<h1 align=center>Predicting Flight Delays</h1>
<h2 align=center>with Apache Spark and TensorFlow</h2>

In this tutorial, we will use the popular [Flights Dataset](http://stat-computing.org/dataexpo/2009/the-data.html) to analyze and predict flight delays in airports based on past flight records. We will show how you can use __Jupyter Notebooks, TensorFlow and Apache Spark__ to read, explore, analyze and visualize your results.  

This tutorial is intended for readers who want to use TensorFlow in Jupyter Notebooks. We use a __Jupyter Notebook__ to write interactive Python code, __Spark__ to distribute the preprocessing phase, and __TensorFlow__ to build the model efficiently.

__TensorFlow__ is an open-source machine learning library for numerical computation using data flow graphs. If you can express your computation as a data flow graph, you can use TensorFlow. The TensorFlow library is already installed in your Data Scientist Workbench, so you can simply import it into your notebook.

Although the __Spark ML library and TensorFlow__ have both been designed around the dataflow paradigm of parallel computation, the distributed version of TensorFlow has not been released yet. This means that, as of now, TensorFlow can run on only one node. That is, with Spark an RDD is distributed across many nodes of a cluster, whereas TensorFlow sits on one node. Therefore, after preprocessing the flight dataset using Spark, we will collect and convert the final results into a numpy matrix (on one node) and use that for modeling with TensorFlow.

For this dataset, we will only look at the flights in 2007 - this is still 7 million flights! 

In this notebook, we will build **a classification model to predict airline delay from historical flight data.**  We are going to train a model to look at flights and predict whether they will be delayed.  

First, we need to import some Python packages that we need:

In [None]:
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql import Row
from pyspark.sql.functions import udf
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel
from pyspark.mllib.stat import Statistics
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.param import Param, Params
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
from IPython.display import display
from ipywidgets import interact
import sys
import numpy as np
import pandas as pd
import time
import datetime
import matplotlib.pyplot as plt
import os.path
%matplotlib inline

### Import data
To import data into your Data Scientist Workbench (DSWB), you can take either one of these actions:

1) Paste the following link into the sidebar of your DSWB:
https://share.datascientistworkbench.com/#/api/v1/workbench/10.115.89.160/shares/QBNwgXam7veFKl7/airline2007.csv

OR

2) Run the following cell to download it directly to your DSWB.

In [None]:
# Download airline2007.csv if it is not already present

if not os.path.isfile("/resources/airline2007.csv"):
    # If the file does not exist yet, download the archive and decompress it.
    # bzip2 -d removes the .bz2 archive after decompressing, so no separate rm is needed.
    !wget --quiet --output-document  /resources/airline2007.csv.bz2 http://stat-computing.org/dataexpo/2009/2007.csv.bz2
    !bzip2 -d /resources/airline2007.csv.bz2
    print("Downloaded to /resources/airline2007.csv")
else:
    # The file already exists
    print("airline2007.csv already exists under /resources/airline2007.csv")
    print("You can continue to the next cell.")

In [None]:
textFile = sc.textFile('/resources/airline2007.csv')

### Cleaning data
In this section, we split each line into fields and remove the header row of the file.

In [None]:
textFileRDD = textFile.map(lambda x: x.split(','))
header = textFileRDD.first()

textRDD = textFileRDD.filter(lambda r: r != header)
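
The same header-removal logic can be tried without Spark. The following is a minimal sketch in plain Python, assuming a made-up three-line sample; the list comprehensions stand in for `map` and `filter`, and the names `sample_lines`, `sample_header` and `data_rows` are hypothetical.

```python
# Hypothetical sample: a header line followed by two made-up records.
sample_lines = [
    "Year,Month,DayofMonth,DepDelay",
    "2007,1,1,7",
    "2007,1,2,NA",
]

# Stand-in for textFile.map(lambda x: x.split(','))
split_rows = [line.split(",") for line in sample_lines]

# Stand-in for textFileRDD.first()
sample_header = split_rows[0]

# Stand-in for textFileRDD.filter(lambda r: r != header)
data_rows = [row for row in split_rows if row != sample_header]

print(sample_header)
print(data_rows)
```

Because the filter compares every record with the header, any repeated header line elsewhere in the file would be dropped as well.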

### Creating the Dataframe from RDD
A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R or Python (pandas), but with richer optimizations under the hood.

In [None]:
def parse(r):
    # Convert one split CSV record into a Row.
    # Return None when a field is missing or not numeric (for example 'NA').
    try:
        x = Row(Year=int(r[0]),
                Month=int(r[1]),
                DayofMonth=int(r[2]),
                DayOfWeek=int(r[3]),
                DepTime=int(float(r[4])),
                CRSDepTime=int(r[5]),
                ArrTime=int(float(r[6])),
                CRSArrTime=int(r[7]),
                UniqueCarrier=r[8],
                DepDelay=int(float(r[15])),
                Origin=r[16],
                Dest=r[17],
                Distance=int(float(r[18])))
    except (ValueError, IndexError):
        x = None
    return x

# Keep only the records that were parsed successfully
rowRDD = textRDD.map(parse).filter(lambda r: r is not None)
airline_df = sqlContext.createDataFrame(rowRDD)
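
To see why some records disappear at this step, here is a small sketch in plain Python of the parse-then-filter pattern. It assumes a simplified, hypothetical record layout of `[DepDelay, Origin]` and a hypothetical helper `parse_delay`; it is not the notebook's `parse` function.

```python
def parse_delay(fields):
    """Return (origin, dep_delay), or None when a field is missing or not numeric."""
    try:
        return (fields[1], int(float(fields[0])))
    except (ValueError, IndexError):
        return None

# Hypothetical records in the form [DepDelay, Origin]
records = [["12", "JFK"], ["NA", "JFK"], ["-3.0", "LAX"], ["5"]]

parsed = [parse_delay(r) for r in records]
clean = [p for p in parsed if p is not None]

print(clean)
```

The record with `NA` and the truncated record both become `None` and are filtered out, so the resulting DataFrame only holds flights with a usable departure delay.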

In this section, we add a new column to our data frame, **DepDelayed**, a binary variable:
- **True**, for flights that have > 15 minutes of delay
- **False**, for flights that have <= 15 minutes of delay

We will later use **DepDelayed** as the target/label column in the classification process.

In [None]:
airline_df = airline_df.withColumn('DepDelayed', airline_df['DepDelay']>15)
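
The boundary of the rule is easy to check on a few made-up values. A minimal sketch in plain Python, with hypothetical delays in minutes:

```python
# Hypothetical departure delays in minutes.
dep_delays = [-5, 0, 15, 16, 120]

# Same rule as airline_df['DepDelay'] > 15: strictly more than 15 minutes.
dep_delayed = [d > 15 for d in dep_delays]

for d, flag in zip(dep_delays, dep_delayed):
    print("%4d -> %s" % (d, flag))
```

A flight that leaves exactly 15 minutes late is still labeled as not delayed, and early departures (negative values) are never delayed.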

We also add a new column, __hour__, which holds the scheduled departure hour of the flight (0 to 23).

In [None]:
# define hour function to obtain hour of day
def hour_ex(x): 
    h = int(str(int(x)).zfill(4)[:2])
    return h

# register as a UDF 
f = udf(hour_ex, IntegerType())

#CRSDepTime: scheduled departure time (local, hhmm)
airline_df = airline_df.withColumn('hour', f(airline_df.CRSDepTime))
airline_df.registerTempTable("airlineDF")
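
The hour extraction relies on zero-padding the `hhmm` value before slicing. The sketch below runs the same `hour_ex` logic on a few hypothetical scheduled times, outside Spark:

```python
def hour_ex(x):
    # Pad to four digits (hhmm) and keep the first two as the hour.
    return int(str(int(x)).zfill(4)[:2])

# Hypothetical scheduled departure times in hhmm form.
for hhmm in [5, 930, 1745, 2359]:
    print("%4d -> %d" % (hhmm, hour_ex(hhmm)))
```

Without the padding, a time such as `930` would be read as hour 93 instead of hour 9.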

## Exploration
Let's do some exploration of this dataset.  
### Exploration: Which Airports have the Most Delays?

In [None]:
groupedDelay = sqlContext.sql("SELECT Origin, count(*) conFlight,avg(DepDelay) delay \
                                FROM airlineDF \
                                GROUP BY Origin")

df_origin = groupedDelay.toPandas()

__Notice:__ To map each airport to its corresponding _Long_ and _Lat_, run the following cell to download the required dataset.

In [None]:
# Download airports.dat if it is not found in /resources/

if not os.path.isfile("/resources/airports.dat"):
    # If the file does not exist yet, download it
    !wget  --quiet --output-document /resources/airports.dat \
        https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat
    print("Downloaded to /resources/airports.dat")
else:
    # The file already exists
    print("airports.dat already exists under /resources/airports.dat")
    print("You can continue to the next cell.")

In [None]:
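# airports.dat has no header row: name the first 12 columns and ignore any extra ones
df = pd.read_csv('/resources/airports.dat', header=None, usecols=range(12), index_col='id',
                 names=['id', 'name', 'city', 'country', 'IATA', 'ICAO', 'lat', 'lng',
                        'alt', 'TZone', 'DST', 'Tz'])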

In [None]:
df_airports = pd.merge(df_origin, df, left_on = 'Origin', right_on = 'IATA')


In [None]:
df_airports.head()

In [None]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def zscore(x):
    return (x-np.average(x))/np.std(x)
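
The plots below combine these two helpers to turn an average delay into a color index: the z-score centers the delays, the sigmoid squashes them into (0, 1), and the result is scaled to an integer. Here is a minimal sketch of that chain in plain Python, assuming made-up delays; `sigmoid_scalar` and `zscore_list` are hypothetical list-based stand-ins for the numpy helpers above.

```python
import math

def sigmoid_scalar(x):
    return 1.0 / (1.0 + math.exp(-x))

def zscore_list(values):
    mean = sum(values) / len(values)
    # Population standard deviation, matching the default of np.std (ddof=0).
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

# Hypothetical average delays (minutes) for five airports.
delays = [2.0, 8.0, 10.0, 12.0, 18.0]

squashed = [sigmoid_scalar(z) for z in zscore_list(delays)]

# Scale to an integer index into the color table, as the plot does with al*20.
color_index = [int(s * 20) for s in squashed]

print(color_index)
```

An airport with exactly the mean delay gets a sigmoid value of 0.5, so it lands in the middle of the indices used; extreme delays are compressed toward the ends instead of dominating the color scale.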

Plot the map:

In [None]:
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
from pylab import rcParams
%matplotlib inline

rcParams['figure.figsize'] = (14,10)


my_map = Basemap(projection='merc',
            resolution = 'l', area_thresh = 1000.0,
            llcrnrlon=-130, llcrnrlat=22, #min longitude (llcrnrlon) and latitude (llcrnrlat)
            urcrnrlon=-60, urcrnrlat=50) #max longitude (urcrnrlon) and latitude (urcrnrlat)

my_map.drawcoastlines()
my_map.drawcountries()
my_map.drawmapboundary()
my_map.fillcontinents(color = 'white', alpha = 0.3)
my_map.shadedrelief()

# To create a color map
colors = plt.get_cmap('hot')(np.linspace(0.0, 1.0, 30))
colors=np.flipud(colors)

#----- Scatter -------
countrange=max(df_airports['conFlight'])-min(df_airports['conFlight'])
al=np.array([sigmoid(x) for x in zscore(df_airports['delay'])])
xs,ys = my_map(np.asarray(df_airports['lng']), np.asarray(df_airports['lat']))
val=df_airports['conFlight']*4000.0/countrange

my_map.scatter(xs, ys,  marker='o', s= val, alpha = 0.8,color=colors[(al*20).astype(int)])

#----- Text -------
df_text=df_airports[(df_airports['conFlight']>60000) & (df_airports['IATA'] != 'HNL')]
xt,yt = my_map(np.asarray(df_text['lng']), np.asarray(df_text['lat']))
txt=np.asarray(df_text['IATA'])
zp = list(zip(xt, yt, txt))  # keep as a list so the labels can be reused for the route map
for row in zp:
    plt.text(row[0],row[1],row[2], fontsize=10, color='blue',)

print("Each marker is an airport.")
print("Size of markers: Airport Traffic (larger means higher number of flights in year)")
print("Color of markers: Average Flight Delay (Redder means longer delays)")

plt.show()

### Exploration: Route delay

#### Which Routes are typically the most delayed?

In [None]:
grp_rout_Delay = sqlContext.sql("SELECT Origin, Dest, count(*) traffic,avg(Distance) avgDist,\
                                    avg(DepDelay) avgDelay\
                                FROM airlineDF \
                                GROUP BY Origin,Dest")
rout_Delay = grp_rout_Delay.toPandas()

In [None]:
df_airport_rout1 = pd.merge(rout_Delay, df, left_on = 'Origin', right_on = 'IATA')
df_airport_rout2 = pd.merge(df_airport_rout1, df, left_on = 'Dest', right_on = 'IATA')
df_airport_rout = df_airport_rout2[["Origin","lat_x","lng_x","Dest","lat_y","lng_y",\
                                    "avgDelay", "traffic"]]

In [None]:
rcParams['figure.figsize'] = (14,10)


my_map = Basemap(projection='merc',
            resolution = 'l', area_thresh = 1000.0,
            llcrnrlon=-130, llcrnrlat=22, #min longitude (llcrnrlon) and latitude (llcrnrlat)
            urcrnrlon=-60, urcrnrlat=50) #max longitude (urcrnrlon) and latitude (urcrnrlat)

my_map.drawcoastlines()
my_map.drawcountries()
my_map.drawmapboundary()
my_map.fillcontinents(color = 'white', alpha = 0.3)
my_map.shadedrelief()

delay=np.array([sigmoid(x) for x in zscore(df_airports["delay"])])
colors = plt.get_cmap('hot')(np.linspace(0.0, 1.0, 40))
colors=np.flipud(colors)
xs,ys = my_map(np.asarray(df_airports['lng']), np.asarray(df_airports['lat']))
xo,yo = my_map(np.asarray(df_airport_rout['lng_x']), np.asarray(df_airport_rout['lat_x']))
xd,yd = my_map(np.asarray(df_airport_rout['lng_y']), np.asarray(df_airport_rout['lat_y']))

my_map.scatter(xs, ys,  marker='o',  alpha = 0.8,color=colors[(delay*20).astype(int)])


al=np.array([sigmoid(x) for x in zscore(df_airport_rout["avgDelay"])])
f=zip(xo,yo,xd,yd,df_airport_rout['avgDelay'],al)
for row in f:
    plt.plot([row[0],row[2]], [row[1],row[3]],'-',alpha=0.07, \
             color=colors[(row[5]*30).astype(int)] )
    

for row in zp:
    plt.text(row[0],row[1],row[2], fontsize=10, color='blue',)

print("Each line represents a route from the Origin to Destination airport.")
print("The redder the line, the longer the average delay on that route.")
    
plt.show()



### Exploration: Airport Origin delay per month

Set the airport code below to choose the origin airport to explore.

In [None]:
Origin_Airport="JFK"

In [None]:
df_ORG = sqlContext.sql("SELECT * from airlineDF WHERE origin='"+ Origin_Airport+"'")
df_ORG.registerTempTable("df_ORG")
df_ORG.select('ArrTime','CRSArrTime','CRSDepTime',\
              'DayOfWeek','DayofMonth','DepDelay','DepTime','Dest').show(2)

Let's look at flights originating from this airport:

In [None]:
print("total flights from this airport: " + str(df_ORG.count()))

In this section, we group flights by month to see how delayed flights are distributed by month:

In [None]:
grp_carr = sqlContext.sql("SELECT  UniqueCarrier,month, avg(DepDelay) avgDelay from df_ORG \
                            WHERE DepDelayed=True \
                            GROUP BY UniqueCarrier,month")
s = grp_carr.toPandas()

In [None]:
ps = s.pivot(index='month', columns='UniqueCarrier', values='avgDelay')[['AA','UA','US']]

In [None]:
rcParams['figure.figsize'] = (8,5)
ps.plot(kind='bar', colormap='Greens');
plt.xlabel('Month')
plt.ylabel('Average delay')
plt.title('How much delay does each carrier have in each month?')

We see that the average delay this year is highest in June and August at this airport.

### Exploration: Airport Origin delay per day/hour

In [None]:
hour_grouped = df_ORG.filter(df_ORG['DepDelayed']).select('DayOfWeek','hour','DepDelay').groupby('DayOfWeek','hour').mean('DepDelay')

In [None]:
rcParams['figure.figsize'] = (10,5)
dh = hour_grouped.toPandas()
c = dh.pivot(index='DayOfWeek', columns='hour')
X = c.columns.levels[1].values
Y = c.index.values
Z = c.values
# Use the actual number of hours and days present so ticks and labels always match
plt.xticks(range(len(X)), X)
plt.yticks(range(len(Y)), Y)
plt.xlabel('Hour of Day')
plt.ylabel('Day of Week')
plt.title('Average delay per hour and day')
plt.imshow(Z)

A clear pattern emerges here. Flights tend to be delayed in these situations:  
- Later in the day, possibly because delays pile up as the day progresses and compound one another.  
- Mornings on the first day of the week, possibly because of more business meetings.

## Modeling: Logistic Regression
In this section, we will build a supervised learning model to predict flight delays for flights leaving our selected airport.


### Feature selection
In the next two cells, we select the features that we need to create the model.

In [None]:
df_model=df_ORG
stringIndexer2 = StringIndexer(inputCol="Dest", outputCol="destIndex")
model_stringIndexer = stringIndexer2.fit(df_model)
indexedDest = model_stringIndexer.transform(df_model)
encoder2 = OneHotEncoder(dropLast=False, inputCol="destIndex", outputCol="destVec")
df_model = encoder2.transform(indexedDest)
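
The indexing and encoding step can be pictured on a tiny, made-up destination column. This is a sketch in plain Python that assumes the default `StringIndexer` ordering (most frequent label first) and mirrors `dropLast=False`; the helper `one_hot` and the sample data are hypothetical.

```python
from collections import Counter

# Hypothetical destination column.
dests = ["LAX", "BOS", "LAX", "SFO", "LAX", "BOS"]

# StringIndexer-style step: the most frequent label gets index 0.
counts = Counter(dests)
labels = sorted(counts, key=lambda k: -counts[k])
index_of = {label: i for i, label in enumerate(labels)}

def one_hot(label):
    # dropLast=False keeps one slot per label, so each vector has len(labels) entries.
    vec = [0.0] * len(labels)
    vec[index_of[label]] = 1.0
    return vec

for d in ["LAX", "BOS", "SFO"]:
    print("%s index=%d vector=%s" % (d, index_of[d], one_hot(d)))
```

Each destination becomes a vector with a single 1.0, which is why the number of feature columns grows with the number of distinct destinations served from the selected airport.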

### Assembler
In order to train our logistic regression model, we have to combine features generated above into a single feature vector. _VectorAssembler_ is a transformer that combines a given list of columns into a single vector column. In each row, the values of the input columns will be concatenated into a vector in the specified order.

In [None]:
assembler = VectorAssembler(
    inputCols = ['Year','Month','DayofMonth','DayOfWeek','hour','Distance','destVec'],
    outputCol = "features")
df_assembled = assembler.transform(df_model)

### Standardization
In the following cell, we use _MinMaxScaler_ to rescale each feature to a specific range  [0, 1] which is appropriate for Logistic Regression in TensorFlow. MinMaxScaler computes summary statistics on a data set and produces a _MinMaxScalerModel_. The model can then transform each feature individually such that it is in the given range.

In [None]:
from pyspark.ml.feature import MinMaxScaler
minmaxscaler= MinMaxScaler(inputCol="features", outputCol="minMaxFeatures")
minMaxModel = minmaxscaler.fit(df_assembled)
minMax_df = minMaxModel.transform(df_assembled)
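
Min-max rescaling is applied to each feature column independently. The sketch below shows the arithmetic in plain Python on hypothetical columns; the helper `min_max_scale` is illustrative, and the handling of a constant column (returning 0.5 for the default [0, 1] range) follows what the Spark documentation describes for `MinMaxScaler`.

```python
def min_max_scale(column):
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column has no spread; use the midpoint of the [0, 1] range.
        return [0.5 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

# Hypothetical feature columns.
distance = [100.0, 600.0, 1100.0, 2600.0]
year = [2007.0, 2007.0, 2007.0, 2007.0]

print(min_max_scale(distance))
print(min_max_scale(year))
```

Since all flights in this dataset are from 2007, a column like `Year` carries no information after scaling: every row gets the same value.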

### Labeling
The corresponding labels in the Flight dataset are True and False, describing whether the departure delay is more than 15 minutes or not. For the purposes of this tutorial, we are going to want our labels as __one-hot vectors__. A one-hot vector is a vector which is 0 in most dimensions, and 1 in a single dimension. In this case, the no-delay (False) label will be represented as a vector which is one in the first dimension, i.e. [1.0,0.0], and the True label will be represented as [0.0,1.0].

In [None]:
from pyspark.sql.types import ArrayType
def delay_lbl(x):
    return [[1.0,0.0],[0.0,1.0]][x]

func = udf(delay_lbl, ArrayType(FloatType(),True))
labeled_df = minMax_df.withColumn('label', func(minMax_df.DepDelayed))

### Convert to Numpy matrix
We will convert each row of the dataframe into a 1x76 vector. The result is that our cleaned dataset is a tensor (an n-dimensional array) with a shape of [n, 76], where n is the number of flights leaving the selected airport (the modeling dataframe is built from df_ORG, not from all 7 million flights). The first dimension indexes the flights and the second dimension indexes the features, including Year, Month, Day, Destination, etc. Each entry in the tensor has a value between 0 and 1.

In [None]:
d1=labeled_df.select('label','minMaxFeatures').toPandas()
np_data_mtx=np.matrix(d1['minMaxFeatures'].map(lambda r:r.toArray().tolist()).tolist())
np_lbl_mtx=np.matrix(d1['label'].tolist())

### Splitting the dataset into training and test datasets
The data is split into two parts, 70% of data as training data, and 30% as test data. This split is very important: it's essential in machine learning that we have separate data which we don't learn from so that we can make sure that what we've learned actually generalizes!


In [None]:
rowCount,colCount=np_data_mtx.shape
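testPercent=int(rowCount*0.3)
# Draw test-row indices without replacement so that exactly 30% of the rows land in the test set
ix=np.random.choice(rowCount, testPercent, replace=False)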
mask = np.ones(rowCount,dtype=bool) 
mask[ix] = False
tr,ts=np_data_mtx[mask],np_data_mtx[~mask]
tr_lbl,ts_lbl=np_lbl_mtx[mask],np_lbl_mtx[~mask]
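
The split can be checked on a small scale. This is a minimal sketch in plain Python, assuming ten hypothetical row indices; it draws the test indices without replacement, because drawing with replacement can pick the same row twice and leave the test set smaller than the requested 30%.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

row_count = 10
test_size = int(row_count * 0.3)

# random.sample draws without replacement, so no row is picked twice.
test_idx = set(random.sample(range(row_count), test_size))
train_idx = [i for i in range(row_count) if i not in test_idx]

print(sorted(test_idx))
print(train_idx)
```

Every row ends up in exactly one of the two sets, which is the property the evaluation relies on.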

### Build the model
We want our model to be able to look at a flight and give probabilities for it being delayed. For example, our model might look at a flight scheduled at 8:00 PM on Monday, March 2nd and be 60% sure it is going to be on time, but give a 40% chance that it will be delayed. This is a classic case where a __logistic regression__ is used as a classification model. In logistic regression, first we add up the evidence of our input being in certain classes, and then we convert that evidence into probabilities.

In the following cell, we set parameters after importing TensorFlow.

In [None]:
import tensorflow as tf
# Set the Parameters
learning_rate = 0.01
training_epochs = 25
batch_size = 100
display_step = 1
classCount=2

We create two placeholders, x and y_, which are values that we will input when we ask TensorFlow to run a computation. We want to be able to input any number of flights, each represented by a colCount-dimensional vector. We represent this as a 2-D tensor of floating-point numbers, with a shape [None, colCount]. (Here None means that a dimension can be of any length.) __x__ represents the feature list and y_ is a placeholder to input the correct answers.

In [None]:
x = tf.placeholder(tf.float32, [None, colCount])
y_ = tf.placeholder(tf.float32, [None, classCount])

We also need to define weights and biases variables for our model. A __Variable__ in TensorFlow is a modifiable tensor that lives in TensorFlow's graph of interacting operations. It can be used and even modified by the computation.  
To build the logistic regression we use the _Softmax_ function, which generalizes logistic regression and can be used for multi-class classification problems. 

In [None]:
W = tf.Variable(tf.zeros([colCount, classCount]))
b = tf.Variable(tf.zeros([classCount]))
y = tf.nn.softmax(tf.matmul(x, W) + b)
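
To make the softmax step concrete, here is a small sketch in plain Python. The function below is an illustrative implementation, not TensorFlow's; the logit values are made up.

```python
import math

def softmax(logits):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# With W and b initialized to zeros, every logit is 0 and both classes are equally likely.
print(softmax([0.0, 0.0]))

# More evidence for the second class shifts probability toward it.
print(softmax([1.0, 3.0]))
```

The outputs always sum to 1, so they can be read as the model's probabilities for "on time" and "delayed".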

In order to train our model, we need to define the cost function, and then try to minimize it. In this cell, we use "cross-entropy" which is widely used in machine learning.  TensorFlow trains the model using a backpropagation algorithm to efficiently determine how our variables affect the cost that should be minimized. In this case, TensorFlow uses the _gradient descent algorithm_ to minimize _cross entropy_. Gradient descent is a simple procedure, where TensorFlow simply shifts each variable a little bit in the direction that reduces the cost. 

In [None]:
cross_entropy = -tf.reduce_sum(y_*tf.log(y))
train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(cross_entropy)
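
A single gradient descent step can be worked through by hand. The sketch below is plain Python under stated assumptions: one made-up flight with two features, weights and biases starting at zero, and hypothetical names (`features`, `weights`, `bias`, `step_size`) chosen so they do not clash with the notebook's variables. It uses the standard result that, for softmax with cross-entropy, the gradient with respect to the logits is the prediction minus the true label.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(features, weights, bias):
    # logits[j] = sum_i features[i] * weights[i][j] + bias[j]
    logits = [sum(features[i] * weights[i][j] for i in range(len(features))) + bias[j]
              for j in range(len(bias))]
    return softmax(logits)

def cross_entropy(y_true, y_pred):
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred))

# Hypothetical flight: two scaled features, true label "delayed" = [0, 1].
features = [0.2, 0.9]
y_true = [0.0, 1.0]
weights = [[0.0, 0.0], [0.0, 0.0]]
bias = [0.0, 0.0]
step_size = 0.01

y_pred = predict(features, weights, bias)
loss_before = cross_entropy(y_true, y_pred)

# Gradient of the loss with respect to the logits: prediction minus truth.
grad_logits = [p - t for p, t in zip(y_pred, y_true)]

# Move every parameter a small step against its gradient.
for i in range(len(features)):
    for j in range(len(bias)):
        weights[i][j] -= step_size * features[i] * grad_logits[j]
for j in range(len(bias)):
    bias[j] -= step_size * grad_logits[j]

loss_after = cross_entropy(y_true, predict(features, weights, bias))
print("loss before: %.6f, loss after: %.6f" % (loss_before, loss_after))
```

With zero weights the model is undecided, so the starting loss is ln 2; after one small step the loss is slightly lower, which is the behavior the optimizer repeats over many batches.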

So far we have only defined the model. Now we add an operation to initialize the variables we created, launch the model in a Session, and run the operation that initializes the variables.

In [None]:
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)

In the following cell, we will run the training step for 25 epochs. In each epoch, we repeatedly draw random batches, train the model on them, and at the end compute the average loss. 

In [None]:
# Training cycle
for epoch in range(training_epochs):
    avg_cost = 0.
    # Number of batches per epoch, based on the size of the training set
    total_batch = tr.shape[0] // batch_size
    # Loop over all batches
    for i in range(total_batch):
        r=np.random.randint(0,tr.shape[0],batch_size)
        batch_xs=tr[r]
        batch_ys = tr_lbl[r]
        sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
        # Compute average loss
        avg_cost += sess.run(cross_entropy, feed_dict={x: batch_xs,y_: batch_ys})/total_batch
    # Display logs per epoch step
    if epoch % display_step == 0:
        print("Epoch: %04d cost= %.9f" % (epoch+1, avg_cost))

## Model Evaluation
Let's figure out where we predicted the correct label. __tf.argmax__ gives you the index of the highest entry in a tensor along some axis. For example, tf.argmax(y,1) is the label our model thinks is most likely for each input, while tf.argmax(y_,1) is the correct label. We can use tf.equal to check if our prediction matches the truth.

In [None]:
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
acc=((sess.run(accuracy, feed_dict={x: ts, y_: ts_lbl}))*100)
print("Model Accuracy for JFK: %1.2f %%" % acc)
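
The accuracy computation boils down to comparing the position of the largest predicted probability with the position of the 1 in the one-hot label. A minimal sketch in plain Python, with made-up predictions and a hypothetical `argmax` helper:

```python
def argmax(values):
    # Index of the largest entry in a list.
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical predicted probabilities and one-hot true labels for four flights.
y_pred = [[0.8, 0.2], [0.3, 0.7], [0.6, 0.4], [0.1, 0.9]]
y_true = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]

correct = [argmax(p) == argmax(t) for p, t in zip(y_pred, y_true)]
accuracy_value = sum(correct) / float(len(correct))

print(correct)
print(accuracy_value)
```

The third flight is predicted as on time although it was delayed, so three of the four predictions match.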

### Use the model to predict your flight from JFK

You can use the following widget to query the model.  
For example, the following flight has a delay:  
    Month=2, Day=3, Hour=18, Dest=CLE

In [None]:
Destin = rout_Delay[rout_Delay['Origin']=='JFK'].Dest.unique()
pred=tf.argmax(y,1)
@interact(Destination=tuple(Destin),Month=(1,12),DayofMonth=(1,31),DayOfWeek=(1,7),Hour=(0,23))
def g(Destination,Month,DayofMonth,DayOfWeek,Hour):
    Distance=int(rout_Delay[(rout_Delay['Origin']=='JFK') & (rout_Delay['Dest']==Destination)]\
                 .avgDist.tolist()[0])
    testcase=  Row(Year=2007.0,Month=Month,DayofMonth=DayofMonth,DayOfWeek=DayOfWeek,hour=Hour,\
                 Origin='JFK',Dest=Destination,Distance=Distance) 
    TestCase_df = sqlContext.createDataFrame(sc.parallelize([testcase]))
    t1 = model_stringIndexer.transform(TestCase_df)
    t2 = encoder2.transform(t1)
    t3 = assembler.transform(t2)
    t4 = minMaxModel.transform(t3)
    case=t4.select('minMaxFeatures').take(1)[0]['minMaxFeatures']
    case2=np.asmatrix(case)
    p=sess.run(pred, feed_dict={x: case2})
    print("Flight from JFK to "+Destination + ", Distance:" + str(Distance))
    if p==0:
        print("Your flight is not predicted to be delayed, Accuracy= %1.2f %%" % (acc))
    else:
        print("Your flight may be delayed, Accuracy= %1.2f %%" % (acc))

<hr>

## Want to learn more?

<a href="http://bigdatauniversity.com/courses/advanced-classification-and-prediction/?utm_source=tutorial-flightdelay-tensor&utm_medium=dswb&utm_campaign=bdu"><img src = "https://ibm.box.com/shared/static/u7iyiej98gb971gmjqvfsveqz3ik4fxj.png"> </a>


<h3>Authors:</h3>
<article class="teacher">
<div class="teacher-image" style="    float: left;
    width: 115px;
    height: 115px;
    margin-right: 10px;
    margin-bottom: 10px;
    border: 1px solid #CCC;
    padding: 3px;
    border-radius: 3px;
    text-align: center;"><img class="alignnone wp-image-2258 " src="https://ibm.box.com/shared/static/tyd41rlrnmfrrk78jx521eb73fljwvv0.jpg" alt="Saeed Aghabozorgi" width="178" height="178" /></div>
<h4>Saeed Aghabozorgi</h4>
<p><a href="https://ca.linkedin.com/in/saeedaghabozorgi">Saeed Aghabozorgi</a>, PhD is a Data Scientist in IBM with a track record of developing enterprise level applications that substantially increases clients’ ability to turn data into actionable knowledge. He is a researcher in data mining field and expert in developing advanced analytic methods like machine learning and statistical modelling on large datasets.</p>
</article>
<article class="teacher">
<div class="teacher-image" style="    float: left;
    width: 115px;
    height: 115px;
    margin-right: 10px;
    margin-bottom: 10px;
    border: 1px solid #CCC;
    padding: 3px;
    border-radius: 3px;
    text-align: center;"><img class="alignnone size-medium wp-image-2177" src="https://ibm.box.com/shared/static/2ygdi03ahcr97df2ofrr6cf8knq4kodd.jpg" alt="Polong Lin" width="300" height="300" /></div>
<h4>Polong Lin</h4>
<p>
<a href="https://ca.linkedin.com/in/polonglin">Polong Lin</a> is a Data Scientist at IBM in Canada. Under the Emerging Technologies division, Polong is responsible for educating the next generation of data scientists through Big Data University. Polong is a regular speaker in conferences and meetups, and holds a M.Sc. in Cognitive Psychology.</p>
</article>

<hr>
Copyright &copy; 2016 [Big Data University](https://bigdatauniversity.com/?utm_source=bducopyrightlink&utm_medium=dswb&utm_campaign=bdu). This notebook and its source code are released under the terms of the [MIT License](https://bigdatauniversity.com/mit-license/).