# Flight Delay Prediction Demo Using SystemML

This notebook is based on datascientistworkbench.com's tutorial notebook for predicting flight delay.

## Loading SystemML 

To use a released version, run "%AddDeps org.apache.systemml systemml 0.9.0-incubating". To use the nightly build, run "%AddJar https://sparktc.ibmcloud.com/repo/latest/SystemML.jar".

Alternatively, provide SystemML.jar and its dependencies on the command line when starting the notebook (for example: --packages com.databricks:spark-csv_2.10:1.4.0 --jars SystemML.jar).

In [1]:
%AddJar https://sparktc.ibmcloud.com/repo/latest/SystemML.jar

Using cached version of SystemML.jar


Use Spark's CSV package to load the CSV file.

In [2]:
%AddDeps com.databricks spark-csv_2.10 1.4.0

:: loading settings :: url = jar:file:/usr/local/spark-kernel/lib/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
:: resolving dependencies :: com.ibm.spark#spark-kernel;working [not transitive]
	confs: [default]
	found com.databricks#spark-csv_2.10;1.4.0 in central
:: resolution report :: resolve 93ms :: artifacts dl 5ms
	:: modules in use:
	com.databricks#spark-csv_2.10;1.4.0 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   1   |   0   |   0   |   0   ||   1   |   0   |
	---------------------------------------------------------------------
:: retrieving :: com.ibm.spark#spark-kernel
	confs: [default]
	0 artifacts copied, 1 already retrieved (0kB/6ms)


## Import Data

Download the airline dataset from stat-computing.org if it has not already been downloaded.

In [3]:
import sys.process._
import java.net.URL
import java.io.File
val url = "http://stat-computing.org/dataexpo/2009/2007.csv.bz2"
val localFilePath = "airline2007.csv.bz2"
// Download only if the file is not already present locally
if (!new File(localFilePath).exists) {
    // Stream the URL contents into the local file and wait for completion
    (new URL(url) #> new File(localFilePath)).!!
}

Load the dataset into a DataFrame using the Spark CSV package.

In [4]:
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val fmt = sqlContext.read.format("com.databricks.spark.csv")
val opt = fmt.options(Map("header"->"true", "inferSchema"->"true", "nullValue"->"null", "treatEmptyValuesAsNulls"->"true"))
val airline = opt.load(localFilePath)

In [5]:
airline.printSchema

## Data Exploration
Which origin airports have the highest average departure delay?

In [6]:
airline.registerTempTable("airline")
sqlContext.sql("""SELECT Origin, count(*) conFlight, avg(DepDelay) delay
                    FROM airline 
                    GROUP BY Origin
                    ORDER BY delay DESC""").show
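
To make the aggregation concrete, here is a minimal sketch of the same group, count, average, and sort steps on a plain Scala collection. The `Flight` case class and the `DelayByOrigin` object are hypothetical names introduced only for illustration. The sketch assumes every flight has a delay value, whereas SQL's `avg` skips nulls.

```scala
// Hypothetical miniature of the airline table: only the two columns the query uses.
case class Flight(origin: String, depDelay: Double)

object DelayByOrigin {
  // Returns (origin, number of flights, average departure delay),
  // sorted by average delay in descending order, like the SQL query.
  def summarize(flights: Seq[Flight]): Seq[(String, Int, Double)] =
    flights
      .groupBy(_.origin)
      .map { case (origin, fs) => (origin, fs.size, fs.map(_.depDelay).sum / fs.size) }
      .toSeq
      .sortBy { case (_, _, avgDelay) => -avgDelay }
}
```

Calling `DelayByOrigin.summarize` on a handful of `Flight` values gives one row per origin, with the airport that has the largest average delay first.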

## Modeling: Logistic Regression

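Predict departure delays of flights from JFK. Missing DepDelay values are filled with 0, DepDelay is renamed to label, and the data is split 70/30 into training and test sets.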

In [7]:
val smallAirlineData = sqlContext.sql("SELECT * FROM airline WHERE Origin='JFK'").na.fill(0.0, Seq("DepDelay"))
val datasets = smallAirlineData.withColumnRenamed("DepDelay", "label").randomSplit(Array(0.7, 0.3))
val trainDataset = datasets(0).cache
val testDataset = datasets(1).cache
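
The 70/30 split above is random, so the two counts will differ from run to run and will only approximate the requested ratio. The following is a rough sketch of that idea on a plain collection, assuming each row is assigned independently. `RandomSplitSketch` and its `seed` parameter are hypothetical, and Spark's actual sampling is more involved.

```scala
import scala.util.Random

object RandomSplitSketch {
  // Sends each row to the first split with probability trainFraction,
  // otherwise to the second. Split sizes are therefore only approximate.
  def split[A](rows: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
    val rng = new Random(seed)
    val tagged = rows.map(row => (row, rng.nextDouble() < trainFraction))
    val (train, test) = tagged.partition(_._2)
    (train.map(_._1), test.map(_._1))
  }
}
```

Fixing the seed makes the split reproducible. `randomSplit` on a DataFrame also accepts a seed as a second argument, which may be worth setting when comparing models across runs.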

In [8]:
trainDataset.count

In [None]:
testDataset.count

### Feature selection

One-hot encode the destination (Dest) and combine it with the columns Year, Month, DayofMonth, DayOfWeek, and Distance into a single feature vector.

In [14]:
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

val indexer = new StringIndexer().setInputCol("Dest").setOutputCol("DestIndex")
val encoder = new OneHotEncoder().setInputCol("DestIndex").setOutputCol("DestVec")
val assembler = new VectorAssembler().setInputCols(Array("Year","Month","DayofMonth","DayOfWeek","Distance","DestVec")).setOutputCol("features")
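
As a rough illustration of what the indexer and encoder stages produce, the sketch below indexes string values by descending frequency and then one-hot encodes the index. It assumes Spark's defaults as I understand them: the most frequent value gets index 0, and the last category is dropped and encoded as the all-zero vector. `OneHotSketch` is a hypothetical name, and breaking frequency ties alphabetically is a choice made for this sketch, not documented Spark behavior.

```scala
object OneHotSketch {
  // Assigns index 0 to the most frequent value, 1 to the next, and so on.
  // Ties are broken alphabetically (an assumption of this sketch).
  def buildIndex(values: Seq[String]): Map[String, Int] =
    values.groupBy(identity).toSeq
      .sortBy { case (value, occurrences) => (-occurrences.size, value) }
      .map(_._1)
      .zipWithIndex
      .toMap

  // One-hot vector with one slot per category except the last one;
  // the last category is represented by the all-zero vector.
  def encode(index: Map[String, Int], value: String): Vector[Double] = {
    val i = index(value)
    Vector.tabulate(index.size - 1)(slot => if (slot == i) 1.0 else 0.0)
  }
}
```

With three destinations, each flight contributes a two-element vector. The assembler stage then concatenates that vector with the numeric columns to form the features column.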

### Build the model using SystemML's MLPipeline wrapper

This wrapper invokes MultiLogReg.dml (for training) and GLM-predict.dml (for prediction). These DML algorithms are available at https://github.com/apache/incubator-systemml/tree/master/scripts/algorithms

In [None]:
import org.apache.spark.ml.Pipeline
import org.apache.sysml.api.ml.LogisticRegression

val lr = new LogisticRegression("log", sc).setRegParam(1e-4).setTol(1e-2).setMaxInnerIter(5).setMaxOuterIter(5)

val pipeline = new Pipeline().setStages(Array(indexer, encoder, assembler, lr))
val model = pipeline.fit(trainDataset)

BEGIN MULTINOMIAL LOGISTIC REGRESSION SCRIPT
Reading X...
Reading Y...


### Evaluate the model 

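Output the root-mean-square (RMS) error on the test data, comparing the original labels (renamed to OriginalLabel) with the label column of the predictions.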

In [None]:
val predictions = model.transform(testDataset.withColumnRenamed("label", "OriginalLabel"))
predictions.registerTempTable("predictions")
sqlContext.sql("SELECT sqrt(avg(pow(OriginalLabel - label, 2.0))) FROM predictions").show
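
The SQL expression `sqrt(avg(pow(OriginalLabel - label, 2.0)))` is the usual root-mean-square error. A minimal sketch of the same computation on plain Scala sequences follows. `RmseSketch` is a hypothetical helper and not part of Spark or SystemML.

```scala
object RmseSketch {
  // Root-mean-square error: square the differences, average them, take the square root.
  def rmse(actual: Seq[Double], predicted: Seq[Double]): Double = {
    require(actual.nonEmpty && actual.size == predicted.size,
      "inputs must be non-empty and of equal length")
    val squaredErrors = actual.zip(predicted).map { case (a, p) => math.pow(a - p, 2.0) }
    math.sqrt(squaredErrors.sum / squaredErrors.size)
  }
}
```

Because the errors are squared before averaging, a few badly mispredicted flights can dominate the result.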

### Perform k-fold cross-validation to tune the hyperparameters

Tune the regularization parameter of the logistic regression by cross-validating over a grid of three candidate values.

In [None]:
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}

val crossval = new CrossValidator().setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator)
val paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.1, 1e-3, 1e-6)).build()
crossval.setEstimatorParamMaps(paramGrid)
crossval.setNumFolds(2) // Setting k = 2
val cvmodel = crossval.fit(trainDataset)
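
To illustrate what k-fold cross-validation does with the training data, the sketch below splits a collection into k folds and pairs each validation fold with the remaining rows as training data. `KFoldSketch` is hypothetical, and assigning shuffled rows to folds round-robin is a simplification. Spark's own fold assignment is randomized differently.

```scala
import scala.util.Random

object KFoldSketch {
  // Returns k (training, validation) pairs; every row lands in exactly one validation set.
  def folds[A](rows: Seq[A], k: Int, seed: Long): Seq[(Seq[A], Seq[A])] = {
    require(k >= 2, "k must be at least 2")
    val shuffled = new Random(seed).shuffle(rows)
    // Row i of the shuffled data goes to fold i % k
    val assigned = shuffled.zipWithIndex.map { case (row, i) => (row, i % k) }
    (0 until k).map { fold =>
      val (validation, training) = assigned.partition(_._2 == fold)
      (training.map(_._1), validation.map(_._1))
    }
  }
}
```

With three regParam candidates and k = 2, the pipeline is fitted 3 × 2 = 6 times during the search. As far as I know, CrossValidator then refits the best setting once on the full training set.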

### Evaluate the cross-validated model

In [None]:
val cvpredictions = cvmodel.transform(testDataset.withColumnRenamed("label", "OriginalLabel"))
cvpredictions.registerTempTable("cvpredictions")
sqlContext.sql("SELECT sqrt(avg(pow(OriginalLabel - label, 2.0))) FROM cvpredictions").show

## Homework ;)

Read http://apache.github.io/incubator-systemml/algorithms-classification.html#multinomial-logistic-regression and perform cross-validation on other hyperparameters, for example icpt, tol, maxOuterIter, and maxInnerIter.
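
As a starting point, it helps to see how quickly a grid grows when several hyperparameters are tuned at once. The sketch below expands named value lists into every combination, which mirrors the effect of chaining several addGrid calls. `GridSketch` is hypothetical, and the candidate values in the usage note are illustrative only. Whether the wrapper exposes each of these hyperparameters as a tunable Param alongside regParam is not confirmed by this notebook.

```scala
object GridSketch {
  // Expands named value lists into every combination (Cartesian product).
  def expand(grid: Seq[(String, Seq[Double])]): Seq[Map[String, Double]] =
    grid.foldLeft(Seq(Map.empty[String, Double])) { case (combos, (name, values)) =>
      for (combo <- combos; value <- values) yield combo + (name -> value)
    }
}
```

For example, `GridSketch.expand(Seq("icpt" -> Seq(0.0, 1.0), "tol" -> Seq(1e-2, 1e-4, 1e-6)))` yields six combinations. Each one is fitted once per fold, so the grid is worth keeping small.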