
<a href="https://cognitiveclass.ai/">
    <img src="https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0111EN/Ad/CCLog.png" width="200" align="center">
</a>

# Flight Delay Prediction Demo Using SystemML

This notebook is based on datascientistworkbench.com's tutorial notebook for predicting flight delays.

## Loading SystemML 

To use a released version of SystemML, run <code>%AddDeps org.apache.systemml systemml 1.2.0</code>.

In [None]:
%AddDeps org.apache.systemml systemml 1.2.0

Use the Databricks spark-csv package to load the CSV file.

In [None]:
%AddDeps com.databricks spark-csv_2.10 1.4.0

## Import Data

Download the airline dataset from <a href="http://stat-computing.org">stat-computing.org</a> if it has not already been downloaded.

In [None]:
import sys.process._
import java.net.URL
import java.io.File
val url = "http://stat-computing.org/dataexpo/2009/2007.csv.bz2"
val localFilePath = "airline2007.csv.bz2"
// Download the compressed CSV only if it is not already present locally.
if (!new File(localFilePath).exists) {
    new URL(url) #> new File(localFilePath) !!
}

Load the dataset into a DataFrame using the spark-csv package.

In [None]:
import org.apache.spark.sql.SQLContext
import org.apache.spark.storage.StorageLevel
val sqlContext = new SQLContext(sc)
val fmt = sqlContext.read.format("com.databricks.spark.csv")
val opt = fmt.options(Map("header"->"true", "inferSchema"->"true"))
// Replace the string "NA" with "0.0" in every string column ("*").
val airline = opt.load(localFilePath).na.replace( "*", Map("NA" -> "0.0") )

In [None]:
airline.printSchema

## Data Exploration
Which origin airports have the highest average departure delay?

In [None]:
airline.registerTempTable("airline")
sqlContext.sql("""SELECT Origin, count(*) conFlight, avg(DepDelay) delay
                    FROM airline
                    GROUP BY Origin
                    ORDER BY delay DESC""").show(40)
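
The query above groups flights by origin, counts them, averages `DepDelay`, and sorts by that average in descending order. As a rough illustration of the same aggregation without Spark, here is a minimal plain-Scala sketch. The `Flight` case class, the `DelayByOrigin` object, and the sample rows are hypothetical and are not part of the notebook or the dataset.

```scala
// Plain-Scala sketch of the GROUP BY / AVG / ORDER BY query on an in-memory sample.
// Flight, DelayByOrigin and the sample rows are made up for illustration.
object DelayByOrigin {
  final case class Flight(origin: String, depDelay: Double)

  // Returns (origin, flight count, average delay), highest average first.
  def summarize(flights: Seq[Flight]): Seq[(String, Int, Double)] =
    flights
      .groupBy(_.origin)
      .toSeq
      .map { case (origin, group) =>
        (origin, group.size, group.map(_.depDelay).sum / group.size)
      }
      .sortBy { case (_, _, avgDelay) => -avgDelay }

  val sample: Seq[Flight] = Seq(
    Flight("JFK", 30.0), Flight("JFK", 10.0),
    Flight("SFO", 5.0), Flight("ORD", 45.0)
  )
}
```

Calling `DelayByOrigin.summarize(DelayByOrigin.sample)` ranks the airports by average delay, which shows why an airport with a single badly delayed flight can appear at the top of the list.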

## Modeling: Logistic Regression

Predict whether flights departing from JFK are delayed by more than 15 minutes.

In [None]:
// Label convention: 1.0 = departure delayed by more than 15 minutes, 2.0 = not delayed.
// Values that fail to parse are labelled 1.0.
sqlContext.udf.register("checkDelay", (depDelay:String) => try { if(depDelay.toDouble > 15) 1.0 else 2.0 } catch { case e:Exception => 1.0 })
val tempSmallAirlineData = sqlContext.sql("SELECT *, checkDelay(DepDelay) label FROM airline WHERE Origin = 'JFK'").persist(StorageLevel.MEMORY_AND_DISK)
// Count flights per destination and keep destinations with more than 1000 flights.
val tempDestSet = tempSmallAirlineData.select("Dest").map(y => (y.get(0).toString, 1)).groupByKey(_._1).reduceGroups((a, b) => (a._1, a._2 + b._2)).map(_._2).filter(_._2 > 1000).toDF
tempDestSet.registerTempTable("tempdest")

tempSmallAirlineData.registerTempTable("tempairline")
val smallAirlineData = sqlContext.sql("SELECT * FROM tempairline WHERE Dest in (SELECT _1 FROM tempdest)")

val datasets = smallAirlineData.randomSplit(Array(0.7, 0.3))
val trainDataset = datasets(0).cache()
val testDataset = datasets(1).cache()
trainDataset.count
testDataset.count

// Alternative: collect the popular destinations to the driver and filter with a UDF
// instead of a subquery. checkDelay and tempSmallAirlineData are defined above.
val popularDest = tempSmallAirlineData.select("Dest").map(y => (y.get(0).toString, 1)).groupByKey(_._1).reduceGroups((a, b) => (a._1, a._2 + b._2)).map(_._2).filter(_._2 > 1000).collect.toMap

sqlContext.udf.register("onlyUsePopularDest", (x:String) => popularDest.contains(x))
tempSmallAirlineData.registerTempTable("tempAirline")
println(tempSmallAirlineData)
// val smallAirlineData = sqlContext.sql("SELECT * FROM tempAirline WHERE onlyUsePopularDest(Dest)")

// val datasets = smallAirlineData.randomSplit(Array(0.7, 0.3))
// val trainDataset = datasets(0).cache()
// val testDataset = datasets(1).cache()
// trainDataset.count
// testDataset.count
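
The cell above does two things: it labels each flight with the `checkDelay` rule, and it keeps only destinations with more than 1000 flights. The following is a minimal plain-Scala sketch of both steps on ordinary collections, intended only to make the logic easy to test in isolation. The `DelayLabels` object and the `popularDestinations` helper are hypothetical names, not part of the notebook or of Spark.

```scala
// Plain-Scala sketch of the labelling rule and the popular-destination filter.
// DelayLabels and popularDestinations are illustrative names only.
object DelayLabels {
  // Mirrors the checkDelay UDF: 1.0 = delayed by more than 15 minutes,
  // 2.0 = not delayed. Unparseable input falls back to 1.0, as in the UDF.
  def checkDelay(depDelay: String): Double =
    try { if (depDelay.toDouble > 15) 1.0 else 2.0 }
    catch { case _: Exception => 1.0 }

  // Counts flights per destination and keeps the destinations that have
  // strictly more than minFlights flights (the notebook uses 1000).
  def popularDestinations(dests: Seq[String], minFlights: Int): Set[String] =
    dests
      .groupBy(identity)
      .filter { case (_, group) => group.size > minFlights }
      .keys
      .toSet
}
```

Note that a delay of exactly 15 minutes is labelled 2.0, since the comparison is strict. The labels start at 1.0 rather than 0.0, which matches the label values used by the UDF in the cell above.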

Encode the destination using one-hot encoding, and include the columns Year, Month, DayofMonth, DayOfWeek, and Distance as features.

In [None]:
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
val indexer = new StringIndexer().setInputCol("Dest").setOutputCol("DestIndex").setHandleInvalid("skip") // Only works on Spark 1.6 or later
val encoder = new OneHotEncoder().setInputCol("DestIndex").setOutputCol("DestVec")
val assembler = new VectorAssembler().setInputCols(Array("Year","Month","DayofMonth","DayOfWeek","Distance","DestVec")).setOutputCol("features")
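
To make the encoding step concrete, here is a minimal plain-Scala sketch of what the indexer and encoder compute. `StringIndexer` assigns index 0 to the most frequent label, and `OneHotEncoder` drops the last category by default, so that category is encoded as an all-zero vector. The `OneHotSketch` object is a hypothetical name, the alphabetical tie-breaking is an assumption made to keep the sketch deterministic, and Spark itself produces sparse vectors rather than the dense `Vector[Double]` used here.

```scala
// Plain-Scala sketch of frequency-based indexing followed by one-hot encoding.
// OneHotSketch is an illustrative name, not a Spark or SystemML API.
object OneHotSketch {
  // Index labels by descending frequency (most frequent label gets 0).
  // Ties are broken alphabetically here; this is an assumption of the sketch.
  def fitIndex(labels: Seq[String]): Map[String, Int] =
    labels
      .groupBy(identity)
      .toSeq
      .map { case (label, group) => (label, group.size) }
      .sortBy { case (label, count) => (-count, label) }
      .map(_._1)
      .zipWithIndex
      .toMap

  // One-hot encode an index into numCategories - 1 slots. The last category
  // maps to the all-zero vector, mimicking the default dropLast behaviour.
  def encode(index: Int, numCategories: Int): Vector[Double] =
    Vector.tabulate(numCategories - 1)(i => if (i == index) 1.0 else 0.0)
}
```

With three destinations, the two most frequent ones get a single 1.0 in their own slot, while the least frequent one is represented by zeros in both slots.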

### Build the model: Use SystemML's MLPipeline wrapper

This wrapper invokes <code>MultiLogReg.dml</code> (for training) and <code>GLM-predict.dml</code> (for prediction). These DML algorithms are available at https://github.com/apache/incubator-systemml/tree/master/scripts/algorithms

In [None]:
import org.apache.spark.ml.Pipeline
import org.apache.sysml.api.ml.LogisticRegression
val lr = new LogisticRegression("log", sc).setRegParam(1e-4).setTol(1e-2).setMaxInnerIter(0).setMaxOuterIter(100)
val pipeline = new Pipeline().setStages(Array(indexer, encoder, assembler, lr))
val model = pipeline.fit(trainDataset)

### Evaluate the model 

Compute the root-mean-square (RMS) error of the predictions on the test data.

In [None]:
val predictions = model.transform(testDataset.withColumnRenamed("label", "OriginalLabel"))
predictions.select("prediction", "OriginalLabel").show
sqlContext.udf.register("square", (x:Double) => Math.pow(x, 2.0))

In [None]:
predictions.registerTempTable("predictions")
sqlContext.sql("SELECT sqrt(avg(square(OriginalLabel - prediction))) FROM predictions").show
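
A note on interpreting this number: assuming the `prediction` column holds class labels (1.0 or 2.0, like `OriginalLabel`), every squared difference is either 0 or 1, so the RMS error equals the square root of the misclassification rate. The plain-Scala sketch below checks this relationship on a small made-up sample. The `RmseSketch` object and the sample values are hypothetical.

```scala
// Plain-Scala sketch relating RMS error to the misclassification rate
// when labels and predictions both take the values 1.0 and 2.0.
object RmseSketch {
  // Root-mean-square error between paired labels and predictions.
  def rmse(labels: Seq[Double], predictions: Seq[Double]): Double = {
    require(labels.nonEmpty && labels.size == predictions.size)
    val squared = labels.zip(predictions).map { case (y, p) => math.pow(y - p, 2.0) }
    math.sqrt(squared.sum / squared.size)
  }

  // Fraction of rows where the predicted class differs from the label.
  def errorRate(labels: Seq[Double], predictions: Seq[Double]): Double = {
    require(labels.nonEmpty && labels.size == predictions.size)
    labels.zip(predictions).count { case (y, p) => y != p }.toDouble / labels.size
  }
}
```

For example, with labels `Seq(1.0, 2.0, 2.0, 1.0)` and predictions `Seq(1.0, 2.0, 1.0, 1.0)`, one of four rows is wrong, so the error rate is 0.25 and the RMS error is 0.5. Squaring the RMS error reported above therefore gives the fraction of misclassified test flights, under the stated assumption.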
