# Flight Delay Predictions with PixieDust  

<img style="max-width: 800px; padding: 25px 0px;" src="https://ibm-watson-data-lab.github.io/simple-data-pipe-connector-flightstats/flight_predictor_architecture.png"/>
  
This notebook features a Spark Machine Learning application that uses weather data to predict whether a flight will be delayed. [Read the step-by-step tutorial](https://medium.com/@vabarbosa/fb613afd6e91#.vo01jflmf).

The application workflow is as follows:  
1. Configure the application parameters
2. Load the training and test data
3. Build the classification models
4. Evaluate the models and iterate
5. Launch a PixieDust embedded application to run the models  

## Prerequisite  

This notebook is a follow-up to [Predict Flight Delays with Apache Spark MLlib, FlightStats, and Weather Data](https://developer.ibm.com/clouddataservices/2016/08/04/predict-flight-delays-with-apache-spark-mllib-flightstats-and-weather-data/). Follow the steps in that tutorial and, at a minimum, complete the following:

* Set up a FlightStats account  
* Provision the Weather Company Data service  
* Obtain or build the training and test data sets  

## Learn more about the technology used:  

* [Weather Company Data](https://console.ng.bluemix.net/docs/services/Weather/index.html)  
* [FlightStats](https://developer.flightstats.com/)  
* [Apache Spark MLlib](https://spark.apache.org/mllib/)  
* [PixieDust](https://github.com/ibm-watson-data-lab/pixiedust)  
* [pixiedust_flightpredict](https://github.com/ibm-watson-data-lab/simple-data-pipe-connector-flightstats/tree/master/pixiedust_flightpredict)    

# Install the latest pixiedust and pixiedust-flightpredict plugin

Make sure you are running the latest `pixiedust` and `pixiedust-flightpredict` versions. After upgrading, restart the kernel before continuing to the next cells.

In [None]:
!pip install --upgrade --user pixiedust

In [None]:
!pip install --upgrade --user pixiedust-flightpredict

<h3>If PixieDust was just installed or upgraded, <span style="color: red">restart the kernel</span> before continuing.</h3>

# Import the required Python package and set Cloudant credentials

Have your credentials for Cloudant, Weather Company Data, and FlightStats available, along with the training and test data info from [Predict Flight Delays with Apache Spark MLlib, FlightStats, and Weather Data](https://developer.ibm.com/clouddataservices/2016/08/04/predict-flight-delays-with-apache-spark-mllib-flightstats-and-weather-data/).

Run this cell to launch and complete the Configuration Dashboard, where you'll load the training and test data. Ensure all <i class="fa fa-2x fa-times" style="font-size:medium"></i> tasks are completed. After editing configuration, you can re-run this cell to see the updated status for each task.

In [None]:
import pixiedust_flightpredict
pixiedust_flightpredict.configure()

# Train multiple classification models  

The following cells train four models: Logistic Regression, Naive Bayes, Decision Tree, and Random Forest.
Feel free to update these models or build your own.

In [None]:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors
from numpy import array
import numpy as np
import math
from datetime import datetime
from dateutil import parser
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

# Replace NaN feature values with 0.0, then train the logistic regression model
logRegModel = LogisticRegressionWithLBFGS.train(
    labeledTrainingData.map(lambda lp: LabeledPoint(
        lp.label,
        np.fromiter(map(lambda x: 0.0 if np.isnan(x) else x, lp.features.toArray()), dtype=np.double))),
    iterations=1000, validateData=False, intercept=False)
print(logRegModel)
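
The NaN handling used in the training cells can be tried in isolation. The following is a minimal sketch in plain Python, without Spark or NumPy; the helper name `replace_nan` and the sample values are hypothetical and are not part of `pixiedust_flightpredict`.

```python
import math


def replace_nan(features, fill=0.0):
    """Return a copy of `features` with every NaN entry replaced by `fill`."""
    return [fill if math.isnan(x) else x for x in features]


# Hypothetical feature vector; the values are illustrative only
sample = [12.5, float("nan"), -3.0, 0.0]
cleaned = replace_nan(sample)
print(cleaned)
```

Only NaN entries change; negative values and zeros pass through untouched, which matches the `0.0 if np.isnan(x) else x` expression in the cell above.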

In [None]:
from pyspark.mllib.classification import NaiveBayes

# NaiveBayes requires non-negative features, so negative and NaN values are set to 0 for now
modelNaiveBayes = NaiveBayes.train(
    labeledTrainingData.map(lambda lp: LabeledPoint(
        lp.label,
        np.fromiter(map(lambda x: x if x > 0.0 else 0.0, lp.features.toArray()), dtype=int))))

print(modelNaiveBayes)
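
To see what the non-negative clamp does to a feature vector, here is a small sketch in plain Python. The helper name `clamp_non_negative` and the sample values are hypothetical; the comparison mirrors the `x if x > 0.0 else 0.0` expression in the cell above.

```python
def clamp_non_negative(features):
    """Keep strictly positive values; map negatives, zeros, and NaN to 0.0."""
    # NaN > 0.0 evaluates to False, so NaN entries also become 0.0
    return [x if x > 0.0 else 0.0 for x in features]


# Hypothetical feature vector; the values are illustrative only
sample = [3.0, -1.5, float("nan"), 0.0]
clamped = clamp_non_negative(sample)
print(clamped)
```

Clamping discards information: after this step a negative reading and a true zero are indistinguishable, which is worth keeping in mind when comparing Naive Bayes against the other models.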

In [None]:
from pyspark.mllib.tree import DecisionTree

# Replace NaN feature values with 0.0, then train the decision tree classifier
modelDecisionTree = DecisionTree.trainClassifier(
    labeledTrainingData.map(lambda lp: LabeledPoint(
        lp.label,
        np.fromiter(map(lambda x: 0.0 if np.isnan(x) else x, lp.features.toArray()), dtype=np.double))),
    numClasses=training.getNumClasses(), categoricalFeaturesInfo={})
print(modelDecisionTree)

In [None]:
from pyspark.mllib.tree import RandomForest

# Replace NaN feature values with 0.0, then train a random forest of 100 trees
modelRandomForest = RandomForest.trainClassifier(
    labeledTrainingData.map(lambda lp: LabeledPoint(
        lp.label,
        np.fromiter(map(lambda x: 0.0 if np.isnan(x) else x, lp.features.toArray()), dtype=np.double))),
    numClasses=training.getNumClasses(), categoricalFeaturesInfo={}, numTrees=100)
print(modelRandomForest)

# Evaluate the models  

`pixiedust_flightpredict` provides a plugin to the PixieDust `display` API and adds a menu (look for the plane icon) that computes the accuracy metrics for the models, including the confusion table.

In [None]:
display(testData)
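
For readers who want to see what these metrics measure, here is a minimal sketch of computing accuracy and a confusion table from (actual, predicted) label pairs. It is not the plugin's implementation: the function names, the sample pairs, and the label encoding (0 = on time, 1 = delayed) are assumptions made for illustration.

```python
from collections import Counter


def confusion_counts(pairs):
    """Count how often each (actual, predicted) combination occurs."""
    return Counter(pairs)


def accuracy(pairs):
    """Fraction of pairs where the prediction matches the actual label."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    correct = sum(1 for actual, predicted in pairs if actual == predicted)
    return correct / len(pairs)


# Hypothetical (actual, predicted) labels: 0 = on time, 1 = delayed
results = [(0, 0), (0, 1), (1, 1), (1, 1), (0, 0), (1, 0)]
table = confusion_counts(results)
acc = accuracy(results)
print(table[(0, 0)], table[(0, 1)], table[(1, 0)], table[(1, 1)])
print(round(acc, 3))
```

The diagonal entries of the table, where actual equals predicted, are the correct predictions; accuracy is their sum divided by the total number of test records.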

# Run the predictive model application  

This cell runs the embedded PixieDust application, which lets users enter flight information. The models then run and predict the probability that the flight will be on time.

In [None]:
import pixiedust_flightpredict
from pixiedust_flightpredict import *
pixiedust_flightpredict.flightPredict("LAS")

# Get aggregated results for all the flights that have been predicted

The following cell shows a map of all the airports and flights searched to date. Each edge represents an aggregated view of all the flights between two airports. Click an edge to display a grouped list of flights showing how many users are on the same flight.

In [None]:
import pixiedust_flightpredict
pixiedust_flightpredict.displayMapResults()
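
The aggregation behind each edge can be pictured with a short sketch. This is not how `displayMapResults()` is implemented: the record layout, the flight numbers, the function name `aggregate_by_edge`, and the choice to treat an edge as an unordered airport pair are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def aggregate_by_edge(records):
    """Group searches by airport pair and count searches per flight."""
    edges = defaultdict(Counter)
    for departure, arrival, flight in records:
        # Assumption: an edge is undirected, so LAS-SFO and SFO-LAS share one key
        edge = tuple(sorted((departure, arrival)))
        edges[edge][flight] += 1
    return {edge: dict(flights) for edge, flights in edges.items()}


# Hypothetical search records: (departure, arrival, flight number)
searches = [
    ("LAS", "SFO", "WN123"),
    ("SFO", "LAS", "WN456"),
    ("LAS", "SFO", "WN123"),
    ("LAS", "JFK", "B6789"),
]
summary = aggregate_by_edge(searches)
print(summary)
```

Each key in `summary` corresponds to one edge on the map, and the per-flight counts correspond to the grouped list shown when an edge is clicked.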