# Let's Send Airplanes Through Pipelines
### or "How to bundle repeated actions into a single predictable process"

We will be attempting to predict whether a single flight will be delayed, given various characteristics of that flight. The data is provided by Albert Bifet & Elena Ikonomovska, [Data Expo competition (2009)](http://kt.ijs.si/elena_ikonomovska/data.html). Here is the description Elena Ikonomovska gives on her website:   
  
>  The dataset consists of a large amount of records, containing flight arrival and departure details for all the commercial flights within the USA, from October 1987 to April 2008. This is a large dataset with nearly 120 million records (11.5 GB memory size). The dataset was cleaned and records were sorted according to the arrival/departure date (year, month, and day) and time of flight. Its final size is around 116 million records and 5.76 GB of memory.
  
We will be using [OpenML](https://www.openml.org/) to access the data, along with `fetch_openml` from sklearn, so that we don't even need to worry about unzipping or finding a folder for the data (it's all handled inside Python). The specifics from OpenML about this dataset can be found [here](https://www.openml.org/d/1169). First, here are the imports for the entire project, along with the code to load the data into memory and read the description.

In [14]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split      
from sklearn.datasets import fetch_openml    
from sklearn.pipeline import Pipeline, make_pipeline, FeatureUnion
from sklearn.preprocessing import OneHotEncoder

In [9]:
airlines = fetch_openml('airlines', version=1)    # Returns a dictionary-like Bunch object
print(airlines['DESCR'])

**Author**: Albert Bifet, Elena Ikonomovska  
**Source**: [Data Expo competition](http://kt.ijs.si/elena_ikonomovska/data.html) - 2009  
**Please cite**:   

Airlines Dataset Inspired in the regression dataset from Elena Ikonomovska. The task is to predict whether a given flight will be delayed, given the information of the scheduled departure.

Downloaded from openml.org.


#### Bonus Info!
Originally, this dataset wasn't gathered for machine learning, but rather for a "Data Expo". Here is the original challenge:  

>The aim of the data expo is to provide a graphical summary of important features of the data set. This is intentionally vague in order to allow different entries to focus on different aspects of the data.

Check out the resulting posters [here](http://stat-computing.org/dataexpo/2009/posters/).

Great! Let's do a bit of prep work, then we'll build the pipeline. The purpose here is to input "Raw" data, whatever that means for you, and output an evaluation metric.

First, we save it in a DataFrame. I had to hunt around a bit to find the column names; any time you begin investigating a new dataset, there's a good chance you'll have metadata that may or may not be useful to you, with the actual data you want nested somewhere inside. For this dataset, the values are kept in the `data` part of the main dictionary and the column names in `feature_names`.

In [13]:
df = pd.DataFrame(airlines['data'], columns=airlines['feature_names'])  # These will be the X inputs
df.head(1)

|   | Airline | Flight | AirportFrom | AirportTo | DayOfWeek | Time | Length |
|---|---|---|---|---|---|---|---|
| 0 | 3.0 | 269.0 | 2.0 | 3.0 | 2.0 | 15.0 | 205.0 |


The outputs we are trying to predict are kept in a different part of the returned object: they reside in `target` as binary values, where 1 means the flight was delayed and 0 means "on time". We will perform our `train_test_split` right from the start.  
#### Important note: it's always a good idea to split your data BEFORE you do any heavy modifications to it (like one-hot encoding or `fillna`). 
This is because modifying the data before a split inherently passes *some* bit of information between the training rows and the test rows that get pulled out later, which biases the evaluation of the model. For example, suppose 15% of a column is `null` and we fill those nulls with the column's median value, THEN split the data. That median was computed from every row, so the training data now carries information derived from the test rows (and the test rows from the training data), even though that's never supposed to happen. My first data science instructor really tried to drive home this point, and I'll borrow from him in saying that this is a "Career Limiting Mistake".
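
To make the leakage point concrete, here is a minimal sketch on made-up numbers (not the airline data). It assumes scikit-learn's `SimpleImputer`, which this notebook does not otherwise use, and compares a median computed from every row with one learned from the training rows only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy single-column feature with missing values (illustrative numbers only).
X = np.array([[1.0], [2.0], [np.nan], [4.0], [100.0], [np.nan], [7.0], [8.0]])

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Leaky: this median is computed from every row, including future test rows.
leaky_median = np.nanmedian(X)

# Safer: the imputer learns its median from the training rows only...
imputer = SimpleImputer(strategy="median")
X_train_filled = imputer.fit_transform(X_train)
# ...and reuses that same learned value to fill the test rows.
X_test_filled = imputer.transform(X_test)

print("median from all rows:     ", leaky_median)
print("median from training rows:", imputer.statistics_[0])
```

The key habit is calling `fit` (or `fit_transform`) on the training rows and only `transform` on the test rows.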

In [12]:
target = pd.DataFrame(airlines['target'], columns=['was_late'])  # This is the goal: y

X_train, X_test, y_train, y_test = train_test_split(df, target, random_state=14159)  # random_state is only so our numbers match, generally not needed except for educational purposes.

In that last step, it's crucial the rows of `X` line up with the rows of `y`. This data set is pre-cleaned and I am willing to trust it, but often you'll specify directly which columns you want from a unified dataframe.
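
As a rough illustration of that more common situation, the sketch below starts from a single hypothetical DataFrame (the rows and the `feature_columns` name are invented for the example), selects the feature columns and the label column explicitly, and then checks that the split kept the rows of `X` and `y` aligned.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical unified table: features and the label live side by side.
flights = pd.DataFrame({
    "Airline": ["AA", "DL", "UA", "AA", "DL", "UA", "AA", "DL"],
    "Length": [120, 95, 240, 60, 180, 75, 300, 45],
    "was_late": [1, 0, 1, 0, 0, 1, 1, 0],
})

feature_columns = ["Airline", "Length"]   # the X inputs
X = flights[feature_columns]
y = flights["was_late"]                   # the goal: y

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Both pieces keep the original row index, so alignment is easy to verify.
train_aligned = X_train.index.equals(y_train.index)
test_aligned = X_test.index.equals(y_test.index)
print(train_aligned, test_aligned)
```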

In [25]:
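one_hot = OneHotEncoder()
s1 = one_hot.fit(X_train)     # learn the categories from the training rows only
s2 = s1.transform(X_train)    # returns a sparse matrix
s3 = s2.toarray()             # dense copy; this can be very large when there are many categories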

If you are used to running a LabelEncoder first to convert the categories to integers, that step is no longer needed: you can now apply the OneHotEncoder to the categories directly.
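
A small sketch of what "directly" looks like, using made-up airline codes rather than the real data. The `handle_unknown="ignore"` setting is my own choice for the example, so that a category seen only in the test rows does not raise an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up airline codes; the encoder accepts string categories as they are.
train = np.array([["AA"], ["DL"], ["UA"], ["AA"]])
test = np.array([["DL"], ["WN"]])   # "WN" never appears in the training rows

encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder.fit_transform(train)   # sparse matrix by default
test_encoded = encoder.transform(test)         # reuses the training categories

print(encoder.categories_)
print(test_encoded.toarray())
```

With this setting, the unseen `"WN"` row is encoded as all zeros instead of stopping the run.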


In [None]:
make_pipeline(OneHotEncoder())  # steps are passed as separate arguments, each an instance
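
To show the "raw data in, evaluation metric out" idea end to end, here is a hedged sketch on a tiny made-up table. The `LogisticRegression` final step, the toy rows, and the `handle_unknown="ignore"` setting are all assumptions for illustration; the notebook has not yet chosen a model.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up rows standing in for the airline data.
toy = pd.DataFrame({
    "Airline": ["AA", "DL", "UA", "AA", "DL", "UA", "AA", "DL", "UA", "AA", "DL", "UA"],
    "DayOfWeek": [1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5],
    "was_late": [1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1],
})
X = toy[["Airline", "DayOfWeek"]]
y = toy["was_late"]

# Split first, so the encoder only ever learns from the training rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# The pipeline bundles encoding and the (assumed) model into one object.
pipe = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(),
)
pipe.fit(X_train, y_train)              # fits the encoder, then the model
accuracy = pipe.score(X_test, y_test)   # encodes X_test, then scores it
print("accuracy on held-out rows:", accuracy)
```

Because `fit` and `score` go through the whole pipeline, the same encoding is applied to the training and test rows without any manual bookkeeping.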