## Using Regression Techniques

Use regression techniques to predict patterns in flight data.

In [328]:
import findspark
findspark.init('/home/rich/spark/spark-2.4.3-bin-hadoop2.7')
import pandas as pd
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import matplotlib.pyplot as plt

### Data dictionary of flights data
    mon — month (integer between 0 and 11)
    dom — day of month (integer between 1 and 31)
    dow — day of week (integer; 1 = Monday and 7 = Sunday)
    org — origin airport (IATA code)
    mile — distance (miles)
    carrier — carrier (IATA code)
    depart — departure time (decimal hour)
    duration — expected duration (minutes)
    delay — delay (minutes)


### Loading flights data

In [329]:
file_path = './data/flights.csv'

In [330]:
#pandas is my first love :)
df = pd.read_csv(file_path)

In [331]:
df.head()

Unnamed: 0,mon,dom,dow,carrier,flight,org,mile,depart,duration,delay
0,11,20,6,US,19,JFK,2153,9.48,351,
1,0,22,2,UA,1107,ORD,316,16.33,82,30.0
2,2,20,4,UA,226,SFO,337,6.17,82,-8.0
3,9,13,1,AA,419,ORD,1236,10.33,195,-5.0
4,4,2,5,AA,325,ORD,258,8.92,65,


In [332]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 10 columns):
mon         50000 non-null int64
dom         50000 non-null int64
dow         50000 non-null int64
carrier     50000 non-null object
flight      50000 non-null int64
org         50000 non-null object
mile        50000 non-null int64
depart      50000 non-null float64
duration    50000 non-null int64
delay       47022 non-null float64
dtypes: float64(2), int64(6), object(2)
memory usage: 3.8+ MB


In [333]:
# Create SparkSession object
spark = SparkSession.builder.master('local[*]').appName('FlightsRegression').getOrCreate()

In [334]:
# Read data from CSV file
flights = spark.read.csv(file_path,sep=',',header=True,inferSchema=True,nullValue='NA')

# Get number of records
print("The data contain %d records." % flights.count())

# View the first five records
flights.show(5)

# Check column data types
flights.dtypes

The data contain 50000 records.
+---+---+---+-------+------+---+----+------+--------+-----+
|mon|dom|dow|carrier|flight|org|mile|depart|duration|delay|
+---+---+---+-------+------+---+----+------+--------+-----+
| 11| 20|  6|     US|    19|JFK|2153|  9.48|     351| null|
|  0| 22|  2|     UA|  1107|ORD| 316| 16.33|      82|   30|
|  2| 20|  4|     UA|   226|SFO| 337|  6.17|      82|   -8|
|  9| 13|  1|     AA|   419|ORD|1236| 10.33|     195|   -5|
|  4|  2|  5|     AA|   325|ORD| 258|  8.92|      65| null|
+---+---+---+-------+------+---+----+------+--------+-----+
only showing top 5 rows



[('mon', 'int'),
 ('dom', 'int'),
 ('dow', 'int'),
 ('carrier', 'string'),
 ('flight', 'int'),
 ('org', 'string'),
 ('mile', 'int'),
 ('depart', 'double'),
 ('duration', 'int'),
 ('delay', 'int')]

### Add a categorical column

In [335]:
from pyspark.ml.feature import StringIndexer

# Create an indexer
indexer = StringIndexer(inputCol='org', outputCol='org_idx')

# Indexer identifies categories in the data
indexer_model = indexer.fit(flights)

# Indexer creates a new column with numeric index values
flights = indexer_model.transform(flights)

flights.show(10)

+---+---+---+-------+------+---+----+------+--------+-----+-------+
|mon|dom|dow|carrier|flight|org|mile|depart|duration|delay|org_idx|
+---+---+---+-------+------+---+----+------+--------+-----+-------+
| 11| 20|  6|     US|    19|JFK|2153|  9.48|     351| null|    2.0|
|  0| 22|  2|     UA|  1107|ORD| 316| 16.33|      82|   30|    0.0|
|  2| 20|  4|     UA|   226|SFO| 337|  6.17|      82|   -8|    1.0|
|  9| 13|  1|     AA|   419|ORD|1236| 10.33|     195|   -5|    0.0|
|  4|  2|  5|     AA|   325|ORD| 258|  8.92|      65| null|    0.0|
|  5|  2|  1|     UA|   704|SFO| 550|  7.98|     102|    2|    1.0|
|  7|  2|  6|     AA|   380|ORD| 733| 10.83|     135|   54|    0.0|
|  1| 16|  6|     UA|  1477|ORD|1440|   8.0|     232|   -7|    0.0|
|  1| 22|  5|     UA|   620|SJC|1829|  7.98|     250|  -13|    4.0|
| 11|  8|  1|     OO|  5590|SFO| 158|  7.77|      60|   88|    1.0|
+---+---+---+-------+------+---+----+------+--------+-----+-------+
only showing top 10 rows



## Encoding flight origin

The org column in the flights data is a categorical variable giving the airport from which a flight departs.

    ORD — O'Hare International Airport (Chicago)
    SFO — San Francisco International Airport
    JFK — John F Kennedy International Airport (New York)
    LGA — La Guardia Airport (New York)
    SMF — Sacramento
    SJC — San Jose
    TUS — Tucson International Airport
    OGG — Kahului (Hawaii)

Because org is a categorical variable, it needs to be one-hot encoded before it can be used in a regression model.

The StringIndexer applied earlier already created org_idx, a column of index values corresponding to the strings in org.

In [336]:
# Import the one hot encoder class
from pyspark.ml.feature import OneHotEncoderEstimator

# Create an instance of the one hot encoder
onehot = OneHotEncoderEstimator(inputCols=['org_idx'], outputCols=['org_dummy'])

# Apply the one hot encoder to the flights data
onehot = onehot.fit(flights)
flights = onehot.transform(flights)

# Check the results
flights.select('org', 'org_idx', 'org_dummy').distinct().sort('org_idx').show()

+---+-------+-------------+
|org|org_idx|    org_dummy|
+---+-------+-------------+
|ORD|    0.0|(7,[0],[1.0])|
|SFO|    1.0|(7,[1],[1.0])|
|JFK|    2.0|(7,[2],[1.0])|
|LGA|    3.0|(7,[3],[1.0])|
|SJC|    4.0|(7,[4],[1.0])|
|SMF|    5.0|(7,[5],[1.0])|
|TUS|    6.0|(7,[6],[1.0])|
|OGG|    7.0|    (7,[],[])|
+---+-------+-------------+
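
The sparse notation in org_dummy reads as (length, [indices], [values]), so (7,[2],[1.0]) is a vector of length 7 with 1.0 at index 2, and the last level (OGG) becomes the all-zero vector (7,[],[]). As a rough illustration of the two steps, here is a plain-Python sketch that needs no Spark. The function names and the toy data are hypothetical, and the alphabetical tie-breaking is an assumption of the sketch, not a statement about Spark's behaviour.

```python
from collections import Counter

def index_labels(values):
    """Map each label to an index, most frequent label first.
    Ties are broken alphabetically (an assumption of this sketch)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def one_hot_drop_last(index, n_levels):
    """Dense one-hot vector of length n_levels - 1; the last level is all zeros."""
    vec = [0.0] * (n_levels - 1)
    if index < n_levels - 1:
        vec[int(index)] = 1.0
    return vec

# Toy data, made up for illustration (not taken from flights.csv)
toy_org = ['ORD', 'ORD', 'ORD', 'SFO', 'SFO', 'JFK']

mapping = index_labels(toy_org)
encoded = {label: one_hot_drop_last(idx, len(mapping)) for label, idx in mapping.items()}

print(mapping)
print(encoded)
```

The least frequent label ends up as the reference level, which is why OGG has no dummy column of its own in the output above.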



### Flight duration model: Just distance

Build a regression model to predict flight duration (the duration column).

Keep the model simple, including only the distance of the flight (the km column) as a predictor.

In [337]:
from pyspark.sql.functions import round

# make a distance km feature
flights = flights.withColumn('km', round(flights.mile * 1.60934, 0)).drop('mile')

flights.show(5)

+---+---+---+-------+------+---+------+--------+-----+-------+-------------+------+
|mon|dom|dow|carrier|flight|org|depart|duration|delay|org_idx|    org_dummy|    km|
+---+---+---+-------+------+---+------+--------+-----+-------+-------------+------+
| 11| 20|  6|     US|    19|JFK|  9.48|     351| null|    2.0|(7,[2],[1.0])|3465.0|
|  0| 22|  2|     UA|  1107|ORD| 16.33|      82|   30|    0.0|(7,[0],[1.0])| 509.0|
|  2| 20|  4|     UA|   226|SFO|  6.17|      82|   -8|    1.0|(7,[1],[1.0])| 542.0|
|  9| 13|  1|     AA|   419|ORD| 10.33|     195|   -5|    0.0|(7,[0],[1.0])|1989.0|
|  4|  2|  5|     AA|   325|ORD|  8.92|      65| null|    0.0|(7,[0],[1.0])| 415.0|
+---+---+---+-------+------+---+------+--------+-----+-------+-------------+------+
only showing top 5 rows
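
As a quick sanity check, the conversion can be repeated by hand for the five rows shown. This sketch uses Python's built-in round rather than the Spark function imported above; the two differ in how they treat exact .5 ties, but none of these values is a tie.

```python
MILES_TO_KM = 1.60934  # same conversion factor as in the cell above

# 'mile' values of the first five rows, copied from the earlier output
miles = [2153, 316, 337, 1236, 258]

# Built-in round (not pyspark.sql.functions.round), rounding to whole km
km = [float(round(m * MILES_TO_KM)) for m in miles]
print(km)
```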



In [338]:
from pyspark.ml.feature import VectorAssembler

# make km the feature
assembler = VectorAssembler(inputCols=['km'], outputCol='features')

flights = assembler.transform(flights)

# Check the resulting column
flights.show(5, truncate=False)

+---+---+---+-------+------+---+------+--------+-----+-------+-------------+------+--------+
|mon|dom|dow|carrier|flight|org|depart|duration|delay|org_idx|org_dummy    |km    |features|
+---+---+---+-------+------+---+------+--------+-----+-------+-------------+------+--------+
|11 |20 |6  |US     |19    |JFK|9.48  |351     |null |2.0    |(7,[2],[1.0])|3465.0|[3465.0]|
|0  |22 |2  |UA     |1107  |ORD|16.33 |82      |30   |0.0    |(7,[0],[1.0])|509.0 |[509.0] |
|2  |20 |4  |UA     |226   |SFO|6.17  |82      |-8   |1.0    |(7,[1],[1.0])|542.0 |[542.0] |
|9  |13 |1  |AA     |419   |ORD|10.33 |195     |-5   |0.0    |(7,[0],[1.0])|1989.0|[1989.0]|
|4  |2  |5  |AA     |325   |ORD|8.92  |65      |null |0.0    |(7,[0],[1.0])|415.0 |[415.0] |
+---+---+---+-------+------+---+------+--------+-----+-------+-------------+------+--------+
only showing top 5 rows



In [339]:
#clean up the column order
flights = flights.select('mon','dom','dow','carrier','flight','org','depart','duration','delay','km','org_idx',
                          'org_dummy','features')

In [340]:
# Split into training and testing sets in a 80:20 ratio
flights_train, flights_test = flights.randomSplit([0.8,0.2],seed=17)

# Check that the train:test count ratio is close to 4 (i.e. an 80:20 split)
training_ratio = flights_train.count() / flights_test.count()
print(training_ratio)

3.956383822363204
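
The printed number is the ratio of training rows to testing rows, not the training fraction. A short worked conversion, assuming only the value printed above:

```python
train_to_test = 3.956383822363204  # ratio printed above

# If train / test = r, then train / (train + test) = r / (1 + r)
train_fraction = train_to_test / (1 + train_to_test)
print(train_fraction)
```

A result close to 0.8 confirms that randomSplit produced roughly the requested 80:20 split.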


In [341]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Create a regression object and train on training data
regression = LinearRegression(labelCol='duration').fit(flights_train)

# Create predictions for the testing data and take a look at the predictions
predictions = regression.transform(flights_test)
predictions.select('duration', 'prediction').show(5, False)

# Calculate the RMSE
RegressionEvaluator(labelCol='duration').evaluate(predictions)

+--------+------------------+
|duration|prediction        |
+--------+------------------+
|135     |149.8246079746783 |
|120     |133.5637354593352 |
|160     |152.39609479105815|
|275     |269.0205851104026 |
|85      |93.40316194469713 |
+--------+------------------+
only showing top 5 rows



17.077817098141733
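
RMSE is the square root of the mean squared difference between the label and the prediction. The sketch below applies that definition to the five rows displayed above only, so its result will not equal the RMSE over the full test set; the helper name rmse is hypothetical.

```python
import math

# (duration, prediction) pairs copied from the five rows shown above
rows = [
    (135, 149.8246079746783),
    (120, 133.5637354593352),
    (160, 152.39609479105815),
    (275, 269.0205851104026),
    (85, 93.40316194469713),
]

def rmse(pairs):
    """Root mean squared error over (actual, predicted) pairs."""
    squared_errors = [(actual - predicted) ** 2 for actual, predicted in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

print(rmse(rows))
```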

### Interpreting the coefficients

The linear regression model for flight duration as a function of distance takes the form

duration = α + β × distance

where

    α — intercept (component of duration which does not depend on distance) and
    β — coefficient (rate at which duration increases as a function of distance; also called the slope).

By looking at the coefficients of your model you will be able to infer

    how much of the average flight duration is actually spent on the ground and
    what the average speed is during a flight.


In [342]:
# Intercept (average minutes on ground)
inter = regression.intercept
print("intercept is: ",inter)

# Coefficients
coefs = regression.coefficients
print("coefs are: ", coefs)

# Average minutes per km
minutes_per_km = regression.coefficients[0]
print("minutes_per_km is : ",minutes_per_km)

# Average speed in km per hour
avg_speed = 60 / minutes_per_km
print("avg_speed is : ",avg_speed)

intercept is:  44.318016537917266
coefs are:  [0.07563196518764231]
minutes_per_km is :  0.07563196518764231
avg_speed is :  793.3153640942751


In [343]:
coefs[0]

0.07563196518764231

In [344]:
predictions.show(5)

+---+---+---+-------+------+---+------+--------+-----+------+-------+-------------+--------+------------------+
|mon|dom|dow|carrier|flight|org|depart|duration|delay|    km|org_idx|    org_dummy|features|        prediction|
+---+---+---+-------+------+---+------+--------+-----+------+-------+-------------+--------+------------------+
|  0|  1|  2|     AA|   154|ORD| 17.25|     135|   49|1395.0|    0.0|(7,[0],[1.0])|[1395.0]| 149.8246079746783|
|  0|  1|  2|     AA|   392|ORD|  8.08|     120|    4|1180.0|    0.0|(7,[0],[1.0])|[1180.0]| 133.5637354593352|
|  0|  1|  2|     AA|   895|ORD| 12.67|     160|   68|1429.0|    0.0|(7,[0],[1.0])|[1429.0]|152.39609479105815|
|  0|  1|  2|     AA|  1561|ORD| 18.58|     275|   65|2971.0|    0.0|(7,[0],[1.0])|[2971.0]| 269.0205851104026|
|  0|  1|  2|     AA|  1659|ORD| 21.08|      85|   29| 649.0|    0.0|(7,[0],[1.0])| [649.0]| 93.40316194469713|
+---+---+---+-------+------+---+------+--------+-----+------+-------+-------------+--------+------------------+
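
To make the interpretation concrete, the printed intercept and slope can be plugged into the linear form by hand. This sketch recomputes the prediction for the first row of the table above (km = 1395.0) and the implied average speed; predict_duration is a hypothetical helper name.

```python
intercept = 44.318016537917266  # minutes, printed above
slope = 0.07563196518764231     # minutes per km, printed above

def predict_duration(km):
    """duration = intercept + slope * distance"""
    return intercept + slope * km

# First row of the predictions table has km = 1395.0
first_row_prediction = predict_duration(1395.0)
print(first_row_prediction)

# The slope is minutes per km, so 60 / slope is km per hour
print(60 / slope)
```

The hand-computed value agrees with the prediction column to within floating-point rounding.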

### Flight duration model: Adding origin airport

Duration of a flight might depend not only on the distance being covered but also the airport from which the flight departs.

Add departure airport as a predictor.

In [345]:
flights.show(5)

+---+---+---+-------+------+---+------+--------+-----+------+-------+-------------+--------+
|mon|dom|dow|carrier|flight|org|depart|duration|delay|    km|org_idx|    org_dummy|features|
+---+---+---+-------+------+---+------+--------+-----+------+-------+-------------+--------+
| 11| 20|  6|     US|    19|JFK|  9.48|     351| null|3465.0|    2.0|(7,[2],[1.0])|[3465.0]|
|  0| 22|  2|     UA|  1107|ORD| 16.33|      82|   30| 509.0|    0.0|(7,[0],[1.0])| [509.0]|
|  2| 20|  4|     UA|   226|SFO|  6.17|      82|   -8| 542.0|    1.0|(7,[1],[1.0])| [542.0]|
|  9| 13|  1|     AA|   419|ORD| 10.33|     195|   -5|1989.0|    0.0|(7,[0],[1.0])|[1989.0]|
|  4|  2|  5|     AA|   325|ORD|  8.92|      65| null| 415.0|    0.0|(7,[0],[1.0])| [415.0]|
+---+---+---+-------+------+---+------+--------+-----+------+-------+-------------+--------+
only showing top 5 rows



In [346]:
flights = flights.select('mon','dom','dow','carrier','flight','org','depart','duration','delay','km','org_idx',
                          'org_dummy')
flights.show(5)

#save for later processing
flights_stored = flights

# make org_dummy and km the features
assembler = VectorAssembler(inputCols=['km','org_dummy'], outputCol='features')

flights = assembler.transform(flights)

# Check the resulting column
flights.show(5, truncate=False)

+---+---+---+-------+------+---+------+--------+-----+------+-------+-------------+
|mon|dom|dow|carrier|flight|org|depart|duration|delay|    km|org_idx|    org_dummy|
+---+---+---+-------+------+---+------+--------+-----+------+-------+-------------+
| 11| 20|  6|     US|    19|JFK|  9.48|     351| null|3465.0|    2.0|(7,[2],[1.0])|
|  0| 22|  2|     UA|  1107|ORD| 16.33|      82|   30| 509.0|    0.0|(7,[0],[1.0])|
|  2| 20|  4|     UA|   226|SFO|  6.17|      82|   -8| 542.0|    1.0|(7,[1],[1.0])|
|  9| 13|  1|     AA|   419|ORD| 10.33|     195|   -5|1989.0|    0.0|(7,[0],[1.0])|
|  4|  2|  5|     AA|   325|ORD|  8.92|      65| null| 415.0|    0.0|(7,[0],[1.0])|
+---+---+---+-------+------+---+------+--------+-----+------+-------+-------------+
only showing top 5 rows

+---+---+---+-------+------+---+------+--------+-----+------+-------+-------------+----------------------+
|mon|dom|dow|carrier|flight|org|depart|duration|delay|km    |org_idx|org_dummy    |features              |
+---+

In [347]:
# Split into training and testing sets in a 80:20 ratio
flights_train, flights_test = flights.randomSplit([0.8,0.2],seed=17)

In [348]:
# Create a regression object and train on training data
regression = LinearRegression(labelCol='duration').fit(flights_train)

# Create predictions for the testing data
predictions = regression.transform(flights_test)

# Calculate the RMSE on testing data
RegressionEvaluator(labelCol='duration').evaluate(predictions)

predictions.select('duration', 'prediction').show(5, False)

+--------+------------------+
|duration|prediction        |
+--------+------------------+
|135     |147.98723402231772|
|120     |132.01442581276055|
|160     |150.51316648336396|
|275     |265.0716328049321 |
|85      |92.56530414171473 |
+--------+------------------+
only showing top 5 rows



## Interpreting coefficients

Origin airport, org, has eight possible values (ORD, SFO, JFK, LGA, SMF, SJC, TUS and OGG) which have been one-hot encoded to seven dummy variables in org_dummy.

The values for km and org_dummy have been assembled into features, which has eight columns with sparse representation. Column indices in features are as follows:

    0 — km
    1 — ORD
    2 — SFO
    3 — JFK
    4 — LGA
    5 — SJC
    6 — SMF and
    7 — TUS.

OGG does not appear in this list because it is the reference level for the origin airport category.
Use intercept and coefficients attributes to interpret the model.
The coefficients attribute is a list, where the first element indicates how flight duration changes with flight distance.

In [349]:
# Intercept (average minutes on ground at the reference airport, OGG)
inter = regression.intercept
print("intercept is: ",inter)

# Coefficients - these represent 8 values from above
coefs = regression.coefficients
print("coefs are: ", coefs)


intercept is:  15.956859201862951
coefs are:  [0.07429213120724262,28.392851786351308,20.339524711313466,52.53454711956853,46.64532328571594,18.144657044664857,15.567486628753434,17.67929758404075]
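
With several features, a prediction is the intercept plus the dot product of the coefficients with the (sparse) feature vector. The sketch below, using the values printed above, recomputes the prediction for a 1395 km flight from ORD, which is the first test row shown earlier; the helper predict and the dict-based sparse representation are illustrative choices, not Spark's API.

```python
intercept = 15.956859201862951
coefs = [0.07429213120724262, 28.392851786351308, 20.339524711313466,
         52.53454711956853, 46.64532328571594, 18.144657044664857,
         15.567486628753434, 17.67929758404075]

def predict(sparse_features):
    """sparse_features maps feature index -> value; missing indices count as zero."""
    return intercept + sum(coefs[i] * v for i, v in sparse_features.items())

# km at index 0, ORD dummy at index 1
ord_prediction = predict({0: 1395.0, 1: 1.0})
print(ord_prediction)
```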


In [350]:
res = regression.coefficients[0]
avg_speed_hour = 60/res
print('avg_speed_hour = ',avg_speed_hour)

# Average minutes on ground at OGG
inter = regression.intercept
print(inter)

# Average minutes on ground at JFK
avg_ground_jfk = inter + regression.coefficients[3]
print(avg_ground_jfk)

# Average minutes on ground at LGA
avg_ground_lga = inter + regression.coefficients[4]
print(avg_ground_lga)


avg_speed_hour =  807.6225439357256
15.956859201862951
68.49140632143148
62.60218248757889
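
The same arithmetic extends to every airport. This sketch assumes the airport order given by org_idx in the StringIndexer output earlier (ORD, SFO, JFK, LGA, SJC, SMF, TUS), with OGG as the reference level whose ground time is the intercept alone.

```python
intercept = 15.956859201862951
coefs = [0.07429213120724262, 28.392851786351308, 20.339524711313466,
         52.53454711956853, 46.64532328571594, 18.144657044664857,
         15.567486628753434, 17.67929758404075]

# Airports in org_idx order; the dummy for org_idx i sits at features index i + 1
airports = ['ORD', 'SFO', 'JFK', 'LGA', 'SJC', 'SMF', 'TUS']

ground_minutes = {'OGG': intercept}  # reference level: intercept only
for position, airport in enumerate(airports, start=1):
    ground_minutes[airport] = intercept + coefs[position]

# Print airports from shortest to longest estimated ground time
for airport, minutes in sorted(ground_minutes.items(), key=lambda kv: kv[1]):
    print(f"{airport}: {minutes:.1f}")
```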


## Bucketing departure time

Time of day data are a challenge with regression models. They are also a great candidate for bucketing.

Convert the flight departure times from numeric values between 0 (corresponding to 00:00) and 24 (corresponding to 24:00) to binned values. Take those binned values and one-hot encode them.

In [351]:
flights = spark.read.csv(file_path,sep=',',header=True,inferSchema=True,nullValue='NA')

flights.show(5)

+---+---+---+-------+------+---+----+------+--------+-----+
|mon|dom|dow|carrier|flight|org|mile|depart|duration|delay|
+---+---+---+-------+------+---+----+------+--------+-----+
| 11| 20|  6|     US|    19|JFK|2153|  9.48|     351| null|
|  0| 22|  2|     UA|  1107|ORD| 316| 16.33|      82|   30|
|  2| 20|  4|     UA|   226|SFO| 337|  6.17|      82|   -8|
|  9| 13|  1|     AA|   419|ORD|1236| 10.33|     195|   -5|
|  4|  2|  5|     AA|   325|ORD| 258|  8.92|      65| null|
+---+---+---+-------+------+---+----+------+--------+-----+
only showing top 5 rows



In [352]:
from pyspark.ml.feature import Bucketizer, OneHotEncoderEstimator

# Create buckets at 3 hour intervals through the day
buckets = Bucketizer(splits=[0, 3, 6, 9, 12, 15, 18, 21, 24], inputCol='depart', outputCol='depart_bucket')

# Bucket the departure times
bucketed = buckets.transform(flights)
bucketed.select('depart', 'depart_bucket').show(5)

# Create a one-hot encoder
onehot = OneHotEncoderEstimator(inputCols=['depart_bucket'], outputCols=['depart_dummy'])

# One-hot encode the bucketed departure times
flights_onehot = onehot.fit(bucketed).transform(bucketed)
flights_onehot.select('depart', 'depart_bucket', 'depart_dummy').show(5)

+------+-------------+
|depart|depart_bucket|
+------+-------------+
|  9.48|          3.0|
| 16.33|          5.0|
|  6.17|          2.0|
| 10.33|          3.0|
|  8.92|          2.0|
+------+-------------+
only showing top 5 rows

+------+-------------+-------------+
|depart|depart_bucket| depart_dummy|
+------+-------------+-------------+
|  9.48|          3.0|(7,[3],[1.0])|
| 16.33|          5.0|(7,[5],[1.0])|
|  6.17|          2.0|(7,[2],[1.0])|
| 10.33|          3.0|(7,[3],[1.0])|
|  8.92|          2.0|(7,[2],[1.0])|
+------+-------------+-------------+
only showing top 5 rows
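
The bucket assignment itself is simple interval lookup: each bucket covers [lower, upper), and the last bucket also includes its upper bound. A plain-Python sketch of that rule, with a hypothetical helper bucket_index, reproduces the bucket numbers shown above:

```python
from bisect import bisect_right

SPLITS = [0, 3, 6, 9, 12, 15, 18, 21, 24]

def bucket_index(hour, splits=SPLITS):
    """Index of the interval [splits[i], splits[i + 1]) that contains hour.
    The final interval is closed on the right, so 24.0 lands in the last bucket."""
    if not splits[0] <= hour <= splits[-1]:
        raise ValueError(f"{hour} is outside the range of the splits")
    return min(bisect_right(splits, hour) - 1, len(splits) - 2)

# Departure times of the first five rows
for depart in [9.48, 16.33, 6.17, 10.33, 8.92]:
    print(depart, float(bucket_index(depart)))
```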



## Flight duration model: Adding departure time

Include the dummy variables in a regression model for flight duration.

Put km, org_dummy and depart_dummy into features, where km is index 0, org_dummy runs from index 1 to 7 and depart_dummy from index 8 to 14.
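
To keep track of which index means what, it can help to write the layout out explicitly. The sketch below builds a list of labels for the 15 positions, assuming the airport order from org_idx and the 3-hour buckets defined above; the label strings themselves are made up for readability.

```python
# Airports in org_idx order (from the StringIndexer output)
org_levels = ['ORD', 'SFO', 'JFK', 'LGA', 'SJC', 'SMF', 'TUS', 'OGG']

splits = [0, 3, 6, 9, 12, 15, 18, 21, 24]
depart_levels = [f"{lo:02d}:00-{hi:02d}:00" for lo, hi in zip(splits, splits[1:])]

# One-hot encoding drops the last level of each category (the reference level)
feature_names = (['km']
                 + [f"org={level}" for level in org_levels[:-1]]
                 + [f"depart={level}" for level in depart_levels[:-1]])

for index, name in enumerate(feature_names):
    print(index, name)
```

Under this layout the reference case is a flight from OGG departing between 21:00 and 24:00, which is what the intercept describes.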

In [353]:
#flights = flights_onehot
#flights_stored.show(5)
flights = flights_stored

# Create buckets at 3 hour intervals through the day
buckets = Bucketizer(splits=[0, 3, 6, 9, 12, 15, 18, 21, 24], inputCol='depart', outputCol='depart_bucket')
bucketed = buckets.transform(flights)
onehot = OneHotEncoderEstimator(inputCols=['depart_bucket'], outputCols=['depart_dummy'])

# One-hot encode the bucketed departure times
flights = onehot.fit(bucketed).transform(bucketed)
#flights.show(5)


# make km, org_dummy and depart_dummy the features
assembler = VectorAssembler(inputCols=['km','org_dummy','depart_dummy'], outputCol='features')

flights = assembler.transform(flights)


In [354]:
flights_train, flights_test = flights.randomSplit([0.8,0.2],seed=17)

In [355]:
# Create a regression object and train on training data
regression = LinearRegression(labelCol='duration').fit(flights_train)

# Create predictions for the testing data
predictions = regression.transform(flights_test)

# Calculate the RMSE on testing data
RegressionEvaluator(labelCol='duration').evaluate(predictions)

# Average minutes on ground at OGG for flights departing between 21:00 and 24:00
avg_eve_ogg = regression.intercept
print(avg_eve_ogg)

# Average minutes on ground at OGG for flights departing between 00:00 and 03:00
avg_night_ogg = regression.intercept + regression.coefficients[8]
print(avg_night_ogg)

# Average minutes on ground at JFK for flights departing between 00:00 and 03:00
avg_night_jfk = regression.intercept + regression.coefficients[3] + regression.coefficients[8]
print(avg_night_jfk)

10.424267279374
-3.8800857990632043
47.84550240132776


Adding departure time resulted in a smaller RMSE.

In [356]:
predictions.select('duration', 'prediction').show(5, False)

+--------+------------------+
|duration|prediction        |
+--------+------------------+
|135     |150.29002569090045|
|120     |129.48581650315236|
|160     |148.57484012682133|
|275     |267.3117109619755 |
|85      |85.85671736253809 |
+--------+------------------+
only showing top 5 rows



## Flight duration model: Regularization

More features will make the model more complicated and difficult to interpret.

Include in the next model:

    km
    org (origin airport, one-hot encoded, 8 levels)
    depart (departure time, binned in 3 hour intervals, one-hot encoded, 8 levels)
    dow (departure day of week, one-hot encoded, 7 levels) and
    mon (departure month, one-hot encoded, 12 levels).

These have been assembled into the features column, which is a sparse representation of 32 columns (remember one-hot encoding produces a number of columns which is one fewer than the number of levels).
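
The count of 32 and the position of each block follow from the level counts listed above. A small sketch of that arithmetic, assuming the blocks are assembled in the order km, org, depart, dow, mon:

```python
# (name, number of levels) for each one-hot encoded category, in assembly order
categories = [('org', 8), ('depart', 8), ('dow', 7), ('mon', 12)]

offsets = {'km': (0, 0)}  # km occupies index 0
next_index = 1
for name, n_levels in categories:
    width = n_levels - 1  # one level is dropped as the reference
    offsets[name] = (next_index, next_index + width - 1)
    next_index += width

print(offsets)
print("total columns:", next_index)
```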

The cells below add the dow and mon dummy variables to flights, reassemble the features column, and randomly split the result into flights_train and flights_test.

In [357]:
#flights = flights.drop('features')
#flights.show()

In [358]:
flights.show(5)

+---+---+---+-------+------+---+------+--------+-----+------+-------+-------------+-------------+-------------+--------------------+
|mon|dom|dow|carrier|flight|org|depart|duration|delay|    km|org_idx|    org_dummy|depart_bucket| depart_dummy|            features|
+---+---+---+-------+------+---+------+--------+-----+------+-------+-------------+-------------+-------------+--------------------+
| 11| 20|  6|     US|    19|JFK|  9.48|     351| null|3465.0|    2.0|(7,[2],[1.0])|          3.0|(7,[3],[1.0])|(15,[0,3,11],[346...|
|  0| 22|  2|     UA|  1107|ORD| 16.33|      82|   30| 509.0|    0.0|(7,[0],[1.0])|          5.0|(7,[5],[1.0])|(15,[0,1,13],[509...|
|  2| 20|  4|     UA|   226|SFO|  6.17|      82|   -8| 542.0|    1.0|(7,[1],[1.0])|          2.0|(7,[2],[1.0])|(15,[0,2,10],[542...|
|  9| 13|  1|     AA|   419|ORD| 10.33|     195|   -5|1989.0|    0.0|(7,[0],[1.0])|          3.0|(7,[3],[1.0])|(15,[0,1,11],[198...|
|  4|  2|  5|     AA|   325|ORD|  8.92|      65| null| 415.0|    0.0|

In [359]:
# Create an indexer
indexer = StringIndexer(inputCol='dow', outputCol='dow_idx')

# Indexer identifies categories in the data
indexer_model = indexer.fit(flights)

# Indexer creates a new column with numeric index values
flights = indexer_model.transform(flights)

# Create an instance of the one hot encoder
onehot = OneHotEncoderEstimator(inputCols=['dow_idx'], outputCols=['dow_dummy'])

# Apply the one hot encoder to the flights data
onehot = onehot.fit(flights)
flights = onehot.transform(flights)

indexer = StringIndexer(inputCol='mon', outputCol='mon_idx')
indexer_model = indexer.fit(flights)
flights = indexer_model.transform(flights)
onehot = OneHotEncoderEstimator(inputCols=['mon_idx'], outputCols=['mon_dummy'])
onehot = onehot.fit(flights)
flights = onehot.transform(flights)

flights = flights.drop('mon_idx','dow_idx','features')

In [360]:
flights.show(5)

+---+---+---+-------+------+---+------+--------+-----+------+-------+-------------+-------------+-------------+-------------+---------------+
|mon|dom|dow|carrier|flight|org|depart|duration|delay|    km|org_idx|    org_dummy|depart_bucket| depart_dummy|    dow_dummy|      mon_dummy|
+---+---+---+-------+------+---+------+--------+-----+------+-------+-------------+-------------+-------------+-------------+---------------+
| 11| 20|  6|     US|    19|JFK|  9.48|     351| null|3465.0|    2.0|(7,[2],[1.0])|          3.0|(7,[3],[1.0])|    (6,[],[])| (11,[6],[1.0])|
|  0| 22|  2|     UA|  1107|ORD| 16.33|      82|   30| 509.0|    0.0|(7,[0],[1.0])|          5.0|(7,[5],[1.0])|(6,[2],[1.0])| (11,[1],[1.0])|
|  2| 20|  4|     UA|   226|SFO|  6.17|      82|   -8| 542.0|    1.0|(7,[1],[1.0])|          2.0|(7,[2],[1.0])|(6,[3],[1.0])| (11,[4],[1.0])|
|  9| 13|  1|     AA|   419|ORD| 10.33|     195|   -5|1989.0|    0.0|(7,[0],[1.0])|          3.0|(7,[3],[1.0])|(6,[1],[1.0])|(11,[10],[1.0])|
|  4| 

In [361]:
# make km and all four one-hot encoded columns the features
assembler = VectorAssembler(inputCols=['km','org_dummy','depart_dummy','dow_dummy','mon_dummy'], outputCol='features')

flights = assembler.transform(flights)

In [362]:
flights.show(5)

+---+---+---+-------+------+---+------+--------+-----+------+-------+-------------+-------------+-------------+-------------+---------------+--------------------+
|mon|dom|dow|carrier|flight|org|depart|duration|delay|    km|org_idx|    org_dummy|depart_bucket| depart_dummy|    dow_dummy|      mon_dummy|            features|
+---+---+---+-------+------+---+------+--------+-----+------+-------+-------------+-------------+-------------+-------------+---------------+--------------------+
| 11| 20|  6|     US|    19|JFK|  9.48|     351| null|3465.0|    2.0|(7,[2],[1.0])|          3.0|(7,[3],[1.0])|    (6,[],[])| (11,[6],[1.0])|(32,[0,3,11,27],[...|
|  0| 22|  2|     UA|  1107|ORD| 16.33|      82|   30| 509.0|    0.0|(7,[0],[1.0])|          5.0|(7,[5],[1.0])|(6,[2],[1.0])| (11,[1],[1.0])|(32,[0,1,13,17,22...|
|  2| 20|  4|     UA|   226|SFO|  6.17|      82|   -8| 542.0|    1.0|(7,[1],[1.0])|          2.0|(7,[2],[1.0])|(6,[3],[1.0])| (11,[4],[1.0])|(32,[0,2,10,18,25...|
|  9| 13|  1|     AA| 

In [365]:
flights.select('features').show(5,truncate=False)

+--------------------------------------------+
|features                                    |
+--------------------------------------------+
|(32,[0,3,11,27],[3465.0,1.0,1.0,1.0])       |
|(32,[0,1,13,17,22],[509.0,1.0,1.0,1.0,1.0]) |
|(32,[0,2,10,18,25],[542.0,1.0,1.0,1.0,1.0]) |
|(32,[0,1,11,16,31],[1989.0,1.0,1.0,1.0,1.0])|
|(32,[0,1,10,15,29],[415.0,1.0,1.0,1.0,1.0]) |
+--------------------------------------------+
only showing top 5 rows



In [366]:
flights_train, flights_test = flights.randomSplit([0.8,0.2],seed=17)

In [371]:
regression = LinearRegression(labelCol='duration').fit(flights_train)

predictions = regression.transform(flights_test)

rmse = RegressionEvaluator(labelCol='duration').evaluate(predictions)

print("The test RMSE is", rmse)

print("\n")
coeffs = regression.coefficients
print(coeffs)

The test RMSE is 10.606545109500756


[0.07439612871049665,27.159411218293794,20.010007367091102,51.75945116860125,45.72336548770983,17.44795681766783,14.962858899771346,17.1106892046316,-14.680435731520726,2.387687080155746,4.039941072205227,6.950825833824756,4.612768532620685,8.839164315549944,8.693816938676932,0.003971629408195554,-0.05855515105425649,-0.32944667683343165,0.07127937393444782,0.26921576598471947,0.09468122491776833,-3.480694045524389,-1.531330936233792,-3.7185459131066634,-1.6568311712213777,-1.5924189448573478,-3.6787427012342597,0.8425869541367698,-2.795911328844872,-3.5786185552697285,-3.38972772264376,-2.221182955823689]


### Use Lasso regression (regularized with an L1 penalty) to create a simpler model

In [379]:
# Fit Lasso model (regParam λ = 1, elasticNetParam α = 1) to training data
regression = LinearRegression(labelCol='duration', regParam=1,elasticNetParam=1).fit(flights_train)

# Calculate the RMSE on testing data
rmse = RegressionEvaluator(labelCol='duration').evaluate(regression.transform(flights_test))
print("The test RMSE is", rmse)

coeffs = regression.coefficients
print(coeffs)

# Number of zero coefficients
zero_coeff = sum([beta==0 for beta in regression.coefficients])
print("Number of coefficients equal to 0:", zero_coeff)

The test RMSE is 11.545792028112292
[0.07350293358100364,5.5828598433558785,0.0,29.09802852173182,22.009764588838063,0.0,-2.2061697985886166,0.0,0.0,0.0,0.0,0.0,0.0,1.117227868678856,1.0815190301505242,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
Number of coefficients equal to 0: 25


Regularisation produced a far simpler model (25 of the 32 coefficients are zero) with similar test performance (RMSE of 11.55 versus 10.61 without regularisation).
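
To see which predictors survive, the Lasso coefficients printed above can be grouped by feature block. This sketch assumes the block layout implied by the assembly order (km at 0, org at 1 to 7, depart at 8 to 14, dow at 15 to 20, mon at 21 to 31).

```python
# Lasso coefficients copied from the output above (the last 17 are all zero)
lasso_coefs = [0.07350293358100364, 5.5828598433558785, 0.0, 29.09802852173182,
               22.009764588838063, 0.0, -2.2061697985886166, 0.0, 0.0, 0.0, 0.0,
               0.0, 0.0, 1.117227868678856, 1.0815190301505242] + [0.0] * 17

# Feature blocks as (name, first index, last index)
blocks = [('km', 0, 0), ('org', 1, 7), ('depart', 8, 14),
          ('dow', 15, 20), ('mon', 21, 31)]

kept_by_block = {}
for name, first, last in blocks:
    kept = [i for i in range(first, last + 1) if lasso_coefs[i] != 0]
    kept_by_block[name] = kept
    print(f"{name}: {len(kept)} of {last - first + 1} coefficients kept, indices {kept}")
```

In this fit every day-of-week and month coefficient is shrunk to zero, while distance and a handful of airport and departure-time terms remain.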