# Classification

Dataset from Kaggle: https://www.kaggle.com/usdot/flight-delays

## Description
The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled, and diverted flights is published in DOT's monthly Air Travel Consumer Report and in this dataset of 2015 flight delays and cancellations.

### Schema
- |-- YEAR: integer (nullable = true) - The year of the flight
- |-- MONTH: integer (nullable = true) - The month of the flight
- |-- DAY: integer (nullable = true) - The day of the flight
- |-- DAY_OF_WEEK: integer (nullable = true) - The day of the week of the flight
- |-- AIRLINE: string (nullable = true) - The airline code of the flight
- |-- FLIGHT_NUMBER: integer (nullable = true) - The flight number of the flight
- |-- TAIL_NUMBER: string (nullable = true) - The tail number of the aircraft
- |-- ORIGIN_AIRPORT: string (nullable = true) - The origin airport code of the flight
- |-- DESTINATION_AIRPORT: string (nullable = true) - The destination airport code of the flight
- |-- SCHEDULED_DEPARTURE: integer (nullable = true) - The scheduled departure time of the flight
- |-- DEPARTURE_TIME: integer (nullable = true) - The actual departure time of the flight
- |-- DEPARTURE_DELAY: integer (nullable = true) - The departure delay of the flight
- |-- TAXI_OUT: integer (nullable = true) - The taxi out time of the flight
- |-- WHEELS_OFF: integer (nullable = true) - The wheels off time of the flight
- |-- SCHEDULED_TIME: integer (nullable = true) - The scheduled time of the flight
- |-- ELAPSED_TIME: integer (nullable = true) - The elapsed time of the flight
- |-- AIR_TIME: integer (nullable = true) - The air time of the flight
- |-- DISTANCE: integer (nullable = true) - The distance between the origin and destination airports
- |-- WHEELS_ON: integer (nullable = true) - The wheels on time of the flight
- |-- TAXI_IN: integer (nullable = true) - The taxi in time of the flight
- |-- SCHEDULED_ARRIVAL: integer (nullable = true) - The scheduled arrival time of the flight
- |-- ARRIVAL_TIME: integer (nullable = true) - The actual arrival time of the flight
- |-- ARRIVAL_DELAY: integer (nullable = true) - The arrival delay of the flight
- |-- DIVERTED: integer (nullable = true) - Whether the flight was diverted or not
- |-- CANCELLED: integer (nullable = true) - Whether the flight was cancelled or not
- |-- CANCELLATION_REASON: string (nullable = true) - The reason the flight was cancelled
- |-- AIR_SYSTEM_DELAY: integer (nullable = true) - The air system delay of the flight
- |-- SECURITY_DELAY: integer (nullable = true) - The security delay of the flight
- |-- AIRLINE_DELAY: integer (nullable = true) - The airline delay of the flight
- |-- LATE_AIRCRAFT_DELAY: integer (nullable = true) - The late aircraft delay of the flight 
- |-- WEATHER_DELAY: integer (nullable = true) - The weather delay of the flight


In [9]:
# download the dataset, extract, and remove compressed version
!kaggle datasets download -d usdot/flight-delays -f "flights.csv"
!unzip flights.csv.zip
!rm flights.csv.zip

Downloading flights.csv.zip to /workspace/pyspark-airline-delay-classification
 90%|████████████████████████████████████▉    | 172M/191M [00:00<00:00, 207MB/s]
100%|█████████████████████████████████████████| 191M/191M [00:00<00:00, 206MB/s]
Archive:  flights.csv.zip
  inflating: flights.csv             


In [10]:
# Initialize PySpark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("airline-delay-analysis").getOrCreate()

In [11]:
# Needed to make Jupyter work with Gitpod
import plotly.io as pio
pio.renderers.default = 'iframe_connected'

In [12]:
# Read the data into a dataframe and print the schema
df = spark.read.csv("flights.csv", header=True, inferSchema=True)
df.printSchema()



root
 |-- YEAR: integer (nullable = true)
 |-- MONTH: integer (nullable = true)
 |-- DAY: integer (nullable = true)
 |-- DAY_OF_WEEK: integer (nullable = true)
 |-- AIRLINE: string (nullable = true)
 |-- FLIGHT_NUMBER: integer (nullable = true)
 |-- TAIL_NUMBER: string (nullable = true)
 |-- ORIGIN_AIRPORT: string (nullable = true)
 |-- DESTINATION_AIRPORT: string (nullable = true)
 |-- SCHEDULED_DEPARTURE: integer (nullable = true)
 |-- DEPARTURE_TIME: integer (nullable = true)
 |-- DEPARTURE_DELAY: integer (nullable = true)
 |-- TAXI_OUT: integer (nullable = true)
 |-- WHEELS_OFF: integer (nullable = true)
 |-- SCHEDULED_TIME: integer (nullable = true)
 |-- ELAPSED_TIME: integer (nullable = true)
 |-- AIR_TIME: integer (nullable = true)
 |-- DISTANCE: integer (nullable = true)
 |-- WHEELS_ON: integer (nullable = true)
 |-- TAXI_IN: integer (nullable = true)
 |-- SCHEDULED_ARRIVAL: integer (nullable = true)
 |-- ARRIVAL_TIME: integer (nullable = true)
 |-- ARRIVAL_DELAY: integer (null

                                                                                

In [13]:
from pyspark.sql.functions import when, col

# Add a column "DELAYED": 0 if DEPARTURE_DELAY is 0, 1 if it is greater than 0,
# and 2 otherwise (less than 0, or null because neither comparison matches)
df = df.withColumn("DELAYED", when(col("DEPARTURE_DELAY") == 0, 0).when(col("DEPARTURE_DELAY") > 0, 1).otherwise(2))
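
The `when`/`otherwise` expression above can be read as a three-way rule. The plain-Python sketch below mirrors it for a single value; the function name `delay_label` is hypothetical and not part of the notebook. It makes one edge case visible: a missing `DEPARTURE_DELAY` matches neither comparison, so it lands in class 2 together with early departures.

```python
from typing import Optional


def delay_label(departure_delay: Optional[int]) -> int:
    """Mirror the DELAYED rule for one value (illustrative helper only)."""
    if departure_delay is not None and departure_delay == 0:
        return 0  # departed exactly on time
    if departure_delay is not None and departure_delay > 0:
        return 1  # departed late
    return 2  # departed early, or the delay is missing (None)


labels = [delay_label(d) for d in (0, 12, -4, None)]
print(labels)  # [0, 1, 2, 2]
```

If missing delays should not count as early departures, one option would be to drop those rows before the label is created; whether that is appropriate depends on what class 2 is meant to represent.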

# Preprocess the data
Drop the columns that we don't need or whose values cannot be known before the flight.

Drop the following columns:
- |-- YEAR: integer (nullable = true) - Not needed since all flights are from the same year
- |-- FLIGHT_NUMBER: integer (nullable = true) - Not needed since it only identifies the scheduled flight
- |-- TAIL_NUMBER: string (nullable = true) - Not needed since it only identifies the individual aircraft
- |-- DEPARTURE_TIME: integer (nullable = true) - Cannot know this before the flight
- |-- DEPARTURE_DELAY: integer (nullable = true) - Cannot know this before the flight
- |-- TAXI_OUT: integer (nullable = true) - Cannot know this before the flight
- |-- WHEELS_OFF: integer (nullable = true) - Cannot know this before the flight
- |-- ELAPSED_TIME: integer (nullable = true) - Cannot know this before the flight
- |-- AIR_TIME: integer (nullable = true) - Cannot know this before the flight
- |-- WHEELS_ON: integer (nullable = true) - Cannot know this before the flight
- |-- TAXI_IN: integer (nullable = true) - Cannot know this before the flight
- |-- ARRIVAL_TIME: integer (nullable = true) - Cannot know this before the flight
- |-- ARRIVAL_DELAY: integer (nullable = true) - Not needed since we will only focus on the departure delay
- |-- DIVERTED: integer (nullable = true) - Cannot know this before the flight
- |-- CANCELLED: integer (nullable = true) - Cannot know this before the flight
- |-- CANCELLATION_REASON: string (nullable = true) - Cannot know this before the flight
- |-- AIR_SYSTEM_DELAY: integer (nullable = true) - Cannot know this before the flight
- |-- SECURITY_DELAY: integer (nullable = true) - Cannot know this before the flight
- |-- AIRLINE_DELAY: integer (nullable = true) - Cannot know this before the flight
- |-- LATE_AIRCRAFT_DELAY: integer (nullable = true) - Cannot know this before the flight
- |-- WEATHER_DELAY: integer (nullable = true) - Cannot know this before the flight

In [14]:
df = df.drop("YEAR", "FLIGHT_NUMBER", "TAIL_NUMBER", "DEPARTURE_TIME", "DEPARTURE_DELAY", "TAXI_OUT", "WHEELS_OFF", "ELAPSED_TIME", "AIR_TIME", "WHEELS_ON", "TAXI_IN", "ARRIVAL_TIME", "ARRIVAL_DELAY", "DIVERTED", "CANCELLED", "CANCELLATION_REASON", "AIR_SYSTEM_DELAY", "SECURITY_DELAY", "AIRLINE_DELAY", "LATE_AIRCRAFT_DELAY", "WEATHER_DELAY")

# Our new schema
# |-- MONTH: integer (nullable = true)
# |-- DAY: integer (nullable = true)
# |-- DAY_OF_WEEK: integer (nullable = true)
# |-- AIRLINE: string (nullable = true)
# |-- ORIGIN_AIRPORT: string (nullable = true)
# |-- DESTINATION_AIRPORT: string (nullable = true)
# |-- SCHEDULED_DEPARTURE: integer (nullable = true)
# |-- SCHEDULED_TIME: integer (nullable = true)
# |-- DISTANCE: integer (nullable = true)
# |-- SCHEDULED_ARRIVAL: integer (nullable = true)
# |-- DELAYED: integer (nullable = false)
df.printSchema()

root
 |-- MONTH: integer (nullable = true)
 |-- DAY: integer (nullable = true)
 |-- DAY_OF_WEEK: integer (nullable = true)
 |-- AIRLINE: string (nullable = true)
 |-- ORIGIN_AIRPORT: string (nullable = true)
 |-- DESTINATION_AIRPORT: string (nullable = true)
 |-- SCHEDULED_DEPARTURE: integer (nullable = true)
 |-- SCHEDULED_TIME: integer (nullable = true)
 |-- DISTANCE: integer (nullable = true)
 |-- SCHEDULED_ARRIVAL: integer (nullable = true)
 |-- DELAYED: integer (nullable = false)



In [15]:
# Remove the rows that have null values
df = df.dropna()

# Print the number of rows
print(df.count())

5819073


                                                                                

# Preparing data for training

We use StringIndexer to convert the categorical columns (AIRLINE, ORIGIN_AIRPORT, DESTINATION_AIRPORT) to numerical indices.

We use VectorAssembler to combine the numerical columns into a single feature vector (required by the model).
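
To make the two steps concrete, here is a minimal plain-Python sketch of the idea. It only imitates the behaviour: indices are assigned by descending frequency with alphabetical tie-breaking, which matches StringIndexer's default `frequencyDesc` ordering, and the "assembler" simply concatenates numbers into a list. The helper name `fit_string_index` and the toy rows are made up for illustration.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to an index, most frequent first (ties alphabetical)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}


# Toy rows (made up for illustration): (MONTH, AIRLINE, DISTANCE)
rows = [(1, "AA", 2475), (1, "DL", 760), (2, "AA", 1235), (3, "WN", 337)]

airline_index = fit_string_index(airline for _, airline, _ in rows)

# "Assemble" the numeric columns plus the encoded airline into one feature list per row
features = [
    [float(month), airline_index[airline], float(distance)]
    for month, airline, distance in rows
]

print(airline_index)  # {'AA': 0.0, 'DL': 1.0, 'WN': 2.0}
print(features[0])    # [1.0, 0.0, 2475.0]
```

The real StringIndexer additionally has a `handleInvalid` parameter that controls what happens when a value appears at prediction time that was not seen during fitting; the sketch would simply raise a `KeyError` in that case.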

In [16]:
# Preparing the data
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler

train, test = df.randomSplit([0.7, 0.3])

# Encode the categorical features using StringIndexer
indexer = StringIndexer(inputCols=["AIRLINE", "ORIGIN_AIRPORT", "DESTINATION_AIRPORT"], outputCols=["AIRLINE_INDEX", "ORIGIN_AIRPORT_INDEX", "DESTINATION_AIRPORT_INDEX"])

# Create the assembler
assembler = VectorAssembler(inputCols=["MONTH", "DAY", "DAY_OF_WEEK", "AIRLINE_INDEX", "ORIGIN_AIRPORT_INDEX", "DESTINATION_AIRPORT_INDEX", "SCHEDULED_DEPARTURE", "SCHEDULED_TIME", "DISTANCE", "SCHEDULED_ARRIVAL"], outputCol="features")

# Get the largest number of distinct values among the categorical columns (used later as maxBins for the tree models)
num_of_origins = df.select("ORIGIN_AIRPORT").distinct().count()
num_of_destinations = df.select("DESTINATION_AIRPORT").distinct().count()
num_of_carriers = df.select("AIRLINE").distinct().count()
max_num_of_categorical_features = max(num_of_origins, num_of_destinations, num_of_carriers)
print("max categories:", max_num_of_categorical_features)

max categories: 629
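
This value matters because Spark's tree learners require `maxBins` to be at least as large as the number of categories of every categorical feature. The same computation on in-memory records looks roughly like the sketch below; the helper `max_cardinality` and the toy records are hypothetical and only illustrate the logic.

```python
def max_cardinality(rows, categorical_columns):
    """Return the largest number of distinct values among the given columns."""
    return max(len({row[c] for row in rows}) for c in categorical_columns)


# Toy records (made up for illustration)
rows = [
    {"AIRLINE": "AA", "ORIGIN_AIRPORT": "LAX", "DESTINATION_AIRPORT": "PBI"},
    {"AIRLINE": "AA", "ORIGIN_AIRPORT": "SFO", "DESTINATION_AIRPORT": "CLT"},
    {"AIRLINE": "DL", "ORIGIN_AIRPORT": "SEA", "DESTINATION_AIRPORT": "CLT"},
]

max_bins = max_cardinality(rows, ["AIRLINE", "ORIGIN_AIRPORT", "DESTINATION_AIRPORT"])
print(max_bins)  # 3 (three distinct origin airports)
```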


                                                                                

# First Classifier: Decision Tree

In [17]:
# Use a Decision Tree Classifier to predict the delay
# Split the data into training and testing sets
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.pipeline import Pipeline


train, test = df.randomSplit([0.7, 0.3])

# Create the classifier
dt = DecisionTreeClassifier(labelCol="DELAYED", featuresCol="features", maxBins=max_num_of_categorical_features)

# Create the pipeline
pipeline = Pipeline(stages=[indexer, assembler, dt])

# Create the grid
# paramGrid = ParamGridBuilder().addGrid(dt.maxDepth, [2, 3, 4, 5, 6, 7, 8, 9, 10]).addGrid(dt.maxBins, [650, 750, 850, 950, 1050, 1150, 1250]).build()

# Create the cross validator
# cv = CrossValidator(estimator=pipeline, evaluator=MulticlassClassificationEvaluator(), estimatorParamMaps=paramGrid, numFolds=3)


# Create the train validation split with the param grid
# tvs = TrainValidationSplit(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=MulticlassClassificationEvaluator(), trainRatio=0.8)

# Create the train validation split without the param grid
# tvs = TrainValidationSplit(estimator=pipeline, evaluator=MulticlassClassificationEvaluator(), trainRatio=0.8)

# Train the model using the pipeline
model_basic = pipeline.fit(train)

# Train the model using TrainValidationSplit
# model = tvs.fit(train)

# Train the model using CrossValidator
# model = cv.fit(train)

# Generate predictions on the test set
predictions_basic = model_basic.transform(test)

# predictions_with_validation = tvs.fit(train).transform(test)



# Create the evaluator (accuracy on the DELAYED label)
evaluator = MulticlassClassificationEvaluator(labelCol="DELAYED", predictionCol="prediction", metricName="accuracy")

# Print the accuracy
print("Accuracy (Decision Tree)", evaluator.evaluate(predictions_basic))

22/05/14 10:19:32 WARN MemoryStore: Not enough space to cache rdd_193_0 in memory! (computed 44.7 MiB so far)
22/05/14 10:19:32 WARN BlockManager: Persisting block rdd_193_0 to disk instead.
22/05/14 10:19:33 WARN MemoryStore: Not enough space to cache rdd_193_1 in memory! (computed 44.7 MiB so far)
22/05/14 10:19:33 WARN BlockManager: Persisting block rdd_193_1 to disk instead.
22/05/14 10:19:33 WARN MemoryStore: Not enough space to cache rdd_193_3 in memory! (computed 44.7 MiB so far)
22/05/14 10:19:33 WARN BlockManager: Persisting block rdd_193_3 to disk instead.
22/05/14 10:19:33 WARN MemoryStore: Not enough space to cache rdd_193_5 in memory! (computed 70.0 MiB so far)
22/05/14 10:19:33 WARN BlockManager: Persisting block rdd_193_5 to disk instead.
22/05/14 10:19:36 WARN MemoryStore: Not enough space to cache rdd_193_0 in memory! (computed 46.3 MiB so far)
22/05/14 10:19:36 WARN MemoryStore: Not enough space to cache rdd_193_3 in memory! (computed 69.4 MiB so far)
22/05/14 10:19:3

Accuracy (Decision Tree) 0.6143093574619173
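
Accuracy on its own is hard to judge without knowing how the three classes are distributed, so a useful reference point is the accuracy of always predicting the most common class. The sketch below shows both computations in plain Python. The labels and predictions are toy values, not results from this notebook, and the function names are hypothetical.

```python
from collections import Counter


def accuracy(labels, predictions):
    """Fraction of predictions that equal the true label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Toy labels and predictions (made up; not results from this notebook)
y_true = [2, 2, 1, 2, 0, 1, 2, 1]
y_pred = [2, 2, 2, 2, 2, 1, 1, 1]

print(accuracy(y_true, y_pred))   # 0.625
print(majority_baseline(y_true))  # 0.5
```

On the real data, the baseline could be obtained by counting rows per `DELAYED` value and dividing the largest count by the total.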


                                                                                

# Second Classifier: Random Forest

In [18]:
# Random forest classifier
from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(labelCol="DELAYED", featuresCol="features", numTrees=10, maxBins=max_num_of_categorical_features)

pipeline = Pipeline(stages=[indexer, assembler, rf])

model = pipeline.fit(train)

predictions = model.transform(test)

# Evaluate the model and print the accuracy
print("Accuracy (Random Forest)", evaluator.evaluate(predictions))

22/05/14 10:20:23 WARN MemoryStore: Not enough space to cache rdd_255_5 in memory! (computed 42.5 MiB so far)
22/05/14 10:20:23 WARN BlockManager: Persisting block rdd_255_5 to disk instead.
22/05/14 10:20:23 WARN MemoryStore: Not enough space to cache rdd_255_4 in memory! (computed 64.5 MiB so far)
22/05/14 10:20:23 WARN BlockManager: Persisting block rdd_255_4 to disk instead.
22/05/14 10:20:23 WARN MemoryStore: Not enough space to cache rdd_255_3 in memory! (computed 42.5 MiB so far)
22/05/14 10:20:23 WARN BlockManager: Persisting block rdd_255_3 to disk instead.
22/05/14 10:20:23 WARN MemoryStore: Not enough space to cache rdd_255_0 in memory! (computed 64.5 MiB so far)
22/05/14 10:20:23 WARN BlockManager: Persisting block rdd_255_0 to disk instead.
22/05/14 10:20:23 WARN MemoryStore: Not enough space to cache rdd_255_2 in memory! (computed 42.5 MiB so far)
22/05/14 10:20:23 WARN BlockManager: Persisting block rdd_255_2 to disk instead.
22/05/14 10:20:23 WARN MemoryStore: Not enoug

Accuracy (Random Forest) 0.6151872637727637
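
The two accuracies are nearly identical, so a per-class breakdown may say more about how the models differ than the single number does. The sketch below computes recall per class in plain Python on toy values; the labels, predictions, and the function name `per_class_recall` are made up for illustration. Recent Spark versions also expose per-label metrics through `MulticlassClassificationEvaluator` (for example `metricName="recallByLabel"` together with `metricLabel`), which would be the natural choice on the full dataset.

```python
from collections import defaultdict


def per_class_recall(labels, predictions):
    """For each true class, the share of its rows that were predicted correctly."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for y, p in zip(labels, predictions):
        totals[y] += 1
        if y == p:
            hits[y] += 1
    return {cls: hits[cls] / totals[cls] for cls in sorted(totals)}


# Toy labels and predictions (made up; not results from this notebook)
y_true = [2, 2, 1, 2, 0, 1, 2, 1]
y_pred = [2, 2, 2, 2, 2, 1, 1, 1]

recall = per_class_recall(y_true, y_pred)
print(recall)  # class 0 is never predicted correctly in this toy example
```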


                                                                                