# Data Preprocessor
## i. Overview
The Data Preprocessor processes the raw data into a simpler dataset for training and prediction. The training data covers 2018 and 2019, while the test data covers 2022. The user will select which test data to predict on.

## ii. Special Notes
1. Since sparse matrices cannot be saved directly to the output files, the Data Preprocessor does not convert the data into a sparse matrix.
2. The output files use the parquet format because parquet files load faster than text formats such as CSV.

## iii. Methodology
### Selected Features

### Transformed Features

### Other Identifying Features


In [1]:
#import libraries
import pandas as pd
import numpy as np

#import classes/functions
from sklearn import preprocessing

#define constants
combinedFlights2018Parquet = "archive/Combined_Flights_2018.parquet"
combinedFlights2019Parquet = "archive/Combined_Flights_2019.parquet"
combinedFlights2022Parquet = "archive/Combined_Flights_2022.parquet"
columns_to_use = [
    'Airline',
    'Origin',
    'Dest',
    'CRSDepTime', 
    'Distance', 
    'Year', 
    'Quarter', 
    'Month', 
    'DayofMonth', 
    'DayOfWeek', 
    'DepTimeBlk', 
    'ArrTimeBlk', 
    'DistanceGroup',
    'ArrDelayMinutes'
    ]

In [2]:
#Define any relevant functions:


#EncodeLabel
#
#This function will accept a dataFrame, and use a label encoder to convert the
#given feature to a set of integers.
#
#dataFrame should be the dataFrame which is going to be changed.
#featureName should be a string indicating the feature(column) to change in the dataFrame.
#
#This function assumes that all observations of the feature are strings, and that
#all empty observations have already been removed from the dataframe (for example, by
#using dropna).
#
#This function does not return anything; the dataFrame passed in is changed in place.
#If printTable is True, this function also prints a reference table of the original
#labels for the feature and their corresponding integer representation.
def EncodeLabel(dataFrame, featureName, printTable = False):
    encoder = preprocessing.LabelEncoder()
    encoder.fit(pd.unique(dataFrame[featureName]))
    dataFrame[featureName] = encoder.transform(dataFrame[featureName])
    if printTable:
        print("Conversion table for feature \"" + featureName + "\":")
        for i, label in enumerate(encoder.classes_):
            print(label + " = " + str(i))
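
As a quick illustration of what EncodeLabel does to a column, the sketch below runs the same `LabelEncoder` steps on a toy frame. The frame `toy` and the extra `OriginCode` column are hypothetical and exist only for this example. `LabelEncoder` stores its classes in sorted order, so the integer assigned to a label is its position in the sorted list of distinct labels.

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical toy data, using airport codes that appear in the Origin table
toy = pd.DataFrame({"Origin": ["ATL", "ABE", "ATL", "ABQ"]})

encoder = preprocessing.LabelEncoder()
# fit_transform learns the sorted distinct labels, then maps each row to its index
toy["OriginCode"] = encoder.fit_transform(toy["Origin"])

# Rebuild the conversion table as a dictionary instead of printing it
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(toy["OriginCode"].tolist())
```

Keeping the mapping as a dictionary makes it possible to look codes up later without reading the printed table.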

In [3]:
#Load data from files

#Load the 2018 and 2019 files directly inside the concat call, which avoids keeping
#two extra dataframes in memory and having to free them later.
trainData = pd.concat([
    pd.read_parquet(combinedFlights2018Parquet, columns = columns_to_use, engine="fastparquet"),
    pd.read_parquet(combinedFlights2019Parquet, columns = columns_to_use, engine="fastparquet"),
], axis = 0)
#load test data
testData = pd.read_parquet(combinedFlights2022Parquet, columns = columns_to_use, engine="fastparquet")

In [4]:
#Clean the data
trainData.dropna(inplace = True)
testData.dropna(inplace = True)
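
Called without arguments, `dropna` removes every row that has a missing value in any column, so it can be useful to check how many rows the cleaning step discards. A minimal sketch on hypothetical toy data (the frame `toy` and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing delay and one missing airline
toy = pd.DataFrame({
    "Airline": ["Cape Air", "Envoy Air", None],
    "ArrDelayMinutes": [12.0, np.nan, 0.0],
})

rows_before = len(toy)
# Drops any row containing at least one missing value, modifying toy in place
toy.dropna(inplace=True)
rows_dropped = rows_before - len(toy)
print(rows_dropped)
```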

In [5]:
#Convert string labels to int labels (EncodeLabel)
#Train:
EncodeLabel(trainData, "Airline", printTable = True)
print("")
EncodeLabel(trainData, "Origin", printTable = True)
print("")
EncodeLabel(trainData, "Dest", printTable = True)
print("")
EncodeLabel(trainData, "DepTimeBlk", printTable = True)
print("")
EncodeLabel(trainData, "ArrTimeBlk", printTable = True)
#Test:
EncodeLabel(testData, "Airline", printTable = False)
EncodeLabel(testData, "Origin", printTable = False)
EncodeLabel(testData, "Dest", printTable = False)
EncodeLabel(testData, "DepTimeBlk", printTable = False)
EncodeLabel(testData, "ArrTimeBlk", printTable = False)

Conversion table for feature "Airline":
Air Wisconsin Airlines Corp = 0
Alaska Airlines Inc. = 1
Allegiant Air = 2
American Airlines Inc. = 3
Cape Air = 4
Capital Cargo International = 5
Comair Inc. = 6
Commutair Aka Champlain Enterprises, Inc. = 7
Compass Airlines = 8
Delta Air Lines Inc. = 9
Empire Airlines Inc. = 10
Endeavor Air Inc. = 11
Envoy Air = 12
ExpressJet Airlines Inc. = 13
Frontier Airlines Inc. = 14
GoJet Airlines, LLC d/b/a United Express = 15
Hawaiian Airlines Inc. = 16
Horizon Air = 17
JetBlue Airways = 18
Mesa Airlines Inc. = 19
Peninsula Airways Inc. = 20
Republic Airlines = 21
SkyWest Airlines Inc. = 22
Southwest Airlines Co. = 23
Spirit Air Lines = 24
Trans States Airlines = 25
United Air Lines Inc. = 26
Virgin America = 27

Conversion table for feature "Origin":
ABE = 0
ABI = 1
ABQ = 2
ABR = 3
ABY = 4
ACK = 5
ACT = 6
ACV = 7
ACY = 8
ADK = 9
ADQ = 10
AEX = 11
AGS = 12
AKN = 13
ALB = 14
ALO = 15
ALW = 16
AMA = 17
ANC = 18
APN = 19
ART = 20
ASE = 21
ATL = 22
ATW = 23

Conversion table for feature "ArrTimeBlk":
0001-0559 = 0
0600-0659 = 1
0700-0759 = 2
0800-0859 = 3
0900-0959 = 4
1000-1059 = 5
1100-1159 = 6
1200-1259 = 7
1300-1359 = 8
1400-1459 = 9
1500-1559 = 10
1600-1659 = 11
1700-1759 = 12
1800-1859 = 13
1900-1959 = 14
2000-2059 = 15
2100-2159 = 16
2200-2259 = 17
2300-2359 = 18
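
One caveat about the tables above: each EncodeLabel call fits a fresh encoder, so the integers assigned in testData are guaranteed to match the training tables only when the test set contains exactly the same labels as the training set. The sketch below shows one possible way to keep the two sets consistent by fitting a single encoder on the labels of both frames. It is not the notebook's method; the helper `encode_shared` and the toy frames are hypothetical.

```python
import pandas as pd
from sklearn import preprocessing

def encode_shared(train, test, feature_name):
    """Encode feature_name in both frames in place, using one shared encoder."""
    encoder = preprocessing.LabelEncoder()
    # Fit on the labels of both frames so every label gets exactly one integer
    encoder.fit(pd.concat([train[feature_name], test[feature_name]]).unique())
    train[feature_name] = encoder.transform(train[feature_name])
    test[feature_name] = encoder.transform(test[feature_name])
    return encoder

# Hypothetical toy frames whose label sets only partly overlap
toy_train = pd.DataFrame({"Airline": ["Virgin America", "Cape Air", "Envoy Air"]})
toy_test = pd.DataFrame({"Airline": ["Envoy Air", "Horizon Air"]})

shared_encoder = encode_shared(toy_train, toy_test, "Airline")
print(list(shared_encoder.classes_))
print(toy_train["Airline"].tolist(), toy_test["Airline"].tolist())
```

With the shared encoder, "Envoy Air" receives the same integer in both frames. Encoding the toy test frame on its own would instead assign it 0, because it would then be the first label in sorted order.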


In [None]:
# save processed data to new files