# Train/Validation/Test Split Notebook - Validation/Test Data Split  

Objective: Split the full dataset into Validation/Test data.  


##### Train data (other notebook):  
3 month data = January and February of 2015    
full data = 2015 through 2017    
Steps:   
A) Drop rows:  
- Where weather readings are suspect or erroneous   
- Where report_types are SY-MT, FM-16, SOD,SOM  
- Where there are nulls  
- Where there are 99999 (missing) data    

B) Bin Data    
C) Use train data to create Airport PageRank helper Table     
D) Join Airport PageRank to final table    
E) Create helper table of averages  grouped by day_of_week and origin for imputing validation data     
F) Create helper table of averages  for imputing validation data (when ORIGIN not in training data)     


 ##### Validation/Test Data (this notebook): 
3 month data = March 2015 (validation)  
full data = 2018 (validation), 2019 (test)    
Steps:  
A) Join PageRank helper table created previously to test data  
B) Imputations  
- Join imputed value table (averages at each ORIGIN) with full data (joins columns with feature averages to full data)  
- For averages with null values, impute with second helper table of averages of all training data (null averages are caused by ORIGIN in validation/test data that are not in train data)   
- Finally, impute the suspect or erroneous, nulls, and 99999 (missing) data in the full table with these averages  

C) Bin Data

In [0]:
#Import packages
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType, NullType, ShortType, DateType, BooleanType, BinaryType, FloatType
from pyspark.sql import SQLContext
from pyspark.sql.functions import isnan, when, count, col, udf, date_trunc, max as max_, sum as sum_, avg as avg_, min as min_
from pyspark.ml.feature import Bucketizer
from pyspark.sql.types import TimestampType
from pyspark.sql.functions import udf, date_trunc, col
from datetime import datetime, timedelta
from pyspark.sql.types import TimestampType
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from pytz import timezone
import pytz

sqlContext = SQLContext(sc)

## Get Joined Data from Shared Folder

In [0]:
## dbutils.fs.mkdirs("dbfs:/mnt/mids-w261/team_25/train_test_data_folder")               #Made Directory in DataBricks, no need to remake
# display(dbutils.fs.ls("dbfs:/mnt/mids-w261/team_25/train_test_data_folder"))

In [0]:
# Read Files and Fix Schema For Processing

#READING PARQUET File from Shared Directory
filename = "flight_weather_data_3m"              # 3 Month Data
joined_data_3m = spark.read.option("header", "true").parquet("dbfs:/mnt/mids-w261/team_25/join_data_folder/"+filename+"/part-00*.parquet")

filename = "flight_weather_data"               # Full data
joined_data = spark.read.option("header", "true").parquet("dbfs:/mnt/mids-w261/team_25/join_data_folder/"+filename+"/part-00*.parquet")

joined_data_3m.display()
print("joined_data_3m Shape:", joined_data_3m.count(), len(joined_data_3m.columns))
joined_data.display()
print("joined_data Shape:", joined_data.count(), len(joined_data.columns))

#Fix Schema
joined_data_3m = joined_data_3m.withColumn('DAY_OF_WEEK',joined_data_3m.DAY_OF_WEEK.cast(IntegerType()) )
joined_data_3m = joined_data_3m.withColumn('FIRST_DEP',joined_data_3m.FIRST_DEP.cast(IntegerType()) )
joined_data_3m = joined_data_3m.withColumn('PREVIOUS_DELAY',joined_data_3m.PREVIOUS_DELAY.cast(IntegerType()) )
joined_data_3m = joined_data_3m.withColumn('CRS_DEP_TIME',joined_data_3m.CRS_DEP_TIME.cast(IntegerType()) )

joined_data = joined_data.withColumn('DAY_OF_WEEK',joined_data.DAY_OF_WEEK.cast(IntegerType()) )
joined_data = joined_data.withColumn('FIRST_DEP',joined_data.FIRST_DEP.cast(IntegerType()) )
joined_data = joined_data.withColumn('PREVIOUS_DELAY',joined_data.PREVIOUS_DELAY.cast(IntegerType()) )
joined_data = joined_data.withColumn('CRS_DEP_TIME',joined_data.CRS_DEP_TIME.cast(IntegerType()) )

YEAR,MONTH,DAY_OF_WEEK,OP_UNIQUE_CARRIER,ORIGIN,ORIGIN_STATE_ABR,DEST,DEST_STATE_ABR,CRS_DEP_TIME,CRS_ELAPSED_TIME,DISTANCE,FIRST_DEP,PREVIOUS_DELAY,DELAY,LATITUDE,LONGITUDE,ELEVATION,REPORT_TYPE,WND_SPEED,WND_SPEED_QUAL,CIG_HEIGHT,CIG_QUAL,VIS_DIST,VIS_DIST_QUAL,VIS_VAR,VIS_VAR_QUAL,TEMP,TEMP_QUAL,DEW_TEMP,DEW_TEMP_QUAL,SLPRESS,SLPRESS_QUAL,DEST_LATITUDE,DEST_LONGITUDE,DEST_ELEVATION,DEST_REPORT_TYPE,DEST_WND_SPEED,DEST_WND_SPEED_QUAL,DEST_CIG_HEIGHT,DEST_CIG_QUAL,DEST_VIS_DIST,DEST_VIS_DIST_QUAL,DEST_VIS_VAR,DEST_VIS_VAR_QUAL,DEST_TEMP,DEST_TEMP_QUAL,DEST_DEW_TEMP,DEST_DEW_TEMP_QUAL,DEST_SLPRESS,DEST_SLPRESS_QUAL,FLIGHTS_PER_DAY
2015,1,4,AA,DFW,TX,CLE,OH,1415,155.0,1021.0,0,0,0,32.8978,-97.0189,170.7,FM-16,31,5,122,5,3219,5,N,5,6,5,0,5,99999,9,41.4057,-81.852,238.0,FM-15,88,5,22000,5,16093,5,N,5,-11,5,-128,5,10169,5,13951
2015,1,4,AA,DFW,TX,DTW,MI,1030,150.0,986.0,0,0,1,32.8978,-97.0189,170.7,FM-16,21,5,396,5,3219,5,N,5,-6,5,-17,5,99999,9,42.2313,-83.3308,192.3,FM-15,88,5,7315,5,16093,5,N,5,-72,5,-133,5,10171,5,13951
2015,1,4,AA,DFW,TX,IND,IN,1625,120.0,761.0,0,0,1,32.8978,-97.0189,170.7,FM-16,26,5,183,5,2414,5,N,5,17,5,6,5,99999,9,39.72517,-86.28168,241.1,FM-15,88,5,7620,5,16093,5,N,5,11,5,-94,5,10190,5,13951
2015,1,4,AA,PHX,AZ,DFW,TX,200,120.0,868.0,0,0,1,33.4277,-112.0038,337.4,SOM,9999,9,99999,9,999999,9,9,9,9999,9,9999,9,99999,9,32.8978,-97.0189,170.7,FM-15,26,5,1372,5,16093,5,N,5,11,5,-50,5,10296,5,13951
2015,1,4,AA,PHX,AZ,DFW,TX,200,120.0,868.0,0,0,1,33.4277,-112.0038,337.4,SOD,9999,9,99999,9,999999,9,9,9,9999,9,9999,9,99999,9,32.8978,-97.0189,170.7,FM-15,26,5,1372,5,16093,5,N,5,11,5,-50,5,10296,5,13951
2015,1,4,AS,ATL,GA,SEA,WA,1810,334.0,2182.0,0,0,1,33.6301,-84.4418,307.8,FM-15,15,5,1981,5,16093,5,N,5,133,5,-28,5,10250,5,47.4444,-122.3138,112.8,FM-15,21,5,22000,5,16093,5,N,5,44,5,-44,5,10263,5,13951
2015,1,4,AS,OGG,HI,BLI,WA,1425,353.0,2681.0,0,0,1,20.89972,-156.42861,15.5,FM-15,41,5,22000,5,16093,5,N,5,239,5,139,5,10122,5,48.79389,-122.53722,45.4,FM-15,15,5,22000,5,16093,5,N,5,39,5,-33,5,10263,5,13951
2015,1,4,AS,SAT,TX,SEA,WA,1830,270.0,1774.0,0,0,0,29.5443,-98.4839,240.5,FM-15,21,5,213,5,16093,5,N,5,50,5,33,5,10191,5,47.4444,-122.3138,112.8,FM-15,21,5,22000,5,16093,5,N,5,50,5,-50,5,10257,5,13951
2015,1,4,AS,SEA,WA,GEG,WA,1220,62.0,224.0,0,0,0,47.4444,-122.3138,112.8,FM-15,0,5,22000,5,16093,5,N,5,0,5,-44,5,10280,5,47.6216,-117.528,717.2,FM-15,0,5,853,5,12875,5,N,5,-106,5,-122,5,10335,5,13951
2015,1,4,B6,FLL,FL,BOS,MA,1342,188.0,1237.0,0,1,1,26.07875,-80.16217,3.4,FM-15,0,5,1829,5,16093,5,N,5,250,5,206,5,10225,5,42.3606,-71.0097,3.7,FM-15,82,5,22000,5,16093,5,N,5,-11,5,-150,5,10175,5,13951


YEAR,MONTH,DAY_OF_WEEK,OP_UNIQUE_CARRIER,ORIGIN,ORIGIN_STATE_ABR,DEST,DEST_STATE_ABR,CRS_DEP_TIME,CRS_ELAPSED_TIME,DISTANCE,FIRST_DEP,PREVIOUS_DELAY,DELAY,LATITUDE,LONGITUDE,ELEVATION,REPORT_TYPE,WND_SPEED,WND_SPEED_QUAL,CIG_HEIGHT,CIG_QUAL,VIS_DIST,VIS_DIST_QUAL,VIS_VAR,VIS_VAR_QUAL,TEMP,TEMP_QUAL,DEW_TEMP,DEW_TEMP_QUAL,SLPRESS,SLPRESS_QUAL,DEST_LATITUDE,DEST_LONGITUDE,DEST_ELEVATION,DEST_REPORT_TYPE,DEST_WND_SPEED,DEST_WND_SPEED_QUAL,DEST_CIG_HEIGHT,DEST_CIG_QUAL,DEST_VIS_DIST,DEST_VIS_DIST_QUAL,DEST_VIS_VAR,DEST_VIS_VAR_QUAL,DEST_TEMP,DEST_TEMP_QUAL,DEST_DEW_TEMP,DEST_DEW_TEMP_QUAL,DEST_SLPRESS,DEST_SLPRESS_QUAL,FLIGHTS_PER_DAY
2015,5,5,AA,DFW,TX,AUS,TX,2020,67.0,190.0,0,0,0,32.8978,-97.0189,170.7,FM-15,26,5,22000,5,16093,5,N,5,244,5,122,5,10152,5,30.1831,-97.6799,146.3,FM-15,31,5,22000,5,16093,5,N,5,267,5,111,5,10153,5,16927
2015,5,5,AA,DFW,TX,ORD,IL,700,146.0,802.0,0,0,0,32.8978,-97.0189,170.7,FM-15,36,5,22000,5,16093,5,N,5,161,5,106,5,10152,5,41.995,-87.9336,201.8,FM-15,26,5,22000,5,16093,5,N,5,33,5,11,5,10203,5,16927
2015,5,5,AA,DFW,TX,SFO,CA,905,233.0,1464.0,0,0,0,32.8978,-97.0189,170.7,FM-15,0,5,22000,5,16093,5,N,5,133,5,100,5,10165,5,37.6197,-122.3647,2.4,FM-15,26,5,22000,5,16093,5,N,5,144,5,89,5,10092,5,16927
2015,5,5,AA,JFK,NY,BOS,MA,1629,80.0,187.0,0,0,0,40.6386,-73.7622,3.4,FM-15,57,5,1097,5,16093,5,N,5,117,5,44,5,10154,5,42.3606,-71.0097,3.7,FM-15,57,5,7620,5,16093,5,N,5,89,5,39,5,10162,5,16927
2015,5,5,AA,LAX,CA,LAS,NV,1640,70.0,236.0,0,0,0,33.938,-118.3888,29.6,FM-15,57,5,5486,5,14484,5,N,5,194,5,128,5,10125,5,36.0719,-115.1634,664.5,FM-15,77,5,7620,5,16093,5,N,5,344,5,-61,5,10071,5,16927
2015,5,5,AA,LGA,NY,MIA,FL,1950,189.0,1096.0,0,0,0,40.7792,-73.88,3.4,FM-15,51,5,7620,5,16093,5,N,5,133,5,44,5,10136,5,25.7881,-80.3169,8.8,FM-15,41,5,22000,5,16093,5,N,5,289,5,94,5,10111,5,16927
2015,5,5,AA,MCO,FL,ORD,IL,1928,175.0,1005.0,0,0,0,28.4339,-81.325,27.4,FM-15,62,5,22000,5,16093,5,N,5,283,5,106,5,10109,5,41.995,-87.9336,201.8,FM-15,41,5,7620,5,16093,5,N,5,178,5,44,5,10193,5,16927
2015,5,5,AA,ORD,IL,DFW,TX,1325,154.0,802.0,0,1,1,41.995,-87.9336,201.8,FM-15,21,5,22000,5,16093,5,N,5,150,5,39,5,10221,5,32.8978,-97.0189,170.7,FM-15,46,5,22000,5,16093,5,N,5,228,5,139,5,10184,5,16927
2015,5,5,AA,ORD,IL,DFW,TX,900,163.0,802.0,0,0,0,41.995,-87.9336,201.8,FM-15,31,5,22000,5,16093,5,N,5,67,5,28,5,10214,5,32.8978,-97.0189,170.7,FM-15,0,5,22000,5,16093,5,N,5,133,5,100,5,10165,5,16927
2015,5,5,AS,JNU,AK,ANC,AK,1945,96.0,571.0,0,0,0,58.3566,-134.564,4.9,FM-15,57,5,1676,5,16093,5,N,5,94,5,50,5,10155,5,61.169,-150.0278,36.6,FM-15,31,5,7620,5,16093,5,N,5,122,5,6,5,10137,5,16927


## Split Data: 

###### Train data:  
3 month data = 1st and 2nd month  
full data = 2015 through 2017  

###### Validation/Test Data: 
3 month data = 3rd month (validation)
full data = 2018 (validation), 2019 (test)

In [0]:
train_3m        = joined_data_3m.where("MONTH < 3")
valid_test_3m   = joined_data_3m.where("MONTH > 2")
            
train           = joined_data.where("YEAR < 2018")
valid_test      = joined_data.where("YEAR > 2017")

#Store Data         
dbutils.fs.rm("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/train_3m", True)      #remove file if there already is an existing one, be careful with this!!!
dbutils.fs.rm("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/valid_test_3m", True)      #remove file if there already is an existing one, be careful with this!!!
dbutils.fs.rm("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/train", True)      #remove file if there already is an existing one, be careful with this!!!
dbutils.fs.rm("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/valid_test", True)      #remove file if there already is an existing one, be careful with this!!!

train_3m.write.parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/train_3m")  
valid_test_3m.write.parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/valid_test_3m")                                    
train.write.parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/train")                                    
valid_test.write.parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/valid_test")

In [0]:
#Read Data
train_3m = spark.read.option("header", "true").parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/train_3m/part-00*.parquet")
valid_test_3m = spark.read.option("header", "true").parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/valid_test_3m/part-00*.parquet")
print("3-Month Train/Validation/Test:", train_3m.count(), valid_test_3m.count())

train = spark.read.option("header", "true").parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/train/part-00*.parquet")
valid_test = spark.read.option("header", "true").parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/valid_test/part-00*.parquet")
print("Full Data Train/Validation/Test:", train.count(), valid_test.count())

# A) Add Page Rank
Join PageRank helper table to test data. Helper table created in train split notebook: https://dbc-c4580dc0-018b.cloud.databricks.com/?o=8229810859276230#notebook/2158640876511176/command/3450569410566968

In [0]:
#Read helper table Data
filename = "airportPR_ordered_df"                      
airportPR_ordered_df = spark.read.option("header", "true").parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/"+filename+"/part-00*.parquet")
# airportPR_ordered_df.display()
airportPR_ordered_df.registerTempTable("airportPR_ordered_df_tt")
unique_pr_airports = airportPR_ordered_df.count()   #This is total of unique airports in pagerank calculated from training data

#Read Data
valid_test_3m.registerTempTable("valid_test_3m_tt") 
valid_test.registerTempTable("valid_test_tt") 

In [0]:
#Join Data

# 3 MONTH DATA
valid_test_3m_pr = spark.sql("""SELECT * 
                                FROM valid_test_3m_tt t1
                                LEFT JOIN airportPR_ordered_df_tt t2
                                ON (t1.ORIGIN = t2.original)
                                """).drop("original")

#SAVING Spark Dataframe to Shared Directory
file_to_store = valid_test_3m_pr                        #CHANGE THIS: name of Spark Dataframe (to save in database)
filename = "valid_test_3m_pr"                      #CHANGE THIS: new file name in database
dbutils.fs.rm("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/"+filename, True)      #remove file if there already is an existing one, be careful with this!!!
file_to_store.write.parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/" + filename)

############################################

# FULL DATA
valid_test_pr = spark.sql("""SELECT * 
                                FROM valid_test_tt t1
                                LEFT JOIN airportPR_ordered_df_tt t2
                                ON (t1.ORIGIN = t2.original)
                                """)

#SAVING Spark Dataframe to Shared Directory
file_to_store = valid_test_pr                        #CHANGE THIS: name of Spark Dataframe (to save in database)
filename = "valid_test_pr"                      #CHANGE THIS: new file name in database
dbutils.fs.rm("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/"+filename, True)      #remove file if there already is an existing one, be careful with this!!!
file_to_store.write.parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/" + filename)

In [0]:
#Read Data
filename = "valid_test_3m_pr"                      
valid_test_3m_pr = spark.read.option("header", "true").parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/"+filename+"/part-00*.parquet")
# valid_test_3m_pr.display()
# print("valid_test_3m_pr Shape:", valid_test_3m_pr.count(), len(valid_test_3m_pr.columns))

############################################

filename = "valid_test_pr"                      
valid_test_pr = spark.read.option("header", "true").parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/"+filename+"/part-00*.parquet")
# valid_test_pr.display()
# print("valid_test_pr:", valid_test_pr.count(), len(valid_test_pr.columns))

# B) Imputations

#### Join imputed value table (averages at each ORIGIN) with full data (joins columns with feature averages to full data)

In [0]:
#Read Imputed Value helper Table from training notebook: 

# 3 MONTH DATA
train_3m_groupby = spark.read.option("header", "true").parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/train_3m_groupby/part-00*.parquet")
# train_3m_groupby.display()

#FULL DATA
train_groupby = spark.read.option("header", "true").parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/train_groupby/part-00*.parquet")
# train_groupby.display()

In [0]:
#Inner Join train_data averages 

# 3 MONTH DATA
valid_test_3m_pr.registerTempTable("valid_test_3m_pr_tt")
train_3m_groupby.registerTempTable("train_3m_groupby_tt")

valid_test_3m_train_3m_groupby = spark.sql("""SELECT t1.*, t2.* 
             FROM(
                  (SELECT * FROM valid_test_3m_pr_tt) t1
                  LEFT JOIN
                  (SELECT * FROM train_3m_groupby_tt) t2
                  ON t1.DAY_OF_WEEK = t2.DOW AND t1.ORIGIN = t2.O
                  )
             """).drop("DOW","O")

#Store Data         
dbutils.fs.rm("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/valid_test_3m_train_3m_groupby", True)      #remove file if there already is an existing one, be careful with this!!!
valid_test_3m_train_3m_groupby.write.parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/valid_test_3m_train_3m_groupby")  

#################################

# FULL DATA
valid_test_pr.registerTempTable("valid_test_pr_tt")
train_groupby.registerTempTable("train_groupby_tt")

valid_test_train_groupby = spark.sql("""SELECT t1.*, t2.* 
             FROM(
                  (SELECT * FROM valid_test_pr_tt) t1
                  LEFT JOIN
                  (SELECT * FROM train_groupby_tt) t2
                  ON t1.DAY_OF_WEEK = t2.MO AND t1.ORIGIN = t2.O
                  )
             """).drop("MO","O")

#Store Data         
dbutils.fs.rm("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/valid_test_train_groupby", True)      #remove file if there already is an existing one, be careful with this!!!
valid_test_train_groupby.write.parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/valid_test_train_groupby")  

#### For averages with null values, impute with second helper table of averages of all training data (null averages are caused by ORIGIN in validation/test data that are not in train data)

In [0]:
#Read Data

# 3 MONTH DATA
valid_test_3m_train_3m_groupby = spark.read.option("header", "true").parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/valid_test_3m_train_3m_groupby/part-00*.parquet")
# valid_test_3m_train_3m_groupby.display()

# FULL DATA
valid_test_train_groupby = spark.read.option("header", "true").parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/valid_test_train_groupby/part-00*.parquet")
# valid_test_3m_train_3m_groupby.display()

In [0]:
# Read in Avg Lookup Table
avg_lookup = spark.read.option("header", "true").parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/avg_lookup/part-00*.parquet")
avg_lookup.display()

#Impute null average readings

#3 MONTH DATA
valid_test_3m_train_3m_groupby1=valid_test_3m_train_3m_groupby.na.fill(value=avg_lookup.select("avg_DAY_OF_WEEKa").first()[0],subset=['avg_DAY_OF_WEEK'])
valid_test_3m_train_3m_groupby1=valid_test_3m_train_3m_groupby1.na.fill(value=avg_lookup.select("avg_CRS_DEP_TIMEa").first()[0],subset=['avg_CRS_DEP_TIME'])
valid_test_3m_train_3m_groupby1=valid_test_3m_train_3m_groupby1.na.fill(value=avg_lookup.select("avg_CRS_ELAPSED_TIMEa").first()[0],subset=['avg_CRS_ELAPSED_TIME'])
valid_test_3m_train_3m_groupby1=valid_test_3m_train_3m_groupby1.na.fill(value=avg_lookup.select("avg_DISTANCEa").first()[0],subset=['avg_DISTANCE'])
valid_test_3m_train_3m_groupby1=valid_test_3m_train_3m_groupby1.na.fill(value=avg_lookup.select("min_FIRST_DEPa").first()[0],subset=['min_FIRST_DEP'])
valid_test_3m_train_3m_groupby1=valid_test_3m_train_3m_groupby1.na.fill(value=avg_lookup.select("min_PREVIOUS_DELAYa").first()[0],subset=['min_PREVIOUS_DELAY'])
valid_test_3m_train_3m_groupby1=valid_test_3m_train_3m_groupby1.na.fill(value=avg_lookup.select("avg_LATITUDEa").first()[0],subset=['avg_LATITUDE'])
valid_test_3m_train_3m_groupby1=valid_test_3m_train_3m_groupby1.na.fill(value=avg_lookup.select("avg_LONGITUDEa").first()[0],subset=['avg_LONGITUDE'])
valid_test_3m_train_3m_groupby1=valid_test_3m_train_3m_groupby1.na.fill(value=avg_lookup.select("avg_ELEVATIONa").first()[0],subset=['avg_ELEVATION'])
valid_test_3m_train_3m_groupby1=valid_test_3m_train_3m_groupby1.na.fill(value=avg_lookup.select("avg_WND_SPEEDa").first()[0],subset=['avg_WND_SPEED'])
valid_test_3m_train_3m_groupby1=valid_test_3m_train_3m_groupby1.na.fill(value=avg_lookup.select("avg_CIG_HEIGHTa").first()[0],subset=['avg_CIG_HEIGHT'])
valid_test_3m_train_3m_groupby1=valid_test_3m_train_3m_groupby1.na.fill(value=avg_lookup.select("avg_VIS_DISTa").first()[0],subset=['avg_VIS_DIST'])
valid_test_3m_train_3m_groupby1=valid_test_3m_train_3m_groupby1.na.fill(value=avg_lookup.select("avg_TEMPa").first()[0],subset=['avg_TEMP'])
valid_test_3m_train_3m_groupby1=valid_test_3m_train_3m_groupby1.na.fill(value=avg_lookup.select("avg_DEW_TEMPa").first()[0],subset=['avg_DEW_TEMP'])
valid_test_3m_train_3m_groupby1=valid_test_3m_train_3m_groupby1.na.fill(value=avg_lookup.select("avg_SLPRESSa").first()[0],subset=['avg_SLPRESS'])

valid_test_3m_train_3m_groupby1=valid_test_3m_train_3m_groupby1.na.fill(value=avg_lookup.select("avg_DEST_LATITUDEa").first()[0],subset=['avg_DEST_LATITUDE'])
valid_test_3m_train_3m_groupby1=valid_test_3m_train_3m_groupby1.na.fill(value=avg_lookup.select("avg_DEST_LONGITUDEa").first()[0],subset=['avg_DEST_LONGITUDE'])
valid_test_3m_train_3m_groupby1=valid_test_3m_train_3m_groupby1.na.fill(value=avg_lookup.select("avg_DEST_ELEVATIONa").first()[0],subset=['avg_DEST_ELEVATION'])
valid_test_3m_train_3m_groupby1=valid_test_3m_train_3m_groupby1.na.fill(value=avg_lookup.select("avg_DEST_WND_SPEEDa").first()[0],subset=['avg_DEST_WND_SPEED'])
valid_test_3m_train_3m_groupby1=valid_test_3m_train_3m_groupby1.na.fill(value=avg_lookup.select("avg_DEST_CIG_HEIGHTa").first()[0],subset=['avg_DEST_CIG_HEIGHT'])
valid_test_3m_train_3m_groupby1=valid_test_3m_train_3m_groupby1.na.fill(value=avg_lookup.select("avg_DEST_VIS_DISTa").first()[0],subset=['avg_DEST_VIS_DIST'])
valid_test_3m_train_3m_groupby1=valid_test_3m_train_3m_groupby1.na.fill(value=avg_lookup.select("avg_DEST_TEMPa").first()[0],subset=['avg_DEST_TEMP'])
valid_test_3m_train_3m_groupby1=valid_test_3m_train_3m_groupby1.na.fill(value=avg_lookup.select("avg_DEST_DEW_TEMPa").first()[0],subset=['avg_DEST_DEW_TEMP'])
valid_test_3m_train_3m_groupby1=valid_test_3m_train_3m_groupby1.na.fill(value=avg_lookup.select("avg_DEST_SLPRESSa").first()[0],subset=['avg_DEST_SLPRESS'])
#Store Data         
dbutils.fs.rm("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/valid_test_3m_train_3m_groupby1", True)      #remove file if there already is an existing one, be careful with this!!!
valid_test_3m_train_3m_groupby1.write.parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/valid_test_3m_train_3m_groupby1") 


#FULL DATA
valid_test_train_groupby1=valid_test_train_groupby.na.fill(value=avg_lookup.select("avg_DAY_OF_WEEKa").first()[0],subset=['avg_DAY_OF_WEEK'])
valid_test_train_groupby1=valid_test_train_groupby1.na.fill(value=avg_lookup.select("avg_CRS_DEP_TIMEa").first()[0],subset=['avg_CRS_DEP_TIME'])
valid_test_train_groupby1=valid_test_train_groupby1.na.fill(value=avg_lookup.select("avg_CRS_ELAPSED_TIMEa").first()[0],subset=['avg_CRS_ELAPSED_TIME'])
valid_test_train_groupby1=valid_test_train_groupby1.na.fill(value=avg_lookup.select("avg_DISTANCEa").first()[0],subset=['avg_DISTANCE'])
valid_test_train_groupby1=valid_test_train_groupby1.na.fill(value=avg_lookup.select("min_FIRST_DEPa").first()[0],subset=['min_FIRST_DEP'])
valid_test_train_groupby1=valid_test_train_groupby1.na.fill(value=avg_lookup.select("min_PREVIOUS_DELAYa").first()[0],subset=['min_PREVIOUS_DELAY'])
valid_test_train_groupby1=valid_test_train_groupby1.na.fill(value=avg_lookup.select("avg_LATITUDEa").first()[0],subset=['avg_LATITUDE'])
valid_test_train_groupby1=valid_test_train_groupby1.na.fill(value=avg_lookup.select("avg_LONGITUDEa").first()[0],subset=['avg_LONGITUDE'])
valid_test_train_groupby1=valid_test_train_groupby1.na.fill(value=avg_lookup.select("avg_ELEVATIONa").first()[0],subset=['avg_ELEVATION'])
valid_test_train_groupby1=valid_test_train_groupby1.na.fill(value=avg_lookup.select("avg_WND_SPEEDa").first()[0],subset=['avg_WND_SPEED'])
valid_test_train_groupby1=valid_test_train_groupby1.na.fill(value=avg_lookup.select("avg_CIG_HEIGHTa").first()[0],subset=['avg_CIG_HEIGHT'])
valid_test_train_groupby1=valid_test_train_groupby1.na.fill(value=avg_lookup.select("avg_VIS_DISTa").first()[0],subset=['avg_VIS_DIST'])
valid_test_train_groupby1=valid_test_train_groupby1.na.fill(value=avg_lookup.select("avg_TEMPa").first()[0],subset=['avg_TEMP'])
valid_test_train_groupby1=valid_test_train_groupby1.na.fill(value=avg_lookup.select("avg_DEW_TEMPa").first()[0],subset=['avg_DEW_TEMP'])
valid_test_train_groupby1=valid_test_train_groupby1.na.fill(value=avg_lookup.select("avg_SLPRESSa").first()[0],subset=['avg_SLPRESS'])

valid_test_train_groupby1=valid_test_train_groupby1.na.fill(value=avg_lookup.select("avg_DEST_LATITUDEa").first()[0],subset=['avg_DEST_LATITUDE'])
valid_test_train_groupby1=valid_test_train_groupby1.na.fill(value=avg_lookup.select("avg_DEST_LONGITUDEa").first()[0],subset=['avg_DEST_LONGITUDE'])
valid_test_train_groupby1=valid_test_train_groupby1.na.fill(value=avg_lookup.select("avg_DEST_ELEVATIONa").first()[0],subset=['avg_DEST_ELEVATION'])
valid_test_train_groupby1=valid_test_train_groupby1.na.fill(value=avg_lookup.select("avg_DEST_WND_SPEEDa").first()[0],subset=['avg_DEST_WND_SPEED'])
valid_test_train_groupby1=valid_test_train_groupby1.na.fill(value=avg_lookup.select("avg_DEST_CIG_HEIGHTa").first()[0],subset=['avg_DEST_CIG_HEIGHT'])
valid_test_train_groupby1=valid_test_train_groupby1.na.fill(value=avg_lookup.select("avg_DEST_VIS_DISTa").first()[0],subset=['avg_DEST_VIS_DIST'])
valid_test_train_groupby1=valid_test_train_groupby1.na.fill(value=avg_lookup.select("avg_DEST_TEMPa").first()[0],subset=['avg_DEST_TEMP'])
valid_test_train_groupby1=valid_test_train_groupby1.na.fill(value=avg_lookup.select("avg_DEST_DEW_TEMPa").first()[0],subset=['avg_DEST_DEW_TEMP'])
valid_test_train_groupby1=valid_test_train_groupby1.na.fill(value=avg_lookup.select("avg_DEST_SLPRESSa").first()[0],subset=['avg_DEST_SLPRESS'])
#Store Data         
dbutils.fs.rm("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/valid_test_train_groupby1", True)      #remove file if there already is an existing one, be careful with this!!!
valid_test_train_groupby1.write.parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/valid_test_train_groupby1") 

avg_DAY_OF_WEEKa,avg_CRS_DEP_TIMEa,avg_CRS_ELAPSED_TIMEa,avg_DISTANCEa,min_FIRST_DEPa,min_PREVIOUS_DELAYa,avg_LATITUDEa,avg_LONGITUDEa,avg_ELEVATIONa,avg_WND_SPEEDa,avg_CIG_HEIGHTa,avg_VIS_DISTa,avg_TEMPa,avg_DEW_TEMPa,avg_SLPRESSa,avg_DEST_LATITUDEa,avg_DEST_LONGITUDEa,avg_DEST_ELEVATIONa,avg_DEST_WND_SPEEDa,avg_DEST_CIG_HEIGHTa,avg_DEST_VIS_DISTa,avg_DEST_TEMPa,avg_DEST_DEW_TEMPa,avg_DEST_SLPRESSa
3.942869869943395,1336.1695987819908,145.05387984885908,846.0186153945617,0,0,36.66500090109792,-96.0239866126912,253.3793025421035,38.581142655405,13082.74633962568,15341.070194213306,173.06952671824973,84.81701658819989,10165.1129698825,36.66381468722042,-96.05805652637844,253.06262291705207,38.49919856590109,13135.569034051388,15351.528574847283,173.61805946460262,85.14554648449783,10165.283138336996


#### Finally, impute the suspect or erroneous, nulls, and 99999 (missing) data in the full table with these averages

In [0]:
#Read Data

# 3 MONTH DATA
valid_test_3m_train_3m_groupby1 = spark.read.option("header", "true").parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/valid_test_3m_train_3m_groupby1/part-00*.parquet")

# FULL DATA
valid_test_train_groupby1 = spark.read.option("header", "true").parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/valid_test_train_groupby1/part-00*.parquet")

In [0]:
#Second, impute suspect or erroneous, nulls, and 99999's readings.

#For airports missing from PageRank helper function, impute nulls as 1/N (N is number of airports in pagerank calculated from training)
unique_pr_airports = airportPR_ordered_df.count()   #This is total of unique airports in pagerank calculated from training data
#1/N = 0.0030211480362537764

# 3 MONTH DATA
# valid_test_3m_pr.describe().display()
#columns: [YEAR, MONTH, DAY_OF_WEEK, OP_UNIQUE_CARRIER, ORIGIN, ORIGIN_STATE_ABR, DEST, DEST_STATE_ABR, CRS_DEP_TIME, CRS_ELAPSED_TIME, DISTANCE, FIRST_DEP, PREVIOUS_DELAY, FLIGHTS_PER_DAY, DELAY, LATITUDE, LONGITUDE, ELEVATION, REPORT_TYPE, WND_SPEED, WND_SPEED_QUAL, CIG_HEIGHT, CIG_QUAL, VIS_DIST, VIS_DIST_QUAL, VIS_VAR, VIS_VAR_QUAL, TEMP, TEMP_QUAL, DEW_TEMP, DEW_TEMP_QUAL, SLPRESS, SLPRESS_QUAL, avg_DAY_OF_WEEK, avg_CRS_DEP_TIME, avg_CRS_ELAPSED_TIME, avg_DISTANCE, min_FIRST_DEP, min_PREVIOUS_DELAY, avg_LATITUDE, avg_LONGITUDE, avg_ELEVATION, avg_WND_SPEED, avg_CIG_HEIGHT, avg_VIS_DIST, avg_TEMP, avg_DEW_TEMP, avg_SLPRESS]

valid_test_3m_train_3m_groupby1.registerTempTable("valid_test_3m_train_3m_groupby1_tt")

valid_test_3m_imputed=spark.sql("""SELECT YEAR, MONTH, DAY_OF_WEEK, OP_UNIQUE_CARRIER, ORIGIN, ORIGIN_STATE_ABR, DEST, DEST_STATE_ABR, FLIGHTS_PER_DAY, DELAY, REPORT_TYPE, WND_SPEED_QUAL, CIG_QUAL, VIS_DIST_QUAL, VIS_VAR, VIS_VAR_QUAL, TEMP_QUAL, DEW_TEMP_QUAL, SLPRESS_QUAL, DEST_REPORT_TYPE, DEST_WND_SPEED_QUAL, DEST_CIG_QUAL, DEST_VIS_DIST_QUAL, DEST_VIS_VAR, DEST_VIS_VAR_QUAL, DEST_TEMP_QUAL, DEST_DEW_TEMP_QUAL, DEST_SLPRESS_QUAL, avg_DAY_OF_WEEK, avg_CRS_DEP_TIME, avg_CRS_ELAPSED_TIME, avg_DISTANCE, min_FIRST_DEP, min_PREVIOUS_DELAY, avg_LATITUDE, avg_LONGITUDE, avg_ELEVATION, avg_WND_SPEED, avg_CIG_HEIGHT, avg_VIS_DIST, avg_TEMP, avg_DEW_TEMP, avg_SLPRESS, avg_DEST_LATITUDE, avg_DEST_LONGITUDE, avg_DEST_ELEVATION, avg_DEST_WND_SPEED, avg_DEST_CIG_HEIGHT, avg_DEST_VIS_DIST, avg_DEST_TEMP, avg_DEST_DEW_TEMP, avg_DEST_SLPRESS, 
              
              case when CRS_DEP_TIME IS NULL then avg_CRS_DEP_TIME else CRS_DEP_TIME end as CRS_DEP_TIME,
              case when CRS_ELAPSED_TIME IS NULL then avg_CRS_ELAPSED_TIME else CRS_ELAPSED_TIME end as CRS_ELAPSED_TIME,
              case when DISTANCE IS NULL then avg_DISTANCE else DISTANCE end as DISTANCE,
              case when FIRST_DEP IS NULL then min_FIRST_DEP else FIRST_DEP end as FIRST_DEP,
              case when PREVIOUS_DELAY IS NULL then min_PREVIOUS_DELAY else PREVIOUS_DELAY end as PREVIOUS_DELAY,
              case when LATITUDE IS NULL then avg_LATITUDE else LATITUDE end as LATITUDE,
              case when LONGITUDE IS NULL then avg_LONGITUDE else LONGITUDE end as LONGITUDE,
              case when ELEVATION IS NULL then avg_ELEVATION else ELEVATION end as ELEVATION,
              
              case when WND_SPEED IS NULL or WND_SPEED=9999 or WND_SPEED_QUAL=2 or WND_SPEED_QUAL=3 or WND_SPEED_QUAL=6 or WND_SPEED_QUAL=7 
              then avg_WND_SPEED else WND_SPEED end as WND_SPEED,
              
              case when CIG_HEIGHT IS NULL or CIG_HEIGHT=99999 or CIG_QUAL=2 or CIG_QUAL=3 or CIG_QUAL=6 or CIG_QUAL=7
              then avg_CIG_HEIGHT else CIG_HEIGHT end as CIG_HEIGHT,
              
              case when VIS_DIST IS NULL or VIS_DIST=999999 or VIS_DIST_QUAL=2 or VIS_DIST_QUAL=3 or VIS_DIST_QUAL=6 or VIS_DIST_QUAL=7
              then avg_VIS_DIST else VIS_DIST end as VIS_DIST,
              
              case when TEMP IS NULL or trim(TEMP)=9999 or TEMP_QUAL=2 or TEMP_QUAL=3 or TEMP_QUAL=6 or TEMP_QUAL=7
              then avg_TEMP else TEMP end as TEMP,
              
              case when DEW_TEMP IS NULL or DEW_TEMP=9999 or DEW_TEMP_QUAL=2 or DEW_TEMP_QUAL=3 or DEW_TEMP_QUAL=6 or DEW_TEMP_QUAL=7
              then avg_DEW_TEMP else DEW_TEMP end as DEW_TEMP,
              
              case when SLPRESS IS NULL or SLPRESS=99999 or SLPRESS_QUAL=2 or SLPRESS_QUAL=3 or SLPRESS_QUAL=6 or SLPRESS_QUAL=7
              then avg_SLPRESS else SLPRESS end as SLPRESS,
              
              case when PAGERANK IS NULL then 0.0030211480362537764 else PAGERANK end as PAGERANK,


              case when DEST_LATITUDE IS NULL then avg_DEST_LATITUDE else DEST_LATITUDE end as DEST_LATITUDE,
              case when DEST_LONGITUDE IS NULL then avg_DEST_LONGITUDE else DEST_LONGITUDE end as DEST_LONGITUDE,
              case when DEST_ELEVATION IS NULL then avg_DEST_ELEVATION else DEST_ELEVATION end as DEST_ELEVATION,
              
              case when DEST_WND_SPEED IS NULL or DEST_WND_SPEED=9999 or DEST_WND_SPEED_QUAL=2 or DEST_WND_SPEED_QUAL=3 or DEST_WND_SPEED_QUAL=6 or DEST_WND_SPEED_QUAL=7 
              then avg_DEST_WND_SPEED else DEST_WND_SPEED end as DEST_WND_SPEED,
              
              case when DEST_CIG_HEIGHT IS NULL or DEST_CIG_HEIGHT=99999 or DEST_CIG_QUAL=2 or DEST_CIG_QUAL=3 or DEST_CIG_QUAL=6 or DEST_CIG_QUAL=7
              then avg_DEST_CIG_HEIGHT else DEST_CIG_HEIGHT end as DEST_CIG_HEIGHT,
              
              case when DEST_VIS_DIST IS NULL or DEST_VIS_DIST=999999 or DEST_VIS_DIST_QUAL=2 or DEST_VIS_DIST_QUAL=3 or DEST_VIS_DIST_QUAL=6 or DEST_VIS_DIST_QUAL=7
              then avg_DEST_VIS_DIST else DEST_VIS_DIST end as DEST_VIS_DIST,
              
              case when DEST_TEMP IS NULL or DEST_TEMP=9999 or DEST_TEMP_QUAL=2 or DEST_TEMP_QUAL=3 or DEST_TEMP_QUAL=6 or DEST_TEMP_QUAL=7
              then avg_DEST_TEMP else DEST_TEMP end as DEST_TEMP,
              
              case when DEST_DEW_TEMP IS NULL or DEST_DEW_TEMP=9999 or DEST_DEW_TEMP_QUAL=2 or DEST_DEW_TEMP_QUAL=3 or DEST_DEW_TEMP_QUAL=6 or DEST_DEW_TEMP_QUAL=7
              then avg_DEST_DEW_TEMP else DEST_DEW_TEMP end as DEST_DEW_TEMP,
              
              case when DEST_SLPRESS IS NULL or DEST_SLPRESS=99999 or DEST_SLPRESS_QUAL=2 or DEST_SLPRESS_QUAL=3 or DEST_SLPRESS_QUAL=6 or DEST_SLPRESS_QUAL=7
              then avg_DEST_SLPRESS else DEST_SLPRESS end as DEST_SLPRESS   
             
             FROM valid_test_3m_train_3m_groupby1_tt
              """)

#Store Data         
dbutils.fs.rm("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/valid_test_3m_imputed", True)      #remove file if there already is an existing one, be careful with this!!!
valid_test_3m_imputed.write.parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/valid_test_3m_imputed")  

#######################################

# FULL DATA

#columns: [YEAR, MONTH, DAY_OF_WEEK, OP_UNIQUE_CARRIER, ORIGIN, ORIGIN_STATE_ABR, DEST, DEST_STATE_ABR, CRS_DEP_TIME, CRS_ELAPSED_TIME, DISTANCE, FIRST_DEP, PREVIOUS_DELAY, DELAY, LATITUDE, LONGITUDE, ELEVATION, REPORT_TYPE, WND_SPEED, WND_SPEED_QUAL, CIG_HEIGHT, CIG_QUAL, VIS_DIST, VIS_DIST_QUAL, VIS_VAR, VIS_VAR_QUAL, TEMP, TEMP_QUAL, DEW_TEMP, DEW_TEMP_QUAL, SLPRESS, SLPRESS_QUAL, avg_DAY_OF_WEEK, avg_CRS_DEP_TIME, avg_CRS_ELAPSED_TIME, avg_DISTANCE, min_FIRST_DEP, min_PREVIOUS_DELAY, avg_LATITUDE, avg_LONGITUDE, avg_ELEVATION, avg_WND_SPEED, avg_CIG_HEIGHT, avg_VIS_DIST, avg_TEMP, avg_DEW_TEMP, avg_SLPRESS]

valid_test_train_groupby1.registerTempTable("valid_test_train_groupby1_tt")

valid_test_imputed=spark.sql("""SELECT YEAR, MONTH, DAY_OF_WEEK, OP_UNIQUE_CARRIER, ORIGIN, ORIGIN_STATE_ABR, DEST, DEST_STATE_ABR, FLIGHTS_PER_DAY, DELAY, REPORT_TYPE, WND_SPEED_QUAL, CIG_QUAL, VIS_DIST_QUAL, VIS_VAR, VIS_VAR_QUAL, TEMP_QUAL, DEW_TEMP_QUAL, SLPRESS_QUAL, DEST_REPORT_TYPE, DEST_WND_SPEED_QUAL, DEST_CIG_QUAL, DEST_VIS_DIST_QUAL, DEST_VIS_VAR, DEST_VIS_VAR_QUAL, DEST_TEMP_QUAL, DEST_DEW_TEMP_QUAL, DEST_SLPRESS_QUAL, avg_DAY_OF_WEEK, avg_CRS_DEP_TIME, avg_CRS_ELAPSED_TIME, avg_DISTANCE, min_FIRST_DEP, min_PREVIOUS_DELAY, avg_LATITUDE, avg_LONGITUDE, avg_ELEVATION, avg_WND_SPEED, avg_CIG_HEIGHT, avg_VIS_DIST, avg_TEMP, avg_DEW_TEMP, avg_SLPRESS, avg_DEST_LATITUDE, avg_DEST_LONGITUDE, avg_DEST_ELEVATION, avg_DEST_WND_SPEED, avg_DEST_CIG_HEIGHT, avg_DEST_VIS_DIST, avg_DEST_TEMP, avg_DEST_DEW_TEMP, avg_DEST_SLPRESS,  
              
              case when CRS_DEP_TIME IS NULL then avg_CRS_DEP_TIME else CRS_DEP_TIME end as CRS_DEP_TIME,
              case when CRS_ELAPSED_TIME IS NULL then avg_CRS_ELAPSED_TIME else CRS_ELAPSED_TIME end as CRS_ELAPSED_TIME,
              case when DISTANCE IS NULL then avg_DISTANCE else DISTANCE end as DISTANCE,
              case when FIRST_DEP IS NULL then min_FIRST_DEP else FIRST_DEP end as FIRST_DEP,
              case when PREVIOUS_DELAY IS NULL then min_PREVIOUS_DELAY else PREVIOUS_DELAY end as PREVIOUS_DELAY,
              case when LATITUDE IS NULL then avg_LATITUDE else LATITUDE end as LATITUDE,
              case when LONGITUDE IS NULL then avg_LONGITUDE else LONGITUDE end as LONGITUDE,
              case when ELEVATION IS NULL then avg_ELEVATION else ELEVATION end as ELEVATION,
              
              case when WND_SPEED IS NULL or WND_SPEED=9999 or WND_SPEED_QUAL=2 or WND_SPEED_QUAL=3 or WND_SPEED_QUAL=6 or WND_SPEED_QUAL=7 
              then avg_WND_SPEED else WND_SPEED end as WND_SPEED,
              
              case when CIG_HEIGHT IS NULL or CIG_HEIGHT=99999 or CIG_QUAL=2 or CIG_QUAL=3 or CIG_QUAL=6 or CIG_QUAL=7
              then avg_CIG_HEIGHT else CIG_HEIGHT end as CIG_HEIGHT,
              
              case when VIS_DIST IS NULL or VIS_DIST=999999 or VIS_DIST_QUAL=2 or VIS_DIST_QUAL=3 or VIS_DIST_QUAL=6 or VIS_DIST_QUAL=7
              then avg_VIS_DIST else VIS_DIST end as VIS_DIST,
              
              case when (trim(TEMP) IS NULL or trim(TEMP)=9999 or TEMP_QUAL=2 or TEMP_QUAL=3 or TEMP_QUAL=6 or TEMP_QUAL=7)
              then avg_TEMP else TEMP end as TEMP,
              
              case when DEW_TEMP IS NULL or DEW_TEMP=9999 or DEW_TEMP_QUAL=2 or DEW_TEMP_QUAL=3 or DEW_TEMP_QUAL=6 or DEW_TEMP_QUAL=7
              then avg_DEW_TEMP else DEW_TEMP end as DEW_TEMP,
              
              case when SLPRESS IS NULL or SLPRESS=99999 or SLPRESS_QUAL=2 or SLPRESS_QUAL=3 or SLPRESS_QUAL=6 or SLPRESS_QUAL=7
              then avg_SLPRESS else SLPRESS end as SLPRESS,
              
              case when PAGERANK IS NULL then 0.0030211480362537764 else PAGERANK end as PAGERANK,


              case when DEST_LATITUDE IS NULL then avg_DEST_LATITUDE else DEST_LATITUDE end as DEST_LATITUDE,
              case when DEST_LONGITUDE IS NULL then avg_DEST_LONGITUDE else DEST_LONGITUDE end as DEST_LONGITUDE,
              case when DEST_ELEVATION IS NULL then avg_DEST_ELEVATION else DEST_ELEVATION end as DEST_ELEVATION,
              
              case when DEST_WND_SPEED IS NULL or DEST_WND_SPEED=9999 or DEST_WND_SPEED_QUAL=2 or DEST_WND_SPEED_QUAL=3 or DEST_WND_SPEED_QUAL=6 or DEST_WND_SPEED_QUAL=7 
              then avg_DEST_WND_SPEED else DEST_WND_SPEED end as DEST_WND_SPEED,
              
              case when DEST_CIG_HEIGHT IS NULL or DEST_CIG_HEIGHT=99999 or DEST_CIG_QUAL=2 or DEST_CIG_QUAL=3 or DEST_CIG_QUAL=6 or DEST_CIG_QUAL=7
              then avg_DEST_CIG_HEIGHT else DEST_CIG_HEIGHT end as DEST_CIG_HEIGHT,
              
              case when DEST_VIS_DIST IS NULL or DEST_VIS_DIST=999999 or DEST_VIS_DIST_QUAL=2 or DEST_VIS_DIST_QUAL=3 or DEST_VIS_DIST_QUAL=6 or DEST_VIS_DIST_QUAL=7
              then avg_DEST_VIS_DIST else DEST_VIS_DIST end as DEST_VIS_DIST,
              
              case when DEST_TEMP IS NULL or DEST_TEMP=9999 or DEST_TEMP_QUAL=2 or DEST_TEMP_QUAL=3 or DEST_TEMP_QUAL=6 or DEST_TEMP_QUAL=7
              then avg_DEST_TEMP else DEST_TEMP end as DEST_TEMP,
              
              case when DEST_DEW_TEMP IS NULL or DEST_DEW_TEMP=9999 or DEST_DEW_TEMP_QUAL=2 or DEST_DEW_TEMP_QUAL=3 or DEST_DEW_TEMP_QUAL=6 or DEST_DEW_TEMP_QUAL=7
              then avg_DEST_DEW_TEMP else DEST_DEW_TEMP end as DEST_DEW_TEMP,
              
              case when DEST_SLPRESS IS NULL or DEST_SLPRESS=9 or DEST_SLPRESS_QUAL=2 or DEST_SLPRESS_QUAL=3 or DEST_SLPRESS_QUAL=6 or DEST_SLPRESS_QUAL=7
              then avg_DEST_SLPRESS else DEST_SLPRESS end as DEST_SLPRESS   


             FROM valid_test_train_groupby1_tt
              """)


#Store Data         
dbutils.fs.rm("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/valid_test_imputed", True)      #remove file if there already is an existing one, be careful with this!!!
valid_test_imputed.write.parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/valid_test_imputed")  

In [0]:
#Read Data

# 3 MONTH DATA
valid_test_3m_imputed = spark.read.option("header", "true").parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/valid_test_3m_imputed/part-00*.parquet")
valid_test_3m_imputed.display()

# FULL DATA
valid_test_imputed = spark.read.option("header", "true").parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/valid_test_imputed/part-00*.parquet")
# valid_test_imputed.display()

YEAR,MONTH,DAY_OF_WEEK,OP_UNIQUE_CARRIER,ORIGIN,ORIGIN_STATE_ABR,DEST,DEST_STATE_ABR,FLIGHTS_PER_DAY,DELAY,REPORT_TYPE,WND_SPEED_QUAL,CIG_QUAL,VIS_DIST_QUAL,VIS_VAR,VIS_VAR_QUAL,TEMP_QUAL,DEW_TEMP_QUAL,SLPRESS_QUAL,DEST_REPORT_TYPE,DEST_WND_SPEED_QUAL,DEST_CIG_QUAL,DEST_VIS_DIST_QUAL,DEST_VIS_VAR,DEST_VIS_VAR_QUAL,DEST_TEMP_QUAL,DEST_DEW_TEMP_QUAL,DEST_SLPRESS_QUAL,avg_DAY_OF_WEEK,avg_CRS_DEP_TIME,avg_CRS_ELAPSED_TIME,avg_DISTANCE,min_FIRST_DEP,min_PREVIOUS_DELAY,avg_LATITUDE,avg_LONGITUDE,avg_ELEVATION,avg_WND_SPEED,avg_CIG_HEIGHT,avg_VIS_DIST,avg_TEMP,avg_DEW_TEMP,avg_SLPRESS,avg_DEST_LATITUDE,avg_DEST_LONGITUDE,avg_DEST_ELEVATION,avg_DEST_WND_SPEED,avg_DEST_CIG_HEIGHT,avg_DEST_VIS_DIST,avg_DEST_TEMP,avg_DEST_DEW_TEMP,avg_DEST_SLPRESS,CRS_DEP_TIME,CRS_ELAPSED_TIME,DISTANCE,FIRST_DEP,PREVIOUS_DELAY,LATITUDE,LONGITUDE,ELEVATION,WND_SPEED,CIG_HEIGHT,VIS_DIST,TEMP,DEW_TEMP,SLPRESS,PAGERANK,DEST_LATITUDE,DEST_LONGITUDE,DEST_ELEVATION,DEST_WND_SPEED,DEST_CIG_HEIGHT,DEST_VIS_DIST,DEST_TEMP,DEST_DEW_TEMP,DEST_SLPRESS
2015,3,7,AA,DFW,TX,ORD,IL,15189,1,FM-16,5,5,5,N,5,5,5,9,FM-15,5,5,5,N,5,5,5,5,7.0,1430.4825870646766,127.28897180762851,768.8994610281924,0,0,32.897800000000046,-97.01889999999992,170.7000000000003,58.82815091210614,13971.243366500828,14818.195895522387,106.98072139303484,24.2363184079602,10194.98673300166,35.096326797263686,-95.155985068408,289.974295190713,41.45563847429519,12014.406094527363,14616.968905472637,86.27446102819238,7.633706467661692,10195.79995854063,950.0,133.0,802.0,0,0,32.8978,-97.0189,170.7,36.0,91.0,2414.0,0.0,-6.0,10194.98673300166,0.0381707360832717,41.995,-87.9336,201.8,0.0,152.0,3219.0,-83.0,-100.0,10290.0
2015,3,7,AA,IAH,TX,MIA,FL,15189,0,FM-15,5,5,A,N,A,5,5,5,FM-15,5,5,5,N,5,5,5,5,7.0,1434.5062454077884,138.92652461425422,818.929463629684,0,0,29.98000000000001,-95.35999999999991,29.0,34.958486407053634,10235.256061719325,13748.89052167524,155.54665686994858,87.81520940484937,10196.4360764144,34.712629426891986,-94.77601407053636,293.8718221895664,41.812637766348274,11866.793901542984,14622.159808963996,98.89088905216752,18.53857457751653,10193.839456282143,951.0,139.0,964.0,0,0,29.98,-95.36,29.0,31.0,91.0,402.0,117.0,117.0,10226.0,0.0223240124155635,25.7881,-80.3169,8.8,41.0,22000.0,16093.0,239.0,206.0,10260.0
2015,3,7,AA,JFK,NY,MCO,FL,15189,1,FM-15,5,5,5,N,5,5,5,5,FM-15,5,5,5,N,5,5,5,5,7.0,1340.8753959873284,246.4841605068638,1455.621964097149,0,0,40.63860000000001,-73.76220000000005,3.4000000000000017,53.3442449841605,6005.921858500528,13989.043294614572,12.94350580781415,-55.46832101372756,10169.729672650476,33.386253004223875,-94.3661244561774,128.53104540654698,34.051214361140445,13106.37434002112,14823.50580781415,124.61932418162618,56.21066525871172,10194.171594508976,1459.0,183.0,944.0,0,0,40.6386,-73.7622,3.4,51.0,366.0,1207.0,-33.0,-67.0,10351.0,0.0145890329261653,28.4339,-81.325,27.4,41.0,1463.0,16093.0,256.0,206.0,10268.0
2015,3,7,AA,LAX,CA,DFW,TX,15189,1,FM-15,5,5,5,N,5,5,5,5,FM-16,5,5,5,N,5,5,5,9,7.0,1376.8017484489565,171.33248730964468,1163.6872532430907,0,0,33.938000000000045,-118.38879999999992,29.599999999999955,23.71714608009024,14597.55527354766,15065.7983643542,158.76931754089114,88.84715172024816,10184.896503102089,36.15198998025946,-106.06744339537512,300.2130851663847,36.3970671178793,11706.18358714044,14387.303440496336,100.2969543147208,21.683587140439933,10191.536661026508,1144.0,181.0,1235.0,0,1,33.938,-118.3888,29.6,21.0,22000.0,16093.0,128.0,83.0,10140.0,0.0308713278344312,32.8978,-97.0189,170.7,46.0,183.0,8047.0,17.0,11.0,10191.536661026508
2015,3,7,AA,LAX,CA,SFO,CA,15189,0,FM-15,5,5,5,N,5,5,5,5,FM-15,5,5,5,N,5,5,5,5,7.0,1376.8017484489565,171.33248730964468,1163.6872532430907,0,0,33.938000000000045,-118.38879999999992,29.599999999999955,23.71714608009024,14597.55527354766,15065.7983643542,158.76931754089114,88.84715172024816,10184.896503102089,36.15198998025946,-106.06744339537512,300.2130851663847,36.3970671178793,11706.18358714044,14387.303440496336,100.2969543147208,21.683587140439933,10191.536661026508,1620.0,80.0,337.0,0,0,33.938,-118.3888,29.6,57.0,1463.0,16093.0,161.0,83.0,10118.0,0.0308713278344312,37.6197,-122.3647,2.4,21.0,22000.0,16093.0,172.0,39.0,10139.0
2015,3,7,AA,LGA,NY,DFW,TX,15189,1,FM-15,5,5,5,N,5,5,5,5,FM-16,5,5,5,N,5,5,5,9,7.0,1387.642010163749,165.3331451157538,802.5505364201016,0,0,40.77919999999996,-73.88000000000005,3.4000000000000017,53.15866741953698,6048.884810841332,12897.010728402032,5.899491812535291,-61.387351778656125,10168.94466403162,34.99564884246189,-84.1914811857708,165.33698475437606,44.5115753811406,10332.941840767928,14225.095990965556,75.2394127611519,3.846979107848673,10200.56747600226,730.0,251.0,1389.0,0,0,40.7792,-73.88,3.4,0.0,22000.0,16093.0,-44.0,-122.0,10390.0,0.0141252218790335,32.8978,-97.0189,170.7,36.0,183.0,4828.0,0.0,-6.0,10200.56747600226
2015,3,7,AA,MIA,FL,SFO,CA,15189,0,FM-15,5,5,5,N,5,5,5,5,FM-15,5,5,5,N,5,5,5,5,7.0,1432.3077407174324,176.47891755821271,1048.1227186910007,0,0,25.78809999999999,-80.31689999999999,8.800000000000004,38.82693517935809,15380.200755191943,16093.0,215.0824417872876,139.63436123348018,10210.094398993077,35.514524449339206,-84.76186118313403,137.73373190685962,42.10320956576463,10302.60918816866,14287.264317180618,76.11516677155444,5.94587791063562,10189.396475770924,1840.0,393.0,2585.0,0,0,25.7881,-80.3169,8.8,67.0,22000.0,16093.0,272.0,189.0,10242.0,0.0102115569589181,37.6197,-122.3647,2.4,0.0,22000.0,16093.0,161.0,22.0,10146.0
2015,3,7,AA,STL,MO,DFW,TX,15189,1,FM-15,5,5,5,N,5,5,5,5,FM-15,5,5,5,N,5,5,5,5,7.0,1361.8914956011731,131.19354838709677,742.1378299120234,0,0,38.7525,-90.37359999999995,161.80000000000007,49.37243401759531,6141.00439882698,14486.136363636364,25.10850439882698,-52.53519061583578,10199.271260997068,36.09212784457478,-91.68502130498534,226.73489736070383,41.759530791788855,11156.052785923754,14215.926686217008,71.60263929618769,-2.846041055718475,10196.16275659824,800.0,115.0,550.0,0,0,38.7525,-90.3736,161.8,0.0,853.0,3219.0,-28.0,-44.0,10298.0,0.0076554971275557,32.8978,-97.0189,170.7,41.0,91.0,3219.0,0.0,-6.0,10265.0
2015,3,7,AS,OAK,CA,OGG,HI,15189,0,FM-15,5,5,5,N,5,5,5,5,FM-16,5,5,5,N,5,5,5,9,7.0,1369.551246537396,112.78947368421052,670.7770083102494,0,0,37.72139,-122.22082999999998,1.8000000000000005,25.58448753462604,12039.03324099723,13557.652354570637,121.59279778393352,68.45013850415512,10196.427977839336,36.28005385041551,-116.4538350138504,341.74445983379513,27.12465373961219,13381.404432132964,14701.058171745151,130.66897506925207,36.322714681440445,10193.124653739613,720.0,339.0,2349.0,0,0,37.72139,-122.22083,1.8,36.0,22000.0,16093.0,83.0,61.0,10141.0,0.0069694060144943,20.89972,-156.42861,15.5,31.0,427.0,2816.0,210.0,200.0,10193.124653739613
2015,3,7,B6,FLL,FL,DCA,VA,15189,1,FM-15,5,5,5,N,5,5,5,5,FM-15,5,5,5,N,5,5,5,5,7.0,1369.5995009357457,177.06674984404242,1075.5514660012475,0,0,26.07875000000001,-80.16217000000005,3.400000000000001,48.64566437928883,16461.85090455396,15967.515907673112,207.07735495945104,140.7174048658765,10206.929507174047,36.87132376793512,-83.6296637679351,134.91216469120403,42.41297567061759,9429.285090455396,13984.67186525265,54.23144104803494,-13.48845913911416,10184.066126013724,2030.0,149.0,899.0,0,0,26.07875,-80.16217,3.4,57.0,22000.0,16093.0,239.0,206.0,10243.0,0.0121748835826635,38.8472,-77.03454,3.0,36.0,244.0,8047.0,6.0,-17.0,10274.0


In [0]:
valid_test_3m_imputed.where("DEST_TEMP = '9999'").display()

YEAR,MONTH,DAY_OF_WEEK,OP_UNIQUE_CARRIER,ORIGIN,ORIGIN_STATE_ABR,DEST,DEST_STATE_ABR,FLIGHTS_PER_DAY,DELAY,REPORT_TYPE,WND_SPEED_QUAL,CIG_QUAL,VIS_DIST_QUAL,VIS_VAR,VIS_VAR_QUAL,TEMP_QUAL,DEW_TEMP_QUAL,SLPRESS_QUAL,DEST_REPORT_TYPE,DEST_WND_SPEED_QUAL,DEST_CIG_QUAL,DEST_VIS_DIST_QUAL,DEST_VIS_VAR,DEST_VIS_VAR_QUAL,DEST_TEMP_QUAL,DEST_DEW_TEMP_QUAL,DEST_SLPRESS_QUAL,avg_DAY_OF_WEEK,avg_CRS_DEP_TIME,avg_CRS_ELAPSED_TIME,avg_DISTANCE,min_FIRST_DEP,min_PREVIOUS_DELAY,avg_LATITUDE,avg_LONGITUDE,avg_ELEVATION,avg_WND_SPEED,avg_CIG_HEIGHT,avg_VIS_DIST,avg_TEMP,avg_DEW_TEMP,avg_SLPRESS,avg_DEST_LATITUDE,avg_DEST_LONGITUDE,avg_DEST_ELEVATION,avg_DEST_WND_SPEED,avg_DEST_CIG_HEIGHT,avg_DEST_VIS_DIST,avg_DEST_TEMP,avg_DEST_DEW_TEMP,avg_DEST_SLPRESS,CRS_DEP_TIME,CRS_ELAPSED_TIME,DISTANCE,FIRST_DEP,PREVIOUS_DELAY,LATITUDE,LONGITUDE,ELEVATION,WND_SPEED,CIG_HEIGHT,VIS_DIST,TEMP,DEW_TEMP,SLPRESS,PAGERANK,DEST_LATITUDE,DEST_LONGITUDE,DEST_ELEVATION,DEST_WND_SPEED,DEST_CIG_HEIGHT,DEST_VIS_DIST,DEST_TEMP,DEST_DEW_TEMP,DEST_SLPRESS


In [0]:
#Finally, keep only columns used for next step(s).

# 3 MONTH DATA
valid_test_3m_imputed=valid_test_3m_imputed.select('YEAR', 'MONTH', 'DAY_OF_WEEK', 'OP_UNIQUE_CARRIER', 'ORIGIN', 'ORIGIN_STATE_ABR', 'DEST', 'DEST_STATE_ABR', 'FLIGHTS_PER_DAY', 'DELAY', 'REPORT_TYPE', 'CRS_DEP_TIME', 'CRS_ELAPSED_TIME', 'DISTANCE', 'FIRST_DEP', 'PREVIOUS_DELAY', 'LATITUDE', 'LONGITUDE', 'ELEVATION', 'WND_SPEED', 'CIG_HEIGHT', 'VIS_DIST', 'VIS_VAR', 'TEMP', 'DEW_TEMP', 'SLPRESS', 'PAGERANK', 'DEST_LATITUDE', 'DEST_LONGITUDE', 'DEST_ELEVATION', 'DEST_WND_SPEED', 'DEST_CIG_HEIGHT', 'DEST_VIS_DIST', 'DEST_VIS_VAR', 'DEST_TEMP', 'DEST_DEW_TEMP', 'DEST_SLPRESS')
# print(valid_test_3m_imputed.columns)

# FULL DATA
valid_test_imputed=valid_test_imputed.select('YEAR', 'MONTH', 'DAY_OF_WEEK', 'OP_UNIQUE_CARRIER', 'ORIGIN', 'ORIGIN_STATE_ABR', 'DEST', 'DEST_STATE_ABR', 'FLIGHTS_PER_DAY', 'DELAY', 'REPORT_TYPE', 'CRS_DEP_TIME', 'CRS_ELAPSED_TIME', 'DISTANCE', 'FIRST_DEP', 'PREVIOUS_DELAY', 'LATITUDE', 'LONGITUDE', 'ELEVATION', 'WND_SPEED', 'CIG_HEIGHT', 'VIS_DIST', 'VIS_VAR', 'TEMP', 'DEW_TEMP', 'SLPRESS', 'PAGERANK', 'DEST_LATITUDE', 'DEST_LONGITUDE', 'DEST_ELEVATION', 'DEST_WND_SPEED', 'DEST_CIG_HEIGHT', 'DEST_VIS_DIST', 'DEST_VIS_VAR', 'DEST_TEMP', 'DEST_DEW_TEMP', 'DEST_SLPRESS')
# print(valid_test_imputed.columns)

# C) Binning

Add code for creating bins for these features:

AIRLINE DATA:

Departure time (* Should be based on local time - this binning needs to occur before UTC)
* 0 - 1:59
* 2 - 3:59
* 4 - 5:59
* 6 - 7:59
* 8 - 9:59
* 10 - 11:59
* 12 - 13:59 
* 14 - 15:59 
* 16 - 17:59
* 18 - 19:59
* 20 - 21:59
* 22 - 23:59


WEATHER DATA:

Sky condition (CIG_HEIGHT)
* 0 - 21999 = Limited vertical visibility
* 22000 = Unlimited veritcal visibility

Wind Speed (based on Beaufort Scale: https://en.wikipedia.org/wiki/Beaufort_scale)
* <15 = Calm to light breeze
* <33 = Gentle breeze
* <107 = Moderate to fresh breeze
* \>=138 = Strong breeze to gale/storm

Visibility Distance
* \>= 16093 (>= 10 miles) --- most values are exactly 16093, likely indicating max visibility
* < 16093 (< 10 miles) --- group all other values below together

In [0]:
#####################################
# 3 Month Data

# change data to whatever the processed and joined dataframe is called
data_3m = valid_test_3m_imputed.withColumn('CRS_DEP_TIME',valid_test_3m_imputed.CRS_DEP_TIME.cast(FloatType()) )

#Convert departure time to bins
dep_time_buck_3m = Bucketizer(splits=[ 0, 200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000, 2200, float('Inf') ],inputCol="CRS_DEP_TIME", outputCol="CRS_DEP_TIME_BUCK")
air_buck_3m = dep_time_buck_3m.setHandleInvalid("keep").transform(data_3m)

#Convert wind, vis, sky to bins
wind_buck_3m = Bucketizer(splits=[ 0, 15, 33, 107, float('Inf') ],inputCol="WND_SPEED", outputCol="WND_SPEED_BUCK")
vis_buck_3m = Bucketizer(splits=[ 0, 16092, float('Inf') ],inputCol="VIS_DIST", outputCol="VIS_DIST_BUCK")
sky_buck_3m = Bucketizer(splits=[ 0, 21999, float('Inf') ],inputCol="CIG_HEIGHT", outputCol="CIG_HEIGHT_BUCK")
w_buck_3m = wind_buck_3m.setHandleInvalid("keep").transform(air_buck_3m)   
w_buck_3m = vis_buck_3m.setHandleInvalid("keep").transform(w_buck_3m)
w_buck_3m = sky_buck_3m.setHandleInvalid("keep").transform(w_buck_3m)

wind_buck_3m_dw = Bucketizer(splits=[ 0, 15, 33, 107, float('Inf') ],inputCol="DEST_WND_SPEED", outputCol="DEST_WND_SPEED_BUCK")
vis_buck_3m_dw = Bucketizer(splits=[ 0, 16092, float('Inf') ],inputCol="DEST_VIS_DIST", outputCol="DEST_VIS_DIST_BUCK")
sky_buck_3m_dw = Bucketizer(splits=[ 0, 21999, float('Inf') ],inputCol="DEST_CIG_HEIGHT", outputCol="DEST_CIG_HEIGHT_BUCK")
w_buck_3m = wind_buck_3m_dw.setHandleInvalid("keep").transform(w_buck_3m)   
w_buck_3m = vis_buck_3m_dw.setHandleInvalid("keep").transform(w_buck_3m)
w_buck_3m = sky_buck_3m_dw.setHandleInvalid("keep").transform(w_buck_3m)

# drop numeric columns (keep only features needed for modeling)
valid_test_data_3m = w_buck_3m.select('YEAR', 'MONTH', 'DAY_OF_WEEK', 'OP_UNIQUE_CARRIER', 'ORIGIN', 'ORIGIN_STATE_ABR', 'DEST', 'DEST_STATE_ABR', 'CRS_DEP_TIME_BUCK', 'CRS_ELAPSED_TIME', 'DISTANCE', 'FIRST_DEP', 'PREVIOUS_DELAY', 'FLIGHTS_PER_DAY', 'LATITUDE', 'LONGITUDE', 'ELEVATION', 'WND_SPEED_BUCK', 'CIG_HEIGHT_BUCK', 'VIS_DIST_BUCK', 'VIS_VAR', 'TEMP', 'DEW_TEMP', 'SLPRESS', 'PAGERANK', 'DEST_LATITUDE', 'DEST_LONGITUDE', 'DEST_ELEVATION', 'DEST_VIS_VAR', 'DEST_TEMP', 'DEST_DEW_TEMP', 'DEST_SLPRESS', 'DEST_WND_SPEED_BUCK', 'DEST_VIS_DIST_BUCK', 'DEST_CIG_HEIGHT_BUCK', 'DELAY')

#SAVING Spark Dataframe to Shared Directory
# dbutils.fs.mkdirs("dbfs:/mnt/mids-w261/team_25/")               #Made Directory in DataBricks, no need to remake
file_to_store = valid_test_data_3m                        #CHANGE THIS: name of Spark Dataframe (to save in database)
filename = "valid_test_data_3m"                      #CHANGE THIS: new file name in database
dbutils.fs.rm("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/"+filename, True)      #remove file if there already is an existing one, be careful with this!!!
file_to_store.write.parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/" + filename)

#####################################
# Full Data

# change data to whatever the processed and joined dataframe is called
data = valid_test_imputed.withColumn('CRS_DEP_TIME',valid_test_imputed.CRS_DEP_TIME.cast(FloatType()) )

#Convert departure time to bins
dep_time_buck = Bucketizer(splits=[ 0, 200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000, 2200, float('Inf') ],inputCol="CRS_DEP_TIME", outputCol="CRS_DEP_TIME_BUCK")
air_buck = dep_time_buck.setHandleInvalid("keep").transform(data)

#Convert wind, vis, sky to bins
wind_buck = Bucketizer(splits=[ 0, 15, 33, 107, float('Inf') ],inputCol="WND_SPEED", outputCol="WND_SPEED_BUCK")
vis_buck = Bucketizer(splits=[ 0, 16092, float('Inf') ],inputCol="VIS_DIST", outputCol="VIS_DIST_BUCK")
sky_buck = Bucketizer(splits=[ 0, 21999, float('Inf') ],inputCol="CIG_HEIGHT", outputCol="CIG_HEIGHT_BUCK")
w_buck = wind_buck.setHandleInvalid("keep").transform(air_buck)   
w_buck = vis_buck.setHandleInvalid("keep").transform(w_buck)
w_buck = sky_buck.setHandleInvalid("keep").transform(w_buck)

wind_buck_dw = Bucketizer(splits=[ 0, 15, 33, 107, float('Inf') ],inputCol="DEST_WND_SPEED", outputCol="DEST_WND_SPEED_BUCK")
vis_buck_dw = Bucketizer(splits=[ 0, 16092, float('Inf') ],inputCol="DEST_VIS_DIST", outputCol="DEST_VIS_DIST_BUCK")
sky_buck_dw = Bucketizer(splits=[ 0, 21999, float('Inf') ],inputCol="DEST_CIG_HEIGHT", outputCol="DEST_CIG_HEIGHT_BUCK")
w_buck = wind_buck_dw.setHandleInvalid("keep").transform(w_buck)   
w_buck = vis_buck_dw.setHandleInvalid("keep").transform(w_buck)
w_buck = sky_buck_dw.setHandleInvalid("keep").transform(w_buck)

# drop numeric columns (keep only features needed for modeling)
valid_test_data = w_buck.select('YEAR', 'MONTH', 'DAY_OF_WEEK', 'OP_UNIQUE_CARRIER', 'ORIGIN', 'ORIGIN_STATE_ABR', 'DEST', 'DEST_STATE_ABR', 'CRS_DEP_TIME_BUCK', 'CRS_ELAPSED_TIME', 'DISTANCE', 'FIRST_DEP', 'PREVIOUS_DELAY', 'FLIGHTS_PER_DAY', 'LATITUDE', 'LONGITUDE', 'ELEVATION', 'WND_SPEED_BUCK', 'CIG_HEIGHT_BUCK', 'VIS_DIST_BUCK', 'VIS_VAR', 'TEMP', 'DEW_TEMP', 'SLPRESS', 'PAGERANK', 'DEST_LATITUDE', 'DEST_LONGITUDE', 'DEST_ELEVATION', 'DEST_VIS_VAR', 'DEST_TEMP', 'DEST_DEW_TEMP', 'DEST_SLPRESS', 'DEST_WND_SPEED_BUCK', 'DEST_VIS_DIST_BUCK', 'DEST_CIG_HEIGHT_BUCK', 'DELAY')

#SAVING Spark Dataframe to Shared Directory
# dbutils.fs.mkdirs("dbfs:/mnt/mids-w261/team_25/")               #Made Directory in DataBricks, no need to remake
file_to_store = valid_test_data                        #CHANGE THIS: name of Spark Dataframe (to save in database)
filename = "valid_test_data"                      #CHANGE THIS: new file name in database
dbutils.fs.rm("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/"+filename, True)      #remove file if there already is an existing one, be careful with this!!!
file_to_store.write.parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/" + filename)

In [0]:
#Read Data
filename = "valid_test_data_3m"                      
valid_test_data_3m = spark.read.option("header", "true").parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/"+filename+"/part-00*.parquet")
# valid_test_data_3m.display()
print("valid_test_data_3m Shape:", valid_test_data_3m.count(), len(valid_test_data_3m.columns))

############################################

filename = "valid_test_data"                      
valid_test_data = spark.read.option("header", "true").parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/"+filename+"/part-00*.parquet")
# valid_test_data.display()
print("valid_test_data:", valid_test_data.count(), len(valid_test_data.columns))

## Validation/Test Data Write to DataBase

In [0]:
##########################
# 3 MONTH DATA

#Split to Validation and Test Data

validation_data_3m_dw_b = valid_test_data_3m.where("MONTH = 3")

#SAVING Spark Dataframe to Shared Directory
file_to_store = validation_data_3m_dw_b                        #CHANGE THIS: name of Spark Dataframe (to save in database)
filename = "validation_data_3m_dw_b"                      #CHANGE THIS: new file name in database
dbutils.fs.rm("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/"+filename, True)      #remove file if there already is an existing one, be careful with this!!!
file_to_store.write.parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/" + filename)

###########################
#FULL DATA

validation_data_dw_b = valid_test_data.where("YEAR = 2018")
test_data_dw_b = valid_test_data.where("YEAR = 2019")

#SAVING Spark Dataframe to Shared Directory
file_to_store = validation_data_dw_b                        #CHANGE THIS: name of Spark Dataframe (to save in database)
filename = "validation_data_dw_b"                      #CHANGE THIS: new file name in database
dbutils.fs.rm("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/"+filename, True)      #remove file if there already is an existing one, be careful with this!!!
file_to_store.write.parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/" + filename)

file_to_store = test_data_dw_b                        #CHANGE THIS: name of Spark Dataframe (to save in database)
filename = "test_data_dw_b"                      #CHANGE THIS: new file name in database
dbutils.fs.rm("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/"+filename, True)      #remove file if there already is an existing one, be careful with this!!!
file_to_store.write.parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/" + filename)

# Rename Data Files for Modeling

In [0]:
#Read Data
filename = "validation_data_3m_dw_b"                      
validation_data_3m_dw_b = spark.read.option("header", "true").parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/"+filename+"/part-00*.parquet")
# validation_data_3m_dw_b.display()
# print("validation_data_3m_dw_b Shape:", validation_data_3m_dw_b.count(), len(validation_data_3m_dw_b.columns))

############################################

filename = "validation_data_dw_b"                      
validation_data_dw_b = spark.read.option("header", "true").parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/"+filename+"/part-00*.parquet")
# validation_data_dw_b.display()
# print("validation_data_dw_b:", validation_data_dw_b.count(), len(validation_data_dw_b.columns))

filename = "test_data_dw_b"                      
test_data_dw_b = spark.read.option("header", "true").parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/"+filename+"/part-00*.parquet")
# test_data_dw_b.display()
# print("test_data_dw_b:", test_data_dw_b.count(), len(test_data_dw_b.columns))

In [0]:
#Split Data to 3month baseline validation, 3 month validation, full validation and full test
validation_data_3m_baseline = validation_data_3m_dw_b.where("ORIGIN = 'ATL' or ORIGIN = 'ORD'")
validation_data_3m = validation_data_3m_dw_b
validation_data = validation_data_dw_b
test_data = test_data_dw_b

#SAVING Spark Dataframe to Shared Directory
file_to_store = validation_data_3m_baseline                        #CHANGE THIS: name of Spark Dataframe (to save in database)
filename = "validation_data_3m_baseline"                      #CHANGE THIS: new file name in database
dbutils.fs.rm("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/"+filename, True)      #remove file if there already is an existing one, be careful with this!!!
file_to_store.write.parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/" + filename)

file_to_store = validation_data_3m                        #CHANGE THIS: name of Spark Dataframe (to save in database)
filename = "validation_data_3m"                      #CHANGE THIS: new file name in database
dbutils.fs.rm("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/"+filename, True)      #remove file if there already is an existing one, be careful with this!!!
file_to_store.write.parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/" + filename)

file_to_store = validation_data                        #CHANGE THIS: name of Spark Dataframe (to save in database)
filename = "validation_data"                      #CHANGE THIS: new file name in database
dbutils.fs.rm("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/"+filename, True)      #remove file if there already is an existing one, be careful with this!!!
file_to_store.write.parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/" + filename)

file_to_store = test_data                        #CHANGE THIS: name of Spark Dataframe (to save in database)
filename = "test_data"                      #CHANGE THIS: new file name in database
dbutils.fs.rm("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/"+filename, True)      #remove file if there already is an existing one, be careful with this!!!
file_to_store.write.parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/" + filename)

In [0]:
#Reado 3month baseline validation, 3 month validation, full validation and full test data

#Read Data
filename = "validation_data_3m_baseline"                      
validation_data_3m_baseline = spark.read.option("header", "true").parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/"+filename+"/part-00*.parquet")
validation_data_3m_baseline.display()
print("validation_data_3m_baseline Shape:", validation_data_3m_baseline.count(), len(validation_data_3m_baseline.columns))

############################################

filename = "validation_data_3m"                      
validation_data_3m = spark.read.option("header", "true").parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/"+filename+"/part-00*.parquet")
validation_data_3m.display()
print("validation_data_3m:", validation_data_3m.count(), len(validation_data_3m.columns))

############################################

filename = "validation_data"                      
validation_data = spark.read.option("header", "true").parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/"+filename+"/part-00*.parquet")
validation_data.display()
print("validation_data:", validation_data.count(), len(validation_data.columns))

############################################

filename = "test_data"                      
test_data = spark.read.option("header", "true").parquet("dbfs:/mnt/mids-w261/team_25/train_test_data_folder/"+filename+"/part-00*.parquet")
test_data.display()
print("test_data:", test_data.count(), len(test_data.columns))

YEAR,MONTH,DAY_OF_WEEK,OP_UNIQUE_CARRIER,ORIGIN,ORIGIN_STATE_ABR,DEST,DEST_STATE_ABR,CRS_DEP_TIME_BUCK,CRS_ELAPSED_TIME,DISTANCE,FIRST_DEP,PREVIOUS_DELAY,FLIGHTS_PER_DAY,LATITUDE,LONGITUDE,ELEVATION,WND_SPEED_BUCK,CIG_HEIGHT_BUCK,VIS_DIST_BUCK,VIS_VAR,TEMP,DEW_TEMP,SLPRESS,PAGERANK,DEST_LATITUDE,DEST_LONGITUDE,DEST_ELEVATION,DEST_VIS_VAR,DEST_TEMP,DEST_DEW_TEMP,DEST_SLPRESS,DEST_WND_SPEED_BUCK,DEST_VIS_DIST_BUCK,DEST_CIG_HEIGHT_BUCK,DELAY
2015,3,7,AA,ORD,IL,MSP,MN,7.0,86.0,334.0,0,0,15189,41.995,-87.9336,201.8,1.0,0.0,0.0,N,-56.0,-89.0,10194.58439116126,0.0473776371384612,44.8831,-93.2289,265.8,N,-56.0,-150.0,10268.0,2.0,1.0,1.0,1
2015,3,7,EV,ATL,GA,CSG,GA,8.0,44.0,83.0,0,0,15189,33.6301,-84.4418,307.8,1.0,0.0,0.0,N,56.0,44.0,10303.0,0.0617810304792605,32.5161,-84.9422,119.5,N,83.0,61.0,10293.0,1.0,1.0,0.0,0
2015,3,7,MQ,ORD,IL,CMI,IL,6.0,47.0,135.0,0,0,15189,41.995,-87.9336,201.8,1.0,0.0,0.0,N,-67.0,-89.0,10289.0,0.0473776371384612,40.03972,-88.27778,229.8,N,-44.0,-56.0,10290.0,0.0,0.0,0.0,1
2015,3,7,OO,ATL,GA,ORD,IL,4.0,132.0,606.0,0,1,15189,33.6301,-84.4418,307.8,2.0,0.0,0.0,N,30.0,10.0,10203.477010504565,0.0617810304792605,41.995,-87.9336,201.8,N,-111.0,-128.0,10301.0,1.0,0.0,0.0,0
2015,3,7,UA,ORD,IL,SFO,CA,9.0,287.0,1846.0,0,0,15189,41.995,-87.9336,201.8,2.0,0.0,1.0,N,-44.0,-94.0,10260.0,0.0473776371384612,37.6197,-122.3647,2.4,N,172.0,39.0,10139.0,1.0,1.0,1.0,1
2015,3,1,DL,ATL,GA,DFW,TX,5.0,153.0,731.0,0,0,16427,33.6301,-84.4418,307.8,1.0,0.0,0.0,N,83.0,72.0,10268.0,0.0617810304792605,32.8978,-97.0189,170.7,N,6.0,0.0,10289.0,2.0,0.0,0.0,1
2015,3,1,DL,ATL,GA,MSP,MN,4.0,161.0,907.0,0,0,16427,33.6301,-84.4418,307.8,1.0,0.0,0.0,N,78.0,72.0,10260.0,0.0617810304792605,44.8831,-93.2289,265.8,N,-111.0,-144.0,10334.0,1.0,1.0,1.0,0
2015,3,1,DL,ATL,GA,RDU,NC,11.0,85.0,356.0,0,0,16427,33.6301,-84.4418,307.8,0.0,0.0,1.0,N,122.0,100.0,10248.0,0.0617810304792605,35.8923,-78.7819,126.8,N,94.0,-50.0,10263.0,0.0,1.0,0.0,0
2015,3,1,EV,ATL,GA,GTR,MS,7.0,70.0,241.0,0,0,16427,33.6301,-84.4418,307.8,2.0,0.0,1.0,N,161.0,106.0,10262.0,0.0617810304792605,33.45,-88.58333,80.5,N,80.0,60.0,10208.477611940298,2.0,0.0,0.0,0
2015,3,1,EV,ORD,IL,GRB,WI,4.0,60.0,173.0,0,0,16427,41.995,-87.9336,201.8,2.0,1.0,1.0,N,-100.0,-133.0,10329.0,0.0473776371384612,44.4794,-88.1366,209.4,N,-111.0,-144.0,10319.0,2.0,0.0,1.0,0


YEAR,MONTH,DAY_OF_WEEK,OP_UNIQUE_CARRIER,ORIGIN,ORIGIN_STATE_ABR,DEST,DEST_STATE_ABR,CRS_DEP_TIME_BUCK,CRS_ELAPSED_TIME,DISTANCE,FIRST_DEP,PREVIOUS_DELAY,FLIGHTS_PER_DAY,LATITUDE,LONGITUDE,ELEVATION,WND_SPEED_BUCK,CIG_HEIGHT_BUCK,VIS_DIST_BUCK,VIS_VAR,TEMP,DEW_TEMP,SLPRESS,PAGERANK,DEST_LATITUDE,DEST_LONGITUDE,DEST_ELEVATION,DEST_VIS_VAR,DEST_TEMP,DEST_DEW_TEMP,DEST_SLPRESS,DEST_WND_SPEED_BUCK,DEST_VIS_DIST_BUCK,DEST_CIG_HEIGHT_BUCK,DELAY
2015,3,7,AA,DFW,TX,ORD,IL,4.0,133.0,802.0,0,0,15189,32.8978,-97.0189,170.7,2.0,0.0,0.0,N,0.0,-6.0,10194.98673300166,0.0381707360832717,41.995,-87.9336,201.8,N,-83.0,-100.0,10290.0,0.0,0.0,0.0,1
2015,3,7,AA,IAH,TX,MIA,FL,4.0,139.0,964.0,0,0,15189,29.98,-95.36,29.0,1.0,0.0,0.0,N,117.0,117.0,10226.0,0.0223240124155635,25.7881,-80.3169,8.8,N,239.0,206.0,10260.0,2.0,1.0,1.0,0
2015,3,7,AA,JFK,NY,MCO,FL,7.0,183.0,944.0,0,0,15189,40.6386,-73.7622,3.4,2.0,0.0,0.0,N,-33.0,-67.0,10351.0,0.0145890329261653,28.4339,-81.325,27.4,N,256.0,206.0,10268.0,2.0,1.0,0.0,1
2015,3,7,AA,LAX,CA,DFW,TX,5.0,181.0,1235.0,0,1,15189,33.938,-118.3888,29.6,1.0,1.0,1.0,N,128.0,83.0,10140.0,0.0308713278344312,32.8978,-97.0189,170.7,N,17.0,11.0,10191.536661026508,2.0,0.0,0.0,1
2015,3,7,AA,LAX,CA,SFO,CA,8.0,80.0,337.0,0,0,15189,33.938,-118.3888,29.6,2.0,0.0,1.0,N,161.0,83.0,10118.0,0.0308713278344312,37.6197,-122.3647,2.4,N,172.0,39.0,10139.0,1.0,1.0,1.0,0
2015,3,7,AA,LGA,NY,DFW,TX,3.0,251.0,1389.0,0,0,15189,40.7792,-73.88,3.4,0.0,1.0,1.0,N,-44.0,-122.0,10390.0,0.0141252218790335,32.8978,-97.0189,170.7,N,0.0,-6.0,10200.56747600226,2.0,0.0,0.0,1
2015,3,7,AA,MIA,FL,SFO,CA,9.0,393.0,2585.0,0,0,15189,25.7881,-80.3169,8.8,2.0,1.0,1.0,N,272.0,189.0,10242.0,0.0102115569589181,37.6197,-122.3647,2.4,N,161.0,22.0,10146.0,0.0,1.0,1.0,0
2015,3,7,AA,STL,MO,DFW,TX,4.0,115.0,550.0,0,0,15189,38.7525,-90.3736,161.8,0.0,0.0,0.0,N,-28.0,-44.0,10298.0,0.0076554971275557,32.8978,-97.0189,170.7,N,0.0,-6.0,10265.0,2.0,0.0,0.0,1
2015,3,7,AS,OAK,CA,OGG,HI,3.0,339.0,2349.0,0,0,15189,37.72139,-122.22083,1.8,2.0,1.0,1.0,N,83.0,61.0,10141.0,0.0069694060144943,20.89972,-156.42861,15.5,N,210.0,200.0,10193.124653739613,1.0,0.0,0.0,0
2015,3,7,B6,FLL,FL,DCA,VA,10.0,149.0,899.0,0,0,15189,26.07875,-80.16217,3.4,2.0,1.0,1.0,N,239.0,206.0,10243.0,0.0121748835826635,38.8472,-77.03454,3.0,N,6.0,-17.0,10274.0,2.0,0.0,0.0,1


YEAR,MONTH,DAY_OF_WEEK,OP_UNIQUE_CARRIER,ORIGIN,ORIGIN_STATE_ABR,DEST,DEST_STATE_ABR,CRS_DEP_TIME_BUCK,CRS_ELAPSED_TIME,DISTANCE,FIRST_DEP,PREVIOUS_DELAY,FLIGHTS_PER_DAY,LATITUDE,LONGITUDE,ELEVATION,WND_SPEED_BUCK,CIG_HEIGHT_BUCK,VIS_DIST_BUCK,VIS_VAR,TEMP,DEW_TEMP,SLPRESS,PAGERANK,DEST_LATITUDE,DEST_LONGITUDE,DEST_ELEVATION,DEST_VIS_VAR,DEST_TEMP,DEST_DEW_TEMP,DEST_SLPRESS,DEST_WND_SPEED_BUCK,DEST_VIS_DIST_BUCK,DEST_CIG_HEIGHT_BUCK,DELAY
2018,12,1,9E,AGS,GA,ATL,GA,6.0,69.0,143.0,0,0,16709,33.3644,-81.9633,40.2,1.0,0.0,0.0,N,189.0,189.0,10214.0,0.0008247055865650812,33.6301,-84.4418,307.8,N,167.0,156.0,99999.0,1.0,0.0,0.0,1
2018,12,1,9E,AGS,GA,ATL,GA,4.0,71.0,143.0,0,0,16709,33.3644,-81.9633,40.2,0.0,0.0,0.0,N,150.0,150.0,10208.0,0.0008247055865650812,33.6301,-84.4418,307.8,N,133.0,128.0,10195.0,2.0,0.0,0.0,0
2018,12,1,9E,BHM,AL,ATL,GA,6.0,65.0,134.0,0,0,16709,33.56556,-86.745,187.5,2.0,0.0,1.0,N,217.0,172.0,10151.0,0.0021165652216008,33.6301,-84.4418,307.8,N,178.0,167.0,99999.0,2.0,0.0,0.0,1
2018,12,1,9E,HRL,TX,MSP,MN,7.0,191.0,1310.0,0,0,16709,26.22806,-97.65417,10.4,1.0,0.0,1.0,N,144.0,72.0,10146.0,0.0008289280080042082,44.8831,-93.2289,265.8,N,-28.0,-50.0,10109.0,2.0,1.0,0.0,1
2018,12,1,AA,BDL,CT,CLT,NC,8.0,132.0,644.0,0,0,16709,41.9375,-72.6819,53.3,2.0,0.0,1.0,N,56.0,-6.0,10248.0,0.0031754809776797,35.2236,-80.9552,221.9,N,156.0,156.0,10173.0,1.0,0.0,0.0,0
2018,12,1,AA,DCA,VA,RSW,FL,4.0,168.0,892.0,0,0,16709,38.8472,-77.03454,3.0,1.0,0.0,1.0,N,50.0,22.0,10256.0,0.0105095571058594,26.53611,-81.755,9.4,N,172.0,167.0,10205.0,1.0,1.0,1.0,0
2018,12,1,AA,DFW,TX,CLT,NC,3.0,151.0,936.0,0,0,16709,32.8978,-97.0189,170.7,2.0,0.0,1.0,N,44.0,33.0,10074.0,0.0381707360832717,35.2236,-80.9552,221.9,N,122.0,122.0,10217.0,1.0,0.0,0.0,0
2018,12,1,AA,DFW,TX,IND,IN,6.0,124.0,761.0,0,0,16709,32.8978,-97.0189,170.7,2.0,0.0,1.0,N,67.0,50.0,10096.0,0.0381707360832717,39.72517,-86.28168,241.1,N,78.0,78.0,10075.0,2.0,0.0,0.0,0
2018,12,1,AA,DFW,TX,SEA,WA,7.0,271.0,1660.0,0,0,16709,32.8978,-97.0189,170.7,2.0,0.0,1.0,N,139.0,28.0,10075.0,0.0381707360832717,47.4444,-122.3138,112.8,N,33.0,22.0,10330.0,1.0,0.0,0.0,0
2018,12,1,AA,DFW,TX,SFO,CA,4.0,241.0,1464.0,0,0,16709,32.8978,-97.0189,170.7,1.0,0.0,1.0,N,50.0,39.0,10079.0,0.0381707360832717,37.6197,-122.3647,2.4,N,83.0,22.0,10159.0,2.0,1.0,1.0,0


YEAR,MONTH,DAY_OF_WEEK,OP_UNIQUE_CARRIER,ORIGIN,ORIGIN_STATE_ABR,DEST,DEST_STATE_ABR,CRS_DEP_TIME_BUCK,CRS_ELAPSED_TIME,DISTANCE,FIRST_DEP,PREVIOUS_DELAY,FLIGHTS_PER_DAY,LATITUDE,LONGITUDE,ELEVATION,WND_SPEED_BUCK,CIG_HEIGHT_BUCK,VIS_DIST_BUCK,VIS_VAR,TEMP,DEW_TEMP,SLPRESS,PAGERANK,DEST_LATITUDE,DEST_LONGITUDE,DEST_ELEVATION,DEST_VIS_VAR,DEST_TEMP,DEST_DEW_TEMP,DEST_SLPRESS,DEST_WND_SPEED_BUCK,DEST_VIS_DIST_BUCK,DEST_CIG_HEIGHT_BUCK,DELAY
2019,8,4,9E,AGS,GA,ATL,GA,2.0,60.0,143.0,0,0,22301,33.3644,-81.9633,40.2,1.0,1.0,1.0,N,250.0,222.0,10112.0,0.0008247055865650812,33.6301,-84.4418,307.8,N,256.0,189.0,10120.0,2.0,1.0,0.0,0
2019,8,4,9E,ATL,GA,OAJ,NC,10.0,86.0,399.0,0,0,22301,33.6301,-84.4418,307.8,2.0,0.0,1.0,N,322.0,200.0,10124.0,0.0617810304792605,34.83333,-77.61667,29.3,N,328.0,222.0,10102.0,2.0,1.0,1.0,1
2019,8,4,9E,BTR,LA,ATL,GA,3.0,98.0,448.0,0,1,22301,30.5372,-91.1469,19.5,1.0,1.0,1.0,N,261.0,250.0,10139.0,0.0014464071133628,33.6301,-84.4418,307.8,N,239.0,194.0,10122.0,1.0,1.0,0.0,0
2019,8,4,9E,DTW,MI,BTV,VT,8.0,109.0,537.0,0,0,22301,42.2313,-83.3308,192.3,2.0,1.0,1.0,N,283.0,150.0,10078.0,0.0206175511649113,44.4683,-73.1499,100.6,N,278.0,183.0,10016.0,2.0,1.0,1.0,0
2019,8,4,9E,JFK,NY,IAD,VA,8.0,115.0,228.0,0,1,22301,40.63915,-73.76401,3.4,2.0,1.0,1.0,N,272.0,233.0,10061.0,0.0145890329261653,38.93486,-77.44728,88.4,N,328.0,139.0,10078.0,2.0,1.0,1.0,1
2019,8,4,9E,JFK,NY,PWM,ME,11.0,99.0,273.0,0,0,22301,40.63915,-73.76401,3.4,3.0,0.0,0.0,N,267.0,161.0,10167.81477165778,0.0145890329261653,43.64222,-70.30444,13.7,N,217.0,183.0,10023.0,1.0,1.0,1.0,1
2019,8,4,9E,JFK,NY,ROC,NY,11.0,103.0,264.0,0,0,22301,40.63915,-73.76401,3.4,3.0,0.0,0.0,N,267.0,161.0,10167.81477165778,0.0145890329261653,43.1167,-77.6767,164.3,N,222.0,194.0,10052.0,2.0,1.0,1.0,1
2019,8,4,9E,MSP,MN,STL,MO,3.0,87.0,448.0,0,0,22301,44.8831,-93.2289,265.8,1.0,1.0,1.0,N,167.0,117.0,10120.0,0.0228724306628863,38.7525,-90.3736,161.8,N,228.0,200.0,10109.0,1.0,1.0,1.0,0
2019,8,4,9E,ORF,VA,LGA,NY,3.0,86.0,296.0,0,1,22301,36.9033,-76.1922,9.1,1.0,1.0,1.0,N,228.0,217.0,10174.867733782645,0.0018830828949039,40.77944,-73.88035,3.4,N,217.0,183.0,10070.0,1.0,1.0,1.0,0
2019,8,4,9E,RDU,NC,LGA,NY,7.0,115.0,431.0,1,0,22301,35.8923,-78.7819,126.8,2.0,1.0,1.0,N,311.0,194.0,10109.0,0.0051515025215362,40.77944,-73.88035,3.4,N,294.0,178.0,10060.0,2.0,1.0,1.0,1
