# Flights Curated Data Pre-processing

**This notebook pre-processes the flights data from the raw folder and saves the result to the curated folder.**

In [17]:
import pandas as pd

df = pd.read_csv("../data/raw/flights_raw_data.csv")
df.head()

Unnamed: 0,flight_id,departure_airport,arrival_airport,scheduled_departure_time,actual_departure_time,scheduled_arrival_time,actual_arrival_time,elapsed_time_flight_minutes
0,0,IAD,JFK,06:00:00,05:54:00,07:29:00,07:10:00,89
1,1,IAD,JFK,06:00:00,05:52:00,07:29:00,07:07:00,89
2,2,IAD,JFK,06:00:00,06:00:00,07:29:00,07:36:00,89
3,3,IAD,JFK,06:00:00,05:54:00,07:29:00,07:11:00,89
4,4,IAD,JFK,06:00:00,05:57:00,07:24:00,07:16:00,84


**Only include flights that depart from or arrive at JFK**

In [18]:
# Ensure departure_airport and arrival_airport are strings
df['departure_airport'] = df['departure_airport'].astype('string')
df['arrival_airport'] = df['arrival_airport'].astype('string')

In [19]:
# Only consider departures from JFK or arrivals at JFK
filtered_df = df[(df['departure_airport'] == 'JFK') | (df['arrival_airport'] == 'JFK')]
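
As a quick check of the filter above, the sketch below applies the same boolean mask to a small hypothetical frame. The name `toy` and its rows are made up for illustration and are not taken from the raw file.

```python
import pandas as pd

# Hypothetical sample rows, not taken from the raw file
toy = pd.DataFrame({
    "departure_airport": ["IAD", "JFK", "LAX", "BOS"],
    "arrival_airport": ["JFK", "LAX", "SFO", "JFK"],
}).astype("string")

# Keep a row when either endpoint is JFK
mask = (toy["departure_airport"] == "JFK") | (toy["arrival_airport"] == "JFK")
toy_filtered = toy[mask]

# Arrivals only: a subset of the rows kept above
toy_arrivals = toy_filtered[toy_filtered["arrival_airport"] == "JFK"]
```

The LAX to SFO row is dropped by the mask, and the JFK departure is dropped once only arrivals are kept.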

In [20]:
# Keep arrivals at JFK; .copy() makes an independent frame so new columns can be added safely
JFK_arr = filtered_df[filtered_df['arrival_airport'] == 'JFK'].copy()

In [21]:
# Extract the arrival hour (the two digits before the first ':')
JFK_arr['arrival_hour'] = JFK_arr['actual_arrival_time'].str.split(':').str[0].str[-2:].astype(int)
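
To illustrate the hour extraction, here is a minimal sketch on a few hypothetical `HH:MM:SS` strings in the format shown by `df.head()`. It runs the same string-based expression and, for comparison, an alternative that parses the values with `pd.to_datetime`. The alternative is an assumption about what would also work on clean data, not the notebook's method, and it would fail on values that do not match the format.

```python
import pandas as pd

# Hypothetical time strings for illustration
times = pd.Series(["07:10:00", "23:59:00", "00:05:00"])

# String-based approach: take the text before the first ':' and keep its last two characters
hour_from_split = times.str.split(":").str[0].str[-2:].astype(int)

# Alternative (assumed, not from the notebook): parse the time, then read the hour
hour_from_parse = pd.to_datetime(times, format="%H:%M:%S").dt.hour
```

Both approaches give the same hours on well-formed input. The string-based version is more tolerant if a value carries a prefix before the hour.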


In [22]:
# Drop unnecessary columns
JFK_arr = JFK_arr.drop(columns=['scheduled_departure_time', 'actual_departure_time', 'scheduled_arrival_time', 'actual_arrival_time'])

**Add hourly bins**


In [23]:
# Create one indicator column per arrival hour and add it to the dataframe
# Hours run from 0 to 23, so 24 bins are needed

for i in range(24):
    bin_name = f"[{i}-{i+1}]"
    JFK_arr[bin_name] = JFK_arr['arrival_hour'].apply(lambda x: 1 if x == i else 0)
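
The hourly bins are a one-hot encoding of `arrival_hour`. A minimal sketch of a vectorized variant follows, assuming hours take values 0 through 23. The frame `toy` and its values are hypothetical. It also adds a sanity check that each row has exactly one bin set.

```python
import pandas as pd

# Hypothetical arrival hours for illustration
toy = pd.DataFrame({"arrival_hour": [0, 7, 7, 23]})

# One indicator column per hour of the day (0 through 23)
for hour in range(24):
    toy[f"[{hour}-{hour + 1}]"] = (toy["arrival_hour"] == hour).astype(int)

bin_columns = [f"[{hour}-{hour + 1}]" for hour in range(24)]

# Each row should have exactly one indicator set
ones_per_row = toy[bin_columns].sum(axis=1)
```

Comparing the whole column at once avoids a Python-level function call per row. A row whose indicators sum to 0 would signal an hour that no bin covers.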

**Re-order the columns**

In [25]:
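# Change column order
# Collect the hourly bin columns (named "[h-h+1]") in the order they were created
hourly_bins = [col for col in JFK_arr.columns if col.startswith('[')]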

column_order = [
    'flight_id', 'departure_airport', 'arrival_airport', 'arrival_hour'
] + hourly_bins + ['elapsed_time_flight_minutes']

# Reorder the columns in the dataframe
JFK_arr = JFK_arr[column_order]

**Save the data**

In [27]:
# Save the data without the DataFrame index
JFK_arr.to_csv('../data/curated/JFK_arrival_flight_data.csv', index=False)