# Preliminary Analysis
## for KidKit

## Data Wrangling

> The original dataset reports domestic flights in the United States, including carriers, arrival and departure delays, and reasons for delays, from 1987 to 2008.

> The data is available at http://stat-computing.org/dataexpo/2009/the-data.html and comes originally from https://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp. The fields are described in detail at https://www.transtats.bts.gov/Fields.asp?Table_ID=236

In [None]:
# import modules
import pandas as pd
import numpy as np
import calendar
import datetime as dt

import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline

In [None]:
df_88 = pd.read_csv('1988.csv', nrows=None, encoding='latin-1')
df_98 = pd.read_csv('1998.csv', nrows=None, encoding='latin-1')
df_08 = pd.read_csv('2008.csv', nrows=None, encoding='latin-1')
df_88.shape, df_98.shape, df_08.shape 

### What is the structure of our dataset?

> We selected three yearly datasets (1988, 1998, and 2008) to show how delays in the aviation industry have evolved over the past decades and, perhaps, to gain some insight into how critical aviation operations cope with a higher volume of flights. The first two datasets contain about 5 million records each, which suggests that flight volume did not change significantly between them. The 2008 dataset has markedly more records. This raises the question of how the aviation system has handled this 30% increase and whether it has had a direct impact on the passenger experience in terms of flight delays.
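
The growth figure quoted above can be checked directly from the row counts. The following is a minimal sketch and not part of the original notebook: `percent_change` is a hypothetical helper, and the toy frames only stand in for the real `df_98` and `df_08`.

```python
import pandas as pd


def percent_change(old_count, new_count):
    """Return the relative change from old_count to new_count, in percent."""
    return (new_count - old_count) / old_count * 100.0


# Toy stand-ins for the yearly frames (hypothetical sizes, not the real data).
toy_98 = pd.DataFrame({"ArrDelay": range(10)})
toy_08 = pd.DataFrame({"ArrDelay": range(13)})

growth = percent_change(len(toy_98), len(toy_08))
print(f"Change in number of records: {growth:.1f}%")
```

With the real data, the same call would be `percent_change(len(df_98), len(df_08))`.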

### What are the main features of interest in our dataset?

> As explained above, an increase in flight demand must have altered the supply side of the aviation system's operations. Of the 29 columns in our dataset, we focus on delayed flights, or delays in general. Further investigation is still needed into cancelled flights, taxi times, and local transportation, including operations to final destinations. The dataset offers other interesting information that could be explored to improve the passenger experience and to assess the success of innovative business models in the aviation industry. Such work could provide interesting insights from an operations management perspective, as well as a comprehensive understanding of the operational cost of idle time. Carrier delay data could also be analyzed in this way, which we do not do here.

### Which features in the dataset will help support our investigation into our features of interest?

> For this analysis we use 'Year', 'Month', 'DayofMonth', 'DayOfWeek', 'ArrDelay', 'CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay', and 'LateAircraftDelay'. 'ArrDelay' is the arrival delay of a flight in minutes. When a flight is delayed, it is the sum, in minutes, of 'CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay', and 'LateAircraftDelay', that is, Carrier Delay, Weather Delay, National Aviation System Delay, Security Delay, and Late Aircraft Delay.
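
Whether the five cause columns really add up to 'ArrDelay' can be tested rather than assumed. The following is a minimal sketch: `cause_sum_matches` is a hypothetical helper, and the toy frame is made-up data, not rows from the dataset. Rows with a missing cause are skipped, on the assumption that causes are recorded only for some flights.

```python
import pandas as pd

CAUSE_COLUMNS = ["CarrierDelay", "WeatherDelay", "NASDelay",
                 "SecurityDelay", "LateAircraftDelay"]


def cause_sum_matches(df):
    """For rows with all five causes present, flag whether they sum to ArrDelay."""
    reported = df.dropna(subset=CAUSE_COLUMNS + ["ArrDelay"])
    return reported[CAUSE_COLUMNS].sum(axis=1) == reported["ArrDelay"]


# Made-up rows: one consistent, one inconsistent, one without cause data.
toy = pd.DataFrame({
    "ArrDelay":          [30.0, 45.0, 5.0],
    "CarrierDelay":      [10.0, 0.0, None],
    "WeatherDelay":      [0.0, 20.0, None],
    "NASDelay":          [20.0, 5.0, None],
    "SecurityDelay":     [0.0, 0.0, None],
    "LateAircraftDelay": [0.0, 10.0, None],
})

matches = cause_sum_matches(toy)
print(f"Share of rows where causes sum to ArrDelay: {matches.mean():.2f}")
```

A share well below 1 on the real data would suggest that the relationship holds only for a subset of flights and deserves a closer look.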

### Updated Dataset

> We now focus on delayed flights, so we drop rows with a null 'ArrDelay', assuming that empty delay records correspond to flights that arrived on time. We also keep only the columns we are interested in: 'Year', 'Month', 'DayofMonth', 'DayOfWeek', 'ArrDelay', 'CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay', and 'LateAircraftDelay'.
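
The assumption that a missing 'ArrDelay' means an on-time arrival can be checked while the 'Cancelled' and 'Diverted' columns are still in the frame. The following is a minimal sketch: `null_arrdelay_breakdown` is a hypothetical helper, the toy rows are made up, and 'Cancelled' and 'Diverted' are assumed to be 0/1 flags.

```python
import pandas as pd


def null_arrdelay_breakdown(df):
    """Count rows with missing ArrDelay, grouped by the Cancelled and Diverted flags."""
    missing = df[df["ArrDelay"].isna()]
    return missing.groupby(["Cancelled", "Diverted"]).size()


# Made-up rows for illustration only.
toy = pd.DataFrame({
    "ArrDelay":  [12.0, None, None, -3.0, None],
    "Cancelled": [0, 1, 0, 0, 1],
    "Diverted":  [0, 0, 1, 0, 0],
})

breakdown = null_arrdelay_breakdown(toy)
print(breakdown)
```

If most missing values line up with cancelled or diverted flights, they do not represent on-time arrivals and the assumption should be revisited.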

In [None]:
# Visualize the number of nulls per column in the 1988 dataset
plt.figure(figsize=(15, 5))
na_counts = df_88.isna().sum()
base_color = sb.color_palette()[0]
# Pass x and y as keywords; newer seaborn versions do not accept them positionally
sb.barplot(x=na_counts.index.values, y=na_counts.values, color=base_color)
plt.xticks(rotation=90);

In [None]:
# Visualize the number of nulls per column in the 1998 dataset
plt.figure(figsize=(15, 5))
na_counts = df_98.isna().sum()
base_color = sb.color_palette()[0]
# Pass x and y as keywords; newer seaborn versions do not accept them positionally
sb.barplot(x=na_counts.index.values, y=na_counts.values, color=base_color)
plt.xticks(rotation=90);

In [None]:
# Visualize the number of nulls per column in the 2008 dataset
plt.figure(figsize=(15, 5))
na_counts = df_08.isna().sum()  # use the 2008 frame, not df_98
base_color = sb.color_palette()[0]
# Pass x and y as keywords; newer seaborn versions do not accept them positionally
sb.barplot(x=na_counts.index.values, y=na_counts.values, color=base_color)
plt.xticks(rotation=90);

In [None]:
#Let's have a look at our fields
df_98.columns

In [None]:
# columns to keep
fields = ['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'ArrDelay', 'CarrierDelay',
       'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay' ]

In [None]:
# columns to drop
# Conveniently, the column names are the same in all three datasets
fields_to_drop = ['DepTime', 'CRSDepTime', 'ArrTime', 'CRSArrTime', 'UniqueCarrier', 'FlightNum', 'TailNum', 
                  'ActualElapsedTime', 'CRSElapsedTime', 'AirTime', 'DepDelay', 'Origin', 'Dest', 'Distance', 
                  'TaxiIn', 'TaxiOut', 'Cancelled', 'CancellationCode', 'Diverted']
df_88.drop(fields_to_drop,axis=1,inplace=True)
df_98.drop(fields_to_drop,axis=1,inplace=True)
df_08.drop(fields_to_drop,axis=1,inplace=True)

In [None]:
# Drop rows with a null 'ArrDelay' in all datasets (we assume these represent flights
# arriving on time), then check the size of the updated datasets

df_88.dropna(subset = ['ArrDelay'], inplace = True)
df_98.dropna(subset = ['ArrDelay'], inplace = True)
df_08.dropna(subset = ['ArrDelay'], inplace = True)
df_88.shape, df_98.shape, df_08.shape

In [None]:
#Shape of delayed flights in 1988
df_88[df_88.ArrDelay > 0].shape

In [None]:
#Shape of delayed flights in 1998
df_98[df_98.ArrDelay > 0].shape

In [None]:
#Shape of delayed flights in 2008
df_08[df_08.ArrDelay > 0].shape
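
The three shape checks above can be condensed into a single comparison of the share of delayed flights per year. The following is a minimal sketch: `delayed_share` is a hypothetical helper, and the toy frames are made-up stand-ins for the real yearly data.

```python
import pandas as pd


def delayed_share(df):
    """Return the fraction of rows with a positive ArrDelay."""
    return (df["ArrDelay"] > 0).mean()


# Made-up stand-ins for the yearly frames.
toy_by_year = {
    1988: pd.DataFrame({"ArrDelay": [5.0, -2.0, 0.0, 17.0]}),
    2008: pd.DataFrame({"ArrDelay": [30.0, 1.0, -4.0, 8.0]}),
}

shares = pd.Series({year: delayed_share(df) for year, df in toy_by_year.items()})
print(shares)
```

With the real data, the dictionary would map 1988, 1998, and 2008 to `df_88`, `df_98`, and `df_08`.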

## Univariate Exploration

> In this section, investigate distributions of individual variables. If
you see unusual points or outliers, take a deeper look to clean things up
and prepare yourself to look at relationships between variables.

> Make sure that, after every plot or related series of plots, that you
include a Markdown cell with comments about what you observed, and what
you plan on investigating next.

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> Your answer here!

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> Your answer here!

## Bivariate Exploration

> In this section, investigate relationships between pairs of variables in your
data. Make sure the variables that you cover here have been introduced in some
fashion in the previous section (univariate exploration).

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> Your answer here!

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> Your answer here!

## Multivariate Exploration

> Create plots of three or more variables to investigate your data even
further. Make sure that your investigations are justified, and follow from
your work in the previous sections.

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> Your answer here!

### Were there any interesting or surprising interactions between features?

> Your answer here!

> At the end of your report, make sure that you export the notebook as an
html file from the `File > Download as... > HTML` menu. Make sure you keep
track of where the exported file goes, so you can put it in the same folder
as this notebook for project submission. Also, make sure you remove all of
the quote-formatted guide notes like this one before you finish your report!