# Project 4: Communicate Data Findings

São Paulo, 15 June 2019<br>
Felipe Mahlmeister

## Table of Contents

1. [Summary](#summary)<br>
2. [Data Wrangling](#data_wrangling)<br>
2.1. [Extracting the Data](#extract)<br>
2.2. [Assess](#assess)<br>
2.3. [Clean](#clean)<br>
3. [Analysis, Modeling, and Validation](#analysis)<br>
4. [Conclusion](#conclusion)<br>

<a id='summary'></a>
## 1. Summary

#### Intro
The data consists of flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008. This is a large dataset: there are nearly 120 million records in total, and it takes up 1.6 gigabytes compressed and 12 gigabytes uncompressed.

These files were downloaded from: http://stat-computing.org/dataexpo/2009/the-data.html

#### Objective

<a id='data_wrangling'></a>
## 2. Data Wrangling

In [1]:
# import all default packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from urllib.request import urlretrieve
import glob

# import my packages
from jupyterworkflow.data import get_url
from jupyterworkflow.data import get_download_and_unzip
from jupyterworkflow.data import get_flights_data

%matplotlib inline

<a id='extract'></a>
### 2.1. Extracting the Data

In [2]:
# Choose the range of years you want to download
start_year = 1987
end_year = 2008

To make this project reproducible, the helper functions imported above automatically download the files from stat-computing.org for every year between `start_year` and `end_year`, and unzip them into the `source` folder.
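As a rough illustration of the year-range logic described above, the sketch below builds one download URL and one local path per year. It is not the code inside `jupyterworkflow.data`: the function name `build_download_plan`, the `BASE_URL` pattern, and the `.csv.bz2` file naming are all assumptions made for the example.

```python
from pathlib import Path

# Assumed URL pattern for the yearly archives; the real location may differ.
BASE_URL = "http://stat-computing.org/dataexpo/2009/{year}.csv.bz2"


def build_download_plan(start_year, end_year, target_dir="source"):
    """Return (url, local_path) pairs for every year in the inclusive range.

    Hypothetical helper, written only to illustrate the idea.
    """
    if start_year > end_year:
        raise ValueError("start_year must not be greater than end_year")

    plan = []
    for year in range(start_year, end_year + 1):
        url = BASE_URL.format(year=year)
        local_path = Path(target_dir) / f"{year}.csv.bz2"
        plan.append((url, local_path))
    return plan


plan = build_download_plan(1987, 2008)
print(len(plan), "files planned")
print(plan[0][0])
```

Each pair could then be passed to `urllib.request.urlretrieve` and decompressed, which keeps the list of files separate from the network step and makes the range easy to check before anything is downloaded.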

In [None]:
url, filepath = get_url(start_year, end_year)
get_flights_data(url, filepath)

In [7]:
download_list = glob.glob('source/*.csv')
download_list.sort()

In [3]:
df = pd.read_csv('source/2008.csv')

In [4]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7009728 entries, 0 to 7009727
Data columns (total 29 columns):
Year                 int64
Month                int64
DayofMonth           int64
DayOfWeek            int64
DepTime              float64
CRSDepTime           int64
ArrTime              float64
CRSArrTime           int64
UniqueCarrier        object
FlightNum            int64
TailNum              object
ActualElapsedTime    float64
CRSElapsedTime       float64
AirTime              float64
ArrDelay             float64
DepDelay             float64
Origin               object
Dest                 object
Distance             int64
TaxiIn               float64
TaxiOut              float64
Cancelled            int64
CancellationCode     object
Diverted             int64
CarrierDelay         float64
WeatherDelay         float64
NASDelay             float64
SecurityDelay        float64
LateAircraftDelay    float64
dtypes: float64(14), int64(10), object(5)
memory usage: 3.0 GB


In [5]:
df.head()

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,...,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
0,2008,1,3,4,2003.0,1955,2211.0,2225,WN,335,...,4.0,8.0,0,,0,,,,,
1,2008,1,3,4,754.0,735,1002.0,1000,WN,3231,...,5.0,10.0,0,,0,,,,,
2,2008,1,3,4,628.0,620,804.0,750,WN,448,...,3.0,17.0,0,,0,,,,,
3,2008,1,3,4,926.0,930,1054.0,1100,WN,1746,...,3.0,7.0,0,,0,,,,,
4,2008,1,3,4,1829.0,1755,1959.0,1925,WN,3920,...,3.0,10.0,0,,0,2.0,0.0,0.0,0.0,32.0


In [6]:
df.loc[:,'UniqueCarrier'].unique()

array(['WN', 'XE', 'YV', 'OH', 'OO', 'UA', 'US', 'DL', 'EV', 'F9', 'FL',
       'HA', 'MQ', 'NW', '9E', 'AA', 'AQ', 'AS', 'B6', 'CO'], dtype=object)

In [15]:
df.sample(20).iloc[:,15:30]

Unnamed: 0,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
6039434,0.0,CLE,MHT,544,6.0,31.0,0,,0,0.0,0.0,16.0,0.0,0.0
3166755,-2.0,SLC,PIT,1659,5.0,25.0,0,,0,,,,,
5701749,0.0,CRW,ATL,363,10.0,10.0,0,,0,,,,,
2872876,-4.0,ORD,STL,258,5.0,21.0,0,,0,,,,,
5820806,-3.0,DTW,ERI,163,7.0,19.0,0,,0,,,,,
4166252,39.0,IAD,OAK,2408,5.0,56.0,0,,0,13.0,0.0,48.0,0.0,26.0
3303390,3.0,MCO,LGA,950,16.0,125.0,0,,0,0.0,0.0,116.0,0.0,0.0
2152757,-3.0,BWI,MKE,641,5.0,12.0,0,,0,,,,,
5946515,4.0,SMF,DEN,910,7.0,15.0,0,,0,,,,,
836122,125.0,SAN,SFO,447,7.0,11.0,0,,0,0.0,0.0,3.0,0.0,122.0


In [14]:
df.loc[:,'Origin'].unique()

array(['IAD', 'IND', 'ISP', 'JAN', 'JAX', 'LAS', 'LAX', 'LBB', 'LIT',
       'MAF', 'MCI', 'MCO', 'MDW', 'MHT', 'MSY', 'OAK', 'OKC', 'OMA',
       'ONT', 'ORF', 'PBI', 'PDX', 'PHL', 'PHX', 'PIT', 'PVD', 'RDU',
       'RNO', 'RSW', 'SAN', 'SAT', 'SDF', 'SEA', 'SFO', 'SJC', 'SLC',
       'SMF', 'SNA', 'STL', 'TPA', 'TUL', 'TUS', 'ABQ', 'ALB', 'AMA',
       'AUS', 'BDL', 'BHM', 'BNA', 'BOI', 'BUF', 'BUR', 'BWI', 'CLE',
       'CMH', 'CRP', 'DAL', 'DEN', 'DTW', 'ELP', 'FLL', 'GEG', 'HOU',
       'HRL', 'ROC', 'DAY', 'ORD', 'EWR', 'SYR', 'IAH', 'LFT', 'MKE',
       'CHS', 'LCH', 'CLT', 'BTR', 'CRW', 'FAT', 'COS', 'MRY', 'LGB',
       'BFL', 'EUG', 'ICT', 'MEM', 'LGA', 'DCA', 'BTV', 'GRK', 'BRO',
       'TYS', 'DSM', 'BPT', 'GPT', 'GRR', 'PWM', 'MSP', 'RIC', 'CVG',
       'SAV', 'SRQ', 'GSO', 'CHA', 'XNA', 'GSP', 'LEX', 'MFE', 'ABE',
       'MLU', 'MOB', 'LRD', 'SHV', 'TLH', 'CAE', 'AEX', 'ATL', 'DFW',
       'BGR', 'AVL', 'BOS', 'MSN', 'HSV', 'MGM', 'MYR', 'VPS', 'CLL',
       'PNS', 'MTJ',

Each CSV file is approximately 600 MB, and together they add up to about 12 GB.

In an ideal world, a normal workstation could hold all these files in memory at the same time, but my computer, for example, has only 8 GB of RAM.

Because we're dealing with large datasets, the next cells measure how much memory each data type uses.
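One common way to work with files that do not fit in RAM is to stream them in chunks with the `chunksize` parameter of `pd.read_csv`. The sketch below is only an illustration under assumptions: the CSV text is a tiny made-up sample, and `count_rows_and_mean_delay` is a hypothetical helper, not part of this project.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for one yearly CSV file (illustrative values only).
csv_text = """Year,UniqueCarrier,ArrDelay
2008,WN,-14
2008,WN,2
2008,XE,14
2008,XE,
2008,YV,34
"""


def count_rows_and_mean_delay(source, chunksize=2):
    """Stream a CSV in chunks, accumulating a row count and the mean ArrDelay."""
    total_rows = 0
    delay_sum = 0.0
    delay_count = 0
    for chunk in pd.read_csv(source, chunksize=chunksize):
        total_rows += len(chunk)
        delay_sum += chunk["ArrDelay"].sum()      # missing values are skipped
        delay_count += chunk["ArrDelay"].count()  # counts non-missing values only
    return total_rows, delay_sum / delay_count


rows, mean_delay = count_rows_and_mean_delay(io.StringIO(csv_text))
print(rows, mean_delay)
```

Only one chunk is held in memory at a time, so peak usage depends on `chunksize` rather than on the size of the file.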

In [17]:
for dtype in ['float','int','object']:
    selected_dtype = df.select_dtypes(include=[dtype])
    mean_usage_b = selected_dtype.memory_usage(deep=True).mean()
    mean_usage_mb = mean_usage_b / 1024 ** 2
    print("Average memory usage for {} columns: {:03.2f}MB".format(dtype,mean_usage_mb))

Average memory usage for float columns: 49.91MB
Average memory usage for int columns: 48.62MB
Average memory usage for object columns: 305.61MB
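The object columns dominate the memory footprint above. One option that is often used for low-cardinality string columns such as `UniqueCarrier` is the pandas `category` dtype. The sketch below uses a small made-up Series, not the real data, so it shows the direction of the effect rather than the savings to expect on this dataset.

```python
import pandas as pd

# Made-up sample with only three distinct carrier codes (illustrative only).
carriers = pd.Series(["WN", "XE", "YV", "WN", "WN", "XE"] * 1000, dtype="object")

# Store each distinct string once and keep small integer codes per row.
as_category = carriers.astype("category")

before_bytes = carriers.memory_usage(deep=True)
after_bytes = as_category.memory_usage(deep=True)

print("object:  ", before_bytes, "bytes")
print("category:", after_bytes, "bytes")
```

The conversion pays off when a column has few distinct values relative to its length. A column such as `TailNum`, with many distinct values, would likely benefit less.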


In [25]:
# We're going to be calculating memory usage a lot,
# so we'll create a function to save us some time!
def mem_usage(pandas_obj):
    """Return the deep memory usage of a DataFrame or Series as a string in MB."""
    if isinstance(pandas_obj, pd.DataFrame):
        usage_b = pandas_obj.memory_usage(deep=True).sum()
    else:  # we assume if not a df it's a series
        usage_b = pandas_obj.memory_usage(deep=True)

    # convert bytes to megabytes (applies to both DataFrames and Series)
    usage_mb = usage_b / 1024 ** 2
    return "{:03.2f} MB".format(usage_mb)

df_int = df.select_dtypes(include=['int'])
converted_int = df_int.apply(pd.to_numeric, downcast='unsigned')
print(mem_usage(df_int))
print(mem_usage(converted_int))
compare_ints = pd.concat([df_int.dtypes, converted_int.dtypes], axis=1)
compare_ints.columns = ['before', 'after']
compare_ints.apply(pd.Series.value_counts)


Unnamed: 0,before,after
uint8,,5.0
uint16,,5.0
int64,10.0,


In [26]:
converted_int.head()

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,CRSDepTime,CRSArrTime,FlightNum,Distance,Cancelled,Diverted
0,2008,1,3,4,1955,2225,335,810,0,0
1,2008,1,3,4,735,1000,3231,810,0,0
2,2008,1,3,4,620,750,448,515,0,0
3,2008,1,3,4,930,1100,1746,515,0,0
4,2008,1,3,4,1755,1925,3920,515,0,0


In [27]:
converted_int.shape

(7009728, 10)

In [None]:
df_float = df.select_dtypes(include=['float'])
converted_float = df_float.apply(pd.to_numeric,downcast='float')
print(mem_usage(df_float))
print(mem_usage(converted_float))
compare_floats = pd.concat([df_float.dtypes,converted_float.dtypes],axis=1)
compare_floats.columns = ['before','after']
compare_floats.apply(pd.Series.value_counts)
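To see what the `downcast` argument of `pd.to_numeric` does without loading a full year of data, the same calls can be tried on a toy frame. The values below are made up for illustration; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, mirroring three of the dataset's columns.
toy = pd.DataFrame({
    "Month": np.array([1, 6, 12], dtype="int64"),
    "Distance": np.array([258, 1659, 2408], dtype="int64"),
    "DepDelay": np.array([-4.0, 39.0, np.nan], dtype="float64"),
})

# Downcast each column to the smallest dtype that can hold its values.
ints = toy.select_dtypes(include=["int64"]).apply(pd.to_numeric, downcast="unsigned")
floats = toy.select_dtypes(include=["float64"]).apply(pd.to_numeric, downcast="float")

print(ints.dtypes)
print(floats.dtypes)
```

Note that `downcast='unsigned'` only shrinks a column whose values are all non-negative, and that `float32` carries roughly seven significant digits, which should be enough for delays and times expressed in minutes.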

<a id='assess'></a>
### 2.2. Assess

<a id='clean'></a>
### 2.3. Clean

In [None]:
download_list = glob.glob('source/*.csv')

In [None]:
download_list.sort()

In [None]:
download_list[-1:]

In [None]:
df = pd.concat([pd.read_csv(f, encoding='latin-1') for f in download_list[:-1]], ignore_index = True)

In [None]:
df.head()

In [None]:
df.shape

In [None]:
# Calculate the time of execution
#start_3 = time.time()

#data = pd.read_csv(filepath[:-4])

# Calculate the time of execution
#end_3 = time.time()

#print('read df - execution time: ',end_3 - start_3, 'seconds')
#print('read df - execution time: ',(end_3 - start_3)/60, 'minutes')
#print('-----------------------------------------------------')
#print('total execution time: ',(end_1+end_2+end_3)-(start_1+start_2+start_3), 'seconds')
#print('total execution time: ',((end_1+end_2+end_3)-(start_1+start_2+start_3))/60, 'minutes')

### What is the structure of your dataset?

> Your answer here!

### What is/are the main feature(s) of interest in your dataset?

> Your answer here!

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

> Your answer here!

## Univariate Exploration

> In this section, investigate distributions of individual variables. If
you see unusual points or outliers, take a deeper look to clean things up
and prepare yourself to look at relationships between variables.

> Make sure that, after every plot or related series of plots, that you
include a Markdown cell with comments about what you observed, and what
you plan on investigating next.

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> Your answer here!

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> Your answer here!

## Bivariate Exploration

> In this section, investigate relationships between pairs of variables in your
data. Make sure the variables that you cover here have been introduced in some
fashion in the previous section (univariate exploration).

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> Your answer here!

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> Your answer here!

## Multivariate Exploration

> Create plots of three or more variables to investigate your data even
further. Make sure that your investigations are justified, and follow from
your work in the previous sections.

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> Your answer here!

### Were there any interesting or surprising interactions between features?

> Your answer here!
