# Cleaning Airport Departure Data

This notebook cleans monthly departure data retrieved from the ITA National Travel and Tourism Office (https://travel.trade.gov/research/monthly/departures/).  The data originally came as a mix of csv and xls files, which can be found in the "raw dowloaded data" folder.  These files were manually combined into a single csv for ease of use (copying directly from the website for some incomplete files, namely 1999/2000).  The combined file is "airport_departures_all.csv" in the "airline data" folder.

### Dependencies

First, we will load a number of useful packages.

In [40]:
#load dependencies
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt

### Load File
Now, we can load in the file for viewing.

In [41]:
#create filepath
filepath = os.path.join('..', 'airline data', 'airport_departures_all.csv')
print(filepath)

../airline data/airport_departures_all.csv


In [42]:
#load data into data frame
raw_data = pd.read_csv(filepath, header = None)
raw_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,255,256,257,258,259,260,261,262,263,264
0,Europe,500.1,490.1,710.6,645.1,918.1,1005.0,984.0,894.9,885.8,...,1064429,1212898,1684659,2139814,1867812,1627172,1676165,1201375,928855,1068009
1,Caribbean,312.8,341.1,357.7,343.3,325.9,348.0,409.7,381.8,244.9,...,822751,783076,714117,865847,959510,718092,356693,473086,567498,754622
2,Asia,259.6,222.6,266.4,243.4,293.1,282.5,271.3,268.5,247.5,...,513034,495884,477954,525875,504326,419586,420099,504961,507576,537118
3,South America,117.9,116.7,118.0,98.7,104.5,129.9,145.9,137.9,102.6,...,165836,148540,154436,193673,182796,161435,119148,137624,157642,222500
4,Central America,85.6,86.3,100.9,78.1,75.5,98.1,107.7,90.4,58.7,...,310182,252392,232366,336150,332634,222045,147044,169255,216656,319918
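As a side note, the later columns store counts as text with comma separators, which is handled further down. A minimal sketch of an alternative, assuming those values are quoted strings such as "1,064,429" in the file (the exact quoting is an assumption, and `sample_csv` is illustrative data rather than the real file): the `thousands` parameter of `pd.read_csv` can strip the separators at load time.

```python
import io
import pandas as pd

# Two illustrative rows in the same layout as the raw file (quoting is assumed).
sample_csv = 'Europe,500.1,"1,064,429"\nAsia,259.6,"513,034"\n'

# thousands="," tells the parser to drop grouping commas inside numeric fields.
sample = pd.read_csv(io.StringIO(sample_csv), header=None, thousands=",")
print(sample.dtypes)
```

If this holds for the real file, the later string-to-number conversion would be unnecessary, although the explicit conversion used in this notebook makes the problem easier to see.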


### Add column headers
We will add column headers: a "Region" column followed by one column per month (Jan. 1996 - Dec. 2017).

In [43]:
#create a list of years (1996 - 2017)
years = np.arange(1996, 2018, 1)

#create a list of months
months = ["January", "February", "March", "April", "May", "June", "July", "August", 
          "September", "October", "November", "December"]

In [44]:
#create a headers list
headers = []
headers.append("Region")

for year in years:
    for month in months:
        headers.append(month + " " + str(year))
        
headers

['Region',
 'January 1996',
 'February 1996',
 'March 1996',
 'April 1996',
 'May 1996',
 'June 1996',
 'July 1996',
 'August 1996',
 'September 1996',
 'October 1996',
 'November 1996',
 'December 1996',
 'January 1997',
 'February 1997',
 'March 1997',
 'April 1997',
 'May 1997',
 'June 1997',
 'July 1997',
 'August 1997',
 'September 1997',
 'October 1997',
 'November 1997',
 'December 1997',
 'January 1998',
 'February 1998',
 'March 1998',
 'April 1998',
 'May 1998',
 'June 1998',
 'July 1998',
 'August 1998',
 'September 1998',
 'October 1998',
 'November 1998',
 'December 1998',
 'January 1999',
 'February 1999',
 'March 1999',
 'April 1999',
 'May 1999',
 'June 1999',
 'July 1999',
 'August 1999',
 'September 1999',
 'October 1999',
 'November 1999',
 'December 1999',
 'January 2000',
 'February 2000',
 'March 2000',
 'April 2000',
 'May 2000',
 'June 2000',
 'July 2000',
 'August 2000',
 'September 2000',
 'October 2000',
 'November 2000',
 'December 2000',
 'January 2001',
 '
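The same header list can also be built from a date range, which avoids typing the month names by hand. This is only a sketch; `headers_alt` is a hypothetical name chosen so it does not clash with the `headers` list above.

```python
import pandas as pd

# Month starts from January 1996 through December 2017.
month_starts = pd.date_range(start="1996-01-01", end="2017-12-01", freq="MS")

# Hypothetical name; mirrors the notebook's `headers` list.
headers_alt = ["Region"] + [f"{d.month_name()} {d.year}" for d in month_starts]

# 22 years x 12 months, plus the Region column.
assert len(headers_alt) == 1 + 22 * 12
```

Checking the length against the number of columns in `raw_data` before assigning the headers would catch a missing or extra month early.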

In [45]:
#add column headers
raw_data.columns = headers
raw_data.head()

Unnamed: 0,Region,January 1996,February 1996,March 1996,April 1996,May 1996,June 1996,July 1996,August 1996,September 1996,...,March 2017,April 2017,May 2017,June 2017,July 2017,August 2017,September 2017,October 2017,November 2017,December 2017
0,Europe,500.1,490.1,710.6,645.1,918.1,1005.0,984.0,894.9,885.8,...,1064429,1212898,1684659,2139814,1867812,1627172,1676165,1201375,928855,1068009
1,Caribbean,312.8,341.1,357.7,343.3,325.9,348.0,409.7,381.8,244.9,...,822751,783076,714117,865847,959510,718092,356693,473086,567498,754622
2,Asia,259.6,222.6,266.4,243.4,293.1,282.5,271.3,268.5,247.5,...,513034,495884,477954,525875,504326,419586,420099,504961,507576,537118
3,South America,117.9,116.7,118.0,98.7,104.5,129.9,145.9,137.9,102.6,...,165836,148540,154436,193673,182796,161435,119148,137624,157642,222500
4,Central America,85.6,86.3,100.9,78.1,75.5,98.1,107.7,90.4,58.7,...,310182,252392,232366,336150,332634,222045,147044,169255,216656,319918


### Clean Data Types
We want to make sure every month column actually holds numbers.  The data files list amounts for years before 2000 in units of thousands, so we also need to multiply those columns by 1000 to match the later years.

In [46]:
#checkout data types
print(raw_data.dtypes)

Region             object
January 1996      float64
February 1996     float64
March 1996        float64
April 1996        float64
May 1996          float64
June 1996         float64
July 1996         float64
August 1996       float64
September 1996    float64
October 1996      float64
November 1996     float64
December 1996     float64
January 1997      float64
February 1997     float64
March 1997        float64
April 1997        float64
May 1997          float64
June 1997         float64
July 1997         float64
August 1997       float64
September 1997    float64
October 1997      float64
November 1997     float64
December 1997     float64
January 1998      float64
February 1998     float64
March 1998        float64
April 1998        float64
May 1998          float64
                   ...   
July 2015          object
August 2015        object
September 2015     object
October 2015       object
November 2015      object
December 2015      object
January 2016       object
February 201

Some of the later columns are stored as object (text) rather than numbers, so we need to convert them.

In [47]:
#find where this change occurs
raw_data.iloc[:, 47:52].dtypes

November 1999    float64
December 1999    float64
January 2000      object
February 2000     object
March 2000        object
dtype: object

In [48]:
#the change occurs after December 1999 (column index 48), so conversion starts at index 49
#strip the thousands-separator commas so that to_numeric can parse the values
#assign by column label so each converted column replaces the original object column
for col in raw_data.columns[49:]:
    raw_data[col] = pd.to_numeric(raw_data[col].str.replace(",", ""))
    
#verify it works
print(raw_data.iloc[:, 47:52].dtypes)

November 1999    float64
December 1999    float64
January 2000       int64
February 2000      int64
March 2000         int64
dtype: object
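The conversion above relies on knowing that the text columns start at index 49. A minimal sketch of a position-independent variant, using a small made-up frame (`demo`) in place of `raw_data`: select every non-numeric column other than "Region" and convert it.

```python
import pandas as pd

# Illustrative stand-in for raw_data; the comma formatting is assumed.
demo = pd.DataFrame({
    "Region": ["Europe", "Asia"],
    "December 1999": [885.8, 247.5],
    "January 2000": ["1,064,429", "513,034"],
})

# Every non-numeric column except the region label needs converting.
text_cols = demo.select_dtypes(exclude="number").columns.drop("Region")
for col in text_cols:
    demo[col] = pd.to_numeric(demo[col].str.replace(",", "", regex=False))

print(demo.dtypes)
```

This would keep working if more years were appended or the column order changed, at the cost of being less explicit about where the format switch happens.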


In [49]:
#multiply the data by 1000 for all years before 2000
for i in np.arange(1, 49):
    raw_data.iloc[:, i] = 1000 * raw_data.iloc[:, i]
    
#show results
raw_data.head()

Unnamed: 0,Region,January 1996,February 1996,March 1996,April 1996,May 1996,June 1996,July 1996,August 1996,September 1996,...,March 2017,April 2017,May 2017,June 2017,July 2017,August 2017,September 2017,October 2017,November 2017,December 2017
0,Europe,500100.0,490100.0,710600.0,645100.0,918100.0,1005000.0,984000.0,894900.0,885800.0,...,1064429,1212898,1684659,2139814,1867812,1627172,1676165,1201375,928855,1068009
1,Caribbean,312800.0,341100.0,357700.0,343300.0,325900.0,348000.0,409700.0,381800.0,244900.0,...,822751,783076,714117,865847,959510,718092,356693,473086,567498,754622
2,Asia,259600.0,222600.0,266400.0,243400.0,293100.0,282500.0,271300.0,268500.0,247500.0,...,513034,495884,477954,525875,504326,419586,420099,504961,507576,537118
3,South America,117900.0,116700.0,118000.0,98700.0,104500.0,129900.0,145900.0,137900.0,102600.0,...,165836,148540,154436,193673,182796,161435,119148,137624,157642,222500
4,Central America,85600.0,86300.0,100900.0,78100.0,75500.0,98100.0,107700.0,90400.0,58700.0,...,310182,252392,232366,336150,332634,222045,147044,169255,216656,319918
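The scaled columns remain floats while the later columns are integers. As a sketch only, again on a small made-up frame (`demo`), the pre-2000 columns could be selected by the year in their label and cast to integers after scaling; rounding first guards against floating-point noise from the multiplication.

```python
import pandas as pd

# Illustrative stand-in for raw_data after the headers have been added.
demo = pd.DataFrame({
    "Region": ["Europe"],
    "January 1996": [500.1],
    "January 2000": [1064429],
})

# Pick month columns whose trailing four-digit year is before 2000.
month_cols = demo.columns.drop("Region")
pre_2000 = [c for c in month_cols if int(c[-4:]) < 2000]

# Scale thousands to units, then round before casting to integers.
demo[pre_2000] = (demo[pre_2000] * 1000).round().astype("int64")
print(demo)
```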


### Verify cleanliness
We will check for null values and make sure all columns are the same length.

In [50]:
#check for any null values
raw_data.isnull().values.any()

False

In [51]:
#check for even column lengths
raw_data.count()

Region            10
January 1996      10
February 1996     10
March 1996        10
April 1996        10
May 1996          10
June 1996         10
July 1996         10
August 1996       10
September 1996    10
October 1996      10
November 1996     10
December 1996     10
January 1997      10
February 1997     10
March 1997        10
April 1997        10
May 1997          10
June 1997         10
July 1997         10
August 1997       10
September 1997    10
October 1997      10
November 1997     10
December 1997     10
January 1998      10
February 1998     10
March 1998        10
April 1998        10
May 1998          10
                  ..
July 2015         10
August 2015       10
September 2015    10
October 2015      10
November 2015     10
December 2015     10
January 2016      10
February 2016     10
March 2016        10
April 2016        10
May 2016          10
June 2016         10
July 2016         10
August 2016       10
September 2016    10
October 2016      10
November 2016

In [52]:
#count the columns that do not have exactly 10 values
sum(raw_data.count() != 10)

0
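The checks above can be gathered into a single function that fails loudly, which may be convenient if the cleaning is rerun on updated data. This is a sketch: `check_clean` is a hypothetical helper, not part of the notebook, and it adds a dtype check that the cells above perform only by eye.

```python
import pandas as pd

def check_clean(df, n_rows):
    """Hypothetical helper: raise AssertionError if the frame is not clean."""
    assert not df.isnull().values.any(), "found null values"
    assert (df.count() == n_rows).all(), "column lengths differ"
    # Every column other than Region should be numeric.
    non_numeric = df.drop(columns="Region").select_dtypes(exclude="number")
    assert non_numeric.empty, f"non-numeric columns: {list(non_numeric.columns)}"
    return True

# Illustrative frame; with the real data this would be check_clean(raw_data, 10).
demo = pd.DataFrame({"Region": ["Europe", "Asia"],
                     "January 1996": [500100.0, 259600.0]})
result = check_clean(demo, n_rows=2)
```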

### Save clean data
We seem to have reasonably usable data now.  We will save this in a new file.

In [53]:
#create a filepath
output_path = os.path.join('..', 'airline data', 'airport_data_cleaned.csv')

In [54]:
#save data (with no index)
raw_data.to_csv(output_path, encoding = 'utf-8', index = False)
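One way to confirm the save worked is to read the file back and compare it with the frame in memory. The sketch below does this with a small made-up frame and a temporary directory, so it does not touch the real output file; `demo` and `demo_cleaned.csv` are illustrative names.

```python
import os
import tempfile
import pandas as pd

# Illustrative stand-in for the cleaned data.
demo = pd.DataFrame({"Region": ["Europe", "Asia"],
                     "January 2000": [1064429, 513034]})

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_cleaned.csv")
    demo.to_csv(demo_path, encoding="utf-8", index=False)
    reloaded = pd.read_csv(demo_path)

# Compare column names and values after the round trip.
same_data = reloaded.to_dict("list") == demo.to_dict("list")
print(same_data)
```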