# Cleaning Airport Departure Data

This notebook cleans data retrieved from the ITA National Travel and Tourism Office (https://travel.trade.gov/research/monthly/departures/).  The data originally came as a mix of csv and xls files, which can be found in the "raw dowloaded data" folder.  These files were manually combined into a single csv for ease of use (for some incomplete files, namely 1999/2000, the values were copied directly from the website).  The combined file is "airport_departures_all.csv" in the "airline data" folder.

### Dependencies

First, we will load a number of useful packages.

In [1]:
#load dependencies
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt

### Load File
Now, we can load in the file for viewing.

In [2]:
#create filepath
filepath = os.path.join('..', 'airline data', 'airport_departures_all.csv')
print(filepath)

../airline data/airport_departures_all.csv


In [3]:
#load data into data frame
raw_data = pd.read_csv(filepath, header = None)
raw_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,255,256,257,258,259,260,261,262,263,264
0,Europe,500.1,490.1,710.6,645.1,918.1,1005.0,984.0,894.9,885.8,...,1064429,1212898,1684659,2139814,1867812,1627172,1676165,1201375,928855,1068009
1,Caribbean,312.8,341.1,357.7,343.3,325.9,348.0,409.7,381.8,244.9,...,822751,783076,714117,865847,959510,718092,356693,473086,567498,754622
2,Asia,259.6,222.6,266.4,243.4,293.1,282.5,271.3,268.5,247.5,...,513034,495884,477954,525875,504326,419586,420099,504961,507576,537118
3,South America,117.9,116.7,118.0,98.7,104.5,129.9,145.9,137.9,102.6,...,165836,148540,154436,193673,182796,161435,119148,137624,157642,222500
4,Central America,85.6,86.3,100.9,78.1,75.5,98.1,107.7,90.4,58.7,...,310182,252392,232366,336150,332634,222045,147044,169255,216656,319918


### Add column headers
We will add column headers: a region column followed by one column per month (Jan. 1996 - Dec. 2017).

In [4]:
#create a list of years (1996 - 2017)
years = np.arange(1996, 2018, 1)

#create a list of months
months = ["January", "February", "March", "April", "May", "June", "July", "August", 
          "September", "October", "November", "December"]

In [5]:
#create a headers list
headers = []
headers.append("Region")

for year in years:
    for month in months:
        headers.append(month + " " + str(year))
        
headers

['Region',
 'January 1996',
 'February 1996',
 'March 1996',
 'April 1996',
 'May 1996',
 'June 1996',
 'July 1996',
 'August 1996',
 'September 1996',
 'October 1996',
 'November 1996',
 'December 1996',
 'January 1997',
 'February 1997',
 'March 1997',
 'April 1997',
 'May 1997',
 'June 1997',
 'July 1997',
 'August 1997',
 'September 1997',
 'October 1997',
 'November 1997',
 'December 1997',
 'January 1998',
 'February 1998',
 'March 1998',
 'April 1998',
 'May 1998',
 'June 1998',
 'July 1998',
 'August 1998',
 'September 1998',
 'October 1998',
 'November 1998',
 'December 1998',
 'January 1999',
 'February 1999',
 'March 1999',
 'April 1999',
 'May 1999',
 'June 1999',
 'July 1999',
 'August 1999',
 'September 1999',
 'October 1999',
 'November 1999',
 'December 1999',
 'January 2000',
 'February 2000',
 'March 2000',
 'April 2000',
 'May 2000',
 'June 2000',
 'July 2000',
 'August 2000',
 'September 2000',
 'October 2000',
 'November 2000',
 'December 2000',
 'January 2001',
 ...
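The same header list can also be built with a single list comprehension. The sketch below is an alternative to the nested loop above, not the notebook's own method; `build_headers` is a hypothetical helper name. It also checks the expected length: one region column plus 22 years of 12 months gives 265 headers, matching the 265 columns (0 through 264) shown in the raw data preview.

```python
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

def build_headers(first_year, last_year):
    """Return ['Region', 'January <first_year>', ..., 'December <last_year>']."""
    return ["Region"] + [
        f"{month} {year}"
        for year in range(first_year, last_year + 1)  # last_year is inclusive
        for month in MONTHS
    ]

example_headers = build_headers(1996, 2017)
# 1 region column + 22 years * 12 months
print(len(example_headers))
print(example_headers[:3], example_headers[-1])
```

Comparing this length with `raw_data.shape[1]` before assigning `raw_data.columns` would catch a mismatch between the file and the assumed date range early.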

In [6]:
#add column headers
raw_data.columns = headers
raw_data.head()

Unnamed: 0,Region,January 1996,February 1996,March 1996,April 1996,May 1996,June 1996,July 1996,August 1996,September 1996,...,March 2017,April 2017,May 2017,June 2017,July 2017,August 2017,September 2017,October 2017,November 2017,December 2017
0,Europe,500.1,490.1,710.6,645.1,918.1,1005.0,984.0,894.9,885.8,...,1064429,1212898,1684659,2139814,1867812,1627172,1676165,1201375,928855,1068009
1,Caribbean,312.8,341.1,357.7,343.3,325.9,348.0,409.7,381.8,244.9,...,822751,783076,714117,865847,959510,718092,356693,473086,567498,754622
2,Asia,259.6,222.6,266.4,243.4,293.1,282.5,271.3,268.5,247.5,...,513034,495884,477954,525875,504326,419586,420099,504961,507576,537118
3,South America,117.9,116.7,118.0,98.7,104.5,129.9,145.9,137.9,102.6,...,165836,148540,154436,193673,182796,161435,119148,137624,157642,222500
4,Central America,85.6,86.3,100.9,78.1,75.5,98.1,107.7,90.4,58.7,...,310182,252392,232366,336150,332634,222045,147044,169255,216656,319918


In [7]:
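#load the travel spending data
transpo_data = os.path.join('..','Transpo Data','Travel_Spending.csv')

transpo_df = pd.read_csv(transpo_data, header=None)

#reuse the region/month headers built above
transpo_df.columns = headers
#transpo_df[transpo_df["Region"]=="NaN"].drop

#keep only the first three rows (loc slicing includes the end label); copy so later edits do not touch transpo_df
clean_transpo = transpo_df.loc[0:2].copy()

clean_transpo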

Unnamed: 0,Region,January 1996,February 1996,March 1996,April 1996,May 1996,June 1996,July 1996,August 1996,September 1996,...,March 2017,April 2017,May 2017,June 2017,July 2017,August 2017,September 2017,October 2017,November 2017,December 2017
0,Total U.S. Travel and Tourism,"$7,066","$7,078","$7,685","$7,053","$8,005","$7,739","$6,913","$7,156","$7,025",...,"$16,137","$16,227","$16,229","$16,305","$16,488","$16,283","$16,575","$16,435","$16,604","$16,568"
1,Travel,"$5,442","$5,434","$5,938","$5,487","$6,218","$6,017","$5,309","$5,493","$5,392",...,"$12,789","$12,942","$12,924","$12,917","$13,012","$12,975","$13,105","$13,001","$13,153","$13,163"
2,Passenger fares,"$1,624","$1,644","$1,747","$1,566","$1,787","$1,722","$1,604","$1,663","$1,633",...,"$3,348","$3,285","$3,305","$3,388","$3,476","$3,308","$3,470","$3,434","$3,451","$3,405"


In [8]:
#save cleaned transportation data as a csv
clean_transpo_path = os.path.join('..', 'Transpo Data', 'Travel_Spending_cleaned.csv')

clean_transpo.to_csv(clean_transpo_path, index = False, encoding = 'utf-8')

### Getting Totaled Data for Transportation data
We wish to total the spending for each year.  To do so, we first convert the dollar strings to numbers and then sum the monthly columns by year.

In [21]:
#read in csv
clean_transpo_df = pd.read_csv(clean_transpo_path)
clean_transpo_df

Unnamed: 0,Region,January 1996,February 1996,March 1996,April 1996,May 1996,June 1996,July 1996,August 1996,September 1996,...,March 2017,April 2017,May 2017,June 2017,July 2017,August 2017,September 2017,October 2017,November 2017,December 2017
0,Total U.S. Travel and Tourism,"$7,066","$7,078","$7,685","$7,053","$8,005","$7,739","$6,913","$7,156","$7,025",...,"$16,137","$16,227","$16,229","$16,305","$16,488","$16,283","$16,575","$16,435","$16,604","$16,568"
1,Travel,"$5,442","$5,434","$5,938","$5,487","$6,218","$6,017","$5,309","$5,493","$5,392",...,"$12,789","$12,942","$12,924","$12,917","$13,012","$12,975","$13,105","$13,001","$13,153","$13,163"
2,Passenger fares,"$1,624","$1,644","$1,747","$1,566","$1,787","$1,722","$1,604","$1,663","$1,633",...,"$3,348","$3,285","$3,305","$3,388","$3,476","$3,308","$3,470","$3,434","$3,451","$3,405"


In [22]:
#re-establish years (1996 - 2017)
years = np.arange(1996, 2018, 1)

In [23]:
#change all columns to numeric columns
for column in clean_transpo_df.columns[1:]:
    #remove thousands separators and dollar signs as literal characters (regex = False)
    stripped = clean_transpo_df[column].str.replace(',', '', regex = False).str.replace('$', '', regex = False)
    #assign the whole column so that its dtype actually changes
    clean_transpo_df[column] = pd.to_numeric(stripped)
    
clean_transpo_df

Unnamed: 0,Region,January 1996,February 1996,March 1996,April 1996,May 1996,June 1996,July 1996,August 1996,September 1996,...,March 2017,April 2017,May 2017,June 2017,July 2017,August 2017,September 2017,October 2017,November 2017,December 2017
0,Total U.S. Travel and Tourism,7066,7078,7685,7053,8005,7739,6913,7156,7025,...,16137,16227,16229,16305,16488,16283,16575,16435,16604,16568
1,Travel,5442,5434,5938,5487,6218,6017,5309,5493,5392,...,12789,12942,12924,12917,13012,12975,13105,13001,13153,13163
2,Passenger fares,1624,1644,1747,1566,1787,1722,1604,1663,1633,...,3348,3285,3305,3388,3476,3308,3470,3434,3451,3405
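The conversion step can be checked on a tiny example. The sketch below uses a made-up two-row frame in the same `"$7,066"` string format; `currency_to_number` is a hypothetical helper name, not something from the notebook. Passing `regex=False` makes `'$'` a literal character rather than a regular-expression end-of-string anchor.

```python
import pandas as pd

# Made-up sample in the same "$5,442" string format as the spending file
sample = pd.DataFrame({
    "Region": ["Travel", "Passenger fares"],
    "January 1996": ["$5,442", "$1,624"],
    "February 1996": ["$5,434", "$1,644"],
})

def currency_to_number(series):
    """Strip '$' and ',' from a string column and convert it to numbers."""
    cleaned = (series.str.replace("$", "", regex=False)
                     .str.replace(",", "", regex=False))
    return pd.to_numeric(cleaned)

# Convert every column except the first (the region labels)
for column in sample.columns[1:]:
    sample[column] = currency_to_number(sample[column])

print(sample.dtypes)
```

Assigning each column by name replaces it outright, so the numeric dtype is kept rather than being cast back to the old string column.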


In [24]:
#sum for all years and put in a new data frame
yearly_transpo_df = pd.DataFrame({'Totaled Amount': clean_transpo_df['Region']})

#add data for all years
for year in years:
    yearly_data = clean_transpo_df[[column for column in clean_transpo_df if str(year) in column]]
    
    yearly_total = yearly_data.sum(axis = 1)
    
    yearly_transpo_df[str(year)] = yearly_total
    
yearly_transpo_df

Unnamed: 0,Totaled Amount,1996,1997,1998,1999,2000,2001,2002,2003,2004,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
0,Total U.S. Travel and Tourism,90231,94294,91423,94875,102560,84630,77380,73779,87633,...,135577,116782,137839,152599.0,166109,180468,193829,202189,194878,196421
1,Travel,69809,73426,71325,75450,82363,67449,61089,58688,69701,...,104620,90679,106853,118645.0,126745,139454,149757,159942,155606,155808
2,Passenger fares,20422,20868,20098,19425,20197,17181,16291,15091,17932,...,30957,26103,30986,33954.0,39364,41014,44072,42247,39272,40613
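As an alternative to looping over the years, the monthly columns can be grouped by the year at the end of each header. This is a minimal sketch on made-up numbers (two months per year for brevity), assuming every value column is named like `"January 1996"`; it is not the notebook's own approach.

```python
import pandas as pd

# Made-up monthly values; real data would have twelve columns per year
monthly = pd.DataFrame({
    "Region": ["Travel", "Passenger fares"],
    "January 1996": [10, 1],
    "February 1996": [20, 2],
    "January 1997": [30, 3],
    "February 1997": [40, 4],
})

values = monthly.set_index("Region")

# Transpose so each month is a row, group the rows by the last token of
# the label ("January 1996" -> "1996"), sum, then transpose back
yearly = values.T.groupby(lambda label: label.split()[-1]).sum().T

print(yearly)
```

Grouping on the final token matches the year exactly, whereas a substring test such as `str(year) in column` would also match any header that happens to contain those digits elsewhere.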


The data is now totaled, but with the years as columns it is not easy to plot.  To make plotting easier, we first transpose the data so that each year is a row.

In [25]:
#first turn all columns to floats
for column in yearly_transpo_df.columns[1:]:
    yearly_transpo_df[column] = yearly_transpo_df[column].astype('float64')
yearly_transpo_df

Unnamed: 0,Totaled Amount,1996,1997,1998,1999,2000,2001,2002,2003,2004,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
0,Total U.S. Travel and Tourism,90231.0,94294.0,91423.0,94875.0,102560.0,84630.0,77380.0,73779.0,87633.0,...,135577.0,116782.0,137839.0,152599.0,166109.0,180468.0,193829.0,202189.0,194878.0,196421.0
1,Travel,69809.0,73426.0,71325.0,75450.0,82363.0,67449.0,61089.0,58688.0,69701.0,...,104620.0,90679.0,106853.0,118645.0,126745.0,139454.0,149757.0,159942.0,155606.0,155808.0
2,Passenger fares,20422.0,20868.0,20098.0,19425.0,20197.0,17181.0,16291.0,15091.0,17932.0,...,30957.0,26103.0,30986.0,33954.0,39364.0,41014.0,44072.0,42247.0,39272.0,40613.0


In [26]:
#index by totaled amount
yearly_transpo_df = yearly_transpo_df.set_index('Totaled Amount')

In [27]:
#transpose the data frame
yearly_transposed_df = yearly_transpo_df.transpose()

In [28]:
#show
yearly_transposed_df.head()

Totaled Amount,Total U.S. Travel and Tourism,Travel,Passenger fares
1996,90231.0,69809.0,20422.0
1997,94294.0,73426.0,20868.0
1998,91423.0,71325.0,20098.0
1999,94875.0,75450.0,19425.0
2000,102560.0,82363.0,20197.0


In [29]:
#reset the index so the years become a regular column
yearly_transposed_df = yearly_transposed_df.reset_index()
yearly_transposed_df.head()

Totaled Amount,index,Total U.S. Travel and Tourism,Travel,Passenger fares
0,1996,90231.0,69809.0,20422.0
1,1997,94294.0,73426.0,20868.0
2,1998,91423.0,71325.0,20098.0
3,1999,94875.0,75450.0,19425.0
4,2000,102560.0,82363.0,20197.0


In [30]:
#check data types
yearly_transposed_df.dtypes

Totaled Amount
index                             object
Total U.S. Travel and Tourism    float64
      Travel                     float64
      Passenger fares            float64
dtype: object

In [31]:
#rename index column as year and turn to numeric
yearly_transposed_df = yearly_transposed_df.rename(columns = {'index': 'Year'})

#change to numeric (assign the whole column so that its dtype changes)
yearly_transposed_df['Year'] = pd.to_numeric(yearly_transposed_df['Year']).astype('float64')

In [32]:
#check data types
yearly_transposed_df.dtypes

Totaled Amount
Year                             float64
Total U.S. Travel and Tourism    float64
      Travel                     float64
      Passenger fares            float64
dtype: object
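The reshaping steps above (set the index, transpose, reset the index, rename, convert) can be collected into a few lines. The sketch below runs them on a made-up two-year frame; the name `tidy` is hypothetical. It also clears the leftover `Totaled Amount` label that otherwise stays attached to the column axis after transposing.

```python
import pandas as pd

# Made-up yearly totals in the same wide layout: one row per category
wide = pd.DataFrame({
    "Totaled Amount": ["Travel", "Passenger fares"],
    "1996": [69809.0, 20422.0],
    "1997": [73426.0, 20868.0],
})

# After transposing, each year is a row and each category is a column
tidy = wide.set_index("Totaled Amount").transpose()
tidy.index.name = "Year"    # becomes the column name after reset_index
tidy.columns.name = None    # drop the leftover "Totaled Amount" axis label
tidy = tidy.reset_index()

# The years were column labels (strings), so convert them to numbers
tidy["Year"] = pd.to_numeric(tidy["Year"])

print(tidy)
```

With the year as an ordinary numeric column, each category can be plotted against `Year` directly.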

In [33]:
#save as a csv
output_path = os.path.join('..', 'Transpo Data', 'Transpo_data_yearly.csv')
yearly_transposed_df.to_csv(output_path, encoding = 'utf-8', index = False)