### Outline ###
    
1. Import datasets

2. Data cleaning + export cleaned datasets to `datasets/cleaned` (Note: index column name should always be "airport_code")

    2.1 Clean `datasets/original/crash` (unfinished data1)
    
    2.2 Clean `datasets/original/delay/2009.csv`. Export `unpleasant_2009.csv` and `unpleasant_airport_code.csv`.
    
    2.3 Clean `datasets/original/airport/airports-extended.csv`. Generate `airport_loc_df` with column names city_name, latitude, longitude, altitude_ft.
    
    2.4 Clean `datasets/original/city/uscities.csv`.
    
    2.5 Merge `unpleasant_airport_code` and `airport_loc_df` to get pre-`airport_prop_df`. Drop rows whose city_name isn't included in `us_cities_df`. Scrape pcp data with datasets in `datasets/original/weather`. Then we get `airport_prop_df`. Export `airport_prop.csv`.
    
    2.6 Clean `datasets/original/airport/Bird Strikes.xlsx`. Merge `bird_strike_avg_df` and `unpleasant_airport_code_df` to get `bird_strike_final_df`. Export it as `bird_strike.csv`.
    
    2.7 Generate population data (unfinished)
   
    
    

#### Note: ####
- 2.1 ~ 2.5 are from `A Pleasant Flight.ipynb` (data visualization part has not been added)
- 2.6 is from `Bird_Strike_Cleaning.ipynb`
- 2.7 is from `Population_Clean.ipynb`


### X column names: ###

airport_code (cities_need.csv) <- index

0: city_name (cities_need.csv)

1: city_id (airport_poneeded.csv)

2: latitude (airport_poneeded.csv)

3: longitude (airport_poneeded.csv)

4: altitude_ft (airport_poneeded.csv)

5: population (airport_poneeded.csv)

6: temp_avg (airport_poneeded.csv)

7: pcp_avg (airport_poneeded.csv)

8: strike_avg (birdstrike_cleaned.csv)

9: damage_avg (birdstrike_cleaned.csv)

10: enplanements_17 (commercial_service_enplanements.xlsx)

11: enplanements_18 (commercial_service_enplanements.xlsx)

12: enplanements_change (commercial_service_enplanements.xlsx)

unfinished: airport_name
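
A minimal sketch of how the X table could be assembled, assuming each cleaned dataset is loaded as a DataFrame indexed by `airport_code`. The two toy frames below are illustrative stand-ins (values copied from outputs later in this notebook), not the real files.

```python
import pandas as pd

# Toy stand-ins for two cleaned datasets; both are indexed by airport_code.
airport_prop = pd.DataFrame(
    {"city_name": ["allentown", "abilene"], "altitude_ft": [393, 1791]},
    index=pd.Index(["ABE", "ABI"], name="airport_code"),
)
bird_strike = pd.DataFrame(
    {"strike_avg": [5.25, 0.75], "damage_avg": [0.166667, 0.083333]},
    index=pd.Index(["ABE", "ABI"], name="airport_code"),
)

# Joining on the shared index keeps one row per airport.
X = airport_prop.join(bird_strike, how="inner")
print(X)
```

Keeping `airport_code` as the index in every cleaned file is what makes this join a one-liner.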

==========================================

In [1]:
import pandas as pd
import numpy as np
from numpy import nan as Nan
import matplotlib.pyplot as plt
import requests
from string import digits
import wget # you need to "pip install wget"  
import glob
import time

import plotly.graph_objects as go # use conda to install plotly --YD
import plotly.figure_factory as ff
import xml.etree.ElementTree as ET # to read one dataset in XML format

1. Import datasets

From `A Pleasant Flight.ipynb` (now in the folder `previous_codes`)

In [2]:
crash_df = pd.read_csv("datasets/original/crash/Airplane_Crashes_and_Fatalities_Since_1908.csv") #data1
etree = ET.parse("datasets/original/crash/AviationData.xml") #data2
delay_2009_df = pd.read_csv("datasets/original/delay/2009.csv") #data3 (done)
    #need to download 2009.csv from https://www.kaggle.com/yuanyuwendymu/airline-delay-and-cancellation-data-2009-2018

In [3]:
airport_loc_df = pd.read_csv("datasets/original/airport/airports-extended.csv") #data4
#https://www.kaggle.com/open-flights/airports-train-stations-and-ferry-terminals#airports-extended.csv

#airport_code: city_name, latitude, longitude, altitude_ft
us_cities_df = pd.read_csv("datasets/original/city/uscities.csv") #data5

From `Bird Strike Cleaning.ipynb` (now in the folder `previous_codes`)

In [4]:
bird_strike_df = pd.read_excel("datasets/original/airport/Bird Strikes.xlsx") #data6
airport_name_df = pd.read_excel("datasets/original/airport/airportcode.xlsx") #data7 
    #airport_name_df function: to map each airport_name to airport_code

==========================================

2.1 Clean `datasets/original/crash` (unfinished data1)

clean data2

In [5]:
xml_root = etree.getroot()

interest_columns = ["EventId","EventDate","Location","Country","Latitude","Longitude","AirportCode","InjurySeverity","AircraftDamage",
                    "AircraftCategory","NumberOfEngines","EngineType","Schedule","TotalUninjured","TotalMinorInjuries",
                    "TotalSeriousInjuries","TotalFatalInjuries","WeatherCondition","BroadPhaseOfFlight","RegistrationNumber","PurposeOfFlight"]

NTSB_crash_df = pd.DataFrame(columns=interest_columns) # NTSB_crash_df: dataframe which collects airport information. -- YD 02/28/2020

In [6]:
crash_rows = []
for elem in xml_root: # This loop will run only once
    for row in elem: 
        # Only care about business flights whose engine type is not 'Reciprocating'-- YD 02/28/2020
        if row.attrib["PurposeOfFlight"] != "Business":
            continue
        if row.attrib["EngineType"] == "Reciprocating": 
            continue
        # Empty attribute strings are stored as NaN
        crash_rows.append([row.attrib[interest] if row.attrib[interest] != "" else np.nan for interest in interest_columns])

# Build the DataFrame once: DataFrame.append was removed in pandas 2.0 and is slow when called per row
NTSB_crash_df = pd.DataFrame(crash_rows,columns=interest_columns)
            
# this may need to run for a while. It takes 18 seconds on my computer
#     NTSB_crash_df
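
A small self-contained sketch of the same filtering idea, run on a hypothetical three-row XML sample. The `<DATA>/<ROWS>/<ROW>` layout and the attribute values are assumptions made for illustration; only the attribute names come from `interest_columns`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real AviationData.xml layout may differ.
sample_xml = """
<DATA>
  <ROWS>
    <ROW EventId="1" PurposeOfFlight="Business" EngineType="Turbo Fan" AirportCode="" />
    <ROW EventId="2" PurposeOfFlight="Personal" EngineType="Turbo Fan" AirportCode="ABE" />
    <ROW EventId="3" PurposeOfFlight="Business" EngineType="Reciprocating" AirportCode="ABQ" />
  </ROWS>
</DATA>
"""

columns = ["EventId", "PurposeOfFlight", "EngineType", "AirportCode"]
root = ET.fromstring(sample_xml.strip())

kept = []
for rows in root:
    for row in rows:
        if row.attrib["PurposeOfFlight"] != "Business":
            continue
        if row.attrib["EngineType"] == "Reciprocating":
            continue
        # Empty attribute strings become None (the notebook uses NaN instead).
        kept.append({c: (row.attrib[c] or None) for c in columns})

print(kept)
```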

In [7]:
# NTSB_crash_df[(NTSB_crash_df["AircraftCategory"]=="Airplane") & (NTSB_crash_df["EngineType"]!="Reciprocating") & (NTSB_crash_df["Country"]=="United States")]

In [8]:
# (unfinished Crash Data Cleaning.ipynb)

==========================================

2.2 Clean `datasets/original/delay/2009.csv`. Export `unpleasant_2009.csv` and `unpleasant_airport_code.csv`.

clean data3

In [9]:
# DEPARTURE
unpleasant_2009_departure = pd.DataFrame()
unpleasant_2009_departure['total_departure'] = delay_2009_df.loc[:,["ORIGIN"]].groupby('ORIGIN').size()
unpleasant_2009_departure[["average_departure_delay","average_departure_taxi"]] = delay_2009_df.loc[:,["ORIGIN","DEP_DELAY","TAXI_OUT"]].groupby('ORIGIN').mean()
unpleasant_2009_departure['average_departure_cancelled'] = delay_2009_df.loc[:,["ORIGIN","CANCELLED"]].groupby('ORIGIN').mean()
unpleasant_2009_departure['averge_departure_distance'] = delay_2009_df.loc[:,["ORIGIN","DISTANCE"]].groupby('ORIGIN').mean()
    
#ARRIVAL
unpleasant_2009_arrival = pd.DataFrame()
unpleasant_2009_arrival['total_arrival'] = delay_2009_df.loc[:,["DEST"]].groupby('DEST').size()
unpleasant_2009_arrival[["average_arrival_delay","average_arrival_taxi"]] = delay_2009_df.loc[:,["DEST","ARR_DELAY","TAXI_IN"]].groupby('DEST').mean()
unpleasant_2009_arrival['average_arrival_diverted'] = delay_2009_df.loc[:,["DEST","DIVERTED"]].groupby('DEST').mean()
unpleasant_2009_arrival['averge_arrival_distance'] = delay_2009_df.loc[:,["DEST","DISTANCE"]].groupby('DEST').mean()
    
unpleasant_2009_departure['total_departure_lg10'] = unpleasant_2009_departure['total_departure'].apply(np.log10)
unpleasant_2009_arrival['total_arrival_lg10'] = unpleasant_2009_arrival['total_arrival'].apply(np.log10)

unpleasant_2009 = unpleasant_2009_departure.merge(unpleasant_2009_arrival,left_index=True,right_index=True)
unpleasant_2009.index.names = ["airport_code"]

In [10]:
unpleasant_2009

airport_code,total_departure,average_departure_delay,average_departure_taxi,average_departure_cancelled,averge_departure_distance,total_departure_lg10,total_arrival,average_arrival_delay,average_arrival_taxi,average_arrival_diverted,averge_arrival_distance,total_arrival_lg10
ABE,4034,5.877463,13.556651,0.020327,562.069658,3.605736,4037,5.126924,4.179300,0.002973,561.860045,3.606059
ABI,2490,8.685573,8.305099,0.023293,158.000000,3.396199,2490,10.406404,3.518443,0.001606,158.000000,3.396199
ABQ,35582,5.050531,9.963724,0.005312,641.429965,4.551230,35577,2.210595,5.242186,0.000675,642.158726,4.551169
ABY,995,7.742564,10.823409,0.021106,146.000000,2.997823,997,6.965271,3.421859,0.002006,146.000000,2.998695
ACK,342,16.478788,20.081818,0.035088,199.000000,2.534026,343,2.516616,5.556886,0.008746,198.647230,2.535294
...,...,...,...,...,...,...,...,...,...,...,...,...
WRG,722,8.484979,9.507163,0.033241,61.569252,2.858537,722,10.140376,3.793054,0.016620,61.569252,2.858537
WYS,340,-6.779412,8.191176,0.000000,273.000000,2.531479,340,-3.958702,3.523529,0.002941,273.000000,2.531479
XNA,13755,7.796676,13.595218,0.029953,526.853435,4.138461,13764,6.735768,4.960809,0.005522,527.071563,4.138745
YAK,719,8.250352,9.568214,0.011127,205.990264,2.856729,719,11.091808,3.847458,0.008345,206.009736,2.856729


export `unpleasant_2009.csv`

In [11]:
unpleasant_2009.to_csv("datasets/cleaned/unpleasant_2009.csv")

In [12]:
unpleasant_airport_code_df = unpleasant_2009[[]]
unpleasant_airport_code_df.to_csv("datasets/cleaned/unpleasant_airport_code.csv")
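
The departure statistics in this section can also be computed in a single pass with pandas named aggregation. This is a sketch on toy records that reuse the column names of `2009.csv`; the values are made up.

```python
import pandas as pd

# Toy flight records using the same column names as 2009.csv.
flights = pd.DataFrame({
    "ORIGIN": ["ABE", "ABE", "ABI"],
    "DEP_DELAY": [10.0, 0.0, 4.0],
    "TAXI_OUT": [12.0, 14.0, 8.0],
    "CANCELLED": [0.0, 1.0, 0.0],
})

# One groupby call produces every per-airport statistic.
departure = flights.groupby("ORIGIN").agg(
    total_departure=("DEP_DELAY", "size"),  # "size" counts rows, including NaN delays
    average_departure_delay=("DEP_DELAY", "mean"),
    average_departure_taxi=("TAXI_OUT", "mean"),
    average_departure_cancelled=("CANCELLED", "mean"),
)
departure.index.name = "airport_code"
print(departure)
```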

==========================================

2.3 Clean `datasets/original/airport/airports-extended.csv`. Generate `airport_loc_df` with column names city_name, latitude, longitude, altitude_ft.

In [13]:
#clean city_name
def clean_city_name(input_city):
    original = input_city
    input_city = str(input_city)
    input_city = input_city.strip()
    input_city = input_city.lower()
    
    input_city = input_city.replace(".","")
    input_city = input_city.replace("\\\\","")
    input_city = input_city.replace("-"," ")
    input_city = input_city.replace(" - "," ")
    input_city = input_city.replace("saint ","st")
    input_city = input_city.replace("east ","")
    input_city = input_city.replace("west ","")
    
    input_city = input_city.translate({ord(k): None for k in digits})
    
    if ('/' in input_city):
        input_city = input_city[:input_city.find('/')]
    if ('(' in input_city):
        input_city = input_city[:input_city.find('(')]
    if (',' in input_city):
        input_city = input_city[:input_city.find(',')]
    input_city = input_city.strip()   
    if (' ' in input_city):
        temp=input_city.find(' ')
        if (temp > 2):
            input_city = input_city[:input_city.find(' ')]
        else:
            if (input_city.find(' ',temp+1) != -1):
                input_city = input_city[temp+1:input_city.find(' ',temp+1)]
            else:
                input_city = input_city[temp+1:]
    input_city = input_city.strip()
    try:
        assert len(input_city) > 2
        assert input_city.replace(" ","").replace("'","").isalpha()
    except AssertionError:
        print("This city name is prehaps incorrect: ",original,input_city,len(original))
    return input_city

In [14]:
airport_loc_df.columns = ["ID","name","city_name","country","airport_code","code4","latitude","longitude","altitude_ft","UTC_offset","DST","timezone","type","information_source"]
airport_loc_df = airport_loc_df.loc[:,["city_name","country","airport_code","latitude","longitude","altitude_ft"]]
airport_loc_df = airport_loc_df[airport_loc_df["country"]=="United States"]
airport_loc_df = airport_loc_df.loc[:,["city_name","airport_code","latitude","longitude","altitude_ft"]]
airport_loc_df = airport_loc_df[airport_loc_df["airport_code"]!="\\N"] # drop rows whose airport_code is the missing-value marker \N
airport_loc_df = airport_loc_df.set_index("airport_code")

airport_loc_df["city_name"] = airport_loc_df["city_name"].apply(clean_city_name)

In [15]:
airport_loc_df.head()

airport_code,city_name,latitude,longitude,altitude_ft
BTI,barter,70.134003,-143.582001,2
K03,fort,70.613403,-159.860001,35
LUR,cape,68.875099,-166.110001,16
PIZ,point,69.732903,-163.005005,22
ITO,hilo,19.721399,-155.048004,38


In [16]:
#airport_loc_df.to_csv("datasets/cleaned/airport_loc.csv")

==========================================

2.4 Clean `datasets/original/city/uscities.csv`.

clean data5

In [17]:
us_cities_df.head()

Unnamed: 0,city,city_ascii,state_id,state_name,county_fips,county_name,county_fips_all,county_name_all,lat,lng,population,density,source,military,incorporated,timezone,ranking,zips,id
0,South Creek,South Creek,WA,Washington,53053,Pierce,53053,Pierce,46.9994,-122.3921,2500,125.0,polygon,False,True,America/Los_Angeles,3,98580 98387 98338,1840042075
1,Roslyn,Roslyn,WA,Washington,53037,Kittitas,53037,Kittitas,47.2507,-121.0989,947,84.0,polygon,False,True,America/Los_Angeles,3,98941 98068 98925,1840019842
2,Sprague,Sprague,WA,Washington,53043,Lincoln,53043,Lincoln,47.3048,-117.9713,441,163.0,polygon,False,True,America/Los_Angeles,3,99032,1840021107
3,Gig Harbor,Gig Harbor,WA,Washington,53053,Pierce,53053,Pierce,47.3352,-122.5968,9507,622.0,polygon,False,True,America/Los_Angeles,3,98332 98335,1840019855
4,Lake Cassidy,Lake Cassidy,WA,Washington,53061,Snohomish,53061,Snohomish,48.0639,-122.092,3591,131.0,polygon,False,True,America/Los_Angeles,3,98223 98258 98270,1840041959


In [18]:
us_cities_df = us_cities_df[["city","state_id","county_fips","county_name","population","density","lat","lng"]]
us_cities_df = us_cities_df.rename(columns = {"city":"city_name"})
us_cities_df["fips"] = us_cities_df["county_fips"]

def get_county_code(input_county):
    return int(input_county) % 1000

us_cities_df["county_fips"] = us_cities_df["county_fips"].apply(get_county_code)

us_cities_df["city_name"] = us_cities_df["city_name"].apply(clean_city_name)

This city name is prehaps incorrect:  Y-O Ranch o 9
This city name is prehaps incorrect:  Ho-Ho-Kus ho 9
This city name is prehaps incorrect:  St. Jo jo 6
This city name is prehaps incorrect:  Ty Ty ty 5
This city name is prehaps incorrect:  So-Hi hi 5
This city name is prehaps incorrect:  K. I. Sawyer i 12
This city name is prehaps incorrect:  G. L. García l 12
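
`get_county_code` keeps the last three digits of the five-digit county FIPS code. A minimal sketch of the full split, with zero padding so the pieces can be used directly in file names and URLs (`split_fips` is a hypothetical helper, not part of the notebook):

```python
def split_fips(fips):
    """Split a county FIPS code into (state, county) strings, zero-padded to 2 and 3 digits."""
    fips = int(fips)
    return str(fips // 1000).zfill(2), str(fips % 1000).zfill(3)

print(split_fips(53053))  # Pierce County, WA in the table above
print(split_fips(5007))   # a four-digit value, as happens when the leading zero is lost
```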


In [19]:
us_cities_df

Unnamed: 0,city_name,state_id,county_fips,county_name,population,density,lat,lng,fips
0,south,WA,53,Pierce,2500,125.0,46.9994,-122.3921,53053
1,roslyn,WA,37,Kittitas,947,84.0,47.2507,-121.0989,53037
2,sprague,WA,43,Lincoln,441,163.0,47.3048,-117.9713,53043
3,gig,WA,53,Pierce,9507,622.0,47.3352,-122.5968,53053
4,lake,WA,61,Snohomish,3591,131.0,48.0639,-122.0920,53061
...,...,...,...,...,...,...,...,...,...
28884,montrose,SD,87,McCook,442,414.0,43.7002,-97.1843,46087
28885,spearfish,SD,81,Lawrence,12675,272.0,44.4912,-103.8167,46081
28886,emery,SD,61,Hanson,455,384.0,43.6020,-97.6195,46061
28887,watertown,SD,29,Codington,21837,475.0,44.9094,-97.1532,46029
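
Because `clean_city_name` often keeps only the first word, different cities can collapse to the same cleaned name (for example `Lake Cassidy` becomes `lake`). A quick sketch for spotting such collisions; the second "Lake" row is an illustrative extra entry, not taken from the output above.

```python
from collections import Counter

# Raw name -> cleaned name, as produced by the first-word rule.
cleaned = {
    "Lake Cassidy": "lake",
    "Lake Stevens": "lake",   # illustrative extra row
    "Roslyn": "roslyn",
}

counts = Counter(cleaned.values())
ambiguous = sorted(name for name, n in counts.items() if n > 1)
print(ambiguous)
```

Ambiguous names like these are why step 2.5 compares latitude and longitude when a city name matches more than one row.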


==========================================

2.5 Merge `unpleasant_airport_code` and `airport_loc_df` to get pre-`airport_prop_df`. Drop rows whose city_name isn't included in `us_cities_df`. Scrape pcp data with datasets in `datasets/original/weather`. Then we get `airport_prop_df`. Export `airport_prop.csv`.

In [20]:
airport_prop_df = unpleasant_airport_code_df.merge(airport_loc_df,how='inner',left_index=True,right_index=True)

In [21]:
airport_prop_df

airport_code,city_name,latitude,longitude,altitude_ft
ABE,allentown,40.652100,-75.440804,393
ABI,abilene,32.411301,-99.681900,1791
ABQ,albuquerque,35.040199,-106.609001,5355
ABY,albany,31.535500,-84.194504,197
ACK,nantucket,41.253101,-70.060204,47
...,...,...,...,...
WRG,wrangell,56.484299,-132.369995,49
WYS,yellowstone,44.688400,-111.117996,6649
XNA,bentonville,36.281898,-94.306801,1287
YAK,yakutat,59.503300,-139.660004,33



In [23]:
city_climate_df = pd.DataFrame(columns=["airport_code","population","density","avg_temp_sp","avg_temp_su","avg_temp_fa","avg_temp_wi","avg_precipitation_sp","avg_precipitation_su","avg_precipitation_fa","avg_precipitation_wi"])
city_search_df = pd.DataFrame(columns=["airport_code","state_id","county_id","city_id","fips"])

In [24]:
def nearest_city(candidates,target_lat,target_lng,max_error=1.5):
    # Return the candidate row closest to the target (sum of absolute lat/lng differences, in degrees),
    # or None when no candidate lies within max_error.
    if candidates.empty:
        return None
    error = (candidates["lat"] - target_lat).abs() + (candidates["lng"] - target_lng).abs()
    if not (error.min() < max_error):
        return None
    return candidates.loc[error.idxmin()]

search_rows = []
for ind,row in airport_prop_df.iterrows():
    city = row["city_name"]
    target_lat = row["latitude"]
    target_lng = row["longitude"]
    target_cities = us_cities_df[us_cities_df["city_name"]==city]

    if (target_cities.shape[0] == 1):
        target_city = target_cities.iloc[0] # unique name match: accepted without a distance check
    else:
        target_city = nearest_city(target_cities,target_lat,target_lng)
        if target_city is None: # no usable name match: fall back to the nearest city overall
            target_city = nearest_city(us_cities_df,target_lat,target_lng)

    if target_city is None:
        print("No data for ",city)
        search_rows.append({"airport_code":ind,"state_id":np.nan,"county_id":np.nan,"city_id":np.nan,"fips":np.nan})
    else:
        county = str(target_city["county_fips"]).zfill(3) # zero-pad the county code to three digits
        search_rows.append({"airport_code":ind,"state_id":target_city["state_id"],"county_id":county,"city_id":target_city.name,"fips":target_city["fips"]})

# Build the DataFrame once; DataFrame.append was removed in pandas 2.0
city_search_df = pd.DataFrame(search_rows,columns=["airport_code","state_id","county_id","city_id","fips"])

#special case for DC
# .loc with a boolean mask writes to city_search_df itself; chained iloc[...][...] assignment may only modify a copy
is_dca = city_search_df["airport_code"]=="DCA"
city_search_df.loc[is_dca,"state_id"] = "MD"
city_search_df.loc[is_dca,"county_id"] = "511"

        
        
city_search_df = city_search_df[(city_search_df["state_id"]!="HI") & (city_search_df["state_id"]!="AK")]
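
The matching above measures closeness as `|Δlat| + |Δlng|` in degrees, which treats a degree of longitude the same at every latitude. As a possible alternative (a swapped-in technique, not what the notebook uses), the great-circle distance can be computed with the haversine formula. `haversine_km` and the 6371 km Earth radius are assumptions of this sketch.

```python
import math

def haversine_km(lat1, lng1, lat2, lng2, radius_km=6371.0):
    """Great-circle distance in kilometres between two (lat, lng) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# One degree of longitude is about half as long at 60 degrees north as at the equator.
print(haversine_km(0, 0, 0, 1))
print(haversine_km(60, 0, 60, 1))
```

With this metric the `1.5` degree threshold would need to be replaced by a distance in kilometres.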

In [25]:
city_search_df.head()

Unnamed: 0,airport_code,state_id,county_id,city_id,fips
0,ABE,PA,77,10988,42077
1,ABI,TX,441,5333,48441
2,ABQ,NM,1,3742,35001
3,ABY,GA,95,17309,13095
4,ACK,MA,19,19193,25019


scrape cities in `city_search_df`

In [26]:
def download_climate_data(state,county,year):
    # Note: `year` is only used in the output file name; the URLs below always request the 2000-2020 series.
    save_path = "datasets/original/weather/"
    fname = state + county + "_" + str(year) + ".csv"
    if (len(glob.glob(save_path + fname))==0): # skip counties that were already downloaded
        # wget.download fetches the file itself, so no separate requests.get call is needed
        URL = "https://www.ncdc.noaa.gov/cag/county/time-series/{}-{}-{}-all-1-2000-2020.csv?base_prd=true&begbaseyear=1901&endbaseyear=2000".format(state,county,"tavg")
        file = wget.download(URL,out=save_path + "tavg/tavg_" + fname)
        URL = "https://www.ncdc.noaa.gov/cag/county/time-series/{}-{}-{}-all-1-2000-2020.csv?base_prd=true&begbaseyear=1901&endbaseyear=2000".format(state,county,"pcp")
        file = wget.download(URL,out=save_path + "pcp/pcp_" + fname)

        tavg_df = pd.read_csv(save_path + "tavg/tavg_" + fname).iloc[4:]
        tavg_df.columns=["date","tavg","comp"]
        tavg = tavg_df.set_index("date")["tavg"]

        pcp_df = pd.read_csv(save_path + "pcp/pcp_" + fname).iloc[4:]
        pcp_df.columns=["date","pcp","comp"]
        pcp = pcp_df.set_index("date")["pcp"]

        pd.concat([tavg, pcp], axis=1).to_csv(save_path + fname)
        time.sleep(1) # not requesting too frequently
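
The two URLs in `download_climate_data` differ only in the parameter name. A small sketch that factors the template out; `build_noaa_url` is a hypothetical helper, the template is copied from the cell above, and whether the endpoint still serves this path is not verified here.

```python
NOAA_TEMPLATE = (
    "https://www.ncdc.noaa.gov/cag/county/time-series/"
    "{state}-{county}-{parameter}-all-1-2000-2020.csv"
    "?base_prd=true&begbaseyear=1901&endbaseyear=2000"
)

def build_noaa_url(state, county, parameter):
    """Return the county time-series URL for 'tavg' or 'pcp'."""
    if parameter not in ("tavg", "pcp"):
        raise ValueError("parameter must be 'tavg' or 'pcp'")
    # The county part is zero-padded to three digits, e.g. 77 -> 077.
    return NOAA_TEMPLATE.format(state=state, county=str(county).zfill(3), parameter=parameter)

print(build_noaa_url("PA", 77, "tavg"))
```

The template has no slot for a single year: the requested range is fixed at 2000-2020.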

In [27]:
counter = 0
for ind,row in city_search_df.iterrows():
    try:
        download_climate_data(row["state_id"],row["county_id"],2018)
        download_climate_data(row["state_id"],row["county_id"],2019)
    except Exception:
        print(row)
    counter+=1
    print("progress: {:.2f}%   Just done: {}".format(100 * counter / city_search_df.shape[0],row["airport_code"]),end="\r")


temp_pcp_rows = []
save_path = "datasets/original/weather/"
years = [2018,2019]
for ind,row in city_search_df.iterrows():
    state = row["state_id"]
    county = row["county_id"]
    
    try:
        tavg = 0
        pcp = 0
        for year in years:
            fname = state + county + "_" + str(year) + ".csv"
            temp_pcp = pd.read_csv(save_path + fname)
            tavg += temp_pcp["tavg"].mean()
            pcp += temp_pcp["pcp"].mean()
        tavg /= len(years)
        pcp /= len(years)
    except Exception: # missing file or missing state/county: record NaN for this airport
        tavg = np.nan
        pcp = np.nan
    temp_pcp_rows.append({"airport_code":row["airport_code"],"temp_avg":tavg,"pcp_avg":pcp})

# Build the DataFrame once; DataFrame.append was removed in pandas 2.0
temp_pcp_df = pd.DataFrame(temp_pcp_rows,columns=["airport_code","temp_avg","pcp_avg"])

progress: 100.00%   Just done: YUM
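
Each downloaded file holds the whole 2000-2020 monthly series, so the means above are taken over that full range. If the intent is an average over 2018 and 2019 only, the rows could be filtered by year first. A sketch on toy data, assuming the `date` column is an integer of the form YYYYMM (the values are made up):

```python
import pandas as pd

# Toy monthly series; `date` is assumed to be YYYYMM.
monthly = pd.DataFrame({
    "date": [201712, 201801, 201806, 201901, 202001],
    "tavg": [30.0, 28.0, 70.0, 32.0, 31.0],
    "pcp": [3.0, 2.0, 4.0, 6.0, 1.0],
})

def yearly_mean(df, years):
    """Mean tavg and pcp over the months that fall in the given years."""
    year = df["date"].astype(int) // 100
    selected = df[year.isin(years)]
    return selected[["tavg", "pcp"]].mean()

result = yearly_mean(monthly, [2018, 2019])
print(result)
```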

In [28]:
airport_prop_df

airport_code,city_name,latitude,longitude,altitude_ft
ABE,allentown,40.652100,-75.440804,393
ABI,abilene,32.411301,-99.681900,1791
ABQ,albuquerque,35.040199,-106.609001,5355
ABY,albany,31.535500,-84.194504,197
ACK,nantucket,41.253101,-70.060204,47
...,...,...,...,...
WRG,wrangell,56.484299,-132.369995,49
WYS,yellowstone,44.688400,-111.117996,6649
XNA,bentonville,36.281898,-94.306801,1287
YAK,yakutat,59.503300,-139.660004,33


In [29]:
airport_prop_df = airport_prop_df.merge(city_search_df.set_index("airport_code").loc[:,["city_id","fips"]],left_index=True,right_index=True)
airport_prop_df = airport_prop_df.merge(temp_pcp_df.set_index("airport_code"),left_index=True,right_index=True)

In [30]:
airport_prop_df

airport_code,city_name,latitude,longitude,altitude_ft,city_id,fips,temp_avg,pcp_avg
ABE,allentown,40.652100,-75.440804,393,10988,42077,51.902490,4.161037
ABI,abilene,32.411301,-99.681900,1791,5333,48441,64.488797,2.131411
ABQ,albuquerque,35.040199,-106.609001,5355,3742,35001,54.316598,0.928465
ABY,albany,31.535500,-84.194504,197,17309,13095,66.578008,4.069336
ACK,nantucket,41.253101,-70.060204,47,19193,25019,50.793361,3.703195
...,...,...,...,...,...,...,...,...
VLD,valdosta,30.782499,-83.276703,203,17227,13185,66.983817,4.138382
VPS,valparaiso,30.483200,-86.525398,87,2510,12091,67.168050,5.020456
WYS,yellowstone,44.688400,-111.117996,6649,26356,30031,40.186307,2.004440
XNA,bentonville,36.281898,-94.306801,1287,15451,5007,58.123651,3.907842


In [31]:
airport_prop_df.to_csv("datasets/cleaned/airport_prop.csv")

In [32]:
#bird_strike_df

==========================================

2.6 Clean `datasets/original/airport/Bird Strikes.xlsx`. Merge `bird_strike_avg_df` and `unpleasant_airport_code_df` to get `bird_strike_final_df`. Export it as `bird_strike.csv`.

In [33]:
bird_strike_df = bird_strike_df[["Airport: Name", "Effect: Indicated Damage"]]
bird_strike_df = bird_strike_df.rename(columns = {"Airport: Name": "airport_name", "Effect: Indicated Damage":"bird_strike_effect"})

In [34]:
bird_strike_df = bird_strike_df.dropna()
bird_strike_df = bird_strike_df.reset_index(drop = True)

In [35]:
bird_strike_df

Unnamed: 0,airport_name,bird_strike_effect
0,LAGUARDIA NY,Caused damage
1,DALLAS/FORT WORTH INTL ARPT,Caused damage
2,LAKEFRONT AIRPORT,No damage
3,SEATTLE-TACOMA INTL,No damage
4,NORFOLK INTL,No damage
...,...,...
25424,SACRAMENTO INTL,No damage
25425,REDDING MUNICIPAL,No damage
25426,ORLANDO INTL,No damage
25427,DETROIT METRO WAYNE COUNTY ARPT,No damage


In [36]:
airport_name_df = airport_name_df.dropna()
airport_name_df = airport_name_df.reset_index(drop = True)

In [37]:
airport_name_df

Unnamed: 0,airport_code,airport_name
0,ABE,Lehigh Valley International
1,ABI,Abilene Regional Airport
2,ABQ,Albuquerque International Sunport
3,ABY,Southwest Georgia Regional
4,ACK,Nantucket Memorial
...,...,...
381,WYS,West Yellowstone
382,XNA,Northwest Arkansas Regional
383,YAK,Yakutat
384,YKM,Yakima Air Terminal


In [38]:
#standardize airport names in bird_strike_df and airport_name_df so we can merge them later

def standardize_airport_name(string):
    
    string = string.lower()
    string = string.strip()
    # remove generic words so that only the distinguishing part of the name is left
    for word in ['intl', 'arpt', 'regional', 'airport', 'sunport', 'international', 'intercontinental']:
        string = string.replace(word, '')
        
    string = string.strip()
    
    return string

In [39]:
bird_strike_df['airport_name'] = bird_strike_df['airport_name'].apply(standardize_airport_name)
airport_name_df['airport_name'] = airport_name_df['airport_name'].apply(standardize_airport_name)
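
A possible variant of the name standardization, sketched with regular expressions. Unlike the function above it only removes whole words and collapses repeated spaces; these differences are choices of this sketch, and `standardize` is a hypothetical name.

```python
import re

# Generic words to drop, matched only as whole words.
_NOISE = re.compile(r"\b(intl|arpt|regional|airport|sunport|international|intercontinental)\b")

def standardize(name):
    """Lower-case the name, drop generic airport words and tidy up whitespace."""
    name = _NOISE.sub(" ", name.lower())
    return re.sub(r"\s+", " ", name).strip()

for raw in ["DALLAS/FORT WORTH INTL ARPT", "Albuquerque International Sunport", "Abilene Regional Airport"]:
    print(raw, "->", standardize(raw))
```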

In [40]:
bird_strike_df

Unnamed: 0,airport_name,bird_strike_effect
0,laguardia ny,Caused damage
1,dallas/fort worth,Caused damage
2,lakefront,No damage
3,seattle-tacoma,No damage
4,norfolk,No damage
...,...,...
25424,sacramento,No damage
25425,redding municipal,No damage
25426,orlando,No damage
25427,detroit metro wayne county,No damage


In [41]:
airport_name_df

Unnamed: 0,airport_code,airport_name
0,ABE,lehigh valley
1,ABI,abilene
2,ABQ,albuquerque
3,ABY,southwest georgia
4,ACK,nantucket memorial
...,...,...
381,WYS,west yellowstone
382,XNA,northwest arkansas
383,YAK,yakutat
384,YKM,yakima air terminal


In [42]:
# Quantify the strike data and the damage data separately

def check_strike (string):
    return 1

def check_damage (string):
    if 'Caused' in string:
        output = 1
    else:
        output = 0
    return output

bird_strike_df['strike'] = bird_strike_df['bird_strike_effect'].apply(check_strike)
bird_strike_df['damage'] = bird_strike_df['bird_strike_effect'].apply(check_damage)

In [43]:
bird_strike_df

Unnamed: 0,airport_name,bird_strike_effect,strike,damage
0,laguardia ny,Caused damage,1,1
1,dallas/fort worth,Caused damage,1,1
2,lakefront,No damage,1,0
3,seattle-tacoma,No damage,1,0
4,norfolk,No damage,1,0
...,...,...,...,...
25424,sacramento,No damage,1,0
25425,redding municipal,No damage,1,0
25426,orlando,No damage,1,0
25427,detroit metro wayne county,No damage,1,0


In [44]:
# Calculate how many strikes and strikes causing damage for each airport

# Select the column before summing so that both results have plain (single-level) column names
grouped_strike = bird_strike_df.groupby('airport_name')['strike'].sum().reset_index()
grouped_damage = bird_strike_df.groupby('airport_name')['damage'].sum().reset_index()

In [45]:
bird_strike_sum_df = pd.merge(grouped_strike, grouped_damage, on='airport_name')
bird_strike_sum_df.columns = ['airport_name', 'strike_sum','damage_sum']
#bird_strike_sum_df

In [46]:
bird_strike_avg_df = pd.merge(airport_name_df, bird_strike_sum_df, on='airport_name')

In [47]:
def average_sum(total):
    # yearly average over 2000-2011, i.e. 12 years
    output = total/(2011 - 2000 + 1)
    return output

bird_strike_avg_df['strike_avg'] = bird_strike_avg_df['strike_sum'].apply(average_sum)
bird_strike_avg_df['damage_avg'] = bird_strike_avg_df['damage_sum'].apply(average_sum)
bird_strike_avg_df = bird_strike_avg_df.drop(columns = ['strike_sum', 'damage_sum'])
#bird_strike_avg_df = bird_strike_avg_df.set_index('airport_code')

In [48]:
bird_strike_avg_df = bird_strike_avg_df.drop(columns = ["airport_name"])
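
The counting and averaging steps of this section, condensed into one sketch on toy records. The airport names and effects are taken from the outputs above; the row counts are made up, and the 12-year span follows `average_sum`.

```python
import pandas as pd

# Toy strike records with the renamed columns used in this section.
strikes = pd.DataFrame({
    "airport_name": ["laguardia ny", "laguardia ny", "norfolk"],
    "bird_strike_effect": ["Caused damage", "No damage", "No damage"],
})

N_YEARS = 2011 - 2000 + 1  # 12 years, as in average_sum

# Every row is one strike; a strike counts as damage when the effect mentions "Caused".
strikes["damage"] = strikes["bird_strike_effect"].str.contains("Caused").astype(int)
per_airport = strikes.groupby("airport_name").agg(
    strike_sum=("damage", "size"),
    damage_sum=("damage", "sum"),
)
per_airport["strike_avg"] = per_airport["strike_sum"] / N_YEARS
per_airport["damage_avg"] = per_airport["damage_sum"] / N_YEARS
print(per_airport)
```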

merge `unpleasant_airport_code_df` and `bird_strike_avg_df`

In [49]:
bird_strike_avg_df

Unnamed: 0,airport_code,strike_avg,damage_avg
0,ABE,5.250000,0.166667
1,ABI,0.750000,0.083333
2,ABQ,6.833333,0.250000
3,ABY,0.416667,0.083333
4,ACK,0.583333,0.083333
...,...,...,...
212,TYS,3.416667,0.166667
213,UNV,0.333333,0.166667
214,VLD,0.250000,0.166667
215,YAK,0.500000,0.250000


In [50]:
unpleasant_airport_code_df.reset_index()

Unnamed: 0,airport_code
0,ABE
1,ABI
2,ABQ
3,ABY
4,ACK
...,...
291,WRG
292,WYS
293,XNA
294,YAK


In [51]:
bird_strike_final_df = pd.merge(bird_strike_avg_df, unpleasant_airport_code_df.reset_index(), how='right')
bird_strike_final_df = bird_strike_final_df.fillna(0)
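
After the right merge, `fillna(0)` makes an airport whose name could not be matched look the same as an airport with zero recorded strikes. A sketch of one way to keep the two cases apart using the `indicator` option of `pd.merge`; the `has_strike_data` column and the choice of `WRG` as the unmatched airport are hypothetical.

```python
import pandas as pd

strike_avg = pd.DataFrame({
    "airport_code": ["ABE", "ABI"],
    "strike_avg": [5.25, 0.75],
    "damage_avg": [0.166667, 0.083333],
})
codes = pd.DataFrame({"airport_code": ["ABE", "ABI", "WRG"]})

# indicator=True adds a _merge column saying where each row came from.
final = pd.merge(strike_avg, codes, on="airport_code", how="right", indicator=True)
final["has_strike_data"] = final["_merge"] == "both"
final = final.drop(columns="_merge").fillna(0).set_index("airport_code")
print(final)
```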

In [52]:
bird_strike_final_df = bird_strike_final_df.set_index("airport_code")

In [53]:
bird_strike_final_df.to_csv("datasets/cleaned/bird_strike.csv")

==========================================

2.7 Generate population data (unfinished)