### Cleaning the US I94 ports in Non-US region


The dictionary entries were cleaned as follows:

1. Each entry's airport/crossing, city, state or province, and country are identified.

    Reason: to match against the airport codes data.
    
   For example:
   
   'TOK'	=	'Torokina #ARPRT, New Guinea'
   
   is changed to:
   
   'TOK'	=	'Torokina Airport, Torokina, Bougainville, New Guinea'
   
   'Code'	=	Airport/Crossing, City, State, Country
   
   This matches the municipality 'Torokina' in the airport codes data.
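
The target layout can be parsed with a small helper. The following is a minimal sketch, assuming every cleaned line has the form `'CODE'<tab>=<tab>'[Airport/Crossing, ]City, State, Country'`. The name `parse_port_line` is hypothetical and is not used elsewhere in this notebook.

```python
def parse_port_line(line):
    """Split one cleaned dictionary line into its fields.

    Hypothetical helper: assumes the form
    'CODE'<TAB>=<TAB>'[Airport/Crossing, ]City, State, Country'.
    """
    code, description = line.rstrip("\n").split("\t=\t")
    parts = [p.strip() for p in description.strip().strip("'").split(",")]
    if len(parts) not in (3, 4):
        raise ValueError(f"expected 3 or 4 comma-separated fields: {line!r}")
    # The airport/crossing name is optional; pad it with an empty string.
    if len(parts) == 3:
        parts.insert(0, "")
    port, city, state, country = parts
    return {"code": code.strip().strip("'"), "port": port, "locality": city,
            "province": state, "territory": country}


example = "'TOK'\t=\t'Torokina Airport, Torokina, Bougainville, New Guinea'"
print(parse_port_line(example))
```

Raising on an unexpected field count, rather than silently padding, makes malformed dictionary lines visible before they reach the dataframe.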
   

In [1]:
import pandas as pd
import numpy as np
import us

In [2]:
I94_port_codes = []
I94_ac = []
I94_city = []
I94_state = []
I94_country = []

In [3]:
with open('../I94_Ports_Non_US.txt') as f:
    for line in f:
        # Each line looks like: 'CODE'<tab>=<tab>'[Airport/Crossing, ]City, State, Country'
        temp = line.strip('\n').split("\t=\t")
        I94_port_codes.append(temp[0].strip("'"))
        exp = temp[1].strip("'").split(",")
        # Read the fields from the right, since the airport/crossing is optional.
        I94_country.append(exp[-1].strip())
        I94_state.append(exp[-2].strip())
        I94_city.append(exp[-3].strip())
        if len(exp)==4:
            I94_ac.append(exp[-4].strip())
        else:
            I94_ac.append("")
#         print(I94_port_codes)
#         print(I94_country)
#         print(I94_state)
#         print(I94_city)
#         print(I94_ac)
#         break

In [4]:
print(len(I94_port_codes))
print(len(I94_country))
print(len(I94_state))
print(len(I94_city))
print(len(I94_ac))

57
57
57
57
57


In [5]:
df_I94_nonUS = pd.DataFrame({'code':I94_port_codes, 
                           'port': I94_ac, 
                           'locality': I94_city, 
                           'province': I94_state, 
                           'territory': I94_country})

In [6]:
df_I94_nonUS

Unnamed: 0,code,port,locality,state,country
0,CLG,,Calgary,Alberta,Canada
1,EDA,,Edmonton,Alberta,Canada
2,YHC,,Hakai pass,British Columbia,Canada
3,HAL,,Halifax,Nova Scotia,Canada
4,MON,,Montreal,Quebec,Canada
5,OTT,,Ottawa,Ontario,Canada
6,YXE,,Saskatoon,Saskatchewan,Canada
7,TOR,,Toronto,Ontario,Canada
8,VCV,,Vancouver,British Columbia,Canada
9,VIC,,Victoria,British Columbia,Canada


### Cleaning the data dictionary for the I94 port codes in the US region

1. Separated the US ports from the non-US ports.
2. Expanded abbreviations such as 'INTL' to International and 'ARPRT' to Airport.
3. Replaced combined airport entries such as:

        OAKLAND COUNTY - PONTIAC AIRPORT, PONTIAC, MI AS OAKLAND COUNTY - PONTIAC, MI
4. Corrected typos:

        HAMIIN, ME corrected to HAMLIN, ME
5. Added the city name when the I94 port definition contains only an airport name:
        
        Given: BILOXI REGIONAL, MS
        Changed to: BILOXI REGIONAL AIRPORT, BILOXI, MS

6. Identified crossings and bridges in the I94 ports.
7. Made sure that every I94 port has an associated city, marked as the entering city/town. This is referred to as the locality in the following sections.
8. Each entry's airport/crossing, city, and state or province is identified.

    Reason: to match against the airport codes data.
    
   For example:
   
   'MOS'	=	'MOSES POINT INTERMEDIATE, AK'
   
   is changed to:
   
   'MOS'	=	'Moses Point Intermediate Airport, MOSES POINT, AK'
   
   'Code'	=	Airport/Crossing, City, State
   
   This matches the municipality 'Moses Point' in the airport codes data.
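
The abbreviation expansion in step 2 can be sketched as follows. This is an illustration only: the mapping contains just the two abbreviations named above, the optional leading `#` is assumed from the `#ARPRT` form seen in the non-US example, and `expand_abbreviations` is a hypothetical name.

```python
import re

# Assumed abbreviation map; only 'INTL' and 'ARPRT' come from the notes above.
ABBREVIATIONS = {"INTL": "International", "ARPRT": "Airport"}

# Match whole words only, optionally preceded by a '#' marker.
_PATTERN = re.compile(r"#?\b(" + "|".join(ABBREVIATIONS) + r")\b")


def expand_abbreviations(text):
    """Replace each known abbreviation with its full form."""
    return _PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1)], text)


print(expand_abbreviations("Torokina #ARPRT, New Guinea"))
```

The word boundaries keep the pattern from rewriting longer words that merely contain an abbreviation.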


In [7]:
I94_port_codesX = []
I94_acX = []
I94_cityX = []
I94_stateX = []
I94_countryX = []

In [8]:
with open('../I94_Ports_US.txt') as f:
    for line in f:
        temp = line.strip('\n').split("\t=\t")
        I94_port_codesX.append(temp[0].strip("'"))
        exp = temp[1].strip().strip("'").split(",")
        I94_countryX.append("United States")
        I94_stateX.append(us.states.lookup(exp[-1].strip()).name)
        I94_cityX.append(" ".join([elem.capitalize() for elem in exp[-2].strip().split(" ")]))
        if len(exp)==3:
            I94_acX.append(exp[-3].strip())
        else:
            I94_acX.append("")
        if len(exp)>3:
            print(line)
#         print(I94_port_codesX)
#         print(I94_countryX)
#         print(I94_stateX)
#         print(I94_cityX)
#         print(I94_acX)
#         break

In [9]:
df_us = pd.DataFrame({'code':I94_port_codesX, 
                           'port': I94_acX, 
                           'locality': I94_cityX, 
                           'province': I94_stateX, 
                           'territory': I94_countryX})

In [10]:
pd.set_option('display.max_rows', 600)
df_us.head()

Unnamed: 0,code,port,locality,state,country
0,ALC,,Alcan,Alaska,United States
1,ANC,,Anchorage,Alaska,United States
2,BAR,Baker AAF,Baker Island,Alaska,United States
3,DAC,Daltons Cache,Haines,Alaska,United States
4,PIZ,DEW Station,Point Lay,Alaska,United States


### Finally, adding the not-reported or unidentified ports

'XXX'	=	'NOT REPORTED/UNKNOWN, NOT REPORTED/UNKNOWN  ' 

'888'	=	'UNIDENTIFED AIR / SEAPORT, UNIDENTIFED AIR / SEAPORT'

'UNK'	=	'UNKNOWN POE, UNKNOWN POE           '

In [11]:
I94_port_codesX += ['XXX', '888', 'UNK']
I94_acX += ['NOT REPORTED/UNKNOWN', 'UNIDENTIFED AIR / SEAPORT', 'UNKNOWN POE']
I94_cityX += ['NOT REPORTED/UNKNOWN', 'UNIDENTIFED AIR / SEAPORT', 'UNKNOWN POE']
I94_stateX += ['NOT REPORTED/UNKNOWN', 'UNIDENTIFED AIR / SEAPORT', 'UNKNOWN POE']
I94_countryX += ['NOT REPORTED/UNKNOWN', 'UNIDENTIFED AIR / SEAPORT', 'UNKNOWN POE']

In [12]:
df_us = pd.DataFrame({'code':I94_port_codesX, 
                           'port': I94_acX, 
                           'locality': I94_cityX, 
                           'province': I94_stateX, 
                           'territory': I94_countryX})

In [13]:
df_us.tail()

Unnamed: 0,code,port,locality,state,country
536,GTF,,International Falls,Minnesota,United States
537,INL,,International Falls,Minnesota,United States
538,XXX,NOT REPORTED/UNKNOWN,NOT REPORTED/UNKNOWN,NOT REPORTED/UNKNOWN,NOT REPORTED/UNKNOWN
539,888,UNIDENTIFED AIR / SEAPORT,UNIDENTIFED AIR / SEAPORT,UNIDENTIFED AIR / SEAPORT,UNIDENTIFED AIR / SEAPORT
540,UNK,UNKNOWN POE,UNKNOWN POE,UNKNOWN POE,UNKNOWN POE


### Merging the US and non-US ports into one dataframe and saving it

In [14]:
df_ports = pd.concat([df_I94_nonUS, df_us], ignore_index=True)

In [15]:
df_ports.head()

Unnamed: 0,code,port,locality,state,country
0,CLG,,Calgary,Alberta,Canada
1,EDA,,Edmonton,Alberta,Canada
2,YHC,,Hakai pass,British Columbia,Canada
3,HAL,,Halifax,Nova Scotia,Canada
4,MON,,Montreal,Quebec,Canada


In [16]:
df_ports.tail()

Unnamed: 0,code,port,locality,state,country
593,GTF,,International Falls,Minnesota,United States
594,INL,,International Falls,Minnesota,United States
595,XXX,NOT REPORTED/UNKNOWN,NOT REPORTED/UNKNOWN,NOT REPORTED/UNKNOWN,NOT REPORTED/UNKNOWN
596,888,UNIDENTIFED AIR / SEAPORT,UNIDENTIFED AIR / SEAPORT,UNIDENTIFED AIR / SEAPORT,UNIDENTIFED AIR / SEAPORT
597,UNK,UNKNOWN POE,UNKNOWN POE,UNKNOWN POE,UNKNOWN POE
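
Before saving, it may be worth checking that no port code appears more than once after the concatenation. The sketch below uses a toy dataframe named `demo` as a stand-in for `df_ports`; its rows are illustrative and are not taken from the real data.

```python
import pandas as pd

# Toy stand-in for df_ports; the rows are illustrative, not the real data.
demo = pd.DataFrame({
    "code": ["CLG", "EDA", "XXX", "CLG"],
    "locality": ["Calgary", "Edmonton", "NOT REPORTED/UNKNOWN", "Calgary"],
})

# keep=False marks every occurrence of a repeated code, not only the later ones.
duplicates = demo[demo["code"].duplicated(keep=False)]
print(duplicates)
print("codes unique:", demo["code"].is_unique)
```

The same two expressions can be applied to `df_ports` directly.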


In [17]:
df_ports.to_csv('../Cleaned Data/I94_ports.csv', index=False)

In [19]:
df_ports_valid_codes = df_ports[['code']]

In [20]:
# df_ports_valid_codes.to_csv('../Cleaned Data/I94_ports_code.csv', index=False)