# Project Title
### Data Engineering Capstone Project

#### Project Summary
--describe your project at a high level--

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
# Do all imports and installs here
import pandas as pd
import numpy as np

import datetime

### Step 1: Scope the Project and Gather Data

#### Scope 
Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use?

#### Describe and Gather Data 
Describe the data sets you're using. Where did they come from? What type of information is included?

In [2]:
# Read in the data here

## For eventual Spark read, SEE THIS THREAD: https://knowledge.udacity.com/questions/602336


fname = 'immigration_data_sample.csv'
df_immig_data = pd.read_csv(fname)
df_immig_data.shape

(1000, 29)

In [3]:
# Spark SQL: date_add('1960-01-01',arrdate) as arrdate

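# SAS numeric dates count days from 1960-01-01; pd.datetime was removed from pandas, so use pd.Timestamp
df_immig_data['arrdate'] = pd.to_timedelta(df_immig_data['arrdate'], unit='D') + pd.Timestamp(1960, 1, 1)
df_immig_data['depdate'] = pd.to_timedelta(df_immig_data['depdate'], unit='D') + pd.Timestamp(1960, 1, 1)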
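
The conversion above mirrors the Spark SQL `date_add('1960-01-01', arrdate)` noted in the cell comment: the value is a day count from the SAS epoch. A minimal sketch of the same idea as a reusable helper follows; the name `sas_days_to_datetime` and the sample values are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# SAS stores dates as a day count from this epoch.
SAS_EPOCH = pd.Timestamp("1960-01-01")


def sas_days_to_datetime(days):
    """Convert a Series of SAS day counts to datetimes; NaN becomes NaT."""
    return SAS_EPOCH + pd.to_timedelta(days, unit="D")


# Illustrative values only: one day count and one missing entry.
sample = pd.Series([20566.0, None])
converted = sas_days_to_datetime(sample)
print(converted)
```

Missing day counts (for example, travellers with no recorded departure) pass through as `NaT`, so no separate null handling should be needed before the conversion.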

In [4]:
df_immig_data['dtaddto'] = pd.to_datetime(df_immig_data['dtaddto'], dayfirst=False, yearfirst=False, format='%m%d%Y', errors='coerce')

In [5]:
with pd.option_context('display.max_rows', 5, 'display.max_columns', None): 
    display(df_immig_data.head())

Unnamed: 0.1,Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,i94bir,i94visa,count,dtadfile,visapost,occup,entdepa,entdepd,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,2027561,4084316.0,2016.0,4.0,209.0,209.0,HHW,2016-04-22,1.0,HI,2016-04-29,61.0,2.0,1.0,20160422,,,G,O,,M,1955.0,2016-07-20,F,,JL,56582670000.0,00782,WT
1,2171295,4422636.0,2016.0,4.0,582.0,582.0,MCA,2016-04-23,1.0,TX,2016-04-24,26.0,2.0,1.0,20160423,MTR,,G,R,,M,1990.0,2016-10-22,M,,*GA,94362000000.0,XBLNG,B2
2,589494,1195600.0,2016.0,4.0,148.0,112.0,OGG,2016-04-07,1.0,FL,2016-04-27,76.0,2.0,1.0,20160407,,,G,O,,M,1940.0,2016-07-05,M,,LH,55780470000.0,00464,WT
3,2631158,5291768.0,2016.0,4.0,297.0,297.0,LOS,2016-04-28,1.0,CA,2016-05-07,25.0,2.0,1.0,20160428,DOH,,G,O,,M,1991.0,2016-10-27,M,,QR,94789700000.0,00739,B2
4,3032257,985523.0,2016.0,4.0,111.0,111.0,CHM,2016-04-06,3.0,NY,2016-04-09,19.0,2.0,1.0,20160406,,,Z,K,,M,1997.0,2016-07-04,F,,,42322570000.0,LAND,WT


In [6]:
df_immig_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 29 columns):
Unnamed: 0    1000 non-null int64
cicid         1000 non-null float64
i94yr         1000 non-null float64
i94mon        1000 non-null float64
i94cit        1000 non-null float64
i94res        1000 non-null float64
i94port       1000 non-null object
arrdate       1000 non-null datetime64[ns]
i94mode       1000 non-null float64
i94addr       941 non-null object
depdate       951 non-null datetime64[ns]
i94bir        1000 non-null float64
i94visa       1000 non-null float64
count         1000 non-null float64
dtadfile      1000 non-null int64
visapost      382 non-null object
occup         4 non-null object
entdepa       1000 non-null object
entdepd       954 non-null object
entdepu       0 non-null float64
matflag       954 non-null object
biryear       1000 non-null float64
dtaddto       987 non-null datetime64[ns]
gender        859 non-null object
insnum        35 non-null float64


In [7]:
df_immig_data.columns

Index(['Unnamed: 0', 'cicid', 'i94yr', 'i94mon', 'i94cit', 'i94res', 'i94port',
       'arrdate', 'i94mode', 'i94addr', 'depdate', 'i94bir', 'i94visa',
       'count', 'dtadfile', 'visapost', 'occup', 'entdepa', 'entdepd',
       'entdepu', 'matflag', 'biryear', 'dtaddto', 'gender', 'insnum',
       'airline', 'admnum', 'fltno', 'visatype'],
      dtype='object')

In [8]:
# Definite keeper cols: 
keeper_cols_nomissing_vals = ['i94yr', 'i94mon', 'i94cit', 'i94res', 'i94port', 'arrdate', 'i94mode', 'i94bir', 'i94visa', 'biryear', 'visatype']

# Keepers, but missing values: 
##  depdate: expected (some people haven't left yet)
##  entdepd: expected (same reason)
##  matflag: expected (same reason)
##  dtaddto: may be able to fill in the missing values (13 rows) using 'arrdate' and typical stay time allowed by 'visatype'
##  gender: should impute probabilistically based on stats of non-missing values
##  airline: expected (only applies with i94mode = 1, arrival by air; no missing values for that mode); impute to 'None/Unknown' for i94mode in [2, 3, 9]
keepers_missing_vals = ['i94addr', 'depdate' , 'entdepd', 'matflag', 'dtaddto', 'gender', 'airline']
                                          
# Deemed unnecessary: 
drop_cols_unnecess = ['Unnamed: 0', 'cicid', 'count', 'dtadfile', 'entdepa', 'entdepu', 'insnum', 'admnum', 'fltno']

# Dropping due to sparsity: 
drop_cols_sparsity = ['visapost', 'occup']
 
keeper_cols = keeper_cols_nomissing_vals + keepers_missing_vals
drop_cols = drop_cols_unnecess + drop_cols_sparsity

In [9]:
np.setdiff1d(df_immig_data.columns,  keeper_cols+drop_cols)

array([], dtype=object)
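
The keep/drop lists above were assembled by reading the `info()` output. One way to back that judgment with numbers is to compute the share of missing values per column and flag anything above a cutoff. This is only a sketch: the helper `missing_fraction`, the toy frame, and the 0.5 threshold are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def missing_fraction(df):
    """Return the share of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


# Toy frame standing in for the immigration sample (values are made up).
demo = pd.DataFrame({
    "i94port": ["HHW", "MCA", "OGG", "LOS"],
    "gender": ["F", None, "M", "M"],
    "occup": [np.nan, np.nan, np.nan, "STU"],
})

frac = missing_fraction(demo)
SPARSITY_THRESHOLD = 0.5  # assumed cutoff, not taken from the notebook
sparse_cols = frac[frac > SPARSITY_THRESHOLD].index.tolist()
print(frac)
print(sparse_cols)
```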

In [10]:
len(keeper_cols)

18

In [11]:
df_immig_data = df_immig_data.drop(columns=drop_cols)
df_immig_data.shape

(1000, 18)

In [12]:
df_immig_data.loc[df_immig_data['i94mode'].isin([2.0, 3.0, 9.0]), ['airline']] = 'None/Unknown'

In [13]:
# ## Exclude "Temporary" visitors:
# # B2: Temporary visitors for pleasure
# # WT: Visa Waiver Program – temporary visitors for pleasure
# # GT: Guam Visa Waiver Program – temporary visitors for pleasure to Guam
# # GMT: Guam - Commonwealth of Northern Mariana Islands (CNMI) Visa Waiver Program - temporary visitors for pleasure to Guam or Northern Mariana Islands
# # B1: Temporary visitors for business
# # WB: Visa Waiver Program – temporary visitors for business
# # GB: Guam Visa Waiver Program – temporary visitors for business to Guam
# # GMB: Guam - Commonwealth of Northern Mariana Islands (CNMI) Visa Waiver Program - temporary visitors for business to Guam or Northern Mariana Islands

# excluded_visa_types = ['B2', 'WT', 'GT', 'GMT', 'B1', 'WB', 'GB', 'GMB']

In [14]:
# df_immig_data.shape

In [15]:
# df_immig_data = df_immig_data[~df_immig_data['visatype'].isin(excluded_visa_types)]
# df_immig_data.shape

In [16]:

df_immig_data['gender'].value_counts(normalize=False, dropna=False)

M      471
F      386
NaN    141
X        2
Name: gender, dtype: int64

In [17]:
## Ignore the trivial 'X' designation and overwrite it along w/ NaNs (probabilistic imputation)

df_immig_data.loc[df_immig_data['gender'] == 'X', 'gender'] = np.nan
# p = observed share of 'M' among non-missing values (look it up by label, not by position)
p = df_immig_data['gender'].value_counts(normalize=True)['M']
n = df_immig_data['gender'].isna().sum()
rands = np.random.random(size=n)
df_immig_data.loc[df_immig_data['gender'].isna(), 'gender'] = ['F' if x > p else 'M' for x in rands]

In [18]:
(p, 1-p)

(0.54959159859976658, 0.45040840140023342)

In [19]:
df_immig_data['gender'].value_counts(normalize=True)

M    0.549
F    0.451
Name: gender, dtype: float64
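
The imputation above thresholds uniform random numbers against `p`, which works for exactly two categories. A hedged sketch of a more general variant is shown here: it samples directly from the observed category frequencies and fixes a seed so reruns give the same fill. The function name `impute_by_frequency` and the toy Series are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def impute_by_frequency(series, seed=0):
    """Fill missing entries by sampling from the observed category frequencies."""
    probs = series.value_counts(normalize=True)  # NaN is excluded by default
    missing = series.isna()
    rng = np.random.default_rng(seed)  # fixed seed so reruns give the same fill
    draws = rng.choice(probs.index.to_numpy(), size=int(missing.sum()), p=probs.to_numpy())
    filled = series.copy()
    filled[missing] = draws
    return filled


# Toy data: four observed values and two missing ones.
demo = pd.Series(["M", "F", "M", None, np.nan, "M"])
filled = impute_by_frequency(demo)
print(filled.value_counts())
```

Observed values are left untouched; only the missing positions receive sampled labels.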

In [20]:
## These are student visas (F1 for student, F2 for dependents of student), in effect for as long as student enrolled in
#  an approved educational program and making satisfactory progress; so not unexpected to see this field left blank;
#  no need to impute anything here

df_immig_data[df_immig_data['dtaddto'].isna()]['visatype'].unique()

array(['F1', 'F2'], dtype=object)

In [21]:
df_immig_data[df_immig_data['visatype'].isin(['F1','F2'])]['dtaddto']

70    NaT
238   NaT
274   NaT
337   NaT
415   NaT
538   NaT
591   NaT
615   NaT
621   NaT
684   NaT
791   NaT
934   NaT
964   NaT
Name: dtaddto, dtype: datetime64[ns]

In [22]:
# Using code suggested in https://knowledge.udacity.com/questions/125439

fname = './I94_SAS_Labels_Descriptions.SAS'
with open(fname) as f:
    f_content = f.read().replace('\t', '')


def code_mapper(file, idx):
    """Parse the `code = label` lines that follow the marker `idx`, up to the next ';'."""
    # Use the text passed in as `file` rather than relying on the global f_content
    block = file[file.index(idx):]
    block = block[:block.index(';')].split('\n')
    block = [line.replace("'", "") for line in block]
    pairs = [line.split('=') for line in block[1:]]  # skip the line holding the marker
    return {p[0].strip(): p[1].strip() for p in pairs if len(p) == 2}

i94cit_res = code_mapper(f_content, "i94cntyl")
i94port = code_mapper(f_content, "i94prtl")
i94mode = code_mapper(f_content, "i94model")
i94addr = code_mapper(f_content, "i94addrl")
i94visa = {'1':'Business', '2': 'Pleasure', '3' : 'Student'}

In [23]:
i94visa

{'1': 'Business', '2': 'Pleasure', '3': 'Student'}
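
To make the parsing step easier to follow, here is a small self-contained sketch of the same idea run on an inline string. The excerpt is made up to resemble the shape of a value block in the labels file (it is not copied from it), and `parse_value_block` is a hypothetical name. Unlike `code_mapper`, this variant strips only the outer quotes, so apostrophes inside a label would survive.

```python
def parse_value_block(text, marker):
    """Return {code: label} for the block that starts at `marker` and ends at ';'."""
    block = text[text.index(marker):]
    block = block[:block.index(";")]
    mapping = {}
    for line in block.split("\n")[1:]:  # skip the line holding the marker itself
        if "=" not in line:
            continue
        code, label = line.split("=", 1)
        mapping[code.strip().strip("'")] = label.strip().strip("'")
    return mapping


# Made-up excerpt in the same shape as the labels file; not copied from it.
sample = """
value i94model
    1 = 'Air'
    2 = 'Sea'
    3 = 'Land'
    9 = 'Not reported' ;
"""

modes = parse_value_block(sample, "i94model")
print(modes)
```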

In [24]:
## Known issues:
# 1) {'RNO': 'cannon intl - reno/tahoe, nv'} needs to change to 'reno, nv' (outdated reference to same thing 'REN' points to)
# 2) {'SRQ': 'bradenton - sarasota, fl'} needs to change to 'sarasota, fl'
# 3) {'OGG': 'kahului - maui, hi'} needs to change to 'kahului, hi'
# 4) {'KOA': 'keahole-kona, hi'} needs to change to 'kailua/kona, hi'
# 5) {'MCA': 'mcallen, tx'} needs to change to 'mc allen, tx'
# 6) {'NEW': 'newark/teterboro, nj'} needs to change to 'newark, nj' (technically 2 nearby airports, but both map to this one port)
# 7) {'OPF': 'opa locka, fl'} needs to change to 'miami, fl'
# 8) {'WAS': 'washington dc'} needs to change to 'washington, dc'

i94port['RNO'] = 'reno, nv'
i94port['SRQ'] = 'sarasota, fl'
i94port['OGG'] = 'kahului, hi'
i94port['KOA'] = 'kailua/kona, hi'
i94port['MCA'] = 'mc allen, tx'
i94port['NEW'] = 'newark, nj'
i94port['OPF'] = 'miami, fl'
i94port['WAS'] = 'washington, dc'

In [25]:
# Build the port -> municipality lookup in one step instead of appending row by row
df_ports_munis = pd.DataFrame(
    [(k, v.lower()) for k, v in i94port.items()],
    columns=['i94port', 'municipality'])
    
df_ports_munis.head()

Unnamed: 0,i94port,municipality
0,ALC,"alcan, ak"
1,ANC,"anchorage, ak"
2,BAR,"baker aaf - baker island, ak"
3,DAC,"daltons cache, ak"
4,PIZ,"dew station pt lay dew, ak"


In [26]:
df_immig_data = df_immig_data.merge(df_ports_munis, on=['i94port'], how='inner')
df_immig_data.head()

Unnamed: 0,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,i94bir,i94visa,entdepd,matflag,biryear,dtaddto,gender,airline,visatype,municipality
0,2016.0,4.0,209.0,209.0,HHW,2016-04-22,1.0,HI,2016-04-29,61.0,2.0,O,M,1955.0,2016-07-20,F,JL,WT,"honolulu, hi"
1,2016.0,4.0,209.0,209.0,HHW,2016-04-15,1.0,HI,2016-04-19,54.0,2.0,O,M,1962.0,2016-07-13,M,JL,WT,"honolulu, hi"
2,2016.0,4.0,209.0,209.0,HHW,2016-04-29,1.0,HI,2016-05-03,39.0,2.0,O,M,1977.0,2016-07-27,M,DL,WT,"honolulu, hi"
3,2016.0,4.0,254.0,276.0,HHW,2016-04-28,1.0,HI,2016-05-02,11.0,2.0,O,M,2005.0,2016-07-26,M,OZ,WT,"honolulu, hi"
4,2016.0,4.0,209.0,209.0,HHW,2016-04-08,1.0,HI,2016-04-12,46.0,2.0,O,M,1970.0,2016-07-06,F,NH,WT,"honolulu, hi"
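
An inner merge silently drops any immigration row whose `i94port` has no entry in the lookup. A minimal sketch for surfacing such rows before committing to the inner join is shown below, using the `indicator` option of `DataFrame.merge`. The toy frames and the port code `XXX` are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for df_immig_data and df_ports_munis; 'XXX' is a made-up code.
immig = pd.DataFrame({"i94port": ["HHW", "XXX", "NYC"]})
ports = pd.DataFrame({
    "i94port": ["HHW", "NYC"],
    "municipality": ["honolulu, hi", "new york, ny"],
})

# A left merge with indicator=True adds a '_merge' column that says where each row matched.
checked = immig.merge(ports, on="i94port", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "i94port"].tolist()
print(unmatched)
```

If the list is empty, the inner merge loses nothing; otherwise the listed codes are candidates for manual fixes like the ones applied to `i94port` earlier.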


In [27]:
# Split 'city, region' on the first comma only; n must be passed by keyword in recent pandas
df_immig_data[['municipality', 'region']] = df_immig_data['municipality'].str.split(',', n=1, expand=True)

In [28]:
df_immig_data['municipality'] = df_immig_data['municipality'].str.strip()
df_immig_data['region'] = df_immig_data['region'].str.strip()

In [29]:
df_immig_data.head()

Unnamed: 0,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,i94bir,i94visa,entdepd,matflag,biryear,dtaddto,gender,airline,visatype,municipality,region
0,2016.0,4.0,209.0,209.0,HHW,2016-04-22,1.0,HI,2016-04-29,61.0,2.0,O,M,1955.0,2016-07-20,F,JL,WT,honolulu,hi
1,2016.0,4.0,209.0,209.0,HHW,2016-04-15,1.0,HI,2016-04-19,54.0,2.0,O,M,1962.0,2016-07-13,M,JL,WT,honolulu,hi
2,2016.0,4.0,209.0,209.0,HHW,2016-04-29,1.0,HI,2016-05-03,39.0,2.0,O,M,1977.0,2016-07-27,M,DL,WT,honolulu,hi
3,2016.0,4.0,254.0,276.0,HHW,2016-04-28,1.0,HI,2016-05-02,11.0,2.0,O,M,2005.0,2016-07-26,M,OZ,WT,honolulu,hi
4,2016.0,4.0,209.0,209.0,HHW,2016-04-08,1.0,HI,2016-04-12,46.0,2.0,O,M,1970.0,2016-07-06,F,NH,WT,honolulu,hi


In [31]:
df_immig_data[df_immig_data['i94port'] == 'No PORT Code (X96)']

Unnamed: 0,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,i94bir,i94visa,entdepd,matflag,biryear,dtaddto,gender,airline,visatype,municipality,region


In [33]:
df_immig_data.head()

Unnamed: 0,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,i94bir,i94visa,entdepd,matflag,biryear,dtaddto,gender,airline,visatype,municipality,region
0,2016.0,4.0,209.0,209.0,HHW,2016-04-22,1.0,HI,2016-04-29,61.0,2.0,O,M,1955.0,2016-07-20,F,JL,WT,honolulu,hi
1,2016.0,4.0,209.0,209.0,HHW,2016-04-15,1.0,HI,2016-04-19,54.0,2.0,O,M,1962.0,2016-07-13,M,JL,WT,honolulu,hi
2,2016.0,4.0,209.0,209.0,HHW,2016-04-29,1.0,HI,2016-05-03,39.0,2.0,O,M,1977.0,2016-07-27,M,DL,WT,honolulu,hi
3,2016.0,4.0,254.0,276.0,HHW,2016-04-28,1.0,HI,2016-05-02,11.0,2.0,O,M,2005.0,2016-07-26,M,OZ,WT,honolulu,hi
4,2016.0,4.0,209.0,209.0,HHW,2016-04-08,1.0,HI,2016-04-12,46.0,2.0,O,M,1970.0,2016-07-06,F,NH,WT,honolulu,hi


In [142]:
# df_immig_data[df_immig_data['municipality'].isin(['saipan', 'nassau', 'agana', 'vancouver', 'dublin', 'shannon',
#        'montreal', 'sanford', 'toronto', 'abu dhabi', 'north caicos',
#        'hamilton'])]

In [106]:
# Read in the data here
fname = 'airport-codes_csv.csv'
df_airport_codes = pd.read_csv(fname)
df_airport_codes['municipality'] = df_airport_codes['municipality'].str.lower()
df_airport_codes.head()

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
0,00A,heliport,Total Rf Heliport,11.0,,US,US-PA,bensalem,00A,,00A,"-74.93360137939453, 40.07080078125"
1,00AA,small_airport,Aero B Ranch Airport,3435.0,,US,US-KS,leoti,00AA,,00AA,"-101.473911, 38.704022"
2,00AK,small_airport,Lowell Field,450.0,,US,US-AK,anchor point,00AK,,00AK,"-151.695999146, 59.94919968"
3,00AL,small_airport,Epps Airpark,820.0,,US,US-AL,harvest,00AL,,00AL,"-86.77030181884766, 34.86479949951172"
4,00AR,closed,Newport Hospital & Clinic Heliport,237.0,,US,US-AR,newport,,,,"-91.254898, 35.6087"


In [107]:
df_airport_codes.shape

(55075, 12)

In [108]:
df_airport_codes = df_airport_codes[df_airport_codes['iso_country'] == 'US']
df_airport_codes.reset_index(drop=True, inplace=True)
df_airport_codes.shape

(22757, 12)

In [109]:
df_airport_codes['type'].unique()

array(['heliport', 'small_airport', 'closed', 'seaplane_base',
       'balloonport', 'medium_airport', 'large_airport'], dtype=object)

In [110]:
# Assume no (or virtually no) immigrants arrive at 'heliport', 'closed', 'seaplane_base', or 'balloonport' airports

df_airport_codes = df_airport_codes[~df_airport_codes['type'].isin(['heliport', 'closed', 'seaplane_base', 'balloonport'])]
df_airport_codes.reset_index(drop=True, inplace=True)
df_airport_codes.shape

(14582, 12)

In [111]:
## Airport info only useful for US airports where 'municipality' is not NaN

df_airport_codes = df_airport_codes[~df_airport_codes['municipality'].isna()]
df_airport_codes.reset_index(drop=True, inplace=True)
df_airport_codes.shape

(14532, 12)
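
The three filtering cells above (country, airport type, non-null municipality) could also be expressed as a single boolean mask with one index reset. The following is a sketch under that assumption; `filter_us_airports` is a hypothetical helper, and the toy rows (including the identifier `ZZZZ`) are made up.

```python
import numpy as np
import pandas as pd

EXCLUDED_TYPES = ["heliport", "closed", "seaplane_base", "balloonport"]


def filter_us_airports(airports):
    """Keep US airports of a usable type that have a municipality."""
    keep = (
        (airports["iso_country"] == "US")
        & ~airports["type"].isin(EXCLUDED_TYPES)
        & airports["municipality"].notna()
    )
    return airports.loc[keep].reset_index(drop=True)


# Toy rows in the shape of the airport codes file; values are illustrative.
demo = pd.DataFrame({
    "ident": ["KJFK", "00A", "EGLL", "00AR", "ZZZZ"],
    "type": ["large_airport", "heliport", "large_airport", "closed", "small_airport"],
    "iso_country": ["US", "US", "GB", "US", "US"],
    "municipality": ["new york", "bensalem", "london", "newport", np.nan],
})
filtered = filter_us_airports(demo)
print(filtered)
```

Returning a fresh frame from `.loc[...]` followed by `reset_index` also avoids adding columns to a filtered view later on.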

In [112]:
df_airport_codes['region'] = df_airport_codes['iso_region'].str.split('-').str[-1].str.lower().str.strip()

# Drop 'US-U-A' 
df_airport_codes = df_airport_codes[~df_airport_codes['region'].isin(['a'])]
df_airport_codes.head()

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates,region
0,00AA,small_airport,Aero B Ranch Airport,3435.0,,US,US-KS,leoti,00AA,,00AA,"-101.473911, 38.704022",ks
1,00AK,small_airport,Lowell Field,450.0,,US,US-AK,anchor point,00AK,,00AK,"-151.695999146, 59.94919968",ak
2,00AL,small_airport,Epps Airpark,820.0,,US,US-AL,harvest,00AL,,00AL,"-86.77030181884766, 34.86479949951172",al
3,00AS,small_airport,Fulton Airport,1100.0,,US,US-OK,alex,00AS,,00AS,"-97.8180194, 34.9428028",ok
4,00AZ,small_airport,Cordes Airport,3810.0,,US,US-AZ,cordes,00AZ,,00AZ,"-112.16500091552734, 34.305599212646484",az


In [113]:
df_airport_codes['region'].unique()

array(['ks', 'ak', 'al', 'ok', 'az', 'ca', 'fl', 'ga', 'id', 'il', 'ky',
       'la', 'md', 'mn', 'mo', 'nj', 'nc', 'ny', 'oh', 'pa', 'or', 'sc',
       'sd', 'tn', 'tx', 'va', 'wa', 'wi', 'wv', 'ia', 'in', 'mt', 'ne',
       'nh', 'nm', 'nv', 'ut', 'wy', 'ms', 'co', 'me', 'mi', 'ma', 'nd',
       'vt', 'ar', 'ri', 'de', 'ct', 'hi', 'dc'], dtype=object)

In [114]:
# # # Reasonable to assume that immigrants never enter the US via an Air Force Base (???)
# df_airport_codes = df_airport_codes[~df_airport_codes['name'].str.contains('Air Force')]
# df_airport_codes.reset_index(drop=True, inplace=True)
# df_airport_codes.shape

In [115]:
## 'US-0883' seems like unwanted data; drop it

df_airport_codes[(df_airport_codes['municipality'] == 'new york') & (df_airport_codes['region'] == 'ny') & (df_airport_codes['type'] == 'large_airport')]

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates,region
9333,KJFK,large_airport,John F Kennedy International Airport,13.0,,US,US-NY,new york,KJFK,JFK,JFK,"-73.77890015, 40.63980103",ny
9467,KLGA,large_airport,La Guardia Airport,21.0,,US,US-NY,new york,KLGA,LGA,LGA,"-73.87259674, 40.77719879",ny
13727,US-0883,large_airport,JFK,,,US,US-NY,new york,,,,"0, 0",ny


In [116]:
df_airport_codes = df_airport_codes[df_airport_codes['ident'] != 'US-0883']
df_airport_codes.reset_index(drop=True, inplace=True)
df_airport_codes[(df_airport_codes['municipality'] == 'new york') & (df_airport_codes['region'] == 'ny') & (df_airport_codes['type'] == 'large_airport')]

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates,region
9333,KJFK,large_airport,John F Kennedy International Airport,13.0,,US,US-NY,new york,KJFK,JFK,JFK,"-73.77890015, 40.63980103",ny
9467,KLGA,large_airport,La Guardia Airport,21.0,,US,US-NY,new york,KLGA,LGA,LGA,"-73.87259674, 40.77719879",ny


In [117]:
## Only interested in US airports, and 'continent' column is null for those, so drop it
df_airport_codes.drop(columns=['continent'], inplace=True)
df_airport_codes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14528 entries, 0 to 14527
Data columns (total 12 columns):
ident           14528 non-null object
type            14528 non-null object
name            14528 non-null object
elevation_ft    14493 non-null float64
iso_country     14528 non-null object
iso_region      14528 non-null object
municipality    14528 non-null object
gps_code        14163 non-null object
iata_code       1865 non-null object
local_code      14371 non-null object
coordinates     14528 non-null object
region          14528 non-null object
dtypes: float64(1), object(11)
memory usage: 1.3+ MB


In [118]:
# Can also drop 'ident', 'elevation_ft', 'iso_country', 'iso_region', 'gps_code', 'iata_code', 'local_code', 'coordinates'
df_airport_codes.drop(columns=['ident', 'elevation_ft', 'iso_country', 'iso_region', 'gps_code', 'iata_code', 'local_code', 'coordinates'], 
                      inplace=True)
df_airport_codes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14528 entries, 0 to 14527
Data columns (total 4 columns):
type            14528 non-null object
name            14528 non-null object
municipality    14528 non-null object
region          14528 non-null object
dtypes: object(4)
memory usage: 454.1+ KB


In [119]:
# Drop any rows where 'type' or 'name' are NaN
df_airport_codes = df_airport_codes[~((df_airport_codes['type'].isna()) | (df_airport_codes['name'].isna()))]
df_airport_codes.reset_index(drop=True, inplace=True)
df_airport_codes.shape

(14528, 4)

In [120]:
## This is what's left 

us_airport_regions = df_airport_codes['region'].unique()
us_airport_regions

array(['ks', 'ak', 'al', 'ok', 'az', 'ca', 'fl', 'ga', 'id', 'il', 'ky',
       'la', 'md', 'mn', 'mo', 'nj', 'nc', 'ny', 'oh', 'pa', 'or', 'sc',
       'sd', 'tn', 'tx', 'va', 'wa', 'wi', 'wv', 'ia', 'in', 'mt', 'ne',
       'nh', 'nm', 'nv', 'ut', 'wy', 'ms', 'co', 'me', 'mi', 'ma', 'nd',
       'vt', 'ar', 'ri', 'de', 'ct', 'hi', 'dc'], dtype=object)

In [121]:
us_airport_munis = df_airport_codes['municipality'].unique()
us_airport_munis

array(['leoti', 'anchor point', 'harvest', ..., 'copper center', 'cibecue',
       'nyac'], dtype=object)

In [122]:
i94data_munis = df_immig_data[df_immig_data['i94mode'] == 1.0]['municipality'].unique()
i94data_regions = df_immig_data[(df_immig_data['i94mode'] == 1.0) & (~df_immig_data['region'].isnull())]['region'].unique()

In [124]:
## All outside the US 50+1 (foreign airports, or US territories such as Guam and Saipan) - great!

np.setdiff1d(i94data_munis, us_airport_munis)

array(['abu dhabi', 'agana', 'montreal', 'nassau', 'north caicos', 'saipan'], dtype=object)

In [125]:
## All non-US 50+1 regions (this is good!)

np.setdiff1d(i94data_regions, us_airport_regions)

array(['bahamas', 'bermuda', 'canada', 'gu', 'ireland', 'spn',
       'turk & caiman'], dtype=object)

In [126]:
agg_airport_codes = df_airport_codes.groupby(['municipality', 'region']).agg(lambda x: x.tolist())
agg_airport_codes

Unnamed: 0_level_0,Unnamed: 1_level_0,type,name
municipality,region,Unnamed: 2_level_1,Unnamed: 3_level_1
abbeville,al,[small_airport],[Abbeville Municipal Airport]
abbeville,la,"[small_airport, small_airport, small_airport, ...","[Coastal Ridge Airpark, Ms Pats Airport, Abbev..."
abbeville,sc,"[small_airport, small_airport]","[Abbeville Airport, Rambos Field]"
abbott,tx,[small_airport],[Stapleton Field]
aberdeen,id,[small_airport],[Aberdeen Municipal Airport]
aberdeen,sd,"[medium_airport, small_airport, small_airport]","[Aberdeen Regional Airport, Thorson Airfield, ..."
aberdeen,wa,[small_airport],[Wishkah River Ranch Airport]
aberdeen proving grounds(aberdeen),md,[medium_airport],[Phillips Army Air Field]
aberdeen/amory,ms,[small_airport],[Monroe County Airport]
abernathy,tx,[small_airport],[Abernathy Municipal Airport]
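
Collapsing the airports to one row per `(municipality, region)` matters for the merge that follows: joining against the raw table would repeat each arrival once per airport in that city. A small sketch of the effect is below; the frames are toy data and `Example Field` is an invented airport name.

```python
import pandas as pd

airports = pd.DataFrame({
    "municipality": ["new york", "new york", "honolulu"],
    "region": ["ny", "ny", "hi"],
    "name": ["John F Kennedy International Airport", "La Guardia Airport", "Example Field"],
})
arrivals = pd.DataFrame({"municipality": ["new york"], "region": ["ny"], "visatype": ["B2"]})

keys = ["municipality", "region"]

# Joining on the raw table fans out: one arrival row per matching airport.
raw_join = arrivals.merge(airports, on=keys, how="inner")

# Aggregating to lists first keeps exactly one row per arrival.
collapsed = airports.groupby(keys).agg(lambda x: x.tolist()).reset_index()
agg_join = arrivals.merge(collapsed, on=keys, how="inner")

print(len(raw_join), len(agg_join))
```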


In [129]:
# agg_airport_codes.loc['zellwood', 'fl']

In [130]:
# agg_airport_codes.loc['zellwood', 'fl']['name']

In [143]:
### NOTE: Will only merge this with the subset of df_immig_data that has i94port in the US 50+1

df_immig_data_us_airports = df_immig_data[df_immig_data['i94mode'] == 1.0].merge(
    agg_airport_codes.reset_index(), on=['municipality', 'region'], how='inner')

In [144]:
df_immig_data_us_airports.isna().sum()

i94yr            0
i94mon           0
i94cit           0
i94res           0
i94port          0
arrdate          0
i94mode          0
i94addr         35
depdate         36
i94bir           0
i94visa          0
entdepd         36
matflag         36
biryear          0
dtaddto         12
gender           0
airline          0
visatype         0
municipality     0
region           0
type             0
name             0
dtype: int64

In [201]:
# Read in the data here
fname = 'us-cities-demographics.csv'
df_us_cities_demos = pd.read_csv(fname, sep=';')
df_us_cities_demos.head()

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Silver Spring,Maryland,33.8,40601.0,41862.0,82463,1562.0,30908.0,2.6,MD,Hispanic or Latino,25924
1,Quincy,Massachusetts,41.0,44129.0,49500.0,93629,4147.0,32935.0,2.39,MA,White,58723
2,Hoover,Alabama,38.5,38040.0,46799.0,84839,4819.0,8229.0,2.58,AL,Asian,4759
3,Rancho Cucamonga,California,34.5,88127.0,87105.0,175232,5821.0,33878.0,3.18,CA,Black or African-American,24437
4,Newark,New Jersey,34.6,138040.0,143873.0,281913,5829.0,86253.0,2.73,NJ,White,76402


In [202]:
df_us_cities_demos['State Code'] = df_us_cities_demos['State Code'].str.lower()
df_us_cities_demos.head()

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Silver Spring,Maryland,33.8,40601.0,41862.0,82463,1562.0,30908.0,2.6,md,Hispanic or Latino,25924
1,Quincy,Massachusetts,41.0,44129.0,49500.0,93629,4147.0,32935.0,2.39,ma,White,58723
2,Hoover,Alabama,38.5,38040.0,46799.0,84839,4819.0,8229.0,2.58,al,Asian,4759
3,Rancho Cucamonga,California,34.5,88127.0,87105.0,175232,5821.0,33878.0,3.18,ca,Black or African-American,24437
4,Newark,New Jersey,34.6,138040.0,143873.0,281913,5829.0,86253.0,2.73,nj,White,76402


In [203]:
df_us_cities_states = df_us_cities_demos['State Code'].unique()
df_us_cities_states.shape[0]

49

In [204]:
# Puerto Rico: drop because only interested in US 50+1

'pr' in df_us_cities_states

True

In [205]:
# Restrict data to US 50+1

df_us_cities_demos = df_us_cities_demos[~(df_us_cities_demos['State Code'] == 'pr')]
df_us_cities_demos.reset_index(drop=True, inplace=True)
df_us_cities_states = df_us_cities_demos['State Code'].unique()
df_us_cities_states.shape[0]

48

In [206]:
# VT, WV & WY are missing from the demographics - so we only get US 47+1

np.setdiff1d(df_airport_codes['region'].unique(), df_us_cities_states)

array(['vt', 'wv', 'wy'], dtype=object)

In [207]:
df_us_cities_demos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2878 entries, 0 to 2877
Data columns (total 12 columns):
City                      2878 non-null object
State                     2878 non-null object
Median Age                2878 non-null float64
Male Population           2875 non-null float64
Female Population         2875 non-null float64
Total Population          2878 non-null int64
Number of Veterans        2878 non-null float64
Foreign-born              2878 non-null float64
Average Household Size    2875 non-null float64
State Code                2878 non-null object
Race                      2878 non-null object
Count                     2878 non-null int64
dtypes: float64(6), int64(2), object(4)
memory usage: 269.9+ KB


In [208]:
df_us_cities_demos[df_us_cities_demos.isna().any(axis=1)]

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
330,The Villages,Florida,70.5,,,72590,15231.0,4034.0,,fl,Hispanic or Latino,1066
446,The Villages,Florida,70.5,,,72590,15231.0,4034.0,,fl,Black or African-American,331
1433,The Villages,Florida,70.5,,,72590,15231.0,4034.0,,fl,White,72211


In [209]:
# The Villages is a retirement community, so let's just drop it from the dataset

df_us_cities_demos = df_us_cities_demos[~(df_us_cities_demos['City'] == 'The Villages')]
df_us_cities_demos.reset_index(drop=True, inplace=True)
df_us_cities_demos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2875 entries, 0 to 2874
Data columns (total 12 columns):
City                      2875 non-null object
State                     2875 non-null object
Median Age                2875 non-null float64
Male Population           2875 non-null float64
Female Population         2875 non-null float64
Total Population          2875 non-null int64
Number of Veterans        2875 non-null float64
Foreign-born              2875 non-null float64
Average Household Size    2875 non-null float64
State Code                2875 non-null object
Race                      2875 non-null object
Count                     2875 non-null int64
dtypes: float64(6), int64(2), object(4)
memory usage: 269.6+ KB


In [212]:
df_us_cities_demos.groupby(['City','Race'])['Count'].sum()

City         Race                             
Abilene      American Indian and Alaska Native      1813
             Asian                                  2929
             Black or African-American             14449
             Hispanic or Latino                    33222
             White                                 95487
Akron        American Indian and Alaska Native      1845
             Asian                                  9033
             Black or African-American             66551
             Hispanic or Latino                     3684
             White                                129192
Alafaya      Asian                                 10336
             Black or African-American              6577
             Hispanic or Latino                    34897
             White                                 63666
Alameda      American Indian and Alaska Native      1329
             Asian                                 27984
             Black or African-American   

In [None]:
# Percentage of Population that's Foreign-born by city would be interesting (maybe also by race?)
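
A possible way to compute the foreign-born share per city is sketched here. Because the demographics file repeats the city-level columns once per `Race` row, the sketch first reduces to one row per city and state. The helper name `foreign_born_share` and the output column `pct_foreign_born` are assumptions; the demo rows follow the shape of the file shown above.

```python
import pandas as pd


def foreign_born_share(demos):
    """Return one row per (City, State Code) with the foreign-born percentage."""
    cols = ["City", "State Code", "Total Population", "Foreign-born"]
    # City-level figures are repeated on every Race row, so keep one row per city.
    city_level = demos.drop_duplicates(subset=["City", "State Code"])[cols].copy()
    city_level["pct_foreign_born"] = (
        100 * city_level["Foreign-born"] / city_level["Total Population"]
    )
    return city_level.sort_values("pct_foreign_born", ascending=False).reset_index(drop=True)


# A few rows in the shape of the demographics file (one city appears twice).
demo = pd.DataFrame({
    "City": ["Silver Spring", "Silver Spring", "Hoover"],
    "State Code": ["md", "md", "al"],
    "Total Population": [82463, 82463, 84839],
    "Foreign-born": [30908.0, 30908.0, 8229.0],
    "Race": ["Hispanic or Latino", "White", "Asian"],
})
share = foreign_born_share(demo)
print(share)
```

Keying on both `City` and `State Code` keeps same-named cities in different states apart.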

In [217]:
# Read in the data here
fname = 'GlobalLandTemperaturesByCity.csv'
df_global_land_temps = pd.read_csv(fname)
df_global_land_temps.head()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1,1743-12-01,,,Århus,Denmark,57.05N,10.33E
2,1744-01-01,,,Århus,Denmark,57.05N,10.33E
3,1744-02-01,,,Århus,Denmark,57.05N,10.33E
4,1744-03-01,,,Århus,Denmark,57.05N,10.33E


In [218]:
df_global_land_temps.shape

(8599212, 7)

In [220]:
# Drop 'AverageTemperatureUncertainty', 'Latitude' & 'Longitude'
df_global_land_temps.drop(columns=['AverageTemperatureUncertainty', 'Latitude', 'Longitude'], inplace=True)
df_global_land_temps.shape

(8599212, 4)

In [222]:
df_global_land_temps['dt'].max()

'2013-09-01'

In [227]:
# Let's just focus on relatively recent temperature data (since '1999-09-01', up through '2013-08-01')

df_global_land_temps = df_global_land_temps[(df_global_land_temps['dt'] >= '1999-09-01') & (df_global_land_temps['dt'] < '2013-09-01')]

df_global_land_temps.reset_index(drop=True, inplace=True)
df_global_land_temps.shape

(589680, 4)

In [228]:
df_global_land_temps.head()

Unnamed: 0,dt,AverageTemperature,City,Country
0,1999-09-01,16.339,Århus,Denmark
1,1999-10-01,9.291,Århus,Denmark
2,1999-11-01,5.736,Århus,Denmark
3,1999-12-01,1.638,Århus,Denmark
4,2000-01-01,3.065,Århus,Denmark


In [229]:
df_global_land_temps[df_global_land_temps.isna().any(axis=1)]

Unnamed: 0,dt,AverageTemperature,City,Country


In [None]:
# Two ways to go here: 1) Average monthly over all the years to get one average per month per city
#  OR 2) Just retain the last year's worth of data...
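
Option 1 above (one average per month per city across all retained years) could look roughly like the following sketch. The function name `monthly_climatology` and the temperature values are assumptions for illustration; the column names match the temperature file.

```python
import pandas as pd


def monthly_climatology(temps):
    """Average temperature per city, country and calendar month across all years."""
    out = temps.copy()
    out["month"] = pd.to_datetime(out["dt"]).dt.month
    return out.groupby(["City", "Country", "month"], as_index=False)["AverageTemperature"].mean()


# Toy rows with made-up temperatures: two Januaries and one February.
demo = pd.DataFrame({
    "dt": ["2011-01-01", "2012-01-01", "2012-02-01"],
    "AverageTemperature": [2.0, 4.0, 5.0],
    "City": ["Århus"] * 3,
    "Country": ["Denmark"] * 3,
})
clim = monthly_climatology(demo)
print(clim)
```

The result has at most twelve rows per city, which makes it straightforward to join against `i94mon` later.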

In [56]:
from pyspark.sql import SparkSession

# Build a session that pulls in the spark-sas7bdat package for reading SAS files
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11")
    .enableHiveSupport()
    .getOrCreate()
)
df_spark = spark.read.format('com.github.saurfang.sas.spark').load('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')


Exception: Java gateway process exited before sending its port number
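
This exception is commonly seen when PySpark cannot launch a JVM, for example because no Java runtime is installed or `JAVA_HOME` is not set, though other causes are possible. A small pre-flight check, offered only as a sketch (the helper `java_available` is hypothetical), can rule that case in or out before building the session.

```python
import os
import shutil


def java_available():
    """Return True if a Java runtime looks discoverable from this process."""
    return bool(os.environ.get("JAVA_HOME")) or shutil.which("java") is not None


if not java_available():
    print("No Java runtime found: set JAVA_HOME or put `java` on PATH before starting Spark.")
```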

In [None]:
#write to parquet
df_spark.write.parquet("sas_data")
df_spark=spark.read.parquet("sas_data")

### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
Document steps necessary to clean the data

In [None]:
# Performing cleaning tasks here





### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
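
The checks listed above can be sketched as a single function that returns a list of failures. This is an illustration in pandas, not a prescribed implementation: `run_quality_checks`, its parameters, and the toy table are all assumptions.

```python
import pandas as pd


def run_quality_checks(df, key_cols, required_cols, expected_rows=None):
    """Return a list of human-readable failures; an empty list means all checks passed."""
    failures = []
    if df.empty:
        failures.append("table is empty")
    # Integrity: the key columns should identify each row uniquely.
    if df.duplicated(subset=key_cols).any():
        failures.append(f"duplicate keys in {key_cols}")
    # Completeness: required columns should have no nulls.
    for col in required_cols:
        if df[col].isna().any():
            failures.append(f"nulls in required column {col!r}")
    # Source/count check against the number of rows expected from the source.
    if expected_rows is not None and len(df) != expected_rows:
        failures.append(f"row count {len(df)} != expected {expected_rows}")
    return failures


# Toy table with one duplicated key and one missing value.
demo = pd.DataFrame({
    "i94port": ["HHW", "MCA", "MCA"],
    "municipality": ["honolulu", "mc allen", None],
})
problems = run_quality_checks(demo, key_cols=["i94port"], required_cols=["municipality"], expected_rows=3)
print(problems)
```

Collecting failures instead of raising on the first one lets a pipeline run report every problem in a single pass.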
 
Run Quality Checks

In [None]:
# Perform quality checks here

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.