# Project Title
### Data Engineering Capstone Project

#### Project Summary
An international beverage production company called "A sweet thing" is interested in increasing their sales by entering new markets.  Present in more than 50 countries it sees the greatest potential in North America, the last continent the company couldn't take a step in, until now.

The idea behind the project is to equip stakeholders (from a few different departments) with necessary statistical knowledge in order to help them make more thorough decisions.

Stakeholders are primarily interested in airports (location, passenger capacity) and cities (location, population density, share of female population) around them.

The goal of this project in, on the one hand, to provide the company with valuable insights from pre-defined data sets, and, on the other hand, to further solidify data-driven decision making  

The project follows the follow steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [80]:
# Do all imports and installs here
import pandas as pd
import numpy as np
import re

# define pandas options needed to explore data mor efficient
pd.options.display.max_columns = None
pd.options.display.max_rows = 100
pd.options.display.max_colwidth = 200

### Step 1: Scope the Project and Gather Data

#### Scope 
This project's main goal is to deliver data about immigration to US, so that stakeholders of the company I work for as a data engineer can analyze it and make data-based decisions. Stakeholders not only are interested in airports and nearby cities the US visitors choose to travel to, but also about the demographics of this cities.

In order to accomplish goals set a great amount of data was parsed. Next block holds files, which were chosen for this project.
Each file provides information on a narrow field, which combined deliver a greater picture of the US immigration activities for 2016.

Pandas library is used to analyze and exctract valuable information from all .csv files. Firstly, I will explore each file content, so that I can decide, what columns are important. After that 

#### Data Read-Ins

In [4]:
# a sample of the main data set 
fname_immi = 'immigration_data_sample.csv'
# a foreigner can only visit US if a visa is issued
fname_visatype = 'visatype.csv'
# airport codes and cities
fname_ccodes = 'country_codes.csv'
# airport codes and cities
fname_airports = 'airport-codes_csv.csv'
# data on US international airports (not every airport is a Point-of-Entry)
fname_international = 'international_airports_US.csv'
# data on US population by city and state
fname_demographics = 'us-cities-demographics.csv'
# another dataset holding data about cities
fname_uszips = 'uszips.csv'
# world temperature
fname_temper = 'worldTempByCity.csv'

In [None]:
df_visatype = pd.read_csv(fname_visatype,delimiter='|')

In [None]:
df_ccodes = pd.read_csv(fname_ccodes)

In [None]:
df_airports = pd.read_csv(fname_airports)

In [None]:
df_inter_US = pd.read_csv(fname_international)

In [None]:
df_demographics = pd.read_csv(fname_demographics, delimiter=";")

In [None]:
df_zips = pd.read_csv(fname_uszips)

### Step 2: Explore and Assess the Data
#### Explore the Data 
This step included reads and exploration of provided and additional datasets gathered by me. TODOs, as well as issues, are included for each dataset 

#### I94_SAS_Labels_Descriptions.SAS

This file contains labels and description about values from immigration datasets.
#### TODOs
* extract Port-of-Entry labels
* extract Country-of-Citizenship labels
* extract State codes

#### Issues
* data in this file is messy: lots of invalid values

In [73]:
# Get port locations from SAS text file
with open("input/I94_SAS_Labels_Descriptions.SAS") as f:
    content = f.readlines()

In [75]:
def get_I94port(ports):
    """This method 
    
    Params:
    
    Returns:
    """
    ports_strip = [p.strip() for p in ports]
    ports_known = [port for port in ports_strip if 'No PORT Code' not in port and 'Collapsed' not in port]
    ports_cleaned = [port.replace('\t','') for port in ports_known]
    ports_results = []
    for port in ports_cleaned:
        match = re.search("""\'([A-Z0-9]{3})\'\S*=\S*\'([A-Z\(\)\.\s\/-]*),\s?([A-Z]{2})(\s\(BPS\)|\s#ARPT|\s*)\'""", port)
        if match:
            code, city, state = match.group(1), match.group(2), match.group(3).strip()
            ports_results.append((code, city, state))
        else:
            print(f'Could not match port of entry: {port}')
    df = pd.DataFrame(ports_results, columns=['I94_port_code', 'I94_port_city', 'I94_port_state'])
    return df

In [76]:
def get_I94cit(cit_codes):
    """This method 
    
    Params:
    
    Returns:
    """
    cit_codes_strip = [c.strip() for c in cit_codes]
    cit_codes_valid = [c for c in cit_codes_strip if 'No Country Code' not in c 
                                                 and 'INVALID:' not in c 
                                                 and '(should not show)' not in c]
    cit_codes_cleaned = [c.replace('\t','') for c in cit_codes_valid]
    cit_results = []
    for c in cit_codes_cleaned:
        match = re.search("""(\d{3})\s?=\s*\'(.*)\'""", c)
        if match:
            cit_results.append((match.group(1), match.group(2)))
    #cit_codes_json = json.dumps(dict(cit_results))
    df = pd.DataFrame(cit_results, columns=['I94_country_code', 'I94_country'])
    return df

In [77]:
def get_I94addr(addr_states):
    """This method 
    
    Params:
    
    Returns:
    """
    addr_states_strip = [a.strip() for a in addr_states]
    addr_states_cleaned = [a.replace('\t','') for a in addr_states_strip]
    addr_results = []
    for a in addr_states_cleaned:
        match = re.match("""\'([A-Z]{2})\'=\'([A-Z\.\s]*)\'""", a)
        if match:
            addr_results.append((match.group(1), match.group(2)))
    df = pd.DataFrame(addr_results, columns=['I94_state_code', 'I94_state'])
    return df

In [78]:
# define lines for particular labels
ports = content[302:962]
cit_codes = content[10:298]
addr_states = content[982:1036]

In [81]:
I94_ports = get_I94port(ports)
I94_cit_codes = get_I94cit(cit_codes)
I94_addr_states = get_I94addr(addr_states)

Could not match port of entry: 'MAP'='MARIPOSA AZ           '
Could not match port of entry: 'BLT'='PACIFIC, HWY. STATION, CA '
Could not match port of entry: 'WSB'='WARROAD INTL, SPB, MN'
Could not match port of entry: 'SAI'='SAIPAN, SPN           '
Could not match port of entry: 'DER'='DERBY LINE, VT (I-91) '
Could not match port of entry: 'DLV'='DERBY LINE, VT (RT. 5)'
Could not match port of entry: 'SWB'='SWANTON, VT (BP - SECTOR HQ)'
Could not match port of entry: 'BLI'='BELLINGHAM, WASHINGTON #INTL'
Could not match port of entry: 'XXX'='NOT REPORTED/UNKNOWN  '
Could not match port of entry: '888'='UNIDENTIFED AIR / SEAPORT'
Could not match port of entry: 'UNK'='UNKNOWN POE           '
Could not match port of entry: 'CLG'='CALGARY, CANADA       '
Could not match port of entry: 'EDA'='EDMONTON, CANADA      '
Could not match port of entry: 'YHC'='HAKAI PASS, CANADA'
Could not match port of entry: 'HAL'='Halifax, NS, Canada   '
Could not match port of entry: 'MON'='MONTREAL, CANADA  

In [82]:
I94_cit_codes.head()

Unnamed: 0,I94_country_code,I94_country
0,236,AFGHANISTAN
1,101,ALBANIA
2,316,ALGERIA
3,102,ANDORRA
4,324,ANGOLA


In [83]:
I94_ports[I94_ports.I94_port_state == 'MX']

Unnamed: 0,I94_port_code,I94_port_city,I94_port_state
509,HMO,GEN PESQUEIRA GARCIA,MX


In [84]:
I94_ports.drop(I94_ports.index[509],inplace=True)
I94_ports.reset_index(inplace=True)

In [86]:
print(f"""The dataframe holds {len(I94_ports)} US Point-of-Entry""")

The dataframe holds 522 US Point-of-Entry


In [87]:
I94_ports.head()

Unnamed: 0,index,I94_port_code,I94_port_city,I94_port_state
0,0,ALC,ALCAN,AK
1,1,ATW,APPLETON,WI
2,2,ANC,ANCHORAGE,AK
3,3,BAR,BAKER AAF - BAKER ISLAND,AK
4,4,DAC,DALTONS CACHE,AK


In [88]:
I94_cit_codes.head(2)

Unnamed: 0,I94_country_code,I94_country
0,236,AFGHANISTAN
1,101,ALBANIA


In [90]:
print(f"""Country codes for {len(I94_cit_codes)} countries are provided""")

Country codes for 235 countries are provided


#### I94_ports and international_airports_US.csv

#### TODOs
* get airport capacity, extract it from a string, transform it to integer

#### Issues
* airport capacity ('Passanger_Role')

In [91]:
raw_inter_airports = pd.read_csv('input/international_airports_US.csv')
df_inter_airports = raw_inter_airports.copy()

In [92]:
df_inter_airports.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 117 entries, 0 to 116
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Unnamed: 0       117 non-null    int64 
 1   Location         117 non-null    object
 2   Airport          117 non-null    object
 3   IATA_Code        117 non-null    object
 4   Passenger_Role   117 non-null    object
 5   2018_passengers  117 non-null    object
dtypes: int64(1), object(5)
memory usage: 5.6+ KB


In [93]:
I94_ports.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 522 entries, 0 to 521
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   index           522 non-null    int64 
 1   I94_port_code   522 non-null    object
 2   I94_port_city   522 non-null    object
 3   I94_port_state  522 non-null    object
dtypes: int64(1), object(3)
memory usage: 16.4+ KB


In [94]:
df_inter_airports.head(2)

Unnamed: 0.1,Unnamed: 0,Location,Airport,IATA_Code,Passenger_Role,2018_passengers
0,0,Akron,Akron Executive Airport,AKC,Non-Hub/Reliever,No Commercial Service
1,1,Albany,Albany International Airport,ALB,Small,"2,848,000 [2]"


In [95]:
def get_count(self):
    """This method
    
    Params:
    
    Returns:
    """
    if 'No' in self or 'TBA' in self or 'Unknown' in self or 'Service' in self:
        return 0
    else:
        try:
            split = self.split('[')
        except ValueError:
            pass
        else:
            count = int(split[0].replace(',',''))
            return count

In [96]:
## get passengers count were defined
df_inter_airports["count_passengers"] = df_inter_airports["2018_passengers"].apply(get_count)

In [97]:
df_inter_airports.head(5)

Unnamed: 0.1,Unnamed: 0,Location,Airport,IATA_Code,Passenger_Role,2018_passengers,count_passengers
0,0,Akron,Akron Executive Airport,AKC,Non-Hub/Reliever,No Commercial Service,0
1,1,Albany,Albany International Airport,ALB,Small,"2,848,000 [2]",2848000
2,2,Albuquerque,Albuquerque International Sunport,ABQ,Medium,"5,258,775 [3]",5258775
3,3,Anchorage,Ted Stevens Anchorage International Airport,ANC,Medium,"5,176,371[4]",5176371
4,4,Appleton,Appleton International Airport,ATW,Small,"717,757 [5]",717757


In [98]:
df_inter_airports["Location_lower"] = df_inter_airports.Location.apply(lambda x: x.lower())

In [99]:
I94_ports.head(2)

Unnamed: 0,index,I94_port_code,I94_port_city,I94_port_state
0,0,ALC,ALCAN,AK
1,1,ATW,APPLETON,WI


In [100]:
I94_ports["I94_port_city_lower"] = I94_ports["I94_port_city"].apply(lambda x: x.lower())

In [101]:
df_inter_merged = df_inter_airports.merge(I94_ports, how='left', left_on='Location_lower', right_on='I94_port_city_lower')

In [102]:
df_inter_merged.head(3)

Unnamed: 0.1,Unnamed: 0,Location,Airport,IATA_Code,Passenger_Role,2018_passengers,count_passengers,Location_lower,index,I94_port_code,I94_port_city,I94_port_state,I94_port_city_lower
0,0,Akron,Akron Executive Airport,AKC,Non-Hub/Reliever,No Commercial Service,0,akron,338.0,AKR,AKRON,OH,akron
1,0,Akron,Akron Executive Airport,AKC,Non-Hub/Reliever,No Commercial Service,0,akron,514.0,CAK,AKRON,OH,akron
2,1,Albany,Albany International Airport,ALB,Small,"2,848,000 [2]",2848000,albany,308.0,ALB,ALBANY,NY,albany


In [104]:
## Airports that could not be matched
df_inter_merged[df_inter_merged.I94_port_city.isna()].head(2)

Unnamed: 0.1,Unnamed: 0,Location,Airport,IATA_Code,Passenger_Role,2018_passengers,count_passengers,Location_lower,index,I94_port_code,I94_port_city,I94_port_state,I94_port_city_lower
25,23,Dayton,Dayton International Airport,DAY,Small,905558,905558,dayton,,,,,
37,35,Greensboro,Piedmont Triad International Airport,GSO,Small,"1,859,588[23]",1859588,greensboro,,,,,


In [105]:
df_inter_merged_final = df_inter_merged[df_inter_merged.I94_port_city.notna()]

In [106]:
df_inter_merged_final = df_inter_merged_final[["Location", "Airport", "IATA_Code", "Passenger_Role", "count_passengers", "I94_port_code", "I94_port_city", "I94_port_state"]]

In [107]:
df_inter_merged_final.Passenger_Role.unique()

array(['Non-Hub/Reliever', 'Small', 'Medium', 'Large', 'Non-Hub',
       'Reliever'], dtype=object)

In [108]:
df_inter_merged_final = df_inter_merged_final[(df_inter_merged_final.Passenger_Role != "Non-Hub/Reliever") & 
                                             (df_inter_merged_final.Passenger_Role != "Non-Hub") &
                                             (df_inter_merged_final.Passenger_Role != "Reliever")]

In [109]:
df_inter_merged_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 97 entries, 2 to 127
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Location          97 non-null     object
 1   Airport           97 non-null     object
 2   IATA_Code         97 non-null     object
 3   Passenger_Role    97 non-null     object
 4   count_passengers  97 non-null     int64 
 5   I94_port_code     97 non-null     object
 6   I94_port_city     97 non-null     object
 7   I94_port_state    97 non-null     object
dtypes: int64(1), object(7)
memory usage: 6.8+ KB


#### df_inter_merged_final + airport_codes.csv

#### IATA code is essential for a commercial airport, as the code will be put on baggage while check-ins.
#### Based on this fact and stakeholder specifications, further filtering steps will be applied:
* ```iso_country``` == 'US'
* ```iata_code``` != Null
* ```type``` == 'medium_airport' & 'large_airport'

Large (18,500,000+ Annual Passengers)

Medium (3,500,000 - 18,499,999 Annual Passengers)

Small ( < 3,500,000 Annual Passengers)

In [110]:
raw_airports = pd.read_csv('input/airport-codes_csv.csv')
df_airports = raw_airports.copy()

In [113]:
df_airports.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22757 entries, 0 to 54896
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   ident         22757 non-null  object 
 1   type          22757 non-null  object 
 2   name          22757 non-null  object 
 3   elevation_ft  22518 non-null  float64
 4   continent     1 non-null      object 
 5   iso_country   22757 non-null  object 
 6   iso_region    22757 non-null  object 
 7   municipality  22655 non-null  object 
 8   gps_code      20984 non-null  object 
 9   iata_code     2019 non-null   object 
 10  local_code    21236 non-null  object 
 11  coordinates   22757 non-null  object 
 12  state         22757 non-null  object 
 13  latitude      22757 non-null  object 
 14  longitude     22757 non-null  object 
dtypes: float64(1), object(14)
memory usage: 2.8+ MB


In [111]:
df_airports.head(2)

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
0,00A,heliport,Total Rf Heliport,11.0,,US,US-PA,Bensalem,00A,,00A,"-74.93360137939453, 40.07080078125"
1,00AA,small_airport,Aero B Ranch Airport,3435.0,,US,US-KS,Leoti,00AA,,00AA,"-101.473911, 38.704022"


In [112]:
## clean data
## issue: as under iso_country there are NaN I could not map the dataframes without some crunching
df_airports = df_airports[df_airports.iso_country == 'US']
df_airports['state'] = df_airports.iso_region.apply(lambda x: re.match(r'US-(.*)',str(x)).group(1) if x else x)
df_airports['latitude'] = df_airports.coordinates.str.split(", ",expand=True)[0]
df_airports['longitude'] = df_airports.coordinates.str.split(", ",expand=True)[1]

In [114]:
df_inter_final = df_inter_merged_final.merge(df_airports, how='left', left_on='IATA_Code', right_on='iata_code')

In [115]:
df_inter_final[df_inter_final.Location == 'Portland']

Unnamed: 0,Location,Airport,IATA_Code,Passenger_Role,count_passengers,I94_port_code,I94_port_city,I94_port_state,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates,state,latitude,longitude
69,Portland,Portland International Jetport,PWM,Small,2134430,POM,PORTLAND,ME,KPWM,large_airport,Portland International Jetport Airport,76.0,,US,US-ME,Portland,KPWM,PWM,PWM,"-70.30930328, 43.64619827",ME,-70.30930328,43.64619827
70,Portland,Portland International Jetport,PWM,Small,2134430,POO,PORTLAND,OR,KPWM,large_airport,Portland International Jetport Airport,76.0,,US,US-ME,Portland,KPWM,PWM,PWM,"-70.30930328, 43.64619827",ME,-70.30930328,43.64619827
71,Portland,Portland International Airport,PDX,Large,19882788,POM,PORTLAND,ME,KPDX,large_airport,Portland International Airport,31.0,,US,US-OR,Portland,KPDX,PDX,PDX,"-122.5979996, 45.58869934",OR,-122.5979996,45.58869934
72,Portland,Portland International Airport,PDX,Large,19882788,POO,PORTLAND,OR,KPDX,large_airport,Portland International Airport,31.0,,US,US-OR,Portland,KPDX,PDX,PDX,"-122.5979996, 45.58869934",OR,-122.5979996,45.58869934


In [116]:
drop_cols_inter = ["type", "name", "continent", "coordinates"]
df_inter_final.drop(drop_cols_inter, inplace = True, axis = 1)

In [117]:
df_inter_final.head(2)

Unnamed: 0,Location,Airport,IATA_Code,Passenger_Role,count_passengers,I94_port_code,I94_port_city,I94_port_state,ident,elevation_ft,iso_country,iso_region,municipality,gps_code,iata_code,local_code,state,latitude,longitude
0,Albany,Albany International Airport,ALB,Small,2848000,ALB,ALBANY,NY,KALB,285.0,US,US-NY,Albany,KALB,ALB,ALB,NY,-73.80169677734375,42.74829864501953
1,Albuquerque,Albuquerque International Sunport,ABQ,Medium,5258775,ABQ,ALBUQUERQUE,NM,KABQ,5355.0,US,US-NM,Albuquerque,KABQ,ABQ,ABQ,NM,-106.609001,35.040199


In [118]:
df_inter_final = df_inter_final[["Location", "Airport", "IATA_Code", "Passenger_Role", "count_passengers", "I94_port_code", "I94_port_state", "elevation_ft", "iso_country", "municipality", "gps_code", "latitude", "longitude"]].copy()
    # rename columns
df_inter_final.rename(columns={"Location":"city_name",
                                   "Airport":"airport_name",
                                   "IATA_Code":"iata_code",
                                   "Passenger_Role":"airport_size",
                                   "count_passengers":"passenger_count",
                                   "latitude":"lat",
                                   "longitude":"lng"}, inplace=True)

In [119]:
df_inter_final.head(2)

Unnamed: 0,city_name,airport_name,iata_code,airport_size,passenger_count,I94_port_code,I94_port_state,elevation_ft,iso_country,municipality,gps_code,lat,lng
0,Albany,Albany International Airport,ALB,Small,2848000,ALB,NY,285.0,US,Albany,KALB,-73.80169677734375,42.74829864501953
1,Albuquerque,Albuquerque International Sunport,ABQ,Medium,5258775,ABQ,NM,5355.0,US,Albuquerque,KABQ,-106.609001,35.040199


#### World Temperature Data by City

#### #TODOs for df_temper:
* filter by ```Country``` == 'United States'
* convert dt to datetime
* drop columns: AverageTemperatureUncertainty

#### Issues 
* Missing temperatures will be removed for the filtered data set
* AverageTemperature will be rounded to a whole degree
* Latitude, Longitude will be set to float, 'N' and 'E' removed ()
* Duplicated values will be removed

In [5]:
df_temper = pd.read_csv(fname_temper)

In [6]:
df_temper.shape

(8599212, 7)

In [7]:
df_temper.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8599212 entries, 0 to 8599211
Data columns (total 7 columns):
 #   Column                         Dtype  
---  ------                         -----  
 0   dt                             object 
 1   AverageTemperature             float64
 2   AverageTemperatureUncertainty  float64
 3   City                           object 
 4   Country                        object 
 5   Latitude                       object 
 6   Longitude                      object 
dtypes: float64(2), object(5)
memory usage: 459.2+ MB


In [8]:
df_temper.tail(4)

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
8599208,2013-06-01,15.043,0.261,Zwolle,Netherlands,52.24N,5.26E
8599209,2013-07-01,18.775,0.193,Zwolle,Netherlands,52.24N,5.26E
8599210,2013-08-01,18.025,0.298,Zwolle,Netherlands,52.24N,5.26E
8599211,2013-09-01,,,Zwolle,Netherlands,52.24N,5.26E


In [9]:
## get unique values from column Country in order to find US
df_temper.Country.unique()

array(['Denmark', 'Turkey', 'Kazakhstan', 'China', 'Spain', 'Germany',
       'Nigeria', 'Iran', 'Russia', 'Canada', "Côte D'Ivoire",
       'United Kingdom', 'Saudi Arabia', 'Japan', 'United States',
       'India', 'Benin', 'United Arab Emirates', 'Mexico', 'Venezuela',
       'Ghana', 'Ethiopia', 'Australia', 'Yemen', 'Indonesia', 'Morocco',
       'Pakistan', 'France', 'Libya', 'Burma', 'Brazil', 'South Africa',
       'Syria', 'Egypt', 'Algeria', 'Netherlands', 'Malaysia', 'Portugal',
       'Ecuador', 'Italy', 'Uzbekistan', 'Philippines', 'Madagascar',
       'Chile', 'Belgium', 'El Salvador', 'Romania', 'Peru', 'Colombia',
       'Tanzania', 'Tunisia', 'Turkmenistan', 'Israel', 'Eritrea',
       'Paraguay', 'Greece', 'New Zealand', 'Vietnam', 'Cameroon', 'Iraq',
       'Afghanistan', 'Argentina', 'Azerbaijan', 'Moldova', 'Mali',
       'Congo (Democratic Republic Of The)', 'Thailand',
       'Central African Republic', 'Bosnia And Herzegovina', 'Bangladesh',
       'Switzerland'

In [15]:
print(f"""'Number of rows when filtered for United States {len(df_temper[df_temper.Country == 'United States'])}""")

'Number of rows when filtered for United States 687289


In [16]:
df_temper_USA = df_temper[df_temper.Country == 'United States']
df_temper['dt'] = pd.to_datetime(df_temper.dt)

In [17]:
df_temper_USA.shape

(687289, 7)

In [18]:
# removind missing temperatures
df_temper_USA_mod = df_temper_USA[~((df_temper_USA.AverageTemperature.isnull()) & 
                                    (df_temper_USA.AverageTemperatureUncertainty.isnull()))]

In [28]:
# number of cities
print(f"""The dataset holds temperature information about {len(df_temper_USA_mod.City.unique())} US cities""")

The dataset holds temperature information about 248 US cities


In [29]:
# finding duplicates
df_temper_USA_mod[df_temper_USA_mod.duplicated(subset=['dt','City'])]

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
404275,1820-01-01,-3.022,7.662,Arlington,United States,39.38N,76.99W
404276,1820-02-01,3.338,3.696,Arlington,United States,39.38N,76.99W
404277,1820-03-01,5.489,3.148,Arlington,United States,39.38N,76.99W
404278,1820-04-01,12.226,1.456,Arlington,United States,39.38N,76.99W
404279,1820-05-01,16.335,1.818,Arlington,United States,39.38N,76.99W
...,...,...,...,...,...,...,...
7148658,2013-05-01,14.309,0.331,Springfield,United States,42.59N,72.00W
7148659,2013-06-01,19.313,0.353,Springfield,United States,42.59N,72.00W
7148660,2013-07-01,23.629,0.447,Springfield,United States,42.59N,72.00W
7148661,2013-08-01,19.579,0.336,Springfield,United States,42.59N,72.00W


In [30]:
df_temper_USA_dropdup = df_temper_USA_mod.drop_duplicates(subset=['dt','City','Country'],keep='first')

In [32]:
print(f"""Percentage of values missing by column AverageTemperature and duplicates when filtered for US: {"{0:.0f}%".format(df_temper_USA_dropdup.shape[0] / df_temper_USA.shape[0])}""")

Percentage of values missing by column AverageTemperature and duplicates when filtered for US: 1%


In [31]:
df_temper_USA_dropdup.shape

(639649, 7)

In [33]:
# Rounding temperature to decimals=1
df_temper_USA_dropdup['AverageTemperature'] = df_temper_USA_dropdup['AverageTemperature'].round(decimals=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_temper_USA_dropdup['AverageTemperature'] = df_temper_USA_dropdup['AverageTemperature'].round(decimals=1)


In [34]:
df_temper_USA_dropdup.head()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
47555,1820-01-01,2.1,3.217,Abilene,United States,32.95N,100.53W
47556,1820-02-01,6.9,2.853,Abilene,United States,32.95N,100.53W
47557,1820-03-01,10.8,2.395,Abilene,United States,32.95N,100.53W
47558,1820-04-01,18.0,2.202,Abilene,United States,32.95N,100.53W
47559,1820-05-01,21.8,2.036,Abilene,United States,32.95N,100.53W


In [35]:
### Min/Max Dates
dates = (df_temper_USA_dropdup.assign(dt=df_temper_USA_dropdup['dt'])
       .groupby(['City'])['dt'].agg([('earliest' , 'min'), ('latest', 'max')])
       .add_prefix('Date_'))
dates = dates.reset_index()

In [36]:
dates.Date_latest.unique()

array(['2013-09-01', '2013-08-01'], dtype=object)

In [37]:
dates.Date_earliest.unique()

array(['1820-01-01', '1743-11-01', '1849-01-01', '1828-01-01',
       '1758-03-01', '1823-01-01', '1835-01-01', '1775-04-01',
       '1825-05-01', '1768-09-01', '1821-11-01'], dtype=object)

In [38]:
dates.head()

Unnamed: 0,City,Date_earliest,Date_latest
0,Abilene,1820-01-01,2013-09-01
1,Akron,1743-11-01,2013-09-01
2,Albuquerque,1820-01-01,2013-09-01
3,Alexandria,1743-11-01,2013-09-01
4,Allentown,1743-11-01,2013-09-01


#### Issues World temperature data

* Date_first column has 11 different dates
* Date_latest column has only two different values dates, separated by a day
* Date_latest goes only till 2013, whereas I94 Immigration Data is from 2016. 

As temperature is not main priority of the project, as it was not issued as critical by stakeholders while planning, not further considerations, or cleaning steps will be taken.

Temperature data will not be taken for final model

#### I94 Immigration Data Sample

#### TODOs
* stakeholders are only interested in Air as mode of transportation. Thereafter, ```i94mode``` only 1.0 will be taken from here 
* further columns hold no value for further analysis and will be removed: cicid, i94res, count, entdepa, entdepd, entdepu, insnum
* the company's main product is aimed at females first. Thereafter the gender will be filtered for male and female only in order to calculate female share on immigrants by country

#### Issues
* no such ```visatype``` as GMT. Records with this value will be removed
* no information given on columns ```entdepa```, ```entdepd```, ```entdepu```. As the records from this columns are of no particular value for further analysis, they will be removed

In [39]:
df_immi = pd.read_csv(fname_immi)

In [40]:
df_immi.head()

Unnamed: 0.1,Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,i94bir,i94visa,count,dtadfile,visapost,occup,entdepa,entdepd,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,2027561,4084316.0,2016.0,4.0,209.0,209.0,HHW,20566.0,1.0,HI,20573.0,61.0,2.0,1.0,20160422,,,G,O,,M,1955.0,7202016,F,,JL,56582670000.0,00782,WT
1,2171295,4422636.0,2016.0,4.0,582.0,582.0,MCA,20567.0,1.0,TX,20568.0,26.0,2.0,1.0,20160423,MTR,,G,R,,M,1990.0,10222016,M,,*GA,94362000000.0,XBLNG,B2
2,589494,1195600.0,2016.0,4.0,148.0,112.0,OGG,20551.0,1.0,FL,20571.0,76.0,2.0,1.0,20160407,,,G,O,,M,1940.0,7052016,M,,LH,55780470000.0,00464,WT
3,2631158,5291768.0,2016.0,4.0,297.0,297.0,LOS,20572.0,1.0,CA,20581.0,25.0,2.0,1.0,20160428,DOH,,G,O,,M,1991.0,10272016,M,,QR,94789700000.0,00739,B2
4,3032257,985523.0,2016.0,4.0,111.0,111.0,CHM,20550.0,3.0,NY,20553.0,19.0,2.0,1.0,20160406,,,Z,K,,M,1997.0,7042016,F,,,42322570000.0,LAND,WT


In [41]:
df_immi.i94cit.unique()

array([209., 582., 148., 297., 111., 577., 245., 113., 131., 116., 438.,
       260., 512., 689., 746., 115., 251., 268., 129., 213., 135., 133.,
       373., 126., 252., 696., 117., 687., 528., 123., 258., 691., 130.,
       107., 103., 694., 254., 574., 368., 575., 586., 734., 514., 273.,
       692., 109., 579., 164., 263., 464., 124., 602., 121., 162., 274.,
       690., 207., 104., 525., 105., 343., 576., 585., 272., 108., 114.,
       140., 180., 526., 603., 332., 513., 516., 218., 204., 296., 201.,
       257., 266., 520., 718., 112., 261., 299., 688., 141., 350., 340.])

In [42]:
df_immi.describe()

Unnamed: 0.1,Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,arrdate,i94mode,depdate,i94bir,i94visa,count,dtadfile,entdepu,biryear,insnum,admnum
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,951.0,1000.0,1000.0,1000.0,1000.0,0.0,1000.0,35.0,1000.0
mean,1542097.0,3040461.0,2016.0,4.0,302.928,298.262,20559.68,1.078,20575.037855,42.382,1.859,1.0,20160420.0,,1973.618,3826.857143,69372370000.0
std,915287.9,1799818.0,0.0,0.0,206.485285,202.12039,8.995027,0.485955,24.211234,17.903424,0.386353,0.0,49.51657,,17.903424,221.742583,23381340000.0
min,10925.0,13208.0,2016.0,4.0,103.0,103.0,20545.0,1.0,20547.0,1.0,1.0,1.0,20160400.0,,1923.0,3468.0,0.0
25%,721442.2,1412170.0,2016.0,4.0,135.0,131.0,20552.0,1.0,20561.0,30.75,2.0,1.0,20160410.0,,1961.0,3668.0,55993010000.0
50%,1494568.0,2941176.0,2016.0,4.0,213.0,213.0,20560.0,1.0,20570.0,42.0,2.0,1.0,20160420.0,,1974.0,3887.0,59314770000.0
75%,2360901.0,4694151.0,2016.0,4.0,438.0,438.0,20567.25,1.0,20580.0,55.0,2.0,1.0,20160420.0,,1985.25,3943.0,93436230000.0
max,3095749.0,6061994.0,2016.0,4.0,746.0,696.0,20574.0,9.0,20715.0,93.0,3.0,1.0,20160800.0,,2015.0,4686.0,95021510000.0


In [43]:
df_immi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 29 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  1000 non-null   int64  
 1   cicid       1000 non-null   float64
 2   i94yr       1000 non-null   float64
 3   i94mon      1000 non-null   float64
 4   i94cit      1000 non-null   float64
 5   i94res      1000 non-null   float64
 6   i94port     1000 non-null   object 
 7   arrdate     1000 non-null   float64
 8   i94mode     1000 non-null   float64
 9   i94addr     941 non-null    object 
 10  depdate     951 non-null    float64
 11  i94bir      1000 non-null   float64
 12  i94visa     1000 non-null   float64
 13  count       1000 non-null   float64
 14  dtadfile    1000 non-null   int64  
 15  visapost    382 non-null    object 
 16  occup       4 non-null      object 
 17  entdepa     1000 non-null   object 
 18  entdepd     954 non-null    object 
 19  entdepu     0 non-null      

In [44]:
df_immi.visatype.unique()

array(['WT', 'B2', 'CP', 'B1', 'GMT', 'WB', 'F1', 'E2', 'F2', 'M1'],
      dtype=object)

In [45]:
# Country where visa was issued
print(df_immi[df_immi.visapost.isnull()].shape)
df_immi['visapost'].value_counts().head()

(618, 29)


MEX    28
BNS    21
BGT    14
SPL    14
GUZ    13
Name: visapost, dtype: int64

In [46]:
# everything except '1' will be removed
df_immi['i94mode'].value_counts().head()

# stakeholders are only interested in Air as mode of transportation. Thereafter, only 1.0 will be taken from here  

1.0    962
3.0     26
2.0     10
9.0      2
Name: i94mode, dtype: int64

In [47]:
# although the sample only holds four data point in this columns, a greater amount can provide more information
# this column will be taken for further analysis
print(df_immi[df_immi.occup.isnull()].shape)
df_immi['occup'].value_counts().head()

(996, 29)


STU    2
OTH    1
PHA    1
Name: occup, dtype: int64

In [48]:
# dataset provides visitors age
print(df_immi[df_immi.i94bir.isnull()].shape)
df_immi['i94bir'].value_counts().head()

(0, 29)


34.0    29
44.0    27
40.0    27
35.0    26
48.0    25
Name: i94bir, dtype: int64

In [49]:
# matchflag is 
print(df_immi['matflag'].value_counts())
temp = df_immi.copy()
temp[temp['matflag'].isnull()].head()

M    954
Name: matflag, dtype: int64


Unnamed: 0.1,Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,i94bir,i94visa,count,dtadfile,visapost,occup,entdepa,entdepd,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
36,98732,216657.0,2016.0,4.0,696.0,696.0,FTL,20545.0,1.0,FL,,54.0,2.0,1.0,20160401,CRS,,G,,,,1962.0,9302016,F,,2D,92513110000.0,406,B2
79,2937355,5957654.0,2016.0,4.0,254.0,276.0,SAI,20556.0,1.0,GU,,20.0,2.0,1.0,20160610,,,A,,,,1996.0,5262016,M,3993.0,7C,45114680000.0,3404,GMT
82,3052309,1435383.0,2016.0,4.0,574.0,206.0,BLA,20552.0,3.0,NE,,61.0,2.0,1.0,20160408,,,Z,,,,1955.0,8082016,F,,,87733070000.0,1788,B2
100,899737,1843262.0,2016.0,4.0,245.0,245.0,CHI,20554.0,1.0,IN,,60.0,2.0,1.0,20160410,BEJ,,G,,,,1956.0,10092016,F,,AA,93191570000.0,186,B2
106,2068019,4231176.0,2016.0,4.0,586.0,586.0,NYC,20566.0,1.0,NJ,,68.0,2.0,1.0,20160422,,,O,,,,1948.0,10212016,,,AA,94252920000.0,2179,B2


In [50]:
# great piece of data when combined with other dimensions
# will be taken for further analysis
print(df_immi[df_immi.visatype.isnull()].shape)
df_immi['visatype'].value_counts()

(0, 29)


WT     443
B2     356
WB      91
B1      61
GMT     27
F1      10
CP       5
F2       3
E2       3
M1       1
Name: visatype, dtype: int64

In [51]:
df_immi.gender.value_counts()

M    471
F    386
X      2
Name: gender, dtype: int64

#### US Cities Demographics

#### TODOs
* a female share on total population has to be calculated
* no race analysis must be made, as North America is completely new to the company. Column Race and Count will be removed
* group data by city, state, and state code
* map demographics table to unzips in order to attain more valuable information
* rename columns

#### Issues
* multiple column types must be set correctly 
* no easy way to map both tables. thereafter groupby was made, as mapping column city+state was defined

In [53]:
raw_demog = pd.read_csv('input/us-cities-demographics.csv', delimiter=';')
raw_uszips = pd.read_csv('input/uszips.csv')

df_demog = raw_demog.copy()
df_uszips = raw_uszips.copy()

In [54]:
df_demog.head(3)

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Silver Spring,Maryland,33.8,40601.0,41862.0,82463,1562.0,30908.0,2.6,MD,Hispanic or Latino,25924
1,Quincy,Massachusetts,41.0,44129.0,49500.0,93629,4147.0,32935.0,2.39,MA,White,58723
2,Hoover,Alabama,38.5,38040.0,46799.0,84839,4819.0,8229.0,2.58,AL,Asian,4759


In [55]:
# Convert floats to ints
cols_float_2_int = ['Male Population', 'Female Population', 'Total Population']

for col in cols_float_2_int:
    df_demog[col] = df_demog[col].replace(np.nan, 0)
    df_demog[col] = df_demog[col].astype(int)

In [56]:
# filter data by comparing sum of male and female population with the total population
df_demog = df_demog[(df_demog['Male Population'] + df_demog['Female Population'] == df_demog['Total Population'])].copy()

In [57]:
df_demog.head(3)

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Silver Spring,Maryland,33.8,40601,41862,82463,1562.0,30908.0,2.6,MD,Hispanic or Latino,25924
1,Quincy,Massachusetts,41.0,44129,49500,93629,4147.0,32935.0,2.39,MA,White,58723
2,Hoover,Alabama,38.5,38040,46799,84839,4819.0,8229.0,2.58,AL,Asian,4759


In [58]:
# define columns for further analysis
cols_main = ['City','State','Median Age','Male Population','Female Population','Total Population','Average Household Size','State Code']
df_demog_main = df_demog[cols_main]

In [59]:
# group data by three columns and aggregate their values
df_demog_agg = df_demog_main.groupby(['City', 'State', 'State Code']).agg({'Median Age':'mean', 
                                                            'Male Population':'sum', 
                                                            'Female Population':'sum',
                                                            'Average Household Size':'mean'}).reset_index().copy()

In [60]:
# calculate total population
df_demog_agg['popul_total'] = df_demog_agg['Male Population'] + df_demog_agg['Female Population']

In [61]:
# calculate female share
df_demog_agg['share_female'] = df_demog_agg['Female Population'] / df_demog_agg['popul_total']

In [62]:
# define column to match with another dataframe
df_demog_agg['matcher'] = df_demog_agg['City'] + '/' + df_demog_agg['State']

In [63]:
# unzips holds multiple rows for many cities as county_fips numbers vary. 
temp_density_coor = df_uszips.groupby(['city', 'state_id', 'state_name']).agg({'density':'mean',
                                                                               'lat':'mean',
                                                                               'lng':'mean'}).reset_index().copy()

In [64]:
temp_density_coor['matcher'] = temp_density_coor['city'] + '/' + temp_density_coor['state_name']

In [65]:
df_merged = df_demog_agg.merge(temp_density_coor, how='left', on='matcher')

In [66]:
# Convert floats to ints
cols_float = ['Average Household Size', 'share_female', 'density']

for col in cols_float:
    df_merged[col] = df_merged[col].replace(np.nan, 0)

In [67]:
df_merged['share_female'] = df_merged['share_female'].round(decimals=2)
df_merged['density'] = df_merged['density'].round(decimals=2)

In [68]:
# rename columns
df_merged.rename(columns={'City':'city_name', 
                          'State':'state_name', 
                          'State Code':'state_code',
                          'Median Age':'age_median',
                          'Male Population':'popul_male',
                          'Female Population':'popul_female',
                          'Average Household Size':'household_size_ave',
                          'density':'popul_density'}, inplace=True)

#### I94_cit_codes and country_codes.csv

#### TODOs
* find out, how to map both dataframes

#### Issues
* there are misspeling in countries names

In [120]:
raw_ccodes = pd.read_csv('input/country_codes.csv', converters={"country_code":str,
                                                          "region_code":str})
df_ccodes = raw_ccodes.copy()

In [121]:
df_ccodes.head(2)

Unnamed: 0,name,alpha_2,alpha_3,country_code,iso_3166_2,region,sub_region,intermediate_region,region_code,sub_region_code,intermediate_region_code
0,Afghanistan,AF,AFG,4,ISO 3166-2:AF,Asia,Southern Asia,,142,34.0,
1,Åland Islands,AX,ALA,248,ISO 3166-2:AX,Europe,Northern Europe,,150,154.0,


In [122]:
I94_cit_codes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235 entries, 0 to 234
Data columns (total 2 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   I94_country_code  235 non-null    object
 1   I94_country       235 non-null    object
dtypes: object(2)
memory usage: 3.8+ KB


In [123]:
I94_cit_codes["I94_country_low"] = I94_cit_codes["I94_country"].apply(lambda x: x.lower())

In [124]:
df_ccodes["name_low"] = df_ccodes.name.apply(lambda x: x.lower())

In [125]:
df_I94_merged = I94_cit_codes.merge(df_ccodes, how='left', left_on='I94_country_low', right_on='name_low')

In [126]:
df_I94_merged.head()

Unnamed: 0,I94_country_code,I94_country,I94_country_low,name,alpha_2,alpha_3,country_code,iso_3166_2,region,sub_region,intermediate_region,region_code,sub_region_code,intermediate_region_code,name_low
0,236,AFGHANISTAN,afghanistan,Afghanistan,AF,AFG,4,ISO 3166-2:AF,Asia,Southern Asia,,142,34.0,,afghanistan
1,101,ALBANIA,albania,Albania,AL,ALB,8,ISO 3166-2:AL,Europe,Southern Europe,,150,39.0,,albania
2,316,ALGERIA,algeria,Algeria,DZ,DZA,12,ISO 3166-2:DZ,Africa,Northern Africa,,2,15.0,,algeria
3,102,ANDORRA,andorra,Andorra,AD,AND,20,ISO 3166-2:AD,Europe,Southern Europe,,150,39.0,,andorra
4,324,ANGOLA,angola,Angola,AO,AGO,24,ISO 3166-2:AO,Africa,Sub-Saharan Africa,Middle Africa,2,202.0,17.0,angola


In [127]:
df_I94_merged[df_I94_merged.name.notna()].head(2)

Unnamed: 0,I94_country_code,I94_country,I94_country_low,name,alpha_2,alpha_3,country_code,iso_3166_2,region,sub_region,intermediate_region,region_code,sub_region_code,intermediate_region_code,name_low
0,236,AFGHANISTAN,afghanistan,Afghanistan,AF,AFG,4,ISO 3166-2:AF,Asia,Southern Asia,,142,34.0,,afghanistan
1,101,ALBANIA,albania,Albania,AL,ALB,8,ISO 3166-2:AL,Europe,Southern Europe,,150,39.0,,albania


In [128]:
drop_ = ['I94_country', 'I94_country_low', 'intermediate_region', 'intermediate_region_code', 'name_low']
df_I94_merged.drop(drop_, inplace = True, axis = 1)

In [129]:
df_I94_merged.rename(columns={"name":"county_name",
                              "alpha_2":"country_alpha_2",
                              "alpha_3":"country_alpha_3",
                              "iso_3166_2":"country_iso_3166_2",
                              "region":"country_region",
                              "sub_region":"country_sub_region",
                              "region_code":"country_region_code",
                              "sub_region_code":"country_sub_region_code"}, inplace=True)

In [130]:
import numpy as np
df_I94_merged["country_sub_region_code"] = df_I94_merged["country_sub_region_code"].replace(np.nan, 0)
df_I94_merged["country_sub_region_code"] = df_I94_merged["country_sub_region_code"].astype(int)

In [131]:
df_I94_merged.head(3)

Unnamed: 0,I94_country_code,county_name,country_alpha_2,country_alpha_3,country_code,country_iso_3166_2,country_region,country_sub_region,country_region_code,country_sub_region_code
0,236,Afghanistan,AF,AFG,4,ISO 3166-2:AF,Asia,Southern Asia,142,34
1,101,Albania,AL,ALB,8,ISO 3166-2:AL,Europe,Southern Europe,150,39
2,316,Algeria,DZ,DZA,12,ISO 3166-2:DZ,Africa,Northern Africa,2,15


### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
This section provides information on what data model was chosen, why and what steps are required in order to load transformed data to a Data Warehouse

As mentioned earlier, main consumers of the data are representitives of a few different departments of "A sweet Thing" company. While considering general database complexity level, one has to keep in mind, that the stakeholders are no data professionals, have different needs in regard to queries, and are likely to expect support from Data Engineer, if query complexity grows.

Based on this knowledge a star schema was chosen. Simple as it is, it is likely to satisfy even most demanding users. 

Picture below shows tables produces, as well as how these tables can be joined. In the middle is the fact table, surrounded by four dimension tables. 

<img src="images/model_data_warehouse.png">

#### 3.2 Mapping Out Data Pipelines
Value from data can only be achieved by working with it thoroughly. Each dataset is likely to be unique, has it's own flows, and critical information. Data manipulation is a very consuming part of each data project, as it has to be done carefully. 

Each table mentioned in the picture above consists of multiple datasets, combined together. This notebook provides detailed information on what had to be done in order to, on the one hand, keep as much data as possible, while, on the other hand, removing / replacing / transforming as much as needed.

Generated dataframes in .csv are saved in ./output/, whereas the main data is kept safe in ./output_gzip/df_immigrant.gzip

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
In order to load data to the Data Warehouse three further scripts were written:
* ```create_tables.py``` -> holds methods to create / drop database
* ```sql_queries.py``` -> holds drop / create / insert queries
* ```etl.py``` -> holds two methods to upload data from .csv files and .gzip files into Data Warehouse

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness

#### Source/Count checks to ensure completeness
Data quality checks are implemented on two separate places of the pipeline:
1. in ```process_input.py``` -> after each file is being cleaned
2. in ```etl.py``` -> after uploading dataframes to Data Warehouse

Unit tests for the scripts to ensure they are doing the right thing:
```unit_test.ipynb``` -> after data was uploaded to the Data Warehouse this script eveluates, if particular values from processed data files can be found in the right format in a Data Warahouse table

#### 4.3 Data dictionary 
Data dictionary for the data model above can be found in the project documentation (filename = 'data_dict.md')

#### Step 5: Complete Project Write Up
#### Tools used
* Python: Python is number one programming language to perform data analysis these days: many different libraries make it easy and quick to perform data analysis, also on a big scale 
* Pandas: Pandas is one of main libraries to crunch data, and is unavoidable when it comes to small to medium datasets
* Spark: Spark (Pyspark library) is best choice when it comes to large datasets
* Postgresql: Postgresql is a great Data Warehouse option
#### Tools for better accessibility:
* AWS S3: S3 can be used to store processed data.  
* AWS Redshift: Redshift is a Data Warehouse, which can be accessed from anywhere. This makes it a perfect choice for those who wants to share a database with more than one location
#### Data updates
As mentioned in the very beginning, the purpose of this Data Project was to gain insights about who immigrate to US and where to exactly. Depending on what next goals might be, one can define how often an update should be made.

From the perspective of the predefined goals it is necessary to update (append new data) tables every 3-5 Months. This way Stakeholders can see changes in immigrants behaviour and act accordingly.
#### Scenarios:
##### The data was increased by 100x.
Around 35 Million rows of data was processed during this project. It was done in a dedicated Udacity workspace and took almost 2 hours. 100x data increase can potentially, even if only minimally, harm business. In order to to secure efficient data processing, alternative places, such as scalable AWS Ressources (also counterparts), should be considered.
AWS Ressources like:
* Amazon S3 -> data storage
* Amazon EMR -> uses Apache Spark, Apache Hive, Presto and other open-source frameworks. This Ressource is able to help in analyzing vast amounts of data
* Amazon Redshift -> being a petabyte-scale data warehouse service, Redshift can then be used as a data warehouse for processed data 

##### The data populates a dashboard that must be updated on a daily basis by 7am every day.
In order to update data frequently, furthermore at 7am, it must be automated properly. Depending on the input and the complexity of data wragling one has at least two options to choose from (sorted by complexity):
* CRON: as long as scripts are simple enough and all the exceptions are in right places
* Airflow: if pipeline is complex and constists of multiple processes that have to be performed as a DAG
* Amazon Livy (in combination with Airflow and Spark): helps to submit multiple Spark jobs in parallel on an EMR cluster

##### The database needed to be accessed by 100+ people.
A good way to handle this amount of connections is to use Redshift. The downfall by this alternative are the costs, which can become significant with time and increased data amount.

Another solution would be to copy (on a regular basis) data from Data Warehouse to a NoSQL server, for example Apache Cassandra or MongoDB. These servers can easily handle multiple connections. They however have no the same level of flexibility and complexity when building queries.