# Project Title
### ETL Pipeline for US Immigration Data

In [1]:
# Do all imports and installs here
import configparser
import os
from datetime import datetime, timedelta   ## for converting data to datetime format

import boto3
import numpy as np
import pandas as pd
import psycopg2     ## Psycopg2 is a PostgreSQL database adapter for the Python programming language.
import matplotlib.pyplot as plt
%matplotlib inline

from pyspark.sql import SparkSession
from pyspark.sql import types as T         ## for DateType and other data types
from pyspark.sql.functions import asc, desc, udf
from pyspark.sql.functions import sum as Fsum
from pyspark.sql.types import IntegerType, StringType

%reload_ext sql

pd.set_option('display.max_columns', 500)

### Step 1: Scope the Project and Gather Data

#### 1.1 Scope 
Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use? etc.


*Our objective is to build a pipeline that processes the raw data in the client's data repository, stores the processed data in a data lake (S3), and loads it into a cloud data warehouse (AWS Redshift) for querying, so that the client can analyse the immigration data at scale and generate actionable insights faster.*


#### 1.2 Describe and Gather Data 
Describe the data sets you're using. Where did it come from? What type of information is included? 

#### 1.3 Data:
1. I94 Immigration Data: This data comes from the US National Tourism and Trade Office. 
      The descriptions are contained in the 'I94_SAS_Labels_Descriptions.SAS' file.
2. World Temperature Data: This dataset comes from Kaggle.
3. U.S. City Demographic Data: This data comes from OpenDataSoft.
4. Airport Code Table: This is a simple table of airport codes and corresponding cities.


#### 1.4 Architecture
![](Images/Architecture.png)

![](Images/architecture1.png)

#### 1.5 High Level Description of the architecture.

1. **Justification for using both EMR and Redshift:** The data stored in S3 can be queried either with PySpark on EMR or from Redshift, which can run SQL queries directly against several data formats in S3. With Redshift this takes only SQL, whereas PySpark requires PySpark code plus some SQL. EMR and Redshift are used in two different phases: EMR processes the large raw data, and Redshift queries the processed data. Redshift is a columnar database, so analytical queries on the processed data are expected to run faster there than in Spark. It also keeps things simple for the end user, who only needs to write SQL queries against the Redshift data warehouse.

2. **Phase 1:** The data is large and comes in varied formats (SAS files, CSV files, and Parquet files), so we first process all of it with PySpark on EMR and store the result in S3 in Parquet format.

3. **Phase 2:** The processed data in S3 is then copied into fact and dimension tables in Redshift, so that it can be queried from any BI application. Query results from Redshift can also be stored in S3.


In [2]:
## create environment variable for configuration
config = configparser.ConfigParser()
config.read('dl.cfg')

os.environ['AWS_ACCESS_KEY_ID']=config['AWS']['AWS_ACCESS_KEY_ID']
os.environ['AWS_SECRET_ACCESS_KEY']=config['AWS']['AWS_SECRET_ACCESS_KEY']

##### Special Notes: 

1. Once the **environment variables** are created, we can create the Spark session with the `hadoop-aws` package, which connects the Spark session to AWS using those environment variables.

2. Therefore, always create the environment variables before creating the Spark session.
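
To make the ordering concrete, here is a minimal sketch of the credential step. It parses an in-memory stand-in for `dl.cfg` (the key values are placeholders, not real credentials, and the helper name `export_aws_credentials` is hypothetical) and exports the keys before any Spark session is built.

```python
import configparser
import os

# Stand-in for dl.cfg; the values are placeholders, not real credentials.
SAMPLE_CFG = """
[AWS]
AWS_ACCESS_KEY_ID = EXAMPLEKEYID
AWS_SECRET_ACCESS_KEY = exampleSecretKey
"""


def export_aws_credentials(cfg_text):
    """Parse the config text and copy the AWS keys into os.environ."""
    config = configparser.ConfigParser()
    config.read_string(cfg_text)
    exported = {}
    for name in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"):
        os.environ[name] = config["AWS"][name]
        exported[name] = os.environ[name]
    return exported


exported = export_aws_credentials(SAMPLE_CFG)
# The Spark session should only be created after this point.
print(sorted(exported))
```

Keeping the export in one small function makes it easy to call at the top of the notebook, before the cell that builds the session.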

In [3]:
## create spark session
## Note: "spark.jars.packages" takes a single comma-separated list. Calling
## .config() twice with the same key keeps only the last value.
spark = SparkSession \
        .builder \
        .config("spark.jars.packages",
                "saurfang:spark-sas7bdat:2.0.0-s_2.11,org.apache.hadoop:hadoop-aws:2.7.0") \
        .getOrCreate()

In [4]:
## Define a Function to convert to upper case

def to_upper_case_(df, existing_column_name, new_column_name):
    """ 
    This function creates an additional column, which converts the existing column to upper case.
    This function also drops the existing lower case column after conversion to upper case.
    
    Args:
      df: data frame
      existing_column_name: name of existing lower case column in data frame df.
      new_column_name: name of new upper case column created using this function. 
      
    """
    df[new_column_name] = df[existing_column_name].apply(lambda x: x.upper())
    df.drop([existing_column_name], axis = 1, inplace = True)
    
    return df

### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
Document steps necessary to clean the data

#### 2.0 S3 Bucket paths

In [5]:
# S3 Customer data paths
csv_data_path = 's3a://data-store-client-ap/csv_data/'
immig_data_path = 's3a://data-store-client-ap/sas_immi_data/'

# S3 Staging path

staging_path = 's3a://staging-ap/'

#### 2.1 Immigration data

In [6]:
# Read the immigration data from S3
df_immi = spark.read.parquet(immig_data_path+'part*')

In [7]:
# Have a glance at the immigration data
df_immi.limit(10).toPandas()

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,i94bir,i94visa,count,dtadfile,visapost,occup,entdepa,entdepd,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,459651.0,2016.0,4.0,135.0,135.0,ATL,20547.0,1.0,FL,20559.0,54.0,2.0,1.0,20160403,,,O,R,,M,1962.0,7012016,,,VS,55556250000.0,115,WT
1,459652.0,2016.0,4.0,135.0,135.0,ATL,20547.0,1.0,FL,20555.0,74.0,2.0,1.0,20160403,,,T,O,,M,1942.0,7012016,F,,VS,674406500.0,103,WT
2,459653.0,2016.0,4.0,135.0,135.0,ATL,20547.0,1.0,FL,20557.0,44.0,2.0,1.0,20160403,,,T,Q,,M,1972.0,10022016,M,,VS,674948200.0,109,B2
3,459654.0,2016.0,4.0,135.0,135.0,ATL,20547.0,1.0,G,20555.0,38.0,2.0,1.0,20160403,,,O,O,,M,1978.0,7012016,,,VS,55541760000.0,103,WT
4,459655.0,2016.0,4.0,135.0,135.0,ATL,20547.0,1.0,GA,,64.0,2.0,1.0,20160403,,,G,,,,1952.0,7012016,F,,VS,55541330000.0,103,WT
5,459656.0,2016.0,4.0,135.0,135.0,ATL,20547.0,1.0,GA,,63.0,2.0,1.0,20160403,,,G,,,,1953.0,7012016,M,,BA,55578040000.0,227,WT
6,459657.0,2016.0,4.0,135.0,135.0,ATL,20547.0,1.0,GA,20548.0,44.0,1.0,1.0,20160403,,,G,I,,M,1972.0,7012016,M,,BA,55578920000.0,227,WB
7,459658.0,2016.0,4.0,135.0,135.0,ATL,20547.0,1.0,GA,20548.0,39.0,1.0,1.0,20160403,,,G,I,,M,1977.0,7012016,M,,BA,55578740000.0,227,WB
8,459659.0,2016.0,4.0,135.0,135.0,ATL,20547.0,1.0,GA,20573.0,84.0,2.0,1.0,20160403,,,G,I,,M,1932.0,7012016,M,,BA,55577460000.0,227,WT
9,459660.0,2016.0,4.0,135.0,135.0,ATL,20547.0,1.0,GA,20555.0,55.0,2.0,1.0,20160403,,,G,N,,M,1961.0,7012016,M,,BA,55577870000.0,227,WT


In [8]:
print('Total Number of Rows in Immigration DataFrame: '+ str(df_immi.count()))

Total Number of Rows in Immigration DataFrame: 2861188


In [9]:
df_immi.printSchema()

root
 |-- cicid: double (nullable = true)
 |-- i94yr: double (nullable = true)
 |-- i94mon: double (nullable = true)
 |-- i94cit: double (nullable = true)
 |-- i94res: double (nullable = true)
 |-- i94port: string (nullable = true)
 |-- arrdate: double (nullable = true)
 |-- i94mode: double (nullable = true)
 |-- i94addr: string (nullable = true)
 |-- depdate: double (nullable = true)
 |-- i94bir: double (nullable = true)
 |-- i94visa: double (nullable = true)
 |-- count: double (nullable = true)
 |-- dtadfile: string (nullable = true)
 |-- visapost: string (nullable = true)
 |-- occup: string (nullable = true)
 |-- entdepa: string (nullable = true)
 |-- entdepd: string (nullable = true)
 |-- entdepu: string (nullable = true)
 |-- matflag: string (nullable = true)
 |-- biryear: double (nullable = true)
 |-- dtaddto: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- insnum: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- admnum: double (nullable = 

#### ----------------------------------------------------
#### Data Dictionary- Immigration Data 

cicid: ID that uniquely identifies each record

I94YR: 4 digit year

I94MON: Numeric month

I94CIT: 3 digit code of the immigrant's country of citizenship

I94RES: 3 digit code of the immigrant's country of residence

I94PORT: Port admitted through

ARRDATE: Arrival date in the USA

I94MODE: Mode of transportation (1 = Air; 2 = Sea; 3 = Land; 9 = Not reported)

I94ADDR: State of arrival

DEPDATE: Departure date

I94BIR: Age of immigrant in Years

I94VISA: Visa codes collapsed into three categories: (1 = Business; 2 = Pleasure; 3 = Student)

COUNT: count of rows (value of 1 assigned to each row)

DTADFILE: Character Date Field

VISAPOST: Department of State post where the visa was issued

OCCUP: Occupation that will be performed in U.S.

ENTDEPA: Arrival Flag. Whether admitted or paroled into the US

ENTDEPD: Departure Flag. Whether departed, lost visa, or deceased

ENTDEPU: Update Flag. Update of visa, either apprehended, overstayed, or updated to PR

MATFLAG: Match flag

BIRYEAR: 4 digit year of birth

DTADDTO: Character date field to when admitted in the US

GENDER: Gender

INSNUM: INS number

AIRLINE: Airline used to arrive in U.S.

ADMNUM: Admission number, should be unique and not nullable

FLTNO: Flight number of Airline used to arrive in U.S.

VISATYPE: Class of admission legally admitting the non-immigrant to temporarily stay in U.S.

##### -----------------------------------------------------

#### 2.1.1 Airport Codes, Country Code, State Code, Visa Code, Travel Mode Code

In [10]:
# Read the csv data from S3
df_airpt_code = spark.read.csv(csv_data_path + 'Airport_Code.csv', header = True, sep = ',')
df_country_code = spark.read.csv(csv_data_path + 'Country_Code.csv', header = True,  sep = ',')
df_state_code = spark.read.csv(csv_data_path + 'State_Code.csv', header = True,  sep = ',')
df_visa_code = spark.read.csv(csv_data_path + 'Visa_Code.csv', header = True, sep = ',')
df_travelMode_code = spark.read.csv(csv_data_path + 'Mode_of_Travel_Code.csv', header = True,  sep = ',')

In [11]:
# create a collection (dictionary) of the codes data
dict_codes = {'airport code': df_airpt_code,'country code': df_country_code,'state code': df_state_code,\
              'visa code': df_visa_code,'travel mode code': df_travelMode_code}

In [12]:
# have a glance over the respective codes data
for name, data in dict_codes.items():
    print(name+' : \n')
    data.printSchema()
    print('\n')
    print(data.limit(3).toPandas())
    print('\n --------- \n')

airport code : 

root
 |-- i94port_: string (nullable = true)
 |-- i94_airport_name_: string (nullable = true)
 |-- i94_state_: string (nullable = true)



  i94port_         i94_airport_name_ i94_state_
0      ALC                     ALCAN         AK
1      ANC                 ANCHORAGE         AK
2      BAR  BAKER AAF - BAKER ISLAND         AK

 --------- 

country code : 

root
 |-- i94cit_: string (nullable = true)
 |-- i94_country_: string (nullable = true)
 |-- iso_country_code_: string (nullable = true)



  i94cit_ i94_country_ iso_country_code_
0     582       MEXICO               484
1     236  AFGHANISTAN                 4
2     101      ALBANIA                 8

 --------- 

state code : 

root
 |-- State_Code: string (nullable = true)
 |-- State: string (nullable = true)



  State_Code    State
0         AL  ALABAMA
1         AK   ALASKA
2         AZ  ARIZONA

 --------- 

visa code : 

root
 |-- Code_Visa: string (nullable = true)
 |-- Visa_Name: string (nullable = true

##### 2.1.2 Join the Codes data to immigration data

In [13]:
# Join the 'state code' data with the immigration data because immigration data has the state code

df_immi_state = df_immi.join(df_state_code, df_immi.i94addr == df_state_code.State_Code)

In [14]:
df_immi_state.count()

2736378
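
The joined count (2,736,378) is lower than the raw row count (2,861,188) because an inner join drops every row whose `i94addr` is null or has no match in the state code table. In the preview above, for example, the row with cicid 459654 carries the code `G`, which is not a state code. The plain-Python sketch below mimics that behaviour on a few rows; the last row is made up for illustration. In Spark, a `left_anti` join would list the unmatched rows.

```python
# State lookup, as in the State_Code.csv data (only two entries shown here).
state_codes = {"FL": "FLORIDA", "GA": "GEORGIA"}

# (cicid, i94addr) pairs; the last row is a made-up example of a missing state.
immigration_rows = [
    (459653, "FL"),
    (459654, "G"),    # invalid code: no match in state_codes
    (459655, "GA"),
    (999999, None),   # hypothetical row with a null i94addr
]

# Inner join: keep only rows whose state code has a match.
joined = [
    (cicid, addr, state_codes[addr])
    for cicid, addr in immigration_rows
    if addr in state_codes
]

# Anti join: the rows the inner join silently discards.
dropped = [cicid for cicid, addr in immigration_rows if addr not in state_codes]

print(joined)
print(dropped)

# The same bookkeeping on the counts reported by the notebook.
rows_lost = 2861188 - 2736378
print(rows_lost)  # 124810
```

Whether losing these rows is acceptable depends on the analysis; a left join would keep them with a null state name.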

In [15]:
# view data Frame
df_immi_state.limit(10).toPandas()

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,i94bir,i94visa,count,dtadfile,visapost,occup,entdepa,entdepd,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype,State_Code,State
0,459651.0,2016.0,4.0,135.0,135.0,ATL,20547.0,1.0,FL,20559.0,54.0,2.0,1.0,20160403,,,O,R,,M,1962.0,7012016,,,VS,55556250000.0,115,WT,FL,FLORIDA
1,459652.0,2016.0,4.0,135.0,135.0,ATL,20547.0,1.0,FL,20555.0,74.0,2.0,1.0,20160403,,,T,O,,M,1942.0,7012016,F,,VS,674406500.0,103,WT,FL,FLORIDA
2,459653.0,2016.0,4.0,135.0,135.0,ATL,20547.0,1.0,FL,20557.0,44.0,2.0,1.0,20160403,,,T,Q,,M,1972.0,10022016,M,,VS,674948200.0,109,B2,FL,FLORIDA
3,459655.0,2016.0,4.0,135.0,135.0,ATL,20547.0,1.0,GA,,64.0,2.0,1.0,20160403,,,G,,,,1952.0,7012016,F,,VS,55541330000.0,103,WT,GA,GEORGIA
4,459656.0,2016.0,4.0,135.0,135.0,ATL,20547.0,1.0,GA,,63.0,2.0,1.0,20160403,,,G,,,,1953.0,7012016,M,,BA,55578040000.0,227,WT,GA,GEORGIA
5,459657.0,2016.0,4.0,135.0,135.0,ATL,20547.0,1.0,GA,20548.0,44.0,1.0,1.0,20160403,,,G,I,,M,1972.0,7012016,M,,BA,55578920000.0,227,WB,GA,GEORGIA
6,459658.0,2016.0,4.0,135.0,135.0,ATL,20547.0,1.0,GA,20548.0,39.0,1.0,1.0,20160403,,,G,I,,M,1977.0,7012016,M,,BA,55578740000.0,227,WB,GA,GEORGIA
7,459659.0,2016.0,4.0,135.0,135.0,ATL,20547.0,1.0,GA,20573.0,84.0,2.0,1.0,20160403,,,G,I,,M,1932.0,7012016,M,,BA,55577460000.0,227,WT,GA,GEORGIA
8,459660.0,2016.0,4.0,135.0,135.0,ATL,20547.0,1.0,GA,20555.0,55.0,2.0,1.0,20160403,,,G,N,,M,1961.0,7012016,M,,BA,55577870000.0,227,WT,GA,GEORGIA
9,459661.0,2016.0,4.0,135.0,135.0,ATL,20547.0,1.0,GA,20555.0,54.0,2.0,1.0,20160403,,,G,N,,M,1962.0,7012016,M,,BA,55577730000.0,227,WT,GA,GEORGIA


In [16]:
# Drop a redundant column
df_immi_rev1 = df_immi_state.drop('i94addr')

#### 2.1.3 Join (inner join) Airport Code, Country Code, and Visa Code to Immigration data

![](Images/dataModelPrelim.PNG)

In [17]:
## The state data is already joined, so we join the remaining three tables
df_immi_joined = df_immi_rev1.join(df_airpt_code , df_immi_rev1.i94port == df_airpt_code.i94port_)\
.join(df_country_code , df_immi_rev1.i94cit == df_country_code.i94cit_)\
.join(df_visa_code , df_immi_rev1.i94visa == df_visa_code.Code_Visa)

In [18]:
df_immi_joined.limit(10).toPandas()

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,depdate,i94bir,i94visa,count,dtadfile,visapost,occup,entdepa,entdepd,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype,State_Code,State,i94port_,i94_airport_name_,i94_state_,i94cit_,i94_country_,iso_country_code_,Code_Visa,Visa_Name
0,459651.0,2016.0,4.0,135.0,135.0,ATL,20547.0,1.0,20559.0,54.0,2.0,1.0,20160403,,,O,R,,M,1962.0,7012016,,,VS,55556250000.0,115,WT,FL,FLORIDA,ATL,ATLANTA,GA,135,UNITED KINGDOM,826,2,Pleasure
1,459652.0,2016.0,4.0,135.0,135.0,ATL,20547.0,1.0,20555.0,74.0,2.0,1.0,20160403,,,T,O,,M,1942.0,7012016,F,,VS,674406500.0,103,WT,FL,FLORIDA,ATL,ATLANTA,GA,135,UNITED KINGDOM,826,2,Pleasure
2,459653.0,2016.0,4.0,135.0,135.0,ATL,20547.0,1.0,20557.0,44.0,2.0,1.0,20160403,,,T,Q,,M,1972.0,10022016,M,,VS,674948200.0,109,B2,FL,FLORIDA,ATL,ATLANTA,GA,135,UNITED KINGDOM,826,2,Pleasure
3,459655.0,2016.0,4.0,135.0,135.0,ATL,20547.0,1.0,,64.0,2.0,1.0,20160403,,,G,,,,1952.0,7012016,F,,VS,55541330000.0,103,WT,GA,GEORGIA,ATL,ATLANTA,GA,135,UNITED KINGDOM,826,2,Pleasure
4,459656.0,2016.0,4.0,135.0,135.0,ATL,20547.0,1.0,,63.0,2.0,1.0,20160403,,,G,,,,1953.0,7012016,M,,BA,55578040000.0,227,WT,GA,GEORGIA,ATL,ATLANTA,GA,135,UNITED KINGDOM,826,2,Pleasure
5,459657.0,2016.0,4.0,135.0,135.0,ATL,20547.0,1.0,20548.0,44.0,1.0,1.0,20160403,,,G,I,,M,1972.0,7012016,M,,BA,55578920000.0,227,WB,GA,GEORGIA,ATL,ATLANTA,GA,135,UNITED KINGDOM,826,1,Business
6,459658.0,2016.0,4.0,135.0,135.0,ATL,20547.0,1.0,20548.0,39.0,1.0,1.0,20160403,,,G,I,,M,1977.0,7012016,M,,BA,55578740000.0,227,WB,GA,GEORGIA,ATL,ATLANTA,GA,135,UNITED KINGDOM,826,1,Business
7,459659.0,2016.0,4.0,135.0,135.0,ATL,20547.0,1.0,20573.0,84.0,2.0,1.0,20160403,,,G,I,,M,1932.0,7012016,M,,BA,55577460000.0,227,WT,GA,GEORGIA,ATL,ATLANTA,GA,135,UNITED KINGDOM,826,2,Pleasure
8,459660.0,2016.0,4.0,135.0,135.0,ATL,20547.0,1.0,20555.0,55.0,2.0,1.0,20160403,,,G,N,,M,1961.0,7012016,M,,BA,55577870000.0,227,WT,GA,GEORGIA,ATL,ATLANTA,GA,135,UNITED KINGDOM,826,2,Pleasure
9,459661.0,2016.0,4.0,135.0,135.0,ATL,20547.0,1.0,20555.0,54.0,2.0,1.0,20160403,,,G,N,,M,1962.0,7012016,M,,BA,55577730000.0,227,WT,GA,GEORGIA,ATL,ATLANTA,GA,135,UNITED KINGDOM,826,2,Pleasure


In [19]:
df_immi_joined.printSchema()

root
 |-- cicid: double (nullable = true)
 |-- i94yr: double (nullable = true)
 |-- i94mon: double (nullable = true)
 |-- i94cit: double (nullable = true)
 |-- i94res: double (nullable = true)
 |-- i94port: string (nullable = true)
 |-- arrdate: double (nullable = true)
 |-- i94mode: double (nullable = true)
 |-- depdate: double (nullable = true)
 |-- i94bir: double (nullable = true)
 |-- i94visa: double (nullable = true)
 |-- count: double (nullable = true)
 |-- dtadfile: string (nullable = true)
 |-- visapost: string (nullable = true)
 |-- occup: string (nullable = true)
 |-- entdepa: string (nullable = true)
 |-- entdepd: string (nullable = true)
 |-- entdepu: string (nullable = true)
 |-- matflag: string (nullable = true)
 |-- biryear: double (nullable = true)
 |-- dtaddto: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- insnum: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- admnum: double (nullable = true)
 |-- fltno: string (nullable = tr

#### 2.1.4 Drop the columns which are not useful.

In [20]:
df_immi_dm = df_immi_joined.drop('i94cit', 'i94cit_', 'i94port', 'i94port_', 'insnum', 'airline', 'admnum', 'fltno', 'matflag')

#### Convert arrival date and departure date columns to date time format

In [21]:
# define the function
def convert_datetime(x):
    """
    Convert a SAS date (number of days since 1 January 1960), as stored in
    the arrdate and depdate columns, to a Python date.
    
    Arg:
       x : value from the arrival date (arrdate) or departure date (depdate) column

    Returns:
       datetime.date, or None if the value is missing or invalid
    """
    try:
        start = datetime(1960, 1, 1)
        return (start + timedelta(days=int(x))).date()
    except (TypeError, ValueError, OverflowError):
        return None


# Create a UDF from the function defined above.
# The UDF returns the converted date as a PySpark DateType value.
udf_datetime_from_sas = udf(convert_datetime, T.DateType())
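
As a quick sanity check of the conversion, the sketch below applies the same arithmetic outside Spark to the `arrdate` and `depdate` values of the first preview row. The function name `sas_days_to_date` is a stand-in used only for this illustration.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # SAS counts days from this date


def sas_days_to_date(days):
    """Convert a SAS day count to a date; return None for missing values."""
    if days is None or days != days:  # days != days is True only for NaN
        return None
    return SAS_EPOCH + timedelta(days=int(days))


print(sas_days_to_date(20547.0))  # arrdate of the first preview row: 2016-04-03
print(sas_days_to_date(20559.0))  # depdate of the first preview row: 2016-04-15
print(sas_days_to_date(None))     # missing departure date: None
```

The first value agrees with the `dtadfile` entry `20160403` on the same row.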

In [22]:
df_immi_dm = df_immi_dm.withColumn("arrival_date", udf_datetime_from_sas("arrdate"))\
.withColumn("departure_date", udf_datetime_from_sas("depdate"))

In [23]:
df_immi_dm.limit(10).toPandas()

Unnamed: 0,cicid,i94yr,i94mon,i94res,arrdate,i94mode,depdate,i94bir,i94visa,count,dtadfile,visapost,occup,entdepa,entdepd,entdepu,biryear,dtaddto,gender,visatype,State_Code,State,i94_airport_name_,i94_state_,i94_country_,iso_country_code_,Code_Visa,Visa_Name,arrival_date,departure_date
0,459651.0,2016.0,4.0,135.0,20547.0,1.0,20559.0,54.0,2.0,1.0,20160403,,,O,R,,1962.0,7012016,,WT,FL,FLORIDA,ATLANTA,GA,UNITED KINGDOM,826,2,Pleasure,2016-04-03,2016-04-15
1,459652.0,2016.0,4.0,135.0,20547.0,1.0,20555.0,74.0,2.0,1.0,20160403,,,T,O,,1942.0,7012016,F,WT,FL,FLORIDA,ATLANTA,GA,UNITED KINGDOM,826,2,Pleasure,2016-04-03,2016-04-11
2,459653.0,2016.0,4.0,135.0,20547.0,1.0,20557.0,44.0,2.0,1.0,20160403,,,T,Q,,1972.0,10022016,M,B2,FL,FLORIDA,ATLANTA,GA,UNITED KINGDOM,826,2,Pleasure,2016-04-03,2016-04-13
3,459655.0,2016.0,4.0,135.0,20547.0,1.0,,64.0,2.0,1.0,20160403,,,G,,,1952.0,7012016,F,WT,GA,GEORGIA,ATLANTA,GA,UNITED KINGDOM,826,2,Pleasure,2016-04-03,
4,459656.0,2016.0,4.0,135.0,20547.0,1.0,,63.0,2.0,1.0,20160403,,,G,,,1953.0,7012016,M,WT,GA,GEORGIA,ATLANTA,GA,UNITED KINGDOM,826,2,Pleasure,2016-04-03,
5,459657.0,2016.0,4.0,135.0,20547.0,1.0,20548.0,44.0,1.0,1.0,20160403,,,G,I,,1972.0,7012016,M,WB,GA,GEORGIA,ATLANTA,GA,UNITED KINGDOM,826,1,Business,2016-04-03,2016-04-04
6,459658.0,2016.0,4.0,135.0,20547.0,1.0,20548.0,39.0,1.0,1.0,20160403,,,G,I,,1977.0,7012016,M,WB,GA,GEORGIA,ATLANTA,GA,UNITED KINGDOM,826,1,Business,2016-04-03,2016-04-04
7,459659.0,2016.0,4.0,135.0,20547.0,1.0,20573.0,84.0,2.0,1.0,20160403,,,G,I,,1932.0,7012016,M,WT,GA,GEORGIA,ATLANTA,GA,UNITED KINGDOM,826,2,Pleasure,2016-04-03,2016-04-29
8,459660.0,2016.0,4.0,135.0,20547.0,1.0,20555.0,55.0,2.0,1.0,20160403,,,G,N,,1961.0,7012016,M,WT,GA,GEORGIA,ATLANTA,GA,UNITED KINGDOM,826,2,Pleasure,2016-04-03,2016-04-11
9,459661.0,2016.0,4.0,135.0,20547.0,1.0,20555.0,54.0,2.0,1.0,20160403,,,G,N,,1962.0,7012016,M,WT,GA,GEORGIA,ATLANTA,GA,UNITED KINGDOM,826,2,Pleasure,2016-04-03,2016-04-11


In [24]:
df_immi_dm.printSchema()

root
 |-- cicid: double (nullable = true)
 |-- i94yr: double (nullable = true)
 |-- i94mon: double (nullable = true)
 |-- i94res: double (nullable = true)
 |-- arrdate: double (nullable = true)
 |-- i94mode: double (nullable = true)
 |-- depdate: double (nullable = true)
 |-- i94bir: double (nullable = true)
 |-- i94visa: double (nullable = true)
 |-- count: double (nullable = true)
 |-- dtadfile: string (nullable = true)
 |-- visapost: string (nullable = true)
 |-- occup: string (nullable = true)
 |-- entdepa: string (nullable = true)
 |-- entdepd: string (nullable = true)
 |-- entdepu: string (nullable = true)
 |-- biryear: double (nullable = true)
 |-- dtaddto: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- visatype: string (nullable = true)
 |-- State_Code: string (nullable = true)
 |-- State: string (nullable = true)
 |-- i94_airport_name_: string (nullable = true)
 |-- i94_state_: string (nullable = true)
 |-- i94_country_: string (nullable = true)
 |-- iso_c

#### 2.1.5 Write the immigration data to parquet files and store them in S3 Staging Bucket.

In [25]:
# write to parquet
df_immi_dm.write.parquet(staging_path+'immigration/')

#### 2.2 Demographics Data

In [26]:
# Read the US demographics data from S3
df_demog = spark.read.csv(csv_data_path+'us-cities-demographics.csv', header=True, sep = ';')

In [27]:
df_demog.limit(10).toPandas().head()

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Silver Spring,Maryland,33.8,40601,41862,82463,1562,30908,2.6,MD,Hispanic or Latino,25924
1,Quincy,Massachusetts,41.0,44129,49500,93629,4147,32935,2.39,MA,White,58723
2,Hoover,Alabama,38.5,38040,46799,84839,4819,8229,2.58,AL,Asian,4759
3,Rancho Cucamonga,California,34.5,88127,87105,175232,5821,33878,3.18,CA,Black or African-American,24437
4,Newark,New Jersey,34.6,138040,143873,281913,5829,86253,2.73,NJ,White,76402


In [28]:
# total number of rows
print('Total Number of Rows in Demographic DataFrame: '+str(df_demog.count()))

Total Number of Rows in Demographic DataFrame: 2891


In [29]:
df_demog_state = df_demog.groupBy(['State']).agg({'Median Age': 'mean', 'Male Population': 'mean', \
                                                  'Female Population': 'mean', 'Number of Veterans': 'mean', \
                                                  'Foreign-born': 'mean', 'Average Household Size': 'mean',\
                                                  'State Code': 'First'})

In [30]:
df_demog_state.limit(5).toPandas().head()

Unnamed: 0,State,avg(Female Population),avg(Median Age),first(State Code),avg(Number of Veterans),avg(Foreign-born),avg(Male Population),avg(Average Household Size)
0,Utah,52769.270833,30.8625,UT,4024.270833,13579.395833,53890.666667,3.156875
1,Hawaii,175959.0,41.4,HI,23213.0,101312.0,176807.0,2.69
2,Minnesota,66025.222222,35.57963,MN,5958.111111,19812.740741,64422.277778,2.496852
3,Ohio,127414.204082,35.593878,OH,12927.673469,17834.510204,119454.163265,2.298571
4,Arkansas,51109.137931,32.737931,AR,5323.793103,10612.172414,48300.827586,2.526897


In [31]:
df_demog_state.count()

49

In [32]:
df_demog_state.printSchema()

root
 |-- State: string (nullable = true)
 |-- avg(Female Population): double (nullable = true)
 |-- avg(Median Age): double (nullable = true)
 |-- first(State Code): string (nullable = true)
 |-- avg(Number of Veterans): double (nullable = true)
 |-- avg(Foreign-born): double (nullable = true)
 |-- avg(Male Population): double (nullable = true)
 |-- avg(Average Household Size): double (nullable = true)



In [33]:
## Rename the columns, otherwise writing to Parquet fails: column names cannot contain characters such as '(', ')' or spaces

df_demog_dm = df_demog_state.withColumnRenamed('avg(Female Population)', 'avgFemalePopulation').withColumnRenamed('avg(Median Age)', 'avgMedianAge')\
.withColumnRenamed('first(State Code)', 'stateCode').withColumnRenamed('avg(Number of Veterans)', 'avgNumberOfVeterans')\
.withColumnRenamed('avg(Foreign-born)', 'avgForeignBorn').withColumnRenamed('avg(Male Population)', 'avgMalePopulation')\
.withColumnRenamed('avg(Average Household Size)', 'avgHouseHoldSize')
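
The renames above are written by hand. As an alternative, here is a minimal sketch of a helper that derives a safe name automatically. It assumes the Spark versions this notebook targets, whose Parquet writer rejects the characters ` ,;{}()\n\t=` in column names. The helper `sanitize_column` is hypothetical, and the names it produces differ slightly from the hand-picked ones (for example `firstStateCode` instead of `stateCode`).

```python
import re

# Characters rejected by the Parquet writer, plus the hyphen, which is
# legal but awkward to use in SQL queries later on.
SPLIT_CHARS = re.compile(r"[ ,;{}()\n\t=\-]")


def sanitize_column(name):
    """Turn e.g. 'avg(Female Population)' into 'avgFemalePopulation'."""
    parts = [p for p in SPLIT_CHARS.split(name) if p]
    # Keep the first token as is and capitalise the first letter of the rest.
    return parts[0] + "".join(p[0].upper() + p[1:] for p in parts[1:])


aggregated = [
    "avg(Female Population)",
    "avg(Median Age)",
    "first(State Code)",
    "avg(Foreign-born)",
]
renames = {old: sanitize_column(old) for old in aggregated}
print(renames)
```

Each pair in `renames` could then be applied in a loop with `withColumnRenamed(old, new)`.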

In [34]:
df_demog_dm.printSchema()

root
 |-- State: string (nullable = true)
 |-- avgFemalePopulation: double (nullable = true)
 |-- avgMedianAge: double (nullable = true)
 |-- stateCode: string (nullable = true)
 |-- avgNumberOfVeterans: double (nullable = true)
 |-- avgForeignBorn: double (nullable = true)
 |-- avgMalePopulation: double (nullable = true)
 |-- avgHouseHoldSize: double (nullable = true)



In [35]:
## Convert the State column to Upper case; because we need to join the immigration data to States Demographic data.

## We can also convert this data frame entirely to Pandas and apply functions because this data frame is very small.

df_demog_upper = to_upper_case_(df_demog_dm.toPandas(), 'State', 'State_upper_case')

In [36]:
df_demog_upper.head()

Unnamed: 0,avgFemalePopulation,avgMedianAge,stateCode,avgNumberOfVeterans,avgForeignBorn,avgMalePopulation,avgHouseHoldSize,State_upper_case
0,52769.270833,30.8625,UT,4024.270833,13579.395833,53890.666667,3.156875,UTAH
1,175959.0,41.4,HI,23213.0,101312.0,176807.0,2.69,HAWAII
2,66025.222222,35.57963,MN,5958.111111,19812.740741,64422.277778,2.496852,MINNESOTA
3,127414.204082,35.593878,OH,12927.673469,17834.510204,119454.163265,2.298571,OHIO
4,51109.137931,32.737931,AR,5323.793103,10612.172414,48300.827586,2.526897,ARKANSAS


#### Write the state level demographic data for cities only (aggregated) to parquet files and store them in S3 Staging Bucket.

In [37]:
# Convert the pandas data frame back to spark dataframe. 
df_demog_dim = spark.createDataFrame(df_demog_upper)

# write to parquet and store it in S3 Staging Bucket
df_demog_dim.write.parquet(staging_path+'demography/')

#### 2.3 Temperature Data

In [38]:
# read temperature data
df_temp = spark.read.csv('../../data2/GlobalLandTemperaturesByCity.csv',header=True)

In [39]:
# total number of rows
print('Total Number of Rows in Temperature DataFrame: '+str(df_temp.count()))

Total Number of Rows in Temperature DataFrame: 8599212


In [40]:
df_temp.limit(5).toPandas()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1,1743-12-01,,,Århus,Denmark,57.05N,10.33E
2,1744-01-01,,,Århus,Denmark,57.05N,10.33E
3,1744-02-01,,,Århus,Denmark,57.05N,10.33E
4,1744-03-01,,,Århus,Denmark,57.05N,10.33E


In [41]:
## Let us aggregate the data to the country level
df_temp_country = df_temp.groupby(["Country"]).agg({"AverageTemperature": 'mean', "Latitude": 'First', "Longitude": "First"})

In [42]:
df_temp_country.printSchema()

root
 |-- Country: string (nullable = true)
 |-- avg(AverageTemperature): double (nullable = true)
 |-- first(Latitude): string (nullable = true)
 |-- first(Longitude): string (nullable = true)



In [43]:
df_temp_dm = df_temp_country.withColumnRenamed('avg(AverageTemperature)', 'avgTemperature').withColumnRenamed('first(Latitude)', 'Latitude').withColumnRenamed('first(Longitude)', 'Longitude')

In [44]:
df_temp_dm.count()

159

In [45]:
## Convert the Country column to Upper case; because we need to join the immigration data to Temperature data

df_temp_upper = to_upper_case_(df_temp_dm.toPandas(), 'Country', 'Country_upper_case')

In [46]:
df_temp_upper.head()

Unnamed: 0,avgTemperature,Latitude,Longitude,Country_upper_case
0,27.189829,8.84N,15.41E,CHAD
1,22.784014,24.92S,58.52W,PARAGUAY
2,3.347268,53.84N,91.36E,RUSSIA
3,25.768408,13.66N,45.41E,YEMEN
4,25.984177,15.27N,17.50W,SENEGAL


#### Write the Country level temperature data to parquet files and store them in S3 Staging Bucket.

In [47]:
df_temp_dim = spark.createDataFrame(df_temp_upper)

df_temp_dim.write.parquet(staging_path+'temperature/')

In [48]:
# We could also partition the data by country when writing:    df_temp_dm.write.partitionBy("Country").parquet(staging_path+'temperature/')
# It is good practice to partition larger datasets because it helps optimize query performance.
# Here we skip partitioning because our data is already very small.

### Step 3: Define the Data Model

#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

**Fact_Immigration**: The immigration table is our central table. It holds the events we need to analyze, so it acts as the fact table in our star schema. The immigration data has already been joined with the state code, country code, visa code and port code tables; joining these lookup tables up front gives us the state and country names corresponding to the codes.

**Dim_Temperature**: The raw temperature data is at the granularity of cities in the country of origin. Since our fact table does not have that level of granularity, I have aggregated the temperature data to the country level. The transformed table gives the average temperature in the country of origin of an immigrant.

**Dim_Demography**: The raw demographic data was given at the city level; I have changed its granularity to the state level. This lets us join the immigration table to the demography table on the State column. The aggregation from city level to state level can be understood from the example below:

*Original table*: The column 'Male Population' represents the male population of a particular city in a US state.

*Transformed table*: The column 'avgMalePopulation' represents the average male population across all the given cities in a particular US state.
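
The city-to-state aggregation can be illustrated in plain Python on a few made-up rows. One caveat: the raw file carries one row per city and race group (see the `Race` and `Count` columns), so a city can appear several times. The sketch below keeps one value per city before averaging; that de-duplication is an assumption about the desired behaviour and is not what the notebook cell does.

```python
from collections import defaultdict
from statistics import mean

# Made-up city-level rows: (state, city, male_population).
# The repeated Austin row mimics a city listed once per race group.
city_rows = [
    ("Texas", "Austin", 400000),
    ("Texas", "Austin", 400000),
    ("Texas", "Waco", 60000),
    ("Utah", "Provo", 50000),
]

# Keep one value per (state, city) so repeated rows do not skew the mean.
unique_cities = {(state, city): males for state, city, males in city_rows}

# Group the city values by state.
by_state = defaultdict(list)
for (state, _city), males in unique_cities.items():
    by_state[state].append(males)

# One row per state, mirroring the avgMalePopulation column.
avg_male_population = {state: mean(values) for state, values in by_state.items()}
print(avg_male_population)
```

Without the de-duplication step, the repeated Austin row would pull the Texas average upwards.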




![](Images/StartSchema.PNG)
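
To show how the star schema is meant to be queried, here is a small runnable sketch that uses an in-memory SQLite database in place of Redshift. The table names, the reduced column sets, and the median age and temperature values are assumptions for illustration; only the join keys (upper-case state and country names) follow the notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_demography (state_upper_case TEXT PRIMARY KEY, avgMedianAge REAL);
CREATE TABLE dim_temperature (country_upper_case TEXT PRIMARY KEY, avgTemperature REAL);
CREATE TABLE fact_immigration (
    cicid INTEGER PRIMARY KEY,
    state TEXT REFERENCES dim_demography(state_upper_case),
    i94_country TEXT REFERENCES dim_temperature(country_upper_case),
    visa_name TEXT
);
-- Dimension values below are made up for the example.
INSERT INTO dim_demography VALUES ('FLORIDA', 39.5), ('GEORGIA', 33.8);
INSERT INTO dim_temperature VALUES ('UNITED KINGDOM', 8.6);
INSERT INTO fact_immigration VALUES
    (459651, 'FLORIDA', 'UNITED KINGDOM', 'Pleasure'),
    (459655, 'GEORGIA', 'UNITED KINGDOM', 'Pleasure'),
    (459657, 'GEORGIA', 'UNITED KINGDOM', 'Business');
""")

# Arrivals per state, enriched with one attribute from each dimension.
rows = conn.execute("""
    SELECT f.state, COUNT(*) AS arrivals, d.avgMedianAge, t.avgTemperature
    FROM fact_immigration AS f
    JOIN dim_demography  AS d ON f.state = d.state_upper_case
    JOIN dim_temperature AS t ON f.i94_country = t.country_upper_case
    GROUP BY f.state, d.avgMedianAge, t.avgTemperature
    ORDER BY f.state
""").fetchall()
print(rows)
conn.close()
```

The fact table stays narrow and event-based, while descriptive attributes live in the dimensions and are pulled in only when a query needs them.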

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

**The whole pipeline is divided into two phases: phase 1 and phase 2.**

![](Images/architecture1.PNG)

**Phase1 :** In this phase we did the extraction of raw data, transformation to process the data, and loading of the data to S3 Staging folder. We used **EMR Spark cluster** to this phase.     
1. Extracted data from the repository (S3 bucket 1).
2. Joined immigration table with all the State, Visa, Port codes to the Immigration table.
3. Saved the processed Immigration Table into the Staging Bucket (S3 bucket 2) in Parquet format.
4. Extracted raw temperature data from the repository.
5. Processed the raw temperature table by aggregating the temperature table at the level of detail of the Country.
6. Saved the processed Temperature table to the Staging Bucket (S3 bucket 2).
7. Extracted raw demographic data.
8. Processed the raw demographic data by aggregating the data to the granularity of the States. 
9. Saved the processed demography table to the Staging Bucket (S3 bucket 2).

**Phase2 :** In this phase we extracted the data from the Staging Bucket (S3 Bucket 2) and loaded them into the **Redshift** data warehouse.
1. Created the data model in Redshift with all the integrity constraints.
2. Copied the data from the Staging Bucket to the Redshift cluster.
3. Performed data quality checks.

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

##### 4.1.1 AWS Credential variables declaration

In [49]:
## Save the credentials in the form of variables for programmatically accessing Redshift

KEY                    = config.get('AWS','AWS_ACCESS_KEY_ID')
SECRET                 = config.get('AWS','AWS_SECRET_ACCESS_KEY')

DWH_CLUSTER_TYPE       = config.get("DWH","DWH_CLUSTER_TYPE")
DWH_NUM_NODES          = config.get("DWH","DWH_NUM_NODES")
DWH_NODE_TYPE          = config.get("DWH","DWH_NODE_TYPE")

DWH_CLUSTER_IDENTIFIER = config.get("DWH","DWH_CLUSTER_IDENTIFIER")
DWH_DB                 = config.get("DWH","DWH_DB")
DWH_DB_USER            = config.get("DWH","DWH_DB_USER")
DWH_DB_PASSWORD        = config.get("DWH","DWH_DB_PASSWORD")
DWH_PORT               = config.get("DWH","DWH_PORT")

DWH_IAM_ROLE_NAME      = config.get("DWH", "DWH_IAM_ROLE_NAME")

IAM_ROLE               = config.get("IAM_ROLE","ARN")
HOST                   = config.get("CLUSTER","HOST")
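
The `config` object used above is not defined in this section; it is presumably a `configparser` parser loaded earlier in the notebook. A minimal sketch of what that setup might look like follows. The section and key names are taken from the cell above, but the file contents and every value shown are placeholders, not the real configuration.

```python
import configparser

# Placeholder contents standing in for a real config file; all values are made up.
SAMPLE_CFG = """
[AWS]
AWS_ACCESS_KEY_ID = example-key-id
AWS_SECRET_ACCESS_KEY = example-secret

[DWH]
DWH_DB = dev
DWH_DB_USER = awsuser
DWH_PORT = 5439

[CLUSTER]
HOST = example-cluster.example.com
"""

example_config = configparser.ConfigParser()
example_config.read_string(SAMPLE_CFG)  # a notebook would typically call .read(<path>) instead

# .get returns strings, so numeric settings need an explicit conversion.
dwh_port = example_config.getint("DWH", "DWH_PORT")
print(example_config.get("DWH", "DWH_DB_USER"), dwh_port)
```

Keeping the keys in a file that is excluded from version control avoids hard-coding secrets in the notebook itself.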

##### 4.1.2 Create AWS resources 

In [50]:
## Create an object s3 through which we can access the s3 buckets in aws.

s3 = boto3.resource('s3',
                       region_name="us-west-2",
                       aws_access_key_id=KEY,
                       aws_secret_access_key=SECRET
                   )

## Create an object iam through which we can access the iam roles in aws. 

iam = boto3.client('iam',aws_access_key_id=KEY,
                     aws_secret_access_key=SECRET,
                     region_name='us-west-2'
                  )

## Create an object redshift through which we can access redshift in aws. 

redshift = boto3.client('redshift',
                       region_name="us-west-2",
                       aws_access_key_id=KEY,
                       aws_secret_access_key=SECRET
                       )

##### 4.1.3 Connection with Redshift

In [51]:
## write Connection query with redshift

conn_string = "postgresql://{}:{}@{}:{}/{}".format(DWH_DB_USER, DWH_DB_PASSWORD, HOST, DWH_PORT, DWH_DB)

In [52]:
## Connect to redshift using sql. Execute the query.
%sql $conn_string

'Connected: awsuser@dev'
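
One caveat with building the connection URL by plain string formatting: a password containing characters such as `@`, `:` or `/` would produce an ambiguous URL. A possible safeguard, sketched below, is to percent-encode the user and password first. The helper name `build_conn_string` and the sample credentials are hypothetical.

```python
from urllib.parse import quote_plus


def build_conn_string(user, password, host, port, db):
    """Build a postgresql:// URL with the credentials percent-encoded."""
    return "postgresql://{}:{}@{}:{}/{}".format(
        quote_plus(user), quote_plus(password), host, port, db
    )


# Made-up credentials; note the '@' and '/' in the password.
example_conn_string = build_conn_string(
    "awsuser", "p@ss/word", "example-host", "5439", "dev"
)
print(example_conn_string)
```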

##### 4.1.4 Drop pre-existing tables

In [53]:
# Drop Tables query

immigration_table_drop         = "DROP TABLE IF EXISTS fact_immigration"
temperature_table_drop         = "DROP TABLE IF EXISTS dim_temperature"
demography_table_drop          = "DROP TABLE IF EXISTS dim_demography"

In [54]:
# Execute the drop table query

%sql $immigration_table_drop
%sql $temperature_table_drop
%sql $demography_table_drop

 * postgresql://awsuser:***@redshift-cluster-1.czg1ucux6cgx.us-west-2.redshift.amazonaws.com:5439/dev
Done.
 * postgresql://awsuser:***@redshift-cluster-1.czg1ucux6cgx.us-west-2.redshift.amazonaws.com:5439/dev
Done.
 * postgresql://awsuser:***@redshift-cluster-1.czg1ucux6cgx.us-west-2.redshift.amazonaws.com:5439/dev
Done.


[]

##### 4.1.5 Create Data Model in Redshift

In [55]:
# Create data model in Redshift

immigration_fact_table_create= ("""
CREATE TABLE IF NOT EXISTS fact_immigration
(
 cicid             FLOAT       PRIMARY KEY,
 i94yr             FLOAT,
 i94mon            FLOAT,
 i94res            FLOAT, 
 arrdate           FLOAT, 
 i94mode           FLOAT, 
 depdate           FLOAT, 
 i94bir            FLOAT, 
 i94visa           FLOAT, 
 count             FLOAT, 
 dtadfile          VARCHAR, 
 visapost          VARCHAR, 
 occup             VARCHAR, 
 entdepa           VARCHAR, 
 entdepd           VARCHAR, 
 entdepu           VARCHAR, 
 biryear           FLOAT, 
 dtaddto           VARCHAR, 
 gender            VARCHAR, 
 visatype          VARCHAR, 
 State_Code        VARCHAR, 
 State             VARCHAR,
 i94_airport_name_ VARCHAR,
 i94_state_        VARCHAR, 
 i94_country_      VARCHAR,
 iso_country_code_ VARCHAR, 
 Code_Visa         VARCHAR, 
 Visa_Name         VARCHAR,
 arrival_date      DATE,
 departure_date    DATE
);
""")



temperature_dim_table_create = ("""
CREATE TABLE IF NOT EXISTS dim_temperature
(
     AverageTemperature     FLOAT,
     Latitude               VARCHAR,
     Longitude              VARCHAR,
     Country_upper_case     VARCHAR  PRIMARY KEY
     )
diststyle all;
""")

demography_dim_table_create = ("""
CREATE TABLE IF NOT EXISTS dim_demography
(
     avgFemalePopulation       FLOAT, 
     avgMedianAge              FLOAT,
     StateCode                 VARCHAR  NOT NULL, 
     avgNumberOfVeterans       FLOAT,
     avgForeignBorn            FLOAT, 
     avgMalePopulation         FLOAT,
     avgAverageHouseholdSize   FLOAT,
     State_upper_case          VARCHAR  PRIMARY KEY
)
diststyle all;
""")

In [56]:
# Execute sql query variables declared above

%sql $immigration_fact_table_create
%sql $temperature_dim_table_create
%sql $demography_dim_table_create

 * postgresql://awsuser:***@redshift-cluster-1.czg1ucux6cgx.us-west-2.redshift.amazonaws.com:5439/dev
Done.
 * postgresql://awsuser:***@redshift-cluster-1.czg1ucux6cgx.us-west-2.redshift.amazonaws.com:5439/dev
Done.
 * postgresql://awsuser:***@redshift-cluster-1.czg1ucux6cgx.us-west-2.redshift.amazonaws.com:5439/dev
Done.


[]

##### 4.1.6 Copy data from Staging Bucket to Redshift

In [57]:
# Copy data to redshift: sql query variables

immi_data_to_redshift= """COPY fact_immigration
                          FROM 's3://staging-ap/immigration/'
                          CREDENTIALS 'aws_iam_role=arn:aws:iam::392638740494:role/dwhRole'
                          FORMAT AS PARQUET; """

demog_data_to_redshift= """COPY dim_demography
                           FROM 's3://staging-ap/demography/'
                           CREDENTIALS 'aws_iam_role=arn:aws:iam::392638740494:role/dwhRole'
                           FORMAT AS PARQUET; """

temp_data_to_redshift= """COPY dim_temperature
                          FROM 's3://staging-ap/temperature/'
                          CREDENTIALS 'aws_iam_role=arn:aws:iam::392638740494:role/dwhRole'
                          FORMAT AS PARQUET; """

In [58]:
# Execute Sql query variable declared above

%sql $demog_data_to_redshift

%sql $temp_data_to_redshift

%sql $immi_data_to_redshift

 * postgresql://awsuser:***@redshift-cluster-1.czg1ucux6cgx.us-west-2.redshift.amazonaws.com:5439/dev
Done.
 * postgresql://awsuser:***@redshift-cluster-1.czg1ucux6cgx.us-west-2.redshift.amazonaws.com:5439/dev
Done.
 * postgresql://awsuser:***@redshift-cluster-1.czg1ucux6cgx.us-west-2.redshift.amazonaws.com:5439/dev
Done.


[]
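
The three COPY statements differ only in table name and S3 prefix, and they embed the role ARN literally even though an `IAM_ROLE` value was already read from the config in section 4.1.1. A possible refactoring, shown here as a sketch, builds them from one template. The helper name `build_copy_query` and the placeholder ARN are hypothetical; the bucket and prefixes are the ones used above.

```python
COPY_TEMPLATE = """COPY {table}
FROM '{s3_path}'
CREDENTIALS 'aws_iam_role={iam_role}'
FORMAT AS PARQUET;"""


def build_copy_query(table, s3_path, iam_role):
    """Fill the COPY template for one table and its staging prefix."""
    return COPY_TEMPLATE.format(table=table, s3_path=s3_path, iam_role=iam_role)


# Placeholder ARN; in the notebook this would be the IAM_ROLE variable,
# assuming it holds the same role that the hard-coded statements use.
example_role = "arn:aws:iam::000000000000:role/exampleRole"

staging_prefixes = {
    "dim_demography": "s3://staging-ap/demography/",
    "dim_temperature": "s3://staging-ap/temperature/",
    "fact_immigration": "s3://staging-ap/immigration/",
}

copy_queries = {
    table: build_copy_query(table, path, example_role)
    for table, path in staging_prefixes.items()
}
print(copy_queries["dim_temperature"])
```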

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks

##### 4.2.0 *To work with the query results as Python values, we run the checks through a psycopg2 connection and cursor instead of the `%sql` magic used in the cells above.*

In [77]:
# establish Connection to the redshift/database
con = psycopg2.connect(dbname= DWH_DB , host= HOST,port= DWH_PORT, user= DWH_DB_USER, password= DWH_DB_PASSWORD)

In [78]:
# create a cursor
cur = con.cursor()

##### 4.2.1 Check whether we are getting any data in facts and dimension tables.

In [79]:
immi_data_check = """ select count(*) from fact_immigration;"""

demo_data_check = """ select count(*) from dim_demography  ;"""

temp_data_check = """ select count(*) from dim_temperature ;"""

In [80]:
# Create a dictionary of Variables
checks = {'fact_immigration':immi_data_check,'dim_demography':demo_data_check,'dim_temperature':temp_data_check}

In [81]:
# actual checks: collect every table whose row count is zero
prob = []
for table_name, table_query in checks.items():
    cur.execute(table_query)
    row_count = cur.fetchone()[0]
    if row_count == 0:
        prob.append(table_name)

# results
if prob:
    for table_name in prob:
        print('No data found in the table : {}'.format(table_name))
else:
    print('Quality Checked : Ok')

Quality Checked : Ok
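
The same emptiness check can be wrapped in a function that returns the offending table names, which makes it easier to reuse and to test. The sketch below runs it against an in-memory SQLite database purely as a stand-in for Redshift (both expose a DB-API cursor); the function name `find_empty_tables` and the sample data are hypothetical.

```python
import sqlite3


def find_empty_tables(cursor, table_names):
    """Return the names of tables whose row count is zero."""
    empty = []
    for name in table_names:
        # Table names come from a fixed list in code, never from user input.
        cursor.execute("SELECT COUNT(*) FROM {}".format(name))
        if cursor.fetchone()[0] == 0:
            empty.append(name)
    return empty


# SQLite stand-in with one populated and one empty table.
demo_con = sqlite3.connect(":memory:")
demo_cur = demo_con.cursor()
demo_cur.execute("CREATE TABLE dim_temperature (Country_upper_case TEXT)")
demo_cur.execute("CREATE TABLE dim_demography (State_upper_case TEXT)")
demo_cur.execute("INSERT INTO dim_temperature VALUES ('INDIA')")

empty_tables = find_empty_tables(demo_cur, ["dim_temperature", "dim_demography"])
print(empty_tables)
demo_con.close()
```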


In [82]:
## Close the connection and cursor

cur.close()
con.close()
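
Amazon Redshift treats primary key constraints as informational and does not enforce them, so the integrity-constraint check listed in 4.2 has to be done with a query. A minimal sketch of such a duplicate-key check follows, again using SQLite as a stand-in for Redshift; `find_duplicate_keys` and the sample rows are hypothetical.

```python
import sqlite3


def find_duplicate_keys(cursor, table, key_column):
    """Return (key, occurrences) pairs for key values that appear more than once."""
    cursor.execute(
        "SELECT {key}, COUNT(*) FROM {table} "
        "GROUP BY {key} HAVING COUNT(*) > 1 ORDER BY {key}".format(
            key=key_column, table=table
        )
    )
    return cursor.fetchall()


demo_con = sqlite3.connect(":memory:")
demo_cur = demo_con.cursor()
# No PRIMARY KEY here, to mimic a warehouse that does not enforce uniqueness.
demo_cur.execute("CREATE TABLE fact_immigration (cicid REAL, i94yr REAL)")
demo_cur.executemany(
    "INSERT INTO fact_immigration VALUES (?, ?)",
    [(1.0, 2016.0), (2.0, 2016.0), (2.0, 2016.0)],
)

duplicates = find_duplicate_keys(demo_cur, "fact_immigration", "cicid")
print(duplicates)
demo_con.close()
```

An empty result means the key column is unique; any returned row identifies a key that was loaded more than once.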

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

##### --------------------------------------

 #### fact_immigration table
##### --------------------------------------
 
 cicid        :         ID that uniquely identifies each record
 
 i94yr        :        4 digit year
 
 i94mon       :       Numeric month
 
 i94res       :      Country of residence (numeric code)
 
 arrdate      :     Arrival date in the USA
 
 i94mode      :    Mode of transportation (1 = Air; 2 = Sea; 3 = Land; 9 = Not reported)
 
 depdate      :     Departure date
 
 i94bir       :    Age of Respondent in Years
 
 i94visa      :    Visa codes collapsed into three categories: (1 = Business; 2 = Pleasure; 3 = Student)
 
 count        :    count of rows (1  for each row)
 
 dtadfile     :    Character Date Field
 
 visapost     :     Post /State where Visa issued 
 
 occup        :     Occupation that will be performed in U.S.
 
 entdepa      :      Arrival Flag. Whether admitted or paroled into the US
 
 entdepd      :     Departure Flag. Whether departed, lost visa, or deceased
 
 entdepu      :     Update Flag. Update of visa, either apprehended, overstayed, or updated to PR
 
 biryear      :     year of birth
 
 dtaddto      :     Character date field to when admitted in the US
 
 gender       :      Gender
 
 visatype     :     Class of admission legally admitting the non-immigrant to temporarily stay in U.S.
 
 State_Code   :     Code for States in USA
 
 State        :     Name of States in USA
 
 i94_airport_name_  :   Airport Name
 
 i94_state_    :    State Code
 
 i94_country_   :   Country of Origin
 
 iso_country_code_     : Country of Origin Code
 
 Code_Visa     :    Code related to visa type
 
 Visa_Name      :   Name of Visa Type
 
 arrival_date   :   Arrival Date in USA
 
 departure_date  :  Departure date from the USA
##### --------------------------------------
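
The table carries both `arrdate` (numeric) and `arrival_date` (DATE). In the I94 files the numeric dates are SAS date values, that is, day counts from 1960-01-01, so the derived DATE columns can be obtained as sketched below. This illustrates the conversion rather than reproducing the exact code used in the pipeline; the function name `sas_days_to_date` is hypothetical.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # SAS counts dates as days since this day


def sas_days_to_date(sas_days):
    """Convert a SAS date value (possibly a float or None) to a datetime.date."""
    if sas_days is None:
        return None  # missing values stay missing
    return SAS_EPOCH + timedelta(days=int(sas_days))


print(sas_days_to_date(20545.0))  # 2016-04-01
```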

#### dim_demography table
##### --------------------------------------

State_upper_case: Name of the State in upper case letters

avgFemalePopulation :  Average Female Population in a state

avgMedianAge :  Average of Median Ages in the cities in a state

StateCode:      State Code

avgNumberOfVeterans: Average Number of Veterans in the state

avgForeignBorn:   Average Foreign Born Population in a state

avgMalePopulation: Average Male Population in a state

avgAverageHouseholdSize:  Average Household Size in the state

##### --------------------------------------

#### dim_temperature table
##### --------------------------------------

Country_upper_case  :   Country Name in Upper Case

AverageTemperature  :   Average Temperature in the Country

Latitude            :   Latitude

Longitude           :   Longitude

##### ===============================================================================

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.
 
 ##### --------------------------------------------------------------------

*We used AWS cloud services because they are low cost, highly scalable, and reliable compared with the on-premises or hybrid options available to us. The solution has been built with future data growth in mind, and all the AWS services used in it can scale with that growth.*

AWS Services and their justifications are mentioned below:

**S3**: S3 buckets serve several purposes here: data lake, staging area, and storage for query results from Redshift. S3 integrates easily with EMR and Redshift as bulk storage, and the data is easy to access. S3 is also cheap, easy to use, highly scalable, highly available, and secure.

**Spark**: The client's existing data was already large, and we expect it to keep growing quickly. We therefore needed a big data processing framework that is fast, scalable, and fault tolerant, and chose Spark. Since Python is widely used among data scientists and engineers, I used Spark's Python API (PySpark) for the data processing.

**EMR**: We divided the pipeline into two stages: one processes the raw data at scale, and the other builds a data warehouse that BI applications can query with SQL. We used EMR for the first stage, processing the raw data and storing the results in the Staging Bucket in S3. EMR is a cloud-native big data platform for processing large amounts of data quickly and cost-effectively. By combining open source tools such as Apache Spark with the elastic compute of Amazon EC2 and the scalable storage of Amazon S3, EMR lets analytical teams run petabyte-scale analysis at a fraction of the cost of traditional on-premises clusters.

**Redshift**: Amazon Redshift is a widely used cloud data warehouse that integrates directly with our data lake (S3). According to AWS, it offers up to 3x faster performance and costs up to 75% less than other cloud data warehouses. The client can query the data in Redshift using standard SQL.

 ##### --------------------------------------------------------------------

**Propose how often the data should be updated and why**

We receive the data monthly, so it is reasonable to update the model monthly.

**Write a description of how you would approach the problem differently under the following scenarios:**

**The data was increased by 100x:**

The solution is designed to scale horizontally. If the data grows by 100x, we would increase the number of nodes in the EMR cluster.


**The data populates a dashboard that must be updated on a daily basis by 7am every day.**:

We can create an Airflow DAG scheduled to run daily, early enough that the data is updated before 7am every day.
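
To meet a 7am deadline, the daily run has to start early enough to absorb the pipeline's runtime plus some safety margin. The sketch below derives a daily cron expression, a format Airflow accepts for a DAG schedule, from those numbers. The runtime and margin values are made-up assumptions, and `cron_for_deadline` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def cron_for_deadline(deadline_hour, runtime_minutes, margin_minutes):
    """Return a daily cron expression that starts early enough to finish by the deadline."""
    # The date is arbitrary; only the time of day matters for a daily schedule.
    deadline = datetime(2000, 1, 2, deadline_hour, 0)
    start = deadline - timedelta(minutes=runtime_minutes + margin_minutes)
    return "{} {} * * *".format(start.minute, start.hour)


# Assumed numbers: a 90-minute run with a 30-minute buffer before 07:00.
print(cron_for_deadline(7, 90, 30))  # 0 5 * * *
```

The actual runtime would have to be measured on the real cluster before choosing the start time.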

**The database needed to be accessed by 100+ people**:

Amazon Redshift provides consistently fast performance, even with thousands of concurrent queries, whether they query data in your Amazon Redshift data warehouse, or directly in your Amazon S3 data lake. Amazon Redshift Concurrency Scaling supports virtually unlimited concurrent users and concurrent queries with consistent service levels by adding transient capacity in seconds as concurrency increases. 

#### =========================================================================