# Project Title
### Data Engineering Capstone Project

#### Project Summary
* The goal of this project is to (1) evaluate the impact of temperature on tourist movements in the USA, and (2) predict the number of visitors in upcoming months based on historical data.
* A star schema is used to build the database, which suits analytical queries.
* Data pipelines are built with Spark and other tools (for example, pandas).

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

### Step 1: Scope the Project and Gather Data

#### Scope 
Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use?
* The end solution is a data model consisting of two (to be confirmed) fact tables referencing two dimension tables.

#### Describe and Gather Data 
Describe the data sets you're using. Where did it come from? What type of information is included? 

* Data come from the [US National Tourism and Trade Office](https://travel.trade.gov/research/reports/i94/historical/2016.html) and from Kaggle ([world temperature](https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data)).

#### Datasets
##### I94 Immigration Data
* This data comes from the US National Tourism and Trade Office: https://travel.trade.gov/research/reports/i94/historical/2016.html
* A data dictionary is included in the workspace.
* A sample file in CSV format is available for inspecting the data before reading it all in.
* Only the columns needed for the project goal are used:
    * 'i94yr': 4-digit year
    * 'i94mon': numeric month
    * 'i94addr': where the immigrant resides in the USA

##### World Temperature Data
* This dataset comes from Kaggle: https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data
* Columns used: 'dt', 'AverageTemperature', 'City', 'Country'

##### U.S. City Demographic Data
* This data comes from OpenDataSoft: https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/

##### Airport Code Table
* This is a simple table of airport codes and corresponding cities: https://datahub.io/core/airport-codes#data

In [1]:
# Do all imports and installs here
import pandas as pd
import logging
import matplotlib.pyplot as plt
import datetime
import pyspark.sql.functions as F

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.\
    config("spark.jars.packages","saurfang:spark-sas7bdat:2.0.0-s_2.11")\
    .enableHiveSupport().getOrCreate()

In [None]:
# Write to parquet --> to be done once only
df_spark = spark.read.format('com.github.saurfang.sas.spark').load('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')
#df_spark.write.parquet("sas_data2")

In [13]:
df_immigrants = spark.read.parquet("sas_data2")
df_address = df_immigrants.select(['i94yr', 'i94mon', 'i94addr'])
#df_address.where(df_address.i94yr == 2016).limit(10).toPandas()

In [24]:
# World temperature data
fname = '../../data2/GlobalLandTemperaturesByCity.csv'
df_tempr = spark.read.csv(fname, header=True)
df_tempr = df_tempr.where(df_tempr.Country == 'United States')
df_tempr.show()

+----------+------------------+-----------------------------+-------+-------------+--------+---------+
|        dt|AverageTemperature|AverageTemperatureUncertainty|   City|      Country|Latitude|Longitude|
+----------+------------------+-----------------------------+-------+-------------+--------+---------+
|1820-01-01|2.1010000000000004|                        3.217|Abilene|United States|  32.95N|  100.53W|
|1820-02-01|             6.926|                        2.853|Abilene|United States|  32.95N|  100.53W|
|1820-03-01|            10.767|                        2.395|Abilene|United States|  32.95N|  100.53W|
|1820-04-01|17.988999999999994|                        2.202|Abilene|United States|  32.95N|  100.53W|
|1820-05-01|            21.809|                        2.036|Abilene|United States|  32.95N|  100.53W|
|1820-06-01|            25.682|                        2.008|Abilene|United States|  32.95N|  100.53W|
|1820-07-01|            26.268|           1.8019999999999998|Abilene|Unit

### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
Document steps necessary to clean the data

In [11]:
df_address.printSchema()

root
 |-- i94yr: double (nullable = true)
 |-- i94mon: double (nullable = true)
 |-- i94addr: string (nullable = true)


In [None]:
df_address.limit(10).toPandas()

Unnamed: 0,i94yr,i94mon,i94addr
0,2016.0,4.0,CA
1,2016.0,4.0,NV
2,2016.0,4.0,WA
3,2016.0,4.0,WA
4,2016.0,4.0,WA
5,2016.0,4.0,HI
6,2016.0,4.0,HI
7,2016.0,4.0,HI
8,2016.0,4.0,FL
9,2016.0,4.0,CA


In [15]:
df_address.select(['i94yr', 'i94mon']).describe().show()

+-------+-------+-------+
|summary|  i94yr| i94mon|
+-------+-------+-------+
|  count|3096313|3096313|
|   mean| 2016.0|    4.0|
| stddev|    0.0|    0.0|
|    min| 2016.0|    4.0|
|    max| 2016.0|    4.0|
+-------+-------+-------+



In [16]:
df_address.createOrReplaceTempView('address_table')

In [12]:
def value_counts(table, col):
    return spark.sql(f"""
    select {col}, count({col}) as count
    from {table}
    group by {col}
    order by count
    """).show(20)

In [57]:
value_counts('address_table', 'i94yr')
value_counts('address_table', 'i94mon')
value_counts('address_table', 'i94addr')

+------+-------+
| i94yr|  count|
+------+-------+
|2016.0|3096313|
+------+-------+

+------+-------+
|i94mon|  count|
+------+-------+
|   4.0|3096313|
+------+-------+

+-------+-----+
|i94addr|count|
+-------+-----+
|   null|    0|
|     KF|    1|
|     52|    1|
|     71|    1|
|     S6|    1|
|     85|    1|
|     UL|    1|
|     RU|    1|
|     VL|    1|
|     RA|    1|
|     UR|    1|
|     ZN|    1|
|     TC|    1|
|     PD|    1|
|     YH|    1|
|     EX|    1|
|     RF|    1|
|     RO|    1|
|     73|    1|
|     FC|    1|
+-------+-----+
only showing top 20 rows
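
The `null | 0` row in the 'i94addr' output is a side effect of `count(col)`, which skips NULL values, so the NULL group is always reported with a count of zero. The sketch below reproduces the effect with the standard-library `sqlite3` module as a stand-in for Spark SQL (an assumption made so the example runs without a Spark session). The toy rows and the `count_expr` parameter are invented for illustration and are not part of the notebook.

```python
import sqlite3

# In-memory stand-in for the Spark temp view; the rows are invented toy data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE address_table (i94addr TEXT)")
conn.executemany(
    "INSERT INTO address_table VALUES (?)",
    [("CA",), ("CA",), (None,), (None,), (None,), ("NV",)],
)


def value_counts(conn, table, col, count_expr):
    """Group by `col` and count with the given aggregate expression."""
    query = (
        f"SELECT {col}, {count_expr} AS n FROM {table} "
        f"GROUP BY {col} ORDER BY n, {col}"
    )
    return conn.execute(query).fetchall()


# count(i94addr) ignores NULLs, so the NULL group reports 0.
by_col = value_counts(conn, "address_table", "i94addr", "count(i94addr)")
# count(*) counts rows, so the NULL group reports its real size.
by_row = value_counts(conn, "address_table", "i94addr", "count(*)")

print(by_col)
print(by_row)
```

Counting rows rather than column values is one way to see how many records actually lack an address before they are filtered out.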



* The descriptive results above show that both the 'i94yr' and 'i94mon' columns are clean, but the 'i94addr' column contains null values and many invalid codes (for example 'KF', '52', and 'S6').

In [60]:
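# Keep only rows whose 'i94addr' is a valid code (TODO: cast the columns)
from utils.city_code import city_code
valid_city_code = list(city_code)  # dict keys are already unique
# Render the codes as a SQL IN-list, e.g. ('CA', 'NV', ...)
str_valid_city_code = "(" + ", ".join(f"'{code}'" for code in valid_city_code) + ")"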

In [87]:
clean_address_df = spark.sql(f"""
select * --cast(i94yr as int), cast(i94mon as int), cast(i94addr as varchar(2))
from address_table
where i94yr is not null and i94mon is not null and i94addr is not null and i94addr in {str_valid_city_code}
""")

In [88]:
clean_address_df.createOrReplaceTempView('clean_address_table')

In [90]:
value_counts('clean_address_table', 'i94addr')

+-------+-----+
|i94addr|count|
+-------+-----+
|     99|   52|
|     VI|  226|
|     WY|  460|
|     SD|  557|
|     WV|  808|
|     ND| 1225|
|     MT| 1339|
|     VT| 1477|
|     AK| 1604|
|     ID| 1752|
|     MS| 1771|
|     NM| 1994|
|     ME| 2361|
|     NH| 2817|
|     AR| 2873|
|     DE| 3111|
|     KS| 3224|
|     OK| 3239|
|     RI| 3289|
|     IA| 3391|
+-------+-----+
only showing top 20 rows



In [25]:
df_tempr = df_tempr.select(['dt', 'Country', 'City', 'AverageTemperature'])

In [26]:
df_tempr.printSchema()

root
 |-- dt: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- City: string (nullable = true)
 |-- AverageTemperature: string (nullable = true)



In [27]:
df_tempr.limit(10).toPandas()

Unnamed: 0,dt,Country,City,AverageTemperature
0,1820-01-01,United States,Abilene,2.1010000000000004
1,1820-02-01,United States,Abilene,6.926
2,1820-03-01,United States,Abilene,10.767
3,1820-04-01,United States,Abilene,17.988999999999994
4,1820-05-01,United States,Abilene,21.809
5,1820-06-01,United States,Abilene,25.682
6,1820-07-01,United States,Abilene,26.268
7,1820-08-01,United States,Abilene,25.048
8,1820-09-01,United States,Abilene,22.435
9,1820-10-01,United States,Abilene,15.83


In [28]:
df_tempr.select(['AverageTemperature']).describe().show()

+-------+--------------------+
|summary|  AverageTemperature|
+-------+--------------------+
|  count|              661524|
|   mean|  13.949334923600631|
| stddev|   9.173337261791255|
|    min|-0.00099999999999989|
|    max|               9.999|
+-------+--------------------+
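
The `min` and `max` rows above look wrong (the reported maximum of 9.999 is below the mean) because 'AverageTemperature' was read as a string, so the values are compared character by character rather than numerically. The sketch below illustrates the difference in plain Python with a handful of values taken from this notebook's outputs. It is an illustration of the ordering issue, not of Spark's internals.

```python
# Temperature values as strings, the way spark.read.csv(..., header=True)
# loads them when no schema is inferred or supplied.
raw = [
    "2.1010000000000004",
    "6.926",
    "10.767",
    "25.682",
    "-24.59",
    "-0.00099999999999989",
    "9.999",
]

# String comparison: "9.999" beats "25.682" because '9' > '2'.
string_min, string_max = min(raw), max(raw)

# Numeric comparison after casting to float.
values = [float(v) for v in raw]
numeric_min, numeric_max = min(values), max(values)

print("as strings:", string_min, string_max)
print("as numbers:", numeric_min, numeric_max)
```

In the notebook itself, one option would be to cast the column to double before calling `describe()`, or to pass `inferSchema=True` to `spark.read.csv`.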



In [29]:
df_tempr.createOrReplaceTempView('tempr_table')

In [30]:
for col in df_tempr.columns:
    value_counts('tempr_table', col)

+----------+-----+
|        dt|count|
+----------+-----+
|1746-07-01|   98|
|1744-03-01|   98|
|1752-02-01|   98|
|1746-05-01|   98|
|1752-09-01|   98|
|1755-05-01|   98|
|1746-03-01|   98|
|1747-03-01|   98|
|1757-07-01|   98|
|1754-11-01|   98|
|1755-06-01|   98|
|1752-10-01|   98|
|1746-11-01|   98|
|1750-02-01|   98|
|1747-10-01|   98|
|1754-03-01|   98|
|1747-07-01|   98|
|1748-08-01|   98|
|1746-09-01|   98|
|1749-10-01|   98|
+----------+-----+
only showing top 20 rows

+-------------+------+
|      Country| count|
+-------------+------+
|United States|687289|
+-------------+------+

+----------------+-----+
|            City|count|
+----------------+-----+
|   Moreno Valley| 1977|
|Rancho Cucamonga| 1977|
|      Santa Rosa| 1977|
|          Pomona| 1977|
|       San Diego| 1977|
|      Sacramento| 1977|
|   San Francisco| 1977|
|          Orange| 1977|
|         Visalia| 1977|
|       Fairfield| 1977|
|     Los Angeles| 1977|
|     Chula Vista| 1977|
|       Oceanside| 1977|
| 

In [56]:
df_tempr_clean = spark.sql("""
select timestamp(date(dt)) as dt,
       cast(Country as varchar(13)) as Country,
       cast(City as string) as City,
       round(cast(AverageTemperature as double), 2) as AverageTemperature
from tempr_table
where dt is not null and Country is not null and City is not null and AverageTemperature is not null
order by AverageTemperature
""")

In [57]:
df_tempr_clean.createOrReplaceTempView('clean_tempr_table')

In [58]:
value_counts('clean_tempr_table', 'AverageTemperature')

+------------------+-----+
|AverageTemperature|count|
+------------------+-----+
|            -20.83|    1|
|            -19.93|    1|
|            -20.78|    1|
|            -24.59|    1|
|            -20.59|    1|
|            -24.16|    1|
|            -20.27|    1|
|            -22.65|    1|
|            -20.21|    1|
|            -22.45|    1|
|            -19.77|    1|
|            -22.06|    1|
|            -19.61|    1|
|            -21.94|    1|
|            -20.19|    1|
|            -19.53|    1|
|            -21.75|    1|
|            -19.31|    1|
|            -21.68|    1|
|            -19.08|    1|
+------------------+-----+
only showing top 20 rows



In [None]:
# Performing cleaning tasks here





### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model
* A star schema is used as the data model for this project.
* It is a relational model that contains one fact table surrounded by two dimension tables.
* It suits analytical queries, since users can analyze the data with only a few joins.
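
The star schema described above could look roughly like the following sketch. It uses the standard-library `sqlite3` module as a stand-in database. The table names (`dim_time`, `dim_location`, `fact_arrivals`), their columns, and the sample rows are all hypothetical, since the notebook does not define the final tables yet.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical layout: one fact table referencing two dimension tables.
conn.executescript("""
CREATE TABLE dim_time (
    time_id INTEGER PRIMARY KEY,
    year    INTEGER NOT NULL,
    month   INTEGER NOT NULL
);
CREATE TABLE dim_location (
    location_code TEXT PRIMARY KEY,
    location_name TEXT NOT NULL
);
CREATE TABLE fact_arrivals (
    time_id         INTEGER NOT NULL REFERENCES dim_time (time_id),
    location_code   TEXT    NOT NULL REFERENCES dim_location (location_code),
    visitor_count   INTEGER NOT NULL,
    avg_temperature REAL
);
""")

# Invented sample rows, for illustration only.
conn.execute("INSERT INTO dim_time VALUES (1, 2016, 4)")
conn.executemany(
    "INSERT INTO dim_location VALUES (?, ?)",
    [("CA", "California"), ("NV", "Nevada")],
)
conn.executemany(
    "INSERT INTO fact_arrivals VALUES (?, ?, ?, ?)",
    [(1, "CA", 120, 15.5), (1, "NV", 30, 18.0)],
)

# An analytical query needs only one join per dimension.
rows = conn.execute("""
SELECT t.year, t.month, l.location_name, f.visitor_count, f.avg_temperature
FROM fact_arrivals AS f
JOIN dim_time AS t ON t.time_id = f.time_id
JOIN dim_location AS l ON l.location_code = f.location_code
ORDER BY f.visitor_count DESC
""").fetchall()

print(rows)
```

Keeping the measures (visitor count, temperature) in the fact table and the descriptive attributes in the dimensions is what keeps the number of joins small.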
#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
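
The three kinds of checks listed above can be sketched in plain Python as follows. The function names, the 'visitors' field, and the sample rows are hypothetical. In the notebook, the same logic would more likely run against Spark DataFrames or SQL tables.

```python
from collections import Counter


def check_row_count(source_count, target_rows):
    """Completeness: the target holds as many rows as the source reported."""
    return len(target_rows) == source_count


def find_duplicate_keys(rows, key_fields):
    """Uniqueness: return every key tuple that occurs more than once."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


def find_type_violations(rows, expected_types):
    """Data types: return (row index, field) pairs with an unexpected type."""
    return [
        (i, field)
        for i, row in enumerate(rows)
        for field, expected in expected_types.items()
        if not isinstance(row[field], expected)
    ]


# Invented sample rows: the third one repeats a key and has a bad type.
rows = [
    {"i94yr": 2016, "i94mon": 4, "i94addr": "CA", "visitors": 120},
    {"i94yr": 2016, "i94mon": 4, "i94addr": "NV", "visitors": 30},
    {"i94yr": 2016, "i94mon": 4, "i94addr": "CA", "visitors": "n/a"},
]

count_ok = check_row_count(3, rows)
duplicates = find_duplicate_keys(rows, ["i94yr", "i94mon", "i94addr"])
violations = find_type_violations(rows, {"i94yr": int, "visitors": int})

print(count_ok, duplicates, violations)
```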
 
Run Quality Checks

In [None]:
# Perform quality checks here

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.