# Come to Brazil
### Data Engineering Capstone Project

#### Project Summary
Brazil is a continent-sized country, in both area and population. Common maps tend to hide this: the Mercator projection, presented by the Flemish geographer and cartographer Gerardus Mercator in 1569, long before modern cartographic techniques were developed, shrinks regions near the equator relative to those at higher latitudes. South America is larger than Western Europe, and several individual Brazilian states are larger in area than the largest Western European countries, such as France and Germany. Even though Brazil is a poor country, the sheer number of people in the upper economic echelons of its population makes it a very interesting subject for commercial study. That, coupled with a virtual 100% import tariff that more than doubles the domestic price of goods, and a habit of smuggling goods among both Brazilian travelers and Brazilian immigrants in America, makes a service industry geared specifically toward Brazilians passing through potentially very profitable. This study uses airport data from the I94 Immigration Data to determine which airports would be the most promising places to establish services such as Apple stores that accept the Brazilian real (Brazil's currency) and employ Portuguese-speaking sales staff.

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [2]:
# Do all imports and installs here
import pandas as pd

### Step 1: Scope the Project and Gather Data

#### Scope 

The scope of the project is the I94 data provided by Udacity, summarized in a Plotly bubble map. The same pipeline could just as well feed an auto-updating Dash website with immigration and travel data on arrivals from Brazil to America. As the joke goes, if you don't come to Brazil, Brazil comes to you.

#### Describe and Gather Data 

- I94 Immigration Data: This data comes from the US National Tourism and Trade Office. A data dictionary is included in the workspace.
- I94_SAS_Labels_descriptions.SAS: Partial description of the SAS data, provided by Udacity.
- Airport Code Table (airport-codes_csv.csv): A simple table of airport codes and corresponding cities.
- U.S. City Demographic Data (us-cities-demographics.csv): This data comes from OpenSoft. It lists US cities with demographic figures such as median age and population.

In [3]:
# Read in the data here
fname = '../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat'
df = pd.read_sas(fname, 'sas7bdat', encoding="ISO-8859-1")
fname2 = '../../data2/GlobalLandTemperaturesByCity.csv'
df2 = pd.read_csv(fname2)

In [21]:
df.head()

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,...,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,6.0,2016.0,4.0,692.0,692.0,XXX,20573.0,,,,...,U,,1979.0,10282016,,,,1897628000.0,,B2
1,7.0,2016.0,4.0,254.0,276.0,ATL,20551.0,1.0,AL,,...,Y,,1991.0,D/S,M,,,3736796000.0,296.0,F1
2,15.0,2016.0,4.0,101.0,101.0,WAS,20545.0,1.0,MI,20691.0,...,,M,1961.0,09302016,M,,OS,666643200.0,93.0,B2
3,16.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,MA,20567.0,...,,M,1988.0,09302016,,,AA,92468460000.0,199.0,B2
4,17.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,MA,20567.0,...,,M,2012.0,09302016,,,AA,92468460000.0,199.0,B2


In [68]:
brazil = df[df['i94res']==689]

In [5]:
brazil

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,...,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
628,737.0,2016.0,4.0,103.0,689.0,NEW,20545.0,1.0,NY,20553.0,...,,M,1980.0,06292016,F,,UA,5.540653e+10,00148,WT
629,738.0,2016.0,4.0,103.0,689.0,NEW,20545.0,1.0,NY,20553.0,...,,M,1982.0,06292016,F,,UA,5.540652e+10,00148,WT
9327,11229.0,2016.0,4.0,111.0,689.0,DAL,20545.0,1.0,NV,20547.0,...,,M,1984.0,06292016,M,,AA,5.540677e+10,00962,WT
9328,11230.0,2016.0,4.0,111.0,689.0,LVG,20545.0,1.0,NV,20551.0,...,,M,1982.0,06292016,,,CM,5.544914e+10,00252,WB
9329,11231.0,2016.0,4.0,111.0,689.0,MIA,20545.0,1.0,FL,20549.0,...,,M,1970.0,06292016,,,JJ,5.540666e+10,08090,WT
9330,11232.0,2016.0,4.0,111.0,689.0,MIA,20545.0,1.0,FL,20556.0,...,,M,1938.0,06292016,,,JJ,5.544946e+10,08094,WT
9331,11233.0,2016.0,4.0,111.0,689.0,HOU,20545.0,1.0,TX,20574.0,...,,M,1998.0,D/S,F,,UA,9.243793e+10,00128,F1
13689,16687.0,2016.0,4.0,117.0,689.0,ATL,20545.0,1.0,MA,20554.0,...,,M,1971.0,06292016,M,,DL,5.540701e+10,00060,WT
13690,16691.0,2016.0,4.0,117.0,689.0,DAL,20545.0,1.0,TX,20554.0,...,,M,1952.0,06292016,,,AA,5.540664e+10,00962,WT
13691,16692.0,2016.0,4.0,117.0,689.0,DAL,20545.0,1.0,TX,20554.0,...,,M,1975.0,06292016,,,AA,5.540664e+10,00962,WT


In [12]:
airports = brazil['i94port'].unique()

In [18]:
airports.sort()
airports

array(['AGA', 'ANZ', 'ATL', 'AUS', 'BAL', 'BED', 'BLA', 'BOA', 'BOS',
       'BQN', 'BRO', 'BUF', 'CAL', 'CHI', 'CHM', 'CIN', 'CLE', 'CLG',
       'CLM', 'CLS', 'CLT', 'DAL', 'DEN', 'DER', 'DET', 'DLR', 'DNA',
       'DOU', 'DUB', 'EDA', 'EGP', 'FMY', 'FTL', 'GPM', 'HAL', 'HAM',
       'HAR', 'HHW', 'HID', 'HIG', 'HOU', 'HPN', 'HTM', 'INP', 'JKM',
       'KAN', 'KOA', 'LAR', 'LCB', 'LEW', 'LEX', 'LIH', 'LLB', 'LNB',
       'LOS', 'LVG', 'LYN', 'MAA', 'MAS', 'MCA', 'MDT', 'MIA', 'MIL',
       'MLB', 'MON', 'NAS', 'NCA', 'NEW', 'NIA', 'NOG', 'NOL', 'NOR',
       'NSV', 'NYC', 'OGD', 'OGG', 'OPF', 'ORL', 'ORO', 'OTM', 'OTT',
       'PBB', 'PDN', 'PEM', 'PEV', 'PHI', 'PHO', 'PHR', 'PHU', 'PIT',
       'PNH', 'POO', 'PRO', 'PSP', 'RAY', 'RDU', 'ROC', 'ROU', 'SAA',
       'SAI', 'SAJ', 'SAV', 'SDP', 'SEA', 'SFB', 'SFR', 'SKA', 'SLC',
       'SLU', 'SNA', 'SNJ', 'SPE', 'SPM', 'SRQ', 'STL', 'STR', 'STT',
       'SUM', 'SYR', 'SYS', 'TAM', 'TEC', 'THO', 'TOR', 'TUC', 'VCV',
       'VIC', 'WAS',

In [6]:
brazil.to_csv("brazil.csv")

In [22]:
df.columns

Index(['cicid', 'i94yr', 'i94mon', 'i94cit', 'i94res', 'i94port', 'arrdate',
       'i94mode', 'i94addr', 'depdate', 'i94bir', 'i94visa', 'count',
       'dtadfile', 'visapost', 'occup', 'entdepa', 'entdepd', 'entdepu',
       'matflag', 'biryear', 'dtaddto', 'gender', 'insnum', 'airline',
       'admnum', 'fltno', 'visatype'],
      dtype='object')

In [46]:
brazil.head()

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,...,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
628,737,2016,4,103,689,NEW,20545,1,NY,20553,...,,M,1980,6292016,F,,UA,55406500000.0,148,WT
629,738,2016,4,103,689,NEW,20545,1,NY,20553,...,,M,1982,6292016,F,,UA,55406500000.0,148,WT
9327,11229,2016,4,111,689,DAL,20545,1,NV,20547,...,,M,1984,6292016,M,,AA,55406800000.0,962,WT
9328,11230,2016,4,111,689,LVG,20545,1,NV,20551,...,,M,1982,6292016,,,CM,55449100000.0,252,WB
9329,11231,2016,4,111,689,MIA,20545,1,FL,20549,...,,M,1970,6292016,,,JJ,55406700000.0,8090,WT


In [4]:
df2.head()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1,1743-12-01,,,Århus,Denmark,57.05N,10.33E
2,1744-01-01,,,Århus,Denmark,57.05N,10.33E
3,1744-02-01,,,Århus,Denmark,57.05N,10.33E
4,1744-03-01,,,Århus,Denmark,57.05N,10.33E


In [25]:
airport_codes = pd.read_csv("airport-codes_csv.csv")
df4 = pd.read_csv("immigration_data_sample.csv")
us_demographics = pd.read_csv("us-cities-demographics.csv",sep = ';')

In [21]:
airport_codes.head()

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
0,00A,heliport,Total Rf Heliport,11.0,,US,US-PA,Bensalem,00A,,00A,"-74.93360137939453, 40.07080078125"
1,00AA,small_airport,Aero B Ranch Airport,3435.0,,US,US-KS,Leoti,00AA,,00AA,"-101.473911, 38.704022"
2,00AK,small_airport,Lowell Field,450.0,,US,US-AK,Anchor Point,00AK,,00AK,"-151.695999146, 59.94919968"
3,00AL,small_airport,Epps Airpark,820.0,,US,US-AL,Harvest,00AL,,00AL,"-86.77030181884766, 34.86479949951172"
4,00AR,closed,Newport Hospital & Clinic Heliport,237.0,,US,US-AR,Newport,,,,"-91.254898, 35.6087"


In [55]:
teste = brazil


In [69]:
# I94 records New York arrivals under the city code NYC; map it to JFK so it matches an IATA code.
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning raised by writing into a filtered slice.
brazil = brazil.assign(i94port=brazil['i94port'].str.replace('NYC','JFK'))

In [70]:
# Same remapping for Los Angeles (LOS -> LAX)
brazil = brazil.assign(i94port=brazil['i94port'].str.replace('LOS','LAX'))


In [47]:
teste.head()

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,...,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
628,737,2016,4,103,689,NEW,20545,1,NY,20553,...,,M,1980,6292016,F,,UA,55406500000.0,148,WT
629,738,2016,4,103,689,NEW,20545,1,NY,20553,...,,M,1982,6292016,F,,UA,55406500000.0,148,WT
9327,11229,2016,4,111,689,DAL,20545,1,NV,20547,...,,M,1984,6292016,M,,AA,55406800000.0,962,WT
9328,11230,2016,4,111,689,LVG,20545,1,NV,20551,...,,M,1982,6292016,,,CM,55449100000.0,252,WB
9329,11231,2016,4,111,689,MIA,20545,1,FL,20549,...,,M,1970,6292016,,,JJ,55406700000.0,8090,WT


In [71]:
z = brazil[brazil['i94port']=='LAX']
z

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,...,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
16802,20461.0,2016.0,4.0,126.0,689.0,LAX,20545.0,1.0,CA,20558.0,...,,M,1948.0,06292016,F,,AA,5.540877e+10,00216,WT
16803,20462.0,2016.0,4.0,126.0,689.0,LAX,20545.0,1.0,CA,20553.0,...,,M,1972.0,06292016,M,,AA,5.540849e+10,00216,WT
16804,20463.0,2016.0,4.0,126.0,689.0,LAX,20545.0,1.0,CA,20556.0,...,,M,1949.0,06292016,M,,AA,5.540848e+10,00216,WT
16805,20464.0,2016.0,4.0,126.0,689.0,LAX,20545.0,1.0,CA,20560.0,...,,M,1946.0,06292016,M,,AA,5.540881e+10,00216,WT
16831,20497.0,2016.0,4.0,126.0,689.0,LAX,20545.0,1.0,CA,20552.0,...,,M,1971.0,09302016,M,,AM,9.247620e+10,00019,B2
43656,51444.0,2016.0,4.0,148.0,689.0,LAX,20545.0,1.0,CA,20554.0,...,,M,1962.0,06292016,,,AA,5.540853e+10,00216,WT
43657,51445.0,2016.0,4.0,148.0,689.0,LAX,20545.0,1.0,OR,20549.0,...,,M,1972.0,06292016,M,,AA,5.540850e+10,00216,WT
63948,77905.0,2016.0,4.0,254.0,689.0,LAX,20545.0,1.0,CA,20548.0,...,,M,1953.0,06292016,,,KE,5.540430e+10,00062,WT
89486,206017.0,2016.0,4.0,689.0,689.0,LAX,20545.0,1.0,CA,20562.0,...,,M,1942.0,09302016,F,,LA,9.251707e+10,02604,B2
89667,206277.0,2016.0,4.0,689.0,689.0,LAX,20545.0,1.0,CA,20553.0,...,,M,1965.0,09302016,,,AA,9.244821e+10,00216,B2


In [20]:
len(df4)

1000

In [26]:
us_demographics.head()

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Silver Spring,Maryland,33.8,40601.0,41862.0,82463,1562.0,30908.0,2.6,MD,Hispanic or Latino,25924
1,Quincy,Massachusetts,41.0,44129.0,49500.0,93629,4147.0,32935.0,2.39,MA,White,58723
2,Hoover,Alabama,38.5,38040.0,46799.0,84839,4819.0,8229.0,2.58,AL,Asian,4759
3,Rancho Cucamonga,California,34.5,88127.0,87105.0,175232,5821.0,33878.0,3.18,CA,Black or African-American,24437
4,Newark,New Jersey,34.6,138040.0,143873.0,281913,5829.0,86253.0,2.73,NJ,White,76402


In [19]:
dfz = airport_codes[airport_codes['iata_code']=='SKA']
dfz

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
29965,KSKA,large_airport,Fairchild Air Force Base,2461.0,,US,US-WA,Spokane,KSKA,SKA,SKA,"-117.65599823, 47.6151008606"


In [72]:
braziliansInAirports = pd.merge(brazil,airport_codes,left_on=['i94port'], right_on=['iata_code'],how='inner')
braziliansInAirports.head()

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,...,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
0,737.0,2016.0,4.0,103.0,689.0,NEW,20545.0,1.0,NY,20553.0,...,Lakefront Airport,8.0,,US,US-LA,New Orleans,KNEW,NEW,NEW,"-90.028297424316, 30.042400360107"
1,738.0,2016.0,4.0,103.0,689.0,NEW,20545.0,1.0,NY,20553.0,...,Lakefront Airport,8.0,,US,US-LA,New Orleans,KNEW,NEW,NEW,"-90.028297424316, 30.042400360107"
2,16693.0,2016.0,4.0,117.0,689.0,NEW,20545.0,1.0,NY,20569.0,...,Lakefront Airport,8.0,,US,US-LA,New Orleans,KNEW,NEW,NEW,"-90.028297424316, 30.042400360107"
3,51442.0,2016.0,4.0,148.0,689.0,NEW,20545.0,1.0,NJ,20553.0,...,Lakefront Airport,8.0,,US,US-LA,New Orleans,KNEW,NEW,NEW,"-90.028297424316, 30.042400360107"
4,206161.0,2016.0,4.0,689.0,689.0,NEW,20545.0,1.0,NJ,20555.0,...,Lakefront Airport,8.0,,US,US-LA,New Orleans,KNEW,NEW,NEW,"-90.028297424316, 30.042400360107"


In [73]:
braziliansInAirports.groupby(['municipality','coordinates','iata_code']).size().to_frame(name = 'count').reset_index().sort_values(by=['count'],ascending=False)

Unnamed: 0,municipality,coordinates,iata_code,count
58,Miami,"-80.29060363769531, 25.79319953918457",MIA,36774
66,Orlando,"-81.332901, 28.5455",ORL,20030
61,New York,"-73.77890015, 40.63980103",JFK,17916
51,Los Angeles,"-118.4079971, 33.94250107",LAX,7390
41,Houston,"-95.27890015, 29.64539909",HOU,5614
3,Atlanta,"-84.428101, 33.6367",ATL,5021
24,Dallas,"-96.851799, 32.847099",DAL,4231
60,New Orleans,"-90.028297424316, 30.042400360107",NEW,3465
75,Point Hope,"-166.7989959716797, 68.34880065917969",PHO,2514
28,Detroit,"-83.00990295, 42.40919876",DET,1584


In [67]:
braziliansInAirports.to_csv("Brazilians-In-Airports2.csv")

In [None]:
import plotly.graph_objects as go

# Work on a copy so the merged frame and the raw I94 frame `df` stay untouched
arrivals = braziliansInAirports.copy()
# "coordinates" is stored as "lon, lat"; split it into two numeric columns
new = arrivals["coordinates"].str.split(", ", n = 1, expand = True)
arrivals['lat'] = new[1].astype(float)
arrivals['lon'] = new[0].astype(float)
# Count arrivals per airport location and rank them from most to fewest arrivals
counts = arrivals.groupby(['municipality', 'lat','lon']).size().to_frame(name = 'count').reset_index()
counts = counts.sort_values('count', ascending = False).reset_index(drop = True)

counts['text'] = counts['municipality'] + '<br>Arrivals '+ counts['count'].astype(str)
# Rank ranges (start inclusive, stop exclusive) into the sorted frame; contiguous so no row is skipped
limits = [(0,2),(2,10),(10,20),(20,50),(50,3000)]
colors = ["royalblue","crimson","lightseagreen","orange","lightgrey"]
scale = 2

fig = go.Figure()

for i in range(len(limits)):
    lim = limits[i]
    df_sub = counts[lim[0]:lim[1]]
    fig.add_trace(go.Scattergeo(
        locationmode = 'USA-states',
        lon = df_sub['lon'],
        lat = df_sub['lat'],
        text = df_sub['text'],
        marker = dict(
            size = df_sub['count']/scale,
            color = colors[i],
            line_color='rgb(40,40,40)',
            line_width=0.5,
            sizemode = 'area'
        ),
        name = '{0} - {1}'.format(lim[0],lim[1])))

fig.update_layout(
        title_text = 'Brazilian arrivals by US city, April 2016<br>(Click legend to toggle traces)',
        showlegend = True,
        geo = dict(
            scope = 'usa',
            landcolor = 'rgb(217, 217, 217)',
        )
    )

fig.show()

![Map](mapz.png "Map")

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.\
config("spark.jars.repositories", "https://repos.spark-packages.org/").\
config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11").\
enableHiveSupport().getOrCreate()

df_spark = spark.read.format('com.github.saurfang.sas.spark').load('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')


In [11]:
#write to parquet
# overwrite so the cell can be re-run without failing on an existing output path
df_spark.write.mode("overwrite").parquet("sas_data")
df_spark=spark.read.parquet("sas_data")

### Step 2: Explore and Assess the Data
#### Explore the Data 
The data provided was in relatively good shape and already well cleaned, but quite a few I94 port values are city codes rather than airport codes, which required the manual cleaning shown above.

#### Cleaning Steps
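City-level I94 port codes were replaced with specific airport codes (NYC with JFK, LOS with LAX) before the inner join with the airport codes dataset. Other I94 port codes were joined as-is, so some matches are probably wrong: NEW, for example, is joined to Lakefront Airport in New Orleans even though those travelers list NY and NJ in `i94addr`, and PHO is joined to Point Hope.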
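
The remapping described in this step can also be written as a lookup table instead of one `str.replace` call per code. This is a minimal sketch rather than the notebook's actual code: the `CITY_TO_AIRPORT` name, the `remap_ports` helper, and the toy frame are hypothetical, and only the two mappings used earlier in the notebook (NYC to JFK, LOS to LAX) are included.

```python
import pandas as pd

# Hypothetical lookup table; only these two mappings appear in the notebook.
CITY_TO_AIRPORT = {"NYC": "JFK", "LOS": "LAX"}


def remap_ports(arrivals: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with city-level I94 port codes replaced by airport codes."""
    out = arrivals.copy()
    # Series.replace with a dict swaps whole values and leaves other codes untouched
    out["i94port"] = out["i94port"].replace(CITY_TO_AIRPORT)
    return out


# Toy data standing in for the filtered I94 frame
sample = pd.DataFrame({"i94port": ["NYC", "MIA", "LOS", "NYC"]})
cleaned = remap_ports(sample)
print(cleaned["i94port"].tolist())
```

Keeping the mappings in one dictionary makes it easier to review which port codes have been corrected and to add further corrections once they have been verified against the I94 label descriptions.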

### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
The data model starts from the arrivals dataset filtered to residents of Brazil (code 689 in `i94res`) and inner joins it to the airport codes dataset on the IATA airport code. The joined rows are grouped and counted per airport, leaving the arrival count, municipality, latitude, and longitude, which are displayed as a bubble map of the United States using Plotly graph objects.
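
The model described here can be sketched end to end on toy data. The frames below are illustrative stand-ins (the rows and rounded coordinates are made up for the example); the column names and the residence code 689 are the ones used in the notebook.

```python
import pandas as pd

# Toy stand-ins for the I94 arrivals and the airport code table (illustrative values)
arrivals = pd.DataFrame({
    "i94res": [689, 689, 689, 101],
    "i94port": ["MIA", "MIA", "JFK", "MIA"],
})
airports = pd.DataFrame({
    "iata_code": ["MIA", "JFK"],
    "municipality": ["Miami", "New York"],
    "coordinates": ["-80.29, 25.79", "-73.78, 40.64"],  # stored as "lon, lat"
})

# 1. Keep arrivals whose country of residence is Brazil (code 689)
brazil = arrivals[arrivals["i94res"] == 689]

# 2. Inner join on the airport code
joined = brazil.merge(airports, left_on="i94port", right_on="iata_code", how="inner")

# 3. Count arrivals per airport
model = (
    joined.groupby(["municipality", "coordinates"])
    .size()
    .to_frame(name="count")
    .reset_index()
)

# 4. Split "lon, lat" into numeric columns and rank by arrivals
lon_lat = model["coordinates"].str.split(", ", n=1, expand=True).astype(float)
model["lon"] = lon_lat[0]
model["lat"] = lon_lat[1]
model = model.sort_values("count", ascending=False).reset_index(drop=True)

print(model[["municipality", "lat", "lon", "count"]])
```

Because the join is an inner join, arrivals whose port code has no matching IATA code are silently dropped, which is why the port cleaning step matters.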

#### 3.2 Mapping Out Data Pipelines
The data is processed with Spark and could be served from Redshift to a live Dash webpage. With that setup the data volume could grow 100x, the pipelines could run daily by 7 am (or even every second), and the Redshift database could be accessed by 100+ people.


### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
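
The kinds of checks listed above could look roughly like the following for the aggregated arrivals table. This is only a sketch under assumptions: the function names and the toy `model` frame are hypothetical, and the checks run on a pandas frame rather than on a relational database.

```python
import pandas as pd


def check_not_empty(table: pd.DataFrame, name: str) -> None:
    """Completeness check: the table must contain at least one row."""
    if table.empty:
        raise ValueError(f"{name} has no rows")


def check_unique_key(table: pd.DataFrame, key: str, name: str) -> None:
    """Integrity check: the key column must not contain duplicates."""
    dupes = int(table[key].duplicated().sum())
    if dupes:
        raise ValueError(f"{name}.{key} has {dupes} duplicated value(s)")


def check_counts_match(joined_rows: int, model: pd.DataFrame) -> None:
    """Source/count check: per-airport counts must add up to the joined row count."""
    total = int(model["count"].sum())
    if total != joined_rows:
        raise ValueError(f"expected {joined_rows} arrivals, model has {total}")


# Toy aggregated table standing in for the grouped arrivals
model = pd.DataFrame({"iata_code": ["MIA", "JFK"], "count": [2, 1]})

check_not_empty(model, "model")
check_unique_key(model, "iata_code", "model")
check_counts_match(3, model)
print("all checks passed")
```

Raising an exception on failure means a scheduled run stops before a broken table reaches the map.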
 
Run Quality Checks

In [None]:
# Perform quality checks here

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.
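
One possible shape for such a data dictionary, kept next to the code so it can be checked against the output table, is sketched below. The descriptions are inferred from the columns of the grouped table shown earlier in this notebook, and the `DATA_DICTIONARY` and `undocumented_columns` names are hypothetical.

```python
# Hypothetical data dictionary for the aggregated arrivals table;
# descriptions are inferred from the notebook's columns, not from official documentation.
DATA_DICTIONARY = {
    "municipality": {
        "description": "City served by the airport",
        "source": "airport-codes_csv.csv",
    },
    "coordinates": {
        "description": "Airport position as a 'longitude, latitude' string",
        "source": "airport-codes_csv.csv",
    },
    "iata_code": {
        "description": "Airport code matched against the I94 port of entry (i94port)",
        "source": "airport-codes_csv.csv",
    },
    "count": {
        "description": "Number of I94 arrival records with i94res == 689 for that airport",
        "source": "i94_apr16_sub.sas7bdat (derived)",
    },
}


def undocumented_columns(columns):
    """Return the column names that have no entry in the data dictionary."""
    return [col for col in columns if col not in DATA_DICTIONARY]


# Columns of the grouped table produced earlier in the notebook
missing = undocumented_columns(["municipality", "coordinates", "iata_code", "count"])
print(missing)
```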

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.