# Project Title
### Data Engineering Capstone Project
This project is based on a dataset that I assembled myself. Step 1 describes the data and its sources.

#### Project Summary
InvestSure is an investment company that manages retirement accounts for the employees of its customers. It receives dumps of many data elements in CSV format from its transactional systems and has been using Excel to load this data for analysis. The data has now grown to a size where this approach is no longer viable. InvestSure has therefore hired me as a Data Engineer to analyze and cleanse the data, build a conceptual model for analytical use, and load the source files into the analytical tables. InvestSure has also asked me to provide typical queries that it could run against this analytical model to gain insights into the data.

The project follows the steps listed below:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [3]:
# Do all imports and installs here
import pandas as pd
import numpy as np
import re
from pyspark.sql import SparkSession
import os
import sys
import glob
import configparser
from datetime import datetime, timedelta, date
from dateutil import parser
from pyspark.sql import types as t
from pyspark.sql.functions import udf, col, monotonically_increasing_id
from pyspark.sql.functions import year, month, dayofmonth, hour, weekofyear

# To suppress numeric values from being returned in exponential format
pd.options.display.float_format = '{:20,.2f}'.format

# suppress warnings from final output
import warnings
warnings.simplefilter("ignore")

# Read project data configuration entries
config = configparser.ConfigParser()
with open('capstone_project_data.cfg') as cfg_file:
    config.read_file(cfg_file)
print(config['LOCAL']['INPUT_DATA_SRC'])

src_data/


### Step 1: Scope the Project and Gather Data

#### Scope 
The scope of this project is to analyze the data provided by transactional systems, cleanse the data if needed and load it into an analytical data model to facilitate querying the data.

#### Describe and Gather Data 
In this section, I will describe the data including its source.

- txn.csv (fact)
    - Structure
        - txn_id
        - txn_date
        - contact_id
        - product_id
        - sales
        - redemptions
    - Source: InvestSure's transactional system captured by its trading application
    - Feed Frequency: Daily

- customer.csv (dimension)
    - Structure
        - customer_id
        - customer_name
        - sector
    - Source: InvestSure's CRM system
    - Feed Frequency: Daily

- contact.csv (dimension)
    - Structure
        - contact_id
        - first_name
        - last_name
        - city
        - state_code
        - zip
        - country
        - latitude
        - longitude
        - customer_id
        - status
        - opportunity
    - Source: InvestSure's CRM system
    - Feed Frequency: Daily

- product.json (dimension)
    - Structure
        - product_id
        - product_name
        - tna
        - ms_rating
        - exp_ratio
        - market_cap
    - Source: Yahoo Finance
    - Feed Frequency: Daily

- sec_codes.csv (mapping table)
    - Structure
        - code
        - description
    - Source: Yahoo Finance provides the sector descriptions; InvestSure's systems use the abbreviated codes
    - Feed Frequency: On demand and when new customers are added

- state.csv
    - Structure
        - state_code
        - state
        - region
    - Source: US Census Bureau
    - Feed Frequency: One time


In [9]:
# Read CUSTOMER data
df_customer = pd.read_csv(config['LOCAL']['INPUT_DATA_CUSTOMER'], encoding = "ISO-8859-1")

# Display sample rows from CUSTOMER data
df_customer.head()

Unnamed: 0,customer_id,customer_name,sector
0,450056063,B.W.E CUSTOM CONSTRUCTION LLC,HC
1,450056064,SERGEY NIZHEGORODTSEV PUBLISHING LLC,CS
2,450056066,SUNRISE ANDOVER LLC,RE
3,450056067,474 CENTRAL BOULEVARD LLC,FS
4,450056068,ELITE FINISHES LLC,TECH


In [8]:
# Read CONTACT data
df_contact = pd.read_csv(config['LOCAL']['INPUT_DATA_CONTACT'], encoding = "ISO-8859-1")

# Display sample rows from CONTACT data
df_contact.head()

Unnamed: 0,contact_id,first_name,last_name,city,state_code,zip,country,latitude,longitude,customer_id,status,opportunity
0,100000339,Lyndy,Chachas,Omaha,NE,68130,USA,41.23,-96.18,450058148,Active,50000
1,100001423,Watts,Eifenstadt,Weston,FL,33326,USA,26.1,-80.36,450059017,Active,50000
2,100001837,Jingfeng,Lopina,Hunt Valley,MD,21030,USA,39.5,-76.67,450059076,Active,125000
3,100002544,Gaynell,Vivrett,Beloit,WI,53511,USA,42.5,-89.04,450057762,Active,16000
4,100002551,Peregrino,Valles,New York,NY,10036,USA,40.76,-73.98,450055995,Active,150000


In [10]:
# Read PRODUCT data (note that this data is in JSON format)
df_product = pd.read_json(config['LOCAL']['INPUT_DATA_PRODUCT'])

# Display sample rows from PRODUCT data
df_product.head()

Unnamed: 0,exp_ratio,market_cap,ms_rating,product_id,product_name,tna
0,0.05,Giant,5,VFIAX,Vanguard 500 Index Admiral,163456368456
1,0.05,Giant,4,VTSAX,Vanguard Total Stock Mkt Idx Adm,136131758268
2,0.04,Giant,5,VINIX,Vanguard Institutional Index I,110407917518
3,0.16,Giant,4,VTSMX,Vanguard Total Stock Mkt Idx Inv,98869371846
4,0.02,Giant,5,VIIIX,Vanguard Institutional Index Instl Pl,93192353649


In [11]:
# Read SECTOR CODES data
df_sec_codes = pd.read_csv(config['LOCAL']['INPUT_DATA_SECTOR'], encoding = "ISO-8859-1")

# Display sample rows from SECTOR CODES data
df_sec_codes.head()

Unnamed: 0,code,description
0,FS,Financial Services
1,RE,Real Estate
2,HC,Healthcare
3,UT,Utilities
4,CS,Communication Services


In [12]:
# Read STATE data
df_state = pd.read_csv(config['LOCAL']['INPUT_DATA_STATE'], encoding = "ISO-8859-1")

# Display sample rows from STATE data
df_state.head()

Unnamed: 0,state_code,state,region
0,AL,Alabama,Southern
1,AK,Alaska,Pacific
2,AZ,Arizona,Pacific
3,AR,Arkansas,Southern
4,CA,California,Pacific


In [13]:
# Read TRANSACTION data
df_txn = pd.read_csv(config['LOCAL']['INPUT_DATA_TXN'], encoding = "ISO-8859-1")

# Display sample rows from TRANSACTION data
df_txn.head()

Unnamed: 0,txn_id,txn_date,contact_id,product_id,sales,redemptions
0,422909780,1/2/15,992808564,VIVAX,46892.19,0.0
1,422909781,1/2/15,261785827,SOPAX,0.0,-33424.78
2,422909782,1/2/15,389127962,BAICX,14230.85,0.0
3,422909783,1/2/15,101692476,SGROX,94046.93,0.0
4,422909784,1/2/15,327754553,FXSIX,22038.86,0.0


In [8]:
	
from pyspark.sql import SparkSession
# Create a Spark session with the spark-sas7bdat package loaded
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11")
    .enableHiveSupport()
    .getOrCreate()
)
df_spark = spark.read.format('com.github.saurfang.sas.spark').load('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')


In [11]:
# Write to parquet, then read the parquet data back
df_spark.write.parquet("sas_data")
df_spark = spark.read.parquet("sas_data")

### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.
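
One way to surface missing values and duplicates is a small profiling helper applied to each DataFrame loaded in Step 1. The sketch below is illustrative only: `profile_frame` and the sample rows are hypothetical and not part of the project data.

```python
import pandas as pd


def profile_frame(df: pd.DataFrame, key: str) -> dict:
    """Summarise basic data quality indicators for one DataFrame."""
    return {
        "rows": len(df),
        # Number of missing values in each column
        "missing_per_column": df.isna().sum().to_dict(),
        # Rows that are exact copies of an earlier row
        "duplicate_rows": int(df.duplicated().sum()),
        # Rows that repeat an earlier value of the key column
        "duplicate_keys": int(df[key].duplicated().sum()),
    }


# Hypothetical sample shaped like a subset of txn.csv
sample_txn = pd.DataFrame({
    "txn_id": [1, 2, 2, 3],
    "txn_date": ["1/2/15", "1/2/15", "1/2/15", None],
    "sales": [100.0, 0.0, 0.0, 50.0],
})

report = profile_frame(sample_txn, key="txn_id")
print(report)
```

The same helper could be called with `df_txn` and `"txn_id"`, or `df_contact` and `"contact_id"`, to compare the feeds on a common footing.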

#### Cleaning Steps
Document steps necessary to clean the data
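
As a starting point, the cleaning steps for the transaction feed might look like the sketch below. It assumes that the `Unnamed: 0` column is an exported row index, that `txn_id` should be unique, and that `txn_date` is in month/day/two-digit-year order, as the sample rows suggest. None of these assumptions is confirmed by the source systems, and `clean_txn` is a hypothetical helper.

```python
import pandas as pd


def clean_txn(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of the transaction data (sketch, see assumptions)."""
    cleaned = df.copy()
    # Drop exported index columns such as "Unnamed: 0", if present
    index_cols = [c for c in cleaned.columns if str(c).startswith("Unnamed")]
    cleaned = cleaned.drop(columns=index_cols)
    # Keep the first occurrence of each transaction id
    cleaned = cleaned.drop_duplicates(subset="txn_id", keep="first")
    # Parse dates; values that do not match the assumed format become NaT
    cleaned["txn_date"] = pd.to_datetime(
        cleaned["txn_date"], format="%m/%d/%y", errors="coerce"
    )
    # Remove rows that cannot be tied to a date, contact, or product
    cleaned = cleaned.dropna(subset=["txn_date", "contact_id", "product_id"])
    return cleaned.reset_index(drop=True)


# Hypothetical rows, including one duplicate id and one unparseable date
raw = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "txn_id": [1, 2, 2, 3],
    "txn_date": ["1/2/15", "1/5/15", "1/5/15", "not a date"],
    "contact_id": [10, 11, 11, 12],
    "product_id": ["VIVAX", "SOPAX", "SOPAX", "BAICX"],
    "sales": [46892.19, 0.0, 0.0, 14230.85],
    "redemptions": [0.0, -33424.78, -33424.78, 0.0],
})

cleaned = clean_txn(raw)
print(cleaned)
```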

In [None]:
# Performing cleaning tasks here





### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model
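
Step 1 labels txn.csv as a fact and the customer, contact, and product files as dimensions, which points toward a star-style model. The sketch below shows, under that assumption, how a fact table could be joined to two dimensions to answer a typical question such as total sales per sector. The rows are hypothetical and the final model may differ.

```python
import pandas as pd

# Hypothetical rows using the column names listed in Step 1
txn = pd.DataFrame({
    "txn_id": [1, 2, 3],
    "contact_id": [10, 11, 10],
    "product_id": ["VFIAX", "VTSAX", "VFIAX"],
    "sales": [100.0, 250.0, 50.0],
    "redemptions": [0.0, 0.0, -20.0],
})
contact = pd.DataFrame({"contact_id": [10, 11], "customer_id": [500, 501]})
customer = pd.DataFrame({"customer_id": [500, 501], "sector": ["FS", "RE"]})

# Join the fact to its dimensions; validate raises if a dimension key repeats
fact = (
    txn.merge(contact, on="contact_id", how="left", validate="many_to_one")
       .merge(customer, on="customer_id", how="left", validate="many_to_one")
)

# Example analytical query: total sales per customer sector
sales_by_sector = fact.groupby("sector")["sales"].sum()
print(sales_by_sector)
```

Keeping the measures (`sales`, `redemptions`) on the fact and the descriptive attributes on the dimensions means a query like this only needs key joins followed by an aggregation.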

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks
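
The checks listed above could be sketched as small functions that each return a measurable result. The helper names and sample rows below are hypothetical; they cover a unique-key check, a source/count check, and a referential check between a fact and one dimension.

```python
import pandas as pd


def count_duplicate_keys(df: pd.DataFrame, key: str) -> int:
    """Number of rows that repeat an earlier key value (0 means unique)."""
    return int(df[key].duplicated().sum())


def counts_match(source_rows: int, loaded_rows: int) -> bool:
    """Source/count check: every source row should have been loaded."""
    return source_rows == loaded_rows


def count_orphans(fact: pd.DataFrame, dim: pd.DataFrame, key: str) -> int:
    """Number of fact rows whose key has no match in the dimension."""
    return int((~fact[key].isin(dim[key])).sum())


# Hypothetical loaded tables
txn = pd.DataFrame({"txn_id": [1, 2, 3], "contact_id": [10, 11, 99]})
contact = pd.DataFrame({"contact_id": [10, 11]})

failures = []
if count_duplicate_keys(txn, "txn_id") > 0:
    failures.append("txn_id is not unique")
if not counts_match(source_rows=3, loaded_rows=len(txn)):
    failures.append("txn row count differs from source")
if count_orphans(txn, contact, "contact_id") > 0:
    failures.append("txn has contact_id values missing from contact")

print(failures)
```

Collecting failures in a list, rather than stopping at the first one, lets a single run report every problem found.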

In [None]:
# Perform quality checks here

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.
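
If the data dictionary is kept in the notebook, one option is to hold it as plain Python records and render it as a table. The sketch below covers only the two sec_codes.csv fields, with descriptions taken from Step 1; the record layout and the `to_markdown` helper are assumptions for illustration.

```python
# Data dictionary entries as plain records (layout is an assumption)
data_dictionary = [
    {
        "table": "sec_codes",
        "field": "code",
        "description": "Abbreviated sector code used by InvestSure's systems",
        "source": "sec_codes.csv",
    },
    {
        "table": "sec_codes",
        "field": "description",
        "description": "Sector description as provided by Yahoo Finance",
        "source": "sec_codes.csv",
    },
]


def to_markdown(rows):
    """Render the records as a markdown table, one row per field."""
    headers = ["table", "field", "description", "source"]
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(row[h] for h in headers) + " |")
    return "\n".join(lines)


table_text = to_markdown(data_dictionary)
print(table_text)
```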

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
    * The data was increased by 100x.
    * The data populates a dashboard that must be updated on a daily basis by 7am every day.
    * The database needed to be accessed by 100+ people.