# ETL

I want to analyze visa petition data, with the end goal of finding the most lucrative petition demographics. To do this, I will first explore the CSV data for a brief overview and then load it into SQL for a more robust analysis. Finally, I will connect the SQL database to an API so that the same queries can be rerun as new entries arrive.

1. Load Data into SQL
2. EDA: Exploratory Data Analysis
3. Visualization
4. API

### Import libraries and data

In [4]:
import pandas as pd
import sqlalchemy as alch
from getpass import getpass
import re

In [18]:
df = pd.read_csv("input/h1b_visas_2011-2016.csv")
df.sample(5)

Unnamed: 0.1,Unnamed: 0,CASE_STATUS,EMPLOYER_NAME,SOC_NAME,JOB_TITLE,FULL_TIME_POSITION,PREVAILING_WAGE,YEAR,WORKSITE,lon,lat
72499,72500,CERTIFIED,HUNT MORTGAGE INVESTMENTS LLC,FINANCIAL ANALYSTS,CAPITAL MARKETS ANALYST,N,65187.0,2016.0,"NEW YORK, NEW YORK",-74.005941,40.712784


### Understand and format the data before importing it into SQL

In [45]:
columns = ["CASE_STATUS", "SOC_NAME", "FULL_TIME_POSITION", "PREVAILING_WAGE", "YEAR", "WORKSITE"]
for column in columns:
    print(f"{column}: \n {df[column].unique()} \n")

CASE_STATUS: 
 ['CERTIFIED-WITHDRAWN' 'WITHDRAWN' 'CERTIFIED' 'DENIED' 'REJECTED'
 'INVALIDATED' 'PENDING QUALITY AND COMPLIANCE REVIEW - UNASSIGNED' nan] 

SOC_NAME: 
 ['BIOCHEMISTS AND BIOPHYSICISTS' 'CHIEF EXECUTIVES' 'FINANCIAL MANAGERS'
 ... 'Tree Trimmers and Pruners'
 'Excavating and Loading Machine and Dragline Operat'
 'Earth Drillers, Except Oil and Gas'] 

FULL_TIME_POSITION: 
 ['N' 'Y' nan] 

PREVAILING_WAGE: 
 [3.6067000e+04 2.4267400e+05 1.9306600e+05 ... 3.3621300e+05 1.3000080e+05
 1.3701792e+08] 

YEAR: 
 [2016. 2015. 2014. 2013. 2012. 2011.   nan] 

WORKSITE: 
 ['ANN ARBOR, MICHIGAN' 'PLANO, TEXAS' 'JERSEY CITY, NEW JERSEY' ...
 'CLINTON, NEW JERSEY' 'OWINGS MILL, MARYLAND' 'ALTANTA, GEORGIA']
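
The output above points at a few things worth cleaning before the load: SOC_NAME mixes upper case and title case, several columns contain NaN, and WORKSITE packs city and state into one string. Below is a minimal sketch of that cleanup on a tiny made-up frame standing in for `df`. The `clean` helper and the `STATE` column are names I am introducing for illustration; they are not part of the dataset. Misspellings such as 'ALTANTA' would still need separate handling.

```python
import pandas as pd

# Tiny stand-in for df; the values mimic the unique() output above.
sample = pd.DataFrame({
    "CASE_STATUS": ["CERTIFIED", "DENIED", None],
    "SOC_NAME": ["FINANCIAL ANALYSTS", "Tree Trimmers and Pruners", "CHIEF EXECUTIVES"],
    "PREVAILING_WAGE": [65187.0, 36067.0, None],
    "YEAR": [2016.0, 2015.0, None],
    "WORKSITE": ["NEW YORK, NEW YORK", "PLANO, TEXAS", "ANN ARBOR, MICHIGAN"],
})


def clean(frame):
    """Return a cleaned copy: no missing key fields, consistent case, a STATE column."""
    out = frame.dropna(subset=["CASE_STATUS", "SOC_NAME", "PREVAILING_WAGE", "YEAR"]).copy()
    out["SOC_NAME"] = out["SOC_NAME"].str.strip().str.upper()
    out["YEAR"] = out["YEAR"].astype(int)
    # The state is whatever follows the last comma in "CITY, STATE".
    out["STATE"] = out["WORKSITE"].str.rsplit(",", n=1).str[-1].str.strip()
    return out


cleaned = clean(sample)
print(cleaned[["SOC_NAME", "YEAR", "STATE"]])
```

Dropping rows with missing values is only one option; keeping them and storing NULL in SQL would work too, depending on what the analysis needs.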


### Potential discoveries:
- Most lucrative field and state.
- Year-over-year growth in visa requests.
- Whether full-time positions are more lucrative.

# 1. Load Data into SQL

### Create SQL schema
This step is done in MySQL Workbench, first by creating an EER diagram and then reverse engineering into tables.
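
For reference, an equivalent table could also be declared from Python. The sketch below is not the schema I built in Workbench: the column names are taken from the INSERT statement used later in this notebook, while the types, lengths, and the surrogate `id` column are assumptions. It runs against an in-memory SQLite database (`sketch_engine`, a name made up for this example) so that it works without a MySQL server.

```python
import sqlalchemy as alch

metadata = alch.MetaData()

# Column names come from the INSERT statement used later in this notebook;
# the types, lengths, and the surrogate id column are assumptions.
visa = alch.Table(
    "visa",
    metadata,
    alch.Column("id", alch.Integer, primary_key=True, autoincrement=True),
    alch.Column("status", alch.String(64)),
    alch.Column("field", alch.String(128)),
    alch.Column("wage", alch.Float),
    alch.Column("year", alch.Integer),
)

# In-memory SQLite stands in for the MySQL server so the sketch runs anywhere.
sketch_engine = alch.create_engine("sqlite:///:memory:")
metadata.create_all(sketch_engine)

column_names = [column["name"] for column in alch.inspect(sketch_engine).get_columns("visa")]
print(column_names)
```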

""" CODE """

### Connect to SQL (using SQLAlchemy)

In [None]:
password = getpass("Insert your password here: ")
dbName = "US_visas"
connectionData = f"mysql+pymysql://root:{password}@localhost/{dbName}"
engine = alch.create_engine(connectionData)

In [None]:
with engine.connect() as conn:
    print(conn.execute(alch.text("SELECT * FROM visa LIMIT 5")).fetchall())

### Populate SQL

In [None]:
def check(table, string):
    ''' Defensive check: return True if `string` already exists in `table`, so duplicates are not inserted '''

    # Table names cannot be bound as parameters, so only known tables are allowed.
    if table not in ("visa", "demographic"):
        raise ValueError(f"Unknown table: {table}")

    # Bind the value instead of formatting it into the SQL string (safe for quotes in names).
    query = alch.text(f"SELECT name FROM {table} WHERE name = :name")
    with engine.connect() as conn:
        return conn.execute(query, {"name": string}).first() is not None

In [None]:
# Test the check function.
#check("visa", "John")

In [None]:
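def insertVisa(status, field, wage, year):
    ''' Insert one row into SQL table VISA '''

    # No duplicate check here: visa rows have no unique identifier (such as a name) to check against.
    query = alch.text("INSERT INTO visa (status, field, wage, year) VALUES (:status, :field, :wage, :year)")
    params = {"status": status, "field": field, "wage": wage, "year": year}
    # Store missing values as NULL rather than the string 'nan'.
    params = {key: (None if pd.isna(value) else value) for key, value in params.items()}
    # engine.begin() commits on success; bound parameters handle quotes inside the values.
    with engine.begin() as conn:
        conn.execute(query, params)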

In [None]:
# Test the insert function.
#insertVisa("CERTIFIED", "FINANCIAL ANALYSTS", 65187.0, 2016)
#check("visa", "John")

In [None]:
# Insert all data into SQL: 3 million rows; CASE_STATUS, SOC_NAME, PREVAILING_WAGE, YEAR.
for index, row in df.iterrows():
    insertVisa (row["CASE_STATUS"], row["SOC_NAME"], row["PREVAILING_WAGE"], row["YEAR"])
    
# for index, row in df.iterrows():
#    insertDemo (row["demo"])
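
Inserting 3 million rows one statement at a time is slow. One alternative worth trying is a bulk load with pandas' `to_sql`. The sketch below uses a tiny made-up frame in place of `df` and an in-memory SQLite connection as a stand-in for MySQL; `column_map` and `to_load` are names introduced for this example, and the chunk size is an arbitrary choice.

```python
import sqlite3

import pandas as pd

# Stand-in for df with the four columns that go into the visa table.
sample = pd.DataFrame({
    "CASE_STATUS": ["CERTIFIED", "DENIED", "CERTIFIED"],
    "SOC_NAME": ["FINANCIAL ANALYSTS", "CHIEF EXECUTIVES", "FINANCIAL MANAGERS"],
    "PREVAILING_WAGE": [65187.0, 242674.0, 193066.0],
    "YEAR": [2016.0, 2015.0, 2014.0],
})

# Map the CSV column names onto the visa table's column names.
column_map = {"CASE_STATUS": "status", "SOC_NAME": "field", "PREVAILING_WAGE": "wage", "YEAR": "year"}
to_load = sample[list(column_map)].rename(columns=column_map)

# In-memory SQLite stands in for MySQL; with the real database, pass `engine` instead.
connection = sqlite3.connect(":memory:")
to_load.to_sql("visa", connection, if_exists="append", index=False, chunksize=10_000)

row_count = connection.execute("SELECT COUNT(*) FROM visa").fetchone()[0]
print(row_count)
```

With `if_exists="append"`, rows are added to an existing table and the table is created only if it is missing, so the schema from Workbench would be left in place.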

### Export to JSON

In [47]:
df.to_json("h1b_visas_2011-2016.json")

# 2. EDA: Exploratory Data Analysis

### Most Lucrative Field
1. What is the aggregate PREVAILING_WAGE for each SOC_NAME?
2. Filter for approved petitions only.
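
A rough sketch of how these two steps could look as a single SQL query against the visa table. It assumes "approved" means a status of exactly 'CERTIFIED' (whether 'CERTIFIED-WITHDRAWN' should count is still an open decision), and it runs on an in-memory SQLite table with made-up rows rather than the real MySQL database. Reporting the average next to the total seems useful, since the wage column contains at least one extreme value (1.37e8).

```python
import sqlite3

# In-memory SQLite table with made-up rows, standing in for the MySQL visa table.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE visa (status TEXT, field TEXT, wage REAL, year INTEGER)")
connection.executemany(
    "INSERT INTO visa (status, field, wage, year) VALUES (?, ?, ?, ?)",
    [
        ("CERTIFIED", "FINANCIAL ANALYSTS", 60000.0, 2016),
        ("CERTIFIED", "FINANCIAL ANALYSTS", 80000.0, 2015),
        ("CERTIFIED", "CHIEF EXECUTIVES", 200000.0, 2016),
        ("DENIED", "CHIEF EXECUTIVES", 900000.0, 2016),
    ],
)

# Aggregate wages per field, counting approved petitions only.
query = """
    SELECT field, COUNT(*) AS petitions, AVG(wage) AS avg_wage, SUM(wage) AS total_wage
    FROM visa
    WHERE status = 'CERTIFIED'
    GROUP BY field
    ORDER BY avg_wage DESC
"""
by_field = connection.execute(query).fetchall()
for row in by_field:
    print(row)
```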

In [None]:
# code here

### Most Lucrative State
1. What is the aggregate PREVAILING_WAGE for each WORKSITE? Group worksites that share a state.
2. Filter for approved petitions only.
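
Since WORKSITE is not among the columns loaded into the visa table, this one is easier to sketch in pandas. The example below uses a made-up frame in place of `df`, again assumes that "approved" means CASE_STATUS equal to 'CERTIFIED', and introduces a hypothetical `STATE` column taken from the text after the last comma in WORKSITE.

```python
import pandas as pd

# Made-up rows standing in for df.
sample = pd.DataFrame({
    "CASE_STATUS": ["CERTIFIED", "CERTIFIED", "CERTIFIED", "DENIED"],
    "PREVAILING_WAGE": [65187.0, 90000.0, 70000.0, 500000.0],
    "WORKSITE": ["NEW YORK, NEW YORK", "PLANO, TEXAS", "AUSTIN, TEXAS", "PLANO, TEXAS"],
})

# Keep approved petitions only, then derive the state from "CITY, STATE".
approved = sample[sample["CASE_STATUS"] == "CERTIFIED"].copy()
approved["STATE"] = approved["WORKSITE"].str.rsplit(",", n=1).str[-1].str.strip()

# Count, median, and total wage per state, largest total first.
by_state = (
    approved.groupby("STATE")["PREVAILING_WAGE"]
    .agg(petitions="count", median_wage="median", total_wage="sum")
    .sort_values("total_wage", ascending=False)
)
print(by_state)
```

The median is included because a total is dominated by states with many petitions, and a mean would be sensitive to outlier wages.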

In [None]:
# code here

# 3. Visualization

In [48]:
# Plot 2-3 charts to visualize demographic variance and popularity.

# Plot a geopandas map of each country's applicants' total income.
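
As a starting point for the popularity chart, here is a minimal matplotlib sketch: a horizontal bar chart of the most common SOC_NAME values. The frame is made up, the cutoff of ten fields is arbitrary, and the `Agg` backend line is only there so the sketch runs outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; drop this line inside a notebook

import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows standing in for df.
sample = pd.DataFrame({
    "SOC_NAME": ["FINANCIAL ANALYSTS", "FINANCIAL ANALYSTS", "CHIEF EXECUTIVES"],
})

# Number of petitions per field, limited to the ten most common fields.
counts = sample["SOC_NAME"].value_counts().head(10)

fig, ax = plt.subplots(figsize=(8, 4))
counts.sort_values().plot.barh(ax=ax)  # ascending sort puts the largest bar on top
ax.set_xlabel("Number of petitions")
ax.set_title("Most common fields (SOC_NAME)")
fig.tight_layout()
```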

# 4. API

Connect the analysis to an API using api.py (see the file in the repo root).