# ETL

I want to analyze H-1B visa petition data, with the end goal of finding the most lucrative petition demographics. To do this, I will first explore the CSV data in pandas for a brief overview and then load it into a SQL database for a more robust analysis. Finally, I will connect the SQL database to an API so that I can keep querying as new entries are added over time.

1. Load Data into SQL
2. EDA: Exploratory Data Analysis
3. Visualization
4. API

### Import libraries and data

In [1]:
import pandas as pd
from getpass import getpass
import pymysql
import sqlalchemy as alch
import re

In [2]:
# only read in 100,000 visa requests. This dataset includes over 3,000,000 requests!
df = pd.read_csv("input/h1b_visas_2011-2016.csv", nrows=100000)
df.drop("Unnamed: 0",axis=1, inplace=True)
df.sample(5)

Unnamed: 0,CASE_STATUS,EMPLOYER_NAME,SOC_NAME,JOB_TITLE,FULL_TIME_POSITION,PREVAILING_WAGE,YEAR,WORKSITE,lon,lat
7815,CERTIFIED,AMAZON.COM.KYDC LLC,COMPUTER AND INFORMATION SYSTEMS MANAGERS,TECHNICAL SERVICES MANAGER,Y,119080.0,2016,"DALLAS, TEXAS",-96.796988,32.776664
1076,WITHDRAWN,"COLLISION CENTER DESIGN, LLC",GENERAL AND OPERATIONS MANAGERS,OPERATIONS MANAGER,Y,70969.6,2016,"JOHNSTON, RHODE ISLAND",-71.512617,41.82052
7573,CERTIFIED,CAREFIRST OF MARYLAND INC.,COMPUTER AND INFORMATION SYSTEMS MANAGERS,COE TECHNOLOGY CONSULTANT,Y,161533.0,2016,"OWINGS MILLS, MARYLAND",-76.780253,39.41955
639,CERTIFIED-WITHDRAWN,TRAIL BLAZERS INC.,GENERAL AND OPERATIONS MANAGERS,"SENIOR VICE PRESIDENT, BUSINESS OPERATIONS",Y,136469.0,2016,"PORTLAND, OREGON",-122.676482,45.523062
5322,CERTIFIED,CRISIL IREVNA US LLC,MARKETING MANAGERS,DIRECTOR OF BUSINESS DEVELOPMENT,Y,109242.0,2016,"NEW YORK, NEW YORK",-74.005941,40.712784


### Understand and format the data before importing to SQL

In [3]:
print(f"CASE_STATUS: \n {df['CASE_STATUS'].unique()} \n\nSOC_NAME: \n {df['SOC_NAME'].unique()} \n\nFULL_TIME_POSITION: \n {df['FULL_TIME_POSITION'].unique()} \n\nPREVAILING_WAGE: \n {df['PREVAILING_WAGE'].unique()} \n\nYEAR: \n {df['YEAR'].unique()} \n\nWORKSITE: \n {df['WORKSITE'].unique()}")

CASE_STATUS: 
 ['CERTIFIED-WITHDRAWN' 'WITHDRAWN' 'CERTIFIED' 'DENIED'] 

SOC_NAME: 
 ['BIOCHEMISTS AND BIOPHYSICISTS' 'CHIEF EXECUTIVES' 'FINANCIAL MANAGERS'
 'GENERAL AND OPERATIONS MANAGERS' 'GENERAL AND OPERATIONS MANAGER'
 'GENERAL AND OPERATIONS MANAGERSE' 'ADVERTISING AND PROMOTIONS MANAGERS'
 'MARKETING MANAGERS' 'PUBLIC RELATIONS SPECIALISTS' 'MARKETING MANAGER'
 'SALES MANAGERS' 'SALES MANAGER' 'SALES MANGERS'
 'PUBLIC RELATIONS AND FUNDRAISING MANAGERS'
 'PUBLIC RELATIONS AND FUND RAISING MANAGERS' 'PUBLIC RELATIONS MANAGERS'
 'ADMINISTRATIVE SERVICES MANAGERS'
 'COMPUTER & INFORMATION SYSTEMS MANAGERS'
 'COMPUTER AND INFORMATION SYSTEMS MANAGERS' 'COMPUTERS MANAGERS'
 'COMPUTER AND INFORMATON SYSTEMS MANAGERS'
 'COMPUTER AND INFORMATION SYSTEMS MANAGER'] 

FULL_TIME_POSITION: 
 ['N' 'Y'] 

PREVAILING_WAGE: 
 [ 36067.  242674.  193066.  ...  77064.  110281.6 125585. ] 

YEAR: 
 [2016] 

WORKSITE: 
 ['ANN ARBOR, MICHIGAN' 'PLANO, TEXAS' 'JERSEY CITY, NEW JERSEY' ...
 'GRAPEVI

In [4]:
print(f"UNIQUE FIELDS: \n {df['SOC_NAME'].nunique()}\n\nUNIQUE FIELDS w/ POP > 100: \n [SEE MySQL QUERY]\n\nFIELD POPULATIONS: \n {df['SOC_NAME'].value_counts()}")

UNIQUE FIELDS: 
 22

UNIQUE FIELDS w/ POP > 100: 
 [SEE MySQL QUERY]

FIELD POPULATIONS: 
 MARKETING MANAGERS                            2884
COMPUTER AND INFORMATION SYSTEMS MANAGERS     2656
GENERAL AND OPERATIONS MANAGERS               1998
SALES MANAGERS                                1076
CHIEF EXECUTIVES                               593
ADVERTISING AND PROMOTIONS MANAGERS            378
PUBLIC RELATIONS AND FUNDRAISING MANAGERS      229
ADMINISTRATIVE SERVICES MANAGERS               155
MARKETING MANAGER                               10
SALES MANAGER                                    3
COMPUTER & INFORMATION SYSTEMS MANAGERS          3
COMPUTERS MANAGERS                               2
PUBLIC RELATIONS AND FUND RAISING MANAGERS       2
COMPUTER AND INFORMATION SYSTEMS MANAGER         2
GENERAL AND OPERATIONS MANAGER                   2
SALES MANGERS                                    1
PUBLIC RELATIONS MANAGERS                        1
PUBLIC RELATIONS SPECIALISTS              
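
The counts above show several singular or misspelled variants of the same field (for example `SALES MANGERS` and `SALES MANAGER`). As a minimal sketch, they could be merged before loading into SQL. The `SOC_NAME_FIXES` map and the `normalize_soc_name` helper are hypothetical names, and each target spelling is my assumption about what the variant was meant to be.

```python
import pandas as pd

# Hypothetical correction map: each key is a variant seen in the output above,
# each value is the spelling it is assumed to stand for.
SOC_NAME_FIXES = {
    "MARKETING MANAGER": "MARKETING MANAGERS",
    "SALES MANAGER": "SALES MANAGERS",
    "SALES MANGERS": "SALES MANAGERS",
    "GENERAL AND OPERATIONS MANAGER": "GENERAL AND OPERATIONS MANAGERS",
    "GENERAL AND OPERATIONS MANAGERSE": "GENERAL AND OPERATIONS MANAGERS",
    "COMPUTER & INFORMATION SYSTEMS MANAGERS": "COMPUTER AND INFORMATION SYSTEMS MANAGERS",
    "COMPUTER AND INFORMATON SYSTEMS MANAGERS": "COMPUTER AND INFORMATION SYSTEMS MANAGERS",
    "COMPUTER AND INFORMATION SYSTEMS MANAGER": "COMPUTER AND INFORMATION SYSTEMS MANAGERS",
    "PUBLIC RELATIONS AND FUND RAISING MANAGERS": "PUBLIC RELATIONS AND FUNDRAISING MANAGERS",
}


def normalize_soc_name(names: pd.Series) -> pd.Series:
    """Trim whitespace, then replace known variants with the canonical spelling."""
    return names.str.strip().replace(SOC_NAME_FIXES)


# Small stand-in sample; in the notebook this would be df["SOC_NAME"].
sample = pd.Series(["SALES MANAGERS", "SALES MANGERS", "SALES MANAGER", "MARKETING MANAGER"])
cleaned = normalize_soc_name(sample)
print(cleaned.value_counts())
```

Merging the variants first would make the `HAVING COUNT(field) > 2` filter in the later SQL queries less necessary, since the mistyped rows would be counted with their intended field.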

### Split WORKSITE into CITY and STATE for further analysis

In [5]:
df.loc[0]

CASE_STATUS                     CERTIFIED-WITHDRAWN
EMPLOYER_NAME                UNIVERSITY OF MICHIGAN
SOC_NAME              BIOCHEMISTS AND BIOPHYSICISTS
JOB_TITLE              POSTDOCTORAL RESEARCH FELLOW
FULL_TIME_POSITION                                N
PREVAILING_WAGE                             36067.0
YEAR                                           2016
WORKSITE                        ANN ARBOR, MICHIGAN
lon                                      -83.743038
lat                                       42.280826
Name: 0, dtype: object

In [6]:
# add columns CITY and STATE (split on the last ", " so every WORKSITE yields exactly two parts)
df[['CITY','STATE']] = df['WORKSITE'].str.rsplit(", ", n=1, expand=True)

In [7]:
# check that the split worked
print(f"CITY: {df['CITY'][0]}\n\nSTATE: {df['STATE'][0]}")

CITY: ANN ARBOR

STATE: MICHIGAN


### Format FULL_TIME_POSITION into boolean values

In [8]:
df['FULL_TIME_POSITION'] = df['FULL_TIME_POSITION'].map({'Y': 1, 'N': 0}) # changing from Y to 1 (not 'True') for SQL syntax
df['FULL_TIME_POSITION'] = df['FULL_TIME_POSITION'].astype('int')
type(df['FULL_TIME_POSITION'][0])

numpy.int64

In [9]:
for val in df['FULL_TIME_POSITION']:
    if val == 1:
        pass
    elif val == 0:
        pass
    else:
        print(val)
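
The loop above prints nothing when every value is 0 or 1. The same check can be sketched in vectorized form, which also reports each unexpected value only once. `unexpected_values` is a hypothetical helper name, and the two sample Series are stand-ins for `df['FULL_TIME_POSITION']`.

```python
import pandas as pd


def unexpected_values(column: pd.Series, allowed=(0, 1)) -> list:
    """Return the distinct values in `column` that are not in `allowed`."""
    mask = ~column.isin(list(allowed))
    return column[mask].unique().tolist()


# Stand-in samples; in the notebook this would be df['FULL_TIME_POSITION'].
clean = pd.Series([1, 0, 1, 1])
dirty = pd.Series([1, 0, 2, 1, 2])

print(unexpected_values(clean))  # empty list: every value is 0 or 1
print(unexpected_values(dirty))  # the stray value 2, listed once
```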

### Length
To size the SQL columns, I check the length of the longest value I expect in each text column. I will make each maximum column size a little larger than the longest current value to leave room for future entries.

In [10]:
print(f"FIELD: {len('Computer User Support Specialists')}\n\nEMPLOYER: {len('THE CHICAGO ATHENAEUM: CENTER FOR ARCHITECTURE, DESIGN & URBAN STUDIES')}\n\nCITY: {len('RESEARCH TRIANGLE PARK')}\n\nSTATE: {len('DISTRICT OF COLUMBIA')}")

FIELD: 33

EMPLOYER: 70

CITY: 22

STATE: 20
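
The lengths above come from hand-picked strings. As a rough sketch, the longest value per column could also be measured directly from the DataFrame. `max_text_lengths` is a hypothetical helper, and the sample rows are stand-ins built from values shown in this notebook.

```python
import pandas as pd


def max_text_lengths(frame: pd.DataFrame, columns) -> dict:
    """Return the length of the longest string in each of the given columns."""
    # astype(str) turns missing values into the string 'nan' (length 3)
    return {col: int(frame[col].astype(str).str.len().max()) for col in columns}


# Stand-in rows; the real call would pass df and its text columns.
sample = pd.DataFrame({
    "EMPLOYER_NAME": ["UNIVERSITY OF MICHIGAN", "BURGER KING CORPORATION"],
    "CITY": ["ANN ARBOR", "JERSEY CITY"],
    "STATE": ["MICHIGAN", "NEW JERSEY"],
})
lengths = max_text_lengths(sample, ["EMPLOYER_NAME", "CITY", "STATE"])
print(lengths)
```

Measuring from the data avoids missing a longer value that was not spotted by eye, though it only reflects the rows that were actually loaded.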


### Potential discoveries:
- most lucrative field, state.
- growth in visa requests y/y.
- check if full-time visas are more lucrative.
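
The last point can be previewed in pandas before any SQL is written. This is only a sketch under the assumption that `FULL_TIME_POSITION` has already been mapped to 0/1 as above; `wage_by_full_time` is a hypothetical helper and the sample rows are stand-ins.

```python
import pandas as pd


def wage_by_full_time(frame: pd.DataFrame) -> pd.DataFrame:
    """Petition count and wage statistics for part-time (0) and full-time (1) rows."""
    return (
        frame.groupby("FULL_TIME_POSITION")["PREVAILING_WAGE"]
        .agg(petitions="count", mean_wage="mean", median_wage="median")
    )


# Stand-in rows using wages that appear in this notebook's output.
sample = pd.DataFrame({
    "FULL_TIME_POSITION": [0, 1, 1, 1],
    "PREVAILING_WAGE": [36067.0, 242674.0, 193066.0, 119080.0],
})
summary = wage_by_full_time(sample)
print(summary)
```

Reporting the median next to the mean is a design choice: a few very high executive wages can pull the mean up considerably.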

# 1. Load Data into SQL

### Create SQL schema
This step is done in MySQL Workbench:
1. Build the database in MySQL Workbench.
2. Build the tables with efficient datatypes.
3. Populate the tables through Python (see below).

Alternatively, I could have drawn an EER diagram and forward engineered it in MySQL Workbench to generate the table code automatically.

### MySQL schema code

In [11]:
"""
-- setup the database
CREATE DATABASE us_immigration;
USE us_immigration;

-- create first table
DROP TABLE IF EXISTS visa;
CREATE TABLE visa (
    -- add an index when new tables are added to the database (unnecessary now)
    status ENUM('CERTIFIED-WITHDRAWN', 'WITHDRAWN', 'CERTIFIED', 'DENIED',
        'REJECTED', 'INVALIDATED', 'PENDING REVIEW - UNASSIGNED', 'nan'),
    field VARCHAR(64),
    job VARCHAR(64),
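    employer VARCHAR(128), -- longest employer name measured above is 70 characters
    full_time bool,
    wage INT,
    year YEAR,
    city VARCHAR(32),
    state VARCHAR(32) -- 'DISTRICT OF COLUMBIA' is 20 characters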
);
""";

### Connect to SQL (using SQLalchemy)

In [12]:
# establish connection
password = getpass("Insert your password here: ")
dbName = "us_immigration"
connectionData = f"mysql+pymysql://root:{password}@localhost/{dbName}"
engine = alch.create_engine(connectionData)

Insert your password here: ········


In [13]:
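# this query returns an empty list until data has been inserted; the output below is from a run after the table was populated.
# note: engine.execute() exists only in SQLAlchemy 1.x (it was removed in 2.0).
list(engine.execute("SELECT * FROM visa"))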

[('CERTIFIED-WITHDRAWN', 'BIOCHEMISTS AND BIOPHYSICISTS', 'POSTDOCTORAL RESEARCH FELLOW', 'UNIVERSITY OF MICHIGAN', 0, 36067, 2016, 'ANN ARBOR', 'MICHIGAN'),
 ('CERTIFIED-WITHDRAWN', 'CHIEF EXECUTIVES', 'CHIEF OPERATING OFFICER', 'GOODMAN NETWORKS, INC.', 1, 242674, 2016, 'PLANO', 'TEXAS'),
 ('CERTIFIED-WITHDRAWN', 'CHIEF EXECUTIVES', 'CHIEF PROCESS OFFICER', 'PORTS AMERICA GROUP, INC.', 1, 193066, 2016, 'JERSEY CITY', 'NEW JERSEY'),
 ('CERTIFIED-WITHDRAWN', 'CHIEF EXECUTIVES', 'REGIONAL PRESIDEN, AMERICAS', 'GATES CORPORATION, A WHOLLY-OWNED SUBSIDIARY OF TOMKINS PLC', 1, 220314, 2016, 'DENVER', 'COLORADO'),
 ('WITHDRAWN', 'CHIEF EXECUTIVES', 'PRESIDENT MONGOLIA AND INDIA', 'PEABODY INVESTMENTS CORP.', 1, 157518, 2016, 'ST. LOUIS', 'MISSOURI'),
 ('CERTIFIED-WITHDRAWN', 'CHIEF EXECUTIVES', 'EXECUTIVE V P, GLOBAL DEVELOPMENT AND PRESIDENT, LATIN AMERI', 'BURGER KING CORPORATION', 1, 225000, 2016, 'MIAMI', 'FLORIDA'),
 ('CERTIFIED-WITHDRAWN', 'CHIEF EXECUTIVES', 'CHIEF OPERATING OFFICER'

### Populate SQL

In [14]:
def check (table, string):
    ''' Return True if `string` already exists in the table's name column, so duplicate data is not inserted into SQL '''
    
    # only query known tables: a table name cannot be passed as a bound parameter
    # (note: the visa table defined above has no name column, so this only suits tables that have one)
    if table not in ("visa", "demographic"):
        return None
    
    # bind the value as a parameter instead of formatting it into the SQL text
    query = alch.text(f"SELECT name FROM {table} WHERE name = :name;")
    with engine.connect() as conn:
        return len(list(conn.execute(query, {"name": string}))) > 0

In [15]:
# Test the check function.
#check("visa", "John")

In [16]:
def insertVisa (status, field, job, employer, full_time, wage, year, city, state):
    ''' Insert data into SQL table VISA '''
    
    """ CANNOT USE CHECK SINCE DATA DOES NOT HAVE UNIQUE IDS (LIKE 'NAME')
    if check("visa", string):
        return "It already exists"
    else:
    """
    
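    # bind the values as parameters so the driver handles quoting (e.g. a '"' inside an employer name);
    # str() keeps the original behaviour of sending every value as a string for MySQL to convert
    columns = ("status", "field", "job", "employer", "full_time", "wage", "year", "city", "state")
    values = (status, field, job, employer, full_time, wage, year, city, state)
    query = alch.text(f"INSERT INTO visa ({', '.join(columns)}) VALUES ({', '.join(':' + c for c in columns)});")
    with engine.begin() as conn:  # begin() commits when the block exits without an error
        conn.execute(query, {c: str(v) for c, v in zip(columns, values)})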

In [17]:
# Test the insert function.
#insertVisa ("test")
#check("visa", "John")

In [18]:
# Insert all loaded rows into SQL (100,000 of the 3 million+ in the full dataset): status, field, job, employer, full-time flag, wage, year, city, state.
for index, row in df.iterrows():
    #print(row["CASE_STATUS"])
    insertVisa(row["CASE_STATUS"], row["SOC_NAME"], row["JOB_TITLE"], row["EMPLOYER_NAME"], row["FULL_TIME_POSITION"], row["PREVAILING_WAGE"], row["YEAR"], row["CITY"], row["STATE"])

# for index, row in df.iterrows():
#    insertDemo (row["demo"])
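
Inserting one row per statement is slow for 100,000 rows. As a sketch of the alternative, the rows can be sent in one batched call with bound parameters. To keep the example runnable without a MySQL server, an in-memory SQLite database stands in for `us_immigration`. The `insert_visas` helper and the simplified column types are assumptions, not the notebook's actual setup.

```python
import sqlite3

COLUMNS = ("status", "field", "job", "employer", "full_time", "wage", "year", "city", "state")


def insert_visas(conn, rows):
    """Insert many rows in one executemany call, with values bound as parameters."""
    placeholders = ", ".join("?" for _ in COLUMNS)
    sql = f"INSERT INTO visa ({', '.join(COLUMNS)}) VALUES ({placeholders})"
    with conn:  # commits on success, rolls back on error
        conn.executemany(sql, rows)


# In-memory SQLite stands in for the MySQL database so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE visa (status TEXT, field TEXT, job TEXT, employer TEXT, "
    "full_time INTEGER, wage INTEGER, year INTEGER, city TEXT, state TEXT)"
)

# Two rows copied from the query output shown earlier in this notebook.
rows = [
    ("CERTIFIED-WITHDRAWN", "BIOCHEMISTS AND BIOPHYSICISTS", "POSTDOCTORAL RESEARCH FELLOW",
     "UNIVERSITY OF MICHIGAN", 0, 36067, 2016, "ANN ARBOR", "MICHIGAN"),
    ("CERTIFIED-WITHDRAWN", "CHIEF EXECUTIVES", "CHIEF OPERATING OFFICER",
     "GOODMAN NETWORKS, INC.", 1, 242674, 2016, "PLANO", "TEXAS"),
]
insert_visas(conn, rows)
row_count = conn.execute("SELECT COUNT(*) FROM visa").fetchone()[0]
print(row_count)
```

The same pattern should carry over to MySQL, with the caveat that placeholder syntax differs by driver (pymysql uses `%s` rather than `?`).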

### Export to JSON

In [19]:
df.to_json("h1b_visas_2011-2016")

# 2. EDA: Exploratory Data Analysis
See MySQL Workbench (or the screenshots attached to the README) for the full query list.

### Most Lucrative Field
1. What is the aggregate PREVAILING_WAGE for each SOC_NAME?
2. Filter for approved petitions only (not yet applied in the queries below).

In [24]:
'''
-- Most lucrative field, by population
SELECT SUM(wage), COUNT(field) AS num_in_field, field
FROM visa
GROUP BY field
HAVING COUNT(field) > 2 -- filter out misc fields (usually mistypes)
ORDER BY num_in_field DESC;

-- Most lucrative field, by total wage
SELECT SUM(wage) AS total_wage, COUNT(field) AS num_in_field, field
FROM visa
GROUP BY field
HAVING COUNT(field) > 2 -- filter out misc fields (usually mistypes)
ORDER BY total_wage DESC;
''';
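
The queries above rank fields by total wage and do not yet restrict the rows to approved petitions. A possible variant, sketched here against an in-memory SQLite stand-in rather than the MySQL database, adds that filter and uses the average wage so that large fields do not dominate. Treating `status = 'CERTIFIED'` as "approved" is my assumption, and the sample rows are illustrative.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL `visa` table; only the needed columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visa (status TEXT, field TEXT, wage INTEGER)")
conn.executemany(
    "INSERT INTO visa VALUES (?, ?, ?)",
    [
        ("CERTIFIED", "MARKETING MANAGERS", 109242),
        ("CERTIFIED", "COMPUTER AND INFORMATION SYSTEMS MANAGERS", 119080),
        ("CERTIFIED", "COMPUTER AND INFORMATION SYSTEMS MANAGERS", 161533),
        ("WITHDRAWN", "GENERAL AND OPERATIONS MANAGERS", 70970),
    ],
)

# Assumption: "approved" means status = 'CERTIFIED'.
QUERY = """
SELECT field, COUNT(*) AS num_in_field, AVG(wage) AS avg_wage
FROM visa
WHERE status = 'CERTIFIED'
GROUP BY field
ORDER BY avg_wage DESC;
"""
results = conn.execute(QUERY).fetchall()
for row in results:
    print(row)
```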

### Most Lucrative State
1. What is the aggregate PREVAILING_WAGE for each WORKSITE? Group worksites by state.
2. Filter for approved petitions only (not yet applied in the queries below).

In [21]:
'''
-- Most lucrative state, by population
SELECT SUM(wage) AS total_wage, COUNT(state) AS num_in_state, state
FROM visa
GROUP BY state
ORDER BY num_in_state DESC;

-- Most lucrative state, by total wage
SELECT SUM(wage) AS total_wage, COUNT(state) AS num_in_state, state
FROM visa
GROUP BY state
ORDER BY total_wage DESC;
''';
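
The state totals can be cross-checked in pandas against the SQL results. This is a sketch only: `wage_by_state` is a hypothetical helper, and the sample rows are stand-ins for the `STATE` and `PREVAILING_WAGE` columns of `df`.

```python
import pandas as pd


def wage_by_state(frame: pd.DataFrame) -> pd.DataFrame:
    """Total wage and petition count per state, largest total first."""
    return (
        frame.groupby("STATE")["PREVAILING_WAGE"]
        .agg(total_wage="sum", num_in_state="count")
        .sort_values("total_wage", ascending=False)
    )


# Stand-in rows; the real call would pass df after the CITY/STATE split.
sample = pd.DataFrame({
    "STATE": ["TEXAS", "TEXAS", "MARYLAND", "NEW YORK"],
    "PREVAILING_WAGE": [119080.0, 242674.0, 161533.0, 109242.0],
})
by_state = wage_by_state(sample)
print(by_state)
```

Small differences from the SQL totals would be expected, because the `wage` column is an `INT` and the CSV wages include fractional values.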

### More queries are included in the README and in the .sql file (see the GitHub repo for the full set).

# 3. Visualization

In [22]:
# Plot 2-3 charts to visualize demographic variance and popularity.

# Plot a geopandas map of each state's total applicant wages.


# 4. API

Connect the analysis to the API using api.py (see the file in the repo root).

### Project Sources:
- [Dataset](https://www.kaggle.com/datasets/nsharan/h-1b-visa?resource=download)