# Team Project Data Mining - MSDS 7331
### Name: Cynthia Alvarado, Christopher Havenstein, Alma Lopez, Hieu Nguyen
#### Notepad for H-1 Visa project
Public Disclosure File: H-1B iCERT LCA
Federal Fiscal Year: 2016
Reporting Period: October 1, 2015 through September 30, 2016
Important Note: This public disclosure file contains administrative data from employers’ Labor Condition Applications (ETA Forms 9035 & 9035E) and the certification determinations processed by the Department’s Office of Foreign Labor Certification, Employment and Training Administration where the date of the determination was issued on or after October 1, 2015, and on or before September 30, 2016. All data were extracted from the Office of Foreign Labor Certification’s iCERT Visa Portal System; an electronic filing and application processing system of employer requests for H-1B nonimmigrant workers.

## Business Understanding

## Data Meaning Type

#### CASE_NUMBER - Unique identifier assigned to each application submitted for processing to the Chicago National Processing Center.
#### CASE_STATUS - Status associated with the last significant event or decision. Valid values include “Certified,” “Certified-Withdrawn,” “Denied,” and “Withdrawn.”
#### CASE_SUBMITTED - Date and time the application was submitted.
#### DECISION_DATE - Date on which the last significant event or decision was recorded by the Chicago National Processing Center.
#### VISA_CLASS - Indicates the type of temporary application submitted for processing. R = H-1B; A = E-3 Australian; C = H-1B1 Chile; S = H-1B1 Singapore. Also referred to as “Program” in prior years.
#### EMPLOYMENT_START_DATE - Beginning date of employment
#### EMPLOYMENT_END_DATE - Ending date of employment
#### EMPLOYER_NAME - Name of employer submitting labor condition application.
#### EMPLOYER_ADDRESS
#### EMPLOYER_CITY
#### EMPLOYER_STATE
#### EMPLOYER_POSTAL_CODE
#### EMPLOYER_COUNTRY
#### EMPLOYER_PROVINCE
#### EMPLOYER_PHONE
#### EMPLOYER_PHONE_EXT
#### AGENT_ATTORNEY_NAME - Name of Agent or Attorney filing an H-1B application on behalf of the employer.
#### AGENT_ATTORNEY_CITY - City information for the Agent or Attorney filing an H-1B application on behalf of the employer.
#### AGENT_ATTORNEY_STATE - State information for the Agent or Attorney filing an H-1B application on behalf of the employer.
#### JOB_TITLE - Title of the job
#### SOC_CODE - Occupational code associated with the job being requested for temporary labor condition, as classified by the Standard Occupational Classification (SOC) System.
#### SOC_NAME - Occupational name associated with the SOC_CODE
#### NAIC_CODE - Industry code associated with the employer requesting permanent labor condition, as classified by the North American Industrial Classification System (NAICS)
#### TOTAL_WORKERS - Total number of foreign workers requested by the Employer(s)
#### FULL_TIME_POSITION - Y = Full Time Position; N = Part Time Position
#### PREVAILING_WAGE - Prevailing Wage for the job being requested for temporary labor condition.
#### PW_UNIT_OF_PAY - Unit of Pay. Valid values include “Daily (DAI),” “Hourly (HR),” “Bi-weekly (BI),” “Weekly (WK),” “Monthly (MTH),” and “Yearly (YR)”
#### PW_WAGE_SOURCE - Source of the prevailing wage. Valid values include “OES,” “CBA,” “DBA,” “SCA,” or “Other”
#### PW_SOURCE_YEAR - Year the Prevailing Wage Source was issued
#### PW_SOURCE_OTHER - If “Other Wage Source,” provide the source of wage
#### WAGE_RATE_OF_PAY_FROM - Employer’s proposed wage rate
#### WAGE_RATE_OF_PAY_TO - Maximum proposed wage rate
#### WAGE_UNIT_OF_PAY - Unit of pay. Valid values include “Hour,” “Week,” “Bi-Weekly,” “Month,” or “Year”
#### H-1B_DEPENDENT - Y = Employer is H-1B Dependent; N = Employer is not H-1B Dependent.
#### WILLFUL_VIOLATOR - Y = Employer has been previously found to be a Willful Violator; N = Employer has not been considered a Willful Violator.
#### WORKSITE_CITY - City information of the foreign worker's intended area of employment.
#### WORKSITE_COUNTY - County information of the foreign worker's intended area of employment
#### WORKSITE_STATE - State information of the foreign worker's intended area of employment
#### WORKSITE_POSTAL_CODE - Zip Code information of the foreign worker's intended area of employment
#### ORIGINAL_CERT_DATE - Original Certification Date for a Certified-Withdrawn application
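
Several of the fields above document a closed set of valid values, so one possible data-quality check is to compare what a column actually contains against that set. The sketch below does this for CASE_STATUS on a small made-up frame. The helper `unexpected_values`, the constant `VALID_CASE_STATUS`, and the toy rows are hypothetical names and data introduced for illustration, and the case-insensitive matching is an assumption because the casing used in the CSV is not shown here.

```python
import pandas as pd

# Valid values taken from the CASE_STATUS description above. They are stored
# upper-case because matching is done case-insensitively (an assumption).
VALID_CASE_STATUS = {"CERTIFIED", "CERTIFIED-WITHDRAWN", "DENIED", "WITHDRAWN"}


def unexpected_values(series, valid):
    """Return the set of non-null values in `series` that are not in `valid`."""
    observed = series.dropna().astype(str).str.strip().str.upper()
    return set(observed.unique()) - valid


# Made-up rows standing in for the real CASE_STATUS column.
toy = pd.DataFrame(
    {"CASE_STATUS": ["Certified", "DENIED", "withdrawn", "Rejected", None]}
)

print(unexpected_values(toy["CASE_STATUS"], VALID_CASE_STATUS))
```

On the real data the same call would be `unexpected_values(H1VisaData_df['CASE_STATUS'], VALID_CASE_STATUS)`; an empty set would mean every non-null status matches the documented values.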


## Data Quality

In [1]:
#Import libraries
import pandas as pd
import math

In [2]:
# Load data (raw string so the backslashes in the Windows path are not treated as escape sequences)
filename = r"C:\Alma@SMU\MSDS_7331_DM\H1Visa_Project\H-1B_Disclosure_Data_FY16.csv"
H1VisaData_df = pd.read_csv(filename, encoding='cp1252')


In [3]:
# Exploratory Data Analysis (EDA)
# Count missing values in each column
H1VisaData_df.isnull().sum()

CASE_NUMBER                   0
CASE_STATUS                   0
CASE_SUBMITTED                0
DECISION_DATE                 0
VISA_CLASS                    0
EMPLOYMENT_START_DATE        17
EMPLOYMENT_END_DATE          25
EMPLOYER_NAME                15
EMPLOYER_ADDRESS              5
EMPLOYER_CITY                 6
EMPLOYER_STATE               34
EMPLOYER_POSTAL_CODE         21
EMPLOYER_COUNTRY              3
EMPLOYER_PROVINCE        640180
EMPLOYER_PHONE                3
EMPLOYER_PHONE_EXT       613590
AGENT_ATTORNEY_NAME           0
AGENT_ATTORNEY_CITY      241534
AGENT_ATTORNEY_STATE     252062
JOB_TITLE                     7
SOC_CODE                      8
SOC_NAME                      8
NAIC_CODE                     5
TOTAL_WORKERS                 0
FULL_TIME_POSITION       647852
PREVAILING_WAGE               1
PW_UNIT_OF_PAY               49
PW_WAGE_SOURCE               53
PW_SOURCE_YEAR               61
PW_SOURCE_OTHER            8187
WAGE_RATE_OF_PAY_FROM         0
WAGE_RAT

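We remove columns that are not useful for our analysis and that also contain null values: EMPLOYER_PROVINCE, EMPLOYER_PHONE_EXT, AGENT_ATTORNEY_CITY, AGENT_ATTORNEY_STATE, FULL_TIME_POSITION (no values at all), and ORIGINAL_CERT_DATE. The same cell also keeps only the rows whose EMPLOYMENT_START_DATE compares greater than '1-1-2016'.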
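
One way to choose such columns systematically is to rank them by the fraction of missing values instead of reading the raw counts. The following is a minimal sketch on a made-up frame, not the procedure used in this notebook: the helper `null_fraction`, the toy rows, and the 0.3 cut-off are all assumptions chosen for illustration.

```python
import pandas as pd


def null_fraction(df):
    """Fraction of missing values per column, largest first."""
    return df.isnull().mean().sort_values(ascending=False)


# Made-up rows that reuse a few of the real column names.
toy = pd.DataFrame({
    "CASE_NUMBER": ["A", "B", "C", "D"],
    "EMPLOYER_PROVINCE": [None, None, None, None],
    "EMPLOYER_PHONE_EXT": [None, "12", None, None],
    "JOB_TITLE": ["ANALYST", "ENGINEER", None, "DEVELOPER"],
})

THRESHOLD = 0.3  # arbitrary cut-off, chosen only for this example

fractions = null_fraction(toy)
to_drop = fractions[fractions > THRESHOLD].index.tolist()

# drop() without inplace returns a new DataFrame and leaves `toy` unchanged.
reduced = toy.drop(columns=to_drop)

print(fractions)
print(list(reduced.columns))
```

A threshold only flags candidates. Whether a sparse column is actually useless for the analysis is still a judgment call.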

In [27]:
# Filter rows on EMPLOYMENT_START_DATE (note: if the column was loaded as text, this comparison is lexicographic, not chronological)
# .copy() makes H1VisaData2 an independent DataFrame, which avoids the SettingWithCopyWarning raised by the in-place drop
H1VisaData2 = H1VisaData_df.loc[H1VisaData_df['EMPLOYMENT_START_DATE'] > '1-1-2016'].copy()
H1VisaData2.drop(['EMPLOYER_PROVINCE', 'EMPLOYER_PHONE_EXT', 'AGENT_ATTORNEY_CITY', 'AGENT_ATTORNEY_STATE', 'FULL_TIME_POSITION', 'ORIGINAL_CERT_DATE'], axis=1, inplace=True)
H1VisaData2.isnull().sum()


CASE_NUMBER                  0
CASE_STATUS                  0
CASE_SUBMITTED               0
DECISION_DATE                0
VISA_CLASS                   0
EMPLOYMENT_START_DATE        0
EMPLOYMENT_END_DATE         11
EMPLOYER_NAME               14
EMPLOYER_ADDRESS             4
EMPLOYER_CITY                5
EMPLOYER_STATE              33
EMPLOYER_POSTAL_CODE        20
EMPLOYER_COUNTRY             2
EMPLOYER_PHONE               2
AGENT_ATTORNEY_NAME          0
JOB_TITLE                    1
SOC_CODE                     3
SOC_NAME                     3
NAIC_CODE                    4
TOTAL_WORKERS                0
PREVAILING_WAGE              1
PW_UNIT_OF_PAY              47
PW_WAGE_SOURCE              50
PW_SOURCE_YEAR              58
PW_SOURCE_OTHER           8184
WAGE_RATE_OF_PAY_FROM        0
WAGE_RATE_OF_PAY_TO          1
WAGE_UNIT_OF_PAY             6
H-1B_DEPENDENT           13265
WILLFUL_VIOLATOR         13266
WORKSITE_CITY               16
WORKSITE_COUNTY           1342
WORKSITE
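
The row filter in the cell above compares EMPLOYMENT_START_DATE with the string '1-1-2016'. If the column was read as text, that comparison is character by character, so it may not select the rows one would expect from the calendar dates. A possible alternative is to parse the column first and compare against a timestamp. The sketch below uses made-up rows, and the MM/DD/YYYY format is an assumption about the CSV that the source does not confirm.

```python
import pandas as pd

# Made-up rows; the date format used here is an assumption, not taken from the file.
toy = pd.DataFrame({
    "CASE_NUMBER": ["A", "B", "C", "D"],
    "EMPLOYMENT_START_DATE": ["10/01/2015", "02/15/2016", "08/10/2016", None],
})

# Text comparison, as in the original filter: evaluated character by character.
text_mask = toy["EMPLOYMENT_START_DATE"].dropna() > "1-1-2016"

# Date comparison: parse first (unparseable or missing values become NaT),
# then compare against a real timestamp. NaT compares as False and is dropped.
start = pd.to_datetime(
    toy["EMPLOYMENT_START_DATE"], format="%m/%d/%Y", errors="coerce"
)
date_mask = start > pd.Timestamp("2016-01-01")
filtered = toy.loc[date_mask].copy()

print(text_mask.tolist())
print(filtered["CASE_NUMBER"].tolist())
```

In this toy frame the text comparison keeps only the 2015 row, while the date comparison keeps the two 2016 rows, so the choice of comparison changes which applications enter the analysis.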

## Simple Statistics

## Visualize Attributes

## Explore Join Attributes

## Explore Attributes and Class

## New Features

## Exceptional Work