This project is developed in Python. Its purpose is to analyze raw H1B data from past years and calculate two metrics: Top 10 Occupations and Top 10 States for certified visa applications.
Design a program to analyze raw immigration data and calculate two metrics: Top 10 Occupations and Top 10 States for certified visa applications. The program has the following requirements:
- The code should be modular and reusable for future years without changes;
- The code should handle special cases such as missing data and typos;
- Only the default data structures that come with the programming language may be used.
The input data is in a semicolon-separated (";") format. The first row contains the field names; a description of the fields can be found here. The field names differ from year to year.
The program must create two output files:
- `top_10_occupations.txt`: Top 10 occupations for certified visa applications
- `top_10_states.txt`: Top 10 states for certified visa applications
Requirements:
- Each line holds one record and each field on each line is separated by a semicolon (;).
- The records in the file must be sorted by `NUMBER_CERTIFIED_APPLICATIONS` and, in case of a tie, alphabetically by `TOP_OCCUPATIONS` (`TOP_STATES`).
- There can be fewer than 10 lines in each file.
- Percentages should also be rounded to one decimal place. For instance, 1.05% should round to 1.1% and 1.04% to 1.0%; 1% should be written as 1.0% (a formatting sketch follows this list).
- Even though the data is semicolon-separated, not all semicolons are delimiters. For example, `aaa;"bbb;ccc";;ddd` should be read as `'aaa', 'bbb;ccc', '', 'ddd'`;
- Field names differ between years;
- Missing data and typos: this is the hardest part of the problem. The worksite state information is all good, but there are missing values and typos in the occupation name and SOC code data.
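For the percentage rounding, here is a minimal sketch using Python's standard `decimal` module (the function name `format_pct` is mine, not from the project code; `decimal` guarantees the round-half-up rule, whereas the built-in `round()` rounds half to even):

```python
from decimal import Decimal, ROUND_HALF_UP

def format_pct(count, total):
    """Percentage rounded half-up to one decimal: 1.05 -> '1.1%', 1.04 -> '1.0%'."""
    pct = Decimal(count) / Decimal(total) * 100
    return f"{pct.quantize(Decimal('0.1'), rounding=ROUND_HALF_UP)}%"
```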
`preprocess.py`: processes the raw data, finds the correct fields, and cleans the data. Only records with 'CERTIFIED' status are written into the processed data:
- `STATE`: 'xx' format; inferred from the zip code if not valid;
- `SOC_CODE`: 'xx-xxxx.xx' format;
- `SOC_NAME`: cleaned as a string.
`data_analyze.py`: analyzes the processed data:
- Infers the occupation name from `SOC_CODE` and `SOC_NAME`;
- Uses `state_counter` and `occupation_counter` hash tables to record the frequency of states and occupations (see the sketch after this list);
- Sorts based on counts and prints out the top 10 states and occupations.
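A minimal sketch of the counting step, using plain dicts as the hash tables. The sample records and `soc_map` here are illustrative stand-ins for data loaded from the preprocessor's output:

```python
# Illustrative processed records: (state, soc_code, soc_name)
processed_records = [
    ('CA', '15-1132.00', 'SOFTWARE DEVELOPERS, APPLICATIONS'),
    ('CA', '15-1132.00', 'SOFTWARE DEVELOPERS, APPLICATIONS'),
    ('TX', '13-1161.00', 'MARKET RESEARCH ANALYSTS AND MARKETING SPECIALISTS'),
]
soc_map = {'15-1132.00': 'SOFTWARE DEVELOPERS, APPLICATIONS'}

state_counter = {}
occupation_counter = {}
for state, soc_code, soc_name in processed_records:
    # Prefer the majority-vote name from the SOC map; fall back to the raw name.
    occupation = soc_map.get(soc_code, soc_name)
    state_counter[state] = state_counter.get(state, 0) + 1
    occupation_counter[occupation] = occupation_counter.get(occupation, 0) + 1
```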
I summarized the field names needed for this problem across the different years.
Year | Status | State | Occupation name | SOC code |
---|---|---|---|---|
2014 | STATUS | LCA_CASE_WORKLOC1_STATE | LCA_CASE_SOC_NAME | LCA_CASE_SOC_CODE |
2015 | CASE_STATUS | WORKSITE_STATE | SOC_NAME | SOC_CODE |
2016 | CASE_STATUS | WORKSITE_STATE | SOC_NAME | SOC_CODE |
So I use the following criteria to select the correct fields:
- Status: 'STATUS' in the name
- State: 'WORK' and 'STATE' in the name
- Occupation name: 'SOC' and 'NAME' in the name
- SOC code: 'SOC' and 'CODE' in the name
Since the 2014 data contains another field named `LCA_CASE_WORKLOC2_STATE`, I only take the first match (see the sketch below).
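A sketch of how that selection might look; the function and dictionary key names are illustrative, not necessarily the project's actual ones:

```python
def find_columns(header_line):
    """Map each needed field to the index of its FIRST matching column.

    Keeping the first match skips 2014's LCA_CASE_WORKLOC2_STATE, since
    LCA_CASE_WORKLOC1_STATE appears before it in the header.
    """
    cols = {}
    for i, name in enumerate(header_line.split(';')):
        if 'STATUS' in name:
            cols.setdefault('status', i)
        elif 'WORK' in name and 'STATE' in name:
            cols.setdefault('state', i)
        elif 'SOC' in name and 'NAME' in name:
            cols.setdefault('soc_name', i)
        elif 'SOC' in name and 'CODE' in name:
            cols.setdefault('soc_code', i)
    return cols
```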
The semicolon delimiter challenge can be solved using a function, as sketched below, or using a regular expression.
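One possible function-based solution follows; Python's standard `csv.reader` with `delimiter=';'` would also handle the quoted fields:

```python
def split_line(line):
    """Split on semicolons that are not inside double quotes.

    split_line('aaa;"bbb;ccc";;ddd') -> ['aaa', 'bbb;ccc', '', 'ddd']
    """
    fields, current, in_quotes = [], [], False
    for ch in line:
        if ch == '"':
            in_quotes = not in_quotes          # toggle quote state, drop the quote
        elif ch == ';' and not in_quotes:
            fields.append(''.join(current))    # delimiter found outside quotes
            current = []
        else:
            current.append(ch)
    fields.append(''.join(current))            # flush the last field
    return fields
```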
Valid values include "Certified", "Certified-Withdrawn", "Denied", and "Withdrawn". In the given data, most entries are valid. In the 2014 data, one case has the status 'REJECTED' and another 'INVALIDATED'; these are the only two special cases across the three years of data. Since only cases with "Certified" status are counted in this problem, the special cases do not affect the result.
The state field should be a two-letter abbreviation for one of the states. In some cases the state name is incorrect and cannot be found in a state dictionary; for these cases, a function uses the zip code to correct the state (sketched below), and all special cases were checked manually. **If the state name is accidentally entered as another valid state, my program cannot catch it.**
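A sketch of the zip-code fallback. The prefix table here is a tiny hypothetical subset for illustration only; in practice, 3-digit zip prefixes map to states and the full table would be built once at startup:

```python
# Hypothetical, partial prefix table for illustration only.
ZIP_PREFIX_TO_STATE = {'100': 'NY', '606': 'IL', '945': 'CA'}
VALID_STATES = {'NY', 'IL', 'CA'}  # in practice, all 50+ abbreviations

def infer_state(state, zip_code):
    """Return the state if it is a valid abbreviation, else infer from the zip."""
    if state in VALID_STATES:
        return state
    return ZIP_PREFIX_TO_STATE.get(str(zip_code)[:3], state)
```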
The occupation names in the raw data are not very clean:
- Missing occupation names;
- Different formats: occupation names ending with '*' (meaning modified in SOC 2000) or '"'; or 'R&D', 'R & D', and 'R and D', which all represent the same name;
- Typos: like 'ANALYST' -> 'ANALYSTS' or 'COMPUTER' -> 'COMPTUER';
- Missing parts of the name: like 'COMPUTER OCCUPATIONS, ALL' -> 'COMPUTER OCCUPATIONS, ALL OTHER'.
We can use the SOC code to help correct the occupation name, but the SOC codes also have problems:
- Missing SOC codes;
- Different formats: xx/xxxx.xx, xx-xxxx-xx, or xxxxxxxx;
- Typos: missing digits or incorrect SOC codes;
- The same SOC code mapped to different occupation names.
To solve the format problem, I use two functions. `clean_soc_code` transforms every SOC code into the 'XX-XXXX.XX' format; if the SOC code has only 6 digits or ends with '.99', the function replaces the ending with '.00'. The `clean_soc_name` function converts the occupation name to all capital letters and removes extra spaces and quotation marks. A sketch of both follows.
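A minimal sketch of the two cleaning functions as described above; details such as the exact handling of unrecoverable codes are my assumptions:

```python
import re

def clean_soc_code(code):
    """Normalize a SOC code to the 'XX-XXXX.XX' format."""
    digits = re.sub(r'\D', '', code)     # drop '/', '-', '.', spaces, etc.
    if len(digits) == 6:                 # 6-digit code: treat the ending as '.00'
        digits += '00'
    if len(digits) != 8:
        return ''                        # unrecoverable (missing/typo'd digits)
    if digits.endswith('99'):            # '.99' is also mapped to '.00'
        digits = digits[:6] + '00'
    return f'{digits[:2]}-{digits[2:6]}.{digits[6:]}'

def clean_soc_name(name):
    """Upper-case the name and strip quotes, '*', and extra whitespace."""
    name = name.upper().replace('"', '').replace('*', '')
    return ' '.join(name.split())        # collapse runs of spaces
```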
The correct occupation name for a given soc_code is the soc_name with the highest frequency. For example, in the table below, the correct name for '13-1161.00' is 'MARKET RESEARCH ANALYSTS AND MARKETING SPECIALISTS'; the others are missing information or contain typos (see the sketch after the table).
SOC code | Occupation name | Count |
---|---|---|
13-1161.00 | MARKET RESEARCH ANALYSTS AND MARKETING SPECIALISTS | 6978 |
 | MARKET RESEARCH ANALYSTS AND MARKETING | 167 |
 | MARKET RESEARCH ANALYST AND MARKETING SPECIALISTS | 2 |
 | MARKET RESEACH ANALYSTS AND MARKETING SPECIALISTS | 1 |
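A sketch of the majority vote, using only built-in dicts; the sample records are illustrative stand-ins for the cleaned (state, soc_code, soc_name) tuples:

```python
# Illustrative cleaned records: (state, soc_code, soc_name)
cleaned_records = [
    ('CA', '13-1161.00', 'MARKET RESEARCH ANALYSTS AND MARKETING SPECIALISTS'),
    ('NY', '13-1161.00', 'MARKET RESEARCH ANALYSTS AND MARKETING SPECIALISTS'),
    ('TX', '13-1161.00', 'MARKET RESEACH ANALYSTS AND MARKETING SPECIALISTS'),
]

# Count how often each name appears for each SOC code.
name_counts = {}
for state, soc_code, soc_name in cleaned_records:
    if soc_code and soc_name:                     # skip records missing either
        counts = name_counts.setdefault(soc_code, {})
        counts[soc_name] = counts.get(soc_name, 0) + 1

# The SOC map keeps the most frequent name per code, e.g.
# soc_map['13-1161.00'] == 'MARKET RESEARCH ANALYSTS AND MARKETING SPECIALISTS'.
soc_map = {code: max(names, key=names.get) for code, names in name_counts.items()}
```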
The preprocessor reads the raw data and writes two output files: `processed_xxx.csv` and `soc_map_xxx.csv`. The processed data is extracted from the raw data with the fields 'state', 'soc_code', and 'soc_name'; only certified records are written into it.
The analysis code (H1BCounter) reads the processed data and the SOC map. It infers the occupation name from 'soc_code', 'soc_name', and the SOC map, and two hash tables record the count of each state/occupation. It then sorts the two tables by decreasing count and alphabetically by name, and writes the first 10 records (or all of them, if there are fewer than 10 keys) into `./output/top_10_occupations.txt` and `./output/top_10_states.txt` (see the sketch below).
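A sketch of the sort-and-write step, reusing `occupation_counter` from the counting sketch above; the exact header text is my assumption based on the column names in the requirements:

```python
def write_top_10(counter, total, header, path):
    """Sort by count (descending), break ties alphabetically, write top 10."""
    ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    with open(path, 'w') as out:
        out.write(header + '\n')
        for name, count in ranked[:10]:            # fewer than 10 is fine
            pct = 100.0 * count / total
            # Swap in format_pct from the rounding sketch for strict half-up rounding.
            out.write(f'{name};{count};{pct:.1f}%\n')

write_top_10(occupation_counter, sum(occupation_counter.values()),
             'TOP_OCCUPATIONS;NUMBER_CERTIFIED_APPLICATIONS;PERCENTAGE',
             './output/top_10_occupations.txt')
```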
The program pre-processes the data and saves the processed data to disk. The processed data takes about 1/10 of the space of the raw data, and the soc_map file is much smaller than the processed data. The advantages are:
- Saving time and memory: when inferring occupation names, using the processed data gives roughly a 10x speed-up.
- Enabling distributed analysis: both the soc_code map and the preprocessed data can safely be extended to a distributed system.
The input file is read and processed line by line, so the largest memory consumers are the hash tables. With roughly 1k occupations, the memory cost is on the order of 100 KB, and it does not grow dramatically with the size of the input.
In the preprocessor, the most time-consuming part is splitting each line on semicolons, so the total time is O(n), where n denotes the number of characters in the input file; the other operations are hash-table lookups with O(1) time.
In the analysis code, the time to loop over all records is O(m), where m denotes the number of characters in the processed data, about 1/10 of n.
The sorting time is O(k log k), where k is the number of states/occupations. Since k ~ 1000, there is no need to use a heap (which would give O(k log 10) time).
Make sure the `./output` directory and `./input/h1b_input.csv` exist in the current directory, then run:

`sh run.sh`
Thanks to the Insight fellowship team; this small project gave me a meaningful and happy weekend.