<a href="https://colab.research.google.com/github/cfcastillo/DS-6-Notebooks/blob/main/Education_Capstone_MS2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TASK LIST

* Find dataset(s) that offer future job popularity/need - **Amy**
  * [top 10 skills by year](https://data.world/datasets/jobs)

* **IN PROCESS** SOC to OCCUP 3 pass match - take max(census) code for multiples - **Cecilia**

**MS-2 requirements**

Your Jupyter Notebook needs to contain the following: 
* **ONGOING** All code needed to clean and format your data
* **DONE** A written description of the data cleaning process
* **DONE** completed rows of the dataset, predictors and response.



# Project Definition

The purpose of this project is to identify the factors that influence people to choose certain professions or trades. Understanding these factors can help colleges like Central New Mexico Community College (CNM) offer courses that support those professions and better target their marketing to people who are likely to choose them.

This project will be a supervised classification problem that uses tree-based models to identify the factors that contribute to career choice.



# Data Identification Process

Steps:

1. We stated several questions we wanted answered (target). 
1. After defining our problem, we listed sets of variables that we believed could answer our questions. We then put the variables and targets into a [spreadsheet](https://docs.google.com/spreadsheets/d/1bOhOBHKOae9TDN9n9-xF7ag4QW_Z0c7HXTYLXeMMLHs/edit#gid=0) to define the dataset we would need to run our analysis. 
1. We then researched data sources such as the Bureau of Labor Statistics and the US Census to locate data that supported our research.
1. We then mapped the columns in the data sources to the columns in our desired dataset and linked multiple datasets by target code value.

*Note: The data identification process is still a work in progress. As we proceed with EDA, we expect to find that some columns are not needed and that others must be added. During data cleaning, we discovered that earnings are complex and are often made up of multiple jobs. Additional analysis will be needed to finalize our predictor before applying the model.*

# Data Collection

The following data sources were used for this project. Data was imported into Google Drive from the below links and modified as needed to support this project.

The primary datasets for this project were initially taken from the Census Bureau's [Annual Social and Economic Supplement (ASEC)](https://www.census.gov/programs-surveys/saipe/guidance/model-input-data/cpsasec.html) of the Current Population Survey (CPS) for 2020. However, because 2020 was anomalous due to COVID-19, we went back to the pre-pandemic 2018 and 2019 data to get more stable occupation and salary information. Per the link above, the "*ASEC data is the source of timely official national estimates of poverty levels and rates and of widely used measures of income. It provides annual estimates based on a survey of more than 75,000 households. The survey contains detailed questions covering social and economic characteristics of each person who is a household member as of the interview date. Income questions refer to income received during the previous calendar year.*"

[Annual Social and Economic Survey (ASEC) All Years Data](https://www.census.gov/data/datasets/time-series/demo/cps/cps-asec.html)

* Contains links to all years from 2010 to 2021. CSV format is available from 2019 to 2021. Before 2019, a fixed-format file is provided, so columns would need to be parsed using the available data dictionary.
* [2021 Survey - csv](https://www.census.gov/data/datasets/time-series/demo/cps/cps-asec.2021.html)
* [2020 Survey - csv](https://www.census.gov/data/datasets/time-series/demo/cps/cps-asec.2020.html)
* [2019 Survey - csv](https://www.census.gov/data/datasets/time-series/demo/cps/cps-asec.2019.html)
* [2018 Survey - dat](https://www.census.gov/data/datasets/time-series/demo/cps/cps-asec.2018.html) - Need to convert to csv
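
For the pre-2019 fixed-format files, `pandas.read_fwf` can slice columns by character position. The sketch below runs on a tiny in-memory sample. The column positions and the output file name are made up for illustration; real positions would have to come from the ASEC data dictionary for the year in question.

```python
import io
import pandas as pd

# Hypothetical layout: (start, end) character positions, end exclusive.
# Real positions must be taken from the ASEC data dictionary.
colspecs = [(0, 4), (4, 6), (6, 7)]
names = ['OCCUP', 'A_AGE', 'A_SEX']

# Three made-up fixed-width records standing in for the .dat file.
sample = io.StringIO("4050211\n0000852\n4020612\n")

# dtype=str keeps leading zeros in code columns such as OCCUP.
fixed_df = pd.read_fwf(sample, colspecs=colspecs, names=names, header=None, dtype=str)
fixed_df['A_AGE'] = fixed_df['A_AGE'].astype(int)
print(fixed_df)

# fixed_df.to_csv('asec_2018_person.csv', index=False)  # hypothetical output name
```

Reading everything as text first and converting only the truly numeric columns avoids silently turning a code like `0000` into `0`.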

[Quarterly Census of Employment and Wages](https://www.bls.gov/cew/about-data/)

* Source data for OES Statistics. Can be used if detailed data is needed.

[Occupational Employment Wage Statistics (OES) Data](https://data.bls.gov/oes/#/geoOcc/Multiple%20occupations%20for%20one%20geographical%20area)

* Format - Excel converted to CSV
* Contains Occupational codes and aggregated statistics on wages for those occupations.

[FIPS State Codes](https://www.census.gov/library/reference/code-lists/ansi/ansi-codes-for-states.html)

* Format - Copied from PDF and converted to CSV
* Contains FIPS State codes mapped to US Postal Service (USPS) State codes.

[Census Occupation Codes](https://www2.census.gov/programs-surveys/cps/techdocs/cpsmar20.pdf)

* Format - Copied from PDF and converted to CSV
* Contains Census Occupation codes mapped to Federal Standard Occupational Classification (SOC) Codes.


# Imports

In [None]:
# grab the imports needed for the project
import pandas as pd
import glob
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import statsmodels.api as sm

# all
from sklearn import datasets
from sklearn import metrics
from sklearn import preprocessing
from sklearn.metrics import classification_report
import sklearn.model_selection as model_selection

# Naive Bayes (Gaussian and Multinomial)
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB

# Regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

# Data Cleaning - MS-2 - Oct 1

Once we identified the data elements needed for our project and the data sources that provided those data elements, the following steps were taken to get the data into a format needed for our analysis.

1. Downloaded data from data sources and placed copies in Google Drive.
1. Made changes to raw data where needed to support the project. 
  * Added State code to OES data and removed headers and footers from the data.
  * Created lookup data for State codes and SOC codes so secondary data sources could be merged with primary Census data. This involved cleaning the census code list so it could be properly parsed.
1. Converted codes in secondary datasets into Census codes.
1. Merged all datasets together into a single dataset.
1. Removed data that did not meet criteria for our analysis.
  * Removed anyone under age 15.
  * Imputed null values.
1. Studied earnings/salary columns to determine which columns provided values that could be used for modeling. Added in columns that were missing from the initial analysis.
1. We were not able to reliably match the OES data to the Census data using the full SOC code because of disparities in SOC codes. Therefore, we executed 3 matching passes, shortening the SOC code by one character on each pass and taking the largest Census code for each SOC code prefix. This allowed us to match XX % of the data.
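
One way the match percentage could be measured is with the `indicator` option of `pandas.merge`, which labels each row by where it was found. This is a minimal sketch on toy data; `oes_toy` and `lookup_toy` are hypothetical stand-ins for the OES data and the Census/SOC lookup.

```python
import pandas as pd

# Toy stand-ins for the OES rows and the Census/SOC lookup table.
oes_toy = pd.DataFrame({'SOC_CODE': ['111011', '111021', '111031', '119199']})
lookup_toy = pd.DataFrame({'SOC_CODE': ['111011', '111021'],
                           'OCCUP': ['0010', '0020']})

# indicator=True adds a _merge column: 'both' means the SOC code was matched.
merged = oes_toy.merge(lookup_toy, on='SOC_CODE', how='left', indicator=True)
match_rate = (merged['_merge'] == 'both').mean() * 100
print(f'Matched {match_rate:.1f}% of OES rows')
```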

TODO: REFINE AS PROJECT PROGRESSES.

## Import Data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Import Census data
asec_year = '19'
asec_path = '/content/drive/MyDrive/Student Folder - Cecilia/Projects/Capstone/Data/ASEC/asecpub' + asec_year + 'csv/'
asec_data_person = pd.read_csv(asec_path + 'pppub' + asec_year + '.csv')
asec_data_household = pd.read_csv(asec_path + 'hhpub' + asec_year + '.csv')

# TODO: once all data is available, join 1x and then save combined file so don't have to join every time code is run.
# Join and import all 50 states' occupation data
oes_path = '/content/drive/MyDrive/Student Folder - Cecilia/Projects/Capstone/Data/Occupations/'
oes_file_names = glob.glob(oes_path + "*.csv")
li = []
for filename in oes_file_names:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)
oes_data = pd.concat(li, axis=0, ignore_index=True)

# File path for all code conversion files.
codes_path = '/content/drive/MyDrive/Student Folder - Cecilia/Projects/Capstone/Data/Codes/'

# Import FIPS state codes so we can convert USPS state to FIPS state to match back to Census data.
fips_state_codes = pd.read_csv(codes_path + 'FIPS_STATE_CODES.csv')

# Import Census occupational codes so we can convert SOC codes into Census Occ codes.
# Is in fixed width format. Will parse out data below.
census_occ_codes = pd.read_fwf(codes_path + 'CENSUS_SOC_OCC_CODES.txt')
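
The TODO above about joining the state files once and saving the result could be handled with a small caching helper. The sketch below is one possible approach, not the project's implementation: `load_oes` and the `cache` subfolder are hypothetical names, and the demo writes toy files to a temporary directory instead of Google Drive.

```python
import glob
import os
import tempfile
import pandas as pd

def load_oes(oes_dir, cache_name='oes_combined.csv'):
    """Concatenate the per-state CSV files once, then reuse the cached result."""
    # The cache lives in a subfolder so the glob below never picks it up.
    cache_path = os.path.join(oes_dir, 'cache', cache_name)
    if os.path.exists(cache_path):
        return pd.read_csv(cache_path, dtype=str)
    files = sorted(glob.glob(os.path.join(oes_dir, '*.csv')))
    combined = pd.concat([pd.read_csv(f, dtype=str) for f in files], ignore_index=True)
    os.makedirs(os.path.dirname(cache_path), exist_ok=True)
    combined.to_csv(cache_path, index=False)
    return combined

# Demo with two toy state files.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({'State': ['AL'], 'EMP': ['10']}).to_csv(os.path.join(tmp, 'AL.csv'), index=False)
    pd.DataFrame({'State': ['AK'], 'EMP': ['20']}).to_csv(os.path.join(tmp, 'AK.csv'), index=False)
    first = load_oes(tmp)    # builds the cache
    second = load_oes(tmp)   # reads the cache
    print(first.equals(second))
```

If a state file changes, the cached file has to be deleted by hand so it is rebuilt on the next run.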

In [None]:
# How many columns and rows do we have in each dataset?
print(f'Person data: {asec_data_person.shape}')
print(f'Household data: {asec_data_household.shape}')
print(f'Occupation data: {oes_data.shape}')
print(f'FIPS State Codes: {fips_state_codes.shape}')
print(f'Census Occ Codes: {census_occ_codes.shape}')

Person data: (180101, 799)
Household data: (94633, 135)
Occupation data: (35822, 19)
FIPS State Codes: (56, 3)
Census Occ Codes: (531, 1)


## ASEC Data

### Define ASEC Columns

The following data dictionary provides details for the selected columns.

[Annual Social and Economic Supplement (ASEC) 2020 Public Use Dictionary](https://www2.census.gov/programs-surveys/cps/datasets/2020/march/ASEC2020ddl_pub_full.pdf)

In [None]:
# Get lists of columns for various datasets that will be used for the project
# Note: Columns can be added as needed here and will propagate through the project.
id_col = ['H_IDNUM']
person_cols = ['OCCUP','A_DTOCC','A_MJOCC','AGE1','A_AGE','A_SEX','PRDTRACE','PRCITSHP','A_HGA','A_HRLYWK', 'A_HRSPAY','A_GRSWK',
               'CLWK','EARNER','HRCHECK','HRSWK','PEARNVAL','A_CLSWKR','A_DTIND']
household_cols = ['GTMETSTA','GEDIV','GESTFIPS','HEFAMINC','HHINC']

### Get Household Id

In [None]:
# Extract the Household id number from the person record so we can join the household and person dataframes by this id.
asec_data_person['H_IDNUM'] = asec_data_person['PERIDNUM'].str[:20]

In [None]:
# View Person Data
asec_data_person[person_cols].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180101 entries, 0 to 180100
Data columns (total 19 columns):
 #   Column    Non-Null Count   Dtype
---  ------    --------------   -----
 0   OCCUP     180101 non-null  int64
 1   A_DTOCC   180101 non-null  int64
 2   A_MJOCC   180101 non-null  int64
 3   AGE1      180101 non-null  int64
 4   A_AGE     180101 non-null  int64
 5   A_SEX     180101 non-null  int64
 6   PRDTRACE  180101 non-null  int64
 7   PRCITSHP  180101 non-null  int64
 8   A_HGA     180101 non-null  int64
 9   A_HRLYWK  180101 non-null  int64
 10  A_HRSPAY  180101 non-null  int64
 11  A_GRSWK   180101 non-null  int64
 12  CLWK      180101 non-null  int64
 13  EARNER    180101 non-null  int64
 14  HRCHECK   180101 non-null  int64
 15  HRSWK     180101 non-null  int64
 16  PEARNVAL  180101 non-null  int64
 17  A_CLSWKR  180101 non-null  int64
 18  A_DTIND   180101 non-null  int64
dtypes: int64(19)
memory usage: 26.1 MB


In [None]:
# Look at first 5 records of selected columns of person data.
asec_data_person[person_cols].head()

Unnamed: 0,OCCUP,A_DTOCC,A_MJOCC,AGE1,A_AGE,A_SEX,PRDTRACE,PRCITSHP,A_HGA,A_HRLYWK,A_HRSPAY,A_GRSWK,CLWK,EARNER,HRCHECK,HRSWK,PEARNVAL,A_CLSWKR,A_DTIND
0,4050,13,3,4,21,1,1,1,37,0,-1,0,1,1,1,30,18000,1,45
1,0,0,0,17,85,2,1,1,39,0,-1,0,5,2,0,0,0,0,0
2,4020,13,3,13,61,2,1,1,39,0,-1,0,1,1,2,44,12000,1,45
3,0,0,0,16,73,2,1,1,39,0,-1,0,5,2,0,0,0,0,0
4,4610,15,3,8,37,1,1,1,39,0,-1,0,1,1,1,20,12000,1,43


In [None]:
# View Household Data
asec_data_household[household_cols].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94633 entries, 0 to 94632
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   GTMETSTA  94633 non-null  int64
 1   GEDIV     94633 non-null  int64
 2   GESTFIPS  94633 non-null  int64
 3   HEFAMINC  94633 non-null  int64
 4   HHINC     94633 non-null  int64
dtypes: int64(5)
memory usage: 3.6 MB


In [None]:
# Look at first 5 records of household data
asec_data_household[household_cols].head()

Unnamed: 0,GTMETSTA,GEDIV,GESTFIPS,HEFAMINC,HHINC
0,2,1,23,-1,0
1,2,1,23,-1,0
2,2,1,23,-1,0
3,2,1,23,6,8
4,2,1,23,-1,0


### Merge Person and Household Records

In [None]:
# Join Household and Personal records into single dataframe
# Inner join - should not have person without household.
asec_combined = pd.merge(asec_data_household[id_col + household_cols], asec_data_person[id_col + person_cols], on=id_col)
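
The assumption that every person belongs to exactly one household can be checked at merge time with the `validate` argument of `pandas.merge`. A minimal sketch on toy frames (the values are invented):

```python
import pandas as pd

households = pd.DataFrame({'H_IDNUM': ['A', 'B'], 'GESTFIPS': [23, 35]})
persons = pd.DataFrame({'H_IDNUM': ['A', 'A', 'B'], 'A_AGE': [21, 85, 61]})

# validate='one_to_many' raises MergeError if a household id is duplicated
# on the household side, which would silently multiply person rows.
combined = pd.merge(households, persons, on='H_IDNUM', how='inner',
                    validate='one_to_many')

# With an inner join, a row count equal to the person count means
# no person was dropped for lack of a household.
print(len(combined) == len(persons))
```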

In [None]:
# View combined result
asec_combined.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 180101 entries, 0 to 180100
Data columns (total 25 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   H_IDNUM   180101 non-null  object
 1   GTMETSTA  180101 non-null  int64 
 2   GEDIV     180101 non-null  int64 
 3   GESTFIPS  180101 non-null  int64 
 4   HEFAMINC  180101 non-null  int64 
 5   HHINC     180101 non-null  int64 
 6   OCCUP     180101 non-null  int64 
 7   A_DTOCC   180101 non-null  int64 
 8   A_MJOCC   180101 non-null  int64 
 9   AGE1      180101 non-null  int64 
 10  A_AGE     180101 non-null  int64 
 11  A_SEX     180101 non-null  int64 
 12  PRDTRACE  180101 non-null  int64 
 13  PRCITSHP  180101 non-null  int64 
 14  A_HGA     180101 non-null  int64 
 15  A_HRLYWK  180101 non-null  int64 
 16  A_HRSPAY  180101 non-null  int64 
 17  A_GRSWK   180101 non-null  int64 
 18  CLWK      180101 non-null  int64 
 19  EARNER    180101 non-null  int64 
 20  HRCHECK   180101 non-null 

In [None]:
asec_combined.head()

Unnamed: 0,H_IDNUM,GTMETSTA,GEDIV,GESTFIPS,HEFAMINC,HHINC,OCCUP,A_DTOCC,A_MJOCC,AGE1,A_AGE,A_SEX,PRDTRACE,PRCITSHP,A_HGA,A_HRLYWK,A_HRSPAY,A_GRSWK,CLWK,EARNER,HRCHECK,HRSWK,PEARNVAL,A_CLSWKR,A_DTIND
0,01000691245394308011,2,1,23,6,8,4050,13,3,4,21,1,1,1,37,0,-1,0,1,1,1,30,18000,1,45
1,39994039016100209011,2,1,23,10,9,0,0,0,17,85,2,1,1,39,0,-1,0,5,2,0,0,0,0,0
2,91193400932060909011,2,1,23,5,5,4020,13,3,13,61,2,1,1,39,0,-1,0,1,1,2,44,12000,1,45
3,14103203009699909011,2,1,23,2,10,0,0,0,16,73,2,1,1,39,0,-1,0,5,2,0,0,0,0,0
4,14103203009699909011,2,1,23,2,10,4610,15,3,8,37,1,1,1,39,0,-1,0,1,1,1,20,12000,1,43


## OES Data

In [None]:
# Shorten column names
oes_data.rename(columns={'State':'USPS_STATE',
                         'Occupation (SOC code)':'SOC_DESC',
                         'Employment(1)':'EMP',
                         'Employment percent relative standard error(3)':'EMP_RSDE',
                         'Hourly mean wage':'HOURLY_MEAN',
                         'Annual mean wage(2)':'ANN_MEAN',
                         'Wage percent relative standard error(3)':'WAGE_RSDE',
                         'Hourly 10th percentile wage':'HOURLY_10TH',
                         'Hourly 25th percentile wage':'HOURLY_25TH',
                         'Hourly median wage':'HOURLY_MEDIAN',
                         'Hourly 75th percentile wage':'HOURLY_75TH',
                         'Hourly 90th percentile wage':'HOURLY_90TH',
                         'Annual 10th percentile wage(2)':'ANN_10TH',
                         'Annual 25th percentile wage(2)':'ANN_25TH',
                         'Annual median wage(2)':'ANN_MEDIAN',
                         'Annual 75th percentile wage(2)':'ANN_75TH',
                         'Annual 90th percentile wage(2)':'ANN_90TH',
                         'Employment per 1,000 jobs':'EMP_PER_1000',
                         'Location Quotient':'LOC_QUOTIENT'}, inplace=True)

### OES Column Footnotes

* (1) Estimates for detailed occupations do not sum to the totals because the totals include occupations not shown separately. Estimates do not include self-employed workers.
* (2) Annual wages have been calculated by multiplying the corresponding hourly wage by 2,080 hours.
* (3) The relative standard error (RSE) is a measure of the reliability of a survey statistic. The smaller the relative standard error, the more precise the estimate.
* (4) Wages for some occupations that do not generally work year-round, full time, are reported either as hourly wages or annual salaries depending on how they are typically paid.
* (5) This wage is equal to or greater than \$100.00 per hour or \$208,000 per year.
* (8) Estimate not released.
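
Footnote (2) can be checked by hand. The sketch below uses the Alabama "All Occupations" row that appears in the OES output later in this notebook. The small gap between the computed and published figures is presumably due to rounding in the published values, which is an assumption rather than something the footnotes state.

```python
HOURS_PER_YEAR = 2080  # 40 hours x 52 weeks, the multiplier named in footnote (2)

hourly_mean = 22.52        # HOURLY_MEAN for AL, All Occupations
published_annual = 46840   # ANN_MEAN for the same row

estimated_annual = hourly_mean * HOURS_PER_YEAR
difference = estimated_annual - published_annual
print(f'Estimated: {estimated_annual:.2f}, published: {published_annual}, gap: {difference:.2f}')
```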

In [None]:
# OES columns we want to keep
# occupation_cols = ['HOURLY_MEAN','HOURLY_MEDIAN','EMP_PER_1000','LOC_QUOTIENT']
occupation_cols = ['HOURLY_MEAN','HOURLY_MEDIAN','EMP_PER_1000','LOC_QUOTIENT', 'ANN_MEAN','ANN_MEDIAN']
# 2   EMP            35874 non-null  object
#  3   EMP_RSDE       35874 non-null  object
#  4   HOURLY_MEAN    35874 non-null  object
#  5   ANN_MEAN       35874 non-null  object
#  6   WAGE_RSDE      35874 non-null  object
#  7   HOURLY_10TH    35874 non-null  object
#  8   HOURLY_25TH    35874 non-null  object
#  9   HOURLY_MEDIAN  35874 non-null  object
#  10  HOURLY_75TH    35874 non-null  object
#  11  HOURLY_90TH    35874 non-null  object
#  12  ANN_10TH       35874 non-null  object
#  13  ANN_25TH       35874 non-null  object
#  14  ANN_MEDIAN     35874 non-null  object
#  15  ANN_75TH       35874 non-null  object
#  16  ANN_90TH       35874 non-null  object
#  17  EMP_PER_1000   35874 non-null  object
#  18  LOC_QUOTIENT   35874 non-null  object

### Get Census State Codes

In [None]:
# Import FIPS state codes matching on USPS state codes
# Left Join - keep OES data even if no match on state code.
oes_data = pd.merge(oes_data, fips_state_codes[['USPS_STATE','FIPS_STATE']], on='USPS_STATE', how='left')

In [None]:
# Verify merge was successful - that we have expected columns and record count is unchanged.
oes_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 35822 entries, 0 to 35821
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   USPS_STATE     35822 non-null  object
 1   SOC_DESC       35822 non-null  object
 2   EMP            35822 non-null  object
 3   EMP_RSDE       35822 non-null  object
 4   HOURLY_MEAN    35822 non-null  object
 5   ANN_MEAN       35822 non-null  object
 6   WAGE_RSDE      35822 non-null  object
 7   HOURLY_10TH    35822 non-null  object
 8   HOURLY_25TH    35822 non-null  object
 9   HOURLY_MEDIAN  35822 non-null  object
 10  HOURLY_75TH    35822 non-null  object
 11  HOURLY_90TH    35822 non-null  object
 12  ANN_10TH       35822 non-null  object
 13  ANN_25TH       35822 non-null  object
 14  ANN_MEDIAN     35822 non-null  object
 15  ANN_75TH       35822 non-null  object
 16  ANN_90TH       35822 non-null  object
 17  EMP_PER_1000   35822 non-null  object
 18  LOC_QUOTIENT   35822 non-n

In [None]:
# Verify we have all states
oes_data[['USPS_STATE', 'FIPS_STATE']].value_counts()

USPS_STATE  FIPS_STATE
CA          6             797
TX          48            794
FL          12            790
NY          36            785
OH          39            775
PA          42            775
IL          17            775
WA          53            756
MI          26            753
TN          47            751
GA          13            749
VA          51            748
IN          18            747
OR          41            744
MN          27            742
CO          8             742
NC          37            741
MD          24            739
NJ          34            738
MA          25            735
MO          29            733
WI          55            731
AL          1             731
LA          22            730
AZ          4             729
OK          40            712
IA          19            711
KY          21            709
UT          49            705
SC          45            696
MS          28            688
KS          20            685
CT          9    

In [None]:
# Verify no nulls after merge.
oes_data.isnull().sum()

USPS_STATE       0
SOC_DESC         0
EMP              0
EMP_RSDE         0
HOURLY_MEAN      0
ANN_MEAN         0
WAGE_RSDE        0
HOURLY_10TH      0
HOURLY_25TH      0
HOURLY_MEDIAN    0
HOURLY_75TH      0
HOURLY_90TH      0
ANN_10TH         0
ANN_25TH         0
ANN_MEDIAN       0
ANN_75TH         0
ANN_90TH         0
EMP_PER_1000     0
LOC_QUOTIENT     0
FIPS_STATE       0
dtype: int64

### Parse SOC Codes

In [None]:
# Parse out SOC code from the description. The code is inside parentheses.
def getSocCode(value):
  # If not able to parse the code, then return the value from the file.
  try:
    return value[value.index('(')+1:value.index(')')]
  except (ValueError, AttributeError):
    # ValueError: no parentheses found. AttributeError: value is not a string (e.g. NaN).
    return value

oes_data['SOC_CODE'] = oes_data['SOC_DESC'].apply(getSocCode)

In [None]:
# Verify codes were properly parsed
oes_data[['SOC_DESC','SOC_CODE']]

Unnamed: 0,SOC_DESC,SOC_CODE
0,All Occupations(000000),000000
1,Management Occupations(110000),110000
2,Chief Executives(111011),111011
3,General and Operations Managers(111021),111021
4,Legislators(111031),111031
...,...,...
35817,Gas Compressor and Gas Pumping Station Operato...,537071
35818,"Pump Operators, Except Wellhead Pumpers(537072)",537072
35819,Wellhead Pumpers(537073),537073
35820,Refuse and Recyclable Material Collectors(537081),537081


### Get Census Occupation Codes

In [None]:
# Prepare Census/SOC map file. Codes are embedded in a single column so need to be parsed out.
# Parse out Census and SOC occupational codes from the description. The code has a dash in it. So locate by dash
# assuming there are no dashes in the description.
def getOccCodeSoc(value):
  # If not able to parse the code, then return the value from the file.
  try:
    return value[value.index('-')-2:value.index('-')+5].replace('-','')
  except (ValueError, AttributeError):
    # ValueError: no dash found. AttributeError: value is not a string (e.g. NaN).
    return value

# Retrieve the first 4 characters of each line. This is the Census code.
def getOccCodeCensus(value):
  return value[:4]

census_occ_codes['OCCUP'] = census_occ_codes['CENSUS_MAP'].apply(getOccCodeCensus)
census_occ_codes['SOC_CODE'] = census_occ_codes['CENSUS_MAP'].apply(getOccCodeSoc)
census_occ_codes

Unnamed: 0,CENSUS_MAP,OCCUP,SOC_CODE
0,5740 Secretaries and administrative assistants...,5740,5740 Secretaries and administrative assistants...
1,0010 Chief executives 11-1011,0010,111011
2,0020 General and operations managers 11-1021,0020,111021
3,0040 Advertising and promotions managers 11-2011,0040,112011
4,0051 Marketing Managers 11-2021,0051,112021
...,...,...,...
526,9645 Stockers and order fillers 53-7065,9645,537065
527,9650 Pumping station operators 53-7070,9650,537070
528,9720 Refuse and recyclable material collectors...,9720,537081
529,9760 Other material moving workers 53-71XX,9760,5371XX
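
The dash-based parser relies on the description containing no dashes. A regular expression that looks for the `NN-NNNN` pattern at the end of the line would relax that assumption. The sketch below is illustrative only: the third row is an invented example of a hyphenated title, and the pattern assumes the SOC code is always the last token on the line.

```python
import pandas as pd

census_map = pd.Series(['0010 Chief executives 11-1011',
                        '9760 Other material moving workers 53-71XX',
                        '7700 First-line supervisors of production workers 51-1011'])

# Census code: first four digits. SOC code: two digits, a dash, then four
# characters that are digits or X placeholders, at the end of the line.
parts = census_map.str.extract(r'^(?P<OCCUP>\d{4})\s+.*?(?P<SOC>\d{2}-[0-9X]{4})\s*$')
parts['SOC_CODE'] = parts['SOC'].str.replace('-', '', regex=False)
print(parts[['OCCUP', 'SOC_CODE']])
```

Rows that do not fit the pattern, such as a header line, come back as missing values and can be inspected separately.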


In [None]:
# TODO: determine if needed. This is a work in progress.
# There are some cases where we do not have an exact match on the full SOC Code.
# In such cases, try to match on the first 5 characters, taking the largest Occ code in case
# there are multiple Occ codes associated with the first 5 characters of SOC Code.
census_occ_codes['SOC_CODE5'] = census_occ_codes['SOC_CODE'].str[:5]
census_occ_codes_soc5 = census_occ_codes.groupby("SOC_CODE5").agg({
    'OCCUP':['max','count']
}).reset_index()

# Repeat the process using the first 4 characters of SOC Code.
census_occ_codes['SOC_CODE4'] = census_occ_codes['SOC_CODE'].str[:4]
census_occ_codes_soc4 = census_occ_codes.groupby("SOC_CODE4").agg({
    'OCCUP':['max','count']
}).reset_index()
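
The fallback passes could be expressed as a loop over prefix widths, filling in only the rows that are still unmatched. This is a minimal sketch on toy data, not the notebook's final matching logic. The frames `lookup` and `oes` and the code values in them are invented for illustration.

```python
import pandas as pd

lookup = pd.DataFrame({'SOC_CODE': ['111011', '119199', '119151'],
                       'OCCUP': ['0010', '0440', '0420']})
oes = pd.DataFrame({'SOC_CODE': ['111011', '119198', '119120', '999999']})

# Start with no Census code assigned.
oes['OCCUP'] = pd.Series([None] * len(oes), dtype='object')

for width in (6, 5, 4):
    # Largest Census code for each SOC prefix of this width.
    prefix_map = lookup.groupby(lookup['SOC_CODE'].str[:width])['OCCUP'].max()
    # Only fill rows that earlier, more specific passes left unmatched.
    unmatched = oes['OCCUP'].isna()
    oes.loc[unmatched, 'OCCUP'] = oes.loc[unmatched, 'SOC_CODE'].str[:width].map(prefix_map)

print(oes)
```

Running the most specific pass first means an exact match is never overwritten by a looser prefix match.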

In [None]:
# Convert SOC Code into Census occupation code
# Left Join - keep OES data even if no match on SOC code.
oes_data = pd.merge(oes_data, census_occ_codes[['OCCUP','SOC_CODE']], on='SOC_CODE', how='left')

In [None]:
# TODO: 2 more passes needed to match. If still largely unmatched with Census code, then will use census
#   major and detailed categories instead and will forego OES statistics.
# Get rows that are missing OCCUP code.
df = oes_data[oes_data['OCCUP'].isnull()]

# df['SOC_CODE'].value_counts().to_csv('test.csv')
# oes_data

In [None]:
oes_data.info()
oes_data.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 35874 entries, 0 to 35873
Data columns (total 22 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   USPS_STATE     35874 non-null  object
 1   SOC_DESC       35874 non-null  object
 2   EMP            35874 non-null  object
 3   EMP_RSDE       35874 non-null  object
 4   HOURLY_MEAN    35874 non-null  object
 5   ANN_MEAN       35874 non-null  object
 6   WAGE_RSDE      35874 non-null  object
 7   HOURLY_10TH    35874 non-null  object
 8   HOURLY_25TH    35874 non-null  object
 9   HOURLY_MEDIAN  35874 non-null  object
 10  HOURLY_75TH    35874 non-null  object
 11  HOURLY_90TH    35874 non-null  object
 12  ANN_10TH       35874 non-null  object
 13  ANN_25TH       35874 non-null  object
 14  ANN_MEDIAN     35874 non-null  object
 15  ANN_75TH       35874 non-null  object
 16  ANN_90TH       35874 non-null  object
 17  EMP_PER_1000   35874 non-null  object
 18  LOC_QUOTIENT   35874 non-n

Unnamed: 0,USPS_STATE,SOC_DESC,EMP,EMP_RSDE,HOURLY_MEAN,ANN_MEAN,WAGE_RSDE,HOURLY_10TH,HOURLY_25TH,HOURLY_MEDIAN,HOURLY_75TH,HOURLY_90TH,ANN_10TH,ANN_25TH,ANN_MEDIAN,ANN_75TH,ANN_90TH,EMP_PER_1000,LOC_QUOTIENT,FIPS_STATE,SOC_CODE,OCCUP
0,AL,All Occupations(000000),1903210,0.5,22.52,46840,0.6,8.98,11.57,17.43,27.39,41.07,18690,24060,36250,56980,85430,1000.0,1.0,1,0,
1,AL,Management Occupations(110000),87110,1.2,52.90,110040,0.7,22.95,32.37,45.73,64.58,91.89,47740,67330,95120,134320,191130,45.772,0.8,1,110000,
2,AL,Chief Executives(111011),1160,8.9,84.09,174910,3.8,23.79,47.08,77.55,-,-,49480,97930,161290,-,-,0.608,0.42,1,111011,10.0
3,AL,General and Operations Managers(111021),31170,2.0,58.56,121800,1.0,23.09,32.57,48.64,73.58,-,48030,67740,101170,153050,-,16.377,0.97,1,111021,20.0
4,AL,Legislators(111031),1150,6.7,-,28840,5.1,-,-,-,-,-,16220,17190,18820,27920,55970,0.604,1.64,1,111031,
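
The row count rises from 35,822 before this merge to 35,874 after it, which suggests that some SOC codes appear more than once in the lookup. A sketch of how such duplicates could be found and collapsed before merging, using toy data and the "largest Census code" rule described earlier:

```python
import pandas as pd

lookup = pd.DataFrame({'SOC_CODE': ['111011', '119199', '119199'],
                       'OCCUP': ['0010', '0440', '0430']})

# SOC codes listed more than once would duplicate OES rows in a left merge.
dupes = lookup[lookup.duplicated('SOC_CODE', keep=False)]
print(dupes)

# Keep the largest Census code per SOC code so the merge stays many-to-one.
lookup_unique = lookup.groupby('SOC_CODE', as_index=False)['OCCUP'].max()

oes = pd.DataFrame({'SOC_CODE': ['111011', '119199', '537081']})
merged = oes.merge(lookup_unique, on='SOC_CODE', how='left', validate='many_to_one')
print(len(merged) == len(oes))
```

With `validate='many_to_one'`, pandas raises an error instead of quietly adding rows if the lookup still contains a repeated SOC code.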


## Combine All Data

In [None]:
# Bring in Occupational data joining on FIPS state and Census occupation code (OCCUP).
# Convert to numeric datatypes so data can be merged.
oes_data['FIPS_STATE'] = pd.to_numeric(oes_data['FIPS_STATE'], errors='coerce')
oes_data['OCCUP'] = pd.to_numeric(oes_data['OCCUP'], errors='coerce')

# Left Join - keep ASEC data even if there is no matching OES record.
asec_oes = pd.merge(asec_combined, oes_data, left_on=['GESTFIPS','OCCUP'], right_on=['FIPS_STATE','OCCUP'], how='left')

# Only get desired columns
asec_oes = asec_oes[household_cols + person_cols + occupation_cols]

In [None]:
# Review result of merged data
asec_oes.info()
asec_oes.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 180101 entries, 0 to 180100
Data columns (total 30 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   GTMETSTA       180101 non-null  int64 
 1   GEDIV          180101 non-null  int64 
 2   GESTFIPS       180101 non-null  int64 
 3   HEFAMINC       180101 non-null  int64 
 4   HHINC          180101 non-null  int64 
 5   OCCUP          180101 non-null  int64 
 6   A_DTOCC        180101 non-null  int64 
 7   A_MJOCC        180101 non-null  int64 
 8   AGE1           180101 non-null  int64 
 9   A_AGE          180101 non-null  int64 
 10  A_SEX          180101 non-null  int64 
 11  PRDTRACE       180101 non-null  int64 
 12  PRCITSHP       180101 non-null  int64 
 13  A_HGA          180101 non-null  int64 
 14  A_HRLYWK       180101 non-null  int64 
 15  A_HRSPAY       180101 non-null  int64 
 16  A_GRSWK        180101 non-null  int64 
 17  CLWK           180101 non-null  int64 
 18  EARN

Unnamed: 0,GTMETSTA,GEDIV,GESTFIPS,HEFAMINC,HHINC,OCCUP,A_DTOCC,A_MJOCC,AGE1,A_AGE,A_SEX,PRDTRACE,PRCITSHP,A_HGA,A_HRLYWK,A_HRSPAY,A_GRSWK,CLWK,EARNER,HRCHECK,HRSWK,PEARNVAL,A_CLSWKR,A_DTIND,HOURLY_MEAN,HOURLY_MEDIAN,EMP_PER_1000,LOC_QUOTIENT,ANN_MEAN,ANN_MEDIAN
0,2,1,23,6,8,4050,13,3,4,21,1,1,1,37,0,-1,0,1,1,1,30,18000,1,45,,,,,,
1,2,1,23,10,9,0,0,0,17,85,2,1,1,39,0,-1,0,5,2,0,0,0,0,0,,,,,,
2,2,1,23,5,5,4020,13,3,13,61,2,1,1,39,0,-1,0,1,1,2,44,12000,1,45,,,,,,
3,2,1,23,2,10,0,0,0,16,73,2,1,1,39,0,-1,0,5,2,0,0,0,0,0,,,,,,
4,2,1,23,2,10,4610,15,3,8,37,1,1,1,39,0,-1,0,1,1,1,20,12000,1,43,,,,,,


## Clean Data

In [None]:
# Remove people under 15 years old because they are not relevant for this project.
# 0 = Not in universe
# 1 = 15 years
# 2 = 16 and 17 years
# 3 = 18 and 19 years
# 4 = 20 and 21 years
# 5 = 22 to 24 years
# 6 = 25 to 29 years
# 7 = 30 to 34 years
# 8 = 35 to 39 years
# 9 = 40 to 44 years
# 10 = 45 to 49 years
# 11 = 50 to 54 years
# 12 = 55 to 59 years
# 13 = 60 to 61 years
# 14 = 62 to 64 years
# 15 = 65 to 69 years
# 16 = 70 to 74 years
# 17 = 75 years and over
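# .copy() makes the filtered result an independent dataframe so later column assignments are safe.
asec_oes = asec_oes[asec_oes['AGE1'] > 0].copy()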
asec_oes.info()
asec_oes.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 141251 entries, 0 to 180100
Data columns (total 30 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   GTMETSTA       141251 non-null  int64 
 1   GEDIV          141251 non-null  int64 
 2   GESTFIPS       141251 non-null  int64 
 3   HEFAMINC       141251 non-null  int64 
 4   HHINC          141251 non-null  int64 
 5   OCCUP          141251 non-null  int64 
 6   A_DTOCC        141251 non-null  int64 
 7   A_MJOCC        141251 non-null  int64 
 8   AGE1           141251 non-null  int64 
 9   A_AGE          141251 non-null  int64 
 10  A_SEX          141251 non-null  int64 
 11  PRDTRACE       141251 non-null  int64 
 12  PRCITSHP       141251 non-null  int64 
 13  A_HGA          141251 non-null  int64 
 14  A_HRLYWK       141251 non-null  int64 
 15  A_HRSPAY       141251 non-null  int64 
 16  A_GRSWK        141251 non-null  int64 
 17  CLWK           141251 non-null  int64 
 18  EARN

Unnamed: 0,GTMETSTA,GEDIV,GESTFIPS,HEFAMINC,HHINC,OCCUP,A_DTOCC,A_MJOCC,AGE1,A_AGE,A_SEX,PRDTRACE,PRCITSHP,A_HGA,A_HRLYWK,A_HRSPAY,A_GRSWK,CLWK,EARNER,HRCHECK,HRSWK,PEARNVAL,A_CLSWKR,A_DTIND,HOURLY_MEAN,HOURLY_MEDIAN,EMP_PER_1000,LOC_QUOTIENT,ANN_MEAN,ANN_MEDIAN
0,2,1,23,6,8,4050,13,3,4,21,1,1,1,37,0,-1,0,1,1,1,30,18000,1,45,,,,,,
1,2,1,23,10,9,0,0,0,17,85,2,1,1,39,0,-1,0,5,2,0,0,0,0,0,,,,,,
2,2,1,23,5,5,4020,13,3,13,61,2,1,1,39,0,-1,0,1,1,2,44,12000,1,45,,,,,,
3,2,1,23,2,10,0,0,0,16,73,2,1,1,39,0,-1,0,5,2,0,0,0,0,0,,,,,,
4,2,1,23,2,10,4610,15,3,8,37,1,1,1,39,0,-1,0,1,1,1,20,12000,1,43,,,,,,


In [None]:
asec_oes.tail()

Unnamed: 0,GTMETSTA,GEDIV,GESTFIPS,HEFAMINC,HHINC,OCCUP,A_DTOCC,A_MJOCC,AGE1,A_AGE,A_SEX,PRDTRACE,PRCITSHP,A_HGA,A_HRLYWK,A_HRSPAY,A_GRSWK,CLWK,EARNER,HRCHECK,HRSWK,PEARNVAL,A_CLSWKR,A_DTIND,HOURLY_MEAN,HOURLY_MEDIAN,EMP_PER_1000,LOC_QUOTIENT,ANN_MEAN,ANN_MEDIAN
180095,1,9,15,13,28,4230,14,3,12,57,1,4,4,41,0,0,0,1,1,1,32,28000,1,45,19.04,19.32,17.306,3.03,39600.0,40190.0
180096,1,9,15,13,28,4760,16,4,10,45,2,4,4,40,0,0,0,1,1,2,40,30000,1,22,16.96,14.8,35.706,1.36,35280.0,30780.0
180097,1,9,15,13,28,0,0,0,1,15,1,4,4,34,0,0,0,5,2,0,0,0,0,0,,,,,,
180099,1,9,15,6,21,4720,16,4,12,57,2,4,4,43,0,0,0,1,1,2,40,15000,1,22,,,,,,
180100,1,9,15,6,21,2910,9,2,6,27,1,4,1,43,0,0,0,3,1,2,35,35000,6,44,21.22,18.62,0.684,2.29,44130.0,38730.0


### Column Descriptions

TODO: Create summary document that has chosen columns.

**Demographic**
* A_HGA - Educational Attainment

**Geo**

**Earnings**
* A_HRLYWK - Whether the respondent is paid by the hour
* A_HRSPAY - Hourly wage, if the respondent is paid by the hour
* A_GRSWK - Gross weekly salary
* CLWK - Longest job classification
* EARNER - Earner/non-earner
* HRCHECK - Part time/full time
* HRSWK - Hours the respondent works per week
* PEARNVAL - Total person earnings; can be positive or negative
* A_CLSWKR - Private/public/self-employed
* A_DTIND - Industry code (see Appendix A)

In [None]:
# TODO: need to switch to annual pay since A_HRSPAY is only for people who hold hourly positions.
# Convert hours pay into a float with 2 decimal places
asec_oes['A_HRSPAY'] = asec_oes['A_HRSPAY'].astype('float') / 100
# first pass - 93897/141251 nulls
asec_oes['HOURLY_MEAN'].isnull().sum()

92627

In [None]:
asec_oes.shape

(141251, 30)

In [None]:
# TODO: Handle null or blank data - We have some "-" data in the oes file that indicates value is unavailable.
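
One possible way to handle the "-" placeholders is to convert the OES wage columns with `pandas.to_numeric` and `errors='coerce'`, which turns any non-numeric text into `NaN`. This is a sketch on toy values taken from the sample rows above, not a decision about how the project will treat missing wages.

```python
import pandas as pd

oes_toy = pd.DataFrame({'HOURLY_MEAN': ['22.52', '-', '84.09'],
                        'ANN_MEAN': ['46840', '28840', '174910']})

wage_cols = ['HOURLY_MEAN', 'ANN_MEAN']

# errors='coerce' converts '-' (value unavailable) into NaN instead of raising.
oes_toy[wage_cols] = oes_toy[wage_cols].apply(pd.to_numeric, errors='coerce')

print(oes_toy.dtypes)
print(oes_toy.isnull().sum())
```

Because coercion also hides any other unexpected text, it is worth counting the nulls before and after to confirm that only the known placeholders were converted.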


# Exploratory Data Analysis (EDA) - MS-3 - Oct 15

# Data Processing / Models - MS-4 - Oct 29

# Data Visualization and Results - MS-5 - Nov 19

# Presentation and Conclusions - Final - Dec 3

