<a href="https://colab.research.google.com/github/cfcastillo/DS-6-Notebooks/blob/main/Education_Capstone_MS2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TASK LIST

* Get last 25 states into data folder for oes data - **Amy**
* Find dataset(s) that offer future job popularity/need - **Amy**
  * [top 10 skills by year](https://data.world/datasets/jobs)
* **DONE** - Split SOC into major and minor groups and analyze - **Cecilia**
* Data cleaning - **Cecilia**
* **DONE** - Populate FIPS State numeric code. **Cecilia**
* **DONE** Merge OES data with Census data.

**MS-2 requirements**

Your Jupyter Notebook needs to contain the following: 
* All code needed to clean and format your data
* A written description of the data cleaning process
* Completed rows of the dataset, including predictors and response.



# Project Definition

The purpose of this project is to identify the factors that influence people to choose certain professions or trades. Understanding these factors can help colleges like Central New Mexico Community College (CNM) offer courses that support those professions and better target their marketing to people who are likely to choose them.

This project will be a supervised classification problem using tree-based models to identify the factors that contribute to career choice.



# Data Identification Process

Steps:

1. We stated several questions we wanted answered (target). 
1. After defining our problem, we listed sets of variables that we believed could answer our questions. We then put the variables and targets into a [spreadsheet](https://docs.google.com/spreadsheets/d/1bOhOBHKOae9TDN9n9-xF7ag4QW_Z0c7HXTYLXeMMLHs/edit#gid=0) to define the dataset we would need to run our analysis. 
1. We then researched data sources such as the Bureau of Labor Statistics and the US Census Bureau to locate data that supported our research. 
1. We then mapped the columns in the data sources to the columns in our desired dataset and linked multiple datasets by target code value.

TODO: finish this up.

# Data Collection

The following data sources were used for this project. Data was imported into Google Drive from the links below and modified as needed to support this project.

The primary datasets for this project were initially taken from the Census Bureau's [Annual Social and Economic Supplement (ASEC)](https://www.census.gov/programs-surveys/saipe/guidance/model-input-data/cpsasec.html) of the Current Population Survey (CPS) for 2020. However, because 2020 was anomalous due to COVID-19, we had to go back to pre-COVID data from 2018 and 2019 to get more stable occupation and salary information. Per the above link, the "*ASEC data is the source of timely official national estimates of poverty levels and rates and of widely used measures of income. It provides annual estimates based on a survey of more than 75,000 households. The survey contains detailed questions covering social and economic characteristics of each person who is a household member as of the interview date. Income questions refer to income received during the previous calendar year.*"

[Annual Social and Economic Supplement (ASEC) 2020 Public Use Dictionary](https://www2.census.gov/programs-surveys/cps/datasets/2020/march/ASEC2020ddl_pub_full.pdf)

* Format - PDF
* Contains data dictionary for public use annual survey.

[Current Population Survey (CPS) ASEC Supplement](https://www2.census.gov/programs-surveys/cps/techdocs/cpsmar20.pdf)

* Format - PDF
* Contains appendices with code descriptions for the annual survey.

[Annual Social and Economic Supplement (ASEC) 2020 Data](https://www.census.gov/data/datasets/2020/demo/cps/cps-asec-2020.html)

* Format - CSV/ASCII or SAS
* Contains data for public use annual survey - no replicate weights.

[Occupational Employment and Wage Statistics (OES) Data](https://data.bls.gov/oes/#/geoOcc/Multiple%20occupations%20for%20one%20geographical%20area)

* Format - Excel converted to CSV
* Contains Occupational codes and aggregated statistics on wages for those occupations.

[FIPS State Codes](https://www.census.gov/library/reference/code-lists/ansi/ansi-codes-for-states.html)

* Format - Copied from PDF and converted to CSV
* Contains FIPS State codes mapped to US Postal Service (USPS) State codes.

[Census Occupation Codes](https://www2.census.gov/programs-surveys/cps/techdocs/cpsmar20.pdf)

* Format - Copied from PDF and converted to CSV
* Contains Census Occupation codes mapped to Federal Standard Occupational Classification (SOC) Codes.


# Imports

In [118]:
# grab the imports needed for the project
import pandas as pd
import glob
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import statsmodels.api as sm

# General scikit-learn utilities
from sklearn import datasets
from sklearn import metrics
from sklearn import preprocessing
from sklearn.metrics import classification_report
import sklearn.model_selection as model_selection

# Naive Bayes (Gaussian and Multinomial)
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB

# Regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

# Data Cleaning - MS-2 - Oct 1

Once we identified the data elements needed for our project and the data sources that provided them, we took the following steps to get the data into the format needed for our analysis.

1. Downloaded data from data sources and placed copies in Google Drive.
1. Made changes to raw data where needed to support the project. 
  * Added State code to OES data and removed headers and footers from the data.
  * Created lookup data for State codes and SOC codes so secondary data sources could be merged with primary Census data.
1. Converted codes in secondary datasets into Census codes.
1. Merged all datasets together into a single dataset.
1. Removed data that did not meet the criteria for our analysis.
  * Removed anyone under age 16.
  * Removed rows with a missing occupational code, because this is our target column and must be populated.
  * Imputed null values.
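
The last cleaning step above can be sketched on toy data. This is only an illustration under stated assumptions: the values are made up, `OCCUP == 0` is assumed to mean "no occupation recorded", and median imputation is one possible choice rather than the method this project has settled on.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged dataset (values are made up for illustration).
toy = pd.DataFrame({
    'OCCUP': [440, 0, 5500, 9130],              # assumption: 0 means no occupation recorded
    'HOURLY_MEAN': [24.76, np.nan, np.nan, 16.96],
})

# Drop rows without a target value, because the target column must be populated.
toy_clean = toy[toy['OCCUP'] != 0].copy()

# Impute the remaining nulls with the column median (one option among several).
median_wage = toy_clean['HOURLY_MEAN'].median()
toy_clean['HOURLY_MEAN'] = toy_clean['HOURLY_MEAN'].fillna(median_wage)

print(toy_clean)
```

Taking a `.copy()` after filtering avoids pandas' chained-assignment warning when the imputed column is written back.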


TODO: Finish this up.

## Import Data

In [119]:
# Mount Drive
from google.colab import drive
drive.mount('/drive')

Mounted at /drive


In [120]:
# Import Census data
asec_path = '/drive/MyDrive/Student Folder - Cecilia/Projects/Capstone/Data/ASEC/asecpub20csv/'
asec_data_person = pd.read_csv(asec_path + 'pppub20.csv')
asec_data_household = pd.read_csv(asec_path + 'hhpub20.csv')
# asec_data_family = pd.read_csv(asec_path + 'ffpub20.csv')

# Import and concatenate the per-state occupation (OES) files found in the data folder
oes_path = '/drive/MyDrive/Student Folder - Cecilia/Projects/Capstone/Data/Occupations/'
oes_file_names = glob.glob(oes_path + "*.csv")
li = []
for filename in oes_file_names:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)
oes_data = pd.concat(li, axis=0, ignore_index=True)

# File path for all code conversion files.
codes_path = '/drive/MyDrive/Student Folder - Cecilia/Projects/Capstone/Data/Codes/'

# Import FIPS state codes so we can convert USPS state to FIPS state to match back to Census data.
fips_state_codes = pd.read_csv(codes_path + 'FIPS_STATE_CODES.csv')

# Import Census occupational codes so we can convert SOC codes into Census Occ codes.
# The file is in fixed-width format. The codes are parsed out below.
census_occ_codes = pd.read_fwf(codes_path + 'CENSUS_SOC_OCC_CODES.txt')
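
The glob-and-concatenate pattern used for the OES files can be tried in isolation. The sketch below writes two made-up state files to a temporary folder so it runs anywhere. The file names and wage values are hypothetical, and the empty-folder check is an added safeguard, not something the cell above does.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write two tiny stand-in state files (contents are made up).
    pd.DataFrame({'State': ['AL'], 'Hourly mean wage': [22.52]}).to_csv(
        os.path.join(tmp_dir, 'AL.csv'), index=False)
    pd.DataFrame({'State': ['AK'], 'Hourly mean wage': [29.00]}).to_csv(
        os.path.join(tmp_dir, 'AK.csv'), index=False)

    # Sort the file names so the row order is reproducible between runs.
    file_names = sorted(glob.glob(os.path.join(tmp_dir, '*.csv')))
    if not file_names:
        raise FileNotFoundError(f'No CSV files found in {tmp_dir}')

    # Read every file and stack the rows into one dataframe.
    combined = pd.concat([pd.read_csv(name) for name in file_names], ignore_index=True)

print(combined.shape)
```

Failing early on an empty file list gives a clearer message than the error `pd.concat` raises when it is handed nothing to concatenate.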

In [121]:
# How many columns and rows do we have in each dataset?
print(f'Person data: {asec_data_person.shape}')
print(f'Household data: {asec_data_household.shape}')
print(f'Occupation data: {oes_data.shape}')
print(f'FIPS State Codes: {fips_state_codes.shape}')
print(f'Census Occ Codes: {census_occ_codes.shape}')

Person data: (157959, 840)
Household data: (91500, 134)
Occupation data: (18630, 19)
FIPS State Codes: (51, 3)
Census Occ Codes: (530, 1)


## ASEC Data

### Define ASEC Columns

The following data dictionary provides details for the selected columns.

[Annual Social and Economic Supplement (ASEC) 2020 Public Use Dictionary](https://www2.census.gov/programs-surveys/cps/datasets/2020/march/ASEC2020ddl_pub_full.pdf)

In [122]:
# Get lists of columns for various datasets that will be used for the project
# Note: Columns can be added as needed here and will propagate through the project.
id_col = ['H_IDNUM']
person_cols = ['OCCUP','A_DTOCC','A_MJOCC','AGE1','A_SEX','PRDTRACE','PRCITSHP','A_HGA','A_HRSPAY','A_CLSWKR','A_DTIND']
household_cols = ['GTCO','GTCSA','GTINDVPC','GTMETSTA','GEDIV','GEREG','GESTFIPS','HEFAMINC','HHINC']
occupation_cols = ['HOURLY_MEAN','HOURLY_MEDIAN','EMP_PER_1000','LOC_QUOTIENT']

### Get Household Id

In [123]:
# Extract the Household id number from the person record so we can join the household and person dataframes by this id.
asec_data_person['H_IDNUM'] = asec_data_person['PERIDNUM'].str[:20]
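
As a small illustration of the slice above, the sketch below uses made-up person ids. It assumes each id is a 20-character household id followed by a 2-character person suffix, and that the column was read as text (the `.str` accessor does not work on integers).

```python
import pandas as pd

# Made-up person ids: 20-character household id + 2-character person suffix (assumed layout).
people = pd.DataFrame({'PERIDNUM': ['8329611509015080901101',
                                    '8329611509015080901102',
                                    '4238996011902050901101']})

# The first 20 characters identify the household.
people['H_IDNUM'] = people['PERIDNUM'].str[:20]

# Two people share the first household, so only two distinct households remain.
n_households = people['H_IDNUM'].nunique()
print(n_households)
```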

In [124]:
# View Person Data
asec_data_person[person_cols].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157959 entries, 0 to 157958
Data columns (total 11 columns):
 #   Column    Non-Null Count   Dtype
---  ------    --------------   -----
 0   OCCUP     157959 non-null  int64
 1   A_DTOCC   157959 non-null  int64
 2   A_MJOCC   157959 non-null  int64
 3   AGE1      157959 non-null  int64
 4   A_SEX     157959 non-null  int64
 5   PRDTRACE  157959 non-null  int64
 6   PRCITSHP  157959 non-null  int64
 7   A_HGA     157959 non-null  int64
 8   A_HRSPAY  157959 non-null  int64
 9   A_CLSWKR  157959 non-null  int64
 10  A_DTIND   157959 non-null  int64
dtypes: int64(11)
memory usage: 13.3 MB


In [125]:
# Look at first 5 records of selected columns of person data.
asec_data_person[person_cols].head()

Unnamed: 0,OCCUP,A_DTOCC,A_MJOCC,AGE1,A_SEX,PRDTRACE,PRCITSHP,A_HGA,A_HRSPAY,A_CLSWKR,A_DTIND
0,440,1,1,14,2,1,1,39,-1,3,51
1,8620,0,0,15,1,1,1,39,-1,0,0
2,9121,22,10,14,1,1,1,39,-1,4,40
3,0,0,0,16,2,1,1,36,-1,0,0
4,5500,17,5,11,2,1,1,39,-1,1,23


In [126]:
# View Household Data
asec_data_household[household_cols].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 91500 entries, 0 to 91499
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   GTCO      91500 non-null  int64
 1   GTCSA     91500 non-null  int64
 2   GTINDVPC  91500 non-null  int64
 3   GTMETSTA  91500 non-null  int64
 4   GEDIV     91500 non-null  int64
 5   GEREG     91500 non-null  int64
 6   GESTFIPS  91500 non-null  int64
 7   HEFAMINC  91500 non-null  int64
 8   HHINC     91500 non-null  int64
dtypes: int64(9)
memory usage: 6.3 MB


In [127]:
# Look at first 5 records of household data
asec_data_household[household_cols].head()

Unnamed: 0,GTCO,GTCSA,GTINDVPC,GTMETSTA,GEDIV,GEREG,GESTFIPS,HEFAMINC,HHINC
0,0,0,0,2,1,1,23,15,41
1,0,0,0,2,1,1,23,7,26
2,0,0,0,2,1,1,23,14,17
3,0,0,0,2,1,1,23,7,4
4,0,0,0,2,1,1,23,15,24


### Merge Person and Household Records

In [128]:
# Join Household and Personal records into single dataframe
# Inner join - should not have person without household.
asec_combined = pd.merge(asec_data_household[id_col + household_cols], asec_data_person[id_col + person_cols], on=id_col)
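
One way to check the "no person without household" assumption is to let pandas validate the join. The toy frames below are made up. The `validate` argument is a standard `pd.merge` option, but using it here is a suggestion, not part of the original cell.

```python
import pandas as pd

households = pd.DataFrame({'H_IDNUM': ['A', 'B'], 'GESTFIPS': [23, 33]})
persons = pd.DataFrame({'H_IDNUM': ['A', 'A', 'B'], 'OCCUP': [440, 8620, 9121]})

# validate='one_to_many' raises MergeError if a household id repeats on the left side.
combined = pd.merge(households, persons, on='H_IDNUM', how='inner',
                    validate='one_to_many')

# An inner join that keeps every person row confirms each person matched a household.
assert len(combined) == len(persons)
print(combined)
```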

In [129]:
# View combined result
asec_combined.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 157959 entries, 0 to 157958
Data columns (total 21 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   H_IDNUM   157959 non-null  object
 1   GTCO      157959 non-null  int64 
 2   GTCSA     157959 non-null  int64 
 3   GTINDVPC  157959 non-null  int64 
 4   GTMETSTA  157959 non-null  int64 
 5   GEDIV     157959 non-null  int64 
 6   GEREG     157959 non-null  int64 
 7   GESTFIPS  157959 non-null  int64 
 8   HEFAMINC  157959 non-null  int64 
 9   HHINC     157959 non-null  int64 
 10  OCCUP     157959 non-null  int64 
 11  A_DTOCC   157959 non-null  int64 
 12  A_MJOCC   157959 non-null  int64 
 13  AGE1      157959 non-null  int64 
 14  A_SEX     157959 non-null  int64 
 15  PRDTRACE  157959 non-null  int64 
 16  PRCITSHP  157959 non-null  int64 
 17  A_HGA     157959 non-null  int64 
 18  A_HRSPAY  157959 non-null  int64 
 19  A_CLSWKR  157959 non-null  int64 
 20  A_DTIND   157959 non-null 

In [130]:
asec_combined.head()

Unnamed: 0,H_IDNUM,GTCO,GTCSA,GTINDVPC,GTMETSTA,GEDIV,GEREG,GESTFIPS,HEFAMINC,HHINC,OCCUP,A_DTOCC,A_MJOCC,AGE1,A_SEX,PRDTRACE,PRCITSHP,A_HGA,A_HRSPAY,A_CLSWKR,A_DTIND
0,83296115090150809011,0,0,0,2,1,1,23,15,41,440,1,1,14,2,1,1,39,-1,3,51
1,83296115090150809011,0,0,0,2,1,1,23,15,41,8620,0,0,15,1,1,1,39,-1,0,0
2,42389960119020509011,0,0,0,2,1,1,23,7,26,9121,22,10,14,1,1,1,39,-1,4,40
3,42389960119020509011,0,0,0,2,1,1,23,7,26,0,0,0,16,2,1,1,36,-1,0,0
4,20595061200937509011,0,0,0,2,1,1,23,14,17,5500,17,5,11,2,1,1,39,-1,1,23


## OES Data

In [131]:
# Shorten column names
oes_data.rename(columns={'State':'USPS_STATE',
                         'Occupation (SOC code)':'SOC_DESC',
                         'Employment(1)':'EMP',
                         'Employment percent relative standard error(3)':'EMP_RSDE',
                         'Hourly mean wage':'HOURLY_MEAN',
                         'Annual mean wage(2)':'ANN_MEAN',
                         'Wage percent relative standard error(3)':'WAGE_RSDE',
                         'Hourly 10th percentile wage':'HOURLY_10TH',
                         'Hourly 25th percentile wage':'HOURLY_25TH',
                         'Hourly median wage':'HOURLY_MEDIAN',
                         'Hourly 75th percentile wage':'HOURLY_75TH',
                         'Hourly 90th percentile wage':'HOURLY_90TH',
                         'Annual 10th percentile wage(2)':'ANN_10TH',
                         'Annual 25th percentile wage(2)':'ANN_25TH',
                         'Annual median wage(2)':'ANN_MEDIAN',
                         'Annual 75th percentile wage(2)':'ANN_75TH',
                         'Annual 90th percentile wage(2)':'ANN_90TH',
                         'Employment per 1,000 jobs':'EMP_PER_1000',
                         'Location Quotient':'LOC_QUOTIENT'}, inplace=True)

### OES Column Footnotes

* (1) Estimates for detailed occupations do not sum to the totals because the totals include occupations not shown separately. Estimates do not include self-employed workers.
* (2) Annual wages have been calculated by multiplying the corresponding hourly wage by 2,080 hours.
* (3) The relative standard error (RSE) is a measure of the reliability of a survey statistic. The smaller the relative standard error, the more precise the estimate.
* (4) Wages for some occupations that do not generally work year-round, full time, are reported either as hourly wages or annual salaries depending on how they are typically paid.
* (5) This wage is equal to or greater than \$100.00 per hour or \$208,000 per year.
* (8) Estimate not released.
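
Footnote (2) can be checked with a quick calculation. The figures below are the Alabama "All Occupations" hourly and annual mean wages from the OES data. The small gap between the product and the published figure is presumably due to rounding in the published values, which is an assumption.

```python
HOURS_PER_YEAR = 2080  # 40 hours per week * 52 weeks, per footnote (2)

hourly_mean = 22.52            # Alabama, All Occupations (from the OES data)
published_annual_mean = 46840  # same row, annual mean wage

# Annual wage implied by the hourly figure.
annual_estimate = hourly_mean * HOURS_PER_YEAR
difference = abs(annual_estimate - published_annual_mean)

print(f'Estimated: {annual_estimate:.2f}, published: {published_annual_mean}, gap: {difference:.2f}')
```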

### Get Census State Codes

In [132]:
# Import FIPS state codes matching on USPS state codes
# Left Join - keep OES data even if no match on state code.
oes_data = pd.merge(oes_data, fips_state_codes[['USPS_STATE','FIPS_STATE']], on='USPS_STATE', how='left')

In [133]:
# Verify merge was successful - that we have expected columns and record count is unchanged.
oes_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18630 entries, 0 to 18629
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   USPS_STATE     18630 non-null  object
 1   SOC_DESC       18630 non-null  object
 2   EMP            18630 non-null  object
 3   EMP_RSDE       18630 non-null  object
 4   HOURLY_MEAN    18630 non-null  object
 5   ANN_MEAN       18630 non-null  object
 6   WAGE_RSDE      18630 non-null  object
 7   HOURLY_10TH    18630 non-null  object
 8   HOURLY_25TH    18630 non-null  object
 9   HOURLY_MEDIAN  18630 non-null  object
 10  HOURLY_75TH    18630 non-null  object
 11  HOURLY_90TH    18630 non-null  object
 12  ANN_10TH       18630 non-null  object
 13  ANN_25TH       18630 non-null  object
 14  ANN_MEDIAN     18630 non-null  object
 15  ANN_75TH       18630 non-null  object
 16  ANN_90TH       18630 non-null  object
 17  EMP_PER_1000   18630 non-null  object
 18  LOC_QUOTIENT   18630 non-n

In [134]:
# Verify we have all states
oes_data[['USPS_STATE', 'FIPS_STATE']].value_counts()

USPS_STATE  FIPS_STATE
CA          6             797
FL          12            790
IL          17            775
MI          26            753
GA          13            749
IN          18            747
MN          27            742
CO          8             742
MD          24            739
MA          25            735
MO          29            733
AL          1             731
LA          22            730
AZ          4             729
IA          19            711
KY          21            709
MS          28            688
KS          20            685
CT          9             682
AR          5             666
ID          16            628
MT          30            627
ME          23            626
HI          15            567
DE          10            523
AK          2             519
DC          11            507
dtype: int64
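
The value counts above list 27 state codes, and the task list notes that more state files still need to be added. A set difference against the FIPS lookup makes the missing states explicit. This sketch uses toy frames. In the notebook, the real inputs would be `fips_state_codes` and `oes_data`.

```python
import pandas as pd

# Toy stand-ins for fips_state_codes and oes_data.
fips_toy = pd.DataFrame({'USPS_STATE': ['AL', 'AK', 'AZ', 'NM'],
                         'FIPS_STATE': [1, 2, 4, 35]})
oes_toy = pd.DataFrame({'USPS_STATE': ['AL', 'AL', 'AK', 'AZ']})

# States in the lookup that never appear in the OES data.
missing_states = sorted(set(fips_toy['USPS_STATE']) - set(oes_toy['USPS_STATE']))
print(missing_states)  # ['NM']
```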

In [135]:
# Verify no nulls after merge.
oes_data.isnull().sum()

USPS_STATE       0
SOC_DESC         0
EMP              0
EMP_RSDE         0
HOURLY_MEAN      0
ANN_MEAN         0
WAGE_RSDE        0
HOURLY_10TH      0
HOURLY_25TH      0
HOURLY_MEDIAN    0
HOURLY_75TH      0
HOURLY_90TH      0
ANN_10TH         0
ANN_25TH         0
ANN_MEDIAN       0
ANN_75TH         0
ANN_90TH         0
EMP_PER_1000     0
LOC_QUOTIENT     0
FIPS_STATE       0
dtype: int64

### Parse SOC Codes

In [136]:
# Parse out SOC code from the description. The code is inside parentheses.
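def getSocCode(value):
  # If not able to parse the code, then return the value from the file.
  try:
    return value[value.index('(')+1:value.index(')')]
  except (ValueError, AttributeError):
    # ValueError: no parentheses found. AttributeError: value is not a string (e.g. NaN).
    return value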

oes_data['SOC_CODE'] = oes_data['SOC_DESC'].apply(lambda val: getSocCode(val))
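
An alternative to slicing by parenthesis position is a vectorized regular expression. This is a sketch of a different technique, not the notebook's method. It assumes the code is always six characters inside the final pair of parentheses.

```python
import pandas as pd

descs = pd.Series(['Chief Executives(111011)',
                   'Pump Operators, Except Wellhead Pumpers(537072)',
                   'No code here'])

# Capture six characters inside the trailing parentheses. Unmatched rows become NaN.
codes = descs.str.extract(r'\(([0-9A-Z]{6})\)\s*$', expand=False)

# Mirror getSocCode's fallback: keep the original text when nothing could be parsed.
codes = codes.fillna(descs)
print(codes.tolist())
```

Anchoring the pattern to the end of the string means a description that itself contains parentheses would still yield the trailing code.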

In [137]:
# Verify codes were properly parsed
oes_data[['SOC_DESC','SOC_CODE']]

Unnamed: 0,SOC_DESC,SOC_CODE
0,All Occupations(000000),000000
1,Management Occupations(110000),110000
2,Chief Executives(111011),111011
3,General and Operations Managers(111021),111021
4,Legislators(111031),111031
...,...,...
18625,Stockers and Order Fillers(537065),537065
18626,"Pump Operators, Except Wellhead Pumpers(537072)",537072
18627,Wellhead Pumpers(537073),537073
18628,Refuse and Recyclable Material Collectors(537081),537081


### Get Census Occupation Codes

In [138]:
# Prepare Census/SOC map file. Codes are embedded in a single column so need to be parsed out.
# Parse out Census and SOC occupational codes from the description. The code has a dash in it. So locate by dash
# assuming there are no dashes in the description.
def getOccCodeSoc(value):
  # If not able to parse the code, then return the value from the file.
  try:
    dash = value.index('-')
    return value[dash-2:dash+5].replace('-','')
  except (ValueError, AttributeError):
    return value

# Retrieve the first 4 characters of each line. This is the Census code.
def getOccCodeCensus(value):
  return value[:4]

census_occ_codes['OCCUP'] = census_occ_codes['CENSUS_MAP'].apply(lambda val: getOccCodeCensus(val))
census_occ_codes['SOC_CODE'] = census_occ_codes['CENSUS_MAP'].apply(lambda val: getOccCodeSoc(val))
census_occ_codes

Unnamed: 0,CENSUS_MAP,OCCUP,SOC_CODE
0,0010 Chief executives 11-1011,0010,111011
1,0020 General and operations managers 11-1021,0020,111021
2,0040 Advertising and promotions managers 11-2011,0040,112011
3,0051 Marketing Managers 11-2021,0051,112021
4,0052 Sales managers 11-2022,0052,112022
...,...,...,...
525,9645 Stockers and order fillers 53-7065,9645,537065
526,9650 Pumping station operators 53-7070,9650,537070
527,9720 Refuse and recyclable material collectors...,9720,537081
528,9760 Other material moving workers 53-71XX,9760,5371XX
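
The parsing above relies on the description containing no dashes. A regular expression anchored to both ends of the line removes that assumption. The sketch below is an alternative technique, not the notebook's method. The last row is a hypothetical entry added only to show a hyphenated title.

```python
import pandas as pd

census_map = pd.Series(['0010 Chief executives 11-1011',
                        '9760 Other material moving workers 53-71XX',
                        '0000 Made-up, hyphen-ated title 99-9999'])  # hypothetical row

# Leading 4 digits = Census code. Trailing NN-NNNN (digits or X) = SOC code.
pattern = r'^(?P<OCCUP>\d{4})\s+.*?(?P<SOC_CODE>\d{2}-[0-9X]{4})\s*$'
parsed = census_map.str.extract(pattern)

# Drop the dash so the SOC code matches the format used in the OES data.
parsed['SOC_CODE'] = parsed['SOC_CODE'].str.replace('-', '', regex=False)
print(parsed)
```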


In [139]:
# Convert SOC Code into Census occupation code
# Left Join - keep OES data even if no match on SOC code.
oes_data = pd.merge(oes_data, census_occ_codes[['OCCUP','SOC_CODE']], on='SOC_CODE', how='left')

In [140]:
oes_data.info()
oes_data.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18656 entries, 0 to 18655
Data columns (total 22 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   USPS_STATE     18656 non-null  object
 1   SOC_DESC       18656 non-null  object
 2   EMP            18656 non-null  object
 3   EMP_RSDE       18656 non-null  object
 4   HOURLY_MEAN    18656 non-null  object
 5   ANN_MEAN       18656 non-null  object
 6   WAGE_RSDE      18656 non-null  object
 7   HOURLY_10TH    18656 non-null  object
 8   HOURLY_25TH    18656 non-null  object
 9   HOURLY_MEDIAN  18656 non-null  object
 10  HOURLY_75TH    18656 non-null  object
 11  HOURLY_90TH    18656 non-null  object
 12  ANN_10TH       18656 non-null  object
 13  ANN_25TH       18656 non-null  object
 14  ANN_MEDIAN     18656 non-null  object
 15  ANN_75TH       18656 non-null  object
 16  ANN_90TH       18656 non-null  object
 17  EMP_PER_1000   18656 non-null  object
 18  LOC_QUOTIENT   18656 non-n

Unnamed: 0,USPS_STATE,SOC_DESC,EMP,EMP_RSDE,HOURLY_MEAN,ANN_MEAN,WAGE_RSDE,HOURLY_10TH,HOURLY_25TH,HOURLY_MEDIAN,HOURLY_75TH,HOURLY_90TH,ANN_10TH,ANN_25TH,ANN_MEDIAN,ANN_75TH,ANN_90TH,EMP_PER_1000,LOC_QUOTIENT,FIPS_STATE,SOC_CODE,OCCUP
0,AL,All Occupations(000000),1903210,0.5,22.52,46840,0.6,8.98,11.57,17.43,27.39,41.07,18690,24060,36250,56980,85430,1000.0,1.0,1,0,
1,AL,Management Occupations(110000),87110,1.2,52.90,110040,0.7,22.95,32.37,45.73,64.58,91.89,47740,67330,95120,134320,191130,45.772,0.8,1,110000,
2,AL,Chief Executives(111011),1160,8.9,84.09,174910,3.8,23.79,47.08,77.55,-,-,49480,97930,161290,-,-,0.608,0.42,1,111011,10.0
3,AL,General and Operations Managers(111021),31170,2.0,58.56,121800,1.0,23.09,32.57,48.64,73.58,-,48030,67740,101170,153050,-,16.377,0.97,1,111021,20.0
4,AL,Legislators(111031),1150,6.7,-,28840,5.1,-,-,-,-,-,16220,17190,18820,27920,55970,0.604,1.64,1,111031,
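
The row count above is 18656, up from 18630 before the merge. One possible explanation is that some SOC codes appear more than once in the Census lookup, so a left join repeats the matching OES rows. The toy example below shows the effect and a way to list the repeated keys. The second Census code for `111021` is hypothetical.

```python
import pandas as pd

oes_toy = pd.DataFrame({'SOC_CODE': ['111011', '111021'],
                        'HOURLY_MEAN': [84.09, 58.56]})

# Hypothetical lookup in which one SOC code maps to two Census codes.
lookup_toy = pd.DataFrame({'OCCUP': ['0010', '0020', '0021'],
                           'SOC_CODE': ['111011', '111021', '111021']})

# List lookup rows whose SOC code is not unique.
dupes = lookup_toy[lookup_toy['SOC_CODE'].duplicated(keep=False)]

# The left join repeats the OES row once per matching lookup row.
merged = pd.merge(oes_toy, lookup_toy, on='SOC_CODE', how='left')
print(len(oes_toy), len(merged))  # 2 3
```

Running the same `duplicated(keep=False)` check on `census_occ_codes['SOC_CODE']` would show whether this explains the extra rows.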


## Combine All Data

In [141]:
# Bring in Occupational data joining on FIPS state and Census occupation code.
# Convert to numeric datatypes so data can be merged.
oes_data['FIPS_STATE'] = pd.to_numeric(oes_data['FIPS_STATE'], errors='coerce')
oes_data['OCCUP'] = pd.to_numeric(oes_data['OCCUP'], errors='coerce')

# Left Join - keep ASEC data even if there is no matching OES record.
asec_oes = pd.merge(asec_combined, oes_data, left_on=['GESTFIPS','OCCUP'], right_on=['FIPS_STATE','OCCUP'], how='left')

# Only get desired columns
asec_oes = asec_oes[household_cols + person_cols + occupation_cols]
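
The numeric conversion matters because the parsed Census codes are text with leading zeros (for example `'0010'`), while the ASEC `OCCUP` column is an integer. A minimal sketch with made-up frames follows. The `indicator` flag is an optional `pd.merge` argument, added here only to show which rows found a match.

```python
import pandas as pd

asec_toy = pd.DataFrame({'GESTFIPS': [1, 1], 'OCCUP': [10, 440]})
oes_toy = pd.DataFrame({'FIPS_STATE': ['1', '1'],
                        'OCCUP': ['0010', '0020'],
                        'HOURLY_MEAN': [84.09, 58.56]})

# Text keys become numbers: '0010' -> 10. Unparseable values would become NaN.
oes_toy['FIPS_STATE'] = pd.to_numeric(oes_toy['FIPS_STATE'], errors='coerce')
oes_toy['OCCUP'] = pd.to_numeric(oes_toy['OCCUP'], errors='coerce')

merged = pd.merge(asec_toy, oes_toy,
                  left_on=['GESTFIPS', 'OCCUP'], right_on=['FIPS_STATE', 'OCCUP'],
                  how='left', indicator=True)

# 'both' = wage data found, 'left_only' = no OES match for that person.
print(merged[['OCCUP', 'HOURLY_MEAN', '_merge']])
```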

In [142]:
# Review result of merged data
asec_oes.info()
asec_oes.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 157959 entries, 0 to 157958
Data columns (total 24 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   GTCO           157959 non-null  int64 
 1   GTCSA          157959 non-null  int64 
 2   GTINDVPC       157959 non-null  int64 
 3   GTMETSTA       157959 non-null  int64 
 4   GEDIV          157959 non-null  int64 
 5   GEREG          157959 non-null  int64 
 6   GESTFIPS       157959 non-null  int64 
 7   HEFAMINC       157959 non-null  int64 
 8   HHINC          157959 non-null  int64 
 9   OCCUP          157959 non-null  int64 
 10  A_DTOCC        157959 non-null  int64 
 11  A_MJOCC        157959 non-null  int64 
 12  AGE1           157959 non-null  int64 
 13  A_SEX          157959 non-null  int64 
 14  PRDTRACE       157959 non-null  int64 
 15  PRCITSHP       157959 non-null  int64 
 16  A_HGA          157959 non-null  int64 
 17  A_HRSPAY       157959 non-null  int64 
 18  A_CL

Unnamed: 0,GTCO,GTCSA,GTINDVPC,GTMETSTA,GEDIV,GEREG,GESTFIPS,HEFAMINC,HHINC,OCCUP,A_DTOCC,A_MJOCC,AGE1,A_SEX,PRDTRACE,PRCITSHP,A_HGA,A_HRSPAY,A_CLSWKR,A_DTIND,HOURLY_MEAN,HOURLY_MEDIAN,EMP_PER_1000,LOC_QUOTIENT
0,0,0,0,2,1,1,23,15,41,440,1,1,14,2,1,1,39,-1,3,51,,,,
1,0,0,0,2,1,1,23,15,41,8620,0,0,15,1,1,1,39,-1,0,0,24.76,24.11,0.549,2.59
2,0,0,0,2,1,1,23,7,26,9121,22,10,14,1,1,1,39,-1,4,40,,,,
3,0,0,0,2,1,1,23,7,26,0,0,0,16,2,1,1,36,-1,0,0,,,,
4,0,0,0,2,1,1,23,14,17,5500,17,5,11,2,1,1,39,-1,1,23,26.16,24.62,0.258,0.37


## Clean Data

In [143]:
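# Remove people under 16 years old because they are not relevant for this project.
# ??? Record count dropped from 157959 to 16855 ???
# NOTE: AGE1 may be a recoded age-group variable rather than age in years, which would explain the drop.
# TODO: Check AGE1 in the ASEC data dictionary and filter on an age-in-years column (possibly A_AGE) instead.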
asec_oes = asec_oes[asec_oes['AGE1'] >= 16]
asec_oes.info()
asec_oes.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16855 entries, 3 to 157956
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   GTCO           16855 non-null  int64 
 1   GTCSA          16855 non-null  int64 
 2   GTINDVPC       16855 non-null  int64 
 3   GTMETSTA       16855 non-null  int64 
 4   GEDIV          16855 non-null  int64 
 5   GEREG          16855 non-null  int64 
 6   GESTFIPS       16855 non-null  int64 
 7   HEFAMINC       16855 non-null  int64 
 8   HHINC          16855 non-null  int64 
 9   OCCUP          16855 non-null  int64 
 10  A_DTOCC        16855 non-null  int64 
 11  A_MJOCC        16855 non-null  int64 
 12  AGE1           16855 non-null  int64 
 13  A_SEX          16855 non-null  int64 
 14  PRDTRACE       16855 non-null  int64 
 15  PRCITSHP       16855 non-null  int64 
 16  A_HGA          16855 non-null  int64 
 17  A_HRSPAY       16855 non-null  int64 
 18  A_CLSWKR       16855 non-

Unnamed: 0,GTCO,GTCSA,GTINDVPC,GTMETSTA,GEDIV,GEREG,GESTFIPS,HEFAMINC,HHINC,OCCUP,A_DTOCC,A_MJOCC,AGE1,A_SEX,PRDTRACE,PRCITSHP,A_HGA,A_HRSPAY,A_CLSWKR,A_DTIND,HOURLY_MEAN,HOURLY_MEDIAN,EMP_PER_1000,LOC_QUOTIENT
3,0,0,0,2,1,1,23,7,26,0,0,0,16,2,1,1,36,-1,0,0,,,,
5,0,0,0,2,1,1,23,7,4,0,0,0,16,2,1,1,39,-1,0,0,,,,
11,0,0,0,2,1,1,23,6,8,0,0,0,16,1,1,1,39,-1,0,0,,,,
13,0,0,0,2,1,1,23,8,8,0,0,0,17,2,1,1,40,-1,0,0,,,,
20,0,0,0,2,1,1,23,6,9,0,0,0,17,2,1,1,39,-1,0,0,,,,


In [144]:
# Convert hourly pay to dollars. The raw A_HRSPAY value carries two implied decimal places.
# ??? WHY DO WE ONLY HAVE SALARY FOR 194 PEOPLE ???
# TODO: Get the 2018 and 2019 reports because 2020 may have been impacted by Covid.
asec_oes['A_HRSPAY'] = asec_oes['A_HRSPAY'].astype('float') / 100
asec_oes[asec_oes['A_HRSPAY'] > 0]

Unnamed: 0,GTCO,GTCSA,GTINDVPC,GTMETSTA,GEDIV,GEREG,GESTFIPS,HEFAMINC,HHINC,OCCUP,A_DTOCC,A_MJOCC,AGE1,A_SEX,PRDTRACE,PRCITSHP,A_HGA,A_HRSPAY,A_CLSWKR,A_DTIND,HOURLY_MEAN,HOURLY_MEDIAN,EMP_PER_1000,LOC_QUOTIENT
98,0,0,0,2,1,1,23,10,20,8030,21,9,17,1,1,1,39,21.00,1,6,,,,
168,0,0,0,2,1,1,23,8,11,9130,22,10,16,2,1,1,41,12.00,1,22,,,,
458,19,0,0,1,1,1,23,11,27,9130,22,10,17,1,1,1,43,12.00,1,22,,,,
817,5,438,0,1,1,1,23,15,41,4220,14,3,17,1,1,1,39,14.13,1,43,,,,
2564,11,148,0,1,1,1,33,9,33,640,2,1,16,2,1,4,43,38.70,1,40,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
154362,0,0,0,1,9,4,2,11,22,3647,11,3,16,2,1,1,39,10.50,1,41,,,,
155638,0,0,0,2,9,4,15,11,34,4760,16,4,16,2,3,1,39,13.00,1,22,16.96,14.80,35.706,1.36
156152,3,0,0,1,9,4,15,15,41,3870,12,3,17,1,4,1,43,12.50,2,51,,,,
156318,3,0,0,1,9,4,15,8,34,4760,16,4,16,2,4,1,43,12.00,1,22,16.96,14.80,35.706,1.36
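
Dividing the whole column by 100 also turns the `-1` placeholder into `-0.01`. A possible refinement is to mask non-positive codes as missing before scaling. The sketch below assumes `-1` means "no hourly pay reported", which should be confirmed against the data dictionary.

```python
import pandas as pd

raw = pd.Series([-1, 2100, 1413], name='A_HRSPAY')

# Treat non-positive codes as missing (assumption), then apply the two implied decimals.
hourly = raw.where(raw > 0).astype('float') / 100

print(hourly.tolist())
```

With the placeholder stored as NaN, summary statistics such as the mean are computed only over people who actually reported an hourly rate.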


In [145]:
# TODO: Handle null or blank data - We have some "-" data in the oes file that indicates value is unavailable.
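
One possible way to address this TODO is to coerce the OES wage columns to numbers so that `"-"` becomes NaN. The sketch below uses made-up rows and only two columns. It does not handle other markers that may exist in the real files, such as thousands separators or footnote symbols.

```python
import pandas as pd

oes_toy = pd.DataFrame({'HOURLY_MEAN': ['84.09', '-', '58.56'],
                        'HOURLY_90TH': ['-', '-', '-']})

wage_cols = ['HOURLY_MEAN', 'HOURLY_90TH']

# errors='coerce' converts anything non-numeric, including '-', to NaN.
oes_toy[wage_cols] = oes_toy[wage_cols].apply(pd.to_numeric, errors='coerce')

print(oes_toy.dtypes)
print(oes_toy.isna().sum())
```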


# Exploratory Data Analysis (EDA) - MS-3 - Oct 15

# Data Processing / Models - MS-4 - Oct 29

# Data Visualization and Results - MS-5 - Nov 19

# Presentation and Conclusions - Final - Dec 3

