<a href="https://colab.research.google.com/github/cfcastillo/DS-6-Notebooks/blob/main/1_Education_Capstone_Data_Collection_and_Cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project Definition

The purpose of this project is to identify what factors influence people to choose certain professions or trades. Understanding these factors can help colleges like Central New Mexico Community College (CNM) offer courses that support those professions and better target their marketing to people who are likely to choose them.

This project will be a supervised classification problem using tree-based models to identify the factors that contribute to career choice.



# Data Identification Process

Steps:

1. We stated several questions we wanted answered (target). 
1. After defining our problem, we listed sets of variables that we believed could answer our questions. We then put the variables and targets into a [spreadsheet](https://docs.google.com/spreadsheets/d/1bOhOBHKOae9TDN9n9-xF7ag4QW_Z0c7HXTYLXeMMLHs/edit#gid=0) to define the dataset we would need to run our analysis. 
1. We then researched data sources such as the Bureau of Labor Statistics and the US Census Bureau to locate data that supported our research.
1. We then mapped the columns in the data sources to the columns in our desired dataset and linked multiple datasets by target code value.

*Note: The data identification process is still a work in progress. As we proceed with EDA, we expect to find that some columns are not needed and that others must be added. While cleaning the data, we discovered that earnings are complex and are often made up of multiple jobs. Additional analysis will be needed to solidify our predictor when applying the model.*

# Data Collection

The following data sources were used for this project. Data was imported into Google Drive from the links below and modified as needed to support this project.

The primary datasets for this project were initially taken from the Census Bureau's [Annual Social and Economic Supplement (ASEC)](https://www.census.gov/programs-surveys/saipe/guidance/model-input-data/cpsasec.html) of the Current Population Survey (CPS) for 2020. However, because 2020 was anomalous due to COVID-19, we chose to go back to the pre-COVID 2019 data to get more stable occupation and salary information. Per the above link, the "*ASEC data is the source of timely official national estimates of poverty levels and rates and of widely used measures of income. It provides annual estimates based on a survey of more than 75,000 households. The survey contains detailed questions covering social and economic characteristics of each person who is a household member as of the interview date. Income questions refer to income received during the previous calendar year.*"

[Annual Social and Economic Survey (ASEC) All Years Data](https://www.census.gov/data/datasets/time-series/demo/cps/cps-asec.html)

* Contains links to all years from 2010 to 2021. CSV format is available from 2019 to 2021. Before 2019, a fixed-format file is provided, so columns need to be parsed using the available data dictionary.
* [2021 Survey - csv](https://www.census.gov/data/datasets/time-series/demo/cps/cps-asec.2021.html)
* [2020 Survey - csv](https://www.census.gov/data/datasets/time-series/demo/cps/cps-asec.2020.html)
* [2019 Survey - csv](https://www.census.gov/data/datasets/time-series/demo/cps/cps-asec.2019.html)
* [2018 Survey - dat](https://www.census.gov/data/datasets/time-series/demo/cps/cps-asec.2018.html) - Need to convert to csv
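
For the fixed-format years, pandas' `read_fwf` can slice each record by character position. The sketch below is only illustrative: the column offsets and the two sample records are made up, and the real positions would have to come from the ASEC data dictionary for the year being parsed.

```python
import io
import pandas as pd

# Hypothetical layout: (start, end) character offsets, end exclusive.
# Real offsets must be taken from the ASEC data dictionary for that year.
layout = {
    "H_SEQ": (0, 5),
    "AGE1": (5, 7),
    "A_SEX": (7, 8),
}

# Two made-up fixed-width records standing in for a .dat file.
sample = io.StringIO("00001122\n00002151\n")

fwf_demo = pd.read_fwf(
    sample,
    colspecs=list(layout.values()),
    names=list(layout.keys()),
    header=None,
)
print(fwf_demo)
```

Keeping the layout in a single dictionary makes it easier to adjust when a column moves between survey years.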

[Quarterly Census of Employment and Wages](https://www.bls.gov/cew/about-data/)

* Source data for OES Statistics. Can be used if detailed data is needed.

[Occupational Employment Wage Statistics (OES) Data](https://data.bls.gov/oes/#/geoOcc/Multiple%20occupations%20for%20one%20geographical%20area)

* Format - Excel converted to CSV
* Contains Occupational codes and aggregated statistics on wages for those occupations.

[FIPS State Codes](https://www.census.gov/library/reference/code-lists/ansi/ansi-codes-for-states.html)

* Format - Copied from PDF and converted to CSV
* Contains FIPS State codes mapped to US Postal Service (USPS) State codes.
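
As a rough illustration of how such a lookup can be attached to the Census data, the sketch below joins a three-row FIPS table to made-up household rows. The `USPS` column name and the sample rows are assumptions for the example; the project's actual lookup file may be laid out differently.

```python
import pandas as pd

# Small illustrative lookup; the project built the full table from the Census PDF.
fips_lookup = pd.DataFrame({
    "GESTFIPS": [23, 35, 48],
    "USPS": ["ME", "NM", "TX"],  # hypothetical column name
})

# Made-up household rows that carry only the numeric FIPS code.
households_demo = pd.DataFrame({"H_SEQ": [1, 2, 3], "GESTFIPS": [23, 35, 99]})

# A left join keeps every household; unknown codes show up as NaN in USPS.
with_state = households_demo.merge(fips_lookup, on="GESTFIPS", how="left")
print(with_state)
```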

[Census Occupation Codes](https://www2.census.gov/programs-surveys/cps/techdocs/cpsmar20.pdf)

* Format - Copied from PDF and converted to CSV
* Contains Census Occupation codes mapped to Federal Standard Occupational Classification (SOC) Codes.

[Bureau of Labor Statistics SOC Codes](https://www.bls.gov/oes/current/oes_stru.htm#15-0000)

* Format - list on website

[Potential CNM Enrollment Data](https://www.cnm.edu/depts/finance-operations/ods/institutional-research-request)


## CNM Data Links

[KPIs PDF](https://www.cnm.edu/depts/finance-operations/ods/dashboards-kpis/kpiupdate1120.pdf)

[CNM Data Dashboard](https://livecnm.sharepoint.com/sites/insights/FO/ODS/CNMdashboards/SitePages/Home.aspx?e=1%3Aa3207ba35cac40c480e6029dc0bfcc2c)

[CNM Graduate Outcomes](https://www.cnm.edu/depts/finance-operations/ods/documents/graduate-surveys/2018-2019-graduate-outcomes.pdf)

## Summarized Data Dictionary

[Here is a link to a summarized data dictionary.](https://docs.google.com/document/d/1io7TtqebJLtw6FKE7zkbUh26QkG3rEJrZX3Fver9zmU/edit)


# Imports

In [None]:
# grab the imports needed for the project
import pandas as pd

# Globals
Team members stored the data under different Google Drive paths. The global below lets each team member specify who is working on this notebook so that the code runs in their environment.

In [None]:
# Expected values are: ellie, amy, cecilia - lowercase
team_member = 'cecilia'

# Root drive path
if team_member in ['amy','ellie']:
  root_drive = '/content/drive/MyDrive/'
else: # Cecilia
  root_drive = '/content/drive/MyDrive/Student Folder - Cecilia/Projects/'

# Data Cleaning

Once we identified the data elements needed for our project and the data sources that provided those data elements, the following steps were taken to get the data into a format needed for our analysis.

1. Downloaded data from data sources and placed copies in Google Drive.
1. Made changes to raw data where needed to support the project. 
  * Added State code to OES data and removed headers and footers from the data.
  * Created lookup data for State codes and SOC codes so secondary data sources could be merged with primary Census data. This involved cleaning the census code list so it could be properly parsed.
1. Converted codes in secondary datasets into Census codes.
1. Merged all datasets together into a single dataset.
1. Removed data that did not meet criteria for our analysis.
  * Removed anyone under age 15.
  * Imputed null values.
1. Studied earnings/salary columns to determine which columns provided values that could be used for modeling. Added in columns that were missing from the initial analysis.
1. We could not reliably match the OES data to the Census data on the full SOC code because the SOC codes differed between the two sources. Therefore, we ran 3 matching passes, shortening the SOC code by one character each time and taking the largest Census code for each SOC code prefix. This let us match a larger percentage of the OES data back to the ASEC data. However, about 50% of the values were still NULL, so we decided to remove the OES data from our final analysis.
1. After initially applying models to our data, we decided to add additional predictor variables mostly from the family dataset indicating income source.
1. Because model performance remained poor, we decided to do some trend analysis for the past 10 years. To do this, we parsed the 2011 to 2018 data and merged it with the 2019 to 2021 data.

[GitHub link for data parsing on local system using Jupyter Notebooks through Anaconda Navigator](https://github.com/cfcastillo/DS-6-Notebooks/blob/main/Education%20Capstone%20Historic%20Data%20Parsing.ipynb)

[GitHub link for merging all years into a single dataset](https://github.com/cfcastillo/DS-6-Notebooks/blob/main/Education_Capstone_Historic_Data_Parsing_Server.ipynb)

This concluded our data cleaning and preparation steps.
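
The SOC prefix-matching passes described in the steps above could look roughly like the sketch below. It is not the project's original code: the crosswalk rows, wage values, and column names are made up, and it assumes an exact-match pass followed by three passes that each drop one trailing character.

```python
import pandas as pd

# Made-up crosswalk: Census occupation code -> SOC code.
crosswalk = pd.DataFrame({
    "census_code": [1005, 1006, 2002],
    "soc_code": ["15-1211", "15-1212", "25-2021"],
})

# Made-up OES rows whose SOC codes do not all appear in the crosswalk.
oes = pd.DataFrame({
    "soc_code": ["15-1211", "15-1219", "25-2099", "99-9999"],
    "mean_wage": [95000, 88000, 61000, 40000],
})

def match_by_prefix(oes_df, xwalk_df, passes=3):
    """Match on the full SOC code, then retry unmatched rows on shorter prefixes."""
    full_len = int(xwalk_df["soc_code"].str.len().max())
    matched = pd.Series(float("nan"), index=oes_df.index)
    for drop in range(passes + 1):  # drop == 0 is the exact-match pass
        n = full_len - drop
        # For each prefix, keep the largest Census code.
        by_prefix = xwalk_df.groupby(xwalk_df["soc_code"].str[:n])["census_code"].max()
        # Only rows that are still unmatched get filled in this pass.
        matched = matched.fillna(oes_df["soc_code"].str[:n].map(by_prefix))
    return oes_df.assign(census_code=matched)

oes_matched = match_by_prefix(oes, crosswalk)
print(oes_matched)
print(f"Unmatched share: {oes_matched['census_code'].isna().mean():.0%}")
```

Printing the unmatched share after the last pass gives a quick way to judge whether the secondary dataset is complete enough to keep.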

## Import Data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Import Census data
asec_year = '21'
data_year = int('20' + asec_year)
asec_path = root_drive + 'Capstone/Data/ASEC/asecpub' + asec_year + 'csv/'
asec_data_person = pd.read_csv(asec_path + 'pppub' + asec_year + '.csv')
asec_data_household = pd.read_csv(asec_path + 'hhpub' + asec_year + '.csv')
asec_data_family = pd.read_csv(asec_path + 'ffpub' + asec_year + '.csv')

In [None]:
# How many columns and rows do we have in each dataset?
print(f'Person data: {asec_data_person.shape}')
print(f'Household data: {asec_data_household.shape}')
print(f'Family data: {asec_data_family.shape}')

Person data: (163543, 830)
Household data: (90759, 134)
Family data: (73151, 85)


## ASEC Data

### Define ASEC Columns

The following data dictionary provides details for the selected columns.

[ASEC Appendices](https://www2.census.gov/programs-surveys/cps/techdocs/cpsmar19.pdf)

[ASEC Data Dictionary](https://www2.census.gov/programs-surveys/cps/datasets/2019/march/06_ASEC_2019-Data_Dictionary_Full.pdf)


In [None]:
# Get lists of columns for various datasets that will be used for the project
# Note: Columns can be added as needed here and will propagate through the project.
hid_col = ['H_IDNUM']
hseq_col = ['H_SEQ']
fseq_col = ['FH_SEQ']  # Joins to household data through H_SEQ
year_col = ['DATA_YEAR']
person_cols = ['OCCUP','A_MJOCC','A_DTOCC','AGE1','A_SEX','PRDTRACE','PXRACE1','PRCITSHP',
               'A_HGA','PRERELG', 'A_GRSWK', 'HRCHECK','HRSWK','PEARNVAL','A_CLSWKR','WEIND',
               'A_MARITL','A_HSCOL','A_WKSTAT','HEA','PEINUSYR']

# In 2022 data? - A_MAJACT, PURACEOT, RAC_HISP, UED_TYP

household_cols = ['GTMETSTA','GEDIV','GESTFIPS','HHINC','H_TENURE','H_LIVQRT']

# In 2022 data? - FEARNS, GEUR
# FKINDEX, 'FINC_ANN', 'FINC_DST', 'FINC_PEN' not in 2018 and earlier

family_cols = ['FINC_FR','FINC_SE','FINC_WS','FINC_CSP','FINC_DIS','FINC_DIV','FINC_RNT',
               'FINC_ED','FINC_SS','FINC_SSI','FINC_FIN','FINC_SUR','FINC_INT','FINC_UC',
               'FINC_OI','FINC_VET','FINC_PAW','FINC_WC']

### Get Household Id

In [None]:
# Extract the Household id number from the person record so we can join the household and person dataframes by this id.
asec_data_person[hid_col[0]] = asec_data_person['PERIDNUM'].str[:20]

In [None]:
# View Person Data
asec_data_person[hid_col + person_cols].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 163543 entries, 0 to 163542
Data columns (total 22 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   H_IDNUM   163543 non-null  object
 1   OCCUP     163543 non-null  int64 
 2   A_MJOCC   163543 non-null  int64 
 3   A_DTOCC   163543 non-null  int64 
 4   AGE1      163543 non-null  int64 
 5   A_SEX     163543 non-null  int64 
 6   PRDTRACE  163543 non-null  int64 
 7   PXRACE1   163543 non-null  int64 
 8   PRCITSHP  163543 non-null  int64 
 9   A_HGA     163543 non-null  int64 
 10  PRERELG   163543 non-null  int64 
 11  A_GRSWK   163543 non-null  int64 
 12  HRCHECK   163543 non-null  int64 
 13  HRSWK     163543 non-null  int64 
 14  PEARNVAL  163543 non-null  int64 
 15  A_CLSWKR  163543 non-null  int64 
 16  WEIND     163543 non-null  int64 
 17  A_MARITL  163543 non-null  int64 
 18  A_HSCOL   163543 non-null  int64 
 19  A_WKSTAT  163543 non-null  int64 
 20  HEA       163543 non-null 

In [None]:
# Look at first 5 records of selected columns of person data.
asec_data_person[hid_col + person_cols].head()

Unnamed: 0,H_IDNUM,OCCUP,A_MJOCC,A_DTOCC,AGE1,A_SEX,PRDTRACE,PXRACE1,PRCITSHP,A_HGA,PRERELG,A_GRSWK,HRCHECK,HRSWK,PEARNVAL,A_CLSWKR,WEIND,A_MARITL,A_HSCOL,A_WKSTAT,HEA,PEINUSYR
0,82389460119020511011,0,0,0,12,2,1,0,1,39,0,0,0,0,0,0,23,1,0,1,4,0
1,82389460119020511011,6305,7,19,12,1,1,0,1,39,0,0,2,40,10000,1,3,1,0,6,3,0
2,82389460119020511011,0,0,0,17,2,1,0,1,39,0,0,0,0,0,0,23,4,0,1,4,0
3,60920525931050712011,2002,2,6,15,2,1,0,1,43,1,827,2,40,43000,1,15,1,0,2,3,0
4,60920525931050712011,9130,10,22,15,1,1,0,1,39,1,635,2,40,33000,1,8,1,0,2,3,0


In [None]:
# Convert H_IDNUM to a string left-padded with zeros to 20 characters. Necessary for years 2018 and prior.
if int(asec_year) <= 18:
  asec_data_household[hid_col[0]] = asec_data_household[hid_col[0]].astype(str).str.zfill(20)
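
The padding matters because an ID stored as a number loses its leading zeros and then no longer equals the string key taken from the person record. A small sketch with made-up IDs (the 22-character person ID below is hypothetical):

```python
import pandas as pd

# Made-up IDs for one household as they might appear in each file.
person_id = pd.Series(["0550835612048911101101"])  # string, leading zero kept
household_id = pd.Series([5508356120489111011])    # integer, leading zero lost

# The person key is the first 20 characters of the person ID.
person_key = person_id.str[:20]

# Without padding the household key is only 19 characters long.
unpadded_key = household_id.astype(str)
padded_key = unpadded_key.str.zfill(20)

print(person_key.iloc[0] == unpadded_key.iloc[0])
print(person_key.iloc[0] == padded_key.iloc[0])
```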

In [None]:
# View Household Data
asec_data_household[hid_col + hseq_col + household_cols].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90759 entries, 0 to 90758
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   H_IDNUM   90759 non-null  object
 1   H_SEQ     90759 non-null  int64 
 2   GTMETSTA  90759 non-null  int64 
 3   GEDIV     90759 non-null  int64 
 4   GESTFIPS  90759 non-null  int64 
 5   HHINC     90759 non-null  int64 
 6   H_TENURE  90759 non-null  int64 
 7   H_LIVQRT  90759 non-null  int64 
dtypes: int64(7), object(1)
memory usage: 5.5+ MB


In [None]:
# Look at first 5 records of household data
asec_data_household[hid_col + hseq_col + household_cols].head()

Unnamed: 0,H_IDNUM,H_SEQ,GTMETSTA,GEDIV,GESTFIPS,HHINC,H_TENURE,H_LIVQRT
0,82389460119020511011,1,2,1,23,11,1,1
1,60920525931050712011,2,2,1,23,41,1,1
2,20449021300235411011,3,2,1,23,7,1,1
3,59365480158020111011,4,2,1,23,0,0,1
4,05508356120489111011,5,2,1,23,0,0,1


In [None]:
# View family data. Get record count before grouping data.
asec_data_family[fseq_col + family_cols].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73151 entries, 0 to 73150
Data columns (total 19 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   FH_SEQ    73151 non-null  int64
 1   FINC_FR   73151 non-null  int64
 2   FINC_SE   73151 non-null  int64
 3   FINC_WS   73151 non-null  int64
 4   FINC_CSP  73151 non-null  int64
 5   FINC_DIS  73151 non-null  int64
 6   FINC_DIV  73151 non-null  int64
 7   FINC_RNT  73151 non-null  int64
 8   FINC_ED   73151 non-null  int64
 9   FINC_SS   73151 non-null  int64
 10  FINC_SSI  73151 non-null  int64
 11  FINC_FIN  73151 non-null  int64
 12  FINC_SUR  73151 non-null  int64
 13  FINC_INT  73151 non-null  int64
 14  FINC_UC   73151 non-null  int64
 15  FINC_OI   73151 non-null  int64
 16  FINC_VET  73151 non-null  int64
 17  FINC_PAW  73151 non-null  int64
 18  FINC_WC   73151 non-null  int64
dtypes: int64(19)
memory usage: 10.6 MB


In [None]:
# There may be multiple families per household. We need unique records in order to merge
# with the household data.
asec_data_family_unique = asec_data_family.drop_duplicates(fseq_col + family_cols)[fseq_col + family_cols]

In [None]:
# View family data after grouping. Get record count with all columns. 
# Compare with record count from sequence number column to ensure we truly have unique rows 
# and to see if further grouping is needed.
asec_data_family_unique

Unnamed: 0,FH_SEQ,FINC_FR,FINC_SE,FINC_WS,FINC_CSP,FINC_DIS,FINC_DIV,FINC_RNT,FINC_ED,FINC_SS,FINC_SSI,FINC_FIN,FINC_SUR,FINC_INT,FINC_UC,FINC_OI,FINC_VET,FINC_PAW,FINC_WC
0,1,2,2,1,2,2,2,2,2,1,2,2,2,2,1,2,2,2,2
1,2,2,2,1,2,2,2,2,2,2,2,2,2,1,2,2,2,2,2
2,3,2,2,2,2,2,2,2,2,1,2,2,2,2,2,2,2,2,2
3,8,2,2,2,2,2,2,2,2,2,1,2,2,2,2,2,2,2,2
4,8,2,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73146,90755,2,2,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
73147,90756,2,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
73148,90757,2,1,2,2,2,2,2,2,1,2,2,2,1,2,2,2,2,2
73149,90758,2,1,1,2,2,2,1,2,2,2,2,2,2,2,2,2,2,2


In [None]:
# Looking at the unique results indicates that we still do not have a single family record per household
# because families within the household may have different income sources. Therefore, we will take the 
# first family record assuming it is the most significant record. If extending this project in future, 
# a better mechanism for reducing the data would be required.
temp_family = asec_data_family[asec_data_family['FFPOS'] == 1]
asec_data_family_single = temp_family[fseq_col + family_cols]

In [None]:
temp_family.tail()

Unnamed: 0,FPOVCUT,FPERSONS,FHEADIDX,FSPOUIDX,FOWNU6,FRELU6,FKIND,FKINDEX,FTYPE,FRELU18,FOWNU18,FLASTIDX,FMLASIDX,FH_SEQ,FAMLIS,FANNVAL,FCSPVAL,FDISVAL,FDIVVAL,FDSTVAL,FEARNVAL,FEDVAL,FFINVAL,FFPOS,FFRVAL,FHIP_VAL,FHIP_VAL2,FINC_ANN,FINC_CSP,FINC_DIS,FINC_DIV,FINC_DST,FINC_ED,FINC_FIN,FINC_FR,FINC_INT,FINC_OI,FINC_PAW,FINC_PEN,FINC_RNT,...,FINC_VET,FINC_WC,FINC_WS,FINTVAL,FMED_VAL,FMOOP,FMOOP2,FOIVAL,FOTC_VAL,FOTHVAL,FPAWVAL,FPCTCUT,FPENVAL,FRECORD,FRNTVAL,FRSPOV,FRSPPCT,FSEVAL,FSPANISH,FSSIVAL,FSSVAL,FSUP_WGT,FSURVAL,FTOTVAL,FTOT_R,FUCVAL,FVETVAL,FWCVAL,FWSVAL,F_MV_FS,F_MV_SL,I_FHIPVAL,I_FHIPVAL2,I_FMEDVAL,I_FMOOP,I_FMOOP2,I_FOTCVAL,POVLL,FILEDATE,YYYYMM
73145,37174,6,1,2,0,1,1,1,1,1,0,6,3,90755,4,0,0,24000,0,0,84800,0,0,1,0,4250,4250,2,2,1,2,2,2,2,2,2,2,2,1,2,...,2,1,1,0,1580,7450,7450,0,1620,80215,0,17,8400,2,0,0,0,0,2,0,18535,47036,0,165015,41,5280,0,24000,84800,0,0,3,3,3,3,3,3,12,81921,202103
73147,13465,1,1,0,0,0,2,3,2,0,0,1,1,90756,4,0,0,0,0,0,60000,0,0,1,0,0,0,2,2,2,2,2,2,2,2,2,2,2,2,2,...,2,2,2,0,100,200,200,0,100,0,0,0,0,2,0,0,0,60000,2,0,0,31235,0,60000,25,0,0,0,0,0,0,3,3,3,3,3,3,12,81921,202103
73148,12413,1,1,0,0,0,2,3,2,0,0,1,1,90757,4,0,0,0,0,0,2700,0,0,1,0,1400,1400,2,2,2,2,2,2,2,2,1,2,2,2,2,...,2,2,2,1,2000,4600,4600,0,1200,19393,0,0,0,2,0,0,0,2700,2,0,19392,51411,0,22093,9,0,0,0,0,0,0,0,0,0,0,0,0,7,81921,202103
73149,15644,2,1,2,0,0,1,1,1,0,0,2,2,90758,4,0,0,0,0,0,85000,0,0,1,0,1800,1800,2,2,2,2,2,2,2,2,2,2,2,2,1,...,2,2,1,0,1000,3556,3556,0,756,25800,0,13,0,2,25800,0,0,50000,2,0,0,51625,0,110800,41,0,0,0,35000,0,0,0,0,0,0,0,0,14,81921,202103
73150,17331,2,1,2,0,0,1,2,1,0,0,2,2,90759,4,0,0,0,0,0,43000,0,0,1,0,0,0,2,2,2,2,2,2,2,2,1,2,2,2,2,...,2,2,1,11,0,500,500,0,500,16803,0,8,0,2,0,0,0,0,1,0,8000,38637,0,59803,24,8792,0,0,43000,0,0,3,3,3,3,3,3,10,81921,202103


In [None]:
# View results
asec_data_family_single

Unnamed: 0,FH_SEQ,FINC_FR,FINC_SE,FINC_WS,FINC_CSP,FINC_DIS,FINC_DIV,FINC_RNT,FINC_ED,FINC_SS,FINC_SSI,FINC_FIN,FINC_SUR,FINC_INT,FINC_UC,FINC_OI,FINC_VET,FINC_PAW,FINC_WC
0,1,2,2,1,2,2,2,2,2,1,2,2,2,2,1,2,2,2,2
1,2,2,2,1,2,2,2,2,2,2,2,2,2,1,2,2,2,2,2
2,3,2,2,2,2,2,2,2,2,1,2,2,2,2,2,2,2,2,2
3,8,2,2,2,2,2,2,2,2,2,1,2,2,2,2,2,2,2,2
5,9,2,2,2,2,2,2,2,2,1,2,2,2,1,2,2,2,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73145,90755,2,2,1,2,1,2,2,2,1,2,2,2,2,1,2,2,2,1
73147,90756,2,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
73148,90757,2,1,2,2,2,2,2,2,1,2,2,2,1,2,2,2,2,2
73149,90758,2,1,1,2,2,2,1,2,2,2,2,2,2,2,2,2,2,2
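
One way to confirm that keeping only `FFPOS == 1` leaves a single family row per household is to check that the join key is unique before merging. The sketch below uses made-up rows and shows only one `FINC_` column for brevity.

```python
import pandas as pd

# Made-up family rows: household 8 contains two families.
family_demo = pd.DataFrame({
    "FH_SEQ": [1, 2, 8, 8],
    "FFPOS": [1, 1, 1, 2],
    "FINC_WS": [1, 1, 2, 2],
})

# Before filtering, FH_SEQ repeats, so a merge on it would duplicate person rows.
print(family_demo["FH_SEQ"].is_unique)

# Keep the first family in each household, as in the cell above.
first_family = family_demo[family_demo["FFPOS"] == 1]

# After filtering, each household sequence number should appear only once.
print(first_family["FH_SEQ"].is_unique)
```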


### Merge ASEC Tables

In [None]:
# Join Household and Personal records into single dataframe
# Inner join - should not have person without household.
asec_combined = pd.merge(asec_data_household[hid_col + hseq_col + household_cols], asec_data_person[hid_col + person_cols], on=hid_col)

# Join Family to get FINC columns
asec_combined = pd.merge(asec_combined, asec_data_family_single[fseq_col + family_cols], left_on=hseq_col, right_on=fseq_col, how='left')

# Add data year so that we can do trend analysis
asec_combined[year_col] = data_year

In [None]:
# View combined result
asec_combined.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 163543 entries, 0 to 163542
Data columns (total 49 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   H_IDNUM    163543 non-null  object
 1   H_SEQ      163543 non-null  int64 
 2   GTMETSTA   163543 non-null  int64 
 3   GEDIV      163543 non-null  int64 
 4   GESTFIPS   163543 non-null  int64 
 5   HHINC      163543 non-null  int64 
 6   H_TENURE   163543 non-null  int64 
 7   H_LIVQRT   163543 non-null  int64 
 8   OCCUP      163543 non-null  int64 
 9   A_MJOCC    163543 non-null  int64 
 10  A_DTOCC    163543 non-null  int64 
 11  AGE1       163543 non-null  int64 
 12  A_SEX      163543 non-null  int64 
 13  PRDTRACE   163543 non-null  int64 
 14  PXRACE1    163543 non-null  int64 
 15  PRCITSHP   163543 non-null  int64 
 16  A_HGA      163543 non-null  int64 
 17  PRERELG    163543 non-null  int64 
 18  A_GRSWK    163543 non-null  int64 
 19  HRCHECK    163543 non-null  int64 
 20  HRSW

In [None]:
asec_combined.head()

Unnamed: 0,H_IDNUM,H_SEQ,GTMETSTA,GEDIV,GESTFIPS,HHINC,H_TENURE,H_LIVQRT,OCCUP,A_MJOCC,A_DTOCC,AGE1,A_SEX,PRDTRACE,PXRACE1,PRCITSHP,A_HGA,PRERELG,A_GRSWK,HRCHECK,HRSWK,PEARNVAL,A_CLSWKR,WEIND,A_MARITL,A_HSCOL,A_WKSTAT,HEA,PEINUSYR,FH_SEQ,FINC_FR,FINC_SE,FINC_WS,FINC_CSP,FINC_DIS,FINC_DIV,FINC_RNT,FINC_ED,FINC_SS,FINC_SSI,FINC_FIN,FINC_SUR,FINC_INT,FINC_UC,FINC_OI,FINC_VET,FINC_PAW,FINC_WC,DATA_YEAR
0,82389460119020511011,1,2,1,23,11,1,1,0,0,0,12,2,1,0,1,39,0,0,0,0,0,0,23,1,0,1,4,0,1,2,2,1,2,2,2,2,2,1,2,2,2,2,1,2,2,2,2,2021
1,82389460119020511011,1,2,1,23,11,1,1,6305,7,19,12,1,1,0,1,39,0,0,2,40,10000,1,3,1,0,6,3,0,1,2,2,1,2,2,2,2,2,1,2,2,2,2,1,2,2,2,2,2021
2,82389460119020511011,1,2,1,23,11,1,1,0,0,0,17,2,1,0,1,39,0,0,0,0,0,0,23,4,0,1,4,0,1,2,2,1,2,2,2,2,2,1,2,2,2,2,1,2,2,2,2,2021
3,60920525931050712011,2,2,1,23,41,1,1,2002,2,6,15,2,1,0,1,43,1,827,2,40,43000,1,15,1,0,2,3,0,2,2,2,1,2,2,2,2,2,2,2,2,2,1,2,2,2,2,2,2021
4,60920525931050712011,2,2,1,23,41,1,1,9130,10,22,15,1,1,0,1,39,1,635,2,40,33000,1,8,1,0,2,3,0,2,2,2,1,2,2,2,2,2,2,2,2,2,1,2,2,2,2,2,2021
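
The assumption that every person has exactly one household can be checked with the `validate` and `indicator` arguments of `pd.merge`. This is a sketch on made-up rows, not a cell from the original notebook.

```python
import pandas as pd

# Made-up rows: two households and three people.
households_demo = pd.DataFrame({"H_IDNUM": ["A", "B"], "HHINC": [11, 41]})
persons_demo = pd.DataFrame({"H_IDNUM": ["A", "A", "B"], "AGE1": [12, 12, 15]})

# validate raises pandas.errors.MergeError if a household key repeats on the left.
# indicator adds a _merge column showing which side each row came from.
checked = pd.merge(
    households_demo,
    persons_demo,
    on="H_IDNUM",
    how="outer",
    validate="one_to_many",
    indicator=True,
)

# Every row is "both" when no person lacks a household and no household lacks a person.
print(checked["_merge"].value_counts())
```

An outer join is used here only for the check. If every row comes back as `both`, the inner join used in the notebook drops nothing.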


## Combine All Data

In [None]:
asec_final = asec_combined[year_col + household_cols + person_cols + family_cols]

In [None]:
asec_final.shape

(163543, 46)

In [None]:
# Review result of merged data
asec_final.info()
# asec_final.head(50)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 163543 entries, 0 to 163542
Data columns (total 46 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   DATA_YEAR  163543 non-null  int64
 1   GTMETSTA   163543 non-null  int64
 2   GEDIV      163543 non-null  int64
 3   GESTFIPS   163543 non-null  int64
 4   HHINC      163543 non-null  int64
 5   H_TENURE   163543 non-null  int64
 6   H_LIVQRT   163543 non-null  int64
 7   OCCUP      163543 non-null  int64
 8   A_MJOCC    163543 non-null  int64
 9   A_DTOCC    163543 non-null  int64
 10  AGE1       163543 non-null  int64
 11  A_SEX      163543 non-null  int64
 12  PRDTRACE   163543 non-null  int64
 13  PXRACE1    163543 non-null  int64
 14  PRCITSHP   163543 non-null  int64
 15  A_HGA      163543 non-null  int64
 16  PRERELG    163543 non-null  int64
 17  A_GRSWK    163543 non-null  int64
 18  HRCHECK    163543 non-null  int64
 19  HRSWK      163543 non-null  int64
 20  PEARNVAL   163543 non-null

## Clean Data

In [None]:
# Remove people under 15 years old because they are not relevant for this project.
# 0 = Not in universe
# 1 = 15 years
# 2 = 16 and 17 years
# 3 = 18 and 19 years
# 4 = 20 and 21 years
# 5 = 22 to 24 years
# 6 = 25 to 29 years
# 7 = 30 to 34 years
# 8 = 35 to 39 years
# 9 = 40 to 44 years
# 10 = 45 to 49 years
# 11 = 50 to 54 years
# 12 = 55 to 59 years
# 13 = 60 to 61 years
# 14 = 62 to 64 years
# 15 = 65 to 69 years
# 16 = 70 to 74 years
# 17 = 75 years and over
asec_final = asec_final[asec_final['AGE1'] > 0]
asec_final.info()
# asec_oes.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 129110 entries, 0 to 163542
Data columns (total 46 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   DATA_YEAR  129110 non-null  int64
 1   GTMETSTA   129110 non-null  int64
 2   GEDIV      129110 non-null  int64
 3   GESTFIPS   129110 non-null  int64
 4   HHINC      129110 non-null  int64
 5   H_TENURE   129110 non-null  int64
 6   H_LIVQRT   129110 non-null  int64
 7   OCCUP      129110 non-null  int64
 8   A_MJOCC    129110 non-null  int64
 9   A_DTOCC    129110 non-null  int64
 10  AGE1       129110 non-null  int64
 11  A_SEX      129110 non-null  int64
 12  PRDTRACE   129110 non-null  int64
 13  PXRACE1    129110 non-null  int64
 14  PRCITSHP   129110 non-null  int64
 15  A_HGA      129110 non-null  int64
 16  PRERELG    129110 non-null  int64
 17  A_GRSWK    129110 non-null  int64
 18  HRCHECK    129110 non-null  int64
 19  HRSWK      129110 non-null  int64
 20  PEARNVAL   129110 non-null

In [None]:
# Export to CSV for teammates to use in EDA
# export_path = root_drive + 'Capstone/Data/FinalData/Trends/asec_' + str(data_year) + '_trend_v2.csv'
# asec_final.to_csv(export_path)

In [None]:
asec_final.shape

(129110, 46)
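
For EDA it may be convenient to turn the `AGE1` codes into readable labels. The sketch below copies the labels from the comments in the cleaning cell above; the `AGE1_LABEL` column name and the sample rows are made up for the example.

```python
import pandas as pd

# Labels copied from the AGE1 code list in the cleaning cell.
age1_labels = {
    0: "Not in universe", 1: "15 years", 2: "16 and 17 years",
    3: "18 and 19 years", 4: "20 and 21 years", 5: "22 to 24 years",
    6: "25 to 29 years", 7: "30 to 34 years", 8: "35 to 39 years",
    9: "40 to 44 years", 10: "45 to 49 years", 11: "50 to 54 years",
    12: "55 to 59 years", 13: "60 to 61 years", 14: "62 to 64 years",
    15: "65 to 69 years", 16: "70 to 74 years", 17: "75 years and over",
}

# Made-up rows; code 0 is dropped by the same filter the notebook applies.
ages_demo = pd.DataFrame({"AGE1": [0, 1, 12, 17]})
kept = ages_demo[ages_demo["AGE1"] > 0].copy()

# AGE1_LABEL is a hypothetical helper column for plots and tables.
kept["AGE1_LABEL"] = kept["AGE1"].map(age1_labels)
print(kept)
```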

# Exploratory Data Analysis (EDA)

The EDA process can be found in the notebook titled [2. Education Capstone EDA.ipynb](https://colab.research.google.com/drive/1Fa18G_kZY8fCEKupjsfICRyeav7dEw7K)