<a href="https://colab.research.google.com/github/cfcastillo/DS-6-Notebooks/blob/main/3_Education_Capstone_Data_Visualization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project Definition

The purpose of this project is to identify the factors that influence people to choose certain professions or trades. Understanding these factors can help colleges such as Central New Mexico Community College (CNM) offer courses that support those professions and target their marketing to people who are likely to choose them.

This project is framed as a supervised classification problem, using tree-based models to identify the factors that contribute to career choice.
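
As a rough illustration of how a tree-based model can surface influential factors, the sketch below fits a `RandomForestClassifier` on synthetic data and ranks the features by impurity-based importance. The data, the feature names (`feature_0`, `feature_1`, ...), and the hyperparameters are placeholders chosen for the example; they are not the project's actual columns or settings.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey data (NOT the project's real features).
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Impurity-based importances sum to 1; larger values mean the feature
# contributed more to the splits across the forest.
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
print("Test accuracy:", model.score(X_test, y_test))
```

Impurity-based importances are only one way to rank factors; they can favor high-cardinality features, so they are best read as a first pass.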



# Data Collection and Cleaning

The data collection and cleaning process is outlined in the notebook titled [1. Education Capstone - Data Collection and Cleaning.ipynb](https://colab.research.google.com/drive/1Y_1b7BmiRF6CSYnoiZqGpfjpbzU4qoFe#scrollTo=Kmxlgo4Wnjgd)


## Column Descriptions

[Here is a summary document showing selected columns.](https://docs.google.com/document/d/1io7TtqebJLtw6FKE7zkbUh26QkG3rEJrZX3Fver9zmU/edit)

# Exploratory Data Analysis (EDA)

EDA can be found in the notebook titled [2. Education Capstone - EDA and Processing.ipynb](https://colab.research.google.com/drive/1Fa18G_kZY8fCEKupjsfICRyeav7dEw7K)

# Data Processing / Models

Data Processing and Model application can be found in the notebook titled [2. Education Capstone - EDA and Processing.ipynb](https://colab.research.google.com/drive/1Fa18G_kZY8fCEKupjsfICRyeav7dEw7K)

# Imports

In [None]:
# grab the imports needed for the project
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
# import statsmodels.api as sm

from sklearn import metrics
from sklearn.metrics import classification_report
# import sklearn.model_selection as model_selection
# from collections import Counter
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
# from mlxtend.plotting import plot_decision_regions

from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Visualization
import graphviz
# from IPython.display import display
# from sklearn import tree
import plotly.express as px
from ipywidgets import interact, Dropdown

# Other
# from sklearn.pipeline import make_pipeline  # does not work properly with randomoversampler.
from imblearn.pipeline import make_pipeline

# Link to Data Drive


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Team members store the data in different Google Drive locations. The `team_member` variable below lets whoever is running the notebook select the paths for their own environment.

In [None]:
# Expected values are: ellie, amy, cecilia - lowercase
team_member = 'cecilia'

# Root drive path
if team_member in ['amy','ellie']:
  root_drive = '/content/drive/MyDrive/'
else: # Cecilia
  root_drive = '/content/drive/MyDrive/Student Folder - Cecilia/Projects/'
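
One possible variation, sketched below, keeps the per-member paths in a dictionary and fails fast on an unexpected name instead of silently falling through to the last branch. The helper `get_root_drive` and the `ROOT_DRIVES` dictionary are hypothetical and not part of the notebook; the paths are the ones used above.

```python
# Hypothetical lookup table: maps each team member to their Drive root.
ROOT_DRIVES = {
    'amy': '/content/drive/MyDrive/',
    'ellie': '/content/drive/MyDrive/',
    'cecilia': '/content/drive/MyDrive/Student Folder - Cecilia/Projects/',
}

def get_root_drive(member):
    """Return the Drive root for a team member, or raise on an unknown name."""
    try:
        return ROOT_DRIVES[member.lower()]
    except KeyError:
        raise ValueError(f"Unknown team member: {member!r}") from None

print(get_root_drive('cecilia'))
```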

# Data Visualization and Results - Ellie

In [None]:
#Import df with all years
df_viz = pd.read_csv(root_drive + 'Capstone/Data/FinalData/Trends/asec_trend_v2.csv')

# Drop rows where A_DTOCC == 0 (not in universe or Armed Forces)
df_viz = df_viz[df_viz['A_DTOCC'] != 0].reset_index(drop=True)

#create dictionary to convert A_DTOCC codes to string descriptions
occ_dict = {1: 'Management',
            2: 'Business & Financial Operations',
            3: 'Computer & Mathematical Science',
            4: 'Architecture & Engineering',
            5: 'Life, Physical, & Social Science',
            6: 'Community & Social Service',
            7: 'Legal',
            8: 'Education, Training, & Library',
            9: 'Arts, Design, Entertainment, Sports, & Media',
            10: 'Healthcare Practitioner & Technical',
            11: 'Healthcare Support',
            12: 'Protective Service',
            13: 'Food Preparation & Serving Related',
            14: 'Building & Grounds Cleaning & Maintenance',
            15: 'Personal Care & Service',
            16: 'Sales & Related',
            17: 'Office & Administrative Support',
            18: 'Farming, Fishing, & Forestry',
            19: 'Construction & Extraction',
            20: 'Installation, Maintenance, & Repair',
            21: 'Production',
            22: 'Transportation & Material Moving',
            23: 'Armed Forces'}

#add column to df_viz with A_DTOCC codes converted to string descriptions
#(codes missing from occ_dict become NaN)
df_viz['occ_string'] = df_viz['A_DTOCC'].map(occ_dict)

#Import state codes
df_states = pd.read_csv(root_drive + 'Capstone/Data/Codes/FIPS_STATE_CODES.csv')

#Combine dataframes to get state names
df_viz = pd.merge(df_viz, df_states, how='left', left_on='GESTFIPS', right_on='FIPS_STATE')
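
To illustrate the code-to-label mapping and the left merge used above, here is a small sketch on toy data. The three-row frame and the two-state lookup are made up for illustration; only the column names mirror the notebook. It shows that occupation codes or FIPS values without a match become `NaN` rather than raising an error, which is worth checking after the merge.

```python
import pandas as pd

# Toy rows (made up); column names mirror the notebook's.
toy = pd.DataFrame({'A_DTOCC': [1, 7, 99], 'GESTFIPS': [35, 48, 2]})
toy_labels = {1: 'Management', 7: 'Legal'}
toy_states = pd.DataFrame({'FIPS_STATE': [35, 48], 'USPS_STATE': ['NM', 'TX']})

# Series.map returns NaN for codes missing from the dictionary.
toy['occ_string'] = toy['A_DTOCC'].map(toy_labels)

# A left merge keeps every row of `toy`; indicator=True adds a `_merge`
# column that flags rows with no matching state code.
merged = toy.merge(toy_states, how='left', left_on='GESTFIPS',
                   right_on='FIPS_STATE', indicator=True)
unmatched = merged[merged['_merge'] == 'left_only']
print(merged)
print("Rows without a state match:", len(unmatched))
```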

What are the most popular occupation categories, by state and year, from 2012 to 2021?

In [None]:
# Dropdown options (dropna guards against rows with no matching state code)
state_o = df_viz['USPS_STATE'].dropna().unique()
state_o_s = Dropdown(options=sorted(state_o))

year_o = df_viz['DATA_YEAR'].unique()
year_o_s = Dropdown(options=sorted(year_o))

# One fixed color per occupation category, taken in order from Plotly's Dark24 palette
occ_colors = dict(zip(occ_dict.values(), px.colors.qualitative.Dark24))

@interact(Year=year_o_s, State=state_o_s)
def pie(Year=2012, State='NM'):
  'Makes pie plot with given year and state'
  df_viz_year = df_viz[(df_viz['DATA_YEAR'] == Year) & (df_viz['USPS_STATE'] == State)]
  # Count rows per occupation category
  occ_counts = df_viz_year.groupby('occ_string').size().reset_index(name='count')
  # color= must be passed for color_discrete_map to take effect
  fig = px.pie(occ_counts, values='count', names='occ_string', color='occ_string',
               color_discrete_map=occ_colors, height=600,
               title=f'Occupations in the US by Year and State: {Year}, {State}')
  return fig.show()

#Note: color_discrete_map only applies when color= is also passed, which keeps each category on the same color. Setting an explicit height (600 here, adjust as needed) should keep the chart large when the function re-runs.

In [None]:
df_viz.head(1)

In [None]:
# np.sort returns a sorted copy; ndarray.sort() sorts in place and returns None
years = np.sort(df_viz['DATA_YEAR'].unique())
print(years)

In [None]:
#time series w/ NM state data w/ drop down of the 23 occupation codes

occ_o = df_viz['occ_string'].dropna().unique()
occ_o_s = Dropdown(options=sorted(occ_o))

@interact(Occupation=occ_o_s)
def timeseries(Occupation='Management'):
  'Makes timeseries with given occupation category'
  df_time = df_viz[(df_viz['occ_string'] == Occupation) & (df_viz['USPS_STATE'] == 'NM')]
  # Count rows per year so that x and y each hold one value per year
  year_counts = df_time.groupby('DATA_YEAR').size().reset_index(name='count')
  fig = px.line(year_counts, x='DATA_YEAR', y='count', title=f'Occupations over Time: {Occupation}, NM')
  return fig.show()

Visualization ideas:
- trendlines
- maps
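
For the trendline idea, one option is a least-squares line fitted to the yearly counts. The sketch below uses `numpy.polyfit` on placeholder counts (not values from the project data), with the years shifted to start at zero to keep the fit numerically well behaved.

```python
import numpy as np

# Placeholder yearly counts (NOT taken from the project data).
years = np.array([2012, 2013, 2014, 2015, 2016])
counts = np.array([100, 110, 120, 130, 140])

# Shift years so the first year is 0, then fit counts = slope * x + intercept.
x = years - years.min()
slope, intercept = np.polyfit(x, counts, deg=1)

# Fitted values, one per year, ready to overlay on a line chart.
trend = slope * x + intercept
print(f"Estimated change per year: {slope:.1f}")
```

The values in `trend` could then be drawn as a second line on the time-series figure.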

# Presentation and Conclusions - Final - Dec 3

