# Machine Learning: Data Preparation Checkpoint Answers

**Tian Lou** \
Ohio Education Research Center \
The Ohio State University

**Xiangyu Ren** \
New York University

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.10257277.svg)](https://doi.org/10.5281/zenodo.10257277)

**This notebook is developed for the [Data Literacy and Evidence Building Executive Class](https://www.socialdatascience.umd.edu/data-literacy).**

**The "Syntucky" data, which is synthetic in nature, is exclusively designed for training exercises. It is not intended to derive meaningful insights or make determinations about real-world populations.**

In [None]:
#Data Analysis Libraries
import pandas as pd
import numpy as np

#Machine Learning Packages
from sklearn.preprocessing import MinMaxScaler

Before running the code below, please change <font color='red'> **YOUR DATA DIRECTORY**</font> to your own file path.

In [None]:
#Load 2013, 2014, and 2015 data

#Define data folder directory (end the path with a slash, since file names are appended directly)
data_directory = 'YOUR DATA DIRECTORY'

#Read in different cohort data
df_2013 = pd.read_csv(data_directory+'syntucky_cohort_2013.csv')
df_2014 = pd.read_csv(data_directory+'syntucky_cohort_2014.csv')
df_2015 = pd.read_csv(data_directory+'syntucky_cohort_2015.csv')

#Combine them into one dataset for easier cleaning and generating variables
df_comb = pd.concat([df_2013,df_2014,df_2015])

#Check the first five rows of the combined data
df_comb.head()

#### **Checkpoint 1: Use an Alternative Job Quality Measurement to Create the Label**

In the data measurement notebook, we developed several job quality measures, including number of jobs (`year7_ct_employers`), employment duration (`year7_ct_qtrs_employed`), and average earnings per employed quarter (`year7_earnings` / `year7_ct_qtrs_employed`). Please use one of these measures, or a measure of your own, to create the job quality label. Then check the distribution of your label.

In [None]:
#Consistent job label
# =1 if the student had exactly 1 employer in year 7
# =0 if the student had more than 1 employer in year 7
#Excluded (label stays missing):
#Students who are not employed in year 7
#Students with missing Pell information
conditions = [((df_comb['year7_ct_employers'].notna()) & (df_comb['year7_ct_employers'] > 1) & (df_comb['first_enroll_acadyr_pell_disbursed'].notna())),
              ((df_comb['year7_ct_employers'] == 1) & (df_comb['first_enroll_acadyr_pell_disbursed'].notna()))]

choices = [0,
           1]

#Use np.nan (the np.NaN alias was removed in NumPy 2.0)
df_comb['label_consistent_jobs'] = np.select(conditions, choices, default = np.nan)

#Check label distribution
df_consistent_jobs = df_comb.groupby(['label_consistent_jobs'])['id'].agg(['count']).reset_index()

df_consistent_jobs['percent'] = round(df_consistent_jobs['count'] / df_consistent_jobs['count'].sum(),2)

df_consistent_jobs
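
The average-earnings measure named in the checkpoint prompt can be turned into a label in the same way. The sketch below is a minimal illustration on a small made-up table, not the notebook's answer: the toy values, the column name `year7_avg_qtr_earnings`, the label name `label_high_earnings`, and the median cutoff are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_comb; all values are invented for illustration
toy = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'year7_earnings': [40000.0, 12000.0, 9000.0, np.nan, 30000.0],
    'year7_ct_qtrs_employed': [4, 2, 3, 0, 4],
})

# Average earnings per employed quarter; left as NaN when no quarters were worked
qtrs = toy['year7_ct_qtrs_employed'].where(toy['year7_ct_qtrs_employed'] > 0)
toy['year7_avg_qtr_earnings'] = toy['year7_earnings'] / qtrs

# Assumed cutoff: the median among employed students (a hypothetical choice)
cutoff = toy['year7_avg_qtr_earnings'].median()

conditions = [toy['year7_avg_qtr_earnings'] >= cutoff,
              toy['year7_avg_qtr_earnings'] < cutoff]
choices = [1, 0]

# Students with no year 7 employment keep a missing label
toy['label_high_earnings'] = np.select(conditions, choices, default=np.nan)

# Check label distribution, mirroring the groupby used for the answer label
print(toy.groupby('label_high_earnings')['id'].agg(['count']).reset_index())
```

A median split yields a roughly balanced label by construction; a fixed dollar threshold would be an equally reasonable choice if one has a substantive reason for it.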

#### **Checkpoint 2: Create Additional Features**

Try to create additional features. For example, you can add `first_enroll_fulltime` to the feature list or use `urm_status` as a feature instead of `race_group`. Consider whether each new feature is categorical or numeric, since that determines how it should be processed. Also check whether your new features have missing values and decide how to handle them.

In [None]:
#Data Cleaning

#Remove doctoral degree recipients due to potential data error
df_comb = df_comb[df_comb['high_completion_label'] != 'Doctoral']

#Check the number of unknown gender
print(df_comb.groupby(['gender'])['id'].agg(['count']).reset_index())

#Remove unknown gender due to small number
#Take a copy so that adding new columns below does not trigger a SettingWithCopyWarning
df_comb = df_comb[df_comb['gender'] != "Unknown"].copy()

#Generate new features

#Age at year 7
df_comb['year7_age'] = df_comb['cohort_acadyr'] + 6 - df_comb['birth_year']

In [None]:
#Only keep the columns we need

#Identifiers
id_cols = ['id', 'cohort_acadyr']

#Labels
label_cols = ['label_consistent_jobs']

#Categorical features
cat_cols = ['first_enroll_acadyr_pell_disbursed', 'gender', 'urm_status', 'instate_origin', 'first_enroll', 
            'cohort_degree_pursuit_type']

#Numeric features
num_cols = ['year7_age']

#Subset to the selected columns (copy so that later assignments do not act on a view)
df_comb = df_comb[id_cols + label_cols + cat_cols + num_cols].copy()

#Check the current data
df_comb.head()

In [None]:
#Check the number of null values in labels and features
#Note that this method only checks whether a column has null values. However, missing values can appear in many formats.
#You may want to inspect further depending on the data you use.
for col in label_cols + cat_cols + num_cols:
    print(col, df_comb[col].isnull().sum())
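
To illustrate the note about missing values appearing in other formats, here is a small sketch of one way to look for placeholder strings that `.isnull()` does not count. The toy category values and the `placeholders` list are hypothetical; which strings stand for "missing" depends on the data you use.

```python
import pandas as pd

# Toy column; the category values and placeholders are invented for illustration
toy = pd.DataFrame({'first_enroll': ['Full-time', '', 'N/A', None, 'Part-time', ' ']})

# Hypothetical list of strings that should count as missing
placeholders = ['', 'N/A', 'NA', 'Unknown']

# Strip surrounding whitespace, then blank out any placeholder string
stripped = toy['first_enroll'].str.strip()
cleaned = stripped.mask(stripped.isin(placeholders))

# Compare the null counts before and after treating placeholders as missing
n_null_raw = toy['first_enroll'].isnull().sum()
n_null_cleaned = cleaned.isnull().sum()
print('nulls before:', n_null_raw)
print('nulls after:', n_null_cleaned)
```

Running `value_counts(dropna=False)` on a categorical column is another quick way to spot such placeholders by eye.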

In [None]:
#Fill in categorical features' null values with "Missing"
df_comb.loc[df_comb['first_enroll'].isnull(), 'first_enroll'] = "Missing"

#Fill in missing age with the average age at year 7, then bottom-code age at 16 and top-code it at 64
df_comb.loc[df_comb['year7_age'].isnull(), 'year7_age'] = round(df_comb['year7_age'].mean(), 0)

df_comb.loc[df_comb['year7_age'] < 16, 'year7_age'] = 16

df_comb.loc[df_comb['year7_age'] > 64, 'year7_age'] = 64
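
The same three steps (mean imputation, bottom coding, top coding) can also be written with `fillna` and `clip`. This is a minimal sketch on a made-up series, offered as an equivalent alternative rather than a replacement for the cell above; the toy ages are invented.

```python
import numpy as np
import pandas as pd

# Toy ages, invented for illustration: one too low, one missing, one too high
ages = pd.Series([15.0, 22.0, np.nan, 70.0, 30.0], name='year7_age')

# Mean of the observed ages, rounded to a whole year (computed before any coding)
mean_age = round(ages.mean(), 0)

# Impute the missing value, then bound ages to the range [16, 64]
ages_clean = ages.fillna(mean_age).clip(lower=16, upper=64)

print(ages_clean.tolist())
```

As in the cell above, the mean is computed from the raw ages before the bounds are applied, so extreme values still influence the imputed value.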

In [None]:
#Categorical features that we need to convert to dummy variables
cat_cols = ['gender', 'urm_status', 'first_enroll', 'cohort_degree_pursuit_type']

#Get dummy variables
df_comb = pd.get_dummies(df_comb, columns = cat_cols, dtype = float)

#Check our current columns
df_comb.columns
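
To make the effect of `pd.get_dummies` concrete, the sketch below applies it to a tiny made-up table. The category values are assumptions for illustration; the actual categories in the Syntucky data may differ.

```python
import pandas as pd

# Toy table; category values are invented for illustration
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'gender': ['Female', 'Male', 'Female'],
    'first_enroll': ['Missing', 'Full-time', 'Part-time'],
})

# Each listed column is replaced by one 0/1 column per category
dummies = pd.get_dummies(toy, columns=['gender', 'first_enroll'], dtype=float)

print(dummies.columns.tolist())
print(dummies['gender_Female'].tolist())
```

Each new column is named `<original column>_<category>`, and the "Missing" category created earlier becomes its own dummy column, so rows with missing values are retained rather than dropped.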

In [None]:
#Define scaler type
scaler = MinMaxScaler()

#Compute the minimum and maximum to be used for scaling
scaler.fit(df_comb['year7_age'].values.reshape(-1,1))

#Scaling features to range [0, 1]
df_comb['year7_age_scl'] = scaler.transform(df_comb['year7_age'].values.reshape(-1,1))

#Drop the original numeric feature
df_comb.drop(columns = ['year7_age'], inplace = True)
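
With its default range of [0, 1], `MinMaxScaler` rescales each value as (x - min) / (max - min). The sketch below reproduces that arithmetic by hand on a few made-up ages, only to show what the scaled column contains; it uses pandas alone and is not a substitute for the scaler.

```python
import pandas as pd

# Toy ages, invented for illustration, already bounded to [16, 64]
ages = pd.Series([16.0, 22.0, 40.0, 64.0], name='year7_age')

# Min-max formula: subtract the minimum, divide by the range
ages_scl = (ages - ages.min()) / (ages.max() - ages.min())

print(ages_scl.tolist())
```

The youngest age maps to 0 and the oldest to 1, with everything else placed proportionally in between.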

In [None]:
#Check summary descriptive statistics for all labels and features
#We use `.T` to transpose the table so that it is easier to read
#The code in `.apply()` formats the numbers in the table so that they show five digits after the decimal point
df_comb.describe().T.apply(lambda x: x.apply('{0:.5f}'.format))