# Preprocessing

This is the third installment of the Capstone 3 project: Preprocessing.

Here are links to the other installments:

Data Wrangle: https://github.com/gisthuband/Capstone_3_Mental_Health_Score_Predictor/blob/main/data_wrangle/data_wrangle.ipynb

Exploratory Data Analysis: https://github.com/gisthuband/Capstone_3_Mental_Health_Score_Predictor/blob/main/exploratory_data_analysis/exploratory_data_analysis.ipynb



This project predicts the mental health scores of individuals based on race/ethnicity, the change in the unemployment rate, and the change in the total number of unemployed people.

Inputs: Race/Ethnicity (`Subgroup`), unemployment rate change (`weighted_unemployment_rate_change`), unemployment change (`total_unemployed_change`)

Outputs: Mental Health Score (`Value`)

In this notebook I build a preprocessing object that can be used in the modeling step of this project.

The object will be able to generate train/test splits of:

1.) A dataframe with only numeric information

2.) A dataframe with a one hot encoding of the categorical variables

3.) A numeric only dataframe whose target is replaced with random values

4.) A one hot encoded dataframe whose target is replaced with random values
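
To make the difference between the numeric only and one hot encoded outputs concrete, here is a minimal sketch on a made-up three-row frame. The column names follow the project's CSV, but the `toy` rows, the `Group A`/`Group B` labels, and all values are invented for illustration. It also uses the `columns=` argument of `pd.get_dummies` as a shortcut, which yields the same kind of columns as the concat-and-drop approach used later in the class.

```python
import pandas as pd

# Toy rows (made up for illustration; not taken from the project data).
toy = pd.DataFrame({
    "Indicator": ["Symptoms of Depressive Disorder",
                  "Symptoms of Anxiety Disorder",
                  "Symptoms of Depressive Disorder"],
    "Subgroup": ["Group A", "Group B", "Group A"],
    "Value": [29.4, 36.3, 27.9],
    "weighted_unemployment_rate_change": [6.2, 6.2, 1.5],
    "total_unemployed_change": [1771000.0, 1771000.0, 250000.0],
})

categorical = ["Indicator", "Subgroup"]

# 1.) Numeric only: drop the categorical columns and the target.
numeric_X = toy.drop(columns=categorical + ["Value"])

# 2.) One hot encoded: swap each categorical column for indicator columns.
one_hot_X = pd.get_dummies(toy.drop(columns="Value"), columns=categorical)

print(list(numeric_X.columns))
print(list(one_hot_X.columns))
```

The numeric only frame keeps just the two unemployment columns, while the one hot frame adds one indicator column per distinct category value.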



## Imports

In [2]:
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv('explored_data_v1.csv')
df = df.drop(columns='Unnamed: 0')

In [6]:
df.head()

,Indicator,Subgroup,Date,Value,Year,weighted_unemployment_rate_change,total_unemployed_change
0,Symptoms of Depressive Disorder,Hispanic or Latino,2020-04-23,29.4,2020,6.168662,1771000.0
1,Symptoms of Anxiety Disorder,Hispanic or Latino,2020-04-23,36.3,2020,6.168662,1771000.0
2,Symptoms of Anxiety Disorder or Depressive Dis...,Hispanic or Latino,2020-04-23,42.7,2020,6.168662,1771000.0
3,Symptoms of Depressive Disorder,Hispanic or Latino,2020-05-07,27.9,2020,6.168662,1771000.0
4,Symptoms of Anxiety Disorder,Hispanic or Latino,2020-05-07,36.2,2020,6.168662,1771000.0


In [15]:
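class preprocess:
    """Builds train/test splits from the explored data CSV.

    Each method takes the CSV path and is called on the class itself,
    e.g. preprocess.dummied('explored_data_v1.csv').
    """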
    def __init__(self, df_path):
        self.df_path = df_path
    
    
    @staticmethod
    def numeric_only(df_path):
        """Split using only the originally numeric columns as features."""
        
        df = pd.read_csv(df_path)

        df = df.drop(columns='Unnamed: 0')

        df = df.drop(columns=['Date','Year','Indicator','Subgroup'])
        
        features = list(df.columns[df.columns != 'Value'])

        X = df[features]
        
        y = df['Value']
        
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=5)
        
        return X_train, X_test, y_train, y_test
    
    
    @staticmethod
    def dummied(df_path):
        """Split with Indicator and Subgroup one hot encoded."""
        
        df = pd.read_csv(df_path)

        df = df.drop(columns='Unnamed: 0')
        
        df = df.drop(columns=['Date','Year'])
        
        dum_df = pd.get_dummies(df[['Indicator','Subgroup']])        
        
        dummed_df = pd.concat([df, dum_df],axis=1)

        dummed_df = dummed_df.drop(columns=['Indicator','Subgroup'])
        
        features = list(dummed_df.columns[dummed_df.columns != 'Value'])
        
        X = dummed_df[features]

        y = dummed_df['Value']

        # No random_state is set here, so this split differs on every call
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)
        
        return X_train, X_test, y_train, y_test
    
    
    @staticmethod
    def random_label_numeric_only(df_path):
        """Numeric only split with the target replaced by random integers."""
        
        df = pd.read_csv(df_path)

        df = df.drop(columns='Unnamed: 0')

        df = df.drop(columns=['Date','Year','Indicator','Subgroup'])
        
        rng = np.random.default_rng()
        
        # Random integer labels in [0, 48); the upper bound is exclusive
        rand_scores = rng.integers(low=0, high=48, size=len(df))
        
        df['random_values'] = rand_scores
        
        df = df.drop(columns='Value')
        
        features = list(df.columns[df.columns != 'random_values'])

        X = df[features]
        
        y = df['random_values']
        
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=6)
        
        return X_train, X_test, y_train, y_test
    
    @staticmethod
    def random_label_dum(df_path):
        """One hot encoded split with the target replaced by random integers."""
        
        df = pd.read_csv(df_path)

        df = df.drop(columns='Unnamed: 0')
        
        df = df.drop(columns=['Date','Year'])
        
        dum_df = pd.get_dummies(df[['Indicator','Subgroup']])        
        
        dummed_df = pd.concat([df, dum_df],axis=1)

        dummed_df = dummed_df.drop(columns=['Indicator','Subgroup'])
        
        rng = np.random.default_rng()
        
        # Random integer labels in [0, 48); the upper bound is exclusive
        rand_scores = rng.integers(low=0, high=48, size=len(df))
        
        dummed_df['random_values'] = rand_scores
        
        dummed_df = dummed_df.drop(columns='Value')
        
        features = list(dummed_df.columns[dummed_df.columns != 'random_values'])

        X = dummed_df[features]
        
        y = dummed_df['random_values']
        
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=9)
        
        return X_train, X_test, y_train, y_test
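
The four methods above share most of their steps. As a design note, the sketch below shows one way they could be collapsed into a single function. `make_split` and its `one_hot`, `random_labels`, and `seed` parameters are hypothetical names that do not appear in the notebook, and the function takes a dataframe rather than a path so that it can run on the made-up `demo` frame. Unlike the class, it seeds the random generator so the random labels repeat between runs.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def make_split(df, one_hot=False, random_labels=False, seed=0):
    """Hypothetical helper covering the four cases handled by the class."""
    df = df.drop(columns=["Date", "Year"])
    categorical = ["Indicator", "Subgroup"]
    if one_hot:
        # Replace the categorical columns with indicator columns.
        df = pd.get_dummies(df, columns=categorical)
    else:
        df = df.drop(columns=categorical)

    X = df.drop(columns="Value")
    if random_labels:
        # Integers in [0, 48); the upper bound is exclusive.
        rng = np.random.default_rng(seed)
        y = pd.Series(rng.integers(low=0, high=48, size=len(df)),
                      index=df.index, name="random_values")
    else:
        y = df["Value"]
    return train_test_split(X, y, test_size=0.2, random_state=seed)


# Made-up demo frame with the same column names as the project CSV.
demo = pd.DataFrame({
    "Indicator": ["A", "B"] * 5,
    "Subgroup": ["G1", "G2", "G3", "G4", "G5"] * 2,
    "Date": ["2020-04-23"] * 10,
    "Value": np.linspace(20.0, 40.0, 10),
    "Year": [2020] * 10,
    "weighted_unemployment_rate_change": [6.2] * 10,
    "total_unemployed_change": [1771000.0] * 10,
})

X_train, X_test, y_train, y_test = make_split(demo, one_hot=True, random_labels=True)
print(X_train.shape, X_test.shape)
```

Keeping the flags as keyword arguments means a new variant only needs a new flag rather than a new copy of the loading and splitting code.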

In [16]:
trial_real = preprocess.dummied('explored_data_v1.csv')

In [18]:
for x in range(4):
    print(trial_real[x].shape)

(556, 9)
(140, 9)
(556,)
(140,)
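
The shapes above follow from `test_size=.2`. Assuming the CSV has 696 rows (556 + 140, inferred from the printed shapes rather than stated in the notebook), scikit-learn rounds the test share up to a whole number of rows and gives the remainder to the training set. The 9 columns are the 2 numeric features plus 7 one hot columns. A quick check of the row arithmetic on a stand-in array of indices:

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 696  # assumption: 556 + 140, inferred from the printed shapes

# Split a stand-in array of row indices with the same test_size as the class.
train_idx, test_idx = train_test_split(np.arange(n_rows), test_size=0.2, random_state=0)

print(len(train_idx), len(test_idx))
print(math.ceil(0.2 * n_rows))
```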


In [27]:
trial_real[2]

78     29.5
83     39.6
690    22.3
222    23.3
154    25.9
       ... 
298    26.6
29     43.4
151    29.1
163    31.6
97     30.2
Name: Value, Length: 556, dtype: float64

In [20]:
trial_fake = preprocess.random_label_dum('explored_data_v1.csv')

In [22]:
for x in range(4):
    print(trial_fake[x].shape)

(556, 9)
(140, 9)
(556,)
(140,)


In [25]:
trial_fake[2]

88     27
21     18
608     6
276    45
504    13
       ..
56     31
501     0
638    14
348     0
382     9
Name: random_values, Length: 556, dtype: int64
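
The random labels above come from `rng.integers(low=0, high=48, ...)`. Two properties are worth keeping in mind: `high` is exclusive, so the values fall between 0 and 47, and an unseeded generator gives different labels on each call. A small sketch of both points follows; the seed value 42 and the sample size of 1000 are arbitrary choices for illustration and do not appear in the notebook.

```python
import numpy as np

# Unseeded, as in the class: labels differ from run to run.
unseeded = np.random.default_rng().integers(low=0, high=48, size=1000)

# Seeded (seed chosen arbitrarily here): labels repeat exactly.
first = np.random.default_rng(42).integers(low=0, high=48, size=1000)
second = np.random.default_rng(42).integers(low=0, high=48, size=1000)

print(unseeded.min() >= 0, unseeded.max() <= 47)
print(np.array_equal(first, second))
```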

## Summary:

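This notebook contains the class that will be used to generate the train/test splits for the models in the next step.

The class can return a dataframe containing only the originally numeric information.

It can also return a dataframe containing a one hot encoding of the categorical information.

Both versions are also available with the mental health score replaced by random values.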