# Classification Modeling

- This project asks whether we can predict if someone experiences food insecurity. Because the outcome is binary (a person is either food secure or not), classification modeling will be used to generate the predictions.
- The models that will be explored are Random Forest, Logistic Regression, Decision Tree, and Bagged Decision Tree classifiers.
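
As a rough sketch of how these four models might be compared, the snippet below scores each one with 5-fold cross-validation. The synthetic data from `make_classification`, the hyperparameter values, and the `models` and `cv_scores` names are assumptions made for illustration; they are not the project's data or final settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features and the food_secure target (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Illustrative hyperparameters only; none of these values come from the project.
# BaggingClassifier uses a decision tree as its default base estimator.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "Bagged Decision Trees": BaggingClassifier(n_estimators=25, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Mean 5-fold cross-validated accuracy for each model.
cv_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
```

Accuracy is used here only because it is the simplest metric to read; with an imbalanced target, recall or precision may be a more informative choice for `scoring`.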

### Libraries and Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV

from sklearn.pipeline import Pipeline

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report, recall_score, precision_score)

from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

In [2]:
df = pd.read_csv('./data/cleaned_data.csv')

In [3]:
df.shape

(48870, 33)

In [4]:
df.head()

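|   | HEFAMINC | HRNUMHOU | PRTAGE | PERET1 | PEHRUSLT | PRNMCHLD | QSTNUM | HESP6 | HESP7 | HESP7A | ... | race | marital_status | food_pantry | has_dis | service_status | job_loss | type_job | in_union | in_school | has_stamps |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100000 | 4 | 30 | 0 | 60 | 2 | 1 | 0 | 0 | 0 | ... | White | yes | Unknown | No | No | No | ForProf | No | No | no |
| 1 | 75000 | 3 | 20 | 0 | 40 | 1 | 2 | 0 | 0 | 0 | ... | White | yes | Unknown | No | No | No | ForProf | No | College | no |
| 2 | 25000 | 3 | 20 | 0 | 0 | 1 | 5 | 0 | 0 | 2 | ... | White | yes | Yes | No | No | No | 0 | No | No | no |
| 3 | 50000 | 1 | 20 | 0 | 40 | 0 | 6 | 0 | 0 | 0 | ... | Black | no | Yes | No | No | No | ForProf | No | No | no |
| 4 | 150000 | 6 | 40 | 0 | 32 | 0 | 7 | 0 | 0 | 0 | ... | White | yes | Yes | No | No | No | Self-emp | No | No | no |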


### Feature Engineering

- Since I had to convert many of my columns back to their categorical values for data analysis, I will take some steps below to convert them back to numerical values for modeling.
- Any column that has only yes or no values will be changed to 1 for yes and 0 for no.
- The other categorical columns will be either dummified (one-hot encoded) or target encoded, depending on the values in each column.
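
The three encoding approaches listed above can be sketched on a small made-up frame. The `toy` data, the choice of `region` for dummy encoding, and the choice of `state` for target encoding are assumptions for illustration only; the notebook does not say here which columns receive which treatment.

```python
import pandas as pd

# Toy frame: column names mirror the notebook, but the values are made up.
toy = pd.DataFrame({
    "has_stamps": ["yes", "no", "no", "yes", "no", "no"],
    "region": ["South", "West", "South", "West", "South", "West"],
    "state": ["TX", "CA", "TX", "CA", "FL", "FL"],
    "food_secure": [0, 1, 1, 1, 0, 1],
})

# 1) Binary yes/no column: explicit mapping to 1/0.
toy["has_stamps"] = toy["has_stamps"].map({"yes": 1, "no": 0})

# 2) Dummy (one-hot) encoding, dropping the first level to avoid a redundant column.
toy = pd.get_dummies(toy, columns=["region"], drop_first=True, dtype=int)

# 3) Target encoding: replace each category with the mean of the target for that category.
state_means = toy.groupby("state")["food_secure"].mean()
toy["state_encoded"] = toy["state"].map(state_means)
toy = toy.drop(columns="state")

print(toy)
```

In a real workflow the target-encoding means would normally be computed on the training split only and then applied to the test split, so that information from the test labels does not leak into the features.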

In [5]:
df.replace(['yes', 'no'], [1, 0], inplace=True)

In [6]:
df.dtypes

HEFAMINC           int64
HRNUMHOU           int64
PRTAGE             int64
PERET1             int64
PEHRUSLT           int64
PRNMCHLD           int64
QSTNUM             int64
HESP6              int64
HESP7              int64
HESP7A             int64
HESP8              int64
HESS1              int64
HESH4              int64
HESC1              int64
HESC2              int64
HESC3              int64
food_secure        int64
state             object
is_metro          object
region            object
division          object
sex               object
education         object
race              object
marital_status     int64
food_pantry       object
has_dis           object
service_status    object
job_loss          object
type_job          object
in_union          object
in_school         object
has_stamps         int64
dtype: object

In [7]:
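# Assign the result back: replace(..., inplace=True) on a selected column may not update df in newer pandas.
df['is_metro'] = df['is_metro'].replace(['Metro', 'No'], [1, 0])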

In [11]:
# Assign the result back rather than relying on inplace=True on a selected column.
df['has_dis'] = df['has_dis'].replace(['Yes', 'No'], [1, 0])

In [9]:
# College and HS both count as being in school (1); No becomes 0.
df['in_school'] = df['in_school'].replace(['College', 'HS', 'No'], [1, 1, 0])

In [12]:
df['has_dis'].value_counts()

0    46967
1     1903
Name: has_dis, dtype: int64
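
Beyond checking one column's counts, a short check like the one below can list which columns still hold non-numeric values and show how skewed a converted column is. This is a minimal sketch on a made-up frame; `toy`, `non_numeric`, and `share` are hypothetical names, and the values are not taken from the dataset.

```python
import pandas as pd

# Toy frame: two columns already converted to 0/1, one still categorical.
toy = pd.DataFrame({
    "has_dis": [0, 0, 1, 0],
    "is_metro": [1, 0, 1, 1],
    "type_job": ["ForProf", "Self-emp", "ForProf", "0"],
})

# Columns that still need encoding before modeling.
non_numeric = [col for col in toy.columns
               if not pd.api.types.is_numeric_dtype(toy[col])]
print(non_numeric)

# Proportion of each class in a converted column.
share = toy["has_dis"].value_counts(normalize=True)
print(share)
```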