# Classification Models Practice

Here are some places to look for a dataset if you want to practice classification algorithms:

- [Kaggle](https://kaggle.com) 
- [Data World](https://data.world/datasets/data)
- [Google Dataset Search](https://datasetsearch.research.google.com/)

Today, we will look at how to approach a classification problem and frame it as a business problem. We will also cover:
- building a baseline model
- improving its performance using hyperparameter tuning
- drawing some insights from the model where possible

# Dataset: Pima Indians Diabetes

We will use a toy dataset for today's guided practice. A toy dataset is a small, standard dataset that is generally used for benchmarking algorithms or for getting everything set up quickly. Here is a [link](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database) to the dataset page. Because this dataset is very simple, it avoids most of the problems we usually have to deal with and requires only minimal preprocessing.

**Attribute Information:**

- Number of times pregnant
- Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- Diastolic blood pressure (mm Hg)
- Triceps skin fold thickness (mm)
- 2-Hour serum insulin (mu U/ml)
- Body mass index (weight in kg/(height in m)^2)
- Diabetes pedigree function
- Age (years)
- Class variable (0 or 1)

# Business case

The business problem here could be a governmental health agency trying to understand the leading causes of diabetes so that it can decide on advertisements that educate the public and raise awareness. It could also be about predicting which population groups might be at risk of diabetes so that good preventive measures are put in place as soon as possible.

## Define true positives, false positives, true negatives and false negatives
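To make the four outcomes concrete, here is a minimal sketch that counts them on a handful of labels. It assumes `1` (diabetes) is the positive class, as in the `Outcome` column; the two label lists are made up purely for illustration.

```python
# Hypothetical labels for illustration only: 1 = diabetes, 0 = no diabetes
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # diabetic, predicted diabetic
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-diabetic, predicted diabetic
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-diabetic, predicted non-diabetic
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # diabetic, missed by the model

print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
```

In the business case above, a false negative is a person with diabetes whom the model misses, and a false positive is a person flagged for follow-up who does not need it.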

## Question - Comment on the metric you think will be most valuable to look at 
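As a starting point for the discussion, the sketch below derives the usual candidate metrics from confusion-matrix counts. The counts are hypothetical. If missing an at-risk person is costlier than a false alarm, recall is often the number to watch, but that is a judgment call for the business case rather than a rule.

```python
# Hypothetical confusion-matrix counts, for illustration only
tp, fp, tn, fn = 60, 20, 80, 40

accuracy = (tp + tn) / (tp + fp + tn + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # of those flagged as diabetic, share that truly are
recall = tp / (tp + fn)                     # of the truly diabetic, share the model found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```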

# Code 

The code is included below this point. Here are the steps we will follow:
- Import libraries and data
- Basic EDA
- Baseline Model
- Hyperparameter Tuning
- Building a pipeline

## Imports 

In [4]:
#libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
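# plot_roc_curve and plot_confusion_matrix were removed in scikit-learn 1.2; use the Display classes
from sklearn.metrics import accuracy_score, RocCurveDisplay, ConfusionMatrixDisplay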

In [2]:
data = pd.read_csv("diabetes.csv")
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


Quick observations:

1. All datatypes look good for modeling, i.e. there are no `object` columns
2. No missing values (every column has 768 non-null entries)
3. The target variable is `Outcome`

In [16]:
#display first 5 rows of df
data.head()

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72             35        0  33.6                     0.627   50        1
1            1       85             66             29        0  26.6                     0.351   31        0
2            8      183             64              0        0  23.3                     0.672   32        1
3            1       89             66             23       94  28.1                     0.167   21        0
4            0      137             40             35      168  43.1                     2.288   33        1


In [6]:
#check class imbalance
data.Outcome.value_counts()

0    500
1    268
Name: Outcome, dtype: int64

In [None]:
#get a quick idea about the data -> summary statistics
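A possible way to fill in this cell is `describe()`. The sketch below runs it on a stand-in DataFrame built from only the five rows printed by `data.head()`, so that it works without the CSV; in the notebook you would call `data.describe()` on the full DataFrame. It also counts zeros per column, on the assumption that zeros in measurements such as `Insulin` or `SkinThickness` may be placeholders for missing readings.

```python
import pandas as pd

# Stand-in for `data`: only the five rows shown by data.head()
sample = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "DiabetesPedigreeFunction": [0.627, 0.351, 0.672, 0.167, 2.288],
    "Age": [50, 31, 32, 21, 33],
    "Outcome": [1, 0, 1, 0, 1],
})

summary = sample.describe()        # count, mean, std, min, quartiles, max per column
zero_counts = (sample == 0).sum()  # number of zeros in each column

print(summary.T)
print(zero_counts)
```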


In [1]:
# comment on the class imbalance: will it affect the results, and do we need
# to do anything about it?
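The imbalance can be quantified directly from the `value_counts()` output. This small sketch uses only those two counts and shows the accuracy a model would get by always predicting the majority class, which is a useful floor to compare any classifier against.

```python
# Class counts taken from the value_counts() output
counts = {0: 500, 1: 268}
total = sum(counts.values())

shares = {label: n / total for label, n in counts.items()}
majority_baseline = max(counts.values()) / total  # accuracy of always predicting class 0

print(shares)
print(f"majority-class accuracy: {majority_baseline:.3f}")
```

With roughly a 65/35 split the imbalance is moderate, so accuracy on its own can look better than the model really is; per-class precision and recall are worth checking as well.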

## EDA

In [15]:
#do 3 EDA plots
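One possible set of three plots is a class-count bar chart, a feature histogram split by class, and a correlation heatmap. The sketch below uses plain matplotlib on a synthetic stand-in DataFrame (the numbers are made up and are not the real data) so that it runs without `diabetes.csv`; in the notebook, replace `df` with `data`.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in, NOT the real data: three made-up features plus a binary outcome
rng = np.random.default_rng(0)
outcome = rng.binomial(1, 0.35, size=768)
df = pd.DataFrame({
    "Glucose": rng.normal(110 + 30 * outcome, 25),
    "BMI": rng.normal(30 + 4 * outcome, 6),
    "Age": rng.normal(31 + 6 * outcome, 10).clip(21),
    "Outcome": outcome,
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Plot 1: class balance
df["Outcome"].value_counts().sort_index().plot.bar(ax=axes[0], title="Outcome counts")

# Plot 2: distribution of one feature, split by class
for label, group in df.groupby("Outcome"):
    axes[1].hist(group["Glucose"], bins=30, alpha=0.6, label=f"Outcome={label}")
axes[1].set_title("Glucose by outcome")
axes[1].legend()

# Plot 3: correlation matrix drawn as a heatmap
corr = df.corr()
im = axes[2].imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
axes[2].set_xticks(range(len(corr)))
axes[2].set_xticklabels(corr.columns, rotation=45, ha="right")
axes[2].set_yticks(range(len(corr)))
axes[2].set_yticklabels(corr.columns)
axes[2].set_title("Correlations")
fig.colorbar(im, ax=axes[2])
fig.tight_layout()
```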

## Logistic Regression Model

In [None]:
#do a train/test split
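A minimal sketch of the split, assuming an 80/20 ratio and stratification on the target (both are choices, not requirements). To stay runnable without the CSV it uses a synthetic stand-in from `make_classification` with the same shape as the diabetes data; in the notebook you would use `X = data.drop(columns="Outcome")` and `y = data["Outcome"]` instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in (768 rows, 8 features, roughly 65/35 classes), NOT the real data
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           weights=[0.65, 0.35], random_state=42)

# stratify=y keeps the class ratio similar in the training and test subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
print(f"positive share: train={y_train.mean():.3f} test={y_test.mean():.3f}")
```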


In [None]:
#build a baseline model 


In [None]:
#get predictions from baseline


In [None]:
#build a confusion matrix and classification report
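The baseline steps above (fit, predict, evaluate) could look roughly like the following sketch. It wraps `StandardScaler` and `LogisticRegression` in a `Pipeline`, which is one reasonable setup rather than the only one, and it again runs on a synthetic stand-in from `make_classification` instead of the real CSV.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes data, NOT the real dataset
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           weights=[0.65, 0.35], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling happens inside the pipeline, so it is fitted on the training data only
baseline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
baseline.fit(X_train, y_train)
y_pred = baseline.predict(X_test)

cm = confusion_matrix(y_test, y_pred)  # rows: actual class, columns: predicted class
print(cm)
print(classification_report(y_test, y_pred))
```

With scikit-learn's convention, `cm[1, 0]` holds the false negatives and `cm[0, 1]` the false positives.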


## Decision Tree Model

In [None]:
#define a decision tree


In [None]:
#fit the model


In [None]:
#get predictions from the model


In [None]:
#print confusion matrix and classification report

In [None]:
#comment about the fit of the model
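One way to judge the fit is to compare training and test accuracy. An unconstrained tree can keep splitting until it classifies every training row correctly, so a large gap between the two numbers suggests overfitting. The sketch below shows the comparison on a synthetic stand-in from `make_classification`, not the real data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the diabetes data, NOT the real dataset
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           weights=[0.65, 0.35], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

tree = DecisionTreeClassifier(random_state=42)  # no depth limit
tree.fit(X_train, y_train)

train_acc = accuracy_score(y_train, tree.predict(X_train))
test_acc = accuracy_score(y_test, tree.predict(X_test))
print(f"depth={tree.get_depth()} train={train_acc:.3f} test={test_acc:.3f}")
```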

## Pruning the decision tree

In [None]:
#prune the decision tree using the max_depth
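A sketch of depth-based pruning follows. `max_depth=3` is an arbitrary value chosen for illustration, the feature names `f0` to `f7` are hypothetical, and the data is a synthetic stand-in. Printing the rules with `export_text` is one way to get the model insights mentioned at the start.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the diabetes data, NOT the real dataset
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           weights=[0.65, 0.35], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Capping the depth stops the tree from memorising the training rows
pruned = DecisionTreeClassifier(max_depth=3, random_state=42)
pruned.fit(X_train, y_train)

print(classification_report(y_test, pruned.predict(X_test)))

# Human-readable split rules; the feature names are placeholders
feature_names = [f"f{i}" for i in range(X.shape[1])]
rules = export_text(pruned, feature_names=feature_names)
print(rules)
```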


In [None]:
#get predictions from the pruned model


In [None]:
#print confusion matrix and classification report


In [None]:
#comment about the fit of the model
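Choosing `max_depth` by hand is a simple form of hyperparameter tuning; `GridSearchCV` automates the search with cross-validation. In the sketch below, the parameter grid, the 5-fold setting and the use of recall as the scoring metric are all assumptions made for illustration, and the data is again a synthetic stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the diabetes data, NOT the real dataset
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           weights=[0.65, 0.35], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Candidate values are illustrative; None means no depth limit
param_grid = {
    "max_depth": [2, 3, 4, 5, 6, None],
    "min_samples_leaf": [1, 5, 10],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    scoring="recall",  # assumed priority: catch as many positive cases as possible
    cv=5,
)
search.fit(X_train, y_train)

print(search.best_params_)
print(f"cross-validated recall: {search.best_score_:.3f}")
print(f"test recall: {search.score(X_test, y_test):.3f}")
```

Only the training split is used for the search, so the test split stays untouched until the final check.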
