# Classification Models and Project Intro

For the Phase 3 project, you get to choose your own dataset. We provide a couple of recommended datasets that you can look at, but I recommend choosing a dataset that fits your career goals. Here are a few places to start looking. Be sure to discuss your choice of dataset with your instructor.

- [Kaggle](https://kaggle.com) 
- [Data World](https://data.world/datasets/data)
- [Google Dataset Search](https://datasetsearch.research.google.com/)

Today, we will look at how to approach a classification problem and how to frame a business problem around it. We will also look at:
- building a baseline model,
- improving its performance using grid search, and
- implementing pipelines to improve the quality of predictions and reduce manual errors.

# Dataset: Pima Indians Diabetes

We will use a toy dataset for today's guided practice. A toy dataset is a small, standard dataset that is generally used for benchmarking algorithms or for getting everything set up quickly. Here is a [link](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database) to the dataset page. Because this dataset is very simple, we will avoid many of the problems we usually have to deal with, and it requires only minimal preprocessing.

**Attribute Information:**

- Number of times pregnant
- Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- Diastolic blood pressure (mm Hg)
- Triceps skin fold thickness (mm)
- 2-Hour serum insulin (mu U/ml)
- Body mass index (weight in kg/(height in m)^2)
- Diabetes pedigree function
- Age (years)
- Class variable (0 or 1)

# Business case

The business problem here could be a governmental health agency trying to understand the leading causes of diabetes so that it can decide on advertisements that educate the public and raise awareness. It could also be about predicting which population groups might be at risk of diabetes, so that good preventive measures can be put in place as soon as possible.

# Code 

The code is included below this point. Here are the steps we will follow:
- Import libraries and data
- Basic EDA
- Baseline Model
- GridSearchCV on baseline model
- Building a pipeline

## Imports 

In [1]:
#libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [2]:
data = pd.read_csv("diabetes.csv")
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


Quick observations:

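1. All datatypes look good for modeling, i.e., there are no `object` columns
2. No null values (zeros in columns such as `Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, and `BMI` may still stand in for missing measurements, so they are worth a look)
3. The target variable is `Outcome`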

In [3]:
#display first 5 rows of df


In [4]:
#check class imbalance


In [5]:
#get a quick idea about the data 


## EDA

In [6]:
#do 3 EDA plots
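
One possible set of three EDA plots is sketched below: class counts, the distribution of one feature split by outcome, and the correlation of each feature with the target. The choice of plots and the helper name `plot_eda` are assumptions for illustration, and the small `demo` frame is synthetic stand-in data with made-up values. The sketch uses only pandas and matplotlib; seaborn functions such as `sns.countplot`, `sns.histplot`, and `sns.heatmap` would give similar views.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in with a few of the dataset's columns (made-up values)
rng = np.random.default_rng(1)
n = 200
demo = pd.DataFrame({
    "Glucose": rng.integers(70, 200, n),
    "BMI": rng.uniform(18, 45, n).round(1),
    "Age": rng.integers(21, 80, n),
})
demo["Outcome"] = (demo["Glucose"] + rng.normal(0, 25, n) > 140).astype(int)


def plot_eda(df, feature="Glucose", target="Outcome"):
    """Draw three quick EDA plots side by side and return the figure."""
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # Plot 1: how many rows fall in each class
    df[target].value_counts().sort_index().plot(kind="bar", ax=axes[0])
    axes[0].set_title("Class counts")

    # Plot 2: distribution of one feature, split by class
    for label, group in df.groupby(target):
        axes[1].hist(group[feature], bins=20, alpha=0.6, label=f"{target}={label}")
    axes[1].set_title(f"{feature} by {target}")
    axes[1].legend()

    # Plot 3: correlation of every feature with the target
    corr = df.corr()[target].drop(target).sort_values()
    corr.plot(kind="barh", ax=axes[2])
    axes[2].set_title(f"Correlation with {target}")

    fig.tight_layout()
    return fig


fig = plot_eda(demo)
```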

## Baseline Model

In [7]:
#do a train-test split


In [8]:
#build a baseline model - choose any one you want


In [9]:
#get predictions from baseline


In [10]:
#build a confusion matrix and classification report
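
The baseline steps above (split, fit, predict, evaluate) could look roughly like this. It is a sketch under stated assumptions: `RandomForestClassifier` is used because the notebook imports it, but any classifier would do, and the 75/25 stratified split and `random_state` values are illustrative choices. The data is a synthetic stand-in generated with `make_classification` and given the dataset's column names, so the numbers it prints say nothing about the real data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for diabetes.csv (real column names, made-up values)
feature_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                 "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]
X_arr, y_arr = make_classification(n_samples=300, n_features=8,
                                   n_informative=4, random_state=0)
demo = pd.DataFrame(X_arr, columns=feature_names)
demo["Outcome"] = y_arr

# Separate features and target, then hold out a test set
X = demo.drop(columns="Outcome")
y = demo["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Baseline: a random forest with default hyperparameters
baseline = RandomForestClassifier(random_state=42)
baseline.fit(X_train, y_train)

# Predictions on the held-out test set
y_pred = baseline.predict(X_test)

# Rows of the confusion matrix are true classes, columns are predicted classes
baseline_cm = confusion_matrix(y_test, y_pred)
print(baseline_cm)
print(classification_report(y_test, y_pred))
```

Passing `stratify=y` keeps the class proportions the same in the train and test sets, which matters when the classes are imbalanced.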


## Tuning the model with GridSearchCV

In [11]:
#define a param grid 


In [12]:
#fit the gridsearch


In [13]:
#print the best parameters


In [14]:
#get predictions from the gridsearch


In [15]:
#print confusion matrix and classification report
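
The tuning cells above could be sketched as follows. The parameter grid, the 3-fold cross-validation, and the `scoring="f1"` choice are all assumptions picked to keep the example small and fast; a real search would use a grid suited to the chosen model. As before, the data is a synthetic stand-in with the dataset's column names and made-up values.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for diabetes.csv (real column names, made-up values)
feature_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                 "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]
X_arr, y_arr = make_classification(n_samples=300, n_features=8,
                                   n_informative=4, random_state=0)
X = pd.DataFrame(X_arr, columns=feature_names)
y = pd.Series(y_arr, name="Outcome")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Define a param grid: every combination of these values is tried
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, None],
    "min_samples_leaf": [1, 5],
}

# Fit the grid search: each combination is scored with 3-fold cross-validation
# on the training data only, so the test set stays untouched
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="f1",
)
grid.fit(X_train, y_train)

# Print the best parameters and their mean cross-validated score
print(grid.best_params_)
print(grid.best_score_)

# Get predictions: by default GridSearchCV refits the best model on all of
# the training data, and predict() uses that refitted model
y_pred_tuned = grid.predict(X_test)

tuned_cm = confusion_matrix(y_test, y_pred_tuned)
print(tuned_cm)
print(classification_report(y_test, y_pred_tuned))
```

Comparing this report with the baseline report on the same test set gives the material for the observations cell.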

In [16]:
#write observations

## Converting to a pipeline

In [17]:
#fitting a basic pipeline


In [18]:
#get predictions from pipeline


In [19]:
#print confusion matrix and classification report


In [20]:
#adding scaling to our pipeline


In [21]:
#get predictions from pipeline


In [22]:
#print confusion matrix and classification report
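
The two pipeline variants above, a basic one and one with scaling added, could be sketched like this. The step names `"scaler"` and `"model"` are arbitrary labels chosen for illustration, and the data is again a synthetic stand-in with the dataset's column names and made-up values.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for diabetes.csv (real column names, made-up values)
feature_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                 "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]
X_arr, y_arr = make_classification(n_samples=300, n_features=8,
                                   n_informative=4, random_state=0)
X = pd.DataFrame(X_arr, columns=feature_names)
y = pd.Series(y_arr, name="Outcome")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Basic pipeline: a single step that wraps the model
basic_pipe = Pipeline([
    ("model", RandomForestClassifier(random_state=42)),
])
basic_pipe.fit(X_train, y_train)
basic_pred = basic_pipe.predict(X_test)
print(confusion_matrix(y_test, basic_pred))
print(classification_report(y_test, basic_pred))

# Pipeline with scaling: the scaler is fit on the training data only,
# and the same fitted transformation is applied at prediction time
scaled_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(random_state=42)),
])
scaled_pipe.fit(X_train, y_train)
scaled_pred = scaled_pipe.predict(X_test)
scaled_cm = confusion_matrix(y_test, scaled_pred)
print(scaled_cm)
print(classification_report(y_test, scaled_pred))
```

Scaling usually changes little for tree-based models such as random forests; it tends to matter more for distance-based or gradient-based models. The main benefit here is that preprocessing and model are fit together, which avoids applying a transformation by hand to one split and forgetting it on the other.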
