# Project Title



## Project Canvas

### Background



Describe the goals and pains

### Value Proposition



Propose the product with the value it creates and the pains it alleviates




### Objective

Break down the product into the key objectives that need to be delivered

### Methodology and Approach

Define the methodology and approach. A typical sequence of steps:

1. Perform some ***basic data analysis and exploratory analysis***.
2. Decide the ***models that will be used***.
  - Supervised - Classification
  - Supervised - Regression
  - Unsupervised
3. Define which ***metrics will be used*** to judge model performance.
4. Perform ***model data pre-processing and preparation***.
  - Data Cleansing (missing values, duplicates, etc.)
  - Feature Engineering
  - Split data into training, test/validation datasets
5. ***Train models***.
  - For each model type:
    - First use *default parameters*, then score against the test/validation dataset.
    - Then, perform *hyper-parameter tuning* to find the best parameters, retrain and rescore.
    - Select the best parameter set for each model.
6. Select the ***best model*** based on the metrics you have defined
7. Give the observations / feedback / recommendations
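
Steps 5 and 6 can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the synthetic data from `make_classification`, the decision tree, the small parameter grid, and recall as the metric are all assumptions chosen to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, mildly imbalanced data standing in for the project dataset (assumption)
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.7, 0.3], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# Baseline: default parameters, scored on the held-out set
baseline = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
baseline_recall = recall_score(y_test, baseline.predict(X_test))

# Tuning: search a small illustrative grid with 5-fold cross-validation,
# then rescore the refitted best estimator on the same held-out set
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid={"max_depth": [2, 4, 6], "min_samples_leaf": [2, 10]},
    scoring="recall",
    cv=5,
)
grid.fit(X_train, y_train)
tuned_recall = recall_score(y_test, grid.best_estimator_.predict(X_test))

print(f"baseline recall: {baseline_recall:.3f}")
print(f"tuned recall:    {tuned_recall:.3f} with {grid.best_params_}")
```

Repeating this for each model type and comparing the tuned scores gives the inputs for step 6.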

### Data Description



The detailed data dictionary is given below

**Data Dictionary**

Text goes here

### Selecting Models

### Selecting Metrics

### Set the scoring method used to optimize models when tuning with GridSearchCV

In [None]:
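# Set the scoring function used by all algorithms in GridSearchCV
# (imported here because this cell runs before the main import cell)
from sklearn.metrics import recall_score
optimize_on = recall_score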

### Importing necessary libraries and data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import matplotlib.gridspec as gridspec
import math

import warnings
warnings.filterwarnings("ignore")

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)

# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)

# To build models for prediction
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier

from joblib import parallel_backend
    
# To encode categorical variables
from sklearn.preprocessing import LabelEncoder

# For tuning the model
from sklearn.model_selection import GridSearchCV

# To get different metric scores
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report, make_scorer, recall_score, roc_curve, auc, accuracy_score

# Import system 
import sys

### Import helper functions to be used later

In [None]:
# Append the directory to your python path using sys
sys.path.append('../utilities')

# Import the utils.py file
import utils

### Read in the dataset

In [None]:
# Read the dataset file
data = pd.read_csv('../datasets/filename_here.csv')

In [None]:
# Copying data to another variable to avoid any changes to original data
same_data = data.copy()

### Data Overview



- Observations
- Sanity checks

In [None]:
# View the first 5 rows of the dataset
data.head()

In [None]:
# Understand the shape of the data
data.shape

In [None]:
# Checking the info of the data
data.info()

In [None]:
# Check for duplicate rows
data.duplicated().sum()

In [None]:
# Checking the descriptive statistics of the columns
data.describe().T

**Observations**
- Text goes here to summarize the section


## Exploratory Data Analysis (EDA)

### **Univariate Analysis**

- For categorical columns, we will look at value counts for leads as a whole.
- For numerical columns, we will look at distributions for leads as a whole.

#### Check our target variable for balance

There are various degrees of imbalance, and what is considered imbalanced can depend on the specific problem, the total size of the dataset, and the number of classes involved. Here are some general guidelines to understand when a dataset is considered imbalanced:

**Minor Imbalance:** This is when the class distribution is slightly off from being equal but not to a degree that severely impacts model performance. An example could be a binary classification problem with a class distribution of 60% for one class and 40% for the other.

**Moderate Imbalance:** Here, the imbalance starts to become more pronounced. For instance, in a binary classification problem, a distribution of 70% for one class and 30% for the other might be considered moderately imbalanced.

**Severe Imbalance:** This is when the class distribution is highly skewed, making one class significantly underrepresented. An example would be having 90% of the data belonging to one class and only 10% to the other in a binary classification scenario.

The specific threshold at which an imbalance becomes problematic varies, but generally, datasets where one class represents less than 20% of the total can start to introduce significant challenges for many standard machine learning models. These challenges include the model's tendency to overpredict the majority class, as doing so can still achieve high accuracy despite poor minority class performance.
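
As a rough aid, the guidelines above can be turned into a small check on the minority-class share. The helper `imbalance_level` and its cut-offs (0.40 and 0.20) are assumptions derived from the examples in the text, not fixed rules, and the `status` series is made-up data.

```python
import pandas as pd


def imbalance_level(target: pd.Series) -> tuple[float, str]:
    """Return the minority-class share and a rough imbalance label."""
    minority_share = float(target.value_counts(normalize=True).min())
    if minority_share >= 0.40:      # e.g. 60/40 or closer to even
        label = "balanced or minor"
    elif minority_share >= 0.20:    # e.g. 70/30
        label = "moderate"
    else:                           # e.g. 90/10
        label = "severe"
    return minority_share, label


# Hypothetical target column with a 70/30 split
status = pd.Series([0] * 70 + [1] * 30, name="status")
share, label = imbalance_level(status)
print(f"minority share = {share:.2f} -> {label}")
```

The cut-offs should be adjusted to the problem at hand, since what counts as problematic depends on dataset size and the number of classes.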

In [None]:
utils.bar_perc(data, 'status', figsize=[6,5])

**Text Example:** This dataset is ***moderately imbalanced***. We will need to check whether setting class weights changes model performance compared with the default (no weights).
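
One possible way to derive class weights, sketched under the assumption of a binary 70/30 target: scikit-learn's `compute_class_weight` with `"balanced"` weights each class by `n_samples / (n_classes * class_count)`. The array `y` below is made-up data standing in for the `status` column.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical 70/30 target standing in for 'status'
y = np.array([0] * 70 + [1] * 30)

# "balanced" gives the rarer class a proportionally larger weight
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}
print(class_weight)

# The dictionary can be passed to any estimator that accepts class_weight
weighted_tree = DecisionTreeClassifier(class_weight=class_weight, random_state=1)
```

Training one model with these weights and one with the default makes the comparison described above concrete.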

#### Category Columns

In [None]:
# Saving column names with 'object' datatype to a list - can be used later in EDA
cat_col = data.select_dtypes(include=['object']).columns.tolist()

**Counts**

In [None]:
utils.countplot_grid(data=data, cols=cat_col)

**Observations**
- Text goes here

#### Numerical Columns

In [None]:
# Saving column names with 'int64' or 'float64' datatype to a list - used later in EDA
num_col = data.select_dtypes(include=['int64','float64']).columns.tolist()

**Distributions**

In [None]:
utils.histogram_boxplot_grid(num_col, data, 3)

**Observations**
- Text goes here

### **Bi-Variate Analysis**

#### **Category columns**




#### **Distribution Statistical Testing**

1. We will look at **distributions and perform hypothesis tests** (ANOVA, Kruskal-Wallis, etc.) to determine whether each numerical variable differs across the categories. We will also plot **boxplots where the distributions are statistically different**.

  - For the hypothesis testing we will use the following setup:
    - H<sub>0</sub> = the distributions are the same
    - H<sub>a</sub> = the distributions are not the same
    - $\alpha$ = 0.05


  If the p-value of the test is less than $\alpha$, we reject H<sub>0</sub> and conclude that the distributions are different.

2. We will also look at **count plots for the target variable** ***'status'***, since it is a boolean value and distribution plots do not always reveal a difference.
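
The test setup in item 1 can be illustrated with SciPy's Kruskal-Wallis test. This is a sketch on synthetic data; the three groups are hypothetical, and it is not necessarily how `utils.distribution_check` is implemented.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical numerical feature split by a three-level categorical column;
# group_c is deliberately shifted so the groups differ
group_a = rng.normal(loc=50, scale=5, size=100)
group_b = rng.normal(loc=50, scale=5, size=100)
group_c = rng.normal(loc=60, scale=5, size=100)

alpha = 0.05
statistic, p_value = stats.kruskal(group_a, group_b, group_c)

# Reject H0 (same distribution) when the p-value falls below alpha
is_different = p_value < alpha
print(f"H = {statistic:.2f}, p = {p_value:.3g}, reject H0: {is_different}")
```

Kruskal-Wallis is rank-based and does not assume normality; `stats.f_oneway` would be the one-way ANOVA counterpart when that assumption is reasonable.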

In [None]:
newdf = utils.distribution_check(data, num_col, cat_col)

# Create a dataframe for the distributions that are different
boxplotdf = newdf[newdf['is_different'] == True].reset_index(drop=True)

# Display the resulting dataframe
boxplotdf

##### **Boxplots**

In [None]:
# Columns to keep in the dataframe passed to the boxplot grid
columns_to_keep = ['num_column', 'cat_column']

# Create a new DataFrame with only the specified columns
plotdf = boxplotdf[columns_to_keep].reset_index(drop=True)

# Create the boxplot grid (first argument: dataframe indicating which columns to plot; second: the data to plot)
utils.boxplot_grid(plotdf, data)

**Observations**
- Text goes here

#### **Correlation Study**

In [None]:
# Use a pair plot to look at the distributions of, and correlations between, the numeric variables
utils.pplot(data, num_col[:-1]) # <- Don't include the 'status' column as it is boolean

In [None]:
# Create a correlation matrix and plot it
utils.corr_matrix(data[num_col[:-1]])

**Observations**
- Text goes here

#### **Count Plots for Target Variable**

In [None]:
utils.countplot_grid(data, cat_col, hue=True, var='status')

**Observations**
- Text goes here

## Feature Engineering / Data Preparation



1. Missing value treatment
2. Feature Engineering
3. Outlier Treatment

In [None]:
#code goes here

- Before we proceed to build a model, we'll have to encode categorical features.
- Separate the independent variables from the dependent (target) variable.
- We'll split the data into train and test to be able to evaluate the model that we train on the training data.

In [None]:
# Creating dummy variables for the categorical columns
# drop_first=True is used to avoid redundant variables
data_encoded = pd.get_dummies(
    data,
    columns = data.select_dtypes(include = ["object", "category"]).columns.tolist(),
    drop_first = True,
)

In [None]:
# Separating independent variables and the target variable
x = data_encoded.drop('status', axis=1)
y = data_encoded['status']

In [None]:
# Splitting the dataset into train and test datasets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.30, random_state = 1, stratify = y )

In [None]:
# Checking the shape of the train and test data
print("Shape of Training set : ", x_train.shape)
print("Shape of test set : ", x_test.shape)

## Train and Tune models

In [None]:
# This is the first model to be scored, so clear any existing metrics that were already built
utils.delete_dataframe_if_exists('dfmr')

### Model Hyper-Parameter Tuning

In [None]:
with parallel_backend(backend='multiprocessing', n_jobs=-1):

    # Choose the type of classifier
    #dtree_estimator = DecisionTreeClassifier(class_weight = {0: 0.30, 1: 0.70}, random_state = 1)
    dtree_estimator = DecisionTreeClassifier(random_state = 1)
    # Grid of parameters to choose from
    parameters = {
        "max_depth": np.arange(2, 20),
        "criterion": ['gini', 'entropy'],
        "min_samples_leaf": [2, 5, 10, 15, 20],
        "max_leaf_nodes": [50, 75, 150, 250],
        "min_samples_split": [10, 30, 50, 70],
    }

    # Type of scoring used to compare parameter combinations
    scorer = metrics.make_scorer(optimize_on, greater_is_better=True)
    
    # Run the grid search
    gridCV = GridSearchCV(dtree_estimator, parameters, scoring = scorer, cv = 5)

    # Fitting the grid search on the train data
    gridCV = gridCV.fit(x_train, y_train)

    # Set the classifier to the best combination of parameters
    dtree_estimator = gridCV.best_estimator_
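
Before running an exhaustive search it can help to estimate its cost. The sketch below counts the candidates in the grid above with scikit-learn's `ParameterGrid`; the variable names `cv_folds`, `n_candidates`, and `n_fits` are introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as in the tuning cell
parameters = {
    "max_depth": np.arange(2, 20),
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [2, 5, 10, 15, 20],
    "max_leaf_nodes": [50, 75, 150, 250],
    "min_samples_split": [10, 30, 50, 70],
}
cv_folds = 5

# Number of parameter combinations: 18 * 2 * 5 * 4 * 4
n_candidates = len(ParameterGrid(parameters))

# Each combination is fitted once per cross-validation fold
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)  # 2880 14400
```

With the default `refit=True`, `GridSearchCV` adds one final fit of the best combination on the full training set. If the search takes too long, narrowing the grid or switching to `RandomizedSearchCV` are common options.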

## Model Validation

Text goes here

## Model Selection

Text goes here

## Actionable Insights and Recommendations

Text goes here