### Machine Learning Assignment - 2
### GROUP 256
### PS - 6
### Group Members -
* P.V.Vihari (2022AC05593)
* Manas Tuteja (2022AC05507)
* Godavarthi Krishna Vamsi (2022AC05704)

In [None]:
import pandas as pd
import numpy as np

# Plots
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.offline as py
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.tools as tls
import plotly.figure_factory as ff
py.init_notebook_mode(connected=True)
import squarify

# Data processing, metrics and modeling
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split, RandomizedSearchCV
from sklearn.metrics import precision_score, recall_score, confusion_matrix, roc_curve, precision_recall_curve, accuracy_score, roc_auc_score
import lightgbm as lgbm
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_curve,auc
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_predict
from yellowbrick.classifier import DiscriminationThreshold

# Stats
import scipy.stats as ss
from numpy import interp  # scipy.interp is no longer available in recent SciPy releases
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform

# Time
import time
from contextlib import contextmanager
@contextmanager
def timer(title):
    t0 = time.time()
    yield
    print("{} - done in {:.0f}s".format(title, time.time() - t0))

#ignore warning messages 
import warnings
warnings.filterwarnings('ignore') 

Reading Dataset from CSV

In [None]:
df = pd.read_csv('/kaggle/input/pima-indian-diabetes-data/Assignment 2.6 - Data/diabetes.csv')


In [None]:
df.head()

In [None]:
df.columns

In [None]:
df.isna().sum()

In [None]:
# Count duplicated rows instead of listing one boolean per row
df.duplicated().sum()

## EDA

In [None]:
import pandas_profiling as df_report

df_report.ProfileReport(df)

#Below is an HTML based dashboard/report on the dataset, please use inner scroll bar to navigate

##  As observed from the profiling report:
* Pregnancies and Age are highly correlated overall
* SkinThickness and Insulin are highly correlated overall

Correlated features in general don't improve models (although it depends on the specifics of the problem like the number of variables and the degree of correlation), but they affect specific models in different ways and to varying extents:

For linear models (e.g., linear regression or logistic regression), multicollinearity can yield solutions that vary wildly and may be numerically unstable.

Random forests can be good at detecting interactions between different features, but highly correlated features can mask these interactions.

### **Why was a correlation analysis needed?**

Correlational analysis is used to measure the strength and direction of the relationship between two variables. When building an LR model, it’s important to check for multicollinearity, which occurs when two or more independent variables are highly correlated with each other. Multicollinearity can cause problems in the model, such as unstable coefficient estimates and difficulty in interpreting the effects of individual predictors.

By performing a correlational analysis, we can identify pairs of highly correlated variables and take appropriate action to address multicollinearity. This may involve removing one of the correlated variables from the model or combining them into a single predictor.
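
As a minimal sketch of that check, the snippet below lists feature pairs whose absolute Pearson correlation exceeds a cutoff. The helper name `highly_correlated_pairs`, the 0.5 cutoff and the synthetic columns `a`, `b`, `c` are illustrative assumptions and are not taken from the profiling report.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(frame, threshold=0.5):
    """Return (feature_a, feature_b, r) for pairs with |Pearson r| >= threshold."""
    corr = frame.corr()
    cols = corr.columns
    pairs = []
    # Walk the upper triangle only, so each pair is reported once
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs

# Synthetic stand-in data (not the diabetes dataset): "b" is built from "a"
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.3, size=200),
    "c": rng.normal(size=200),
})
pairs = highly_correlated_pairs(demo, threshold=0.5)
print(pairs)
```

On the real data, the same helper could be called on the feature columns only (everything except `Outcome`).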

## What problems does Multicollinearity cause?
Multicollinearity causes the following two basic types of problems:

* The coefficient estimates can swing wildly based on which other independent variables are in the model. The coefficients become very sensitive to small changes in the model.
* Multicollinearity reduces the precision of the estimated coefficients, which weakens the statistical power of your regression model. You might not be able to trust the p-values to identify independent variables that are statistically significant.

## The need to reduce multicollinearity 

Whether multicollinearity needs to be reduced depends on its severity and on the primary goal of the model.

* The severity of the problems increases with the degree of multicollinearity. Therefore, moderate multicollinearity does not always need to be resolved.

* Multicollinearity affects only the specific independent variables that are correlated. Therefore, if multicollinearity is not present for the independent variables that we are particularly interested in, we may not need to resolve it. Suppose your model contains the experimental variables of interest and some control variables. If high multicollinearity exists for the control variables but not the experimental variables, then you can interpret the experimental variables without problems.

* Multicollinearity affects the coefficients and p-values, but it does not influence the predictions, the precision of the predictions, or the goodness-of-fit statistics. If the primary goal is to make predictions and we don't need to understand the role of each independent variable, we don't need to reduce even severe multicollinearity.
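
One common way to quantify the severity discussed above is the variance inflation factor, VIF_j = 1 / (1 - R_j^2), where R_j^2 is obtained by regressing feature j on the remaining features. The sketch below is an illustration on synthetic data, not part of the original analysis; the helper name `variance_inflation_factors` and the columns `base`, `near_copy` and `unrelated` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def variance_inflation_factors(frame):
    """Compute VIF_j = 1 / (1 - R_j^2) for every column of a numeric DataFrame."""
    vifs = {}
    for col in frame.columns:
        others = frame.drop(columns=col)
        # R^2 of predicting this column from all the other columns
        r2 = LinearRegression().fit(others, frame[col]).score(others, frame[col])
        vifs[col] = 1.0 / (1.0 - r2)
    return pd.Series(vifs)

# Synthetic stand-in data (not the diabetes dataset)
rng = np.random.default_rng(1)
base = rng.normal(size=300)
demo = pd.DataFrame({
    "base": base,
    "near_copy": base + rng.normal(scale=0.2, size=300),  # nearly a copy of "base"
    "unrelated": rng.normal(size=300),                     # independent feature
})
vif = variance_inflation_factors(demo)
print(vif.round(2))
```

A VIF near 1 indicates little collinearity. Values above roughly 5 to 10 are often treated as a warning sign, although that cutoff is a rule of thumb rather than a fixed standard.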

### The team will try the 1st iteration of Model Building with 2 cases

#### ITR 1. A
* Create Random Forest and KNN models on the dataset with the correlated features

#### ITR 1. B
* Create Random Forest and KNN models on the dataset after removing all correlated features

This is done to validate the hypothesis that Random Forest Classifiers are less affected by multicollinearity and should yield similar results in both cases. If the model performance falls in case B of ITR 1, it would indicate that the model has not learnt properly due to a lack of information

## Feature Engineering

### Outlier Detection

In [None]:
df.describe()

* Columns - **Pregnancies, SkinThickness, Insulin, and DiabetesPedigreeFunction** have a high number of outliers, as can be seen in the table above. We can observe a very high Max value when compared to the 25th percentile, Mean and 75th percentile

### Scatter Plot for Better Visualizations

In [None]:
sns.scatterplot(df, x= 'Pregnancies', y = 'Pregnancies')

In [None]:
sns.countplot(df, x= 'Pregnancies')

#### We can observe that the Pregnancies column has some outright outliers, with some women having as many as 17 pregnancies. If the outliers are not treated, they will inflate the variance of the training data and give a skewed picture of measures of central tendency such as the mean. Hence the team recommends min-max normalization to scale all values to between 0 and 1 and reduce the impact of high variance on the model's predictions

### Using Min-Max Normalization to scale down the Pregnancies Feature

In [None]:
df_min_max_scaled = df.copy()
  
# apply min-max normalization to the Pregnancies column
column = 'Pregnancies'
df_min_max_scaled[column] = (df_min_max_scaled[column] - df_min_max_scaled[column].min()) / (df_min_max_scaled[column].max() - df_min_max_scaled[column].min())    
  
# view normalized data
display(df_min_max_scaled)

#### Insulin Feature Outliers

In [None]:
sns.scatterplot(df, x= 'Insulin', y = 'Insulin')

#### It is easily observable by the scatterplot that there are **some very extreme values in the Insulin column**. *Normalization may not be able to reduce* the variance here to make legitimate predictions. Hence, the team will **use Inter-Quartile Range to remove outliers from the column**. 

As per the scatter plot, 600 is a good MAX for the column and all the values above it are generally considered outliers. Hence we draw the boundary at 3*IQR 

In [None]:
Q1 = df_min_max_scaled['Insulin'].quantile(0.25)
Q3 = df_min_max_scaled['Insulin'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 3*IQR
upper = Q3 + 3*IQR
 
# Collect the index labels (not positions) of the outlier rows
upper_array = df_min_max_scaled.index[df_min_max_scaled['Insulin'] >= upper]
lower_array = df_min_max_scaled.index[df_min_max_scaled['Insulin'] <= lower]
 
# Removing the outliers
df_min_max_scaled.drop(index=upper_array, inplace=True)
df_min_max_scaled.drop(index=lower_array, inplace=True)
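
The same fence rule can be written as a small reusable helper. The sketch below is an illustration on a toy column, not the notebook's own code. The function name `drop_iqr_outliers` and the toy values are assumptions. It filters with a boolean mask, so it does not depend on the DataFrame having a default 0..n-1 index.

```python
import pandas as pd

def drop_iqr_outliers(frame, column, k=3.0):
    """Keep rows strictly inside [Q1 - k*IQR, Q3 + k*IQR] for the given column."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Boolean mask aligned on index labels, so non-default indexes are handled
    keep = (frame[column] > lower) & (frame[column] < upper)
    return frame[keep]

# Toy data with a non-default index and one extreme value
toy = pd.DataFrame(
    {"Insulin": [0, 10, 20, 30, 40, 50, 60, 70, 80, 900]},
    index=range(100, 110),
)
cleaned = drop_iqr_outliers(toy, "Insulin", k=3.0)
print(cleaned)
```

Here Q1 = 22.5 and Q3 = 67.5, so the upper fence is 202.5 and only the row with the value 900 is removed.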

In [None]:
sns.scatterplot(df_min_max_scaled, x= 'Insulin', y = 'Insulin')

Outliers have been treated very well as observed from the scatterplot

In [None]:
df_min_max_scaled.describe()

### Skin Thickness Feature

In [None]:
sns.scatterplot(df_min_max_scaled, x= 'SkinThickness', y = 'SkinThickness')

#### It is easily observable by the scatterplot that there are **some very extreme values in the SkinThickness column**. *Normalization may not be able to reduce* the variance here to make legitimate predictions. Hence, the team will **use Inter-Quartile Range to remove outliers from the column**. 

As per the scatterplot 0.5 would be a good estimate to eliminate outliers

In [None]:
# Note: these are the 1st and 99th percentiles rather than the quartiles,
# so "IQR" below is the 1st-99th percentile range
Q1 = df_min_max_scaled['SkinThickness'].quantile(0.01)
Q3 = df_min_max_scaled['SkinThickness'].quantile(0.99)
IQR = Q3 - Q1
lower = Q1 - 0.01*IQR
upper = Q3 + 0.01*IQR
 
# Collect the index labels of the outlier rows. Positions from np.where no
# longer match the labels once earlier rows have been dropped.
upper_array = df_min_max_scaled.index[df_min_max_scaled['SkinThickness'] >= upper]
lower_array = df_min_max_scaled.index[df_min_max_scaled['SkinThickness'] <= lower]
 
# Removing the outliers
df_min_max_scaled.drop(index=upper_array, inplace=True)
df_min_max_scaled.drop(index=lower_array, inplace=True)

In [None]:
sns.scatterplot(df_min_max_scaled, x= 'SkinThickness', y = 'SkinThickness')

### DiabetesPedigreeFunction Outlier Removal

In [None]:
sns.scatterplot(df_min_max_scaled, x= 'DiabetesPedigreeFunction', y = 'DiabetesPedigreeFunction')

#### We can observe that the DiabetesPedigreeFunction column has some outright outliers. If the outliers are not treated, they will inflate the variance of the training data and give a skewed picture of measures of central tendency such as the mean. Hence the team recommends min-max normalization to scale all values to between 0 and 1 and reduce the impact of high variance on the model's predictions

In [None]:
df_final = df_min_max_scaled.copy()
  
# apply normalization techniques by Column 1
column = 'DiabetesPedigreeFunction'
df_final[column] = (df_final[column] - df_final[column].min()) / (df_final[column].max() - df_final[column].min())    
  
# view normalized data
display(df_final)

In [None]:
df_final.describe()

### **The outliers have been treated!**

## Model Building

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
df_modeling = df_final.copy()

### Checking for Sample Balancing

Data balancing is a crucial step in preparing your dataset for machine learning models, especially when dealing with imbalanced datasets. Here are some commonly used techniques:

* **Resampling**: This involves either oversampling the minority class or undersampling the majority class to balance the dataset.

* **SMOTE** (Synthetic Minority Over-sampling Technique): This method generates synthetic examples of the minority class to balance the dataset.

* **Class Weights**: Some machine learning models provide a parameter (called `class_weight` in scikit-learn) that can be used to give more importance to the minority class during training.

* **Ensemble Methods**: Techniques like bagging and boosting can be adapted for imbalanced data.

* **Cost-Sensitive Learning**: This involves incorporating misclassification costs or using cost-sensitive algorithms.

In [None]:
df_modeling['Outcome'].value_counts()

### The team will tackle the problem of Data Imbalance by using the class-weight method as the underlying sample population in the Dataset is quite low and we do not wish to lose our original sample.

#### As we can see, 64% of the sample is mapped to Outcome = 0 and only 36% is mapped to Outcome = 1.

#### This can be tackled by lowering the positive-class probability threshold from the default 0.5 to 0.36, so that the predictions better reflect the class balance of the training sample and the Classification Models can be evaluated properly.
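
A threshold only has an effect when it is applied to predicted probabilities. The hard 0/1 labels returned by `predict()` are unchanged by any cutoff between 0 and 1. The sketch below illustrates this on synthetic data with a similar 64/36 class split. It is not the notebook's own result, and the names `proba_pos`, `pred_default` and `pred_lowered` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (roughly 64% / 36%), not the diabetes dataset
X, y = make_classification(n_samples=600, n_features=8,
                           weights=[0.64, 0.36], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# predict() gives hard labels; predict_proba() gives class probabilities
proba_pos = clf.predict_proba(X_te)[:, 1]

pred_default = (proba_pos > 0.50).astype(int)   # default decision rule
pred_lowered = (proba_pos > 0.36).astype(int)   # lowered threshold

print("recall @0.50:", recall_score(y_te, pred_default))
print("recall @0.36:", recall_score(y_te, pred_lowered))
```

Lowering the threshold can only add positive predictions, so recall never decreases, although precision may.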

### 80-20 Train-Test Split

In [None]:
x1=df_modeling.iloc[:,:-1]
y1=df_modeling.iloc[:,-1]

### Input Features

In [None]:
x1

### Target Variable

In [None]:
y1

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x1, y1, test_size=0.20, random_state=0)

## Using dataset with the correlated features ITR 1 A
### Initializing Random Forest Model ITR 1 A

In [None]:
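rf_1 = RandomForestClassifier().fit(X_train, y_train)
# Predicted probability of the positive class, so that the 0.36 threshold has an effect
rf_y_pred = rf_1.predict_proba(X_test)[:, 1]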

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [None]:
y_predict_class = [1 if prob > 0.36 else 0 for prob in rf_y_pred]
# Accuracy Score
lm_score  =  accuracy_score(y_test, y_predict_class)
print('Accuracy Score is: ', lm_score)

# Confusion Matrix
print("Confusion Matrix\n",confusion_matrix(y_test, y_predict_class))

# Classification Report
print("Classification Report of Random Forest :\n",classification_report(y_test, y_predict_class))

## Using dataset without the correlated features - ITR 1B

* Removing 1 feature between Pregnancies and Age at random (Age is dropped)
* Removing 1 feature between SkinThickness and Insulin at random (Insulin is dropped)

In [None]:
df_modeling.columns

In [None]:
# 'Outcome' is the target and must not be among the input features (target leakage)
x2 = df_modeling[['Glucose', 'SkinThickness', 'BloodPressure','BMI', 'DiabetesPedigreeFunction','Pregnancies']]
y2 = y1

### Initializing Random Forest Model ITR 1 B

In [None]:
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(x2, y2, test_size=0.20, random_state=0)

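rf_2 = RandomForestClassifier().fit(X_train_2, y_train_2)
# Predicted probability of the positive class, so that the threshold has an effect
rf_y_pred_2 = rf_2.predict_proba(X_test_2)[:, 1]

In [None]:
# Same 0.36 positive-class threshold as in ITR 1 A
y_predict_class_2 = [1 if prob > 0.36 else 0 for prob in rf_y_pred_2]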
# Accuracy Score
lm_score  =  accuracy_score(y_test_2, y_predict_class_2)
print('Accuracy Score is: ', lm_score)

# Confusion Matrix
print("Confusion Matrix\n",confusion_matrix(y_test_2, y_predict_class_2))

# Classification Report
print("Classification Report of Random Forest :\n",classification_report(y_test_2, y_predict_class_2))

### Comparing the Accuracies, Precision and Recall between ITR 1A and ITR 1B of Random Forest Classifier

#### It is observed that, without the correlated features, the Random Forest Classifier starts overfitting the small sample and will not do well on new values

This is because
* Tree-based algorithms are generally good at dealing with multicollinearity, as they do not use all the features at the same time to make a split

* Hence the hypothesis that multicollinearity does not have a massive effect on the model's predictions is validated

* By removing important features, a good amount of information was hidden from the model, and hence it started overfitting the sample

## Trying out Random Forest Iterations with Different Train Test Split with all the features 

* **90-10 Train Test Split**

In [None]:
X_train_3, X_test_3, y_train_3, y_test_3 = train_test_split(x1, y1, test_size=0.10, random_state=0)
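rf_3 = RandomForestClassifier().fit(X_train_3, y_train_3)
# Predicted probability of the positive class, so that the 0.36 threshold has an effect
rf_y_pred_3 = rf_3.predict_proba(X_test_3)[:, 1]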

In [None]:
y_predict_class_3 = [1 if prob > 0.36 else 0 for prob in rf_y_pred_3]
# Accuracy Score
lm_score  =  accuracy_score(y_test_3, y_predict_class_3)
print('Accuracy Score is: ', lm_score)

# Confusion Matrix
print("Confusion Matrix\n",confusion_matrix(y_test_3, y_predict_class_3))

# Classification Report
print("Classification Report of Random Forest :\n",classification_report(y_test_3, y_predict_class_3))

* **The Random Forest Classifier performs poorly with a larger training sample as it starts overfitting**

## Trying out Random Forest Iterations with Hyperparameter Tuning using Cross Validations

### Random Forest Classifier's implementation in Scikit learn, as used here in the notebook has the following hyperparameters that can be tuned:

* **n_estimators**: This parameter controls the number of trees in the forest. More trees can help to get a more generalized result, but they also increase the time complexity of the model.

* **max_depth**: This parameter governs the maximum height up to which the trees inside the forest can grow. It is crucial for increasing the accuracy of the model, but setting it too high can lead to overfitting.

* **min_samples_split**: This specifies the minimum number of samples an internal node must hold in order to split into further nodes. A very low value can lead to overfitting, while a very high value can cause underfitting.

* **min_samples_leaf**: This is the minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches.

* **max_features**: This is the number of features to consider when looking for the best split. **The `'auto'` option of this hyperparameter has been deprecated in recent versions of scikit-learn, so it is left at its default and won't be tuned.**

* **Creating a Random Grid of Hyperparameters' Values**

In [None]:
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split (max_features) is left at its default

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
print(random_grid)

* **Random Search Training**

The most important arguments in RandomizedSearchCV are **n_iter, which controls the number of different combinations to try, and cv, which is the number of folds to use for cross validation (we use 100 and 3 respectively).** More iterations cover a wider search space and more cv folds give a more reliable performance estimate, which reduces the chance of selecting an overfit configuration, but raising either will increase the run time. Machine learning is a field of trade-offs, and performance vs time is one of the most fundamental.

In [None]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf_h = RandomForestClassifier()
# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = rf_h, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random.fit(X_train, y_train)

* **View Best Combination of Hyperparameters**

In [None]:
rf_random.best_params_

### Comparing Base Model v/s Best Random Search Model

In [None]:
def evaluate(model, test_features, test_labels):
    # Classification accuracy in percent. A percentage error (MAPE) is not
    # usable here because it divides by labels that are 0.
    predictions = model.predict(test_features)
    accuracy = 100 * accuracy_score(test_labels, predictions)
    print('Model Performance')
    print('Misclassification Rate: {:0.4f}'.format(1 - accuracy / 100))
    print('Accuracy = {:0.2f}%.'.format(accuracy))
    
    return accuracy

# Using Base Model rf_1 here
base_accuracy = evaluate(rf_1, X_test, y_test)

#Using best random search hyperparameters arrived at using Cross Validation
best_random = rf_random.best_estimator_
random_accuracy = evaluate(best_random, X_test, y_test)

print('Improvement of {:0.2f}%.'.format( 100 * (random_accuracy - base_accuracy) / base_accuracy))

* **The model whose hyperparameters were selected by cross-validation performs better than the base model**

* For hyperparameter tuning, we perform many iterations of the entire K-Fold CV process, each time using different model settings. We then compare all of the models, select the best one, train it on the full training set, and then evaluate on the testing set.
* To assess a different set of hyperparameters, we have to split our training data into K folds and train and evaluate K times. Since we have 100 sets of hyperparameters and are using 3-Fold CV, that represents 300 training loops.
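
The fit count described above (candidates times folds) can be checked directly on a fitted search object. The sketch below uses a deliberately small search on synthetic data so that it runs quickly. The parameter values and the 5-candidate, 3-fold setup are assumptions for illustration and are not the notebook's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (not the diabetes dataset)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

param_distributions = {
    "n_estimators": [10, 20, 30],
    "max_depth": [2, 4, None],
    "min_samples_leaf": [1, 2, 4],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,          # number of sampled hyperparameter combinations
    cv=3,              # number of cross-validation folds
    random_state=0,
)
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])
n_cv_fits = n_candidates * search.n_splits_
print(n_candidates, "candidates x", search.n_splits_, "folds =", n_cv_fits, "fits")
```

With the default `refit=True`, one additional fit of the best configuration on the full training data follows the cross-validation fits.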

### Initializing KNN Model ITR 1 A
#### With multicollinearity

* With Hyperparameter Tuning (K) 

In [None]:
from sklearn.neighbors import KNeighborsClassifier

#Setup arrays to store training and test accuracies
neighbors = np.arange(1,9)
knn_train_accuracy =np.empty(len(neighbors))
knn_test_accuracy = np.empty(len(neighbors))

for i,k in enumerate(neighbors):
    #Setup a knn classifier with k neighbors
    knn = KNeighborsClassifier(n_neighbors=k)
    
    #Fit the model
    knn.fit(X_train, y_train)
    
    #Compute accuracy on the training set
    knn_train_accuracy[i] = knn.score(X_train, y_train)
    
    #Compute accuracy on the test set
    knn_test_accuracy[i] = knn.score(X_test, y_test)

In [None]:
display(knn_train_accuracy, knn_test_accuracy)

#### Observation
* **Test Accuracy of the KNN Algorithm Increases with a higher value of K**
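
Choosing K by test accuracy means the test set also takes part in tuning. As a hedged alternative, and not the approach used in this notebook, the sketch below selects K by cross-validation on the training split only and scales the features first, since KNN is distance-based. The data is synthetic and the variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the diabetes dataset)
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

neighbors = np.arange(1, 9)
cv_scores = []
for k in neighbors:
    # Scaling lives inside the pipeline, so it is refit on each CV training fold
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=int(k)))
    cv_scores.append(cross_val_score(model, X_tr, y_tr, cv=5).mean())

best_k = int(neighbors[int(np.argmax(cv_scores))])
final = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=best_k))
final.fit(X_tr, y_tr)
test_accuracy = final.score(X_te, y_te)
print("best k:", best_k, "test accuracy:", test_accuracy)
```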

### Initializing KNN Model ITR 1 B

#### Without Multicollinearity

* With Hyperparameter Tuning (K) 

In [None]:
neighbors = np.arange(1,9)
knn_train_accuracy_2 =np.empty(len(neighbors))
knn_test_accuracy_2 = np.empty(len(neighbors))

for i,k in enumerate(neighbors):
    #Setup a knn classifier with k neighbors
    knn = KNeighborsClassifier(n_neighbors=k)
    
    #Fit the model
    knn.fit(X_train_2, y_train_2)
    
    #Compute accuracy on the training set
    knn_train_accuracy_2[i] = knn.score(X_train_2, y_train_2)
    
    #Compute accuracy on the test set
    knn_test_accuracy_2[i] = knn.score(X_test_2, y_test_2)

In [None]:
display(knn_train_accuracy_2, knn_test_accuracy_2)

### Observation

* **We observe that the Test Accuracy of the KNN Classifier decreases with an increasing K when the features are treated for Multicollinearity**

* This is because by removing important features, a good amount of information was hidden from the model and hence it started overfitting the sample

## Final Answer/Conclusion

### Observation

It is observed that the Random Forest Classifier performs better than the KNN Classifier on Recall, while the KNN Classifier has the higher Accuracy:

* Accuracy
 * RF - 68%
 * KNN - 78%
 
 
* Recall
 * RF - 67%
 * KNN - 64%


### Justification

* The **Random Forest model has a 0.67 Recall**. This means 67% of the patients who did in fact have diabetes were correctly identified by the model: if a total of 1000 people had diabetes, the model would give the label 1 (high risk) to 670 of them. This is better than the **KNN model, which has a recall of 0.64**: of the same 1000 people, it would give the label 1 (high risk) to 640.

* The **Random Forest Accuracy of 0.68** tells us that in general the Random Forest model will make the right prediction 68% of the time, compared to a **KNN Accuracy of 0.78**, which means KNN would make correct predictions 78% of the time.

* However, Accuracy does not measure how well a model identifies people with diabetes, which is the more important question, since someone who does not have diabetes in the first place can in fact be fine without being diagnosed.

* In the world of medical diagnosis and healthcare, Recall is a much more important statistic to measure a model's performance as it is identifying most of the at-risk population correctly and helps in saving lives.
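
The difference between the two metrics can be made concrete with a small worked example. The labels and predictions below are hypothetical counts chosen for illustration. They are not the notebook's results, and "model A" and "model B" do not refer to the Random Forest or KNN models above.

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical ground truth: 10 patients with diabetes (1), 20 without (0)
y_true = [1] * 10 + [0] * 20

# Hypothetical model A: finds 8 of the 10 positives, raises 8 false alarms
pred_a = [1] * 8 + [0] * 2 + [1] * 8 + [0] * 12
# Hypothetical model B: finds 5 of the 10 positives, raises 1 false alarm
pred_b = [1] * 5 + [0] * 5 + [1] * 1 + [0] * 19

for name, pred in [("A", pred_a), ("B", pred_b)]:
    print(name,
          "accuracy:", round(accuracy_score(y_true, pred), 3),
          "recall:", round(recall_score(y_true, pred), 3))
```

Model B is more accurate (24 of 30 correct against 20 of 30), yet it misses half of the patients with diabetes, while model A misses only 2 of 10.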


## We would recommend going forward with the Random Forest Classifier trained by tuning hyperparameters using cross-validation, as it best fits the use case.