<a href="https://colab.research.google.com/github/adithya1010/100-Days-of-Code/blob/main/Task-3/NaanMudhalvan_Task3_DiabetesClassification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Task-3-Diabetes Classification**

Project done with inputs from Gemini 1.5 Flash

**Link to Chat:**

1. https://www.getmerlin.in/share/chat/3GQOQs7vFV5

**References:**

1. https://docs.python.org/3/library/getpass.html
2. https://www.kaggle.com/discussions/general/74235

**Dataset:**
1. https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database


### Importing libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from sklearn.metrics import classification_report, confusion_matrix


### Using getpass to hide the Kaggle API key


Python's getpass module prompts the user for a password without echoing the input to the console. It is commonly used in command-line tools and scripts that need a secret from the user. Here it keeps the Kaggle API key out of the notebook's source and output.

In [None]:
from getpass import getpass
secret = getpass('Enter the secret value: ')

Enter the secret value: ··········
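One way to avoid retyping the key on every run is to check an environment variable first and only prompt when it is absent. The sketch below is illustrative: the helper name `read_secret` and the variable name `DEMO_KAGGLE_KEY` are hypothetical and do not come from the original notebook.

```python
import os
from getpass import getpass


def read_secret(env_name, prompt="Enter the secret value: "):
    """Return the secret from the environment, or prompt for it without echo."""
    value = os.environ.get(env_name)
    if value:
        return value
    # Fall back to an interactive prompt that hides the typed characters.
    return getpass(prompt)


# Demo only: pre-set a placeholder so the call below does not prompt.
os.environ["DEMO_KAGGLE_KEY"] = "placeholder-key"
secret_demo = read_secret("DEMO_KAGGLE_KEY")
```

In an interactive session where the variable is not set, the same call would fall through to the `getpass` prompt.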


### Downloading the dataset from Kaggle and unzipping it

In [None]:
import os
os.environ['KAGGLE_USERNAME'] = "adithyast" # username from the json file
os.environ['KAGGLE_KEY'] = secret # key from the json file
!kaggle datasets download -d uciml/pima-indians-diabetes-database # api copied from kaggle
!unzip pima-indians-diabetes-database.zip

Dataset URL: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database
License(s): CC0-1.0
Downloading pima-indians-diabetes-database.zip to /content
  0% 0.00/8.91k [00:00<?, ?B/s]
100% 8.91k/8.91k [00:00<00:00, 17.5MB/s]
Archive:  pima-indians-diabetes-database.zip
  inflating: diabetes.csv            


### Reading the dataset

In [None]:
df = pd.read_csv("diabetes.csv")
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


### Handling missing values using Mean Imputation

In [None]:
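# 3. Handle Missing Values (using mean imputation for simplicity)
# Note: by default SimpleImputer only fills NaN entries; zeros in these
# columns (for example Insulin = 0) are left unchanged by this step.
imputer = SimpleImputer(strategy='mean')
for col in ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']:
    df[col] = imputer.fit_transform(df[[col]])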
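As a rough illustration of what mean imputation does, the sketch below fills placeholder values in one column with the mean of the observed values. It assumes that a 0 in a column such as Insulin stands for a missing measurement, which is a common reading of this dataset but is not stated in the notebook. The helper name `impute_zeros_with_mean` is hypothetical, and the sample values are the five Insulin entries shown in the `df.head()` output.

```python
def impute_zeros_with_mean(values):
    """Replace zeros with the mean of the non-zero entries (assumed observed)."""
    observed = [v for v in values if v != 0]
    if not observed:
        # Nothing to estimate the mean from; return the data unchanged.
        return list(values)
    fill = sum(observed) / len(observed)
    return [fill if v == 0 else v for v in values]


# Insulin values from the first five rows of the dataset.
insulin_head = [0, 0, 0, 94, 168]
insulin_filled = impute_zeros_with_mean(insulin_head)
print(insulin_filled)
```

With scikit-learn, passing `missing_values=0` to `SimpleImputer` is one way to get similar behavior, though it would change the results reported later in this notebook.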

### Separating features and target

In [None]:
# 4. Separate Features and Target
X = df.drop('Outcome', axis=1)
y = df['Outcome']


### Handling Class Imbalance using SMOTE

**Synthetic Minority Over-Sampling Technique (SMOTE)**

SMOTE is a widely used technique for addressing class imbalance, particularly in binary classification. It enlarges the minority class by creating synthetic samples along the line segments that join a minority sample to one of its nearest minority-class neighbors. Training on the rebalanced data can improve the classifier's performance on the minority class. SMOTE is commonly used in applications such as spam detection, fraud detection, and medical diagnosis, where the minority class represents the rare or critical event. It can also be combined with other resampling techniques, such as undersampling the majority class.
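The interpolation step can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not the imbalanced-learn implementation; the function name `smote_like_sample`, the neighbor count `k=2`, and the toy points are all assumptions made for the example.

```python
import math
import random


def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and a near neighbor."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # Rank the other minority points by Euclidean distance and keep the k nearest.
    neighbours = sorted(
        (p for p in minority if p is not base),
        key=lambda p: math.dist(base, p),
    )[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # position along the segment, in [0, 1)
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]


# Toy minority class: three points on the line y = x.
minority_points = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
synthetic = smote_like_sample(minority_points, k=2, rng=random.Random(42))
```

Because every synthetic point lies on a segment between two existing minority points, it stays inside the region the minority class already occupies.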

In [None]:
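# 5. Handle Class Imbalance (using SMOTE)
# Note: resampling is applied before the train/test split below, so synthetic
# samples can land in the test set and the reported scores may be optimistic.
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
X_resampled  # not displayed: a notebook cell only shows its last expression
y_resampled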

Unnamed: 0,Outcome
0,1
1,0
2,1
3,0
4,1
...,...
995,1
996,1
997,1
998,1


### Splitting data into training and test sets

In [None]:
# 6. Split Data
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)


### Feature Scaling with Standard Scaler

StandardScaler is a preprocessing class in scikit-learn, a Python machine learning library. It rescales each numeric feature to have a mean of 0 and a standard deviation of 1, a process known as standardization (sometimes loosely called normalization). Scaling helps to:

- Stabilize learning algorithms
- Improve the performance of gradient descent methods
- Enable the comparison of features with different units or scales

StandardScaler is a common preprocessing step in many machine learning pipelines.
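The computation behind standardization is small enough to write out by hand. The sketch below is a minimal plain-Python version, assuming the population standard deviation is used; the helper names `fit_standardizer` and `standardize` are hypothetical, and the training values are the five Glucose entries shown in the `df.head()` output.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_values):
    """Learn the mean and population standard deviation from training data."""
    mu = fmean(train_values)
    sigma = pstdev(train_values)
    if sigma == 0:
        # Guard against division by zero for a constant feature.
        sigma = 1.0
    return mu, sigma


def standardize(values, mu, sigma):
    """Apply z = (x - mu) / sigma using statistics learned elsewhere."""
    return [(v - mu) / sigma for v in values]


glucose_train = [148, 85, 183, 89, 137]
mu, sigma = fit_standardizer(glucose_train)
glucose_scaled = standardize(glucose_train, mu, sigma)

# New data is scaled with the training statistics, not its own.
glucose_new_scaled = standardize([120], mu, sigma)
```

The statistics are learned from the training data only and then reused for any other data, which mirrors the `fit_transform` / `transform` pair used in the cell that follows.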

In [None]:
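# 7. Feature Scaling (optional here: tree-based models such as Random Forest
# are largely insensitive to feature scale). The scaler is fit on the training
# set only and then applied to the test set.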
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

### Using Random Forest Classifier to train the model

**Random Forest Classifier: A Powerful Ensemble Learning Method**

A Random Forest classifier is a supervised learning algorithm that combines many decision trees to improve the accuracy and robustness of predictions. Each tree is trained on a random bootstrap sample of the data and considers a random subset of features at each split; the trees' predictions are then aggregated to produce the final outcome. This approach reduces overfitting and improves the model's ability to generalize to new, unseen data. Random Forests can be used for both classification and regression, and are particularly effective with high-dimensional data and non-linear relationships.
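The two ingredients described above, bootstrap sampling and aggregation, can be sketched without any library. This is only an illustration: the helper names and the per-tree predictions are made up for the example. It uses a hard majority vote, whereas scikit-learn's implementation averages the trees' predicted class probabilities.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as one tree's training set."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(tree_predictions):
    """Combine per-tree label lists into one label per sample by hard vote."""
    per_sample = zip(*tree_predictions)
    return [Counter(votes).most_common(1)[0][0] for votes in per_sample]


rng = random.Random(42)
rows = list(range(10))
sample = bootstrap_sample(rows, rng)  # some rows repeat, others are left out

# Hypothetical predictions from three trees for three test samples.
tree_predictions = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
]
combined = majority_vote(tree_predictions)
```

Each tree sees a slightly different dataset, so individual trees make different errors, and the vote tends to cancel them out.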

In [None]:
# 8. Train a Random Forest Model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

### Making Predictions

In [None]:
# 9. Make Predictions
y_pred = model.predict(X_test)


### Evaluating the Model

In [None]:
# 10. Evaluate the Model
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.87      0.77      0.82        99
           1       0.80      0.89      0.84       101

    accuracy                           0.83       200
   macro avg       0.84      0.83      0.83       200
weighted avg       0.83      0.83      0.83       200

[[76 23]
 [11 90]]
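The figures in the classification report can be recomputed from the confusion matrix, which helps when reading the two outputs together. The sketch below assumes scikit-learn's convention that rows are true labels and columns are predicted labels; the helper name `per_class_metrics` is hypothetical.

```python
def per_class_metrics(matrix, label):
    """Return (precision, recall, f1) for one class from a confusion matrix."""
    tp = matrix[label][label]
    fn = sum(matrix[label]) - tp                  # true label, predicted other
    fp = sum(row[label] for row in matrix) - tp   # other label, predicted this
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Confusion matrix printed above: rows = true class, columns = predicted class.
matrix = [[76, 23], [11, 90]]

accuracy = sum(matrix[i][i] for i in range(len(matrix))) / sum(map(sum, matrix))
metrics_class_0 = per_class_metrics(matrix, 0)
metrics_class_1 = per_class_metrics(matrix, 1)

for label, (p, r, f) in enumerate([metrics_class_0, metrics_class_1]):
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Reading the matrix this way, 23 non-diabetic cases were predicted as diabetic and 11 diabetic cases were missed, which is why recall is lower for class 0 than for class 1.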
