## **🧩 Problem Statement**
The challenge is to create a machine learning model. This model will learn from health-related data (features) of women and then predict whether they have a specific type of diabetes, such as Type 2 or gestational diabetes.

## **🛠️ Setting Up the Environment**
Before we start, we need to make sure we have all the tools (libraries) we need.

**Installing Libraries**

We first install or upgrade two important libraries:

* scikit-learn: A fundamental library for machine learning in Python.

* pycaret: A low-code machine learning library that helps speed up the experiment cycle.

In [None]:
!pip install --upgrade scikit-learn


In [None]:
!pip install pycaret

# 📦 **Importing Libraries**

Next, we import all the necessary libraries into our notebook. These libraries help us with tasks like:

* Handling data (pandas, numpy)
* Creating visualizations (matplotlib.pyplot, seaborn)
* Preparing data for the model (KNNImputer, OneHotEncoder, MinMaxScaler, ColumnTransformer, train_test_split, Pipeline)
* Building and evaluating models (SVC, LogisticRegression, RandomForestClassifier, DecisionTreeClassifier, accuracy_score, classification_report)
* Using PyCaret for automated ML (pycaret.classification)
* Ignoring unnecessary warnings (warnings)

In [None]:
# 📦 Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns  # used for the correlation heatmap

from sklearn.impute import KNNImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import accuracy_score, classification_report

import warnings
warnings.filterwarnings("ignore")


# 📂 **Loading and Understanding the Data**

Start by loading the datasets. There are three files:

* Train.csv: Contains the training data, including features and the target variable (diabetes type).
* Test.csv: Contains the test data, which has the same features but not the target variable. We'll use this to make our final predictions.
* SampleSubmission.csv: Shows the format we need for our submission file.

In [None]:
# 📂 Load the dataset (update the path as needed)
df = pd.read_csv("/content/SheCures/Train.csv")
ts = pd.read_csv("/content/SheCures/Test.csv")
ss = pd.read_csv("/content/SheCures/SampleSubmission.csv")
df.head()


**Getting to Know the Data**

* df.columns: See the names of all the columns

* df.info(): Get a summary, including the number of entries, column names, data types, and non-null counts. This helps identify missing values.

* df.describe():  Look at basic statistics (count, mean, standard deviation, min, max, etc.) for numerical columns.

* Class Distribution: Check how many samples belong to each diabetes type (our target). This helps us see if the dataset is balanced.

* Correlation: Visualize how numerical features relate to each other using a heatmap. This can help identify potentially important features or relationships.

In [None]:
df.columns

In [None]:
# ℹ️ Dataset information
df.info()

In [None]:
# 📊 Basic statistics
df.describe()

In [None]:
# 🧮 Class distribution
diabetes_type = "Target"
df[diabetes_type].value_counts()
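
Raw counts can be hard to compare at a glance, so it may help to look at class shares as well. The sketch below uses a made-up target column (`toy_target` is a stand-in for `df["Target"]`, not the real data) and computes the share of each class plus a simple largest-to-smallest ratio.

```python
import pandas as pd

# Toy target column standing in for df["Target"]; the labels are invented.
toy_target = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="Target")

# Relative frequency of each class (the shares sum to 1.0).
class_share = toy_target.value_counts(normalize=True)

# Ratio of the largest class to the smallest; 1.0 would mean perfectly balanced.
imbalance_ratio = class_share.max() / class_share.min()

print(class_share)
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```

A ratio well above 1 suggests that accuracy alone could be misleading and that per-class metrics such as F1 deserve attention.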


In [None]:
# 🔍 Correlation heatmap
sns.heatmap(df.corr(numeric_only=True), annot=True)

# ⚙️ **Data Preprocessing**

Since machine learning models usually require data to be in a specific format, this section covers cleaning and preparing the data.

**Identifying Features**

Here I separate the features (the inputs for the model) from the target variable (what I want to predict). I also identify which features are numerical (like age or glucose level) and which are categorical (like 'Yes'/'No' or a specific group).

In [None]:
target = "Target"
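# Exclude the target and the ID column (an identifier, not a predictive feature)
features = df.drop(columns=[target, "ID"], errors="ignore")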

num_features = features.select_dtypes(include=["int64", "float64"]).columns.tolist()
cat_features = features.select_dtypes(include=["object", "category"]).columns.tolist()
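
As a small illustration of the same idea, the sketch below splits a toy frame into numerical and categorical columns. The column names (`age`, `glucose`, `smoker`) are invented for the example, and it uses `include="number"` / `exclude="number"` as one possible variant, so that numeric dtypes other than `int64` and `float64` are not silently skipped.

```python
import pandas as pd

# Toy data with hypothetical column names; not the competition dataset.
toy = pd.DataFrame({
    "age": [31, 45, 52],
    "glucose": [5.4, 7.1, None],
    "smoker": ["No", "Yes", "No"],
    "Target": [0, 1, 1],
})

toy_features = toy.drop(columns=["Target"])

# Every numeric dtype goes to the numerical list, everything else to the categorical list.
num_cols = toy_features.select_dtypes(include="number").columns.tolist()
cat_cols = toy_features.select_dtypes(exclude="number").columns.tolist()

# Missing values per column, which tells us where imputation will be needed.
missing_per_column = toy_features.isna().sum()

print(num_cols, cat_cols)
print(missing_per_column)
```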

**Creating Preprocessing Pipelines**

Here I build "pipelines" to handle numerical and categorical features separately and consistently.

**Numerical Pipeline:**

* *KNNImputer*: Fills in missing numerical values using the K-Nearest Neighbors method.

* *MinMaxScaler*: Scales numerical features to a range between 0 and 1. This helps many models perform better.

**Categorical Pipeline**:

* *OneHotEncoder*: Converts categorical text data into numerical format (0s and 1s) so the model can understand it.

Finally, I combine the numerical and categorical pipelines using *ColumnTransformer*.

In [None]:
# Define preprocessing pipelines
num_transformer = Pipeline([
    ("imputer", KNNImputer(n_neighbors=5)),
    ("scaler", MinMaxScaler())
])

cat_transformer = Pipeline([
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer([
    ("num", num_transformer, num_features),
    ("cat", cat_transformer, cat_features)
])
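
To see what such a preprocessor produces, here is a minimal sketch on a four-row toy frame. The data and column names are invented, and `n_neighbors=2` is chosen only because the toy frame is so small. The missing glucose value is filled from the two rows closest in age, then both numeric columns are scaled to the 0 to 1 range and `smoker` becomes two indicator columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy data with hypothetical columns; one glucose value is missing.
toy = pd.DataFrame({
    "age": [30.0, 40.0, 50.0, 60.0],
    "glucose": [5.0, np.nan, 7.0, 9.0],
    "smoker": ["No", "Yes", "No", "Yes"],
})

toy_num = Pipeline([
    ("imputer", KNNImputer(n_neighbors=2)),
    ("scaler", MinMaxScaler()),
])
toy_cat = Pipeline([
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

toy_preprocessor = ColumnTransformer([
    ("num", toy_num, ["age", "glucose"]),
    ("cat", toy_cat, ["smoker"]),
])

transformed = toy_preprocessor.fit_transform(toy)

# ColumnTransformer may return a sparse matrix; convert so it can be inspected.
if hasattr(transformed, "toarray"):
    transformed = transformed.toarray()

print(transformed.shape)
print(transformed)
```

The output has two scaled numeric columns followed by one indicator column per `smoker` category.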

**Splitting the Data**

At this stage, it's crucial to split the data before fitting any preprocessing steps, to avoid "data leakage" (where information from the test set accidentally influences the training process). The data is split into a training set (80%) and a testing set (20%).

In [None]:
# Split data before preprocessing to avoid data leakage
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df.drop(columns=[target]), df[target], test_size=0.2, random_state=42)
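
If the classes turn out to be imbalanced, one optional variation is to pass `stratify` to `train_test_split` so that both splits keep roughly the same class proportions. The sketch below uses invented data (`toy_X`, `toy_y`); it is not part of the original workflow, which splits without stratification.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 15 samples of class 0 and 5 samples of class 1 (invented).
toy_X = pd.DataFrame({"feature": range(20)})
toy_y = pd.Series([0] * 15 + [1] * 5, name="Target")

# stratify=toy_y keeps the 75/25 class ratio in both the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```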

In [None]:
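# Optional preview: transform the full dataset to inspect the resulting column names.
# This fits the preprocessor on every row, so df_cleaned should not be used for evaluation;
# the preprocessor is refit on the training split only in the next step.
df_cleaned = preprocessor.fit_transform(df.drop(columns=[target]))  # Drop target column before preprocessing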
new_columns = (
    num_features +
    list(preprocessor.named_transformers_['cat'].named_steps['encoder'].get_feature_names_out(cat_features))
)


**Applying Preprocessing**

Now I applied the preprocessor to both sets: it was fitted on the training set only, and the fitted transformations were then reused on the testing set. This imputed missing values, scaled numbers, and one-hot encoded categories.

In [None]:
# Fit and transform training and test data
X_train = preprocessor.fit_transform(X_train_raw)
X_test = preprocessor.transform(X_test_raw)
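
The reason for fitting on the training set only can be shown with a tiny worked example. In this sketch (invented numbers), a scaler fitted on the training values maps an unseen larger test value above 1, which is expected. Fitting on train and test together would instead change how the training data itself is scaled, which is the leakage to avoid.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Invented single-feature data.
train_values = np.array([[0.0], [5.0], [10.0]])
test_values = np.array([[12.0]])

# Correct: learn min and max from the training data only.
scaler = MinMaxScaler().fit(train_values)
scaled_test = scaler.transform(test_values)  # (12 - 0) / (10 - 0) = 1.2

# Leaky: the test value influences the learned maximum.
leaky_scaler = MinMaxScaler().fit(np.vstack([train_values, test_values]))
leaky_train = leaky_scaler.transform(train_values)  # training max becomes 10 / 12

print(scaled_test.ravel())
print(leaky_train.ravel())
```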

In [None]:
# Extract the names of the new columns created by one-hot encoding
encoded_cat_columns = preprocessor.named_transformers_['cat'].named_steps['encoder'].get_feature_names_out(cat_features)
new_columns = num_features + list(encoded_cat_columns)


## 🤖 **Building Models with PyCaret**

PyCaret simplifies the process of training and comparing many different machine learning models.

**Setting Up PyCaret**

Here I initialized the PyCaret environment. I provided the training data (before our manual preprocessing, as PyCaret can handle it), specified the target column, and set some parameters. I dropped the 'ID' column as it's not a feature.

In [None]:
df_pycaret = df.drop(columns=["ID"])

In [None]:
# Set up PyCaret
from pycaret.classification import *

classf = setup(data = df_pycaret, target = 'Target', train_size = 0.8,
               normalize = True, session_id = 123)

**Comparing Models**

PyCaret can automatically train and evaluate several common classification models and show us which one performs best based on standard metrics.

In [None]:
best_model = compare_models()

**Creating a Specific Model**

Based on the compare_models() results, I created a specific model. Here, I chose Random Forest ('rf') based on the F1 Score.

In [None]:
model = create_model('rf')

## 📊 **Evaluating the Model**

I needed to see how well my chosen model performs, especially on data it hasn't seen before.

**Classification Report**

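Here I used the manually split test set (X_test_raw and y_test) to evaluate the PyCaret model. The classification_report gives us precision, recall, and F1-score for each class. **Note**: Prediction is done on X_test_raw because PyCaret models expect data in the original format and handle preprocessing internally. Keep in mind that setup() made its own 80/20 split with a different seed, so some rows in X_test_raw were probably part of PyCaret's training data and these scores may be optimistic.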

In [None]:
# Print per-class precision, recall, and F1 score for the model
test_predictions = predict_model(model, data=X_test_raw)
print(classification_report(y_test, test_predictions["prediction_label"]))
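
To make the numbers in the report easier to read, here is a small worked example with invented labels. It shows how the per-class F1 scores relate to the macro average, which is the unweighted mean across classes.

```python
from sklearn.metrics import classification_report, f1_score

# Invented labels: four samples of class 0 and two of class 1.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

# Per-class F1: class 0 has precision 3/4 and recall 3/4, class 1 has 1/2 and 1/2.
per_class_f1 = f1_score(y_true, y_pred, average=None)

# Macro F1 is the plain mean of the per-class scores.
macro_f1 = f1_score(y_true, y_pred, average="macro")

print(classification_report(y_true, y_pred))
print(per_class_f1, macro_f1)
```

On imbalanced data the macro average treats every class equally, whereas the weighted average in the report is dominated by the larger class.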


**Visualizations**

PyCaret provides easy ways to visualize model performance:

* *Confusion Matrix*: Shows how many predictions were correct and where the model made mistakes.
* *Feature Importance*: Shows which features the model found most important for making predictions.

In [None]:
# Plot the chosen Random Forest model so the plots match the report above
plot_model(model, plot='confusion_matrix')
plot_model(model, plot='feature')
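
For readers who want the numbers behind the confusion matrix plot, the sketch below builds one with scikit-learn on invented labels. Rows are true classes and columns are predicted classes, so the diagonal holds the correct predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented labels for illustration.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

# Rows = true class, columns = predicted class.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# Recall per class: correct predictions divided by the number of true samples.
recall_per_class = np.diag(cm) / cm.sum(axis=1)

print(cm)
print(recall_per_class)
```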


## 🔮 **Making Predictions**

Now I used the trained model to predict diabetes types for the actual test dataset (ts).

**Preparing Test Data and Predicting**

First, I needed to prepare the test set (ts) by dropping the 'ID' column. Then, I used predict_model to get the predictions.

In [None]:
# Save ID column from test set
test_ids = ts["ID"]
ts_clean = ts.drop(columns=["ID"], errors='ignore')

# Predict on the test data
predictions = predict_model(model, data=ts_clean)

predictions.head(10)

## **Saving the model for deployment**

In [None]:
final_model = finalize_model(model)

save_model(final_model, 'classification_model')

## **Creating a Submission File for the Contest**

In [None]:
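# Create submission from the predicted labels
# (column names should match SampleSubmission.csv; check ss.columns)
final_labels = predictions["prediction_label"]
submission = pd.DataFrame({
    "id": test_ids,
    "diabetes_type": final_labels
})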

submission_path = "/content/SheCures/submission_file.csv"
submission.to_csv(submission_path, index=False)
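
Before uploading, it can be worth checking the file against the sample submission. The helper below is a hypothetical sketch (`check_submission` and the toy column names are assumptions, not part of the contest tooling) that compares column names and row counts and looks for missing predictions.

```python
import pandas as pd


def check_submission(submission: pd.DataFrame, sample: pd.DataFrame) -> list:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    if list(submission.columns) != list(sample.columns):
        problems.append(
            f"columns {list(submission.columns)} differ from {list(sample.columns)}"
        )
    if len(submission) != len(sample):
        problems.append(f"expected {len(sample)} rows, got {len(submission)}")
    if submission.isna().any().any():
        problems.append("submission contains missing values")
    return problems


# Toy frames with invented column names, standing in for ss and submission.
sample_submission = pd.DataFrame({"ID": ["a", "b", "c"], "Target": [0, 0, 0]})
good_submission = pd.DataFrame({"ID": ["a", "b", "c"], "Target": [1, 0, 1]})
bad_submission = pd.DataFrame({"id": ["a", "b"], "diabetes_type": [1, None]})

print(check_submission(good_submission, sample_submission))
print(check_submission(bad_submission, sample_submission))
```

In the notebook, the same check could be run as `check_submission(submission, ss)` before writing the CSV.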