## Step 1: Import Necessary Libraries

### Explanation:
We begin by importing libraries for:

- **Data manipulation and analysis:** pandas, numpy
- **Data preprocessing:** train_test_split (splitting data), StandardScaler (scaling features)
- **Modeling:** Algorithms such as Logistic Regression, Random Forest, and SVM
- **Evaluation:** Metrics like accuracy, classification report, and confusion matrix
- **Saving models:** pickle for saving trained models and scalers


In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import pickle


## Step 2: Load the Dataset

### Explanation:
Load the Pima Indians Diabetes Dataset using pandas. As the output below shows, the file contains eight clinical features (glucose level, blood pressure, BMI, age, and so on), a target column `Outcome` indicating diabetes presence (1 for diabetic, 0 for non-diabetic), and an extra `Unnamed: 0` column that is a saved row index.

- Verify the dataset structure using `.head()` and `.info()`.
- Check for missing values.
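
`isnull()` only counts NaN values. In this dataset the `head()` output already shows zeros for `Insulin` and `SkinThickness`, and zeros in such columns are commonly treated as unrecorded measurements rather than real readings. The sketch below counts them on five rows copied from that output; the choice of columns to check is an assumption, not something the notebook itself does.

```python
import pandas as pd

# First five rows, copied from the dataset.head() output in this notebook.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Assumption: a zero in these columns means "not recorded".
zero_counts = (sample == 0).sum()
print(zero_counts)
```

Running the same count on the full `dataset` would show how many rows are affected before deciding whether to impute or leave them.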


In [4]:
# Load the dataset
dataset = pd.read_csv('diabetics.csv')

# Inspect the data
print("First few rows of the dataset:")
print(dataset.head())

print("\nDataset Information:")
dataset.info()  # info() prints its summary directly and returns None

print("\nCheck for missing values:")
print(dataset.isnull().sum())


First few rows of the dataset:
   Unnamed: 0  Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin  \
0           0            6      148             72             35        0   
1           1            1       85             66             29        0   
2           2            8      183             64              0        0   
3           3            1       89             66             23       94   
4           4            0      137             40             35      168   

    BMI  DiabetesPedigreeFunction  Age  Outcome  
0  33.6                     0.627   50        1  
1  26.6                     0.351   31        0  
2  23.3                     0.672   32        1  
3  28.1                     0.167   21        0  
4  43.1                     2.288   33        1  

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------       

In [7]:
dataset

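| Unnamed: 0.1 | Unnamed: 0 | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 | 0 |
| 765 | 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |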


## Step 3: Select Features and Target Variables

### Explanation:
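Separate the dataset into:
- **Features (X):** All columns except the last. With this file that means nine columns: the `Unnamed: 0` row index plus the eight clinical measurements.
- **Target (y):** The last column, `Outcome`, indicating diabetes status.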

This is essential for training machine learning models.
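
Positional slicing keeps whatever happens to sit in the leading columns, including the `Unnamed: 0` index, which carries no clinical information. A minimal sketch of selecting by column name instead is shown here on three rows copied from the `head()` output. The names `sample`, `X_named`, and `y_named` are illustrative, and this is an alternative to the cell below, not what the notebook runs: switching to it would change the reported accuracies and the number of values expected in Step 10.

```python
import pandas as pd

# Three rows copied from the dataset.head() output in this notebook.
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Pregnancies": [6, 1, 8],
    "Glucose": [148, 85, 183],
    "BloodPressure": [72, 66, 64],
    "SkinThickness": [35, 29, 0],
    "Insulin": [0, 0, 0],
    "BMI": [33.6, 26.6, 23.3],
    "DiabetesPedigreeFunction": [0.627, 0.351, 0.672],
    "Age": [50, 31, 32],
    "Outcome": [1, 0, 1],
})

# Drop the saved index and the target by name; what remains are the features.
X_named = sample.drop(columns=["Unnamed: 0", "Outcome"])
y_named = sample["Outcome"]

print(list(X_named.columns))
print(y_named.tolist())
```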


In [5]:
# Step 3: Feature and target selection

X = dataset.iloc[:, :-1].values  # Features (independent variables)
y = dataset.iloc[:, -1].values   # Target (dependent variable)


## Step 4: Split the Dataset

### Explanation:
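- Split the data into training (80%) and testing (20%) subsets using `train_test_split`; `random_state=42` makes the split reproducible.
- Evaluating on held-out data estimates how the model performs on unseen examples. It exposes overfitting rather than preventing it.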
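
When the classes are imbalanced, a plain random split can leave the test set with a different class ratio than the training set. The sketch below shows the `stratify` argument of `train_test_split`, which keeps the ratio roughly equal in both subsets. The data here is synthetic, and the 500/268 class balance is an assumption chosen for illustration; the notebook's own cell does not stratify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 768 rows, 8 features, assumed 500/268 class balance.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(768, 8))
y_demo = np.array([0] * 500 + [1] * 268)

# stratify=y_demo preserves the class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())  # share of positives in each subset
```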


In [6]:
# Step 4: Split the dataset

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


## Step 5: Scale the Features

### Explanation:
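- Features are standardized using `StandardScaler`, which rescales each column to zero mean and unit variance.
- The scaler is fit on the training set only and then applied to the test set, so no test-set statistics leak into training.
- This step matters for models that are sensitive to feature magnitudes, such as SVM and regularized logistic regression.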
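
To make the transformation concrete, the sketch below checks that `StandardScaler` applies `(x - mean) / std` using the training statistics, with the population standard deviation (`ddof=0`). The two-column arrays are small illustrative values, not the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative values only: two columns, four training rows, one test row.
train = np.array([[148.0, 50.0], [85.0, 31.0], [183.0, 32.0], [89.0, 21.0]])
test = np.array([[137.0, 33.0]])

scaler_demo = StandardScaler().fit(train)   # learns mean and std from train only
scaled_test = scaler_demo.transform(test)   # reuses those training statistics

# Manual equivalent: subtract the training mean, divide by the training std.
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(np.allclose(scaled_test, manual))
```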


In [8]:
# Step 5: Feature scaling

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


## Step 6: Train Machine Learning Models

### Explanation:
Define three machine learning models:
- **Logistic Regression:** A linear model for binary classification.
- **Random Forest:** An ensemble of decision trees.
- **Support Vector Machine (SVM):** A margin-based classifier, used here with an RBF kernel; often effective on small datasets.

This cell only instantiates the models. Fitting on the training data and predicting on the testing data happen in the Step 7 loop.


In [9]:
# Step 6: Train models

models = {
    "Logistic Regression": LogisticRegression(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Support Vector Machine": SVC(kernel='rbf', random_state=42)
}


## Step 7: Evaluate Models

### Explanation:
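Fit each model on the training data, predict on the testing data, and evaluate using:
- **Accuracy:** Fraction of correct predictions.
- **Classification Report:** Precision, recall, and F1-score per class.
- **Confusion Matrix:** Counts of true negatives, false positives, false negatives, and true positives. `confusion_matrix` is imported in Step 1, but the loop in this step does not print it.

The accuracy of each model is stored in `results` for the comparison in Step 8.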
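
As a small illustration of how the confusion matrix relates to accuracy, the sketch below uses eight hand-written labels (not the notebook's predictions). In scikit-learn, rows are true classes and columns are predicted classes, so `ravel()` on a binary matrix yields the counts in the order TN, FP, FN, TP.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-written labels for illustration only.
y_true_demo = [0, 0, 0, 1, 1, 1, 0, 1]
y_pred_demo = [0, 1, 0, 1, 0, 1, 0, 1]

cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()
acc_demo = accuracy_score(y_true_demo, y_pred_demo)

print(cm)
print(tn, fp, fn, tp)
print(acc_demo)  # equals (tn + tp) / total
```

Inside the evaluation loop, the equivalent call would be `confusion_matrix(y_test, y_pred)`.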


In [10]:
results = {}
for model_name, model in models.items():
    print(f"\nTraining {model_name}...")
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"Accuracy for {model_name}: {acc}")
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
    results[model_name] = acc



Training Logistic Regression...
Accuracy for Logistic Regression: 0.7532467532467533
Classification Report:
              precision    recall  f1-score   support

           0       0.81      0.80      0.81        99
           1       0.65      0.67      0.66        55

    accuracy                           0.75       154
   macro avg       0.73      0.74      0.73       154
weighted avg       0.76      0.75      0.75       154


Training Random Forest...
Accuracy for Random Forest: 0.7532467532467533
Classification Report:
              precision    recall  f1-score   support

           0       0.82      0.79      0.80        99
           1       0.64      0.69      0.67        55

    accuracy                           0.75       154
   macro avg       0.73      0.74      0.74       154
weighted avg       0.76      0.75      0.76       154


Training Support Vector Machine...
Accuracy for Support Vector Machine: 0.7597402597402597
Classification Report:
              precision  

## Step 8: Select the Best Model

### Explanation:
Identify the model with the highest accuracy.
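
The sketch below replays the selection on the accuracies printed in Step 7 and converts them back to counts of correct predictions on the 154-row test set. It suggests how narrow the margin is: the winning model differs from the other two by a single test sample. The names `results_demo` and `n_test` are illustrative.

```python
# Accuracies copied from the Step 7 output in this notebook.
results_demo = {
    "Logistic Regression": 0.7532467532467533,
    "Random Forest": 0.7532467532467533,
    "Support Vector Machine": 0.7597402597402597,
}
n_test = 154  # test-set size, from the "support" column of the reports

# max() with key=dict.get returns the key with the largest value;
# on a tie it returns the first such key in insertion order.
best = max(results_demo, key=results_demo.get)
correct = {name: round(acc * n_test) for name, acc in results_demo.items()}

print(best)
print(correct)
```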


In [11]:
# Step 8: Choose the best model

best_model_name = max(results, key=results.get)
print(f"\nBest model based on accuracy: {best_model_name} with accuracy {results[best_model_name]}")



Best model based on accuracy: Support Vector Machine with accuracy 0.7597402597402597


## Step 9: Save the Model and Scaler

### Explanation:
Save the best model and the scaler using `pickle`. These files will be used for deployment to make predictions on new data.
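
A quick way to gain confidence in the saved artifact is a round trip: serialize, deserialize, and confirm that the restored model predicts the same labels. The sketch below does this in memory with `pickle.dumps` and `pickle.loads` on a small synthetic problem; the data and the names `model_demo` and `restored` are illustrative, not part of the notebook.

```python
import pickle

import numpy as np
from sklearn.svm import SVC

# Synthetic data for illustration: the label depends on the sign of column 0.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

model_demo = SVC(kernel="rbf").fit(X_demo, y_demo)

# Round trip through bytes instead of a file on disk.
restored = pickle.loads(pickle.dumps(model_demo))

same = np.array_equal(model_demo.predict(X_demo), restored.predict(X_demo))
print(same)
```

The same comparison can be run on `X_test` after loading `diabetes_model.pkl`.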


In [13]:
# Step 9: Save the best model

best_model = models[best_model_name]
with open('diabetes_model.pkl', 'wb') as f:
    pickle.dump(best_model, f)
print("\nBest model saved as 'diabetes_model.pkl'.")

# Save the scaler for deployment
with open('scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)
print("Scaler saved as 'scaler.pkl'.")



Best model saved as 'diabetes_model.pkl'.
Scaler saved as 'scaler.pkl'.


## Step 10: Test Deployment

### Explanation:
Simulate deployment by loading the saved model and scaler, then make predictions on new data.
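
The scaler and model expect input columns in exactly the order used during training, which in this notebook is every column except `Outcome`, starting with `Unnamed: 0`. One way to guard against misordered values is to build the row from named fields, as sketched below. `FEATURE_ORDER` and `patient` are hypothetical helpers, the measurements are row 0 from the `head()` output, and the value 0 for `Unnamed: 0` is simply that row's index.

```python
import numpy as np

# Training column order in this notebook: all columns except Outcome.
FEATURE_ORDER = [
    "Unnamed: 0", "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age",
]

# Row 0 from the dataset.head() output, keyed by column name.
patient = {
    "Age": 50, "BMI": 33.6, "Glucose": 148, "Pregnancies": 6,
    "BloodPressure": 72, "SkinThickness": 35, "Insulin": 0,
    "DiabetesPedigreeFunction": 0.627, "Unnamed: 0": 0,
}

# Assemble a single-row 2D array in training order, whatever the dict order is.
row = np.array([[patient[name] for name in FEATURE_ORDER]], dtype=float)
print(row.shape)
print(row)
```

A row built this way can be passed to `loaded_scaler.transform` in place of a hand-typed array.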


In [15]:
# Step 10: Deployment test
# Load the saved model and scaler for predictions
with open('diabetes_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

with open('scaler.pkl', 'rb') as f:
    loaded_scaler = pickle.load(f)

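# Example test input: 9 values, because the scaler was fit on 9 columns
# (`Unnamed: 0` followed by the eight clinical features, in that order).
# Note: these are row 0's Pregnancies..Outcome values, so each one sits one
# column to the left of where the training data had it.
test_input = np.array([[6, 148, 72, 35, 0, 33.6, 0.627, 50, 1]])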
scaled_input = loaded_scaler.transform(test_input)
prediction = loaded_model.predict(scaled_input)

print("\nTest Input Prediction:")
print("Diabetic" if prediction[0] == 1 else "Non-Diabetic")



Test Input Prediction:
Non-Diabetic
