Goal: Build an ML model that predicts whether a patient is diabetic, based on the diagnostic parameters in the dataset.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [2]:
# Load Dataset
data = pd.read_csv("/content/diabetes_data.csv")
data

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
5,5,116,74,0,0,25.6,0.201,30,0
6,3,78,50,32,88,31.0,0.248,26,1
7,10,115,0,0,0,35.3,0.134,29,0
8,2,197,70,45,543,30.5,0.158,53,1
9,8,125,96,0,0,0.0,0.232,54,1


In [3]:
# Separate Features & Target
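# "Unnamed: 0" is the saved row index from the CSV, not a patient measurement, so it is dropped too
X = data.drop(columns=["Outcome", "Unnamed: 0"], errors="ignore")   # all feature columns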
y = data["Outcome"]                # target column

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Logistic Regression Model
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

# Predictions
y_pred = model.predict(X_test_scaled)

# Model Evaluation
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print("\nModel Evaluation:")
print(f"Accuracy: {accuracy*100:.2f}%")
print("\nConfusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(class_report)


Model Evaluation:
Accuracy: 80.00%

Confusion Matrix:
[[0 0]
 [1 4]]

Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         0
           1       1.00      0.80      0.89         5

    accuracy                           0.80         5
   macro avg       0.50      0.40      0.44         5
weighted avg       1.00      0.80      0.89         5
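
The report above shows a support of 0 for class 0: the random test split happened to contain only diabetic patients, so these scores say nothing about how well non-diabetic patients are recognized. One common remedy is a stratified split. The sketch below uses hypothetical stand-in arrays (`X_demo`, `y_demo`), not the notebook's data, to illustrate how the `stratify` argument of `train_test_split` keeps both classes in the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data (assumption): 25 patients, 8 features, 9 positives
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(25, 8))
y_demo = np.array([1] * 9 + [0] * 16)

# stratify=y_demo keeps the class ratio roughly equal in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("test class counts:", np.bincount(y_te, minlength=2))
```

In the notebook itself this would amount to passing `stratify=y` in the existing `train_test_split` call, assuming each class has at least two samples.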





Answer these questions based on the above assignment:

**1. How can machine learning help in predicting diabetes?**

**Ans.** Machine learning (ML) helps by:

Learning patterns from historical patient data (age, glucose level, BMI, etc.).

Predicting whether a new patient is likely to have diabetes.

Helping doctors with early detection and prevention.

This can reduce manual diagnosis time and may improve accuracy, provided the model is validated on representative data.

**2. What are some effective features to consider when building a diabetes prediction model?**

**Ans.** From our dataset, these are good features:

Pregnancies → Number of pregnancies; more pregnancies may raise diabetes risk

Glucose → High glucose strongly indicates diabetes

BloodPressure → High blood pressure often accompanies diabetes

SkinThickness → Skinfold thickness, a proxy for body fat

Insulin → Serum insulin level; abnormal values can reflect insulin resistance

BMI (Body Mass Index) → Higher BMI increases risk

DiabetesPedigreeFunction → Score summarizing family history of diabetes

Age → Risk generally rises with age

Outcome (1 = diabetic, 0 = not diabetic) is our target variable.
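
One rough way to see which features a linear model leans on is to standardize them and compare the logistic regression coefficients. The sketch below does this on only the ten rows displayed in the notebook output above, so the ranking is illustrative and should not be read as a finding; the variable names (`rows`, `ranked`) are placeholders.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# The ten rows shown in the notebook's DataFrame display
rows = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0, 5, 3, 10, 2, 8],
    "Glucose": [148, 85, 183, 89, 137, 116, 78, 115, 197, 125],
    "BloodPressure": [72, 66, 64, 66, 40, 74, 50, 0, 70, 96],
    "SkinThickness": [35, 29, 0, 23, 35, 0, 32, 0, 45, 0],
    "Insulin": [0, 0, 0, 94, 168, 0, 88, 0, 543, 0],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, 0.0],
    "DiabetesPedigreeFunction": [0.627, 0.351, 0.672, 0.167, 2.288,
                                 0.201, 0.248, 0.134, 0.158, 0.232],
    "Age": [50, 31, 32, 21, 33, 30, 26, 29, 53, 54],
    "Outcome": [1, 0, 1, 0, 1, 0, 1, 0, 1, 1],
})

features = rows.drop(columns="Outcome")

# Standardize so coefficient magnitudes are comparable across features
scaled = StandardScaler().fit_transform(features)
clf = LogisticRegression(max_iter=1000).fit(scaled, rows["Outcome"])

# Rank features by absolute coefficient size (sign shows direction of effect)
coefs = pd.Series(clf.coef_[0], index=features.columns)
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked.round(2))
```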

**3. Which machine learning algorithms work well for diabetes prediction?**

**Ans.** Logistic Regression → Simple and interpretable; a good baseline

Decision Trees / Random Forest → Capture non-linear patterns

Support Vector Machines (SVM) → Can work well on small to medium datasets when features are scaled

K-Nearest Neighbors (KNN) → Distance-based; sensitive to feature scaling

Neural Networks → Suited to large, complex datasets

We use Logistic Regression here because it is a common, interpretable baseline for medical prediction.
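
A simple way to choose among these algorithms is to compare them under the same cross-validation scheme. The sketch below does so on synthetic data from `make_classification` (an assumption standing in for the diabetes CSV), so the printed scores say nothing about the real dataset; the model settings are illustrative defaults.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the patient table (assumption): 300 rows, 8 features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Scale-sensitive models get a StandardScaler inside their pipeline
candidates = {
    "LogisticRegression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM (RBF)": make_pipeline(StandardScaler(), SVC()),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

# Mean 5-fold cross-validated accuracy per model
cv_scores = {
    name: cross_val_score(est, X_demo, y_demo, cv=5, scoring="accuracy").mean()
    for name, est in candidates.items()
}

for name, score in sorted(cv_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} {score:.3f}")
```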

**4. How do you evaluate the performance of a diabetes prediction model?**

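**Ans.** We can use:

Accuracy → Share of all predictions that are correct; can mislead when classes are imbalanced

Precision & Recall → Especially important in medical prediction, where missing a diabetic patient (low recall) is costly

Confusion Matrix → Counts of true/false positives and true/false negatives

ROC-AUC Score → Measures how well the model ranks diabetic patients above non-diabetic ones across all thresholds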
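
To make these definitions concrete, the metrics can be recomputed by hand from the confusion matrix printed in the evaluation output above. This is a small worked example in plain Python, assuming scikit-learn's layout (rows are true labels, columns are predicted labels).

```python
# Confusion matrix from the evaluation output: rows = true, columns = predicted
conf_matrix = [[0, 0],
               [1, 4]]

tn, fp = conf_matrix[0]
fn, tp = conf_matrix[1]
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp) if (tp + fp) else 0.0   # of predicted diabetics, how many are
recall = tp / (tp + fn) if (tp + fn) else 0.0      # of true diabetics, how many were found
f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

# Specificity needs true class-0 samples; there are none in this test split
specificity = tn / (tn + fp) if (tn + fp) else None

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f} specificity={specificity}")
```

The values for class 1 match the classification report (1.00, 0.80, 0.89). Specificity is undefined because the test split contains no class-0 patients.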

**5. How can you increase the performance of the models?**

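**Ans.** Data cleaning (treat impossible zero values, e.g. in BMI or BloodPressure, as missing, then impute or drop them)

Feature scaling (standardize features such as glucose and BMI)

Feature engineering (combine or transform features)

Try other algorithms (Random Forest, XGBoost)

Hyperparameter tuning (e.g. grid search with cross-validation)

More data (usually improves generalization)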
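
Several of these steps can be chained so that cleaning, scaling, and tuning are all cross-validated together. The following is a minimal sketch on synthetic data with artificially injected missing values (both are assumptions, not the notebook's data); the median imputation and the `C` grid are illustrative choices only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption) with about 5% of values knocked out
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)
rng = np.random.default_rng(0)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

# Imputation and scaling live inside the pipeline, so each CV fold
# fits them on its own training part only (no leakage from validation data)
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Tune the regularization strength C with 5-fold CV, scored by ROC-AUC
grid = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1, 10]}, cv=5, scoring="roc_auc")
grid.fit(X_demo, y_demo)

print("best C:", grid.best_params_["clf__C"])
print("best CV ROC-AUC:", round(grid.best_score_, 3))
```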

**6. Are there any challenges or limitations when working with diabetes data in machine learning?**

**Ans.** Missing values recorded as zeros (e.g. SkinThickness=0 or BMI=0 is physiologically implausible)

Imbalanced dataset (typically more non-diabetic than diabetic patients)

Small dataset (may lead to overfitting and unstable evaluation)

Ethical concerns (wrong predictions can affect treatment)
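
The zero-value problem is visible even in the ten rows displayed in the notebook output. The sketch below rebuilds the relevant columns from those rows, treats zeros as missing in the measurement columns where zero is implausible (the choice of columns is an assumption), and checks the class balance of the sample.

```python
import numpy as np
import pandas as pd

# Columns copied from the ten rows shown in the notebook's DataFrame display
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 116, 78, 115, 197, 125],
    "BloodPressure": [72, 66, 64, 66, 40, 74, 50, 0, 70, 96],
    "SkinThickness": [35, 29, 0, 23, 35, 0, 32, 0, 45, 0],
    "Insulin": [0, 0, 0, 94, 168, 0, 88, 0, 543, 0],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, 0.0],
    "Outcome": [1, 0, 1, 0, 1, 0, 1, 0, 1, 1],
})

# Assumption: a zero in these measurements means "not recorded"
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

cleaned = sample.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)

# How many values per column are effectively missing
missing_counts = cleaned[zero_as_missing].isna().sum()
print(missing_counts)

# Share of each class in the sample
class_share = sample["Outcome"].value_counts(normalize=True)
print(class_share)
```

In these ten rows, Insulin is missing in 6 cases and SkinThickness in 4, which illustrates why dropping such rows outright can discard much of a small dataset.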