# Diabetes Prediction using Support Vector Machine (SVM)

This project builds a **machine learning model** to predict whether a person has diabetes based on medical and physiological features. We use Support Vector Machine (SVM) with a linear kernel for binary classification.

## Project Overview
- **Dataset**: PIMA Indians Diabetes Dataset
- **Algorithm**: Support Vector Machine (Linear Kernel)
- **Goal**: Predict diabetes with high accuracy using diagnostic measurements
- **Evaluation Metrics**: Accuracy score on training and test data

## 1. Importing Required Libraries

We import all necessary Python libraries for:
- **Data manipulation**: NumPy, Pandas
- **Preprocessing**: StandardScaler for feature scaling
- **Model building**: SVM classifier from scikit-learn
- **Evaluation**: Accuracy score metric

In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import accuracy_score

## 2. Loading the Dataset

We load the **PIMA Indians Diabetes Dataset**, which contains diagnostic measurements for females of Pima Indian heritage aged 21 and older.

### Dataset Features:
1. **Pregnancies**: Number of times pregnant
2. **Glucose**: Plasma glucose concentration (2 hours in an oral glucose tolerance test)
3. **BloodPressure**: Diastolic blood pressure (mm Hg)
4. **SkinThickness**: Triceps skin fold thickness (mm)
5. **Insulin**: 2-Hour serum insulin (mu U/ml)
6. **BMI**: Body mass index (weight in kg/(height in m)²)
7. **DiabetesPedigreeFunction**: Diabetes pedigree function (genetic influence)
8. **Age**: Age in years
9. **Outcome**: Target variable (1 = diabetic, 0 = non-diabetic)

In [None]:
# Raw string so the backslashes in the Windows path are not read as escape sequences
dataset = pd.read_csv(r'C:\LEARNING\python\diabetes (1).csv')


## 3. Exploratory Data Analysis (EDA)

Before building the model, we explore the dataset to understand:
- Dataset dimensions and structure
- Statistical distributions of features
- Class balance in the target variable
- Feature correlations with diabetes outcome

### 3.1 Dataset Shape

Check the number of rows (samples) and columns (features) in the dataset.

In [None]:
dataset.shape

### 3.2 Statistical Summary

The `describe()` method provides descriptive statistics including:
- **Mean**: Average value
- **Std**: Standard deviation (spread of data)
- **Min/Max**: Range of values
- **Percentiles**: Distribution quartiles

This helps identify outliers and potentially invalid values (e.g., zero values in features like Glucose or Blood Pressure).

In [None]:
dataset.describe()
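
The zero values mentioned above can also be counted directly. The following is a minimal sketch on a small hand-made frame; the column names match the dataset, but the values and the `sample` and `zero_counts` names are illustrative assumptions, not taken from the real data.

```python
import pandas as pd

# Tiny hand-made stand-in for the real dataset (values are illustrative only).
sample = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 0],
    "BMI": [33.6, 26.6, 23.3, 0.0],
})

# A zero is physiologically implausible in these columns, so count zeros per column.
zero_counts = (sample == 0).sum()
print(zero_counts)
```

Running the same expression on `dataset` would show which columns use zero as a stand-in for a missing measurement.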

### 3.3 Target Variable Distribution

Examine the class distribution to check for imbalance between diabetic and non-diabetic cases.

In [None]:
dataset['Outcome'].value_counts()
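
Raw counts are easier to judge as proportions. A small sketch, assuming a hand-made `outcome` series rather than the real column, shows how `value_counts(normalize=True)` reports the share of each class:

```python
import pandas as pd

# Illustrative labels only: 7 non-diabetic (0) and 3 diabetic (1) cases.
outcome = pd.Series([0, 0, 0, 1, 0, 1, 0, 1, 0, 0], name="Outcome")

# normalize=True returns fractions that sum to 1 instead of raw counts.
proportions = outcome.value_counts(normalize=True)
print(proportions)
```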

### 3.4 Feature Means by Outcome

Compare average feature values between diabetic and non-diabetic groups to identify predictive features.

In [None]:
dataset.groupby('Outcome').mean()
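
To make the comparison explicit, the two group means can be subtracted. This is a sketch on made-up numbers; `sample`, `group_means`, and `mean_gap` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Illustrative values only, not rows from the real dataset.
sample = pd.DataFrame({
    "Glucose": [85, 90, 150, 170],
    "BMI": [24.0, 26.0, 33.0, 35.0],
    "Outcome": [0, 0, 1, 1],
})

group_means = sample.groupby("Outcome").mean()

# Difference between the diabetic (1) and non-diabetic (0) group means per feature.
mean_gap = group_means.loc[1] - group_means.loc[0]
print(group_means)
print(mean_gap)
```

Because the features are on different scales, the raw gaps are not directly comparable across columns; they only indicate the direction of the difference.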

### 3.5 Preview Dataset

Display the first few rows to verify data loaded correctly and inspect actual values.

In [None]:
dataset.head()

## 4. Data Preprocessing

Preprocessing steps ensure optimal model performance:
1. **Feature-Target Separation**: Split independent variables (X) from the target variable (y)
2. **Feature Scaling**: Standardize features to have mean=0 and std=1
3. **Train-Test Split**: Divide data for training and evaluation

### 4.1 Separating Features and Target

Split the dataset into:
- **X**: Input features (all columns except 'Outcome')
- **y**: Target variable ('Outcome' column)

In [None]:
x = dataset.drop(columns='Outcome')  # `columns=` already implies axis=1
y= dataset['Outcome']

### 4.2 Verify Feature-Target Separation

Display X and y to confirm proper separation.

In [None]:
print(x)
print(y)

### 4.3 Feature Standardization

**Why Standardization?**
- SVM is sensitive to feature scales
- StandardScaler transforms features to have mean=0 and standard deviation=1
- Prevents features with large numeric ranges from dominating the decision boundary

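We fit the scaler on the features to learn each column's mean and standard deviation. Note that fitting on the full dataset, as done here, lets test-set rows influence the scaling statistics; fitting on the training split alone avoids this leakage.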

In [None]:
scaler= StandardScaler()
scaler.fit(x)

### 4.4 Transform Features

Apply the standardization transformation to the feature data.

In [None]:
Standardized_data=scaler.transform(x)
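
To make the transformation concrete, the sketch below compares `StandardScaler` with the manual formula z = (x - mean) / std on a small made-up matrix. The `features`, `demo_scaler`, `scaled`, and `manual` names are assumptions for illustration; `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small illustrative matrix: 4 samples, 2 features on very different scales.
features = np.array([[1.0, 100.0],
                     [2.0, 200.0],
                     [3.0, 300.0],
                     [4.0, 400.0]])

demo_scaler = StandardScaler().fit(features)
scaled = demo_scaler.transform(features)

# Manual equivalent: subtract the column mean, divide by the population std (ddof=0).
manual = (features - features.mean(axis=0)) / features.std(axis=0)

print(scaled.mean(axis=0))          # approximately 0 for each column
print(scaled.std(axis=0))           # approximately 1 for each column
print(np.allclose(scaled, manual))  # both approaches agree
```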

### 4.5 Update Feature Matrix

Replace the original features with standardized values.

In [None]:
x= Standardized_data
y= dataset['Outcome']

### 4.6 Verify Standardized Data

Display the standardized features to confirm transformation.

In [None]:
print(x)
print(y)

### 4.7 Train-Test Split

Split the data into training and testing sets:
- **Training Set (80%)**: Used to train the model
- **Test Set (20%)**: Used to evaluate model performance on unseen data

**Parameters**:
- `test_size=0.2`: 20% of data for testing
- `stratify=y`: Maintains class distribution in both sets
- `random_state=2`: Ensures reproducible splits

In [None]:
x_train,x_test, y_train, y_test= train_test_split(x,y,test_size=0.2, stratify=y, random_state=2)
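
The effect of `stratify` can be checked on synthetic labels. The following sketch uses made-up data (70 negatives, 30 positives) and hypothetical variable names; it is not the notebook's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 70 negatives and 30 positives (illustrative only).
labels = np.array([0] * 70 + [1] * 30)
features = np.arange(100).reshape(-1, 1)

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=2
)

# With stratify, both subsets keep the positive rate of the full data.
print("train positive rate:", l_train.mean())
print("test positive rate:", l_test.mean())
```

Without `stratify`, the positive rate in a small test set can drift away from the overall rate purely by chance.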

### 4.8 Verify Split Dimensions

Check the shapes of training and test sets to ensure proper splitting.

In [None]:
# Full data, training split, test split
print(x.shape, x_train.shape, x_test.shape)
print(y.shape, y_train.shape, y_test.shape)

## 5. Model Building: Support Vector Machine (SVM)

**Why SVM?**
- Effective for binary classification problems
- Works well with high-dimensional data
- A linear kernel is a reasonable choice when the classes are approximately linearly separable

**How SVM Works**:
- Finds the optimal hyperplane that separates the two classes
- Maximizes the margin between classes for better generalization
- Support vectors are the critical data points closest to the decision boundary
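
The ideas above can be seen on a toy problem. This is a minimal sketch on six hand-picked 2-D points, not the diabetes data; `points`, `classes`, and `toy_clf` are hypothetical names. For a linear kernel, scikit-learn exposes the hyperplane through `coef_` and `intercept_`, and the retained boundary points through `support_vectors_`.

```python
import numpy as np
from sklearn import svm

# Toy 2-D data: two well-separated clusters (illustrative only).
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
classes = np.array([0, 0, 0, 1, 1, 1])

toy_clf = svm.SVC(kernel="linear").fit(points, classes)

# The hyperplane is w . x + b = 0; coef_ and intercept_ hold w and b.
w, b = toy_clf.coef_[0], toy_clf.intercept_[0]
print("w =", w, "b =", b)

# Only the points nearest the boundary are kept as support vectors.
print(toy_clf.support_vectors_)

# decision_function returns the signed score w . x + b for each sample.
scores = toy_clf.decision_function(points)
print(np.allclose(scores, points @ w + b))
```

A positive score maps to class 1 and a negative score to class 0; the magnitude grows with distance from the boundary.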

### 5.1 Initialize SVM Classifier

Create an SVM classifier with a linear kernel for binary classification.

In [None]:
classifier = svm.SVC(kernel='linear')

### 5.2 Train the Model

Fit the SVM classifier on the training data to learn the decision boundary.

In [None]:
classifier.fit(x_train, y_train)

## 6. Model Evaluation

Evaluate model performance using accuracy score on both training and test data:
- **Training Accuracy**: Indicates how well the model fits the training data
- **Test Accuracy**: Measures generalization to unseen data

**Note**: A large gap between training and test accuracy may indicate overfitting.

### 6.1 Training Data Accuracy

Evaluate model performance on the training set.

In [None]:
x_train_prediction = classifier.predict(x_train)
# accuracy_score expects the true labels first, then the predictions
training_data_accuracy = accuracy_score(y_train, x_train_prediction)

### 6.2 Display Training Accuracy

Print the accuracy score achieved on the training data.

In [None]:
print('accuracy score of training data :', training_data_accuracy)

### 6.3 Test Data Accuracy

Evaluate model performance on the test set to measure generalization.

In [None]:
y_test_prediction = classifier.predict(x_test)
# accuracy_score expects the true labels first, then the predictions
test_data_accuracy = accuracy_score(y_test, y_test_prediction)

### 6.4 Display Test Accuracy

Print the accuracy score achieved on the test data.

In [None]:
print('accuracy score of test data :', test_data_accuracy)

## 7. Making Predictions on New Data

Use the trained model to predict diabetes for new patient data.

**Prediction Process**:
1. Input patient features
2. Convert to NumPy array
3. Reshape for model compatibility
4. Standardize using the same scaler
5. Make prediction
6. Interpret result (0 = non-diabetic, 1 = diabetic)

### 7.1 Example Prediction

Predict diabetes for a sample patient with the following characteristics:
- Pregnancies: 2
- Glucose: 108
- Blood Pressure: 62
- Skin Thickness: 32
- Insulin: 56
- BMI: 25.2
- Diabetes Pedigree Function: 0.128
- Age: 21

In [None]:
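# Feature values in the same column order as the training data
input_data = (2, 108, 62, 32, 56, 25.2, 0.128, 21)
# Convert to a NumPy array and reshape to one row (a single sample)
input_data_as_numpy_array = np.asarray(input_data)
input_data_reshaped = input_data_as_numpy_array.reshape(1, -1)
# Standardize with the scaler fitted earlier, then predict
std_data = scaler.transform(input_data_reshaped)
prediction = classifier.predict(std_data)
print(prediction)
if prediction[0] == 0:
    print('The person is not diabetic')
else:
    print('The person is diabetic')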
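
The prediction steps can also be wrapped in a reusable function. The sketch below is one possible way to do it: `predict_outcome` is a hypothetical helper, and the scaler and classifier are trained on synthetic data so the block runs on its own.

```python
import numpy as np
from sklearn import svm
from sklearn.preprocessing import StandardScaler


def predict_outcome(fitted_scaler, fitted_classifier, values):
    """Hypothetical helper: scale one sample and return the predicted class (0 or 1)."""
    sample = np.asarray(values, dtype=float).reshape(1, -1)  # one row, n features
    scaled = fitted_scaler.transform(sample)                 # reuse the training scaler
    return int(fitted_classifier.predict(scaled)[0])


# Synthetic stand-in for the training data: class 1 has larger feature values.
rng = np.random.default_rng(0)
low = rng.normal(loc=0.0, scale=1.0, size=(50, 3))
high = rng.normal(loc=5.0, scale=1.0, size=(50, 3))
train_x = np.vstack([low, high])
train_y = np.array([0] * 50 + [1] * 50)

demo_scaler = StandardScaler().fit(train_x)
demo_clf = svm.SVC(kernel="linear").fit(demo_scaler.transform(train_x), train_y)

print(predict_outcome(demo_scaler, demo_clf, [0.1, -0.2, 0.3]))
print(predict_outcome(demo_scaler, demo_clf, [5.2, 4.9, 5.1]))
```

Passing the fitted scaler explicitly makes it harder to forget the standardization step or to refit the scaler on the new sample by accident.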


## 8. Conclusion and Key Insights

### Model Performance:
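- The linear-kernel SVM serves as a simple baseline for diabetes prediction; its quality should be judged from the training and test accuracy reported in Section 6
- Standardization matters because SVMs are sensitive to feature scales, although this notebook does not measure its effect directly
- Using the model for real-world diabetes risk assessment would require validation beyond the accuracy scores computed here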

### Important Features:
Based on the group means in the EDA (Section 3.4), the features that appear most influential for diabetes prediction are:
1. **Glucose Level**: Strongest predictor of diabetes
2. **BMI**: High correlation with diabetes risk
3. **Age**: Older patients show higher diabetes prevalence
4. **Diabetes Pedigree Function**: Genetic predisposition plays a role

### Future Improvements:
1. **Hyperparameter Tuning**: Use GridSearchCV to optimize SVM parameters (C, plus gamma if a non-linear kernel such as RBF is tried; gamma has no effect on a linear kernel)
2. **Feature Engineering**: Create interaction terms or polynomial features
3. **Advanced Algorithms**: Test Random Forest, XGBoost, or Neural Networks
4. **Cross-Validation**: Implement k-fold CV for more robust evaluation
5. **Additional Metrics**: Include precision, recall, F1-score, and ROC-AUC
6. **Handle Missing Values**: Treat zeros in features such as Glucose and BloodPressure as missing values and impute them
7. **Deployment**: Create a web application using Streamlit or Flask
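
Several of these improvements (hyperparameter tuning, cross-validation, additional metrics) can be combined in a few lines. The following is a sketch under stated assumptions: it uses synthetic data from `make_classification` instead of the diabetes dataset, and the grid of `C` values is an arbitrary starting point. Placing the scaler inside a `Pipeline` means each cross-validation fold fits its own scaler on its training portion only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with 8 features, like the diabetes data (not the real dataset).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2
)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler.
pipeline = Pipeline([("scaler", StandardScaler()), ("svc", SVC(kernel="linear"))])

# Illustrative grid; these values of C are arbitrary starting points.
param_grid = {"svc__C": [0.01, 0.1, 1, 10]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("best C:", search.best_params_["svc__C"])
print("mean CV accuracy:", search.best_score_)

# Precision, recall, and F1 per class on the held-out test split.
test_predictions = search.predict(X_test)
print(classification_report(y_test, test_predictions))
```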

### Potential Applications:
- Clinical decision support systems
- Early diabetes screening programs
- Health risk assessment tools
- Preventive healthcare initiatives

---