<a href="https://colab.research.google.com/github/anshkilhor/diabetes-prediction-using-svm/blob/main/diabetes_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Scikit-learn, often abbreviated as `sklearn`, is a powerful and versatile machine learning library in Python. It is designed to be simple and efficient for data mining and data analysis, and it builds on top of other scientific Python libraries such as NumPy, SciPy, and matplotlib.

### Key Features of Scikit-learn

1. **Classification**: Identifying which category an object belongs to. Examples include:
   - Logistic Regression
   - Support Vector Machines (SVM)
   - k-Nearest Neighbors (k-NN)
   - Decision Trees
   - Random Forests
   - Gradient Boosting

2. **Regression**: Predicting a continuous-valued attribute associated with an object. Examples include:
   - Linear Regression
   - Ridge Regression
   - Lasso Regression
   - Elastic Net
   - Support Vector Regression (SVR)
   - Decision Trees for Regression

3. **Clustering**: Automatic grouping of similar objects into sets. Examples include:
   - K-means
   - Agglomerative Clustering
   - DBSCAN
   - Mean Shift

4. **Dimensionality Reduction**: Reducing the number of random variables to consider. Examples include:
   - Principal Component Analysis (PCA)
   - Singular Value Decomposition (SVD)
   - Linear Discriminant Analysis (LDA)
   - t-distributed Stochastic Neighbor Embedding (t-SNE)

5. **Model Selection**: Comparing, validating, and choosing parameters and models. Tools include:
   - Grid Search
   - Randomized Search
   - Cross-validation
   - Validation Curve
   - Learning Curve

6. **Preprocessing**: Feature extraction and normalization. Techniques include:
   - Standardization
   - Normalization
   - Binarization
   - One-Hot Encoding
   - Imputation of missing values

### Core Components

#### 1. **Datasets**
Scikit-learn provides several built-in datasets for experimentation, such as:
   - `load_iris()`: Iris flower dataset.
   - `load_digits()`: Handwritten digits dataset.
   - `load_wine()`: Wine recognition dataset.
   - `load_breast_cancer()`: Breast cancer diagnostic dataset.
   - Methods to fetch larger datasets, e.g., `fetch_20newsgroups()`, `fetch_openml()`.
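
As a quick illustration of the loaders listed above, the following sketch loads the Iris dataset and inspects what comes back. The variable name `iris` is chosen here for illustration only.

```python
from sklearn.datasets import load_iris

# load_iris() returns a Bunch, a dictionary-like container
iris = load_iris()

print(iris.data.shape)     # feature matrix: (n_samples, n_features)
print(iris.target.shape)   # label vector: (n_samples,)
print(iris.feature_names)  # names of the measured features
print(iris.target_names)   # names of the classes
```

Passing `return_X_y=True` instead returns just the feature matrix and the label vector as a tuple, which is the form used in the workflow examples.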

#### 2. **Model Training**
Scikit-learn's API is consistent and straightforward. Here's a typical workflow:
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_iris(return_X_y=True)

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features by removing the mean and scaling to unit variance
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
```

#### 3. **Pipeline**
Scikit-learn provides a `Pipeline` class to streamline the workflow, encapsulating preprocessing steps and the model training process:
```python
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Pipeline Accuracy: {accuracy}")
```

#### 4. **Cross-Validation**
Cross-validation estimates how well a model generalizes by training and scoring it on several different splits of the data:
```python
from sklearn.model_selection import cross_val_score

# Score the pipeline (not the bare model) so the scaler is refit on each training fold
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"Cross-validation scores: {scores}")
```
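
The grid search tool listed under Model Selection builds on the same cross-validation machinery. The sketch below is one possible setup, not a recommended configuration: the candidate values for `C` and the names `search_pipeline` and `param_grid` are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling lives inside the pipeline, so it is refit on every training fold
search_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative candidates for the inverse regularization strength C
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

# Evaluate every candidate with 5-fold cross-validation
search = GridSearchCV(search_pipeline, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)  # best value of C found on this grid
print(search.best_score_)   # mean cross-validated accuracy for that value
```

The `clf__C` key uses the `<step name>__<parameter>` convention to reach a parameter of a step inside a pipeline.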

### Integration with Other Libraries
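- **NumPy**: Provides the arrays that estimators accept as input and return as output.
- **SciPy**: Supplies sparse matrices and numerical routines; many estimators accept SciPy sparse matrices directly.
- **matplotlib**: Plotting library, used to visualize data and results.
- **pandas**: Data manipulation and analysis library; DataFrames can be passed to estimators as input.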

### Advantages
- **Ease of use**: Consistent API, extensive documentation, and numerous examples.
- **Versatility**: Supports a wide range of machine learning tasks.
- **Interoperability**: Seamlessly integrates with other scientific Python libraries.

### Conclusion
Scikit-learn is a comprehensive machine learning library that provides tools for preprocessing data and for building, tuning, and evaluating models. Its ease of use, rich feature set, and integration with other scientific libraries make it a go-to choice for both beginners and experienced practitioners in data science and machine learning.

# Diabetes Prediction Using SVM

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn import svm

In [2]:
diabetes_data = pd.read_csv('/content/diabetes (1).csv')

In [3]:
pd.read_csv?

In [4]:
diabetes_data.head()

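| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |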


In [5]:
diabetes_data.shape

(768, 9)

In [6]:
diabetes_data.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


In [7]:
diabetes_data['Outcome'].value_counts()

Outcome
0    500
1    268
Name: count, dtype: int64

In [8]:
diabetes_data.groupby('Outcome').mean()

| Outcome | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 3.298 | 109.98 | 68.184 | 19.664 | 68.792 | 30.3042 | 0.429734 | 31.19 |
| 1 | 4.865672 | 141.257463 | 70.824627 | 22.164179 | 100.335821 | 35.142537 | 0.5505 | 37.067164 |


In [9]:
X = diabetes_data.drop(columns='Outcome')
Y = diabetes_data['Outcome']

In [10]:
print(X,Y)

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0              6      148             72             35        0  33.6   
1              1       85             66             29        0  26.6   
2              8      183             64              0        0  23.3   
3              1       89             66             23       94  28.1   
4              0      137             40             35      168  43.1   
..           ...      ...            ...            ...      ...   ...   
763           10      101             76             48      180  32.9   
764            2      122             70             27        0  36.8   
765            5      121             72             23      112  26.2   
766            1      126             60              0        0  30.1   
767            1       93             70             31        0  30.4   

     DiabetesPedigreeFunction  Age  
0                       0.627   50  
1                       0.351   31  


Standardizing the data

Standardizing data in machine learning involves transforming the data so that it has a mean of zero and a standard deviation of one. This process is essential for ensuring that different features contribute equally to the model and that the model's performance is not skewed by features with different scales.

### What is Standardization?

Standardization is a preprocessing step that adjusts the values of features to have:
- A mean (average) of 0.
- A standard deviation of 1.

The standardization formula for a feature \( X \) is:

\[ X_{\text{standardized}} = \frac{X - \mu}{\sigma} \]

where:
- \( \mu \) is the mean of the feature.
- \( \sigma \) is the standard deviation of the feature.
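
The formula above can be checked by hand with NumPy. This is a minimal sketch on a small made-up array; it assumes the population standard deviation (`ddof=0`), which is the convention `StandardScaler` uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up sample data: four rows, two features
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])

mu = X.mean(axis=0)    # per-feature mean
sigma = X.std(axis=0)  # per-feature population standard deviation (ddof=0)

# Apply (X - mu) / sigma column by column via broadcasting
X_manual = (X - mu) / sigma

# StandardScaler should produce the same values
X_sklearn = StandardScaler().fit_transform(X)

print(X_manual.mean(axis=0))             # approximately 0 for each feature
print(X_manual.std(axis=0))              # approximately 1 for each feature
print(np.allclose(X_manual, X_sklearn))  # True
```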

### How to Standardize Data

In Python, particularly with scikit-learn, standardization can be done using the `StandardScaler` class:

```python
from sklearn.preprocessing import StandardScaler

# Sample data
X = [[1, 2], [3, 4], [5, 6], [7, 8]]

# Create a StandardScaler object
scaler = StandardScaler()

# Fit the scaler to the data and transform it
X_standardized = scaler.fit_transform(X)

print(X_standardized)
```

### Why Standardize Data?

1. **Ensuring Equal Contribution of Features**:
   - Many machine learning algorithms (such as regularized linear models, support vector machines, and neural networks) work best when the features are centered around zero and have a similar scale.
   - Features with larger ranges can dominate the learning process, leading to a model that gives undue importance to certain features.

2. **Improving Convergence in Gradient-Based Algorithms**:
   - Algorithms like gradient descent can converge faster when the data is standardized because the cost function's landscape becomes more uniform, avoiding elongated valleys that slow down convergence.

3. **Handling Regularization**:
   - Regularization techniques (like Lasso and Ridge regression) penalize large coefficients. If the data is not standardized, the penalty imposed on coefficients will be inconsistent due to the varying scales of the features.

4. **Enhancing Performance of Distance-Based Algorithms**:
   - Algorithms that rely on distances (like k-nearest neighbors and clustering algorithms) are sensitive to the scale of the data. Standardizing ensures that all features contribute equally to the distance calculations.
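
The effect on distance-based methods can be seen on a tiny made-up example. In this sketch the data, the column meanings (income and age), and the helper `dist` are all hypothetical; the point is only that the nearest neighbor of a row can change once both features are put on the same scale.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: column 0 is yearly income, column 1 is age in years
X = np.array([
    [30_000.0, 25.0],
    [30_500.0, 60.0],
    [45_000.0, 26.0],
    [60_000.0, 40.0],
])

def dist(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

# On the raw data the income column dominates the distance
raw_d01 = dist(X[0], X[1])  # similar income, very different age
raw_d02 = dist(X[0], X[2])  # different income, similar age

# After standardization both columns contribute on a comparable scale
X_std = StandardScaler().fit_transform(X)
std_d01 = dist(X_std[0], X_std[1])
std_d02 = dist(X_std[0], X_std[2])

print(f"raw:          d(0,1)={raw_d01:.1f}  d(0,2)={raw_d02:.1f}")
print(f"standardized: d(0,1)={std_d01:.3f}  d(0,2)={std_d02:.3f}")
```

On the raw data, row 0 is closest to row 1 because only income matters numerically; after standardization the large age gap counts as well and row 2 becomes the closer one.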

### Examples of Algorithms that Benefit from Standardization

1. **Linear Models**:
   - Linear Regression
   - Logistic Regression
   - Support Vector Machines

2. **Distance-Based Algorithms**:
   - k-Nearest Neighbors (k-NN)
   - K-means Clustering

3. **Gradient-Based Algorithms**:
   - Neural Networks
   - Linear models trained with stochastic gradient descent

4. **Principal Component Analysis (PCA)**:
   - PCA identifies the directions (principal components) that maximize the variance in the data. If features are not standardized, PCA may give undue importance to features with larger scales.
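
The PCA point can be illustrated with synthetic data. The sketch below assumes two independent random features whose scales differ by a factor of 1000; the data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: two independent features on very different scales
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(0.0, 1.0, 500),     # small-scale feature
    rng.normal(0.0, 1000.0, 500),  # large-scale feature
])

# Share of variance captured by each principal component, raw data
ratio_raw = PCA(n_components=2).fit(X).explained_variance_ratio_

# Same, after standardizing both features
X_std = StandardScaler().fit_transform(X)
ratio_std = PCA(n_components=2).fit(X_std).explained_variance_ratio_

print("raw:         ", ratio_raw)
print("standardized:", ratio_std)
```

Without standardization the first component is almost entirely the large-scale feature; after standardization the variance is split roughly evenly, as expected for independent features.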

### When Not to Standardize

While standardization is beneficial in many cases, there are situations where it might not be necessary or appropriate:
- **Tree-Based Algorithms**: Decision trees and ensemble methods like random forests and gradient boosting are not sensitive to the scale of the data. These algorithms split data based on the feature values and do not rely on distance metrics or gradient descent.
- **Interpretability**: In some cases, you might want to retain the original scale of the data for interpretability purposes. For example, when predicting house prices, you might want to understand the impact of features in their original units.
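
The claim about tree-based algorithms can be checked empirically. This is a small sketch on the Iris dataset, not a proof: it trains one decision tree on raw features and one on standardized features and measures how often their test predictions agree.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only
scaler = StandardScaler().fit(X_train)

# Same tree settings, once on raw and once on standardized features
tree_raw = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(
    scaler.transform(X_train), y_train
)

pred_raw = tree_raw.predict(X_test)
pred_scaled = tree_scaled.predict(scaler.transform(X_test))

# Fraction of test samples on which both trees predict the same class
agreement = (pred_raw == pred_scaled).mean()
print(f"Fraction of identical test predictions: {agreement:.2f}")
```

Because standardization preserves the ordering of values within each feature, the trees can find equivalent split points either way, so the predictions are expected to match or nearly match.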

### Conclusion

Standardizing data is an important preprocessing step for scale-sensitive algorithms. It puts features on a comparable footing, speeds up convergence of gradient-based training, and often improves model accuracy. By transforming the data to have a mean of zero and a standard deviation of one, we can mitigate issues arising from features with different scales.

In [11]:
scaler = StandardScaler()
scaler.fit(X)

In [13]:

standardized_data = scaler.transform(X)
print(standardized_data)

[[ 0.63994726  0.84832379  0.14964075 ...  0.20401277  0.46849198
   1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575 ... -0.68442195 -0.36506078
  -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 ... -1.10325546  0.60439732
  -0.10558415]
 ...
 [ 0.3429808   0.00330087  0.14964075 ... -0.73518964 -0.68519336
  -0.27575966]
 [-0.84488505  0.1597866  -0.47073225 ... -0.24020459 -0.37110101
   1.17073215]
 [-0.84488505 -0.8730192   0.04624525 ... -0.20212881 -0.47378505
  -0.87137393]]


In [15]:

X = standardized_data
Y = diabetes_data['Outcome']
print(X)
print(Y)


[[ 0.63994726  0.84832379  0.14964075 ...  0.20401277  0.46849198
   1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575 ... -0.68442195 -0.36506078
  -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 ... -1.10325546  0.60439732
  -0.10558415]
 ...
 [ 0.3429808   0.00330087  0.14964075 ... -0.73518964 -0.68519336
  -0.27575966]
 [-0.84488505  0.1597866  -0.47073225 ... -0.24020459 -0.37110101
   1.17073215]
 [-0.84488505 -0.8730192   0.04624525 ... -0.20212881 -0.47378505
  -0.87137393]]
0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64


In [16]:
# Hold out 20% of the rows for testing; stratify to keep the class ratio in both splits
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)
print(X.shape, X_train.shape, X_test.shape)

# Train the support vector machine classifier with a linear kernel
classifier = svm.SVC(kernel='linear')
classifier.fit(X_train, Y_train)

# Accuracy on the training data (accuracy_score expects y_true first, then y_pred)
X_train_prediction = classifier.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)
print('Accuracy score of the training data : ', training_data_accuracy)

# Accuracy on the held-out test data
X_test_prediction = classifier.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)
print('Accuracy score of the test data : ', test_data_accuracy)

(768, 8) (614, 8) (154, 8)
Accuracy score of the training data :  0.7866449511400652
Accuracy score of the test data :  0.7727272727272727
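
The cells above fit the scaler on all 768 rows before splitting. A common variant, sketched here, splits first and wraps the scaler and the classifier in a pipeline so that the scaling statistics come from the training rows only. Because the CSV file is not part of this text, the sketch uses a synthetic stand-in from `make_classification`; the data, the class weights, and the names `X_demo`, `y_demo`, and `svm_pipeline` are assumptions for illustration, and the resulting accuracy says nothing about the diabetes data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with the same shape as the diabetes features (768 x 8)
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=5, n_redundant=0,
    weights=[0.65, 0.35], random_state=2,
)

# Split first, so the test rows stay unseen during preprocessing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2
)

# fit() learns the scaling from X_tr only; predict() reuses it on X_te
svm_pipeline = make_pipeline(StandardScaler(), SVC(kernel="linear"))
svm_pipeline.fit(X_tr, y_tr)

test_accuracy = accuracy_score(y_te, svm_pipeline.predict(X_te))
print(X_tr.shape, X_te.shape)
print(f"Test accuracy on synthetic data: {test_accuracy:.3f}")
```

With this arrangement a new sample can be passed straight to `svm_pipeline.predict`, since the pipeline applies the stored scaling before classifying.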


In [17]:
input_data = (5,166,72,19,175,25.8,0.587,51)
# change the input data to a NumPy array
input_data_as_numpy_array = np.asarray(input_data)
# reshape to (1, n_features) because we are predicting for a single instance
input_data_reshaped = input_data_as_numpy_array.reshape(1, -1)
# standardize the input with the scaler fitted earlier
std_data = scaler.transform(input_data_reshaped)
print(std_data)
prediction = classifier.predict(std_data)
print(prediction)
if prediction[0] == 0:
    print('The person is not diabetic')
else:
    print('The person is diabetic')


[[ 0.3429808   1.41167241  0.14964075 -0.09637905  0.82661621 -0.78595734
   0.34768723  1.51108316]]
[1]
The person is diabetic


