# Diabetes Data Analysis
This notebook analyzes a diabetes dataset. The steps we will follow are:
1. Load the data into a pandas DataFrame.
2. Perform exploratory data analysis (EDA) to understand the structure and characteristics of the data.
3. Visualize the data to gain further insights.
4. Preprocess the data to prepare it for machine learning models (this may include data cleaning, encoding, normalization, etc.).
5. Build a Machine Learning model to predict the outcome based on the given features.
6. Evaluate the performance of the model.

Each step will be accompanied by an explanation and analysis of the process and results.

In [None]:
import pandas as pd

# Load the data
df = pd.read_csv('diabetes.csv')
df.head()

## Data Loading
The data has been successfully loaded into a pandas DataFrame. The columns in the data are:
- Pregnancies: Number of times pregnant
- Glucose: Plasma glucose concentration
- BloodPressure: Diastolic blood pressure (mm Hg)
- SkinThickness: Triceps skin fold thickness (mm)
- Insulin: 2-Hour serum insulin (mu U/ml)
- BMI: Body mass index (weight in kg/(height in m)^2)
- DiabetesPedigreeFunction: Diabetes pedigree function
- Age: Age (years)
- Outcome: Class variable (0 or 1)

The 'Outcome' column is the target variable we want to predict. A value of 1 represents a positive diagnosis for diabetes, while a 0 represents a negative diagnosis.

In [None]:
# Check the shape of the data
print('Shape of the data:', df.shape)

# Check the data types of the columns
print('\nData types of the columns:')
print(df.dtypes)

# Check for missing values
print('\nMissing values in each column:')
print(df.isnull().sum())

## Exploratory Data Analysis (EDA)
From the initial EDA, we have found that:
- The dataset contains 768 rows and 9 columns.
- All columns are numerical (int64 or float64).
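- `isnull()` reports no missing values (no NaN entries) in any column.

These findings tell us that all our data is numerical, which is suitable for most machine learning models, and that no NaN handling is required. Note, however, that `isnull()` only detects NaN: a value of 0 in columns such as 'Glucose', 'BloodPressure', or 'BMI' is not physiologically plausible and may be a placeholder for a missing measurement, so these columns deserve a closer look before modeling.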
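A quick way to look for such placeholder zeros is sketched below. It is only an illustration: the `toy` frame contains made-up values (not rows from `diabetes.csv`), and the choice of `suspect_cols` is an assumption about which columns cannot legitimately be zero.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame with made-up values (not taken from diabetes.csv)
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BMI": [33.6, 26.6, 0.0, 28.1],
    "Pregnancies": [6, 1, 0, 1],  # a zero is a legitimate value here
})

# Assumption: a zero in these columns stands for "not measured"
suspect_cols = ["Glucose", "BMI"]

# isnull() finds nothing, because the placeholders are ordinary numbers
nan_counts = toy.isnull().sum()

# Counting zeros exposes the suspicious entries
zero_counts = (toy[suspect_cols] == 0).sum()

# Optionally mark them as NaN so that a later imputation step can handle them
cleaned = toy.copy()
cleaned[suspect_cols] = cleaned[suspect_cols].replace(0, np.nan)

print(nan_counts)
print(zero_counts)
print(cleaned.isnull().sum())
```

On the real data, the same two lines (`(df[cols] == 0).sum()` and `replace(0, np.nan)`) would apply, with the column list adjusted to whatever the domain knowledge supports.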

In [None]:
import matplotlib.pyplot as plt

# Plot histograms for each column
df.hist(bins=50, figsize=(20,15))
plt.tight_layout()
plt.show()

## Data Visualization
From the histograms for each numerical column in the dataset, we can observe that:
- Most of the feature columns, such as 'Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', and 'Age', are not normally distributed.
- The 'Outcome' column, our target variable, has more instances of class 0 (no diabetes) than class 1 (diabetes). The dataset is therefore imbalanced, which matters when interpreting accuracy: a model that always predicts class 0 would already be correct for the majority of instances.
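The imbalance can be quantified with class proportions, which also give a majority-class baseline against which model accuracy can be compared. The sketch below uses a short made-up label series (`outcome` is illustrative, not the real 'Outcome' column).

```python
import pandas as pd

# Illustrative labels only (not the real Outcome column): 7 zeros, 3 ones
outcome = pd.Series([0, 0, 0, 1, 0, 1, 0, 1, 0, 0], name="Outcome")

# Share of each class in the data
class_share = outcome.value_counts(normalize=True)

# Accuracy of a trivial model that always predicts the most frequent class
majority_baseline = class_share.max()

print(class_share)
print("Majority-class baseline accuracy:", majority_baseline)
```

On the notebook's data, the equivalent call would be `df['Outcome'].value_counts(normalize=True)`.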

In [None]:
from sklearn.model_selection import train_test_split

# Split the data into features (X) and the target variable (y)
X = df.drop('Outcome', axis=1)
y = df['Outcome']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the shape of the training and test sets
print('X_train shape:', X_train.shape)
print('y_train shape:', y_train.shape)
print('X_test shape:', X_test.shape)
print('y_test shape:', y_test.shape)

In [None]:
import seaborn as sns

# Plot the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix of Variables')
plt.show()

# Plot the distribution of the target variable
plt.figure(figsize=(6, 4))
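sns.countplot(x='Outcome', data=df)  # pass the column by name; a bare positional Series is not accepted as x in recent seaborn versions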
plt.title('Distribution of Outcome')
plt.show()

# Plot the distribution of the Age variable
plt.figure(figsize=(6, 4))
sns.histplot(df['Age'], kde=True)
plt.title('Distribution of Age')
plt.show()

# Plot the relationship between Age and Outcome
plt.figure(figsize=(6, 4))
sns.boxplot(x='Outcome', y='Age', data=df)
plt.title('Relationship between Age and Outcome')
plt.show()

# Plot the relationship between BMI and Outcome
plt.figure(figsize=(6, 4))
sns.boxplot(x='Outcome', y='BMI', data=df)
plt.title('Relationship between BMI and Outcome')
plt.show()

## Additional Exploratory Data Analysis
We created additional plots to gain more insights into the data:

1. **Correlation Matrix of Variables**: This heatmap shows the correlation between each pair of variables in the dataset. The correlation is a value between -1 and 1 that represents how closely two variables are linearly related. A value close to 1 means a strong positive correlation (as one variable increases, so does the other), a value close to -1 means a strong negative correlation (as one variable increases, the other decreases), and a value close to 0 means no linear correlation.

2. **Distribution of Outcome**: This bar plot shows the number of instances for each class in the 'Outcome' variable. We can see that there are more instances of class 0 (no diabetes) than class 1 (diabetes), indicating that our dataset is imbalanced.

3. **Distribution of Age**: This histogram shows the distribution of values in the 'Age' variable. We can see that most of the patients are in their 20s to 30s.

4. **Relationship between Age and Outcome**: This box plot shows the distribution of 'Age' for each class in the 'Outcome' variable. We can see that patients with diabetes (Outcome = 1) tend to be older than those without diabetes (Outcome = 0).

5. **Relationship between BMI and Outcome**: This box plot shows the distribution of 'BMI' for each class in the 'Outcome' variable. We can see that patients with diabetes (Outcome = 1) tend to have a higher BMI than those without diabetes (Outcome = 0).
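To make the correlation values in the heatmap more concrete, the following sketch computes the Pearson coefficient by hand on small made-up arrays and compares it with `np.corrcoef`. The helper `pearson_r` is a hypothetical name introduced here for illustration.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation: covariance divided by the product of the spreads."""
    a_c = a - a.mean()
    b_c = b - b.mean()
    return float((a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum()))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pos = 2.0 * x + 1.0      # increases linearly with x
y_neg = -3.0 * x + 10.0    # decreases linearly with x

x_sym = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_sq = x_sym ** 2          # depends on x_sym, but not linearly

r_pos = pearson_r(x, y_pos)
r_neg = pearson_r(x, y_neg)
r_zero = pearson_r(x_sym, y_sq)
r_np = float(np.corrcoef(x, y_pos)[0, 1])

print(r_pos, r_neg, r_zero, r_np)
```

The last case shows why a value near 0 only rules out a linear relationship: `y_sq` is fully determined by `x_sym`, yet the coefficient is zero.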

## Data Preprocessing
In the data preprocessing stage, we first split our data into features (X) and the target variable (y), and then further split these into training and test sets. This will allow us to evaluate the performance of our machine learning model later on. We used 80% of the data for training and 20% for testing.

The shapes of the sets are as follows:
- X_train shape: (614, 8)
- y_train shape: (614,)
- X_test shape: (154, 8)
- y_test shape: (154,)

This means we have 614 instances in our training set and 154 instances in our test set. Each instance has 8 features.
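Because the classes are imbalanced, one option worth considering is a stratified split, which keeps the class proportions similar in both sets. This is a sketch on synthetic data, not what the notebook ran; `X_demo` and `y_demo` are made-up stand-ins for `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 rows, 65% class 0 and 35% class 1
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = np.array([0] * 130 + [1] * 70)

# stratify=y_demo asks the splitter to preserve the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Share of class 1 in train:", y_tr.mean())
print("Share of class 1 in test: ", y_te.mean())
```

In the notebook, this would amount to adding `stratify=y` to the existing `train_test_split` call; note that doing so changes which rows land in the test set, so the reported accuracies would change as well.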

In [None]:
from sklearn.preprocessing import StandardScaler

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit on the training data
scaler.fit(X_train)

# Transform both the training and test data
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Check the first few rows of the transformed training data
print(X_train[:5])

After splitting the data into training and test sets, we standardized the features so that they are on the same scale. The scaler is fitted on the training data only and then applied to both sets, so no information from the test set leaks into preprocessing. Scaling is especially important for algorithms that are sensitive to feature magnitudes, such as k-Nearest Neighbors (k-NN) and Support Vector Machines (SVM).

We used the StandardScaler from sklearn, which standardizes features by removing the mean and scaling to unit variance. Each value now represents how many standard deviations the original value lies from the training-set mean. A negative value indicates that the original value was below the mean, while a positive value indicates it was above the mean.
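The transformation can be reproduced by hand, which makes the role of the training-set statistics explicit. The arrays below are made-up single-feature examples.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data
train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
test = np.array([[3.0], [6.0]])

# Fit on the training data only
scaler_demo = StandardScaler().fit(train)

# The same statistics computed manually
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

# z = (x - training mean) / training standard deviation
manual = (test - mu) / sigma
scaled = scaler_demo.transform(test)

print(manual.ravel())
print(scaled.ravel())
```

The test value 3.0 equals the training mean and maps to 0, while 6.0 lies above every training value and maps to roughly 2.12 standard deviations above the mean.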

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Initialize the Logistic Regression model
model = LogisticRegression(random_state=42)

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the training and test data
train_preds = model.predict(X_train)
test_preds = model.predict(X_test)

# Calculate the accuracy of the model on the training and test data
train_acc = accuracy_score(y_train, train_preds)
test_acc = accuracy_score(y_test, test_preds)

print('Training accuracy:', train_acc)
print('Test accuracy:', test_acc)

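## Model Building and Evaluation
We trained a Logistic Regression model with default parameters as a baseline and measured its accuracy on both the training and test data.

There are several ways we could try to improve the model's performance, such as: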
- Using a more complex model: Logistic Regression is a simple model which might not be able to capture all the complexities in the data. We could try using a more complex model like a Random Forest or a Gradient Boosting model.
- Tuning the model's parameters: We used the default parameters for the Logistic Regression model. We could try tuning these parameters to see if we can get better performance.
- Engineering new features from the existing data: We could try creating new features from the existing data which might help improve the model's performance.
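As an illustration of the second option, the sketch below tunes the regularization strength `C` of a Logistic Regression with cross-validation. It runs on synthetic data from `make_classification`, and the grid values are arbitrary assumptions, so the selected value says nothing about the diabetes data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Scaling inside the pipeline means it is re-fitted on each training fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Smaller C means stronger regularization; these candidates are arbitrary
c_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, c_grid, cv=5, scoring="accuracy")
search.fit(X_syn, y_syn)

print("Best C:", search.best_params_["logisticregression__C"])
print("Best cross-validation accuracy:", search.best_score_)
```

Putting the scaler in the pipeline keeps each validation fold out of the scaling statistics, which mirrors the train-only fitting used earlier in the notebook.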

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Initialize the Random Forest Classifier
rf_model = RandomForestClassifier(random_state=42)

# Train the model on the training data
rf_model.fit(X_train, y_train)

# Make predictions on the training and test data
rf_train_preds = rf_model.predict(X_train)
rf_test_preds = rf_model.predict(X_test)

# Calculate the accuracy of the model on the training and test data
rf_train_acc = accuracy_score(y_train, rf_train_preds)
rf_test_acc = accuracy_score(y_test, rf_test_preds)

print('Training accuracy:', rf_train_acc)
print('Test accuracy:', rf_test_acc)

## Trying a More Complex Model
We built a Random Forest Classifier and trained it on our training data. We then used the model to make predictions on both the training and test data, and calculated the accuracy of the model on both sets.

The model achieved an accuracy of 1.0 on the training data and approximately 0.721 on the test data. While the model performs perfectly on the training data, its performance drops on the test data. This is a clear sign of overfitting, which means the model has learned the training data too well and is not generalizing well to unseen data.

To address this, we can try tuning the model's parameters or using a different model. For instance, we could adjust the number of trees in the forest (n_estimators parameter) or the maximum depth of the trees (max_depth parameter). We could also try a Gradient Boosting model, which often performs well on a wide range of problems.
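The effect of restricting tree growth can be sketched on synthetic data. The parameter values below (`max_depth=4`, `min_samples_leaf=5`) are arbitrary assumptions chosen for illustration, and `min_samples_leaf` is an additional parameter not used elsewhere in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, deliberately noisy data (flip_y randomly reassigns some labels)
X_syn, y_syn = make_classification(
    n_samples=600, n_features=8, n_informative=4, flip_y=0.15, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

settings = {
    "unconstrained": {},
    "constrained": {"max_depth": 4, "min_samples_leaf": 5},
}

# Store (training accuracy, test accuracy) for each setting
scores = {}
for label, params in settings.items():
    rf = RandomForestClassifier(n_estimators=100, random_state=42, **params)
    rf.fit(X_tr, y_tr)
    scores[label] = (rf.score(X_tr, y_tr), rf.score(X_te, y_te))

for label, (train_acc, test_acc) in scores.items():
    print(f"{label}: train={train_acc:.3f}, test={test_acc:.3f}")
```

A smaller gap between training and test accuracy for the constrained forest would indicate less memorization of the training set; whether test accuracy itself improves depends on the data.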

In [None]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [None, 10, 20, 30, 40, 50]
}

# Initialize the GridSearchCV
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, n_jobs=-1)

# Fit the GridSearchCV
grid_search.fit(X_train, y_train)

# Get the best parameters and the best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

# Make predictions on the training and test data using the best model
gs_train_preds = grid_search.predict(X_train)
gs_test_preds = grid_search.predict(X_test)

# Calculate the accuracy of the best model on the training and test data
gs_train_acc = accuracy_score(y_train, gs_train_preds)
gs_test_acc = accuracy_score(y_test, gs_test_preds)

print('Best parameters:', best_params)
print('Best cross-validation score:', best_score)
print('Training accuracy:', gs_train_acc)
print('Test accuracy:', gs_test_acc)

## Parameter Tuning
We performed a grid search to find the best parameters for the Random Forest model. We tried different values for the number of trees in the forest (n_estimators) and the maximum depth of the trees (max_depth), and used cross-validation to find the best parameters.

The best parameters found were 'max_depth': None and 'n_estimators': 200. The best cross-validation score was approximately 0.779.

The model with the best parameters achieved an accuracy of 1.0 on the training data and approximately 0.734 on the test data. Compared with the untuned Random Forest (approximately 0.721), that is a difference of about two of the 154 test instances. The selected max_depth of None leaves the trees unrestricted, so, as with the previous Random Forest model, this model is overfitting the training data.

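To address this, we could try using a different model or engineering new features from the existing data. We could also try a more extensive grid search: the grid above only contains deep trees (max_depth of 10 or more, or unlimited), so adding shallower depths or parameters such as min_samples_leaf may constrain the trees more effectively. Other tuning methods, such as random search or Bayesian optimization, are further options.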
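A random search could look roughly like the following. This is a sketch on synthetic data; the parameter ranges, `n_iter=5`, and `cv=3` are small arbitrary choices to keep it fast, not recommendations.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training data
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Distributions to sample from (randint's upper bound is exclusive)
param_dist = {
    "n_estimators": randint(50, 201),
    "max_depth": randint(2, 11),
    "min_samples_leaf": randint(1, 11),
}

# n_iter sets how many random parameter combinations are evaluated
rand_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=5,
    cv=3,
    random_state=42,
)
rand_search.fit(X_syn, y_syn)

print("Best parameters:", rand_search.best_params_)
print("Best cross-validation score:", rand_search.best_score_)
```

Unlike a grid search, the cost here is fixed by `n_iter` rather than by the size of the grid, which makes it easier to explore several parameters at once.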

In [None]:
# After scaling, X_train and X_test are NumPy arrays, so wrap them in
# DataFrames to be able to access the columns by name
X_train_df = pd.DataFrame(X_train, columns=df.columns[:-1])
X_test_df = pd.DataFrame(X_test, columns=df.columns[:-1])

# Create the interaction feature
X_train_df['Glucose_BMI'] = X_train_df['Glucose'] * X_train_df['BMI']
X_test_df['Glucose_BMI'] = X_test_df['Glucose'] * X_test_df['BMI']

# Convert X_train_df and X_test_df back to numpy arrays
X_train = X_train_df.values
X_test = X_test_df.values

# Initialize the Random Forest Classifier with the best parameters
rf_model = RandomForestClassifier(n_estimators=best_params['n_estimators'], max_depth=best_params['max_depth'], random_state=42)

# Train the model on the training data
rf_model.fit(X_train, y_train)

# Make predictions on the training and test data
rf_train_preds = rf_model.predict(X_train)
rf_test_preds = rf_model.predict(X_test)

# Calculate the accuracy of the model on the training and test data
rf_train_acc = accuracy_score(y_train, rf_train_preds)
rf_test_acc = accuracy_score(y_test, rf_test_preds)

print('Training accuracy:', rf_train_acc)
print('Test accuracy:', rf_test_acc)

## Feature Engineering
We created an interaction feature between 'Glucose' and 'BMI', which is the product of the two features. Because the product is computed after scaling, it is the product of the standardized values rather than of the raw measurements. The idea behind interaction features is that the effect of one feature on the target variable may depend on the value of another feature. In this case, the effect of 'Glucose' on 'Outcome' might depend on the value of 'BMI'.

We then trained a new Random Forest model with the best parameters found in the grid search on the data with the interaction feature. The model achieved an accuracy of 1.0 on the training data and approximately 0.740 on the test data. Compared with the previous model without the interaction feature (approximately 0.734), this corresponds to roughly one additional correctly classified instance among the 154 test instances, so the improvement is too small to attribute to the new feature with confidence. The model is still overfitting the training data.

To further improve the model's performance, we could try creating more interaction features or other types of new features. We could also try using a different model or a different method for parameter tuning.