# Diabetes Data Feature Engineering
## Notebook steps overview:  

1. Load the cleaned dataset.  
2. Create new features based on domain knowledge.  
3. Evaluate the correlation of features with the target variable.  
4. Apply feature selection methods (ANOVA, Mutual Information, Random Forest, RFE).  
5. Visualize feature selection results.  
6. Transform features using scaling techniques.  
7. Perform dimensionality reduction with PCA.  
8. Save the final processed dataset.  
9. Summarize the feature engineering process.  

In [2]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif, RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
import os

# Set styling for plots
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('viridis')
sns.set_context('notebook', font_scale=1.2)

In [4]:
# 1. Load the Cleaned Dataset
# -----------------------------------------------------

print("Loading cleaned diabetes dataset...")
df = pd.read_csv('C:/Users/hp/Desktop/diabetes-analysis-project/outputs/cleaned_data.csv')

# Make a copy to preserve the original data
df_original = df.copy()

Loading cleaned diabetes dataset...


In [5]:
# 2. Feature Creation

print("\n🧮 Creating New Features:")

# Feature 1: BMI * Glucose / 100 (Diabetes Risk Index)
df['Diabetes_Risk_Index'] = df['BMI'] * df['Glucose'] / 100
print("1. Created 'Diabetes_Risk_Index' = BMI * Glucose / 100")

# Feature 2: Insulin / Glucose ratio (Insulin Sensitivity)
df['Insulin_Sensitivity'] = df['Insulin'] / df['Glucose']
print("2. Created 'Insulin_Sensitivity' = Insulin / Glucose")

# Feature 3: Age * BMI / 100 (Age-adjusted BMI)
df['Age_BMI_Factor'] = df['Age'] * df['BMI'] / 100
print("3. Created 'Age_BMI_Factor' = Age * BMI / 100")

# Feature 4: Pregnancies to Age ratio
df['Pregnancies_Age_Ratio'] = df['Pregnancies'] / df['Age']
print("4. Created 'Pregnancies_Age_Ratio' = Pregnancies / Age")

# Feature 5: Diabetes Pedigree * BMI (Genetic-Physical Risk)
df['Genetic_Physical_Risk'] = df['DiabetesPedigreeFunction'] * df['BMI']
print("5. Created 'Genetic_Physical_Risk' = DiabetesPedigreeFunction * BMI")

# Feature 6: Glucose to BloodPressure ratio
df['Glucose_BP_Ratio'] = df['Glucose'] / df['BloodPressure']
print("6. Created 'Glucose_BP_Ratio' = Glucose / BloodPressure")

# Handle any infinities or NaNs from division operations
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df.fillna(df.median(), inplace=True)


🧮 Creating New Features:
1. Created 'Diabetes_Risk_Index' = BMI * Glucose / 100
2. Created 'Insulin_Sensitivity' = Insulin / Glucose
3. Created 'Age_BMI_Factor' = Age * BMI / 100
4. Created 'Pregnancies_Age_Ratio' = Pregnancies / Age
5. Created 'Genetic_Physical_Risk' = DiabetesPedigreeFunction * BMI
6. Created 'Glucose_BP_Ratio' = Glucose / BloodPressure
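
The ratio features divide by columns that could contain zeros. The cleanup step at the end of the cell handles this by replacing infinities with NaN and then filling with the column median. The sketch below uses a made-up four-row frame (`toy` is a hypothetical name, and the values are illustrative, not taken from the dataset) to show what that step does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows 2 and 4 have a zero denominator.
toy = pd.DataFrame({
    "Glucose": [120.0, 90.0, 150.0, 0.0],
    "BloodPressure": [80.0, 0.0, 75.0, 0.0],
})

# 120/80 = 1.5, 90/0 = inf, 150/75 = 2.0, 0/0 = NaN
toy["Glucose_BP_Ratio"] = toy["Glucose"] / toy["BloodPressure"]

# Same cleanup as in the notebook cell: inf -> NaN, then median imputation.
toy = toy.replace([np.inf, -np.inf], np.nan)
toy = toy.fillna(toy.median(numeric_only=True))

ratio = toy["Glucose_BP_Ratio"].tolist()
print(ratio)
```

The median is taken over the finite ratios only (1.5 and 2.0), so both problem rows receive 1.75.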


In [6]:
# 3. Feature Evaluation

print("\n Evaluating Features:")

# Prepare features and target
X = df.drop('Outcome', axis=1)
y = df['Outcome']

# Correlation of new features with outcome
new_features = ['Diabetes_Risk_Index', 'Insulin_Sensitivity', 'Age_BMI_Factor', 
                'Pregnancies_Age_Ratio', 'Genetic_Physical_Risk', 'Glucose_BP_Ratio']

correlation_with_outcome = pd.DataFrame({
    'Feature': X.columns,
    'Correlation': [df[col].corr(df['Outcome']) for col in X.columns]
}).sort_values('Correlation', ascending=False)

print("\nFeature Correlation with Outcome:")
print(correlation_with_outcome)

# Plot correlation of every feature (original and engineered) with the outcome
plt.figure(figsize=(12, 8))
sns.barplot(x='Correlation', y='Feature', data=correlation_with_outcome, palette='viridis')
plt.title('Feature Correlation with Diabetes Outcome', fontsize=16)
plt.xlabel('Correlation Coefficient', fontsize=14)
plt.ylabel('Feature', fontsize=14)
plt.axvline(x=0, color='red', linestyle='--')
plt.tight_layout()
plt.savefig(r'C:\Users\hp\Desktop\diabetes-analysis-project\visuals\static\feature_correlation.png')
plt.close()


 Evaluating Features:

Feature Correlation with Outcome:
                     Feature  Correlation
8        Diabetes_Risk_Index     0.522110
1                    Glucose     0.501914
4                    Insulin     0.420559
10            Age_BMI_Factor     0.372047
13          Glucose_BP_Ratio     0.355128
5                        BMI     0.324264
9        Insulin_Sensitivity     0.307041
3              SkinThickness     0.282478
12     Genetic_Physical_Risk     0.257118
7                        Age     0.243699
6   DiabetesPedigreeFunction     0.202637
0                Pregnancies     0.184710
2              BloodPressure     0.159291
11     Pregnancies_Age_Ratio     0.121302



Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x='Correlation', y='Feature', data=correlation_with_outcome, palette='viridis')


In [7]:
# 4. Feature Selection Methods
# -----------------------------------------------------

print("\nApplying Feature Selection Methods:")

# Method 1: Univariate Selection (ANOVA F-value)
selector_f = SelectKBest(score_func=f_classif, k=10)
selector_f.fit(X, y)
f_scores = pd.DataFrame({
    'Feature': X.columns,
    'F_Score': selector_f.scores_,
    'P_Value': selector_f.pvalues_
}).sort_values('F_Score', ascending=False)

print("\nANOVA F-value based selection:")
print(f_scores)

# Method 2: Mutual Information
selector_mi = SelectKBest(score_func=mutual_info_classif, k=10)
selector_mi.fit(X, y)
mi_scores = pd.DataFrame({
    'Feature': X.columns,
    'MI_Score': selector_mi.scores_
}).sort_values('MI_Score', ascending=False)

print("\nMutual Information based selection:")
print(mi_scores)

# Method 3: Random Forest Feature Importance
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)
rf_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf.feature_importances_
}).sort_values('Importance', ascending=False)

print("\nRandom Forest Feature Importance:")
print(rf_importance)

# Method 4: Recursive Feature Elimination (RFE)
rfe = RFE(estimator=RandomForestClassifier(n_estimators=100, random_state=42), n_features_to_select=10)
rfe.fit(X, y)
rfe_ranking = pd.DataFrame({
    'Feature': X.columns,
    'Ranking': rfe.ranking_
}).sort_values('Ranking')

print("\nRecursive Feature Elimination Ranking:")
print(rfe_ranking)


Applying Feature Selection Methods:

ANOVA F-value based selection:
                     Feature     F_Score       P_Value
8        Diabetes_Risk_Index  369.884925  2.959930e-70
1                    Glucose  332.374135  3.138132e-64
4                    Insulin  212.081652  1.154721e-43
10            Age_BMI_Factor  158.568023  7.944963e-34
13          Glucose_BP_Ratio  142.440741  9.092791e-31
5                        BMI  115.974816  1.200896e-25
9        Insulin_Sensitivity  102.734022  4.911300e-23
3              SkinThickness   85.585706  1.336850e-19
12     Genetic_Physical_Risk   69.869112  2.135960e-16
7                        Age   62.317959  7.744258e-15
6   DiabetesPedigreeFunction   42.263568  1.264380e-10
0                Pregnancies   34.863756  4.859636e-09
2              BloodPressure   25.695836  4.771064e-07
11     Pregnancies_Age_Ratio   14.739760  1.312756e-04

Mutual Information based selection:
                     Feature  MI_Score
4                    Insulin  
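
One caveat for Method 2: `mutual_info_classif` uses a randomized nearest-neighbour estimator, so its scores can shift slightly between runs unless `random_state` is fixed. A minimal sketch of pinning the seed through `functools.partial`, run on synthetic data from `make_classification` (the names `X_demo`, `y_demo`, and `mi_fixed` are illustrative and not part of the notebook):

```python
from functools import partial

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in for the diabetes features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)

# Bind random_state so SelectKBest calls the scorer with a fixed seed.
mi_fixed = partial(mutual_info_classif, random_state=42)

scores_a = SelectKBest(score_func=mi_fixed, k=3).fit(X_demo, y_demo).scores_
scores_b = SelectKBest(score_func=mi_fixed, k=3).fit(X_demo, y_demo).scores_

print(np.array_equal(scores_a, scores_b))
```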

In [10]:
# 5. Visualize Feature Selection Results

# Plot Random Forest Feature Importance
plt.figure(figsize=(12, 8))
sns.barplot(x='Importance', y='Feature', data=rf_importance, palette='viridis')
plt.title('Random Forest Feature Importance', fontsize=16)
plt.xlabel('Importance', fontsize=14)
plt.ylabel('Feature', fontsize=14)
plt.tight_layout()
plt.savefig('C:/Users/hp/Desktop/diabetes-analysis-project/visuals/static/rf_feature_importance.png')
plt.close()

# Plot combined feature selection ranks
def get_rank(df, value_col, feature_col='Feature'):
    df_ranked = df.copy()
    df_ranked['Rank'] = df_ranked[value_col].rank(ascending=False)
    return df_ranked.set_index(feature_col)['Rank']

# Get ranks from each method
corr_ranks = get_rank(correlation_with_outcome, 'Correlation')
f_ranks = get_rank(f_scores, 'F_Score')
mi_ranks = get_rank(mi_scores, 'MI_Score')
rf_ranks = get_rank(rf_importance, 'Importance')

# Combine ranks
combined_ranks = pd.DataFrame({
    'Correlation_Rank': corr_ranks,
    'F_Score_Rank': f_ranks,
    'MI_Score_Rank': mi_ranks,
    'RF_Importance_Rank': rf_ranks
})

combined_ranks['Average_Rank'] = combined_ranks.mean(axis=1)
combined_ranks = combined_ranks.sort_values('Average_Rank')

print("\nCombined Feature Rankings (lower is better):")
print(combined_ranks)

# Plot combined ranks
plt.figure(figsize=(14, 10))
sns.heatmap(combined_ranks, annot=True, cmap='viridis_r', fmt='.1f')
plt.title('Feature Selection Rankings Across Methods', fontsize=16)
plt.tight_layout()
plt.savefig('C:/Users/hp/Desktop/diabetes-analysis-project/visuals/static/feature_selection_ranks.png')
plt.close()


Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x='Importance', y='Feature', data=rf_importance, palette='viridis')



Combined Feature Rankings (lower is better):
                          Correlation_Rank  F_Score_Rank  MI_Score_Rank  \
Feature                                                                   
Insulin                                3.0           3.0            1.0   
Diabetes_Risk_Index                    1.0           1.0            4.0   
Glucose                                2.0           2.0            5.0   
Insulin_Sensitivity                    7.0           7.0            2.0   
Age_BMI_Factor                         4.0           4.0            7.0   
SkinThickness                          8.0           8.0            3.0   
Glucose_BP_Ratio                       5.0           5.0            8.0   
BMI                                    6.0           6.0            6.0   
Age                                   10.0          10.0            9.0   
Genetic_Physical_Risk                  9.0           9.0           11.0   
DiabetesPedigreeFunction              11.0          11
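
In the rows shown above, `Correlation_Rank` and `F_Score_Rank` are identical. For a binary target this is expected: the ANOVA F statistic is a monotone function of the squared Pearson correlation, F = r²(n − 2) / (1 − r²), so the two rankings coincide whenever all correlations share the same sign, as they do here. The sketch below checks the identity on synthetic data (`X_demo`, `y_demo`, and `f_from_r` are illustrative names, not from the notebook).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif

# Synthetic binary-classification data as a stand-in.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, random_state=1
)
n = len(y_demo)

# F statistic per feature, as computed by SelectKBest(f_classif).
f_scores_demo, _ = f_classif(X_demo, y_demo)

# Pearson correlation of each feature with the 0/1 target.
r = np.array([np.corrcoef(X_demo[:, j], y_demo)[0, 1]
              for j in range(X_demo.shape[1])])

# F reconstructed from r alone.
f_from_r = r**2 * (n - 2) / (1 - r**2)

print(np.allclose(f_scores_demo, f_from_r))
```

As a spot check against the notebook output, Glucose has r = 0.501914 and F = 332.37, which satisfies the formula for n = 989.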

In [11]:
# 6. Feature Transformation

print("\n Feature Transformation:")

# Select features based on combined ranking
top_features = combined_ranks.sort_values('Average_Rank').index[:10].tolist()
print(f"Selected top 10 features: {top_features}")

# Prepare dataset with selected features only
X_selected = df[top_features]

# Method 1: Standardization
scaler_standard = StandardScaler()
X_standard = scaler_standard.fit_transform(X_selected)
X_standard_df = pd.DataFrame(X_standard, columns=top_features)

# Method 2: Min-Max Scaling
scaler_minmax = MinMaxScaler()
X_minmax = scaler_minmax.fit_transform(X_selected)
X_minmax_df = pd.DataFrame(X_minmax, columns=top_features)

# Method 3: Robust Scaling (less sensitive to outliers)
scaler_robust = RobustScaler()
X_robust = scaler_robust.fit_transform(X_selected)
X_robust_df = pd.DataFrame(X_robust, columns=top_features)

# Compare distributions of original vs scaled features
feature_to_plot = top_features[0]  # Pick first top feature for visualization

plt.figure(figsize=(15, 10))

plt.subplot(2, 2, 1)
sns.histplot(X_selected[feature_to_plot], kde=True)
plt.title(f'Original {feature_to_plot}', fontsize=14)

plt.subplot(2, 2, 2)
sns.histplot(X_standard_df[feature_to_plot], kde=True)
plt.title(f'StandardScaler {feature_to_plot}', fontsize=14)

plt.subplot(2, 2, 3)
sns.histplot(X_minmax_df[feature_to_plot], kde=True)
plt.title(f'MinMaxScaler {feature_to_plot}', fontsize=14)

plt.subplot(2, 2, 4)
sns.histplot(X_robust_df[feature_to_plot], kde=True)
plt.title(f'RobustScaler {feature_to_plot}', fontsize=14)

plt.tight_layout()
plt.savefig('C:/Users/hp/Desktop/diabetes-analysis-project/visuals/static/feature_scaling_comparison.png')
plt.close()


 Feature Transformation:
Selected top 10 features: ['Insulin', 'Diabetes_Risk_Index', 'Glucose', 'Insulin_Sensitivity', 'Age_BMI_Factor', 'SkinThickness', 'Glucose_BP_Ratio', 'BMI', 'Age', 'Genetic_Physical_Risk']
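
To make the difference between the three scalers concrete, the sketch below applies each one to a single toy column containing an outlier and compares the result with the textbook formula. The column `x` is made up for illustration; the formulas reflect scikit-learn's default settings (population standard deviation for `StandardScaler`, median and 25th to 75th percentile range for `RobustScaler`).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One toy feature with an obvious outlier (100).
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(x)
minmax = MinMaxScaler().fit_transform(x)
robust = RobustScaler().fit_transform(x)

# Manual equivalents of the default behaviour.
q1, med, q3 = np.percentile(x, [25, 50, 75])
manual_standard = (x - x.mean()) / x.std()           # zero mean, unit variance
manual_minmax = (x - x.min()) / (x.max() - x.min())  # squeezed into [0, 1]
manual_robust = (x - med) / (q3 - q1)                # centred on median, scaled by IQR

print(robust.ravel())
```

With the outlier present, `MinMaxScaler` compresses the four ordinary values into a narrow band near 0, while `RobustScaler` keeps them spread between -1 and 0.5.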


In [12]:
# 7. Dimensionality Reduction with PCA

print("\nDimensionality Reduction with PCA:")

# Apply PCA
pca = PCA(n_components=3)  # Reduce to 3 components for visualization
X_pca = pca.fit_transform(X_standard)

# Create PCA dataframe for plotting
pca_df = pd.DataFrame(data=X_pca, columns=['PC1', 'PC2', 'PC3'])
pca_df['Outcome'] = y.values

# Get feature importance in PCA
pca_components = pd.DataFrame(
    pca.components_.T,
    columns=[f'PC{i+1}' for i in range(pca.n_components_)],
    index=top_features
)

print("\nPCA Components:")
print(pca_components)

print(f"\nExplained variance ratio: {pca.explained_variance_ratio_}")
print(f"Cumulative explained variance: {np.sum(pca.explained_variance_ratio_)}")

# Plot PCA results
plt.figure(figsize=(12, 10))

# 2D PCA plot
plt.subplot(2, 2, 1)
sns.scatterplot(x='PC1', y='PC2', hue='Outcome', data=pca_df, palette=['#3498db', '#e74c3c'])
plt.title('PCA: PC1 vs PC2', fontsize=14)

# PC1 vs PC3 (a second 2D view of the 3-component projection)
plt.subplot(2, 2, 2)
sns.scatterplot(x='PC1', y='PC3', hue='Outcome', data=pca_df, palette=['#3498db', '#e74c3c'])
plt.title('PCA: PC1 vs PC3', fontsize=14)

# Feature contributions to PC1
plt.subplot(2, 2, 3)
sns.barplot(x=pca_components.index, y=pca_components['PC1'], palette='viridis')
plt.title('Feature Contributions to PC1', fontsize=14)
plt.xticks(rotation=90)

# Feature contributions to PC2
plt.subplot(2, 2, 4)
sns.barplot(x=pca_components.index, y=pca_components['PC2'], palette='viridis')
plt.title('Feature Contributions to PC2', fontsize=14)
plt.xticks(rotation=90)

plt.tight_layout()
plt.savefig('C:/Users/hp/Desktop/diabetes-analysis-project/visuals/static/pca_analysis.png')
plt.close()


Dimensionality Reduction with PCA:

PCA Components:
                            PC1       PC2       PC3
Insulin                0.341829 -0.381743  0.377220
Diabetes_Risk_Index    0.458066 -0.022709 -0.239172
Glucose                0.389243 -0.217417 -0.103798
Insulin_Sensitivity    0.220136 -0.308586  0.494872
Age_BMI_Factor         0.329336  0.497137  0.274115
SkinThickness          0.274025  0.213960 -0.204168
Glucose_BP_Ratio       0.284537 -0.404663 -0.157024
BMI                    0.340501  0.242423 -0.274826
Age                    0.189706  0.433232  0.478934
Genetic_Physical_Risk  0.238569  0.090144 -0.313755

Explained variance ratio: [0.3969733  0.16677334 0.15060819]
Cumulative explained variance: 0.7143548354375522



Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x=pca_components.index, y=pca_components['PC1'], palette='viridis')

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x=pca_components.index, y=pca_components['PC2'], palette='viridis')
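
One way to read the PC1 loadings printed above: each principal axis is a unit vector, so the squared loadings sum to 1 and can be interpreted as each feature's share of that component. The sketch below copies the PC1 column and the explained variance ratios from the output above; `share` and `cumulative` are illustrative names.

```python
import numpy as np
import pandas as pd

# PC1 loadings copied from the "PCA Components" output.
pc1_loadings = pd.Series({
    "Insulin": 0.341829,
    "Diabetes_Risk_Index": 0.458066,
    "Glucose": 0.389243,
    "Insulin_Sensitivity": 0.220136,
    "Age_BMI_Factor": 0.329336,
    "SkinThickness": 0.274025,
    "Glucose_BP_Ratio": 0.284537,
    "BMI": 0.340501,
    "Age": 0.189706,
    "Genetic_Physical_Risk": 0.238569,
})

# Squared loadings: each feature's share of PC1.
share = pc1_loadings**2
total = share.sum()
top_feature = share.idxmax()

# Running total of the explained variance ratios reported above.
cumulative = np.cumsum([0.3969733, 0.16677334, 0.15060819])

print(share.sort_values(ascending=False).round(3))
print(cumulative.round(4))
```

On these numbers `Diabetes_Risk_Index` has the largest share (about 21%), but `Glucose`, `Insulin`, `BMI`, and `Age_BMI_Factor` each contribute more than 10%, so PC1 is a fairly broad mix rather than a single-feature axis.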


In [13]:
# 8. Save Final Processed Dataset

print("\n Saving Final Processed Dataset:")

# Create final dataset with original features + selected engineered features
# Sort by average rank and select all original features plus top 3 engineered features
original_features = df_original.columns.tolist()
engineered_features = [f for f in top_features if f not in original_features][:3]

final_features = original_features + engineered_features
final_df = df[final_features].copy()

print(f"Final dataset includes original features plus top engineered features: {engineered_features}")

# Save dataset with selected features (unscaled)
final_df.to_csv('C:/Users/hp/Desktop/diabetes-analysis-project/outputs/features_engineered.csv', index=False)

# Also save a standardized version
X_final = final_df.drop('Outcome', axis=1)
y_final = final_df['Outcome']

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_final)
X_scaled_df = pd.DataFrame(X_scaled, columns=X_final.columns)
X_scaled_df['Outcome'] = y_final.values

X_scaled_df.to_csv('C:/Users/hp/Desktop/diabetes-analysis-project/outputs/features_engineered_scaled.csv', index=False)



 Saving Final Processed Dataset:
Final dataset includes original features plus top engineered features: ['Diabetes_Risk_Index', 'Insulin_Sensitivity', 'Age_BMI_Factor']
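
The save calls above use absolute paths tied to one machine. A possible alternative, sketched under the assumption that the project keeps an `outputs` folder under a single root, is to build paths with `pathlib`. Here `project_root` is a temporary directory standing in for the real project folder, and `demo_df` is a made-up two-row frame.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the real project folder (hypothetical; replace with your own root).
project_root = Path(tempfile.mkdtemp())
outputs_dir = project_root / "outputs"
outputs_dir.mkdir(parents=True, exist_ok=True)

# Made-up data, just to exercise the round trip.
demo_df = pd.DataFrame({"Glucose": [120, 90], "Outcome": [1, 0]})

out_path = outputs_dir / "features_engineered.csv"
demo_df.to_csv(out_path, index=False)

# Read the file back to confirm it was written as expected.
reloaded = pd.read_csv(out_path)
print(reloaded.equals(demo_df))
```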



# **Feature Engineering Summary:**

## 1. Created 6 new features:
   - Diabetes_Risk_Index: BMI * Glucose / 100
   - Insulin_Sensitivity: Insulin / Glucose
   - Age_BMI_Factor: Age * BMI / 100
   - Pregnancies_Age_Ratio: Pregnancies / Age
   - Genetic_Physical_Risk: DiabetesPedigreeFunction * BMI
   - Glucose_BP_Ratio: Glucose / BloodPressure

## 2. Top Features According to Average Ranking:
   ['Insulin', 'Diabetes_Risk_Index', 'Glucose', 'Insulin_Sensitivity', 'Age_BMI_Factor', 'SkinThickness', 'Glucose_BP_Ratio', 'BMI', 'Age', 'Genetic_Physical_Risk']

## 3. Best Engineered Features:
   ['Diabetes_Risk_Index', 'Insulin_Sensitivity', 'Age_BMI_Factor']

## 4. Feature Transformations Applied:
   - StandardScaler
   - MinMaxScaler
   - RobustScaler

## 5. PCA Analysis:
   - Top 3 PCs explain 71.44% of variance
   - PC1 is mainly influenced by Diabetes_Risk_Index
   - PCA was fitted on all 989 instances in the dataset