***Data Preprocessing***

**Load the data**


In [26]:
import pandas as pd
import numpy as np

# Load the data
data = pd.read_csv('/content/hospital_data.csv')

# Print the first five rows
print("\nHead of data set : ", data.head())


Head of data set :       Patient_ID   Age  Gender  HbA1c_Level Readmitted
0  DMC 992/2019  58.0  Female          7.7         No
1  DMC 392/2016  71.0  Female         11.1         No
2  DMC 905/2019  48.0    Male          7.6         No
3  DMC 587/2019  34.0  Female          8.0         No
4  DMC 611/2011  62.0    Male         11.9         No


**Cleaning:** *Remove any inconsistencies or errors in the data.*

Remove incorrect/incomplete data

In [27]:
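# Drop any rows with missing data. The .copy() makes data_cleaned an independent
# DataFrame, so columns added to it later do not trigger SettingWithCopyWarning.
data_cleaned = data.dropna().copy()
print(data_cleaned)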


       Patient_ID   Age  Gender  HbA1c_Level Readmitted
0    DMC 992/2019  58.0  Female          7.7         No
1    DMC 392/2016  71.0  Female         11.1         No
2    DMC 905/2019  48.0    Male          7.6         No
3    DMC 587/2019  34.0  Female          8.0         No
4    DMC 611/2011  62.0    Male         11.9         No
..            ...   ...     ...          ...        ...
995       DMC 822  23.0    Male          5.2         No
996       DMC 823  20.0    Male          9.6        Yes
997       DMC 824  68.0  Female          8.6         No
998       DMC 825  59.0    Male         11.9         No
999       DMC 826  51.0  Female          6.2         No

[997 rows x 5 columns]


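Binning (the grouping step that smoothing by mean/median/boundary builds on)

In [29]:
# Bin HbA1c_Level into labeled intervals. No smoothing is applied in this cell.
# pd.cut uses right-closed intervals by default: (0, 5.9], (5.9, 7.0], (7.0, 9.0], (9.0, 12.0].
# Values outside (0, 12.0] become NaN.
# These cut points are this notebook's own convention, not the standard clinical HbA1c thresholds.

bins = [0, 5.9, 7.0, 9.0, 12.0]
labels = ['Low', 'Normal', 'Pre-diabetes', 'Diabetes']
data_cleaned.loc[:, 'HbA1c_Level_Binned'] = pd.cut(data_cleaned['HbA1c_Level'], bins=bins, labels=labels)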

print("\n Binned HbA1c_Level")
print(data_cleaned[['Patient_ID', 'HbA1c_Level', 'HbA1c_Level_Binned']])


 Binned HbA1c_Level
       Patient_ID  HbA1c_Level HbA1c_Level_Binned
0    DMC 992/2019          7.7       Pre-diabetes
1    DMC 392/2016         11.1           Diabetes
2    DMC 905/2019          7.6       Pre-diabetes
3    DMC 587/2019          8.0       Pre-diabetes
4    DMC 611/2011         11.9           Diabetes
..            ...          ...                ...
995       DMC 822          5.2                Low
996       DMC 823          9.6           Diabetes
997       DMC 824          8.6       Pre-diabetes
998       DMC 825         11.9           Diabetes
999       DMC 826          6.2             Normal

[997 rows x 3 columns]
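
The cell above only assigns each reading to a labeled interval. As a minimal sketch of the smoothing step itself, the following uses a small made-up series (the values and the choice of three equal-frequency bins are assumptions for illustration, not taken from the hospital data) and replaces each value with its bin mean, bin median, or nearest bin boundary.

```python
import numpy as np
import pandas as pd

# Hypothetical HbA1c readings, used only for illustration
values = pd.Series([5.2, 6.2, 7.6, 7.7, 8.0, 8.6, 9.6, 11.1, 11.9], name="HbA1c_Level")

# Equal-frequency (quantile) bins: three bins of three values each
bin_ids = pd.qcut(values, q=3, labels=False)

# Smoothing by bin mean / bin median: replace each value with its bin's statistic
smoothed_mean = values.groupby(bin_ids).transform("mean")
smoothed_median = values.groupby(bin_ids).transform("median")

# Smoothing by bin boundaries: snap each value to the nearer of its bin's min or max
bin_min = values.groupby(bin_ids).transform("min")
bin_max = values.groupby(bin_ids).transform("max")
smoothed_boundary = pd.Series(
    np.where(values - bin_min <= bin_max - values, bin_min, bin_max),
    index=values.index,
)

print(pd.DataFrame({
    "value": values,
    "bin": bin_ids,
    "mean": smoothed_mean,
    "median": smoothed_median,
    "boundary": smoothed_boundary,
}))
```

The same pattern should carry over to `data_cleaned['HbA1c_Level']`, with `pd.cut` and the notebook's own bin edges in place of `pd.qcut` if fixed intervals are preferred.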


Regression

In [15]:
from sklearn.linear_model import LinearRegression

# Predict HbA1c_Level from Age and Gender
data_reg = data_cleaned[['Age', 'Gender', 'HbA1c_Level']].copy()
# Encode Gender as 1 for 'Female' and 0 for any other value
data_reg['Gender_numeric'] = (data_reg['Gender'] == 'Female').astype(int)

X = data_reg[['Age', 'Gender_numeric']]
y = data_reg['HbA1c_Level']

# Fit the regression model
model = LinearRegression().fit(X, y)
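
The cell above fits the model but does not inspect or use it. The sketch below shows one way a fitted model might be examined and used for smoothing by regression, where each observed value is replaced with its fitted value. The data frame `demo` and its exact linear relationship are invented for illustration, as are the names `demo_model` and `HbA1c_Smoothed`.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical records constructed so that
# HbA1c_Level = 5 + 0.05 * Age + 0.5 * Gender_numeric exactly
demo = pd.DataFrame({
    "Age": [30.0, 40.0, 50.0, 60.0, 70.0, 80.0],
    "Gender_numeric": [0, 1, 0, 1, 0, 1],
})
demo["HbA1c_Level"] = 5 + 0.05 * demo["Age"] + 0.5 * demo["Gender_numeric"]

X_demo = demo[["Age", "Gender_numeric"]]
y_demo = demo["HbA1c_Level"]
demo_model = LinearRegression().fit(X_demo, y_demo)

# Learned parameters: intercept and one coefficient per feature
print("intercept:", demo_model.intercept_)
print("coefficients:", dict(zip(X_demo.columns, demo_model.coef_)))

# R^2 on the training data (1.0 here only because the demo data is exactly linear)
r2 = demo_model.score(X_demo, y_demo)
print("R^2:", r2)

# Smoothing by regression: replace each observed value with the model's fitted value
demo["HbA1c_Smoothed"] = demo_model.predict(X_demo)
```

On real data the R^2 value indicates how much of the variation in `HbA1c_Level` the two predictors explain. A low value would suggest that regression-based smoothing discards most of the signal.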

Clustering

In [17]:
from sklearn.cluster import KMeans

# Cluster on Age and HbA1c_Level; a fixed random_state makes the assignment reproducible
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
data_cleaned.loc[:, 'Cluster'] = kmeans.fit_predict(data_cleaned[['Age', 'HbA1c_Level']])
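
In a cleaning context, clustering is typically used to flag records that sit far from every cluster centre. The sketch below illustrates that idea on invented data: it standardizes both features first (otherwise `Age`, with its larger numeric range, dominates the distances) and then ranks rows by distance to their assigned centroid. The data, the planted outlier in the last row, and the name `Distance_To_Centroid` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical patients: a younger low-HbA1c group, an older high-HbA1c group,
# and one unusual record in the last row
demo = pd.DataFrame({
    "Age": [22.0, 25.0, 27.0, 30.0, 65.0, 68.0, 70.0, 72.0, 45.0],
    "HbA1c_Level": [5.4, 5.6, 5.5, 5.8, 9.8, 10.1, 10.4, 10.0, 11.9],
})

# Put both features on a comparable scale before computing distances
scaled = StandardScaler().fit_transform(demo[["Age", "HbA1c_Level"]])

demo_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
demo["Cluster"] = demo_kmeans.fit_predict(scaled)

# Euclidean distance from each row to the centre of its own cluster
centres = demo_kmeans.cluster_centers_[demo["Cluster"].to_numpy()]
demo["Distance_To_Centroid"] = np.linalg.norm(scaled - centres, axis=1)

# Rows with the largest distances are candidates for manual review
print(demo.sort_values("Distance_To_Centroid", ascending=False).head(3))
```

Whether a distant row is an error or a genuine but rare patient is a judgment the distance alone cannot make, so the ranking is a starting point for review rather than a removal rule.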

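**Handle missing values** (median, mean, or mode).

In [18]:
# Alternative to dropping rows: fill missing numeric values with the column median.
# This modifies the original `data`, not `data_cleaned`, so the later cells
# still work on the row-dropped data.
data['Age'] = data['Age'].fillna(data['Age'].median())
data['HbA1c_Level'] = data['HbA1c_Level'].fillna(data['HbA1c_Level'].median())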

print(data)

       Patient_ID   Age  Gender  HbA1c_Level Readmitted
0    DMC 992/2019  58.0  Female          7.7         No
1    DMC 392/2016  71.0  Female         11.1         No
2    DMC 905/2019  48.0    Male          7.6         No
3    DMC 587/2019  34.0  Female          8.0         No
4    DMC 611/2011  62.0    Male         11.9         No
..            ...   ...     ...          ...        ...
995       DMC 822  23.0    Male          5.2         No
996       DMC 823  20.0    Male          9.6        Yes
997       DMC 824  68.0  Female          8.6         No
998       DMC 825  59.0    Male         11.9         No
999       DMC 826  51.0  Female          6.2         No

[1000 rows x 5 columns]
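
The cell above uses the median only. As a small sketch of all three strategies named in the heading, the following fills one column with its median, one with its mean, and a categorical column with its mode. The four-row frame is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical records with one missing value per column
demo = pd.DataFrame({
    "Age": [58.0, np.nan, 48.0, 34.0],
    "Gender": ["Female", "Female", None, "Male"],
    "HbA1c_Level": [7.7, 11.1, np.nan, 8.0],
})

filled = demo.copy()

# Median: robust to extreme values, a common default for skewed numeric columns
filled["Age"] = filled["Age"].fillna(filled["Age"].median())

# Mean: reasonable when the column is roughly symmetric
filled["HbA1c_Level"] = filled["HbA1c_Level"].fillna(filled["HbA1c_Level"].mean())

# Mode: the most frequent category, for non-numeric columns;
# mode() returns a Series because ties are possible, so take the first entry
filled["Gender"] = filled["Gender"].fillna(filled["Gender"].mode().iloc[0])

print(filled)
print("Remaining missing values:", int(filled.isna().sum().sum()))
```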


**Reduction**

Dimensionality reduction

In [19]:
from sklearn.decomposition import PCA

# Reduce 'Age' and 'HbA1c_Level' to a single principal component
pca = PCA(n_components=1)
# fit_transform returns an array of shape (n_rows, 1); take its only column
data_cleaned.loc[:, 'PCA_Age_HbA1c'] = pca.fit_transform(data_cleaned[['Age', 'HbA1c_Level']])[:, 0]
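
PCA is driven by variance, so on unscaled data the component mostly tracks `Age`, whose numeric range is far larger than that of `HbA1c_Level`. The sketch below, on invented data, standardizes the features first and reports how much variance the single component retains. Adding `StandardScaler` is a suggestion here, not something the notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical patients, used only for illustration
demo = pd.DataFrame({
    "Age": [23.0, 34.0, 48.0, 58.0, 62.0, 71.0],
    "HbA1c_Level": [5.2, 8.0, 7.6, 7.7, 11.9, 11.1],
})

# Standardize so that both features contribute on the same scale
scaled = StandardScaler().fit_transform(demo[["Age", "HbA1c_Level"]])

demo_pca = PCA(n_components=1)
component = demo_pca.fit_transform(scaled)  # array of shape (6, 1)
demo["PCA_Age_HbA1c"] = component[:, 0]

# Share of the total variance kept by the single component
explained = demo_pca.explained_variance_ratio_[0]
print(f"Variance retained: {explained:.1%}")
```

With two standardized features, the first component retains a share of (1 + |r|) / 2 of the variance, where r is the correlation between them. The reduction therefore loses little only when the two columns are strongly correlated.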

Attribute subset selection

In [20]:
# Attribute subset selection: keep only the columns used in the later steps

data_reduced = data_cleaned[['Age', 'Gender', 'HbA1c_Level_Binned', 'Readmitted', 'Cluster']]

Numerosity Reduction - Sampling or Modeling

In [21]:
# Numerosity reduction: keep a random 50% sample of the rows (seeded for reproducibility)
data_Redu = data_reduced.sample(frac=0.5, random_state=42)

print("Reduced Data ")
print(data_Redu)

Reduced Data 
      Age  Gender HbA1c_Level_Binned Readmitted  Cluster
456  71.0  Female           Diabetes         No        1
795  26.0    Male                Low         No        0
211  47.0    Male       Pre-diabetes        Yes        0
312  38.0    Male           Diabetes         No        0
742  68.0    Male             Normal        Yes        1
..    ...     ...                ...        ...      ...
180  79.0  Female           Diabetes         No        1
447  43.0    Male       Pre-diabetes         No        0
419  46.0  Female                Low         No        0
942  64.0    Male           Diabetes         No        1
882  50.0    Male           Diabetes         No        1

[498 rows x 5 columns]
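
A plain random sample draws rows uniformly, so the share of readmitted patients in the sample can drift from the share in the full data. If that proportion matters for later analysis, stratified sampling is one option. The sketch below samples within each `Readmitted` group on invented data. It relies on `DataFrameGroupBy.sample`, which is available in pandas 1.1 and later.

```python
import pandas as pd

# Hypothetical data: 8 patients not readmitted, 4 readmitted
demo = pd.DataFrame({
    "Age": [58.0, 71.0, 48.0, 34.0, 62.0, 23.0, 68.0, 59.0, 20.0, 47.0, 68.0, 51.0],
    "Readmitted": ["No"] * 8 + ["Yes"] * 4,
})

# Sample 50% of the rows within each Readmitted group,
# which preserves the original 2:1 class ratio
stratified = demo.groupby("Readmitted").sample(frac=0.5, random_state=42)

counts = stratified["Readmitted"].value_counts()
print(counts)
```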


**Transformation:** Convert data into a suitable format for analysis.

Normalization

In [22]:
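# Min-max normalize 'Age' to the range [0, 1]. 'HbA1c_Level' is not normalized here
# because only its binned form was kept during attribute subset selection.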
data_Redu['Age_Normalized'] = (data_Redu['Age'] - data_Redu['Age'].min()) / (data_Redu['Age'].max() - data_Redu['Age'].min())



Feature selection and Feature Engineering

In [23]:
# Convert Gender to numeric (Female: 1, Male: 0)
data_Redu['Gender_numeric'] = data_Redu['Gender'].apply(lambda x: 1 if x == 'Female' else 0)



Discretization

In [24]:
age_bins = [0, 30, 50, 100]
age_labels = ['Young', 'Middle-aged', 'Senior']
data_Redu['Age_Group'] = pd.cut(data_Redu['Age'], bins=age_bins, labels=age_labels)
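
It helps to know how `pd.cut` treats values that fall exactly on a bin edge. By default the intervals are closed on the right, so with the edges used above an age of 30 is 'Young' and an age of 50 is 'Middle-aged', while an age of exactly 0 falls outside the first interval. The sketch below checks this on a few invented edge-case ages.

```python
import pandas as pd

age_bins = [0, 30, 50, 100]
age_labels = ["Young", "Middle-aged", "Senior"]

# Hypothetical ages chosen to sit on or next to the bin edges
ages = pd.Series([0, 30, 31, 50, 51, 100])

# Default (right=True): intervals are (0, 30], (30, 50], (50, 100]
default_groups = pd.cut(ages, bins=age_bins, labels=age_labels)

# include_lowest=True also admits the lowest edge, so age 0 becomes 'Young'
with_lowest = pd.cut(ages, bins=age_bins, labels=age_labels, include_lowest=True)

# right=False: intervals are [0, 30), [30, 50), [50, 100), so age 100 falls outside
left_closed = pd.cut(ages, bins=age_bins, labels=age_labels, right=False)

print(pd.DataFrame({
    "age": ages,
    "default": default_groups,
    "include_lowest": with_lowest,
    "right=False": left_closed,
}))
```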

Create a new CSV file containing the preprocessed data

In [25]:
# Save the preprocessed data to a new CSV file
data_Redu.to_csv('preprocessed_hospital_data.csv', index=False)

print("Final Transformed Data saved to 'preprocessed_hospital_data.csv'.")
print(data_Redu)

Final Transformed Data saved to 'preprocessed_hospital_data.csv'.
      Age  Gender HbA1c_Level_Binned Readmitted  Cluster  Age_Normalized  \
456  71.0  Female           Diabetes         No        1        0.864407   
795  26.0    Male                Low         No        0        0.101695   
211  47.0    Male       Pre-diabetes        Yes        0        0.457627   
312  38.0    Male           Diabetes         No        0        0.305085   
742  68.0    Male             Normal        Yes        1        0.813559   
..    ...     ...                ...        ...      ...             ...   
180  79.0  Female           Diabetes         No        1        1.000000   
447  43.0    Male       Pre-diabetes         No        0        0.389831   
419  46.0  Female                Low         No        0        0.440678   
942  64.0    Male           Diabetes         No        1        0.745763   
882  50.0    Male           Diabetes         No        1        0.508475   

     Gender_numeric  