
Name : **Yash Santosh Rahate**

Roll No : **44**

Div : **D15B**

**DMBI Practical 02**



---



**Aim:** To perform data preprocessing on the dataset using Python.

Dataset Link : https://www.kaggle.com/datasets/marius2303/medical-condition-prediction-dataset

# Loading Dataset and Introduction:

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('/content/medical_conditions_dataset.csv')
df.head()

Unnamed: 0,id,full_name,age,gender,smoking_status,bmi,blood_pressure,glucose_levels,condition
0,1,User0001,,male,Non-Smoker,,,,Pneumonia
1,2,User0002,30.0,male,Non-Smoker,,105.315064,,Diabetic
2,3,User0003,18.0,male,Non-Smoker,35.612486,,,Pneumonia
3,4,User0004,,male,Non-Smoker,,99.119829,,Pneumonia
4,5,User0005,76.0,male,Non-Smoker,,,,Diabetic


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              10000 non-null  int64  
 1   full_name       10000 non-null  object 
 2   age             5445 non-null   float64
 3   gender          10000 non-null  object 
 4   smoking_status  10000 non-null  object 
 5   bmi             4652 non-null   float64
 6   blood_pressure  3766 non-null   float64
 7   glucose_levels  4756 non-null   float64
 8   condition       10000 non-null  object 
dtypes: float64(4), int64(1), object(4)
memory usage: 703.3+ KB


In [4]:
df.describe()

Unnamed: 0,id,age,bmi,blood_pressure,glucose_levels
count,10000.0,5445.0,4652.0,3766.0,4756.0
mean,5000.5,53.541598,27.42342,135.209429,135.219608
std,2886.89568,20.925113,7.231257,26.041531,37.607638
min,1.0,18.0,15.012119,90.00962,70.015961
25%,2500.75,35.0,21.077894,113.107754,102.273703
50%,5000.5,54.0,27.326204,134.82104,135.436764
75%,7500.25,72.0,33.68933,157.949509,168.349011
max,10000.0,89.0,39.998687,179.999359,199.890429


In [5]:
df.shape

(10000, 9)

In [6]:
df['condition'].value_counts()

condition,count
Diabetic,6013
Pneumonia,2527
Cancer,1460


# Data Cleaning - handling missing values

In [7]:
df.isnull().sum()

column,missing_count
id,0
full_name,0
age,4555
gender,0
smoking_status,0
bmi,5348
blood_pressure,6234
glucose_levels,5244
condition,0
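
Raw counts are easier to compare as percentages: in the output above, 4555 of the 10000 age values are missing, which is 45.55%. The sketch below shows one way to get that share for every column. The `toy` frame is hypothetical and stands in for `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, used only to show the calculation
toy = pd.DataFrame({
    "age": [30.0, np.nan, 18.0, np.nan],
    "condition": ["Diabetic", "Pneumonia", "Pneumonia", "Cancer"],
})

# The mean of a Boolean mask is the fraction of missing entries per column
missing_pct = toy.isnull().mean() * 100
print(missing_pct)
```

On the real data, the same expression applied to `df` should report the share of missing values per column.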


In [8]:
correlation_matrix = df[['bmi', 'blood_pressure', 'glucose_levels']].corr()
print(correlation_matrix)

                     bmi  blood_pressure  glucose_levels
bmi             1.000000        0.031372        0.011236
blood_pressure  0.031372        1.000000        0.039158
glucose_levels  0.011236        0.039158        1.000000


In [9]:
df.shape

(10000, 9)

In [10]:
df_cleaned = df.dropna(subset=['age', 'bmi', 'blood_pressure', 'glucose_levels'], how='all')
print(f"Shape after dropping completely missing rows: {df_cleaned.shape}")

Shape after dropping completely missing rows: (9209, 9)
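
To make the effect of `how='all'` concrete, here is a small sketch on a made-up frame (the `toy` data and column subset are hypothetical). A row is dropped only when every listed column is missing, whereas `how='any'` would drop a row with even a single gap.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not taken from the medical dataset
toy = pd.DataFrame({
    "age": [30.0, np.nan, np.nan],
    "bmi": [np.nan, np.nan, 22.5],
    "condition": ["Diabetic", "Pneumonia", "Cancer"],
})

cols = ["age", "bmi"]
kept_all = toy.dropna(subset=cols, how="all")  # drops only row 1 (both values missing)
kept_any = toy.dropna(subset=cols, how="any")  # drops every row that has any gap

print(len(kept_all), len(kept_any))
```

With so many gaps in this dataset, `how='any'` would likely discard most rows, which is presumably why `how='all'` was chosen here.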


In [11]:
# Drop rows where all four key attributes are missing
df_cleaned = df.dropna(subset=['age', 'bmi', 'blood_pressure', 'glucose_levels'], how='all').copy()

# Fill the remaining gaps with the median of each column within the same condition group
for col in ['age', 'bmi', 'blood_pressure', 'glucose_levels']:
    df_cleaned[col] = df_cleaned.groupby('condition')[col].transform(lambda x: x.fillna(x.median()))

# Verify missing values are handled
df_cleaned.isnull().sum()


column,missing_count
id,0
full_name,0
age,0
gender,0
smoking_status,0
bmi,0
blood_pressure,0
glucose_levels,0
condition,0
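
The grouped median idea can be checked by hand on a tiny example. This is only a sketch on hypothetical data: the `toy` frame and the group labels `A` and `B` are made up, and the per-group medians are computed once for all numeric columns instead of column by column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the technique
toy = pd.DataFrame({
    "condition": ["A", "A", "A", "B", "B"],
    "age": [20.0, np.nan, 40.0, 60.0, np.nan],
    "bmi": [np.nan, 22.0, 26.0, np.nan, 30.0],
})

num_cols = ["age", "bmi"]
# Per-group medians, broadcast back to the original row order
group_medians = toy.groupby("condition")[num_cols].transform("median")

filled = toy.copy()
filled[num_cols] = toy[num_cols].fillna(group_medians)
print(filled)
```

Each gap receives the median of its own group, so a missing age in group `A` becomes 30.0 while a missing age in group `B` becomes 60.0.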


In [12]:
df_cleaned.shape

(9209, 9)

In [13]:
df_cleaned.describe()

Unnamed: 0,id,age,bmi,blood_pressure,glucose_levels
count,9209.0,9209.0,9209.0,9209.0,9209.0
mean,4987.308394,53.489195,27.39898,135.070193,135.320178
std,2889.255671,16.115101,5.142369,16.653643,27.047757
min,2.0,18.0,15.012119,90.00962,70.015961
25%,2481.0,48.0,27.19469,134.536675,131.985031
50%,4981.0,54.0,27.23914,135.164088,136.505756
75%,7500.0,59.0,27.959845,135.164088,137.594369
max,10000.0,89.0,39.998687,179.999359,199.890429


# Data Cleaning - handling noisy values


In [15]:
import numpy as np

# Function to detect outliers with the IQR rule and replace them with the column median.
# Note: it modifies the DataFrame that is passed in.
def handle_outliers(df, column):
    Q1 = df[column].quantile(0.25)  # First quartile (25%)
    Q3 = df[column].quantile(0.75)  # Third quartile (75%)
    IQR = Q3 - Q1  # Interquartile range

    lower_bound = Q1 - 1.5 * IQR  # Lower bound
    upper_bound = Q3 + 1.5 * IQR  # Upper bound

    # Replace outliers with median value
    median_value = df[column].median()
    df[column] = np.where((df[column] < lower_bound) | (df[column] > upper_bound), median_value, df[column])

# Apply outlier handling for the three numerical features
for col in ['bmi', 'blood_pressure', 'glucose_levels']:
    handle_outliers(df_cleaned, col)

# Verify changes
df_cleaned.describe()




Unnamed: 0,id,age,bmi,blood_pressure,glucose_levels
count,9209.0,9209.0,9209.0,9209.0,9209.0
mean,4987.308394,53.489195,27.32516,135.047747,135.853244
std,2889.255671,16.115101,0.294468,0.237141,2.316542
min,2.0,18.0,26.051998,133.607466,123.579231
25%,2481.0,48.0,27.23914,135.164088,136.505756
50%,4981.0,54.0,27.23914,135.164088,136.505756
75%,7500.0,59.0,27.23914,135.164088,136.505756
max,10000.0,89.0,29.106727,136.096022,145.98932
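
Before replacing values, it can help to look at the fences and at how many points they flag. In this notebook the quartiles are taken after median imputation, so the bmi fences work out to roughly 26.05 and 29.11, and many originally observed values fall outside them. The sketch below is a non-mutating variant on hypothetical values; `iqr_bounds` is an illustrative helper, not part of the notebook.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) IQR fences for a numeric series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical values with one obvious outlier at 95.0
values = pd.Series([21.0, 23.0, 24.0, 25.0, 26.0, 27.0, 95.0])
lower, upper = iqr_bounds(values)
is_outlier = (values < lower) | (values > upper)

# Replace flagged values with the median, leaving the input untouched
cleaned = values.where(~is_outlier, values.median())
print(lower, upper, int(is_outlier.sum()))
```

Checking `is_outlier.sum()` on each real column first would show how much of the data the replacement step is about to overwrite.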


# Data Transformation

In [16]:
df_cleaned.head(1)

Unnamed: 0,id,full_name,age,gender,smoking_status,bmi,blood_pressure,glucose_levels,condition
1,2,User0002,30.0,male,Non-Smoker,27.23914,135.164088,136.505756,Diabetic


In [17]:
df_cleaned['gender'].unique()

array(['male', 'female'], dtype=object)

In [18]:
df_cleaned['smoking_status'].unique()

array(['Non-Smoker', 'Smoker'], dtype=object)

In [19]:
df_cleaned['condition'].unique()

array(['Diabetic', 'Pneumonia', 'Cancer'], dtype=object)

In [20]:
# Applying One-Hot Encoding on df_cleaned
df_cleaned = pd.get_dummies(df_cleaned, columns=['gender', 'smoking_status', 'condition'], drop_first=True)

# Display transformed dataset
df_cleaned.head()


Unnamed: 0,id,full_name,age,bmi,blood_pressure,glucose_levels,gender_male,smoking_status_Smoker,condition_Diabetic,condition_Pneumonia
1,2,User0002,30.0,27.23914,135.164088,136.505756,True,False,True,False
2,3,User0003,18.0,27.23914,134.536675,134.899292,True,False,False,True
3,4,User0004,54.0,27.346625,135.164088,134.899292,True,False,False,True
4,5,User0005,76.0,27.23914,135.164088,136.505756,True,False,True,False
5,6,User0006,40.0,27.23914,135.164088,136.505756,True,False,True,False



In [22]:
df_cleaned.shape

(9209, 10)
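
The role of `drop_first=True` is easier to see on a few rows. In this sketch the `toy` frame is hypothetical, but it uses the same category labels as the dataset. The first category in sorted order (`female`, `Cancer`) is dropped and becomes the implicit baseline.

```python
import pandas as pd

# Hypothetical toy frame with the same category labels as the dataset
toy = pd.DataFrame({
    "gender": ["male", "female", "male"],
    "condition": ["Diabetic", "Pneumonia", "Cancer"],
})

encoded = pd.get_dummies(toy, columns=["gender", "condition"], drop_first=True)
print(list(encoded.columns))
print(encoded)
```

A row with condition `Cancer` has both `condition_Diabetic` and `condition_Pneumonia` set to false, which is how the dropped category is still recoverable.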

# Data Normalization

In [23]:
df_cleaned.describe()

Unnamed: 0,id,age,bmi,blood_pressure,glucose_levels
count,9209.0,9209.0,9209.0,9209.0,9209.0
mean,4987.308394,53.489195,27.32516,135.047747,135.853244
std,2889.255671,16.115101,0.294468,0.237141,2.316542
min,2.0,18.0,26.051998,133.607466,123.579231
25%,2481.0,48.0,27.23914,135.164088,136.505756
50%,4981.0,54.0,27.23914,135.164088,136.505756
75%,7500.0,59.0,27.23914,135.164088,136.505756
max,10000.0,89.0,29.106727,136.096022,145.98932


Now, let's normalize the numerical features (age, bmi, blood_pressure, and glucose_levels) to bring them to a common scale.

In [24]:
from sklearn.preprocessing import MinMaxScaler

# Selecting numerical columns for normalization
num_features = ['age', 'bmi', 'blood_pressure', 'glucose_levels']

scaler = MinMaxScaler()

# Applying Min-Max Scaling
df_cleaned[num_features] = scaler.fit_transform(df_cleaned[num_features])

df_cleaned.describe()


Unnamed: 0,id,age,bmi,blood_pressure,glucose_levels
count,9209.0,9209.0,9209.0,9209.0,9209.0
mean,4987.308394,0.499848,0.416784,0.578762,0.5477
std,2889.255671,0.226973,0.096397,0.095293,0.10337
min,2.0,0.0,0.0,0.0,0.0
25%,2481.0,0.422535,0.388625,0.625512,0.576817
50%,4981.0,0.507042,0.388625,0.625512,0.576817
75%,7500.0,0.577465,0.388625,0.625512,0.576817
max,10000.0,1.0,1.0,1.0,1.0
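
With its default range of 0 to 1, Min-Max scaling maps each value x to (x - min) / (max - min). As a quick hand check, the sketch below applies that formula in plain pandas to the first five age values from the cleaned data, plus the column maximum of 89 so that the toy range (18 to 89) matches the real column. The `age` series is assembled here only for illustration.

```python
import pandas as pd

# First five ages from the cleaned data, plus the observed maximum (89);
# the minimum (18) is already among them
age = pd.Series([30.0, 18.0, 54.0, 76.0, 40.0, 89.0])

# Min-Max scaling by hand: (x - min) / (max - min)
scaled = (age - age.min()) / (age.max() - age.min())
print(scaled.round(6).tolist())
```

For example, age 30 becomes (30 - 18) / (89 - 18) = 12 / 71, about 0.169014, which agrees with the scaled age of the first row in the normalized data.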


# Data Reduction

In [25]:
df_cleaned.head()

Unnamed: 0,id,full_name,age,bmi,blood_pressure,glucose_levels,gender_male,smoking_status_Smoker,condition_Diabetic,condition_Pneumonia
1,2,User0002,0.169014,0.388625,0.625512,0.576817,True,False,True,False
2,3,User0003,0.0,0.388625,0.373393,0.505132,True,False,False,True
3,4,User0004,0.507042,0.423811,0.625512,0.505132,True,False,False,True
4,5,User0005,0.816901,0.388625,0.625512,0.576817,True,False,True,False
5,6,User0006,0.309859,0.388625,0.625512,0.576817,True,False,True,False


In [26]:
# Data Reduction - Selecting Relevant Features
features_to_keep = ['age', 'bmi', 'blood_pressure', 'glucose_levels', 'gender_male', 'smoking_status_Smoker', 'condition_Diabetic', 'condition_Pneumonia']
df_reduced = df_cleaned[features_to_keep]

# Check the reduced dataset
df_reduced.head()


Unnamed: 0,age,bmi,blood_pressure,glucose_levels,gender_male,smoking_status_Smoker,condition_Diabetic,condition_Pneumonia
1,0.169014,0.388625,0.625512,0.576817,True,False,True,False
2,0.0,0.388625,0.373393,0.505132,True,False,False,True
3,0.507042,0.423811,0.625512,0.505132,True,False,False,True
4,0.816901,0.388625,0.625512,0.576817,True,False,True,False
5,0.309859,0.388625,0.625512,0.576817,True,False,True,False


In [27]:
df_reduced.shape

(9209, 8)
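
Since the only columns removed are the identifiers `id` and `full_name`, the same reduction can also be written by naming what to drop. This is a sketch on a hypothetical `toy` frame that mirrors a few of the notebook's column names.

```python
import pandas as pd

# Hypothetical toy frame mirroring a few of the notebook's column names
toy = pd.DataFrame({
    "id": [1, 2],
    "full_name": ["User0001", "User0002"],
    "age": [0.17, 0.0],
    "gender_male": [True, True],
})

# Drop identifier columns instead of listing every feature to keep
reduced = toy.drop(columns=["id", "full_name"])
print(list(reduced.columns))
```

Listing the columns to drop keeps the step short and makes it explicit that only identifiers are removed, while an explicit keep-list guards against new columns slipping in unnoticed.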

# Conclusion

In this experiment, we performed essential data preprocessing on a medical dataset, including the following steps:

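1. Handling Missing Values: We dropped the 791 rows in which age, bmi, blood_pressure, and glucose_levels were all missing, then filled the remaining gaps with the median of each column within the same condition group.

2. Handling Noisy Values (Outliers): We replaced values outside the 1.5 * IQR fences in bmi, blood_pressure, and glucose_levels with the column median. Because the quartiles were computed after imputation, the fences were narrow and the spread of these columns shrank sharply (for example, the standard deviation of bmi fell from 5.14 to 0.29).

3. Data Transformation (One-Hot Encoding): The categorical columns gender, smoking_status, and condition were one-hot encoded with drop_first=True, producing gender_male, smoking_status_Smoker, condition_Diabetic, and condition_Pneumonia.

4. Data Normalization: age, bmi, blood_pressure, and glucose_levels were rescaled to the range 0 to 1 using Min-Max scaling.

5. Data Reduction: We dropped the identifier columns id and full_name, leaving 8 columns.

After these steps, the dataset was cleaned and transformed into 9209 rows and 8 columns, ready for machine learning modeling. The class distribution of condition was not rebalanced.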

In [28]:
df_reduced.to_csv('dmbi_4.csv', index=False)