### Assignment 2: Cleaning and Preparing Healthcare Data for Analysis
Objective:
To clean a real-world healthcare dataset by handling inconsistencies, duplicates, and missing values.
Instructions:
Load the Dataset:
Read the healthcare dataset into a Pandas DataFrame.
Handle Missing Data:
Identify missing values in patient demographics (age, gender, blood pressure, etc.).
Apply appropriate imputation methods.
Detect and Handle Duplicates:
Identify duplicate records using duplicated().
Remove or merge duplicates as necessary.
Detect and Handle Outliers:
Use boxplots to identify extreme values.
Apply transformations or capping techniques to handle outliers.
Standardize and Normalize Data:
Convert categorical variables into numerical representations.
Scale numerical variables using Min-Max Scaling or Standard Scaling.
Data Validation:
Ensure no missing values or duplicates remain.
Check data types and correct inconsistencies.
Final Data Export:
Save the cleaned dataset as a CSV file for further analysis.


In [4]:
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Load dataset
df = pd.read_csv('Healthcare_Data.csv')

# Handle missing values
df.fillna(df.median(numeric_only=True), inplace=True)
for col in df.select_dtypes(include=['object']):
    df[col].fillna(df[col].mode()[0], inplace=True)

# Remove duplicates
df.drop_duplicates(inplace=True)

# Encode categorical variables
for col in df.select_dtypes(include=['object']):
    df[col] = LabelEncoder().fit_transform(df[col])

# Scale numerical data
df[df.select_dtypes(include=['float64', 'int64']).columns] = MinMaxScaler().fit_transform(df.select_dtypes(include=['float64', 'int64']))

# Save cleaned dataset
df.to_csv('Healthcare_Data.csv', index=False)
print("Cleaned dataset saved successfully.")


Cleaned dataset saved successfully.


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df[col].fillna(df[col].mode()[0], inplace=True)
