In [1]:
import pandas as pd
import numpy as np
import os

In [2]:
os.chdir('..')
pwd = os.getcwd()

In [3]:
# Build the path with os.path.join so it also works outside Windows
df = pd.read_csv(os.path.join(pwd, "2. Dataset", "healthcare_dataset.csv"))

In [4]:
df.head()

Unnamed: 0,Name,Age,Gender,Blood Type,Medical Condition,Date of Admission,Doctor,Hospital,Insurance Provider,Billing Amount,Room Number,Admission Type,Discharge Date,Medication,Test Results
0,Bobby JacksOn,30,Male,B-,Cancer,2024-01-31,Matthew Smith,Sons and Miller,Blue Cross,18856.281306,328,Urgent,2024-02-02,Paracetamol,Normal
1,LesLie TErRy,62,Male,A+,Obesity,2019-08-20,Samantha Davies,Kim Inc,Medicare,33643.327287,265,Emergency,2019-08-26,Ibuprofen,Inconclusive
2,DaNnY sMitH,76,Female,A-,Obesity,2022-09-22,Tiffany Mitchell,Cook PLC,Aetna,27955.096079,205,Emergency,2022-10-07,Aspirin,Normal
3,andrEw waTtS,28,Female,O+,Diabetes,2020-11-18,Kevin Wells,"Hernandez Rogers and Vang,",Medicare,37909.78241,450,Elective,2020-12-18,Ibuprofen,Abnormal
4,adrIENNE bEll,43,Female,AB+,Cancer,2022-09-19,Kathleen Hanna,White-White,Aetna,14238.317814,458,Urgent,2022-10-09,Penicillin,Abnormal


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55500 entries, 0 to 55499
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Name                55500 non-null  object 
 1   Age                 55500 non-null  int64  
 2   Gender              55500 non-null  object 
 3   Blood Type          55500 non-null  object 
 4   Medical Condition   55500 non-null  object 
 5   Date of Admission   55500 non-null  object 
 6   Doctor              55500 non-null  object 
 7   Hospital            55500 non-null  object 
 8   Insurance Provider  55500 non-null  object 
 9   Billing Amount      55500 non-null  float64
 10  Room Number         55500 non-null  int64  
 11  Admission Type      55500 non-null  object 
 12  Discharge Date      55500 non-null  object 
 13  Medication          55500 non-null  object 
 14  Test Results        55500 non-null  object 
dtypes: float64(1), int64(2), object(12)
memory usage: 6.4

# Data Cleaning Plan for Hospital Admissions Dataset

## Objective
Clean and prepare the hospital admissions dataset to ensure accuracy, consistency, and readiness for analysis.

---

## Step 1: Text Standardization
- Convert the **Name** and **Doctor** columns to proper case.
- Remove leading and trailing whitespace from all text fields.
- Standardize capitalization across categorical variables:
  - Gender  
  - Blood Type  
  - Medical Condition  
  - Admission Type  
  - Medication  
  - Test Results  
  - Insurance Provider  
  - Hospital  

---

## Step 2: Categorical Data Validation
- Ensure **Gender** contains only valid values (Male, Female).
- Verify **Blood Type** follows standard formats:
  - A+, A-, B+, B-, AB+, AB-, O+, O-
- Confirm **Admission Type** values are limited to:
  - Emergency, Urgent, Elective
- Validate **Test Results** entries:
  - Normal, Abnormal, Inconclusive

---

## Step 3: Date Formatting and Validation
- Convert **Date of Admission** and **Discharge Date** to a standard date format (YYYY-MM-DD).
- Verify that **Discharge Date** is the same as or later than **Date of Admission**.
- Flag any records with illogical or inconsistent dates.

---

## Step 4: Numerical Data Validation
- Confirm **Age** values fall within a realistic range (0–120).
- Ensure **Billing Amount** values are non-negative.
- Round **Billing Amount** to two decimal places if necessary.
- Verify **Room Number** values are positive integers.

---

## Step 5: Duplicate and Consistency Checks
- Identify duplicate records using key fields such as:
  - Name  
  - Date of Admission  
  - Hospital  
- Remove exact duplicate entries.
- Check for inconsistent patient information (e.g., the same name associated with different genders).

---

## Step 6: Outlier and Anomaly Detection
- Identify unusually high or low values in **Billing Amount**.
- Review extreme **Age** values for potential data entry errors.
- Flag anomalies for review rather than removing them automatically.

---

## Step 7: Final Review and Documentation
- Reconfirm data types after cleaning.
- Ensure all fields are consistent and analysis-ready.
- Document:
  - All cleaning steps performed  
  - Records removed or corrected  
  - Assumptions made during the cleaning process  

---

## Final Output
- Save the cleaned dataset as a new file to preserve the original data.
- Ensure the final dataset is suitable for reporting, visualization, and modeling.


---
---

## Step 1: Text Standardization
- Convert the **Name** and **Doctor** columns to proper case.
- Remove leading and trailing whitespace from all text fields.
- Standardize capitalization across categorical variables:
  - Gender  
  - Blood Type  
  - Medical Condition  
  - Admission Type  
  - Medication  
  - Test Results  
  - Insurance Provider  
  - Hospital  

In [6]:
df.head()

Unnamed: 0,Name,Age,Gender,Blood Type,Medical Condition,Date of Admission,Doctor,Hospital,Insurance Provider,Billing Amount,Room Number,Admission Type,Discharge Date,Medication,Test Results
0,Bobby JacksOn,30,Male,B-,Cancer,2024-01-31,Matthew Smith,Sons and Miller,Blue Cross,18856.281306,328,Urgent,2024-02-02,Paracetamol,Normal
1,LesLie TErRy,62,Male,A+,Obesity,2019-08-20,Samantha Davies,Kim Inc,Medicare,33643.327287,265,Emergency,2019-08-26,Ibuprofen,Inconclusive
2,DaNnY sMitH,76,Female,A-,Obesity,2022-09-22,Tiffany Mitchell,Cook PLC,Aetna,27955.096079,205,Emergency,2022-10-07,Aspirin,Normal
3,andrEw waTtS,28,Female,O+,Diabetes,2020-11-18,Kevin Wells,"Hernandez Rogers and Vang,",Medicare,37909.78241,450,Elective,2020-12-18,Ibuprofen,Abnormal
4,adrIENNE bEll,43,Female,AB+,Cancer,2022-09-19,Kathleen Hanna,White-White,Aetna,14238.317814,458,Urgent,2022-10-09,Penicillin,Abnormal


In [7]:
df['Name'] = df['Name'].str.title()
df['Doctor'] = df['Doctor'].str.title()

In [8]:
text_cols = df.select_dtypes(include='object').columns

In [9]:
df[text_cols] = df[text_cols].apply(lambda x: x.str.strip())

In [10]:
df.head()

Unnamed: 0,Name,Age,Gender,Blood Type,Medical Condition,Date of Admission,Doctor,Hospital,Insurance Provider,Billing Amount,Room Number,Admission Type,Discharge Date,Medication,Test Results
0,Bobby Jackson,30,Male,B-,Cancer,2024-01-31,Matthew Smith,Sons and Miller,Blue Cross,18856.281306,328,Urgent,2024-02-02,Paracetamol,Normal
1,Leslie Terry,62,Male,A+,Obesity,2019-08-20,Samantha Davies,Kim Inc,Medicare,33643.327287,265,Emergency,2019-08-26,Ibuprofen,Inconclusive
2,Danny Smith,76,Female,A-,Obesity,2022-09-22,Tiffany Mitchell,Cook PLC,Aetna,27955.096079,205,Emergency,2022-10-07,Aspirin,Normal
3,Andrew Watts,28,Female,O+,Diabetes,2020-11-18,Kevin Wells,"Hernandez Rogers and Vang,",Medicare,37909.78241,450,Elective,2020-12-18,Ibuprofen,Abnormal
4,Adrienne Bell,43,Female,AB+,Cancer,2022-09-19,Kathleen Hanna,White-White,Aetna,14238.317814,458,Urgent,2022-10-09,Penicillin,Abnormal
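
The cells above cover the name columns and whitespace; the capitalization bullet for the categorical columns could be handled along the following lines. This is a minimal sketch on a made-up frame (`demo` and its values are illustrative, not rows from the dataset), and it assumes title case is the desired form for ordinary categories while blood types should stay upper case.

```python
import pandas as pd

# Toy frame standing in for df; the values are invented for illustration.
demo = pd.DataFrame({
    "Gender": ["male", "FEMALE "],
    "Blood Type": ["ab+", "o-"],
    "Medication": ["aspirin", "IBUPROFEN"],
})

# Title-case ordinary categories after stripping whitespace.
title_cols = ["Gender", "Medication"]
for col in title_cols:
    demo[col] = demo[col].str.strip().str.title()

# Blood types are upper-cased instead: str.title() would turn "AB+" into "Ab+".
demo["Blood Type"] = demo["Blood Type"].str.strip().str.upper()

print(demo)
```

Title-casing is not safe for every column: it would change `Cook PLC` to `Cook Plc` and `UnitedHealthcare` to `Unitedhealthcare`, so **Hospital** and **Insurance Provider** are probably better left as they are or mapped explicitly.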


## Step 2: Categorical Data Validation
- Ensure **Gender** contains only valid values (Male, Female).
- Verify **Blood Type** follows standard formats:
  - A+, A-, B+, B-, AB+, AB-, O+, O-
- Confirm **Admission Type** values are limited to:
  - Emergency, Urgent, Elective
- Validate **Test Results** entries:
  - Normal, Abnormal, Inconclusive

In [11]:
for col in ['Gender', 'Blood Type', 'Admission Type', 'Test Results']:
    print(f"{col}: {df[col].unique()} ")

Gender: ['Male' 'Female'] 
Blood Type: ['B-' 'A+' 'A-' 'O+' 'AB+' 'AB-' 'B+' 'O-'] 
Admission Type: ['Urgent' 'Emergency' 'Elective'] 
Test Results: ['Normal' 'Inconclusive' 'Abnormal'] 
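
Inspecting `unique()` works well for a handful of values. The same check can be made repeatable by comparing each column against its allowed set, as in this sketch. The `allowed` dictionary mirrors the lists in the Step 2 plan; the helper name `invalid_counts` and the toy frame are assumptions for illustration.

```python
import pandas as pd

# Allowed values taken from the Step 2 plan.
allowed = {
    "Gender": {"Male", "Female"},
    "Blood Type": {"A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-"},
    "Admission Type": {"Emergency", "Urgent", "Elective"},
    "Test Results": {"Normal", "Abnormal", "Inconclusive"},
}

def invalid_counts(frame, allowed_values):
    """Return {column: number of rows whose value is outside the allowed set}."""
    return {
        col: int((~frame[col].isin(values)).sum())
        for col, values in allowed_values.items()
    }

# Toy frame with two deliberately invalid entries ("Unknown", "Pending").
demo = pd.DataFrame({
    "Gender": ["Male", "Female", "Unknown"],
    "Blood Type": ["A+", "O-", "AB+"],
    "Admission Type": ["Urgent", "Elective", "Emergency"],
    "Test Results": ["Normal", "Abnormal", "Pending"],
})

counts = invalid_counts(demo, allowed)
print(counts)
```

Calling `invalid_counts(df, allowed)` on the real frame should return zeros for all four columns, given the `unique()` output above.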


## Step 3: Date Formatting and Validation
- Convert **Date of Admission** and **Discharge Date** to a standard date format (YYYY-MM-DD).
- Verify that **Discharge Date** is the same as or later than **Date of Admission**.
- Flag any records with illogical or inconsistent dates.

In [12]:
df['Date of Admission'] = pd.to_datetime(df['Date of Admission'])
df['Discharge Date'] = pd.to_datetime(df['Discharge Date'])

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55500 entries, 0 to 55499
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   Name                55500 non-null  object        
 1   Age                 55500 non-null  int64         
 2   Gender              55500 non-null  object        
 3   Blood Type          55500 non-null  object        
 4   Medical Condition   55500 non-null  object        
 5   Date of Admission   55500 non-null  datetime64[ns]
 6   Doctor              55500 non-null  object        
 7   Hospital            55500 non-null  object        
 8   Insurance Provider  55500 non-null  object        
 9   Billing Amount      55500 non-null  float64       
 10  Room Number         55500 non-null  int64         
 11  Admission Type      55500 non-null  object        
 12  Discharge Date      55500 non-null  datetime64[ns]
 13  Medication          55500 non-null  object    

In [14]:
df[df["Discharge Date"] < df['Date of Admission']]

Unnamed: 0,Name,Age,Gender,Blood Type,Medical Condition,Date of Admission,Doctor,Hospital,Insurance Provider,Billing Amount,Room Number,Admission Type,Discharge Date,Medication,Test Results
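
The filter above returns no rows, so no discharge precedes its admission. To flag records instead of only filtering them, one option is to parse with `errors="coerce"` and keep a boolean column, as sketched below. The column names `Length of Stay` and `Date Issue` are hypothetical additions, and the toy rows are invented.

```python
import pandas as pd

# Toy frame: row 1 is discharged before admission, row 2 has an unparseable date.
demo = pd.DataFrame({
    "Date of Admission": ["2024-01-31", "2022-10-07", "not a date"],
    "Discharge Date": ["2024-02-02", "2022-09-22", "2020-12-18"],
})

date_cols = ["Date of Admission", "Discharge Date"]
for col in date_cols:
    # Unparseable values become NaT instead of raising an error.
    demo[col] = pd.to_datetime(demo[col], errors="coerce")

# Stay length in days; NaN when either date is missing.
demo["Length of Stay"] = (demo["Discharge Date"] - demo["Date of Admission"]).dt.days

# Flag rows with a missing date or a negative stay.
demo["Date Issue"] = demo[date_cols].isna().any(axis=1) | (demo["Length of Stay"] < 0)

print(demo)
```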


In [15]:
df.describe()

Unnamed: 0,Age,Date of Admission,Billing Amount,Room Number,Discharge Date
count,55500.0,55500,55500.0,55500.0,55500
mean,51.539459,2021-11-01 01:02:22.443243008,25539.316097,301.134829,2021-11-16 13:15:20.821621504
min,13.0,2019-05-08 00:00:00,-2008.49214,101.0,2019-05-09 00:00:00
25%,35.0,2020-07-28 00:00:00,13241.224652,202.0,2020-08-12 00:00:00
50%,52.0,2021-11-01 00:00:00,25538.069376,302.0,2021-11-17 00:00:00
75%,68.0,2023-02-03 00:00:00,37820.508436,401.0,2023-02-18 00:00:00
max,89.0,2024-05-07 00:00:00,52764.276736,500.0,2024-06-06 00:00:00
std,19.602454,,14211.454431,115.243069,


In [16]:
# First Admission: 2019/05/08
# First Discharge: 2019/05/09
# Last Admission: 2024/05/07
# Last Discharge: 2024/06/06

# Dataset range: 2019/05/08 - 2024/06/06

## Step 4: Numerical Data Validation
- Confirm **Age** values fall within a realistic range (0–120).
- Ensure **Billing Amount** values are non-negative.
- Round **Billing Amount** to two decimal places if necessary.
- Verify **Room Number** values are positive integers.

In [17]:
df['Age'].describe()

count    55500.000000
mean        51.539459
std         19.602454
min         13.000000
25%         35.000000
50%         52.000000
75%         68.000000
max         89.000000
Name: Age, dtype: float64

In [18]:
df['Billing Amount'].describe()

count    55500.000000
mean     25539.316097
std      14211.454431
min      -2008.492140
25%      13241.224652
50%      25538.069376
75%      37820.508436
max      52764.276736
Name: Billing Amount, dtype: float64

In [19]:
df[df['Billing Amount'] < 0]

Unnamed: 0,Name,Age,Gender,Blood Type,Medical Condition,Date of Admission,Doctor,Hospital,Insurance Provider,Billing Amount,Room Number,Admission Type,Discharge Date,Medication,Test Results
132,Ashley Erickson,32,Female,AB-,Cancer,2019-11-05,Gerald Hooper,"and Johnson Moore, Branch",Aetna,-502.507813,376,Urgent,2019-11-23,Penicillin,Normal
799,Christopher Weiss,49,Female,AB-,Asthma,2023-02-16,Kelly Thompson,Hunter-Hughes,Aetna,-1018.245371,204,Elective,2023-03-09,Penicillin,Inconclusive
1018,Ashley Warner,60,Male,A+,Hypertension,2021-12-21,Andrea Bentley,"and Wagner, Lee Klein",Aetna,-306.364925,426,Elective,2022-01-11,Ibuprofen,Normal
1421,Jay Galloway,74,Female,O+,Asthma,2021-01-20,Debra Everett,Group Peters,Blue Cross,-109.097122,381,Emergency,2021-02-09,Ibuprofen,Abnormal
2103,Joshua Williamson,72,Female,B-,Diabetes,2021-03-21,Wendy Ramos,"and Huff Reeves, Dennis",Blue Cross,-576.727907,369,Urgent,2021-04-17,Aspirin,Abnormal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52894,Joseph Cox,23,Male,AB-,Diabetes,2019-10-13,Peter Smith,Inc Ward,Blue Cross,-353.865186,271,Elective,2019-10-25,Lipitor,Inconclusive
53204,Ashley Warner,55,Male,A+,Hypertension,2021-12-21,Andrea Bentley,"and Wagner, Lee Klein",Aetna,-306.364925,426,Elective,2022-01-11,Ibuprofen,Normal
53232,Daniel Drake,68,Female,B+,Hypertension,2020-04-24,Brett Ray,Carr Ltd,Aetna,-591.917419,426,Elective,2020-04-26,Lipitor,Abnormal
54136,Dr. Michael Mckay,64,Male,O+,Cancer,2019-05-31,Dawn Navarro,"Mcconnell and Rios, Clark",UnitedHealthcare,-199.663795,122,Urgent,2019-06-12,Ibuprofen,Abnormal


Because the dataset is synthetic, many records have negative billing amounts. They are left in place rather than corrected or removed.

In [21]:
df['Billing Amount'] = df['Billing Amount'].round(2)

In [23]:
df.head()

Unnamed: 0,Name,Age,Gender,Blood Type,Medical Condition,Date of Admission,Doctor,Hospital,Insurance Provider,Billing Amount,Room Number,Admission Type,Discharge Date,Medication,Test Results
0,Bobby Jackson,30,Male,B-,Cancer,2024-01-31,Matthew Smith,Sons and Miller,Blue Cross,18856.28,328,Urgent,2024-02-02,Paracetamol,Normal
1,Leslie Terry,62,Male,A+,Obesity,2019-08-20,Samantha Davies,Kim Inc,Medicare,33643.33,265,Emergency,2019-08-26,Ibuprofen,Inconclusive
2,Danny Smith,76,Female,A-,Obesity,2022-09-22,Tiffany Mitchell,Cook PLC,Aetna,27955.1,205,Emergency,2022-10-07,Aspirin,Normal
3,Andrew Watts,28,Female,O+,Diabetes,2020-11-18,Kevin Wells,"Hernandez Rogers and Vang,",Medicare,37909.78,450,Elective,2020-12-18,Ibuprofen,Abnormal
4,Adrienne Bell,43,Female,AB+,Cancer,2022-09-19,Kathleen Hanna,White-White,Aetna,14238.32,458,Urgent,2022-10-09,Penicillin,Abnormal


In [25]:
df[df['Room Number'] <= 0]

Unnamed: 0,Name,Age,Gender,Blood Type,Medical Condition,Date of Admission,Doctor,Hospital,Insurance Provider,Billing Amount,Room Number,Admission Type,Discharge Date,Medication,Test Results
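
The individual Step 4 checks can also be gathered into one table of boolean flags, which makes it easy to count problems per rule and to mark rows for review. This is a sketch on invented values; the flag names and the `needs_review` column are assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy frame: row 1 has an impossible age and a negative bill, row 2 a bad room.
demo = pd.DataFrame({
    "Age": [30, 130, 62],
    "Billing Amount": [18856.28, -502.51, 33643.33],
    "Room Number": [328, 376, 0],
})

# One boolean column per rule from the Step 4 plan.
checks = pd.DataFrame({
    "age_out_of_range": ~demo["Age"].between(0, 120),
    "negative_billing": demo["Billing Amount"] < 0,
    "bad_room_number": demo["Room Number"] <= 0,
})

# A row needs review if it violates at least one rule.
demo["needs_review"] = checks.any(axis=1)

# Number of violations per rule.
summary = checks.sum().to_dict()
print(summary)
```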


## Step 5: Duplicate and Consistency Checks
- Identify duplicate records using key fields such as:
  - Name  
  - Date of Admission  
  - Hospital  
- Remove exact duplicate entries.
- Check for inconsistent patient information (e.g., the same name associated with different genders).

In [28]:
df[['Name','Date of Admission', 'Hospital']].duplicated().sum()

np.int64(5500)

In [31]:
df[df.duplicated(subset=['Name', "Date of Admission","Hospital"], keep='first')]

Unnamed: 0,Name,Age,Gender,Blood Type,Medical Condition,Date of Admission,Doctor,Hospital,Insurance Provider,Billing Amount,Room Number,Admission Type,Discharge Date,Medication,Test Results
50000,Krista Gibson,24,Male,A+,Diabetes,2022-05-05,Alexandra Gould,"and Cooper Soto Mccullough,",Aetna,6364.48,230,Urgent,2022-05-16,Lipitor,Inconclusive
50001,Brooke Mccullough,63,Male,AB+,Diabetes,2022-01-18,John Brock,Stark-Smith,Medicare,16183.46,388,Urgent,2022-02-01,Lipitor,Normal
50002,Daniel Thompson,88,Female,AB+,Hypertension,2023-12-14,Ryan Wolfe,Davis-Hicks,Blue Cross,29177.76,157,Elective,2023-12-16,Lipitor,Normal
50003,Nicholas Hunt,22,Male,O-,Diabetes,2021-10-09,Christopher Brown,Stewart-Garza,Blue Cross,37920.32,383,Emergency,2021-10-29,Lipitor,Normal
50004,Melissa Martinez,78,Female,O-,Cancer,2022-01-05,Amy Brown,Lee-Jefferson,Blue Cross,26378.78,132,Urgent,2022-01-31,Lipitor,Abnormal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
55495,Elizabeth Jackson,42,Female,O+,Asthma,2020-08-16,Joshua Jarvis,Jones-Thompson,Blue Cross,2650.71,417,Elective,2020-09-15,Penicillin,Abnormal
55496,Kyle Perez,61,Female,AB-,Obesity,2020-01-23,Taylor Sullivan,Tucker-Moyer,Cigna,31457.80,316,Elective,2020-02-01,Aspirin,Normal
55497,Heather Wang,38,Female,B+,Hypertension,2020-07-13,Joe Jacobs Dvm,"and Mahoney Johnson Vasquez,",UnitedHealthcare,27620.76,347,Urgent,2020-08-10,Ibuprofen,Abnormal
55498,Jennifer Jones,43,Male,O-,Arthritis,2019-05-25,Kimberly Curry,"Jackson Todd and Castro,",Medicare,32451.09,321,Elective,2019-05-31,Ibuprofen,Abnormal
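
Rows that share the three key fields are not necessarily exact duplicates. In the negative-billing output above, for example, Ashley Warner appears at index 1018 with Age 60 and at index 53204 with Age 55, with the other fields identical. The sketch below separates the two cases on a toy frame; the hospital names and the variable names are illustrative assumptions.

```python
import pandas as pd

# Toy frame: the first pair differs in Age, the second pair is fully identical.
demo = pd.DataFrame({
    "Name": ["Ashley Warner", "Ashley Warner", "Kyle Perez", "Kyle Perez"],
    "Age": [60, 55, 61, 61],
    "Date of Admission": ["2021-12-21", "2021-12-21", "2020-01-23", "2020-01-23"],
    "Hospital": ["Hospital A", "Hospital A", "Hospital B", "Hospital B"],
})
key_cols = ["Name", "Date of Admission", "Hospital"]

# Rows identical in every column to an earlier row.
exact_dupes = int(demo.duplicated().sum())

# Rows that repeat an earlier row on the key fields only.
key_dupes = int(demo.duplicated(subset=key_cols).sum())

# Key groups whose members disagree on Age, worth reviewing before dropping.
groups = demo[demo.duplicated(subset=key_cols, keep=False)]
conflicting = int(groups.groupby(key_cols)["Age"].nunique().gt(1).sum())

print(exact_dupes, key_dupes, conflicting)
```

Dropping on the key subset with `keep='first'` silently discards the second Age value in such groups, so the choice of which row to keep is an assumption worth recording.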


In [34]:
df.drop_duplicates(subset=['Name', "Date of Admission","Hospital"], keep='first',inplace=True)

In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 50000 entries, 0 to 49999
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   Name                50000 non-null  object        
 1   Age                 50000 non-null  int64         
 2   Gender              50000 non-null  object        
 3   Blood Type          50000 non-null  object        
 4   Medical Condition   50000 non-null  object        
 5   Date of Admission   50000 non-null  datetime64[ns]
 6   Doctor              50000 non-null  object        
 7   Hospital            50000 non-null  object        
 8   Insurance Provider  50000 non-null  object        
 9   Billing Amount      50000 non-null  float64       
 10  Room Number         50000 non-null  int64         
 11  Admission Type      50000 non-null  object        
 12  Discharge Date      50000 non-null  datetime64[ns]
 13  Medication          50000 non-null  object        


In [36]:
# Count the distinct Gender values recorded under each Name
gender_inconsistencies = (
    df.groupby("Name")["Gender"]
    .nunique()
    .reset_index()
)

In [39]:
gender_inconsistencies[gender_inconsistencies['Gender'] > 1]

Unnamed: 0,Name,Gender
5,Aaron Baker,2
11,Aaron Bradshaw,2
31,Aaron Davis,2
69,Aaron Lopez,2
72,Aaron Martinez,2
...,...,...
40169,Zachary Miller,2
40172,Zachary Moore,2
40177,Zachary Obrien,2
40193,Zachary Reyes,2


Multiple genders appear under the same name because the dataset is synthetic.

## Step 6: Outlier and Anomaly Detection
- Identify unusually high or low values in **Billing Amount**.
- Review extreme **Age** values for potential data entry errors.
- Flag anomalies for review rather than removing them automatically.

In [41]:
df[['Billing Amount','Age']].describe()

Unnamed: 0,Billing Amount,Age
count,50000.0,50000.0
mean,25555.691533,51.58036
std,14215.932245,19.582194
min,-2008.49,18.0
25%,13239.4075,35.0
50%,25541.305,52.0
75%,37853.9975,68.0
max,52764.28,85.0
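
`describe()` gives the quartiles, but it does not flag anything by itself. One common convention, assumed here rather than taken from the notebook, is the 1.5 × IQR rule. The helper name `iqr_flags` and the toy values are illustrative.

```python
import pandas as pd

def iqr_flags(series, k=1.5):
    """Return a boolean Series marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy values with one obvious outlier.
demo = pd.DataFrame({"Billing Amount": [100.0, 110.0, 120.0, 130.0, 140.0, 1000.0]})

# Flag for review rather than dropping, as the Step 6 plan recommends.
demo["billing_outlier"] = iqr_flags(demo["Billing Amount"])
print(demo)
```

With the quartiles shown above (Q1 ≈ 13,239.41, Q3 ≈ 37,854.00), the fences work out to roughly -23,682 and 74,776, so this rule would flag no **Billing Amount** value, including the negative ones. Those still need the explicit sign check from Step 4.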


## Step 7: Final Review and Documentation
- Reconfirm data types after cleaning.
- Ensure all fields are consistent and analysis-ready.
- Document:
  - All cleaning steps performed  
  - Records removed or corrected  
  - Assumptions made during the cleaning process  

In [42]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 50000 entries, 0 to 49999
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   Name                50000 non-null  object        
 1   Age                 50000 non-null  int64         
 2   Gender              50000 non-null  object        
 3   Blood Type          50000 non-null  object        
 4   Medical Condition   50000 non-null  object        
 5   Date of Admission   50000 non-null  datetime64[ns]
 6   Doctor              50000 non-null  object        
 7   Hospital            50000 non-null  object        
 8   Insurance Provider  50000 non-null  object        
 9   Billing Amount      50000 non-null  float64       
 10  Room Number         50000 non-null  int64         
 11  Admission Type      50000 non-null  object        
 12  Discharge Date      50000 non-null  datetime64[ns]
 13  Medication          50000 non-null  object        


---
---
### Cleaning Report

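- 'Name' and 'Doctor' converted to title case; leading and trailing whitespace stripped from all text columns
- Validated 'Date of Admission' and 'Discharge Date'; no discharge precedes its admission (date range: 2019/05/08 - 2024/06/06)
- 'Billing Amount' rounded to two decimal places; negative amounts were kept and attributed to the synthetic nature of the dataset
- 5,500 duplicate records were removed, keeping the first occurrence, based on ['Name', 'Date of Admission', 'Hospital']
- Name-Gender inconsistencies were attributed to the synthetic nature of the dataset and left unchanged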

---
---

In [44]:
# Write to a new file so the original dataset is preserved
df.to_csv(os.path.join(pwd, "2. Dataset", "cleaned_dataset.csv"), index=False)