### üîç DATA DETECTIVE CHALLENGE: Patient Records Investigation

 Your mission: Clean and analyze this messy patient records dataset!

 SCORING SYSTEM:
 - Each bug found and fixed: 5 points
 - Bonus: Creative solution: +3 points
 - Speed bonus: First to finish correctly: +10 points

 Total possible bugs: 15+ categories
 Maximum score: 100+ points


In [1]:
import pandas as pd
import numpy as np
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

In [2]:
# ================================================================
# STEP 1: LOAD THE DATA
# ================================================================

# TODO: Load the messy_patient_records.csv file
df = pd.read_csv('messy_patient_records.csv')


print("Dataset loaded successfully!")
print(f"Shape: {df.shape}")
print("\n" + "="*70)

Dataset loaded successfully!
Shape: (100, 15)



In [3]:
# ================================================================
# STEP 2: INITIAL EXPLORATION
# ================================================================
# Get a feel for the data before you start cleaning

print("\nüîé INITIAL DATA INSPECTION")
print("="*70)

# Display data types
print("\nüìä Data Types:")
print(df.dtypes)

# Display basic info
print("\nüìã Dataset Info:")
print(df.info())

# Check for missing values
print("\n‚ùì Missing Values Count:")
print(df.isnull().sum())

print("\n" + "="*70)


üîé INITIAL DATA INSPECTION

üìä Data Types:
Patient_ID            float64
First_Name             object
Last_Name              object
Date_of_Birth          object
Age                   float64
Gender                 object
Blood_Type             object
Admission_Date         object
Discharge_Date         object
Diagnosis              object
Temperature            object
Blood_Pressure         object
Heart_Rate            float64
Insurance_Provider     object
Bill_Amount            object
dtype: object

üìã Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Patient_ID          98 non-null     float64
 1   First_Name          98 non-null     object 
 2   Last_Name           98 non-null     object 
 3   Date_of_Birth       82 non-null     object 
 4   Age                 95 non-null     float64
 5   Gender          

In [4]:
# Display first few rows
print("\nFirst 10 rows:")
df.head(10)


First 10 rows:


Unnamed: 0,Patient_ID,First_Name,Last_Name,Date_of_Birth,Age,Gender,Blood_Type,Admission_Date,Discharge_Date,Diagnosis,Temperature,Blood_Pressure,Heart_Rate,Insurance_Provider,Bill_Amount
0,1001.0,John,Smith,,45.0,Unknown,A +,2024/10/15,,Flu,97.3F,151 / 88,78.0,,$566
1,1002.0,Mary,johnson,1954-01-24,26.0,MALE,A+,2024-10-25,2024-11-03,Pneumonia,38.4,133/71,68.0,Cigna,5876.33
2,1003.0,james,WILLIAMS,1975-04-08,34.0,M,O-,2025-07-09,2025-07-18,COVID-19,36.5,96/76,78.0,BlueCross,7015.1
3,1004.0,SARAH,Brown,1957-12-04,34.0,Male,B+,2024-11-25,2024-12-05,COVID-19,37.8,151/67,88.0,BLUECROSS,7574.35
4,1005.0,Michael,Jones,2009-02-19,78.0,MALE,AB+,2024-12-19,2024-12-25,Migraine,38.0,98/85,91.0,UnitedHealth,965.92
5,1006.0,Robert,Garcia,1994-01-01,39.0,M,B+,2025-05-05,2025-05-18,Migraine,37.8,152/64,67.0,BLUECROSS,6260.86
6,1007.0,william,Miller,1951-04-08,51.0,MALE,A+,2025-03-13,2025-03-21,Anxiety,37.6,163/100,61.0,blue cross,5218.91
7,1008.0,Olivia,Davis,,85.0,m,O+,2024-12-30,2025-01-09,Bronchitis,39.2,177/63,100.0,Cigna,9089.48
8,1009.0,David,,2004-10-01,72.0,FEMALE,O-,2025-10-06,2025-10-01,Fracture,37.9,109/69,98.0,aetna,3218.84
9,1010.0,Sophia,Martinez,2011-04-23,45.0,m,A+,,,Bronchitis,37.0,162 / 120,75.0,United Health,8089.17


In [5]:
# ================================================================
# üéØ CLUES: Types of Bugs Hidden in the Dataset
# ================================================================
print("\nüïµÔ∏è DETECTIVE CLUES - Types of bugs you'll find:")
print("="*70)
print("""
1. üî¢ DUPLICATE DATA: Some patients appear more than once
2. ‚ùå MISSING VALUES: Multiple types (None, NaN, 'NA', 'N/A', empty strings)
3. üî§ DATA TYPE ISSUES: Numbers stored as strings, dates as objects
4. üìÖ DATE FORMAT ISSUES: Inconsistent date formats across columns
5. üé® FORMATTING ISSUES: Extra whitespace, inconsistent capitalization
6. üö´ INVALID VALUES: Impossible ages, negative numbers where they shouldn't be
7. üå°Ô∏è MIXED UNITS: Temperature in different units (Celsius vs Fahrenheit)
8. ü©∏ TYPOS: Misspellings in categorical data
9. ‚öñÔ∏è INCONSISTENT NAMING: Same values written differently
10. üîÄ LOGICAL INCONSISTENCIES: Dates that don't make sense (discharge before admission)
11. üìä OUTLIERS: Values way outside normal ranges
12. üî¢ WRONG SEPARATORS: Data formatted incorrectly
13. üíµ CURRENCY SYMBOLS: Embedded in numeric data
14. üé≠ CATEGORICAL CHAOS: Same categories with different cases/formats
15. üîÑ INCONSISTENT NULLS: Different representations of missing data

Your challenge: Find and fix ALL of them!
""")
print("="*70)


üïµÔ∏è DETECTIVE CLUES - Types of bugs you'll find:

1. üî¢ DUPLICATE DATA: Some patients appear more than once
2. ‚ùå MISSING VALUES: Multiple types (None, NaN, 'NA', 'N/A', empty strings)
3. üî§ DATA TYPE ISSUES: Numbers stored as strings, dates as objects
4. üìÖ DATE FORMAT ISSUES: Inconsistent date formats across columns
5. üé® FORMATTING ISSUES: Extra whitespace, inconsistent capitalization
6. üö´ INVALID VALUES: Impossible ages, negative numbers where they shouldn't be
7. üå°Ô∏è MIXED UNITS: Temperature in different units (Celsius vs Fahrenheit)
8. ü©∏ TYPOS: Misspellings in categorical data
9. ‚öñÔ∏è INCONSISTENT NAMING: Same values written differently
10. üîÄ LOGICAL INCONSISTENCIES: Dates that don't make sense (discharge before admission)
11. üìä OUTLIERS: Values way outside normal ranges
12. üî¢ WRONG SEPARATORS: Data formatted incorrectly
13. üíµ CURRENCY SYMBOLS: Embedded in numeric data
14. üé≠ CATEGORICAL CHAOS: Same categories with different cases/formats
1

In [6]:
# ================================================================
# STEP 3: START YOUR INVESTIGATION
# ================================================================

# Create a copy to work with (always keep original!)
df_clean = df.copy()

print("\nüßπ CLEANING PROCESS BEGINS!")
print("="*70)


üßπ CLEANING PROCESS BEGINS!


# BUG CATEGORY 1: DUPLICATE RECORDS

In [7]:

print("\n1Ô∏è‚É£ Checking for duplicates...")

# --- find all duplicated Patient_IDs ---

# TODO: Check for duplicate Patient_IDs
# HINT: Use df_clean['Patient_ID'].duplicated()

print("Duplicate IDs:", df_clean["Patient_ID"].duplicated().sum())
duplicate_ids = df_clean[df_clean['Patient_ID'].duplicated(keep=False)]
print("Duplicate Patient IDs found:")
print(duplicate_ids[["Patient_ID", "First_Name", "Last_Name", "Date_of_Birth"]])

# TODO: Remove or handle duplicates
# HINT: Use drop_duplicates() or investigate why duplicates exist

"""
It seems the same ID was assigned to different persons, causing duplicate IDs. 
So removing duplicate IDs might actually remove unique entries.
We shall need to reasign IDs instead.
"""

# Let's first identify truly repeated persons (same name, same DOB)
true_dupes = df_clean[df_clean.duplicated(subset=["Patient_ID", "First_Name", "Last_Name", "Date_of_Birth"], keep=False)]
print("\nPossible true duplicates:")
print(f"Number of possible true duplicates =  {true_dupes.shape[0]}")

# Now we can see that much as the IDs show presence of duplicates, duplicates actually don't exist.
# Let's handle this;

# Convert Patient_ID to numeric just in case there are string values
df_clean["Patient_ID"] = pd.to_numeric(df_clean["Patient_ID"], errors="coerce")

# Show min, max, and count of valid (non-missing) IDs
min_id = df_clean["Patient_ID"].min(skipna=True)
max_id = df_clean["Patient_ID"].max(skipna=True)
num_ids = df_clean["Patient_ID"].notna().sum()

print(f"Existing Patient_ID range: {int(min_id)} -> {int(max_id)}")
print(f"Total valid Patient_IDs: {num_ids}")

# Optional: check how many are missing
missing_ids = df_clean["Patient_ID"].isna().sum()
print(f"Missing Patient_IDs: {missing_ids}")

"""
Now we can clearly see that the ID ranges from from 1001 - 1100
98 patients were assigned IDs
2 patients do not have IDs
"""

# Let's check how many values between 1001 - 1100 were not assigned, because we see that some patients were given the same IDs

# Get all existing (non-null) IDs
existing_ids = df_clean["Patient_ID"].dropna().astype(int)

# Define the full expected range
full_range = set(range(1001, 1101))  # 1101 is exclusive

# Find which IDs are not used
missing_from_range = sorted(list(full_range - set(existing_ids)))

print("Unused Patient_IDs available for assignment:")
print(missing_from_range)
print(f"Total missing IDs: {len(missing_from_range)}")

"""
From the results, we can see that 4 IDs (1011, 1026, 1051, 1076) were not assigned, clearly explaining why we have duplicate IDs
We can now go ahead and reassigned those IDs to patients with duplicate and missing IDs
"""
# Fix duplicate Patient_IDs
df_clean.loc[14, "Patient_ID"] = 1011  # Michael Jones
df_clean.loc[25, "Patient_ID"] = 1026  # Emma Garcia (with DOB 2008)

# Fix missing Patient_IDs
df_clean.loc[50, "Patient_ID"] = 1051  # John Smith (1981)
df_clean.loc[75, "Patient_ID"] = 1076  # Emma Garcia (1985)

"""
Now let's check for missing IDs again, if any
"""
print("Missing IDs:", df_clean["Patient_ID"].isna().sum())
print("Duplicate IDs:", df_clean["Patient_ID"].duplicated().sum())

"""
Wooow! Now our data has no missing and duplicated IDs
We can now proceed to the next check.
"""
print("‚úì Duplicate check complete")



1Ô∏è‚É£ Checking for duplicates...
Duplicate IDs: 3
Duplicate Patient IDs found:
    Patient_ID First_Name Last_Name Date_of_Birth
10      1015.0       John     Smith    2009-07-08
14      1015.0  Michael       Jones           NaN
25      1015.0       Emma    Garcia    2008-02-13
50         NaN       John     Smith    1981-01-08
75         NaN       Emma    Garcia    1985/05/20

Possible true duplicates:
Number of possible true duplicates =  0
Existing Patient_ID range: 1001 -> 1100
Total valid Patient_IDs: 98
Missing Patient_IDs: 2
Unused Patient_IDs available for assignment:
[1011, 1026, 1051, 1076]
Total missing IDs: 4
Missing IDs: 0
Duplicate IDs: 0
‚úì Duplicate check complete


In [8]:
# ----------------------------------------------------------------
# BUG CATEGORY 2: MISSING VALUES - STANDARDIZE
# ----------------------------------------------------------------
print("\n2Ô∏è‚É£ Standardizing missing values...")

# TODO: Replace all variations of missing values with proper NaN
# HINT: Look for 'NA', 'N/A', empty strings '', and convert to np.nan
# HINT: Use replace() method

"""
Let's define all possible variants of missing values
"""
missing_variants = ['NA', 'N/A', '', 'na', 'null']

# Let's now replace all those possible variants with the uniform np.nan across the whole DataFrame
df_clean.replace(missing_variants, np.nan, inplace = True)
df_clean


print("‚úì Missing values standardized")


2Ô∏è‚É£ Standardizing missing values...
‚úì Missing values standardized


In [9]:
# ----------------------------------------------------------------
# BUG CATEGORY 3: WHITESPACE ISSUES
# ----------------------------------------------------------------
print("\n3Ô∏è‚É£ Cleaning whitespace...")

# TODO: Remove leading/trailing whitespace from text columns
# HINT: Use .str.strip() on string columns

"""
Since white space only exist within string data and not numeric, let's apply these changes to only columns that have string data
"""
# Identify object (text/string) columns
text_cols = df_clean.select_dtypes(include="object").columns

# Strip whitespace from each
for col in text_cols:
    df_clean[col] = df_clean[col].str.strip()

print("‚úì Whitespace cleaned")


3Ô∏è‚É£ Cleaning whitespace...


‚úì Whitespace cleaned


In [10]:
# ----------------------------------------------------------------
# BUG CATEGORY 4: CASE CONSISTENCY
# ----------------------------------------------------------------
print("\n4Ô∏è‚É£ Fixing case inconsistencies...")

# TODO: Standardize capitalization in First_Name, Last_Name, Gender, etc.
# HINT: Use .str.title(), .str.upper(), or .str.lower() as appropriate

# Standardize names (jones to Jones)
df_clean["First_Name"] = df_clean["First_Name"].str.title()
df_clean["Last_Name"] = df_clean["Last_Name"].str.title()

# Standardize gender (e.g., 'male' to 'MALE')
df_clean["Gender"] = df_clean["Gender"].str.upper()

"""
We also realize that Gender is labelled as M and MALE or F and FEMALE, 
let's make this consistent by replacing M with MALE and F with FEMALE
"""

df_clean["Gender"] = df_clean["Gender"].replace({
    "M": "MALE",
    "F": "FEMALE"
})

print("‚úì Case consistency applied")


4Ô∏è‚É£ Fixing case inconsistencies...
‚úì Case consistency applied


In [11]:
# ----------------------------------------------------------------
# BUG CATEGORY 5: DATE FORMAT ISSUES
# ----------------------------------------------------------------
print("\n5Ô∏è‚É£ Converting date columns...")

# TODO: Convert Date_of_Birth, Admission_Date, Discharge_Date to proper datetime
# HINT: Use pd.to_datetime() with errors='coerce'

"""
Now let's convert all the three columns (Date_of_Birth, Admission_Date, Discharge_Date) to proper datetime
"""

# Let's first define a variable holding those three columns
date_cols = ["Date_of_Birth", "Admission_Date", "Discharge_Date"]

# Let's first fix inconsistent date separators before converting
for col in date_cols:
    df_clean[col] = df_clean[col].astype(str).str.replace("/", "-", regex=False)
    df_clean[col] = pd.to_datetime(df_clean[col], errors='coerce')

# Iterate through each column and convert to proper datetime
for col in date_cols:
    df_clean[col] = pd.to_datetime(df_clean[col], errors='coerce')

print("‚úì Dates converted")


5Ô∏è‚É£ Converting date columns...
‚úì Dates converted


In [12]:
# ----------------------------------------------------------------
# BUG CATEGORY 6: DATA TYPE CONVERSIONS
# ----------------------------------------------------------------
print("\n6Ô∏è‚É£ Fixing data types...")

# TODO: Convert Age, Heart_Rate to integers (handle strings first)
# HINT: First convert strings to numeric, then to int

"""
Converting Age and Heart_Rate to numeric first, then integers
"""

df_clean["Age"] = pd.to_numeric(df_clean["Age"], errors="coerce").astype("Int64")
df_clean["Heart_Rate"] = pd.to_numeric(df_clean["Heart_Rate"], errors="coerce").astype("Int64")

# TODO: Convert Bill_Amount to float (remove $ symbols first)
# HINT: Use .str.replace() to remove $ then convert to float

"""
Removing $ symbol and converting Bill_Amount to float
"""
df_clean["Bill_Amount"] = df_clean["Bill_Amount"].astype(str).str.replace("$", "", regex=False)
df_clean["Bill_Amount"] = pd.to_numeric(df_clean["Bill_Amount"], errors="coerce")


print("‚úì Data types corrected")


6Ô∏è‚É£ Fixing data types...
‚úì Data types corrected


In [13]:
# ----------------------------------------------------------------
# BUG CATEGORY 7: INVALID VALUES
# ----------------------------------------------------------------
print("\n7Ô∏è‚É£ Handling invalid values...")

# TODO: Fix negative ages, impossible ages (>120), negative bill amounts
# HINT: Use conditional replacement or set to NaN

# Fixing invalid Age
df_clean.loc[(df_clean["Age"] < 0) | (df_clean["Age"] > 120), "Age"] = np.nan

# Fixing invalid Bill_Amount
df_clean.loc[df_clean["Bill_Amount"] < 0, "Bill_Amount"] = np.nan


# TODO: Fix heart rates outside normal range (30-200)

# Fixing abnormal Heart_Rate
df_clean.loc[(df_clean["Heart_Rate"] < 30) | (df_clean["Heart_Rate"] > 200), "Heart_Rate"] = np.nan


# TODO: Fix impossible temperatures (>42¬∞C or <30¬∞C)

"""
Before fixing impossible temperatures, let's first standardize temperature units to degree celicious
Convert those recorded in F to degree celicious by substracting 32 from it and multiplying by (5/9)

"""
# Let's defeine a function that does that
def convert_temp(val):
    if isinstance(val, str):
        val = val.strip()
        if val.upper().endswith("F"):
            try:
                f_val = float(val[:-1])  # Strip the 'F'
                return round((f_val - 32) * 5 / 9, 1)
            except:
                return np.nan
        else:
            try:
                return round(float(val), 1)
            except:
                return np.nan
    else:
        try:
            return round(float(val), 1)
        except:
            return np.nan
    

# Let's apply our function to convert the temperature
df_clean["Temperature"] = df_clean["Temperature"].apply(convert_temp)

# Temperature values are now converted to a uniform unit. Let's get rid of unrealistic temperature values
df_clean.loc[(df_clean["Temperature"] < 30) | (df_clean["Temperature"] > 42), "Temperature"] = np.nan


print("‚úì Invalid values handled")


7Ô∏è‚É£ Handling invalid values...
‚úì Invalid values handled


In [14]:
# # Ensure column is string for parsing
# temp = df_clean["Temperature"].astype(str).str.strip()

# # Mask for Fahrenheit values
# f_mask = temp.str.upper().str.endswith("F")

# # Extract numeric part and convert
# df_clean.loc[f_mask, "Temperature"] = (
#     temp[f_mask].str[:-1].astype(float).sub(32).mul(5 / 9).round(1)
# )

# # Convert remaining (assumed degree celicious) to float
# df_clean.loc[~f_mask, "Temperature"] = pd.to_numeric(temp[~f_mask], errors="coerce").round(1)


In [15]:
# ----------------------------------------------------------------
# BUG CATEGORY 8: BLOOD TYPE ISSUES
# ----------------------------------------------------------------
print("\n8Ô∏è‚É£ Cleaning Blood Types...")

# TODO: Fix '0+' to 'O+', remove extra spaces, handle invalid blood types
# HINT: Use .replace() or .str.replace()

# Fixing common typos and spacing and replacing '0+' to 'O+'
df_clean["Blood_Type"] = (
    df_clean["Blood_Type"]
    .astype(str)
    .str.strip()
    .str.upper()
    .str.replace(" ", "", regex=False)     # Remove all internal spaces
    .str.replace("0+", "O+", regex=False)  # Replace '0+' with 'O+'
)

# To handle invalid blood types, let's first define the valid ones
valid_blood_types = {"A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-"}

# Let's now set the invalid tyoes to NaN
df_clean.loc[~df_clean["Blood_Type"].isin(valid_blood_types), "Blood_Type"] = np.nan

print("‚úì Blood types cleaned")


8Ô∏è‚É£ Cleaning Blood Types...
‚úì Blood types cleaned


In [16]:
# ----------------------------------------------------------------
# BUG CATEGORY 9: TEMPERATURE UNITS
# ----------------------------------------------------------------
print("\n9Ô∏è‚É£ Standardizing temperature units...")

# TODO: Convert all temperatures to Celsius (remove 'F', convert if needed)
# HINT: Check for 'F' in string, use formula: (F - 32) * 5/9

"""
I have already done this step just before setting unrealistic temperatures to NaN
Refer to BUG CATEGORY 7
Function, "convert_temp" was created to handle this
"""

print("‚úì Temperatures standardized")


9Ô∏è‚É£ Standardizing temperature units...
‚úì Temperatures standardized


In [17]:
# ----------------------------------------------------------------
# BUG CATEGORY 10: BLOOD PRESSURE FORMAT
# ----------------------------------------------------------------
print("\nüîü Standardizing blood pressure format...")

# TODO: Make all blood pressure values use '/' separator
# HINT: Replace ' / ' and '-' with '/'

# Standardizing separators and removing extra whitespace
df_clean["Blood_Pressure"] = (
    df_clean["Blood_Pressure"]
    .astype(str)
    .str.strip()
    .str.replace(r"\s*[-/]\s*", "/", regex=True)  # Normalize all to '/'
)

print("‚úì Blood pressure standardized")


üîü Standardizing blood pressure format...
‚úì Blood pressure standardized


In [18]:
# ----------------------------------------------------------------
# BUG CATEGORY 11: INSURANCE PROVIDER CONSISTENCY
# ----------------------------------------------------------------
print("\n1Ô∏è‚É£1Ô∏è‚É£ Standardizing insurance providers...")

# TODO: Make all provider names consistent (e.g., all 'BlueCross' or all 'Aetna')
# HINT: Use .str.lower() then .str.title() or create a mapping
"""
To perform this operation by creating mapping, we need to first know the different naming variations that exist in that column
"""
# Check different naming variations
print(sorted(df_clean['Insurance_Provider'].dropna().unique()))

# Now that we know, let's map using a dictionary

provider_map = {
    'Aetna': 'Aetna',
    'aetna': 'Aetna',
    'BlueCross': 'BlueCross',
    'BLUECROSS': 'BlueCross',
    'blue cross': 'BlueCross',
    'Cigna': 'Cigna',
    'CIGNA': 'Cigna',
    'UnitedHealth': 'UnitedHealth',
    'United Health': 'UnitedHealth',
    'Medicare': 'Medicare'
}

# Now let's remove any leading or trailing white space before applying the mappings
df_clean['Insurance_Provider'] = df_clean['Insurance_Provider'].astype(str).str.strip()

# Apply the mapping
df_clean['Insurance_Provider'] = df_clean['Insurance_Provider'].replace(provider_map)

# Let's now confirm that our operation was effective by printing different variations again
print(sorted(df_clean['Insurance_Provider'].dropna().unique()))

print("‚úì Insurance providers standardized")


1Ô∏è‚É£1Ô∏è‚É£ Standardizing insurance providers...
['Aetna', 'BLUECROSS', 'BlueCross', 'CIGNA', 'Cigna', 'Medicare', 'United Health', 'UnitedHealth', 'aetna', 'blue cross']
['Aetna', 'BlueCross', 'Cigna', 'Medicare', 'UnitedHealth', 'nan']
‚úì Insurance providers standardized


In [19]:
# ----------------------------------------------------------------
# BUG CATEGORY 12: DIAGNOSIS CLEANUP
# ----------------------------------------------------------------
print("\n1Ô∏è‚É£2Ô∏è‚É£ Cleaning diagnosis field...")

# TODO: Strip whitespace, fix typos (e.g., 'Diabetis' -> 'Diabetes')
# HINT: Use .str.strip() and .replace()

# Just as in BUG CATEGORY 11, let's display different variations
print(sorted(df_clean['Diagnosis'].dropna().unique()))

# Create mapping now
diagnosis_map = {
    "ASTHMA": "Asthma",
    "Anxiety": "Anxiety",
    "Bronchitis": "Bronchitis",
    "COVID-19": "COVID-19",
    "Diabetis": "Diabetes",
    "diabetes": "Diabetes",
    "Flu": "Flu",
    "Fracture": "Fracture",
    "Hypertension": "Hypertension",
    "Migraine": "Migraine",
    "Pneumonia": "Pneumonia"
}

# Remove any trailing or leading white space and apply the mapping

df_clean["Diagnosis"] = (
    df_clean["Diagnosis"]
    .astype(str)
    .str.strip()
    .replace(diagnosis_map)
)

# Check variation again to confirm
print(sorted(df_clean['Diagnosis'].dropna().unique()))


print("‚úì Diagnosis cleaned")


1Ô∏è‚É£2Ô∏è‚É£ Cleaning diagnosis field...
['ASTHMA', 'Anxiety', 'Bronchitis', 'COVID-19', 'Diabetis', 'Flu', 'Fracture', 'Hypertension', 'Migraine', 'Pneumonia', 'diabetes']
['Anxiety', 'Asthma', 'Bronchitis', 'COVID-19', 'Diabetes', 'Flu', 'Fracture', 'Hypertension', 'Migraine', 'Pneumonia']
‚úì Diagnosis cleaned


In [20]:
# ----------------------------------------------------------------
# BUG CATEGORY 13: LOGICAL INCONSISTENCIES
# ----------------------------------------------------------------
print("\n1Ô∏è‚É£3Ô∏è‚É£ Checking logical consistency...")

# TODO: Ensure Discharge_Date is after Admission_Date
# HINT: Compare the two date columns

# Let's set Discharge_Date to NaT if it's before Admission_Date
mask = df_clean["Discharge_Date"] < df_clean["Admission_Date"]
df_clean.loc[mask, "Discharge_Date"] = pd.NaT


# TODO: Calculate proper age from Date_of_Birth

# from datetime import datetime

# Now let's define the current date
today = pd.Timestamp("today")

# Recalculate age; take 1 year equals to 365 and 1/4 days; also = 365.25 days
# Define Age calculator
age_calc = (today - df_clean["Date_of_Birth"]) / pd.Timedelta(days=365.25)

# Convert valid values to integers
age_cleaned = pd.to_numeric(age_calc, errors="coerce").dropna().astype(int)

# Assign only for those valid rows
df_clean.loc[age_cleaned.index, "Age"] = age_cleaned


print("‚úì Logical consistency checked")


1Ô∏è‚É£3Ô∏è‚É£ Checking logical consistency...
‚úì Logical consistency checked


In [21]:
df_clean.head(15)

Unnamed: 0,Patient_ID,First_Name,Last_Name,Date_of_Birth,Age,Gender,Blood_Type,Admission_Date,Discharge_Date,Diagnosis,Temperature,Blood_Pressure,Heart_Rate,Insurance_Provider,Bill_Amount
0,1001.0,John,Smith,NaT,45,UNKNOWN,A+,2024-10-15,NaT,Flu,36.3,151/88,78,,566.0
1,1002.0,Mary,Johnson,1954-01-24,71,MALE,A+,2024-10-25,2024-11-03,Pneumonia,38.4,133/71,68,Cigna,5876.33
2,1003.0,James,Williams,1975-04-08,50,MALE,O-,2025-07-09,2025-07-18,COVID-19,36.5,96/76,78,BlueCross,7015.1
3,1004.0,Sarah,Brown,1957-12-04,67,MALE,B+,2024-11-25,2024-12-05,COVID-19,37.8,151/67,88,BlueCross,7574.35
4,1005.0,Michael,Jones,2009-02-19,16,MALE,AB+,2024-12-19,2024-12-25,Migraine,38.0,98/85,91,UnitedHealth,965.92
5,1006.0,Robert,Garcia,1994-01-01,31,MALE,B+,2025-05-05,2025-05-18,Migraine,37.8,152/64,67,BlueCross,6260.86
6,1007.0,William,Miller,1951-04-08,74,MALE,A+,2025-03-13,2025-03-21,Anxiety,37.6,163/100,61,BlueCross,5218.91
7,1008.0,Olivia,Davis,NaT,85,MALE,O+,2024-12-30,2025-01-09,Bronchitis,39.2,177/63,100,Cigna,9089.48
8,1009.0,David,,2004-10-01,21,FEMALE,O-,2025-10-06,NaT,Fracture,37.9,109/69,98,Aetna,3218.84
9,1010.0,Sophia,Martinez,2011-04-23,14,MALE,A+,NaT,NaT,Bronchitis,37.0,162/120,75,UnitedHealth,8089.17


In [22]:
# ================================================================
# STEP 4: FINAL VALIDATION
# ================================================================

print("\n" + "="*70)
print("üèÅ CLEANING COMPLETE - FINAL VALIDATION")
print("="*70)

# Check missing values after cleaning
print("\n‚ùì Missing Values After Cleaning:")
print(df_clean.isnull().sum())

# Check data types
print("\nüìä Data Types After Cleaning:")
print(df_clean.dtypes)

# Display sample of cleaned data
print("\n‚ú® Sample of Cleaned Data:")
print(df_clean.head(10))


üèÅ CLEANING COMPLETE - FINAL VALIDATION

‚ùì Missing Values After Cleaning:
Patient_ID             0
First_Name             2
Last_Name              2
Date_of_Birth         27
Age                    0
Gender                 3
Blood_Type             5
Admission_Date        12
Discharge_Date        28
Diagnosis              0
Temperature           13
Blood_Pressure         0
Heart_Rate             9
Insurance_Provider     0
Bill_Amount            7
dtype: int64

üìä Data Types After Cleaning:
Patient_ID                   float64
First_Name                    object
Last_Name                     object
Date_of_Birth         datetime64[ns]
Age                            Int64
Gender                        object
Blood_Type                    object
Admission_Date        datetime64[ns]
Discharge_Date        datetime64[ns]
Diagnosis                     object
Temperature                  float64
Blood_Pressure                object
Heart_Rate                     Int64
Insurance_Provider 

In [23]:
# ================================================================
# STEP 5: BASIC ANALYSIS (Prove your cleaning worked!)
# ================================================================

print("\n" + "="*70)
print("üìä BASIC ANALYSIS ON CLEANED DATA")
print("="*70)

# TODO: Calculate and display:
# 1. Average age of patients
# Average age of patients is calculated using the mean() function
print("\n Average age of patients:")
print(df_clean["Age"].mean().round(1))

# 2. Most common diagnosis
# For Most common diagnosis, we look for the diagnosis that appears most, and for that we use mode() function
print("\nMost common diagnosis:")
print(df_clean["Diagnosis"].mode().iloc[0])

# 3. Average bill amount
print("\nAverage bill amount:")
print(df_clean["Bill_Amount"].mean().round(2))

# 4. Gender distribution
print("\nGender distribution:")
print(df_clean["Gender"].value_counts())

# 5. Most common blood type
print("\nMost common blood type:")
print(df_clean["Blood_Type"].mode().iloc[0])


üìä BASIC ANALYSIS ON CLEANED DATA

 Average age of patients:
46.8

Most common diagnosis:
Diabetes

Average bill amount:
5336.49

Gender distribution:
Gender
MALE       52
FEMALE     41
UNKNOWN     4
Name: count, dtype: int64

Most common blood type:
A+


In [24]:
# ================================================================
# STEP 6: SAVE YOUR CLEANED DATA
# ================================================================

# TODO: Save the cleaned dataset
df_clean.to_csv('cleaned_patient_records_WALTER_ODUR_RegNO_2025_HD07_26017U.csv', index=False)
print("\n‚úÖ Cleaned data saved to 'cleaned_patient_records_WALTER_ODUR_RegNO_2025_HD07_26017U.csv'")

print("\n" + "="*70)
print("üéâ CHALLENGE COMPLETE!")
print("="*70)
print("\nSubmit your notebook to instructor for scoring.")
print("Remember to document what bugs you found and how you fixed them!")


‚úÖ Cleaned data saved to 'cleaned_patient_records_WALTER_ODUR_RegNO_2025_HD07_26017U.csv'

üéâ CHALLENGE COMPLETE!

Submit your notebook to instructor for scoring.
Remember to document what bugs you found and how you fixed them!
