Objectives

This notebook loads the raw healthcare insurance dataset, removes duplicate records, standardises the text columns, and saves a clean, analysis-ready dataset. Doing this first establishes data quality and consistency before any exploratory or analytical work begins.

Inputs

Raw dataset: dataset/raw/insurance.csv
Python libraries: pandas

Outputs

Cleaned dataset saved as dataset/cleaned/insurance_cleaned.csv

Summary checks confirming:
No missing values
No duplicate records
Correct data types for the categorical variables (sex, smoker, region)
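The three summary checks listed above could be expressed as a small helper along the following lines. This is only a sketch: the function name `run_summary_checks` and the three sample rows are made up for illustration and are not part of the notebook; only the column names come from the dataset.

```python
import pandas as pd

CATEGORICAL_COLS = ["sex", "smoker", "region"]


def run_summary_checks(df, categorical_cols=CATEGORICAL_COLS):
    """Hypothetical helper: report the three summary checks as booleans."""
    return {
        # True when no cell in the frame is NaN/None
        "no_missing_values": not df.isnull().any().any(),
        # True when no row is an exact copy of an earlier row
        "no_duplicate_records": not df.duplicated().any(),
        # True when every listed column uses the pandas category dtype
        "categorical_dtypes_ok": all(
            isinstance(df[col].dtype, pd.CategoricalDtype)
            for col in categorical_cols
        ),
    }


# Made-up rows that only mimic the shape of the insurance data
sample = pd.DataFrame(
    {
        "age": [30, 41, 25],
        "sex": ["female", "male", "male"],
        "bmi": [24.5, 31.2, 28.0],
        "children": [0, 2, 1],
        "smoker": ["no", "yes", "no"],
        "region": ["northeast", "southwest", "southeast"],
        "charges": [4000.0, 21000.0, 3500.0],
    }
)
for col in CATEGORICAL_COLS:
    sample[col] = sample[col].astype("category")

results = run_summary_checks(sample)
print(results)
```

Returning a dictionary rather than raising on the first failure makes it easy to print all three results together at the end of the notebook.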


In [3]:
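import pandas as pd

# load the raw dataset and preview the first rows
df = pd.read_csv("../dataset/raw/insurance.csv")
print(df.head())

# a cell only displays its last expression, so print each check explicitly
print(df.isnull().sum())                         # missing values per column
print("Duplicate rows:", df.duplicated().sum())  # fully duplicated rows

# store the categorical variables with the category dtype
for col in ["sex", "smoker", "region"]:
    df[col] = df[col].astype("category")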

In [4]:
df_clean = df.copy()

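# standardise text columns first, so rows that differ only in case or
# whitespace are recognised as duplicates; .str keeps missing values as NaN,
# whereas astype(str) would turn them into the string "nan" and hide them
# from isnull(); cast back to category to keep the intended dtype
for col in ["sex", "smoker", "region"]:
    df_clean[col] = df_clean[col].str.strip().str.lower().astype("category")

# remove duplicates
df_clean = df_clean.drop_duplicates()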

print("Raw shape:", df.shape)
print("Clean shape:", df_clean.shape)
print(df_clean.isnull().sum())

# save cleaned dataset
df_clean.to_csv("../dataset/cleaned/insurance_cleaned.csv", index=False)
print("Saved cleaned file")

Raw shape: (1338, 7)
Clean shape: (1337, 7)
age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
Saved cleaned file
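
One caveat about the saved file: CSV stores no dtype information, so the category dtype has to be requested again when the cleaned file is read back. The sketch below shows this with a tiny made-up frame and an in-memory buffer standing in for insurance_cleaned.csv; the `demo` rows are hypothetical and are not taken from the dataset.

```python
import io

import pandas as pd

CATEGORICAL_COLS = ["sex", "smoker", "region"]

# Made-up frame standing in for df_clean
demo = pd.DataFrame(
    {
        "sex": pd.Categorical(["female", "male"]),
        "smoker": pd.Categorical(["yes", "no"]),
        "region": pd.Categorical(["southwest", "southeast"]),
        "charges": [100.0, 200.0],
    }
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# Plain read: the category dtype is not restored
buffer.seek(0)
plain = pd.read_csv(buffer)

# Read with an explicit dtype mapping: the columns come back as category
buffer.seek(0)
typed = pd.read_csv(buffer, dtype={col: "category" for col in CATEGORICAL_COLS})

print(plain.dtypes)
print(typed.dtypes)
```

In a downstream notebook the same `dtype` mapping could be passed when reading "../dataset/cleaned/insurance_cleaned.csv".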
