## Project Milestone 2

### Cleaning/Formatting Flat File Source

### Medical Cost Personal Dataset: Data Cleansing & Transformation

##### This notebook walks through a data wrangling process on the "Medical Cost Personal Dataset" (Kaggle). It applies and explains six distinct data cleansing and transformation steps, ends with a human-readable, cleaned dataset, and closes with a discussion of ethical considerations.

---

### 1. Load and Inspect the Data


In [11]:
import pandas as pd

# Load the dataset
df = pd.read_csv('insurance_medical.csv')

# Preview the first few rows
df.head()


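| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |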


### Data Overview

Checking data types, missing values, and basic stats to understand what transformations are needed.


In [16]:
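# Column dtypes and non-null counts (info() prints its report directly)
df.info()
# Summary statistics; not rendered below because a cell displays only its last expression
df.describe(include='all')
# Missing values per column
df.isnull().sum()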


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
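
The checks above can also be collected into a single table so that dtype, missing count, and cardinality sit side by side. The following is a minimal sketch under stated assumptions: `summarize` is a hypothetical helper, and `sample` is made-up data, not rows from the insurance file.

```python
import pandas as pd

def summarize(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, missing count, and distinct count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isnull().sum(),
        "n_unique": frame.nunique(),  # nunique() ignores missing values by default
    })

# Hypothetical data with one missing age
sample = pd.DataFrame({
    "age": [19, 18, None],
    "sex": ["female", "male", "male"],
})
summary = summarize(sample)
print(summary)
```

A column with many distinct values relative to its row count is likely continuous, while a low count suggests a categorical field worth checking for inconsistent spellings.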

### 2. Data Transformation & Cleansing Steps

The **six transformation steps** below are each clearly labeled and described.


### Step 1: Standardize Column Headers

Ensure all column names are lowercase, trimmed, and use underscores for consistency.


In [24]:
df.columns = [col.strip().lower().replace(' ', '_') for col in df.columns]
print(df.columns)


Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')
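
The insurance file already has clean headers, so the step leaves them unchanged. To show what the same normalization does to messier input, here is a small sketch. The function name `standardize_columns` and the messy headers are hypothetical and chosen only for illustration.

```python
import pandas as pd

def standardize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with trimmed, lowercase, underscore-separated column names."""
    out = frame.copy()
    out.columns = (
        out.columns.str.strip()
        .str.lower()
        .str.replace(r"\s+", "_", regex=True)  # collapse any run of whitespace
    )
    return out

# Hypothetical messy headers, not taken from the insurance file
messy = pd.DataFrame({" Age ": [19], "BMI": [27.9], "Smoker Status": ["yes"]})
tidy = standardize_columns(messy)
print(list(tidy.columns))
```

Returning a copy keeps the original frame intact, which makes it easier to compare headers before and after.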


### Step 2: Consistent Casing & Values in Categorical Columns

- Standardize `sex`, `smoker`, and `region` columns to lowercase and remove extra whitespace.
- Fix any inconsistent or misspelled values (if found).


In [29]:
for col in ['sex', 'smoker', 'region']:
    df[col] = df[col].str.strip().str.lower()
df['sex'].value_counts(), df['smoker'].value_counts(), df['region'].value_counts()



(sex
 male      676
 female    662
 Name: count, dtype: int64,
 smoker
 no     1064
 yes     274
 Name: count, dtype: int64,
 region
 southeast    364
 southwest    325
 northwest    325
 northeast    324
 Name: count, dtype: int64)
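
The value counts show only the expected categories, so no corrections were needed here. If a misspelling did appear, one way to detect and repair it is sketched below. The allowed sets are taken from the counts above; the helper `unexpected_values`, the sample rows, and the correction map are hypothetical.

```python
import pandas as pd

# Allowed values as they appear in the value counts above
ALLOWED = {
    "sex": {"male", "female"},
    "smoker": {"yes", "no"},
    "region": {"southeast", "southwest", "northwest", "northeast"},
}

def unexpected_values(frame: pd.DataFrame, allowed: dict = ALLOWED) -> dict:
    """Map each checked column to the set of values outside its allowed set."""
    report = {}
    for col, valid in allowed.items():
        found = set(frame[col].dropna().unique()) - valid
        if found:
            report[col] = found
    return report

# Hypothetical sample containing one misspelling
sample = pd.DataFrame({
    "sex": ["male", "female", "femal"],
    "smoker": ["yes", "no", "no"],
    "region": ["southeast", "northwest", "northeast"],
})
problems = unexpected_values(sample)
print(problems)

# Hypothetical correction map, applied only after reviewing the report
sample["sex"] = sample["sex"].replace({"femal": "female"})
remaining = unexpected_values(sample)
```

Reviewing the report before writing the correction map avoids silently merging values that are genuinely different categories.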

### Step 3: Remove Duplicate Rows

Check for and remove any duplicate rows to ensure data quality.


In [32]:
print("Duplicates before removal:", df.duplicated().sum())
df = df.drop_duplicates()
print("Duplicates after removal:", df.duplicated().sum())


Duplicates before removal: 1
Duplicates after removal: 0
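
Before dropping duplicates it can help to look at the rows involved, since two people can legitimately share the same values in a dataset with no identifiers. A minimal sketch, using made-up rows rather than the actual duplicate from the insurance file:

```python
import pandas as pd

# Hypothetical rows; the second and fourth are identical
sample = pd.DataFrame({
    "age": [19, 30, 45, 30],
    "sex": ["female", "male", "male", "male"],
    "charges": [16884.92, 1639.56, 8000.00, 1639.56],
})

# keep=False marks every member of a duplicate group, not only the later copies
dupes = sample[sample.duplicated(keep=False)]
print(dupes)

# drop_duplicates() keeps the first occurrence of each group by default
deduped = sample.drop_duplicates().reset_index(drop=True)
print(len(sample), "->", len(deduped))
```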


### Step 4: Identify & Handle Outliers

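Identify outliers in `bmi` and `charges` using the IQR (interquartile range) method: a value counts as an outlier if it falls below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.  
For this dataset, the outlying rows are counted and then removed rather than flagged. The `charges` bounds are computed after the `bmi` outliers have been dropped.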


In [37]:
def remove_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    # Print how many are outliers
    print(f"Outliers in '{column}':", ((data[column] < lower) | (data[column] > upper)).sum())
    # Remove outliers
    return data[(data[column] >= lower) & (data[column] <= upper)]

# Remove outliers for 'bmi' and 'charges'
df = remove_outliers_iqr(df, 'bmi')
df = remove_outliers_iqr(df, 'charges')


Outliers in 'bmi': 9
Outliers in 'charges': 138
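
Removing rows is not the only option. An alternative that keeps every record is to add a boolean flag column and decide later whether to filter on it. The sketch below uses the same 1.5 × IQR rule; `iqr_bounds`, `flag_outliers`, and the sample values are hypothetical and are not part of the notebook's pipeline.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple:
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_outliers(frame: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Return a copy with a boolean '<column>_outlier' column; no rows are dropped."""
    lower, upper = iqr_bounds(frame[column], k)
    out = frame.copy()
    # between() is inclusive on both ends, matching the >= / <= test used above
    out[f"{column}_outlier"] = ~out[column].between(lower, upper)
    return out

# Hypothetical charges with one very large value
sample = pd.DataFrame({"charges": [1000.0, 2000.0, 3000.0, 4000.0, 5000.0, 60000.0]})
lower, upper = iqr_bounds(sample["charges"])
flagged = flag_outliers(sample, "charges")
print(lower, upper)
print(flagged)
```

Flagging keeps the transformation reversible: the full dataset stays available, and any analysis can state explicitly whether it includes the flagged rows.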


### Step 5: Fix Data Types

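- Cast `children` to an integer type. It was already read as `int64` (see the `info()` output in Section 1), so the cast is a safeguard; in this environment `astype(int)` produces `int32`.
- Confirm all columns have appropriate data types.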


In [40]:
df['children'] = df['children'].astype(int)
df.dtypes


age           int64
sex          object
bmi         float64
children      int32
smoker       object
region       object
charges     float64
dtype: object
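
The width of the integer produced by `astype(int)` can depend on the platform and NumPy version. Naming the dtype explicitly avoids that. The sketch below also shows the optional `category` dtype for low-cardinality text columns; the sample frame is hypothetical, and neither cast is part of the notebook's pipeline.

```python
import pandas as pd

# Hypothetical frame where 'children' arrived as float
sample = pd.DataFrame({
    "children": [0.0, 1.0, 3.0],
    "smoker": ["yes", "no", "no"],
})

# An explicit width gives the same dtype on every platform
sample["children"] = sample["children"].astype("int64")

# 'category' stores each distinct label once, which suits yes/no style columns
sample["smoker"] = sample["smoker"].astype("category")

print(sample.dtypes)
```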

### Step 6: Feature Engineering – Add BMI Category

Create a new column that categorizes BMI as 'underweight' (below 18.5), 'normal' (18.5 to under 25), 'overweight' (25 to under 30), or 'obese' (30 or above), based on CDC guidelines for adults.


In [45]:
def bmi_category(bmi):
    """Map a BMI value to its weight category (CDC adult cut-offs)."""
    if bmi < 18.5:
        return 'underweight'
    elif bmi < 25:
        return 'normal'
    elif bmi < 30:
        return 'overweight'
    else:
        return 'obese'

# Apply the mapping row by row and count each category
df['bmi_category'] = df['bmi'].apply(bmi_category)
df['bmi_category'].value_counts()


bmi_category
obese          564
overweight     382
normal         224
underweight     20
Name: count, dtype: int64
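
The same binning can be expressed with `pd.cut`, which is vectorized and leaves missing BMI values as missing. With the `if`/`else` function, a `NaN` would fall through every comparison and be labeled 'obese'; that does not matter here because `bmi` has no nulls, but it is worth knowing. This is an alternative sketch, not the notebook's method, and the sample values are made up to sit on the boundaries.

```python
import numpy as np
import pandas as pd

# Left-closed bins: [-inf, 18.5), [18.5, 25), [25, 30), [30, inf)
bins = [-np.inf, 18.5, 25, 30, np.inf]
labels = ["underweight", "normal", "overweight", "obese"]

# Hypothetical BMI values chosen to sit on the boundaries, plus one missing value
sample = pd.Series([17.0, 18.5, 24.99, 25.0, 30.0, np.nan], name="bmi")

# right=False makes each bin include its lower edge, matching the cut-offs above
categories = pd.cut(sample, bins=bins, labels=labels, right=False)
print(categories)
```

The result is an ordered categorical, so sorting and grouping follow the underweight-to-obese order rather than alphabetical order.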

### 3. Final Cleaned Dataset

Below is the final, human-readable, cleaned dataset after all transformations.


In [48]:
df.reset_index(drop=True, inplace=True)
df.head(20)  # Show first 20 rows for readability


| | age | sex | bmi | children | smoker | region | charges | bmi_category |
|---|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 | overweight |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 | obese |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 | obese |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 | normal |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 | overweight |
| 5 | 31 | female | 25.74 | 0 | no | southeast | 3756.6216 | overweight |
| 6 | 46 | female | 33.44 | 1 | no | southeast | 8240.5896 | obese |
| 7 | 37 | female | 27.74 | 3 | no | northwest | 7281.5056 | overweight |
| 8 | 37 | male | 29.83 | 2 | no | northeast | 6406.4107 | overweight |
| 9 | 60 | female | 25.84 | 0 | no | northwest | 28923.13692 | overweight |


### 4. Ethical Implications of Data Wrangling

Below is a short discussion of the ethical considerations, legal/regulatory risks, and steps taken for this project.

---

#### **Ethical Implications**

In this project, the data wrangling steps standardized column headers and categorical values, removed duplicate rows, removed outliers, fixed data types, and added a derived BMI category for clarity. No personal identifiers were present. Even so, transformations can discard valuable information: removing the one duplicate row and the 147 outlier rows (9 in `bmi`, 138 in `charges`) may bias analysis against underrepresented populations, for example patients with unusually high charges. The outlier removal rested on the assumption that extreme values represent errors or rare cases that could unduly influence predictive models.

Regulations such as HIPAA would apply if the data contained sensitive health information. This dataset is anonymized and public, but care was still taken to retain the meaning of the original data fields. The dataset was sourced from Kaggle, a reputable public data-sharing platform, but its origin should be validated further before it is used for production analytics. Ethical data handling also means documenting all transformations transparently, so others can reproduce or challenge the choices made. To mitigate risks, transformations should be justified, minimal, and reversible, and every step should be documented clearly, as this notebook aims to do.
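
One lightweight way to document transformations is to record the row count before and after each step, so that any data loss is visible in a single table. The sketch below is an illustration under stated assumptions: `apply_step`, the `log` list, and the sample frame are hypothetical and are not part of the notebook above.

```python
import pandas as pd

def apply_step(frame: pd.DataFrame, name: str, func, log: list) -> pd.DataFrame:
    """Run one transformation and append its before/after row counts to the log."""
    before = len(frame)
    result = func(frame)
    log.append({
        "step": name,
        "rows_before": before,
        "rows_after": len(result),
        "rows_removed": before - len(result),
    })
    return result

# Hypothetical data with one exact duplicate
sample = pd.DataFrame({
    "bmi": [22.0, 22.0, 31.5],
    "charges": [1000.0, 1000.0, 2500.0],
})

log = []
cleaned = apply_step(sample, "drop_duplicates", lambda d: d.drop_duplicates(), log)
cleaned = apply_step(cleaned, "keep_bmi_under_30", lambda d: d[d["bmi"] < 30], log)

audit = pd.DataFrame(log)
print(audit)
```

Keeping the raw file untouched and storing a log like this alongside the cleaned output lets a reviewer see exactly where rows were lost and question any step.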

---
