# Session 30: Data Cleaning Part 2 (Handling Missing Data)

**Unit 3: Data Collection and Cleaning**
**Hour: 30**
**Mode: Practical Lab**

---

### 1. Objective

Building on the last session, our goal is to handle the 11 missing values we identified in the `TotalCharges` column. We will learn how to fill these `NaN` values using a common strategy: **imputation**.

### 2. Setup

Let's get our DataFrame to the state it was in at the end of the last lab.

In [None]:
import pandas as pd

url = 'https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/data/Telco-Customer-Churn.csv'
df = pd.read_csv(url)

# Convert TotalCharges to numeric; entries that cannot be parsed become NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
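To see what `errors='coerce'` does in isolation, here is a minimal sketch on a toy Series. The values are made up for illustration, and it assumes the unparseable entries are blank strings; check the raw column to confirm what they look like in your copy of the data.

```python
import pandas as pd

# Toy data (hypothetical values): the middle entry is a blank string.
raw = pd.Series(['29.85', ' ', '1889.5'])

# errors='coerce' replaces anything that cannot be parsed with NaN
# instead of raising an error.
numeric = pd.to_numeric(raw, errors='coerce')

print(numeric)
print(f"Missing after coercion: {numeric.isnull().sum()}")
```

The result has a floating-point dtype, which is what lets us compute a median on it later.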

### 3. Diagnosing the Missing Data

Let's confirm the number of missing values.

In [None]:
print(f"Number of missing values in TotalCharges: {df['TotalCharges'].isnull().sum()}")

### 4. Imputation Strategy

Since only 11 of the 7,043 rows have a missing value, deleting those rows would be a reasonable option. However, for the sake of learning, we will use imputation.
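For comparison, the deletion option could look like the sketch below. It uses a small toy DataFrame with made-up values rather than the lab's `df`, so it does not alter the data we are about to impute.

```python
import pandas as pd
import numpy as np

# Toy data (hypothetical values) with one missing TotalCharges entry.
toy = pd.DataFrame({
    'customerID': ['A', 'B', 'C', 'D'],
    'TotalCharges': [29.85, np.nan, 1889.50, 108.15],
})

# Drop only the rows where TotalCharges is missing; other columns are ignored.
dropped = toy.dropna(subset=['TotalCharges'])

print(f"Rows before: {len(toy)}, rows after: {len(dropped)}")
```

`dropna()` returns a new DataFrame, so `toy` itself is left unchanged.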

**Strategy:** For numerical data, a common default is to fill the missing values with the **median** of the column. We use the median instead of the mean because the median is robust to outliers: a few extremely high or low values pull the mean toward them but barely move the median.
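A quick way to see this robustness is to compare the two statistics with and without an extreme value. The numbers below are made up for illustration.

```python
import pandas as pd

# Two toy Series (hypothetical values) that differ only in the last entry.
charges = pd.Series([20.0, 25.0, 30.0, 35.0, 40.0])
with_outlier = pd.Series([20.0, 25.0, 30.0, 35.0, 8000.0])

print(f"No outlier:   mean={charges.mean()}, median={charges.median()}")
print(f"With outlier: mean={with_outlier.mean()}, median={with_outlier.median()}")
# No outlier:   mean=30.0, median=30.0
# With outlier: mean=1622.0, median=30.0
```

One extreme value moves the mean from 30 to 1622, while the median stays at 30.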

Let's calculate the median of the `TotalCharges` column.

In [None]:
median_total_charges = df['TotalCharges'].median()
print(f"The median total charges is: {median_total_charges:.2f}")

### 5. Applying the Imputation

Pandas provides the `.fillna()` method to fill missing values. We will pass the median we just calculated to this method.

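`.fillna()` returns a new Series, so we assign the result back to the column. Avoid `df['TotalCharges'].fillna(..., inplace=True)`: calling an in-place method on a selected column is chained assignment, which raises a `FutureWarning` in recent pandas versions and leaves `df` unchanged once Copy-on-Write is the default (pandas 3.0).

In [None]:
df['TotalCharges'] = df['TotalCharges'].fillna(median_total_charges)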

### 6. Verifying the Fix

Now, let's check for missing values again. The count should be 0.

In [None]:
print(f"Number of missing values in TotalCharges after imputation: {df['TotalCharges'].isnull().sum()}")

Let's also check the `.info()` summary. The `TotalCharges` column should now have 7043 non-null entries, just like all the other columns.

In [None]:
df.info()

**Success!** The data type is correct, and there are no more missing values in that column.

### 7. Conclusion

In this session, you learned the standard workflow for handling missing numerical data:
1.  Use `.isnull().sum()` to count the number of missing values.
2.  Choose an imputation strategy (e.g., filling with the median for numerical data).
3.  Calculate the value to be imputed (e.g., using `.median()`).
4.  Use the `.fillna()` method to replace the `NaN` values.
5.  Verify that the fix was successful by checking for missing values again.
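The steps above can be sketched as one small function. `impute_median` is a hypothetical helper written for this recap, not a pandas function, and the toy data is made up for illustration.

```python
import pandas as pd
import numpy as np

def impute_median(frame, column):
    """Hypothetical helper: fill NaNs in `column` with that column's median."""
    # Step 1: count the missing values.
    n_missing = int(frame[column].isnull().sum())
    # Steps 2-3: choose the median and calculate it (NaNs are skipped).
    fill_value = frame[column].median()
    # Step 4: fill on a copy so the caller's DataFrame is not modified.
    result = frame.copy()
    result[column] = result[column].fillna(fill_value)
    # Step 5: verify that nothing is missing any more.
    assert result[column].isnull().sum() == 0
    return result, n_missing, fill_value

toy = pd.DataFrame({'TotalCharges': [10.0, np.nan, 30.0, 50.0]})
cleaned, n_missing, fill_value = impute_median(toy, 'TotalCharges')
print(f"Filled {n_missing} missing value(s) with {fill_value}")
```

Returning the count and the fill value alongside the cleaned DataFrame makes it easy to record what was changed.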

Our dataset is now one step closer to being ready for modeling.

**Next Session:** We will continue cleaning by finding and handling duplicate data.