# Session 29: Data Cleaning Part 1 (Fixing Data Types)

**Unit 3: Data Collection and Cleaning**
**Hour: 29**
**Mode: Practical Lab**

---

### 1. Objective

This is the first hands-on lab dedicated to cleaning our Telco Churn dataset. Our primary goal is to address the most critical issue we identified during exploration: the `TotalCharges` column has the wrong data type.

We will learn how to diagnose the problem, handle errors during conversion, and successfully change the column to a numeric type.

### 2. Setup

Import Pandas and load the Telco dataset.

In [None]:
import pandas as pd

url = 'https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/data/Telco-Customer-Churn.csv'
df = pd.read_csv(url)

### 3. Diagnosing the Problem

First, let's confirm the problem using `.info()`.

In [None]:
df.info()

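The output shows that `TotalCharges` has the dtype `object` (Pandas stored it as text) rather than `float64` (numeric). Until the column is converted, numeric operations such as sums, averages, and comparisons will either fail or give misleading results.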
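To see why text-typed numbers are a problem, here is a minimal sketch on made-up values (not taken from the Telco file). It assumes an `object` column holding numeric-looking strings, and the variable names are illustrative only.

```python
import pandas as pd

# Toy data (not from the Telco file): numbers stored as text in an object column.
charges_text = pd.Series(['9.5', '10.0', '29.85'], dtype=object)

# "Adding" strings concatenates them instead of summing the amounts.
text_sum = charges_text.sum()

# Strings compare character by character, so '9.5' ranks above '29.85'.
text_max = charges_text.max()

# After conversion, the same operations behave numerically.
charges_num = pd.to_numeric(charges_text)
num_sum = charges_num.sum()
num_max = charges_num.max()

print(repr(text_sum), repr(text_max))
print(num_sum, num_max)
```

Neither text operation raises an error here, which is why a wrong data type can go unnoticed until the results look odd.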

Let's try to convert it directly using `pd.to_numeric` and see what happens.

In [None]:
# This conversion is expected to fail; we catch the error so we can read its message
try:
    pd.to_numeric(df['TotalCharges'])
except ValueError as e:
    print(f"Error: {e}")

The error message (of the form `Unable to parse string " "`) tells us exactly what the problem is: some of the values in the `TotalCharges` column are not numbers but strings containing a single space, `' '`.

How many of these are there?

### 4. Finding the Problematic Rows

We can filter the DataFrame to find the rows where `TotalCharges` is just a space.

In [None]:
problematic_rows = df[df['TotalCharges'] == ' ']
problematic_rows

**Insight:** We've found 11 rows where this occurs. Looking at the `tenure` column, we see that all of these customers have a tenure of 0. This makes sense: they are brand new customers who haven't been charged anything yet, so the source system presumably left the value blank.

This is essentially a form of missing data.
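The tenure observation can be checked in code rather than by eye. The sketch below uses a small made-up DataFrame (only the column names come from the lab) and strips whitespace first, on the assumption that blanks might also appear as `''` or as several spaces.

```python
import pandas as pd

# Toy stand-in for the Telco data; only the column names come from the lab.
toy = pd.DataFrame({
    'tenure': [0, 12, 0, 34],
    'TotalCharges': [' ', '350.4', ' ', '2010.55'],
})

# Strip whitespace first so '', ' ' and '   ' are all treated as blank.
is_blank = toy['TotalCharges'].str.strip() == ''

n_blank = int(is_blank.sum())

# True only if every blank row belongs to a customer with tenure 0.
all_new_customers = bool((toy.loc[is_blank, 'tenure'] == 0).all())

print(n_blank, all_new_customers)
```

Running the same two checks on `df` would confirm both the count of blank rows and whether they are all tenure-0 customers.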

### 5. Fixing the Problem

The `pd.to_numeric` function has a useful parameter called `errors`. By setting `errors='coerce'`, we tell Pandas: "Try to convert the column to a number. If you encounter any value that you can't convert (like our `' '` strings), replace it with `NaN` (Not a Number), which is Pandas' standard marker for missing values."

In [None]:
# Convert to numeric; values that cannot be parsed (the ' ' entries) become NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
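One caveat: `errors='coerce'` turns every unparseable value into `NaN`, not just the blanks we already know about. The sketch below uses made-up values to show this, and suggests one possible way to list exactly which entries were coerced. The variable names are illustrative.

```python
import pandas as pd

# Toy values, chosen to show what 'coerce' does with different kinds of bad input.
raw = pd.Series(['29.85', ' ', 'N/A', '1,200'], dtype=object)

converted = pd.to_numeric(raw, errors='coerce')

# Entries that were present as text but became NaN during conversion.
newly_missing = raw[converted.isnull() & raw.notnull()]

print(converted.dtype)
print(newly_missing.tolist())
```

Note that `'1,200'` is lost along with the blank. That is why it is worth inspecting the problematic rows first, as we did in Section 4, before coercing.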

Now, let's check the `.info()` again to see if our fix worked.

In [None]:
df.info()

**Success!**
1.  The `TotalCharges` column is now a `float64` type.
2.  The number of non-null entries for `TotalCharges` is now `7032` (7043 - 11), correctly showing that we now have 11 missing values.

We can also confirm by checking for null values directly.

In [None]:
# .isnull() returns True for missing values, .sum() counts the Trues
df['TotalCharges'].isnull().sum()
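After the conversion, the earlier filter `df['TotalCharges'] == ' '` no longer matches anything, because the blanks are now `NaN`. A minimal sketch on a toy frame (made-up values) shows how the null mask can retrieve those rows instead.

```python
import pandas as pd

# Toy frame standing in for df after the conversion in Section 5.
toy = pd.DataFrame({
    'tenure': [0, 12, 0],
    'TotalCharges': pd.to_numeric(pd.Series([' ', '350.4', ' ']), errors='coerce'),
})

# Select the rows whose TotalCharges is now missing.
missing_rows = toy[toy['TotalCharges'].isnull()]

print(len(missing_rows))
print(missing_rows.index.tolist())
```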

### 6. Conclusion

In this session, you learned a critical data cleaning workflow:
1.  Use `.info()` to identify columns with incorrect data types.
2.  Attempt a direct conversion to diagnose the specific error.
3.  Isolate and inspect the problematic rows to understand the root cause.
4.  Use `pd.to_numeric` with `errors='coerce'` to force the conversion while correctly flagging problematic entries as missing data (`NaN`).
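The workflow above could be wrapped in a small reusable function. The helper below is hypothetical (`coerce_to_numeric` is not part of Pandas) and is only one possible sketch: it converts a column and reports which original values were turned into `NaN`.

```python
import pandas as pd

def coerce_to_numeric(df, column):
    """Hypothetical helper: convert one column to numeric and report what was coerced."""
    original = df[column]
    converted = pd.to_numeric(original, errors='coerce')
    # Rows that held some value before conversion but are NaN afterwards.
    coerced_mask = converted.isnull() & original.notnull()
    report = original[coerced_mask].value_counts()
    # Work on a copy so the caller's DataFrame is left unchanged.
    result = df.copy()
    result[column] = converted
    return result, report

# Toy data for illustration; only the column names come from the lab.
toy = pd.DataFrame({
    'tenure': [0, 5, 0, 40],
    'TotalCharges': [' ', '99.5', ' ', '3200.0'],
})

cleaned, report = coerce_to_numeric(toy, 'TotalCharges')
print(cleaned['TotalCharges'].dtype)
print(report)
```

Returning a report alongside the cleaned data keeps the coercion visible, so unexpected values are noticed rather than silently dropped.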

**Next Session:** Now that we have correctly identified the missing `TotalCharges` values, we will learn how to handle them using imputation.