# Session 36: Tying It All Together (Intro to EDA)

**Unit 3: Data Collection and Cleaning**
**Hour: 36**
**Mode: Practical Lab**

---

### 1. Objective

This lab serves as a capstone for Unit 3. We will perform a streamlined, end-to-end cleaning and preparation workflow on the Telco dataset, combining all the techniques we've learned so far. The output will be a clean dataset, ready for the deep-dive analysis and visualization we will perform in Unit 4.

This is the bridge between the **Scrub** and **Explore** phases of the OSEMN workflow.

### 2. Setup

Import Pandas and load the dataset.

In [None]:
import pandas as pd

url = 'https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/data/Telco-Customer-Churn.csv'
df = pd.read_csv(url)

### 3. Combined Cleaning and Preparation Workflow

Let's execute our cleaning plan step-by-step.

#### Step 1: Fix `TotalCharges` Data Type and Handle Missing Values

We'll combine the steps from Sessions 29 and 30 into a single, efficient process.

In [None]:
# Coerce to numeric, creating NaNs for blank strings
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

# Calculate the median
median_charges = df['TotalCharges'].median()

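# Fill the missing values with the median.
# Assign the result back instead of calling inplace=True on a column selection:
# recent pandas versions warn about that pattern, and it may not modify df at all.
df['TotalCharges'] = df['TotalCharges'].fillna(median_charges)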

# Verify the fix
print("Data type of TotalCharges after fix:")
print(df['TotalCharges'].dtype)
print("\nNumber of missing values in TotalCharges after fix:")
print(df['TotalCharges'].isnull().sum())
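
To see what the coercion step does in isolation, here is a minimal sketch on a made-up series (the values are illustrative, not taken from the Telco file). It counts how many entries were turned into `NaN` before filling them, which is a useful number to record before imputing anything.

```python
import pandas as pd

# Made-up stand-in for the raw TotalCharges column: numbers stored as text,
# with blank strings where the value is missing
raw = pd.Series(['29.85', ' ', '1889.5', ' ', '108.15'], name='TotalCharges')

# errors='coerce' turns anything that cannot be parsed into NaN
numeric = pd.to_numeric(raw, errors='coerce')

# Count the coerced entries before they are filled
n_coerced = int(numeric.isna().sum())

# Fill the gaps with the median of the values that did parse
filled = numeric.fillna(numeric.median())

print(f"Entries coerced to NaN: {n_coerced}")
print(filled)
```

Running the same count on the real column right after `pd.to_numeric` shows how many rows the median fill will affect.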

#### Step 2: Check for Duplicates

As we discovered in Session 31, this dataset has no full-row duplicates, but the check is quick and belongs in every cleaning workflow.

In [None]:
print(f"Number of duplicate rows found: {df.duplicated().sum()}")
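
A full-row check can miss records that share an identifier but differ in another column. The sketch below uses a small made-up frame (the column names mirror the Telco dataset, the rows are invented) to show how the `subset` argument of `duplicated()` catches that case.

```python
import pandas as pd

# Invented rows: customer 'B2' appears twice with different MonthlyCharges
toy = pd.DataFrame({
    'customerID': ['A1', 'B2', 'B2', 'C3'],
    'tenure': [1, 5, 5, 9],
    'MonthlyCharges': [20.0, 55.5, 60.0, 80.0],
})

# No two rows are identical across every column
full_row_dupes = int(toy.duplicated().sum())

# Restricting the comparison to the ID column reveals the repeat
id_dupes = int(toy.duplicated(subset='customerID').sum())

# Keep the first occurrence of each ID (one possible policy, not the only one)
deduped = toy.drop_duplicates(subset='customerID', keep='first')

print(f"Full-row duplicates: {full_row_dupes}")
print(f"Duplicate customer IDs: {id_dupes}")
```

Which occurrence to keep is a judgment call that depends on the data; `keep='first'` is used here only for illustration.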

#### Step 3: Standardize Categorical Values (Example)

Let's look at the values in the `PaymentMethod` column. They seem consistent, but in a real-world scenario, you might have variations.

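**Hypothetical Problem:** Imagine some rows contained `"Credit Card (automatic)"` (capital "C" in "Card") alongside the existing `"Credit card (automatic)"`. Pandas would treat these as two different categories, so we would need to map them to a single spelling.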

While not strictly necessary here, this is how you would do it:

In [None]:
# Example of standardization using .replace()
# Map the hypothetical capitalization variants back to the spellings the dataset already uses.
# The variants do not occur in our current data, so nothing changes here,
# but this demonstrates the technique.
df['PaymentMethod'] = df['PaymentMethod'].replace({
    'Credit Card (automatic)': 'Credit card (automatic)',
    'Bank Transfer (automatic)': 'Bank transfer (automatic)'
})

df['PaymentMethod'].value_counts()
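
An explicit mapping works when the variants are known in advance. When they are not, a normalization pass with the `.str` accessor is a common alternative. The sketch below uses an invented series and assumes sentence case is the target spelling.

```python
import pandas as pd

# Invented sample with inconsistent casing and stray whitespace
payment = pd.Series([
    'Credit card (automatic)',
    'Credit Card (automatic)',
    ' credit card (automatic) ',
    'Mailed check',
])

# Strip surrounding whitespace, then force sentence case
# (first character upper case, everything else lower case)
normalized = payment.str.strip().str.capitalize()

print(normalized.value_counts())
```

Comparing `value_counts()` before and after the pass is a quick way to confirm that the variants collapsed into the categories you expected.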

#### Step 4: Create a New Feature

Let's re-create our `TenureGroup` feature from Session 35, as it will be useful for our upcoming analysis.

In [None]:
# pd.cut intervals are right-inclusive by default: (-1, 12], (12, 48], (48, 73]
# i.e. 0-12 months, 13-48 months, 49-72 months
bins = [-1, 12, 48, 73]
labels = ['New Customer', 'Medium-Term Customer', 'Long-Term Customer']

df['TenureGroup'] = pd.cut(df['tenure'], bins=bins, labels=labels)
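
Bin edges are easy to get wrong by one, so it helps to test the boundary values directly. This sketch applies the same `bins` and `labels` to a handful of hand-picked tenures and to one out-of-range value (80, chosen only for illustration).

```python
import pandas as pd

bins = [-1, 12, 48, 73]
labels = ['New Customer', 'Medium-Term Customer', 'Long-Term Customer']

# Values sitting on or next to each bin edge
tenure = pd.Series([0, 12, 13, 48, 49, 72])
groups = pd.cut(tenure, bins=bins, labels=labels)

# Show each tenure next to the group it landed in
print(pd.DataFrame({'tenure': tenure, 'TenureGroup': groups}))

# A value outside every bin is not an error; it silently becomes NaN
out_of_range = pd.cut(pd.Series([80]), bins=bins, labels=labels)
print(out_of_range.isna().sum())
```

Because out-of-range values become `NaN` rather than raising, it is worth checking the new column for missing values after binning.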

### 4. Final Review of the Cleaned Data

Let's take one last look at our prepared dataset with `.info()` and `.head()`.

In [None]:
df.info()

In [None]:
df.head()

The data is now clean and ready for analysis!
*   No missing values.
*   Correct data types.
*   A new, engineered feature (`TenureGroup`) is available.
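
The three claims above can also be verified in code rather than by reading the `.info()` output. The helper below, `check_clean`, is a hypothetical name introduced for this sketch, and the two small frames are invented to show a passing and a failing case.

```python
import pandas as pd

def check_clean(frame):
    """Hypothetical helper: return a list of problems found in the frame."""
    problems = []
    if frame.isna().sum().sum() > 0:
        problems.append('missing values present')
    if not pd.api.types.is_numeric_dtype(frame['TotalCharges']):
        problems.append('TotalCharges is not numeric')
    if 'TenureGroup' not in frame.columns:
        problems.append('TenureGroup column is absent')
    return problems

# Invented frame that satisfies all three checks
clean = pd.DataFrame({
    'tenure': [1, 30],
    'TotalCharges': [29.85, 1500.0],
    'TenureGroup': ['New Customer', 'Medium-Term Customer'],
})

# Invented frame that fails all three checks
dirty = pd.DataFrame({
    'tenure': [1, 30],
    'TotalCharges': ['29.85', None],
})

clean_problems = check_clean(clean)
dirty_problems = check_clean(dirty)
print(clean_problems)
print(dirty_problems)
```

Calling `check_clean(df)` on the prepared Telco frame should return an empty list if every step above ran as intended.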

### 5. Conclusion

This session consolidated the key data cleaning and preparation tasks into a single, logical workflow. We have successfully transitioned our raw data into a clean, analyzable state.
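
One way to make the workflow repeatable is to wrap it in a function. The sketch below is an assumption about how that might look, not part of the lab itself: `prepare_telco` is a hypothetical name, it drops full-row duplicates before imputing (which should remove nothing on the Telco data, per Session 31), and it is demonstrated on an invented four-row frame.

```python
import pandas as pd

def prepare_telco(frame):
    """Hypothetical wrapper around this session's cleaning steps."""
    # Work on a de-duplicated copy so the caller's frame is left untouched
    out = frame.drop_duplicates().copy()

    # Coerce TotalCharges to numeric and fill gaps with the median
    out['TotalCharges'] = pd.to_numeric(out['TotalCharges'], errors='coerce')
    out['TotalCharges'] = out['TotalCharges'].fillna(out['TotalCharges'].median())

    # Re-create the TenureGroup feature
    out['TenureGroup'] = pd.cut(
        out['tenure'],
        bins=[-1, 12, 48, 73],
        labels=['New Customer', 'Medium-Term Customer', 'Long-Term Customer'],
    )
    return out

# Invented input: one blank TotalCharges and one repeated row
toy = pd.DataFrame({
    'tenure': [0, 24, 60, 60],
    'TotalCharges': [' ', '480.0', '3000.0', '3000.0'],
})

prepared = prepare_telco(toy)
print(prepared)
```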

This prepared dataset will be the foundation for all the work we do in Unit 4, where we will dive deep into Exploratory Data Analysis (EDA) to uncover insights.

**Next Session:** We will begin Unit 4 with a theoretical discussion of the key statistical concepts used to describe and summarize data.