# Session 31: Data Cleaning Part 3 (Handling Duplicates)

**Unit 3: Data Collection and Cleaning**
**Hour: 31**
**Mode: Practical Lab**

---

### 1. Objective

This lab focuses on another common data cleaning task: identifying and removing duplicate records from a dataset.

### 2. Setup

Let's start with a fresh, simple DataFrame to make the concept of duplicates clear.

In [None]:
import pandas as pd

data = {
    'id': [1, 2, 3, 2], # Note the duplicate id 2
    'name': ['Alice', 'Bob', 'Charlie', 'Bob'] # Note the duplicate name Bob
}

df_simple = pd.DataFrame(data)
df_simple

As you can see, the last row (index 3) is an exact duplicate of the row at index 1.

### 3. Identifying Duplicates

Pandas provides the `.duplicated()` method, which returns a boolean Series indicating whether a row is a duplicate of a *previous* row.

In [None]:
df_simple.duplicated()

Notice that the first occurrence of the Bob/2 row (at index 1) is marked `False`, while the second occurrence (at index 3) is marked `True`.
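
To look at the duplicated rows themselves rather than just the flags, the boolean Series can be used as a filter. The sketch below rebuilds the same `df_simple` so it runs on its own, and uses the `keep=False` option of `.duplicated()`, which marks every copy of a repeated row instead of only the later ones. The variable names `later_copies` and `all_copies` are illustrative.

```python
import pandas as pd

# Same toy data as in the Setup section
df_simple = pd.DataFrame({
    'id': [1, 2, 3, 2],
    'name': ['Alice', 'Bob', 'Charlie', 'Bob'],
})

# Default (keep='first'): only the later copy is flagged
later_copies = df_simple[df_simple.duplicated()]

# keep=False: every row that has a twin is flagged, including the first one
all_copies = df_simple[df_simple.duplicated(keep=False)]

print(later_copies)
print(all_copies)
```

Viewing all copies side by side is a handy sanity check before deleting anything.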

We can use `.sum()` to quickly count the number of duplicate rows.

In [None]:
print(f"Number of duplicate rows: {df_simple.duplicated().sum()}")

### 4. Removing Duplicates

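The `.drop_duplicates()` method finds and removes duplicate rows, keeping the first occurrence by default. It returns a new DataFrame and leaves the original unchanged, so we assign the result to a new variable.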

In [None]:
df_cleaned = df_simple.drop_duplicates()
df_cleaned

The duplicate row has been removed.
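
Keeping the first occurrence is only the default. As a small sketch on the same toy data, the `keep` parameter can instead retain the last occurrence (`keep='last'`) or discard every copy (`keep=False`). The example also shows `.reset_index(drop=True)`, since dropping rows leaves gaps in the index. The variable names are illustrative.

```python
import pandas as pd

# Same toy data as in the Setup section
df_simple = pd.DataFrame({
    'id': [1, 2, 3, 2],
    'name': ['Alice', 'Bob', 'Charlie', 'Bob'],
})

# Keep the last copy of each duplicated row instead of the first
kept_last = df_simple.drop_duplicates(keep='last')

# Drop every row that has a duplicate, keeping none of the copies
dropped_all = df_simple.drop_duplicates(keep=False)

# Remove duplicates, then renumber the index from 0
reindexed = df_simple.drop_duplicates().reset_index(drop=True)

print(kept_last)
print(dropped_all)
print(reindexed)
```

Which option is appropriate depends on the data: `keep='last'` suits cases where later rows are more recent, while `keep=False` is a stricter choice for when no copy can be trusted.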

### 5. Application to the Telco Dataset

Now, let's apply this to our main Telco dataset. Does it have any duplicate rows?

In [None]:
# Load the data again
url = 'https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/data/Telco-Customer-Churn.csv'
df = pd.read_csv(url)

In [None]:
duplicate_count = df.duplicated().sum()
print(f"Number of duplicate rows in the Telco dataset: {duplicate_count}")

**Finding:** The Telco dataset contains no exact duplicate rows. This is good news, but it is not always the case in real-world projects.

Sometimes two rows should count as duplicates even if they match on only a subset of columns. For example, is any `customerID` repeated? A customer ID should be unique, so a repeated ID would signal a problem even when the other columns differ.

In [None]:
customer_id_duplicates = df.duplicated(subset=['customerID']).sum()
print(f"Number of duplicate customer IDs: {customer_id_duplicates}")

The result is 0, confirming that every customer has a unique ID.
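
Because the Telco data has no repeated IDs, the difference between a full-row check and a `subset` check is easier to see on made-up data. The sketch below uses a small hypothetical DataFrame (`df_demo`, with invented customer IDs and charges that are not taken from the Telco dataset) in which one ID appears twice with different values.

```python
import pandas as pd

# Hypothetical records for illustration only (not from the Telco dataset)
df_demo = pd.DataFrame({
    'customerID': ['A-001', 'B-002', 'A-001'],
    'MonthlyCharges': [29.85, 56.95, 30.10],
})

# Full-row check: the two A-001 rows differ in MonthlyCharges, so none is flagged
full_row_dupes = df_demo.duplicated().sum()

# Key-column check: the second A-001 row is flagged
id_dupes = df_demo.duplicated(subset=['customerID']).sum()

# Keep one row per customerID; here we assume the later row is the one to keep
deduped = df_demo.drop_duplicates(subset=['customerID'], keep='last')

print(f"Full-row duplicates: {full_row_dupes}")
print(f"Duplicate customer IDs: {id_dupes}")
print(deduped)
```

When rows share a key but differ elsewhere, deciding which one to keep is a judgment call about the data, so it is worth inspecting the conflicting rows before dropping any of them.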

### 6. Conclusion

In this session, you learned the straightforward process for handling duplicate data:
1.  Use `.duplicated().sum()` to quickly check whether duplicate rows exist and how many there are.
2.  Use `.drop_duplicates()` to remove them.
3.  Use the `subset` parameter in these methods to check for duplicates based on specific key columns (like an ID).

While our main dataset contained no duplicates, this is a crucial check in any data cleaning workflow.

**Next Session:** We will explore how to identify and handle outliers in our numerical data.