---
## Author Information

**Name:** Abdul Rehman  
**Role:** Data Science Enthusiast | Python Learner  
**Notebook Created:** 22-July-2025  

**Connect with Me:**  


[![LinkedIn](https://img.shields.io/badge/LinkedIn-blue?style=flat&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/abdul-rehman-74b418350/)
[![GitHub](https://img.shields.io/badge/GitHub-black?style=flat&logo=github&logoColor=white)](https://github.com/datawithrehman/Data-Science-Beginning)
[![Twitter](https://img.shields.io/badge/Twitter-blue?style=flat&logo=twitter&logoColor=white)](https://x.com/datawithrehman)



# Removing Duplicates & Fixing Messy Data with Pandas
Cleaning data is often more important than modeling. If your dataset is messy, even the best machine learning models will give poor results.

In this notebook, we'll explore how to:
- Find and remove duplicates
- Clean messy text data
- Reset the index after removing rows
- Avoid common beginner mistakes


In [None]:
# Step 1: Import pandas
import pandas as pd

# Sample data with duplicates and messy strings
data = {
    'Name': ['Alice', 'Bob', 'Alice', 'Carol', 'bob '],
    'City': ['Lahore', 'Karachi', 'Lahore', 'Islamabad', 'karachi'],
    'Age': [25, 30, 25, 27, 30]
}
df = pd.DataFrame(data)
df

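### Step 2: Finding Duplicates
`.duplicated()` returns a boolean Series that is `True` for every row that exactly repeats an earlier row. In the sample data only the second `Alice` row is flagged; the `'bob '` row is not, because its trailing space and lowercase letters make it a different string from `'Bob'`.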

In [None]:
df.duplicated()
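
The mask from `.duplicated()` can also be counted or used to display the affected rows. The sketch below rebuilds the sample data so it runs on its own; the variable names `dup_mask`, `n_dups`, and `all_copies` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Carol', 'bob '],
    'City': ['Lahore', 'Karachi', 'Lahore', 'Islamabad', 'karachi'],
    'Age': [25, 30, 25, 27, 30],
})

# Default keep='first': only the later repeats are marked True
dup_mask = df.duplicated()

# True counts as 1, so the sum is the number of repeated rows
n_dups = int(dup_mask.sum())

# keep=False marks every member of a duplicate group, including the first one
all_copies = df[df.duplicated(keep=False)]

print(n_dups)
print(all_copies)
```

Viewing every copy with `keep=False` is a handy way to inspect what would be removed before actually dropping anything.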

### Step 3: Removing Duplicate Rows
`.drop_duplicates()` returns a new DataFrame with the repeated rows removed, keeping the first occurrence of each.

⚠️ **Important:** The original DataFrame is left unchanged, so either reassign the result to a variable or pass `inplace=True`.

In [None]:
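# .copy() gives an independent DataFrame, so the column assignments in Step 4
# do not trigger SettingWithCopyWarning on pandas versions that emit it
df_cleaned = df.drop_duplicates().copy()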
df_cleaned
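
By default every column has to match for a row to count as a duplicate. As a rough sketch, the `subset` and `keep` parameters of `drop_duplicates()` change that behaviour. Using `Age` alone as the key is only for illustration; in real data you would pick the columns that actually identify a record.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Carol', 'bob '],
    'City': ['Lahore', 'Karachi', 'Lahore', 'Islamabad', 'karachi'],
    'Age': [25, 30, 25, 27, 30],
})

# Compare only the Age column; keep the first row seen for each age
by_age_first = df.drop_duplicates(subset=['Age'])

# Same comparison, but keep the last row seen for each age
by_age_last = df.drop_duplicates(subset=['Age'], keep='last')

print(by_age_first)
print(by_age_last)
```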

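### Step 4: Fixing Messy Text Data
We’ll remove stray whitespace with `.str.strip()` and normalize capitalization with `.str.lower()`. Normalizing can turn near-duplicates such as `'Bob'` and `'bob '` into exact duplicates, so check for duplicates again after this step.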

In [None]:
# Strip spaces and convert to lowercase
df_cleaned['Name'] = df_cleaned['Name'].str.strip().str.lower()
df_cleaned['City'] = df_cleaned['City'].str.strip().str.lower()
df_cleaned
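
Once the text is normalized, the two Bob rows become identical, so a second pass of `drop_duplicates()` is needed to remove the extra one. The sketch below repeats the earlier steps so it runs on its own; `remaining` and `df_final` are illustrative names, not part of the original notebook.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Carol', 'bob '],
    'City': ['Lahore', 'Karachi', 'Lahore', 'Islamabad', 'karachi'],
    'Age': [25, 30, 25, 27, 30],
})

# First pass: removes only the exact repeat of the Alice row
df_cleaned = df.drop_duplicates().copy()

# Normalize the text columns
for col in ['Name', 'City']:
    df_cleaned[col] = df_cleaned[col].str.strip().str.lower()

# 'Bob'/'Karachi' and 'bob '/'karachi' are now the same row
remaining = int(df_cleaned.duplicated().sum())

# Second pass: removes the duplicate that normalization revealed
df_final = df_cleaned.drop_duplicates()

print(remaining)
print(df_final)
```

An alternative is to clean the text first and drop duplicates once at the end, which catches both kinds of repeats in a single pass.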

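### Step 5: Resetting the Index
Dropping rows leaves gaps in the index (here `0, 1, 3, 4`). `reset_index(drop=True)` renumbers the rows from 0, and `drop=True` discards the old index instead of keeping it as a new column.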

In [None]:
df_cleaned = df_cleaned.reset_index(drop=True)
df_cleaned
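
To see what resetting actually changes, the sketch below compares the index labels before and after. It also shows `ignore_index=True`, an option of `drop_duplicates()` in pandas 1.0 and later that renumbers the rows in the same call. The names `before`, `after`, `with_old`, and `one_step` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Carol', 'bob '],
    'City': ['Lahore', 'Karachi', 'Lahore', 'Islamabad', 'karachi'],
    'Age': [25, 30, 25, 27, 30],
})

df_cleaned = df.drop_duplicates()
before = list(df_cleaned.index)  # label 2 is missing after the drop

# drop=True throws the old labels away
after = list(df_cleaned.reset_index(drop=True).index)

# Without drop=True the old labels are kept in a new column named 'index'
with_old = df_cleaned.reset_index()

# Assumes pandas >= 1.0: drop and renumber in one call
one_step = df.drop_duplicates(ignore_index=True)

print(before, after)
print(with_old.columns.tolist())
```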

### 🚨 Common Mistake Alert
- Calling `drop_duplicates()` without assigning the result to a variable
- Expecting the original DataFrame to change without passing `inplace=True`

```python
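# Wrong: the de-duplicated result is returned and then discarded, so df is unchanged
df.drop_duplicates()

# Right: reassign the result (or pass inplace=True)
df = df.drop_duplicates()
```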
```
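
A related slip is combining both fixes at once: with `inplace=True` the method modifies the DataFrame and returns `None`, so assigning that return value would overwrite your variable with `None`. A small sketch, using the illustrative names `work` and `returned`:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Carol', 'bob '],
    'City': ['Lahore', 'Karachi', 'Lahore', 'Islamabad', 'karachi'],
    'Age': [25, 30, 25, 27, 30],
})

# Calling without reassigning: df still has all of its rows afterwards
df.drop_duplicates()
rows_after_bare_call = len(df)

# inplace=True changes `work` itself and returns None
work = df.copy()
returned = work.drop_duplicates(inplace=True)

print(rows_after_bare_call)
print(returned)
print(len(work))
```

Pick one style per call: either `df = df.drop_duplicates()` or `df.drop_duplicates(inplace=True)`, never `df = df.drop_duplicates(inplace=True)`.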


### 📆 Mini Challenge
**Try this:** Create your own DataFrame with duplicate rows and use `drop_duplicates()` to clean it.