---
title: "Data Cleaning with Pandas"
author: "Mohammed Adil Siraju"
date: "2025-09-19"
categories: [pandas, data-cleaning, preprocessing]
description: "A comprehensive guide to handling duplicates and outliers in Pandas DataFrames, with practical examples and best practices."
---

Welcome to this tutorial on data cleaning using Pandas! Data cleaning is a crucial step in any data analysis workflow. In this notebook, we'll cover two essential techniques:

- **Handling duplicates**: Removing or managing repeated rows.
- **Detecting and removing outliers**: Using statistical methods like IQR (Interquartile Range).

By the end, you'll have practical skills to preprocess messy datasets effectively.

## 1. Setting Up and Creating Sample Data

First, let's import Pandas and create a sample DataFrame to work with.

In [30]:
import pandas as pd

data1 = {
    'A': [1,2,2,3,3],
    'B': [4,5,5,6,7]
}

df1 = pd.DataFrame(data1)
df1

|   | A | B |
|---|---|---|
| 0 | 1 | 4 |
| 1 | 2 | 5 |
| 2 | 2 | 5 |
| 3 | 3 | 6 |
| 4 | 3 | 7 |


## 2. Dealing with Duplicates

Duplicate rows inflate counts and bias aggregates such as means and totals. Pandas provides `duplicated()` to detect them and `drop_duplicates()` to remove them.

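### Checking for Duplicates

`duplicated()` returns a boolean Series that marks every row repeating an earlier one, so summing it counts the duplicate rows.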

In [31]:
df1.duplicated().sum()

np.int64(1)
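
The count alone does not show which rows are involved. As a small sketch on the same sample data, the boolean mask can be inspected directly, and `keep=False` flags every member of a duplicate group rather than only the later copies. The variable names here are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 2, 3, 3], 'B': [4, 5, 5, 6, 7]})

# Default keep='first': only the later copy of a repeated row is flagged.
first_mask = df1.duplicated()

# keep=False flags every row that belongs to a duplicate group.
all_mask = df1.duplicated(keep=False)

# Boolean indexing shows the rows involved (index 1 and 2 here).
dup_rows = df1[all_mask]
print(dup_rows)
```

Looking at the flagged rows before dropping anything is a cheap way to confirm that they really are redundant records.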

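### Removing Duplicates

`drop_duplicates()` returns a new DataFrame that keeps the first occurrence of each repeated row; `df1` itself is unchanged.

In [32]: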
df1.drop_duplicates()

|   | A | B |
|---|---|---|
| 0 | 1 | 4 |
| 1 | 2 | 5 |
| 3 | 3 | 6 |
| 4 | 3 | 7 |


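Pass `subset` to compare only certain columns. Here rows count as duplicates when they share a value in `A`, so row 4 (A=3, B=7) is dropped even though its `B` value is unique.

In [33]: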
df1.drop_duplicates(subset=['A'])

|   | A | B |
|---|---|---|
| 0 | 1 | 4 |
| 1 | 2 | 5 |
| 3 | 3 | 6 |
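
Which row survives depends on the `keep` argument. The sketch below, on the same sample data, keeps the last row for each value of `A` and optionally renumbers the result with `ignore_index=True`. Whether the first or last occurrence is the right one to keep depends on your data; this only illustrates the options.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 2, 3, 3], 'B': [4, 5, 5, 6, 7]})

# keep='last' retains the final row for each value of A instead of the first.
last_kept = df1.drop_duplicates(subset=['A'], keep='last')

# ignore_index=True relabels the surviving rows 0..n-1.
renumbered = df1.drop_duplicates(subset=['A'], keep='last', ignore_index=True)

print(last_kept)
print(renumbered)
```

With `keep='last'`, the row with `B = 7` is kept for `A = 3`, where the default kept the row with `B = 6`.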


## 3. Handling Outliers

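Outliers are extreme values that can distort statistics such as the mean and standard deviation. We'll use the IQR method to detect and filter them: a value is flagged as an outlier if it lies more than 1.5 × IQR below the first quartile (Q1) or above the third quartile (Q3), where IQR = Q3 − Q1.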

### Creating Sample Data with Outliers

In [34]:
data2 = {
    'A': [1, 2, 2, 3, 11, 11],
    'B': [1, 5, 5, 6, 12, 25]
}

df2 = pd.DataFrame(data2)
df2

|   | A  | B  |
|---|----|----|
| 0 | 1  | 1  |
| 1 | 2  | 5  |
| 2 | 2  | 5  |
| 3 | 3  | 6  |
| 4 | 11 | 12 |
| 5 | 11 | 25 |


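`describe()` summarizes each column, including the quartiles the IQR method relies on: for column `B`, the 25% and 75% rows give Q1 = 5.0 and Q3 = 10.5.

In [35]: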
df2.describe()

|       | A        | B        |
|-------|----------|----------|
| count | 6.0      | 6.0      |
| mean  | 5.0      | 9.0      |
| std   | 4.690416 | 8.602325 |
| min   | 1.0      | 1.0      |
| 25%   | 2.0      | 5.0      |
| 50%   | 2.5      | 5.5      |
| 75%   | 9.0      | 10.5     |
| max   | 11.0     | 25.0     |
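
The 75% value of 10.5 for column `B` does not appear in the data, because pandas' default quantile method interpolates linearly between neighboring sorted values. The sketch below reproduces that number by hand; `linear_quantile` is a hypothetical helper written for illustration, not a pandas function.

```python
import math
import pandas as pd

b = pd.Series([1, 5, 5, 6, 12, 25])

def linear_quantile(values, q):
    """Hypothetical helper: quantile via linear interpolation between sorted values."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)   # fractional position in the sorted data
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])

# Position 0.75 * 5 = 3.75 lies between the sorted values 6 and 12.
manual_q3 = linear_quantile(b, 0.75)
pandas_q3 = b.quantile(0.75)

print(manual_q3, pandas_q3)
```

Both values come out as 6 + 0.75 × (12 − 6) = 10.5, matching the `75%` row above.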


### Calculating IQR and Bounds

In [36]:
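# First and third quartiles of column B
q_low = df2['B'].quantile(0.25)
q_high = df2['B'].quantile(0.75)

# Interquartile range: the spread of the middle 50% of the values
iqr = q_high - q_low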

In [37]:
# Values more than 1.5 * IQR beyond the quartiles are treated as outliers
l_bound = q_low - 1.5 * iqr
u_bound = q_high + 1.5 * iqr

print(f"Lower bound: {l_bound}")
print(f"Upper bound: {u_bound}")
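
If the same calculation is needed for several columns, the steps above can be wrapped in a small function. This is a minimal sketch; `iqr_bounds` and its `k` parameter are hypothetical names introduced here, with `k=1.5` mirroring the multiplier used in this tutorial.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Hypothetical helper: return (lower, upper) outlier bounds for a numeric Series."""
    q_low = series.quantile(0.25)
    q_high = series.quantile(0.75)
    iqr = q_high - q_low
    return q_low - k * iqr, q_high + k * iqr

df2 = pd.DataFrame({'A': [1, 2, 2, 3, 11, 11], 'B': [1, 5, 5, 6, 12, 25]})

l_bound, u_bound = iqr_bounds(df2['B'])
print(f"Lower bound: {l_bound}")
print(f"Upper bound: {u_bound}")
```

For column `B` this gives Q1 = 5.0, Q3 = 10.5, and IQR = 5.5, so the bounds are 5.0 − 8.25 = −3.25 and 10.5 + 8.25 = 18.75.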

In [38]:
df2

|   | A  | B  |
|---|----|----|
| 0 | 1  | 1  |
| 1 | 2  | 5  |
| 2 | 2  | 5  |
| 3 | 3  | 6  |
| 4 | 11 | 12 |
| 5 | 11 | 25 |


### Filtering Out Outliers

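Keep only the rows whose `B` value falls strictly between the two bounds. With bounds of -3.25 and 18.75, only the row with `B = 25` is removed.

In [39]:
df_filtered = df2[(df2['B'] > l_bound) & (df2['B'] < u_bound)]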
df_filtered

|   | A  | B  |
|---|----|----|
| 0 | 1  | 1  |
| 1 | 2  | 5  |
| 2 | 2  | 5  |
| 3 | 3  | 6  |
| 4 | 11 | 12 |
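
Before discarding rows, it can help to look at what the filter removes. As a sketch, `Series.between` builds the same mask in one call, and negating it selects the outliers. This assumes a pandas version that accepts string values for `inclusive` (1.3 or later); `'neither'` matches the strict comparisons used above.

```python
import pandas as pd

df2 = pd.DataFrame({'A': [1, 2, 2, 3, 11, 11], 'B': [1, 5, 5, 6, 12, 25]})

q_low = df2['B'].quantile(0.25)
q_high = df2['B'].quantile(0.75)
iqr = q_high - q_low
l_bound = q_low - 1.5 * iqr
u_bound = q_high + 1.5 * iqr

# True for rows strictly inside the bounds; endpoints are excluded.
in_range = df2['B'].between(l_bound, u_bound, inclusive='neither')

outliers = df2[~in_range]     # rows the filter would drop
df_filtered = df2[in_range]   # same rows as the boolean filter above

print(outliers)
```

Here the only flagged row is index 5, where `B = 25`. Its `A` value of 11 is not an outlier by the same rule, which is a reminder that the bounds are computed per column.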
