# Dealing with Duplicates & Outliers in Pandas

## Introduction to Duplicates:

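Duplicates are rows in a dataset that have identical values across all columns, or across the subset of columns that is meant to identify a record uniquely.
They can distort analysis results, for example by inflating counts and biasing averages, and should be handled appropriately.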

## Detecting Duplicates:

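`duplicated()`: Returns a boolean Series that marks each row repeating an earlier row as `True`. By default the first occurrence is not marked.

`drop_duplicates()`: Returns a new DataFrame with the duplicate rows removed, keeping the first occurrence by default.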
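
As a small, deterministic illustration of these two functions, the sketch below uses a made-up five-row frame (the names and values are hypothetical, not taken from the dataset in this notebook). It also shows the `subset` and `keep` parameters, which control which columns are compared and which copy is retained.

```python
import pandas as pd

# Hypothetical toy data: row 2 repeats row 0 exactly,
# and row 4 repeats row 1 on Name and City but not on Age.
toy = pd.DataFrame({
    'Name': ['Ann', 'Bob', 'Ann', 'Cara', 'Bob'],
    'Age':  [30, 45, 30, 52, 46],
    'City': ['Oslo', 'Lima', 'Oslo', 'Rome', 'Lima'],
})

# True for every row that repeats an earlier row on all columns
# (keep='first' is the default, so the first copy stays False)
full_mask = toy.duplicated()

# Compare only the Name column and mark every copy, including the first
name_mask = toy.duplicated(subset=['Name'], keep=False)

# Drop exact duplicates, keeping the first occurrence
toy_unique = toy.drop_duplicates()

# Drop rows that share a Name, keeping the last occurrence instead
toy_last_by_name = toy.drop_duplicates(subset=['Name'], keep='last')
```

Comparing on a `subset` is often the more useful check in practice, because two records for the same entity may differ in a column that was entered inconsistently.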

## Example: Detecting and Handling Duplicates:

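```python
import pandas as pd
from faker import Faker  # third-party package that generates fake data

# Build a 100-row DataFrame of random names, ages, and cities
fake = Faker()
data = {'Name': [fake.name() for _ in range(100)],
        'Age': [fake.random_int(min=18, max=80) for _ in range(100)],
        'City': [fake.city() for _ in range(100)]}

df = pd.DataFrame(data)

# Detect duplicates: select the rows that repeat an earlier row.
# Randomly generated rows rarely repeat, so this is usually empty here.
duplicates = df[df.duplicated()]

# Remove duplicates, keeping the first occurrence of each row
df_clean = df.drop_duplicates()
```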


## Introduction to Outliers:

Outliers are data points that differ significantly from the other observations in the dataset.
They can skew statistical measures such as the mean and standard deviation, and they can degrade model performance.
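
To make the effect on statistical measures concrete, here is a minimal sketch with a hypothetical list of ages in which one value (350) stands in for a data-entry error. The cutoff of 120 used to exclude it is an assumption chosen only for this illustration.

```python
import pandas as pd

# Hypothetical ages; the last value is a deliberate data-entry error
ages = pd.Series([22, 25, 27, 30, 31, 33, 35, 350])

# Statistics with the suspicious value included
mean_with = ages.mean()
median_with = ages.median()

# Same statistics with the suspicious value excluded (assumed cutoff of 120)
trimmed = ages[ages < 120]
mean_without = trimmed.mean()
median_without = trimmed.median()

print(f"mean:   {mean_without} -> {mean_with}")
print(f"median: {median_without} -> {median_with}")
```

In this sample the single extreme value more than doubles the mean, while the median moves by only half a year.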

## Detecting Outliers:

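Outliers can be identified using statistical methods such as the Z-score, which measures how many standard deviations a value lies from the mean, or the IQR (interquartile range) rule, which flags values that fall far outside the middle 50% of the data.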
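
The IQR approach can be sketched as follows. The data is hypothetical, and the multiplier of 1.5 is the conventional rule of thumb rather than anything prescribed by this notebook.

```python
import pandas as pd

# Hypothetical ages; 350 stands in for an obvious data-entry error
ages = pd.Series([22, 25, 27, 30, 31, 33, 35, 350])

# First and third quartiles, and the spread between them
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Conventional fences: 1.5 * IQR below Q1 and above Q3
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
is_outlier = (ages < lower) | (ages > upper)
iqr_outliers = ages[is_outlier]
```

Because the quartiles are barely affected by extreme values, this rule still flags 350 in a sample this small, where a Z-score threshold of 3 would not.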

### Dealing with Outliers:

Outliers can be handled by removing them, transforming them, or using robust statistical techniques.
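
A minimal sketch of these options, using hypothetical data and bounds derived from the 1.5 × IQR rule (an assumed choice; any sensible bounds would do):

```python
import numpy as np
import pandas as pd

# Hypothetical ages; 350 stands in for an obvious data-entry error
ages = pd.Series([22, 25, 27, 30, 31, 33, 35, 350])

# Bounds from the 1.5 * IQR rule (an assumed choice for this sketch)
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
lower, upper = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)

# 1. Remove: keep only the values inside the bounds
removed = ages[ages.between(lower, upper)]

# 2. Transform by capping: pull extreme values back to the nearest bound
capped = ages.clip(lower=lower, upper=upper)

# 3. Transform by rescaling: log1p compresses large values far more than small ones
logged = np.log1p(ages)

# 4. Robust statistic: the median is barely affected by the extreme value
robust_center = ages.median()
```

Removal discards the whole row, whereas capping and rescaling keep it, which may matter when the other columns of that row are still valid.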

### Detecting and Handling Outliers using Z-score:

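```python
# Detect outliers using Z-score
from scipy import stats

# Work from the deduplicated frame so the earlier cleanup is not discarded.
# Each Z-score is the number of standard deviations an Age lies from the mean.
z_scores = stats.zscore(df_clean['Age'])
threshold = 3  # common rule of thumb

# Rows whose Age is more than `threshold` standard deviations from the mean.
# Ages drawn uniformly from 18 to 80 rarely stray that far, so this is usually empty.
outliers = df_clean[abs(z_scores) > threshold]

# Remove outliers
df_clean = df_clean[abs(z_scores) <= threshold]
df_clean
```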

|     | Name | Age | City |
|-----|------|-----|------|
| 0 | Gregory Cook | 76 | East Barbara |
| 1 | Anita Mcneil | 37 | Bennettmouth |
| 2 | James Zavala | 78 | West Ray |
| 3 | Rhonda Salas | 61 | Brownport |
| 4 | Ryan Robertson | 72 | Michaelhaven |
| ... | ... | ... | ... |
| 95 | Jared Chan | 80 | Howardton |
| 96 | Mrs. Jacqueline Evans | 57 | Hutchinsonside |
| 97 | Katherine Rivers | 41 | East Christy |
| 98 | Matthew Booth | 62 | Martinbury |
