# Session 32: Data Cleaning Part 4 (Handling Outliers)

**Unit 3: Data Collection and Cleaning**
**Hour: 32**
**Mode: Practical Lab**

---

### 1. Objective

This lab focuses on identifying outliers in numerical data. Outliers can skew statistical analyses and negatively affect some machine learning models. We will use a common statistical method, the **Interquartile Range (IQR)**, to find them.

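**What is an outlier?** A data point that lies far from the bulk of the other observations. An outlier may be a recording error or a genuine but rare value, so flagging one is a prompt to investigate, not an automatic reason to delete it.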
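
To make the "skew" concrete, here is a minimal sketch using hypothetical charge values (not taken from the Telco dataset). It compares the mean and median of a small series before and after one extreme value is appended; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical monthly charges, invented for illustration.
charges = pd.Series([20.0, 25.0, 30.0, 35.0, 40.0])

# Append a single, deliberately extreme value.
charges_with_outlier = pd.concat(
    [charges, pd.Series([500.0])], ignore_index=True
)

mean_before = charges.mean()
mean_after = charges_with_outlier.mean()
median_before = charges.median()
median_after = charges_with_outlier.median()

print(f"Mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"Median: {median_before:.2f} -> {median_after:.2f}")
```

The mean moves a long way while the median barely changes, which is why quartile-based rules such as the IQR method are less sensitive to the outliers they are trying to detect.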

### 2. Setup

We will continue with our Telco dataset.

In [None]:
import pandas as pd

url = 'https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/data/Telco-Customer-Churn.csv'
df = pd.read_csv(url)

### 3. The IQR Method for Outlier Detection

The Interquartile Range (IQR) is the difference between the third quartile (Q3, the 75th percentile) and the first quartile (Q1, the 25th percentile): IQR = Q3 - Q1. It measures the spread of the middle 50% of the data.

A common rule of thumb is that any data point that falls **below Q1 - 1.5 * IQR** or **above Q3 + 1.5 * IQR** is considered an outlier.
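
Before turning to the real data, the rule can be worked through on a tiny hypothetical series. This is a sketch under the assumption that pandas' default (linear interpolation) quantiles are acceptable; the values and the helper name `iqr_bounds` are made up for illustration.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) bounds using the Q1 - k*IQR / Q3 + k*IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical data: seven similar values and one extreme value.
toy = pd.Series([10, 12, 13, 14, 15, 16, 18, 100])

lower, upper = iqr_bounds(toy)
flagged = toy[(toy < lower) | (toy > upper)]

print(f"Bounds: [{lower}, {upper}]")   # Q1 = 12.75, Q3 = 16.5, IQR = 3.75
print(f"Flagged values: {flagged.tolist()}")
```

Here only the value 100 falls outside the bounds. The multiplier `k` is exposed as a parameter because 1.5 is a convention, not a fixed law.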

Let's apply this to the `MonthlyCharges` column.

#### Step 1: Calculate Q1, Q3, and IQR

We can get Q1 and Q3 from the `.describe()` method or calculate them directly with `.quantile()`.

In [None]:
Q1 = df['MonthlyCharges'].quantile(0.25)
Q3 = df['MonthlyCharges'].quantile(0.75)
IQR = Q3 - Q1

print(f"Q1 (25th percentile): {Q1:.2f}")
print(f"Q3 (75th percentile): {Q3:.2f}")
print(f"IQR (Interquartile Range): {IQR:.2f}")
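
The `.describe()` route mentioned above can be sketched as follows. The example uses a small hypothetical series so it runs without the dataset; it relies on `.describe()` labelling the quartiles `'25%'` and `'75%'`, which is the default for numeric data.

```python
import pandas as pd

# Hypothetical values, used only to compare the two approaches.
toy = pd.Series([10, 12, 13, 14, 15, 16, 18, 100])

summary = toy.describe()        # count, mean, std, min, 25%, 50%, 75%, max
q1_from_describe = summary["25%"]
q3_from_describe = summary["75%"]

# Both approaches should agree on the quartiles.
same_q1 = q1_from_describe == toy.quantile(0.25)
same_q3 = q3_from_describe == toy.quantile(0.75)

print(f"Q1 from describe(): {q1_from_describe:.2f} (matches quantile: {same_q1})")
print(f"Q3 from describe(): {q3_from_describe:.2f} (matches quantile: {same_q3})")
```

`.quantile()` is usually the more direct choice when only the quartiles are needed, while `.describe()` is convenient when you also want the min and max to compare against the bounds.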

#### Step 2: Define the Outlier Boundaries

Now we calculate the lower and upper bounds based on the rule.

In [None]:
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"Lower outlier boundary: {lower_bound:.2f}")
print(f"Upper outlier boundary: {upper_bound:.2f}")

#### Step 3: Identify the Outliers

We can now use these boundaries to filter our DataFrame and see if any rows fall outside this range.

In [None]:
# Find outliers using the OR condition
outliers = df[(df['MonthlyCharges'] < lower_bound) | (df['MonthlyCharges'] > upper_bound)]

print(f"Number of outliers found: {len(outliers)}")

**Finding:** The IQR method did not find any outliers in the `MonthlyCharges` column: every value lies within the computed bounds. This suggests the column contains no extreme values, although the rule alone cannot confirm that each individual value is correct.

A null result is still a result. We have established that `MonthlyCharges` contains no outliers according to a widely used rule of thumb.
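
Because the filter returns zero rows on the real column, it is worth checking that the same logic does flag something when an extreme value is present. The sketch below wraps the three steps in a hypothetical helper, `iqr_outliers`, and runs it on an invented DataFrame with one planted extreme charge; neither the helper nor the data comes from the Telco dataset.

```python
import pandas as pd

def iqr_outliers(frame: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Return the rows of `frame` whose `column` value is outside the IQR bounds."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    mask = (frame[column] < lower_bound) | (frame[column] > upper_bound)
    return frame[mask]

# Invented data: the last charge is a planted extreme value.
toy_df = pd.DataFrame({
    "customer": ["a", "b", "c", "d", "e", "f", "g", "h"],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 950.00],
})

toy_outliers = iqr_outliers(toy_df, "MonthlyCharges")
print(f"Number of outliers found: {len(toy_outliers)}")
print(toy_outliers)
```

If the lab DataFrame is loaded, the same helper could be called as `iqr_outliers(df, 'MonthlyCharges')`, which should reproduce the empty result found above.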

### 4. Visualizing Outliers with a Box Plot

A box plot is the visual counterpart of the IQR method. The box spans Q1 to Q3, and by default each "whisker" extends to the most extreme data point that still lies within 1.5 * IQR of the box, so a whisker can stop short of the calculated boundary. Any points beyond the whiskers are plotted individually, making them easy to see.

Let's use Seaborn to create a box plot.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
sns.boxplot(x=df['MonthlyCharges'])
plt.title('Box Plot of Monthly Charges')
plt.show()

**Interpretation:** The box plot visually confirms our finding. The whiskers extend to the full range of the data, and there are no individual points plotted beyond them, indicating no outliers were detected by this method.
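
To see the numbers behind a box plot without drawing one, Matplotlib provides `matplotlib.cbook.boxplot_stats`, which computes the quartiles, whisker ends, and outlying points ("fliers"). The sketch below applies it to hypothetical values and assumes the default `whis=1.5` convention, which is also Seaborn's default.

```python
import numpy as np
from matplotlib.cbook import boxplot_stats

# Hypothetical values with one extreme point.
toy = np.array([10, 12, 13, 14, 15, 16, 18, 100], dtype=float)

stats = boxplot_stats(toy, whis=1.5)[0]   # one dict per dataset

q1, q3 = stats["q1"], stats["q3"]
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(f"Calculated boundaries: [{lower_fence}, {upper_fence}]")
print(f"Whisker ends:          [{stats['whislo']}, {stats['whishi']}]")
print(f"Points drawn as dots:  {stats['fliers'].tolist()}")
```

The whiskers end at actual data points (10 and 18 here), inside the calculated boundaries, and only the value 100 is reported as a flier. When there are no fliers, as with `MonthlyCharges`, the whiskers simply reach the minimum and maximum of the data.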

### 5. Conclusion

In this session, you learned how to:
1.  Define outliers using the Interquartile Range (IQR) method.
2.  Calculate Q1, Q3, and the upper/lower boundaries.
3.  Filter the DataFrame to find data points that fall outside these boundaries.
4.  Use a box plot as a quick and effective way to visualize potential outliers.

While we found no outliers in this specific case, this process is a vital part of data cleaning to ensure the quality of your analysis.

**Next Session:** We will move on to the next cleaning step: Data Transformation, starting with normalization.