# Self-study try-it activity 4.2: Identifying outliers in a data set

Outlier detection, also known as anomaly detection, is the process of identifying data points that deviate significantly from the norm within a data set. These data points, called outliers, may represent abnormalities, errors or rare events that require further investigation. Outlier detection is critical in various fields such as finance, healthcare, cyber security and marketing due to its ability to enhance data quality and uncover valuable insights.

In this notebook, you will explore how to detect and handle outliers.

In [None]:
#Import the necessary libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy.stats import zscore
import scipy.stats as stats
import seaborn as sns
from scipy.stats.mstats import winsorize



## 1. Detecting outliers using the interquartile range (IQR)

- Sort the data: arrange the data set in an ascending order.

- Find the median (Q2): the median is the middle value of the data set. If the number of data points is even, take the average of the two middle values.

- Determine Q1: the median of the lower half of the data set (below Q2).

- Determine Q3: the median of the upper half of the data set (above Q2).

- Calculate IQR: subtract Q1 from Q3.

Note: Data points that fall outside the range $Q1 - 1.5 \times IQR \leq \text{data} \leq Q3 + 1.5 \times IQR$ are treated as outliers.
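The steps above can be sketched directly in plain Python. This is a minimal illustration, not part of the original activity: the helper name `quartiles_by_halves` and the sample values are assumptions chosen for the example. Note that `np.percentile` uses linear interpolation by default, so it can return slightly different quartiles from the median-of-halves method (for these values NumPy reports Q1 = 6 rather than 4).

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) using the median-of-halves steps (hypothetical helper)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q2 = median(ordered)
    # For an odd number of points, the middle value is left out of both halves.
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), q2, median(upper)

sample = [24, 56, 78, 1, 2, 45, 6, 78, 999]  # illustrative values
q1, q2, q3 = quartiles_by_halves(sample)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in sample if x < lower_fence or x > upper_fence]

print(f"Q1: {q1}, Q2: {q2}, Q3: {q3}, IQR: {iqr}")
print("Outliers:", outliers)
```

Both approaches flag the same point here, but on small data sets the choice of quartile method can move the fences enough to change which points are flagged.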


In [None]:
data = [24,56,78,1,2,45,6,78,999]

sorted_data = sorted(data)
print(sorted_data)

#Calculate Q1 and Q3
q1 = np.percentile(sorted_data, 25) #Q1 (25th percentile)
q3 = np.percentile(sorted_data, 75) #Q3 (75th percentile)

#Calculate IQR
iqr = q3 - q1

print(f"Q1: {q1}, Q3: {q3}, IQR: {iqr}")

#The outlier points
outliers = [x for x in sorted_data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

print("Outliers:", outliers)


In [None]:
#Plot outliers

plt.figure(figsize=(8, 5))
plt.plot(sorted_data, 'bo-', label="Data Points", markersize=8)

#Highlight outliers
outlier_indices = [i for i, val in enumerate(sorted_data) if val in outliers]
plt.scatter(outlier_indices, outliers, color='red', label="Outliers", zorder=5)

#Add labels and title
plt.title("Dataset with Outliers Highlighted")
plt.xlabel("Index")
plt.ylabel("Value")
plt.legend()
plt.show()

In [None]:
#Generate a set of 1,000 points with ten outlier points and plot the outlier points

#Number of data points
num_points = 1000

#Generate normally distributed data (mean = 0, std = 1)
data = np.random.normal(0, 1, num_points)

#Number of outliers to inject
num_outliers = 10

#Randomly select indices for outliers
outlier_indices = np.random.choice(range(num_points), num_outliers, replace=False)

#Replace selected indices with extreme values (outliers)
for idx in outlier_indices:
    data[idx] = np.random.choice([50, -50])  #Extreme outlier values

#Display the generated data with outliers

plt.figure(figsize=(8, 5))
plt.plot(data, 'bo-', label="Data Points", markersize=8)

#Recalculate outliers for the current data
#Calculate Q1 and Q3
q1 = np.percentile(data, 25) #Q1 (25th percentile)
q3 = np.percentile(data, 75) #Q3 (75th percentile)

#Calculate IQR
iqr = q3 - q1

#The outlier points (a boolean mask keeps indices and values aligned)
is_outlier = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)
outliers = data[is_outlier]

#Highlight outliers using the updated outlier information
outlier_indices = np.flatnonzero(is_outlier)
plt.scatter(outlier_indices, outliers, color='red', label="Outliers", zorder=5)

# Add labels and title
plt.title("Dataset with Outliers Highlighted")
plt.xlabel("Index")
plt.ylabel("Value")
plt.legend()
plt.show()

## 2. Detecting outliers using z-score

Z-scores are commonly used to detect outliers in approximately normally distributed data. A z-score measures how many standard deviations a data point lies from the mean, and points whose absolute z-score exceeds a chosen threshold (typically 2 or 3) are considered outliers. The `zscore` function from the `scipy.stats` package can be used to compute these scores.
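As a rough sketch of what the z-score calculation does, the example below computes $z = (x - \mu) / \sigma$ by hand with NumPy. The sample values and the `threshold` variable are assumptions made for illustration. It uses the population standard deviation (`ddof=0`), which is also the default in `scipy.stats.zscore`.

```python
import numpy as np

values = np.array([10.0, 12.0, 13.0, 14.0, 15.0, 16.0, 20.0, 100.0])  # illustrative values

# z = (x - mean) / standard deviation, using the population std (ddof=0)
z_manual = (values - values.mean()) / values.std(ddof=0)

threshold = 2  # hypothetical cut-off; 3 is the other common choice
flagged = values[np.abs(z_manual) > threshold]

print("z-scores:", np.round(z_manual, 2))
print("Flagged:", flagged)
```

With only eight values, no point can reach an absolute z-score above $\sqrt{7} \approx 2.65$, so a threshold of 3 would flag nothing here. This is one reason a lower threshold is sometimes used for small samples.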

In [None]:
#From scipy.stats, import z-score
#Step 1: Generate normally distributed data
np.random.seed(42)  # Set seed for reproducibility
data = np.random.normal(loc=0, scale=1, size=100)  # mean = 0, std = 1, 100 data points

#Step 2: Calculate z-scores using the z-score method
z_scores = zscore(data)

#Step 3: Identify outliers (z-scores > 3 or < -3)
outliers = data[np.abs(z_scores) > 3]

#Step 4: Plot the data and highlight the outliers
plt.figure(figsize=(10, 6))
plt.hist(data, bins=30, color='lightblue', alpha=0.7, label='Data')
plt.scatter(outliers, np.zeros_like(outliers), color='red', label='Outliers (z > 3 or z < -3)', zorder=5)
plt.title('Normally Distributed Data with Outliers (z > 3 or z < -3)')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.legend()
plt.grid(True)
plt.show()

#Print the outliers
print("Outliers detected:", outliers)


In [None]:
#Increase the data set to 10,000 data points and display the outliers whose z-score is greater than 3 or less than -3.

In [None]:

#Step 1: Generate normally distributed data
np.random.seed(42)  # Set seed for reproducibility
data = np.random.normal(loc=0, scale=1, size=10000)  # mean = 0, std = 1, 10,000 data points

#Step 2: Calculate z-scores using the z-score method
z_scores = zscore(data)

#Step 3: Identify outliers (z-scores > 3 or < -3)
outliers = data[np.abs(z_scores) > 3]

#Step 4: Plot the data and highlight the outliers
plt.figure(figsize=(10, 6))
plt.hist(data, bins=30, color='lightblue', alpha=0.7, label='Data')
plt.scatter(outliers, np.zeros_like(outliers), color='red', label='Outliers (z > 3 or z < -3)', zorder=5)
plt.title('Normally Distributed Data with Outliers (z > 3 or z < -3)')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.legend()
plt.grid(True)
plt.axvline(x=-3, color='purple', linestyle='--', linewidth=2, label='Z = -3')
plt.axvline(x=3, color='orange', linestyle='--', linewidth=2, label='Z = +3')
plt.show()

#Print the outliers
print("Outliers detected:", outliers)


In [None]:
#In which plot are the outliers more clearly visible, and why?
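As a hint for the question above, the sketch below estimates how many points a standard normal sample is expected to produce beyond a given z-score threshold. The helper name `two_sided_tail` is an assumption for this example; it relies on the identity $P(|Z| > k) = \operatorname{erfc}(k / \sqrt{2})$ for a standard normal variable $Z$.

```python
import math

def two_sided_tail(k):
    """P(|Z| > k) for a standard normal variable Z (hypothetical helper)."""
    return math.erfc(k / math.sqrt(2))

for n in (100, 10_000):
    for k in (2, 3):
        expected = n * two_sided_tail(k)
        print(f"n = {n:>6}, |z| > {k}: about {expected:.1f} points expected")
```

Roughly 0.27% of normally distributed points lie beyond three standard deviations, so a sample of 100 will often contain none, while a sample of 10,000 should contain a few dozen even though nothing unusual was injected.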

## 3. Detecting outliers using visualisation tools

Outliers can be visualised using various plots. Common visual tools for detecting outliers include:

- Histograms

- Box plots

- Scatter plots

- Quantile-quantile (QQ) plots

- Violin plots

### Histograms

A histogram is a graphical representation of the distribution of a data set, dividing the range of values into bins and showing the frequency of data points within each bin. Histograms play a crucial role in detecting outliers, especially in univariate data (data with one variable), as they provide a visual way to identify extreme values.

In [None]:
#Histogram to detect outliers


#Example data set
data = [25, 26, 29, 30, 30, 35, 36, 40, 41, 100]

#Create the histogram
plt.hist(data, bins=10, edgecolor='black')

#Add titles and labels
plt.title('Histogram to Detect Outliers')
plt.xlabel('Age')
plt.ylabel('Frequency')

#Show the plot
plt.show()


In [None]:
#Simulate data of 100 ages and use a histogram to plot the outliers

#Simulate 100 ages data (in the range of zero to 100 years)
np.random.seed(42)
ages = np.random.randint(0, 101, 100)

#Add two outlier points manually
ages = np.append(ages, [150, 200])  # Adding two outlier points

#Plot the histogram
plt.figure(figsize=(10,6))
plt.hist(ages, bins=10, color='skyblue', edgecolor='black')
plt.title('Age Distribution Histogram')
plt.xlabel('Age')
plt.ylabel('Frequency')

#Identify and plot the outliers (you'll use IQR method to detect outliers)
Q1 = np.percentile(ages, 25)
Q3 = np.percentile(ages, 75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = ages[(ages < lower_bound) | (ages > upper_bound)]

#Plot the outliers
plt.scatter(outliers, np.zeros_like(outliers), color='red', label='Outliers', zorder=5)
plt.axvline(x=lower_bound, color='green', linestyle='--', linewidth=2, label='IQR Lower Bound')
plt.axvline(x=upper_bound, color='magenta', linestyle='--', linewidth=2, label='IQR Upper Bound')

plt.legend()
plt.show()

print("Outliers:", outliers)


### Box plots

A box plot visually identifies outliers by showing the distribution of data. It has:

**Box:** represents the IQR between Q1 and Q3.

**Whiskers:** extend from Q1 and Q3 to the most extreme data points that lie within 1.5 * IQR of the box.

**Outliers:** data points outside the whiskers are considered outliers.

Outliers are shown as dots beyond the whiskers, helping quickly identify extreme values.
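To make the parts of a box plot concrete, the following sketch computes the box edges, whisker ends and outlier points numerically. It assumes the default whisker length of 1.5 * IQR and uses illustrative age values; the variable names are chosen for this example only.

```python
import numpy as np

ages = np.array([25, 26, 29, 30, 30, 35, 36, 40, 41, 100])  # illustrative values

# Box edges
q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whisker ends are the most extreme data points still inside the fences
inside = ages[(ages >= lower_fence) & (ages <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the fences is drawn as an individual point
fliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(f"Box: {q1} to {q3}, whiskers: {whisker_low} to {whisker_high}")
print("Points drawn as outliers:", fliers)
```

The whiskers stop at actual data values (25 and 41 here) rather than at the fences themselves, which is why they are usually shorter than 1.5 * IQR.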

In [None]:

#Example data set
data = [25, 26, 29, 30, 30, 35, 36, 40, 41, 100]

#Create a box plot
plt.boxplot(data)

#Add titles and labels
plt.title('Box Plot to Identify Outliers')
plt.ylabel('Age')

#Show the plot
plt.show()


In [None]:
# Explore the various options in the box plot and play around with different options.


### Scatter plots

A scatter plot can be used to detect outliers by visually identifying points that deviate significantly from the general pattern or cluster of data points.

In [None]:

#Example data set (height in cm, weight in kg)
height = [150, 160, 165, 170, 175, 180, 185, 190, 200, 300]
weight = [50, 55, 60, 65, 70, 75, 80, 85, 90, 150]

#Create scatter plot
plt.scatter(height, weight)

#Add titles and labels
plt.title('Scatter Plot to Identify Outliers')
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')

#Show the plot
plt.show()


### QQ plots

A QQ plot is a graphical tool used to compare the distribution of a data set to a theoretical distribution (often the normal distribution) and can be useful in detecting outliers.

Any points that deviate significantly from the straight line in the QQ plot are potential outliers. These points represent data values that do not conform to the expected distribution.
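The idea of "deviation from the straight line" can also be checked numerically. The sketch below is an illustration under assumptions: it uses a seeded random sample with one injected value of 10, and it calls `stats.probplot` without a `plot` argument so that it returns the plotted coordinates and the fitted reference line instead of drawing them.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
sample = np.append(rng.normal(loc=0, scale=1, size=100), 10)  # one injected outlier

# probplot returns (theoretical quantiles, ordered values) and the least-squares line
(theoretical, ordered), (slope, intercept, r) = stats.probplot(sample, dist="norm")

# Vertical distance of each point from the reference line
residuals = ordered - (slope * theoretical + intercept)
worst = np.argmax(np.abs(residuals))

print(f"Largest deviation: value {ordered[worst]:.2f}, residual {residuals[worst]:.2f}")
```

The point with the largest residual is the injected value, which is the same point that sits far above the line in the plotted version.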

In [None]:
#Example data set (normal distribution with an outlier)
data = np.random.normal(loc=0, scale=1, size=100)  # 100 data points from a normal distribution
data_with_outlier = np.append(data, 10)  #Add an outlier value (10)

#Generate a QQ plot
stats.probplot(data_with_outlier, dist="norm", plot=plt)

# Add titles and labels
plt.title('QQ Plot to Identify Outliers')
plt.show()


### Violin plots

A violin plot detects outliers by showing the distribution and density of data. Outliers appear as points or areas far from the main bulk of the data.

The plot includes:

- Density curves: outliers are points outside the main distribution.

- Box plot: points outside the whiskers ($1.5 * IQR$) are outliers.

Violin plots are useful for spotting data points that deviate significantly from the rest of the distribution.

In [None]:

#Example data set (normal distribution with an outlier)
data = np.random.normal(loc=0, scale=1, size=100)  # 100 data points from a normal distribution
data_with_outlier = np.append(data, [10, 12, 15])  # Adding some outliers

#Create a violin plot
sns.violinplot(data=data_with_outlier)

#Add titles and labels
plt.title('Violin Plot to Identify Outliers')
plt.xlabel('Dataset')
plt.ylabel('Value')

# Show the plot
plt.show()


In [None]:
#Explore the other options of the violin plot.

## 4. Handling outliers

Outliers can be handled by removing them, imputing them or transforming the data. Let’s explore when each of these approaches is most effective.

### Removing outliers

You may consider removing outliers in the following situations:

- Data errors: if the outliers are caused by data entry errors or incorrect measurements, they should be removed, as they do not represent real observations.

- Distorting analysis: if outliers significantly distort statistical analysis or model performance (e.g. mean, standard deviation or regression models), removing them may improve the results.

- Non-representative values: if outliers do not reflect the population or the process you're studying (e.g. extreme values that don't belong to the same context as the other data points), they might be excluded.


In [None]:

#Example data set
data = pd.Series([10, 12, 13, 14, 15, 16, 20, 100])

#Calculate mean and standard deviation
mean = data.mean()
std_dev = data.std()

#Define upper and lower bounds (two standard deviations)
upper_bound = mean + 2 * std_dev
lower_bound = mean - 2 * std_dev

print("Upper bound:", upper_bound)
print("Lower bound:", lower_bound)

#Remove outliers
cleaned_data = data[(data >= lower_bound) & (data <= upper_bound)]
print("Cleaned Data:", cleaned_data.tolist())  # Output: [10, 12, 13, 14, 15, 16, 20]


### Imputing outliers

You may consider imputing outliers in the following situations:

- Data represents real phenomena: if the outliers are legitimate data points but are too extreme to be included in the analysis, imputing them with a more reasonable value (e.g. mean, median or a value based on context) can make the data more manageable while preserving its meaning.

- Minimising impact on analysis: if outliers distort statistical models or analysis, imputing them can reduce their influence while still keeping the data set intact. This is useful when you don't want to lose valuable data by removing them.

- Preventing data loss: imputing can help if removing outliers results in a significant loss of data, especially in small data sets where every point matters.

- Preserving overall distribution: imputation can be used when you need to preserve the overall distribution of data but want to correct extreme values that might skew the results.

In [None]:
#Given this example data, calculate median and IQR-based bounds and replace outliers with median
data = pd.Series([10, 12, 13, 14, 15, 16, 20, 100])
data = data.astype('float64')
#Calculate the median and IQR-based bounds
median = data.median()
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

#Replace the outliers with the median
imputed_data = data.copy()
imputed_data[data < lower_bound] = median
imputed_data[data > upper_bound] = median

print("Imputed Data:", imputed_data.tolist())  # Output: [10.0, 12.0, 13.0, 14.0, 15.0, 16.0, 20.0, 14.5]
# Create comparison DataFrame
comparison_df = pd.DataFrame({
    'Original': data,
    'Imputed': imputed_data
})

print(comparison_df)


### Transforming data

You may consider transforming the data in the following situations:

- Skewed distribution: if the data is skewed (e.g. right-skewed), transforming it (e.g. using a log or square root transformation) can reduce the impact of outliers and make the distribution more normal. This can improve the performance of statistical models that assume normality.

- Meeting model assumptions: some models rely on normality assumptions (e.g. inference in linear regression assumes normally distributed residuals). Applying transformations can help the data meet these assumptions more closely and reduce the effect of outliers.

- Improving model fit: transformations such as logarithms or Box-Cox can help compress extreme values, making the data more consistent and improving model performance without removing valuable data.

- Genuine outliers: if the outliers are real and represent significant values (e.g. rare but important events), transformation can help bring them within a more manageable range without losing information.


In [None]:
#Log transformation
data = pd.Series([10, 12, 13, 14, 15, 16, 20, 100])

#Apply log transformation (add a small constant to avoid log(0))
transformed_data = np.log(data + 1)

print("Log Transformed Data:", transformed_data.tolist())
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].hist(data, bins=6, color='skyblue', edgecolor='black')
axes[0].set_title('Original Data')
axes[0].set_xlabel('Value')
axes[0].set_ylabel('Frequency')

axes[1].hist(transformed_data, bins=6, color='salmon', edgecolor='black')
axes[1].set_title('Log Transformed Data')
axes[1].set_xlabel('Log(Value + 1)')
axes[1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()
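The Box-Cox transformation mentioned earlier in this section can be tried in a similar way. The sketch below is a minimal example, assuming strictly positive input values (a requirement of Box-Cox). When no `lmbda` argument is given, `scipy.stats.boxcox` estimates the power parameter by maximum likelihood and returns it alongside the transformed data.

```python
import numpy as np
from scipy import stats

# Illustrative values; Box-Cox requires strictly positive data
values = np.array([10, 12, 13, 14, 15, 16, 20, 100], dtype=float)

# With no lmbda argument, boxcox estimates lambda by maximum likelihood
transformed, fitted_lambda = stats.boxcox(values)

print("Estimated lambda:", round(float(fitted_lambda), 3))
print("Skewness before:", round(float(stats.skew(values)), 3))
print("Skewness after: ", round(float(stats.skew(transformed)), 3))
```

Like the log transformation, Box-Cox preserves the ordering of the values, so the extreme point remains the largest but sits much closer to the rest of the data.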


In [None]:

#Winsorise the data (cap the most extreme values at each end)
data = pd.Series([10, 12, 13, 14, 15, 16, 20, 100])

#With only 8 values, limits of 0.05 round down to zero values per tail and change nothing.
#Limits of 0.125 (1/8) cap exactly one value at each end.
winsorized_data = winsorize(data.values, limits=[0.125, 0.125])
print("Winsorized Data:", winsorized_data.tolist())

#Output: [12, 12, 13, 14, 15, 16, 20, 20]


Open the `CarPrice_Assignment.csv` data set and perform the following:

1. Read the CSV file using `pd.read_csv()` and save it in a DataFrame named `df`.

2. Compute the number of rows in the data set.

3. Pick out the price column and check for outliers using z-score.

4. Display the outliers.

In [None]:
import pandas as pd
df = pd.read_csv("data/CarPrice.csv")

In [None]:
price = df['price']
print(price)

In [None]:
#Step 2: Calculate z-scores for the 'price' column
z_scores = zscore(df['price'])

#Step 3: Identify the outliers (z-scores > 3 or z < -3)
outliers_zscore = df['price'][abs(z_scores) > 3]

#Step 4: Display the outliers
print("Outliers detected using Z-scores:", outliers_zscore)

In [None]:
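#Remove the rows whose price z-score lies outside the range -3 to 3
df_cleaned_zscore = df[abs(z_scores) <= 3]

#Step 5: Compare the number of rows before and after removing the outliers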
print(f"Rows before removing outliers: {len(df)}")
print(f"Rows after removing outliers: {len(df_cleaned_zscore)}")

Use the `synthetic_data.csv` data set provided and experiment further with detecting and handling outliers.

In [None]:
syn_data = pd.read_csv("data/synthetic_data.csv")

In [None]:
syn_data.shape


In [None]:
syn_data.describe()

In [None]:
syn_data.head()


In [None]:
col1_syn_data = syn_data['col1']

In [None]:
z_scores_col1 = zscore(syn_data['col1'])

#Step 3: Identify the outliers (z-scores > 3 or z < -3)
outliers_zscore_col1 = syn_data['col1'][abs(z_scores_col1) > 3]

#Step 4: Display the outliers
print("Outliers detected using Z-scores:", outliers_zscore_col1)