## Handling Outliers

In [4]:
# imports
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt


Outliers in Data Science: Unraveling the Unusual

In the realm of data science, anomalies often lurk within datasets, waiting to disrupt analyses and skew results. These anomalies, known as outliers, hold the potential to mislead, confuse, and sometimes even provide unique insights. In this article, we embark on a journey to understand the world of outliers – what they are, why they matter, and how to effectively handle them.

What Are Outliers? Outliers are data points that significantly deviate from the norm. They can be unusually high or low values that don't align with the overall pattern of the dataset. Outliers can stem from various sources, including measurement errors, data entry mistakes, or genuine rare events.

Why Do Outliers Matter? Outliers hold the power to distort statistical analyses and machine learning models, leading to inaccurate predictions and biased results. Failing to address outliers can undermine the integrity of your insights and decision-making. However, outliers are not always undesirable; in some cases, they might represent critical information, such as fraudulent transactions or rare disease occurrences.

Detecting Outliers:

Visualizations: Box plots, scatter plots, and histograms can help visualize the distribution of data and identify potential outliers.
Z-Score: The z-score measures how many standard deviations a data point lies from the mean. A z-score whose absolute value exceeds a threshold (often 2 or 3) might indicate an outlier.
IQR (Interquartile Range): The IQR is the range between the first quartile (Q1, the 25th percentile) and the third quartile (Q3, the 75th percentile). Data points below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are commonly flagged as outliers.
Distance-based Methods: Techniques like k-nearest neighbors or DBSCAN can help identify data points that are far from their neighbors.
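
The IQR rule described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a definitive implementation: the array `values` holds made-up numbers rather than the scholarship dataset, and the 1.5 multiplier is the conventional choice for the fences, which you may want to tune for your data.

```python
import numpy as np

# Illustrative values (not taken from the scholarship dataset)
values = np.array([23, 25, 22, 27, 21, 24, 26, 100, 23, 28, 22, 29])

# First and third quartiles, and the interquartile range
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences placed 1.5 * IQR beyond each quartile
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Keep only the values that fall outside the fences
iqr_outliers = values[(values < lower_fence) | (values > upper_fence)]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Fences:", lower_fence, upper_fence)
print("Outliers detected using IQR:", iqr_outliers)
```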

Dealing with Outliers:

Removal: In some cases, outliers can be removed from the dataset. However, this approach must be undertaken cautiously, as removing too many data points might lead to loss of valuable information.
Transformation: Applying mathematical transformations such as log or square root (to non-negative data) compresses large values, which reduces skew and the influence of outliers.
Capping or Flooring: Replacing extreme values with a predefined maximum or minimum value can help mitigate the effect of outliers.
Imputation: Replacing outliers with more reasonable values derived from interpolation or other imputation methods can improve the dataset's quality.
Model Robustness: Utilizing algorithms that are less sensitive to outliers, such as Random Forest or Support Vector Machines, can help mitigate the impact of outliers on the model's performance.
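
Two of the strategies above, capping and transformation, can be sketched as follows. This is only an illustration under stated assumptions: the array `values` is made up, the caps are taken from the 1.5 × IQR fences (any predefined minimum and maximum would work the same way), and `np.log1p` is used for the log transform so that zeros would not cause problems.

```python
import numpy as np

# Illustrative values (not taken from the scholarship dataset)
values = np.array([23, 25, 22, 27, 21, 24, 26, 100, 23, 28, 22, 29], dtype=float)

# Capping and flooring: clip every value to the 1.5 * IQR fences
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
capped = np.clip(values, q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Transformation: log1p computes log(1 + x), which compresses large values
logged = np.log1p(values)

print("Capped:", capped)
print("Max before and after log1p:", values.max(), logged.max())
```

Capping keeps every row in the dataset, whereas removal would drop the row containing the extreme value.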

In [8]:
data = pd.read_csv('rawdata/csv/scholarship.csv')

df1 = data.copy()

df1.head()

Unnamed: 0,semester_percentage,scholarship_exam_marks,got_scholarship
0,71.9,26,1
1,74.6,38,1
2,75.4,40,1
3,64.2,8,1
4,72.3,17,0


![image.png](attachment:image.png)

Identifying Data Anomalies with Z-Score and Python

Outliers, those mysterious data points that deviate significantly from the norm, can wreak havoc on analyses and models. Fortunately, the Z-score method offers a powerful tool to detect these outliers. In this article, we'll explore the Z-score technique, understand its significance, and provide a step-by-step code example in Python.

Understanding Z-Score: The Z-score measures how many standard deviations a data point is away from the mean. In other words, it quantifies the relative distance of a data point from the average. A high Z-score indicates that the data point is far from the mean, suggesting the possibility of an outlier.

The Formula: The formula to calculate the Z-score for a data point x is:

$$Z = \frac{x - \mu}{\sigma}$$
![image.png](attachment:image.png)


Where:

x is the observed value.
μ is the mean of the dataset.
σ is the standard deviation of the dataset.

Detecting Outliers with Z-Score:

Calculate the mean (μ) and standard deviation (σ) of the dataset.

For each data point (x) in the dataset, compute its Z-score.

Set a Z-score threshold (commonly 2 or 3) beyond which data points are considered outliers.

Flag data points whose absolute Z-score exceeds the threshold as potential outliers.

In [9]:
import numpy as np

# Generate example data; here we can see that 100 is an outlier
data = np.array([23, 25, 22, 27, 21, 24, 26, 100, 23, 28, 22, 29])

# Calculate mean and standard deviation
mean = np.mean(data)
std_dev = np.std(data)
print('mean :', mean, 'standard Dev :', std_dev)
# Set Z-score threshold
z_threshold = 2

# Calculate Z-scores
z_scores = [(x - mean) / std_dev for x in data]
print(z_scores)
# Identify outliers
outliers = [data[i] for i, z in enumerate(z_scores) if abs(z) > z_threshold]

print("Original Data:", data)
print("Outliers detected using Z-score:", outliers)

mean : 30.833333333333332 standard Dev : 20.99536985993711
[np.float64(-0.3730981347597368), np.float64(-0.27783903652320824), np.float64(-0.4207276838780011), np.float64(-0.18257993828667968), np.float64(-0.46835723299626536), np.float64(-0.3254685856414725), np.float64(-0.23020948740494396), np.float64(3.294377147346613), np.float64(-0.3730981347597368), np.float64(-0.1349503891684154), np.float64(-0.4207276838780011), np.float64(-0.08732084005015112)]
Original Data: [ 23  25  22  27  21  24  26 100  23  28  22  29]
Outliers detected using Z-score: [np.int64(100)]
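
The same check can also be written in vectorized form and applied to a DataFrame column. The sketch below assumes a hypothetical helper named `zscore_outliers` and a small hand-made frame that borrows the column name `scholarship_exam_marks` from the table shown earlier; the values are the example array from the cell above, not rows from the real dataset. Note that pandas defaults to the sample standard deviation (`ddof=1`), so `ddof=0` is passed to match `np.std`.

```python
import pandas as pd

def zscore_outliers(series, threshold=3.0):
    """Return a boolean mask marking values whose |z| exceeds the threshold.

    Hypothetical helper for illustration only.
    """
    values = series.astype(float)
    # ddof=0 gives the population standard deviation, matching np.std's default
    z = (values - values.mean()) / values.std(ddof=0)
    return z.abs() > threshold

# Illustrative frame: same numbers as the NumPy example, not the real dataset
df = pd.DataFrame(
    {"scholarship_exam_marks": [23, 25, 22, 27, 21, 24, 26, 100, 23, 28, 22, 29]}
)

mask = zscore_outliers(df["scholarship_exam_marks"], threshold=2.0)
outlier_rows = df[mask]    # rows flagged as potential outliers
cleaned = df[~mask]        # rows kept if removal is the chosen strategy

print(outlier_rows)
```

Returning a mask rather than the filtered data leaves the decision (remove, cap, or inspect) to the caller.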
