# Outliers

## What are Outliers?

Outliers are data points that are significantly different from other data points in a dataset. They can be unusually high or low values compared to the rest of the data. Here's a simple explanation:

Imagine you have a group of friends, and you're all sharing how much money you spent on lunch. Most of your friends spent around $10-$15, but one friend spent $100. That friend's spending would be considered an outlier because it's much higher than what everyone else spent.

Similarly, in a dataset of test scores, if most students scored between 70-90 out of 100, but one student scored 20, that score would be an outlier because it's significantly lower than the others.

Outliers can occur due to various reasons such as measurement errors, data entry mistakes, or genuinely extreme values. It's essential to identify outliers because they can skew statistical analyses and lead to inaccurate conclusions about the data. Handling outliers often involves either removing them from the dataset or transforming them to minimize their impact on the analysis.

![Outlier](https://media.geeksforgeeks.org/wp-content/uploads/20210627205853/global-300x245.jpg)
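
As a rough illustration of the lunch example above, the sketch below compares the mean and the median with and without the $100 lunch. The individual amounts are made up for illustration; only the "$10-$15 versus $100" pattern comes from the text.

```python
from statistics import mean, median

# Hypothetical lunch spending in dollars; the last value is the outlier.
lunch_spend = [10, 12, 11, 15, 13, 14, 100]

# The same data without the $100 lunch.
typical = lunch_spend[:-1]

mean_without = mean(typical)
mean_with = mean(lunch_spend)
median_without = median(typical)
median_with = median(lunch_spend)

# The mean doubles, while the median barely moves.
print(f"mean:   {mean_without:.2f} -> {mean_with:.2f}")      # 12.50 -> 25.00
print(f"median: {median_without:.2f} -> {median_with:.2f}")  # 12.50 -> 13.00
```

A single extreme value is enough to make the mean a poor description of what a "typical" friend spent, which is why outliers can lead to inaccurate conclusions.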

## When should you remove outliers and when should you not?

Deciding whether to remove outliers from a dataset depends on the context of the analysis and the nature of the data. Here are some considerations:

### 1. When to Remove Outliers:

1. **Data Entry Errors**: If outliers are clearly due to mistakes in data entry or measurement errors, removing them can improve the accuracy of the analysis.

2. **Analysis Sensitivity**: In some cases, outliers can significantly skew statistical analyses or machine learning models, leading to misleading results. Removing outliers may help mitigate this issue and produce more reliable results.

3. **Violation of Assumptions**: Some statistical methods assume that the data follow certain distributions or have specific properties. If outliers violate these assumptions, removing them may be necessary to ensure the validity of the analysis.

### 2. When Not to Remove Outliers:

1. **Genuine Data**: Outliers may represent valid, genuine observations that provide valuable insights into the underlying processes being studied. Removing them could lead to the loss of important information or trends in the data.

2. **Sample Size**: Removing outliers reduces the size of the dataset, potentially reducing statistical power and making it harder to detect true effects or relationships.

3. **Impact on Generalization**: If the goal is to build a model that generalizes to new data, removing outliers that occur naturally makes the training set less representative of the data the model will later see. The model may then look accurate on the cleaned training data but perform poorly on unseen data that contains similar extreme values.

4. **Exploratory Analysis**: During exploratory data analysis, it's often beneficial to examine outliers to understand why they occur and whether they reveal interesting patterns or anomalies in the data.

### Considerations:

1. **Domain Knowledge**: Understanding the subject matter and context of the data is crucial for making informed decisions about outliers. Domain experts can provide valuable insights into whether outliers are meaningful or should be treated as noise.

2. **Robust Methods**: Instead of removing outliers outright, consider using robust statistical methods that are less sensitive to extreme values. These methods can help mitigate the influence of outliers while still preserving valuable information in the data.

In summary, the decision to remove outliers depends on the specific goals of the analysis, the nature of the data, and the potential impact on the validity and interpretability of the results. It's essential to weigh the benefits and drawbacks carefully and consider alternative approaches before removing outliers from a dataset.
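
One way to picture the "robust methods" consideration is to compare a classical measure of spread (the standard deviation) with a robust one (the median absolute deviation, MAD). This is a minimal sketch with NumPy; the sensor-style readings are hypothetical and the choice of MAD is just one common robust estimator, not the only option.

```python
import numpy as np

# Hypothetical readings; 35.0 is the extreme value.
readings = np.array([20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 35.0])

# Classical estimates: both are pulled by the extreme value.
mean = readings.mean()
std = readings.std()

# Robust estimates: the median and the median absolute deviation (MAD).
center = np.median(readings)
mad = np.median(np.abs(readings - center))

print(f"mean={mean:.2f}, std={std:.2f}")
print(f"median={center:.2f}, MAD={mad:.2f}")
```

The median and MAD stay close to what the six ordinary readings suggest, so the extreme value is kept in the data without dominating the summary.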

## How to treat Outliers?

Treating outliers involves handling them in a way that minimizes their impact on statistical analyses or machine learning models while preserving the integrity of the data. Here are several common approaches to treat outliers:

1. **Removal**: One straightforward approach is to remove outliers from the dataset entirely. However, this should be done cautiously, and the decision to remove outliers should be based on careful consideration of the data and the specific analysis being performed. Outliers can be removed using statistical techniques such as z-score, interquartile range (IQR), or domain-specific knowledge.

2. **Transformation**: Data transformation techniques can be used to reduce the impact of outliers while still retaining them in the dataset. For example, taking the logarithm or square root of right-skewed, non-negative data compresses large values more than small ones, which can make the distribution more symmetric and reduce the influence of extreme values.

3. **Winsorization**: Winsorization involves capping the extreme values of a dataset at a certain percentile (e.g., replacing values above the 95th percentile with the value at the 95th percentile). This approach keeps every observation, so the sample size is unchanged, while limiting how far the tails can pull on the analysis.

4. **Binning**: Binning involves grouping values into intervals or bins and working with the bin rather than the raw value. An extreme value simply falls into the first or last bin (for example, "100 and above"), which limits its influence while maintaining the overall structure of the data, at the cost of some precision.

5. **Imputation**: Imputation involves replacing missing or outlier values with estimated or predicted values. This can be done using various techniques such as mean, median, regression models, or sophisticated imputation methods like KNN imputation or multiple imputation by chained equations (MICE).

6. **Robust Statistical Methods**: Using statistical methods that are less sensitive to outliers, such as robust regression or robust statistical estimators, can help mitigate the influence of extreme values on the analysis.

7. **Model-Based Approaches**: In some cases, it may be appropriate to use outlier detection and correction techniques within the context of a specific statistical or machine learning model. For example, robust regression models or outlier-robust algorithms can automatically adjust for outliers during model fitting.

It's essential to carefully consider the advantages and disadvantages of each approach and choose the most appropriate method based on the characteristics of the data, the goals of the analysis, and domain knowledge. Additionally, documenting the outlier treatment process is crucial for transparency and reproducibility in data analysis.
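
Two of the approaches above, transformation and imputation, can be sketched in a few lines of NumPy. The data is hypothetical, and flagging outliers with the 1.5 * IQR fences before imputing is an assumption made for this example; any other detection rule could be substituted.

```python
import numpy as np

# Hypothetical right-skewed sample; 250.0 is the extreme value.
values = np.array([12.0, 15.0, 14.0, 13.0, 16.0, 18.0, 17.0, 250.0])

# Transformation: log1p(x) = log(1 + x) compresses large values
# much more than small ones, so the extreme value stays but pulls less.
log_values = np.log1p(values)

# Imputation: flag points outside the 1.5 * IQR fences, then replace
# them with the median of the points that were not flagged.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
imputed = np.where(is_outlier, np.median(values[~is_outlier]), values)

print("ratio max/min before log:", values.max() / values.min())
print("ratio max/min after log: ", log_values.max() / log_values.min())
print("imputed:", imputed)
```

Whichever treatment is chosen, keeping the original column alongside the treated one makes the step easy to document and reverse.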

## What are the Effects of Outliers on ML Algorithms? Which algorithms get affected by outliers?

Outliers can significantly impact the performance of machine learning algorithms in various ways. Here's a breakdown of the effects of outliers on machine learning algorithms and how different algorithms are affected:

### Effects of Outliers on ML Algorithms:

1. **Bias in Parameter Estimation**: Outliers can bias parameter estimates in algorithms that rely on minimizing error or maximizing likelihood, leading to inaccurate model fitting.

2. **Reduced Model Performance**: Outliers can distort the underlying patterns and relationships in the data, leading to poor generalization performance of the model on unseen data.

3. **Increased Variance**: Outliers can increase the variance of the model, making it more sensitive to small changes in the training data and reducing its ability to generalize to new data.

4. **Model Instability**: Outliers can cause instability in model training, leading to fluctuations in model performance and making it difficult to interpret the results.

5. **Skewed Decision Boundaries**: Outliers can skew decision boundaries in classification algorithms, leading to misclassification of data points near the outliers.

### Algorithms Affected by Outliers:

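The algorithms below range from sensitive (those that fit weights by minimizing squared error, or that rely on distances) to relatively robust (such as tree-based methods):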

1. **Linear Regression**: Linear regression is sensitive to outliers because it aims to minimize the sum of squared errors, making it susceptible to the influence of extreme values.

2. **Logistic Regression**: Logistic regression can also be affected by outliers, particularly when the outliers are influential in determining the decision boundary between classes.

3. **k-Nearest Neighbors (kNN)**: kNN predictions depend on the nearest neighbors of each point, so an outlier that lands among those neighbors directly influences the prediction, especially when k is small. Extreme feature values can also dominate the distance calculations.

4. **K-Means Clustering**: K-means clustering can be affected by outliers because it minimizes the sum of squared distances to cluster centroids. A distant point contributes a very large squared distance, so it can pull a centroid toward itself or even end up with a cluster of its own.

5. **Decision Trees**: Decision trees are relatively robust to outliers because they partition the feature space based on splits that minimize impurity, rather than relying on distance-based metrics.

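6. **Support Vector Machines (SVM)**: Soft-margin SVMs are fairly robust to outliers because the decision boundary depends only on the support vectors, and the regularization parameter C limits how much any single point can influence it. However, with a large C, or when outliers lie close to the boundary, they can become support vectors and shift the margin.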

7. **Random Forests**: Random forests are less sensitive to outliers compared to individual decision trees because they aggregate predictions from multiple trees, reducing the impact of outliers on the overall model.

In summary, the effect of outliers on machine learning algorithms varies depending on the algorithm's underlying principles and how it processes the data. While some algorithms are more robust to outliers, others are more sensitive and may require preprocessing steps to handle outliers effectively.
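
The sensitivity of linear regression can be seen on a small synthetic example. The sketch below fits a least-squares line with `np.polyfit` before and after corrupting one response value, and compares it with a hand-written Theil-Sen slope (the median of all pairwise slopes), used here only as one example of a robust alternative. The data and the helper name `theil_sen_slope` are hypothetical.

```python
from itertools import combinations

import numpy as np

# Hypothetical data lying exactly on y = 2x + 1.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0

# Corrupt a single response: the true value at x = 9 would be 19.0.
y_corrupt = y.copy()
y_corrupt[9] = 100.0

# Ordinary least squares minimizes the sum of squared errors,
# so one large residual has a large effect on the fitted slope.
ols_slope_clean, _ = np.polyfit(x, y, deg=1)
ols_slope_corrupt, _ = np.polyfit(x, y_corrupt, deg=1)


def theil_sen_slope(x, y):
    """Median of the slopes between all pairs of points."""
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
    ]
    return float(np.median(slopes))


robust_slope = theil_sen_slope(x, y_corrupt)

print(f"OLS slope, clean data:     {ols_slope_clean:.2f}")
print(f"OLS slope, one outlier:    {ols_slope_corrupt:.2f}")
print(f"Theil-Sen slope, outlier:  {robust_slope:.2f}")
```

With one corrupted point out of ten, the least-squares slope moves from 2 to roughly 6.4, while the median-based slope stays at 2 because most pairwise slopes are unaffected.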

## How to detect outliers?

Detecting outliers involves identifying data points that are significantly different from the majority of the data. Here are several common methods for detecting outliers:

1. **Visual Inspection**: Visualizing the data using plots such as scatter plots, box plots, histograms, or Q-Q plots can often reveal outliers visually. Data points that fall far away from the main cluster or exhibit unusual patterns compared to the rest of the data may indicate outliers.

2. **Summary Statistics**: Calculating summary statistics such as mean, median, standard deviation, quartiles, and interquartile range (IQR) can help identify outliers. Data points that fall far outside the range defined by these summary statistics may be considered outliers.

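3. **Z-Score**: Z-score (standard score) measures how many standard deviations a data point is from the mean. Data points whose absolute z-score exceeds a certain threshold (e.g., 2 or 3) are considered outliers. Because the mean and standard deviation are themselves affected by extreme values, this method works best on roughly normally distributed data.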

4. **IQR Method**: The Interquartile Range (IQR) method defines outliers as data points that fall below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR, where Q1 and Q3 are the first and third quartiles, respectively, and IQR is the interquartile range.

5. **Box Plot**: Box plots visually display the distribution of data and identify potential outliers as data points beyond the whiskers (lines extending from the box) that represent the range of typical values.

6. **Density-Based Methods**: Density-based outlier detection methods such as DBSCAN (Density-Based Spatial Clustering of Applications with Noise) or LOF (Local Outlier Factor) identify outliers based on the density of data points in the feature space.

7. **Distance-Based Methods**: Distance-based outlier detection methods flag points that are far from the rest of the data. k-nearest neighbors (kNN) approaches measure the distance of each data point from its nearest neighbors, while Mahalanobis distance measures how far a point is from the center of the data, taking the covariance between features into account.

8. **Cluster-Based Methods**: Cluster-based outlier detection methods identify outliers as data points that do not belong to any cluster or belong to small clusters compared to the majority of the data.

9. **Machine Learning Models**: Some machine learning algorithms, such as isolation forests or one-class SVM, can be trained to distinguish between normal and outlier data points and identify outliers in the dataset.

10. **Domain Knowledge**: Finally, domain knowledge and subject matter expertise can provide valuable insights into what constitutes an outlier in the context of the data and the specific problem being addressed.

In practice, a combination of these methods may be used to detect outliers effectively, and the choice of method depends on the characteristics of the data and the specific requirements of the analysis.
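
The z-score and IQR rules from the list above can be sketched as two small NumPy functions. The function names, default thresholds, and test scores are assumptions for illustration; the scores loosely follow the earlier "most students between 70 and 90, one student at 20" example.

```python
import numpy as np


def zscore_outliers(values, threshold=3.0):
    """Flag points whose absolute z-score exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold


def iqr_outliers(values, k=1.5):
    """Flag points below Q1 - k * IQR or above Q3 + k * IQR."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Hypothetical test scores; 20 is the low outlier.
scores = np.array([72, 75, 78, 80, 82, 85, 88, 90, 74, 86, 20])

z_flagged = scores[zscore_outliers(scores, threshold=2.5)]
iqr_flagged = scores[iqr_outliers(scores)]

print("z-score rule:", z_flagged)  # [20]
print("IQR rule:    ", iqr_flagged)  # [20]
```

The example uses a threshold of 2.5 for the z-score rule because, in a sample this small, the outlier inflates the standard deviation and its own z-score ends up only just above 3. The IQR rule is less affected since quartiles do not depend on the extreme values.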

## Techniques for Outlier Detection and Removal

1. Z-Score
2. IQR Method
3. Percentile
4. Winsorization
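
The percentile method and winsorization from this list differ only in what happens to values outside the chosen bounds: one drops them, the other caps them. The sketch below assumes 5th and 95th percentile bounds and hypothetical data; the helper names `percentile_bounds`, `trim`, and `winsorize` are made up for this example.

```python
import numpy as np


def percentile_bounds(values, lower_pct=5, upper_pct=95):
    """Return the (low, high) cut-offs at the given percentiles."""
    low, high = np.percentile(values, [lower_pct, upper_pct])
    return low, high


def trim(values, lower_pct=5, upper_pct=95):
    """Percentile method: drop values outside the bounds."""
    values = np.asarray(values, dtype=float)
    low, high = percentile_bounds(values, lower_pct, upper_pct)
    return values[(values >= low) & (values <= high)]


def winsorize(values, lower_pct=5, upper_pct=95):
    """Winsorization: cap values at the bounds, keeping every point."""
    values = np.asarray(values, dtype=float)
    low, high = percentile_bounds(values, lower_pct, upper_pct)
    return np.clip(values, low, high)


# Hypothetical data: one very low and one very high value.
data = np.array([1.0] + list(range(40, 59)) + [500.0])

trimmed = trim(data)
capped = winsorize(data)

print("original size:", len(data))
print("trimmed size: ", len(trimmed))
print("capped size:  ", len(capped))
print("capped range: ", capped.min(), "to", capped.max())
```

Trimming shrinks the dataset, whereas winsorizing keeps its size and replaces the two extremes with the nearest bound, which matters when sample size is already small.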