# Significance of outliers:

- Outliers distort the mean and standard deviation of a dataset, which can lead to erroneous statistical results.
- Many machine learning algorithms do not work well in the presence of outliers, so it is often desirable to detect and remove them.
- Outliers are highly useful in anomaly detection, such as fraud detection, where fraudulent transactions are very different from normal transactions.

# Outlier Detection using IQR
An outlier is a data point that is distant from the other observations in a data set, that is, a point that lies outside the overall distribution of the data.

### What are the criteria to identify an outlier?

1. A data point that lies more than 1.5 times the interquartile range (IQR) above the 3rd quartile or below the 1st quartile.
2. A data point that lies more than 3 standard deviations from the mean. This can be checked with a z-score; a stricter threshold of 2 standard deviations is sometimes used instead.
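
Written as formulas, the two criteria above look roughly as follows. The symbols $Q_1$, $Q_3$, $\mu$, $\sigma$ and the threshold $k$ are notation introduced here for illustration, not taken from the original text.

$$
\mathrm{IQR} = Q_3 - Q_1, \qquad x \text{ is an outlier if } x < Q_1 - 1.5\,\mathrm{IQR} \ \text{ or } \ x > Q_3 + 1.5\,\mathrm{IQR}
$$

$$
z = \frac{x - \mu}{\sigma}, \qquad x \text{ is an outlier if } |z| > k \quad (\text{commonly } k = 3, \text{ or } k = 2 \text{ for a stricter rule})
$$

Here $Q_1$ and $Q_3$ are the 1st and 3rd quartiles, and $\mu$ and $\sigma$ are the mean and standard deviation of the column being checked.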

### Why do outliers exist in a dataset?

1. Variability in the data
2. An experimental measurement error

### What are the impacts of having outliers in a dataset?

1. They can cause problems during statistical analysis, because many summary statistics are sensitive to extreme values.
2. They may have a significant impact on the mean and the standard deviation.
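
To make the effect on the mean and standard deviation concrete, here is a minimal sketch using only the Python standard library. The numbers are made up for illustration and do not come from the dataset used later in this notebook.

```python
import statistics

values = [10, 12, 11, 13, 12, 11, 10, 12]   # made-up sample
with_outlier = values + [100]                # same sample plus one extreme value

for label, data in (("without outlier", values), ("with outlier", with_outlier)):
    mean = statistics.mean(data)
    median = statistics.median(data)
    stdev = statistics.stdev(data)           # sample standard deviation
    print(f"{label}: mean={mean:.2f}, median={median}, stdev={stdev:.2f}")
```

A single extreme value moves the mean from 11.375 to roughly 21.2, while the median only moves from 11.5 to 12. This is one reason quartile-based rules such as the IQR are less affected by the outliers they are trying to find.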

### Ways of finding outliers
1. Scatter plots
2. Box plots
3. Z-score
4. Interquartile range (IQR)
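
The class further down in this notebook only implements the IQR method. For comparison, the z-score method could be sketched as below. The function name `zscore_outlier_positions`, the default threshold of 3 and the sample data are all assumptions made for this illustration.

```python
import numpy as np

def zscore_outlier_positions(values, threshold=3.0):
    """Return 0-based positions whose absolute z-score exceeds `threshold`."""
    arr = np.asarray(values, dtype=float)
    std = arr.std()                      # population standard deviation (ddof=0)
    if std == 0:
        return []                        # all values identical: nothing stands out
    z_scores = (arr - arr.mean()) / std
    return np.flatnonzero(np.abs(z_scores) > threshold).tolist()

sample = [10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 12, 100]  # made-up data
positions = zscore_outlier_positions(sample)
print(positions)
```

One caveat: the outlier itself inflates the mean and standard deviation used to compute the z-score, so in very small samples no point can reach an absolute z-score above 3. The IQR rule does not have this limitation.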



In [1]:
import pandas as pd
import numpy as np

In [2]:
class OutlierDetector:
    def __init__(self, input_file, output_file):
        self.input_file = input_file
        self.output_file = output_file
        
    def read_data(self):
        return pd.read_csv(self.input_file)
    
    def detect_outliers_using_iqr(self, df):
        # Only numeric columns have quartiles; select them via the public API.
        for col in df.select_dtypes(include='number').columns:
            q1 = df[col].quantile(0.25)
            q3 = df[col].quantile(0.75)
            # calculate interquartile range
            iqr = q3 - q1
            # calculate lower and upper bound using IQR
            lower_bound = q1 - 1.5 * iqr
            upper_bound = q3 + 1.5 * iqr
            # positional (0-based) indices of values outside the bounds
            mask = ((df[col] < lower_bound) | (df[col] > upper_bound)).to_numpy()
            outlier_indices = np.flatnonzero(mask).tolist()
            yield col, q1, q3, len(outlier_indices), outlier_indices
            
    def show_outliers_details_using_iqr(self, df):
        column = []
        outlier_range = []
        outlier_count = []
        outlier_indices = []
        for name, q1, q3, count, indices in self.detect_outliers_using_iqr(df):
            column.append(name)  # numerical column name
            outlier_range.append((q1, q3))  # (q1, q3)
            outlier_count.append(count)  # count of outliers
            outlier_indices.append(indices)  # outlier indices
        combined_df = pd.DataFrame({
            'Column': column,
            'Outlier Range': outlier_range,
            'Number of Outliers': outlier_count,
            'Outlier Indices': outlier_indices})
        combined_df.to_csv(self.output_file, index=False)

In [3]:
outlier_detector = OutlierDetector(
    input_file='data/country_data.csv',
    output_file='output/outliers_details.csv',
)

In [4]:
data = outlier_detector.read_data()

In [5]:
outlier_detector.show_outliers_details_using_iqr(data)
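
To sanity-check the IQR rule without the CSV files, the same bounds can be computed on a small in-memory DataFrame. This is only a sketch: the `toy` frame, its column names and its values are hypothetical and merely stand in for `data/country_data.csv`.

```python
import pandas as pd

# Hypothetical toy data; not the real country dataset.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E", "F"],
    "income": [10, 12, 11, 13, 12, 100],
})

q1 = toy["income"].quantile(0.25)
q3 = toy["income"].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Boolean mask of rows that fall outside the IQR bounds.
is_outlier = (toy["income"] < lower_bound) | (toy["income"] > upper_bound)
outlier_rows = toy[is_outlier]
cleaned = toy[~is_outlier]               # rows kept if outliers are removed

print(outlier_rows)
```

With these values the bounds work out to 9.0 and 15.0, so only the row with an income of 100 is flagged. Whether flagged rows should actually be dropped depends on the analysis; for anomaly detection they are the rows of interest.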