In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv('../data/Train.csv')

## Data Cleaning: Outliers

Outliers sometimes occur in real-world numerical data. An outlier is a data point that differs significantly from the rest of the data. Outliers are often undesirable because they pull statistics such as the mean towards themselves, which can lead to erroneous conclusions.
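
To see how much a single extreme value can move the mean, the sketch below compares the mean and the median of a small sample with and without its most extreme value. The sample is the same array used in the detection examples in this notebook; the variable names `data` and `without_extreme` are only illustrative.

```python
import numpy as np

data = np.array([1, 5, 6, 8, 13, 11, 12, 10, 11, 15, 45, 17, 13, 14])
without_extreme = data[data != 45]  # drop the single extreme value by hand

print("mean with / without 45:  ", np.mean(data), np.mean(without_extreme))
print("median with / without 45:", np.median(data), np.median(without_extreme))
```

The mean drops from about 12.9 to about 10.5 when 45 is removed, while the median only moves from 11.5 to 11.0, which is why the mean is described as sensitive to outliers.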

To remove outliers, first we have to detect them.

We will look at two common ways to detect outliers:

1. The standard deviation method
2. The interquartile range method

### The standard deviation method

NOTE: this method assumes that the data is approximately normally distributed (Gaussian). Many natural measurements are, so it is often a reasonable first choice, but check the distribution before relying on it: for strongly skewed data, the rule can flag legitimate values as outliers.

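To detect outliers, we use the mean and standard deviation of the data.

Any data point that is more than 3 standard deviations away from the mean, in either direction, is considered an outlier. For normally distributed data, about 99.7% of values fall within 3 standard deviations of the mean, so a value outside that range is rare.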

In [2]:
# Example: here is a numpy array containing some data.
array = np.array([1, 5, 6, 8, 13, 11, 12, 10, 11, 15, 45, 17, 13, 14])

# the average is:
avg = np.mean(array)
# the standard deviation is:
sd = np.std(array) # numpy's built-in function; by default it computes the population standard deviation (ddof=0)

print(avg, sd)

12.928571428571429 9.801343098300979


In [3]:
# Now we can keep only the values that lie within 3 standard deviations of the mean
array[np.abs(array - avg) < 3 * sd]

array([ 1,  5,  6,  8, 13, 11, 12, 10, 11, 15, 17, 13, 14])

In [4]:
# The reason we use np.abs is to catch outliers on both sides of the mean, not just the ones greater than the mean
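
The same rule can be wrapped in a small reusable function. This is a minimal sketch: `sd_outlier_mask`, `n_sd` and `ddof` are hypothetical names introduced here. One thing to be aware of is that `np.std` computes the population standard deviation by default (`ddof=0`), whereas pandas' `Series.std` defaults to the sample standard deviation (`ddof=1`), so the two can give slightly different cut-offs.

```python
import numpy as np

def sd_outlier_mask(values, n_sd=3.0, ddof=0):
    """Return a boolean mask that is True where a value is an outlier."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    sd = values.std(ddof=ddof)
    # np.abs flags values on both sides of the mean
    return np.abs(values - mean) > n_sd * sd

array = np.array([1, 5, 6, 8, 13, 11, 12, 10, 11, 15, 45, 17, 13, 14])
mask = sd_outlier_mask(array)

print(array[mask])   # values flagged as outliers
print(array[~mask])  # values kept
```

Returning a mask rather than the filtered array keeps the decision about what to do with the flagged values separate from the detection step.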

### The interquartile range method

If the data is not close enough to a normal distribution for the standard deviation method, we can use the interquartile range method.

Quartiles are the three points that split the sorted data into four equal-sized parts. The **first quartile**, also called the 25th percentile or Q1, is the value below which about 25% of the data lies. The second quartile (Q2) has 50% of the data below it, which makes it the median, and the third quartile (Q3) has 75% of the data below it.

The difference between Q3 and Q1 is called the **interquartile range**. We will use this value to define outliers: any value which is more than **1.5 * IQR** below Q1 or above Q3 is considered an outlier.

In [5]:
# using our old data array, we will remove outliers using the IQR method

# use numpy's percentile function to compute Q1 and Q3
quartiles = np.percentile(array, [25, 75])
q1, q3 = quartiles

# calculate the IQR
iqr = q3 - q1

# now, filter the array, only including values that are within the range:
# [q1 - iqr * 1.5, q3 + 1.5 * iqr]
# compute the limits
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
array[(array >= lower) & (array <= upper)]

array([ 1,  5,  6,  8, 13, 11, 12, 10, 11, 15, 17, 13, 14])
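
When the data lives in a pandas DataFrame, as it does in this notebook, the same filter can be written with `Series.quantile` and `Series.between`. The sketch below uses a small made-up DataFrame named `demo` with a hypothetical `value` column, not a column from `Train.csv`.

```python
import pandas as pd

demo = pd.DataFrame({"value": [1, 5, 6, 8, 13, 11, 12, 10, 11, 15, 45, 17, 13, 14]})

q1 = demo["value"].quantile(0.25)
q3 = demo["value"].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# between() is inclusive on both ends by default
filtered = demo[demo["value"].between(lower, upper)]

print(lower, upper)
print(filtered["value"].tolist())
```

For this sample the limits work out to 0.625 and 21.625, so only the value 45 is dropped.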

### What do you do with outliers?

There are several options; which one to choose depends on what suits your task best.

1. You can eliminate outliers from your dataset.
2. You can constrain (clip) them, replacing each outlier with a value within range, such as the nearest limit.
3. You can leave them in.
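
The first two options can be sketched as follows, assuming the IQR limits from the previous section define the accepted range. `np.clip` replaces every value below the lower limit with the lower limit and every value above the upper limit with the upper limit.

```python
import numpy as np

array = np.array([1, 5, 6, 8, 13, 11, 12, 10, 11, 15, 45, 17, 13, 14])

q1, q3 = np.percentile(array, [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Option 1: eliminate the outliers
removed = array[(array >= lower) & (array <= upper)]

# Option 2: constrain (clip) the outliers to the nearest limit
clipped = np.clip(array, lower, upper)

print(removed)
print(clipped)
```

Clipping keeps the number of data points unchanged, which matters when each row must line up with other columns. Note that the clipped result is a float array, because the limits are floats.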

## Feature Engineering

Feature Engineering refers to creating new features from the ones you have.

NOTE: feature engineering can never create new information out of nowhere, since it uses the data you already have. Thus, the point of feature engineering isn't to give you more data to work with, but to convert the existing data into a more convenient form.

In [6]:
# Feature engineering example:
# Say you want to know whether a store was established before or after 1990.
# To do so, create a new column and use the data in Outlet_Establishment_Year
# to decide what to put in it.
# A vectorized comparison returns a boolean Series directly, so no apply/lambda is needed.
df['Outlet_Established_Before_1990'] = df['Outlet_Establishment_Year'] < 1990
df[['Outlet_Identifier', 'Outlet_Establishment_Year', 'Outlet_Established_Before_1990']]

| | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Established_Before_1990 |
|---|---|---|---|
| 0 | OUT049 | 1999 | False |
| 1 | OUT018 | 2009 | False |
| 2 | OUT049 | 1999 | False |
| 3 | OUT010 | 1998 | False |
| 4 | OUT013 | 1987 | True |
| 5 | OUT018 | 2009 | False |
| 6 | OUT013 | 1987 | True |
| 7 | OUT027 | 1985 | True |
| 8 | OUT045 | 2002 | False |
| 9 | OUT017 | 2007 | False |


In [7]:
df.columns

Index(['Item_Identifier', 'Item_Weight', 'Item_Fat_Content', 'Item_Visibility',
       'Item_Type', 'Item_MRP', 'Outlet_Identifier',
       'Outlet_Establishment_Year', 'Outlet_Size', 'Outlet_Location_Type',
       'Outlet_Type', 'Item_Outlet_Sales', 'Outlet_Established_Before_1990'],
      dtype='object')
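
Another common kind of engineered feature is a numeric transformation of an existing column. As a sketch, an outlet's age can be derived from `Outlet_Establishment_Year`. The small `demo` DataFrame and the `REFERENCE_YEAR` constant below are assumptions made for illustration; use the reference year that matches when your data was collected.

```python
import pandas as pd

REFERENCE_YEAR = 2013  # hypothetical reference year, not taken from the dataset

demo = pd.DataFrame({
    "Outlet_Identifier": ["OUT049", "OUT018", "OUT013"],
    "Outlet_Establishment_Year": [1999, 2009, 1987],
})

# The new column contains no new information; it only re-expresses the year as an age
demo["Outlet_Age"] = REFERENCE_YEAR - demo["Outlet_Establishment_Year"]

print(demo)
```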