# Handling Outliers

Outliers can be handled in one of the following ways:
- Drop    : Rarely a good option, because we lose a lot of information. First find out whether the value is a genuine extreme or an error such as a broken sensor.
- Mark    : The safest option. Flagging the outlier lets us check whether it had an effect.
- Rescale : Take the log of the values so that outliers do not have as much effect.

In [2]:
import pandas as pd

houses = pd.DataFrame()
houses['Price'] = [534433, 392333, 293222, 4322032]
houses['Bathrooms'] = [2, 3.5, 2, 116]
houses['Square_feet'] = [1500, 2500, 1500, 48000]
houses

Unnamed: 0,Price,Bathrooms,Square_feet
0,534433,2.0,1500
1,392333,3.5,2500
2,293222,2.0,1500
3,4322032,116.0,48000


In [3]:
# Option 1: Drop outliers
# Filtering returns a new DataFrame; houses itself is left unchanged
houses[houses['Bathrooms'] < 20]

Unnamed: 0,Price,Bathrooms,Square_feet
0,534433,2.0,1500
1,392333,3.5,2500
2,293222,2.0,1500


In [4]:
# Option 2 : Mark outliers
import numpy as np

# Create feature based on boolean condition
houses['Outlier'] = np.where(houses['Bathrooms'] < 20, 0, 1)

# Show data
houses

Unnamed: 0,Price,Bathrooms,Square_feet,Outlier
0,534433,2.0,1500,0
1,392333,3.5,2500,0
2,293222,2.0,1500,0
3,4322032,116.0,48000,1
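
One way to use the `Outlier` flag to see whether the marked row had an effect is to compute a summary statistic with and without it. The sketch below is only an illustration: the choice of mean `Price` as the statistic and the names `mean_all` and `mean_without` are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Rebuild the example data so this cell runs on its own
houses = pd.DataFrame({
    'Price': [534433, 392333, 293222, 4322032],
    'Bathrooms': [2, 3.5, 2, 116],
    'Square_feet': [1500, 2500, 1500, 48000],
})

# Same marking rule as above: 1 if the row is an outlier, else 0
houses['Outlier'] = np.where(houses['Bathrooms'] < 20, 0, 1)

# Mean price over all rows vs. over the non-outlier rows only
# (mean_all and mean_without are hypothetical names for this sketch)
mean_all = houses['Price'].mean()
mean_without = houses.loc[houses['Outlier'] == 0, 'Price'].mean()

print(mean_all)      # 1385505.0
print(mean_without)  # about 406662.67
```

A large gap between the two means suggests that the flagged row dominates the statistic, which is the kind of effect marking is meant to expose.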


In [5]:
# Option 3 : Rescale so that outliers have less effect
# Log feature
houses['Log_of_Square_Feet'] = np.log(houses['Square_feet'])

# Show data
houses

Unnamed: 0,Price,Bathrooms,Square_feet,Outlier,Log_of_Square_Feet
0,534433,2.0,1500,0,7.31322
1,392333,3.5,2500,0,7.824046
2,293222,2.0,1500,0,7.31322
3,4322032,116.0,48000,1,10.778956
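
To get a rough sense of how much the log transform dampens the outlier, one possibility is to compare how far the largest value sits from the median before and after the transform. This is a sketch under assumptions: the max-to-median ratio is just one convenient measure picked for illustration, and `ratio_raw` and `ratio_log` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Square footage from the example above
square_feet = pd.Series([1500, 2500, 1500, 48000])

# Natural log of every value, computed in one vectorized call
log_square_feet = np.log(square_feet)

# How many times larger the maximum is than the median, before and after
ratio_raw = square_feet.max() / square_feet.median()
ratio_log = log_square_feet.max() / log_square_feet.median()

print(ratio_raw)  # 24.0
print(ratio_log)  # much closer to 1 than ratio_raw
```

The outlier is still the largest value after the transform, but it sits far closer to the rest of the data, so it has less influence on anything fitted to the feature.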
