# **Outlier Detection and Removal Using Percentiles**

## First Example (simple)

In [1]:
import pandas as pd

df = pd.read_csv('heights.csv')
df

Unnamed: 0,name,height
0,mohan,5.9
1,maria,5.2
2,sakib,5.1
3,tao,5.5
4,virat,4.9
5,khusbu,5.4
6,dmitry,6.2
7,selena,6.5
8,john,7.1
9,imran,14.5


**Data samples above the 95% quantile (anything above this threshold is treated as an outlier):**

In [2]:
max_threshold = df['height'].quantile(0.95)
max_threshold

np.float64(9.689999999999998)

In [3]:
df[df['height'] > max_threshold]

Unnamed: 0,name,height
9,imran,14.5
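
The threshold above is not one of the observed heights because pandas' `quantile` interpolates linearly between neighbouring sorted values by default. The following is a minimal sketch of that calculation. It assumes only the ten rows displayed above (the actual `heights.csv` contains more rows, so the number it produces differs from the notebook's threshold), and `linear_quantile` is a hypothetical helper written for illustration.

```python
import math
import pandas as pd

# Only the ten rows displayed above; heights.csv itself contains more rows,
# so this threshold will not match the notebook's value.
heights = pd.Series([5.9, 5.2, 5.1, 5.5, 4.9, 5.4, 6.2, 6.5, 7.1, 14.5])


def linear_quantile(values, q):
    """Quantile by linear interpolation between the two nearest sorted values."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q          # fractional index into the sorted data
    lower = math.floor(position)
    upper = min(lower + 1, len(ordered) - 1)   # guard against running off the end
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


manual = linear_quantile(heights, 0.95)
builtin = heights.quantile(0.95)   # default interpolation is 'linear'
print(manual, builtin)
```

Both values should agree up to floating-point rounding, which shows that the threshold sits between the two largest observations rather than on either of them.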


**Data samples below the 5% quantile (anything below this threshold is treated as an outlier):**

In [4]:
min_threshold = df['height'].quantile(0.05)
min_threshold

np.float64(3.6050000000000004)

In [5]:
df[df['height'] < min_threshold]

Unnamed: 0,name,height
12,yoseph,1.2




> *We can also use domain knowledge to detect outliers. For example, if we know that the maximum plausible height is about 7.5 feet, we can say directly: "if the height is greater than 7.5, it is an outlier".*



> *Unfortunately, when we deal with real-world data, we often don't have that much domain knowledge, and the features can be very complex, so it becomes hard to come up with a fixed threshold. In that case, quantiles are very useful, because what we are doing is removing the samples at the far left and far right ends of the distribution.*
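
The two approaches described in the notes above can be compared side by side. This is a rough sketch that assumes only the ten rows displayed earlier (not the full `heights.csv`); `DOMAIN_MAX_HEIGHT` is a hypothetical constant holding the 7.5 feet limit mentioned in the note.

```python
import pandas as pd

# The ten rows displayed earlier; the real heights.csv holds more rows.
df_demo = pd.DataFrame({
    "name": ["mohan", "maria", "sakib", "tao", "virat",
             "khusbu", "dmitry", "selena", "john", "imran"],
    "height": [5.9, 5.2, 5.1, 5.5, 4.9, 5.4, 6.2, 6.5, 7.1, 14.5],
})

# Approach 1: a fixed threshold taken from domain knowledge.
DOMAIN_MAX_HEIGHT = 7.5  # feet
by_domain = df_demo[df_demo["height"] > DOMAIN_MAX_HEIGHT]

# Approach 2: a threshold derived from the data itself.
quantile_max = df_demo["height"].quantile(0.95)
by_quantile = df_demo[df_demo["height"] > quantile_max]

print(by_domain["name"].tolist())
print(by_quantile["name"].tolist())
```

On this small sample both rules flag the same row. The quantile rule needs no prior knowledge of plausible heights, but it always flags a fixed share of the data, whether or not those rows are truly erroneous.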



In [6]:
df[(df.height < max_threshold) & (df.height > min_threshold)]

Unnamed: 0,name,height
0,mohan,5.9
1,maria,5.2
2,sakib,5.1
3,tao,5.5
4,virat,4.9
5,khusbu,5.4
6,dmitry,6.2
7,selena,6.5
8,john,7.1
10,jose,6.1
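
The filtering step above can be wrapped in a small reusable function. This is only a sketch: `trim_by_quantile` and the `sample` data are hypothetical and made up for illustration, and the strict comparisons mirror the ones used in the cell above.

```python
import pandas as pd


def trim_by_quantile(df, column, lower=0.05, upper=0.95):
    """Keep rows whose `column` value lies strictly between the two quantiles."""
    low, high = df[column].quantile([lower, upper])
    mask = (df[column] > low) & (df[column] < high)
    return df[mask]


# Hypothetical data for illustration only (not the contents of heights.csv).
sample = pd.DataFrame(
    {"height": [1.2, 4.9, 5.1, 5.2, 5.4, 5.5, 5.9, 6.1, 6.2, 6.5, 7.1, 14.5]}
)
trimmed = trim_by_quantile(sample, "height")
print(trimmed["height"].min(), trimmed["height"].max(), len(trimmed))
```

Passing the quantile levels as parameters makes it easy to try tighter or looser cut-offs, such as the 0.001 and 0.999 levels used in the second example.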


## Second Example (complex)

In [7]:
df = pd.read_csv('bhp.csv')
df.head()

Unnamed: 0,location,size,total_sqft,bath,price,bhk,price_per_sqft
0,Electronic City Phase II,2 BHK,1056.0,2.0,39.07,2,3699
1,Chikka Tirupathi,4 Bedroom,2600.0,5.0,120.0,4,4615
2,Uttarahalli,3 BHK,1440.0,2.0,62.0,3,4305
3,Lingadheeranahalli,3 BHK,1521.0,3.0,95.0,3,6245
4,Kothanur,2 BHK,1200.0,2.0,51.0,2,4250


In [8]:
df.shape

(13200, 7)

In [9]:
df.describe()

Unnamed: 0,total_sqft,bath,price,bhk,price_per_sqft
count,13200.0,13200.0,13200.0,13200.0,13200.0
mean,1555.302783,2.691136,112.276178,2.800833,7920.337
std,1237.323445,1.338915,149.175995,1.292843,106727.2
min,1.0,1.0,8.0,1.0,267.0
25%,1100.0,2.0,50.0,2.0,4267.0
50%,1275.0,2.0,71.85,3.0,5438.0
75%,1672.0,3.0,120.0,3.0,7317.0
max,52272.0,40.0,3600.0,43.0,12000000.0


**The 0.1th percentile (quantile 0.001) and the 99.9th percentile (quantile 0.999)**

In [10]:
min_threshold, max_threshold = df['price_per_sqft'].quantile([0.001, 0.999])
min_threshold, max_threshold

(1366.184, 50959.36200000098)

In [11]:
df[df['price_per_sqft'] < min_threshold]

Unnamed: 0,location,size,total_sqft,bath,price,bhk,price_per_sqft
665,Yelahanka,3 BHK,35000.0,3.0,130.0,3,371
798,other,4 Bedroom,10961.0,4.0,80.0,4,729
1867,other,3 Bedroom,52272.0,2.0,140.0,3,267
2392,other,4 Bedroom,2000.0,3.0,25.0,4,1250
3934,other,1 BHK,1500.0,1.0,19.5,1,1300
5343,other,9 BHK,42000.0,8.0,175.0,9,416
5417,Ulsoor,4 BHK,36000.0,4.0,450.0,4,1250
5597,JP Nagar,2 BHK,1100.0,1.0,15.0,2,1363
7166,Yelahanka,1 Bedroom,26136.0,1.0,150.0,1,573
7862,JP Nagar,3 BHK,20000.0,3.0,175.0,3,875


In [12]:
df[df['price_per_sqft'] > max_threshold]

Unnamed: 0,location,size,total_sqft,bath,price,bhk,price_per_sqft
345,other,3 Bedroom,11.0,3.0,74.0,3,672727
1005,other,1 BHK,15.0,1.0,30.0,1,200000
1106,other,5 Bedroom,24.0,2.0,150.0,5,625000
4044,Sarjapur Road,4 Bedroom,1.0,4.0,120.0,4,12000000
4924,other,7 BHK,5.0,7.0,115.0,7,2300000
5911,Mysore Road,1 Bedroom,45.0,1.0,23.0,1,51111
6356,Bommenahalli,4 Bedroom,2940.0,3.0,2250.0,4,76530
7012,other,1 BHK,650.0,1.0,500.0,1,76923
7575,other,1 BHK,425.0,1.0,750.0,1,176470
7799,other,4 BHK,2000.0,3.0,1063.0,4,53150


In [13]:
main_df = df[(df['price_per_sqft'] > min_threshold) & (df['price_per_sqft'] < max_threshold)]
main_df.shape

(13172, 7)
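
It can be useful to check how many rows fall in each tail before dropping them. The sketch below uses synthetic, randomly generated prices (not `bhp.csv`), so the counts it prints say nothing about the real dataset. It also shows `Series.between` with `inclusive="neither"` as an equivalent way to write the two strict comparisons; the string form of `inclusive` assumes pandas 1.3 or newer.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a price_per_sqft column (hypothetical data).
rng = np.random.default_rng(0)
prices = pd.Series(
    rng.lognormal(mean=8.6, sigma=0.4, size=10_000), name="price_per_sqft"
)

low, high = prices.quantile([0.001, 0.999])

# Rows at or beyond each threshold are the ones the strict filter discards.
below = int((prices <= low).sum())
above = int((prices >= high).sum())

# Same condition as (prices > low) & (prices < high).
kept = int(prices.between(low, high, inclusive="neither").sum())

print(below, above, kept)
```

The three counts always add up to the total number of rows, which makes this a quick sanity check on the filter.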

In [14]:
main_df.sample(10)

Unnamed: 0,location,size,total_sqft,bath,price,bhk,price_per_sqft
13004,9th Phase JP Nagar,2 BHK,1331.95,2.0,69.0,2,5180
1462,Bannerghatta,4 Bedroom,810.0,4.0,50.0,4,6172
7543,Banashankari Stage III,4 Bedroom,1200.0,6.0,170.0,4,14166
12213,Hosa Road,2 BHK,1161.0,2.0,48.75,2,4198
4332,other,8 Bedroom,2400.0,9.0,350.0,8,14583
1087,5th Phase JP Nagar,9 Bedroom,812.0,6.0,165.0,9,20320
870,Kanakpura Road,3 BHK,1100.0,3.0,58.0,3,5272
5436,Ramamurthy Nagar,1 Bedroom,600.0,1.0,43.0,1,7166
4433,Rachenahalli,2 BHK,1113.0,2.0,55.0,2,4941
881,other,2 BHK,1180.0,2.0,69.0,2,5847
