# Outlier Detection Using the Percentile Method

# Author/Created by: Ajay Taneja

# Date: September - October 2022.

# Removal of outliers using the percentile method on the Airbnb dataset (see the References section for the link to the dataset)

In this notebook, we use the Airbnb New York City dataset and remove outliers with the percentile method, based on the price per night of each apartment/home. The percentile limits are a judgment call; this notebook uses the 2nd and 98th percentiles. The goal is a new pandas DataFrame with the outliers removed.

In [20]:
#Load the dataset

import pandas as pd
df = pd.read_csv("AB_NYC_2019.csv")
df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


In [21]:
#We have to remove the outliers using the percentile method. We use the price per night as the indicator
#to detect the outliers

#First, let us use the describe() function to get some summary statistics

df.describe()

Unnamed: 0,id,host_id,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
count,48895.0,48895.0,48895.0,48895.0,48895.0,48895.0,48895.0,38843.0,48895.0,48895.0
mean,19017140.0,67620010.0,40.728949,-73.95217,152.720687,7.029962,23.274466,1.373221,7.143982,112.781327
std,10983110.0,78610970.0,0.05453,0.046157,240.15417,20.51055,44.550582,1.680442,32.952519,131.622289
min,2539.0,2438.0,40.49979,-74.24442,0.0,1.0,0.0,0.01,1.0,0.0
25%,9471945.0,7822033.0,40.6901,-73.98307,69.0,1.0,1.0,0.19,1.0,0.0
50%,19677280.0,30793820.0,40.72307,-73.95568,106.0,3.0,5.0,0.72,1.0,45.0
75%,29152180.0,107434400.0,40.763115,-73.936275,175.0,5.0,24.0,2.02,2.0,227.0
max,36487240.0,274321300.0,40.91306,-73.71299,10000.0,1250.0,629.0,58.5,327.0,365.0


In [22]:
#Looking at the price column, the maximum is $10000 per night and the minimum is $0 per night. There seem to be some
#outliers in the dataset

In [23]:
min_threshold, max_threshold = df.price.quantile([0.02, 0.98])
min_threshold, max_threshold

(35.0, 550.0)
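
The unpacking in the cell above works because `quantile` returns a Series when it is given a list of quantiles. Here is a minimal sketch of the same call on made-up data. The `toy_prices` series and the `toy_min` / `toy_max` names are assumptions for illustration and are not part of the Airbnb file.

```python
import pandas as pd

# Toy data (an assumption, not the Airbnb prices): the integers 0..100,
# so each percentile lands on roughly the value of the same number.
toy_prices = pd.Series(range(101))

# A list of quantiles returns a Series indexed by those quantiles;
# unpacking that Series yields its values in order.
toy_min, toy_max = toy_prices.quantile([0.02, 0.98])
print(toy_min, toy_max)
```

With pandas' default linear interpolation, the two thresholds should come out at about 2 and 98 for this series.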

In [24]:
#Let us now see the data points where the price per night is less than the minimum threshold
df[(df.price < min_threshold)]

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
469,165824,Lady only Curtain-divided room,733894,Lucy,Queens,Sunnyside,40.74000,-73.91901,Private room,33,44,31,2019-05-01,0.32,3,161
747,270315,Bed-stuy Royal Room,1398639,Juliet,Brooklyn,Bedford-Stuyvesant,40.68812,-73.93254,Private room,34,10,16,2019-03-27,0.19,3,216
776,278145,Large Room in a Huge NY apartment.,1452026,Heidi,Queens,Astoria,40.77117,-73.91905,Private room,30,5,3,2017-06-20,0.03,1,0
845,296844,"HISTORIC WILLIAMSBURG, BKLYN #1",839679,Brady,Brooklyn,Williamsburg,40.71653,-73.95554,Private room,30,3,24,2013-12-04,0.28,3,0
957,375249,Enjoy Staten Island Hospitality,1887999,Rimma & Jim,Staten Island,Graniteville,40.62109,-74.16534,Private room,20,3,80,2019-05-26,0.92,1,226
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48845,36454025,Private 5 star room,261338177,Diana,Brooklyn,Gravesend,40.59131,-73.97114,Private room,33,2,0,,,6,318
48847,36455321,#6 New Hotel-Like Private Room QUEEN Bed near JFK,263504959,David,Queens,Woodhaven,40.69183,-73.86523,Private room,34,1,0,,,8,320
48852,36455809,"Cozy Private Room in Bushwick, Brooklyn",74162901,Christine,Brooklyn,Bushwick,40.69805,-73.92801,Private room,30,1,1,2019-07-08,1.00,1,1
48867,36473044,The place you were dreaming for.(only for guys),261338177,Diana,Brooklyn,Gravesend,40.59080,-73.97116,Shared room,25,1,0,,,6,338


In [25]:
# Similarly, let us see the data points where the price is greater than the maximum threshold
df[df.price > max_threshold]

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
85,19601,perfect for a family or small group,74303,Maggie,Brooklyn,Brooklyn Heights,40.69723,-73.99268,Entire home/apt,800,1,25,2016-08-04,0.24,1,7
299,68974,Unique spacious loft on the Bowery,281229,Alicia,Manhattan,Little Italy,40.71943,-73.99627,Entire home/apt,575,2,191,2019-06-20,1.88,1,298
345,89427,The Brooklyn Waverly,116599,Sahr,Brooklyn,Clinton Hill,40.68613,-73.96536,Entire home/apt,650,5,0,,,3,365
365,103311,2 BR w/ Terrace @ Box House Hotel,417504,The Box House Hotel,Brooklyn,Greenpoint,40.73861,-73.95485,Private room,599,3,9,2018-05-19,0.09,28,60
496,174966,Luxury 2Bed/2.5Bath Central Park View,836168,Henry,Manhattan,Upper West Side,40.77350,-73.98697,Entire home/apt,2000,30,30,2018-05-05,0.33,11,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48523,36308562,"Tasteful & Trendy Brooklyn Brownstone, near Train",217732163,Sandy,Brooklyn,Bedford-Stuyvesant,40.68767,-73.95805,Entire home/apt,1369,1,0,,,1,349
48535,36311055,"Stunning & Stylish Brooklyn Luxury, near Train",245712163,Urvashi,Brooklyn,Bedford-Stuyvesant,40.68245,-73.93417,Entire home/apt,1749,1,0,,,1,303
48697,36388720,HUGE LUXURY CONDO – INCREDIBLE WATER VIEWS,273619215,Layla,Manhattan,Upper West Side,40.77665,-73.98867,Entire home/apt,750,4,0,,,1,174
48757,36419574,Luxury & Spacious 1500 ft² MANHATTAN Townhouse,11454384,Ellen,Manhattan,Tribeca,40.71815,-74.01145,Entire home/apt,700,3,0,,,1,37
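
Before dropping anything, it can help to count how many rows each tail would flag. The sketch below does this with boolean masks on a small made-up frame. The `toy` DataFrame and the 10th/90th percentile limits are assumptions chosen so the example stays tiny; the notebook itself uses 0.02 and 0.98.

```python
import pandas as pd

# Hypothetical prices with one suspiciously low and one suspiciously high value.
toy = pd.DataFrame({"price": [0, 20, 40, 60, 80, 100, 120, 140, 160, 5000]})

# Wider limits than the notebook's, only because the toy frame has 10 rows.
low, high = toy["price"].quantile([0.10, 0.90])

# Boolean masks mark the rows in each tail; summing a mask counts its True values.
below = toy["price"] < low
above = toy["price"] > high
print("below:", below.sum(), "above:", above.sum())
```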


In [26]:
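#Let's keep the minimum threshold at the 2nd percentile and the maximum threshold at the 98th percentile
#Let us remove the outliers outside this range (the strict comparisons also drop rows priced exactly at a threshold)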

df2 = df[(df.price > min_threshold) & (df.price < max_threshold)]
df2.sample(10)

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
48243,36149281,My Cozy Two Bedroom Apt in Soho - Very Conve,1387341,Dan,Manhattan,SoHo,40.72403,-73.997,Entire home/apt,250,2,0,,,1,247
15954,12898943,Beautiful Pre-War on Prospect Park,10259868,Meredith,Brooklyn,Crown Heights,40.66676,-73.95991,Private room,85,7,0,,,1,0
8092,6244461,Bright studio in the heart of Brooklyn,7290581,Juliana,Brooklyn,Prospect Heights,40.67847,-73.96694,Entire home/apt,150,2,2,2017-01-14,0.06,1,0
35730,28355651,A Cozy Room in New York Apartment,214095681,Pavel And Sarah,Queens,Sunnyside,40.74403,-73.92147,Private room,65,2,64,2019-07-01,6.34,2,6
23334,18893669,"Master Bedroom w/ TV, AC, King Sized Bed",31314659,Muneer,Manhattan,East Harlem,40.79439,-73.94129,Private room,75,7,0,,,1,0
18862,14968000,Spacious Comtemporary 3BR Apartment,24805685,Saif,Brooklyn,Park Slope,40.67124,-73.98625,Entire home/apt,218,2,117,2019-07-08,3.41,1,226
31002,24038323,Dreamy Williamsburg Bedroom with Private Terrace,32720219,Gizem,Brooklyn,Williamsburg,40.70894,-73.94333,Private room,120,4,10,2019-06-17,0.66,2,8
29390,22545036,Cozy Room Close to Columbia,12541753,Yiou,Manhattan,Upper West Side,40.80034,-73.96127,Private room,39,3,1,2018-01-13,0.06,1,0
6013,4402117,GORGEOUS 2 Bedroom in Queens NYC,22780462,Mark,Queens,Kew Gardens Hills,40.72891,-73.82588,Entire home/apt,149,3,82,2019-06-23,1.55,2,327
28728,22207770,AWESOME PRIVATE FAMILY HOUSE CLOSE TO MANHATTAN,155989291,Ahmed,Queens,Corona,40.74725,-73.85486,Entire home/apt,359,3,10,2019-06-09,0.66,1,281


In [27]:
df2.price.describe()

count    46465.000000
mean       135.165350
std         88.602677
min         36.000000
25%         70.000000
50%        109.000000
75%        175.000000
max        549.000000
Name: price, dtype: float64
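
The threshold and filter steps can be wrapped in one reusable function. This is only a sketch under stated assumptions: the function name `drop_percentile_outliers`, its parameters, and the `toy` frame are hypothetical and do not appear in the original notebook. It keeps the notebook's strict comparisons, so rows exactly at a threshold are also dropped.

```python
import pandas as pd

def drop_percentile_outliers(frame, column, lower=0.02, upper=0.98):
    """Return the rows whose `column` value lies strictly between two percentiles."""
    low, high = frame[column].quantile([lower, upper])
    # Strict comparisons, as in the notebook: values equal to a threshold are removed.
    keep = (frame[column] > low) & (frame[column] < high)
    return frame[keep]

# Made-up data for illustration only.
toy = pd.DataFrame({"price": [0, 20, 40, 60, 80, 100, 120, 140, 160, 5000]})
trimmed = drop_percentile_outliers(toy, "price", lower=0.10, upper=0.90)
print(len(toy) - len(trimmed), "rows removed")
```

On the real data, the equivalent call would be `drop_percentile_outliers(df, "price")`, which should give the same rows as `df2` above.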

References:
1) Airbnb dataset from Kaggle: https://www.kaggle.com/datasets/dgomonov/new-york-city-airbnb-open-data
2) Codebasics: https://youtu.be/7sJaRHF03K8?list=PLeo1K3hjS3ut5olrDIeVXk9N3Q7mKhDxO