## Verify Data Quality

### Data manipulation

In this section, I import the data and apply any transforms this section requires. This should be merged into the general data prep for the rest of the notebook.

In [15]:
import pandas as pd
import pandas_profiling as pp
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [6]:
airbnb = pd.read_csv('C:/Users/William/OneDrive/MLI/AB_NYC_2019.csv')
airbnb.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0



###  Explain any missing values

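Only four columns have missing values. name (16) and host_name (21) are free-text fields missing on a handful of rows out of 48,895, so they should have little effect on the analysis. last_review and reviews_per_month are each missing 10,052 times. The identical counts suggest they are blank on the same rows, most likely listings that have never been reviewed: in the preview above, the one row with number_of_reviews of 0 has both fields blank.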

In [7]:
airbnb.isnull().sum()

id                                    0
name                                 16
host_id                               0
host_name                            21
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                       10052
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64
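
One way to test whether the two review fields are blank on the same rows, and whether those rows are the zero-review listings, is sketched below. The helper name `review_gaps_match_zero_reviews` is hypothetical, and `sample` is a small stand-in copied from the first rows of the preview so the snippet runs on its own; in the notebook the function would be called on `airbnb`.

```python
import numpy as np
import pandas as pd

def review_gaps_match_zero_reviews(df):
    """Return True if both review fields are missing exactly where number_of_reviews is 0."""
    no_reviews = df['number_of_reviews'] == 0
    missing_date = df['last_review'].isnull()
    missing_rate = df['reviews_per_month'].isnull()
    return bool((no_reviews == missing_date).all() and (no_reviews == missing_rate).all())

# Stand-in data taken from the first five preview rows (not the full file).
sample = pd.DataFrame({
    'number_of_reviews': [9, 45, 0, 270, 9],
    'last_review': ['2018-10-19', '2019-05-21', np.nan, '2019-07-05', '2018-11-19'],
    'reviews_per_month': [0.21, 0.38, np.nan, 4.64, 0.10],
})

print(review_gaps_match_zero_reviews(sample))
```

If the check holds on the full data, filling reviews_per_month with 0 for those rows would be a defensible treatment, since no reviews means a rate of zero.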

### Duplicate Data

In [9]:
airbnbDups = airbnb[airbnb.duplicated()]
airbnbDups.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365


There are no duplicate rows in the data, based on the check above.
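
Note that `duplicated()` with no arguments compares every column, including id, so a listing entered twice under different ids would not be caught. A minimal sketch of a looser check follows; `duplicate_listings` is a hypothetical helper, the choice of columns to ignore is an assumption, and `sample` is made-up data for illustration only.

```python
import pandas as pd

def duplicate_listings(df, ignore=('id',)):
    """Return rows that repeat an earlier row once the `ignore` columns are left out."""
    cols = [c for c in df.columns if c not in ignore]
    return df[df.duplicated(subset=cols, keep='first')]

# Made-up rows: ids differ, but the first two listings are otherwise identical.
sample = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Loft', 'Loft', 'Studio'],
    'host_id': [10, 10, 11],
    'price': [100, 100, 80],
})

print(len(sample[sample.duplicated()]))  # full-row check, as in the cell above
print(len(duplicate_listings(sample)))   # same check with id left out
```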

### Outliers

In [17]:
airbnb.drop(columns=['host_id','id']).describe(percentiles=[.01, .05, .25, .5, .75, .95, .99])


Unnamed: 0,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
count,48895.0,48895.0,48895.0,48895.0,48895.0,38843.0,48895.0,48895.0
mean,40.728949,-73.95217,152.720687,7.029962,23.274466,1.373221,7.143982,112.781327
std,0.05453,0.046157,240.15417,20.51055,44.550582,1.680442,32.952519,131.622289
min,40.49979,-74.24442,0.0,1.0,0.0,0.01,1.0,0.0
1%,40.596687,-74.026774,30.0,1.0,0.0,0.02,1.0,0.0
5%,40.646114,-74.00388,40.0,1.0,0.0,0.04,1.0,0.0
25%,40.6901,-73.98307,69.0,1.0,1.0,0.19,1.0,0.0
50%,40.72307,-73.95568,106.0,3.0,5.0,0.72,1.0,45.0
75%,40.763115,-73.936275,175.0,5.0,24.0,2.02,2.0,227.0
95%,40.825643,-73.865771,355.0,30.0,114.0,4.64,15.0,359.0


Outliers would not be relevant for the id and categorical columns. So id, host_id, name, host_name, and room_type would not need attention. Categorical variables would be checked for unusual categories. 
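
The check for unusual categories could look something like the sketch below, which counts each level (missing values included) so that misspellings or rare one-off levels stand out. `category_counts` is a hypothetical helper, and `sample` holds only the five preview rows; in the notebook it would be run on `airbnb`.

```python
import pandas as pd

def category_counts(df, columns):
    """Map each column name to its value counts, with NaN counted as its own level."""
    return {col: df[col].value_counts(dropna=False) for col in columns}

# Stand-in data taken from the first five preview rows.
sample = pd.DataFrame({
    'neighbourhood_group': ['Brooklyn', 'Manhattan', 'Manhattan', 'Brooklyn', 'Manhattan'],
    'room_type': ['Private room', 'Entire home/apt', 'Private room',
                  'Entire home/apt', 'Entire home/apt'],
})

counts = category_counts(sample, ['neighbourhood_group', 'room_type'])
for col, table in counts.items():
    print(table, end='\n\n')
```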

Other columns, based on the describe function above, seem to have reasonable values. Latitude and longitude are within a tight range, suggesting we are not seeing listings far away from NYC. availability_365 seems OK for the most part. A large number of listings seem to be available zero nights; these could be a mistake, or perhaps this field was not collected for all listings. There are no values above 365 either, and it is reasonable for a host to offer their property all year long.
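
The 25th percentile of availability_365 is 0 in the table above, so at least a quarter of listings report no availability. A small sketch for measuring the exact share is below; `share_zero` is a hypothetical helper and `sample` contains only the five preview rows.

```python
import pandas as pd

def share_zero(df, column='availability_365'):
    """Fraction of rows where `column` equals zero."""
    return float((df[column] == 0).mean())

# Stand-in data taken from the first five preview rows.
sample = pd.DataFrame({'availability_365': [365, 355, 365, 194, 0]})

print(f"{share_zero(sample):.0%} of sample listings show zero available nights")
```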

We will take a look at the others on an individual basis. Let's start with price.

In [29]:
airbnb[['name','price','minimum_nights','room_type','neighbourhood']].sort_values(by='price', ascending=False)

Unnamed: 0,name,price,minimum_nights,room_type,neighbourhood
9151,Furnished room in Astoria apartment,10000,100,Private room,Astoria
17692,Luxury 1 bedroom apt. -stunning Manhattan views,10000,5,Entire home/apt,Greenpoint
29238,1-BR Lincoln Center,10000,30,Entire home/apt,Upper West Side
40433,2br - The Heart of NYC: Manhattans Lower East ...,9999,30,Entire home/apt,Lower East Side
12342,"Quiet, Clean, Lit @ LES & Chinatown",9999,99,Private room,Lower East Side
6530,Spanish Harlem Apt,9999,5,Entire home/apt,East Harlem
30268,Beautiful/Spacious 1 bed luxury flat-TriBeCa/Soho,8500,30,Entire home/apt,Tribeca
4377,Film Location,8000,1,Entire home/apt,Clinton Hill
29662,East 72nd Townhouse by (Hidden by Airbnb),7703,1,Entire home/apt,Upper East Side
42523,70' Luxury MotorYacht on the Hudson,7500,1,Entire home/apt,Battery Park City


Previously we showed the 99th percentile for price to be around $800 a night. The listings above far exceed that, some by more than 10x. From the descriptions, these seem to be luxury spaces, event spaces, longer-term leases, or mistakes. A Google search showed rooms at the five-star Four Seasons hotel going for around $1100, which helps make the case that these listings are above and beyond the usual accommodations. They would likely not be of interest to the average tourist, so we may consider excluding listings above $800 a night based on the 99th percentile. We might consider the same treatment on the low end, where some rooms are listed for $0. Again, these appear to be actual listings but likely have special circumstances that would not interest the average tourist. We can also search for a way to exclude lease-style listings.
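
The price treatment described above might be sketched as follows. This is one possible rule, not a settled decision: `trim_price` is a hypothetical helper that drops $0 listings and anything above a chosen quantile. The `sample` values are illustrative, and the demo uses the 90th percentile only because seven rows are too few for a meaningful 99th.

```python
import pandas as pd

def trim_price(df, upper_quantile=0.99):
    """Keep rows with a price above 0 and at or below the given quantile."""
    upper = df['price'].quantile(upper_quantile)
    keep = (df['price'] > 0) & (df['price'] <= upper)
    return df[keep]

# Illustrative prices: a $0 listing, typical values, and one $10000 listing.
sample = pd.DataFrame({'price': [0, 80, 89, 149, 150, 225, 10000]})

trimmed = trim_price(sample, upper_quantile=0.9)
print(trimmed['price'].tolist())
```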

In [30]:
airbnb[['name','price','minimum_nights','room_type','neighbourhood']].sort_values(by='minimum_nights', ascending=False).head(20)

Unnamed: 0,name,price,minimum_nights,room_type,neighbourhood
5767,Prime W. Village location 1 bdrm,180,1250,Entire home/apt,Greenwich Village
2854,,400,1000,Entire home/apt,Battery Park City
38664,Shared Studio (females only),110,999,Shared room,Greenwich Village
13404,Historic Designer 2 Bed. Apartment,99,999,Entire home/apt,Harlem
26341,Beautiful place in Brooklyn! #2,79,999,Private room,Williamsburg
47620,Williamsburg Apartment,140,500,Entire home/apt,Williamsburg
14285,Peaceful apartment close to F/G,45,500,Private room,Kensington
8014,Wonderful Large 1 bedroom,75,500,Entire home/apt,Harlem
11193,Zen Room in Crown Heights Brooklyn,50,500,Private room,Crown Heights
7355,Beautiful Fully Furnished 1 bed/bth,134,500,Entire home/apt,Long Island City


Looking at minimum_nights, the top few observations appear to be mistakes. Apartments often lease for one year, so values around 365 seem reasonable. 500 seems odd, and 999 and above just seem wrong. We could probably make the case for removing observations above 500, and potentially above 365 as well. The data discarded would be small compared to the overall data set.
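
To see how much data such a cutoff would discard before committing to it, something like the sketch below could be used. `split_by_min_nights` is a hypothetical helper, the 365-night threshold is just one of the options mentioned above, and `sample` mixes a few values from the tables in this section.

```python
import pandas as pd

def split_by_min_nights(df, max_nights=365):
    """Return (kept, dropped) frames split on minimum_nights > max_nights."""
    too_long = df['minimum_nights'] > max_nights
    return df[~too_long], df[too_long]

# A few minimum_nights values seen in the outputs above.
sample = pd.DataFrame({'minimum_nights': [1, 3, 10, 500, 999, 1250]})

kept, dropped = split_by_min_nights(sample)
print(f"kept {len(kept)}, dropped {len(dropped)}")
```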

In [34]:
airbnb[['name','price','minimum_nights','number_of_reviews','room_type','neighbourhood','availability_365']].sort_values(by='number_of_reviews', ascending=False).head(20)

Unnamed: 0,name,price,minimum_nights,number_of_reviews,room_type,neighbourhood,availability_365
11759,Room near JFK Queen Bed,47,1,629,Private room,Jamaica,333
2031,Great Bedroom in Manhattan,49,1,607,Private room,Harlem,293
2030,Beautiful Bedroom in Manhattan,49,1,597,Private room,Harlem,342
2015,Private Bedroom in Manhattan,49,1,594,Private room,Harlem,339
13495,Room Near JFK Twin Beds,47,1,576,Private room,Jamaica,173
10623,Steps away from Laguardia airport,46,1,543,Private room,East Elmhurst,163
1879,Manhattan Lux Loft.Like.Love.Lots.Look !,99,2,540,Private room,Lower East Side,179
20403,Cozy Room Family Home LGA Airport NO CLEANING FEE,48,1,510,Private room,East Elmhurst,341
4870,Private brownstone studio Brooklyn,160,1,488,Entire home/apt,Park Slope,269
471,LG Private Room/Family Friendly,60,3,480,Private room,Bushwick,0


Looking at number of reviews, these values are very high, but at first glance I have no reason to suspect they are not real. It would take time to accumulate this many reviews, but apart from last_review I have no date information to give me clues that they are invalid. Depending on the algorithm used, we could keep them as they are or bin them to stop them from over-influencing the model.
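
If binning is chosen, a minimal sketch with `pd.cut` is shown below. The bin edges and labels are assumptions picked for illustration, not values derived from the data, and `bin_reviews` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Assumed edges: each bin includes its left edge and excludes its right edge.
REVIEW_EDGES = [0, 1, 10, 50, 100, np.inf]
REVIEW_LABELS = ['0', '1-9', '10-49', '50-99', '100+']

def bin_reviews(series):
    """Convert review counts into ordered categories so extreme counts share one bin."""
    return pd.cut(series, bins=REVIEW_EDGES, labels=REVIEW_LABELS, right=False)

# Review counts taken from the outputs above.
sample = pd.Series([0, 9, 45, 270, 629])

print(bin_reviews(sample).astype(str).tolist())
```

With these edges, 270 and 629 land in the same top bin, which is the intended effect: the model sees "many reviews" rather than the raw magnitude.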

In [36]:
airbnb[['name','price','reviews_per_month','number_of_reviews','minimum_nights','room_type','neighbourhood','availability_365']].sort_values(by='reviews_per_month', ascending=False).head(20)

Unnamed: 0,name,price,reviews_per_month,number_of_reviews,minimum_nights,room_type,neighbourhood,availability_365
42075,Enjoy great views of the City in our Deluxe Room!,100,58.5,156,1,Private room,Theater District,299
42076,Great Room in the heart of Times Square!,199,27.95,82,1,Private room,Theater District,299
38870,Lou's Palace-So much for so little,45,20.94,37,1,Private room,Rosedale,134
27287,JFK Comfort.5 Mins from JFK Private Bedroom & ...,80,19.75,403,1,Private room,Springfield Gardens,26
28651,JFK 2 Comfort 5 Mins from JFK Private Bedroom,50,17.82,341,1,Private room,Springfield Gardens,25
29628,JFK 3 Comfort 5 Mins from JFK Private Bedroom,50,16.81,302,1,Private room,Springfield Gardens,26
20403,Cozy Room Family Home LGA Airport NO CLEANING FEE,48,16.22,510,1,Private room,East Elmhurst,341
22469,Cute Tiny Room Family Home by LGA NO CLEANING FEE,48,16.03,436,1,Private room,East Elmhurst,337
36238,“For Heaven Cakes”,75,15.78,132,1,Entire home/apt,Springfield Gardens,28
40297,Studio Apartment 6 minutes from JFK Airport,67,15.32,95,1,Private room,Jamaica,145


For reviews per month, the top observation, 58.5, must be a mistake. The minimum stay is one night, so a listing can host at most one stay per night, and the maximum value should be about 31 reviews in a month.
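
A simple way to flag such rows is sketched below. `implausible_review_rate` is a hypothetical helper, and the cap of 31 follows the one-review-per-night reasoning above; `sample` holds the top three rows of the previous output.

```python
import pandas as pd

def implausible_review_rate(df, max_per_month=31):
    """Return rows whose reviews_per_month exceeds the cap; missing rates are not flagged."""
    return df[df['reviews_per_month'] > max_per_month]

# Top three rows from the reviews_per_month output above.
sample = pd.DataFrame({
    'reviews_per_month': [58.5, 27.95, 20.94],
    'number_of_reviews': [156, 82, 37],
    'minimum_nights': [1, 1, 1],
}, index=[42075, 42076, 38870])

flagged = implausible_review_rate(sample)
print(flagged.index.tolist())
```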