# Methodology

1. Review columns and their meaning

2. Check for missing values and deal with them

3. Create plots when necessary

4. Transform features

5. Create more plots to visualize the data and come up with some assumptions

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
airbnb_df = pd.read_csv('data/AB_NYC_2019.csv')

In [3]:
airbnb_df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


In [4]:
airbnb_df.columns

Index(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
       'neighbourhood', 'latitude', 'longitude', 'room_type', 'price',
       'minimum_nights', 'number_of_reviews', 'last_review',
       'reviews_per_month', 'calculated_host_listings_count',
       'availability_365'],
      dtype='object')

In [5]:
airbnb_df.shape

(48895, 16)

All of these features seem to make sense except 'availability_365', which is rather ambiguous. Does it mean availability over the last 365 days, the next 365 days, or just the current calendar year?

Let's look at the missing values for each feature.

In [6]:
airbnb_df.isna().sum()

id                                    0
name                                 16
host_id                               0
host_name                            21
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                       10052
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64

Only a very small number of values are missing from the name and host_name columns: 16 and 21 out of 48,895 rows, so it is safe to ignore them. The last_review and reviews_per_month columns are more interesting: each is missing 10,052 values, roughly 20% of the rows.

Based on the feature names, we can assume the two are related. Let's try to verify this.
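The missing share per column can be read off directly by averaging the missing-value mask. The sketch below uses a small made-up frame (`toy_df`, a hypothetical stand-in for `airbnb_df`) so that it runs on its own; on the real data the equivalent call would be `airbnb_df.isna().mean()`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for airbnb_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    'name': ['a', 'b', None, 'd', 'e'],
    'number_of_reviews': [9, 0, 3, 0, 1],
    'reviews_per_month': [0.21, np.nan, 0.5, np.nan, 0.1],
})

# isna() gives a boolean mask; the column mean of that mask is the missing share.
missing_share = toy_df.isna().mean()

# Express the share as a percentage, rounded to one decimal place.
missing_pct = (missing_share * 100).round(1)
print(missing_pct)
```

Working with shares rather than raw counts makes it easier to compare columns at a glance.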

In [7]:
airbnb_df['last_review'].value_counts(dropna = False)

NaN           10052
2019-06-23     1413
2019-07-01     1359
2019-06-30     1341
2019-06-24      875
              ...  
2015-04-27        1
2014-09-21        1
2015-06-13        1
2014-04-02        1
2015-07-08        1
Name: last_review, Length: 1765, dtype: int64

In [8]:
airbnb_df['reviews_per_month'].value_counts(dropna = False)

NaN      10052
0.02       919
0.05       893
1.00       893
0.03       804
         ...  
10.23        1
8.94         1
6.04         1
7.61         1
10.67        1
Name: reviews_per_month, Length: 938, dtype: int64

In [9]:
# airbnb_df[airbnb_df['last_review'].isna() == airbnb_df['reviews_per_month'].isna()]
last_review_missing_idx = airbnb_df.loc[airbnb_df['last_review'].isna()].index.tolist()
reviews_per_month_missing_idx = airbnb_df.loc[airbnb_df['reviews_per_month'].isna()].index.tolist()

In [10]:
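def index_comparision(list_1, list_2):
    """Count the positions at which two index lists agree or differ."""
    num_same_idx, num_diff_idx = 0, 0

    # Compare the lists position by position; zip stops at the shorter list.
    for idx_1, idx_2 in zip(list_1, list_2):
        if idx_1 == idx_2:
            num_same_idx += 1
        else:
            num_diff_idx += 1

    # Entries without a partner in the other list also count as differences.
    num_diff_idx += abs(len(list_1) - len(list_2))

    print('Number of same indices: {}\nNumber of different indices: {}'.format(num_same_idx, num_diff_idx))

index_comparision(last_review_missing_idx, reviews_per_month_missing_idx)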

Number of same indices: 10052
Number of different indices: 0


It looks like these values are missing in the same rows. A safe assumption would be that these listings simply have no reviews.
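The same check can also be done without building index lists, by comparing the two missing-value masks row by row. This is only a sketch of an alternative, again on a made-up frame (`toy_df` is hypothetical), not the approach used in the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for airbnb_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    'last_review': ['2018-10-19', None, '2019-07-05', None],
    'reviews_per_month': [0.21, np.nan, 4.64, np.nan],
})

last_review_missing = toy_df['last_review'].isna()
reviews_per_month_missing = toy_df['reviews_per_month'].isna()

# True only if every row is missing in both columns or present in both.
masks_match = bool((last_review_missing == reviews_per_month_missing).all())

# Number of rows where exactly one of the two columns is missing.
num_mismatched = int((last_review_missing ^ reviews_per_month_missing).sum())

print(masks_match, num_mismatched)
```

Because the masks share the frame's index, this comparison is aligned by row label rather than by list position.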

Let's cross-check this against 'availability_365'. We will assume that a listing that is not available at all would have no reviews, because Airbnb lets a host mark a listing as unavailable without taking it down completely.

Confirming this would require additional information. Of course, the opposite could also be true: the listing is so popular that it is completely booked.

In [11]:
airbnb_df['availability_365'].value_counts()

0      17533
365     1295
364      491
1        408
89       361
       ...  
195       26
196       24
183       24
181       23
202       20
Name: availability_365, Length: 366, dtype: int64

The most common value by far is 0: 17,533 of the 48,895 listings (about 36%) show no availability at all. These must be either very popular listings or listings that the host has made unavailable.
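To see how large a share each availability value represents, `value_counts` can return relative frequencies instead of raw counts. The series below is made up for illustration (`availability` is a hypothetical stand-in for `airbnb_df['availability_365']`).

```python
import pandas as pd

# Hypothetical availability values; made up for illustration.
availability = pd.Series([0, 0, 365, 12, 0, 89, 365, 200], name='availability_365')

# normalize=True returns relative frequencies instead of raw counts.
shares = availability.value_counts(normalize=True)

# Share of listings with zero available days (looked up by label, not position).
share_unavailable = shares.loc[0]
print(shares)
```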

In [12]:
airbnb_df.loc[(airbnb_df['availability_365'] == 0) & (airbnb_df['last_review'].isna())]

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
26,8700,Magnifique Suite au N de Manhattan - vue Cloitres,26394,Claude & Sophie,Manhattan,Inwood,40.86754,-73.92639,Private room,80,4,0,,,1,0
193,51438,1 Bedroom in 2 Bdrm Apt- Upper East,236421,Jessica,Manhattan,Upper East Side,40.77333,-73.95199,Private room,130,14,0,,,2,0
267,64015,Prime East Village 1 Bedroom,146944,David,Manhattan,East Village,40.72807,-73.98594,Entire home/apt,200,3,0,,,1,0
276,65556,"Room in S3rd/Bedford, Williamsburg",320422,Marlon,Brooklyn,Williamsburg,40.71368,-73.96260,Private room,60,3,0,,,1,0
390,118680,Spacious East Village apt near it all,599354,Bobby,Manhattan,East Village,40.73067,-73.98702,Private room,87,2,0,,,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48550,36313048,Sunny room with private entrance in shared home,16883913,Tiffany,Queens,Ridgewood,40.69919,-73.89902,Private room,45,1,0,,,1,0
48731,36410519,Sunlight charming apt. in the heart of Brooklyn,121384174,Luciana Paula,Brooklyn,Park Slope,40.66716,-73.98101,Entire home/apt,111,8,0,,,1,0
48756,36419441,Murray Hill Masterpiece,273824202,David,Manhattan,Murray Hill,40.74404,-73.97239,Entire home/apt,129,2,0,,,1,0
48760,36420725,"Sunnyside, Queens 15 Mins to Midtown Clean & C...",19990280,Brandon,Queens,Sunnyside,40.74719,-73.91919,Private room,46,1,0,,,1,0


In [13]:
airbnb_df.loc[(airbnb_df['number_of_reviews'] == 0) & (airbnb_df['last_review'].isna())]

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.94190,Private room,150,3,0,,,1,365
19,7750,Huge 2 BR Upper East Cental Park,17985,Sing,Manhattan,East Harlem,40.79685,-73.94872,Entire home/apt,190,7,0,,,2,249
26,8700,Magnifique Suite au N de Manhattan - vue Cloitres,26394,Claude & Sophie,Manhattan,Inwood,40.86754,-73.92639,Private room,80,4,0,,,1,0
36,11452,Clean and Quiet in Brooklyn,7355,Vt,Brooklyn,Bedford-Stuyvesant,40.68876,-73.94312,Private room,35,60,0,,,1,365
38,11943,Country space in the city,45445,Harriet,Brooklyn,Flatbush,40.63702,-73.96327,Private room,150,1,0,,,1,365
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48890,36484665,Charming one bedroom - newly renovated rowhouse,8232441,Sabrina,Brooklyn,Bedford-Stuyvesant,40.67853,-73.94995,Private room,70,2,0,,,2,9
48891,36485057,Affordable room in Bushwick/East Williamsburg,6570630,Marisol,Brooklyn,Bushwick,40.70184,-73.93317,Private room,40,4,0,,,2,36
48892,36485431,Sunny Studio at Historical Neighborhood,23492952,Ilgar & Aysel,Manhattan,Harlem,40.81475,-73.94867,Entire home/apt,115,10,0,,,1,27
48893,36485609,43rd St. Time Square-cozy single bed,30985759,Taz,Manhattan,Hell's Kitchen,40.75751,-73.99112,Shared room,55,1,0,,,6,2


Filtering on 'availability_365' showed that 4845 entries have 0 availability and a missing 'last_review'. Filtering on 'number_of_reviews' equal to 0 together with a missing 'last_review' returned every one of the rows with a missing value.

This is a very strong indicator that the missing values are really just 0 values, assuming we can trust our data, of course.
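Rather than reading row counts off the displayed tables, the counts can be computed from the boolean masks themselves. The sketch below does this on a made-up frame (`toy_df` and its values are hypothetical); the same masks could be built from `airbnb_df`.

```python
import pandas as pd

# Hypothetical stand-in for airbnb_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    'number_of_reviews': [9, 0, 0, 270, 0],
    'last_review': ['2018-10-19', None, None, '2019-07-05', None],
    'availability_365': [365, 0, 365, 194, 0],
})

no_reviews = toy_df['number_of_reviews'] == 0
missing_last_review = toy_df['last_review'].isna()
unavailable = toy_df['availability_365'] == 0

# Summing a boolean mask counts the rows where it is True.
n_missing = int(missing_last_review.sum())
n_missing_and_unavailable = int((missing_last_review & unavailable).sum())

# If the two masks agree on every row, 'last_review' is missing
# exactly when the listing has zero reviews.
missing_iff_no_reviews = bool((no_reviews == missing_last_review).all())

print(n_missing, n_missing_and_unavailable, missing_iff_no_reviews)
```

Checking that the masks agree on every row is a stricter test than comparing row counts, since two different sets of rows could happen to have the same size.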

In [14]:
airbnb_df['last_review'] = airbnb_df['last_review'].fillna(value = 0)
airbnb_df['reviews_per_month'] = airbnb_df['reviews_per_month'].fillna(value = 0)

Now we want to do a few things: 

1. Transform and/or remove last_review column.
2. Transform room_type to a dummy variable.
3. Transform host_name to find gender of the host.
4. Use name to figure out the time frame and whether it falls around certain holidays.
5. Consider changing neighbourhood to dummy variable.
6. Drop id, host_id, neighbourhood_group, latitude, longitude, last_review.

Of course, we will create visualizations to support each of these steps.
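As a rough sketch of how steps 2 and 6 might look, the block below encodes room_type as dummy variables and drops the identifier columns. It is one possible approach, not the final implementation: `toy_df` holds only a few columns from the first rows shown earlier, and dropping the original room_type column after encoding is an assumption made here.

```python
import pandas as pd

# Small stand-in for airbnb_df with a subset of its columns.
toy_df = pd.DataFrame({
    'id': [2539, 2595, 3647],
    'host_id': [2787, 2845, 4632],
    'room_type': ['Private room', 'Entire home/apt', 'Private room'],
    'price': [149, 225, 150],
})

# Step 2: one indicator column per room type.
room_dummies = pd.get_dummies(toy_df['room_type'], prefix='room_type')

# Step 6: columns to drop; errors='ignore' skips names absent from the toy frame.
cols_to_drop = ['id', 'host_id', 'neighbourhood_group',
                'latitude', 'longitude', 'last_review']

# Assumption: the original room_type column is dropped once it is encoded.
remaining = toy_df.drop(columns=cols_to_drop + ['room_type'], errors='ignore')
transformed = pd.concat([remaining, room_dummies], axis=1)

print(transformed.columns.tolist())
```

If the encoded features later feed a linear model, passing `drop_first=True` to `pd.get_dummies` would avoid perfectly collinear indicator columns.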