## CS5304/INFO5304 - Data Science: Airbnb Project
William Tung - wt275
### EDA & Data Cleaning

In [3]:
# Packages
import numpy as np
import pandas as pd

#### Preliminary EDA & Dataset Overview

In [23]:
# Airbnb NYC 2019 Data
data = pd.read_csv('data/AB_NYC_2019.csv')
print('Dataset Shape:',data.shape)
data.describe()

Dataset Shape: (48895, 16)


| | id | host_id | latitude | longitude | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 48895.0 | 48895.0 | 48895.0 | 48895.0 | 48895.0 | 48895.0 | 48895.0 | 38843.0 | 48895.0 | 48895.0 |
| mean | 19017140.0 | 67620010.0 | 40.728949 | -73.95217 | 152.720687 | 7.029962 | 23.274466 | 1.373221 | 7.143982 | 112.781327 |
| std | 10983110.0 | 78610970.0 | 0.05453 | 0.046157 | 240.15417 | 20.51055 | 44.550582 | 1.680442 | 32.952519 | 131.622289 |
| min | 2539.0 | 2438.0 | 40.49979 | -74.24442 | 0.0 | 1.0 | 0.0 | 0.01 | 1.0 | 0.0 |
| 25% | 9471945.0 | 7822033.0 | 40.6901 | -73.98307 | 69.0 | 1.0 | 1.0 | 0.19 | 1.0 | 0.0 |
| 50% | 19677280.0 | 30793820.0 | 40.72307 | -73.95568 | 106.0 | 3.0 | 5.0 | 0.72 | 1.0 | 45.0 |
| 75% | 29152180.0 | 107434400.0 | 40.763115 | -73.936275 | 175.0 | 5.0 | 24.0 | 2.02 | 2.0 | 227.0 |
| max | 36487240.0 | 274321300.0 | 40.91306 | -73.71299 | 10000.0 | 1250.0 | 629.0 | 58.5 | 327.0 | 365.0 |


##### Column Types

In [33]:
print('Columns and dtype')
data.dtypes

Columns and dtype


id                                  int64
name                               object
host_id                             int64
host_name                          object
neighbourhood_group                object
neighbourhood                      object
latitude                          float64
longitude                         float64
room_type                          object
price                               int64
minimum_nights                      int64
number_of_reviews                   int64
last_review                        object
reviews_per_month                 float64
calculated_host_listings_count      int64
availability_365                    int64
dtype: object

##### Missing Values Count
The missing value counts per column show that the dataset is relatively clean except for four columns. 'last_review' and 'reviews_per_month' each have 10,052 missing values, while 'name' and 'host_name' are missing 16 and 21 entries, respectively. No other column has values marked as null or N/A. After checking the remaining columns against their descriptive purpose, we conclude that missing values in this dataset are denoted by 'NaN', and that zero-valued entries elsewhere appear to be valid for their respective columns.
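The distinction between zeros and NaN described above can be checked directly. The following is a minimal sketch on a hypothetical toy frame (the rows and the names `toy` and `summary` are made up for illustration and are not part of the notebook); the same two expressions could be applied to `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Airbnb data (values are made up).
toy = pd.DataFrame({
    'price': [0, 150, 89],
    'number_of_reviews': [0, 12, 0],
    'reviews_per_month': [np.nan, 0.4, np.nan],
    'availability_365': [0, 365, 20],
})

# Restrict to numeric columns, then count exact zeros and NaN separately.
# NaN never compares equal to 0, so the two counts do not overlap.
numeric = toy.select_dtypes(include='number')
summary = pd.DataFrame({
    'zeros': (numeric == 0).sum(),
    'missing': numeric.isna().sum(),
})
print(summary)
```

Reading the two counts side by side makes it easier to judge, column by column, whether a zero is a plausible value or a disguised missing entry.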

As for missing value handling, we can leave 'name' and 'host_name' as is, since those fields aren't needed to identify the listing or the host, each of which has a corresponding ID field. Potential analysis related to these name fields could revolve around the 'attractiveness' of the listing name or possible bias towards a host name. If that type of analysis is explored, these fields will be segmented and handled separately.

Since the columns 'last_review' and 'reviews_per_month' are used for quantitative analysis, we will need to fill in their missing values. These missing entries are likely tied to listings that have not yet had a review or a guest. Both of these scenarios are completely valid.

In [32]:
print('Columns and Missing Value Counts')
data.isna().sum()

Columns and Missing Value Counts


id                                    0
name                                 16
host_id                               0
host_name                            21
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                       10052
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64

In [54]:
# Count rows with zero reviews where both review columns are NaN
no_reviews = (data['number_of_reviews'] == 0) & data['last_review'].isna() \
    & data['reviews_per_month'].isna()
print('# of common occurrences:', no_reviews.sum())

# of common occurrences: 10052


In fact, 'number_of_reviews' is zero wherever 'last_review' and 'reviews_per_month' are NaN: the NaN count and the common occurrence count are both 10,052. We will therefore fill these NaN entries with zeros so that the columns can be used for quantitative analysis. Zero is a valid replacement given the meaning of these columns, and it is consistent with 'number_of_reviews', which already uses zero for listings without reviews.
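A minimal sketch of the zero fill for 'reviews_per_month', using a hypothetical toy frame with made-up rows (the names `toy`, `filled`, and `consistent` are illustrative assumptions, not from the notebook). 'last_review' is left out of the sketch because it holds date strings rather than numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows (made-up values) mimicking the review columns.
toy = pd.DataFrame({
    'number_of_reviews': [0, 12, 0],
    'reviews_per_month': [np.nan, 0.4, np.nan],
})

# Replace NaN with 0 on a copy so the raw data stays untouched.
filled = toy.copy()
filled['reviews_per_month'] = filled['reviews_per_month'].fillna(0)

# Sanity check: every listing with zero reviews now has a rate of 0.
zero_review_rows = filled['number_of_reviews'] == 0
consistent = (filled.loc[zero_review_rows, 'reviews_per_month'] == 0).all()

print(filled)
print('Zero-review listings all have rate 0:', consistent)
```

Filling on a copy keeps the original NaN markers available in case a later step needs to distinguish "never reviewed" from a genuinely low review rate.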