## 1. Loading the Data:

In [1]:
# import libraries that will be used
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# read csv file
df = pd.read_csv(r"C:\Users\adiya\Documents\Uni\Data Science\1SA-Final-Project\AB_US_2020.csv", low_memory=False)

## 2. Understanding the Data:

In [3]:
# display the first 5 rows for a quick look
df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,city
0,38585,Charming Victorian home - twin beds + breakfast,165529,Evelyne,,28804,35.65146,-82.62792,Private room,60,1,138,16/02/20,1.14,1,0,Asheville
1,80905,French Chic Loft,427027,Celeste,,28801,35.59779,-82.5554,Entire home/apt,470,1,114,07/09/20,1.03,11,288,Asheville
2,108061,Walk to stores/parks/downtown. Fenced yard/Pet...,320564,Lisa,,28801,35.6067,-82.55563,Entire home/apt,75,30,89,30/11/19,0.81,2,298,Asheville
3,155305,Cottage! BonPaul + Sharky's Hostel,746673,BonPaul,,28806,35.57864,-82.59578,Entire home/apt,90,1,267,22/09/20,2.39,5,0,Asheville
4,160594,Historic Grove Park,769252,Elizabeth,,28801,35.61442,-82.54127,Private room,125,30,58,19/10/15,0.52,1,0,Asheville


In [4]:
# check the shape of the DataFrame (rows, columns)
# understand the amount of data
df.shape

(226030, 17)

In [5]:
# description of data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226030 entries, 0 to 226029
Data columns (total 17 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   id                              226030 non-null  int64  
 1   name                            226002 non-null  object 
 2   host_id                         226030 non-null  int64  
 3   host_name                       225997 non-null  object 
 4   neighbourhood_group             110185 non-null  object 
 5   neighbourhood                   226030 non-null  object 
 6   latitude                        226030 non-null  float64
 7   longitude                       226030 non-null  float64
 8   room_type                       226030 non-null  object 
 9   price                           226030 non-null  int64  
 10  minimum_nights                  226030 non-null  int64  
 11  number_of_reviews               226030 non-null  int64  
 12  last_review     

At first glance, several features (for example "neighbourhood_group",
"last_review", and "reviews_per_month") contain many null values.

### Features in the DataFrame:
0. id: unique id number for each listing
1. name: name of listing
2. host_id: unique host id number
3. host_name: name of host
4. neighbourhood_group: group in which the neighbourhood lies
5. neighbourhood: name of the neighbourhood
6. latitude: latitude of listing
7. longitude: longitude of listing
8. room_type: room type (Entire home/apt, Private room, etc.)
9. price: price of listing per night
10. minimum_nights: minimum number of nights required to book
11. number_of_reviews: total number of reviews on listing
12. last_review: date of last review
13. reviews_per_month: average reviews per month
14. calculated_host_listings_count: total number of listings by host
15. availability_365: number of days a year the listing is available for rent
16. city: region of the listing
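
The rows shown by `df.head()` suggest that 'last_review' is stored as text in day/month/year form (for example 16/02/20). As a minimal sketch, and assuming that format holds for the whole column, the strings could be parsed into proper dates as below. The `sample` Series is a hypothetical stand-in built from a few of the values shown above, not the full dataset.

```python
import pandas as pd

# Hypothetical stand-in for df['last_review'], using values seen in df.head()
sample = pd.Series(["16/02/20", "07/09/20", None, "19/10/15"], name="last_review")

# Assumption: every date follows the dd/mm/yy pattern seen in the first rows.
# errors="coerce" turns anything unparseable (and missing values) into NaT.
parsed = pd.to_datetime(sample, format="%d/%m/%y", errors="coerce")

print(parsed)
```

Parsing the column this way keeps missing dates as `NaT`, so the null count of the column is unchanged.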

## 3. Cleaning the Data:

In [6]:
# further examination of null values
# the methods below calculate the number of missing values in each feature
df.isnull().sum()

id                                     0
name                                  28
host_id                                0
host_name                             33
neighbourhood_group               115845
neighbourhood                          0
latitude                               0
longitude                              0
room_type                              0
price                                  0
minimum_nights                         0
number_of_reviews                      0
last_review                        48602
reviews_per_month                  48602
calculated_host_listings_count         0
availability_365                       0
city                                   0
dtype: int64
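
Raw counts are easier to judge next to the share of rows they represent. The sketch below shows one possible way to report both; `missing_summary` is a hypothetical helper and `sample` is a tiny made-up stand-in for `df`, so the numbers it prints are not those of the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for df (the real data comes from the CSV file)
sample = pd.DataFrame({
    "name": ["A", None, "C", "D"],
    "neighbourhood_group": [np.nan, np.nan, "Manhattan", np.nan],
    "reviews_per_month": [1.14, np.nan, 0.81, 2.39],
    "price": [60, 470, 75, 90],
})

def missing_summary(frame):
    """Return the count and percentage of nulls per column, largest first."""
    counts = frame.isnull().sum()
    percent = (frame.isnull().mean() * 100).round(2)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    return summary.sort_values("missing", ascending=False)

summary = missing_summary(sample)
print(summary)
```

Calling `missing_summary(df)` on the full DataFrame would give the same counts as `df.isnull().sum()` plus a percentage column.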

##### Understanding the features gives insight into how to treat null values.

In [7]:
print("'city' Column Labels: \n", df['city'].unique())
print()
print("'neighbourhood_group' Column Labels: \n", df['neighbourhood_group'].unique())

'city' Column Labels: 
 ['Asheville' 'Austin' 'Boston' 'Broward County' 'Cambridge' 'Chicago'
 'Clark County' 'Columbus' 'Denver' 'Hawaii' 'Jersey City' 'Los Angeles'
 'Nashville' 'New Orleans' 'New York City' 'Oakland' 'Pacific Grove'
 'Portland' 'Rhode Island' 'Salem' 'San Clara Country' 'San Diego'
 'San Francisco' 'San Mateo County' 'Santa Cruz County' 'Seattle'
 'Twin Cities MSA' 'Washington D.C.']

'neighbourhood_group' Column Labels: 
 [nan 'Hawaii' 'Kauai' 'Maui' 'Honolulu' 'Other Cities'
 'City of Los Angeles' 'Unincorporated Areas' 'Manhattan' 'Brooklyn'
 'Queens' 'Staten Island' 'Bronx' 'Providence' 'Washington' 'Newport'
 'Bristol' 'Kent' 'Central Area' 'Other neighborhoods' 'West Seattle'
 'Downtown' 'Ballard' 'Capitol Hill' 'Beacon Hill' 'Seward Park'
 'Queen Anne' 'Rainier Valley' 'Lake City' 'Cascade' 'Delridge'
 'University District' 'Northgate' 'Magnolia' 'Interbay']


The 'city' column (presented above) already records the area of every listing, and it has no missing values.
The 'neighbourhood_group' column, by contrast, is missing in 115,845 of 226,030 rows (about 51%) and adds
little beyond 'city' and 'neighbourhood', so I decided to remove it.
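
One way to support this decision is to check how the missing 'neighbourhood_group' values are spread across cities. The sketch below is only an illustration: `null_share_by_city` is a hypothetical helper and `sample` is a small made-up stand-in for `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for df
sample = pd.DataFrame({
    "city": ["Asheville", "Asheville", "New York City", "New York City", "Seattle"],
    "neighbourhood_group": [np.nan, np.nan, "Manhattan", "Brooklyn", "Ballard"],
})

def null_share_by_city(frame, column="neighbourhood_group"):
    """Return, for each city, the fraction of rows where `column` is null."""
    return frame[column].isnull().groupby(frame["city"]).mean()

share = null_share_by_city(sample)
print(share)
```

A share of 1.0 for a city would mean the column carries no information there at all.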

In [8]:
# dropping 'neighbourhood_group' column
df.drop(columns=['neighbourhood_group'], inplace=True)
# examining the changes
df.head(3)

Unnamed: 0,id,name,host_id,host_name,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,city
0,38585,Charming Victorian home - twin beds + breakfast,165529,Evelyne,28804,35.65146,-82.62792,Private room,60,1,138,16/02/20,1.14,1,0,Asheville
1,80905,French Chic Loft,427027,Celeste,28801,35.59779,-82.5554,Entire home/apt,470,1,114,07/09/20,1.03,11,288,Asheville
2,108061,Walk to stores/parks/downtown. Fenced yard/Pet...,320564,Lisa,28801,35.6067,-82.55563,Entire home/apt,75,30,89,30/11/19,0.81,2,298,Asheville


In [9]:
# confirm that the number of columns dropped from 17 to 16
df.shape

(226030, 16)

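The 'reviews_per_month' column gives each listing's average number of reviews per month. It has 48,602
missing values, the same count as 'last_review', which suggests these listings have never been reviewed.
The missing values can therefore be replaced with 0, meaning a monthly average of 0 reviews.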
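
Before filling, the assumption that a missing 'reviews_per_month' means "no reviews" can be checked against 'number_of_reviews'. The following is a minimal sketch on a hypothetical stand-in DataFrame (`sample`); it has not been run on the real data, so the result for `df` may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for df
sample = pd.DataFrame({
    "number_of_reviews": [138, 0, 89, 0],
    "last_review": ["16/02/20", None, "30/11/19", None],
    "reviews_per_month": [1.14, np.nan, 0.81, np.nan],
})

# Rows where the monthly average is missing
missing_rpm = sample["reviews_per_month"].isnull()
# Rows where the listing has no reviews at all
no_reviews = sample["number_of_reviews"] == 0

# True only if the two conditions mark exactly the same rows
all_missing_have_no_reviews = bool((missing_rpm == no_reviews).all())
print(all_missing_have_no_reviews)

# Fill only 'reviews_per_month'; 'last_review' keeps its missing values
filled = sample.fillna({"reviews_per_month": 0})
print(filled["reviews_per_month"].isnull().sum())
```

If the check holds on the full data, filling with 0 does not invent information: it only states explicitly what 'number_of_reviews' already says.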


In [10]:
# replacing null values in 'reviews_per_month' column with 0
df.fillna({'reviews_per_month': 0}, inplace=True)
# examining changes
df['reviews_per_month'].isnull().sum()

0