### Capstone Project 
#### Project Title: Using Machine Learning to develop a recommendation system for London neighbourhoods
#### Student Name: Zinia Bhattacharya
#### Notebook 1 - Data Preprocessing & EDA


### Context:<br>

Airbnb currently offers a number of filters that help guests personalise their property search, e.g.<br>
- Price range
- Type of stay and property (Entire apartment/private room etc or House/Flat/Guest House etc)
- Number of rooms & beds
- Amenities
- Host details

For listings and experiences in the countryside, Airbnb also provides additional recommendations to personalize one's stay e.g. Vineyards, Farms, Countryside, Surfing, Lakes, Beaches, Designer homes, Tiny homes etc.
**However, for city stays, guests are currently shown all available properties on a map, without any location recommendations or filters.** <br>

Potential Airbnb guests have to scroll through comments to get a 'feel' for the neighbourhood, or research the suitability of the location separately, before narrowing down on a property. In Google Search trends, searches around 'Where to stay in London' have an average relative popularity score of 70. For comparison, Google's own relative score is 91 and Amazon's is approximately 72 over the same one-year period. Evidently, people are seeking this information to guide their decision on where to stay in London during their visit.

Adding such recommendations to the Airbnb site can enhance the user experience, as guests would not have to leave the Airbnb ecosystem for location guidance. It can also pave the way for sustainable tourism, drive tourism revenue for 'under-the-radar' neighbourhoods and ease tourist overcrowding in central London. <br>

This recommendation system also has the potential to be extended to other short-term rental booking sites and hotels, and expanded to other top cities globally.

--------------------------------------------------------------------------

#### Data Source 1: insideairbnb

Data sourced from insideairbnb.com <br>
Date range : The data spans from 2011-2022 <br>
Data File : Listings <br>

**Data Dictionary** <br>
The key columns in the dataset are listed below. Some of the columns in the source file do not have any data recorded and will be dropped prior to our analysis.

| Column                             | Description                                                            |
| :---                               |    :----:                                                              |
| id                                 | Unique id for the specific property/listing                            |
| name                               | Name or short description of the property/listing                      |
| neighborhood_overview              | Host's description of the neighbourhood in the listing                 |
| neighbourhood_cleansed             | The London borough the listing is located in                           |
| latitude                           | Latitude of the property location                                      |
| longitude                          | Longitude of the property location                                     |
| minimum_nights                     | The minimum required nights for booking                                |
| property_type                      | The type of rental property, e.g. Entire rental unit or Private room in home |
| room_type                          | The type of stay, e.g. Entire home/apt or Private room                 |
| price                              | Average per night price of the listing                                 |
| review_scores_location             | Average review score given by guests for the location of the listing   |
| suburb                             | Imputed value after reverse geocoding against latitude and longitude   |

#### Data Source 2: ChatGPT as an additional source to analyse the characteristics of different neighbourhoods in London
Using the following prompt in ChatGPT, we collated `suburb_tags` for the top 100 suburbs in London:
*'Provide neighbourhood characteristics of the following suburbs in London'*

----------------------------------------------------------------------------------------------------------

#### Summary of the data pre-processing approach


Data pre-processing was a critical part of the project because `suburb` level data is key to the analysis and the dataset did not come pre-populated with it.<br>

The key steps in the data-sourcing and data pre-processing stage were:

1) Identifying the right dataset from the Inside Airbnb source - Having evaluated a few different datasets, we settled on the dataset used here because it provides detailed data on the listed properties, most importantly `neighborhood_overview`, a short description of the neighbourhood where the property is located <br>
2) Narrowing down the dataset to include only relevant columns - As the purpose of the project is to identify neighbourhood profiles and recommend properties accordingly, the key columns retained in the final dataset relate to the property location <br>
3) Imputing missing values for `name` and `review_scores_location` based on what we could glean from the other columns - Missing `name` values were completed using the property type and the neighbourhood overview, and missing `review_scores_location` values were filled with the average review score for the corresponding London borough <br>
4) Reverse geocoding to map `suburb` names from latitude and longitude - We used Nominatim (the OpenStreetMap geocoder) to request suburb names based on the property location. It returned a suburb for all but 2,619 of the 40,605 listings; for those 2,619, Nominatim had no suburb details available
5) Imputing missing values for `suburb` by requesting `postcode` data and then manually mapping suburbs based on outward postcodes - Imputing the missing `suburb` values was somewhat time-consuming. We first requested 'postcode' information and mapped it to 'ward' level data for the 'City of London' borough as a test. However, 'ward' level data was too granular for this project, where the end objective is to profile and recommend broad 'suburbs' to Airbnb guests. We therefore changed approach and used the 'outward' part of the postcode (e.g. BR2 in the postcode BR2 6AN) to establish the suburb for each postcode and map it to the relevant listing. For boroughs which are closely knit (e.g. City of London), or outer boroughs where the suburbs within are broadly similar (e.g. Richmond), we used the borough name as the `suburb` for the purposes of this project
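
To make step 5 concrete, the sketch below shows one way the outward part of a postcode could be extracted and looked up. The helper names `outward_code` and `suburb_from_postcode` and the two-entry `example_lookup` are illustrative assumptions and not the code used later in this notebook; the two mappings shown are taken from the dictionaries used in the imputation cells.

```python
def outward_code(postcode: str) -> str:
    """Return the outward part of a UK postcode, e.g. 'BR2' for 'BR2 6AN'."""
    parts = postcode.strip().upper().split()
    if len(parts) == 2:
        return parts[0]
    # No space in the input: the inward part is always the last three characters
    compact = "".join(parts)
    return compact[:-3] if len(compact) > 3 else compact


# Hypothetical, deliberately tiny lookup for illustration only
example_lookup = {"BR2": "Bromley", "W4": "Chiswick"}


def suburb_from_postcode(postcode: str, lookup: dict, default: str = "") -> str:
    """Map a full postcode to a suburb via its outward code."""
    return lookup.get(outward_code(postcode), default)


print(suburb_from_postcode("BR2 6AN", example_lookup))
```

Matching on the complete outward code avoids a shorter code such as E1 accidentally matching a longer one such as E17.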

In [2]:
#Importing the required packages

import numpy as np
import pandas as pd

# plotting
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
import warnings
warnings.filterwarnings('ignore')



In [4]:
#Loading the dataset
df_listing_new= pd.read_csv("C:\\Users\\Zinia\\Documents\\capstone-project-ZiniaB\\Data\\inside airbnb\\listings_new.csv")

In [5]:
#Looking at the columns in the dataset
df_listing_new.columns

Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'source', 'name',
       'description', 'neighborhood_overview', 'picture_url', 'host_id',
       'host_url', 'host_name', 'host_since', 'host_location', 'host_about',
       'host_response_time', 'host_response_rate', 'host_acceptance_rate',
       'host_is_superhost', 'host_thumbnail_url', 'host_picture_url',
       'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'neighbourhood',
       'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude',
       'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms',
       'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'ca

In [6]:
# creating a new dataframe with only the relevant columns for the project
# .copy() makes df an independent dataframe, so later assignments modify df itself rather than a view of df_listing_new

df = df_listing_new[['id','name', 'neighborhood_overview',
       'neighbourhood_cleansed', 'latitude',
       'longitude', 'minimum_nights','property_type', 'room_type','price','review_scores_location',
       ]].copy()


#host_neighbourhood is the same as neighbourhood_cleansed, so host_neighbourhood has been dropped

Google link to dataset: https://drive.google.com/file/d/1IM1CwhKhf1StfN7DCZ23Z_PDEWDo-D_D/view?usp=sharing

In [7]:
#viewing the dataframe
df.head()

Unnamed: 0,id,name,neighborhood_overview,neighbourhood_cleansed,latitude,longitude,minimum_nights,property_type,room_type,price,review_scores_location
0,13913,Holiday London DB Room Let-on going,Finsbury Park is a friendly melting pot commun...,Islington,51.56861,-0.1127,1,Private room in rental unit,Private room,$79.00,4.71
1,15400,Bright Chelsea Apartment. Chelsea!,It is Chelsea.,Kensington and Chelsea,51.4878,-0.16813,10,Entire rental unit,Entire home/apt,$75.00,4.93
2,172811,Nice double bedroom in NW London,,Camden,51.5471,-0.17981,21,Entire rental unit,Entire home/apt,$229.00,
3,173082,The Residential Suite Above Gallery,"The neighbourhood ""Victoria Park Village"" is a...",Hackney,51.538254,-0.044086,2,Entire rental unit,Entire home/apt,$132.00,4.68
4,42010,You Will Save Money Here,We have a unique cinema called the Phoenix whi...,Barnet,51.5859,-0.16434,4,Private room in home,Private room,$65.00,4.72


In [8]:
#details of the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71938 entries, 0 to 71937
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      71938 non-null  int64  
 1   name                    71914 non-null  object 
 2   neighborhood_overview   40605 non-null  object 
 3   neighbourhood_cleansed  71938 non-null  object 
 4   latitude                71938 non-null  float64
 5   longitude               71938 non-null  float64
 6   minimum_nights          71938 non-null  int64  
 7   property_type           71938 non-null  object 
 8   room_type               71938 non-null  object 
 9   price                   71938 non-null  object 
 10  review_scores_location  53807 non-null  float64
dtypes: float64(3), int64(2), object(6)
memory usage: 6.0+ MB


In [9]:
#Checking whether there are any duplicated rows
print(f" There are {df.duplicated().sum()} duplicate rows in df_listing")

 There are 0 duplicate rows in df_listing


In [10]:
#Checking the null values in the dataset
df.isna().sum()

id                            0
name                         24
neighborhood_overview     31333
neighbourhood_cleansed        0
latitude                      0
longitude                     0
minimum_nights                0
property_type                 0
room_type                     0
price                         0
review_scores_location    18131
dtype: int64

In [11]:
# Looking at the `neighborhood_overview` column closely to see what kind of descriptions are provided for the property neighbourhood
df['neighborhood_overview'].head(10)

0    Finsbury Park is a friendly melting pot commun...
1                                       It is Chelsea.
2                                                  NaN
3    The neighbourhood "Victoria Park Village" is a...
4    We have a unique cinema called the Phoenix whi...
5    Location, location, location! You won't find b...
6    The neighborhood of Holland Park borders Notti...
7    The area is called Munster village.  It has a ...
8    It's a really safe and friendly neighbourhood....
9                                                  NaN
Name: neighborhood_overview, dtype: object

Our main column of interest is `neighborhood_overview`, as it contains the description of the neighbourhood provided by the host. We cannot analyse neighbourhoods and property locations that have no description against them, so we will drop the 31,333 rows that have null values in `neighborhood_overview`. This still leaves 40,605 rows in our dataset with which to analyse neighbourhood descriptions in London.

In [12]:
#dropping the rows with null values in `neighborhood_overview`
df.dropna(subset=['neighborhood_overview'], inplace=True)

In [13]:
#checking that the rows with null values have been dropped
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 40605 entries, 0 to 71937
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      40605 non-null  int64  
 1   name                    40601 non-null  object 
 2   neighborhood_overview   40605 non-null  object 
 3   neighbourhood_cleansed  40605 non-null  object 
 4   latitude                40605 non-null  float64
 5   longitude               40605 non-null  float64
 6   minimum_nights          40605 non-null  int64  
 7   property_type           40605 non-null  object 
 8   room_type               40605 non-null  object 
 9   price                   40605 non-null  object 
 10  review_scores_location  33645 non-null  float64
dtypes: float64(3), int64(2), object(6)
memory usage: 3.7+ MB


We can see that there are still null values in the `name` and `review_scores_location` columns. We will fill them with relevant values in the next few steps.

In [14]:
#looking at the `name` column null values more closely
df[df['name'].isna()]

Unnamed: 0,id,name,neighborhood_overview,neighbourhood_cleansed,latitude,longitude,minimum_nights,property_type,room_type,price,review_scores_location
1406,1346531,,"Situated in hip, lively Dalston there is a ple...",Hackney,51.5479,-0.06752,2,Entire rental unit,Entire home/apt,$80.00,4.65
8506,9859465,,Clerkenwell is a vibrant area with many restau...,Islington,51.52465,-0.09624,4,Entire rental unit,Entire home/apt,$120.00,
8619,9833194,,Clapham Common provides an oasis of peace amid...,Lambeth,51.45238,-0.12921,2,Private room in rental unit,Private room,$36.00,
12866,14051483,,- Area: The flat is located 2min by walk away ...,Lambeth,51.47023,-0.12353,1,Private room in rental unit,Private room,$20.00,


In [15]:
#Imputing NaNs in the `name` column with a short description of the property type and neighbourhood from the other columns

df.loc[1406, 'name'] = 'Entire home in Dalston'
df.loc[8506, 'name'] = 'Entire home in Clerkenwell'
df.loc[8619, 'name'] = 'Private room in Clapham Common'
df.loc[12866, 'name'] = 'Private room in Lambeth'

We can also see from the dataframe that `neighbourhood_cleansed` contains the borough name, so we will rename it for easier reference

In [16]:
#renaming `neighbourhood_cleansed` to `London_borough`
df.rename(columns= {'neighbourhood_cleansed':'London_borough'}, inplace=True)

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 40605 entries, 0 to 71937
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      40605 non-null  int64  
 1   name                    40605 non-null  object 
 2   neighborhood_overview   40605 non-null  object 
 3   London_borough          40605 non-null  object 
 4   latitude                40605 non-null  float64
 5   longitude               40605 non-null  float64
 6   minimum_nights          40605 non-null  int64  
 7   property_type           40605 non-null  object 
 8   room_type               40605 non-null  object 
 9   price                   40605 non-null  object 
 10  review_scores_location  33645 non-null  float64
dtypes: float64(3), int64(2), object(6)
memory usage: 4.7+ MB


For `review_scores_location`, we will impute the missing values with the borough average, on the hypothesis that location scores are likely to be similar for properties within the same borough

In [18]:
#looking at the mean of the review_scores by London borough
df.groupby(['London_borough'])['review_scores_location'].mean()

London_borough
Barking and Dagenham      4.520340
Barnet                    4.675242
Bexley                    4.663934
Brent                     4.632068
Bromley                   4.743007
Camden                    4.830038
City of London            4.823320
Croydon                   4.599150
Ealing                    4.685341
Enfield                   4.642847
Greenwich                 4.685111
Hackney                   4.772441
Hammersmith and Fulham    4.757260
Haringey                  4.691287
Harrow                    4.715862
Havering                  4.700000
Hillingdon                4.709930
Hounslow                  4.669000
Islington                 4.804420
Kensington and Chelsea    4.846861
Kingston upon Thames      4.805115
Lambeth                   4.770315
Lewisham                  4.673356
Merton                    4.736242
Newham                    4.621683
Redbridge                 4.574313
Richmond upon Thames      4.856049
Southwark                 4.714687
Sutto

In [19]:
#filling missing values in `review_scores_location` column with the mean scores for that borough using 'transform' method
df["review_scores_location"] = df["review_scores_location"].fillna(df.groupby("London_borough")['review_scores_location'].transform('mean'))
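
As a small worked example of the imputation above, the sketch below applies the same `groupby` + `transform('mean')` pattern to a toy dataframe. The values are invented purely for illustration and are not taken from the listings data.

```python
import numpy as np
import pandas as pd

# Toy data: two boroughs, each with one missing location score
toy = pd.DataFrame({
    "London_borough": ["Camden", "Camden", "Camden", "Barnet", "Barnet"],
    "review_scores_location": [4.8, 5.0, np.nan, 4.6, np.nan],
})

# transform('mean') returns one value per row (the mean of that row's borough),
# so it lines up with the original index and can be passed straight to fillna
borough_mean = toy.groupby("London_borough")["review_scores_location"].transform("mean")
toy["review_scores_location"] = toy["review_scores_location"].fillna(borough_mean)

print(toy)
```

The missing Camden score becomes the mean of 4.8 and 5.0, and the missing Barnet score becomes 4.6. Note that a borough with no scores at all would still be left as NaN, because its mean is itself NaN.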

In [90]:
#reverse geocoding to get suburb names based on latitude and longitude
#several versions of this code were tried so that listings with no 'suburb' in the response would not raise an error; this version returns a value for every row in the dataset
#the code below is commented out as it takes considerable time to run

# from geopy.geocoders import Nominatim
# geolocator = Nominatim(user_agent="my_application")

# def get_suburb(row):
#     suburb = ''  # default, so a failed lookup returns an empty string instead of raising an error
#     try:
#         location = geolocator.reverse((row['latitude'], row['longitude']))
#         address = location.raw['address']
#         suburb = address.get('suburb', '')
#     except Exception:
#         pass
#     return suburb

# df['suburb'] = df.apply(get_suburb, axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['suburb'] = df.apply(get_suburb, axis=1)
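
The lookup logic can be tested without calling Nominatim by passing the reverse-geocoding function in as an argument. The sketch below is a minimal illustration under that assumption: `lookup_address_field` and `fake_reverse` are hypothetical names, and `fake_reverse` is a stand-in that only imitates the shape of a geopy result (an object with a `.raw` dictionary).

```python
from types import SimpleNamespace


def lookup_address_field(reverse, latitude, longitude, field, default=""):
    """Return one field of a reverse-geocoded address, or `default` if unavailable."""
    try:
        location = reverse((latitude, longitude))
    except Exception:
        # Network errors, timeouts etc. fall back to the default value
        return default
    if location is None:
        # No result for these coordinates
        return default
    return location.raw.get("address", {}).get(field, default)


def fake_reverse(coords):
    """Stand-in for a geocoder: knows a single location, returns None otherwise."""
    if coords == (51.56861, -0.1127):
        return SimpleNamespace(raw={"address": {"suburb": "Finsbury Park"}})
    return None


print(lookup_address_field(fake_reverse, 51.56861, -0.1127, "suburb"))
```

With geopy, `geolocator.reverse` could be passed in place of `fake_reverse`, ideally wrapped in `geopy.extra.rate_limiter.RateLimiter` with `min_delay_seconds=1` to stay within Nominatim's usage limits.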


In [91]:
# to check if the 'suburb' values had been returned
# df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 40605 entries, 0 to 71937
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      40605 non-null  int64  
 1   name                    40605 non-null  object 
 2   neighborhood_overview   40605 non-null  object 
 3   London_borough          40605 non-null  object 
 4   latitude                40605 non-null  float64
 5   longitude               40605 non-null  float64
 6   minimum_nights          40605 non-null  int64  
 7   property_type           40605 non-null  object 
 8   room_type               40605 non-null  object 
 9   price                   40605 non-null  object 
 10  review_scores_location  40605 non-null  float64
 11  suburb                  40605 non-null  object 
dtypes: float64(3), int64(2), object(7)
memory usage: 5.0+ MB


Saving cleaned DataFrame with no null-values and with suburb added - to csv

In [98]:

df.to_csv('C:\\Users\\Zinia\\Documents\\capstone-project-ZiniaB\\Data\\suburb_updated_withindex.csv',index=False)

In [101]:
joblib.dump(df,'../Data/df_suburb.pkl',compress =9)

['../Data/df_suburb.pkl']

In [20]:
#Loading df_suburb to recheck it before moving to EDA
df=joblib.load("C:\\Users\\Zinia\\Documents\\capstone-project-ZiniaB\\Data\\df_suburb.pkl")

In [21]:
df.head()

Unnamed: 0,id,name,neighborhood_overview,London_borough,latitude,longitude,minimum_nights,property_type,room_type,price,review_scores_location,suburb
0,13913,Holiday London DB Room Let-on going,Finsbury Park is a friendly melting pot commun...,Islington,51.56861,-0.1127,1,Private room in rental unit,Private room,$79.00,4.71,Finsbury Park
1,15400,Bright Chelsea Apartment. Chelsea!,It is Chelsea.,Kensington and Chelsea,51.4878,-0.16813,10,Entire rental unit,Entire home/apt,$75.00,4.93,Chelsea
3,173082,The Residential Suite Above Gallery,"The neighbourhood ""Victoria Park Village"" is a...",Hackney,51.538254,-0.044086,2,Entire rental unit,Entire home/apt,$132.00,4.68,Homerton
4,42010,You Will Save Money Here,We have a unique cinema called the Phoenix whi...,Barnet,51.5859,-0.16434,4,Private room in home,Private room,$65.00,4.72,Hampstead Garden Suburb
5,17402,Superb 3-Bed/2 Bath & Wifi: Trendy W1,"Location, location, location! You won't find b...",Westminster,51.52195,-0.14094,4,Entire rental unit,Entire home/apt,$425.00,4.88,Fitzrovia


In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 40605 entries, 0 to 71937
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      40605 non-null  int64  
 1   name                    40605 non-null  object 
 2   neighborhood_overview   40605 non-null  object 
 3   London_borough          40605 non-null  object 
 4   latitude                40605 non-null  float64
 5   longitude               40605 non-null  float64
 6   minimum_nights          40605 non-null  int64  
 7   property_type           40605 non-null  object 
 8   room_type               40605 non-null  object 
 9   price                   40605 non-null  object 
 10  review_scores_location  40605 non-null  float64
 11  suburb                  40605 non-null  object 
dtypes: float64(3), int64(2), object(7)
memory usage: 4.0+ MB


In [23]:
#Grouping by suburb to count listings per suburb; the top 100 suburbs are identified by sorting these counts at the next stage
df.groupby(['suburb']).count()['id']

suburb
                  2619
Abbey Wood          33
Acton              346
Addiscombe          51
Albany Park          5
                  ... 
Worcester Park      21
World's End        179
Worton              19
Yeading             15
Yiewsley            30
Name: id, Length: 397, dtype: int64
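
The counts above can be turned into a ranked list with `value_counts`. The sketch below is one possible way to do this on a toy dataframe; the function name `top_suburbs` and the toy rows are assumptions for illustration. It excludes the blank suburb name that appears at the top of the output above.

```python
import pandas as pd

# Toy data with two blank suburb names, mirroring the blank group in the output above
toy = pd.DataFrame({
    "suburb": ["Acton", "", "Acton", "Chelsea", "", "Acton", "Chelsea", "Yeading"]
})


def top_suburbs(frame: pd.DataFrame, n: int) -> pd.Series:
    """Return the n suburbs with the most listings, ignoring blank names."""
    named = frame.loc[frame["suburb"].str.strip() != "", "suburb"]
    # value_counts sorts in descending order of frequency by default
    return named.value_counts().head(n)


top2 = top_suburbs(toy, 2)
print(top2)
```

On the full dataset, `top_suburbs(df, 100)` would give the 100 suburbs with the most listings once the blank names have been handled.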

In [24]:
#the output above shows a blank suburb name for 2619 listings
#Looking closer at the rows with a blank suburb:
#where no 'suburb' was available from Nominatim at the time of reverse geocoding, the lookup returned an empty string
df[df['suburb']=='']

Unnamed: 0,id,name,neighborhood_overview,London_borough,latitude,longitude,minimum_nights,property_type,room_type,price,review_scores_location,suburb
27,198258,Penthouse Living in East London,"I live in Barking town centre, at one time the...",Barking and Dagenham,51.534300,0.081780,2,Private room in rental unit,Private room,$69.00,4.440000,
45,41445,2 Double bed apartment in quiet area North London,Quite area popular with families,Barnet,51.614920,-0.256320,4,Entire rental unit,Entire home/apt,$550.00,4.530000,
64,318287,safe and spacious room in comfy family home,We are a cosmopolitan family in a lovely neigh...,Waltham Forest,51.616070,-0.029820,2,Private room in home,Private room,$29.00,4.450000,
95,81080,Luxury Self contained Studio Apt.,Lots of local history - you are right on the d...,Richmond upon Thames,51.406440,-0.335630,3,Entire serviced apartment,Entire home/apt,$170.00,4.940000,
164,343848,"Lovely double room in Oakwood, North London",It is a short walk to the local Sainsbury's su...,Enfield,51.650750,-0.118410,180,Private room in home,Private room,$41.00,4.610000,
...,...,...,...,...,...,...,...,...,...,...,...,...
71593,776521441063993062,one bedroom and a couch.,Very quiet and friendly.,Barnet,51.610528,-0.276035,1,Private room in rental unit,Private room,$73.00,4.675242,
71599,776668700186722169,Appartamento ad uso esclusivo,Tranquillo e ben collegato con il centro attra...,Waltham Forest,51.589243,-0.029610,1,Entire rental unit,Entire home/apt,$150.00,4.661732,
71714,776597925814980536,Lovely 2 Bedroom Garden Flat,Around the corner on Bellenden Road places lik...,Southwark,51.468649,-0.073742,3,Entire rental unit,Entire home/apt,$175.00,4.714687,
71801,776648991500097600,2BR Penthouse with Terrace in the Heart of Hol...,About the Holborn London Location: The Holborn...,City of London,51.517099,-0.111866,1,Entire rental unit,Entire home/apt,$259.00,4.823320,
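
Because these blanks are empty strings rather than nulls, `df.info()` and `isna()` report the `suburb` column as fully populated. A minimal sketch of how such blanks could be converted to NaN so that the usual null checks pick them up is shown below; the toy values are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"suburb": ["Chelsea", "", "  ", "Acton"]})

# Empty or whitespace-only strings are not counted as nulls
missing_before = int(toy["suburb"].isna().sum())

# mask() replaces values with NaN wherever the condition is True
is_blank = toy["suburb"].str.strip() == ""
toy["suburb"] = toy["suburb"].mask(is_blank)

missing_after = int(toy["suburb"].isna().sum())
print(missing_before, missing_after)
```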


In [25]:
#Creating a separate dataframe to populate the missing suburbs
df_missing=df[df['suburb']=='']

In [26]:
df_missing.shape

(2619, 12)

In [27]:
# As the suburb data was missing, we will request postcodes from Nominatim to inform our broad estimate of suburb names.

from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="my_application")

def get_postcode(row):
    postcode = ''  # default, so a failed lookup returns an empty string instead of raising an error
    try:
        location = geolocator.reverse((row['latitude'], row['longitude']))
        address = location.raw['address']
        postcode = address.get('postcode', '')
    except Exception:
        pass
    return postcode


In [28]:
#Adding the postcodes returned by the call to the df_missing dataframe as a separate column `postcode`
#commenting out this code as it takes 20+ mins to run

# df_missing['postcode'] = df_missing.apply(get_postcode, axis=1)

In [29]:
df_missing.head(5)
#we can see that the postcode column has now been added to the df_missing dataframe

Unnamed: 0,id,name,neighborhood_overview,London_borough,latitude,longitude,minimum_nights,property_type,room_type,price,review_scores_location,suburb,postcode
27,198258,Penthouse Living in East London,"I live in Barking town centre, at one time the...",Barking and Dagenham,51.5343,0.08178,2,Private room in rental unit,Private room,$69.00,4.44,,IG11 7RQ
45,41445,2 Double bed apartment in quiet area North London,Quite area popular with families,Barnet,51.61492,-0.25632,4,Entire rental unit,Entire home/apt,$550.00,4.53,,NW7 3QA
64,318287,safe and spacious room in comfy family home,We are a cosmopolitan family in a lovely neigh...,Waltham Forest,51.61607,-0.02982,2,Private room in home,Private room,$29.00,4.45,,E4 8HB
95,81080,Luxury Self contained Studio Apt.,Lots of local history - you are right on the d...,Richmond upon Thames,51.40644,-0.33563,3,Entire serviced apartment,Entire home/apt,$170.00,4.94,,KT8 9BY
164,343848,"Lovely double room in Oakwood, North London",It is a short walk to the local Sainsbury's su...,Enfield,51.65075,-0.11841,180,Private room in home,Private room,$41.00,4.61,,EN2 7JP


In [30]:
df_missing.to_csv('C:\\Users\\Zinia\\Documents\\capstone-project-ZiniaB\\Data\\missingsuburbswithpostcode.csv', index = False)
#saving this as csv for back-up, in order to avoid making the request to nominatim again(without index)

In [31]:
#re-loading the csv file to continue with imputation of missing suburbs
df_missing = pd.read_csv('C:\\Users\\Zinia\\Documents\\capstone-project-ZiniaB\\Data\\missingsuburbswithpostcode.csv')

In [32]:
df_missing.head()

Unnamed: 0,id,name,neighborhood_overview,London_borough,latitude,longitude,minimum_nights,property_type,room_type,price,review_scores_location,suburb,postcode
0,198258,Penthouse Living in East London,"I live in Barking town centre, at one time the...",Barking and Dagenham,51.5343,0.08178,2,Private room in rental unit,Private room,$69.00,4.44,,IG11 7RQ
1,41445,2 Double bed apartment in quiet area North London,Quite area popular with families,Barnet,51.61492,-0.25632,4,Entire rental unit,Entire home/apt,$550.00,4.53,,NW7 3QA
2,318287,safe and spacious room in comfy family home,We are a cosmopolitan family in a lovely neigh...,Waltham Forest,51.61607,-0.02982,2,Private room in home,Private room,$29.00,4.45,,E4 8HB
3,81080,Luxury Self contained Studio Apt.,Lots of local history - you are right on the d...,Richmond upon Thames,51.40644,-0.33563,3,Entire serviced apartment,Entire home/apt,$170.00,4.94,,KT8 9BY
4,343848,"Lovely double room in Oakwood, North London",It is a short walk to the local Sainsbury's su...,Enfield,51.65075,-0.11841,180,Private room in home,Private room,$41.00,4.61,,EN2 7JP


The first step is to identify smaller boroughs where individual suburb profiles are likely to be similar. 'City of London' is one such borough: it is centrally located and mostly covers the financial district. So we will use the borough as a proxy for the suburb here

In [35]:
df_missing.isna().sum()

id                           0
name                         0
neighborhood_overview        0
London_borough               0
latitude                     0
longitude                    0
minimum_nights               0
property_type                0
room_type                    0
price                        0
review_scores_location       0
suburb                    2619
postcode                    13
dtype: int64

In [36]:
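#Creating a separate df for rows with missing postcode to impute them manually
#.copy() so that the suburb assignment below modifies this dataframe rather than a view of df_missing

df_missing_pc= df_missing[(df_missing['postcode'].isna())].copy()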

In [37]:
#dropping the rows with missing postcodes from the existing dataframe
df_missing = df_missing.dropna(subset=['postcode'])

In [38]:
df_missing_pc.head(15)

Unnamed: 0,id,name,neighborhood_overview,London_borough,latitude,longitude,minimum_nights,property_type,room_type,price,review_scores_location,suburb,postcode
131,5388341,Eclectic London home great location,"We might be biased, but we think we live in th...",City of London,51.51476,-0.07612,3,Entire rental unit,Entire home/apt,$98.00,4.5,,
143,5989171,My spare room is for rent,The local shopping street is Askew Road. Over ...,Hammersmith and Fulham,51.50181,-0.24857,1,Private room in condo,Private room,$53.00,4.77,,
482,16276053,Spacious Double Room in Zone 1 near Shoreditch,Next to trendy Brick Lane and Spittafields Market,City of London,51.5148,-0.07637,4,Private room in rental unit,Private room,$40.00,4.82,,
900,25884581,Spacious room king-size bed,"It is a safe neighbourhood, with people from d...",Brent,51.59733,-0.26751,1,Private room in home,Private room,$40.00,5.0,,
973,28237545,Lovely Garden View - Close to City Center,Zone 4 Burnt Oak & Zone 5 Edgware ( Northern L...,Barnet,51.61023,-0.27034,3,Private room in home,Private room,$26.00,5.0,,
1358,38150425,Single room-Comfort-Shared Bathroom-Tiny Room,Apple House Guest London Heathrow is situated ...,Hillingdon,51.48384,-0.44338,1,Private room in guesthouse,Private room,$62.00,4.8,,
1360,38150552,En suite Double Room 5min from Heathrow Airport,Apple House Guest London Heathrow is situated ...,Hillingdon,51.48397,-0.44298,1,Private room in guesthouse,Private room,$69.00,3.0,,
1903,53544785,Lovely modern 2 bed flat in the heart of Woolwich,peaceful,Greenwich,51.48796,0.06729,2,Entire condo,Entire home/apt,$129.00,4.685111,,
2144,644804400072872785,Bed and breakfast in leafy Bounds Green,A Quiet neighbourhood away from the hustle and...,Haringey,51.603942,-0.126679,1,Private room in bed and breakfast,Private room,$45.00,4.71,,
2146,645440879081312934,Nice room near London Designer Outlet,Aside from the the iconic stadium and the SSE ...,Brent,51.55857,-0.28001,2,Private room in rental unit,Private room,"$1,570.00",4.632068,,


In [39]:
#Looking at the location of the above properties and their location descriptions, it seems fair to allocate the borough name as the suburb for the purposes of our analysis
df_missing_pc['suburb']=df_missing_pc['London_borough']

In [40]:
#dropping the postcode column(due to its null values) and checking the dataframe after imputing the suburb name
#this dataframe will be combined with the df_missing dataframe after the df_missing dataframe has its suburbs filled in
df_missing_pc.drop(['postcode'],axis=1, inplace=True)
df_missing_pc.head()

Unnamed: 0,id,name,neighborhood_overview,London_borough,latitude,longitude,minimum_nights,property_type,room_type,price,review_scores_location,suburb
131,5388341,Eclectic London home great location,"We might be biased, but we think we live in th...",City of London,51.51476,-0.07612,3,Entire rental unit,Entire home/apt,$98.00,4.5,City of London
143,5989171,My spare room is for rent,The local shopping street is Askew Road. Over ...,Hammersmith and Fulham,51.50181,-0.24857,1,Private room in condo,Private room,$53.00,4.77,Hammersmith and Fulham
482,16276053,Spacious Double Room in Zone 1 near Shoreditch,Next to trendy Brick Lane and Spittafields Market,City of London,51.5148,-0.07637,4,Private room in rental unit,Private room,$40.00,4.82,City of London
900,25884581,Spacious room king-size bed,"It is a safe neighbourhood, with people from d...",Brent,51.59733,-0.26751,1,Private room in home,Private room,$40.00,5.0,Brent
973,28237545,Lovely Garden View - Close to City Center,Zone 4 Burnt Oak & Zone 5 Edgware ( Northern L...,Barnet,51.61023,-0.27034,3,Private room in home,Private room,$26.00,5.0,Barnet


In [41]:
df_missing.loc[df_missing['London_borough'] == 'City of London', 'suburb'] = 'City of London'

In [42]:
# Applying suburb names for the first set of missing suburbs
# Note: ' ' is appended to each prefix in the loop below, so keys that already end in a space ('E17 ', 'E5 ') and the bare area code 'EN'
# do not match postcodes in the standard 'E17 3TJ' form; those rows are filled with the borough name later on
suburb_prefix = {'BR2': 'Bromley', 'BR5': 'Orpington', 'BR6': 'Orpington', 'BR7': 'Chislehurst', 'CR0': 'Croydon', 'CR2': 'South Croydon', 'CR5': 'Coulsdon', 'DA7': 'Bexleyheath', 'DA14': 'Sidcup', 'E17 ': 'Walthamstow', 'E4': 'Walthamstow', 'E5 ': 'Clapton', 'E12': 'Wanstead', 'E10': 'Wanstead', 'E16': 'Newham', 'EN': 'Enfield', 'EN2': 'Enfield'}

for prefix, suburb in suburb_prefix.items():
    df_missing.loc[df_missing['postcode'].str.startswith(prefix + ' '), 'suburb'] = suburb

In [43]:
# Applying suburb names for the second set of missing suburbs
suburb_prefix2 = {'W10': 'North Kensington', 'W12': "Shepherd's Bush", 'W13': 'West Ealing', 'W3': 'Acton', 'W4': 'Chiswick', 'W5':'Ealing', 'W6':'Hammersmith', 'WC2A': 'Holborn'}


for prefix, suburb in suburb_prefix2.items():
    df_missing.loc[df_missing['postcode'].str.startswith(prefix + ' '), 'suburb'] = suburb

In [44]:
# Applying suburb names for the third set of missing suburbs
suburb_prefix3 = {'N10': 'Muswell Hill', 'N11': 'Arnos Grove', 'N13': 'Palmers Green', 'N14': 'Tottenham', 'N15': 'Tottenham', 'N17': 'Tottenham', 'N18': 'Tottenham', 'N21': 'Grange Park', 'N22':'Wood Green', 'N4':'Finsbury Park', 'N9': 'Edmonton', 'NW10':'Harlesden', 'NW3': 'Belsize Park', 'NW7': 'Mill Lane', 'NW9': 'Colindale'}

for prefix, suburb in suburb_prefix3.items():
    df_missing.loc[df_missing['postcode'].str.startswith(prefix + ' '), 'suburb'] = suburb

In [45]:
# Applying suburb names for the fourth set of missing suburbs
suburb_prefix4 = {'SE10': 'Greenwich', 'SE15': 'Peckham', 'SE16': 'Bermondsey', 'SE18': 'Woolwich', 'SE19': 'Crystal Palace', 'SE2': 'Abbey Wood', 'SE21': 'Dulwich', 'SE22': 'Dulwich', 'SE26': 'Sydenham', 'SE28': 'Thamesmead', 'SE8': 'Deptford', 'SE9': 'Eltham', 'SM': 'Sutton', 'SW11': 'Battersea', 'SW13': 'Barnes', 'SW15': 'Putney', 'SW18': 'Wandsworth', 'SW2': 'Brixton', 'SW4': 'Clapham', 'SW6': 'Fulham', 'SW8': 'Battersea'}

for prefix, suburb in suburb_prefix4.items():
    df_missing.loc[df_missing['postcode'].str.startswith(prefix + ' '), 'suburb'] = suburb
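
The four prefix loops above could also be expressed as a single lookup on the outward code (the part of the postcode before the space). The following is a minimal sketch rather than the code used in this notebook: `OUTWARD_TO_SUBURB`, `outward_code` and `fill_suburb_from_postcode` are hypothetical names, the lookup is abbreviated to three entries taken from the dictionaries above, and the demo postcodes other than `BR2 6AN` and `IG11 7RQ` are made up for illustration. Unlike the loops, it only fills rows where `suburb` is still missing.

```python
import pandas as pd

# Hypothetical, abbreviated lookup; the full mapping lives in the four dictionaries above.
OUTWARD_TO_SUBURB = {"BR2": "Bromley", "W4": "Chiswick", "SE10": "Greenwich"}


def outward_code(postcode):
    """Return the outward code (text before the space), or None if the value is unusable."""
    if not isinstance(postcode, str) or not postcode.strip():
        return None
    return postcode.strip().upper().split()[0]


def fill_suburb_from_postcode(df, lookup):
    """Fill missing `suburb` values using an exact match on the outward code."""
    out = df.copy()
    mapped = out["postcode"].map(outward_code).map(lookup)
    out["suburb"] = out["suburb"].fillna(mapped)
    return out


# Illustrative rows only (assumed data, not taken from the listings).
demo = pd.DataFrame({
    "postcode": ["BR2 6AN", "w4 1ab", "SE10 9NF", "IG11 7RQ", None],
    "suburb": [None, None, "Existing", None, None],
})
filled = fill_suburb_from_postcode(demo, OUTWARD_TO_SUBURB)
print(filled)
```

Matching on the exact outward code avoids having to append a space to each prefix, so stray whitespace in a dictionary key or a lower-case postcode cannot silently prevent a match.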

In [46]:
#Checking remaining missing suburbs
df_missing.isna().sum()

id                          0
name                        0
neighborhood_overview       0
London_borough              0
latitude                    0
longitude                   0
minimum_nights              0
property_type               0
room_type                   0
price                       0
review_scores_location      0
suburb                    954
postcode                    0
dtype: int64

In [47]:
#looking at the details of the rows where suburb is missing in order to identify the most appropriate way to fill in the values
df_missing[(df_missing['suburb'].isna())]

Unnamed: 0,id,name,neighborhood_overview,London_borough,latitude,longitude,minimum_nights,property_type,room_type,price,review_scores_location,suburb,postcode
0,198258,Penthouse Living in East London,"I live in Barking town centre, at one time the...",Barking and Dagenham,51.534300,0.081780,2,Private room in rental unit,Private room,$69.00,4.440000,,IG11 7RQ
3,81080,Luxury Self contained Studio Apt.,Lots of local history - you are right on the d...,Richmond upon Thames,51.406440,-0.335630,3,Entire serviced apartment,Entire home/apt,$170.00,4.940000,,KT8 9BY
7,428716,"Clean, private spacious dble rm 4mins from sta...","Beautiful wall art, parks, tennis courts, lak...",Waltham Forest,51.587330,-0.003450,1,Private room in townhouse,Private room,$85.00,4.780000,,E17 3TJ
9,447223,15mins frm Centr.London with garden,Market close-by - 3 days per week. Traditiona...,Havering,51.578660,0.169210,2,Entire home,Entire home/apt,$114.00,4.300000,,RM7 7AP
14,590443,ILFORD - 2 DR (1 private bathr + 1 en-suite ba...,- All you need is within 5-10 minutes walk (su...,Redbridge,51.560970,0.076020,25,Entire rental unit,Entire home/apt,$48.00,4.440000,,IG1 4EQ
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2604,772313840279032352,Two-bedroom apartment with terrace in Farrington,Farringdon is one of the most up-and-coming ne...,Islington,51.519124,-0.104363,1,Entire rental unit,Entire home/apt,$327.00,4.804420,,EC1M 6PB
2611,775944550906085503,"Studio in Wembley, London",Studio located at a 15 mins walks from Wembley...,Brent,51.561716,-0.286860,3,Entire rental unit,Entire home/apt,$70.00,4.632068,,HA9 8HB
2613,775953684576286413,"Private Studio in Wembley, London",Studio located at a 15 mins walks from Wembley...,Brent,51.561716,-0.286860,3,Entire rental unit,Entire home/apt,$50.00,4.632068,,HA9 8HB
2614,776521441063993062,one bedroom and a couch.,Very quiet and friendly.,Barnet,51.610528,-0.276035,1,Private room in rental unit,Private room,$73.00,4.675242,,HA8 9AB


In [48]:
# Looking at the missing suburbs, the majority of them are in the Greater London area and fairly homogeneous at a borough level.
# We will therefore go ahead and use the borough name as a proxy for the suburb name
df_missing['suburb'] = df_missing['suburb'].fillna(df_missing['London_borough'])

In [49]:
df_missing.head()

Unnamed: 0,id,name,neighborhood_overview,London_borough,latitude,longitude,minimum_nights,property_type,room_type,price,review_scores_location,suburb,postcode
0,198258,Penthouse Living in East London,"I live in Barking town centre, at one time the...",Barking and Dagenham,51.5343,0.08178,2,Private room in rental unit,Private room,$69.00,4.44,Barking and Dagenham,IG11 7RQ
1,41445,2 Double bed apartment in quiet area North London,Quite area popular with families,Barnet,51.61492,-0.25632,4,Entire rental unit,Entire home/apt,$550.00,4.53,Mill Lane,NW7 3QA
2,318287,safe and spacious room in comfy family home,We are a cosmopolitan family in a lovely neigh...,Waltham Forest,51.61607,-0.02982,2,Private room in home,Private room,$29.00,4.45,Walthamstow,E4 8HB
3,81080,Luxury Self contained Studio Apt.,Lots of local history - you are right on the d...,Richmond upon Thames,51.40644,-0.33563,3,Entire serviced apartment,Entire home/apt,$170.00,4.94,Richmond upon Thames,KT8 9BY
4,343848,"Lovely double room in Oakwood, North London",It is a short walk to the local Sainsbury's su...,Enfield,51.65075,-0.11841,180,Private room in home,Private room,$41.00,4.61,Enfield,EN2 7JP


In [50]:
df_missing.isna().sum()

id                        0
name                      0
neighborhood_overview     0
London_borough            0
latitude                  0
longitude                 0
minimum_nights            0
property_type             0
room_type                 0
price                     0
review_scores_location    0
suburb                    0
postcode                  0
dtype: int64

In [51]:
#Concatenating df_missing (with completed suburbs) and df_missing_pc(completed suburbs for rows which had missing postcode)
df_suburbc=pd.concat([df_missing,df_missing_pc], axis=0)

In [52]:
df_suburbc

Unnamed: 0,id,name,neighborhood_overview,London_borough,latitude,longitude,minimum_nights,property_type,room_type,price,review_scores_location,suburb,postcode
0,198258,Penthouse Living in East London,"I live in Barking town centre, at one time the...",Barking and Dagenham,51.534300,0.081780,2,Private room in rental unit,Private room,$69.00,4.440000,Barking and Dagenham,IG11 7RQ
1,41445,2 Double bed apartment in quiet area North London,Quite area popular with families,Barnet,51.614920,-0.256320,4,Entire rental unit,Entire home/apt,$550.00,4.530000,Mill Lane,NW7 3QA
2,318287,safe and spacious room in comfy family home,We are a cosmopolitan family in a lovely neigh...,Waltham Forest,51.616070,-0.029820,2,Private room in home,Private room,$29.00,4.450000,Walthamstow,E4 8HB
3,81080,Luxury Self contained Studio Apt.,Lots of local history - you are right on the d...,Richmond upon Thames,51.406440,-0.335630,3,Entire serviced apartment,Entire home/apt,$170.00,4.940000,Richmond upon Thames,KT8 9BY
4,343848,"Lovely double room in Oakwood, North London",It is a short walk to the local Sainsbury's su...,Enfield,51.650750,-0.118410,180,Private room in home,Private room,$41.00,4.610000,Enfield,EN2 7JP
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2144,644804400072872785,Bed and breakfast in leafy Bounds Green,A Quiet neighbourhood away from the hustle and...,Haringey,51.603942,-0.126679,1,Private room in bed and breakfast,Private room,$45.00,4.710000,Haringey,
2146,645440879081312934,Nice room near London Designer Outlet,Aside from the the iconic stadium and the SSE ...,Brent,51.558570,-0.280010,2,Private room in rental unit,Private room,"$1,570.00",4.632068,Brent,
2208,659428748704825984,★NEW BUILD★1Bed Studio Apart Private GYM /NETF...,Home to London's most iconic concert and event...,Brent,51.562199,-0.279161,3,Entire serviced apartment,Entire home/apt,$170.00,4.500000,Brent,
2213,659279404634312107,Nice room near London Designer Outlet,Aside from the the iconic stadium and the SSE ...,Brent,51.558550,-0.279980,2,Private room in rental unit,Private room,"$1,570.00",4.632068,Brent,


In [53]:
df_suburbc.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2619 entries, 0 to 2562
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      2619 non-null   int64  
 1   name                    2619 non-null   object 
 2   neighborhood_overview   2619 non-null   object 
 3   London_borough          2619 non-null   object 
 4   latitude                2619 non-null   float64
 5   longitude               2619 non-null   float64
 6   minimum_nights          2619 non-null   int64  
 7   property_type           2619 non-null   object 
 8   room_type               2619 non-null   object 
 9   price                   2619 non-null   object 
 10  review_scores_location  2619 non-null   float64
 11  suburb                  2619 non-null   object 
 12  postcode                2606 non-null   object 
dtypes: float64(3), int64(2), object(8)
memory usage: 286.5+ KB


In [54]:
#dropping the postcode column as it is no longer required
df_suburbc.drop(['postcode'], axis=1, inplace=True)

In [55]:
df_suburbc.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2619 entries, 0 to 2562
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      2619 non-null   int64  
 1   name                    2619 non-null   object 
 2   neighborhood_overview   2619 non-null   object 
 3   London_borough          2619 non-null   object 
 4   latitude                2619 non-null   float64
 5   longitude               2619 non-null   float64
 6   minimum_nights          2619 non-null   int64  
 7   property_type           2619 non-null   object 
 8   room_type               2619 non-null   object 
 9   price                   2619 non-null   object 
 10  review_scores_location  2619 non-null   float64
 11  suburb                  2619 non-null   object 
dtypes: float64(3), int64(2), object(7)
memory usage: 266.0+ KB


In [56]:
#saving (pickling) the dataframe that holds all rows whose suburb was missing in the original dataframe, now filled in
joblib.dump(df_suburbc,'../Data/df_missingsuburbsfilled.pkl',compress =9)

['../Data/df_missingsuburbsfilled.pkl']

In [71]:
#loading the original dataframe as it was before the imputation of missing values in the `suburb` column
df = joblib.load('../Data/df_suburb.pkl')

In [72]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 40605 entries, 0 to 71937
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      40605 non-null  int64  
 1   name                    40605 non-null  object 
 2   neighborhood_overview   40605 non-null  object 
 3   London_borough          40605 non-null  object 
 4   latitude                40605 non-null  float64
 5   longitude               40605 non-null  float64
 6   minimum_nights          40605 non-null  int64  
 7   property_type           40605 non-null  object 
 8   room_type               40605 non-null  object 
 9   price                   40605 non-null  object 
 10  review_scores_location  40605 non-null  float64
 11  suburb                  40605 non-null  object 
dtypes: float64(3), int64(2), object(7)
memory usage: 4.0+ MB


In [73]:
#dropping rows that were missing `suburb` details, so we can concatenate with the version of the dataframe where the missing values have been filled in 
df2=df.dropna(axis=0, subset=['suburb'])

In [74]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 40605 entries, 0 to 71937
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      40605 non-null  int64  
 1   name                    40605 non-null  object 
 2   neighborhood_overview   40605 non-null  object 
 3   London_borough          40605 non-null  object 
 4   latitude                40605 non-null  float64
 5   longitude               40605 non-null  float64
 6   minimum_nights          40605 non-null  int64  
 7   property_type           40605 non-null  object 
 8   room_type               40605 non-null  object 
 9   price                   40605 non-null  object 
 10  review_scores_location  40605 non-null  float64
 11  suburb                  40605 non-null  object 
dtypes: float64(3), int64(2), object(7)
memory usage: 4.0+ MB


In [61]:
# The original dataframe does not show any null values against suburb, most likely because the missing values are stored as empty strings.
# We will convert the blank strings to NaN and then drop these rows, so we can merge df_suburbc
df2 = df2.assign(suburb=df2['suburb'].replace('', np.nan))

In [62]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 40605 entries, 0 to 71937
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      40605 non-null  int64  
 1   name                    40605 non-null  object 
 2   neighborhood_overview   40605 non-null  object 
 3   London_borough          40605 non-null  object 
 4   latitude                40605 non-null  float64
 5   longitude               40605 non-null  float64
 6   minimum_nights          40605 non-null  int64  
 7   property_type           40605 non-null  object 
 8   room_type               40605 non-null  object 
 9   price                   40605 non-null  object 
 10  review_scores_location  40605 non-null  float64
 11  suburb                  37986 non-null  object 
dtypes: float64(3), int64(2), object(7)
memory usage: 4.0+ MB
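
The behaviour seen above, where `info()` reports no nulls until blank strings are converted, can be reproduced on a toy frame. This is a small sketch under assumptions: the four values are illustrative, and the regex also treats whitespace-only strings as missing, which goes slightly beyond the exact `''` replacement used in the notebook.

```python
import numpy as np
import pandas as pd

# Illustrative values only: one empty string and one whitespace-only string.
demo = pd.DataFrame({"suburb": ["Chelsea", "", "  ", "Fitzrovia"]})

# Empty strings are valid (non-null) objects, so isna() does not count them.
n_missing_before = int(demo["suburb"].isna().sum())

# Convert empty or whitespace-only strings to NaN so that dropna()/isna() can see them.
cleaned = demo.assign(suburb=demo["suburb"].replace(r"^\s*$", np.nan, regex=True))
n_missing_after = int(cleaned["suburb"].isna().sum())

print(n_missing_before, n_missing_after)
```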


In [63]:
df3=df2.dropna(axis=0, subset=['suburb'])

In [64]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37986 entries, 0 to 71937
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      37986 non-null  int64  
 1   name                    37986 non-null  object 
 2   neighborhood_overview   37986 non-null  object 
 3   London_borough          37986 non-null  object 
 4   latitude                37986 non-null  float64
 5   longitude               37986 non-null  float64
 6   minimum_nights          37986 non-null  int64  
 7   property_type           37986 non-null  object 
 8   room_type               37986 non-null  object 
 9   price                   37986 non-null  object 
 10  review_scores_location  37986 non-null  float64
 11  suburb                  37986 non-null  object 
dtypes: float64(3), int64(2), object(7)
memory usage: 3.8+ MB


In [65]:
#concatenating the two dataframes to get the final dataframe with all suburbs filled in
df4=pd.concat([df3,df_suburbc],axis=0)

In [66]:
df4.head()

Unnamed: 0,id,name,neighborhood_overview,London_borough,latitude,longitude,minimum_nights,property_type,room_type,price,review_scores_location,suburb
0,13913,Holiday London DB Room Let-on going,Finsbury Park is a friendly melting pot commun...,Islington,51.56861,-0.1127,1,Private room in rental unit,Private room,$79.00,4.71,Finsbury Park
1,15400,Bright Chelsea Apartment. Chelsea!,It is Chelsea.,Kensington and Chelsea,51.4878,-0.16813,10,Entire rental unit,Entire home/apt,$75.00,4.93,Chelsea
3,173082,The Residential Suite Above Gallery,"The neighbourhood ""Victoria Park Village"" is a...",Hackney,51.538254,-0.044086,2,Entire rental unit,Entire home/apt,$132.00,4.68,Homerton
4,42010,You Will Save Money Here,We have a unique cinema called the Phoenix whi...,Barnet,51.5859,-0.16434,4,Private room in home,Private room,$65.00,4.72,Hampstead Garden Suburb
5,17402,Superb 3-Bed/2 Bath & Wifi: Trendy W1,"Location, location, location! You won't find b...",Westminster,51.52195,-0.14094,4,Entire rental unit,Entire home/apt,$425.00,4.88,Fitzrovia


In [68]:
df4.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 40605 entries, 0 to 2562
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      40605 non-null  int64  
 1   name                    40605 non-null  object 
 2   neighborhood_overview   40605 non-null  object 
 3   London_borough          40605 non-null  object 
 4   latitude                40605 non-null  float64
 5   longitude               40605 non-null  float64
 6   minimum_nights          40605 non-null  int64  
 7   property_type           40605 non-null  object 
 8   room_type               40605 non-null  object 
 9   price                   40605 non-null  object 
 10  review_scores_location  40605 non-null  float64
 11  suburb                  40605 non-null  object 
dtypes: float64(3), int64(2), object(7)
memory usage: 4.0+ MB
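
The row count above (37,986 + 2,619 = 40,605) confirms that no listings were lost in the merge. A check of this kind could be wrapped in a small helper, sketched below under assumptions: `combine_checked` is a hypothetical name, the demo rows reuse three `id`/`suburb` pairs shown earlier in this notebook, and `ignore_index=True` is used here to get a unique index, which differs from the `pd.concat` call above that keeps the original index values.

```python
import pandas as pd


def combine_checked(known, filled, key="id"):
    """Concatenate two frames and verify that no rows were lost, duplicated or left without a suburb."""
    combined = pd.concat([known, filled], axis=0, ignore_index=True)
    if len(combined) != len(known) + len(filled):
        raise ValueError("row count changed during concatenation")
    if combined[key].duplicated().any():
        raise ValueError(f"duplicate values found in '{key}'")
    if combined["suburb"].isna().any():
        raise ValueError("some rows still have no suburb")
    return combined


# Toy frames built from rows displayed earlier in the notebook.
known = pd.DataFrame({"id": [13913, 15400], "suburb": ["Finsbury Park", "Chelsea"]})
filled = pd.DataFrame({"id": [198258], "suburb": ["Barking and Dagenham"]})

combined = combine_checked(known, filled)
print(combined)
```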


In [69]:
joblib.dump(df4,'../Data/df_suburb_clean.pkl',compress =9)

['../Data/df_suburb_clean.pkl']

In [70]:
df4.to_csv('../Data/df_suburb_clean.csv')

#### Summary Recap



The data pre-processing step was a critical part of the project because the analysis depends on `suburb`-level data, which the dataset did not come pre-populated with.<br>

The key steps in the data-sourcing and data pre-processing stage were:

1) Identifying the right dataset from the Inside Airbnb source - Having evaluated a few different datasets, we settled on the dataset used here as it provides detailed data on the listed properties, the key field being `neighborhood_overview`, which gives a short description of the neighborhood where the property is located <br>
2) Narrowing down the dataset to the relevant columns - As the purpose of the project is to identify neighborhood profiles and recommend properties accordingly, the key columns retained in the final dataset relate to the property location <br>
3) Imputing missing values for `name` and `review_scores_location` from the other columns - `name` was completed using details from the property location and neighborhood overview, and `review_scores_location` was filled in with the average review score for the corresponding London borough <br>
4) Reverse geocoding to map `suburb` names from latitude and longitude - We used Nominatim (OpenStreetMap) to request suburb names based on the property location. It returned a suburb for all but 2,619 of the 40,605 listings <br>
5) Imputing missing values for `suburb` by requesting `postcode` data and then manually mapping suburbs based on outward postcodes - This step was somewhat time-consuming. We first requested `postcode` information and mapped it to ward-level data for the City of London borough as a test. However, ward-level data was too granular for this project, where the end objective is to profile and recommend broad suburbs to Airbnb guests. We therefore changed approach and used the outward code of each postcode (e.g. BR2 in BR2 6AN) to identify the suburb and map it to the relevant listing. For boroughs that are closely knit (e.g. City of London) and outer boroughs where the suburbs are broadly similar (e.g. Richmond), we used the borough name as the `suburb` for the purposes of this project
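
Steps 4 and 5 together amount to a fallback chain: use the suburb returned by reverse geocoding when there is one, otherwise look up the outward postcode, otherwise fall back to the borough. The sketch below illustrates that chain only and is not the code used in this notebook. It assumes each reverse-geocoding result has been reduced to a plain `address` dictionary with optional `suburb` and `postcode` keys; `resolve_suburb` is a hypothetical helper, the lookup is abbreviated, and the `W4 1AB` postcode is made up.

```python
def resolve_suburb(address, borough, outward_lookup):
    """Pick a suburb name: geocoded suburb first, then outward-code lookup, then the borough."""
    address = address or {}

    # 1) Suburb returned by the reverse geocoder, if any.
    suburb = address.get("suburb")
    if isinstance(suburb, str) and suburb.strip():
        return suburb.strip()

    # 2) Manual mapping from the outward code (the part before the space).
    postcode = address.get("postcode")
    if isinstance(postcode, str) and postcode.strip():
        outward = postcode.strip().upper().split()[0]
        if outward in outward_lookup:
            return outward_lookup[outward]

    # 3) Borough name as a proxy for the suburb.
    return borough


# Abbreviated, illustrative lookup (W4 -> Chiswick is taken from the mapping used earlier).
lookup = {"W4": "Chiswick"}

results = [
    resolve_suburb({"suburb": "Finsbury Park"}, "Islington", lookup),
    resolve_suburb({"postcode": "W4 1AB"}, "Hounslow", lookup),
    resolve_suburb({"postcode": "KT8 9BY"}, "Richmond upon Thames", lookup),
    resolve_suburb(None, "City of London", lookup),
]
print(results)
```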

_____________________________________________________________________