### This notebook analyses the London BnB Listings dataset dated 6th September 2023

The data can be downloaded from: http://insideairbnb.com/get-the-data.html

In [None]:
#Import the necessary libraries

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style="whitegrid")

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 500)

In [None]:
#Read the data
data = pd.read_csv("Data/LondonBnBListings.csv")

In [None]:
#Let us see the shape of the data
data.shape

In [None]:
#Let us look at summary statistics of the numeric columns to get a better picture
data.describe()

In [None]:
#Let us look at the first few rows of the data and their values
data.head(5)

###### Based on the column values above, we can safely ignore the following columns: id, listing_url, scrape_id, last_scraped, source, name, description, neighborhood_overview, picture_url, host_url, host_location, host_about, host_response_time, host_thumbnail_url, host_picture_url, host_verifications, host_has_profile_pic, neighbourhood, neighbourhood_group_cleansed, bathrooms, bathrooms_text, minimum_nights, maximum_nights, minimum_minimum_nights, maximum_minimum_nights, minimum_maximum_nights, maximum_maximum_nights, calendar_updated, has_availability, availability_60, availability_90, calendar_last_scraped, number_of_reviews, number_of_reviews_l30d, first_review, last_review, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, calculated_host_listings_count_entire_homes, calculated_host_listings_count_private_rooms, calculated_host_listings_count_shared_rooms.

The reasons are as follows:

1. Some of the columns are redundant

2. Some of the columns, such as URLs, do not help the analysis

3. Some of the columns have mostly null values
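
The third reason can be checked directly. The following is a minimal sketch on a toy frame, not the notebook's actual data: the values are made up, and the 70% cut-off is a hypothetical threshold chosen only for illustration.

```python
import pandas as pd

# Toy stand-in for the listings table; column names mirror the dataset,
# but the values are invented for illustration.
sample = pd.DataFrame({
    "price": ["$100.00", "$85.00", None, "$1,200.00"],
    "bathrooms": [None, None, None, None],
    "calendar_updated": [None, None, None, "today"],
    "room_type": ["Entire home/apt", "Private room", "Private room", "Shared room"],
})

# Fraction of missing values per column, largest first.
null_share = sample.isna().mean().sort_values(ascending=False)

# Hypothetical cut-off: flag columns that are more than 70% empty.
threshold = 0.7
mostly_null = null_share[null_share > threshold].index.tolist()
print(mostly_null)
```

Running the same two lines on `data` (before the drop) would list the columns that are candidates for removal under whatever threshold is chosen.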

In [None]:
data.drop(['id', 'listing_url', 'scrape_id', 'last_scraped', 'source', 'name', 'description', 'neighborhood_overview', 
           'picture_url', 'host_url', 'host_location', 'host_about', 'host_response_time', 'host_thumbnail_url', 
           'host_picture_url', 'host_verifications', 'host_has_profile_pic', 'neighbourhood', 'neighbourhood_group_cleansed', 
           'bathrooms', 'bathrooms_text', 'minimum_nights', 'maximum_nights', 'minimum_minimum_nights', 
           'maximum_minimum_nights', 'minimum_maximum_nights', 'maximum_maximum_nights', 'calendar_updated', 
           'has_availability', 'availability_60', 'availability_90', 'calendar_last_scraped', 'number_of_reviews', 
           'number_of_reviews_l30d', 'first_review', 'last_review', 'review_scores_cleanliness', 'review_scores_checkin', 
           'review_scores_communication', 'review_scores_location', 'calculated_host_listings_count_entire_homes', 
           'calculated_host_listings_count_private_rooms', 'calculated_host_listings_count_shared_rooms'],axis=1, inplace=True)

In [None]:
#Recheck the shape of the data
data.shape

In [None]:
#The price column is stored as a string; convert it to a numeric format
#regex=False makes '$' a literal character rather than a regular-expression anchor
data['price'] = data['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype('float')
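
As a quick sanity check of the conversion, here is a small sketch on made-up price strings. It passes `regex=False` explicitly because `$` is also a regular-expression anchor, and how `str.replace` interprets it by default has differed between pandas versions; the sample values are assumptions, not rows from the dataset.

```python
import pandas as pd

# Made-up price strings in the same "$1,234.00" style as the price column.
raw = pd.Series(["$1,250.00", "$85.00", "$2,000.00"])

# Strip the currency symbol and thousands separator as literal characters,
# then cast the remaining text to float.
clean = (
    raw.str.replace("$", "", regex=False)
       .str.replace(",", "", regex=False)
       .astype(float)
)
print(clean.tolist())
```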

In [None]:
#Distribution of price data
data['price'].describe()

In [None]:
#Distribution of price data visually
bx = sns.boxplot(data=data, x='price')
bx.set(xlabel ="Price ($)", title ='Distribution of Price ($)')
plt.show()

In [None]:
data[data['price']>2000].shape

###### As the box plot and the count above show, only a very small share of listings (0.279%) are priced above $2000, so we can safely drop those data points from our analysis.
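
A percentage like the one quoted above can be computed in one step by averaging a boolean mask. The sketch below uses invented prices, so its result is not the notebook's figure.

```python
import pandas as pd

# Invented prices; two of the ten values exceed the cut-off.
prices = pd.Series([50.0, 120.0, 95.0, 3000.0, 180.0, 60.0, 75.0, 2500.0, 110.0, 40.0])

cutoff = 2000
# The mean of a boolean Series is the fraction of True values.
share_above = (prices > cutoff).mean() * 100
print(f"{share_above:.1f}% of listings are priced above {cutoff}")
```

On the real data the equivalent expression would be `(data['price'] > 2000).mean() * 100`.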

In [None]:
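#Keep listings priced at or below 2000; .copy() avoids SettingWithCopyWarning when new columns are added later
data = data[data['price'] <= 2000].copy()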
data.shape

In [None]:
#Distribution of Number of reviews (last twelve months) for a listing
data['number_of_reviews_ltm'].describe()

**According to the Airbnb data dictionary, `number_of_reviews_ltm` is a calculated column that gives the number of reviews the listing received in the last twelve months. We assume that a higher number of reviews (good or bad) indicates a more popular listing. From the summary above, roughly a quarter of the listings have more than 6 reviews. Therefore, we create a new boolean column named `popularity1` that flags these listings.**

In [None]:
data['popularity1'] = data['number_of_reviews_ltm'] > 6

In [None]:
#Distribution of Availability of a listing for the next 365 days
data['availability_365'].describe()

**According to the Airbnb data dictionary, `availability_365` is a calculated column that gives the availability of the listing over the next 365 days. We assume that lower availability indicates a more popular listing. From the summary above, around 50% of the listings are available for 65 days or fewer. Therefore, we create a new boolean column named `popularity2` that flags these listings.**

In [None]:
data['popularity2'] = data['availability_365'] <= 65
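
The two cut-offs (6 reviews and 65 days) are read off the `describe()` output. As an alternative sketch, they could be derived from quantiles instead of being hard-coded. The toy values below are invented, and using the 75th percentile and the median is an assumption based on the "25%" and "50%" wording above.

```python
import pandas as pd

# Invented values standing in for the two columns used above.
toy = pd.DataFrame({
    "number_of_reviews_ltm": [0, 0, 1, 2, 3, 5, 8, 20],
    "availability_365": [0, 10, 30, 60, 90, 180, 300, 365],
})

# Derive the thresholds from the data rather than typing them in.
reviews_cut = toy["number_of_reviews_ltm"].quantile(0.75)
availability_cut = toy["availability_365"].quantile(0.50)

# Same comparisons as in the notebook, with computed thresholds.
toy["popularity1"] = toy["number_of_reviews_ltm"] > reviews_cut
toy["popularity2"] = toy["availability_365"] <= availability_cut
print(toy[["popularity1", "popularity2"]].sum())
```

Computing the thresholds this way keeps the flags meaningful if the notebook is rerun on a different snapshot of the data.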

In [None]:
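#The amenities column stores each list as a string. We parse the strings into list objects.
#ast.literal_eval only evaluates Python literals, unlike eval, which would execute arbitrary code
import ast
data['amenities'] = data['amenities'].apply(ast.literal_eval)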
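
To illustrate the string-to-list conversion, here is a small sketch using `ast.literal_eval` from the standard library, which parses literals without executing code. The amenity strings are made up, and the sketch assumes every entry is a well-formed list literal.

```python
import ast
import pandas as pd

# Made-up amenity strings in the list-as-text form described above.
amenities = pd.Series(['["Wifi", "Kitchen", "Heating"]', '["Wifi"]', '[]'])

# Parse each string into a real Python list.
parsed = amenities.apply(ast.literal_eval)

# Number of amenities per listing.
counts = parsed.apply(len)
print(counts.tolist())
```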

##### **Motivation:** Businesses and government bodies want to know which neighbourhoods are popular, measured by the number of listings in each neighbourhood, to help them plan amenities such as public transport, restaurants, shopping centres, salons etc.
**Question 1:** Which neighbourhoods (say the top 10-15) have the most listings, and so may be good places to set up businesses relevant to tourists, visitors or travellers?

**Assumption:** A larger number of listings in an area may reflect high past demand from frequent visitors.

In [None]:
#Compute how many unique neighbourhoods there are
data['neighbourhood_cleansed'].nunique()

In [None]:
#Find the top 15 neighbourhoods with the most listings
neighbourhood = data.groupby(['neighbourhood_cleansed']).size().sort_values(ascending=False).reset_index()
neighbourhood.columns = ['Neighbourhood','Counts']
neighbourhood.head(15)

**Observation: The 15 neighbourhoods above account for around 80% of the listings in the data, which should help businesses and government decide where to extend or build facilities. From here on, we concentrate mostly on the listings in these neighbourhoods.**
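
The coverage figure in the observation can be computed rather than estimated by eye. Below is a minimal sketch on invented neighbourhood labels, with a hypothetical `top_n` of 2 in place of 15, so the printed number is not the notebook's result.

```python
import pandas as pd

# Invented labels: 10 listings spread over 4 neighbourhoods.
labels = pd.Series(
    ["Westminster"] * 5 + ["Camden"] * 3 + ["Hackney"] + ["Sutton"],
    name="neighbourhood_cleansed",
)

# value_counts sorts neighbourhoods by listing count, largest first.
counts = labels.value_counts()

top_n = 2
# Share of all listings that fall in the top_n neighbourhoods.
share = counts.head(top_n).sum() / counts.sum() * 100
print(f"Top {top_n} neighbourhoods cover {share:.1f}% of listings")
```

On the real data, the same calculation would be `neighbourhood['Counts'].head(15).sum() / neighbourhood['Counts'].sum() * 100`.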

In [None]:
#Extract the listings that belong to the top 15 neighbourhoods
top_15_names = neighbourhood['Neighbourhood'].head(15)
neighbourhood_15 = data[data['neighbourhood_cleansed'].isin(top_15_names)]

In [None]:
#Visualize how the listings are distributed on a map based on latitude and longitude information
plt.figure(figsize=(10,6), dpi = 150)

# plotting data on chart 
vz = sns.scatterplot(data=neighbourhood_15, x='longitude',y='latitude',hue='neighbourhood_cleansed')
sns.move_legend(vz, "upper left", bbox_to_anchor=(1, 1))

#Add plot title
plt.title('London BnB property listings by neighbourhood (6th Sep 2023)')
plt.show()