# Business Understanding

Airbnb is an American online marketplace for vacation rentals, based in San Francisco, California. The platform, accessible
through its website or app, connects travellers looking for accommodation with hosts who have space to rent. Hosts can list
spare rooms, entire properties, or parts of a property, and can also offer tourism experiences. Travellers search for rooms
and properties by neighbourhood or location. Airbnb recommends the best price in the neighbourhood, and users book the best deal.

In this analysis, we use the Boston Airbnb data to analyse customer reviews, understand customer satisfaction, and draw any
additional insights from the reviews. This will help us understand customer preferences and provide what customers want.
To do so, we will answer the following questions:
    
1. What type of properties are being booked the most?
2. Is there a relation between the number of bookings of a place and its review score?
3. Is there anything that biases the users' reviews?
4. How do reviews for a given property change across seasons?

# Data Understanding

The selected dataset contains three data files:

- listings
- reviews
- calendar

We will go through each file in the analysis below to understand it better and select the appropriate one or ones for the rest of
our analysis.

In [None]:
#Import the necessary packages

import numpy as np
import pandas as pd
#import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
import cartopy


import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
#mpl.style.use('ggplot')

## Reading the 3 files into dataframes

In [None]:
listings = pd.read_csv("data/listings.csv")
calendar = pd.read_csv("data/calendar.csv")
reviews = pd.read_csv("data/reviews.csv")

## Analysis and data preparation for the 3 files we have

### Listings data

In [None]:
listings.columns

In [None]:
listings.shape

In [None]:
listings.info()

In [None]:
listings.describe()

### Dealing with null values in the listings data

In [None]:
#list of columns with at least 70% null values

to_drop = [cols for cols in listings.columns.values if (listings[cols].isnull().sum()/len(listings)*100)>=70]

In [None]:
#check the result
to_drop

We will drop the columns with at least 70% null values, as they are not useful for our analysis.
The weekly_price and monthly_price columns can also be removed, as the price column is enough for our needs.

In [None]:
#Drop columns with at least 70% null values
listings.drop(to_drop,axis=1,inplace=True)

In [None]:
#check that all columns with at least 70% null values have been removed
[cols for cols in listings.columns.values if (listings[cols].isnull().sum()/len(listings)*100)>=70]
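
The null-share filter used above can also be written with `isnull().mean()`, which returns the fraction of missing values per column directly. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are illustrative assumptions, not the Boston data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `listings`
toy = pd.DataFrame({
    "price": ["$10", "$20", "$30", "$40"],
    "square_feet": [np.nan, np.nan, np.nan, 500.0],      # 75% missing
    "review_scores_rating": [90.0, np.nan, 95.0, 80.0],  # 25% missing
})

# Percentage of null values per column
null_share = toy.isnull().mean() * 100

# Columns at or above the 70% threshold, then drop them
to_drop_toy = null_share[null_share >= 70].index.tolist()
toy_clean = toy.drop(columns=to_drop_toy)
```

Keeping `null_share` as a Series also makes it easy to inspect columns that sit just below the threshold before deciding to drop them.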

Now we turn to the price columns.

In [None]:
# List of all columns with price
listings_price = listings[[x for x in listings.columns if 'price' in x]]

In [None]:
listings_price.head()

Let's remove the dollar sign ($) from the price column and change its type to float.

In [None]:
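# Remove the $ sign and the thousands separators, then convert the price column to float.
# regex=False treats '$' as a literal character rather than the regex end-of-string anchor.
listings.price = listings.price.str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)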

In [None]:
#check the changes
listings.price.head()
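
To illustrate why the literal replacement matters, here is a small sketch on a hypothetical Series of price strings (`raw_price` is an assumed example, not taken from the dataset). With `regex=False`, `$` and `,` are removed as plain characters before the cast to float:

```python
import pandas as pd

# Hypothetical price strings in the same format as the listings file
raw_price = pd.Series(["$250.00", "$1,400.00", "$65.00"])

clean_price = (
    raw_price.str.replace("$", "", regex=False)   # strip the currency symbol
             .str.replace(",", "", regex=False)   # strip thousands separators
             .astype(float)
)
```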

As the main focus of this analysis is the reviews, let's isolate all the review-related columns and see if we need to make any changes.

In [None]:
#list of columns related to reviews
listings_reviews = listings[[x for x in listings.columns if 'review' in x]]

In [None]:
listings_reviews.describe()

In [None]:
listings_reviews.info()

In [None]:
listings_reviews.head()

In listings_reviews, some rows have a number of reviews equal to zero (0). Those rows are not useful for our analysis, so we will drop them.

In [None]:
# Selecting all rows without reviews
to_drop1 = listings_reviews[listings_reviews['number_of_reviews'] == 0]

In [None]:
to_drop1

In [None]:
# drop all rows without reviews
listings.drop(to_drop1.index, inplace = True)
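
The same filtering can be expressed as a single boolean mask instead of collecting an index and dropping it. This is only a sketch on a hypothetical frame (`toy` and its `id` values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical listings with a review count per row
toy = pd.DataFrame({"id": [1, 2, 3], "number_of_reviews": [0, 5, 2]})

# Keep only listings that have at least one review
reviewed = toy[toy["number_of_reviews"] > 0].copy()
```

The `.copy()` avoids later assignments on `reviewed` being treated as writes to a view of `toy`.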

Now, we will check if we have any other null reviews values.

In [None]:
listings_reviews1 = listings[[x for x in listings.columns if 'review' in x]]

In [None]:
(listings_reviews1.isnull().sum()/len(listings_reviews1)*100).sort_values(ascending=False)

We still have some columns with null values. We will not drop them, as they might be useful to our analysis.
Instead, we will fill the missing values with the column mean.
We use the mean because, in the listings_reviews.describe() output a few cells earlier, the smallest values in all of those columns were greater than 0,
so filling with the mean is more sensible than filling with zero.

In [None]:
# listing the remaining review columns with at least 2% null values
tobe_filled = [cols for cols in listings_reviews1.columns.values if (listings_reviews1[cols].isnull().sum()/len(listings_reviews1)*100)>=2]

In [None]:
tobe_filled

In [None]:
# Fill the null values with the rounded column mean
listings[tobe_filled] = listings[tobe_filled].fillna(value=round(listings[tobe_filled].mean()))
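
Mean imputation as used above can be sketched on a small example. The frame `toy_scores` below is hypothetical; passing a Series of column means to `fillna` fills each column with its own mean:

```python
import numpy as np
import pandas as pd

# Hypothetical review-score columns with gaps
toy_scores = pd.DataFrame({
    "review_scores_rating": [90.0, np.nan, 100.0, 80.0],
    "review_scores_value": [9.0, 10.0, np.nan, 8.0],
})

# One rounded mean per column; NaN values are ignored when averaging
col_means = toy_scores.mean().round()

# fillna aligns the Series index with the column names
filled = toy_scores.fillna(col_means)
```

Note that mean imputation shrinks the variance of the filled columns, which is worth keeping in mind when comparing rating spreads later on.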

### Let's analyse the calendar data we have

In [None]:
#Analyse and prepare the calendar data

In [None]:
calendar.info()

In [None]:
calendar.describe()

In [None]:
calendar.head()

The calendar data does not contain information useful to our analysis. We will not investigate it further.

### Analysing the reviews data

In [None]:
reviews.columns

In [None]:
reviews.info()

In [None]:
reviews.head()

In [None]:
# We need to change the date column to a datetime 
reviews['date'] = pd.to_datetime(reviews.date) 

In [None]:
# Add month, year and yearmonth columns to the reviews dataset for easier plotting
reviews['month'], reviews['year'], reviews['yearmonth'] = reviews.date.dt.month, reviews.date.dt.year, reviews.date.dt.to_period("M")
# show_counts replaces the null_counts argument, which was removed in pandas 2.0
reviews.info(verbose=True, show_counts=True)

In [None]:
reviews.head(10)
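
The date handling above can be checked on a tiny example. The two dates below are hypothetical; the sketch shows what the `.dt` accessors return once the column is a datetime:

```python
import pandas as pd

# Hypothetical review dates as strings, as they would come from a CSV
toy_reviews = pd.DataFrame({"date": ["2015-10-03", "2016-01-15"]})
toy_reviews["date"] = pd.to_datetime(toy_reviews["date"])

toy_reviews["month"] = toy_reviews["date"].dt.month                 # integer month, 1-12
toy_reviews["year"] = toy_reviews["date"].dt.year                   # integer year
toy_reviews["yearmonth"] = toy_reviews["date"].dt.to_period("M")    # monthly Period, e.g. 2015-10
```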

In [None]:
# Now, we will check for missing values in the reviews data

(reviews.isnull().sum()/len(reviews)*100).sort_values(ascending=False)

We do not have many null values. The few that exist are in the comments column, which is not used in our analysis. No further processing is needed for the reviews data.

# Now we will start our analysis on the data

Let's begin by finding the property types that are listed most often.

In [None]:
#Analyzing and plotting the number of listings based on their property type

propertytype = listings[['host_listings_count', 'property_type']].sort_values(by = 'host_listings_count', ascending=False)

# Counting rows per property type gives the number of listings of each type
propertytype.groupby(['property_type'])['host_listings_count'].count().sort_values(ascending=False).plot(kind='bar', 
           figsize =(15,8), 
           title = 'Boston Property Type Listing Frequency', 
           legend = False)
plt.xlabel('Property type')
plt.ylabel('Number of listings')

The most listed property types in the dataset are, in this order: Apartment, House, Condominium.
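
A frequency ranking like this one can also be obtained with `value_counts`, which counts and sorts in one step. The sketch below uses a hypothetical list of property types rather than the real listings:

```python
import pandas as pd

# Hypothetical property types for six listings
toy_listings = pd.DataFrame({
    "property_type": ["Apartment", "House", "Apartment",
                      "Condominium", "Apartment", "House"]
})

# Counts per category, sorted from most to least frequent
type_counts = toy_listings["property_type"].value_counts()
```

Calling `type_counts.plot(kind="bar")` would then give the same kind of bar chart as above.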

In [None]:
# Distribution of property types per neighbourhood

listings.groupby(['neighbourhood','property_type'])['property_type'].count().unstack('property_type').plot.bar(stacked=True,figsize=(16,10))
plt.ylabel('Nb property type')
plt.title('Distribution of property types per neighbourhood')
plt.legend(loc=3)

In [None]:
#Which neighbourhood has the most listings?

plt.figure(figsize=(14,6))
listings.groupby(['neighbourhood'])['host_listings_count'].count().sort_values(ascending=False).plot(kind='bar')
plt.title('Number of listings per neighbourhood')
plt.xlabel('Neighbourhood')
plt.ylabel('Number of listings')

In [None]:
#What are the most expensive property types in Boston?

plt.figure(figsize=(14,6));
listings.groupby(['property_type'])['price'].mean().sort_values(ascending=False).plot(kind='bar');
plt.title('Property price as per Property Type');
plt.xlabel('Property Type');
plt.ylabel('$ Price');

In [None]:
#What are the most and least expensive neighbourhoods in Boston?

plt.figure(figsize=(14,6))
listings.groupby(['neighbourhood'])['price'].mean().sort_values(ascending=False).plot(kind='bar')
plt.title('Mean price per neighbourhood')
plt.xlabel('Neighbourhood')
plt.ylabel('Mean price ($)')

In [None]:
#What are the most and least booked property types in Boston?
#Note: calculated_host_listings_count is the number of listings held by each host, so it is only a rough proxy for bookings.

plt.figure(figsize=(14,6))
listings.groupby(['property_type'])['calculated_host_listings_count'].mean().sort_values(ascending=False).plot(kind='bar')
plt.title('Number of bookings per property type')
plt.xlabel('Property Type')
plt.ylabel('Number of bookings')

In [None]:
#Which neighbourhood has the most bookings?

plt.figure(figsize=(14,6))
listings.groupby(['neighbourhood'])['calculated_host_listings_count'].count().sort_values(ascending=False).plot(kind='bar')
plt.title('Number of bookings per neighbourhood')
plt.xlabel('Neighbourhood')
plt.ylabel('Number of bookings')

In [None]:
#Which property types in Boston receive the most reviews?

plt.figure(figsize=(14,6))
listings.groupby(['property_type'])['number_of_reviews'].mean().sort_values(ascending=False).plot(kind='bar')
plt.title('Number of reviews by Property type')
plt.xlabel('Property Type')
# plt.ylabel('number_of_reviews')

In [None]:
#Which property types in Boston are the best rated?

plt.figure(figsize=(14,6))
listings.groupby(['property_type'])['review_scores_rating'].mean().sort_values(ascending=False).plot(kind='bar')
plt.title('Review score per Property Type')
plt.xlabel('Property Type')
plt.ylabel('review_scores_rating')

In [None]:
#Which neighbourhood has the best ratings?

plt.figure(figsize=(14,6))
listings.groupby(['neighbourhood'])['review_scores_rating'].mean().sort_values(ascending=False).plot(kind='bar')
plt.title('Review score per neighbourhood')
plt.xlabel('Neighbourhood')
plt.ylabel('review_scores_rating')

Comparing the ratings across the plots above, we notice that there is not much variation: all the ratings are high, from 80 to 100.
Is this due to a bias on the customer side when rating?
Do customers rank every property highly, no matter its price and condition?
Those questions go beyond the scope of our analysis. To get a clearer view of the rankings, we would need to apply sentiment analysis methods to the comments on the properties.
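
Before reaching for sentiment analysis, the concentration of ratings at the top of the scale can be quantified with a few summary statistics. This is a minimal sketch on hypothetical scores (`toy_ratings` is an assumption, not the Boston ratings):

```python
import pandas as pd

# Hypothetical review scores on the 0-100 scale
toy_ratings = pd.Series([100, 96, 93, 90, 88, 85, 80, 60],
                        name="review_scores_rating")

# Share of listings rated 80 or higher
share_high = (toy_ratings >= 80).mean()

# Quartiles show how tightly the scores cluster near the top
summary = toy_ratings.describe()[["min", "25%", "50%", "75%", "max"]]
```

A high `share_high` together with a narrow interquartile range would support the observation that ratings carry little variation.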

### Analysing the relationship between price and number of reviews

In [None]:
#Interpreting the relation between number of reviews and price

price_review = listings[['number_of_reviews', 'price']].sort_values(by = 'price')

price_review.plot(x = 'price', 
                  y = 'number_of_reviews', 
                  style = 'o',
                  figsize =(17,10),
                  legend = False,
                  title = 'Reviews based on Price')

plt.xlabel("price")
plt.ylabel("Number of reviews")

From the graph, most reviews are observed for listings priced at around 100 to 400. The number quickly declines as the price goes up.
Let's plot the data for prices <= 100 and see if we can get a clearer interpretation.

In [None]:
#Interpreting the relation between number of reviews and price

price_review = listings[['number_of_reviews', 'price']].sort_values(by = 'price')
price_review1 = price_review[price_review['price'] <= 100]

price_review1.plot(x = 'price', 
                  y = 'number_of_reviews', 
                  style = 'o',
                  figsize =(17,10),
                  legend = False,
                  title = 'Reviews based on Price')

plt.xlabel("price")
plt.ylabel("Number of reviews")

We can see that, for a particular price, the number of reviews is quite random, so the plot does not show any relationship between the price and the number of reviews.
Let's group the data by price to get a better view of the accumulated number of reviews for a given price.

In [None]:
#Accumulated number of reviews at each price point

price_review = listings[['number_of_reviews', 'price']].sort_values(by = 'price')

# sum() accumulates the reviews of all listings sharing a price; count() would only count those listings
price_review.groupby(['price'])['number_of_reviews'].sum().plot(x = 'price', 
                  y = 'number_of_reviews', 
                  style = 'o',
                  figsize =(17,10),
                  legend = False,
                  title = 'Reviews based on Price')

plt.xlabel("price")
plt.ylabel("Number of reviews")
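
When grouping by price, `count()` and `sum()` answer different questions: the first returns how many listings share a price, the second the total number of reviews across those listings. A minimal sketch on hypothetical values makes the difference visible:

```python
import pandas as pd

# Hypothetical listings: two at the same price, one more expensive
toy = pd.DataFrame({
    "price": [100.0, 100.0, 250.0],
    "number_of_reviews": [12, 30, 4],
})

# count = number of listings per price, sum = accumulated reviews per price
per_price = toy.groupby("price")["number_of_reviews"].agg(["count", "sum"])
```

Here the 100 price point has two listings but 42 reviews in total, so the two aggregations would produce quite different plots.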

In [None]:
#Accumulated number of reviews at each price point, for prices up to 100

price_review = listings[['number_of_reviews', 'price']].sort_values(by = 'price')
price_review1 = price_review[price_review['price'] <= 100]

# sum() accumulates the reviews of all listings sharing a price; count() would only count those listings
price_review1.groupby(['price'])['number_of_reviews'].sum().plot(x = 'price', 
                  y = 'number_of_reviews', 
                  style = 'o',
                  figsize =(17,10),
                  legend = False,
                  title = 'Reviews based on Price')

plt.xlabel("price")
plt.ylabel("Number of reviews")

**In conclusion, the price does not necessarily influence the number of reviews. It is worth noting that expensive property types are booked less, which could explain why their review counts are low.**

This answers the question: Do users tend to give more reviews to expensive listings?

The answer is: not really. Although listings priced at 400 or below have more reviews, the number of reviews per listing varies widely.
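
A visual impression like this can be backed by a correlation coefficient. The sketch below, on hypothetical numbers, computes Pearson's correlation and a Spearman rank correlation (obtained here as the Pearson correlation of the ranks, which avoids any extra dependency):

```python
import pandas as pd

# Hypothetical prices and review counts
toy = pd.DataFrame({
    "price": [50, 100, 150, 200, 400],
    "number_of_reviews": [10, 40, 5, 30, 2],
})

# Linear association
pearson = toy["price"].corr(toy["number_of_reviews"])

# Monotonic association: Pearson correlation computed on the ranks
spearman = toy["price"].rank().corr(toy["number_of_reviews"].rank())
```

Values close to 0 would be consistent with the conclusion above, while values close to -1 or 1 would suggest a relationship that the scatter plots hide.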

### Analysing the relationship between the number of reviews and the review score

In [None]:
#Interpreting the relation between number of reviews and review_scores_rating 


price_review = listings[['number_of_reviews', 'review_scores_rating']].sort_values(by = 'review_scores_rating')

price_review.plot(x = 'review_scores_rating', 
                  y = 'number_of_reviews', 
                  style = 'o',
                  figsize =(17,10),
                  legend = False,
                  title = 'Reviews based on review_scores_rating')

plt.xlabel("review_scores_rating")
plt.ylabel("Number of reviews")

The majority of our data points are in the second half of the graph, with review scores between 70 and 100. We can see a slight relationship between the number of reviews and the review score.
Let's narrow our view to review scores between 70 and 100.

In [None]:
#Interpreting the relation between number of reviews and review_scores_rating 


price_review = listings[['number_of_reviews', 'review_scores_rating']].sort_values(by = 'review_scores_rating')
price_review1 = price_review[price_review['review_scores_rating'] >70 ]
price_review1.plot(x = 'review_scores_rating', 
                  y = 'number_of_reviews', 
                  style = 'o',
                  figsize =(17,10),
                  legend = False,
                  title = 'Reviews based on review_scores_rating')

plt.xlabel("review_scores_rating")
plt.ylabel("Number of reviews")

There seems to be a slight increase in the number of reviews as the review score increases. Let's take a closer look by grouping the data by review score.

In [None]:
#Interpreting the relation between number of reviews and review_scores_rating 


price_review = listings[['number_of_reviews', 'review_scores_rating']].sort_values(by = 'review_scores_rating')

price_review.groupby(['review_scores_rating'])['number_of_reviews'].count().plot(x = 'review_scores_rating', 
                  y = 'number_of_reviews', 
                  style = 'o',
                  figsize =(17,10),
                  legend = False,
                  title = 'Reviews based on review_scores_rating')

plt.xlabel("review_scores_rating")
plt.ylabel("Number of reviews")

Users may occasionally rate a listing because it already has many ratings, but the graphs above show no clear relationship. We therefore find no relationship between the number of reviews and the review score.

This answers the question: Do users tend to rate a listing because it has received many reviews in the past?

The answer is no, although it might happen in a few instances.
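
Grouping on every distinct score produces many small groups. One alternative is to bin the scores into bands with `pd.cut` and compare a robust statistic per band. The data and the band edges below are hypothetical choices for illustration:

```python
import pandas as pd

# Hypothetical scores and review counts
toy = pd.DataFrame({
    "review_scores_rating": [72, 78, 85, 88, 95, 99],
    "number_of_reviews": [3, 5, 10, 20, 40, 8],
})

# Bands (70, 80], (80, 90], (90, 100]; the edges are an assumption
bins = [70, 80, 90, 100]
toy["score_band"] = pd.cut(toy["review_scores_rating"], bins=bins)

# Median number of reviews per score band
band_median = toy.groupby("score_band", observed=True)["number_of_reviews"].median()
```

The median is less sensitive than the mean to a handful of listings with very many reviews.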

### Analysing the relationship between number of listings and the rating scores

In [None]:
#Interpreting the relation between the host listings count and review_scores_rating

price_review = listings[['host_listings_count', 'review_scores_rating']].sort_values(by = 'review_scores_rating')

price_review.plot(x = 'review_scores_rating', 
                  y = 'host_listings_count', 
                  style = 'o',
                  figsize =(17,10),
                  legend = False,
                  title = 'Reviews based on host_listings_count')

plt.xlabel("review_scores_rating")
plt.ylabel("host_listings_count")

The majority of the data points are located between review scores 70 and 100. Let's narrow the graph down to those values.

In [None]:
#Interpreting the relation between the host listings count and review_scores_rating

price_review = listings[['host_listings_count', 'review_scores_rating']].sort_values(by = 'review_scores_rating')
price_review1 = price_review[price_review['review_scores_rating'] >70 ]

price_review1.plot(x = 'review_scores_rating', 
                  y = 'host_listings_count', 
                  style = 'o',
                  figsize =(17,10),
                  legend = False,
                  title = 'Reviews based on host_listings_count')

plt.xlabel("review_scores_rating")
plt.ylabel("host_listings_count")

Once again, there is no clear relationship between the number of listings and the review scores.
Let's group the data by review score to see if we can get a better view.

In [None]:
#Interpreting the relation between the host listings count and review_scores_rating

price_review = listings[['host_listings_count', 'review_scores_rating']].sort_values(by = 'review_scores_rating')
price_review1 = price_review[price_review['review_scores_rating'] >70 ]

price_review1.groupby(['review_scores_rating'])['host_listings_count'].count().plot(x = 'review_scores_rating', 
                  y = 'host_listings_count', 
                  style = 'o',
                  figsize =(17,10),
                  legend = False,
                  title = 'Reviews based on host_listings_count')

plt.xlabel("review_scores_rating")
plt.ylabel("host_listings_count")

We can see the beginning of a relationship between the two values, but it is not consistent enough to draw a conclusion. We therefore find no relationship between the number of listings and the review scores.

This answers the question: Is there a relation between the number of bookings of a place and its review score?

The answer is no. It holds on a few occasions, but not overall: a place that has been listed multiple times will not always have a good rating.

### Let's analyse the reviews per year and month to see when users are most likely to give reviews

In [None]:
# grouping the reviews by year
reviews_per_year = pd.DataFrame(reviews.groupby(['year'])['listing_id'].count())

In [None]:
# adapting the indexing
reviews_per_year.rename(columns = {'listing_id': 'listing_count'}, inplace = True)
reviews_per_year.reset_index(inplace=True)

In [None]:
# overview of the output
reviews_per_year.head(20)

In [None]:
# plotting the number of reviews against the year
fig, ax = plt.subplots()
plt.scatter(reviews_per_year['year'], reviews_per_year['listing_count'])
plt.show()

We can see in the plot above that the number of reviews increased a lot between 2009 and 2016.
Let's see below how the number of reviews increased throughout the months.
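
The yearly counts can also be produced in one step by grouping on the year extracted from the datetime column and calling `size()`. The sketch below uses hypothetical review rows:

```python
import pandas as pd

# Hypothetical reviews: one row per review
toy_reviews = pd.DataFrame({
    "listing_id": [1, 1, 2, 3, 3, 3],
    "date": pd.to_datetime([
        "2014-05-01", "2015-09-10", "2015-10-02",
        "2016-01-03", "2015-10-20", "2014-12-25",
    ]),
})

# size() counts rows per group, so each review is counted once
per_year = (
    toy_reviews.groupby(toy_reviews["date"].dt.year)
               .size()
               .rename("review_count")
)
```

Unlike `count()` on a specific column, `size()` does not skip rows where that column is null.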

In [None]:
reviews_per_yearmonth = pd.DataFrame(reviews.groupby(['yearmonth'])['listing_id'].count())

In [None]:
reviews_per_yearmonth.rename(columns = {'listing_id': 'listing_count'}, inplace = True)
reviews_per_yearmonth.reset_index(inplace=True)

In [None]:
reviews_per_yearmonth['yearmonth'] = reviews_per_yearmonth['yearmonth'].astype(str)

In [None]:
reviews_per_yearmonth.head(20)

In [None]:
# plotting the number of reviews per monthYear

fig, ax = plt.subplots(figsize=(22, 12))
plt.scatter(reviews_per_yearmonth['yearmonth'], reviews_per_yearmonth['listing_count'])
plt.xticks(rotation=90)
plt.show()

As seen previously, the number of reviews has increased a lot through the years. Looking closer, we can see that each year has a peak, a month in which the number of reviews is the highest.
Let's look at this trend more clearly.

In [None]:
#Splitting the yearmonth column to get separate columns for years and months

reviews_per_yearmonth_pivot = reviews_per_yearmonth.copy()
reviews_per_yearmonth_pivot[['yearmonth','month']] = reviews_per_yearmonth_pivot['yearmonth'].str.split('-',expand=True)

In [None]:
reviews_per_yearmonth_pivot.head()

In [None]:
#Transforming the data into a matrix shape for easier plotting.

# Months columns prefilled
months_in_order = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']

In [None]:
reviews_per_yearmonth_pivot = reviews_per_yearmonth_pivot[reviews_per_yearmonth_pivot['yearmonth']>"2008"].pivot(index='month', columns='yearmonth', values='listing_count').reindex(months_in_order)

In [None]:
reviews_per_yearmonth_pivot

In [None]:
reviews_per_yearmonth_pivot.columns

At first glance, the number of reviews peaks each year around the 10th month. We still need to plot the data for better clarity. There are a few null values in the 2009 and 2016 columns. The 2009 column still fits our analysis, since all months from the 7th onward are filled. The 2016 column, on the contrary, is missing some useful data, so we will remove that column.
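
A month-by-year table of this kind can also be built straight from the datetime components, without converting periods to strings and splitting them. This is a sketch with `pd.crosstab` on hypothetical dates; note that it fills missing month and year combinations with 0 rather than NaN, which differs from the pivot used in this notebook:

```python
import pandas as pd

# Hypothetical review dates
dates = pd.to_datetime(pd.Series([
    "2014-09-01", "2014-10-05", "2014-10-09",
    "2015-10-01", "2015-10-11", "2015-11-02",
]))

# Rows are months, columns are years, cells are review counts
month_by_year = pd.crosstab(dates.dt.month, dates.dt.year,
                            rownames=["month"], colnames=["year"])

# Make sure all twelve months appear, in calendar order
month_by_year = month_by_year.reindex(range(1, 13), fill_value=0)
```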

In [None]:
#fill the null values in the 2009 column with mean
reviews_per_yearmonth_pivot['2009'] = reviews_per_yearmonth_pivot['2009'].fillna(value=round(reviews_per_yearmonth_pivot['2009'].mean()))

In [None]:
#removing the 2016 column (the axis must be given by keyword in pandas 2.0 and later)
reviews_per_yearmonth_pivot = reviews_per_yearmonth_pivot.drop(columns='2016')

In [None]:
reviews_per_yearmonth_pivot

In [None]:
#line chart

fig, ax = plt.subplots(figsize=(17, 10))
plt.plot(reviews_per_yearmonth_pivot, marker='o')
plt.title("Historical Count of reviews: Month by Month Comparison", fontsize=22)
plt.xlabel("Date [Month]")
plt.ylabel("Count [Reviews]")
plt.legend(reviews_per_yearmonth_pivot.columns)
plt.show()

From the above plot, we can conclude that each year the number of reviews peaks in the 10th month, except for 2014, where the peak was reached in the 9th month.
Overall, customers tend to give more reviews around the 9th and 10th months.
This could be explained by people returning to their everyday lives after travelling: more properties are handed back to their owners, so more guests post their reviews.
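
The peak month per year can be read off programmatically with `idxmax`, which returns the index label of the maximum in each column. The small table below is hypothetical and only mimics the shape of the month-by-year pivot:

```python
import pandas as pd

# Hypothetical month-by-year review counts
toy_pivot = pd.DataFrame(
    {"2014": [5, 30, 22], "2015": [10, 25, 41]},
    index=pd.Index(["08", "09", "10"], name="month"),
)

# For each year (column), the month (index label) with the most reviews
peak_month = toy_pivot.idxmax()
```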