# SI618 Project
### Analyzing the Impact of Various Factors on B&B Visitors' Reviews
#### — A study based on Airbnb datasets

Team members: Qian Dong (dqq) section 001; Yujun Zhang (yukiz) section 001; Yinuo Wei (seesaway) section 001


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from folium.plugins import HeatMap

### Cleaning and manipulation
1. Primary dataset description:

This dataset contains essential details enabling an in-depth analysis of hosts, geographical availability, and key metrics required for predictions and drawing meaningful conclusions. It contains information about the name, host, location, room type, number of reviews, and the price of the house.

In [None]:
ab=pd.read_csv('data/AB_NYC_2019.csv')

In [None]:
ab.head()

In [None]:
ab.shape

In [None]:
ab.columns

In [None]:
ab.describe()

In [None]:
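# box plots of every numeric column, one subplot each
ab.select_dtypes(exclude=['object'])\
    .plot(kind='box', subplots=True, layout=(3,4), figsize=(14,14), fontsize=14)
# suptitle labels the whole figure; plt.title would only label the last subplot
plt.suptitle('Box Plot for each input variable')
plt.show()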

The summary table and box plots show a few extreme price outliers, so to keep the histogram readable we restrict the price data to listings priced under 2,000.
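
The 2,000 cutoff can be sanity-checked by looking at how many listings it removes and where the upper quantiles of price fall. A minimal sketch on synthetic, right-skewed prices (the `prices` series and its distribution are made-up stand-ins for `ab['price']`, not the real data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for ab['price']: right-skewed values with a long tail.
rng = np.random.default_rng(0)
prices = pd.Series(np.round(rng.lognormal(mean=4.7, sigma=0.7, size=10_000)))

cutoff = 2000
share_kept = (prices < cutoff).mean()            # fraction of listings left in the plot
upper_quantiles = prices.quantile([0.95, 0.99, 0.999])

print(f'share of listings below {cutoff}: {share_kept:.4f}')
print(upper_quantiles)
```

On the real data, replacing `prices` with `ab['price']` would show whether the cutoff only trims the tail or removes a noticeable share of listings.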

In [None]:
price_data = ab[ab.price < 2000].loc[:, ['id', 'price']]

plt.figure(figsize=(10, 6))
plt.hist(price_data.price, bins=50, ec='black')
plt.xlabel('Price')
plt.ylabel('Quantity')
plt.title('Distribution of prices')
plt.show()

The plot shows that most houses and apartments are priced below 500, with prices around 100 being the most common. In general, the number of available houses and apartments decreases as the price increases.

In [None]:
ab.select_dtypes(exclude=['object']).isna().sum()

Only reviews_per_month has missing values, so it is the only numeric column we need to fill.

In [None]:
reviews_per_month_data = ab[ab.reviews_per_month.notna()].loc[:, ['id', 'reviews_per_month']]

plt.figure(figsize=(6, 4))
plt.hist(reviews_per_month_data.reviews_per_month, bins=50, ec='black')
plt.xlabel('Reviews per month')
plt.ylabel('Quantity')
plt.title('Distribution of reviews per month')
plt.show()

In [None]:
ab['reviews_per_month'].min()

From the histogram of reviews_per_month and its minimum value, we can infer that the missing values should be 0, since a listing with no reviews has 0 reviews per month.

In [None]:
# replace missing values with 0 (listings without reviews)
ab['reviews_per_month'] = ab['reviews_per_month'].fillna(0)

The updated graph:

In [None]:
ab['reviews_per_month'].plot(kind='box')
plt.title('Box Plot for updated reviews per month')
plt.ylabel('Reviews per month')
plt.show()

In [None]:
ab.select_dtypes(include=['object']).head()

In [None]:
ab['neighbourhood_group'].value_counts().plot(kind='bar', edgecolor='black')
plt.title('Neighbourhood Group Counts')
plt.xlabel('Neighbourhood Group')
plt.ylabel('Count')
plt.show()

In [None]:
# box plot of price by neighbourhood_group
plt.figure(figsize=(8, 6))
sns.boxplot(x='neighbourhood_group', y='price', data=ab[ab.price < 2000])
plt.xlabel('Neighbourhood group')
plt.ylabel('Price')
plt.title('Price by neighbourhood group')
plt.show()

From the box plot of prices across the different neighbourhood groups, we can see that prices in Manhattan are higher than in the other groups, which suggests that the neighbourhood (that is, the area) influences the price.
The value-count plot also shows that Manhattan has the most houses and apartments listed, so it appears to be the most popular neighbourhood group.

In [None]:
ab['neighbourhood'].nunique()

In [None]:
ab['neighbourhood'].value_counts().head(10)

In [None]:
plt.figure(figsize=(6, 4))
ab['room_type'].value_counts().plot(kind='bar', edgecolor='black')
plt.title('Room Type Counts')
plt.xlabel('Room Type')
plt.ylabel('Count')
plt.show()

In [None]:
ab['last_review']=pd.to_datetime(ab['last_review'])
ab['last_review'].dt.year.value_counts().sort_index().plot(kind='bar', edgecolor='black')
plt.title('Last Review Year Counts')
plt.xlabel('Last Review Year')
plt.ylabel('Count')
plt.show()

In [None]:
ab['last_review'].dt.month.value_counts().sort_index().plot(kind='bar', edgecolor='black')
plt.title('Last Review Month Counts')
plt.xlabel('Last Review Month')
plt.ylabel('Count')
plt.show()

In [None]:
# last and first review
ab['last_review'].max(), ab['last_review'].min()

In [None]:
ab['last_review'].value_counts().head(10)

In [None]:
ab.select_dtypes(exclude=['number']).isna().sum()

Missing names cannot be filled meaningfully. The missing last_review values correspond to the missing reviews_per_month values, so we keep them as null: a null last_review means the listing has no reviews.

In [None]:
ab[ab['last_review'].isna()][['number_of_reviews', 'last_review', 'reviews_per_month']].sample(5)

In [None]:
ab[ab['last_review'].isna()][['number_of_reviews', 'last_review', 'reviews_per_month']].nunique()

It turns out that all missing values in the review-related columns occur because the listing has no reviews.
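
The same conclusion can be checked explicitly rather than by sampling rows. The sketch below uses a toy frame with the primary dataset's column names and invented values; it assumes reviews_per_month has already been filled with 0, as done earlier:

```python
import pandas as pd

# Toy frame with the same column names as the primary dataset (values are made up).
toy = pd.DataFrame({
    'number_of_reviews': [0, 3, 0, 12],
    'last_review': pd.to_datetime([None, '2019-05-01', None, '2018-11-20']),
    'reviews_per_month': [0.0, 0.4, 0.0, 1.2],
})

no_reviews = toy['number_of_reviews'] == 0
# last_review should be missing exactly on the rows that have no reviews ...
missing_matches = bool((toy['last_review'].isna() == no_reviews).all())
# ... and the filled reviews_per_month should be 0 on those same rows.
zero_rate = bool((toy.loc[no_reviews, 'reviews_per_month'] == 0).all())
print(missing_matches, zero_rate)
```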

2. Secondary dataset description

This dataset encompasses user feedback on Airbnb listings in New York City, offering
valuable insights into the firsthand experiences of staying in these accommodations. By
perusing these reviews, one can gain an understanding of the strengths, weaknesses, and
the ideal demographic for each listing. It contains the time that the reviews were posted and
the content of the reviews.

In [None]:
reviews=pd.read_csv('data/AB_reviews_NYC.csv')

In [None]:
reviews.head()

In [None]:
reviews.shape

In [None]:
reviews.isna().sum()

There are no missing values.

In [None]:
reviews['listing_id'].nunique()

In [None]:
reviews['url'].nunique()

The url and listing_id columns have the same number of unique values and correspond to each other, so url is not needed for the data analysis.
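
Equal unique counts are consistent with a one-to-one mapping but do not prove it. A stricter check is sketched below on a toy table; `toy_reviews` and its placeholder URLs are invented for illustration:

```python
import pandas as pd

# Toy stand-in for the reviews table; the URLs are invented placeholders.
toy_reviews = pd.DataFrame({
    'listing_id': [101, 101, 202, 303],
    'url': ['https://example.com/rooms/101', 'https://example.com/rooms/101',
            'https://example.com/rooms/202', 'https://example.com/rooms/303'],
})

# One-to-one means every listing has a single url and every url a single listing.
urls_per_listing = toy_reviews.groupby('listing_id')['url'].nunique()
listings_per_url = toy_reviews.groupby('url')['listing_id'].nunique()
one_to_one = bool((urls_per_listing == 1).all() and (listings_per_url == 1).all())
print(one_to_one)
```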

In [None]:
reviews.drop(columns=['url'], inplace=True)

In [None]:
# convert review_posted_date to datetime
reviews['review_posted_date']=pd.to_datetime(reviews['review_posted_date'])
# plot the number of reviews posted per year
reviews['review_posted_date'].dt.year.value_counts().sort_index().plot(kind='bar', edgecolor='black')
plt.title('Review Posted Year Counts')
plt.xlabel('Review Posted Year')
plt.ylabel('Count')
plt.show()

In [None]:
reviews['review_posted_date'].dt.month.value_counts().sort_index().plot(kind='bar', edgecolor='black')
plt.title('Review Posted Month Counts')
plt.xlabel('Review Posted Month')
plt.ylabel('Count')
plt.show()

In [None]:
# plot histogram of review length (in characters)
reviews['review'].str.len().plot(kind='hist', bins=50, edgecolor='black')

Most of the reviews are short.

### Visualization of features correlations

1. Heatmap of Correlations of the Primary dataset

In [None]:
sns.heatmap(ab.select_dtypes(exclude=['object']).drop(['id', 'host_id'], axis=1)
            .corr(), cmap='coolwarm', center=0, annot=True)
plt.title('Heatmap of Correlation of numerical variables')
plt.show()

The positive correlation among the review-related variables is expected. Longitude is negatively correlated with price and calculated_host_listings_count and positively correlated with reviews_per_month, which reflects a geographic influence. minimum_nights and reviews_per_month are negatively correlated, which is logical. availability_365 is positively correlated with number_of_reviews, while calculated_host_listings_count is negatively correlated with it. calculated_host_listings_count and availability_365 are also positively correlated.

### Heatmap of Geographic Distribution of Number of Rooms

In [None]:
m = folium.Map(location=[ab['latitude'].mean(), ab['longitude'].mean()], zoom_start=12)
HeatMap(data=ab[['latitude', 'longitude']], radius=15).add_to(m)
m

In the generated map, red and orange areas are those with more houses or apartments available for reservation.

### Joining Datasets


In [None]:
airbnb = pd.merge(ab, reviews, left_on='id', right_on='listing_id', how='inner')
airbnb.head()
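
Because a listing can have many reviews, the inner join returns one row per review and drops listings without reviews as well as reviews without a matching listing. The sketch below, on toy frames with invented values, uses the `validate` and `indicator` options of `pd.merge` to make both effects visible; `listings` and `toy_reviews` are hypothetical stand-ins for `ab` and `reviews`.

```python
import pandas as pd

# Toy stand-ins: `listings` plays the role of `ab`, `toy_reviews` the role of `reviews`.
listings = pd.DataFrame({'id': [1, 2, 3], 'price': [100, 150, 80]})
toy_reviews = pd.DataFrame({'listing_id': [1, 1, 2, 4],
                            'review': ['Great', 'Clean', 'Noisy', 'Fine']})

# validate='one_to_many' raises a MergeError if `id` is not unique on the left side.
joined = pd.merge(listings, toy_reviews, left_on='id', right_on='listing_id',
                  how='outer', validate='one_to_many', indicator=True)
print(joined['_merge'].value_counts())

# Rows marked 'both' are the ones an inner join would keep.
inner_rows = int((joined['_merge'] == 'both').sum())
```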

### Machine Learning (Predict Price)

In this part, two regression models are trained to predict the price of a room from the features of the primary dataset.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor

In [None]:
le = LabelEncoder()
airbnb['neighbourhood_group'] = le.fit_transform(airbnb['neighbourhood_group'])
airbnb['neighbourhood'] = le.fit_transform(airbnb['neighbourhood'])
airbnb['room_type'] = le.fit_transform(airbnb['room_type'])

X = airbnb[['neighbourhood_group', 'neighbourhood', 'latitude', 'longitude', 'room_type', 'minimum_nights',
            'number_of_reviews', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365']]
y = airbnb['price']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

#### 1. Linear Regression Model

In [None]:
LinearRegre = LinearRegression()

LinearRegre.fit(X_train, y_train)

y_pred = LinearRegre.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
print(f'Mean Absolute Error: {mae}')

mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')

r2 = r2_score(y_test, y_pred)
print(f'R-squared (R2): {r2}')

relative_error = mae/airbnb.price.mean()
print(f'relative error: {relative_error}')

#### 2. Random Forest Model

In [None]:
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

rf_model.fit(X_train, y_train)

y_pred_rf = rf_model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred_rf)
print(f'Mean Absolute Error: {mae}')

mse = mean_squared_error(y_test, y_pred_rf)
print(f'Mean Squared Error: {mse}')

r2 = r2_score(y_test, y_pred_rf)
print(f'R-squared (R2): {r2}')

relative_error = mae/airbnb.price.mean()
print(f'relative error: {relative_error}')

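Judging by the R-squared value, the random forest regressor performs better than the linear model: its R-squared is nearly 1. This score should be read with some caution, because the joined table repeats each listing's features once per review, so the same listing can land in both the training and test sets.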
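
One way to test how much of a score depends on rows from the same listing appearing on both sides of the split is to keep each listing's rows together. A minimal sketch using scikit-learn's `GroupShuffleSplit`; the listing ids and the `X_toy` array are made up, and on the real data the groups would presumably be `airbnb['id']`:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical listing ids: each listing appears once per review, as in a joined table.
listing_ids = np.repeat(np.arange(50), 4)        # 50 listings, 4 reviews each
X_toy = np.arange(len(listing_ids)).reshape(-1, 1)

# Split by listing so that no listing contributes rows to both sides.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X_toy, groups=listing_ids))

shared = set(listing_ids[train_idx]) & set(listing_ids[test_idx])
print(f'listings in both train and test: {len(shared)}')
```

If the R-squared drops noticeably under such a split, the earlier score was partly measuring how well the model recognizes listings it has already seen.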

### Natural Language Processing

Evaluate the reviewers' satisfaction levels based on the sentiment of their reviews.

In [None]:
!pip install textblob

In [None]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

In [None]:
from textblob import TextBlob
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [None]:
reviews = airbnb['review']

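# build the stop-word set and lemmatizer once instead of on every call
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # keep purely alphabetic tokens that are not stop words, lowercased and lemmatized
    tokens = [lemmatizer.lemmatize(word.lower()) for word in text.split()
              if word.isalpha() and word.lower() not in stop_words]
    return ' '.join(tokens)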

preprocessed_reviews = reviews.apply(preprocess_text)

sentiments = preprocessed_reviews.apply(lambda x: TextBlob(x).sentiment.polarity)

airbnb['sentiment_score'] = sentiments

print(airbnb[['review', 'sentiment_score']])

Having obtained the sentiment scores, we want to know whether price and sentiment are related.

In [None]:
# linear regression of price on sentiment score
# (separate variable names so the multi-feature models and predictions above are not overwritten)
X_sent = airbnb[['sentiment_score']]
y_sent = airbnb['price']

X_sent_train, X_sent_test, y_sent_train, y_sent_test = train_test_split(
    X_sent, y_sent, test_size=0.2, random_state=42)

lr_sent = LinearRegression()

lr_sent.fit(X_sent_train, y_sent_train)

y_pred_sent = lr_sent.predict(X_sent_test)

mae = mean_absolute_error(y_sent_test, y_pred_sent)
print(f'Mean Absolute Error: {mae}')

mse = mean_squared_error(y_sent_test, y_pred_sent)
print(f'Mean Squared Error: {mse}')

r2 = r2_score(y_sent_test, y_pred_sent)
print(f'R-squared (R2): {r2}')

relative_error = mae/airbnb.price.mean()
print(f'relative error: {relative_error}')

In [None]:
# regression plot of price against sentiment score
plt.figure(figsize=(8, 6))
sns.regplot(x='sentiment_score', y='price', data=airbnb)
plt.xlabel('Sentiment Score')
plt.ylabel('Price')
plt.title('Price by Sentiment Score')
plt.show()

From the regression plot, there seems to be a relation between price and sentiment score, but it is not pronounced.

We can segment the sentiment scores into 4 groups and check whether the groups have different price distributions.
* -1.00 to -0.50: very unsatisfied
* -0.50 to 0.00: slightly unsatisfied
* 0.00 to 0.50: slightly satisfied
* 0.50 to 1.00: very satisfied

In [None]:
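# segment sentiment score into 4 groups using the fixed cut points listed above
# (bins=4 would instead split the observed score range into equal-width intervals)
airbnb['sentiment_score_group'] = \
    pd.cut(airbnb['sentiment_score'], bins=[-1, -0.5, 0, 0.5, 1],
           include_lowest=True,
           labels=['very unsatisfied', 'slightly unsatisfied',
                   'slightly satisfied', 'very satisfied'])
airbnb['sentiment_score_group'].value_counts()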

In [None]:
# box plot of price by sentiment score group
plt.figure(figsize=(8, 6))
sns.boxplot(x='sentiment_score_group', y='price', data=airbnb[airbnb.price < 2000])
plt.xlabel('Sentiment Score Group')
plt.ylabel('Price')
plt.title('Price by Sentiment Score Group')
plt.show()

In [None]:
# anova test
import statsmodels
import statsmodels.api as sm
from statsmodels.formula.api import ols

formula = 'price ~ C(sentiment_score_group)'
lm = ols(formula, airbnb).fit()
table = sm.stats.anova_lm(lm, typ=2)
print(table)

The ANOVA result indicates that mean prices differ across the sentiment score groups, so price and sentiment score are related. Together with the box plot above, this suggests that more expensive rooms tend to receive more positive reviews, while cheaper ones receive more negative reviews.

#### Sentiment Trend over Time

In [None]:
# sentiment score over time
year_sentiment = airbnb.groupby(airbnb['review_posted_date'].dt.year)['sentiment_score'].mean()
year_sentiment

In [None]:
# plot sentiment score over time
plt.figure(figsize=(8, 4))
year_sentiment.plot(kind='bar', edgecolor='black')
plt.xlabel('Year')
plt.ylabel('Sentiment Score')
plt.title('Sentiment Score by Year')
plt.show()

We also want to look at the sentiment score over time. To simplify the analysis, we use the mean sentiment score for each year as a measure of the satisfaction level. From the plot above, we can see that the satisfaction level first decreased from 2011 to 2015 and then generally increased, with a slight decrease in 2019.

### Evaluation
Evaluate the relationship between the relative price and satisfaction
(Relative price = Predicted Price (Expected Price) - Actual Price).

In [None]:
price_differences_lr = y_pred - y_test
# summary statistics are more informative than value_counts for continuous differences
price_differences_lr.describe()

In [None]:
price_differences_rf = y_pred_rf - y_test
price_differences_rf.describe()

In [None]:
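test_indices = X_test.index
# copy so that adding columns does not trigger a SettingWithCopyWarning
test_set = airbnb.loc[test_indices].copy()
test_set['price_differences_rf'] = price_differences_rf
test_set['price_differences_lr'] = price_differences_lr
# keep rows where the random forest prediction is within 50 of the actual price
test_set = test_set[abs(test_set.price_differences_rf) < 50]
test_set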

In [None]:
plt.scatter(test_set['sentiment_score'], test_set['price_differences_lr'])
plt.title('Sentiment Score vs. Linear Regression Price Difference (Test Set)')
plt.xlabel('Sentiment Score')
plt.ylabel('Price Difference (Predicted - Actual)')
plt.show()

In [None]:
plt.scatter(test_set['sentiment_score'], test_set['price_differences_rf'])
plt.title('Sentiment Score vs. Random Forest Price Difference (Test Set)')
plt.xlabel('Sentiment Score')
plt.ylabel('Price Difference (Predicted - Actual)')
plt.show()

In [None]:
correlation = test_set['price_differences_lr'].corr(test_set['sentiment_score'])
correlation 

We find that the price differences are not strongly correlated with the reviewers' satisfaction levels.

Thus, whether the price is lower or higher than expected is not a major factor in reviewers' satisfaction.

This raises the question of which factors do strongly affect reviewers' satisfaction levels.

### Machine Learning (Predict Users' Satisfaction Levels)

In [None]:
le = LabelEncoder()
airbnb['neighbourhood_group'] = le.fit_transform(airbnb['neighbourhood_group'])
airbnb['neighbourhood'] = le.fit_transform(airbnb['neighbourhood'])
airbnb['room_type'] = le.fit_transform(airbnb['room_type'])

X = airbnb[['neighbourhood_group', 'neighbourhood', 'latitude', 'longitude', 'room_type', 'minimum_nights',
            'number_of_reviews', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365', 'price']]
y = airbnb['sentiment_score']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

In [None]:
LinearRegre = LinearRegression()

LinearRegre.fit(X_train, y_train)

y_pred = LinearRegre.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
print(f'Mean Absolute Error: {mae}')

mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')

r2 = r2_score(y_test, y_pred)
print(f'R-squared (R2): {r2}')

relative_error = mae/(airbnb.sentiment_score.max() - airbnb.sentiment_score.min())
print(f'relative error: {relative_error}')

In [None]:
coefficients_lr = LinearRegre.coef_
feature_names = ['neighbourhood_group', 'neighbourhood', 'latitude', 'longitude', 'room_type', 'minimum_nights',
            'number_of_reviews', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365', 'price']
feature_correlation_lr = pd.DataFrame({'Feature_names': feature_names, 'Coefficients': coefficients_lr})
feature_correlation_lr

In [None]:
feature_correlation_lr_sorted = \
    feature_correlation_lr \
        .sort_values(by='Coefficients', ascending=False)

#plot the coefficients
plt.figure(figsize=(8, 4))
plt.bar(feature_correlation_lr_sorted['Feature_names'], 
        feature_correlation_lr_sorted['Coefficients'])
plt.xticks(rotation=90)
plt.xlabel('Features')
plt.ylabel('Coefficients')
plt.title('Coefficients of Features')
plt.show()

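From the results, we can see that neighbourhood, minimum_nights, number_of_reviews, availability_365, and price have coefficients close to zero and so appear to have little effect on the satisfaction level.

However, the location of the room and reviews_per_month have the largest coefficients and appear to strongly affect the reviewers' satisfaction levels. Note that the features are on different scales, so coefficient magnitudes are only a rough guide to influence.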
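
Raw regression coefficients depend on each feature's units, so a feature with a narrow range such as latitude can receive a large coefficient without being more influential. The sketch below compares raw coefficients with coefficients fitted on standardized features. It runs on synthetic data: the feature names and values are illustrative only, and both features are constructed to matter equally.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative, not the real dataset).
rng = np.random.default_rng(0)
latitude = rng.normal(40.7, 0.05, 500)     # narrow range, like a coordinate
price = rng.normal(150, 100, 500)          # wide range, like a nightly price
# Both features contribute equally once measured in standard deviations.
target = (2.0 * (latitude - 40.7) / 0.05
          + 2.0 * (price - 150) / 100
          + rng.normal(0, 0.1, 500))

X_raw = np.column_stack([latitude, price])
raw_coef = LinearRegression().fit(X_raw, target).coef_
std_coef = LinearRegression().fit(StandardScaler().fit_transform(X_raw), target).coef_

print('raw coefficients:         ', raw_coef)   # dominated by the narrow-range feature
print('standardized coefficients:', std_coef)   # comparable across features
```

Applying the same idea to the notebook would mean scaling `X_train` before fitting and comparing the resulting coefficients.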

In [None]:
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

rf_model.fit(X_train, y_train)

y_pred_rf = rf_model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred_rf)
print(f'Mean Absolute Error: {mae}')

mse = mean_squared_error(y_test, y_pred_rf)
print(f'Mean Squared Error: {mse}')

r2 = r2_score(y_test, y_pred_rf)
print(f'R-squared (R2): {r2}')

relative_error = mae/(airbnb.sentiment_score.max() - airbnb.sentiment_score.min())
print(f'relative error: {relative_error}')

From the scores, we can see that R-squared is not very high and the relative error is not low enough.

Thus, the satisfaction level is not easy to predict, and many factors beyond the recorded listing data may affect it.

In [None]:
feature_importances = rf_model.feature_importances_
feature_importances 

In [None]:
feature_correlation_rf = pd.DataFrame({'Feature_names': feature_names, 'Coefficients': feature_importances})
feature_correlation_rf

In [None]:
feature_correlation_rf_sorted = feature_correlation_rf.sort_values(by='Coefficients', ascending=False)

#plot the coefficients
plt.figure(figsize=(8, 4))
plt.bar(feature_correlation_rf_sorted['Feature_names'], 
        feature_correlation_rf_sorted['Coefficients'])
plt.xticks(rotation=90)
plt.xlabel('Features')
plt.ylabel('Feature Importance')
plt.title('Random Forest Feature Importances')
plt.show()

From the results, we can see that neighbourhood_group, room_type, and calculated_host_listings_count do not affect the satisfaction level much.

However, the location of the room and reviews_per_month still strongly affect the reviewers' satisfaction levels.
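
The `feature_importances_` attribute of a random forest is impurity-based and computed from the training data, so it can favor continuous features with many distinct values. Permutation importance on held-out data is a common cross-check. A hedged sketch on synthetic data (the arrays below are invented, with only the first feature driving the target):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only the first feature drives the target (purely illustrative).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(400, 3))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(0, 0.1, 400)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)
model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Importance = drop in held-out R^2 when one feature's values are shuffled.
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=42)
print(result.importances_mean)
```

In the notebook, the equivalent call would take `rf_model`, `X_test`, and `y_test`; agreement between the two rankings would strengthen the conclusion about location and reviews_per_month.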