# SI618 Project
### Analyzing the Impact of Various Factors on B&B Visitors' Reviews
#### — A study based on Airbnb datasets

Team members: Qian Dong (dqq) section 001; Yujun Zhang (yukiz) section 001; Yinuo Wei (seesaway) section 001


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from folium.plugins import HeatMap

### Cleaning and manipulation
1. Primary dataset description

In [None]:
ab=pd.read_csv('data/AB_NYC_2019.csv')

In [None]:
ab.head()

In [None]:
ab.shape

In [None]:
ab.columns

In [None]:
ab.describe()

In [None]:
ab.select_dtypes(exclude=['object'])\
    .plot(kind='box', subplots=True, layout=(3,4), figsize=(14,14), fontsize=14)

In [None]:
ab.select_dtypes(exclude=['object']).isna().sum()

Only reviews_per_month has missing values.

In [None]:
ab['reviews_per_month'].min()

We can infer that the missing values should be 0.

In [None]:
# Replace missing values with 0 (not the mode); assign back to avoid chained in-place modification
ab['reviews_per_month'] = ab['reviews_per_month'].fillna(0)

Updated box plot:

In [None]:
ab['reviews_per_month'].plot(kind='box')

In [None]:
ab.select_dtypes(include=['object']).head()

In [None]:
ab['neighbourhood_group'].value_counts().plot(kind='bar')

In [None]:
ab['neighbourhood'].nunique()

In [None]:
ab['neighbourhood'].value_counts().head(10)

In [None]:
ab['room_type'].value_counts().plot(kind='bar')

In [None]:
ab['last_review']=pd.to_datetime(ab['last_review'])
ab['last_review'].dt.year.value_counts().sort_index().plot(kind='bar')

In [None]:
ab['last_review'].dt.month.value_counts().sort_index().plot(kind='bar')

In [None]:
# last and first review
ab['last_review'].max(), ab['last_review'].min()

In [None]:
ab['last_review'].value_counts().head(10)

In [None]:
ab.select_dtypes(exclude=['number']).isna().sum()

Missing names cannot be meaningfully filled. Missing last_review values could be filled, but they correspond to the missing values of reviews_per_month, so we keep them as null: a null last_review means the listing has no reviews.

In [None]:
ab[ab['last_review'].isna()][['number_of_reviews', 'last_review', 'reviews_per_month']].sample(5)

In [None]:
ab[ab['last_review'].isna()][['number_of_reviews', 'last_review', 'reviews_per_month']].nunique()

It turns out that all missing values in the review-related columns occur because the listing has no reviews.
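The check above can also be written as an explicit test. The sketch below uses a small made-up table (the values are illustrative, not taken from the dataset) and confirms that both review columns are missing exactly where number_of_reviews is 0.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the listings table (values are made up for illustration).
toy = pd.DataFrame({
    'number_of_reviews': [0, 3, 0, 12],
    'last_review': pd.to_datetime([None, '2019-05-01', None, '2019-06-30']),
    'reviews_per_month': [np.nan, 0.4, np.nan, 1.2],
})

no_reviews = toy['number_of_reviews'].eq(0)

# Both review columns should be missing exactly where there are no reviews.
missing_matches_zero = bool(
    (toy['last_review'].isna() == no_reviews).all()
    and (toy['reviews_per_month'].isna() == no_reviews).all()
)
print(missing_matches_zero)
```

On the real data, the same comparison could be run on `ab` in place of `toy`.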

2. Secondary dataset description

In [None]:
reviews=pd.read_csv('data/AB_reviews_NYC.csv')

In [None]:
reviews.head()

In [None]:
reviews.shape

In [None]:
reviews.isna().sum()

There are no missing values.

In [None]:
reviews['listing_id'].nunique()

In [None]:
reviews['url'].nunique()

The number of unique url values matches the number of unique listing_id values, so url is redundant and is not needed for the analysis.
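Equal unique counts suggest a one-to-one mapping but do not prove it. A stricter check, sketched here on a toy frame with placeholder URLs, counts how many distinct values of one column appear for each value of the other.

```python
import pandas as pd

# Toy reviews table; the URLs are placeholders, not real listing pages.
toy_reviews = pd.DataFrame({
    'listing_id': [1, 1, 2, 3],
    'url': ['https://example.com/rooms/1', 'https://example.com/rooms/1',
            'https://example.com/rooms/2', 'https://example.com/rooms/3'],
})

# Each listing_id should map to exactly one url, and each url to one listing_id.
urls_per_listing = toy_reviews.groupby('listing_id')['url'].nunique()
listings_per_url = toy_reviews.groupby('url')['listing_id'].nunique()

is_one_to_one = bool(urls_per_listing.eq(1).all() and listings_per_url.eq(1).all())
print(is_one_to_one)
```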

In [None]:
reviews.drop(columns=['url'], inplace=True)

In [None]:
# Convert review_posted_date to datetime
reviews['review_posted_date']=pd.to_datetime(reviews['review_posted_date'])
# Plot the number of reviews posted per year
reviews['review_posted_date'].dt.year.value_counts().sort_index().plot(kind='bar')

In [None]:
reviews['review_posted_date'].dt.month.value_counts().sort_index().plot(kind='bar')

In [None]:
# Plot a histogram of review length (in characters)
reviews['review'].str.len().plot(kind='hist', bins=50)

### Visualization

1. Heatmap of Correlations of the Primary dataset

In [None]:
sns.heatmap(ab.select_dtypes(exclude=['object']).drop(['id', 'host_id'], axis=1)
            .corr(), cmap='coolwarm', center=0)

The review-related variables (number_of_reviews and reviews_per_month) are clearly positively correlated. Longitude is negatively correlated with price and calculated_host_listings_count and positively correlated with reviews_per_month, which reflects a geographic effect. minimum_nights and reviews_per_month are negatively correlated, which is logical. availability_365 is positively correlated with number_of_reviews, while calculated_host_listings_count is negatively correlated with it. calculated_host_listings_count and availability_365 are positively correlated with each other.
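Reading pairs off a heatmap by eye is error-prone, so it can help to list the pairs sorted by absolute correlation. This is a minimal sketch on synthetic columns (`a`, `b`, `c` are made-up names); the same steps could be applied to the correlation matrix of `ab`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
toy = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.1, size=n),  # strongly tied to a
    'c': rng.normal(size=n),                 # unrelated noise
})

corr = toy.corr()

# Keep each pair once by masking the diagonal and the lower triangle.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# One row per pair, strongest absolute correlation first.
pairs = upper.stack().dropna().sort_values(key=lambda s: s.abs(), ascending=False)
print(pairs)
```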

In [None]:
m = folium.Map(location=[ab['latitude'].mean(), ab['longitude'].mean()], zoom_start=12)
HeatMap(data=ab[['latitude', 'longitude']], radius=15).add_to(m)
m

### Joining Datasets


In [None]:
airbnb = pd.merge(ab, reviews, left_on='id', right_on='listing_id', how='inner')
airbnb
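An inner join silently drops listings without reviews and reviews without a matching listing. The sketch below, on toy tables with made-up values, uses the `validate` and `indicator` options of `pd.merge` to confirm that listing ids are unique and to count how many rows an inner join would keep or drop.

```python
import pandas as pd

# Toy tables (made-up values): listing 3 has no review, review 4 has no listing.
toy_listings = pd.DataFrame({'id': [1, 2, 3], 'price': [100, 80, 150]})
toy_reviews = pd.DataFrame({'listing_id': [1, 1, 2, 4],
                            'review': ['Great', 'Clean', 'Noisy', 'Orphan']})

merged = pd.merge(
    toy_listings, toy_reviews,
    left_on='id', right_on='listing_id',
    how='outer',             # keep unmatched rows so they can be counted
    validate='one_to_many',  # raises MergeError if listing ids are duplicated
    indicator=True,          # adds a _merge column: both / left_only / right_only
)

print(merged['_merge'].value_counts())

# Rows marked 'both' are exactly the rows an inner join would return.
inner_rows = int((merged['_merge'] == 'both').sum())
```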

### Machine Learning (Predict Price)

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor

In [None]:
le = LabelEncoder()
airbnb['neighbourhood_group'] = le.fit_transform(airbnb['neighbourhood_group'])
airbnb['neighbourhood'] = le.fit_transform(airbnb['neighbourhood'])
airbnb['room_type'] = le.fit_transform(airbnb['room_type'])

X = airbnb[['neighbourhood_group', 'neighbourhood', 'latitude', 'longitude', 'room_type', 'minimum_nights',
            'number_of_reviews', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365']]
y = airbnb['price']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
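LabelEncoder maps each category to an integer, which a linear model then treats as an ordered numeric quantity. For nominal columns such as room_type, one-hot encoding is a common alternative. The sketch below shows it with `pd.get_dummies` on a toy frame (made-up rows); it is not the encoding used in this notebook.

```python
import pandas as pd

# Toy frame with one nominal column and one numeric column (made-up rows).
toy = pd.DataFrame({
    'room_type': ['Private room', 'Entire home/apt', 'Shared room', 'Private room'],
    'minimum_nights': [1, 3, 2, 1],
})

# One indicator column per category; drop_first avoids a redundant column.
encoded = pd.get_dummies(toy, columns=['room_type'], drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

Tree-based models such as the random forest are less sensitive to this choice, since they split on thresholds and do not assume a linear effect.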

#### 1. Linear Regression Model

In [None]:
LinearRegre = LinearRegression()

LinearRegre.fit(X_train, y_train)

y_pred = LinearRegre.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
print(f'Mean Absolute Error: {mae}')

mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')

r2 = r2_score(y_test, y_pred)
print(f'R-squared (R2): {r2}')

relative_error = mae / airbnb.price.mean()
print(f'Relative error: {relative_error}')

#### 2. Random Forest Model

In [None]:
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

rf_model.fit(X_train, y_train)

y_pred_rf = rf_model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred_rf)
print(f'Mean Absolute Error: {mae}')

mse = mean_squared_error(y_test, y_pred_rf)
print(f'Mean Squared Error: {mse}')

r2 = r2_score(y_test, y_pred_rf)
print(f'R-squared (R2): {r2}')

relative_error = mae / airbnb.price.mean()
print(f'Relative error: {relative_error}')
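Both model cells compute the same four metrics, so the calculation could be wrapped in a small helper. `regression_report` is a hypothetical name. As an assumption of this sketch, it divides the MAE by the mean of the true values passed in, whereas the cells above divide by the mean price of the full table.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred):
    """Return MAE, MSE, R-squared and MAE relative to the mean of y_true."""
    mae = mean_absolute_error(y_true, y_pred)
    return {
        'mae': mae,
        'mse': mean_squared_error(y_true, y_pred),
        'r2': r2_score(y_true, y_pred),
        'relative_error': mae / np.mean(y_true),
    }

# Toy example with made-up prices.
report = regression_report([100, 200, 300], [110, 190, 300])
for name, value in report.items():
    print(f'{name}: {value:.4f}')
```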

### Natural Language Processing

Evaluate the reviewers' satisfaction levels based on the sentiment of their reviews.

In [None]:
from textblob import TextBlob
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [None]:
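reviews = airbnb['review']

# Requires the NLTK 'stopwords' and 'wordnet' corpora (via nltk.download).
# Build the stop-word set and lemmatizer once rather than on every call.
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Keep purely alphabetic, non-stop-word tokens, lowercased and lemmatized
    tokens = [lemmatizer.lemmatize(word.lower()) for word in text.split()
              if word.isalpha() and word.lower() not in stop_words]
    return ' '.join(tokens)

preprocessed_reviews = reviews.apply(preprocess_text)

# TextBlob polarity ranges from -1 (negative) to 1 (positive)
sentiments = preprocessed_reviews.apply(lambda x: TextBlob(x).sentiment.polarity)

airbnb['sentiment_score'] = sentiments

print(airbnb[['review', 'sentiment_score']])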
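One side effect of splitting on whitespace and then filtering with `word.isalpha()` is that any word with attached punctuation (for example `great,`) is dropped. The sketch below compares that filter with a regex-based tokenizer. It uses only the standard library, a tiny hypothetical stop-word list in place of NLTK's, and omits lemmatization.

```python
import re

# Tiny hypothetical stop-word list, standing in for NLTK's English list.
STOP_WORDS = {'the', 'was', 'and', 'a'}

def tokens_split(text):
    # Whitespace split, then keep purely alphabetic tokens.
    return [w.lower() for w in text.split()
            if w.isalpha() and w.lower() not in STOP_WORDS]

def tokens_regex(text):
    # Alternative: extract alphabetic runs so punctuation does not hide words.
    return [w for w in re.findall(r'[a-z]+', text.lower())
            if w not in STOP_WORDS]

sample = 'The room was great, and the host was friendly!'
print(tokens_split(sample))  # ['room', 'host']
print(tokens_regex(sample))  # ['room', 'great', 'host', 'friendly']
```

The regex version has its own trade-off: it splits contractions such as "don't" into two fragments.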

### Evaluation
Evaluate the relationship between relative price and satisfaction
(Relative price = Predicted Price (Expected Price) - Actual Price).

In [None]:
price_differences_lr = y_pred - y_test
price_differences_lr.value_counts()

In [None]:
price_differences_rf = y_pred_rf - y_test
price_differences_rf.value_counts()

In [None]:
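test_indices = X_test.index
# Copy so the new columns are added to an independent DataFrame
test_set = airbnb.loc[test_indices].copy()
test_set['price_differences_rf'] = price_differences_rf
test_set['price_differences_lr'] = price_differences_lr
# Keep rows where the random forest prediction is within 50 of the actual price
test_set = test_set[test_set['price_differences_rf'].abs() < 50]
test_set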

In [None]:
plt.scatter(test_set['sentiment_score'], test_set['price_differences_lr'])
plt.title('Sentiment Score vs. Price Difference, Linear Regression (Test Set)')
plt.xlabel('Sentiment Score')
plt.ylabel('Price Difference (Predicted - Actual)')
plt.show()

In [None]:
plt.scatter(test_set['sentiment_score'], test_set['price_differences_rf'])
plt.title('Sentiment Score vs. Price Difference, Random Forest (Test Set)')
plt.xlabel('Sentiment Score')
plt.ylabel('Price Difference (Predicted - Actual)')
plt.show()

In [None]:
correlation = test_set['price_differences_lr'].corr(test_set['sentiment_score'])
correlation 
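`Series.corr` computes the Pearson coefficient by default, which only measures linear association. A rank-based (Spearman) coefficient is a common complement. The sketch below computes both on synthetic, independent columns (made-up data, not the notebook's results), obtaining Spearman as the Pearson correlation of the ranks.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic, independent columns standing in for the two quantities.
toy = pd.DataFrame({
    'sentiment_score': rng.uniform(-1, 1, size=300),
    'price_difference': rng.normal(0, 20, size=300),
})

# Pearson: linear association.
pearson = toy['sentiment_score'].corr(toy['price_difference'])

# Spearman: Pearson correlation of the ranks, so it captures monotonic trends.
spearman = toy['sentiment_score'].rank().corr(toy['price_difference'].rank())

print(f'Pearson: {pearson:.3f}, Spearman: {spearman:.3f}')
```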

The price differences are not strongly correlated with reviewers' satisfaction levels.

Thus, whether the price is lower or higher than expected is not a major factor in reviewers' satisfaction.

This raises the question of which factors do strongly affect reviewers' satisfaction levels.

### Machine Learning (Predict Users' Satisfaction Levels)

In [None]:
le = LabelEncoder()
airbnb['neighbourhood_group'] = le.fit_transform(airbnb['neighbourhood_group'])
airbnb['neighbourhood'] = le.fit_transform(airbnb['neighbourhood'])
airbnb['room_type'] = le.fit_transform(airbnb['room_type'])

X = airbnb[['neighbourhood_group', 'neighbourhood', 'latitude', 'longitude', 'room_type', 'minimum_nights',
            'number_of_reviews', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365', 'price']]
y = airbnb['sentiment_score']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

In [None]:
LinearRegre = LinearRegression()

LinearRegre.fit(X_train, y_train)

y_pred = LinearRegre.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
print(f'Mean Absolute Error: {mae}')

mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')

r2 = r2_score(y_test, y_pred)
print(f'R-squared (R2): {r2}')

relative_error = mae / (airbnb.sentiment_score.max() - airbnb.sentiment_score.min())
print(f'Relative error: {relative_error}')

In [None]:
coefficients = LinearRegre.coef_
feature_names = ['neighbourhood_group', 'neighbourhood', 'latitude', 'longitude', 'room_type', 'minimum_nights',
            'number_of_reviews', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365', 'price']
feature_correlation = pd.DataFrame({'Feature_names': feature_names, 'Coefficients': coefficients})
feature_correlation

From the results, we can see that neighbourhood, minimum_nights, number_of_reviews, availability_365 and price do not affect the satisfaction level much.

However, the location of the room and reviews_per_month strongly affect reviewers' satisfaction levels.
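Linear-regression coefficients are expressed per unit of each feature, so a feature with a narrow range (such as a latitude) can receive a large coefficient without having a large effect. The sketch below uses synthetic data with made-up feature names. The two features contribute equally to the target, yet their raw coefficients differ by orders of magnitude; standardizing the features first makes the coefficients comparable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000
narrow = rng.normal(40.7, 0.05, size=n)  # narrow range, like a latitude
wide = rng.normal(150, 100, size=n)      # wide range, like a price
X = np.column_stack([narrow, wide])

# Each feature moves y by about 0.5 per standard deviation.
y = 10 * narrow + 0.005 * wide + rng.normal(0, 0.1, size=n)

# Raw coefficients: per original unit, so they depend on each feature's scale.
raw_coefs = LinearRegression().fit(X, y).coef_

# Standardized coefficients: per standard deviation, so they are comparable.
X_std = StandardScaler().fit_transform(X)
std_coefs = LinearRegression().fit(X_std, y).coef_

print('raw:', raw_coefs)
print('standardized:', std_coefs)
```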

In [None]:
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

rf_model.fit(X_train, y_train)

y_pred_rf = rf_model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred_rf)
print(f'Mean Absolute Error: {mae}')

mse = mean_squared_error(y_test, y_pred_rf)
print(f'Mean Squared Error: {mse}')

r2 = r2_score(y_test, y_pred_rf)
print(f'R-squared (R2): {r2}')

relative_error = mae / (airbnb.sentiment_score.max() - airbnb.sentiment_score.min())
print(f'Relative error: {relative_error}')

From these scores, we can see that R-squared is not very high and the relative error is not low enough.

Thus, the satisfaction level is not easy to predict, and many factors other than the listing attributes used here may affect it.
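One way to put such scores in context is to compare them with a trivial baseline that always predicts the training mean. This is a sketch on synthetic data in which the target is unrelated to the features, using scikit-learn's `DummyRegressor`. A model is only informative to the extent that it beats this baseline.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = rng.normal(0.3, 0.2, size=500)  # synthetic target, unrelated to X

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Baseline that ignores the features and predicts the training mean.
baseline = DummyRegressor(strategy='mean').fit(X_train, y_train)
baseline_pred = baseline.predict(X_test)

baseline_mae = mean_absolute_error(y_test, baseline_pred)
baseline_r2 = r2_score(y_test, baseline_pred)  # at or slightly below 0
print(f'Baseline MAE: {baseline_mae:.4f}, baseline R2: {baseline_r2:.4f}')
```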

In [None]:
feature_importances = rf_model.feature_importances_
feature_importances 

In [None]:
feature_correlation = pd.DataFrame({'Feature_names': feature_names, 'Importance': feature_importances})
feature_correlation

From the results, we can see that neighbourhood_group, room_type and calculated_host_listings_count do not affect the satisfaction level much.

However, the location of the room and reviews_per_month still strongly affect reviewers' satisfaction levels.
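The `feature_importances_` attribute is computed from the training data. As a cross-check, permutation importance measures how much the test score drops when one feature is shuffled. The sketch below applies scikit-learn's `permutation_importance` to synthetic data in which only the first feature matters; it is an illustration, not a result from this dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = 3 * X[:, 0] + rng.normal(0, 0.1, size=400)  # only feature 0 matters

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# Shuffle each feature in turn on held-out data and record the score drop.
result = permutation_importance(model, X_test, y_test,
                                n_repeats=5, random_state=42)

# Feature indices ordered from most to least important.
ranking = np.argsort(result.importances_mean)[::-1]
print(ranking, result.importances_mean)
```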