# <center><span style="color:red;"> <b> Booking reviews </b></span></center>

<p align="center">
    <a href="https://www.kaggle.com/competitions/sf-booking">Kaggle - Hotel Booking Link</a>
</p>

## <center><span style="text-decoration: underline;">Feature Engineering + Prediction of the Reviewer's Score </span></center>

<h4 style="margin-bottom: 0;">📝 Dataset Description: Hotel Reviews Dataset </h4>

- **`hotel_address`**: Address of the hotel.
- **`review_date`**: Date when the reviewer posted the corresponding review.
- **`average_score`**: Average score of the hotel, calculated based on the most recent comments over the past year.
- **`hotel_name`**: Name of the hotel.
- **`reviewer_nationality`**: Country of the reviewer.
- **`negative_review`**: Negative review provided by the reviewer.
- **`review_total_negative_word_counts`**: Total number of words in the negative review.
- **`positive_review`**: Positive review provided by the reviewer.
- **`review_total_positive_word_counts`**: Total number of words in the positive review.
- **`reviewer_score`**: Score assigned by the reviewer based on their experience.
- **`total_number_of_reviews_reviewer_has_given`**: Total number of reviews the reviewer has given in the past.
- **`total_number_of_reviews`**: Total number of valid reviews for the hotel.
- **`tags`**: Tags assigned by the reviewer to the hotel.
- **`days_since_review`**: Number of days between the review date and the dataset's last update.
- **`additional_number_of_scoring`**: Some guests provide a rating without leaving a review. This number indicates the count of valid ratings without associated reviews.
- **`lat`**: Latitude of the hotel.
- **`lng`**: Longitude of the hotel.

Source: ["Booking reviews" (kaggle.com)](https://www.kaggle.com/competitions/sf-booking) 

## <div style="text-align:center; border-radius:30px 30px; padding:7px; color:white; margin:0; font-size:110%; font-family:Pacifico; background-color:#3170af; overflow:hidden"><b> Table of Contents </b></div>

1. [Import Libraries](#chapter1)
2. [Descriptive Statistics](#chapter3)
3. [Feature Engineering](#chapter4)
4. [Modeling](#chapter6)

<a id="chapter1"></a>
## <div style="text-align:center; border-radius:30px 30px; padding:7px; color:white; margin:0; font-size:110%; font-family:Pacifico; background-color:#3170af; overflow:hidden"><b> ⚙️ Import Libraries </b></div>

In [None]:
import pandas as pd 
import numpy as np


# Modeling
from sklearn.feature_selection import chi2
from sklearn import linear_model
from sklearn import metrics
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

<a id="chapter3"></a>
## <div style="text-align:center; border-radius:30px 30px; padding:7px; color:white; margin:0; font-size:110%; font-family:Pacifico; background-color:#3170af; overflow:hidden"><b> ✨ Some preliminary descriptive analysis  </b></div>

In [44]:
# Load the dataset
df = pd.read_csv('data/hotels.csv')
data_temp = df.copy()
display('Beginning of the dataset:')
display(data_temp.head(2))
display('End of the dataset:')
display(data_temp.tail(2))  # tail(), not head(), returns the last rows
display("Number of variables:", data_temp.shape[1])
display('Number of observations:', data_temp.shape[0])

'Beginning of the dataset:'

Unnamed: 0,hotel_address,additional_number_of_scoring,review_date,average_score,hotel_name,reviewer_nationality,negative_review,review_total_negative_word_counts,total_number_of_reviews,positive_review,review_total_positive_word_counts,total_number_of_reviews_reviewer_has_given,reviewer_score,tags,days_since_review,lat,lng
0,Stratton Street Mayfair Westminster Borough Lo...,581,2/19/2016,8.4,The May Fair Hotel,United Kingdom,Leaving,3,1994,Staff were amazing,4,7,10.0,"[' Leisure trip ', ' Couple ', ' Studio Suite ...",531 day,51.507894,-0.143671
1,130 134 Southampton Row Camden London WC1B 5AF...,299,1/12/2017,8.3,Mercure London Bloomsbury Hotel,United Kingdom,poor breakfast,3,1361,location,2,14,6.3,"[' Business trip ', ' Couple ', ' Standard Dou...",203 day,51.521009,-0.123097


'Number of variables:'

17

'Number of observations:'

386803

In [3]:
data_temp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 386803 entries, 0 to 386802
Data columns (total 17 columns):
 #   Column                                      Non-Null Count   Dtype  
---  ------                                      --------------   -----  
 0   hotel_address                               386803 non-null  object 
 1   additional_number_of_scoring                386803 non-null  int64  
 2   review_date                                 386803 non-null  object 
 3   average_score                               386803 non-null  float64
 4   hotel_name                                  386803 non-null  object 
 5   reviewer_nationality                        386803 non-null  object 
 6   negative_review                             386803 non-null  object 
 7   review_total_negative_word_counts           386803 non-null  int64  
 8   total_number_of_reviews                     386803 non-null  int64  
 9   positive_review                             386803 non-null  object 
 

In [4]:
data_temp.describe() 

Unnamed: 0,additional_number_of_scoring,average_score,review_total_negative_word_counts,total_number_of_reviews,review_total_positive_word_counts,total_number_of_reviews_reviewer_has_given,reviewer_score,lat,lng
count,386803.0,386803.0,386803.0,386803.0,386803.0,386803.0,386803.0,384355.0,384355.0
mean,498.246536,8.397231,18.538988,2743.992042,17.776985,7.17725,8.396906,49.443522,2.823402
std,500.258012,0.547881,29.703369,2316.457018,21.726141,11.05442,1.63609,3.466936,4.579043
min,1.0,5.2,0.0,43.0,0.0,1.0,2.5,41.328376,-0.369758
25%,169.0,8.1,2.0,1161.0,5.0,1.0,7.5,48.214662,-0.143649
50%,342.0,8.4,9.0,2134.0,11.0,3.0,8.8,51.499981,-0.00025
75%,660.0,8.8,23.0,3613.0,22.0,8.0,9.6,51.516288,4.834443
max,2682.0,9.8,408.0,16670.0,395.0,355.0,10.0,52.400181,16.429233


In [5]:
data_temp.describe(include='object')

Unnamed: 0,hotel_address,review_date,hotel_name,reviewer_nationality,negative_review,positive_review,tags,days_since_review
count,386803,386803,386803,386803,386803,386803,386803,386803
unique,1493,731,1492,225,248828,311737,47135,731
top,163 Marsh Wall Docklands Tower Hamlets London ...,8/2/2017,Britannia International Hotel Canary Wharf,United Kingdom,No Negative,No Positive,"[' Leisure trip ', ' Couple ', ' Double Room '...",1 days
freq,3587,1911,3587,184033,95907,26885,3853,1911


In [6]:
if data_temp.isin([np.inf, -np.inf]).any().any():
    print("Infinite values found!")
else:
    print("No infinite values found!")    

# Check for missing values
if data_temp.isnull().any().any():
    print("Missing values found!")
else:
    print("No missing values found!")

No infinite values found!
Missing values found!


In [7]:
data_null = (data_temp.isnull().mean()*100).round(3).to_frame()
data_null.columns = ['Missing %']
data_null = data_null.rename_axis('Variables')
 
display(data_null)

Variables,Missing %
hotel_address,0.0
additional_number_of_scoring,0.0
review_date,0.0
average_score,0.0
hotel_name,0.0
reviewer_nationality,0.0
negative_review,0.0
review_total_negative_word_counts,0.0
total_number_of_reviews,0.0
positive_review,0.0


<div style="padding: 20px; border: 2px solid #c77220; border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); max-width: 100%; margin: 0 auto;">
    <ul style="font-size: 18px; font-family: 'Arial', sans-serif; line-height: 1.5em; word-wrap: break-word; overflow-wrap: break-word;">
    <h4 style="margin-bottom: 0;">💡 Interpretation:</h4>
        <li>The two coordinate variables (<code>lat</code> and <code>lng</code>) are each missing in about 0.633% of rows (2,448 of 386,803); all other variables are complete.</li>
    </ul>
</div>
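
Before choosing an imputation strategy, it can help to check whether the missing coordinates are spread across many hotels or concentrated in a few. The snippet below is a minimal sketch on a toy frame: the hotel names `A`, `B`, `C` and their coordinates are invented for illustration, and `toy` stands in for `data_temp`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for data_temp; hotel names and coordinates are invented
toy = pd.DataFrame({
    "hotel_name": ["A", "A", "B", "B", "C"],
    "lat": [51.5, 51.5, np.nan, np.nan, 48.2],
    "lng": [-0.14, -0.14, np.nan, np.nan, 16.4],
})

# Share of missing values per coordinate column, in percent
missing_pct = (toy[["lat", "lng"]].isnull().mean() * 100).round(3)

# Number of rows without a latitude, counted per hotel
missing_by_hotel = toy.loc[toy["lat"].isnull(), "hotel_name"].value_counts()

print(missing_pct)
print(missing_by_hotel)
```

If the gaps turn out to belong to a small set of hotels, filling them per hotel (for example from the address) may be more faithful than a single global value, although that is a design choice rather than a requirement.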

In [24]:
print(f'Number of duplicates: {data_temp[data_temp.duplicated()].shape[0]}')

# Drop duplicates:
data_temp = data_temp.drop_duplicates()
print(f'Dimension of the dataframe after the drop of duplicates: {data_temp.shape}')

Number of duplicates: 307
Dimension of the dataframe after the drop of duplicates: (386496, 17)


In [27]:
data_temp.head(2)

Unnamed: 0,hotel_address,additional_number_of_scoring,review_date,average_score,hotel_name,reviewer_nationality,negative_review,review_total_negative_word_counts,total_number_of_reviews,positive_review,review_total_positive_word_counts,total_number_of_reviews_reviewer_has_given,reviewer_score,tags,days_since_review,lat,lng
0,Stratton Street Mayfair Westminster Borough Lo...,581,2/19/2016,8.4,The May Fair Hotel,United Kingdom,Leaving,3,1994,Staff were amazing,4,7,10.0,"[' Leisure trip ', ' Couple ', ' Studio Suite ...",531 day,51.507894,-0.143671
1,130 134 Southampton Row Camden London WC1B 5AF...,299,1/12/2017,8.3,Mercure London Bloomsbury Hotel,United Kingdom,poor breakfast,3,1361,location,2,14,6.3,"[' Business trip ', ' Couple ', ' Standard Dou...",203 day,51.521009,-0.123097


In [30]:
print(f"There are {data_temp['hotel_name'].nunique()} unique hotels")

There are 1492 unique hotels


In [36]:
data_temp['review_date'] = pd.to_datetime(data_temp['review_date'])
print(f"The latest review was done on {data_temp['review_date'].max().date()}")
print(f"The first review was done on {data_temp['review_date'].min().date()}")

The latest review was done on 2017-08-03
The first review was done on 2015-08-04
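
Once `review_date` is a datetime column, calendar features can be derived from it, and the text column `days_since_review` (values such as `531 day`) can be turned into a number. The following is a rough sketch on two rows copied from the preview above; the new column names (`review_year`, `review_month`, `review_weekday`, `days_since_review_num`) are assumptions chosen for illustration, not names used elsewhere in this notebook.

```python
import pandas as pd

# Two rows copied from the dataset preview
toy = pd.DataFrame({
    "review_date": ["2/19/2016", "1/12/2017"],
    "days_since_review": ["531 day", "203 day"],
})

# The sample values are month/day/year, so the format is stated explicitly
toy["review_date"] = pd.to_datetime(toy["review_date"], format="%m/%d/%Y")

# Calendar features (hypothetical column names)
toy["review_year"] = toy["review_date"].dt.year
toy["review_month"] = toy["review_date"].dt.month
toy["review_weekday"] = toy["review_date"].dt.dayofweek  # Monday = 0

# Keep only the leading integer of strings such as "531 day" or "1 days"
toy["days_since_review_num"] = (
    toy["days_since_review"].str.extract(r"(\d+)", expand=False).astype(int)
)

print(toy)
```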


In [55]:
# Convert the 'tags' column from a string representation to a real list of tags,
# e.g. "['Leisure trip', 'Solo traveler']" -> ['Leisure trip', 'Solo traveler'].
# ast.literal_eval only parses Python literals, so unlike eval it cannot run arbitrary code.
import ast
data_temp['tags'] = data_temp['tags'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
all_tags = [tag.strip() for tags_list in data_temp['tags'] for tag in tags_list]
print(f"Number of unique tags: {len(set(all_tags))}")

Number of unique tags: 2368
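
With 2368 distinct tags, one-hot encoding all of them is impractical, so a common first step is to count how often each tag occurs and build features only from the frequent ones. The sketch below assumes a toy frame with tag strings in the same format as the dataset; the `is_couple` flag is a hypothetical feature name used only for illustration.

```python
import ast
from collections import Counter

import pandas as pd

# Toy tag strings in the dataset's format (padded with spaces inside the quotes)
toy = pd.DataFrame({
    "tags": [
        "[' Leisure trip ', ' Couple ']",
        "[' Business trip ', ' Couple ']",
        "[' Leisure trip ', ' Solo traveler ']",
    ]
})

# Parse each string into a list and strip the padding spaces
toy["tags"] = toy["tags"].apply(
    lambda s: [tag.strip() for tag in ast.literal_eval(s)]
)

# Count every tag across all reviews
tag_counts = Counter(tag for tags in toy["tags"] for tag in tags)
print(tag_counts.most_common(2))

# Hypothetical binary feature: does the review carry the 'Couple' tag?
toy["is_couple"] = toy["tags"].apply(lambda tags: int("Couple" in tags))
```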


<a id="chapter4"></a>  
## <div style="text-align:center; border-radius:30px 30px; padding:7px; color:white; margin:0; font-size:110%; font-family:Pacifico; background-color:#3170af; overflow:hidden"><b> 🧐 Feature Engineering </b></div>

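In [None]:
# Split the columns by dtype: numeric features vs. categorical (string) features
num_cols = data_temp.select_dtypes(include=['number']).columns
cat_cols = data_temp.select_dtypes(include=['object']).columns

In [None]:
X = data_temp.drop(columns=['reviewer_score'])
y = data_temp['reviewer_score']

# Note: chi2 expects non-negative numeric features and a discrete target,
# so the cat_cols columns must be encoded before this line can run
imp_cat = pd.Series(chi2(X[cat_cols], y)[0], index=cat_cols)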
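
scikit-learn's `chi2` scores non-negative numeric features against class labels, so string columns and a continuous score cannot be passed to it directly. The sketch below shows one possible preparation on invented toy data: categories are mapped to integer codes with `OrdinalEncoder`, and the score is truncated to an integer to act as a class label. Both choices are assumptions for illustration; the `trip_type` column and the `toy_cat_cols` name are hypothetical.

```python
import pandas as pd
from sklearn.feature_selection import chi2
from sklearn.preprocessing import OrdinalEncoder

# Invented toy data; 'trip_type' is a hypothetical categorical feature
toy = pd.DataFrame({
    "reviewer_nationality": ["United Kingdom", "United Kingdom", "Spain",
                             "Spain", "France", "France"],
    "trip_type": ["Leisure", "Business", "Leisure",
                  "Business", "Leisure", "Business"],
    "reviewer_score": [9.6, 7.5, 9.2, 7.1, 9.0, 7.9],
})
toy_cat_cols = ["reviewer_nationality", "trip_type"]

# chi2 needs non-negative numbers: map each category to an integer code
X_cat = OrdinalEncoder().fit_transform(toy[toy_cat_cols])

# chi2 treats the target as class labels: truncate the score to an integer
y_class = toy["reviewer_score"].astype(int)

chi2_stats, p_values = chi2(X_cat, y_class)
imp_cat = pd.Series(chi2_stats, index=toy_cat_cols).sort_values(ascending=False)
print(imp_cat)
```

Note that integer codes impose an arbitrary order on the categories, so the resulting statistics depend on the encoding; one-hot encoding is a common alternative when that matters.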



In [26]:
data_temp['lat'].describe()

count    384048.000000
mean         49.443988
std           3.468266
min          41.328376
25%          48.214277
50%          51.500198
75%          51.516384
max          52.400181
Name: lat, dtype: float64

<a id="chapter6"></a>
## <div style="text-align:center; border-radius:30px 30px; padding:7px; color:white; margin:0; font-size:110%; font-family:Pacifico; background-color:#3170af; overflow:hidden"><b> 🛠️ Modeling </b></div>

In [11]:
# First attempt: X still contains string columns, so fitting fails (see the error below)
X = data_temp.drop(['reviewer_score'], axis = 1)
y = data_temp['reviewer_score'] 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)
regr = RandomForestRegressor(n_estimators=100)  
 
regr.fit(X_train, y_train) 
y_pred = regr.predict(X_test) 

ValueError: could not convert string to float: 'Pla a de Llevant s n Sant Mart 08019 Barcelona Spain'

In [16]:
# Impute missing coordinates with the most frequent value of each column
data_temp['lat'] = data_temp['lat'].fillna(data_temp['lat'].mode()[0])
data_temp['lng'] = data_temp['lng'].fillna(data_temp['lng'].mode()[0])

In [18]:
# Second attempt: keep only the numeric columns as features
X = data_temp.select_dtypes(include=['number']).drop(columns=['reviewer_score'])
y = data_temp['reviewer_score'] 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)
regr = RandomForestRegressor(n_estimators=100)  
 
regr.fit(X_train, y_train) 
y_pred = regr.predict(X_test) 

In [20]:
print('MAPE:', metrics.mean_absolute_percentage_error(y_test, y_pred).round(3))

MAPE: 0.141
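
MAPE is the mean of |y - ŷ| / |y| over all test rows, so a value of 0.141 means the predictions deviate from the true score by about 14.1% on average. As a quick sanity check of what the metric computes, the sketch below reproduces it by hand on four invented score pairs and compares the result with scikit-learn's implementation.

```python
import numpy as np
from sklearn import metrics

# Invented true and predicted scores, for illustration only
y_true = np.array([10.0, 6.3, 8.8, 7.5])
y_hat = np.array([9.0, 7.0, 8.8, 9.0])

# Mean of the absolute errors relative to the true values
mape_manual = np.mean(np.abs(y_true - y_hat) / np.abs(y_true))

# Same quantity from scikit-learn
mape_sklearn = metrics.mean_absolute_percentage_error(y_true, y_hat)

print(round(mape_manual, 3), round(mape_sklearn, 3))
```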
