# TripAdvisor Recommendation System

In this project, we aim to develop a recommendation system for TripAdvisor reviews that surpasses the performance of the BM25 algorithm. We'll begin by implementing a BM25 baseline using the `rank_bm25` library and then explore advanced Natural Language Processing (NLP) techniques to enhance the recommendation system.

## 1. Data Preparation

### 1.1 Importing Necessary Libraries

We'll start by importing the essential libraries for data manipulation and analysis.


In [1]:
import pandas as pd
import numpy as np

### 1.2 Loading the Dataset

Next, we'll load the TripAdvisor Hotel Reviews dataset. Ensure that the dataset file (`tripadvisor_hotel_reviews.csv`) is in the same directory as this notebook, or provide the correct path.


In [2]:
# Load the dataset
df = pd.read_csv('tripadvisor_hotel_reviews.csv')


In [3]:
df.columns

Index(['Review', 'Rating'], dtype='object')

### 1.3 Inspecting the Dataset

Let's examine the first few rows of the dataset to understand its structure.


In [4]:
# Display the first few rows of the dataset
df.head()

| | Review | Rating |
|---|---|---|
| 0 | nice hotel expensive parking got good deal sta... | 4 |
| 1 | ok nothing special charge diamond member hilto... | 2 |
| 2 | nice rooms not 4* experience hotel monaco seat... | 3 |
| 3 | unique, great stay, wonderful time hotel monac... | 5 |
| 4 | great stay great stay, went seahawk game aweso... | 5 |


The dataset should have the following columns:

- `Review`: Concatenated reviews for the place.
- `Rating`: Average rating of the place based on all reviews.


### 1.4 Preprocessing the Reviews

We'll preprocess the concatenated reviews by tokenizing the text, converting it to lowercase, removing punctuation, and eliminating stopwords.


In [5]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string
import nltk

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

# Define stopwords and punctuation
stop_words = set(stopwords.words('english'))
punctuation = set(string.punctuation)

def preprocess(text):
    # Tokenize the text
    tokens = word_tokenize(text.lower())
    # Remove punctuation and stopwords
    tokens = [word for word in tokens if word not in stop_words and word not in punctuation]
    return tokens

# Apply preprocessing to the reviews
df['processed_review'] = df['Review'].apply(preprocess)




## 2. Implementing the BM25 Baseline

### 2.1 Installing the `rank_bm25` Library

We'll install the `rank_bm25` library, which provides an efficient implementation of the BM25 algorithm.


In [6]:
!pip install rank_bm25






### 2.2 Initializing the BM25 Model

We'll initialize the BM25 model with the processed reviews.


In [7]:
from rank_bm25 import BM25Okapi

# Create a list of processed reviews
corpus = df['processed_review'].tolist()

# Initialize the BM25 model
bm25 = BM25Okapi(corpus)
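
To make the scoring concrete, here is a minimal pure-Python sketch of textbook Okapi BM25. It is not the internals of `rank_bm25`: the smoothed IDF formula, the defaults `k1=1.5` and `b=0.75`, and the names `bm25_scores` and `toy_corpus` are assumptions chosen for illustration, and library implementations differ in such details.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query (textbook Okapi BM25)."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        # Length normalization: longer-than-average documents are penalized
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Smoothed, non-negative IDF (an assumed variant; others exist)
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term-frequency saturation controlled by k1
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Hypothetical toy corpus of already-tokenized reviews
toy_corpus = [
    ["clean", "room", "friendly", "staff"],
    ["dirty", "room", "rude", "staff"],
    ["great", "pool", "great", "view"],
]
toy_scores = bm25_scores(["clean", "friendly", "room"], toy_corpus)
print(toy_scores)
```

The first toy review shares three query terms, the second shares only "room", and the third shares none, so the scores decrease in that order and the last one is zero.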


### 2.3 Defining a Function to Recommend Similar Places

We'll define a function that, given a place's index, returns the most similar place based on BM25 similarity scores.


In [8]:
def recommend_similar_place(index):
    # Get the processed review of the place
    query = df.loc[index, 'processed_review']
    # Compute BM25 scores
    scores = bm25.get_scores(query)
    # Get the index of the most similar place (excluding the query place itself)
    scores[index] = -float('inf')  # Exclude the same place
    most_similar_index = scores.argmax()
    return most_similar_index


### 2.4 Evaluating the BM25 Model

To evaluate the effectiveness of the BM25 model, we'll calculate the Mean Squared Error (MSE) between the average ratings of each place and the ratings of its most similar recommended place. This comparison allows us to assess how accurately BM25 matches each place with another based on review similarity, without using any explicit rating information.

We'll speed this up with parallel processing via `joblib`, distributing the per-review computations across all available CPU cores (`n_jobs=-1`). To keep the runtime manageable, we evaluate on the first 100 reviews rather than the entire dataset.


In [9]:
!pip install joblib






In [10]:
from sklearn.metrics import mean_squared_error
from joblib import Parallel, delayed
import numpy as np

# Select the first 100 indexes
sample_indexes = df.index[:100]

def compute_mse_for_index(index):
    # Get the recommended place
    recommended_index = recommend_similar_place(index)
    # Get the ratings of the query and recommended places
    query_rating = df.loc[index, 'Rating']
    recommended_rating = df.loc[recommended_index, 'Rating']
    # Compute MSE for the current pair
    return mean_squared_error([query_rating], [recommended_rating])

def evaluate_bm25_parallel(selected_indexes):
    # Use Parallel to compute the MSE in parallel for the selected indexes
    mse_scores = Parallel(n_jobs=-1)(delayed(compute_mse_for_index)(index) for index in selected_indexes)
    # Compute the average MSE
    average_mse = np.mean(mse_scores)
    return average_mse

# Evaluate the BM25 model on the first 100 entries
bm25_mse = evaluate_bm25_parallel(sample_indexes)
print(f'BM25 Average MSE for first 100 entries: {bm25_mse}')


BM25 Average MSE for first 100 entries: 1.58
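
As a sanity check on what this metric measures, the sketch below computes the same quantity by hand: the mean of the squared rating gaps between each review and its recommended review. The ratings are the five shown by `df.head()`, but the `toy_recommended` indices are made up for illustration, and `nearest_neighbour_mse` is a hypothetical helper rather than part of the notebook.

```python
def nearest_neighbour_mse(ratings, recommended):
    """Mean squared rating gap between each item and its recommended item.

    ratings: list of numeric ratings, one per review.
    recommended: recommended[i] is the index recommended for review i.
    """
    squared_gaps = [(ratings[i] - ratings[j]) ** 2 for i, j in enumerate(recommended)]
    return sum(squared_gaps) / len(squared_gaps)

toy_ratings = [4, 2, 3, 5, 5]       # ratings of the first five reviews
toy_recommended = [3, 2, 0, 4, 3]   # hypothetical recommendations, not real output
toy_mse = nearest_neighbour_mse(toy_ratings, toy_recommended)
print(toy_mse)
```

Because each call to `mean_squared_error` in the cell above receives a single pair, it reduces to one squared difference, so averaging those values is equivalent to this computation.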


## 3. Developing an Enhanced Recommendation Model

To improve on the BM25 baseline, we'll use a hybrid approach that combines BM25 with sentence embeddings. The goal is to create a model that better captures the semantic similarities between reviews. Specifically, we'll:

1. Generate embeddings for each review using a pre-trained model.
2. Combine BM25 similarity scores with embedding-based similarity scores for improved recommendations.

This hybrid approach should ideally reduce the Mean Squared Error (MSE) compared to BM25 alone.


### 3.1 Setting Up Pre-trained Embeddings

We'll start by generating sentence embeddings for each review using `Sentence-BERT`. These embeddings will help capture semantic similarities between reviews, which we can then combine with BM25 similarity scores.


In [11]:
!pip install sentence-transformers






In [12]:

from sentence_transformers import SentenceTransformer
import numpy as np

# Load a pre-trained Sentence-BERT model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings for all reviews in one batched call (much faster than encoding row by row)
df['embedding'] = list(model.encode(df['Review'].tolist(), show_progress_bar=True))









### 3.2 Calculating Similarity Scores

We'll use cosine similarity between the embeddings to determine the semantic similarity between reviews. This similarity measure will complement the BM25 score, which is based on exact word matches.


In [13]:
from sklearn.metrics.pairwise import cosine_similarity

def get_embedding_similarity(index, candidate_index):
    # Retrieve the embeddings for both reviews
    embedding1 = df.loc[index, 'embedding']
    embedding2 = df.loc[candidate_index, 'embedding']
    # Compute cosine similarity
    return cosine_similarity([embedding1], [embedding2])[0][0]
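
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below spells that out in plain Python with toy 2-D vectors; `cosine` is a hypothetical helper for illustration, and the real embeddings are much higher-dimensional.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])      # perpendicular vectors
print(same_direction, orthogonal)
```

Parallel vectors score close to 1 regardless of their lengths and perpendicular vectors score 0, which is why cosine similarity compares the direction of two embeddings rather than their magnitude.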


### 3.3 Hybrid Recommendation Function

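We'll develop a hybrid recommendation function that combines BM25 and embedding similarities to find the most similar place. We’ll use a weighted average of BM25 and embedding scores to adjust their influence. Note that the two scores live on different scales: raw BM25 scores are unbounded, while cosine similarity lies between -1 and 1, so `alpha` does not translate directly into each component's share of the final score.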


In [14]:
def recommend_hybrid_place(index, alpha=0.5):
    # Get the BM25 scores
    query = df.loc[index, 'processed_review']
    bm25_scores = bm25.get_scores(query)
    
    # Combine BM25 and embedding similarities for all other entries
    hybrid_scores = []
    for candidate_index in df.index:
        if candidate_index != index:
            # Get BM25 score
            bm25_score = bm25_scores[candidate_index]
            # Get embedding similarity
            embedding_score = get_embedding_similarity(index, candidate_index)
            # Hybrid score: weighted average of BM25 and embedding scores
            hybrid_score = alpha * bm25_score + (1 - alpha) * embedding_score
            hybrid_scores.append((candidate_index, hybrid_score))
    
    # Sort by hybrid score (higher is better) and return the best match
    most_similar_index = max(hybrid_scores, key=lambda x: x[1])[0]
    return most_similar_index
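
One common way to make the two score types comparable is to rescale each to [0, 1] before blending. The sketch below is an assumption about how this could be done, not what the function above does; `min_max`, `blend_scores`, and the toy numbers are all hypothetical.

```python
def min_max(values):
    """Rescale a list of scores to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all candidates tie
    return [(v - lo) / (hi - lo) for v in values]

def blend_scores(bm25_scores, embedding_scores, alpha=0.5):
    """Weighted average of min-max-normalized BM25 and embedding scores."""
    bm25_scaled = min_max(bm25_scores)
    emb_scaled = min_max(embedding_scores)
    return [alpha * b + (1 - alpha) * e for b, e in zip(bm25_scaled, emb_scaled)]

# Hypothetical scores for three candidate reviews
toy_bm25 = [60.0, 40.0, 5.0]
toy_cosine = [0.30, 0.90, 0.10]

raw = [0.2 * b + 0.8 * e for b, e in zip(toy_bm25, toy_cosine)]
scaled = blend_scores(toy_bm25, toy_cosine, alpha=0.2)

raw_best = raw.index(max(raw))           # candidate 0: BM25 dominates
scaled_best = scaled.index(max(scaled))  # candidate 1: embeddings now matter
print(raw_best, scaled_best)
```

In this toy case the unnormalized blend picks the candidate with the highest BM25 score even at `alpha=0.2`, while the normalized blend picks the candidate with the highest cosine similarity. Whether normalization helps on the real data would need to be measured.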


### 3.4 Evaluating the Hybrid Model

We'll evaluate the hybrid model on the first 100 entries and compare its Mean Squared Error (MSE) with that of BM25. This will help us assess if our modifications improve the recommendation accuracy.


In [15]:
def evaluate_hybrid_parallel(selected_indexes=None, alpha=0.5):
    if selected_indexes is None:
        selected_indexes = df.index[:100]  # Default to first 100 entries if none specified
    mse_scores = Parallel(n_jobs=8)(delayed(lambda idx: mean_squared_error(
        [df.loc[idx, 'Rating']],
        [df.loc[recommend_hybrid_place(idx, alpha), 'Rating']]
    ))(index) for index in selected_indexes)
    # Calculate average MSE
    average_mse = np.mean(mse_scores)
    return average_mse

# Run evaluation with alpha set to 0.2 (BM25 weight 0.2, embedding weight 0.8)
hybrid_mse = evaluate_hybrid_parallel(sample_indexes, alpha=0.2)
print(f'Hybrid Model Average MSE for first 100 entries: {hybrid_mse}')


Hybrid Model Average MSE for first 100 entries: 1.48


### 3.5 Analyzing the Hybrid Model Results

After running the hybrid model (with `alpha=0.2`) on the first 100 entries, we got an MSE of 1.48, compared to the BM25-only baseline of 1.58. This modest improvement plausibly comes from the embeddings, which capture the meaning of reviews more effectively than BM25 alone, though a 100-review sample is too small to rule out noise.

#### Why the Hybrid Model Performed Better
1. **Understanding Context**: Embeddings help pick up on the meaning of words, not just exact matches, so similar reviews are matched even if they use different words.
2. **Balancing Exact and Semantic Similarity**: By mixing BM25’s exact word matching with the semantic similarity from embeddings, we get a better overall match between reviews.

Overall, it looks like adding embeddings was worth it, as it reduced our MSE and improved the model's ability to identify similar reviews.

---

## 4. Trying Out Another Method: Word Mover's Distance (WMD)

To keep exploring ways to improve the recommendation accuracy, we’re going to try out **Word Mover's Distance (WMD)**. WMD calculates the "distance" between two documents by finding the minimum "cost" to match words from one document to another based on their meaning.

#### Why Try WMD?
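WMD uses word embeddings to measure the distance between two documents, capturing word-level semantic similarity across the whole review. It treats each review as a bag of words, so word order is ignored, but it can still match reviews that describe the same experience with different vocabulary.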
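
To build intuition, the sketch below implements a relaxed variant of WMD in which every word simply travels to its nearest word in the other document. This gives a lower bound on the true WMD, which solves a full optimal-transport problem instead. The 2-D `toy_vectors` and the helper `relaxed_wmd` are made up for illustration; real Word2Vec vectors have hundreds of dimensions.

```python
import math

def relaxed_wmd(doc_a, doc_b, vectors):
    """Relaxed Word Mover's Distance: a cheap lower bound on true WMD."""
    def one_way(src, dst):
        # Each source word moves entirely to its closest destination word
        total = 0.0
        for word in src:
            total += min(math.dist(vectors[word], vectors[other]) for other in dst)
        return total / len(src)
    # Take the tighter of the two one-sided bounds
    return max(one_way(doc_a, doc_b), one_way(doc_b, doc_a))

# Hypothetical 2-D "embeddings" in which synonyms sit close together
toy_vectors = {
    "clean": (1.0, 0.0),
    "spotless": (0.9, 0.1),
    "room": (0.0, 1.0),
    "suite": (0.1, 0.9),
    "noisy": (-1.0, 0.0),
}

close = relaxed_wmd(["clean", "room"], ["spotless", "suite"], toy_vectors)
far = relaxed_wmd(["clean", "room"], ["noisy", "room"], toy_vectors)
print(close, far)
```

The first pair shares no words yet gets a small distance because its words are near-synonyms in the toy space, whereas the second pair shares "room" but is pushed apart by "clean" versus "noisy". An exact-match method like BM25 would rank these two pairs the other way round.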

### 4.1 Setting Up Word Mover's Distance Using Preprocessed Reviews

We'll calculate WMD using the preprocessed, tokenized reviews we already created. We’ll load a pre-trained Word2Vec model to measure the distance between reviews.



In [21]:
!pip install gensim
!pip install POT






In [18]:


from gensim.models import KeyedVectors

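# Load pre-trained Word2Vec vectors (the GoogleNews model is large; a smaller model can be substituted for quick experiments)
# Note: this rebinds `model`, replacing the Sentence-BERT model loaded in section 3.1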
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)


### 4.2 Implementing WMD for Recommendations

Using WMD, we’ll find the most similar review to a given review by calculating the minimal "distance" in meaning between words in two reviews.


In [19]:
def recommend_wmd_place(index):
    query_review = df.loc[index, 'processed_review']
    wmd_distances = []
    
    for candidate_index in df.index:
        if candidate_index != index:
            candidate_review = df.loc[candidate_index, 'processed_review']
            # Compute Word Mover's Distance
            wmd_distance = model.wmdistance(query_review, candidate_review)
            wmd_distances.append((candidate_index, wmd_distance))
    
    # Sort by WMD (lower distance is better) and return the best match
    most_similar_index = min(wmd_distances, key=lambda x: x[1])[0]
    return most_similar_index


### 4.3 Evaluating the WMD-Based Model

We’ll evaluate the WMD model on the first 100 entries by calculating the Mean Squared Error (MSE) between the ratings of each place and the closest place found by WMD.


In [23]:
def evaluate_wmd_parallel(selected_indexes=None):
    if selected_indexes is None:
        selected_indexes = df.index[:100]  # Default to first 100 entries
    mse_scores = Parallel(n_jobs=2)(delayed(lambda idx: mean_squared_error(
        [df.loc[idx, 'Rating']],
        [df.loc[recommend_wmd_place(idx), 'Rating']]
    ))(index) for index in selected_indexes)
    # Calculate average MSE
    average_mse = np.mean(mse_scores)
    return average_mse

# Run evaluation on the first 100 entries
wmd_mse = evaluate_wmd_parallel(sample_indexes)
print(f'WMD Model Average MSE for first 100 entries: {wmd_mse}')


WMD Model Average MSE for first 100 entries: 1.18


## 5. Comparing the Results of the Three Models

After evaluating all three models on the first 100 entries of TripAdvisor hotel reviews, we observed the following Mean Squared Error (MSE) values:

- **BM25 Baseline Model**: MSE = 1.58
- **Hybrid Model (BM25 + Embeddings)**: MSE = 1.48
- **Word Mover's Distance (WMD) Model**: MSE = 1.18
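
To put these numbers in perspective, the short calculation below expresses each model's MSE as a relative reduction over the BM25 baseline. It uses only the three values reported above; the variable names are illustrative.

```python
baseline_mse = 1.58
reported = {"hybrid": 1.48, "wmd": 1.18}

# Fraction by which each model lowers the MSE relative to BM25
relative_reduction = {
    name: (baseline_mse - mse) / baseline_mse for name, mse in reported.items()
}

for name, reduction in relative_reduction.items():
    print(f"{name}: {reduction:.1%} lower MSE than BM25")
```

This works out to roughly a 6% reduction for the hybrid model and roughly 25% for WMD, both measured on the same 100 reviews.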

#### How Each Model Works

1. **BM25 Baseline Model**:
   - BM25 is a popular refinement of TF-IDF that scores relevance using term frequency (TF) within a document and inverse document frequency (IDF) across the corpus, with term-frequency saturation so that repeated occurrences of a word yield diminishing returns.
   - It also normalizes for document length, so a long review is not favored merely because it contains more words.
   - In this project, BM25 finds similar reviews by prioritizing exact word matches and frequency within the reviews, making it effective for identifying hotels with similar feedback based on specific terms (e.g., "clean rooms" or "friendly staff"). However, it lacks an understanding of word meaning, so it might not pick up on context if the language varies.

2. **Hybrid Model (BM25 + Embeddings)**:
   - This model combines BM25 scores with cosine similarity scores from Sentence-BERT embeddings, which provide vector-based representations that capture semantic relationships.
   - Embeddings capture more contextual meaning, helping the model match reviews that discuss similar ideas even if they use different words (e.g., "very clean" vs. "spotless").
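   - The hybrid model uses a weighted average of BM25 and embedding similarity, controlled by the parameter `alpha` (set to 0.2 with the intent of weighting embeddings more heavily, although raw BM25 scores are unnormalized and typically much larger than cosine similarities).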
   - By blending BM25’s exact matching with embedding-based semantic similarity, the hybrid model improved the MSE to 1.48, showing it could capture a bit more context than BM25 alone.

3. **Word Mover's Distance (WMD) Model**:
   - WMD calculates the “distance” between two documents by finding the minimal cumulative "travel cost" required to transform one document’s words into the other’s, using word embeddings.
   - WMD doesn’t just rely on matches or weighted averages—it measures the actual "distance" in meaning between words, enabling it to find reviews that express similar sentiments in different ways.
   - For hotel reviews, this means WMD can match reviews more effectively when similar experiences are described with varied language, helping it handle nuances in customer feedback.

#### Why WMD Performed the Best

WMD outperformed both BM25 and the hybrid model, achieving the lowest MSE of 1.18. This likely happened for several reasons:

1. **Deeper Semantic Matching**: WMD can match reviews based on meaning rather than exact words. This means it can find reviews that are expressing similar feedback, even if the words used differ, which is useful for identifying hotels with similar guest experiences.
   
2. **Optimal Word Transport**: WMD calculates the minimum "cost" to match words between two reviews, capturing subtle similarities in word usage that BM25 and the hybrid model might miss.

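3. **Pair-Specific Matching**: Unlike BM25, which is limited to exact word matches, and the hybrid model, which blends two precomputed similarity scores, WMD solves a separate word-alignment problem for each pair of reviews, which makes it flexible at finding meaningful connections in the hotel review dataset. Note that the Word2Vec vectors are static and WMD treats each review as a bag of words, so word order and sentence-level context are not modeled.

Overall, WMD’s ability to handle language variations and capture deeper semantic similarities likely contributed to its better performance on TripAdvisor reviews, although all three results come from only the first 100 entries and should be confirmed on a larger sample.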
