# **Contextual Search for Hotel Reviews**

Inspired by
https://github.com/UKPLab/sentence-transformers/tree/master/sentence_transformers

Data source
https://www.kaggle.com/datasets/hamzafarooq50/hotel-listings-and-reviews?resource=download&select=HotelListInDubai__en2019100120191005.csv

### **Import Package**
First, install sentence-transformers, which provides an easy-to-use interface to BERT-style sentence embedding models, along with spaCy for basic NLP.

In [None]:
!pip install -U spacy
!pip install -U sentence-transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
!pip install opendatasets
!pip install pandas

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import opendatasets as od
import pandas

# download data from Kaggle (using key and username)
od.download("https://www.kaggle.com/datasets/hamzafarooq50/hotel-listings-and-reviews?select=hotelReviewsInDubai__en2019100120191005.csv")

Skipping, found downloaded files in "./hotel-listings-and-reviews" (use force=True to force download)


In [None]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from collections import Counter
from heapq import nlargest

In [None]:
!python -m spacy download en_core_web_sm

2023-02-18 00:57:27.147260: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-02-18 00:57:27.147553: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-02-18 00:57:33.551351: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download

In [None]:
!ls

hotel-listings-and-reviews  sample_data


# **Basic NLP**

In [None]:
# Data Cleaning

import re
#sample review from the IMDB dataset.
review = "<b>A touching movie!!</b> It is full of emotions and wonderful acting.<br> I could have sat through it a second time."
cleaned_review = re.sub(re.compile('<.*?>'), '', review) #removing HTML tags
cleaned_review = re.sub('[^A-Za-z0-9]+', ' ', cleaned_review) #taking only words

print(cleaned_review)

A touching movie It is full of emotions and wonderful acting I could have sat through it a second time 


In [None]:
#Lowercase

cleaned_review = cleaned_review.lower()
print(cleaned_review)

a touching movie it is full of emotions and wonderful acting i could have sat through it a second time 


In [None]:
# Tokenization

import nltk
nltk.download('punkt')

from nltk.tokenize import word_tokenize
tokens = word_tokenize(cleaned_review)  # split the cleaned text into word tokens

print(cleaned_review)
print(tokens)

a touching movie it is full of emotions and wonderful acting i could have sat through it a second time 
['a', 'touching', 'movie', 'it', 'is', 'full', 'of', 'emotions', 'and', 'wonderful', 'acting', 'i', 'could', 'have', 'sat', 'through', 'it', 'a', 'second', 'time']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
# Stop words removal

nltk.download('stopwords')

from nltk.corpus import stopwords
stop_words = stopwords.words('english')
filtered_review = [word for word in tokens if word not in stop_words] # removing stop words
print(filtered_review)

['touching', 'movie', 'full', 'emotions', 'wonderful', 'acting', 'could', 'sat', 'second', 'time']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
# Lemmatization

nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemm_review = [lemmatizer.lemmatize(word) for word in filtered_review]
print(lemm_review)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


['touching', 'movie', 'full', 'emotion', 'wonderful', 'acting', 'could', 'sat', 'second', 'time']


[nltk_data]   Package omw-1.4 is already up-to-date!
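
The cleaning steps above (tag removal, lowercasing, tokenization, stop-word removal) can be chained into one helper. The sketch below is a minimal standard-library version, not the NLTK pipeline used in the cells: `preprocess` and `TOY_STOP_WORDS` are hypothetical names, the stop-word set is a tiny illustrative subset, and tokenization is a plain whitespace split.

```python
import re

# Hypothetical, deliberately tiny stop-word set for illustration only;
# the notebook itself uses NLTK's full English list.
TOY_STOP_WORDS = {"a", "it", "is", "of", "and", "i", "have", "through"}

def preprocess(text, stop_words=TOY_STOP_WORDS):
    """Strip HTML tags, keep alphanumerics, lowercase, split, drop stop words."""
    text = re.sub(r"<.*?>", "", text)            # remove HTML tags
    text = re.sub(r"[^A-Za-z0-9]+", " ", text)   # replace everything else with spaces
    tokens = text.lower().split()                # whitespace tokenization
    return [t for t in tokens if t not in stop_words]

review = ("<b>A touching movie!!</b> It is full of emotions and wonderful acting."
          "<br> I could have sat through it a second time.")
print(preprocess(review))
```

Keeping the steps in a single function makes it easier to apply the same cleaning to both the corpus and the queries.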


# **Moving to the Deep Learning Part**

In [None]:
import os
import spacy
nlp = spacy.load("en_core_web_sm")
from spacy import displacy

In [None]:
text = """Looking for a hotel in New York near Times Square with free breakfast and cheaper 
than $100 for 2nd June which is really kids friendly and has a swimming pool and I want to stay there for 8 days"""
doc = nlp(text)
sentence_spans = list(doc.sents)
displacy.render(doc, jupyter = True, style="ent")

In [None]:
text = """Close to the Eiffel Tower and is very high end with great shopping nearby"""
doc = nlp(text)
sentence_spans = list(doc.sents)
displacy.render(doc, jupyter = True, style="ent")

In [None]:
text = "I want to stay in a European city that filmed Game of Thrones and has very cheap booze and art galleries for 4 days"
#text = """My very photogenic mother died in a freak accident (picnic, lightning) when I was three, and, save for a pocket of warmth in the darkest past, nothing of her subsists within the hollows and dells of memory, over which, if you can still stand my style (I am writing under observation), the sun of my infancy had set: surely, you all know those redolent remnants of day suspended, with the midges, about some hedge in bloom or suddenly entered and traversed by the rambler, at the bottom of a hill, in the summer dusk; a furry warmth, golden midges"""
doc = nlp(text)
sentence_spans = list(doc.sents)
displacy.render(doc, jupyter = True, style="ent")

In [None]:
stopwords=list(STOP_WORDS)
from string import punctuation
punctuation=punctuation+ '\n'

In [None]:
import pandas as pd
from sentence_transformers import SentenceTransformer
import scipy.spatial
import pickle as pkl

embedder = SentenceTransformer('all-MiniLM-L6-v2')
#embedder = SentenceTransformer('bert-base-nli-mean-tokens')

# **Hotel data in Dubai**

In [None]:
import opendatasets as od
import pandas
  
od.download("https://www.kaggle.com/datasets/hamzafarooq50/hotel-listings-and-reviews?resource=download&select=HotelListInDubai__en2019100120191005.csv")

Skipping, found downloaded files in "./hotel-listings-and-reviews" (use force=True to force download)


### (1) Hotel list

In [None]:
import pandas as pds
  
# reading the CSV file
file =('/content/hotel-listings-and-reviews/HotelListInDubai__en2019100120191005.csv')
df_list = pds.read_csv(file)
  
# displaying the first rows of the CSV file
df_list.head()

Unnamed: 0.1,Unnamed: 0,hotel_name,url,locality,reviews,tripadvisor_rating,checkIn,checkOut,price_per_night,booking_provider,no_of_deals,hotel_features
0,0,Four Points By Sheraton Downtown Dubai,http://www.tripadvisor.com/Hotel_Review-g29542...,Dubai,2046,,2019/10/01,2019/10/05,$74,FourPoints.com,15,
1,1,FIVE Palm Jumeirah Dubai,http://www.tripadvisor.com/Hotel_Review-g29542...,Dubai,5388,,2019/10/01,2019/10/05,,Booking.com,15,
2,2,"Atlantis, The Palm",http://www.tripadvisor.com/Hotel_Review-g29542...,Dubai,25417,,2019/10/01,2019/10/05,,Booking.com,10,
3,3,Citymax Hotel Bur Dubai,http://www.tripadvisor.com/Hotel_Review-g29542...,Dubai,3704,,2019/10/01,2019/10/05,,TripAdvisor,14,
4,4,Premier Inn Dubai International Airport Hotel,http://www.tripadvisor.com/Hotel_Review-g29542...,Dubai,5215,,2019/10/01,2019/10/05,,Booking.com,14,


### (2) Hotel Reviews

In [None]:
# reading the CSV file
file =('/content/hotel-listings-and-reviews/hotelReviewsInDubai__en2019100120191005.csv')
df_reviews = pds.read_csv(file)
  
# displaying the first rows of the CSV file
df_reviews.head()

Unnamed: 0.1,Unnamed: 0,review_body,review_date,hotelName,hotelUrl
0,0,Just to say this is really an excellent hotel ...,"July 14, 2019",0 Four Points By Sheraton Downtown Dubai\nN...,http://www.tripadvisor.com/Hotel_Review-g29542...
1,1,"Found this pub by chance, what a great place, ...","July 12, 2019",0 Four Points By Sheraton Downtown Dubai\nN...,http://www.tripadvisor.com/Hotel_Review-g29542...
2,2,"House keeping is perfect , the rooms are alway...","July 9, 2019",0 Four Points By Sheraton Downtown Dubai\nN...,http://www.tripadvisor.com/Hotel_Review-g29542...
3,3,Although we had a few issues in terms of check...,"July 6, 2019",0 Four Points By Sheraton Downtown Dubai\nN...,http://www.tripadvisor.com/Hotel_Review-g29542...
4,4,I was stayed over 3 night in room ( 730 ) my f...,"July 4, 2019",0 Four Points By Sheraton Downtown Dubai\nN...,http://www.tripadvisor.com/Hotel_Review-g29542...


In [None]:
df_reviews['hotelName'].value_counts()

0    Four Points By Sheraton Downtown Dubai\nName: hotel_name, dtype: object           54
8    Orient Guest House\nName: hotel_name, dtype: object                               54
23    Signature 1 Hotel Tecom\nName: hotel_name, dtype: object                         54
20    Golden Tulip Al Barsha\nName: hotel_name, dtype: object                          54
18    Winchester Grand Hotel Apartments\nName: hotel_name, dtype: object               54
1    FIVE Palm Jumeirah Dubai\nName: hotel_name, dtype: object                         54
9    Barjeel Heritage Guest House\nName: hotel_name, dtype: object                     54
29    London Creek Hotel Apartments\nName: hotel_name, dtype: object                   54
7    Address Dubai Marina\nName: hotel_name, dtype: object                             54
5    JW Marriott Hotel Dubai\nName: hotel_name, dtype: object                          54
2    Atlantis, The Palm\nName: hotel_name, dtype: object                               54
3    Citym

In [None]:
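# Split off the pandas repr residue ("Name: hotel_name, dtype: object") that follows the newline
df_reviews[['Hotel_Name_Clean','Extra']] = df_reviews.hotelName.str.split("\n",expand=True)
# Drop the first 4 characters (row index and padding), then trim the remaining whitespace
df_reviews['Hotel_Name_Clean'] = df_reviews['Hotel_Name_Clean'].str.slice(4,).str.strip()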
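
The fixed `slice(4,)` depends on how wide the index prefix happens to be. As a hedged alternative, the sketch below strips the prefix with a regular expression instead. `clean_hotel_name` is a hypothetical helper, and it assumes every raw value starts with an integer index followed by whitespace, as in the `value_counts()` output.

```python
import re

# Optional leading whitespace, an integer index, whitespace, then the rest of the line.
_INDEX_PREFIX = re.compile(r"^\s*\d+\s+(.*)$")

def clean_hotel_name(raw):
    """Extract the hotel name from a pandas-repr string such as
    '23    Some Hotel' followed by a newline and 'Name: hotel_name, dtype: object'."""
    first_line = raw.split("\n", 1)[0]           # drop the "Name: ..., dtype: ..." line
    match = _INDEX_PREFIX.match(first_line)
    return (match.group(1) if match else first_line).strip()

print(clean_hotel_name("23    Signature 1 Hotel Tecom\nName: hotel_name, dtype: object"))
print(clean_hotel_name("0    Four Points By Sheraton Downtown Dubai\nName: hotel_name, dtype: object"))
```

With pandas, a helper like this could be applied through `df_reviews['hotelName'].apply(clean_hotel_name)`.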

In [None]:
df_reviews['Hotel_Name_Clean'].drop_duplicates()

0             Four Points By Sheraton Downtown Dubai
54                          FIVE Palm Jumeirah Dubai
108                               Atlantis, The Palm
162                          Citymax Hotel Bur Dubai
216    Premier Inn Dubai International Airport Hotel
270                          JW Marriott Hotel Dubai
324                Four Points by Sheraton Bur Dubai
378                             Address Dubai Marina
432                               Orient Guest House
486                     Barjeel Heritage Guest House
540       DAMAC Towers by Paramount Hotels & Resorts
545                                 Hotel Beit Bahar
546                             Roda Boutique Villas
566                              Vida Emirates Hills
569                                   Vasantam Hotel
575                  Hyatt Place Dubai/Wasl District
612                                    BackPacker 16
634                    Crowne Plaza Dubai Apartments
635                Winchester Grand Hotel Apar

In [None]:
df_reviews.head()

Unnamed: 0.1,Unnamed: 0,review_body,review_date,hotelName,hotelUrl,Hotel_Name_Clean,Extra
0,0,Just to say this is really an excellent hotel ...,"July 14, 2019",0 Four Points By Sheraton Downtown Dubai\nN...,http://www.tripadvisor.com/Hotel_Review-g29542...,Four Points By Sheraton Downtown Dubai,"Name: hotel_name, dtype: object"
1,1,"Found this pub by chance, what a great place, ...","July 12, 2019",0 Four Points By Sheraton Downtown Dubai\nN...,http://www.tripadvisor.com/Hotel_Review-g29542...,Four Points By Sheraton Downtown Dubai,"Name: hotel_name, dtype: object"
2,2,"House keeping is perfect , the rooms are alway...","July 9, 2019",0 Four Points By Sheraton Downtown Dubai\nN...,http://www.tripadvisor.com/Hotel_Review-g29542...,Four Points By Sheraton Downtown Dubai,"Name: hotel_name, dtype: object"
3,3,Although we had a few issues in terms of check...,"July 6, 2019",0 Four Points By Sheraton Downtown Dubai\nN...,http://www.tripadvisor.com/Hotel_Review-g29542...,Four Points By Sheraton Downtown Dubai,"Name: hotel_name, dtype: object"
4,4,I was stayed over 3 night in room ( 730 ) my f...,"July 4, 2019",0 Four Points By Sheraton Downtown Dubai\nN...,http://www.tripadvisor.com/Hotel_Review-g29542...,Four Points By Sheraton Downtown Dubai,"Name: hotel_name, dtype: object"


## Combine reviews

In [None]:
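# Concatenate all reviews of a hotel into one text; join with a space so that
# the last word of one review is not fused with the first word of the next
df_combined = df_reviews.sort_values(['Hotel_Name_Clean']).groupby('Hotel_Name_Clean', sort=False).review_body.apply(' '.join).reset_index(name='all_review')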
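
To see why the separator matters when reviews are combined per hotel, here is a small standard-library sketch of the same group-and-join idea. The `rows` data is made up for illustration and stands in for the `Hotel_Name_Clean` and `review_body` columns.

```python
from collections import defaultdict

# Hypothetical (hotel, review) pairs standing in for df_reviews rows
rows = [
    ("Hotel A", "Great pool."),
    ("Hotel B", "Close to the metro."),
    ("Hotel A", "Friendly staff."),
]

# Group the reviews by hotel name
grouped = defaultdict(list)
for hotel, review in rows:
    grouped[hotel].append(review)

# Join each hotel's reviews, once without and once with a separator
no_separator = {hotel: "".join(reviews) for hotel, reviews in grouped.items()}
with_space = {hotel: " ".join(reviews) for hotel, reviews in grouped.items()}

print(no_separator["Hotel A"])   # review boundaries run together
print(with_space["Hotel A"])     # reviews stay separated by a space
```

Once punctuation is stripped, text joined without a separator would merge words across review boundaries, which can distort the embedded text.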

In [None]:
df_combined.head()

Unnamed: 0,Hotel_Name_Clean,all_review
0,Address Dubai Marina,"Excellent Hotel and service, i enjoyed my stay..."
1,Al SEEF Hotel,AMAZING palace with beautiful design the servi...
2,"Atlantis, The Palm",Nice hotel for the family. Everywhere in the h...
3,BackPacker 16,It's not a fancy hotel and it's not a real hos...
4,Barjeel Heritage Guest House,Only had two days here to break the long trip ...


In [None]:
import re

# Keep only letters, digits, and whitespace
df_combined['all_review'] = df_combined['all_review'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', x))

# Lowercase the combined review text
df_combined['all_review'] = df_combined['all_review'].str.lower()

In [None]:
df = df_combined

In [None]:
df.head()

Unnamed: 0,Hotel_Name_Clean,all_review
0,Address Dubai Marina,excellent hotel and service i enjoyed my stay ...
1,Al SEEF Hotel,amazing palace with beautiful design the servi...
2,"Atlantis, The Palm",nice hotel for the family everywhere in the ho...
3,BackPacker 16,its not a fancy hotel and its not a real hoste...
4,Barjeel Heritage Guest House,only had two days here to break the long trip ...


In [None]:
df_sentences = df_combined.set_index("all_review")
df_sentences = df_sentences["Hotel_Name_Clean"].to_dict()
df_sentences_list = list(df_sentences.keys())
len(df_sentences_list)

28

In [None]:
list(df_sentences.keys())[1]

'amazing palace with beautiful design the services provided der is nice with the location just wow im so recommend it specially with the how much u will pay and lets dont forget its one of jumeirah group this bustling area of dubai was unknown to me until my stay at the lovely al seef hotel the staff were fantastic especially our waiter at breakfast unfortunately was unable to catch his name only know he is egyptian and provided the most attentive service lovely area to stay in even as a local resident the hotel is nicely located near busy area of the town which is easily accessible it is slightly priced very budgeted other clean and comfortable the breakfast had hot option overall good and the staffs are incredibly friendly and helpful'

In [None]:
from tqdm import tqdm
from sentence_transformers import SentenceTransformer, util

In [None]:
df_sentences_list = [str(d) for d in tqdm(df_sentences_list)]

100%|██████████| 28/28 [00:00<00:00, 5574.09it/s]


## Embeddings

In [None]:
# Corpus: one combined, cleaned review text per hotel
corpus = df_sentences_list
corpus_embeddings = embedder.encode(corpus,show_progress_bar=True)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [None]:
corpus_embeddings[0].shape

(384,)

In [None]:
corpus_embeddings[0]
corpus_embeddings[0][1:5]

array([ 0.00088606,  0.00086861,  0.03601856, -0.06950236], dtype=float32)
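
Each hotel is now represented by a 384-dimensional vector, and the search step compares such vectors by cosine similarity. As a rough illustration of that measure (a plain-Python sketch, not the sentence-transformers implementation), with `cosine_similarity` as a hypothetical helper and toy 2-dimensional vectors:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # parallel, different lengths
```

Because the measure depends only on direction, a long combined review and a short query can still score as similar.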

In [None]:
# model = SentenceTransformer('all-MiniLM-L6-v2')
# paraphrases = util.paraphrase_mining(model, corpus)
# query_embeddings_p =  util.paraphrase_mining(model, queries,show_progress_bar=True)

In [None]:
# import pickle as pkl
# with open("/content/drive/MyDrive/BertSentenceSimilarity/Pickles/corpus_embeddings.pkl" , "wb") as file_:
# pkl.dump(corpus_embeddings,file_)
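
The commented-out cell above hints at caching the embeddings with pickle. A minimal sketch of a save/load round trip follows, assuming a temporary directory rather than the Google Drive path. `save_embeddings`, `load_embeddings`, and the toy data are hypothetical stand-ins.

```python
import os
import pickle
import tempfile

def save_embeddings(embeddings, path):
    """Serialize embeddings to disk with pickle."""
    with open(path, "wb") as f:
        pickle.dump(embeddings, f)

def load_embeddings(path):
    """Load embeddings previously written by save_embeddings."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Toy nested list standing in for the real corpus_embeddings array
toy_embeddings = [[0.1, 0.2], [0.3, 0.4]]

with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "corpus_embeddings.pkl")
    save_embeddings(toy_embeddings, cache_path)
    restored = load_embeddings(cache_path)

print(restored == toy_embeddings)
```

Caching avoids re-encoding the corpus on every run. Pickle files should only be loaded from trusted sources.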

## **Query Sentences Input**

In [None]:
import torch

# Query sentences
queries = ['hotel that is close to the airport',
           'Hotel with easy access for taxi']

# For each query, find the top_k hotels whose combined reviews are closest by cosine similarity
top_k = min(5, len(corpus))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)

    # Cosine similarity against every corpus embedding, then torch.topk for the highest scores
    cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print(f"\nTop {top_k} most similar hotels in corpus:")

    for score, idx in zip(top_results[0], top_results[1]):
        idx = int(idx)
        # Look up the hotel whose combined review text matches this corpus entry
        hotel = df.loc[df['all_review'] == corpus[idx], 'Hotel_Name_Clean'].iloc[0]
        print("Hotel:", hotel, "(Score: {:.4f})".format(score))
        print(corpus[idx], "\n")

    # Alternatively, util.semantic_search performs cosine similarity + top-k in one call:
    # hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=5)[0]
    # for hit in hits:
    #     print(corpus[hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))

In [None]:
model = SentenceTransformer('sentence-transformers/paraphrase-xlm-r-multilingual-v1')
embeddings = model.encode(corpus)
#print(embeddings)

In [None]:
query_embedding.shape

torch.Size([384])
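
The query loop in the "Query Sentences" section ranks every hotel by cosine similarity and keeps the best few. The same idea can be sketched without torch, using `heapq.nlargest` (imported near the top of the notebook). `top_k_matches` and the 2-dimensional toy vectors are hypothetical, standing in for the real 384-dimensional embeddings.

```python
import math
from heapq import nlargest

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k_matches(query_vec, corpus_vecs, k=5):
    """Return (index, score) pairs for the k corpus vectors most similar to query_vec."""
    scores = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(corpus_vecs)]
    return nlargest(min(k, len(scores)), scores, key=lambda pair: pair[1])

# Toy "embeddings": index 0 points almost the same way as the query, index 2 does not
toy_corpus = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
for idx, score in top_k_matches([1.0, 0.1], toy_corpus, k=2):
    print(idx, round(score, 4))
```

For a corpus of 28 hotels a brute-force scan like this is enough; much larger corpora usually call for batched tensor operations or an approximate nearest-neighbour index.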