# 2. Search Engine

## 2.0 Preprocessing

The first step is to preprocess the restaurant descriptions. We use the custom function ```preprocessing``` and store the resulting token list for each description in a dictionary ```preprocessed_docs```, keyed by document ID.

In [None]:
df = pd.read_csv('restaurants_data.tsv', sep = '\t')

In [None]:
df.head()

Unnamed: 0,restaurantName,city,postalCode,country,address,priceRange,cuisineType,description,facilitiesServices,creditCards,phoneNumber,website
0,Konnubio,Florence,50123,Italy,via dei Conti 8 r,€€,['Italian Contemporary'],Different options under one roof: a bar open f...,"['Air conditioning', 'Restaurant offering vege...","['amex', 'maestrocard', 'mastercard', 'visa']",+39 055 238 1189,https://guide.michelin.com/en/toscana/firenze/...
1,Seta Sushi Restaurant,Bologna,40125,Italy,corte Isolani 2,€€,['Japanese'],It was friendship and passion for Japanese cui...,"['Air conditioning', 'Terrace']","['amex', 'maestrocard', 'mastercard', 'visa']",+39 051 003 9367,https://guide.michelin.com/en/emilia-romagna/b...
2,Wine & Dine,Canazei,38032,Italy,strèdà Roma 5,€€,"['Regional Cuisine', ' Classic Cuisine']","Popular with locals, this restaurant has the t...","['Car park', 'Garden or park', 'Interesting wi...","['amex', 'maestrocard', 'mastercard', 'visa']",+39 0462 601111,https://guide.michelin.com/en/trentino-alto-ad...
3,Enoteca di Canelli - Casa Crippa,Canelli,14053,Italy,corso Libertà 65/a,€€,"['Modern Cuisine', ' Piedmontese']","Occupying a late-19C palazzo, in what was a hi...",['Air conditioning'],"['amex', 'dinersclub', 'mastercard', 'visa']",+39 0141 832182,https://guide.michelin.com/en/piemonte/canelli...
4,Shiroya,Rome,186,Italy,via dei Baullari 147,€€,"['Japanese', ' Asian']",One of the most popular restaurants in the his...,"['Air conditioning', 'Terrace']","['amex', 'mastercard', 'visa']",+39 06 6476 0753,https://guide.michelin.com/en/lazio/roma/resta...


In [None]:
preprocessed_docs = defaultdict(list) # initialize defaultdict to store preprocessed docs
for doc_id, doc in enumerate(df.description):
  preprocessed_docs[doc_id] = engine.preprocessing(doc) # preprocess doc at position doc_id

In [None]:
# Test description
text = '''After many years' experience in Michelin-starred restaurants, Luigi Tramontano and his wife Nicoletta
have opened their first restaurant in the chef's native Gargnano. Previously a pasta factory, the building has been converted
into an elegant, contemporary-style restaurant which has nonetheless retained its charming high ceilings.
The cuisine is inspired by regional traditions which are reinterpreted to create gourmet dishes,
all prepared with respect for the ingredients used and a strong focus on local produce.'''

# Test preprocessing on test description
print(engine.preprocessing(text))

['mani', 'year', 'experi', 'michelin', 'star', 'restaur', 'luigi', 'tramontano', 'wife', 'nicoletta', 'open', 'first', 'restaur', 'chef', 'nativ', 'gargnano', 'previous', 'pasta', 'factori', 'build', 'convert', 'eleg', 'contemporary', 'styl', 'restaur', 'nonetheless', 'retain', 'charm', 'high', 'ceil', 'cuisin', 'inspir', 'region', 'tradit', 'reinterpret', 'creat', 'gourmet', 'dish', 'prepar', 'respect', 'ingredi', 'use', 'strong', 'focu', 'local', 'produc']
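
The ```engine.preprocessing``` helper is defined outside this section. As a rough illustration only, a function of this kind typically lowercases the text, tokenizes it, drops stopwords, and stems what remains. The sketch below uses only the standard library; the stopword list and the suffix-stripping rule are simplified stand-ins (assumptions, not the notebook's actual implementation), so its output will not match the stems printed above.

```python
import re

# Tiny illustrative stopword list (an assumption; a real pipeline would use a fuller one)
STOPWORDS = {
    "a", "an", "and", "the", "in", "of", "for", "with", "is", "has", "been",
    "to", "on", "by", "which", "are", "their", "his", "its", "all", "into",
    "after", "have",
}

def naive_stem(token):
    """Strip a few common English suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters would remain
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess_sketch(text):
    """Lowercase, tokenize on runs of letters, remove stopwords, and stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess_sketch("The building has been converted into an elegant restaurant."))
```

Keeping each step (tokenizing, filtering, stemming) separate makes it easy to swap in a proper stemmer or stopword list later without touching the rest of the pipeline.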


## 2.1 Conjunctive Query

### 2.1.1 Create your Index!

In the following code cells, we (1) collect the unique tokens from all preprocessed restaurant descriptions into a DataFrame ```vocabulary_df``` that maps each term to a unique integer ID, and (2) build the inverted index, which maps each term ID to the list of documents containing that term.

In [None]:
# 1. Vocabulary File

# Retrieve the restaurants DataFrame
df = pd.read_csv('restaurants_data.tsv', sep='\t')

# Collect the unique tokens across all documents
unique_tokens = set()
for doc in preprocessed_docs.values():
  unique_tokens.update(doc)

doc_tokens = sorted(unique_tokens) # sort so that term IDs are reproducible across runs

vocabulary_dict = {term: i for i, term in enumerate(doc_tokens)} # dictionary of all vocabulary terms
vocabulary_df = pd.DataFrame({'term': vocabulary_dict.keys(), 'term_id': vocabulary_dict.values()}) # dataframe that maps terms to IDs

vocabulary_df.to_csv('vocabulary.csv', index=False) # save vocabulary dataframe in a csv file

In [None]:
# 2. Inverted Index

inverted_index = defaultdict(list) # initialize inverted_index dictionary

# Compute the inverted_index
for doc_id, tokens in preprocessed_docs.items():
  for token in set(tokens): # the set ensures each term is recorded once per document
    # Look up the term_id of the current term/token
    term_id = vocabulary_dict[token]
    inverted_index[term_id].append(doc_id)

# Save the inverted_index dictionary to a file
with open("inverted_index.pkl", "wb") as file:
    pickle.dump(inverted_index, file)

Next, we let the user enter a query. Clicking Search runs the first search engine, which retrieves all restaurants whose descriptions contain every term in the query.
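
The ```engine.find_restaurants``` function is not shown in this section. A minimal sketch of conjunctive retrieval, assuming the query has already been preprocessed into tokens, is to intersect the posting lists of the query terms. For brevity the toy index below is keyed by the term itself rather than by a ```term_id```; the names ```build_index```, ```conjunctive_search```, and ```toy_docs``` are hypothetical.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the sorted list of doc IDs that contain it."""
    index = defaultdict(list)
    for doc_id, tokens in docs.items():
        for token in set(tokens):
            index[token].append(doc_id)
    for postings in index.values():
        postings.sort()
    return index

def conjunctive_search(query_tokens, index):
    """Return the sorted doc IDs whose token lists contain every query token."""
    if not query_tokens:
        return []
    posting_sets = []
    for token in set(query_tokens):
        if token not in index:  # an unknown term means no document can match
            return []
        posting_sets.append(set(index[token]))
    # Start from the smallest posting set to keep intermediate results small
    posting_sets.sort(key=len)
    return sorted(set.intersection(*posting_sets))

# Toy, already-preprocessed documents (hypothetical data)
toy_docs = {
    0: ["modern", "cuisin", "local", "produc"],
    1: ["japanes", "cuisin", "terrac"],
    2: ["modern", "japanes", "cuisin"],
}
toy_index = build_index(toy_docs)
print(conjunctive_search(["modern", "cuisin"], toy_index))  # [0, 2]
```

Returning early on an unknown term avoids building any intersection at all, which is the common case for misspelled queries.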

In [None]:
import ipywidgets as widgets
from IPython.display import display
# re-load inverted index in case it was modified somewhere
with open('inverted_index.pkl', 'rb') as file:
    inverted_index = pickle.load(file)

# Text input field for query
text_input = widgets.Text(
    value='',
    placeholder='Type your query',
    description='Query:',
    disabled=False
)

# Search button
search_button = widgets.Button(
    description='Search',
    disabled=False,
    button_style='primary'
)

output = widgets.Output()

# Define a function to handle button press
def on_search_button_clicked(b):
    with output:
        output.clear_output()  # clear previous output if there are any
        query = text_input.value
        if query.strip():  # Check if there's an input
            display(engine.find_restaurants(query, vocabulary_df, inverted_index, df)) # display query results
        else:
            print("Please enter something to search for.")

# Link the function to the button
search_button.on_click(on_search_button_clicked)

# Display the widgets
display(text_input, search_button, output)


In [None]:
output

## 2.2 Ranked Search Engine with TF-IDF and Cosine Similarity

### 2.2.1 Inverted Index with TF-IDF Scores

We first compute the inverted index with TF-IDF scores using the custom function ```tf_idf```, then save the resulting ```updated_inverted_index``` to a pickle file. In this index, each term ID maps to a list of (document ID, TF-IDF score) pairs.

In [None]:
# Preliminary steps
n = len(preprocessed_docs)
updated_inverted_index = defaultdict(list) # initialize default dictionary to store the inverted_index values with TF-IDF scores
inverted_index_copy = inverted_index.copy() # Create a copy of the inverted_index to iterate over

# Compute updated_inverted_index
for term_id, docs in inverted_index_copy.items():
  tf_idf_scores = engine.tf_idf(int(term_id), inverted_index, preprocessed_docs, vocabulary_df, n)
  updated_inverted_index[term_id] = list(zip(docs, tf_idf_scores))

with open('updated_inverted_index.pkl', 'wb') as file:
    pickle.dump(updated_inverted_index, file)
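
The body of ```engine.tf_idf``` is not shown here. As a sketch under stated assumptions, one common weighting is tf = (term count) / (document length) and idf = log(n / df), where df is the number of documents containing the term. The function name ```tf_idf_sketch``` and the toy data are hypothetical, and the notebook's own helper may use a different variant.

```python
import math

def tf_idf_sketch(term, postings, docs, n):
    """Return one TF-IDF score per document in `postings`, in the same order.

    Assumes tf = term count / document length and idf = log(n / df);
    this is one common weighting, not necessarily the notebook's.
    """
    idf = math.log(n / len(postings))
    scores = []
    for doc_id in postings:
        tokens = docs[doc_id]
        tf = tokens.count(term) / len(tokens)
        scores.append(tf * idf)
    return scores

# Toy, already-preprocessed documents (hypothetical data)
toy_docs = {
    0: ["pasta", "pasta", "wine"],
    1: ["sushi", "wine"],
    2: ["pasta", "terrac"],
}
scores = tf_idf_sketch("pasta", [0, 2], toy_docs, n=len(toy_docs))
print(list(zip([0, 2], scores)))
```

Under this weighting, a term that occurs in every document has idf = log(1) = 0, so all of its scores are zero regardless of how often it appears.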

Next, we regroup the scores in ```updated_inverted_index``` by document: for each document we keep only the (term ID, TF-IDF) tuples whose score is non-zero, sorted by term ID, and save the result to a pickle file.

In [None]:
# Compute the TF-IDF vectors of all documents and store them in a pickle file
doc_tf_idf_scores = defaultdict(list) # initialize dictionary to store non-zero TF-IDF scores for each document

for term_id, docs_scores in updated_inverted_index.items():
  for doc_id, tf_idf_score in docs_scores:
    if tf_idf_score != 0:
      doc_tf_idf_scores[doc_id].append((term_id, tf_idf_score))

# Sort every document's vector by term_id once all terms have been processed
for scores in doc_tf_idf_scores.values():
  scores.sort(key=lambda x: x[0])

with open('doc_tf_idf_scores.pkl', 'wb') as file:
    pickle.dump(doc_tf_idf_scores, file)

Finally, we let the user enter a text query and return the top-k restaurants (here k = 10), ranked by the cosine similarity between the query and each description.
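
The ranking itself happens inside ```engine.top_k_restaurants```, which is not shown in this section. A minimal sketch of the idea, assuming the query has already been turned into a sparse ```{term_id: weight}``` vector, computes the cosine similarity against each document vector and keeps the k best with a heap. The names ```cosine_similarity```, ```top_k```, and the toy vectors are hypothetical.

```python
import heapq
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as {term_id: weight}."""
    dot = sum(w * vec_b[t] for t, w in vec_a.items() if t in vec_b)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vectors, k):
    """Return up to k (doc_id, score) pairs with the highest non-zero similarity."""
    scored = ((doc_id, cosine_similarity(query_vec, dict(vec)))
              for doc_id, vec in doc_vectors.items())
    positive = (pair for pair in scored if pair[1] > 0)
    return heapq.nlargest(k, positive, key=lambda pair: pair[1])

# Toy document vectors in the same (term_id, tf-idf) tuple layout as doc_tf_idf_scores
toy_vectors = {
    0: [(0, 0.5), (1, 0.2)],
    1: [(1, 0.7)],
    2: [(2, 0.9)],
}
query_vec = {0: 1.0, 1: 1.0}  # hypothetical query weights
print(top_k(query_vec, toy_vectors, k=2))
```

Using ```heapq.nlargest``` avoids sorting every scored document when only the k best are needed.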

In [None]:
# re-load inverted index in case it was modified somewhere
with open('inverted_index.pkl', 'rb') as file:
    inverted_index = pickle.load(file)

# Text input field for query
text_input = widgets.Text(
    value='',
    placeholder='Type your query',
    description='Query:',
    disabled=False
)

# Search button
search_button = widgets.Button(
    description='Search',
    disabled=False,
    button_style='primary'
)

output = widgets.Output()

# Define a function to handle button press
def on_search_button_clicked(b):
    with output:
        output.clear_output()  # clear previous output if there are any
        query = text_input.value
        if query.strip():  # Check if there's an input
            k = 10
            display(engine.top_k_restaurants(query, inverted_index, vocabulary_dict, doc_tf_idf_scores, df, n, k)) # display query results
        else:
            print("Please enter something to search for.")

# Link the function to the button
search_button.on_click(on_search_button_clicked)

# Display the widgets
display(text_input, search_button, output)


In [None]:
output