In [1]:
!pip install lxml



# [Homework 3](https://github.com/Sapienza-University-Rome/ADM/tree/master/2024/Homework_3) - Michelin restaurants in Italy
![iStock-654454404-777x518](https://a.storyblok.com/f/125576/2448x1220/327bb24d32/hero_update_michelin.jpg/m/1224x0/filters:format(webp))

## 1. Data collection

For the data collection, we wrote the required functions in a `data_collection.py` module. 

In [2]:
from data_collection import save_links, download_html_from_link_file, html_to_tsv

The following is the overview of the main functions for each step, together with the code to run. 

Every function has an optional `data_folder` argument, which sets the working data directory. 
We found this useful, for example, to use the date of the data collection as the directory name, since the Michelin list of restaurants is constantly updated. 

In [3]:
data_folder = 'DATA 24-11-09'
# date of last data collection, yy-mm-dd
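
A folder name of this form can also be derived from the collection date instead of being typed by hand. The sketch below assumes a hypothetical helper called `dated_folder`, which is not part of `data_collection.py`; it only reproduces the `DATA yy-mm-dd` pattern used above.

```python
from datetime import date


def dated_folder(day: date, prefix: str = "DATA") -> str:
    """Return a folder name such as 'DATA 24-11-09' (yy-mm-dd)."""
    return f"{prefix} {day.strftime('%y-%m-%d')}"


print(dated_folder(date(2024, 11, 9)))  # DATA 24-11-09
```

Passing `date.today()` would name the folder after the day on which the collection is run.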

---

### 1.1. Get the list of Michelin restaurants
   #### **Function**: `save_links`
   - **Description**: 
     Collects restaurant links from the Michelin Guide website starting from the provided `start_url`. The links are saved into a text file (`restaurant_links.txt`) within a specified data folder.
   - **Input**: 
     - `start_url`: URL of the Michelin Guide page to start scraping.
   - **Optional Input**: 
     - `file_name`: name of the output file; by default it is `restaurant_links.txt`.
     - `data_folder`: the folder where the data will be stored; by default it is `DATA`.
   - **Output**:
     - A text file containing restaurant links, one per line, saved in the `data_folder`.
   - **Key Features**:
     - Automatically detects the number of pages to scrape.
     - Skips scraping if the links file already exists.

In [4]:
start_url = "https://guide.michelin.com/en/it/restaurants"
save_links(start_url, data_folder = data_folder)

Links already collected.
There are 1982 link already collected


---

### 1.2. Crawl Michelin restaurant pages
   #### **Function**: `download_html_from_link_file`
   - **Description**: 
     Downloads the HTML from every URL in the input `file_name`, and saves them to a structured folder (`DATA/HTMLs/page_X`).
   - **Input (all optional)**:
     - `file_name`: name of the file with the links; by default it is `restaurant_links.txt`.
     - `data_folder`: the folder where the data will be stored; by default it is `DATA`.
   - **Output**:
     - Saves the HTML files in a structured folder `DATA/HTMLs/page_X`. 
   - **Key Features**:
     - Uses `ThreadPoolExecutor` to speed up the process
     - Skips existing HTML files
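
The two key features can be sketched as follows. This is a simplified illustration, not the code in `data_collection.py`: the name `download_all`, the flat `<index>.html` layout (instead of `page_X` subfolders), and the injected `fetch` callable are all assumptions made to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable


def download_all(urls: list[str], out_dir: Path,
                 fetch: Callable[[str], str], max_workers: int = 8) -> int:
    """Save fetch(url) as out_dir/<index>.html, skipping files that exist.

    Returns the number of pages that were actually fetched.
    """
    out_dir.mkdir(parents=True, exist_ok=True)

    def worker(item: tuple[int, str]) -> bool:
        index, url = item
        target = out_dir / f"{index}.html"
        if target.exists():  # already downloaded in a previous run
            return False
        target.write_text(fetch(url), encoding="utf-8")
        return True

    # Threads suit this task because the work is dominated by network waits.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(pool.map(worker, enumerate(urls)))
```

In practice `fetch` could be something like `lambda url: requests.get(url, timeout=10).text`, assuming the `requests` package is installed. Running the function a second time fetches nothing, because every target file already exists.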

In [5]:
download_html_from_link_file(data_folder = data_folder)

Download HTMLs: 100%|██████████| 1982/1982 [00:00<00:00, 7462.04it/s]

All html files have been saved.





---

### 1.3 Parse downloaded pages

#### **Function**: `extract_info_from_html`
- **Description**:  
  Parses a restaurant's HTML page and extracts structured information such as name, address, cuisine type, price range, description, and services.
- **Input**:
  - `html`: The raw HTML content of a restaurant's page.
- **Output**:
  - A dictionary containing extracted fields.
- **Key Features**:
  - Handles missing data gracefully.
  - Handles addresses separated by commas.
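
The comma handling matters because a street address may itself contain commas. The sketch below shows one way to deal with this, assuming (this is not confirmed by the module) that the raw string ends with the city, postal code, and country, in that order. The function name `split_address` is hypothetical.

```python
def split_address(raw: str) -> dict:
    """Split 'street[, more street], city, postal code, country'.

    Splitting from the right keeps any commas inside the street part intact.
    """
    parts = [part.strip() for part in raw.rsplit(",", 3)]
    if len(parts) != 4:  # unexpected format: keep the raw text, leave the rest empty
        return {"address": raw.strip(), "city": "", "postalCode": "", "country": ""}
    address, city, postal_code, country = parts
    return {"address": address, "city": city,
            "postalCode": postal_code, "country": country}


print(split_address("via Olmadella 11, località Balignano, Longiano, 47020, Italy"))
```

Returning empty strings for missing fields is one possible reading of "handles missing data gracefully"; the actual function may use a different placeholder.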


#### **Function**: `html_to_tsv`
- **Description**:  
  Scans the `HTMLs` folder inside the `data_folder` for all the html files, then processes every file with `extract_info_from_html`.
- **Input (optional)**:
  - `data_folder`: The folder where data will be stored; by default it is `DATA`.
  - `max_workers`: the max number of concurrent HTML parsing tasks. 
- **Output**:
  - Saves the TSV files in the folder `DATA/TSVs`.
- **Key Features**:
     - Uses `ThreadPoolExecutor` to speed up the process. 
- **Advice**:
     - Fine-tune the `max_workers` parameter according to your CPU performance. As a rule of thumb, set `max_workers` to the number of CPU cores available. An estimated processing time of around 5 minutes is typical. 

In [6]:
html_to_tsv(data_folder=data_folder, max_workers=4)

Processing HTMLs: 100%|██████████| 1982/1982 [00:00<00:00, 49094.72it/s]

All files have been processed and saved.





For completeness, let us create the dataframe for our dataset, in order to handle it effectively.

#### **Function**: `create_combined_dataframe`
- **Description**:  
  This function reads all the `.tsv` files from a specified folder, loads them into individual pandas DataFrames, and then combines them into a single DataFrame. It is useful for aggregating data from multiple sources into one unified dataset for further analysis.

- **Input**:
  - `folder_path` (str): The path to the folder containing the `.tsv` files to be read.
  - `separator` (str): The delimiter used in the `.tsv` files. Typically, it's a tab (`\t`), but it could be adjusted if needed.
  
- **Output**:
  - Returns a pandas DataFrame containing all the combined data from the `.tsv` files in the specified folder.

- **Key Features**:
  - Utilizes `glob` to find all `.tsv` files in the provided folder.
  - Loads each file as a DataFrame using pandas `read_csv()` with the specified delimiter.
  - Concatenates all DataFrames into one, ignoring index to prevent duplication.
  - Efficient handling of large datasets through pandas' built-in functions.
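
A minimal sketch of these steps is shown here. It is written from the description above rather than copied from `data_collection.py`, and the name `combine_tsv_files` is hypothetical. Sorting the paths and returning an empty DataFrame when no file is found are choices made for this example.

```python
import glob
import os

import pandas as pd


def combine_tsv_files(folder_path: str, separator: str = "\t") -> pd.DataFrame:
    """Read every .tsv file in folder_path and stack them into one DataFrame."""
    paths = sorted(glob.glob(os.path.join(folder_path, "*.tsv")))
    frames = [pd.read_csv(path, sep=separator) for path in paths]
    if not frames:  # pd.concat raises on an empty list
        return pd.DataFrame()
    # ignore_index=True renumbers the rows 0..n-1 across all files
    return pd.concat(frames, ignore_index=True)
```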

By running this function, you'll have a consolidated view of all the restaurant data in a single DataFrame, ready for any further analysis or processing. The first few rows of the dataset are provided below.

In [7]:
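import os
from data_collection import create_combined_dataframe
# os.path.join avoids the invalid "\T" escape sequence and works on any operating system
df = create_combined_dataframe(os.path.join(data_folder, "TSVs"), separator='\t')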
df.head()

Unnamed: 0,restaurantName,address,city,postalCode,country,priceRange,cuisineType,description,facilitiesServices,creditCards,phoneNumber,website
0,20Tre,via David Chiossone 20 r,Genoa,16123,Italy,€€,"Farm to table, Modern Cuisine",Situated in the heart of Genoa’s historic cent...,['Air conditioning'],"['amex', 'dinersclub', 'mastercard', 'visa']",+39 010 247 6191,https://www.ristorante20tregenova.it/
1,Il Ristorante Alain Ducasse Napoli,Via Cristoforo Colombo 45,Naples,80133,Italy,€€€€,"Creative, Mediterranean Cuisine","Alain Ducasse, one of the great names in conte...","['Air conditioning', 'Great view', 'Interestin...","['amex', 'dinersclub', 'discover', 'maestrocar...",+39 081 604 1580,https://theromeocollection.com/en/romeo-napoli...
2,Salvo Cacciatori,via Vieusseux 12,Oneglia,18100,Italy,€€€,"Ligurian, Contemporary",This restaurant has come a long way since 1906...,"['Air conditioning', 'Restaurant offering vege...","['amex', 'discover', 'jcb', 'maestrocard', 'ma...",+39 0183 293763,https://ristorantesalvocacciatori.it/
3,Terre Alte,"via Olmadella 11, località Balignano",Longiano,47020,Italy,€€€,Seafood,One of the best-known addresses in this region...,"['Air conditioning', 'Car park', 'Terrace']","['amex', 'mastercard', 'visa']",+39 0547 666138,https://ristoranteterrealte.com/
4,Tubladel,via Trebinger 22,Ortisei,39046,Italy,€€€,Regional Cuisine,Although this restaurant with wood-adorned din...,"['Car park', 'Terrace']","['mastercard', 'visa']",+39 0471 796879,https://www.tubladel.com/


---

## 2. Search Engine

In this section, we develop two search engines: a **Conjunctive Search Engine** and a **Ranked Search Engine**. Both let users retrieve restaurants by querying their descriptions.

### 2.0 Preprocessing the Text

First, we will clean and prepare the restaurant descriptions data using the `nltk` library. Let's start by installing and downloading the necessary library and packages.

In [8]:
!pip install --upgrade nltk



In [9]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\marta\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\marta\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Let us now add a new column to the DataFrame named `processedDescription`. This column will store the processed versions of the restaurant descriptions, refined by removing stopwords, cleaning punctuation, and applying stemming.

#### **Function**: `preprocess_text`
- **Description:**  
    This function preprocesses a list of restaurant descriptions to enhance their suitability for search and retrieval tasks. The function performs several preprocessing steps including tokenization, removal of common stopwords, punctuation cleaning, and word stemming. These operations help streamline search processes by reducing descriptions to their core, searchable components.

- **Input**
    - `text` (list-like of str, e.g. the `description` column): The restaurant descriptions to process.

- **Output**
    - `processed_text` (list of list of str): One list of tokens per input description. Each token is a cleaned, stemmed version of a word from the original text.

- **Key Features**
    - **Tokenization**: Divides each description into individual words or punctuation marks for further processing.
    - **Stopword Removal**: Filters out commonly used words (e.g., "the", "is", "and") that are less meaningful for search and classification.
    - **Punctuation Cleaning**: Removes non-alphanumeric characters to focus on the essential content.
    - **Stemming**: Reduces each word to its root form, facilitating matches across different morphological variants (e.g., "eating" and "eat").
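
The pipeline can be approximated with the standard library alone, as in the sketch below. It is not the `preprocess_text` implementation: the regular expression stands in for NLTK's tokenizer, `STOPWORDS` is a tiny illustrative list, and stemming is left as a pluggable `stem` callable that defaults to doing nothing.

```python
import re
from typing import Callable, Iterable

# Tiny illustrative stopword list; NLTK's English list is much longer.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "with"}


def preprocess(texts: Iterable[str],
               stem: Callable[[str], str] = lambda word: word) -> list[list[str]]:
    """Lowercase, tokenize, drop stopwords, then stem every description."""
    processed = []
    for text in texts:
        # Keeping only alphanumeric runs tokenizes and strips punctuation at once.
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        processed.append([stem(token) for token in tokens if token not in STOPWORDS])
    return processed


print(preprocess(["The chef is cooking with seasonal ingredients."]))
# [['chef', 'cooking', 'seasonal', 'ingredients']]
```

Passing `stem=nltk.stem.PorterStemmer().stem` would add real stemming, provided `nltk` is installed.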

In [10]:
# Import the preprocess_text function from the search_engine module
from search_engine import preprocess_text

# Apply the preprocess_text function to the 'description' column in the DataFrame
# and store the results in a new column named 'processedDescription'
df['processedDescription'] = preprocess_text(df['description'])

# Display the first few rows of the DataFrame to verify the new column
df.head()

Unnamed: 0,restaurantName,address,city,postalCode,country,priceRange,cuisineType,description,facilitiesServices,creditCards,phoneNumber,website,processedDescription
0,20Tre,via David Chiossone 20 r,Genoa,16123,Italy,€€,"Farm to table, Modern Cuisine",Situated in the heart of Genoa’s historic cent...,['Air conditioning'],"['amex', 'dinersclub', 'mastercard', 'visa']",+39 010 247 6191,https://www.ristorante20tregenova.it/,"[situat, heart, genoa, histor, centr, contempo..."
1,Il Ristorante Alain Ducasse Napoli,Via Cristoforo Colombo 45,Naples,80133,Italy,€€€€,"Creative, Mediterranean Cuisine","Alain Ducasse, one of the great names in conte...","['Air conditioning', 'Great view', 'Interestin...","['amex', 'dinersclub', 'discover', 'maestrocar...",+39 081 604 1580,https://theromeocollection.com/en/romeo-napoli...,"[alain, ducass, one, great, name, contemporari..."
2,Salvo Cacciatori,via Vieusseux 12,Oneglia,18100,Italy,€€€,"Ligurian, Contemporary",This restaurant has come a long way since 1906...,"['Air conditioning', 'Restaurant offering vege...","['amex', 'discover', 'jcb', 'maestrocard', 'ma...",+39 0183 293763,https://ristorantesalvocacciatori.it/,"[restaur, come, long, way, sinc, 1906, owner, ..."
3,Terre Alte,"via Olmadella 11, località Balignano",Longiano,47020,Italy,€€€,Seafood,One of the best-known addresses in this region...,"['Air conditioning', 'Car park', 'Terrace']","['amex', 'mastercard', 'visa']",+39 0547 666138,https://ristoranteterrealte.com/,"[one, best, known, address, region, fish, enth..."
4,Tubladel,via Trebinger 22,Ortisei,39046,Italy,€€€,Regional Cuisine,Although this restaurant with wood-adorned din...,"['Car park', 'Terrace']","['mastercard', 'visa']",+39 0471 796879,https://www.tubladel.com/,"[although, restaur, wood, adorn, dine, room, r..."


### 2.1 Conjunctive Query

Next, we will construct the **Conjunctive Search Engine**, which retrieves restaurants whose descriptions contain all specified query terms.

#### 2.1.1 Create Your Index!

### Building and Loading the Vocabulary

In this section, we create or load a file named `"vocabulary.csv"`, which maps each unique word to a corresponding integer identifier (`term_id`). We assign integer values from $0$ up to the total number of unique words in the processed descriptions minus one. 

To optimize computation and avoid redundant processing, if the file already exists, we simply load it. For efficient usage, the data is stored in a DataFrame called `vocabulary_df`.

#### **Function**: `get_vocabulary`

- **Description:**  
    This function either loads an existing `"vocabulary.csv"` file or creates a new one if it does not exist. The vocabulary file maps each unique word (or "term") found in the processed text descriptions to a unique integer ID, which can be used to reference the term efficiently.

- **Input**
    - `processed_texts` (list of list of str): A list where each sublist contains tokenized and processed words from a text.
    - `file_path` (str, default=`"vocabulary.csv"`): Path to the `.csv` file where the vocabulary will be stored or loaded from.

- **Output**
    - `vocabulary_df` (`pd.DataFrame`): A DataFrame that contains two columns: `term_id`, the unique integer ID assigned to each term, and `term`, the corresponding word.

- **Key Features**
    - **File Existence Check**: Checks if `"vocabulary.csv"` already exists to avoid redundant recomputation. If it does exist, the file is loaded; otherwise, a new vocabulary is created and saved.
    - **Unique Term Extraction**: Extracts all unique terms from the processed texts by flattening the list of lists and converting it to a set, ensuring only unique words are included.
    - **DataFrame Creation**: Each term is assigned a unique integer ID, and both the term and ID are stored in a DataFrame.
    - **CSV Storage**: Saves the vocabulary as `"vocabulary.csv"` to enable reuse in future computations.
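
The load-or-create pattern described above can be sketched like this. The example deviates from `get_vocabulary` in two labeled ways: it returns a plain `dict` instead of a DataFrame, and it sorts the terms so that IDs are reproducible between runs. The name `load_or_build_vocabulary` is hypothetical.

```python
import csv
import os


def load_or_build_vocabulary(processed_texts, file_path="vocabulary.csv"):
    """Return {term: term_id}, reading file_path if it already exists."""
    if os.path.exists(file_path):
        with open(file_path, newline="", encoding="utf-8") as f:
            return {row["term"]: int(row["term_id"]) for row in csv.DictReader(f)}

    # Flatten the list of lists into a set of unique terms, then number them.
    terms = sorted({term for doc in processed_texts for term in doc})
    vocabulary = {term: term_id for term_id, term in enumerate(terms)}

    with open(file_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["term_id", "term"])
        writer.writerows((term_id, term) for term, term_id in vocabulary.items())
    return vocabulary
```

One consequence of this caching is that the file must be deleted whenever the descriptions change, otherwise a stale vocabulary is loaded.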

In [11]:
# Import the get_vocabulary function from the search_engine module
from search_engine import get_vocabulary

# Generate or load the vocabulary DataFrame based on the 'processedDescription' column in df
descriptions_vocabulary_df = get_vocabulary(df['processedDescription'])

# Print the first few rows of the vocabulary DataFrame
print(descriptions_vocabulary_df.head().to_string(index=False))

Loading vocabulary.csv file.
 term_id      term
       0    laurin
       1    gorini
       2 imaginari
       3  ciccioli
       4  clientel


Let us generate the **Inverted Index** dictionary, which maps each `term_id` to a list of document IDs where that term appears. Similar to the vocabulary, the function is designed to either create the `inverted_index.json` file if the dictionary has not been created yet or load it if it already exists, thereby avoiding unnecessary recomputation.

#### **Function**: `get_inverted_index`

- **Description:**  
    This function either creates a new **Inverted Index** or loads an existing one from a `.json` file. The inverted index maps `term_id` values to lists of document IDs where each corresponding term appears. If the index does not exist, it is created by iterating over the processed texts and populating the index with term-document mappings. The function avoids recomputation by saving the index to a `.json` file for future use.

- **Input**
    - `processed_texts` (`list of list of str`): A list of processed document texts, where each text is a list of terms (strings).
    - `vocabulary_df` (`pandas.DataFrame`): A DataFrame containing `term` and `term_id` columns, mapping each term to a unique `term_id`.
    - `file_path` (`str`, default=`"inverted_index.json"`): Path to the `.json` file where the inverted index will be stored or loaded from.

- **Output**
    - `inverted_index` (`dict`): A dictionary where keys are term IDs and values are lists of document indices (IDs) that contain each term.

- **Key Features**
    - **File Existence Check**: The function checks if an inverted index already exists in a `.json` file. If it does, the file is loaded; otherwise, a new index is created.
    - **Term Mapping**: A mapping of terms to their corresponding `term_id` is generated for fast lookups when building the index.
    - **Efficient Indexing**: The inverted index is constructed by iterating through the processed texts and storing document IDs for each unique term in a document.
    - **JSON Storage**: The inverted index is saved in a `.json` file to avoid recomputation in future runs, ensuring computational efficiency.
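
The indexing step itself is short. The sketch below assumes the vocabulary is available as a plain `{term: term_id}` dictionary and leaves out the JSON caching; `build_inverted_index` is a hypothetical name.

```python
from collections import defaultdict


def build_inverted_index(processed_texts, term_to_id):
    """Map each term_id to the sorted list of documents containing the term."""
    index = defaultdict(list)
    for doc_id, tokens in enumerate(processed_texts):
        for term in set(tokens):  # set(): list a document once per term
            index[term_to_id[term]].append(doc_id)
    return dict(index)


docs = [["pasta", "wine", "pasta"], ["wine", "fish"]]
vocab = {"fish": 0, "pasta": 1, "wine": 2}
print(build_inverted_index(docs, vocab))
```

Because documents are visited in increasing order, every posting list comes out sorted. When such a dictionary is written with `json.dump`, the integer keys are stored as strings, so they need converting back with `int` after loading.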


In [12]:
# Import the get_inverted_index function from the search_engine module
from search_engine import get_inverted_index

# Generate the inverted index using the processed descriptions and the descriptions vocabulary DataFrame
inverted_index = get_inverted_index(df['processedDescription'], descriptions_vocabulary_df)

# Iterate through the first 10 terms in the inverted index and print their corresponding document IDs
for idx, (term, docs) in enumerate(inverted_index.items()):
    if idx < 10:
        print(f"Term ID: {term}, Document IDs: {docs}")
    else:
        break

Loading Inverted Index from theinverted_index.json file.
Term ID: 0, Document IDs: [944]
Term ID: 1, Document IDs: [589]
Term ID: 2, Document IDs: [1299]
Term ID: 3, Document IDs: [886]
Term ID: 4, Document IDs: [499, 628, 1458, 1566, 1575, 1806]
Term ID: 5, Document IDs: [899]
Term ID: 6, Document IDs: [80, 91, 416, 430, 885, 1128, 1436, 1622, 1684]
Term ID: 7, Document IDs: [34, 38, 43, 78, 80, 90, 293, 328, 388, 408, 433, 556, 626, 853, 897, 1110, 1157, 1233, 1256, 1319, 1381, 1596, 1662, 1740, 1946]
Term ID: 8, Document IDs: [64]
Term ID: 9, Document IDs: [1352, 1543, 1886]


#### 2.1.2 Execute the Query

Now, let us execute the query that will be handled by the Conjunctive Search Engine. This engine will process the query terms and return a list of restaurants whose descriptions contain all of the query words.

#### **Function**: `execute_conjunctive_query`
- **Description:**  
    This function processes a search query to find documents that contain all of the terms specified in the query. It uses the inverted index to identify the documents where each term appears and then returns only those documents that contain every term in the query.

- **Input:**
    - `query` (str): The search query, typically a string containing multiple terms.
    - `inverted_index` (dict): The inverted index, which maps `term_ids` to lists of document indices (IDs) where each term appears.
    - `vocabulary_df` (pd.DataFrame): A DataFrame mapping terms to their unique `term_ids`, allowing the function to look up terms' corresponding IDs.

- **Output:**
    - `intersection_result` (list of int): A list of document IDs that contain all of the terms specified in the search query.

- **Key Features:**
    - **Preprocessing the Query:**  The input query is first preprocessed to tokenize and clean the terms before searching.
    - **Mapping Terms to IDs:**  The terms from the query are matched with their corresponding `term_ids` in the vocabulary DataFrame.
    - **Retrieving Document IDs:**  For each term in the query, the function fetches the document IDs where the term appears using the inverted index.
    - **Intersection of Document Sets:**  The function performs an intersection of all the document sets to ensure that only documents containing all the terms in the query are returned.
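
The lookup and intersection can be sketched as below, assuming the query has already been preprocessed into a list of terms and the vocabulary is a `{term: term_id}` dictionary. The name `conjunctive_query` is hypothetical, and returning an empty list for unknown terms is a choice made for this example.

```python
def conjunctive_query(query_terms, inverted_index, term_to_id):
    """Return the sorted IDs of documents that contain every query term."""
    posting_sets = []
    for term in query_terms:
        term_id = term_to_id.get(term)
        if term_id is None:  # an unknown term cannot appear in any document
            return []
        posting_sets.append(set(inverted_index.get(term_id, [])))
    if not posting_sets:  # empty query
        return []
    posting_sets.sort(key=len)  # intersect starting from the smallest set
    return sorted(set.intersection(*posting_sets))


index = {0: [0, 2, 5], 1: [2, 5, 7], 2: [5]}
vocab = {"modern": 0, "season": 1, "cuisin": 2}
print(conjunctive_query(["modern", "season"], index, vocab))  # [2, 5]
```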

Let's execute the query `"modern seasonal cuisine"` and retrieve all restaurants whose processed descriptions contain all these words.

In [13]:
# Import the function for executing a conjunctive query based on the given terms
from search_engine import execute_conjunctive_query

# Execute the conjunctive query to get the document IDs that match all terms in the query
# The function returns a list of document IDs that contain all terms in the query ("modern seasonal cuisine")
documents_id = execute_conjunctive_query(
    "modern seasonal cuisine",  # The query string containing terms to be matched
    inverted_index,  # The inverted index (term ID -> list of IDs of the documents containing that term)
    descriptions_vocabulary_df  # The vocabulary DataFrame that maps terms to their IDs
)

# Retrieve the documents that match the query by using the sorted document IDs
# This creates a DataFrame with the restaurant details (name, address, description, website) for the matching documents
conjunctive_query_result_df = df[['restaurantName', 'address', 'description', 'website']].iloc[sorted(documents_id)]

# Rename the columns to make them more user-friendly for display purposes
conjunctive_query_result_df = conjunctive_query_result_df.rename(columns={
    'restaurantName': 'Restaurant Name',
    'address': 'Address',
    'description': 'Description',
    'website': 'Website'
})

# Print the resulting DataFrame containing the restaurant details that match the query
conjunctive_query_result_df

Unnamed: 0,Restaurant Name,Address,Description,Website
88,Ronchi Rò,località Cime di Dolegna 12,Ronchi Rò is an estate-cum-agriturismo surroun...,https://www.ronchiro.it
180,Razzo,via Andrea Doria 17/f,"A quiet restaurant with a relaxed, young and m...",https://vadoarazzo.it/
256,Flurin,Laubengasse 2,Flurin occupies an old medieval tower in Glore...,https://www.flurin.it
289,Quadri Bistrot,Via Solferino 48,"A modern bistro with a cocktail-bar, trendy de...",https://www.quadribistrot.it/
304,Materia | Spazio Cucina,via Teatro Massimo 29,The entrance to this restaurant is typical of ...,https://www.materiaspaziocucina.it/
456,Gallery Bistrot Contemporaneo,via Regina Margherita 3/b,"Modern, tasty and carefully curated cuisine, w...",
501,Ca' Del Moro,località Erbin 31,Situated within the La Collina dei Ciliegi win...,https://www.cadelmoro.wine/it
561,[àbitat],via Henry Dunant 1,"A young, enthusiastic and professional couple ...",https://www.abitatproject.it
565,Babette,via Michelangelo 17,Situated just beyond the centre of Albenga in ...,https://www.ristorantebabette.net/
586,Cappuccini Cucina San Francesco,via Cappuccini 54,"Housed in the resort of the same name, this el...",https://www.cappuccini.it/


### 2.2 Ranked Search Engine with TF-IDF and Cosine Similarity

Finally, we build the **Ranked Search Engine**, which returns the *top-k* restaurants sorted by their similarity to the query, using **TF-IDF** and **Cosine Similarity**.

#### 2.2.1 Inverted Index with TF-IDF Scores

Let us obtain the **Inverted Index with TF-IDF Scores** dictionary. 

Like in previous cases, the following function creates the file `tfIdf_inverted_index.json` if it doesn't exist. Otherwise, it loads the file, avoiding recomputation, and stores the data in a dictionary. This is a new inverted index where each entry corresponds to a term, and the value is a list of tuples containing document IDs and their associated TF-IDF scores.

#### **Function**: `get_tfIdf_inverted_index` 
- **Description**: \
    This function creates or loads an inverted index with **TF-IDF scores** for a collection of documents. It calculates the importance of each term in each document based on the **term frequency (TF)** and **inverse document frequency (IDF)**. If the TF-IDF inverted index already exists as a JSON file, it will be loaded; otherwise, the function will generate it and save it to a file for future use.

- **Input**:
    - `inverted_index` (dict): A dictionary where the keys are `term_ids`, and the values are lists of document IDs (indices) where each term appears.
    - `vocabulary_df` (pd.DataFrame): A DataFrame that maps terms to their unique `term_ids`. This is used to look up terms and their IDs.
    - `processed_texts` (list of list of str): A list of processed documents, where each document is represented as a list of terms (strings).
    - `file_path` (str, default=`"tfIdf_inverted_index.json"`): The file path where the TF-IDF inverted index will be saved or loaded. Default is `"tfIdf_inverted_index.json"`.

- **Output**:
    - `tfIdf_inverted_index` (dict): A dictionary where the keys are `term_ids`, and the values are lists of tuples. Each tuple contains:
        - `doc_id` (int): The document ID (index in the `processed_texts` list).
        - `tf-idf score` (float): The TF-IDF score for the term in the corresponding document.

- **Key Features**:
    - **Check for Existing File**: The function checks whether the TF-IDF inverted index already exists as a JSON file. If it does, it loads the index from the file to avoid recomputation.
    - **Calculate TF-IDF Scores**: If the file does not exist, the function computes the TF-IDF scores for each term in the vocabulary, using the `inverted_index` and `processed_texts` as input.
    - **Save the Index**: Once the TF-IDF scores are calculated, the index is saved to a JSON file for future use, ensuring efficient reuse of precomputed data.


#### **Function**: `get_tfIdf` 
- **Description**:
    This function calculates the **TF-IDF (Term Frequency-Inverse Document Frequency)** score for a specific term in a document, providing a measure of the term's importance within that document relative to the entire corpus. It combines two components:
    - **Term Frequency (TF)**: The frequency of the term within the document.
    - **Inverse Document Frequency (IDF)**: A measure of the rarity of the term across the entire corpus.

    The function is used in `get_tfIdf_inverted_index` to compute the TF-IDF scores for terms in documents and build the **TF-IDF inverted index**.

- **Input**:
    - `term` (str): The term for which the TF-IDF score is being calculated.
    - `document` (list of str): A list of terms representing the document being analyzed.
    - `corpus` (list of list of str): A collection of documents, where each document is represented as a list of terms.

- **Output**:
    - **TF-IDF score** (float): The TF-IDF score for the given term in the specified document. It is the product of the **TF** and **IDF** scores.

- **Key Features**:
    - **Term Frequency (TF)**: The function calculates the term frequency as the ratio of occurrences of the term in the document to the total number of terms in the document.
    - **Inverse Document Frequency (IDF)**: The function computes the IDF as the logarithm of the total number of documents divided by the number of documents that contain the term. A term that appears in every document therefore gets an IDF of 0 and does not contribute to the score.
    - **Logarithmic Scaling**: The logarithm compresses the ratio, so a term found in a single document is weighted more than a common one without dominating the score.
    - **Combination of TF and IDF**: The TF and IDF values are multiplied together to produce the TF-IDF score, which reflects the relative importance of the term in the document within the context of the entire corpus.
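
In symbols, with $f_{t,d}$ the number of occurrences of term $t$ in document $d$, $|d|$ the number of terms in $d$, $N$ the number of documents, and $n_t$ the number of documents containing $t$, the description above corresponds to

$$\text{tf-idf}(t, d) = \frac{f_{t,d}}{|d|} \cdot \log\frac{N}{n_t}.$$

A minimal sketch of this formula follows. The natural logarithm is an assumption (the base is not stated above), as is returning 0 for empty documents and for terms absent from the corpus; `tf_idf` is a hypothetical name.

```python
import math


def tf_idf(term, document, corpus):
    """TF-IDF of `term` in `document`, with IDF computed over `corpus`."""
    if not document:
        return 0.0
    tf = document.count(term) / len(document)
    docs_with_term = sum(1 for doc in corpus if term in doc)
    if docs_with_term == 0:  # avoid dividing by zero for unseen terms
        return 0.0
    idf = math.log(len(corpus) / docs_with_term)  # natural log: an assumption
    return tf * idf


corpus = [["pizza", "pasta", "pizza"], ["pasta", "wine"]]
print(tf_idf("pizza", corpus[0], corpus))  # (2/3) * ln(2)
print(tf_idf("pasta", corpus[0], corpus))  # 0.0, "pasta" is in every document
```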

In [14]:
# Importing the get_tfIdf_inverted_index function from the search_engine module
from search_engine import get_tfIdf_inverted_index

# Create the TF-IDF inverted index dictionary using the inverted index, vocabulary, and processed descriptions
tfIdf_inverted_index = get_tfIdf_inverted_index(inverted_index, descriptions_vocabulary_df, df['processedDescription'])

# Print the first 10 key-value pairs from the TF-IDF inverted index dictionary
for idx, (term, docs) in enumerate(tfIdf_inverted_index.items()):
    if idx < 10:
        print(f"Term ID: {term}, List of (Doc ID, tf-idf score) pairs: {docs}")
    else:
        break

Loading Inverted Index with TF-IDF scores from the tfIdf_inverted_index.json file.
Term ID: 0, List of (Doc ID, tf-idf score) pairs: [(944, 0.05992147308970551)]
Term ID: 1, List of (Doc ID, tf-idf score) pairs: [(589, 0.053501315258665624)]
Term ID: 2, List of (Doc ID, tf-idf score) pairs: [(1299, 0.04161213409007326)]
Term ID: 3, List of (Doc ID, tf-idf score) pairs: [(886, 0.03841120069852917)]
Term ID: 4, List of (Doc ID, tf-idf score) pairs: [(499, 0.08455191759086206), (628, 0.038920723970396816), (1458, 0.07909695516564515), (1566, 0.06130014025337499), (1575, 0.08757162893339283), (1806, 0.051083450211145826)]
Term ID: 5, List of (Doc ID, tf-idf score) pairs: [(899, 0.05447406644518682)]
Term ID: 6, List of (Doc ID, tf-idf score) pairs: [(80, 0.026710507559875075), (91, 0.027024748825285373), (416, 0.03329135724853995), (430, 0.049937035872809926), (885, 0.05220690113975583), (1128, 0.04030006403770625), (1436, 0.02443727287392826), (1622, 0.06380843472636824), (1684, 0.1093858

#### 2.2.2 Execute the Ranked Query

In this step, we will process the query entered by the user and utilize a **Ranked Search Engine** to return the most relevant documents (restaurants) based on their similarity to the query. The search engine will return the top **k** matching restaurants (documents) with the highest similarity scores. If fewer than **k** restaurants have a non-zero similarity score, it will return all the restaurants that have any similarity to the query.

#### **Function**: `execute_ranked_query`

- **Description**:
    This function performs a ranked search by evaluating how closely the documents in a collection match the query based on the **TF-IDF** vectors. It uses only the terms from the query that are present in the **vocabulary**, and the **cosine similarity** is computed to rank the documents.

- **Input**:
    - `query_terms` (str): A space-separated string of terms that represent the search query.
    - `tfIdf_inverted_index` (dict): A dictionary mapping term IDs to lists of tuples (document ID, TF-IDF score), representing the inverted index for the documents.
    - `vocabulary_df` (DataFrame): A DataFrame containing terms and their corresponding term IDs.
    - `processed_texts` (list of list of str): A list of processed documents, where each document is represented as a list of terms.
    - `top_k` (int): The number of top-ranked documents to return based on similarity.

- **Output**:
    - `ranked_results` (list): A list of tuples, where each tuple contains a document ID and its cosine similarity score with respect to the query.
    - `not_found` (str): A message listing the query terms that were not found in the vocabulary.

- **Key Features:**
    - **Preprocessing the Query**: The input query, a string of space-separated terms, is tokenized and cleaned using the `preprocess_text` function to ensure that the query terms are in a standard form for comparison.
    - **Handling Missing Terms**: Any query terms that are not present in the vocabulary are filtered out, and a message is created to indicate which terms could not be matched.

    - **Mapping Query Terms to Term IDs**: The remaining valid query terms are mapped to their corresponding **term IDs** using the **vocabulary** DataFrame. These term IDs will be used to build the **query vector**.
    
    - **Building the Query Vector**: A **query vector** is initialized with zeros, and for each term in the query, the **TF-IDF** score is calculated using the `get_tfIdf` function. These scores populate the query vector at the positions corresponding to the term IDs.
    
    - **Building Document Vectors**: Document vectors are also initialized with zero values. These vectors are populated using the **TF-IDF scores** from the **TF-IDF inverted index**, which stores the scores for each term in each document.

    - **Cosine Similarity Calculation**: For each document, the cosine similarity between its vector and the query vector is computed using the `get_cosine_similarity` function. Cosine similarity measures the angle between the two vectors, providing a score that indicates how similar the document is to the query. A value closer to 1 indicates higher similarity.

    - **Top-K Ranking**: The documents are sorted in descending order of similarity, so the most similar appear first. If more than `top_k` documents match, only the `top_k` highest-ranked are returned; otherwise all matching documents are returned.
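
The ranking logic can be illustrated with a compact sketch. It differs from the dense vectors described above: it works directly on the sparse TF-IDF inverted index and takes precomputed query weights as a `{term_id: weight}` dictionary. The name `ranked_query` and this input format are assumptions made for the example.

```python
import heapq
import math
from collections import defaultdict


def ranked_query(query_weights, tfidf_index, top_k=5):
    """Return up to top_k (doc_id, cosine similarity) pairs, best first.

    query_weights: {term_id: weight}
    tfidf_index:   {term_id: [(doc_id, tf-idf score), ...]}
    """
    # Squared norm of every document vector, accumulated over all terms.
    doc_norm_sq = defaultdict(float)
    for postings in tfidf_index.values():
        for doc_id, score in postings:
            doc_norm_sq[doc_id] += score * score

    # Only terms that occur in the query contribute to the dot product.
    dot = defaultdict(float)
    for term_id, weight in query_weights.items():
        for doc_id, score in tfidf_index.get(term_id, []):
            dot[doc_id] += weight * score

    query_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    scores = []
    for doc_id, value in dot.items():
        denominator = query_norm * math.sqrt(doc_norm_sq[doc_id])
        if denominator > 0 and value > 0:  # keep non-zero similarities only
            scores.append((doc_id, value / denominator))
    return heapq.nlargest(top_k, scores, key=lambda pair: pair[1])
```

Documents that share no term with the query never enter `dot`, so they are left out of the result, which matches the rule of returning fewer than `top_k` results when few documents match.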

### **Function**: `get_cosine_similarity`

This helper is used by `execute_ranked_query` to calculate the **cosine similarity** between a document vector and a query vector. This similarity metric is widely used in **information retrieval** and **text mining** to compare the direction of two vectors, regardless of their magnitude.

#### **Description**
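Cosine similarity is the cosine of the angle between two vectors. A cosine similarity score of:
- **1** means the vectors point in the same direction (maximum similarity).
- **0** means the vectors are orthogonal (no similarity).
- **-1** means the vectors point in opposite directions.

Since TF-IDF scores are normally non-negative, the scores computed for this search engine fall between 0 and 1.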

This function specifically handles cases where one or both vectors might be zero vectors, returning 0 if the cosine similarity cannot be calculated.
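
For reference, the score can be written as follows, where $\mathbf{d}$ and $\mathbf{q}$ denote the document and query vectors (this notation is introduced here for illustration and does not appear in the code):

$$
\cos(\theta) = \frac{\mathbf{d} \cdot \mathbf{q}}{\lVert \mathbf{d} \rVert_2 \, \lVert \mathbf{q} \rVert_2}
= \frac{\sum_i d_i q_i}{\sqrt{\sum_i d_i^2} \, \sqrt{\sum_i q_i^2}}
$$

As a small worked example, for $\mathbf{d} = (1, 0)$ and $\mathbf{q} = (1, 1)$ the dot product is $1$, the norms are $1$ and $\sqrt{2}$, and the score is $1/\sqrt{2} \approx 0.707$.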

- **Input**:
    - `doc_vector` (numpy array): A vector representing the document, containing **TF-IDF scores** for each term.
    - `query_vector` (numpy array): A vector representing the query, with **TF-IDF scores** assigned for each term.

- **Output**:
    - `float`: The cosine similarity score between the document and query vectors. Returns 0 if the denominator (product of norms) is zero.

- **Key Features**:
    - **Dot Product Calculation**: The **dot product** of the document and query vectors is computed. This represents the overlap or alignment between the two vectors.

    - **L2 Norm Calculation**: The **L2 norm** (magnitude) of both the document vector and the query vector is calculated individually. The L2 norm provides a measure of the vector's length.

    - **Cosine Similarity Calculation**: The cosine similarity score is derived by dividing the dot product by the product of the norms of the two vectors. This step effectively normalizes the dot product, yielding a similarity score between -1 and 1.

    - **Handling Zero Vectors**: If either vector is a **zero vector** (i.e., the L2 norm is zero), the function returns 0, as the cosine similarity is undefined in this case.
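
A minimal sketch of such a helper is shown below. It follows the four steps listed above but is not necessarily identical to the implementation in `search_engine.py`; the name `cosine_similarity_sketch` is hypothetical.

```python
import numpy as np

def cosine_similarity_sketch(doc_vector, query_vector):
    """Cosine similarity of two 1-D arrays; returns 0.0 for zero vectors."""
    # Dot product: overlap between the two vectors
    dot = np.dot(doc_vector, query_vector)
    # Product of the L2 norms (vector lengths)
    denom = np.linalg.norm(doc_vector) * np.linalg.norm(query_vector)
    # The similarity is undefined for a zero vector, so fall back to 0.0
    if denom == 0:
        return 0.0
    return float(dot / denom)

# Parallel vectors score (approximately) 1, orthogonal vectors score 0
print(cosine_similarity_sketch(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])))
print(cosine_similarity_sketch(np.array([1.0, 0.0]), np.array([0.0, 1.0])))
```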

Let's execute the query `"cuisine modern seasonal"` and retrieve the *top-10* results ranked by relevance. This shows the best-matching documents based on the cosine similarity between the query vector and the document vectors. Any query terms that are not found in the vocabulary are reported in the output.

In [15]:
# Import the function for executing a ranked query based on TF-IDF scores
from search_engine import execute_ranked_query

# Execute the ranked query with the search terms "cuisine modern seasonal"
# The function returns the ranked query result (documents and their similarity scores) and any terms not found in the vocabulary
ranked_query_result, terms_not_found = execute_ranked_query(
    "cuisine modern seasonal",  # The query string
    tfIdf_inverted_index,  # The TF-IDF inverted index (term -> document ID, TF-IDF score pairs)
    descriptions_vocabulary_df,  # The vocabulary DataFrame (terms and their IDs)
    df['processedDescription'],  # The processed descriptions of the documents
    10  # Number of top results to return
)

# Extract the document IDs from the ranked query result
ranked_query_result_ids = [doc_id for (doc_id, _) in ranked_query_result]

# Extract the cosine similarity scores from the ranked query result
ranked_query_result_scores = [tfIdf_score for (_, tfIdf_score) in ranked_query_result]

# Retrieve the restaurant information (name, address, description, and website) based on the document IDs
# (.copy() makes the selection an independent DataFrame before a column is added to it)
ranked_query_result_df = df[['restaurantName', 'address', 'description', 'website']].iloc[ranked_query_result_ids].copy()

# Add the cosine similarity scores as a new column in the DataFrame
ranked_query_result_df['tfIdf_score'] = ranked_query_result_scores

# Rename the columns to make them more user-friendly for display purposes
ranked_query_result_df = ranked_query_result_df.rename(columns={
    'restaurantName': 'Restaurant Name',
    'address': 'Address',
    'description': 'Description',
    'website': 'Website',
    'tfIdf_score': 'Similarity Score'
})

# Print any terms from the query that were not found in the vocabulary
print(terms_not_found)

# Display the DataFrame with the top-ranked results, including the restaurant details and similarity scores
ranked_query_result_df





Unnamed: 0,Restaurant Name,Address,Description,Website,Similarity Score
878,Saur,via Filippo Turati 8,"In a tiny rural village, this contemporary, al...",https://ristorantesaur.it,0.251278
726,La Botte,via Giuseppe Garibaldi 8,A modern and welcoming contemporary bistro sit...,http://www.trattorialabottestresa.it,0.235697
180,Razzo,via Andrea Doria 17/f,"A quiet restaurant with a relaxed, young and m...",https://vadoarazzo.it/,0.226468
1979,Piccolo Lord,corso San Maurizio 69 bis/g,"Professional service in a welcoming, modern re...",https://www.ristorantepiccololord.it/,0.217604
1444,La Valle,"via Umberto I 25, località Valle Sauglio",A well - run restaurant in a quiet area just o...,https://www.ristorantelavalle.it/,0.205892
6,Al Vecchio Convento,viale Borri 348,"Ask for a table in the main dining room, with ...",https://www.alvecchioconvento.it/,0.189635
798,RistoFante,via Mazzini 41,The motto of this restaurant is “In step with ...,https://www.ristofante.it/,0.182989
1878,Aprudia,largo del Forno 16,"At this restaurant in the historic centre, whe...",http://www.aprudia.com,0.174753
520,Barbieri,via Italo Barbieri,Enjoy your meal in the classic - style dining ...,https://www.hotelbarbieri.it,0.17451
0,20Tre,via David Chiossone 20 r,Situated in the heart of Genoa’s historic cent...,https://www.ristorante20tregenova.it/,0.171513


Here, we demonstrate an example where some terms in the query are not present in the vocabulary. When this occurs, the function will return ranked results based solely on the terms that do match the vocabulary, while also notifying the user of the terms that were not found. This approach allows us to retrieve relevant results even if certain query terms are missing from the vocabulary.

In this example, the query is `"modern albero cuisine seasonal treno"`. Note that the results will be identical to the previous example, as only the terms `"modern"`, `"cuisine"`, and `"seasonal"` were found and used for ranking, while `"albero"` and `"treno"` were absent from the vocabulary.
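
The filtering step can be sketched as follows. This is an illustration under assumptions, not the code in `search_engine.py`: the function name `split_query_terms` is hypothetical, the vocabulary is represented as a plain set of terms, and the message is assumed to be an empty string when every term is found. The sketch lists missing terms in query order, which may differ from the order the real function prints.

```python
def split_query_terms(query_terms, vocabulary_terms):
    """Separate query terms into known terms and a 'not found' message."""
    # Terms that can be used for ranking
    known = [t for t in query_terms if t in vocabulary_terms]
    # Terms that are absent from the vocabulary
    missing = [t for t in query_terms if t not in vocabulary_terms]
    # Build the message only when something is missing
    message = ""
    if missing:
        message = "No matches found for these terms: " + ", ".join(missing)
    return known, message

known, message = split_query_terms(
    "modern albero cuisine seasonal treno".split(),
    {"modern", "cuisine", "seasonal"},  # toy vocabulary
)
print(known)
print(message)
```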

In [16]:
ranked_query2_result, terms2_not_found = execute_ranked_query(
    "modern albero cuisine seasonal treno", tfIdf_inverted_index, descriptions_vocabulary_df, df['processedDescription'], 10
)

# Extract the document IDs from the ranked query result
ranked_query2_result_ids = [doc_id for (doc_id, _) in ranked_query2_result]

# Extract the cosine similarity scores from the ranked query result
ranked_query2_result_scores = [tfIdf_score for (_, tfIdf_score) in ranked_query2_result]

# Retrieve the restaurant information (name, address, description, and website) based on the document IDs
# (.copy() makes the selection an independent DataFrame before a column is added to it)
ranked_query2_result_df = df[['restaurantName', 'address', 'description', 'website']].iloc[ranked_query2_result_ids].copy()

# Add the cosine similarity scores as a new column in the DataFrame
ranked_query2_result_df['tfIdf_score'] = ranked_query2_result_scores

# Rename the columns to make them more user-friendly for display purposes
ranked_query2_result_df = ranked_query2_result_df.rename(columns={
    'restaurantName': 'Restaurant Name',
    'address': 'Address',
    'description': 'Description',
    'website': 'Website',
    'tfIdf_score': 'Similarity Score'
})

# Print any terms from the query that were not found in the vocabulary
print(terms2_not_found)

# Display the DataFrame with the top-ranked results, including the restaurant details and similarity scores
ranked_query2_result_df


No matches found for these terms: treno, albero


Unnamed: 0,Restaurant Name,Address,Description,Website,Similarity Score
878,Saur,via Filippo Turati 8,"In a tiny rural village, this contemporary, al...",https://ristorantesaur.it,0.251278
726,La Botte,via Giuseppe Garibaldi 8,A modern and welcoming contemporary bistro sit...,http://www.trattorialabottestresa.it,0.235697
180,Razzo,via Andrea Doria 17/f,"A quiet restaurant with a relaxed, young and m...",https://vadoarazzo.it/,0.226468
1979,Piccolo Lord,corso San Maurizio 69 bis/g,"Professional service in a welcoming, modern re...",https://www.ristorantepiccololord.it/,0.217604
1444,La Valle,"via Umberto I 25, località Valle Sauglio",A well - run restaurant in a quiet area just o...,https://www.ristorantelavalle.it/,0.205892
6,Al Vecchio Convento,viale Borri 348,"Ask for a table in the main dining room, with ...",https://www.alvecchioconvento.it/,0.189635
798,RistoFante,via Mazzini 41,The motto of this restaurant is “In step with ...,https://www.ristofante.it/,0.182989
1878,Aprudia,largo del Forno 16,"At this restaurant in the historic centre, whe...",http://www.aprudia.com,0.174753
520,Barbieri,via Italo Barbieri,Enjoy your meal in the classic - style dining ...,https://www.hotelbarbieri.it,0.17451
0,20Tre,via David Chiossone 20 r,Situated in the heart of Genoa’s historic cent...,https://www.ristorante20tregenova.it/,0.171513
