In [1]:
!pip install lxml



# [Homework 3](https://github.com/Sapienza-University-Rome/ADM/tree/master/2024/Homework_3) - Michelin restaurants in Italy
![iStock-654454404-777x518](https://a.storyblok.com/f/125576/2448x1220/327bb24d32/hero_update_michelin.jpg/m/1224x0/filters:format(webp))

## 1. Data collection

For the data collection, we wrote the required function in a `data_collection.py` module. 

In [2]:
from data_collection import save_links, download_html_from_link_file, html_to_tsv

The following is the overview of the main functions for each step, together with the code to run. 

Every function has an optional `data_folder` argument wich server the purpose to set the working data directory. 
We tought this to be useful, for example to set the date of the data collection as the directory name. 
This is useful, as the Michelin list of restaurant is constantly updated. 

In [3]:
data_folder = 'DATA 24-11-09'
# date of last data collection, yy-mm-dd

---

### 1.1. Get the list of Michelin restaurants
   #### **Function**: `save_links`
   - **Description**: 
     Collects restaurant links from the Michelin Guide website starting from the provided `start_url`. The links are saved into a text file (`restaurant_links.txt`) within a specified data folder.
   - **Input**: 
     - `start_url`: URL of the Michelin Guide page to start scraping.
   - **Optional Input**: 
     - `file_name`: name of the output file; by default it is `restaurant_links.txt`.
     - `data_folder`: the folder where datas will be stored; by default it is `DATA`.
   - **Output**:
     - A text file containing restaurant links, one per line, saved in the `data_folder`.
   - **Key Features**:
     - Automatically detects the number of pages to scrape.
     - Skips scraping if the links file already exists.

In [4]:
start_url = "https://guide.michelin.com/en/it/restaurants"
save_links(start_url, data_folder = data_folder)

Links already collected.
There are 1981 links already collected


---

### 1.2. Crawl Michelin restaurant pages
   #### **Function**: `download_html_from_link_file`
   - **Description**: 
     Downloads the HTML from every URL in the input `file_name`, and saves them to a structured folder (`DATA/HTMLs/page_X`).
   - **Input (all optional)**:
     - `file_name`: name of the file with the links; by default it is `restaurant_links.txt`.
     - `data_folder`: the folder where datas will be stored; by default it is `DATA`.
   - **Output**:
     - Saves the HTML files in a structured folder `DATA/HTMLs/page_X`. 
   - **Key Features**:
     - Uses `ThreadPoolExecutor` to speed up the process
     - Skips existing HTML files

In [5]:
download_html_from_link_file(data_folder = data_folder)

Download HTMLs: 100%|██████████| 1981/1981 [00:00<00:00, 2801.04it/s]

All html files have been saved.





---

### 1.3 Parse downloaded pages

#### **Function**: `extract_info_from_html`
- **Description**:  
  Parses a restaurant's HTML page and extracts structured information such as name, address, cuisine type, price range, description, and services.
- **Input**:
  - `html`: The raw HTML content of a restaurant's page.
- **Output**:
  - A dictionary containing extracted fields.
- **Key Features**:
  - Handles missing data gracefully.
  - Handles addresses separated by commas.


#### **Function**: `html_to_tsv`
- **Description**:  
  Scans the `HTMLs` folder inside the `data_folder` for all the html files, then processes every file with `extract_info_from_html`.
- **Input (optional)**:
  - `data_folder`: The folder where data will be stored; by default it is `DATA`.
  - `max_workers`: the max number of concurrent HTML parsing tasks. 
- **Output**:
  - Saves the TSV files in the folder `DATA/TSVs`.
- **Key Features**:
     - Uses `ThreadPoolExecutor` to speed up the process. 
- **Advice**:
     - Fine-tune the `max_workers` parameter according to your CPU performance. As a rule of thumb, set `max_workers` to the number of CPU cores available. An estimated processing time of around 5 minutes is typical. 

In [6]:
html_to_tsv(data_folder=data_folder, max_workers=4)

Processing HTMLs: 100%|██████████| 1981/1981 [00:00<00:00, 9287.06it/s]

All files have been processed and saved.





For completeness, let us create the dataframe for our dataset, in order to handle it effectively.

#### **Function**: `create_combined_dataframe`
- **Description**:  
  This function reads all the `.tsv` files from a specified folder, loads them into individual pandas DataFrames, and then combines them into a single DataFrame. It is useful for aggregating data from multiple sources into one unified dataset for further analysis.

- **Input**:
  - `folder_path` (str): The path to the folder containing the `.tsv` files to be read.
  - `separator` (str): The delimiter used in the `.tsv` files. Typically, it's a tab (`\t`), but it could be adjusted if needed.
  
- **Output**:
  - Returns a pandas DataFrame containing all the combined data from the `.tsv` files in the specified folder.

- **Key Features**:
  - Utilizes `glob` to find all `.tsv` files in the provided folder.
  - Loads each file as a DataFrame using pandas `read_csv()` with the specified delimiter.
  - Concatenates all DataFrames into one, ignoring index to prevent duplication.
  - Efficient handling of large datasets through pandas' built-in functions.

By running this function, you'll have a consolidated view of all the restaurant data in a single DataFrame, ready for any further analysis or processing. The first few rows of the dataset are provided below.

In [7]:
from data_collection import create_combined_dataframe
df = create_combined_dataframe(data_folder+"\TSVs", separator='\t')
df.head()

  df = create_combined_dataframe(data_folder+"\TSVs", separator='\t')


Unnamed: 0,restaurantName,address,city,postalCode,region,country,latitude,longitude,priceRange,cuisineType,description,facilitiesServices,creditCards,phoneNumber,website
0,20Tre,via David Chiossone 20 r,Genoa,16123,Liguria,Italy,44.40878,8.933115,€€,"Farm to table, Modern Cuisine","Run by three partners, this contemporary-style...",['Air conditioning'],"['amex', 'dinersclub', 'mastercard', 'visa']",+39 010 247 6191,https://www.ristorante20tregenova.it/
1,Il Ristorante Alain Ducasse Napoli,Via Cristoforo Colombo 45,Naples,80133,Campania,Italy,40.840277,14.255592,€€€€,"Creative, Mediterranean Cuisine","Alain Ducasse, one of the great names in conte...","['Air conditioning', 'Great view', 'Interestin...","['amex', 'dinersclub', 'discover', 'maestrocar...",+39 081 604 1580,https://theromeocollection.com/en/romeo-napoli...
2,Tre Olivi,via Poseidonia 41,Paestum,84047,Campania,Italy,40.42511,14.98659,€€€€,"Creative, Campanian","Oliver Glowig, German by birth but Italian by ...","['Air conditioning', 'Car park', 'Garden or pa...","['amex', 'mastercard', 'visa']",+39 0828 720023,http://www.treolivi.com
3,Tubladel,via Trebinger 22,Ortisei,39046,Trentino-South Tyrol,Italy,46.570627,11.678971,€€€,Regional Cuisine,Although this restaurant with wood-adorned din...,"['Car park', 'Terrace']","['amex', 'maestrocard', 'mastercard', 'visa']",+39 0471 796879,https://www.tubladel.com/
4,Villa Fiordaliso,corso Zanardelli 150,Gardone Riviera,25083,Lombardy,Italy,45.622208,10.56996,€€€€,Italian Contemporary,Villa Fiordaliso is one of the beautiful early...,"['Car park', 'Garden or park', 'Great view', '...","['amex', 'mastercard', 'visa']",+39 0365 20158,https://www.villafiordaliso.it


---

## 2. Search Engine

In this section, we developed two types of search engines: a **Conjunctive Search Engine** and a **Ranked Search Engine**. These engines enable users to retrieve restaurant information based on their queries about descriptions.

### 2.0 Preprocessing the Text

First, we will clean and prepare the restaurant descriptions data using the `nltk` library. Let's start by installing and downloading the necessary library and packages.

In [None]:
!pip install --upgrade nltk
!pip install --upgrade certifi



In [None]:
import nltk
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

Let us now add a new column to the DataFrame named `processedDescription`. This column will store the processed versions of the restaurant descriptions, refined by removing stopwords, cleaning punctuation, and applying stemming.

#### **Function**: `preprocess_text`
- **Description:**  
    This function preprocesses a list of restaurant descriptions to enhance their suitability for search and retrieval tasks. The function performs several preprocessing steps including tokenization, removal of common stopwords, punctuation cleaning, and word stemming. These operations help streamline search processes by reducing descriptions to their core, searchable components.

- **Input**
    - `text` (str): A string representing the text input.

- **Output**
    - `processed_text` (list of list of str): A list in which each element is a list of processed tokens corresponding to each word in the input text. Each token is a cleaned, stemmed version of the original words in the provided text.

- **Key Features**
    - **Tokenization**: Divides each description into individual words or punctuation marks for further processing.
    - **Stopword Removal**: Filters out commonly used words (e.g., "the", "is", "and") that are less meaningful for search and classification.
    - **Punctuation Cleaning**: Removes non-alphanumeric characters to focus on the essential content.
    - **Stemming**: Reduces each word to its root form, facilitating matches across different morphological variants (e.g., "eating" and "eat").

In [None]:
# Import the preprocess_text function from the search_engine module
from search_engine import preprocess_text

# Apply the preprocess_text function to the 'description' column in the DataFrame
# and store the results in a new column named 'processedDescription'
df['processedDescription'] = preprocess_text(df['description'])

# Display the first few rows of the DataFrame to verify the new column
df.head()

Unnamed: 0,restaurantName,address,city,postalCode,country,priceRange,cuisineType,description,facilitiesServices,creditCards,phoneNumber,website,processedDescription
0,Autem*,via Serviliano Lattuada 2,Milan,20135,Italy,€€€,"Farm to table, Modern Cuisine",A modern direction with the kitchen well in vi...,"['Air conditioning', 'Wheelchair access']","['amex', 'dinersclub', 'mastercard', 'visa']",+39 351 278 0368,https://autem-milano.com/,"[modern, direct, kitchen, well, view, even, ou..."
1,Antico Albergo,via Dante Alighieri 18,Limito,20096,Italy,€€,"Italian, Classic Cuisine","With its exposed bricks and beams, this restau...","['Air conditioning', 'Terrace']","['amex', 'maestrocard', 'mastercard', 'visa']",+39 02 926 6157,https://www.anticoalbergo.it,"[expos, brick, beam, restaur, hous, 18c, build..."
2,Pierre - Trattoria Sartoriale,viale dei Mille 1/c,Treviso,31100,Italy,€€,"Modern Cuisine, Creative","Situated on the edge of the historic centre, t...","['Air conditioning', 'Wheelchair access']","['amex', 'maestrocard', 'mastercard', 'visa']",+39 0422 541022,https://www.pierretrattoriasartoriale.com,"[situat, edg, histor, centr, simpl, restaur, t..."
3,Babette,via Michelangelo 17,Albenga,17031,Italy,€€,"Mediterranean Cuisine, Ligurian",Situated just beyond the centre of Albenga in ...,"['Air conditioning', 'Great view', 'Terrace']","['maestrocard', 'mastercard', 'visa']",+39 0182 544556,https://www.ristorantebabette.net/,"[situat, beyond, centr, albenga, beauti, locat..."
4,Bottega Culinaria,contrada Pontoni 72,San Vito Chietino,66038,Italy,€€€,"Creative, Seasonal Cuisine",Nestled amid a charming landscape of olive gro...,"['Air conditioning', 'Car park', 'Garden or pa...","['amex', 'dinersclub', 'mastercard', 'visa']",+39 339 142 1111,https://www.bottegaculinaria.com/,"[nestl, amid, charm, landscap, oliv, grove, re..."


### 2.1 Conjunctive Query

Next, we will construct the **Conjunctive Search Engine**, which retrieves restaurants whose descriptions contain all specified query terms.

#### 2.1.1 Create Your Index!

### Building and Loading the Vocabulary

In this section, we create or load a file named `"vocabulary.csv"`, which maps each unique word to a corresponding integer identifier (`term_id`). We assign integer values from $0$ up to the total number of unique words in the processed descriptions minus one. 

To optimize computation and avoid redundant processing, if the file already exists, we simply load it. For efficient usage, the data is stored in a DataFrame called `vocabulary_df`.

#### Function: `get_vocabulary`

- **Description:**  
    This function either loads an existing `"vocabulary.csv"` file or creates a new one if it does not exist. The vocabulary file maps each unique word (or "term") found in the processed text descriptions to a unique integer ID, which can be used to reference the term efficiently.

- **Input**
    - `processed_texts` (list of list of str): A list where each sublist contains tokenized and processed words from a text.
    - `file_path` (str, default=`"vocabulary.csv"`): Path to the `.csv` file where the vocabulary will be stored or loaded from.

- **Output**
    - `vocabulary_df` (`pd.DataFrame`): A DataFrame that contains two columns: `term_id`, the unique integer ID assigned to each term, and `term`, the corresponding word.

- **Key Features**
    - **File Existence Check**: Checks if `"vocabulary.csv"` already exists to avoid redundant recomputation. If it does exist, the file is loaded; otherwise, a new vocabulary is created and saved.
    - **Unique Term Extraction**: Extracts all unique terms from the processed texts by flattening the list of lists and converting it to a set, ensuring only unique words are included.
    - **DataFrame Creation**: Each term is assigned a unique integer ID, and both the term and ID are stored in a DataFrame.
    - **CSV Storage**: Saves the vocabulary as `"vocabulary.csv"` to enable reuse in future computations.

In [None]:
# Import the get_vocabulary function from the search_engine module
from search_engine import get_vocabulary

# Generate or load the vocabulary DataFrame based on the 'processedDescription' column in df
descriptions_vocabulary_df = get_vocabulary(df['processedDescription'])

# Print the first few rows of the vocabulary DataFrame
print(descriptions_vocabulary_df.head().to_string(index=False))

Loading vocabulary.csv file.
 term_id      term
       0    laurin
       1    gorini
       2 imaginari
       3  ciccioli
       4  clientel


In [None]:
# Import the get_inverted_index function from the search_engine module
from search_engine import get_inverted_index

# Generate the inverted index using the processed descriptions and the descriptions vocabulary DataFrame
inverted_index = get_inverted_index(df['processedDescription'], descriptions_vocabulary_df)

# Iterate through the first 10 terms in the inverted index and print their corresponding document IDs
for idx, (term, docs) in enumerate(inverted_index.items()):
    if idx < 10:
        print(f"Term ID: {term}, Document IDs: {docs}")
    else:
        break

Loading Inverted Index from theinverted_index.json file.
Term ID: 0, Document IDs: [944]
Term ID: 1, Document IDs: [589]
Term ID: 2, Document IDs: [1299]
Term ID: 3, Document IDs: [886]
Term ID: 4, Document IDs: [499, 628, 1458, 1566, 1575, 1806]
Term ID: 5, Document IDs: [899]
Term ID: 6, Document IDs: [80, 91, 416, 430, 885, 1128, 1436, 1622, 1684]
Term ID: 7, Document IDs: [34, 38, 43, 78, 80, 90, 293, 328, 388, 408, 433, 556, 626, 853, 897, 1110, 1157, 1233, 1256, 1319, 1381, 1596, 1662, 1740, 1946]
Term ID: 8, Document IDs: [64]
Term ID: 9, Document IDs: [1352, 1543, 1886]


#### 2.1.2 Execute the Query

Now, let us execute the query that will be handled by the Conjunctive Search Engine. This engine will process the query terms and return a list of restaurants whose descriptions contain all of the query words.

#### **Function**: `execute_conjunctive_query`
- **Description:**  
    This function processes a search query to find documents that contain all of the terms specified in the query. It uses the inverted index to identify the documents where each term appears and then returns only those documents that contain every term in the query.

- **Input:**
    - `query` (str): The search query, typically a string containing multiple terms.
    - `inverted_index` (dict): The inverted index, which maps `term_ids` to lists of document indices (IDs) where each term appears.
    - `vocabulary_df` (pd.DataFrame): A DataFrame mapping terms to their unique `term_ids`, allowing the function to look up terms' corresponding IDs.

- **Output:**
    - `intersection_result` (list of int): A list of document IDs that contain all of the terms specified in the search query.

- **Key Features:**
    - **Preprocessing the Query:**  The input query is first preprocessed to tokenize and clean the terms before searching.
    - **Mapping Terms to IDs:**  The terms from the query are matched with their corresponding `term_ids` in the vocabulary DataFrame.
    - **Retrieving Document IDs:**  For each term in the query, the function fetches the document IDs where the term appears using the inverted index.
    - **Intersection of Document Sets:**  The function performs an intersection of all the document sets to ensure that only documents containing all the terms in the query are returned.

Let's execute the query `"modern seasonal cuisine"` and retrieve all restaurants whose processed descriptions contain all these words.

In [None]:
# Import the function for executing a conjunctive query based on the given terms
from search_engine import execute_conjunctive_query

# Execute the conjunctive query to get the document IDs that match all terms in the query
# The function returns a list of document IDs that contain all terms in the query ("modern seasonal cuisine")
documents_id = execute_conjunctive_query(
    "modern seasonal cuisine",  # The query string containing terms to be matched
    inverted_index,  # The inverted index (term -> list of document IDs where each term appears)
    descriptions_vocabulary_df  # The vocabulary DataFrame that maps terms to their IDs
)

# Retrieve the documents that match the query by using the sorted document IDs
# This creates a DataFrame with the restaurant details (name, address, description, website) for the matching documents
conjunctive_query_result_df = df[['restaurantName', 'address', 'description', 'website']].iloc[sorted(documents_id)]

# Rename the columns to make them more user-friendly for display purposes
conjunctive_query_result_df = conjunctive_query_result_df.rename(columns={
    'restaurantName': 'Restaurant Name',
    'address': 'Address',
    'description': 'Description',
    'website': 'Website'
})

# Print the resulting DataFrame containing the restaurant details that match the query
conjunctive_query_result_df

Unnamed: 0,Restaurant Name,Address,Description,Website
88,Ristorante Da Anita - Chalet Prà delle Nasse,"via Cavallazza 24, località Prà delle Nasse",This family-run restaurant is surrounded by vi...,https://www.ristorante-da-anita.com/ristorante...
180,Alessandro Mecca al Castello di Grinzane Cavour,via Castello 5,There’s no shortage of classic Piedmontese ing...,https://www.alessandromecca.it/
256,Modì,via Bucceri,This restaurant has moved from its previous lo...,https://www.modiristorante.it
289,Alto,via Circondariale San Francesco 2,The top floor of the renovated Executive Spa H...,http://www.altoristorante.com
304,Repubblica di Perno,"vicolo Cavour 5, loc. Perno",Situated in the heart of the gastronomic Langa...,http://www.repubblicadiperno.it
456,U.P.E.P.I.D.D.E.,"vico Sant'Agnese 2, angolo corso Cavour",Undeniably characteristic and refreshing! Dug ...,https://www.upepidde.it/
501,Molteni,via Ruzzina 2/4,Hospitality abounds in this establishment wher...,https://www.albergomolteni.it/
561,La Valle,"via Umberto I 25, località Valle Sauglio",A well - run restaurant in a quiet area just o...,https://www.ristorantelavalle.it/
565,Le Lampare al Fortino,"via Tiepolo, molo Sant'Antonio","Built over a medieval church, this old fort th...",https://www.lelamparealfortino.it/it/home/
586,Rueda Gaucha,viale Europa 18,This restaurant is justifiably renowned for tw...,


### 2.2 Ranked Search Engine with TF-IDF and Cosine Similarity

Finally, we build the **Ranked Search Engine**, that returns the *top-k* restaurants sorted by their similarity to the query, utilizing **TF-IDF** and **Cosine Similarity**.

#### 2.2.1 Inverted Index with TF-IDF Scores

Let us obtain the **Inverted Index with TF-IDF Scores** dictionary. 

Like in previous cases, the following function creates the file `tfIdf_inverted_index.json` if it doesn't exist. Otherwise, it loads the file, avoiding recomputation, and stores the data in a dictionary. This is a new inverted index where each entry corresponds to a term, and the value is a list of tuples containing document IDs and their associated TF-IDF scores.

#### **Function**: `get_tfIdf_inverted_index` 
- **Description**: \
    This function creates or loads an inverted index with **TF-IDF scores** for a collection of documents. It calculates the importance of each term in each document based on the **term frequency (TF)** and **inverse document frequency (IDF)**. If the TF-IDF inverted index already exists as a JSON file, it will be loaded; otherwise, the function will generate it and save it to a file for future use.

- **Input**:
    - `inverted_index` (dict): A dictionary where the keys are `term_ids`, and the values are lists of document IDs (indices) where each term appears.
    - `vocabulary_df` (pd.DataFrame): A DataFrame that maps terms to their unique `term_ids`. This is used to look up terms and their IDs.
    - `processed_texts` (list of list of str): A list of processed documents, where each document is represented as a list of terms (strings).
    - `file_path` (str, default=`"tfIdf_inverted_index.json"`): The file path where the TF-IDF inverted index will be saved or loaded. Default is `"tfIdf_inverted_index.json"`.

-  **Output**:
    - **tfIdf_inverted_index** (`dict`): A dictionary where the keys are `term_ids`, and the values are lists of tuples. Each tuple contains:
        - `doc_id` (int): The document ID (index in the `processed_texts` list).
        - `tf-idf score` (float): The TF-IDF score for the term in the corresponding document.

- **Key Features**:
    - **Check for Existing File**: The function checks whether the TF-IDF inverted index already exists as a JSON file. If it does, it loads the index from the file to avoid recomputation.
    - **Calculate TF-IDF Scores**: If the file does not exist, the function computes the TF-IDF scores for each term in the vocabulary, using the `inverted_index` and `processed_texts` as input.
    - **Save the Index**: Once the TF-IDF scores are calculated, the index is saved to a JSON file for future use, ensuring efficient reuse of precomputed data.


### **Function**: `get_tfIdf` 
- **Description**:
    This function calculates the **TF-IDF (Term Frequency-Inverse Document Frequency)** score for a specific term in a document, providing a measure of the term's importance within that document relative to the entire corpus. It combines two components:
        - **Term Frequency (TF)**: The frequency of the term within the document.
        - **Inverse Document Frequency (IDF)**: A measure of the rarity of the term across the entire corpus.

    The function is used in `get_tfIdf_inverted_index` to compute the TF-IDF scores for terms in documents and build the **TF-IDF inverted index**.

- **Input**:
    - `term` (str): The term for which the TF-IDF score is being calculated.
    - `document` (list of str): A list of terms representing the document being analyzed.
    - `corpus` (list of list of str): A collection of documents, where each document is represented as a list of terms.

- **Output**:
    - **TF-IDF score** (float): The TF-IDF score for the given term in the specified document. It is the product of the **TF** and **IDF** scores.

- **Key Features**:
    - **Term Frequency (TF)**: The function calculates the term frequency as the ratio of occurrences of the term in the document to the total number of terms in the document.
    - **Inverse Document Frequency (IDF)**: The function computes the IDF by taking the logarithm of the total number of documents divided by the number of documents that contain the term, ensuring that terms appearing in every document are not given excessive weight.
    - **Logarithmic Scaling**: The IDF is logarithmic to reduce the effect of very frequent terms in the corpus.
    - **Combination of TF and IDF**: The TF and IDF values are multiplied together to produce the TF-IDF score, which reflects the relative importance of the term in the document within the context of the entire corpus.

In [None]:
# Importing the get_tfIdf_inverted_index function from the search_engine module
from search_engine import get_tfIdf_inverted_index

# Create the TF-IDF inverted index dictionary using the inverted index, vocabulary, and processed descriptions
tfIdf_inverted_index = get_tfIdf_inverted_index(inverted_index, descriptions_vocabulary_df, df['processedDescription'])

# Print the first 10 key-value pairs from the TF-IDF inverted index dictionary
for idx, (term, docs) in enumerate(tfIdf_inverted_index.items()):
    if idx < 10:
        print(f"Term ID: {term}, List of (Doc ID, tf-idf score) pairs: {docs}")
    else:
        break

Loading Inverted Index with TF-IDF scores from the tfIdf_inverted_index.json file.
Term ID: 0, List of (Doc ID, tf-idf score) pairs: [(944, 0.05992147308970551)]
Term ID: 1, List of (Doc ID, tf-idf score) pairs: [(589, 0.053501315258665624)]
Term ID: 2, List of (Doc ID, tf-idf score) pairs: [(1299, 0.04161213409007326)]
Term ID: 3, List of (Doc ID, tf-idf score) pairs: [(886, 0.03841120069852917)]
Term ID: 4, List of (Doc ID, tf-idf score) pairs: [(499, 0.08455191759086206), (628, 0.038920723970396816), (1458, 0.07909695516564515), (1566, 0.06130014025337499), (1575, 0.08757162893339283), (1806, 0.051083450211145826)]
Term ID: 5, List of (Doc ID, tf-idf score) pairs: [(899, 0.05447406644518682)]
Term ID: 6, List of (Doc ID, tf-idf score) pairs: [(80, 0.026710507559875075), (91, 0.027024748825285373), (416, 0.03329135724853995), (430, 0.049937035872809926), (885, 0.05220690113975583), (1128, 0.04030006403770625), (1436, 0.02443727287392826), (1622, 0.06380843472636824), (1684, 0.1093858

#### 2.2.2 Execute the Ranked Query

In this step, we will process the query entered by the user and utilize a **Ranked Search Engine** to return the most relevant documents (restaurants) based on their similarity to the query. The search engine will return the top **k** matching restaurants (documents) with the highest similarity scores. If fewer than **k** restaurants have a non-zero similarity score, it will return all the restaurants that have any similarity to the query.

### **Function**: `execute_ranked_query`

- **Description**
    This function performs a ranked search by evaluating how closely the documents in a collection match the query based on the **TF-IDF** vectors. It uses only the terms from the query that are present in the **vocabulary**, and the **cosine similarity** is computed to rank the documents.

- **Input**:
    - `query_terms` (str): A space-separated string of terms that represent the search query.
    - `tfIdf_inverted_index` (dict): A dictionary mapping term IDs to lists of tuples (document ID, TF-IDF score), representing the inverted index for the documents.
    - `vocabulary_df` (DataFrame): A DataFrame containing terms and their corresponding term IDs.
    - `processed_texts` (list of list of str): A list of processed documents, where each document is represented as a list of terms.
    - `top_k` (int): The number of top-ranked documents to return based on similarity.

- **Output**:
    - `ranked_results` (list): A list of tuples, where each tuple contains a document ID and its cosine similarity score with respect to the query.
    - `not_found` (str): A message listing the query terms that were not found in the vocabulary.

- **Key Features:**
    - **Preprocessing the Query**: The input query, a string of space-separated terms, is tokenized and cleaned using the `preprocess_text` function to ensure that the query terms are in a standard form for comparison.
    - **Handling Missing Terms**: Any query terms that are not present in the vocabulary are filtered out, and a message is created to indicate which terms could not be matched.

    - **Mapping Query Terms to Term IDs**: The remaining valid query terms are mapped to their corresponding **term IDs** using the **vocabulary** DataFrame. These term IDs will be used to build the **query vector**.
    
    - **Building the Query Vector**: A **query vector** is initialized with zeros, and for each term in the query, the **TF-IDF** score is calculated using the `get_tfIdf` function. These scores populate the query vector at the positions corresponding to the term IDs.
    
    - **Building Document Vectors**: Document vectors are also initialized with zero values. These vectors are populated using the **TF-IDF scores** from the **TF-IDF inverted index**, which stores the scores for each term in each document.

    - **Cosine Similarity Calculation**: For each document, the cosine similarity between its vector and the query vector is computed using the `get_cosine_similarity` function. Cosine similarity measures the angle between the two vectors, providing a score that indicates how similar the document is to the query. A value closer to 1 indicates higher similarity.

    - **Top- K Ranking**: The documents are ranked based on their similarity scores, with the most similar documents appearing first. The results are sorted in descending order of similarity. If there are more than `top_k` results, the function limits the returned documents to the top-k highest ranked documents. If fewer than `top_k` documents match, all matching documents are returned.

### **Function**: `get_cosine_similarity`

This function is a helper function used by the `execute_ranked_query` function to calculate the **cosine similarity** between a document vector and a query vector. This similarity metric is widely used in **information retrieval** and **text mining** to measure the similarity between two vectors, regardless of their magnitude.

#### **Description**
Cosine similarity is calculated as the cosine of the angle between two vectors. A cosine similarity score of:
- **1** means the vectors are identical.
- **0** means the vectors are orthogonal (no similarity).
- **-1** indicates complete dissimilarity (opposites).

This function specifically handles cases where one or both vectors might be zero vectors, returning 0 if the cosine similarity cannot be calculated.

- **Input**:
    - `doc_vector` (numpy array): A vector representing the document, containing **TF-IDF scores** for each term.
    - `query_vector` (numpy array): A vector representing the query, with **TF-IDF scores** assigned for each term.

- **Output**:
    - `float`: The cosine similarity score between the document and query vectors. Returns 0 if the denominator (product of norms) is zero.

- **Key Features**:
    - **Dot Product Calculation**: The **dot product** of the document and query vectors is computed. This represents the overlap or alignment between the two vectors.

    - **L2 Norm Calculation**:The **L2 norm** (magnitude) of both the document vector and the query vector is calculated individually. The L2 norm provides a measure of the vector's length.

    - **Cosine Similarity Calculation**: The cosine similarity score is derived by dividing the dot product by the product of the norms of the two vectors. This step effectively normalizes the dot product, yielding a similarity score between -1 and 1.

    - **Handling Zero Vectors**: If either vector is a **zero vector** (i.e., the L2 norm is zero), the function returns 0, as the cosine similarity is undefined in this case.

Let's execute the query `"modern seasonal cuisine"` and retrieve the *top-10* results ranked by relevance. This will allow us to see the best-matching documents based on cosine similarity between the query vector and document vectors. Any terms from the query that are not found in the vocabulary will be noted in the output.

In [None]:
# Import the function for executing a ranked query based on TF-IDF scores
from search_engine import execute_ranked_query1

# Execute the ranked query with the search terms "cuisine modern seasonal"
# The function returns the ranked query result (documents and their similarity scores) and any terms not found in the vocabulary
ranked_query_result, terms_not_found = execute_ranked_query1(
    "cuisine modern seasonal",  # The query string
    tfIdf_inverted_index,  # The TF-IDF inverted index (term -> document ID, TF-IDF score pairs)
    descriptions_vocabulary_df,  # The vocabulary DataFrame (terms and their IDs)
    df['processedDescription'],  # The processed descriptions of the documents
    10  # Number of top results to return
)

# Extract the document IDs from the ranked query result
ranked_query_result_ids = [doc_id for (doc_id, _) in ranked_query_result]

# Extract the TF-IDF scores from the ranked query result
ranked_query_result_scores = [tfIdf_score for (_, tfIdf_score) in ranked_query_result]

# Retrieve the restaurant information (name, address, description, and website) based on the document IDs
ranked_query_result_df = df[['restaurantName', 'address', 'description', 'website']].iloc[ranked_query_result_ids]

# Add the TF-IDF similarity scores as a new column in the DataFrame
ranked_query_result_df['tfIdf_score'] = ranked_query_result_scores

# Rename the columns to make them more user-friendly for display purposes
ranked_query_result_df = ranked_query_result_df.rename(columns={
    'restaurantName': 'Restaurant Name',
    'address': 'Address',
    'description': 'Description',
    'website': 'Website',
    'tfIdf_score': 'Similarity Score'
})

# Print any terms from the query that were not found in the vocabulary
print(terms_not_found)

# Display the DataFrame with the top-ranked results, including the restaurant details and similarity scores
ranked_query_result_df


ImportError: cannot import name 'execute_ranked_query1' from 'search_engine' (/Users/riccardo.soleo/Desktop/ADM-HW3/search_engine.py)

Here, we demonstrate an example where some terms in the query are not present in the vocabulary. When this occurs, the function will return ranked results based solely on the terms that do match the vocabulary, while also notifying the user of the terms that were not found. This approach allows us to retrieve relevant results even if certain query terms are missing from the vocabulary.

In this example, the query is `"modern albero cuisine seasonal treno"`. Note that the results will be identical to the previous example, as only the terms `"modern"`, `"cuisine"`, and `"seasonal"` were found and used for ranking, while `"albero"` and `"treno"` were absent from the vocabulary.

In [None]:
ranked_query2_result, terms2_not_found = execute_ranked_query(
    "modern albero cuisine seasonal treno", tfIdf_inverted_index, descriptions_vocabulary_df, df['processedDescription'], 10
)

ranked_query2_result_ids = [doc_id for (doc_id, _) in ranked_query2_result]

# Extract the TF-IDF scores from the ranked query result
ranked_query2_result_scores = [tfIdf_score for (_, tfIdf_score) in ranked_query2_result]

# Retrieve the restaurant information (name, address, description, and website) based on the document IDs
ranked_query2_result_df = df[['restaurantName', 'address', 'description', 'website']].iloc[ranked_query2_result_ids]

# Add the TF-IDF similarity scores as a new column in the DataFrame
ranked_query2_result_df['tfIdf_score'] = ranked_query2_result_scores

# Rename the columns to make them more user-friendly for display purposes
ranked_query2_result_df = ranked_query2_result_df.rename(columns={
    'restaurantName': 'Restaurant Name',
    'address': 'Address',
    'description': 'Description',
    'website': 'Website',
    'tfIdf_score': 'Similarity Score'
})

# Print any terms from the query that were not found in the vocabulary
print(terms2_not_found)

# Display the DataFrame with the top-ranked results, including the restaurant details and similarity scores
ranked_query2_result_df


No matches found for these terms: albero, treno


Unnamed: 0,Restaurant Name,Address,Description,Website,Similarity Score
878,La Primula,via San Rocco 47,Situated between Pordenone and Aviano in the m...,https://www.ristorantelaprimula.it/,0.251275
726,Il Focarile,"via Pontina al km 46,5",The sumptuous entrance gives an elegant tone t...,http://www.ilfocarile.it,0.236176
180,Alessandro Mecca al Castello di Grinzane Cavour,via Castello 5,There’s no shortage of classic Piedmontese ing...,https://www.alessandromecca.it/,0.226464
1979,Nida,via Nicola Barbantini 338,Located outside the city centre about 1km from...,http://www.serendepico.com/nida,0.217601
1444,Antica Locanda al Cervo - Landgasthof zum Hirs...,via Schrann 9/c,"Enjoy generous, regional cuisine influenced by...",https://www.hirschenwirt.it/,0.205889
6,Indiniò,via Norsinia 21/b,This restaurant (the name Indiniò means “nowhe...,http://www.indinio.it,0.189022
798,La Brughiera,via XXIV Maggio 23,This old farmhouse in the Groane Regional Park...,https://www.labrughiera.it,0.182986
1878,Al Paradiso,via Sant'Ermacora 1,A small gem of a restaurant housed in an old f...,https://www.trattoriaparadiso.it/,0.174751
520,Campamac,strada della Valle 1,A real gourmet inn which serves traditional Pi...,https://www.campamac.com/,0.173946
0,Autem*,via Serviliano Lattuada 2,A modern direction with the kitchen well in vi...,https://autem-milano.com/,0.171544


### ***3. Define a New Score!***

We need to extract:

-Description Terms: Words relevant to the description (e.g., "Italian").

-Cuisine Type: "Italian"

-Facilities and Services: ["Outdoor seating", "Wi-Fi"]

-Price Range: "Moderately priced"(€)

In [None]:
#user inputs
user_description_query = input("Enter keywords for the restaurant description (e.g., 'modern seasonal cuisine'): ")# we are using modern seasonal couisine
user_cuisine_query = input("Enter preferred cuisine types (comma-separated): ")# we typed modern italian
user_facilities_query = input("Enter desired facilities/services (comma-separated): ")#we typed wheelchair access, wi-fi, sport
user_price_query = input("Enter preferred price range (e.g., '€', '€€', '€€€'): ")#we used €€


now we define a new scoring function that incorporate more attributes with different weights:
-Description match: 0.3
-Cuisine match: 0.25
-Facilities match: 0.25
-Price match: 0.2

In [None]:
def compute_custom_new_score(
    doc_id,
    description_scores,
    df,
    user_cuisines,
    user_facilities,
    user_price,
    max_description_score
):
    #initialize individual scores
    description_score = description_scores.get(doc_id, 0) / max_description_score  #normalize to keep it between [0,1]
    cuisine_score = 0
    facilities_score = 0
    price_score = 0

    #get restaurant data
    restaurant = df.iloc[doc_id]

    #cuisine Match
    restaurant_cuisines = [c.strip().lower() for c in restaurant['cuisineType'].split(',')]
    if any(cuisine in restaurant_cuisines for cuisine in user_cuisines):
        cuisine_score = 1

    #servicies Match
    restaurant_facilities = [f.strip().lower() for f in restaurant['facilitiesServices'].split(',')]
    matching_facilities = set(user_facilities) & set(restaurant_facilities)
    facilities_score = len(matching_facilities) / len(user_facilities) if user_facilities else 0

    #price Match
    if restaurant['priceRange'] == user_price:
        price_score = 1

    #weights
    weights = {
        'description': 0.3,
        'cuisine': 0.25,
        'facilities': 0.25,
        'price': 0.2
    }

    #total score
    total_score = (
        weights['description'] * description_score +
        weights['cuisine'] * cuisine_score +
        weights['facilities'] * facilities_score +
        weights['price'] * price_score
    )

    return total_score


now we are going to use 'heapq' to mantain the top-k restaurants

In [None]:
import heapq

def get_top_k_restaurants(
    ranked_docs,
    df,
    user_cuisines,
    user_facilities,
    user_price,
    k=10
):
    heap = []
    max_description_score = max(score for (_, score) in ranked_docs) if ranked_docs else 1

    for doc_id, description_score in ranked_docs:
        # Check if doc_id is within the bounds of df
        if doc_id < 0 or doc_id >= len(df):
            print(f"Warning: Document ID {doc_id} is out of bounds for DataFrame 'df'. Skipping this document.")
            continue
        total_score = compute_custom_new_score(
            doc_id,
            dict(ranked_docs),
            df,
            user_cuisines,
            user_facilities,
            user_price,
            max_description_score
        )

        #maintain a heap of size k
        if len(heap) < k:
            heapq.heappush(heap, (total_score, doc_id))
        else:
            heapq.heappushpop(heap, (total_score, doc_id))

    #extract restaurants from heap and sort by score descending
    top_restaurants = sorted(heap, key=lambda x: x[0], reverse=True)
    return top_restaurants


now take the restaurants details and display the top-k

In [None]:
import numpy as np
from search_engine import preprocess_text
from search_engine import get_tfIdf
from search_engine import get_vocabulary
from search_engine import get_inverted_index
from search_engine import execute_conjunctive_query
from search_engine import cosine_similarity
from collections import defaultdict
def execute_ranked_query1(query_terms, inverted_index, vocabulary_df, processed_texts, top_k):
    """
    Executes a ranked query by calculating cosine similarity between a query vector (TF-IDF)
    and document vectors, using only the terms from the query that, once processed, exist in the vocabulary.

    Parameters:
    - query_terms (str): Query input as a space-separated string of terms.
    - inverted_index (dict): Dictionary with term IDs as keys and values as lists of tuples (document ID, TF-IDF score),
                             representing the inverted index for documents.
    - vocabulary_df (DataFrame): DataFrame of vocabulary terms, each with a unique term ID.
    - processed_texts (list of list of str): List of processed texts, each represented as a list of terms.
    - top_k (int): Number of top-ranked documents to return based on similarity.

    Returns:
    - ranked_results (list): List of tuples, each containing a document ID and its similarity score.
    - not_found (str): Message listing terms from the query that were not found in the vocabulary.
    """
    
    # Tokenize and clean query terms
    query_list = preprocess_text([query_terms])[0]
    
    # Filter out terms that are not in the vocabulary and store those not found
    no_matches = [term for term in query_list if term not in vocabulary_df['term'].values]
    query_list = [term for term in query_list if term in vocabulary_df['term'].values]

    # If any query terms were not found, create a message with those terms
    not_found = "No matches found for these terms: " + ', '.join(list(set(no_matches))) if no_matches else ""

    # Map query terms to their corresponding term IDs from the vocabulary
    query_term_ids = (vocabulary_df[vocabulary_df['term'].isin(query_list)]).set_index('term').loc[query_list].reset_index()['term_id'].astype(int).tolist()

    # Initialize the query vector, setting TF-IDF values for query terms
    query_vector = np.zeros(vocabulary_df.shape[0])
    for i in range(len(query_term_ids)):
        query_vector[query_term_ids[i]] = get_tfIdf(query_list[i], query_list, processed_texts)
        
    # Initialize document vectors with default zero values for each term
    document_vectors = defaultdict(lambda: np.zeros(vocabulary_df.shape[0]))
    
    # Populate document vectors with TF-IDF scores from the inverted index
    for term_id in vocabulary_df['term_id']:
        if term_id in inverted_index:
            for doc_id, tfidf_score in inverted_index[term_id]:
                document_vectors[doc_id][term_id] = tfidf_score
    
    # Compute cosine similarity for each document vector against the query vector
    scores = {doc_id: cosine_similarity(doc_vector, query_vector) for doc_id, doc_vector in document_vectors.items()}
    
    # Rank documents by similarity scores, in descending order
    ranked_results = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    
    if top_k is not None and len(ranked_results) > top_k:
        # Limit the results to the top_k documents
        ranked_results = ranked_results[:top_k]

    return ranked_results, not_found

In [None]:

# Parse user inputs
user_cuisines = [c.strip().lower() for c in user_cuisine_query.split(',')]
user_facilities = [f.strip().lower() for f in user_facilities_query.split(',')]
user_price = user_price_query.strip()

# Perform initial ranked query based on description
ranked_query_result, terms_not_found = execute_ranked_query1(
    user_description_query,
    tfIdf_inverted_index,
    descriptions_vocabulary_df,
    df['processedDescription'],
    top_k=None  # Get all documents
)

# Get top-k restaurants based on the custom score
top_k = 10
top_restaurants = get_top_k_restaurants(
    ranked_query_result,
    df,
    user_cuisines,
    user_facilities,
    user_price,
    k=top_k
)

# Retrieve and display restaurant details
restaurant_ids = [doc_id for (_, doc_id) in top_restaurants]
scores = [score for (score, _) in top_restaurants]

# Build DataFrame
result_df = df.loc[restaurant_ids, ['restaurantName', 'address', 'description', 'website']]
result_df['Custom Score'] = scores

# Sort by score descending
result_df = result_df.rename(columns={
    'restaurantName': 'Restaurant Name',
    'address': 'Address',
    'description': 'Description',
    'website': 'Website'
})

# Display the sorted results
print("\nTop Restaurants Based on Your Preferences:\n")
print(result_df.sort_values(by='Custom Score', ascending=False))



Top Restaurants Based on Your Preferences:

                                        Restaurant Name  \
726                                         Il Focarile   
1979                                               Nida   
1444  Antica Locanda al Cervo - Landgasthof zum Hirs...   
6                                               Indiniò   
798                                        La Brughiera   
1878                                        Al Paradiso   
31                                  San Quintino Resort   
647                                         Belle Parti   
1479                                           Peperosa   
907                                         Quintogusto   

                         Address  \
726       via Pontina al km 46,5   
1979   via Nicola Barbantini 338   
1444             via Schrann 9/c   
6              via Norsinia 21/b   
798           via XXIV Maggio 23   
1878         via Sant'Ermacora 1   
31                   via Vigne 6   
647           via

so here the steps of the code:

-take the user inputs

-use the preview TD-IDF search to get a list of revelant restaurants

-calculate the new score using the function 'compute_custom_new_score'

-heap manage the top k restaurants

-display the results

## 4. Visualizing the Most Relevant Restaurants

### *4.1 Collect information on unique restaurant locations in Italy (in the format of City and Region)*

**Answer:** We got arround 1.116 cities with their respective region

In [8]:
# Extracting the list of (city, region) tuples for the coordinates later 
city_region_list = list(zip(df['city'], df['region']))
print(f" cities and regions: {len(city_region_list)}")
print(f"List of unique cities and their regions: {city_region_list}")


 cities and regions: 1981
List of unique cities and their regions: [('Genoa', 'Liguria'), ('Naples', 'Campania'), ('Paestum', 'Campania'), ('Ortisei', 'Trentino-South Tyrol'), ('Gardone Riviera', 'Lombardy'), ('Varese', 'Lombardy'), ('Ticciano', 'Campania'), ('Pompei', 'Campania'), ('Verona', 'Veneto'), ('Rome', 'Lazio'), ('Florence', 'Tuscany'), ('Treia', 'Marche'), ('Livigno', 'Lombardy'), ('Milan', 'Lombardy'), ('Valdieri', 'Piedmont'), ('Rapallo', 'Liguria'), ('Sorrento', 'Campania'), ('Treviso', 'Veneto'), ('Pallanza', 'Piedmont'), ('San Michele', 'Trentino-South Tyrol'), ('San Michele', 'Trentino-South Tyrol'), ('Rome', 'Lazio'), ('Imola', 'Emilia-Romagna'), ('Olevano Romano', 'Lazio'), ('Godia', 'Friuli-Venezia Giulia'), ('Spoltore', 'Abruzzo'), ('Greve in Chianti', 'Tuscany'), ('Ponza', 'Lazio'), ('Milan', 'Lombardy'), ('Turin', 'Piedmont'), ('Mortara', 'Lombardy'), ('Milan', 'Lombardy'), ('Maranello', 'Emilia-Romagna'), ('Guarene', 'Piedmont'), ('Lonigo', 'Veneto'), ('Suna', '

### *4.2 Provide coordinates for these location.*

**Answer:** With Geopy the coordinates were impported in a CSV due to the large amount of data, and then added them to the data set for better precision. 

In [9]:
# pip3 install geopy

In [14]:
import time
import pandas as pd
from geopy.geocoders import Nominatim

# geolocator
geolocator = Nominatim(user_agent="city-region-coordinates-extractor")

# Function to get coordinates for a city-region pair starting in Italy
def get_coordinates(city, region):
    try:
        query = f"{city}, {region}, Italy"
        location = geolocator.geocode(query)
        if location:
            return location.latitude, location.longitude
        else:
            return None, None
    except Exception as e:
        print(f"Error getting coordinates for {city}, {region}: {e}")
        return None, None

# Function to extract coordinates for all unique city-region pairs
def extract_coordinates(city_region_list):
    coordinates_list = []
    
    # Loop through each city-region pair and get coordinates
    for city, region in city_region_list:
        print(f"Geocoding {city}, {region}...")
        latitude, longitude = get_coordinates(city, region)
        coordinates_list.append((city, region, latitude, longitude))
        time.sleep(1)
    
    return coordinates_list

coordinates = extract_coordinates(city_region_list)

# Create a DataFrame from the list of coordinates
coordinates_df = pd.DataFrame(coordinates, columns=['city', 'region', 'latitude', 'longitude'])
coordinates_df.to_csv('city_region_coordinates.csv', index=False)
print("Coordinates were extracted and saved to 'city_region_coordinates.csv'.")


Geocoding Genoa, Liguria...
Geocoding Naples, Campania...
Geocoding Paestum, Campania...
Geocoding Ortisei, Trentino-South Tyrol...
Geocoding Gardone Riviera, Lombardy...
Geocoding Varese, Lombardy...
Geocoding Ticciano, Campania...
Geocoding Pompei, Campania...
Geocoding Verona, Veneto...
Geocoding Rome, Lazio...
Geocoding Florence, Tuscany...
Geocoding Treia, Marche...
Geocoding Livigno, Lombardy...
Geocoding Milan, Lombardy...
Geocoding Valdieri, Piedmont...
Geocoding Rapallo, Liguria...
Geocoding Sorrento, Campania...
Geocoding Treviso, Veneto...
Geocoding Pallanza, Piedmont...
Geocoding San Michele, Trentino-South Tyrol...
Geocoding San Michele, Trentino-South Tyrol...
Geocoding Rome, Lazio...
Geocoding Imola, Emilia-Romagna...
Geocoding Olevano Romano, Lazio...
Geocoding Godia, Friuli-Venezia Giulia...
Geocoding Spoltore, Abruzzo...
Geocoding Greve in Chianti, Tuscany...
Geocoding Ponza, Lazio...
Geocoding Milan, Lombardy...
Geocoding Turin, Piedmont...
Geocoding Mortara, Lombard

Now we add them in the data set

In [15]:
# Rename the latitude and longitude columns in the coordinates_df to avoid conflicts
coordinates_df.rename(columns={'latitude': 'coordinates_latitude', 'longitude': 'coordinates_longitude'}, inplace=True)

# Merge the original dataframe (df) with the coordinates dataframe (coordinates_df)
df_coordinates = pd.merge(df, coordinates_df, on=['city', 'region'], how='left')

# Check if the merge worked as expected
print(df_coordinates.head())

# Drop duplicate rows based on the restaurant name and coordinates (latitude, longitude)
df_deduplicated = df.drop_duplicates(subset=['restaurantName', 'latitude', 'longitude'])

# Verify that duplicates are removed
print(df_deduplicated.head())

# Merge the deduplicated dataframe (df_deduplicated) with the coordinates dataframe (coordinates_df)
df_coordinates = pd.merge(df_deduplicated, coordinates_df, on=['city', 'region'], how='left')

# Check the final merged dataframe
print(df_coordinates.head())

# Group by city-region and aggregate restaurant names, price ranges, and other info
df_aggregated = df_coordinates.groupby(['city', 'region', 'latitude', 'longitude']).agg(
    restaurant_names=('restaurantName', lambda x: list(x)),
    price_ranges=('priceRange', lambda x: list(x)),
    cuisines=('cuisineType', lambda x: list(x)),
    descriptions=('description', lambda x: list(x)),
    facilities=('facilitiesServices', lambda x: list(x)),
    credit_cards=('creditCards', lambda x: list(x)),
    phone_numbers=('phoneNumber', lambda x: list(x)),
    websites=('website', lambda x: list(x))
).reset_index()

# Verify that the aggregation worked as expected
print(df_aggregated.head())




  restaurantName                   address   city  postalCode   region  \
0          20Tre  via David Chiossone 20 r  Genoa       16123  Liguria   
1          20Tre  via David Chiossone 20 r  Genoa       16123  Liguria   
2          20Tre  via David Chiossone 20 r  Genoa       16123  Liguria   
3          20Tre  via David Chiossone 20 r  Genoa       16123  Liguria   
4          20Tre  via David Chiossone 20 r  Genoa       16123  Liguria   

  country  latitude  longitude priceRange                    cuisineType  \
0   Italy  44.40878   8.933115         €€  Farm to table, Modern Cuisine   
1   Italy  44.40878   8.933115         €€  Farm to table, Modern Cuisine   
2   Italy  44.40878   8.933115         €€  Farm to table, Modern Cuisine   
3   Italy  44.40878   8.933115         €€  Farm to table, Modern Cuisine   
4   Italy  44.40878   8.933115         €€  Farm to table, Modern Cuisine   

                                         description    facilitiesServices  \
0  Run by three part

In [None]:
!pip install folium

Collecting folium
  Using cached folium-0.18.0-py2.py3-none-any.whl.metadata (3.8 kB)
Collecting branca>=0.6.0 (from folium)
  Using cached branca-0.8.0-py3-none-any.whl.metadata (1.5 kB)
Using cached folium-0.18.0-py2.py3-none-any.whl (108 kB)
Using cached branca-0.8.0-py3-none-any.whl (25 kB)
Installing collected packages: branca, folium
Successfully installed branca-0.8.0 folium-0.18.0


### 4.3 Map Setup: Use a mapping library like plotly or folium to create a visual display of restaurants by region. and 4.4 Encoding Price Ranges: Incorporate a visual representation for price ranges:

Use color-coding or marker size to represent the restaurant’s price range (€, €€, €€€, €€€€).
Include a legend for interpreting price levels.


**Answer:** By using folium, the first HTML shows the retsaurants by prices. However, in the secong graph we decided to frouped by regions and also show the price categories.

In [16]:
import folium

# Create a base map centered around Rome, Italy
map_center = [41.9028, 12.4964]  
m = folium.Map(location=map_center, zoom_start=6)

# Define a color scale for price ranges
price_colors = {
    '€': 'green',
    '€€': 'blue',
    '€€€': 'yellow',
    '€€€€': 'red'
}

# Add markers for each restaurant on the map
for _, row in df_aggregated.iterrows():
    lat, lon = row['latitude'], row['longitude']
    restaurant_names = row['restaurant_names']
    price_ranges = row['price_ranges']
    city = row['city']
    region = row['region']
    
    
    if pd.isnull(lat) or pd.isnull(lon):
        continue
    
    # Set the marker color based on price range (take the first price range for simplicity)
    price = price_ranges[0] if price_ranges else '€€'
    color = price_colors.get(price, 'black')  
    
    # Convert all items to string before joining
    restaurant_names_str = ', '.join([str(name) for name in restaurant_names])
    price_ranges_str = ', '.join([str(price) for price in price_ranges])
    
    # Create a popup with restaurant info, including city and region
    popup_content = f"<b>{restaurant_names_str}</b><br>Price Range: {price_ranges_str}<br>City: {city}<br>Region: {region}"
    
    folium.CircleMarker(
        location=[lat, lon],
        radius=8,
        color=color,
        fill=True,
        fill_opacity=0.7,
        popup=popup_content
    ).add_to(m)

# Save the map as an HTML file
m.save("restaurants_map.html")
print("Map has been saved to 'restaurants_map.html'.")
m


Map has been saved to 'restaurants_map.html'.


* Improving the code to get the restaurants by price, but also grouped by region

In [17]:
import folium
from folium.plugins import MarkerCluster

# Create a base map centered around Rome, Italy
map_center = [41.9028, 12.4964]  #
m = folium.Map(location=map_center, zoom_start=6)

# Define a color scale for price ranges
price_colors = {
    '€': 'green',
    '€€': 'blue',
    '€€€': 'yellow',
    '€€€€': 'red'
}

# Create a MarkerCluster
marker_cluster = MarkerCluster().add_to(m)

# Group restaurants by region
for region, group in df_aggregated.groupby('region'):
    # Create a sub-cluster for each region
    region_cluster = MarkerCluster(name=region).add_to(marker_cluster)

    # Add markers for each restaurant in the region
    for _, row in group.iterrows():
        lat, lon = row['latitude'], row['longitude']
        restaurant_names = row['restaurant_names']
        price_ranges = row['price_ranges']
        city = row['city']

        # Skip rows where coordinates are missing
        if pd.isnull(lat) or pd.isnull(lon):
            continue
        
        # Set the marker color based on price range (take the first price range for simplicity)
        price = price_ranges[0] if price_ranges else '€€'
        color = price_colors.get(price, 'gray')  # Default to gray if price is not found
        
        # Convert all items to string before joining
        restaurant_names_str = ', '.join([str(name) for name in restaurant_names])
        price_ranges_str = ', '.join([str(price) for price in price_ranges])

        # Create a popup with restaurant info, including city and region
        popup_content = f"<b>{restaurant_names_str}</b><br>Price Range: {price_ranges_str}<br>City: {city}<br>Region: {region}"

        # Add a marker to the region's sub-cluster
        folium.CircleMarker(
            location=[lat, lon],
            radius=8,
            color=color,
            fill=True,
            fill_opacity=0.7,
            popup=popup_content
        ).add_to(region_cluster)


folium.LayerControl().add_to(m)
m.save("restaurants_by_region_map.html")
print("Map grouped by region has been saved to 'restaurants_by_region_map.html'.")
m


Map grouped by region has been saved to 'restaurants_by_region_map.html'.


### 4.5 Plot Top-K Restaurants: Use the custom score from Step 3 to select the top-k restaurants for display.

This map will give users an overview of restaurant options across different regions in Italy, with an indication of cost based on visual cues.

In [18]:
import folium
from folium.plugins import MarkerCluster
import pandas as pd


top_k = 10
top_restaurants = get_top_k_restaurants(
    ranked_query_result,
    df,
    user_cuisines,
    user_facilities,
    user_price,
    k=top_k
)


restaurant_ids = [doc_id for (_, doc_id) in top_restaurants]
scores = [score for (score, _) in top_restaurants]
top_restaurants_df = df.loc[restaurant_ids, ['restaurantName', 'latitude', 'longitude', 'price_ranges', 'city', 'region']]


map_center = [41.9028, 12.4964]  # Rome coordinates
m = folium.Map(location=map_center, zoom_start=6)


price_colors = {
    '€': 'green',
    '€€': 'blue',
    '€€€': 'yellow',
    '€€€€': 'red'
}


marker_cluster = MarkerCluster().add_to(m)


for _, row in top_restaurants_df.iterrows():
    lat, lon = row['latitude'], row['longitude']
    restaurant_name = row['restaurantName']
    price_ranges = row['price_ranges']
    city = row['city']
    region = row['region']

    
    if pd.isnull(lat) or pd.isnull(lon):
        continue

    
    price = price_ranges[0] if price_ranges else '€€'
    color = price_colors.get(price, 'gray') 

    
    price_ranges_str = ', '.join([str(price) for price in price_ranges])
    popup_content = f"<b>{restaurant_name}</b><br>Price Range: {price_ranges_str}<br>City: {city}<br>Region: {region}"

   
    folium.CircleMarker(
        location=[lat, lon],
        radius=8,
        color=color,
        fill=True,
        fill_opacity=0.7,
        popup=popup_content
    ).add_to(marker_cluster)

    
    print(f"Restaurant: {restaurant_name}, Price Range: {price_ranges_str}, City: {city}, Region: {region}")


folium.LayerControl().add_to(m)
m.save("top_10_restaurants_map.html")
print("Top 10 restaurants map has been saved to 'top_10_restaurants_map.html'.")
m


NameError: name 'get_top_k_restaurants' is not defined