In [1]:
!pip install lxml



# [Homework 3](https://github.com/Sapienza-University-Rome/ADM/tree/master/2024/Homework_3) - Michelin restaurants in Italy
![iStock-654454404-777x518](https://a.storyblok.com/f/125576/2448x1220/327bb24d32/hero_update_michelin.jpg/m/1224x0/filters:format(webp))

## 1. Data collection

For the data collection, we wrote the required functions in the `data_collection.py` module. 

In [2]:
from data_collection import save_links, download_html_from_link_file, html_to_tsv

The following is an overview of the main functions for each step, together with the code to run them. 

Every function has an optional `data_folder` argument, which sets the working data directory. 
We found this useful, for example, to use the date of the data collection as the directory name, since the Michelin list of restaurants is constantly updated. 

In [3]:
data_folder = 'DATA 24-11-09'
# date of last data collection, yy-mm-dd

---

### 1.1. Get the list of Michelin restaurants
   #### **Function**: `save_links`
   - **Description**: 
     Collects restaurant links from the Michelin Guide website starting from the provided `start_url`. The links are saved into a text file (`restaurant_links.txt`) within a specified data folder.
   - **Input**: 
     - `start_url`: URL of the Michelin Guide page to start scraping.
   - **Optional Input**: 
     - `file_name`: name of the output file; by default it is `restaurant_links.txt`.
     - `data_folder`: the folder where the data will be stored; by default it is `DATA`.
   - **Output**:
     - A text file containing restaurant links, one per line, saved in the `data_folder`.
   - **Key Features**:
     - Automatically detects the number of pages to scrape.
     - Skips scraping if the links file already exists.
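
The "skip if the links file already exists" behaviour can be sketched as follows. This is a minimal illustration, not the code in `data_collection.py`: the name `save_links_sketch` and the `fetch_links` callable (standing in for the actual scraping) are hypothetical.

```python
from pathlib import Path
from typing import Callable, Iterable


def save_links_sketch(
    fetch_links: Callable[[], Iterable[str]],
    file_name: str = "restaurant_links.txt",
    data_folder: str = "DATA",
) -> int:
    """Store one link per line, unless the links file already exists.

    `fetch_links` is a hypothetical callable that performs the scraping.
    Returns the number of links stored in the file.
    """
    folder = Path(data_folder)
    folder.mkdir(parents=True, exist_ok=True)
    links_path = folder / file_name

    if links_path.exists():
        # Skip scraping and count the links collected previously.
        with links_path.open(encoding="utf-8") as f:
            return sum(1 for line in f if line.strip())

    links = list(fetch_links())
    links_path.write_text("\n".join(links) + "\n", encoding="utf-8")
    return len(links)
```

Passing the scraper in as a callable keeps the caching logic independent of the network, so a second run never touches the website.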

In [4]:
start_url = "https://guide.michelin.com/en/it/restaurants"
save_links(start_url, data_folder = data_folder)

Links already collected.
There are 1982 link already collected


---

### 1.2. Crawl Michelin restaurant pages
   #### **Function**: `download_html_from_link_file`
   - **Description**: 
     Downloads the HTML from every URL in the input `file_name`, and saves them to a structured folder (`DATA/HTMLs/page_X`).
   - **Input (all optional)**:
     - `file_name`: name of the file with the links; by default it is `restaurant_links.txt`.
     - `data_folder`: the folder where the data will be stored; by default it is `DATA`.
   - **Output**:
     - Saves the HTML files in a structured folder `DATA/HTMLs/page_X`. 
   - **Key Features**:
     - Uses `ThreadPoolExecutor` to speed up the process
     - Skips existing HTML files
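
A minimal sketch of the two key features, concurrent downloads and skipping files that already exist. The function name, the `fetch_html` callable, the `restaurant_<i>.html` file names, and the `links_per_page` grouping are assumptions made for illustration; the real module may organise the `page_X` folders differently.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, List


def download_all_sketch(
    links: List[str],
    fetch_html: Callable[[str], str],
    data_folder: str = "DATA",
    links_per_page: int = 20,  # assumed grouping of links into page_X folders
    max_workers: int = 8,
) -> int:
    """Save one HTML file per link under <data_folder>/HTMLs/page_X.

    Returns how many files were downloaded; existing files are skipped.
    """
    def job(item):
        idx, url = item
        page_dir = Path(data_folder) / "HTMLs" / f"page_{idx // links_per_page + 1}"
        page_dir.mkdir(parents=True, exist_ok=True)
        target = page_dir / f"restaurant_{idx}.html"
        if target.exists():
            return 0  # already downloaded in a previous run
        target.write_text(fetch_html(url), encoding="utf-8")
        return 1

    # Threads suit this step because each job mostly waits on the network.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(pool.map(job, enumerate(links)))
```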

In [5]:
download_html_from_link_file(data_folder = data_folder)

Download HTMLs: 100%|██████████| 1982/1982 [00:00<00:00, 8116.91it/s]

All html files have been saved.





---

### 1.3. Parse downloaded pages

#### **Function**: `extract_info_from_html`
- **Description**:  
  Parses a restaurant's HTML page and extracts structured information such as name, address, cuisine type, price range, description, and services.
- **Input**:
  - `html`: The raw HTML content of a restaurant's page.
- **Output**:
  - A dictionary containing extracted fields.
- **Key Features**:
  - Handles missing data gracefully.
  - Handles addresses separated by commas.
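
To illustrate the comma handling, here is a minimal sketch. It assumes the page exposes the address as one comma-separated string ending in city, postal code, and country; that format and the helper name `split_address` are assumptions, not taken from `data_collection.py`.

```python
def split_address(raw: str) -> dict:
    """Split 'street[, more street], city, postal code, country' into fields."""
    parts = [p.strip() for p in raw.split(",")]
    if len(parts) < 4:
        # Missing data: keep what we have and leave the other fields empty.
        return {"address": raw.strip(), "city": "", "postalCode": "", "country": ""}
    # The last three parts are fixed; everything before them is the street,
    # which may itself contain commas.
    *street, city, postal_code, country = parts
    return {
        "address": ", ".join(street),
        "city": city,
        "postalCode": postal_code,
        "country": country,
    }
```

Unpacking from the right is what lets a street such as `via Olmadella 11, località Balignano` stay in a single `address` field.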


#### **Function**: `html_to_tsv`
- **Description**:  
  Scans the `HTMLs` folder inside the `data_folder` for all the html files, then processes every file with `extract_info_from_html`.
- **Input (optional)**:
  - `data_folder`: The folder where data will be stored; by default it is `DATA`.
  - `max_workers`: the max number of concurrent HTML parsing tasks. 
- **Output**:
  - Saves the TSV files in the folder `DATA/TSVs`.
- **Key Features**:
  - Uses `ThreadPoolExecutor` to speed up the process. 
- **Advice**:
  - Tune `max_workers` to your machine: as a rule of thumb, set it to the number of available CPU cores. Processing typically takes around 5 minutes. 
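
As a rough illustration of the output step and of the rule of thumb for `max_workers`, the sketch below writes one parsed record to a TSV file with the standard `csv` module. The helper name `write_tsv`, the fallback of 4 workers, and the one-record-per-file layout are assumptions, not the actual behaviour of `html_to_tsv`.

```python
import csv
import os
from pathlib import Path

# Rule of thumb: one worker per CPU core; os.cpu_count() may return None.
default_workers = os.cpu_count() or 4


def write_tsv(record: dict, tsv_path) -> None:
    """Write a single record as a header row plus one data row, tab-separated."""
    tsv_path = Path(tsv_path)
    tsv_path.parent.mkdir(parents=True, exist_ok=True)
    with tsv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(record), delimiter="\t")
        writer.writeheader()
        writer.writerow(record)
```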

In [6]:
html_to_tsv(data_folder=data_folder, max_workers=4)

Processing HTMLs: 100%|██████████| 1982/1982 [00:00<00:00, 30441.48it/s]

All files have been processed and saved.





For completeness, let us create the dataframe for our dataset, in order to handle it effectively.

#### **Function**: `create_combined_dataframe`
- **Description**:  
  This function reads all the `.tsv` files from a specified folder, loads them into individual pandas DataFrames, and then combines them into a single DataFrame. It is useful for aggregating data from multiple sources into one unified dataset for further analysis.

- **Input**:
  - `folder_path` (str): The path to the folder containing the `.tsv` files to be read.
  - `separator` (str): The delimiter used in the `.tsv` files. Typically, it's a tab (`\t`), but it could be adjusted if needed.
  
- **Output**:
  - Returns a pandas DataFrame containing all the combined data from the `.tsv` files in the specified folder.

- **Key Features**:
  - Utilizes `glob` to find all `.tsv` files in the provided folder.
  - Loads each file as a DataFrame using pandas `read_csv()` with the specified delimiter.
  - Concatenates all DataFrames into one, ignoring index to prevent duplication.
  - Efficient handling of large datasets through pandas' built-in functions.
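
The steps listed above (`glob`, `read_csv`, `concat`) can be sketched as follows. This is an illustrative reimplementation under the hypothetical name `combine_tsvs`, not the body of `create_combined_dataframe`; sorting the file paths is an added assumption to make the row order reproducible.

```python
import glob
import os

import pandas as pd


def combine_tsvs(folder_path: str, separator: str = "\t") -> pd.DataFrame:
    """Read every .tsv file in folder_path and stack them into one DataFrame."""
    paths = sorted(glob.glob(os.path.join(folder_path, "*.tsv")))
    frames = [pd.read_csv(path, sep=separator) for path in paths]
    if not frames:
        # pd.concat raises on an empty list, so return an empty frame instead.
        return pd.DataFrame()
    # ignore_index=True renumbers the rows 0..n-1 across all files.
    return pd.concat(frames, ignore_index=True)
```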

By running this function, you'll have a consolidated view of all the restaurant data in a single DataFrame, ready for any further analysis or processing. The first few rows of the dataset are provided below.

In [7]:
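import os
from data_collection import create_combined_dataframe
# Build the path portably; the literal "\TSVs" relies on an invalid escape sequence and only works on Windows
df = create_combined_dataframe(os.path.join(data_folder, "TSVs"), separator='\t')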
df.head()

Unnamed: 0,restaurantName,address,city,postalCode,country,priceRange,cuisineType,description,facilitiesServices,creditCards,phoneNumber,website
0,20Tre,via David Chiossone 20 r,Genoa,16123,Italy,€€,"Farm to table, Modern Cuisine",Situated in the heart of Genoa’s historic cent...,['Air conditioning'],"['amex', 'dinersclub', 'mastercard', 'visa']",+39 010 247 6191,https://www.ristorante20tregenova.it/
1,Il Ristorante Alain Ducasse Napoli,Via Cristoforo Colombo 45,Naples,80133,Italy,€€€€,"Creative, Mediterranean Cuisine","Alain Ducasse, one of the great names in conte...","['Air conditioning', 'Great view', 'Interestin...","['amex', 'dinersclub', 'discover', 'maestrocar...",+39 081 604 1580,https://theromeocollection.com/en/romeo-napoli...
2,Salvo Cacciatori,via Vieusseux 12,Oneglia,18100,Italy,€€€,"Ligurian, Contemporary",This restaurant has come a long way since 1906...,"['Air conditioning', 'Restaurant offering vege...","['amex', 'discover', 'jcb', 'maestrocard', 'ma...",+39 0183 293763,https://ristorantesalvocacciatori.it/
3,Terre Alte,"via Olmadella 11, località Balignano",Longiano,47020,Italy,€€€,Seafood,One of the best-known addresses in this region...,"['Air conditioning', 'Car park', 'Terrace']","['amex', 'mastercard', 'visa']",+39 0547 666138,https://ristoranteterrealte.com/
4,Tubladel,via Trebinger 22,Ortisei,39046,Italy,€€€,Regional Cuisine,Although this restaurant with wood-adorned din...,"['Car park', 'Terrace']","['mastercard', 'visa']",+39 0471 796879,https://www.tubladel.com/


---

## 2. Search Engine

### 2.0 Preprocessing the Text

In [8]:
!pip install --upgrade nltk



In [9]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\marta\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\marta\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [10]:
from search_engine import preprocess_text
df['processedDescription'] = preprocess_text(df['description'])
df.head()

Unnamed: 0,restaurantName,address,city,postalCode,country,priceRange,cuisineType,description,facilitiesServices,creditCards,phoneNumber,website,processedDescription
0,20Tre,via David Chiossone 20 r,Genoa,16123,Italy,€€,"Farm to table, Modern Cuisine",Situated in the heart of Genoa’s historic cent...,['Air conditioning'],"['amex', 'dinersclub', 'mastercard', 'visa']",+39 010 247 6191,https://www.ristorante20tregenova.it/,"[situat, heart, genoa, histor, centr, contempo..."
1,Il Ristorante Alain Ducasse Napoli,Via Cristoforo Colombo 45,Naples,80133,Italy,€€€€,"Creative, Mediterranean Cuisine","Alain Ducasse, one of the great names in conte...","['Air conditioning', 'Great view', 'Interestin...","['amex', 'dinersclub', 'discover', 'maestrocar...",+39 081 604 1580,https://theromeocollection.com/en/romeo-napoli...,"[alain, ducass, one, great, name, contemporari..."
2,Salvo Cacciatori,via Vieusseux 12,Oneglia,18100,Italy,€€€,"Ligurian, Contemporary",This restaurant has come a long way since 1906...,"['Air conditioning', 'Restaurant offering vege...","['amex', 'discover', 'jcb', 'maestrocard', 'ma...",+39 0183 293763,https://ristorantesalvocacciatori.it/,"[restaur, come, long, way, sinc, 1906, owner, ..."
3,Terre Alte,"via Olmadella 11, località Balignano",Longiano,47020,Italy,€€€,Seafood,One of the best-known addresses in this region...,"['Air conditioning', 'Car park', 'Terrace']","['amex', 'mastercard', 'visa']",+39 0547 666138,https://ristoranteterrealte.com/,"[one, best, known, address, region, fish, enth..."
4,Tubladel,via Trebinger 22,Ortisei,39046,Italy,€€€,Regional Cuisine,Although this restaurant with wood-adorned din...,"['Car park', 'Terrace']","['mastercard', 'visa']",+39 0471 796879,https://www.tubladel.com/,"[although, restaur, wood, adorn, dine, room, r..."
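
The stemmed tokens in `processedDescription` (`situat`, `histor`, `centr`) are consistent with a pipeline that lowercases, tokenizes, removes stopwords, and stems. A minimal sketch of such a pipeline follows; it is not the actual `preprocess_text`. The regex tokenizer, the tiny hard-coded stopword set, and the choice of NLTK's `PorterStemmer` are assumptions, picked so that the example runs without downloading any NLTK resources.

```python
import re

from nltk.stem import PorterStemmer

# Tiny illustrative stopword set; the notebook downloads NLTK's full list instead.
STOPWORDS = {"a", "an", "and", "in", "of", "s", "the", "this", "with"}
_stemmer = PorterStemmer()


def preprocess(text: str) -> list:
    """Lowercase, split on non-word characters, drop stopwords, stem."""
    # \w+ keeps accented letters and digits, and splits on hyphens and apostrophes.
    tokens = re.findall(r"\w+", text.lower())
    return [_stemmer.stem(token) for token in tokens if token not in STOPWORDS]
```

Applied to a whole column, this would be used as `df['description'].apply(preprocess)`.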


### 2.1 Conjunctive Query

#### 2.1.1 Create Your Index!

In [11]:
from search_engine import create_vocabulary
vocabulary_df = create_vocabulary(df['processedDescription'])
print(vocabulary_df.head().to_string(index=False))

Loading existing vocabulary file.
 term_id      term
       0 spaghetto
       1      1969
       2    gâteau
       3      karl
       4      area


In [12]:
from search_engine import create_inverted_index
inverted_index = create_inverted_index(df['processedDescription'], vocabulary_df)

Loading inverted index from file.
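
For reference, a vocabulary and an inverted index of this shape can be built in a few lines. The sketch below uses plain dictionaries and hypothetical function names; it assigns term ids in order of first appearance, which need not match the ids shown above.

```python
def build_vocabulary(docs):
    """Map each distinct term to an integer id, in order of first appearance."""
    vocab = {}
    for tokens in docs:
        for term in tokens:
            vocab.setdefault(term, len(vocab))
    return vocab


def build_inverted_index(docs, vocab):
    """Map each term id to the sorted list of document ids containing the term."""
    index = {term_id: [] for term_id in vocab.values()}
    for doc_id, tokens in enumerate(docs):
        # set() ensures a document is listed once per term, however often it occurs.
        for term in set(tokens):
            index[vocab[term]].append(doc_id)
    return index
```

Because documents are visited in increasing `doc_id` order, every posting list comes out already sorted.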


#### 2.1.2 Execute the Query

In [13]:
from search_engine import execute_conjunctive_query
query = "modern seasonal cuisine"
documents_id = execute_conjunctive_query(query, inverted_index, vocabulary_df)
subset_df = df[['restaurantName', 'address', 'description', 'website']].iloc[sorted(documents_id)]

# Rename columns for display purposes only
subset_df = subset_df.rename(columns={
    'restaurantName': 'Restaurant Name',
    'address': 'Address',
    'description': 'Description',
    'website': 'Website'
})

# Print the DataFrame
subset_df

Unnamed: 0,Restaurant Name,Address,Description,Website
88,Ronchi Rò,località Cime di Dolegna 12,Ronchi Rò is an estate-cum-agriturismo surroun...,https://www.ronchiro.it
180,Razzo,via Andrea Doria 17/f,"A quiet restaurant with a relaxed, young and m...",https://vadoarazzo.it/
256,Flurin,Laubengasse 2,Flurin occupies an old medieval tower in Glore...,https://www.flurin.it
289,Quadri Bistrot,Via Solferino 48,"A modern bistro with a cocktail-bar, trendy de...",https://www.quadribistrot.it/
304,Materia | Spazio Cucina,via Teatro Massimo 29,The entrance to this restaurant is typical of ...,https://www.materiaspaziocucina.it/
456,Gallery Bistrot Contemporaneo,via Regina Margherita 3/b,"Modern, tasty and carefully curated cuisine, w...",
501,Ca' Del Moro,località Erbin 31,Situated within the La Collina dei Ciliegi win...,https://www.cadelmoro.wine/it
561,[àbitat],via Henry Dunant 1,"A young, enthusiastic and professional couple ...",https://www.abitatproject.it
565,Babette,via Michelangelo 17,Situated just beyond the centre of Albenga in ...,https://www.ristorantebabette.net/
586,Cappuccini Cucina San Francesco,via Cappuccini 54,"Housed in the resort of the same name, this el...",https://www.cappuccini.it/
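
A conjunctive query returns only the documents that contain every query term, which amounts to intersecting the posting lists. A minimal sketch, assuming the query has already been preprocessed into stemmed terms, `vocab` maps terms to ids, and `index` maps term ids to lists of document ids (all names are hypothetical, not those of `search_engine.py`):

```python
def conjunctive_query(query_terms, vocab, index):
    """Return the set of document ids that contain all query terms."""
    postings = []
    for term in query_terms:
        term_id = vocab.get(term)
        if term_id is None:
            # An unknown term cannot appear in any document.
            return set()
        postings.append(set(index[term_id]))
    if not postings:
        return set()
    # Starting from the shortest posting list keeps the intermediate sets small.
    postings.sort(key=len)
    return set.intersection(*postings)
```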


### 2.2 Ranked Search Engine with TF-IDF and Cosine Similarity

#### 2.2.1 Inverted Index with TF-IDF Scores

In [14]:
from search_engine import create_tfIdf_inverted_index
tfIdf_inverted_index = create_tfIdf_inverted_index(inverted_index, vocabulary_df, df['processedDescription'])
# Print the first 10 key-value pairs
for idx, (term, docs) in enumerate(tfIdf_inverted_index.items()):
    if idx < 10:
        print(f"Term: {term}, Docs: {docs}")
    else:
        break

Loading inverted index with TF-IDF scores from file.
Term: 0, Docs: [(151, 0.0282020145948564), (389, 0.043387714761317545)]
Term: 1, Docs: [(764, 0.1109738043904193)]
Term: 2, Docs: [(1797, 0.07308031020832492)]
Term: 3, Docs: [(1514, 0.03995056958055095)]
Term: 4, Docs: [(42, 0.025277979174367058), (72, 0.033260498913640864), (105, 0.008838454256771698), (115, 0.019748421229974264), (119, 0.02872497633450802), (121, 0.03009283235043697), (134, 0.01886416356296049), (141, 0.07434699757166782), (152, 0.03510830440884313), (180, 0.06652099782728173), (194, 0.03611139882052437), (210, 0.03240766560816289), (215, 0.018055699410262183), (258, 0.054952128639928384), (272, 0.014869399514333564), (276, 0.03240766560816289), (279, 0.025793856300374545), (291, 0.033260498913640864), (316, 0.029392999039961693), (341, 0.019444599364897737), (346, 0.0382999684460107), (428, 0.034159431316712244), (440, 0.03717349878583391), (451, 0.033260498913640864), (465, 0.0382999684460107), (500, 0.025793856
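
To make the structure of this index concrete, here is a minimal sketch that computes TF-IDF postings from tokenized documents. It uses one common variant, term frequency normalized by document length times `log(N / df)`; this choice is an assumption and may not reproduce the scores printed above. For brevity the sketch keys the index by term rather than by term id.

```python
import math
from collections import Counter


def tfidf_index(docs):
    """Map each term to a list of (doc_id, tf-idf score) pairs."""
    n_docs = len(docs)
    counts = [Counter(tokens) for tokens in docs]
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for count in counts for term in count)

    index = {}
    for doc_id, count in enumerate(counts):
        total = sum(count.values())
        for term, freq in count.items():
            score = (freq / total) * math.log(n_docs / doc_freq[term])
            index.setdefault(term, []).append((doc_id, score))
    return index
```

With this variant, a term that occurs in every document gets a score of 0, since `log(N / N) = 0`.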

#### 2.2.2 Execute the Ranked Query

In [15]:
import importlib, search_engine
importlib.reload(search_engine)

<module 'search_engine' from 'c:\\Users\\marta\\OneDrive\\Documenti\\GitHub\\ADM-HW3\\search_engine.py'>

In [16]:
from search_engine import execute_ranked_query
ranked_results = execute_ranked_query("modern seasonal cuisine", tfIdf_inverted_index, vocabulary_df,  df['processedDescription'], 10)
print(ranked_results)
ranked_ids = [doc_id for (doc_id,_) in ranked_results]
ranked_scores = [tfIdf_score for (_,tfIdf_score) in ranked_results]
# Take an explicit copy so the column assignment below acts on an independent DataFrame
ranked_results_df = df[['restaurantName', 'address', 'description', 'website']].iloc[ranked_ids].copy()
ranked_results_df['tfIdf_score'] = ranked_scores
ranked_results_df['tfIdf_score'] = ranked_results_df['tfIdf_score'].round(10)
ranked_results_df

[(1178, 0.9736334965517988), (127, 0.9736334965517986), (183, 0.9736334965517986), (675, 0.9736334965517986), (784, 0.9736334965517986), (1415, 0.9736334965517986), (1491, 0.9736334965517986), (1815, 0.9736334965517986), (89, 0.9736334965517985), (181, 0.9736334965517985)]


Unnamed: 0,restaurantName,address,description,website,tfIdf_score
1178,Osteria dei Vespri,piazza Croce dei Vespri 6,Situated in the palazzo where the famous ball ...,https://www.osteriadeivespri.it,0.973633
127,Hosteria La Cave Cantù,via Circonvallazione Cantù 62,"Situated within the 18C Cantù monastery, now a...",https://www.lacavecantu.it,0.973633
183,Arnaldo - Clinica Gastronomica,piazza XXIV Maggio 3,This restaurant was first awarded a Michelin s...,https://www.clinicagastronomica.com/,0.973633
675,Dattilo,Contrada Dattilo,In a world of haute cuisine still dominated by...,https://dattilo.it/,0.973633
784,All'Enoteca,via Roma 57,"A standard-bearer for regional cuisine, Davide...",https://www.davidepalluda.it/,0.973633
1415,La Leggenda dei Frati,costa San Giorgio 6/a,After an uphill approach if you’re arriving he...,https://laleggendadeifrati.it/,0.973633
1491,Osteria Numero Sette,via Andrea Costa 7,There’s been a recent change in management at ...,https://osterianumerosette.business.site,0.973633
1815,Filo,SP 583 - Località Bagnana 96,This restaurant boasts a superb lakeside locat...,https://ristorantefilo.it,0.973633
89,Seta Sushi Restaurant,corte Isolani 2,It was friendship and passion for Japanese cui...,http://www.setasushirestaurant.com,0.973633
181,SanBrite,Località Alverà,This restaurant has just a few tables amid a d...,https://www.sanbrite.it/,0.973633


In [17]:
from search_engine import cosine_similarity
print(cosine_similarity([0.1,0.2,0.5],[0.1,0.2,0.5]))

1.0000000000000002
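
The result is marginally above 1 because of floating-point rounding, and the ranked scores above likewise differ only in their last digit, so such values are best compared with a tolerance. The sketch below shows a plain-Python cosine similarity and a top-k ranking built on it; the function names are hypothetical and this is not the implementation in `search_engine.py`.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query_vec, doc_vecs, k):
    """Return the k (doc_id, score) pairs with the highest cosine similarity."""
    scored = ((doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Compare with a tolerance instead of testing for exact equality with 1.0.
same = math.isclose(cosine([0.1, 0.2, 0.5], [0.1, 0.2, 0.5]), 1.0)
```

`heapq.nlargest` avoids sorting every document when only the first `k` results are needed.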


## Algorithmic Question

In [18]:
from AQ import algorithmic_question

# Open the input file "input.txt" in read mode
with open("input.txt", 'r') as file:
    # Read the first line which indicates the number of test cases (t)
    t = int(file.readline())  # Number of test cases
    
    grid_list = []  # Initialize an empty list to store the grids for each test case
    
    # Loop through each test case (from 1 to t)
    for _ in range(t):
        # Read the number of packages (n) for the current test case
        n = int(file.readline())  # Number of packages in this test case
        grid = []  # Initialize an empty list to store the coordinates of packages in the current grid
        
        # Read the coordinates for each package in the test case
        for _ in range(n):
            # Read the x and y coordinates of the package, split by space, and convert them to integers
            x, y = map(int, file.readline().split())
            # Append the package coordinates as a list [x, y] to the grid
            grid.append([x, y])
        
        # Add the current grid (list of package coordinates) to the grid_list
        grid_list.append(grid)

    # After collecting all test cases, pass the grid_list to the algorithmic_question function
    result = algorithmic_question(grid_list)
    
    # Print the result returned by the algorithmic_question function (either "YES" with the path or "NO")
    print(result)

YES
RUUURRRRUU
NO
YES
RRRRUUU
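
The problem statement is not reproduced in this notebook. Assuming it is the classic task in which a robot starts at `(0, 0)`, may only move right (`R`) or up (`U`), and must collect every package along the lexicographically smallest path, a greedy solution can be sketched as follows. The function name and the example grids are illustrative and are not taken from `AQ.py` or `input.txt`.

```python
def lexicographically_smallest_path(packages):
    """Return the smallest R/U path visiting all packages, or None if impossible."""
    x, y = 0, 0
    path = []
    # Visiting packages in (x, y) order is the only order that can work
    # when the robot can never move left or down.
    for px, py in sorted(packages):
        if py < y:
            # Reaching this package would require moving down.
            return None
        # 'R' < 'U', so take all right moves before the up moves.
        path.append("R" * (px - x) + "U" * (py - y))
        x, y = px, py
    return "".join(path)


for grid in ([(1, 3), (1, 2), (3, 3), (5, 5), (4, 3)], [(1, 0), (0, 1)]):
    answer = lexicographically_smallest_path(grid)
    print("NO" if answer is None else f"YES\n{answer}")
```

Sorting dominates the running time, so each test case costs `O(n log n)` plus the length of the path.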

