# [Homework 3](https://github.com/Sapienza-University-Rome/ADM/tree/master/2024/Homework_3) - Michelin restaurants in Italy
![iStock-654454404-777x518](https://a.storyblok.com/f/125576/2448x1220/327bb24d32/hero_update_michelin.jpg/m/1224x0/filters:format(webp))

## 1. Data collection

For the data collection, we wrote the required functions in a `data_collection.py` module.



In [10]:
from data_collection import save_links, download_html_from_link_file, html_to_tsv

Below is an overview of the main functions for each step, together with the code to run them.

Every function has an optional `data_folder` argument that sets the working data directory.
We found this useful, for example, for naming the directory after the date of the data collection,
since the Michelin list of restaurants is constantly updated.

In [11]:
data_folder = 'DATA 11-09'
# date of last data collection
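
A minimal sketch of how such a date-stamped folder name could be generated instead of typed by hand. The helper `dated_folder_name` is hypothetical (it is not part of `data_collection.py`), and it assumes that the `11-09` in the name above is a month-day stamp.

```python
from datetime import date
from typing import Optional


def dated_folder_name(day: Optional[date] = None, prefix: str = "DATA") -> str:
    """Return a folder name such as 'DATA 11-09' (assumed month-day format)."""
    day = day or date.today()
    return f"{prefix} {day:%m-%d}"


# Hypothetical usage: name the folder after a given collection date
data_folder_example = dated_folder_name(date(2024, 11, 9))
print(data_folder_example)  # DATA 11-09
```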

---

### 1.1. Get the list of Michelin restaurants
   #### **Function**: `save_links`
   - **Description**: 
     Collects restaurant links from the Michelin Guide website starting from the provided `start_url`. The links are saved into a text file (`restaurant_links.txt`) within a specified data folder.
   - **Input**: 
     - `start_url`: URL of the Michelin Guide page to start scraping.
   - **Optional Input**: 
     - `file_name`: name of the output file; by default it is `restaurant_links.txt`.
     - `data_folder`: the folder where the data will be stored; by default it is `DATA`.
   - **Output**:
     - A text file containing restaurant links, one per line, saved in the `data_folder`.
   - **Key Features**:
     - Automatically detects the number of pages to scrape.
     - Skips scraping if the links file already exists.

In [12]:
start_url = "https://guide.michelin.com/en/it/restaurants"
save_links(start_url, data_folder = data_folder)

Links already collected.
There are 1982 link already collected
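
The "skip if the links file already exists" behaviour described above can be sketched as follows. This is not the actual `save_links` implementation: `save_links_sketch` and its `fetch_page_links` argument are hypothetical, the real scraping and page-count detection are replaced by a callable and an explicit `n_pages`, and only the caching and de-duplication logic is shown.

```python
from pathlib import Path
from typing import Callable, List


def save_links_sketch(
    fetch_page_links: Callable[[int], List[str]],  # hypothetical: returns the links found on one page
    n_pages: int,                                  # assumed to be detected beforehand
    file_name: str = "restaurant_links.txt",
    data_folder: str = "DATA",
) -> List[str]:
    """Collect links page by page unless the links file already exists."""
    out_path = Path(data_folder) / file_name

    # Skip scraping entirely if a previous run already produced the file
    if out_path.exists():
        text = out_path.read_text(encoding="utf-8")
        return [line for line in text.splitlines() if line]

    links: List[str] = []
    seen = set()
    for page in range(1, n_pages + 1):
        for link in fetch_page_links(page):
            if link not in seen:  # keep the first occurrence, drop duplicates
                seen.add(link)
                links.append(link)

    # One link per line, as in restaurant_links.txt
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text("\n".join(links) + "\n", encoding="utf-8")
    return links
```

Passing the page fetcher as an argument keeps the sketch runnable without network access; in a real run it would wrap the HTTP request and the HTML parsing.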


---

### 1.2. Crawl Michelin restaurant pages
   #### **Function**: `download_html_from_link_file`
   - **Description**: 
     Downloads the HTML of every URL listed in the input `file_name` and saves each page to a structured folder (`DATA/HTMLs/page_X`).
   - **Input (all optional)**:
     - `file_name`: name of the file with the links; by default it is `restaurant_links.txt`.
     - `data_folder`: the folder where the data will be stored; by default it is `DATA`.
   - **Output**:
     - Saves the HTML files in a structured folder `DATA/HTMLs/page_X`. 
   - **Key Features**:
     - Uses `ThreadPoolExecutor` to speed up the process
     - Skips existing HTML files

In [13]:
download_html_from_link_file(data_folder = data_folder)

Download HTMLs: 100%|██████████| 1982/1982 [00:00<00:00, 4185.80it/s]

All html files have been saved.
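
A rough sketch of the pattern described above: concurrent downloads with `ThreadPoolExecutor`, skipping files that are already on disk. It is not the module's actual code. The function `download_all_sketch`, the injected `fetch_html` callable, the `restaurant_{index}.html` file names and the grouping of `links_per_page` links per `page_X` folder are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, List, Tuple


def download_all_sketch(
    links: List[str],
    fetch_html: Callable[[str], str],  # hypothetical: returns the HTML of one URL
    data_folder: str = "DATA",
    links_per_page: int = 20,          # assumption: how links map to page_X folders
    max_workers: int = 8,
) -> int:
    """Download every link concurrently; return how many files were written."""

    def download_one(indexed_link: Tuple[int, str]) -> bool:
        index, url = indexed_link
        page = index // links_per_page + 1
        target = (
            Path(data_folder) / "HTMLs" / f"page_{page}" / f"restaurant_{index}.html"
        )
        if target.exists():  # already downloaded in a previous run
            return False
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(fetch_html(url), encoding="utf-8")
        return True

    # Threads suit this step because the work is dominated by network waits
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(pool.map(download_one, enumerate(links)))
```

Because existing files are skipped, a second call over the same links writes nothing and finishes almost immediately.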





---

### 1.3. Parse downloaded pages

#### **Function**: `extract_info_from_html`
- **Description**:  
  Parses a restaurant's HTML page and extracts structured information such as name, address, cuisine type, price range, description, and services.
- **Input**:
  - `html`: The raw HTML content of a restaurant's page.
- **Output**:
  - A dictionary containing extracted fields.
- **Key Features**:
  - Handles missing data gracefully.
  - Handles addresses separated by commas.
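
To illustrate the last two points, here is a small sketch of how a comma-separated address string could be split while tolerating missing data. It is not taken from `extract_info_from_html`: the helper `split_address`, the field names, and the assumed order (street, city, postal code, country) are hypothetical.

```python
from typing import Dict, Optional

ADDRESS_FIELDS = ["address", "city", "postal_code", "country"]  # assumed field names


def split_address(raw: Optional[str]) -> Dict[str, Optional[str]]:
    """Split 'street, city, postal code, country' into separate fields."""
    result: Dict[str, Optional[str]] = dict.fromkeys(ADDRESS_FIELDS)
    if not raw:
        return result  # missing address: every field stays None

    # Split from the right so that commas inside the street part are preserved
    parts = [part.strip() for part in raw.rsplit(",", 3)]
    if len(parts) != len(ADDRESS_FIELDS):
        # Unexpected shape: keep the raw text rather than guessing the fields
        result["address"] = raw.strip()
        return result
    return dict(zip(ADDRESS_FIELDS, parts))


print(split_address("Via Roma, 12, Torino, 10121, Italy"))
# {'address': 'Via Roma, 12', 'city': 'Torino', 'postal_code': '10121', 'country': 'Italy'}
```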


#### **Function**: `html_to_tsv`
- **Description**:  
  Scans the `HTMLs` folder inside the `data_folder` for all the HTML files, then processes every file with `extract_info_from_html`.
- **Input (optional)**:
  - `data_folder`: The folder where data will be stored; by default it is `DATA`.
  - `max_workers`: the max number of concurrent HTML parsing tasks. 
- **Output**:
  - Saves the TSV files in the folder `DATA/TSVs`.
- **Key Features**:
     - Uses `ThreadPoolExecutor` to speed up the process. 
- **Advice**:
     - Tune the `max_workers` parameter to your machine. As a rule of thumb, set `max_workers` to the number of available CPU cores. A full run is estimated to take around 5 minutes.
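
The rule of thumb above can be sketched as follows: default `max_workers` to `os.cpu_count()` and write one TSV per HTML file. This is an illustration under assumptions, not the module's code. `html_to_tsv_sketch` and its `parse` argument (standing in for `extract_info_from_html`) are hypothetical, as is the choice to name each TSV after its HTML file.

```python
import csv
import os
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, Optional


def html_to_tsv_sketch(
    parse: Callable[[str], Dict[str, object]],  # stand-in for extract_info_from_html
    data_folder: str = "DATA",
    max_workers: Optional[int] = None,
) -> int:
    """Convert every HTML file under HTMLs/ into a one-row TSV under TSVs/."""
    html_files = sorted((Path(data_folder) / "HTMLs").rglob("*.html"))
    tsv_dir = Path(data_folder) / "TSVs"
    tsv_dir.mkdir(parents=True, exist_ok=True)

    # Rule of thumb: one worker per CPU core when no value is given
    workers = max_workers or os.cpu_count() or 1

    def convert(html_path: Path) -> None:
        record = parse(html_path.read_text(encoding="utf-8"))
        tsv_path = tsv_dir / f"{html_path.stem}.tsv"
        with open(tsv_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(
                f, fieldnames=list(record), delimiter="\t", lineterminator="\n"
            )
            writer.writeheader()
            writer.writerow(record)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(convert, html_files))  # consume the iterator to surface errors
    return len(html_files)
```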

In [14]:
html_to_tsv(data_folder=data_folder, max_workers=4)

Processing HTMLs: 100%|██████████| 1982/1982 [00:00<00:00, 22717.33it/s]

All files have been processed and saved.





---

In [15]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string
import re
import glob
from tqdm import tqdm

# Text preprocessing setup
nltk.download('stopwords')
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\leox0\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [16]:
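def preprocess_text(text):
    """Lowercase, strip punctuation, drop stopwords and stem the remaining words."""
    # Missing descriptions are read by pandas as NaN (a float): treat them as empty
    if not isinstance(text, str):
        return []
    text = text.lower()
    # Remove every ASCII punctuation character
    text = re.sub(f'[{re.escape(string.punctuation)}]', '', text)
    words = text.split()
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    return words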

In [17]:
# Load all files into a single DataFrame
files = glob.glob(data_folder + '/TSVs/restaurant_*.tsv')
df_list = []

for file in tqdm(files, desc='files'):
    restaurant_id = int(file.split('_')[-1].split('.')[0])  # Extract unique ID
    data = pd.read_csv(file, sep='\t')
    data['restaurant_id'] = restaurant_id  # Add ID as a new column
    df_list.append(data)

# Combine all files into one DataFrame
df = pd.concat(df_list, ignore_index=True)

# Apply preprocessing
df['processed_description'] = df['description'].apply(preprocess_text)

files: 100%|██████████| 1982/1982 [00:04<00:00, 433.94it/s]


In [20]:
# Create Vocabulary
unique_words = pd.Series([word for words in df['processed_description'] for word in words]).unique()
vocab = {word: idx for idx, word in enumerate(unique_words)}

# Save Vocabulary
pd.DataFrame(list(vocab.items()), columns=['word', 'term_id']).to_csv('vocabulary.csv', index=False)

In [25]:
import json

# Build the inverted index: term_id -> set of IDs of the restaurants
# whose description contains the term (a set avoids duplicate IDs)
inverted_index = {}
for restaurant_id, words in zip(df['restaurant_id'], df['processed_description']):
    for word in words:
        inverted_index.setdefault(vocab[word], set()).add(int(restaurant_id))

# Convert each set to a sorted list of document IDs (sets are not JSON-serializable)
inverted_index = {term_id: sorted(ids) for term_id, ids in inverted_index.items()}

# Save the inverted index; json.dump writes the integer term IDs as string keys
with open('inverted_index.json', 'w') as f:
    json.dump(inverted_index, f)
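
As a small worked example of the structures built above, the sketch below repeats the vocabulary and inverted-index construction on made-up data and shows one way the saved index could be used. The toy documents, the `toy_*` names and the `conjunctive_query` helper are hypothetical. The JSON round trip is included because JSON object keys are always strings, so the integer term IDs must be converted back after loading.

```python
import json

# Toy stand-in for df['processed_description'], keyed by restaurant_id (made-up data)
toy_docs = {
    0: ["modern", "cuisin", "sea"],
    1: ["tradit", "cuisin"],
    2: ["sea", "view", "modern"],
}

# Vocabulary: term -> term_id, in order of first appearance
toy_vocab = {}
for words in toy_docs.values():
    for word in words:
        toy_vocab.setdefault(word, len(toy_vocab))

# Inverted index: term_id -> sorted list of restaurant IDs
toy_index = {}
for restaurant_id, words in toy_docs.items():
    for word in words:
        toy_index.setdefault(toy_vocab[word], set()).add(restaurant_id)
toy_index = {term_id: sorted(ids) for term_id, ids in toy_index.items()}

# After a JSON round trip the term IDs come back as strings: cast them to int again
reloaded = json.loads(json.dumps(toy_index))
restored = {int(term_id): ids for term_id, ids in reloaded.items()}


def conjunctive_query(terms, vocab, index):
    """Return the IDs of the documents containing every (already stemmed) term."""
    postings = [set(index.get(vocab.get(term, -1), [])) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


print(conjunctive_query(["modern", "sea"], toy_vocab, restored))  # [0, 2]
```

In a real query the input text would first go through `preprocess_text`, so that the query terms are stemmed the same way as the descriptions.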