In [25]:
!pip install lxml



# [Homework 3](https://github.com/Sapienza-University-Rome/ADM/tree/master/2024/Homework_3) - Michelin restaurants in Italy
![iStock-654454404-777x518](https://a.storyblok.com/f/125576/2448x1220/327bb24d32/hero_update_michelin.jpg/m/1224x0/filters:format(webp))

## 1. Data collection

For the data collection, we wrote the required functions in a `data_collection.py` module. 

In [1]:
from data_collection import save_links, download_html_from_link_file, html_to_tsv

The following is an overview of the main functions for each step, together with the code to run them. 

Every function has an optional `data_folder` argument, which sets the working data directory. 
We thought this would be useful, for example to use the date of the data collection as the directory name, 
since the Michelin list of restaurants is constantly updated. 

In [2]:
data_folder = 'DATA 24-11-09'
# date of last data collection, yy-mm-dd
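
The folder name can also be derived from the collection date instead of being typed by hand. The sketch below shows one way to do that; the helper name `dated_folder` is hypothetical and is not part of `data_collection.py`.

```python
from datetime import date

def dated_folder(day=None, prefix="DATA"):
    """Return a folder name such as 'DATA 24-11-09' (yy-mm-dd)."""
    day = day or date.today()  # default to the day the notebook is run
    return f"{prefix} {day:%y-%m-%d}"

print(dated_folder(date(2024, 11, 9)))  # DATA 24-11-09
```

Calling `dated_folder()` without arguments uses today's date, so every new collection run gets its own directory.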

---

### 1.1. Get the list of Michelin restaurants
   #### **Function**: `save_links`
   - **Description**: 
     Collects restaurant links from the Michelin Guide website starting from the provided `start_url`. The links are saved into a text file (`restaurant_links.txt`) within a specified data folder.
   - **Input**: 
     - `start_url`: URL of the Michelin Guide page to start scraping.
   - **Optional Input**: 
     - `file_name`: name of the output file; by default it is `restaurant_links.txt`.
     - `data_folder`: the folder where data will be stored; by default it is `DATA`.
   - **Output**:
     - A text file containing restaurant links, one per line, saved in the `data_folder`.
   - **Key Features**:
     - Automatically detects the number of pages to scrape.
     - Skips scraping if the links file already exists.
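
The "skip if the file already exists" behaviour can be sketched as follows. This is a minimal illustration and not the code in `data_collection.py`: the function name `write_links_once` is hypothetical, and the scraping step is left out so that only the file handling is shown.

```python
from pathlib import Path

def write_links_once(links, file_name="restaurant_links.txt", data_folder="DATA"):
    """Write one link per line unless the file already exists.

    Returns True if the file was written, False if it was skipped.
    """
    folder = Path(data_folder)
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / file_name
    if target.exists():
        return False  # keep the previous scrape untouched
    target.write_text("\n".join(links) + "\n", encoding="utf-8")
    return True
```

Returning a flag lets the caller report whether the links were freshly written or reused from an earlier run.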

In [3]:
start_url = "https://guide.michelin.com/en/it/restaurants"
save_links(start_url, data_folder = data_folder)

Detected number of pages: 100


Links Scraping: 100%|██████████| 100/100 [01:42<00:00,  1.03s/it]

Found 1981 restaurant links





---

### 1.2. Crawl Michelin restaurant pages
   #### **Function**: `download_html_from_link_file`
   - **Description**: 
     Downloads the HTML of every URL listed in the input `file_name` and saves the pages to a structured folder (`DATA/HTMLs/page_X`).
   - **Input (all optional)**:
     - `file_name`: name of the file with the links; by default it is `restaurant_links.txt`.
     - `data_folder`: the folder where data will be stored; by default it is `DATA`.
   - **Output**:
     - Saves the HTML files in a structured folder `DATA/HTMLs/page_X`. 
   - **Key Features**:
     - Uses `ThreadPoolExecutor` to speed up the process
     - Skips existing HTML files
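
The two key features above (a thread pool and skipping files that already exist) could look roughly like the sketch below. It is an illustration under assumptions, not the module's implementation: `download_all`, the injected `fetch` callable and the `restaurant_<index>.html` naming scheme are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def download_all(urls, out_dir, fetch, max_workers=8):
    """Save fetch(url) for every url in out_dir, skipping files that already exist.

    `fetch` is any callable mapping a URL to its HTML text (for example a thin
    wrapper around an HTTP client); it is injected so the sketch stays testable.
    Returns the number of newly written files.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    def job(indexed_url):
        index, url = indexed_url
        target = out_dir / f"restaurant_{index}.html"  # hypothetical naming scheme
        if target.exists():
            return False  # already downloaded in a previous run
        target.write_text(fetch(url), encoding="utf-8")
        return True

    # Downloads are I/O-bound, so threads overlap the waiting time
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(pool.map(job, enumerate(urls)))
```

Because existing files are skipped, an interrupted run can simply be started again and only the missing pages are fetched.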

In [4]:
download_html_from_link_file(data_folder = data_folder)

Download HTMLs: 100%|██████████| 1981/1981 [00:39<00:00, 50.59it/s]

All html files have been saved.





---

### 1.3. Parse downloaded pages

#### **Function**: `extract_info_from_html`
- **Description**:  
  Parses a restaurant's HTML page and extracts structured information such as name, address, cuisine type, price range, description, and services.
- **Input**:
  - `html`: The raw HTML content of a restaurant's page.
- **Output**:
  - A dictionary containing extracted fields.
- **Key Features**:
  - Handles missing data gracefully.
  - Handles addresses separated by commas.
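
To illustrate the comma handling, the sketch below assumes the page exposes the address as one string in the order street, city, postal code, country. That layout is an assumption made for the example (the actual page structure is not shown here), and `split_address` is a hypothetical helper.

```python
def split_address(raw):
    """Split 'street, city, postalCode, country' from the right.

    Splitting from the right keeps commas inside the street part intact.
    Every value is None when the string is missing or has too few parts.
    """
    keys = ("address", "city", "postalCode", "country")
    if not raw:
        return dict.fromkeys(keys)
    parts = [part.strip() for part in raw.rsplit(",", 3)]
    if len(parts) < 4:
        return dict.fromkeys(keys)
    return dict(zip(keys, parts))

# Made-up address with a comma inside the street part
print(split_address("Piazza Duomo 1, int. 2, Milan, 20121, Italy"))
```

Returning `None` for every field, rather than raising an error, is one way to handle missing data gracefully, so a single malformed page does not stop the whole parsing run.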


#### **Function**: `html_to_tsv`
- **Description**:  
  Scans the `HTMLs` folder inside the `data_folder` for all the html files, then processes every file with `extract_info_from_html`.
- **Input (optional)**:
  - `data_folder`: The folder where data will be stored; by default it is `DATA`.
  - `max_workers`: the max number of concurrent HTML parsing tasks. 
- **Output**:
  - Saves the TSV files in the folder `DATA/TSVs`.
- **Key Features**:
     - Uses `ThreadPoolExecutor` to speed up the process. 
- **Advice**:
     - Fine-tune the `max_workers` parameter according to your CPU performance. As a rule of thumb, set `max_workers` to the number of CPU cores available. Processing time depends on the machine: we estimated around 5 minutes, while the run recorded below took just under 12 minutes with `max_workers=4`. 
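
As an illustration of the output step, the sketch below writes one extracted dictionary to a TSV file with the standard `csv` module. It assumes one file per restaurant with a header row followed by a value row; `write_tsv` is a hypothetical name and the real `html_to_tsv` may organise its files differently.

```python
import csv
from pathlib import Path

def write_tsv(record, path):
    """Write one dict as a two-line TSV file: header row, then value row."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(record), delimiter="\t")
        writer.writeheader()
        # Missing values (None) are written as empty fields by the csv module
        writer.writerow(record)
```

Using `csv.DictWriter` rather than joining strings by hand means that tabs or quotes inside a description are escaped correctly.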

In [5]:
html_to_tsv(data_folder=data_folder, max_workers=4)

Processing HTMLs: 100%|██████████| 1981/1981 [11:44<00:00,  2.81it/s]

All files have been processed and saved.





For completeness, let us create the dataframe for our dataset, in order to handle it effectively.

#### **Function**: `create_combined_dataframe`
- **Description**:  
  This function reads all the `.tsv` files from a specified folder, loads them into individual pandas DataFrames, and then combines them into a single DataFrame. It is useful for aggregating data from multiple sources into one unified dataset for further analysis.

- **Input**:
  - `folder_path` (str): The path to the folder containing the `.tsv` files to be read.
  - `separator` (str): The delimiter used in the `.tsv` files. Typically, it's a tab (`\t`), but it could be adjusted if needed.
  
- **Output**:
  - Returns a pandas DataFrame containing all the combined data from the `.tsv` files in the specified folder.

- **Key Features**:
  - Utilizes `glob` to find all `.tsv` files in the provided folder.
  - Loads each file as a DataFrame using pandas `read_csv()` with the specified delimiter.
  - Concatenates all DataFrames into one, ignoring index to prevent duplication.
  - Efficient handling of large datasets through pandas' built-in functions.

By running this function, you'll have a consolidated view of all the restaurant data in a single DataFrame, ready for any further analysis or processing. The first few rows of the dataset are provided below.
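
The steps listed under Key Features can be sketched as follows. This is a minimal version written for illustration, assuming every file shares the same columns; `combine_tsvs` is a hypothetical name and the actual `create_combined_dataframe` may differ in its details.

```python
import glob
import os

import pandas as pd

def combine_tsvs(folder_path, separator="\t"):
    """Read every .tsv file in folder_path and stack them into one DataFrame."""
    paths = sorted(glob.glob(os.path.join(folder_path, "*.tsv")))
    if not paths:
        return pd.DataFrame()  # nothing to combine
    frames = [pd.read_csv(path, sep=separator) for path in paths]
    # ignore_index=True renumbers the rows 0..n-1 instead of repeating each file's index
    return pd.concat(frames, ignore_index=True)
```

Sorting the paths makes the row order reproducible between runs, since `glob` does not guarantee any particular order.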

In [7]:
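import os
from data_collection import create_combined_dataframe

# Build the path with os.path.join: the literal "\TSVs" contains the invalid escape "\T",
# which triggers a SyntaxWarning, and the backslash separator only works on Windows
df = create_combined_dataframe(os.path.join(data_folder, "TSVs"), separator='\t')
df.head()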


| | restaurantName | address | city | postalCode | region | country | latitude | longitude | priceRange | cuisineType | description | facilitiesServices | creditCards | phoneNumber | website |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20Tre | via David Chiossone 20 r | Genoa | 16123 | Liguria | Italy | 44.40878 | 8.933115 | €€ | Farm to table, Modern Cuisine | Run by three partners, this contemporary-style... | ['Air conditioning'] | ['amex', 'dinersclub', 'mastercard', 'visa'] | +39 010 247 6191 | https://www.ristorante20tregenova.it/ |
| 1 | Il Ristorante Alain Ducasse Napoli | Via Cristoforo Colombo 45 | Naples | 80133 | Campania | Italy | 40.840277 | 14.255592 | €€€€ | Creative, Mediterranean Cuisine | Alain Ducasse, one of the great names in conte... | ['Air conditioning', 'Great view', 'Interestin... | ['amex', 'dinersclub', 'discover', 'maestrocar... | +39 081 604 1580 | https://theromeocollection.com/en/romeo-napoli... |
| 2 | Tre Olivi | via Poseidonia 41 | Paestum | 84047 | Campania | Italy | 40.42511 | 14.98659 | €€€€ | Creative, Campanian | Oliver Glowig, German by birth but Italian by ... | ['Air conditioning', 'Car park', 'Garden or pa... | ['amex', 'mastercard', 'visa'] | +39 0828 720023 | http://www.treolivi.com |
| 3 | Tubladel | via Trebinger 22 | Ortisei | 39046 | Trentino-South Tyrol | Italy | 46.570627 | 11.678971 | €€€ | Regional Cuisine | Although this restaurant with wood-adorned din... | ['Car park', 'Terrace'] | ['amex', 'maestrocard', 'mastercard', 'visa'] | +39 0471 796879 | https://www.tubladel.com/ |
| 4 | Villa Fiordaliso | corso Zanardelli 150 | Gardone Riviera | 25083 | Lombardy | Italy | 45.622208 | 10.56996 | €€€€ | Italian Contemporary | Villa Fiordaliso is one of the beautiful early... | ['Car park', 'Garden or park', 'Great view', '... | ['amex', 'mastercard', 'visa'] | +39 0365 20158 | https://www.villafiordaliso.it |


---

## 4. Visualizing the Most Relevant Restaurants

* *4.1 Collect information on unique restaurant locations in Italy (in the format of City and Region)*

**Answer:** We found around 1,116 unique cities, each paired with its respective region.

In [52]:
# Build the list of (city, region) tuples used for the coordinates later.
# Note: this keeps one pair per restaurant row, so repeated cities are not removed.
city_region_list = list(zip(df['city'], df['region']))
print(f"City-region pairs (one per restaurant): {len(city_region_list)}")
print(f"List of cities and their regions: {city_region_list}")


City-region pairs (one per restaurant): 1981
List of cities and their regions: [('Genoa', 'Liguria'), ('Naples', 'Campania'), ('Paestum', 'Campania'), ('Ortisei', 'Trentino-South Tyrol'), ('Gardone Riviera', 'Lombardy'), ('Varese', 'Lombardy'), ('Ticciano', 'Campania'), ('Pompei', 'Campania'), ('Verona', 'Veneto'), ('Rome', 'Lazio'), ('Florence', 'Tuscany'), ('Treia', 'Marche'), ('Livigno', 'Lombardy'), ('Milan', 'Lombardy'), ('Valdieri', 'Piedmont'), ('Rapallo', 'Liguria'), ('Sorrento', 'Campania'), ('Treviso', 'Veneto'), ('Pallanza', 'Piedmont'), ('San Michele', 'Trentino-South Tyrol'), ('San Michele', 'Trentino-South Tyrol'), ('Rome', 'Lazio'), ('Imola', 'Emilia-Romagna'), ('Olevano Romano', 'Lazio'), ('Godia', 'Friuli-Venezia Giulia'), ('Spoltore', 'Abruzzo'), ('Greve in Chianti', 'Tuscany'), ('Ponza', 'Lazio'), ('Milan', 'Lombardy'), ('Turin', 'Piedmont'), ('Mortara', 'Lombardy'), ('Milan', 'Lombardy'), ('Maranello', 'Emilia-Romagna'), ('Guarene', 'Piedmont'), ('Lonigo', 'Veneto'), ('Suna', '
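
The list printed above still contains repeated pairs (for example Rome and Milan appear several times). A minimal sketch of how the distinct pairs could be extracted while keeping their original order is given below; `unique_pairs` is a hypothetical helper and is not used elsewhere in this notebook.

```python
def unique_pairs(pairs):
    """Return the distinct (city, region) pairs in first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(pairs))

sample = [("Rome", "Lazio"), ("Milan", "Lombardy"), ("Rome", "Lazio")]
print(unique_pairs(sample))  # [('Rome', 'Lazio'), ('Milan', 'Lombardy')]
```

Applied to `city_region_list`, `len(unique_pairs(city_region_list))` would give the number of distinct locations.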

* *4.2 Provide coordinates for these locations.*

**Answer:** We geocoded each location with Geopy and, because of the large number of lookups, saved the resulting coordinates to a CSV file; we then added them to the dataset for better precision. 

In [35]:
# pip3 install geopy

In [53]:
import time
import pandas as pd
from geopy.geocoders import Nominatim

# Nominatim geocoder (OpenStreetMap data); user_agent identifies this script to the service
geolocator = Nominatim(user_agent="city-region-coordinates-extractor")

# Function to get coordinates for a city-region pair located in Italy
def get_coordinates(city, region):
    try:
        query = f"{city}, {region}, Italy"
        location = geolocator.geocode(query)
        if location:
            return location.latitude, location.longitude
        else:
            return None, None
    except Exception as e:
        print(f"Error getting coordinates for {city}, {region}: {e}")
        return None, None

# Function to extract coordinates for every city-region pair in the list
def extract_coordinates(city_region_list):
    coordinates_list = []
    
    # Loop through each city-region pair and get coordinates
    for city, region in city_region_list:
        print(f"Geocoding {city}, {region}...")
        latitude, longitude = get_coordinates(city, region)
        coordinates_list.append((city, region, latitude, longitude))
        time.sleep(1)
    
    return coordinates_list

coordinates = extract_coordinates(city_region_list)

# Create a DataFrame from the list of coordinates
coordinates_df = pd.DataFrame(coordinates, columns=['city', 'region', 'latitude', 'longitude'])
coordinates_df.to_csv('city_region_coordinates.csv', index=False)
print("Coordinates were extracted and saved to 'city_region_coordinates.csv'.")


Geocoding Genoa, Liguria...
Geocoding Naples, Campania...
Geocoding Paestum, Campania...
Geocoding Ortisei, Trentino-South Tyrol...
Geocoding Gardone Riviera, Lombardy...
Geocoding Varese, Lombardy...
Geocoding Ticciano, Campania...
Geocoding Pompei, Campania...
Geocoding Verona, Veneto...
Geocoding Rome, Lazio...
Geocoding Florence, Tuscany...
Geocoding Treia, Marche...
Geocoding Livigno, Lombardy...
Geocoding Milan, Lombardy...
Geocoding Valdieri, Piedmont...
Geocoding Rapallo, Liguria...
Geocoding Sorrento, Campania...
Geocoding Treviso, Veneto...
Geocoding Pallanza, Piedmont...
Geocoding San Michele, Trentino-South Tyrol...
Geocoding San Michele, Trentino-South Tyrol...
Geocoding Rome, Lazio...
Geocoding Imola, Emilia-Romagna...
Geocoding Olevano Romano, Lazio...
Geocoding Godia, Friuli-Venezia Giulia...
Geocoding Spoltore, Abruzzo...
Geocoding Greve in Chianti, Tuscany...
Geocoding Ponza, Lazio...
Geocoding Milan, Lombardy...
Geocoding Turin, Piedmont...
Geocoding Mortara, Lombard
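
The log above shows some pairs being geocoded more than once (San Michele, Rome, Milan). With a one-second pause per request, looking up each distinct pair only once would shorten the run. The sketch below is one way to do that; `geocode_unique` is a hypothetical helper, and the geocoder is passed in as a plain callable returning `(latitude, longitude)` or `None`, so a small wrapper around `geolocator.geocode` would be needed to use it with geopy.

```python
import time

def geocode_unique(pairs, geocode, pause=1.0):
    """Geocode each distinct (city, region) pair once.

    `geocode` is any callable taking a query string and returning
    (latitude, longitude) or None.
    """
    cache = {}
    for city, region in pairs:
        key = (city, region)
        if key in cache:
            continue  # already looked up, no extra request
        try:
            cache[key] = geocode(f"{city}, {region}, Italy") or (None, None)
        except Exception as error:  # keep going if one lookup fails
            print(f"Error getting coordinates for {city}, {region}: {error}")
            cache[key] = (None, None)
        time.sleep(pause)  # stay polite towards the geocoding service
    return [(city, region, lat, lon) for (city, region), (lat, lon) in cache.items()]
```

The returned list has the same `(city, region, latitude, longitude)` layout as `coordinates_list`, with one row per distinct pair. geopy also provides a `RateLimiter` helper in `geopy.extra.rate_limiter`, which could replace the manual `time.sleep` call.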

Now we add them to the dataset.

In [58]:
# Rename the geocoded columns so they do not clash with the latitude/longitude already in df
coordinates_df.rename(columns={'latitude': 'coordinates_latitude', 'longitude': 'coordinates_longitude'}, inplace=True)

# First attempt: merge df with coordinates_df as it is.
# coordinates_df holds one row per restaurant, so a city that occurs n times
# matches n rows and each of its restaurants is repeated n times.
df_coordinates = pd.merge(df, coordinates_df, on=['city', 'region'], how='left')

# Inspect the result (note the repeated rows)
print(df_coordinates.head())

# Keep one row per restaurant, identified by its name and coordinates
df_deduplicated = df.drop_duplicates(subset=['restaurantName', 'latitude', 'longitude'])
print(df_deduplicated.head())

# Keep one geocoded row per (city, region) so that the merge cannot multiply rows
coordinates_unique = coordinates_df.drop_duplicates(subset=['city', 'region'])

# Merge again; validate raises an error if the right-hand keys are not unique
df_coordinates = pd.merge(df_deduplicated, coordinates_unique, on=['city', 'region'],
                          how='left', validate='many_to_one')
print(df_coordinates.head())

# Group by location and collect the restaurant details into lists.
# The grouping uses the restaurant-level latitude/longitude scraped from the Michelin pages.
df_aggregated = df_coordinates.groupby(['city', 'region', 'latitude', 'longitude']).agg(
    restaurant_names=('restaurantName', list),
    price_ranges=('priceRange', list),
    cuisines=('cuisineType', list),
    descriptions=('description', list),
    facilities=('facilitiesServices', list),
    credit_cards=('creditCards', list),
    phone_numbers=('phoneNumber', list),
    websites=('website', list)
).reset_index()

# Verify that the aggregation worked as expected
print(df_aggregated.head())




  restaurantName                   address   city  postalCode   region  \
0          20Tre  via David Chiossone 20 r  Genoa       16123  Liguria   
1          20Tre  via David Chiossone 20 r  Genoa       16123  Liguria   
2          20Tre  via David Chiossone 20 r  Genoa       16123  Liguria   
3          20Tre  via David Chiossone 20 r  Genoa       16123  Liguria   
4          20Tre  via David Chiossone 20 r  Genoa       16123  Liguria   

  country  latitude  longitude priceRange                    cuisineType  \
0   Italy  44.40878   8.933115         €€  Farm to table, Modern Cuisine   
1   Italy  44.40878   8.933115         €€  Farm to table, Modern Cuisine   
2   Italy  44.40878   8.933115         €€  Farm to table, Modern Cuisine   
3   Italy  44.40878   8.933115         €€  Farm to table, Modern Cuisine   
4   Italy  44.40878   8.933115         €€  Farm to table, Modern Cuisine   

                                         description    facilitiesServices  \
0  Run by three part

In [64]:
!pip install folium



* *4.3 Map Setup: Use a mapping library like plotly or folium to create a visual display of restaurants by region.*

* *4.4 Encoding Price Ranges: Incorporate a visual representation for price ranges:*
  * *Use color-coding or marker size to represent the restaurant’s price range (€, €€, €€€, €€€€).*
  * *Include a legend for interpreting price levels.*


**Answer:** Using folium, the first map (`restaurants_map.html`) shows the restaurants color-coded by price range. In the second map (`restaurants_by_region_map.html`) we additionally group the markers by region while keeping the price color-coding.

In [70]:
import folium

# Create a base map centered around Rome, Italy
map_center = [41.9028, 12.4964]  
m = folium.Map(location=map_center, zoom_start=6)

# Define a color scale for price ranges
price_colors = {
    '€': 'green',
    '€€': 'blue',
    '€€€': 'yellow',
    '€€€€': 'red'
}

# Add markers for each restaurant on the map
for _, row in df_aggregated.iterrows():
    lat, lon = row['latitude'], row['longitude']
    restaurant_names = row['restaurant_names']
    price_ranges = row['price_ranges']
    city = row['city']
    region = row['region']
    
    
    if pd.isnull(lat) or pd.isnull(lon):
        continue
    
    # Set the marker color based on price range (take the first price range for simplicity)
    price = price_ranges[0] if price_ranges else '€€'
    color = price_colors.get(price, 'black')  
    
    # Convert all items to string before joining
    restaurant_names_str = ', '.join([str(name) for name in restaurant_names])
    price_ranges_str = ', '.join([str(price) for price in price_ranges])
    
    # Create a popup with restaurant info, including city and region
    popup_content = f"<b>{restaurant_names_str}</b><br>Price Range: {price_ranges_str}<br>City: {city}<br>Region: {region}"
    
    folium.CircleMarker(
        location=[lat, lon],
        radius=8,
        color=color,
        fill=True,
        fill_opacity=0.7,
        popup=popup_content
    ).add_to(m)

# Save the map as an HTML file
m.save("restaurants_map.html")
print("Map has been saved to 'restaurants_map.html'.")


Map has been saved to 'restaurants_map.html'.
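
The task also asks for a legend explaining the price colors. One possible approach is to build a small HTML box from the same color dictionary used for the markers. The sketch below only builds the HTML string; `build_legend_html` is a hypothetical helper, and the styling values are arbitrary choices.

```python
def build_legend_html(price_colors, title="Price range"):
    """Return an HTML snippet listing each price level next to its colour."""
    rows = "".join(
        f'<div><span style="display:inline-block;width:12px;height:12px;'
        f'border-radius:50%;background:{color};margin-right:6px;"></span>{label}</div>'
        for label, color in price_colors.items()
    )
    return (
        '<div style="position:fixed;bottom:30px;left:30px;z-index:9999;'
        'background:white;padding:8px 12px;border:1px solid grey;font-size:14px;">'
        f"<b>{title}</b>{rows}</div>"
    )

legend_html = build_legend_html({'€': 'green', '€€': 'blue', '€€€': 'yellow', '€€€€': 'red'})
```

With folium, the snippet can be attached to the map before saving, for example with `m.get_root().html.add_child(folium.Element(legend_html))`.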


* Improving the code to show the restaurants by price and also grouped by region

In [69]:
import folium
from folium.plugins import MarkerCluster

# Create a base map centered around Rome, Italy
map_center = [41.9028, 12.4964]
m = folium.Map(location=map_center, zoom_start=6)

# Define a color scale for price ranges
price_colors = {
    '€': 'green',
    '€€': 'blue',
    '€€€': 'yellow',
    '€€€€': 'red'
}

# Group restaurants by region
for region, group in df_aggregated.groupby('region'):
    # Create one named cluster per region and add it directly to the map,
    # so that LayerControl can list each region as a separate layer
    region_cluster = MarkerCluster(name=region).add_to(m)

    # Add markers for each restaurant in the region
    for _, row in group.iterrows():
        lat, lon = row['latitude'], row['longitude']
        restaurant_names = row['restaurant_names']
        price_ranges = row['price_ranges']
        city = row['city']

        # Skip rows where coordinates are missing
        if pd.isnull(lat) or pd.isnull(lon):
            continue
        
        # Set the marker color based on price range (take the first price range for simplicity)
        price = price_ranges[0] if price_ranges else '€€'
        color = price_colors.get(price, 'gray')  # Default to gray if price is not found
        
        # Convert all items to string before joining
        restaurant_names_str = ', '.join([str(name) for name in restaurant_names])
        price_ranges_str = ', '.join([str(price) for price in price_ranges])

        # Create a popup with restaurant info, including city and region
        popup_content = f"<b>{restaurant_names_str}</b><br>Price Range: {price_ranges_str}<br>City: {city}<br>Region: {region}"

        # Add a marker to the region's sub-cluster
        folium.CircleMarker(
            location=[lat, lon],
            radius=8,
            color=color,
            fill=True,
            fill_opacity=0.7,
            popup=popup_content
        ).add_to(region_cluster)


folium.LayerControl().add_to(m)
m.save("restaurants_by_region_map.html")
print("Map grouped by region has been saved to 'restaurants_by_region_map.html'.")


Map grouped by region has been saved to 'restaurants_by_region_map.html'.
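
Both maps color a marker by the first price range in its group "for simplicity". When a group holds several restaurants, the most frequent price level may be a more representative choice. The sketch below shows that alternative; `representative_price` is a hypothetical helper and is not what the maps above use.

```python
from collections import Counter

def representative_price(price_ranges, default=None):
    """Return the most frequent price level in the list, or `default` if none is valid.

    Ties are resolved in favour of the level that appears first in the list.
    """
    # Ignore missing values such as NaN, which are not strings
    valid = [price for price in price_ranges if isinstance(price, str) and price]
    if not valid:
        return default
    return Counter(valid).most_common(1)[0][0]

print(representative_price(['€€', '€€€€', '€€€€']))  # €€€€
```

In the loops above, `price = representative_price(price_ranges, default='€€')` could replace the `price_ranges[0]` lookup.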
