#### NOVA IMS / BSc in Data Science / Text Mining 2024/2025
### <b>Group Project: "Solving the Hyderabadi Word Soup"</b>
#### Notebook `Web Scraping`

#### Group:
- `Miguel Matos - 20221925`
- `Nuno Leandro - 20221861`
- `Patrícia Bezerra - 20221907`
- `Rita Silva - 20221920`
- `Vasco Capão - 20221906`
#### <font color='#BFD72F'>Table of Contents </font> <a class="anchor" id='toc'></a> 
- [0. Imports](#p0)
- [1. Wikipedia Dish List Scraper](#p1)
- [2. Data Cleaning](#p2)
- [3. Export Data](#p3)

<font color='#BFD72F' size=7>0. Imports</font> <a class="anchor" id="p0"></a>

In [None]:
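# The wildcard imports supply every name used below (requests, BeautifulSoup, re, pd,
# logging, random, Retry, HTTPAdapter) together with the project helper functions
from utils.pipeline_project import *
import utils.pipeline_project as pipeline_project
from utils.functions import *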

<font color='#BFD72F' size=7>1. Wikipedia Dish List Scraper</font> <a class="anchor" id="p1"></a>

[Back to TOC](#toc)

We decided to use web scraping because extracting dish names directly from the reviews was difficult: the review text is unstructured. Scraping a list of dishes from Wikipedia gave us a dictionary of dish names, which we used to identify and extract the dishes mentioned in the reviews.
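
As a minimal sketch of how such a dictionary can be used downstream, the snippet below matches known dish names against a review with a word-boundary regular expression. The helper `find_dishes`, the sample dish list, and the sample review are illustrative assumptions and are not part of the project code.

```python
import re

def find_dishes(review, dish_names):
    """Hypothetical helper: return the known dish names that occur in a review."""
    # Try longer names first so 'chicken biryani' is preferred over 'biryani'
    ordered = sorted(set(dish_names), key=len, reverse=True)
    if not ordered:
        return []
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in ordered) + r")\b",
        re.IGNORECASE,
    )
    found = []
    for match in pattern.finditer(review):
        name = match.group(1).lower()
        if name not in found:  # keep first-seen order, drop repeats
            found.append(name)
    return found

# Illustrative data only
dish_names = ["biryani", "chicken biryani", "haleem"]
review = "The Chicken Biryani was great, but the haleem was cold."
print(find_dishes(review, dish_names))  # ['chicken biryani', 'haleem']
```

Matching on whole words avoids false hits inside longer words, at the cost of missing misspelled or inflected dish names.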

In [7]:
# Logging setup to capture errors and store them in 'scraper_errors.log' file
logging.basicConfig(filename='Scraper/scraper_errors.log', level=logging.ERROR)

# Define headers with a browser-like User-Agent (picked at random once, then reused for every request)
HEADERS = {
    'User-Agent': random.choice([
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36'
    ]),
    'Referer': 'https://www.google.com/'
}

# Create a requests session with retry logic to handle temporary failures
session = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
session.mount('https://', HTTPAdapter(max_retries=retries))

# URL of the Wikipedia page containing the list of dishes
url = "https://en.wikipedia.org/wiki/List_of_dishes"

# Send a GET request with a 10-second timeout and stop on HTTP error status codes
response = session.get(url, headers=HEADERS, timeout=10)
response.raise_for_status()

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Dictionary to store dishes by their categories
dishes_by_category = {}

# Find all div elements containing lists of dishes
divs = soup.find_all('div', {'class': 'div-col'})

# Loop through each div to extract categories and dishes
for div in divs:
    try:
        # Get the heading (category) preceding the current div, such as h2 or h3
        heading = div.find_previous(['h3', 'h2'])
        category = re.sub(r'\[edit\]', '', heading.text).strip() if heading else 'Unknown'
        
        # Extract dishes and their links within the div
        dishes = [{'name': li.text.strip(), 'link': "https://en.wikipedia.org" + li.find('a')['href']}
                  for li in div.find_all('li') if li.find('a')]
        
        # Store the extracted dishes under their respective categories
        if dishes:
            dishes_by_category[category] = dishes
    except Exception as e:
        # Log any errors encountered during category processing
        logging.error(f"Error processing category: {e}")
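
The link construction above prepends the Wikipedia host to each `href`, which assumes that every link is site-relative. As a hedged alternative, `urllib.parse.urljoin` from the standard library resolves site-relative, protocol-relative, and absolute links alike. The `BASE_URL` constant, the `resolve_link` helper, and the sample hrefs are illustrative assumptions.

```python
from urllib.parse import urljoin

# Assumed base: the page the links were scraped from
BASE_URL = "https://en.wikipedia.org/wiki/List_of_dishes"

def resolve_link(href, base=BASE_URL):
    """Hypothetical helper: turn an href found on `base` into an absolute URL."""
    return urljoin(base, href)

print(resolve_link("/wiki/List_of_almond_dishes"))
# https://en.wikipedia.org/wiki/List_of_almond_dishes
print(resolve_link("https://example.org/page"))
# https://example.org/page
```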

<font color='#BFD72F' size=7>2. Data Cleaning</font> <a class="anchor" id="p2"></a>

[Back to TOC](#toc)

In [None]:
# List to store the final data for DataFrame creation
data = []

# List of unwanted items to exclude from the scraped data
unwanted_items = ['Main page', 'Contents', 'Current events', 'Random article', 'See also']

# Loop through each category and its associated dishes
for category, dishes in dishes_by_category.items():
    for dish in dishes:
        try:
            # Fetch the dish's detail page; HTTP errors are raised and logged by the except block
            response = session.get(dish['link'], headers=HEADERS, timeout=10)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, 'html.parser')

            # Check if the page is a redirect; if so, skip it
            if soup.find('div', {'class': 'redirectText'}):
                print(f"Page {dish['link']} is a redirect. Skipping it.")
                continue

            # List to store potential dish items found on the page
            list_items = []
            
            # Loop through all 'ul' elements to extract 'li' items
            for ul in soup.select('ul'):
                # Skip 'ul' elements that are inside 'nav' or 'footer' tags
                if ul.find_parent('nav') or ul.find_parent('footer'):
                    continue
                for li in ul.find_all('li'):
                    # Clean and normalize the dish name by removing extra spaces
                    clean_dish_name = re.sub(r'\s+', ' ', li.text.strip())
                    # Only add items that are not in the unwanted list
                    if clean_dish_name not in unwanted_items:
                        list_items.append(clean_dish_name)

            # Add the cleaned dish items to the data list
            for item in set(list_items):
                data.append({'Main Category': category, 'Sub-Page': dish['name'], 'Dish': item})

        except Exception as e:
            # Log any errors encountered during scraping of a dish link
            logging.error(f"Error scraping {dish['link']}: {e}")

# Convert the collected data into a pandas DataFrame
df = pd.DataFrame(data)

In [8]:
df.head(10)

| | Main Category | Sub-Page | Dish |
|---|---|---|---|
| 0 | Lists of prepared foods | List of almond dishes | Tteok |
| 1 | Lists of prepared foods | List of almond dishes | Brunch |
| 2 | Lists of prepared foods | List of almond dishes | Candied almonds – Sweet almond snack food |
| 3 | Lists of prepared foods | List of almond dishes | Biscuit Tortoni – Italian frozen dairy dessert |
| 4 | Lists of prepared foods | List of almond dishes | Semmelwrap – Swedish almond pastry |
| 5 | Lists of prepared foods | List of almond dishes | Coconut milk |
| 6 | Lists of prepared foods | List of almond dishes | Qurabiya – Shortbread-like cookies found in th... |
| 7 | Lists of prepared foods | List of almond dishes | Brined |
| 8 | Lists of prepared foods | List of almond dishes | Pages displaying wikidata descriptions as a fa... |
| 9 | Lists of prepared foods | List of almond dishes | Praline |


In [10]:
# Keep only the dish name, dropping the description that follows the dash
df["Dish"] = df["Dish"].apply(extract_until_dash)

In [11]:
df.head(10)

| | Main Category | Sub-Page | Dish |
|---|---|---|---|
| 0 | Lists of prepared foods | List of almond dishes | Tteok |
| 1 | Lists of prepared foods | List of almond dishes | Brunch |
| 2 | Lists of prepared foods | List of almond dishes | Candied almonds |
| 3 | Lists of prepared foods | List of almond dishes | Biscuit Tortoni |
| 4 | Lists of prepared foods | List of almond dishes | Semmelwrap |
| 5 | Lists of prepared foods | List of almond dishes | Coconut milk |
| 6 | Lists of prepared foods | List of almond dishes | Qurabiya |
| 7 | Lists of prepared foods | List of almond dishes | Brined |
| 8 | Lists of prepared foods | List of almond dishes | Pages displaying wikidata descriptions as a fa... |
| 9 | Lists of prepared foods | List of almond dishes | Praline |
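
`extract_until_dash` is defined in `utils.functions` and is not shown in this notebook. Judging from the outputs above, it keeps the text that precedes the en dash separating a dish name from its Wikipedia short description. A minimal sketch under that assumption follows; the name `extract_until_dash_sketch` and the exact separator handling are hypothetical and may differ from the project's helper.

```python
def extract_until_dash_sketch(text):
    """Keep the part of `text` before the first ' – ' (en dash) separator."""
    # Assumption: name and description are separated by an en dash with spaces around it
    name, _, _ = text.partition(" – ")
    return name.strip()

print(extract_until_dash_sketch("Candied almonds – Sweet almond snack food"))  # Candied almonds
print(extract_until_dash_sketch("Praline"))  # Praline
```

Strings without a separator pass through unchanged, which matches rows such as `Praline` in the output.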


In [12]:
# Normalize dish names with the project's text preprocessing pipeline
preprocessor = pipeline_project.MainPipeline(no_stopwords=False).main_pipeline
df["Dish"] = df["Dish"].map(preprocessor)

In [13]:
# Keep the sub-page title as the category and drop the top-level heading
df = df[["Sub-Page", "Dish"]]
df = df.rename(columns={"Sub-Page": "Category"})

In [14]:
df.head()

| | Category | Dish |
|---|---|---|
| 0 | List of almond dishes | tteok |
| 1 | List of almond dishes | brunch |
| 2 | List of almond dishes | candied almond |
| 3 | List of almond dishes | biscuit tortoni |
| 4 | List of almond dishes | semmelwrap |


<font color='#BFD72F' size=7>3. Export Data</font> <a class="anchor" id="p3"></a>

[Back to TOC](#toc)

In [15]:
df.to_csv('Data/wikipedia_dishes.csv', index=False)