# Team Project ODCM - Team 10

## Project 
This document contains the code written to scrape data from Vivino. The extracted data can be used to analyse the influence of the price of wine on consumer ratings. The Vivino website lists millions of different wines, so it is impossible to scrape all of them. Therefore, we created a sample of wines to scrape. This sample consists of all wines that originate from Spain and can be delivered to the Netherlands. As of 05-10-2024, this leaves a sample of 8,079 wines (this number could change over time). For each of these wines, the following data will be extracted:
- **Hyperlink**: The hyperlink of each wine, which includes the unique id of each wine. The unique wine id will later be isolated when cleaning the data.
- **Brand**: The brand that produces the wine.
- **Wine**: The name of the specific wine.
- **Rating**: The star rating of the specific wine (0-5).
- **Price**: The price of the specific wine. When the wine is on discount, the original price will be taken, not the discounted price.
- **Timestamp**: The timestamp at which the data is extracted. This is useful if the dataset will be used in future analysis, so the date and time of extraction can always be found.
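
The wine id mentioned under **Hyperlink** can be isolated with a regular expression during cleaning. The sketch below assumes that the scraped links contain a `/w/<digits>` path segment. That link shape, the helper name `extract_wine_id` and the example links are assumptions made for illustration, so the pattern should be checked against the real `hyperlink` column before it is used.

```python
import re

# Assumed link shape: ".../w/<numeric id>?...". Verify against the scraped data.
WINE_ID_PATTERN = re.compile(r"/w/(\d+)")

def extract_wine_id(hyperlink):
    """Return the numeric wine id as a string, or None if no '/w/<id>' part is found."""
    match = WINE_ID_PATTERN.search(hyperlink)
    return match.group(1) if match else None

# Hypothetical links, made up for illustration only.
print(extract_wine_id("/NL/nl/example-winery-reserva/w/1234567?year=2019"))  # prints 1234567
print(extract_wine_id("/explore"))  # prints None
```

Returning `None` instead of raising keeps the cleaning step running when a link does not match, so such rows can be inspected afterwards.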

*Important note: Since the Vivino website offers some wines multiple times, the duplicate rows are deleted later, which leaves a sample of 7,585 wines.*
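
A minimal sketch of such a deduplication step, assuming the rows are read as dictionaries (for example with `csv.DictReader`) and that two rows describe the same listing when their `hyperlink` values match. The helper name `drop_duplicate_wines` and the choice of `hyperlink` as the key are assumptions; the actual cleaning code may define duplicates differently. Comparing whole rows would not work as-is, because the timestamp is recorded per row and therefore differs between otherwise identical rows.

```python
def drop_duplicate_wines(rows, key="hyperlink"):
    """Keep the first row for each distinct value of `key`, preserving the original order."""
    seen = set()
    unique_rows = []
    for row in rows:
        value = row[key]
        if value not in seen:
            seen.add(value)
            unique_rows.append(row)
    return unique_rows

# Made-up rows for illustration; the first and third share a hyperlink.
sample_rows = [
    {"hyperlink": "/w/1", "Wine": "X 2019", "Timestamp": 1.0},
    {"hyperlink": "/w/2", "Wine": "Y 2020", "Timestamp": 2.0},
    {"hyperlink": "/w/1", "Wine": "X 2019", "Timestamp": 3.0},
]
print(len(drop_duplicate_wines(sample_rows)))  # prints 2
```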

### Code
The code runs separately for each type of wine: red, white, rose, sparkling, dessert and fortified. After the code for a type has run, a dataset is created containing all wines of that type. Later, when cleaning the data, each dataset is labelled by adding an extra column with the type. Then, all datasets are merged into one dataset containing all 8,057 wines. This dataset is cleaned further and, after removing duplicate wines, it produces a final dataset with 7,585 wines.
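
The labelling and merging described above could look roughly like the sketch below. It uses only the standard `csv` module and assumes that all per-type files share the same header row. The function name `merge_wine_files`, the column name `Type` and the output file name in the usage note are hypothetical and not taken from the actual cleaning code.

```python
import csv

def merge_wine_files(files_by_type, output_path):
    """Combine per-type csv files into one file with an extra 'Type' column.

    Returns the number of data rows written.
    """
    fieldnames = []
    merged_rows = []
    for wine_type, path in files_by_type.items():
        with open(path, newline="", encoding="utf-8") as source:
            reader = csv.DictReader(source)
            for row in reader:
                row["Type"] = wine_type  # label each row with its wine type
                merged_rows.append(row)
            # Take the column names from the first file that has a header row.
            if not fieldnames and reader.fieldnames:
                fieldnames = list(reader.fieldnames) + ["Type"]

    with open(output_path, mode="w", newline="", encoding="utf-8") as target:
        writer = csv.DictWriter(target, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(merged_rows)
    return len(merged_rows)
```

A call such as `merge_wine_files({"red": "red_wine.csv", "white": "white_wine.csv"}, "all_wines.csv")` would then produce one combined file; duplicates can be removed afterwards.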

We created a makefile to automate this process. When the makefile (located in the GitHub repository) is run, it first runs this web scraping code, creating separate datasets per type of wine. Because the separate datasets are no longer needed once they are merged, the makefile also deletes the separate datasets created by the web scraper.

### Install the packages & libraries

To extract the data, we make use of Beautiful Soup and Selenium. In order to run the code, first install and import the packages below:

In [1]:
#installing packages
!pip3 install selenium
!pip3 install webdriver_manager
!pip3 install beautifulsoup4

#libraries
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import requests
import time
import csv



## Red Wines 

In the cell below we collect the data for the red wines. To make sure the code doesn't break, we split the wines we want to extract over 4 separate urls, one per price range. These urls are then fed into the code one by one using a for loop. After scraping, the data is appended to a csv file named 'red_wine.csv'.
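
Each of these urls is a results page that loads more wines while scrolling, so the scraper keeps scrolling until the page height stops growing. The sketch below isolates that step in a helper so it could be reused for every wine type. The function name `scroll_until_stable` and the `max_rounds` safety cap are assumptions added for illustration; `driver` is expected to be a Selenium WebDriver, of which only `execute_script` is used.

```python
import time

def scroll_until_stable(driver, pause=2.0, max_rounds=1000):
    """Scroll to the bottom until the page height stops growing; return the final height."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        # Scroll down to the bottom of the page and give new content time to load.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # Give slow content one more chance before concluding the page has ended.
            time.sleep(pause)
            new_height = driver.execute_script("return document.body.scrollHeight")
            if new_height == last_height:
                break
        last_height = new_height
    return last_height
```

The cap on the number of rounds only guards against a page that never stops growing; with the default value it should not affect normal runs.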

In [4]:
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)

urls = ('https://www.vivino.com/explore?e=eJzLLbI1VMvNzLM1UMtNrLA1NFBLrrT181FLtnUNDVIrAEqmp9mWJRZlppYk5qjlF6XYpqQWJ6vlJ1XaFhRlJqeqlZdExwJVJVcWA-nUYjUwCQC3hRy5', # 0-10 euro
      'https://www.vivino.com/explore?e=eJzLLbI1VMvNzLM1NFDLTaywNTJQS6609fNRS7Z1DQ1SKwDKpqfZliUWZaaWJOao5Rel2KakFier5SdV2hYUZSanqpWXRMcCVSVXFgPp1GI1MAkAya8c6w%3D%3D', # 10-20 euro
      'https://www.vivino.com/explore?e=eJzLLbI1VMvNzLM1MlDLTaywNTFQS6609fNRS7Z1DQ1SKwDKpqfZliUWZaaWJOao5Rel2KakFier5SdV2hYUZSanqpWXRMcCVSVXFgPp1GI1MAkAyosc7g%3D%3D', # 20-40 euro
      'https://www.vivino.com/explore?e=eJzLLbI1VMvNzLM1MVDLTaywNTUwUEuutPXzUUu2dQ0NUisASqen2ZYlFmWmliTmqOUXpdimpBYnq-UnVdoWFGUmp6qVl0THAlUlVxYD6dRiNTAJAN31HSE%3D') # 40-500 euro

# Open a csv file to store the data in
with open('red_wine.csv', mode='w', newline='', encoding='utf-8') as file:
      writer = csv.writer(file)

      # Write the header row
      writer.writerow(['hyperlink','Brand', 'Wine', 'Rating', 'Reviews', 'Price','Timestamp'])

for url in urls:
      driver.get(url)
      driver.maximize_window()

      # Optional: Adding some wait time for the page to fully load if needed
      driver.implicitly_wait(20)

      # Click away the cookie banner if it is shown
      try:
            accept_cookies_button = driver.find_element(By.ID,"didomi-notice-agree-button")
            accept_cookies_button.click()
            print("Cookies accepted.")
      except Exception as e:
            print("Cookie acceptance button not found or could not be clicked:", e)

      # Infinite scroll to load more content
      scroll_pause_time = 2 # Adjust if necessary
      last_height = driver.execute_script("return document.body.scrollHeight")

      while True:
            # Scroll down to the bottom of the page
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    
            # Wait for the new page to load
            time.sleep(scroll_pause_time)
    
            # Calculate new scroll height and compare with the last scroll height
            new_height = driver.execute_script("return document.body.scrollHeight")
            if new_height == last_height:
                  time.sleep(scroll_pause_time) # Wait for the potentially new content to load
                  new_height = driver.execute_script("return document.body.scrollHeight") # Check the height once more
                  if new_height == last_height:
                        break  # Stop if we've reached the end of the page
            last_height = new_height

      # Get the final page source after all content is loaded
      page_source = driver.page_source

      # Parse the page source with BeautifulSoup
      soup = BeautifulSoup(page_source, 'html.parser')

      # Create empty lists to store the data
      hyperlink = []
      brands = []
      wines = []
      ratings = []
      reviews = []
      prices = []
      timestamp = []

      # Find all wine entries on the page
      wine_entries = soup.find_all(class_='card__card--2R5Wh wineCard__wineCardContent--3cwZt')

      for entry in wine_entries:

            # Extract hyperlink
            link_tag = entry.find('a', class_='wineCard__cardLink--3F_uB')
            if link_tag and link_tag.has_attr('href'):
                  hyperlink.append(link_tag['href'])

            # Extract brand
            brand = entry.find(class_='wineInfoVintage__truncate--3QAtw')
            if brand:
                  brands.append(brand.get_text(strip=True))

            # Extract wine name
            wine = entry.find(class_='wineInfoVintage__vintage--VvWlU wineInfoVintage__truncate--3QAtw') 
            if wine:
                  wines.append(wine.get_text(strip=True))

            # Extract rating
            rating = entry.find(class_='vivinoRating_averageValue__uDdPM')
            if rating:
                  ratings.append(rating.get_text(strip=True))
            # Extract review count
            review = entry.find(class_='vivinoRating_caption__xL84P')
            if review:
                  # Get only the first part so the word 'beoordelingen' is not scraped
                  review_text = review.get_text(strip=True)
                  review_count = review_text.split()[0]
                  reviews.append(review_count)
            
            # Check for the presence of a discount first
            discount_price_div = entry.find(class_='price_strike__mOVjZ addToCart__subText--1pvFt')
            if discount_price_div:
                  # If a discount exists, take the original price
                  discount_price_text = discount_price_div.get_text(strip=True)
                  price_only = discount_price_text.split()[-1]  # Get the last part (currency + price)
                  prices.append(price_only)  # Append the original price, not the discounted one
    
            else:  
                  # Extract currency & price if present in the addToCartButton
                  price_divs = entry.find_all(class_='addToCartButton__price--qJdh4')
                  if price_divs:  # If primary price class exists
                        for price_div in price_divs:
                              currency = price_div.find('div', class_='addToCartButton__currency--2CTNX')
                              price = price_div.find_all('div')[1]  # Assuming price is in the second div
                              full_price = f"{currency.get_text(strip=True) if currency else ''}{price.get_text(strip=True) if price else ''}"
                              prices.append(full_price)  # Save the full price
                  else:  # If not present, extract price from alternative class (online verkrijgbaar vanaf...)
                        alt_price_div = entry.find(class_='addToCart__subText--1pvFt addToCart__ppcPrice--ydrd5')
                        if alt_price_div:
                              alt_price_text = alt_price_div.get_text(strip=True)
                              price_only = alt_price_text.split()[-1]  # Get the last part (currency + price)
                              prices.append(price_only)  # Append only the price to the list
            
            # Extract timestamp
            timestamps = time.time()
            timestamp.append(timestamps)

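      # Pause between result pages so the server is not overloaded;
      # the entries above are parsed from already-downloaded HTML, so no per-wine pause is needed
      time.sleep(2)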

      # Append the rows for this url to the csv file
      with open('red_wine.csv', mode='a', newline='', encoding='utf-8') as file:
            writer = csv.writer(file)

            # Write the data rows (loop names differ from the list names so the lists are not overwritten)
            for link, brand, wine, rating, review_count, price, scraped_at in zip(hyperlink, brands, wines, ratings, reviews, prices, timestamp):
                  writer.writerow([link, brand, wine, rating, review_count, price, scraped_at])

      print("Data saved to 'red_wine.csv' successfully.")

# Close the browser once all urls have been scraped
driver.quit()

Cookies accepted.


NoSuchWindowException: Message: no such window: target window already closed
from unknown error: web view not found
  (Session info: chrome=129.0.6668.70)

## White Wines

In the cell below we collect the data for the white wines. To make sure the code doesn't break, we split the wines we want to extract over 2 separate urls, one per price range. These urls are then fed into the code one by one using a for loop. After scraping, the data is appended to a csv file named 'white_wine.csv'.
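
The cells in this notebook collect every field in its own list and zip the lists together when writing. If a wine card lacks one field (for example a rating), the lists end up with different lengths, `zip` stops at the shortest one, and values can shift to the wrong wine. A hedged alternative is to build one complete row per wine card, with an empty string for anything that is missing. The helper name `build_row` and the sample values below are made up for illustration.

```python
import csv
import io
import time

HEADER = ["hyperlink", "Brand", "Wine", "Rating", "Reviews", "Price", "Timestamp"]

def build_row(fields, timestamp=None):
    """Return one row in HEADER order; missing or None fields become ''."""
    row = [fields.get(name) or "" for name in HEADER[:-1]]
    row.append(time.time() if timestamp is None else timestamp)
    return row

# Illustrative values only; the second wine has no rating or review count.
cards = [
    {"hyperlink": "/w/1", "Brand": "A", "Wine": "X 2019",
     "Rating": "4,1", "Reviews": "120", "Price": "€9,95"},
    {"hyperlink": "/w/2", "Brand": "B", "Wine": "Y 2020", "Price": "€12,50"},
]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(HEADER)
for card in cards:
    writer.writerow(build_row(card, timestamp=0))
print(buffer.getvalue())
```

In the scraper, the dictionary for each card would be filled from the parsed elements, for example `"Rating": rating.get_text(strip=True) if rating else None`, so that every wine always yields exactly one row.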

In [6]:
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)

urls = ('https://www.vivino.com/explore?e=eJzLLbI1VMvNzLM1UMtNrLA1MlBLrrT181FLtnUNDVIrAEqmp9mWJRZlppYk5qjlF6XYJhYnq-UnVdoWFGUmp6qVl0TH2hoBNRUD6dRiNTAJAJlRHFM%3D', # 0-20 euro
        'https://www.vivino.com/explore?e=eJzLLbI1VMvNzLM1MlDLTaywNTUwUEuutPXzUUu2dQ0NUisASqen2ZYlFmWmliTmqOUXpdgmFier5SdV2hYUZSanqpWXRMfaGgE1FQPp1GI1MAkAvnccuA%3D%3D') # 20-500 euro

# Open a csv file to store the data in
with open('white_wine.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)

    # Write the header row
    writer.writerow(['hyperlink','Brand', 'Wine', 'Rating', 'Reviews', 'Price','Timestamp'])

for url in urls:
    driver.get(url)
    driver.maximize_window()

    # Optional: Adding some wait time for the page to fully load if needed
    driver.implicitly_wait(20)

    # Click away the cookie banner if it is shown
    try:
        accept_cookies_button = driver.find_element(By.ID,"didomi-notice-agree-button")
        accept_cookies_button.click()
        print("Cookies accepted.")
    except Exception as e:
        print("Cookie acceptance button not found or could not be clicked:", e)

    # Infinite scroll to load more content
    scroll_pause_time = 2 # Adjust if necessary
    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:
        # Scroll down to the bottom of the page
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    
        # Wait for the new page to load
        time.sleep(scroll_pause_time)
    
        # Calculate new scroll height and compare with the last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            time.sleep(scroll_pause_time) # Wait for the potentially new content to load
            new_height = driver.execute_script("return document.body.scrollHeight") # Check the height once more
            if new_height == last_height:
                break  # Stop if we've reached the end of the page
        last_height = new_height

    # Get the final page source after all content is loaded
    page_source = driver.page_source

    # Parse the page source with BeautifulSoup
    soup = BeautifulSoup(page_source, 'html.parser')

    # Create empty lists to store the data
    hyperlink = []
    brands = []
    wines = []
    ratings = []
    reviews = []
    prices = []
    timestamp = []

    # Find all wine entries on the page
    wine_entries = soup.find_all(class_='card__card--2R5Wh wineCard__wineCardContent--3cwZt')

    for entry in wine_entries:

        # Extract hyperlink
        link_tag = entry.find('a', class_='wineCard__cardLink--3F_uB')
        if link_tag and link_tag.has_attr('href'):
            hyperlink.append(link_tag['href'])

        # Extract brand
        brand = entry.find(class_='wineInfoVintage__truncate--3QAtw')
        if brand:
            brands.append(brand.get_text(strip=True))

        # Extract wine name
        wine = entry.find(class_='wineInfoVintage__vintage--VvWlU wineInfoVintage__truncate--3QAtw') 
        if wine:
            wines.append(wine.get_text(strip=True))

        # Extract rating
        rating = entry.find(class_='vivinoRating_averageValue__uDdPM')
        if rating:
            ratings.append(rating.get_text(strip=True))
        # Extract review count
        review = entry.find(class_='vivinoRating_caption__xL84P')
        if review:
            # Get only the first part so the word 'beoordelingen' is not scraped
            review_text = review.get_text(strip=True)
            review_count = review_text.split()[0]
            reviews.append(review_count)
            
        # Check for the presence of a discount first
        discount_price_div = entry.find(class_='price_strike__mOVjZ addToCart__subText--1pvFt')
        if discount_price_div:
            # If a discount exists, take the original price
            discount_price_text = discount_price_div.get_text(strip=True)
            price_only = discount_price_text.split()[-1]  # Get the last part (currency + price)
            prices.append(price_only)  # Append the original price, not the discounted one
    
        else:  
            # Extract currency & price if present in the addToCartButton
            price_divs = entry.find_all(class_='addToCartButton__price--qJdh4')
            if price_divs:  # If primary price class exists
                for price_div in price_divs:
                    currency = price_div.find('div', class_='addToCartButton__currency--2CTNX')
                    price = price_div.find_all('div')[1]  # Assuming price is in the second div
                    full_price = f"{currency.get_text(strip=True) if currency else ''}{price.get_text(strip=True) if price else ''}"
                    prices.append(full_price)  # Save the full price
            else:  # If not present, extract price from alternative class (online verkrijgbaar vanaf...)
                alt_price_div = entry.find(class_='addToCart__subText--1pvFt addToCart__ppcPrice--ydrd5')
                if alt_price_div:
                    alt_price_text = alt_price_div.get_text(strip=True)
                    price_only = alt_price_text.split()[-1]  # Get the last part (currency + price)
                    prices.append(price_only)  # Append only the price to the list
            
        # Extract timestamp
        timestamps = time.time()
        timestamp.append(timestamps)

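    # Pause between result pages so the server is not overloaded;
    # the entries above are parsed from already-downloaded HTML, so no per-wine pause is needed
    time.sleep(2)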

    # Append the rows for this url to the csv file
    with open('white_wine.csv', mode='a', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)

        # Write the data rows (loop names differ from the list names so the lists are not overwritten)
        for link, brand, wine, rating, review_count, price, scraped_at in zip(hyperlink, brands, wines, ratings, reviews, prices, timestamp):
            writer.writerow([link, brand, wine, rating, review_count, price, scraped_at])

    print("Data saved to 'white_wine.csv' successfully.")

# Close the browser once all urls have been scraped
driver.quit()

Cookies accepted.
Data saved to 'white_wine.csv' successfully.
Cookie acceptance button not found or could not be clicked: Message: no such element: Unable to locate element: {"method":"css selector","selector":"[id="didomi-notice-agree-button"]"}
  (Session info: chrome=129.0.6668.70); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception

## Rose Wines

In the cell below we collect the data for the rose wines. Since the number of rose wines is not extensive, it is not necessary to split the wines over separate urls. Therefore, the code is only run once with the full set of rose wines. After scraping, the data is written to a csv file named 'rose_wine.csv'.

In [6]:
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)

# Opening the 'Vivino' website
url = 'https://www.vivino.com/explore?e=eJwdi70KgCAABt_mmw1q_Ma2aAiaIsLMQkgNtb-3T1ruhuNsYAFrHAWsfFgJAfWybaBY9x2OXLeVlwxGJ7nDh4WLjgp-fhlkMm6Lk_KnS7jTMLLMd8zWET8_0d4gXQ%3D%3D'
driver.get(url)
driver.maximize_window()

# Optional: Adding some wait time for the page to fully load if needed
driver.implicitly_wait(20)  # 20 seconds 

# Click away the cookie banner if it is shown
try:
    accept_cookies_button = driver.find_element(By.ID,"didomi-notice-agree-button")
    accept_cookies_button.click()
    print("Cookies accepted.")
except Exception as e:
    print("Cookie acceptance button not found or could not be clicked:", e)

# Infinite scroll to load more content
scroll_pause_time = 2  # Adjust if necessary
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to the bottom of the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    
    # Wait for the new page to load
    time.sleep(scroll_pause_time)
    
    # Calculate new scroll height and compare with the last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
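    if new_height == last_height:
        time.sleep(scroll_pause_time)  # Wait for the potentially new content to load
        new_height = driver.execute_script("return document.body.scrollHeight")  # Check the height once more
        if new_height == last_height:
            break  # Stop if we've reached the end of the page
    last_height = new_height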

# Get the final page source after all content is loaded
page_source = driver.page_source

# Parse the page source with BeautifulSoup
soup = BeautifulSoup(page_source, 'html.parser')

# Create empty lists to store the data
hyperlink = []
brands = []
wines = []
ratings = []
reviews = []
prices = []
timestamp = []

# Find all wine entries on the page
wine_entries = soup.find_all(class_='card__card--2R5Wh wineCard__wineCardContent--3cwZt')

for entry in wine_entries:

    link_tag = entry.find('a', class_='wineCard__cardLink--3F_uB')
    if link_tag and link_tag.has_attr('href'):
        hyperlink.append(link_tag['href'])

    # Extract brand
    brand = entry.find(class_='wineInfoVintage__truncate--3QAtw')
    if brand:
        brands.append(brand.get_text(strip=True))

    # Extract wine name
    wine = entry.find(class_='wineInfoVintage__vintage--VvWlU wineInfoVintage__truncate--3QAtw') 
    if wine:
        wines.append(wine.get_text(strip=True))

    # Extract rating
    rating = entry.find(class_='vivinoRating_averageValue__uDdPM')
    if rating:
        ratings.append(rating.get_text(strip=True))
    
    # Extract review count
    review = entry.find(class_='vivinoRating_caption__xL84P')
    if review:
       # Get only the first part so the word 'beoordelingen' is not scraped
        review_text = review.get_text(strip=True)
        review_count = review_text.split()[0]
        reviews.append(review_count)

    # Check for the presence of a discount first
    discount_price_div = entry.find(class_='price_strike__mOVjZ addToCart__subText--1pvFt')
    if discount_price_div:
        # If a discount exists, take the original price
        discount_price_text = discount_price_div.get_text(strip=True)
        price_only = discount_price_text.split()[-1]  # Get the last part (currency + price)
        prices.append(price_only)  # Append the original price, not the discounted one
    
    else:  
        # Extract currency & price if present in the addToCartButton
        price_divs = entry.find_all(class_='addToCartButton__price--qJdh4')
        if price_divs:  # If primary price class exists
            for price_div in price_divs:
                currency = price_div.find('div', class_='addToCartButton__currency--2CTNX')
                price = price_div.find_all('div')[1]  # Assuming price is in the second div
                full_price = f"{currency.get_text(strip=True) if currency else ''}{price.get_text(strip=True) if price else ''}"
                prices.append(full_price)  # Save the full price
        else:  # If not present, extract price from alternative class (online verkrijgbaar vanaf...)
            alt_price_div = entry.find(class_='addToCart__subText--1pvFt addToCart__ppcPrice--ydrd5')
            if alt_price_div:
                alt_price_text = alt_price_div.get_text(strip=True)
                price_only = alt_price_text.split()[-1]  # Get the last part (currency + price)
                prices.append(price_only)  # Append only the price to the list
                
    timestamps = time.time()
    timestamp.append(timestamps)

    # Wait for 2 seconds to not overload the server
    time.sleep(2)

# Open csv file to store the data
with open('rose_wine.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)

    # Write the header row
    writer.writerow(['hyperlink','Brand', 'Wine', 'Rating', 'Reviews', 'Price','Timestamp'])

    # Write the data rows (loop names differ from the list names so the lists are not overwritten)
    for link, brand, wine, rating, review_count, price, scraped_at in zip(hyperlink, brands, wines, ratings, reviews, prices, timestamp):
        writer.writerow([link, brand, wine, rating, review_count, price, scraped_at])

print("Data saved to 'rose_wine.csv' successfully.")

# Close the browser once scraping is done
driver.quit()

NoSuchWindowException: Message: no such window: target window already closed
from unknown error: web view not found
  (Session info: chrome=129.0.6668.70)


## Sparkling Wines

In the cell below we collect the data for the sparkling wines. Since the number of sparkling wines is not extensive, it is not necessary to split the wines over separate urls. Therefore, the code is only run once with the full set of sparkling wines. After scraping, the data is written to a csv file named 'sparkling_wine.csv'.

In [8]:
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)

# Opening the 'Vivino' website
url = 'https://www.vivino.com/explore?e=eJwdi70KgCAABt_mm41o_Ma2aAiaIsLMQkgNtb-3T1ruhuNsYAFrHAWsfFgJAfWybaBY9x2OXLeVlwxGJ7nDh4WLjgp-fhlkMm6Lk_KnS7jTMLLMd8zWET8_0c4gXA%3D%3D'
driver.get(url)
driver.maximize_window()

# Optional: Adding some wait time for the page to fully load if needed
driver.implicitly_wait(20)  # 20 seconds 

# Click away the cookie banner if it is shown
try:
    accept_cookies_button = driver.find_element(By.ID,"didomi-notice-agree-button")
    accept_cookies_button.click()
    print("Cookies accepted.")
except Exception as e:
    print("Cookie acceptance button not found or could not be clicked:", e)

# Infinite scroll to load more content
scroll_pause_time = 2  # Adjust if necessary
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to the bottom of the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    
    # Wait for the new page to load
    time.sleep(scroll_pause_time)
    
    # Calculate new scroll height and compare with the last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
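    if new_height == last_height:
        time.sleep(scroll_pause_time)  # Wait for the potentially new content to load
        new_height = driver.execute_script("return document.body.scrollHeight")  # Check the height once more
        if new_height == last_height:
            break  # Stop if we've reached the end of the page
    last_height = new_height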

# Get the final page source after all content is loaded
page_source = driver.page_source

# Parse the page source with BeautifulSoup
soup = BeautifulSoup(page_source, 'html.parser')

# Create empty lists to store the data
hyperlink = []
brands = []
wines = []
ratings = []
reviews = []
prices = []
timestamp = []

# Find all wine entries on the page
wine_entries = soup.find_all(class_='card__card--2R5Wh wineCard__wineCardContent--3cwZt')

for entry in wine_entries:

    link_tag = entry.find('a', class_='wineCard__cardLink--3F_uB')
    if link_tag and link_tag.has_attr('href'):
        hyperlink.append(link_tag['href'])

    # Extract brand
    brand = entry.find(class_='wineInfoVintage__truncate--3QAtw')
    if brand:
        brands.append(brand.get_text(strip=True))

    # Extract wine name
    wine = entry.find(class_='wineInfoVintage__vintage--VvWlU wineInfoVintage__truncate--3QAtw') 
    if wine:
        wines.append(wine.get_text(strip=True))

    # Extract rating
    rating = entry.find(class_='vivinoRating_averageValue__uDdPM')
    if rating:
        ratings.append(rating.get_text(strip=True))
    
    # Extract review count
    review = entry.find(class_='vivinoRating_caption__xL84P')
    if review:
       # Get only the first part so the word 'beoordelingen' is not scraped
        review_text = review.get_text(strip=True)
        review_count = review_text.split()[0]
        reviews.append(review_count)

    # Check for the presence of a discount first
    discount_price_div = entry.find(class_='price_strike__mOVjZ addToCart__subText--1pvFt')
    if discount_price_div:
        # If a discount exists, take the original price
        discount_price_text = discount_price_div.get_text(strip=True)
        price_only = discount_price_text.split()[-1]  # Get the last part (currency + price)
        prices.append(price_only)  # Append the original price, not the discounted one
    
    else:  
        # Extract currency & price if present in the addToCartButton
        price_divs = entry.find_all(class_='addToCartButton__price--qJdh4')
        if price_divs:  # If primary price class exists
            for price_div in price_divs:
                currency = price_div.find('div', class_='addToCartButton__currency--2CTNX')
                price = price_div.find_all('div')[1]  # Assuming price is in the second div
                full_price = f"{currency.get_text(strip=True) if currency else ''}{price.get_text(strip=True) if price else ''}"
                prices.append(full_price)  # Save the full price
        else:  # If not present, extract price from alternative class (online verkrijgbaar vanaf...)
            alt_price_div = entry.find(class_='addToCart__subText--1pvFt addToCart__ppcPrice--ydrd5')
            if alt_price_div:
                alt_price_text = alt_price_div.get_text(strip=True)
                price_only = alt_price_text.split()[-1]  # Get the last part (currency + price)
                prices.append(price_only)  # Append only the price to the list
                
    timestamps = time.time()
    timestamp.append(timestamps)

    # Wait for 2 seconds to not overload the server
    time.sleep(2)

# Open csv file to store the data
with open('sparkling_wine.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)

    # Write the header row
    writer.writerow(['hyperlink','Brand', 'Wine', 'Rating', 'Reviews', 'Price','Timestamp'])

    # Write the data rows (loop names differ from the list names so the lists are not overwritten)
    for link, brand, wine, rating, review_count, price, scraped_at in zip(hyperlink, brands, wines, ratings, reviews, prices, timestamp):
        writer.writerow([link, brand, wine, rating, review_count, price, scraped_at])

print("Data saved to 'sparkling_wine.csv' successfully.")

# Close the browser once scraping is done
driver.quit()

Cookie acceptance button not found or could not be clicked: Message: no such element: Unable to locate element: {"method":"css selector","selector":"[id="didomi-notice-agree-button"]"}
  (Session info: chrome=129.0.6668.70); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception

## Dessert Wines

In the cell below we collect the data for the dessert wines. Since the number of dessert wines is not extensive, it is not necessary to split the wines over separate urls. Therefore, the code is only run once with the full set of dessert wines. After scraping, the data is written to a csv file named 'dessert_wine.csv'.

In [2]:
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)

# Opening the 'Vivino' website
url = 'https://www.vivino.com/explore?e=eJwdi70KgCAABt_mm22Ipm9si4agKSLMLITUUPt7-6TlbjjOBhawxlHAyoelEFAv2waKdd_hyHVbeclgdJI7fFi46Kjg55dBJuO2OCl_uoQ7DSOrfMdsHfHzA9IOIGA%3D'
driver.get(url)
driver.maximize_window()

# Implicit wait: element lookups retry for up to 20 seconds before failing
driver.implicitly_wait(20)

# Dismiss the cookie banner if it is shown
try:
    accept_cookies_button = driver.find_element(By.ID, "didomi-notice-agree-button")
    accept_cookies_button.click()
    print("Cookies accepted.")
except Exception as e:
    # Include the exception so the cause is visible in the output
    print("Cookie acceptance button not found or could not be clicked:", e)

# Infinite scroll to load more content
scroll_pause_time = 2  # Adjust if necessary
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to the bottom of the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    
    # Wait for the new page to load
    time.sleep(scroll_pause_time)
    
    # Calculate new scroll height and compare with the last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        # Wait once more and re-check, in case new content was still loading
        time.sleep(scroll_pause_time)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # Height is stable: we've reached the end of the page
    last_height = new_height

# Get the final page source after all content is loaded
page_source = driver.page_source

# Close the browser; everything below works on the saved page source
driver.quit()

# Parse the page source with BeautifulSoup
soup = BeautifulSoup(page_source, 'html.parser')

# Create empty lists to store the data
hyperlink = []
brands = []
wines = []
ratings = []
reviews = []
prices = []
timestamp = []

# Find all wine entries on the page
wine_entries = soup.find_all(class_='card__card--2R5Wh wineCard__wineCardContent--3cwZt')

for entry in wine_entries:

    # Each field is appended exactly once per wine (None when missing), so the
    # lists stay aligned when they are zipped into rows later

    # Extract hyperlink
    link_tag = entry.find('a', class_='wineCard__cardLink--3F_uB')
    hyperlink.append(link_tag['href'] if link_tag and link_tag.has_attr('href') else None)

    # Extract brand
    brand = entry.find(class_='wineInfoVintage__truncate--3QAtw')
    brands.append(brand.get_text(strip=True) if brand else None)

    # Extract wine name
    wine = entry.find(class_='wineInfoVintage__vintage--VvWlU wineInfoVintage__truncate--3QAtw')
    wines.append(wine.get_text(strip=True) if wine else None)

    # Extract rating
    rating = entry.find(class_='vivinoRating_averageValue__uDdPM')
    ratings.append(rating.get_text(strip=True) if rating else None)

    # Extract review count
    review = entry.find(class_='vivinoRating_caption__xL84P')
    review_text = review.get_text(strip=True) if review else ''
    # Keep only the first part so the word 'beoordelingen' is not scraped
    reviews.append(review_text.split()[0] if review_text else None)

    # Determine the price: the original (struck-through) price if the wine is
    # discounted, else the add-to-cart price, else the alternative price
    # (online verkrijgbaar vanaf...)
    price_value = None
    discount_price_div = entry.find(class_='price_strike__mOVjZ addToCart__subText--1pvFt')
    price_div = entry.find(class_='addToCartButton__price--qJdh4')
    alt_price_div = entry.find(class_='addToCart__subText--1pvFt addToCart__ppcPrice--ydrd5')
    if discount_price_div:
        # A discount exists: keep the original price, i.e. the last part (currency + price)
        parts = discount_price_div.get_text(strip=True).split()
        price_value = parts[-1] if parts else None
    elif price_div:
        currency = price_div.find('div', class_='addToCartButton__currency--2CTNX')
        inner_divs = price_div.find_all('div')
        amount = inner_divs[1] if len(inner_divs) > 1 else None  # Assuming price is in the second div
        price_value = f"{currency.get_text(strip=True) if currency else ''}{amount.get_text(strip=True) if amount else ''}" or None
    elif alt_price_div:
        parts = alt_price_div.get_text(strip=True).split()
        price_value = parts[-1] if parts else None  # Get the last part (currency + price)
    prices.append(price_value)

    # Record the extraction time (seconds since the Unix epoch)
    timestamp.append(time.time())

    # No pause is needed in this loop: it only parses the page source that was
    # already downloaded and sends no further requests to the server

# Open csv file to store the data
with open('dessert_wine.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)

    # Write the header row
    writer.writerow(['hyperlink','Brand', 'Wine', 'Rating', 'Reviews', 'Price','Timestamp'])

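    # Write the data rows. The loop variables have their own names so they do not
    # overwrite the lists being zipped; zip stops at the shortest list.
    for link, brand, wine, rating, review_count, price, scraped_at in zip(hyperlink, brands, wines, ratings, reviews, prices, timestamp):
        writer.writerow([
            link,
            brand,
            wine,
            rating,
            review_count,
            price,
            scraped_at
        ])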

print("Data saved to 'dessert_wine.csv' successfully.")

Cookies accepted.
Data saved to 'dessert_wine.csv' successfully.
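
The cell above keeps one list per column and zips the lists together when writing the file. An alternative that keeps the fields of each wine together is to build one dictionary per wine and write the dictionaries with `csv.DictWriter`. The following is a minimal sketch under assumptions, not the code used in this project: the helpers `build_row` and `rows_to_csv`, the sample values, and the in-memory buffer are hypothetical and only illustrate the idea. The column names mirror the header row written above.

```python
import csv
import io
import time

# Column names taken from the header row written by the scraper
FIELDNAMES = ['hyperlink', 'Brand', 'Wine', 'Rating', 'Reviews', 'Price', 'Timestamp']


def build_row(link=None, brand=None, wine=None, rating=None, reviews=None, price=None):
    """Hypothetical helper: one dictionary per wine; missing fields stay None."""
    return {
        'hyperlink': link,
        'Brand': brand,
        'Wine': wine,
        'Rating': rating,
        'Reviews': reviews,
        'Price': price,
        'Timestamp': time.time(),
    }


def rows_to_csv(rows):
    """Hypothetical helper: write the rows to CSV text; None becomes an empty cell."""
    buffer = io.StringIO(newline='')
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample data: the second wine has no rating and no review count
rows = [
    build_row('/w/1', 'Brand A', 'Wine A 2019', '4.1', '120', '€12,95'),
    build_row('/w/2', 'Brand B', 'Wine B', price='€8,50'),
]
csv_text = rows_to_csv(rows)
```

Because every wine produces exactly one row, a missing rating or price leaves an empty cell in that row instead of shifting the values of later wines.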


## Fortified Wines

In the cell below we collect the data for the fortified wines. Because the number of fortified wines is not extensive, it is not necessary to split them across separate URLs. The code therefore runs only once, on the full set of fortified wines. After scraping, the data is written to a CSV file named 'fortified_wine.csv'.

In [10]:
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)

# Opening the 'Vivino' website
url = 'https://www.vivino.com/explore?e=eJzLLbI1VMvNzLM1UMtNrLA1NTBQS6609fNRS7Z1DQ1SKwDKpqfZliUWZaaWJOao5Rel2CYWJ6vlJ1XaFhRlJqeqlZdEx9oamQB1FQMZqcVqYBIAxtccug%3D%3D'
driver.get(url)
driver.maximize_window()

# Implicit wait: element lookups retry for up to 20 seconds before failing
driver.implicitly_wait(20)

# Dismiss the cookie banner if it is shown
try:
    accept_cookies_button = driver.find_element(By.ID, "didomi-notice-agree-button")
    accept_cookies_button.click()
    print("Cookies accepted.")
except Exception as e:
    # Include the exception so the cause is visible in the output
    print("Cookie acceptance button not found or could not be clicked:", e)

# Infinite scroll to load more content
scroll_pause_time = 2  # Adjust if necessary
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to the bottom of the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    
    # Wait for the new page to load
    time.sleep(scroll_pause_time)
    
    # Calculate new scroll height and compare with the last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        # Wait once more and re-check, in case new content was still loading
        time.sleep(scroll_pause_time)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # Height is stable: we've reached the end of the page
    last_height = new_height

# Get the final page source after all content is loaded
page_source = driver.page_source

# Close the browser; everything below works on the saved page source
driver.quit()

# Parse the page source with BeautifulSoup
soup = BeautifulSoup(page_source, 'html.parser')

# Create empty lists to store the data
hyperlink = []
brands = []
wines = []
ratings = []
reviews = []
prices = []
timestamp = []

# Find all wine entries on the page
wine_entries = soup.find_all(class_='card__card--2R5Wh wineCard__wineCardContent--3cwZt')

for entry in wine_entries:

    # Each field is appended exactly once per wine (None when missing), so the
    # lists stay aligned when they are zipped into rows later

    # Extract hyperlink
    link_tag = entry.find('a', class_='wineCard__cardLink--3F_uB')
    hyperlink.append(link_tag['href'] if link_tag and link_tag.has_attr('href') else None)

    # Extract brand
    brand = entry.find(class_='wineInfoVintage__truncate--3QAtw')
    brands.append(brand.get_text(strip=True) if brand else None)

    # Extract wine name
    wine = entry.find(class_='wineInfoVintage__vintage--VvWlU wineInfoVintage__truncate--3QAtw')
    wines.append(wine.get_text(strip=True) if wine else None)

    # Extract rating
    rating = entry.find(class_='vivinoRating_averageValue__uDdPM')
    ratings.append(rating.get_text(strip=True) if rating else None)

    # Extract review count
    review = entry.find(class_='vivinoRating_caption__xL84P')
    review_text = review.get_text(strip=True) if review else ''
    # Keep only the first part so the word 'beoordelingen' is not scraped
    reviews.append(review_text.split()[0] if review_text else None)

    # Determine the price: the original (struck-through) price if the wine is
    # discounted, else the add-to-cart price, else the alternative price
    # (online verkrijgbaar vanaf...)
    price_value = None
    discount_price_div = entry.find(class_='price_strike__mOVjZ addToCart__subText--1pvFt')
    price_div = entry.find(class_='addToCartButton__price--qJdh4')
    alt_price_div = entry.find(class_='addToCart__subText--1pvFt addToCart__ppcPrice--ydrd5')
    if discount_price_div:
        # A discount exists: keep the original price, i.e. the last part (currency + price)
        parts = discount_price_div.get_text(strip=True).split()
        price_value = parts[-1] if parts else None
    elif price_div:
        currency = price_div.find('div', class_='addToCartButton__currency--2CTNX')
        inner_divs = price_div.find_all('div')
        amount = inner_divs[1] if len(inner_divs) > 1 else None  # Assuming price is in the second div
        price_value = f"{currency.get_text(strip=True) if currency else ''}{amount.get_text(strip=True) if amount else ''}" or None
    elif alt_price_div:
        parts = alt_price_div.get_text(strip=True).split()
        price_value = parts[-1] if parts else None  # Get the last part (currency + price)
    prices.append(price_value)

    # Record the extraction time (seconds since the Unix epoch)
    timestamp.append(time.time())

    # No pause is needed in this loop: it only parses the page source that was
    # already downloaded and sends no further requests to the server

# Open csv file to store the data
with open('fortified_wine.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)

    # Write the header row
    writer.writerow(['hyperlink','Brand', 'Wine', 'Rating', 'Reviews', 'Price','Timestamp'])

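    # Write the data rows. The loop variables have their own names so they do not
    # overwrite the lists being zipped; zip stops at the shortest list.
    for link, brand, wine, rating, review_count, price, scraped_at in zip(hyperlink, brands, wines, ratings, reviews, prices, timestamp):
        writer.writerow([
            link,
            brand,
            wine,
            rating,
            review_count,
            price,
            scraped_at
        ])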

print("Data saved to 'fortified_wine.csv' successfully.")

Cookies accepted.
Data saved to 'fortified_wine.csv' successfully.
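
With the last wine type scraped, the per-type files can be combined. The project description states that each dataset gets an extra column with the wine type before everything is merged during cleaning. That cleaning code is not shown in this section, so the following is only a rough sketch of the idea using the standard library. The function `merge_wine_files`, the column name `Type`, and any output file name are assumptions made for illustration.

```python
import csv
from pathlib import Path


def merge_wine_files(files_by_type, output_path):
    """Hypothetical helper: merge per-type CSV files into one, adding a 'Type' column.

    files_by_type maps a wine type (e.g. 'dessert') to the path of its CSV file.
    Returns the number of data rows written.
    """
    # Scraper columns plus the assumed 'Type' column
    fieldnames = ['hyperlink', 'Brand', 'Wine', 'Rating', 'Reviews', 'Price', 'Timestamp', 'Type']
    written = 0
    with open(output_path, mode='w', newline='', encoding='utf-8') as out_file:
        writer = csv.DictWriter(out_file, fieldnames=fieldnames)
        writer.writeheader()
        for wine_type, path in files_by_type.items():
            if not Path(path).exists():
                continue  # Skip types that have not been scraped yet
            with open(path, newline='', encoding='utf-8') as in_file:
                for row in csv.DictReader(in_file):
                    row['Type'] = wine_type  # Label every row with its wine type
                    writer.writerow(row)
                    written += 1
    return written
```

A call such as `merge_wine_files({'dessert': 'dessert_wine.csv', 'fortified': 'fortified_wine.csv'}, 'all_wines.csv')` would then produce one file in which every row carries its wine type; the output name `all_wines.csv` is made up for this example.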
