# 1. Load URL to Python using selenium

We use Selenium to extract data from the URL of the red wine ranking page of Vivino.com. 
Selenium lets us scrape the page as if we were a regular user, which reduces the chance of errors and of being blocked by the website. 


In [1]:
#import relevant packages used in this script
import re
from selenium import webdriver
import pandas as pd
import statistics
import csv
from time import sleep

#load chrome webdriver
driver = webdriver.Chrome()

In [2]:
#call url of vivino website
driver.get('https://www.vivino.com/explore?e=eJzLLbI1VMvNzLM1UMtNrLA1NTBQS6609fNRS7Z1DQ1SKwDKpqfZliUWZaaWJOao5Rel2KrlJ1XaqpWXRMfaGgIAb8QUBg%3D%3D')

# 2. Set up correct settings for chromedriver

The Chromedriver window is maximized because the page code of a full-screen window can differ slightly from that of a smaller window. Maximizing the window ensures that the same code is scraped for all users of our scraper. 

Secondly, the page is scrolled down a bit, since the wine Regions have to be loaded (if you check the Chromedriver window, you can see this yourself). 

In [3]:
#maximize chromedriver window 
driver.maximize_window()
#scroll down 1000 pixels so that the Regions filter gets loaded
driver.execute_script('window.scrollTo(0, 1000)')
#sleep 20 seconds to make sure the Regions are loaded
sleep(20)
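
A fixed 20-second sleep always waits the full time, even when the Regions appear sooner. As an alternative, the waiting could be done by polling: check a condition repeatedly and continue as soon as it holds. The sketch below is not part of the scraper; `wait_until` and `regions_loaded` are hypothetical names, and the condition is a stand-in that simply becomes true on the third check. In the scraper, the condition would instead look for the Regions pills in `driver.page_source` (Selenium also ships its own explicit-wait helper, `WebDriverWait`, which serves the same purpose).

```python
import time


def wait_until(condition, timeout=20.0, poll_interval=0.5):
    """Call condition() repeatedly until it is truthy or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within " + str(timeout) + " seconds")
        time.sleep(poll_interval)


# Stand-in condition for this demo: it becomes true on the third check.
# Assumption: in the scraper this would test whether the Regions are loaded.
checks = {"count": 0}


def regions_loaded():
    checks["count"] += 1
    return checks["count"] >= 3


wait_until(regions_loaded, timeout=5.0, poll_interval=0.01)
print(checks["count"])
```

With polling, the 20 seconds become an upper bound instead of a fixed cost per run.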


In [4]:
#Automatically click cookie consent button
#Not working yet


#link = driver.find_element_by_class_name("MuiButton-label jss12")
#print(link)
#link.click()

In [5]:
from bs4 import BeautifulSoup
res = driver.page_source.encode('utf-8')
soup = BeautifulSoup(res, "html.parser")

Below, the total number of wines remaining after applying our filter is extracted from the website. We find the element with the right class, split its text on spaces and select the first number, since that is the number of wines that are going to be scraped. (This total will later be used in the loop, to make sure all wines end up in the output.)
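
The split-and-select step can be tried out on a plain string first. The sketch below is only an illustration: `parse_total_wines` is a hypothetical helper name and the summary line is made up, so the real wording on the page may differ. Stripping thousands separators such as `1,234` is an addition that the scraper itself does not perform.

```python
def parse_total_wines(summary_text):
    """Return the first whole number found in a summary line.

    Assumption: the wine count is the first purely numeric word in the text.
    Thousands separators (e.g. "1,234") are removed before the check.
    """
    for word in summary_text.split():
        cleaned = word.replace(",", "")
        if cleaned.isdigit():
            return int(cleaned)
    raise ValueError("no number found in: " + repr(summary_text))


# Made-up summary line; not copied from the website.
example = "Showing 1,234 red wines between 10 and 40"
total_wines = parse_total_wines(example)
print(total_wines)
```

Raising an error when no number is found makes a changed page layout visible immediately, instead of silently continuing with a wrong total.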

Now it is time to run the loop that scrapes all the wines. The outcome of this loop is four lists containing the wine names, wine prices, numbers of reviews and average ratings of all wines from the Bourgogne region. 

First, the lists are defined, because elements cannot be appended to a list that does not exist yet.
The number of iterations is total_wines/5. The divisor is 5 because we chose a scroll range of 1000: with a scroll range of 1000, 5 wines are scraped in one run. We thus need total_wines/5 runs to scrape all wines on the website. 
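
As a worked example of this arithmetic, the sketch below computes the number of runs for a given total. The function name `scroll_rounds_needed` is hypothetical, and the default of 5 wines per run is taken from the text above. Rounding up is an assumption added here, so that a final, partly filled run is still counted when the total is not a multiple of 5.

```python
import math


def scroll_rounds_needed(total_wines, wines_per_scroll=5):
    """Number of runs needed to bring every wine into view."""
    # Round up so that a last, partly filled run is not dropped.
    return math.ceil(total_wines / wines_per_scroll)


print(scroll_rounds_needed(500))  # 500 / 5 = 100 runs
print(scroll_rounds_needed(503))  # 503 / 5 = 100.6, rounded up to 101 runs
```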

The function then returns four lists, which can be turned into variables for further analysis. 

In [6]:
#urls of the explore page, one per region
urls = [
    "https://www.vivino.com/explore?e=eJzLLbI1VMvNzLM1UMtNrLA1NTBQS6609fNRS7Z1DQ1SKwDKpqfZliUWZaaWJOao5Rel2KrlJ1XaqpWXRMcCJYuApLGFMQAadBYx",
    "https://www.vivino.com/explore?e=eJzLLbI1VMvNzLM1UMtNrLA1NTBQS6609fNRS7Z1DQ1SKwDKpqfZliUWZaaWJOao5Rel2KrlJ1XaqpWXRMcCJYuApLGFKQAadhYz",
    "https://www.vivino.com/explore?e=eJzLLbI1VMvNzLM1UMtNrLA1NTBQS6609fNRS7Z1DQ1SKwDKpqfZliUWZaaWJOao5Rel2KrlJ1XaqpWXRMcCJYuApJEpAAQ-Ffo%3D",
    "https://www.vivino.com/explore?e=eJzLLbI1VMvNzLM1UMtNrLA1NTBQS6609fNRS7Z1DQ1SKwDKpqfZliUWZaaWJOao5Rel2KrlJ1XaqpWXRMcCJYuApLGFJQAaehY3",
    "https://www.vivino.com/explore?e=eJzLLbI1VMvNzLM1UMtNrLA1NTBQS6609fNRS7Z1DQ1SKwDKpqfZliUWZaaWJOao5Rel2KrlJ1XaqpWXRMcCJYuApImBAQAaZBYn",
    "https://www.vivino.com/explore?e=eJzLLbI1VMvNzLM1UMtNrLA1NTBQS6609fNRS7Z1DQ1SKwDKpqfZliUWZaaWJOao5Rel2KrlJ1XaqpWXRMcCJYuApLGlCQAadxYz",
]

In [7]:
#create list of all regions
#the second filter group on the page holds the region pills; look it up once
region_pills = soup.find_all(attrs={"class": "filterPills__items--_grOA"})[1].find_all(attrs={"class": "pill__text--24qI1"})
total_regions_list = [pill.text for pill in region_pills[:6]]


In [8]:
#check whether list is correct
print(total_regions_list)

['Bordeaux', 'Bourgogne', 'Napa Valley', 'Piemonte', 'Rhone Valley', 'Toscana']


In [9]:
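#write function to look up the url that belongs to a region
def get_url(Region):
    #the urls are stored in the same order as the region names
    for i, region_name in enumerate(total_regions_list):
        if region_name == Region:
            return urls[i]
    #fail with a clear message instead of returning an undefined variable
    raise ValueError("Unknown region: " + str(Region))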
    

In [10]:
#maximum number of scroll rounds per region (the loop stops earlier once 500 wines are in view)
the_range = 500

In [11]:
def scroll_page(Region):    
    
    url = get_url(Region)

    driver.get(url)
    #wait for the page to load before reading its source
    sleep(5)
    res = driver.page_source.encode('utf-8')
    soup = BeautifulSoup(res, "html.parser")
    scroll_range = 0

    #the first number in the query summary is the total number of wines for this region
    all_wines = soup.find_all(attrs={"class": "querySummary__querySummary--39WP2"})[0].text
    numbers = [int(word) for word in all_wines.split() if word.isdigit()][0]
    print(numbers)
    
    scraping_round = 1
        
    for _ in range(the_range):
        res = driver.page_source.encode('utf-8')
        soup = BeautifulSoup(res, "html.parser")
        
        # all wine cards in the current view
        wine_cards = soup.find_all(attrs={"data-testid": "wineCard"})
        num_wines_view = len(wine_cards)

        # append name, price, number of reviews and average rating of each wine to the csv file of the region
        fullpath = str(Region) + ".csv"
        with open(fullpath, mode='a', newline='') as csv_file:
            writer = csv.writer(csv_file)
            for card in wine_cards:
                the_name_id = card.find_all(attrs={"class": "wineInfoVintage__vintage--VvWlU wineInfoVintage__truncate--3QAtw"})[0].text
                the_price_id = card.find_all(True, {"class": ["addToCartButton__price--qJdh4", "addToCart__subText--1pvFt addToCart__ppcPrice--ydrd5", "addToCart__subText--1pvFt addToCart__soldOut--1dP2Z"]})[0].text
                the_reviews_id = card.find_all(attrs={"class": "vivinoRating__caption--3tZeS"})[0].text
                the_rating_id = card.find_all(attrs={"class": "vivinoRating__averageValue--3Navj"})[0].text
                writer.writerow([scraping_round, Region, the_name_id, the_price_id, the_reviews_id, the_rating_id])
                
                
        scraping_round += 1
        
        # scroll down the page
        scroll_range += 1000
        driver.execute_script('window.scrollTo(0, ' + str(scroll_range) + ')')
        
        
        #recount the wines in view; soup is only re-parsed at the start of the next round,
        #so this count does not yet include the wines loaded by the scroll above
        num_wines_view = len(soup.find_all(attrs={"data-testid" : "wineCard"}))
        
        #break loop if 500 wines are scraped
        if num_wines_view >= 500:
            break
        #break loop if all wines available on page are scraped
        #if num_wines_view >= numbers:
        #    break
        # pause for 1 second
        sleep(1)

Now the function scroll_page() is run for all Regions. The output is thus 6 CSV files containing the wine names, prices, numbers of reviews and average ratings.
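
The rows in these files have no header line, so the columns have to be named when the files are read back. A minimal sketch of that step is shown below. The column names in `COLUMNS` are hypothetical labels chosen here (only their order follows the `writer.writerow` call in scroll_page()), and the sample rows are made up and stand in for the contents of a file such as Bordeaux.csv.

```python
import csv
import io

# Hypothetical column names; the scraper writes rows without a header line.
COLUMNS = ["scraping_round", "region", "name", "price", "reviews", "rating"]

# Made-up sample rows standing in for a region file written by the scraper.
sample_csv = io.StringIO(
    "1,Bordeaux,Example Wine A 2015,19.95,1234 ratings,4.1\n"
    "1,Bordeaux,Example Wine B 2016,24.50,87 ratings,3.9\n"
)

# DictReader maps every row onto the given column names.
rows = list(csv.DictReader(sample_csv, fieldnames=COLUMNS))
for row in rows:
    print(row["name"], row["rating"])
```

To read a real file, `sample_csv` would be replaced by a file opened with `open(Region + ".csv", newline='')`.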

In [12]:
def filter_file(Region):

    fullpath = Region + ".csv"
    filtered_file = Region + 'filtered.csv'

    #read all rows once; the first column holds the scraping round
    with open(fullpath, 'r', newline='') as fin:
        rows = list(csv.reader(fin))

    #every round re-writes all wines in view, so the last round is the most complete one
    answer = max(int(row[0].replace(',', '')) for row in rows)
    print(answer)

    #keep only the rows of the last scraping round
    with open(filtered_file, 'w', newline='') as fout:
        writer = csv.writer(fout)
        for row in rows:
            if int(row[0].replace(',', '')) == answer:
                writer.writerow(row)
              
                

In [None]:
for Region in total_regions_list: 
    scroll_page(Region)
    filter_file(Region)


0


In [None]:
filter_file("Bordeaux")