##  <center>Assessment 3</center>
##  <center>WebCrawler and NLP System</center>

## Task1

## Overview:

<font size="3"> The majority of wine drinkers are not experts. Most buy wine by price, label type and reputation (aka marketing) [1]. To buy wine without being disappointed, one would need a trial-and-error approach: recording metadata for each bottle (for example price, grape, blend, region and country), not forgetting the wine drinker's descriptors such as 'a subtle hint of chocolate'. This would take considerable effort, expense and discipline. What if chance could be taken out of buying wine, so that the chance of not being disappointed is better than 51%?<br><br>This project compares wine reviews and sentiment from wine experts and the average wine drinker, aka wine plebs. The project goal is to use NLP to produce a review that balances the expert's view and the average wine drinker's view. To achieve this goal, a WebCrawler takes a wine list from a popular Australian website, nicks.com.au. This website was chosen due to the large number of wine items available. The wine items chosen are red wines between \\$15&ndash;\$40, mainly to reduce the wine list; however, this price range is also assumed to be typical for the non-expert. Another WebCrawler is used to find wine reviews from the experts and non-experts. Natural Language Processing sentiment analysis is used to find key words within each review, and the sentiment of these words is then used to create a new review. This new review is a complementary decision aid for the average wine drinker that removes the impact of label type and reputation from the purchase.</font>

    

## References
[1] Spence, C. (2020). Wine psychology: basic & applied. _Cogn. Research_ _5_, 22. https://doi.org/10.1186/s41235-020-00225-6

## Task2

<font size="3">This programme consumes data from two wine websites, nicks.com.au and vivino.com. From the nicks.com.au site, 10,080 wine items are scraped with their expert reviews/descriptions. From vivino.com, the non-expert wine reviews are scraped using the wine list from nicks.com.au. These two sites were used because they have a large inventory of wines and reviews. Wines between \\$15&ndash;\$40 are selected, firstly to reduce scraping time and secondly because this price bracket is assumed to be a typical price range for most people. In principle, thousands of wines can be scraped.

The data extracted from the nicks.com.au site are wine vintage (year), wine name, price, and description/review. The price is per bottle, and the descriptions/reviews are written by experts using expert wine vocabulary. For example, for the wine 2019 30 Mile Shiraz: ‘Fresh and inky the nose tosses up a mix of crushed berries, plum and liquorice followed by subtle spicy oak and pepper notes’. The data extracted from vivino.com are the corresponding non-expert wine reviews. For the same wine, 2019 30 Mile Shiraz, non-expert reviews include ‘At Brittany’s bachelorette night on Salt spring. It’s the kind of red I like.’ and ‘Easy drinking with full meals’. The vivino.com site has multiple reviews per wine.

The nicks.com.au site was relatively easy to scrape, as it has an option to show the wines in list form. This meant only having to iterate through the list under class name ‘info’ and extract the relevant elements. The vivino.com site was more difficult for two reasons. Firstly, the pages load dynamically, i.e., content is loaded only as the page is slowly scrolled down. Secondly, to extract more than three reviews, a popup window had to be scraped. To overcome the dynamic page loading problem, a slow scroller function scrolls down the page in steps. It took considerable time to find the balance between the speed of the scroller and making sure the content is loaded. Another function checks whether the link to more reviews exists; if it does not, the reviews are scraped from the page without opening the popup. Implicit waits and sleep times were used to ensure wine names were entered in the search bar and content had loaded.
 
The vivino.com site, under ‘Terms of Use’, indicates that content can be used for non-commercial purposes only. The data cannot be licensed, sold, rented, transferred etc. The nicks.com.au site is more liberal and nonspecific with regard to data use.

The metadata used in the project is price per bottle. As per [1], it is assumed that people buy wine on price; therefore, when performing NLP analysis, price is taken into consideration to reduce potential bias. The content extractor exports the elements required to achieve the objective: vintage, wine name, price, and the reviews from wine experts and non-experts. The final data structure is a single CSV file on which the NLP analysis is performed.
</font>
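
The scroll-and-wait idea described above can be sketched without a browser. This is a minimal illustration, not the project's code: `scroll_until_stable`, `get_height`, `scroll_to_bottom`, `max_rounds` and the `FakePage` stand-in are all hypothetical names introduced here.

```python
import time


def scroll_until_stable(get_height, scroll_to_bottom, pause=3.0, max_rounds=50):
    """Scroll until the page height stops changing, or max_rounds is reached.

    get_height and scroll_to_bottom are caller-supplied callables (hypothetical
    names). Returns the final page height.
    """
    last_height = get_height()
    for _ in range(max_rounds):
        scroll_to_bottom()
        time.sleep(pause)  # give dynamically loaded content time to appear
        new_height = get_height()
        if new_height == last_height:
            break  # nothing new was loaded, so the page is complete
        last_height = new_height
    return last_height


# Stand-in for a browser page: each scroll loads more content, up to 3000 px.
class FakePage:
    def __init__(self):
        self.height = 1000

    def scroll(self):
        self.height = min(self.height + 1000, 3000)


page = FakePage()
final_height = scroll_until_stable(lambda: page.height, page.scroll, pause=0)
print(final_height)
```

With Selenium, `get_height` could wrap `driver.execute_script("return document.body.scrollHeight")` and `scroll_to_bottom` could wrap the `window.scrollTo` call. The `max_rounds` cap is an added safeguard against pages that never stop loading.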

In [None]:
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import pandas as pd

<font size="3">This section scrapes 10,080 red wine products between \\$15&ndash;\$40, collecting 'vintage', 'product', 'price' and 'rating'. The scraper then iterates through the wine list, enters each wine name (product) in the search bar of nicks.com.au, and scrapes the expert wine descriptions/reviews. The reviews are used in NLP analysis.</font>
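
As a rough sketch of how the text of one 'info' element could be split into the four fields, the helper below assumes the layout seen in the scraped output (vintage, product, rating and price on consecutive lines). The function names `parse_info_block` and `price_to_float` are hypothetical, and skipping blank lines is an assumption, not something the site guarantees.

```python
def parse_info_block(text):
    """Split the text of one 'info' element into its four fields.

    Assumes vintage, product name, rating and price appear on consecutive
    lines. Returns None when fewer than four non-empty lines are present.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) < 4:
        return None
    vintage, product, rating, price = lines[:4]
    return {'Vintage': vintage, 'Wine': product, 'Rating': rating, 'Price': price}


def price_to_float(price):
    """Convert a price string such as '$357.00' to a float."""
    return float(price.replace('$', '').replace(',', ''))


# Sample text modelled on the first row of the scraped output.
sample = "2018\nPowell & Son Barossa Valley Shiraz\n96\n$357.00"
row = parse_info_block(sample)
print(row)
print(price_to_float(row['Price']))

# An incomplete entry is rejected instead of raising IndexError.
print(parse_info_block("2018\nNameless"))
```

Returning `None` for incomplete entries lets the caller skip them, so the vintage, product, rating and price columns always have the same length.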

In [40]:
# Chrome driver setup with loading/scroll

def selem_drive(driver):
    driver.set_window_size(1024, 1000)
    # 11,571 red wines between $15-$40
    url = 'https://www.nicks.com.au/red-wines?cat=9&dir=desc&limit=60&mode=list&order=score&price=3.00-.00000'  
    driver.get(url)
    # scroll the page, wait 3 seconds and continue until page stops loading
    page_len = driver.execute_script(
        "window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
    match = False
    while not match:
        last_count = page_len
        time.sleep(3)
        page_len = driver.execute_script(
            "window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
        if last_count == page_len:
            match = True

    return driver


# get wine list as defined in driver.get(url)
vintage_list, product_list, rating_list, price_list = [], [], [], []


def get_elements_nicks(driver):
    xpath = '/html/body/div[1]/main/menu[2]/span'
    check_num_items = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, xpath)))
    # current_total = int(check_num_items.text[2:4])
    total_items = int(check_num_items.text[7:13])
    total_pages = round((total_items / 60 + 1))  # 194 pages
    tot_to_use = total_pages - 24  # reduce to 10,080 wines
    # print(total_items - current_total)

    for i in range(2, tot_to_use): # change 'tot_to_use' to 3 for testing

        driver.find_element(By.LINK_TEXT, str(i)).click()

        for res in driver.find_elements(By.CLASS_NAME, "info"):
            lines = res.text.splitlines()
            if len(lines) < 4:
                # skip incomplete entries so the four lists stay the same length
                print('Skipping incomplete entry:', lines)
                continue
            vintage_list.append(lines[0])
            product_list.append(lines[1])
            rating_list.append(lines[2])
            price_list.append(lines[3])

    # print(vintage_list, product_list, rating_list, price_list)
    #print(len(vintage_list), len(product_list), len(rating_list), len(price_list))
    df = ({'Vintage': vintage_list, 'Wine': product_list, 'Rating': rating_list, 'Price': price_list})
    wine_data = pd.DataFrame(df)
    wine_data.to_csv('nicks_wine_data_list.csv', index=False, header=True, encoding='utf-8')
    print(wine_data.head())
    

# search each wine in the list (nicks_wine_data_list.csv) and combine the list with the wine descriptions into nicks_wine_data.csv
desc_list = []


def search_elements_nicks(driver):
    wine_data = pd.read_csv("nicks_wine_data_list.csv")
    result = wine_data['Vintage'].apply(str) + ' ' + wine_data['Wine'].apply(str)

    for i in result:
        try:
            search = driver.find_element(By.XPATH, '//*[@id="search"]')
            search.send_keys(i)
            search.send_keys(Keys.RETURN)
            element = driver.find_element(By.XPATH, '//*[@id="hits"]/div/div/div/ul/li[2]/a')
            element.click()

            descs = driver.find_elements(By.XPATH, '/html/body/div[1]/main/div[1]/div[2]/div[3]')
            desc_list.append(' '.join(desc.text for desc in descs))
        except Exception as e:
            # keep one entry per wine so descriptions stay aligned with the wine list
            print('No description found for', i, '-', e)
            desc_list.append('')

    df_desc = ({'Description': desc_list})
    nicks_wine_data_desc = pd.DataFrame(df_desc)
    nicks_wine_data_desc.to_csv('nicks_wine_data_desc.csv', index=False, header=True, encoding='utf-8')

    wine_data_list = pd.read_csv("nicks_wine_data_list.csv")
    wine_data_desc = pd.read_csv("nicks_wine_data_desc.csv")
    nicks_wine_data = pd.concat([wine_data_list, wine_data_desc], axis=1)
    # wine dataset to use
    nicks_wine_data.to_csv('nicks_wine_data.csv', index=False, header=True, encoding='utf-8')
    print(nicks_wine_data.head())


if __name__ == '__main__':
    driver = webdriver.Chrome(executable_path=r'C:/seleniumChromeDriver/chromedriver_win32/chromedriver.exe')
    selem_drive(driver)
    get_elements_nicks(driver)
    search_elements_nicks(driver)

  Vintage                                     Wine Rating    Price
0    2018       Powell & Son Barossa Valley Shiraz     96  $357.00
1    2018            Brave Souls The Whaler Shiraz     96  $258.00
2    2016         Incygnes Green’s Vineyard Shiraz     96  $354.00
3    2018  Heathcote Estate Single Vineyard Shiraz     96  $479.88
4    2018     Mr Riggs Outpost Coonawarra Cabernet     96  $239.88
   Vintage                                     Wine  Rating    Price  \
0     2018       Powell & Son Barossa Valley Shiraz      96  $357.00   
1     2018            Brave Souls The Whaler Shiraz      96  $258.00   
2     2016         Incygnes Green’s Vineyard Shiraz      96  $354.00   
3     2018  Heathcote Estate Single Vineyard Shiraz      96  $479.88   
4     2018     Mr Riggs Outpost Coonawarra Cabernet      96  $239.88   

                                         Description  
0  Fruit for this wine is sourced from mature vin...  
1  New comer Julia Weirich in collaboration with ...  


<font size="3">This section uses the wine list scraped by the code above. Each wine name is entered in the search bar of vivino.com to scrape the non-expert wine reviews. The reviews are used in NLP analysis.</font>
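
One way to keep a long scraping run going when a single wine cannot be found is to handle errors per wine rather than around the whole loop. The sketch below is illustrative only: `scrape_all`, `build_query`, `fetch_reviews` and `fake_fetch` are hypothetical names, and the fetcher is a stand-in for the browser-driving code.

```python
def build_query(vintage, wine):
    """Build the search-bar text from vintage and wine name."""
    return f'{vintage} {wine}'.strip()


def scrape_all(queries, fetch_reviews):
    """Call fetch_reviews(query) for every query and keep results aligned.

    fetch_reviews is a caller-supplied function (hypothetical) that returns a
    list of review texts for one wine. A failure for one wine is recorded as
    an empty list rather than stopping the run.
    """
    results = []
    for query in queries:
        try:
            reviews = fetch_reviews(query)
        except Exception as exc:  # broad on purpose: a failure skips this wine only
            print(f'Skipping {query!r}: {exc}')
            reviews = []
        results.append({'query': query, 'reviews': reviews})
    return results


# Stand-in fetcher: pretends one wine has no search result.
def fake_fetch(query):
    if 'Missing' in query:
        raise LookupError('no search result')
    return ['Easy drinking with full meals']


queries = [build_query(2019, '30 Mile Shiraz'), build_query(2018, 'Missing Wine')]
results = scrape_all(queries, fake_fetch)
print(results)
```

Because every query produces exactly one result entry, the reviews can later be matched back to the wine list by position or by query text.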

In [None]:
import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver import ActionChains
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import pandas as pd


# dynamic webpage, slow scroll to make sure reviews are shown
def slow_scroll(driver):
    y = 512
    for timer in range(0, 20):
        driver.execute_script("window.scrollTo(0, " + str(y) + ")")
        y += 512
        time.sleep(1)

        
# check if link to more reviews exists
def check_path_exists(path):
    try:
        driver.find_element(By.LINK_TEXT, path)  # uses the module-level driver
    except NoSuchElementException:
        return False
    return True


comm_review_list = []


# Get non wine expert reviews
def get_wine_reviews(driver):
    wine_data = pd.read_csv("nicks_wine_data_list4.csv")
    result = wine_data['Vintage'].apply(str) + ' ' + wine_data['Wine'].apply(str)
    count = 0

    for i in result:
        try:
            search = driver.find_element(
                By.XPATH, '//*[@id="navigation-container"]/div/nav/div[1]/div/div/div/form/input')
            search.send_keys(i)
            time.sleep(1)
            search.send_keys(Keys.RETURN)
            time.sleep(1)
            element_wine = driver.find_element(
                By.XPATH, '/html/body/div[3]/section[1]/div/div/div/div[1]/div/div[1]/div/div[1]/a/figure')
            element_wine.click()
            time.sleep(2)
            slow_scroll(driver)
            driver.implicitly_wait(5)
            count += 1
            print(count)  # number of wines visited so far, in case the run crashes

            path_check = check_path_exists('Show more reviews')

            if path_check is False:
                # no popup link: scrape the reviews shown on the page
                data = driver.find_elements(By.XPATH, '//*[@id="all_reviews"]/div[2]/div[1]')
                for res in data:
                    comm_review_list.append(res.text)

            else:
                driver.find_element(By.LINK_TEXT, 'Show more reviews').click()
                comm_review = driver.find_element(By.CLASS_NAME, 'allReviews__header--1AKxx')
                actions = ActionChains(driver)
                actions.move_to_element(comm_review).click().perform()

                # press END repeatedly so the popup loads more reviews
                for _ in range(50):
                    actions.send_keys(Keys.END).perform()

                data = driver.find_elements(By.CLASS_NAME, 'allReviews__reviews--EpUem')
                for d in data:
                    comm_review_list.append(d.text)

                # close popup
                WebDriverWait(driver, 3).until(
                    EC.element_to_be_clickable((By.XPATH, '//*[@id="baseModal"]/div/div[1]/a'))).click()

        except Exception as e:
            # skip this wine and continue, instead of silently ending the whole run
            print('Failed to scrape', i, '-', e)

    df_desc = ({'Description': comm_review_list})
    vivino_review_pleb = pd.DataFrame(df_desc)
    print(vivino_review_pleb.head())
#   vivino_review_pleb.to_csv('vivino_review_pleb.csv', index=False, header=True, encoding='utf-8')

#   wine_data_list = pd.read_csv("nicks_wine_data_temp.csv")
#   wine_data_desc = pd.read_csv("vivino_vine_desc_pleb.csv")
#   nicks_wine_data = pd.concat([wine_data_list, wine_data_desc], axis=1)
#   final_wine_data.to_csv('nicks_wine_data.csv', index=False, header=True, encoding='utf-8')
#   print(final_wine_data.head()) # final data repo for NLP



if __name__ == '__main__':
    driver = webdriver.Chrome(executable_path=r'C:/seleniumChromeDriver/chromedriver_win32/chromedriver.exe')
    driver.set_window_size(1800, 1024)
    url = 'https://www.vivino.com/AU/en/'
    driver.get(url)
    get_wine_reviews(driver)

<font size="3">File 'final_wine_data.csv' is used for NLP analysis.</font>
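
A possible way to build such a combined file is to join the two review sources on vintage and wine name rather than by row position. The following is a sketch under assumptions: `merge_reviews`, `to_csv_text` and the `CommunityReviews` column are hypothetical names, and the second sample wine is a made-up placeholder.

```python
import csv
import io


def merge_reviews(expert_rows, community_reviews):
    """Attach community reviews to expert rows by (Vintage, Wine) key.

    expert_rows: list of dicts with 'Vintage', 'Wine' and 'Description'.
    community_reviews: dict mapping (vintage, wine) to a list of review texts.
    Wines without community reviews get an empty string.
    """
    merged = []
    for row in expert_rows:
        key = (str(row['Vintage']), row['Wine'])
        combined = dict(row)
        combined['CommunityReviews'] = ' | '.join(community_reviews.get(key, []))
        merged.append(combined)
    return merged


def to_csv_text(rows):
    """Serialise a list of dicts to CSV text, using the first row's keys as header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


expert_rows = [
    {'Vintage': '2019', 'Wine': '30 Mile Shiraz',
     'Description': 'Fresh and inky the nose tosses up a mix of crushed berries'},
    # Placeholder wine, invented for illustration only.
    {'Vintage': '2018', 'Wine': 'Example Cabernet', 'Description': 'Placeholder text'},
]
community_reviews = {
    ('2019', '30 Mile Shiraz'): ["It's the kind of red I like.", 'Easy drinking with full meals'],
}

merged = merge_reviews(expert_rows, community_reviews)
csv_text = to_csv_text(merged)
print(csv_text)
```

Joining on a key means a wine that was not found on one site leaves an empty cell instead of shifting every later review onto the wrong wine.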