# Review Summarization

1. Select an English-language website that hosts customer reviews on products (or services, businesses, movies, events, etc.).

2. Make sure that the website includes a free-text search box that users can use to search for products.

3. Email me your selection at ted@aueb.gr. Each student should work on a different website, so I will maintain the list of selected websites at the top of our Wiki. First come, first served.

4. Create a first Python Notebook with a function called scrape( ). The function should accept a query (a word or short phrase) as a parameter. The function should then use Selenium to:

   * submit the query to the website's search box and retrieve the list of matching products.
   * access the first product on the list and download all its reviews into a csv file. For each review, the function should get the text, the rating, and the date. One line per review, 3 fields per line.

5. Create a second Python Notebook with a function called summarize( ). The function should accept as a parameter the path to a csv file created by the first Notebook. It should then create a 1-page pdf file that includes a summary of all the reviews in the csv.

The nature of the summary is entirely up to you. It can be text-based, visual-based, or a combination of both.
It is also up to you to define what is important enough to be included in the summary.
Focus on creating a summary that you think would be the most informative for customers.
The creation of the pdf should be done through the notebook.
You can use any Python-based library you want.

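As an illustration of the kind of figures a summary could be built on, the sketch below reads a reviews file and computes a few basic statistics. It is only a starting point under stated assumptions, not the required solution: `summary_stats` is a hypothetical helper, the column names `text`, `rating`, `date`, the `NA` placeholder and the `YYYY/MM/DD` date format are taken from the scraper further down in this notebook, and the sample rows are made up. Rendering the result into a PDF is left to whichever library you choose.

```python
import csv
import io
from collections import Counter
from statistics import mean


def summary_stats(csv_file):
    """Compute simple statistics from a reviews CSV (columns: text, rating, date).

    `csv_file` is any open text file object. Hypothetical helper for illustration.
    """
    reader = csv.DictReader(csv_file)
    n_rows, ratings, dates = 0, [], []
    for row in reader:
        n_rows += 1
        rating = row.get("rating") or ""
        date = row.get("date") or "NA"
        # missing values are assumed to be stored as 'NA'; skip them
        if rating.isdigit():
            ratings.append(int(rating))
        if date != "NA":
            dates.append(date)
    return {
        "n_reviews": n_rows,
        "mean_rating": round(mean(ratings), 2) if ratings else None,
        "rating_counts": dict(sorted(Counter(ratings).items())),
        # with YYYY/MM/DD strings, string order equals chronological order
        "first_date": min(dates) if dates else None,
        "last_date": max(dates) if dates else None,
    }


# Made-up sample data in the three-field format
sample = io.StringIO(
    "text,rating,date\n"
    "Great fit,5,2022/11/02\n"
    '"Too narrow, returned",2,2022/12/01\n'
    "Okay,4,NA\n"
)
stats = summary_stats(sample)
print(stats)
```

A dictionary like this can feed a short text paragraph, a rating histogram, or both, depending on what you decide is most informative for customers.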

---

> Chalkiopoulos Georgios, Electrical and Computer Engineer NTUA <br />
> Data Science postgraduate Student <br />
> gchalkiopoulos@aueb.gr

## Install Libraries

In [1]:
"""
!pip install -U selenium
!pip install webdriver-manager
"""

'\n!pip install -U selenium\n!pip install webdriver-manager\n'

## Imports

In [2]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.webdriver import WebDriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common import NoSuchElementException, TimeoutException, ElementClickInterceptedException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.remote import webelement

from pathlib import Path
import csv
import time
from typing import TextIO, List, Tuple
import logging
import re
from re import Pattern, Match
from datetime import datetime

In [6]:
class AmazonScrapper:
    """Class that scrapes Amazon reviews. Searches for the query, opens the first non-sponsored product and saves the review text, the rating score and the date of each review"""
    website: str = "https://www.amazon.co.uk/"
    logger = logging.getLogger("AmazonLogger")


    def __init__(self,
                 query: str,
                 driver: WebDriver = None,
                 output_path: str = None,
                 wait: int = 5
                 ):
        self.query = query
        self.driver = driver
        # always store a Path, so that .open() and .absolute() also work when a str is passed
        self.output_path: Path = Path(output_path) if output_path else Path(f"amazon_reviews_{query.replace(' ', '_')}.csv")
        self.wait = wait
        self.logger = self._setup_logger()
        self.writer, self.fw = self._writer()


    def _setup_logger(self):
        """Set up the logger"""

        # Create logger
        logger = logging.getLogger(self.__class__.__name__)
        logger.setLevel(logging.INFO)

        if not logger.handlers:
            # Create console handler and set level to info
            ch = logging.StreamHandler()
            ch.setLevel(logging.INFO)

            # Create formatter
            formatter = logging.Formatter('[%(asctime)s] %(levelname)s [%(name)s] - %(message)s')

            # Add formatter to ch
            ch.setFormatter(formatter)

            # Add ch to logger
            logger.addHandler(ch)

        return logger

    def _writer(self) -> Tuple[csv.writer, TextIO]:
        """Opens the output file and returns a csv writer together with the file handle"""

        # open the output file; newline="" lets the csv module control line endings
        fw: TextIO = self.output_path.open(mode="w", encoding="utf8", newline="")
        writer = csv.writer(fw, lineterminator="\n")
        # header row: one column per review field
        writer.writerow(["text", "rating", "date"])
        return writer, fw


    def get_reviews(self) -> None:
        """Main method that performs the steps needed to get the reviews"""

        try:
            self._setup_driver()
            self._load_main_page()
            self._accept_cookies()
            self._apply_query()

            product, product_name = self._find_product()
            self._click_product(product, product_name)

            self._see_all_reviews()
            self._process_reviews()
        finally:
            # always release the browser and the csv file, even if a step fails
            self.logger.info("Closing Driver.")
            if self.driver:
                self.driver.quit()
            self.fw.close()

    def _setup_driver(self) -> WebDriver:
        """Creates a Chrome driver unless one was passed to the constructor"""
        if self.driver is None:
            self.driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
        return self.driver

    def _load_main_page(self) -> None:
        """Loads main page"""
        self.logger.info(f"Initialize website: {self.website}.")
        self.driver.maximize_window()
        self.driver.get(self.website)
        time.sleep(self.wait)


    def _accept_cookies(self) -> None:
        """Try to accept cookies"""
        try:
            # wait up to self.wait seconds for the cookie button to become clickable
            accept_box = WebDriverWait(self.driver, self.wait).until(
                EC.element_to_be_clickable((By.ID, "sp-cc-accept"))
            )
            accept_box.click()
            self.logger.info("Cookies accepted")
        except TimeoutException:
            self.logger.warning("Cookies element not found.")
        time.sleep(self.wait)


    def _apply_query(self) -> None:
        """Find the search box and apply the query"""

        # find search box
        search_box = self.driver.find_element(by=By.ID, value="twotabsearchtextbox")
        search_box.send_keys(self.query)

        # press search button
        search = self.driver.find_element(by=By.ID, value="nav-search-submit-button")
        search.click()
        self.logger.info(f"Search for {self.query} submitted.")
        time.sleep(self.wait)

    def _find_product(self) -> Tuple[webelement.WebElement, str]:
        """Finds the first non-sponsored product and returns it together with its name"""
        items =  WebDriverWait(self.driver,self.wait).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "[data-component-type='s-search-result']")))

        for product in items:

            # Find product name
            try:
                product_name = product.find_element(by=By.XPATH, value=".//h2/a/span[contains(@class,'text')]").text
            except NoSuchElementException:
                self.logger.warning("Product name element not found in search result. Skipping.")
                continue

            # return first non-sponsored product
            if self._is_sponsored(product):
                self.logger.info(f"Skipping sponsored product: {product_name}.")
                continue
            else:
                self.logger.info(f"Found not sponsored product: {product_name}")
                return product, product_name

        # no usable product in the result list: fail with a clear message
        raise ValueError(f"No non-sponsored product found for query: {self.query}.")


    @staticmethod
    def _is_sponsored(_product) -> bool:
        """Checks if a listed product is a sponsored one"""
        # skip sponsored
        try:
            _product.find_element(by=By.CSS_SELECTOR, value="[aria-label='View Sponsored information or leave ad feedback']")
            return True
        except NoSuchElementException:
            return False

    def _click_product(self, product: webelement.WebElement, product_name: str) -> None:
        """Click on the product link"""
        # try to find the clickable link
        try:
            link = product.find_element(by=By.XPATH, value=".//h2/a[contains(@class,'a-link-normal')]")
            link.click()
            self.logger.info(f"Clicked on product: {product_name}.")
        except NoSuchElementException:
            self.logger.error(f"Could not find clickable link for the product: {product_name}.")
            raise ValueError(f"Clickable link not found for the product {product_name}. Please check!")
        # give the product page time to load
        time.sleep(self.wait)


    def _see_all_reviews(self) -> None:
        """Click on the see all reviews button"""

        try:
            local_reviews = WebDriverWait(self.driver, self.wait).until(EC.element_to_be_clickable((By.CLASS_NAME, "cr-widget-FocalReviews")))
            all_reviews = local_reviews.find_element(by=By.CSS_SELECTOR, value="[data-hook='see-all-reviews-link-foot']")
            all_reviews.click()
            self.logger.info("Clicked See all Reviews (Local Reviews).")
            time.sleep(self.wait)

        # WebDriverWait raises TimeoutException, find_element raises NoSuchElementException
        except (TimeoutException, NoSuchElementException):
            self.logger.warning("Could not find Local review element.")

    def _get_page_reviews(self) -> List[webelement.WebElement]:
        """Returns all reviews in a page (an empty list if none load in time)"""

        # scroll down
        self.driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')

        # get all the reviews in the page
        try:
            reviews =  WebDriverWait(self.driver, self.wait).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[data-hook="review"]')))
            self.logger.info("Reviews Loaded.")
            return reviews
        # WebDriverWait signals a missing element with TimeoutException
        except TimeoutException:
            self.logger.warning("Could not find the 'review' CSS element in data-hook.")
            return []


    def _write_review(self, review: webelement.WebElement) -> None:
        """Add a line with the text, rating and date of a given review.
        Uses code from the lecture Customer Analytics

        Args:
            review: a webelement.WebElement object with the current review
        """

        # initialize key attributes
        rating, text, date = 'NA', 'NA', 'NA'

        # try to find the date box; without it the review's country cannot be checked
        try:
            date_text: str = review.find_element(by=By.CSS_SELECTOR, value='[data-hook="review-date"]').text
        except NoSuchElementException:
            self.logger.debug("Skipping review without a date box.")
            return

        # Only keep reviews from English-speaking marketplaces
        pattern: Pattern = re.compile(r"Reviewed in (?P<country>.*) on (?P<review_date>.*)")
        found = pattern.search(date_text)
        if found is None:
            self.logger.debug(f"Skipping review with unexpected date line: {date_text}")
            return
        match: dict = found.groupdict()
        countries: List[str] = ["United States", "Australia", "United Kingdom"]

        if any(x in match["country"] for x in countries):
            date = datetime.strptime(match["review_date"], '%d %B %Y').strftime("%Y/%m/%d")
        else:
            self.logger.debug(f"Skipping Non-English review: {match.get('country')}")
            return


        # try to find the rating box
        try:
            rating_box=review.find_element(by=By.CSS_SELECTOR, value='[data-hook*="review-star-rating"]')
        except NoSuchElementException:
            rating_box=None

        # box found
        if rating_box:
            rating_info = rating_box.get_attribute('class')  # get the text of class attribute
            found_rating = re.search(r'a-star-(\d)', rating_info)  # look for the star rating in the class text
            if found_rating:
                rating = found_rating.group(1)  # extract the star rating

        # try to find the content box and extract its text
        try:
            text = review.find_element(by=By.CSS_SELECTOR, value='[data-hook="review-body"]').text
        except NoSuchElementException:
            text = 'NA'

        # write a new row
        self.writer.writerow([text, rating, date])


    def _process_reviews(self) -> None:
        """Loads next review page until the end. Calls self._write_review and self._get_page_reviews"""

        page: int = 1
        while True:
            try:
                reviews = self._get_page_reviews()

                self.logger.info(f"Iterating Page {page}.")
                for review in reviews:

                    try:
                        self._write_review(review)
                    except Exception as exc:
                        # keep going, but report which kind of error occurred
                        self.logger.warning(f"Could not write review ({exc.__class__.__name__}).")
                        time.sleep(self.wait)

            except TimeoutException:
                self.logger.warning("Could not load reviews.")

            # wait until the next Button loads
            try:
                next_button = WebDriverWait(self.driver,self.wait*10).until(EC.presence_of_element_located((By.CLASS_NAME,'a-last')))
            except TimeoutException:
                # no pagination control appeared within the timeout
                self.logger.info("Next button not found. Stopping.")
                break

            # final page reached, 'next' button is disabled on this page
            if 'a-disabled' in next_button.get_attribute('class'):
                self.logger.info("Reached Last Page.")
                break

            # stop after 100 pages loaded
            if page == 100:
                self.logger.info("Reached 100 Pages.")
                break

            # click on the next Button
            try:
                next_button.click()

            except ElementClickInterceptedException:
                self.logger.warning("Could not Click. Refreshing Page, please scroll manually.")
                self.driver.refresh()
                time.sleep(self.wait*2)

            # wait for a few seconds
            time.sleep(self.wait*2)
            page += 1

        self.logger.info(f"All reviews loaded. File saved under: \n{self.output_path.absolute()}")

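The two parsing steps inside `_write_review` (the "Reviewed in ... on ..." date line and the star rating hidden in the class attribute) can be tried out without a browser. The sketch below isolates them as plain functions. `parse_date_line` and `parse_star_class` are hypothetical names that are not part of the class above, the sample strings used with them are illustrative rather than captured from the site, and `%B` assumes an English locale.

```python
import re
from datetime import datetime
from typing import Optional, Tuple

# same patterns as in _write_review, compiled once
DATE_LINE = re.compile(r"Reviewed in (?P<country>.+) on (?P<review_date>.+)")
STAR_CLASS = re.compile(r"a-star-(\d)")


def parse_date_line(line: str) -> Optional[Tuple[str, str]]:
    """Return (country, YYYY/MM/DD), or None if the line cannot be parsed."""
    match = DATE_LINE.search(line)
    if match is None:
        return None
    try:
        # e.g. "3 December 2022"; month names depend on the active locale
        parsed = datetime.strptime(match["review_date"], "%d %B %Y")
    except ValueError:
        return None
    return match["country"], parsed.strftime("%Y/%m/%d")


def parse_star_class(class_attr: str) -> str:
    """Return the star digit as a string, or 'NA' if the class has none."""
    match = STAR_CLASS.search(class_attr)
    return match.group(1) if match else "NA"


# illustrative inputs
print(parse_date_line("Reviewed in the United Kingdom on 3 December 2022"))
print(parse_star_class("a-icon a-icon-star a-star-4 review-rating"))
```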
In [9]:
def scrape(query: str,
           driver: WebDriver = None,
           wait: int = 5,
           output_path: str = None) -> None:
    """
    Function that accepts a query (along with an optional selenium.webdriver.chrome.webdriver.WebDriver)
    and scrapes the first non-sponsored product from amazon.co.uk.
    Uses the AmazonScrapper class.

    Args:
        query: the name of the product to search
        driver (optional): a WebDriver object
        wait (optional): wait time in seconds. Set to 5 by default due to stable performance
        output_path (optional): path of the output csv file. Defaults to amazon_reviews_{query}.csv

    Returns:
        None
    """

    AmazonScrapper(query=query, driver=driver, wait=wait, output_path=output_path).get_reviews()

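Since the second notebook relies on the file having exactly three fields per review, a quick structural check after scraping can catch problems early. The following is a minimal sketch: `check_reviews_csv` is a hypothetical helper, and the demo rows are made up. Note that a review text may contain commas or line breaks; the `csv` module quotes such fields, so "one line per review" really means one CSV record per review.

```python
import csv
import tempfile
from pathlib import Path


def check_reviews_csv(path: Path) -> int:
    """Return the number of review records; raise ValueError on a malformed file."""
    with path.open(newline="", encoding="utf8") as fr:
        reader = csv.reader(fr)
        header = next(reader, None)
        if header != ["text", "rating", "date"]:
            raise ValueError(f"Unexpected header: {header}")
        count = 0
        for record in reader:
            if len(record) != 3:
                raise ValueError(
                    f"Line {reader.line_num}: expected 3 fields, found {len(record)}"
                )
            count += 1
    return count


# Round trip with made-up data, including a comma and a line break inside the text
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_reviews.csv"
    with demo_path.open("w", newline="", encoding="utf8") as fw:
        writer = csv.writer(fw, lineterminator="\n")
        writer.writerow(["text", "rating", "date"])
        writer.writerow(["Comfortable, true to size", "5", "2022/12/01"])
        writer.writerow(["Nice shoes.\nA bit stiff at first.", "4", "2022/11/20"])
    n_records = check_reviews_csv(demo_path)

print(n_records)
```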
In [10]:
query: str = "Vans Ward Sneaker"

scrape(query=query)

[2022-12-26 21:17:56,069] INFO [AmazonScrapper] - Initialize website: https://www.amazon.co.uk/.
[2022-12-26 21:18:03,155] INFO [AmazonScrapper] - Cookies accepted
[2022-12-26 21:18:10,867] INFO [AmazonScrapper] - Search for Vans Ward Sneaker submitted.
[2022-12-26 21:18:15,967] INFO [AmazonScrapper] - Found not sponsored product: Vans Men's Ward Sneaker
[2022-12-26 21:18:19,687] INFO [AmazonScrapper] - Clicked on product: Vans Men's Ward Sneaker.
[2022-12-26 21:18:30,870] INFO [AmazonScrapper] - Clicked See all Reviews (Local Reviews).
[2022-12-26 21:18:35,938] INFO [AmazonScrapper] - Reviews Loaded.
[2022-12-26 21:18:35,939] INFO [AmazonScrapper] - Iterating Page 1.
[2022-12-26 21:18:41,852] INFO [AmazonScrapper] - Reviews Loaded.
[2022-12-26 21:18:41,854] INFO [AmazonScrapper] - Iterating Page 2.
[2022-12-26 21:18:47,734] INFO [AmazonScrapper] - Reviews Loaded.
[2022-12-26 21:18:47,735] INFO [AmazonScrapper] - Iterating Page 3.
[2022-12-26 21:18:53,657] INFO [AmazonScrapper] - Revie