# Review Summarization

1. Select an English-language website that hosts customer reviews of products (or services, businesses, movies, events, etc.).

2. Make sure that the website includes a free-text search box that users can use to search for products.

3. Email me your selection at ted@aueb.gr. Each student should work on a different website, so I will maintain the list of selected websites at the top of our Wiki. First come, first served.

4. Create a first Python Notebook with a function called scrape( ). The function should accept a query (a word or short phrase) as a parameter. The function should then use Selenium to:

   * submit the query to the website's search box and retrieve the list of matching products.
   * access the first product on the list and download all its reviews into a CSV file. For each review, the function should get the text, the rating, and the date. One line per review, 3 fields per line.

5. Create a second Python Notebook with a function called summarize( ). The function should accept as a parameter the path to a CSV file created by the first Notebook. It should then create a 1-page PDF file that includes a summary of all the reviews in the CSV.

The nature of the summary is entirely up to you. It can be text-based, visual, or a combination of both.
It is also up to you to decide what is important enough to include in the summary.
Focus on creating the summary that you think would be most informative for customers.
The PDF should be created from within the notebook.
You can use any Python-based library you want.
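
As one possible starting point for summarize( ), the sketch below reads a CSV in the three-field layout described above and computes a few basic statistics that could feed a text or visual summary. The function name `rating_stats`, the assumption that the file starts with a `text,rating,date` header row, and the assumption that every rating is numeric are illustrative choices, not requirements of the assignment.

```python
import csv
from collections import Counter
from pathlib import Path


def rating_stats(csv_path: str) -> dict:
    """Return the review count, mean rating and rating distribution of a reviews CSV."""
    with Path(csv_path).open(newline="", encoding="utf8") as fr:
        # DictReader uses the header row (text, rating, date) as field names
        ratings = [float(row["rating"]) for row in csv.DictReader(fr)]

    return {
        "count": len(ratings),
        # Guard against an empty file: no reviews means no mean
        "mean": sum(ratings) / len(ratings) if ratings else None,
        "distribution": dict(Counter(ratings)),
    }
```

The returned dictionary is deliberately plain so that it can be passed to whichever plotting or PDF library ends up producing the one-page summary.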


---

> Chalkiopoulos Georgios, Electrical and Computer Engineer NTUA <br />
> Data Science postgraduate Student <br />
> gchalkiopoulos@aueb.gr

## Install Libraries

In [2]:
"""
!pip install -U selenium
!pip install webdriver-manager
"""

'\n!pip install -U selenium\n!pip install webdriver-manager\n'

## Imports

In [3]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.webdriver import WebDriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common import NoSuchElementException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.remote import webelement

from pathlib import Path
import csv, time
from typing import TextIO
import logging
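
The `_setup_logger` method of the scraper class attaches a console handler only when the logger has none. A minimal sketch of why that guard matters when a notebook cell is re-run: `logging.getLogger` returns the same object for the same name, so without the check every re-run would add another handler and each message would be printed multiple times. The names `get_logger` and `DemoLogger` are hypothetical and used only for this illustration.

```python
import logging


def get_logger(name: str) -> logging.Logger:
    """Return a logger with exactly one console handler, however often this is called."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # skip if a previous call already attached a handler
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("[%(asctime)s] %(levelname)s [%(name)s] - %(message)s")
        )
        logger.addHandler(handler)
    return logger


first = get_logger("DemoLogger")
second = get_logger("DemoLogger")  # simulates re-running the notebook cell
```

Both calls return the same logger object, and it still carries a single handler after the second call.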

In [23]:

class AmazonScrapper:
    """Class that scrapes Amazon reviews. Searches for the first non-sponsored product matching a user-defined query and saves the review text, the rating score and the date"""
    website: str = "https://www.amazon.co.uk/"
    logger = logging.getLogger("AmazonLogger")


    def __init__(self,
                 query: str,
                 driver: WebDriver,
                 output_file: str = "amazon_reviews.csv",
                 wait: int = 5
                 ):
        self.query = query
        self.driver = driver
        self.output_file: Path = Path(output_file)
        self.wait = wait
        self.logger = self._setup_logger()


    def _setup_logger(self):
        """Set up the logger"""

        # Create logger
        logger = logging.getLogger(self.__class__.__name__)
        logger.setLevel(logging.INFO)

        if not logger.handlers:
            # Create console handler and set level to info
            ch = logging.StreamHandler()
            ch.setLevel(logging.INFO)

            # Create formatter
            formatter = logging.Formatter('[%(asctime)s] %(levelname)s [%(name)s] - %(message)s')

            # Add formatter to ch
            ch.setFormatter(formatter)

            # Add ch to logger
            logger.addHandler(ch)

        return logger

    def _writer(self) -> csv.writer:
        """Opens the output file, writes the header row and returns the csv writer"""

        # open the output file; newline="" lets the csv module control line endings
        fw: TextIO = self.output_file.open(mode="w", encoding="utf8", newline="")
        writer = csv.writer(fw, lineterminator="\n")
        writer.writerow(["text", "rating", "date"])
        return writer


    def get_reviews(self) -> None:
        """Main method that performs needed steps to get the reviews"""
        self._load_main_page()
        self._accept_cookies()
        self._apply_query()

        product, product_name = self._find_product()
        self._click_product(product, product_name)

        self._see_all_reviews()
        # self.driver.quit()


    def _load_main_page(self) -> None:
        """Loads main page"""
        self.logger.info(f"Initialize website: {self.website}.")
        self.driver.maximize_window()
        self.driver.get(self.website)


    def _accept_cookies(self) -> None:
        """Try to accept cookies"""
        try:
            # look up the banner button inside the try block so a missing element is caught
            accept_box = self.driver.find_element(by=By.ID, value="sp-cc-accept")
            accept_box.click()
            self.logger.info("Cookies accepted")
        except NoSuchElementException:
            self.logger.warning("Cookies element not found.")



    def _apply_query(self) -> None:
        """Find the search box and apply the query"""

        # find search box
        search_box = self.driver.find_element(by=By.ID, value="twotabsearchtextbox")
        search_box.send_keys(self.query)

        # press search button
        search = self.driver.find_element(by=By.ID, value="nav-search-submit-button")
        search.click()
        self.logger.info(f"Search for {self.query} submitted.")
        time.sleep(self.wait)

    def _find_product(self) -> tuple:
        """Finds the first non-sponsored product and returns it together with its name"""
        items = WebDriverWait(self.driver, self.wait).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "[data-component-type='s-search-result']")))

        for product in items:

            # Find product name
            try:
                product_name = product.find_element(by=By.XPATH, value=".//h2/a/span[contains(@class,'text')]").text
            except NoSuchElementException:
                self.logger.warning("Product name element (.//h2/a/span) not found in search result.")
                continue

            # return first non-sponsored product
            if self._is_sponsored(product):
                self.logger.info(f"Skipping sponsored product: {product_name}.")
                continue
            else:
                self.logger.info(f"Found not sponsored product: {product_name}")
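                return product, product_name

        # no usable result: fail explicitly instead of implicitly returning None
        raise ValueError(f"No non-sponsored product found for query: {self.query}.")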


    @staticmethod
    def _is_sponsored(_product) -> bool:
        """Checks if a listed product is a sponsored one"""
        # skip sponsored
        try:
            _product.find_element(by=By.CSS_SELECTOR, value="[aria-label='View Sponsored information or leave ad feedback']")
            return True
        except NoSuchElementException:
            return False

    def _click_product(self, product: webelement, product_name: str) -> None:
        """click on product"""
        # try to find the clickable link
        try:
            link = product.find_element(by=By.XPATH, value=".//h2/a[contains(@class,'a-link-normal')]")
            link.click()
            self.logger.info(f"Clicked on product: {product_name}.")
            time.sleep(self.wait)
        except NoSuchElementException:
            self.logger.error(f"Could not find clickable link for the product: {product_name}.")
            raise ValueError(f"Clickable link not found for the product {product_name}. Please check!")


    def _see_all_reviews(self) -> None:
        """Click on the see all reviews button"""

        # open the full review list from the local reviews widget
        try:
            local_reviews = WebDriverWait(self.driver, self.wait).until(EC.element_to_be_clickable((By.CLASS_NAME, "cr-widget-FocalReviews")))
            all_reviews = local_reviews.find_element(by=By.CSS_SELECTOR, value="[data-hook='see-all-reviews-link-foot']")
            all_reviews.click()
            self.logger.info("Clicked See all Reviews (Local Reviews).")
            time.sleep(self.wait)

        except NoSuchElementException:
            self.logger.warning("Could not find Local review element.")
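
The class above stops once the full review list has been opened; the reviews themselves still have to be written out, one line per review with three fields. A minimal sketch of that last step, assuming each review has already been extracted into a `(text, rating, date)` tuple. The helper name `save_reviews` is hypothetical and is not part of the class.

```python
import csv
from pathlib import Path
from typing import Iterable, Tuple


def save_reviews(rows: Iterable[Tuple[str, str, str]],
                 output_file: str = "amazon_reviews.csv") -> int:
    """Write one review per line (text, rating, date) and return the number of rows written."""
    count = 0
    # The with block closes the file even if writing fails part-way;
    # newline="" lets the csv module control line endings
    with Path(output_file).open(mode="w", encoding="utf8", newline="") as fw:
        writer = csv.writer(fw, lineterminator="\n")
        writer.writerow(["text", "rating", "date"])
        for text, rating, date in rows:
            # csv.writer quotes fields that contain commas or line breaks
            writer.writerow([text, rating, date])
            count += 1
    return count
```

Because the csv module handles quoting, review texts that contain commas or line breaks still read back as a single field.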



In [25]:
query: str = "adidas"
driver: WebDriver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

AmazonScrapper(query=query, driver=driver).get_reviews()

[2022-12-04 19:24:23,016] INFO [AmazonScrapper] -Initialize website: https://www.amazon.co.uk/.
[2022-12-04 19:24:24,886] INFO [AmazonScrapper] -Cookies accepted
[2022-12-04 19:24:46,324] INFO [AmazonScrapper] -Search for adidas submitted.
[2022-12-04 19:24:51,423] INFO [AmazonScrapper] -Skipping sponsored product: ODLO Men's Suw Boxer Natural + Light Men's Panty.
[2022-12-04 19:24:51,461] INFO [AmazonScrapper] -Skipping sponsored product: adidas Unisex-Youth Adidas Judo Gi Kids Uniform White Blue 250g Gb Stripes Suit 110 120 130 140 150 160 adidas Judo Gi Kids Uniform White Blue 250g GB Stripes Suit 110 120 130 140 150 160.
[2022-12-04 19:24:51,500] INFO [AmazonScrapper] -Skipping sponsored product: adidas Men's Ld Wntr Hd Sweatshirt.
[2022-12-04 19:24:51,539] INFO [AmazonScrapper] -Skipping sponsored product: 55 Sport X-Type Replacement Studs for adidas Football & Rugby Boots.
[2022-12-04 19:24:51,578] INFO [AmazonScrapper] -Found not sponsored product: adidas Men's Core18 Hoody HOOD