# Amazon Customer Reviews Dataset Generator

## Setup

### Install and import required libraries and modules

In [1]:
# Install and import the required libraries

# Uncomment the following line if you are running this notebook for the first time and don't have the required packages installed
#!pip install -r ../requirements.txt

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
import lxml
import time
import re
import pandas as pd
import math
from collections import namedtuple
from IPython.display import clear_output
from amazon_scraper import AmazonScraper

## Input

Enter the Amazon product page URL for each product and, optionally, the number of review pages to scrape. If the number of pages is not specified, the scraper iterates through the review pages until it reaches the end of the US reviews or the last available page of reviews.

The input is a dictionary (`dict`) that maps each product URL to its page count, in the following format:

```python
product_list = {
    'product_1_url': 10,
    'product_2_url': 10,
    'product_3_url': 10
}
```

In [2]:
product_list = {
    'https://www.amazon.com/PlayStation-5-Console/dp/B09DFCB66S/ref=sr_1_4?crid=341VBTUP1514Q&keywords=playstation+5&qid=1667007732&qu=eyJxc2MiOiI2LjE5IiwicXNhIjoiOC42NCIsInFzcCI6IjguMzkifQ%3D%3D&sprefix=playstation%2Caps%2C317&sr=8-4': 100,
    'https://www.amazon.com/Xbox-S/dp/B08G9J44ZN/ref=lp_16225016011_1_10': 100,
    'https://www.amazon.com/Nintendo-Switch-Neon-Blue-Joy%E2%80%91/dp/B07VGRJDFY/ref=sr_1_2?keywords=nintendo+switch&qid=1667058528&qu=eyJxc2MiOiI1LjA3IiwicXNhIjoiNS4yOCIsInFzcCI6IjQuODkifQ%3D%3D&sprefix=nintendo+%2Caps%2C388&sr=8-2': 100
}
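
Because a mistyped URL or page count only surfaces once the browser is already running, it can be useful to check the dictionary first. The following is a minimal sketch, not part of `amazon_scraper`: `validate_product_list` and `ASIN_PATTERN` are hypothetical names, the check assumes that product URLs contain a `/dp/<ASIN>` segment with a 10-character ASIN, and it assumes that `None` is how an unspecified page count would be written.

```python
import re
from urllib.parse import urlparse

# Hypothetical helper, not part of amazon_scraper.
# Assumption: a product URL contains "/dp/" followed by a 10-character ASIN.
ASIN_PATTERN = re.compile(r'/dp/([A-Z0-9]{10})(?:[/?]|$)')


def validate_product_list(product_list):
    """Return a list of human-readable problems found in product_list."""
    problems = []
    for url, pages in product_list.items():
        parsed = urlparse(url)
        if parsed.scheme not in ('http', 'https') or 'amazon.' not in parsed.netloc:
            problems.append(f'Not an Amazon URL: {url}')
        elif not ASIN_PATTERN.search(parsed.path):
            problems.append(f'No /dp/<ASIN> segment in: {url}')
        # Assumption: None stands for "no page limit".
        if pages is not None:
            is_int = isinstance(pages, int) and not isinstance(pages, bool)
            if not is_int or pages < 1:
                problems.append(f'Invalid page count {pages!r} for: {url}')
    return problems


example = {'https://www.amazon.com/Xbox-S/dp/B08G9J44ZN/ref=lp_16225016011_1_10': 100}
for problem in validate_product_list(example):
    print(problem)
```

An empty result means every entry passed these checks; it does not guarantee that the page exists or that it has reviews.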

## Web Scraping

In [3]:
# Create the scraper
scraper = AmazonScraper()

reviews = []

# Scrape the reviews for each product URL and its page count
for url, num_pages in product_list.items():
    reviews.extend(scraper.scrape_reviews(url, num_pages, product_list))

clear_output()
print(f'Scraping complete. {len(reviews)} reviews scraped for {len(product_list)} products.')

# Load the scraped reviews into a DataFrame
df = pd.DataFrame(reviews)

# Remove video-player text that starts with "Video Player is loading." and ends with "This is a modal window."
# (non-greedy match; re.DOTALL lets it span line breaks)
df['review_body'] = df['review_body'].str.replace(r'Video Player is loading\..*?This is a modal window\.', '', regex=True, flags=re.DOTALL).str.strip()
# Remove the literal string "The media could not be loaded."
df['review_body'] = df['review_body'].str.replace('The media could not be loaded.', '', regex=False).str.strip()
# Remove rows with an empty review_body
df = df[df['review_body'] != '']

# Save the cleaned data to a CSV file
df.to_csv('../dataset/amazon_reviews.csv', index=True)

Scraping complete. 2748 reviews scraped for 3 products.
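
The text cleanup in the cell above can also be written as a plain function, which makes it easy to try on a few strings before running it on the full dataset. This is a minimal sketch under stated assumptions: `clean_review_body`, `PLAYER_TEXT`, and `MEDIA_ERROR` are hypothetical names, the sample strings are made up rather than taken from the scraped data, and the pattern assumes the video-player residue may contain arbitrary text between its two marker sentences.

```python
import re

# Hypothetical names for illustration; not part of amazon_scraper.
# Non-greedy match from the first marker sentence to the second; DOTALL spans line breaks.
PLAYER_TEXT = re.compile(r'Video Player is loading\..*?This is a modal window\.', re.DOTALL)
MEDIA_ERROR = 'The media could not be loaded.'


def clean_review_body(text):
    """Strip video-player residue from a single review body."""
    text = PLAYER_TEXT.sub('', text)
    text = text.replace(MEDIA_ERROR, '')  # literal match, no regex needed
    return text.strip()


# Made-up review bodies, for illustration only.
samples = [
    'Video Player is loading.\nPlay Video\nThis is a modal window.\nGreat console!',
    'The media could not be loaded.',
    '  Works as expected.  ',
]

# Drop bodies that are empty after cleaning, as the notebook does for rows.
cleaned = [body for body in map(clean_review_body, samples) if body]
print(cleaned)
```

With a DataFrame, the same function could be applied as `df['review_body'].map(clean_review_body)`.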
