# Amazon Customer Reviews Dataset Generator

## Introduction

This notebook scrapes customer reviews from Amazon product review pages and saves them to a CSV file. The pages to scrape are defined in a Python dictionary that maps each product review page URL (the key) to the number of review pages to scrape for that product (the value). The notebook imports this dictionary as `review_urls` from the `reviews_dictionary` module.
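As an illustration, the dictionary could look like the sketch below. The names `reviews_dictionary` and `review_urls` are taken from the import and the scraping loop later in this notebook; the single entry, its page count, and the trailing slash on the URL are assumptions made for the example (the scraper keeps everything before the last `/` of the URL, so some trailing segment is needed).

```python
# Hypothetical contents of reviews_dictionary.py (an assumed example, not the original file).
# Key: product review page URL; value: number of review pages to scrape for that product.
# The trailing slash is an assumption: the scraper keeps everything before the last '/'.
review_urls = {
    'https://www.amazon.com/2022-Apple-MacBook-Laptop-chip/product-reviews/B0B3C57XLR/': 5,
}

# List the (url, page) pairs that a loop over the dictionary would visit.
jobs = [
    (url, page)
    for url, num_pages in review_urls.items()
    for page in range(1, num_pages + 1)
]
print(len(jobs), 'pages to scrape')
```

Keeping the page counts next to the URLs makes it straightforward to scrape more pages for products with many reviews.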

## Setup

### Install and import required libraries and modules

In [1]:
# Install and import the required libraries

# Uncomment the following line if you are running this notebook for the first time and don't have the required packages installed
#!pip install -r requirements.txt

import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from collections import namedtuple
from IPython.display import clear_output

import reviews_dictionary as rd

## Function Definitions

The `AmazonScraper` class was created using the guide [How To Create An Amazon Review Scraper With Python](https://youtu.be/w2i3hPyJe08) by [Jie Jenn](https://www.youtube.com/c/JieJenn).

In [2]:
# Amazon scraper

# Store each scraped review in a namedtuple instead of a dictionary: fields are accessed by name, instances are immutable and light on memory, and pandas uses the field names as column names
ReviewDetails = namedtuple('ReviewDetails', ['product_name', 'review_title', 'review_comment', 'review_rating', 'review_date', 'user_username', 'user_profile_url', 'verified_purchase'])

# Scraper that downloads Amazon product review pages and parses the reviews on them
class AmazonScraper:
    review_date_pattern = re.compile(r'(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?) \d+, \d{4}')
    product_name_pattern = re.compile(r'^https://www\.amazon\.com/(.+)/product-reviews')

    # Set up a session with browser-like request headers
    def __init__(self):
        # Create a session
        self.session = requests.Session()
        self.session.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:87.0) Gecko/20100101 Firefox/87.0'
        self.session.headers['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
        self.session.headers['Accept-Language'] = 'en-US,en;q=0.5'
        self.session.headers['Connection'] = 'keep-alive'
        self.session.headers['Upgrade-Insecure-Requests'] = '1'

    # Main function
    def scrapeReviews(self, url, page_num, filter_by='recent'):
        """
        Scrape one page of reviews for a product.

        @args: url - product review page URL
               page_num - number of the review page to fetch
               filter_by - 'recent' or 'helpful'
        @return: list of ReviewDetails namedtuples (empty if the page could not be scraped)
        """
        try:
            # Check that the URL is a product review URL before sending a request
            product_name_match = self.product_name_pattern.search(url)

            if not product_name_match:
                print('Invalid URL.')
                return []

            product_name = product_name_match.group(1).replace('-', ' ')

            # Keep everything before the last slash and append the query parameters
            review_url = re.search(r'^.+(?=/)', url).group()
            review_url = review_url + '?reviewerType=all_reviews&sortBy={0}&pageNumber={1}'.format(filter_by, page_num)
            print('Processing comments page {} for {}...'.format(page_num, review_url))
            clear_output(wait=True)
            response = self.session.get(review_url)

            # Parse the response
            soup = BeautifulSoup(response.content, 'html.parser')
            review_list = soup.find('div', {'id': 'cm_cr-review_list'})

            # The review list can be missing, e.g. if the page was not served as expected
            if review_list is None:
                print('No review list found.')
                return []

            reviews = []
            product_reviews = review_list.find_all('div', {'data-hook': 'review'})

            # Iterate through the reviews
            for product_review in product_reviews:
                review_title = product_review.find('a', {'data-hook': 'review-title'}).text.strip()
                review_comment = product_review.find('span', {'data-hook': 'review-body'}).text.strip()
                review_rating = product_review.find('i', {'data-hook': 'review-star-rating'}).text
                review_date = self.review_date_pattern.search(product_review.find('span', {'data-hook': 'review-date'}).text).group(0)
                user_username = product_review.a.span.text
                user_profile_url = 'https://amazon.com/{0}'.format(product_review.a['href'])
                verified_purchase = product_review.find('span', {'data-hook': 'avp-badge'}) is not None
                reviews.append(ReviewDetails(product_name, review_title, review_comment, review_rating, review_date, user_username, user_profile_url, verified_purchase))

            return reviews
        except Exception as e:
            # Report the error and return an empty list so callers can always extend() the result
            print(e)

            return []
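The string handling in `scrapeReviews` can be tried without sending any request. The following is a minimal sketch, not the scraper itself: it reuses the two regular expressions from the class (written as raw strings), while the helper name `build_review_url`, the example URL with its trailing slash, and the sample date text are assumptions made for illustration.

```python
import re

# Same patterns as in AmazonScraper, written as raw strings.
REVIEW_DATE_PATTERN = re.compile(
    r'(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|'
    r'Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?) \d+, \d{4}'
)
PRODUCT_NAME_PATTERN = re.compile(r'^https://www\.amazon\.com/(.+)/product-reviews')


def build_review_url(url, page_num, filter_by='recent'):
    """Hypothetical helper: mirror how scrapeReviews builds the page URL."""
    # Keep everything before the last slash, then append the query parameters.
    base = re.search(r'^.+(?=/)', url).group()
    return base + '?reviewerType=all_reviews&sortBy={0}&pageNumber={1}'.format(filter_by, page_num)


# Assumed example URL; the trailing slash is what the "last slash" rule strips.
example_url = 'https://www.amazon.com/2022-Apple-MacBook-Laptop-chip/product-reviews/B0B3C57XLR/'

page_url = build_review_url(example_url, 5)
product_name = PRODUCT_NAME_PATTERN.search(example_url).group(1).replace('-', ' ')

# Assumed sample of the text found in a review-date element.
sample_date_text = 'Reviewed in the United States on October 23, 2022'
review_date = REVIEW_DATE_PATTERN.search(sample_date_text).group(0)

print(page_url)
print(product_name)
print(review_date)
```

Checking these pieces in isolation makes it easier to tell a malformed input URL apart from a page that could not be parsed.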

## Running the code

In [3]:
# Comments scraper

reviews = []
reviews_urls = rd.review_urls
scraper = AmazonScraper()

for url, num_pages in reviews_urls.items():
    for page in range(1, num_pages+1):
        # Fall back to an empty list so that a failed page never passes None to extend()
        reviews.extend(scraper.scrapeReviews(url, page) or [])
        
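# Save to CSV; the utf-8-sig encoding writes a byte order mark so that spreadsheet tools such as Excel detect UTF-8 and display special characters correctly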
pd.DataFrame(reviews).to_csv('reviews.csv', index=False, encoding='utf-8-sig')

Processing comments page 5 for https://www.amazon.com/2022-Apple-MacBook-Laptop-chip/product-reviews/B0B3C57XLR?reviewerType=all_reviews&sortBy=recent&pageNumber=5...
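To see what the `utf-8-sig` encoding does, the sketch below writes a single row with the standard `csv` module instead of pandas and inspects the resulting bytes. The reduced `Review` namedtuple, the sample row, and the temporary file name are assumptions made for illustration.

```python
import csv
import os
import tempfile
from collections import namedtuple

# Hypothetical, reduced version of ReviewDetails with two fields only.
Review = namedtuple('Review', ['review_title', 'review_comment'])

# Sample row containing a non-ASCII character (the curly apostrophe).
rows = [Review('Charging is horrible', 'Charger doesn’t consistently charge.')]

path = os.path.join(tempfile.mkdtemp(), 'reviews_sample.csv')

# utf-8-sig prepends a byte order mark (BOM) to the file.
with open(path, 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(Review._fields)  # header row from the namedtuple field names
    writer.writerows(rows)

# Read the raw bytes to confirm that the file starts with the UTF-8 BOM.
with open(path, 'rb') as f:
    has_bom = f.read().startswith(b'\xef\xbb\xbf')

# Reading with the same encoding strips the BOM again.
with open(path, newline='', encoding='utf-8-sig') as f:
    round_trip = list(csv.reader(f))

print(has_bom)
print(round_trip)
```

Reading the file back with `encoding='utf-8-sig'` avoids the BOM ending up inside the first column name.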


## Visualizing the dataset

In [4]:
pd.DataFrame(reviews).head()

| | product_name | review_title | review_comment | review_rating | review_date | user_username | user_profile_url | verified_purchase |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Apple AirPods Charging Latest Model | Charging is horrible | Charger doesn’t consistently charge. Drains wh... | 3.0 out of 5 stars | October 23, 2022 | Teddi | https://amazon.com//gp/profile/amzn1.account.A... | True |
| 1 | Apple AirPods Charging Latest Model | Works perfect with Dell laptop, Samsung cell p... | Easy steps to follow for all devices. | 5.0 out of 5 stars | October 22, 2022 | Pama | https://amazon.com//gp/profile/amzn1.account.A... | True |
| 2 | Apple AirPods Charging Latest Model | Apple does it again | Apple is always top quality and these are no e... | 5.0 out of 5 stars | October 22, 2022 | Charlie P. | https://amazon.com//gp/profile/amzn1.account.A... | True |
| 3 | Apple AirPods Charging Latest Model | Works good | No problems they work good | 5.0 out of 5 stars | October 22, 2022 | Paula Pearson | https://amazon.com//gp/profile/amzn1.account.A... | True |
| 4 | Apple AirPods Charging Latest Model | Good product | confortable, very good product | 5.0 out of 5 stars | October 22, 2022 | Reynaldo Rodriguez | https://amazon.com//gp/profile/amzn1.account.A... | True |
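The `review_rating` and `review_date` columns are stored as text, which limits sorting and plotting. A minimal sketch of converting them is shown below, assuming the formats visible in the rows above ('3.0 out of 5 stars' and 'October 23, 2022'); the helper names `parse_rating` and `parse_review_date` are hypothetical and not part of the notebook.

```python
import re
from datetime import datetime


def parse_rating(text):
    """Return the leading number of a string such as '5.0 out of 5 stars' as a float."""
    match = re.match(r'\s*(\d+(?:\.\d+)?) out of 5 stars', text)
    # Return None for text that does not follow the expected format.
    return float(match.group(1)) if match else None


def parse_review_date(text):
    """Parse a date such as 'October 23, 2022' into a datetime.date.

    Assumes full English month names; %B depends on the active locale.
    """
    return datetime.strptime(text, '%B %d, %Y').date()


rating = parse_rating('3.0 out of 5 stars')
review_date = parse_review_date('October 23, 2022')

print(rating, review_date.isoformat())
```

With pandas, such helpers could be applied to a column through `Series.apply`, for example `df['review_rating'].apply(parse_rating)`.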
