# Amazon Product Reviews Scraper and Preprocessor

This Jupyter notebook scrapes Amazon product reviews and preprocesses the collected data for further analysis. It covers the following steps:

1. **Setup and Installation**: Install necessary Python packages.
2. **Scraping Amazon Reviews**: Use the `selectorlib` library to extract reviews from Amazon product pages.
3. **Data Preprocessing**: Clean and preprocess the scraped reviews data for two categories: Books and Beauty products.
4. **Data Export**: Save the cleaned datasets to CSV files for future use.

- Dependencies:

In [None]:
!pip install python-dateutil requests selectorlib

In [19]:
from selectorlib import Extractor
import requests 
import json 
from time import sleep
import csv
from dateutil import parser as dateparser
import re

In [None]:
# e: Extractor object built from the selectors defined in selectors.yml
e = Extractor.from_yaml_file('selectors.yml')

def scrape(url):    
    """
        Parameters:
            url: URL of the Amazon product page
            
        Returns:
            product_dict: Dictionary with the extracted product data (product title and reviews), or None if the request was blocked
            
        Description:
            This function scrapes the Amazon product page and returns a dictionary containing the product details.
    """
    headers = {
        'authority': 'www.amazon.com',
        'pragma': 'no-cache',
        'cache-control': 'no-cache',
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'none',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-dest': 'document',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    }

    # Downloading the page using requests
    print("Downloading %s"%url)
    r = requests.get(url, headers=headers)
    # Basic check for a blocked request (Amazon usually responds with 503)
    if r.status_code >= 500:
        if "To discuss automated access to Amazon data please contact" in r.text:
            print("Page %s was blocked by Amazon. Please try using better proxies\n"%url)
        else:
            print("Page %s must have been blocked by Amazon as the status code was %d"%(url,r.status_code))
        return None
    # Pass the HTML of the page to the extractor and return the structured data
    return e.extract(r.text)

The next cell reads product URLs from a text file, scrapes the reviews for each URL, and appends them to a CSV file. It performs the following steps:

1. Initializes a serial number counter `sno`.
2. Opens the URL list file for reading and the CSV file for appending.
3. Sets up a CSV writer with specified field names and writes the header.
4. Iterates through each URL in the URL list file.
5. Extracts the ASIN (Amazon Standard Identification Number) from the URL using a regular expression.
6. Calls a `scrape` function to get review data from the URL.
7. Processes each review:
   - Adds product title, URL, and ASIN to the review.
   - Normalizes the 'verified' field.
   - Extracts and formats the rating and date.
   - Removes the 'images' field if present.
   - Adds a serial number to the review.
8. Writes the processed review to the CSV file.
9. Pauses for 11 seconds after each successfully scraped URL to reduce the risk of being blocked.
10. Prints a message if no reviews are found for a URL.
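
The per-review processing in steps 5 to 7 can also be written as a standalone helper, which makes it possible to test the logic without touching the network. The following is a minimal sketch rather than the notebook's own code: `extract_asin` and `normalize_review` are hypothetical names, the sample review and URL are made up for illustration, and `datetime.strptime` stands in for `dateutil` under the assumption that dates look like `November 12, 2024`.

```python
import re
from datetime import datetime

ASIN_PATTERN = re.compile(r'/dp/([A-Z0-9]{10})')

def extract_asin(url):
    """Return the 10-character ASIN from a /dp/ URL, or None if absent."""
    match = ASIN_PATTERN.search(url)
    return match.group(1) if match else None

def normalize_review(review, product_title, url, sno):
    """Return a cleaned copy of one scraped review dictionary."""
    row = dict(review)  # copy so the scraped data is not mutated
    row['product'] = product_title
    row['url'] = url
    row['product_asin'] = extract_asin(url)
    # 'verified' holds the badge text (e.g. 'Verified Purchase') or None
    row['verified'] = 'Yes' if 'Verified Purchase' in (row.get('verified') or '') else 'No'
    # '5.0 out of 5 stars' -> '5.0'
    rating = row.get('rating')
    row['rating'] = rating.split(' out of')[0] if rating else None
    # Keep only the text after the last 'on ' and reformat it as '12 Nov 2024'
    date = row.get('date')
    if date:
        date_posted = date.split('on ')[-1]
        # Assumption: the date is formatted like 'November 12, 2024'
        row['date'] = datetime.strptime(date_posted, '%B %d, %Y').strftime('%d %b %Y')
    else:
        row['date'] = None
    row.pop('images', None)  # images are not written to the CSV
    row['sno'] = sno
    return row

# Illustrative input only; the values are invented for this example
sample = {
    'title': 'Great',
    'content': 'Works well',
    'date': 'Reviewed in the United States on November 12, 2024',
    'variant': None,
    'images': None,
    'verified': 'Verified Purchase',
    'author': 'A. Reader',
    'rating': '5.0 out of 5 stars',
}
row = normalize_review(sample, 'Example Product',
                       'https://www.amazon.com/Example/dp/B000000000/ref=x', 1)
print(row['product_asin'], row['verified'], row['rating'], row['date'])
```

Returning a copy instead of editing the dictionary in place keeps the scraped data intact, which helps when the same review needs to be inspected again while debugging.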

In [None]:
sno = 1
with open("beauty_products_urls.txt",'r') as urllist, open('beauty_category_reviews.csv','a', newline='') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=["sno", "title", "content", "date", "variant", "verified", "author", "rating", "product", "product_asin", "url"],quoting=csv.QUOTE_ALL)
    writer.writeheader()
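    for url in urllist:
        # Strip the trailing newline so it does not end up in the request or the CSV
        url = url.strip()
        if not url:
            continue
        # Extract the 10-character ASIN from the /dp/ segment of the URL
        asin_match = re.search(r'/dp/([A-Z0-9]{10})', url)
        asin = asin_match.group(1) if asin_match else None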
        data = scrape(url) 
        if data and data.get('reviews'):
            for r in data['reviews']:
                r["product"] = data["product_title"]
                r['url'] = url
                r['product_asin'] = asin
                # Mark the review as verified only when the badge text is present
                if r.get('verified') and 'Verified Purchase' in r['verified']:
                    r['verified'] = 'Yes'
                else:
                    r['verified'] = 'No'
                
                if r['rating']:
                    r['rating'] = r['rating'].split(' out of')[0]
                else:
                    r['rating'] = None
                if r['date']:
                    date_posted = r['date'].split('on ')[-1]
                    r['date'] = dateparser.parse(date_posted).strftime('%d %b %Y')
                else:
                    r['date'] = None
                if 'images' in r:
                    del r['images']
                r['sno'] = sno
                sno += 1
                writer.writerow(r)
            sleep(11)
        else:
            print(f"No reviews found for {url}")
    

Downloading https://www.amazon.com/KAHI-Wrinkle-Bounce-Multi-Balm/dp/B0DDC3QGJL/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews

Downloading https://www.amazon.com/ELEMIS-Pro-Collagen-Nutrient-Rich-Intensive-Anti-Wrinkle/dp/B000J10II4/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews

Downloading https://www.amazon.com/DOMINAS-Cream-Plus-1-76-Niacinamide/dp/B0BSKRK5R1/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews

Downloading https://www.amazon.com/ELEMIS-Pro-Collagen-Marine-Cream-SPF/dp/B07BMBQG73/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews

Downloading https://www.amazon.com/Vanicream-Gentle-Cleanser-sensitive-Dispenser/dp/B00QY1XZ4W/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews

Downloading https://www.amazon.com/Blue-Lagoon-Nourishing-Sustainable-Bioactive/dp/B0BGXXTY64/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews

Downloading https://www.amazon.com/Ursa-Major-Natural-Biodegradable-Cruelty-Free/d
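
Because the output file is opened in append mode, running the scraping cell a second time writes the header row again in the middle of the CSV. One possible guard, sketched here with a hypothetical helper `append_rows` and a throwaway temporary file, is to write the header only when the file is missing or empty.

```python
import csv
import os
import tempfile

FIELDNAMES = ["sno", "title", "content", "date", "variant", "verified",
              "author", "rating", "product", "product_asin", "url"]

def append_rows(path, rows, fieldnames=FIELDNAMES):
    """Append rows to a CSV file, writing the header only for a new or empty file."""
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='') as outfile:
        writer = csv.DictWriter(outfile, fieldnames=fieldnames, quoting=csv.QUOTE_ALL)
        if needs_header:
            writer.writeheader()
        # Keys missing from a row are written as empty strings
        writer.writerows(rows)

# Demo: two separate "runs" against the same file produce a single header
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo_reviews.csv')
    append_rows(demo_path, [{'sno': 1, 'title': 'first run'}])
    append_rows(demo_path, [{'sno': 2, 'title': 'second run'}])
    with open(demo_path, newline='') as f:
        demo_lines = f.read().splitlines()

print(len(demo_lines))
```

The same check could be placed directly before `writer.writeheader()` in the scraping cell if a separate helper is not wanted.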

In [None]:
!pip install pandas

In [13]:
import pandas as pd

#### Books category reviews preprocessing:

In [14]:
df = pd.read_csv('books_category_reviews.csv')

display(df.head())

Unnamed: 0,sno,title,content,date,variant,verified,author,rating,product,product_asin,url
0,1,5.0 out of 5 stars Unique!,What a unique plot line! With characters you ...,12 Nov 2024,,No,Kindle Worm,5.0,,B0CQHMTW4X,https://www.amazon.com/Deadly-Animals-Novel-Ma...
1,2,4.0 out of 5 stars Creepy Folklore Mystery,Ava is not like other 14-year-olds. She has a ...,12 Nov 2024,,No,Hannah O. Christensen,4.0,,B0CQHMTW4X,https://www.amazon.com/Deadly-Animals-Novel-Ma...
2,3,5.0 out of 5 stars Fresh approach to crime wit...,An incredible debut. 'Deadly Animals' shines w...,17 Feb 2024,,No,sikonat,5.0,,B0CQHMTW4X,https://www.amazon.com/Deadly-Animals-Novel-Ma...
3,4,4.0 out of 5 stars Deadly Animals by Marie Tie...,Thirteen year old Ava Bonney is unlike most ot...,23 Feb 2024,,No,damppebbles,4.0,,B0CQHMTW4X,https://www.amazon.com/Deadly-Animals-Novel-Ma...
4,5,,I absolutely loved this book. It was so well w...,23 Sep 2024,,No,Crime fiction lover,,,B0CQHMTW4X,https://www.amazon.com/Deadly-Animals-Novel-Ma...


- Dropping unnecessary columns in books category reviews dataset.

In [15]:
df.drop(['variant', 'verified', 'product'], axis=1, inplace=True)

In [16]:
df.head()

Unnamed: 0,sno,title,content,date,author,rating,product_asin,url
0,1,5.0 out of 5 stars Unique!,What a unique plot line! With characters you ...,12 Nov 2024,Kindle Worm,5.0,B0CQHMTW4X,https://www.amazon.com/Deadly-Animals-Novel-Ma...
1,2,4.0 out of 5 stars Creepy Folklore Mystery,Ava is not like other 14-year-olds. She has a ...,12 Nov 2024,Hannah O. Christensen,4.0,B0CQHMTW4X,https://www.amazon.com/Deadly-Animals-Novel-Ma...
2,3,5.0 out of 5 stars Fresh approach to crime wit...,An incredible debut. 'Deadly Animals' shines w...,17 Feb 2024,sikonat,5.0,B0CQHMTW4X,https://www.amazon.com/Deadly-Animals-Novel-Ma...
3,4,4.0 out of 5 stars Deadly Animals by Marie Tie...,Thirteen year old Ava Bonney is unlike most ot...,23 Feb 2024,damppebbles,4.0,B0CQHMTW4X,https://www.amazon.com/Deadly-Animals-Novel-Ma...
4,5,,I absolutely loved this book. It was so well w...,23 Sep 2024,Crime fiction lover,,B0CQHMTW4X,https://www.amazon.com/Deadly-Animals-Novel-Ma...


- Saving the cleaned dataset to a csv file:

In [21]:
df.to_csv('data/books_category_reviews.csv', index=False)

#### Beauty category reviews preprocessing:

In [None]:
df2 = pd.read_csv('data/beauty_category_reviews.csv')

display(df2.head())

Unnamed: 0,sno,title,content,date,variant,verified,author,rating,product,product_asin,url
0,1,5.0 out of 5 stars Quality product at a much l...,I have tried more expensive makeup fixatives a...,23 Aug 2024,,No,Amazon Customer,5.0,,B0CTHWQV2R,https://www.amazon.com/W7-Fixer-Matte-Set-Sett...
1,2,5.0 out of 5 stars Love it,Works great Read more,30 Sep 2024,,No,Sandy motherofone,5.0,,B0CTHWQV2R,https://www.amazon.com/W7-Fixer-Matte-Set-Sett...
2,3,4.0 out of 5 stars I like the matte coverage,I like the mat coverage over my make up. It t...,29 Sep 2024,,No,cjosea101,4.0,,B0CTHWQV2R,https://www.amazon.com/W7-Fixer-Matte-Set-Sett...
3,4,3.0 out of 5 stars Does not last a full Eight ...,I used this product immediately after receivin...,15 Jul 2024,,No,Lane,3.0,,B0CTHWQV2R,https://www.amazon.com/W7-Fixer-Matte-Set-Sett...
4,5,5.0 out of 5 stars Consistent,"I've been using this for close to a year now, ...",05 Nov 2024,,No,Hannah L.,5.0,,B0CTHWQV2R,https://www.amazon.com/W7-Fixer-Matte-Set-Sett...


- Dropping unnecessary columns in beauty category reviews dataset.

In [9]:
df2.drop(['variant', 'verified', 'product'], axis=1, inplace=True)

df2.head()

Unnamed: 0,sno,title,content,date,author,rating,product_asin,url
0,1,5.0 out of 5 stars Quality product at a much l...,I have tried more expensive makeup fixatives a...,23 Aug 2024,Amazon Customer,5.0,B0CTHWQV2R,https://www.amazon.com/W7-Fixer-Matte-Set-Sett...
1,2,5.0 out of 5 stars Love it,Works great Read more,30 Sep 2024,Sandy motherofone,5.0,B0CTHWQV2R,https://www.amazon.com/W7-Fixer-Matte-Set-Sett...
2,3,4.0 out of 5 stars I like the matte coverage,I like the mat coverage over my make up. It t...,29 Sep 2024,cjosea101,4.0,B0CTHWQV2R,https://www.amazon.com/W7-Fixer-Matte-Set-Sett...
3,4,3.0 out of 5 stars Does not last a full Eight ...,I used this product immediately after receivin...,15 Jul 2024,Lane,3.0,B0CTHWQV2R,https://www.amazon.com/W7-Fixer-Matte-Set-Sett...
4,5,5.0 out of 5 stars Consistent,"I've been using this for close to a year now, ...",05 Nov 2024,Hannah L.,5.0,B0CTHWQV2R,https://www.amazon.com/W7-Fixer-Matte-Set-Sett...


- Saving the cleaned dataset to a csv file:

In [10]:
df2.to_csv('beauty_category_reviews.csv', index=False)