# Amazon Product Description Scraping Bot
To use this script, install Python 3 or later.
Follow this <a href="https://www.youtube.com/watch?v=Xjv1sY630Uc">video</a> to set up Selenium and ChromeDriver.

Once Python and Selenium are set up,
install the required Python packages:
- `pip install selenium`
- `pip install pandas`

Raw Dataset Collection:<br/>
Download the raw review dataset, which supplies the product IDs to scrape, from https://jmcauley.ucsd.edu/data/amazon/.

### Data Cleaning
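The raw JSON file downloaded from the above website cannot be loaded directly into a pandas DataFrame, because it stores one JSON object per line rather than a single JSON array. Clean it first by running the code in the "Format/Clean the raw Amazon Data" section at the bottom of this notebook.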
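
Each line of the raw file appears to be a standalone JSON object, which is why `pd.read_json` rejects the file as a whole. As an alternative to rewriting the file, the sketch below parses it line by line with the standard `json` module. The helper name `load_json_lines` is hypothetical and not part of the original script, and the sketch assumes every non-blank line is valid JSON.

```python
import json

def load_json_lines(path):
    """Parse a file holding one JSON object per line into a list of dicts.

    Hypothetical helper: it leaves the file on disk untouched.
    """
    records = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```

The resulting list can be passed to `pd.DataFrame(records)`. pandas can also read this layout directly with `pd.read_json(fileName, lines=True)`.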


In [24]:
fileName='JSONs/reviews_Video_Games_5.json' # path to the file you want to use to scrape
saveDirectory='scraped/' # path to the folder to save the scraped data into
scrape_limit=3000 # not every product can be successfully scraped, so set this about 500 higher than the number of descriptions you need

In [25]:
from selenium import webdriver
import pandas as pd
import time

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import ElementNotVisibleException
import json
import sys

chromedriverPath = '/Users/vidit/Desktop/RandomProj/Ryan_Data/chromedriver' # path to the ChromeDriver executable on your machine



In [26]:
for p in sys.path:
    print(p)

/Users/vidit/Desktop/RandomProj/Ryan_Data
/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python37.zip
/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7
/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/lib-dynload

/Users/vidit/Library/Python/3.7/lib/python/site-packages
/usr/local/lib/python3.7/site-packages
/usr/local/lib/python3.7/site-packages/IPython/extensions
/Users/vidit/.ipython


# !!!! Clean the raw data before running the next cell
The cleaning code is in the last section, "Format/Clean the raw Amazon Data".

In [28]:
data = pd.read_json(fileName)
data

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A2HD75EMZR8QLN,0700099867,123,"[8, 12]",Installing the game was a struggle (because of...,1,Pay to unlock content? I don't think so.,1341792000,"07 9, 2012"
1,A3UR8NLLY1ZHCX,0700099867,"Alejandro Henao ""Electronic Junky""","[0, 0]",If you like rally cars get this game you will ...,4,Good rally game,1372550400,"06 30, 2013"
2,A1INA0F5CWW3J4,0700099867,"Amazon Shopper ""Mr.Repsol""","[0, 0]",1st shipment received a book instead of the ga...,1,Wrong key,1403913600,"06 28, 2014"
3,A1DLMTOTHQ4AST,0700099867,ampgreen,"[7, 10]","I got this version instead of the PS3 version,...",3,"awesome game, if it did not crash frequently !!",1315958400,"09 14, 2011"
4,A361M14PU2GUEG,0700099867,"Angry Ryan ""Ryan A. Forrest""","[2, 2]",I had Dirt 2 on Xbox 360 and it was an okay ga...,4,DIRT 3,1308009600,"06 14, 2011"
...,...,...,...,...,...,...,...,...,...
231775,A1ICREREXO9J81,B00KHECZXO,Frustrated gamer,"[0, 1]",Funny people on here are rating sellers that a...,5,this is for rating the system not the seller,1405814400,"07 20, 2014"
231776,A3VVMIMMTYQV5F,B00KHECZXO,Johnny Saigon,"[8, 11]",All this is is the Deluxe 32GB Wii U with Mari...,1,Get the Other Bundle Which Includes Extra Whee...,1403308800,"06 21, 2014"
231777,A1DD4B97M4DUC5,B00KHECZXO,migit,"[62, 66]",The package should have more red on it and sho...,1,Fake bundle,1401321600,"05 29, 2014"
231778,A2Q9CNJ4T6ZK99,B00KHECZXO,"Philip Brown ""Philip & Chana""","[33, 36]",Can get this at Newegg for $329.00 and the pac...,1,Looks Like We Have Gougers Again.,1401667200,"06 2, 2014"


In [29]:
uniqueproducts = data.asin.unique()
uniqueproducts

array(['0700099867', '6050036071', '7100027950', ..., 'B00JXW6GE0',
       'B00KAI3KW2', 'B00KHECZXO'], dtype=object)

In [30]:
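# this launches the browser; Selenium 4 takes the driver path through a Service
# object (the positional form webdriver.Chrome(path) was removed in recent 4.x releases)
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service(chromedriverPath))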

In [31]:
def parse_about(driver,asin):
    # Returns the product description text, or '' when none can be scraped.
    driver.get(f'https://www.amazon.com/dp/{asin}')
    try:
        # find_elements returns [] when nothing matches (find_element would raise),
        # so the fallback branch below is reachable
        productDesc = driver.find_elements(By.ID, 'productDescription')

        if productDesc:
            allp = productDesc[0].find_elements(By.TAG_NAME, "p")
            description=''

            for p in allp:
                description+=p.text
                description+='\n\n'
            return description
        else:
            # no description block: fall back to the feature bullet list
            ul = driver.find_element(By.CSS_SELECTOR, 'ul.a-unordered-list.a-vertical.a-spacing-mini')
            allli = ul.find_elements(By.TAG_NAME, "li")
            description = ''
            for li in allli:
                description += (li.text+'.')
            return description
    except ElementNotVisibleException:
        print("\tpage not available!")
        return ''
    except Exception:
        print("\tcannot get product description")
        return ''

In [32]:
result = []

In [None]:
for i,prod in enumerate(uniqueproducts):
    if i==scrape_limit:
        break
    if i%100==0:
        print("program running:",len(result))
    try:
        result.append(parse_about(driver,prod).replace('..','.'))
    except Exception:
        # keep a placeholder so result[i] still belongs to uniqueproducts[i]
        result.append('')

program running: 0
	cannot get product description
	cannot get product description
	cannot get product description
	cannot get product description
program running: 100
program running: 200
program running: 300
program running: 400
	cannot get product description
program running: 500
	cannot get product description
program running: 600
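
The loop above keeps only the description strings, so the saved CSV has no column tying a description back to its product. A minimal sketch of one way to keep the two together follows. The helper `scrape_all` and its `fetch`, `limit`, and `delay` parameters are hypothetical names introduced for illustration, and the pause between page loads is an assumption rather than part of the original script.

```python
import time

def scrape_all(asins, fetch, limit=None, delay=1.0):
    """Collect (asin, description) pairs, storing '' when a fetch fails.

    `fetch` is any callable that takes an ASIN and returns a string;
    in this notebook it could be `lambda a: parse_about(driver, a)`.
    """
    rows = []
    for i, asin in enumerate(asins):
        if limit is not None and i >= limit:
            break
        try:
            text = fetch(asin)
        except Exception:
            text = ''  # keep the row so every ASIN still has an entry
        rows.append((asin, text))
        if delay:
            time.sleep(delay)  # pause between page loads
    return rows
```

With the names used in this notebook, `rows = scrape_all(uniqueproducts, lambda a: parse_about(driver, a), limit=scrape_limit)` followed by `pd.DataFrame(rows, columns=['asin', 'description'])` would give a table with one row per product ID.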


In [None]:
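import os
df = pd.DataFrame()
df['description'] = result
# write into saveDirectory (set in the first cell), creating the folder if needed
os.makedirs(saveDirectory, exist_ok=True)
df.to_csv(os.path.join(saveDirectory, os.path.basename(fileName).replace('.json','.csv')))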

In [13]:
df

Unnamed: 0,description
0,Dirt 3 is a popular rally racing game for Play...
1,
2,Continue Link's adventures with Legend of Zeld...
3,Having stunning Nintendo Wii High Definition G...
4,Tom Clancy's H.A.W.X. 2 plunges fans into an e...


## Format/Clean the raw Amazon Data

In [14]:
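# The raw file holds one JSON object per line. Wrap the lines in [ ... ] and
# separate them with commas so the whole file becomes a single JSON array.
# Run this only once per file: it overwrites fileName in place.
with open(fileName) as f:
    # drop trailing newlines and skip blank lines so no stray comma is written
    file_lines = [line.strip() for line in f if line.strip()]

with open(fileName,'w') as file:
    file.write('[' + ',\n'.join(file_lines) + ']')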