# Web scraping of Amazon products

We use the Python library selenium for web crawling and scraping.

In [1]:
import sys
!{sys.executable} -m pip install selenium



In [2]:
from selenium import webdriver
from PIL import Image
import requests
import urllib.request
import os

We initialize a new web driver and point it at the page where we want to start scraping. In our case, we scrape the names, prices, and ratings of GPU listings from the first ten pages of Amazon search results for "gpu".

In [6]:
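from selenium.webdriver.chrome.service import Service
# Selenium 4 takes the chromedriver path through a Service object.
driver = webdriver.Chrome(service=Service("/Users/ruicao/Downloads/scrape_notebook/chromedriver"))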
driver.get("https://www.amazon.com/s?k=gpu")

![title](amazon-gpus.png)

We find the link to each product page by looking for anchor (`a`) tags whose class attribute is "a-link-normal a-text-normal", and we store these links in a list. We then follow the link to the next results page and repeat, until we have covered the first ten pages of GPU listings.

In [7]:
product_links = []
link_xpath = './/a[@class = "a-link-normal a-text-normal"]'
next_xpath = './/li[@class = "a-last"]//a'

for page in range(10):
    if page > 0:
        # Follow the "next page" link; stop early if the page has none.
        # find_elements("xpath", ...) replaces find_elements_by_xpath, removed in Selenium 4.3.
        next_links = driver.find_elements("xpath", next_xpath)
        if not next_links:
            break
        driver.get(next_links[0].get_attribute("href"))
    # Collect every product link on the current results page.
    for elem in driver.find_elements("xpath", link_xpath):
        product_links.append(elem.get_attribute("href"))
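
To see what the XPath predicate matches, the sketch below runs the same expression offline with Python's standard-library `xml.etree.ElementTree`, which supports this subset of XPath. The fragment and its `/p/...` links are made up for illustration and are not Amazon's actual markup. The point is that `@class = "..."` compares the whole attribute string, so a tag with fewer or additional classes is skipped.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment standing in for a results page.
SAMPLE = """
<div>
  <a class="a-link-normal a-text-normal" href="/p/1">First GPU</a>
  <a class="a-link-normal" href="/p/2">Different class string</a>
  <a class="a-link-normal a-text-normal s-extra" href="/p/3">Extra class</a>
</div>
"""

root = ET.fromstring(SAMPLE)
# Same predicate as in the notebook: the class attribute must equal the full string.
matches = root.findall('.//a[@class="a-link-normal a-text-normal"]')
hrefs = [a.get("href") for a in matches]
print(hrefs)
```

Only the first link matches, which is one reason a selector like this can silently miss listings if the site changes its class names.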

For each product page, we get the name, price, and rating of the product. To do this, we look for `span` tags whose `id` or `class` attribute uniquely identifies each field. We wrap each lookup in a function and run all three on every page stored in our list of links. Some lookups come back empty because selenium often queries a page before it has finished loading and misses tags that are actually there. In that case we record the price or rating as null (for the name, we fall back to the first path segment of the URL). Running the scrape several times and corroborating the results would yield a more complete dataset.

In [8]:
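def get_name(url):
    # find_elements("xpath", ...) replaces find_elements_by_xpath, removed in Selenium 4.3.
    span = driver.find_elements("xpath", './/span[@id = "productTitle"]')
    if not span:
        # Title not found: fall back to the first path segment of the URL.
        start = url.find("amazon.com/") + 11
        end = url.find("/", start)
        return url[start:end]
    return span[0].text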
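
The fallback in `get_name` can be tried on its own. The sketch below is a variant built on `urllib.parse` rather than the notebook's string slicing; `slug_from_url` is a hypothetical helper name, and both URLs are invented examples.

```python
from urllib.parse import urlparse

def slug_from_url(url):
    """Return the first non-empty path segment of url, or "" if there is none."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return segments[0] if segments else ""

# Hypothetical product URL: the slug is a readable stand-in for the title.
print(slug_from_url("https://www.amazon.com/Example-GPU-Name/dp/B000000000"))
# Hypothetical redirect-style URL: the first segment carries no product name.
print(slug_from_url("https://www.amazon.com/gp/slredirect/example"))
```

For redirect-style links the first segment is just `gp`, so a name obtained through the fallback should be treated as a placeholder, not as a title.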

In [9]:
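def get_price():
    # find_elements("xpath", ...) replaces find_elements_by_xpath, removed in Selenium 4.3.
    span = driver.find_elements("xpath", './/span[@id = "priceblock_ourprice"]')
    if not span:
        return None
    # Drop the leading currency symbol and the thousands separators before converting.
    return float(span[0].text[1:].replace(',', ''))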
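
The slice `text[1:]` assumes the text is a single amount with one leading currency symbol. An empty string or a price range would make `float` raise `ValueError`. Here is a minimal sketch of a more defensive parser that returns `None` instead. `parse_price` is a hypothetical helper, and the sample strings are assumptions about what the page might contain.

```python
import re

# One optional "$", then digits with optional thousands separators and decimals.
_PRICE_RE = re.compile(r"^\$?\s*(\d[\d,]*(?:\.\d+)?)$")

def parse_price(text):
    """Return the price as a float, or None if text is not a single dollar amount."""
    match = _PRICE_RE.match(text.strip())
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))

for sample in ["$219.99", "$1,299.00", "", "$88.99 - $119.99"]:
    print(repr(sample), "->", parse_price(sample))
```

Inside `get_price`, the last line could then become `return parse_price(span[0].text)`.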

In [10]:
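def get_rating():
    # find_elements("xpath", ...) replaces find_elements_by_xpath, removed in Selenium 4.3.
    span = driver.find_elements("xpath", './/span[@class = "reviewCountTextLinkedHistogram noUnderline"]')
    if not span:
        return None
    # The first whitespace-separated token of the title attribute is the numeric rating.
    return float(span[0].get_attribute("title").split()[0])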
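
Since the missing values come from querying a page too early, one option is to retry each lookup for a short time before giving up. The sketch below is a plain-Python polling helper. `poll_until` and its default timings are assumptions, not part of the notebook. Selenium also provides `WebDriverWait` in `selenium.webdriver.support.ui` for explicit waits.

```python
import time

def poll_until(find, timeout=10.0, interval=0.5):
    """Call find() repeatedly until it returns a non-empty result or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = find()
        if result or time.monotonic() >= deadline:
            return result
        time.sleep(interval)

# Stand-in for a lookup that only succeeds on the third attempt.
attempts = {"count": 0}

def fake_find():
    attempts["count"] += 1
    return ["productTitle"] if attempts["count"] >= 3 else []

print(poll_until(fake_find, timeout=1.0, interval=0.01))

# With a live driver, a lookup could be wrapped like this (not executed here):
# span = poll_until(lambda: driver.find_elements("xpath", './/span[@id = "productTitle"]'))
```

The trade-off is speed: pages that genuinely lack a price now cost the full timeout, so a short timeout is probably sensible for a few hundred product pages.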

In [11]:
product_names = []
product_prices = []
product_ratings = []
for link in product_links:
    driver.get(link)
    product_names.append(get_name(link))
    product_prices.append(get_price())
    product_ratings.append(get_rating())
print("Length of Names:", len(product_names))
print("Length of Prices:", len(product_prices))
print("Length of Ratings:", len(product_ratings))

Length of Names: 228
Length of Prices: 228
Length of Ratings: 228
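
All three lists have the same length because missing values are stored as `None`. One reading of "running multiple iterations and corroborating the data" is to scrape several times and keep the first non-null value seen for each field. The sketch below assumes each run is a dictionary keyed by URL. `merge_runs`, the field names, and the sample values are all hypothetical.

```python
def merge_runs(runs):
    """Combine several scrapes keyed by URL, keeping the first non-None value per field."""
    merged = {}
    for run in runs:
        for url, fields in run.items():
            target = merged.setdefault(url, {})
            for key, value in fields.items():
                # Only fill a field that is still missing.
                if target.get(key) is None:
                    target[key] = value
    return merged

# Made-up results from two passes over the same hypothetical page.
run_1 = {"https://example.com/gpu-a": {"price": None, "rating": 4.5}}
run_2 = {"https://example.com/gpu-a": {"price": 199.99, "rating": None}}

combined = merge_runs([run_1, run_2])
print(combined)
```

This only fills gaps. It does not detect runs that disagree, for example when a price changes between passes, so a stricter version would compare the non-null values as well.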


We store the information that we scraped in a pandas DataFrame and then export it as a CSV file, ready to be analyzed!

In [12]:
import pandas as pd

In [13]:
zippedList =  list(zip(product_names, product_prices, product_ratings, product_links))
amazon_gpus = pd.DataFrame(zippedList, columns = ['Product Name' , 'Price', 'Rating (out of 5)', 'URL']) 

In [14]:
amazon_gpus

| | Product Name | Price | Rating (out of 5) | URL |
|---|---|---|---|---|
| 0 | MSI Gaming Radeon RX 5500 XT Boost Clock: 1845... | 219.99 | 4.5 | https://www.amazon.com/MSI-Gaming-RX-5500-XT/d... |
| 1 | XFX Radeon RX 580 GTS XXX Edition 1386MHz OC+,... | 212.81 | 4.5 | https://www.amazon.com/XFX-Radeon-1386MHz-Grap... |
| 2 | XFX RX 5600 XT THICC III PRO - 14GBPS 6GB GDDR... | 321.14 | 4.5 | https://www.amazon.com/XFX-5600-THICC-III-PRO/... |
| 3 | Gigabyte GV-N1030OC-2GI Nvidia GeForce GT 1030... | 88.99 | 4.5 | https://www.amazon.com/Gigabyte-GV-N1030OC-2GI... |
| 4 | PNY GeForce GT 1030 2GB Graphic Card (VCGGT103... | 119.99 | 4.3 | https://www.amazon.com/PNY-GeForce-1030-Graphi... |
| ... | ... | ... | ... | ... |
| 223 | ASUS GeForce GTX 1070 8GB Turbo Edition 4K & V... | | 4.3 | https://www.amazon.com/ASUS-GeForce-Auto-Extre... |
| 224 | EVGA GeForce GTX 1080 Ti SC2 Gaming, 11GB GDDR... | | 4.7 | https://www.amazon.com/EVGA-GeForce-Gaming-GDD... |
| 225 | EVGA GeForce GTX 1650 XC Ultra Gaming, 4GB GDD... | 241.81 | 4.6 | https://www.amazon.com/EVGA-GeForce-Ultra-Gami... |
| 226 | EZDIY-FAB New PCI Express PCIe3.0 16x Flexible... | 23.98 | 4.2 | https://www.amazon.com/gp/slredirect/picassoRe... |


In [15]:
amazon_gpus.to_csv("amazon_gpus.csv")
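
With its default settings, `to_csv` also writes the DataFrame index as an unnamed first column, which pandas labels `Unnamed: 0` when the file is read back. The sketch below shows the difference that `index=False` makes. It uses a one-row made-up DataFrame and in-memory buffers instead of the real `amazon_gpus.csv`.

```python
import io

import pandas as pd

# Tiny made-up frame standing in for amazon_gpus.
df = pd.DataFrame({"Product Name": ["Example GPU"], "Price": [199.99]})

# Default behaviour: the index is written as an extra, unnamed column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out of the file.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

If the row numbers are not needed downstream, `amazon_gpus.to_csv("amazon_gpus.csv", index=False)` keeps the file to the four scraped columns.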

One limitation of our current program is that it also scrapes some extraneous listings, such as sponsored ones. We will have to filter these out, either before scraping or in the final dataset.
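
As a starting point for that filter, note that the URL in row 226 passes through a `/gp/slredirect/` path instead of pointing at a product page directly. The sketch below drops links of that shape. It assumes, without confirmation, that sponsored listings can be recognized this way. `is_sponsored` and `drop_sponsored` are hypothetical helpers, and the sample URLs are invented.

```python
from urllib.parse import urlparse

def is_sponsored(url):
    """Guess whether url is a sponsored redirect (assumption: path starts with /gp/slredirect/)."""
    return urlparse(url).path.startswith("/gp/slredirect/")

def drop_sponsored(links):
    """Return the links that do not look like sponsored redirects."""
    return [link for link in links if not is_sponsored(link)]

# Invented examples of the two URL shapes.
sample_links = [
    "https://www.amazon.com/Example-GPU-Name/dp/B000000000",
    "https://www.amazon.com/gp/slredirect/example",
]
kept = drop_sponsored(sample_links)
print(kept)
```

Applying `drop_sponsored(product_links)` before the scraping loop would also avoid visiting those pages, although any sponsored listings that use ordinary product URLs would still slip through.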