# Web scraping of Amazon products

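We use the Python library Selenium for web crawling and scraping. The code below is written against the Selenium 3 API: the `find_elements_by_xpath` helper it relies on was removed in Selenium 4.3, where the equivalent call is `driver.find_elements(By.XPATH, ...)`.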

In [1]:
import sys
!{sys.executable} -m pip install selenium


In [2]:
from selenium import webdriver
from PIL import Image
import requests
import urllib.request
import os

We initialize a new web driver and point it at the page where we want to start scraping. In our case, we scrape the names, prices, and ratings of GPU listings from the first ten pages of Amazon search results for "gpu".

In [4]:
driver = webdriver.Chrome("/Users/awang/Desktop/saas-work/WebScraping/chromedriver")
driver.get("https://www.amazon.com/s?k=gpu")

![title](amazon-gpus.png)

We find the link to each product page by selecting the anchor (`a`) tags whose `class` attribute is "a-link-normal a-text-normal", and we store these links in a list. We do this for the first ten pages of GPU listings.

In [5]:
link_xpath = './/a[@class = "a-link-normal a-text-normal"]'
next_xpath = './/li[@class = "a-last"]//a'

product_links = []
for page in range(10):
    # Collect every product link on the results page that is currently loaded.
    for elem in driver.find_elements_by_xpath(link_xpath):
        product_links.append(elem.get_attribute("href"))
    # Follow the "Next" link; stop after ten pages or when there is no such link,
    # instead of raising an IndexError on an empty result.
    next_links = driver.find_elements_by_xpath(next_xpath)
    if page == 9 or not next_links:
        break
    driver.get(next_links[0].get_attribute("href"))
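
The collected list may contain repeated or missing links. That is an assumption on our part, not something the output above shows: the same listing can plausibly appear on more than one results page, and an anchor without an `href` yields `None`. A minimal sketch of an order-preserving clean-up step, where `dedupe_links` and the sample URLs are hypothetical:

```python
def dedupe_links(links):
    """Drop missing hrefs and repeated URLs while keeping first-seen order."""
    # dict.fromkeys preserves insertion order (Python 3.7+), so the first
    # occurrence of each URL keeps its position in the list.
    return list(dict.fromkeys(link for link in links if link))


# Hypothetical sample; in the notebook this would be `product_links`.
sample_links = [
    "https://www.amazon.com/example-a",
    None,
    "https://www.amazon.com/example-b",
    "https://www.amazon.com/example-a",
]
unique_links = dedupe_links(sample_links)
```

Running this before visiting the product pages would avoid loading the same page twice.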

For each product page, we extract the name, price, and rating of the product. To do this, we look for `span` tags whose `id` or `class` attribute uniquely identifies each field. We write one function per field and run them on every link stored in our initial list. Some fields come back as null: Selenium often queries a page before it has finished rendering, so it can miss a tag that is actually there. We record these as null for now; running the scrape several times and merging the results would give a more complete dataset.

In [6]:
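def get_name(url):
    # Read the product title from the page currently loaded in the driver.
    span = driver.find_elements_by_xpath('.//span[@id = "productTitle"]')
    if not span:
        # Title not found: fall back to the first path segment of the URL.
        start = url.find("amazon.com/") + 11
        end = url.find("/", start)
        return url[start:end]
    return span[0].text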
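
The URL fallback in `get_name` relies on string offsets. If "amazon.com/" is missing from the URL, or no "/" follows the slug, `str.find` returns -1 and the slice silently yields the wrong text. As a hedged alternative, the same idea can be sketched with `urllib.parse`; `name_from_url` and the example URL are hypothetical:

```python
from urllib.parse import urlparse


def name_from_url(url):
    """Return the first path segment of a URL, or None if the path is empty."""
    segments = [part for part in urlparse(url).path.split("/") if part]
    return segments[0] if segments else None


# Hypothetical URL, shaped like the product links in this notebook.
example = "https://www.amazon.com/Some-Product-Name/dp/B000000000"
slug = name_from_url(example)
```

Returning `None` for an empty path keeps the missing-name case consistent with how `get_price` and `get_rating` report missing values.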

In [7]:
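def get_price():
    # Look up the price on the page currently loaded in the driver.
    span = driver.find_elements_by_xpath('.//span[@id = "priceblock_ourprice"]')
    if not span:
        return None
    # Drop the leading currency symbol and the thousands separators before converting.
    return float(span[0].text[1:].replace(',',''))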
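
Note that `get_price` assumes the text is a single leading currency symbol followed by a number. An empty string, or a price range, would make `float` raise a `ValueError`. A minimal sketch of a more forgiving parser follows. `parse_price` is a hypothetical helper, and the range format in the usage lines is an assumption about what the page might show:

```python
import re

# Matches the first number in the text, such as "1,599.99" or "24.19".
_PRICE_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_price(text):
    """Return the first price found in `text` as a float, or None if there is none."""
    match = _PRICE_RE.search(text or "")
    if match is None:
        return None
    return float(match.group().replace(",", ""))


single = parse_price("$1,599.99")
lowest = parse_price("$24.19 - $30.00")  # assumed range format: keeps the first value
missing = parse_price("")
```

With this helper, `get_price` could return `parse_price(span[0].text)` instead of slicing the string by hand.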

In [8]:
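def get_rating():
    # Look up the rating on the page currently loaded in the driver.
    span = driver.find_elements_by_xpath('.//span[@class = "reviewCountTextLinkedHistogram noUnderline"]')
    if not span:
        return None
    # The title attribute starts with the numeric rating; keep only that first token.
    return float(span[0].get_attribute("title").split()[0])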
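
Because a tag can be missed when the page has not finished rendering, one option is to retry a lookup a few times before recording it as null. The sketch below is a generic helper, not part of Selenium: `retry_until_found`, its parameters, and its defaults are hypothetical. Selenium's own explicit waits (`WebDriverWait`) address the same problem.

```python
import time


def retry_until_found(getter, attempts=3, delay=1.0, sleep=time.sleep):
    """Call `getter` until it returns a non-None value or the attempts run out."""
    for attempt in range(attempts):
        value = getter()
        if value is not None:
            return value
        if attempt < attempts - 1:
            sleep(delay)  # give the page time to finish rendering
    return None
```

In the scraping loop, `product_prices.append(retry_until_found(get_price))` would then replace the direct call to `get_price()`. The `sleep` parameter exists only so the delay can be swapped out when testing.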

In [9]:
product_names = []
product_prices = []
product_ratings = []
for link in product_links:
    driver.get(link)
    product_names.append(get_name(link))
    product_prices.append(get_price())
    product_ratings.append(get_rating())
print("Length of Names:", len(product_names))
print("Length of Prices:", len(product_prices))
print("Length of Ratings:", len(product_ratings))

Length of Names: 201
Length of Prices: 201
Length of Ratings: 201
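
To combine several scraping runs as suggested above, one simple approach is to keep, for each product, the first non-null value seen across the runs. The sketch below assumes every run visits the same links in the same order; `merge_runs` and the sample values are hypothetical:

```python
def merge_runs(*runs):
    """Merge parallel lists of scraped values, keeping the first non-None per position."""
    merged = []
    for values in zip(*runs):
        merged.append(next((v for v in values if v is not None), None))
    return merged


# Hypothetical prices from two passes over the same three links.
first_pass = [699.99, None, None]
second_pass = [699.99, 369.99, None]
merged_prices = merge_runs(first_pass, second_pass)
```

A position stays `None` only if every run missed it, which is a reasonable signal that the page really has no such field.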


We store the scraped information in a pandas DataFrame and then export it as a CSV file, ready to be analyzed!

In [10]:
import pandas as pd

In [11]:
zippedList =  list(zip(product_names, product_prices, product_ratings, product_links))
amazon_gpus = pd.DataFrame(zippedList, columns = ['Product Name' , 'Price', 'Rating (out of 5)', 'URL']) 

In [12]:
amazon_gpus

|  | Product Name | Price | Rating (out of 5) | URL |
| --- | --- | --- | --- | --- |
| 0 | XFX AMD Radeon RX 580 GTS XXX Edition Graphics... | 699.99 |  | https://www.amazon.com/gp/slredirect/picassoRe... |
| 1 | ASUS Phoenix GeForce GTX 1650 OC Edition Graph... | 369.99 |  | https://www.amazon.com/gp/slredirect/picassoRe... |
| 2 | ASUS TUF Gaming NVIDIA GeForce RTX 3070 OC Edi... | 1599.99 | 1.0 | https://www.amazon.com/Gaming-GeForce-Graphics... |
| 3 | HHCJ6 Dell NVIDIA Tesla K80 24GB GDDR5 PCI-E 3... | 199.00 | 3.9 | https://www.amazon.com/Dell-Tesla-K80-Accelera... |
| 4 | Gigabyte Geforce GTX 1050 Ti OC 4GB GDDR5 128 ... | 309.99 | 4.6 | https://www.amazon.com/Gigabyte-Geforce-GDDR5-... |
| ... | ... | ... | ... | ... |
| 196 | EVGA EPOWER V (100-UV-0600-BR) | 149.99 | 5.0 | https://www.amazon.com/EVGA-100-UV-0600-BR-EPO... |
| 197 | EVGA GeForce GT 1030 SC 2GB GDDR5 Single Slot ... |  | 4.5 | https://www.amazon.com/EVGA-GeForce-Single-Gra... |
| 198 | 2 GB Graphics Video Card GPU Upgrade Replaceme... | 185.98 | 5.0 | https://www.amazon.com/Graphics-Replacement-Mi... |
| 199 | Bewinner1 Graphics Card GPU Brace Support Hold... | 24.19 |  | https://www.amazon.com/Bewinner1-Graphics-All-... |


In [13]:
amazon_gpus.to_csv("amazon_gpus.csv")

One limitation of our current program is that it also scrapes extraneous listings, such as sponsored ones. These need to be filtered out, either before scraping or from the final dataset.
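
As a sketch of filtering before scraping: the first two rows of the table above have URLs that pass through `/gp/slredirect/`, and we assume here that this path marks a sponsored listing. The source does not confirm that assumption. `is_sponsored`, `drop_sponsored`, and the sample URLs are hypothetical:

```python
# Assumption: sponsored results are the links routed through this redirect path.
SPONSORED_MARKER = "/gp/slredirect/"


def is_sponsored(url):
    """Return True if the URL looks like a sponsored-listing redirect."""
    return SPONSORED_MARKER in url


def drop_sponsored(links):
    """Keep only the links that do not look sponsored."""
    return [link for link in links if not is_sponsored(link)]


# Hypothetical sample; in the notebook this would be `product_links`.
sample = [
    "https://www.amazon.com/gp/slredirect/example",
    "https://www.amazon.com/Example-Product/dp/B000000000",
]
organic = drop_sponsored(sample)
```

The same predicate could be applied to the `URL` column of the final dataset instead, if the sponsored rows are worth keeping in a separate table.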