# Web Scraping Data

The purpose of this notebook is to scrape data from grailed.com and build a DataFrame for analysis and modeling. First, I automated a browser to scroll the main page 200 times, loading more listings each time, and collected the summary information and URL for every listing. Next, I followed those URLs and scraped more complete information on each listing, plus its top image. Finally, I joined the resulting DataFrames and saved the result for import into my EDA notebook.

## Importing Libraries

In [1]:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import NoSuchElementException
from webdriver_manager.chrome import ChromeDriverManager
import time
import numpy as np
import urllib.request
import progressbar

import sys

In [2]:
sys.path.append('/Users/nicksubic/Documents/flatiron/phase_1/nyc-mhtn-ds-091420-lectures/capstone/Clothing_Recommender/src/')
sys.path.append('/Users/nicksubic/.wdm/drivers/chromedriver/mac64/87.0.4280.88/chromedriver')

In [3]:
import data

In [4]:
# Instantiate the webdriver

driver = webdriver.Chrome(ChromeDriverManager().install())

[WDM] - Current google-chrome version is 87.0.4280
[WDM] - Get LATEST driver version for 87.0.4280
[WDM] - Get LATEST driver version for 87.0.4280
[WDM] - Trying to download new driver from http://chromedriver.storage.googleapis.com/87.0.4280.88/chromedriver_mac64.zip
 
[WDM] - Driver has been saved in cache [/Users/nicksubic/.wdm/drivers/chromedriver/mac64/87.0.4280.88]


### Instantiating the webdriver and scrolling to update the page 200 times

In [5]:
chrome_options = webdriver.ChromeOptions()

# Navigate to the main page for tops
driver.get('https://www.grailed.com/categories/tops')
timeout = 30

#Wait for the page to load or return a Time Out error and close
try:
    WebDriverWait(driver, timeout).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='feed-item']")))
except TimeoutException:
    print("Timed Out")
    driver.quit()
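
The explicit wait above keeps checking the page until the first feed item is visible and gives up after `timeout` seconds. As a rough illustration of that polling pattern only (this is not Selenium's implementation, and `wait_until` and `ready` are hypothetical names), a minimal sketch:

```python
import time

def wait_until(condition, timeout=30.0, poll=0.5):
    """Call `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)

# Example: a condition that becomes true on the third check
attempts = []

def ready():
    attempts.append(1)
    return len(attempts) >= 3

result = wait_until(ready, timeout=5.0, poll=0.01)
```

In the notebook, `WebDriverWait(driver, timeout).until(...)` plays this role, with `EC.visibility_of_element_located` as the condition.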

In [9]:
def scroll_page(number_of_scrolls):
    '''Scrolls to the bottom of the page the given number of times and returns the feed items found'''
    # Bypass the header and recommended items
    results = driver.find_elements_by_xpath('//div[@class="FiltersInstantSearch"]//div[@class="feed-item"]')

    # Start the scroll
    for i in range(0, number_of_scrolls):
        driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
        
        # Print an update every 30 scrolls
        if i%30 == 0:
            print(f'Completed scroll {i+1} out of {number_of_scrolls}')
        time.sleep(2)

    # Set the results as everything after the initial scroll
    results = driver.find_elements_by_xpath('//div[@class="FiltersInstantSearch"]//div[@class="feed-item"]')
    return results
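
`scroll_page` always performs the full number of scrolls, even if the site stops loading new listings. One possible variation, sketched here under the assumption that the page height stops changing once nothing more loads, is to stop early. `scroll_until_stable` and `FakeDriver` are hypothetical names; the fake driver only exists so the sketch runs without a browser.

```python
import time

def scroll_until_stable(driver, max_scrolls, pause=2.0):
    """Scroll to the bottom up to `max_scrolls` times, stopping once the page stops growing.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for count in range(1, max_scrolls + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # give the site time to load the next batch
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return count
        last_height = new_height
    return max_scrolls


class FakeDriver:
    """Stand-in for a webdriver: the page grows on the first two scrolls, then stops."""

    def __init__(self):
        self.heights = [1000, 2000, 3000]
        self.position = 0

    def execute_script(self, script):
        if script.startswith("return"):
            return self.heights[self.position]
        # A scroll: the page grows until the last height is reached
        self.position = min(self.position + 1, len(self.heights) - 1)


scrolls = scroll_until_stable(FakeDriver(), max_scrolls=200, pause=0)
```

A slow network can make the height look stable before new items arrive, so a short `pause` may end the loop too early; the fixed-count loop above avoids that risk at the cost of extra scrolls.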

In [10]:
results = scroll_page(200)

Completed scroll 1 out of 200
Completed scroll 31 out of 200
Completed scroll 61 out of 200
Completed scroll 91 out of 200
Completed scroll 121 out of 200
Completed scroll 151 out of 200
Completed scroll 181 out of 200


 <selenium.webdriver.remote.webelement.WebElement (session="98065bfa0bad5548d8f2910bfe9bfe0e", element="895fcbd1-331c-4c44-b064-0cd977daff07")>,
 <selenium.webdriver.remote.webelement.WebElement (session="98065bfa0bad5548d8f2910bfe9bfe0e", element="6490ee02-5791-4b44-8ff6-134a0d8a2104")>,
 <selenium.webdriver.remote.webelement.WebElement (session="98065bfa0bad5548d8f2910bfe9bfe0e", element="1e6b5666-c83f-4a90-a150-577065132b7d")>,
 <selenium.webdriver.remote.webelement.WebElement (session="98065bfa0bad5548d8f2910bfe9bfe0e", element="a8c8ef66-497b-455e-a689-971a8e5ae873")>,
 <selenium.webdriver.remote.webelement.WebElement (session="98065bfa0bad5548d8f2910bfe9bfe0e", element="7a2b084a-2245-40bb-b870-7d16eaaca4aa")>,
 <selenium.webdriver.remote.webelement.WebElement (session="98065bfa0bad5548d8f2910bfe9bfe0e", element="3e2ae70c-4acb-49dc-9cda-4d4fb1362f3c")>,
 <selenium.webdriver.remote.webelement.WebElement (session="98065bfa0bad5548d8f2910bfe9bfe0e", element="b6631ed7-0c01-436b-8a75-302aa866ab9a")>,


### Creating the first DataFrame by scraping the first page

In [9]:
df = data.make_dataframe(results)

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10100 entries, 0 to 10099
Data columns (total 9 columns):
Name        10100 non-null object
Designer    10100 non-null object
Price       3658 non-null object
NewPrice    6442 non-null object
OldPrice    6442 non-null object
Size        10100 non-null object
Time        10100 non-null object
LastBump    7759 non-null object
Link        10100 non-null object
dtypes: object(9)
memory usage: 710.3+ KB


### Creating the second DataFrame by following the links in the first one, gathering more info and downloading images

In [18]:
from selenium.common.exceptions import WebDriverException

def page_dataframe(list_of_links):
    '''Takes a list of links to grailed.com, scrapes data from each page and saves the main image.
       Returns a DataFrame with each page's summary information.'''

    # Lists to append everything to
    UserName=[]
    Sold=[]
    CurrentListings=[]
    Description=[]
    ProfileLink=[]
    FeedBack=[]
    FeedbackLink=[]
    FollowerCount=[]
    FullSize=[]
    PostedTime=[]
    BumpedTime=[]
    Location=[]

    # Every list that ends up as a column in the final DataFrame
    columns = [UserName, Sold, FeedBack, CurrentListings, Description, ProfileLink,
               FeedbackLink, FollowerCount, FullSize, PostedTime, BumpedTime, Location]

    # Iterate through the list of links
    for i, link in enumerate(list_of_links):
        
        # If the page fails to load, add a NaN value to every column in that row
        try:
            driver.get(link)
        except WebDriverException:
            for column in columns:
                column.append(np.nan)
            continue
        
        # Download the first image for the listing; if it's missing, keep scraping the rest of the page.
        # Skipping the row here would leave these lists shorter than df and misalign the later concat.
        try:
            image = driver.find_element_by_xpath('//div[@class="-image-wrapper"]')
            src = image.find_element_by_tag_name('img').get_attribute('src')
            urllib.request.urlretrieve(src, f'src/images/{i}.jpg')
        except NoSuchElementException:
            pass
        
        # Get the seller's name, or input NaN if there isn't one
        try:
            UserName.append(driver.find_element_by_xpath('//span[@class="-username"]').text)
        except NoSuchElementException:
            UserName.append(np.nan)

        # Get number of items sold by seller, else NaN
        try:
            Sold.append(driver.find_element_by_xpath('//a[@class="-link"]/span[2]').text)
        except NoSuchElementException:
            Sold.append(np.nan)

        # Get the number of feedback ratings the seller has
        try:
            FeedBack.append(driver.find_element_by_xpath('//span[@class="-feedback-count"]').text)
        except NoSuchElementException:
            FeedBack.append(np.nan)

        # Get number of seller's listings
        try:
            CurrentListings.append(driver.find_element_by_xpath('//a[@class="-for-sale-link"]').text)
        except NoSuchElementException:
            CurrentListings.append(np.nan)

        # Get the item description (generally a fairly long string)
        try:
            Description.append(driver.find_element_by_xpath('//div[@class="listing-description"]').text)
        except NoSuchElementException:
            Description.append(np.nan)

        # Get seller's profile link
        try:
            ProfileLink.append(driver.find_element_by_xpath('//span[@class="Username"]/a').get_attribute("href"))
        except NoSuchElementException:
            ProfileLink.append(np.nan)

        # Get Feedback Link
        try:
            FeedbackLink.append(driver.find_element_by_xpath('//div[@class="-details"]/a').get_attribute("href"))
        except NoSuchElementException:
            FeedbackLink.append(np.nan)

        # Get number of likes for item
        try:
            FollowerCount.append(driver.find_element_by_xpath('//p[@class="-follower-count"]').text)
        except NoSuchElementException:
            FollowerCount.append(np.nan)

        # Get expanded size info
        try:
            FullSize.append(driver.find_element_by_xpath('//h2[@class="listing-size sub-title"]').text)
        except NoSuchElementException:
            FullSize.append(np.nan)

        # Get time posted
        try:
            PostedTime.append(driver.find_element_by_xpath('//div[@class="-metadata"]/span[2]').text)
        except NoSuchElementException:
            PostedTime.append(np.nan)

        # Get time price dropped most recently
        try:
            BumpedTime.append(driver.find_element_by_xpath('//div[@class="-metadata"]/span[4]').text)
        except NoSuchElementException:
            BumpedTime.append(np.nan)

        # Get seller location
        try:
            Location.append(driver.find_element_by_xpath('//label[@class="--label"]').text)
        except NoSuchElementException:
            Location.append(np.nan)
        
        # Print status update every 100 pages
        if i%100 == 0:
            print(f'Completed Page {i} out of {len(list_of_links)}.')

    # Create dictionary with every list generated by the scrape loop   
    page_dict = {'Username': UserName,
                     'Sold': Sold,
                     'Feedback': FeedBack,
                     'CurrentListings': CurrentListings, 
                     'Description': Description, 
                     'ProfileLink': ProfileLink, 
                     'FeedbackLink': FeedbackLink, 
                     'FollowerCount': FollowerCount,
                     'FullSize': FullSize,
                     'PostedTime': PostedTime,
                     'BumpedTime': BumpedTime, 
                     'Location': Location}

    # Convert the dictionary to a DataFrame and return it along with its row count
    page_df = pd.DataFrame(page_dict)
    return page_df, len(page_df)
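
Most of `page_dataframe` repeats one pattern: look up an element, take its text, and fall back to NaN if the element is missing. A possible way to factor that out is sketched below. `text_or_nan`, `FakeElement`, `PAGE` and `fake_find` are hypothetical names, and the lookup function and exception types are passed in so the sketch runs without a browser.

```python
import math

def text_or_nan(find, xpath, missing=(LookupError,)):
    """Return the text of the element at `xpath`, or NaN if the lookup fails."""
    try:
        return find(xpath).text
    except missing:
        return math.nan


# Stand-ins for demonstration only: a fake element and a fake lookup function
class FakeElement:
    def __init__(self, text):
        self.text = text

PAGE = {'//span[@class="-username"]': FakeElement("oghypeshop")}

def fake_find(xpath):
    return PAGE[xpath]  # raises KeyError (a LookupError) when the element is missing

username = text_or_nan(fake_find, '//span[@class="-username"]')
followers = text_or_nan(fake_find, '//p[@class="-follower-count"]')
```

In the notebook, the call would look something like `text_or_nan(driver.find_element_by_xpath, xpath, missing=(NoSuchElementException,))`, with one call per column.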

In [19]:
df2 = page_dataframe(df.Link)

Completed Page 0 out of 10100.
Completed Page 100 out of 10100.
Completed Page 200 out of 10100.
Completed Page 300 out of 10100.
Completed Page 400 out of 10100.
Completed Page 500 out of 10100.
Completed Page 600 out of 10100.
Completed Page 700 out of 10100.
Completed Page 800 out of 10100.
Completed Page 900 out of 10100.
Completed Page 1000 out of 10100.
Completed Page 1100 out of 10100.
Completed Page 1200 out of 10100.
Completed Page 1300 out of 10100.
Completed Page 1400 out of 10100.
Completed Page 1500 out of 10100.
Completed Page 1600 out of 10100.
Completed Page 1700 out of 10100.
Completed Page 1800 out of 10100.
Completed Page 1900 out of 10100.
Completed Page 2000 out of 10100.
Completed Page 2100 out of 10100.
Completed Page 2200 out of 10100.
Completed Page 2300 out of 10100.
Completed Page 2400 out of 10100.
Completed Page 2500 out of 10100.
Completed Page 2600 out of 10100.
Completed Page 2700 out of 10100.
Completed Page 2800 out of 10100.
Completed Page 2900 out of
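
With roughly ten thousand pages to visit, a single failed image download should not stop the run. The sketch below shows one way to make the download step in `page_dataframe` more forgiving. `download_image` is a hypothetical helper, and the demo copies a local temporary file through a `file://` URL instead of contacting grailed.com.

```python
import os
import tempfile
import urllib.request
from pathlib import Path

def download_image(src, path):
    """Save the image at `src` to `path`; return True on success, False otherwise."""
    if not src:  # get_attribute('src') returns None when the attribute is absent
        return False
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    try:
        urllib.request.urlretrieve(src, path)
    except (OSError, ValueError):
        # URLError and HTTPError are subclasses of OSError;
        # ValueError covers malformed URLs
        return False
    return True


# Demo with local files only
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.jpg"
    source.write_bytes(b"not really a jpeg")
    saved = download_image(source.as_uri(), os.path.join(tmp, "images", "0.jpg"))
    copied = Path(tmp, "images", "0.jpg").read_bytes()
    missing = download_image(Path(tmp, "nope.jpg").as_uri(), os.path.join(tmp, "images", "1.jpg"))
    skipped = download_image(None, os.path.join(tmp, "images", "2.jpg"))
```

Returning a boolean rather than raising lets the caller record which listings have no image without interrupting the loop.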

In [11]:
# Close webdriver
driver.close()

### Combining the DataFrames and verifying everything worked

In [22]:
# Combine the DataFrames
df3 = pd.concat([df, df2[0]], axis = 1)
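
`pd.concat(..., axis=1)` pairs rows by index label, so the join above is only correct if row `i` of both DataFrames describes the same listing. A more defensive alternative, sketched here with toy data, is to join on the listing URL. This assumes the second DataFrame also carries a `Link` column, which `page_dataframe` does not currently record; `join_on_link` is a hypothetical name.

```python
import pandas as pd

def join_on_link(summary, details):
    """Left-join listing details onto the summary frame using the Link column."""
    # validate="one_to_one" raises an error if a Link appears more than once in either frame
    return summary.merge(details, on="Link", how="left", validate="one_to_one")


# Toy data: the second listing has no scraped details
summary = pd.DataFrame({"Name": ["a", "b", "c"], "Link": ["u1", "u2", "u3"]})
details = pd.DataFrame({"Link": ["u1", "u3"], "Username": ["x", "z"]})
merged = join_on_link(summary, details)
```

Listings without details keep their summary columns and get NaN in the detail columns. If the scroll picked up the same listing twice, the duplicates would need to be dropped before this merge.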

In [23]:
df3.head()

| | Name | Designer | Price | NewPrice | OldPrice | Size | Time | LastBump | Link | Username | ... | Feedback | CurrentListings | Description | ProfileLink | FeedbackLink | FollowerCount | FullSize | PostedTime | BumpedTime | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bape Varsity style Jacket Bathing Ape | BAPE | $155 | | | M | about 14 hours ago | | https://www.grailed.com/listings/18766603-bape... | oghypeshop | ... | 77 Feedback | 59 Listings for Sale | Bape Varsity Jacket\nSize M fits true\nRelease... | https://www.grailed.com/oghypeshop | https://www.grailed.com/oghypeshop/feedback | 111 | Size: US M / EU 48-50 / 2 | 1 day ago | | Add a comment |
| 1 | Vintage Nike Sunfaded Mini Swoosh Travis Style... | MADE IN USA × NIKE × VINTAGE | $100 | | | L | about 17 hours ago | | https://www.grailed.com/listings/18763620-made... | im_groot | ... | 8 Feedback | 47 Listings for Sale | Brand : Vintage Nike sunfaded mini swoosh blac... | https://www.grailed.com/im_groot | https://www.grailed.com/im_groot/feedback | 87 | Size: US L / EU 52-54 / 3 | 1 day ago | | Shipping: Asia to |
| 2 | CDG logo hoodie | CDG CDG CDG × COMME DES GARCONS | $129 | | | L | about 18 hours ago | | https://www.grailed.com/listings/18761921-cdg-... | binefartoldn | ... | 25 Feedback | 9 Listings for Sale | Comes des garçons black hoodie\nFrom Dover str... | https://www.grailed.com/binefartoldn | https://www.grailed.com/binefartoldn/feedback | 168 | Size: US L / EU 52-54 / 3 | 1 day ago | | Shipping: UK to |
| 3 | 90s Faded Uni Blank Tee | STREETWEAR × VINTAGE | $25 | | | M | 1 day ago | | https://www.grailed.com/listings/18750493-stre... | TwoFold | ... | 1086 Feedback | 189 Listings for Sale | 90s Faded Uni Blank Tee. Size Medium.\nPit To ... | https://www.grailed.com/TwoFold | https://www.grailed.com/TwoFold/feedback | 107 | Size: US M / EU 48-50 / 2 | 2 days ago | | Shipping: US to |
| 4 | Vintage Nike Grey Black Big Logo Spellout Hood... | NIKE × VINTAGE | $55 | | | M | 1 day ago | | https://www.grailed.com/listings/18745492-nike... | gcwiek | ... | 275 Feedback | 46 Listings for Sale | Mens medium\nGood condition other than some we... | https://www.grailed.com/gcwiek | https://www.grailed.com/gcwiek/feedback | 108 | Size: US M / EU 48-50 / 2 | 2 days ago | | Shipping: US to |


In [26]:
# Save to csv
df3.to_csv('./src/data/grailed.csv')

The info is now scraped, combined and ready to load into the EDA notebook.