# Grailed Webscrape Part 1: Designer Links

Grailed is an especially difficult website to scrape. Originally I hoped to scrape grailed.com/sold directly, but I quickly discovered that one can only scroll far enough to collect about 800 sold items. To get around this limit, I decided to collect sold items on Grailed with the following steps.

1. Scrape Grailed for the links to all the designers on the site (~6,000 brands in total).
2. Go to the sold section for each of those designers.
3. For each designer, collect the link to each individual item.
4. For each item gather the essential components that can be used for modeling.

This first notebook completes the first two steps of the scraping task: it first collects the link to each designer page on Grailed, then collects the link to the sold page for each of those designers.

Help was received from the following post, although the notebooks that follow have been heavily adapted: https://medium.com/@mike_liu/scraping-grailed-8501eef914a8

In [10]:
# imports
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.action_chains import ActionChains
import time

## Part 1: Designer Links

In [13]:
# Setting a base url for where to find all the designers on grailed
base_url = "https://www.grailed.com/designers/"

# open up chrome
chrome_options = webdriver.ChromeOptions()
driver = webdriver.Chrome("C:/Users/alber/Documents/chromedriver.exe", options=chrome_options)  # replace the chromedriver path with your own
driver.get(base_url)

# wait up to 30 seconds for the page to load; quit the browser if it takes longer
timeout = 30
try:
    WebDriverWait(driver, timeout).until(EC.visibility_of_element_located((By.XPATH, "//div[@id='app']")))
except TimeoutException:
    print("Timed out")
    driver.quit()
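
The explicit wait above keeps checking a condition until it holds or the timeout runs out. As a rough illustration of that idea only (this is not Selenium's implementation), here is a small stand-alone sketch. The names `wait_until` and `element_visible` are hypothetical and exist only for this example.

```python
import time


def wait_until(condition, timeout=30.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Raises TimeoutError if nothing truthy comes back within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Toy condition that becomes true on the third check, standing in for
# "the element is now visible on the page".
checks = {"count": 0}


def element_visible():
    checks["count"] += 1
    return checks["count"] >= 3


found = wait_until(element_visible, timeout=5, poll_interval=0.01)
```

The point of polling rather than sleeping for a fixed time is that the script moves on as soon as the page is ready and only pays the full timeout when something has gone wrong.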

In [14]:
# gathering all links on the designer page
results = driver.find_elements_by_xpath("//a[@href]")
len(results)

6140

In [15]:
# Making a list of all the links
links = []
for result in results:
    links.append(result.get_attribute("href"))  # grabbing the link attribute

# Turn the links into a DataFrame
ItemDF = pd.DataFrame(links, columns=['Link'])
ItemDF

| | Link |
|---|---|
| 0 | https://www.grailed.com/ |
| 1 | https://www.grailed.com/shop |
| 2 | https://www.grailed.com/sell |
| 3 | https://www.grailed.com/drycleanonly |
| 4 | https://www.grailed.com/users/sign_up |
| ... | ... |
| 6135 | https://www.facebook.com/grailed |
| 6136 | https://www.twitter.com/grailed |
| 6137 | https://www.youtube.com/channel/UCrcycxtz_yoAf... |
| 6138 | https://www.linkedin.com/company/grailed |


In [16]:
# filtering out links that are not designer page links
ItemDF = ItemDF[ItemDF['Link'].str.contains("designers/")]

In [17]:
# in total we have 6,028 designers to scrape
len(ItemDF)

6028
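
The substring filter above keeps any link containing "designers/", which would also keep the designers index page itself if it shows up among the links. A stricter check is sketched below using only the standard library. It assumes every designer page has the shape `https://www.grailed.com/designers/<slug>` (as in the Prada example later in this notebook); `is_designer_link` and `sample_links` are hypothetical names made up for this illustration.

```python
from urllib.parse import urlparse


def is_designer_link(url):
    """Return True for URLs shaped like https://www.grailed.com/designers/<slug>."""
    parsed = urlparse(url)
    # Split the path into its non-empty segments, e.g. ["designers", "prada"].
    parts = [p for p in parsed.path.split("/") if p]
    return (
        parsed.netloc in ("grailed.com", "www.grailed.com")
        and len(parts) == 2
        and parts[0] == "designers"
    )


sample_links = [
    "https://www.grailed.com/",
    "https://www.grailed.com/designers/",
    "https://www.grailed.com/designers/prada",
    "https://www.facebook.com/grailed",
]
designer_links = [u for u in sample_links if is_designer_link(u)]
```

With pandas, the same function could be applied as `ItemDF[ItemDF['Link'].apply(is_designer_link)]`.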

In [18]:
# saving the brand_links as a csv
ItemDF.to_csv('brand_links.csv')

## Part 2: Getting to the Sold Page for Each Designer

This step may seem unnecessary. For example, you might think that going from the current listings for Prada to the sold listings for Prada would be as simple as changing the designer link from https://www.grailed.com/designers/prada to https://www.grailed.com/sold/prada. This, however, is not the case: the true link to the Prada sold listings is https://www.grailed.com/sold/taPPoyeJEw. There is no simple way to work out what the sold link will be for each designer, so the links must be gathered with the following process:

1. Access the designer page.
2. Click on "Show Only".
3. Close the login pop-up.
4. Click on "Show Only" again, then click on the sold box.
5. Wait for the page to load and gather the link.

Because this process is slow, I broke up the gathering of sold links for the ~6,000 designers into groups of 1,000 in case my computer crashed (which it did multiple times).
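
One way to make a crash less costly is to write each result to disk as soon as it is collected and to skip anything already recorded when the script is restarted. The sketch below shows that idea with the standard library only. It is not what this notebook does: `load_done`, `append_result`, `scrape_all`, the checkpoint file name, and `fake_fetch` (a stand-in for the Selenium steps) are all hypothetical.

```python
import csv
import os
import tempfile


def load_done(path):
    """Return the set of designer links already recorded in the checkpoint file."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="") as f:
        return {row["designer_link"] for row in csv.DictReader(f)}


def append_result(path, designer_link, sold_link):
    """Append one result row, writing the header the first time."""
    is_new = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["designer_link", "sold_link"])
        writer.writerow([designer_link, sold_link])


def scrape_all(designer_links, fetch_sold_link, path):
    """Process only the links that are not already in the checkpoint file."""
    done = load_done(path)
    for link in designer_links:
        if link in done:
            continue
        append_result(path, link, fetch_sold_link(link))


# Demonstration with a stand-in for the browser automation.
checkpoint = os.path.join(tempfile.mkdtemp(), "sold_links_checkpoint.csv")
links = [f"https://www.grailed.com/designers/brand-{i}" for i in range(5)]
calls = []


def fake_fetch(link):
    calls.append(link)
    return "SOLD:" + link  # placeholder, not a real sold link


scrape_all(links[:3], fake_fetch, checkpoint)  # first run stops after 3 links
scrape_all(links, fake_fetch, checkpoint)      # rerun picks up the remaining 2
```

Because the second call skips the three links already in the file, each designer is fetched exactly once even though the run was restarted.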

In [23]:
# list of sold links to be appended to
sold_links_1000 = []
# count will be used to get an idea of where I am in the scraping process
count = 0

# going through first 1,000 designer links
for link in ItemDF['Link'][:1000]:
    # will open the designer link
    base_url = link
    # open up chrome
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument("--start-maximized")  # using full screen
    driver = webdriver.Chrome("C:/Users/alber/Documents/chromedriver.exe", options=chrome_options)  # replace the chromedriver path with your own
    driver.get(base_url)
    # wait 30 sec
    timeout = 30
    try:
        WebDriverWait(driver, timeout).until(EC.visibility_of_element_located((By.XPATH, "//div[@id='__next']")))
    except TimeoutException:
        print("Timed out waiting for page to load")
        driver.close()
        continue
    
    # note: WebDriverWait without .until() only builds the waiter object; it does not pause
    WebDriverWait(driver, 10)
    
    try:
        element = driver.find_elements_by_xpath("//*[contains(text(),'Show Only')]") # finding show only button 
        element[0].click(); # click show only
    except:
        driver.close()
        continue
        
    # sets a 1-second implicit wait for element lookups (it is not a sleep)
    driver.implicitly_wait(1)
    
    try:
        elem = driver.find_element_by_xpath("//div[@class='UsersAuthentication']") # user login window will pop up
        ac = ActionChains(driver)
        ac.move_to_element(elem).move_by_offset(250, 0).click().perform() # clicking away from login window
    except:
        driver.close()
        print('failed')
        continue
    
    driver.implicitly_wait(1)
    
    try:
        element = driver.find_elements_by_xpath("//*[contains(text(),'Show Only')]") # finding show only button again
        element[0].click(); # clicking now that pop-up is away
    except:
        driver.close()
        continue
    
    driver.implicitly_wait(1)
    
    try:
        element = driver.find_elements_by_xpath("//input[@id='sold-filter']") # finding sold box
        element[0].click(); # click sold box
    except:
        driver.close()
        continue
    
    driver.implicitly_wait(1)   
    
    try:
        sold_link = driver.current_url # getting current url which is the sold link for the designer
        sold_links_1000.append(sold_link) # appending sold link
    except:
        driver.close()
        continue
    
    driver.close()
    
    count += 1
    
    if count%10 == 0:
        print(count)

# creating df and exporting first 1000 sold links
SoldDF_1000=pd.DataFrame(sold_links_1000,columns=['Link'])
SoldDF_1000.to_csv('sold_links_1000.csv')

10
20
30
...
1000
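
Because designers whose pages fail are skipped, the list of sold links no longer lines up row by row with `ItemDF`. One possible way around this, sketched below, is to keep each sold link paired with the designer it came from. The helper `designer_slug` and the `results` list are hypothetical; the Prada pair is the example given earlier, and "some-brand" is a made-up placeholder.

```python
from urllib.parse import urlparse


def designer_slug(designer_link):
    """Extract the last path segment, e.g. 'prada' from .../designers/prada."""
    path = urlparse(designer_link).path
    return path.rstrip("/").split("/")[-1]


# Hypothetical results: None marks a designer whose page failed to load.
results = [
    ("https://www.grailed.com/designers/prada", "https://www.grailed.com/sold/taPPoyeJEw"),
    ("https://www.grailed.com/designers/some-brand", None),
]

# Keep only the designers for which a sold link was actually collected.
sold_by_designer = {
    designer_slug(designer): sold
    for designer, sold in results
    if sold is not None
}
```

In the loop above, this would amount to appending `(link, sold_link)` instead of just `sold_link`.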


## Repeating the Process for the Remaining 5,000 Designers

In [24]:
# list of sold links
sold_links_2000 = []
count = 0
for link in ItemDF['Link'][1000:2000]:
    # Set a URL
    base_url = link  
    # open up chrome. 
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument("--start-maximized")
    driver = webdriver.Chrome("C:/Users/alber/Documents/chromedriver.exe",options=chrome_options)
    driver.get(base_url)
    # wait 30 sec
    timeout = 30
    try:
        WebDriverWait(driver, timeout).until(EC.visibility_of_element_located((By.XPATH, "//div[@id='__next']")))
    except TimeoutException:
        print("Timed out waiting for page to load")
        driver.close()
        continue
    
    WebDriverWait(driver, 10)    
    
    try:
        element = driver.find_elements_by_xpath("//*[contains(text(),'Show Only')]")
        element[0].click();
    except:
        driver.close()
        continue
        
    driver.implicitly_wait(1)
    
    try:
        elem = driver.find_element_by_xpath("//div[@class='UsersAuthentication']")
        ac = ActionChains(driver)
        ac.move_to_element(elem).move_by_offset(250, 0).click().perform()
    except:
        driver.close()
        print('failed')
        continue
    
    driver.implicitly_wait(1)
    
    try:
        element = driver.find_elements_by_xpath("//*[contains(text(),'Show Only')]")
        element[0].click();
    except:
        driver.close()
        continue
    
    driver.implicitly_wait(1)
    
    try:
        element = driver.find_elements_by_xpath("//input[@id='sold-filter']")
        element[0].click();
    except:
        driver.close()
        continue
    
    driver.implicitly_wait(1)   
    
    try:
        sold_link = driver.current_url
        sold_links_2000.append(sold_link)
    except:
        driver.close()
        continue
    
    driver.close()
    
    count += 1
    
    if count%10 == 0:
        print(count)
        
SoldDF_2000=pd.DataFrame(sold_links_2000,columns=['Link'])
SoldDF_2000.to_csv('sold_links_2000.csv')

10
20
30
...
1000
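
The two cells so far differ only in their slice bounds and variable names. Instead of copying the cell for each group, the bounds could be generated, as in the sketch below. `batch_ranges` is a hypothetical helper, and 6028 is the number of designer links counted earlier.

```python
def batch_ranges(total, size):
    """Yield (start, stop) index pairs covering range(total) in steps of `size`."""
    for start in range(0, total, size):
        yield start, min(start + size, total)


ranges = list(batch_ranges(6028, 1000))
```

Each pair can then be used as `ItemDF['Link'][start:stop]`. With 6,028 links this gives seven ranges, the last one covering the 28 leftover links; the notebook instead folds those into its final group with `[5000:]`.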


In [32]:
# list of sold links
sold_links_3000_2 = []
count = 0
for link in ItemDF['Link'][2269:3000]:
    # Set a URL
    base_url = link  
    # open up chrome
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument("--start-maximized")
    driver = webdriver.Chrome("C:/Users/alber/Documents/chromedriver.exe",options=chrome_options)
    driver.get(base_url)
    # wait 30 sec
    timeout = 30
    try:
        WebDriverWait(driver, timeout).until(EC.visibility_of_element_located((By.XPATH, "//div[@id='__next']")))
    except TimeoutException:
        print("Timed out waiting for page to load")
        driver.close()
        continue
    
    WebDriverWait(driver, 10)    
    
    try:
        element = driver.find_elements_by_xpath("//*[contains(text(),'Show Only')]")
        element[0].click();
    except:
        driver.close()
        continue
        
    driver.implicitly_wait(1)
    
    try:
        elem = driver.find_element_by_xpath("//div[@class='UsersAuthentication']")
        ac = ActionChains(driver)
        ac.move_to_element(elem).move_by_offset(250, 0).click().perform()
    except:
        driver.close()
        print('failed')
        continue
    
    driver.implicitly_wait(1)
    
    try:
        element = driver.find_elements_by_xpath("//*[contains(text(),'Show Only')]")
        element[0].click();
    except:
        driver.close()
        continue
    
    driver.implicitly_wait(1)
    
    try:
        element = driver.find_elements_by_xpath("//input[@id='sold-filter']")
        element[0].click();
    except:
        driver.close()
        continue
    
    driver.implicitly_wait(1)   
    
    try:
        sold_link = driver.current_url
        sold_links_3000_2.append(sold_link)
    except:
        driver.close()
        continue
    
    driver.close()
    
    count += 1
    
    if count%10 == 0:
        print(count)

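# the computer stopped partway through this group, so the links from the earlier
# interrupted run (sold_links_3000, collected in a cell not shown here) are
# combined with the links from this rerun
sold_links_3000_total = sold_links_3000 + sold_links_3000_2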
SoldDF_3000=pd.DataFrame(sold_links_3000_total,columns=['Link'])
SoldDF_3000.to_csv('sold_links_3000.csv')

10
20
30
...
730
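
The cell above joins the links from an interrupted run with the links from its rerun by simple concatenation. If the two runs overlap, the same sold link can end up in the list twice. A minimal sketch of a merge that drops repeats while keeping the original order follows; `merge_partial_runs` and the example links are hypothetical.

```python
def merge_partial_runs(*runs):
    """Concatenate lists of links, dropping repeats but keeping first-seen order."""
    merged = []
    for run in runs:
        merged.extend(run)
    # dict keys are unique and preserve insertion order
    return list(dict.fromkeys(merged))


# Made-up partial results; the second run re-scraped one link.
first_run = ["https://www.grailed.com/sold/aaa", "https://www.grailed.com/sold/bbb"]
second_run = ["https://www.grailed.com/sold/bbb", "https://www.grailed.com/sold/ccc"]
combined = merge_partial_runs(first_run, second_run)
```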


In [33]:
# list of sold links
sold_links_4000_2 = []
count = 0
for link in ItemDF['Link'][3184:4000]:
    # Set a URL
    base_url = link  
    # open up chrome.
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument("--start-maximized")
    driver = webdriver.Chrome("C:/Users/alber/Documents/chromedriver.exe",options=chrome_options)
    driver.get(base_url)
    # wait 30 sec
    timeout = 30
    try:
        WebDriverWait(driver, timeout).until(EC.visibility_of_element_located((By.XPATH, "//div[@id='__next']")))
    except TimeoutException:
        print("Timed out waiting for page to load")
        driver.close()
        continue
    
    WebDriverWait(driver, 10)    
    
    try:
        element = driver.find_elements_by_xpath("//*[contains(text(),'Show Only')]")
        element[0].click();
    except:
        driver.close()
        continue
        
    driver.implicitly_wait(1)
    
    try:
        elem = driver.find_element_by_xpath("//div[@class='UsersAuthentication']")
        ac = ActionChains(driver)
        ac.move_to_element(elem).move_by_offset(250, 0).click().perform()
    except:
        driver.close()
        print('failed')
        continue
    
    driver.implicitly_wait(1)
    
    try:
        element = driver.find_elements_by_xpath("//*[contains(text(),'Show Only')]")
        element[0].click();
    except:
        driver.close()
        continue
    
    driver.implicitly_wait(1)
    
    try:
        element = driver.find_elements_by_xpath("//input[@id='sold-filter']")
        element[0].click();
    except:
        driver.close()
        continue
    
    driver.implicitly_wait(1)   
    
    try:
        sold_link = driver.current_url
        sold_links_4000_2.append(sold_link)
    except:
        driver.close()
        continue
    
    driver.close()
    
    count += 1
    
    if count%10 == 0:
        print(count)
        
# combining with sold_links_4000 from the earlier interrupted run (cell not shown here)
sold_links_4000_total = sold_links_4000 + sold_links_4000_2
SoldDF_4000=pd.DataFrame(sold_links_4000_total,columns=['Link'])
SoldDF_4000.to_csv('sold_links_4000.csv')

10
20
30
...
810


In [37]:
# list of sold links
sold_links_5000_2 = []
count = 0
for link in ItemDF['Link'][4700:5000]:
    # Set a URL
    base_url = link  
    # open up chrome.
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument("--start-maximized")
    driver = webdriver.Chrome("C:/Users/alber/Documents/chromedriver.exe",options=chrome_options)
    driver.get(base_url)
    # wait 30 sec
    timeout = 30
    try:
        WebDriverWait(driver, timeout).until(EC.visibility_of_element_located((By.XPATH, "//div[@id='__next']")))
    except TimeoutException:
        print("Timed out waiting for page to load")
        driver.close()
        continue
    
    WebDriverWait(driver, 10)    
    
    try:
        element = driver.find_elements_by_xpath("//*[contains(text(),'Show Only')]")
        element[0].click();
    except:
        driver.close()
        continue
        
    driver.implicitly_wait(1)
    
    try:
        elem = driver.find_element_by_xpath("//div[@class='UsersAuthentication']")
        ac = ActionChains(driver)
        ac.move_to_element(elem).move_by_offset(250, 0).click().perform()
    except:
        driver.close()
        print('failed')
        continue
    
    driver.implicitly_wait(1)
    
    try:
        element = driver.find_elements_by_xpath("//*[contains(text(),'Show Only')]")
        element[0].click();
    except:
        driver.close()
        continue
    
    driver.implicitly_wait(1)
    
    try:
        element = driver.find_elements_by_xpath("//input[@id='sold-filter']")
        element[0].click();
    except:
        driver.close()
        continue
    
    driver.implicitly_wait(1)   
    
    try:
        sold_link = driver.current_url
        sold_links_5000_2.append(sold_link)
    except:
        driver.close()
        continue
    
    driver.close()
    
    count += 1
    
    if count%10 == 0:
        print(count)

# combining with sold_links_5000 from the earlier interrupted run (cell not shown here)
sold_links_5000_total = sold_links_5000 + sold_links_5000_2
SoldDF_5000=pd.DataFrame(sold_links_5000_total,columns=['Link'])
SoldDF_5000.to_csv('sold_links_5000_3.csv')

10
20
30
...
300


In [28]:
# list of sold links
sold_links_6000 = []
count = 0
for link in ItemDF['Link'][5000:]:
    # Set a URL
    base_url = link  
    # open up chrome. 
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument("--start-maximized")
    driver = webdriver.Chrome("C:/Users/alber/Documents/chromedriver.exe",options=chrome_options)
    driver.get(base_url)
    # wait 30 sec
    timeout = 30
    try:
        WebDriverWait(driver, timeout).until(EC.visibility_of_element_located((By.XPATH, "//div[@id='__next']")))
    except TimeoutException:
        print("Timed out waiting for page to load")
        driver.close()
        continue
    
    WebDriverWait(driver, 10)    
    
    try:
        element = driver.find_elements_by_xpath("//*[contains(text(),'Show Only')]")
        element[0].click();
    except:
        driver.close()
        continue
        
    driver.implicitly_wait(1)
    
    try:
        elem = driver.find_element_by_xpath("//div[@class='UsersAuthentication']")
        ac = ActionChains(driver)
        ac.move_to_element(elem).move_by_offset(250, 0).click().perform()
    except:
        driver.close()
        print('failed')
        continue
    
    driver.implicitly_wait(1)
    
    try:
        element = driver.find_elements_by_xpath("//*[contains(text(),'Show Only')]")
        element[0].click();
    except:
        driver.close()
        continue
    
    driver.implicitly_wait(1)
    
    try:
        element = driver.find_elements_by_xpath("//input[@id='sold-filter']")
        element[0].click();
    except:
        driver.close()
        continue
    
    driver.implicitly_wait(1)   
    
    try:
        sold_link = driver.current_url
        sold_links_6000.append(sold_link)
    except:
        driver.close()
        continue
    
    driver.close()
    
    count += 1
    
    if count%10 == 0:
        print(count)
        
SoldDF_6000=pd.DataFrame(sold_links_6000,columns=['Link'])
SoldDF_6000.to_csv('sold_links_6000.csv')

10
20
30
...
970
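
At this point the sold links are spread across several CSV files, one per group. A possible way to read them back into a single list is sketched below with the standard library. `read_links` is a hypothetical helper, and the demonstration uses two small stand-in files rather than the real ones; they follow the layout `to_csv` produces here, an unnamed index column followed by `Link`.

```python
import csv
import os
import tempfile


def read_links(paths, column="Link"):
    """Read one column from several CSV files into a single list."""
    links = []
    for path in paths:
        with open(path, newline="") as f:
            links.extend(row[column] for row in csv.DictReader(f))
    return links


# Build two small stand-in files for the demonstration.
tmp = tempfile.mkdtemp()
paths = []
for name, rows in [("part_a.csv", ["u1", "u2"]), ("part_b.csv", ["u3"])]:
    path = os.path.join(tmp, name)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["", "Link"])  # unnamed index column, then Link
        for i, url in enumerate(rows):
            writer.writerow([i, url])
    paths.append(path)

all_links = read_links(paths)
```

For the real files, `paths` would be the list of CSV names written by the cells above.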
