👘👗Fashion Store Crawler

I created an application that crawls the online store of the fashion brand Reformation and extracts structured product information.

About the Project

This project was a coding challenge from Changing Room, a Honey extension for sustainable fashion that helps users track and reduce their environmental impact. The challenge: build an application that crawls a fashion brand's website with a crawler framework (here, Selenium), extracts product information in a structured form, and stores the data in a hosted database (AWS RDS).

Extracted information:

  • display_name (str)
  • product_material (str)
  • color (str)
  • size (list)
  • price (str)
  • product_url (str)
  • image_links (list)
  • brand_name (str)
  • description (str)
  • scraped_date (date)
  • category (str)
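As a purely illustrative example (the product name and values below are made up, not real scraped data), a single record with the fields above might look like:

```python
from datetime import date

# Illustrative record only -- values are invented, not scraped.
product = {
    "display_name": "Example Dress",
    "product_material": "100% Viscose",
    "color": "Black",
    "size": ["XS", "S", "M", "L"],
    "price": "$218",
    "product_url": "https://www.thereformation.com/products/example-dress",
    "image_links": ["https://example.com/example-dress-front.jpg"],
    "brand_name": "Reformation",
    "description": "A fitted mini dress.",
    "scraped_date": date.today(),
    "category": "dresses",
}
```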

Process

  • Understand key objectives and deliverables
  • Research and study the structure of the website: Reformation
  • Read the Selenium documentation and download the WebDriver matching my Chrome version
  • Brainstorm, strategize, and pseudocode the flow for collecting the extracted information and inserting it into the DB
  • Program the crawler (using By.XPATH to find all clothing links to loop through)
  • Continue extracting information from the site, testing functions, and appending results to a dictionary
  • Create an AWS account, set up an AWS IAM user, and create a DB with AWS RDS
  • Plan out a potential primary key for the database
  • Establish the database connection and set up an environment variable for DB_PASS
  • CREATE TABLE reformation_db
  • INSERT INTO reformation_db
  • SELECT * FROM reformation_db
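The three SQL steps at the end of the list can be sketched end to end. This is a local SQLite stand-in for illustration only (the actual project used AWS RDS, whose driver and credentials differ); the column subset and the sample row are invented, and product_url is used as the key here:

```python
import sqlite3

# Local SQLite stand-in for the AWS RDS database -- illustration only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE TABLE reformation_db (subset of the extracted fields)
cur.execute("""
    CREATE TABLE reformation_db (
        product_url   TEXT PRIMARY KEY,
        display_name  TEXT,
        price         TEXT,
        color         TEXT,
        scraped_date  TEXT
    )
""")

# INSERT INTO reformation_db (sample values are made up)
cur.execute(
    "INSERT INTO reformation_db VALUES (?, ?, ?, ?, ?)",
    ("https://www.thereformation.com/products/example",
     "Example Dress", "$218", "Black", "2022-01-01"),
)
conn.commit()

# SELECT * FROM reformation_db
rows = cur.execute("SELECT * FROM reformation_db").fetchall()
print(rows)
```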

Addressing Strategy to Scrape Entire Website & Automatic Updates

By finding the appropriate XPath, `//div[@class="product-tile__quickadd"]/div/a`, I was able to grab all the product links, which enabled me to go through each product page and extract the necessary information. This strategy could be expanded to cover all the critical pages of the site, for example using Beautiful Soup.
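As a sketch of the same link-gathering idea using only the standard library (the project itself used Selenium's By.XPATH on the live site), an HTMLParser can pull product links out of tiles shaped like Reformation's; the HTML snippet below is a simplified mock of the real markup, not the live page:

```python
from html.parser import HTMLParser

# Simplified mock of Reformation's product-tile markup -- not the live page.
HTML = """
<div class="product-tile__quickadd"><div><a href="/products/dress-1">Dress 1</a></div></div>
<div class="product-tile__quickadd"><div><a href="/products/dress-2">Dress 2</a></div></div>
"""

class ProductLinkParser(HTMLParser):
    """Collects hrefs of <a> tags nested inside product-tile divs."""
    def __init__(self):
        super().__init__()
        self.in_tile = 0
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "product-tile__quickadd":
            self.in_tile += 1
        elif tag == "a" and self.in_tile and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        # Naive: any closing div counts as leaving the tile.
        if tag == "div" and self.in_tile:
            self.in_tile -= 1

parser = ProductLinkParser()
parser.feed(HTML)
print(parser.links)  # ['/products/dress-1', '/products/dress-2']
```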

I understand that the site structure can change, which can be tricky, so finding a unique ID/primary key would be necessary to keep track of new, old, or no-longer-available products. This is something I would look into more. I know that for Amazon products there is a unique identifier (an ASIN) that can be used to track the state of a product.
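In the absence of a vendor-supplied identifier like an ASIN, one common workaround (not part of the original project) is to derive a stable surrogate key by hashing the canonical product URL:

```python
import hashlib

def product_key(product_url: str) -> str:
    """Derive a stable surrogate key from the canonical product URL.

    Stable across re-crawls as long as the URL does not change; if the
    site restructures its URLs, a remapping step is still required.
    """
    # Trivial normalization: strip whitespace and any trailing slash.
    url = product_url.strip().rstrip("/")
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]

key = product_key("https://www.thereformation.com/products/example/")
print(key)
```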

Lastly, with all the projected requests to the site, an IP block is likely, so a proxy service would need to be purchased.
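A minimal sketch of rotating requests across a purchased proxy pool (the endpoints below are placeholders, not real proxies; in the actual crawler, the chosen proxy would be passed to Selenium via its driver options):

```python
from itertools import cycle

# Placeholder proxy endpoints -- a real pool would come from a paid provider.
PROXIES = [
    "http://proxy-1.example.com:8080",
    "http://proxy-2.example.com:8080",
    "http://proxy-3.example.com:8080",
]

proxy_pool = cycle(PROXIES)

def next_proxy() -> str:
    """Return the next proxy endpoint, round-robin."""
    return next(proxy_pool)

# Each page fetch would draw a fresh proxy; the pool wraps around.
assigned = [next_proxy() for _ in range(4)]
print(assigned)
```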

Information Extraction Example

```python
from selenium.webdriver.common.by import By

# `driver` is an already-initialized Selenium WebDriver on a product page.
# Grab every block holding material details on the product page.
product_material_find = driver.find_elements(By.XPATH, '//div[@class="margin-b--15"]')

# Pull the raw inner HTML of each block, strip whitespace, and join.
product_materials_list = [y.get_attribute("innerHTML").strip() for y in product_material_find]
product_material = " ".join(product_materials_list)
```

Technologies

Screenshots of Process

Inspecting the Reformation Website · AWS RDS

Other Acknowledgements

License

MIT