I created an application that crawls the online store of the fashion brand Reformation and extracts product information.

The project was a coding challenge from Changing Room, a Honey-style browser extension for sustainable fashion that helps users track and reduce their environmental impact. The challenge asks for an application that crawls a fashion brand's website with a crawler framework (in this case Selenium), extracts information in a structured manner, and stores the data in a hosted database (AWS RDS).
Extracted information:
- display_name (str)
- product_material (str)
- color (str)
- size (list)
- price (str)
- product_url (str)
- image_links (list)
- brand_name (str)
- description (str)
- scraped_date (date)
- category (str)
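
As a sketch, each scraped product can be gathered into a dictionary keyed by these fields before it is written to the database (all sample values below are hypothetical):

```python
from datetime import date

# Hypothetical example of one scraped product record,
# matching the fields listed above.
product = {
    "display_name": "Juliette Dress",  # hypothetical value
    "product_material": "100% Viscose",
    "color": "Black",
    "size": ["XS", "S", "M", "L"],
    "price": "$218",
    "product_url": "https://www.thereformation.com/products/juliette-dress",
    "image_links": ["https://media.thereformation.com/image1.jpg"],
    "brand_name": "Reformation",
    "description": "A mini dress with a sweetheart neckline.",
    "scraped_date": date.today(),
    "category": "dresses",
}
```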
Steps taken:
- Understanding the key objectives and deliverables
- Researching and studying the structure of the target website: Reformation
- Reading the Selenium documentation and downloading the WebDriver matching my Chrome version
- Brainstorming, strategizing, and pseudocoding how to collect the extracted information and insert it into the DB
- Programming (using By.XPATH to find all clothing links to loop through; see the crawler sketch after this list)
- Continuing to extract information from the site, testing functions, and appending the results to a dictionary
- Creating an AWS account, setting up an AWS IAM user, and creating a database with AWS RDS
- Planning a potential primary key for the database
- Establishing the database connection and setting up an environment variable for DB_PASS (see the database sketch after this list)
- CREATE TABLE reformation_db
- Inserting into reformation_db
- SELECT * FROM reformation_db
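
The core of the crawler is the link-collection step mentioned above. A minimal sketch, assuming Selenium 4 with Chrome and using the XPath quoted in the notes below (the listing-page URL is hypothetical):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Assumes Selenium 4+ and a ChromeDriver matching the installed Chrome version.
driver = webdriver.Chrome()
driver.get("https://www.thereformation.com/categories/dresses")  # hypothetical listing URL

# Grab every product link on the listing page (XPath from the notes below).
link_elements = driver.find_elements(By.XPATH, '//div[@class="product-tile__quickadd"]/div/a')
product_urls = [el.get_attribute("href") for el in link_elements]

# Visit each product page and extract the fields listed earlier.
for url in product_urls:
    driver.get(url)
    # ... extract display_name, product_material, color, size, price, image_links, ...

driver.quit()
```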
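
The database steps could look like the following sketch. It assumes a PostgreSQL RDS instance and the psycopg2 driver (the challenge only specifies AWS RDS); the endpoint, user, and trimmed-down schema are illustrative, and `product` is the dictionary from the earlier sketch:

```python
import os
import psycopg2

# Hypothetical RDS endpoint and user; the password comes from the
# DB_PASS environment variable, as described above.
conn = psycopg2.connect(
    host="reformation-db.xxxx.us-east-1.rds.amazonaws.com",
    user="admin",
    password=os.environ["DB_PASS"],
    dbname="postgres",
)
cur = conn.cursor()

# Illustrative subset of the full schema; product_url doubles as the primary key.
cur.execute("""
    CREATE TABLE IF NOT EXISTS reformation_db (
        product_url  TEXT PRIMARY KEY,
        display_name TEXT,
        price        TEXT,
        scraped_date DATE
    )
""")

# Insert one scraped record (the `product` dict from the sketch above).
cur.execute(
    "INSERT INTO reformation_db (product_url, display_name, price, scraped_date) "
    "VALUES (%s, %s, %s, %s)",
    (product["product_url"], product["display_name"], product["price"], product["scraped_date"]),
)
conn.commit()

cur.execute("SELECT * FROM reformation_db")
print(cur.fetchall())
```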
"By finding the appropriate xpath, '//div[@class="product-tile__quickadd"]/div/a', I was able to grab all the product links which enabled me to go through each product page and extract the necessary information. I think this strategy can be expanded to include all the critical pages of the site using Beautiful Soup.
"I understand that the site structure can change which can be tricky, so finding a unique ID/primary would be necessary to keep track of new, old, or not available products. This is something I would look into more. I know for Amazon products, there is a unique identifier (ASINS) that can be used to track the stage of the product.
"Lastly, with all the projected requests on the site, it is likely that an IP block will occur, so a proxy server would need to be purchased.
The material extraction on each product page followed the same element-finding pattern:

```python
# Grab the material/composition blocks on a product page
# (assumes `driver` is already on the product page and By is imported).
product_material_find = driver.find_elements(By.XPATH, '//div[@class="margin-b--15"]')
product_materials_list = [el.get_attribute('innerHTML') for el in product_material_find]
product_material = " ".join(fragment.strip() for fragment in product_materials_list)
```