# Web scraping Sephora for Excel data

The goal of this notebook is to produce usable Excel data by scraping a site that uses lazy loading and follows standard product-container naming conventions. Each step matters for keeping every value aligned with the product it belongs to.

The sections of this analysis are:

 - Webdriver loader
     * Opens the Sephora webpage
     * Scrapes every product "div" with the specified class name, simulating scrolling so that "lazy loaded" products are rendered
     * Checks whether a product is already present before adding it to the main dataframe
     * Finalizes the product data
 - Display of part of the extracted data.
 - Conversion of each column to an array, then splicing the arrays together.
 - Each scraped product link is then opened in a new webdriver, which extracts the product's ingredients div container, identified by its class name.
 - The values in this dataframe are then processed in a similar fashion to the products themselves.
 - The final result, the product name and ingredient data, is output as an Excel sheet.
 - Additional data that can be added to the final sheet includes 'Image URL' and 'URL Extension'.
 - In short, given a Sephora category URL, this notebook outputs an Excel sheet containing the products present on that page and their ingredients, showing its process every step of the way.
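
The scroll-and-collect step described in the list above can be sketched as a standalone function. This is only an illustration under stated assumptions, not the notebook's own code: `collect_products`, `parse_cards`, `target`, `step`, `pause`, and `max_scrolls` are hypothetical names, and `parse_cards` stands for any callable that turns page source into a list of dicts with a `"link"` key. The sketch keeps one ordered list of records, uses a set only for the "already present" check, and also stops when the scroll position passes the bottom of the page.

```python
import time


def collect_products(driver, parse_cards, target=30, step=1000, pause=3.0, max_scrolls=50):
    """Scroll in fixed steps and collect each product once, keyed by its link."""
    seen_links = set()  # used only for the "already present" check
    rows = []           # ordered records, one dict per product
    position = 0
    for _ in range(max_scrolls):
        driver.execute_script(f"window.scrollTo(0, {position});")
        time.sleep(pause)  # give lazily loaded cards time to render
        for card in parse_cards(driver.page_source):
            if card["link"] not in seen_links:
                seen_links.add(card["link"])
                rows.append(card)
        if len(rows) >= target:
            break
        # Stop when the scroll position has reached the bottom of the page
        page_height = driver.execute_script("return document.body.scrollHeight")
        if position >= page_height:
            break
        position += step
    return rows
```

Because every product is stored as a single record, its name, image URL, and link cannot drift apart, and the loop ends even when the page holds fewer products than `target`.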

In [2]:
import time
import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup

# NOTE: sets drop duplicates but do not keep insertion order, so the three
# columns built from them below are not guaranteed to line up row by row.
product_names = set()
image_urls = set()
product_links = set()

driver = webdriver.Chrome()

time.sleep(5)  # give the browser window time to start

url = 'https://www.sephora.sg/brands/aveda/hair'
driver.get(url)

# Scroll down in 1000 px steps so that lazily loaded products are rendered
scroll_pause_time = 3
scroll_height = 0
while True:
    driver.execute_script(f"window.scrollTo(0, {scroll_height});")
    time.sleep(scroll_pause_time)
    # Re-parse the page after every scroll and look at all product cards
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    containers = soup.find_all('div', {'class': 'products-card-container'})

    for container in containers:
        product_name = container.find('div', {'class': 'product-name'}).text.strip()
        product_link = 'https://www.sephora.sg' + container.find('a', {'class': 'product-card-image-link'})['href']
        product_image = container.find('img', {'class': 'product-card-image'})['src']
        # Only add products whose link has not been seen yet
        if product_link not in product_links:
            product_names.add(product_name)
            image_urls.add(product_image)
            product_links.add(product_link)

    # Stop once at least 30 distinct product links have been collected
    # (the loop does not end on its own if the page lists fewer than 30)
    if len(product_links) >= 30:
        break
    scroll_height += 1000

driver.quit()

df = pd.DataFrame({'Product Name': list(product_names), 'Image Url': list(image_urls), 'Product Links': list(product_links)})


NoSuchWindowException: Message: no such window: target window already closed
from unknown error: web view not found
  (Session info: chrome=112.0.5615.138)




The cell below displays a shortened version of the extracted data.
Because each column was built from an unordered set, 'Image Url' may not align with 'Product Name' at this stage.
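
A small illustration of why per-column sets are fragile. The product names and URLs here are made up for the example and are not Sephora data. Each set removes duplicates independently, so two products that share one image leave the columns with different lengths, and sets make no promise about order in any case.

```python
# Hypothetical cards: two products happen to share one placeholder image
cards = [
    {"name": "Shampoo A", "image": "https://example.com/placeholder.jpg"},
    {"name": "Shampoo B", "image": "https://example.com/placeholder.jpg"},
    {"name": "Conditioner C", "image": "https://example.com/c.jpg"},
]

# One set per column: each column is deduplicated on its own
names = {card["name"] for card in cards}
images = {card["image"] for card in cards}
print(len(names), len(images))  # the two columns no longer have the same length

# One tuple per product: the pairing survives regardless of duplicates
rows = [(card["name"], card["image"]) for card in cards]
print(rows[1])
```

Keeping whole rows together, as in `rows`, is the approach the list-based cells further down take.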

In [21]:
print(df)

                                         Product Name  \
0                     Shampure™ Nurturing Conditioner   
1              Botanical Repair Strengthening Shampoo   
2                Nutriplenish™ Shampoo Light Moisture   
3         Invati Advanced™  Exfoliating Shampoo Light   
4     Botanical Repair™ Strengthening Overnight Serum   
5            Nutriplenish™ Conditioner Light Moisture   
6        Nutriplenish™ Treatment Masque Deep Moisture   
7                    Nutriplenish™ Multi-use Hair Oil   
8          Invati Advanced™  Exfoliating Shampoo Rich   
9                     Rosemary Mint Purifying Shampoo   
10                 Invati Advanced™ Scalp Revitalizer   
11  Botanical Repair Intensive Strengthening Hair ...   
12         Nutriplenish™ Daily Moisturizing Treatment   
13  Invati Advanced™ 3-Step System Light Set (Holi...   
14     Invati Advanced™ Intensive Hair & Scalp Masque   
15  Botanical Repair Intensive Strengthening Hair ...   
16                            B



Prints list ('array') versions of each column, built in a single pass over the product containers. Each name, image URL, and link should now sit at the same index.

In [22]:
dfArrayProduct = []
dfArrayImg = []
dfArrayLink = []
# `containers` holds the product cards parsed on the last scroll of the loader cell.
# Appending to all three lists in the same pass keeps index i tied to one product.
for container in containers:
    dfArrayProduct.append(container.find('div', {'class': 'product-name'}).text.strip())
    dfArrayImg.append(container.find('img', {'class': 'product-card-image'})['src'])
    dfArrayLink.append('https://www.sephora.sg' + container.find('a', {'class': 'product-card-image-link'})['href'])
print(dfArrayProduct)
print('')
print(dfArrayImg)
print('')
print(dfArrayLink)



['Rosemary Mint Weightless Conditioner', 'Shampure™ Nurturing Conditioner', 'Invati Advanced™  Exfoliating Shampoo Light', 'Shampure™ Nurturing Shampoo', 'Rosemary Mint Purifying Shampoo', 'Invati Advanced™ Thickening Conditioner', 'Nutriplenish™ Shampoo Light Moisture', 'Nutriplenish™ Multi-use Hair Oil', 'Invati Advanced™  Exfoliating Shampoo Rich', 'Nutriplenish™ Leave-in Conditioner', 'Nutriplenish™ Daily Moisturizing Treatment', 'Smooth Infusion™ Style-Prep™ Smoother', 'Nutriplenish™ Treatment Masque Deep Moisture', 'Invati Advanced™ Intensive Hair & Scalp Masque', 'Botanical Repair™ Strengthening Overnight Serum', 'Nutriplenish™ Curl Gelée', 'Botanical Repair Strengthening Leave-In Treatment For Hair', 'Be Curly™ Curl Enhancer', 'Be Curly™ Style Prep™', 'Invati Advanced™ Scalp Revitalizer', 'Botanical Repair Intensive Strengthening Hair Masque Rich', 'Botanical Repair Intensive Strengthening Hair Masque Light', 'Invati Advanced™ Thickening Foam', 'Botanical Repair Strengthening S
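
Splicing the three lists back together can be guarded with a length check, so a misaligned scrape fails loudly instead of silently shifting rows. The helper below is a sketch; `splice_columns` is a hypothetical name, and the dictionary keys are chosen to match the column names used in the dataframe cell of this notebook.

```python
def splice_columns(names, images, links):
    """Combine three parallel lists into one record per product."""
    if not (len(names) == len(images) == len(links)):
        raise ValueError(
            f"column lengths differ: {len(names)}, {len(images)}, {len(links)}"
        )
    # zip pairs the items by index, so row i keeps name i, image i, and link i
    return [
        {"Product_Name": name, "Image_URL": image, "Extensions": link}
        for name, image, link in zip(names, images, links)
    ]
```

A list of dicts like this can be passed directly to `pd.DataFrame(...)`, which uses the keys as column names.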



Rebuilds the dataframe from the ordered lists, so each row's product name, image URL, and link now belong to the same product.

In [23]:
df = pd.DataFrame({'Product_Name': list(dfArrayProduct), 'Image_URL': list(dfArrayImg), 'Extensions': list(dfArrayLink)})

print(df)

                                         Product_Name  \
0                Rosemary Mint Weightless Conditioner   
1                     Shampure™ Nurturing Conditioner   
2         Invati Advanced™  Exfoliating Shampoo Light   
3                         Shampure™ Nurturing Shampoo   
4                     Rosemary Mint Purifying Shampoo   
5             Invati Advanced™ Thickening Conditioner   
6                Nutriplenish™ Shampoo Light Moisture   
7                    Nutriplenish™ Multi-use Hair Oil   
8          Invati Advanced™  Exfoliating Shampoo Rich   
9                  Nutriplenish™ Leave-in Conditioner   
10         Nutriplenish™ Daily Moisturizing Treatment   
11              Smooth Infusion™ Style-Prep™ Smoother   
12       Nutriplenish™ Treatment Masque Deep Moisture   
13     Invati Advanced™ Intensive Hair & Scalp Masque   
14    Botanical Repair™ Strengthening Overnight Serum   
15                           Nutriplenish™ Curl Gelée   
16  Botanical Repair Strengthen

In [24]:
print(dfArrayLink)

['https://www.sephora.sg/products/aveda-rosemary-mint-weightless-conditioner/v/250ml', 'https://www.sephora.sg/products/aveda-shampure-nurturing-conditioner/v/250ml', 'https://www.sephora.sg/products/aveda-invati-advanced-exfoliating-shampoo-light/v/200ml', 'https://www.sephora.sg/products/aveda-shampure-nurturing-shampoo/v/250ml', 'https://www.sephora.sg/products/aveda-rosemary-mint-purifying-shampoo/v/250ml', 'https://www.sephora.sg/products/aveda-invati-advanced-thickening-conditioner/v/200ml', 'https://www.sephora.sg/products/aveda-nutriplenish-shampoo-light-moisture/v/250ml', 'https://www.sephora.sg/products/aveda-nutriplenish-multi-use-hair-oil/v/30ml', 'https://www.sephora.sg/products/aveda-invati-advanced-exfoliating-shampoo-rich/v/200ml', 'https://www.sephora.sg/products/aveda-nutriplenish-leave-in-conditioner/v/200ml', 'https://www.sephora.sg/products/aveda-nutriplenish-daily-moisturizing-treatment/v/150ml', 'https://www.sephora.sg/products/aveda-smooth-infusion-style-prep-sm



Each product page is allotted 11 s so that all data is loaded and extracted before the webdriver is closed.

In [25]:
print("Number of links: ",  str(len(dfArrayLink)))
print("Estimated Load Time (in seconds)", str(len(dfArrayLink) * 11))
for link in dfArrayLink:
    print(link)


Number of links:  34
Estimated Load Time (in seconds) 374
https://www.sephora.sg/products/aveda-rosemary-mint-weightless-conditioner/v/250ml
https://www.sephora.sg/products/aveda-shampure-nurturing-conditioner/v/250ml
https://www.sephora.sg/products/aveda-invati-advanced-exfoliating-shampoo-light/v/200ml
https://www.sephora.sg/products/aveda-shampure-nurturing-shampoo/v/250ml
https://www.sephora.sg/products/aveda-rosemary-mint-purifying-shampoo/v/250ml
https://www.sephora.sg/products/aveda-invati-advanced-thickening-conditioner/v/200ml
https://www.sephora.sg/products/aveda-nutriplenish-shampoo-light-moisture/v/250ml
https://www.sephora.sg/products/aveda-nutriplenish-multi-use-hair-oil/v/30ml
https://www.sephora.sg/products/aveda-invati-advanced-exfoliating-shampoo-rich/v/200ml
https://www.sephora.sg/products/aveda-nutriplenish-leave-in-conditioner/v/200ml
https://www.sephora.sg/products/aveda-nutriplenish-daily-moisturizing-treatment/v/150ml
https://www.sephora.sg/products/aveda-smooth
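
The estimate above is a simple product: 34 links at 11 s each gives 374 s. For longer category pages it may be easier to read in minutes. The helper below is a hypothetical convenience, not part of the original notebook.

```python
def estimate_load_time(n_links, seconds_per_page=11):
    """Return the total estimate in seconds and as a 'M min S s' string."""
    total = n_links * seconds_per_page
    minutes, seconds = divmod(total, 60)
    return total, f"{minutes} min {seconds} s"


print(estimate_load_time(34))  # → (374, '6 min 14 s')
```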

Ingredient scraping for each product: each product's URL extension is used to open its page and scrape the ingredient data.

In [9]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

def get_product_ingredients(url):
    """Open one product page and return [ingredients text, product heading]."""
    ingredientHeader = []
    service = Service('path/to/chromedriver')
    driver = webdriver.Chrome(service=service)
    driver.get(url)
    try:
        # Wait up to 5 s for the ingredients container to appear in the DOM
        WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.CLASS_NAME, 'product-ingredients-values')))
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        ingredients_div = soup.find('div', {'class': 'product-ingredients-values'}).text.strip()
        header_div = soup.find('div', {'class': 'product-heading'}).h1.text
        ingredientHeader.append(ingredients_div)
        ingredientHeader.append(header_div)
        return ingredientHeader
    finally:
        # Always close the browser, even if the wait times out
        driver.quit()

# Print the heading, then the ingredients, for every product link
for url in dfArrayLink:
    ingredients = get_product_ingredients(url)
    print(ingredients[1])
    print(ingredients[0])
    print('')

In [26]:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

def get_product_ingredients(url):
    """Open one product page and return [product heading, ingredients text]."""
    service = Service('path/to/chromedriver')
    driver = webdriver.Chrome(service=service)
    driver.get(url)
    try:
        # Wait up to 10 s for the ingredients container to appear in the DOM
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'product-ingredients-values')))
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        ingredients_div = soup.find('div', {'class': 'product-ingredients-values'}).text.strip()
        header_div = soup.find('div', {'class': 'product-heading'}).h1.text
        return [header_div, ingredients_div]
    except Exception:
        # Timeout or missing element: return an empty two-column row so the
        # dataframe keeps one row per link
        return [None, None]
    finally:
        driver.quit()

ingredients_list = [get_product_ingredients(url) for url in dfArrayLink]

df = pd.DataFrame(ingredients_list, columns=['product', 'ingredients'])

                                              product  \
0                Rosemary Mint Weightless Conditioner   
1                     Shampure™ Nurturing Conditioner   
2         Invati Advanced™  Exfoliating Shampoo Light   
3                         Shampure™ Nurturing Shampoo   
4                     Rosemary Mint Purifying Shampoo   
5             Invati Advanced™ Thickening Conditioner   
6                Nutriplenish™ Shampoo Light Moisture   
7                    Nutriplenish™ Multi-use Hair Oil   
8          Invati Advanced™  Exfoliating Shampoo Rich   
9                  Nutriplenish™ Leave-in Conditioner   
10         Nutriplenish™ Daily Moisturizing Treatment   
11              Smooth Infusion™ Style-Prep™ Smoother   
12       Nutriplenish™ Treatment Masque Deep Moisture   
13     Invati Advanced™ Intensive Hair & Scalp Masque   
14    Botanical Repair™ Strengthening Overnight Serum   
15                           Nutriplenish™ Curl Gelée   
16  Botanical Repair Strengthen
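
When a product page fails to load, an empty row says nothing about which link was affected. One possible refinement, sketched below with hypothetical names, is to keep the URL in every row and to collect the failures separately. Here `scrape_all` takes any callable `fetch(url)` that returns a `(product, ingredients)` pair or raises, for example a variant of `get_product_ingredients` that lets its exceptions propagate.

```python
def scrape_all(urls, fetch):
    """Run fetch(url) for every URL, keeping one row per URL and a failure log."""
    rows = []
    failed = []
    for url in urls:
        try:
            product, ingredients = fetch(url)
        except Exception as exc:
            # Remember which page failed and why, and keep the row as a gap
            failed.append((url, repr(exc)))
            product, ingredients = None, None
        rows.append({"url": url, "product": product, "ingredients": ingredients})
    return rows, failed
```

The `failed` list can then be retried on its own, and since `rows` stays aligned with the input links, the 'Image URL' and 'URL Extension' columns can be attached by position.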

In [None]:
print(df)



Generates an Excel sheet containing the data shown in the dataframe above.

In [28]:
df.to_excel('aveda-ingredients.xlsx', index=False, startrow=0)