<img src="./images/cs_logo_pink.PNG" style="float: left; margin: 36px 20px 0 0; height: 60px">

# Capstone Project - Cos Skin <br><i style = "font-size:16px">Your skin but better</i>

## Notebook 1 of 7
<b>Notebook 1: Introduction & Data Collection Part 1 of 3<br></b>
Notebook 2: Data Collection Part 2 of 3<br>
Notebook 3: Data Collection Part 3 of 3<br>
Notebook 4: EDA & Data Cleaning<br>
Notebook 5: Pre-processing<br>
Notebook 6: Modeling<br>
Notebook 7: App Deployment

# Introduction

## Background
The number of skincare brands and products on the market is ever increasing, but having more options does not necessarily help us make better choices. The overwhelming number of options available when building a skincare routine from scratch may instead lead to decision fatigue. Hence, I decided to build a skincare recommender to help beginners put together a simple skincare routine and to expose existing skincare users to more options. <br>

To create a simple skincare routine, I will be looking at 5 categories: Cleanser, Toner, Day Moisturizer, Night Cream and Sunscreen.

## Goal
To recommend 5 different products based on the user's profile and preferences. Put together, these 5 products give the user a morning routine consisting of 4 products and a night routine consisting of 3 products.

| Morning Routine                                                   | Night Routine                                |
|-------------------------------------------------------------------|----------------------------------------------|
| 1. Cleanser<br> 2. Toner<br> 3. Day Moisturizer<br> 4. Sunscreen | 1. Cleanser<br> 2. Toner<br> 3. Night Cream |

## Data Collection

For this project, I will be scraping data from Sephora Singapore. To keep the skincare routine simple, I will focus on the 5 basic categories: cleanser, toner, day moisturizer, night cream and sunscreen.<br>

Data Collection will be split into three parts.<br>
In the first notebook, I will be using Selenium to scrape basic product information from each category URL. The basic information scraped includes product URL, name, price, brand, rating, number of reviews and whether the product is a Sephora Exclusive.
<br><br>
In the second notebook, I will also be using Selenium to scrape additional product information which includes product descriptions, product claims, product ingredients and product images. 
<br><br>
In the third notebook, I will be using API GET requests to pull product reviews.

In [1]:
# import libraries
import pandas as pd
import csv
import time
import sys, traceback

from tqdm import tqdm
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import NoSuchElementException

#### Functions for Web Scraping

In [2]:
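# create a list of URLs to be scraped
# (reads the global lists cats_to_scrape and total_pages defined in later cells)
def create_cat_url():
    cats = []
    for base_url, n_pages in zip(cats_to_scrape, total_pages):
        for page in range(1, n_pages + 1):
            if page == 1:
                # the first page is the base URL without a page parameter
                cats.append(base_url)
            else:
                cats.append(f'{base_url}&page={page}')

    return cats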

# find number of pages per category
def cat_pages(cat_url):
    option = webdriver.ChromeOptions()
    option.add_argument('--headless')
    chrome_executable = Service('/Users/CLARE/mambaforge/bin/chromedriver')
    driver = webdriver.Chrome(service=chrome_executable, options = option)
    driver.implicitly_wait(10)
    
    try:
        driver.get(cat_url) 
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "div.pagination-container.full>nav>a.page")))
        pagination = driver.find_elements(By.CSS_SELECTOR, "div.pagination-container.full>nav>a.page")
        page_list = [p.get_attribute("text") for p in pagination]
        # the second-to-last link holds the final page number
        last_page = int(page_list[-2])
    except (TimeoutException, IndexError):
        # no usable pagination links: treat the category as a single page
        last_page = 1
    finally:
        # always close the browser so repeated calls do not leave Chrome processes running
        driver.quit()
    
    return last_page

# scrape data with selenium
def scrape_pdt_by_cat(category, csv_name, tqdm_desc):
    option = webdriver.ChromeOptions()
    option.add_argument('--headless')
    chrome_executable = Service('/Users/CLARE/mambaforge/bin/chromedriver')
    driver = webdriver.Chrome(service=chrome_executable, options = option)
    driver.maximize_window()
    driver.implicitly_wait(10)
    
    header = ["pdt_url", "brand", "pdt_name", "price", "rating", "num_reviews", "sephora_exclusive"]
    
    try:
        # create and open the csv file once per call; newline='' prevents blank rows on Windows
        with open(csv_name, 'a', newline = '', encoding = "utf-8") as csvFile:
            csvWriter = csv.DictWriter(csvFile, fieldnames = header)
            csvWriter.writeheader()
            
            for url in category:
                driver.get(url) 
                WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "a.product-card-description[data-v-664a9576]")))
                box = driver.find_elements(By.CSS_SELECTOR, "a.product-card-description[data-v-664a9576]")

                for card in tqdm(box, desc = tqdm_desc):
                    pdt_link = card.get_attribute("href")
                    pdt_brand = card.find_element(By.CSS_SELECTOR, "p.product-card-brand").text
                    pdt_name = card.find_element(By.CSS_SELECTOR, "p.product-card-product").text
                    pdt_price = float(card.find_element(By.CSS_SELECTOR, "p.product-price>span").text.replace('$', ''))
                    
                    # rating and review count are missing for products without reviews
                    try:
                        pdt_rating = float(card.find_element(By.CSS_SELECTOR, "div.rating-container>div.product-rating-rating").get_attribute("data-rateit-value"))
                        pdt_review = int(card.find_element(By.CSS_SELECTOR, "div.rating-container>span.product-rating-count").text.replace('(', '').replace(')',''))
                    except NoSuchElementException:
                        pdt_rating = 'NA'
                        pdt_review = 'NA'
                    
                    # the exclusive tag is only present on Sephora Exclusive products
                    try:
                        pdt_exclusive = card.find_element(By.CSS_SELECTOR, "p>span.dy-plp-exclusive-tag").text
                    except NoSuchElementException:
                        pdt_exclusive = 0
                    
                    csvWriter.writerow({"pdt_url": pdt_link, "brand":pdt_brand, "pdt_name":pdt_name, "price":pdt_price, "rating":pdt_rating, "num_reviews":pdt_review, "sephora_exclusive":pdt_exclusive})
    finally:
        # close the browser even if a page fails to load
        driver.quit()
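
The price and review-count fields above are converted inline with chained `replace` calls. A minimal sketch of the same conversions as small standalone helpers follows. `parse_price` and `parse_review_count` are hypothetical names that are not part of the original notebook, and the comma handling assumes that larger prices may be rendered with a thousands separator, which was not checked against the site.

```python
def parse_price(text):
    """Convert a price label such as '$45.00' to a float (hypothetical helper)."""
    # Assumption: a thousands separator, if present, is a comma.
    return float(text.replace('$', '').replace(',', '').strip())


def parse_review_count(text):
    """Convert a review label such as '(123)' to an int (hypothetical helper)."""
    return int(text.strip().strip('()'))


print(parse_price('$45.00'))        # 45.0
print(parse_review_count('(123)'))  # 123
```

Keeping the parsing in pure functions makes it possible to check the conversions without launching a browser.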

In [3]:
# first page of category URLs
cats_to_scrape = ["https://www.sephora.sg/categories/skincare/cleanser-and-exfoliator/facial-cleanser?view=120", 
                 "https://www.sephora.sg/categories/skincare/toner?view=120",
                 "https://www.sephora.sg/categories/skincare/moisturiser/day-moisturiser?view=120", 
                 "https://www.sephora.sg/categories/skincare/moisturiser/night-cream?view=120",
                 "https://www.sephora.sg/categories/skincare/suncare/face-sunscreen?view=120"]

In [4]:
# total number of pages per category
total_pages = [cat_pages(url) for url in cats_to_scrape]

In [5]:
total_pages

[4, 2, 5, 3, 1]

In [6]:
# list of all URLs to be scraped
cat_urls = create_cat_url()

In [7]:
cat_urls

['https://www.sephora.sg/categories/skincare/cleanser-and-exfoliator/facial-cleanser?view=120',
 'https://www.sephora.sg/categories/skincare/cleanser-and-exfoliator/facial-cleanser?view=120&page=2',
 'https://www.sephora.sg/categories/skincare/cleanser-and-exfoliator/facial-cleanser?view=120&page=3',
 'https://www.sephora.sg/categories/skincare/cleanser-and-exfoliator/facial-cleanser?view=120&page=4',
 'https://www.sephora.sg/categories/skincare/toner?view=120',
 'https://www.sephora.sg/categories/skincare/toner?view=120&page=2',
 'https://www.sephora.sg/categories/skincare/moisturiser/day-moisturiser?view=120',
 'https://www.sephora.sg/categories/skincare/moisturiser/day-moisturiser?view=120&page=2',
 'https://www.sephora.sg/categories/skincare/moisturiser/day-moisturiser?view=120&page=3',
 'https://www.sephora.sg/categories/skincare/moisturiser/day-moisturiser?view=120&page=4',
 'https://www.sephora.sg/categories/skincare/moisturiser/day-moisturiser?view=120&page=5',
 'https://www.se

In [8]:
# split cat_urls list into individual lists by categories 
cleanser = cat_urls[0:4]
toner = cat_urls[4:6]
day_moisturizer = cat_urls[6:11]
night_cream = cat_urls[11:14]
sunscreen = cat_urls[14:16]
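
The slice boundaries above are hard-coded from the `total_pages` output, so they need updating by hand whenever a category gains or loses a page. As a sketch of one alternative, the boundaries can be derived from the page counts. `split_by_page_counts` is a hypothetical helper, and the placeholder URLs are illustrative only.

```python
from itertools import accumulate


def split_by_page_counts(urls, page_counts):
    """Split a flat URL list into consecutive chunks, one per category."""
    ends = list(accumulate(page_counts))   # cumulative end index of each chunk
    starts = [0] + ends[:-1]               # each chunk starts where the previous one ended
    return [urls[start:end] for start, end in zip(starts, ends)]


# Placeholder URLs standing in for cat_urls; page counts as reported above.
demo_urls = [f'url_{n}' for n in range(15)]
chunks = split_by_page_counts(demo_urls, [4, 2, 5, 3, 1])
print([len(chunk) for chunk in chunks])  # [4, 2, 5, 3, 1]
```

With the notebook's variables, this could be called as `split_by_page_counts(cat_urls, total_pages)`.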

In [9]:
categories = [cleanser, toner, day_moisturizer, night_cream, sunscreen]

In [10]:
# csv file name & location for each category
file_name = ["../data/cleanser_basic_info.csv",
            "../data/toner_basic_info.csv", 
            "../data/day_moisturizer_basic_info.csv", 
            "../data/night_cream_basic_info.csv", 
            "../data/sunscreen_basic_info.csv"]

In [11]:
# tqdm_description to identify which category is being scraped
tqdm_desc = ['cleanser', 'toner', 'day_moisturizer', 'night_cream', 'sunscreen']

In [12]:
# scraping of category URLs to get basic product info
for i in range(len(categories)):
    scrape_pdt_by_cat(categories[i], file_name[i], tqdm_desc[i])

cleanser: 100%|███████████████████████████████| 120/120 [12:11<00:00,  6.10s/it]
cleanser: 100%|███████████████████████████████| 120/120 [13:50<00:00,  6.92s/it]
cleanser: 100%|███████████████████████████████| 120/120 [13:20<00:00,  6.67s/it]
cleanser: 100%|█████████████████████████████████| 12/12 [01:41<00:00,  8.43s/it]
toner: 100%|██████████████████████████████████| 120/120 [12:31<00:00,  6.26s/it]
toner: 100%|████████████████████████████████████| 91/91 [10:58<00:00,  7.24s/it]
day_moisturizer: 100%|████████████████████████| 120/120 [11:40<00:00,  5.84s/it]
day_moisturizer: 100%|████████████████████████| 120/120 [15:31<00:00,  7.76s/it]
day_moisturizer: 100%|████████████████████████| 120/120 [15:21<00:00,  7.68s/it]
day_moisturizer: 100%|████████████████████████| 120/120 [15:01<00:00,  7.51s/it]
day_moisturizer: 100%|██████████████████████████| 66/66 [07:05<00:00,  6.45s/it]
night_cream: 100%|████████████████████████████| 120/120 [13:21<00:00,  6.68s/it]
night_cream: 100%|██████████

The initial scraping obtains 1466 products in total. As I am looking to recommend only one product for each category, I will drop any items whose names contain words such as 'Set' or 'Kit'. I also observed that some products have the same name but are from different brands. Hence, a new column is created by concatenating the brand and product name to act as a unique identifier for each item.

In [13]:
# import datasets that were scraped above
cleanser_basic_info = pd.read_csv('../data/cleanser_basic_info.csv')
toner_basic_info = pd.read_csv('../data/toner_basic_info.csv')
day_m_basic_info = pd.read_csv('../data/day_moisturizer_basic_info.csv')
night_cream_basic_info = pd.read_csv('../data/night_cream_basic_info.csv')
sunscreen_basic_info = pd.read_csv('../data/sunscreen_basic_info.csv')

In [14]:
# function to remove all sets/kits from dataset
def drop_sets(df):
    df = df[(df['pdt_name'].str.contains('Kit') == False) & (df['pdt_name'].str.contains('Set') == False)]
    df = df.reset_index(drop = True)
    
    return df

# add a unique identifier by concatenating brand and product name as different brands may have products with the same name
def unique_identifier(df):
    df['unique_id'] = df['brand'] + '-' + df['pdt_name']
    return df
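
`str.contains('Set')` matches the substring anywhere in the name, so a name containing a word such as 'Setting' would also be dropped, while a lowercase 'set' would be kept. Whether this affects the scraped data was not checked. Below is a sketch of a stricter whole-word, case-insensitive test using only the standard library. `is_bundle` is a hypothetical name and the sample product names are made up for illustration.

```python
import re

# Match 'set' or 'kit' only as whole words, ignoring case.
BUNDLE_PATTERN = re.compile(r'\b(?:set|kit)\b', flags=re.IGNORECASE)


def is_bundle(pdt_name):
    """Return True if the product name contains 'set' or 'kit' as a whole word."""
    return bool(BUNDLE_PATTERN.search(pdt_name))


# Made-up product names for illustration only.
names = ['Hydrating Cleanser', 'Holiday Skincare Set', 'Makeup Setting Toner', 'Travel kit']
print([name for name in names if not is_bundle(name)])
# ['Hydrating Cleanser', 'Makeup Setting Toner']
```

Inside `drop_sets`, this could be applied as `df[~df['pdt_name'].apply(is_bundle)]`, assuming no product names are missing.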

In [15]:
# drop products that come as a set
cleanser_pdts = drop_sets(cleanser_basic_info)
toner_pdts = drop_sets(toner_basic_info)
day_moisturizer_pdts = drop_sets(day_m_basic_info)
night_cream_pdts = drop_sets(night_cream_basic_info)
sunscreen_pdts = drop_sets(sunscreen_basic_info)

In [16]:
# create a new column to serve as product identification
cleanser_pdts_unique = unique_identifier(cleanser_pdts)
toner_pdts_unique = unique_identifier(toner_pdts)
day_moisturizer_pdts_unique = unique_identifier(day_moisturizer_pdts)
night_cream_pdts_unique = unique_identifier(night_cream_pdts)
sunscreen_pdts_unique = unique_identifier(sunscreen_pdts)
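
Because `unique_id` is meant to identify each product, it may be worth confirming that no value repeats, for example if the same product were to appear on two listing pages. A minimal sketch with the standard library follows. `find_duplicates` is a hypothetical helper and the sample identifiers are invented.

```python
from collections import Counter


def find_duplicates(ids):
    """Return identifiers that occur more than once, in first-seen order."""
    counts = Counter(ids)
    return [uid for uid, n in counts.items() if n > 1]


# Invented identifiers in the 'brand-pdt_name' form built above.
sample_ids = ['BrandA-Gentle Cleanser', 'BrandB-Gentle Cleanser', 'BrandA-Gentle Cleanser']
print(find_duplicates(sample_ids))  # ['BrandA-Gentle Cleanser']
```

With the notebook's data, this could be called as `find_duplicates(cleanser_pdts_unique['unique_id'])`.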

In [19]:
# export datasets
cleanser_pdts_unique.to_csv('../data/cleanser_pdts.csv', index = False )
toner_pdts_unique.to_csv('../data/toner_pdts.csv', index = False )
day_moisturizer_pdts_unique.to_csv('../data/day_moisturizer_pdts.csv', index = False )
night_cream_pdts_unique.to_csv('../data/night_cream_pdts.csv', index = False )
sunscreen_pdts_unique.to_csv('../data/sunscreen_pdts.csv', index = False )