# Web Scraping: Almost Data Science With Python And BeautifulSoup

Web scraping is a wonderful facet of programming which lets you collect incredible amounts of information automatically. Along with the freedom you get from not needing to sit in front of a computer for 6 hours, you're also faced with the legal grey area of automatic data collection. 

That's why, before you scrape anything and everything, you should first read the website's Terms and Conditions and send the owners an email asking whether what you're doing *really* is OK. Better yet, let professionals do the talking and outsource your data scraping needs, for example by using Find Data Lab.

If you're still interested in the DIY version or just want to understand the principles of how data scraping works, follow along on the (as yet unfinished) journey of how I set out to decipher what the ingredients on the back of my moisturiser's label actually mean.

Tasks for this project:
1. Get a list of the ingredients from the web (done)
    1.1. Scrape
    1.2. Save the output as a .txt file or create a database
2. Access PubChem chemistry database and get the short summaries of each compound/substance
3. Read the output by candlelight while taking a bubble bath 

Let's get started!

In [None]:
import requests
from bs4 import BeautifulSoup
import re

As this is the pre-alpha version, we'll be trying to scrape the ingredients of a single product from only one brand.

In [None]:
urls = [
        'https://www.sephora.com/brand/the-ordinary',
        ]

The for loop is commented out: it would be the way to proceed if we had multiple brands to check out.
The next lines of code download the HTML of the page listed in 'urls' and find every <a> tag, the element that defines a hyperlink.

Most 'clickable' things on a webpage are hyperlinks. That's something to keep in mind if your project involves navigating through multiple pages of a site.

In [None]:
#for n in range(len(urls)):
n = 0
page = requests.get(urls[n])
soup = BeautifulSoup(page.content, 'html.parser')
subsections = soup.find_all('a')
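
If the list of brands grows, the commented-out loop could be wrapped in a small helper. The following is only a sketch: the function name `fetch_anchor_tags`, the 10-second timeout and the status check are my assumptions, not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def fetch_anchor_tags(urls, timeout=10):
    """Return a dict mapping each URL to the list of <a> tags found on that page."""
    anchors_by_url = {}
    for url in urls:
        page = requests.get(url, timeout=timeout)
        # Raise an exception on 4xx/5xx responses instead of parsing an error page
        page.raise_for_status()
        soup = BeautifulSoup(page.content, 'html.parser')
        anchors_by_url[url] = soup.find_all('a')
    return anchors_by_url
```

Calling `fetch_anchor_tags(urls)` would then give one list of tags per brand page, keyed by its URL.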

Next we get the <a> tag's href attribute, which specifies the URL of the page the link goes to.

Some technicalities: the list has to be initialised before the loop, and each href is appended to it inside the loop.

In [None]:
links = []
for link in subsections:
    links.append(link.get('href'))
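
One thing to watch for: an <a> tag without an href makes `get('href')` return None. As a small offline illustration (the HTML snippet and its paths are made up), Beautiful Soup's `href=True` filter can skip such tags up front:

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page
sample_html = """
<a href="/gifts">Gifts</a>
<a>No link here</a>
<a href="/product/serum?ref=theordinary_fromthebrand">Serum</a>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Without a filter, the second tag contributes None to the list
all_hrefs = [a.get('href') for a in sample_soup.find_all('a')]

# href=True keeps only the tags that actually carry an href attribute
real_hrefs = [a['href'] for a in sample_soup.find_all('a', href=True)]
```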

Now, we've got all of the links, but that's of no use, since there's no need to navigate to "Gifts" or "Quizzes".

By inspecting the product links, we can conclude that the unifying factor is the string "theordinary_fromthebrand", which means that in order to select all of the products individually we need to find a hyperlink with these words in it.

Using a regular expression, we loop through the list of links and build a new list that holds, for every link, either None (no match) or a Match object. A Match object means the link contains the product-identifying string.

In [None]:
products = []
reg = re.compile("theordinary_fromthebrand")
for link in links:
    #<a> tags without an href give None, which reg.search() can't handle
    if link is not None:
        products.append(reg.search(link))

This is where we filter out all of the None entries, which leaves us with a list containing only the matches for product links.

In [None]:
f_products = [x for x in products if x is not None]
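
The search and the filter can also be done in one pass. This is a sketch on made-up links (the paths and the `ref=` parameter are hypothetical); it keeps the link strings themselves rather than Match objects:

```python
import re

# Hypothetical hrefs, including a None from an <a> tag without an href
sample_links = [
    '/gifts',
    None,
    '/product/serum?ref=theordinary_fromthebrand',
    '/quizzes',
    '/product/cream?ref=theordinary_fromthebrand',
]

pattern = re.compile('theordinary_fromthebrand')

# Keep a link only if it is a non-empty string and the pattern is found in it
product_links = [link for link in sample_links if link and pattern.search(link)]
```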

We finish link hunting by reading each Match object's string attribute (the original link that was searched) and prefixing it with the site address to create a full URL.

In [None]:
f_products_links = ["https://www.sephora.com" + match.string for match in f_products]
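
Plain string concatenation assumes every href is a site-relative path. If some links might already be absolute, the standard library's `urljoin` handles both cases. A minimal sketch with made-up hrefs:

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.sephora.com'

# Hypothetical hrefs: one relative, one already absolute
sample_hrefs = [
    '/product/serum?ref=theordinary_fromthebrand',
    'https://www.sephora.com/product/cream?ref=theordinary_fromthebrand',
]

# urljoin prefixes relative paths and leaves absolute URLs untouched
full_urls = [urljoin(BASE_URL, href) for href in sample_hrefs]
```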

That's essentially it. Next we apply the same principles to get the actual list of product ingredients.

As I was only interested in testing my method and *getting the answers*, I chose to pass the 7th link from the product list, which is index 6 (remember that Python lists start counting from 0, unlike e.g. R lists).

Next we find the division that contains a specific class, which in our case is a text box on the website. 

You can find the specific identifiers by right-clicking anywhere on a webpage and choosing "Inspect Element" in Firefox or just "Inspect" in Google Chrome.
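
To show what a class-based lookup looks like, here is a small offline sketch. The HTML and the class names `product-name` and `tab-panel` are invented for the example; a real page will have its own (possibly auto-generated) names that you read off in the inspector. The `class_` keyword is equivalent to passing `{'class': 'tab-panel'}`.

```python
from bs4 import BeautifulSoup

# Invented markup: one product name and three tab panels sharing a class
sample_page = """
<span class="product-name">Example Moisturiser</span>
<div class="tab-panel">Details</div>
<div class="tab-panel">How to use</div>
<div class="tab-panel">Ingredients go here</div>
"""

sample_soup = BeautifulSoup(sample_page, 'html.parser')

# find_all returns every matching <div>, in document order
panels = sample_soup.find_all('div', class_='tab-panel')

# find returns only the first match
name = sample_soup.find('span', class_='product-name').get_text(strip=True)

# The third panel (index 2) is the one holding the ingredients
ingredients_panel = panels[2].get_text(strip=True)
```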

Web scraping needs to be personalised for every website, since no page is the same. That's why the next lines are not exactly generalizable.

Next we find the "Ingredients" tab in the output and select only the part that gets printed on the label. Here I found that it can be neatly done by splitting off the chunk of text that follows two line breaks.

Then we clean up the text and create a Python list that contains only the product ingredients. Using replace() four times is not very elegant, but hey, it works.

In [None]:
#loads only the HA moisturising factors
prod_page = requests.get(f_products_links[6])

soup = BeautifulSoup(prod_page.content, 'html.parser')
ingreds = soup.find_all('div', {'class': 'css-pz80c5'})
title = soup.find('span', {'class': 'css-0'}).text

#ingredients tab
ing_tab = str(ingreds[2]).split('<br/><br/>')
#this is like the expressional equivalent of moon moon
label = ing_tab[1].replace('.', '').replace('</div>', '').replace('\r', '').replace('\n', '').split(', ')
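
For comparison, the four replace() calls could be folded into a single regular expression. This is a sketch on an invented string shaped like the serialised div, and `clean_label` is a hypothetical helper name. Unlike the line above, it splits on bare commas and strips whitespace from each item.

```python
import re

# Made-up serialised <div>, shaped like the text the split works on
raw_div = (
    '<div class="css-pz80c5">A short marketing blurb.<br/><br/>'
    'Aqua (Water), Sodium Hyaluronate,\r\n Glycerin.</div>'
)


def clean_label(div_html):
    # Everything after the first double line break is the ingredient list
    chunk = div_html.split('<br/><br/>')[1]
    # Drop the closing tag, full stops and line breaks in one pass
    text = re.sub(r'</div>|[.\r\n]', '', chunk)
    # Split on commas and trim the whitespace around each ingredient
    return [item.strip() for item in text.split(',') if item.strip()]


sample_label = clean_label(raw_div)
```

Like the original, this removes every full stop, so it would also strip the dots from a name that legitimately contains them.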

The last step is optional. If you want to save the output as a .txt file with every ingredient on a new line, use this:

In [None]:
filepath = '/path/to/file/folder'

#write all product ingredients in a .txt file
#if you decide to spend your night becoming a rogue biochemist:
#make all of the file names
#a_bunch_of_files = ["path/to/file%i.txt" %x for x in range(len(f_products_links))]
#loop the rest, or just make a database
links_path = filepath + '/ingred_list.txt'
#the with block closes the file automatically, so no explicit close() is needed
with open(links_path, 'w') as linksfile:
    linksfile.write(title + '\n') #the first line is the product name
    for it in label:
        linksfile.write('%s\n' % it)
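
The same step can be sketched with pathlib. Here `save_ingredients` is a hypothetical helper, the title and ingredients are stand-in values, and a temporary folder is used so the example runs anywhere:

```python
import tempfile
from pathlib import Path

# Stand-ins for the scraped title and label
sample_title = 'Example Moisturiser'
sample_label = ['Aqua (Water)', 'Sodium Hyaluronate', 'Glycerin']


def save_ingredients(folder, title, label):
    """Write the product name and one ingredient per line; return the file path."""
    out_path = Path(folder) / 'ingred_list.txt'
    out_path.write_text('\n'.join([title, *label]) + '\n', encoding='utf-8')
    return out_path


# Write into a throwaway folder and read the lines back to check the result
with tempfile.TemporaryDirectory() as tmp_dir:
    saved = save_ingredients(tmp_dir, sample_title, sample_label)
    lines_written = saved.read_text(encoding='utf-8').splitlines()
```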


This is it! Part one of "I wanted to decipher what the ingredients on the back of my moisturiser's label actually mean" is done, and hopefully you've gained a little insight into how web scraping works.

If you don't want to deal with the legal aspects of web scraping or spend a bunch of your valuable time on it, make your life easier and outsource the work to professionals.