# The Worst Way to Web Scrape: Selenium

If you want to feel like a sexy wizard, then use non-headless Selenium and go on a biochemistry journey with me.

A brief reminder of what I set out to do: I wanted to understand what the ingredients on the back of my moisturiser's label *really* mean.

Remember the tasks for this project?
1. Get a list of the ingredients from the web (done) <br>
    1.1. Scrape <br>
    1.2. Save the output as a .txt file or create a database
2. Access PubChem chemistry database and get the short summaries of each compound/substance (DONE!)
3. Read the output by candlelight while taking a bubble bath

It's finally time to access the PubChem databases! 

If you're a part of the .txt file aesthetic crowd, then the following lines will be useful. Here we read in the .txt file we created in the last tootorial (hehe):

In [None]:
filepath = '/path/to/your/file'
links_path = filepath + '/ingred_list.txt'
ingreds = []
# open the file and read its content into a list
with open(links_path, 'r') as linksfile:
    for line in linksfile:
        # remove the trailing line break, if there is one
        currentPlace = line.rstrip('\n')

        # add item to the list
        ingreds.append(currentPlace)

# the with block has already closed the file; this just confirms it
linksfile.closed
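
A note on stripping the line break: slicing with `line[:-1]` removes the last character whatever it is, while `rstrip("\n")` only removes a newline. The difference shows up when the final line of the file has no trailing newline. A small sketch with an in-memory file (the contents are made up for illustration):

```python
import io

# Made-up file contents; note the last line has no trailing newline
fake_file = io.StringIO("My Moisturiser\nAqua\nGlycerin")

sliced = []
stripped = []
for line in fake_file:
    sliced.append(line[:-1])            # always drops the last character
    stripped.append(line.rstrip("\n"))  # only drops a newline if present

print(sliced)    # ['My Moisturiser', 'Aqua', 'Glyceri']
print(stripped)  # ['My Moisturiser', 'Aqua', 'Glycerin']
```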

Let's import all of the modules. You can use whatever browser you want. I don't remember why, but this time I opted for Firefox. 

In [None]:
from selenium.webdriver import Firefox
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException, TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
from selenium.webdriver.common.by import By
import pandas as pd

If you're not familiar with PubChem, a brief description would be that it's like Google for chemistry.

The next function will get the featured compound result if it's available. If not, the first compound will be selected:

In [None]:
# clicks on the featured result if it exists,
# otherwise falls back to the first compound result
# name is only used in the error message; pass it in at call time,
# because ing[n] does not exist yet when this cell runs
def get_result(name=None):
    try:
        fres = driver.find_element(By.XPATH, "//a[@data-action='featured-result-link']")
        if fres.is_displayed():
            fres.click()

    # if no featured link, click on the first search result
    except NoSuchElementException:
        print("No featured result, looking for the 1st compounds result")
        try:
            res = driver.find_element(By.XPATH, "//a[@data-action='result-link']")
            if res.is_displayed():
                print("1st compounds result selected")
                res.click()

        except NoSuchElementException:
            print("Invalid search for", name)

This is where we start the Selenium magic. Keep in mind that you need to download a webdriver, either chromedriver for Google Chrome or geckodriver for Firefox. Either add the driver to your PATH or do as I did and pass the path to the driver executable.

In [None]:
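# first list item is the title, the rest are the ingredients
title = ingreds[0]
ing = ingreds[1:]
# position in ing of the ingredient to look up
n = 2

gecko_driver = "/path/to/your/geckodriver"
# note: executable_path is the older Selenium API; Selenium 4 deprecated it in favour of a Service object
driver = Firefox(executable_path = gecko_driver)
driver.get("https://pubchem.ncbi.nlm.nih.gov/")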

I opted to select elements all throughout this script using XPaths. Make sure to add some kind of wait until the driver has loaded the page, or at least the elements you need, otherwise there **will** be an error.

In [None]:
#xpath of the search bar
src_xpath = "//input[contains(@id, 'search_')]"
#waits for the search bar to be loaded
wait = WebDriverWait(driver, 10)
wait.until(ec.visibility_of_element_located((By.XPATH, src_xpath)))

src = driver.find_element(By.XPATH, src_xpath)
src.send_keys(ing[n])
src.send_keys(Keys.ENTER)
# have to think of something better here: implicitly_wait does not pause,
# it makes every later find_element call retry for up to 5 seconds
driver.implicitly_wait(5)
get_result(ing[n])
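
To make the waiting idea concrete, here is a rough sketch of what an explicit wait does conceptually: call a condition again and again until it returns something truthy or the timeout runs out. `wait_until` and `page_loaded` are hypothetical names I made up for illustration. This is not Selenium's implementation, and no browser is involved.

```python
import time

def wait_until(condition, timeout=2.0, poll=0.05):
    """Call condition() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition was never met")
        time.sleep(poll)

# Pretend page: the "search bar" only appears on the third check
checks = {"count": 0}

def page_loaded():
    checks["count"] += 1
    return "search bar" if checks["count"] >= 3 else None

element = wait_until(page_loaded)
print(element, "found after", checks["count"], "checks")
```

The point of polling instead of sleeping for a fixed time is that the script carries on as soon as the element shows up, and fails loudly if it never does.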

Now we've (hopefully) entered a valid compound name that will yield a valid search result.
From here on, the action happens on a specific compound's page. I've decided that I'm interested in the 'Pharmacology' section. 
There's a possibility that some more niche compounds will not have a 'Pharmacology' section, so an option is to get the use in manufacturing or something of the sort. I'll get to that later.

## Getting text from a page with Selenium
Easy: driver.find_element().text <br>
Done.

In [None]:
pharm_xpath = "//section[contains(@id, 'Pharmacology')]"
# stays None if the section cannot be read
pharm = None

try:
    wait.until(ec.visibility_of_element_located((By.XPATH, pharm_xpath)))
    pharm = driver.find_element(By.XPATH, pharm_xpath).text
except TimeoutException:
    # a missing section also ends up here, because the wait keeps retrying until it times out
    print("Took too long to load, or there is no Pharmacology section :(")
except NoSuchElementException:
    print("No Pharmacology section, looking for the next best thing")
    # TODO: search for use in manufacturing
driver.quit()

Next we move into the data 'munging' territory. Don't ask me what I think of terms like 'munging' and 'wrangling'. 

I've decided to specifically get the info from sections pharmacology, mechanism of action and metabolite description. 
After splitting the huge text block by new lines, we get a list of all of the items that were written in the big 'Pharmacology' section.

In [None]:
paragr = pharm.split("\n")
# skip the first line of the block
pharm_paragr = paragr[1:]

sections = [
        'Pharmacology',
        'Mechanism of Action',
        'Metabolite Description'
        ]

Next we need to find where the sections are in the big pharm_paragr list. The following code does just that, returning an index of relevant headings. I made it print out the location of headings probably because I was feeling lonely when I wrote those lines and wanted someone to talk to. There's no need to do that really.

If you want to be a stylish script wizard, you can create log files, to which your script writes some status messages throughout the process.

In [None]:
sect_nb = []
for z, line in enumerate(pharm_paragr):
    for g in sections:
        if g in line:
            sect_nb.append(z)
            print("List element nb %i is the section %s" % (z, line))
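
For the log-file idea mentioned above, Python's built-in `logging` module is one option. A minimal sketch, assuming a log file called `scrape.log` in a temporary folder (the file name, the logger name and the messages are all placeholders of mine):

```python
import logging
import os
import tempfile

# Placeholder location; point this wherever you keep your project files
log_path = os.path.join(tempfile.mkdtemp(), "scrape.log")

logger = logging.getLogger("pubchem_scrape")
logger.setLevel(logging.INFO)
handler = logging.FileHandler(log_path)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(handler)

# Status messages go to the file instead of the notebook output
logger.info("List element nb %i is the section %s", 4, "Mechanism of Action")
logger.warning("Invalid search for %s", "some ingredient")

# Close the handler so everything is flushed, then read the file back
handler.close()
with open(log_path) as f:
    log_text = f.read()
print(log_text)
```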

I feel ashamed that this next section took me THE LONGEST TIME to write, but here it finally is.
First we check that there are no weird heading duplicates found in the big Bertha (pharm_paragr).
Next we need to find ALL the headings, not only the ones we want. Luckily, headings start with a number, so we pick out every string whose first character is a digit.

Then we need to find the range of relevant paragraphs. At this moment the pharm_paragr list is a jumble of headings and paragraphs. So, to find *relevant* paragraphs only, we first find where all of the headings are and then add +1, because the next list item after a heading will be a paragraph. After that, we get the location of where the relevant paragraph ends, which is the position of the next heading.

In [None]:
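if len(sections) == len(sect_nb):
    #finds all headings: a non-empty string that starts with a number
    #(enumerate gives the right position even if two lines have identical text)
    heads = [i for i, w in enumerate(pharm_paragr) if w and w[0].isdigit()]
    #range of relevant paragraphs
    #position within heads of the heading that follows each relevant one
    sect_par_index = [pos + 1 for pos, i in enumerate(heads) if i in sect_nb]
    #a section ends at the next heading, or at the end of the list if it is the last one
    par_end = [heads[i] if i < len(heads) else len(pharm_paragr) for i in sect_par_index]
    par_start = [i + 1 for i in sect_nb]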
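
Since that logic is easy to get lost in, here is the same idea on a tiny made-up list. The text is invented and only the shape matters: headings start with a digit, and a section's paragraphs run from the line after its heading up to the next heading.

```python
# Made-up stand-in for pharm_paragr: headings start with a number
toy = [
    "8.1 Pharmacology",
    "Some pharmacology text.",
    "More pharmacology text.",
    "8.2 Absorption",
    "Text we do not care about.",
    "8.3 Mechanism of Action",
    "How it works.",
]
wanted = ["Pharmacology", "Mechanism of Action"]

# Index of every heading, wanted or not
heads = [i for i, line in enumerate(toy) if line and line[0].isdigit()]
# Index of the headings we actually want
wanted_heads = [i for i in heads if any(name in toy[i] for name in wanted)]

ranges = []
for h in wanted_heads:
    start = h + 1
    # The section ends at the next heading, or at the end of the list
    later = [i for i in heads if i > h]
    end = later[0] if later else len(toy)
    ranges.append((start, end))

print(ranges)  # [(1, 3), (6, 7)]
print(toy[1:3], toy[6:7])
```

The end position is exclusive, which is exactly what a slice like `toy[start:end]` expects.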

I've created a little reference for my own comfort:

In [None]:
    # pharm_paragr - this is where the text is
    # sect_nb = positions of the relevant headings
    # par_start = first list item of a section
    # par_end = position of the next heading, i.e. one past the last list item of a section

Next comes the littlest data frame. The index will not be numerical, we'll use the ingredient names. Consequently, `.loc[]` expects an ingredient name rather than a number; to access a row by its position, use `.iloc[]`.
I've chosen the columns to be the section names that we scraped previously.

In [None]:
df = pd.DataFrame(index = ing, columns = sections)
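
A quick sketch of how a name-labelled index behaves, using made-up ingredient names and a single column: `.loc[]` looks rows up by label, `.iloc[]` by position.

```python
import pandas as pd

# Made-up ingredient names standing in for ing
toy_ing = ["Aqua", "Glycerin", "Squalane"]
toy_df = pd.DataFrame(index=toy_ing, columns=["Pharmacology"])

toy_df.iloc[1, 0] = "set by position"                    # second row, first column
toy_df.loc["Squalane", "Pharmacology"] = "set by label"  # row and column by name

print(toy_df)
```

Positions are handy inside a loop that already counts with `n`, while labels are easier to read when you look something up later.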

Next came the most brain-farty thing in this whole shebang. I had to make a new list of descriptions. The descriptions had to be merged from the master-list, since one paragraph contained multiple list items. I decided to join the list items by planting back the '\n'. It will be easier to re-split the list on retrieval (OMG SO EXCITING STAY TUNED FOR WHAT I HAVE IN STORE NEXT).
Then the 3 corresponding section descriptors are added in the relevant row:

In [None]:
    # makes a list where every list item is a merged description from the master list
    x = []
    for a in range(len(par_start)):
        # append the joined string itself, so each cell holds plain text
        x.append('\n '.join(pharm_paragr[par_start[a]:par_end[a]]))
    df.iloc[n] = x
#this is where the next loop starts
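
The join-now, split-later idea in miniature (the strings are made up). Note that the separator used for splitting has to match the one used for joining exactly, including the space after the `\n`:

```python
# Made-up list items that belong to one section
items = ["First sentence.", "Second sentence.", "Third sentence."]

merged = "\n ".join(items)      # one string per section, ready for a table cell
restored = merged.split("\n ")  # split on the very same separator

print(repr(merged))
print(restored == items)  # True
```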

Beautiful end to a part of a beautiful journey. I'm coming for you, beauty industry! Just kidding, I just want to be the hipster who knows what squalene is for.