In [50]:
from IPython.display import Image

In [None]:
Image(filename='../do') 

# Introduction
In this notebook, I will show how I solved a not-so-straightforward web scraping problem. In the process, you will learn the following: 
- How to do web scraping with Selenium
- How to resolve element identification problems in dynamic pages
- How to navigate through pages on a d

# Motivation
What problem am I trying to solve? I would like to keep track of job postings on this [site](https://careers.un.org/lbw/Home.aspx). Why can't I use regular web scraping packages such as requests or scrapy? The reason is that the website uses dynamic web pages, which can display different content each time they are viewed. For example, the page may change with the time of day, the user who accesses it, or the type of user interaction. For instance, in order to view the job postings, you have to submit a search, and in order to view all search results, you have to click through the pages of results. All these things are difficult to achieve with a library such as requests. Instead, I use [Selenium](https://selenium-python.readthedocs.io) for Python.

# Setup
In order to be able to scrape data from dynamic pages, we need to use **Selenium**, which is one of the most popular browser automation tools. The best way to use Selenium is via **WebDriver**, a powerful API that is part of Selenium and makes calls to a browser to automate it, carrying out actions such as *open this web page*, *click this link*, and *see whether the link opens this URL*, which are some of the things we will do.

In Python, we need to install the selenium package and a supported WebDriver for the browser we want to use. In my case, I installed a WebDriver for Firefox, but you can work with your favorite web browser.

## Installing Webdrivers
1. Download the latest [GeckoDriver](https://github.com/mozilla/geckodriver/releases/) (for Firefox) and/or [ChromeDriver](http://chromedriver.storage.googleapis.com/index.html) (for Chrome).
2. Unpack the driver somewhere fairly easy to navigate to, like the root of your home user directory.
3. Add the driver's location to your system PATH. For instance, on macOS or Linux, assuming that ```~/.local/bin``` is on your PATH, you can place the driver's executable in this folder, or modify ```.bash_profile``` (or ```.bashrc```) so that PATH includes the folder where you unpacked the driver.

Please refer to [Mozilla documentation](https://developer.mozilla.org/en-US/docs/Learn/Tools_and_testing/Cross_browser_testing/Your_own_automation_environment) for detailed instructions on how to do this for Windows systems.
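
A quick way to confirm that the PATH step worked is to ask Python where it finds the driver. This is a minimal sketch using only the standard library; the helper name ```find_driver``` is my own, and it assumes the executables are named ```geckodriver``` and ```chromedriver```.

```python
import shutil


def find_driver(name):
    """Return the full path of an executable found on PATH, or None if it is missing."""
    return shutil.which(name)


# Report where each driver lives (hypothetical check, run once after installing)
for driver_name in ("geckodriver", "chromedriver"):
    location = find_driver(driver_name)
    print("{}: {}".format(driver_name, location or "not found on PATH"))
```

If the driver you installed shows up as "not found on PATH", revisit step 3 before trying selenium.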

## Installing selenium
Next, install the selenium package using pip or any other package manager you like. You can install it within a virtual environment or in your global environment.
```pip install selenium```

## Test selenium
To test that everything is working, we will do a Google search using selenium instead of using the UI. It's actually fun when you do this for the first time. When using selenium, you can select a ```headless``` option, which means you don't see the browser when it's opened. Alternatively, you can turn off this option and watch the browser as you interact with it using selenium. 

In [21]:
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.expected_conditions import presence_of_element_located


opts = Options()
opts.headless = False # turn on or off headless option
search_term = 'Lake Malawi'
browser = Firefox(options=opts)
browser.get("https://google.com/") # open the webpage
browser.find_element(By.NAME, "q").send_keys(search_term + Keys.RETURN)
browser.implicitly_wait(10)
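# find_element(By.XPATH, ...) works in both Selenium 3 and 4; find_element_by_xpath was removed in Selenium 4
results_stats = browser.find_element(By.XPATH, '//*[@id="result-stats"]')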
print("Number of search results {}".format(results_stats.text))
browser.close()

Number of search results About 41,300,000 results (0.81 seconds) 
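
If you want the count as a number rather than a sentence, the text can be parsed with a regular expression. This is only a sketch: the helper name ```parse_result_count``` is my own, and it assumes the text keeps the "About N results" shape shown above.

```python
import re


def parse_result_count(stats_text):
    """Extract the integer count from text like 'About 41,300,000 results (0.81 seconds)'."""
    match = re.search(r"([\d,]+)\s+results", stats_text)
    if match is None:
        return None  # the text did not have the expected shape
    return int(match.group(1).replace(",", ""))


print(parse_result_count("About 41,300,000 results (0.81 seconds) "))
```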


## Other useful tools
Just like in any other web scraping task, it's important to be able to accurately identify elements within a webpage's DOM using XPath and CSS selectors. For this, the built-in developer tools provided in browsers are useful. In addition, I used [ChroPath](https://chrome.google.com/webstore/detail/chropath/ljngjbnaijcbncmcnjfhigebomdlkcjo?hl=en-US), an extension for Chrome.

In [18]:
results_stats.text

'About 38,200,000 results (0.80 seconds) '

# Understanding the Task at Hand
The goal here is to be able to download job posts every day from this site and save them to disk as a text file. In order to do this, there are two hoops we have to go through:
- Search job openings using the drop-down options provided.
- Extract all job posts from the search results. Here, we will have to click through all pages of search results.

## Manually exploring the site
As a requirement, we need to understand the website we would like to get data from, so we will navigate to the 
[UN jobs website](https://careers.un.org/lbw/Home.aspx). One of the key things to note is that we can leave the default search options as they are and still get results.

# Search with Selenium
As we saw in the previous section, we can just accept the default search options and proceed to search. 
In order to do this with selenium, we need to identify the search button. 
Once again, you can use developer tools in your favorite browser to get this information. If you are using ChroPath with Google Chrome, it is straightforward to point at the search button and then copy the XPath for the button. The process is like so:
1. Open up developer tools by going to *more tools->developer tools*
2. Click on the inspect element button and then select the search button
3. Right click and select copy XPath

In [41]:
opts = Options()
opts.headless = True # turn on or off headless option
browser = Firefox(options=opts)
browser.get("https://careers.un.org/lbw/home.aspx?lang=en-US")
search_button_id = "ctl00_ContentPlaceHolder1_UNCareersLoader1_ctl00_SearchControl1_btnSearch"
# search with default options
browser.find_element(By.ID,search_button_id).send_keys(Keys.RETURN)

# Retrieve the Search Results from the Page
The next thing we need to do is retrieve the results from the page (i.e., being able to grab the actual job posts). Again using the developer tools, 
we poke around the search results to find a way to extract them. There is definitely more than one way of doing this, but this is how I achieved it. By checking the XPath for the displayed job positions, I noted that all of them have ```gvSearchGrid``` in their *id*. I therefore decided to use an XPath which checks that the id contains the aforementioned string, like so:

```find_elements(By.XPATH, '//*[contains(@id,"gvSearchGrid")]')```

Before settling on this solution, I tried another solution which relied on the fact that the search results are stored in a table. As such, using XPath searching for *tr* also works. I ended up not using the *tr* tag because I just had more problems with it than the other approach.

In [42]:
xpath_for_search_results = '//*[contains(@id,"gvSearchGrid")]'
search_results = browser.find_elements(By.XPATH, xpath_for_search_results)

## Process the retrieved results
For each job post, I'm interested in the following data: position title, grade, duty station and posting period. The XPath search above returns a list of ```selenium.webdriver.firefox.webelement.FirefoxWebElement``` objects. One disadvantage of the XPath string we used is that it also gives us elements which aren't job posts. As such, we have to sift through them and keep only the relevant items. We can get the text of a *FirefoxWebElement* by reading its ```text``` attribute. Thus, we can get the position title as displayed in the search results. We can also get all the details of the position using this approach. However, I found it difficult to parse the text when it contains all the details because of conflicting separators. For instance, for the position shown below, it's not easy to use either a comma or a space as a separator to get the information we need.

```DIRECTOR, ECONOMIC AFFAIRS D-2 143143 Economic, Social and Development Programme Management United Nations Conference on Trade and Development GENEVA 15/02/2021```

Since each job post shown in the search results is a link to the detailed information about the job, I decided to get the required data by following this link.
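
To make the separator problem concrete, here is a small sketch that splits the example row both ways. The helper name ```split_row``` is my own; the row text is the one quoted above.

```python
row_text = ("DIRECTOR, ECONOMIC AFFAIRS D-2 143143 Economic, Social and Development "
            "Programme Management United Nations Conference on Trade and Development "
            "GENEVA 15/02/2021")


def split_row(text, separator):
    """Split the row on a separator and trim whitespace around each piece."""
    return [part.strip() for part in text.split(separator)]


by_comma = split_row(row_text, ",")
by_space = split_row(row_text, " ")

# The comma cuts the title in half and lumps every other field together
print(by_comma[0])
print(by_comma[1])
# The space splits multi-word fields such as the title and the department
print(by_space[:4])
```

Neither split lines up with the fields we want: the title itself contains a comma, and most fields contain spaces.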

## Following clickable links
Unlike in other webpages where you can retrieve the URL of a link from its "href" attribute, 
for JavaScript-enabled pages, the "href" returns a JavaScript call which runs when you click the link. For instance, let's look at the "href" for one of the job posts.

In [32]:
for i in search_results:
    if i.text == 'DIRECTOR, ECONOMIC AFFAIRS':
        url = i.get_attribute("href")
        print("URL--> {}".format(url))

URL-->javascript:__doPostBack('ctl00$ContentPlaceHolder1$UNCareersLoader1$ctl00$RadtabStrip_Grid1$gvSearchGrid$ctl02$lnkTitle','')
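
The value above is a call to the page's ```__doPostBack``` function rather than a URL. If you want to log or compare these links, the two arguments can be pulled out with a regular expression. This is a sketch only; the helper name ```parse_postback``` is my own, and it assumes the href keeps the exact quoting shown in the output.

```python
import re

# Matches __doPostBack('<target>','<argument>') as printed in the output above
POSTBACK_PATTERN = re.compile(r"__doPostBack\('([^']*)','([^']*)'\)")


def parse_postback(href):
    """Return (event_target, event_argument) from a __doPostBack href, or None."""
    match = POSTBACK_PATTERN.search(href)
    if match is None:
        return None
    return match.group(1), match.group(2)


href = ("javascript:__doPostBack('ctl00$ContentPlaceHolder1$UNCareersLoader1$ctl00"
        "$RadtabStrip_Grid1$gvSearchGrid$ctl02$lnkTitle','')")
target, argument = parse_postback(href)
print(target.split("$")[-1])
```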


However, we can easily follow the link by using the ```click()``` function. Once we click the link, the driver opens another window; we can switch to that window and retrieve the data we need. I do this below.

In [None]:
for i in search_results:
    if i.text == 'DIRECTOR, ECONOMIC AFFAIRS':
        i.click()
        windows = browser.window_handles
        browser.switch_to.window(windows[-1])
        pos_title = browser.find_element(By.XPATH, '//*[contains(@id,"jobPostingTitle")]').text
        # the grade is the last comma-separated piece of the posting title
        grade = pos_title.split(",")[-1].strip()
        job_code = browser.find_element(By.XPATH, '//*[contains(@id,"jobCodeTitle")]').text
        duty_station = browser.find_element(By.XPATH, '//*[contains(@id,"jobDutystation")]').text
        period = browser.find_element(By.XPATH, '//*[contains(@id,"jobPeriod")]').text
        url = browser.current_url
        pos_html = {'Title': job_code, 'Grade': grade, 'DutyStation': duty_station,
                        'PostingPeriod': period, 'url': url}
        print(pos_html)
        break

# Get Search Results from All Pages

There are more than 10 pages of results and 
our goal is to scrape all job posts from these pages. There are several minor problems to solve:
1. Determine how many pages of results there are
2. Progressively access each batch of results (visible on page)
3. Follow each page of results. We already know that we have to click through the links

## Get number of pages
The site displays 1-10 pages at a time and a link to click through to the next chunk of 10 pages. To find the total number of pages, we have to click through to the end using ```>>```. The final chunk of 10 pages will have ```<<``` and we use this fact to retrieve the final number of pages. In the function ```get_number_of_pages()```, I do the following to get the number of pages.
1. On the first page of results, retrieve ```>>``` and use ```click()``` to get to the next chunk of results.
2. Use the *tr* tag to get table rows which include the row for page numbers
3. Loop through the rows and if we find ```<<```, we know we have hit the last page of results
4. Simply extract the last page number from the row of page numbers
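
Step 4 boils down to reading a number out of the pager row's text. Here is a small sketch of that step on its own, so it can be tried without a browser. The helper name ```last_page_from_pager``` and the sample pager text are my own assumptions about what the row looks like, not output captured from the site.

```python
def last_page_from_pager(pager_text):
    """Return the largest page number found in a pager row's text, or None if there is none."""
    numbers = [int(token) for token in pager_text.split() if token.isdigit()]
    return max(numbers) if numbers else None


# Hypothetical text of the last chunk's pager row
print(last_page_from_pager("<< 21 22 23 24"))
```

Taking the largest number rather than the last token makes the helper indifferent to trailing symbols in the row.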

In [None]:
def get_number_of_pages(guessed_num_pages=40):
    """
    Determine number of pages on the site.
    :param guessed_num_pages: Fallback used when the last page number cannot be read
    :return: Number of pages
    :rtype: int
    """
    # Firefox is imported from selenium.webdriver in the test cell above;
    # the with-block quits the browser on exit, so no explicit close is needed
    with Firefox() as driver:
        driver.get("https://careers.un.org/lbw/home.aspx?lang=en-US")
        driver.find_element(By.ID,
                            "ctl00_ContentPlaceHolder1_UNCareersLoader1_ctl00_SearchControl1_btnSearch").send_keys(
            Keys.RETURN)

        driver.implicitly_wait(10)
        try:
            next_page = driver.find_element(By.LINK_TEXT, ">>")
            next_page.click()
            for tr in driver.find_elements(By.TAG_NAME, "tr"):
                if "<<" in tr.text:
                    # the last token in the pager row is the last page number
                    return int(tr.text.split()[-1])
        except Exception as e:
            # if anything above fails, fall back to the guess below
            print(e)
        # no pager row containing "<<" was found, or an error occurred
        return guessed_num_pages

# Putting Everything Together
Let's wrap all the logic in the following three functions.
- **search_with_selenium()**. The role of this function is to loop through all pages and, within the loop, call the rest of the functions.
- **get_search_results_from_page()**. For each page, retrieve job posts from that page using the appropriate XPath. The important problem I had to deal with in this function is the fact that only 10 pages are visible at a time. As such, to navigate to page 11, we need to use ```...``` instead of the page number. The same applies to page 21 and to every page that immediately follows a multiple of 10.
- **extract_job_posts()**. Within this function, I sift through the retrieved results as shown earlier and identify which ones are job posts. For each job post, I follow the link and extract the job post details. One issue I had to deal with here is switching between the main window and the job details window: when you click a job post link, it opens another window, and in order to grab the details, you have to switch to that window. Afterwards, I have to switch back to the main window.

In [47]:
def search_with_selenium(url="https://careers.un.org/lbw/home.aspx?lang=en-US"):
    """
    Takes the base url and searches for all professional jobs
    :return: A list of dict items, one per job post
    :rtype: list
    """
    num_pages = get_number_of_pages()
    print("Found {} pages".format(num_pages))
    # for identifying search button
    search_button_id = "ctl00_ContentPlaceHolder1_UNCareersLoader1_ctl00_SearchControl1_btnSearch"
    # Firefox is imported from selenium.webdriver in the test cell above
    with Firefox() as driver:
        driver.get(url)
        driver.find_element(By.ID,search_button_id).send_keys(
            Keys.RETURN)

        data = []
        for page in range(1, num_pages + 1):
            try:
                results_table = get_search_results_from_page(driver, page)
                data += extract_job_posts(driver, results_table)
            except Exception as e:
                # skip pages that fail, but report them instead of failing silently
                print("Skipping page {}: {}".format(page, e))
    return data
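
The stated goal is to save the job posts to disk as a text file, so here is a minimal sketch of that last step. It assumes the list of dicts has the keys used in this notebook (```Title```, ```Grade```, ```DutyStation```, ```PostingPeriod```, ```url```); the function names, the tab-separated layout and the dated file name are my own choices.

```python
import csv
import io
import os
from datetime import date

FIELDS = ["Title", "Grade", "DutyStation", "PostingPeriod", "url"]


def job_posts_to_text(data):
    """Render a list of job-post dicts as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(data)
    return buffer.getvalue()


def save_job_posts(data, folder=".", day=None):
    """Write the job posts to <folder>/un_jobs_<YYYY-MM-DD>.txt and return the path."""
    day = day or date.today()
    path = os.path.join(folder, "un_jobs_{}.txt".format(day.isoformat()))
    with open(path, "w", encoding="utf-8") as f:
        f.write(job_posts_to_text(data))
    return path


# Typical use once scraping works (not executed here because it opens a browser):
# save_job_posts(search_with_selenium())
```

A tab is used as the separator because position titles such as "DIRECTOR, ECONOMIC AFFAIRS" already contain commas.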

In [49]:
def get_search_results_from_page(driver, page):
    """
    Helper function to extract search results matching job posts.
    :return: A list of web elements whose id contains "gvSearchGrid"
    :rtype: list
    """
    xpath_for_search_results = '//*[contains(@id,"gvSearchGrid")]'
    driver.implicitly_wait(10)
    if page > 1:
        next_page = None
        # Temporarily hard coded these values as I don't expect to have more than 30 pages
        # This could be dynamically generated from the number of pages and visible pages shown
        if page == 11:
            next_page = driver.find_element(By.LINK_TEXT, "...")
        elif page == 21:
            # when we hit page 21, there are two ... links: one going forward, one going backwards
            for i in driver.find_elements(By.LINK_TEXT, "..."):
                if str(page) in i.get_attribute("href"):
                    next_page = i
                    break
        else:
            next_page = driver.find_element(By.LINK_TEXT, str(page))
        if next_page is None:
            # fail with a clear message instead of an AttributeError on None
            raise RuntimeError("No link found for page {}".format(page))
        next_page.click()
    return driver.find_elements(By.XPATH, xpath_for_search_results)
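
The hard-coded checks for pages 11 and 21 follow a simple rule, which can be written down on its own. This is a sketch under the assumption that the pager always shows 10 page numbers at a time; the helper name ```link_text_for_page``` is my own.

```python
def link_text_for_page(page, visible_pages=10):
    """Return the pager link text to click in order to reach `page`."""
    # The first page of each new chunk (11, 21, 31, ...) is only reachable via "..."
    if page > 1 and page % visible_pages == 1:
        return "..."
    return str(page)


print([p for p in range(1, 35) if link_text_for_page(p) == "..."])
```

Note that from page 21 onwards there are two ```...``` links on the page, so the link text alone is not enough and the check on the "href" is still needed to pick the forward one.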


In [None]:
def extract_job_posts(driver, results_table):
    """
    Given results on a single page, extract the job details
    :param driver: Browser session currently showing a page of search results
    :type driver: selenium WebDriver
    :param results_table: Elements returned by get_search_results_from_page()
    :type results_table: list
    :return: One dict per job post
    :rtype: list
    """
    data = []
    # remember the results window so we can always get back to it
    main_window = driver.current_window_handle

    for i in results_table:
        try:
            # read the text before clicking, while we are still in the results window
            link_text = i.text
            if not link_text:
                continue
            i.click()
            windows = driver.window_handles
            if len(windows) == 1:
                # the click did not open a job details window
                continue
            driver.switch_to.window(windows[-1])
            try:
                pos_title = driver.find_element(By.XPATH, '//*[contains(@id,"jobPostingTitle")]').text
                grade = pos_title.split(",")[-1].strip()
            except Exception:
                grade = 'Not specified'

            try:
                job_code = driver.find_element(By.XPATH, '//*[contains(@id,"jobCodeTitle")]').text
            except Exception:
                job_code = link_text

            duty_station = driver.find_element(By.XPATH, '//*[contains(@id,"jobDutystation")]').text
            period = driver.find_element(By.XPATH, '//*[contains(@id,"jobPeriod")]').text
            url = driver.current_url
            pos_html = {'Title': job_code, 'Grade': grade, 'DutyStation': duty_station,
                        'PostingPeriod': period, 'url': url}
            data.append(pos_html)
        except Exception as e:
            print(e)
        finally:
            # close the details window if one is open, then return to the results
            if driver.current_window_handle != main_window:
                driver.close()
            driver.switch_to.window(main_window)
    return data
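
The switch-to-details-then-back dance can also be packaged as a context manager, so that the return to the main window happens even when scraping the details fails. This is a sketch, not the code I ran: ```detail_window``` is a name of my own, and it relies only on ```current_window_handle```, ```window_handles```, ```switch_to.window()``` and ```close()```, which Selenium drivers provide. ```FakeDriver``` is a stand-in with just those attributes so the example runs without a browser.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def detail_window(driver):
    """Switch to the newest window, then close it and return to the original one."""
    main_window = driver.current_window_handle
    driver.switch_to.window(driver.window_handles[-1])
    try:
        yield driver
    finally:
        # only close if a second window was actually opened
        if driver.current_window_handle != main_window:
            driver.close()
        driver.switch_to.window(main_window)


class FakeDriver:
    """Stand-in for a Selenium driver: two windows, with 'main' currently focused."""

    def __init__(self):
        self.window_handles = ["main", "details"]
        self.current_window_handle = "main"
        self.switch_to = SimpleNamespace(window=self._switch)

    def _switch(self, handle):
        self.current_window_handle = handle

    def close(self):
        self.window_handles.remove(self.current_window_handle)


fake = FakeDriver()
with detail_window(fake):
    inside = fake.current_window_handle  # scraping of the details would go here
print(inside, fake.current_window_handle, fake.window_handles)
```

With a real driver, the body of the ```with``` block would hold the XPath lookups for the job details.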

# Concluding Remarks

In this notebook, I have shown how I used selenium to scrape data from 
a website where content is dynamically loaded using JavaScript. So, one big question is why not use *requests*, *BeautifulSoup* or *scrapy*? For web pages where data is dynamically loaded, the data is only loaded when you click on a link or perform some other action in the browser. As such, you cannot reach the desired data by downloading it using **requests** or standard **scrapy** code. Although scrapy does have advanced features to deal with this issue, for my simple use case, I found that using **selenium** was much easier. It's worth mentioning that selenium can be used in combination with scrapy as explained [here](https://stackoverflow.com/questions/17975471/selenium-with-scrapy-for-dynamic-page).

The task that I perform in this notebook is straightforward. However, navigating through pages to ensure that I scraped data from all pages was slightly tricky, and I have shown how I resolved the problem. Also, with the code I have, I still get this error: *Web element reference not seen before*, which means the code could use some improvement.