Sudhir Voleti. ISB. Apr 2019.

Hi all,

Welcome back. Often during data collection, scraping static webpages with tools like R's rvest or Python's Beautiful Soup is simply not enough, for a number of reasons:
* the target webpage is dynamic, 
* it has hidden and changing content, 
* it has [AJAX (Asynchronous JavaScript And XML)](https://en.wikipedia.org/wiki/Ajax_%28programming%29) scripts running (which allow web pages to change content dynamically without reloading the entire page), 
* etc.
<br>
<br>
When we navigate such a page, the page (and by extension the web application behind it) would require us to take a number of actions. Think of:

1. Clicking a button to load more content
2. Refreshing old content
3. Clearing search boxes prior to search
4. Typing in search queries
5. Choosing options from pulldown menus
6. Scrolling 
7. Etc.

<br>
<br>
As people, we can do this easily. But imagine needing to collect data from hundreds of such pages or page instances.
<br>
<br>
In such a scenario, would it not be great if we could launch a browser from code, script the actions needed to display the data we want, and then scrape that data? This is where webdrivers enter the picture. One of the best known is Selenium, which has bindings for both R and Python. In my experience the Python bindings are the more stable of the two, so here we are.
<br>
<br>
To recap: the aim of this doc is to demo webdriver functionality using a few elementary exercises.
<br>
<br>


### Installation

Skip this step if you have already installed Selenium.

In [1]:
import sys
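
The install cell above only imports sys, so here is a minimal sketch of what a check-and-install step might look like. It assumes the package is published on PyPI under the name `selenium` and that pip is available for the current interpreter; the helper names `is_installed` and `pip_install_command` are made up for illustration.

```python
import importlib.util
import sys


def is_installed(package_name):
    """Return True if `package_name` can be imported in this environment."""
    return importlib.util.find_spec(package_name) is not None


def pip_install_command(package_name):
    """Build a pip command tied to the interpreter running this notebook."""
    return [sys.executable, "-m", "pip", "install", package_name]


# Assumption: the PyPI package name is "selenium".
if not is_installed("selenium"):
    # Printed rather than executed, so nothing gets installed by surprise.
    print("Run:", " ".join(pip_install_command("selenium")))
```

To actually install from within the notebook, the command list could be handed to `subprocess.check_call(...)`. Using `sys.executable` keeps the install tied to the same Python that the notebook kernel runs on.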

### Import the required libraries

In [2]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys 

Ok. What follows applies if you're a novice (somewhat like me).
<br>
<br>
So if the above code block ran without errors on, say, Spyder's IPython console, you've installed the Selenium package correctly.
<br>
<br>
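The browser driver itself (ChromeDriver, for Chrome) is a separate binary. I specifically unzipped it and placed it at a local machine location ('F:/R miscell/chromedriver.exe') where I can call it at will without the hassle of relative paths etc. You may choose to do the same. The code below points at 'data/chromedriver' instead; use whichever path holds your copy.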
<br>
<br>
Now, in the code below, we invoke the webdriver to play around a tad with www.python.org as our target webpage, do some basic searches there etc.
<br>
<br>
Be aware we can use CSS selectors from SelectorGadget like previously, or other options, to get to particular tags and HTML elements on the target page.
<br>
<br>
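Specifically, keep a lookout for the following Selenium Python functions:
1. .get() to load the target webpage in the Chrome instance that the webdriver has opened.
2. .find_element_by_*() to find particular fields, boxes, buttons etc. on the target webpage. (These are the Selenium 3 names; Selenium 4 replaces them with find_element(By.CSS_SELECTOR, ...) and friends.)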
<br>
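
The find_element_by_*() helpers listed above are wrappers around one generic call, find_element(by, value), where `by` is a plain string naming the lookup strategy. A rough sketch of that idea follows. It assumes `driver` is a live webdriver object; the `LOCATORS` table and the `find` helper are illustrative names of mine, not part of Selenium.

```python
# Strategy strings. These match the values of Selenium's `By` constants
# (By.ID, By.NAME, By.CSS_SELECTOR, By.XPATH).
LOCATORS = {
    "id": "id",
    "name": "name",
    "css": "css selector",
    "xpath": "xpath",
}


def find(driver, how, value):
    """Locate one element through the generic find_element(by, value) call.

    `find` is a hypothetical convenience wrapper, not a Selenium function.
    """
    if how not in LOCATORS:
        raise ValueError("unknown locator strategy: %r" % how)
    return driver.find_element(LOCATORS[how], value)
```

With a browser open, `find(driver, "css", "#id-search-field")` would look up the same search box that the cells below find with find_element_by_css_selector. Going through the generic call has the side benefit of not depending on the per-strategy helper names.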

Behold.

In [3]:
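# https://selenium-python.readthedocs.io/getting-started.html
# Selenium 3 style: the path to the chromedriver binary is the first argument.
# Newer Selenium 4 releases expect webdriver.Chrome(service=Service('data/chromedriver')),
# with Service imported from selenium.webdriver.chrome.service.
driver = webdriver.Chrome('data/chromedriver')
driver.get("http://www.python.org")    # load the page in the browser
assert "Python" in driver.title   # confirm that 'Python' appears in the page title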

The moment the above runs, webdriver.Chrome() opens a new Chrome window that is remote-driven by the webdriver, and .get(target webpage) loads the target page in it.
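
A bare assert gives little to go on when it fails, and Python skips asserts altogether when run with the -O flag. As a hedged alternative, here is a small sketch that wraps the load-and-check step in a function with a more descriptive error. The names `open_page` and `make_driver` are hypothetical; `make_driver` stands for any zero-argument callable that returns a webdriver.

```python
def open_page(make_driver, url, expected_in_title):
    """Start a browser, load `url`, and check the page title.

    `make_driver` is an assumed zero-argument factory, for example
    lambda: webdriver.Chrome('data/chromedriver').
    """
    driver = make_driver()
    try:
        driver.get(url)  # navigate the browser window to the page
        if expected_in_title not in driver.title:
            raise RuntimeError(
                "unexpected page title %r for %s" % (driver.title, url)
            )
    except Exception:
        driver.quit()  # do not leave a stray browser open on failure
        raise
    return driver
```

Usage would be along the lines of `driver = open_page(lambda: webdriver.Chrome('data/chromedriver'), "http://www.python.org", "Python")`. Passing a factory rather than a driver lets the function own the cleanup if anything goes wrong during loading.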
<br>
<br>
The plan now is to search in the searchbox on the target webpage for a search term, say, "pycon" (for 'python conferences'). This implies we first need to ID the search field with .find_element_by_*(search-field-ID) and tell the webdriver to populate it.
<br>
<br>
We can find an element by name or by CSS selector (among other options like XPath). I'll just use the CSS selector.
<br>
<br>
Here is a screenshot of what the CSS selector for the search box field should look like.

![](data/img1.png)

In [4]:
# elem = driver.find_element_by_name("q")   # alternative: locate by name; see http://selenium-python.readthedocs.io/locating-elements.html#locating-elements
elem = driver.find_element_by_css_selector("#id-search-field")     # locate the search box by its CSS selector

So we've IDed the webpage element we want to act on using its CSS selector. We've named the element elem so we can repeatedly call it and manipulate it.
<br>
<br>
Next, I'll clear the search box with .clear() in case it has any previous entries in it; then populate it with our search term using .send_keys("search term"); hit return; and check whether any results came back.
<br>
<br>
See below.

In [5]:
# sending keys below is like typing on the keyboard
elem.clear()    # first clear any pre-populated text in the input field
elem.send_keys("pycon")     # type "pycon" (for Python conferences) into the search field
elem.send_keys(Keys.RETURN)     # hit ENTER to submit the search query
assert "No results found." not in driver.page_source     # fail if the page reports that nothing was found
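
One caveat with the check above: it reads driver.page_source right after the key press, so if the results page has not finished loading, the assertion may be looking at the wrong page. Selenium ships WebDriverWait (in selenium.webdriver.support.ui) for exactly this; the sketch below is a library-free stand-in that shows the same polling idea. `wait_until` is an illustrative name, not a Selenium function.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed. Hypothetical helper, sketched for illustration only.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)
```

For example, `wait_until(lambda: "/search/" in driver.current_url)` would pause until the browser reaches a search URL, assuming the site serves its results under such a path. The page_source check can then run against the page it was meant for.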

It makes sense to use assert just to be sure. Here's a screenshot of the results, for the record.

![](data/img2.png)

Now that the results are displayed, we can hand driver.page_source to Beautiful Soup and scrape the results as required. Cool, eh?
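
To give a flavour of that scraping step, here is a minimal sketch. With Beautiful Soup one would typically start from `BeautifulSoup(driver.page_source, "html.parser")`; to stay dependency-free, the sketch swaps in Python's standard-library html.parser and simply collects link text and targets. `SAMPLE_HTML` is made-up markup standing in for driver.page_source, not the actual structure of the python.org results page, and `LinkCollector` and `extract_links` are illustrative names.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for every <a href=...> in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_links(page_source):
    """Return all (text, href) pairs found in an HTML string."""
    collector = LinkCollector()
    collector.feed(page_source)
    collector.close()
    return collector.links


# Made-up markup standing in for driver.page_source.
SAMPLE_HTML = """
<ul>
  <li><a href="/events/1/">PyCon Example 2019</a></li>
  <li><a href="/events/2/">PyCon Sample Sprint</a></li>
</ul>
"""
print(extract_links(SAMPLE_HTML))
```

In a live session the call would be `extract_links(driver.page_source)`. Narrowing the output to just the search results would need the real page's tags and classes, which SelectorGadget can help identify.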
<br>
<br>
Alright. How about we search for something else now. Say, 'beautiful soup' itself. See below.

In [6]:
# trying a new search term, say 'beautiful soup'
elem = driver.find_element_by_css_selector("#id-search-field")     # find the search box again on the results page
elem.clear()    # first clear any pre-populated text in the input field
elem.send_keys("beautiful soup")     # type "beautiful soup" into the search field
elem.send_keys(Keys.RETURN)     # hit ENTER to submit the search query
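
The two search cells repeat the same four steps, so they fold naturally into one function. A hedged sketch: `run_search` is a hypothetical helper, the selector is the one used in the cells above, and the submit key is passed in by the caller (Keys.RETURN in this notebook) so the sketch itself carries no Selenium import.

```python
SEARCH_BOX = "#id-search-field"  # CSS selector used in the cells above


def run_search(driver, term, submit_key):
    """Clear the search box, type `term`, and submit with `submit_key`.

    Hypothetical helper; `driver` is assumed to be a live webdriver.
    """
    # Look the box up afresh on every call: after the page reloads, an
    # element found earlier no longer refers to anything on the new page.
    box = driver.find_element("css selector", SEARCH_BOX)
    box.clear()
    box.send_keys(term)
    box.send_keys(submit_key)
```

A loop such as `for term in ["pycon", "beautiful soup"]: run_search(driver, term, Keys.RETURN)` would then replay both searches, with whatever scraping is needed slotted in after each call.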

And so on.
<br>
Last but not least, it is important to close the browser window you opened, using the driver.close() command as below.

In [7]:
driver.close()     # close() closes the current window or tab, whereas quit() exits the entire browser
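
If an earlier cell raises an error, the cleanup cell above never runs and the browser stays open. One way around this, sketched here under the assumption that a zero-argument factory `make_driver` returns the webdriver, is a small context manager that always calls quit(). `browser_session` is an illustrative name, not part of Selenium.

```python
from contextlib import contextmanager


@contextmanager
def browser_session(make_driver):
    """Yield a webdriver and always quit it, even if the body raises.

    `make_driver` is an assumed factory, for example
    lambda: webdriver.Chrome('data/chromedriver').
    """
    driver = make_driver()
    try:
        yield driver
    finally:
        driver.quit()  # ends the whole browser session, not just one tab
```

Usage would look like `with browser_session(lambda: webdriver.Chrome('data/chromedriver')) as driver: driver.get("http://www.python.org")`, with all the searching and scraping placed inside the with block.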

Well, any simple examples we can try now? How about automating your FB login with the Chrome driver?
<br>
<br>
That's for homework, maybe.
<br>
<br>
Shall end this markdown here.
<br>
<br>
Sudhir ISB, Apr 2019.