# Web scraping using Selenium

This notebook assumes that you have selenium installed along with the browser driver. You can find the drivers here:

http://www.seleniumhq.org/download/

Here we use the Chrome driver, but you can use any one of your choice.

The selenium package may be installed in Anaconda with:

```bash
conda install -c conda-forge selenium
```

This short tutorial uses a trick mentioned in this [post](https://stackoverflow.com/a/23447450/8500344) so that the browser window stays hidden.

Start with the required imports:

In [1]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options

The following configures the driver. Here we configure window size and a headless version of the driver.

In [2]:
WINDOW_SIZE = "1920,1080"

chrome_options = Options()  
chrome_options.add_argument("--headless")  
chrome_options.add_argument("--window-size=%s" % WINDOW_SIZE)

Then we just have to create the driver. Omit the options if you want to see the browser window while you work. This can be useful for inspecting the page you are browsing with the developer tools.

In [3]:
driver = webdriver.Chrome(options=chrome_options)

Now we go to the URL we want to scrape. This example uses www.airfleets.net, which is an amazing site for aeronautics fans:

In [4]:
driver.get("http://www.airfleets.net")

The following function selects the element with id `search_bas` (a text input on the page), clears its content, types the content of the `value` variable, presses the RETURN key, then returns the page source (HTML content).

In [5]:
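from selenium.webdriver.common.by import By

def request(driver, value):
    # Locate the search box by its id (the find_element_by_* helpers were removed in Selenium 4)
    elem = driver.find_element(By.ID, "search_bas")
    elem.clear()                 # remove any previous query
    elem.send_keys(value)        # type the search term
    elem.send_keys(Keys.RETURN)  # submit the search
    return driver.page_source    # HTML of the page after submitting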
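
Note that `page_source` is read right after the key press, so the results page may not have finished loading yet. Selenium provides `WebDriverWait` for this situation; the sketch below only illustrates the underlying idea (poll a condition until a timeout) in plain Python. The `wait_until` helper and the fake `page_ready` condition are hypothetical names introduced here so the example runs without a browser:

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


# Fake condition for illustration only: becomes true on the third call.
calls = {"n": 0}


def page_ready():
    calls["n"] += 1
    return calls["n"] >= 3


ready = wait_until(page_ready, timeout=2.0, poll_interval=0.01)
print(ready, calls["n"])
```

With a real driver, the condition would check for something that only appears on the results page; which element that is depends on the site and is not covered here.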

Use the function to search for all aircraft whose manufacturer serial number is 100:

In [6]:
result = request(driver, "100")

Finally, extract every Airbus model designation (an `A` followed by three digits) mentioned on this page:

In [7]:
import re
pattern = re.compile(r"A[0-9]{3}")
print(pattern.findall(result))

['A300', 'A300', 'A300', 'A320', 'A330', 'A380']
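
Because `findall` returns every occurrence, the same designation can show up several times. A minimal sketch of counting and de-duplicating the matches with `collections.Counter` follows; the `sample_html` string is a made-up stand-in for `result`, so the example runs without a browser:

```python
import re
from collections import Counter

# Hypothetical stand-in for the page source returned by request().
sample_html = "<td>A300</td><td>A300</td><td>A320</td><td>A380</td>"

pattern = re.compile(r"A[0-9]{3}")
matches = pattern.findall(sample_html)

# Count how many times each designation appears, then list the distinct ones.
counts = Counter(matches)
unique_types = sorted(counts)

print(counts["A300"])
print(unique_types)
```

In the notebook itself, passing `result` to `pattern.findall` instead of `sample_html` should give the same kind of summary for the real page.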


You can of course do much more complicated things to parse the page and extract the information you want.
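
As one possible direction, the standard library's `html.parser` can pull text out of specific tags instead of matching over the raw source. The sketch below collects the text of every `<td>` cell. The `CellCollector` class and the sample markup are illustrative assumptions, not the actual structure of the site's pages:

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the stripped text content of every <td> element."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # Start a fresh buffer for this cell.
            self._in_cell = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_cell:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._in_cell:
            self.cells.append("".join(self._buffer).strip())
            self._in_cell = False


# Hypothetical markup; in the notebook you would feed `result` instead.
sample_html = """
<table>
  <tr><td>A300</td><td><a href="#">Example Airline</a></td></tr>
  <tr><td>A320</td><td>Sample Airways</td></tr>
</table>
"""

collector = CellCollector()
collector.feed(sample_html)
print(collector.cells)
```

Working on parsed cells rather than the raw source makes it easier to keep related values together, for example a model designation and the text next to it in the same row.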