# Intro to Web Scraping with Beautiful Soup and Selenium
Courtesy of Professor Mike Davis

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [2]:
page_to_scrape = "https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)"

# Pre-JavaScript
Here, we'll just capture the response from the URL above and use Beautiful Soup to parse the content. Any client-side JavaScript (code that runs in the web browser) won't be executed here, so whatever it would add to the page is missing.

In [3]:
page_response = requests.get(page_to_scrape)
page_response

<Response [200]>
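
To illustrate the point about client-side JavaScript, here is a minimal sketch using a made-up page (the HTML string and the names raw_html, soup_demo and target are assumptions for illustration, not part of the Wikipedia page). The script is present in the markup, but Beautiful Soup only parses it and never runs it:

```python
from bs4 import BeautifulSoup

# Hypothetical page: the <div> is empty in the HTML the server sends,
# and the script would fill it in only when a browser runs it.
raw_html = """
<html><body>
  <div id="population"></div>
  <script>
    document.getElementById("population").textContent = "filled in by script";
  </script>
</body></html>
"""

soup_demo = BeautifulSoup(raw_html, "html.parser")
target = soup_demo.find("div", id="population")

# Beautiful Soup parses the markup but does not execute the script,
# so the div is still empty.
print(repr(target.text))  # ''
```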

In [10]:
soup = BeautifulSoup(page_response.content,'html.parser')
# print(soup.prettify())

## Built-ins

There are several built-in shortcuts, like .title, which lets us access the title element of the page:

In [11]:
soup.title

<title>List of countries by population (United Nations) - Wikipedia</title>

We can also pull out the text by referencing the .text attribute:

In [59]:
soup.title.text

'List of countries by population (United Nations) - Wikipedia'
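
The same dotted access works for other tags, and attributes can be read like dictionary keys. A small sketch on a made-up HTML snippet (html_doc and demo are illustrative names, not from the page being scraped):

```python
from bs4 import BeautifulSoup

html_doc = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <h1>Countries</h1>
    <p>See the <a href="/wiki/World">world total</a>.</p>
  </body>
</html>
"""

demo = BeautifulSoup(html_doc, "html.parser")

print(demo.title)        # the whole <title> tag
print(demo.title.text)   # just its text: Demo page
print(demo.h1.text)      # text of the first <h1>: Countries
print(demo.a["href"])    # attributes are read like dictionary keys: /wiki/World
print(demo.a.get("id"))  # .get() returns None when the attribute is missing
```

Dotted access always returns the first matching tag, so it is most useful for elements that appear once, such as the title.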

## Finding elements with .find() and .find_all()
The .find() method returns the first matching element, and the .find_all() method returns a list of every match for the element we search for. Let's try to find a table whose class attribute includes both 'wikitable' and 'sortable'.

In [60]:
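# select_one takes a CSS selector, so both classes are required here. A dict
# that repeats the 'class' key would keep only the last value ('sortable').
table = soup.select_one("table.wikitable.sortable")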
# table
type(table)

bs4.element.Tag
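
One thing to watch when matching on several classes: a Python dict literal keeps only the last value for a repeated key, so an attrs dict that lists 'class' twice ends up filtering on a single class. The sketch below uses a made-up two-table document (html_doc, demo and the variable names are assumptions) to compare that with a CSS selector, which requires both classes:

```python
from bs4 import BeautifulSoup

# A dict literal keeps only the last value for a repeated key.
attrs = {'class': 'wikitable', 'class': 'sortable'}
print(attrs)  # {'class': 'sortable'}

html_doc = """
<table class="sortable"><tr><td>other</td></tr></table>
<table class="wikitable sortable"><tr><td>target</td></tr></table>
"""
demo = BeautifulSoup(html_doc, "html.parser")

# Matches the first table that has "sortable" among its classes.
by_dict = demo.find("table", attrs=attrs)
# The CSS selector requires both classes on the same table.
by_css = demo.select_one("table.wikitable.sortable")
# find_all returns every match rather than only the first.
all_sortable = demo.find_all("table", class_="sortable")

print(by_dict.td.text)    # other
print(by_css.td.text)     # target
print(len(all_sortable))  # 2
```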

Here, we'll find the first tr element inside tbody, and then each th element within that row.

In [61]:
col_headers = table.find('tbody').find('tr').find_all('th')
# col_headers
type(col_headers)

bs4.element.ResultSet

In [62]:
table_headers = [ele.text.strip() for ele in col_headers]
table_headers

[]
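
An empty list like this can come up when the first row inside tbody holds no th cells, for example when the header row sits in a separate thead. As a hedged sketch on made-up HTML (header_texts is a hypothetical helper, not part of Beautiful Soup), one option is to take the first row anywhere in the table that contains th cells:

```python
from bs4 import BeautifulSoup

html_doc = """
<table>
  <thead><tr><th>Rank</th><th>Country</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>China</td></tr>
  </tbody>
</table>
"""
demo_table = BeautifulSoup(html_doc, "html.parser").find("table")

# Looking only at the first row of <tbody> finds no header cells here.
from_tbody = demo_table.find("tbody").find("tr").find_all("th")


def header_texts(table):
    """Return header text from the first row that has any <th> cells."""
    for row in table.find_all("tr"):
        cells = row.find_all("th")
        if cells:
            return [cell.text.strip() for cell in cells]
    return []


print(len(from_tbody))           # 0
print(header_texts(demo_table))  # ['Rank', 'Country']
```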

And here we'll find the first tbody element, and then all the tr elements within it.

In [16]:
rows = table.find('tbody').find_all('tr')
clean_rows = []
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    if cols and cols[1] != 'World':
        clean_rows.append(cols)
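
To see what the loop above keeps and what it skips, here is the same pattern on a tiny made-up table (the rows and numbers are toy values, and demo_rows is an illustrative name). Header rows contain no td cells, so they produce an empty list and are dropped, as is the 'World' summary row:

```python
from bs4 import BeautifulSoup

html_doc = """
<table><tbody>
  <tr><th>Rank</th><th>Country</th><th>Population</th></tr>
  <tr><td>-</td><td>World</td><td>1,000</td></tr>
  <tr><td>1</td><td> China </td><td>600</td></tr>
  <tr><td>2</td><td>India</td><td>400</td></tr>
</tbody></table>
"""

demo_rows = []
for row in BeautifulSoup(html_doc, "html.parser").find_all("tr"):
    # get_text(strip=True) trims surrounding whitespace in each cell
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    # header rows have no <td>, so cells is empty and the row is skipped
    if cells and cells[1] != "World":
        demo_rows.append(cells)

print(demo_rows)  # [['1', 'China', '600'], ['2', 'India', '400']]
```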

Let's put this into a pandas DataFrame and convert the population column from text to integers.

In [17]:
mydf = pd.DataFrame(clean_rows)
mydf.columns = ['rank','country','continent','region','pop2016','population','change']
mydf['population'] = mydf['population'].str.replace(',','').astype('int')
mydf.head()

| | rank | country | continent | region | pop2016 | population | change |
|---|---|---|---|---|---|---|---|
| 0 | 1 | China[a] | Asia | Eastern Asia | 1403500365 | 1409517397 | +0.4% |
| 1 | 2 | India | Asia | Southern Asia | 1324171354 | 1339180127 | +1.1% |
| 2 | 3 | United States | Americas | Northern America | 322179605 | 324459463 | +0.7% |
| 3 | 4 | Indonesia | Asia | South-eastern Asia | 261115456 | 263991379 | +1.1% |
| 4 | 5 | Brazil | Americas | South America | 207652865 | 209288278 | +0.8% |
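
The scraped values are all strings, and some carry extras such as the footnote marker in 'China[a]'. A possible cleanup, sketched on two rows modeled on the output above (demo_df and the change_pct column are illustrative names, and the thousands separators are assumed to match the raw page):

```python
import pandas as pd

demo_df = pd.DataFrame(
    [
        ["1", "China[a]", "1,403,500,365", "1,409,517,397", "+0.4%"],
        ["2", "India", "1,324,171,354", "1,339,180,127", "+1.1%"],
    ],
    columns=["rank", "country", "pop2016", "population", "change"],
)

# Strip footnote markers such as "[a]" from country names.
demo_df["country"] = demo_df["country"].str.replace(r"\[.*?\]", "", regex=True)

# Convert both population columns from text to integers.
for col in ["pop2016", "population"]:
    demo_df[col] = pd.to_numeric(demo_df[col].str.replace(",", "", regex=False))

# Turn "+0.4%" into the float 0.4.
demo_df["change_pct"] = demo_df["change"].str.rstrip("%").astype(float)

print(demo_df[["country", "population", "change_pct"]])
```

Passing regex=True or regex=False explicitly avoids depending on the default, which has changed between pandas versions.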


And just use a few columns

In [18]:
cols_to_keep = ['country','continent','population']
world_pop = mydf[cols_to_keep].reset_index(drop=True)
world_pop.head()

| | country | continent | population |
|---|---|---|---|
| 0 | China[a] | Asia | 1409517397 |
| 1 | India | Asia | 1339180127 |
| 2 | United States | Americas | 324459463 |
| 3 | Indonesia | Asia | 263991379 |
| 4 | Brazil | Americas | 209288278 |


In [19]:
world_pop.to_csv('world_pop.csv', index = False)
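
The index = False argument matters when the file is read back. A short sketch, using an in-memory buffer instead of a file on disk (demo_pop and the buffer names are assumptions for illustration), shows that writing the index adds an extra unnamed column on the round trip:

```python
import io

import pandas as pd

demo_pop = pd.DataFrame(
    {"country": ["China", "India"], "population": [1409517397, 1339180127]}
)

# Default: the row index is written as a first column with an empty header.
with_index = io.StringIO()
demo_pop.to_csv(with_index)

# index=False leaves the row index out of the file.
without_index = io.StringIO()
demo_pop.to_csv(without_index, index=False)

cols_with = pd.read_csv(io.StringIO(with_index.getvalue())).columns.tolist()
cols_without = pd.read_csv(io.StringIO(without_index.getvalue())).columns.tolist()

print(cols_with)     # ['Unnamed: 0', 'country', 'population']
print(cols_without)  # ['country', 'population']
```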

## Rendering JavaScript with Selenium
Web pages sometimes load data using JavaScript, which won't execute if you just capture the response to a GET request (what gets sent back when you query a web page's URL). You need a WebDriver, which controls a real browser, to render the page. This is where the Selenium library helps us.

In [21]:
from selenium import webdriver 
from bs4 import BeautifulSoup
import pandas as pd

You will probably need to run 'pip install selenium', since Selenium is not one of the libraries that come with Anaconda. Then, you'll need to download geckodriver, the WebDriver that lets Selenium use Firefox to render web pages.

(Note: the geckodriver that's in this repo is for macOS, not Windows.)

Download the geckodriver from here: https://github.com/mozilla/geckodriver/releases

In [39]:
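from selenium.webdriver.firefox.service import Service

# Change this to the folder where the geckodriver executable is located.
# Download geckodriver from: https://github.com/mozilla/geckodriver/releases
driverpath = "/Users/denisvrdoljak/Desktop/SCUpython/OMIS30_GitHubREPO/wk9/"
# Assuming Selenium 4: the driver location is passed through a Service object
# (a bare positional argument to webdriver.Firefox is not the driver path).
driver = webdriver.Firefox(service=Service(executable_path=driverpath + "geckodriver"))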
page_to_scrape = "https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)"
driver.get(page_to_scrape)

In [43]:
soup = BeautifulSoup(driver.page_source, "html.parser")

In [45]:
# CSS selector requiring both classes (a dict cannot repeat the 'class' key)
table = soup.select_one("table.wikitable.sortable")
table

<table class="wikitable sortable plainrowheaders jquery-tablesorter" style="text-align: right;">
<caption>Countries and areas ranked by population in 2017
</caption>
<thead><tr>
<th class="headerSort" data-sort-type="number" role="columnheader button" scope="col" tabindex="0" title="Sort ascending">Rank
</th>
<th class="headerSort" role="columnheader button" scope="col" tabindex="0" title="Sort ascending">Country or area
</th>
<th class="headerSort" role="columnheader button" scope="col" tabindex="0" title="Sort ascending"><a href="/wiki/United_Nations_geoscheme" title="United Nations geoscheme">UN continental<br/>region</a><sup class="reference" id="cite_ref-region_2-0"><a href="#cite_note-region-2">[2]</a></sup>
</th>
<th class="headerSort" role="columnheader button" scope="col" tabindex="0" title="Sort ascending"><a href="/wiki/United_Nations_geoscheme" title="United Nations geoscheme">UN statistical<br/>region</a><sup class="reference" id="cite_ref-region_2-1"><a href="#cite_note-r

Here, we follow more or less the same steps, but the rendered HTML (after client-side JavaScript has run) may look different, so we might need to search for different tags. In this case the header row sits in a thead element.

In [46]:
col_headers = table.find('thead').find('tr').find_all('th')
table_headers = [ele.text.strip() for ele in col_headers]
table_headers

['Rank',
 'Country or area',
 'UN continentalregion[2]',
 'UN statisticalregion[2]',
 'Population(1 July 2016)[3]',
 'Population(1 July 2017)[3]',
 'Change']
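
Headers such as 'UN continentalregion[2]' run words together because .text joins the pieces around a br tag with nothing in between, and it also includes the footnote link. A possible cleanup, sketched on a trimmed copy of one header cell (th_html, th, raw_text and clean_text are illustrative names):

```python
from bs4 import BeautifulSoup

th_html = (
    '<table><tr><th>'
    '<a href="/wiki/United_Nations_geoscheme">UN continental<br/>region</a>'
    '<sup class="reference"><a href="#cite_note-region-2">[2]</a></sup>'
    '</th></tr></table>'
)
th = BeautifulSoup(th_html, "html.parser").find("th")

raw_text = th.text.strip()
print(raw_text)  # UN continentalregion[2]

# Remove footnote markers from the parsed tree before reading the text.
for sup in th.find_all("sup", class_="reference"):
    sup.decompose()

# A separator puts a space where the <br/> split the header.
clean_text = th.get_text(" ", strip=True)
print(clean_text)  # UN continental region
```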

In [47]:
rows = table.find('tbody').find_all('tr')
clean_rows = []
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    if cols and cols[1] != 'World':
        clean_rows.append(cols)

And, let's just use the same columns we did above

In [48]:
mydf = pd.DataFrame(clean_rows)
mydf.columns = ['rank','country','continent','region','pop2016','population','change']
mydf['population'] = mydf['population'].str.replace(',','').astype('int')
cols_to_keep = ['country','continent','population']
world_pop = mydf[cols_to_keep].reset_index(drop=True)
world_pop.head()

| | country | continent | population |
|---|---|---|---|
| 0 | China[a] | Asia | 1409517397 |
| 1 | India | Asia | 1339180127 |
| 2 | United States | Americas | 324459463 |
| 3 | Indonesia | Asia | 263991379 |
| 4 | Brazil | Americas | 209288278 |
