## Web Scraping
* Outline:
    1. Web scraping using Python and BeautifulSoup - block 1
        - US presidents in history from wikipedia
            + requests
            + BeautifulSoup
            + or urllib.request
    2. Web Scraping Wikipedia Tables using BeautifulSoup and Python - block 12
        - Countries listed in the table from wikipedia
            + requests
            + BeautifulSoup
            + Pandas
    3. Web Scraping HTML Tables with Python - block 21
        - Table from the Pokemon Database
            + requests
            + lxml.html
            + Pandas

### Web scraping using Python and BeautifulSoup
https://www.codementor.io/dankhan/web-scrapping-using-python-and-beautifulsoup-o3hxadit4

In [1]:
# Install requests and beautifulsoup4
# $ pip install requests
# $ pip install beautifulsoup4

Difference between requests.get() and urllib.request.urlopen() in Python:  
https://stackoverflow.com/questions/38114499/difference-between-requests-get-and-urrlib-request-urlopen-python  

##### urllib and urllib2 are both Python 2 modules for making URL requests, but they offer different functionality. In Python 3 they were merged into the urllib package (urllib.request, urllib.parse, urllib.error).  

1) urllib2 can accept a Request object to set the headers for a URL request, urllib accepts only a URL.  

2) urllib provides the urlencode method which is used for the generation of GET query strings, urllib2 doesn't have such a function. This is one of the reasons why urllib is often used along with urllib2.  

##### Requests - Requests is a simple, easy-to-use HTTP library written in Python.  

1) Requests encodes the parameters automatically, so you just pass them as simple arguments. With urllib you have to encode them yourself with urllib.urlencode() (urllib.parse.urlencode() in Python 3) before passing them.  

2) It automatically decodes the response into Unicode.  

3) Requests also has more convenient error handling. If your authentication fails, urllib2 raises a urllib2.URLError, while Requests returns a normal response object. To see whether the request was successful, check the boolean response.ok.  

Reference - https://stackoverflow.com/questions/2018026/what-are-the-differences-between-the-urllib-urllib2-urllib3-and-requests-modul
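
The points above refer to the Python 2 modules. As a rough sketch of the Python 3 standard-library equivalents, the snippet below builds a query string with urllib.parse.urlencode and attaches a header through a urllib.request.Request object. The query parameters and the User-Agent value are made up for illustration, and no request is actually sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical query parameters, used only for illustration
params = {"action": "query", "titles": "Web scraping", "format": "json"}

# urlencode() turns the dict into a percent-encoded query string
query = urlencode(params)
url = "https://en.wikipedia.org/w/api.php?" + query

# A Request object carries custom headers; nothing is sent until urlopen() is called
req = Request(url, headers={"User-Agent": "scraping-notes-example/0.1"})

print(query)
print(req.full_url)
print(req.get_header("User-agent"))  # Request stores header names capitalized
```

With Requests, the same parameters would simply be passed as `params=params` to `requests.get()`.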

#### Collecting web page data

Go to https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States#Presidents, right-click the table that lists the United States presidents, and choose Inspect to view its HTML.

      # The goal for this exercise is to find all the presidents of the United States in history.

In [2]:
# Import the installed modules
import requests
from bs4 import BeautifulSoup
from urllib import request 

In [3]:
# To get the data from the web page we will use requests API's get() method
url = "https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States"
page = requests.get(url)
#page = request.urlopen(url).read().decode('utf8')  # use urllib.request module

In [4]:
# page is a requests.models.Response object if we use the Requests module
# page is a str if we use urllib.request instead and decode the bytes with .read().decode()
type(page)

requests.models.Response

A list of the response status code:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Status

In [5]:
# It is always good to check the http response status code from Requests 
print(page.status_code)   # This should print 200
# print(page[:800])

200
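
Besides printing status_code, a response can be checked with the boolean response.ok or with raise_for_status(). The sketch below assumes a hypothetical helper named describe and uses hand-built Response objects so that it runs without network access; in real use the response would come from requests.get().

```python
import requests

def describe(response):
    """Return a short status summary for a requests.Response."""
    try:
        # raise_for_status() raises requests.HTTPError for 4xx and 5xx codes
        response.raise_for_status()
    except requests.HTTPError as err:
        return "failed: {}".format(err)
    return "ok: {}".format(response.status_code)

# Hand-built Response objects, for illustration only
good = requests.Response()
good.status_code = 200
bad = requests.Response()
bad.status_code = 404

print(good.ok, describe(good))
print(bad.ok, describe(bad))
```

Checking before parsing avoids feeding an error page to the HTML parser.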


In [6]:
# The raw bytes of the page are available through the content attribute of the Response object
print(page.content[:800])

b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of presidents of the United States - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRequestId":"XciHFgpAIDEAABQJKH4AAACM","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_presidents_of_the_United_States","wgTitle":"List of presidents of the Unite'


In [7]:
# create a bs4 object and use the prettify method from bs4
# This will print data in format like inspecting the web page.
soup = BeautifulSoup(page.content, 'html.parser') # the input of the BeautifulSoup should be string object or bytes
print(soup.prettify()[:1000])

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of presidents of the United States - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRequestId":"XciHFgpAIDEAABQJKH4AAACM","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_presidents_of_the_United_States","wgTitle":"List of presidents of the United States","wgCurRevisionId":925557458,"wgRevisionId":925557458,"wgArticleId":19908980,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgC

From inspecting the page, we know that our table is in a table tag with the class "wikitable".   
So, first we extract that table using the find method of the bs4 object.   
find returns the first matching bs4.element.Tag, or None if nothing matches.  

In [8]:
tb = soup.find('table', {'class':'wikitable'}) # find tag, table, and class (using dictionary), wikitable

In [9]:
# The tb should be a bs4.element.Tag object
type(tb)

bs4.element.Tag
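
A minimal sketch of how find behaves, using a small hand-written HTML snippet instead of the live page (the snippet and variable names are assumptions for illustration). It shows that searching for the class 'wikitable' also matches a table whose class attribute is 'wikitable sortable', and that find returns None when nothing matches.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup, not the real Wikipedia page
html = """
<table class="wikitable sortable">
  <tr><th>Name</th></tr>
  <tr><td><b><a href="/wiki/George_Washington" title="George Washington">George Washington</a></b></td></tr>
</table>
"""
sample_soup = BeautifulSoup(html, "html.parser")

# Matches because 'wikitable' is one of the classes of the table
table = sample_soup.find("table", {"class": "wikitable"})
# No table has this class, so find() returns None
missing = sample_soup.find("table", {"class": "no-such-class"})

print(type(table).__name__, table.name)
print(missing)
```

Testing the result for None before using it avoids an AttributeError when a page layout changes.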

This tag has many nested tags, but we only need the names, which are the text of the a tags nested inside b tags within the table.   
For that we find all b tags under the table tag and then find the a tag under each b tag.  
We use the find_all method and iterate over each b tag to get its a tag.

The HTML < b > tag is used to create a 'b' element, which represents bold text in an HTML document.  
   
The HTML < a > tag is used for creating an a element (also known as an "anchor" element). The a element represents a hyperlink. 

In [10]:
for link in tb.find_all('b'):
    name = link.find('a')
    print(name)  

<a href="/wiki/George_Washington" title="George Washington">George Washington</a>
<a href="/wiki/John_Adams" title="John Adams">John Adams</a>
<a href="/wiki/Thomas_Jefferson" title="Thomas Jefferson">Thomas Jefferson</a>
<a href="/wiki/James_Madison" title="James Madison">James Madison</a>
<a href="/wiki/James_Monroe" title="James Monroe">James Monroe</a>
<a href="/wiki/John_Quincy_Adams" title="John Quincy Adams">John Quincy Adams</a>
<a href="/wiki/Andrew_Jackson" title="Andrew Jackson">Andrew Jackson</a>
<a href="/wiki/Martin_Van_Buren" title="Martin Van Buren">Martin Van Buren</a>
<a href="/wiki/William_Henry_Harrison" title="William Henry Harrison">William Henry Harrison</a>
<a href="/wiki/John_Tyler" title="John Tyler">John Tyler</a>
<a href="/wiki/James_K._Polk" title="James K. Polk">James K. Polk</a>
<a href="/wiki/Zachary_Taylor" title="Zachary Taylor">Zachary Taylor</a>
<a href="/wiki/Millard_Fillmore" title="Millard Fillmore">Millard Fillmore</a>
<a href="/wiki/Franklin_Pie

Each name is the visible text of the a tag, which can be extracted with the get_text() method. (The title attribute itself can be read with name.get('title').)

In [11]:
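for link in tb.find_all('b'):
    name = link.find('a') #  find() returns the first item that matches the tag, or None
    if name is not None:  # skip b tags that do not contain a link
        print(name.get_text())  # the first argument of get_text() is a separator, not an attribute name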

George Washington
John Adams
Thomas Jefferson
James Madison
James Monroe
John Quincy Adams
Andrew Jackson
Martin Van Buren
William Henry Harrison
John Tyler
James K. Polk
Zachary Taylor
Millard Fillmore
Franklin Pierce
James Buchanan
Abraham Lincoln
Andrew Johnson
Ulysses S. Grant
Rutherford B. Hayes
James A. Garfield
Chester A. Arthur
Grover Cleveland
Benjamin Harrison
Grover Cleveland
William McKinley
Theodore Roosevelt
William H. Taft
Woodrow Wilson
Warren Harding
Calvin Coolidge
Herbert Hoover
Franklin Delano Roosevelt
Harry S. Truman
Dwight D. Eisenhower
John F. Kennedy
Lyndon B. Johnson
Richard Nixon
Gerald Ford
Jimmy Carter
Ronald Reagan
George H. W. Bush
Bill Clinton
George W. Bush
Barack Obama
Donald Trump
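
The difference between the link text and the title attribute can be sketched on a hand-written snippet (the markup below is made up for illustration). get_text() returns the visible text, get('title') returns the attribute value, and a b tag without a link is skipped.

```python
from bs4 import BeautifulSoup

# Made-up markup: one linked entry and one bold cell without a link
html = """
<table class="wikitable">
  <tr><td><b><a href="/wiki/Example" title="Example Title">Example text</a></b></td></tr>
  <tr><td><b>No link here</b></td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table", {"class": "wikitable"})

names = []
for bold in table.find_all("b"):
    link = bold.find("a")
    if link is None:
        # find() found no a tag inside this b tag
        continue
    names.append((link.get_text(), link.get("title")))

print(names)
```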


### Web Scraping Wikipedia Tables using BeautifulSoup and Python
https://medium.com/analytics-vidhya/web-scraping-wiki-tables-using-beautifulsoup-and-python-6b9ea26d8722

#### Collecting web page data  
Use the link from Wikipedia: https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area  
     # The goal is to scrape Wikipedia to find all the countries in Asia.

First, import the requests library.  
Requests allows you to send organic, grass-fed HTTP/1.1 requests, without the need for manual labor.

In [12]:
import requests

In [13]:
# Despite its name, website_url holds the page's HTML, not the URL
# requests.get(url).text sends a GET request and returns the response body as a string
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area').text

In [14]:
# .text is an attribute (not a method) of the Response object, so website_url is a str
type(website_url)

str

In [15]:
print(website_url[:300])

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>List of Asian countries by area - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgP


We begin by reading the source code of the web page and creating a BeautifulSoup object (soup) with the BeautifulSoup constructor.   
Beautiful Soup is a Python package for parsing HTML and XML documents.   
It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping.   
The prettify() method of the BeautifulSoup object lets us view how the tags are nested in the document.  

##### Differences between parsers
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers

In [16]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url,'lxml')
print(soup.prettify()[:400])

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of Asian countries by area - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_Asian_countries_by_area","wgTitle":"List of Asian countries by a
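
To see how the parser choice can matter, the sketch below feeds the same piece of invalid markup to each parser that happens to be installed and prints what each one builds. The markup string is a small illustrative example; parsers that are not installed are reported as None rather than causing an error.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately invalid markup: an unclosed a tag followed by a stray closing p tag
broken = "<a></p>"

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # This parser is not installed in the current environment
        results[parser] = None

for parser, markup in results.items():
    print(parser, "->", markup)
```

For well-formed pages the parsers usually agree, so the choice mostly affects speed and how broken HTML is repaired.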


If you inspect the HTML carefully, all the table contents, i.e. the names of the countries we intend to extract, are inside a table whose class attribute is ‘wikitable sortable’.  
So our first task is to find the table with class ‘wikitable sortable’ in the HTML.  

In [17]:
My_table = soup.find('table',{'class':'wikitable sortable'})

Under table class ‘wikitable sortable’ we have links with country name as title.

Now to extract all the links within < a >, we will use find_all().

In [18]:
links=My_table.find_all('a')
links

[<a href="/wiki/Russia" title="Russia">Russia</a>,
 <a href="#cite_note-russiaTotalAreaByCIA-1">[1]</a>,
 <a href="/wiki/China" title="China">China</a>,
 <a href="/wiki/Hong_Kong" title="Hong Kong">Hong Kong</a>,
 <a href="/wiki/Macau" title="Macau">Macau</a>,
 <a href="/wiki/India" title="India">India</a>,
 <a href="#cite_note-2">[2]</a>,
 <a href="/wiki/Kazakhstan" title="Kazakhstan">Kazakhstan</a>,
 <a href="/wiki/Saudi_Arabia" title="Saudi Arabia">Saudi Arabia</a>,
 <a href="/wiki/Iran" title="Iran">Iran</a>,
 <a href="/wiki/Mongolia" title="Mongolia">Mongolia</a>,
 <a href="/wiki/Indonesia" title="Indonesia">Indonesia</a>,
 <a href="/wiki/Pakistan" title="Pakistan">Pakistan</a>,
 <a href="/wiki/Turkey" title="Turkey">Turkey</a>,
 <a href="/wiki/Myanmar" title="Myanmar">Myanmar</a>,
 <a href="/wiki/Afghanistan" title="Afghanistan">Afghanistan</a>,
 <a href="/wiki/Yemen" title="Yemen">Yemen</a>,
 <a href="/wiki/Thailand" title="Thailand">Thailand</a>,
 <a href="/wiki/Turkmenistan" t

Some footnote links (a tags such as [1]) were also stored in the links variable.  
Use the in operator to keep only the < a > tags whose markup contains the string 'wiki'.

In [19]:
countries=[]                     # empty list to collect the country names in the for loop
for link in links:
    if 'wiki' in str(link):      # keep only the tags whose markup contains 'wiki'
        countries.append(link.get('title'))  # alternative: use get_text() method
countries

['Russia',
 'China',
 'Hong Kong',
 'Macau',
 'India',
 'Kazakhstan',
 'Saudi Arabia',
 'Iran',
 'Mongolia',
 'Indonesia',
 'Pakistan',
 'Turkey',
 'Myanmar',
 'Afghanistan',
 'Yemen',
 'Thailand',
 'Turkmenistan',
 'Uzbekistan',
 'Iraq',
 'Japan',
 'Vietnam',
 'Malaysia',
 'Oman',
 'Philippines',
 'Laos',
 'Kyrgyzstan',
 'Syria',
 'Golan Heights',
 'Cambodia',
 'Bangladesh',
 'Nepal',
 'Tajikistan',
 'North Korea',
 'South Korea',
 'Jordan',
 'Azerbaijan',
 'United Arab Emirates',
 'Georgia (country)',
 'Sri Lanka',
 'Egypt',
 'Bhutan',
 'Taiwan',
 'Armenia',
 'Kuwait',
 'East Timor',
 'Qatar',
 'Lebanon',
 'Israel',
 'State of Palestine',
 'Brunei',
 'Singapore',
 'Bahrain',
 'Maldives']
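
A possible alternative to the substring test is to filter on the href attribute itself, which is a slightly stricter check. This is a sketch on a hand-written snippet modeled on the links shown earlier, not the approach of the original tutorial.

```python
from bs4 import BeautifulSoup

# Sample markup modeled on the links printed earlier
html = """
<table class="wikitable sortable">
  <tr><td><a href="/wiki/Russia" title="Russia">Russia</a><a href="#cite_note-1">[1]</a></td></tr>
  <tr><td><a href="/wiki/China" title="China">China</a></td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table", {"class": "wikitable sortable"})

# href=True and title=True keep only a tags that have both attributes;
# the startswith() test drops footnote anchors such as "#cite_note-1"
sample_countries = [
    a["title"]
    for a in table.find_all("a", href=True, title=True)
    if a["href"].startswith("/wiki/")
]
print(sample_countries)
```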

Convert the countries list into a pandas DataFrame for further work in Python.

In [20]:
import pandas as pd
df=pd.DataFrame()
df['Country']=countries
df.head(5)

     Country
0     Russia
1      China
2  Hong Kong
3      Macau
4      India
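
The DataFrame can also be created in one step from a dict, and the column can be tidied afterwards. The sketch below uses a few of the scraped names and removes a trailing parenthesised note such as the one in 'Georgia (country)'; the variable names and the regular expression are assumptions for illustration.

```python
import pandas as pd

# A few of the names scraped above, including one with a disambiguation suffix
sample = ["Russia", "China", "Georgia (country)", "Sri Lanka"]

# Passing a dict creates the column in one step
sample_df = pd.DataFrame({"Country": sample})

# Drop a trailing parenthesised note such as " (country)"
sample_df["Country"] = sample_df["Country"].str.replace(r"\s*\(.*\)$", "", regex=True)

print(sample_df)
```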


### Web Scraping HTML Tables with Python
https://towardsdatascience.com/web-scraping-html-tables-with-python-c9baba21059

#### Collecting web page data  
    # The goal is to try scraping the online Pokemon Database (http://pokemondb.net/pokedex/all).

##### Inspect HTML
Before moving forward, we need to understand the structure of the website we wish to scrape.  
This can be done by right-clicking the element we wish to scrape and then choosing “Inspect”. 

##### Import Libraries
We will need requests for getting the HTML contents of the website and lxml.html for parsing the relevant fields. Finally, we will store the data in a pandas DataFrame.

In [21]:
import requests          # to get a response from a url
import lxml.html as lh   # lxml parser
import pandas as pd      # pandas dataframe

##### Scrape Table Cells
The code below allows us to get the Pokemon stats data of the HTML table.

In [22]:
#Assign the url to an url variable
url='https://pokemondb.net/pokedex/all'

#Create a handle, page, to handle the contents of the website
page = requests.get(url)

In [23]:
# It is always good to check the http response status code from Requests 
print(page.status_code)   # This should print 200
# print(page[:800])

200


In [24]:
print(page.content[:500])

b'<!DOCTYPE html>\n<html lang="en">\n<head>\n<meta charset="utf-8">\n<title>Pok\xc3\xa9mon Pok\xc3\xa9dex: list of Pok\xc3\xa9mon with stats | Pok\xc3\xa9mon Database</title>\n<link rel="preconnect" href="https://fonts.gstatic.com">\n<link rel="preconnect" href="https://img.pokemondb.net">\n<link rel="stylesheet" href="/static/css/pokemondb-e614e67e0f.css">\n<meta name="viewport" content="width=device-width, initial-scale=1">\n<meta property="og:description" name="description" content="The Pok\xc3\xa9dex contains detailed stats for eve'
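
The \xc3\xa9 sequences in the bytes output above are the UTF-8 encoding of the letter é. A minimal sketch of the difference between the bytes in page.content and a decoded str, using a short byte string copied from that output:

```python
# A fragment of the raw bytes shown above
raw = b"Pok\xc3\xa9mon Pok\xc3\xa9dex"

# Decoding turns the two-byte sequence \xc3\xa9 into the single character é
text = raw.decode("utf-8")

print(type(raw).__name__, len(raw))
print(type(text).__name__, len(text))
print(text)
```

Parsers such as lxml.html accept the bytes directly and handle the decoding themselves.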


In [25]:
# instead of using beautifulsoup, use fromstring() method from lxml.html module to create a html structure object
#Store the contents of the website under doc
doc = lh.fromstring(page.content)  # <lxml.html.HtmlElement> / <Element html at 0x121281688>

In [26]:
type(doc)

lxml.html.HtmlElement

####  < table > tag
The < table > tag defines an HTML table.  
An HTML table consists of the < table > element and one or more < tr >, < th >, and < td > elements.  
The < tr > element defines a table row,  
The < th > element defines a table header,  
and the < td > element defines a table cell.  

A more complex HTML table may also include < caption >, < col >, < colgroup >, < thead >, < tfoot >, and < tbody > elements.  

Example:   
< table>  
....< tr>  
........< th>Month< /th>  
........< th>Savings< /th>  
....< /tr>  
....< tr>  
........< td>January< /td>  
........< td>100< /td>  
....< /tr>  
....< tr>  
........< td>February< /td>  
........< td>80< /td>  
....< /tr>  
< /table>  
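
As a small sketch of how the tr, th and td elements map to rows and cells, the Month/Savings example above can be parsed with lxml.html. The variable names are chosen for illustration.

```python
import lxml.html as lh

# The Month/Savings example table as real HTML
html = """
<table>
  <tr><th>Month</th><th>Savings</th></tr>
  <tr><td>January</td><td>100</td></tr>
  <tr><td>February</td><td>80</td></tr>
</table>
"""
example = lh.fromstring(html)

# Every tr element is one row; its children are the th or td cells
rows = example.xpath("//tr")
header = [cell.text_content() for cell in rows[0]]
data = [[cell.text_content() for cell in row] for row in rows[1:]]

print(header)
print(data)
```

Note that every cell comes back as a string, so numeric columns still need to be converted.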

In [27]:
#Parse data that are stored between <tr>..</tr> of HTML by using xpath() from lxml module
# the xpath() method will return a list
tr_elements = doc.xpath('//tr')

In [28]:
# The new variable is a list of html structure objects
type(tr_elements)  # <list>

list

As a sanity check, ensure that all the rows have the same width.   
If not, we probably got something more than just the table.  

In [29]:
#Check the length of the first 12 rows
[len(T) for T in tr_elements[:12]]

[10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10]

The first 12 rows all have exactly 10 columns, which suggests that the data collected in tr_elements come from the table.
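
The check above only looks at the first 12 rows. One way to cover every row is to count how often each width occurs, sketched here on a hand-written page with two tables of different widths (the markup is an assumption for illustration).

```python
from collections import Counter
import lxml.html as lh

# Hand-written page: a 2-column table followed by an unrelated 3-column table
html = """
<html><body>
<table>
  <tr><th>Name</th><th>HP</th></tr>
  <tr><td>Bulbasaur</td><td>45</td></tr>
  <tr><td>Ivysaur</td><td>60</td></tr>
</table>
<table>
  <tr><td>a</td><td>b</td><td>c</td></tr>
</table>
</body></html>
"""
sample_doc = lh.fromstring(html)

# len() of a row element is its number of child cells
widths = Counter(len(tr) for tr in sample_doc.xpath("//tr"))
print(widths)
```

More than one distinct width means that //tr picked up rows from outside the table of interest.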

##### Parse Table Header
Next, let’s parse the first row as our header.

In [30]:
#Create empty list
col=[]
#For each cell in the header row, store its text together with an empty list
for i,t in enumerate(tr_elements[0]):
    name = t.text_content()
    print('{:2d} : {:s}'.format(i+1,name))
    col.append((name,[]))

 1 : #
 2 : Name
 3 : Type
 4 : Total
 5 : HP
 6 : Attack
 7 : Defense
 8 : Sp. Atk
 9 : Sp. Def
10 : Speed

In [31]:
col

[('#', []),
 ('Name', []),
 ('Type', []),
 ('Total', []),
 ('HP', []),
 ('Attack', []),
 ('Defense', []),
 ('Sp. Atk', []),
 ('Sp. Def', []),
 ('Speed', [])]

Now we have a list of tuples, each containing a column header and an empty list.  
We are going to iterate over each row and append the cells of the table to these lists.

##### Creating Pandas DataFrame  
Each tuple in col pairs a header with a list, and the loop fills those lists row by row.

In [32]:
# The first row is the header, so the data starts at the second row
for j in range(1, len(tr_elements)):
    # T is the j-th row
    T = tr_elements[j]

    # A row that does not have 10 cells is not part of our table, so stop here
    if len(T) != 10:
        break

    # Iterate over the cells of the row; i is the column index
    for i, t in enumerate(T.iterchildren()):
        data = t.text_content()
        # Leave the first column ('#') as text; for the other columns,
        # convert whole-number strings to int and keep anything else as a string
        if i > 0:
            try:
                data = int(data)
            except ValueError:
                pass
        # Append the value to the list of the i-th column
        col[i][1].append(data)
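
The try/except conversion can be isolated in a small helper, which makes it easy to try on a few values. The function name to_int_or_text is hypothetical and the sample cells are taken from the first table row.

```python
def to_int_or_text(text):
    """Return int(text) when text is a whole number, otherwise text unchanged."""
    try:
        return int(text)
    except ValueError:
        # int() rejects anything that is not a whole number, e.g. "Grass Poison"
        return text

cells = ["Bulbasaur", "Grass Poison", "318", "45"]
converted = [to_int_or_text(c) for c in cells]
print(converted)
```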

In [33]:
[len(cell) for (colname,cell) in col]

[926, 926, 926, 926, 926, 926, 926, 926, 926, 926]

Perfect! This shows that each of our 10 columns has exactly 926 values.  
Now we are ready to create the DataFrame:

In [34]:
Dict={colname : column for (colname , column) in col}

In [35]:
import pandas as pd 
df=pd.DataFrame(Dict)
df.head()

   #                    Name          Type  Total  HP  Attack  Defense  Sp. Atk  Sp. Def  Speed
0  1               Bulbasaur  Grass Poison    318  45      49       49       65       65     45
1  2                 Ivysaur  Grass Poison    405  60      62       63       80       80     60
2  3                Venusaur  Grass Poison    525  80      82       83      100      100     80
3  3  Venusaur Mega Venusaur  Grass Poison    625  80     100      123      122      120     80
4  4              Charmander          Fire    309  39      52       43       60       50     65
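
The step from the list of (header, values) tuples to a DataFrame can be sketched on two short columns with values from the rows above. Because the numbers were converted to int, pandas stores them in a numeric column and arithmetic works directly. The variable names are assumptions for illustration.

```python
import pandas as pd

# Two columns in the same (header, values) shape as col
sample_col = [("Name", ["Bulbasaur", "Ivysaur"]), ("HP", [45, 60])]

# dict() accepts a list of (key, value) pairs directly
sample_df = pd.DataFrame(dict(sample_col))

print(sample_df.dtypes)
print(sample_df["HP"].mean())
```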


Voilà! We have scraped the Pokédex table from the web :)