# Scraping data from a web page - example with explanations
Requires beautifulsoup4 and requests (both are imported below)

- pip install beautifulsoup4 requests

or

- sudo pip install beautifulsoup4 requests

See the [BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-the-tree) for more details on what is shown below.

Here is a summary of navigation with BeautifulSoup:
![alt text](summaryOfBS4CommandsForNavigation.png "navigation summary")

A version of this code without explanations is available [here](WebScraping_SinglePage-NoInstructions.ipynb)

In [None]:
from bs4 import BeautifulSoup
import requests

## Getting all the HTML code

In [None]:
url = 'https://en.wikipedia.org/wiki/Reince_Priebus'
r  = requests.get(url)
data = r.text
# name the parser explicitly; without it bs4 guesses one and emits a warning
soup = BeautifulSoup(data, 'html.parser')
soup

### Example: get all the links on the page

In [None]:
for link in soup.find_all('a'):
    print(link.get('href'))
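
Some `a` tags have no `href` (the loop above prints `None` for those), and many links on a page are relative. A minimal sketch of one way to handle both, run on a made-up fragment rather than the live page; the names `html`, `base_url`, `demo_soup` and `absolute_links` are illustrative assumptions:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for a downloaded page
html = """
<p>
  <a href="/wiki/Wisconsin">relative link</a>
  <a name="top">anchor without href</a>
  <a href="https://example.org/page">absolute link</a>
</p>
"""
base_url = "https://en.wikipedia.org/wiki/Reince_Priebus"

demo_soup = BeautifulSoup(html, "html.parser")

# href=True keeps only the <a> tags that carry an href attribute;
# urljoin resolves relative paths against the page URL
absolute_links = [
    urljoin(base_url, a["href"]) for a in demo_soup.find_all("a", href=True)
]
print(absolute_links)
```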

## Searching by CSS class
Inspecting the Wikipedia page https://en.wikipedia.org/wiki/Reince_Priebus shows that the date of birth and the name are inside HTML elements (a span and a div) that have class attributes. We retrieve these two pieces of information first.
![alt text](webScrapingImg1.png "html structure for birthday and name")

In [None]:
# look for a span tag with a class named bday
bday=soup.find("span", class_="bday").get_text()
bday

In [None]:
# one can also get the list of all matches
#bdayTag=soup.find_all("span", class_="bday")

In [None]:
nickname=soup.find("div", class_="nickname").get_text()
nickname
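
`find` returns `None` when nothing matches, so calling `get_text()` directly raises an `AttributeError` if the page layout changes. A small sketch of a guarded lookup; the helper `text_or_none` and the sample date are hypothetical and not part of the original notebook:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: it has a bday span but no nickname div
html = '<div><span class="bday">2000-01-31</span></div>'
demo_soup = BeautifulSoup(html, "html.parser")

def text_or_none(soup, name, css_class):
    """Return the stripped text of the first matching tag, or None if absent."""
    tag = soup.find(name, class_=css_class)
    return tag.get_text(strip=True) if tag is not None else None

found = text_or_none(demo_soup, "span", "bday")
missing = text_or_none(demo_soup, "div", "nickname")
print(found, missing)
```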

## Navigating HTML
The information about education has no class attribute, and it is stored in an HTML table.
![alt text](webScrapingImg2.png "html structure for education")
Because there are many tables in the page and they are not named, we do the following:
- start by looking for a th tag with text Education (using find)
- retrieve the row that contains the th tag (using parent)
- because the information about the education is stored as links we then retrieve all the 'a' tags (using find_all)
- for each one of the links we retrieve the text (using string)

The cells marked "Explanation Only" are not required; they are only there to show you what is happening.
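
The four steps listed above can also be tried offline. The sketch below runs them on a made-up table; the universities and degrees are placeholders, not data from the Wikipedia page, and the variable names are illustrative:

```python
from bs4 import BeautifulSoup

# Hypothetical infobox-like table with two rows
html = """
<table>
  <tr><th>Born</th><td>Somewhere</td></tr>
  <tr><th>Education</th><td><a href="/u1">University A</a> (<a href="/d1">BA</a>)<br><a href="/u2">University B</a> (<a href="/d2">JD</a>)</td></tr>
</table>
"""
demo_soup = BeautifulSoup(html, "html.parser")

heading = demo_soup.find("th", string="Education")   # step 1: find the th
row = heading.parent                                 # step 2: go up to the tr
links = row.find_all("a")                            # step 3: collect the links
link_texts = [a.string for a in links]               # step 4: text of each link
print(link_texts)
```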

In [None]:
# find the th tag that has a text 'Education'
edu_heading = soup.find("th", string="Education")
edu_heading

In [None]:
# One can also find multiple tags that match the requirement
#edu_table = soup.find_all("th", string="Education")

In [None]:
# go one level up to the table row level
edu_row = edu_heading.parent
edu_row

In [None]:
# get a list containing all links
edu_list_with_tags = edu_row.find_all('a')
edu_list_with_tags

In [None]:
# Explanation Only
# this shows how one can extract the text of the links
# you also see that, in order to get each degree, we need two consecutive elements (university and degree name)
for element in edu_list_with_tags:
    print(element.string)

In [None]:
# Explanation Only
# this shows that edu_list_with_tags is a bs4 ResultSet (a subclass of Python's list)
type(edu_list_with_tags)

In [None]:
# a ResultSet can already be indexed and looped over, so this conversion is optional;
# we make a plain list to keep the next steps explicit
# to confirm that all is fine we print the first element of the list
tagsList = list(edu_list_with_tags)
tagsList[0]
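
As a side note, the object returned by `find_all` is a `ResultSet`, which in current bs4 releases subclasses Python's `list`, so indexing and slicing work on it directly. A quick check on a made-up fragment (the names here are illustrative):

```python
from bs4 import BeautifulSoup
from bs4.element import ResultSet, Tag

demo_soup = BeautifulSoup("<p><a>x</a><a>y</a></p>", "html.parser")
demo_links = demo_soup.find_all("a")

# ResultSet is a list subclass; each item in it is a Tag
is_result_set = isinstance(demo_links, ResultSet)
is_list = isinstance(demo_links, list)
first_is_tag = isinstance(demo_links[0], Tag)
last_text = demo_links[-1].string
print(is_result_set, is_list, first_is_tag, last_text)
```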

In [None]:
# Explanation Only
# this shows that each element in our list is a bs4 Tag
# therefore we can apply bs4 methods to it
type(tagsList[0])

In [None]:
# Explanation Only
# this shows how to extract the text from a single element
tagsList[0].string

Now we write the code that builds a list of degrees for the person in question starting from the tagsList
- first we create a list where we will store all the degrees in the form \[university, degree\]
- then we traverse the tagsList 2 by 2 (see the range from 0 to the end of the list with step 2)
- and we save the text of the link in a pair \[university, degree\]

In [None]:
edu_list=[]
for x in range(0,len(tagsList)-1, 2):
    edu_list.append([tagsList[x].string, tagsList[x+1].string])
edu_list
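
The same pairing can be written without index arithmetic by zipping the even-indexed items with the odd-indexed ones. A sketch on placeholder strings standing in for the link texts; `texts` and `pairs` are hypothetical names:

```python
# Placeholder values standing in for [tag.string for tag in tagsList]
texts = ["University A", "BA", "University B", "JD"]

# texts[0::2] holds the universities, texts[1::2] the degrees;
# zip stops at the shorter slice, so a trailing unpaired item is dropped
pairs = [list(pair) for pair in zip(texts[0::2], texts[1::2])]
print(pairs)
```

Like the `range` loop with `len(tagsList)-1`, this ignores a final element that has no partner.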

Information about children is also stored in a table.
![alt text](webScrapingImg3.png "html structure for numb of children")

In [None]:
# find the th tag that has a text 'Children'
children_heading = soup.find("th", string="Children")
children_heading

In [None]:
# Explanation Only
# next_sibling gives us the next element at the same level of the HTML hierarchy
children_heading.next_sibling

In [None]:
# we retrieve the number of children by getting the text of the next sibling
numbChildren = children_heading.next_sibling.string
numbChildren
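
`next_sibling` returns whatever node comes next, which can be a whitespace text node if the markup has a line break between `th` and `td`. A sketch of a more tolerant variant using `find_next_sibling`, on a made-up fragment (the value `2` is a placeholder, not taken from the page):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with whitespace between the th and td tags
html = """
<table>
  <tr>
    <th>Children</th>
    <td>2</td>
  </tr>
</table>
"""
demo_soup = BeautifulSoup(html, "html.parser")
heading = demo_soup.find("th", string="Children")

# here next_sibling is the whitespace between </th> and <td>, not the cell
raw_sibling = heading.next_sibling
print(repr(raw_sibling))

# find_next_sibling skips text nodes and returns the next matching tag
cell = heading.find_next_sibling("td")
count = cell.get_text(strip=True)
print(count)
```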