# Scraping an HTML page (loading and searching its contents)

# Local:  saved in a file on your computer
# Remote: somewhere on the web

To fully understand this notebook, open the example_html.html file in another tab, and open its source code in a third tab (or even better, in the browser's View > Developer tools). You will see the exact address of that file in a minute.

For scraping, we need a few different libraries, most notably BeautifulSoup. Let's first import these:

In [2]:
import os
from urllib.request import urlopen
from bs4 import BeautifulSoup

We can simply pass the address of a web page as a string to `urlopen` and open it. Afterwards, BeautifulSoup converts the result into a BeautifulSoup object, which has many useful functions and attributes:

In [23]:
# website address
#page = 'http://www.uebs.ed.ac.uk'

# open the url and store the website
#website = urlopen(page)

# for now we use a local file (os.getcwd() gets the Current Working Directory, aka. the folder you're in)
file_url = "file:///"+os.getcwd()+"/example_html.html"
website_source_code = urlopen(file_url)


# in another tab, open the example_html.html file directly in your browser to see what it looks like,
# then right-click and select 'view source', or open developer tools to see the source
print("Paste this url into your browser to see the demo website (copy the whole thing, together with the file:// part):")
print(file_url)

# convert the website's content; this needs a parser, in this case an HTML parser
soup = BeautifulSoup(website_source_code, 'html.parser')

Paste this url into your browser to see the demo website (copy the whole thing, together with the file:// part):
file:///C:\Users\s2112348\OneDrive - University of Edinburgh\Github Repos\web-and-social-network-analytics-notes\week1-web-scraping-and-analytics/example_html.html
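Building the file URL by string concatenation works here, but as the output above shows, on Windows it leaves backslashes and unescaped spaces in the URL. A minimal sketch of an alternative, assuming the same `example_html.html` sits in the current working directory, uses `pathlib` from the standard library:

```python
from pathlib import Path

# Path.cwd() is the current working directory, the same folder os.getcwd() returns
html_path = Path.cwd() / "example_html.html"

# as_uri() builds a file:// URL with forward slashes and percent-encoded spaces
file_url = html_path.as_uri()
print(file_url)
```

The resulting string can be passed to `urlopen` in the same way as the concatenated one.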


In [None]:
# here's the complete html of the page, but it's easier to read if you open its source using the url above
print(soup)

In [24]:

# .find_all retrieves all <h1> tags:
h1Tags = soup.find_all('h1')
for h1 in h1Tags:
    print('Complete tag code: ', h1)
    print("Just the text in the tag: ", h1.text)

Complete tag code:  <h1 title="A header">Example for Media and Web Analytics</h1>
Just the text in the tag:  Example for Media and Web Analytics
Complete tag code:  <h1 title="A header">Some other stuff</h1>
Just the text in the tag:  Some other stuff


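Passing a name to `.find_all()` matches tag names, not attributes. The `<h1>` tags above have a `title` attribute, but searching for `'title'` looks for `<title>` tags instead: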

In [25]:
titleTags = soup.find_all('title')
for title in titleTags:
    print('Complete tag code: ', title)
    print("Just the text in the tag: ", title.text)
    
# nothing is printed: there are no <title></title> tags on the page
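To reach an attribute rather than a tag, you can filter on the attribute and then read it from the tag like a dictionary entry. The sketch below uses a small made-up HTML string (modelled on the two `<h1>` tags printed earlier) instead of the demo file, so the name `demo_soup` and the string itself are assumptions for illustration only:

```python
from bs4 import BeautifulSoup

# a stand-in for the demo page, containing only the two headers
demo_html = (
    '<h1 title="A header">Example for Media and Web Analytics</h1>'
    '<h1 title="A header">Some other stuff</h1>'
)
demo_soup = BeautifulSoup(demo_html, "html.parser")

# title=True matches every tag that has a title attribute, whatever its value
tags_with_title = demo_soup.find_all(title=True)

for tag in tags_with_title:
    # tag["title"] reads the attribute value, tag.text reads the text inside the tag
    print("Attribute:", tag["title"], "| Text:", tag.text)
```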

## Understanding the HTML is all about finding the components you need:

### .find_all( ) will find all things that match the criteria, and return them in a list
### .find( ) will find just the first item that matches the criteria

You can use it on the whole website, like `a_table = soup.find("table")`, or on an element you found before, like `first_row = a_table.find("tr")`.

You can search by tag type, id, or class: `soup.find("h1")`, `soup.find(id="main_navigation")`, `soup.find(class_="warning_message")`. Note the trailing underscore in `class_`: `class` is a reserved word in Python, so BeautifulSoup uses `class_` as the keyword argument.

Fetching an element by its unique id is especially common:

In [26]:
middle_row = soup.find(id='middle_row')

print('Complete tag code: ', middle_row)
print("Just the text in the tag: ", middle_row.text)


Complete tag code:  <tr id="middle_row">
<td>400</td>
<td>500</td>
<td>600</td>
</tr>
Just the text in the tag:  
400
500
600
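The difference between `.find()` and `.find_all()`, and what happens when nothing matches, can be sketched on a small made-up page. The HTML string and the ids and classes in it are assumptions for illustration, not part of the demo file:

```python
from bs4 import BeautifulSoup

demo_html = """
<div id="main_navigation">Menu</div>
<p class="warning_message">First warning</p>
<p class="warning_message">Second warning</p>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

# .find() returns only the first match, .find_all() returns every match in a list
first_warning = demo_soup.find(class_="warning_message")
all_warnings = demo_soup.find_all(class_="warning_message")

# ids are meant to be unique, so .find() is the natural choice
navigation = demo_soup.find(id="main_navigation")

# when nothing matches, .find() returns None rather than raising an error
missing = demo_soup.find(id="does_not_exist")

print(first_warning.text)
print(len(all_warnings))
print(navigation.text)
print(missing)
```

Because a failed `.find()` gives back `None`, calling `.text` on the result raises an `AttributeError`, so it is worth checking for `None` before using an element from a page you do not control.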



## Find children:

When, as above, a tag contains children (tags inside it), you can extract them into a list.
For example, the table row `<tr></tr>` above contains three table data cells `<td></td>`.

`.findChildren()` will give you a list with all the tags inside a given tag, including tags nested more deeply.

You can specify exactly which children you want, just as with `.find()`. So you could use `.findChildren("tr")` or `.findChildren(class_="warning_message")`

In [27]:
middle_row = soup.find(id='middle_row')
cells_in_the_row = middle_row.findChildren()
for cell in cells_in_the_row:
    print('Complete tag code: ', cell, "Just the text in the tag: ", cell.text)


Complete tag code:  <td>400</td> Just the text in the tag:  400
Complete tag code:  <td>500</td> Just the text in the tag:  500
Complete tag code:  <td>600</td> Just the text in the tag:  600
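In the row above every cell is a direct child, so the distinction does not show, but `.findChildren()` is an older alias of `.find_all()` and therefore also returns tags nested more deeply. A small sketch on a made-up row (the HTML string is an assumption for illustration) shows how `recursive=False` restricts the search to direct children:

```python
from bs4 import BeautifulSoup

# the first cell contains a nested <b> tag
demo_html = '<table><tr id="row"><td><b>400</b></td><td>500</td></tr></table>'
demo_soup = BeautifulSoup(demo_html, "html.parser")
row = demo_soup.find(id="row")

# with no arguments, find_all() returns every tag inside the row
all_descendants = row.find_all()

# recursive=False keeps only the tags directly inside the row
direct_children = row.find_all(recursive=False)

print([tag.name for tag in all_descendants])   # ['td', 'b', 'td']
print([tag.name for tag in direct_children])   # ['td', 'td']
```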


You can also narrow a search to certain tags. For example, here we look for all `div` tags that have the (CSS) class `hipster`:

In [30]:
class_elements = soup.find_all("div", {"class" : "hipster" })
for element in class_elements:
    #print('whole tag:\n', str(element), '\n')
    print('Just the text: ', element.text)

Just the text:  
A Dangerous-Looking Header

I look like a paragraph Kylo Ren could have written.


Just the text:  
Another Dangerous-Looking Header

This one is not as scary.
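The output above contains many blank lines because `.text` keeps the whitespace between the nested tags. A minimal sketch, using a made-up HTML string rather than the demo file, shows the equivalent `class_` keyword form of the search and how `.get_text()` can tidy the result:

```python
from bs4 import BeautifulSoup

demo_html = """
<div class="hipster">
  <h2>A Dangerous-Looking Header</h2>
  <p>I look like a paragraph.</p>
</div>
<div class="plain"><p>Not selected.</p></div>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

# these two calls are equivalent ways of filtering by CSS class
by_dict = demo_soup.find_all("div", {"class": "hipster"})
by_keyword = demo_soup.find_all("div", class_="hipster")

for element in by_keyword:
    # strip=True trims each piece of text and drops whitespace-only pieces;
    # the first argument is the separator placed between the pieces
    clean_text = element.get_text(" ", strip=True)
    print(clean_text)
```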




Getting all the elements out of the table:

In [31]:
# list all tables, since we only have 1, use the first in the list at index 0
my_table = soup.find_all('table')[0]
# or just use: my_table = soup.find('table')

# loop the rows and keep the row number
row_num = 0
for row in my_table.find_all('tr'):
    print("Row: "+str(row_num))
    row_num = row_num+1

    #loop the cells in the row
    for cell in row.find_all('td'):
        print("whole html:", str(cell)+" \tJust content: "+cell.text)
        
# if you'd like, try to change this code to use .findChildren( ) rather than .find_all('tr')

Row: 0
whole html: <td>100</td> 	Just content: 100
whole html: <td>200</td> 	Just content: 200
whole html: <td>300</td> 	Just content: 300
Row: 1
whole html: <td>400</td> 	Just content: 400
whole html: <td>500</td> 	Just content: 500
whole html: <td>600</td> 	Just content: 600
Row: 2
whole html: <td>700</td> 	Just content: 700
whole html: <td>800</td> 	Just content: 800
whole html: <td>900</td> 	Just content: 900
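Printing is fine for exploring, but for analysis you usually want the values in a data structure. The sketch below, on a made-up two-row table (an assumption standing in for the demo table), collects the cells into a list of lists and uses `enumerate` to number the rows instead of a manual counter:

```python
from bs4 import BeautifulSoup

demo_html = """
<table>
  <tr><td>100</td><td>200</td></tr>
  <tr><td>400</td><td>500</td></tr>
</table>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")
my_table = demo_soup.find("table")

# one inner list per row, with each cell's text converted to an integer
table_values = [
    [int(cell.text) for cell in row.find_all("td")]
    for row in my_table.find_all("tr")
]

# enumerate yields the row number together with the row
for row_num, row_values in enumerate(table_values):
    print("Row:", row_num, row_values)

print(table_values)  # [[100, 200], [400, 500]]
```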


### Minitask: Now attempt to scrape something from a real online website:

Use the above code to make a list of all the degrees available at the Business School of the University of Edinburgh.

1. You will need to get the source of the page the list is on and feed it into BeautifulSoup (see the code above). Instead of our demo file:// URL, use this one: https://www.ed.ac.uk/studying/undergraduate/degrees/index.php?action=view&code=12
2. Get the HTML component that holds all the degrees. Use developer tools to identify what type of component it is (hint: ul stands for "unordered list"). Does this component have a class or an id? How would you get a component when you know its id? (hint: proxy_degreeList)
3. What type of tag are the actual degree names in (div, a, p, or something else)? Hint: which tag surrounds the name of the course?
4. Grab the children of that type from the component that holds all the names and, in a loop, extract only the text of each of them and print it.


I am posting the solution lower down, but do try to solve it by yourself first!

In [34]:
# copy-paste relevant parts of the code from above to start:
import os
from urllib.request import urlopen
from bs4 import BeautifulSoup

# website address
page = 'https://www.ed.ac.uk/studying/undergraduate/degrees/index.php?action=view&code=12'

# open the url and store the website
website = urlopen(page)

# convert the website's content; for this a parser is needed
soup = BeautifulSoup(website, 'html.parser')
print(soup)

<!DOCTYPE html>

<html dir="ltr" lang="en" prefix="og: http://ogp.me/ns#">
<head>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<link href="http://www.w3.org/1999/xhtml/vocab" rel="profile"/>
<meta charset="utf-8">
<meta content="https://www.ed.ac.uk/sites/all/themes/uoe/assets/uoe-logo-centred-black.png" name="twitter:image">
<meta content="Business,degree,programme,study,business,studies,international,economics,accounting,management,planning,finance,marketing,strategy,enterprise,innovation,human,resources,entrepreneurship,accountancy,careers,foundation,prospectus,2021" name="keywords"/>
<meta content="Find out more about studying Business at Edinburgh, including what you can study, how you will be taught and career opportunities." name="description"/>
<meta content="2020-11-09" http-equiv="last-modified"/>
<meta content="https://www.ed.ac.uk/sites/all/themes/uoe/assets/uoe-logo-centred-black.png" propert
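The cell above calls `urlopen` directly, which raises an exception if the page cannot be reached. A minimal sketch of a more defensive download is shown below. The function name `fetch_html`, the timeout value, and the User-Agent string are all assumptions for illustration, not something the course page requires:

```python
from urllib.error import URLError
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10):
    """Return the raw bytes of a page, or None if it could not be fetched."""
    # hypothetical User-Agent string that identifies the script to the server
    request = Request(url, headers={"User-Agent": "web-analytics-course-notebook"})
    try:
        # the with-block closes the connection once the content has been read
        with urlopen(request, timeout=timeout) as response:
            return response.read()
    except URLError as error:
        # HTTPError is a subclass of URLError, so HTTP error codes end up here too
        print("Could not fetch", url, "-", error)
        return None
```

The returned bytes can be passed to `BeautifulSoup(..., 'html.parser')` just like the object returned by `urlopen`, after checking that the result is not `None`.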

In [43]:
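degree_list = soup.find(id="proxy_degreeList")
# with no arguments, findChildren() returns every tag nested inside the list
degree_list_c = degree_list.findChildren()

# collect the text of every nested tag
all_texts = []
for degree in degree_list_c:
    all_texts.append(degree.text)

# on this page the degree name is in every third entry, starting at index 1
degree_names = []
for i in range(1, len(all_texts), 3):
    degree_names.append(all_texts[i])

degree_names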

['Business and Economics (MA) NL11',
 'Business and Geography (MA) NL17',
 'Business and Law (MA) NM11',
 'Business Management (MA) N100',
 'Business with Decision Analytics (MA) NN12',
 'Business with Enterprise and Innovation (MA) N1N2',
 'Business with Human Resource Management (MA) N1N6',
 'Business with Marketing (MA) N1N5',
 'Business with Strategic Economics (MA) N1L1',
 'Finance and Business (MA) NN13',
 'International Business (MA) N120',
 'International Business with Arabic (MA) N1T6',
 'International Business with Chinese (MA) N1T1',
 'International Business with French (MA) N1R1',
 'International Business with German (MA) N1R2',
 'International Business with Italian (MA) N1R3',
 'International Business with Japanese (MA) N1T2',
 'International Business with Russian (MA) N1R7',
 'International Business with Spanish (MA) N1R4']

In [48]:
degree_list = soup.find(id="proxy_degreeList")
degree_list_c = degree_list.findChildren("a")

#l = []
#for degree in degree_list_c:
#    l.append(degree.text)    
#l

for degree in degree_list_c:
    print(degree.text)  

Business and Economics (MA) NL11
Business and Geography (MA) NL17
Business and Law (MA) NM11
Business Management (MA) N100
Business with Decision Analytics (MA) NN12
Business with Enterprise and Innovation (MA) N1N2
Business with Human Resource Management (MA) N1N6
Business with Marketing (MA) N1N5
Business with Strategic Economics (MA) N1L1
Finance and Business (MA) NN13
International Business (MA) N120
International Business with Arabic (MA) N1T6
International Business with Chinese (MA) N1T1
International Business with French (MA) N1R1
International Business with German (MA) N1R2
International Business with Italian (MA) N1R3
International Business with Japanese (MA) N1T2
International Business with Russian (MA) N1R7
International Business with Spanish (MA) N1R4
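Each line of the output ends with a short code after the degree name. If you want the two parts separately, a plain string split is enough. The sketch below hard-codes two of the lines printed above so that it runs without the network; the variable names are assumptions for illustration:

```python
# two of the degree strings printed above
degree_texts = [
    "Business and Economics (MA) NL11",
    "International Business (MA) N120",
]

parsed_degrees = []
for text in degree_texts:
    # rsplit with maxsplit=1 splits once, at the last space in the string
    name, code = text.rsplit(" ", 1)
    parsed_degrees.append({"name": name, "code": code})

for degree in parsed_degrees:
    print(degree["code"], "->", degree["name"])
```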


Only uncover the solutions once you have tried to complete the task:
    
    
<details><summary style='color:blue'>CLICK HERE TO SEE HINT 1.</summary>

1. You will need to get the source of the page the list is on and feed it into BeautifulSoup (see the code above). Instead of our demo file:// URL, use this one: https://www.ed.ac.uk/studying/undergraduate/degrees/index.php?action=view&code=12

```
file_url = "https://www.ed.ac.uk/studying/undergraduate/degrees/index.php?action=view&code=12"
website_source_code = urlopen(file_url)
soup_degrees_website = BeautifulSoup(website_source_code, 'html.parser')
```
</details>

<details><summary style='color:blue'>CLICK HERE TO SEE HINT 2.</summary>

2. Get the HTML component that holds all the degrees. Use developer tools to identify what type of component it is (hint: ul stands for "unordered list"). Does this component have a class or an id? How would you get a component when you know its id? (hint: proxy_degreeList)

```
degrees = soup_degrees_website.find(id='proxy_degreeList')
```
</details>

<details><summary style='color:blue'>CLICK HERE TO SEE HINT 3.</summary>

3. What type of tag are the actual degree names in (div, a, p, or something else)? Hint: which tag surrounds the name of the course?

```
for list_item in degrees.findChildren("a"):
```
</details>



<details><summary style='color:blue'>CLICK HERE TO SEE HINT 4.</summary>

4. Grab the children of that type from the component that holds all the names and, in a loop, extract only the text of each of them and print it.

```
    print("Degree Name:", list_item.text)
```
</details>
