# Retrieving data from the web, Part 1

![image.png](attachment:image.png)

## 5.01 Introduction to retrieving data from the web

Keywords: `Highlighted`

* `HTTP servers`, also known as web servers; 
* `TCP/IP`: Transmission Control Protocol (TCP) and Internet Protocol (IP). TCP identifies a service by its port number (e.g. TCP port 80 for HTTP, 222xx), while IP identifies the machine by its IP address;
* `Client-server network`: the medium through which clients access resources and services from a central computer, via either a local area network (LAN) or a wide area network (WAN), such as the Internet;
* When a client requests a web page from a web server, it normally receives back data consisting of `HTML code, JavaScript, JSON, etc.`;
* JSON is easier to work with as it contains only the data. Unlike markup languages, it does not describe how the data `should be displayed`;
* Websites are accessed via `HTTP (protocol)`. This is different from HTML, which defines the format that the data is in;
* API: a common way of organizing the set of data services on a website is called a RESTful API.

### `More about RESTful APIs`
What is REST architecture?
REST stands for `REpresentational State Transfer`. It was introduced by Roy Fielding in 2000. REST is an architecture based on the HTTP protocol and web standards. In REST, every component is a resource, and each resource is accessed through a common interface using standard HTTP methods.

A REST server provides access to resources; a REST client accesses them and `modifies` the request `parameters` to get `different elements` of a resource (e.g., monthly or daily data). Each resource is identified by a URI or global ID. REST can represent a resource in various formats, such as text, JSON, CSV or XML. JSON is currently the most popular. JSON is often preferred over CSV because it can represent `hierarchical` data.
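
To see why hierarchy matters, here is a small sketch with made-up data (the `station` and `readings` field names are hypothetical, not taken from any real API). Nested JSON maps directly onto Python dictionaries and lists, whereas flattening the same data for CSV repeats the parent field on every row.

```python
import csv
import io
import json

# Hypothetical API response: one station with nested daily readings
payload = '{"station": "A1", "readings": [{"day": "Mon", "temp": 12.5}, {"day": "Tue", "temp": 14.0}]}'

data = json.loads(payload)          # nested dicts and lists
temps = [r["temp"] for r in data["readings"]]

# Flattening for CSV: the parent field is repeated on every row
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["station", "day", "temp"])
for r in data["readings"]:
    writer.writerow([data["station"], r["day"], r["temp"]])
csv_text = buffer.getvalue()

print(temps)
print(csv_text)
```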

There are at least four HTTP methods:
1. GET − Provides read-only access to a resource.
2. POST − Used to create a new resource.
3. DELETE − Used to remove a resource.
4. PUT − Used to update an existing resource or create a new resource.
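
The four methods above can be sketched with the standard library's `urllib.request.Request`, which builds a request object without sending anything. The endpoint `https://api.example.com/staff/1` and the JSON bodies are assumptions made up for illustration.

```python
from urllib.request import Request

# Hypothetical endpoint, used only to build requests; nothing is sent
URL = "https://api.example.com/staff/1"

def build(method, body=None):
    """Build (but do not send) a request that uses the given HTTP method."""
    return Request(URL, data=body, method=method)

requests_by_method = {
    "GET": build("GET"),                           # read a resource
    "POST": build("POST", b'{"name": "Sean"}'),    # create a resource
    "PUT": build("PUT", b'{"name": "Sean"}'),      # update or create
    "DELETE": build("DELETE"),                     # remove a resource
}

for name, req in requests_by_method.items():
    print(name, req.get_method(), req.full_url, req.data)
```

Only POST and PUT carry a body here; GET and DELETE identify the resource through the URL alone.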

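Beware of attacks and eavesdropping: the parameters of a GET request are part of the URL. Over plain HTTP, any router on the path can read the full URL. Over HTTPS the path and query string are encrypted in transit, but the URL can still end up in server logs, proxy logs and browser history, so do not put secrets in it.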
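
A short sketch of why GET parameters deserve care: they are encoded into the URL itself, so anything that records the URL can recover them. The endpoint and the `user` and `token` parameters below are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint and parameters
base = "https://api.example.com/search"
params = {"user": "sean", "token": "secret123"}

# For a GET request the parameters become part of the URL itself
url = base + "?" + urlencode(params)
print(url)

# Anything that records the URL (for example a server access log)
# can recover the parameters from it
recovered = parse_qs(urlsplit(url).query)
print(recovered)
```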


References:
**If you want to know more about REST APIs**
* https://programminghistorian.org/en/lessons/creating-apis-with-python-and-flask
* https://towardsdatascience.com/the-right-way-to-build-an-api-with-python-cd08ab285f8f

## 5.02 Handling data on the web

Have a look at the requests library. 

Requests: HTTP for Humans  https://requests.readthedocs.io/en/master/
        
The quickstart guide  https://docs.python-requests.org/en/master/user/quickstart/ should be enough to get you started working with this incredible library!

There are lots of clever things you can do with this library, including handling all manner of content. See if you can create a simple request and get a response code back. What do you think that first number is all about?
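
As a hint for the question about the first number, the sketch below groups status codes by their leading digit using the standard library's `http.HTTPStatus`. The `CLASSES` table and the `describe` helper are names made up for this illustration, and no request is sent.

```python
from http import HTTPStatus

# The first digit of a status code gives its class
CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}

def describe(code):
    """Return the class and standard phrase for an HTTP status code."""
    return CLASSES[code // 100], HTTPStatus(code).phrase

for code in (200, 301, 404, 500):
    print(code, describe(code))
```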

If any of the links are broken, let us know via the Student Portal.  https://my.london.ac.uk/

## 5.03 Introduction to HTML and the DOM

![image.png](attachment:image.png)

Keywords: HTML, H3, table, text, requesting a page.
* DOM (Document Object Model):
* `Document`
   * `<html>`
      * `<head>`
          * `<title>`
      * `<body>`
* HTML header, e.g. `<meta charset="utf-8">`
* Video:

```html
<video controls poster="/images/w3html5.gif">
  <source src="movie.mp4" type="video/mp4">
  <source src="movie.ogg" type="video/ogg">
  Your browser does not support the video tag.
</video>
```

* CSS, e.g. an inline style sheet
* JavaScript, e.g. an `onclick` event
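
The nesting in the DOM outline above can be made visible with a small sketch based on the standard library's `html.parser`. The `TreePrinter` class, the `VOID` set and the sample page are assumptions written for this illustration; it prints one indented line per element.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they do not add a level
VOID = {"meta", "link", "img", "br", "hr", "input", "source"}

class TreePrinter(HTMLParser):
    """Collect one indented line per element to show the nesting."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        if tag not in VOID:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID:
            self.depth -= 1

page = ("<html><head><meta charset='utf-8'><title>Demo</title></head>"
        "<body><h3>Heading</h3><p>Text</p></body></html>")

printer = TreePrinter()
printer.feed(page)
print("\n".join(printer.lines))
```

In the printed outline, `head` and `body` sit at the same depth under `html`, while `title` is nested inside `head`.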

Other HTML Tags: See https://www.w3schools.com/tags/tag_img.asp

Go to http://doc.gold.ac.uk/~smcgr004/test.html and view the page source.
* It reveals that the code follows the DOM, as follows: <br>
`<tr>` is the row; <br> `<th>` is the header cell; <br>
`<td>` is the data cell; <br>
`<link rel="stylesheet">` links the CSS. <br>
For CSS, please see https://www.w3schools.com/css/


```html

<!DOCTYPE html>
<html>
<head>
<link rel="stylesheet" type="text/css" href="mstyle.css">
</head>
<body>

<h2>Staff Database</h2>
<p>This is our webpage of staff data. The records below are members of staff.</p>
<table>
  <tr>
    <th>Name</th>
    <th>Surname</th>
    <th>Age</th>
  </tr>
  <tr>
    <td>Sean</td>
    <td>McGrath</td>
    <td>31</td>
  </tr>
  <tr>
    <td>Santiago</td>
    <td>Mateo</td>
    <td>39</td>
  </tr>
  <tr>
    <td>Jerry</td>
    <td></td>
    <td>22</td>
  </tr>
  <tr>
    <td>Gabriela</td>
    <td>Ruiz</td>
    <td>24</td>
  </tr>
  <tr>
    <td>Niamh</td>
    <td>Mulligan</td>
    <td>19</td>
  </tr>
</table>

</body>
</html>


```




## Practice Quiz - Welcome to the World Wide Web!
Practice Quiz • 10 min


**Question 1**
What does HTML stand for?
* `Hypertext Markup Language`
* Higher Text Moodle Language


**Question 2**
What does DOM stand for?
* `Document Object Model`
* Dynamic Object Manifestation


**Question 3**
What is the DOM?
* `An abstract model for handling nodes/elements in a tree-like hierarchy`
* A model for handling data requests

**Question 4**
The root node is `the highest node` in the tree structure, and has no parent. An example of a root node could be
* `HTML`
* `Document`
* `XML`
* P tags
```
<html>
<xml>
In JavaScript: document.documentElement (the root element) and document.documentElement.nodeName (its tag name)
```
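
The same idea can be checked in Python. This is a minimal sketch using the standard library's `xml.etree.ElementTree` on a small, well-formed sample document (made up for illustration): the root element is the one node that never appears as anyone's child.

```python
import xml.etree.ElementTree as ET

# A small, well-formed document (ElementTree needs valid XML)
doc = "<html><head><title>Demo</title></head><body><p>Hi</p></body></html>"

root = ET.fromstring(doc)

# Map every child element to its parent; the root never appears as a key
parent_of = {child: parent for parent in root.iter() for child in parent}

print(root.tag)
print(root in parent_of)
print([child.tag for child in root])
```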

## 5.04 HTTP and transferring data via the web


![image.png](attachment:image.png)

In [1]:
import requests

In [2]:
rslt=requests.get('http://doc.gold.ac.uk/~smcgr004/test.html')

In [3]:
rslt.status_code

200

In [4]:
rslt.text

'<!DOCTYPE html>\n<html>\n<head>\n<link rel="stylesheet" type="text/css" href="mstyle.css">\n</head>\n<body>\n\n<h2>Staff Database</h2>\n<p>This is our webpage of staff data. The records below are members of staff.</p>\n<table>\n  <tr>\n    <th>Name</th>\n    <th>Surname</th>\n    <th>Age</th>\n  </tr>\n  <tr>\n    <td>Sean</td>\n    <td>McGrath</td>\n    <td>31</td>\n  </tr>\n  <tr>\n    <td>Santiago</td>\n    <td>Mateo</td>\n    <td>39</td>\n  </tr>\n  <tr>\n    <td>Jerry</td>\n    <td></td>\n    <td>22</td>\n  </tr>\n  <tr>\n    <td>Gabriela</td>\n    <td>Ruiz</td>\n    <td>24</td>\n  </tr>\n  <tr>\n    <td>Niamh</td>\n    <td>Mulligan</td>\n    <td>19</td>\n  </tr>\n</table>\n\n</body>\n</html>\n\n'

In [5]:
print(rslt.text)

<!DOCTYPE html>
<html>
<head>
<link rel="stylesheet" type="text/css" href="mstyle.css">
</head>
<body>

<h2>Staff Database</h2>
<p>This is our webpage of staff data. The records below are members of staff.</p>
<table>
  <tr>
    <th>Name</th>
    <th>Surname</th>
    <th>Age</th>
  </tr>
  <tr>
    <td>Sean</td>
    <td>McGrath</td>
    <td>31</td>
  </tr>
  <tr>
    <td>Santiago</td>
    <td>Mateo</td>
    <td>39</td>
  </tr>
  <tr>
    <td>Jerry</td>
    <td></td>
    <td>22</td>
  </tr>
  <tr>
    <td>Gabriela</td>
    <td>Ruiz</td>
    <td>24</td>
  </tr>
  <tr>
    <td>Niamh</td>
    <td>Mulligan</td>
    <td>19</td>
  </tr>
</table>

</body>
</html>




In [6]:
import bs4

In [7]:
bs=bs4.BeautifulSoup(rslt.text, "html.parser")

### find()

In [8]:
bs.find("tr")

<tr>
<th>Name</th>
<th>Surname</th>
<th>Age</th>
</tr>

In [9]:
bs.find_all("tr")

[<tr>
 <th>Name</th>
 <th>Surname</th>
 <th>Age</th>
 </tr>,
 <tr>
 <td>Sean</td>
 <td>McGrath</td>
 <td>31</td>
 </tr>,
 <tr>
 <td>Santiago</td>
 <td>Mateo</td>
 <td>39</td>
 </tr>,
 <tr>
 <td>Jerry</td>
 <td></td>
 <td>22</td>
 </tr>,
 <tr>
 <td>Gabriela</td>
 <td>Ruiz</td>
 <td>24</td>
 </tr>,
 <tr>
 <td>Niamh</td>
 <td>Mulligan</td>
 <td>19</td>
 </tr>]

### find_all()

In [10]:
rows=bs.find_all("tr")

In [11]:
for i, row in enumerate(rows):
    print("\nrow " + str(i) + ": " + str(row))


row 0: <tr>
<th>Name</th>
<th>Surname</th>
<th>Age</th>
</tr>

row 1: <tr>
<td>Sean</td>
<td>McGrath</td>
<td>31</td>
</tr>

row 2: <tr>
<td>Santiago</td>
<td>Mateo</td>
<td>39</td>
</tr>

row 3: <tr>
<td>Jerry</td>
<td></td>
<td>22</td>
</tr>

row 4: <tr>
<td>Gabriela</td>
<td>Ruiz</td>
<td>24</td>
</tr>

row 5: <tr>
<td>Niamh</td>
<td>Mulligan</td>
<td>19</td>
</tr>
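
A natural next step is to turn the rows into records keyed by the header row. The notes use BeautifulSoup for parsing; the sketch below swaps in the standard library's `html.parser` so that it runs offline without extra packages. The `TableParser` class is a hypothetical helper, and the HTML is a shortened copy of the staff table.

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td"):
            self.rows[-1].append("")   # start an empty cell
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data

# Shortened copy of the staff table, so the sketch runs offline
html = """<table>
  <tr><th>Name</th><th>Surname</th><th>Age</th></tr>
  <tr><td>Sean</td><td>McGrath</td><td>31</td></tr>
  <tr><td>Jerry</td><td></td><td>22</td></tr>
</table>"""

parser = TableParser()
parser.feed(html)
header, *body = parser.rows
records = [dict(zip(header, row)) for row in body]
print(records)
```

Note that Jerry's missing surname comes through as an empty string, which is the kind of gap a later cleaning step has to handle.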


## 5.05 Introduction to web scraping


![image.png](attachment:image.png)

In [17]:
import json
import requests
from bs4 import BeautifulSoup

In [24]:
def get_soup(URL, jar=None):
    # Browser-like headers; some sites reject requests without a user-agent
    request_headers = {"upgrade-insecure-requests":"1",
                   "user-agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0",
                   "accept-language": "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=1.0",
                   "accept": "*/*", "accept-encoding": "gzip, deflate, br"}

    if jar:
        # Reuse the cookies from an earlier request
        r = requests.get(URL, cookies=jar, headers=request_headers)
    else:
        r = requests.get(URL, headers=request_headers)
        # Keep the cookies the server set so they can be passed to later calls
        jar = r.cookies
    print(r.url)
    data = r.text
    soup = BeautifulSoup(data, "html.parser")  # parse the HTML into a tree
    return soup, jar


In [25]:
soup, jar = get_soup('http://doc.gold.ac.uk/~smcgr004/test.html')

http://doc.gold.ac.uk/~smcgr004/test.html


### What happens if we do not send request_headers? Try it.

In [26]:
request_headers = {"upgrade-insecure-requests":"1",
                   "user-agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0",
                   # Accept Chinese as well as English; a higher q means more preferred
                   "accept-language": "zh-CN;q=0.5,zh;q=0.8,en-US;q=1,en;q=0.3,*;q=0.5",
                   "accept": "*/*", "accept-encoding": "gzip, deflate, br"}
      

See https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept-Language and https://developer.mozilla.org/en-US/docs/Glossary/Quality_values for how the `q` values rank languages.
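
To make the quality values concrete, here is a minimal sketch that ranks the languages in the `accept-language` string used above. `parse_accept_language` is a hypothetical helper written for illustration, not part of `requests`, and it assumes a simple, well-formed header.

```python
def parse_accept_language(header):
    """Return (language, q) pairs sorted from most to least preferred."""
    prefs = []
    for part in header.split(","):
        lang, _, q = part.strip().partition(";q=")
        prefs.append((lang, float(q) if q else 1.0))  # q defaults to 1
    # sorted() is stable, so ties keep their original order
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)

header = "zh-CN;q=0.5,zh;q=0.8,en-US;q=1,en;q=0.3,*;q=0.5"
for lang, q in parse_accept_language(header):
    print(lang, q)
```

With this header, `en-US` ranks first and `en` last, so a server that honours the header would prefer US English, then Chinese.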

### find()

In [27]:
def page_name(soup):
    h2 = soup.find("h2")
    if h2 is not None:
        return h2.text

In [28]:
page_name(soup)

'Staff Database'

### findNext()

<!DOCTYPE html>
<html>
<head>
<link rel="stylesheet" type="text/css" href="mstyle.css">
</head>
<body>

<h2>Staff Database</h2>
<p>This is our webpage of staff data. The records below are members of staff.</p>
<table>
  <tr>
    <th>Name</th>
    <th>Surname</th>
    <th>Age</th>
  </tr>
  <tr>
    <td>Sean</td>
    <td>McGrath</td>
    <td>31</td>
  </tr>
  <tr>
    <td>Santiago</td>
    <td>Mateo</td>
    <td>39</td>
  </tr>
  <tr>
    <td>Jerry</td>
    <td></td>
    <td>22</td>
  </tr>
  <tr>
    <td>Gabriela</td>
    <td>Ruiz</td>
    <td>24</td>
  </tr>
  <tr>
    <td>Niamh</td>
    <td>Mulligan</td>
    <td>19</td>
  </tr>
</table>

</body>
</html>

In [29]:
def page_surname(soup):
    # Find the cell holding the surname, trying a few spellings
    td = soup.find("td", string="McGrath")
    if td is None:
        td = soup.find("td", string="mcgrath")
    if td is None:
        td = soup.find("td", string="Mcgrath")
    if td is not None:
        age = td.findNext("td").text  ## the next <td> in the row holds the age (see the table)
        return age

In [38]:
page_surname(soup)

'31'

In [39]:
row = (page_name(soup), page_surname(soup))

#### So `row` is a tuple: index 0 is 'Staff Database' and index 1 is '31'

In [40]:
row

('Staff Database', '31')

In [41]:
def age_to_float(value):
    value = value.replace("1","2")
    return float(value)

#### Take index 1 (i.e. '31'), replace "1" with "2", and convert the result to a float

In [42]:
age_to_float(row[1])

32.0
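
Scraped values arrive as strings and may be empty (as with Jerry's missing surname in the table), so a conversion step usually needs to tolerate bad input. The sketch below is one possible approach; `to_float` is a hypothetical helper, and treating commas as thousands separators is an assumption about the data.

```python
def to_float(value):
    """Convert scraped text to a float; return None if it is not a number."""
    text = value.strip().replace(",", "")   # drop spaces and thousands commas
    try:
        return float(text)
    except ValueError:
        return None

print(to_float("31"))
print(to_float(" 1,250 "))
print(to_float(""))
```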

In [44]:
def case_replace(value):
    value = value.replace("Database", "database")
    return(value)


#### Take index 0 (i.e. 'Staff Database') and replace "Database" with "database"

In [45]:
case_replace(row[0])

'Staff database'

In [47]:
def list_to_dict(row):
    # row is a (page name, age) tuple; no loop is needed to unpack it
    d = {}
    d['Database Name'] = row[0]
    d['Age'] = row[1]
    return d

In [50]:
def writeJson(data):
    with open("results.json", "w", encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False)

In [51]:
row

('Staff Database', '31')

In [52]:
print(list_to_dict(row))

{'Database Name': 'Staff Database', 'Age': '31'}


In [53]:
writeJson(list_to_dict(row))
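
It is worth checking that what was saved can be read back unchanged. This sketch repeats the write with the same `json.dump` options and then reloads the file; it writes to a temporary directory (an assumption made so that no existing `results.json` is overwritten).

```python
import json
import os
import tempfile

record = {"Database Name": "Staff Database", "Age": "31"}

# Write to a temporary directory so no existing file is overwritten
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "results.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, ensure_ascii=False)
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded == record)
```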

## 5.06 Data sources on the internet, let's explore!

Go and find some data sources on the internet around a subject that you are interested in. Hint: you should think about this exercise as a lens for **your coursework**! 

For example, can you find any JSON format financial data? 

* What is interesting about this data? 
* What sort of processing might you want to do on it?
* What sort of interesting insights can you draw from an analysis of the data?
* Who does the data belong to and is it freely accessible?

Post your findings on the discussion forums and reply to at least two posts of your peers. 

Participation is optional

## 5.07 HTTP, the internet and file formats
Practice Quiz • 10  minutes

**Question 1**
Which sentence best describes the internet?
* The internet is a set of computers connected together. Every  machine on the internet can connect to every other machine.
* `The Internet is a set of computers connected together, some of which are publicly visible to other machines on the internet.`

`Some machines are programmed not to respond to certain types of queries, such as ICMP packets (ping).` So not every machine on the internet can be connected to.

**Question 2**
A web browser is an example of an HTTP client.
* `True`
* False

**Question 3**
JSON is a common format for websites that includes information about how the website should be laid out. 
* `False`
* True

*JSON can contain lots of types of data but is generally not used to handle things like styling and layout. CSS is generally the standard for identifying features of layout on the web.*

## 5.08 Grab it, fix it, save it

![image.png](attachment:image.png)

## 5.09 Thinking about web scraping
Practice Quiz • 5 min

**Question 1**
What are the dangers of web scraping?
* It might break your computer
* `Web scraping can cause a denial of service by demanding excessive resources from a remote server, for instance when scraping thousands of pages of data`
* Web scraping might not be possible on all data types
* `Not all data is intended to be scraped. There may be legal or ethical implications of retrieving and holding said data`

**Question 2**
Before we web scrape some content, we should check...
* `The terms and conditions of the website to ensure that we are conforming to standards and guidelines set out.`
* That we can convert content to a dataframe
* Whether it can be scraped technically
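
Alongside the terms and conditions, many sites publish a `robots.txt` file stating which paths automated clients may fetch. The sketch below uses the standard library's `urllib.robotparser` on made-up rules parsed locally. The rules, the `my-scraper` agent name and the `allowed` helper are all assumptions for illustration, and checking `robots.txt` does not replace reading the site's terms.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed locally so the sketch runs offline
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rules = RobotFileParser()
rules.parse(robots_txt.splitlines())

def allowed(path, agent="my-scraper"):
    """Return True if robots.txt permits this agent to fetch the path."""
    return rules.can_fetch(agent, "https://www.example.com" + path)

print(allowed("/staff.html"))
print(allowed("/private/data.html"))
print(rules.crawl_delay("my-scraper"))
```

Waiting at least the crawl delay between requests (for example with `time.sleep`) also helps avoid the excessive load described in Question 1.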



## 5.100 Web Scraping Basics

See the file

Click on the "Start" button to launch the lab activity. Note, you might experience some performance issues with cell number 4 here. It may take a minute or two to scrape the content because there is a lot of HTML to parse!

The file you will need for this exercise is:

Web_Scraping_Basics.ipynb

# END