# HTTP Requests and Parsing

---

## HTTP Requests

---

### Background Reading
- **Internet**--interconnected network of computers that use the internet suite of protocols (TCP/IP) to send and receive information.  It is the infrastructure.
- **World Wide Web**--network of web pages that are hosted by web servers and viewed by web browsers. It is the information (files) that is transferred.
- **IP Address**-- Internet Protocol Address. It is a unique address given to an electronic device connected to the internet.  It provides a location that data can be sent to.  Like a phone number for a person.
- **Port**--number which tells a computer which application on the computer should receive packets of data.  If we compared the IP address to a phone number, the port would be like the extension that got us to the right person.
- **Socket**--one endpoint of a two-way communication link between two programs connected to the internet. A socket is bound to a port number. It can be thought of as both the phone number and the phone number extension when there is an active call between two people.
    - **Transport Layer Security (TLS)**--protocols for establishing authenticated and encrypted links between computers connected to the internet. It is what makes HTTPS secure.  It is the successor to the **Secure Sockets Layer (SSL)**, which is now deprecated. 
- **HTTP and HTTPS**--HyperText Transfer Protocol. It is the protocol that specifies how web browsers and web servers send and receive data back and forth across the Internet.  The S in HTTPS stands for secure.  HTTPS is used any time sensitive information is transferred across the internet.
- **HTML**--Hypertext Markup Language.  Language used to specify the text of a webpage, format of a webpage, and hyperlink web pages together.  HTML files end with .html.  These files are the main resources that are transferred across the internet.  However, in today's age, we are also transferring CSS, XML, JSON, JS, and binary files (media, compressed program files, etc.) frequently. 
- **URL**--Uniform Resource Locator.  Contains address/location of a web server that has a resource (like HTML files).  The domain name is a unique name that humans can remember and is converted into an IP address for computers to use.  The URL also contains the protocol (scheme) used to retrieve the resource (file), namely HTTP or HTTPS.  Note that the port is often left out, as it defaults to port 80 for HTTP and port 443 for HTTPS.
    - ![](images/URLimage.png)
    - The URL above tells a web browser to use HTTP, go to a certain web server computer (domain/IP address), go to the correct application on that web server computer (port), and then go to the correct file (file path) on that web server computer to GET the resource (HTML file).
    - If it all works out, the response we receive back starts with the "200 OK" **status code**, followed by **headers** that describe the response.  If there is an error we may get back a different status code, such as "404 Not Found".
- More information is served from web servers to web browsers, but it is a two way communication.  We the users send requests for files to a web server using HTTP(S).  This communication follows a request, response pattern.  We request something from a web server and wait for it to respond.  The most used HTTP request is called the GET method.  We try and GET data from a web server. Normally, we do this without thinking about it by Googling a webpage and clicking on links.  However, we can do this in a more intentional way using a programming language.
    1. Single GET request.  We may want to request a single web page file.  We'll be doing this below.
    1. **Scrape**--a script sends a GET request to a webpage, downloads the HTML file, and saves this info to a database.  It looks for hyperlinks on the downloaded copy, then sends GET requests to new webpages. GET requests, download, find links, repeat.  The term scraping often implies that we are targeting certain websites and file types to download.  On a smaller scale than *crawl*.
    1. **Crawl**--also called spider.  Same as scrape, but term often implies we are navigating and downloading lots of web pages.  If we wanted to build an index for a search engine we'd  *crawl* over all the pages on the world wide web.
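
The parts of a URL described above can be pulled apart with the standard library's `urllib.parse` module.  The block below is a minimal sketch; the URL in it is a placeholder made up for illustration, not a real resource.

```python
from urllib.parse import urlsplit

# Illustrative URL; the host and path are placeholders
s_url = 'http://www.example.com:80/path/to/page.html'

parts = urlsplit(s_url)
print(parts.scheme)    # Protocol used to retrieve the resource
print(parts.hostname)  # Domain name, which is converted to an IP address
print(parts.port)      # Port, i.e. which application on the server
print(parts.path)      # File path of the resource on the server

# When the port is left out of the URL, .port is None and the client
# falls back to the default for the scheme (80 for HTTP, 443 for HTTPS)
print(urlsplit('https://www.example.com/index.html').port)
```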

---

### HTTP Get Requests
- We can send a single GET request in a handful of ways.  As the internet has evolved, new Python libraries have been created that perform requests in a more Pythonic way with more functionality.  We'll cover 5 ways to send HTTP get requests, but we really only need to know the `requests` library.

**5 WAYS to SEND HTTP GET REQUESTS**

1.  **Telnet**--old program that can open a socket to a host on a port.  It sends everything unencrypted, which is a potential security risk, so the Telnet client is no longer enabled by default on Windows PCs.

2. **Web Browser Developer Tools**--send HTTP requests using a web browser.  This can actually be helpful if we want to double check our Python code.
    1. Go to a website in a web browser
    1. In Mozilla Firefox, enable the menu bar by right clicking on the top row where the tabs are
    1. Tools> Web Developer> Web Developer Tools
    1. Refresh web browser
    1. Click Network
    1. Click a row.  Each row is a request the browser sent when the page was refreshed (most are GET requests).
    1. View the headers

3. **`socket`**--old and verbose.  Rarely used directly for HTTP requests today, but helps us understand what's going on under the hood.
    1. Import socket module
    2. Create socket object
    3. Specify domain and port that socket object will be connected to
    4. Encode data.  Convert Unicode string to bytes (UTF-8).  When we talk to an external resource like a network we send bytes (UTF-8).
    5. Send data
    6. Receive data
    7. Decode data.  Convert bytes (UTF-8) to Unicode string.  When we receive data from an external resource like a network we receive bytes (UTF-8).
    8. Close network socket connection

Code | Use
--- | ---
`socket` | Module
`socket.socket()` | Returns socket object. Similar to a file object/handle.  In a nutshell, the parameters `socket.AF_INET` and `socket.SOCK_STREAM` mean that we are going to connect over the internet (IPv4) and have a stream of communication (TCP).  Don't need to fully understand to use.
`.connect()` | Socket object method.  Connects the socket object to a remote address.  It is kinda like the second step of getting the socket object ready before we can communicate with the web server.  We specify where to connect with a tuple parameter (website domain, port).  This is similar to typing a URL into web browser.
`.encode()` | String object method.  Converts Unicode string to bytes/byte object using UTF-8
`.send()` | Socket object method.  Sends byte object command to web server.  Parameter is the command created above.
`.recv()` | Socket object method.  Receives byte object data from web server.  Parameter is the max number of bytes we wish to receive at one time in one chunk.  Called buffer (buff) size.  It is common for buffer to be a relatively small power of 2, for example 4096.
`.decode()` | Bytes object method.  Converts bytes/byte object to Unicode string using UTF-8
`.close()` | Socket object method.  Closes network socket connection
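
The encode and decode steps above can be tried without any network connection.  This is a small sketch; the sample string is an assumption, chosen because it contains a non-ASCII character.

```python
s_text = 'café'             # Unicode string: 4 characters
b_text = s_text.encode()    # UTF-8 is the default encoding
print(b_text)               # Bytes object; the é takes two bytes
print(len(s_text), len(b_text))

# Decoding with the same encoding restores the original string
print(b_text.decode() == s_text)

# Decoding with a different encoding raises no error here,
# but silently produces the wrong characters
s_wrong = b_text.decode('ISO-8859-1')
print(s_wrong == s_text)
print(len(s_wrong))
```

This is why it matters which encoding is used when bytes from a network are converted back to a string.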

---

**EXAMPLES**

In [1]:
import socket

In [2]:
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(('data.pr4e.org', 80))
http_get_method = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
sock.send(http_get_method) 

while True:
    data = sock.recv(512)  # Receive up to 512 bytes at a time
    if len(data) < 1:  # Empty bytes object means the server closed the connection
        break
    print(data.decode(), end='')  # end='' so a chunk boundary does not add a line break
sock.close()

HTTP/1.1 200 OK
Date: Sun, 02 Jan 2022 21:06:28 GMT
Server: Apache/2.4.18 (Ubuntu)
Last-Modified: Sat, 13 May 2017 11:22:22 GMT
ETag: "a7-54f6609245537"
Accept-Ranges: bytes
Content-Length: 167
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: text/plain

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief

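
The raw bytes returned by `.recv()` hold the status line, the headers, and the body in one stream.  As a rough sketch of how they could be separated by hand, the block below works on a hard-coded, shortened response (an assumption for illustration) so that it runs without a network connection.  The variable names are illustrative.

```python
# Shortened raw HTTP response, hard-coded for illustration
b_response = (
    b'HTTP/1.1 200 OK\r\n'
    b'Content-Length: 49\r\n'
    b'Content-Type: text/plain\r\n'
    b'\r\n'
    b'But soft what light through yonder window breaks\n'
)

# A blank line (\r\n\r\n) separates the header block from the body
b_head, b_body = b_response.split(b'\r\n\r\n', 1)
l_head_lines = b_head.decode('ISO-8859-1').split('\r\n')

# First line is the status line: protocol version, status code, reason phrase
s_version, s_code, s_reason = l_head_lines[0].split(' ', 2)

# Remaining lines are "Name: value" headers
d_headers = {}
for s_line in l_head_lines[1:]:
    s_name, s_value = s_line.split(':', 1)
    d_headers[s_name.strip()] = s_value.strip()

print(s_code, s_reason)
print(d_headers)
print(b_body.decode())
```

Libraries such as `urllib` and `requests` do this separation for us, which is one reason they are preferred over raw sockets.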

---

4. **`urllib`**--is made for Python 3.  It is an updated combination of urllib (Python 1.2) and urllib2 (Python 1.6).  Better than `socket`, but still dated and less used.  Creates a file handle that we can use almost exactly like a file handle created with `open()`.  No header info by default, but can get header information if desired.

Code | Use
--- | ---
`urllib` | Library
`urllib.request` | Dot notation specifies `request` module within `urllib` library.  Must be imported explicitly with `import urllib.request`, as a plain `import urllib` does not reliably load it.  This `request` module is different from the `requests` library found below.
`urllib.request.urlopen()` | Creates file object/handle

---

**EXAMPLES**

In [3]:
import urllib.request
# Must use dot notation to import the request module within the urllib library.
# A plain `import urllib` does not reliably make urllib.request available.

**`.read()` and `.decode()`**

In [4]:
# Open socket, encode and send GET request, and receive file all in one command
fh = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')

b_romeo = fh.read()  # Read in bytes from file handle
s_romeo = b_romeo.decode()  # Decode bytes to string
print(s_romeo)

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief



**`for` loop and `.decode()`**

In [5]:
# Open socket, encode and send GET request, and receive file all in one command
fh = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')

for b_line in fh:
    print(b_line.decode(), end="")  # Read in bytes from file handle, decode, and print

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief


5. **`requests`**--the most popular way.  Use this.  Only library found on W3schools.  It is a third-party library built on top of another third-party package called `urllib3`, and makes `urllib3` more convenient to use.  Unlike `urllib`, it does not create a file handle.  Instead, `requests.get()` returns a response object.  The `read()`, `readline()`, and `readlines()` methods do not work with response objects, but for loops do.

Code | Use
--- | ---
`requests` | Library
`requests.get('<URL>')` | Sends GET request and returns response object.  Parameter is URL.  Optional argument `timeout=<SECONDS>`.  This is the number of seconds `.get()` will wait for the server to send data before raising a `Timeout` exception.  If not included then the request could hang indefinitely.  Recommended.
`.status_code` | Response object attribute.  Returns HTTP status code.
`.raise_for_status()` | Response object method.  Raises an exception if the response has an error status code (4xx or 5xx).  Does nothing if no error.  Good practice to always run this after HTTP request.  Can run to stop program or can run in try statement.
`.url` | Response object attribute.  Returns final URL of the response.
`.headers` | Response object attribute.  Returns dictionary-like object of HTTP headers.
`.text` | Response object attribute.  Returns response body as Python string.  `requests` module makes an educated guess as to the encoding of the file text based on HTTP headers and decodes bytes automatically.
`.encoding` | Response object attribute.  Displays encoding used to decode bytes when creating `.text`.
`.content` | Response object attribute.  Returns response body as bytes object (bytes).  If the transfer was compressed with `gzip` or `deflate`, the body is decompressed for us automatically.
`.json()` | Response object method.  Parses the response body as JSON and returns the corresponding Python object (usually a dictionary or list).
`.iter_content()` | Response object method.  Iterates over the response body in chunks of bytes.  Specify size of chunks in bytes.  `chunk_size = 100000` is usually good.  Used for writing files.
`for` | Looping over the response object directly reads in the body as bytes in 128-byte chunks (not line by line).
`requests.codes.ok` | Constant equal to 200

---

**EXAMPLES**

In [6]:
import requests

**`.raise_for_status()`**

In [7]:
try:
    request_object = requests.get('https://www.wikipedia.org/BAD_PAGE', timeout = 1.0)
    request_object.raise_for_status()
except Exception as err :
    print(f'The web page could not be reached. {err}')

The web page could not be reached. 404 Client Error: Not Found for url: https://en.wikipedia.org/BAD_PAGE


**View Metadata**

In [8]:
request_object = requests.get('http://data.pr4e.org/romeo.txt', timeout = 1.0)
request_object.raise_for_status()  # No error so nothing done.
print(request_object.status_code)
print(request_object.url)
print(request_object.headers)

200
http://data.pr4e.org/romeo.txt
{'Date': 'Sun, 02 Jan 2022 21:06:29 GMT', 'Server': 'Apache/2.4.18 (Ubuntu)', 'Last-Modified': 'Sat, 13 May 2017 11:22:22 GMT', 'ETag': '"a7-54f6609245537"', 'Accept-Ranges': 'bytes', 'Content-Length': '167', 'Cache-Control': 'max-age=0, no-cache, no-store, must-revalidate', 'Pragma': 'no-cache', 'Expires': 'Wed, 11 Jan 1984 05:00:00 GMT', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive', 'Content-Type': 'text/plain'}


**`.text`**

In [9]:
request_object = requests.get('http://data.pr4e.org/romeo.txt', timeout = 1.0)
request_object.raise_for_status()  # No error so nothing done.
print(request_object.encoding)
print(type(request_object.text))
print(request_object.text)

ISO-8859-1
<class 'str'>
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief



**Read in bytes with for loop**

In [10]:
request_object = requests.get('http://data.pr4e.org/romeo.txt', timeout = 1.0)
request_object.raise_for_status()  # No error so nothing done.

# Looping over the response object yields the body as bytes in 128-byte chunks, not lines
for b_chunk in request_object:
    s_chunk = b_chunk.decode()
    print(s_chunk)

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
W
ho is already sick and pale with grief



**`.content`**

In [11]:
request_object = requests.get('http://data.pr4e.org/romeo.txt', timeout = 1.0)
request_object.raise_for_status()  # No error so nothing done.
print(request_object.content)

b'But soft what light through yonder window breaks\nIt is the east and Juliet is the sun\nArise fair sun and kill the envious moon\nWho is already sick and pale with grief\n'


**`.iter_content()`**

In [12]:
with open('output/requests_byte_file.jpg', 'wb') as file_object:
    request_object = requests.get('http://data.pr4e.org/cover3.jpg', timeout = 1.0)
    request_object.raise_for_status()  # No error so nothing done.
    for b_chunk in request_object.iter_content(chunk_size = 100000):
        file_object.write(b_chunk)

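- Lastly, note that there are other HTTP request methods besides GET

Method | Use
--- | ---
`GET` | Web browser sends GET to request file/resource from a web server.  The most common method.
`POST` | Web browser sends POST to submit data to a web server (e.g. a filled-in form).  Often creates a new resource.
`PUT` | Similar to `POST`, but creates or replaces the resource at the given URL
`HEAD` | Similar to `GET`, but the response contains only the headers and no body
`DELETE` | Web browser sends DELETE to delete file/resource from web server
`PATCH` | Web browser sends PATCH to partially modify a file/resource on web server
`OPTIONS` | Web browser sends OPTIONS to receive a list of communication options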
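
A method other than GET is chosen when the request is built.  The block below is a minimal sketch using the standard library's `urllib.request.Request`, which only constructs request objects and does not send anything.  The URL and payload are placeholders.

```python
import urllib.request

s_url = 'http://www.example.com/resource'  # Placeholder URL

# No data and no method given: defaults to GET
req_get = urllib.request.Request(s_url)

# Supplying data (as bytes) switches the default method to POST
req_post = urllib.request.Request(s_url, data=b'name=value')

# Any other method can be named explicitly
req_head = urllib.request.Request(s_url, method='HEAD')

for req in (req_get, req_post, req_head):
    print(req.get_method(), req.full_url)

# Passing one of these objects to urllib.request.urlopen() would send it
```

With the `requests` library the counterparts are functions named after the method, such as `requests.post()` and `requests.head()`.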

---

## Parsing HTML
- Once we GET our data, we read it into our script and then parse it.  
- **Parse**--conversion of data from the original format to another format that is more useful to us
- To parse data properly, we need to know the basic structure of the data.  We'll first look at HTML data.
- HTML documents use *tags* and *elements* within their code
    1. **Element**--combination of a start **tag**, some content, and an end **tag**. The element includes the start and end tags.
```html
<tagname>Content goes here</tagname>
```
    1. **Attributes**--name/value pairs on the start tag of the element.  Optional.  They provide more information about the element.
```html
<tagname attribute_name="value">Content goes here</tagname>
```
- These elements are organized in an **element tree** structure.  The tree starts at the "root" element and branches to nested "leaf"/"child" elements/nodes.   Elements at the same level of nested-ness are called siblings.  A parent is the element directly above the current one.
- In the below element tree, note that:
    - The root element is `<html>` and there is only one root
    - All the elements in the tree are descendants of `<html>`
    - `<html>`'s direct descendants (children one level below) are `<head>` and `<body>`
    - `<head>` and `<body>` are siblings
    - The parent of `<head>` and `<body>` is `<html>`
    - `<head>` and `<body>` have their own children 

![](images/HTML_tree.png)

- Note that HTML does NOT care about indents (no meaning).  However, they are used to signify level on element tree as this dramatically increases readability.
- Many HTML documents on the web contain bad HTML code that is not in a standard, correct format. E.g. elements may have missing end tags.  However, web browsers forgive these errors and allow the code to function.  This is great for website creators, but bad for someone using regular expressions that assume HTML is in a certain format.  Because HTML can be formatted in a couple ways, and because it can be formatted wrong and still work, trying to use regular expressions with HTML is a bad idea.  Instead, use Beautiful Soup.
- **Beautiful Soup**--library that uses element tree structure to parse HTML and XML, converting it into a more user friendly format.  As Beautiful Soup parses, it repairs errors in the code structure (e.g. adds missing end tags) so that our searches return the expected results.  Its name and domain (https://www.crummy.com/software/BeautifulSoup/bs4/doc/) tease HTML for being so bad.
- Beautiful soup transforms HTML documents into 3 main kinds of Python objects.  Each has its own attributes and methods.
    1. **BeautifulSoup**--represents the parsed HTML document as a whole.  For most purposes it can be treated similarly to a tag object.
    1. **Tag**--corresponds to the element in original HTML document
    1. **NavigableString**--corresponds to the content within an element
        1. **Comment**--special type of `NavigableString` 
        1. There are more special types of `NavigableString` objects

Code | Use
--- | ---
`bs4` | Module.  Package is installed under the name `beautifulsoup4`, however import using `import bs4`.
`bs4.BeautifulSoup()` | Parses HTML or XML document and returns BeautifulSoup object.  First argument is the document, which can be in its encoded bytes format or its decoded Python string format.  Both work.  Second argument is parser.  `html.parser` is built-in and works fine.  `lxml` must be installed, but is faster.  Two others can be seen on website.
`.prettify()` | BeautifulSoup or Tag object method.  Like `pprint()`.   Uses insignificant whitespace to print an easier to read text.
`.<TAG>` | BeautifulSoup object attribute.  Returns tag object. Returns FIRST element with specified tag.  E.g. if we entered `.p` it would return the element of the first paragraph, but not the second nor third paragraphs.  Tag object includes all descendants and can be visualized as a branch of the element tree.
`.contents` | Tag object attribute.  Returns a tag's DIRECT descendants in a list.
`.descendants` | Tag object attribute.  Returns ALL tag's descendants in generator object.  Generator object is iterable and can be used in for loop or used with `list()` to create a Python list.
`.string` | Tag object attribute.  Returns NavigableString object.  This should then be converted to a Python string data type with `str()`.  If a tag has only one child, and that child is a NavigableString, it returns that text as expected.  If tag's only child is another tag, and *that* tag has a NavigableString child, then the parent tag is considered to have the same NavigableString as the child tag.
`.find_all()` | BeautifulSoup or Tag object method.  Returns all descendants that match filters in list-like object.  Calling the object directly, e.g. `<OBJECT>('a')`, is a shortcut for `<OBJECT>.find_all('a')`.  Most popular use of entire library.
More | See website for more info on moving sideways to siblings, moving up the tree to parents, filtering, the `.find()` function, and much more.
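
To illustrate the warning above about regular expressions, the sketch below compares a naive pattern with a real HTML parser on the same messy snippet.  To keep it runnable without third-party installs it uses the standard library's `html.parser.HTMLParser` directly rather than Beautiful Soup.  The `LinkCollector` class and the sample HTML are hypothetical, written for this example.

```python
import re
from html.parser import HTMLParser

# Same kind of link written three different ways (sample data for illustration)
s_html = """
<p>Three links:
<a href="http://example.com/elsie" class="sister">Elsie</a>
<a class='sister' href='http://example.com/lacie'>Lacie
<A HREF="http://example.com/tillie">Tillie</A>
"""

# Naive pattern: assumes lowercase tag, href first, double quotes
l_regex = re.findall(r'<a href="([^"]+)"', s_html)
print(l_regex)


class LinkCollector(HTMLParser):
    """Hypothetical helper that records the href of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us
        if tag == 'a':
            for s_name, s_value in attrs:
                if s_name == 'href':
                    self.links.append(s_value)


collector = LinkCollector()
collector.feed(s_html)
collector.close()
print(collector.links)
```

The pattern could be patched for each variation, but a parser handles attribute order, quoting style, letter case, and missing end tags without special cases.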

---

**EXAMPLES**

In [13]:
import requests
import bs4

**From URL**

In [14]:
# Send GET request and create a response object
ro = requests.get('http://www.dr-chuck.com/page1.htm', timeout = 1)
ro.raise_for_status()

# We could use .text or .content attributes here.  Both work.
soup = bs4.BeautifulSoup(ro.content, "html.parser")
print(type(soup))

<class 'bs4.BeautifulSoup'>


**From File Object/Handle**

In [15]:
with open('input/html_example.html', 'rt') as file_handle:
    soup = bs4.BeautifulSoup(file_handle, "html.parser")
    print(type(soup))

<class 'bs4.BeautifulSoup'>


**BeautifulSoup Object**
- Notice how the original is missing end tags for `<html>` and `<body>`
- This was parsed and the BeautifulSoup object created.  The BeautifulSoup object now contains closing tags.

In [16]:
s_html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = bs4.BeautifulSoup(s_html, "html.parser")  # BeautifulSoup object
print(type(soup))
print(soup)
print('\n')
print(soup.prettify())

<class 'bs4.BeautifulSoup'>

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>


<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/til

**Tag Object**

In [17]:
print(type(soup.head))  # Tag object
print(soup.head)  # Tags are branches that include all descendants
print('\n')
print(soup.head.prettify())

<class 'bs4.element.Tag'>
<head><title>The Dormouse's story</title></head>


<head>
 <title>
  The Dormouse's story
 </title>
</head>



**`.contents`**

In [18]:
print(type(soup.head.contents))
print(soup.head.contents)

<class 'list'>
[<title>The Dormouse's story</title>]


**`.descendants`**

In [19]:
gen_descendants = soup.head.descendants
print(type(gen_descendants))
for child in gen_descendants:  # Iterable.
    print(child)

<class 'generator'>
<title>The Dormouse's story</title>
The Dormouse's story


In [20]:
gen_descendants = soup.head.descendants
print(list(gen_descendants))  # Turned into normal list

[<title>The Dormouse's story</title>, "The Dormouse's story"]


**NavigableString Object**

In [21]:
tag = soup.title  # Tag object
nav_string = tag.string  # NavigableString object
print(type(nav_string))
print(nav_string)
s_nav_string = str(nav_string)  # Convert to Python string

<class 'bs4.element.NavigableString'>
The Dormouse's story


**`.find_all()`**

In [22]:
found = soup.find_all("a")
print(type(found))
for tag in found:
    print(tag)

<class 'bs4.element.ResultSet'>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>


---

## Parsing XML
- Data from HTTP GET requests may also be formatted with XML
- **XML**--Extensible Markup Language.  "Wire format". The "wire" is the internet.  Whatever the original data format, it can be converted to XML (called serialize), sent across the internet (the "wire"), and then be converted (de-serialize) into any other format.
- Both XML and HTML come from the SGML family of languages. Because of this, both use tags, elements, and trees. Though they both provide a standardized format for storing data, the uses are quite different.
- HTML changes how data is displayed so humans can view it. HTML makes data easy for humans to read. XML stores data so that any program on any computer can send data to any other program on any other computer. XML makes data easy for machines to read. Also, HTML tags are predefined. XML tags are not predefined. This means that two programs must agree on the meaning of the XML tags being used.
- XML trees share tree structure with HTML as seen in the image below:

![](images/XML_tree.png)

- The image below is similar to the one above, but shows attributes when they are included for an element:

![](images/XML_tree_attributes.gif)

- To GET XML data we would send a GET request to a web server that we know outputs XML with a desired element tree structure.  After receiving it we would read in the data and parse it.
- We can parse XML data with a module in the Python Standard Library called `xml`

Code | Use
--- | ---
`xml.etree.ElementTree` | Module.  Conventionally, `import xml.etree.ElementTree as ET`.
`ET.parse()` | Returns ElementTree object.  Parameter is file name or file object.  Use `.getroot()` to get the root element.
`ET.fromstring()` | Returns the root Element object.  Parameter is string of text that is structured as an element tree.
`.findall()` | Element object method.  Returns list of elements with a specified tag that are direct children of current element.  Parameter is tag (or path) to search for.
`.find()` | Element object method.  Returns *first* child with a specified tag, or `None` if there is no match.  Parameter is tag to search for.
`.find().text` | Returns the text content of the element that was found
`.find().get()` | Returns the value of the specified attribute of the element that was found.  Parameter is attribute name.

---

**EXAMPLES**

In [23]:
import xml.etree.ElementTree as ET  # Use alias as this is long

In [24]:
# The data would normally be gotten from an HTTP GET request, but here, we'll just provide it as a string
s_data = """
<person>
    <name>The Black Knight</name>
    <phone type="intl">
    +1 734 303 4456
    </phone>
    <email hide="yes"/>
</person>
""" # Having triple quotes on separate lines makes code look cleaner, but does create a new line at the top and bottom

tree_object = ET.fromstring(s_data)  # Returns the root element (<person>) of the element tree
print(type(tree_object))

# Goes down tree and finds name tag
print(tree_object.find('name'))

# Goes down tree and finds name tag.  Returns just the text content of element.
print(tree_object.find('name').text) 

# Goes down tree and finds email tag.  Returns value associated with hide attribute.
print(tree_object.find('email').get('hide'))

<class 'xml.etree.ElementTree.Element'>
<Element 'name' at 0x000002A1A81FE900>
The Black Knight
yes


In [25]:
# The data would normally be gotten from an HTTP request, but here, we'll just provide it as a string
s_data = """<stuff>
    <users>
        <user x="2">
            <id>001</id>
            <name>Knights Who Say Ni</name>
        </user>
        <user x="7">
             <id>009</id>
             <name>Unladen Swallow</name>
        </user>
    </users>
</stuff>"""

tree_object = ET.fromstring(s_data)  # Returns the root element (<stuff>) of the element tree
print(type(tree_object))

# Returns list of tags/elements
l_users = tree_object.findall('users/user')  # Finds all user elements under the users element
print(type(l_users))
print(len(l_users))

for user in l_users:
    print(type(user))
    print(user.find('name').text)  # Name
    print(user.find('id').text)  # ID
    print(user.get("x"))  # Return value of attribute x

<class 'xml.etree.ElementTree.Element'>
<class 'list'>
2
<class 'xml.etree.ElementTree.Element'>
Knights Who Say Ni
001
2
<class 'xml.etree.ElementTree.Element'>
Unladen Swallow
009
7
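
The difference between `ET.parse()` and `ET.fromstring()`, and the serialize step mentioned earlier, can be sketched as follows.  `io.StringIO` stands in for a file on disk so the block is self-contained; the sample data is a shortened version of the one used above.

```python
import io
import xml.etree.ElementTree as ET

s_data = '<person><name>The Black Knight</name><email hide="yes"/></person>'

# ET.fromstring() returns the root Element directly
root_from_string = ET.fromstring(s_data)

# ET.parse() accepts a file name or file object and returns an ElementTree,
# so one extra step is needed to reach the root element
tree = ET.parse(io.StringIO(s_data))
root_from_file = tree.getroot()

print(type(tree).__name__, type(root_from_file).__name__)
print(root_from_string.tag == root_from_file.tag)

# .attrib holds all of an element's attributes as a dictionary
print(root_from_file.find('email').attrib)

# Serialize: convert the element tree back into text that could be sent
s_serialized = ET.tostring(root_from_file, encoding='unicode')
print(s_serialized)
```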


---

## Parsing JSON
- We covered JSON in more detail in the *JSON* section above
- Data from HTTP GET requests may also be formatted with JSON
- JSON, like XML, is a "wire format". The "wire" is the internet. Whatever the original data format, it can be converted to JSON (called serialize), sent across the internet (the "wire"), and then be converted (de-serialize) into any other format.
- XML is older, more complicated, and more flexible. JSON is newer and simpler. XML has to be parsed with an XML parser. JSON can be parsed by a JavaScript function. JSON is now much more popular than XML.

Code | Use
--- | ---
`json` | Module
`json.loads()` | Parses JSON and returns Python.  Converts JSON string to appropriate Python data types.  Can be thought of as "load **s**tring".
`json.dumps()` | Serializes Python and returns JSON. Converts Python data types to JSON string.  Can be thought of as "dump **s**tring".

---

**EXAMPLES**

In [26]:
import requests
import json

**`loads()`**

In [27]:
# Loads, then sum certain values

request_object = requests.get('http://py4e-data.dr-chuck.net/comments_42.json', timeout = 1.0)
request_object.raise_for_status()  # No error so nothing done.
b_data = request_object.content
print(type(b_data))

s_JSON_data = b_data.decode()
print(type(s_JSON_data))

d_data = json.loads(s_JSON_data)
print(type(d_data))

i_accumulate = 0
for d_withinlist in d_data['comments']:
    i_accumulate += d_withinlist['count']
print(i_accumulate)

<class 'bytes'>
<class 'str'>
<class 'dict'>
2553


- If the URL is not working in future, can always use this nested Python dictionary for practice:

```python
d_data = {'note': 'This file contains the sample data for testing', 'comments': [{'name': 'Romina', 'count': 97}, {'name': 'Laurie', 'count': 97}, {'name': 'Bayli', 'count': 90}, {'name': 'Siyona', 'count': 90}, {'name': 'Taisha', 'count': 88}, {'name': 'Alanda', 'count': 87}, {'name': 'Ameelia', 'count': 87}, {'name': 'Prasheeta', 'count': 80}, {'name': 'Asif', 'count': 79}, {'name': 'Risa', 'count': 79}, {'name': 'Zi', 'count': 78}, {'name': 'Danyil', 'count': 76}, {'name': 'Ediomi', 'count': 76}, {'name': 'Barry', 'count': 72}, {'name': 'Lance', 'count': 72}, {'name': 'Hattie', 'count': 66}, {'name': 'Mathu', 'count': 66}, {'name': 'Bowie', 'count': 65}, {'name': 'Samara', 'count': 65}, {'name': 'Uchenna', 'count': 64}, {'name': 'Shauni', 'count': 61}, {'name': 'Georgia', 'count': 61}, {'name': 'Rivan', 'count': 59}, {'name': 'Kenan', 'count': 58}, {'name': 'Hassan', 'count': 57}, {'name': 'Isma', 'count': 57}, {'name': 'Samanthalee', 'count': 54}, {'name': 'Alexa', 'count': 51}, {'name': 'Caine', 'count': 49}, {'name': 'Grady', 'count': 47}, {'name': 'Anne', 'count': 40}, {'name': 'Rihan', 'count': 38}, {'name': 'Alexei', 'count': 37}, {'name': 'Indie', 'count': 36}, {'name': 'Rhuairidh', 'count': 36}, {'name': 'Annoushka', 'count': 32}, {'name': 'Kenzi', 'count': 25}, {'name': 'Shahd', 'count': 24}, {'name': 'Irvine', 'count': 22}, {'name': 'Carys', 'count': 21}, {'name': 'Skye', 'count': 19}, {'name': 'Atiya', 'count': 18}, {'name': 'Rohan', 'count': 18}, {'name': 'Nuala', 'count': 14}, {'name': 'Maram', 'count': 12}, {'name': 'Carlo', 'count': 12}, {'name': 'Japleen', 'count': 9}, {'name': 'Breeanna', 'count': 7}, {'name': 'Zaaine', 'count': 3}, {'name': 'Inika', 'count': 2}]}
```

**`dumps()`**

In [28]:
# Dump: Python to JSON.  Format JSON for readability
x = {"name":"John", "age":30, "city":"New York"}
print(f'This is the Python dictionary: {x}')

# Serialize
y = json.dumps(x, indent=4, separators=(", ", "="))
# Indent puts each dictionary item on a new line and indents 4 spaces.
# Separator is symbol used to separate items, followed by symbol used to separate keys from values.
# Note: "=" is used here for display only.  Valid JSON requires ":" between keys and values.

print(f'It has been converted (dumps) into a string and modified for readability: {y}')

This is the Python dictionary: {'name': 'John', 'age': 30, 'city': 'New York'}
It has been converted (dumps) into a string and modified for readability: {
    "name"="John", 
    "age"=30, 
    "city"="New York"
}
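
The effect of `separators` on validity can be checked by trying to read the string back with `json.loads()`. The sketch below is illustrative only; the variable names (`s_valid`, `s_display`, `s_compact`, `b_round_trip`) are not from the original notes.

```python
import json

x = {"name": "John", "age": 30, "city": "New York"}

# Default separators produce valid JSON that loads() can read back.
s_valid = json.dumps(x, indent=4)
print(json.loads(s_valid) == x)

# Using "=" between keys and values is fine for display,
# but the result is no longer JSON, so loads() rejects it.
s_display = json.dumps(x, indent=4, separators=(", ", "="))
try:
    json.loads(s_display)
    b_round_trip = True
except json.JSONDecodeError as err:
    b_round_trip = False
    print(f'Not valid JSON: {err.msg}')

# Compact form (no spaces), useful when sending data rather than reading it.
s_compact = json.dumps(x, separators=(",", ":"))
print(s_compact)
```

If the string is meant to be parsed again later, keeping `:` as the key-value separator is the safer choice.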


In [29]:
# Similar to above, but with keys sorted alphabetically
x = {"name":"John", "age":30, "city":"New York"}
print(f'This is the Python dictionary: {x}')

# Serialize (dump) with sorted keys
y = json.dumps(x, indent=4, sort_keys=True)

print(f'This is the string sorted alphabetically by key: {y}')

This is the Python dictionary: {'name': 'John', 'age': 30, 'city': 'New York'}
This is the string sorted alphabetically by key: {
    "age": 30,
    "city": "New York",
    "name": "John"
}
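
One practical use of `sort_keys=True` is getting the same string from dictionaries that hold the same data in a different insertion order. A minimal sketch, with illustrative variable names (`d_first`, `d_second`, `s_first`, `s_second`) that are not from the original notes:

```python
import json

# Same key/value pairs, inserted in a different order.
d_first = {"name": "John", "age": 30, "city": "New York"}
d_second = {"city": "New York", "name": "John", "age": 30}

# Without sorting, each string follows its dictionary's insertion order,
# so the two strings differ even though the data is the same.
print(json.dumps(d_first) == json.dumps(d_second))

# With sort_keys=True both dictionaries serialize to the same string.
s_first = json.dumps(d_first, sort_keys=True)
s_second = json.dumps(d_second, sort_keys=True)
print(s_first == s_second)
print(s_first)
```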


---

### APIs
- **API**--Application Programming Interface. It is code (Programming) that allows Applications to Interface (send/receive data) across the Internet. In other words, it is a contract between applications that defines the patterns of interaction.
- Web APIs transfer data over HTTP requests (most often GET requests when retrieving data), and the responses are usually JSON-formatted plain text.
- API protocols are not pre-defined like other data-sharing protocols. There is usually a producer of data and a consumer of data. The producer defines what data they are willing to share, what format the data is in, and how the consumer can make requests to receive the data.
- There are a variety of reasons why companies, governments, researchers, and non-profits share their data.
- Producers often limit the number of API requests so that they can control how their data is used. Companies often sell access to their data. They do this by requiring keys. Keys are unique sequences that confirm the user has registered or paid to access the data.
- Many websites make their data available in JSON format. Facebook, Twitter, Yahoo, Google, Tumblr, Wikipedia, Flickr, Data.gov, Reddit, IMDb, Rotten Tomatoes, LinkedIn, and many other popular sites offer APIs for programs to use.
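
Because producers can reject a key or cap the number of requests, a consumer usually has to handle failed requests as well as successful ones. The sketch below is one possible way to do that with `urllib`. It assumes the producer reports key problems with HTTP status 401 or 403 and request limits with 429, which is common but not guaranteed for any particular API. The function name `fetch_json` and its `opener` parameter are hypothetical and exist only for this illustration.

```python
import json
import urllib.error
import urllib.request


def fetch_json(url, opener=urllib.request.urlopen):
    """Return (data, error_message); exactly one of the two is None.

    `opener` is any callable that takes a URL and returns a readable
    response object. It defaults to urllib.request.urlopen.
    """
    try:
        with opener(url) as response:
            return json.loads(response.read().decode()), None
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status code.
        # Assumption: the producer uses these conventional codes.
        if err.code in (401, 403):
            return None, f'HTTP {err.code}: key missing, invalid, or not authorized'
        if err.code == 429:
            return None, 'HTTP 429: request limit reached, try again later'
        return None, f'HTTP {err.code}: {err.reason}'
    except urllib.error.URLError as err:
        # No HTTP response at all (DNS failure, refused connection, ...).
        return None, f'Could not reach server: {err.reason}'
```

`HTTPError` is caught before `URLError` because it is a subclass of `URLError`; reversing the order would hide the status code. A call such as `d_data, s_error = fetch_json(s_url_param)` then lets the caller check `s_error` before navigating the data.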

Code | Use
--- | ---
`?` | Separates the URL path from the query string. Parameters follow the `?` as `key=value` pairs joined by `&`, and they change the output of the web server.
`urllib.parse.urlencode()` | Differs from the normal string `encode()` and `decode()`. Converts a Python dictionary into a query string: each item becomes `key=value`, items are joined by `&`, spaces are replaced by `+`, and other special characters are percent-encoded (for example, `,` becomes `%2C`). Used to format a Python dictionary so that it can be added onto a URL after the `?` as parameters.
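
The encoding can be seen, and reversed, without sending any request. This sketch reuses the address and URL from these notes; `parse_qs()` and `urlsplit()` are standard `urllib.parse` functions, and the variable names `s_query`, `s_url`, and `d_back` are illustrative.

```python
import urllib.parse

d_param = {'key': 42, 'address': '1600 Pennsylvania Avenue NW, Washington, DC 20500'}

# urlencode() joins key=value pairs with "&" and escapes unsafe characters.
s_query = urllib.parse.urlencode(d_param)
print(s_query)
# key=42&address=1600+Pennsylvania+Avenue+NW%2C+Washington%2C+DC+20500

# The "?" separates the path from the query string.
s_url = 'http://py4e-data.dr-chuck.net/json?' + s_query

# Reverse direction: pull the query out of the URL and decode it.
# parse_qs() returns every value as a list of strings, because a key may repeat.
d_back = urllib.parse.parse_qs(urllib.parse.urlsplit(s_url).query)
print(d_back)
```

Note that the integer `42` comes back as the string `'42'`: a query string carries text only.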

---

**EXAMPLES**

In [30]:
import urllib.request, urllib.parse, urllib.error
import json

- Encodes the parameters, sends a GET request to the API, converts the JSON string to a Python dictionary, and navigates the data tree

In [31]:
s_url_api = 'http://py4e-data.dr-chuck.net/json?'  # Geocode API website that accepts a fixed key (set below) instead of a registered one

s_address = "1600 Pennsylvania Avenue NW, Washington, DC 20500"  # A big white house

d_param = dict()  # Blank dictionary
d_param['key'] = 42  # Key for Dr. Chuck's API
d_param['address'] = s_address  # Key value pair added
s_url_param = s_url_api + urllib.parse.urlencode(d_param)
print(s_url_param)

url_handle = urllib.request.urlopen(s_url_param)
s_data = url_handle.read().decode()
print('Retrieved', len(s_data), 'characters')
#print(s_data) # Useful to understand tree structure

d_data = json.loads(s_data)  # Turns string into Python dictionary
print(type(d_data))


place_id = d_data['results'][0]['place_id']
# Navigates the tree below
print(f'Place ID is: {place_id}')


#Tree structure of the response
#d_data
#    "results": List
#        Dictionary
#            "address_components" : List
#            "formatted_address" : "Address, State, Zip, USA"
#            "geometry" : Dictionary
#            "place_id" : ID
#            "types" : List
#    "status": "OK"

http://py4e-data.dr-chuck.net/json?key=42&address=1600+Pennsylvania+Avenue+NW%2C+Washington%2C+DC+20500
Retrieved 2325 characters
<class 'dict'>
Place ID is: ChIJGVtI4by3t4kRr51d_Qm_x58
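
The lookup `d_data['results'][0]['place_id']` raises an error if the response is not JSON, reports a failure, or has an empty `results` list. A more defensive version might look like the sketch below. It runs on an abbreviated sample string shaped like the tree above rather than a live request. The function name `get_place_id` is hypothetical, and the `ZERO_RESULTS` status in the second call is an assumption about how a geocoding API might report no match.

```python
import json

# Abbreviated sample shaped like the tree above; most fields are omitted.
s_sample = '''{
    "results": [
        {"place_id": "ChIJGVtI4by3t4kRr51d_Qm_x58"}
    ],
    "status": "OK"
}'''


def get_place_id(s_json):
    """Return the first result's place_id, or None if the response is unusable."""
    try:
        d_data = json.loads(s_json)
    except json.JSONDecodeError:
        return None  # Response was not JSON at all
    if not isinstance(d_data, dict) or d_data.get('status') != 'OK':
        return None  # API reported a failure
    l_results = d_data.get('results') or []
    if not l_results:
        return None  # Valid response, but no match for the address
    return l_results[0].get('place_id')


print(get_place_id(s_sample))
print(get_place_id('{"results": [], "status": "ZERO_RESULTS"}'))
print(get_place_id('not json'))
```

The first call prints the place ID; the other two print `None` instead of raising an exception.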


---