# ProgRes, Part II

# Web services

Fabien Mathieu - fabien.mathieu@normalesup.org

Sébastien Tixeuil - Sebastien.Tixeuil@lip6.fr

# Roadmap

- Part I: done
- Part II (Web services)
  - This week:
    - Definitions / Reminders (OSI, HTTP, REST...)
    - Client side
       - Retrieve content
       - Manipulate content
  - Server side (next week)
- Part III: P2P

# Methodology

- Course and practicals are done in notebooks (Jupyter or JupyterLab)
- This means you will send your practical notebooks. Please put your name on the file and inside as well!
- Practicals: for some advanced optional questions, you may add some traditional `.py` files (companion packages) -> `zip`
- Mini-projects: `zip` with a mix of notebooks and `.py` files is expected. Report can be integrated inside a notebook (preferred) or a PDF, limited to 10 pages.

## What is a Web Service?

# Reminder: the OSI model

| Layer | Protocol Data Unit (PDU) | Function |
| :-- | --- | :-: |
| 7 (Application) | Data | High-level protocols such as for resource sharing or remote file access, e.g. HTTP. |
| 6 (Presentation) | Data | Translation of data between a networking service and an application;<br>including character encoding, data compression and encryption/decryption. |
| 5 (Session) | Data | Managing communication sessions, i.e.,<br>continuous exchange of information in the form of multiple back-and-forth transmissions between two nodes. |
| 4 (Transport) | Segment, datagram | Reliable transmission of data segments between points on a network,<br>including segmentation, acknowledgement and multiplexing. |
| 3 (Network) | Packet | Structuring and managing a multi-node network, including addressing, routing and traffic control. |
| 2 (Data link) | Frame, PRB | Transmission of data frames between two nodes connected by a physical layer. |
| 1 (Physical) | Bit, Symbol | Transmission and reception of raw bit streams over a physical medium. |

From Wikipedia

# Reminder: the OSI model

- OSI should be seen as a guideline rather than a set of strict boundaries
- 1 & 2 (physical layers): depend on the physical medium you use
- 3 & 4 (Internet layers): IP (3), UDP ("3.5"), TCP (4)
- Data layers: HTTP(S), SSH, SMTP, (S)FTP... (some see 5/6/7 distinction as artificial)

# The OSI hourglass

Just for Internet culture

- Many physical implementations
- Many data applications
- One waist: IP

<img src = "https://scx2.b-cdn.net/gfx/news/hires/2011/howtheintern.jpg">

# OSI model and PROGRES

- Part I (past sessions) was about building your own data layers on top of sockets
- Part II is about directly using L7 protocols
- Part III will be about overlay networks ("L8")

# HTTP in the old days (Web 1.0)

- Goal: serve a (static) web page
- Client (user on a browser) requests a URL
- Server serves a physical file (or a view of a directory)
- Browser displays the page

- https://files.data.gouv.fr/anssi/ascadv2/
- http://test-debit.free.fr/

- HTTP: HyperText Transfer Protocol
- URL: Uniform Resource Locator

# HTTPS today (Web X.Y)

- Goal: C requests R from S using a URI
- C (the client) is anything
- R (the resource) is anything
- S is a server
- https://pokeapi.co/api/v2/pokemon-form/25/
- https://api.archives-ouvertes.fr/search/?q=authIdHal_s:fabien-mathieu

- HTTPS: HyperText Transfer Protocol Secure
- URI: Uniform Resource Identifier
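
To make the structure of a URI concrete, here is a small sketch that takes one of the example URIs above apart with the standard library (`urllib.parse`). Only the URI comes from the slides; the variable names are illustrative.

```python
from urllib.parse import urlsplit, parse_qs

# One of the example URIs from the slide above
uri = "https://api.archives-ouvertes.fr/search/?q=authIdHal_s:fabien-mathieu"
parts = urlsplit(uri)

print(parts.scheme)   # https
print(parts.netloc)   # api.archives-ouvertes.fr  (the server S)
print(parts.path)     # /search/
print(parts.query)    # q=authIdHal_s:fabien-mathieu

# The query string is a set of key=value pairs; parse_qs maps each key to a list of values
params = parse_qs(parts.query)
print(params)         # {'q': ['authIdHal_s:fabien-mathieu']}
```

Nothing is sent over the network here: the URI is only parsed as a string.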

# Web services

- Client is typically a program executed to access content
 - On client side (JS of a webpage)
 - On server side (to *build* the webpage to return)
- Request is an HTTP(S) method on a URI
- Result is anything (None, html, xml, json, image...)

# HTTP methods

Any HTTP method follows the same pattern:
- A request is sent by the client, with some headers (type of request, ...) and possibly a body (input data)
- A response is sent by the server, with some headers (status code, metadata) and possibly a body (output data)
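
The headers/body split can be illustrated on a hand-written HTTP/1.1 response. This is only a sketch: the response text below is made up for the example (it is not a capture from a real server), and real clients such as `requests` do this parsing for you.

```python
# A hand-written response, as it would travel on the wire (illustrative content)
body = '{"name": "pikachu"}'
raw_response = "\r\n".join([
    "HTTP/1.1 200 OK",                                # status line
    "Content-Type: application/json; charset=utf-8",  # headers...
    f"Content-Length: {len(body)}",
    "",                                               # empty line: end of headers
    body,                                             # body (output data)
])

# Headers and body are separated by the first empty line
head, _, payload = raw_response.partition("\r\n\r\n")
status_line, *header_lines = head.split("\r\n")
version, status_code, reason = status_line.split(" ", 2)
headers = dict(line.split(": ", 1) for line in header_lines)

print(version, status_code, reason)   # HTTP/1.1 200 OK
print(headers)
print(payload)                        # {"name": "pikachu"}
```

A request has the same shape, except that its first line carries the method and the target (for instance `GET /api/v2/pokemon-form/25/ HTTP/1.1`) instead of a status code.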

# HTTP methods: theory

From Wikipedia:
- **GET**: requests data from the target resource. GET requests should only retrieve data. All data is in the URI (useful for caching).
- **HEAD**: like GET, but don't actually send the data. Uses include checking whether a page is available through the status code and quickly finding the size of a file (Content-Length).
- **POST**: requests the target to process some resources (data) sent by the client. Posted data is not on the URI. For example, it is used for posting a message to an Internet forum, subscribing to a mailing list, or completing an online shopping transaction.
- **PUT**: requests that the target resource be created or updated using the data enclosed in the request. A distinction from POST is that the client specifies the target location on the server.
- **DELETE**: requests that the target resource be deleted.
- **CONNECT**: establishes a TCP/IP tunnel. It is often used to secure connections through one or more HTTP proxies with TLS.
- **OPTIONS**: requests that the target resource transfer the HTTP methods that it supports. This can be used to check the functionality of a web server.
- **TRACE**: transfers the received request in the response body. That way a client can see what (if any) changes or additions have been made by intermediaries.
- **PATCH**: modifies part(s) of entry. This can save bandwidth by updating a part of a file or document without having to transfer it entirely.

# HTTP methods: GET or POST?

- **GET** can emulate any method, since everything can be encoded in the URI (e.g. "http://my.server.co/uri?method=POST&data=...")
- Same for **POST** (think of it as a GET whose data travels in the request body instead of the URI)
- 99% of methods used in the real world are GET or POST

# HTTP methods: GET?

- All request information is visible in the URI
- A human can type a GET request directly in the browser's address bar
- Useful for caching / bookmarking
- Good practice: use GET to read

# HTTP methods: POST?

- Keeps some request information out of the URI, inside the message body (not encrypted unless HTTPS is used)
- Usually performed by forms / JavaScript
- Useful for sending login / password
- Good practice: use POST to write (and to send passwords)
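
The difference between the two methods can be sketched without contacting any server, by building the requests with the standard library and inspecting them. The endpoint `http://example.org/api/items` and the data are hypothetical; the `requests` package used in the rest of the course builds equivalent objects internally.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

base = "http://example.org/api/items"   # hypothetical endpoint, never contacted here
data = {"Hello": "world", "answer": 42}

# GET: the data is encoded in the URI itself, there is no body
get_req = Request(f"{base}?{urlencode(data)}")
print(get_req.get_method(), get_req.full_url)
# GET http://example.org/api/items?Hello=world&answer=42
print(get_req.data)   # None

# POST: the URI stays clean, the data travels in the body
post_req = Request(
    base,
    data=json.dumps(data).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
print(post_req.get_method(), post_req.full_url)
# POST http://example.org/api/items
print(post_req.data)  # b'{"Hello": "world", "answer": 42}'
```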

# Anatomy of a GET method: inspect

The browser way: Inspect

https://pokeapi.co/api/v2/pokemon-form/25/

# Anatomy of a GET method: within Python

The `requests` package (which we will use intensively) gives you access to all of it.

In [234]:
from requests import get
request = get("https://pokeapi.co/api/v2/pokemon-form/25/")
print(f"Request headers: {request.request.headers}")
print(f"Response headers: {request.headers}")

Request headers: {'User-Agent': 'python-requests/2.26.0', 'Accept-Encoding': 'gzip, deflate, br', 'Accept': '*/*', 'Connection': 'keep-alive'}
Response headers: {'Date': 'Fri, 21 Oct 2022 06:56:08 GMT', 'Content-Type': 'application/json; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'access-control-allow-origin': '*', 'Cache-Control': 'public, max-age=86400, s-maxage=86400', 'etag': 'W/"4f8-9UFHCQTIN7577Tbo+plk2O2VsR0"', 'function-execution-id': 't7ug4xeujcin', 'strict-transport-security': 'max-age=31556926', 'x-cloud-trace-context': 'b74ae7c6207f58334b7dc42b61d4df26', 'x-country-code': 'FR', 'x-orig-accept-language': 'pl-PL,pl;q=0.9,en-US;q=0.8,en;q=0.7', 'x-powered-by': 'Express', 'x-served-by': 'cache-cdg20772-CDG', 'x-cache': 'HIT', 'x-cache-hits': '1', 'x-timer': 'S1666305942.176406,VS0,VE1', 'vary': 'Accept-Encoding,cookie,need-authorization, x-fh-requested-host, accept-encoding', 'alt-svc': 'h3=":443"; ma=86400, h3-29=":443"; ma=86400', 'CF-Cache-St

# Anatomy of a GET method: within Python

In [235]:
print(f"Status code: {request.status_code}")
print(f"Method: {request.request.method}")

Status code: 200
Method: GET


In [236]:
{k: v for k, v in request.json().items() if k != 'sprites'}

{'form_name': '',
 'form_names': [],
 'form_order': 1,
 'id': 25,
 'is_battle_only': False,
 'is_default': True,
 'is_mega': False,
 'name': 'pikachu',
 'names': [],
 'order': 36,
 'pokemon': {'name': 'pikachu',
  'url': 'https://pokeapi.co/api/v2/pokemon/25/'},
 'types': [{'slot': 1,
   'type': {'name': 'electric', 'url': 'https://pokeapi.co/api/v2/type/13/'}}],
 'version_group': {'name': 'red-blue',
  'url': 'https://pokeapi.co/api/v2/version-group/1/'}}

# Anatomy of a POST method: within Python

We will use http://ptsv2.com, a website for testing POST requests.

http://ptsv2.com/t/progres

In [239]:
from requests import post
request = post("http://ptsv2.com/t/progres/post", json={'Hello': 'world', 'ansWer': 42, 'password': 'MyPrivatePassword'})

# Anatomy of a POST method: within Python

http://ptsv2.com/t/progres

In [240]:
print(f"Status code: {request.status_code}")
print(f"Method: {request.request.method}")
print(f"Request headers: {request.request.headers}")
print(f"Response headers: {request.headers}")
print(request.text)

Status code: 200
Method: POST
Request headers: {'User-Agent': 'python-requests/2.26.0', 'Accept-Encoding': 'gzip, deflate, br', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-Length': '65', 'Content-Type': 'application/json'}
Response headers: {'Content-Type': 'text/plain; charset=utf-8', 'Vary': 'Accept-Encoding', 'Access-Control-Allow-Origin': '*', 'Content-Encoding': 'gzip', 'X-Cloud-Trace-Context': 'dcb45161edf7a0912bc7653e7560835d', 'Date': 'Fri, 21 Oct 2022 06:59:48 GMT', 'Server': 'Google Frontend', 'Cache-Control': 'private', 'Content-Length': '77'}
Thank you for this dump. I hope you have a lovely day!


# REST API

- REpresentational State Transfer (REST) Application Programming Interface (API): a simple, relatively standardized way of providing web services.
- Response is typically json or xml.

https://pokeapi.co/api/v2/pokemon-form/25/

- Method is GET (implicit)
- https://pokeapi.co/api/ is the base URL of the URI
- v2 is the API version
- pokemon-form/25/ is the actual request (fetch Pikachu!)
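
The decomposition above suggests that REST URIs can be assembled mechanically. A minimal sketch, assuming the base/version/resource/identifier layout used by PokeAPI (the helper `rest_uri` is hypothetical, not part of any library):

```python
def rest_uri(base, version, resource, ident):
    """Assemble base URL + API version + resource name + identifier."""
    return f"{base.rstrip('/')}/{version}/{resource}/{ident}/"

base = "https://pokeapi.co/api/"
print(rest_uri(base, "v2", "pokemon-form", 25))
# https://pokeapi.co/api/v2/pokemon-form/25/

# Other resources that appear in the JSON answer shown earlier
for resource, ident in [("pokemon", 25), ("type", 13), ("version-group", 1)]:
    print(rest_uri(base, "v2", resource, ident))
```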

## Client Side

### Retrieve data

# Retrieve data

Data can be:
- Arbitrary bytes (image, pdf, binary...)
- Text
- Structured text (html, json, ...)

You need to adapt:
- Load in memory or save to file?
- Don't load text as bytes, or JSON as plain text!
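
A short sketch of why the distinction matters, using only the standard library. The byte strings below are made up for the example; they stand for what a server could send.

```python
import json

# Bytes versus text: the same content has different lengths
raw = "Sébastien".encode("utf-8")
print(len(raw))                  # 10 bytes ('é' takes two bytes in UTF-8)
text = raw.decode("utf-8")
print(len(text))                 # 9 characters

# Decoding with the wrong encoding does not fail, it silently garbles the text
wrong = raw.decode("latin-1")
print(wrong)                     # SÃ©bastien

# Text versus structured data: a JSON payload is only useful once parsed
payload = b'{"id": 25, "name": "pikachu"}'
as_text = payload.decode("utf-8")
as_data = json.loads(as_text)
print(type(as_text).__name__, type(as_data).__name__)   # str dict
print(as_data["name"])           # pikachu
```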

# Retrieve with requests

In [241]:
from requests import Session
manga = "http://lelscano.com"
s = Session()
r = s.get(manga)
print(f"Request status is {r.status_code},\n"
 f"Content length is {len(r.content)} bytes,\n"
 f"Request encoding is {r.encoding},\n"
 f"Text size is {len(r.text)} chars.")
print(f"Response headers: {r.headers}")

Request status is 200,
Content length is 62512 bytes,
Request encoding is UTF-8,
Text size is 62506 chars.
Response headers: {'Date': 'Fri, 21 Oct 2022 07:01:19 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Set-Cookie': 'mobile_lelscan=0; expires=Sat, 22-Oct-2022 07:01:19 GMT; Max-Age=86400; path=lelscans.net', 'Vary': 'Accept-Encoding', 'CF-Cache-Status': 'DYNAMIC', 'Report-To': '{"endpoints":[{"url":"https:\\/\\/a.nel.cloudflare.com\\/report\\/v3?s=4ipRNPQorsdARL90uBVSLPhlZmbNK4c52W5aqZiw%2FVjv%2BkxU89wm38D9cGA%2FuGRFfrEasJw8yOGjr6VrtPm6Xspmpo1qWa2u4ap3Bq%2BzmwiHe8DGTWUqgFS6Uw3km%2F0%3D"}],"group":"cf-nel","max_age":604800}', 'NEL': '{"success_fraction":0,"report_to":"cf-nel","max_age":604800}', 'Server': 'cloudflare', 'CF-RAY': '75d81f0efe3fd642-CDG', 'Content-Encoding': 'br', 'alt-svc': 'h3=":443"; ma=86400, h3-29=":443"; ma=86400'}


# Retrieve with requests

In [242]:
r.text[:1000]

'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\r\n<html xmlns="http://www.w3.org/1999/xhtml">\r\n<head>\r\n<title>One Piece lecture en ligne scan</title>\r\n\t<meta name="description" content="One Piece Lecture en ligne, tous les scan One Piece." /> \r\n\t<meta name="lelscan" content="One Piece" />\r\n\t<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1" />\r\n\t<meta http-equiv="Content-Language" content="fr" />\r\n\t<meta name="keywords" content="One Piece lecture en ligne, lecture en ligne One Piece, scan One Piece, One Piece scan, One Piece lel, lecture en ligne One Piece, Lecture, lecture,  scan, chapitre, chapitre One Piece, lecture One Piece, lecture Chapitre One Piece, mangas, manga, One Piece, One Piece fr, One Piece france, scans, image One Piece " /> \r\n\t<meta name="subject" content="One Piece lecture en ligne scan" />\r\n\t<meta name="identifier-url" content="https://lelscans.ne

# Retrieve with requests

In [243]:
r.content[:1000]

b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\r\n<html xmlns="http://www.w3.org/1999/xhtml">\r\n<head>\r\n<title>One Piece lecture en ligne scan</title>\r\n\t<meta name="description" content="One Piece Lecture en ligne, tous les scan One Piece." /> \r\n\t<meta name="lelscan" content="One Piece" />\r\n\t<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1" />\r\n\t<meta http-equiv="Content-Language" content="fr" />\r\n\t<meta name="keywords" content="One Piece lecture en ligne, lecture en ligne One Piece, scan One Piece, One Piece scan, One Piece lel, lecture en ligne One Piece, Lecture, lecture,  scan, chapitre, chapitre One Piece, lecture One Piece, lecture Chapitre One Piece, mangas, manga, One Piece, One Piece fr, One Piece france, scans, image One Piece " /> \r\n\t<meta name="subject" content="One Piece lecture en ligne scan" />\r\n\t<meta name="identifier-url" content="https://lelscans.n

# Example: remote file size

In [244]:
def get_size(url):
    s = Session()
    r = s.head(url)
    return int(r.headers['Content-Length'])

In [245]:
url = "http://ftp.crifo.org/debian-cd/current/amd64/iso-dvd/debian-11.5.0-amd64-DVD-1.iso"
get_size(url)

3897638912
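
Note that `Content-Length` is optional: the lelscano response shown earlier, for instance, uses `Transfer-Encoding: chunked` and announces no size, in which case `get_size` would raise a `KeyError`. A more defensive variant could look like the sketch below; the helper `size_from_headers` is hypothetical and works on any mapping of headers.

```python
from typing import Mapping, Optional

def size_from_headers(headers: Mapping[str, str]) -> Optional[int]:
    """Return the announced body size in bytes, or None if it is not announced."""
    # Header names are case-insensitive, so normalize them before the lookup
    lowered = {name.lower(): value for name, value in headers.items()}
    value = lowered.get("content-length")
    if value is not None and value.isdigit():
        return int(value)
    return None

print(size_from_headers({"Content-Length": "3897638912"}))   # 3897638912
print(size_from_headers({"Transfer-Encoding": "chunked"}))   # None
```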

# Example: stream downloading

In [246]:
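from pathlib import Path
def download(source_url, dest_file):
    s = Session()
    # Warning: this disables TLS certificate verification; only do it for hosts you trust
    s.verify = False
    # stream=True avoids loading the whole body in memory
    r = s.get(source_url, stream=True)
    # Stop here on 4xx/5xx instead of saving an error page to disk
    r.raise_for_status()
    dest_file = Path(dest_file)
    with open(dest_file, "wb") as f:
        # Write the body chunk by chunk (8 KiB at a time)
        for chunk in r.iter_content(chunk_size=8192):
            if chunk:
                f.write(chunk)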

In [247]:
url = "https://www-npa.lip6.fr/~tixeuil/m2r/uploads/Main/PROGRES2022_2.pdf"
download(url, 'python.pdf')



### Manipulate data

#### Basic string manipulation

# String manipulation

Even if you have structured data, you need to master basic string manipulations.

In [248]:
txt = " Python is a great language\n but Erlang is pretty cool too!  "
print(txt)

 Python is a great language
 but Erlang is pretty cool too!  


# Split

In [249]:
print(l1:=txt.split())
print(l2:=txt.split('a'))
print(l3:=txt.split('\n'))
print(l4:=txt.split('an'))

['Python', 'is', 'a', 'great', 'language', 'but', 'Erlang', 'is', 'pretty', 'cool', 'too!']
[' Python is ', ' gre', 't l', 'ngu', 'ge\n but Erl', 'ng is pretty cool too!  ']
[' Python is a great language', ' but Erlang is pretty cool too!  ']
[' Python is a great l', 'guage\n but Erl', 'g is pretty cool too!  ']


# Join

In [250]:
print(' '.join(l1))
print('a'.join(l2))
print('\n'.join(l3))
print('AN'.join(l4))

Python is a great language but Erlang is pretty cool too!
 Python is a great language
 but Erlang is pretty cool too!  
 Python is a great language
 but Erlang is pretty cool too!  
 Python is a great lANguage
 but ErlANg is pretty cool too!  


# Other methods

Python `str` objects come with many powerful methods... Try them!

In [251]:
print(', '.join([method_name for method_name in dir(txt) 
                 if callable(getattr(txt, method_name)) and not method_name.startswith('_')]))

capitalize, casefold, center, count, encode, endswith, expandtabs, find, format, format_map, index, isalnum, isalpha, isascii, isdecimal, isdigit, isidentifier, islower, isnumeric, isprintable, isspace, istitle, isupper, join, ljust, lower, lstrip, maketrans, partition, removeprefix, removesuffix, replace, rfind, rindex, rjust, rpartition, rsplit, rstrip, split, splitlines, startswith, strip, swapcase, title, translate, upper, zfill
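
A few of these methods in action, as a quick sketch on the same example string (the selection of methods is arbitrary):

```python
txt = " Python is a great language\n but Erlang is pretty cool too!  "

# strip removes leading/trailing whitespace; splitlines cuts on line breaks
lines = [line.strip() for line in txt.splitlines()]
print(lines)   # ['Python is a great language', 'but Erlang is pretty cool too!']

# count, replace, startswith / endswith
print(txt.count("is"))                               # 2
print(lines[1].replace("Erlang", "Haskell"))         # but Haskell is pretty cool too!
print(lines[0].startswith("Python"), lines[1].endswith("!"))   # True True

# partition splits once, around the first occurrence of the separator
user, sep, domain = "fabien.mathieu@normalesup.org".partition("@")
print(user, domain)                                  # fabien.mathieu normalesup.org
```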


#### Regular Expression

# Regular Expressions

- More complex than basic methods
- More powerful
- Personal advice: don't use RegEx if you can avoid it
- Personal advice #2: sometimes, you cannot avoid it, so you need to learn

# Regular Expressions

Recipe:
- define a *pattern*
- apply pattern to a string
- you can:
  - find all pattern occurrences (`findall`)
  - substitute with another string (`sub`)
  - check if the pattern matches the string (`match`/`fullmatch`)
  - extract parts of the pattern (`group`)
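
The recipe above can be sketched with the standard `re` module. The patterns and strings are chosen for illustration only.

```python
import re

txt = " Python is a great language\n but Erlang is pretty cool too!  "

# findall: all occurrences of "a capitalized word"
print(re.findall(r"\b[A-Z]\w+", txt))           # ['Python', 'Erlang']

# sub: replace every run of whitespace with a single space
print(re.sub(r"\s+", " ", txt).strip())
# Python is a great language but Erlang is pretty cool too!

# match anchors at the start only, fullmatch requires the whole string to match
date = "2022-07-01"
print(re.match(r"\d{4}", date) is not None)     # True
print(re.fullmatch(r"\d{4}", date) is not None) # False

# group: extract the parts captured by parentheses
m = re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", date)
print(m.group(1), m.group(2), m.group(3))       # 2022 07 01
```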

# Defining a pattern

- a, X, 9, < -- ordinary characters just match themselves exactly.
- Some characters have special meanings: . ^ \$ * + ? { [ ] \ | ( ) (details below)
- . (a period) -- matches any single character except newline '\n'
- \w -- (lowercase w) matches a "word" character: a letter or digit or underscore [a-zA-Z0-9_]. \W matches any non-word character. 
- \b -- boundary between word and non-word 
- \s -- (lowercase s) matches a single whitespace character -- space, newline, return, tab, form feed [ \n\r\t\f]. \S (uppercase S) matches any non-whitespace character.

# Defining a pattern

- \t, \n, \r -- tab, newline, return 
- \d -- decimal digit [0-9]
- ^ -- matches the start of the string; \$ -- matches the end of the string. As the first character inside a set, ^ means "not" (e.g. [^@] matches any character but @)
- \ -- inhibits the "specialness" of a character. So, for example, use \\. to match a period or \\\\ to match a backslash. If you are unsure whether a character has a special meaning, such as '@', you can put a backslash in front of it, \\@, to make sure it is treated just as a character

# Defining a pattern

- [] - set of possible characters
- | - or
- {n} - exactly n occurrences.
- () - create group 
- \+ - at least one occurrence.
- \* - zero or more occurrences.
- ? - zero or one occurrence.
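
Each of these building blocks can be tried in isolation. The strings below are arbitrary test inputs; raw strings (`r"..."`) are used so that Python leaves the backslashes untouched.

```python
import re

# Sets, {n} and groups: take a room code such as "26-00/103" apart
m = re.fullmatch(r"([0-9]{2})-([0-9]{2})/([0-9]{3})", "26-00/103")
print(m.groups())                                   # ('26', '00', '103')

# | (or) and ? (zero or one occurrence)
print(re.findall(r"colou?r|couleur", "color colour couleur"))
# ['color', 'colour', 'couleur']

# + (at least one occurrence) versus * (zero or more occurrences)
print(re.findall(r"ab+", "a ab abbb"))              # ['ab', 'abbb']
print(re.findall(r"ab*", "a ab abbb"))              # ['a', 'ab', 'abbb']

# ^ as first character of a set means "not"
print(re.findall(r"[^@]+", "user@host"))            # ['user', 'host']

# re.escape adds the backslashes needed to match a string literally
print(re.escape("pokeapi.co"))                      # pokeapi\.co
```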

# Example: extract Email Information

In [252]:
import re
txt = 'fabien.mathieu@normalesup.org'
pattern = '([^@]+)@([^@]+)'
m = re.fullmatch(pattern, txt)
print(m.groups())

('fabien.mathieu', 'normalesup.org')


# Example: *manual* html parsing

https://www.lip6.fr/recherche/team_membres.php?acronyme=NPA

In [253]:
url = "https://www.lip6.fr/recherche/team_membres.php?acronyme=NPA"
r = s.get(url)
txt = r.text

# Example: *manual* html parsing

In [254]:
pattern1 = r"<table class='annuaire'>(.*?)</table>"
permanents = re.findall(pattern1, txt, re.DOTALL)[0]
permanents

'\n\t<tr>\n\t\t<td><a class="nouser" title="Pas de page personnelle" href="#"></a><a href=\'../actualite/personnes-fiche.php?ident=P224\'>Baey&nbsp;Sébastien</a> (Maître de Conférences, Sorbonne Université)</td>\n\t\t<td class=\'bureau\'>Campus Pierre et Marie Curie 26-00/103</td>\n\t</tr>\n\t<tr>\n\t\t<td><a class="nouser" title="Pas de page personnelle" href="#"></a><a href=\'../actualite/personnes-fiche.php?ident=P144\'>Baynat&nbsp;Bruno</a> (Maître de Conférences, Sorbonne Université)</td>\n\t\t<td class=\'bureau\'>Campus Pierre et Marie Curie 26-00/112</td>\n\t</tr>\n\t<tr>\n\t\t<td><a class="user" title="Page personnelle" href="http://lip6.fr/Lelia.Blin"></a><a href=\'../actualite/personnes-fiche.php?ident=P512\'>Blin&nbsp;Lélia</a> (Maître de Conférences  [HDR], Université d’Évry - Université Paris-Saclay)</td>\n\t\t<td class=\'bureau\'>Campus Pierre et Marie Curie 26-00/122</td>\n\t</tr>\n\t<tr>\n\t\t<td><a class="user" title="Page personnelle" href="http://lip6.fr/Marcelo.Amor

# Example: *manual* html parsing

In [255]:
pattern = r"<a href=.*?>([^<]*?)</a>.*?26-00/([0-9]{3})"
print("\n".join(f"{p[0].replace('&nbsp;', ' ')}: {p[1]}" 
                for p in re.findall(pattern, permanents, re.DOTALL)))

Baey Sébastien: 103
Baynat Bruno: 112
Blin Lélia: 122
Dias de Amorim Marcelo: 109
Fdida Serge: 111
Fladenmuller Anne: 108
Fossati Francesca: 117
Fourmaux Olivier: 103
Friedman Timur: 107
Giovanidis Anastasios: 126
Malouch Naceur: 105
Marzouki Meryem: 105
Potop-Butucaru Maria: 115
Spathis Prométhée: 128
Thai Kim Loan: 114
Tixeuil Sébastien: 113


#### HTML parsing with BeautifulSoup

# BeautifulSoup

A much easier way to manipulate html!
- Make a soup (a navigable version of a string)
- Browse a soup 
- soup.find("tag") / soup.tag (returns soup)
- soup.find_all("tag") / soup("tag") (returns list)
- soup.find("tag", {'attr_name': 'attr_value'})
- soup.contents (list of children)
- soup.attrs: attributes

# BeautifulSoup

Extract text:
- soup.decode_contents(): returns soup as string
- soup.encode_contents(): returns soup as bytes
- soup.text: returns soup as tagless string
- soup['attr_name']: returns attribute value
- soup.name: tag name

# Back to RE example

In [256]:
from bs4 import BeautifulSoup as Soup
soup = Soup(r.text)
nbsp = '\xa0'
entries = soup.table('tr')
names = [ p('a')[1].text.replace(nbsp, ' ') for p in entries ]
rooms = [ p('td')[-1].text.rsplit('/')[1] for p in entries ]
print('\n'.join(f"{n}: {r}" for n, r in zip(names, rooms)))

Baey Sébastien: 103
Baynat Bruno: 112
Blin Lélia: 122
Dias de Amorim Marcelo: 109
Fdida Serge: 111
Fladenmuller Anne: 108
Fossati Francesca: 117
Fourmaux Olivier: 103
Friedman Timur: 107
Giovanidis Anastasios: 126
Malouch Naceur: 105
Marzouki Meryem: 105
Potop-Butucaru Maria: 115
Spathis Prométhée: 128
Thai Kim Loan: 114
Tixeuil Sébastien: 113


# Another example

https://www.lip6.fr/production/publications-type.php?id=-1&annee=2022&type_pub=ART

In [257]:
news = "https://www.lip6.fr/production/publications-type.php?id=-1&annee=2022&type_pub=ART"
soup = Soup(s.get(news).text)

# Another example

In [258]:
soup.find('li', {'class': 'D700'})

<li class="D700"><strong>P. Amestoy, O. Boiteau, A. Buttari, M. Gerest, F. Jézéquel, J.‑Y. L'Excellent, Th. Mary</strong> : “<a href="https://hal.archives-ouvertes.fr/hal-03251738">Mixed Precision Low Rank Approximations and their Application to Block Low Rank LU Factorization</a>”, IMA Journal of Numerical Analysis, (Oxford University Press (OUP)) [Amestoy 2022b]</li>

# Another example

In [259]:
for p in soup.find_all('li', {'class': 'D700'})[:5]:
    a = p.a
    print(a.text)
    print(a['href'])

Mixed Precision Low Rank Approximations and their Application to Block Low Rank LU Factorization
https://hal.archives-ouvertes.fr/hal-03251738
Proposer un jeu sérieux pour former à l’inclusion : retour d’expérience en France
https://hal.archives-ouvertes.fr/hal-03596832
Lessons Learned and Future Directions of MetaTutor: Leveraging Multichannel Data to Scaffold Self-Regulated Learning With an Intelligent Tutoring System
https://hal.archives-ouvertes.fr/hal-03701172
On Polynomial Modular Number Systems over $ \mathbb{Z}/{p}\mathbb{Z} $
https://hal.archives-ouvertes.fr/hal-03611829
Multistage knapsack
https://hal.archives-ouvertes.fr/hal-03660984


#### XML

# XML

- A human-readable way to represent data
- Introduced as a generalization / normalization of HTML
- Extensible Markup Language (XML) 
- Serializable (can be directly loaded/dumped from string)
- Used by many languages

# XML specification

XML is made of markups similar to HTML:
- tag: something that starts with < and ends with >.
  - start-tag, such as `<section>`
  - end-tag, such as `</section>`
  - empty-element tag, such as `<line-break />`
- element: an empty-element tag, or everything from a start-tag to its matching end-tag (tags included)
- content: everything between a start-tag and its matching end-tag (tags excluded). Can contain text and/or element(s)
- attribute: key-value pair stored inside a start-tag or an empty-element tag.
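
To connect this vocabulary with code, here is a small sketch using the standard `xml.etree.ElementTree` module on a made-up one-line document (the snippet and its `id` attribute are invented for the example):

```python
import xml.etree.ElementTree as ET

# One element with an attribute, some text, a child element and an empty-element tag
snippet = '<section id="intro">Hello <b>world</b><line-break /></section>'
root = ET.fromstring(snippet)

print(root.tag)          # section           (name of the start-tag)
print(root.attrib)       # {'id': 'intro'}   (attributes of the start-tag)
print(repr(root.text))   # 'Hello '          (text content before the first child)

# The content also holds child elements
for child in root:
    print(child.tag, repr(child.text))
# b 'world'
# line-break None
```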

# Example #1

In [260]:
xml = """<?xml version="1.0" encoding="UTF-8"?>
<note> 
<to>Tove</to>
<from>Jani</from> 
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body> 
</note>
"""

# Example #2

In [261]:
xml = """<?xml version="1.0"?> <data> 
<country name="Liechtenstein"> <rank>1</rank>
<year>2008</year> <gdppc>141100</gdppc>
<neighbor name="Austria" direction="E"/> 
<neighbor name="Switzerland" direction="W"/> 
</country>
<country name="Singapore"> 
<rank>4</rank>
<year>2011</year> <gdppc>59900</gdppc>
<neighbor name="Malaysia" direction="N"/> 
</country>
<country name="Panama"> 
<rank>68</rank>
<year>2011</year> <gdppc>13600</gdppc>
<neighbor name="Costa Rica" direction="W"/> 
<neighbor name="Colombia" direction="E"/> 
</country> </data> 
"""
with open('data.xml', 'wt') as f:
    f.write(xml)

# Parsing XML with the xml package

`xml.etree.ElementTree` loads the whole file; you can then navigate the tree structure.

In [263]:
import xml.etree.ElementTree as ET
root = ET.parse('data.xml').getroot()
print(f"Main tag: {root.tag}; main attributes: {root.attrib}")
print(f"Text of second element of first element: {root[0][1].text}")
for child in root:
    print(child.tag, child.attrib)
for n in root.iter('neighbor'):
    print(n.attrib)

Main tag: data; main attributes: {}
Text of second element of first element: 2008
country {'name': 'Liechtenstein'}
country {'name': 'Singapore'}
country {'name': 'Panama'}
{'name': 'Austria', 'direction': 'E'}
{'name': 'Switzerland', 'direction': 'W'}
{'name': 'Malaysia', 'direction': 'N'}
{'name': 'Costa Rica', 'direction': 'W'}
{'name': 'Colombia', 'direction': 'E'}


# Parsing XML with the xml package

- You can also load from string (instead of from file)
- For very large files, you may want to parse the file incrementally (e.g. with `ET.iterparse`) instead of loading the full content

In [264]:
root = ET.fromstring(xml) 
print(f"Main tag: {root.tag}; main attributes: {root.attrib}")

Main tag: data; main attributes: {}
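
For the large-file case, a possible approach is `ET.iterparse`, which yields elements as soon as they are complete. The sketch below uses a shortened copy of the country data held in memory (`io.BytesIO` stands in for a real file); it illustrates the pattern and is not tuned for any particular file.

```python
import io
import xml.etree.ElementTree as ET

# Shortened copy of the example data; BytesIO makes it look like a file
xml_bytes = b"""<?xml version="1.0"?>
<data>
  <country name="Liechtenstein"><rank>1</rank></country>
  <country name="Singapore"><rank>4</rank></country>
  <country name="Panama"><rank>68</rank></country>
</data>"""

ranks = {}
# An 'end' event is emitted each time an element has been read completely
for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
    if elem.tag == "country":
        ranks[elem.get("name")] = int(elem.findtext("rank"))
        elem.clear()   # drop the children we no longer need to save memory

print(ranks)   # {'Liechtenstein': 1, 'Singapore': 4, 'Panama': 68}
```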


# Parsing XML with BeautifulSoup

In [265]:
soup = Soup(xml)
soup

<?xml version="1.0"?><html><body><data>
<country name="Liechtenstein"> <rank>1</rank>
<year>2008</year> <gdppc>141100</gdppc>
<neighbor direction="E" name="Austria"></neighbor>
<neighbor direction="W" name="Switzerland"></neighbor>
</country>
<country name="Singapore">
<rank>4</rank>
<year>2011</year> <gdppc>59900</gdppc>
<neighbor direction="N" name="Malaysia"></neighbor>
</country>
<country name="Panama">
<rank>68</rank>
<year>2011</year> <gdppc>13600</gdppc>
<neighbor direction="W" name="Costa Rica"></neighbor>
<neighbor direction="E" name="Colombia"></neighbor>
</country> </data>
</body></html>

# Parsing XML with BeautifulSoup

In [266]:
print("\n".join( c.name+" "+str(c.attrs) for c in soup.data.contents if c.name))

country {'name': 'Liechtenstein'}
country {'name': 'Singapore'}
country {'name': 'Panama'}


In [267]:

print("\n".join( str(c.attrs) for c in soup('neighbor')))

{'name': 'Austria', 'direction': 'E'}
{'name': 'Switzerland', 'direction': 'W'}
{'name': 'Malaysia', 'direction': 'N'}
{'name': 'Costa Rica', 'direction': 'W'}
{'name': 'Colombia', 'direction': 'E'}


#### JSON

# JSON

- A simple way to represent data
- Introduced for Javascript (JavaScript Object Notation)
- More compact than HTML/XML, but still easy to read by humans
- Anything that can be represented in XML can be represented in JSON
- Serializable (can be directly dumped into a string)
- Widely used by many languages

# JSON specification

JSON data can be:
- A number
- A string
- A boolean, `true` or `false` (-> `True` or `False` in Python)
- An ordered list of elements (-> Python `list`)
- A collection of key–value pairs where the keys are strings (-> Python `dict`)
- null: an empty value, using the word `null` (-> `None`)
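
The mapping between JSON values and Python types can be checked directly with the standard `json` module. The document below is made up to contain one value of each kind.

```python
from json import loads

doc = loads('{"n": 25, "x": 1.5, "s": "pikachu", "b": true, "l": [1, 2], "o": {"k": null}}')

# Print the Python type obtained for each JSON value
for key, value in doc.items():
    print(key, type(value).__name__, value)
# n int 25
# x float 1.5
# s str pikachu
# b bool True
# l list [1, 2]
# o dict {'k': None}
```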

# Example: from/to string

In [270]:
from json import loads, dumps
dumps(['aéçèà',1234,[4,5,6], {'key1': None, 'key2': True}])

'["a\\u00e9\\u00e7\\u00e8\\u00e0", 1234, [4, 5, 6], {"key1": null, "key2": true}]'

In [271]:
loads('["a\\u00e9\\u00e7\\u00e8\\u00e0", 1234, [4, 5, 6], {"key1": null, "key2": true}]')

['aéçèà', 1234, [4, 5, 6], {'key1': None, 'key2': True}]
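
Two details are worth knowing when going back and forth, sketched below: `dumps` escapes non-ASCII characters unless told otherwise, and the round trip does not preserve every Python type (the dictionary with an integer key and a tuple is a made-up example).

```python
from json import loads, dumps

data = ['aéçèà', 1234, [4, 5, 6], {'key1': None, 'key2': True}]

# ensure_ascii=False keeps accented characters readable
print(dumps(data, ensure_ascii=False))
# ["aéçèà", 1234, [4, 5, 6], {"key1": null, "key2": true}]

# JSON keys are always strings and JSON has no tuples
round_trip = loads(dumps({1: (2, 3)}))
print(round_trip)   # {'1': [2, 3]}
```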

# Example: from/to file

In [272]:
from json import load, dump
data = {}
data['people'] = []
data['people'].append({'name': 'Mark', 'website': 'facebook.com'})
data['people'].append({'name': 'Larry', 'website': 'google.com'})
data['people'].append({'name': 'Tim', 'website': 'apple.com'})
with open('data.json', 'wt') as f:
    dump(data, f)

# Example: from/to file

In [273]:
with open('data.json', 'rt') as f:
    raw = f.read()
raw

'{"people": [{"name": "Mark", "website": "facebook.com"}, {"name": "Larry", "website": "google.com"}, {"name": "Tim", "website": "apple.com"}]}'

In [274]:
with open('data.json', 'rt') as f:
    data = load(f)
print('\n'.join( f"Name: {p['name']}; Website: {p['website']}" for p in data['people']))

Name: Mark; Website: facebook.com
Name: Larry; Website: google.com
Name: Tim; Website: apple.com


#### CSV

# CSV

- CSV: Comma Separated Values
- Cheap format for tables
- Each line is a row
- Columns are separated by a delimiter (usually, but not necessarily, a comma)
- First row may contain header names
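
As an illustration of a non-comma delimiter, here is a sketch that reads a small semicolon-separated table with the standard `csv` module. The data is invented for the example (decimal commas, one quoted field containing the delimiter).

```python
import csv
from io import StringIO

# Made-up table: ';' as delimiter, ',' as decimal mark, quotes around a field with ';'
raw = 'name;price;comment\nBig Mac;4,70;"tasty; filling"\nFries;2,10;\n'

rows = list(csv.reader(StringIO(raw), delimiter=';'))
header, *records = rows
print(header)    # ['name', 'price', 'comment']
print(records)   # [['Big Mac', '4,70', 'tasty; filling'], ['Fries', '2,10', '']]

# Every field is read as a string: conversion to numbers is up to you
prices = {name: float(price.replace(',', '.')) for name, price, _ in records}
print(prices)    # {'Big Mac': 4.7, 'Fries': 2.1}
```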

# Example: the Big Mac index

https://github.com/TheEconomist/big-mac-data

In [275]:
url = "https://github.com/TheEconomist/big-mac-data/raw/master/source-data/big-mac-source-data.csv"
big_mac = s.get(url).text
print(big_mac[:300])

name,iso_a3,currency_code,local_price,dollar_ex,GDP_dollar,date
Argentina,ARG,ARS,2.5,1,,2000-04-01
Australia,AUS,AUD,2.59,1.68,,2000-04-01
Brazil,BRA,BRL,2.95,1.79,,2000-04-01
Britain,GBR,GBP,1.9,0.632911392,,2000-04-01
Canada,CAN,CAD,2.85,1.47,,2000-04-01
Chile,CHL,CLP,1260,514,,2000-04-01
China,C


# Using the csv module

In [276]:
from io import StringIO # Make a string look like a file
import csv
with StringIO(big_mac) as csvfile:
    r = csv.reader(csvfile)
    for row in r:
        # Columns: 0 = name, 3 = local_price, 6 = date
        if row[0] == "France":
            print(','.join([row[0], row[3], row[6]]))

France,3.5,2011-07-01
France,3.6,2012-01-01
France,3.6,2012-07-01
France,3.6,2013-01-01
France,3.9,2013-07-01
France,3.8,2014-01-01
France,3.9,2014-07-01
France,3.9,2015-01-01
France,4.1,2015-07-01
France,4.1,2016-01-01
France,4.1,2016-07-01
France,4.1,2017-01-01
France,4.1,2017-07-01
France,4.2,2018-01-01
France,4.2,2018-07-01
France,4.2,2019-01-01
France,4.2,2019-07-09
France,4.2,2020-01-14
France,4.2,2020-07-01
France,4.2,2021-01-01
France,4.3,2021-07-01
France,4.35,2022-01-01
France,4.7,2022-07-01


# Using the csv module

In [277]:
with StringIO(big_mac) as csvfile:
    r = csv.reader(csvfile)
    for i, row in enumerate(r):
        print(','.join([row[0], row[3], row[6]]))
        if i > 6:
            break

name,local_price,date
Argentina,2.5,2000-04-01
Australia,2.59,2000-04-01
Brazil,2.95,2000-04-01
Britain,1.9,2000-04-01
Canada,2.85,2000-04-01
Chile,1260,2000-04-01
China,9.9,2000-04-01


# Using the csv module

In [278]:
with StringIO(big_mac) as csvfile:
    r = csv.DictReader(csvfile)
    for row in r:
        if row['currency_code'] == 'EUR' and '2022-07' in row['date']:
            print(row['name'], row['local_price'])

Austria 4.35
Belgium 4.6
Germany 4.58
Spain 4.58
Estonia 3.4
Euro area 4.65
Finland 5.25
France 4.7
Greece 4
Ireland 5
Italy 5.1
Lithuania 3.05
Latvia 3
Netherlands 4.5
Portugal 4
Slovakia 3.9
Slovenia 3.2


# Pandas

In [279]:
import pandas as pd
with StringIO(big_mac) as csvfile:
    df = pd.read_csv(csvfile)
df

|  | name | iso_a3 | currency_code | local_price | dollar_ex | GDP_dollar | date |
| --: | :-- | :-- | :-- | --: | --: | --: | :-- |
| 0 | Argentina | ARG | ARS | 2.50 | 1.000000 |  | 2000-04-01 |
| 1 | Australia | AUS | AUD | 2.59 | 1.680000 |  | 2000-04-01 |
| 2 | Brazil | BRA | BRL | 2.95 | 1.790000 |  | 2000-04-01 |
| 3 | Britain | GBR | GBP | 1.90 | 0.632911 |  | 2000-04-01 |
| 4 | Canada | CAN | CAD | 2.85 | 1.470000 |  | 2000-04-01 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1943 | Uruguay | URY | UYU | 255.00 | 41.910000 | 15169.153 | 2022-07-01 |
| 1944 | United States | USA | USD | 5.77 | 1.000000 | 63078.471 | 2022-07-01 |
| 1945 | Venezuela | VEN | VES | 10.00 | 5.673200 | 1690.659 | 2022-07-01 |
| 1946 | Vietnam | VNM | VND | 69000.00 | 23417.000000 | 3520.738 | 2022-07-01 |


#### Other formats

# Other formats

- All widely used formats have a Python package to manipulate them
- xls, xlsx -> pandas, XlsxWriter
- pdf -> pdfminer.six
- ... (look for it when you need it)

#### Python packages dedicated to a website

# Wikipedia

In [280]:
from wikipedia import page
r = page("Python (programming language)")
print(r.summary)

Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation.Python is dynamically-typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming. It is often described as a "batteries included" language due to its comprehensive standard library.Guido van Rossum began working on Python in the late 1980s as a successor to the ABC programming language and first released it in 1991 as Python 0.9.0. Python 2.0 was released in 2000 and introduced new features such as list comprehensions, cycle-detecting garbage collection, reference counting, and Unicode support. Python 3.0, released in 2008, was a major revision that is not completely backward-compatible with earlier versions. Python 2 was discontinued with version 2.7.18 in 2020.Python consistently ranks as one of the most popular programming languag

# Google Scholar

In [281]:
from scholarly import scholarly
searcher = next(scholarly.search_author("Sebastien Tixeuil"))
searcher['interests']

['Distributed Computing', 'Computer Networks', 'Algorithms & Theory']

In [282]:
searcher = next(scholarly.search_author("Fabien Mathieu"))
searcher['interests']

['Graphs', 'P2P networks', 'queuing systems', 'ranking algorithms']

## Server Side... next week!