In [1]:
%%html
<style>
.prompt_container { display: none !important; }
.prompt { display: none !important; }
.run_this_cell { display: none !important; }

.rise-enabled .reveal .slides,
.rise-enabled .reveal .slides h1,
.rise-enabled .reveal .slides h2,
.rise-enabled .reveal .slides h3,
.rise-enabled .reveal .slides h4,
.rise-enabled .reveal .slides h5,
.rise-enabled .reveal .slides h6

{

  
    font-family: "Domestic Manners", sans-serif !important;
}

.slides {
    position: absolute;
    top: 0;
    left: 0;
}

</style>

# ProgRes, Part II

# Web services

Fabien Mathieu - fabien.mathieu@normalesup.org

Sébastien Tixeuil - Sebastien.Tixeuil@lip6.fr

# Roadmap

- Part I: done
- Part II (Web services)
  - This week:
    - Definitions / Reminders (OSI, HTTP, REST...)
    - Client side
       - Retrieve content
       - Manipulate content
  - Next week: Server side
- Part III: P2P

# Methodology

- Course and practicals are written as notebooks (Jupyter Notebook or JupyterLab)
- This means you will submit your practicals as notebooks.
- Please put your name(s) **in the filename AND inside** the notebook as well!
- Practicals: for some advanced optional questions, you may add some traditional `.py` files (companion packages) -> `zip`
- Mini-projects: `zip` with a mix of notebooks and `.py` files is expected.

# Deontology

One unique rule: quote your sources!
- You can Google your answers (Wikipedia, Stack Overflow, ...)
- You can ChatGPT your answers (at your own risk)
- You are encouraged to discuss with other students during **practicals**

BUT:
- If you quote but obviously didn't understand what you quoted, don't expect full marks.
- Not quoting is called *plagiarism*. Don't complain if you get caught.
- Mini-Projects: cross-groups "*cooperation*" will be sanctioned

# Jupyter notebook?

A notebook is just a text file with extension `.ipynb` that contains cells.
- Two main types of cells:
  - Markdown cells to write formatted text. You can itemize or write maths like $\frac{\sqrt{\pi}}{2}$
  - Code cells to execute Python code
- This is a markdown cell

In [None]:
# This is a code cell
x = 1+1

In [None]:
# Code cells share the same workspace
x

# Using Jupyter Notebook

Two modes:
- Command mode (blue). Hit `esc` to enter it
- Edit mode (green). Hit `enter` on a cell to edit
- There are many shortcuts (hit `H` in command mode to see them)

## What is a Web Service?

# Reminder: the OSI model

| Layer | Protocol Data Unit (PDU) | Function |
| :-- | --- | :-: |
| 7 (Application) | Data | High-level protocols such as for resource sharing or remote file access, e.g. HTTP. |
| 6 (Presentation) | Data | Translation of data between a networking service and an application;<br>including character encoding, data compression and encryption/decryption. |
| 5 (Session) | Data | Managing communication sessions, i.e.,<br>continuous exchange of information in the form of multiple back-and-forth transmissions between two nodes. |
| 4 (Transport) | Segment, datagram | Reliable transmission of data segments between points on a network,<br>including segmentation, acknowledgement and multiplexing. |
| 3 (Network) | Packet | Structuring and managing a multi-node network, including addressing, routing and traffic control. |
| 2 (Data link) | Frame, PRB | Transmission of data frames between two nodes connected by a physical layer. |
| 1 (Physical) | Bit, Symbol | Transmission and reception of raw bit streams over a physical medium. |

From Wikipedia

# Reminder: the OSI model

- OSI should be seen as a guideline more than strict frontiers
- 1 & 2 (physical layers): depend on the physical medium you use
- 3 & 4 (Internet layers): IP (3), UDP ("3.5"), TCP (4)
- Data layers: HTTP(S), SSH, SMTP, (S)FTP... (some see 5/6/7 distinction as artificial)

# The OSI hourglass

Just for Internet culture

- Many physical implementations
- Many data applications
- One waist: IP

<img src = "https://scx2.b-cdn.net/gfx/news/hires/2011/howtheintern.jpg">

# OSI model and PROGRES

- Part I (past sessions) was about building your own data layers out of sockets
- Part II is about directly using L7 protocols
- Part III will be about overlay networks ("L8")

# HTTP in the old days (Web 1.0)

- Goal: serve a (static) web page
- Client (user on a browser) requests a URL
- Server serves a physical file (or a view of a directory)
- Browser displays the page

- https://files.data.gouv.fr/anssi/ascadv2/
- http://test-debit.free.fr/

- HTTP: HyperText Transfer Protocol
- URL: Uniform Resource Locator

# HTTPS today (Web X.Y)

- Goal: C asks S for R using a URI
- C is anything
- R is anything
- S is a server
- https://pokeapi.co/api/v2/pokemon-form/25/
- https://api.archives-ouvertes.fr/search/?q=authIdHal_s:fabien-mathieu

- HTTPS: HyperText Transfer Protocol Secure
- URI: Uniform Resource Identifier

# Web services

- Client is typically a program executed to access content
  - On client side (JS of a webpage)
  - On server side (to *build* the webpage to return)
- Request is an HTTP(S) method on a URI
- Result is anything (None, html, xml, json, image...)

# HTTP methods

Any HTTP method is like this:
- A request is sent by the client, with some headers (type of request, ...) and possibly a body (input data)
- A response is sent by the server, with some headers (status code, metadata) and possibly a body (output data)
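
The request/response exchange described above can be observed without any network access. The sketch below is only an illustration: it uses the standard library instead of `requests`, and the `EchoHandler` class, the `/hello` path and the `Accept` header are made-up names chosen for the demo.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """Hypothetical toy server: answers any GET with a short text body."""

    def do_GET(self):
        body = f"You asked for {self.path}".encode("utf-8")
        self.send_response(200)  # status code
        self.send_header("Content-Type", "text/plain; charset=utf-8")  # metadata
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)  # response body (output data)

    def log_message(self, format, *args):
        pass  # keep the notebook output quiet


# Port 0 lets the operating system pick a free port
server = HTTPServer(("127.0.0.1", 0), EchoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The request: a method, a path, some headers, and here no body
conn = HTTPConnection("127.0.0.1", server.server_port, timeout=5)
conn.request("GET", "/hello", headers={"Accept": "text/plain"})

# The response: a status code, some headers, and a body
resp = conn.getresponse()
status = resp.status
content_type = resp.getheader("Content-Type")
text = resp.read().decode("utf-8")
conn.close()
server.shutdown()
server.server_close()

print(f"Status: {status}")
print(f"Content-Type: {content_type}")
print(f"Body: {text}")
```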

# HTTP methods: theory

From Wikipedia:
- **GET**: requests data from the target resource. GET requests should only retrieve data. All data is in the URI (useful for caching).
- **HEAD**: like GET, but the server does not actually send the data. Uses include checking whether a page is available through the status code and quickly finding the size of a file (Content-Length).
- **POST**: requests the target to process some resources (data) sent by the client. Posted data is not on the URI. For example, it is used for posting a message to an Internet forum, subscribing to a mailing list, or completing an online shopping transaction.
- **PUT**: requests that the target resource be created or updated using the data enclosed in the request. A distinction from POST is that the client specifies the target location on the server.
- **DELETE**: requests deletion of the target resource.
- **CONNECT**: establishes a TCP/IP tunnel. It is often used to secure connections through one or more HTTP proxies with TLS.
- **OPTIONS**: requests that the target resource list the HTTP methods that it supports. This can be used to check the functionality of a web server.
- **TRACE**: transfers the received request in the response body. That way a client can see what (if any) changes or additions have been made by intermediaries.
- **PATCH**: modifies part(s) of entry. This can save bandwidth by updating a part of a file or document without having to transfer it entirely.

# HTTP methods in practice: GET or POST?

- **GET** can emulate any method (e.g. "http://my.server.co/uri?method=POST&data=...")
- Same for **POST** (which is a GET whose data travels in the request body instead of the URI)
- Most requests seen in the real world are GET or POST

# HTTP methods: GET?

- All request information is visible in the URI
- A human can type a GET request directly in the browser
- Useful for caching / bookmarking
- Good practice: use GET to read

# HTTP methods: POST?

- Some request information is carried inside the request body instead of the URI
- Usually performed by forms / JavaScript
- Useful for sending login / password
- Good practice: use POST to write (and to send passwords)
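
To make the GET/POST difference concrete, here is a small sketch that builds both kinds of request without sending them. It uses the standard library rather than `requests`; the `/search` path and the parameters `q` and `limit` are invented for the illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

base = "http://my.server.co/search"       # placeholder host, hypothetical path
params = {"q": "pikachu", "limit": 5}     # hypothetical parameters

# GET: the data is appended to the URI as a query string, so it is visible and bookmarkable
get_req = Request(f"{base}?{urlencode(params)}", method="GET")

# POST: the same data travels in the request body, the URI stays unchanged
post_req = Request(base, data=urlencode(params).encode("ascii"), method="POST")

for req in (get_req, post_req):
    print(req.get_method(), req.full_url, req.data)
```

Neither request is sent here; passing one of them to `urllib.request.urlopen` would actually perform it.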

# Anatomy of a GET method: inspect

The browser way: Inspect

https://pokeapi.co/api/v2/pokemon-form/25/

# Anatomy of a GET method: within Python

The `requests` package (which we will use intensively) gives you everything.

In [None]:
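from requests import get
# Despite its name, `request` holds the server's response;
# the request that was actually sent is available as `request.request`
request = get("https://pokeapi.co/api/v2/pokemon-form/25/")
print(f"Request headers: {request.request.headers}")
print(f"Response headers: {request.headers}")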

# Anatomy of a GET method: within Python

In [None]:
print(f"Status code: {request.status_code}")
print(f"Method: {request.request.method}")

In [None]:
{k: v for k, v in request.json().items() if k != 'sprites'}

# Anatomy of a POST method: within Python

We will use http://ptsv3.com, a website for testing POST requests.

http://ptsv3.com/t/progres

In [None]:
from requests import post
request = post("http://ptsv3.com/t/progres/post", json={'Hello': 'world', 'ansWer': 42, 'password': 'MyPrivatePassword'})

# Anatomy of a POST method: within Python

http://ptsv3.com/t/progres

In [None]:
print(f"Status code: {request.status_code}")
print(f"Method: {request.request.method}")
print(f"Request headers: {request.request.headers}")
print(f"Response headers: {request.headers}")
print(request.text)

# REST API

- A REpresentational State Transfer (REST) Application Programming Interface (API) is a simple, relatively standardized way of providing web services.
- Response is typically json or xml.

https://pokeapi.co/api/v2/pokemon-form/25/

- Method is GET (implicit)
- https://pokeapi.co/api/ is the base URL of the URI
- v2 is the API version
- pokemon-form/25/ is the actual request (fetch Pikachu!)
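
The decomposition above can be reproduced with the standard library. This is only a sketch: the variable names (`api_root`, `version`, `resource`, `identifier`) are ours, not part of any API.

```python
from urllib.parse import urlsplit

uri = "https://pokeapi.co/api/v2/pokemon-form/25/"
parts = urlsplit(uri)

# Keep the non-empty path segments: ['api', 'v2', 'pokemon-form', '25']
segments = [segment for segment in parts.path.split("/") if segment]
api_root, version, resource, identifier = segments

print(f"Scheme: {parts.scheme}, host: {parts.netloc}")
print(f"API root: {api_root}, version: {version}")
print(f"Resource: {resource}, identifier: {identifier}")
```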

## Client Side

### Retrieve data

# Retrieve data

Data can be:
- Arbitrary bytes (image, pdf, binary...)
- Text
- Structured text (html, json, ...)

You need to adapt:
- Load in memory or save to file?
- Don't load a text as bytes or a json as text!
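
A short sketch of the three levels (bytes, text, structured data), using a made-up record rather than a real server response:

```python
import json

payload = {"name": "Sébastien", "id": 25}   # made-up record
# What travels on the network is always bytes
raw = json.dumps(payload, ensure_ascii=False).encode("utf-8")

text = raw.decode("utf-8")       # bytes -> text requires the right encoding
wrong = raw.decode("latin-1")    # wrong encoding: no error, but garbled accents
data = json.loads(text)          # text -> Python objects

print(len(raw), "bytes vs", len(text), "characters")
print("Wrong decoding:", wrong)
print("Parsed:", data["name"], data["id"] + 1)
```

With `requests`, these three levels roughly correspond to `r.content`, `r.text` and `r.json()`.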

# Retrieve with requests

In [None]:
from requests import Session
manga = "https://lelscans.net"
s = Session() # A Session reuses the underlying connection, which improves performance
r = s.get(manga)
print(f"Response status is {r.status_code},\n"
 f"Content length is {len(r.content)} bytes,\n"
 f"Response encoding is {r.encoding},\n"
 f"Text size is {len(r.text)} chars.")
print(f"Response headers: {r.headers}")

# Retrieve with requests

In [None]:
r.text[:1000]

# Retrieve with requests

In [None]:
r.content[:1000]

# Example: remote file size

In [None]:
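def get_size(url):
    """Return the size in bytes announced by the server, or None if it is not announced."""
    s = Session()
    # requests does not follow redirects on HEAD by default, so ask for it explicitly
    r = s.head(url, allow_redirects=True)
    size = r.headers.get('Content-Length')
    return int(size) if size is not None else None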

In [None]:
url = "http://ftp.crifo.org/debian-cd/current/amd64/iso-dvd/debian-12.2.0-amd64-DVD-1.iso"
get_size(url)

# Example: stream downloading

In [None]:
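from pathlib import Path
def download(source_url, dest_file):
    s = Session()
    # Disables TLS certificate checks: insecure, keep it only for servers with a broken certificate
    s.verify = False
    # stream=True: the body is fetched chunk by chunk instead of being loaded in memory
    r = s.get(source_url, stream=True)
    r.raise_for_status()  # do not save an error page to disk
    dest_file = Path(dest_file)
    with open(dest_file, "wb") as f:
        for chunk in r.iter_content(chunk_size=8192):
            if chunk:
                f.write(chunk)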

In [None]:
url = "https://www-npa.lip6.fr/~tixeuil/m2r/uploads/Main/PROGRES2022_2.pdf"
download(url, 'python.pdf')

<a href="./python.pdf">python.pdf (local file)</a>

### Manipulate data

#### Basic string manipulation

# String manipulation

Even if you have structured data, you need to master basic string manipulations.

In [None]:
txt = " Python is a great language\n but Erlang is pretty cool too!  "
print(txt)

# Split

In [None]:
print(l1:=txt.split())
print(l2:=txt.split('a'))
print(l3:=txt.split('\n'))
print(l4:=txt.split('an'))

# Join

In [None]:
print(' '.join(l1))
print('a'.join(l2))
print('\n'.join(l3))
print('AN'.join(l4))

# Other methods

Python `str` objects come with many powerful built-in methods... Try them!

In [None]:
print(', '.join([method_name for method_name in dir(txt) 
                 if callable(getattr(txt, method_name)) and not method_name.startswith('_')]))

#### Regular Expression

# Regular Expressions

- More powerful
- More complex than basic methods
- Personal advice: don't use RegEx if you can avoid it
- Personal advice #2: sometimes, you cannot avoid it, so you need to learn

# Regular Expressions

Recipe:
- define a *pattern*
- apply pattern to a string
- you can:
  - find all pattern occurrences (`findall`)
  - substitute with another string (`sub`)
  - check if the pattern matches the string (`match`/`fullmatch`)
  - extract parts of the match (`group`)
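
The recipe can be sketched on a made-up log line (the string and the pattern below are invented for the illustration):

```python
import re

log = "GET /index.html 200, POST /login 302, GET /missing 404"   # made-up input
# Three groups: method, path, status code
pattern = re.compile(r"(GET|POST) (\S+) (\d{3})")

hits = pattern.findall(log)                   # all occurrences, one tuple of groups each
masked = pattern.sub(r"\1 <path> \3", log)    # substitution, reusing groups 1 and 3
first = pattern.match(log)                    # match anchored at the start of the string
whole = pattern.fullmatch(log)                # None: the pattern does not cover the whole string
method, path, status = first.group(1), first.group(2), first.group(3)

print(hits)
print(masked)
print(method, path, status, whole)
```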

# Defining a pattern

- a, X, 9, < -- ordinary characters just match themselves exactly.
- Some characters have special meanings: . ^ \$ * + ? { [ ] \ | ( ) (details below)
- . (a period) -- matches any single character except newline '\n'
- \w -- (lowercase w) matches a "word" character: a letter or digit or underscore [a-zA-Z0-9_]. \W matches any non-word character. 
- \b -- boundary between word and non-word 
- \s -- (lowercase s) matches a single whitespace character -- space, newline, return, tab, form feed [ \n\r\t\f]. \S (upper case S) matches any non-whitespace character.

# Defining a pattern

- \t, \n, \r -- tab, newline, return 
- \d -- decimal digit [0-9]
- ^ -- matches the start of the string; \$ -- matches the end of the string. Inside a set, a leading ^ means "not" (e.g. [^@] matches any character except @)
- \ -- inhibit the "specialness" of a character. So, for example, use \\. to match a period or \\\\ to match a backslash. If you are unsure whether a character has a special meaning, such as '@', you can put a backslash in front of it, \@, to make sure it is treated just as a character
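
A quick sketch of why escaping matters, using `re.escape` from the standard library to do the escaping automatically (the test strings are made up):

```python
import re

name = "fabien.mathieu"

# Unescaped: '.' is a wildcard, so 'X' is accepted in its place
unescaped = re.fullmatch(name, "fabienXmathieu")
# Escaped automatically: '.' must now be a literal period
escaped = re.fullmatch(re.escape(name), "fabienXmathieu")
# Escaped by hand, in a raw string to keep the backslash readable
literal = re.fullmatch(r"fabien\.mathieu", "fabien.mathieu")
# Matching a backslash itself
backslashes = re.findall(r"\\", r"C:\temp\file")

print(bool(unescaped), bool(escaped), bool(literal), len(backslashes))
```

Writing patterns as raw strings (`r"..."`) avoids doubling every backslash at the Python level.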

# Defining a pattern

- [] - set of possible characters
- | - or
- {n} - exactly n occurrences.
- () - create group 
- \+ - at least one occurrence.
- \* - zero or more occurrences.
- ? - zero or one occurrence. After another quantifier, it makes the match non-greedy (`.*` is greedy, `.*?` is not).
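
A few of these building blocks in action, on made-up strings:

```python
import re

html = "<b>bold</b> and <i>italic</i>"
greedy = re.findall(r"<.*>", html)     # as long as possible: a single match
lazy = re.findall(r"<.*?>", html)      # as short as possible: one match per tag

office = "room 25-26/318"
words = re.findall(r"\b\w+\b", office)   # runs of word characters
pairs = re.findall(r"\d{2}", office)     # exactly two digits, non-overlapping
only_digits = [bool(re.search(r"^\d+$", s)) for s in ("318", "318b")]
choice = re.findall(r"cat|dog", "cat, dog, bird")
vowels = re.findall(r"[aeiou]", "regex")

print(greedy, lazy, sep="\n")
print(words, pairs, only_digits, choice, vowels, sep="\n")
```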

# Example: extract Email Information

In [None]:
import re
txt = 'fabien.mathieu@normalesup.org'
pattern = '([^@]+)@([^@]+)'
m = re.fullmatch(pattern, txt)
print(m.groups())

# Example: *manual* html parsing

https://www.lip6.fr/recherche/team_membres.php?acronyme=NPA

In [None]:
url = "https://www.lip6.fr/recherche/team_membres.php?acronyme=NPA"
r = s.get(url)
txt = r.text

# Example: *manual* html parsing

In [None]:
pattern1 = r"<table class='annuaire'>(.*?)</table>"
permanents = re.findall(pattern1, txt, re.DOTALL)[0]
permanents

# Example: *manual* html parsing

In [None]:
pattern = r"<a href=.*?>([^<]*?)</a>.*?([0-9]{2}-[0-9]{2})/([0-9]{3})"
print("\n".join(f"{p[0].replace('&nbsp;', ' ')}: corridor {p[1]}, room {p[2]}" 
                for p in re.findall(pattern, permanents, re.DOTALL)))

#### HTML parsing with BeautifulSoup

# BeautifulSoup

A much easier way to manipulate html!
- Make a soup (a navigable version of a string)
- Browse a soup
  - soup.find("tag") / soup.tag (returns a soup)
  - soup.find_all("tag") / soup("tag") (returns a list)
  - soup.find("tag", {'attr_name': 'attr_value'})
  - soup.contents (list of children)
  - soup.attrs: attributes

# BeautifulSoup

Extract text:
- soup.decode_contents(): returns soup as string
- soup.encode_contents(): returns soup as bytes
- soup.text: returns soup as tagless string
- soup['attr_name']: returns attribute value
- soup.name: tag name
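
The calls listed above can be tried offline on a small HTML snippet. The snippet below is entirely made up (names, links and offices are invented), and the parser is set to Python's built-in `html.parser`, which is one possible choice among several.

```python
from bs4 import BeautifulSoup

# Made-up HTML, loosely shaped like a directory table
html = """
<table class="annuaire">
  <tr><td><a href="/people/ada">Ada Lovelace</a></td><td class="bureau">25-26/318</td></tr>
  <tr><td><a href="/people/alan">Alan Turing</a></td><td class="bureau">25-26/320</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

table = soup.find("table", {"class": "annuaire"})   # first matching tag
rows = table.find_all("tr")                         # list of matching tags
first_link = rows[0].a                              # shortcut for rows[0].find("a")

names = [row.a.text for row in rows]
hrefs = [row.a["href"] for row in rows]
offices = [row.find("td", {"class": "bureau"}).text for row in rows]

print(first_link.name, first_link.attrs)
print(list(zip(names, hrefs, offices)))
```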

# Back to RE example

In [None]:
from bs4 import BeautifulSoup as Soup
soup = Soup(r.text, 'html.parser')  # naming the parser explicitly avoids a warning
nbsp = '\xa0'
entries = soup.table('tr')
names = [ p('a')[1].text.replace(nbsp, ' ') for p in entries ]
locations = [p.find('td', {'class': 'bureau'}).text.split()[-1].split('/') for p in entries]
print('\n'.join(f"{n}: corridor {l[0]}, room {l[1]} " for n, l in zip(names, locations)))

# Another example

https://www.lip6.fr/production/publications-type.php?id=-1&annee=2022&type_pub=ART

In [None]:
news = "https://www.lip6.fr/production/publications-type.php?id=-1&annee=2022&type_pub=ART"
soup = Soup(s.get(news).text, 'html.parser')  # naming the parser explicitly avoids a warning

# Another example

In [None]:
# First article
soup.find('li', {'class': 'D700'})

# Another example

In [None]:
# The 5 first articles: names and URL
for p in soup.find_all('li', {'class': 'D700'})[:5]:
    a = p.a
    print(a.text)
    print(a['href'])

#### XML

# XML

- A human-readable way to represent data
- Introduced as a generalization / normalization of HTML
- Extensible Markup Language (XML) 
- Serializable (can be directly loaded/dumped from string)
- Used by many languages

# XML specification

XML is made of markups similar to HTML:
- tag: something that starts with < and ends with >.
  - start-tag, such as `<section>`
  - end-tag, such as `</section>`
  - empty-element tag, such as `<line-break />`
- element: an empty-element tag, or everything from a start-tag to its matching end-tag (tags included)
- content: everything between a start-tag and its matching end-tag (tags excluded). Can contain text and/or element(s)
- attribute: key-value pair stored inside a start-tag or an empty-element tag.
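
These notions map directly onto what a parser exposes. A minimal sketch with the standard library, on a made-up one-line document:

```python
import xml.etree.ElementTree as ET

# Made-up document: one element with an attribute, a child with text, and an empty-element tag
doc = '<section id="intro"><title>Hello</title><line-break /></section>'
root = ET.fromstring(doc)

children = [child.tag for child in root]   # the content of <section> is made of two elements
print(f"Tag: {root.tag}, attributes: {root.attrib}")
print(f"Children: {children}")
print(f"Text of <title>: {root[0].text}, text of <line-break />: {root[1].text}")
```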

# Example #1

In [None]:
xml = """<?xml version="1.0" encoding="UTF-8"?>
<note> 
<to>Tove</to>
<from>Jani</from> 
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body> 
</note>
"""

# Example #2

In [None]:
xml = """<?xml version="1.0"?> <data> 
<country name="Liechtenstein"> <rank>1</rank>
<year>2008</year> <gdppc>141100</gdppc>
<neighbor name="Austria" direction="E"/> 
<neighbor name="Switzerland" direction="W"/> 
</country>
<country name="Singapore"> 
<rank>4</rank>
<year>2011</year> <gdppc>59900</gdppc>
<neighbor name="Malaysia" direction="N"/> 
</country>
<country name="Panama"> 
<rank>68</rank>
<year>2011</year> <gdppc>13600</gdppc>
<neighbor name="Costa Rica" direction="W"/> 
<neighbor name="Colombia" direction="E"/> 
</country> </data> 
"""
with open('data.xml', 'wt') as f:
    f.write(xml)

# Parsing XML with the xml package

`xml.etree.ElementTree` loads the whole file; you can then navigate the tree structure.

In [None]:
import xml.etree.ElementTree as ET
root = ET.parse('data.xml').getroot()
print(f"Main tag: {root.tag}; main attributes: {root.attrib}")
print(f"Text of second element of first element: {root[0][1].text}")
for child in root:
    print(child.tag, child.attrib)
for n in root.iter('neighbor'):
    print(n.attrib)

# Parsing XML with the xml package

You can also load from string (instead of from file)

In [None]:
root = ET.fromstring(xml) 
print(f"Main tag: {root.tag}; main attributes: {root.attrib}")

For very large files, you may want to iterate over the file instead of loading the full content.
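
One way to do this with the standard library is `ET.iterparse`, which yields elements as they are read. The sketch below uses a synthetic in-memory document as a stand-in for a large file; the element names mimic the example above, but the values are made up.

```python
import io
import xml.etree.ElementTree as ET

# Synthetic document standing in for a large file (1000 made-up <country> entries)
big_xml = "<data>" + "".join(
    f'<country name="C{i}"><rank>{i}</rank></country>' for i in range(1000)
) + "</data>"
stream = io.BytesIO(big_xml.encode("utf-8"))   # behaves like a file opened in binary mode

count, total = 0, 0
# "end" events fire once an element and all its children have been read
for event, elem in ET.iterparse(stream, events=("end",)):
    if elem.tag == "country":
        count += 1
        total += int(elem.find("rank").text)
        elem.clear()   # drop the children we no longer need to limit memory use

print(f"{count} countries, sum of ranks: {total}")
```

With a real file, the path (e.g. `'data.xml'`) can be passed to `ET.iterparse` instead of the stream.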

# Parsing XML with BeautifulSoup

In [None]:
soup = Soup(xml, features='xml')
soup

# Parsing XML with BeautifulSoup

In [None]:
print("\n".join( c.name+" "+str(c.attrs) for c in soup.data.contents if c.name))

In [None]:

print("\n".join( str(c.attrs) for c in soup('neighbor')))

#### JSON

# JSON

- A simple way to represent data
- Introduced for JavaScript (JavaScript Object Notation)
- More compact than HTML/XML, but still easy to read by humans
- Anything that can be represented in XML can be represented in JSON
- Serializable (can be directly dumped into a string)
- Widely used by many languages

# JSON specification

JSON data can be:
- A number
- A string
- A boolean, `true` or `false` (-> `True` or `False` in Python)
- An ordered list of elements (-> Python `list`)
- A collection of key–value pairs where the keys are strings (-> Python `dict`)
- null: an empty value, using the word `null` (-> `None`)
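
The mapping between JSON values and Python types can be checked directly. The document below is made up for the illustration:

```python
import json

text = ('{"number": 1.5, "string": "abc", "boolean": true, '
        '"list": [1, 2, 3], "object": {"k": "v"}, "nothing": null}')
data = json.loads(text)
types = {key: type(value).__name__ for key, value in data.items()}
print(types)

# The conversion is not perfectly symmetric: JSON keys are always strings
# and JSON has no tuple, so some Python values change type on the way
round_trip = json.dumps({1: "a", "t": (1, 2)})
print(round_trip)
```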

# Example: from/to string

In [None]:
from json import loads, dumps
dumps(['aéçèà',1234,[4,5,6], {'key1': None, 'key2': True}])

In [None]:
loads('["a\\u00e9\\u00e7\\u00e8\\u00e0", 1234, [4, 5, 6], {"key1": null, "key2": true}]')

# Example: from/to file

In [None]:
from json import load, dump
data = {}
data['people'] = []
data['people'].append({'name': 'Mark', 'website': 'facebook.com'})
data['people'].append({'name': 'Larry', 'website': 'google.com'})
data['people'].append({'name': 'Tim', 'website': 'apple.com',})
with open('data.json', 'wt') as f:
    dump(data, f)

# Example: from/to file

In [None]:
with open('data.json', 'rt') as f:
    raw = f.read()
raw

In [None]:
with open('data.json', 'rt') as f:
    data = load(f)
print('\n'.join( f"Name: {p['name']}; Website: {p['website']}" for p in data['people']))

# Fun fact: Jupyter Notebooks are... json

In [None]:
with open('Web_services.ipynb', encoding='utf8') as f:
    this_notebook = load(f)

In [None]:
this_notebook['cells'][:5]
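
Since a notebook is plain JSON, it can be explored like any other dictionary. The sketch below does not read a real file: it uses a hand-written, minimal notebook (real notebooks carry more metadata) to count cell types and list the code sources.

```python
import json
from collections import Counter

# Hand-written minimal notebook, for illustration only
notebook_text = json.dumps({
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {"cell_type": "code", "execution_count": None, "metadata": {},
         "outputs": [], "source": ["x = 1+1"]},
        {"cell_type": "code", "execution_count": None, "metadata": {},
         "outputs": [], "source": ["x"]},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
})

notebook = json.loads(notebook_text)
cell_types = Counter(cell["cell_type"] for cell in notebook["cells"])
code_sources = ["".join(cell["source"]) for cell in notebook["cells"]
                if cell["cell_type"] == "code"]

print(cell_types)
print(code_sources)
```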

#### CSV

# CSV

- CSV: Comma Separated Values
- Cheap format for tables
- Each line is a row
- Columns are separated by a separator (usually but not necessarily comma)
- First row may contain header names
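
A tiny sketch of these rules with the standard `csv` module, on made-up data that uses a semicolon as separator and a header row:

```python
import csv
from io import StringIO

raw = "name;price\nburger;5.15\nfries;1.90\n"   # made-up data

# DictReader uses the first row as header names; the separator is given explicitly
rows = list(csv.DictReader(StringIO(raw), delimiter=";"))
# Every value is read as a string: convert explicitly when numbers are needed
total = sum(float(row["price"]) for row in rows)

print(rows)
print(f"Total: {total:.2f}")
```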

# Example: the Big Mac index

https://github.com/TheEconomist/big-mac-data

In [None]:
url = "https://github.com/TheEconomist/big-mac-data/raw/master/source-data/big-mac-source-data.csv"
big_mac = s.get(url).text
print(big_mac[:300])

# Using the csv module

In [None]:
from io import StringIO # Make a string look like a file
import csv
with StringIO(big_mac) as csvfile:
    r = csv.reader(csvfile)
    for row in r:
        if row[0] == "France":
            print(','.join([row[0], row[3], row[6]]))

# Using the csv module

In [None]:
with StringIO(big_mac) as csvfile:
    r = csv.reader(csvfile)
    for i, row in enumerate(r):
        print(','.join([row[0], row[3], row[6]]))
        if i > 6:
            break

# Using the csv module

In [None]:
with StringIO(big_mac) as csvfile:
    r = csv.DictReader(csvfile)
    for row in r:
        if row['currency_code'] == 'EUR' and '2022-07' in row['date']:
            print(row['name'], row['local_price'])

# Pandas

In [None]:
import pandas as pd
with StringIO(big_mac) as csvfile:
    df = pd.read_csv(csvfile)
df

# Pandas

Method `describe` computes summary statistics (count, mean, quartiles...) for each numeric column.

In [None]:
df.describe()

# Pandas

Method `apply` lets you apply a function to each row (with `axis=1`). Example: compute prices in dollars.

In [None]:
import numpy as np
df['dollar_price'] = df.apply(lambda r: r['local_price']/r['dollar_ex']
                                              if r['dollar_ex'] else np.nan, axis=1)

# Pandas

In [None]:
df['dollar_price'].describe()

# Pandas

In [None]:
df.loc[df['dollar_price'].idxmax()]  # idxmax returns an index label, hence loc

In [None]:
df.loc[df['dollar_price'].idxmin()]  # idxmin returns an index label, hence loc

# Pandas

In [None]:
from matplotlib import pyplot as plt
f, a = plt.subplots(figsize=(15, 5))
df.boxplot(column=['dollar_price'], by='date', figsize=(15, 5), rot=45, ax=a)
a.set_xticklabels(labels=a.get_xticklabels(), ha='right')
plt.ylim([0, 10])
plt.show()

#### Other formats

# Other formats

- All widely used formats have a Python package to manipulate them
- xls, xlsx -> pandas, xlsxwriter
- pdf -> pdfminer.six
- ... (look for it when you need it)

#### Python packages dedicated to a website

# Wikipedia

In [None]:
from wikipedia import page
r = page("Python (programming language)")
print(r.summary)

# Google Scholar

In [None]:
from scholarly import scholarly
searcher = next(scholarly.search_author("Sebastien Tixeuil"))
searcher['interests']

In [None]:
searcher = next(scholarly.search_author("Fabien Mathieu"))
searcher['interests']

## Server Side... next week!