# Textual data scraping and preprocessing

> This course is a reworking of the excellent book designed by Melanie Walsh, [*Introduction to Cultural Analytics & Python*](https://melaniewalsh.github.io/Intro-Cultural-Analytics/welcome.html). Many paragraphs and explanations have been retained without modification.

> **Read this book!** It was itself inspired by web scraping lessons from [Lauren Klein](https://github.com/laurenfklein/emory-qtm340/blob/master/notebooks/class4-web-scraping-complete.ipynb) and [Allison Parrish](https://github.com/aparrish/dmep-python-intro/blob/master/scraping-html.ipynb).

Classically, a distinction is made between different work stages: (1) data production, (2) data processing and (3) data analysis.
This course is a practical and fairly detailed introduction to the first 2 stages: **production and (pre)processing**.

You'll learn how to use open Web services to automatically collect textual data. In doing so, you'll gain a better understanding of how the Web works (HTTP, HTML), discover structured data formats (CSV, JSON, XML) and get hands-on experience of the Python programming language.

The important thing is not necessarily to retain everything, but to gain a better understanding of how this ecosystem works, so that you can gradually determine by yourself the solutions you need to implement to meet your own requirements.


In this lesson, we're going to introduce:

1. how to "scrape" data from the internet with the Python libraries requests and BeautifulSoup.
1. how to preprocess our data with spaCy.

We will cover how to:

* Programmatically access the text of a web page
* Extract information from structured documents (CSV/TSV, HTML, JSON, XML)
* Build collections of texts
* Design preprocessing steps

And along the way, we'll be learning the basics of the Python programming language.

## Why Do We Need To Scrape At All?

Today, written heritage is massively available on the Internet under Free Licenses.

Community initiatives such as Project Gutenberg, based on crowdsourcing, have been succeeded by very large-scale institutional projects exploiting the potential of machine learning for the automatic acquisition of text, including handwritten text. Gallica (the BnF's digital library) offers over 10 million documents online.

Here are a few projects worth your attention:

- [Project Gutenberg](https://www.gutenberg.org/): since December 1971! Project Gutenberg is a library of free electronic versions of physically existing books (>70,000 free eBooks).
- [Wikisource](https://fr.wikisource.org/wiki/Wikisource:Accueil): Wikisource is a digital library of public domain texts, managed as a wiki using the MediaWiki engine (> 360,000 free and open texts).
- [Gallica](https://gallica.bnf.fr/): Gallica is the digital library of the Bibliothèque nationale de France and its partners. It has been freely accessible since 1997, and contains > 10 million documents.
- [HathiTrust](https://www.hathitrust.org/)
- …

This digital heritage opens up unprecedented prospects for the humanities: provided, of course, that we know how to use these services in order to build research corpora. Text is also increasingly natively digital, and researchers often need to automatically follow certain subjects on platforms such as Twitter.

This course will not teach you how to analyze these huge textual corpora, but rather how to build them. It will also show you how to pre-process them, an essential prerequisite for computational analysis.

**Building the collection of Jules Verne novels**

Let's start by building up a small corpus of Jules Verne's works available on Project Gutenberg. In the process, we'll discover HTTP.

## Reading a metadata table using Pandas (CSV/TSV)

![csv](./img/verne_csv.png)

A [comma-separated values](https://en.wikipedia.org/wiki/Comma-separated_values) (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format. A CSV file typically stores tabular data (numbers and text) in plain text, in which case each line will have the same number of fields.

The T in TSV stands for "tab". Tabs are more convenient for separating text values, which may themselves contain commas.
In short, our lesson begins with the reading of a simple TSV metadata table, which is a very common data exchange format.
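
To make the comma problem concrete, here is a minimal sketch using only Python's standard `csv` module. The record (a title containing a comma, plus a language code) is a made-up example and not a row from the lesson's table.

```python
import csv
import io

# Hypothetical record: the title itself contains a comma.
title = "Vingt mille lieues sous les mers, tome 1"
record = [title, "fr"]

# Naively joining with commas makes the comma inside the title look like a separator.
naive_csv_line = ",".join(record)
naive_fields = naive_csv_line.split(",")
print(len(naive_fields), naive_fields)

# Joining with tabs keeps the record unambiguous: the comma is ordinary text.
tsv_line = "\t".join(record)
fields = next(csv.reader(io.StringIO(tsv_line), delimiter="\t"))
print(len(fields), fields)
```

Real CSV files work around this with quoting rules, which the `csv` module and Pandas both handle; tabs simply make quoting unnecessary in most cases.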

=====

**[Pandas](https://pandas.pydata.org/)** is a library written for the Python programming language, enabling data manipulation and analysis. In particular, it offers data structures and array manipulation operations.

Here, we use Pandas to read a [TSV table](https://en.wikipedia.org/wiki/Comma-separated_values) and store its data in a [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) (a 2-dimensional array), so that it can be manipulated.

In [None]:
import pandas as pd

In [None]:
urls = pd.read_csv("data/verne.csv", delimiter='\t', encoding='utf-8')
# Get an overview
urls.head()

Here, the list of books and their metadata (download link, language) has been compiled in advance. We'll see later how to collect this information automatically, using a search API such as that provided by BnF.

Let's learn how to read the DataFrame in different ways and to access the values contained in the cells.

In [None]:
# Knowing the size of the df
print(
    f"{'Dimensions':15}"
    f"{'Lines':15}"
    f"{'Columns'}"
)
print(
    f"{str(urls.shape):15}"
    f"{str(urls.shape[0]):15}"
    f"{str(urls.shape[1])}"
)

You can access one or more specific lines or slices using the [`.iloc[]`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) property.

In [None]:
# Access a specific line
urls.iloc[5]

In [None]:
# Access specific lines
urls.iloc[[5, 7, 11]]

In [None]:
# Access a slice of lines
urls.iloc[5:8]

You can get a cell value by position or by column name.

In [None]:
print(
    str(urls.iloc[5, 2]),          # by position: row 5, column 2
    '<=>',
    str(urls.iloc[5]['Title'])     # by column name
)
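
The same access patterns can be tried without the lesson's file. The sketch below builds a tiny DataFrame from an in-memory TSV string; the column names `id` and `lang` and the three rows are assumptions made for illustration (only `Title` is known to exist in the real table).

```python
import io

import pandas as pd

# Hypothetical miniature metadata table (tab-separated), standing in for the real file.
tsv_text = (
    "id\tlang\tTitle\n"
    "1\tfr\tDe la Terre à la Lune\n"
    "2\tfr\tVoyage au centre de la Terre\n"
    "3\ten\tThe Mysterious Island\n"
)
demo = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(demo.shape)               # (number of rows, number of columns)
print(demo.iloc[1, 2])          # row position 1, column position 2
print(demo.iloc[1]["Title"])    # same cell, column selected by name
print(demo.loc[1, "Title"])     # label-based equivalent (default row labels are 0, 1, 2)
```

`.iloc` always counts positions, while `.loc` uses labels; with the default integer index the two happen to coincide for rows.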

Each novel title in this TSV file is paired with a URL for the plain text. How can we actually use these URLs to get computationally tractable text data? Though we could manually navigate to each URL and copy/paste each novel into a file, that would be suuuuper slow and painstaking. It would be much better to programmatically access the text data attached to every URL.

## HTTP Requests and Responses

To programmatically access the text data attached to every URL, we can use a Python library called [requests](https://requests.readthedocs.io/en/master/).

A [Uniform Resource Locator](https://en.wikipedia.org/wiki/URL) (URL, ~ a web address), is a reference to a web resource that specifies its location on a computer network and a mechanism for retrieving it.

Every HTTP URL conforms to the syntax of a generic URI. The example below can be read as 6 parts, organized hierarchically:

`http://www.domain.com:80/path/to/myfile.html?key1=value1&key2=value2#anchor_in_doc`

URL parts:

- **protocol**: `http://` or `https://`.
- **domain name**: `www.domain.com` –instead of a domain name, you can use an IP address.
- **port**: `:80` –indicates the technical "door" to be used to access the server's resources. This fragment is generally absent, as the browser uses the standard ports associated with the protocols (80 for HTTP, 443 for HTTPS).
- **path**: `/path/to/myfile.html` –path, on the web server, to the resource. In the early days of the Web, this path often corresponded to a "physical" path existing on the server. Today, this path is merely an abstraction managed by the web server, and no longer corresponds to a "physical" reality.
- **parameters**: `?key1=value1&key2=value2` –constructed as a list of key/value pairs separated by an ampersand.
- **anchor**: `#anchor_in_doc` –points to a given location in the resource.
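
The parts listed above can be checked with Python's standard `urllib.parse` module. This is a small sketch applied to the example URL; nothing is sent over the network.

```python
from urllib.parse import parse_qs, urlsplit

url = "http://www.domain.com:80/path/to/myfile.html?key1=value1&key2=value2#anchor_in_doc"
parts = urlsplit(url)

print(parts.scheme)     # protocol
print(parts.hostname)   # domain name
print(parts.port)       # port, as an integer
print(parts.path)       # path
print(parts.fragment)   # anchor

# The query string is decoded into a dict mapping each key to a list of values.
parameters = parse_qs(parts.query)
print(parameters)
```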

When you type a URL in your browser's address bar, you're sending an HTTP **request** for a web page, and the server which stores that web page will accordingly send back a **response**, some web page data that your browser will render.

<img src="./img/http.png" width="600px">

In the image below, in the inspector's network tab, you can see that for the URL, 2 HTTP requests received a positive response (status 200).

![404](./img/request-response.png)

### Get HTML Data with Requests

**HTTP**

In the HTTP protocol, a **method** is a command specifying a type of request, i.e. asking the server to perform an action. In general, the action concerns a resource identified by the URL following the method name.

There are many [methods](https://en.wikipedia.org/wiki/HTTP#HTTP/1.1_request_messages), the most common being `GET`, `HEAD` and `POST`:

- `GET`: the most common method for requesting a resource. A GET request has no effect on the resource.
- `HEAD`: this method only requests information about the resource, without actually requesting the resource itself.
- `POST`: this method is used to send data for processing (usually from an HTML form).
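
To see that the method is simply a property attached to a request, here is a sketch that builds three requests without sending them. It uses the standard library's `urllib.request` rather than requests so that it runs offline; the form data `q=verne` is a made-up placeholder.

```python
from urllib.request import Request

url = "https://www.gutenberg.org/cache/epub/4791/pg4791.txt"

# Build (but do not send) three requests that differ only by their method.
get_request = Request(url)                                    # GET is the default when no data is attached
head_request = Request(url, method="HEAD")
post_request = Request(url, data=b"q=verne", method="POST")   # hypothetical form data

for request in (get_request, head_request, post_request):
    print(request.get_method(), request.full_url)
```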

**Requests Python Library**

With the [`.get()` method](https://requests.readthedocs.io/en/latest/api/#requests.get), we can request to "get" web page data for a specific URL, which we will store in a variable called `response`.

In [None]:
import requests

In [None]:
response = requests.get("https://www.gutenberg.org/cache/epub/4791/pg4791.txt")

### HTTP Header Fields

Wikipedia: "[HTTP header fields](https://en.wikipedia.org/wiki/List_of_HTTP_header_fields) are a **list of strings sent and received by both the client program and server on every HTTP request and response. These headers are usually invisible to the end-user and are only processed or logged by the server and client applications. They define how information sent/received through the connection are encoded** (as in Content-Encoding), the session verification and identification of the client (as in browser cookies, IP address, user-agent) or their anonymity thereof (VPN or proxy masking, user-agent spoofing), how the server should handle data (as in Do-Not-Track), the age (the time it has resided in a shared cache) of the document being downloaded, amongst others."


In [None]:
response.headers

Thus, using the `headers` attribute of the [`Response` object](https://requests.readthedocs.io/en/latest/user/advanced/#request-and-response-objects), we can see that the document returned by Project Gutenberg is a plain text file.

But more often than not, as we'll soon see, responses are encoded in HTML.

In [None]:
response.headers['Content-Type']
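
A `Content-Type` value usually combines a media type with optional parameters such as the character encoding. The sketch below splits such a value with the standard library; the header string is an assumed example, and the value actually returned by a server may differ.

```python
from email.message import Message

# Assumed example value; the real header returned by a server may differ.
header_value = "text/plain; charset=utf-8"

message = Message()
message["Content-Type"] = header_value

media_type = message.get_content_type()    # the part before the semicolon
charset = message.get_content_charset()    # the "charset" parameter
print(media_type, charset)
```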

### HTTP Status Code

If we check out `response`, it will simply tell us its [HTTP response code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status), aka whether the request was successful or not.

"200" is a successful response, while "404" is a common "Page Not Found" error.

In [None]:
response

Let's see what happens if we make a mistake entering the URL...  
('page4791' instead of 'pg4791')

In [None]:
bad_response = requests.get("https://www.gutenberg.org/cache/epub/4791/page4791.txt")
bad_response

![404](./img/bad_response.png)
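
Status codes are grouped in families by their first digit. As a small offline sketch, the standard library's `http.HTTPStatus` can translate a numeric code into its standard phrase; the helper name `describe_status` is ours, not part of any library.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short label such as '404 Not Found (client error)'."""
    status = HTTPStatus(code)
    if 200 <= code < 300:
        family = "success"
    elif 300 <= code < 400:
        family = "redirection"
    elif 400 <= code < 500:
        family = "client error"
    elif 500 <= code < 600:
        family = "server error"
    else:
        family = "informational"
    return f"{code} {status.phrase} ({family})"


for code in (200, 404, 500):
    print(describe_status(code))
```

With requests, the numeric code of a response is available as `response.status_code`, so `describe_status(response.status_code)` would label a live response in the same way.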

### Extract Text From Web Page

To actually get at the text data in the response, we need to use the [`.text` property](https://requests.readthedocs.io/en/latest/api/#requests.Response.text), which we will save in a variable called `text_string`.

Project Gutenberg provides here a version of the novel in plain text format (which is very convenient from a pedagogical point of view). But more often, the text data that we get from the Web is formatted in the HTML markup language, which we will talk more about in the BeautifulSoup section below.

In [None]:
text_string = response.text

Here's the text of the novel now in a variable.

In [None]:
print(text_string)

### Extract Text From Multiple Web Pages

Repeating the operation for each novel would be tedious… Let's see how we can extract the text for every URL in the DataFrame at once. To do so, we're going to create a smaller DataFrame containing the first 10 novels –less processing for the demonstration and the planet…

In [None]:
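# .copy() gives an independent DataFrame, so adding a column later does not trigger a SettingWithCopyWarning
sample_urls = urls[:10].copy()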

We're going to make a function called `scrape_novel()` that includes our `requests.get()` and `response.text` code.

In [None]:
def scrape_novel(url):
    response = requests.get(url)
    html_string = response.text
    return html_string

Then we're going to apply it to the "url" column of the DataFrame and create a new "text" column for the resulting extracted text.

In [None]:
sample_urls['text'] = sample_urls['url'].apply(scrape_novel)

In [None]:
sample_urls.head(3)

The DataFrame above is truncated, so we can't see the full contents of the "text" column. But if we print out the rows of the column one at a time, we can see that we successfully extracted text for each URL (though some of these URLs returned 404 errors).

In [None]:
print(sample_urls.iloc[4]['text'])

It's simple and easy! However, that plain text format poses a few problems. It is not possible to automatically distinguish the Jules Verne text from the metadata and editorial paratext. Likewise, all the credit references at the end of the transcription are mixed in with the text, which can skew the analysis.

We need a more structured format that allows us to distinguish between the author's text, the metadata and the editorial paratext.

## Web Scraping

Not all web pages will be as easy to scrape as these Project Gutenberg plain text files, however. Let's say we wanted to scrape the lyrics for NTM's song "[On est encore là](https://genius.com/Supreme-ntm-on-est-encore-la-lyrics)" (1998) from Genius.com.

NB. The teaching sequence is by Melanie Walsh, and has been updated as the data model has evolved...

<img src="./img/ntm.png" class="center" >

Even at a glance, we can tell that this *Genius* web page is a lot more complicated than the *Project Gutenberg* page and that it contains a lot of information beyond the lyrics.

Sure enough, if we use our requests library again and try to grab the data for this web page, the underlying data is much more complicated, too.

In [None]:
response = requests.get("https://genius.com/Supreme-ntm-on-est-encore-la-lyrics")
html_string = response.text
print(html_string)

How can we extract just the song lyrics from this messy soup of a document? Luckily there's a Python library that can help us called [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/), which parses HTML documents.

To understand BeautifulSoup and HTML, we're going to take things one step at a time. We'll start with a very simple example (beginner level) to understand the basic structure of an HTML page. Next, we'll go back to the Jules Verne novels available on Project Gutenberg (intermediate level) before trying to automatically extract the lyrics to the NTM song (advanced level).

### HTML5

This [Singers' singers webpage](todolink) is adapted from Parrish's website titled "[Kittens and the TV Shows They Love](http://static.decontextualize.com/kittens.html)" made for the purposes of teaching BeautifulSoup.

Instant geek: thanks to IPython's [core.display module](https://ipython.org/ipython-doc/2/api/generated/IPython.core.display.html), it's even possible to display the content of a web page in a notebook.

In [None]:
from IPython.display import display, HTML
response = requests.get("https://raw.githubusercontent.com/architexte/cours-data-processing/main/data/punk.html")
html_string = response.text
display(HTML(html_string))

Let's take a look at the structure of this HTML page:

In [None]:
print(html_string)

#### HTML Tags

HTML stands for HyperText Markup Language. It is the standard language for writing web page documents. The most important thing you need to know about HTML is that the language uses HTML "tags" to represent different elements, such as a main header `<h1>`. 

| HTML Tag                | Explanation                    |
|-------------------------|--------------------------------|
| `<!DOCTYPE>`            | Defines document type          |
| `<html>`                | Root of the HTML document      |
| `<head>`                | Metadata about document        |
| `<title>`               | Title for document             |
| `<body>`                | Document body                  |
| `<h1>` to `<h6>`        | Headings                       |
| `<div>`                 | Block section in a document    |
| `<p>`                   | Paragraph                      |
| `<ul>`                  | Unordered list                 |
| `<ol>`                  | Ordered list                   |
| `<li>`                  | List item                      |
| `<br>`                  | Line break                     |
| `<a>`                   | Hyperlink                      |
| `<img>`                 | Image                          |
| `<span>`                | Inline section in a document   |
| `<!--comment here-->`   | Comment                        |

HTML tags often, but not always, require a "closing" tag. For example, the main header "Singers' singers" will be surrounded by `<h1>` (opening tag) and `</h1>` (closing tag) on either side: `<h1>Singers' singers</h1>`

#### HTML Attributes, Classes, and IDs

HTML elements sometimes come with even more information inside a tag. This will often be a keyword (like `class` or `id`) followed by an equals sign `=` and a further descriptor such as `<div class="iggy">`.

We need to know about tags as well as attributes, classes, and IDs because this is how we're going to extract specific HTML data with BeautifulSoup.
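
Before turning to BeautifulSoup, tags and attributes can be made visible with Python's standard `html.parser` module alone. In this sketch, the class `TagLister` and the HTML snippet are illustrative assumptions, loosely modelled on the kind of page used in this lesson.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Collect every opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.found.append((tag, dict(attrs)))


# Hypothetical snippet with a class, an id and an href attribute.
snippet = (
    '<div class="singer" id="iggy"><h2>Iggy Pop</h2>'
    '<p>Singer of <a href="#stooges">The Stooges</a></p></div>'
)

lister = TagLister()
lister.feed(snippet)
for tag, attributes in lister.found:
    print(tag, attributes)
```

BeautifulSoup does this bookkeeping for us and adds convenient search methods on top.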

### BeautifulSoup (Singers’ singers)

In [None]:
from bs4 import BeautifulSoup

To make a BeautifulSoup document, we call `BeautifulSoup()` with two parameters: the `html_string` from our HTTP request and [the kind of parser](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use) that we want to use, which will always be `"html.parser"` for our purposes.

In [None]:
response = requests.get("https://raw.githubusercontent.com/architexte/cours-data-processing/main/data/punk.html")
html_string = response.text
document = BeautifulSoup(html_string, "html.parser")

In [None]:
print(type(document))

In [None]:
document

#### Extract HTML Element

We can use the [`.find()` method](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find) to find and extract certain elements, such as a main header.

In [None]:
document.find("h1")

If we want only the text contained between those tags, we can use `.text` to extract just the text.

In [None]:
document.find("h1").text

In [None]:
type(document.find("h1").text)

Find the HTML element that contains an image.

In [None]:
document.find("img")

In [None]:
document.find("img")['src']

**The `.find()` method returns only one result** (the first one).  
However, we may wish to obtain a list of all the images called up in the page.

#### Extract Multiple HTML Elements

You can also extract multiple HTML elements at a time with the [`.find_all()` method](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all) that returns a list.

In [None]:
document.find_all("img")

With Python, it's easy to use a `for` loop to go through the list:

In [None]:
for img in document.find_all("img"):
    print(img['src'])

It's possible to extract a series of elements according to the value of their attributes (here, `div` elements whose `class` attribute value is `singer`).

In [None]:
document.find_all("div", attrs={"class": "singer"})

In [None]:
document.find("h2").text

In [None]:
document.find_all("h2")

``` {warning}
Heads up! The code below will cause an error.
```
Let's try to extract the text from all the header2 elements:

In [None]:
document.find_all("h2").text

That didn't work! In order to extract text data from multiple HTML elements, we need a `for` loop and some list-building.

In [None]:
all_h2_headers = document.find_all("h2")
all_h2_headers

First we will make an empty list called `h2_headers`.

Then `for` each `header` in `all_h2_headers`, we will grab the `.text`, put it into a variable called `header_contents`, then `.append()` it to our `h2_headers` list.

In [None]:
h2_headers = []
for header in all_h2_headers:
    header_contents = header.text
    h2_headers.append(header_contents)

In [None]:
h2_headers

You can produce the same result in a more "pythonic" way by using a **list comprehension** (shorter syntax):

In [None]:
h2_headers = [header.text for header in all_h2_headers]
h2_headers
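
The difference between `.find()` and `.find_all()` can be reproduced offline on a few lines of HTML. The snippet below is a made-up stand-in for the real page; only the class name `singer` is taken from the lesson.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet; the class name "singer" mirrors the one used above.
html = """
<h1>Singers' singers</h1>
<div class="singer"><h2>Iggy Pop</h2></div>
<div class="singer"><h2>Johnny Cash</h2></div>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("h2")        # a single Tag: .text works
every = soup.find_all("h2")    # a list-like ResultSet of Tags

print(first.text)
titles = [tag.text for tag in every]
print(titles)

try:
    every.text                 # the collection itself has no text
except AttributeError as error:
    print("AttributeError:", error)
```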

#### Inspect HTML Elements with Browser

Most times if you're looking to extract something from an HTML document, it's best to use your "Inspect" capabilities in your web browser. You can hover over elements that you're interested in and find that specific element in the HTML.

<img src="./img/inspect1.png" width="700px">

For example, if we hover over the main link "Johnny Cash":

<img src="./img/inspect2.png" width="700px" >

### Your Turn! (Project Gutenberg)

Ok so now we've learned a little bit about how to use BeautifulSoup to parse HTML documents. So how would we apply what we've learned to extract the text of Jules Verne's novels?

Project Gutenberg does not share its editions in plain text format only: an HTML version is also available. It's a little more difficult to process than the plain text version, but at least we can try to extract the metadata and the author's text alone.

Let's follow our example of the *Voyage au Centre de la Terre*: https://www.gutenberg.org/cache/epub/4791/pg4791.html.

Inspect the page with your browser and try to extract:

- the main header of the page
- the title of the novel
- `p` elements with an `id`

Finally, try to extract only the text of the novel (the one written by Jules Verne).

In [None]:
response = requests.get("https://www.gutenberg.org/cache/epub/4791/pg4791.html")
html_string = response.text
document = BeautifulSoup(html_string, "html.parser")

In [None]:
# Main header
document.find('h2')

In [None]:
# Main header text
document.find('h2').text

There is no `h1` element, which is a semantic oddity... A closer look reveals that the hierarchy of headings is not even respected (`h5` > `h2`).

But note that thanks to BeautifulSoup, you can easily print the text contained in all descendant elements (here `span`). How convenient!

In [None]:
# Title of the novel
document.find("h5").text

In [None]:
# First p element with an id attribute
document.find("p", {"id" : True})

In [None]:
# The `p` element whose id value is 'id02329'.
document.find("p", {"id" : "id02329"})

In [None]:
# A `p` element that has an `id` attribute AND a `style` attribute.
document.find("p", attrs={"id" : True, "style" : True})

In [None]:
# Same, compact syntax
document.find("p", {"id" : True, "style" : True})

In [None]:
# List of the first 10 p elements with an id attribute
document.find_all("p", {"id" : True})[0:10]

In [None]:
#document.find_all(['p', 'h2', 'h5'])
#document.find_all(['p', 'h2', 'h5'], id=True) # awkward: the id filter applies to every tag in the list
#document.select('p[id], h2, h5')

Better, but we don't want to extract either the acknowledgements (`@id` 'id00000'-'id00002') or the editorial notes (`@id` 'id00003'-'id00005')...  
One strategy is to extract only the paragraphs following the novel's title (`h5`):

In [None]:
start_tag = document.find('h5')
start_tag.find_all_next('p')[0:2]

We're making progress, but on closer inspection, we realize that we're forgetting the titles (`h2`). You can pass a list of elements to the `find_all_next()` method:

In [None]:
start_tag = document.find('h5')
start_tag.find_all_next(['p', 'h2'])[0:3]

All that's left is to save the text in a list, simply with a small loop:

In [None]:
verne_text_list = []
for element in start_tag.find_all_next(['p', 'h2']):
    text = element.text
    verne_text_list.append(text)
verne_text_list[0:3]

In [None]:
# Otherwise, print the text directly (here, the first 3 elements).
for element in start_tag.find_all_next(['p', 'h2'])[0:3]:
    print(element.text)

**Issue**

- find all `p` with an id: `document.find_all('p', id=True)`

This syntax cannot select paragraphs with an `id` together with `h2` elements without one: it is impossible to write something like `(['p', id=True, 'h2'])`…  
In this case, you can use the `select()` method:

- select all `p` with an id and all `h2`: `document.select('p[id], h2')`

All you have to do is write:

```python
verne_text_list = [element.text for element in document.select('p[id], h2, h5')]
```
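
The difference between the two approaches can be checked on a few lines of HTML. The snippet below is a made-up stand-in for the novel's page (the `id` values and texts are placeholders).

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one paragraph without id, one h2 and two paragraphs with an id.
html = """
<h5>Title</h5>
<p>Editorial note without id</p>
<h2>I</h2>
<p id="id00010">First paragraph.</p>
<p id="id00011">Second paragraph.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all: only the p elements that carry an id attribute
with_find_all = [element.text for element in soup.find_all("p", id=True)]

# select: a CSS selector list, here "p with an id" OR "any h2", in document order
with_select = [element.text for element in soup.select("p[id], h2")]

print(with_find_all)
print(with_select)
```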

**Summary**

Thanks to HTML, BeautifulSoup and a little trickery, you can extract Jules Verne's text alone. A little more difficult than with the plain text format, but also more subtle: you manage to separate the author's text from the editorial paratext.

Issue: this recipe works for this novel, but what guarantee do we have that it will work for other texts? To automate extractions, we need standardized sources... That's one of the advantages of APIs.

### Your Turn! (Genius)

So how would we apply what we've learned to extract NTM lyrics?

https://genius.com/Supreme-ntm-on-est-encore-la-lyrics

What HTML element do we need to "find" to extract the song lyrics?

In [None]:
response = requests.get("https://genius.com/Supreme-ntm-on-est-encore-la-lyrics")
html_string = response.text
document = BeautifulSoup(html_string, "html.parser")

What do we have in the `p` elements?

In [None]:
ntm_p = document.find_all("p")
print(ntm_p)

What HTML element do we need to "find" to extract the title?

In [None]:
print(document.find('h1').text)

When inspecting the page, we notice that some `div` elements have a `data-lyrics-container` attribute.

In [None]:
print(document.find('div', {"data-lyrics-container": "true"}).text)

**Issue**. We lose the line breaks between the lines of verse. What can we do? The [`get_text()` method](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text) returns all the text in a document (or beneath a tag) as a single Unicode string and lets you specify a string to be used to join the bits of text together…

In [None]:
lyrics = document.find('div', {"data-lyrics-container": "true"}).get_text("\n")
print(lyrics)

**Issue (continued)**. Good idea, but it's really not great because of the segmentation of the annotations...  
We need to be more specific and process all `br` elements:

In [None]:
for br in document.find_all("br"):
    br.replace_with("\n")
lyrics = document.find('div', {"data-lyrics-container": "true"}).text
print(lyrics)

**Issue (end)**. Gee, we only have the beginning of the lyrics, which are spread over several `div` elements...

In [None]:
lyrics = document.find_all('div', {"data-lyrics-container": "true"})
for lyrics_part in lyrics:
    print(lyrics_part.text)

**Reuse**.  
All that remains is to write this code into a small function so that we can reuse it to automatically extract the text of other songs.

In [None]:
def get_lyrics(song_genius_url):
    response = requests.get(song_genius_url)
    document = BeautifulSoup(response.text, "html.parser")
    for br in document.find_all("br"):
        br.replace_with("\n")
    lyrics = document.find_all('div', {"data-lyrics-container": "true"})
    for lyrics_part in lyrics:
        print(lyrics_part.text)

In [None]:
# https://genius.com/Grandmaster-flash-and-the-furious-five-the-message-lyrics
# https://genius.com/De-la-soul-the-magic-number-lyrics
# https://genius.com/Dr-jeckyll-and-mr-hyde-genius-rap-lyrics
# https://genius.com/Supreme-ntm-on-est-encore-la-lyrics
get_lyrics('https://genius.com/Amel-bent-ma-philosophie-lyrics')
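
A possible variant, sketched here under assumptions, returns the lyrics as a string instead of printing them and works on an HTML string, so it can be tried without a network connection. The function name `extract_lyrics` and the sample markup are hypothetical; the markup only imitates the `data-lyrics-container` structure observed above.

```python
from bs4 import BeautifulSoup


def extract_lyrics(html_string):
    """Return the lyrics found in html_string as one string (empty if none)."""
    document = BeautifulSoup(html_string, "html.parser")
    # Turn every <br> into a real line break before reading the text.
    for br in document.find_all("br"):
        br.replace_with("\n")
    containers = document.find_all("div", {"data-lyrics-container": "true"})
    return "\n".join(container.text for container in containers)


# Hypothetical markup imitating the structure observed on the page.
sample = (
    '<div data-lyrics-container="true">Line one<br/><a href="#">Line</a> two</div>'
    '<div class="ad">ignored</div>'
    '<div data-lyrics-container="true">Line three</div>'
)
lyrics_text = extract_lyrics(sample)
print(lyrics_text)
```

Returning a string rather than printing makes it easier to store the result, for example in a DataFrame column.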

**Conclusion**.

Unfortunately, this method is neither reusable nor future-proof… **We need standardized data served via an API.**

## APIs

[Wikipedia](https://en.wikipedia.org/wiki/API): An **application programming interface** (API) is a way for two or more computer programs to communicate with each other. It is a type of software interface, offering a service to other pieces of software.  
In contrast to a user interface, which connects a computer to a person, an application programming interface connects computers or pieces of software to each other. It is not intended to be used directly by a person (the end user) other than a computer programmer who is incorporating it into the software.

- **An API enables a computer to request information from another computer over the Internet**.
- **Data access endpoints and the format of the response are standardized according to a specification**.

### Genius API (JSON)

According to its [documentation](https://docs.genius.com), the Genius API provides access to various resources, including:

- Search (results)
- Artists
- Songs

<img src="./img/genius_doc.png" class="center" >

Let's explore the possibilities out of curiosity…

Genius uses the OAuth2 standard for making API calls on behalf of individual users.  
Requests are authenticated with an **Access Token** sent in an HTTP header or simply **as a request parameter**.

[How to get, store and call your Genius API keys](https://melaniewalsh.github.io/Intro-Cultural-Analytics/04-Data-Collection/07-Genius-API.html#api-keys)…

The best practice is to keep your API keys away from your code, such as in another file.

My key is stored in a Python file called `api_key.py` that contains just one variable, `your_client_access_token = "MY_API_KEY"`, so I can import this variable into the notebook below with `import api_key`.

In [None]:
import api_key
#api_key.your_client_access_token
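
Another common way to keep a key out of the code is an environment variable. This is only a sketch of that alternative: the variable name `GENIUS_ACCESS_TOKEN` and the helper `read_access_token` are assumptions, not something Genius or this course prescribes.

```python
import os


def read_access_token(variable_name="GENIUS_ACCESS_TOKEN"):
    """Return the token stored in the environment, or raise a clear error."""
    token = os.environ.get(variable_name)
    if not token:
        raise RuntimeError(f"Set the {variable_name} environment variable first.")
    return token


# Demonstration only: a real token would be set outside the notebook.
os.environ["GENIUS_ACCESS_TOKEN"] = "MY_API_KEY"
token = read_access_token()
print(token)
```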

#### Making an API Request

Making an API request looks a lot like typing a specially-formatted URL. But instead of getting a rendered HTML web page in return, you get some data in return.

Let's start with the basic search, which allows you to get a bunch of Genius data about any artist or songs that you search for:

http://api.genius.com/search?q={search_term}&access_token={client_access_token}

In [None]:
search_term = "Supreme NTM"
genius_search_url = f"http://api.genius.com/search?q={search_term}&access_token={api_key.your_client_access_token}"
response = requests.get(genius_search_url)
response.headers['Content-Type']

This time, the response is not formatted as plain text or HTML, but as JSON.
With Requests, you can call [.json()](https://requests.readthedocs.io/en/latest/api/#requests.Response.json) to return the JSON-decoded content of a response.

Thanks to IPython's [`.display` module](https://ipython.readthedocs.io/en/stable/api/generated/IPython.display.html#IPython.display.JSON), we can effectively display a response that is quite long:

In [None]:
from IPython.display import JSON

In [None]:
JSON(response.json())
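
The search term used earlier contains a space, and other terms may contain characters such as `&` that have a special meaning in a URL. As a sketch, the query string can be built with the standard library's `urlencode`, which escapes such characters; the helper name `build_search_url` is ours, and the token value is a placeholder.

```python
from urllib.parse import urlencode


def build_search_url(search_term, access_token):
    """Build the Genius search URL with properly escaped parameters."""
    query = urlencode({"q": search_term, "access_token": access_token})
    return f"http://api.genius.com/search?{query}"


search_url = build_search_url("Supreme NTM", "MY_API_KEY")
print(search_url)
print(build_search_url("Simon & Garfunkel", "MY_API_KEY"))
```

requests offers the same escaping through the `params` argument, as in `requests.get("http://api.genius.com/search", params={"q": search_term, "access_token": token})`.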

#### JSON

[JSON](https://en.wikipedia.org/wiki/JSON) (JavaScript Object Notation) is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays (or other serializable values). JSON is commonly used by APIs.  
JSON data can be nested and contains key/value pairs.

JSON [Syntax Rules](https://www.w3schools.com/whatis/whatis_json.asp):

- Data is in name/value pairs: `"title": "On est encore là"`
- Data is separated by commas: `"id": 87367, "language": "fr"`
- Curly braces hold **objects**: `"release_date_components": {"year": 1998, "month": 4, "day": 21}`
- Square brackets hold **arrays**: `"featured_artists": […]`

See also https://en.wikipedia.org/wiki/JSON#Syntax:

- **Array**: an ordered list of zero or more elements
- **Object**: a collection of name–value pairs where the names (also called keys) are strings. 

We can index this data and look at the 10th “hit” (`['hits'][9]`) about our search term "Supreme NTM":

NB: the code `\xa0` represents a non-breaking space.

In [None]:
json_data = response.json()
json_data['response']['hits'][9]
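
The JSON syntax rules map directly onto Python types once the text is decoded. The sketch below decodes a small hand-written JSON string with the standard `json` module; the record is an illustrative excerpt reusing the values quoted above, not a full API response.

```python
import json

# Hand-written excerpt shaped like one song record; JSON itself requires double quotes.
json_text = """
{
  "id": 87367,
  "title": "On est encore là",
  "language": "fr",
  "release_date_components": {"year": 1998, "month": 4, "day": 21},
  "featured_artists": []
}
"""
song = json.loads(json_text)

print(type(song).__name__)                       # a JSON object becomes a Python dict
print(type(song["featured_artists"]).__name__)   # a JSON array becomes a Python list
print(song["release_date_components"]["year"])   # nested values are reached key by key
```

This is what `response.json()` does for us behind the scenes, which is why the decoded data is displayed with Python's single quotes.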

#### Looping Through JSON Data

We can see that each `hits` in the `response` corresponds to a song. With a `for` loop, we can easily extract the title (`full_title`) as well as the `id` of each song:

In [None]:
for song in json_data['response']['hits']:
    print(
        f"{str(song['result']['id']):10}", # constrain string length to align
        song['result']['full_title']
    )

A closer look reveals other relevant information:

- date of release = ???
- number of visits to the associated page = ???


In [None]:
for song in json_data['response']['hits']:
    print(
        f"{str(song['result']['id']):10}",
        song['result']['release_date_components']['year'],
        f"{str(song['result']['stats']['pageviews']):8}",
        song['result']['full_title']
    )

We can take advantage of what we've already learned to store this metadata in a dataframe. We also take this opportunity to extract:

- artist name = ???
- artist id =  ???

NB. This `artist_id` will allow us to automatically extract information about the band using the `artists` route. (`GET /artists/:id`).

In [None]:
songs_meta = []
for song in json_data['response']['hits']:
    songs_meta.append([song['result']['id'],
                       song['result']['full_title'],
                       song['result']['release_date_components']['year'],
                       song['result']['stats']['pageviews'],
                       song['result']['primary_artist']['id'],
                       song['result']['artist_names']
])

#Make a Pandas dataframe from a list
songs_df = pd.DataFrame(songs_meta)
songs_df.columns = ['song_id', 'song_title', 'year', 'page_views', 'artist_id', 'artist_names']
songs_df

In [None]:
def get_songs_meta_of_a_search(band_name):
    # Let Requests build (and URL-encode) the query string
    genius_search_url = "https://api.genius.com/search"
    params = {'q': band_name, 'access_token': api_key.your_client_access_token}
    json_data = requests.get(genius_search_url, params=params).json()
    songs_meta = []
    for song in json_data['response']['hits']:
        # release_date_components can be empty (None) for some songs
        release = song['result']['release_date_components'] or {}
        songs_meta.append([song['result']['id'],
                           song['result']['full_title'],
                           release.get('year'),
                           song['result']['stats']['pageviews'],
                           song['result']['primary_artist']['id'],
                           song['result']['artist_names']
                          ])

    songs_df = pd.DataFrame(songs_meta)
    songs_df.columns = ['song_id', 'song_title', 'year', 'page_views', 'artist_id', 'artist_names']
    return songs_df

In [None]:
get_songs_meta_of_a_search('Orelsan')

#### Exercise 1

Extract the NTM band description using the following route, which does not require an authentication token: https://genius.com/api/artists/{artist_id}

- find the band ID
- discover the [`text_format` query parameter](https://docs.genius.com/#/response-format-h1), which can be used to specify how text content is formatted.

**Answer:**

```python
artist_id = "24568"
text_format = 'plain'

genius_api_url = f"https://genius.com/api/artists/{artist_id}?text_format={text_format}"
json_data = requests.get(genius_api_url).json()
print(json_data['response']['artist']['description']['plain'])
```


#### Exercise 2

https://docs.genius.com/#artists-h2:

```
GET /artists/:id/songs

Documents (songs) for the artist specified. By default, 20 items are returned for each request.
```

The query parameter `sort` sorts songs by popularity.  
**Goal**: write a function that returns a list of the n most popular songs for an artist_id.


**Answer**:

```python
def get_popular_songs_of_artist(artist_id, sort_param, max_number):
    genius_api_url = f"https://genius.com/api/artists/{artist_id}/songs?sort={sort_param}&per_page={max_number}&text_format=plain"
    json_data = requests.get(genius_api_url).json()

    for song in json_data['response']['songs']:
        year = song['release_date_components']['year'] if song['release_date_components'] else 'none'
        print(song['id'],
              song['annotation_count'],
              year,
              song['full_title'])
```

and

```python
get_popular_songs_of_artist('24568', 'popularity', 10)
```

#### LyricsGenius Wrapper

Our initial aim (already achieved) was to extract song lyrics. But is there a simpler, more durable way? There is a `song` resource, and its [documentation](https://docs.genius.com/#songs-h2) is encouraging.

> A song is a document hosted on Genius. It's usually music lyrics.

Let's have a look.


In [None]:
from IPython.display import JSON
song_id = '87367'
json_data = requests.get(f"https://api.genius.com/songs/{song_id}?access_token={api_key.your_client_access_token}").json()
JSON(json_data)

**Disappointing!** Unfortunately, contrary to what the documentation claims, the lyrics are not accessible.  
The Genius API appears extremely restrictive, probably for commercial reasons (it's easier to contribute than to grab the data...). In this case, it may be useful to use a wrapper.

An **API wrapper** is a language-specific package or kit that encapsulates multiple API calls to make complicated functions easy to use. 

For Genius, there's an excellent wrapper, freely available, **LyricsGenius**. 

- code: https://github.com/johnwmillr/LyricsGenius
- documentation: https://lyricsgenius.readthedocs.io/en/master/ 

The [method implemented to extract lyrics](https://github.com/johnwmillr/LyricsGenius/blob/master/lyricsgenius/genius.py#L95) is quite similar to ours. However, the code is of better quality: it's more tried and tested, and we can expect it to be maintained. And it's easier to use. There's no need, for example, to find the song identifier.

In [None]:
pip install git+https://github.com/johnwmillr/LyricsGenius.git

In [None]:
import lyricsgenius
genius = lyricsgenius.Genius(api_key.your_client_access_token)

In [None]:
artist = genius.search_artist("NTM", max_songs=3, sort="title")

In [None]:
print(artist, type(artist))

The `artist` object stores all the information about an artist, including:

- their name: `artist.name`
- their id: `artist.id`
- the list of their songs: `artist.songs`

This makes it easy to access each song and get, for example, for the second one in the list:

- the title: `artist.songs[1].title`
- the id: `artist.songs[1].id`
- the lyrics: `artist.songs[1].lyrics`



In [None]:
# Print and analyze all the current properties and values of `artist` object?
'''
from pprint import pprint
pprint(vars(artist))
'''

In [None]:
print(artist.name, artist.id)
print('=====')
print(artist.songs[1].title, artist.songs[1].id)
print('=====')
print(artist.songs[1].lyrics[56:140], '…')

It's easy to loop over the song list:

In [None]:
for song in artist.songs:
    print(song.title)

LyricsGenius also gives direct access to the lyrics of a song:

In [None]:
song_title_search = 'Encore là'
band_name_search = 'NTM'
print(genius.search_song(song_title_search, band_name_search).lyrics)

It becomes very easy to write a small function to build a search corpus...

In [None]:
import lyricsgenius
import pandas as pd

def get_artists_lyrics(artist_names_list, max_songs, sort_criteria):
    
    genius = lyricsgenius.Genius(api_key.your_client_access_token)
    lyrics_df = pd.DataFrame(columns=['artist_name', 'artist_id', 'song_id', 'song_title', 'song_lyrics'])
    
    for artist_name in artist_names_list:
        artist = genius.search_artist(artist_name, max_songs=max_songs, sort=sort_criteria)
        if artist is None:
            # nothing found for this name: skip it
            continue
        for song in artist.songs:
            song_meta = [
                str(artist.name),
                str(artist.id),
                str(song.id),
                str(song.title),
                str(song.lyrics)
            ]
            lyrics_df.loc[len(lyrics_df)] = song_meta
    return lyrics_df

In [None]:
lyrics_df = get_artists_lyrics(['NTM', 'Orelsan'], 3, 'popularity')

In [None]:
lyrics_df.head()

**The corpus is rapidly built up for analysis: stylometry, attribution or topic modeling, etc.**

In [None]:
print(lyrics_df.iloc[1].song_lyrics[:240])

**Issue**. We see that the first line contains credits... `.index()` allows us to position ourselves after the first line break.

In [None]:
# Issue: the first line contains credits… the .index() method lets us position ourselves after the first line break.
lyrics = lyrics_df.iloc[1].song_lyrics[:240]
print(lyrics[lyrics.index('\n')+1:])

We can apply this method to the `song_lyrics` column of the dataframe, in order to improve the data. Alternatively, we could rewrite our `get_artists_lyrics()` function.
**Warning**. Be careful to apply only once, otherwise the first line of the lyrics will be lost each time it is run...

In [None]:
lyrics_df['song_lyrics']

In [None]:
lyrics_df['song_lyrics'] = lyrics_df['song_lyrics'].apply(lambda x: x[x.index('\n')+1:])
lyrics_df['song_lyrics']
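
`.index()` raises a `ValueError` when a string contains no line break at all. A slightly more tolerant variant can be sketched with `str.partition()`. The helper name `drop_first_line()` and the sample text are made up for illustration.

```python
def drop_first_line(text):
    """Return text without its first line; leave it unchanged if it has no line break."""
    head, sep, tail = text.partition('\n')
    return tail if sep else text

sample = "3 Contributors Song Lyrics\nFirst verse\nSecond verse"  # made-up lyrics
print(drop_first_line(sample))
print(drop_first_line("no line break here"))
```

Such a helper could be passed to `.apply()` in place of the lambda. The warning about running the cell only once still holds.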

With Genius, we discovered JSON and learned how to browse it to extract data. We read its API documentation and managed a token for authentication. Finally, we discovered a first wrapper and its documentation, so we could easily build up a search corpus.

### Gallica Search API (XML)

[Gallica](https://gallica.bnf.fr/) is the digital library of the Bibliothèque nationale de France and its partners. It has been freely accessible since 1997, and contains more than 10 million documents.

Above all, the BnF has done an impressive job of making these open resources available through various APIs that give access to different services: metadata, textual data, and also image retrieval.

[Gallica's APIs are documented](https://api.bnf.fr/recherche?f[0]=sources:197) in exemplary fashion, with particular attention paid to the needs of researchers.

**Using the Search API, let's start by trying to automatically collect data relating to Jules Verne's novels.**  

**This API allows you to search the Gallica digital collection:**

- **Documentation**: [https://api.bnf.fr/fr/api-gallica-de-recherche](https://api.bnf.fr/fr/api-gallica-de-recherche)
- **SRU Gallica search service**: https://api.bnf.fr/fr/api-gallica-de-recherche#scroll-nav__3

SRU (Search/Retrieval via URL) is a **standard metadata exchange protocol** adapted to the needs of library catalogues. In other words, the libraries have agreed to define a standardised way of offering their catalogue as a web service.

And the SRU standard specifies **how to query** the server that exposes the catalogue data and **how to format the response**.

Read: https://bibliotheques.wordpress.com/2017/10/27/papa-cest-quoi-un-sru/

![SRU](img/SRU.png)


#### CQL Query

The basics: https://gallica.bnf.fr/SRU?version=1.2&operation=searchRetrieve&query={CQL_QUERY}


[CQL, the Contextual Query Language](https://www.loc.gov/standards/sru/cql/), is a formal language for representing queries to information retrieval systems such as web indexes, bibliographic catalogs and museum collection information. The design objective is that queries be human readable and writable, and that the language be intuitive while maintaining the expressiveness of more complex languages.


A few examples to help you understand.


**Documents mentioning "Jules Verne" in Gallica** (>18752 records…):

- `query=gallica all "Jules Verne"`
- [https://gallica.bnf.fr/SRU?version=1.2&operation=searchRetrieve&**query=gallica all "Jules Verne"**](https://gallica.bnf.fr/SRU?version=1.2&operation=searchRetrieve&query=gallica%20all%20%22Jules%20Verne%22)

**Documents of which "Jules Verne" is the author in Gallica** (>400 records):  

- `query=dc.creator all "jules verne"`
- [https://gallica.bnf.fr/SRU?version=1.2&operation=searchRetrieve&**query=dc.creator all "Jules Verne"**](https://gallica.bnf.fr/SRU?version=1.2&operation=searchRetrieve&query=dc.creator%20all%20%22Jules%20Verne%22)

**Books (monographs) in French authored by "Jules Verne" in Gallica** (>290 records):

- `query=dc.creator all "jules verne"&filter=dc.type all "monographie" and dc.language all "fre"`
- [https://gallica.bnf.fr/SRU?version=1.2&operation=searchRetrieve&**query=dc.creator all "jules verne"&filter=dc.type all "monographie" and dc.language all "fre"**](https://gallica.bnf.fr/SRU?version=1.2&operation=searchRetrieve&query=dc.creator%20all%20%22jules%20verne%22&filter=dc.type%20all%20%22monographie%22%20and%20dc.language%20all%20%22fre%22)


There are also **sorting criteria**. Results can be sorted according to OCR quality: `ocr.quality/sort.descending`:

[https://gallica.bnf.fr/SRU?version=1.2&operation=searchRetrieve&query=dc.creator all "jules verne" sortby ocr.quality/sort.descending&filter=dc.type all "monographie" and dc.language all "fre"](https://gallica.bnf.fr/SRU?version=1.2&operation=searchRetrieve&query=dc.creator%20all%20%22jules%20verne%22%20sortby%20ocr.quality/sort.descending&filter=dc.type%20all%20%22monographie%22%20and%20dc.language%20all%20%22fre%22)

There are other parameters:
 
- `startRecord`: the pagination system index, between 1 and the maximum number of results returned by the query
- `maximumRecords`: the number of results returned by the service (from 0 to a maximum of 50). By default, if this parameter is not present, the value is 15.


**The 5 Jules Verne books with the best OCR quality**:

- `query=dc.creator all "jules verne" sortby ocr.quality/sort.descending&filter=dc.type all "monographie" and dc.language all "fre"&maximumRecords=5`
- [https://gallica.bnf.fr/SRU?version=1.2&operation=searchRetrieve&query=dc.creator all "jules verne" sortby ocr.quality/sort.descending&filter=dc.type all "monographie" and dc.language all "fre"&maximumRecords=5](https://gallica.bnf.fr/SRU?version=1.2&operation=searchRetrieve&query=dc.creator%20all%20%22jules%20verne%22%20sortby%20ocr.quality/sort.descending&filter=dc.type%20all%20%22monographie%22%20and%20dc.language%20all%20%22fre%22&maximumRecords=5)
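
Rather than encoding such URLs by hand, the query string can be assembled from its parts. The following is only a sketch using the standard library; `build_sru_url()` is a hypothetical helper, not part of any Gallica client.

```python
from urllib.parse import urlencode, quote

def build_sru_url(cql_query, cql_filter=None, start_record=1, maximum_records=15):
    """Assemble a Gallica SRU search URL (hypothetical helper)."""
    params = {
        'version': '1.2',
        'operation': 'searchRetrieve',
        'query': cql_query,
    }
    if cql_filter:
        params['filter'] = cql_filter
    params['startRecord'] = start_record
    params['maximumRecords'] = maximum_records
    # quote_via=quote encodes spaces as %20; safe='/' keeps the slash of the sort key as-is
    return 'https://gallica.bnf.fr/SRU?' + urlencode(params, quote_via=quote, safe='/')

sru_url = build_sru_url(
    'dc.creator all "jules verne" sortby ocr.quality/sort.descending',
    cql_filter='dc.type all "monographie" and dc.language all "fre"',
    maximum_records=5,
)
print(sru_url)
```

Passing a `params` dictionary to `requests.get()` would be another option; building the string explicitly makes it easy to paste the URL into a browser and check it.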

We're going to start working on this very small corpus to understand the format of the response.
 

In [None]:
import requests
url = "https://gallica.bnf.fr/SRU?version=1.2&operation=searchRetrieve&query=dc.creator%20all%20%22jules%20verne%22%20sortby%20ocr.quality/sort.descending&filter=dc.type%20all%20%22monographie%22%20and%20dc.language%20all%20%22fre%22&maximumRecords=5"
response = requests.get(url)
response.headers['Content-Type']

This time, the response is not formatted as JSON, but as XML.  
Unfortunately, `Requests` does not handle parsing XML responses… :(

In [None]:
print(response.text)

You can observe the XML structure of a record.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<srw:searchRetrieveResponse
    xmlns:srw="http://www.loc.gov/zing/srw/"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
    <srw:records>
        <srw:record>
            <srw:recordSchema>http://www.openarchives.org/OAI/2.0/OAIdc.xsd</srw:recordSchema>
            …
            <srw:recordData>
                <oai_dc:dc>
                    <dc:creator>Verne, Jules (1828-1905)…</dc:creator>
                    <dc:date>1896</dc:date>
                    <dc:identifier>https://gallica.bnf.fr/ark:/12148/bpt6k65501998</dc:identifier>
                    <dc:language>fre</dc:language>
                    <dc:source>Bibliothèque nationale de France…</dc:source>
                    <dc:title>Les voyages du Capitaine Cook…</dc:title>
                    <dc:identifier>https://gallica.bnf.fr/ark:/12148/bpt6k65501998</dc:identifier>
                </oai_dc:dc>
            </srw:recordData>
            …
        </srw:record>
        <srw:record/>
        …
    </srw:records>
</srw:searchRetrieveResponse>
```

To extract the information… **we need to parse XML data**!

#### XML Parsing

**[The ElementTree XML API](https://docs.python.org/3/library/xml.etree.elementtree.html)**. The `xml.etree.ElementTree` (`ET` in short) module implements a simple and efficient API for parsing and creating XML data.

XML is an inherently hierarchical data format, and the most natural way to represent it is with a tree. **ET has two classes** for this purpose:

- `ElementTree` represents the whole XML document as a tree
- `Element` represents a single node in this tree.

Interactions with the whole document (reading and writing to/from files) are usually done on the ElementTree level. Interactions with a single XML element and its sub-elements are done on the Element level.


**[fromstring() parses XML from a string directly into an Element](https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.fromstring)** –stored below in the `root` variable– which is the root element of the parsed tree.

Then, `Element` has some useful methods that help iterate recursively over all the sub-tree below it (its children, their children, and so on). For example, `Element.iter()`.

**[The iter() method](https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.iter)** creates a tree iterator with the current element as the root. The iterator iterates over this element and all elements below it, in document (depth first) order. If tag is not None or '*', only elements whose tag equals tag are returned from the iterator.

In [None]:
import xml.etree.ElementTree as ET
root = ET.fromstring(response.content)
for child in root.iter('*'):
    print(child.tag)

#### Metadata extraction

We can see that all the elements of the response are accessible, according to their namespace: `{http://purl.org/dc/elements/1.1/}creator`  
The `creator` element is declared for the Dublin Core namespace (`http://purl.org/dc/elements/1.1/`).

Using a `for` loop, each `record` is read, and the `Element.findall()` method is used to extract Dublin Core metadata:

- `identifier`
- `date`
- `title`
- `creator`

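NB. Given a bare tag name, the [`Element.findall()` method](https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.findall) only finds direct children of the current element. Here we pass it a path (`recordData/dc/identifier`, each step written with its namespace), so it walks down from the `record` to the Dublin Core element.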

In [None]:
import xml.etree.ElementTree as ET
root = ET.fromstring(response.content)
for record in root.iter('{http://www.loc.gov/zing/srw/}record'):
    dc_identifier = record.findall('{http://www.loc.gov/zing/srw/}recordData/{http://www.openarchives.org/OAI/2.0/oai_dc/}dc/{http://purl.org/dc/elements/1.1/}identifier')[0].text
    dc_date = record.findall('{http://www.loc.gov/zing/srw/}recordData/{http://www.openarchives.org/OAI/2.0/oai_dc/}dc/{http://purl.org/dc/elements/1.1/}date')[0].text
    dc_title = record.findall('{http://www.loc.gov/zing/srw/}recordData/{http://www.openarchives.org/OAI/2.0/oai_dc/}dc/{http://purl.org/dc/elements/1.1/}title')[0].text
    dc_creator = record.findall('{http://www.loc.gov/zing/srw/}recordData/{http://www.openarchives.org/OAI/2.0/oai_dc/}dc/{http://purl.org/dc/elements/1.1/}creator')[0].text
    print(dc_creator, dc_identifier, dc_date, dc_title)
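
The long `{namespace}tag` paths can be shortened by passing a prefix-to-URI dictionary to ElementTree's search methods. This is a minimal sketch on a made-up, shortened response; the dictionary name `NS` and the prefixes are our own choice, and only the URIs have to match the document.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI mapping used to resolve the prefixes in the paths below
NS = {
    'srw': 'http://www.loc.gov/zing/srw/',
    'oai_dc': 'http://www.openarchives.org/OAI/2.0/oai_dc/',
    'dc': 'http://purl.org/dc/elements/1.1/',
}

# Made-up, shortened response containing a single record
sample = """<srw:searchRetrieveResponse
    xmlns:srw="http://www.loc.gov/zing/srw/"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <srw:records>
    <srw:record>
      <srw:recordData>
        <oai_dc:dc>
          <dc:date>1896</dc:date>
          <dc:title>Les voyages du Capitaine Cook</dc:title>
        </oai_dc:dc>
      </srw:recordData>
    </srw:record>
  </srw:records>
</srw:searchRetrieveResponse>"""

sample_root = ET.fromstring(sample)
for record in sample_root.iterfind('.//srw:record', NS):
    # findtext() returns the text of the first match, or None if there is no match
    title = record.findtext('srw:recordData/oai_dc:dc/dc:title', namespaces=NS)
    date = record.findtext('srw:recordData/oai_dc:dc/dc:date', namespaces=NS)
    print(date, title)
```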

In [None]:
# Same thing, but loading data into a dataframe
import pandas as pd
metadata = []
root = ET.fromstring(response.content)
for record in root.iter('{http://www.loc.gov/zing/srw/}record'):
    dc_identifier = record.findall('{http://www.loc.gov/zing/srw/}recordData/{http://www.openarchives.org/OAI/2.0/oai_dc/}dc/{http://purl.org/dc/elements/1.1/}identifier')[0].text
    dc_date = record.findall('{http://www.loc.gov/zing/srw/}recordData/{http://www.openarchives.org/OAI/2.0/oai_dc/}dc/{http://purl.org/dc/elements/1.1/}date')[0].text
    dc_title = record.findall('{http://www.loc.gov/zing/srw/}recordData/{http://www.openarchives.org/OAI/2.0/oai_dc/}dc/{http://purl.org/dc/elements/1.1/}title')[0].text
    dc_creator = record.findall('{http://www.loc.gov/zing/srw/}recordData/{http://www.openarchives.org/OAI/2.0/oai_dc/}dc/{http://purl.org/dc/elements/1.1/}creator')[0].text
    ocr_quality = record.findall('{http://www.loc.gov/zing/srw/}extraRecordData/nqamoyen')[0].text
    metadata.append({
        'ocr_quality': ocr_quality,
        'dc_creator': dc_creator,
        'dc_identifier': dc_identifier,
        'dc_date': dc_date,
        'dc_title': dc_title
    })
metadata_df = pd.DataFrame(metadata)
metadata_df

In [None]:
# We make the code reusable by defining a function.
def get_gallica_metadata(search_api_url):
    response = requests.get(search_api_url)
    root = ET.fromstring(response.content)
    metadata = []
    for record in root.iter('{http://www.loc.gov/zing/srw/}record'):
        dc_identifier = record.findall('{http://www.loc.gov/zing/srw/}recordData/{http://www.openarchives.org/OAI/2.0/oai_dc/}dc/{http://purl.org/dc/elements/1.1/}identifier')[0].text
        dc_date = record.findall('{http://www.loc.gov/zing/srw/}recordData/{http://www.openarchives.org/OAI/2.0/oai_dc/}dc/{http://purl.org/dc/elements/1.1/}date')[0].text if record.findall('{http://www.loc.gov/zing/srw/}recordData/{http://www.openarchives.org/OAI/2.0/oai_dc/}dc/{http://purl.org/dc/elements/1.1/}date') else 'NaN'
        dc_title = record.findall('{http://www.loc.gov/zing/srw/}recordData/{http://www.openarchives.org/OAI/2.0/oai_dc/}dc/{http://purl.org/dc/elements/1.1/}title')[0].text
        dc_creator = record.findall('{http://www.loc.gov/zing/srw/}recordData/{http://www.openarchives.org/OAI/2.0/oai_dc/}dc/{http://purl.org/dc/elements/1.1/}creator')[0].text
        ocr_quality = record.findall('{http://www.loc.gov/zing/srw/}extraRecordData/nqamoyen')[0].text
        metadata.append({
            'ocr_quality': ocr_quality,
            'dc_creator': dc_creator,
            'dc_identifier': dc_identifier,
            'dc_date': dc_date,
            'dc_title': dc_title
        })
    metadata_df = pd.DataFrame(metadata)
    return metadata_df


Same thing. But this version of the function, which is a little more abstract, allows you to specify the list of fields to be extracted as an argument (`dc_elements_array`).

In [None]:
# Same. Pass the list of fields to be extracted
def get_gallica_metadata(search_api_url, dc_elements_array):
    # Use the URL received as argument (not a global variable)
    response = requests.get(search_api_url)
    root = ET.fromstring(response.content)
    # Path from a record down to its Dublin Core elements
    dc_path = '{http://www.loc.gov/zing/srw/}recordData/{http://www.openarchives.org/OAI/2.0/oai_dc/}dc/{http://purl.org/dc/elements/1.1/}'
    metadata = []
    for record in root.iter('{http://www.loc.gov/zing/srw/}record'):
        record_meta_dic = {}
        for dc_el in dc_elements_array:
            # findtext() returns None if the element is missing and '' if it is empty
            record_meta_dic[dc_el] = record.findtext(dc_path + dc_el) or 'NaN'
        # non DC metadata
        record_meta_dic['ocr_quality'] = record.findtext('{http://www.loc.gov/zing/srw/}extraRecordData/nqamoyen') or 'NaN'
        metadata.append(record_meta_dic)
    metadata_df = pd.DataFrame(metadata)
    return metadata_df


In [None]:
url = "https://gallica.bnf.fr/SRU?version=1.2&operation=searchRetrieve&query=dc.creator%20all%20%22jules%20verne%22%20sortby%20ocr.quality/sort.descending&filter=dc.type%20all%20%22monographie%22%20and%20dc.language%20all%20%22fre%22&startRecord=1&maximumRecords=5"
books_df = get_gallica_metadata(url, ['creator', 'identifier', 'title'])
books_df
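
Because `maximumRecords` is capped at 50, collecting a larger result set means sending several requests with increasing `startRecord` values. Here is a minimal sketch of that arithmetic, with a hypothetical helper `sru_pages()`. The total number of results is assumed to be known; SRU responses normally report it in a `numberOfRecords` element, which should be checked in the actual response.

```python
def sru_pages(total_records, page_size=50):
    """Yield (startRecord, maximumRecords) pairs covering total_records results."""
    for start in range(1, total_records + 1, page_size):
        yield start, min(page_size, total_records - start + 1)

# For example, a query matching 120 records needs three requests
for start_record, maximum_records in sru_pages(120):
    print(f"&startRecord={start_record}&maximumRecords={maximum_records}")
```

Each pair can then be appended to the search URL before calling `get_gallica_metadata()`, and the resulting dataframes combined with `pd.concat()`.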


#### Save DF to CSV

The Pandas [pd.DataFrame.to_csv](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html#pandas-dataframe-to-csv) method exports the dataframe object to a csv file.

Several parameters are useful here:

- `sep`: field delimiter for the output file.
- `index`: whether or not to write row names (index).
- `encoding`: the encoding to use in the output file.


In [None]:
books_df.to_csv('./output/books.tsv', sep='\t', encoding='utf-8', index=False)

You can also specify the list of columns to be exported:

In [None]:
books_df.to_csv('./output/books.tsv', sep='\t', encoding='utf-8', index=False,
               columns=['creator', 'identifier', 'title'])

At the start of the course, we inherited a TSV file for accessing the Gutenberg versions of Jules Verne's novels. This time, thanks to Gallica's Search API and your new skills in XML data parsing, you've automatically built up this table yourself.

**Now it's time to retrieve the text too!**

### Gallica Document API

Documentation: https://api.bnf.fr/fr/api-document-de-gallica

From a document found via the Search API or the Gallica interface, the Document API can be used to retrieve the metadata needed to use the document's digital resources, including:

- bibliographic information
- search hits
- text (plain text / OCR)

Thus, for an ark identifier, it is always possible to retrieve the metadata of the OAI record:

https://gallica.bnf.fr/services/OAIRecord?ark={ark}

This service returns the document's OAI-PMH record as well as other technical information, such as document type, or whether or not full-text searching is available.

Only one parameter is mandatory: `ark`, the ark identifier of the digital document.


#### Metadata retrieval

If necessary, we can find all the metadata we need for our edition of *Les voyages du Capitaine Cook*:  
https://gallica.bnf.fr/services/OAIRecord?ark=ark:/12148/bpt6k65501998

We need to get its ark in our dataframe `books_df`:

a. Find the corresponding row (the one whose `title` cell contains 'Capitaine Cook'):  
`books_df.loc[books_df['title'].str.contains('Capitaine Cook')]`  
Note that `str.contains()` is case-sensitive.

b. Extract the identifier (`https://gallica.bnf.fr/ark:/12148/bpt6k65501998`):  
`books_df.loc[books_df['title'].str.contains('Capitaine Cook')]['identifier'][1]`  
NB. the final `[1]` is necessary because the selection made with [`.loc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) returns a Series rather than a single value; `[1]` picks the value whose index label is 1 (here, the row of this book in `books_df`).

c. Extract its ark ('ark:/12148/bpt6k65501998'), a substring of the identifier:  
`books_df.loc[books_df['title'].str.contains('Capitaine Cook')]['identifier'][1].split('https://gallica.bnf.fr/')[1]`  
or, more simply, slice the string after its first 23 characters (the length of `https://gallica.bnf.fr/`):  
`books_df.loc[books_df['title'].str.contains('Capitaine Cook')]['identifier'][1][23:]`

=====
Or, even simpler, you already know the ark (and that's it!):  
`ark:/12148/bpt6k65501998`



In [None]:
my_ark = books_df.loc[books_df['title'].str.contains('Capitaine Cook')]['identifier'][1][23:]
my_novel_record = requests.get('https://gallica.bnf.fr/services/OAIRecord?ark='+my_ark)
print(my_novel_record.text)

From here, you can put your new XML parsing skills to good use and discover, for example, the OCR quality available for this book:

In [None]:
from xml.etree import ElementTree
tree = ElementTree.fromstring(my_novel_record.content)
tree.find('nqamoyen').text

Or to find related resources:

In [None]:
for relation in tree.iter('{http://purl.org/dc/elements/1.1/}relation'):
    print(relation.text)

#### Text retrieval (plain text)

It's really very easy!

When a document is indexed in full text, it is possible to obtain this text, using the `texteBrut` qualifier:  
https://gallica.bnf.fr/{ark}.texteBrut

Let's retrieve the text from *Les voyages du Capitaine Cook*:  
https://gallica.bnf.fr/ark:/12148/bpt6k65501998.texteBrut


In [None]:
text_url = f"https://gallica.bnf.fr/{my_ark}.texteBrut"
response = requests.get(text_url)
response.text

What do we find? Surprisingly, the text is not actually available in plain text, but is formatted in HTML…

In [None]:
response.headers['Content-Type']

But that's a good thing! We're going to be able to put to good use the BeautifulSoup skills we've acquired in processing the files shared by Project Gutenberg, so as to extract Jules Verne's text alone.

You need to analyze the HTML publication model, so as to cut out the metadata at the beginning of the file (pay attention to the use of the `hr` element...). Test before wrapping the code in a small function `get_book_text()`.

In [None]:
from bs4 import BeautifulSoup
document = BeautifulSoup(response.text, "html.parser")
first_hr_tag = document.select('hr')[0]
text = ''
for p in first_hr_tag.find_all_next('p'):
    text += p.text + '\n'
#print(text.strip())

In [None]:
def get_book_text(ark_url):
    response = requests.get(ark_url+'.texteBrut')
    document = BeautifulSoup(response.text, "html.parser")
    first_hr_tag = document.select('hr')[0]
    text = ''
    for p in first_hr_tag.find_all_next('p'):
        text += p.text + '\n'
    return text

Now we just need to retrieve the text for each of our novels, by applying our little function to each row of our dataframe...

**That's easy! We simply add a 'text' column.  
But it's a bit time-consuming, because you have to retrieve all the data via HTTP, then process it... Don't launch the cell until you're sure!**

In [None]:
books_df['text'] = books_df.apply(lambda x: get_book_text(x['identifier']), axis=1)
books_df.head()

In [None]:
#print(books_df.iloc[1]['text'])

Note that you can also specify the pages you wish to retrieve, using the `f[X]n[Y]` qualifier (X is the number of the start page, and Y is the number of pages to retrieve from there):

https://gallica.bnf.fr/{ark}/f{start_page_number}n{number_of_pages}.texteBrut

For example, the first 5 pages of the chapter devoted to Bougainville (p. 75-79 => 5 pages from f87)

https://gallica.bnf.fr/ark:/12148/bpt6k65501998/f87n5.texteBrut
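
As a sketch, both forms of the `texteBrut` URL (whole document or page range) can be produced by one small function. `build_text_url()` is a hypothetical helper; it assumes the ark is given in full, as in `ark:/12148/bpt6k65501998`.

```python
def build_text_url(ark, start_page=None, page_count=None):
    """Build a texteBrut URL for a whole document or for a page range (hypothetical helper)."""
    url = f"https://gallica.bnf.fr/{ark}"
    # Add the page qualifier only when both values are provided
    if start_page is not None and page_count is not None:
        url += f"/f{start_page}n{page_count}"
    return url + ".texteBrut"

whole_book_url = build_text_url('ark:/12148/bpt6k65501998')
chapter_url = build_text_url('ark:/12148/bpt6k65501998', start_page=87, page_count=5)
print(whole_book_url)
print(chapter_url)
```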


#### Save Text to File

We often need to save data locally...

In [None]:
file_name = books_df.iloc[1]['identifier'][34:]
with open(f'output/{file_name}.txt', 'w') as file:  # 'w' overwrites: re-running the cell does not duplicate the text
    file.write(books_df.iloc[1]['text'])

With [.iterrows()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iterrows.html), you can iterate over the DataFrame rows to export each text easily:

In [None]:
for index, row in books_df.iterrows():
    file_name = row['identifier'][34:]
    with open(f'output/{file_name}.txt', 'w') as file:
        file.write(row['text'])

You may also need to save the text retrieved via the API directly to a file.  
But remember: the text is actually formatted in HTML and we'll need to process it.

In [None]:
my_ark = 'ark:/12148/bpt6k65501998'
response_text = requests.get(f"https://gallica.bnf.fr/{my_ark}.texteBrut").text
with open(f'./output/{my_ark[11:]}.html', 'w') as file:
    file.write(response_text)

#### Text retrieval (OCR)

The Gallica API can be used to extract an OCR page from the digital library:

https://gallica.bnf.fr/RequestDigitalElement?O={ark}&E=ALTO&Deb={page_number}

For example, the ALTO on the first page of the first chapter devoted to Bougainville (p. 1 => f13)

https://gallica.bnf.fr/RequestDigitalElement?O=bpt6k65501998&E=ALTO&Deb=13


In [None]:
print(requests.get('https://gallica.bnf.fr/RequestDigitalElement?O=bpt6k65501998&E=ALTO&Deb=13').text)

It's XML, ALTO to be precise.

[ALTO](https://en.wikipedia.org/wiki/ALTO_(XML)) (Analysed Layout and Text Object) is an XML standard for reporting the physical layout and logical structure of text transcribed by optical character recognition (OCR).

It contains for each text box:

- the text
- its coordinates
- the recognition confidence rate
- and even formatting elements

The format allows images and text to be superimposed (PDF like).
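
To give an idea of how this information is laid out, here is a minimal sketch that reads words and confidence values from an ALTO fragment. The fragment is made up, not taken from Gallica. It assumes the namespace used by Gallica's ALTO files and the standard ALTO attributes `CONTENT` (the word) and `WC` (word confidence, between 0 and 1) on `String` elements.

```python
import xml.etree.ElementTree as ET

ALTO_NS = '{http://bibnum.bnf.fr/ns/alto_prod}'

# Made-up ALTO fragment: one text line made of two words
alto_sample = """<alto xmlns="http://bibnum.bnf.fr/ns/alto_prod">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="CHAPITRE" HPOS="410" VPOS="300" WIDTH="180" HEIGHT="30" WC="0.99"/>
      <SP/>
      <String CONTENT="PREMIER" HPOS="600" VPOS="300" WIDTH="170" HEIGHT="30" WC="0.87"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

alto_root = ET.fromstring(alto_sample)

# Rebuild each line of text from its String elements
lines = []
for text_line in alto_root.iter(ALTO_NS + 'TextLine'):
    words = [s.get('CONTENT') for s in text_line.iter(ALTO_NS + 'String')]
    lines.append(' '.join(words))
print(lines)

# Words whose recognition confidence falls below a threshold
doubtful = [s.get('CONTENT') for s in alto_root.iter(ALTO_NS + 'String')
            if float(s.get('WC', 1)) < 0.9]
print(doubtful)
```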

We don't have time to go into detail about the potential of ALTO. Let's take a simple example that sets the text aside and looks at images instead.

Thanks to [Gallica's IIIF API](https://api.bnf.fr/fr/api-iiif-de-recuperation-des-images-de-gallica), you can also display the page.

In [None]:
from IPython.display import Image

page_iiif_url = 'https://gallica.bnf.fr/iiif/ark:/12148/bpt6k65501998/f13/full/400,/0/native.jpg'
Image(url=page_iiif_url)

The illustration is tagged in the ALTO (element `Illustration`) and we can extract its identifier, as well as its coordinates in the page:

In [None]:
from xml.etree import ElementTree
tree = ElementTree.fromstring(requests.get('https://gallica.bnf.fr/RequestDigitalElement?O=bpt6k65501998&E=ALTO&Deb=13').content)
for illustration in tree.iter('{http://bibnum.bnf.fr/ns/alto_prod}Illustration'):
    print(illustration.get('ID'))

In [None]:
illustrations = []
for illustration in tree.iter('{http://bibnum.bnf.fr/ns/alto_prod}Illustration'):
    illustrations.append({
            'ID': illustration.get('ID'),
            'HPOS': illustration.get('HPOS'),
            'VPOS': illustration.get('VPOS'),
            'WIDTH': illustration.get('WIDTH'),
            'HEIGHT': illustration.get('HEIGHT')
        })
illustrations

Thanks to the IIIF API, we can easily display (or extract for analysis) this single illustration.

In [None]:
from IPython.display import Image

illustration_iiif_url = ('https://gallica.bnf.fr/iiif/ark:/12148/bpt6k65501998/f13/'
      + illustrations[0]['HPOS'] +','
      + illustrations[0]['VPOS'] +','
      + illustrations[0]['WIDTH'] +','
      + illustrations[0]['HEIGHT']
      + '/full/0/native.jpg')

Image(url=illustration_iiif_url, width=400)
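
Both image URLs used above follow the IIIF Image API pattern `{region}/{size}/{rotation}/{quality}.{format}`. The sketch below wraps that pattern in a small function. `gallica_iiif_url()` is a hypothetical helper, and the coordinates in `box` are made-up values standing in for an entry of the `illustrations` list.

```python
# Hypothetical helper (an assumption for illustration): builds a Gallica IIIF image URL.
def gallica_iiif_url(ark_id, folio, region='full', size='full', rotation=0):
    """Build a URL of the form .../{region}/{size}/{rotation}/native.jpg"""
    return (
        f'https://gallica.bnf.fr/iiif/ark:/12148/{ark_id}/f{folio}/'
        f'{region}/{size}/{rotation}/native.jpg'
    )

# Whole page, scaled to a width of 400 pixels ('400,' means "width 400, height automatic")
page_url = gallica_iiif_url('bpt6k65501998', 13, size='400,')

# Crop: the region is written "x,y,width,height".
# These values are invented; in the notebook they come from the ALTO file.
box = {'HPOS': '100', 'VPOS': '200', 'WIDTH': '300', 'HEIGHT': '400'}
region = ','.join(box[key] for key in ('HPOS', 'VPOS', 'WIDTH', 'HEIGHT'))
crop_url = gallica_iiif_url('bpt6k65501998', 13, region=region)

print(page_url)
print(crop_url)
```

Separating the region from the size keeps the two concerns apart: what part of the page to cut out, and how large the returned image should be.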

### Wrappers

- PyGallica: https://github.com/ian-nai/PyGallica
- gallipy: https://libraries.io/pypi/gallipy


## Preprocessing

Let's mobilize the spaCy skills we've acquired to write a pre-processing function that will normalize all tokens and exclude stopwords.

With [.copy()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.copy.html), we create a new object, so that the original `books_df` is left untouched:

In [None]:
books_df_prep = books_df.copy()
books_df_prep.head()

The preprocessing functions are defined below (see the spaCy introduction). Here, the pipeline:

- Lowercases the text
- Lemmatizes each token
- Removes punctuation symbols
- Removes stop words

In [None]:
# Preprocessing functions
import spacy
from spacy.lang.fr.stop_words import STOP_WORDS as fr_stop

import fr_core_news_md
nlp = fr_core_news_md.load()

def preprocess_lemma(token):
    return token.lemma_.strip().lower()

# Filter: a function that returns True or False for a token according to certain criteria
def is_token_allowed(token):
    return bool(
        token
        and str(token).strip()
        and not token.is_stop
        and not token.is_punct
    )

def preprocess_text(text):
    doc = nlp(text)
    filtered_doc_lemmas = [
        preprocess_lemma(token)
        for token in doc
        if is_token_allowed(token)
    ]
    return ' '.join(filtered_doc_lemmas)

In [None]:
# test!
preprocess_text(books_df_prep.iloc[1]['text'])

Once the preprocessing function `preprocess_text()` has been tested, we can use `.apply()` to run it on the text of each novel contained in our dataframe:

In [None]:
books_df_prep['keywords'] = books_df_prep.apply(lambda x: preprocess_text(x['text']), axis=1)

In [None]:
books_df_prep.head()

In [None]:
books_df_prep.iloc[1]['keywords']

We can easily adapt the code, for example to retain only stop words, if our aim is to do automatic author attribution.

The modification is very slight: only the `is_token_allowed()` function changes, so that it now retains only stop words. `preprocess_text()` is repeated below, unchanged, for readability. Strictly speaking this is not required, because Python looks up `is_token_allowed` each time `preprocess_text()` is called, so it would pick up the new filter anyway.

In [None]:
# Filter: a function that returns True or False for a token according to certain criteria
def is_token_allowed(token):
    return bool(
        token.is_stop
        and str(token).strip()
    )

def preprocess_text(text):
    doc = nlp(text)
    filtered_doc_lemmas = [
        preprocess_lemma(token)
        for token in doc
        if is_token_allowed(token)
    ]
    return ' '.join(filtered_doc_lemmas)

In [None]:
# test!
preprocess_text(books_df_prep.iloc[1]['text'])
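
A note on why changing the filter alone is enough: Python looks up a function name when the call is executed, not when the calling function is defined. The sketch below shows this with a toy whitespace tokenizer and a made-up stop list. `TOY_STOP_WORDS`, `is_word_allowed()` and `preprocess_words()` are hypothetical names used only here, not part of the lesson's spaCy pipeline.

```python
TOY_STOP_WORDS = {'le', 'la', 'de', 'et'}  # made-up mini list, not spaCy's

# First version of the filter: keep everything except stop words
def is_word_allowed(word):
    return word not in TOY_STOP_WORDS

def preprocess_words(text):
    # is_word_allowed is looked up each time this line runs
    return ' '.join(w for w in text.lower().split() if is_word_allowed(w))

sentence = 'Le tour de la Terre et la Lune'
before = preprocess_words(sentence)

# Second version of the filter: keep only stop words.
# preprocess_words() is NOT redefined.
def is_word_allowed(word):
    return word in TOY_STOP_WORDS

after = preprocess_words(sentence)

print(before)
print(after)
```

The same call returns content words the first time and stop words the second time, even though `preprocess_words()` itself was never touched.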

In [None]:
books_df_prep['stopwords'] = books_df_prep.apply(lambda x: preprocess_text(x['text']), axis=1)

In [None]:
books_df_prep.head()
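
For author attribution, what usually matters is how often each function word is used rather than the raw sequence. As a minimal sketch, and not a full attribution method, the string stored in the `stopwords` column could be turned into relative frequencies with the standard library. `function_word_profile()` is a hypothetical helper and the input string is invented.

```python
from collections import Counter

# Hypothetical helper (an assumption for illustration)
def function_word_profile(stopword_text):
    """Map each word of a space-separated string to its relative frequency."""
    counts = Counter(stopword_text.split())
    total = sum(counts.values())
    # An empty string gives an empty profile (no division takes place)
    return {word: count / total for word, count in counts.items()}

# Invented input standing in for one cell of the 'stopwords' column
profile = function_word_profile('le de la et la de le la')
print(profile)
```

Profiles computed this way for several texts can then be compared with one another, which is one simple starting point for stylometric analysis.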

## Appendix. Imports

In [None]:
# all imports
'''
import pandas as pd
import requests
from bs4 import BeautifulSoup
from IPython.display import display, HTML, Image, JSON
import xml.etree.ElementTree as ET
from xml.etree import ElementTree
import lyricsgenius
import spacy
from spacy.lang.fr.stop_words import STOP_WORDS as fr_stop
'''