# Textual data scraping and preprocessing

> This course is a reworking of the excellent book designed by Melanie Walsh, [*Introduction to Cultural Analytics & Python*](https://melaniewalsh.github.io/Intro-Cultural-Analytics/welcome.html). Many paragraphs and explanations have been retained without modification.

> **Read this book!** It was itself inspired by web scraping lessons from [Lauren Klein](https://github.com/laurenfklein/emory-qtm340/blob/master/notebooks/class4-web-scraping-complete.ipynb) and [Allison Parrish](https://github.com/aparrish/dmep-python-intro/blob/master/scraping-html.ipynb).

Classically, a distinction is made between different work stages: (1) data production, (2) data processing and (3) data analysis.
This course is a practical and fairly detailed introduction to the first 2 stages: **production and (pre)processing**.

You'll learn how to use open Web services to automatically collect textual data. In doing so, you'll gain a better understanding of how the Web works (HTTP, HTML), discover structured data formats (CSV, JSON, XML) and get hands-on experience of the Python programming language.

The important thing is not necessarily to retain everything, but to gain a better understanding of how this ecosystem works, so that you can gradually determine for yourself the solutions you need to implement to meet your own requirements.


In this lesson, we're going to introduce how to "scrape" data from the internet with the Python libraries requests and BeautifulSoup.

We will cover how to:

* Programmatically access the text of a web page
* Understand the basics of HTML
* Extract certain HTML elements
* Understand the basics of structured documents (CSV/TSV, JSON, XML)…
* …and parse them to extract information
* Build collections
* Design preprocessing steps

And along the way, we'll be learning the basics of the Python programming language.

## Why Do We Need To Scrape At All?

Today, written heritage is massively available on the Internet under free licenses.

Community initiatives such as Project Gutenberg, based on crowdsourcing, have been succeeded by very large-scale institutional projects that exploit the potential of machine learning for the automatic acquisition of text, including handwritten text. Gallica (the BnF's digital library) offers over 10 million documents online.

Here are a few projects worth your attention:

- [Project Gutenberg](https://www.gutenberg.org/): since December 1971! Project Gutenberg is a library of free electronic versions of physically existing books (>70,000 free eBooks).
- [Wikisource](https://fr.wikisource.org/wiki/Wikisource:Accueil): Wikisource is a digital library of public domain texts, managed as a wiki using the MediaWiki engine (> 360,000 free and open texts).
- [Gallica](https://gallica.bnf.fr/): Gallica is the digital library of the Bibliothèque nationale de France and its partners. It has been freely accessible since 1997, and contains > 10 million documents.
- [HathiTrust](https://www.hathitrust.org/)
- …

This digital heritage opens up unprecedented prospects for the humanities and social sciences, provided, of course, that we know how to use these services to build research corpora. Text is also increasingly born digital, and researchers often need to automatically follow certain subjects on platforms such as Twitter.

This course will not teach you how to analyze these vast textual corpora, but rather how to build them. It will also show you how to pre-process them, an essential prerequisite for computational analysis.

**Building the collection of Jules Verne novels**

Let's start by building up a small corpus of Jules Verne's works available on Project Gutenberg. In the process, we'll discover HTTP.

## Reading a metadata table using Pandas (CSV/TSV)

![csv](./img/verne_csv.png)

A [comma-separated values](https://en.wikipedia.org/wiki/Comma-separated_values) (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format. A CSV file typically stores tabular data (numbers and text) in plain text, in which case each line will have the same number of fields.

The T in TSV stands for "tab". Tabs are more convenient for separating text values, which may themselves contain commas.
In short, our lesson begins with the reading of a simple TSV metadata table, which is a very common data exchange format.
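
To make the difference concrete, here is a minimal sketch using only Python's standard `csv` module. The record is a sample modelled on the Verne table, and the variable names are illustrative. It shows that an author name containing commas is broken apart by a naive comma split, but survives intact when fields are separated by tabs.

```python
import csv
import io

# The same record written with two different separators
# (sample values modelled on the Verne metadata table).
comma_line = "4791,Verne, Jules, 1828-1905,Voyage au Centre de la Terre"
tab_line = "4791\tVerne, Jules, 1828-1905\tVoyage au Centre de la Terre"

# Splitting on commas breaks the author name into several pieces...
naive_fields = comma_line.split(",")
print(len(naive_fields))

# ...whereas the tab-separated version yields the three expected fields.
tab_fields = next(csv.reader(io.StringIO(tab_line), delimiter="\t"))
print(len(tab_fields))
print(tab_fields[1])
```

A well-formed CSV file would avoid the problem by quoting the author field, but a tab separator sidesteps the question altogether.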


**[Pandas](https://pandas.pydata.org/)** is a library written for the Python programming language, enabling data manipulation and analysis. In particular, it offers data structures and array manipulation operations.

Here, we use Pandas to read a [TSV table](https://en.wikipedia.org/wiki/Comma-separated_values) and store its data in a [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) (a 2-dimensional array), so that it can be manipulated.

In [1]:
import pandas as pd

In [2]:
urls = pd.read_csv("datas/verne.csv", delimiter='\t', encoding='utf-8')
# Get an overview
urls.head()

Unnamed: 0,Etext-No.,Author,Title,url,Language
0,26823,"Verne, Jules, 1828-1905",Michel Strogoff: Pièce à grand spectacle en 5 ...,https://www.gutenberg.org/cache/epub/26823/pg2...,fr
1,800,"Verne, Jules, 1828-1905",Le tour du monde en quatre-vingts jours,https://www.gutenberg.org/cache/epub/800/pg800...,fr
2,3456,"Verne, Jules, 1828-1905",Le tour du monde en quatre-vingts jours,https://www.gutenberg.org/cache/epub/3456/pg34...,fr
3,4548,"Verne, Jules, 1828-1905",Cinq Semaines En Ballon,https://www.gutenberg.org/cache/epub/4548/pg45...,fr
4,4717,"Verne, Jules, 1828-1905",Autour de la Lune,https://www.gutenberg.org/cache/epub/4717/pg47...,fr


Here, the list of books and their metadata (download link, language) has been compiled in advance. We'll see later how to collect this information automatically, using a search API such as the one provided by the BnF.

Let's learn how to read the DataFrame in different ways and to access the values contained in the cells.

In [3]:
# Knowing the size of the df
print(
    f"{'Dimensions':15}"
    f"{'Lines':15}"
    f"{'Columns'}"
)
print(
    f"{str(urls.shape):15}"
    f"{str(urls.shape[0]):15}"
    f"{str(urls.shape[1])}"
)

Dimensions     Lines          Columns
(46, 5)        46             5


You can access one or more specific lines or slices using the [`.iloc[]`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) property.

In [4]:
# Access a specific line
urls.iloc[5]

Etext-No.                                                 4791
Author                                 Verne, Jules, 1828-1905
Title                             Voyage au Centre de la Terre
url          https://www.gutenberg.org/cache/epub/4791/pg47...
Language                                                    fr
Name: 5, dtype: object

In [5]:
# Access specific lines
urls.iloc[[5, 7, 11]]

Unnamed: 0,Etext-No.,Author,Title,url,Language
5,4791,"Verne, Jules, 1828-1905",Voyage au Centre de la Terre,https://www.gutenberg.org/cache/epub/4791/pg47...,fr
7,5081,"Verne, Jules, 1828-1905",Les Indes Noires,https://www.gutenberg.org/cache/epub/5081/pg50...,fr
11,5097,"Verne, Jules, 1828-1905",Vingt mille Lieues Sous Les Mers — Complete,https://www.gutenberg.org/cache/epub/5097/pg50...,fr


In [6]:
# Access a slice of lines
urls.iloc[5:8]

Unnamed: 0,Etext-No.,Author,Title,url,Language
5,4791,"Verne, Jules, 1828-1905",Voyage au Centre de la Terre,https://www.gutenberg.org/cache/epub/4791/pg47...,fr
6,4968,"Verne, Jules, 1828-1905",Les Cinq Cents Millions De La Bégum,https://www.gutenberg.org/cache/epub/4968/pg49...,fr
7,5081,"Verne, Jules, 1828-1905",Les Indes Noires,https://www.gutenberg.org/cache/epub/5081/pg50...,fr


You can get a cell value by column position or by column name.

In [7]:
print(
    str(urls.iloc[5, 2]),  # row 5, column 2 (purely positional)
    '<=>',
    str(urls.iloc[5]['Title'])
)

Voyage au Centre de la Terre <=> Voyage au Centre de la Terre
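
The sketch below replays the same idea on a tiny table built in memory, so it can be run without the `datas/verne.csv` file. The two-row sample and the name `sample` are assumptions made for illustration. It contrasts positional access (`.iloc`) with label-based access (`.loc`).

```python
import io
import pandas as pd

# A two-row stand-in for the Verne table (illustrative sample, not the real file).
tsv_sample = (
    "Etext-No.\tAuthor\tTitle\tLanguage\n"
    "4791\tVerne, Jules, 1828-1905\tVoyage au Centre de la Terre\tfr\n"
    "5081\tVerne, Jules, 1828-1905\tLes Indes Noires\tfr\n"
)
sample = pd.read_csv(io.StringIO(tsv_sample), delimiter="\t")

# Positional access: first row, third column.
print(sample.iloc[0, 2])

# Label-based access: row labelled 0, column named "Title".
print(sample.loc[0, "Title"])

# A whole column as a plain Python list.
print(sample["Title"].tolist())
```

Access by name is usually more robust: it keeps working if the columns are reordered.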


Each novel title in this TSV file is paired with a URL for the plain text. How can we actually use these URLs to get computationally tractable text data? Though we could manually navigate to each URL and copy/paste each novel into a file, that would be suuuuper slow and painstaking. It would be much better to programmatically access the text data attached to every URL.

## HTTP Requests and Responses

To programmatically access the text data attached to every URL, we can use a Python library called [requests](https://requests.readthedocs.io/en/master/).

<img src="./img/http.png" width="600px">

A [Uniform Resource Locator](https://en.wikipedia.org/wiki/URL) (URL), colloquially termed a web address, is a reference to a web resource that specifies its location on a computer network and a mechanism for retrieving it.

Every HTTP URL conforms to the syntax of a generic URI. The example below can be broken down into 6 parts, organized hierarchically:

`http://www.domain.com:80/path/to/myfile.html?key1=value1&key2=value2#anchor_in_doc`

URL parts:

- **protocol**: `http://` or `https://`.
- **domain name**: `www.domain.com`. Instead of a domain name, you can use an IP address.
- **port**: `:80`. Indicates the technical "door" to be used to access the server's resources. This part is generally absent, as the browser uses the standard ports associated with the protocols (80 for HTTP, 443 for HTTPS).
- **path**: `/path/to/myfile.html`. Path, on the web server, to the resource. In the early days of the Web, this path often corresponded to a "physical" path existing on the server. Today, it is often merely an abstraction managed by the web server, and need not correspond to any "physical" reality.
- **parameters**: `?key1=value1&key2=value2`. Constructed as a list of key/value pairs separated by an ampersand.
- **anchor**: `#anchor_in_doc`. Points to a given location in the resource.
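
As a quick check of the breakdown above, Python's standard library can split a URL into these parts. This is a minimal sketch using `urllib.parse` on the example URL; note that the library uses its own names for some parts (`scheme`, `query`, `fragment`).

```python
from urllib.parse import urlsplit, parse_qs

url = "http://www.domain.com:80/path/to/myfile.html?key1=value1&key2=value2#anchor_in_doc"
parts = urlsplit(url)

print(parts.scheme)           # protocol
print(parts.hostname)         # domain name
print(parts.port)             # port
print(parts.path)             # path
print(parse_qs(parts.query))  # parameters, as a dictionary of lists
print(parts.fragment)         # anchor
```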

When you type a URL into your browser's address bar, you're sending an HTTP **request** for a web page, and the server which stores that web page will accordingly send back a **response**, some web page data that your browser will render.

In the image below, in the inspector's network tab, you can see that for the URL, 2 HTTP requests received a positive response (status 200).

![404](./img/request-response.png)

### Get HTML Data with Requests

**HTTP**

In the HTTP protocol, a **method** is a command specifying a type of request, i.e. asking the server to perform an action. In general, the action concerns a resource identified by the URL following the method name.

There are many [methods](https://en.wikipedia.org/wiki/HTTP#HTTP/1.1_request_messages), the most common being `GET`, `HEAD` and `POST`:

- `GET`: the most common method for requesting a resource. A GET request has no effect on the resource.
- `HEAD`: this method only requests information about the resource, without actually requesting the resource itself.
- `POST`: this method is used to send data for processing (usually from an HTML form).
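
To see what distinguishes these three methods, the sketch below builds (but never sends) one request of each kind with the standard library's `urllib.request.Request`. Nothing goes over the network. The form URL and its data are hypothetical, chosen only to illustrate a `POST` body.

```python
from urllib.request import Request

novel_url = "https://www.gutenberg.org/cache/epub/4791/pg4791.txt"
form_url = "https://example.org/form"  # hypothetical address, for illustration only

# Build the three kinds of request without sending them.
get_request = Request(novel_url, method="GET")
head_request = Request(novel_url, method="HEAD")
post_request = Request(form_url, data=b"key1=value1", method="POST")

# Only the POST request carries data in its body.
for request in (get_request, head_request, post_request):
    print(request.get_method(), request.full_url, request.data)
```

The requests library used in the rest of this lesson exposes the same methods as functions: `requests.get()`, `requests.head()` and `requests.post()`.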

**Requests Python Library**

With the [`.get()` method](https://requests.readthedocs.io/en/latest/api/#requests.get), we can request to "get" web page data for a specific URL, which we will store in a variable called `response`.

In [8]:
import requests

In [9]:
response = requests.get("https://www.gutenberg.org/cache/epub/4791/pg4791.txt")

### HTTP Header Fields

Wikipedia: "[HTTP header fields](https://en.wikipedia.org/wiki/List_of_HTTP_header_fields) are a **list of strings sent and received by both the client program and server on every HTTP request and response. These headers are usually invisible to the end-user and are only processed or logged by the server and client applications. They define how information sent/received through the connection are encoded** (as in Content-Encoding), the session verification and identification of the client (as in browser cookies, IP address, user-agent) or their anonymity thereof (VPN or proxy masking, user-agent spoofing), how the server should handle data (as in Do-Not-Track), the age (the time it has resided in a shared cache) of the document being downloaded, amongst others."


In [10]:
response.headers

{'Date': 'Fri, 02 Jun 2023 12:40:57 GMT', 'Server': 'Apache', 'Content-Location': 'pg4791.txt.utf8', 'Vary': 'negotiate', 'TCN': 'choice', 'Last-Modified': 'Fri, 12 May 2023 21:00:11 GMT', 'Accept-Ranges': 'bytes', 'Content-Length': '461106', 'X-Backend': 'gutenweb1', 'Content-Type': 'text/plain; charset=utf-8'}

Thus, using the `headers` attribute of the [`Response` object](https://requests.readthedocs.io/en/latest/user/advanced/#request-and-response-objects), we can see that the document returned by Project Gutenberg is a plain text file.

But more often than not, as we'll soon see, responses are encoded in HTML.

In [11]:
response.headers['Content-Type']

'text/plain; charset=utf-8'
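
The `Content-Type` value packs two pieces of information into one string: the media type and the character encoding. As a minimal sketch, the standard library's `email.message.Message` class (HTTP headers follow the same conventions as e-mail headers) can separate them. The header value is copied from the output above.

```python
from email.message import Message

# Reuse the header value returned by Project Gutenberg.
header = Message()
header["Content-Type"] = "text/plain; charset=utf-8"

print(header.get_content_type())     # media type
print(header.get_content_charset())  # character encoding
```

An HTML page would typically announce `text/html` as its media type.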

### HTTP Status Code

If we check out `response`, it will simply tell us its [HTTP response code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status), aka whether the request was successful or not.

"200" is a successful response, while "404" is a common "Page Not Found" error.

In [12]:
response

<Response [200]>

Let's see what happens if we make a mistake entering the URL...  
('page4791' instead of 'pg4791')

In [15]:
bad_response = requests.get("https://www.gutenberg.org/cache/epub/4791/page4791.txt")
bad_response

<Response [404]>

![404](./img/bad_response.png)
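
Status codes are grouped into families by their first digit. The sketch below uses the standard library's `http.HTTPStatus` to look up the official phrase for a code; the helper name `describe_status` and the family labels are illustrative choices.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    phrase = HTTPStatus(code).phrase
    if 200 <= code < 300:
        family = "success"
    elif 400 <= code < 500:
        family = "client error"
    elif 500 <= code < 600:
        family = "server error"
    else:
        family = "other"
    return f"{code} {phrase} ({family})"

print(describe_status(200))
print(describe_status(404))
```

In requests, `response.status_code` holds the numeric code, and `response.raise_for_status()` raises an exception for 4xx and 5xx responses, which is handy when a script should stop on a bad URL.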

### Extract Text From Web Page

To actually get at the text data in the response, we need to use the [`.text` property](https://requests.readthedocs.io/en/latest/api/#requests.Response.text), which we will save in a variable called `text_string`.

Here, Project Gutenberg provides a version of the novel in plain text format (which is convenient from a pedagogical point of view). More often, though, the text data that we get from the Web is formatted in the HTML markup language, which we will talk more about in the BeautifulSoup section below.

In [16]:
text_string = response.text

Here's the text of the novel now in a variable.

In [17]:
print(text_string)

﻿The Project Gutenberg eBook of Voyage au Centre de la Terre, by Jules Verne

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United States, you
will have to check the laws of the country where you are located before
using this eBook.

Title: Voyage au Centre de la Terre

Author: Jules Verne

Release Date: March 21, 2002 [eBook #4971]
[Most recently updated: May 12, 2023]

Language: French


Produced by: Carlo Traverso, Robert Rowe, Charles Franks
and the Online Distributed Proofreading Team.
Revised by Richard Tonsing.

*** START OF THE PROJECT GUTENBERG EBOOK VOYAGE AU CENTRE DE LA TERRE ***




We thank the Bibliotheque Nationale de France that has made available
the image files at www://gallica.bnf.fr

### Extract Text From Multiple Web Pages

Repeating the operation for each novel would be tedious… Let's see how we can extract the text for every URL in the DataFrame at once. To do so, we're going to create a smaller DataFrame containing only the first 10 novels, which means less processing for the demonstration (and for the planet).

In [18]:
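# .copy() creates an independent DataFrame, so that adding a column later
# does not trigger pandas' SettingWithCopyWarning
sample_urls = urls[:10].copy()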

We're going to make a function called `scrape_novel()` that includes our `requests.get()` and `response.text` code.

In [19]:
def scrape_novel(url):
    response = requests.get(url)
    html_string = response.text
    return html_string

Then we're going to apply it to the "url" column of the DataFrame and create a new "text" column for the resulting extracted text.

In [None]:
sample_urls['text'] = sample_urls['url'].apply(scrape_novel)
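
When fetching many pages in a row, it is good practice to pause between requests and to keep going if one of them fails. The sketch below is one possible way to do this, not the method used in this notebook. The helper `scrape_all`, its `delay` parameter and the stand-in `fake_fetch` function are all hypothetical names, and the demonstration runs offline.

```python
import time

def scrape_all(urls, fetch, delay=1.0):
    """Fetch each URL in turn, pausing between requests.

    `fetch` is any function that takes a URL and returns its text.
    A URL whose fetch raises an exception is stored as None.
    """
    texts = {}
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay)  # be polite to the server
        try:
            texts[url] = fetch(url)
        except Exception as error:
            print(f"Could not fetch {url}: {error}")
            texts[url] = None
    return texts

# Offline demonstration with a stand-in for a real download function.
def fake_fetch(url):
    if "missing" in url:
        raise ValueError("404 Not Found")
    return f"text of {url}"

results = scrape_all(
    ["https://example.org/a", "https://example.org/missing"],
    fake_fetch,
    delay=0,
)
print(results)
```

With the real function, the call would look like `scrape_all(sample_urls['url'], scrape_novel)`. Note that `scrape_novel()` as written never raises on a 404 response; calling `response.raise_for_status()` inside it would turn such responses into exceptions that this loop can catch.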

In [21]:
sample_urls.head(3)

Unnamed: 0,Etext-No.,Author,Title,url,Language,text
0,26823,"Verne, Jules, 1828-1905",Michel Strogoff: Pièce à grand spectacle en 5 ...,https://www.gutenberg.org/cache/epub/26823/pg2...,fr,"﻿Project Gutenberg's Michel Strogoff, by Jules..."
1,800,"Verne, Jules, 1828-1905",Le tour du monde en quatre-vingts jours,https://www.gutenberg.org/cache/epub/800/pg800...,fr,﻿The Project Gutenberg EBook of Le Tour du Mon...
2,3456,"Verne, Jules, 1828-1905",Le tour du monde en quatre-vingts jours,https://www.gutenberg.org/cache/epub/3456/pg34...,fr,﻿The Project Gutenberg Etext of Tour Du Mond 8...


The DataFrame above is truncated, so we can't see the full contents of the "text" column. But if we print the "text" cell of a single row, we can check that the text was actually extracted for that URL (keep in mind that some URLs may return a 404 error page instead of a novel).

In [22]:
print(sample_urls.iloc[4]['text'])

﻿The Project Gutenberg EBook of Autour de la Lune, by Jules Verne

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: Autour de la Lune

Author: Jules Verne

Posting Date: May 30, 2011 [EBook #4717]
Release Date: December, 2003
[This file was first posted on March 6, 2002]

Language: French



*** START OF THIS PROJECT GUTENBERG EBOOK AUTOUR DE LA LUNE ***




Produced by John Walker, http://www.fourmilab.ch/








----------------------------------------------------------------------

                          AUTOUR DE LA LUNE
                        Etext Production Notes

Mathematical symbols are enclosed in the brackets "\(" and "\)" and are
expressed as their character or symbol names in the LaTeX typesetting
language. One formula has also been typeset in "visual mode", in l

It's simple and easy! However, that plain text format poses a few problems. It is not possible to automatically distinguish the Jules Verne text from the metadata and editorial paratext. Likewise, all the credit references at the end of the transcription are mixed in with the text, which can skew the analysis.

We need a more structured format that allows us to distinguish between the author's text, the metadata and the editorial paratext.
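
To illustrate both the possibilities and the limits of plain text, here is a rough heuristic that cuts everything outside the `*** START OF … ***` and `*** END OF … ***` lines. The start marker appears in the outputs above; the end marker is assumed to follow the same pattern, and the sample string is a shortened, made-up stand-in for a real file.

```python
import re

# The start marker is visible in the files above; the end marker is assumed
# to follow the same pattern.
START_MARKER = re.compile(
    r"^\*\*\* ?START OF (THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
END_MARKER = re.compile(
    r"^\*\*\* ?END OF (THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)

def strip_gutenberg_boilerplate(text):
    """Keep only what lies between the START and END marker lines."""
    start = START_MARKER.search(text)
    end = END_MARKER.search(text)
    begin = start.end() if start else 0
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()

# Shortened, made-up stand-in for a Project Gutenberg plain text file.
sample = """Title: Autour de la Lune

*** START OF THIS PROJECT GUTENBERG EBOOK AUTOUR DE LA LUNE ***

Produced by John Walker

AUTOUR DE LA LUNE

*** END OF THIS PROJECT GUTENBERG EBOOK AUTOUR DE LA LUNE ***

License text...
"""

print(strip_gutenberg_boilerplate(sample))
```

The licence header and footer are gone, but the "Produced by" credit is still there: nothing in the plain text marks it as editorial paratext rather than the author's prose.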

## Web Scraping

Not all web pages will be as easy to scrape as these Project Gutenberg plain text files, however. Let's say we wanted to scrape the lyrics for NTM's song "[On est encore là](https://genius.com/Supreme-ntm-on-est-encore-la-lyrics)" (1998) from Genius.com.

NB. The teaching sequence is by Melanie Walsh, and has been updated as the data model has evolved...

<img src="./img/ntm.png" class="center" >

Even at a glance, we can tell that this *Genius* web page is a lot more complicated than the *Project Gutenberg* page and that it contains a lot of information beyond the lyrics.

Sure enough, if we use our requests library again and try to grab the data for this web page, the underlying data is much more complicated, too.

In [23]:
response = requests.get("https://genius.com/Supreme-ntm-on-est-encore-la-lyrics")
html_string = response.text
print(html_string)

<!doctype html>
<html>
  <head>
    <title>Suprême NTM – On est encore là Lyrics | Genius Lyrics</title>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta content='width=device-width,initial-scale=1' name='viewport'>

  <meta name="apple-itunes-app" content="app-id=709482991">

<link href="https://assets.genius.com/images/apple-touch-icon.png?1685647951" rel="apple-touch-icon" />


  

  <link href="https://assets.genius.com/images/apple-touch-icon.png?1685647951" rel="apple-touch-icon" />

  

  <!-- Mobile IE allows us to activate ClearType technology for smoothing fonts for easy reading -->
  <meta http-equiv="cleartype" content="on">




<META name="y_key" content="f63347d284f184b0">

<meta property="og:site_name" content="Genius"/>
<meta property="fb:app_id" content="265539304824" />
<meta property="fb:pages" content="308252472676410" />

<link title="Genius" type="application/opensearchdescription+xml" rel="search" href="https://genius.com/opensearch.xm

How can we extract just the song lyrics from this messy soup of a document? Luckily there's a Python library that can help us called [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/), which parses HTML documents.

To understand BeautifulSoup and HTML, we're going to take things one step at a time. We'll start with a very simple example (beginner level) to understand the basic structure of an HTML page. Next, we'll go back to the Jules Verne novels available on Project Gutenberg (intermediate level) before trying to automatically extract the lyrics to the NTM song (advanced level).

### HTML5

This [Singers' singers webpage](http://corpus.enc.sorbonne.fr/corpus/test/punk.html) is adapted from Parrish's website titled "[Kittens and the TV Shows They Love](http://static.decontextualize.com/kittens.html)", made for the purposes of teaching BeautifulSoup.

Geek moment: thanks to IPython's [core.display module](https://ipython.org/ipython-doc/2/api/generated/IPython.core.display.html), it's even possible to display the content of a web page in a notebook.

In [24]:
from IPython.display import display, HTML
response = requests.get("http://corpus.enc.sorbonne.fr/corpus/test/punk.html")
html_string = response.text
display(HTML(html_string))

Let's take a look at the structure of this HTML page:

In [25]:
print(html_string)

<!doctype html>
<html>
  <head>
    <title>Singers' singers</title>
  </head>
  <body>
    <h1>Singers' singers</h1>
    <p class="subtitle">Singers and the Singers They Love</p>
    <div class="singer">
      <h2>Iggy Pop</h2>
      <div><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e8/Iggy_Pop_HS_Yearbook.jpeg/170px-Iggy_Pop_HS_Yearbook.jpeg"></div>
      <ul class="idols">
        <li><a href="https://en.wikipedia.org/wiki/Frank_Sinatra">Frank Sinatra</a></li>
        <li><a href="https://en.wikipedia.org/wiki/Jim_Morrison">Jim Morrison</a></li>
      </ul>
      Last check-up: <span class="lastcheckup">2023-05-30</span>
    </div>
    <div class="singer">
      <h2>Joe Strummer</h2>
      <div><img src="https://upload.wikimedia.org/wikipedia/commons/6/63/Joe_strummer_1999.jpg"></div>
      <ul class="idols">
        <li><a href="https://en.wikipedia.org/wiki/Johnny_Cash">Johnny Cash</a></li>
        <li><a href="https://en.wikipedia.org/wiki/John_Lydon">John Lydo

#### HTML Tags

HTML stands for HyperText Markup Language. It is the standard language for writing web page documents. The most important thing you need to know about HTML is that the language uses HTML "tags" to represent different elements, such as a main header `<h1>`. 

| HTML Tag | Explanation |
|--------------------|-------------------------------------------|
| `<!DOCTYPE>` | Defines the document type |
| `<html>` | Root of the HTML document |
| `<head>` | Metadata about the document |
| `<title>` | Title of the document |
| `<body>` | Document body |
| `<h1>` to `<h6>` | Headings |
| `<div>` | Block-level section in a document |
| `<p>` | Paragraph |
| `<ul>` | Unordered list |
| `<ol>` | Ordered list |
| `<li>` | List item |
| `<br>` | Line break |
| `<a>` | Hyperlink |
| `<img>` | Image |
| `<span>` | Inline section in a document |
| `<!-- comment here -->` | Comment |

HTML tags often, but not always, require a "closing" tag. For example, the main header "Singers' singers" is surrounded by `<h1>` (opening tag) and `</h1>` (closing tag) on either side: `<h1>Singers' singers</h1>`

#### HTML Attributes, Classes, and IDs

HTML elements sometimes come with even more information inside a tag. This will often be a keyword (like `class` or `id`) followed by an equals sign `=` and a further descriptor, as in `<div class="singer">`.

We need to know about tags as well as attributes, classes, and IDs because this is how we're going to extract specific HTML data with BeautifulSoup.
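
Before turning to BeautifulSoup, the sketch below uses the standard library's `html.parser` to log every opening and closing tag in a small fragment, along with its attributes. The class name `TagLogger` and the shortened image path `iggy.jpeg` are illustrative assumptions; the fragment is a simplified version of the page above.

```python
from html.parser import HTMLParser

class TagLogger(HTMLParser):
    """Record each opening and closing tag, with its attributes."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(("open", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("close", tag, {}))

# Simplified fragment of the "Singers' singers" page.
snippet = '<div class="singer"><h2>Iggy Pop</h2><img src="iggy.jpeg"></div>'

logger = TagLogger()
logger.feed(snippet)
for kind, tag, attributes in logger.events:
    print(kind, tag, attributes)
```

The log shows `div` and `h2` being opened and closed, while `img` is opened but never closed: it is one of the tags that do not take a closing tag.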

### BeautifulSoup (Singers’ singers)

In [26]:
from bs4 import BeautifulSoup

To make a BeautifulSoup document, we call `BeautifulSoup()` with two parameters: the `html_string` from our HTTP request and [the kind of parser](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use) that we want to use, which will always be `"html.parser"` for our purposes.

In [27]:
response = requests.get("http://corpus.enc.sorbonne.fr/corpus/test/punk.html")
html_string = response.text
document = BeautifulSoup(html_string, "html.parser")

In [28]:
print(type(document))

<class 'bs4.BeautifulSoup'>


In [29]:
document

<!DOCTYPE html>

<html>
<head>
<title>Singers' singers</title>
</head>
<body>
<h1>Singers' singers</h1>
<p class="subtitle">Singers and the Singers They Love</p>
<div class="singer">
<h2>Iggy Pop</h2>
<div><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e8/Iggy_Pop_HS_Yearbook.jpeg/170px-Iggy_Pop_HS_Yearbook.jpeg"/></div>
<ul class="idols">
<li><a href="https://en.wikipedia.org/wiki/Frank_Sinatra">Frank Sinatra</a></li>
<li><a href="https://en.wikipedia.org/wiki/Jim_Morrison">Jim Morrison</a></li>
</ul>
      Last check-up: <span class="lastcheckup">2023-05-30</span>
</div>
<div class="singer">
<h2>Joe Strummer</h2>
<div><img src="https://upload.wikimedia.org/wikipedia/commons/6/63/Joe_strummer_1999.jpg"/></div>
<ul class="idols">
<li><a href="https://en.wikipedia.org/wiki/Johnny_Cash">Johnny Cash</a></li>
<li><a href="https://en.wikipedia.org/wiki/John_Lydon">John Lydon</a></li>
</ul>
      Last check-up: <span class="lastcheckup">2023-06-01</span>
</div>
</body>
</ht

#### Extract HTML Element

We can use the [`.find()` method](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find) to find and extract certain elements, such as a main header.

In [30]:
document.find("h1")

<h1>Singers' singers</h1>

If we want only the text contained between those tags, we can use `.text` to extract just the text.

In [31]:
document.find("h1").text

"Singers' singers"

In [32]:
type(document.find("h1").text)

str

Find the HTML element that contains an image.

In [33]:
document.find("img")

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e8/Iggy_Pop_HS_Yearbook.jpeg/170px-Iggy_Pop_HS_Yearbook.jpeg"/>

In [34]:
document.find("img")['src']

'https://upload.wikimedia.org/wikipedia/commons/thumb/e/e8/Iggy_Pop_HS_Yearbook.jpeg/170px-Iggy_Pop_HS_Yearbook.jpeg'

**The `.find()` method returns only one result** (the first one).  
However, we may wish to obtain a list of all the images referenced in the page.

#### Extract Multiple HTML Elements

You can also extract multiple HTML elements at a time with the [`.find_all()` method](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all) that returns a list.

In [35]:
document.find_all("img")

[<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e8/Iggy_Pop_HS_Yearbook.jpeg/170px-Iggy_Pop_HS_Yearbook.jpeg"/>,
 <img src="https://upload.wikimedia.org/wikipedia/commons/6/63/Joe_strummer_1999.jpg"/>]

With Python, it's easy to use a `for` loop to go through the list:

In [36]:
for img in document.find_all("img"):
    print(img['src'])

https://upload.wikimedia.org/wikipedia/commons/thumb/e/e8/Iggy_Pop_HS_Yearbook.jpeg/170px-Iggy_Pop_HS_Yearbook.jpeg
https://upload.wikimedia.org/wikipedia/commons/6/63/Joe_strummer_1999.jpg


It's possible to extract a series of elements according to the value of their attributes (here, every `div` whose `class` attribute value is `singer`).

In [37]:
document.find_all("div", attrs={"class": "singer"})

[<div class="singer">
 <h2>Iggy Pop</h2>
 <div><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e8/Iggy_Pop_HS_Yearbook.jpeg/170px-Iggy_Pop_HS_Yearbook.jpeg"/></div>
 <ul class="idols">
 <li><a href="https://en.wikipedia.org/wiki/Frank_Sinatra">Frank Sinatra</a></li>
 <li><a href="https://en.wikipedia.org/wiki/Jim_Morrison">Jim Morrison</a></li>
 </ul>
       Last check-up: <span class="lastcheckup">2023-05-30</span>
 </div>,
 <div class="singer">
 <h2>Joe Strummer</h2>
 <div><img src="https://upload.wikimedia.org/wikipedia/commons/6/63/Joe_strummer_1999.jpg"/></div>
 <ul class="idols">
 <li><a href="https://en.wikipedia.org/wiki/Johnny_Cash">Johnny Cash</a></li>
 <li><a href="https://en.wikipedia.org/wiki/John_Lydon">John Lydon</a></li>
 </ul>
       Last check-up: <span class="lastcheckup">2023-06-01</span>
 </div>]

In [38]:
document.find("h2").text

'Iggy Pop'

In [39]:
document.find_all("h2")

[<h2>Iggy Pop</h2>, <h2>Joe Strummer</h2>]

``` {warning}
Heads up! The code below will cause an error.
```
Let's try to extract the text from all the `h2` elements:

In [40]:
document.find_all("h2").text

AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

That didn't work! In order to extract text data from multiple HTML elements, we need a `for` loop and some list-building.

In [43]:
all_h2_headers = document.find_all("h2")
all_h2_headers

[<h2>Iggy Pop</h2>, <h2>Joe Strummer</h2>]

First we will make an empty list called `h2_headers`.

Then `for` each `header` in `all_h2_headers`, we will grab the `.text`, put it into a variable called `header_contents`, then `.append()` it to our `h2_headers` list.

In [44]:
h2_headers = []
for header in all_h2_headers:
    header_contents = header.text
    h2_headers.append(header_contents)

In [45]:
h2_headers

['Iggy Pop', 'Joe Strummer']

You can produce the same result in a more "pythonic" way by using a **list comprehension** (shorter syntax):

In [46]:
h2_headers = [header.text for header in all_h2_headers]
h2_headers

['Iggy Pop', 'Joe Strummer']
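
The same pattern scales to nested structures. As a sketch, the code below works on a trimmed copy of the page (pasted inline so that it runs without a network connection) and combines `.find()`, `.find_all()` and a list comprehension to map each singer to their idols. The names `html_snippet`, `snippet` and `idols_by_singer` are illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the "Singers' singers" page, pasted inline.
html_snippet = """
<div class="singer">
  <h2>Iggy Pop</h2>
  <ul class="idols">
    <li><a href="https://en.wikipedia.org/wiki/Frank_Sinatra">Frank Sinatra</a></li>
    <li><a href="https://en.wikipedia.org/wiki/Jim_Morrison">Jim Morrison</a></li>
  </ul>
</div>
<div class="singer">
  <h2>Joe Strummer</h2>
  <ul class="idols">
    <li><a href="https://en.wikipedia.org/wiki/Johnny_Cash">Johnny Cash</a></li>
    <li><a href="https://en.wikipedia.org/wiki/John_Lydon">John Lydon</a></li>
  </ul>
</div>
"""
snippet = BeautifulSoup(html_snippet, "html.parser")

idols_by_singer = {}
for singer in snippet.find_all("div", attrs={"class": "singer"}):
    # .find() and .find_all() can be called on any element, not just the document
    name = singer.find("h2").text
    idols_by_singer[name] = [link.text for link in singer.find_all("a")]

print(idols_by_singer)
```

Calling `.find_all()` on each `div` rather than on the whole document is what keeps each singer's links separate.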

#### Inspect HTML Elements with Browser

Most of the time, if you're looking to extract something from an HTML document, it's best to use the "Inspect" feature of your web browser. You can hover over elements that you're interested in and find that specific element in the HTML.

<img src="./img/inspect1.png" width="700px">

For example, if we hover over the link "Johnny Cash":

<img src="./img/inspect2.png" width="700px" >

### Your Turn! (Project Gutenberg)

Ok so now we've learned a little bit about how to use BeautifulSoup to parse HTML documents. So how would we apply what we've learned to extract the text of Jules Verne's novels?

Project Gutenberg shares its editions in plain text format, but not exclusively: an HTML version is also available. It's a little more difficult to process than the plain text version, but at least we can try to extract the metadata and the author's text separately.

Let's follow our example of the *Voyage au Centre de la Terre*: https://www.gutenberg.org/cache/epub/4791/pg4791.html.

Inspect the page with your browser and try to extract:

- the main header of the page
- the title of the novel
- `p` elements with an `id` attribute

Finally, try to extract only the text of the novel (the part written by Jules Verne).

In [47]:
response = requests.get("https://www.gutenberg.org/cache/epub/4791/pg4791.html")
html_string = response.text
document = BeautifulSoup(html_string, "html.parser")

In [48]:
# Main header
document.find('h2')

<h2 id="pg-header-heading">The Project Gutenberg eBook of <span lang="fr">Voyage au Centre de la Terre</span></h2>

In [49]:
# Main header text
document.find('h2').text

'The Project Gutenberg eBook of Voyage au Centre de la Terre'

There is no `h1` element, which is a semantic oddity... A closer look reveals that the hierarchy of headings is not even respected (`h5` > `h2`).

But note that thanks to BeautifulSoup, you can easily print the text contained in all descendant elements (here `span`). How convenient!

In [50]:
# Title of the novel
document.find("h5").text

'VOYAGE  AU  CENTRE DE  LA TERRE'

In [51]:
# First p element with an id attribute
document.find("p", {"id" : True})

<p id="id00000">We thank the Bibliotheque Nationale de France that has made available</p>

In [52]:
# The `p` element whose id value is 'id02329'.
document.find("p", {"id" : "id02329"})

<p id="id02329">—Explique-toi, mon garçon,</p>

In [53]:
# A `p` element that has an `id` attribute AND a `style` attribute.
document.find("p", attrs={"id" : True, "style" : True})

<p id="id00001" style="margin-top: 4em">the image files at www://gallica.bnf.fr, authorizing the preparation
of the etext through OCR.</p>

In [54]:
# Same, compact syntax
document.find("p", {"id" : True, "style" : True})

<p id="id00001" style="margin-top: 4em">the image files at www://gallica.bnf.fr, authorizing the preparation
of the etext through OCR.</p>

In [55]:
# List of the first 8 p elements with an id attribute
document.find_all("p", {"id" : True})[0:8]

[<p id="id00000">We thank the Bibliotheque Nationale de France that has made available</p>,
 <p id="id00001" style="margin-top: 4em">the image files at www://gallica.bnf.fr, authorizing the preparation
 of the etext through OCR.</p>,
 <p id="id00002" style="margin-top: 2em">Nous remercions la Bibliothèque Nationale de France qui a mis à
 disposition les images dans www://gallica.bnf.fr, et a donné
 l'authorisation à les utilizer pour préparer ce texte.</p>,
 <p id="id00003" style="margin-top: 4em">Editorial note: We emphasize with <i>X</i> the runes that Verne emphasizes
 with serifs, and translitterates with uppecase.</p>,
 <p id="id00004">Note de l'éditeur: On répresente avec <i>X</i> les runes que Verne relève
 avec des sérifs, et transcrit avec des maj uscules.</p>,
 <p id="id00005" style="margin-top: 4em">Jules Verne</p>,
 <p id="id00008" style="margin-top: 2em">Le 24 mai 1863, un dimanche, mon oncle, le professeur Lidenbrock,
 revint précipitamment vers sa petite maison située au

In [None]:
#document.find_all(['p', 'h2', 'h5'])
#document.find_all(['p', 'h2', 'h5'], id=True) # cumbersome
#document.select('p[id], h2, h5')

Better, but we don't want to extract either the acknowledgements (`@id` 'id00000'-'id00002') or the editorial notes (`@id` 'id00003'-'id00005')...  
One strategy is to extract only the paragraphs following the novel's title (`h5`):

In [56]:
start_tag = document.find('h5')
start_tag.find_all_next('p')[0:2]

[<p id="id00008" style="margin-top: 2em">Le 24 mai 1863, un dimanche, mon oncle, le professeur Lidenbrock,
 revint précipitamment vers sa petite maison située au numéro 19
 de König-strasse, l'une des plus anciennes rues du vieux quartier
 de Hambourg.</p>,
 <p id="id00009">La bonne Marthe dut se croire fort en retard, car le dîner
 commençait à peine à chanter sur le fourneau de la cuisine.</p>]

We're making progress, but on closer inspection, we realize that we're forgetting the titles (`h2`). You can pass a list of elements to the `find_all_next()` method:

In [57]:
start_tag = document.find('h5')
start_tag.find_all_next(['p', 'h2'])[0:3]

[<h2 id="id00007" style="margin-top: 4em">I</h2>,
 <p id="id00008" style="margin-top: 2em">Le 24 mai 1863, un dimanche, mon oncle, le professeur Lidenbrock,
 revint précipitamment vers sa petite maison située au numéro 19
 de König-strasse, l'une des plus anciennes rues du vieux quartier
 de Hambourg.</p>,
 <p id="id00009">La bonne Marthe dut se croire fort en retard, car le dîner
 commençait à peine à chanter sur le fourneau de la cuisine.</p>]

All that's left is to save the text in a list, simply with a small loop:

In [58]:
verne_text_list = []
for element in start_tag.find_all_next(['p', 'h2']):
    text = element.text
    verne_text_list.append(text)
verne_text_list[0:3]

['I',
 "Le 24 mai 1863, un dimanche, mon oncle, le professeur Lidenbrock,\r\nrevint précipitamment vers sa petite maison située au numéro 19\r\nde König-strasse, l'une des plus anciennes rues du vieux quartier\r\nde Hambourg.",
 'La bonne Marthe dut se croire fort en retard, car le dîner\r\ncommençait à peine à chanter sur le fourneau de la cuisine.']
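The strings above still contain the `\r\n` line breaks inherited from the HTML source. As a possible first preprocessing step, the sketch below collapses every run of whitespace into a single space. The function name `normalize_whitespace` and the sample paragraph are illustrative assumptions, not part of the original notebook.

```python
# Hypothetical sample imitating one extracted paragraph.
paragraph = "La bonne Marthe dut se croire fort en retard, car le dîner\r\ncommençait à peine  à chanter."

def normalize_whitespace(text):
    """Collapse any run of whitespace (spaces, tabs, line breaks) into one space."""
    # split() without argument cuts on whitespace and drops empty pieces.
    return " ".join(text.split())

print(normalize_whitespace(paragraph))
```

It could then be applied to the whole list with a comprehension such as `[normalize_whitespace(t) for t in verne_text_list]`.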

In [59]:
# Alternatively, print the text of each element directly.
for element in start_tag.find_all_next(['p', 'h2'])[0:3]:
    print(element.text)

I
Le 24 mai 1863, un dimanche, mon oncle, le professeur Lidenbrock,
revint précipitamment vers sa petite maison située au numéro 19
de König-strasse, l'une des plus anciennes rues du vieux quartier
de Hambourg.
La bonne Marthe dut se croire fort en retard, car le dîner
commençait à peine à chanter sur le fourneau de la cuisine.


**Issue**

- find all `p` with an `id`: `document.find_all('p', id=True)`

With this syntax, the `id=True` filter applies to every tag in the list, so you cannot select the `p` elements that have an `id` together with the `h2` elements that do not: `find_all(['p', 'h2'], id=True)` would drop every `h2` without an `id`…  
In this case, you can use the `select()` method, which accepts CSS selectors:

- select all `p` with an `id` and all `h2`: `document.select('p[id], h2')`

All you have to do is write:

```python
verne_text_list = [element.text for element in document.select('p[id], h2, h5')]
```
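A possible refinement, sketched here on a hypothetical miniature page: `select()` applied to the whole document also matches the elements located before the title, so one option is to combine `find_all_next()` with a condition on the `id` attribute. The HTML fragment is invented for the demonstration and only imitates the structure of the Gutenberg page.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page imitating the structure of the Gutenberg HTML.
html_string = """
<h2 id="pg-header-heading">The Project Gutenberg eBook</h2>
<p id="id00000">We thank the Bibliotheque Nationale de France</p>
<h5>VOYAGE AU CENTRE DE LA TERRE</h5>
<h2 id="id00007">I</h2>
<p id="id00008">Le 24 mai 1863, un dimanche...</p>
<p>Paragraph without id</p>
"""
document = BeautifulSoup(html_string, "html.parser")

# select() on the whole document also returns the header and the acknowledgement.
everything = [element.text for element in document.select("p[id], h2, h5")]

# Starting from the title keeps only what follows it; the condition keeps
# every h2 but only the p elements that carry an id attribute.
start_tag = document.find("h5")
novel_only = [
    element.text
    for element in start_tag.find_all_next(["p", "h2"])
    if element.name == "h2" or element.has_attr("id")
]
print(novel_only)
```

Whether this filter is sufficient for the real page has to be checked by inspecting the result; it is only one way of combining the two approaches.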

**Summary**

Thanks to HTML, BeautifulSoup and a little trickery, you can extract Jules Verne's text alone. A little more difficult than with the plain text format, but also more subtle: you manage to separate the author's text from the editorial paratext.

Issue: this recipe works for this novel, but what guarantee do we have that it will work for other texts? To automate extractions, we need standardized sources... That's one of the advantages of APIs.

### Your Turn! (Genius)

So how would we apply what we've learned to extract NTM lyrics?

https://genius.com/Supreme-ntm-on-est-encore-la-lyrics

What HTML element do we need to "find" to extract the song lyrics?

In [60]:
response = requests.get("https://genius.com/Supreme-ntm-on-est-encore-la-lyrics")
html_string = response.text
document = BeautifulSoup(html_string, "html.parser")

What do we have in the `p` elements?

In [61]:
ntm_p = document.find_all("p")
print(ntm_p)

[<p class="HeaderCredits__Label-wx7h8g-2 ghcavQ">Produced by</p>, <p class="SongPage__HeaderSpace-sc-19xhmoi-3 cZZbkS"></p>, <p>How to Format Lyrics:</p>, <p>To learn more, check out our <a class="Link__StyledLink-rwn6i6-0 coFKbX" font-weight="light" href="https://genius.com/Genius-how-to-add-songs-to-genius-annotated">transcription guide</a> or visit our <a class="Link__StyledLink-rwn6i6-0 coFKbX" font-weight="light" href="https://genius.com/transcribers">transcribers forum</a></p>, <p>Artistes : KoolShen, Joeystarr<br/>
Instru :<br/>
(I) Madizm<br/>
(II) Zoxeakopat<br/>
Sample : Whayback – <a data-api_path="/songs/8394" href="https://genius.com/Artifacts-whayback-lyrics" rel="noopener">“Artifacts”</a><br/>
Album : <em>Suprême NTM</em> (1998)<br/>
[Bad image! Try pasting a different image link!]</p>, <p>Le double morceau de l'album éponyme du Suprême NTM, qui affirme la persévérance du duo dans sa subversion. La version (II) présente le même texte, privé de l'extrait télévisuel faisan

What HTML element do we need to "find" to extract the title?

In [62]:
print(document.find('h1').text)

On est encore là


Inspecting the page reveals that some `div` elements have a `data-lyrics-container` attribute:

In [63]:
print(document.find('div', {"data-lyrics-container": "true"}).text)

[Couplet 1 - KoolShen]Retour en force de l'ordre moral, je veux surtout pas te casser ton moralMais c'est le bordel quand t'entres pas dans leur panel, je suis formelEt reste formé pour ça, nique le CSAC'est pour ça que j'ai gardé ma tenue de combatQue je lâcherai pas mon ton-bâ, fais gaffe à ton dos, protège tes abdosSi tu parles cash de leurs vices, ils te feront pas de cadeauOn nous censure parce que notre culture est trop basanéeQu'on représente pas assez la France du passéC'est carré, on veut nous stopper : ça allaitTant qu'on rappait dans les MJC, mais aujourd'huiLe phénomène a grandi, Dieu merci, je remercieLes jeunes qui rappent sans merci et puis nique sa mère, siOn passe pas dans leurs radios, on fera le tour, c'est pas graveLe plus dur c'était de sortir de la cave et les gens le saventHey, on est encore làPrêt à foutre le souk et tout le monde est ccord-d'a[Refrain - Joeystarr, KoolShen]Non, non ! (hahaha...) On est encore làPrêt à foutre le souk et tout le monde est ccord-d

**Issue**. We lose the line breaks between the verses. What can we do? The [`get_text()` method](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text) returns all the text in a document (or beneath a tag) as a single Unicode string and lets you specify a string used to join the bits of text together…

In [64]:
lyrics = document.find('div', {"data-lyrics-container": "true"}).get_text("\n")
print(lyrics)

[Couplet 1 - KoolShen]
Retour en force de l'ordre moral
, je veux surtout pas te casser ton moral
Mais c'est le bordel quand t'entres pas dans leur panel
, je suis formel
Et reste formé pour ça, 
nique le CSA
C'est pour ça que j'ai gardé ma tenue de combat
Que je lâcherai pas mon ton-bâ, 
fais gaffe à ton dos, protège tes abdos
Si tu parles cash de leurs vices, ils te feront pas de cadeau
On nous censure parce que notre culture est trop basanée
Qu'on représente pas assez la France du passé
C'est carré, on veut nous stopper : ça allait
Tant qu'on rappait dans les MJC
, mais aujourd'hui
Le phénomène a grandi, Dieu merci, je remercie
Les jeunes qui rappent sans merci et puis
 
nique sa mère
, si
On passe pas dans leurs radios, on fera le tour, c'est pas grave
Le plus dur c'était de sortir de la cave et les gens le savent
Hey, on est encore là
Prêt à foutre le souk et tout le monde est ccord-d'a
[Refrain - Joeystarr, KoolShen]
Non, non ! (hahaha...) On est encore là
Prêt à foutre le souk e

**Issue (continued)**. Good idea, but the result is not great: the annotations split the text into several pieces, so line breaks also appear in the middle of verses...  
We need to be more specific and process the `br` elements themselves:

In [65]:
for br in document.find_all("br"):
    br.replace_with("\n")
lyrics = document.find('div', {"data-lyrics-container": "true"}).text
print(lyrics)

[Couplet 1 - KoolShen]
Retour en force de l'ordre moral, je veux surtout pas te casser ton moral
Mais c'est le bordel quand t'entres pas dans leur panel, je suis formel
Et reste formé pour ça, nique le CSA
C'est pour ça que j'ai gardé ma tenue de combat
Que je lâcherai pas mon ton-bâ, fais gaffe à ton dos, protège tes abdos
Si tu parles cash de leurs vices, ils te feront pas de cadeau
On nous censure parce que notre culture est trop basanée
Qu'on représente pas assez la France du passé
C'est carré, on veut nous stopper : ça allait
Tant qu'on rappait dans les MJC, mais aujourd'hui
Le phénomène a grandi, Dieu merci, je remercie
Les jeunes qui rappent sans merci et puis nique sa mère, si
On passe pas dans leurs radios, on fera le tour, c'est pas grave
Le plus dur c'était de sortir de la cave et les gens le savent
Hey, on est encore là
Prêt à foutre le souk et tout le monde est ccord-d'a

[Refrain - Joeystarr, KoolShen]
Non, non ! (hahaha...) On est encore là
Prêt à foutre le souk et tout 

**Issue (end)**. Gee, we only have the beginning of the lyrics, which are spread over several `div` elements...

In [66]:
lyrics = document.find_all('div', {"data-lyrics-container": "true"})
for lyrics_part in lyrics:
    print(lyrics_part.text)

[Couplet 1 - KoolShen]
Retour en force de l'ordre moral, je veux surtout pas te casser ton moral
Mais c'est le bordel quand t'entres pas dans leur panel, je suis formel
Et reste formé pour ça, nique le CSA
C'est pour ça que j'ai gardé ma tenue de combat
Que je lâcherai pas mon ton-bâ, fais gaffe à ton dos, protège tes abdos
Si tu parles cash de leurs vices, ils te feront pas de cadeau
On nous censure parce que notre culture est trop basanée
Qu'on représente pas assez la France du passé
C'est carré, on veut nous stopper : ça allait
Tant qu'on rappait dans les MJC, mais aujourd'hui
Le phénomène a grandi, Dieu merci, je remercie
Les jeunes qui rappent sans merci et puis nique sa mère, si
On passe pas dans leurs radios, on fera le tour, c'est pas grave
Le plus dur c'était de sortir de la cave et les gens le savent
Hey, on est encore là
Prêt à foutre le souk et tout le monde est ccord-d'a

[Refrain - Joeystarr, KoolShen]
Non, non ! (hahaha...) On est encore là
Prêt à foutre le souk et tout 

**Reuse**.  
All that remains is to write this code into a small function so that we can reuse it to automatically extract the text of other songs.

In [67]:
def get_lyrics(song_genius_url):
    response = requests.get(song_genius_url)
    document = BeautifulSoup(response.text, "html.parser")
    for br in document.find_all("br"):
        br.replace_with("\n")
    lyrics = document.find_all('div', {"data-lyrics-container": "true"})
    for lyrics_part in lyrics:
        print(lyrics_part.text)
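The function above prints the lyrics. If you would rather store them, a variant that returns a string may be more convenient. The sketch below is one possible version: the name `extract_lyrics` and the sample fragment are assumptions made for the illustration, and the function takes the HTML as a string so that it can be tried without a network call.

```python
from bs4 import BeautifulSoup

def extract_lyrics(html_string):
    """Return the lyrics found in an HTML string as one block of text."""
    document = BeautifulSoup(html_string, "html.parser")
    # Turn every <br> into a real line break before reading the text.
    for br in document.find_all("br"):
        br.replace_with("\n")
    parts = document.find_all("div", {"data-lyrics-container": "true"})
    # Join the text of all the lyrics containers, one after the other.
    return "\n".join(part.text for part in parts)

# Hypothetical fragment imitating the Genius markup:
# two containers, one annotation link inside the first line.
sample_html = (
    '<div data-lyrics-container="true">Line one, <a href="#">annotated</a><br/>Line two</div>'
    '<div data-lyrics-container="true">Line three</div>'
)
print(extract_lyrics(sample_html))
```

With a live page, you would call it as `extract_lyrics(requests.get(song_genius_url).text)`.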

In [None]:
# https://genius.com/Grandmaster-flash-and-the-furious-five-the-message-lyrics
# https://genius.com/De-la-soul-the-magic-number-lyrics
# https://genius.com/Dr-jeckyll-and-mr-hyde-genius-rap-lyrics
# https://genius.com/Supreme-ntm-on-est-encore-la-lyrics
get_lyrics('https://genius.com/Amel-bent-ma-philosophie-lyrics')

**Conclusion**.

Unfortunately, this method is fragile and not future-proof: it depends on the HTML structure of the page, which can change at any time… **We need standardized data served via an API.**

## APIs

[Wikipedia](https://en.wikipedia.org/wiki/API): An **application programming interface** (API) is a way for two or more computer programs to communicate with each other. It is a type of software interface, offering a service to other pieces of software.  
In contrast to a user interface, which connects a computer to a person, an application programming interface connects computers or pieces of software to each other. It is not intended to be used directly by a person (the end user) other than a computer programmer who is incorporating it into the software.

- **An API enables a computer to request information from another computer over the Internet**.
- **Data access endpoints and the format of the response are standardized according to a specification**.

### Genius API (JSON)

According to its [documentation](https://docs.genius.com), the Genius API provides access to various resources, including:

- Search (results)
- Artists
- Songs

<img src="./img/genius_doc.png" class="center" >

Let's explore the possibilities out of curiosity…

Genius uses the OAuth2 standard for making API calls on behalf of individual users.  
Requests are authenticated with an **Access Token** sent in an HTTP header or simply **as a request parameter**.

[How to get, store and call your Genius API keys](https://melaniewalsh.github.io/Intro-Cultural-Analytics/04-Data-Collection/07-Genius-API.html#api-keys)…

The best practice is to keep your API keys out of your code, for example in a separate file.

My key is stored in a Python file called `api_key.py` that contains a single variable, `your_client_access_token = "MY_API_KEY"`, so I can import it into this notebook with `import api_key`.

In [69]:
import api_key
#api_key.your_client_access_token
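Another common way to keep a key out of the code is an environment variable. The sketch below is an illustration under assumptions: the variable name `GENIUS_ACCESS_TOKEN` is hypothetical, and the placeholder value is set in the code only so that the example runs on its own. It also shows the header form of the authentication mentioned above, following the usual OAuth2 "Bearer" convention.

```python
import os

# Hypothetical variable name. In practice you would define it in your shell
# or notebook environment; the placeholder below only makes the demo runnable.
os.environ.setdefault("GENIUS_ACCESS_TOKEN", "MY_API_KEY")

access_token = os.environ["GENIUS_ACCESS_TOKEN"]

# Header form of the authentication, as an alternative to the request parameter.
headers = {"Authorization": f"Bearer {access_token}"}
print(sorted(headers))
```

Such a dictionary can be passed to Requests with `requests.get(url, headers=headers)`.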

#### Making an API Request

Making an API request looks a lot like typing a specially-formatted URL. But instead of getting a rendered HTML web page in return, you get some data in return.

Let's start with the basic search, which allows you to get a bunch of Genius data about any artist or songs that you search for:

http://api.genius.com/search?q={search_term}&access_token={client_access_token}
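A search term may contain spaces or accented characters, which have to be escaped in a URL. Requests takes care of this for you, but as a minimal sketch of what happens, the standard library can build the query string explicitly. The token below is a placeholder.

```python
from urllib.parse import urlencode

base_url = "http://api.genius.com/search"
# Placeholder token: replace it with your own client access token.
params = {"q": "Suprême NTM", "access_token": "MY_API_KEY"}

# urlencode() escapes spaces and accented characters for us.
genius_search_url = f"{base_url}?{urlencode(params)}"
print(genius_search_url)
```

With Requests, the same dictionary can be passed directly: `requests.get(base_url, params=params)`.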

In [70]:
search_term = "Supreme NTM"
genius_search_url = f"http://api.genius.com/search?q={search_term}&access_token={api_key.your_client_access_token}"
response = requests.get(genius_search_url)
response.headers['Content-Type']

'application/json; charset=utf-8'

This time, the response is not formatted as plain text or HTML, but as JSON.
With Requests, you can call [.json()](https://requests.readthedocs.io/en/latest/api/#requests.Response.json) to return the JSON-encoded content of a response.

Thanks to IPython's [`.display` module](https://ipython.readthedocs.io/en/stable/api/generated/IPython.display.html#IPython.display.JSON), we can effectively display a response that is quite long:

In [71]:
from IPython.display import JSON

In [72]:
JSON(response.json())

<IPython.core.display.JSON object>

#### JSON

[JSON](https://en.wikipedia.org/wiki/JSON) (JavaScript Object Notation) is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays (or other serializable values). JSON is commonly used by APIs.  
JSON data can be nested and contains key/value pairs.

JSON [Syntax Rules](https://www.w3schools.com/whatis/whatis_json.asp):

- Data is in name/value pairs: `'title': 'On est encore là'`
- Data is separated by commas: `'id': 87367, 'language': 'fr'`
- Curly braces hold **objects**: `'release_date_components': {'year': 1998, 'month': 4, 'day': 21}`
- Square brackets hold **arrays**: `'featured_artists': […]`

See also https://en.wikipedia.org/wiki/JSON#Syntax:

- **Array**: an ordered list of zero or more elements
- **Object**: a collection of name–value pairs where the names (also called keys) are strings. 
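
The examples in this section use single quotes because they are copied from the way Python displays the decoded data; in a JSON document itself, strings are written with double quotes. The short sketch below, based on a hand-written sample, shows the round trip between JSON text and Python objects with the standard `json` module.

```python
import json

# A JSON document is text; note the double quotes required by the format.
json_text = '{"title": "On est encore là", "id": 87367, "featured_artists": [], "release_date_components": {"year": 1998, "month": 4, "day": 21}}'

# json.loads() turns the text into Python objects: object -> dict, array -> list.
song = json.loads(json_text)
print(type(song), type(song["featured_artists"]))
print(song["release_date_components"]["year"])

# json.dumps() goes the other way; ensure_ascii=False keeps accented characters readable.
print(json.dumps(song, ensure_ascii=False, indent=2))
```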

We can index this data and look at the 10th “hit” (`['hits'][9]`) about our search term "Supreme NTM":

NB: the code `\xa0` represents a non-breaking space.

In [73]:
json_data = response.json()
json_data['response']['hits'][9]

{'highlights': [],
 'index': 'song',
 'type': 'song',
 'result': {'annotation_count': 24,
  'api_path': '/songs/87367',
  'artist_names': 'Suprême NTM',
  'full_title': 'On est encore là by\xa0Suprême\xa0NTM',
  'header_image_thumbnail_url': 'https://images.genius.com/17800f65ce7efc1a4bea4f1bc86685c6.300x300x1.jpg',
  'header_image_url': 'https://images.genius.com/17800f65ce7efc1a4bea4f1bc86685c6.320x320x1.jpg',
  'id': 87367,
  'language': 'fr',
  'lyrics_owner_id': 64332,
  'lyrics_state': 'complete',
  'path': '/Supreme-ntm-on-est-encore-la-lyrics',
  'pyongs_count': 1,
  'relationships_index_url': 'https://genius.com/Supreme-ntm-on-est-encore-la-sample',
  'release_date_components': {'year': 1998, 'month': 4, 'day': 21},
  'release_date_for_display': 'April 21, 1998',
  'release_date_with_abbreviated_month_for_display': 'Apr. 21, 1998',
  'song_art_image_thumbnail_url': 'https://images.genius.com/17800f65ce7efc1a4bea4f1bc86685c6.300x300x1.jpg',
  'song_art_image_url': 'https://imag

#### Looping Through JSON Data

We can see that each entry of `hits` in the `response` corresponds to a song. With a `for` loop, we can easily extract the title (`full_title`) as well as the `id` of each song:

In [74]:
for song in json_data['response']['hits']:
    print(
        f"{str(song['result']['id']):10}", # constrain string length to align
        song['result']['full_title']
    )

67453      Ma Benz by Suprême NTM (Ft. Lord Kossity)
2706       Laisse pas traîner ton fils by Suprême NTM
4921       That’s My People by Suprême NTM
3957441    Sur le drapeau by Suprême NTM (Ft. Sofiane)
61371      Seine-Saint-Denis Style by Suprême NTM
54666      Pose ton gun by Suprême NTM
70084      Affirmative Action (Saint-Denis-Style Remix) by Suprême NTM (Ft. The Firm)
203        La fièvre by Suprême NTM
51240      Tout n'est pas si facile by Suprême NTM
87367      On est encore là by Suprême NTM


A closer look reveals other relevant information:
    
- date of release = ???
- number of visits to the associated page = ???


In [75]:
for song in json_data['response']['hits']:
    print(
        f"{str(song['result']['id']):10}",
        song['result']['release_date_components']['year'],
        f"{str(song['result']['stats']['pageviews']):8}",
        song['result']['full_title']
    )

67453      1998 97018    Ma Benz by Suprême NTM (Ft. Lord Kossity)
2706       1998 104386   Laisse pas traîner ton fils by Suprême NTM
4921       1998 74920    That’s My People by Suprême NTM
3957441    2018 58408    Sur le drapeau by Suprême NTM (Ft. Sofiane)
61371      1998 53136    Seine-Saint-Denis Style by Suprême NTM
54666      1998 46390    Pose ton gun by Suprême NTM
70084      1996 32851    Affirmative Action (Saint-Denis-Style Remix) by Suprême NTM (Ft. The Firm)
203        1995 27962    La fièvre by Suprême NTM
51240      1995 26234    Tout n'est pas si facile by Suprême NTM
87367      1998 20264    On est encore là by Suprême NTM
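
Not every song necessarily has every field filled in (the answer to Exercise 2, further down, guards against an empty `release_date_components`). The sketch below shows one defensive way to read nested values; the two hits are hypothetical samples that only imitate the structure of the search response.

```python
# Hypothetical hits imitating the structure of the search response;
# the second one has no release date and no statistics.
hits = [
    {"result": {"id": 87367, "full_title": "On est encore là by Suprême NTM",
                "release_date_components": {"year": 1998, "month": 4, "day": 21},
                "stats": {"pageviews": 20264}}},
    {"result": {"id": 1, "full_title": "Untitled demo",
                "release_date_components": None,
                "stats": {}}},
]

rows = []
for hit in hits:
    result = hit["result"]
    # `or {}` covers both a missing key and an explicit None value.
    date = result.get("release_date_components") or {}
    stats = result.get("stats") or {}
    rows.append((result["id"], date.get("year"), stats.get("pageviews"), result["full_title"]))

for row in rows:
    print(row)
```

Missing values come out as `None` instead of stopping the loop with an error.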


We can take advantage of what we've already learned to store this metadata in a dataframe. We also take this opportunity to extract:

- artist name = ???
- artist id =  ???

NB. This `artist_id` will allow us to automatically extract information about the band using the `artists` route. (`GET /artists/:id`).

In [76]:
songs_meta = []
for song in json_data['response']['hits']:
    songs_meta.append([song['result']['id'],
                       song['result']['full_title'],
                       song['result']['release_date_components']['year'],
                       song['result']['stats']['pageviews'],
                       song['result']['primary_artist']['id'],
                       song['result']['artist_names']
])

#Make a Pandas dataframe from a list
songs_df = pd.DataFrame(songs_meta)
songs_df.columns = ['song_id', 'song_title', 'year', 'page_views', 'artist_id', 'artist_names']
songs_df

|  | song_id | song_title | year | page_views | artist_id | artist_names |
|---|---|---|---|---|---|---|
| 0 | 67453 | Ma Benz by Suprême NTM (Ft. Lord Kossity) | 1998 | 97018 | 24568 | Suprême NTM (Ft. Lord Kossity) |
| 1 | 2706 | Laisse pas traîner ton fils by Suprême NTM | 1998 | 104386 | 24568 | Suprême NTM |
| 2 | 4921 | That’s My People by Suprême NTM | 1998 | 74920 | 24568 | Suprême NTM |
| 3 | 3957441 | Sur le drapeau by Suprême NTM (Ft. Sofiane) | 2018 | 58408 | 24568 | Suprême NTM (Ft. Sofiane) |
| 4 | 61371 | Seine-Saint-Denis Style by Suprême NTM | 1998 | 53136 | 24568 | Suprême NTM |
| 5 | 54666 | Pose ton gun by Suprême NTM | 1998 | 46390 | 24568 | Suprême NTM |
| 6 | 70084 | Affirmative Action (Saint-Denis-Style Remix) b... | 1996 | 32851 | 24568 | Suprême NTM (Ft. The Firm) |
| 7 | 203 | La fièvre by Suprême NTM | 1995 | 27962 | 24568 | Suprême NTM |
| 8 | 51240 | Tout n'est pas si facile by Suprême NTM | 1995 | 26234 | 24568 | Suprême NTM |
| 9 | 87367 | On est encore là by Suprême NTM | 1998 | 20264 | 24568 | Suprême NTM |


In [77]:
def get_songs_meta_of_a_search(band_name):
    genius_search_url = f"http://api.genius.com/search?q={band_name}&access_token={api_key.your_client_access_token}"
    json_data = requests.get(genius_search_url).json()
    songs_meta = []
    for song in json_data['response']['hits']:
        songs_meta.append([song['result']['id'],
                           song['result']['full_title'],
                           song['result']['release_date_components']['year'],
                           song['result']['stats']['pageviews'],
                           song['result']['primary_artist']['id'],
                           song['result']['artist_names']
                          ])
    
    songs_df = pd.DataFrame(songs_meta)
    songs_df.columns = ['song_id', 'song_title', 'year', 'page_views', 'artist_id', 'artist_names']
    return(songs_df)

In [79]:
get_songs_meta_of_a_search('Orelsan')

|  | song_id | song_title | year | page_views | artist_id | artist_names |
|---|---|---|---|---|---|---|
| 0 | 3266715 | Notes pour trop tard by OrelSan (Ft. Ibeyi) | 2017 | 551896 | 1286 | OrelSan (Ft. Ibeyi) |
| 1 | 4088140 | Rêves bizarres by OrelSan (Ft. Damso) | 2018 | 314322 | 1286 | OrelSan (Ft. Damso) |
| 2 | 55645 | Suicide social by OrelSan | 2011 | 255724 | 1286 | OrelSan |
| 3 | 3267187 | San by OrelSan | 2017 | 234475 | 1286 | OrelSan |
| 4 | 3267190 | Défaite de famille by OrelSan | 2017 | 218459 | 1286 | OrelSan |
| 5 | 3267195 | Paradis by OrelSan | 2017 | 216777 | 1286 | OrelSan |
| 6 | 51809 | Sale pute by OrelSan | 2007 | 215420 | 1286 | OrelSan |
| 7 | 3266712 | Zone by OrelSan (Ft. Dizzee Rascal & Nekfeu) | 2017 | 211259 | 1286 | OrelSan (Ft. Dizzee Rascal & Nekfeu) |
| 8 | 7325573 | L'odeur de l'essence by OrelSan | 2021 | 201167 | 1286 | OrelSan |
| 9 | 3266713 | La pluie by OrelSan (Ft. Stromae) | 2017 | 188567 | 1286 | OrelSan (Ft. Stromae) |


#### Exercise 1

Extract the NTM band description using the following route, which does not require an authentication token: https://genius.com/api/artists/{artist_id}

- find the band ID
- discover the [`text_format` query parameter](https://docs.genius.com/#/response-format-h1), which can be used to specify how text content is formatted.

**Answer:**

```python
artist_id = "24568"
text_format = 'plain'

genius_api_url = f"https://genius.com/api/artists/{artist_id}?text_format={text_format}"
json_data = requests.get(genius_api_url).json()
print(json_data['response']['artist']['description']['plain'])
```


#### Exercise 2

https://docs.genius.com/#artists-h2:

```
GET /artists/:id/songs

Documents (songs) for the artist specified. By default, 20 items are returned for each request.
```

The query parameter `sort` sorts songs by popularity.  
**Goal**: write a function that returns a list of the n most popular songs for an artist_id.


**Answer**:

```python
def get_popular_songs_of_artist(artist_id, sort_param, max_number):
    genius_api_url = f"https://genius.com/api/artists/{artist_id}/songs?sort={sort_param}&per_page={max_number}&text_format=plain"
    json_data = requests.get(genius_api_url).json()

    for song in json_data['response']['songs']:
        year = song['release_date_components']['year'] if song['release_date_components'] else 'none'
        print(song['id'],
              song['annotation_count'],
              year,
              song['full_title'])
```

and

```python
get_popular_songs_of_artist('24568', 'popularity', 10)
```

#### LyricsGenius Wrapper

Our initial aim (already achieved) was to extract song lyrics. But is there a simpler, more durable way? There is a `song` resource, and its [documentation](https://docs.genius.com/#songs-h2) is encouraging.

> A song is a document hosted on Genius. It's usually music lyrics.

Let's have a look.


In [80]:
from IPython.display import JSON
song_id = '87367'
json_data = requests.get(f"https://api.genius.com/songs/{song_id}?access_token={api_key.your_client_access_token}").json()
JSON(json_data)

<IPython.core.display.JSON object>

**Disappointing!** Unfortunately, contrary to what the documentation claims, the lyrics are not accessible.  
The Genius API appears extremely restrictive, probably for commercial reasons (it's easier to contribute than to grab the data...). In this case, it may be useful to use a wrapper.

An **API wrapper** is a language-specific package or kit that encapsulates multiple API calls to make complicated functions easy to use. 
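
To give an idea of what a wrapper does, here is a deliberately tiny sketch. It is not how LyricsGenius is implemented: the class `TinyGeniusClient`, its methods and the injected `fetch_json` function are all hypothetical names chosen for the illustration. The point is only that URL building and authentication are hidden behind readable method calls.

```python
from urllib.parse import urlencode

class TinyGeniusClient:
    """Hypothetical mini-wrapper: hides URL building behind readable methods."""

    base_url = "https://api.genius.com"

    def __init__(self, access_token, fetch_json):
        self.access_token = access_token
        # fetch_json is any function that takes a URL and returns decoded JSON,
        # for example: lambda url: requests.get(url).json()
        self.fetch_json = fetch_json

    def _call(self, path, **params):
        # Every call gets the token added automatically.
        params["access_token"] = self.access_token
        return self.fetch_json(f"{self.base_url}{path}?{urlencode(params)}")

    def search(self, term):
        return self._call("/search", q=term)

    def song(self, song_id):
        return self._call(f"/songs/{song_id}")

# Demo with a fake fetch function that simply echoes the URL it receives.
client = TinyGeniusClient("MY_API_KEY", fetch_json=lambda url: {"requested": url})
print(client.search("Supreme NTM"))
print(client.song(87367))
```

Passing the fetch function as a parameter keeps the sketch testable without a network connection; a real wrapper would also handle errors, pagination and rate limits.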

For Genius, there's an excellent wrapper, freely available, **LyricsGenius**. 

- code: https://github.com/johnwmillr/LyricsGenius
- documentation: https://lyricsgenius.readthedocs.io/en/master/ 

The [method implemented to extract lyrics](https://github.com/johnwmillr/LyricsGenius/blob/master/lyricsgenius/genius.py#L95) is quite similar to ours. However, the code is of better quality: it's more tried and tested, and we can expect it to be maintained. And it's easier to use. There's no need, for example, to find the song identifier.

In [None]:
pip install git+https://github.com/johnwmillr/LyricsGenius.git

In [83]:
import lyricsgenius
genius = lyricsgenius.Genius(api_key.your_client_access_token)

In [84]:
artist = genius.search_artist("NTM", max_songs=3, sort="title")

Searching for songs by NTM...

Changing artist name to 'Suprême NTM'
Song 1: "93.2 NTMEO Radio"
Song 2: "Affirmative Action"
Song 3: "Affirmative Action (Saint-Denis-Style Remix)"

Reached user-specified song limit (3).
Done. Found 3 songs.


In [85]:
print(artist, type(artist))

Suprême NTM, 3 songs <class 'lyricsgenius.types.artist.Artist'>


The `artist` object stores all the information about an artist, including:

- their name: `artist.name`
- their id: `artist.id`
- the list of their songs: `artist.songs`

This makes it easy to access each song and to get, for example for the second one in the list:

- the title: `artist.songs[1].title`
- the id: `artist.songs[1].id`
- the lyrics: `artist.songs[1].lyrics`



In [None]:
# Print and analyze all the current properties and values of `artist` object?
'''
from pprint import pprint
pprint(vars(artist))
'''

In [86]:
print(artist.name, artist.id)
print('=====')
print(artist.songs[1].title, artist.songs[1].id)
print('=====')
print(artist.songs[1].lyrics[56:140], '…')

Suprême NTM 24568
=====
Affirmative Action 1657320
=====
Chacun sa Mafia 
Chacun sa Mifa
You sit back relax catching contacts
Sip your cognac …


It's easy to loop over the song list:

In [87]:
for song in artist.songs:
    print(song.title)

93.2 NTMEO Radio
Affirmative Action
Affirmative Action (Saint-Denis-Style Remix)


LyricsGenius also gives direct access to the lyrics of a song:

In [89]:
song_title_search = 'Encore là'
band_name_search = 'NTM'
print(genius.search_song(song_title_search, band_name_search).lyrics)

Searching for "Encore là" by NTM...
Done.
18 ContributorsOn est encore là Lyrics[Couplet 1 - KoolShen]
Retour en force de l'ordre moral, je veux surtout pas te casser ton moral
Mais c'est le bordel quand t'entres pas dans leur panel, je suis formel
Et reste formé pour ça, nique le CSA
C'est pour ça que j'ai gardé ma tenue de combat
Que je lâcherai pas mon ton-bâ, fais gaffe à ton dos, protège tes abdos
Si tu parles cash de leurs vices, ils te feront pas de cadeau
On nous censure parce que notre culture est trop basanée
Qu'on représente pas assez la France du passé
C'est carré, on veut nous stopper : ça allait
Tant qu'on rappait dans les MJC, mais aujourd'hui
Le phénomène a grandi, Dieu merci, je remercie
Les jeunes qui rappent sans merci et puis nique sa mère, si
On passe pas dans leurs radios, on fera le tour, c'est pas grave
Le plus dur c'était de sortir de la cave et les gens le savent
Hey, on est encore là
Prêt à foutre le souk et tout le monde est ccord-d'a

[Refrain - Joeystarr, 

It becomes very easy to write a small function to build a research corpus...

In [90]:
import lyricsgenius
import pandas as pd

def get_artists_lyrics(artist_names_list, max_songs, sort_criteria):
    """Return a dataframe with one row per song for each artist searched."""
    genius = lyricsgenius.Genius(api_key.your_client_access_token)
    lyrics_df = pd.DataFrame(columns=['artist_name', 'artist_id', 'song_id', 'song_title', 'song_lyrics'])

    for artist_name in artist_names_list:
        artist = genius.search_artist(artist_name, max_songs=max_songs, sort=sort_criteria)
        for song in artist.songs:
            # Append one row per song at the end of the dataframe
            lyrics_df.loc[len(lyrics_df)] = [
                str(artist.name),
                str(artist.id),
                str(song.id),
                str(song.title),
                str(song.lyrics)
            ]
    return lyrics_df

In [91]:
lyrics_df = get_artists_lyrics(['NTM', 'Orelsan'], 3, 'popularity')

Searching for songs by NTM...

Changing artist name to 'Suprême NTM'
Song 1: "Laisse pas traîner ton fils"
Song 2: "Ma Benz"
Song 3: "That’s My People"

Reached user-specified song limit (3).
Done. Found 3 songs.
Searching for songs by Orelsan...

Changing artist name to 'OrelSan'
Song 1: "Notes pour trop tard"
Song 2: "Rêves bizarres"
Song 3: "Suicide social"

Reached user-specified song limit (3).
Done. Found 3 songs.


In [92]:
lyrics_df.head()

|  | artist_name | artist_id | song_id | song_title | song_lyrics |
|---|---|---|---|---|---|
| 0 | Suprême NTM | 24568 | 2706 | Laisse pas traîner ton fils | 41 ContributorsTranslationsEnglishLaisse pas t... |
| 1 | Suprême NTM | 24568 | 67453 | Ma Benz | 36 ContributorsMa Benz Lyrics[Intro - Lord Kos... |
| 2 | Suprême NTM | 24568 | 4921 | That’s My People | 24 ContributorsTranslationsEnglishThat’s My Pe... |
| 3 | OrelSan | 1286 | 3266715 | Notes pour trop tard | 101 ContributorsTranslationsEnglishNotes pour ... |
| 4 | OrelSan | 1286 | 4088140 | Rêves bizarres | 153 ContributorsRêves bizarres Lyrics[Paroles ... |


**The corpus is rapidly built up for analysis: stylometry, attribution or topic modeling, etc.**

In [105]:
print(lyrics_df.iloc[1].song_lyrics[:240])

36 ContributorsMa Benz Lyrics[Intro - Lord Kossity &  Joey Starr]
Yo, yo rude boy !
Jaguar Gorgone, Kool Shen, Lord Kossity
Come back again, now man !
SP.One from the track
Everytime I’m coming with car (for real !)
He comin' with them [?]



**Issue**. We see that the first line contains credits... `.index()` allows us to position ourselves after the first line break.

In [107]:
# issue: the first line contains credits… .index() lets us position ourselves just after the first line break.
lyrics = lyrics_df.iloc[1].song_lyrics[:240]
print(lyrics[lyrics.index('\n')+1:])

Yo, yo rude boy !
Jaguar Gorgone, Kool Shen, Lord Kossity
Come back again, now man !
SP.One from the track
Everytime I’m coming with car (for real !)
He comin' with them [?]



We can apply this method to the `song_lyrics` column of the dataframe, in order to improve the data. Alternatively, we could rewrite our `get_artists_lyrics()` function.
**Warning**. Be careful to apply only once, otherwise the first line of the lyrics will be lost each time it is run...

In [108]:
lyrics_df['song_lyrics']

0    41 ContributorsTranslationsEnglishLaisse pas t...
1    36 ContributorsMa Benz Lyrics[Intro - Lord Kos...
2    24 ContributorsTranslationsEnglishThat’s My Pe...
3    101 ContributorsTranslationsEnglishNotes pour ...
4    153 ContributorsRêves bizarres Lyrics[Paroles ...
5    111 ContributorsSuicide social Lyrics[Paroles ...
Name: song_lyrics, dtype: object

In [109]:
lyrics_df['song_lyrics'] = lyrics_df['song_lyrics'].apply(lambda x: x[x.index('\n')+1:])
lyrics_df['song_lyrics']

0    \n[Couplet 1 - Kool Shen]\nÀ l'aube de l'an 20...
1    Yo, yo rude boy !\nJaguar Gorgone, Kool Shen, ...
2    J't'explique, c'que j'kiffe, c'est de fumer de...
3    \n[Intro]\nOk, j'avais ton âge y'a à peu près ...
4    \n[Refrain : OrelSan]\nCe soir, j'me mets mina...
5    \n[Couplet unique]\nAujourd'hui sera l'dernier...
Name: song_lyrics, dtype: object
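
The warning above can be addressed by making the cleaning step idempotent. The following is a minimal sketch, not part of the original notebook: it assumes that the credits header always starts with a number followed by "Contributors", as in the outputs shown here. The helper name `strip_credits_line` is hypothetical.

```python
import re

# Assumption: the Genius credits header always looks like "<number> Contributors…"
CREDITS_HEADER = re.compile(r"^\d+ Contributors?")

def strip_credits_line(lyrics):
    """Drop the first line only when it looks like the credits header."""
    if CREDITS_HEADER.match(lyrics) and "\n" in lyrics:
        return lyrics.split("\n", 1)[1]
    # No header detected: return the lyrics unchanged
    return lyrics

raw = "36 ContributorsMa Benz Lyrics[Intro]\nYo, yo rude boy !\nJaguar Gorgone"
once = strip_credits_line(raw)
twice = strip_credits_line(once)
print(once == twice)
```

With such a helper, `lyrics_df['song_lyrics'].apply(strip_credits_line)` could be run several times without losing the first line of the lyrics.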

With Genius, we discovered JSON and learned how to browse it to extract data. We read its API documentation and managed a token for authentication. Finally, we discovered a first wrapper and its documentation, so we could easily build up a search corpus.

### Gallica Search API (XML)

[https://api.bnf.fr/fr/api-gallica-de-recherche#/recherche/findInCatalogue](https://api.bnf.fr/fr/api-gallica-de-recherche#/recherche/findInCatalogue)

This API allows you to search the Gallica digital collection.


A word about SRU (see https://bibliotheques.wordpress.com/2017/10/27/papa-cest-quoi-un-sru/). To use it well, you need to know the BnF data model. SRU (Search/Retrieve via URL) is a **standardised web service** adapted to the needs of library catalogues. In other words, libraries have agreed on a standardised way of offering their catalogue as a web service.

And the SRU standard **specifies how to query the server that exposes the catalogue data**.

![SRU](img/SRU.png)

The base URL:

```
https://gallica.bnf.fr/SRU?version=1.2&operation=searchRetrieve&query={CQL QUERY}
```

#### CQL Query

[CQL, the Contextual Query Language](https://www.loc.gov/standards/sru/cql/), is a formal language for representing queries to information retrieval systems such as web indexes, bibliographic catalogs and museum collection information. The design objective is that queries be human readable and writable, and that the language be intuitive while maintaining the expressiveness of more complex languages.

The syntax makes it possible to take advantage of the richness of the semantic model. But you have to know it. A few examples to help you understand.


Documents mentioning "Jules Verne" in Gallica (>18752 records…):

- `query=gallica all "Jules Verne"`
- [https://gallica.bnf.fr/SRU?version=1.2&operation=searchRetrieve&**query=gallica all "Jules Verne"**](https://gallica.bnf.fr/SRU?version=1.2&operation=searchRetrieve&query=gallica%20all%20%22Jules%20Verne%22)

Documents of which "Jules Verne" is the author in Gallica (>400 records):

- `query=dc.creator all "jules verne"`
- [https://gallica.bnf.fr/SRU?version=1.2&operation=searchRetrieve&**query=dc.creator all "Jules Verne"**](https://gallica.bnf.fr/SRU?version=1.2&operation=searchRetrieve&query=dc.creator%20all%20%22Jules%20Verne%22)

Books (monographs) in French authored by "Jules Verne" in Gallica (>290 records):

- `query=dc.creator all "jules verne"&filter=dc.type all "monographie" and dc.language all "fre"`
- [https://gallica.bnf.fr/SRU?version=1.2&operation=searchRetrieve&**query=dc.creator all "jules verne"&filter=dc.type all "monographie" and dc.language all "fre"**](https://gallica.bnf.fr/SRU?version=1.2&operation=searchRetrieve&query=dc.creator%20all%20%22jules%20verne%22&filter=dc.type%20all%20%22monographie%22%20and%20dc.language%20all%20%22fre%22)


Sort criteria are also available. For example, results can be sorted by OCR quality with `sortby ocr.quality/sort.descending`:

[https://gallica.bnf.fr/SRU?version=1.2&operation=searchRetrieve&query=dc.creator all "jules verne" sortby ocr.quality/sort.descending&filter=dc.type all "monographie" and dc.language all "fre"](https://gallica.bnf.fr/SRU?version=1.2&operation=searchRetrieve&query=dc.creator%20all%20%22jules%20verne%22%20sortby%20ocr.quality/sort.descending&filter=dc.type%20all%20%22monographie%22%20and%20dc.language%20all%20%22fre%22)


The 5 Gallica books authored by Jules Verne with the best OCR quality:

[https://gallica.bnf.fr/SRU?version=1.2&operation=searchRetrieve&query=dc.creator all "jules verne" sortby ocr.quality/sort.descending&filter=dc.type all "monographie" and dc.language all "fre"&maximumRecords=5](https://gallica.bnf.fr/SRU?version=1.2&operation=searchRetrieve&query=dc.creator%20all%20%22jules%20verne%22%20sortby%20ocr.quality/sort.descending&filter=dc.type%20all%20%22monographie%22%20and%20dc.language%20all%20%22fre%22&maximumRecords=5)

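Two other parameters control pagination:

- `startRecord`: position of the first record to return (counting starts at 1)
- `maximumRecords`: maximum number of records to return in the response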

The first 3 books in French written by Jules Verne: 

[https://gallica.bnf.fr/SRU?version=1.2&operation=searchRetrieve&query=dc.creator all "jules verne"&filter=dc.type all "monographie" and dc.language all "fre"&startRecord=1&maximumRecords=3](https://gallica.bnf.fr/SRU?version=1.2&operation=searchRetrieve&query=dc.creator%20all%20%22jules%20verne%22&filter=dc.type%20all%20%22monographie%22%20and%20dc.language%20all%20%22fre%22&startRecord=1&maximumRecords=3)

We will start by working on this very small corpus in order to understand the code.
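
Writing percent-encoded URLs by hand is error-prone. As a minimal sketch (the helper name `build_sru_url` and its parameter names are hypothetical, not part of the Gallica API), the standard library can assemble and encode the SRU parameters used in the examples above:

```python
from urllib.parse import urlencode, quote

GALLICA_SRU = "https://gallica.bnf.fr/SRU"

def build_sru_url(cql_query, cql_filter=None, start_record=1, maximum_records=5):
    """Assemble a searchRetrieve URL from a CQL query and an optional filter."""
    params = {
        "version": "1.2",
        "operation": "searchRetrieve",
        "query": cql_query,
    }
    if cql_filter:
        params["filter"] = cql_filter
    params["startRecord"] = start_record
    params["maximumRecords"] = maximum_records
    # quote_via=quote encodes spaces as %20 (not '+'), as in the URLs shown above
    return GALLICA_SRU + "?" + urlencode(params, safe="/", quote_via=quote)

url = build_sru_url(
    'dc.creator all "jules verne"',
    'dc.type all "monographie" and dc.language all "fre"',
    start_record=1,
    maximum_records=3,
)
print(url)
```

To page through a larger result set, the same helper could be called repeatedly while increasing `start_record` by `maximum_records` each time.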
 

In [111]:
import requests
url = "https://gallica.bnf.fr/SRU?version=1.2&operation=searchRetrieve&query=dc.creator%20all%20%22jules%20verne%22%20sortby%20ocr.quality/sort.descending&filter=dc.type%20all%20%22monographie%22%20and%20dc.language%20all%20%22fre%22&maximumRecords=5"
response = requests.get(url)
response_string = response.text
print(response_string)

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<srw:searchRetrieveResponse xmlns:ns6="http://gallica.bnf.fr/namespaces/gallica/" xmlns:diag="http://www.loc.gov/zing/srw/diagnostic/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:srw="http://www.loc.gov/zing/srw/" xmlns:dc="http://purl.org/dc/elements/1.1/">
    <srw:version>1.2</srw:version>
    <srw:echoedSearchRetrieveRequest>
        <srw:query>dc.creator all "jules verne" sortby ocr.quality/sort.descending</srw:query>
        <srw:version>1.2</srw:version>
    </srw:echoedSearchRetrieveRequest>
    <srw:numberOfRecords>218</srw:numberOfRecords>
    <srw:extraResponseData>&lt;numberOfRecordsDecollapser&gt;296&lt;/numberOfRecordsDecollapser&gt;</srw:extraResponseData>
    <srw:records>
        <srw:record>
            <srw:recordSchema>http://www.openarchives.org/OAI/2.0/OAIdc.xsd</srw:recordSchema>
            <srw:recordPacking>xml</srw:recordPacking>
            <srw:recordData>
                <oai_dc:d

Note that the data is structured, not as JSON but as XML.

Let's look at the structure of a record.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<srw:searchRetrieveResponse
    xmlns:srw="http://www.loc.gov/zing/srw/"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
    <srw:records>
        <srw:record>
            <srw:recordSchema>http://www.openarchives.org/OAI/2.0/OAIdc.xsd</srw:recordSchema>
            …
            <srw:recordData>
                <oai_dc:dc>
                    <dc:creator>Verne, Jules (1828-1905)…</dc:creator>
                    <dc:date>1896</dc:date>
                    <dc:identifier>https://gallica.bnf.fr/ark:/12148/bpt6k65501998</dc:identifier>
                    <dc:language>fre</dc:language>
                    <dc:source>Bibliothèque nationale de France…</dc:source>
                    <dc:title>Les voyages du Capitaine Cook…</dc:title>
                    <dc:identifier>https://gallica.bnf.fr/ark:/12148/bpt6k65501998</dc:identifier>
                </oai_dc:dc>
            </srw:recordData>
            …
        </srw:record>
        <srw:record/>
        …
    </srw:records>
</srw:searchRetrieveResponse>
```

To extract the information, we need to parse XML data.

In [112]:
import xml.etree.ElementTree as ET
root = ET.fromstring(response.content)
for child in root.iter('*'):
    print(child.tag)

{http://www.loc.gov/zing/srw/}searchRetrieveResponse
{http://www.loc.gov/zing/srw/}version
{http://www.loc.gov/zing/srw/}echoedSearchRetrieveRequest
{http://www.loc.gov/zing/srw/}query
{http://www.loc.gov/zing/srw/}version
{http://www.loc.gov/zing/srw/}numberOfRecords
{http://www.loc.gov/zing/srw/}extraResponseData
{http://www.loc.gov/zing/srw/}records
{http://www.loc.gov/zing/srw/}record
{http://www.loc.gov/zing/srw/}recordSchema
{http://www.loc.gov/zing/srw/}recordPacking
{http://www.loc.gov/zing/srw/}recordData
{http://www.openarchives.org/OAI/2.0/oai_dc/}dc
{http://purl.org/dc/elements/1.1/}contributor
{http://purl.org/dc/elements/1.1/}contributor
{http://purl.org/dc/elements/1.1/}creator
{http://purl.org/dc/elements/1.1/}date
{http://purl.org/dc/elements/1.1/}description
{http://purl.org/dc/elements/1.1/}description
{http://purl.org/dc/elements/1.1/}description
{http://purl.org/dc/elements/1.1/}description
{http://purl.org/dc/elements/1.1/}description
{http://purl.org/dc/elements/

We can see that every element of the response is accessible through its namespace.
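
Repeating the full namespace URI in every path quickly becomes hard to read. `ElementTree` also accepts a prefix-to-URI mapping, which allows shorter paths. The sketch below runs on a hand-written, simplified response modelled on the record shown earlier; the second record (without a date) is invented for illustration only.

```python
import xml.etree.ElementTree as ET

# Prefixes are local shortcuts; only the URIs have to match the document
NS = {
    "srw": "http://www.loc.gov/zing/srw/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Simplified, hand-written sample (not an actual Gallica response)
SAMPLE_RESPONSE = """
<srw:searchRetrieveResponse
    xmlns:srw="http://www.loc.gov/zing/srw/"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <srw:records>
    <srw:record>
      <srw:recordData>
        <oai_dc:dc>
          <dc:creator>Verne, Jules (1828-1905)</dc:creator>
          <dc:date>1896</dc:date>
          <dc:title>Les voyages du Capitaine Cook</dc:title>
        </oai_dc:dc>
      </srw:recordData>
    </srw:record>
    <srw:record>
      <srw:recordData>
        <oai_dc:dc>
          <dc:title>Record without a date</dc:title>
        </oai_dc:dc>
      </srw:recordData>
    </srw:record>
  </srw:records>
</srw:searchRetrieveResponse>
""".strip()

root = ET.fromstring(SAMPLE_RESPONSE)
rows = []
for record in root.iterfind(".//srw:record", NS):
    rows.append({
        # findtext() returns the default instead of failing when the element is absent
        "title": record.findtext("srw:recordData/oai_dc:dc/dc:title", default="NaN", namespaces=NS),
        "date": record.findtext("srw:recordData/oai_dc:dc/dc:date", default="NaN", namespaces=NS),
    })
print(rows)
```

Using `findtext()` with a default value also avoids the `IndexError` that `findall(...)[0]` raises when a record lacks an element.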

In [113]:
import xml.etree.ElementTree as ET
root = ET.fromstring(response.content)
for record in root.iter('{http://www.loc.gov/zing/srw/}record'):
    dc_identifier = record.findall('{http://www.loc.gov/zing/srw/}recordData/{http://www.openarchives.org/OAI/2.0/oai_dc/}dc/{http://purl.org/dc/elements/1.1/}identifier')[0].text
    dc_date = record.findall('{http://www.loc.gov/zing/srw/}recordData/{http://www.openarchives.org/OAI/2.0/oai_dc/}dc/{http://purl.org/dc/elements/1.1/}date')[0].text
    dc_title = record.findall('{http://www.loc.gov/zing/srw/}recordData/{http://www.openarchives.org/OAI/2.0/oai_dc/}dc/{http://purl.org/dc/elements/1.1/}title')[0].text
    dc_creator = record.findall('{http://www.loc.gov/zing/srw/}recordData/{http://www.openarchives.org/OAI/2.0/oai_dc/}dc/{http://purl.org/dc/elements/1.1/}creator')[0].text
    print(dc_creator, dc_identifier, dc_date, dc_title)

Verne, Jules (1828-1905). Auteur du texte https://gallica.bnf.fr/ark:/12148/bpt6k65775059 1877 Hector Servadac : voyages et aventures à travers le monde solaire / par Jules Verne ; dessins de P. Philippoteaux, gravés par Laplante
Verne, Jules (1828-1905). Auteur du texte https://gallica.bnf.fr/ark:/12148/bpt6k65501998 1896 Les voyages du Capitaine Cook / par Jules Verne... ; dessins par P. Philippoteaux ; fac-similés d'après les documents anciens et cartes par Matthis et Morieu
Verne, Jules (1828-1905). Auteur du texte https://gallica.bnf.fr/ark:/12148/bpt6k6512278z 1886 Robur-le-conquérant / par Jules Verne ; 87 dessins par L. Benett et G. Roux
Verne, Jules (1828-1905). Auteur du texte https://gallica.bnf.fr/ark:/12148/bpt6k65773135 1887 Nord contre Sud : les voyages extraordinaires / par Jules Verne,... ; dessins par Benett..
Verne, Jules (1828-1905). Auteur du texte https://gallica.bnf.fr/ark:/12148/bpt6k65143496 1871 De la terre à la lune : trajet direct en 97 heures (10e édition) 

In [114]:
# Load everything into a dataframe
import pandas as pd
metadata = []
root = ET.fromstring(response.content)
for record in root.iter('{http://www.loc.gov/zing/srw/}record'):
    dc_identifier = record.findall('{http://www.loc.gov/zing/srw/}recordData/{http://www.openarchives.org/OAI/2.0/oai_dc/}dc/{http://purl.org/dc/elements/1.1/}identifier')[0].text
    dc_date = record.findall('{http://www.loc.gov/zing/srw/}recordData/{http://www.openarchives.org/OAI/2.0/oai_dc/}dc/{http://purl.org/dc/elements/1.1/}date')[0].text
    dc_title = record.findall('{http://www.loc.gov/zing/srw/}recordData/{http://www.openarchives.org/OAI/2.0/oai_dc/}dc/{http://purl.org/dc/elements/1.1/}title')[0].text
    dc_creator = record.findall('{http://www.loc.gov/zing/srw/}recordData/{http://www.openarchives.org/OAI/2.0/oai_dc/}dc/{http://purl.org/dc/elements/1.1/}creator')[0].text
    ocr_quality = record.findall('{http://www.loc.gov/zing/srw/}extraRecordData/nqamoyen')[0].text
    metadata.append({
        'ocr_quality': ocr_quality,
        'dc_creator': dc_creator,
        'dc_identifier': dc_identifier,
        'dc_date': dc_date,
        'dc_title': dc_title
    })
    #print(dc_creator, dc_identifier, dc_date, dc_title)
#print(metadata)
metadata_df = pd.DataFrame(metadata)
metadata_df

Unnamed: 0,ocr_quality,dc_creator,dc_identifier,dc_date,dc_title
0,99.99,"Verne, Jules (1828-1905). Auteur du texte",https://gallica.bnf.fr/ark:/12148/bpt6k65775059,1877,Hector Servadac : voyages et aventures à trave...
1,99.99,"Verne, Jules (1828-1905). Auteur du texte",https://gallica.bnf.fr/ark:/12148/bpt6k65501998,1896,Les voyages du Capitaine Cook / par Jules Vern...
2,99.99,"Verne, Jules (1828-1905). Auteur du texte",https://gallica.bnf.fr/ark:/12148/bpt6k6512278z,1886,Robur-le-conquérant / par Jules Verne ; 87 des...
3,99.99,"Verne, Jules (1828-1905). Auteur du texte",https://gallica.bnf.fr/ark:/12148/bpt6k65773135,1887,Nord contre Sud : les voyages extraordinaires ...
4,99.99,"Verne, Jules (1828-1905). Auteur du texte",https://gallica.bnf.fr/ark:/12148/bpt6k65143496,1871,De la terre à la lune : trajet direct en 97 he...


In [115]:
# Make the code reusable by defining a function
def get_gallica_metadata(search_api_url):
    # query the URL passed as an argument (not the global `url` variable)
    response = requests.get(search_api_url)
    root = ET.fromstring(response.content)
    metadata = []
    for record in root.iter('{http://www.loc.gov/zing/srw/}record'):
        dc_identifier = record.findall('{http://www.loc.gov/zing/srw/}recordData/{http://www.openarchives.org/OAI/2.0/oai_dc/}dc/{http://purl.org/dc/elements/1.1/}identifier')[0].text
        dc_date = record.findall('{http://www.loc.gov/zing/srw/}recordData/{http://www.openarchives.org/OAI/2.0/oai_dc/}dc/{http://purl.org/dc/elements/1.1/}date')[0].text if record.findall('{http://www.loc.gov/zing/srw/}recordData/{http://www.openarchives.org/OAI/2.0/oai_dc/}dc/{http://purl.org/dc/elements/1.1/}date') else 'NaN'
        dc_title = record.findall('{http://www.loc.gov/zing/srw/}recordData/{http://www.openarchives.org/OAI/2.0/oai_dc/}dc/{http://purl.org/dc/elements/1.1/}title')[0].text
        dc_creator = record.findall('{http://www.loc.gov/zing/srw/}recordData/{http://www.openarchives.org/OAI/2.0/oai_dc/}dc/{http://purl.org/dc/elements/1.1/}creator')[0].text
        ocr_quality = record.findall('{http://www.loc.gov/zing/srw/}extraRecordData/nqamoyen')[0].text
        metadata.append({
            'ocr_quality': ocr_quality,
            'dc_creator': dc_creator,
            'dc_identifier': dc_identifier,
            'dc_date': dc_date,
            'dc_title': dc_title
        })
    metadata_df = pd.DataFrame(metadata)
    return metadata_df


In [116]:
# Make the function generic: the caller chooses which Dublin Core elements to extract
def get_gallica_metadata(search_api_url, dc_elements_array):
    # query the URL passed as an argument (not the global `url` variable)
    response = requests.get(search_api_url)
    root = ET.fromstring(response.content)
    dc_path = '{http://www.loc.gov/zing/srw/}recordData/{http://www.openarchives.org/OAI/2.0/oai_dc/}dc/{http://purl.org/dc/elements/1.1/}'
    metadata = []
    for record in root.iter('{http://www.loc.gov/zing/srw/}record'):
        record_meta_dic = {}
        for dc_el in dc_elements_array:
            # findtext() returns None when the element is missing, instead of raising an IndexError
            record_meta_dic[dc_el] = record.findtext(dc_path + dc_el) or 'NaN'
        # non DC metadata
        record_meta_dic['ocr_quality'] = record.findtext('{http://www.loc.gov/zing/srw/}extraRecordData/nqamoyen') or 'NaN'
        metadata.append(record_meta_dic)
    metadata_df = pd.DataFrame(metadata)
    return metadata_df


In [117]:
url = "https://gallica.bnf.fr/SRU?version=1.2&operation=searchRetrieve&query=dc.creator%20all%20%22jules%20verne%22%20sortby%20ocr.quality/sort.descending&filter=dc.type%20all%20%22monographie%22%20and%20dc.language%20all%20%22fre%22&startRecord=1&maximumRecords=5"
books_df = get_gallica_metadata(url, ['creator', 'identifier', 'title'])
books_df


Unnamed: 0,creator,identifier,title,ocr_quality
0,"Verne, Jules (1828-1905). Auteur du texte",https://gallica.bnf.fr/ark:/12148/bpt6k65775059,Hector Servadac : voyages et aventures à trave...,99.99
1,"Verne, Jules (1828-1905). Auteur du texte",https://gallica.bnf.fr/ark:/12148/bpt6k65501998,Les voyages du Capitaine Cook / par Jules Vern...,99.99
2,"Verne, Jules (1828-1905). Auteur du texte",https://gallica.bnf.fr/ark:/12148/bpt6k6512278z,Robur-le-conquérant / par Jules Verne ; 87 des...,99.99
3,"Verne, Jules (1828-1905). Auteur du texte",https://gallica.bnf.fr/ark:/12148/bpt6k65773135,Nord contre Sud : les voyages extraordinaires ...,99.99
4,"Verne, Jules (1828-1905). Auteur du texte",https://gallica.bnf.fr/ark:/12148/bpt6k65143496,De la terre à la lune : trajet direct en 97 he...,99.99


In [118]:
# once we have the ark identifiers, we can load the text
# Discovering the Document API

ark_url = 'https://gallica.bnf.fr/ark:/12148/bpt6k65775059'
response = requests.get(ark_url+'.texteBrut')


In [119]:
from bs4 import BeautifulSoup
document = BeautifulSoup(response.text, "html.parser")
first_hr_tag = document.select('hr')[0]
text = ''
for p in first_hr_tag.find_all_next('p'):
    text += p.text + '\n'
#print(text.strip())

In [120]:
def get_book_text(ark_url):
    response = requests.get(ark_url+'.texteBrut')
    document = BeautifulSoup(response.text, "html.parser")
    first_hr_tag = document.select('hr')[0]
    text = ''
    for p in first_hr_tag.find_all_next('p'):
        text += p.text + '\n'
    return text
    

In [None]:
#print(get_book_text(ark_url))

In [122]:
# finally, we can add this text to our dataframe
books_df['text'] = books_df.apply(lambda x: get_book_text(x['identifier']), axis=1)
books_df


Unnamed: 0,creator,identifier,title,ocr_quality,text
0,"Verne, Jules (1828-1905). Auteur du texte",https://gallica.bnf.fr/ark:/12148/bpt6k65775059,Hector Servadac : voyages et aventures à trave...,99.99,JULES VERNE \nHECTOR \nSERVADAC \nHECTOR SERVA...
1,"Verne, Jules (1828-1905). Auteur du texte",https://gallica.bnf.fr/ark:/12148/bpt6k65501998,Les voyages du Capitaine Cook / par Jules Vern...,99.99,LES VOYAGES \nDU \nCAPITAINE COOK \nBIBLIOTHÈQ...
2,"Verne, Jules (1828-1905). Auteur du texte",https://gallica.bnf.fr/ark:/12148/bpt6k6512278z,Robur-le-conquérant / par Jules Verne ; 87 des...,99.99,ROBUR LE CONQUÉRANT \nCOLLECTION HETZEL \nROBU...
3,"Verne, Jules (1828-1905). Auteur du texte",https://gallica.bnf.fr/ark:/12148/bpt6k65773135,Nord contre Sud : les voyages extraordinaires ...,99.99,JULES VERNE \nNORD \nVOYAGES \nEXTRAORDINAIRE ...
4,"Verne, Jules (1828-1905). Auteur du texte",https://gallica.bnf.fr/ark:/12148/bpt6k65143496,De la terre à la lune : trajet direct en 97 he...,99.99,DE LA TERRE \nA \nLA LUNE \nⓒ \nOUVRAGES DU MÊ...


In [None]:
#print(books_df.iloc[1]['text'])

In [None]:
# To build a corpus and reuse it later, the files can also be copied locally

#print(books_df.iloc[1]['text'])
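
Copying the texts locally can be sketched as follows. This is one possible approach, not the notebook's own: the function name `save_corpus`, the default `corpus` directory and the choice of naming each file after the last segment of its ark URL are all assumptions.

```python
from pathlib import Path

def save_corpus(texts_by_ark_url, corpus_dir="corpus"):
    """Write each text to <corpus_dir>/<ark name>.txt and return the paths written."""
    corpus_path = Path(corpus_dir)
    corpus_path.mkdir(parents=True, exist_ok=True)
    written = []
    for ark_url, text in texts_by_ark_url.items():
        # 'https://gallica.bnf.fr/ark:/12148/bpt6k65501998' -> 'bpt6k65501998'
        ark_name = ark_url.rstrip("/").split("/")[-1]
        file_path = corpus_path / f"{ark_name}.txt"
        file_path.write_text(text, encoding="utf-8")
        written.append(file_path)
    return written

# Possible call with the dataframe built above:
# save_corpus(dict(zip(books_df['identifier'], books_df['text'])))
```

A file can later be read back with `Path('corpus/bpt6k65501998.txt').read_text(encoding='utf-8')`, which avoids downloading the same book again.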

In [None]:
# Wrappers exist for the most widely used APIs
# For example Twitter: https://twython.readthedocs.io/en/latest/

In [None]:
# survey the existing services for collecting text, in particular Wikisource
#https://fr.wikisource.org/wiki/J%E2%80%99accuse%E2%80%A6!
#https://ws-export.wmcloud.org/?lang=fr&title=J%E2%80%99accuse%E2%80%A6!

### Gallica Document API

Given an ark identifier, you can always access the metadata recorded in the OAI record.

`https://gallica.bnf.fr/services/OAIRecord?ark={ark}`

This service returns the document's OAI-PMH record along with other technical information, such as the document type or whether full-text search is available.

Only one parameter is required: the ark identifier of the digitised document.

In [123]:
from xml.etree import ElementTree
tree = ElementTree.fromstring(requests.get('https://gallica.bnf.fr/services/OAIRecord?ark=bpt6k65501998').content)

In [124]:
tree.find('nqamoyen').text

'99.99'

In [125]:
# or list the related resources
for relation in tree.iter('{http://purl.org/dc/elements/1.1/}relation'):
    print(relation.text)

Notice du catalogue : http://catalogue.bnf.fr/ark:/12148/cb31562766z


#### Text retrieval service

To retrieve the raw text:

`https://gallica.bnf.fr/ark:/12148/{ark}.texteBrut`

If you only want part of the document, insert the qualifier `f{X}n{Y}` before the `texteBrut` qualifier, where X is the number of the page (view) to start from and Y is the number of pages to return.

`https://gallica.bnf.fr/ark:/12148/{ark}/f{start_page_number}n{number_of_pages}.texteBrut`

For example, the first 5 pages of the chapter devoted to Bougainville (p. 75-79 => 5 pages starting from f87):

[https://gallica.bnf.fr/ark:/12148/bpt6k65501998/f87n5.texteBrut](https://gallica.bnf.fr/ark:/12148/bpt6k65501998/f87n5.texteBrut)
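
The URL pattern above can be wrapped in a small helper. This is only a sketch: the function name `build_text_url` and its parameter names are hypothetical, and it simply reproduces the two URL templates given in this section.

```python
def build_text_url(ark_url, start_view=None, number_of_views=None):
    """Return the texteBrut URL for a whole document or for a range of views."""
    if start_view is None:
        # whole document
        return f"{ark_url}.texteBrut"
    # partial document: number_of_views views starting at view start_view
    return f"{ark_url}/f{start_view}n{number_of_views}.texteBrut"

ark_url = "https://gallica.bnf.fr/ark:/12148/bpt6k65501998"
print(build_text_url(ark_url))
print(build_text_url(ark_url, start_view=87, number_of_views=5))
```

Note that `start_view` is the view number used by Gallica (f87 here), not the page number printed in the book (p. 75).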


#### OCR retrieval service (ALTO)

The following Gallica API (RequestDigitalElement) retrieves the OCR of a single page from the digital library:

`https://gallica.bnf.fr/RequestDigitalElement?O={ark}&E=ALTO&Deb={page_number}`

For example, the ALTO of the first page of the first chapter devoted to Bougainville (p. 1 => f13):

https://gallica.bnf.fr/RequestDigitalElement?O=bpt6k65501998&E=ALTO&Deb=13


In [126]:
# ALTO is XML
# This makes it possible to extract content, for example the reference of an image

from xml.etree import ElementTree
tree = ElementTree.fromstring(requests.get('https://gallica.bnf.fr/RequestDigitalElement?O=bpt6k65501998&E=ALTO&Deb=13').content)
for illustration in tree.iter('{http://bibnum.bnf.fr/ns/alto_prod}Illustration'):
    print(illustration.get('ID'))

PAG_00000013_IL000001


In [127]:
# We can then retrieve the coordinates of the image and display it through IIIF

illustrations = []
for illustration in tree.iter('{http://bibnum.bnf.fr/ns/alto_prod}Illustration'):
    illustrations.append({
            'ID': illustration.get('ID'),
            'HPOS': illustration.get('HPOS'),
            'VPOS': illustration.get('VPOS'),
            'WIDTH': illustration.get('WIDTH'),
            'HEIGHT': illustration.get('HEIGHT')
        })
illustrations

[{'ID': 'PAG_00000013_IL000001',
  'HPOS': '378',
  'VPOS': '774',
  'WIDTH': '1845',
  'HEIGHT': '2209'}]

In [128]:
from IPython.display import Image

illustration_iiif_url = ('https://gallica.bnf.fr/iiif/ark:/12148/bpt6k65501998/f13/'
      + illustrations[0]['HPOS'] +','
      + illustrations[0]['VPOS'] +','
      + illustrations[0]['WIDTH'] +','
      + illustrations[0]['HEIGHT']
      + '/full/0/native.jpg')

Image(url=illustration_iiif_url, width=400)
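
The same URL can be produced by a reusable helper. The sketch below follows the `region/size/rotation/quality.format` pattern used in the cell above; the function name `build_iiif_region_url` and its parameters are hypothetical, and it assumes, as the cell above does, that the ALTO coordinates can be used directly as the IIIF region.

```python
def build_iiif_region_url(ark, view, illustration, size="full", rotation=0,
                          quality="native", image_format="jpg"):
    """Build a Gallica IIIF URL cropping a view to an ALTO illustration block."""
    # ALTO attributes: HPOS/VPOS = top-left corner, WIDTH/HEIGHT = block size
    region = ",".join(illustration[key] for key in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
    return (f"https://gallica.bnf.fr/iiif/ark:/12148/{ark}/f{view}/"
            f"{region}/{size}/{rotation}/{quality}.{image_format}")

illustration = {
    "ID": "PAG_00000013_IL000001",
    "HPOS": "378",
    "VPOS": "774",
    "WIDTH": "1845",
    "HEIGHT": "2209",
}
iiif_url = build_iiif_region_url("bpt6k65501998", 13, illustration)
print(iiif_url)
```

Looping over the `illustrations` list with this helper would give one cropped image URL per illustration found on the page.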


### Wrappers

- PyGallica : https://github.com/ian-nai/PyGallica
- gallipy : https://libraries.io/pypi/gallipy


## Preprocessing with Spacy

We now learn how to manipulate text, starting with named entities. In this lesson, we're going to learn about a text analysis method called Named Entity Recognition (NER). This method will help us computationally identify people, places, and things (of various kinds) in a text or collection of texts.

In [None]:
!pip install -U spacy

In [None]:
import spacy
from spacy import displacy
from collections import Counter
import pandas as pd
pd.options.display.max_rows = 600
pd.options.display.max_colwidth = 400

In [None]:
# Download Language Model
conda install -c conda-forge spacy-model-fr_core_news_md

In [None]:
import fr_core_news_md
nlp = fr_core_news_md.load()

In [None]:
text = books_df.iloc[1]['text']
document = nlp(text)

In [None]:
#displacy.render(document, style="ent")

In [None]:
#document.ents

In [None]:
for named_entity in document.ents:
    print(named_entity, named_entity.label_)

In [None]:
for named_entity in document.ents:
    if named_entity.label_ == "PER":
        print(named_entity)

In [None]:
import math
number_of_chunks = 80

chunk_size = math.ceil(len(text) / number_of_chunks)

text_chunks = []

for number in range(0, len(text), chunk_size):
    text_chunk = text[number:number+chunk_size]
    text_chunks.append(text_chunk)

In [None]:
#len(text_chunks)
#text_chunks[0]

In [None]:
# https://spacy.io/api/language#pipe
# nlp.pipe returns a generator on purpose
# https://www.dataquest.io/blog/python-generators-tutorial/
chunked_documents = list(nlp.pipe(text_chunks))

In [None]:
# TODO: turn this into a function that extracts PER or LOC entities depending on an argument
people = []

for doc in chunked_documents:
    for named_entity in doc.ents:
        if named_entity.label_ == "PER":
            people.append(named_entity.text)

# https://docs.python.org/3/library/collections.html#collections.Counter
people_tally = Counter(people)

people_df = pd.DataFrame(people_tally.most_common(), columns=['character', 'count'])
people_df.iloc[0:25]
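
The counting loop above can be generalised into a function that takes the entity labels as an argument. This is a minimal sketch with a hypothetical name, `count_entities`. So that it runs without a language model, the demonstration uses stand-in objects that only mimic the `.ents`, `.text` and `.label_` attributes of spaCy documents; the entities they contain are invented.

```python
from collections import Counter
from types import SimpleNamespace

def count_entities(documents, labels=("PER",)):
    """Tally the text of every named entity whose label is in `labels`."""
    tally = Counter()
    for doc in documents:
        for named_entity in doc.ents:
            if named_entity.label_ in labels:
                tally[named_entity.text] += 1
    return tally

# Stand-in documents (illustration only, not real spaCy output)
fake_docs = [
    SimpleNamespace(ents=[SimpleNamespace(text="Cook", label_="PER"),
                          SimpleNamespace(text="Tahiti", label_="LOC")]),
    SimpleNamespace(ents=[SimpleNamespace(text="Cook", label_="PER"),
                          SimpleNamespace(text="Bougainville", label_="PER")]),
]
people_tally = count_entities(fake_docs, labels=("PER",))
places_tally = count_entities(fake_docs, labels=("LOC",))
print(people_tally.most_common())
```

With the real data, `count_entities(chunked_documents, labels=("PER", "LOC"))` could replace the loop, and the resulting `Counter` can be passed to `pd.DataFrame(tally.most_common(), columns=['entity', 'count'])`.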

In [None]:
# NB: the text can also be split into sentences
'''
for sentence in document.sents:
    print(sentence.text, '===')
'''

### POS / Lemmatisation

In [None]:
sample = "Un jour peut-être, je parviendrai à réussir ce que je souhaite entreprendre."

In [None]:
document = nlp(books_df.iloc[1]['text'])

In [None]:
# https://universaldependencies.org/u/pos/
# tokens have a great many attributes
for token in document:
    print(token, token.lemma_, token.pos_)

### spaCy basics
https://realpython.com/natural-language-processing-spacy-python/

In [None]:
quote = "Certains attendent que le temps change, d'autres le saisissent avec force et agissent. Lorsque ta vue veut pénétrer trop loin dans les ténèbres, il advient qu'en imaginant tu t'égares."

In [None]:
# the loaded language model
# a bundle of language models
nlp

In [None]:
document = nlp(quote)

In [None]:
document

In [None]:
type(document)

In [None]:
print([token.text for token in document])

In [None]:
sentences = list(document.sents)
sentences

In [None]:
for sentence in sentences:
    print(f'{sentence[:5]}…')

In [None]:
from spacy.lang.fr.stop_words import STOP_WORDS as fr_stop
for stop_word in list(fr_stop)[:10]:
    print(stop_word)
len(fr_stop)

In [None]:
print([token.text for token in document if not token.is_stop])

In [None]:
# We can access the structured data, but not the lyrics…
response = requests.get("https://genius.com/api/songs/87367")
response.json()

In [None]:
requests.get(f"https://api.genius.com/songs/87367?access_token={api_key.your_client_access_token}").json()