![](https://drive.google.com/uc?export=view&id=1L9JLLQHPZoMRwzYfmKcyM9VME_SHeZrr)

# TP 1 : pre-processing texts

In this practical session, we will see how to pre-process textual data using NLTK and Spacy.

Within a computer, text is encoded as a string of characters. 
In order to analyze textual data within NLP applications, we first need to properly preprocess it. 
An NLP preprocessing pipeline generally consists of the following steps:
* sentence segmentation
* tokenization
* normalization: lower-casing, lemmatization, optionally removing stop words and punctuation
* POS tagging
* named entity recognition
* parsing

The first two steps are necessary, while the others are optional.

For these exercises, we will use the modules **NLTK** and **spaCy**. Both are already installed on Google Colab, but some NLTK resources may be missing; we will see later how to download them.

NLTK and Spacy both provide ways to carry out tasks such as segmentation, tokenization, lemmatization and pos-tagging.

NLTK is a rather old library, but it is still widely used. It was built by scholars and researchers as a toolkit for creating complex NLP functions.
spaCy is more recent and is organized around an NLP pipeline. While NLTK provides access to many algorithms for a given task, spaCy aims to provide a single, well-chosen way to do it (https://www.activestate.com/blog/natural-language-processing-nltk-vs-spacy/).

We will extract information from Wikipedia pages as an example.

## 0- Upload and read the text files

We will start with a text written in English, then apply the same tools to French.
We'll use the `wikipedia` library to extract pages from Wikipedia.

In [None]:
! pip install wikipedia

In [None]:
import wikipedia
wikipedia.set_lang('en')
text_en = wikipedia.page("Lovelace")
print(text_en.content[:1000])

In [None]:
wikipedia.set_lang('fr')
text_fr = wikipedia.page("Lovelace")
print(text_fr.content[:1000])

## 1- Using NLTK



--> **For now, we work on the English file**

### 1.1 Sentence Segmentation

**Exercise 1:** Breaking the text into Sentences

* Import [NLTK](https://www.nltk.org/api/nltk.html)
* You can use Python's help(X) to get information about how X works, e.g. help(nltk.word_tokenize) for information about NLTK's word tokenizer.   
Use the [help function](https://www.nltk.org/api/nltk.html?highlight=help#module-nltk.help) to see how to use *nltk.sent_tokenize*  
* Now use the [sent_tokenize()](https://www.nltk.org/api/nltk.tokenize.html?highlight=sent_tokenize#nltk.tokenize.sent_tokenize) function to segment the text into sentences.
* Apply print() to the output of the tokenizer to view the results
* How many sentences do we have?

You might need to download additional resources for NLTK, e.g. the package 'punkt':



In [None]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # recent NLTK releases look for this resource instead of 'punkt'

In [None]:
import nltk

# get information on nltk.sent_tokenize 


# Perform sentence segmentation on the English wikipedia page


# Print the sentences and the total number of sentences


### 1.2 Tokenization

**Exercise 2:** Tokenizing a text file 

* Tokenize the text using NLTK [word_tokenize](https://www.nltk.org/api/nltk.tokenize.html?highlight=word_tokenize)
* Inspect the results: does it work well?
* How many tokens do we have?
* How many words / types / unique tokens do we have? Hint: use numpy.unique(list).
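
As a small illustration of the difference between tokens and types, the sketch below counts both on a hand-made token list. The list is invented for the example and is not the output of `word_tokenize`.

```python
import numpy as np

# Toy token list, made up for illustration (not produced by NLTK).
tokens = ["the", "cat", "sat", "on", "the", "mat", ".", "the", "end", "."]

# np.unique returns the sorted distinct values; return_counts adds their frequencies.
types, counts = np.unique(tokens, return_counts=True)
frequencies = dict(zip(types.tolist(), counts.tolist()))

print("number of tokens:", len(tokens))
print("number of types:", len(types))
print(frequencies)
```

The same call should work on any list of strings, including the list returned by a tokenizer.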

In [None]:
import numpy as np

# Use NLTK to tokenize the text


# Print out the tokens, the total number of tokens, and the number of 
# unique tokens (vocabulary)


### 1.3 Pre-processing of French text

--> **Now, we will use the French wikipedia page**

**Exercise 3** Tokenization for French
* Now, try to perform the same pre-processing on the French document
* Do you see a problem?
* Check the language option for NLTK, does it work better?
* Use the *RegexpTokenizer* to solve the issue.
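
To see what a regex-based tokenizer does with French elision, here is a minimal sketch on a made-up sentence. It assumes a pattern with three alternatives, tried from left to right.

```python
from nltk.tokenize import RegexpTokenizer

# Three alternatives, tried in order:
#   \w'       one word character followed by an apostrophe (l', d', n', ...)
#   \w+       a run of word characters
#   [^\w\s]   any single character that is neither a word character nor whitespace
tokenizer = RegexpTokenizer(r"\w'|\w+|[^\w\s]")

sentence = "L'analyse d'Ada n'est pas finie."  # made-up example sentence
tokens = tokenizer.tokenize(sentence)
print(tokens)
```

Note that the first alternative only covers single-letter elisions: with this pattern a form such as *qu'* is split into *qu* and the apostrophe.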

In [None]:
# Use NLTK to perform sentence segmentation and tokenization on the 
# French wikipedia page (same prints)



In [None]:
# Change the language option?


In [None]:
# Below we define a regex based tokenizer, does it work better?
from nltk import RegexpTokenizer
tokenizer = RegexpTokenizer(r'''\w'|\w+|[^\w\s]''')


## 2- Using Spacy

All info about Spacy: https://spacy.io/ ; More info on the pipelines: https://spacy.io/usage/processing-pipelines 

spaCy is more geared towards real-world applications than NLTK, and it generally performs better on the basic processing steps. 

Spacy can be used to directly tokenize any text. 
With spacy, we build a pipeline that does everything at once. 
To make it work, you need to **load a model specific to the target language**, for example 'en_core_web_sm' for English (there are also some domain-specific models).


The model corresponds to a processing 'pipeline': 
  by default, it includes tokenization, lemmatization and POS tagging, among other components

Using spacy:
- import the spacy module into Python 
- load all the necessary models, e.g. for English


In [None]:
import spacy 
nlp = spacy.load('en_core_web_sm')


Then process a text with the pipeline: 



In [None]:
doc = nlp(text_en.content)

### 2.1 Tokenisation

**Exercise 4:** Tokenize the text in French
* Find a model for French and tokenize the French text. Hint: you will need to download the model first; this can be done in the notebook using *spacy.cli.download( model_name )*
* What does the *doc* variable contain? Hint: you can either consult spaCy's online documentation to find out how to access the information, or look at the built-in help by typing help(doc). https://spacy.io/api/doc
* Print the individual tokens. Do you see any error?
* How many tokens do we have?
* How many words / types / unique tokens do we have (i.e. vocabulary size)?
* Use Pandas to better visualize the results
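
The sketch below shows one possible way to inspect a `Doc` and put its tokens in a table. It assumes a blank French pipeline (`spacy.blank("fr")`, tokenizer only), chosen here only because it needs no download; the exercise itself expects a trained French model. The text and the names `nlp_blank` and `doc_demo` are made up for the example.

```python
import pandas as pd
import spacy

# Assumption: a blank French pipeline (tokenizer only, no model download needed).
nlp_blank = spacy.blank("fr")

text = "Ada Lovelace aime les machines. Les machines aiment Ada."  # made-up text
doc_demo = nlp_blank(text)

# A Doc behaves like a sequence of Token objects.
tokens = [token.text for token in doc_demo]
print("tokens:", len(doc_demo), "| types:", len(set(tokens)))

# One row per token, with a few attributes that need no trained model.
df = pd.DataFrame(
    {
        "text": tokens,
        "index": [token.i for token in doc_demo],
        "is_alpha": [token.is_alpha for token in doc_demo],
        "is_punct": [token.is_punct for token in doc_demo],
    }
)
print(df)
```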

In [None]:
# Download a model for French


In [None]:
# Load the model


# Preprocess using spacy's pipeline


# Inspect tokens: print out the tokens, the total number of tokens, and the 
# number of unique tokens (vocabulary) 



#### Pandas

You can use Pandas to better visualize the results

In [None]:
# Display a pandas dataframe with the tokens


### 2.2 Sentence segmentation

**Exercise 5:**
Apart from token segmentation, spaCy has also automatically segmented our document into sentences. 
* Print out the different sentences of the document.
Hint: Look at the "Data descriptors " in the help page for 'doc'.
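
As a minimal sketch of how sentences are exposed on a `Doc`, the example below iterates over `doc.sents`. It assumes a blank French pipeline with the rule-based `sentencizer` component added, so that nothing has to be downloaded; a trained model typically sets sentence boundaries itself.

```python
import spacy

# Assumption: blank French pipeline plus the rule-based "sentencizer" component.
# This extra step is only needed here because no trained model is loaded.
nlp_blank = spacy.blank("fr")
nlp_blank.add_pipe("sentencizer")

doc_demo = nlp_blank("Ada Lovelace aime les machines. Les machines aiment Ada.")

# doc.sents yields one Span object per sentence.
sentences = [sent.text for sent in doc_demo.sents]
for number, sentence in enumerate(sentences, start=1):
    print(number, sentence)
```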



In [None]:
# Print the sentences


## 3- Further pre-processing 

We saw earlier that the most frequent words are punctuation and function words. 
In order to find the most important words, e.g. to index documents, we probably want to remove these tokens.
We are thus now going to **remove punctuation signs and "stop words"**.
Note that for a full normalization, we would probably also lower case the first word of each sentence, and all words that are not tagged as proper nouns (but it requires pos tagging).

**Exercise 6:**
* Define a function that segments, tokenizes, removes punctuation and removes stop words. 
  * **Hint** look at *string.punctuation*
  * **Hint** *spacy.lang* contains language specific data for each language, in particular stop words lists.
* Apply this function to the French wikipedia page
* Print the total number of unique tokens after this pre-processing and the first 100 tokens of the cleaned version of the text
* Display a pandas DataFrame containing the tokens after cleaning, ordered by frequency
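
One possible shape for such a cleaning function is sketched below on a made-up text. It assumes a blank French pipeline with a `sentencizer` (no download needed) and lower-cases every token before filtering, which is a simplification; `clean_text` and `nlp_blank` are names chosen for the example.

```python
import string
from collections import Counter

import pandas as pd
import spacy
from spacy.lang.fr.stop_words import STOP_WORDS

# Assumption: blank French pipeline + sentencizer, so nothing has to be downloaded.
nlp_blank = spacy.blank("fr")
nlp_blank.add_pipe("sentencizer")


def clean_text(text):
    """Return the lower-cased tokens of `text` without punctuation or stop words."""
    cleaned = []
    for sentence in nlp_blank(text).sents:
        for token in sentence:
            word = token.text.lower()
            # Skip whitespace tokens and tokens made only of punctuation characters.
            if not word.strip() or all(char in string.punctuation for char in word):
                continue
            # Skip French stop words.
            if word in STOP_WORDS:
                continue
            cleaned.append(word)
    return cleaned


sample = "La Joconde est un tableau. Le tableau est exposé au Louvre !"  # made-up text
cleaned_tokens = clean_text(sample)

# Frequency table, most frequent tokens first.
frequencies = pd.DataFrame(
    Counter(cleaned_tokens).most_common(), columns=["token", "frequency"]
)
print(len(set(cleaned_tokens)), "unique tokens")
print(frequencies)
```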


In [None]:
# Define a function that cleans a text by segmenting into sentences, 
# tokenizing, removing punctuation and stop words



# Apply the function to the French wikipedia page


# Print the number of unique tokens after cleaning


# Print the first 100 tokens of the cleaned version of the text


In [None]:
# Display a pandas DataFrame containing the tokens after cleaning, 
# ordered by frequency


## 4- POS tagging

Remember that the model corresponds to a processing 'pipeline' in Spacy: 
  - by default, it includes the tokenisation, the lemmatization and the POS tagging

**Exercise 7**
- Print each individual token, together with its lemmatized form and part-of-speech tag
- Use Pandas to better visualize the results
- Look at the results: do you see any errors?
- You can use the function 'spacy.explain' to get information about an annotation label, for example a POS tag. Apply it to each POS tag to get a more detailed label.
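
The attributes involved can be sketched as follows. To keep the example runnable without a trained model, the lemmas and POS tags are written by hand when building the `Doc` (an assumption made for illustration); with a real pipeline they are produced by `nlp(text)`.

```python
import pandas as pd
import spacy
from spacy.tokens import Doc

# Assumption: hand-written annotations on a made-up sentence, so that the
# example runs without a trained model.
vocab = spacy.blank("fr").vocab
doc_demo = Doc(
    vocab,
    words=["La", "petite", "fille", "mange", "une", "pomme", "."],
    lemmas=["le", "petit", "fille", "manger", "un", "pomme", "."],
    pos=["DET", "ADJ", "NOUN", "VERB", "DET", "NOUN", "PUNCT"],
)

# token.lemma_ and token.pos_ hold the string values of the annotations;
# spacy.explain maps a tag to a short human-readable description.
table = pd.DataFrame(
    {
        "text": [token.text for token in doc_demo],
        "lemma": [token.lemma_ for token in doc_demo],
        "pos": [token.pos_ for token in doc_demo],
        "pos_explained": [spacy.explain(token.pos_) for token in doc_demo],
    }
)
print(table)
```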


In [None]:
# Print tokens, lemmas, and pos tags


#### Pandas

You can use Pandas to better visualize the results

In [None]:
# Display a pandas DataFrame containing the tokens and associated lemmas and POS 


In [None]:
# Use the method 'explain' to get a more detailed version of the POS tags



## 5- Named entity recognition

As part of the preprocessing pipeline, Spacy has also carried out named entity recognition.

**Exercise 8:**
* print out each named entity, together with the label assigned to it
* what do the labels stand for?
* Use the module called 'displacy' to visualize the Named Entities directly in the text.
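
A minimal sketch of how entities are accessed and rendered is given below. The entity spans are set by hand on a made-up sentence (an assumption, so that no trained model is needed); with a real pipeline, `doc.ents` is filled in by the NER component.

```python
import spacy
from spacy import displacy
from spacy.tokens import Doc, Span

# Assumption: entity spans set by hand on a made-up sentence.
vocab = spacy.blank("fr").vocab
doc_demo = Doc(vocab, words=["Ada", "Lovelace", "est", "née", "à", "Londres", "."])
doc_demo.ents = [
    Span(doc_demo, 0, 2, label="PER"),  # "Ada Lovelace"
    Span(doc_demo, 5, 6, label="LOC"),  # "Londres"
]

for ent in doc_demo.ents:
    # ent.label_ is the entity type; spacy.explain gives a short description.
    print(ent.text, ent.label_, spacy.explain(ent.label_))

# jupyter=False returns the HTML markup as a string; in a notebook,
# displacy.render(doc_demo, style="ent") displays the result directly.
html = displacy.render(doc_demo, style="ent", jupyter=False)
```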

In [None]:
# print out each named entity, together with the label assigned to it


In [None]:
# Use the method 'explain' to get a more detailed version of the NE tags


In [None]:
# Use the module called 'displacy' to visualize the Named Entities directly in the text


## 6- Parsing 

Finally, as part of the pipeline, spaCy has also performed dependency parsing (note that each component can be disabled if not needed).

**Exercise 9:** 

* Retrieve the information from the dependency parses: dependent and head of each token for the first sentence of the document
* Use displacy to visualize a parse tree: first try with a simple sentence (e.g. *La petite brise la glace.*) then use the first sentence of the document.
* Navigating the parse tree: each element of the tree has attributes that you can use to inspect it: 
  * Define a Pandas DataFrame in which each token is associated with its head and with the relation between them. Also include the children of the token, if any.
* Print all the adjectives and the noun they modify
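
The token attributes used to navigate a parse can be sketched as follows. The heads and dependency labels are written by hand on a made-up sentence (an assumption, so that the example runs without a trained model); heads are given as absolute token indices.

```python
import spacy
from spacy import displacy
from spacy.tokens import Doc

# Assumption: hand-written parse of a made-up sentence
# (absolute head indices, Universal Dependencies style labels).
vocab = spacy.blank("fr").vocab
doc_demo = Doc(
    vocab,
    words=["La", "petite", "fille", "mange", "une", "pomme", "rouge", "."],
    pos=["DET", "ADJ", "NOUN", "VERB", "DET", "NOUN", "ADJ", "PUNCT"],
    heads=[2, 2, 3, 3, 5, 3, 5, 3],
    deps=["det", "amod", "nsubj", "ROOT", "det", "obj", "amod", "punct"],
)

# token.head is the governor, token.dep_ the relation, token.children the dependents.
for token in doc_demo:
    children = [child.text for child in token.children]
    print(token.i, token.text, token.dep_, "<-", token.head.text, children)

# Adjectives attached to a noun through the "amod" relation.
adjective_noun = [
    (token.text, token.head.text)
    for token in doc_demo
    if token.pos_ == "ADJ" and token.dep_ == "amod" and token.head.pos_ == "NOUN"
]
print(adjective_noun)

# jupyter=False returns the SVG markup as a string instead of displaying it.
svg = displacy.render(doc_demo, style="dep", jupyter=False)
```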



In [None]:
# Retrieve the information from the dependency parses: dependent and head of 
# each token for the first sentence of the document



In [None]:
# Use displacy to visualize a parse tree: 
# first try with a simple sentence (e.g. *La petite brise la glace.*) 




In [None]:
# then use the first sentence of the wikipedia document


In [None]:
# Define a pandas DataFrame containing all the information about a token:
# text, pos, dep relation to head, head text, head pos, children 


In [None]:
# Extract all adjectives and the noun they modify


## 7- Putting it all together

Now we are going to use the skills practiced in the preceding exercises to build a simple question-answering system on a toy dataset (in French).

We will focus on specific questions of the form "Qui a peint X ?". 
We will define patterns based on different ways of formulating this question, and use them to extract the answer from a small toy corpus based on Wikipedia pages about paintings. 

When you're done with this exercise, try to answer other types of questions, such as "Où est exposé X ?", "Quand a été peinte X ?".

Below, we reload the spacy French model adding specific options to merge named entities containing multiple tokens.

In [None]:
# Load the model
nlp = spacy.load('fr_core_news_sm')

nlp.add_pipe("merge_entities")
nlp.add_pipe("merge_noun_chunks")

Here is the list of questions we will consider. 
You also need a corpus of source documents, which you can find on Moodle (corpus_qa.txt).

In [None]:
question_list = [
    'La Joconde est un tableau de qui ?',
    'Le radeau de la méduse est une peinture réalisée par qui ?'
]
corpus = 'corpus_qa.txt'

**Exercise 10:** In this part, we focus on the first question. This question is designed to be similar to the document containing the answer. We can thus define a pattern based on its structure to extract the answer from the document.

- Process the question using the spacy nlp pipeline 
- display its parse tree and / or print a Pandas dataframe containing information from the parse tree
- Now to define a lexico-syntactic pattern, you are going to use spacy *DependencyMatcher* : https://spacy.io/usage/rule-based-matching#dependencymatcher
  * Look at the doc to understand how it works
  * Define a pattern that should match 'qui ' in the question 
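
To illustrate the general shape of a `DependencyMatcher` pattern, here is a minimal sketch on a toy sentence unrelated to the questions of the exercise. The parse is written by hand (an assumption, so that no trained model is needed), and the pattern looks for a verb together with its direct object.

```python
import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc

# Assumption: hand-annotated toy sentence, not taken from the exercise corpus.
vocab = spacy.blank("fr").vocab
doc_demo = Doc(
    vocab,
    words=["La", "fille", "mange", "une", "pomme", "."],
    pos=["DET", "NOUN", "VERB", "DET", "NOUN", "PUNCT"],
    heads=[1, 2, 2, 4, 2, 2],
    deps=["det", "nsubj", "ROOT", "det", "obj", "punct"],
)

# The first entry is the anchor token; each following entry is described
# relative to an already declared node (">" means "is the direct head of").
pattern = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"POS": "VERB"}},
    {
        "LEFT_ID": "verb",
        "REL_OP": ">",
        "RIGHT_ID": "object",
        "RIGHT_ATTRS": {"DEP": "obj"},
    },
]

matcher = DependencyMatcher(vocab)
matcher.add("VERB_OBJECT", [pattern])

# Each match is (match_id, token_ids), with token_ids in pattern order.
matches = matcher(doc_demo)
for match_id, token_ids in matches:
    verb, obj = (doc_demo[i] for i in token_ids)
    print(verb.text, "->", obj.text)
```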

In [None]:
# Display the parse tree of the first question 


In [None]:
from spacy.matcher import DependencyMatcher
# Define a lexico-syntactic pattern that allows to retrieve the answer

pattern = [
    #...

]

# If you match the pattern to the original question, it should output 'qui'


**Exercise 11**: Retrieve matching documents 
Retrieve the documents that are relevant to the question, i.e. the ones containing the keyword 'La Joconde'.

It is recommended to define a function, that could be used for the next exercises. It could work by:

* first indexing all the documents using the named entities present in each document (i.e. build a dictionary mapping a named entity to all the documents where it appears)
* the *retrieve_documents(...)* function should then try to match the input keyword with a named entity and return the matching documents.
* Test the function with 'Joconde':
  * Does it work with the method based on named entities?
  * Add a backup solution: if no document is found, simply search for the keyword string in the documents
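
The indexing idea can be sketched in plain Python as follows. The two documents and their entity lists are hypothetical and written by hand; in the exercise, the entities would come from spaCy and the documents from the corpus file.

```python
from collections import defaultdict

# Hypothetical toy corpus and hand-listed entities, standing in for the real
# corpus and for the entities spaCy would extract from each document.
documents = {
    0: "La Joconde est exposée au musée du Louvre.",
    1: "Le Cri est conservé à Oslo.",
}
entities_per_document = {0: ["Louvre"], 1: ["Le Cri", "Oslo"]}

# Inverted index: entity text -> set of ids of the documents that mention it.
entity_index = defaultdict(set)
for doc_id, entities in entities_per_document.items():
    for entity in entities:
        entity_index[entity].add(doc_id)


def retrieve_documents(keyword):
    """Look the keyword up in the entity index, then fall back to substring search."""
    doc_ids = entity_index.get(keyword, set())
    if not doc_ids:
        doc_ids = {i for i, text in documents.items() if keyword in text}
    return [documents[i] for i in sorted(doc_ids)]


print(retrieve_documents("Oslo"))     # found through the entity index
print(retrieve_documents("Joconde"))  # found through the substring fallback
```

An exact-match index is the simplest choice; it misses keywords that are only part of an entity, which is what the fallback is for.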


In [None]:
# Retrieve the document that is relevant to the question, i.e. the one 
# containing 'La Joconde'


**Exercise 12:** Test the pattern

- Apply the pattern to each sentence of the retrieved document, do you find the right answer?

In [None]:
#  Apply the pattern to each sentence of this document, do you find the right answer?


**Exercise 13:** Now define a pattern to answer the second question, and find the answer!

Be careful, there is a little issue here:
- Display the parse tree for the question and the matching document: what do you observe?
- Build a pattern to match the question and the answer (Hint: you can use a list of dependency relations in the pattern, e.g. *"RIGHT_ATTRS": {"DEP": {"IN": ["acl", "advcl"]}}*)
- Finally, retrieve the answer

In [None]:
# Display the parse tree of the second question 


In [None]:
# Display the parse tree of the matching document (first and unique sentence)


In [None]:
# Define a pattern that matches both the question and answer


In [None]:
# Retrieve the answer


**Exercise 14:** Use the patterns defined above to find the answers to the questions below. Here, we know that we are looking for the painter: we don't want to match the question to the answer, but to test known patterns to find the right answer.

- find a way to extract the name of the painting from the question
- retrieve the relevant document
- test the patterns defined previously to extract the correct answer

In [None]:
new_questions = [
    'Qui a peint American Gothic ?',
    'Qui est l\'auteur de la peinture La Nuit étoilée',
    'Qui a réalisé Un Garrochista ?',
]

**Exercise 15:** To go further

Try to:
- find who painted 'Le Cri' and 'Arearea'
- find where 'La Joconde' and 'Arearea' are located