**<h1>Lab 1: Basic usage of the spaCy and NLTK libraries</h1>**


Check that the spacy package is installed

In [None]:
!pip show spacy

Name: spacy
Version: 3.7.5
Summary: Industrial-strength Natural Language Processing (NLP) in Python
Home-page: https://spacy.io
Author: Explosion
Author-email: contact@explosion.ai
License: MIT
Location: /usr/local/lib/python3.11/dist-packages
Requires: catalogue, cymem, jinja2, langcodes, murmurhash, numpy, packaging, preshed, pydantic, requests, setuptools, spacy-legacy, spacy-loggers, srsly, thinc, tqdm, typer, wasabi, weasel
Required-by: en-core-web-sm, fastai


Check which models are installed

In [None]:
import spacy
print(spacy.util.get_installed_models())

['en_core_web_sm']


Load the en_core_web_sm model

In [None]:
nlp = spacy.load('en_core_web_sm')

You can try other models: https://spacy.io/models and https://spacy.io/usage/models

Once the model is loaded, the variable nlp refers to an instance of the Language class, which holds language-specific rules for various tasks (for example, tokenization) as well as a processing pipeline.
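As a minimal sketch of what "a Language instance with a processing pipeline" means, the snippet below builds a blank English pipeline and inspects its components. It deliberately uses `spacy.blank("en")` rather than the trained model so that it runs without any download, and the variable name `blank_nlp` is only illustrative (chosen so it does not overwrite the notebook's `nlp`).

```python
import spacy
from spacy.language import Language

# Assumption: a blank English pipeline (tokenization rules only, no trained components).
blank_nlp = spacy.blank("en")

# The object is a language-specific subclass of Language.
print(type(blank_nlp).__name__, isinstance(blank_nlp, Language))

# A blank pipeline has no components yet.
print(blank_nlp.pipe_names)

# Add spaCy's built-in rule-based sentence splitter as a pipeline component.
blank_nlp.add_pipe("sentencizer")
print(blank_nlp.pipe_names)
```

Running `nlp.pipe_names` on the loaded en_core_web_sm model lists the trained components of that pipeline in the same way.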

# Tokenization

The Doc object is a container for accessing linguistic annotations. A Doc is a sequence of Token objects. It gives access to sentences and named entities, can export annotations to numpy arrays, and can be serialized losslessly to compressed binary strings. Internally, the Doc object holds an array of TokenC structs. The Python-level Token and Span objects are views of this array, i.e. they do not own the data themselves.
You can learn more about the Doc object here: https://spacy.io/api/doc
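The export and serialization features mentioned above can be sketched as follows. This is an illustration under stated assumptions: it uses a blank English pipeline (so only lexical attributes such as `LOWER` and `IS_ALPHA` are meaningful), and the names `demo_nlp`, `demo_doc`, and `restored` are hypothetical.

```python
import spacy
from spacy.tokens import Doc

# Assumption: a blank English pipeline, so the sketch needs no trained model.
demo_nlp = spacy.blank("en")
demo_doc = demo_nlp("He didn't want to pay $20 for this book.")

# Export per-token attributes to a numpy array:
# one row per token, one column per requested attribute.
arr = demo_doc.to_array(["LOWER", "IS_ALPHA"])
print(arr.shape)

# Serialize the Doc to bytes, then restore it into a new Doc
# that shares the same vocabulary.
data = demo_doc.to_bytes()
restored = Doc(demo_nlp.vocab).from_bytes(data)
print(restored.text == demo_doc.text)

# Token and Span objects are views onto the Doc's token array.
print(restored[0].text, "|", restored[0:3].text)
```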

In [None]:
# Example sentence
s = "He didn't want to pay $20 for this book."
doc = nlp(s)

We can iterate over this Doc object and print its tokens.

In [None]:
print([t.text for t in doc])

['He', 'did', "n't", 'want', 'to', 'pay', '$', '20', 'for', 'this', 'book', '.']


The Doc object can be indexed and sliced like an ordinary list.

In [None]:
print(doc[0])
print(type(doc[0]))

He
<class 'spacy.tokens.token.Token'>


In [None]:
# Slicing a Doc object returns a Span object.
print(doc[0:3])
print(type(doc[0:3]))

He didn't
<class 'spacy.tokens.span.Span'>


In [None]:
# Access the index of each token within the sentence.
print([(t.text, t.i) for t in doc])

[('He', 0), ('did', 1), ("n't", 2), ('want', 3), ('to', 4), ('pay', 5), ('$', 6), ('20', 7), ('for', 8), ('this', 9), ('book', 10), ('.', 11)]


You can learn more about the Token and Span objects here:

https://spacy.io/api/token

https://spacy.io/api/span

spaCy's tokenization is non-destructive, meaning that the original input can be reconstructed from the tokens.
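A small sketch of this property: each token keeps its trailing whitespace (`text_with_ws`) and its character offset in the original string (`idx`), so the input can be rebuilt exactly. The blank pipeline and the variable names are assumptions made so the example stays self-contained.

```python
import spacy

# Assumption: a blank English pipeline; only the tokenizer is needed here.
demo_nlp = spacy.blank("en")
text = "He didn't want to pay $20 for this book."
demo_doc = demo_nlp(text)

# Concatenating each token's text plus its trailing whitespace restores the input.
rebuilt = "".join(tok.text_with_ws for tok in demo_doc)
print(rebuilt == text)

# Each token also records where it starts in the original string.
offsets_ok = all(
    text[tok.idx : tok.idx + len(tok.text)] == tok.text for tok in demo_doc
)
print(offsets_ok)
```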

# Sentence segmentation

We can also process a text containing several sentences and access each sentence individually through the sents property of the Doc object.

In [None]:
s = """The room was eerily silent, save for the faint rustling
 of the curtains as the breeze gently passed through the open
 window. She stood still, staring at the vast expanse of the
 horizon, where the sky met the sea in a seamless blend of
 colors. A soft hum filled the air, almost as if the world
 itself were holding its breath. She wandered closer to the
 edge, her shoes tapping lightly on the wooden floor, and
 peered out, wondering if she had made the right decision
 to come here. The path ahead seemed both inviting and
 uncertain, like the beginning of an adventure
 she hadn't yet understood."""

doc = nlp(s)

for sent in doc.sents:
    print(sent)


The room was eerily silent, save for the faint rustling
 of the curtains as the breeze gently passed through the open
 window.
She stood still, staring at the vast expanse of the
 horizon, where the sky met the sea in a seamless blend of 
 colors.
A soft hum filled the air, almost as if the world 
 itself were holding its breath.
She wandered closer to the
 edge, her shoes tapping lightly on the wooden floor, and 
 peered out, wondering if she had made the right decision 
 to come here.
The path ahead seemed both inviting and 
 uncertain, like the beginning of an adventure
 she hadn't yet understood.


# Case normalization (case folding)

In [None]:
s = "He informed Pr. Kamal that he had completed the tests and would send the results soon."
doc = nlp(s)
print([t.lower_ for t in doc])


['he', 'informed', 'pr', '.', 'kamal', 'that', 'he', 'had', 'completed', 'the', 'tests', 'and', 'would', 'send', 'the', 'results', 'soon', '.']


# Stop word removal

spaCy ships with a default stop word list. To display your document without the stop words, you can use the is_stop attribute of each token.

In [None]:
# spaCy's default stop word list.
print(nlp.Defaults.stop_words)
print(len(nlp.Defaults.stop_words))
print([t for t in doc if not t.is_stop])

{'hereupon', 'ten', 'anything', 'toward', 'front', 'few', '‘m', 'go', 'because', 'some', 'will', 'had', 'has', 'sometime', '‘s', 'i', 'done', 'except', 'due', 'he', 'does', 'forty', 'thence', 'three', 'bottom', 'already', 'seem', 'whatever', 'indeed', 'nobody', 'before', 'which', 'put', 'sometimes', "'d", 'yourself', 'thru', 'were', 'they', 'alone', 'almost', 'via', 'five', 'part', 'do', 'twelve', 'those', 'all', 'moreover', 'thus', 'after', 'third', 'by', 'every', 'whenever', 'using', 'him', 'other', 're', 'could', 'whence', 'serious', 'did', 'hundred', 'above', 'too', 'becoming', 'first', 'in', 'herein', 'than', "'m", '’re', 'across', 'what', 'here', 'whereafter', 'an', 'whether', 'them', '‘ve', '’m', 'always', 'among', 'take', 'have', 'anyway', 'within', 'everything', 'next', 'therein', '‘ll', 'ca', "'s", 'ourselves', "'re", 'each', 'onto', 'if', 'empty', 'see', 'may', 'throughout', 'but', 'either', 'much', 'on', 'was', 'unless', 'both', 'thereby', 'least', 'together', 'below', 'nev
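In practice, stop word filtering is often combined with punctuation filtering. The sketch below shows one way to do this with the `is_stop` and `is_punct` token attributes; it assumes a blank English pipeline (which carries the same default stop word list) and uses illustrative variable names.

```python
import spacy

# Assumption: a blank English pipeline, used so the sketch runs without a trained model.
demo_nlp = spacy.blank("en")
demo_doc = demo_nlp(
    "He informed Pr. Kamal that he had completed the tests "
    "and would send the results soon."
)

# Keep tokens that are neither stop words nor punctuation.
kept = [tok.text for tok in demo_doc if not tok.is_stop and not tok.is_punct]
print(kept)

# Membership in the default list can also be checked directly.
print("the" in demo_nlp.Defaults.stop_words)
```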

# Lemmatization

In [None]:
[(t.text, t.lemma_) for t in doc]

[('He', 'he'),
 ('informed', 'inform'),
 ('Pr', 'Pr'),
 ('.', '.'),
 ('Kamal', 'Kamal'),
 ('that', 'that'),
 ('he', 'he'),
 ('had', 'have'),
 ('completed', 'complete'),
 ('the', 'the'),
 ('tests', 'test'),
 ('and', 'and'),
 ('would', 'would'),
 ('send', 'send'),
 ('the', 'the'),
 ('results', 'result'),
 ('soon', 'soon'),
 ('.', '.')]

# POS Tagging

In [None]:
s = "A student took part in a course at school"
doc = nlp(s)
[(t.text, t.pos_) for t in doc]



[('A', 'DET'),
 ('student', 'NOUN'),
 ('took', 'VERB'),
 ('part', 'NOUN'),
 ('in', 'ADP'),
 ('a', 'DET'),
 ('course', 'NOUN'),
 ('at', 'ADP'),
 ('school', 'NOUN')]

In [None]:
print(spacy.explain('PROPN'))
print(spacy.explain('NNP'))
print(spacy.explain('VBD'))

proper noun
noun, proper singular
verb, past tense




# Named entity recognition

In [None]:
s = "Tesla is designing an electric SUV that could possibly arrive in Europe by next winter."
doc = nlp(s)

[(t.text, t.ent_type_) for t in doc]

[('Tesla', 'ORG'),
 ('is', ''),
 ('designing', ''),
 ('an', ''),
 ('electric', ''),
 ('SUV', ''),
 ('that', ''),
 ('could', ''),
 ('possibly', ''),
 ('arrive', ''),
 ('in', ''),
 ('Europe', 'LOC'),
 ('by', ''),
 ('next', 'DATE'),
 ('winter', 'DATE'),
 ('.', '')]

In [None]:
from spacy import displacy

# We need to set the 'jupyter' argument to True in order to output
# the visualization directly. Otherwise, you'll get raw HTML.
displacy.render(doc, style='ent', jupyter=True)

In [None]:
spacy.explain('ORG')

'Companies, agencies, institutions, etc.'

In [None]:
print([(t.text, t.ent_type_) for t in doc if t.ent_type != 0])

[('Tesla', 'ORG'), ('Europe', 'LOC'), ('next', 'DATE'), ('winter', 'DATE')]


In [None]:
print([(ent.text, ent.label_) for ent in doc.ents])

[('Tesla', 'ORG'), ('Europe', 'LOC'), ('next winter', 'DATE')]


# Parsing

In [None]:
s = "She enrolled in the course at the national school of applied science"
doc = nlp(s)

# Note the 'style' argument is assigned a 'dep' flag this time around.
displacy.render(doc, style='dep', jupyter=True)

# Testing a French model

In [None]:
!python -m spacy download fr_core_news_sm

Collecting fr-core-news-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.7.0/fr_core_news_sm-3.7.0-py3-none-any.whl (16.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16.3/16.3 MB 57.2 MB/s eta 0:00:00
Installing collected packages: fr-core-news-sm
Successfully installed fr-core-news-sm-3.7.0
✔ Download and installation successful
You can now load the package via spacy.load('fr_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
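# Load the French model by name.
nlp = spacy.load("fr_core_news_sm")

# Equivalent alternative: import the installed model package and call its load().
# import fr_core_news_sm
# nlp = fr_core_news_sm.load()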




In [None]:
text = "L'Ecole Nationale des Sciences Appliquées lance le premier concours qui aura lieu le mardi à Tanger. Tarik est parmis les candidats qui vont passer ce concours"
doc = nlp(text)
print([(w.text, w.pos_) for w in doc])
print(nlp.Defaults.stop_words)
print(len(nlp.Defaults.stop_words))
print([t for t in doc if not t.is_stop])
displacy.render(doc, style='ent', jupyter=True)
print([(t.text, t.lemma_) for t in doc])
displacy.render(doc, style='dep', jupyter=True)

[("L'", 'DET'), ('Ecole', 'PROPN'), ('Nationale', 'PROPN'), ('des', 'ADP'), ('Sciences', 'NOUN'), ('Appliquées', 'ADJ'), ('lance', 'VERB'), ('le', 'DET'), ('premier', 'ADJ'), ('concours', 'NOUN'), ('qui', 'PRON'), ('aura', 'VERB'), ('lieu', 'NOUN'), ('le', 'DET'), ('mardi', 'NOUN'), ('à', 'ADP'), ('Tanger', 'PROPN'), ('.', 'PUNCT'), ('Tarik', 'PROPN'), ('est', 'AUX'), ('parmis', 'VERB'), ('les', 'DET'), ('candidats', 'NOUN'), ('qui', 'PRON'), ('vont', 'VERB'), ('passer', 'VERB'), ('ce', 'DET'), ('concours', 'NOUN')]
{'allons', 'longtemps', 'eu', 'vais', 'nos', 'ceux-là', 'certaines', 'différentes', 'différent', 'té', 'deja', 'peux', 'n’', 'vous', 'specifiques', 'cela', 'mien', 'troisième', 'hui', 'antérieure', 'treize', 'sien', 'derrière', 'ouias', 'souvent', 'ça', 'cinquantième', 'laquelle', 'miens', 'suffit', 'un', 'i', 'se', 'certaine', 'font', 'chacune', 'je', 'tenir', 'celle-ci', 'eux-mêmes', 'hé', 'restent', 'feront', 'aura', 'autres', 'avons', 'même', 'cinq', 'peut', 'dix-huit',

[("L'", 'le'), ('Ecole', 'Ecole'), ('Nationale', 'Nationale'), ('des', 'de'), ('Sciences', 'science'), ('Appliquées', 'appliqué'), ('lance', 'lancer'), ('le', 'le'), ('premier', 'premier'), ('concours', 'concours'), ('qui', 'qui'), ('aura', 'avoir'), ('lieu', 'lieu'), ('le', 'le'), ('mardi', 'mardi'), ('à', 'à'), ('Tanger', 'Tanger'), ('.', '.'), ('Tarik', 'Tarik'), ('est', 'être'), ('parmis', 'parmis'), ('les', 'le'), ('candidats', 'candidat'), ('qui', 'qui'), ('vont', 'aller'), ('passer', 'passer'), ('ce', 'ce'), ('concours', 'concours')]


# Using NLTK

In [None]:
!pip show nltk

Name: nltk
Version: 3.9.1
Summary: Natural Language Toolkit
Home-page: https://www.nltk.org/
Author: NLTK Team
Author-email: nltk.team@gmail.com
License: Apache License, Version 2.0
Location: /usr/local/lib/python3.11/dist-packages
Requires: click, joblib, regex, tqdm
Required-by: textblob


In [None]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
example_string = """Muad'Dib learned rapidly because his first training was in how to learn.
And the first lesson of all was the basic trust that he could learn.
It's shocking to find how many people do not believe they can learn,
and how many more believe learning to be difficult."""
print(sent_tokenize(example_string))
print(word_tokenize(example_string))

["Muad'Dib learned rapidly because his first training was in how to learn.", 'And the first lesson of all was the basic trust that he could learn.', "It's shocking to find how many people do not believe they can learn,\nand how many more believe learning to be difficult."]
["Muad'Dib", 'learned', 'rapidly', 'because', 'his', 'first', 'training', 'was', 'in', 'how', 'to', 'learn', '.', 'And', 'the', 'first', 'lesson', 'of', 'all', 'was', 'the', 'basic', 'trust', 'that', 'he', 'could', 'learn', '.', 'It', "'s", 'shocking', 'to', 'find', 'how', 'many', 'people', 'do', 'not', 'believe', 'they', 'can', 'learn', ',', 'and', 'how', 'many', 'more', 'believe', 'learning', 'to', 'be', 'difficult', '.']


In [None]:
nltk.download("stopwords")
from nltk.corpus import stopwords


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
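worf_quote = "Sir, I protest. I am not a merry man!"
# Tokenize the quote first; the filter below works on the token list.
words_in_quote = word_tokenize(worf_quote)
stop_words = set(stopwords.words("english"))
filtered_list = [word for word in words_in_quote if word.casefold() not in stop_words]
filtered_list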

['Sir', ',', 'protest', '.', 'merry', 'man', '!']

In [None]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
string_for_stemming = """The crew of the USS Discovery discovered many discoveries.
Discovering is what explorers do."""
words = word_tokenize(string_for_stemming)
stemmed_words = [stemmer.stem(word) for word in words]
stemmed_words

['the',
 'crew',
 'of',
 'the',
 'uss',
 'discoveri',
 'discov',
 'mani',
 'discoveri',
 '.',
 'discov',
 'is',
 'what',
 'explor',
 'do',
 '.']

In [None]:
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

In [None]:
sagan_quote = """If you wish to make an apple pie from scratch,
you must first invent the universe."""
words_in_sagan_quote = word_tokenize(sagan_quote)
nltk.pos_tag(words_in_sagan_quote)

[('If', 'IN'),
 ('you', 'PRP'),
 ('wish', 'VBP'),
 ('to', 'TO'),
 ('make', 'VB'),
 ('an', 'DT'),
 ('apple', 'NN'),
 ('pie', 'NN'),
 ('from', 'IN'),
 ('scratch', 'NN'),
 (',', ','),
 ('you', 'PRP'),
 ('must', 'MD'),
 ('first', 'VB'),
 ('invent', 'VB'),
 ('the', 'DT'),
 ('universe', 'NN'),
 ('.', '.')]