**Morpho-syntactic analysis with NLTK (https://www.nltk.org/)**
In this lab you will study and then test several features of NLTK and spaCy for text preprocessing and vectorization. Each feature comes with one example; take the time to try it on other sentences and to understand how to manipulate this kind of textual data.

**Tokenization and POS tagging** with NLTK:

In [26]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
text = nltk.word_tokenize("Today is raining!")
print(nltk.pos_tag(text))

[('Today', 'NN'), ('is', 'VBZ'), ('raining', 'VBG'), ('!', '.')]
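
The tags in this output come from the Penn Treebank tagset, which `nltk.pos_tag` uses by default for English. As a reading aid, the sketch below glosses the four tags seen above. The `PENN_GLOSSES` dictionary and the `gloss` helper are illustrative names, not part of NLTK, and the table only covers these four tags.

```python
# Hypothetical helper (not an NLTK API): glosses for the Penn Treebank tags
# that appear in the output above.
PENN_GLOSSES = {
    "NN": "noun, singular or mass",
    "VBZ": "verb, 3rd person singular present",
    "VBG": "verb, gerund or present participle",
    ".": "sentence-final punctuation",
}

def gloss(tagged_tokens):
    """Return (word, tag, description) triples; unknown tags get '?'."""
    return [(word, tag, PENN_GLOSSES.get(tag, "?")) for word, tag in tagged_tokens]

# Tagged tokens copied from the cell output above.
tagged = [('Today', 'NN'), ('is', 'VBZ'), ('raining', 'VBG'), ('!', '.')]
for word, tag, description in gloss(tagged):
    print(f"{word:8} {tag:4} {description}")
```

Note that the exclamation mark receives the `.` tag, which Penn Treebank uses for all sentence-final punctuation.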




Let's find out which tags are the most common in the NEWS category of the Brown corpus:

In [3]:
from nltk.corpus import brown
nltk.download('brown')
nltk.download('universal_tagset')
brown_news_tagged = brown.tagged_words(categories='news',tagset='universal')
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
tag_fd.most_common()



[('NOUN', 30654),
 ('VERB', 14399),
 ('ADP', 12355),
 ('.', 11928),
 ('DET', 11389),
 ('ADJ', 6706),
 ('ADV', 3349),
 ('CONJ', 2717),
 ('PRON', 2535),
 ('PRT', 2264),
 ('NUM', 2166),
 ('X', 92)]
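
Raw counts are easier to compare as proportions. The sketch below recomputes each tag's share from the counts listed above; the variable names `tag_counts`, `total` and `shares` are illustrative.

```python
# Counts copied from the output above (universal tagset, Brown "news" category).
tag_counts = [
    ('NOUN', 30654), ('VERB', 14399), ('ADP', 12355), ('.', 11928),
    ('DET', 11389), ('ADJ', 6706), ('ADV', 3349), ('CONJ', 2717),
    ('PRON', 2535), ('PRT', 2264), ('NUM', 2166), ('X', 92),
]

# Total number of tagged tokens, then the relative frequency of each tag.
total = sum(count for _, count in tag_counts)
shares = {tag: count / total for tag, count in tag_counts}

for tag, share in shares.items():
    print(f"{tag:5} {share:6.1%}")
```

Inside the notebook, `tag_fd.freq('NOUN')` should return the same ratio directly from the `FreqDist` object.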

**Stemming** with NLTK.

In [4]:
from nltk.stem import PorterStemmer
porter = PorterStemmer()

for word in ['walking', 'walks', 'walked']:
    print(porter.stem(word))

walk
walk
walk
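
Stemming is typically used to merge inflected forms before counting words. The sketch below illustrates that idea with a deliberately naive suffix stripper. It is not the Porter algorithm (which applies several ordered rule phases), and `naive_stem` is a hypothetical helper written only for this illustration.

```python
from collections import Counter

def naive_stem(word):
    """Strip one common English suffix. Deliberately crude; NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words such as "is" stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["walking", "walks", "walked", "talked", "talks", "dog"]

# Count occurrences per stem instead of per surface form.
stem_counts = Counter(naive_stem(w) for w in words)
print(stem_counts.most_common())
```

In the notebook, replacing `naive_stem` with `porter.stem` gives the real behaviour on the same word list.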


**Distribution of words in the text**

The method text.similar() takes a word w, finds all contexts w1 w w2 in which it occurs, then finds all words w' that appear in the same contexts, i.e. w1 w' w2.

In [5]:
text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
#text.similar('woman')
# Try the following words, and others of your own:
text.similar('bought')
#text.similar('the')

made said done put had seen found given left heard was been brought
set got that took in told felt
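
The context-matching idea described above can be sketched in plain Python. This is a simplification under stated assumptions: candidates are ranked by the number of distinct shared (w1, w2) contexts, which is not necessarily how NLTK scores them, and the toy token list and the `similar_words` function are invented for the illustration.

```python
from collections import Counter, defaultdict

def similar_words(tokens, target, top_n=5):
    """Rank words by how many (left, right) contexts they share with `target`."""
    # Map each word to the set of (previous word, next word) pairs around it.
    contexts = defaultdict(set)
    for left, word, right in zip(tokens, tokens[1:], tokens[2:]):
        contexts[word].add((left, right))

    target_contexts = contexts[target]
    scores = Counter()
    for word, word_contexts in contexts.items():
        shared = len(word_contexts & target_contexts)
        if word != target and shared:
            scores[word] = shared
    return [word for word, _ in scores.most_common(top_n)]

# Toy corpus (hypothetical data, already lowercased and tokenized).
tokens = ("i bought a book . i read a book . i bought the car . "
          "i sold the car . i sold a house").split()
print(similar_words(tokens, "bought"))
```

Here "sold" shares two contexts with "bought" ("i _ a" and "i _ the") while "read" shares only one, so "sold" is ranked first.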


**How to create a CFG?**

Let's define a grammar and see how to parse a simple sentence that the grammar accepts.

Which sentences can this grammar recognize?

In [6]:
from nltk.corpus import treebank
grammar1 = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP | V NP PP
PP -> P NP
V -> 'saw' | 'ate' | 'walked' | 'chase'
NP -> 'John' | 'Mary' | 'Bob' | Det N | Det N PP | N
Det -> 'a' | 'an' | 'the' | 'my'
N -> 'man' | 'dog' | 'cat' | 'telescope' | 'park'| 'dogs' | 'cats'
P -> 'in' | 'on' | 'by' | 'with'
""")
#sent = "Mary saw Bob".split()
sent = "dogs chase cats".split()
rd_parser = nltk.RecursiveDescentParser(grammar1)
for tree in rd_parser.parse(sent):
  print(tree)

(S (NP (N dogs)) (VP (V chase) (NP (N cats))))
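
To see what a recursive descent parser does with such a grammar, here is a minimal pure-Python recognizer for the same rules. It is a sketch, not NLTK's `RecursiveDescentParser`: the `GRAMMAR` dictionary, `expand` and `count_derivations` are illustrative names, and the approach only terminates because this grammar has no left-recursive rule.

```python
# The grammar above, rewritten as a dictionary: nonterminal -> list of productions.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "VP": [["V", "NP"], ["V", "NP", "PP"]],
    "PP": [["P", "NP"]],
    "V": [["saw"], ["ate"], ["walked"], ["chase"]],
    "NP": [["John"], ["Mary"], ["Bob"], ["Det", "N"], ["Det", "N", "PP"], ["N"]],
    "Det": [["a"], ["an"], ["the"], ["my"]],
    "N": [["man"], ["dog"], ["cat"], ["telescope"], ["park"], ["dogs"], ["cats"]],
    "P": [["in"], ["on"], ["by"], ["with"]],
}

def expand(symbols, tokens, pos):
    """Yield each end position reached by matching `symbols` against tokens[pos:]."""
    if not symbols:
        yield pos
        return
    first, rest = symbols[0], symbols[1:]
    if first in GRAMMAR:
        # Nonterminal: try every production, then continue with the remaining symbols.
        for production in GRAMMAR[first]:
            for middle in expand(production, tokens, pos):
                yield from expand(rest, tokens, middle)
    elif pos < len(tokens) and tokens[pos] == first:
        # Terminal: it must equal the next input token.
        yield from expand(rest, tokens, pos + 1)

def count_derivations(sentence):
    """Number of ways the whole sentence can be derived from S (0 = rejected)."""
    tokens = sentence.split()
    return sum(1 for end in expand(["S"], tokens, 0) if end == len(tokens))

for sentence in ["dogs chase cats", "Mary saw Bob",
                 "the dog saw a man in the park", "dogs cats"]:
    print(sentence, "->", count_derivations(sentence))
```

A count above 1 signals structural ambiguity: in "the dog saw a man in the park", the PP can attach either to the verb phrase or to the noun phrase "a man".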


Modify the grammar so that it can recognize the sentence "dogs chase cats". Test it with NLTK!

In [7]:
# Test here!

**A CFG for French.** Which sentences can this grammar recognize?

Modify the grammar so that it can recognize other sentences!

In [8]:
grammaire = nltk.CFG.fromstring("""
S -> SN SV
SN -> Art Nom
SV -> V SN | V
Nom -> 'chien' | 'chat'
Art -> 'le'
V -> 'mange'
V -> 'dort'
""")
sent = "le chien dort".split()
rd_parser = nltk.RecursiveDescentParser(grammaire)
for tree in rd_parser.parse(sent):
  print(tree)

(S (SN (Art le) (Nom chien)) (SV (V dort)))
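
Because this French grammar contains no recursive rule, its language is finite and can be enumerated. The sketch below does so in plain Python; the `GRAMMAIRE` dictionary and the `generate` function are illustrative names written for this example (NLTK ships a comparable utility in `nltk.parse.generate`).

```python
from itertools import product

# The French grammar above: nonterminal -> list of productions.
GRAMMAIRE = {
    "S": [["SN", "SV"]],
    "SN": [["Art", "Nom"]],
    "SV": [["V", "SN"], ["V"]],
    "Nom": [["chien"], ["chat"]],
    "Art": [["le"]],
    "V": [["mange"], ["dort"]],
}

def generate(symbol):
    """Return every word sequence derivable from `symbol`.

    Assumes the grammar is non-recursive; otherwise this would never terminate.
    """
    if symbol not in GRAMMAIRE:  # terminal word
        return [[symbol]]
    results = []
    for production in GRAMMAIRE[symbol]:
        # Expand each right-hand-side symbol, then combine the alternatives.
        parts = [generate(child) for child in production]
        for combination in product(*parts):
            results.append([word for part in combination for word in part])
    return results

sentences = [" ".join(words) for words in generate("S")]
print(len(sentences))
for sentence in sentences:
    print(sentence)
```

The enumeration also exposes over-generation: the grammar accepts "le chien dort le chat" because it does not distinguish transitive from intransitive verbs.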


Let's now try spaCy, another open-source library for advanced natural language processing in Python.

In [21]:
!python -m spacy download en_core_web_sm


Traceback (most recent call last):
  File "<frozen runpy>", line 189, in _run_module_as_main
  File "<frozen runpy>", line 148, in _get_module_details
  File "<frozen runpy>", line 112, in _get_module_details
  File "C:\Users\Skyzo\AppData\Roaming\Python\Python312\site-packages\spacy\__init__.py", line 13, in <module>
    from . import pipeline  # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Skyzo\AppData\Roaming\Python\Python312\site-packages\spacy\pipeline\__init__.py", line 1, in <module>
    from .attributeruler import AttributeRuler
  File "C:\Users\Skyzo\AppData\Roaming\Python\Python312\site-packages\spacy\pipeline\attributeruler.py", line 8, in <module>
    from ..language import Language
  File "C:\Users\Skyzo\AppData\Roaming\Python\Python312\site-packages\spacy\language.py", line 46, in <module>
    from .pipe_analysis import analyze_pipes, print_pipe_analysis, validate_attrs
  File "C:\Users\Skyzo\AppData\Roaming\Python\Python312\site-packages\spacy\pipe_analysis.py

In [22]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('Hello     World!')
for token in doc:
    print('"' + token.text + '"')

ModuleNotFoundError: No module named 'spacy'

In [None]:
# Sentence detection

doc = nlp("These are apples. These are oranges.")

for sent in doc.sents:
    print(sent)



In [None]:
# POS Tagging

doc = nlp("Next week I'll be in Madrid.")
print([(token.text, token.tag_) for token in doc])



In [None]:
# NER Named Entity Recognition

doc = nlp(u"Next week I'll be in Madrid.")
for ent in doc.ents:
    print(ent.text, ent.label_)


In [None]:
# Spacy Entity Types

doc = nlp(u"I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ")
for ent in doc.ents:
    print(ent.text, ent.label_)


In [None]:
# displaCy

from spacy import displacy

doc = nlp(u'I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ')
displacy.render(doc, style='ent', jupyter=True)

In [None]:
# Chunking

doc = nlp("Wall Street Journal just published an interesting piece on crypto currencies")
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.label_, chunk.root.text)


In [None]:
# Dependency Parsing

doc = nlp('Wall Street Journal just published an interesting piece on crypto currencies')

for token in doc:
    print("{0}/{1} <--{2}-- {3}/{4}".format(
        token.text, token.tag_, token.dep_, token.head.text, token.head.tag_))

In [None]:
# Visualizing the dependency parse

from spacy import displacy

doc = nlp(u'Wall Street Journal just published an interesting piece on crypto currencies')
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})

In [None]:
!python -m spacy download en_core_web_lg

In [None]:
# Load the en_core_web_lg embeddings

nlp = spacy.load('en_core_web_lg')


In [None]:
# View vector representation for the word 'banana'

print(nlp.vocab[u'banana'].vector)

In [None]:
# Word embedding math: "man" - "woman" + "queen" should land near "king"

from scipy import spatial

cosine_similarity = lambda x, y: 1 - spatial.distance.cosine(x, y)

man = nlp.vocab[u'man'].vector
woman = nlp.vocab[u'woman'].vector
queen = nlp.vocab[u'queen'].vector
king = nlp.vocab[u'king'].vector

# We now need to find the closest vector in the vocabulary to the result of "man" - "woman" + "queen"
maybe_king = man - woman + queen
computed_similarities = []

for word in nlp.vocab:
    # Ignore words without vectors
    if not word.has_vector:
        continue

    similarity = cosine_similarity(maybe_king, word.vector)
    computed_similarities.append((word, similarity))

computed_similarities = sorted(computed_similarities, key=lambda item: -item[1])
print([w[0].text for w in computed_similarities[:10]])

# ['Queen', 'QUEEN', 'queen', 'King', 'KING', 'king', 'KIng', 'KINGS', 'kings', 'Kings']
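
The analogy arithmetic can be checked by hand on a tiny example that needs no model download. The sketch below uses made-up 2-dimensional vectors whose axes stand for (royalty, gender); these are not real spaCy embeddings, and `toy_vectors` is a hypothetical name.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Hypothetical 2-d "embeddings": (royalty, gender). Not real spaCy vectors.
toy_vectors = {
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
}

# Same arithmetic as in the cell above: man - woman + queen.
maybe_king = tuple(
    m - w + q
    for m, w, q in zip(toy_vectors["man"], toy_vectors["woman"], toy_vectors["queen"])
)

# Rank the whole toy vocabulary by similarity to the computed vector.
ranked = sorted(
    toy_vectors,
    key=lambda word: cosine_similarity(maybe_king, toy_vectors[word]),
    reverse=True,
)
print(maybe_king, ranked)
```

With real embeddings the input words often rank near the top as well, which is why the result listed above starts with variants of "queen" before reaching "king".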


In [None]:
# Computing similarity

banana = nlp.vocab[u'banana']
dog = nlp.vocab[u'dog']
fruit = nlp.vocab[u'fruit']
animal = nlp.vocab[u'animal']

print(dog.similarity(animal), dog.similarity(fruit)) # 0.6618534 0.23552845
print(banana.similarity(fruit), banana.similarity(animal)) # 0.67148364 0.2427285



In [None]:
# Computing Similarity on entire texts

target = nlp(u"Cats are beautiful animals.")

doc1 = nlp(u"Dogs are awesome.")
doc2 = nlp(u"Some gorgeous creatures are felines.")
doc3 = nlp(u"Dolphins are swimming mammals.")

print(target.similarity(doc1))  # 0.8901765218466683
print(target.similarity(doc2))  # 0.9115828449161616
print(target.similarity(doc3))  # 0.7822956752876101
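
For models that ship word vectors, spaCy's default document similarity is the cosine between averaged token vectors. The sketch below reproduces that idea on made-up 2-dimensional vectors; `TOY_VECTORS`, `doc_vector` and `doc_similarity` are hypothetical names, whitespace tokenization is a simplification, and each text is assumed to contain at least one known word.

```python
import math

# Hypothetical 2-d word vectors (not taken from any spaCy model).
TOY_VECTORS = {
    "cats": (1.0, 0.0), "dogs": (0.9, 0.1), "animals": (0.8, 0.2),
    "cars": (0.0, 1.0), "machines": (0.1, 0.9), "are": (0.5, 0.5),
}

def doc_vector(text):
    """Average the vectors of the known words in `text`."""
    vectors = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    return tuple(sum(dimension) / len(vectors) for dimension in zip(*vectors))

def cosine(x, y):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def doc_similarity(text_a, text_b):
    """Cosine similarity between the averaged vectors of two texts."""
    return cosine(doc_vector(text_a), doc_vector(text_b))

print(doc_similarity("cats are animals", "dogs are animals"))
print(doc_similarity("cats are animals", "cars are machines"))
```

Averaging ignores word order and lets frequent words such as "are" pull every document toward the same region, which helps explain why even loosely related sentences score high in the cell above.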