### This document walks through the basic steps of preprocessing an English text. These steps are needed to convert human-language text into a machine-readable format for further processing.

In [3]:
import nltk
# nltk.download()  # uncomment on first run to open the NLTK downloader and fetch the required data

**Step 1:** Converting the text to **lowercase**

In [4]:
text_src = """Perhaps one of the most significant advances made by Arabic mathematics began at this time with the work of al-Khwarizmi, namely the beginnings of algebra. It is important to understand just how significant this new idea was. It was a revolutionary move away from the Greek concept of mathematics which was essentially geometry. Algebra was a unifying theory which allowed rational numbers, irrational numbers, geometrical magnitudes, etc., to all be treated as "algebraic objects" . It gave mathematics a whole new development path so much broader in concept to that which had existed before, and provided a vehicle for future development of the subject. Another important aspect of the introduction of algebraic ideas was that it allowed mathematics to be applied to itself in a way which had not happened before."""
text_lower = text_src.lower()
print(text_lower)

perhaps one of the most significant advances made by arabic mathematics began at this time with the work of al-khwarizmi, namely the beginnings of algebra. it is important to understand just how significant this new idea was. it was a revolutionary move away from the greek concept of mathematics which was essentially geometry. algebra was a unifying theory which allowed rational numbers, irrational numbers, geometrical magnitudes, etc., to all be treated as "algebraic objects" . it gave mathematics a whole new development path so much broader in concept to that which had existed before, and provided a vehicle for future development of the subject. another important aspect of the introduction of algebraic ideas was that it allowed mathematics to be applied to itself in a way which had not happened before.
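To see why lowercasing matters, the sketch below counts a word in two fragments of the sample text before and after calling `lower()`. The fragment string and the helper name `count_word` are chosen here for illustration only.

```python
# Two fragments taken from the sample text; "algebra" appears once in each,
# with different capitalization.
fragments = "the beginnings of algebra. Algebra was a unifying theory"

def count_word(text, word):
    """Count exact, case-sensitive matches of `word` among whitespace-separated tokens."""
    return sum(1 for token in text.replace(".", " ").split() if token == word)

before = count_word(fragments, "algebra")         # the capitalized form is missed
after = count_word(fragments.lower(), "algebra")  # both forms now match
print(before, after)
```

Without lowercasing, later steps such as stopword filtering or frequency counting would treat `Algebra` and `algebra` as two different words.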


**Step 2:** Removing punctuation marks with `str.translate` and `string.punctuation`. Note that this call strips ASCII punctuation only: it does not use **regular expressions**, and accents and other diacritics are left untouched.

In [5]:
import string
text_result = text_lower.translate(str.maketrans('', '', string.punctuation))
print(text_result)

perhaps one of the most significant advances made by arabic mathematics began at this time with the work of alkhwarizmi namely the beginnings of algebra it is important to understand just how significant this new idea was it was a revolutionary move away from the greek concept of mathematics which was essentially geometry algebra was a unifying theory which allowed rational numbers irrational numbers geometrical magnitudes etc to all be treated as algebraic objects  it gave mathematics a whole new development path so much broader in concept to that which had existed before and provided a vehicle for future development of the subject another important aspect of the introduction of algebraic ideas was that it allowed mathematics to be applied to itself in a way which had not happened before
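If regular expressions and diacritic removal are actually wanted, a minimal sketch using only the standard library could look like the following. The helper names `remove_punctuation` and `strip_diacritics` are hypothetical, and the accented sample string is not part of the notebook's text.

```python
import re
import unicodedata

def remove_punctuation(text):
    """Delete every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", text)

def strip_diacritics(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(remove_punctuation('the work of al-khwarizmi, namely "algebraic objects" .'))
print(strip_diacritics("café naïve"))
```

One difference from the `string.punctuation` approach: `\w` matches the underscore, so this regular expression keeps `_` while `str.translate` with `string.punctuation` removes it.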


**Step 3: Tokenization** splits the text into individual words

In [6]:
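from nltk import word_tokenize
# Note: the token list is stored under the name `word_tokenize`, which shadows
# the imported function; later cells use `word_tokenize` as the list of tokens.
word_tokenize = word_tokenize(text_result)
print(word_tokenize)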

['perhaps', 'one', 'of', 'the', 'most', 'significant', 'advances', 'made', 'by', 'arabic', 'mathematics', 'began', 'at', 'this', 'time', 'with', 'the', 'work', 'of', 'alkhwarizmi', 'namely', 'the', 'beginnings', 'of', 'algebra', 'it', 'is', 'important', 'to', 'understand', 'just', 'how', 'significant', 'this', 'new', 'idea', 'was', 'it', 'was', 'a', 'revolutionary', 'move', 'away', 'from', 'the', 'greek', 'concept', 'of', 'mathematics', 'which', 'was', 'essentially', 'geometry', 'algebra', 'was', 'a', 'unifying', 'theory', 'which', 'allowed', 'rational', 'numbers', 'irrational', 'numbers', 'geometrical', 'magnitudes', 'etc', 'to', 'all', 'be', 'treated', 'as', 'algebraic', 'objects', 'it', 'gave', 'mathematics', 'a', 'whole', 'new', 'development', 'path', 'so', 'much', 'broader', 'in', 'concept', 'to', 'that', 'which', 'had', 'existed', 'before', 'and', 'provided', 'a', 'vehicle', 'for', 'future', 'development', 'of', 'the', 'subject', 'another', 'important', 'aspect', 'of', 'the', 'in
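For intuition, tokenization of already-cleaned text can be approximated with a regular expression. This is a simplified stand-in written for illustration, not the algorithm used by NLTK's `word_tokenize`; the function name `simple_tokenize` is hypothetical.

```python
import re

def simple_tokenize(text):
    """Return maximal runs of letters, digits and underscores."""
    return re.findall(r"\w+", text)

tokens = simple_tokenize("it is important to understand just how significant this new idea was")
print(tokens)
```

On text that still contains punctuation the two approaches differ: this sketch would split `al-khwarizmi` into `al` and `khwarizmi`, which is one reason the order of the preprocessing steps matters.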

**Step 4: Stopword filtering** removes the most common words of the language, which carry little meaning for the text, such as 'the', 'is', 'for', 'a' ...

In [7]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
#print(stop_words)

filtered_words = [i for i in word_tokenize if i not in stop_words]
print(filtered_words)

['perhaps', 'one', 'significant', 'advances', 'made', 'arabic', 'mathematics', 'began', 'time', 'work', 'alkhwarizmi', 'namely', 'beginnings', 'algebra', 'important', 'understand', 'significant', 'new', 'idea', 'revolutionary', 'move', 'away', 'greek', 'concept', 'mathematics', 'essentially', 'geometry', 'algebra', 'unifying', 'theory', 'allowed', 'rational', 'numbers', 'irrational', 'numbers', 'geometrical', 'magnitudes', 'etc', 'treated', 'algebraic', 'objects', 'gave', 'mathematics', 'whole', 'new', 'development', 'path', 'much', 'broader', 'concept', 'existed', 'provided', 'vehicle', 'future', 'development', 'subject', 'another', 'important', 'aspect', 'introduction', 'algebraic', 'ideas', 'allowed', 'mathematics', 'applied', 'way', 'happened']
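The same filtering idea can be sketched without downloading the corpus. The stopword set below is a small hand-picked subset chosen for illustration (an assumption, not the full NLTK list), and it is stored as a `set` because membership tests on a set take constant time on average.

```python
# Hypothetical, hand-picked stopword set for illustration; the notebook uses
# the full English list from nltk.corpus.stopwords instead.
sample_stop_words = {"of", "the", "it", "is", "to", "was", "a", "this", "how", "just"}

sample_tokens = ["it", "is", "important", "to", "understand", "just", "how",
                 "significant", "this", "new", "idea", "was"]

# Keep only the tokens that are not in the stopword set.
kept = [token for token in sample_tokens if token not in sample_stop_words]
print(kept)
```

For long documents, wrapping the NLTK list in `set(...)` before filtering is a common way to speed up the list comprehension.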


**Step 5:** Handling sparse terms and particular words by normalizing word forms

- **Stemming** reduces words to their stem, base, or root form

In [8]:
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in filtered_words]
print(stemmed)

['perhap', 'one', 'signific', 'advanc', 'made', 'arab', 'mathemat', 'began', 'time', 'work', 'alkhwarizmi', 'name', 'begin', 'algebra', 'import', 'understand', 'signific', 'new', 'idea', 'revolutionari', 'move', 'away', 'greek', 'concept', 'mathemat', 'essenti', 'geometri', 'algebra', 'unifi', 'theori', 'allow', 'ration', 'number', 'irrat', 'number', 'geometr', 'magnitud', 'etc', 'treat', 'algebra', 'object', 'gave', 'mathemat', 'whole', 'new', 'develop', 'path', 'much', 'broader', 'concept', 'exist', 'provid', 'vehicl', 'futur', 'develop', 'subject', 'anoth', 'import', 'aspect', 'introduct', 'algebra', 'idea', 'allow', 'mathemat', 'appli', 'way', 'happen']
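To make the effect of stemming visible, the sketch below groups a few words by their stem. The (word, stem) pairs are copied from the outputs of the stopword-filtering and stemming cells; the grouping code itself is an illustration and not part of the original notebook.

```python
from collections import defaultdict

# (word, stem) pairs copied from the notebook's outputs.
pairs = [
    ("algebra", "algebra"), ("algebraic", "algebra"),
    ("significant", "signific"), ("mathematics", "mathemat"),
    ("geometry", "geometri"), ("geometrical", "geometr"),
]

# Group the original words under the stem they were reduced to.
groups = defaultdict(list)
for word, stem in pairs:
    groups[stem].append(word)

for stem, words in groups.items():
    print(stem, "<-", words)
```

The grouping shows both the benefit and a limit of the Porter stemmer on this text: `algebra` and `algebraic` collapse to one stem, while `geometry` and `geometrical` end up with two different stems.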


- **Lemmatization** uses lexical knowledge bases to obtain the correct base forms of words

In [9]:
from nltk.stem import WordNetLemmatizer
lemmatizer=WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(word) for word in filtered_words]
print(lemmatized)

['perhaps', 'one', 'significant', 'advance', 'made', 'arabic', 'mathematics', 'began', 'time', 'work', 'alkhwarizmi', 'namely', 'beginning', 'algebra', 'important', 'understand', 'significant', 'new', 'idea', 'revolutionary', 'move', 'away', 'greek', 'concept', 'mathematics', 'essentially', 'geometry', 'algebra', 'unifying', 'theory', 'allowed', 'rational', 'number', 'irrational', 'number', 'geometrical', 'magnitude', 'etc', 'treated', 'algebraic', 'object', 'gave', 'mathematics', 'whole', 'new', 'development', 'path', 'much', 'broader', 'concept', 'existed', 'provided', 'vehicle', 'future', 'development', 'subject', 'another', 'important', 'aspect', 'introduction', 'algebraic', 'idea', 'allowed', 'mathematics', 'applied', 'way', 'happened']
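A side-by-side view helps compare the two techniques. The triples below are copied from the stemming and lemmatization outputs of this notebook; the table-printing code is only a sketch added for illustration.

```python
# (word, Porter stem, WordNet lemma) triples copied from the notebook's outputs.
rows = [
    ("advances", "advanc", "advance"),
    ("beginnings", "begin", "beginning"),
    ("allowed", "allow", "allowed"),
    ("numbers", "number", "number"),
    ("mathematics", "mathemat", "mathematics"),
]

# Words that the lemmatizer returned unchanged.
unchanged_by_lemmatizer = [word for word, _, lemma in rows if word == lemma]

print(f"{'word':<12}{'stem':<10}{'lemma'}")
for word, stem, lemma in rows:
    print(f"{word:<12}{stem:<10}{lemma}")
print(unchanged_by_lemmatizer)
```

Lemmas are real words, whereas stems such as `advanc` are not. Verb forms like `allowed` stay unchanged here because `lemmatize` is called without a part-of-speech argument and then treats each word as a noun; passing `pos="v"` for verbs should reduce it to `allow`.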


**Step 6: Part-of-speech tagging** converts a sentence into a list of tuples, where each tuple has the form (word, tag). The tag indicates whether the word is a noun, an adjective, a verb, etc.

In [10]:
from nltk import pos_tag
pos = pos_tag(filtered_words)
print(pos)

nltk.help.upenn_tagset("NNP")

[('perhaps', 'RB'), ('one', 'CD'), ('significant', 'JJ'), ('advances', 'NNS'), ('made', 'VBN'), ('arabic', 'JJ'), ('mathematics', 'NNS'), ('began', 'VBD'), ('time', 'NN'), ('work', 'NN'), ('alkhwarizmi', 'RB'), ('namely', 'RB'), ('beginnings', 'JJ'), ('algebra', 'NN'), ('important', 'JJ'), ('understand', 'NN'), ('significant', 'JJ'), ('new', 'JJ'), ('idea', 'NN'), ('revolutionary', 'JJ'), ('move', 'VB'), ('away', 'RB'), ('greek', 'JJ'), ('concept', 'NN'), ('mathematics', 'NNS'), ('essentially', 'RB'), ('geometry', 'VBP'), ('algebra', 'JJ'), ('unifying', 'VBG'), ('theory', 'NN'), ('allowed', 'VBN'), ('rational', 'JJ'), ('numbers', 'NNS'), ('irrational', 'JJ'), ('numbers', 'NNS'), ('geometrical', 'JJ'), ('magnitudes', 'NNS'), ('etc', 'VBP'), ('treated', 'VBN'), ('algebraic', 'JJ'), ('objects', 'NNS'), ('gave', 'VBD'), ('mathematics', 'NNS'), ('whole', 'JJ'), ('new', 'JJ'), ('development', 'NN'), ('path', 'NN'), ('much', 'JJ'), ('broader', 'JJR'), ('concept', 'NN'), ('existed', 'VBD'), ('
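The tagger is applied here to the stopword-filtered list, whereas the chunking step of this notebook tags the full token list. The sketch below compares a few tags copied from those two outputs; the comparison code is an illustration added here, not part of the original notebook.

```python
# Tags copied from this notebook's outputs: the first value comes from tagging
# the stopword-filtered list, the second from tagging the full token list
# (as printed in the chunking output).
tags = {
    "alkhwarizmi": ("RB", "NN"),
    "beginnings": ("JJ", "NNS"),
    "understand": ("NN", "VB"),
    "significant": ("JJ", "JJ"),
    "idea": ("NN", "NN"),
}

# Keep only the words whose tag differs between the two runs.
changed = {word: pair for word, pair in tags.items() if pair[0] != pair[1]}
for word, (filtered_tag, full_tag) in changed.items():
    print(f"{word}: {filtered_tag} (filtered) vs {full_tag} (full token list)")
```

A plausible reading is that the tagger relies on the neighbouring words, so removing stopwords before tagging takes away context it needs. If tag quality matters, tagging the unfiltered tokens and filtering afterwards may be the safer order.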

**Step 7: Chunking** (optional) is a natural language processing step that identifies the constituent parts of sentences (nouns, verbs, adjectives, etc.) and groups them into higher-order units with discrete grammatical meanings (noun groups or phrases, verb groups, etc.).

We define this with a single regular-expression rule. It states that whenever the chunker finds an optional determiner (DT) followed by any number of adjectives (JJ) and then a noun (NN), a noun phrase (NP) chunk should be formed.
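The rule can be tried on a small hand-tagged fragment, which avoids running the tagger. The tags below are copied from the chunking output of this notebook; the variable names `tagged`, `parser`, and `noun_phrases` are chosen here for illustration.

```python
import nltk

# Hand-tagged fragment; the tags are copied from the chunking output of this notebook.
tagged = [("just", "RB"), ("how", "WRB"), ("significant", "JJ"),
          ("this", "DT"), ("new", "JJ"), ("idea", "NN"), ("was", "VBD")]

# One rule: optional determiner, any number of adjectives, then a singular noun.
parser = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = parser.parse(tagged)

# Collect the words of every NP chunk found in the tree.
noun_phrases = [" ".join(word for word, tag in subtree.leaves())
                for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")]
print(noun_phrases)
```

Because the rule requires the tag `NN` exactly, plural nouns tagged `NNS` are not chunked, which is why phrases such as `rational numbers` stay outside any NP in the output.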

In [163]:
# Single rule: optional determiner, any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"

chunkParser = nltk.RegexpParser(grammar)
tree = chunkParser.parse(pos_tag(word_tokenize))

for subtree in tree.subtrees():
    print(subtree)

tree.draw()

(S
  perhaps/RB
  one/CD
  of/IN
  the/DT
  most/RBS
  significant/JJ
  advances/NNS
  made/VBN
  by/IN
  arabic/JJ
  mathematics/NNS
  began/VBD
  at/IN
  (NP this/DT time/NN)
  with/IN
  (NP the/DT work/NN)
  of/IN
  (NP alkhwarizmi/NN)
  namely/RB
  the/DT
  beginnings/NNS
  of/IN
  (NP algebra/NN)
  it/PRP
  is/VBZ
  important/JJ
  to/TO
  understand/VB
  just/RB
  how/WRB
  significant/JJ
  (NP this/DT new/JJ idea/NN)
  was/VBD
  it/PRP
  was/VBD
  (NP a/DT revolutionary/JJ move/NN)
  away/RB
  from/IN
  (NP the/DT greek/JJ concept/NN)
  of/IN
  mathematics/NNS
  which/WDT
  was/VBD
  essentially/RB
  (NP geometry/JJ algebra/NN)
  was/VBD
  (NP a/DT unifying/JJ theory/NN)
  which/WDT
  allowed/VBD
  rational/JJ
  numbers/NNS
  irrational/JJ
  numbers/NNS
  geometrical/JJ
  magnitudes/NNS
  etc/VBP
  to/TO
  all/DT
  be/VB
  treated/VBN
  as/IN
  algebraic/JJ
  objects/NNS
  it/PRP
  gave/VBD
  mathematics/NNS
  (NP a/DT whole/JJ new/JJ development/NN)
  (NP path/NN)
  so/RB
  much