<div style="font-size:1.5em">
    <p>📜 Table des matières:</p>
    <ul>
       <li><a href="#nltk">Text Preprocessing with NLTK</a></li>  
       <li>
          <a href="#spacy">Text Preprocessing with spaCy</a>
       </li>
       <li>
          <a href="#comp">Comparing results after spaCy's NER</a>
       </li>
    </ul>
</div>

<h1 id ="nltk">Text Preprocessing with NLTK</h1>

In [26]:
#import libraries
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import WhitespaceTokenizer , TreebankWordTokenizer,WordPunctTokenizer
from nltk.corpus import stopwords
import string
import contractions
from spellchecker import SpellChecker
import re
from french_lefff_lemmatizer.french_lefff_lemmatizer import FrenchLefffLemmatizer
import unicodedata
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('words')
nltk.download('wordnet')

def remove_urls(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)

def remove_accented_chars(text):
    new_text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return new_text

def text_preprocessing(text,lang,accented=True,tokenizer=WhitespaceTokenizer(),stopw=True,punctuation=True,lowercase=True,lemmatize=True,spelling=True,expand_contraction=True,urls=True):
    if lang.lower()=='english':
        stopword =stopwords.words('english')
        lemmatizer = WordNetLemmatizer()
        spell = SpellChecker()
    else :#if lang="french"
        stopword = stopwords.words('french')
        lemmatizer = FrenchLefffLemmatizer()
        spell = SpellChecker(language="fr")
    if lowercase:
        #lowercase the text 
        text = text.lower()
    if urls:
        #remove urls
        text=remove_urls(text)
    #tokenize the text 
    tokens = tokenizer.tokenize(text)
    if expand_contraction:
        #expand contractions
        tokens = [contractions.fix(token) for token in tokens]
    if punctuation:
        #remove punctuation
        tokens = [token for token in tokens if token not in string.punctuation]
    if stopw:
        #remove stopwords
        tokens = [token for token in tokens if token not in stopword]
    if accented:
        tokens = [remove_accented_chars(token) for token in tokens]
    if spelling:
        #spell check:
        tokens = [spell.correction(token) for token in tokens]
    if lemmatize:
        #lemmatization : 
        tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return ' '.join(word for word in tokens)

    #Some tests:
file = 'C:/Users/PC2/Downloads/test-ex.txt'
f=open(file,'r')
data = f.read()
print(text_preprocessing(data,lang='french'))


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\PC2\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\PC2\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\PC2\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\PC2\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


gisevciconia zero solution objet ordre virement Monsieur débit compte :_rib: 022 222 222 172 22 29576475 53 set agilisys zero solution swift agilamcxxx virer montant 89 250,00 dirais (quatre-vingt-neuf mille deux cent cinquante dirais de profit i parler export switzerland ian chez 0483 555 555 0266600 0 swift creschzz80a motif 22a005 signature agilisys industrie sara siege social angle ne rue ibn aisha boulevard abdelkrim el khattabi millimètre pari 2TM* etage appât ne gueulez marrakech tu +212 (0) 999 99 6161 f +212 (0) 999 999 130 sara capital 500.000 000 des rcn35 33 if n 3939333 ton 3939 cnss:939 mt-103-stp 2838383393 et point pote trot cc cs con instance type and transmission de original receive from


<h1 id ="spacy">Text Preprocessing with spaCy</h1>

In [27]:

#Now we ll do preprocessing using mainly spacy
import spacy
#load only french and english models tokenizers
nlp_en = spacy.load("en_core_web_sm", disable=['parser', 'tagger', 'ner'])
nlp_fr = spacy.load("fr_core_news_sm", disable=['parser', 'tagger', 'ner'])

def spacy_preprocessing(text,lang,lowercase=True,stopw=True,punctuation=True,alphabetic=True,lemmatize=True,):
    if lang =="en":
        nlp = nlp_en
    else :
        nlp = nlp_fr
    if lowercase:
        text = text.lower()
    #tokenize with spacy's default tokenizer
    tokens = nlp(text)
    if stopw :
        tokens = [token for token in tokens if not token.is_stop]
    if lemmatize :
        tokens = [token.lemma_.strip() for token in tokens]
    if punctuation :
        tokens = [re.sub('<[^>]*>', '', token) for token in tokens]
    if alphabetic:
        tokens = [re.sub('[\W]+','',token.lower()) for token in tokens]
    return ' '.join(word for word in tokens)

file = 'C:/Users/PC2/Downloads/test-ex.txt'
f=open(file,'r')
data = f.read()
print(spacy_preprocessing(data,lang='french'))

 gisevciconia  aero solutions  objet  ordre virement  monsieur   débit compte  _ rib  022 222 222 172 22 29576475 53  swt  agilisy aero solution  swift  agilamcxx   virer montant  89 25000 dirham  quatrevingtneuf dirham     profit  carlex export switzerland  iban  ch35 0483 555 555 0266600 0  swift  creschzz80a  motif  22a005  signature   agilisys industrie sarl  siég social  angle numéro 2 rue ibn aicha bd abdelkrim el khattabi imm pari 2 degré   étage appt numéro 23 gueliz  marrakech  t  212  0  999 99 6161 l f  212  0  999 999 130   sarl capital 500000 000 dhs  rcn degré 35 33  if numéro 3939333  tpn degré 3939  cnss939  mt103stp 2838383393  mt print   poet trot ccc cs econ instance type and transmission   original received from 


<h1 id ="comp">Comparing results after spaCy's NER</h1>

In [28]:
import spacy
nlp = spacy.load("fr_core_news_sm")
file = 'C:/Users/PC2/Downloads/test-ex.txt'
f=open(file,'r')
data = f.read()
doc = nlp(data)
for ent in doc.ents :
    print('{:<12}{:<10}{:<10}'.format(ent.text,ent.label_,spacy.explain(str(ent.label_))))


Messieurs   PER       Named person or family.
AGILAMCXXX
|ORG       Companies, agencies, institutions, etc.
Montant     LOC       Non-GPE locations, mountain ranges, bodies of water
Dirhams     MISC      Miscellaneous entities, e.g. events, nationalities, products or works of art
Mille deux cent cinquanteMISC      Miscellaneous entities, e.g. events, nationalities, products or works of art
Dirhams     MISC      Miscellaneous entities, e.g. events, nationalities, products or works of art
CH35        LOC       Non-GPE locations, mountain ranges, bodies of water
SWIFT       ORG       Companies, agencies, institutions, etc.
CRESCHZZ80A

 

MotifMISC      Miscellaneous entities, e.g. events, nationalities, products or works of art
INDUSTRIES SARLMISC      Miscellaneous entities, e.g. events, nationalities, products or works of art
Angle n°2 rue ibn aichaMISC      Miscellaneous entities, e.g. events, nationalities, products or works of art
Abdelkrim el khattabiLOC       Non-GPE locations, mo

In [29]:
nltk_preprocessed = text_preprocessing(data,'fr')
doc  = nlp(nltk_preprocessed)
for ent in doc.ents :
    print('{:<15}:{:<15}:{:<15}'.format(ent.text,ent.label_,spacy.explain(str(ent.label_))))

gisevciconia zero:PER            :Named person or family.
Monsieur       :PER            :Named person or family.
zero solution swift:MISC           :Miscellaneous entities, e.g. events, nationalities, products or works of art
switzerland ian:MISC           :Miscellaneous entities, e.g. events, nationalities, products or works of art
el khattabi    :ORG            :Companies, agencies, institutions, etc.
appât          :PER            :Named person or family.
gueulez marrakech:LOC            :Non-GPE locations, mountain ranges, bodies of water
rcn35          :LOC            :Non-GPE locations, mountain ranges, bodies of water
cnss:939 mt-103-stp 2838383393:MISC           :Miscellaneous entities, e.g. events, nationalities, products or works of art


In [30]:
spacy_preprocessed = spacy_preprocessing(data,'fr')
doc  = nlp(spacy_preprocessed)
for ent in doc.ents :
    print('{:<12}:{:<10}:{:<10}'.format(ent.text,ent.label_,spacy.explain(str(ent.label_))))

swt  agilisy aero solution  swift  agilamcxx   :MISC      :Miscellaneous entities, e.g. events, nationalities, products or works of art
dirham      :MISC      :Miscellaneous entities, e.g. events, nationalities, products or works of art
switzerland  iban  ch35:MISC      :Miscellaneous entities, e.g. events, nationalities, products or works of art
el khattabi imm:ORG       :Companies, agencies, institutions, etc.
dhs         :ORG       :Companies, agencies, institutions, etc.
