# Text processing

## Tokenization

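Several tokenizers are available. As the timings below show, spaCy is much faster than the other implementations (Moses, NLTK): about 5 ms versus 13.2 s for Moses on the French sample, and about 1 ms versus 7 ms for NLTK on the English one. It also often returns better results.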

In [1]:
from nautilus_nlp.preprocessing.tokenizer import tokenize, untokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
fr_txt = "Ceci est un texte français, j'adore 1 !"
eng_txt = "Let's play together!"

In [3]:
str_ = """Les moteurs de recherche tels Google, Exalead ou Yahoo! sont des applications très connues de fouille de textes sur de grandes masses de données. Cependant, les moteurs de recherche ne se basent pas uniquement sur le texte pour l'indexer, mais également sur la façon dont les pages sont mises en valeur les unes par rapport aux autres. L'algorithme utilisé par Google est PageRank, et il est courant de voir HITS dans le milieu académique"""

### French spaCy

In [4]:
%%time

tokenized_fr_txt = tokenize(str_, lang_module="fr_spacy")

CPU times: user 5.07 ms, sys: 130 µs, total: 5.2 ms
Wall time: 5.03 ms


In [5]:
tokenized_fr_txt = tokenize(str_, lang_module="fr_spacy")

In [6]:
print(tokenized_fr_txt[:10])

['Les', 'moteurs', 'de', 'recherche', 'tels', 'Google', ',', 'Exalead', 'ou', 'Yahoo']


### French Moses

In [7]:
%%time
tokenized_fr_txt = tokenize(str_, lang_module="fr_moses")

CPU times: user 13.2 s, sys: 6.59 ms, total: 13.2 s
Wall time: 13.2 s


In [8]:
print(tokenized_fr_txt[:10])

['Les', 'moteurs', 'de', 'recherche', 'tels', 'Google', ',', 'Exalead', 'ou', 'Yahoo']


### English spaCy

In [9]:
%%time
tokenized_eng_txt = tokenize(eng_txt, lang_module="en_spacy")

CPU times: user 990 µs, sys: 0 ns, total: 990 µs
Wall time: 998 µs


In [10]:
tokenized_eng_txt = tokenize(eng_txt, lang_module="en_spacy")

In [11]:
tokenized_eng_txt

['Let', "'s", 'play', 'together', '!']

### English NLTK 

In [12]:
%%time
tokenized_eng_txt = tokenize(eng_txt, lang_module="en_nltk")
tokenized_eng_txt

CPU times: user 7.94 ms, sys: 122 µs, total: 8.06 ms
Wall time: 6.95 ms
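
The `%%time` figures above come from single runs, which can be noisy or dominated by one-off costs. Below is a minimal sketch of a repeated-measurement comparison using the standard `timeit` module. The two tokenizers (`whitespace_tokenize`, `regex_tokenize`) and the helper `best_time` are hypothetical stand-ins written for this example; they are not the `nautilus_nlp` backends, and any callable that takes a string could be passed instead.

```python
import re
import timeit

# Hypothetical stand-in tokenizers, used only to have something to time.
def whitespace_tokenize(text):
    """Split on runs of whitespace."""
    return text.split()

_TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def regex_tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    return _TOKEN_RE.findall(text)

def best_time(func, text, repeat=5, number=1000):
    """Return the best per-call time in seconds over `repeat` batches."""
    timer = timeit.Timer(lambda: func(text))
    return min(timer.repeat(repeat=repeat, number=number)) / number

sample = "Let's play together!"
for tokenizer in (whitespace_tokenize, regex_tokenize):
    seconds = best_time(tokenizer, sample)
    print(f"{tokenizer.__name__}: {seconds * 1e6:.2f} µs per call")
```

Taking the minimum over several batches reduces the influence of background noise on the comparison.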


## Un-tokenization

In [13]:
%%time
untokenize(tokenized_eng_txt, lang='en')

CPU times: user 1.46 s, sys: 35 µs, total: 1.46 s
Wall time: 1.46 s


"Let's play together!"

In [14]:
%%time
untokenize(tokenized_fr_txt, lang='fr')

CPU times: user 3 s, sys: 351 µs, total: 3 s
Wall time: 3 s


"Les moteurs de recherche tels Google, Exalead ou Yahoo ! sont des applications très connues de fouille de textes sur de grandes masses de données. Cependant, les moteurs de recherche ne se basent pas uniquement sur le texte pour l' indexer, mais également sur la façon dont les pages sont mises en valeur les unes par rapport aux autres. L' algorithme utilisé par Google est PageRank, et il est courant de voir HITS dans le milieu académique"
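
The French output above keeps a space after elided forms ("l' indexer", "L' algorithme"), which shows that un-tokenization is rule-based and not always a perfect inverse of tokenization. The following is a minimal sketch of such rules, assuming tokens shaped like the ones shown in this notebook. The function `naive_untokenize` is hypothetical and is not how `untokenize` is implemented.

```python
import re

def naive_untokenize(tokens):
    """Join tokens with spaces, then re-attach punctuation and clitics."""
    text = " ".join(tokens)
    # Attach closing punctuation to the preceding word: "together !" -> "together!"
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)
    # Attach clitics that start with an apostrophe: "Let 's" -> "Let's"
    text = re.sub(r"\s+('\w+)", r"\1", text)
    # Attach elided forms that end with an apostrophe: "l' indexer" -> "l'indexer"
    text = re.sub(r"(\w')\s+", r"\1", text)
    return text

print(naive_untokenize(["Let", "'s", "play", "together", "!"]))
# prints: Let's play together!
print(naive_untokenize(["L'", "algorithme", "utilisé", ",", "et", "l'", "indexer"]))
# prints: L'algorithme utilisé, et l'indexer
```

These rules are deliberately simple: they would also glue an opening single quote to the previous word, and they remove the space before "!" that the French output above preserves ("Yahoo !").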

# Stemming 

In [15]:
from nautilus_nlp.preprocessing.stemming import stem_tokens

In [16]:
stem_tokens(['I','survived','these', 'dogs'], lang='english')

['i', 'surviv', 'these', 'dog']

In [17]:
stem_tokens(['je', 'mangerai', 'dans', 'les', 'cuisines', 'du', 'château'], lang='french')

['je', 'mang', 'dan', 'le', 'cuisin', 'du', 'château']
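
The outputs above show that a stem is not necessarily a dictionary word ('surviv', 'dan'): stemming truncates suffixes by rule without looking words up. Here is a toy sketch of suffix stripping to illustrate the idea. The function `toy_stem`, its suffix list, and the minimum stem length are assumptions made for this example; real stemmers use far more elaborate rule sets, and this is not the algorithm behind `stem_tokens`.

```python
def toy_stem(token, suffixes=("ing", "ed", "s"), min_stem_len=3):
    """Lowercase the token and strip the first matching suffix, if any."""
    word = token.lower()
    for suffix in suffixes:
        # Only strip when enough of the word remains to form a stem.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

def toy_stem_tokens(tokens):
    """Apply toy_stem to every token in the list."""
    return [toy_stem(token) for token in tokens]

print(toy_stem_tokens(["I", "survived", "these", "dogs"]))
# prints: ['i', 'surviv', 'these', 'dog']
```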

# Lemmatization

## French 

In [18]:
from nautilus_nlp.preprocessing.lemmatization import lemmatize_french_tokens

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [19]:
txt_to_tokenize=['Ceci', 'est', 'un', 'texte', 'français', ',', "j'", 'adore', 'tes', 'frites', 'bien', 'grasses', 'YOLO', '!']
print(txt_to_tokenize)

['Ceci', 'est', 'un', 'texte', 'français', ',', "j'", 'adore', 'tes', 'frites', 'bien', 'grasses', 'YOLO', '!']


In [20]:
%%time
lemmatized_tokens = lemmatize_french_tokens(txt_to_tokenize, module='spacy')

CPU times: user 17.3 ms, sys: 0 ns, total: 17.3 ms
Wall time: 15.8 ms


In [21]:
lemmatized_tokens = lemmatize_french_tokens(txt_to_tokenize, module='spacy')

In [22]:
print(lemmatized_tokens)

['ceci', 'être', 'un', 'texte', 'français', ',', 'j', "'", 'adorer', 'ton', 'frit', 'bien', 'gras', 'yolo', '!']


## English

In [23]:
from nautilus_nlp.preprocessing.lemmatization import lemmatize_english_tokens

In [24]:
to_lemmatize = ['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best']

In [25]:
%%time
# spaCy backend: in the output below, lemmas are lowercased and the pronoun becomes '-PRON-'
lemmatize_english_tokens(to_lemmatize, module='spacy')

CPU times: user 13.7 ms, sys: 0 ns, total: 13.7 ms
Wall time: 12.5 ms


['the', 'strip', 'bat', 'be', 'hang', 'on', '-PRON-', 'foot', 'for', 'good']

In [26]:
%%time
# NLTK backend: in the output below, case is preserved and 'their' and 'best' are left unchanged
lemmatize_english_tokens(to_lemmatize, module='nltk')

CPU times: user 1.72 s, sys: 125 ms, total: 1.85 s
Wall time: 1.82 s


['The', 'strip', 'bat', 'be', 'hang', 'on', 'their', 'foot', 'for', 'best']

# Remove stop words

In [27]:
from nautilus_nlp.preprocessing.preprocess import get_stopwords, remove_stopwords

In [28]:
FRENCH_SW = get_stopwords('fr')

In [29]:
text = "J'ai un beau cheval"

In [30]:
remove_stopwords(text, FRENCH_SW)

["J'ai", 'cheval']

In [31]:
# Tokenize first so that elided forms such as "J'ai" are split before filtering
remove_stopwords(tokenize(text, lang_module="fr_spacy"), FRENCH_SW)

["J'", 'cheval']
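
The two results above differ because stop words are matched against whole tokens: in the raw string "J'ai" stays as one token, while the spaCy tokens separate "J'" from "ai". A minimal sketch of this behaviour follows, assuming a raw string is split on whitespace. The function `remove_stopwords_sketch` and the tiny `SAMPLE_SW` set are hypothetical and stand in for `remove_stopwords` and the real French list.

```python
def remove_stopwords_sketch(text_or_tokens, stopwords):
    """Drop tokens found in `stopwords` (exact, case-sensitive match)."""
    if isinstance(text_or_tokens, str):
        # Assumption: a raw string is split on whitespace only.
        tokens = text_or_tokens.split()
    else:
        tokens = list(text_or_tokens)
    return [token for token in tokens if token not in stopwords]

# Hypothetical stop word set, just large enough for this example.
SAMPLE_SW = {"ai", "un", "beau"}

print(remove_stopwords_sketch("J'ai un beau cheval", SAMPLE_SW))
# prints: ["J'ai", 'cheval']
print(remove_stopwords_sketch(["J'", "ai", "un", "beau", "cheval"], SAMPLE_SW))
# prints: ["J'", 'cheval']
```

Tokenizing before filtering lets the filter see "ai" on its own, which is why the second call removes it.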