In [1]:
# https://textblob.readthedocs.io/en/dev/_modules/textblob/blob.html#Word.lemmatize

Things To Consider In Utilizing Lemmatization

1. It uses dictionary-based words. A lemma is the root or base form of a word, so lemmatization aims to return that base form rather than just stripping the word’s inflections.
2. It depends heavily on part of speech (POS) to find the base word. Without the part of speech specified, lemmatization might not perform well and you might not get the result that you’re looking for.
3. It is generally slower than stemming, but more powerful. Because lemmatization relies on dictionary lookups and part-of-speech information rather than simple suffix-stripping rules, it is considered slower than stemming. In return, its results are real dictionary words.
4. It is more accurate at finding the root word. Because lemmatization maps an inflected word to a dictionary entry, you have a better chance of getting an accurate output.
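
To make the contrast in the list above concrete, here is a minimal sketch that needs no corpora. The suffix rules and the tiny dictionary are hypothetical stand-ins, not what TextBlob or NLTK actually use; real stemmers and lemmatizers are far more elaborate.

```python
# Toy illustration only: hypothetical rules and a hypothetical dictionary.

# Suffixes checked in order by the toy stemmer.
SUFFIXES = ("ing", "es", "ed", "s")

# Miniature dictionary mapping inflected forms to their lemmas.
LEMMAS = {"codes": "code", "studies": "study", "geese": "goose"}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, whether or not a real word remains."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) > len(suffix) + 2:
            return lowered[: -len(suffix)]
    return lowered


def toy_lemmatize(word: str) -> str:
    """Look the word up; return it unchanged when the dictionary has no entry."""
    return LEMMAS.get(word.lower(), word)


for w in ["codes", "studies", "geese", "Python"]:
    print(f"{w:8} stem={toy_stem(w):8} lemma={toy_lemmatize(w)}")
```

The toy stemmer turns "codes" into "cod" and "studies" into "studi", neither of which is a dictionary word, while the lookup returns "code" and "study" and also handles the irregular plural "geese".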

In [76]:
!pip install textblob



In [2]:
from textblob import TextBlob

# 1 - lemmatization is different from stemming

In [74]:
example = TextBlob("According to the zen of Python, simple codes are better than complex codes.") # show more than one example
# with a longer text

# lemmatization
print('lemmatization:', example.words.lemmatize())
# stemming
print('stemming:', example.words.stem())

lemmatization: ['According', 'to', 'the', 'zen', 'of', 'Python', 'simple', 'code', 'are', 'better', 'than', 'complex', 'code']
stemming: ['accord', 'to', 'the', 'zen', 'of', 'python', 'simpl', 'code', 'are', 'better', 'than', 'complex', 'code']


# 2 - lemmatization depends on POS, with bad POS, it might not perform well

In [75]:
example = TextBlob("Simple codes are better than complex codes.")

CC coordinating conjunction
CD cardinal number
DT determiner
EX existential there (as in “there is”; think of it as “there exists”)
FW foreign word
IN preposition/subordinating conjunction
JJ adjective ‘big’
JJR adjective, comparative ‘bigger’
JJS adjective, superlative ‘biggest’
LS list marker ‘1)’
MD modal ‘could’, ‘will’
NN noun, singular ‘desk’
NNS noun, plural ‘desks’
NNP proper noun, singular ‘Harrison’
NNPS proper noun, plural ‘Americans’
PDT predeterminer ‘all’ in ‘all the kids’
POS possessive ending, as in ‘parent’s’
PRP personal pronoun ‘I’, ‘he’, ‘she’
PRP$ possessive pronoun ‘my’, ‘his’, ‘hers’
RB adverb ‘very’, ‘silently’
RBR adverb, comparative ‘better’
RBS adverb, superlative ‘best’
RP particle ‘up’ in ‘give up’
TO the word ‘to’, as in go ‘to’ the store
UH interjection ‘errrrrrrrm’
VB verb, base form ‘take’
VBD verb, past tense ‘took’
VBG verb, gerund/present participle ‘taking’
VBN verb, past participle ‘taken’
VBP verb, non-3rd person singular present ‘take’
VBZ verb, 3rd person singular present ‘takes’
WDT wh-determiner ‘which’
WP wh-pronoun ‘who’, ‘what’
WP$ possessive wh-pronoun ‘whose’
WRB wh-adverb ‘where’, ‘when’

In [46]:
sentence_example = TextBlob('This is a complete sentence')
sentence_example.words.lemmatize()

WordList(['This', 'is', 'a', 'complete', 'sentence'])

In [47]:
sentence_example.tags
# dt - determiner
# vbz - verb, 3rd person singular present
# jj - adjective
# nn - noun

[('This', 'DT'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('complete', 'JJ'),
 ('sentence', 'NN')]

In [55]:
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
print('lemma with pos tag:', wnl.lemmatize('is', 'v'))
print('lemma without pos tag:', wnl.lemmatize('is'))

lemma with pos tag: be
lemma without pos tag: is


In [41]:
example.tags 
# vbg - verb, gerund/ present participle
# to - to
# dt - determiner
# nn - noun
# nnp - proper noun
# jj - adjective
# nns - noun plural
# vbp - verb, non-3rd person singular present
# jjr - adjective comparative
# in - preposition / subordinating conjunction

[('According', 'VBG'),
 ('to', 'TO'),
 ('the', 'DT'),
 ('zen', 'NN'),
 ('of', 'IN'),
 ('Python', 'NNP'),
 ('simple', 'JJ'),
 ('codes', 'NNS'),
 ('are', 'VBP'),
 ('better', 'JJR'),
 ('than', 'IN'),
 ('complex', 'JJ'),
 ('codes', 'NNS')]
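
The tags shown above are Penn Treebank tags, while the `pos` argument used earlier with `wnl.lemmatize('is', 'v')` is a single WordNet letter. One common way to bridge the two is a small mapping helper. The sketch below is an assumption about how you might write it: `penn_to_wordnet` is a hypothetical name, and falling back to noun for every other tag is a simplification.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS letter: 'a', 'v', 'r' or 'n'."""
    if tag.startswith("JJ"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("VB"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return "n"      # nouns and everything else fall back to noun


# A few (word, tag) pairs copied from the tagger output above.
tagged = [("simple", "JJ"), ("codes", "NNS"), ("are", "VBP"), ("better", "JJR")]
for word, tag in tagged:
    print(word, tag, "->", penn_to_wordnet(tag))
```

The returned letter could then be passed as the second argument to `wnl.lemmatize(word, pos)`, so that a verb such as "are" is looked up as a verb rather than as a noun.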

# 3 - Lemmatization is slower than stemming

In [72]:
import time

start_lemma = time.time()
example.words.lemmatize()
end_lemma = time.time()
elapsed_lemma = end_lemma - start_lemma
print('Elapsed time:', elapsed_lemma)

Elapsed time: 0.0005240440368652344


In [71]:
start_stem = time.time()
example.words.stem()
end_stem = time.time()
elapsed_stem = end_stem - start_stem
print('Elapsed time:', elapsed_stem)

Elapsed time: 0.0008702278137207031


In [73]:
elapsed_lemma > elapsed_stem

False
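
Note that the single run above reports `False`: in that measurement, lemmatization was not slower than stemming. A one-off `time.time()` difference over a 13-word sentence is dominated by noise, so a repeated measurement is more trustworthy. The sketch below uses the standard-library `timeit` module; `best_time`, `fake_stem` and `fake_lemmatize` are hypothetical stand-ins so the block runs without any corpora.

```python
import timeit


def best_time(func, repeat=5, number=1000):
    """Return the best per-call time in seconds over several repeated runs."""
    totals = timeit.repeat(func, repeat=repeat, number=number)
    return min(totals) / number


# Stand-in workloads (assumptions, not the real TextBlob calls).
words = "simple codes are better than complex codes".split()
lookup = {"codes": "code"}


def fake_stem():
    # Crude suffix stripping, standing in for a stemmer.
    return [w[:-1] if w.endswith("s") else w for w in words]


def fake_lemmatize():
    # Dictionary lookup, standing in for a lemmatizer.
    return [lookup.get(w, w) for w in words]


stem_time = best_time(fake_stem)
lemma_time = best_time(fake_lemmatize)
print(f"stem:  {stem_time:.2e} s per call")
print(f"lemma: {lemma_time:.2e} s per call")
```

To compare the real methods, the same helper could be given `example.words.stem` and `example.words.lemmatize`. Calling each once beforehand is probably wise, since the first lemmatization call may include the cost of loading WordNet.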

# 4 - Lemmatization has higher accuracy

Accuracy: Lemmatization does not merely cut words off as you see in stemming algorithms. Analysis of words is conducted based on the word’s POS to take context into consideration when producing lemmas. Also, lemmatization leads to real dictionary words being produced.

With stemming, some of the output words are not part of the English dictionary (e.g., “becaus,” “peopl,” and “programm”). Another thing to notice is that context is not taken into consideration. For instance, “programmers” is a plural noun, but it is reduced to “program,” which can be a noun or a verb; in other words, the root words are ambiguous.

The motivation behind context-sensitive lemmatizers was to improve the performance on unseen and ambiguous words. In our lemmatization example, we will be using a popular lemmatizer called WordNet lemmatizer. 

WordNet is a large, free, and publicly available lexical database for the English language that aims to establish structured semantic relationships between words.

An input word passed to our lemmatizer remains unchanged if it cannot be found in WordNet. This means context must be provided, which is done by passing a value for the part-of-speech parameter, pos, in wordnet_lemmatizer.lemmatize.
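
The "unchanged if not found" behaviour can be pictured as a lookup keyed by both the word and its part of speech. The sketch below is a toy model, not WordNet itself: `LEXICON` and `lookup_lemma` are hypothetical, and the default of `"n"` only mirrors the noun default seen earlier, where `wnl.lemmatize('is')` returned `is`.

```python
# Hypothetical miniature lexicon keyed by (word, pos).
LEXICON = {
    ("is", "v"): "be",
    ("are", "v"): "be",
    ("codes", "n"): "code",
    ("better", "a"): "good",
    ("better", "r"): "well",
}


def lookup_lemma(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if absent."""
    return LEXICON.get((word.lower(), pos), word)


print(lookup_lemma("is"))           # is  (no noun entry, so unchanged)
print(lookup_lemma("is", "v"))      # be
print(lookup_lemma("better", "a"))  # good
print(lookup_lemma("better", "r"))  # well
```

The same surface form can map to different lemmas depending on the part of speech, which is why a wrong or missing `pos` tends to leave words untouched.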

Notice that the word “programmer” was not cut down to “program” by our lemmatizer: this is because we told our lemmatizer to lemmatize only verbs.

-> Text preprocessing includes both stemming and lemmatization. People often confuse the two terms, and some treat them as the same. In practice, lemmatization is often preferred over stemming because it performs a morphological analysis of the words.

0. https://textblob.readthedocs.io/en/dev/_modules/textblob/blob.html#Word.lemmatize 

base the topics on the methods in the documentation

1. https://www.alura.com.br/artigos/textblob-alternativa-para-processamento-linguagem-natural - done
2. https://www.geeksforgeeks.org/python-lemmatization-with-textblob/ - done
3. https://stackabuse.com/python-for-nlp-introduction-to-the-textblob-library/ - done
4. https://hannibunny.github.io/nlpbook/02normalisation/03StemLemma.html - done
5. https://www.analyticsvidhya.com/blog/2021/10/making-natural-language-processing-easy-with-textblob/ - done
6. https://textblob.readthedocs.io/en/dev/quickstart.html - done
7. https://www.machinelearningplus.com/nlp/lemmatization-examples-python/ - done
8. https://blog.enterprisedna.co/lemmatization-in-python-a-beginners-guide/ - done
9. https://acervolima.com/python-abordagens-de-lematizacao-com-exemplos/ - done
10. https://www.guru99.com/stemming-lemmatization-python-nltk.html - done
11. Wordnet: https://wordnet.princeton.edu/

-- ask again what the idea is -- present two outlines and see which one he prefers

- outline 1: stem vs lemma
- outline 2: all the TextBlob functions -- this one
- outline 3: lemma as part of a larger project - research TF-IDF, LDA and topic modelling


In [1]:
!python -m textblob.download_corpora

[nltk_data] Downloading package brown to /Users/csamp/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /Users/csamp/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/csamp/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/csamp/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package conll2000 to /Users/csamp/nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/csamp/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
Finished.
