# Useful libraries for NLP and Text Mining

This notebook illustrates how some libraries work, namely *NLTK*, *TextBlob*, and *spaCy*.

For the curious, the website http://textanalysisonline.com lets you try out these libraries and also demonstrates some common tasks, such as sentiment analysis.

## NLTK
NLTK (Natural Language Toolkit) is one of the best-known Python libraries for natural language processing.

See [Dive Into NLTK, Part I: Getting Started with NLTK](http://textminingonline.com/dive-into-nltk-part-i-getting-started-with-nltk) to explore some of these features in more detail.

In [2]:
import nltk

Predefined lists of stopwords (closed-class words)

In [None]:
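# The stopword lists ship as NLTK data and must be downloaded once.
nltk.download('stopwords', quiet=True)
stopwords = nltk.corpus.stopwords.words('portuguese')
print("total words:", len(stopwords))
print(stopwords[:90])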
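
Stopword lists are typically used to drop function words before counting or modelling. The sketch below shows the idea on a tiny hand-written set, so it runs without any NLTK data. `TINY_STOPWORDS` and `remove_stopwords` are hypothetical names introduced here for illustration; in practice the set would be built from `nltk.corpus.stopwords.words('portuguese')`.

```python
# Hypothetical, hand-picked stopword set for illustration only;
# in the notebook this would be set(nltk.corpus.stopwords.words('portuguese')).
TINY_STOPWORDS = {"o", "a", "os", "as", "de", "e", "com", "são", "deste"}

def remove_stopwords(tokens, stopwords):
    """Keep only the tokens whose lowercase form is not a stopword."""
    return [t for t in tokens if t.lower() not in stopwords]

tokens = "lindos são os prados verdes e com muita luz deste Portugal".split()
content_words = remove_stopwords(tokens, TINY_STOPWORDS)
print(content_words)
```

Converting the list to a `set` first keeps each membership test cheap, which matters on longer texts.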

### Tokenization

In [None]:
text = """Samsung recently revealed that it would give its phones
an impressive four years of security updates. Paired
with its high position in our regular Android update tracker, 
the company is rapidly becoming the Android update king. In 
light of the news, we're curious to know how
up-to-date your phone might be. If you're willing 
to dig through the settings app for just a moment, let's
compare: What Android security patch level is your phone running?"""

tokens = nltk.word_tokenize(text)
print(tokens)

Finding the most frequent tokens ...

In [None]:
freq = nltk.FreqDist(tokens)
print(freq.most_common(20))
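
Raw token counts are case-sensitive and include punctuation, so the top of the list is often dominated by tokens such as `.` and `,`. Since `nltk.FreqDist` is a subclass of `collections.Counter`, the sketch below uses a plain `Counter` to show one possible normalization step before counting. The `sample_tokens` list is a hypothetical stand-in for the tokenizer output.

```python
from collections import Counter

# Hypothetical token list, shaped like the output of a word tokenizer.
sample_tokens = ["The", "company", "is", "the", "Android", "update", "king", ".",
                 "The", "Android", "update", "tracker", ",", "updates", "."]

# Lowercase and keep only alphabetic tokens before counting.
normalized = [t.lower() for t in sample_tokens if t.isalpha()]
counts = Counter(normalized)
print(counts.most_common(3))
```

The same list comprehension can be applied to `tokens` before passing it to `nltk.FreqDist`.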

### Part-of-speech (POS)
The tagset used in the next examples is described here: [Tagset](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)

In [None]:
text = "Massive iceberg, larger than New York City, breaks off in Antarctica"
print(nltk.pos_tag( text.split() ))

Do the results look correct to you? How would you improve the tokenization of the sentence to obtain more suitable tags?
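
One possible answer: `text.split()` leaves punctuation glued to words (`iceberg,`, `City,`), so the tagger receives tokens it is unlikely to know. The usual fix would be to pass the output of `nltk.word_tokenize(text)` to `nltk.pos_tag`. The sketch below only illustrates the difference in tokenization, using a simple regular expression (an assumption made here so the example runs without NLTK data; it is not how `word_tokenize` works internally).

```python
import re

text = "Massive iceberg, larger than New York City, breaks off in Antarctica"

# Whitespace splitting leaves punctuation glued to the preceding word.
whitespace_tokens = text.split()

# Simple regex alternative: runs of word characters, or single punctuation marks.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(regex_tokens)
```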

### Stemming

In [None]:
import nltk.stem as stem
sentence = "the flies died and denied their dead stating sensational lies"

stemmer = stem.porter.PorterStemmer()
res = [ stemmer.stem(w) for w in sentence.split()]
print(" ".join(res) )
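
To get a feel for what a stemmer does, here is a toy suffix stripper. It is deliberately naive and is not the Porter algorithm: the suffix list, the `min_stem_len` threshold, and the name `naive_stem` are all arbitrary assumptions made for illustration.

```python
# Toy suffix stripper for illustration; NOT the Porter algorithm.
# The suffix list and the minimum stem length are arbitrary assumptions.
SUFFIXES = ("ational", "ing", "ies", "ed", "s")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix if at least min_stem_len characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

sentence = "the flies died and denied their dead stating sensational lies"
stems = [naive_stem(w) for w in sentence.split()]
print(" ".join(stems))
```

Comparing this output with the `PorterStemmer` result above shows why real stemmers need ordered rule sets and conditions on the remaining stem rather than plain suffix removal.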

More sophisticated stemmers are also available. For example, the `SnowballStemmer` also supports Portuguese (although its results are rather poor) ...

In [None]:
print("Languages:", stem.snowball.SnowballStemmer.languages)

In [None]:
frase = "lindos são os prados verdes e com muita luz deste Portugal"
stemmer = stem.snowball.SnowballStemmer("portuguese")
res = [stemmer.stem(w) for w in frase.split()]
print(" ".join(res) )

NLTK also gives access to Portuguese texts, which can be used to train and test models.

In [None]:
nltk.download('floresta', quiet=True)  # corpus data, downloaded once
from nltk.corpus import floresta
print("Contains %s words" % len(floresta.words()))

In [None]:
print(nltk.corpus.floresta.sents()[:5])

## spaCy

In [None]:
%pip install spacy

In [19]:
text="""Samsung recently revealed that it would give its phones
an impressive four years of security updates. Paired with its high
position in our regular Android update tracker, the company
is rapidly becoming the Android update king. In light of the news,
we're curious to know how up-to-date your phone might be.
If you're willing to dig through the settings app for just a moment,
let's compare: What Android security patch level is your phone running?"""
text = text.replace("\n", " ")

In [None]:
print(text)

In [None]:
import spacy
# Requires the model to be installed first: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)

Token, lemma, coarse-grained POS, Penn Treebank tag, dependency label, plus the `is_alpha` and `is_stop` flags

In [None]:
for token in doc[:10]:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.is_alpha, token.is_stop)

Noun chunks

In [None]:
for chunk in doc.noun_chunks:
    print(f"{chunk.text:35} | {chunk.root.text} | {chunk.root.dep_} | {chunk.root.head.text}")

#### Named entities

In [None]:
for ent in doc.ents:
    print(f"{ent.text:15} {ent.label_:10} {ent.start_char} {ent.end_char}")

Portuguese

In [None]:
import spacy
# Requires the model to be installed first: python -m spacy download pt_core_news_sm
nlp = spacy.load('pt_core_news_sm')

#### Tokenization

In [None]:
doc = nlp("""verificaram-se crescimentos económicos em grande escala""")
print(" | ".join([token.text for token in doc]))

Lemma, POS, Tag, Dependency Tag, ...

In [None]:
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.is_alpha, token.is_stop)

## TextBlob
TextBlob is a Python natural language processing library that builds on NLTK and Pattern and provides text mining, text analysis, and text processing modules for Python developers.

Material based on: http://textminingonline.com/getting-started-with-textblob

In [None]:
%pip install textblob

#### Splitting a text into sentences

In [None]:
from textblob import TextBlob

text = """Natural language processing (NLP) deals with
the application of computational models to text or speech
data. Application areas within NLP include automatic
(machine) translation between languages; dialogue systems,
which allow a human to interact with a machine using natural
language; and information extraction, where the goal is to
transform unstructured text into structured (database) 
representations that can be searched and browsed in flexible
ways. I love all the languages."""
text = text.replace("\n", " ")
print(text)

In [None]:
blob = TextBlob(text)
print(blob.sentences)

In [None]:
for num, frase in enumerate(blob.sentences):
    print(f"{num} => {frase}")
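
For contrast, the sketch below splits sentences with a single regular expression. It is a naive baseline written for illustration (the name `naive_sentences` and the `sample` text are hypothetical), and it shows the kind of mistake a rule this simple makes on abbreviations.

```python
import re

def naive_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "Dr. Smith loves NLP. Do you? I love all the languages."
for num, sentence in enumerate(naive_sentences(sample)):
    print(f"{num} => {sentence}")
```

The abbreviation `Dr.` is cut off as a sentence of its own, which is the kind of case a trained sentence segmenter is designed to handle.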

In [31]:
text="""Samsung recently revealed that it would give its phones
an impressive four years of security updates. Paired with its high
position in our regular Android update tracker, the company
is rapidly becoming the Android update king. In light of the news,
we're curious to know how up-to-date your phone might be.
If you're willing to dig through the settings app for just a moment,
let's compare: What Android security patch level is your phone running?"""
text = text.replace("\n", " ")

In [None]:
blob = TextBlob(text)
print(blob.tags)

In [None]:
print(blob.noun_phrases)

In [None]:
for s in blob.sentences:
    print("-->", s)

#### Sentiment analysis

In [None]:
frase = TextBlob("This is an amazing library!")
print(frase, "-->", frase.sentiment.polarity)

In [36]:
textos=["I love chocolate", "I hate to eat", "I don't love", "I like cakes"]

In [None]:
for texto in textos:
    polaridade = TextBlob(texto).sentiment.polarity
    print(texto, "-->", polaridade)
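
Polarity values lie in the range [-1, 1]. To build some intuition for how a lexicon-based scorer can reach such numbers, here is a toy version. It is only a sketch under stated assumptions: `TOY_LEXICON`, its scores, the `NEGATIONS` set, and the sign-flipping rule are all invented for illustration and are not taken from TextBlob.

```python
# Hypothetical mini-lexicon: these scores are invented for illustration
# and are not taken from TextBlob.
TOY_LEXICON = {"love": 0.5, "hate": -0.8, "amazing": 0.6}
NEGATIONS = {"not", "don't", "never"}

def toy_polarity(text):
    """Average the scores of known words, flipping the sign after a negation."""
    tokens = text.lower().split()
    scores = []
    for i, token in enumerate(tokens):
        word = token.strip(".,!?")
        if word in TOY_LEXICON:
            score = TOY_LEXICON[word]
            # Flip the sign when the previous token is a negation word.
            if i > 0 and tokens[i - 1] in NEGATIONS:
                score = -score
            scores.append(score)
    return sum(scores) / len(scores) if scores else 0.0

for texto in ["I love chocolate", "I hate to eat", "I don't love", "I like cakes"]:
    print(texto, "-->", toy_polarity(texto))
```

Words missing from the lexicon (here, `like`) contribute nothing, so the last sentence scores 0.0. This is a reminder that a result of zero can mean "unknown" rather than "neutral".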