# Libraries for Text Mining and NLP
The site http://textanalysisonline.com lets you try out, online, some of the functionality of these libraries: NLTK, TextBlob, spaCy

## NLTK
NLTK (Natural Language Toolkit) is one of the most widely used Python libraries for natural language processing. See [Dive Into NLTK, Part I: Getting Started with NLTK](http://textminingonline.com/dive-into-nltk-part-i-getting-started-with-nltk) for a more detailed look at some of these features.

In [None]:
from pprint import pprint
import nltk

Predefined stop word lists (closed-class words)

In [None]:
# Requires the stop word corpus; download it once with nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('portuguese')
print("total words:", len(stopwords), stopwords[:90])
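
A typical use of such a list is to drop stop words from a token sequence before counting or modelling. The sketch below is a minimal illustration that needs no NLTK data: `remove_stopwords` and `sample_stopwords` are hypothetical names, and the short list is only a stand-in for the full list returned by `nltk.corpus.stopwords.words('portuguese')`.

```python
def remove_stopwords(tokens, stopword_list):
    """Return the tokens that are not in the stop word list (case-insensitive)."""
    stop_set = {w.lower() for w in stopword_list}  # set gives O(1) lookups
    return [t for t in tokens if t.lower() not in stop_set]

# Illustrative subset only (assumption); in the notebook, pass the NLTK list instead.
sample_stopwords = ["de", "a", "o", "que", "e", "os", "com"]

tokens = "lindos são os prados verdes e com muita luz deste Portugal".split()
filtered = remove_stopwords(tokens, sample_stopwords)
print(filtered)
```

Converting the list to a set first keeps the filter fast even for long documents.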

### Tokenization

In [None]:
text = """Samsung recently revealed that it would give its phones
an impressive four years of security updates. Paired
with its high position in our regular Android update tracker, 
the company is rapidly becoming the Android update king. In 
light of the news, we're curious to know how
up-to-date your phone might be. If you're willing 
to dig through the settings app for just a moment, let's
compare: What Android security patch level is your phone running?"""
tokens = nltk.word_tokenize(text)
print(tokens)
print()
fd = nltk.FreqDist(tokens)
print(fd.most_common(20))
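
Because `word_tokenize` also emits punctuation as tokens, the most frequent entries in a raw frequency distribution tend to be commas and periods. The sketch below shows one possible way to count only word-like tokens. It uses a regular expression and `collections.Counter` as stand-ins (assumptions, not NLTK's tokenizer or `FreqDist`), and `sample` is a short excerpt of the text above.

```python
import re
from collections import Counter

sample = ("the company is rapidly becoming the Android update king. "
          "In light of the news, we're curious to know how "
          "up-to-date your phone might be.")

# Keep runs of letters, allowing inner apostrophes and hyphens ("we're", "up-to-date").
words = re.findall(r"[a-z]+(?:['-][a-z]+)*", sample.lower())

counts = Counter(words)
print(counts.most_common(5))
```

Lowercasing before counting merges variants such as "The" and "the"; whether that is desirable depends on the task.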

### Part-of-speech (POS)

In [None]:
text = "Massive iceberg, larger than New York City, breaks off in Antarctica"
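# Tokenize first so punctuation is separated from words
# (text.split() would leave "iceberg," and "City," as single tokens)
print(nltk.pos_tag(nltk.word_tokenize(text)))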

### Stemming

In [None]:
import nltk.stem as stem
sentence = "the flies died and denied their dead stating sensational lies"

stemmer = stem.porter.PorterStemmer()
res = [stemmer.stem(w) for w in sentence.split()]
print(" ".join(res))

More sophisticated stemmers are also available. For example, the SnowballStemmer also supports Portuguese (although its results are not great) ...

In [None]:
print("Languages:", stem.snowball.SnowballStemmer.languages)

In [None]:
frase = "lindos são os prados verdes e com muita luz deste Portugal"
stemmer = stem.snowball.SnowballStemmer("portuguese")
res = [stemmer.stem(w) for w in frase.split()]
print(" ".join(res))

We also have access to Portuguese texts, which we can use to train and test models

In [None]:
from nltk.corpus import floresta
print("Contains %s words" % len(floresta.words()))

In [None]:
print(nltk.corpus.floresta.sents()[:5])
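
To train and test a model on such a corpus, the sentences are usually split into two disjoint sets. The following is a minimal sketch of a reproducible split. `train_test_split` is a hypothetical helper written here for illustration, and `sample_sents` is placeholder data standing in for `floresta.sents()`.

```python
import random

def train_test_split(items, test_ratio=0.2, seed=42):
    """Shuffle a copy of items and split it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = int(len(shuffled) * test_ratio)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]

# Placeholder sentences (assumption); in the notebook, pass floresta.sents() instead.
sample_sents = [["sentence", str(i)] for i in range(10)]

train_sents, test_sents = train_test_split(sample_sents)
print(len(train_sents), len(test_sents))
```

Shuffling before splitting avoids a test set made up only of the last documents in the corpus.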

## NLP using spaCy

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

doc = nlp("""Samsung recently revealed that it would give its phones
an impressive four years of security updates. Paired
with its high position in our regular Android update tracker, 
the company is rapidly becoming the Android update king. In 
light of the news, we're curious to know how
up-to-date your phone might be. If you're willing 
to dig through the settings app for just a moment, let's
compare: What Android security patch level is your phone running?""")

In [None]:
import spacy
nlp = spacy.load('pt_core_news_sm')
doc = nlp("""verificaram-se crescimentos económicos em grande escala""")

Tokens

In [None]:
print(" | ".join([token.text for token in doc]))

Lemma, POS, Tag, Dependency Tag, ...

In [None]:
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.is_alpha, token.is_stop)

Chunks

In [None]:
for chunk in doc.noun_chunks:
    print(f"{chunk.text:40} | {chunk.root.text} | {chunk.root.dep_} | {chunk.root.head.text}")

Named entities

In [None]:
for ent in doc.ents:
    print(f"{ent.text:15} {ent.label_:10} {ent.start_char} {ent.end_char}")

## TextBlob
TextBlob is a Python natural language processing toolkit that stands on the shoulders of giants like NLTK and Pattern. It provides text mining, text analysis, and text processing modules for Python developers.

Material based on: http://textminingonline.com/getting-started-with-textblob

In [None]:
from textblob import TextBlob

In [None]:
text = """Samsung recently revealed that it would give its phones
an impressive four years of security updates. Paired
with its high position in our regular Android update tracker, 
the company is rapidly becoming the Android update king. In 
light of the news, we're curious to know how
up-to-date your phone might be. If you're willing 
to dig through the settings app for just a moment, let's
compare: What Android security patch level is your phone running?"""

In [None]:
blob = TextBlob(text)
print(blob.tags)

In [None]:
print(blob.noun_phrases)

In [None]:
for s in blob.sentences:
    print("-->", s.replace("\n"," "))

In [None]:
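# Note: translate() relied on the Google Translate web API; it was deprecated in
# TextBlob 0.16 and removed in later releases, so this cell may fail on recent versions.
print(blob.translate(to="fr"))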

### Training a classifier (optional)
Additional references: [Tutorial: Building a Text Classification System](https://textblob.readthedocs.io/en/dev/classifiers.html#loading-data-and-creating-a-classifier)

In [None]:
train = [
    ('I love this sandwich.', 'pos'),
    ('this is an amazing place!', 'pos'),
    ('I feel very good about these beers.', 'pos'),
    ('this is my best work.', 'pos'),
    ("what an awesome view", 'pos'),
    ('I do not like this restaurant', 'neg'),
    ('I am tired of this stuff.', 'neg'),
    ("I can't deal with this", 'neg'),
    ('he is my sworn enemy!', 'neg'),
    ('my boss is horrible.', 'neg'),
]

In [None]:
from textblob.classifiers import NaiveBayesClassifier

In [None]:
cl = NaiveBayesClassifier(train)

In [None]:
cl.classify("This is an amazing library!")
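
To give an intuition for what a Naive Bayes text classifier does, here is a simplified bag-of-words sketch with add-one smoothing, written from scratch with the standard library. It is not TextBlob's implementation (which builds on NLTK and uses different features), and `tokenize`, `train_nb`, `classify_nb`, and `toy_train` are hypothetical names introduced only for this illustration.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    stripped = (w.strip(".,!?").lower() for w in text.split())
    return [w for w in stripped if w]

def train_nb(examples):
    """Count labels and per-label word occurrences from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for w in tokenize(text):
            word_counts[label][w] += 1
            vocab.add(w)
    return label_counts, word_counts, vocab

def classify_nb(model, text):
    """Return the label with the highest log-probability score."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in tokenize(text):
            if w in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][w] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

# A subset of the training examples above.
toy_train = [
    ("I love this sandwich.", "pos"),
    ("this is an amazing place!", "pos"),
    ("I do not like this restaurant", "neg"),
    ("my boss is horrible.", "neg"),
]

model = train_nb(toy_train)
prediction = classify_nb(model, "an amazing sandwich")
print(prediction)
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the add-one smoothing keeps a single unseen word from driving a class score to zero.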