<a href="https://colab.research.google.com/github/eduardodut/Mineracao_dados_textos_web/blob/master/projeto01_equipe01_entrega30062020.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<b> TEAM: </b>
  - Eduardo Façanha
  - Giovanni Brígido
  - Maurício Brito

<b> ACTIVITY 01 </b> - Text pre-processing (Deadline: 11/05/2020 - 30%)

- Tokenization
- Lemmatization
- POS tagging
- Normalization (hashtags, mentions, emojis, and special symbols)
- Chunking
- NER (named entities)
- Stop-word removal

<b> ACTIVITY 02 </b> - Semantic representation (Deadline: 30/06/2020 - 30%)

- Use of external knowledge bases
- Topic identification
- Vector representation of words and texts

<b> ACTIVITY 03 </b> - Offensive language analysis - Subtasks A and B (Deadline: 30/07/2020 - 40%)

- Result of subtask A on a test set to be provided
- Result of subtask B on a test set to be provided


<b> Loading the data file and converting it to a DataFrame </b>

The file is downloaded and a DataFrame is instantiated from its contents. The DataFrame variable is named 'tweets'.

In [1]:
import pandas as pd
# download the file stored in the project repository
!curl --remote-name \
    -H 'Accept: application/vnd.github.v3.raw' \
    --location https://raw.githubusercontent.com/eduardodut/Mineracao_dados_textos_web/master/datasets/olid-training-v1.0.tsv

# read the file into a DataFrame
tweets = pd.read_csv('/content/olid-training-v1.0.tsv', sep='\t', encoding='utf-8')

# convert the 'id' column from integer to string
tweets['id'] = tweets['id'].astype('str')

# reorder the columns and display the first records
tweets = tweets[['subtask_c','subtask_b','subtask_a','id','tweet']]
tweets.head(20)

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1915k  100 1915k    0     0  2970k      0 --:--:-- --:--:-- --:--:-- 2970k


Unnamed: 0,subtask_c,subtask_b,subtask_a,id,tweet
0,,UNT,OFF,86426,@USER She should ask a few native Americans wh...
1,IND,TIN,OFF,90194,@USER @USER Go home you’re drunk!!! @USER #MAG...
2,,,NOT,16820,Amazon is investigating Chinese employees who ...
3,,UNT,OFF,62688,"@USER Someone should'veTaken"" this piece of sh..."
4,,,NOT,43605,@USER @USER Obama wanted liberals &amp; illega...
5,OTH,TIN,OFF,97670,@USER Liberals are all Kookoo !!!
6,,UNT,OFF,77444,@USER @USER Oh noes! Tough shit.
7,GRP,TIN,OFF,52415,@USER was literally just talking about this lo...
8,,,NOT,45157,@USER Buy more icecream!!!
9,IND,TIN,OFF,13384,@USER Canada doesn’t need another CUCK! We alr...


In [2]:
# check for and remove duplicate tweets
if tweets.duplicated(['tweet']).sum() > 0:
  tweets.drop_duplicates(subset='tweet', keep='first', inplace=True)

print('DUPLICATE TWEETS: ', tweets.duplicated(['tweet']).sum())

DUPLICATE TWEETS:  0


<b> Initial text cleaning </b>

Converts the text of each tweet to lowercase, removes the 'URL' placeholder and line breaks, and collapses extra spaces and tabs. The result is stored in a new column of the tweets DataFrame.

Input: tweets['tweet']<br/>
Output: tweets['tweet_tratado']

In [3]:
from nltk.tokenize import TweetTokenizer, sent_tokenize
import re
import string
from nltk.corpus import stopwords as sw

def tratamento_texto(tweet):

  tweet = tweet.lower()

  # remove user mentions from each tweet (disabled)
  # tweet = re.sub(r'@user', '', tweet, flags=re.MULTILINE)
  # remove the 'url' placeholder; \b keeps words such as 'hurl' intact
  tweet = re.sub(r'\burl\b', '', tweet)
  # remove line breaks
  tweet = re.sub(r'\n', '', tweet)
  # replace tabs with a single space
  tweet = re.sub(r'\t', ' ', tweet)
  # collapse runs of whitespace into a single space
  tweet = re.sub(r'\s+', ' ', tweet)
  # note: HTML entities such as &amp; are left untouched
  # remove quotes and apostrophes (disabled)
  # tweet = re.sub('[\'"‘’“”…]', '', tweet)
  # strip last, so that a removed trailing 'url' leaves no dangling space
  return tweet.strip()

# create a new column in the 'tweets' DataFrame with the cleaned text of each tweet
tweets['tweet_tratado'] = tweets['tweet'].apply(tratamento_texto)
tweets[tweets.columns[::-1]].head()

Unnamed: 0,tweet_tratado,tweet,id,subtask_a,subtask_b,subtask_c
0,@user she should ask a few native americans wh...,@USER She should ask a few native Americans wh...,86426,OFF,UNT,
1,@user @user go home you’re drunk!!! @user #mag...,@USER @USER Go home you’re drunk!!! @USER #MAG...,90194,OFF,TIN,IND
2,amazon is investigating chinese employees who ...,Amazon is investigating Chinese employees who ...,16820,NOT,,
3,"@user someone should'vetaken"" this piece of sh...","@USER Someone should'veTaken"" this piece of sh...",62688,OFF,UNT,
4,@user @user obama wanted liberals &amp; illega...,@USER @USER Obama wanted liberals &amp; illega...,43605,NOT,,
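
The cleaning steps described in this section can be checked on a single string without loading the dataset. The sketch below is a hypothetical variant (`limpa_tweet` and the sample text are illustrative names, not part of the notebook). It assumes the 'url' placeholder should only be dropped when it is a whole word, and it turns line breaks into spaces instead of deleting them.

```python
import re

def limpa_tweet(tweet: str) -> str:
    """Hypothetical variant of tratamento_texto, for illustration only."""
    tweet = tweet.lower()
    # drop the 'url' placeholder only when it stands alone as a word
    tweet = re.sub(r'\burl\b', '', tweet)
    # collapse line breaks, tabs and repeated spaces into one space
    tweet = re.sub(r'\s+', ' ', tweet)
    return tweet.strip()

exemplo = "@USER  Check this\tURL \n before you hurl insults"
print(limpa_tweet(exemplo))
```

The word boundary `\b` is what keeps "hurl" intact; a plain `'url'` pattern would reduce it to "h".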


<b> Sentence splitting </b>

Splits each tweet into sentences.

Input: tweets['tweet_tratado']<br/>
Output: tweets['tweet_em_sentencas']

In [4]:
import nltk

# quiet=True suppresses the download log
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)

def separa_sentencas(tweet):
  # return the list of sentences with surrounding whitespace stripped
  return [sent.strip() for sent in sent_tokenize(tweet)]

tweets['tweet_em_sentencas'] = tweets['tweet_tratado'].apply(separa_sentencas)
tweets[tweets.columns[::-1]].head()

Unnamed: 0,tweet_em_sentencas,tweet_tratado,tweet,id,subtask_a,subtask_b,subtask_c
0,[@user she should ask a few native americans w...,@user she should ask a few native americans wh...,@USER She should ask a few native Americans wh...,86426,OFF,UNT,
1,"[@user @user go home you’re drunk!!!, @user #m...",@user @user go home you’re drunk!!! @user #mag...,@USER @USER Go home you’re drunk!!! @USER #MAG...,90194,OFF,TIN,IND
2,[amazon is investigating chinese employees who...,amazon is investigating chinese employees who ...,Amazon is investigating Chinese employees who ...,16820,NOT,,
3,"[@user someone should'vetaken"" this piece of s...","@user someone should'vetaken"" this piece of sh...","@USER Someone should'veTaken"" this piece of sh...",62688,OFF,UNT,
4,[@user @user obama wanted liberals &amp; illeg...,@user @user obama wanted liberals &amp; illega...,@USER @USER Obama wanted liberals &amp; illega...,43605,NOT,,


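<b> Tokenization </b>

Tokenizes each sentence of the tweet with NLTK's TweetTokenizer. Each tweet becomes a list of sentences, and each sentence a list of tokens.

Input: tweets['tweet_em_sentencas']<br/>
Output: tweets['tweet_tokenizado']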

In [5]:
def tokeniza_sentenca(lista_sentencas):
  # alternative (disabled): join the sentences and tokenize them as one string
  # tokenizer = TweetTokenizer()
  # sentencas_unidas = " ".join(w for w in lista_sentencas)
  # tokens = tokenizer.tokenize(sentencas_unidas)

  tokenizer = TweetTokenizer()
  # one list of tokens per sentence
  return [tokenizer.tokenize(sentenca) for sentenca in lista_sentencas]

tweets['tweet_tokenizado'] = tweets['tweet_em_sentencas'].apply(tokeniza_sentenca)
tweets[tweets.columns[::-1]].head()

Unnamed: 0,tweet_tokenizado,tweet_em_sentencas,tweet_tratado,tweet,id,subtask_a,subtask_b,subtask_c
0,"[[@user, she, should, ask, a, few, native, ame...",[@user she should ask a few native americans w...,@user she should ask a few native americans wh...,@USER She should ask a few native Americans wh...,86426,OFF,UNT,
1,"[[@user, @user, go, home, you, ’, re, drunk, !...","[@user @user go home you’re drunk!!!, @user #m...",@user @user go home you’re drunk!!! @user #mag...,@USER @USER Go home you’re drunk!!! @USER #MAG...,90194,OFF,TIN,IND
2,"[[amazon, is, investigating, chinese, employee...",[amazon is investigating chinese employees who...,amazon is investigating chinese employees who ...,Amazon is investigating Chinese employees who ...,16820,NOT,,
3,"[[@user, someone, should'vetaken, "", this, pie...","[@user someone should'vetaken"" this piece of s...","@user someone should'vetaken"" this piece of sh...","@USER Someone should'veTaken"" this piece of sh...",62688,OFF,UNT,
4,"[[@user, @user, obama, wanted, liberals, &, il...",[@user @user obama wanted liberals &amp; illeg...,@user @user obama wanted liberals &amp; illega...,@USER @USER Obama wanted liberals &amp; illega...,43605,NOT,,
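
The tokenized column holds one list of tokens per sentence. Some downstream steps may prefer a single flat list per tweet. A minimal sketch of that conversion follows; `achata_tokens` and the sample input are hypothetical and not part of the notebook.

```python
from itertools import chain

def achata_tokens(sentencas_tokenizadas):
    """Flatten [[tok, ...], [tok, ...]] into a single [tok, ...] list."""
    return list(chain.from_iterable(sentencas_tokenizadas))

# illustrative input with the same nesting as tweets['tweet_tokenizado']
exemplo = [['go', 'home', '!'], ['you', 'are', 'drunk']]
print(achata_tokens(exemplo))
```

It could be applied with `tweets['tweet_tokenizado'].apply(achata_tokens)` if a flat column is wanted.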


<b> POS Tagger </b>

Performs part-of-speech tagging on the tokens of each sentence.

Input: tweets['tweet_tokenizado']<br/>
Output: tweets['tweet_POS_tagged']

In [6]:
# quiet=True suppresses the download log
nltk.download('averaged_perceptron_tagger', quiet=True)

def pos_taggeador(lista_tokens):
  # apply nltk.pos_tag to each sentence (list of tokens) of the tweet
  return [nltk.pos_tag(lista) for lista in lista_tokens]

# a plain .apply(nltk.pos_tag) would only work if each cell held a flat list of tokens
tweets['tweet_POS_tagged'] = tweets['tweet_tokenizado'].apply(pos_taggeador)
tweets[tweets.columns[::-1]].head()

Unnamed: 0,tweet_POS_tagged,tweet_tokenizado,tweet_em_sentencas,tweet_tratado,tweet,id,subtask_a,subtask_b,subtask_c
0,"[[(@user, IN), (she, PRP), (should, MD), (ask,...","[[@user, she, should, ask, a, few, native, ame...",[@user she should ask a few native americans w...,@user she should ask a few native americans wh...,@USER She should ask a few native Americans wh...,86426,OFF,UNT,
1,"[[(@user, NNP), (@user, NNP), (go, VBP), (home...","[[@user, @user, go, home, you, ’, re, drunk, !...","[@user @user go home you’re drunk!!!, @user #m...",@user @user go home you’re drunk!!! @user #mag...,@USER @USER Go home you’re drunk!!! @USER #MAG...,90194,OFF,TIN,IND
2,"[[(amazon, NN), (is, VBZ), (investigating, VBG...","[[amazon, is, investigating, chinese, employee...",[amazon is investigating chinese employees who...,amazon is investigating chinese employees who ...,Amazon is investigating Chinese employees who ...,16820,NOT,,
3,"[[(@user, NN), (someone, NN), (should'vetaken,...","[[@user, someone, should'vetaken, "", this, pie...","[@user someone should'vetaken"" this piece of s...","@user someone should'vetaken"" this piece of sh...","@USER Someone should'veTaken"" this piece of sh...",62688,OFF,UNT,
4,"[[(@user, NNP), (@user, NNP), (obama, NN), (wa...","[[@user, @user, obama, wanted, liberals, &, il...",[@user @user obama wanted liberals &amp; illeg...,@user @user obama wanted liberals &amp; illega...,@USER @USER Obama wanted liberals &amp; illega...,43605,NOT,,
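
Lemmatization is listed in Activity 01, but no cell in this notebook performs it. WordNet-based lemmatizers usually need a coarse part of speech rather than a Penn Treebank tag, so one possible preparatory step is the mapping sketched below. The helper name `penn_para_wordnet` is hypothetical, and defaulting to noun for unmapped tags is an assumption.

```python
def penn_para_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS letter (default: noun)."""
    # 'a' adjective, 'v' verb, 'n' noun, 'r' adverb
    prefixos = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}
    return prefixos.get(tag[:1], 'n')

# illustrative sentence in the same (token, tag) shape as tweets['tweet_POS_tagged']
sentenca = [('she', 'PRP'), ('should', 'MD'), ('ask', 'VB'), ('native', 'JJ')]
print([(tok, penn_para_wordnet(tag)) for tok, tag in sentenca])
```

The resulting letter could be passed as the `pos` argument of NLTK's `WordNetLemmatizer.lemmatize`.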


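<b> Chunking </b>

Splits each sentence into noun-phrase chunks using a regular-expression grammar. Each token is stored as a (token, POS tag, IOB tag) triple.

Input: tweets['tweet_POS_tagged']<br/>
Output: tweets['tweet_chunked']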

In [7]:
from nltk.chunk import tree2conlltags

# candidate noun-phrase grammars; only pattern1 is used below
pattern = 'NP: {<DT>?<JJ>*<NN>}'
pattern1 = 'NP: {<DT>?<JJ>*<NN.*>*}'
pattern2 = 'NP: {<DT><NN.*><.*>*<NN.*>}'

# build the parser once instead of once per sentence
cp = nltk.RegexpParser(pattern1)

def chunker(lista_tweets_pos_tagged):
  # parse each POS-tagged sentence and convert the tree to (token, POS, IOB) triples
  return [tree2conlltags(cp.parse(lista)) for lista in lista_tweets_pos_tagged]


tweets['tweet_chunked'] = tweets['tweet_POS_tagged'].apply(chunker)

tweets[tweets.columns[::-1]].head()

Unnamed: 0,tweet_chunked,tweet_POS_tagged,tweet_tokenizado,tweet_em_sentencas,tweet_tratado,tweet,id,subtask_a,subtask_b,subtask_c
0,"[[(@user, IN, O), (she, PRP, O), (should, MD, ...","[[(@user, IN), (she, PRP), (should, MD), (ask,...","[[@user, she, should, ask, a, few, native, ame...",[@user she should ask a few native americans w...,@user she should ask a few native americans wh...,@USER She should ask a few native Americans wh...,86426,OFF,UNT,
1,"[[(@user, NNP, B-NP), (@user, NNP, I-NP), (go,...","[[(@user, NNP), (@user, NNP), (go, VBP), (home...","[[@user, @user, go, home, you, ’, re, drunk, !...","[@user @user go home you’re drunk!!!, @user #m...",@user @user go home you’re drunk!!! @user #mag...,@USER @USER Go home you’re drunk!!! @USER #MAG...,90194,OFF,TIN,IND
2,"[[(amazon, NN, B-NP), (is, VBZ, O), (investiga...","[[(amazon, NN), (is, VBZ), (investigating, VBG...","[[amazon, is, investigating, chinese, employee...",[amazon is investigating chinese employees who...,amazon is investigating chinese employees who ...,Amazon is investigating Chinese employees who ...,16820,NOT,,
3,"[[(@user, NN, B-NP), (someone, NN, I-NP), (sho...","[[(@user, NN), (someone, NN), (should'vetaken,...","[[@user, someone, should'vetaken, "", this, pie...","[@user someone should'vetaken"" this piece of s...","@user someone should'vetaken"" this piece of sh...","@USER Someone should'veTaken"" this piece of sh...",62688,OFF,UNT,
4,"[[(@user, NNP, B-NP), (@user, NNP, I-NP), (oba...","[[(@user, NNP), (@user, NNP), (obama, NN), (wa...","[[@user, @user, obama, wanted, liberals, &, il...",[@user @user obama wanted liberals &amp; illeg...,@user @user obama wanted liberals &amp; illega...,@USER @USER Obama wanted liberals &amp; illega...,43605,NOT,,
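
The IOB triples are compact but hard to read. A minimal sketch of how the noun phrases could be rebuilt from them is shown below. `extrai_chunks` and the sample sentence are hypothetical, and treating an I- tag with no preceding B- tag as the start of a chunk is an assumption (this can happen once stop words are removed).

```python
def extrai_chunks(sentenca_iob, rotulo='NP'):
    """Group consecutive B-/I- tagged tokens into chunk strings."""
    chunks, atual = [], []
    for token, _pos, iob in sentenca_iob:
        if iob == f'B-{rotulo}':
            # a B- tag always closes the open chunk and starts a new one
            if atual:
                chunks.append(' '.join(atual))
            atual = [token]
        elif iob == f'I-{rotulo}':
            # continue the open chunk (or start one if the B- token is missing)
            atual.append(token)
        else:
            # any other tag closes the open chunk
            if atual:
                chunks.append(' '.join(atual))
            atual = []
    if atual:
        chunks.append(' '.join(atual))
    return chunks

# illustrative input, not actual notebook output
exemplo = [('amazon', 'NN', 'B-NP'), ('is', 'VBZ', 'O'),
           ('investigating', 'VBG', 'O'),
           ('chinese', 'JJ', 'B-NP'), ('employees', 'NNS', 'I-NP')]
print(extrai_chunks(exemplo))
```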


<b> NER </b>

Performs named-entity recognition (NER).

Input: tweets['tweet_POS_tagged']<br/>
Output: tweets['tweet_NER']

In [8]:
# quiet=True suppresses the download log
nltk.download('maxent_ne_chunker', quiet=True)
nltk.download('words', quiet=True)

def ner(lista_tokens_taggeados):
  # nltk.ne_chunk returns one nltk.Tree per POS-tagged sentence
  return [nltk.ne_chunk(lista) for lista in lista_tokens_taggeados]


tweets['tweet_NER'] = tweets['tweet_POS_tagged'].apply(ner)
tweets[tweets.columns[::-1]].head()

Unnamed: 0,tweet_NER,tweet_chunked,tweet_POS_tagged,tweet_tokenizado,tweet_em_sentencas,tweet_tratado,tweet,id,subtask_a,subtask_b,subtask_c
0,"[[(@user, IN), (she, PRP), (should, MD), (ask,...","[[(@user, IN, O), (she, PRP, O), (should, MD, ...","[[(@user, IN), (she, PRP), (should, MD), (ask,...","[[@user, she, should, ask, a, few, native, ame...",[@user she should ask a few native americans w...,@user she should ask a few native americans wh...,@USER She should ask a few native Americans wh...,86426,OFF,UNT,
1,"[[(@user, NNP), (@user, NNP), (go, VBP), (home...","[[(@user, NNP, B-NP), (@user, NNP, I-NP), (go,...","[[(@user, NNP), (@user, NNP), (go, VBP), (home...","[[@user, @user, go, home, you, ’, re, drunk, !...","[@user @user go home you’re drunk!!!, @user #m...",@user @user go home you’re drunk!!! @user #mag...,@USER @USER Go home you’re drunk!!! @USER #MAG...,90194,OFF,TIN,IND
2,"[[(amazon, NN), (is, VBZ), (investigating, VBG...","[[(amazon, NN, B-NP), (is, VBZ, O), (investiga...","[[(amazon, NN), (is, VBZ), (investigating, VBG...","[[amazon, is, investigating, chinese, employee...",[amazon is investigating chinese employees who...,amazon is investigating chinese employees who ...,Amazon is investigating Chinese employees who ...,16820,NOT,,
3,"[[(@user, NN), (someone, NN), (should'vetaken,...","[[(@user, NN, B-NP), (someone, NN, I-NP), (sho...","[[(@user, NN), (someone, NN), (should'vetaken,...","[[@user, someone, should'vetaken, "", this, pie...","[@user someone should'vetaken"" this piece of s...","@user someone should'vetaken"" this piece of sh...","@USER Someone should'veTaken"" this piece of sh...",62688,OFF,UNT,
4,"[[(@user, NNP), (@user, NNP), (obama, NN), (wa...","[[(@user, NNP, B-NP), (@user, NNP, I-NP), (oba...","[[(@user, NNP), (@user, NNP), (obama, NN), (wa...","[[@user, @user, obama, wanted, liberals, &, il...",[@user @user obama wanted liberals &amp; illeg...,@user @user obama wanted liberals &amp; illega...,@USER @USER Obama wanted liberals &amp; illega...,43605,NOT,,


<b> Stop-word removal </b>

Removes English stop words and punctuation from each tweet's list of tokens.

Inputs:<br/>
         * tweets['tweet_tokenizado']<br/>
         * tweets['tweet_NER']<br/>
         * tweets['tweet_chunked']<br/>

Outputs:<br/>
         * tweets['tokens_sem_stopwords']<br/>
         * tweets['NER_sem_stopwords'] <br/>
         * tweets['chunks_sem_stopwords']<br/>



In [9]:
from string import punctuation

# download the resource and build the stop-word set once, not once per row
nltk.download('stopwords', quiet=True)
stop_words = set(sw.words('english') + list(punctuation))

def remove_stop_words(lista_token_sentenca):
  '''Stop-word removal: receives a list of token lists (one per sentence)
  and returns the filtered list of token lists.
  '''
  return [[w for w in lista_tokens if w not in stop_words]
          for lista_tokens in lista_token_sentenca]

def remove_stop_words_tuplas(lista_tuplas_sentencas):
  '''Stop-word removal: receives, per sentence, a sequence of tuples whose
  first element is the token, and returns lists of the tuples that remain.
  '''
  return [[w for w in lista_tuplas if w[0] not in stop_words]
          for lista_tuplas in lista_tuplas_sentencas]

tweets['tokens_sem_stopwords'] = tweets['tweet_tokenizado'].apply(remove_stop_words)
tweets['NER_sem_stopwords'] = tweets['tweet_NER'].apply(remove_stop_words_tuplas)
tweets['chunks_sem_stopwords'] = tweets['tweet_chunked'].apply(remove_stop_words_tuplas)
tweets[tweets.columns[::-1]].head()

Unnamed: 0,chunks_sem_stopwords,NER_sem_stopwords,tokens_sem_stopwords,tweet_NER,tweet_chunked,tweet_POS_tagged,tweet_tokenizado,tweet_em_sentencas,tweet_tratado,tweet,id,subtask_a,subtask_b,subtask_c
0,"[[(@user, IN, O), (ask, VB, O), (native, JJ, I...","[[(@user, IN), (ask, VB), (native, JJ), (ameri...","[[@user, ask, native, americans, take]]","[[(@user, IN), (she, PRP), (should, MD), (ask,...","[[(@user, IN, O), (she, PRP, O), (should, MD, ...","[[(@user, IN), (she, PRP), (should, MD), (ask,...","[[@user, she, should, ask, a, few, native, ame...",[@user she should ask a few native americans w...,@user she should ask a few native americans wh...,@USER She should ask a few native Americans wh...,86426,OFF,UNT,
1,"[[(@user, NNP, B-NP), (@user, NNP, I-NP), (go,...","[[(@user, NNP), (@user, NNP), (go, VBP), (home...","[[@user, @user, go, home, ’, drunk], [@user, #...","[[(@user, NNP), (@user, NNP), (go, VBP), (home...","[[(@user, NNP, B-NP), (@user, NNP, I-NP), (go,...","[[(@user, NNP), (@user, NNP), (go, VBP), (home...","[[@user, @user, go, home, you, ’, re, drunk, !...","[@user @user go home you’re drunk!!!, @user #m...",@user @user go home you’re drunk!!! @user #mag...,@USER @USER Go home you’re drunk!!! @USER #MAG...,90194,OFF,TIN,IND
2,"[[(amazon, NN, B-NP), (investigating, VBG, O),...","[[(amazon, NN), (investigating, VBG), (chinese...","[[amazon, investigating, chinese, employees, s...","[[(amazon, NN), (is, VBZ), (investigating, VBG...","[[(amazon, NN, B-NP), (is, VBZ, O), (investiga...","[[(amazon, NN), (is, VBZ), (investigating, VBG...","[[amazon, is, investigating, chinese, employee...",[amazon is investigating chinese employees who...,amazon is investigating chinese employees who ...,Amazon is investigating Chinese employees who ...,16820,NOT,,
3,"[[(@user, NN, B-NP), (someone, NN, I-NP), (sho...","[[(@user, NN), (someone, NN), (should'vetaken,...","[[@user, someone, should'vetaken, piece, shit,...","[[(@user, NN), (someone, NN), (should'vetaken,...","[[(@user, NN, B-NP), (someone, NN, I-NP), (sho...","[[(@user, NN), (someone, NN), (should'vetaken,...","[[@user, someone, should'vetaken, "", this, pie...","[@user someone should'vetaken"" this piece of s...","@user someone should'vetaken"" this piece of sh...","@USER Someone should'veTaken"" this piece of sh...",62688,OFF,UNT,
4,"[[(@user, NNP, B-NP), (@user, NNP, I-NP), (oba...","[[(@user, NNP), (@user, NNP), (obama, NN), (wa...","[[@user, @user, obama, wanted, liberals, illeg...","[[(@user, NNP), (@user, NNP), (obama, NN), (wa...","[[(@user, NNP, B-NP), (@user, NNP, I-NP), (oba...","[[(@user, NNP), (@user, NNP), (obama, NN), (wa...","[[@user, @user, obama, wanted, liberals, &, il...",[@user @user obama wanted liberals &amp; illeg...,@user @user obama wanted liberals &amp; illega...,@USER @USER Obama wanted liberals &amp; illega...,43605,NOT,,
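
In the output above the typographic apostrophe ’ survives the filter, because `string.punctuation` only contains ASCII characters. One possible extension, sketched below, also drops tokens made up entirely of Unicode punctuation. The helper `so_pontuacao` is hypothetical, and relying on the Unicode "P" categories is an assumption about what should count as punctuation here.

```python
import unicodedata
from string import punctuation

def so_pontuacao(token: str) -> bool:
    """True if every character is ASCII punctuation or Unicode category P*."""
    return bool(token) and all(
        ch in punctuation or unicodedata.category(ch).startswith('P')
        for ch in token
    )

# illustrative tokens; '’' and '…' are not in string.punctuation
tokens = ['go', 'home', '’', 'drunk', '!', '…', '#maga']
print([t for t in tokens if not so_pontuacao(t)])
```

Hashtags such as '#maga' are kept, since only tokens that consist solely of punctuation are removed.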


<b> End of activity 01 </b>

The main deliverables are the columns tweets['tokens_sem_stopwords'] and tweets['NER_sem_stopwords'] of the tweets dataset.

In [0]:
tweets[['NER_sem_stopwords','chunks_sem_stopwords']].head()

Unnamed: 0,NER_sem_stopwords,chunks_sem_stopwords
0,"[[(ask, VB), (native, JJ), (americans, NNS), (...","[[(ask, VB, O), (native, JJ, I-NP), (americans..."
1,"[[(go, VB), (home, NN), (’, VBP), (drunk, NN)]...","[[(go, VB, O), (home, NN, B-NP), (’, VBP, O), ..."
2,"[[(amazon, NN), (investigating, VBG), (chinese...","[[(amazon, NN, B-NP), (investigating, VBG, O),..."
3,"[[(someone, NN), (should'vetaken, VBD), (piece...","[[(someone, NN, B-NP), (should'vetaken, VBD, O..."
4,"[[(obama, RB), (wanted, VBD), (liberals, NNS),...","[[(obama, RB, O), (wanted, VBD, O), (liberals,..."


- External knowledge base: WordNet lemmatizer?
- Do not lowercase mid-sentence words, since they may be entities
- Identify typos
- Remove high-frequency n-grams (they add no information) and low-frequency n-grams with errors (to prevent overfitting)

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
texts = [
    "good movie", "not a good movie", "did not like", 
    "i like it", "good one"
]
# using the default tokenizer in TfidfVectorizer
tfidf = TfidfVectorizer(min_df=2, max_df=0.5, ngram_range=(1, 2))
features = tfidf.fit_transform(texts)
pd.DataFrame(
    features.toarray(),
    # get_feature_names() was removed in scikit-learn 1.2
    columns=tfidf.get_feature_names_out()
)
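
The frequency filtering mentioned in the notes above can be reproduced by hand to see which n-grams survive. The sketch below uses only the standard library and assumes scikit-learn's default token pattern (words of two or more characters). The names `ngrams`, `df` and `kept` are illustrative.

```python
import re
from collections import Counter

texts = ["good movie", "not a good movie", "did not like",
         "i like it", "good one"]

def ngrams(text, n_max=2):
    """Return the set of unigrams and bigrams found in one document."""
    # same idea as the default token pattern: tokens of 2+ word characters
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    }

# document frequency: in how many texts each n-gram appears
df = Counter(term for text in texts for term in ngrams(text))

min_df, max_df = 2, 0.5  # absolute count and proportion of documents
kept = sorted(t for t, c in df.items() if min_df <= c <= max_df * len(texts))
print(kept)
```

With these settings "good" is discarded as too frequent (3 of 5 documents), and n-grams seen in a single document are discarded as too rare.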