<a href="https://colab.research.google.com/github/eduardodut/Mineracao_dados_textos_web/blob/master/projeto01_equipe01_Atividade02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<b> EQUIPE: </b>
  - Eduardo Façanha
  - Giovanni Brígido
  - Maurício Brito

<b> ATIVIDADE 01 </b> - Pré-processamento dos textos (Prazo: 11/05/2020 - 30%)

- Tokenização
- Lematização
- POS Tagging
- Normalização (hashtags, menções, emojis e símbolos especiais)
- Chunking
- NER (entidades nomeadas)
- Remoção stop-words

<b> ATIVIDADE 02 </b> - Representação Semântica (Prazo: 30/06/2020 - 30%)

- Uso de bases de conhecimento externas
- Identificação de tópicos
- Representação vetorial das palavras e textos

<b> ATIVIDADE 03 </b> - Analise da Linguagem Ofensiva - Subtarefas A e B (Prazo: 30/07/2020 - 40%)

- Resultado da subtarefa A para um conjunto de teste a ser fornecido
- Resultado da subtarefa B para um conjunto de teste a ser fornecido


## Atividade 01

### Carregamento do arquivo de dados e transformação em DataFrame

É realizado o download do arquivo e instanciado um DataFrame com os dados. A variável do DataFrame é chamada 'tweets'

In [236]:
import pandas as pd
#download o arquivo localizado no reposítório do projeto
!curl --remote-name \
    -H 'Accept: application/vnd.github.v3.raw' \
    --location https://raw.githubusercontent.com/eduardodut/Mineracao_dados_textos_web/master/datasets/olid-training-v1.0.tsv

#leitura para objeto dataframe
tweets = pd.read_csv('/content/olid-training-v1.0.tsv', sep='\t',encoding= 'utf-8')

#conversão da coluna 'id' de inteiro para string
tweets['id'] = tweets['id'].astype('str')

#visualização dos primeiros registros

tweets = tweets[['subtask_c','subtask_b','subtask_a','id','tweet']]
tweets.head(20)

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0100 1915k  100 1915k    0     0  4789k      0 --:--:-- --:--:-- --:--:-- 4789k


Unnamed: 0,subtask_c,subtask_b,subtask_a,id,tweet
0,,UNT,OFF,86426,@USER She should ask a few native Americans wh...
1,IND,TIN,OFF,90194,@USER @USER Go home you’re drunk!!! @USER #MAG...
2,,,NOT,16820,Amazon is investigating Chinese employees who ...
3,,UNT,OFF,62688,"@USER Someone should'veTaken"" this piece of sh..."
4,,,NOT,43605,@USER @USER Obama wanted liberals &amp; illega...
5,OTH,TIN,OFF,97670,@USER Liberals are all Kookoo !!!
6,,UNT,OFF,77444,@USER @USER Oh noes! Tough shit.
7,GRP,TIN,OFF,52415,@USER was literally just talking about this lo...
8,,,NOT,45157,@USER Buy more icecream!!!
9,IND,TIN,OFF,13384,@USER Canada doesn’t need another CUCK! We alr...


In [237]:
#verificação e remoção de duplicatas
if tweets.duplicated(['tweet']).sum()>0:
  tweets.drop_duplicates(subset='tweet', keep='first', inplace=True)

print('TWEETS DUPLICADOS: ',tweets.duplicated(['tweet']).sum())

TWEETS DUPLICADOS:  0


### Tratamento inicial do texto

Converte o texto de cada tweet, separadamente, em minúsculo e remove espaços e tabulações extras. O resultado é guardado no DataFrame tweets em uma nova coluna.

Entrada: tweets['tweet']<br/>
Saída: tweets['tweet_tratado']

In [238]:
from nltk.tokenize import TweetTokenizer, sent_tokenize
import re
import string
from nltk.corpus import stopwords as sw

def tratamento_texto(tweet):
  
  tweet = tweet.lower()
  tweet = tweet.strip()
  
  #remove as menções a usuários de cada tweet
  # tweet = re.sub(r'@user', '', tweet, flags=re.MULTILINE)
  #remove as palavras url
  tweet = re.sub(r'url', '', tweet, flags=re.MULTILINE)
  #remove as quebras de linha
  tweet = re.sub(r'\n', '', tweet)
  #substitui tabulações por um espaço em branco
  tweet = re.sub(r'\t', ' ', tweet)
  #substitui um ou mais espaços em branco por um espaço
  tweet= re.sub(r'\s+', ' ', tweet, flags=re.I)
  #&amp;
  #remove aspas e apóstofres
  tweet = re.sub('[\'"‘’“”…]', '', tweet)
  #remove aspas e apóstofres
  tweet = re.sub('^#$', '', tweet)
  tweet = re.sub('@', '', tweet)
  return tweet

#cria uma nova coluna no dataframe 'tweets' com cada tweet tokenizado
tweets['tweet_tratado'] = tweets['tweet'].apply(tratamento_texto)
tweets[tweets.columns[::-1]].head()

Unnamed: 0,tweet_tratado,tweet,id,subtask_a,subtask_b,subtask_c
0,user she should ask a few native americans wha...,@USER She should ask a few native Americans wh...,86426,OFF,UNT,
1,user user go home youre drunk!!! user #maga #t...,@USER @USER Go home you’re drunk!!! @USER #MAG...,90194,OFF,TIN,IND
2,amazon is investigating chinese employees who ...,Amazon is investigating Chinese employees who ...,16820,NOT,,
3,user someone shouldvetaken this piece of shit ...,"@USER Someone should'veTaken"" this piece of sh...",62688,OFF,UNT,
4,user user obama wanted liberals &amp; illegals...,@USER @USER Obama wanted liberals &amp; illega...,43605,NOT,,


In [239]:
tweets.tweet_tratado[1]

'user user go home youre drunk!!! user #maga #trump2020 👊🇺🇸👊 '

<b> Separação em sentenças </b>

Separa cada tweet em sentenças.

Entrada: tweets['tweet_tratado']<br/>
Saída: tweets['tweet_em_sentencas']

In [240]:
import nltk
from contextlib import redirect_stdout
import os

with redirect_stdout(open(os.devnull, "w")):
  nltk.download("stopwords") 
  nltk.download('punkt')

def separa_sentencas(tweet):
  
  lista_sentencas = sent_tokenize(tweet)
  # lista_setencas.str.strip()
  nova_lista = []
  for sent in lista_sentencas:
    nova_lista.append(sent.strip())

  return nova_lista #retorna lista de sentenças com a função .strip() aplicada
tweets['tweet_em_sentencas'] = tweets['tweet_tratado'].apply(separa_sentencas)
tweets[tweets.columns[::-1]].head()

Unnamed: 0,tweet_em_sentencas,tweet_tratado,tweet,id,subtask_a,subtask_b,subtask_c
0,[user she should ask a few native americans wh...,user she should ask a few native americans wha...,@USER She should ask a few native Americans wh...,86426,OFF,UNT,
1,"[user user go home youre drunk!!!, user #maga ...",user user go home youre drunk!!! user #maga #t...,@USER @USER Go home you’re drunk!!! @USER #MAG...,90194,OFF,TIN,IND
2,[amazon is investigating chinese employees who...,amazon is investigating chinese employees who ...,Amazon is investigating Chinese employees who ...,16820,NOT,,
3,[user someone shouldvetaken this piece of shit...,user someone shouldvetaken this piece of shit ...,"@USER Someone should'veTaken"" this piece of sh...",62688,OFF,UNT,
4,[user user obama wanted liberals &amp; illegal...,user user obama wanted liberals &amp; illegals...,@USER @USER Obama wanted liberals &amp; illega...,43605,NOT,,


### Tokenização 

Tokenização do tweet.

Entrada: tweets['tweet_em_sentencas']<br/>
Saída: tweets['tweet_tokenizado']

In [241]:
import string as punctuation

nltk.download('punkt')
def tokeniza_sentenca(lista_sentencas):
  # tokenizer = TweetTokenizer()
  # #união das sentenças
  # sentencas_unidas = " ".join(w for w in lista_sentencas)
  # #tokenização das sentenças unidas
  # tokens = tokenizer.tokenize(sentencas_unidas)

  tokenizer = TweetTokenizer()
  tokens = []
  
  for sentenca in lista_sentencas:
    lista_tokens = tokenizer.tokenize(sentenca)
       
    sentenca_sem_stopword = []

    for token in lista_tokens:
      if token not in string.punctuation:
        sentenca_sem_stopword.append(token)

    tokens.append(sentenca_sem_stopword)

  # tweet = re.sub('[\'"‘’“”!…]', '', tweet)

  return tokens


# tokeniza_sentenca(tweets['tweet_em_sentencas'][2])

tweets['tweet_tokenizado'] = tweets['tweet_em_sentencas'].apply(tokeniza_sentenca)
tweets[tweets.columns[::-1]].head()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Unnamed: 0,tweet_tokenizado,tweet_em_sentencas,tweet_tratado,tweet,id,subtask_a,subtask_b,subtask_c
0,"[[user, she, should, ask, a, few, native, amer...",[user she should ask a few native americans wh...,user she should ask a few native americans wha...,@USER She should ask a few native Americans wh...,86426,OFF,UNT,
1,"[[user, user, go, home, youre, drunk], [user, ...","[user user go home youre drunk!!!, user #maga ...",user user go home youre drunk!!! user #maga #t...,@USER @USER Go home you’re drunk!!! @USER #MAG...,90194,OFF,TIN,IND
2,"[[amazon, is, investigating, chinese, employee...",[amazon is investigating chinese employees who...,amazon is investigating chinese employees who ...,Amazon is investigating Chinese employees who ...,16820,NOT,,
3,"[[user, someone, shouldvetaken, this, piece, o...",[user someone shouldvetaken this piece of shit...,user someone shouldvetaken this piece of shit ...,"@USER Someone should'veTaken"" this piece of sh...",62688,OFF,UNT,
4,"[[user, user, obama, wanted, liberals, illegal...",[user user obama wanted liberals &amp; illegal...,user user obama wanted liberals &amp; illegals...,@USER @USER Obama wanted liberals &amp; illega...,43605,NOT,,


### POS Tagger

Realiza a part of speech tagging do texto de cada token

Entrada: tweets['tweet_tokenizado']<br/>
Saída: tweets['tweet_POS_tagged']

In [243]:
from contextlib import redirect_stdout
import os

with redirect_stdout(open(os.devnull, "w")):
    nltk.download('averaged_perceptron_tagger')
# a função map aplica a funcao nltk.post_tag para cada lista contida da coluna tweet tokenizado
 
def pos_taggeador(lista_tokens):
  setenca_taggeada = []
  for lista in lista_tokens:
    setenca_taggeada.append(nltk.pos_tag(lista))

  return setenca_taggeada

                                                        #apply(nltk.pos) se a coluna for composta de lista de tokens
tweets['tweet_POS_tagged'] = tweets['tweet_tokenizado'].apply(pos_taggeador)#
tweets[tweets.columns[::-1]].head()

Unnamed: 0,tweet_POS_tagged,tweet_tokenizado,tweet_em_sentencas,tweet_tratado,tweet,id,subtask_a,subtask_b,subtask_c
0,"[[(user, IN), (she, PRP), (should, MD), (ask, ...","[[user, she, should, ask, a, few, native, amer...",[user she should ask a few native americans wh...,user she should ask a few native americans wha...,@USER She should ask a few native Americans wh...,86426,OFF,UNT,
1,"[[(user, NN), (user, NN), (go, VBP), (home, NN...","[[user, user, go, home, youre, drunk], [user, ...","[user user go home youre drunk!!!, user #maga ...",user user go home youre drunk!!! user #maga #t...,@USER @USER Go home you’re drunk!!! @USER #MAG...,90194,OFF,TIN,IND
2,"[[(amazon, NN), (is, VBZ), (investigating, VBG...","[[amazon, is, investigating, chinese, employee...",[amazon is investigating chinese employees who...,amazon is investigating chinese employees who ...,Amazon is investigating Chinese employees who ...,16820,NOT,,
3,"[[(user, NN), (someone, NN), (shouldvetaken, V...","[[user, someone, shouldvetaken, this, piece, o...",[user someone shouldvetaken this piece of shit...,user someone shouldvetaken this piece of shit ...,"@USER Someone should'veTaken"" this piece of sh...",62688,OFF,UNT,
4,"[[(user, RB), (user, JJ), (obama, NN), (wanted...","[[user, user, obama, wanted, liberals, illegal...",[user user obama wanted liberals &amp; illegal...,user user obama wanted liberals &amp; illegals...,@USER @USER Obama wanted liberals &amp; illega...,43605,NOT,,


### Chunking

Separação de cada sentença em chunks. 

Entrada: tweets['tweet_POS_tagged']<br/>
Saída: tweets['tweet_chunked']

In [None]:
from nltk.chunk import conlltags2tree, tree2conlltags

pattern = 'NP: {<DT>?<JJ>*<NN>}'
pattern1 = 'NP: {<DT>?<JJ>*<NN.*>*}'
pattern2 = 'NP: {<DT><NN.*><.*>*<NN.*>}'

def chunker(lista_tweets_pos_tagged):

  lista_saida = []

  pattern = 'NP: {<DT>?<JJ>*<NN>}'
  pattern1 = 'NP: {<DT>?<JJ>*<NN.*>*}'
  pattern2 = 'NP: {<DT><NN.*><.*>*<NN.*>}'


  for lista in lista_tweets_pos_tagged:
    cp = nltk.RegexpParser(pattern1)
    cs = cp.parse(lista)
    iob_tagged = tree2conlltags(cs)
    
    lista_saida.append(iob_tagged)
  return lista_saida


tweets['tweet_chunked'] = tweets['tweet_POS_tagged'].apply(chunker)

tweets[tweets.columns[::-1]].head()

### NER 

Realiza a reconhecimento de entidades, NER.

Entrada: tweets['tweet_POS_tagged']<br/>
Saída: tweets['tweet_NER']

In [245]:
from nltk.tag import pos_tag
from nltk.chunk import conlltags2tree, tree2conlltags
from pprint import pprint
from nltk.chunk.regexp import ChunkString, ChunkRule, ChinkRule 
from nltk.tree import Tree 
from contextlib import redirect_stdout
import os

with redirect_stdout(open(os.devnull, "w")):
    nltk.download('maxent_ne_chunker')
    nltk.download('words')

def ner(lista_tokens_taggeados):
  lista_tokens_ner = []
  for lista in lista_tokens_taggeados:
    lista_tokens_ner.append(nltk.ne_chunk(lista))

  return lista_tokens_ner


tweets['tweet_NER'] = tweets['tweet_POS_tagged'].apply(ner)
tweets[tweets.columns[::-1]].head()

Unnamed: 0,tweet_NER,tweet_chunked,tweet_POS_tagged,tweet_tokenizado,tweet_em_sentencas,tweet_tratado,tweet,id,subtask_a,subtask_b,subtask_c
0,"[[(user, IN), (she, PRP), (should, MD), (ask, ...","[[(user, IN, O), (she, PRP, O), (should, MD, O...","[[(user, IN), (she, PRP), (should, MD), (ask, ...","[[user, she, should, ask, a, few, native, amer...",[user she should ask a few native americans wh...,user she should ask a few native americans wha...,@USER She should ask a few native Americans wh...,86426,OFF,UNT,
1,"[[(user, NN), (user, NN), (go, VBP), (home, NN...","[[(user, NN, B-NP), (user, NN, I-NP), (go, VBP...","[[(user, NN), (user, NN), (go, VBP), (home, NN...","[[user, user, go, home, youre, drunk], [user, ...","[user user go home youre drunk!!!, user #maga ...",user user go home youre drunk!!! user #maga #t...,@USER @USER Go home you’re drunk!!! @USER #MAG...,90194,OFF,TIN,IND
2,"[[(amazon, NN), (is, VBZ), (investigating, VBG...","[[(amazon, NN, B-NP), (is, VBZ, O), (investiga...","[[(amazon, NN), (is, VBZ), (investigating, VBG...","[[amazon, is, investigating, chinese, employee...",[amazon is investigating chinese employees who...,amazon is investigating chinese employees who ...,Amazon is investigating Chinese employees who ...,16820,NOT,,
3,"[[(user, NN), (someone, NN), (shouldvetaken, V...","[[(user, NN, B-NP), (someone, NN, I-NP), (shou...","[[(user, NN), (someone, NN), (shouldvetaken, V...","[[user, someone, shouldvetaken, this, piece, o...",[user someone shouldvetaken this piece of shit...,user someone shouldvetaken this piece of shit ...,"@USER Someone should'veTaken"" this piece of sh...",62688,OFF,UNT,
4,"[[(user, RB), (user, JJ), (obama, NN), (wanted...","[[(user, RB, O), (user, JJ, B-NP), (obama, NN,...","[[(user, RB), (user, JJ), (obama, NN), (wanted...","[[user, user, obama, wanted, liberals, illegal...",[user user obama wanted liberals &amp; illegal...,user user obama wanted liberals &amp; illegals...,@USER @USER Obama wanted liberals &amp; illega...,43605,NOT,,


### Remoção de stop words

Remove da lista de tokens de cada tweet as stop words da língua inglesa e pontuações.

Entradas:<br/>
         * tweets['tweet_tokenizado']<br/>
         * tweets['tweet_ner']<br/>
         * tweets['tweet_chunked']<br/>

Saída:<br/>
         * tweets['tokens_sem_stopwords']<br/>
         * tweets['NER_sem_stopwords'] <br/>
         * tweets['chunks_sem_stopwords']<br/>



In [246]:
from contextlib import redirect_stdout
import os
# import string library function  
from string import punctuation
    


def remove_stop_words(lista_token_sentenca):
  '''Função de remoção de stop word que recebe lista de tokens e devolve
  lista de tokens
  '''
  with redirect_stdout(open(os.devnull, "w")):
    nltk.download("stopwords") 
    nltk.download('punkt')
  
  stopwords = sw.words('english')
  stop_words = set(stopwords + list(punctuation ))

  lista_saida = []

  for lista_tokens in lista_token_sentenca:
    tokens = [w for w in lista_tokens if not w in stop_words]
    lista_saida.append(tokens)


  return lista_saida

def remove_stop_words_tuplas(lista_tuplas_sentencas):
  '''Função de remoção de stop word que recebe lista de tuplas de token e tag e devolve
  lista de tuplas de token e tag
  '''
  with redirect_stdout(open(os.devnull, "w")):
    nltk.download("stopwords") 
    nltk.download('punkt')
  
  stopwords = sw.words('english')
  stop_words = set(stopwords + list(punctuation ))
  lista_saida = []
  for lista_tuplas in lista_tuplas_sentencas:
    tuplas = [w for w in lista_tuplas if not w[0] in stop_words]
    lista_saida.append(tuplas)
  return lista_saida

tweets['tokens_sem_stopwords'] = tweets['tweet_tokenizado'].apply(remove_stop_words)
tweets['NER_sem_stopwords'] = tweets['tweet_NER'].apply(remove_stop_words_tuplas)
tweets['chunks_sem_stopwords'] = tweets['tweet_chunked'].apply(remove_stop_words_tuplas)
tweets[tweets.columns[::-1]].head()

Unnamed: 0,chunks_sem_stopwords,NER_sem_stopwords,tokens_sem_stopwords,tweet_NER,tweet_chunked,tweet_POS_tagged,tweet_tokenizado,tweet_em_sentencas,tweet_tratado,tweet,id,subtask_a,subtask_b,subtask_c
0,"[[(user, IN, O), (ask, VB, O), (native, JJ, I-...","[[(user, IN), (ask, VB), (native, JJ), (americ...","[[user, ask, native, americans, take]]","[[(user, IN), (she, PRP), (should, MD), (ask, ...","[[(user, IN, O), (she, PRP, O), (should, MD, O...","[[(user, IN), (she, PRP), (should, MD), (ask, ...","[[user, she, should, ask, a, few, native, amer...",[user she should ask a few native americans wh...,user she should ask a few native americans wha...,@USER She should ask a few native Americans wh...,86426,OFF,UNT,
1,"[[(user, NN, B-NP), (user, NN, I-NP), (go, VBP...","[[(user, NN), (user, NN), (go, VBP), (home, NN...","[[user, user, go, home, youre, drunk], [user, ...","[[(user, NN), (user, NN), (go, VBP), (home, NN...","[[(user, NN, B-NP), (user, NN, I-NP), (go, VBP...","[[(user, NN), (user, NN), (go, VBP), (home, NN...","[[user, user, go, home, youre, drunk], [user, ...","[user user go home youre drunk!!!, user #maga ...",user user go home youre drunk!!! user #maga #t...,@USER @USER Go home you’re drunk!!! @USER #MAG...,90194,OFF,TIN,IND
2,"[[(amazon, NN, B-NP), (investigating, VBG, O),...","[[(amazon, NN), (investigating, VBG), (chinese...","[[amazon, investigating, chinese, employees, s...","[[(amazon, NN), (is, VBZ), (investigating, VBG...","[[(amazon, NN, B-NP), (is, VBZ, O), (investiga...","[[(amazon, NN), (is, VBZ), (investigating, VBG...","[[amazon, is, investigating, chinese, employee...",[amazon is investigating chinese employees who...,amazon is investigating chinese employees who ...,Amazon is investigating Chinese employees who ...,16820,NOT,,
3,"[[(user, NN, B-NP), (someone, NN, I-NP), (shou...","[[(user, NN), (someone, NN), (shouldvetaken, V...","[[user, someone, shouldvetaken, piece, shit, v...","[[(user, NN), (someone, NN), (shouldvetaken, V...","[[(user, NN, B-NP), (someone, NN, I-NP), (shou...","[[(user, NN), (someone, NN), (shouldvetaken, V...","[[user, someone, shouldvetaken, this, piece, o...",[user someone shouldvetaken this piece of shit...,user someone shouldvetaken this piece of shit ...,"@USER Someone should'veTaken"" this piece of sh...",62688,OFF,UNT,
4,"[[(user, RB, O), (user, JJ, B-NP), (obama, NN,...","[[(user, RB), (user, JJ), (obama, NN), (wanted...","[[user, user, obama, wanted, liberals, illegal...","[[(user, RB), (user, JJ), (obama, NN), (wanted...","[[(user, RB, O), (user, JJ, B-NP), (obama, NN,...","[[(user, RB), (user, JJ), (obama, NN), (wanted...","[[user, user, obama, wanted, liberals, illegal...",[user user obama wanted liberals &amp; illegal...,user user obama wanted liberals &amp; illegals...,@USER @USER Obama wanted liberals &amp; illega...,43605,NOT,,


<b> Fim da atividade 01 </b>

Tem-se como principais entregas as colunas tweets['tokens_sem_stopwords'] e tweets['NER_sem_stopwords'] do dataset tweets.

In [247]:
tweets[['NER_sem_stopwords','chunks_sem_stopwords']].head()

Unnamed: 0,NER_sem_stopwords,chunks_sem_stopwords
0,"[[(user, IN), (ask, VB), (native, JJ), (americ...","[[(user, IN, O), (ask, VB, O), (native, JJ, I-..."
1,"[[(user, NN), (user, NN), (go, VBP), (home, NN...","[[(user, NN, B-NP), (user, NN, I-NP), (go, VBP..."
2,"[[(amazon, NN), (investigating, VBG), (chinese...","[[(amazon, NN, B-NP), (investigating, VBG, O),..."
3,"[[(user, NN), (someone, NN), (shouldvetaken, V...","[[(user, NN, B-NP), (someone, NN, I-NP), (shou..."
4,"[[(user, RB), (user, JJ), (obama, NN), (wanted...","[[(user, RB, O), (user, JJ, B-NP), (obama, NN,..."


base externa: wordnet lemmatizer?
Não deixar palavras de meio de sentença em minúsculo pois podem ser entidades
identificar "typos"
Remover n-grams de alta frequência (não adicionam informação), e de  baixa frequência com erros (para prevenir overfit)

## Atividade 02

### Bag of wods: unigramas e bigramas

entrada: tweets sem stopwords
saida: uma coluna para os bigramas por tweet

In [248]:
from sklearn.feature_extraction.text import CountVectorizer
#passado como argumento para evitar outro preprocessamento pelo objeto de contagem
def preprocessador_nulo(texto):
  return texto
#Para não realizar outra tokenização e remover as hashtags e emojis
def tokenizador_nulo(texto):
  return texto.split(" ")

def ngrams_por_tweet(lista_tokens):

  texto = ""

  for sentenca in lista_tokens:

    texto = texto + " ".join(x for x in sentenca)
  
  # print(texto)

  cv = CountVectorizer(tokenizer = tokenizador_nulo, ngram_range= (1,2))
  bow = cv.fit_transform([texto])

  dicionario = dict(zip(cv.get_feature_names(),bow.toarray().sum(axis=0)))

  return dicionario

tweets['ngrams_por_tweet'] =  tweets['tokens_sem_stopwords'].apply(ngrams_por_tweet)

In [249]:
tweets[tweets.columns[::-1]].head()

Unnamed: 0,ngrams_por_tweet,chunks_sem_stopwords,NER_sem_stopwords,tokens_sem_stopwords,tweet_NER,tweet_chunked,tweet_POS_tagged,tweet_tokenizado,tweet_em_sentencas,tweet_tratado,tweet,id,subtask_a,subtask_b,subtask_c
0,"{'americans': 1, 'americans take': 1, 'ask': 1...","[[(user, IN, O), (ask, VB, O), (native, JJ, I-...","[[(user, IN), (ask, VB), (native, JJ), (americ...","[[user, ask, native, americans, take]]","[[(user, IN), (she, PRP), (should, MD), (ask, ...","[[(user, IN, O), (she, PRP, O), (should, MD, O...","[[(user, IN), (she, PRP), (should, MD), (ask, ...","[[user, she, should, ask, a, few, native, amer...",[user she should ask a few native americans wh...,user she should ask a few native americans wha...,@USER She should ask a few native Americans wh...,86426,OFF,UNT,
1,"{'#maga': 1, '#maga #trump2020': 1, '#trump202...","[[(user, NN, B-NP), (user, NN, I-NP), (go, VBP...","[[(user, NN), (user, NN), (go, VBP), (home, NN...","[[user, user, go, home, youre, drunk], [user, ...","[[(user, NN), (user, NN), (go, VBP), (home, NN...","[[(user, NN, B-NP), (user, NN, I-NP), (go, VBP...","[[(user, NN), (user, NN), (go, VBP), (home, NN...","[[user, user, go, home, youre, drunk], [user, ...","[user user go home youre drunk!!!, user #maga ...",user user go home youre drunk!!! user #maga #t...,@USER @USER Go home you’re drunk!!! @USER #MAG...,90194,OFF,TIN,IND
2,"{'#china': 1, '#china #tcot': 1, '#kag': 1, '#...","[[(amazon, NN, B-NP), (investigating, VBG, O),...","[[(amazon, NN), (investigating, VBG), (chinese...","[[amazon, investigating, chinese, employees, s...","[[(amazon, NN), (is, VBZ), (investigating, VBG...","[[(amazon, NN, B-NP), (is, VBZ, O), (investiga...","[[(amazon, NN), (is, VBZ), (investigating, VBG...","[[amazon, is, investigating, chinese, employee...",[amazon is investigating chinese employees who...,amazon is investigating chinese employees who ...,Amazon is investigating Chinese employees who ...,16820,NOT,,
3,"{'piece': 1, 'piece shit': 1, 'shit': 1, 'shit...","[[(user, NN, B-NP), (someone, NN, I-NP), (shou...","[[(user, NN), (someone, NN), (shouldvetaken, V...","[[user, someone, shouldvetaken, piece, shit, v...","[[(user, NN), (someone, NN), (shouldvetaken, V...","[[(user, NN, B-NP), (someone, NN, I-NP), (shou...","[[(user, NN), (someone, NN), (shouldvetaken, V...","[[user, someone, shouldvetaken, this, piece, o...",[user someone shouldvetaken this piece of shit...,user someone shouldvetaken this piece of shit ...,"@USER Someone should'veTaken"" this piece of sh...",62688,OFF,UNT,
4,"{'illegals': 1, 'illegals move': 1, 'liberals'...","[[(user, RB, O), (user, JJ, B-NP), (obama, NN,...","[[(user, RB), (user, JJ), (obama, NN), (wanted...","[[user, user, obama, wanted, liberals, illegal...","[[(user, RB), (user, JJ), (obama, NN), (wanted...","[[(user, RB, O), (user, JJ, B-NP), (obama, NN,...","[[(user, RB), (user, JJ), (obama, NN), (wanted...","[[user, user, obama, wanted, liberals, illegal...",[user user obama wanted liberals &amp; illegal...,user user obama wanted liberals &amp; illegals...,@USER @USER Obama wanted liberals &amp; illega...,43605,NOT,,


### Consulta de bases externas

#### Busca de sinônimos e antônimos

In [None]:
#função para descobrir os sinônimos dos tokens da colunas "tokens_sem_stopwords"

import nltk
nltk.download('punkt')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn


def busca_sinonimos_antonimos(lista_sentencas_tokenizadas):
  dicionario_sinonimos = dict()
  dicionario_antonimos = dict()

  for sent in lista_sentencas_tokenizadas:
    for palavra in sent:
      sinonimos = []
      antonimos = []
      for syn  in wn.synsets(palavra):
        for l in syn.lemmas():
          if l.name() not in sinonimos:
            sinonimos.append(l.name()) 
          if l.antonyms():
              antonimos.append(l.antonyms()[0].name())
      if len(sinonimos) > 0:
        dicionario_sinonimos[palavra] = sinonimos
      if len(antonimos) > 0:
        dicionario_antonimos[palavra] = antonimos
    
  return dicionario_sinonimos, dicionario_antonimos



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
tweets['sinonimos_antonimos'] = tweets['tokens_sem_stopwords'].apply(busca_sinonimos_antonimos)
tweets['sinonimos_antonimos'] 

0        ({'ask': ['ask', 'inquire', 'enquire', 'requir...
1        ({'go': ['go', 'spell', 'tour', 'turn', 'Adam'...
2        ({'amazon': ['amazon', 'virago', 'Amazon', 'Am...
3        ({'someone': ['person', 'individual', 'someone...
4        ({'wanted': ['desire', 'want', 'need', 'requir...
                               ...                        
13235    ({'sometimes': ['sometimes'], 'get': ['get', '...
13236    ({'shabby': ['moth-eaten', 'ratty', 'shabby', ...
13237    ({'report': ['report', 'study', 'written_repor...
13238    ({'pussy': ['cunt', 'puss', 'pussy', 'slit', '...
13239    ({'vs': ['volt', 'V', 'vanadium', 'atomic_numb...
Name: sinonimos_antonimos, Length: 13207, dtype: object

#### Empath

In [286]:
# !pip install empath
from empath import Empath
lexicon = Empath()


tweets['classific_empath'] = tweets["tweet_tratado"].apply(lexicon.analyze)


### Recuperação dos word embeddings

In [276]:
# from gensim.models.doc2vec import TaggedDocument, Doc2Vec
from gensim.models.word2vec import Word2Vec

def word_embedding(lista_sentencas_tokenizadas):
  dicionario_saida = dict()
  model = Word2Vec(lista_sentencas_tokenizadas, min_count=1,size= 50,workers=3, window =2, sg = 1)
  for palavra in model.wv.vocab:
    dicionario_saida[palavra] = model[palavra]

  return dicionario_saida


# word_embedding(tweets['tweet_tokenizado'].apply(lambda l: [item for sublist in l for item in sublist]))#.apply(" ".join))
# tweets['tweet_embeddings'] = tweets['tweet_tokenizado'].apply(lambda l: [item for sublist in l for item in sublist]).apply(word_embedding)
tweets['tweet_embeddings'] = tweets['tweet_tokenizado'].apply(word_embedding)

  


In [279]:
#exemplo dos vetores de cada tweet
print(tweets['tweet_tokenizado'][2])
print(tweets['tweet_embeddings'][2])

[['amazon', 'is', 'investigating', 'chinese', 'employees', 'who', 'are', 'selling', 'internal', 'data', 'to', 'third-party', 'sellers', 'looking', 'for', 'an', 'edge', 'in', 'the', 'competitive', 'marketplace'], ['#amazon', '#maga', '#kag', '#china', '#tcot']]
{'amazon': array([ 0.00708107, -0.00160974, -0.00302817, -0.00557131,  0.00803064,
       -0.00126878,  0.00032594, -0.0044409 , -0.003109  ,  0.00228831,
       -0.00240538,  0.00445926,  0.00242291, -0.00491508,  0.00508993,
       -0.00803601,  0.00654048, -0.00555123,  0.00293973,  0.0090288 ,
       -0.00339887,  0.00253844,  0.00222103,  0.00992467,  0.00517678,
        0.0073553 ,  0.00747471, -0.00740749, -0.00537205,  0.00851592,
        0.00139181,  0.00637761,  0.00542973, -0.0092469 ,  0.00479682,
        0.00642271, -0.00557764, -0.00363793,  0.00676623,  0.00828977,
       -0.00604183,  0.00758583,  0.00558285,  0.00663622,  0.00286959,
       -0.00754548, -0.00605596, -0.00688615, -0.00524256,  0.00821458],
      d

In [281]:
tweets.tweet_tratado[1]

'user user go home youre drunk!!! user #maga #trump2020 👊🇺🇸👊 '

### Vetorização por tfidf


In [265]:
flatten = lambda l: [item for sublist in l for item in sublist]
tweets['tweet_tokenizado'].apply(lambda l: [item for sublist in l for item in sublist]).apply(" ".join)

0        user she should ask a few native americans wha...
1        user user go home youre drunk user #maga #trum...
2        amazon is investigating chinese employees who ...
3        user someone shouldvetaken this piece of shit ...
4        user user obama wanted liberals illegals to mo...
                               ...                        
13235    user sometimes i get strong vibes from people ...
13236    benidorm ✅ creamfields ✅ maga ✅ not too shabby...
13237    user and why report this garbage we dont give ...
13238                                           user pussy
13239    #spanishrevenge vs #justice #humanrights and #...
Name: tweet_tokenizado, Length: 13207, dtype: object

In [266]:
from sklearn.feature_extraction.text import TfidfVectorizer


tokenizador = TweetTokenizer()
#é utilizado o mesmo tokenizador para o processo de vetorização
cv = TfidfVectorizer(tokenizer = tokenizador.tokenize, ngram_range= (1,1))

vetorizacao_unigram = cv.fit_transform(tweets['tweet_tokenizado'].apply(lambda l: [item for sublist in l for item in sublist]).apply(" ".join))


In [267]:
dataframe_vetorizacao_unigram = pd.DataFrame(vetorizacao_unigram.toarray(), columns= cv.get_feature_names())

In [268]:
dataframe_vetorizacao_unigram.head()

Unnamed: 0,#100thmonkey,#102,#10millionsubscribers,#12,#180,#18n18,#1950sbornwomen,#1950swomen,#1a,#1ab,#1linewed,#1standlast,#1worldonlines,#2019loancharge,#2020,#2020maga,#2a,#2adefenders,#2ashallnotbeinfringed,#2birdsofafeather,#4-reds,#405,#51,#60minutes,#80s,#8217,#88,#a2,#a8,#aba,#abcnews,#abetterway,#ableg,#abortion,#ac360,#accountability,#activist,#adamandeve,#adelaide,#adiya,...,🚶,🛑,🛵,🛸,🤐,🤑,🤒,🤔,🤖,🤗,🤙,🤞,🤟,🤠,🤡,🤢,🤣,🤤,🤥,🤦,🤧,🤨,🤩,🤪,🤫,🤬,🤭,🤮,🤯,🤷,🥀,🥂,🦁,🦅,🦇,🦊,🧐,🧟,🧠,🧡
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


0        [user, she, should, ask, a, few, native, ameri...
1        [user, user, go, home, youre, drunk, user, #ma...
2        [amazon, is, investigating, chinese, employees...
3        [user, someone, shouldvetaken, this, piece, of...
4        [user, user, obama, wanted, liberals, illegals...
                               ...                        
13235    [user, sometimes, i, get, strong, vibes, from,...
13236    [benidorm, ✅, creamfields, ✅, maga, ✅, not, to...
13237    [user, and, why, report, this, garbage, we, do...
13238                                        [user, pussy]
13239    [#spanishrevenge, vs, #justice, #humanrights, ...
Name: tweet_tokenizado, Length: 13207, dtype: object
['user', 'liberals', 'are', 'all', 'kookoo']


In [269]:
from sklearn.feature_extraction.text import TfidfVectorizer

tokenizador = TweetTokenizer()

cv2 = TfidfVectorizer(tokenizer = tokenizador.tokenize, ngram_range= (2,2))

vetorizacao_bigram = cv2.fit_transform(tweets['tweet_tokenizado'].apply(lambda l: [item for sublist in l for item in sublist]).apply(" ".join))

In [None]:
#A visualização em dataframe não é possível por limitação de memória ram

#dataframe_vetorizacao_bigram = pd.DataFrame(vetorizacao_bigram.toarray(), columns= cv2.get_feature_names())

In [270]:
vetorizacao_bigram

<13207x130705 sparse matrix of type '<class 'numpy.float64'>'
	with 265934 stored elements in Compressed Sparse Row format>

## rascunhos

In [None]:
for sent in tweets['tokens_sem_stopwords'][0]:
  for palavra in sent:
    print(palavra)
    if len(wn.synsets(palavra))>0:
      print(wn.synsets(palavra)[0].hypernyms()[0].name())
    

def busca_hiperonimos(lista_sentencas_tokenizadas):
  dicionario_sinonimos = dict()
  dicionario_antonimos = dict()

  for sent in lista_sentencas_tokenizadas:
    for palavra in sent:
      sinonimos = []
      antonimos = []
      for syn  in wn.synsets(palavra):
        for l in syn.lemmas():
          if l.name() not in sinonimos:
            sinonimos.append(l.name()) 
          if l.antonyms():
              antonimos.append(l.antonyms()[0].name())
      if len(sinonimos) > 0:
        dicionario_sinonimos[palavra] = sinonimos
      if len(antonimos) > 0:
        dicionario_antonimos[palavra] = antonimos
    
  return dicionario_sinonimos, dicionario_antonimos



@user
ask
communicate.v.02
native
person.n.01
americans
inhabitant.n.01
take
income.n.01


In [None]:
import unicodedata
from sklearn.base import BaseEstimator, TransformerMixin

class TextNormalizer(BaseEstimator, TransformerMixin):
  def __init__(self, language='english'):
    self.stopwords = set(nltk.corpus.stopwords.words(language))
    self.lemmatizer = WordNetLemmatizer()
  
  def is_punct(self, token):
    return all(
    unicodedata.category(char).startswith('P') for char in token)
  def is_stopword(self, token):
    return token.lower() in self.stopwords

  def normalize(self, document):

    return [
    self.lemmatize(token, tag).lower()
    for paragraph in document
    for sentence in paragraph
    for (token, tag) in sentence
    if not self.is_punct(token) and not self.is_stopword(token)
    ]


def lemmatize(self, token, pos_tag):
  tag = {
  'N': wn.NOUN,
  'V': wn.VERB,
  'R': wn.ADV,
  'J': wn.ADJ
  }.get(pos_tag[0], wn.NOUN)
  return self.lemmatizer.lemmatize(token, tag)

def fit(self, X, y=None):
  return self
def transform(self, documents):
  for document in documents:
    yield self.normalize(document)
