<a href="https://colab.research.google.com/github/eduardodut/Mineracao_dados_textos_web/blob/master/Projeto01_min_texto_web.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<b> TEAM: </b>
  - Eduardo Façanha
  - Giovanni Brígido
  - Maurício Brito

<b> ACTIVITY 01 </b> - Text preprocessing (Deadline: 11 May 2020 - 30%)

- Tokenization
- Lemmatization
- POS tagging
- Normalization (hashtags, mentions, emojis and special symbols)
- NER (named entity recognition)
- Stop-word removal

<b> ACTIVITY 02 </b> - Semantic representation (Deadline: 22 June 2020 - 30%)

- Use of external knowledge bases
- Topic identification
- Vector representation of words and texts

<b> ACTIVITY 03 </b> - Offensive language analysis - Subtasks A and B (Deadline: 27 July 2020 - 40%)

- Result of subtask A on a test set to be provided
- Result of subtask B on a test set to be provided


<b> ACTIVITY 01 (partial)</b>
  - Reading the txt file (handling each tweet separately)
  - Handling of hashtags, mentions, emojis and special symbols, values, dates
  - Sentence splitting and tokenization
  - Stop-word removal


<b> Loading the data file and converting it to a DataFrame </b>

The file is downloaded and a DataFrame is instantiated from its data. The DataFrame variable is named 'tweets'.

In [246]:
import pandas as pd
# download the data file from the project repository
!curl --remote-name \
    -H 'Accept: application/vnd.github.v3.raw' \
    --location https://raw.githubusercontent.com/eduardodut/Mineracao_dados_textos_web/master/datasets/olid-training-v1.0.tsv

# read the tab-separated file into a DataFrame
tweets = pd.read_csv('/content/olid-training-v1.0.tsv', sep='\t', encoding='utf-8')

# convert the 'id' column from integer to string
tweets['id'] = tweets['id'].astype('str')

# show the first rows
tweets.head()



Unnamed: 0,id,tweet,subtask_a,subtask_b,subtask_c
0,86426,@USER She should ask a few native Americans wh...,OFF,UNT,
1,90194,@USER @USER Go home you’re drunk!!! @USER #MAG...,OFF,TIN,IND
2,16820,Amazon is investigating Chinese employees who ...,NOT,,
3,62688,"@USER Someone should'veTaken"" this piece of sh...",OFF,UNT,
4,43605,@USER @USER Obama wanted liberals &amp; illega...,NOT,,


In [0]:
# drop the columns whose names start with 'sub' (the subtask labels)
tweets.drop(tweets.columns[tweets.columns.str.startswith('sub')], axis=1, inplace=True)

In [248]:
# check for duplicate tweets and remove them, keeping the first occurrence
if tweets.duplicated(['tweet']).sum() > 0:
  tweets.drop_duplicates(subset='tweet', keep='first', inplace=True)

print('DUPLICATE VALUES: ', tweets.duplicated(['tweet']).sum())

DUPLICATE VALUES:  0
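
To make the duplicate check easier to follow, here is a minimal sketch on a toy DataFrame. The rows and the names `df`, `n_duplicadas` and `df_unico` are made up for illustration and are not part of the project data.

```python
import pandas as pd

# Toy data standing in for the 'tweet' column (hypothetical values)
df = pd.DataFrame({
    "id": ["1", "2", "3", "4"],
    "tweet": ["good morning", "hello world", "good morning", "hello world"],
})

# duplicated() flags every repeat after the first occurrence
n_duplicadas = int(df.duplicated(["tweet"]).sum())
print("duplicates before:", n_duplicadas)

# keep='first' preserves the earliest row of each repeated tweet
df_unico = df.drop_duplicates(subset="tweet", keep="first")
print(df_unico["id"].tolist())
```

Returning a new DataFrame instead of using `inplace=True` leaves the original rows available for inspection, which can help when checking what was dropped.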


<b> Initial text treatment </b>

Converts the text of each tweet, individually, to lowercase, removes user mentions, line breaks, curly quotes and ellipses, and collapses extra spaces and tabs. The result is stored in a new column of the DataFrame tweets.

In [261]:
from nltk.tokenize import TweetTokenizer, sent_tokenize
import re
import string
from nltk.corpus import stopwords as sw

def tratamento_texto(tweet):

  tweet = tweet.lower()

  # remove user mentions (anonymized as @USER in the dataset)
  tweet = re.sub(r'@user', '', tweet)

  # remove line breaks
  tweet = re.sub('\n', '', tweet)
  # replace tabs with a single space
  tweet = re.sub('\t', ' ', tweet)
  # collapse runs of whitespace into a single space
  tweet = re.sub(r'\s+', ' ', tweet)

  # remove curly quotes, curly apostrophes and ellipses
  tweet = re.sub('[‘’“”…]', '', tweet)
  # trim last, so the space left behind by a leading mention is removed too
  return tweet.strip()

# create a new column in the 'tweets' DataFrame with the treated text of each tweet
tweets['tweet_tratado'] = tweets['tweet'].apply(tratamento_texto)
tweets[tweets.columns[::-1]].head()

Unnamed: 0,tweet_tratado,tweet,id
0,she should ask a few native americans what th...,@USER She should ask a few native Americans wh...,86426
1,go home youre drunk!!! #maga #trump2020 👊🇺🇸👊 url,@USER @USER Go home you’re drunk!!! @USER #MAG...,90194
2,amazon is investigating chinese employees who ...,Amazon is investigating Chinese employees who ...,16820
3,"someone should'vetaken"" this piece of shit to...","@USER Someone should'veTaken"" this piece of sh...",62688
4,obama wanted liberals &amp; illegals to move ...,@USER @USER Obama wanted liberals &amp; illega...,43605
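
The treated text can still contain HTML entities such as `&amp;`. One possible way to handle them, sketched here as an assumption rather than as part of the project's pipeline, is to decode entities with the standard library's `html.unescape` before the other steps. The helper name `normaliza_tweet` and the sample string are hypothetical.

```python
import html
import re

def normaliza_tweet(tweet):
    """Hypothetical helper, not part of the notebook's pipeline."""
    # decode HTML entities such as &amp; before any other step
    tweet = html.unescape(tweet)
    tweet = tweet.lower()
    # drop the anonymized @user mentions
    tweet = re.sub(r'@user', '', tweet)
    # remove curly quotes, curly apostrophes and ellipses
    tweet = re.sub('[‘’“”…]', '', tweet)
    # collapse any run of whitespace (spaces, tabs, newlines) and trim the ends
    tweet = re.sub(r'\s+', ' ', tweet).strip()
    return tweet

# made-up sample, loosely based on the tweets shown above
exemplo = "@USER @USER Obama wanted liberals &amp; illegals\tto move"
resultado = normaliza_tweet(exemplo)
print(resultado)
```

Collapsing whitespace with a single `\s+` substitution at the end also covers tabs and line breaks, so separate substitutions for them are not needed in this sketch.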


<b> Sentence splitting </b>

Splits each tweet into sentences and stores them in a new column of the DataFrame.

In [262]:
import nltk
# sent_tokenize needs the Punkt models; download them once before applying it
nltk.download('punkt', quiet=True)

def separa_sentencas(tweet):
  # split the tweet into sentences and trim each one
  return [sent.strip() for sent in sent_tokenize(tweet)]

tweets['tweet_em_sentencas'] = tweets['tweet_tratado'].apply(separa_sentencas)
tweets[tweets.columns[::-1]].head()

Unnamed: 0,tweet_em_sentencas,tweet_tratado,tweet,id
0,[she should ask a few native americans what th...,she should ask a few native americans what th...,@USER She should ask a few native Americans wh...,86426
1,"[go home youre drunk!!!, #maga #trump2020 👊🇺🇸👊...",go home youre drunk!!! #maga #trump2020 👊🇺🇸👊 url,@USER @USER Go home you’re drunk!!! @USER #MAG...,90194
2,[amazon is investigating chinese employees who...,amazon is investigating chinese employees who ...,Amazon is investigating Chinese employees who ...,16820
3,"[someone should'vetaken"" this piece of shit to...","someone should'vetaken"" this piece of shit to...","@USER Someone should'veTaken"" this piece of sh...",62688
4,[obama wanted liberals &amp; illegals to move ...,obama wanted liberals &amp; illegals to move ...,@USER @USER Obama wanted liberals &amp; illega...,43605


<b> Tokenization </b>

Joins the sentences into a single string and tokenizes the tweet. The result is stored in a new column.

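In [263]:
# tokenizer aware of Twitter conventions (hashtags, mentions, emoticons)
tokenizer = TweetTokenizer()

def tokeniza_sentenca(lista_sentencas):

  # join the sentences back into a single string before tokenizing
  sentencas_unidas = " ".join(lista_sentencas)
  tokens = tokenizer.tokenize(sentencas_unidas)

  return tokens

tweets['tweet_tokenizado'] = tweets['tweet_em_sentencas'].apply(tokeniza_sentenca)
tweets[tweets.columns[::-1]].head()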

Unnamed: 0,tweet_tokenizado,tweet_em_sentencas,tweet_tratado,tweet,id
0,"[she, should, ask, a, few, native, americans, ...",[she should ask a few native americans what th...,she should ask a few native americans what th...,@USER She should ask a few native Americans wh...,86426
1,"[go, home, youre, drunk, !, !, !, #maga, #trum...","[go home youre drunk!!!, #maga #trump2020 👊🇺🇸👊...",go home youre drunk!!! #maga #trump2020 👊🇺🇸👊 url,@USER @USER Go home you’re drunk!!! @USER #MAG...,90194
2,"[amazon, is, investigating, chinese, employees...",[amazon is investigating chinese employees who...,amazon is investigating chinese employees who ...,Amazon is investigating Chinese employees who ...,16820
3,"[someone, should'vetaken, "", this, piece, of, ...","[someone should'vetaken"" this piece of shit to...","someone should'vetaken"" this piece of shit to...","@USER Someone should'veTaken"" this piece of sh...",62688
4,"[obama, wanted, liberals, &, illegals, to, mov...",[obama wanted liberals &amp; illegals to move ...,obama wanted liberals &amp; illegals to move ...,@USER @USER Obama wanted liberals &amp; illega...,43605
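
As a small illustration of why a tweet-aware tokenizer is used instead of plain whitespace splitting, the sketch below tokenizes one short string both ways. The sample text is adapted from the table above, and the variable names are chosen only for this example.

```python
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()

texto = "go home youre drunk!!! #maga #trump2020"

# whitespace splitting keeps the punctuation glued to the word
por_espaco = texto.split()
# TweetTokenizer separates the punctuation marks but keeps hashtags whole
tokens = tokenizer.tokenize(texto)

print(por_espaco)
print(tokens)
```

`TweetTokenizer` is rule-based and does not need any downloaded NLTK data, unlike `sent_tokenize`.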


<b> Stop-word removal </b>

Removes English stop words and punctuation marks from each tweet's token list.

In [264]:
import nltk
from string import punctuation

# download the stop-word list once, instead of on every call
nltk.download('stopwords', quiet=True)

# English stop words plus the ASCII punctuation marks
stop_words = set(sw.words('english') + list(punctuation))

def remove_stop_words(lista_tokens):
  # keep only the tokens that are neither stop words nor punctuation
  return [w for w in lista_tokens if w not in stop_words]

tweets['tokens_sem_stopwords'] = tweets['tweet_tokenizado'].apply(remove_stop_words)
tweets[tweets.columns[::-1]].head()

Unnamed: 0,tokens_sem_stopwords,tweet_tokenizado,tweet_em_sentencas,tweet_tratado,tweet,id
0,"[ask, native, americans, take]","[she, should, ask, a, few, native, americans, ...",[she should ask a few native americans what th...,she should ask a few native americans what th...,@USER She should ask a few native Americans wh...,86426
1,"[go, home, youre, drunk, #maga, #trump2020, 👊,...","[go, home, youre, drunk, !, !, !, #maga, #trum...","[go home youre drunk!!!, #maga #trump2020 👊🇺🇸👊...",go home youre drunk!!! #maga #trump2020 👊🇺🇸👊 url,@USER @USER Go home you’re drunk!!! @USER #MAG...,90194
2,"[amazon, investigating, chinese, employees, se...","[amazon, is, investigating, chinese, employees...",[amazon is investigating chinese employees who...,amazon is investigating chinese employees who ...,Amazon is investigating Chinese employees who ...,16820
3,"[someone, should'vetaken, piece, shit, volcano...","[someone, should'vetaken, "", this, piece, of, ...","[someone should'vetaken"" this piece of shit to...","someone should'vetaken"" this piece of shit to...","@USER Someone should'veTaken"" this piece of sh...",62688
4,"[obama, wanted, liberals, illegals, move, red,...","[obama, wanted, liberals, &, illegals, to, mov...",[obama wanted liberals &amp; illegals to move ...,obama wanted liberals &amp; illegals to move ...,@USER @USER Obama wanted liberals &amp; illega...,43605
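
Note that `string.punctuation` holds only ASCII punctuation, so marks such as the curly apostrophe `’` pass through a filter built from it. One possible alternative, sketched here as an assumption and not as the project's method, is to test each token's Unicode category with the standard library's `unicodedata`. The helper `e_pontuacao` and the tiny stop-word set are hypothetical.

```python
import unicodedata
from string import punctuation

def e_pontuacao(token):
    """Hypothetical helper: True when every character is Unicode punctuation."""
    # general categories starting with 'P' cover ASCII and non-ASCII punctuation
    return bool(token) and all(
        unicodedata.category(ch).startswith('P') for ch in token
    )

# illustrative stop-word set; the notebook uses NLTK's English list instead
stop_words_exemplo = {'you', 're'}

tokens = ['go', 'home', 'you', '’', 're', 'drunk', '!', '#maga']
filtrados = [t for t in tokens
             if t not in stop_words_exemplo and not e_pontuacao(t)]

# the curly apostrophe is not part of string.punctuation
print('’' in punctuation)
print(filtrados)
```

Hashtags such as `#maga` are kept because only tokens made up entirely of punctuation characters are dropped.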
