# Natural Language Processing - Answer Key

The goal of Natural Language Processing (NLP) is to give computers the ability to understand and compose text. “Understanding” a text means recognizing its context; performing syntactic, semantic, lexical, and morphological analysis; producing summaries; extracting information; interpreting meaning; analyzing sentiment; and even learning concepts from the processed texts.


In this notebook, we explore two classic NLP problems: `text classification` and `topic clustering`.

### Helper Libraries

In [21]:
import pandas as pd
from re import sub

from numpy import asarray
import matplotlib.pyplot as plt

from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer 
from nltk import download

from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import TfidfVectorizer

from warnings import filterwarnings

filterwarnings('ignore')
download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/thaisalmeida/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Practice I: Fake News Identification

<img src="https://media.giphy.com/media/26n6ziTEeDDbowBkQ/giphy.gif"/>

### Loading the Dataset

In [12]:
fake_set = pd.read_csv('datasets/fakenews_silverman.csv')
real_set = pd.read_csv('datasets/realnews_silverman.csv')

In [13]:
print(f'|fake news| = {fake_set.shape[0]} samples \n|legitimate news| = {real_set.shape[0]} samples')

|fake news| = 467 samples 
|legitimate news| = 467 samples


In [14]:
fake_set.head(3)

Unnamed: 0,headline,main_content,label
0,AUSTRALIA: 600-POUND WOMAN GIVES BIRTH TO 40-P...,Perth | A 600-pound woman has given birth to a...,0
1,Jonathan S. Geller,Apple has been hard at work on multiple upcomi...,0
2,Amazon Is Opening a Brick-and-Mortar Store in ...,"Amazon, the cyber store that sells everything,...",0


In [15]:
real_set.head(3)

Unnamed: 0,headline,main_content,label
0,Apple’s next major Mac revealed: the radically...,Apple is preparing an all-new MacBook Air for ...,1
1,Report: A Radically Redesigned 12-Inch MacBook...,Everyone's been waiting years and years for a ...,1
2,Apple may launch 12-inch MacBook Air with Reti...,Apple would never lower itself to rubbing elbo...,1


In [16]:
news_list = pd.concat([fake_set['headline'],real_set['headline']], axis=0, ignore_index=True)
target_list = pd.concat([fake_set['label'],real_set['label']], axis=0, ignore_index=True)

print(f'|corpus| = {news_list.shape[0]} samples')

|corpus| = 934 samples


### Data Cleaning + Feature Engineering

In [17]:
def remove_stopwords_and_normalize(doc_text, stopwords_hash):
    """Lowercase each token, drop stopwords, and stem the remaining tokens."""
    content = []
    stemmer = PorterStemmer()

    for word in doc_text:
        word_clean = word.lower().strip()
        if word_clean not in stopwords_hash:
            content.append(stemmer.stem(word_clean))
    return content

def tokenizer(text):
    """Split text into tokens made of word characters."""
    return RegexpTokenizer(r'\w+').tokenize(text)

def data_cleaning(news_list, target_list):
    """Clean each headline and keep it, with its label, only if text remains."""
    X_clean, Y_clean = [], []

    # Dict keys give constant-time stopword lookups
    stopwords_dict = {word: 0 for word in stopwords.words('english')}

    for idx, news in enumerate(news_list):
        text = sub(r'[^\w\s]', ' ', news)  # replace punctuation with spaces
        text = sub(r'\d', ' ', text)       # replace digits with spaces
        text = tokenizer(text)
        text = remove_stopwords_and_normalize(text, stopwords_dict)
        text = ' '.join(text).strip()

        # Skip headlines that end up empty after cleaning
        if len(text) > 0:
            X_clean.append(text)
            Y_clean.append(target_list[idx])
    return X_clean, Y_clean

In [18]:
X, y = data_cleaning(news_list, target_list)
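
To make the cleaning steps above concrete, here is a minimal standard-library sketch that traces one headline through the same sequence: punctuation removal, digit removal, tokenization, lowercasing, and stopword filtering. The headline and the tiny `STOPWORDS` set are hypothetical stand-ins (the notebook uses NLTK's full English list), and stemming is left out so the example runs without NLTK.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list
STOPWORDS = {"s", "the", "a", "to", "of", "in"}

def clean_headline(headline):
    """Mirror the notebook's cleaning steps, minus stemming."""
    text = re.sub(r"[^\w\s]", " ", headline)  # punctuation -> space
    text = re.sub(r"\d", " ", text)           # digits -> space
    tokens = re.findall(r"\w+", text)         # same pattern as RegexpTokenizer(r'\w+')
    kept = [t.lower() for t in tokens if t.lower() not in STOPWORDS]
    return " ".join(kept)

sample = "Apple's 12-inch MacBook Air: 3 new colors!"  # hypothetical headline
cleaned = clean_headline(sample)
print(cleaned)  # apple inch macbook air new colors
```

A headline made only of digits and punctuation comes out as an empty string, which is why `data_cleaning` checks the length before keeping a sample and its label.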

### Feature Engineering + Text Classification

In [19]:
X, y = asarray(X), asarray(y)

kfold = StratifiedKFold(n_splits=5, random_state=42, shuffle=True)

for iteration, (train_index, test_index) in enumerate(kfold.split(X, y), start=1):

    X_train, X_test = X[train_index], X[test_index]
    Y_train, Y_test = y[train_index], y[test_index]

    # Unigram TF-IDF; drop terms found in fewer than 5 or in more than 70% of the training headlines
    vectorizer = TfidfVectorizer(use_idf=True, ngram_range=(1, 1),
                                 min_df=5, max_df=0.70)

    # Fit the vocabulary on the training fold only, then reuse it on the test fold
    X_train = vectorizer.fit_transform(X_train)
    X_test = vectorizer.transform(X_test)

    classifier = RandomForestClassifier(random_state=5)
    classifier.fit(X_train, Y_train)
    predictions = classifier.predict(X_test)

    print(f'Fold: {iteration}')
    print(classification_report(Y_test, predictions, target_names=['fake', 'legitimate']), '\n\n')

Fold: 1
              precision    recall  f1-score   support

        fake       0.82      0.87      0.85        94
  legitimate       0.86      0.81      0.84        94

    accuracy                           0.84       188
   macro avg       0.84      0.84      0.84       188
weighted avg       0.84      0.84      0.84       188
 


Fold: 2
              precision    recall  f1-score   support

        fake       0.77      0.82      0.79        94
  legitimate       0.81      0.76      0.78        94

    accuracy                           0.79       188
   macro avg       0.79      0.79      0.79       188
weighted avg       0.79      0.79      0.79       188
 


Fold: 3
              precision    recall  f1-score   support

        fake       0.79      0.86      0.82        93
  legitimate       0.85      0.77      0.81        93

    accuracy                           0.82       186
   macro avg       0.82      0.82      0.82       186
weighted avg       0.82      0.82      0.82 
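
The `TfidfVectorizer` used in the loop above turns each headline into a vector of term weights. As a rough illustration of what those weights mean, the sketch below recomputes them in plain Python, assuming scikit-learn's default settings (raw term counts, smoothed IDF, L2 normalization). The three headlines are hypothetical, and the `min_df`/`max_df` pruning from the notebook is omitted.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) with smoothed-IDF, L2-normalized TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)

    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF, as in scikit-learn's default (smooth_idf=True)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocab, rows

# Hypothetical cleaned headlines
docs = ["apple watch gold", "apple macbook air", "apple watch price"]
vocab, rows = tfidf(docs)
first = dict(zip(vocab, rows[0]))
print({t: round(w, 3) for t, w in first.items() if w > 0})
```

In the first headline, `gold` (one document) outweighs `watch` (two documents), which outweighs `apple` (all three). That is the intended effect: terms that appear everywhere carry little information for telling classes apart.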

### Challenge:

- Build a fake news identification model using the `content` of the news articles, represented as `bigrams` weighted by TF-IDF.
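
One possible starting point for this challenge, sketched on a hypothetical two-document corpus rather than the real data: setting `ngram_range=(2, 2)` makes `TfidfVectorizer` build its vocabulary from pairs of consecutive words instead of single words. The variable names here are illustrative and do not come from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the cleaned `main_content` texts
toy_corpus = [
    "apple watch gold edition",
    "apple watch steel edition",
]

# ngram_range=(2, 2) keeps only pairs of consecutive words
bigram_vectorizer = TfidfVectorizer(use_idf=True, ngram_range=(2, 2))
X_bigrams = bigram_vectorizer.fit_transform(toy_corpus)

bigrams = sorted(bigram_vectorizer.vocabulary_)
print(bigrams)
print(X_bigrams.shape)
```

For the challenge itself, the same idea would be applied to the `main_content` column instead of `headline`, inside the same `StratifiedKFold` loop. Because word pairs are rarer than single words, a threshold such as `min_df` may need to be lowered; that is a tuning assumption to verify, not a result from this notebook.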

<img height="50" width="300" src="https://media.giphy.com/media/l2YWs1NexTst9YmFG/giphy.gif"/>

## Practice II: Topic Clustering

### Loading the Dataset

In [2]:
fake_set = pd.read_csv('datasets/fakenews_silverman.csv')
real_set = pd.read_csv('datasets/realnews_silverman.csv')

In [3]:
print(f'|fake news| = {fake_set.shape[0]} samples \n|legitimate news| = {real_set.shape[0]} samples')

|fake news| = 467 samples 
|legitimate news| = 467 samples


In [6]:
news_list = pd.concat([fake_set['headline'],real_set['headline']], axis=0, ignore_index=True)
target_list = pd.concat([fake_set['label'],real_set['label']], axis=0, ignore_index=True)

print(f'|corpus| = {news_list.shape[0]} samples')

|corpus| = 934 samples


In [7]:
def top_cluster_terms(km, tfidf_vectorizer, number_of_clusters):
    """Print each cluster's size and its 7 highest-weighted centroid terms."""
    # Sort term indices by centroid weight, largest first
    order_centroids = km.cluster_centers_.argsort()[:, ::-1]
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    terms = tfidf_vectorizer.get_feature_names_out()
    dist = clusters_distribution(km)

    for i in range(number_of_clusters):
        top_words = [terms[ind] for ind in order_centroids[i, :7]]
        print("Cluster ", i, f'| Total: {dist[i]}|', ' '.join(top_words))

def clusters_distribution(km):
    """Count how many documents were assigned to each cluster."""
    clusters_count = {}
    for i in km.labels_:
        clusters_count[i] = clusters_count.get(i, 0) + 1
    return clusters_count
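
The two helpers above rely on a descending `argsort` over the centroid weights and on a per-label count. The sketch below shows both ideas on hypothetical values (a four-term vocabulary, two centroids, five labels), so the ranking can be checked by hand.

```python
from collections import Counter
import numpy as np

# Hypothetical vocabulary and 2 cluster centroids (one weight per term)
terms = ["apple", "isis", "watch", "ebola"]
centers = np.array([
    [0.1, 0.7, 0.0, 0.2],
    [0.5, 0.0, 0.4, 0.1],
])

# argsort is ascending; reversing each row puts the heaviest terms first
order = centers.argsort()[:, ::-1]
top_terms = [[terms[j] for j in row[:2]] for row in order]
print(top_terms)  # [['isis', 'ebola'], ['apple', 'watch']]

# Counting cluster sizes from hypothetical labels, equivalent to clusters_distribution
labels = [0, 1, 1, 0, 1]
sizes = Counter(labels)
print(sizes[0], sizes[1])  # 2 3
```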

In [8]:
def remove_stopwords(doc_text, stopwords_hash):
    """Lowercase each token and drop stopwords (no stemming, so terms stay readable)."""
    content = []

    for word in doc_text:
        word_clean = word.lower().strip()
        if word_clean not in stopwords_hash:
            content.append(word_clean)
    return content

def tokenizer(text):
    """Split text into tokens made of word characters."""
    return RegexpTokenizer(r'\w+').tokenize(text)

def data_cleaning(news_list):
    """Clean each headline and keep it only if text remains."""
    X_clean = []

    # Dict keys give constant-time stopword lookups
    stopwords_dict = {word: 0 for word in stopwords.words('english')}

    for news in news_list:
        text = sub(r'[^\w\s]', ' ', news)  # replace punctuation with spaces
        text = sub(r'\d', ' ', text)       # replace digits with spaces
        text = tokenizer(text)
        text = remove_stopwords(text, stopwords_dict)
        text = ' '.join(text).strip()

        # Skip headlines that end up empty after cleaning
        if len(text) > 0:
            X_clean.append(text)
    return X_clean

In [9]:
X_clean = data_cleaning(news_list)

In [10]:
vectorizer = TfidfVectorizer(use_idf=True, sublinear_tf=False, ngram_range=(2,2))
X = vectorizer.fit_transform(X_clean)

kmeans = KMeans(n_clusters=10, random_state = 42).fit(X)

top_cluster_terms(kmeans, vectorizer, 10)


Cluster  0 | Total: 179| banksy arrested argentina president batmobile stolen stolen detroit turning werewolf identity revealed president adopts
Cluster  1 | Total: 508| boko haram third breast bank hank big bank sugarhill gang vladimir putin jose canseco
Cluster  2 | Total: 24| justin bieber bear attack bieber ringtone russian fisherman saves man saves russian mauled bear
Cluster  3 | Total: 26| rescue attempt luke somers attempt yemen yemen rescue british born rescue bid killed rescue
Cluster  4 | Total: 76| apple watch macbook air inch macbook watch edition gold apple steel apple stainless steel
Cluster  5 | Total: 47| islamic state james foley killed us us journalist missing american journalist james american journalist
Cluster  6 | Total: 26| james wright wright foley journalist james american journalist beheads american isis beheads islamic state
Cluster  7 | Total: 24| isis fighters contracted ebola fighters contracted isis militants iraqi media media reports reports isis
Cluste

In [11]:
X_clean = data_cleaning(news_list)

vectorizer = TfidfVectorizer(use_idf=True, sublinear_tf=False)
X = vectorizer.fit_transform(X_clean)

kmeans = KMeans(n_clusters=10, random_state = 42).fit(X)

top_cluster_terms(kmeans, vectorizer, 10)


Cluster  0 | Total: 390| banksy hoax woman isis saudi caught arrested
Cluster  1 | Total: 29| haram boko nigeria ceasefire girls kidnapped schoolgirls
Cluster  2 | Total: 153| batmobile president werewolf argentina stolen boy million
Cluster  3 | Total: 42| killed rescue yemen attempt al hostage us
Cluster  4 | Total: 90| apple watch gold macbook air inch cost
Cluster  5 | Total: 66| claims isis islamic state weapons airdrop us
Cluster  6 | Total: 76| ebola bear bieber justin contracted isis man
Cluster  7 | Total: 46| journalist james foley american wright video beheaded
Cluster  8 | Total: 23| companies two packard hewlett split hp break
Cluster  9 | Total: 19| bank gang hank sugarhill big canadian captured

