# Projeto: sumarizador de documentos e extração de palavras-chave
---
#### Elisa Malzoni e Bruna Kimura
---
Projeto em duplas.

Neste projeto vamos usar o que aprendemos sobre vetorização de documentos e similaridade para construir sumários de documentos. Os sumários serão construídos à partir das sentenças existentes no texto, como se estivéssemos usando uma caneta marcadora de texto para ressaltar as sentenças mais importantes. Este tipo de sumarização é conhecido como sumarização extrativa. 

## Sumarização

Você deve implementar duas técnicas de sumarização de documentos:

1. Clustering

    Nesta técnica você deverá:

    - Vetorizar as sentenças do documento. Teste as várias opções de vetorização que aprendemos (TF-IDF, Doc2Vec, LDA, etc).
    - Agrupar as sentenças do documento em clusters usando K-Means.
    - Escolher as sentenças mais próximas do centro do cluster, para cada cluster.
    - Exibir estas sentenças na ordem em que se apresentaram no texto original.


2. TextRank

    A ideia do TextRank é conectar todas as sentenças entre si através da sua similaridade (e.g. $1 - \text{distância cosseno}$) em uma *matriz de similaridade*. Nesta matriz a entrada $(i,j)$ representa a similaridade entre as sentenças $i$ e $j$. Sentenças mais informativas tendem a estar conectadas com várias outras sentenças do texto, e servirão como representantes dos assuntos que estão sendo discutidos nestas outras sentenças. Para determinar quais são as sentenças de maior conectividade vamos usar o algoritmo PageRank (https://en.wikipedia.org/wiki/PageRank), que já está implementado na biblioteca `networkx` em Python.

    Veja o artigo https://www.analyticsvidhya.com/blog/2018/11/introduction-text-summarization-textrank-python/ para entender mais sobre essa técnica. Para conhecer os detalhes finos do algoritmo, veja o artigo original em https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf

    Implemente o TextRank e use-o para determinar as sentenças mais importantes do documento. Apresente estas sentenças na ordem em que aparecem no documento original.

## Extração de palavras-chave

As mesmas técnicas de sumarização de documentos servem para extrair palavras-chave de documentos: basta considerar a similaridade entre embeddings de palavras ao invés de vetores de sentença!

Implemente também a extração de palavras-chave de documentos.

## Testando as implementações

Para testar suas implementações de sumarização você pode usar o dataset "CNN/Daily Mail" (https://github.com/abisee/cnn-dailymail). Cuidado: é um dataset bem grande, para testar seus desenvolvimentos é recomendável não rodar no corpus inteiro toda vez.

## Entregáveis

- O repositório com o código
- Um relatório completo: 
    - Introdução
        - Explicar o que é sumarização de texto, diferentes tipos, e fazer uma revisão bibliográfica pequena
        - Explicar os dois algoritmos
    - Métodos
        - Explicar o experimento: qual dataset, como vai medir desempenho, etc.
    - Resultados
        - Métricas automatizadas: ROUGE (https://pypi.org/project/rouge/)
        - Comparação qualitativa
    - Conclusão
    
## Rubrica

| Conceito | Definição |
|:--------:|:----------|
|    I     | Não entregou ou entregou nonsense |
|    D     | O relatório está incompleto ou com falhas, o código tem erros mas está aproximadamente correto, falta mais de uma implementação |
|    C     | O relatório está completo mas com escrita pobre, o código está meio bagunçado mas correto, falta uma implementação apenas |
|    B     | Bom relatório, à exceção do ROUGE. Boa implementação. |
|    A     | Implementou métrica ROUGE de desempenho. Aprimorou os métodos por conta própria, melhorando o desempenho. |

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import pairwise_distances_argmin_min
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import MiniBatchKMeans
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from pprint import pprint
import networkx as nx
import pandas as pd
import numpy as np
import gensim
import re
import os

### Preparando o Dataset

In [12]:
stories = os.listdir('./stories')

doc_sents = {}
all_sents = []
num_all_sents = 0

for storie in stories:
    path = './stories/' + storie
    
    
    with open(path, 'r') as f:
        doc = []
        sent = []
        for cnt, line in enumerate(f):
            if cnt > 13:
                if line != '\n':
                    sent.append(re.sub('\n', '', line))
                if line == '@highlight\n':
                    break
                else:

                    doc+= sent
                    sent = []
                    
    all_sents += doc
    num_sents = len(doc)
    doc_sents[storie] = list(range(num_all_sents, num_all_sents + num_sents))
    num_all_sents+=num_sents

## 1 - Clustering

In [13]:
vectorizer = TfidfVectorizer(analyzer=lambda x: x)
vectors = vectorizer.fit_transform(all_sents)

## Clustering TF-IDF

In [14]:
for doc, sent_indexes in doc_sents.items():
    num_sents = len(sent_indexes)
    
    first_index = doc_sents[doc][0]
    last_index = doc_sents[doc][-1] + 1
    
    if num_sents >= 2:
        kmeans = MiniBatchKMeans(n_clusters= 2, random_state=42)
        result = kmeans.fit_transform(vectors[first_index:last_index])

        closest, dis = pairwise_distances_argmin_min(kmeans.cluster_centers_, vectors[first_index:last_index])
        
        closest.sort()
        
        texts_centroids=[]
        for e in closest:
            sent = all_sents[e+first_index]

            texts_centroids.append(sent)
    
        print(f'--------------- doc: {doc} ---------------------------------')    
        print(texts_centroids)

--------------- doc: 0a3dddec7c0492895ee26b68ae57f25bb2386fd8.story ---------------------------------
['Scroll down for video', "The park's zoological team members report the mother and baby appear to be healthy, but as with any newborn, the first few days are critical, with the calf's sex is yet to be determined"]
--------------- doc: 0a1cabbc6b9c07c97e8001dca6301770c28f1833.story ---------------------------------
["Up and running: Scotland have put together some decent results under Strachan's guidance", "Ralf Mutschke, head of security with the governing body, told the Telegraph: 'FIFA, and in particular myself, has to make the presumption that the World Cup itself is under threat and implement the maximum protection for our competition as we can. We are trying to protect the World Cup from fixing and we have set up a pretty wide range of measures to do so.'"]
--------------- doc: 0a1a94f06809b73d31cf1f43435827cd21467d94.story ---------------------------------
['Regina Bennett, 46, 

## Clustering usando cbow

In [15]:
with open('sentences.txt', 'w', encoding='utf8') as file:
    for sentence in all_sents:
        file.write(f'{sentence}\n')

In [16]:
%%time
model_cbow = gensim.models.Word2Vec(
    corpus_file='sentences.txt',
    window=5,
    size=200,
    seed=42,
    iter=100,
    workers=12,
)

CPU times: user 19.3 s, sys: 7.11 s, total: 26.4 s
Wall time: 5.73 s


In [17]:
def cbow(model, sent):
    vec = np.zeros(model.wv.vector_size)
    for word in sent:
        if word in model:
            vec += model.wv.get_vector(word)
            
    norm = np.linalg.norm(vec)
    if norm > np.finfo(float).eps:
        vec /= norm
    return vec

In [18]:
vecs_cbow = [cbow(model_cbow, sent) for sent in all_sents]

  after removing the cwd from sys.path.


In [19]:
for doc, sent_indexes in doc_sents.items():
    num_sents = len(sent_indexes)
    
    first_index = doc_sents[doc][0]
    last_index = doc_sents[doc][-1] + 1
    
    if num_sents >= 2:
        kmeans_cbow = MiniBatchKMeans(n_clusters= 2, random_state=42)
        result = kmeans_cbow.fit_transform(vecs_cbow[first_index:last_index])

        closest, dis = pairwise_distances_argmin_min(kmeans_cbow.cluster_centers_, vecs_cbow[first_index:last_index])
        closest.sort()
        texts_centroids=[]
        
        for e in closest:
            sent = all_sents[e+first_index]

            texts_centroids.append(sent)
    
        print(f'-------- doc: {doc} ---------------------------------')    
        print(texts_centroids)

-------- doc: 0a3dddec7c0492895ee26b68ae57f25bb2386fd8.story ---------------------------------
["Seconds later, the baby whale - the sixth successful killer whale birth in the parkís 49-year history - instinctively swam to the water's surface to take its first breath", "The baby killer whale takes its first breath at SeaWorld San Diego. Researchers say both are doing well, although the newborn's sex has not yet been determined"]
-------- doc: 0a1cabbc6b9c07c97e8001dca6301770c28f1833.story ---------------------------------
['FIFA have plans in place to combat the threat of match-fixing during the World Cup, which kicks off next month.', "Ralf Mutschke, head of security with the governing body, told the Telegraph: 'FIFA, and in particular myself, has to make the presumption that the World Cup itself is under threat and implement the maximum protection for our competition as we can. We are trying to protect the World Cup from fixing and we have set up a pretty wide range of measures to do

## 2 - TextRank

Utilizando o tutorial - https://www.analyticsvidhya.com/blog/2018/11/introduction-text-summarization-textrank-python/

In [20]:
# remove punctuations, numbers and special characters
clean_sentences = pd.Series(all_sents).str.replace("[^a-zA-Z]", " ")

# make alphabets lowercase
clean_sentences = [s.lower() for s in clean_sentences]

In [21]:
stop_words = stopwords.words('english')

In [22]:
# function to remove stopwords
def remove_stopwords(sen):
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new

In [23]:
# remove stopwords from the sentences
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]

In [24]:
# Extract word vectors
word_embeddings = {}
f = open('glove.6B.100d.txt', encoding='utf-8')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    word_embeddings[word] = coefs
f.close()

In [25]:
sentence_vectors = []
for i in clean_sentences:
    if len(i) != 0:
        v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
    else:
        v = np.zeros((100,))
    sentence_vectors.append(v)

In [26]:
for doc, sent_indexes in doc_sents.items():
    num_sents = len(sent_indexes)
    sim_mat = np.zeros([num_sents, num_sents])
    for i, s1 in enumerate(sent_indexes):
        for j, s2 in enumerate(sent_indexes):
            if i != j:
                sim_mat[i][j] = cosine_similarity(sentence_vectors[s1].reshape(1,100), sentence_vectors[s2].reshape(1,100))[0,0]
    
    nx_graph = nx.from_numpy_array(sim_mat)
    scores = nx.pagerank(nx_graph)
    
    first_index = doc_sents[doc][0]
    last_index = doc_sents[doc][-1] + 1
    
    ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(all_sents[first_index:last_index])), reverse=True)
    print(f'doc: {doc} \n')
    for i in range(3):
        print(ranked_sentences[i][1])
    print('*' * 30)
    print(all_sents[first_index:last_index])
    print('-' * 50)


doc: 0a3dddec7c0492895ee26b68ae57f25bb2386fd8.story 

The baby killer whale takes its first breath at SeaWorld San Diego. Researchers say both are doing well, although the newborn's sex has not yet been determined
Seconds later, the baby whale - the sixth successful killer whale birth in the parkís 49-year history - instinctively swam to the water's surface to take its first breath
Seconds later, the baby whale - the sixth successful killer whale birth in the park's 49-year history -  instinctively swam to the water's surface to take its first breath.
******************************
['The moment a killer whale is born has been captured on video at a San Diego attraction.', 'Kasatka the killer whale gave birth to a 7 feet long , 350 pounds baby killer whale after an almost 18-month gestation.', "The newborn killer whale calf was born at Shamu Stadium, SeaWorld San Diego, under the watchful eyes of the SeaWorld's zoological team members after more than an hour of labour.", 'Scroll down fo

## 3 - Extração de palavras-chave com TextRank

**!!!! Vai demorar bastante para rodar !!!!**

In [9]:
punctuations = '''!()-[]{};:'"\,<>./?@#$%^&*_~'''

In [10]:
count = 0
for doc, sent_indexes in doc_sents.items():
    count +=1
    # palavras chaves somente pros primeiros textos
    if count > 5:
        break
    words_doc = []
    words_doc_clean = []
    for i in sent_indexes:
        
        words_doc += all_sents[i].split(' ')
    
    for word in words_doc:
        no_punct = ''
        for char in word:
            if char not in punctuations:
                no_punct = no_punct + char
        no_punct = no_punct.lower()
        words_doc_clean.append(no_punct)


    num_words = len(words_doc_clean)
    num_sents = len(sent_indexes)
    sim_mat = np.zeros([num_words, num_words])
    
    for i, w1 in enumerate(words_doc_clean):
        for j, w2 in enumerate(words_doc_clean):
            if i != j:
                try:
                    sim_mat[i][j] = cosine_similarity(word_embeddings[w1].reshape(1,100), word_embeddings[w2].reshape(1,100))[0,0]
                except KeyError:
                    sim_mat[i][j] = -1
    
    nx_graph = nx.from_numpy_array(sim_mat)
    
    try:
        scores = nx.pagerank(nx_graph)

    except:
        continue
        
    first_index = doc_sents[doc][0]
    last_index = doc_sents[doc][-1] + 1
    
    ranked_words = sorted(((scores[i],s) for i,s in enumerate(words_doc_clean[first_index:last_index])), reverse=True)
    ranked_words_clean = []
    for w in ranked_words:
        if w[1] not in stop_words:
            ranked_words_clean.append(w)
    print(f'doc: {doc} \n')
    for i in range(4):
        print(ranked_words_clean[i][1])
    print('*' * 30)
    print(all_sents[first_index:last_index])
    print('-' * 50)

doc: 0a3dddec7c0492895ee26b68ae57f25bb2386fd8.story 

moment
captured
born
killer
******************************
['The moment a killer whale is born has been captured on video at a San Diego attraction.', 'Kasatka the killer whale gave birth to a 7 feet long , 350 pounds baby killer whale after an almost 18-month gestation.', "The newborn killer whale calf was born at Shamu Stadium, SeaWorld San Diego, under the watchful eyes of the SeaWorld's zoological team members after more than an hour of labour.", 'Scroll down for video', 'After an almost 18-month gestation, the killer whale begins to emerge from its mother at Shamu Stadium, SeaWorld San Diego', 'Kasatka the killer whale gave birth to a 7 feet long , 350 pounds baby killer whale - with the birth captured on video', "Seconds later, the baby whale - the sixth successful killer whale birth in the parkís 49-year history - instinctively swam to the water's surface to take its first breath", "Seconds later, the baby whale - the sixth s