# Projeto: sumarizador de documentos e extração de palavras-chave

Projeto em duplas.

Neste projeto vamos usar o que aprendemos sobre vetorização de documentos e similaridade para construir sumários de documentos. Os sumários serão construídos à partir das sentenças existentes no texto, como se estivéssemos usando uma caneta marcadora de texto para ressaltar as sentenças mais importantes. Este tipo de sumarização é conhecido como sumarização extrativa. 

## Sumarização

Você deve implementar duas técnicas de sumarização de documentos:

1. Clustering

Nesta técnica você deverá:

- Vetorizar as sentenças do documento. Teste as várias opções de vetorização que aprendemos (TF-IDF, Doc2Vec, LDA, etc).
- Agrupar as sentenças do documento em clusters usando K-Means.
- Escolher as sentenças mais próximas do centro do cluster, para cada cluster.
- Exibir estas sentenças na ordem em que se apresentaram no texto original.


2. TextRank

A ideia do TextRank é conectar todas as sentenças entre si através da sua similaridade (e.g. $1 - \text{distância cosseno}$) em uma *matriz de similaridade*. Nesta matriz a entrada $(i,j)$ representa a similaridade entre as sentenças $i$ e $j$. Sentenças mais informativas tendem a estar conectadas com várias outras sentenças do texto, e servirão como representantes dos assuntos que estão sendo discutidos nestas outras sentenças. Para determinar quais são as sentenças de maior conectividade vamos usar o algoritmo PageRank (https://en.wikipedia.org/wiki/PageRank), que já está implementado na biblioteca `networkx` em Python.

Veja o artigo https://www.analyticsvidhya.com/blog/2018/11/introduction-text-summarization-textrank-python/ para entender mais sobre essa técnica. Para conhecer os detalhes finos do algoritmo, veja o artigo original em https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf

Implemente o TextRank e use-o para determinar as sentenças mais importantes do documento. Apresente estas sentenças na ordem em que aparecem no documento original.

## Extração de palavras-chave

As mesmas técnicas de sumarização de documentos servem para extrair palavras-chave de documentos: basta considerar a similaridade entre embeddings de palavras ao invés de vetores de sentença!

Implemente também a extração de palavras-chave de documentos.

## Testando as implementações

Para testar suas implementações de sumarização você pode usar o dataset "CNN/Daily Mail" (https://github.com/abisee/cnn-dailymail). Cuidado: é um dataset bem grande, para testar seus desenvolvimentos é recomendável não rodar no corpus inteiro toda vez.

## Entregáveis

- O repositório com o código
- Um relatório completo: 
    - Introdução
        - Explicar o que é sumarização de texto, diferentes tipos, e fazer uma revisão bibliográfica pequena
        - Explicar os dois algoritmos
    - Métodos
        - Explicar o experimento: qual dataset, como vai medir desempenho, etc.
    - Resultados
        - Métricas automatizadas: ROUGE (https://pypi.org/project/rouge/)
        - Comparação qualitativa
    - Conclusão
    
## Rubrica

| Conceito | Definição |
|:--------:|:----------|
|    I     | Não entregou ou entregou nonsense |
|    D     | O relatório está incompleto ou com falhas, o código tem erros mas está aproximadamente correto, falta mais de uma implementação |
|    C     | O relatório está completo mas com escrita pobre, o código está meio bagunçado mas correto, falta uma implementação apenas |
|    B     | Bom relatório, à exceção do ROUGE. Boa implementação. |
|    A     | Implementou métrica ROUGE de desempenho. Aprimorou os métodos por conta própria, melhorando o desempenho. |

## 1 - Clustering

In [15]:
# vetorização
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import pairwise_distances_argmin_min
import pandas as pd
import numpy as np
from nltk.corpus import reuters
from nltk.corpus import brown
from pprint import pprint


In [None]:
docs = {}
for fileid in reuters.fileids():
    docs[fileid] = reuters.sents(fileid)

for k, v in docs.items():
    docs[k] = []
    sent = ''
    for s in v:
        sent = ' '.join(s)
        docs[k].append(sent)   

In [None]:
doc_sents = {}
all_sents = []
num_all_sents = 0

for doc_id, doc in docs.items():
    all_sents += doc
    num_sents = len(doc)
    doc_sents[doc_id] = list(range(num_all_sents, num_all_sents + num_sents))
    num_all_sents+=num_sents


In [None]:
vectorizer = TfidfVectorizer(analyzer=lambda x: x)
vectors = vectorizer.fit_transform(all_sents)

## Clustering TF-IDF

In [None]:
for doc, sent_indexes in doc_sents.items():
    num_sents = len(sent_indexes)
    
    first_index = doc_sents[doc][0]
    last_index = doc_sents[doc][-1] + 1
    
    if num_sents >= 2:
        kmeans = MiniBatchKMeans(n_clusters= 2, random_state=42)
        result = kmeans.fit_transform(vectors[first_index:last_index])

        closest, dis = pairwise_distances_argmin_min(kmeans.cluster_centers_, vectors[first_index:last_index])
        
        closest.sort()
        
        texts_centroids=[]
        for e in closest:
            sent = all_sents[e+first_index]

            texts_centroids.append(sent)
    
        print(f'doc: {doc} ---------------------------------')    
        print(texts_centroids)

## Clustering usando cbow

In [None]:
import gensim
with open('sentences.txt', 'w', encoding='utf8') as file:
    for sentence in all_sents:
        file.write(f'{sentence}\n')

In [None]:
%%time
model_cbow = gensim.models.Word2Vec(
    corpus_file='sentences.txt',
    window=5,
    size=200,
    seed=42,
    iter=100,
    workers=12,
)

In [None]:
def cbow(model, sent):
    vec = np.zeros(model.wv.vector_size)
    for word in sent:
        if word in model:
            vec += model.wv.get_vector(word)
            
    norm = np.linalg.norm(vec)
    if norm > np.finfo(float).eps:
        vec /= norm
    return vec

In [None]:
vecs_cbow = [cbow(model_cbow, sent) for sent in all_sents]

In [None]:
for doc, sent_indexes in doc_sents.items():
    num_sents = len(sent_indexes)
    
    first_index = doc_sents[doc][0]
    last_index = doc_sents[doc][-1] + 1
    
    if num_sents >= 2:
        kmeans_cbow = MiniBatchKMeans(n_clusters= 2, random_state=42)
        result = kmeans_cbow.fit_transform(vecs_cbow[first_index:last_index])

        closest, dis = pairwise_distances_argmin_min(kmeans_cbow.cluster_centers_, vecs_cbow[first_index:last_index])
        closest.sort()
        texts_centroids=[]
        
        for e in closest:
            sent = all_sents[e+first_index]

            texts_centroids.append(sent)
    
        print(f'doc: {doc} ---------------------------------')    
        print(texts_centroids)

## 2 - TextRank

Utilizando o tutorial - https://www.analyticsvidhya.com/blog/2018/11/introduction-text-summarization-textrank-python/

In [1]:
import re

In [5]:
df = pd.read_csv('tennis_articles_v4.csv')

In [6]:
df.head()

Unnamed: 0,article_id,article_text,source
0,1,Maria Sharapova has basically no friends as te...,https://www.tennisworldusa.org/tennis/news/Mar...
1,2,"BASEL, Switzerland (AP), Roger Federer advance...",http://www.tennis.com/pro-game/2018/10/copil-s...
2,3,Roger Federer has revealed that organisers of ...,https://scroll.in/field/899938/tennis-roger-fe...
3,4,Kei Nishikori will try to end his long losing ...,http://www.tennis.com/pro-game/2018/10/nishiko...
4,5,"Federer, 37, first broke through on tour over ...",https://www.express.co.uk/sport/tennis/1036101...


In [7]:
df['article_text'][0]

"Maria Sharapova has basically no friends as tennis players on the WTA Tour. The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much. I think everyone knows this is my job here. When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match. I'm a pretty competitive girl. I say my hellos, but I'm not sending any players flowers as well. Uhm, I'm not really friendly or close to many players. I have not a lot of friends away from the courts.' When she said she is not really close to a lot of players, is that something strategic that she is doing? Is it different on the men's tour than the women's tour? 'No, not at all. I think just because you're in the same 

In [8]:
from nltk.tokenize import sent_tokenize
from pprint import pprint
sentences = []
for s in df['article_text']:
    sentences.append(sent_tokenize(s))
# print(sentences)
doc_sents = {}
all_sents = []
num_all_sents = 0

for index, doc in enumerate(sentences):
    all_sents += doc
    num_sents = len(doc)
    doc_sents[index] = list(range(num_all_sents, num_all_sents + num_sents))
    num_all_sents+=num_sents                        

pprint(doc_sents)
sentences = [y for x in sentences for y in x] # flatten list
# print(sentences)

{0: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16],
 1: [17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28],
 2: [29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45],
 3: [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58],
 4: [59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76],
 5: [77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88],
 6: [89, 90, 91, 92, 93, 94, 95, 96, 97, 98],
 7: [99,
     100,
     101,
     102,
     103,
     104,
     105,
     106,
     107,
     108,
     109,
     110,
     111,
     112,
     113,
     114,
     115,
     116,
     117,
     118]}


In [9]:
# !wget http://nlp.stanford.edu/data/glove.6B.zip
# !unzip glove*.zip

In [10]:
# remove punctuations, numbers and special characters
clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ")

# make alphabets lowercase
clean_sentences = [s.lower() for s in clean_sentences]

In [11]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

In [12]:
# function to remove stopwords
def remove_stopwords(sen):
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new

In [13]:
# remove stopwords from the sentences
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]

In [16]:
# Extract word vectors
word_embeddings = {}
f = open('glove.6B.100d.txt', encoding='utf-8')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    word_embeddings[word] = coefs
f.close()

In [17]:
sentence_vectors = []
for i in clean_sentences:
    if len(i) != 0:
        v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
    else:
        v = np.zeros((100,))
    sentence_vectors.append(v)

In [18]:
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx

In [19]:
for doc, sent_indexes in doc_sents.items():
    num_sents = len(sent_indexes)
    sim_mat = np.zeros([num_sents, num_sents])
    for i, s1 in enumerate(sent_indexes):
        for j, s2 in enumerate(sent_indexes):
            if i != j:
                sim_mat[i][j] = cosine_similarity(sentence_vectors[s1].reshape(1,100), sentence_vectors[s2].reshape(1,100))[0,0]
    
    nx_graph = nx.from_numpy_array(sim_mat)
    scores = nx.pagerank(nx_graph)
    
    first_index = doc_sents[doc][0]
    last_index = doc_sents[doc][-1] + 1
    
    ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences[first_index:last_index])), reverse=True)
    print(f'doc: {doc} \n')
    for i in range(3):
        print(ranked_sentences[i][1])
    print('*' * 30)
    print(sentences[first_index:last_index])
    print('-' * 50)


doc: 0 

I think just because you're in the same sport doesn't mean that you have to be friends with everyone just because you're categorized, you're a tennis player, so you're going to get along with tennis players.
I think everyone just thinks because we're tennis players we should be the greatest of friends.
When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.
******************************
['Maria Sharapova has basically no friends as tennis players on the WTA Tour.', "The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much.", 'I think everyone knows this is my job here.', "When I'm on the courts or when I'm on the court playing, I'm a compet

## 3 - Extração de palavras-chave

In [36]:
punctuations = '''!()-[]{};:'"\,<>./?@#$%^&*_~'''

In [60]:
for doc, sent_indexes in doc_sents.items():
    words_doc = []
    words_doc_clean = []
    for i in sent_indexes:
        
        words_doc += all_sents[i].split(' ')
    
    for word in words_doc:
        no_punct = ''
        for char in word:
            if char not in punctuations:
                no_punct = no_punct + char
        no_punct = no_punct.lower()
        words_doc_clean.append(no_punct)


    num_words = len(words_doc_clean)
    num_sents = len(sent_indexes)
    sim_mat = np.zeros([num_words, num_words])
    
    for i, w1 in enumerate(words_doc_clean):
        for j, w2 in enumerate(words_doc_clean):
            if i != j:
                try:
                    sim_mat[i][j] = cosine_similarity(word_embeddings[w1].reshape(1,100), word_embeddings[w2].reshape(1,100))[0,0]
                except KeyError:
                    sim_mat[i][j] = -1
    
    nx_graph = nx.from_numpy_array(sim_mat)
    scores = nx.pagerank(nx_graph)

    
    first_index = doc_sents[doc][0]
    last_index = doc_sents[doc][-1] + 1
    
    ranked_words = sorted(((scores[i],s) for i,s in enumerate(words_doc_clean[first_index:last_index])), reverse=True)
    ranked_words_clean = []
    for w in ranked_words:
        if w[1] not in stop_words:
            ranked_words_clean.append(w)
    print(f'doc: {doc} \n')
    for i in range(4):
        print(ranked_words_clean[i][1])
    print('*' * 30)
    print(sentences[first_index:last_index])
    print('-' * 50)

doc: 0 

players
friends
basically
player
******************************
['Maria Sharapova has basically no friends as tennis players on the WTA Tour.', "The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much.", 'I think everyone knows this is my job here.', "When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.", "I'm a pretty competitive girl.", "I say my hellos, but I'm not sending any players flowers as well.", "Uhm, I'm not really friendly or close to many players.", "I have not a lot of friends away from the courts.'", 'When she said she is not really close to a lot of players, is that something strategic that she is doing?', "Is it differe