# Text Summarizer

This project aims to create summaries of texts by selecting the sentences or words that best represent them, based on how their embeddings group together. For a sentence-level summary, each sentence in a document is passed through an embedding algorithm that turns it into a vector; once all sentences are vectors, clustering or PageRank can use the similarity between sentences to find a few that convey the overall idea of the text.

This type of summarization is known as extractive summarization, and several embedding mechanisms can be used to measure the similarity between sentences and support the extraction process. The ones covered in this project are TF-IDF, CBOW, Doc2Vec, LDA and Word2Vec.
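
As an illustration of how an embedding lets sentences be compared, the sketch below builds TF-IDF vectors by hand and measures cosine similarity between them. It is a simplified stand-in, not the notebook's pipeline: the helper names `tfidf_vectors` and `cosine` and the toy sentences are assumptions made for this example, and the plain `tf * log(n / df)` weighting differs from the smoothed variant that `TfidfVectorizer` applies by default.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_sents):
    """Return one {term: weight} dict per sentence using plain tf * idf."""
    n = len(tokenized_sents)
    # Document frequency: number of sentences in which each term appears
    df = Counter(term for sent in tokenized_sents for term in set(sent))
    vectors = []
    for sent in tokenized_sents:
        tf = Counter(sent)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(v1, v2):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)

# Hypothetical, already tokenized sentences
example_sents = [
    ["murray", "lost", "match"],
    ["murray", "lost", "set"],
    ["courts", "are", "fast"],
]
vecs = tfidf_vectors(example_sents)
print(cosine(vecs[0], vecs[1]))  # the two sentences share "murray" and "lost"
print(cosine(vecs[0], vecs[2]))  # no terms in common
```

Sentences that share informative terms end up with a higher similarity, which is the signal that clustering and PageRank rely on.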

To assess the quality of the summaries, the ROUGE metric was chosen. It uses the number of words in common between a reference summary and the generated summary to compute the F1 score, precision and recall of the generated summary. The higher the precision, recall and F1, the better the summary.
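
A minimal sketch of the idea behind ROUGE-N follows, assuming whitespace tokenization and clipped n-gram counts. The function `rouge_n` is a hypothetical helper written for this example; the `rouge` package used in this notebook may differ in tokenization and counting details.

```python
from collections import Counter

def ngrams(tokens, n):
    """List of consecutive n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(candidate, reference, n=1):
    """Precision, recall and F1 from n-gram overlap between two strings."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"f": f1, "p": precision, "r": recall}

# Toy example: 2 of the 4 candidate words appear among the 5 reference words
scores = rouge_n("murray lost the match", "murray lost in straight sets")
print(scores)
```

Precision is the share of the generated summary that appears in the reference, and recall is the share of the reference that the generated summary recovers, so a long summary tends to trade precision for recall.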

# Contents

### [Open the Corpus](#Open)

### [Relevant Functions](#Func)

### [Sentence-level summary](#Sent)

[TF-IDF](#tfidf)

[CBOW](#cbow)

[Doc2Vec](#doc2vec)

[LDA](#lda)

### [Keyword summary](#Word)

[Word2Vec](#word2vec)

### [Conclusions](#Conc)



### User inputs

In [1]:
# Data to be studied: one of ["cnn_stories_sample", "cnn_stories"]
corpus = "cnn_stories_sample"

# Settings for the word filters
min_sent_size = 5
use_stopwords = True

### Relevant imports

In [2]:
# Imports for data handling
import pandas as pd
import numpy as np
import scipy
import pickle

import json
import os

# Imports for text handling
import re
from nltk.corpus import stopwords
stopwords = stopwords.words("english")

# Imports for vectorization algorithms
from sklearn.feature_extraction.text import TfidfVectorizer

import gensim
from gensim.corpora import Dictionary
from gensim.models.ldamulticore import LdaMulticore
import pyLDAvis.gensim

# Imports for ranking algorithms
import networkx as nx
from sklearn.cluster import MiniBatchKMeans

# Imports for summary evaluation
from rouge import Rouge
rouge = Rouge()

# Others
import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

## Open the Corpus
<a id='Open'></a>

In this section the selected corpus is opened, filtered by the `preprocess` function and saved as JSON and TXT files so the filtering does not have to be repeated later.

In [3]:
# Dictionary of words that must be removed or replaced
manual_conversions = {"nt":"not", "ll":"will", "m":"am", "s":"TRASH"}
if use_stopwords:
    manual_conversions.update({stopword:"TRASH" for stopword in stopwords})

In [4]:
# Function that filters the words of one sentence of the text
def preprocess(sent):
    sent = re.sub("-LRB-|-RRB-", "", sent, flags=re.DOTALL|re.MULTILINE)
    sent = sent.lower()
    sent = re.sub(r"[^a-z0-9\ ]", "", sent, flags=re.DOTALL|re.MULTILINE)
    sent = re.sub(r"[0-9]+", "num", sent, flags=re.DOTALL|re.MULTILINE)
    sent = re.sub(r" +(?= )", "", sent, flags=re.DOTALL|re.MULTILINE).strip()
    sent = sent.split(" ")
    sent = [manual_conversions[word] if word in manual_conversions.keys() else word for word in sent]
    return [word for word in sent if word != "TRASH"]

In [5]:
# Opens the data of the selected corpus and applies the filtering
sents = {}
orig_text = {}
highlights = {}
available_stories = os.listdir(f"./{corpus}")
for storie in available_stories:
    with open(f"./{corpus}/{storie}", "r", encoding="utf-8") as file:
        file = file.read()
    original_sents = re.sub(r"\n\n", " ", file.split("@highlight")[0], flags=re.DOTALL|re.MULTILINE).split(" . ")
    highlights_sents = re.sub(r"\n\n*", " . ", "".join(file.split("@highlight")[1:]), flags=re.DOTALL|re.MULTILINE)

    # Stores the original text and the highlights
    orig_text[storie] = original_sents
    highlights[storie] = highlights_sents

    # Stores the preprocessed sentences
    preprocessed_sent = [preprocess(original_sent) for original_sent in original_sents]
    sents[storie] = [sent for sent in preprocessed_sent if len(sent) > min_sent_size]

In [6]:
# Saves the filtered, tokenized data and the original data
with open("storage/orig_sents.json", "w") as file:
    json.dump(orig_text, file)

with open("storage/highlights.json", "w") as file:
    json.dump(highlights, file)

sents_reference = {}
i = 0
with open('storage/all_sents.txt', 'w', encoding='utf8') as file:
    
    for text_id, text in sents.items():
        sents_reference[text_id] = []
        
        for sentence in text:
            file.write(" ".join([tok for tok in sentence]) + "\n")
            sents_reference[text_id].append(i)
            i += 1
            
with open("storage/sents_reference.json", "w") as file:
    json.dump(sents_reference, file)

## Define Relevant Functions
<a id='Func'></a>

In this section, a few functions are defined to avoid repeating code for each embedding model, since the steps for loading the data, clustering, PageRank, computing ROUGE scores and saving the summaries are very similar across models.
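
The selection step of the clustering approach (keep the sentence nearest to each cluster center) can be sketched in isolation as below. This is a simplified illustration using plain lists and Euclidean distance: the helper names, the toy vectors and the precomputed `labels` are assumptions for this example, whereas in the notebook the cluster labels come from `MiniBatchKMeans`.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def closest_to_centroids(vectors, labels):
    """For each cluster label, return the index of the vector nearest to its centroid."""
    best = {}
    for label in sorted(set(labels)):
        members = [i for i, l in enumerate(labels) if l == label]
        center = centroid([vectors[i] for i in members])
        best[label] = min(members, key=lambda i: math.dist(vectors[i], center))
    return best

# Hypothetical 2-D sentence vectors already assigned to two clusters
vectors = [[0.0, 0.0], [1.0, 0.0], [0.4, 0.0], [5.0, 5.0], [5.0, 6.0], [5.0, 5.4]]
labels = [0, 0, 0, 1, 1, 1]
print(closest_to_centroids(vectors, labels))  # {0: 2, 1: 5}
```

The returned indices point at the sentences that would be copied into the summary, one per cluster.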

In [112]:
def load_data():
     
    with open('storage/all_sents.txt', 'r', encoding='utf8') as file:
        all_sents = [sent.split(" ") for sent in file.read().split("\n")]
    
    with open("storage/sents_reference.json", "r") as file:
        sents_reference = json.load(file)
        
    with open("storage/orig_sents.json", "r") as file:
        orig_sents = json.load(file)    

    with open("storage/highlights.json", "r") as file:
        highlights = json.load(file)
        
    return all_sents, sents_reference, orig_sents, highlights


def find_most_relevant_cl(vecs, sents_reference, clusters=3, per_cluster_sent=1):
    
    best = {}
    warned = False
    for text_id in sents_reference.keys():
    
        # Gets the sentence vectors for this text
        target_vecs = vecs[sents_reference[text_id][0]:sents_reference[text_id][-1]+1] 

        # Clusters these sentences
        kmeans_cbow = MiniBatchKMeans(n_clusters=clusters, random_state=42)
        result = kmeans_cbow.fit_transform(target_vecs)
        df = pd.DataFrame(result)

        # Selects the sentence(s) closest to each cluster center
        this_best = []
        for cluster_number in range(result.shape[1]):
            result = df[kmeans_cbow.labels_ == cluster_number].sort_values(by=cluster_number).index.values[:per_cluster_sent]
            if len(result) > 0:
                
                result = list(result)
                while len(result) < per_cluster_sent:
                    result.append(0)
                    
                this_best.append(result)
                
            elif not warned:
                warned = True
                warnings.warn("No center vector found in one of the clusters", RuntimeWarning)
                
        best[text_id] = set(sorted(list(np.array(this_best).flatten())))
            
    return best

def find_most_relevant_pr(vecs, sents_reference, n_sents=3, squared=False):

    best = {}
    for text_id in sents_reference.keys():

        # Gets the sentence vectors for this text
        target_vecs = vecs[sents_reference[text_id][0]:sents_reference[text_id][-1]+1] 
        target_vecs = [vec.toarray() for vec in target_vecs]
        # Builds the pairwise similarity matrix for these sentences
        sim_mat = np.zeros((len(sents_reference[text_id]), len(sents_reference[text_id])))
        for i, v1 in enumerate(target_vecs):
            for j, v2 in enumerate(target_vecs):
                norm1 = np.linalg.norm(v1)
                norm2 = np.linalg.norm(v2)
                # Skips pairs where a vector is all zeros or the norms are negligible
                # Note: the dot product is divided by the sum of the norms (standard cosine similarity divides by their product)
                if (v1.sum() != 0 and v2.sum() != 0) and ((norm1 + norm2) > np.finfo(float).eps):
                    if squared:
                        sim_mat[i][j] = ((v1 * v2).sum() / (norm1 + norm2)) ** 2
                    else:
                        sim_mat[i][j] = (v1 * v2).sum() / (norm1 + norm2)

        graph = nx.from_numpy_array(sim_mat)
        pr = nx.pagerank_numpy(graph)

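        # Note: ascending sort, so this keeps the n_sents nodes with the lowest PageRank scores (reverse=True would keep the highest)
        best[text_id] = set(sorted(pr, key=pr.get)[:n_sents])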
        
    return best

def calculate_rouge(highlights, cl_predict, pr_predict):
    
    rouge_results = {}
    for text_id in cl_predict.keys():
        this_cl_summary = " . ".join(cl_predict[text_id])
        this_pr_summary = " . ".join(pr_predict[text_id])
        this_highlights = highlights[text_id]
        
        this_cl_summary = re.sub(r"[^a-z0-9\ ]", "", this_cl_summary.lower(), flags=re.DOTALL|re.MULTILINE)
        this_pr_summary = re.sub(r"[^a-z0-9\ ]", "", this_pr_summary.lower(), flags=re.DOTALL|re.MULTILINE)
        this_highlights = re.sub(r"[^a-z0-9\ ]", "", this_highlights.lower(), flags=re.DOTALL|re.MULTILINE)
        
        cl_score = rouge.get_scores(this_cl_summary, this_highlights)
        pr_score = rouge.get_scores(this_pr_summary, this_highlights)
        
        rouge_results[text_id] = {name:result for name, result in zip(["Cluster", "PageRank"], cl_score + pr_score)}
        
    return rouge_results

def average_rouge(rouge_results):
    
    n_texts = len(rouge_results)

    sample = np.array([0, 0, 0])
    cl_results = {"Rouge1":sample,"Rouge2":sample,"RougeL":sample}
    pr_results = {"Rouge1":sample,"Rouge2":sample,"RougeL":sample}

    for text_id, score in rouge_results.items():
        cl_results["Rouge1"] = cl_results["Rouge1"] + np.array(list(score["Cluster"]['rouge-1'].values()))
        cl_results["Rouge2"] = cl_results["Rouge2"] + np.array(list(score["Cluster"]['rouge-2'].values()))
        cl_results["RougeL"] = cl_results["RougeL"] + np.array(list(score["Cluster"]['rouge-l'].values()))

        pr_results["Rouge1"] = pr_results["Rouge1"] + np.array(list(score["PageRank"]['rouge-1'].values()))
        pr_results["Rouge2"] = pr_results["Rouge2"] + np.array(list(score["PageRank"]['rouge-2'].values()))
        pr_results["RougeL"] = pr_results["RougeL"] + np.array(list(score["PageRank"]['rouge-l'].values()))

    for key in cl_results.keys():
        cl_results[key] = [round(i, 3) for i in cl_results[key]/n_texts]
        pr_results[key] = [round(i, 3) for i in pr_results[key]/n_texts]
    
    return cl_results, pr_results

def save_summary_result(model_name, summary_cl, summary_pr, rouge_results):
    
    if summary_cl is not None:
        with open(f"outputs/summary_{model_name}_cl.json", "w") as file:
            json.dump(summary_cl, file)

    if summary_pr is not None:
        with open(f"outputs/summary_{model_name}_pr.json", "w") as file:
            json.dump(summary_pr, file)
    
    if rouge_results is not None:
        with open(f"outputs/rouge_{model_name}.json", "w") as file:
            json.dump(rouge_results, file)

# Sentence-level summary
<a id='Sent'></a>

In this section the embedding models are trained on the full set of sentences from all documents in the corpus. They are then used to turn the sentences into vectors, and those vectors, split by document, go through clustering and PageRank algorithms to select the most relevant sentences.
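
To make the PageRank step concrete, here is a minimal power-iteration sketch over a sentence similarity matrix. It is a generic illustration and not the `networkx` implementation used in this notebook: the function name `pagerank`, the damping factor of 0.85 and the toy similarity matrix are assumptions chosen for the example.

```python
def pagerank(sim, damping=0.85, iterations=100, tol=1e-9):
    """Power iteration on a row-normalized similarity matrix (list of lists)."""
    n = len(sim)
    row_sums = [sum(row) for row in sim]
    ranks = [1.0 / n] * n
    for _ in range(iterations):
        new_ranks = []
        for j in range(n):
            incoming = 0.0
            for i in range(n):
                if row_sums[i] > 0:
                    incoming += ranks[i] * sim[i][j] / row_sums[i]
                else:
                    # A node with no outgoing weight spreads its rank uniformly
                    incoming += ranks[i] / n
            new_ranks.append((1 - damping) / n + damping * incoming)
        delta = sum(abs(a - b) for a, b in zip(new_ranks, ranks))
        ranks = new_ranks
        if delta < tol:
            break
    return ranks

# Hypothetical similarities: sentences 0-2 are close to each other, sentence 3 is not
sim = [
    [0.0, 0.9, 0.8, 0.1],
    [0.9, 0.0, 0.7, 0.1],
    [0.8, 0.7, 0.0, 0.1],
    [0.1, 0.1, 0.1, 0.0],
]
ranks = pagerank(sim)
# Indices of the two highest-ranked sentences
top = sorted(range(len(ranks)), key=ranks.__getitem__, reverse=True)[:2]
print(ranks, top)
```

Sentences that are similar to many other sentences accumulate rank, so taking the highest-ranked indices yields the sentences most central to the document.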

### TF-IDF
<a id='tfidf'></a>

In [8]:
all_sents, sents_reference, orig_text, highlights = load_data()

In [9]:
%%time
tfidf = TfidfVectorizer(min_df=5, 
                        max_df=0.9, 
                        max_features=5000, 
                        sublinear_tf=False, 
                        analyzer=lambda x: x)

tfidf_vecs = tfidf.fit_transform(all_sents)

Wall time: 666 ms


#### Clustering

In [10]:
tfidf_cl_best = find_most_relevant_cl(tfidf_vecs, sents_reference)
tfidf_cl_summary = {}
for text_id in tfidf_cl_best.keys():
    tfidf_cl_summary[text_id] = [orig_text[text_id][sent] for sent in tfidf_cl_best[text_id]]



#### PageRank

In [11]:
tfidf_pr_best = find_most_relevant_pr(tfidf_vecs, sents_reference)
tfidf_pr_summary = {}
for text_id in tfidf_pr_best.keys():
    tfidf_pr_summary[text_id] = [orig_text[text_id][sent] for sent in tfidf_pr_best[text_id]]

#### Run the ROUGE test

In [12]:
tfidf_rouge_results = calculate_rouge(highlights, tfidf_cl_summary, tfidf_pr_summary)
tfidf_cl_score, tfidf_pr_score = average_rouge(tfidf_rouge_results)

In [13]:
print("Results in order as: f, p, r")
print("Cluster ", tfidf_cl_score)
print("PageRank", tfidf_pr_score)

Results in order as: f, p, r
Cluster  {'Rouge1': [0.238, 0.217, 0.292], 'Rouge2': [0.063, 0.057, 0.078], 'RougeL': [0.167, 0.154, 0.197]}
PageRank {'Rouge1': [0.216, 0.189, 0.275], 'Rouge2': [0.049, 0.042, 0.065], 'RougeL': [0.144, 0.127, 0.178]}


#### Inspect one result

In [15]:
text_id = list(tfidf_cl_summary.keys())[np.random.randint(0, len(tfidf_cl_summary))]

In [16]:
". ".join(orig_text[text_id])

"-LRB- CNN -RRB- -- Andy Murray 's first match since undergoing back surgery in September ended in a straight sets defeat to Jo-Wilfried Tsonga at an exhibition tournament in Abu Dhabi Thursday. The reigning Wimbledon champion went down 7-5 6-3 to the Frenchman , who himself was plagued by injury at the back end of this year. Murray , who has dropped to No. 4 in the rankings , lacked sharpness after his layoff and was broken in the 12th game of the opening set to fall behind. The British star has been training at his base in Florida to prepare for the upcoming season and looked set to even the match up when he gained an early break of service in the second set. But Tsonga hit back with two breaks of his own to wrap up victory in 72 minutes at the Zayed Sports City complex. `` The courts here are very fast and you have to react quickly , '' said 26-year-old Murray. `` Jo was sharper than me today , he served very well. `` It 's always good fun here. It 's great preparation for the seaso

In [17]:
tfidf_cl_summary[text_id]

['The reigning Wimbledon champion went down 7-5 6-3 to the Frenchman , who himself was plagued by injury at the back end of this year',
 "`` It 's always good fun here",
 "'' The organizers of the Mubadala World Tennis Championship have indeed attracted a stellar field with the top two ranked players , Rafael Nadal and Novak Djokovic , in the line-up"]

In [18]:
tfidf_pr_summary[text_id]

['`` Jo was sharper than me today , he served very well',
 "`` It 's always good fun here",
 'But Tsonga hit back with two breaks of his own to wrap up victory in 72 minutes at the Zayed Sports City complex']

In [19]:
text_rouge = tfidf_rouge_results[text_id]

reform = {(level1_key, level2_key): values
          for level1_key, level2_dict in text_rouge.items()
          for level2_key, values in level2_dict.items()}

pd.DataFrame(reform).round(3)

| | Cluster rouge-1 | Cluster rouge-2 | Cluster rouge-l | PageRank rouge-1 | PageRank rouge-2 | PageRank rouge-l |
|---|---|---|---|---|---|---|
| f | 0.267 | 0.097 | 0.222 | 0.209 | 0.0 | 0.128 |
| p | 0.241 | 0.088 | 0.196 | 0.231 | 0.0 | 0.128 |
| r | 0.298 | 0.109 | 0.256 | 0.191 | 0.0 | 0.128 |


In [113]:
save_summary_result("tfidf", tfidf_cl_summary, tfidf_pr_summary, tfidf_rouge_results)

### CBOW
<a id='cbow'></a>

In [20]:
all_sents, sents_reference, orig_text, highlights = load_data()

In [21]:
%%time
cbow = gensim.models.Word2Vec(
    corpus_file='storage/all_sents.txt',
    window=5,
    size=200,
    seed=42,
    iter=100,
    workers=12,
)

Wall time: 35.3 s


In [22]:
def sum_word_vecs(model, sent):
    vec = np.zeros(model.wv.vector_size)
    for word in sent:
        if word in model:
            vec += model.wv.get_vector(word)
            
    norm = np.linalg.norm(vec)
    if norm > np.finfo(float).eps:
        vec /= norm
    return vec

In [23]:
cbow_vecs = scipy.sparse.csr.csr_matrix([sum_word_vecs(cbow, sent) for sent in all_sents])

#### Clustering

In [24]:
cbow_cl_best = find_most_relevant_cl(cbow_vecs, sents_reference)
cbow_cl_summary = {}
for text_id in cbow_cl_best.keys():
    cbow_cl_summary[text_id] = [orig_text[text_id][sent] for sent in cbow_cl_best[text_id]]



#### PageRank

In [25]:
cbow_pr_best = find_most_relevant_pr(cbow_vecs, sents_reference)
cbow_pr_summary = {}
for text_id in cbow_pr_best.keys():
    cbow_pr_summary[text_id] = [orig_text[text_id][sent] for sent in cbow_pr_best[text_id]]



#### Run the ROUGE test

In [26]:
cbow_rouge_results = calculate_rouge(highlights, cbow_cl_summary, cbow_pr_summary)
cbow_cl_score, cbow_pr_score = average_rouge(cbow_rouge_results)

In [27]:
print("Results in order as: f, p, r")
print("Cluster ", cbow_cl_score)
print("PageRank", cbow_pr_score)

Results in order as: f, p, r
Cluster  {'Rouge1': [0.231, 0.209, 0.287], 'Rouge2': [0.056, 0.05, 0.07], 'RougeL': [0.16, 0.146, 0.191]}
PageRank {'Rouge1': [0.219, 0.193, 0.273], 'Rouge2': [0.051, 0.044, 0.064], 'RougeL': [0.146, 0.13, 0.177]}


#### Inspect one result

In [28]:
text_id = list(cbow_cl_summary.keys())[np.random.randint(0, len(cbow_cl_summary))]

In [29]:
". ".join(orig_text[text_id])

"-LRB- CNN -RRB- -- Didier Drogba scored the only goal as Chelsea beat Juventus 1-0 at Stamford Bridge to give Guus Hiddink 's side a slender advantage ahead of their Champions League last-16 second leg in Turin. Didier Drogba celebrates his goal as Chelsea took a narrow advantage after their home tie against Juventus. Drogba , his season hampered by injury , suspension and a fallout with axed coach Luiz Felipe Scolari , looked back to his predatory best when he took a pass from Salomon Kalou and despatched the ball beyond Gianluigi Buffon in the 12th minute. Former Chelsea coach Claudio Ranieri , now in charge of Juve , was given a warm reception by the home fans before the game. Ranieri is still held in high esteem by Chelsea supporters even though he failed to win a single trophy during his four-year stint at Stamford Bridge. Hiddink was taking charge of a Chelsea side at home for the first time since his temporary appointment and it was the hosts who made the first inroads towards 

In [30]:
cbow_cl_summary[text_id]

['In the 15th minute , Drogba should have made it two when he met a corner from Frank Lampard inside the six-yard box , but he inexplicably headed wide',
 'Juventus enjoyed plenty of possession after the interval but found the Chelsea defense in fine form , with Petr Cech only having to deal with a succession of long-range efforts']

In [31]:
cbow_pr_summary[text_id]

['Juventus enjoyed plenty of possession after the interval but found the Chelsea defense in fine form , with Petr Cech only having to deal with a succession of long-range efforts',
 'Marco Marchionni and Alessandro del Piero both tried their luck from distance and Pavel Nedved went close near the end for the visitors',
 'Ranieri is still held in high esteem by Chelsea supporters even though he failed to win a single trophy during his four-year stint at Stamford Bridge']

In [32]:
text_rouge = cbow_rouge_results[text_id]

reform = {(level1_key, level2_key): values
          for level1_key, level2_dict in text_rouge.items()
          for level2_key, values in level2_dict.items()}

pd.DataFrame(reform).round(3)

| | Cluster rouge-1 | Cluster rouge-2 | Cluster rouge-l | PageRank rouge-1 | PageRank rouge-2 | PageRank rouge-l |
|---|---|---|---|---|---|---|
| f | 0.274 | 0.022 | 0.146 | 0.205 | 0.0 | 0.097 |
| p | 0.232 | 0.018 | 0.128 | 0.154 | 0.0 | 0.074 |
| r | 0.333 | 0.026 | 0.171 | 0.308 | 0.0 | 0.143 |


In [114]:
save_summary_result("cbow", cbow_cl_summary, cbow_pr_summary, cbow_rouge_results)

### Doc2Vec
<a id='doc2vec'></a>

In [33]:
all_sents, sents_reference, orig_text, highlights = load_data()

In [34]:
doc2vec = gensim.models.Doc2Vec(
    corpus_file='storage/all_sents.txt',
    vector_size=200,
    window=5,
    min_count=5,
    workers=12,
    epochs=100,
)

In [35]:
doc2vec_vecs = scipy.sparse.csr.csr_matrix(doc2vec.docvecs.vectors_docs)

#### Clustering

In [36]:
doc2vec_cl_best = find_most_relevant_cl(doc2vec_vecs, sents_reference)
doc2vec_cl_summary = {}
for text_id in doc2vec_cl_best.keys():
    doc2vec_cl_summary[text_id] = [orig_text[text_id][sent] for sent in doc2vec_cl_best[text_id]]



#### PageRank

In [37]:
doc2vec_pr_best = find_most_relevant_pr(doc2vec_vecs, sents_reference)
doc2vec_pr_summary = {}
for text_id in doc2vec_pr_best.keys():
    doc2vec_pr_summary[text_id] = [orig_text[text_id][sent] for sent in doc2vec_pr_best[text_id]]

#### Run the ROUGE test

In [38]:
doc2vec_rouge_results = calculate_rouge(highlights, doc2vec_cl_summary, doc2vec_pr_summary)
doc2vec_cl_score, doc2vec_pr_score = average_rouge(doc2vec_rouge_results)

In [39]:
print("Results in order as: f, p, r")
print("Cluster ", doc2vec_cl_score)
print("PageRank", doc2vec_pr_score)

Results in order as: f, p, r
Cluster  {'Rouge1': [0.232, 0.206, 0.292], 'Rouge2': [0.057, 0.051, 0.073], 'RougeL': [0.161, 0.145, 0.195]}
PageRank {'Rouge1': [0.233, 0.2, 0.301], 'Rouge2': [0.06, 0.051, 0.079], 'RougeL': [0.155, 0.135, 0.193]}


#### Inspect one result

In [40]:
text_id = list(doc2vec_cl_summary.keys())[np.random.randint(0, len(doc2vec_cl_summary))]

In [41]:
". ".join(orig_text[text_id])

"Tripoli , Libya -LRB- CNN -RRB- -- The head of Libya 's opposition government told reporters Saturday he welcomed a call Friday by Russian President Dmitry Medvedev for Moammar Gadhafi to step down. Medvedev 's statement , echoing the stance of American and European leaders , appeared to indicate a closing diplomatic window for the longtime Libyan strongman. The chairman of the National Transitional Council , Mustafa Abdul Jalil , said he has offered amnesty to Gadhafi loyalists who defect before the demise of the regime , but reiterated that there will be `` no negotiation for any solution until Gadhafi 's departure. '' Once that happens , elections and a constitutional referendum will be held within a year , Jailil said in the opposition stronghold of Benghazi. In an interview with CNN , Jalil said the council had sold a shipment of oil to China for $ 160 million. The confirmation of the sale is expected to buttress the political and economic credibility of the fledgling rebel power

In [42]:
doc2vec_cl_summary[text_id]

['`` We should celebrate what our heroic sons have accomplished in Misrata and the Nafusa mountains , as well as applaud the wide international support for our revolution',
 'A decade ago , the site was used as a military station']

In [43]:
doc2vec_pr_summary[text_id]

['A decade ago , the site was used as a military station',
 "Tripoli , Libya -LRB- CNN -RRB- -- The head of Libya 's opposition government told reporters Saturday he welcomed a call Friday by Russian President Dmitry Medvedev for Moammar Gadhafi to step down",
 "Jalil marked the 100th day of the nation 's civil war"]

In [None]:
highlights[text_id]

In [44]:
text_rouge = doc2vec_rouge_results[text_id]

reform = {(level1_key, level2_key): values
          for level1_key, level2_dict in text_rouge.items()
          for level2_key, values in level2_dict.items()}

pd.DataFrame(reform).round(3)

| | Cluster rouge-1 | Cluster rouge-2 | Cluster rouge-l | PageRank rouge-1 | PageRank rouge-2 | PageRank rouge-l |
|---|---|---|---|---|---|---|
| f | 0.078 | 0.0 | 0.062 | 0.108 | 0.0 | 0.051 |
| p | 0.081 | 0.0 | 0.065 | 0.094 | 0.0 | 0.044 |
| r | 0.075 | 0.0 | 0.061 | 0.125 | 0.0 | 0.061 |


In [115]:
save_summary_result("doc2vec", doc2vec_cl_summary, doc2vec_pr_summary, doc2vec_rouge_results)

### LDA
<a id='lda'></a>

In [45]:
all_sents, sents_reference, orig_text, highlights = load_data()

In [46]:
dictionary = Dictionary(all_sents)
doc2bow = [dictionary.doc2bow(sent) for sent in all_sents]

In [47]:
%%time
NUM_TOPICS = 20
ldamodel = LdaMulticore(doc2bow, num_topics=NUM_TOPICS, id2word=dictionary, passes=30)

Wall time: 1min 15s


In [48]:
# Change to True to explore the LDA model interactively
if False:
    lda_display = pyLDAvis.gensim.prepare(ldamodel, doc2bow, dictionary, sort_topics=False)
    pyLDAvis.display(lda_display)

In [49]:
raw_vecs = [ldamodel.get_document_topics(text) for text in doc2bow]

In [50]:
lda_vecs = []
for vec in raw_vecs:
    this_vec = []
    curr = 0
    for i in range(NUM_TOPICS):
        if (i == vec[curr][0]):
            this_vec.append(vec[curr][1])
            curr+=1
            if curr == len(vec):
                curr = -1
        else:
            this_vec.append(0)
    lda_vecs.append(this_vec)
    
lda_vecs = scipy.sparse.csr.csr_matrix(lda_vecs)
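
`get_document_topics` returns only the topics whose probability is above a threshold, as `(topic_id, probability)` pairs, which is why the loop above fills the missing topics with zeros. The same conversion can be sketched more compactly as below; `to_dense` and the example pairs are hypothetical names and values used only for illustration.

```python
def to_dense(topic_pairs, num_topics):
    """Turn [(topic_id, probability), ...] into a fixed-length list, filling gaps with 0.0."""
    dense = [0.0] * num_topics
    for topic_id, prob in topic_pairs:
        dense[topic_id] = prob
    return dense

# Hypothetical output for one sentence with 6 topics in the model
example = [(1, 0.25), (4, 0.7)]
print(to_dense(example, 6))  # [0.0, 0.25, 0.0, 0.0, 0.7, 0.0]
```

Unlike the index-tracking loop, this version also handles a sentence for which no topic is returned, producing an all-zero vector.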

#### Clustering

In [51]:
lda_cl_best = find_most_relevant_cl(lda_vecs, sents_reference)
lda_cl_summary = {}
for text_id in lda_cl_best.keys():
    lda_cl_summary[text_id] = [orig_text[text_id][sent] for sent in lda_cl_best[text_id]]



#### PageRank

In [52]:
lda_pr_best = find_most_relevant_pr(lda_vecs, sents_reference)
lda_pr_summary = {}
for text_id in lda_pr_best.keys():
    lda_pr_summary[text_id] = [orig_text[text_id][sent] for sent in lda_pr_best[text_id]]

#### Run the ROUGE test

In [53]:
lda_rouge_results = calculate_rouge(highlights, lda_cl_summary, lda_pr_summary)
lda_cl_score, lda_pr_score = average_rouge(lda_rouge_results)

In [54]:
print("Results in order as: f, p, r")
print("Cluster ", lda_cl_score)
print("PageRank", lda_pr_score)

Results in order as: f, p, r
Cluster  {'Rouge1': [0.232, 0.207, 0.291], 'Rouge2': [0.058, 0.052, 0.074], 'RougeL': [0.162, 0.147, 0.197]}
PageRank {'Rouge1': [0.227, 0.198, 0.292], 'Rouge2': [0.055, 0.047, 0.072], 'RougeL': [0.152, 0.134, 0.19]}


#### Inspect one result

In [55]:
text_id = list(lda_cl_summary.keys())[np.random.randint(0, len(lda_cl_summary))]

In [56]:
". ".join(orig_text[text_id])



In [57]:
lda_cl_summary[text_id]

['Historically , the role has gone to cardinals , however',
 'But Cardinal Roger Mahony , the retired archbishop of Los Angeles , suggested that the announcement might not be far away in a tweet posted Thursday',
 "Although some may be wondering why it 's taking so long to set the date for the conclave , Lombardi pushed back against the idea that the cardinals were dragging their feet"]

In [58]:
lda_pr_summary[text_id]

["Thursday morning 's business included reports on the financial state of the Holy See , Lombardi said",
 'Lombardi said 152 cardinals met Thursday morning',
 'CNN Vatican analyst John Allen , also a correspondent for National Catholic Reporter , wrote last month that Schoenborn `` certainly has the right pedigree for the job']

In [120]:
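# Note: this cell reads doc2vec_rouge_results rather than lda_rouge_results, so the table below shows the Doc2Vec scores for this text
text_rouge = doc2vec_rouge_results[text_id]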

reform = {(level1_key, level2_key): values
          for level1_key, level2_dict in text_rouge.items()
          for level2_key, values in level2_dict.items()}

pd.DataFrame(reform).round(3)

| | Cluster rouge-1 | Cluster rouge-2 | Cluster rouge-l | PageRank rouge-1 | PageRank rouge-2 | PageRank rouge-l |
|---|---|---|---|---|---|---|
| f | 0.263 | 0.062 | 0.212 | 0.088 | 0.0 | 0.107 |
| p | 0.191 | 0.045 | 0.158 | 0.067 | 0.0 | 0.085 |
| r | 0.419 | 0.1 | 0.321 | 0.129 | 0.0 | 0.143 |


In [116]:
save_summary_result("lda", lda_cl_summary, lda_pr_summary, lda_rouge_results)

# Keyword summary
<a id='Word'></a>

In this section the embedding models are trained on the full set of words from all documents in the corpus. They are then used to turn the words into vectors, and those vectors, split by document, go through clustering and PageRank algorithms to select the most relevant words.

### Word2Vec
<a id='word2vec'></a>

In [60]:
all_sents, sents_reference, orig_text, highlights = load_data()

In [61]:
%%time
model_word2vec = gensim.models.Word2Vec(
    corpus_file='storage/all_sents.txt',
    window=5,
    size=200,
    seed=42,
    iter=100,
    workers=12,
)

Wall time: 40.1 s


In [62]:
all_words = []
word2vec_vecs = []
words_reference = {}
i = 0
for text_id, sents in sents_reference.items():
    words_reference[text_id] = []
    for sent in sents:
        for word in all_sents[sent]:
            if word in model_word2vec:
                all_words.append(word)
                word2vec_vecs.append(model_word2vec.wv.get_vector(word))
                words_reference[text_id].append(i)
                i += 1
                
word2vec_vecs = scipy.sparse.csr.csr_matrix(word2vec_vecs)

#### Clustering

In [89]:
word2vec_cl_best = find_most_relevant_cl(word2vec_vecs, words_reference, per_cluster_sent=5)
word2vec_cl_summary = {}
for text_id in word2vec_cl_best.keys():
    word2vec_cl_summary[text_id] = [all_words[word] for word in word2vec_cl_best[text_id]]



#### PageRank

In [104]:
# PageRank iterates over multiplications of an NxN matrix (N being the number of vectors
# in the document). With words, the number of vectors grows a lot for long documents, so PageRank
# takes an excessive amount of time to compute and is not recommended in this case
if False:
    word2vec_pr_best = find_most_relevant_pr(word2vec_vecs, words_reference, squared=True, n_sents=15)
    word2vec_pr_summary = {}
    for text_id in word2vec_pr_best.keys():
        word2vec_pr_summary[text_id] = [all_words[word] for word in word2vec_pr_best[text_id]]
else:
    word2vec_pr_summary = None

#### Inspect one result

In [105]:
text_id = list(word2vec_cl_summary.keys())[np.random.randint(0, len(word2vec_cl_summary))]

In [106]:
". ".join(orig_text[text_id])

"-LRB- CNN -RRB- -- So , Gary Oldman , tell us what you really think. In a raw interview with Playboy , the actor , 56 , railed against Hollywood `` dishonesty '' and double standards , said that Mel Gibson and Alec Baldwin have been victims of hypocrisy and asserted that not voting for `` 12 Years a Slave '' to win an Oscar meant `` you were a racist. '' Oh , and he does n't like the Golden Globes , helicopter parents or reality TV , either. Indeed , the `` Dark Knight '' actor , who 's starring in the forthcoming `` Dawn of the Planet of the Apes , '' pulled no punches when talking about pretty much anything. The conversation will appear in the magazine 's July/August issue. The Gibson and Baldwin affairs really angered him , he said , because he believes their accusers do n't exactly have clean hands themselves. `` I do n't know about Mel. He got drunk and said a few things , but we 've all said those things. We 're all f *** ing hypocrites , '' Oldman said. `` The policeman who arr

In [107]:
word2vec_cl_summary[text_id]

['cnn',
 'birthplace',
 'francis',
 'churches',
 'pope',
 'far',
 'church',
 'francis',
 'philippines',
 'places',
 'cnn',
 'coming',
 'num',
 'power',
 'centers']

In [110]:
word2vec_pr_summary

In [117]:
save_summary_result("word2vec", word2vec_cl_summary, word2vec_pr_summary, None)

# Conclusions
<a id='Conc'></a>

### Results:

In [152]:
all_results = {"TFIDF":{"Cluster":tfidf_cl_score, "PageRank":tfidf_pr_score},
               "CBow":{"Cluster":cbow_cl_score, "PageRank":cbow_pr_score},
               "Doc2Vec":{"Cluster":doc2vec_cl_score, "PageRank":doc2vec_pr_score},
               "LDA":{"Cluster":lda_cl_score, "PageRank":lda_pr_score}}

reform = {(level1_key, level2_key, level3_key): values
           for level1_key, level2_dict in all_results.items()
           for level2_key, level3_dict in level2_dict.items()
           for level3_key, values      in level3_dict.items()}

all_results = pd.DataFrame(reform, index=["F1", "Precision", "Recall"]).T

In [156]:
all_results

| Model | Method | Metric | F1 | Precision | Recall |
|---|---|---|---|---|---|
| TFIDF | Cluster | Rouge1 | 0.238 | 0.217 | 0.292 |
| TFIDF | Cluster | Rouge2 | 0.063 | 0.057 | 0.078 |
| TFIDF | Cluster | RougeL | 0.167 | 0.154 | 0.197 |
| TFIDF | PageRank | Rouge1 | 0.216 | 0.189 | 0.275 |
| TFIDF | PageRank | Rouge2 | 0.049 | 0.042 | 0.065 |
| TFIDF | PageRank | RougeL | 0.144 | 0.127 | 0.178 |
| CBow | Cluster | Rouge1 | 0.231 | 0.209 | 0.287 |
| CBow | Cluster | Rouge2 | 0.056 | 0.05 | 0.07 |
| CBow | Cluster | RougeL | 0.16 | 0.146 | 0.191 |
| CBow | PageRank | Rouge1 | 0.219 | 0.193 | 0.273 |


In [165]:
all_results.sort_values("Precision", ascending=False)

| Model | Method | Metric | F1 | Precision | Recall |
|---|---|---|---|---|---|
| TFIDF | Cluster | Rouge1 | 0.238 | 0.217 | 0.292 |
| CBow | Cluster | Rouge1 | 0.231 | 0.209 | 0.287 |
| LDA | Cluster | Rouge1 | 0.232 | 0.207 | 0.291 |
| Doc2Vec | Cluster | Rouge1 | 0.232 | 0.206 | 0.292 |
| Doc2Vec | PageRank | Rouge1 | 0.233 | 0.2 | 0.301 |
| LDA | PageRank | Rouge1 | 0.227 | 0.198 | 0.292 |
| CBow | PageRank | Rouge1 | 0.219 | 0.193 | 0.273 |
| TFIDF | PageRank | Rouge1 | 0.216 | 0.189 | 0.275 |
| TFIDF | Cluster | RougeL | 0.167 | 0.154 | 0.197 |
| LDA | Cluster | RougeL | 0.162 | 0.147 | 0.197 |


### Conclusion:

The ROUGE scores show that none of the models produced a very satisfactory summary. Possible reasons:
* The difference in length between the sentences selected by the algorithms and the highlights (sometimes the highlights have more words, sometimes the summary has many more)
* The text preprocessing, which removed very short sentences and stopwords and may have hurt the ranking process
* The ROUGE metric may be too strict for the kind of test performed
* Summarization may be too complex a task for the level of development put into these models

In any case, manual validation showed that some summaries were reasonable for their texts, despite not achieving very high scores.