# Text Summarizer

This project aims to create summaries of texts by selecting the sentences or words that best represent them, based on the grouping of their embeddings. For sentence-level summaries, each sentence of the document goes through an embedding algorithm that turns it into a vector; once all sentences are vectors, clustering or PageRank can use the similarity between sentences to find a few that convey the general idea of the text.

This type of summarization is known as extractive summarization, and several embedding mechanisms can be used to measure the similarity between sentences and help in the extraction process. The principle behind it is that if two sentences are very similar, you only need one of them. The algorithms covered in this project are TF-IDF, CBOW, Doc2Vec, LDA, and Word2Vec.

To assess the quality of the summaries, the ROUGE metric was chosen. From the number of words shared between a reference summary and the generated summary, it computes the F1 score, the precision, and the recall of the generated summary. The higher the precision, the recall, and the F1 score, the better the summary.
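To make the metric concrete, the sketch below computes ROUGE-1 precision, recall, and F1 from unigram overlap. It is a simplified illustration under the assumption of whitespace tokenization, not the exact computation performed by the `rouge` package used later; the function name `rouge_1` is hypothetical.

```python
from collections import Counter

def rouge_1(candidate, reference):
    """Unigram-overlap precision, recall and F1 (simplified ROUGE-1)."""
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    # Clipped overlap: a word counts at most as often as it appears in both texts
    overlap = sum((cand_counts & ref_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"p": precision, "r": recall, "f": f1}

scores = rouge_1("the cat sat", "the cat sat on the mat")
print(scores)
```

In this toy case every candidate word appears in the reference, so precision is perfect, while recall is lower because the reference contains words the candidate misses.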

# Contents

### [Open the Corpus](#Open)

### [Relevant Functions](#Func)

### [Sentence Summarization](#Sent)

[TF-IDF](#tfidf)

[CBOW](#cbow)

[Doc2Vec](#doc2vec)

[LDA](#lda)

### [Keyword Summarization](#Word)

[Word2Vec](#word2vec)

### [Conclusion](#Conc)



### User inputs

In [91]:
# Dataset to study; options: ["cnn_stories_sample", "cnn_stories"]
corpus = "cnn_stories_sample"

# Settings for the word filters
min_sent_size = 5
use_stopwords = True

### Relevant imports

In [92]:
# Imports for data handling
import pandas as pd
import numpy as np
import scipy
import pickle

import json
import os

# Imports for text processing
import re
from nltk.corpus import stopwords
stopwords = stopwords.words("english")

# Imports for vectorization algorithms
from sklearn.feature_extraction.text import TfidfVectorizer

import gensim
from gensim.corpora import Dictionary
from gensim.models.ldamulticore import LdaMulticore
import pyLDAvis.gensim

# Imports for ranking algorithms
import networkx as nx
from sklearn.cluster import MiniBatchKMeans

# Imports for summary evaluation
from rouge import Rouge
rouge = Rouge()

# Miscellaneous
import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

## Open the Corpus
<a id='Open'></a>

In this section the selected corpus is opened, filtered by the `preprocess` function, and saved to JSON and TXT files so the filtering does not have to be repeated later.

In [93]:
# Dictionary of words that must be removed or replaced
manual_conversions = {"nt":"not", "ll":"will", "m":"am", "s":"TRASH"}
if use_stopwords:
    manual_conversions.update({stopword:"TRASH" for stopword in stopwords})

In [94]:
# Cleans and tokenizes one sentence of the text
def preprocess(sent):
    # Drops the bracket tokens used by the corpus tokenizer
    sent = re.sub("-LRB-|-RRB-", "", sent, flags=re.DOTALL|re.MULTILINE)
    sent = sent.lower()
    # Keeps only letters, digits, and spaces
    sent = re.sub(r"[^a-z0-9\ ]", "", sent, flags=re.DOTALL|re.MULTILINE)
    # Replaces every number with the placeholder token "num"
    sent = re.sub(r"[0-9]+", "num", sent, flags=re.DOTALL|re.MULTILINE)
    # Collapses repeated spaces
    sent = re.sub(r" +(?= )", "", sent, flags=re.DOTALL|re.MULTILINE).strip()
    sent = sent.split(" ")
    # Applies the manual conversions, then drops every word mapped to "TRASH"
    sent = [manual_conversions.get(word, word) for word in sent]
    return [word for word in sent if word != "TRASH"]
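As an illustration of what the filter does to a raw corpus sentence, the sketch below reproduces the same steps on a made-up input. The reduced conversion table (without stop words) and the names `conversions` and `preprocess_demo` are assumptions introduced only for this example.

```python
import re

# Hypothetical, reduced conversion table (no stop words) for illustration
conversions = {"nt": "not", "ll": "will", "m": "am", "s": "TRASH"}

def preprocess_demo(sent):
    sent = re.sub("-LRB-|-RRB-", "", sent).lower()
    sent = re.sub(r"[^a-z0-9 ]", "", sent)        # keep letters, digits, spaces
    sent = re.sub(r"[0-9]+", "num", sent)         # numbers become "num"
    sent = re.sub(r" +(?= )", "", sent).strip()   # collapse repeated spaces
    words = [conversions.get(word, word) for word in sent.split(" ")]
    return [word for word in words if word != "TRASH"]

tokens = preprocess_demo("-LRB- CNN -RRB- -- She was n't there in 2007 , he 'll go")
print(tokens)
# ['cnn', 'she', 'was', 'not', 'there', 'in', 'num', 'he', 'will', 'go']
```

With `use_stopwords = True`, common words such as "she" or "was" would also be mapped to "TRASH" and removed.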

In [95]:
# Opens the selected corpus and applies the filtering
sents = {}
orig_text = {}
highlights = {}
available_stories = os.listdir(f"./{corpus}")
for storie in available_stories:
    with open(f"./{corpus}/{storie}", "r", encoding="utf-8") as file:
        file = file.read()
    original_sents = re.sub(r"\n\n", " ", file.split("@highlight")[0], flags=re.DOTALL|re.MULTILINE).split(" . ")
    highlights_sents = re.sub(r"\n\n*", " . ", "".join(file.split("@highlight")[1:]), flags=re.DOTALL|re.MULTILINE)

    # Stores the original text and the highlights
    orig_text[storie] = original_sents
    highlights[storie] = highlights_sents

    # Stores the preprocessed sentences
    preprocessed_sent = [preprocess(original_sent) for original_sent in original_sents]
    sents[storie] = [sent for sent in preprocessed_sent if len(sent) > min_sent_size]

In [96]:
# Saves the filtered, tokenized, and original data
with open("storage/orig_sents.json", "w") as file:
    json.dump(orig_text, file)

with open("storage/highlights.json", "w") as file:
    json.dump(highlights, file)

sents_reference = {}
i = 0
with open('storage/all_sents.txt', 'w', encoding='utf8') as file:
    
    for text_id, text in sents.items():
        sents_reference[text_id] = []
        
        for sentence in text:
            file.write(" ".join([tok for tok in sentence]) + "\n")
            sents_reference[text_id].append(i)
            i += 1
            
with open("storage/sents_reference.json", "w") as file:
    json.dump(sents_reference, file)

## Define relevant functions
<a id='Func'></a>

In this section, some functions are created to avoid repeating code for each embedding model, since the steps of loading the data, clustering, PageRank, computing the ROUGE score, and saving the summaries are very similar.

In [97]:
def load_data():
     
    with open('storage/all_sents.txt', 'r', encoding='utf8') as file:
        all_sents = [sent.split(" ") for sent in file.read().split("\n")]
    
    with open("storage/sents_reference.json", "r") as file:
        sents_reference = json.load(file)
        
    with open("storage/orig_sents.json", "r") as file:
        orig_sents = json.load(file)    

    with open("storage/highlights.json", "r") as file:
        highlights = json.load(file)
        
    return all_sents, sents_reference, orig_sents, highlights


def find_most_relevant_cl(vecs, sents_reference, clusters=3, per_cluster_sent=1):
    
    best = {}
    warned = False
    for text_id in sents_reference.keys():
    
        # Gets the sentence vectors of this text
        target_vecs = vecs[sents_reference[text_id][0]:sents_reference[text_id][-1]+1] 

        # Clusters these sentences
        kmeans_cbow = MiniBatchKMeans(n_clusters=clusters, random_state=42)
        result = kmeans_cbow.fit_transform(target_vecs)
        df = pd.DataFrame(result)

        # Selects the sentence(s) closest to each cluster center
        this_best = []
        for cluster_number in range(result.shape[1]):
            result = df[kmeans_cbow.labels_ == cluster_number].sort_values(by=cluster_number).index.values[:per_cluster_sent]
            if len(result) > 0:
                
                result = list(result)
                while len(result) < per_cluster_sent:
                    result.append(0)
                    
                this_best.append(result)
                
            elif not warned:
                warned = True
                warnings.warn("No center vector found in one of the clusters", RuntimeWarning)
                
        best[text_id] = set(sorted(list(np.array(this_best).flatten())))
            
    return best

def find_most_relevant_pr(vecs, sents_reference, n_sents=3, squared=False):

    best = {}
    for text_id in sents_reference.keys():

        # Gets the sentence vectors of this text
        target_vecs = vecs[sents_reference[text_id][0]:sents_reference[text_id][-1]+1] 
        target_vecs = [vec.toarray() for vec in target_vecs]
        # Builds the pairwise similarity matrix of these sentences
        sim_mat = np.zeros((len(sents_reference[text_id]), len(sents_reference[text_id])))
        for i, v1 in enumerate(target_vecs):
            for j, v2 in enumerate(target_vecs):
                norm1 = np.linalg.norm(v1)
                norm2 = np.linalg.norm(v2)
                # Checks that neither vector sums to zero and that the normalization value is reasonable
                if (v1.sum() != 0 and v2.sum() != 0) and ((norm1 + norm2) > np.finfo(float).eps):
                    # Similarity is the dot product divided by the sum of the norms
                    # (standard cosine similarity would divide by their product)
                    if squared:
                        sim_mat[i][j] = ((v1 * v2).sum() / (norm1 + norm2)) ** 2
                    else:
                        sim_mat[i][j] = (v1 * v2).sum() / (norm1 + norm2)

        graph = nx.from_numpy_array(sim_mat)
        pr = nx.pagerank_numpy(graph)

        # Keeps the first n_sents sentences in ascending order of PageRank score
        best[text_id] = set(sorted(pr, key=pr.get)[:n_sents])
        
    return best

def calculate_rouge(highlights, cl_predict, pr_predict):
    
    rouge_results = {}
    for text_id in cl_predict.keys():
        this_cl_summary = " . ".join(cl_predict[text_id])
        this_pr_summary = " . ".join(pr_predict[text_id])
        this_highlights = highlights[text_id]
        
        this_cl_summary = re.sub(r"[^a-z0-9\ ]", "", this_cl_summary.lower(), flags=re.DOTALL|re.MULTILINE)
        this_pr_summary = re.sub(r"[^a-z0-9\ ]", "", this_pr_summary.lower(), flags=re.DOTALL|re.MULTILINE)
        this_highlights = re.sub(r"[^a-z0-9\ ]", "", this_highlights.lower(), flags=re.DOTALL|re.MULTILINE)
        
        cl_score = rouge.get_scores(this_cl_summary, this_highlights)
        pr_score = rouge.get_scores(this_pr_summary, this_highlights)
        
        rouge_results[text_id] = {name:result for name, result in zip(["Cluster", "PageRank"], cl_score + pr_score)}
        
    return rouge_results

def summary_rouge(rouge_results):
    
    n_texts = len(rouge_results)

    cl_results = {"Rouge1":[],"Rouge2":[],"RougeL":[]}
    pr_results = {"Rouge1":[],"Rouge2":[],"RougeL":[]}

    for text_id, score in rouge_results.items():
        cl_results["Rouge1"].append(np.array(list(score["Cluster"]['rouge-1'].values())))
        cl_results["Rouge2"].append(np.array(list(score["Cluster"]['rouge-2'].values())))
        cl_results["RougeL"].append(np.array(list(score["Cluster"]['rouge-l'].values())))

        pr_results["Rouge1"].append(np.array(list(score["PageRank"]['rouge-1'].values())))
        pr_results["Rouge2"].append(np.array(list(score["PageRank"]['rouge-2'].values())))
        pr_results["RougeL"].append(np.array(list(score["PageRank"]['rouge-l'].values())))
    
    params = ["f", "p", "r"]
    params_std = ["f_std", "p_std", "r_std"]
    for key in list(cl_results.keys()):
        cl_mean, cl_std = np.mean(cl_results[key], axis=0), np.std(cl_results[key], axis=0)
        cl_results[key] = {k:v for k, v in zip(params, cl_mean)} 
        cl_results[f"{key}_std"] = {k:v for k, v in zip(params_std, cl_std)}
        
        pr_mean, pr_std = np.mean(pr_results[key], axis=0), np.std(pr_results[key], axis=0)
        pr_results[key] = {k:v for k, v in zip(params, pr_mean)}
        pr_results[f"{key}_std"] = {k:v for k, v in zip(params_std, pr_std)}
        
    data = {"Cluster":{key:value for key, value in cl_results.items() if not "std" in key},
            "Cluster_std":{key.replace("_std", ""):value for key, value in cl_results.items() if "std" in key},  
            "PageRank":{key:value for key, value in pr_results.items() if not "std" in key},
            "PageRank_std":{key.replace("_std", ""):value for key, value in pr_results.items() if "std" in key}}

    for grouper in data.keys():
        data[grouper] = {(rouge, score.replace("_std", "")):value for rouge in data[grouper].keys() 
                                                                  for score, value in data[grouper][rouge].items()}
                         
    return pd.DataFrame(data)

def save_summary_result(model_name, summary_cl, summary_pr, rouge_results):
    
    if summary_cl is not None:
        with open(f"outputs/summary_{model_name}_cl.json", "w") as file:
            json.dump(summary_cl, file)

    if summary_pr is not None:
        with open(f"outputs/summary_{model_name}_pr.json", "w") as file:
            json.dump(summary_pr, file)
    
    if rouge_results is not None:
        with open(f"outputs/rouge_{model_name}.json", "w") as file:
            json.dump(rouge_results, file)
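For readers who want to see the ranking idea without NetworkX, the following is a minimal pure-Python sketch of PageRank by power iteration over a cosine-similarity graph. It is not the notebook's `find_most_relevant_pr`: it assumes standard cosine similarity, a damping factor of 0.85, and toy 2-dimensional vectors, and the names `cosine` and `pagerank` are hypothetical.

```python
import math

def cosine(u, v):
    """Standard cosine similarity; returns 0.0 for zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def pagerank(sim, damping=0.85, iterations=100):
    """Power iteration on a weighted graph given as a similarity matrix."""
    n = len(sim)
    ranks = [1.0 / n] * n
    out_weight = [sum(row) for row in sim]  # total edge weight leaving each node
    for _ in range(iterations):
        new_ranks = []
        for j in range(n):
            incoming = sum(ranks[i] * sim[i][j] / out_weight[i]
                           for i in range(n) if out_weight[i] > 0)
            new_ranks.append((1 - damping) / n + damping * incoming)
        ranks = new_ranks
    return ranks

# Toy sentence vectors: three similar sentences and one outlier
vectors = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]
sim = [[cosine(u, v) if i != j else 0.0 for j, v in enumerate(vectors)]
       for i, u in enumerate(vectors)]
ranks = pagerank(sim)

# Sorting in descending order keeps the highest-ranked sentences
top = sorted(range(len(ranks)), key=ranks.__getitem__, reverse=True)[:2]
print(ranks, top)
```

The outlier sentence is weakly connected to the others, so it receives the lowest score and is not selected.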

# Summary with sentences
<a id='Sent'></a>

In this section the embedding models are trained on the full set of sentences from all documents in the corpus. They are then used to turn the sentences into vectors, and these vectors, split by document, go through clustering and PageRank algorithms to select each document's most relevant sentences.
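The clustering side of this selection can be sketched as follows: group the sentence vectors, then keep the sentence closest to each cluster center. This is a minimal illustration that swaps in a plain k-means with a deterministic initialization (the first `k` points) instead of scikit-learn's `MiniBatchKMeans`; the function names and the toy vectors are assumptions.

```python
import math

def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def kmeans(points, k, iterations=20):
    # Deterministic initialization for the sketch: the first k points
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest center
        labels = [min(range(k), key=lambda c: distance(p, centers[c])) for p in points]
        # Move each center to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers, labels

def closest_to_centers(points, k):
    """Index of the point nearest to each cluster center."""
    centers, labels = kmeans(points, k)
    chosen = []
    for c in range(k):
        members = [i for i, label in enumerate(labels) if label == c]
        if members:
            chosen.append(min(members, key=lambda i: distance(points[i], centers[c])))
    return sorted(chosen)

# Toy sentence vectors forming two well-separated groups
sentence_vectors = [[0.0, 0.0], [5.0, 5.0], [0.2, 0.0],
                    [0.1, 0.0], [5.2, 5.0], [5.1, 5.0]]
selected = closest_to_centers(sentence_vectors, k=2)
print(selected)  # [3, 5]
```

Each selected index points at the most central sentence of its group, which is the sentence that would go into the summary.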

### TF-IDF
<a id='tfidf'></a>

The TF-IDF value (short for term frequency-inverse document frequency) is a statistical measure intended to indicate how important a word is to a document relative to a collection of documents or a corpus, based on how often the word appears in that document compared with how often it appears in the corpus. Here each sentence is treated as a document, so the score reflects how important a word is to its sentence relative to the rest of the sentences.
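As a rough illustration of the weighting, the sketch below computes the textbook form, raw term count times `log(N / df)`, over a few toy tokenized sentences. It is not what `TfidfVectorizer` computes exactly: scikit-learn's defaults use a smoothed IDF and L2-normalize each row, so the numbers differ. The function name `tfidf_vectors` and the toy data are assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_sents):
    n_docs = len(tokenized_sents)
    # Document frequency: in how many sentences each word appears
    doc_freq = Counter(word for sent in tokenized_sents for word in set(sent))
    vocab = sorted(doc_freq)
    vectors = []
    for sent in tokenized_sents:
        counts = Counter(sent)
        vectors.append([counts[word] * math.log(n_docs / doc_freq[word])
                        for word in vocab])
    return vocab, vectors

toy_sents = [["border", "patrol", "agents"],
             ["border", "security"],
             ["border", "agents", "agents"]]
vocab, vectors = tfidf_vectors(toy_sents)
for sent_vector in vectors:
    print(dict(zip(vocab, sent_vector)))
```

The word "border" appears in every sentence, so its weight is zero everywhere, while a word that is repeated in one sentence and rare elsewhere gets the largest weight.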

In [98]:
all_sents, sents_reference, orig_text, highlights = load_data()

In [99]:
%%time
tfidf = TfidfVectorizer(min_df=5, 
                        max_df=0.9, 
                        max_features=5000, 
                        sublinear_tf=False, 
                        analyzer=lambda x: x)

tfidf_vecs = tfidf.fit_transform(all_sents)

Wall time: 366 ms


#### Clustering

In [100]:
tfidf_cl_best = find_most_relevant_cl(tfidf_vecs, sents_reference)
tfidf_cl_summary = {}
for text_id in tfidf_cl_best.keys():
    tfidf_cl_summary[text_id] = [orig_text[text_id][sent] for sent in tfidf_cl_best[text_id]]

#### PageRank

In [101]:
tfidf_pr_best = find_most_relevant_pr(tfidf_vecs, sents_reference)
tfidf_pr_summary = {}
for text_id in tfidf_pr_best.keys():
    tfidf_pr_summary[text_id] = [orig_text[text_id][sent] for sent in tfidf_pr_best[text_id]]

#### Run the ROUGE test

In [102]:
tfidf_rouge_results = calculate_rouge(highlights, tfidf_cl_summary, tfidf_pr_summary)
tfidf_rouge_summary = summary_rouge(tfidf_rouge_results)
tfidf_rouge_summary.T

| | Rouge1 f | Rouge1 p | Rouge1 r | Rouge2 f | Rouge2 p | Rouge2 r | RougeL f | RougeL p | RougeL r |
|---|---|---|---|---|---|---|---|---|---|
| Cluster | 0.25976 | 0.217934 | 0.350955 | 0.077447 | 0.063872 | 0.107217 | 0.18172 | 0.154687 | 0.235619 |
| Cluster_std | 0.100982 | 0.091195 | 0.151006 | 0.079896 | 0.067569 | 0.112455 | 0.091264 | 0.07998 | 0.127176 |
| PageRank | 0.216318 | 0.189212 | 0.274902 | 0.049444 | 0.042177 | 0.06494 | 0.145535 | 0.12837 | 0.1799 |
| PageRank_std | 0.094421 | 0.086779 | 0.135446 | 0.07131 | 0.061034 | 0.09719 | 0.080917 | 0.072245 | 0.10992 |


#### Look at one result

In [103]:
text_id = list(tfidf_cl_summary.keys())[np.random.randint(0, len(tfidf_cl_summary))]

In [104]:
". ".join(orig_text[text_id])

"-LRB- CNN -RRB- -- More than 4.3 million people tuned in to watch the U.S. women 's soccer team beat Japan in a 2-1 victory in the gold medal Olympic game. Shannon Boxx was just happy to be on the field. After injuring her hamstring , Boxx was sidelined for the team 's earlier game against Colombia. It was heartbreaking for the athlete to sit on the bench after all the health problems she had already battled during her journey to London. Boxx was diagnosed with lupus in 2007 when she was 30 years old. At the time she was playing for the U.S. National Team and had begun feeling extremely fatigued ; regular training sessions left her with joint pain and muscle soreness. She went public with her lupus diagnosis in April 2012 and is now working with the Lupus Foundation of America to create awareness about this chronic autoimmune disease that affects 1.5 million people in the U.S. With lupus , Boxx 's body produces auto-antibodies that attack and destroy healthy tissue because her immune 

In [105]:
tfidf_cl_summary[text_id]

['I remember willing myself through those training sessions and then getting home and lying on the couch the rest of the day',
 'Boxx was diagnosed with lupus in 2007 when she was 30 years old',
 'At the time she was playing for the U.S. National Team and had begun feeling extremely fatigued ; regular training sessions left her with joint pain and muscle soreness']

In [106]:
tfidf_pr_summary[text_id]

["I do my best to eat a balanced diet , but as of right now it is n't any different than the rest of my teammates",
 'I do less Olympic lifts and more body-weight exercises',
 "-RRB- Sjogren 's Syndrome is an autoimmune in which your body attacks your moisture-producing glands"]

In [107]:
text_rouge = tfidf_rouge_results[text_id]

reform = {(level1_key, level2_key): values
          for level1_key, level2_dict in text_rouge.items()
          for level2_key, values in level2_dict.items()}

pd.DataFrame(reform).round(3)

| | Cluster rouge-1 | Cluster rouge-2 | Cluster rouge-l | PageRank rouge-1 | PageRank rouge-2 | PageRank rouge-l |
|---|---|---|---|---|---|---|
| f | 0.211 | 0.065 | 0.25 | 0.173 | 0.051 | 0.137 |
| p | 0.159 | 0.048 | 0.2 | 0.143 | 0.042 | 0.116 |
| r | 0.312 | 0.097 | 0.333 | 0.219 | 0.065 | 0.167 |


In [108]:
save_summary_result("tfidf", tfidf_cl_summary, tfidf_pr_summary, tfidf_rouge_results)

### CBOW
<a id='cbow'></a>

CBOW here means using the Word2Vec model to find a vector for each word in the corpus based on the words that usually appear in its neighborhood. To apply this to documents, the vectors of the words in each sentence are summed, so that sentences with similar words should have similar sum vectors.
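The summing step can be illustrated without a trained model. The sketch below adds up made-up 3-dimensional word vectors and normalizes the result to unit length; the vectors, the dictionary `toy_word_vectors`, and the function `sentence_vector` are hypothetical stand-ins for what a trained Word2Vec model would provide.

```python
import math

# Hypothetical 3-dimensional word vectors, made up for illustration
toy_word_vectors = {
    "border": [1.0, 0.0, 0.0],
    "patrol": [0.8, 0.2, 0.0],
    "soccer": [0.0, 0.0, 1.0],
}

def sentence_vector(sent, word_vectors):
    dims = len(next(iter(word_vectors.values())))
    total = [0.0] * dims
    for word in sent:
        if word in word_vectors:  # out-of-vocabulary words are skipped
            total = [t + w for t, w in zip(total, word_vectors[word])]
    norm = math.sqrt(sum(t * t for t in total))
    return [t / norm for t in total] if norm > 0 else total

vec_a = sentence_vector(["border", "patrol"], toy_word_vectors)
vec_b = sentence_vector(["patrol", "border", "unknown"], toy_word_vectors)
vec_c = sentence_vector(["soccer"], toy_word_vectors)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

print(dot(vec_a, vec_b), dot(vec_a, vec_c))
```

Because the sum ignores word order and unknown words, the first two sentences map to the same vector, while the third, which shares no words with them, is orthogonal to both.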

In [109]:
all_sents, sents_reference, orig_text, highlights = load_data()

In [110]:
%%time
cbow = gensim.models.Word2Vec(
    corpus_file='storage/all_sents.txt',
    window=5,
    size=200,
    seed=42,
    iter=100,
    workers=12,
)

Wall time: 38.7 s


In [111]:
def sum_word_vecs(model, sent):
    vec = np.zeros(model.wv.vector_size)
    for word in sent:
        if word in model.wv:
            vec += model.wv.get_vector(word)
            
    norm = np.linalg.norm(vec)
    if norm > np.finfo(float).eps:
        vec /= norm
    return vec

In [112]:
cbow_vecs = scipy.sparse.csr.csr_matrix([sum_word_vecs(cbow, sent) for sent in all_sents])

#### Clustering

In [113]:
cbow_cl_best = find_most_relevant_cl(cbow_vecs, sents_reference)
cbow_cl_summary = {}
for text_id in cbow_cl_best.keys():
    cbow_cl_summary[text_id] = [orig_text[text_id][sent] for sent in cbow_cl_best[text_id]]

#### PageRank

In [114]:
cbow_pr_best = find_most_relevant_pr(cbow_vecs, sents_reference)
cbow_pr_summary = {}
for text_id in cbow_pr_best.keys():
    cbow_pr_summary[text_id] = [orig_text[text_id][sent] for sent in cbow_pr_best[text_id]]

#### Run the ROUGE test

In [115]:
cbow_rouge_results = calculate_rouge(highlights, cbow_cl_summary, cbow_pr_summary)
cbow_rouge_summary = summary_rouge(cbow_rouge_results)
cbow_rouge_summary.T

| | Rouge1 f | Rouge1 p | Rouge1 r | Rouge2 f | Rouge2 p | Rouge2 r | RougeL f | RougeL p | RougeL r |
|---|---|---|---|---|---|---|---|---|---|
| Cluster | 0.257674 | 0.213397 | 0.352303 | 0.075178 | 0.061447 | 0.104808 | 0.179528 | 0.15092 | 0.235734 |
| Cluster_std | 0.09685 | 0.086449 | 0.14593 | 0.078356 | 0.065913 | 0.109651 | 0.08632 | 0.07491 | 0.121574 |
| PageRank | 0.218177 | 0.191844 | 0.273898 | 0.048831 | 0.04244 | 0.062395 | 0.14631 | 0.129873 | 0.178445 |
| PageRank_std | 0.093517 | 0.087612 | 0.130342 | 0.069202 | 0.060728 | 0.091309 | 0.079103 | 0.072535 | 0.103133 |


#### Look at one result

In [116]:
text_id = list(cbow_cl_summary.keys())[np.random.randint(0, len(cbow_cl_summary))]

In [117]:
". ".join(orig_text[text_id])

"Texas Gov. Rick Perry will immediately send up to 1,000 National Guard troops to help secure the southern border , where tens of thousands of unaccompanied minors from Central America have crossed into the United States this year in a surge that is deemed a humanitarian crisis. Perry also wants President Barack Obama and Congress to hire an additional 3,000 border patrol agents for the Texas border , which would eventually replace the temporary guard forces. `` I will not stand idly by , '' Perry said in Austin Monday , announcing what he called Operation Strong Safety. `` The price of inaction is too high. '' Perry 's state has received the majority of migrant children , especially in the Rio Grande region , and he has repeatedly called on the federal government to beef up border security. White House spokesman Josh Earnest said the White House has not yet received the formal communication required for Perry to deploy guard troops. But he said if Perry follows through , he hopes thos

In [118]:
cbow_cl_summary[text_id]

['Perry also wants President Barack Obama and Congress to hire an additional 3,000 border patrol agents for the Texas border , which would eventually replace the temporary guard forces',
 'At the same time , resources for border security have steadily increased : More than 18,000 agents patrolled the border in 2013 compared to 10,000 a decade ago',
 "Children at the border : What 's happening and why But Perry said the guard will be `` force multipliers , '' helping Customs and Border Protection agents both on the ground and in the air to catch the 80 % of people crossing the border who are n't children and to combat cartel and trafficking crime"]

In [119]:
cbow_pr_summary[text_id]

['About 1 million people have been caught crossing the border nearly every year between 1983 until 2006 , but that number has dropped to about 400,000 in 2013',
 'But he said if Perry follows through , he hopes those forces would be coordinated `` with the significant ongoing efforts already in place',
 "The Obama administration questioned Perry 's motives since many of the minors are not trying to evade the border patrol but are turning themselves in after crossing the border"]

In [120]:
text_rouge = cbow_rouge_results[text_id]

reform = {(level1_key, level2_key): values
          for level1_key, level2_dict in text_rouge.items()
          for level2_key, values in level2_dict.items()}

pd.DataFrame(reform).round(3)

| | Cluster rouge-1 | Cluster rouge-2 | Cluster rouge-l | PageRank rouge-1 | PageRank rouge-2 | PageRank rouge-l |
|---|---|---|---|---|---|---|
| f | 0.273 | 0.053 | 0.19 | 0.254 | 0.032 | 0.078 |
| p | 0.198 | 0.038 | 0.147 | 0.205 | 0.026 | 0.066 |
| r | 0.438 | 0.085 | 0.268 | 0.333 | 0.043 | 0.098 |


In [121]:
save_summary_result("cbow", cbow_cl_summary, cbow_pr_summary, cbow_rouge_results)

### Doc2Vec
<a id='doc2vec'></a>

Doc2Vec works similarly to Word2Vec, but is aimed at representing documents. Besides looking at the words of a paragraph, it uses the paragraph id to learn which words are more common in each part of the document. Using sentences as paragraphs, the vector generated by Doc2Vec can therefore describe a sentence through both the words that compose it and the kind of sentence it is.

In [122]:
all_sents, sents_reference, orig_text, highlights = load_data()

In [123]:
doc2vec = gensim.models.Doc2Vec(
    corpus_file='storage/all_sents.txt',
    vector_size=200,
    window=5,
    min_count=5,
    workers=12,
    epochs=100,
)

In [124]:
doc2vec_vecs = scipy.sparse.csr.csr_matrix(doc2vec.docvecs.vectors_docs)

#### Clustering

In [125]:
doc2vec_cl_best = find_most_relevant_cl(doc2vec_vecs, sents_reference)
doc2vec_cl_summary = {}
for text_id in doc2vec_cl_best.keys():
    doc2vec_cl_summary[text_id] = [orig_text[text_id][sent] for sent in doc2vec_cl_best[text_id]]

#### PageRank

In [126]:
doc2vec_pr_best = find_most_relevant_pr(doc2vec_vecs, sents_reference)
doc2vec_pr_summary = {}
for text_id in doc2vec_pr_best.keys():
    doc2vec_pr_summary[text_id] = [orig_text[text_id][sent] for sent in doc2vec_pr_best[text_id]]


#### Run the ROUGE test

In [127]:
doc2vec_rouge_results = calculate_rouge(highlights, doc2vec_cl_summary, doc2vec_pr_summary)
doc2vec_rouge_summary = summary_rouge(doc2vec_rouge_results)
doc2vec_rouge_summary.T

| | Rouge1 f | Rouge1 p | Rouge1 r | Rouge2 f | Rouge2 p | Rouge2 r | RougeL f | RougeL p | RougeL r |
|---|---|---|---|---|---|---|---|---|---|
| Cluster | 0.24157 | 0.204844 | 0.32057 | 0.062322 | 0.05233 | 0.084075 | 0.166657 | 0.143568 | 0.212597 |
| Cluster_std | 0.097813 | 0.089677 | 0.14044 | 0.074917 | 0.065258 | 0.101684 | 0.086872 | 0.077918 | 0.117063 |
| PageRank | 0.231744 | 0.199216 | 0.299695 | 0.058134 | 0.049416 | 0.076256 | 0.157619 | 0.137367 | 0.196575 |
| PageRank_std | 0.098235 | 0.090443 | 0.138975 | 0.076196 | 0.067011 | 0.100473 | 0.086874 | 0.077847 | 0.114539 |


#### Look at one result

In [128]:
text_id = list(doc2vec_cl_summary.keys())[np.random.randint(0, len(doc2vec_cl_summary))]

In [129]:
". ".join(orig_text[text_id])

"Placerville , California -LRB- CNN -RRB- -- California 's attorney general is `` actively reviewing '' an animal charity executive who had agreed not to take a higher office with another charity after a state investigation into how her previous employer had spent its donations , a spokesman for the AG 's office told CNN. The woman at the center of the review , Terri Crisp , has been identified by SPCA International in its tax filings as one of its directors or officers. She also serves as the spokeswoman for the charity 's `` Baghdad Pups '' program which , according to SPCA International , `` helps U.S. troops safely transport home the companion animals they befriend in the war zone. '' Before her work with SPCA International , Crisp headed the California-based animal rescue charity Noah 's Wish , which received millions of dollars in donations after Hurricane Katrina struck the U.S. Gulf Coast in 2005. It promised to use the money to help animals affected by the disaster The Califor

In [130]:
doc2vec_cl_summary[text_id]

["In addition to its questionable finances , CNN found that SPCA International misrepresented the `` Baghdad Pups '' program on its tax filings",
 'But this organization has an enormous amount of fund-raising costs , certainly relative to the amount of money being spent',
 "'' Of the $ 14 million raised in 2010 , SPCA International reports it spent less than 0.5 % -- about $ 60,000 -- in small cash grants to animal shelters across the United States"]

In [131]:
doc2vec_pr_summary[text_id]

['CNN requested an on-camera interview several weeks ago from Stephanie Scott , the SPCA International public relations director , but Scott never responded either by phone or e-mail',
 "Quadriga Art is one of the world 's largest direct-mail providers to charities and nonprofits",
 "'' Of the $ 14 million raised in 2010 , SPCA International reports it spent less than 0.5 % -- about $ 60,000 -- in small cash grants to animal shelters across the United States"]

In [132]:
highlights[text_id]

" . Under a 2007 settlement , Terri Crisp agreed not to serve as a charity official . Yet , last year , she was named as one of SPCA International 's directors and officers . California 's attorney general is now reviewing Crisp 's involvement with SPCAI"

In [133]:
text_rouge = doc2vec_rouge_results[text_id]

reform = {(level1_key, level2_key): values
          for level1_key, level2_dict in text_rouge.items()
          for level2_key, values in level2_dict.items()}

pd.DataFrame(reform).round(3)

| | Cluster rouge-1 | Cluster rouge-2 | Cluster rouge-l | PageRank rouge-1 | PageRank rouge-2 | PageRank rouge-l |
|---|---|---|---|---|---|---|
| f | 0.074 | 0.019 | 0.09 | 0.145 | 0.037 | 0.083 |
| p | 0.06 | 0.015 | 0.075 | 0.116 | 0.029 | 0.067 |
| r | 0.098 | 0.025 | 0.111 | 0.195 | 0.05 | 0.111 |


In [134]:
save_summary_result("doc2vec", doc2vec_cl_summary, doc2vec_pr_summary, doc2vec_rouge_results)

### LDA
<a id='lda'></a>

Topic modeling is the task of identifying the topics that best describe a set of documents. These topics only emerge during the modeling process, which is why they are called latent. A popular topic modeling technique is Latent Dirichlet Allocation (LDA). LDA assumes a fixed set of topics, each representing a set of words. Its goal is to map every document to the topics so that the words in each document are mostly captured by those imaginary topics. Here, LDA is used to find topics in the sentences of the corpus, so that sentences with similar topics should be similar.

In [135]:
all_sents, sents_reference, orig_text, highlights = load_data()

In [136]:
dictionary = Dictionary(all_sents)
doc2bow = [dictionary.doc2bow(sent) for sent in all_sents]

In [137]:
%%time
NUM_TOPICS = 20
ldamodel = LdaMulticore(doc2bow, num_topics=NUM_TOPICS, id2word=dictionary, passes=30)

Wall time: 1min 16s


In [138]:
# Change to True to explore the LDA model
if False:
    lda_display = pyLDAvis.gensim.prepare(ldamodel, doc2bow, dictionary, sort_topics=False)
    pyLDAvis.display(lda_display)

In [139]:
raw_vecs = [ldamodel.get_document_topics(text) for text in doc2bow]

In [140]:
lda_vecs = []
for vec in raw_vecs:
    this_vec = []
    curr = 0
    for i in range(NUM_TOPICS):
        if (i == vec[curr][0]):
            this_vec.append(vec[curr][1])
            curr+=1
            if curr == len(vec):
                curr = -1
        else:
            this_vec.append(0)
    lda_vecs.append(this_vec)
    
lda_vecs = scipy.sparse.csr.csr_matrix(lda_vecs)
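The loop above turns each sparse list of `(topic_id, probability)` pairs into a fixed-length vector with one slot per topic. A shorter way to express the same conversion is sketched below on made-up data; the helper name `to_dense` and the example values are assumptions, and the sketch presumes each topic id is a valid index below `num_topics`.

```python
def to_dense(topic_pairs, num_topics):
    """Expand sparse (topic_id, probability) pairs into a dense list."""
    dense = [0.0] * num_topics
    for topic_id, probability in topic_pairs:
        dense[topic_id] = probability
    return dense

# Hypothetical sparse output for one sentence: only topics 1 and 4 are present
sparse_doc = [(1, 0.25), (4, 0.7)]
dense_doc = to_dense(sparse_doc, num_topics=6)
print(dense_doc)  # [0.0, 0.25, 0.0, 0.0, 0.7, 0.0]
```

This form also handles a sentence with no reported topics, which simply yields an all-zero vector.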

#### Clustering

In [141]:
lda_cl_best = find_most_relevant_cl(lda_vecs, sents_reference)
lda_cl_summary = {}
for text_id in lda_cl_best.keys():
    lda_cl_summary[text_id] = [orig_text[text_id][sent] for sent in lda_cl_best[text_id]]

#### PageRank

In [142]:
lda_pr_best = find_most_relevant_pr(lda_vecs, sents_reference)
lda_pr_summary = {}
for text_id in lda_pr_best.keys():
    lda_pr_summary[text_id] = [orig_text[text_id][sent] for sent in lda_pr_best[text_id]]

#### Run the ROUGE test

In [143]:
lda_rouge_results = calculate_rouge(highlights, lda_cl_summary, lda_pr_summary)
lda_rouge_summary = summary_rouge(lda_rouge_results)
lda_rouge_summary.T

| | Rouge1 f | Rouge1 p | Rouge1 r | Rouge2 f | Rouge2 p | Rouge2 r | RougeL f | RougeL p | RougeL r |
|---|---|---|---|---|---|---|---|---|---|
| Cluster | 0.244753 | 0.206479 | 0.328869 | 0.064587 | 0.053491 | 0.088854 | 0.167786 | 0.143892 | 0.215974 |
| Cluster_std | 0.095793 | 0.087585 | 0.14411 | 0.074595 | 0.064001 | 0.103575 | 0.084989 | 0.076337 | 0.117416 |
| PageRank | 0.2275 | 0.197228 | 0.293353 | 0.056417 | 0.048156 | 0.074433 | 0.15484 | 0.13607 | 0.193202 |
| PageRank_std | 0.093677 | 0.086917 | 0.136365 | 0.07226 | 0.064021 | 0.096826 | 0.081134 | 0.074008 | 0.111042 |


#### Look at one result

In [145]:
text_id = list(lda_cl_summary.keys())[np.random.randint(0, len(lda_cl_summary))]

In [146]:
". ".join(orig_text[text_id])

"-LRB- CNN -RRB- -- Bolivian President Evo Morales on Sunday pledged to continue his hunger strike until Monday , when Congress -- including the opposition-led Senate -- is set to reconvene. Evo Morales on hunger strike at the presidential palace in Bolivia 's capital , La Paz. Morales ' speech , televised by a state-run station , was his first formal address to the nation since starting the strike Thursday in the government palace. More than three days into the strike , Morales appeared healthy during his address. The president wants the opposition-led Senate to set a date for general elections that are expected to give him another five-year term. Morales on Friday called on opposition members -- who walked out of the Congress in mid-session late Thursday -- to pass the election law , the government-run Bolivian Information Agency said. The nation 's first indigenous president reportedly carried out an 18-day hunger strike in 2002 , when he was expelled from Congress. "

In [147]:
lda_cl_summary[text_id]

['-LRB- CNN -RRB- -- Bolivian President Evo Morales on Sunday pledged to continue his hunger strike until Monday , when Congress -- including the opposition-led Senate -- is set to reconvene',
 'More than three days into the strike , Morales appeared healthy during his address',
 "The nation 's first indigenous president reportedly carried out an 18-day hunger strike in 2002 , when he was expelled from Congress"]

In [148]:
lda_pr_summary[text_id]

['-LRB- CNN -RRB- -- Bolivian President Evo Morales on Sunday pledged to continue his hunger strike until Monday , when Congress -- including the opposition-led Senate -- is set to reconvene',
 "Evo Morales on hunger strike at the presidential palace in Bolivia 's capital , La Paz",
 'More than three days into the strike , Morales appeared healthy during his address']

In [149]:
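# NOTE: this reads the Doc2Vec ROUGE results; `lda_rouge_results` is probably what was
# intended in this LDA section. Kept as executed so that it matches the output shown.
text_rouge = doc2vec_rouge_results[text_id]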

reform = {(level1_key, level2_key): values
          for level1_key, level2_dict in text_rouge.items()
          for level2_key, values in level2_dict.items()}

pd.DataFrame(reform).round(3)

Method,Cluster,Cluster,Cluster,PageRank,PageRank,PageRank
Metric,rouge-1,rouge-2,rouge-l,rouge-1,rouge-2,rouge-l
f,0.476,0.194,0.414,0.427,0.198,0.391
p,0.391,0.159,0.367,0.355,0.164,0.347
r,0.61,0.25,0.474,0.537,0.25,0.447


In [150]:
save_summary_result("lda", lda_cl_summary, lda_pr_summary, lda_rouge_results)

# Keyword summary
<a id='Word'></a>

In this section, the embedding models are trained on the full set of words from every document in the corpus. They are then used to turn the words into vectors, and those vectors, grouped by document, go through clustering and PageRank algorithms that select each document's most relevant words.

### Word2Vec
<a id='word2vec'></a>

The Word2Vec model describes a word by the words that surround it in the documents, so words with similar vectors usually occur in the same context. For summarization, this means that when two words are used in the same context, only one of them is needed to summarize the text.
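The redundancy idea above can be illustrated with a minimal sketch. It is not the notebook's selection method: the three-dimensional vectors are made up for illustration, and `cosine_similarity`, `drop_redundant` and the `0.9` threshold are hypothetical names and values.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (hypothetical); the Word2Vec model in this notebook uses 200 dimensions
toy_vectors = {
    "president": [0.9, 0.1, 0.0],
    "senator":   [0.8, 0.2, 0.1],
    "strike":    [0.0, 0.2, 0.9],
}

def drop_redundant(words, vectors, threshold=0.9):
    """Keep a word only if it is not too similar to any word already kept."""
    kept = []
    for word in words:
        if all(cosine_similarity(vectors[word], vectors[k]) < threshold for k in kept):
            kept.append(word)
    return kept

# "senator" points in almost the same direction as "president", so it is dropped
keywords = drop_redundant(["president", "senator", "strike"], toy_vectors)
```

With these toy values, `keywords` keeps one word per context instead of both near-synonyms.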

In [151]:
all_sents, sents_reference, orig_text, highlights = load_data()

In [152]:
%%time
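# Parameter names follow the gensim 3.x API used here;
# gensim 4.x renames `size` to `vector_size` and `iter` to `epochs`.
model_word2vec = gensim.models.Word2Vec(
    corpus_file='storage/all_sents.txt',
    window=5,
    size=200,
    seed=42,
    iter=100,
    workers=12,
)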

Wall time: 39.3 s


In [153]:
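all_words = []        # every in-vocabulary word, in corpus order
word2vec_vecs = []    # the Word2Vec vector of each entry in all_words
words_reference = {}  # text_id -> indices into all_words for that document
i = 0
for text_id, sents in sents_reference.items():
    words_reference[text_id] = []
    for sent in sents:
        for word in all_sents[sent]:
            # Skip words that were left out of the vocabulary (e.g. below min_count)
            if word in model_word2vec.wv:
                all_words.append(word)
                word2vec_vecs.append(model_word2vec.wv.get_vector(word))
                words_reference[text_id].append(i)
                i += 1

word2vec_vecs = scipy.sparse.csr_matrix(word2vec_vecs)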

#### Clusterização

In [154]:
word2vec_cl_best = find_most_relevant_cl(word2vec_vecs, words_reference, per_cluster_sent=5)
word2vec_cl_summary = {}
for text_id in word2vec_cl_best.keys():
    word2vec_cl_summary[text_id] = [all_words[word] for word in word2vec_cl_best[text_id]]
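`find_most_relevant_cl` is defined earlier in the notebook and its implementation is not repeated here. As a rough sketch of the general idea (cluster the word vectors of one document, then keep the words closest to each centroid), the block below uses a plain k-means written from scratch. All names (`kmeans`, `keywords_by_cluster`, `per_cluster`) and the two-dimensional toy vectors are assumptions for illustration, not the notebook's code.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(points, k, iterations=10):
    """Plain Lloyd's k-means with a deterministic, evenly spaced initialisation."""
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for index, point in enumerate(points):
            nearest = min(range(k), key=lambda c: squared_distance(point, centroids[c]))
            clusters[nearest].append(index)
        # Update step: move each centroid to the mean of its members
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(points[i][d] for i in members) / len(members)
                                for d in range(len(points[0]))]
    return centroids, clusters

def keywords_by_cluster(words, vectors, k, per_cluster=1):
    """Return the `per_cluster` words closest to each of the k centroids."""
    centroids, clusters = kmeans(vectors, k)
    selected = []
    for centroid, members in zip(centroids, clusters):
        ranked = sorted(members, key=lambda i: squared_distance(vectors[i], centroid))
        selected.extend(words[i] for i in ranked[:per_cluster])
    return selected

# Toy document: two obvious groups of words (hypothetical 2-D vectors)
words = ["pope", "cardinals", "catholic", "year", "month", "week"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.0], [0.0, 1.0], [0.2, 0.8], [0.1, 0.9]]
keywords = keywords_by_cluster(words, vectors, k=2)
```

The notebook call above passes `per_cluster_sent=5`, which would correspond to keeping several words per cluster rather than one.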



#### PageRank

In [155]:
# PageRank iterates multiplications of an N x N matrix (N being the number of vectors in the
# document). With words, N grows quickly for long documents, so PageRank takes an excessive
# amount of time to compute and is not recommended in this case.
if False:
    word2vec_pr_best = find_most_relevant_pr(word2vec_vecs, words_reference, squared=True, n_sents=15)
    word2vec_pr_summary = {}
    for text_id in word2vec_pr_best.keys():
        word2vec_pr_summary[text_id] = [all_words[word] for word in word2vec_pr_best[text_id]]
else:
    word2vec_pr_summary = None
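To make the cost argument concrete, here is a minimal power-iteration PageRank over a dense similarity matrix. It is a sketch under assumptions, not the notebook's `find_most_relevant_pr`: the function name, the damping factor of 0.85, the iteration count and the toy 3 x 3 matrix are all illustrative. Each iteration touches every pair of nodes, so a document with 5,000 in-vocabulary words would need a 5,000 x 5,000 matrix, that is, 25 million similarity values.

```python
def pagerank(similarity, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dense N x N similarity matrix (list of lists)."""
    n = len(similarity)
    row_sums = [sum(row) for row in similarity]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Score flowing into node i from every node j:
            # O(N) work per node, hence O(N^2) per iteration
            incoming = sum(
                scores[j] * similarity[j][i] / row_sums[j]
                for j in range(n) if row_sums[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

# Toy symmetric similarity matrix (hypothetical values, zero diagonal)
toy_similarity = [
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.2],
    [0.1, 0.2, 0.0],
]
scores = pagerank(toy_similarity)

# Indices of the two highest-ranked items, best first
top_two = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:2]
```

In the toy matrix, item 2 is only weakly similar to the others, so it receives the lowest score and is left out of `top_two`.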

#### Inspecting a result

In [216]:
text_id = list(word2vec_cl_summary.keys())[np.random.randint(0, len(word2vec_cl_summary))]

In [221]:
". ".join(orig_text[text_id])

"-LRB- CNN -RRB- -- Syrian opposition forces may have executed as many as 30 people , most of them government soldiers , in rural Aleppo , according to the United Nations , which cited videos of the killings posted on the Internet in July. U.N High Commissioner for Human Rights Navi Pillay called the allegations `` deeply shocking '' and called Friday for an independent investigation into the incident , which appears to have taken place in Khan al-Assal in northern Syria. `` There needs to be a thorough independent investigation to establish whether war crimes have been committed , and those responsible for such crimes should be brought to justice , '' Pillay said in a statement. The videos , posted to the Internet between July 22 and 26 , show government soldiers being ordered to lie on the ground , bodies being collected by doctors , corpses strewn along a wall and bodies in Khan al-Assal bearing gunshot wounds to the head. Pillay 's office also has information that Syrian rebels are

In [222]:
word2vec_cl_summary[text_id]

['said',
 'feast',
 'americans',
 'pope',
 'places',
 'year',
 'allen',
 'deserves',
 'cardinals',
 'cape',
 'senior',
 'small',
 'cardinals',
 'catholic',
 'jesus']

In [219]:
word2vec_pr_summary

In [220]:
save_summary_result("word2vec", word2vec_cl_summary, word2vec_pr_summary, None)

# Conclusions
<a id='Conc'></a>

### Numerical results:

In [213]:
all_results = {"TFIDF":tfidf_rouge_summary,
               "CBow":cbow_rouge_summary,
               "Doc2Vec":doc2vec_rouge_summary,
               "LDA":lda_rouge_summary}

all_results_mean = pd.DataFrame({(level1_key, level2_key): values
                                  for level1_key, level2_dict in all_results.items()
                                  for level2_key, values in level2_dict.items() if "std" not in level2_key}).T

all_results_std = pd.DataFrame({(level1_key, level2_key.replace("_std", "")): values
                               for level1_key, level2_dict in all_results.items()
                               for level2_key, values in level2_dict.items() if "std" in level2_key}).T

order = all_results_mean.sort_values(("Rouge1", "p"), ascending=False).index

In [214]:
print("Mean score of all models")
all_results_mean.loc[order]

Mean score of all models


Model,Method,Rouge1,Rouge1,Rouge1,Rouge2,Rouge2,Rouge2,RougeL,RougeL,RougeL
,,f,p,r,f,p,r,f,p,r
TFIDF,Cluster,0.25976,0.217934,0.350955,0.077447,0.063872,0.107217,0.18172,0.154687,0.235619
CBow,Cluster,0.257674,0.213397,0.352303,0.075178,0.061447,0.104808,0.179528,0.15092,0.235734
LDA,Cluster,0.244753,0.206479,0.328869,0.064587,0.053491,0.088854,0.167786,0.143892,0.215974
Doc2Vec,Cluster,0.24157,0.204844,0.32057,0.062322,0.05233,0.084075,0.166657,0.143568,0.212597
Doc2Vec,PageRank,0.231744,0.199216,0.299695,0.058134,0.049416,0.076256,0.157619,0.137367,0.196575
LDA,PageRank,0.2275,0.197228,0.293353,0.056417,0.048156,0.074433,0.15484,0.13607,0.193202
CBow,PageRank,0.218177,0.191844,0.273898,0.048831,0.04244,0.062395,0.14631,0.129873,0.178445
TFIDF,PageRank,0.216318,0.189212,0.274902,0.049444,0.042177,0.06494,0.145535,0.12837,0.1799


In [215]:
print("Standard deviation of the score of all models")
all_results_std.loc[order]

Standard deviation of the score of all models


Model,Method,Rouge1,Rouge1,Rouge1,Rouge2,Rouge2,Rouge2,RougeL,RougeL,RougeL
,,f,p,r,f,p,r,f,p,r
TFIDF,Cluster,0.100982,0.091195,0.151006,0.079896,0.067569,0.112455,0.091264,0.07998,0.127176
CBow,Cluster,0.09685,0.086449,0.14593,0.078356,0.065913,0.109651,0.08632,0.07491,0.121574
LDA,Cluster,0.095793,0.087585,0.14411,0.074595,0.064001,0.103575,0.084989,0.076337,0.117416
Doc2Vec,Cluster,0.097813,0.089677,0.14044,0.074917,0.065258,0.101684,0.086872,0.077918,0.117063
Doc2Vec,PageRank,0.098235,0.090443,0.138975,0.076196,0.067011,0.100473,0.086874,0.077847,0.114539
LDA,PageRank,0.093677,0.086917,0.136365,0.07226,0.064021,0.096826,0.081134,0.074008,0.111042
CBow,PageRank,0.093517,0.087612,0.130342,0.069202,0.060728,0.091309,0.079103,0.072535,0.103133
TFIDF,PageRank,0.094421,0.086779,0.135446,0.07131,0.061034,0.09719,0.080917,0.072245,0.10992


### Conclusion:

The ROUGE scores show that none of the models produced a very satisfactory summary. Possible reasons:
* The length mismatch between the sentences selected by the algorithms and the highlights (sometimes the highlight has more words, sometimes the summary has many more words)
* The text preprocessing, which may have hurt the ranking by removing very short sentences and stopwords
* The ROUGE metric may be too strict for the kind of test performed
* Summarization may be too complex a task for the level of development put into these models
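The length-mismatch point in the first bullet can be illustrated with a simplified ROUGE-1. This is a sketch, not the scoring library used in the notebook: `rouge_1` is a hypothetical helper that tokenizes on whitespace and counts clipped unigram overlap, and the example strings are made up.

```python
from collections import Counter

def rouge_1(candidate, reference):
    """Unigram-overlap precision, recall and F1 between two whitespace-tokenized strings."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps, per word, the smaller of the two counts (clipping)
    overlap = sum((cand_counts & ref_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"p": precision, "r": recall, "f": f1}

reference = "morales continues hunger strike"
short_summary = "morales continues hunger strike"
long_summary = "morales continues hunger strike until congress reconvenes on monday"

exact = rouge_1(short_summary, reference)
padded = rouge_1(long_summary, reference)
```

Both candidates contain every reference word, so recall is the same, but the longer one is penalized in precision (4 matching words out of 9). An extractive summary built from full sentences is exposed to the same effect when the highlight is much shorter.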

Even so, manual validation showed that some summaries were reasonable for their texts, despite their modest scores.

Future work: one way to improve summary quality would be to use the best summary produced by the algorithms, so that out of the 8 possible summaries (4 models × 2 selection methods), only the one with the highest precision or F1 score is shown to the reader.
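A minimal sketch of that selection step follows. It assumes a hypothetical per-document dictionary keyed by `(model, method)` whose values follow the `rouge-1` / `f`, `p`, `r` layout seen in the per-text ROUGE output above; the scores are made up. Since ROUGE needs a reference highlight, this only applies to documents that have one.

```python
def pick_best_summary(candidates, metric="f"):
    """Return the (model, method) key with the highest ROUGE-1 value for `metric`."""
    return max(candidates, key=lambda key: candidates[key]["rouge-1"][metric])

# Hypothetical scores for a single document (illustrative values only)
doc_scores = {
    ("TFIDF", "Cluster"):   {"rouge-1": {"f": 0.31, "p": 0.27, "r": 0.36}},
    ("LDA", "PageRank"):    {"rouge-1": {"f": 0.42, "p": 0.35, "r": 0.52}},
    ("Doc2Vec", "Cluster"): {"rouge-1": {"f": 0.38, "p": 0.44, "r": 0.33}},
}

best_by_f1 = pick_best_summary(doc_scores)
best_by_precision = pick_best_summary(doc_scores, metric="p")
```

The two criteria can disagree, as they do in this toy example, so the choice between precision and F1 would need to be fixed before comparing models.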
