<center><h1>Algorithm for Summarizing a Text (Naive)</h1></center>

<img src="./img/ml.jpeg" width="550">

In [20]:
import re

class SummaryTool:

    def split_content_to_sentences(self, content):
        '''
        Split a text into sentences (naive).
        '''
        content = content.replace("\n", ". ")
        return content.split(". ")

    def split_content_to_paragraphs(self, content):
        '''
        Split a text into paragraphs (naive).
        '''
        return content.split("\n\n")

    def sentences_intersection(self, sent1, sent2):
        '''
        Compute the intersection score between two sentences.
        '''

        # Split the sentences into words/tokens
        s1 = set(sent1.split(" "))
        s2 = set(sent2.split(" "))

        # Guard against division by zero when both token sets are empty
        if len(s1) + len(s2) == 0:
            return 0

        # Normalize the overlap by the average number of distinct tokens
        return len(s1.intersection(s2)) / ((len(s1) + len(s2)) / 2)

    def format_sentence(self, sentence):
        '''
        Format a sentence: remove every non-word character (anything other than letters, digits, and underscores).
        The formatted sentence is used as the key in the sentence dictionary.
        '''
        sentence = re.sub(r'\W+', '', sentence)
        return sentence

    def get_senteces_ranks(self, content):
        '''
        Convert the content into a dictionary <k, v>:
        k = the formatted sentence
        v = the sentence's rank (score)
        '''

        # Split the content into sentences
        sentences = self.split_content_to_sentences(content)

        # Compute the intersection score for every pair of sentences
        n = len(sentences)
        values = [[0 for _ in range(n)] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                values[i][j] = self.sentences_intersection(sentences[i], sentences[j])

        # Build the sentence dictionary
        # A sentence's score is the sum of its intersections with all other sentences.
        sentences_dic = {}
        for i in range(n):
            score = 0
            for j in range(n):
                if i == j:
                    continue
                score += values[i][j]
            sentences_dic[self.format_sentence(sentences[i])] = score

        return sentences_dic

    def get_best_sentence(self, paragraph, sentences_dic):
        '''
        Return the best sentence in a paragraph.
        '''

        # Split the paragraph into sentences
        sentences = self.split_content_to_sentences(paragraph)

        # Ignore short paragraphs
        if len(sentences) < 2:
            return ""

        # Pick the highest-scoring sentence according to the sentence dictionary
        best_sentence = ""
        max_value = 0
        for s in sentences:
            strip_s = self.format_sentence(s)
            if strip_s:
                if sentences_dic[strip_s] > max_value:
                    max_value = sentences_dic[strip_s]
                    best_sentence = s

        return best_sentence

    def get_summary(self, title, content, sentences_dic):
        '''
        Build the summary.
        '''

        # Split the content into paragraphs
        paragraphs = self.split_content_to_paragraphs(content)

        # Add the title
        summary = []
        summary.append(title.strip())
        summary.append("")

        # Add the best sentence from each paragraph
        for p in paragraphs:
            sentence = self.get_best_sentence(p, sentences_dic).strip()
            if sentence:
                summary.append(sentence)

        return "\n".join(summary)
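
As a small worked example of the scoring used in `sentences_intersection`, the sketch below repeats the same formula as a standalone function and applies it to two made-up sentences (the sample sentences are assumptions, not taken from the notebook's input). Each sentence has 5 distinct tokens and they share 3, so the score is 3 divided by the average set size of 5.

```python
def sentences_intersection(sent1, sent2):
    # Same formula as the method above, written as a plain function
    s1 = set(sent1.split(" "))
    s2 = set(sent2.split(" "))
    if len(s1) + len(s2) == 0:
        return 0
    return len(s1 & s2) / ((len(s1) + len(s2)) / 2)

# Hypothetical sample sentences, for illustration only
a = "the cat sat on the mat"
b = "the dog sat on the rug"

score = sentences_intersection(a, b)
print(score)  # 3 shared tokens / average of 5 distinct tokens = 0.6
```

Because the tokens go into sets, repeated words such as "the" count only once, and a sentence compared with itself scores 1.0.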

In [28]:
# Article by Pablo Robledo, 'El Espectador' (Nov 25, 2020)

title = 'Un juez suspende la norma que obligaría a los condados de Georgia a realizar el recuento manual de votos'

with open('text.txt', encoding='utf-8') as f:
    content = f.read()

In [29]:
# Create a SummaryTool object
st = SummaryTool()

# Build the sentence dictionary
sentences_dic = st.get_senteces_ranks(content)
sentences_dic

{'THEADVENTURESOF': 0.0,
 'SHERLOCKHOLMES': 0.0,
 '': 226.83333333333331,
 'BY': 0.0,
 'SIRARTHURCONANDOYLE': 0.0,
 'CONTENTS': 0.0,
 'IAScandalinBohemia': 10.722054521351408,
 'IITheRedHeadedLeague': 0.0,
 'IIIACaseofIdentity': 15.10135849933062,
 'IVTheBoscombeValleyMystery': 0.0,
 'VTheFiveOrangePips': 0.0,
 'VITheManwiththeTwistedLip': 24.484474636719018,
 'VIITheAdventureoftheBlueCarbuncle': 35.311924626942925,
 'VIIITheAdventureoftheSpeckledBand': 35.311924626942925,
 'IXTheAdventureoftheEngineersThumb': 35.311924626942925,
 'XTheAdventureoftheNobleBachelor': 35.311924626942925,
 'XITheAdventureoftheBerylCoronet': 35.311924626942925,
 'XIITheAdventureoftheCopperBeeches': 35.311924626942925,
 'ADVENTUREI': 131.53178429264776,
 'ASCANDALINBOHEMIA': 76.9812326568472,
 'I': 0.0,
 'ToSherlockHolmessheisalwaysthewoman': 32.79900844412051,
 'Ihaveseldomheardhimmentionherunderanyothername': 21.096698559644615,
 'Inhiseyessheeclipsesandpredominatesthewholeofhersex': 45.108920807001006,
 '
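
The dictionary output above includes an empty-string key `''` with a score. One likely cause, sketched here with a hypothetical sample text, is that the naive sentence splitter produces empty fragments wherever blank lines occur, and `format_sentence` maps all of them to the same key. Since the dictionary is filled by plain assignment, fragments that share a key overwrite one another and only the last score survives.

```python
import re

def format_sentence(sentence):
    # Same regex as the method above: drop every non-word character
    return re.sub(r'\W+', '', sentence)

# Hypothetical sample text with blank lines between blocks
text = "TITLE\n\nFirst sentence. Second one!\n\n"

# Same splitting rule as split_content_to_sentences
sentences = text.replace("\n", ". ").split(". ")
keys = [format_sentence(s) for s in sentences]

print(sentences)  # ['TITLE', '', 'First sentence', 'Second one!', '', '']
print(keys)       # ['TITLE', '', 'Firstsentence', 'Secondone', '', '']
```

A possible mitigation, not part of the original notebook, would be to skip fragments whose formatted key is empty before scoring them.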

In [30]:
# Build the summary from the sentence dictionary
summary = st.get_summary(title, content, sentences_dic)
print(summary)

THE ADVENTURES OF SHERLOCK HOLMES

VII.	The Adventure of the Blue Carbuncle
ADVENTURE  I
And yet there was but one woman to him, and that woman was the late Irene Adler, of dubious and questionable memory.
From time to time I heard some vague account of his doings: of his summons to Odessa in the case of the Trepoff murder, of his clearing up of the singular tragedy of the Atkinson brothers at Trincomalee, and finally of the mission which he had accomplished so delicately and successfully for the reigning family of Holland
I rang the bell and was shown up to the chamber which had formerly been in part my own.
With hardly a word spoken, but with a kindly eye, he waved me to an armchair, threw across his case of cigars, and indicated a spirit case and a gasogene in the corner
"I think, Watson, that you have put on seven and a half pounds since I saw you."
"Indeed, I should have thought a little more
How do I know that you have been getting yourself very wet lately, and that you have a mo

In [31]:
# Print how much shorter the summary is relative to the original
print("")
print("Original length %s" % (len(title) + len(content)))
print("Summary length %s" % len(summary))
print("Percent reduction: %s" % (100 - (100 * (len(summary) / (len(title) + len(content))))))


Original length 41066
Summary length 9676
Percent reduction: 76.43792918716213
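
The last figure is the percentage by which the summary is shorter than the original, not the summary's share of it. A minimal sketch of that arithmetic, using the two lengths printed above and a hypothetical helper name (`reduction_percent` is not part of the notebook):

```python
def reduction_percent(original_length, summary_length):
    # Hypothetical helper: percentage of the original removed by the summary
    return 100 - (100 * (summary_length / original_length))

# Lengths taken from the output above
reduction = reduction_percent(41066, 9676)
print("Reduction: %.2f%%" % reduction)  # Reduction: 76.44%
```

Put differently, the summary keeps roughly 23.56% of the original character count.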
