Feat: add Lexrank text summarization #62

genesluna · 2019-03-18T22:22:56Z

The Problem

Currently the text robot picks the first n sentences of the content returned from Wikipedia. However, these first sentences are not always the best representation (summary) of the page's content.

With that in mind, and knowing that the idea of the project is to automate as much as possible, I decided to contribute support for unsupervised text summarization using Radev's LexRank algorithm http://www.jair.org/papers/paper1523.html. Basically, it ranks each sentence of a document by lexical centrality, finds the most important sentences, and reproduces them.
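
To make the idea concrete, here is a minimal LexRank-style sketch. It is not the code from this PR and not the API of any npm package: the function names, the naive tokenizer, the 0.1 similarity threshold and the 0.85 damping factor are all assumptions for illustration, and it uses plain term-frequency cosine similarity where the paper uses an IDF-weighted variant.

```javascript
// Minimal LexRank-style sketch (illustrative only, all names are hypothetical).
// Assumptions: naive tokenization, term-frequency cosine similarity,
// similarity threshold 0.1, damping factor 0.85.

function tokenize(sentence) {
  return sentence.toLowerCase().match(/[a-z0-9]+/g) || []
}

function termFrequencies(tokens) {
  const tf = new Map()
  for (const token of tokens) tf.set(token, (tf.get(token) || 0) + 1)
  return tf
}

function cosine(a, b) {
  let dot = 0
  for (const [term, count] of a) dot += count * (b.get(term) || 0)
  const norm = (v) => Math.sqrt([...v.values()].reduce((s, x) => s + x * x, 0))
  const denominator = norm(a) * norm(b)
  return denominator === 0 ? 0 : dot / denominator
}

function lexRankScores(sentences, threshold = 0.1, damping = 0.85, iterations = 50) {
  const n = sentences.length
  const vectors = sentences.map((s) => termFrequencies(tokenize(s)))

  // Similarity graph: an edge links two sentences that are similar enough.
  const adjacency = vectors.map((a, i) =>
    vectors.map((b, j) => (i !== j && cosine(a, b) > threshold ? 1 : 0))
  )
  const degree = adjacency.map((row) => row.reduce((s, x) => s + x, 0))

  // Power iteration: a sentence is central when central sentences link to it.
  let scores = new Array(n).fill(1 / n)
  for (let step = 0; step < iterations; step++) {
    scores = scores.map((_, i) => {
      let incoming = 0
      for (let j = 0; j < n; j++) {
        if (adjacency[j][i]) incoming += scores[j] / degree[j]
      }
      return (1 - damping) / n + damping * incoming
    })
  }
  return scores
}

// Returns the `count` highest-scoring sentences, most central first.
function summarize(sentences, count) {
  const scores = lexRankScores(sentences)
  return sentences
    .map((text, index) => ({ text, index, score: scores[index] }))
    .sort((a, b) => b.score - a.score)
    .slice(0, count)
}
```

Calling `summarize(sentences, 7)` on an array of sentences would return the seven most central ones together with their original `index`, which is how sentences from the end of the text can displace sentences from the beginning.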

Example

Currently, if we search for the term "Javascript" in video-maker, we get the following sentences as a result:

JavaScript , often abbreviated as JS, is a high-level, interpreted programming language that conforms to the ECMAScript specification.
It is a programming language that is characterized as dynamic, weakly typed, prototype-based and multi-paradigm.
Alongside HTML and CSS, JavaScript is one of the core technologies of the World Wide Web. JavaScript enables interactive web pages and is an essential part of web applications.
The vast majority of websites use it, and major web browsers have a dedicated JavaScript engine to execute it.
As a multi-paradigm language, JavaScript supports event-driven, functional, and imperative programming styles.
It has APIs for working with text, arrays, dates, regular expressions, and the DOM, but the language itself does not include any I/O, such as networking, storage, or graphics facilities.
It relies upon the host environment in which it is embedded to provide these features.

With automated summarization we would get the following result:

As a multi-paradigm language, JavaScript supports event-driven, functional, and imperative programming styles.
JavaScript was influenced by programming languages such as Self and Scheme.
It is a programming language that is characterized as dynamic, weakly typed, prototype-based and multi-paradigm.
JavaScript , often abbreviated as JS, is a high-level, interpreted programming language that conforms to the ECMAScript specification.
The terms Vanilla JavaScript and Vanilla JS refer to JavaScript not extended by any frameworks or additional libraries.
The vast majority of websites use it, and major web browsers have a dedicated JavaScript engine to execute it.
It relies upon the host environment in which it is embedded to provide these features.

Note that, besides the order of some sentences having changed according to their relevance, others were removed and replaced by sentences the algorithm considered more relevant.

The full text analyzed follows below:

JavaScript , often abbreviated as JS, is a high-level, interpreted programming language that conforms to the ECMAScript specification. It is a programming language that is characterized as dynamic, weakly typed, prototype-based and multi-paradigm. Alongside HTML and CSS, JavaScript is one of the core technologies of the World Wide Web. JavaScript enables interactive web pages and is an essential part of web applications. The vast majority of websites use it, and major web browsers have a dedicated JavaScript engine to execute it. As a multi-paradigm language, JavaScript supports event-driven, functional, and imperative programming styles. It has APIs for working with text, arrays, dates, regular expressions, and the DOM, but the language itself does not include any I/O, such as networking, storage, or graphics facilities. It relies upon the host environment in which it is embedded to provide these features. Initially only implemented client-side in web browsers, JavaScript engines are now embedded in many other types of host software, including server-side in web servers and databases, and in non-web programs such as word processors and PDF software, and in runtime environments that make JavaScript available for writing mobile and desktop applications, including desktop widgets. The terms Vanilla JavaScript and Vanilla JS refer to JavaScript not extended by any frameworks or additional libraries. Scripts written in Vanilla JS are plain JavaScript code. Although there are similarities between JavaScript and Java, including language name, syntax, and respective standard libraries, the two languages are distinct and differ greatly in design. JavaScript was influenced by programming languages such as Self and Scheme.

To see an example with the term "Michael Jackson", click here. I think it is an even better example than the previous one.

I also changed the Algorithmia search result from wikipediaContent.content to wikipediaContent.summary, since the first element returned inside 'content' is precisely the 'summary'. This speeds up the regular-expression processing and, as a bonus, helps the LexRank algorithm, since it has to produce a 'summary of the summary' rather than a summary of the entire content.
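
A small sketch of that choice, assuming the parser response exposes both a `summary` and a `content` string as described above. The helper name `pickSourceText` and the fallback to `content` are hypothetical and not part of the PR.

```javascript
// Hypothetical helper: prefer the shorter `summary` field and fall back to
// the full `content` when the summary is missing or blank.
function pickSourceText(wikipediaContent) {
  const summary = (wikipediaContent.summary || '').trim()
  return summary.length > 0 ? summary : (wikipediaContent.content || '')
}
```

The fallback is only a precaution for pages whose summary comes back empty; whether that ever happens with this parser is not something the thread confirms.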

I am aware that the PR will not be merged. The intention is only to show one more of the many automation possibilities available to us today.

maycrodrigues · 2019-03-18T23:53:13Z

Congratulations, my friend! Awesome! I'm going to implement and test it in my project too! Sick!!!! 👍👍👍

PS: I'm doing it in TS https://github.com/maycrodrigues/video-maker-typescript 😎

acristh · 2019-03-19T00:39:20Z

Great stuff!
This way the whole text is analyzed, and we avoid losing important information. 😃👏

marceloavf · 2019-03-19T13:41:48Z

Congratulations @genesluna!

I had thought of this as soon as I saw the video, but I didn't know about this lexical ranking technique.

filipedeschamps · 2019-03-19T14:14:08Z

Sensational!!!!

robots/text.js

leodutra · 2019-04-08T11:40:59Z

@genesluna I have a question: was the second example you gave generated by the analysis?
There is a break in context that we may need to resolve in order to integrate the improvement:

The vast majority of websites use it, and major web browsers have a dedicated JavaScript engine to execute it.

It relies upon the host environment in which it is embedded to provide these features.

"these features" does not directly refer to anything in the sentence that now precedes it.

This second sentence lost its context after LexRank... before, it read like this:

It has APIs for working with text, arrays, dates, regular expressions, and the DOM, but the language itself does not include any I/O, such as networking, storage, or graphics facilities.

It relies upon the host environment in which it is embedded to provide these features.

Any ideas?
Maybe because LexRank was run on sentences rather than on the whole original text?
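
One possible mitigation, sketched under assumptions: keep the top-N sentences but put them back in the order they appear in the source text. The function name and the `index` and `score` fields are hypothetical. This restores reading order only; a sentence such as the one above can still lose its antecedent if the sentence it refers to is not selected.

```javascript
// Hypothetical post-processing step: keep the top-N ranked sentences,
// then restore the order in which they appear in the source text.
// Assumes each item carries its original position (`index`) and a `score`.
function topSentencesInDocumentOrder(rankedSentences, count) {
  return [...rankedSentences]
    .sort((a, b) => b.score - a.score) // most relevant first
    .slice(0, count)                   // keep the best N
    .sort((a, b) => a.index - b.index) // back to reading order
}
```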

felipealfah · 2019-05-22T02:36:49Z

@genesluna How are you? I tried to implement it, but it returns errors such as missing lexrank modules. Any tips for fixing this?

C:\Users\Felipe\video-maker>node index.js
internal/modules/cjs/loader.js:584
throw err;
^

Error: Cannot find module 'lexrank.js'
at Function.Module._resolveFilename (internal/modules/cjs/loader.js:582:15)
at Function.Module._load (internal/modules/cjs/loader.js:508:25)
at Module.require (internal/modules/cjs/loader.js:637:17)
at require (internal/modules/cjs/helpers.js:22:18)
at Object.<anonymous> (C:\Users\Felipe\video-maker\robots\text.js:4:17)
at Module._compile (internal/modules/cjs/loader.js:701:30)
at Object.Module._extensions..js (internal/modules/cjs/loader.js:712:10)
at Module.load (internal/modules/cjs/loader.js:600:32)
at tryModuleLoad (internal/modules/cjs/loader.js:539:12)
at Function.Module._load (internal/modules/cjs/loader.js:531:3)

HelioLuna · 2019-08-30T19:29:29Z

Hello, I saw that several people were having problems with this branch and decided to help. I managed to implement the lexrank algorithm (https://www.npmjs.com/package/lexrank) by following these steps:

1 - Install in the project: npm i lexrank
2 - Implement in the project:
const lexrank = require('lexrank')

// Fills content.sentences with the 5 top-ranked sentences.
// Resolves once lexrank has finished, rejects if it reports an error.
function quebrarContentEmSentencasLexicasRankeadas(content) {
    return new Promise((resolve, reject) => {
        content.sentences = []

        lexrank.summarize(content.sourceContentSanitizada, 5, (error, result) => {
            if (error) {
                return reject(error)
            }

            result.forEach((sentence) => {
                content.sentences.push({
                    text: sentence.text,
                    keywords: [],
                    images: []
                })
            })

            console.log(content.sentences)
            resolve(content.sentences)
        })
    })
}

Version used: "lexrank": "^1.0.5"
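
The same callback-to-promise wrapping can also be done with Node's `util.promisify`. The sketch below uses a stand-in function, `fakeSummarize`, that only mimics the `(text, lineCount, callback)` shape used above; it is not the real library, and `fillSentences` is a hypothetical name.

```javascript
const { promisify } = require('util')

// Stand-in with the same callback shape as the summarize call above:
// (text, lineCount, callback(error, sentences)). It is NOT the real library;
// it simply returns the first `lineCount` sentences.
function fakeSummarize(text, lineCount, callback) {
  const sentences = text
    .split('. ')
    .filter((s) => s.length > 0)
    .slice(0, lineCount)
    .map((s) => ({ text: s }))
  setImmediate(() => callback(null, sentences))
}

// promisify turns the error-first callback into a Promise-returning function.
const summarizeAsync = promisify(fakeSummarize)

async function fillSentences(content, lineCount) {
  const result = await summarizeAsync(content.sourceContentSanitized, lineCount)
  content.sentences = result.map((sentence) => ({
    text: sentence.text,
    keywords: [],
    images: []
  }))
  return content
}
```

Note that `promisify` resolves with the first result argument only, so any extra arguments a real callback receives would be dropped.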

leodutra · 2019-08-31T09:42:32Z

@HelioLuna, could you please take a look at my comment #62 (comment)?

Can you reuse the implementation and test it, perhaps with the same text from the OP?

HelioLuna · 2019-09-01T13:49:00Z

> @HelioLuna, could you please take a look at my comment #62 (comment)?
>
> Can you reuse the implementation and test it, perhaps with the same text from the OP?

So, I ran lexrank on the whole text, and lexrank took care of finding and returning the best sentences based on its ranking algorithm.

guilherme-argentino · 2020-03-10T01:57:30Z

I was quite interested, but I hit this error and got stuck on it.

(node:23892) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'compact' of undefined

This happened inside sentence-tokenizer, a dependency of lexrank.

rodrigo-sntg · 2020-05-09T15:59:37Z

> I was quite interested, but I hit this error and got stuck on it.
>
> (node:23892) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'compact' of undefined
>
> This happened inside sentence-tokenizer, a dependency of lexrank.

@guilherme-argentino, I solved this by using lexrank.js itself.

Posting my code below:

const algorithmia = require('algorithmia')
const lexrank = require('lexrank.js')
const algorithmiaApiKey = require('../credentials/algorithmia.json').apiKey
const sentenceBoundaryDetection = require(`sbd`)

const watsonApiKey = require('../credentials/watson-nlu.json').apikey
const NaturalLanguageUnderstandingV1 = require('watson-developer-cloud/natural-language-understanding/v1.js')


const nlu = new NaturalLanguageUnderstandingV1({
    iam_apikey: watsonApiKey,
    version: '2018-04-05',
    url: 'https://gateway.watsonplatform.net/natural-language-understanding/api/'
})

const state = require('./state.js')

async function robot() {
    const content = state.load()
    await fetchContentFromWiki(content)
    sanitizeContent(content)
    // breakContentIntoSentences(content)
    await breakContentIntoLexicalRankedSentences(content)
    limitMaximumSentences(content)
    await fetchKeywordsOfAllSentences(content)

    state.save(content)

    
    async function fetchContentFromWiki(content){
        const algorithmiaAuthenticated = algorithmia(algorithmiaApiKey)
        const wikipediaAlgo = algorithmiaAuthenticated.algo('web/WikipediaParser/0.1.2')
        const wikipediaResponse = await wikipediaAlgo.pipe(content.searchTerm)
        const wikipediaContent = wikipediaResponse.get()
        
        content.sourceContentOriginal = wikipediaContent
        
        // content.sourceContentOriginal = wikipediaContent.summary

    }

    function sanitizeContent(content){
        const withoutBlankLinesAndMarkdown = removeBlankLinesAndMarkdown(content.sourceContentOriginal.content)
        const withoutDatesInParenthesis = removeDatesInParenthesis(withoutBlankLinesAndMarkdown)

        content.sourceContentSanitized = withoutDatesInParenthesis

        function removeBlankLinesAndMarkdown(text){
            const allLines = text.split('\n')

            const withoutBlankLinesAndMarkdown = allLines.filter((line) => {
                if (line.trim().length === 0 || line.trim().startsWith('=')) {
                    return false
                }

                return true
            })

            return withoutBlankLinesAndMarkdown.join(' ')
        }
    }

    function removeDatesInParenthesis(text) {
        return text.replace(/\((?:\([^()]*\)|[^()])*\)/gm, '').replace(/  /g,' ')
    }

    function breakContentIntoSentences(content) {
        content.sentences = []
    
        const sentences = sentenceBoundaryDetection.sentences(content.sourceContentSanitized)
        sentences.forEach((sentence) => {
          content.sentences.push({
            text: sentence,
            keywords: [],
            images: []
          })
        })
    }

    function limitMaximumSentences(content){
        content.sentences = content.sentences.slice(0, content.maximumSentences)
    }

    async function fetchKeywordsOfAllSentences(content) {
        console.log('> [text-robot] Starting to fetch keywords from Watson')

        // One Watson call per sentence, awaited in sequence.
        for (const sentence of content.sentences) {
            sentence.keywords = await fetchWatsonAndReturnKeywords(sentence)
        }
    }

    async function fetchWatsonAndReturnKeywords(sentence) {
        return new Promise((resolve, reject) => {
          nlu.analyze({
            text: sentence.text,
            features: {
              keywords: {}
            }
          }, (error, response) => {
            if (error) {
              reject(error)
              return
            }
    
            const keywords = response.keywords.map((keyword) => {
              return keyword.text
            })

            sentence.keywords = keywords
    
            resolve(keywords)
          })
        })
      }

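      function breakContentIntoLexicalRankedSentences(content) {
        content.sentences = []

        // lexrank is callback-based, so wrap it in a Promise that robot() can await.
        return new Promise((resolve, reject) => {
          lexrank(content.sourceContentSanitized, (err, result) => {
            if (err) {
              return reject(err)
            }

            // Sort the sentences in result[0] by average weight, highest first.
            const sentences = result[0].sort((a, b) => b.weight.average - a.weight.average)

            sentences.forEach((sentence) => {
              content.sentences.push({
                text: sentence.text,
                keywords: [],
                images: []
              })
            })

            resolve()
          })
        })
      }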


    
}

module.exports = robot

The error in @genesluna's code was that it was calling the function summary.lexrank.
I changed it to use just lexrank.
That way there was no error.

Feat: add Lexrank text summarization

fb9ddf4

filipedeschamps reviewed Mar 19, 2019

robots/text.js (outdated)

Fix: callback promise wrap

b0df011

matbrgz added the enhancement New feature or request label Oct 31, 2019

luanadriani mentioned this pull request Dec 4, 2019

Escopos e Tarefas - Futuras Implementações luanadriani/video-maker#5

Open

25 tasks
