Readability's title #64

Open
tybenz opened this Issue Mar 4, 2014 · 4 comments


tybenz commented Mar 4, 2014

Readability pulls its article title from the title tag, right? More often than not, though, the title tag carries a lot of other information besides the title of the article. It usually includes the name of the site itself and sometimes a category.
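To make the problem concrete, here is a minimal sketch of one common workaround: split the title tag text on typical separators and keep the longest segment. The helper name `guess_article_title` and the separator list are assumptions made up for illustration; they are not part of ruby-readability, and the "longest segment wins" rule will misfire on sites whose name is longer than the article title.

```ruby
# Hypothetical helper, not part of ruby-readability.
# Separators must have whitespace on both sides, so hyphenated
# words such as "well-formed" are left alone.
TITLE_SEPARATOR = /\s+[|\-:]+\s+/

def guess_article_title(title_tag_text)
  cleaned = title_tag_text.strip
  parts = cleaned.split(TITLE_SEPARATOR)
  # Nothing to split: return the text unchanged.
  return cleaned if parts.size < 2
  # Assumption: the article title is the longest segment.
  parts.max_by(&:length)
end

puts guess_article_title("How Readability Picks Titles | Example Blog")
```

This only cleans up the title tag. It does not look at the markup at all, which is why pulling the title from a heading inside the article is still attractive.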

I know the original readability script just grabbed the title, but I'm wondering if this version of the script can be modified to grab the actual title of the article from the markup. It seems as though the scoring system is set up to exclude the heading tag (the h1 in the example below) that contains the article title.

Example:

<article>
  <div class="article-title">
    <h1>Article title</h1>
  </div>
  <div class="article-content">
    <p>
      Claritatem insitam; est usus legentis in iis qui facit eorum claritatem.
      Investigationes demonstraverunt lectores legere melius quod ii legunt
      saepius. Claritas est etiam processus dynamicus, qui sequitur mutationem
      consuetudium lectorum. Mirum est notare quam littera gothica, quam nunc
      putamus parum claram, anteposuerit litterarum formas humanitatis per saecula
      quarta decima et quinta decima. Eodem modo typi, qui nunc nobis videntur
      parum clari, fiant sollemnes in futurum.
    </p>
    <p>
      Nunc varius risus quis nulla. Vivamus vel magna. Ut rutrum. Aenean
      dignissim, leo quis faucibus semper, massa est faucibus massa, sit amet
      pharetra arcu nunc et sem. Aliquam tempor. Nam lobortis sem non urna.
      Pellentesque et urna sit amet leo accumsan volutpat. Nam molestie lobortis
      lorem. Quisque eu nulla. Donec id orci in ligula dapibus egestas. Donec sed
      velit ac lectus mattis sagittis.
    </p>
  </div>
</article>

In the above example, Readability will always grab the content from .article-content and never the enclosing <article> tag, so the h1 is left out. What can I do to modify the script to grab the whole article, title and all?

Owner

cantino commented Mar 5, 2014

Hey @tybenz! Interesting idea. Do you want to work on a pull request for that?

tybenz commented Mar 5, 2014

Yeah. I'd love to. I don't know enough about the scoring algorithm though. Wondering if you had any ideas on what a good start might be.

Owner

cantino commented Mar 5, 2014

No problem. I'd try to write a failing spec, then I'd take a look at score_node, class_weight, and REGEXES and see if something similar could be written to estimate which node is the title.
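As a rough starting point for that kind of estimate, here is a sketch of a heading scorer in the spirit of a class-weight check. Everything in it is hypothetical: the `Heading` struct, the two regexes, the weights, and the `title_score` and `best_title` helpers are invented for illustration and are not taken from ruby-readability. It assumes the headings have already been pulled out of the document, with `css_class` holding the heading's own classes plus its parent's.

```ruby
# Hypothetical sketch, not ruby-readability code.
Heading = Struct.new(:tag, :css_class, :text)

# Assumed hint patterns, loosely modelled on the idea behind REGEXES.
TITLE_CLASS_HINT     = /title|headline|heading/i
NON_TITLE_CLASS_HINT = /comment|sidebar|footer|related|widget/i

# Assumed weights; they would need tuning against real pages.
TAG_WEIGHT = { "h1" => 30, "h2" => 15, "h3" => 5 }.freeze

def title_score(heading, title_tag_text)
  score = TAG_WEIGHT.fetch(heading.tag, 0)
  classes = heading.css_class.to_s
  score += 25 if classes =~ TITLE_CLASS_HINT
  score -= 25 if classes =~ NON_TITLE_CLASS_HINT
  # A heading whose text also appears in the title tag is a strong signal.
  text = heading.text.to_s.strip
  score += 40 if !text.empty? && title_tag_text.downcase.include?(text.downcase)
  score
end

def best_title(headings, title_tag_text)
  best = headings.max_by { |h| title_score(h, title_tag_text) }
  best && best.text.strip
end

headings = [
  Heading.new("h2", "sidebar-title", "Related posts"),
  Heading.new("h1", "article-title", "Article title"),
  Heading.new("h3", "", "Comments")
]
puts best_title(headings, "Article title | Example Site | News")
```

A failing spec could then assert that the extracted title for the example markup is "Article title" rather than the full title tag text.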

tybenz commented Mar 6, 2014

Also, I want to get something straight. Is it true that you only ever score p tags, td tags, and their parents and grandparents?

https://github.com/cantino/ruby-readability/blob/master/lib/readability.rb#L270-L271

Am I missing something?
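If that reading is right, the behaviour in the example follows directly. Below is a simplified sketch of that kind of scoring, not the library's actual code: the `Node` struct, the exact formula, and the 25-character minimum are assumptions chosen for illustration. Each p or td adds its score to its parent and half of it to its grandparent, and nothing else is scored.

```ruby
# Simplified, hypothetical model of paragraph scoring.
Node = Struct.new(:name, :text, :parent)

MIN_TEXT_LENGTH = 25 # assumed threshold

def paragraph_score(text)
  # Assumed formula: base point, one per comma, up to 3 for length.
  1 + text.count(",") + [text.length / 100, 3].min
end

def score_candidates(nodes)
  # Compare keys by identity so two similar-looking nodes stay distinct.
  scores = Hash.new(0.0).compare_by_identity
  nodes.each do |node|
    next unless %w[p td].include?(node.name)
    next if node.text.length < MIN_TEXT_LENGTH
    score = paragraph_score(node.text)
    parent = node.parent
    grandparent = parent && parent.parent
    scores[parent] += score if parent
    scores[grandparent] += score / 2.0 if grandparent
  end
  scores
end

# The structure from the example at the top of the issue.
article   = Node.new("article", "", nil)
title_div = Node.new("div.article-title", "", article)
content   = Node.new("div.article-content", "", article)
h1 = Node.new("h1", "Article title", title_div)
p1 = Node.new("p", "a" * 118 + ",,", content)
p2 = Node.new("p", "b" * 30, content)

score_candidates([h1, p1, p2]).each do |node, score|
  puts "#{node.name}: #{score}"
end
```

Under these assumptions div.article-content always outscores the article element, and div.article-title never becomes a candidate because the h1 inside it is never scored.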
