Readability pulls its article title from the title tag, right? Well, more often than not, the title tag holds a lot more than just the title of the article. It usually includes the name of the site itself and sometimes a category.
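For illustration, here is a minimal sketch of one way to trim the extra site and category text out of a title-tag string. It is an assumption-laden heuristic, not something the script does today: `guess_article_title` and `TITLE_SEPARATORS` are hypothetical names, and "the longest segment is the article title" is only a guess that will be wrong on some sites.

```ruby
# Hypothetical helper; not part of the readability script.
# Separators commonly found between an article title and a site name.
TITLE_SEPARATORS = /\s+[|:\-]\s+/

# Split the title-tag text on those separators and keep the longest piece,
# assuming the article title is longer than the site name or category.
def guess_article_title(title_tag_text)
  cleaned = title_tag_text.to_s.strip
  parts = cleaned.split(TITLE_SEPARATORS).map(&:strip).reject(&:empty?)
  return cleaned if parts.empty?

  parts.max_by(&:length)
end

puts guess_article_title("Scoring headers in Readability | Example Blog")
# prints "Scoring headers in Readability"
```

This only works on the title-tag text, so it does nothing for pages whose title tag leaves the article title out entirely.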
I know the original readability script just grabbed the title, but I'm wondering if this version of the script can be modified to grab the actual title of the article from the markup. It seems as though the scoring system is set up to exclude the header tag that contains the article title.
Claritatem insitam; est usus legentis in iis qui facit eorum claritatem.
Investigationes demonstraverunt lectores legere me lius quod ii legunt
saepius. Claritas est etiam processus dynamicus, qui sequitur mutationem
consuetudium lectorum. Mirum est notare quam littera gothica, quam nunc
putamus parum claram, anteposuerit litterarum formas humanitatis per seacula
quarta decima et quinta decima. Eodem modo typi, qui nunc nobis videntur
parum clari, fiant sollemnes in futurum.
Nunc varius risus quis nulla. Vivamus vel magna. Ut rutrum. Aenean
dignissim, leo quis faucibus semper, massa est faucibus massa, sit amet
pharetra arcu nunc et sem. Aliquam tempor. Nam lobortis sem non urna.
Pellentesque et urna sit amet leo accumsan volutpat. Nam molestie lobortis
lorem. Quisque eu nulla. Donec id orci in ligula dapibus egestas. Donec sed
velit ac lectus mattis sagittis.
In the above example, readability will always grab the content from .article-content and not the <article> tag itself. What can I do to modify the script to grab the whole article, title and all?
Hey @tybenz! Interesting idea. Do you want to work on a pull request for that?
Yeah. I'd love to. I don't know enough about the scoring algorithm though. Wondering if you had any ideas on what a good start might be.
No problem. I'd try to write a failing spec, then I'd take a look at score_node, class_weight, and REGEXES and see if something similar could be written to estimate which node is the title.
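As a rough starting point, a title estimator along those lines might look like the sketch below. Everything in it is an assumption: `HeadingNode`, `TITLE_REGEXES`, `TAG_WEIGHT`, `title_score`, and `best_title_node` are hypothetical names, the weights are arbitrary, and plain structs stand in for real parsed HTML nodes so the example runs on its own.

```ruby
# Hypothetical sketch; none of these names exist in the readability script.
# A plain struct stands in for a parsed heading element.
HeadingNode = Struct.new(:name, :css_class, :id, :text)

# Assumed patterns for class/id values that hint at (or against) a title.
TITLE_REGEXES = {
  positive: /title|headline|heading/i,
  negative: /comment|sidebar|footer|widget|related|site|logo/i
}.freeze

# Assumed base weights: higher-level headings are likelier article titles.
TAG_WEIGHT = { "h1" => 30, "h2" => 20, "h3" => 10 }.freeze

def title_score(node, title_tag_text)
  score = TAG_WEIGHT.fetch(node.name, 0)

  # Adjust the score by class and id, one attribute at a time.
  [node.css_class, node.id].each do |attr|
    next if attr.nil? || attr.empty?

    score += 25 if attr =~ TITLE_REGEXES[:positive]
    score -= 25 if attr =~ TITLE_REGEXES[:negative]
  end

  # Reward headings whose text also appears in the title tag.
  text = node.text.to_s.strip
  score += 40 if !text.empty? && title_tag_text.downcase.include?(text.downcase)
  score
end

def best_title_node(nodes, title_tag_text)
  nodes.max_by { |node| title_score(node, title_tag_text) }
end

headings = [
  HeadingNode.new("h1", "site-logo", nil, "Example Blog"),
  HeadingNode.new("h2", "entry-title", nil, "Scoring headers"),
  HeadingNode.new("h3", nil, "comments-heading", "3 Comments")
]
puts best_title_node(headings, "Scoring headers | Example Blog").text
```

A failing spec could then assert that the chosen heading is the article's own and not the site banner.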
Also, I want to get something straight. Is it true that you only ever score p tags, td tags, and their parents and grandparents?
Am I missing something?
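To make the question concrete, here is a minimal sketch of the pass as I understand it, assuming that reading is correct. It is not copied from the script: `TreeNode` and `score_paragraphs` as written here are illustrative stand-ins, and the score formula and the 25-character minimum are assumptions.

```ruby
# Illustrative stand-in for a parsed node; not the script's real data model.
TreeNode = Struct.new(:name, :text, :parent)

# Assumed behavior: only p and td nodes contribute, and their score is
# credited to the parent (in full) and to the grandparent (at half weight).
def score_paragraphs(nodes, min_text_length = 25)
  # Key by object identity so two nodes with equal fields stay distinct.
  scores = Hash.new(0.0).compare_by_identity

  nodes.each do |node|
    next unless %w[p td].include?(node.name)

    text = node.text.to_s
    next if text.length < min_text_length

    parent = node.parent
    next if parent.nil?

    # Assumed formula: a base point, one per comma, up to 3 for length.
    content_score = 1 + text.count(",") + [text.length / 100, 3].min
    scores[parent] += content_score

    grandparent = parent.parent
    scores[grandparent] += content_score / 2.0 unless grandparent.nil?
  end

  scores
end

article = TreeNode.new("article", "", nil)
content = TreeNode.new("div", "", article)
heading = TreeNode.new("h1", "Title", article)
para    = TreeNode.new("p", "One, two, three and a sentence long enough to count.", content)

scores = score_paragraphs([article, content, heading, para])
puts scores[content] # prints 3.0
puts scores[article] # prints 1.5
```

If that is how it works, the wrapper around the paragraphs outscores the `<article>` element and the heading never receives a score at all, which would explain why the title gets left out.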