Simple content scoring prototype #583
In case you are interested, here are pages with [
{
"file": "die-partei.net.luebeck.html",
"score": 0.7719298245614035,
"html": 57,
"trafilatura": 44,
},
{
"file": "schleifen.ucoz.de.briefe.html",
"score": 1.052325581395349,
"html": 172,
"trafilatura": 181,
},
{
"file": "love-hina.ch.0409.html",
"score": 0.3643410852713178,
"html": 129,
"trafilatura": 47,
},
{
"file": "wehranlage-horka.de.887.html",
"score": 0.5235602094240838,
"html": 191,
"trafilatura": 100,
},
{
"file": "nextkabinett.wordpress.com.garden.html",
"score": 0.09857482185273159,
"html": 842,
"trafilatura": 83,
},
{
"file": "wiki.piratenpartei.de.stammtisch.html",
"score": 0.38513513513513514,
"html": 148,
"trafilatura": 57,
},
{
"file": "pix-bavaria.de.html",
"score": 0.7720588235294118,
"html": 136,
"trafilatura": 105,
},
{
"file": "lavazza.de.qualita.html",
"score": 0.08863636363636364,
"html": 440,
"trafilatura": 39,
},
{
"file": "gnaur.wordpress.com.moglichkeit.html",
"score": 0.12643678160919541,
"html": 174,
"trafilatura": 22,
},
{"file": "seelenradio.de.leo.html", "score": 0.2, "html": 185, "trafilatura": 37},
{
"file": "ohneq.de.johannes.html",
"score": 0.6134453781512605,
"html": 119,
"trafilatura": 73,
},
{
"file": "xinhuanet.com.c_1125597921.html",
"score": 0.4955357142857143,
"html": 224,
"trafilatura": 111,
},
{
"file": "banyuetan.org.1000200033136171577956287380194268_1.html",
"score": 0.4943820224719101,
"html": 356,
"trafilatura": 176,
},
{
"file": "baike.baidu.com.tanya.html",
"score": 0.6725197541703248,
"html": 1139,
"trafilatura": 766,
},
{
"file": "scmp.com.playbook.html",
"score": 0.21359223300970873,
"html": 103,
"trafilatura": 22,
},
{
"file": "juliasleseblog.blogspot.com.irland.html",
"score": 0,
"html": 0,
"trafilatura": 0,
},
{"file": "cecil.de.lieblingsfarbe.html", "score": 1.0, "html": 1, "trafilatura": 1},
{"file": "street-one.de.blue.html", "score": 1.0, "html": 1, "trafilatura": 1},
{
"file": "it-for-kids.org.variables.html",
"score": 0.9117647058823529,
"html": 34,
"trafilatura": 31,
},
{
"file": "zahlenzauberin.wordpress.com.ferien.html",
"score": 0.5082872928176796,
"html": 181,
"trafilatura": 92,
},
{
"file": "rueda.wikidot.com.enchufla.html",
"score": 0.6058091286307054,
"html": 482,
"trafilatura": 292,
},
{"file": "changenow.de.loibl.html", "score": 0, "html": 0, "trafilatura": 0},
{
"file": "chip.de.bestcrypt.html",
"score": 0.4508670520231214,
"html": 346,
"trafilatura": 156,
},
{
"file": "faz.net.leone.html",
"score": 0.028588098016336057,
"html": 3428,
"trafilatura": 98,
},
{
"file": "archive.ordnungsrausch.com.orga-life.html",
"score": 0.1505016722408027,
"html": 299,
"trafilatura": 45,
},
{
"file": "weselpower.wordpress.com.monstergesprche.html",
"score": 0.2,
"html": 90,
"trafilatura": 18,
},
{
"file": "0b4609a864eb4fa0bbcb2b395f6be9eb.html",
"score": 0.18,
"html": 200,
"trafilatura": 36,
},
{
"file": "backen.de.maulwurfkuchen.html",
"score": 0.2687074829931973,
"html": 294,
"trafilatura": 79,
},
{"file": "thelocal.se.tattooed.html", "score": 1.0, "html": 13, "trafilatura": 13},
{
"file": "bundeswehrkarriere.de.Laura.html",
"score": 0,
"html": 0,
"trafilatura": 0,
},
{
"file": "fouryears.eu.interning.html",
"score": 0.3142857142857143,
"html": 175,
"trafilatura": 55,
},
{"file": "wevolver.com.vehicle.html", "score": 1.0, "html": 8, "trafilatura": 8},
{
"file": "nhk.or.jp.k100.html",
"score": 0.5454545454545454,
"html": 33,
"trafilatura": 18,
},
{
"file": "bettycrocker.com.pineapple.html",
"score": 0.18652849740932642,
"html": 386,
"trafilatura": 72,
},
{
"file": "cybercook.com.br.sequilho.html",
"score": 0.5740740740740741,
"html": 162,
"trafilatura": 93,
},
{"file": "workable.com.gousto.html", "score": 0, "html": 0, "trafilatura": 0},
{
"file": "journals.univie.ac.at.submissions.html",
"score": 0.3556701030927835,
"html": 388,
"trafilatura": 138,
},
{
"file": "sports.fr.lorient.html",
"score": 0.9892665474060823,
"html": 559,
"trafilatura": 553,
},
{
"file": "_Ziemniaki na szóstej, surówka na dziesiątej_. Jak pomagać, żeby nie zaszkodzić_ [PORADNIK W PIGUŁCE].html",
"score": 0.053987730061349694,
"html": 815,
"trafilatura": 44,
},
{"file": "dlg.org-Preis.html", "score": 0.88, "html": 25, "trafilatura": 22},
{
"file": "homify.de-Tischdecke.html",
"score": 0.625,
"html": 32,
"trafilatura": 20,
},
{
"file": "outdoor-magazin.com-vanlife.html",
"score": 0.08672936259143156,
"html": 957,
"trafilatura": 83,
},
{"file": "camping.info-ligurien.html", "score": 1.0, "html": 17, "trafilatura": 17},
{
"file": "dw.com-elephants.html",
"score": 0.15234375,
"html": 256,
"trafilatura": 39,
},
{
"file": "mitundvoneinander.com-Frühling.html",
"score": 0.20689655172413793,
"html": 232,
"trafilatura": 48,
},
{
"file": "nestle-family-com-chicken.html",
"score": 0.32664756446991405,
"html": 349,
"trafilatura": 114,
},
{
"file": "ekiba.de-trauer.html",
"score": 0.686084142394822,
"html": 309,
"trafilatura": 212,
},
{
"file": "eurosport.de-corona.html",
"score": 0.6981981981981982,
"html": 444,
"trafilatura": 310,
},
{
"file": "eurailpress.de-rekordniveau.html",
"score": 0.4868421052631579,
"html": 152,
"trafilatura": 74,
},
]
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #583 +/- ##
==========================================
- Coverage 97.82% 95.97% -1.86%
==========================================
Files 23 23
Lines 3449 3503 +54
==========================================
- Hits 3374 3362 -12
- Misses 75 141 +66
☔ View full report in Codecov by Sentry.
Hi @zirkelc, thanks for your work, here are a few comments:
I opened a new PR for
@zirkelc I'm not sure what to do with this pull request. Do you want to keep working on it by leveraging the functionality you just introduced?
I wanted to get back to you on that. I did some more tests and re-implemented this function in JavaScript (because that's the environment I usually work in). The results are mixed: the majority of scores for non-readable pages fall in the range 0-0.5 or are greater than 1.0 (both ranges are bad, so the result is good). However, there are certain cases where the results are in the good range 0.5-1.0 even though the pages are not readable. So I'm not sure how reliable the scores will get. Maybe the combination of
If you think this score would be a useful addition to Trafilatura, I can develop it further and update the evaluation to use the new
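To make the three ranges mentioned above concrete, here is a minimal sketch that tallies scores per range. The sample records are copied from the results posted earlier in this thread; the `bucket` helper and its boundary handling (0.5 and 1.0 both count as "good") are assumptions for illustration, not part of the PR.

```python
from collections import Counter

# A few records copied from the results posted earlier in this thread.
records = [
    {"file": "die-partei.net.luebeck.html", "score": 0.7719298245614035, "html": 57, "trafilatura": 44},
    {"file": "schleifen.ucoz.de.briefe.html", "score": 1.052325581395349, "html": 172, "trafilatura": 181},
    {"file": "love-hina.ch.0409.html", "score": 0.3643410852713178, "html": 129, "trafilatura": 47},
    {"file": "faz.net.leone.html", "score": 0.028588098016336057, "html": 3428, "trafilatura": 98},
    {"file": "changenow.de.loibl.html", "score": 0, "html": 0, "trafilatura": 0},
]


def bucket(score: float) -> str:
    """Assign a score to one of the three ranges discussed above.

    Hypothetical helper: treating exactly 0.5 and exactly 1.0 as part of
    the "good" range is an assumption, the comment does not specify it.
    """
    if score > 1.0:
        return "above 1.0"
    if score >= 0.5:
        return "0.5-1.0"
    return "0-0.5"


# Tally how many sample records fall into each range.
counts = Counter(bucket(record["score"]) for record in records)
for name in ("0-0.5", "0.5-1.0", "above 1.0"):
    print(f"{name}: {counts[name]}")
```

The posted records carry no readability flag, so this only counts scores per range; comparing against is_probably_readerable would need that value stored alongside each record.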
I also think this reliability issue would prevent us from directly using such a metric. It's nice to have ported
I'd be in favor of closing this PR now and focusing on improving/porting further components of readability.js. If you'd like, you can work on a PR or, if it's easier for you in the short term, maybe list striking differences between the current version and the port in a new issue?
Okay, I agree. Let's close this PR and I will create a new issue to discuss the port.
I implemented a small prototype as discussed in #572. It's super basic and I don't expect you to add it to the library, just wanted to show you some results and hear your opinion.
The scoring.py module contains the functions to count the unique words in an HTML document and in a markdown document. The semantic elements (header, footer, nav) are removed, but other elements could be added based on class names or ids, or other tags like forms, inputs, etc. Links are not removed yet, but I think relative links are a good indicator of navigation elements, so they could also be removed. Absolute links, or more specifically links to other hostnames, should probably be kept.
The ratio unique_words_markdown / unique_words_html is then considered the content score. My assumption is the following:
I ported the Readability.js isProbablyReaderable to compare the results. My assumption is that an HTML file with is_probably_readerable=False should also have a low content score.
The scoring_small.py module is a copy of comparison_small.py. I collect all unique word counts and calculate some statistics. Here are the results so far:
The current average content score across all files is 0.676, and for files which are not readerable (is_probably_readerable=False) it is 0.456. I assume the content score could be increased even more with more cleansing of the HTML. I will have to investigate the cases with a low content score and those with a score above 1.0 to see if they are actually bad.
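For readers who want to try the idea without the PR's code, the counting described above can be sketched as follows. This is not the scoring.py from the PR: it uses only the standard library, and the names `_TextCollector`, `unique_words`, `unique_words_html` and `content_score` are hypothetical. It divides the markdown count by the HTML count and returns 0 for an empty page, which matches the records posted in this thread (for example 44 / 57 ≈ 0.772, and 0 for pages with 0 words). Lowercasing words and skipping script and style are additional assumptions.

```python
import re
from html.parser import HTMLParser

# header/footer/nav come from the description above; script/style are an
# extra assumption so that inline code is not counted as words.
SKIPPED_TAGS = {"header", "footer", "nav", "script", "style"}
WORD_RE = re.compile(r"\w+")


class _TextCollector(HTMLParser):
    """Collect text that is not nested inside any skipped tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def unique_words(text: str) -> set:
    """Return the set of lowercased word tokens in a text."""
    return {word.lower() for word in WORD_RE.findall(text)}


def unique_words_html(html: str) -> set:
    """Unique words of an HTML page, ignoring the skipped elements."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return unique_words(" ".join(parser.chunks))


def content_score(html: str, markdown: str) -> float:
    """Unique words in the markdown divided by unique words in the HTML."""
    html_count = len(unique_words_html(html))
    markdown_count = len(unique_words(markdown))
    if html_count == 0:
        return 0.0  # empty pages get a score of 0, as in the posted records
    return markdown_count / html_count


html = """
<html><body>
  <nav><a href="/home">Home</a> <a href="/about">About</a></nav>
  <article><h1>Apple pie</h1><p>Mix flour and butter. Bake the pie.</p></article>
  <div class="sidebar">Related recipes</div>
  <footer>Copyright Example</footer>
</body></html>
"""
markdown = "# Apple pie\n\nMix flour and butter. Bake the pie."

print(content_score(html, markdown))
```

In this toy page the sidebar text stays in the HTML count but is missing from the markdown, so the score drops below 1.0. Removing more boilerplate from the HTML side, for instance by class name as suggested above, would raise it.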
Do you have any remarks or ideas for improvements? Of course, these are all assumptions, so please don't hesitate to point out flaws in my logic or implementation.