Trafilatura is skipping ordered list elements (ol tag) #10

Closed
qx54 opened this issue Apr 5, 2020 · 4 comments
Comments


qx54 commented Apr 5, 2020

Trafilatura does not include the ordered list (ol tag) on the following page, e.g. the list whose first item starts with "Der Arbeitgeber finanziert Ihre bAV allein":

https://www.finanztip.de/betriebliche-altersvorsorge/

It's also skipping the h3 titles on the aforementioned page.
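One way to check a page for this kind of loss is to compare the headings and ordered-list items in the HTML with the extracted text. The sketch below uses only the Python standard library. `StructureCollector` and `missing_elements` are hypothetical helper names, not part of Trafilatura, and the sample HTML and sample output are made-up stand-ins, not taken from the page above. It assumes explicit closing tags and does exact substring matching, so it is only a rough check.

```python
from html.parser import HTMLParser


class StructureCollector(HTMLParser):
    """Collect the text of <h3> headings and of <li> items inside an <ol>."""

    def __init__(self):
        super().__init__()
        self.ol_depth = 0    # > 0 while inside an ordered list
        self.current = None  # tag currently being captured ("li" or "h3")
        self.buffer = []
        self.found = []      # (tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag == "ol":
            self.ol_depth += 1
        elif tag == "h3" or (tag == "li" and self.ol_depth > 0):
            self.current, self.buffer = tag, []

    def handle_endtag(self, tag):
        if tag == "ol" and self.ol_depth > 0:
            self.ol_depth -= 1
        elif tag == self.current:
            # Collapse whitespace so line breaks in the markup do not matter.
            text = " ".join("".join(self.buffer).split())
            if text:
                self.found.append((tag, text))
            self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.buffer.append(data)


def missing_elements(html, extracted):
    """Return the (tag, text) pairs from the HTML that are absent from the extracted text."""
    collector = StructureCollector()
    collector.feed(html)
    normalized = " ".join(extracted.split())
    return [(tag, text) for tag, text in collector.found if text not in normalized]


# Made-up sample input; a real check would use the downloaded page
# and the text returned by the extractor.
SAMPLE_HTML = """
<h3>Example heading</h3>
<p>Intro paragraph.</p>
<ol>
  <li>First step</li>
  <li>Second step</li>
</ol>
"""
SAMPLE_EXTRACTED = "Intro paragraph.\nFirst step"

print(missing_elements(SAMPLE_HTML, SAMPLE_EXTRACTED))
# [('h3', 'Example heading'), ('li', 'Second step')]
```

An empty result would mean that every collected heading and list item appears verbatim in the output.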

PS: Not sure if you want issues with the result precision to be reported here?

Owner

adbar commented Apr 9, 2020

Thanks for reporting the issue, I'll look into it. Do you have an example of a result precision problem?

Author

qx54 commented Apr 10, 2020

I meant that I'm not sure if you want such issues concerning the precision of the extracted content to be reported here.

Anyway, I did some further manual tests. Here are the issues I found:

https://www.smava.de/privatkredit/privatkredit-zinsen/
-> skipping most of the content

https://www.finanzcheck.de/autokredit/leasing-oder-finanzierung/
-> not catching most titles and unordered lists.

https://www.vergleich.de/auto-leasen-finanzieren-oder-kaufen.html
-> Content in accordion gets included, but the h3 title (e.g. "Vorteile beim Barkauf:") of the accordion box gets skipped.

https://www.focus.de/auto/experten/auto-leasen-oder-kaufen-fuer-wen-lohnt-sich-was_id_9209161.html
-> "Zur Person" insertion in the middle of the text gets included.
-> Plenty of text from the comment function gets included even with include_comments=False:
"Vielen Dank! Ihr Kommentar wurde abgeschickt.
Hier können Sie selbst Artikel verfassen: Bericht schreiben
Im Interesse unserer User behalten wir uns vor, jeden Beitrag vor der Veröffentlichung zu prüfen. Als registrierter Nutzer werden Sie automatisch per E-Mail benachrichtigt, wenn Ihr Kommentar freigeschaltet wurde."

https://www.comparis.ch/leasing/info/autofinanzierung
-> parts of the inserted form get included, e.g. "Berechnen Sie die Kosten für Ihren Privatkredit – und vergleichen Sie." and "Laufzeit in Monaten"
-> testimonial content and the disclaimer get included

https://www.verivox.de/kredit/leasing-oder-finanzierung/
-> Good job besides some smaller issues, e.g. "Das sagen unsere Kunden" gets included (bad) while the testimonial / Ekomi rating block doesn't get included (good).

https://fincompare.de/firmenwagen-leasing-oder-finanzierung
-> the whole content gets skipped, only the following footer sentence gets included: "FinCompare wurde als geprüftes Vergleichsportal in der Kategorie Vermittlungsservice ausgezeichnet. Damit ist FinCompare als erster Vermittlungsservice vom TÜV Saarland nach den folgenden Kriterien zertifiziert: Qualität der Beratung, Aktualität, vielfältige Suchoptionen, Transparenz, Übersichtlichkeit, Datenschutz."

https://www.giromatch.com/online-kredit/1000-euro-kredit
-> CTA button text gets included: "Jetzt 1000 Euro Kredit sichern »"

https://www.financescout24.de/kredit/autokredit
-> h3 title of the FAQ accordion does not get included
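Several of the points above are precision problems: boilerplate strings that end up in the output. A rough way to track them is to keep the offending snippets in a list and check each new extraction against it. The sketch below assumes nothing beyond the standard library. `leaked_snippets` is a hypothetical helper name, and `sample_output` is a made-up stand-in, not real extractor output.

```python
def leaked_snippets(extracted, unwanted):
    """Return the unwanted snippets that appear in the extracted text."""
    # Normalize whitespace and case so formatting differences do not hide a match.
    normalized = " ".join(extracted.split()).casefold()
    return [s for s in unwanted if " ".join(s.split()).casefold() in normalized]


# Boilerplate strings quoted in the report above.
UNWANTED = [
    "Jetzt 1000 Euro Kredit sichern »",
    "Das sagen unsere Kunden",
    "Laufzeit in Monaten",
]

# Made-up stand-in; in practice this would be the extractor's return value,
# e.g. from trafilatura.extract(downloaded, include_comments=False).
sample_output = "Some article text.\nLaufzeit in Monaten\nMore article text."

print(leaked_snippets(sample_output, UNWANTED))
# ['Laufzeit in Monaten']
```

Rerunning the same check after an update shows which of the reported strings still leak into the output.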

Thanks

Owner

adbar commented Apr 16, 2020

Thank you for your feedback! I was able to solve some of the problems you mentioned; others cannot be fixed ad hoc without destabilizing the whole structure. The improvements will ship very soon with version 0.4.1.

Do you have other examples of extraction issues, especially pages where the main content is missing?

Owner

adbar commented Jul 15, 2020

The extraction has been significantly improved, including for the web pages you mentioned. I'm closing the issue for now.

adbar closed this as completed Jul 15, 2020