Adds a mechanism for accepting or rejecting articles #184

Weyaaron · 2023-05-08T12:01:53Z

This PR enhances fundus by addressing issue #181.

It adds a mechanism that filters articles before they are extracted. It is implemented in a functional manner: PublisherSpec gained an optional argument for a function constructor that builds a filter function based on the URL and the HTML of a given article.

One example function is given; it is used to filter out articles from the NDR that are podcasts.
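A minimal sketch of the idea described above, not the code from this PR: a publisher spec carries an optional filter, and pages are checked against it before they reach the parser. `SpecSketch`, `ArticleFilter`, `pages_to_extract`, `reject_podcasts`, and the example URLs are all hypothetical names chosen for illustration, and the `(url, html)` argument order simply follows the wording of the description.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional, Tuple

# Hypothetical alias: (url, html) -> True if the page should be extracted.
ArticleFilter = Callable[[str, str], bool]


@dataclass
class SpecSketch:
    """Hypothetical stand-in for PublisherSpec, reduced to the new optional field."""

    name: str
    article_filter: Optional[ArticleFilter] = None


def pages_to_extract(
    spec: SpecSketch, pages: Iterable[Tuple[str, str]]
) -> Iterator[Tuple[str, str]]:
    """Yield only the (url, html) pairs that pass the publisher's filter."""
    for url, html in pages:
        # No filter configured means every page is passed on to extraction.
        if spec.article_filter is None or spec.article_filter(url, html):
            yield url, html


def reject_podcasts(url: str, html: str) -> bool:
    """Accept a page unless its URL looks like an NDR podcast page."""
    return re.search(r"podcast[0-9]{4}", url) is None


ndr = SpecSketch(name="NDR", article_filter=reject_podcasts)

# Made-up URLs, used only to exercise the filter.
pages = [
    ("https://www.ndr.de/nachrichten/beispiel100.html", "<html></html>"),
    ("https://www.ndr.de/nachrichten/podcast1234.html", "<html></html>"),
]
kept = list(pages_to_extract(ndr, pages))
```

Because the filter runs before extraction, rejected pages never reach the parser at all.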

# Conflicts: # src/fundus/scraping/pipeline.py # src/fundus/scraping/scraper.py

MaxDall · 2023-05-08T13:12:51Z

Thanks for adding this @Weyaaron.

IMO there are a few things missing here to solve #181 but I would consider this a good first step.
I would handle this review as follows: first talk about some more general aspects of the implementation, and then go into more detail as the actual implementation adjusts. I think there will be many changes to come, so I will refrain from doing a code review for now.

In no particular order some things I noticed:

As far as I can tell, this implementation only supports one classifier per publisher. Please correct me if I'm missing something here; this is my biggest concern. Not being able to use more than one classifier per publisher makes it almost impossible to handle future requirements.
Since we're dealing with heuristics here, especially filters, any implementation solving #181 ([Proposal] Add a classification step before the parsing that classifies HTML as article, article-hub, ...), or article classification in general, should add support for logical operators. I am thinking of something like this:

article_classification... = AND(classification1, OR(classification2, NOT(classification3)))

Implementing logic would make things more flexible and, in the end, even easier. Take, for example, the now existing url_based_classifier.
Instead of the two opposing regex parameters accepting.../rejecting..., which take away simplicity and readability, you could have a simple url_regex_classifier... and invert it if you want to reject URLs.
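The composition suggested above could look roughly like the following sketch. The combinator names `AND`, `OR`, and `NOT` come from the comment; the `Classifier` alias, the `(html, url)` argument order, and the two sample classifiers are assumptions made for illustration and are not part of fundus.

```python
from typing import Callable

# Hypothetical alias: (html, url) -> True if the page is accepted.
Classifier = Callable[[str, str], bool]


def AND(*classifiers: Classifier) -> Classifier:
    """Accept only if every given classifier accepts."""
    return lambda html, url: all(c(html, url) for c in classifiers)


def OR(*classifiers: Classifier) -> Classifier:
    """Accept if at least one given classifier accepts."""
    return lambda html, url: any(c(html, url) for c in classifiers)


def NOT(classifier: Classifier) -> Classifier:
    """Invert a classifier, turning an accepting rule into a rejecting one."""
    return lambda html, url: not classifier(html, url)


# Two hypothetical base classifiers to combine.
def is_podcast(html: str, url: str) -> bool:
    return "podcast" in url


def has_article_tag(html: str, url: str) -> bool:
    return "<article" in html


# Accept pages that contain an <article> element and are not podcasts.
combined = AND(has_article_tag, NOT(is_podcast))
```

With a `NOT` combinator, a single accepting classifier is enough; a separate rejecting variant of each rule is no longer needed.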

Weyaaron · 2023-05-08T13:51:16Z

All good points. I thought about combining multiple classifiers but have refrained from adding explicit support so far. I agree with both of these points and will look into this soon.

Weyaaron · 2023-05-08T14:18:34Z

My first thoughts involved a lot of overengineering; I will rewrite this slightly.

@MaxDall

Weyaaron · 2023-05-08T14:30:25Z

The rewrite has taken place; the overall quality has improved, in my opinion.

MaxDall

Thanks for addressing the concerns raised.

MaxDall · 2023-05-09T08:30:00Z

src/fundus/publishers/base_objects.py

@@ -11,6 +11,7 @@ class PublisherSpec:
    parser: Type[BaseParser]
    rss_feeds: List[str] = field(default_factory=list)
    sitemaps: List[str] = field(default_factory=list)
+    article_classifier: Optional[Callable[[str, str], bool]] = field(default=None)


Just one last thing, then this one is good to go. Could you type hint article_classifier with a Protocol called ArticleClassifier, as we did with the ExtractionFilter? This seems like overengineering at first, but by doing this we can convey that the first parameter will be the URL and the second one the HTML.

Has been done, although the first parameter is the HTML, the second the URL.
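For illustration, a callback Protocol along the lines requested above might look like the sketch below. The name `ArticleClassifier` is taken from the review comment and the `(html, url)` order from the reply; the actual definition merged in this PR is not shown in the conversation, and `no_podcasts` and the example URLs are hypothetical.

```python
from typing import Optional, Protocol


class ArticleClassifier(Protocol):
    """Callable that decides whether a page should be treated as an article."""

    def __call__(self, html: str, url: str) -> bool:
        ...


# Hypothetical classifier; any function with this signature satisfies the Protocol.
def no_podcasts(html: str, url: str) -> bool:
    return "podcast" not in url


# Mirrors the optional field from the diff, but typed with the Protocol.
classifier: Optional[ArticleClassifier] = no_podcasts
```

Compared with `Callable[[str, str], bool]`, the Protocol names its parameters, so a reader or type checker can see which string is the HTML and which is the URL.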

MaxDall · 2023-05-09T08:33:24Z

src/fundus/publishers/de/__init__.py

@@ -127,6 +128,7 @@ class DE(PublisherEnum):
        news_map="https://www.ndr.de/sitemap112-newssitemap.xml",
        sitemaps=["https://www.ndr.de/sitemap112-sitemap.xml"],
        parser=NDRParser,
+        article_classifier=lambda _, url: not bool(re.search("podcast[0-9]{4}", url)),


Now that I see this, I have to admit the previous implementation was way more readable. Maybe we should bring it back and call it regex_classifier, which takes a single argument in the call. Sorry for the effort, but what do you think, @Weyaaron?

I hope I changed it according to your comment.
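A sketch of what a single-argument `regex_classifier` could look like. Only the name comes from the comment above; the body is an assumption, not the merged implementation. It assumes that a match means the page is rejected, mirroring the lambda in the diff, and it uses the `(html, url)` order mentioned earlier in the review.

```python
import re
from typing import Callable


def regex_classifier(pattern: str) -> Callable[[str, str], bool]:
    """Build a classifier that rejects every page whose URL matches `pattern`."""
    compiled = re.compile(pattern)

    def classify(html: str, url: str) -> bool:
        # The HTML is ignored; only the URL is checked against the pattern.
        return compiled.search(url) is None

    return classify


# Reads more clearly in a publisher entry than the inline lambda.
ndr_classifier = regex_classifier(r"podcast[0-9]{4}")
```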

Weyaaron added 7 commits April 26, 2023 15:19

Starts with adding html classification

ae45f34

Changes the html classification to be function based

e716e21

Implements url based filtering for the ndr

0f2db55

Changes the classifier to be optional

c59e34b

Merge branch 'master' into add_classification

bdd420f

# Conflicts: # src/fundus/scraping/pipeline.py # src/fundus/scraping/scraper.py

Finishes merge with main

88c1cf4

Fixes isort

20c6bbb

Weyaaron mentioned this pull request May 8, 2023

[WIP] A parser for the titanic #185

Closed

3 tasks

Improves the readability of the article classification and addresses some of the comments from @MaxDall

8d2c919

MaxDall requested changes May 8, 2023

MaxDall reviewed May 8, 2023

Weyaaron added 2 commits May 8, 2023 21:17

Addresses the comments from @MaxDall

8a937b1

Fixes a variable name

a837002

MaxDall requested changes May 9, 2023

MaxDall reviewed May 9, 2023

Weyaaron and others added 3 commits May 9, 2023 14:52

Updates the typing of the classifier

42784c7

Moved classification to a new file and reworked classifier a bit.

ca3235b

added some documentation and switched html/url parameter

8f41e03

MaxDall approved these changes May 9, 2023

Weyaaron merged commit 35e15fe into master May 9, 2023

Weyaaron deleted the add_classification branch May 9, 2023 15:18

dobbersc mentioned this pull request May 10, 2023

Scraping "Occupy Democrats" over Sitemap #178

Closed

Weyaaron mentioned this pull request May 11, 2023

What level of noise is acceptable in the articles? #195

Closed
