Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds a mechanism for accepting or rejecting articles #184

Merged
merged 13 commits into from
May 9, 2023

Conversation

Weyaaron
Copy link
Collaborator

@Weyaaron Weyaaron commented May 8, 2023

This Pr enhances fundus by solving issue #181.

It consists of a mechanism that filters articles based before extracting them. It is implemented in a functional manner: The Publisherspec gained an optional argument for a function constructor that constructs a filter function based on the url and the html for a given article.

One example function is given,it is used to filter articles from the ndr that are podcasts.

@Weyaaron Weyaaron mentioned this pull request May 8, 2023
3 tasks
@MaxDall
Copy link
Collaborator

MaxDall commented May 8, 2023

Thanks for adding this @Weyaaron.

IMO there are a few things missing here to solve #181 but I would consider this a good first step.
I would handle this review as the following: First talk about some more general things about the implementation and then go more and more into detail as the actual implementation adjusts. I think there will be many changes to come so I will refrain from doing a code review yet.

In no particular order some things I noticed:

  • As far as I can tell this implementation only supports one classifier per publisher. Please correct me if I'm missing something here; this is my biggest concern. Not being able to use more than one classifier per publisher makes it almost impossible to handle future expectations.
  • Since we're dealing with heuristics here, especially filters, any implementation solving [Proposal] Add a classification step before the parsing that classifies HTML as article, article-hub, ...  #181, or article classification in general, should add support for logic operators. I can think about something like this:
article_classification... = AND(classification1, OR(classification2, NOT(classification3)))

Implementing logic would make things more flexible and in the end even easier. I.e. take the now existing url_based_classifier.
In contrast to the contrary regex parameters accepting.../rejecting..., which takes away simplicity and readability, you could have a simple url_regex_classifier... and inverse it if you want to reject urls.

@Weyaaron
Copy link
Collaborator Author

Weyaaron commented May 8, 2023

All good points, I thought about combining multiple classifiers, but refrained from adding explicit support so far. I support both of these points and will look into this soon.

@Weyaaron
Copy link
Collaborator Author

Weyaaron commented May 8, 2023

My first thoughts involved a lot of overengineering, I will rewrite this slightly.

@Weyaaron
Copy link
Collaborator Author

Weyaaron commented May 8, 2023

The rewrite has taken place, the overall quality has been improved in my opionion.

Copy link
Collaborator

@MaxDall MaxDall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing the concerns raised.

src/fundus/publishers/de/__init__.py Show resolved Hide resolved
src/fundus/publishers/base_objects.py Outdated Show resolved Hide resolved
src/fundus/publishers/base_objects.py Outdated Show resolved Hide resolved
src/fundus/publishers/de/__init__.py Outdated Show resolved Hide resolved
src/fundus/scraping/scraper.py Outdated Show resolved Hide resolved
src/fundus/scraping/scraper.py Outdated Show resolved Hide resolved
src/fundus/utils/article_classification.py Outdated Show resolved Hide resolved
@@ -11,6 +11,7 @@ class PublisherSpec:
parser: Type[BaseParser]
rss_feeds: List[str] = field(default_factory=list)
sitemaps: List[str] = field(default_factory=list)
article_classifier: Optional[Callable[[str, str], bool]] = field(default=None)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one last thing then this one is good to go. Could you type hint article_clasifier with a Protocol called ArticleClasifier as we did with the ExtractionFilter? This seems like over engineering at first, but doing this we can provide the information that the first parameter gonna be the URL and the second one is the HTML.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has been done, although the first parameter is the HTML, the second the URL.

@@ -127,6 +128,7 @@ class DE(PublisherEnum):
news_map="https://www.ndr.de/sitemap112-newssitemap.xml",
sitemaps=["https://www.ndr.de/sitemap112-sitemap.xml"],
parser=NDRParser,
article_classifier=lambda _, url: not bool(re.search("podcast[0-9]{4}", url)),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How now seeing this I have to admit the previous implementation was way more readable. Maybe we should bring it back and call it regex_classifier wich takes a single argument in the call. Sorry for the effort but what do you think @Weyaaron?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope I changed it according to your comment.

@Weyaaron Weyaaron merged commit 35e15fe into master May 9, 2023
@Weyaaron Weyaaron deleted the add_classification branch May 9, 2023 15:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants