Adds a mechanism for accepting or rejecting articles #184
# Conflicts: # src/fundus/scraping/pipeline.py # src/fundus/scraping/scraper.py
Thanks for adding this @Weyaaron. IMO there are a few things missing here to solve #181 but I would consider this a good first step. In no particular order some things I noticed:
article_classification... = AND(classification1, OR(classification2, NOT(classification3))) Implementing logic would make things more flexible and in the end even easier. I.e. take the now existing
All good points. I thought about combining multiple classifiers but have refrained from adding explicit support so far. I support both of these points and will look into this soon.
My first thoughts involved a lot of overengineering, so I will rewrite this slightly.
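The AND/OR/NOT composition suggested above could be sketched roughly as follows. This is only an illustration, not the code in this PR: the helper names `and_`, `or_`, `not_` and the three example classifiers are hypothetical, and it assumes a classifier takes `(html, url)` and returns `True` to accept an article.

```python
import re
from typing import Callable

# Assumed shape of a classifier: (html, url) -> True if the article is accepted.
Classifier = Callable[[str, str], bool]


def and_(*classifiers: Classifier) -> Classifier:
    """Accept only if every given classifier accepts."""
    return lambda html, url: all(c(html, url) for c in classifiers)


def or_(*classifiers: Classifier) -> Classifier:
    """Accept if at least one given classifier accepts."""
    return lambda html, url: any(c(html, url) for c in classifiers)


def not_(classifier: Classifier) -> Classifier:
    """Invert the decision of a classifier."""
    return lambda html, url: not classifier(html, url)


# Hypothetical example classifiers, for illustration only.
def has_body(html: str, url: str) -> bool:
    return bool(html.strip())


def is_news(html: str, url: str) -> bool:
    return "/news/" in url


def is_podcast(html: str, url: str) -> bool:
    return re.search("podcast[0-9]{4}", url) is not None


# Mirrors AND(classification1, OR(classification2, NOT(classification3))).
combined = and_(has_body, or_(is_news, not_(is_podcast)))
```

Keeping every combinator on the same `(html, url) -> bool` shape means a combined classifier can be passed anywhere a single one is expected.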
…ome of the comments from @MaxDall
The rewrite has taken place; the overall quality has improved, in my opinion.
Thanks for addressing the concerns raised.
@@ -11,6 +11,7 @@ class PublisherSpec:
parser: Type[BaseParser]
rss_feeds: List[str] = field(default_factory=list)
sitemaps: List[str] = field(default_factory=list)
article_classifier: Optional[Callable[[str, str], bool]] = field(default=None)
Just one last thing, then this one is good to go. Could you type hint article_classifier with a Protocol called ArticleClassifier, as we did with the ExtractionFilter? This seems like overengineering at first, but doing this lets us document that the first parameter will be the URL and the second one the HTML.
Has been done, although the first parameter is the HTML, the second the URL.
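A callback Protocol along the lines discussed here might look like the sketch below. The parameter order (HTML first, URL second) follows the reply above; the exact definition in fundus may differ, and `reject_podcasts` is a hypothetical function used only to show that a plain function satisfies the Protocol.

```python
from typing import Optional, Protocol


class ArticleClassifier(Protocol):
    """Callback protocol: names the two string parameters explicitly."""

    def __call__(self, html: str, url: str) -> bool:
        """Return True if the article should be kept (assumed convention)."""
        ...


# Hypothetical classifier; any callable with a matching signature
# satisfies the Protocol structurally, no inheritance needed.
def reject_podcasts(html: str, url: str) -> bool:
    return "podcast" not in url


classifier: Optional[ArticleClassifier] = reject_podcasts
```

Compared with `Callable[[str, str], bool]`, the Protocol records which string is which, so a type checker and a reader both see the intended order.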
src/fundus/publishers/de/__init__.py
Outdated
@@ -127,6 +128,7 @@ class DE(PublisherEnum):
news_map="https://www.ndr.de/sitemap112-newssitemap.xml",
sitemaps=["https://www.ndr.de/sitemap112-sitemap.xml"],
parser=NDRParser,
article_classifier=lambda _, url: not bool(re.search("podcast[0-9]{4}", url)),
Now that I see this, I have to admit the previous implementation was way more readable. Maybe we should bring it back and call it regex_classifier, which takes a single argument in the call. Sorry for the effort, but what do you think @Weyaaron?
I hope I changed it according to your comment.
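One way to read the regex_classifier suggestion is as a small factory that takes the pattern as its single argument and returns a classifier. The sketch below is an assumption about that shape, not the merged implementation; it keeps the `(html, url)` order and the "`True` means keep" convention of the lambda in the diff, and the test URLs in the usage note are made up.

```python
import re
from typing import Callable


def regex_classifier(pattern: str) -> Callable[[str, str], bool]:
    """Build a classifier that rejects articles whose URL matches `pattern`."""
    compiled = re.compile(pattern)

    def classify(html: str, url: str) -> bool:
        # The HTML is ignored; only the URL is inspected.
        return compiled.search(url) is None

    return classify


# Equivalent in behaviour to the lambda shown in the diff for NDR.
ndr_classifier = regex_classifier("podcast[0-9]{4}")
```

With this, the spec entry would read `article_classifier=regex_classifier("podcast[0-9]{4}")`, which states the intent without an inline lambda.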
This PR enhances fundus by solving issue #181.
It adds a mechanism that filters articles before extracting them. It is implemented in a functional manner: the PublisherSpec gained an optional argument for a function constructor that constructs a filter function based on the URL and the HTML of a given article.
One example function is given; it is used to filter out NDR articles that are podcasts.
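To illustrate what "filtering before extraction" means in practice, here is a minimal sketch. The function `filter_before_extraction` and the sample data are hypothetical and do not reflect the actual fundus scraper or pipeline code; the sketch assumes an optional classifier that returns `True` for articles to keep.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple


def filter_before_extraction(
    pages: Iterable[Tuple[str, str]],
    classifier: Optional[Callable[[str, str], bool]] = None,
) -> Iterator[Tuple[str, str]]:
    """Yield only the (html, url) pairs the classifier accepts.

    With no classifier configured, every page passes through unchanged.
    """
    for html, url in pages:
        if classifier is not None and not classifier(html, url):
            continue  # rejected: never handed to the parser
        yield html, url


# Made-up sample input for illustration.
sample_pages = [
    ("<html>a</html>", "https://www.ndr.de/nachrichten/artikel.html"),
    ("<html>b</html>", "https://www.ndr.de/nachrichten/podcast4684.html"),
]
kept = list(filter_before_extraction(sample_pages, lambda _, url: "podcast" not in url))
```

Because rejected pages are dropped before parsing, no parser time is spent on content that would be discarded anyway.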