Makes article properties overwritable by parser #330

Open
wants to merge 3 commits into
base: master
Changes from 2 commits
10 changes: 8 additions & 2 deletions src/fundus/parser/__init__.py
@@ -1,4 +1,10 @@
-from .base_parser import BaseParser, ParserProxy, attribute, function
+from .base_parser import (
+    BaseParser,
+    ParserProxy,
+    attribute,
+    function,
+    overwrite_attribute,
+)
 from .data import ArticleBody

-__all__ = ["ParserProxy", "BaseParser", "attribute", "function", "ArticleBody"]
+__all__ = ["ParserProxy", "BaseParser", "attribute", "function", "overwrite_attribute", "ArticleBody"]
16 changes: 12 additions & 4 deletions src/fundus/parser/base_parser.py
@@ -76,8 +76,9 @@ def __repr__(self):


 class Attribute(RegisteredFunction):
-    def __init__(self, func: Callable[[object], Any], priority: Optional[int], validate: bool):
-        self.validate = validate
+    def __init__(self, func: Callable[[object], Any], priority: Optional[int], validate: bool, overwrite: bool = False):
+        self.validate = validate if not overwrite else False
Collaborator commented:

Just writing the true path first for clarity.

Suggested change
-        self.validate = validate if not overwrite else False
+        self.validate = False if overwrite else validate

+        self.overwrite = overwrite

Collaborator commented on lines +79 to +81:

Could you clarify why an overwritten attribute cannot also be a validated attribute? If we allow overwritten attributes to be validated, they are the same as regular attributes.

         super(Attribute, self).__init__(func=func, priority=priority)


@@ -88,7 +89,10 @@ def __init__(self, func: Callable[[object], Any], priority: Optional[int]):

 def _register(cls, factory: Type[RegisteredFunction], **kwargs):
     def wrapper(func):
-        return functools.update_wrapper(factory(func, **kwargs), func)
+        try:
+            return functools.update_wrapper(factory(func, **kwargs), func)
+        except TypeError as err:
+            raise err
dobbersc marked this conversation as resolved.

     # _register was called with parenthesis
     if cls is None:
@@ -102,6 +106,10 @@ def attribute(cls=None, /, *, priority: Optional[int] = None, validate: bool = T
     return _register(cls, factory=Attribute, priority=priority, validate=validate)


+def overwrite_attribute(cls):
+    return _register(cls, factory=Attribute, priority=None, validate=False, overwrite=True)
+
+
 def function(cls=None, /, *, priority: Optional[int] = None):
     return _register(cls, factory=Function, priority=priority)

@@ -137,7 +145,7 @@ def validated(self) -> "AttributeCollection":

     @property
     def unvalidated(self) -> "AttributeCollection":
-        return AttributeCollection(*[attr for attr in self.functions if not attr.validate])
+        return AttributeCollection(*[attr for attr in self.functions if not attr.validate and not attr.overwrite])
Collaborator commented:

Doesn't the following line already adjust the validate attribute, so that (un)validated attributes are identified correctly?

        self.validate = validate if not overwrite else False

Suggested change
-        return AttributeCollection(*[attr for attr in self.functions if not attr.validate and not attr.overwrite])
+        return AttributeCollection(*[attr for attr in self.functions if not attr.validate])
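To make the question above concrete, the following sketch compares the two filters on a tiny hand-made collection. `Attr` is a hypothetical stand-in that only carries the two flags, and the attribute names are invented for illustration. Under these assumptions the filters are not equivalent: because `validate` is forced to `False` for overwrite attributes, the suggested filter would list them as unvalidated, while the PR's filter leaves them out of both groups.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Attr:
    """Hypothetical minimal record; only the two flags from the diff matter here."""

    name: str
    validate: bool
    overwrite: bool = False

    def __post_init__(self) -> None:
        # Same rule as in Attribute.__init__: overwrite implies validate=False.
        if self.overwrite:
            self.validate = False


def validated(functions: List[Attr]) -> List[str]:
    return [attr.name for attr in functions if attr.validate]


def unvalidated_pr(functions: List[Attr]) -> List[str]:
    # Filter as written in the PR.
    return [attr.name for attr in functions if not attr.validate and not attr.overwrite]


def unvalidated_suggested(functions: List[Attr]) -> List[str]:
    # Filter as proposed in the review suggestion.
    return [attr.name for attr in functions if not attr.validate]


functions = [
    Attr("title", validate=True),
    Attr("free_access", validate=False),
    Attr("lang", validate=True, overwrite=True),
]

print(validated(functions))
print(unvalidated_pr(functions))
print(unvalidated_suggested(functions))
```

Which behaviour is wanted depends on whether overwrite attributes should appear among the unvalidated ones; the thread does not settle this.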



 class FunctionCollection(RegisteredFunctionCollection[Function]):
10 changes: 8 additions & 2 deletions src/fundus/scraping/article.py
@@ -11,6 +11,7 @@
 from fundus.logging.logger import basic_logger
 from fundus.parser import ArticleBody
 from fundus.scraping.html import HTML
+from fundus.utils.caching import cached_attribute


 @dataclass(frozen=True)
@@ -41,12 +42,17 @@ def from_extracted(cls, html: HTML, extracted: Dict[str, Any], exception: Option

         return article

-    @property
+    @cached_attribute
     def plaintext(self) -> Optional[str]:
         return str(self.body) if self.body else None

-    @property
+    @cached_attribute
     def lang(self) -> Optional[str]:
+        """
+        computes used language
+        Returns:
+
+        """
         language: Optional[str] = None

         if self.plaintext:
39 changes: 39 additions & 0 deletions src/fundus/utils/caching.py
@@ -0,0 +1,39 @@
+import functools
+
+
+class _CachedAttribute(object):
Collaborator commented:

In general, I was a bit confused by the features added in this PR: first the overwrite_attribute decorator, and second the property caching. I'll spare you my earlier comments of confusion, but at first I thought these were separate features and that the custom cached attribute was only motivated by improving general performance. Nevertheless, here is how I understand it now:

The original problem outlined in #328 is caused by the @property decorator on lang, which we want to overwrite in the parsers. We cannot continue using standard properties here because the defined properties have no setter, which prevents them from being overwritten in subclasses by methods with the @attribute decorator. Did I understand this correctly? It would have been helpful to include the origin of the error, or at least an error traceback, in the issue or the PR description.
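The setter problem described above can be reproduced in isolation. The sketch below assumes that extracted values are written onto the frozen instance with `object.__setattr__` (the diff does not show that part of `from_extracted`); `WithProperty`, `WithDescriptor`, `NonDataDescriptor` and `try_overwrite` are hypothetical names. A `property` is a data descriptor, so the write is routed to its missing setter and fails, whereas a descriptor that defines only `__get__` can be shadowed by an instance attribute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WithProperty:
    body: str

    @property
    def lang(self) -> str:
        return "en"


class NonDataDescriptor:
    """Defines only __get__, so an instance attribute of the same name shadows it."""

    def __init__(self, method):
        self.method = method

    def __get__(self, inst, cls):
        if inst is None:
            return self
        return self.method(inst)


@dataclass(frozen=True)
class WithDescriptor:
    body: str

    @NonDataDescriptor
    def lang(self) -> str:
        return "en"


def try_overwrite(article, value: str) -> bool:
    """Try to store a parser-provided value; report whether it worked."""
    try:
        # object.__setattr__ bypasses the frozen-dataclass guard, but it still
        # honours data descriptors such as property.
        object.__setattr__(article, "lang", value)
    except AttributeError:
        return False
    return True


with_property = WithProperty(body="text")
with_descriptor = WithDescriptor(body="text")

print(try_overwrite(with_property, "de"), with_property.lang)
print(try_overwrite(with_descriptor, "de"), with_descriptor.lang)
```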

Collaborator commented:

I'm a bit unsure whether this entire attribute caching via a custom property is necessary. Some alternatives come to mind:

  1. We could define a setter for the relevant properties. But this would be bad, since we don't want lang etc. to be openly mutable.
  2. From a design perspective, why are lang and plaintext computed in the article and not in the parser? Maybe we can move both of these properties from the Article to the BaseParser as attributes. We would then need to include lang and plaintext as fields in the Article dataclass, so that they are populated later, like the other attributes. This way it would also be explicit that these are attributes one may overwrite from the BaseParser, and Article would define the skeleton of attributes as dataclass fields.
  3. What would be wrong with the regular cached_attribute? I tried to replace it with the custom one and it seemed to work fine, even when overwriting it in a parser.
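Alternative 3 presumably refers to the standard library's `functools.cached_property`; that reading is an assumption, as the comment does not name it. Under that assumption, the sketch below shows that `cached_property` works on a frozen dataclass, because it stores the result directly in the instance `__dict__`, and that a value written with `object.__setattr__` takes precedence over the computed one. `Article` here is a reduced, hypothetical stand-in and `calls` exists only to count computations.

```python
from dataclasses import dataclass
from functools import cached_property
from typing import List, Optional

calls: List[str] = []  # records each real computation, for illustration only


@dataclass(frozen=True)
class Article:
    """Reduced stand-in for the Article dataclass in the diff."""

    body: Optional[str] = None

    @cached_property
    def plaintext(self) -> Optional[str]:
        calls.append("plaintext")
        return str(self.body) if self.body else None


# Computed once, then served from the instance __dict__.
article = Article(body="Hello")
first = article.plaintext
second = article.plaintext

# A value stored before the first access shadows the cached_property,
# because cached_property defines no __set__ (it is a non-data descriptor).
overwritten = Article(body="Hello")
object.__setattr__(overwritten, "plaintext", "from parser")
shadowed = overwritten.plaintext
```

This only works for classes whose instances have a `__dict__`; a dataclass declared with `slots=True` would need a different approach.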

+    """Computes attribute value and caches it in the instance.
+    From https://stackoverflow.com/questions/7388258/replace-property-for-perfomance-gain?noredirect=1&lq=1
+    Tweaked a bit to be used with a wrapper.
+    """
+
+    def __init__(self, method):
+        self.method = method
+
+    def __get__(self, inst, cls):
+        if inst is None:
+            return self
+        result = self.method(inst)
+        object.__setattr__(inst, self.__name__, result)  # type: ignore[attr-defined]
+        return result
+
+
+# This was implemented in order to
+def cached_attribute(attribute):
+    """Decorate attributes to be cached.
+
+    This works like `cached_property`, but instead of `property` or `cached_property`, the decorated attribute
+    can be overwritten.
+
+    Args:
+        attribute: The attribute to decorate.
+
+    Returns:
+        A wrapped _CachedAttribute instance.
+
+    """
+
+    def wrapper(func):
+        return functools.update_wrapper(_CachedAttribute(func), func)
+
+    return wrapper(attribute)