Add max_file_download_size config and expose to DataSource #2080

Merged
merged 3 commits into main from seanstory/4716-make-max-filesize-configurable on Jan 24, 2024

Conversation

seanstory (Member) commented Jan 22, 2024

Closes https://github.com/elastic/enterprise-search-team/issues/4716

Adds a new config, service.max_file_download_size, which defaults to 10MB.

Also adds a new abstraction for the specific framework configs that we want to expose to DataSource instances, without exposing all of our YAML configs. This PR exposes only the new max_file_size, but it can easily be updated to expose more (follow-up work: https://github.com/elastic/enterprise-search-team/issues/5057).

Checklists

Pre-Review Checklist

  • this PR has a meaningful title
  • this PR links to all relevant github issues that it fixes or partially addresses
  • this PR has a thorough description
  • Covered the changes with automated tests
  • Tested the changes locally
  • Added a label for each target release version (example: v7.13.2, v7.14.0, v8.0.0)
  • Considered corresponding documentation changes
  • Contributed any configuration settings changes to the configuration reference

Release Note

service.max_file_download_size can now be set in config.yml to control how large a file the framework will attempt to download. This config is only used by connectors that are not using the Data Extraction Service.
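
As an illustration, a config.yml entry for this setting might look like the following; the exact value shown (10MB expressed in bytes) is an assumption for the example, not taken from this PR:

service:
  max_file_download_size: 10485760  # ~10MB, in bytes (assumed unit)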

Comment on lines +198 to +207
class Builder:
    def __init__(self):
        self.max_file_size = DEFAULT_MAX_FILE_SIZE

    def with_max_file_size(self, max_file_size):
        self.max_file_size = max_file_size
        return self

    def build(self):
        return DataSourceFrameworkConfig(self.max_file_size)
Member

Curious - why use a builder here?

For now it's only one property, and in the future it'll probably be more like 10.

Do you foresee partial initialisation of this config (e.g. with some properties initialised, some not), or multi-step initialisation of the config?

Member Author

I want to make it easy for this to be built up from multiple places eventually. We had a recent Slack thread where we discussed how some service configs might one day have a UI and be stored in ES. We've got some in the YAML, and we may someday have some in RCFs. Passing around a half-complete builder is a good pattern for scenarios like this, where multiple actors each contribute parts of something being built.

Member

Yeah, I understand that, though I feel the purpose of a Builder is often to have a mutable object until, at some point, you're ready to create an immutable object and use it somewhere. Since Python has different encapsulation concepts, that's much less necessary, and a Builder becomes more of a "convenience" class that makes object creation more pleasant.

I'd think that just having a DataSourceFrameworkConfig class with getters/setters should be fine.

This shouldn't prevent merging the change as is, though - it's more of a discussion to hear your POV here :)

Member

I think the benefit of the Builder pattern is that it makes object instantiation simple and readable when there are a lot of variables. Apparently this is implemented for future-proofing.

Do you think we should define the constructor of DataSourceFrameworkConfig like this:

def __init__(self, builder):
	self.max_file_size = builder.max_file_size

and the build method like this:

def build(self):
	return DataSourceFrameworkConfig(self)

Member

I think the truth is there will be N sources that provide config options, e.g.:

  • Configuration file
  • Connector-specific config
  • Deployment-wide settings

So the builder will have 3 args / 3 with_x methods, and the number is unlikely to grow IMO - I might be wrong here.

As to the syntax, I think the original syntax in Sean's changeset is already good Builder syntax - the only improvement (also absolutely unnecessary) would be to make DataSourceFrameworkConfig have read-only properties with data stored in __-prefixed attributes so they can't be accessed directly, but I don't think it matters a lot.

Member

I think the truth is there will be N sources that provide config options, e.g.:

  • Configuration file
  • Connector-specific config
  • Deployment-wide settings

So the builder will have 3 args / 3 with_x methods

I think the builder should not care where this config is coming from. Take max_file_size as an example: it should only define one method, with_max_file_size. If it's from the configuration file, you can do builder.with_max_file_size(loaded_config.max_file_size), and if it's from a connector, you can do builder.with_max_file_size(connector.max_file_size).

Member Author

Let's just keep these options in mind as this grows. I think we've agreed that the current state is "fine", and that it could be improved on depending on how we use it in the future. So let's just plan to refactor as necessary as its usage grows.
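
For readers following along, here is a minimal, self-contained sketch of the builder pattern being discussed, assuming (as the diff snippet above suggests) that Builder is nested inside DataSourceFrameworkConfig; it is an illustration, not the framework's exact code:

DEFAULT_MAX_FILE_SIZE = 10 * 1024 * 1024  # assumed 10MB default


class DataSourceFrameworkConfig:
    def __init__(self, max_file_size):
        self.max_file_size = max_file_size

    class Builder:
        def __init__(self):
            self.max_file_size = DEFAULT_MAX_FILE_SIZE

        def with_max_file_size(self, max_file_size):
            self.max_file_size = max_file_size
            return self

        def build(self):
            return DataSourceFrameworkConfig(self.max_file_size)


# Multiple actors can each contribute to a partially built config before it is finalized:
builder = DataSourceFrameworkConfig.Builder()
builder.with_max_file_size(20 * 1024 * 1024)  # e.g. an override read from config.yml
framework_config = builder.build()
assert framework_config.max_file_size == 20 * 1024 * 1024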

Comment on lines +393 to +395
  if attachment_size > self.framework_config.max_file_size:
      self._logger.warning(
-         f"File size {attachment_size} of file {attachment_name} is larger than {FILE_SIZE_LIMIT} bytes. Discarding file content"
+         f"File size {attachment_size} of file {attachment_name} is larger than {self.framework_config.max_file_size} bytes. Discarding file content"
Member

I was also wondering if it's possible to move this condition elsewhere, so that each connector would not have to call it.

The best I could think of was, rather than doing the comparison manually, to make the class "richer" and do something like:

if self.framework_config.file_size_supported(attachment_size):
    # download

How do you feel?
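
For illustration only (this helper was not added in this PR), a "richer" config along the lines of the suggestion could look roughly like this; the method name and placement are hypothetical:

class DataSourceFrameworkConfig:
    def __init__(self, max_file_size):
        self.max_file_size = max_file_size

    def file_size_supported(self, file_size):
        # True if the file is small enough for the framework to download
        return file_size <= self.max_file_size

A connector could then call self.framework_config.file_size_supported(attachment_size) instead of doing the size comparison itself.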

Member Author

I'm not going to tackle that as part of this - my focus for this PR is making the existing logic configurable. I believe there was debate previously about moving this kind of thing to the Extraction utils, but that ended up being rejected, and I don't want to re-open that discussion at the moment.

seanstory enabled auto-merge (squash) January 24, 2024 15:50
seanstory merged commit 2ac9db5 into main Jan 24, 2024
2 checks passed
seanstory deleted the seanstory/4716-make-max-filesize-configurable branch January 24, 2024 16:12

💔 Failed to create backport PR(s)

The backport operation could not be completed due to the following error:
There are no branches to backport to. Aborting.

The backport PRs will be merged automatically after passing CI.

To backport manually run:
backport --pr 2080 --autoMerge --autoMergeMethod squash
