Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding ability to disable writing to the public suffix list cache #74

Merged

Conversation

jsf9k
Copy link
Member

@jsf9k jsf9k commented Mar 22, 2018

This is necessary when running in AWS Lambda, for instance, where the local filesystem is read-only.

This is necessary when running in AWS Lambda, for instance, where the
local filesystem is read-only.
@jsf9k jsf9k self-assigned this Mar 22, 2018
@jsf9k jsf9k requested a review from dav3r March 22, 2018 17:54
else:
psl_age = datetime.now() - datetime.fromtimestamp(stat(PublicSuffixListFilename).st_mtime)
if psl_age > timedelta(hours=24):
if not PublicSuffixListReadOnly:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of a new flag about the list being read only, could you rely on only downloading the file if it's not already present on disk? In Lambda the filesystem will be read-only, but in Lambda the PSL should always be on disk and packaged with the function.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The caching that trustymail uses (which is useful when trustymail is run standalone) also checks the PSL file to see if it's less than 24 hours old. So it would still try to update the Lambda PSL file if we didn't have the extra flag.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, that helps. I'll push a bit more, though -- could the file write attempt be caught, and the subsequent error on a read-only disk, caught and swallowed-with-logging?

Read-only disks will not be a common occurrence, and failing gracefully-but-silently will probably always be an okay path for those kinds of situations. For read-only disks, where there's some sort of stateless package being prepared, the package preparation could be responsible for pulling the latest version.

Alternatively, Lambda allows /tmp to be writeable. Instead of trustymail supporting a new flag just for this purpose, an environment that knows it's using trustymail in a special environment could/should be responsible for providing a location to a writeable cache location. (Or ensuring no writes occur by providing a fresh copy.)

Copy link
Member Author

@jsf9k jsf9k Mar 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could the file write attempt be caught, and the subsequent error on a read-only disk, caught and swallowed-with-logging?

I think swallowing the write attempt except for a log message is a little dangerous. Read-only filesystems will be uncommon, but trying to overwrite cache files that you don't have permission to will be much more common. The user may not notice the log message and therefore not realize that an old PSL is being used.

For read-only disks, where there's some sort of stateless package being prepared, the package preparation could be responsible for pulling the latest version.

I think you're suggesting that the Lambda function copy the PSL to /tmp so it can be overwritten. But then you would have every Lambda instance that is kicked off downloading its own PSL, unless you repackaged the Lambda code with the latest PSL within 24 hours of scanning.

The simplest thing would be to remove the "is the PSL older than 24 hours" check, but I'm hesitant to do that because the caching was added by one of the non-NCATS folks who uses trustymail (@seanthegeek). That implies it's useful to someone who is running trustymail outside of domain-scan.

I think a better solution might be to have the Lambda code pull the PSL from an S3 bucket and copy it to /tmp. If the PSL is older than 24 hours then the file inside the S3 bucket should be updated by the Lambda function, and then subsequent Lambda calls will share the new file. This would also obviate the need to rebuild the Lambda function just to include an updated PSL, which we currently suffer from. I'm not sure what will happen when you kick off 800 Lambda functions and they all try to update the PSL at once, though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're suggesting that the Lambda function copy the PSL to /tmp so it can be overwritten. But then you would have every Lambda instance that is kicked off downloading its own PSL, unless you repackaged the Lambda code with the latest PSL within 24 hours of scanning.

I wasn't, I was actually suggesting that the Lambda function itself be rebuilt to include an updated PSL.

But I can see why that would affect non-USG users (such as @seanthegeek) more than USG users, since we would expect very little churn in the PSL that would be relevant to USG domains.

With pshtt, I worked hard to ensure that we never had to download third party data, either from the internet or from S3. And what do you know, it looks like I also use some lambda/local special-case logic to do that in pshtt. So I should probably drop my complaint about having any Lambda-specific logic in the trustymail scanner.

I still think, just as advice to trustymail, that having trustymail manage a flag discussing the state of the read only file system is exposing the wrong layer of abstraction. I recommend having a flag which disables the 24-hour cache feature, rather than a flag which relates to a low-level implementation detail that conflicts with that feature.

Perhaps it's also worth domain-scan having a generalized solution to packaging Lambda functions with third party data sources. For example, a scanner could specify the source of third party data that is needed for that scanner to run, and the Lambda deploy process could automatically fetch and re-package them during packaging and deployment.

I think there is a fundamental tension between "don't have every Lambda instance make a network request to get this data" and "don't ever have to repackage Lambda functions to stay fresh with this data". I'm comfortable pushing some burden on the repackaging process (especially given how easy you've made it with Docker), and suggesting that staying fresh with the PSL and other sources means setting up (perhaps automated) repackaging of functions on a regular basis. Having repackaging be "aware" of third party dependencies per-scanner could make this easier.

Copy link
Member Author

@jsf9k jsf9k Mar 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can convert the read-only flag into a "turn off the 24-hour cache check" flag. That's really the same thing as making the cache read-only, since if the cache already exists then the only reason it would be updated would be if it were more than a day old. You already approved the domain-scan pull request that corresponds to this one, though, so to keep them consistent I will go ahead and merge this one too.

If you're using Lambda then you obviously have access to the rest of AWS. I really like the idea of using an S3 bucket to provide a shared, always-updated third-party data cache to all domain-scan Lambda functions. This obviates the need to rebuild the Lambda function just to update third-party data files or download them in every single Lambda execution.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, trustymail isn't really managing a flag that corresponds to the state of the file system. It's managing a flag that corresponds to whether the user wants to overwrite the cache with new data after 24 hours or treat the cache as read-only. I guess that's a matter of semantics.

@jsf9k jsf9k merged commit da98d8c into develop Mar 25, 2018
@jsf9k jsf9k deleted the improvement/add_ability_to_turn_off_writing_to_psl_cache branch March 25, 2018 18:10
mcdonnnj pushed a commit that referenced this pull request Jan 23, 2023
Adds a comment to direct additional requirements into setup.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants