-
Notifications
You must be signed in to change notification settings - Fork 31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding ability to disable writing to the public suffix list cache #74
Adding ability to disable writing to the public suffix list cache #74
Conversation
This is necessary when running in AWS Lambda, for instance, where the local filesystem is read-only.
else: | ||
psl_age = datetime.now() - datetime.fromtimestamp(stat(PublicSuffixListFilename).st_mtime) | ||
if psl_age > timedelta(hours=24): | ||
if not PublicSuffixListReadOnly: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of a new flag about the list being read only, could you rely on only downloading the file if it's not already present on disk? In Lambda the filesystem will be read-only, but in Lambda the PSL should always be on disk and packaged with the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The caching that trustymail
uses (which is useful when trustymail
is run standalone) also checks the PSL file to see if it's less than 24 hours old. So it would still try to update the Lambda PSL file if we didn't have the extra flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, that helps. I'll push a bit more, though -- could the file write attempt be caught, and the subsequent error on a read-only disk, caught and swallowed-with-logging?
Read-only disks will not be a common occurrence, and failing gracefully-but-silently will probably always be an okay path for those kinds of situations. For read-only disks, where there's some sort of stateless package being prepared, the package preparation could be responsible for pulling the latest version.
Alternatively, Lambda allows /tmp
to be writeable. Instead of trustymail supporting a new flag just for this purpose, an environment that knows it's using trustymail in a special environment could/should be responsible for providing a location to a writeable cache location. (Or ensuring no writes occur by providing a fresh copy.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could the file write attempt be caught, and the subsequent error on a read-only disk, caught and swallowed-with-logging?
I think swallowing the write attempt except for a log message is a little dangerous. Read-only filesystems will be uncommon, but trying to overwrite cache files that you don't have permission to will be much more common. The user may not notice the log message and therefore not realize that an old PSL is being used.
For read-only disks, where there's some sort of stateless package being prepared, the package preparation could be responsible for pulling the latest version.
I think you're suggesting that the Lambda function copy the PSL to /tmp
so it can be overwritten. But then you would have every Lambda instance that is kicked off downloading its own PSL, unless you repackaged the Lambda code with the latest PSL within 24 hours of scanning.
The simplest thing would be to remove the "is the PSL older than 24 hours" check, but I'm hesitant to do that because the caching was added by one of the non-NCATS folks who uses trustymail
(@seanthegeek). That implies it's useful to someone who is running trustymail
outside of domain-scan
.
I think a better solution might be to have the Lambda code pull the PSL from an S3 bucket and copy it to /tmp
. If the PSL is older than 24 hours then the file inside the S3 bucket should be updated by the Lambda function, and then subsequent Lambda calls will share the new file. This would also obviate the need to rebuild the Lambda function just to include an updated PSL, which we currently suffer from. I'm not sure what will happen when you kick off 800 Lambda functions and they all try to update the PSL at once, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you're suggesting that the Lambda function copy the PSL to /tmp so it can be overwritten. But then you would have every Lambda instance that is kicked off downloading its own PSL, unless you repackaged the Lambda code with the latest PSL within 24 hours of scanning.
I wasn't, I was actually suggesting that the Lambda function itself be rebuilt to include an updated PSL.
But I can see why that would affect non-USG users (such as @seanthegeek) more than USG users, since we would expect very little churn in the PSL that would be relevant to USG domains.
With pshtt
, I worked hard to ensure that we never had to download third party data, either from the internet or from S3. And what do you know, it looks like I also use some lambda
/local
special-case logic to do that in pshtt
. So I should probably drop my complaint about having any Lambda-specific logic in the trustymail
scanner.
I still think, just as advice to trustymail
, that having trustymail
manage a flag discussing the state of the read only file system is exposing the wrong layer of abstraction. I recommend having a flag which disables the 24-hour cache feature, rather than a flag which relates to a low-level implementation detail that conflicts with that feature.
Perhaps it's also worth domain-scan having a generalized solution to packaging Lambda functions with third party data sources. For example, a scanner could specify the source of third party data that is needed for that scanner to run, and the Lambda deploy process could automatically fetch and re-package them during packaging and deployment.
I think there is a fundamental tension between "don't have every Lambda instance make a network request to get this data" and "don't ever have to repackage Lambda functions to stay fresh with this data". I'm comfortable pushing some burden on the repackaging process (especially given how easy you've made it with Docker), and suggesting that staying fresh with the PSL and other sources means setting up (perhaps automated) repackaging of functions on a regular basis. Having repackaging be "aware" of third party dependencies per-scanner could make this easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can convert the read-only flag into a "turn off the 24-hour cache check" flag. That's really the same thing as making the cache read-only, since if the cache already exists then the only reason it would be updated would be if it were more than a day old. You already approved the domain-scan
pull request that corresponds to this one, though, so to keep them consistent I will go ahead and merge this one too.
If you're using Lambda then you obviously have access to the rest of AWS. I really like the idea of using an S3 bucket to provide a shared, always-updated third-party data cache to all domain-scan
Lambda functions. This obviates the need to rebuild the Lambda function just to update third-party data files or download them in every single Lambda execution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, trustymail
isn't really managing a flag that corresponds to the state of the file system. It's managing a flag that corresponds to whether the user wants to overwrite the cache with new data after 24 hours or treat the cache as read-only. I guess that's a matter of semantics.
Adds a comment to direct additional requirements into setup.py
This is necessary when running in AWS Lambda, for instance, where the local filesystem is read-only.