Initial Functionality Implementation #2
Conversation
Swap out instances of skeleton-python-library and make some initial reference updates.
Get rid of or move the example library's files and put in place the files that will implement this package's functionality.
Add utility methods to get a desired hashing object and to update and get the digest from a hashing object.
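As a rough sketch of the second of these utilities (the name get_hash_digest is hypothetical, not necessarily what the package uses):

def get_hash_digest(hasher, contents: bytes) -> str:
    # Hypothetical helper: update the given hashing object with the contents
    # and return the resulting hexadecimal digest.
    hasher.update(contents)
    return hasher.hexdigest()

import hashlib
print(get_hash_digest(hashlib.sha256(), b"hello world"))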
The class's primary method is UrlHasher.hash_url(), supported by helper methods. The main content processing is done by _handle* methods, each designed to handle a specific content type as given by the remote server. These methods always return a HandlerResult to standardize their output, and UrlHasher.hash_url() uses the UrlResult named tuple to provide convenient access while keeping the results immutable.
Also fix a missing import and update __init__.py to expose the UrlHasher class.
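A minimal sketch of the dispatch pattern described above; the field names and the example handler body are assumptions, not the package's actual definitions:

import hashlib
from typing import Callable, Dict, NamedTuple

class HandlerResult(NamedTuple):
    # Standardized return value for every _handle* method (fields assumed).
    hash: str
    contents: bytes

class UrlResult(NamedTuple):
    # Immutable, convenient access to a hash_url() result (fields assumed).
    status: int
    content_type: str
    hash: str

def _handle_plaintext(contents: bytes) -> HandlerResult:
    # Hash the raw bytes without any additional processing.
    return HandlerResult(hashlib.sha256(contents).hexdigest(), contents)

# Handlers are keyed by the content type reported by the remote server.
handlers: Dict[str, Callable[[bytes], HandlerResult]] = {
    "text/plain": _handle_plaintext,
}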
Update the README to include information about this package's functionality.
Add logic to the GitHub Actions workflow to pull down a zipped binary from the https://github.com/adieuadieu/serverless-chrome project and extract it to a specified location. The test_hasher.py file has been updated to test against this binary instead of the browser that pyppeteer detects and uses automatically; this only happens if the binary is found at the specified location.
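A sketch of what that conditional test setup might look like; the binary path, import path, and constructor arguments here are all assumptions, not the repository's actual values:

import os

import pytest

from hash_http_content import UrlHasher

# Hypothetical location where the CI workflow extracts the serverless-chrome binary.
CHROME_BINARY = os.path.join("tests", "files", "headless-chromium")

@pytest.mark.skipif(
    not os.path.exists(CHROME_BINARY),
    reason="serverless-chrome binary not found; pyppeteer's browser will be used",
)
def test_hash_url_with_local_chromium():
    # Point pyppeteer at the downloaded binary instead of its auto-detected one.
    hasher = UrlHasher("sha256", browser_options={"executablePath": CHROME_BINARY})
    result = hasher.hash_url("https://example.com")
    assert result.hash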
Declare the dictionary of handlers upfront instead of adding them individually.
Also correct the name of a test for the same file.
If the contents appear to be text, use the plaintext handler as a fallback instead.
Also adjust the output of the interface slightly.
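A hedged sketch of this fallback, building on a handlers dictionary like the one sketched earlier; the "appears to be text" check is approximated here by the content-type prefix, though the package may inspect the bytes themselves:

from typing import Callable, Dict, Optional

def get_handler(
    handlers: Dict[str, Callable], content_type: str
) -> Optional[Callable]:
    # Prefer an exact match on the content type reported by the server.
    handler = handlers.get(content_type)
    if handler is None and content_type.startswith("text/"):
        # Assumed fallback: contents that appear to be text are run through
        # the plaintext handler.
        handler = handlers.get("text/plain")
    return handler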
Switch the build workflow to use this script to retrieve the binary in CI.
Powerful work!
download_url="https://github.com/cisagov/hash-http-content",
# Author details
author="Cyber and Infrastructure Security Agency",
author_email="ncats@hq.dhs.gov",
This email address needs to be updated upstream.
cisagov/skeleton-python-library#57 Did we reach a consensus on which of each to use? I'm happy to make an upstream PR to resolve this.
Please see my comment at cisagov/skeleton-python-library#57 (comment).
long_description=readme(),
long_description_content_type="text/markdown",
# NCATS "homepage"
url="https://www.us-cert.gov/resources/ncats",
This URL needs to be updated upstream.
try:
    hasher = getattr(hashlib, hash_algorithm)(usedforsecurity=False)
except AttributeError:
    # There is no named constructor for the desired hashing algorithm
    try:
        # Work around typeshed's incorrect type hints
        hasher = getattr(hashlib, "new")(hash_algorithm, usedforsecurity=False)
    except TypeError:
        hasher = hashlib.new(hash_algorithm)
except TypeError:
    hasher = getattr(hashlib, hash_algorithm)()
return hasher
It's a complicated case...lotta ins, out, what-have-yous...
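A usage sketch, assuming the snippet above forms the body of a get_hasher() utility (the name is an assumption):

hasher = get_hasher("sha256")
hasher.update(b"hello world")
print(hasher.hexdigest())
# b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9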
return hasher.hexdigest()

class HandlerResult(NamedTuple):
Digging the named tuples!
I left a few comments and questions, but overall, really great work @mcdonnnj! 🏎️
Update comments with suggestions from code review. Co-authored-by: dav3r <david.redmin@trio.dhs.gov>
Add additional context and explanation about a workaround being used to pass the mypy pre-commit hook.
@cisagov/team-ois, we should give more specific commit comments when merging reviewers' suggestions. This PR doesn't contain a particularly egregious example, but I have seen worse. I'm not saying I always do it right either, but I've noticed that we sometimes merge others' suggestions without giving the commit comment the attention it deserves. The comment should be just as informative as if you had created the commits yourself and were committing them from your editor or the command line. Remember that someone may be trying to figure out why things are done a certain way by studying the commit history.
That's a really good point. Building on that, I think it's also important to batch only similar suggestions instead of lumping everything together. In this PR it would have been helpful to make three commits from the suggestions instead of batching them all together; then I could have written more specific commit messages for each.
The 302 status code was missing from the list of redirect_status_codes, so it has been added. I have also added an explanation of why these four values were chosen instead of just matching any 3xx status code.
Fix a typo I made in the comment explaining the redirect statuses being used. Co-authored-by: dav3r <david.redmin@trio.dhs.gov>
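For illustration only, the list might look like the sketch below; the four specific values are my assumption, and the real ones (with their explanation) are in the source:

# Only these codes are treated as redirects, rather than any 3xx status,
# because codes like 300 (Multiple Choices) and 304 (Not Modified) do not
# point to a single alternate location to follow. (Values assumed.)
redirect_status_codes = [301, 302, 307, 308]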
Add support for the new member to the CLI and its output. Update testing to reflect this addition.
Force-pushed from da0ea7b to 9f0088c.
This timeout is used for requests.get() calls to provide a configurable limit on how long to wait for a request to finish. The default value is five seconds.
When testing against different sites, I found that more complex sites did not always have enough time for everything to process and render. Unfortunately, the setContent() coroutine does not allow you to pass additional options, so I have switched to using the goto() coroutine instead, as it allows me to use the waitUntil option to give the page time to process and fully load.
Previously a new page was created, used, and then closed for every UrlHasher._handle_html() call. However, since we already reuse the browser object for a class instance, it makes sense to reuse a page in the browser in the same way.
Adjust the browser's timeout while waiting for the events given in waitUntil from the default of 30 seconds to five seconds. Wrap everything in a try/except block and pull whatever content has rendered at the end of the timeout as a fallback. If a site has long-polling or similar side activity, the networkidle2 event may never occur; handling the timeout allows us to retrieve a site that has most likely already loaded.
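A hedged sketch of the rendering step after these changes, assuming pyppeteer and illustrative names:

from pyppeteer.errors import TimeoutError as PyppeteerTimeoutError

async def render(page, url: str, timeout_ms: int = 5000) -> str:
    try:
        # Wait up to timeout_ms for the page to load and for network
        # activity to quiet down (the networkidle2 event).
        await page.goto(url, waitUntil="networkidle2", timeout=timeout_ms)
    except PyppeteerTimeoutError:
        # Long-polling or similar side activity can keep networkidle2 from
        # ever firing; fall back to whatever content has rendered so far.
        pass
    return await page.content()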
Add retries to the requests.get() call for the provided URL in UrlHasher.hash_url(). This allows a configurable number of retries in case of network difficulties or similar problems. After self._retries additional attempts, the exception that was caught is raised.
Change from a hard-coded value to reusing the self._timeout value.
Add a verify kwarg to UrlHasher.hash_url() to allow control over TLS verification in the requests library. The kwarg is passed directly as the kwarg of the same name to requests.get().
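Putting the timeout, retry, and verify pieces together, the fetch might look roughly like the following sketch (names assumed; at this point in the PR's history the retry loop still caught any Exception, which a later commit narrows):

import requests

def fetch(url: str, timeout: float = 5.0, retries: int = 3, verify: bool = True):
    attempts = 0
    while True:
        try:
            # timeout bounds how long to wait for the request, in seconds;
            # verify is passed straight through to control TLS verification.
            return requests.get(url, timeout=timeout, verify=verify)
        except Exception:
            attempts += 1
            if attempts > retries:
                # Out of additional attempts; raise the caught exception.
                raise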
This pull request introduces 1 alert when merging 50a6197 into 54423cf - view the new alerts on LGTM.com.
Force-pushed from 50a6197 to b4c49d2.
When performing requests.get() we were catching any Exception. Retries should only happen for connection-related issues, so this narrows the caught exceptions down to ConnectionError and Timeout (which matches both the ConnectTimeout and ReadTimeout errors).
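After this change the except clause in a sketch like the one above narrows to connection-related errors only:

import requests
from requests.exceptions import ConnectionError, Timeout

def fetch_with_retries(url: str, timeout: float, retries: int, verify: bool = True):
    attempts = 0
    while True:
        try:
            return requests.get(url, timeout=timeout, verify=verify)
        except (ConnectionError, Timeout):
            # Timeout is the parent of both ConnectTimeout and ReadTimeout,
            # so both are retried; any other exception propagates immediately.
            attempts += 1
            if attempts > retries:
                raise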
Since the result of a UrlHasher.hash_url() call and the UrlHasher class itself are the primary interfaces to this package, they should both be publicly available.
Now I like it even better!
page: Page = asyncio.get_event_loop().run_until_complete(
    self._browser.newPage()
)

# Try to guarantee our preferred encoding
page_contents = bytes(page_contents.encode(self._default_encoding)).decode(
    self._default_encoding
)
Should this line have a type hint?
Certainly wouldn't hurt. I get bad about it sometimes when a variable is only assigned to once from a "guaranteed output". I'll go back and add some more type hinting to keep it all fresh and statically happy (especially since I added new instance variables with the expectation that they can be changed).
During review type hints were mentioned, and it made me take a second look to see if I had missed any that could be easily added. This was especially important for some of the instance variables that were added, as they are expected to be modified if desired.
Nice improvements since the last time I looked! 👍
🗣 Description
This PR is the initial implementation of the functionality of this Python package. That functionality is to retrieve an HTTP URL and hash the contents found, with additional processing for certain types of content.
💭 Motivation and context
This is a core piece of the BOD 20-01 compliance checking we are tasked with providing.
🧪 Testing
All of our pre-commit hooks pass. I have implemented code coverage for almost the entire code base, and those tests are also passing.
✅ Checklist
Documentation updated to reflect the changes in this PR.