
Conversation

mcdonnnj
Member

🗣 Description

This PR is the initial implementation of this Python package's functionality: retrieving an HTTP URL and hashing the content found, with additional processing for certain types of content.

💭 Motivation and context

This is a core piece of the BOD 20-01 compliance checking we are tasked with providing.

🧪 Testing

All of our pre-commit hooks pass. I have implemented test coverage for almost the entire code base, and those tests also pass.

✅ Checklist

  • This PR has an informative and human-readable title.
  • Changes are limited to a single goal - eschew scope creep!
  • All relevant type-of-change labels have been added.
  • I have read the CONTRIBUTING document.
  • These code changes follow cisagov code standards.
  • All relevant repo and/or project documentation has been updated
    to reflect the changes in this PR.
  • Tests have been added and/or modified to cover the changes in this PR.
  • All new and existing tests pass.

Swap out instances of skeleton-python-library and do some initial reference updates.

Get rid of or relocate the example library's files and put in place stub files to implement this package's functionality.

Add utility methods to get a desired hashing object and to update and get the digest from a hashing object.

The UrlHasher class has a primary method, UrlHasher.hash_url(), supported by helper methods. The main content processing is done by _handle* methods, each designed to handle a specific content type as reported by the remote server. These methods always return a HandlerResult to standardize their output, and UrlHasher.hash_url() uses the UrlResult named tuple to provide convenient access while keeping the results immutable. Also fix a missing import and update __init__.py to expose the UrlHasher class.

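A rough usage sketch of the interface described above (the import path assumes the module name mirrors the repository, and the UrlResult field names are illustrative rather than the package's confirmed API):

```python
from hash_http_content import UrlHasher

hasher = UrlHasher()  # default construction is assumed here

# hash_url() fetches the URL, dispatches to the appropriate _handle* method
# based on the content type reported by the server, and returns an immutable
# UrlResult named tuple.
result = hasher.hash_url("https://example.gov")

# Field names below are illustrative.
print(result.hash, result.contents)
```
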
Update the README to include information about this package's functionality.

Add logic to the GitHub Actions workflow to pull down a zipped binary from the https://github.com/adieuadieu/serverless-chrome project and extract it to a specified location. The test_hasher.py file has been updated to add testing that uses this binary instead of whatever pyppeteer detects and uses automatically. This testing only takes place if the binary is found in the specified location.

Declare the dictionary of handlers up front instead of adding them individually. Also correct the name of a test for the same file.

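A minimal sketch of that up-front declaration (handler names follow the _handle* convention mentioned earlier; the exact content types registered and the HandlerResult fields shown are assumptions):

```python
from typing import Callable, Dict, NamedTuple


class HandlerResult(NamedTuple):
    """Stand-in for the package's HandlerResult; these fields are assumptions."""

    hash: str
    contents: bytes


class UrlHasher:
    """Sketch only: declares the content-type handler mapping up front."""

    def __init__(self) -> None:
        # Keys are content types as reported by the remote server; the package
        # may register more (or different) types than shown here.
        self._handlers: Dict[str, Callable[[bytes], HandlerResult]] = {
            "text/html": self._handle_html,
            "text/plain": self._handle_plaintext,
        }

    def _handle_html(self, contents: bytes) -> HandlerResult: ...

    def _handle_plaintext(self, contents: bytes) -> HandlerResult: ...
```
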
If the contents appear to be text, use the plaintext handler as a fallback instead. Also adjust the output of the interface slightly.

Switch the build workflow to use this script instead to retrieve it in CI.

@mcdonnnj mcdonnnj added the improvement label (This issue or pull request will add new or improve existing functionality) Feb 10, 2021
@mcdonnnj mcdonnnj self-assigned this Feb 10, 2021
Member

@jsf9k jsf9k left a comment

Powerful work!

download_url="https://github.com/cisagov/hash-http-content",
# Author details
author="Cyber and Infrastructure Security Agency",
author_email="ncats@hq.dhs.gov",
Member

This email address needs to be updated upstream.

Member Author

cisagov/skeleton-python-library#57 Did we reach a consensus on which of each to use? I'm happy to make an upstream PR to resolve this.

long_description=readme(),
long_description_content_type="text/markdown",
# NCATS "homepage"
url="https://www.us-cert.gov/resources/ncats",
Member

This URL needs to be updated upstream.

Comment on lines 25 to 36
try:
hasher = getattr(hashlib, hash_algorithm)(usedforsecurity=False)
except AttributeError:
# There is no named constructor for the desired hashing algorithm
try:
# Work around typeshed's incorrect type hints
hasher = getattr(hashlib, "new")(hash_algorithm, usedforsecurity=False)
except TypeError:
hasher = hashlib.new(hash_algorithm)
except TypeError:
hasher = getattr(hashlib, hash_algorithm)()
return hasher
Member

It's a complicated case...lotta ins, out, what-have-yous...
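
For context, a self-contained sketch of what this constructor-selection logic accomplishes (the helper name get_hasher is hypothetical, and the typeshed workaround from the snippet above is noted rather than reproduced):

```python
import hashlib


def get_hasher(hash_algorithm: str):
    """Return a hashing object, tolerating older Pythons and unnamed algorithms."""
    try:
        # Python 3.9+ lets every constructor take usedforsecurity, which matters
        # on FIPS-enabled builds when hashing for non-security purposes.
        return getattr(hashlib, hash_algorithm)(usedforsecurity=False)
    except AttributeError:
        # No named constructor for this algorithm; construct it by name instead.
        # (The package calls new() via getattr() to work around a typeshed issue.)
        try:
            return hashlib.new(hash_algorithm, usedforsecurity=False)
        except TypeError:
            # Interpreters older than 3.9 reject the keyword argument.
            return hashlib.new(hash_algorithm)
    except TypeError:
        # Interpreters older than 3.9 reject the keyword on named constructors too.
        return getattr(hashlib, hash_algorithm)()


hasher = get_hasher("sha256")
hasher.update(b"example")
print(hasher.hexdigest())
```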

return hasher.hexdigest()


class HandlerResult(NamedTuple):
Member

Digging the named tuples!

Member

@dav3r dav3r left a comment

I left a few comments and questions, but overall, really great work @mcdonnnj! 🏎️

Update comments with suggestions from code review.

Co-authored-by: dav3r <david.redmin@trio.dhs.gov>

Add additional context and explanation about a workaround being used to pass the mypy pre-commit hook.
@jsf9k
Member

jsf9k commented Feb 11, 2021

@cisagov/team-ois, we should give more specific comments when merging reviewer's suggestion blocks. In this case, for example, "Update comments to explicitly mention AWS" would be a slightly better comment. It gives you a little better idea of what is actually in the commit.

This PR doesn't contain a particularly egregious example, but I have seen suggestion commits in other PRs where the commit comment is "Modify blah.md", for example. In such a case there is no way to know what the commit contains, short of examining the commit itself. In many cases where a reviewer leaves behind multiple suggestion blocks it makes sense to group the suggestions into multiple commits that can each be given a specific commit message.

I'm not saying I always do it right either, but I've noticed that we sometimes merge others' suggestions without giving the commit comment the attention it deserves. The comment should be just as informative as if you had created the commits yourself and were committing them from your editor or the command line. Remember that someone may be trying to figure out why things were done a certain way by studying the output of git log or git whatchanged, without the context of the PR.

@mcdonnnj
Member Author

> @cisagov/team-ois, we should give more specific comments when merging reviewer's suggestion blocks. […]

That's a really good point. Building on that, I think it's also important to batch only similar suggestions instead of lumping everything together. In this PR it would have been helpful to make three commits from the suggestions instead of batching them all together. Then I could have given each a more specific commit message.

The 302 status code was missing from the list of redirect_status_codes, so it has been added. I have also added an explanation of why these four values are chosen instead of just matching any 3xx status code.

mcdonnnj and others added 2 commits February 12, 2021 11:52
Fix a typo I made in the comment explaining the redirect statuses being used.

Co-authored-by: dav3r <david.redmin@trio.dhs.gov>

Add support for the new member to the CLI and its output. Update testing to reflect this addition.

This timeout is used for requests.get() calls to provide a configurable limit on how long to wait for a request to finish. The default value is five seconds.

When testing against different sites, I found that more complex sites did not always have enough time for everything to process and render. Unfortunately the setContent() coroutine does not allow you to pass additional options, so I have switched to using the goto() coroutine instead, as it allows me to use the waitUntil option to give the page time to process and fully load.

Previously a new page was created, used, and then closed for every UrlHasher._handle_html() call. However, since we already reuse the browser object for a class instance, there is no sense in not reusing a page in the browser the same way.

Adjust the browser's timeout while waiting for the events given in waitUntil from the default of 30 seconds down to five seconds. Wrap everything in a try/except block and pull whatever content has rendered at the end of the timeout as a fallback. If a site has long polling or similar background activity, the networkidle2 event may never occur; handling the timeout lets us retrieve a site that has most likely already loaded.

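A rough sketch of the rendering flow these commits describe, assuming pyppeteer and a reused page object (the function name render_page is illustrative):

```python
from pyppeteer.errors import TimeoutError as BrowserTimeoutError


async def render_page(page, url: str, timeout_ms: int = 5000) -> str:
    """Return the rendered HTML, falling back to whatever has loaded on timeout."""
    try:
        # Unlike setContent(), goto() accepts waitUntil; networkidle2 fires once
        # no more than two network connections have been open for 500 ms.
        await page.goto(url, waitUntil="networkidle2", timeout=timeout_ms)
    except BrowserTimeoutError:
        # Long-polling or similar background activity can keep networkidle2 from
        # ever firing; grab whatever content has rendered by now.
        pass
    return await page.content()


# Driven synchronously elsewhere, e.g.:
# html = asyncio.get_event_loop().run_until_complete(render_page(page, url))
```
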
Add retries to the requests.get() call for the provided URL in UrlHasher.hash_url(). This allows a configurable number of retries in case of network difficulties or the like. After self._retries additional attempts, the exception that was caught is raised.

Change from a hard coded value to reusing the self._timeout value.

Added a verify kwarg to UrlHasher.hash_url() to allow control over TLS verification in the requests library. It is passed directly as the kwarg of the same name to the requests.get() method.

@lgtm-com

lgtm-com bot commented Feb 16, 2021

This pull request introduces 1 alert when merging 50a6197 into 54423cf - view on LGTM.com

new alerts:

  • 1 for Module is imported with 'import' and 'import from'

When performing requests.get() we were blanket-catching any Exception. Retries should only happen for connection-related issues, so this narrows the caught exceptions down to ConnectionError and Timeout (which matches both the ConnectTimeout and ReadTimeout errors).

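Putting the timeout, retry, and verify pieces together, a sketch of the narrowed retry logic (the standalone fetch() function and its defaults are illustrative; in the package this behavior lives on UrlHasher and uses self._retries, self._timeout, and the verify kwarg):

```python
import requests
from requests import exceptions


def fetch(url: str, retries: int = 3, timeout: float = 5.0, verify: bool = True) -> requests.Response:
    """GET a URL, retrying only on connection-related failures."""
    for attempt in range(retries + 1):
        try:
            # timeout bounds how long to wait for the request; verify is passed
            # straight through to control TLS certificate verification.
            return requests.get(url, timeout=timeout, verify=verify)
        except (exceptions.ConnectionError, exceptions.Timeout):
            # Timeout covers both ConnectTimeout and ReadTimeout. Once the extra
            # attempts are exhausted, re-raise the caught exception.
            if attempt == retries:
                raise


# Example: response = fetch("https://example.gov", retries=2, verify=False)
```
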
Since the result of a UrlHasher.hash_url() call and the UrlHasher class itself are the primary interfaces to this package, they should both be publicly available.

@mcdonnnj mcdonnnj requested review from dav3r and jsf9k February 16, 2021 15:00
Member

@jsf9k jsf9k left a comment

Now I like it even better!

page: Page = asyncio.get_event_loop().run_until_complete(
self._browser.newPage()
# Try to guarantee our preferred encoding
page_contents = bytes(page_contents.encode(self._default_encoding)).decode(
Member

Should this line have a type hint?

Member Author

Certainly wouldn't hurt. I get bad about it sometimes when a variable is only assigned to once from a "guaranteed output". I'll go back and add some more type hinting to keep it all fresh and statically happy (especially since I added new instance variables with the expectation that they can be changed).

Member Author

@jsf9k That is actually hinted here (just above the line you were looking at):

page_contents: str = asyncio.get_event_loop().run_until_complete(

That said, I took care of some low-hanging/should-be-there type hints in 2c51f2b.

During review, type hints were mentioned, which made me take a second look to see if I had missed any that could easily be added. This was especially important for some of the instance variables that were added, as they are expected to be modified if desired.
Member

@dav3r dav3r left a comment

Nice improvements since the last time I looked! 👍

@mcdonnnj mcdonnnj merged commit c4b07e8 into develop Feb 17, 2021
@mcdonnnj mcdonnnj deleted the first-commits branch February 17, 2021 19:44