
Conversation

mcdonnnj
Member

🗣 Description

This PR is the initial implementation of this Python package's functionality: retrieving an HTTP URL and hashing the content found, with additional processing for certain types of content.

💭 Motivation and context

This is a core piece of the BOD 20-01 compliance checking we are tasked with providing.

🧪 Testing

All of our pre-commit hooks pass. I have implemented test coverage for almost the entire code base, and those tests also pass.

✅ Checklist

  • This PR has an informative and human-readable title.
  • Changes are limited to a single goal - eschew scope creep!
  • All relevant type-of-change labels have been added.
  • I have read the CONTRIBUTING document.
  • These code changes follow cisagov code standards.
  • All relevant repo and/or project documentation has been updated
    to reflect the changes in this PR.
  • Tests have been added and/or modified to cover the changes in this PR.
  • All new and existing tests pass.

Swap out instances of skeleton-python-library and do some initial reference updates.

Get rid of or relocate the example library's files and put in place stub files to implement this package's functionality.

Add utility methods to get a desired hashing object and to update and get the digest from a hashing object.

The UrlHasher class has a primary method, UrlHasher.hash_url(), supported by helper methods. The main content processing is done by _handle* methods, each designed to handle a specific content type as reported by the remote server. These methods always return a HandlerResult to standardize their output, and UrlHasher.hash_url() uses the UrlResult named tuple to provide convenient access while keeping the results immutable. Also fix a missing import and update __init__.py to expose the UrlHasher class.

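A rough usage sketch of the interface described above (the import path assumes the module name mirrors the repository, and the UrlResult field names are illustrative rather than the package's confirmed API):

```python
from hash_http_content import UrlHasher

hasher = UrlHasher()  # default construction is assumed here

# hash_url() fetches the URL, dispatches to the appropriate _handle* method
# based on the content type reported by the server, and returns an immutable
# UrlResult named tuple.
result = hasher.hash_url("https://example.gov")

# Field names below are illustrative.
print(result.hash, result.contents)
```
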
Update the README to include information about this package's functionality.

Add logic to the GitHub Actions workflow to pull down a zipped binary from the https://github.com/adieuadieu/serverless-chrome project and extract it to a specified location. The test_hasher.py file has been updated to add testing that uses this binary instead of whatever pyppeteer detects and uses automatically. This testing only takes place if the binary is found in the specified location.

Declare the dictionary of handlers up front instead of adding them individually. Also correct the name of a test for the same file.

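A minimal sketch of that up-front declaration (handler names follow the _handle* convention mentioned earlier; the exact content types registered and the HandlerResult fields shown are assumptions):

```python
from typing import Callable, Dict, NamedTuple


class HandlerResult(NamedTuple):
    """Stand-in for the package's HandlerResult; these fields are assumptions."""

    hash: str
    contents: bytes


class UrlHasher:
    """Sketch only: declares the content-type handler mapping up front."""

    def __init__(self) -> None:
        # Keys are content types as reported by the remote server; the package
        # may register more (or different) types than shown here.
        self._handlers: Dict[str, Callable[[bytes], HandlerResult]] = {
            "text/html": self._handle_html,
            "text/plain": self._handle_plaintext,
        }

    def _handle_html(self, contents: bytes) -> HandlerResult: ...

    def _handle_plaintext(self, contents: bytes) -> HandlerResult: ...
```
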
If the contents appear to be text, use the plaintext handler as a fallback instead. Also adjust the output of the interface slightly.

Switch the build workflow to use this script instead to retrieve it in CI.

@mcdonnnj mcdonnnj added the improvement label (This issue or pull request will add new or improve existing functionality) Feb 10, 2021
@mcdonnnj mcdonnnj self-assigned this Feb 10, 2021
Member

@jsf9k jsf9k left a comment

Powerful work!

download_url="https://github.com/cisagov/hash-http-content",
# Author details
author="Cyber and Infrastructure Security Agency",
author_email="ncats@hq.dhs.gov",
Member

This email address needs to be updated upstream.

Member Author

cisagov/skeleton-python-library#57 Did we reach a consensus on which of each to use? I'm happy to make an upstream PR to resolve this.

long_description=readme(),
long_description_content_type="text/markdown",
# NCATS "homepage"
url="https://www.us-cert.gov/resources/ncats",
Member

This URL needs to be updated upstream.

Comment on lines 25 to 36
try:
hasher = getattr(hashlib, hash_algorithm)(usedforsecurity=False)
except AttributeError:
# There is no named constructor for the desired hashing algorithm
try:
# Work around typeshed's incorrect type hints
hasher = getattr(hashlib, "new")(hash_algorithm, usedforsecurity=False)
except TypeError:
hasher = hashlib.new(hash_algorithm)
except TypeError:
hasher = getattr(hashlib, hash_algorithm)()
return hasher
Member

It's a complicated case...lotta ins, out, what-have-yous...
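
For context, a self-contained sketch of what this constructor-selection logic accomplishes (the helper name get_hasher is hypothetical, and the typeshed workaround from the snippet above is noted rather than reproduced):

```python
import hashlib


def get_hasher(hash_algorithm: str):
    """Return a hashing object, tolerating older Pythons and unnamed algorithms."""
    try:
        # Python 3.9+ lets every constructor take usedforsecurity, which matters
        # on FIPS-enabled builds when hashing for non-security purposes.
        return getattr(hashlib, hash_algorithm)(usedforsecurity=False)
    except AttributeError:
        # No named constructor for this algorithm; construct it by name instead.
        # (The package calls new() via getattr() to work around a typeshed issue.)
        try:
            return hashlib.new(hash_algorithm, usedforsecurity=False)
        except TypeError:
            # Interpreters older than 3.9 reject the keyword argument.
            return hashlib.new(hash_algorithm)
    except TypeError:
        # Interpreters older than 3.9 reject the keyword on named constructors too.
        return getattr(hashlib, hash_algorithm)()


hasher = get_hasher("sha256")
hasher.update(b"example")
print(hasher.hexdigest())
```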

return hasher.hexdigest()


class HandlerResult(NamedTuple):
Member

Digging the named tuples!

Member

@dav3r dav3r left a comment

I left a few comments and questions, but overall, really great work @mcdonnnj! 🏎️

Update comments with suggestions from code review.

Co-authored-by: dav3r <david.redmin@trio.dhs.gov>

Add additional context and explanation about a workaround being used to pass the mypy pre-commit hook.
@jsf9k
Member

jsf9k commented Feb 11, 2021

@cisagov/team-ois, we should give more specific comments when merging reviewer's suggestion blocks. In this case, for example, "Update comments to explicitly mention AWS" would be a slightly better comment. It gives you a little better idea of what is actually in the commit.

This PR doesn't contain a particularly egregious example, but I have seen suggestion commits in other PRs where the commit comment is "Modify blah.md", for example. In such a case there is no way to know what the commit contains, short of examining the commit itself. In many cases where a reviewer leaves behind multiple suggestion blocks it makes sense to group the suggestions into multiple commits that can each be given a specific commit message.

I'm not saying I always do it right either, but I've noticed that we sometimes merge others' suggestions without giving the commit comment the attention it deserves. The comment should be just as informative as if you had created the commits yourself and were committing them from your editor or the command line. Remember that someone may be trying to figure out why things were done a certain way by studying the output of git log or git whatchanged, without the context of the PR.

@mcdonnnj
Member Author

> @cisagov/team-ois, we should give more specific comments when merging reviewer's suggestion blocks. […]

That's a really good point. Building on that, I think it's also important to batch only similar suggestions instead of lumping everything together. In this PR it would have been helpful to make three commits from the suggestions instead of batching them all together. Then I could have given each a more specific commit message.

The 302 status code was missing from the list of redirect_status_codes, so it has been added. I have also added an explanation of why these four values are chosen instead of just matching any 3xx status code.

mcdonnnj and others added 2 commits February 12, 2021 11:52
Fix a typo I made in the comment explaining the redirect statuses being used.

Co-authored-by: dav3r <david.redmin@trio.dhs.gov>

Add support for the new member to the CLI and its output. Update testing to reflect this addition.

This timeout is used for requests.get() calls to provide a configurable limit on how long to wait for a request to finish. The default value is five seconds.

When testing against different sites, I found that more complex sites did not always have enough time for everything to process and render. Unfortunately the setContent() coroutine does not allow you to pass additional options, so I have switched to using the goto() coroutine instead, as it allows me to use the waitUntil option to give the page time to process and fully load.

Previously a new page was created, used, and then closed for every UrlHasher._handle_html() call. However, since we already reuse the browser object for a class instance, there is no sense in not reusing a page in the browser the same way.

Adjust the browser's timeout while waiting for the events given in waitUntil from the default of 30 seconds down to five seconds. Wrap everything in a try/except block and pull whatever content has rendered at the end of the timeout as a fallback. If a site has long polling or similar background activity, the networkidle2 event may never occur; handling the timeout lets us retrieve a site that has most likely already loaded.

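A rough sketch of the rendering flow these commits describe, assuming pyppeteer and a reused page object (the function name render_page is illustrative):

```python
from pyppeteer.errors import TimeoutError as BrowserTimeoutError


async def render_page(page, url: str, timeout_ms: int = 5000) -> str:
    """Return the rendered HTML, falling back to whatever has loaded on timeout."""
    try:
        # Unlike setContent(), goto() accepts waitUntil; networkidle2 fires once
        # no more than two network connections have been open for 500 ms.
        await page.goto(url, waitUntil="networkidle2", timeout=timeout_ms)
    except BrowserTimeoutError:
        # Long-polling or similar background activity can keep networkidle2 from
        # ever firing; grab whatever content has rendered by now.
        pass
    return await page.content()


# Driven synchronously elsewhere, e.g.:
# html = asyncio.get_event_loop().run_until_complete(render_page(page, url))
```
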
Add retries to the requests.get() call for the provided URL in UrlHasher.hash_url(). This allows a configurable number of retries in case of network difficulties or the like. After self._retries additional attempts, the exception that was caught is raised.

Change from a hard coded value to reusing the self._timeout value.

Added a verify kwarg to UrlHasher.hash_url() to allow control over TLS verification in the requests library. It is passed directly as the kwarg of the same name to the requests.get() method.

@lgtm-com

lgtm-com bot commented Feb 16, 2021

This pull request introduces 1 alert when merging 50a6197 into 54423cf - view on LGTM.com

new alerts:

  • 1 for Module is imported with 'import' and 'import from'

When performing requests.get() we were blanket-catching any Exception. Retries should only happen for connection-related issues, so this narrows the caught exceptions down to ConnectionError and Timeout (which matches both the ConnectTimeout and ReadTimeout errors).

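Putting the timeout, retry, and verify pieces together, a sketch of the narrowed retry logic (the standalone fetch() function and its defaults are illustrative; in the package this behavior lives on UrlHasher and uses self._retries, self._timeout, and the verify kwarg):

```python
import requests
from requests import exceptions


def fetch(url: str, retries: int = 3, timeout: float = 5.0, verify: bool = True) -> requests.Response:
    """GET a URL, retrying only on connection-related failures."""
    for attempt in range(retries + 1):
        try:
            # timeout bounds how long to wait for the request; verify is passed
            # straight through to control TLS certificate verification.
            return requests.get(url, timeout=timeout, verify=verify)
        except (exceptions.ConnectionError, exceptions.Timeout):
            # Timeout covers both ConnectTimeout and ReadTimeout. Once the extra
            # attempts are exhausted, re-raise the caught exception.
            if attempt == retries:
                raise


# Example: response = fetch("https://example.gov", retries=2, verify=False)
```
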
Since the result of a UrlHasher.hash_url() call and the UrlHasher class itself are the primary interfaces to this package, they should both be publicly available.

@mcdonnnj mcdonnnj requested review from dav3r and jsf9k February 16, 2021 15:00
Member

@jsf9k jsf9k left a comment

Now I like it even better!

page: Page = asyncio.get_event_loop().run_until_complete(
self._browser.newPage()
# Try to guarantee our preferred encoding
page_contents = bytes(page_contents.encode(self._default_encoding)).decode(
Member

Should this line have a type hint?

Member Author

Certainly wouldn't hurt. I get bad about it sometimes when a variable is only assigned to once from a "guaranteed output". I'll go back and add some more type hinting to keep it all fresh and statically happy (especially since I added new instance variables with the expectation that they can be changed).

Member Author

@jsf9k That is actually hinted here (just above the line you were looking at):

page_contents: str = asyncio.get_event_loop().run_until_complete(

That said, I took care of some low-hanging/should-be-there type hints in 2c51f2b.

During review, type hints were mentioned, which made me take a second look to see if I had missed any that could easily be added. This was especially important for some of the instance variables that were added, as they are expected to be modified if desired.
Member

@dav3r dav3r left a comment

Nice improvements since the last time I looked! 👍

@mcdonnnj mcdonnnj merged commit c4b07e8 into develop Feb 17, 2021
@mcdonnnj mcdonnnj deleted the first-commits branch February 17, 2021 19:44