Merged
29 commits
d21e003
Initial changes after cloning repository
mcdonnnj Feb 8, 2021
5e656c4
Remove example files and put in framework for this module
mcdonnnj Feb 8, 2021
c989d96
Add utility methods for hashing
mcdonnnj Feb 8, 2021
dde6e15
Add NamedTuples to store expected results
mcdonnnj Feb 8, 2021
9b00ea9
Implement a class to provide URL hashing functionality
mcdonnnj Feb 8, 2021
4a27278
Add initial testing for hash_http_content.hasher
mcdonnnj Feb 8, 2021
6848d0e
Update the README
mcdonnnj Feb 8, 2021
c71b71a
Add testing for a supplied Chromium binary
mcdonnnj Feb 8, 2021
4945a2b
Consolidate handler definitions
mcdonnnj Feb 8, 2021
8c893c1
Implement a command line interface to the package
mcdonnnj Feb 9, 2021
be15172
Add some debug logging to hash_http_content.hasher
mcdonnnj Feb 9, 2021
fed60c7
Adjust fallback handler usage
mcdonnnj Feb 9, 2021
999e586
Add testing for the command line interface
mcdonnnj Feb 9, 2021
dfb8f93
Add a script to retrieve serverless-chrome binaries
mcdonnnj Feb 9, 2021
00207ee
Update comments based on feedback
mcdonnnj Feb 10, 2021
5635e2a
Expand comment about mypy workaround
mcdonnnj Feb 10, 2021
a988651
Add missing redirect status code and explain choices
mcdonnnj Feb 11, 2021
fa02dac
Fix typo in comment
mcdonnnj Feb 12, 2021
9f0088c
Add content_type member to the UrlResult NamedTuple
mcdonnnj Feb 13, 2021
5d0486a
Add a timeout instance variable to UrlHasher
mcdonnnj Feb 14, 2021
813be31
Update UrlHasher._handle_html() to navigate to a local file
mcdonnnj Feb 14, 2021
0383c77
Switch to reusing a browser page
mcdonnnj Feb 15, 2021
11ac606
Handle browser timeout while waiting for content
mcdonnnj Feb 15, 2021
05ce989
Add retry mechanism to requests.get() call
mcdonnnj Feb 15, 2021
9ac5dc5
Make Page.goto() timeout configurable
mcdonnnj Feb 15, 2021
7fc99db
Add option to control TLS validation
mcdonnnj Feb 15, 2021
b4c49d2
Narrow scope of try block in UrlHasher.hash_url()
mcdonnnj Feb 16, 2021
3499d5c
Add hasher.UrlResult to the public objects
mcdonnnj Feb 16, 2021
2c51f2b
Add additional type hints to the UrlHasher class
mcdonnnj Feb 16, 2021
2 changes: 1 addition & 1 deletion .coveragerc
@@ -2,7 +2,7 @@
 # https://coverage.readthedocs.io/en/latest/config.html

 [run]
-source = src/example
+source = src/hash_http_content
 omit =
 branch = true

2 changes: 1 addition & 1 deletion .github/lineage.yml
@@ -3,4 +3,4 @@ version: "1"

 lineage:
   skeleton:
-    remote-url: https://github.com/cisagov/skeleton-generic.git
+    remote-url: https://github.com/cisagov/skeleton-python-library.git
2 changes: 2 additions & 0 deletions .github/workflows/build.yml
@@ -64,6 +64,8 @@ jobs:
 ${{ hashFiles('**/requirements.txt') }}"
 restore-keys: |
 ${{ env.BASE_CACHE_KEY }}
+- name: Download and extract a serverless-chrome binary
+run: ./get_serverless_chrome_binary.sh
 - name: Install dependencies
 run: |
 python -m pip install --upgrade pip
10 changes: 5 additions & 5 deletions CONTRIBUTING.md
@@ -15,7 +15,7 @@ all of which should be in this repository.

 If you want to report a bug or request a new feature, the most direct
 method is to [create an
-issue](https://github.com/cisagov/skeleton-python-library/issues) in
+issue](https://github.com/cisagov/hash-http-content/issues) in
 this repository. We recommend that you first search through existing
 issues (both open and closed) to check if your particular issue has
 already been reported. If it has then you might want to add a comment
@@ -25,7 +25,7 @@ one.
 ## Pull requests ##

 If you choose to [submit a pull
-request](https://github.com/cisagov/skeleton-python-library/pulls),
+request](https://github.com/cisagov/hash-http-content/pulls),
 you will notice that our continuous integration (CI) system runs a
 fairly extensive set of linters, syntax checkers, system, and unit tests.
 Your pull request may fail these checks, and that's OK. If you want
@@ -111,9 +111,9 @@ can create and configure the Python virtual environment with these
 commands:

 ```console
-cd skeleton-python-library
-pyenv virtualenv <python_version_to_use> skeleton-python-library
-pyenv local skeleton-python-library
+cd hash-http-content
+pyenv virtualenv <python_version_to_use> hash-http-content
+pyenv local hash-http-content
 pip install --requirement requirements-dev.txt
 ```

58 changes: 36 additions & 22 deletions README.md
@@ -1,25 +1,39 @@
-# skeleton-python-library #
-
-[![GitHub Build Status](https://github.com/cisagov/skeleton-python-library/workflows/build/badge.svg)](https://github.com/cisagov/skeleton-python-library/actions)
-[![Coverage Status](https://coveralls.io/repos/github/cisagov/skeleton-python-library/badge.svg?branch=develop)](https://coveralls.io/github/cisagov/skeleton-python-library?branch=develop)
-[![Total alerts](https://img.shields.io/lgtm/alerts/g/cisagov/skeleton-python-library.svg?logo=lgtm&logoWidth=18)](https://lgtm.com/projects/g/cisagov/skeleton-python-library/alerts/)
-[![Language grade: Python](https://img.shields.io/lgtm/grade/python/g/cisagov/skeleton-python-library.svg?logo=lgtm&logoWidth=18)](https://lgtm.com/projects/g/cisagov/skeleton-python-library/context:python)
-[![Known Vulnerabilities](https://snyk.io/test/github/cisagov/skeleton-python-library/develop/badge.svg)](https://snyk.io/test/github/cisagov/skeleton-python-library)
-
-This is a generic skeleton project that can be used to quickly get a
-new [cisagov](https://github.com/cisagov) Python library GitHub
-project started. This skeleton project contains [licensing
-information](LICENSE), as well as
-[pre-commit hooks](https://pre-commit.com) and
-[GitHub Actions](https://github.com/features/actions) configurations
-appropriate for a Python library project.
-
-## New Repositories from a Skeleton ##
-
-Please see our [Project Setup guide](https://github.com/cisagov/development-guide/tree/develop/project_setup)
-for step-by-step instructions on how to start a new repository from
-a skeleton. This will save you time and effort when configuring a
-new repository!
+# hash-http-content #
+
+[![GitHub Build Status](https://github.com/cisagov/hash-http-content/workflows/build/badge.svg)](https://github.com/cisagov/hash-http-content/actions)
+[![Coverage Status](https://coveralls.io/repos/github/cisagov/hash-http-content/badge.svg?branch=develop)](https://coveralls.io/github/cisagov/hash-http-content?branch=develop)
+[![Total alerts](https://img.shields.io/lgtm/alerts/g/cisagov/hash-http-content.svg?logo=lgtm&logoWidth=18)](https://lgtm.com/projects/g/cisagov/hash-http-content/alerts/)
+[![Language grade: Python](https://img.shields.io/lgtm/grade/python/g/cisagov/hash-http-content.svg?logo=lgtm&logoWidth=18)](https://lgtm.com/projects/g/cisagov/hash-http-content/context:python)
+[![Known Vulnerabilities](https://snyk.io/test/github/cisagov/hash-http-content/develop/badge.svg)](https://snyk.io/test/github/cisagov/hash-http-content)
+
+This is a Python library to retrieve the contents of a given URL via HTTP (or
+HTTPS) and hash the processed contents.
+
+## Content processing ##
+
+If an encoding is detected, this package will convert content into the UTF-8
+encoding before proceeding.
+
+Additional content processing is currently implemented for the following types
+of content:
+
+* HTML
+* JSON
+
+### HTML ###
+
+HTML content is processed by leveraging the
+[pyppeteer](https://github.com/pyppeteer/pyppeteer) package to execute any
+JavaScript on a retrieved page. The result is then parsed by
+[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) to reduce the
+content to the human visible portions of a page.
+
+### JSON ###
+
+JSON content is processed by using the
+[`json` library](https://docs.python.org/3/library/json.html) that is part of
+the Python standard library. It is read in and then output in a deterministic
+manner to adjust for any styling differences between content.

 ## Contributing ##

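The content processing described in the new README can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the package's implementation: `visible_text`, `normalize_json`, and `hash_content` are hypothetical names, the JavaScript rendering step that the README attributes to pyppeteer is skipped entirely, and `html.parser` stands in for Beautiful Soup.

```python
import hashlib
import json
from html.parser import HTMLParser
from typing import List


class _VisibleTextParser(HTMLParser):
    """Collect text found outside of tags whose contents are not displayed."""

    # Assumption: these tags approximate content that is not human visible
    HIDDEN_TAGS = {"head", "noscript", "script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._hidden_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN_TAGS:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN_TAGS and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._hidden_depth == 0:
            self.chunks.append(text)


def visible_text(html: str) -> str:
    """Reduce an HTML document to its displayed text (hypothetical helper)."""
    parser = _VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def normalize_json(text: str) -> str:
    """Re-serialize JSON so key order and whitespace no longer matter."""
    return json.dumps(json.loads(text), sort_keys=True, separators=(",", ":"))


def hash_content(content: str, algorithm: str = "sha256") -> str:
    """Hash the processed content after encoding it as UTF-8."""
    return hashlib.new(algorithm, content.encode("utf-8")).hexdigest()


# Two JSON documents that differ only in key order and whitespace
first = hash_content(normalize_json('{"b": 1, "a": 2}'))
second = hash_content(normalize_json('{\n  "a": 2,\n  "b": 1\n}'))
print(first == second)
print(visible_text("<head><title>t</title></head><p>Hello</p><script>x()</script>"))
```

Sorting keys and fixing the separators is one way to get deterministic output from `json.dumps`; the README does not say which options the package itself uses.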
2 changes: 1 addition & 1 deletion bump_version.sh
@@ -6,7 +6,7 @@ set -o nounset
 set -o errexit
 set -o pipefail

-VERSION_FILE=src/example/_version.py
+VERSION_FILE=src/hash_http_content/_version.py

 HELP_INFORMATION="bump_version.sh (show|major|minor|patch|prerelease|build|finalize)"

52 changes: 52 additions & 0 deletions get_serverless_chrome_binary.sh
@@ -0,0 +1,52 @@
+#!/usr/bin/env bash
+
+set -o nounset
+set -o errexit
+set -o pipefail
+
+function usage {
+  echo "Usage:"
+  echo " ${0##*/} [options]"
+  echo
+  echo "Options:"
+  echo " -h, --help Show the help message."
+  echo " -l, --latest Pull down the latest release on GitHub."
+  exit "$1"
+}
+
+# Defaults to a specific version for use in GitHub Actions
+DOWNLOAD_URL="https://github.com/adieuadieu/serverless-chrome/releases/download/v1.0.0-57/stable-headless-chromium-amazonlinux-2.zip"
+LOCAL_FILE="serverless-chrome.zip"
+LOCAL_DIR="tests/files/"
+
+
+# Get the URL of the latest stable release available
+function get_latest_stable_url {
+  releases_url="https://api.github.com/repos/adieuadieu/serverless-chrome/releases"
+  # Get the URL for the latest release's assets
+  latest_assets=$(curl -s "$releases_url" | jq -r '.[0].assets_url')
+  # Get the download URL of the zip for the stable branch
+  DOWNLOAD_URL=$(curl -s "$latest_assets" | jq -r '.[] | select(.browser_download_url | contains("stable")) | .browser_download_url')
+}
+
+while (( "$#" ))
+do
+  case "$1" in
+    -h|--help)
+      usage 0
+      ;;
+    -l|--latest)
+      get_latest_stable_url
+      shift 1
+      ;;
+    *) # Reject anything unrecognized; matching only "-*" would loop forever on a positional argument
+      usage 1
+      ;;
+  esac
+done
+
+# Follow redirects and output as the specified file name
+curl -L --output "$LOCAL_FILE" "$DOWNLOAD_URL"
+# Extract the specified file to the specified directory and overwrite without
+# prompting
+unzip -o "$LOCAL_FILE" -d "$LOCAL_DIR"
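The jq filter in `get_latest_stable_url` can be read as the Python sketch below. The payload is made up for illustration and only mirrors the `browser_download_url` field the script relies on; `stable_asset_urls` is a hypothetical helper, not part of this PR.

```python
from typing import Any, Dict, List


def stable_asset_urls(assets: List[Dict[str, Any]]) -> List[str]:
    """Return every download URL that contains "stable", like the jq filter."""
    return [
        asset["browser_download_url"]
        for asset in assets
        if "stable" in asset["browser_download_url"]
    ]


# Made-up payload shaped like a list of release assets
sample_assets = [
    {"browser_download_url": "https://example.invalid/dev-headless-chromium.zip"},
    {"browser_download_url": "https://example.invalid/stable-headless-chromium.zip"},
]
print(stable_asset_urls(sample_assets))
```

Like the jq expression, this returns every match rather than the first one, so a release with more than one asset whose URL contains "stable" would yield several URLs.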
27 changes: 17 additions & 10 deletions setup.py
@@ -1,5 +1,5 @@
 """
-This is the setup module for the example project.
+This is the setup module for the hash-http-content project.

 Based on:

@@ -42,16 +42,16 @@ def get_version(version_file):


 setup(
-    name="example",
+    name="hash-http-content",
     # Versions should comply with PEP440
-    version=get_version("src/example/_version.py"),
-    description="Example python library",
+    version=get_version("src/hash_http_content/_version.py"),
+    description="HTTP content hasher",
     long_description=readme(),
     long_description_content_type="text/markdown",
     # NCATS "homepage"
     url="https://www.us-cert.gov/resources/ncats",
Member

This URL needs to be updated upstream.

     # The project's main homepage
-    download_url="https://github.com/cisagov/skeleton-python-library",
+    download_url="https://github.com/cisagov/hash-http-content",
     # Author details
     author="Cyber and Infrastructure Security Agency",
     author_email="ncats@hq.dhs.gov",
Member

This email address needs to be updated upstream.

Member Author

cisagov/skeleton-python-library#57 Did we reach a consensus on which of each to use? I'm happy to make an upstream PR to resolve this.

@@ -77,13 +77,20 @@ def get_version(version_file):
     ],
     python_requires=">=3.6",
     # What does your project relate to?
-    keywords="skeleton",
+    keywords="hash http requests",
     packages=find_packages(where="src"),
     package_dir={"": "src"},
-    package_data={"example": ["data/*.txt"]},
     py_modules=[splitext(basename(path))[0] for path in glob("src/*.py")],
     include_package_data=True,
-    install_requires=["docopt", "schema", "setuptools >= 24.2.0"],
+    install_requires=[
+        "beautifulsoup4",
+        "docopt",
+        "lxml",
+        "pyppeteer",
+        "requests",
+        "schema",
+        "setuptools >= 24.2.0",
+    ],
     extras_require={
         "test": [
             "coverage",
@@ -99,6 +106,6 @@ def get_version(version_file):
             "pytest",
         ]
     },
-    # Conveniently allows one to run the CLI tool as `example`
-    entry_points={"console_scripts": ["example = example.example:main"]},
+    # Conveniently allows one to run the CLI tool as `hash-url`
+    entry_points={"console_scripts": ["hash-url = hash_http_content.cli:main"]},
 )
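A `console_scripts` entry has the form `name = module:attribute`; installing the package generates a command with that name which imports the module and calls the attribute. The sketch below shows roughly what that resolution amounts to. `resolve_entry_point` is a hypothetical helper written for illustration, and the demonstration resolves a standard library function because `hash_http_content` may not be installed where this runs.

```python
from importlib import import_module
from typing import Any, Callable, Tuple


def resolve_entry_point(spec: str) -> Tuple[str, Callable[..., Any]]:
    """Split "name = module:attribute" and import the referenced callable."""
    command, _, target = spec.partition("=")
    module_name, _, attribute = target.strip().partition(":")
    return command.strip(), getattr(import_module(module_name), attribute)


# Stand-in spec; the PR's real entry is "hash-url = hash_http_content.cli:main"
command, func = resolve_entry_point("dump-json = json:dumps")
print(command, func({"a": 1}))
```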
1 change: 0 additions & 1 deletion src/example/data/secret.txt

This file was deleted.

108 changes: 0 additions & 108 deletions src/example/example.py

This file was deleted.

@@ -1,9 +1,12 @@
-"""The example library."""
+"""The hash-http-content library."""
+# Standard Python Libraries
+from typing import List
+
 # We disable a Flake8 check for "Module imported but unused (F401)" here because
 # although this import is not directly used, it populates the value
 # package_name.__version__, which is used to get version information about this
 # Python package.
 from ._version import __version__ # noqa: F401
-from .example import example_div
+from .hasher import UrlHasher, UrlResult

-__all__ = ["example_div"]
+__all__: List[str] = ["UrlHasher", "UrlResult"]
@@ -1,5 +1,5 @@
 """Code to run if this package is used as a Python module."""

-from .example import main
+from .cli import main

 main()
File renamed without changes.