Python Environment Packaging Scripts (#2109)

* Python Environment Packaging Scripts
* Redirect unneeded output to /dev/null
* README.md changes
* Add documentation to README
* Documentation Updates
* Documentation Updates

cooperative-computing-lab · Aug 6, 2019 · commit 2bab715
1 parent 254769c

Showing 4 changed files with 350 additions and 0 deletions.
diff --git a/python_packaging/README.md b/python_packaging/README.md
@@ -0,0 +1,186 @@
+# Python Packaging Utilities
+
+The Python packaging utilities analyze a Python script and build a Conda environment that contains the dependencies the script needs to run. In distributed computing systems such as Work Queue, it is often difficult to maintain homogeneous work environments for Python applications, because the scripts rely on many outside resources at runtime, such as the Python interpreter and imported libraries. The Python packaging collection provides three tools that address this problem by helping users analyze their Python programs and build Conda environments that give consistent runtimes within the Work Queue system.
+
+The `python_package_analyze` tool analyzes a Python script to determine all its top-level module dependencies and the interpreter version it uses. It then generates a concise, human-readable JSON output file containing the necessary information required to build a self-contained Conda virtual environment for the Python script.
+
+The `python_package_create` tool takes the output JSON file generated by `python_package_analyze` and creates this Conda environment, preinstalled with all the necessary libraries and the correct Python interpreter version. It then generates a packaged tarball of the environment that can be easily relocated to a different machine within the system to run the Python task.
+
+The `python_package_run` tool acts as a wrapper script for the Python task, unpacking and activating the Conda environment and running the task within the environment.
+
+
+
+
+
+
+# python_package_analyze(1)
+
+## NAME
+
+`python_package_analyze` - command-line utility for analyzing a Python script for library and interpreter dependencies
+
+## SYNOPSIS
+
+`python_package_analyze [options] <python-script> <json-output-file>`
+
+## DESCRIPTION
+
+`python_package_analyze` is a simple command line utility for analyzing Python scripts for the necessary external dependencies. It generates an output file that can be used with `python_package_create` to build a self-contained Conda environment for the Python application.
+
+The `python-script` argument is the path (relative or absolute) to the Python script to be analyzed. The `json-output-file` argument is the path (relative or absolute) to the output JSON file that the command generates. The output file does not need to exist beforehand; if a file with the same name already exists, it is overwritten.
+
+## OPTIONS
+
+-h, --help        Show this help message
+
+## EXIT STATUS
+
+On success, returns zero. On failure, returns non-zero.
+- 1 - Invalid command format
+- 2 - Invalid path to the Python script to be analyzed
+
+## EXAMPLE
+
+An example Python script `example.py` contains the following code:
+
+```
+import os
+import sys
+import pickle
+
+import antigravity
+import matplotlib
+
+
+if __name__ == "__main__":
+    print("example")
+```
+
+To analyze the `example.py` script for its dependencies and generate the output JSON dependencies file `dependencies.json`, run the following command:
+
+`$ python_package_analyze example.py dependencies.json`
+
+Once the command completes, the `dependencies.json` file within the current working directory will contain the following, when the default `python3` interpreter on the local machine is Python 3.7.3:
+
+`{"python": "3.7.3", "modules": ["antigravity", "matplotlib"]}`
+
+Note that system-level (standard library) modules such as `os`, `sys`, and `pickle` are not included in the `"modules"` list, as they are automatically installed into Conda virtual environments. Additionally, running the tool with a different version of the Python interpreter results in a different `"python"` value in the output file.
+
+## POSSIBLE IMPROVEMENTS
+1. Utilize `ModuleFinder` library to get complete list of modules that are used by the Python script
+- Provides more comprehensive list of modules used, including system-level modules, making it redundant
+- Takes longer to run compared to the currently-implemented parsing algorithm
+- More rigorously tested than the parsing algorithm, so it ensures that all modules will be listed
+2. Use `pip freeze` to find all modules that are installed within the machine
+- Instead of seeing if the module is not a system module, just see if it is installed on the machine, but requires that the module be installed on the master machine
+- Misses cases where a module is installed to the machine, but not by pip
+- The advantage to this option is that `pip freeze` includes versions, so you can add version numbers for module dependencies to get more accurate pip installations into the virtual environment
+- `stdlib_list` library that is in the current implementation requires installation and has not been rigorously tested
+
+
+
+# python_package_create(1)
+
+## NAME
+
+`python_package_create` - command-line utility for creating a Conda virtual environment given a Python dependencies file
+
+## SYNOPSIS
+
+`python_package_create [options] <dependency-file> <environment-name>`
+
+## DESCRIPTION
+
+`python_package_create` is a simple command-line utility that creates a local Conda environment from an input JSON dependency file, generated by `python_package_analyze`. The environment is installed into the default Conda directory on the local machine. The command also creates an environment tarball in the current working directory with extension `.tar.gz` that can be sent to and run on different machines with the same architecture.
+
+The `dependency-file` argument is the path (relative or absolute) to the JSON dependency file that was created by `python_package_analyze`. The `environment-name` argument specifies the name for the environment that is created. 
+
+## OPTIONS
+
+-h, --help        Show this help message
+
+## EXIT STATUS
+
+On success, returns zero. On failure, returns non-zero.
+- 1 - Invalid command format
+- 2 - Invalid path to the JSON dependency file
+
+## EXAMPLE
+
+A dependencies file `dependencies.json` contains the following:
+
+`{"python": "3.7.3", "modules": ["antigravity", "matplotlib"]}`
+
+To generate a Conda environment with the Python 3.7.3 interpreter and the `antigravity` and `matplotlib` modules preinstalled and with name `example_venv`, run the following command:
+
+`$ python_package_create dependencies.json example_venv`
+
+Note that this will also create an `example_venv.tar.gz` environment tarball within the current working directory, which can then be exported to different machines for execution.
+
+## POSSIBLE IMPROVEMENTS
+1. Figure out alternative to using `subprocess.call()` to create the Conda environment (perhaps make a Bash script altogether)
+- Most of the execution occurs within the subprocess call, so basically a Bash script, but easier to use Python to parse the JSON file and write to the requirement file
+- Perhaps use a JSON parsing command line utility within Bash script instead, such as `jq`
+- If a Conda environment API for Python is ever created, it would be very useful here, as we could remove the subprocess call completely
+2. Remove redirection of all output to `/dev/null`
+- All output from the subprocess call is removed for organization purposes, but some commands like `pip install` might be useful for the user to see
+- Removing redirection also makes it much easier to debug
+
+
+
+# python_package_run(1)
+
+## NAME
+
+`python_package_run` - wrapper script that executes Python script within an isolated Conda environment
+
+## SYNOPSIS
+
+`python_package_run [options] <environment-name> <python-command-string>`
+
+## DESCRIPTION
+
+The `python_package_run` tool acts as a wrapper script for a Python task, running the task within the specified Conda environment. `python_package_run` can be utilized on different machines within the Work Queue system to unpack and activate a Conda environment, and run a task within the isolated environment.
+
+The `environment-name` argument is the name of the Conda environment in which to run the Python task. The `python-command-string` argument is the full shell command of the Python task that will be run within the Conda environment.
+
+## OPTIONS
+
+-h, --help        Show this help message
+
+## EXIT STATUS
+
+On success, returns zero. On failure, returns non-zero.
+- 1 - Invalid command format
+- 2 - Environment tarball does not exist
+- 3 - Failed trying to extract environment tarball
+- 4 - Failed trying to activate Conda environment
+
+Note: The wrapper script returns the exit status of the Python command string, so the exit code of the Python task may overlap with the exit codes of the wrapper script listed above.
+
+## EXAMPLE
+
+A Python script `example.py` has been analyzed using `python_package_analyze` and a corresponding Conda environment named `example_venv` has been created, with all the necessary dependencies preinstalled. To execute the script within the environment, run the following command:
+
+`$ python_package_run example_venv "python3 example.py"`
+
+This will run the `python3 example.py` task string within the activated `example_venv` Conda environment. Note that this command can be performed either locally, on the same machine that analyzed the script and created the environment, or remotely, on a different machine that contains the Conda environment tarball and the `example.py` script.
+
+## POSSIBLE IMPROVEMENTS
+1. Do protection checking against dangerous shell commands, as the script runs the command line argument directly
+- The program directly runs the task string that is passed in, which means the user could send a task that is harmful to the worker machine
+- Perhaps Work Queue already performs protection checking on task strings, in which case it is not necessary
+
+
+
+
+# HOW TO TEST OVERALL FUNCTIONALITY
+
+Desired Python script to run: `hi.py`
+
+1. `./python_package_analyze hi.py output.json`
+- Generates the appropriate JSON file in the current working directory
+2. `./python_package_create output.json venv`
+- Will create a Conda environment in the Conda `envs` folder, and will create a packed tarball of the environment named `venv.tar.gz` in the current working directory
+- To more easily debug, remove the redirected output to `/dev/null` in the subprocess call to see all output of the environment creation and module installation
+3. `./python_package_run venv "python3 hi.py"`
+- Runs the `python3 hi.py` task command within the activated `venv` Conda environment
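
The README's "Possible Improvements" for `python_package_analyze` suggest using `pip freeze` to attach version numbers to module dependencies. The following is a minimal sketch of that idea, not part of this commit. The function names `parse_freeze` and `pin_modules` are hypothetical, and the sketch assumes that the import name matches the pip distribution name, which is not always true (for example, `yaml` is distributed as `PyYAML`).

```python
def parse_freeze(freeze_output):
    """Map lower-cased distribution names to versions from `pip freeze` text."""
    versions = {}
    for line in freeze_output.splitlines():
        line = line.strip()
        # Only plain "name==version" lines are handled in this sketch;
        # editable installs, direct URLs, and comments are skipped.
        if not line or line.startswith(("#", "-")) or "==" not in line:
            continue
        name, version = line.split("==", 1)
        versions[name.lower()] = version
    return versions


def pin_modules(modules, freeze_output):
    """Return requirement lines, pinned where `pip freeze` lists a version."""
    versions = parse_freeze(freeze_output)
    pinned = []
    for module in modules:
        version = versions.get(module.lower())
        # Modules that pip does not know about are left unpinned.
        pinned.append("{}=={}".format(module, version) if version else module)
    return pinned
```

The returned lines could be written to the requirements file in place of the bare module names. Modules installed by something other than pip stay unpinned, which matches the limitation noted in the README.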
diff --git a/python_packaging/python_package_analyze b/python_packaging/python_package_analyze
@@ -0,0 +1,69 @@
+#!/usr/bin/env python3
+import os
+import sys
+import json
+from stdlib_list import stdlib_list
+
+
+def usage(exit_code):
+    print("Usage: python_package_analyze [options] <python-script> <json-output-file>")
+    print("where options are:")
+    print(" -h, --help\tShow this help screen")
+    exit(exit_code)
+
+
+# Parse command line arguments
+if len(sys.argv) > 1 and (sys.argv[1] == "-h" or sys.argv[1] == "--help"):
+    usage(0)
+if len(sys.argv) != 3:
+    usage(1)
+python_script = sys.argv[1]
+json_output_file = sys.argv[2]
+if not os.path.exists(python_script):
+    print("Python script does not exist")
+    exit(2)
+
+# Find Python version and obtain list of standard library modules 
+version = ".".join(sys.version.split()[0].split(".")[:2])
+libraries = stdlib_list(version)
+
+# Parse the Python script for all import statements
+dependencies = []
+source = open(python_script, "r")
+for line in source.readlines():
+    words = line.split()
+    isList = False
+    isFrom = False
+    # Iterate through each word in the line
+    for i in range(0, len(words)):
+        # Signals that you are importing a module
+        if words[i] == "from" or words[i] == "import":
+            if words[i] == "from":
+                isFrom = True
+            i += 1
+            name = words[i]
+            if name[-1] == ",":
+                name = name[:-1]
+                isList = True
+            if name not in libraries:
+                dependencies.append(name)
+        # Iterate through multiple imports if multiple listed on one line
+        while isList:
+            i += 1
+            name = words[i]
+            if name[-1] == ",":
+                name = name[:-1]
+            else:
+                isList = False
+            if name not in libraries:
+                dependencies.append(name)
+        if isFrom:
+            break
+
+# Put the JSON data into a file
+python_info = {}
+python_info["python"] = sys.version.split()[0]
+python_info["modules"] = dependencies
+output = open(json_output_file, "w")
+json.dump(python_info, output)
+exit(0)
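
The analyzer above finds imports by splitting each line into words. As an alternative, the standard `ast` module can parse the script and report import statements directly, which also covers imports inside functions and ignores the words `import` and `from` when they appear in strings or comments. This is a sketch under stated assumptions, not the commit's implementation: `find_top_level_imports` is a hypothetical name, and the caller supplies the set of standard library names (for example from `stdlib_list`, as the script already does).

```python
import ast


def find_top_level_imports(source, stdlib_names):
    """Return sorted top-level module names imported by `source`,
    excluding any name found in `stdlib_names`."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import a.b, c" yields aliases "a.b" and "c"; keep "a" and "c".
            for alias in node.names:
                found.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # level > 0 marks a relative import ("from . import x"), which
            # refers to the script's own package, so it is skipped.
            if node.level == 0 and node.module:
                found.add(node.module.split(".")[0])
    return sorted(found - set(stdlib_names))
```

One trade-off: `ast.parse` uses the grammar of the interpreter that runs the analyzer, so a script written for a different Python version may fail to parse.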
diff --git a/python_packaging/python_package_create b/python_packaging/python_package_create
@@ -0,0 +1,50 @@
+#!/usr/bin/env python3
+import json
+import os
+import sys
+import subprocess
+import conda_pack
+
+
+def usage(exit_code):
+    print("Usage: python_package_create [options] <dependency-file> <environment-name>")
+    print("where options are:")
+    print(" -h, --help\tShow this help screen")
+    exit(exit_code)
+
+
+# Parse command line arguments
+if len(sys.argv) > 1 and (sys.argv[1] == "-h" or sys.argv[1] == "--help"):
+    usage(0)
+if len(sys.argv) != 3:
+    usage(1)
+dependency_file = sys.argv[1]
+environment_name = sys.argv[2]
+if not os.path.exists(dependency_file):
+    print("JSON dependency file does not exist")
+    exit(2)
+else:
+    dependency_fp = open(dependency_file, "r")
+
+# Extract Python environment data from JSON file and create requirements file
+package_data = json.load(dependency_fp)
+python_version = package_data["python"]
+dependencies = package_data["modules"]
+req_file = open("/tmp/requirements.txt", "w")
+for module in dependencies:
+    module_string = module + "\n"
+    req_file.write(module_string)
+req_file.close()
+
+# Create a named environment (in the default Conda envs directory, as the
+# README describes) and install all necessary modules into it
+subprocess.call("conda create -n {} -y python={} &> /dev/null; \
+                eval \"$(conda shell.bash hook)\" &> /dev/null; \
+                conda activate {} &> /dev/null; \
+                pip install -r /tmp/requirements.txt &> /dev/null; \
+                pip install tblib &> /dev/null; \
+                rm /tmp/requirements.txt &> /dev/null; \
+                conda deactivate &> /dev/null;".format(environment_name, python_version, environment_name), shell=True)
+
+# Pack the environment
+conda_pack.pack(name=environment_name, output="{}.tar.gz".format(environment_name), force=True)
+exit(0)
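
The script above writes to a fixed `/tmp/requirements.txt` and builds one long shell string. The sketch below shows one possible alternative: a unique temporary requirements file plus argument lists that can be passed to `subprocess` without a shell. It only plans the commands and does not run them. The name `plan_environment` is hypothetical, and using `conda run` to invoke `pip` inside the environment is an assumption that requires a Conda version that provides that subcommand.

```python
import json
import tempfile


def plan_environment(dependency_json, environment_name):
    """Write a requirements file and return it with the commands to run."""
    data = json.loads(dependency_json)
    python_version = data["python"]
    modules = data["modules"]

    # A unique temporary file avoids collisions between concurrent runs,
    # unlike a fixed /tmp/requirements.txt path. The caller deletes it.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as req:
        for module in modules:
            req.write(module + "\n")
        requirements_path = req.name

    # Argument lists need no shell quoting or output redirection syntax.
    create_cmd = ["conda", "create", "-n", environment_name, "-y",
                  "python={}".format(python_version)]
    install_cmd = ["conda", "run", "-n", environment_name,
                   "pip", "install", "-r", requirements_path]
    return requirements_path, create_cmd, install_cmd
```

Each list could then be passed to `subprocess.run(cmd, stdout=subprocess.DEVNULL)` to keep the quiet behavior, or run without redirection while debugging.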
diff --git a/python_packaging/python_package_run b/python_packaging/python_package_run
@@ -0,0 +1,45 @@
+#!/bin/bash
+
+usage() {
+    echo "Usage: python_package_run [options] <environment-name> <python-command-string>"
+    echo "where options are:"
+    echo -e " -h, --help\tShow this help screen"
+    exit $1
+}
+
+# Parse command line arguments
+if [ "$1" == "-h" ] || [ "$1" == "--help" ]; then
+    usage 0
+fi
+if [ $# -ne 2 ]; then
+    usage 1
+fi
+ENVIRONMENT_NAME=$1
+PYTHON_COMMAND_STRING=$2
+
+# Unpack the packed environment
+if [ ! -f "${ENVIRONMENT_NAME}.tar.gz" ]; then
+    echo "Environment tarball does not exist, exiting"
+    exit 2
+fi    
+tar xzf "${ENVIRONMENT_NAME}.tar.gz"
+if [ $? -ne 0 ]; then
+    echo "Unable to successfully unpack tarball, exiting"
+    exit 3
+fi
+
+# Activate conda environment, run the task, deactivate
+source bin/activate &> /dev/null
+if [ $? -ne 0 ]; then
+    echo "Unable to activate Conda environment, exiting"
+    exit 4
+fi
+conda-unpack &> /dev/null
+if [ $? -ne 0 ]; then
+    echo "Unable to activate Conda environment, exiting"
+    exit 4
+fi
+${PYTHON_COMMAND_STRING}
+EXITVALUE=$?
+source bin/deactivate &> /dev/null
+exit $EXITVALUE
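
The wrapper above expands `${PYTHON_COMMAND_STRING}` unquoted, so the task string is split on whitespace and any quotes inside it are not honored (an argument such as `"my script.py"` would be split in two). The sketch below shows how a Python-based wrapper could split the string with quoting rules and run it without a shell, returning the task's exit status as the current script does. `run_task` is a hypothetical name, and this is an illustration rather than a proposed replacement for the script.

```python
import shlex
import subprocess


def run_task(command_string):
    """Run a task string without a shell and return its exit status."""
    # shlex.split honors quotes but does not expand variables, globs,
    # pipes, or ';', so shell metacharacters become ordinary arguments.
    argv = shlex.split(command_string)
    if not argv:
        raise ValueError("empty command string")
    return subprocess.run(argv).returncode
```

Because no shell interprets the string, this also narrows the "dangerous shell commands" concern raised in the README, although the task itself can still do anything the user running it is allowed to do.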