The hpycc package is intended to simplify the use of data stored on HPCC and make it easily available to both users and other servers through basic Python calls. Its long-term goal is to make access to and manipulation of HPCC data as quick and easy as with any other type of system.
HPCC seem to have pulled the images that this is based on. I've created one for the test environment I developed against, but if you want to develop against a different one you will need to build the relevant container. I've forked the old HPCC docker repo in case you need it or it gets removed: https://github.com/datamacgyver/docker-hpcc
Note that in the docker files I've tried, the wget for HPCC is broken. You can get the new ones from the HPCC website, but there's a modified one in this repo that I used for the build.
The below readme and package documentation is available at https://hpycc.readthedocs.io/en/latest/
The package's github is available at: https://github.com/OdinProAgrica/hpycc
This package is released under GNU GPLv3 Licence: https://www.gnu.org/licenses/gpl-3.0.en.html
Want to use this in R? Then the reticulate package is your friend! Just save as a CSV and read back in. That or you can use an R notebook with a Python chunk.
Install with:
pip install hpycc
Or, if you are still a bit old school:
python -m pip install hpycc
Tested and working on HPCC v6.4.2 and Python 3.5.2 under Windows 10. Has been used on Linux systems but not extensively tested there.
The package itself mainly uses core Python; pandas is needed for outputting DataFrames.
There is a dependency for client tools to run ECL scripts (you need ecl.exe and eclcc.exe). Make sure you install the right client tools for your HPCC version and add the dir to your system path, e.g. C:\Program Files (x86)\HPCCSystems\X.X.X\clienttools\bin.
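A quick way to confirm that the client tools are visible before running anything is to look them up on the system path. The sketch below uses only the standard library; the helper name find_client_tools is hypothetical and not part of hpycc.

```python
import shutil

# Executables named in the dependency note above (ecl.exe and eclcc.exe on Windows).
REQUIRED_TOOLS = ("ecl", "eclcc")


def find_client_tools(tools=REQUIRED_TOOLS):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    # shutil.which honours PATHEXT on Windows, so "ecl" also matches "ecl.exe".
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in find_client_tools().items():
        print("%s: %s" % (tool, path or "NOT FOUND on PATH"))
```

If either tool is reported as not found, add the clienttools bin directory to your system path and open a new terminal before retrying.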
Tests and docker container functions require docker to spin up HPCC environments.
The summary below covers the key functions and their non-optional parameters. For specific arguments see the relevant function's documentation. Note that while retrieving a file is a multi-threaded process, running a script and getting the results is not. If your result is large, you may therefore be better off saving the results of a script to a thor file with run.run_script() and then downloading that file with get.get_thor_file().
Connection(username, server="localhost", port=8010, repo=None, password="password", legacy=False, test_conn=True)
Create a connection to a new HPCC instance. This is then passed to any interface functions.
get_output(connection, script, ...)
Run a given ECL script and either return the first result as a pandas DataFrame or save it to file.
get_outputs(connection, script, ...)
Run a given ECL script and return all results as a dict of pandas DataFrames or save them to files.
get_thor_file(connection, logical_file, path, ...) & save_thor_file(connection, logical_file, path, ...)
Get a logical file and either return as a pandas dataframe or save it to file.
run_script(connection, script, ...)
Run a given ECL script. 10 rows will be returned but they are discarded; no output is given.
spray_file(connection, source_file, logical_file, ...)
Spray a CSV file or pandas DataFrame into HPCC.
A collection of functions for running and managing HPCC docker containers, designed for our testing, is also available for general use. Creating a docker_tools.HPCCContainer object starts a container; see its help file for shutting down and other management tasks.
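As noted earlier, fetching script results is not multi-threaded while downloading a logical file is, so large results are better written to a thor file first and downloaded afterwards. The sketch below shows one way to wrap that pattern. The helper names build_output_script and run_then_download are hypothetical (they are not part of hpycc), the default record layout is an assumption taken from the example data, and the only hpycc calls used are run_script and get_thor_file.

```python
import os
import tempfile


def build_output_script(source_file, target_file,
                        layout="{STRING col1; STRING col2;}"):
    """Return ECL text that reads source_file and writes it to target_file."""
    # The layout default is an assumption; replace it with your own record layout.
    return (
        "a := DATASET('%s', %s, THOR);\n"
        "OUTPUT(a, , '%s');\n" % (source_file, layout, target_file)
    )


def run_then_download(conn, source_file, target_file):
    """Run the script on HPCC, then download the thor file it wrote."""
    # Imported lazily so build_output_script can be used without hpycc installed.
    import hpycc

    fd, script_path = tempfile.mkstemp(suffix=".ecl")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(build_output_script(source_file, target_file))
        hpycc.run_script(conn, script_path)             # single-threaded step
        return hpycc.get_thor_file(conn, target_file)   # multi-threaded download
    finally:
        os.remove(script_path)
```

With an open connection, run_then_download(conn, '~temp::testfile1', '~temp::testfile2') would return the downloaded file as a DataFrame.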
The below code gives an example of functionality:
import hpycc
import pandas as pd
from hpycc.utils import docker_tools
from os import remove

# Start an HPCC docker image for testing
docker_tools.HPCCContainer(tag="6.4.26-1")

# Setup stuff
username = 'HPCC_dev'
test_file = 'test.csv'
f_hpcc_1 = '~temp::testfile1'
f_hpcc_2 = '~temp::testfile2'
ecl_script = 'ecl_script.ecl'

# Let's create a connection object so we can interface with the HPCC
# instance we just spun up with Docker.
conn = hpycc.Connection(username, server="localhost")

try:
    # So, let's spray up some data:
    pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': ['a', 'b', 'c', 'd']}).to_csv(test_file, index=False)
    hpycc.spray_file(conn, test_file, f_hpcc_1, expire=7)

    # Lovely, we can now extract that as a Thor file:
    df = hpycc.get_thor_file(conn, f_hpcc_1)
    print(df)
    # Note __fileposition__ column. This will be drop-able in future versions.
    #################################
    #   col1 col2  __fileposition__ #
    # 0    1    a                 0 #
    # 1    3    c                20 #
    # 2    2    b                10 #
    # 3    4    d                30 #
    #################################

    # If preferred data can also be extracted using an ECL script.
    with open(ecl_script, 'w') as f:
        f.writelines("DATASET('%s', {STRING col1; STRING col2;}, THOR);" % f_hpcc_1)

    # Note, all columns are currently string-ified by default
    df = hpycc.get_output(conn, ecl_script)
    print(df)
    ################
    #   col1 col2  #
    # 0    1    a  #
    # 1    3    c  #
    # 2    2    b  #
    # 3    4    d  #
    ################

    # get_thor_file() is optimised for large files, get_output is not (yet).
    # To run a script and download a large result you should therefore save
    # a thor file and grab that.
    with open(ecl_script, 'w') as f:
        f.writelines("a := DATASET('%s', {STRING col1; STRING col2;}, THOR);"
                     "OUTPUT(a, , '%s');" % (f_hpcc_1, f_hpcc_2))
    hpycc.run_script(conn, ecl_script)
    df = hpycc.get_thor_file(conn, f_hpcc_2)
    print(df)
    #################################
    #   col1 col2  __fileposition__ #
    # 0    1    a                 0 #
    # 1    3    c                20 #
    # 2    2    b                10 #
    # 3    4    d                30 #
    #################################

finally:
    # Shutdown our docker container
    docker_tools.HPCCContainer(pull=False, start=False).stop_container()
    remove(ecl_script)
    remove(test_file)
To report issues or bugs, or to leave comments, please use the package's GitHub: https://github.com/OdinProAgrica/hpycc
Any contributions are also welcome.