Data Science Workspace (dsws)

An integration layer for common data science components. It provides common access patterns for the Hadoop component libraries used in parallel, distributed environments.

Some access patterns are standardized:

  • hive
  • imp
  • sql

Some access patterns are specific to CDSW:

Existing libraries, classes, configs, and types

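| library          | class  | config             | type   | default |
|------------------|--------|--------------------|--------|---------|
| dsws             | duct   | 0-dsws_config.py   |        | X       |
| hive             | Hive   | 0-hive_hive.py     | cli    |         |
| Beeline (Hive)   | hbl    | 0-hive_hbl.py      | cli    | X       |
| pyhs2            | Pyhs2  | 0-hive_pyhs2.py    | conn   | X       |
| Beeline (Impala) | Ibl    | 0-impala_ibl.py    | cli    | X       |
| impyla           | Impyla | 0-impala_impyla.py | conn   | X       |
| spark            | Spark  | 0-spark_spark.py   | sess   | X       |
| tb               | Tb     | 0-webapp_tb.py     | webapp |         |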

Requirement notes. This package requires some libraries up front; others will only be available after they are installed.

Impyla requirements

thrift==0.9.3
impyla>=0.14.0

For Hive and/or Kerberos support

sasl>=0.2.1
thrift_sasl>=0.2.1

For some of the example code

pandas>=0.20.3
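
Since some of these libraries may not be present until after install, it can help to check which ones are importable in the current environment. The following is a minimal sketch, not part of dsws itself: the `REQUIREMENTS` mapping and the `missing_requirements` helper are hypothetical names introduced for illustration, and the check only looks for the module, not for the version pins listed above.

```python
from importlib.util import find_spec

# Hypothetical mapping of pip package name -> importable module name.
# impyla installs a module named `impala`; the others import under
# the same name as the package.
REQUIREMENTS = {
    "thrift": "thrift",
    "impyla": "impala",
    "sasl": "sasl",
    "thrift_sasl": "thrift_sasl",
    "pandas": "pandas",
}


def missing_requirements(requirements=REQUIREMENTS):
    """Return the sorted pip names whose modules cannot be located."""
    return sorted(
        name
        for name, module in requirements.items()
        if find_spec(module) is None  # None means the module is not installed
    )


# Report what is missing; versions are not checked in this sketch.
missing = missing_requirements()
if missing:
    print("Missing libraries: " + ", ".join(missing))
else:
    print("All listed libraries are importable.")
```

Anything reported as missing can then be installed with pip using the version constraints given above.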

Tests

Upload to PyPI

https://packaging.python.org/guides/migrating-to-pypi-org/#uploading

https://stackoverflow.com/questions/45207128/failed-to-upload-packages-to-pypi-410-gone

pip install -U pip setuptools twine

python setup.py sdist

twine upload dist/*

Quickstart

To give you a configuration to evaluate, the project comes with an example configuration specific to a quickstart instance.