
Hadoop File System

Odo interacts with the Hadoop File System using WebHDFS and the pywebhdfs Python library.

[Figure: odo and HDFS]


HDFS URIs consist of the hdfs:// protocol, a hostname, and a filename. Simple and complex examples follow:

    hdfs://hostname:myfile.csv
    hdfs://username@hostname:/path/to/myfile.json

Alternatively, you may want to pass authentication information through keyword arguments to the odo function, as in the following example:

>>> from odo import odo
>>> odo('localfile.csv', 'hdfs://hostname:myfile.csv',
...     port=14000, user='hdfs')

We pass these authentication keyword arguments (host, port, user) straight through to the pywebhdfs.webhdfs.PyWebHdfsClient class.


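For intuition, the call above corresponds roughly to constructing the client yourself. This is a sketch, not odo internals; note that PyWebHdfsClient takes the port as a string and the user under the keyword user_name:

>>> from pywebhdfs.webhdfs import PyWebHdfsClient
>>> client = PyWebHdfsClient(host='hostname', port='14000', user_name='hdfs')
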
Constructing HDFS Objects explicitly

Most users interact with odo using URI strings.

Alternatively, you can construct objects programmatically. HDFS uses the HDFS type modifier:

>>> from odo import HDFS, CSV, JSONLines, Directory
>>> auth = {'user': 'hdfs', 'port': 14000, 'host': 'hostname'}
>>> data = HDFS(CSV)('/user/hdfs/data/accounts.csv', **auth)
>>> data = HDFS(JSONLines)('/user/hdfs/data/accounts.json', **auth)
>>> data = HDFS(Directory(CSV))('/user/hdfs/data/', **auth)
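
These objects then behave like any other odo source or target. For example, a rough sketch (assuming the accounts files actually exist on a reachable cluster) that pulls the directory of CSV files into a local DataFrame:

>>> import pandas as pd
>>> from odo import odo
>>> df = odo(data, pd.DataFrame)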


We can convert any text type (CSV, JSON, JSONLines, TextFile) to its equivalent on HDFS (HDFS(CSV), HDFS(JSON), ...):

HDFS(*) <-> *
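
For example, the download direction looks just like the upload shown earlier; this is a sketch with placeholder hostname, path, and credentials:

>>> from odo import odo
>>> odo('hdfs://hostname:/user/hdfs/data/accounts.csv', 'accounts.csv',
...     port=14000, user='hdfs')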

Additionally, we know how to load HDFS files into the Hive metastore:

HDFS(Directory(CSV)) -> Hive
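
As a sketch of what that looks like, assuming a configured Hive server and odo's hive://user@hostname/database::table URI form (the hostname, database, and table below are placeholders):

>>> from odo import odo
>>> odo('hdfs://hostname:/user/hdfs/data/', 'hive://hdfs@hostname/default::accounts',
...     port=14000, user='hdfs')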

The network also allows conversions from other types, like a pandas DataFrame to an HDFS CSV file, by routing through a temporary local CSV file:

Foo <-> Temp(*) <-> HDFS(*)
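
In practice that means a DataFrame can be written to HDFS in a single call, with odo staging the data through a temporary local CSV file behind the scenes (again, the hostname and credentials below are placeholders):

>>> import pandas as pd
>>> from odo import odo
>>> df = pd.DataFrame({'name': ['Alice', 'Bob'], 'balance': [100, 200]})
>>> odo(df, 'hdfs://hostname:/user/hdfs/data/accounts.csv',
...     port=14000, user='hdfs')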