DPark

Join the chat at https://gitter.im/douban/dpark

DPark is a Python clone of Spark, a MapReduce-like computing framework that supports iterative computation.

Installation:

## Due to the use of C extensions, some libraries need to be installed first.

$ sudo apt-get install libtool pkg-config build-essential autoconf automake
$ sudo apt-get install python-dev
$ sudo apt-get install libzmq-dev

## Then just pip install dpark (``sudo`` may be needed if you run into a permission problem).

$ pip install dpark

Example for word counting (wc.py):

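from dpark import DparkContext

ctx = DparkContext()
lines = ctx.textFile("/tmp/words.txt")
# Split each line into words and pair every word with a count of 1.
words = lines.flatMap(lambda x: x.split()).map(lambda x: (x, 1))
# Sum the counts per word and collect the result as a dict.
wc = words.reduceByKey(lambda x, y: x + y).collectAsMap()
print(wc)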
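
To make the data flow concrete, here is a minimal pure-Python sketch of what the flatMap, map and reduceByKey steps compute on a single machine. It does not use DPark at all; the function name ``word_count`` and the sample input are made up for illustration.

```python
from collections import defaultdict


def word_count(lines):
    """Single-process stand-in for flatMap -> map -> reduceByKey."""
    # flatMap: split every line and flatten the pieces into one stream of words.
    words = (word for line in lines for word in line.split())
    # map: pair every word with an initial count of 1.
    pairs = ((word, 1) for word in words)
    # reduceByKey: merge the values of equal keys with x + y.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    # collectAsMap: hand the result back as a plain dict.
    return dict(counts)


if __name__ == "__main__":
    print(word_count(["a b a", "b c"]))
```

The difference in DPark is that each step runs over partitions of the input, and the reduce step has to shuffle pairs with the same key to the same place before merging them.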

This script can run locally or on a Mesos cluster without modification; only the command-line arguments differ:

$ python wc.py
$ python wc.py -m process
$ python wc.py -m host[:port]

See examples/ for more use cases.

Some more docs (in Chinese): https://github.com/jackfengji/test_pro/wiki

DPark can run with Mesos 0.9 or higher.

If the $MESOS_MASTER environment variable is set, you can use a shortcut and run DPark on Mesos by typing:

$ python wc.py -m mesos

$MESOS_MASTER can use any address scheme that the Mesos master accepts, for example:

$ export MESOS_MASTER=zk://zk1:2181,zk2:2181,zk3:2181/mesos_master
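
The ZooKeeper form packs a list of host:port pairs and a znode path into one string. The sketch below only illustrates that structure; it is not how DPark or Mesos parse the value, and the helper name ``split_zk_master`` is hypothetical.

```python
def split_zk_master(url):
    """Split 'zk://h1:p1,h2:p2/path' into ([(host, port), ...], '/path')."""
    prefix = "zk://"
    if not url.startswith(prefix):
        raise ValueError("not a zk:// URL: %r" % url)
    rest = url[len(prefix):]
    # Everything up to the first '/' is the server list; the rest is the znode path.
    servers, sep, path = rest.partition("/")
    hosts = []
    for item in servers.split(","):
        host, _, port = item.partition(":")
        # Assumption: fall back to ZooKeeper's default client port when none is given.
        hosts.append((host, int(port) if port else 2181))
    return hosts, sep + path


if __name__ == "__main__":
    print(split_zk_master("zk://zk1:2181,zk2:2181,zk3:2181/mesos_master"))
```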

To speed up shuffling, deploy Nginx on port 5055 to serve the data in DPARK_WORK_DIR (default: /tmp/dpark), for example:

server {
        listen 5055;
        server_name localhost;
        root /tmp/dpark/;
}
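
Once Nginx is running, a quick sanity check is to drop a file into the work directory and fetch it back over HTTP. The sketch below assumes the configuration above (port 5055, root /tmp/dpark) and Python 3; the helper names and the probe file name are made up for illustration, and nothing here is part of DPark.

```python
import os
import urllib.request


def work_dir_url(host, relpath, port=5055):
    """Build the URL under which Nginx would expose a file from the work dir."""
    return "http://%s:%d/%s" % (host, port, relpath.lstrip("/"))


def probe(work_dir="/tmp/dpark", host="localhost", port=5055, name="probe.txt"):
    """Write a small file into work_dir and return True if HTTP serves it back."""
    os.makedirs(work_dir, exist_ok=True)
    path = os.path.join(work_dir, name)
    with open(path, "w") as f:
        f.write("ok")
    try:
        with urllib.request.urlopen(work_dir_url(host, name, port), timeout=3) as resp:
            return resp.read() == b"ok"
    except OSError:
        # Connection refused, timeout, or an HTTP error status all mean "not served".
        return False
    finally:
        # Always clean up the probe file, whether or not the fetch succeeded.
        os.remove(path)


if __name__ == "__main__":
    print("work dir reachable over HTTP:", probe())
```

Running it on every node that executes tasks shows whether each one exposes its own work directory on the expected port.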

Mailing list: dpark-users@googlegroups.com (http://groups.google.com/group/dpark-users)