Distri is a Java framework for building distributed systems that download large sets of web documents in a short time. A system built with Distri can be scaled easily to support real-time web document downloading with high throughput.
Distri uses a master-slave structure, where independent web downloading tasks are executed by distributed machines (the slaves) under the centralized supervision of a control processor (the master). A system built on top of Distri scales well and can control a large number of slave machines. By distributing tasks across those slaves to achieve parallelism, Distri can produce fairly high throughput. In addition, Distri's throttling mechanism ensures that no slave machine makes requests to the same host too often, which could otherwise cause that machine to be blocked from visiting the host.
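To make the throttling idea concrete, here is a minimal sketch of a per-host rate limit. It is not Distri's actual code: the class name `HostThrottle`, the method `tryAcquire`, and the fixed minimum interval are all assumptions made for illustration, and the README does not say whether Distri enforces the limit on the master or on each slave.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical per-host throttle (illustrative only, not part of Distri).
 * A request to a host is allowed only if at least minIntervalMillis has
 * passed since the last allowed request to that same host.
 */
final class HostThrottle {
    private final long minIntervalMillis;

    // Timestamp (in milliseconds) of the last allowed request, keyed by host.
    private final Map<String, Long> lastGranted = new HashMap<>();

    HostThrottle(long minIntervalMillis) {
        if (minIntervalMillis < 0) {
            throw new IllegalArgumentException("minIntervalMillis must be >= 0");
        }
        this.minIntervalMillis = minIntervalMillis;
    }

    /**
     * Returns true and records the request time if the host may be visited
     * at nowMillis; returns false if the previous allowed request to this
     * host was too recent. The caller supplies the clock value, which keeps
     * the class easy to test.
     */
    synchronized boolean tryAcquire(String host, long nowMillis) {
        Long last = lastGranted.get(host);
        if (last != null && nowMillis - last < minIntervalMillis) {
            return false; // too soon: the caller should defer this download
        }
        lastGranted.put(host, nowMillis);
        return true;
    }
}
```

A worker would call `tryAcquire(host, System.currentTimeMillis())` before each download and requeue the task when it returns `false`. Throttling per host rather than globally means a slow-down for one site does not reduce throughput for the others.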
More details about how Distri works and its performance analysis can be found in this paper.