Purpose

Luis Rivera edited this page Feb 20, 2020 · 12 revisions

The main reason for building and configuring this cluster is to leverage Hadoop and Spark's distributed storage and processing capabilities to build, train, test, and assess Machine Learning models in parallel for different projects at work. It is also meant to demonstrate that you may not need to invest in multiple full-powered computers or servers to build a cluster.

I had to train about 300 XGBoost models on time-series financial data, a job that would take about 7 hours in total. Yep, we don't have powerful computers or servers at work. The goal here is therefore to at least reduce the time it takes to train all the models. Furthermore, with a separate cluster, I no longer have to use my work computer to train them, so I can keep working on other tasks while the cluster does the job.

Yes, of course, a full computer may have much more power in terms of CPU speed, RAM, and other hardware. However, not everyone has that much money to spend (myself included). That is where Raspberry Pis come in.