Purpose

Luis Rivera edited this page Feb 21, 2020 · 12 revisions

The main reason for building and configuring this cluster is to use Hadoop's distributed storage and Spark's distributed processing to build, train, test, and assess machine learning models in parallel for projects at work. It is also meant to demonstrate that you may not need to invest in full-powered computers or servers to build a cluster.

At work, I found myself training about 300 XGBoost models on time-series financial data, which took about 7 hours in total. Yep, we don't have powerful computers or servers at work. So the goal here is, at the very least, to reduce the time it takes to train all of the models. Having a separate cluster also lets me keep working on other tasks while the cluster takes the load.

Yes, of course, a full computer may have much more power in terms of CPU speed, RAM, and other hardware. However, not everyone has that much money to spend (myself included). That is where Raspberry Pis come in.