Purpose

Luis Rivera edited this page Feb 20, 2020 · 12 revisions

The main reason for building and configuring this cluster is to leverage Hadoop's and Spark's distributed storage and processing capabilities to build, train, test, and assess Machine Learning models in parallel for different projects at work. It is also meant to demonstrate that you may not need to invest in multiple full-powered computers or servers to build a cluster.

I had to train about 300 XGBoost models on time-series financial data, which would take about 7 hours in total. Yep, we don't have powerful computers or servers at work. The goal here is therefore to at least reduce the time it takes to train all the models. Furthermore, with a separate cluster, I no longer have to use my work computer to train these models, so I can keep working on other tasks while the cluster does the job for me.
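As a rough back-of-the-envelope check on those numbers, the sketch below estimates the best-case wall-clock time when the 300 models are spread evenly across several workers. Only the 300 models and the 7 hours come from the text above. The function name, the worker counts, and especially the assumption that every worker trains a model as fast as the original machine are illustrative (a single Raspberry Pi core is likely slower), so treat the output as an optimistic lower bound and not a measurement.

```python
import math

TOTAL_MODELS = 300   # from the text above
SERIAL_HOURS = 7.0   # from the text above: total time when training one model at a time


def ideal_wall_clock_hours(workers: int) -> float:
    """Best-case hours to train all models on `workers` parallel workers.

    Assumes (hypothetically) that every model takes the same time, that each
    worker is as fast as the original machine, and that there is no
    scheduling, network, or I/O overhead.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    per_model_hours = SERIAL_HOURS / TOTAL_MODELS   # about 84 seconds per model
    rounds = math.ceil(TOTAL_MODELS / workers)      # models each worker trains, at most
    return rounds * per_model_hours


if __name__ == "__main__":
    # Worker counts here are arbitrary examples, not the cluster's real size.
    for workers in (1, 2, 4, 8):
        print(f"{workers} worker(s): {ideal_wall_clock_hours(workers):.2f} h")
```

Because each model is trained independently, this kind of workload divides almost evenly across workers, which is why a cluster of small machines can help even if each one is individually slower.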

Yes, of course, a full computer may have much more power in terms of CPU speed, RAM, and other hardware. However, not everyone has that much money to spend (myself included). That is where Raspberry Pis come in.

Raspberry Pi 4B - Main Specs:

  • Broadcom BCM2711, Quad core Cortex-A72 (ARM v8) 64-bit SoC @ 1.5GHz
  • 1GB, 2GB or 4GB LPDDR4-3200 SDRAM (depending on model)
  • 2.4 GHz and 5.0 GHz IEEE 802.11ac wireless, Bluetooth 5.0, BLE
  • Gigabit Ethernet
  • 2 × micro-HDMI ports (up to 4kp60 supported)
  • 4-pole stereo audio and composite video port
  • H.265 (4kp60 decode), H264 (1080p60 decode, 1080p30 encode)
  • OpenGL ES 3.0 graphics
  • Micro-SD card slot for loading operating system and data storage
  • 5V DC via USB-C connector (minimum 3A*)
  • 5V DC via GPIO header (minimum 3A*)
  • Power over Ethernet (PoE) enabled (requires separate PoE HAT)

Full Specifications: https://www.raspberrypi.org/products/raspberry-pi-4-model-b/specifications/