<a href="https://colab.research.google.com/github/groda/big_data/blob/master/Setting_up_Spark_Standalone_on_Google_Colab_BigtopEdition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://github.com/groda/big_data"><div><img src="https://github.com/groda/big_data/blob/master/logo_bdb.png?raw=true" align=right width="90" alt="Logo Big Data for Beginners"></div></a>
# Install and run Spark in standalone mode—Apache Bigtop edition <div><img src="https://www.apache.org/logos/res/bigtop/bigtop.png" width="45" style='vertical-align:middle; display:inline;' alt="Apache Bigtop" data-url="https://www.apache.org/logos/#bigtop"><img src="https://www.apache.org/logos/res/spark/spark.png" width="45" style='vertical-align:middle; display:inline;' alt="Apache Spark" data-url="https://www.apache.org/logos/#spark"></div>

<br>

We will install Apache Spark on a single machine (the virtual machine hosting this notebook) in _standalone mode_, meaning it uses Spark's built-in cluster manager rather than an external one such as YARN, Mesos, or Kubernetes. For more information, see the [types of cluster managers supported by Spark](https://spark.apache.org/docs/latest/cluster-overview.html#cluster-manager-types).

We're following the official [Spark Standalone documentation](https://spark.apache.org/docs/latest/spark-standalone.html), using Apache Bigtop's Spark distribution, which conveniently packages Spark's start scripts as services.




Before running this notebook, you may want to update the Bigtop version (currently version `3.2.1` from 2023-08-22 [with Hadoop 3.3.5](https://bigtop.apache.org/release-notes.html), see also the [full list of releases](https://bigtop.apache.org/download.html)).
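
To make the version easier to change, the repository URL could be derived from a single variable. The following is a minimal sketch, assuming the URL layout used in the repository cell of this notebook stays the same across releases (this is not verified here). The names `bigtop_list_url` and `BIGTOP_VERSION` are hypothetical and not part of Bigtop.

```shell
# Hypothetical helper: build the bigtop.list URL for a given Bigtop version.
# The URL layout is copied from the repository cell in this notebook; whether
# a given version/distribution combination exists must be checked separately.
bigtop_list_url() {
  version=$1   # Bigtop version, e.g. 3.2.1
  distro=$2    # lowercase distributor ID, e.g. ubuntu
  release=$3   # distribution release number, e.g. 22.04
  printf 'https://archive.apache.org/dist/bigtop/bigtop-%s/repos/%s-%s/bigtop.list\n' \
    "$version" "$distro" "$release"
}

BIGTOP_VERSION=3.2.1   # change this one value to target another release
bigtop_list_url "$BIGTOP_VERSION" ubuntu 22.04
```

On the Colab machine, the last two arguments could come from `lsb_release -is | tr '[:upper:]' '[:lower:]'` and `lsb_release -rs`, as in the installation cell.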

### A side note:





<a href="https://spark.apache.org/"><img src="https://www.apache.org/logos/res/spark/spark.png" width="120" align="right" style='vertical-align:middle; display:inline;' alt="Apache Spark" data-url="https://www.apache.org/logos/#spark"></a>
<a href="https://bigtop.apache.org/"><img src="https://www.apache.org/logos/res/bigtop/bigtop.png" width="120" align="right" style='vertical-align:middle; display:inline;' alt="Apache Bigtop" data-url="https://www.apache.org/logos/#bigtop"></a>


I recently discovered [a website](https://www.apache.org/logos/) where you can find all Apache project logos, including Spark, with transparent backgrounds. It’s a great resource for anyone needing these assets for presentations or documentation.





## Install Spark from Bigtop repository

In [1]:
%%bash
# Add Bigtop repository
echo "Adding Bigtop repository..."
curl -o /etc/apt/sources.list.d/bigtop-3.2.1.list https://archive.apache.org/dist/bigtop/bigtop-3.2.1/repos/$(lsb_release -is | tr '[:upper:]' '[:lower:]')-$(lsb_release -rs)/bigtop.list

# Download and add the Bigtop GPG key
echo "Adding Bigtop GPG key..."
wget --no-clobber -qO - https://archive.apache.org/dist/bigtop/bigtop-3.2.1/repos/GPG-KEY-bigtop | sudo apt-key add -

# Update package cache
echo "Updating package cache..."
apt update

Adding Bigtop repository...
Adding Bigtop GPG key...
OK
Updating package cache...
Hit:1 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:2 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Get:4 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:5 http://repos.bigtop.apache.org/releases/3.2.1/ubuntu/22.04/amd64 bigtop InRelease [2,502 B]
Get:6 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
Get:7 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Ign:8 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Get:9 https://r2u.stat.illinois.edu/ubuntu jammy Release [5,713 B]
Get:10 https://r2u.stat.illinois.edu/ubuntu jammy Release.gpg [793 B]
Hit:11 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:12 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0100    86  100    86    0     0    153      0 --:--:-- --:--:-- --:--:--   153


W: http://repos.bigtop.apache.org/releases/3.2.1/ubuntu/22.04/amd64/dists/bigtop/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)


### Explore Bigtop and Spark packages

In [2]:
%%bash
echo 'List all available packages that match "bigtop"'
apt search bigtop

echo 'List all available packages that match "spark"'
apt search  spark

List all available packages that match "bigtop"
Sorting...
Full Text Search...
bigtop-ambari-mpack/stable 2.7.5.0-1 all
  Ambari Mpack

bigtop-groovy/stable 2.5.4-1 all
  An agile and dynamic language for the Java Virtual Machine

bigtop-jsvc/stable 1.2.4-1 amd64
  Application to launch java daemon

bigtop-utils/stable 3.2.1-1 all
  Collection of useful tools for Bigtop

List all available packages that match "spark"
Sorting...
Full Text Search...
alluxio/stable 2.8.0-2 all
  Reliable file sharing at memory speed across cluster frameworks

libjs-jquery.sparkline/jammy 2.1.2-3 all
  library for jQuery to generate sparklines

libsparkline-php/jammy 0.2-7 all
  sparkline graphing library for php

livy/stable 0.7.1-1 all
  Livy is an open source REST interface for interacting with Apache Spark from anywhere.

node-sparkles/jammy 1.0.1-2 all
  Namespaced global event emitter

nspark/jammy 1.7.8B2+git20210317.cb30779-2 amd64
  Unarchiver for Spark and ArcFS files

pcp-export-pcp2spark/jammy 







### Install the essential packages

To run a Spark job, we need the core libraries as well as the Spark master and the Spark worker. In this setup, the master and the worker both run on the same machine, localhost.

The package `bigtop-utils` will be used to start the services.

In [3]:
%%bash
for p in spark-core spark-master spark-worker bigtop-utils; do
  echo "🛠️ Installing $p"
  apt install -y $p
done

🛠️ Installing spark-core
Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
  bigtop-groovy bigtop-jsvc bigtop-utils hadoop hadoop-client hadoop-hdfs hadoop-mapreduce
  hadoop-yarn netcat-openbsd zookeeper
The following NEW packages will be installed:
  bigtop-groovy bigtop-jsvc bigtop-utils hadoop hadoop-client hadoop-hdfs hadoop-mapreduce
  hadoop-yarn netcat-openbsd spark-core zookeeper
0 upgraded, 11 newly installed, 0 to remove and 50 not upgraded.
Need to get 707 MB of archives.
After this operation, 869 MB of additional disk space will be used.
Get:1 http://repos.bigtop.apache.org/releases/3.2.1/ubuntu/22.04/amd64 bigtop/contrib amd64 bigtop-utils all 3.2.1-1 [5,432 B]
Get:2 http://archive.ubuntu.com/ubuntu jammy/main amd64 netcat-openbsd amd64 1.218-4ubuntu1 [39.4 kB]
Get:3 http://repos.bigtop.apache.org/releases/3.2.1/ubuntu/22.04/amd64 bigtop/contrib amd64 bigtop-groovy all 2.5.4-1 [4,832 kB]












**Note:** A future version of this notebook will use an alternative to `apt` for installing packages, in order to avoid the warning

```
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
```

## Start Spark services

Thanks to Bigtop's utilities, we can now start the Spark master and Spark worker as services. Normally one would use `systemctl`, but since this is not possible on Colab, we resort to `service`.

In [4]:
%%bash
for p in spark-master spark-worker; do
  echo "Starting $p"
  # systemctl start $p
  service $p start
done

Starting spark-master
 * Starting Spark master (spark-master): 
Starting spark-worker
 * Starting Spark worker (spark-worker): 
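
The service scripts may return before the daemons accept connections. The following is a minimal sketch for waiting until a port is reachable. It assumes bash (it relies on bash's `/dev/tcp` redirection) and the master port `7077` used later in this notebook. `wait_for_port` is a hypothetical helper, not a Bigtop or Spark utility.

```shell
# Hypothetical helper: poll a TCP port once per second until it accepts a
# connection or the number of attempts is exhausted.
# Requires bash: /dev/tcp/<host>/<port> is a bash feature, not a real file.
wait_for_port() {
  host=$1
  port=$2
  tries=${3:-30}   # number of attempts, one second apart (assumed default)
  i=0
  while [ "$i" -lt "$tries" ]; do
    # The subshell opens a connection on file descriptor 3 and exits,
    # which closes the descriptor again.
    if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}
```

For example, `wait_for_port "$HOSTNAME" 7077 30 && echo "master is reachable"` would poll the master port for roughly 30 seconds before giving up.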


## Run the `pi` example

This step may take some time.

We'll run the `SparkPi` demo from the examples included in the Spark distribution, which are packaged in the `spark-examples*.jar` file.

We'll submit the job using [`spark-submit`](https://spark.apache.org/docs/latest/submitting-applications.html), and the output will be an approximation of π (for more details, see the [official Spark examples](https://spark.apache.org/examples.html)).


The cell below first defines the variable `$EXAMPLES_JAR`, which points to the archive containing all the examples from the Spark distribution.

The following command submits the SparkPi application (located in the `org.apache.spark.examples.SparkPi` class) to the Spark master at `spark://${HOSTNAME}:7077` using `spark-submit`:

```
$SPARK_HOME/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://${HOSTNAME}:7077 \
  $EXAMPLES_JAR \
  100
```

In this example, the argument $100$ is the number of slices (partitions) the work is split into. SparkPi samples 100,000 random points per slice in the square enclosing the unit circle and estimates π as four times the fraction of points that fall inside the circle.
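
The sampling idea behind the estimate can also be tried outside Spark. The following is a minimal single-process sketch using `awk`, not the SparkPi source. The function name `estimate_pi` and its two parameters (number of points and random seed) are hypothetical.

```shell
# Monte Carlo estimate of pi: sample points uniformly in the square
# [-1, 1] x [-1, 1] and count those that fall inside the unit circle.
estimate_pi() {
  # $1 = number of random points, $2 = seed for the random generator
  awk -v n="$1" -v seed="$2" 'BEGIN {
    srand(seed)
    inside = 0
    for (i = 0; i < n; i++) {
      x = rand() * 2 - 1
      y = rand() * 2 - 1
      if (x * x + y * y <= 1) inside++
    }
    # area(circle) / area(square) = pi / 4, hence the factor 4
    printf "%.4f\n", 4 * inside / n
  }'
}

estimate_pi 100000 42
```

The exact digits depend on the `awk` implementation and the seed. With 100,000 points the estimate typically agrees with π to about two decimal places, and more points narrow the error only slowly.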

In [5]:
%%bash

export EXAMPLES_JAR=$(find $(which hadoop|awk -F 'bin/hadoop' '{print $1}') -name 'spark-examples.jar' -print -quit)

$SPARK_HOME/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://${HOSTNAME}:7077 \
  $EXAMPLES_JAR \
  100

Pi is roughly 3.1423215142321514


SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2024-10-21 11:30:54,510 INFO spark.SparkContext: Running Spark version 3.2.3
2024-10-21 11:30:55,572 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2024-10-21 11:30:56,053 INFO resource.ResourceUtils: No custom resources configured for spark.driver.
2024-10-21 11:30:56,061 INFO spark.SparkContext: Submitted application: Spark Pi
2024-10-21 11:30:56,236 INFO resource.ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount:

## Run the Java Random Forest Regressor example

Next, we will run the Java Random Forest Regressor example. Source: [JavaRandomForestRegressorExample.java](https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/ml/JavaRandomForestRegressorExample.java).

In [6]:
%%bash
 j=$(find $(which hadoop|awk -F 'bin/hadoop' '{print $1}') -name 'spark-examples.jar' -print -quit)
 echo "Jar file containing examples: $j"

Jar file containing examples: /usr/lib/spark/examples/jars/spark-examples.jar
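
Since the same lookup is repeated in several cells, it could be wrapped in a small function. This is a sketch under assumptions: `find_examples_jar` is a hypothetical name, and the default search root `/usr` is chosen only because the jar was found under `/usr/lib/spark` in the output above.

```shell
# Hypothetical helper: print the first spark-examples.jar found under a root
# directory (default: /usr). Prints nothing if no such file exists.
find_examples_jar() {
  root=${1:-/usr}
  find "$root" -name 'spark-examples.jar' -print -quit 2>/dev/null
}

# Guarded usage: only report the jar if the search returned a path.
EXAMPLES_JAR=$(find_examples_jar /usr || true)
if [ -n "$EXAMPLES_JAR" ]; then
  echo "Jar file containing examples: $EXAMPLES_JAR"
else
  echo "spark-examples.jar not found under /usr" >&2
fi
```

Checking for an empty result before calling `spark-submit` avoids submitting a job with a missing jar argument.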


If you run

```
%%bash
j=$(find $(which hadoop|awk -F 'bin/hadoop' '{print $1}') -name 'spark-examples.jar' -print -quit)
spark-submit --class  org.apache.spark.examples.ml.JavaRandomForestRegressorExample $j
```

you'll get an error message telling you that the file `/content/data/mllib/sample_libsvm_data.txt` is missing. The fix is to place a copy of this file at that path, but first we need to find it on the system!

In [7]:
!find / -name 'sample_libsvm_data.txt' 2> /dev/null

/usr/lib/spark/data/mllib/sample_libsvm_data.txt
/usr/local/lib/python3.10/dist-packages/pyspark/data/mllib/sample_libsvm_data.txt


Copy the data file to the expected location:

In [8]:
%%bash
mkdir -p /content/data/mllib/
cp /usr/lib/spark/data/mllib/sample_libsvm_data.txt /content/data/mllib/sample_libsvm_data.txt
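
The find-then-copy steps can be combined so that the copy only happens when the file was actually located. This is a minimal sketch: `stage_data_file` is a hypothetical helper, and the search root `/usr/lib/spark/data` is taken from the `find` output above.

```shell
# Hypothetical helper: find a file by name under a root directory and copy
# the first match into dest_dir, creating dest_dir if necessary.
stage_data_file() {
  name=$1
  root=$2
  dest_dir=$3
  src=$(find "$root" -name "$name" -print -quit 2>/dev/null)
  if [ -z "$src" ]; then
    echo "not found: $name under $root" >&2
    return 1
  fi
  mkdir -p "$dest_dir" && cp "$src" "$dest_dir/"
}

# Only attempt the copy where Bigtop's Spark data directory exists.
if [ -d /usr/lib/spark/data ]; then
  stage_data_file sample_libsvm_data.txt /usr/lib/spark/data /content/data/mllib
fi
```

Restricting the search to Bigtop's Spark directory avoids picking up the copy that ships with the `pyspark` Python package.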

Run the `JavaRandomForestRegressorExample` class:

In [9]:
%%bash
j=$(find $(which hadoop|awk -F 'bin/hadoop' '{print $1}') -name 'spark-examples.jar' -print -quit)
spark-submit --class  org.apache.spark.examples.ml.JavaRandomForestRegressorExample $j

+----------+-----+--------------------+
|prediction|label|            features|
+----------+-----+--------------------+
|       0.0|  0.0|(692,[123,124,125...|
|       0.0|  0.0|(692,[124,125,126...|
|       0.0|  0.0|(692,[124,125,126...|
|       0.0|  0.0|(692,[126,127,128...|
|       0.0|  0.0|(692,[126,127,128...|
+----------+-----+--------------------+
only showing top 5 rows

Root Mean Squared Error (RMSE) on test data = 0.17159383568311662
Learned regression forest model:
RandomForestRegressionModel: uid=rfr_460e2087c372, numTrees=20, numFeatures=692
  Tree 0 (weight 1.0):
    If (feature 489 <= 5.5)
     Predict: 0.0
    Else (feature 489 > 5.5)
     Predict: 1.0
  Tree 1 (weight 1.0):
    If (feature 517 <= 66.5)
     Predict: 0.0
    Else (feature 517 > 66.5)
     Predict: 1.0
  Tree 2 (weight 1.0):
    If (feature 406 <= 126.5)
     Predict: 0.0
    Else (feature 406 > 126.5)
     Predict: 1.0
  Tree 3 (weight 1.0):
    If (feature 406 <= 126.5)
     Predict: 0.0
    Else (f

24/10/21 11:31:32 INFO SparkContext: Running Spark version 3.5.3
24/10/21 11:31:33 INFO SparkContext: OS info Linux, 6.1.85+, amd64
24/10/21 11:31:33 INFO SparkContext: Java version 11.0.24
24/10/21 11:31:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/10/21 11:31:33 INFO ResourceUtils: No custom resources configured for spark.driver.
24/10/21 11:31:33 INFO SparkContext: Submitted application: JavaRandomForestRegressorExample
24/10/21 11:31:33 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
24/10/21 11:31:33 INFO ResourceProfile: Limiting resource is cpu
24/10/21 11:31:33 INFO ResourceProfileManager: Added ResourceProfile id: 0
24/10/21 11:31:33 INFO Secur

## Summary

In this guide, we installed the essential Spark packages (`spark-core`, `spark-master`, and `spark-worker`) from the Bigtop distribution and used Bigtop's service scripts to start a Spark master and a Spark worker on a single machine. We then ran two example jobs included in the Spark package.