# Use Spark Connect in Client Applications
### Implemented by Simona Scala - A.Y. 2022/23

## Method 1: Manual Installation

To install Apache Spark on Google Colab manually, we need to follow a series of steps.

**Step 1**: The first step is to install Java, as Spark relies on the Java Virtual Machine (JVM) to run.

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

**Step 2**: Download the latest version of Apache Spark
- Go to Spark's [download page](https://spark.apache.org/downloads.html) and choose a Spark release version and a package type (the default is the latest version).
- Click the link for downloading Spark, and you will be directed to a new web page.
- Copy the first link on the web page, which is below the sentence "*We suggest the following site for your download:*"
- Download Spark from the copied link.
- Unzip the downloaded file to extract its contents.

In [None]:
!wget -q https://dlcdn.apache.org/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz   # download Spark
!tar xf spark-3.4.1-bin-hadoop3.tgz   # unzip the file
!ls -a    # list all files and directories
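The download link follows a regular pattern, so it can be derived from the release version instead of being copied by hand. The sketch below is only illustrative: `spark_download_url` and its parameters are hypothetical names, not part of Spark. It assumes the `hadoop3` package type used above and relies on the fact that releases no longer hosted on `dlcdn.apache.org` are kept on `archive.apache.org`.

```python
# Hypothetical helper: the name and defaults are illustrative, not part of Spark.
def spark_download_url(version, hadoop_profile="hadoop3", archived=False):
    """Build the download URL for a pre-built Spark package."""
    # Current releases are served from dlcdn.apache.org; older ones live in the archive.
    if archived:
        base = "https://archive.apache.org/dist/spark"
    else:
        base = "https://dlcdn.apache.org/spark"
    package = f"spark-{version}-bin-{hadoop_profile}"
    return f"{base}/spark-{version}/{package}.tgz"


url = spark_download_url("3.4.1")
print(url)
```

If `wget` reports that the file was not found, the release has probably been moved, and passing `archived=True` gives the archive location to try instead.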

**Step 3**: Set up the environment variables for Spark

In [None]:
SPARK_VERSION = '3.4.1'

In [None]:
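import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"   # where apt installed the JDK
os.environ["SPARK_HOME"] = f"/content/spark-{SPARK_VERSION}-bin-hadoop3"   # extracted Spark directory
os.environ["SPARK_VERSION"] = SPARK_VERSION   # read later by the shell command that starts the server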
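Before moving on, it can help to confirm that these variables point at directories that actually exist, since a typo here only surfaces later as a less obvious error. The following is a minimal sketch; `check_spark_env` is a hypothetical helper name introduced for illustration.

```python
import os


def check_spark_env(env=os.environ):
    """Return a list of problems; an empty list means both paths look fine."""
    problems = []
    for name in ("JAVA_HOME", "SPARK_HOME"):
        path = env.get(name)
        if not path:
            problems.append(f"{name} is not set")
        elif not os.path.isdir(path):
            problems.append(f"{name} points to a missing directory: {path}")
    return problems


# Print any problems found in the current environment (prints nothing if all is well).
for problem in check_spark_env():
    print(problem)
```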

**Step 4**: Install and import the library for locating Spark

In [None]:
!pip install -q findspark
import findspark
findspark.init()    # initiate findspark
findspark.find()    # check the location for Spark

'/content/spark-3.4.1-bin-hadoop3'
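To see why this step is needed: a manually extracted Spark distribution is not on Python's import path, so `import pyspark` would fail without help. The sketch below approximates what `findspark.init()` arranges, under the assumption that the distribution keeps its Python sources in `python/` and a bundled Py4J archive in `python/lib/`. `spark_python_paths` is a hypothetical name and this is not findspark's actual code.

```python
import glob
import os


def spark_python_paths(spark_home):
    """Return the paths that need to be on sys.path for `import pyspark` to work."""
    spark_python = os.path.join(spark_home, "python")
    # The Spark distribution bundles Py4J as a zip archive under python/lib.
    py4j_zips = sorted(glob.glob(os.path.join(spark_python, "lib", "py4j-*.zip")))
    return [spark_python] + py4j_zips


# Falls back to the directory used in this notebook if SPARK_HOME is not set.
print(spark_python_paths(os.environ.get("SPARK_HOME", "/content/spark-3.4.1-bin-hadoop3")))
```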

**Step 5**: Test the installation by starting a "traditional" Spark session and checking the session information

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark)   # check Spark Session Information
print(type(spark))  # check the type of session

**Step 6**: Next, we connect to Spark remotely using Spark Connect. This means launching a Spark Connect server and then creating a remote Spark session on the client where our application runs. Before we can do that, we need to stop the existing regular Spark session, because it cannot coexist with the remote Spark Connect session we are about to create.

In [None]:
SparkSession.builder.master("local[*]").getOrCreate().stop()

At this point, we are ready to launch the Spark Connect server with the `start-connect-server.sh` script.

In [None]:
!$SPARK_HOME/sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:$SPARK_VERSION

starting org.apache.spark.sql.connect.service.SparkConnectServer, logging to /content/spark-3.4.1-bin-hadoop3/logs/spark--org.apache.spark.sql.connect.service.SparkConnectServer-1-e556dd8aed2b.out


The command above launched the server with its default settings, so it listens for Spark Connect clients on `localhost:15002`.
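The start script returns as soon as the server process has been launched, which can be a few seconds before the port accepts connections. A client that connects too early may fail, so one option is to poll the port first. The sketch below uses only the standard library; `wait_for_port` and its default values are assumptions made for illustration.

```python
import socket
import time


def wait_for_port(host="localhost", port=15002, timeout=60.0, interval=1.0):
    """Return True once a TCP connection succeeds, False if `timeout` seconds pass first."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connection means something is listening on host:port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling `wait_for_port()` right after the start script blocks until the server is reachable or a minute has passed. Note that an open port only shows that a process is listening, not that the server is fully initialized.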

So now we can create a remote Spark session on the client and check the session information.

In [None]:
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
print(spark)   # check Spark Session Information
print(type(spark))  # check the type of session
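Once the remote session exists, DataFrame code is written the same way as with a regular session, while the work runs on the server. The sketch below builds the connection string and runs a tiny query; `connect_url` and `run_demo` are hypothetical helpers, and `run_demo` assumes that PySpark is installed and that a Spark Connect server is reachable at the given address.

```python
def connect_url(host="localhost", port=15002):
    """Build a Spark Connect connection string of the form sc://host:port."""
    return f"sc://{host}:{port}"


def run_demo(url):
    """Create a remote session and run a small DataFrame query on the server."""
    # Imported lazily so that connect_url works even where PySpark is not installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.remote(url).getOrCreate()
    df = spark.range(5).withColumnRenamed("id", "n")
    return [row.n for row in df.collect()]


print(connect_url())
```

With the server from this notebook running, `run_demo(connect_url())` sends the query to it and returns the collected values to the client.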

## Method 2: Automatic Installation

The second method of installing PySpark on Google Colab is to use `pip install`.

In [None]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.4.1.tar.gz (310.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 310.8/310.8 MB 3.7 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.4.1-py2.py3-none-any.whl size=311285397 sha256=a5725dd41af05dc2b78e0bf848eed6b0b832c48f9df28203aa7b0f0cfba979f7
  Stored in directory: /root/.cache/pip/wheels/0d/77/a3/ff2f74cc9ab41f8f594dabf0579c2a7c6de920d584206e0834
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.4.1


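To create a remote Spark session, we call the builder's `remote` method with the address of our Spark server when we create the session. The cell below connects to `sc://localhost:15002`, so a Spark Connect server must already be listening at that address.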

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
print(spark)   # check Spark Session Information
print(type(spark))  # check the type of session

pyspark.sql.connect.session.SparkSession

# Warning!

In Spark 3.4, Spark Connect supports most PySpark APIs, including [DataFrame](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html), [Functions](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html), and [Column](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/column.html). However, some APIs such as [SparkContext](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.html) and [RDD](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.html) are not supported. You can check which APIs are currently supported in the [API reference](https://spark.apache.org/docs/latest/api/python/reference/index.html) documentation. Supported APIs are labeled “Supports Spark Connect” so you can check whether the APIs you are using are available before migrating existing code to Spark Connect.
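As a rough first pass before migrating, existing code can be scanned for attribute accesses that usually indicate SparkContext or RDD usage. The sketch below uses Python's `ast` module; `find_unsupported_usages` and the `UNSUPPORTED_ATTRS` set are illustrative assumptions, not an official compatibility check, and the API reference remains the authoritative source.

```python
import ast

# Illustrative only: attribute names that typically indicate SparkContext/RDD usage.
UNSUPPORTED_ATTRS = {"sparkContext", "rdd"}


def find_unsupported_usages(source):
    """Return sorted (line_number, attribute_name) pairs for flagged attribute accesses."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Attribute) and node.attr in UNSUPPORTED_ATTRS:
            hits.append((node.lineno, node.attr))
    return sorted(hits)


# The source is only parsed, never executed, so PySpark is not needed here.
example = """
df = spark.range(10)
sc = spark.sparkContext
pairs = df.rdd.map(lambda r: (r.id, 1))
"""
print(find_unsupported_usages(example))  # prints [(3, 'sparkContext'), (4, 'rdd')]
```

A name-based scan like this can produce false positives (any attribute called `rdd` is flagged) and misses indirect usage, so it is best treated as a starting list for manual review.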

*N.B.: This notebook was last updated in July 2023*