<a href="https://colab.research.google.com/github/dialdfordata/PySpark_Tutorial/blob/main/00_Google_Colab_PySpark_Connection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **PySpark on Google Colab Environment**

In this notebook we will go through the series of tasks needed to run PySpark in the Google Colab environment.

As you know, Google Colab is an online notebook-like coding environment that is well-suited for machine learning and data analysis.

It comes equipped with many machine learning libraries and even offers free GPU usage. It is mainly used by data scientists and ML engineers.

We will first install OpenJDK, an open-source implementation of the Java Development Kit (JDK). Apache Spark is written in Scala, which runs on the Java Virtual Machine (JVM). OpenJDK provides an open-source implementation of the JVM, which Spark needs in order to run.

Google Colab runs on a Linux backend, so we will use Linux commands to install OpenJDK. The first command, ``!sudo apt update``, refreshes the package lists from the repositories so that the latest information about available packages can be retrieved.

In [None]:
# Open JDK installation

!sudo apt update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

Next, we will install OpenJDK version 8.

``!apt-get install openjdk-8-jdk-headless -qq > /dev/null`` :

This command installs the headless version of OpenJDK 8. The ``-qq`` flag makes the installation quiet and non-interactive, and redirecting standard output to ``/dev/null`` suppresses most of what remains. The "headless" version excludes graphical components, making it suitable for server environments or automated tasks.
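
Because the installation output is suppressed, it can help to confirm that Java is actually available before moving on. The following is a minimal sketch, not part of the original notebook; the helper name `java_version_banner` is hypothetical, and it only assumes that the `java` executable, if installed, is on the `PATH`.

```python
import shutil
import subprocess


def java_version_banner():
    """Return the text printed by `java -version`, or None if java is not on PATH."""
    java = shutil.which("java")
    if java is None:
        return None
    # `java -version` writes its banner to stderr rather than stdout.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    return (result.stderr or result.stdout).strip()


banner = java_version_banner()
print(banner if banner else "Java not found on PATH")
```

If the message says Java was not found, rerun the installation cell without the ``> /dev/null`` redirect to see what went wrong.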

Next, we need to download Spark 3.5.3, which is available from the official Apache Spark website.

In [None]:
# Download spark-3.5.3-bin-hadoop3, a specific distribution of Apache Spark
!wget -q https://dlcdn.apache.org/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3.tgz

This code block uses the ``wget`` command to download a file from the specified URL. The ``-q`` flag enables quiet mode, suppressing most output to the terminal. Specifically, it retrieves the Apache Spark 3.5.3 binary package built for Hadoop 3, compressed as a ``.tgz`` archive, from the provided Apache mirror URL.

Next, we need to unpack the downloaded Spark archive.
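
The download URL follows a regular pattern built from the Spark and Hadoop versions, so it can be generated instead of typed by hand. The sketch below is an illustration only: the helper `spark_download_urls` is hypothetical, and the second URL assumes that releases no longer served from `dlcdn.apache.org` are kept under `archive.apache.org/dist/spark/` with the same layout.

```python
def spark_download_urls(version="3.5.3", hadoop="3"):
    """Build candidate download URLs for a Spark binary package (hypothetical helper)."""
    package = f"spark-{version}-bin-hadoop{hadoop}.tgz"
    return [
        # Primary location used in this notebook.
        f"https://dlcdn.apache.org/spark/spark-{version}/{package}",
        # Assumed fallback: the Apache archive, for releases removed from the CDN.
        f"https://archive.apache.org/dist/spark/spark-{version}/{package}",
    ]


for url in spark_download_urls():
    print(url)
```

If ``wget`` fails silently because of ``-q``, trying the second URL is a reasonable next step.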

In [None]:
# Unpack the contents of the spark-3.5.3-bin-hadoop3.tgz file into the file system
!tar xf spark-3.5.3-bin-hadoop3.tgz

This code block extracts the contents of the ``spark-3.5.3-bin-hadoop3.tgz`` archive using the ``tar`` command. The ``xf`` options specify that the archive should be e**x**tracted (``x``) and that the archive's **f**ile name (``spark-3.5.3-bin-hadoop3.tgz``) follows. This command unpacks the Spark binary distribution into the current working directory, making its files and directories accessible for use.

Now that Spark is downloaded and unpacked, we need to define two environment variables, one for Java and one for Spark. The ``JAVA_HOME`` environment variable points to the Java installation directory on the machine and is essential for Spark. The ``SPARK_HOME`` environment variable points to the Apache Spark installation directory, and Spark uses it to locate its own components and libraries.
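
The same extraction can be done from Python with the standard-library `tarfile` module. This is a minimal sketch rather than the notebook's method; the helper name `extract_archive` and the small demo archive are made up so that the example runs without the Spark download.

```python
import os
import tarfile
import tempfile


def extract_archive(archive_path, dest_dir="."):
    """Extract a gzip-compressed tar archive and return its top-level entry names."""
    with tarfile.open(archive_path, "r:gz") as archive:
        top_level = sorted({m.name.split("/")[0] for m in archive.getmembers()})
        archive.extractall(dest_dir)
    return top_level


# Demo on a tiny throwaway archive instead of the real Spark package.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "demo-package")
    os.makedirs(src)
    with open(os.path.join(src, "README.txt"), "w") as f:
        f.write("hello")

    archive_path = os.path.join(tmp, "demo-package.tgz")
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(src, arcname="demo-package")

    out_dir = os.path.join(tmp, "out")
    os.makedirs(out_dir)
    top = extract_archive(archive_path, out_dir)
    extracted_ok = os.path.isfile(os.path.join(out_dir, "demo-package", "README.txt"))

print(top, extracted_ok)  # ['demo-package'] True
```

For the Spark package, the top-level entry is the directory that ``SPARK_HOME`` should later point to.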

In [None]:
# Configuration of environment variables
# JAVA_HOME points to the Java installation directory on the machine and is essential for Spark.
# SPARK_HOME points to the Apache Spark installation directory; Spark uses it to locate its own components and libraries.
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.3-bin-hadoop3"
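
A typo in either path only shows up later as a confusing Spark startup error, so a quick check can save time. The sketch below is an optional addition; the helper name `missing_env_dirs` is hypothetical, and it simply reports variables that are unset or do not point at an existing directory.

```python
import os


def missing_env_dirs(names=("JAVA_HOME", "SPARK_HOME")):
    """Return {variable: value} for variables that are unset or not an existing directory."""
    problems = {}
    for name in names:
        value = os.environ.get(name)
        if value is None or not os.path.isdir(value):
            problems[name] = value
    return problems


problems = missing_env_dirs()
if problems:
    print("Check these variables:", problems)
else:
    print("JAVA_HOME and SPARK_HOME both point at existing directories")
```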

Further, we need to install three Python libraries. The first is ``findspark``, which makes it easier to configure and launch Apache Spark in local environments such as this development machine. We also need ``py4j`` (Python for Java), which lets Python code call into the JVM, and an up-to-date version of ``pyspark``.

In [None]:
# Installation of the required libraries

!pip install -q findspark
!pip install py4j -q
!pip install pyspark -qU


Next, we import the findspark library, a utility that simplifies setting up PySpark (Spark's Python API) in environments like Jupyter notebooks.

In [None]:
# findspark.init() makes it easy to configure and launch Apache Spark in local development environments
import findspark
findspark.init()

We initialize the library with the ``findspark.init()`` method. It locates the Apache Spark installation (here through ``SPARK_HOME``), adds Spark's Python directories to Python's module search path, and sets the necessary environment variables. This ensures that PySpark modules can be imported and Spark-related commands can run without manual configuration of paths.
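
To make the path handling less of a black box, here is a simplified sketch of the idea. It is not findspark's actual source: the helper name `add_spark_to_sys_path` is hypothetical, and it assumes the usual layout of a Spark binary distribution, with Python sources under `python/` and a bundled `py4j-*.zip` under `python/lib/`.

```python
import glob
import os
import sys


def add_spark_to_sys_path(spark_home):
    """Prepend Spark's Python directory and bundled py4j zip to sys.path.

    Returns the list of paths that should now be importable from.
    """
    spark_python = os.path.join(spark_home, "python")
    py4j_zips = glob.glob(os.path.join(spark_python, "lib", "py4j-*.zip"))
    added = [spark_python] + py4j_zips[:1]
    # Insert in reverse so the final order in sys.path matches `added`.
    for path in reversed(added):
        if path not in sys.path:
            sys.path.insert(0, path)
    return added


spark_home = os.environ.get("SPARK_HOME", "/content/spark-3.5.3-bin-hadoop3")
print(add_spark_to_sys_path(spark_home))
```

In practice, calling ``findspark.init()`` is the simpler choice; the sketch only illustrates why ``import pyspark`` works afterwards.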

Next we need to import the pyspark library.

In [None]:
import pyspark

Now that the pyspark library has been imported successfully, we will create a Spark session, which is the entry point for a Spark program. I will make a separate video to explain Spark Context and Spark Session. For the time being, follow along with me.

In [None]:
# Create a PySpark session
from pyspark.sql import SparkSession

spark = (SparkSession
         .builder
         .appName("PySpark Connection in Google Colab")
         .getOrCreate()
)

spark

Perfect! Our Spark session is up and running. Now we will create a small Spark DataFrame to see PySpark in action.

First, we import the ``DataFrame`` class from ``pyspark.sql``.

Then we define the data and the columns, and finally we create the Spark DataFrame with the ``spark.createDataFrame`` method.

``df.show()`` then prints the DataFrame on the screen.

In [None]:
# Example Data

from pyspark.sql import DataFrame

# Defining the data
data = [
    (1, "Jane", "Admin"),
    (2, "Joe", "HR"),
    (3, "John", "IT"),
    (4, "Mary", "Legal"),
    (5, "Kate", "IT")
]

# Defining the columns
columns = ["id", "name", "dept"]

# Creation of Data Frame
df = spark.createDataFrame(data, columns)

df.show()

+---+----+-----+
| id|name| dept|
+---+----+-----+
|  1|Jane|Admin|
|  2| Joe|   HR|
|  3|John|   IT|
|  4|Mary|Legal|
|  5|Kate|   IT|
+---+----+-----+

