<a href="https://colab.research.google.com/github/CRforty6/Apache-Spark-CodeLabs/blob/main/CodeLab1_ApacheSpark1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🔥CodeLab 1️: Apache Spark for Large-scale Data Processing #ApacheSpark #Part1🔥

Apache Spark is a powerful distributed computing framework that's great for processing large datasets. <br />
In this CodeLab, I'll guide you through the following steps in Google Colab:

⛏️ Set up Apache Spark environment (and SparkSession) <br/>
🚀 Run a simple Spark application (with data load + operations) <br/>
🔭 Launch the Spark Web UI (to monitor applications)

<br />

---

# ⛏️Set up Apache Spark environment (and SparkSession)

We will use the following components:
1. Java 8, 11 or 17
2. Python 3.8+ (Spark supports Java, Scala 2.12 or 2.13, Python 3.8+ and R 3.5+)
3. Spark 3.5.1 (Feb 2024 release)
4. The findspark Python package

*Alternatively, skip 3 & 4 above and install the PySpark Python package (`pip install pyspark`) if you only need Python support.*
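
To see which of the two routes a runtime already supports, you can check whether the packages are importable. This is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of Spark or findspark.

```python
import importlib.util

def is_installed(package_name: str) -> bool:
    """Return True if the top-level package can be imported in this runtime."""
    return importlib.util.find_spec(package_name) is not None

# Report which of the two setup routes is already available
for name in ("pyspark", "findspark"):
    status = "available" if is_installed(name) else "not installed"
    print(f"{name}: {status}")
```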

## 1️⃣: Install Apache Spark and its dependencies

Let's install Apache Spark and its dependencies. <br />


In [None]:
# Check Java installation on Colab. Colab already has Java available
!java -version

In [None]:
# Check Python installation on Colab. Colab already has Python available too
!python --version

In [None]:
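# Download Spark 3.5.1
# (if this link returns 404, older releases are kept at https://archive.apache.org/dist/spark/)
!wget -q https://dlcdn.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz

# Extract the Spark archive
!tar xf spark-3.5.1-bin-hadoop3.tgz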

# Install findspark to use spark with python
!pip install -q findspark

In [None]:
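# Set the environment variables to point to the Java and Spark install locations
import os
# Adjust this path if `java -version` above reports a different Java version
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
# Use an absolute path so Spark is still found if the working directory changes
os.environ["SPARK_HOME"] = os.path.abspath("spark-3.5.1-bin-hadoop3")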
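
Before starting Spark, it can help to confirm that both variables point at real directories, since a wrong path usually surfaces later as a less obvious error. This is a minimal sketch using only the standard library; `check_env_dirs` is a hypothetical helper name.

```python
import os

def check_env_dirs(names):
    """Map each variable name to True if it is set and points to an existing directory."""
    result = {}
    for name in names:
        path = os.environ.get(name)
        result[name] = bool(path) and os.path.isdir(path)
    return result

# Report the state of the two variables set above
for name, ok in check_env_dirs(["JAVA_HOME", "SPARK_HOME"]).items():
    print(f"{name}: {'OK' if ok else 'missing or not a directory'}")
```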

## 2️⃣: Start a SparkSession
Let's create a SparkSession and print its version.

In [None]:
# Initialize findspark
import findspark
findspark.init()

# Create a PySpark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("Codelab1").getOrCreate()


In [None]:
# Display the SparkSession summary (Spark version, master and app name)
spark

# 🚀Run a simple Spark application (with data load + operations)

We'll load data in a few different ways and perform some basic operations: <br />
* Manual data entry
* Data file in directory
* External data file

## 1️⃣: Data load and basic operations - manual data entry

Create a DataFrame from manually entered data and perform a basic operation on it.

In [None]:
# Create a DataFrame with manually input data
data = [("Alice", 34), ("Bob", 45), ("Charlie", 27)]
df = spark.createDataFrame(data, ["Name", "Age"])

# Keep only the rows where Age is greater than 30
df_filtered = df.filter(df.Age > 30)

# Show the result
df_filtered.show()
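
For intuition, the filter above keeps the rows whose `Age` exceeds 30. The same logic in plain Python, shown only as an illustration and not as how Spark executes it, looks like this:

```python
# Same rows as the DataFrame above
data = [("Alice", 34), ("Bob", 45), ("Charlie", 27)]

# Plain-Python equivalent of df.filter(df.Age > 30)
filtered = [(name, age) for name, age in data if age > 30]

print(filtered)  # [('Alice', 34), ('Bob', 45)]
```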

## 2️⃣: Data load and basic operations - data file in directory

Create a DataFrame from a CSV file in Google Colab's `sample_data` folder and perform basic operations on it.

In [None]:
# Create a DataFrame from a CSV file in Colab's sample_data directory
df = spark.read.csv("sample_data/california_housing_test.csv", header=True, inferSchema=True)

# View the first few rows
df.show(5)

In [None]:
# View the table schema
df.printSchema()

In [None]:
# Perform some count operations
print("Number of rows:", df.count())
print("Number of columns:", len(df.columns))

## 3️⃣: Data load and basic operations - external data file

Create a DataFrame from a CSV file stored in a GitHub repository and perform basic operations on it.

In [None]:
# Create a DataFrame from an external data file
from pyspark import SparkFiles
url_ext = 'https://raw.githubusercontent.com/fivethirtyeight/data/master/nba-elo/nbaallelo.csv' # sample sports data - NBA

# Download the file, then read it from the local path Spark stored it under
spark.sparkContext.addFile(url_ext)
df = spark.read.csv(SparkFiles.get("nbaallelo.csv"), inferSchema=True, header=True)

# View the first few rows
df.show(5)
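
`SparkFiles.get` expects the base name of the file that `addFile` downloaded, which is the last path segment of the URL. Rather than typing it by hand, it can be derived from the URL. This is a minimal sketch using the standard library; `filename_from_url` is a hypothetical helper.

```python
import posixpath
from urllib.parse import urlparse

def filename_from_url(url: str) -> str:
    """Return the last path segment of a URL, ignoring any query string."""
    return posixpath.basename(urlparse(url).path)

url_ext = "https://raw.githubusercontent.com/fivethirtyeight/data/master/nba-elo/nbaallelo.csv"
print(filename_from_url(url_ext))  # nbaallelo.csv
```

The result could then be passed to `SparkFiles.get(...)` in place of the hard-coded file name.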

In [None]:
# Group data and calculate averages
avg_points = df.groupBy("team_id").agg({"pts": "avg"}).withColumnRenamed("avg(pts)", "avg_points")

# Show the result
avg_points.show()
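
To make the aggregation concrete, here is a plain-Python sketch of what grouping by `team_id` and averaging `pts` computes. The rows are made up for illustration and are not taken from the NBA dataset; Spark performs the same calculation in a distributed way.

```python
from collections import defaultdict

# Hypothetical (team_id, pts) rows; these values are made up for illustration
rows = [("AAA", 100), ("AAA", 110), ("BBB", 90)]

# Accumulate the sum and count of points per team
totals = defaultdict(lambda: [0, 0])
for team_id, pts in rows:
    totals[team_id][0] += pts
    totals[team_id][1] += 1

# Divide sum by count to get the per-team average
avg_by_team = {team: total / count for team, (total, count) in totals.items()}
print(avg_by_team)  # {'AAA': 105.0, 'BBB': 90.0}
```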

In [None]:
# Import the Spark SQL `col` function to reference columns by name
from pyspark.sql.functions import col

# Keep only the rows for the Lakers franchise
lakers_data = df.filter(col("fran_id") == "Lakers")

# Show the result - first few rows
lakers_data.show(5)

# 🔭Launch the Spark Web UI to Monitor Applications

The Spark Web UI provides detailed information about running Spark applications, including job progress, resource utilization, and DAG visualization.

We'll start a tunnel to access the Spark UI on Google Colab.
1. Sign in to https://dashboard.ngrok.com/get-started/your-authtoken and copy your authtoken
2. Run the code below and enter your authtoken
3. The Spark UI will be accessible at the provided *temporary public URL*

In [None]:
# Start a tunnel to access the Spark UI on Google Colab
!pip install -q pyngrok
from pyngrok import ngrok, conf
import getpass

print("Enter your authtoken, which can be copied "
      "from https://dashboard.ngrok.com/get-started/your-authtoken")
conf.get_default().auth_token = getpass.getpass()

# Spark serves its web UI on port 4040 by default
ui_port = 4040
public_url = ngrok.connect(ui_port).public_url
print(f" * ngrok tunnel \"{public_url}\" -> \"http://127.0.0.1:{ui_port}\"")
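
Spark serves the UI on port 4040 by default and moves to the next port (4041, 4042, and so on) if that one is taken, so the tunnel can point at the wrong port when several sessions are running. This is a minimal sketch, using only the standard library, that looks for the first port in that range accepting connections before a tunnel is opened. `first_open_port` is a hypothetical helper, and an open port is not proof that it belongs to Spark; in PySpark, `spark.sparkContext.uiWebUrl` reports the actual UI address.

```python
import socket

def first_open_port(ports, host="127.0.0.1", timeout=0.5):
    """Return the first port in `ports` that accepts a TCP connection, or None."""
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            # connect_ex returns 0 when the connection succeeds
            if sock.connect_ex((host, port)) == 0:
                return port
    return None

detected_port = first_open_port(range(4040, 4045))
print("Spark UI port:", detected_port if detected_port is not None else "none found")
```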

# ⛳Summary

**What we covered**
* Set up Apache Spark environment and SparkSession
* Run a simple Spark application to test the setup
* Launch the Spark web UI to monitor applications

**Additional Resources:**
* [Apache Spark Documentation](https://spark.apache.org/docs/latest/): In-depth knowledge and tutorials.
* Online courses and tutorials for advanced Apache Spark learning.