# Day 2: Setting up Apache Spark in Google Colab


Today, I started hands-on learning by **setting up PySpark** in Google Colab! Google Colab is a free cloud-based notebook environment that makes it easy to experiment with PySpark without installing anything on your own computer.

## Why Choose Google Colab?

Google Colab lets you run Python code in the cloud and provides access to GPU/TPU resources. It’s an ideal environment to test and run PySpark code directly in your browser, with no local installation required.

## **Steps to Set Up PySpark on Google Colab**

### **Step 1**: Open a new notebook in Google Colab.

### **Step 2**: Install PySpark and Java:

Since Spark requires Java to run, we need to install both:

```python
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!pip install pyspark
```



#### Explanation:


*   `apt-get` installs OpenJDK 8, an open-source implementation of Java. The `headless` variant leaves out GUI libraries, which Spark does not need.
*   `pip` installs the PySpark library, which bundles Spark itself, so no separate Spark download is needed.
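
Before moving on, it can help to confirm that Java is actually available. The following is a minimal sketch, not part of the original setup: the helper name `java_version` is made up for illustration, and it only assumes that a working install puts a `java` executable on the `PATH`.

```python
import shutil
import subprocess


def java_version():
    """Return the first line of `java -version`, or None if java is not on PATH.

    Hypothetical helper for illustration only.
    """
    java = shutil.which("java")
    if java is None:
        return None
    # `java -version` prints its banner to stderr rather than stdout.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    banner = (result.stderr or result.stdout).strip()
    return banner.splitlines()[0] if banner else None


print(java_version() or "Java not found on PATH")
```

If this prints "Java not found on PATH", re-running the install cell is a reasonable first thing to try.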

### **Step 3**: Set the Java environment variable:

```python
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
```


#### Explanation:
Setting JAVA_HOME ensures PySpark can locate the correct Java path, avoiding version conflicts.
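
The path above is hard-coded, so a slightly more defensive variant can check that the directory exists before setting the variable. This is a sketch under stated assumptions: `set_java_home` is a hypothetical helper, and the only candidate path used here is the one from this post.

```python
import os


def set_java_home(candidates):
    """Set JAVA_HOME to the first candidate directory that exists.

    Returns the chosen path, or None if no candidate exists.
    Hypothetical helper for illustration only.
    """
    for path in candidates:
        if os.path.isdir(path):
            os.environ["JAVA_HOME"] = path
            return path
    return None


chosen = set_java_home(["/usr/lib/jvm/java-8-openjdk-amd64"])
print(chosen or "No JDK directory found; check the install step")
```

Passing a list makes it easy to add other candidate locations later without changing the logic.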

### **Step 4**: Create your first Spark Session:

```python
from pyspark.sql import SparkSession

# Create a Spark session (or reuse one that is already running)
spark = SparkSession.builder.appName("PySpark in Colab").getOrCreate()

# Print the Spark version
print(spark.version)
```

Output:

```
3.5.3
```


## Tomorrow: Spark Architecture! 🚀

On Day 3, we’ll dive into Spark’s architecture, including Drivers, Executors, and Cluster Managers. You’ll understand how Spark manages data and executes tasks efficiently. Stay tuned!