# How to Use PySpark in Google Colab

## Steps:

- **Install dependencies**: Install Java 8, download Spark 3.5.1, and install the `findspark` helper with pip.
- **Set up environment variables**: Point `JAVA_HOME` and `SPARK_HOME` at the Java and Spark installations.
- **Initialize SparkSession**: Create a `SparkSession` object to start working with Spark.
- **Verify installation**: Run a simple PySpark command to confirm everything is set up correctly.

# Install and download required packages

In [1]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
!tar xf spark-3.5.1-bin-hadoop3.tgz
!pip install -q findspark
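
The version `3.5.1` and the build name `hadoop3` appear in the download URL, in the archive name, and later in `SPARK_HOME`. A minimal sketch, assuming you want to keep them in one place: the helper name `spark_paths` and its parameters are hypothetical (not part of Spark or findspark), and the naming pattern is simply the one used in the cell above, assumed to hold for other releases on the archive.

```python
def spark_paths(version="3.5.1", hadoop_build="hadoop3", workdir="/content"):
    """Build the archive URL, archive file name, and SPARK_HOME path.

    Hypothetical helper: the pattern is copied from the download cell
    and is only an assumption for versions other than 3.5.1.
    """
    dirname = f"spark-{version}-bin-{hadoop_build}"
    archive = f"{dirname}.tgz"
    url = f"http://archive.apache.org/dist/spark/spark-{version}/{archive}"
    return {"url": url, "archive": archive, "spark_home": f"{workdir}/{dirname}"}


paths = spark_paths()
print(paths["url"])         # same URL as in the wget command
print(paths["spark_home"])  # directory created by extracting the archive
```

IPython expands `$name` inside `!` commands, so after `url = paths["url"]` the command `!wget -q $url` fetches the same file.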

# Set up environment variables

In [3]:
!pwd

/content


In [4]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.1-bin-hadoop3"
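
A typo in either path tends to surface later as a less obvious error when the session starts. The following is a small pre-flight sketch using only the standard library; `missing_env_paths` is an illustrative name, not a findspark or Spark API.

```python
import os


def missing_env_paths(names=("JAVA_HOME", "SPARK_HOME"), environ=None):
    """Return the variable names that are unset or do not point to a directory."""
    environ = os.environ if environ is None else environ
    # os.path.isdir("") is False, so unset variables are reported too.
    return [name for name in names if not os.path.isdir(environ.get(name, ""))]


problems = missing_env_paths()
if problems:
    print("Check these variables:", ", ".join(problems))
else:
    print("JAVA_HOME and SPARK_HOME both point to existing directories")
```

An empty list means both directories exist, which is the state the next cell expects.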

In [2]:
!ls

sample_data		 spark-3.5.1-bin-hadoop3.tgz
spark-3.5.1-bin-hadoop3  spark-3.5.1-bin-hadoop3.tgz.1


# Import and Set Up Spark Session

In [5]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.conf.set("spark.sql.repl.eagerEval.enabled", True) # Render a DataFrame as a table when it is the last expression in a cell
spark
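
The last step in the list at the top, verifying the installation, can be done with a tiny job. A minimal sketch: `smoke_test` is a hypothetical helper that relies only on `SparkSession.range`, `DataFrame.count`, and the session's `version` attribute.

```python
def smoke_test(session, n=5):
    """Run a trivial job and return (Spark version, number of rows counted)."""
    row_count = session.range(n).count()  # count() is an action, so a real job runs
    return session.version, row_count


# In the notebook, after the cell above has created `spark`:
# print(smoke_test(spark))
```

If the returned count equals `n`, then Java, Spark, and the Python bindings are working together; the version string should match the archive that was downloaded.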

# Example

In [6]:
# Create a small sample DataFrame; event_type is skewed (about 5% "rare", 95% "common")
from pyspark.sql.functions import rand, when

df = (
    spark.range(0, 10)
    .withColumn("user_id", (rand() * 1000).cast("int"))
    .withColumn("event_type", when(rand() > 0.95, "rare").otherwise("common"))
    .withColumn("value", rand())
)

In [7]:
df.show()

+---+-------+----------+-------------------+
| id|user_id|event_type|              value|
+---+-------+----------+-------------------+
|  0|    318|    common| 0.9733141758161218|
|  1|    857|    common|  0.949520095259974|
|  2|    617|    common| 0.1306696785276129|
|  3|     59|    common|0.29992855428951526|
|  4|    667|    common| 0.2870752916621744|
|  5|    665|    common| 0.9361488842300535|
|  6|    491|    common|0.06606442221355968|
|  7|    859|    common|  0.876521488657336|
|  8|    226|    common| 0.8361631377603518|
|  9|    275|    common| 0.6254969578425751|
+---+-------+----------+-------------------+
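
Every row in the output above is `common`, which is the more likely outcome with only 10 rows: `rand()` draws uniformly from [0, 1), so each row is `rare` with probability 0.05. A short worked calculation, assuming the rows are independent draws:

```python
rows = 10
p_rare = 1 - 0.95  # P(rand() > 0.95) for a uniform draw on [0, 1)

p_no_rare = (1 - p_rare) ** rows  # chance that all rows are "common"
expected_rare = rows * p_rare     # average number of "rare" rows per run

print(f"P(no rare rows in {rows}) = {p_no_rare:.3f}")
print(f"Expected rare rows        = {expected_rare:.1f}")
```

The first value is about 0.60, so roughly six runs in ten show no `rare` rows at this size. Because `rand()` is unseeded here, the values also change on every run; increasing the row count in `spark.range` makes the skew visible.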



In [8]:
df.limit(5)

| id | user_id | event_type | value |
|---|---|---|---|
| 0 | 318 | common | 0.9733141758161218 |
| 1 | 857 | common | 0.949520095259974 |
| 2 | 617 | common | 0.1306696785276129 |
| 3 | 59 | common | 0.2999285542895152 |
| 4 | 667 | common | 0.2870752916621744 |
