<a href="https://colab.research.google.com/github/arthi-rajendran-DS/Medium-Implementations/blob/main/PyBytes_Day21.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PySpark on Google Colab

Running PySpark on Google Colab is a convenient way to experiment with Spark in a cloud-hosted environment without installing anything locally. Colab provides a Jupyter notebook interface with free access to a GPU and limited access to a TPU, although Spark in local mode, as set up here, runs on the runtime's CPU cores. Here's how to set up PySpark on Google Colab:

Step 1: Create a new notebook or open an existing one in Google Colab.

Step 2: Install PySpark and findspark libraries.

In [1]:
!pip install pyspark
!pip install findspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.4.0.tar.gz (310.8 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.4.0-py2.py3-none-any.whl size=311317130 sha256=f022db164132de7c536208595e9b554cf7c392f05d4a380057863ae82b564079
  Stored in directory: /root/.cache/pip/wheels/7b/1b/4b/3363a1d04368e7ff0d408e57ff57966fcdf00583774e761327
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.4.0
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting findspark
  Downloading findspark-2.0.1-py2.py3-none-any.whl (4.4 k

Step 3: In a new cell, set up the environment and initialize a SparkSession.

In [2]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()


The findspark library locates the Spark installation in the Colab environment so that pyspark can be imported. The SparkSession.builder.master("local[*]") configuration runs Spark in local mode, with one worker thread per available core.
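To make the "all available cores" part concrete, here is a minimal sketch using only the standard library. It reports how many logical cores Python sees on the runtime and builds the master strings you could pass to the builder. The helper `local_master` is a hypothetical name introduced for illustration, and `os.cpu_count()` is only an approximation of what the JVM behind Spark will detect.

```python
import os

# Logical CPU cores visible to this Python process. local[*] asks Spark for
# one worker thread per logical core, so this approximates the thread count.
cores = os.cpu_count() or 1


def local_master(threads=None):
    """Build a local-mode master URL (hypothetical helper, not a Spark API)."""
    # local[*] means "as many threads as cores"; local[K] means exactly K threads
    return "local[*]" if threads is None else f"local[{threads}]"


print(f"Logical cores: {cores}")
print(local_master())    # string for "use every core"
print(local_master(2))   # string for "use two worker threads"
```

Passing local[2] instead of local[*] to SparkSession.builder.master would cap Spark at two worker threads, which can be useful when you want predictable resource usage.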

Step 4: Write your PySpark code in subsequent cells and run it. The example below builds a word counter over a small text file:
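The word counter expects its input at /content/sample_data/demo.txt. That file is not one of the datasets Colab normally ships in sample_data, so it has to be uploaded or created first. The sketch below creates one. The sample text is an assumption: it was chosen to be consistent with the word counts shown in this notebook's output and is not necessarily the original file. `write_demo_file` is a hypothetical helper name.

```python
import tempfile
from pathlib import Path

# Assumed sample text, consistent with the counts in the notebook output;
# the original demo.txt is not available.
SAMPLE_TEXT = (
    "Hello, this is an example text file.\n"
    "It contains multiple lines of text with words repeating throughout the file.\n"
    "The purpose is to demonstrate word count in PySpark.\n"
    "PySpark is a powerful tool for distributed data processing.\n"
)


def write_demo_file(path):
    """Write the sample text to `path`, creating parent directories if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(SAMPLE_TEXT, encoding="utf-8")
    return target


# In Colab you would pass "/content/sample_data/demo.txt" here; a temporary
# directory is used so the sketch runs anywhere.
demo_path = write_demo_file(Path(tempfile.mkdtemp()) / "sample_data" / "demo.txt")
print(demo_path.read_text(encoding="utf-8"))
```

Alternatively, upload your own text file through the Colab file browser and point textFile at its path.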

In [3]:
# Load the text file into an RDD
lines_rdd = spark.sparkContext.textFile("/content/sample_data/demo.txt")

# Split the lines into words
words_rdd = lines_rdd.flatMap(lambda line: line.split(" "))

# Map each word to a key-value pair
word_count_rdd = words_rdd.map(lambda word: (word, 1))

# Reduce by key to get the count of each word
word_counts = word_count_rdd.reduceByKey(lambda a, b: a + b)

# Collect the word counts and print
results = word_counts.collect()
for (word, count) in results:
    print(f"{word}: {count}")


Hello,: 1
this: 1
is: 3
an: 1
file.: 2
It: 1
multiple: 1
of: 1
The: 1
count: 1
in: 1
PySpark: 1
example: 1
text: 2
contains: 1
lines: 1
with: 1
words: 1
repeating: 1
throughout: 1
the: 1
purpose: 1
to: 1
demonstrate: 1
word: 1
PySpark.: 1
a: 1
powerful: 1
tool: 1
for: 1
distributed: 1
data: 1
processing.: 1
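To see what each RDD transformation contributes, it can help to mirror the pipeline in plain Python on a tiny input. The sketch below is not Spark code. It is a single-machine illustration of the flatMap, map, and reduceByKey steps, using hypothetical input lines and a hypothetical `reduce_by_key` helper.

```python
from itertools import chain

# Hypothetical input lines standing in for the contents of the text file
lines = [
    "PySpark is a tool",
    "Spark is fast",
]

# flatMap: split every line and flatten the pieces into one sequence of words
words = list(chain.from_iterable(line.split(" ") for line in lines))

# map: pair each word with an initial count of 1
pairs = [(word, 1) for word in words]


def reduce_by_key(pairs, func):
    """Combine the values of equal keys with func (illustrative helper)."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


# reduceByKey: the same a + b function as in the Spark version
word_counts = reduce_by_key(pairs, lambda a, b: a + b)

for word, count in word_counts.items():
    print(f"{word}: {count}")
```

The plain-Python version keeps words in first-seen order. Spark distributes keys across partitions, so the order returned by collect() is not guaranteed, which is why the output above does not follow the order of the text. Splitting on a single space also leaves punctuation attached, so "PySpark" and "PySpark." are counted as different words.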
