# CSE547 - Colab 1
## Wordcount in Spark

Adapted from Stanford CS246

### Setup

Let's set up Spark in your Colab environment. Run the cell below!

In [None]:
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

Now we authenticate a Google Drive client to download the file we will be processing in our Spark job.

**Make sure to follow the interactive instructions.**

In [None]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [None]:
id='1SE6k_0YukzGd5wK-E4i6mG83nydlfvSa'
downloaded = drive.CreateFile({'id': id})
downloaded.GetContentFile('pg100.txt')

If you executed the cells above, you should be able to see the file *pg100.txt* under the "Files" tab on the left panel.
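
If you prefer to check from code rather than the "Files" tab, a minimal sketch like the one below reports whether the file is present and how large it is. The helper name `describe_download` is hypothetical and not part of the original notebook; it assumes the file was saved to the current working directory.

```python
import os

def describe_download(path):
    """Return a short status string for a file that should have been downloaded."""
    if not os.path.isfile(path):
        return f"{path}: not found"
    # Report the size so an empty or truncated download is easy to spot
    return f"{path}: {os.path.getsize(path)} bytes"

print(describe_download("pg100.txt"))
```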

### Your task

If you ran the setup stage successfully, you are ready to work on the *pg100.txt* file, which contains a copy of the complete works of Shakespeare.

Write a Spark application that outputs the number of words that start with each letter. That is, for every letter, count the total number of (non-unique) words that start with that letter. In your implementation **ignore the letter case**, i.e., treat all words as lower case. You can also ignore all words **starting** with a non-alphabetic character.

For this task we ask you to use the [**RDD MapReduce API**](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.html) from Spark (map, reduceByKey, flatMap, etc.) instead of the **DataFrame API**.
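
To make the counting rule concrete before involving Spark, here is a plain-Python sketch that mirrors the flatMap, filter, map, and reduceByKey steps on a two-line toy input. It is only an illustration of the expected semantics, not the RDD solution itself; the function name `first_letter_counts` and the sample lines are made up for this example.

```python
from collections import Counter
from itertools import chain

def first_letter_counts(lines):
    """Count words by lowercase first letter, skipping non-alphabetic starts."""
    # flatMap step: lowercase each line, split on whitespace, flatten into one stream
    words = chain.from_iterable(line.lower().split() for line in lines)
    # filter step: keep only words whose first character is a letter
    # map + reduceByKey step: key each word by its first letter and sum the ones
    return Counter(word[0] for word in words if word[0].isalpha())

sample = ["To be, or not to be:", "that is the Question. 'Tis 42"]
counts = first_letter_counts(sample)
for letter in sorted(counts):
    print(f"{letter}: {counts[letter]}")
```

Note that `'Tis` and `42` are dropped because they start with a non-alphabetic character, while `be,` and `Question.` still count, since only the first character matters.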

In [1]:
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark import SparkContext
import pandas as pd

# create the Spark Session
spark = SparkSession.builder.getOrCreate()

# create the Spark Context
sc = spark.sparkContext

24/04/03 13:31:54 WARN Utils: Your hostname, CC-M133A-EU.local resolves to a loopback address: 127.0.0.1; using 10.84.11.106 instead (on interface en0)
24/04/03 13:31:54 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/04/03 13:31:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
# YOUR CODE HERE
# Read the text file
lines = sc.textFile('pg100.txt')

# Split the lines into words
words = lines.flatMap(lambda line: line.lower().split())

# Filter out words starting with non-alphabetic characters
words = words.filter(lambda word: word[0].isalpha())

# Map each word to a key-value pair where the key is the first letter and the value is 1
word_counts = words.map(lambda word: (word[0], 1))

# Reduce by key to count the number of words starting with each letter
letter_counts = word_counts.reduceByKey(lambda a, b: a + b)

# Print the letter counts
for letter, count in letter_counts.collect():
    print(f"{letter}: {count}")


p: 27759
g: 20782
c: 34567
s: 65705
b: 45455
i: 62167
r: 14265
y: 25855
l: 29569
d: 29713
j: 3339
h: 60563
t: 123602
e: 18697
o: 43494
w: 59597
f: 36814
u: 9170
a: 84836
n: 26759
m: 55676
v: 5728
k: 9418
q: 2377
z: 71
x: 14
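
The pairs above come back in no particular order, because `collect()` on a reduced RDD does not sort by key. If an ordered listing is easier to read, the collected list can be sorted on the driver; the sketch below does this in plain Python on a few pairs copied from the output above. On the Spark side, calling `sortByKey()` on the pair RDD before `collect()` should give the same alphabetical order.

```python
# A few (letter, count) pairs copied from the output above
pairs = [("p", 27759), ("g", 20782), ("c", 34567), ("z", 71), ("x", 14)]

# Alphabetical order: tuples compare by their first element, the letter
by_letter = sorted(pairs)

# Most frequent first: sort on the count, descending
by_count = sorted(pairs, key=lambda kv: kv[1], reverse=True)

for letter, count in by_letter:
    print(f"{letter}: {count}")
```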


                                                                                

Once you have obtained the desired results, **head over to Gradescope and submit your solution for this Colab**!