# CS246 - Colab 1
## Wordcount in Spark

### Setup

Let's set up Spark in your Colab environment. Run the cell below!

In [1]:
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

Collecting pyspark
  Downloading https://files.pythonhosted.org/packages/8e/b0/bf9020b56492281b9c9d8aae8f44ff51e1bc91b3ef5a884385cb4e389a40/pyspark-3.0.0.tar.gz (204.7MB)
     |████████████████████████████████| 204.7MB 66kB/s 
Collecting py4j==0.10.9
  Downloading https://files.pythonhosted.org/packages/9e/b6/6a4fb90cd235dc8e265a6a2067f2a2c99f0d91787f06aca4bcf7c23f3f80/py4j-0.10.9-py2.py3-none-any.whl (198kB)
     |████████████████████████████████| 204kB 40.6MB/s 
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.0.0-py2.py3-none-any.whl size=205044182 sha256=ac30a99309ce380f7e4aafcea38bfe4bb41691e68b32bf2116c48fad1def5fa3
  Stored in directory: /root/.cache/pip/wheels/57/27/4d/ddacf7143f8d5b76c45c61ee2e43d9f8492fc5a8e78ebd7d37
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.0.0

Now we authenticate a Google Drive client to download the file we will be processing in our Spark job.

**Make sure to follow the interactive instructions.**

In [2]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [3]:
# Drive file ID of pg100.txt (named file_id to avoid shadowing the built-in id())
file_id = '1SE6k_0YukzGd5wK-E4i6mG83nydlfvSa'
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('pg100.txt')

If you executed the cells above, you should be able to see the file *pg100.txt* under the "Files" tab on the left panel.

### Your task

If you ran the setup stage successfully, you are ready to work on the *pg100.txt* file, which contains a copy of the complete works of Shakespeare.

Write a Spark application that outputs the number of words that start with each letter. That is, for every letter, count the total number of (non-unique) words that start with that letter. In your implementation **ignore the letter case**, i.e., treat all words as lowercase. You can also ignore all words **starting** with a non-alphabetic character.
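To make the specification concrete, here is a minimal plain-Python sketch of the same counting rule, run on a tiny made-up sample rather than on *pg100.txt*. The function name `count_first_letters` and the sample lines are illustrative assumptions, and splitting on any whitespace is one possible tokenization, not necessarily the one your Spark solution must use.

```python
from collections import Counter

def count_first_letters(lines):
    """Count words by lowercased first letter, skipping non-alphabetic starts."""
    counts = Counter()
    for line in lines:
        for word in line.split():          # split on any run of whitespace
            if word[0].isalpha():          # ignore words like "'Tis" or "42"
                counts[word[0].lower()] += 1
    return counts

# Hypothetical sample input, not taken from pg100.txt
sample = ["To be, or not to be:", "that is the Question! 'Tis 42"]
print(sorted(count_first_letters(sample).items()))
# [('b', 2), ('i', 1), ('n', 1), ('o', 1), ('q', 1), ('t', 4)]
```

Note that "To" and "to" fall under the same key, while "'Tis" and "42" are skipped because their first character is not a letter.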

In [6]:
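# Import SparkSession explicitly: a wildcard import of pyspark.sql.functions
# would shadow Python built-ins such as sum, max, and min
from pyspark.sql import SparkSession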
from pyspark import SparkContext
import pandas as pd

# create the Spark Session
spark = SparkSession.builder.getOrCreate()

# create the Spark Context
sc = spark.sparkContext

In [8]:
# YOUR
downloaded

GoogleDriveFile({'id': '1SE6k_0YukzGd5wK-E4i6mG83nydlfvSa', 'kind': 'drive#file', 'etag': '"MTU3ODQ0MjAxNzc4OQ"', 'selfLink': 'https://www.googleapis.com/drive/v2/files/1SE6k_0YukzGd5wK-E4i6mG83nydlfvSa', 'webContentLink': 'https://drive.google.com/uc?id=1SE6k_0YukzGd5wK-E4i6mG83nydlfvSa&export=download', 'alternateLink': 'https://drive.google.com/file/d/1SE6k_0YukzGd5wK-E4i6mG83nydlfvSa/view?usp=drivesdk', 'embedLink': 'https://drive.google.com/file/d/1SE6k_0YukzGd5wK-E4i6mG83nydlfvSa/preview?usp=drivesdk', 'iconLink': 'https://drive-thirdparty.googleusercontent.com/16/type/text/plain', 'thumbnailLink': 'https://lh3.googleusercontent.com/cizOyQrlCsbhouOOjXYJjSAVFlO23FThx7nt0X9PJi62trQjvarLV0yReptTioKo3MuQ7XVxx7w=s220', 'title': 'pg100.txt', 'mimeType': 'text/plain', 'labels': {'starred': False, 'hidden': False, 'trashed': False, 'restricted': False, 'viewed': False}, 'copyRequiresWriterPermission': False, 'createdDate': '2020-01-08T00:04:41.233Z', 'modifiedDate': '2020-01-08T00:06:57.

In [81]:
# read in the text file and make it an RDD
data = sc.textFile('pg100.txt')

# Split each line on single spaces. Only the first character of each word
# matters, so punctuation inside or after a word needs no further cleanup.
data = data.flatMap(lambda line: line.strip().split(' '))

# Build (letter, 1) pairs; empty strings and words starting with a
# non-alphabetic character map to the placeholder (None, 0)
data = data.map(lambda word: (word[0].lower(), 1) if word and word[0].isalpha() else (None, 0))

data.take(20)

[('t', 1),
 ('p', 1),
 ('g', 1),
 ('e', 1),
 ('o', 1),
 ('t', 1),
 ('c', 1),
 ('w', 1),
 ('o', 1),
 ('w', 1),
 ('s', 1),
 ('b', 1),
 ('w', 1),
 ('s', 1),
 (None, 0),
 ('t', 1),
 ('e', 1),
 ('i', 1),
 ('f', 1),
 ('t', 1)]
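The `(None, 0)` entries in the output above are placeholders for words that should be ignored. One possible alternative, sketched here under the assumption that you would rather never emit those placeholders, is a generator that yields pairs only for valid words. The helper name `first_letter_pairs` is hypothetical; it is plain Python, so it can be checked without a Spark context.

```python
def first_letter_pairs(line):
    """Yield (letter, 1) for each word in `line` that starts with a letter."""
    for word in line.strip().split(' '):
        # Skip empty strings (from repeated spaces) and non-alphabetic starts
        if word and word[0].isalpha():
            yield (word[0].lower(), 1)

# Assumed usage with the notebook's SparkContext (not executed here):
#   pairs = sc.textFile('pg100.txt').flatMap(first_letter_pairs)
print(list(first_letter_pairs("  Enter  KING and 3 lords")))
# [('e', 1), ('k', 1), ('a', 1), ('l', 1)]
```

With this approach the later `filter` on the count would no longer be needed, at the cost of moving the logic out of a one-line lambda.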

In [84]:
# Sum the counters for each letter in the reduce step
data = data.reduceByKey(lambda a, b: a + b)

# Drop the (None, 0) placeholder and show the counts sorted by letter
data.filter(lambda x: x[1] > 0).sortBy(lambda x: x[0]).collect()

[('a', 84836),
 ('b', 45455),
 ('c', 34567),
 ('d', 29713),
 ('e', 18697),
 ('f', 36814),
 ('g', 20782),
 ('h', 60563),
 ('i', 62167),
 ('j', 3339),
 ('k', 9418),
 ('l', 29569),
 ('m', 55676),
 ('n', 26759),
 ('o', 43494),
 ('p', 27759),
 ('q', 2377),
 ('r', 14265),
 ('s', 65705),
 ('t', 123602),
 ('u', 9170),
 ('v', 5728),
 ('w', 59597),
 ('x', 14),
 ('y', 25855),
 ('z', 71)]
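A quick sanity check on the collected list can catch obvious mistakes, such as a missing letter or an uppercase key. The sketch below copies the output above into a variable named `result` (an assumed name; in the notebook you would assign the return value of `collect()` instead) and uses only the standard library.

```python
import string
from collections import Counter

# Counts copied from the output above
result = [('a', 84836), ('b', 45455), ('c', 34567), ('d', 29713), ('e', 18697),
          ('f', 36814), ('g', 20782), ('h', 60563), ('i', 62167), ('j', 3339),
          ('k', 9418), ('l', 29569), ('m', 55676), ('n', 26759), ('o', 43494),
          ('p', 27759), ('q', 2377), ('r', 14265), ('s', 65705), ('t', 123602),
          ('u', 9170), ('v', 5728), ('w', 59597), ('x', 14), ('y', 25855),
          ('z', 71)]

counts = Counter(dict(result))

# Letters of the alphabet that never appear as a key
missing = [c for c in string.ascii_lowercase if c not in counts]
print(missing)                # []

# Most frequent starting letter and its count
print(counts.most_common(1))  # [('t', 123602)]
```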

Once you have obtained the desired results, **head over to Gradescope and submit your solution for this Colab**!