# CS246 - Colab 1
## Wordcount in Spark

### Setup

Let's set up Spark in your Colab environment. Run the cell below!

In [None]:
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

Now we authenticate a Google Drive client to download the file we will be processing in our Spark job.

**Make sure to follow the interactive instructions.**

In [None]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [None]:
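# Google Drive ID of the input file (named file_id to avoid shadowing the built-in id)
file_id = '1SE6k_0YukzGd5wK-E4i6mG83nydlfvSa'
downloaded = drive.CreateFile({'id': file_id})
# Save the file locally as pg100.txt
downloaded.GetContentFile('pg100.txt')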

If you executed the cells above, you should be able to see the file *pg100.txt* under the "Files" tab on the left panel.

### Your task

If you ran the setup stage successfully, you are ready to work on the *pg100.txt* file, which contains a copy of the complete works of Shakespeare.

Write a Spark application that outputs the number of words that start with each letter. In other words, for every letter we want the total number of (non-unique) words that start with that letter. In your implementation **ignore the letter case**, i.e., treat all words as lower case. You can also ignore all words **starting** with a non-alphabetic character.
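
As a way to sanity-check a Spark solution on a handful of lines, the task can be sketched in plain Python. This is only an illustrative reference, not the Spark application itself: the helper name `first_letter_counts` is hypothetical, and it assumes whitespace tokenization, so a word with leading punctuation (such as `'Tis`) is skipped rather than counted under `t`.

```python
from collections import Counter

def first_letter_counts(lines):
    """Count words by lowercase first letter, skipping non-alphabetic starts."""
    counts = Counter()
    for line in lines:
        # split() with no argument never yields empty strings
        for word in line.lower().split():
            first = word[0]
            # Keep only the ASCII letters a-z as keys
            if "a" <= first <= "z":
                counts[first] += 1
    return counts

sample = ["To be, or not to be", "'Tis 3 o'clock"]
print(first_letter_counts(sample))
```

A different tokenization (for example, splitting on every non-alphanumeric character) will give different counts for words that contain apostrophes or hyphens, so the totals depend on that choice.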

In [None]:
import re  # standard-library regular expressions, used by clean_line below

from pyspark.sql import SparkSession

# create the Spark Session
spark = SparkSession.builder.getOrCreate()

# create the Spark Context
sc = spark.sparkContext

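`sc.textFile("testo.txt")` returns an `RDD[String]` with one element per line of the file (here `testo.txt` is just a placeholder file name).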

In [None]:
shakespeare = sc.textFile("pg100.txt")

In [None]:
def clean_line(line):
  # Lowercase the line, then replace every run of non-alphanumeric characters with a single space
  lowercase = line.lower()
  lowercase = re.sub("[^0-9a-zA-Z]+", " ", lowercase)
  return lowercase
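
To see what this cleaning step does to punctuation, here is a small standalone sketch that applies the same regular expression to a made-up sample line (the sample text is an assumption, not taken from *pg100.txt*). Note that apostrophes also become spaces, so a contraction is split into two tokens.

```python
import re

def clean_line(line):
    # Same logic as the cell above: lowercase, then turn each run of
    # non-alphanumeric characters into a single space
    return re.sub("[^0-9a-zA-Z]+", " ", line.lower())

cleaned = clean_line("Who's there? -- Nay, answer me: 2 o'clock!")
print(repr(cleaned))

# Split on single spaces and drop the empty tokens, as the Spark pipeline does
tokens = [w for w in cleaned.split(" ") if w != ""]
print(tokens)
```

With this tokenization, `who's` contributes one word under `w` and one under `s`, and the token `2` ends up keyed under a digit.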

In [None]:
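# Clean each line, split it into words, drop empty tokens, and key each word by its first character
shakespeare_txt = shakespeare.map(clean_line)\
  .flatMap(lambda st: st.split(" "))\
  .filter(lambda x: x != '')\
  .map(lambda word: (word[0], 1))

# Sum the counts per first character, then sort by count in descending order
count_words = shakespeare_txt.reduceByKey(lambda a, b: a + b)
count_words = count_words.sortBy(lambda x: -x[1])

# At most 36 keys are possible (26 letters + 10 digits), because clean_line keeps digits
count_words.take(36)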

[('t', 127781),
 ('a', 86000),
 ('s', 75226),
 ('i', 62420),
 ('h', 61029),
 ('w', 60097),
 ('m', 56252),
 ('b', 46001),
 ('o', 43712),
 ('d', 39173),
 ('f', 37186),
 ('c', 34983),
 ('l', 32389),
 ('p', 28059),
 ('n', 27313),
 ('y', 25926),
 ('g', 21167),
 ('e', 20409),
 ('r', 15234),
 ('k', 9535),
 ('u', 9230),
 ('v', 5802),
 ('j', 3372),
 ('q', 2388),
 ('1', 932),
 ('2', 330),
 ('3', 84),
 ('z', 79),
 ('4', 67),
 ('5', 43),
 ('9', 34),
 ('6', 28),
 ('7', 24),
 ('8', 23),
 ('x', 14),
 ('0', 10)]
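
The result above still contains digit keys, while the task allows ignoring words that start with a non-alphabetic character. A minimal sketch of dropping those keys, shown on a few pairs copied from the output rather than on the full RDD:

```python
# A few (key, count) pairs copied from the output above
pairs = [("t", 127781), ("q", 2388), ("1", 932), ("z", 79), ("0", 10)]

# Keep only the pairs whose key is a letter
letters_only = [(key, count) for key, count in pairs if key.isalpha()]
print(letters_only)
```

On the RDD itself, the same predicate could be applied with `count_words.filter(lambda kv: kv[0].isalpha())`. The letter counts stay unchanged, because the filter only removes keys.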