## Spark Wordcount

In [1]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
from pyspark.sql import SparkSession
import re

In [3]:
# Mount my Google drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


#### Set up the Spark Session and Context

In [4]:
spark = SparkSession.builder \
    .appName("WordcountExample") \
    .getOrCreate()
sc = spark.sparkContext
sc.setLogLevel('WARN')
sc.uiWebUrl

'http://ad0fe628f795:4040'

In [5]:
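def wordsFromLine(line):
    # Split on runs of non-word characters. A blank line, or a line that
    # starts or ends with punctuation, also yields empty strings.
    #return re.split('[ ,:.;?!]', line.lower())
    return re.split(r'\W+', line.lower())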
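The second most frequent entry in the results further down is an empty token (`: 56045`). That is consistent with how `re.split` behaves: a blank line, or a line that begins or ends with punctuation, produces empty strings. A minimal sketch of one way to avoid this, assuming empty tokens are not wanted, is to collect the word runs with `re.findall` instead of splitting on the gaps. The name `words_from_line_no_empties` is hypothetical and is not part of the original notebook.

```python
import re

def words_from_line_no_empties(line):
    """Hypothetical variant of wordsFromLine that never returns empty strings."""
    # findall collects runs of word characters instead of splitting on the gaps,
    # so leading/trailing punctuation and blank lines contribute nothing.
    return re.findall(r'\w+', line.lower())

sample = "In the beginning, God created the heaven and the earth."

# The split-based approach leaves a trailing '' because the line ends with '.'
print(re.split(r'\W+', sample.lower()))

# The findall-based approach returns only the words
print(words_from_line_no_empties(sample))
```

Passing this function to `flatMap` in place of `wordsFromLine` should remove the empty-token row from the counts and leave the other rows unchanged.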

In [6]:
myfile = "drive/MyDrive/CS452/bible.txt"

lines = sc.textFile(myfile)

In [7]:
lines.count()

100222

#### Now do the work
Use `flatMap` to split the lines into words, then apply the traditional MapReduce steps (map each word to a `(word, 1)` pair and sum the pairs by key), and finally sort the results by count.

In [8]:
counts = lines.flatMap(wordsFromLine) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(lambda x, y: x + y) \
              .sortBy(lambda a: a[1], ascending=False)   # sort in descending order
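To make the four stages concrete, here is a rough single-machine sketch of the same pipeline in plain Python, with no Spark involved. It only illustrates what each stage computes; it says nothing about how Spark partitions or shuffles the data. The function `local_wordcount` and the sample lines are hypothetical.

```python
import re
from collections import defaultdict

def words_from_line(line):
    # Same tokenization as the notebook's wordsFromLine
    return re.split(r'\W+', line.lower())

def local_wordcount(lines):
    """Hypothetical local analogue of flatMap -> map -> reduceByKey -> sortBy."""
    # flatMap: each line yields many words, flattened into one list
    words = [w for line in lines for w in words_from_line(line)]
    # map: pair every word with the count 1
    pairs = [(w, 1) for w in words]
    # reduceByKey: sum the counts of pairs that share a key
    totals = defaultdict(int)
    for word, one in pairs:
        totals[word] += one
    # sortBy: order the (word, count) pairs by count, descending
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

sample_lines = ["the cat and the dog", "the end"]  # made-up input
result = local_wordcount(sample_lines)
for word, count in result:
    print("%s: %i" % (word, count))
```

In Spark the same steps run lazily and in parallel across partitions; nothing is computed until an action such as `take` or `count` is called.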

Grab the first 50 entries of the sorted list and print them.

In [9]:
output = counts.take(50)
# Iterate over the result directly; take() can return fewer than 50 items
for word, count in output:
    print("%s: %i" % (word, count))

the: 64204
: 56045
and: 51764
of: 34789
to: 13660
that: 12927
in: 12725
he: 10422
shall: 9840
for: 8997
unto: 8997
i: 8854
his: 8473
a: 8235
lord: 7964
they: 7379
be: 7032
is: 7015
him: 6659
not: 6617
them: 6430
it: 6144
with: 6059
all: 5638
thou: 5474
thy: 4600
was: 4524
god: 4472
which: 4420
my: 4368
me: 4096
said: 3999
but: 3997
ye: 3983
their: 3942
have: 3909
will: 3843
thee: 3827
from: 3657
as: 3531
are: 2970
when: 2835
this: 2833
1: 2830
out: 2776
were: 2772
upon: 2750
man: 2735
2: 2725
you: 2687
