# Subproject A

Generate a dictionary (hash map) of the 40 words with the largest counts across all documents. The word counts should be case-insensitive; otherwise, no additional preprocessing is needed.

First, import the required Spark classes and create a SparkContext object, which tells Spark how to access a cluster.

In [1]:
from pyspark import SparkContext, SparkConf
from operator import add

In [2]:
import json

In [3]:
conf = SparkConf().setAppName("word count").setMaster("local[3]")
sc = SparkContext(conf = conf)

Create an RDD by referencing the datasets in external storage. All file paths are passed together as one comma-separated string (wildcards allowed), because the words are counted across all documents.

In [None]:
lines = sc.textFile('./data/4300-0.txt,./data/pg*.txt')
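
To see which files a comma-separated path string with wildcards would select, the pattern list can be expanded locally. This is only a rough sketch using Python's `glob` module; Spark resolves paths through its own file-system layer, so the matching rules may differ in edge cases. The helper `expand_paths` and the file names `pg100.txt`, `pg200.txt`, and `notes.md` are made up for illustration.

```python
import glob
import os
import tempfile

def expand_paths(spec):
    """Expand a comma-separated list of paths/wildcards into sorted file paths."""
    matches = []
    for pattern in spec.split(","):
        matches.extend(glob.glob(pattern))
    return sorted(matches)

with tempfile.TemporaryDirectory() as data_dir:
    # Hypothetical files standing in for the real ./data directory
    for name in ["4300-0.txt", "pg100.txt", "pg200.txt", "notes.md"]:
        open(os.path.join(data_dir, name), "w").close()
    spec = f"{data_dir}/4300-0.txt,{data_dir}/pg*.txt"
    found = [os.path.basename(p) for p in expand_paths(spec)]

print(found)  # ['4300-0.txt', 'pg100.txt', 'pg200.txt']
```

The file `notes.md` is left out because neither pattern matches it.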

Split each line into a list of words.  
Map each word, w, to a tuple, (w, 1), after case-folding it.  
Sum the counts of the tuples that share the same key (word).

In [None]:
counts = lines.flatMap(lambda line : line.split()).map(lambda word : (word.strip().casefold(), 1)).reduceByKey(add)
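
The same split, map, and reduce steps can be tried on a few in-memory lines without a cluster. The following is a minimal sketch in plain Python, with `collections.Counter` standing in for `reduceByKey(add)`; the function name `count_words` and the sample lines are assumptions made for illustration.

```python
from collections import Counter

def count_words(lines):
    """Count case-insensitive, whitespace-delimited words in an iterable of lines."""
    counts = Counter()
    for line in lines:
        # line.split() plays the role of flatMap; casefold() makes keys case-insensitive
        for word in line.split():
            counts[word.casefold()] += 1  # plays the role of reduceByKey(add)
    return counts

sample = ["The cat and the dog", "THE end."]  # made-up sample lines
word_counts = count_words(sample)
print(word_counts["the"])   # 3
print(word_counts["end."])  # 1 (punctuation is kept: no extra preprocessing)
```

Because `split()` with no arguments already discards surrounding whitespace, the `strip()` call in the Spark cell is harmless but should not change the result.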

Get the top 40 words. Dropping words with counts of 2 or less first can improve performance a little.

In [None]:
output = counts.filter(lambda x : x[1] > 2).top(40, key=lambda x : x[1])
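
Note that the filter `x[1] > 2` keeps only words that occur at least 3 times. The selection step can be sketched locally with `heapq.nlargest`, which, like `top`, returns the largest items in descending order. The helper `top_n`, its `min_count` parameter, and the sample counts are hypothetical.

```python
import heapq

def top_n(counts, n, min_count=3):
    """Return the n (word, count) pairs with the largest counts, highest first."""
    # Keep only words seen at least min_count times (same effect as x[1] > 2)
    frequent = ((w, c) for w, c in counts.items() if c >= min_count)
    return heapq.nlargest(n, frequent, key=lambda pair: pair[1])

sample_counts = {"the": 9, "and": 5, "of": 4, "rare": 2, "once": 1}  # made-up data
print(top_n(sample_counts, 2))  # [('the', 9), ('and', 5)]
```

If fewer than `n` words pass the filter, the result is simply shorter, so the threshold is only safe when at least 40 words exceed it.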

In [None]:
output

Convert the list to a dictionary and write it to a JSON file.

In [None]:
output = dict(output)

In [None]:
with open('sp1.json', 'w') as outfile:
    json.dump(output, outfile)
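
As a small check of the conversion step, the sketch below turns a list of (word, count) tuples into a dictionary and round-trips it through JSON in memory. The sample data is made up. Since Python dictionaries preserve insertion order (Python 3.7 and later), the ranking order should survive the round trip.

```python
import json

top_words = [("the", 3), ("and", 2), ("of", 1)]  # made-up sample data
as_dict = dict(top_words)

encoded = json.dumps(as_dict)
print(encoded)  # {"the": 3, "and": 2, "of": 1}

decoded = json.loads(encoded)
print(list(decoded.items()) == top_words)  # True
```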

### Alternative way to read texts

This approach may have a potential bug for a line such as "abc def\\", whose raw text is "abc def\\\n": the splitlines() method may fail to detect the '\n' character on such a line.

In [4]:
lines = sc.wholeTextFiles('./data/4300-0.txt,./data/pg*.txt')

In [5]:
counts = lines.flatMap(lambda text : text[1].splitlines()).flatMap(lambda line : line.split()).map(lambda word : (word.strip().casefold(), 1)).reduceByKey(add)
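
The backslash concern can be tested in plain Python on a tiny sample. This is only a sketch on a made-up string, not on the real files; on this sample, `str.splitlines()` still breaks the line after the trailing backslash, and the resulting words match those from splitting the whole text directly.

```python
text = "abc def\\\nghi jkl\n"  # made-up sample; first line ends with a literal backslash

lines = text.splitlines()
print(lines)  # ['abc def\\', 'ghi jkl']

# Words obtained line by line versus words obtained from the whole text at once
words_via_lines = [w for line in lines for w in line.split()]
words_direct = text.split()
print(words_via_lines == words_direct)  # True
```

If this also holds for the real documents, the intermediate `splitlines()` step should not change the word counts, since `split()` already treats line breaks as whitespace.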

In [6]:
output = counts.filter(lambda x : x[1] > 2).top(40, key=lambda x : x[1])

In [7]:
output

[('the', 78837),
 ('and', 45168),
 ('of', 44739),
 ('to', 33436),
 ('a', 24234),
 ('in', 22126),
 ('that', 14818),
 ('he', 13019),
 ('is', 12918),
 ('his', 12270),
 ('i', 11044),
 ('with', 10296),
 ('for', 10036),
 ('as', 9639),
 ('be', 8834),
 ('was', 8787),
 ('not', 8141),
 ('it', 8123),
 ('but', 7856),
 ('by', 7701),
 ('or', 7407),
 ('her', 7403),
 ('they', 6735),
 ('which', 6517),
 ('you', 6354),
 ('on', 6214),
 ('from', 5811),
 ('at', 5695),
 ('are', 5590),
 ('she', 5458),
 ('all', 5437),
 ('their', 5285),
 ('have', 5146),
 ('had', 4647),
 ('this', 4090),
 ('my', 3841),
 ('so', 3710),
 ('we', 3629),
 ('no', 3620),
 ('if', 3571)]

In [8]:
output = dict(output)

In [9]:
with open('sp1.json', 'w') as outfile:
    json.dump(output, outfile)

In [10]:
sc.stop()