# Subproject B

Implement a stopword list (use the stopwords.txt file). Stopwords carry little meaning about the specific text. Regenerate the list of the top 40 words across all documents, dropping every word found in the provided stopword list. As before, your counts should be case-insensitive, but otherwise no additional preprocessing is needed (HINT: you will need to look at broadcast variables to do this).

First, import the Spark classes the program needs and create a SparkContext object, which tells Spark how to access a cluster.

In [1]:
from pyspark import SparkContext, SparkConf

In [2]:
import json

In [3]:
conf = SparkConf().setAppName('word count').setMaster('local[3]')
sc = SparkContext(conf = conf)

Create an RDD by referencing datasets in an external storage system. We pass all file paths together (with wildcards), as we are going to count the words across all documents.

In [4]:
lines = sc.textFile('./data/4300-0.txt,./data/pg*.txt')

Create an RDD for the stopwords, collect it into a set, and wrap that set in a broadcast variable. A broadcast variable is a read-only value that Spark caches on each node rather than shipping a copy with every task, so every node can access the same dataset efficiently.

In [5]:
stoplist = sc.textFile('./data/stopwords.txt')

In [6]:
broadcastStopList = sc.broadcast(set(stoplist.collect()))

In [7]:
broadcastStopList.value

{'a',
 'about',
 'all',
 'an',
 'and',
 'any',
 'are',
 'as',
 'at',
 'be',
 'been',
 'but',
 'by',
 'can',
 'could',
 'do',
 'for',
 'from',
 'had',
 'has',
 'he',
 'her',
 'him',
 'his',
 'if',
 'in',
 'into',
 'is',
 'it',
 'may',
 'me',
 'my',
 'of',
 'on',
 'or',
 'other',
 'our',
 'said',
 'shall',
 'she',
 'so',
 'some',
 'such',
 'than',
 'that',
 'the',
 'their',
 'them',
 'then',
 'there',
 'they',
 'this',
 'to',
 'very',
 'was',
 'we',
 'were',
 'what',
 'when',
 'which',
 'who',
 'will',
 'with',
 'would',
 'your'}
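Because `collect()` returns the file's lines as a plain Python list, wrapping it in a `set` makes the membership test in the later filter step a hash lookup instead of a list scan. The following is a minimal plain-Python sketch of that loading step, with no Spark involved. The helper name `load_stopwords`, the in-memory sample, and the extra stripping and lowercasing are assumptions for illustration, not part of the notebook.

```python
import io

def load_stopwords(lines):
    """Build a lowercase stopword set from an iterable of text lines (hypothetical helper)."""
    # Strip surrounding whitespace and skip blank lines so they never become "words".
    return {line.strip().lower() for line in lines if line.strip()}

# In-memory stand-in for stopwords.txt, one word per line (assumed sample data).
sample = io.StringIO("a\nabout\nThe\n\nsaid\n")
stopwords = load_stopwords(sample)
print(sorted(stopwords))
```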

Split each line into a list of words  
Filter out the words which are in the stopwords list  
Map each word, w, to a tuple, (w, 1)  
Add the tuples with the same key (word)

In [8]:
words = lines.flatMap(lambda line : line.split())

In [9]:
filteredWords = words.filter(lambda word : word.lower() not in broadcastStopList.value)

In [10]:
counts = filteredWords.map(lambda word : (word.lower(), 1)).reduceByKey(lambda a, b : a + b)
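The same four steps can be traced locally on a couple of lines. This is a plain-Python sketch of the logic, not the Spark job itself: the sample lines and the three-word stopword set are made up, and `collections.Counter` stands in for `map` plus `reduceByKey`.

```python
from collections import Counter

# Small in-memory stand-ins for the RDD and the broadcast set (assumed sample data).
lines = ["The man said, I will go", "the MAN said no"]
stopwords = {"the", "said", "will"}

# flatMap: split every line on whitespace into one flat list of tokens.
words = [word for line in lines for word in line.split()]

# filter: keep tokens whose lowercase form is not a stopword.
filtered = [word for word in words if word.lower() not in stopwords]

# map + reduceByKey: count the lowercase form of each remaining token.
counts = Counter(word.lower() for word in filtered)
print(counts)
```

Since no punctuation is stripped, the token `said,` does not match the stopword `said` and is counted as its own word, which is also why `'said,'` appears in the notebook's final output.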

Get the top 40 words (dropping words with counts of 2 or fewer before ordering can improve performance slightly)

In [11]:
output = counts.filter(lambda x : x[1] > 2).takeOrdered(40, key=lambda x : -x[1])
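`takeOrdered(n, key=...)` returns the `n` elements with the smallest key, so negating the count yields the largest counts first. A rough local analogue, assuming a small made-up list of `(word, count)` pairs and using `heapq.nsmallest` in place of the RDD method:

```python
import heapq

# Assumed sample of (word, count) pairs standing in for the counts RDD.
counts = [("man", 5), ("go", 1), ("no", 3), ("i", 7), ("up", 2)]

# Same predicate as the notebook cell: keep counts strictly greater than 2.
frequent = [pair for pair in counts if pair[1] > 2]

# Smallest negated count == largest count, so this picks the top 2 by frequency.
top2 = heapq.nsmallest(2, frequent, key=lambda x: -x[1])
print(top2)
```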

In [12]:
output

[('i', 11044),
 ('not', 8141),
 ('you', 6354),
 ('have', 5146),
 ('no', 3620),
 ('one', 3498),
 ('like', 2253),
 ('more', 2087),
 ('out', 2021),
 ('up', 1831),
 ('man', 1783),
 ('now', 1579),
 ('only', 1555),
 ('must', 1523),
 ('little', 1485),
 ('those', 1447),
 ('good', 1444),
 ('should', 1417),
 ('after', 1379),
 ('great', 1358),
 ('every', 1356),
 ('first', 1318),
 ('own', 1289),
 ('did', 1271),
 ('how', 1266),
 ('see', 1251),
 ('these', 1244),
 ('men', 1233),
 ('over', 1209),
 ('where', 1205),
 ('make', 1196),
 ('upon', 1188),
 ('nor', 1181),
 ('never', 1177),
 ('much', 1167),
 ('time', 1166),
 ('said,', 1163),
 ('two', 1142),
 ('old', 1140),
 ('made', 1128)]

Convert the list to a dictionary and write it to a JSON file

In [13]:
output = dict(output)

In [14]:
with open('sp2.json', 'w') as outfile:
    json.dump(output, outfile)
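Python dictionaries preserve insertion order (guaranteed since Python 3.7), so the ranking survives the conversion and the JSON round trip. A small sketch with made-up counts, using `json.dumps` and `json.loads` on a string rather than writing a file:

```python
import json

# Assumed sample, already sorted by descending count like the notebook's output.
top_words = [("i", 7), ("man", 5), ("no", 3)]

as_dict = dict(top_words)     # keys keep the order of the sorted list
text = json.dumps(as_dict)    # serializes keys in that same order
restored = json.loads(text)   # parsing also preserves key order
print(list(restored.items()))
```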

In [15]:
sc.stop()