# Subproject C

Implement an additional preprocessing step that strips out periods (.), commas (,), colons (:), semicolons (;), single quotes (’), exclamation points (!), and question marks (?), but only when they are the first or last character of a word (this leaves contractions like “can’t” or “won’t” unaffected). You may notice that doing so requires you to discard all words that have only one character; do that as well. Submit the resulting top 40 words.
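The stripping rule can be sketched in plain Python before involving Spark. `strip_edge_punct` is a hypothetical helper that is not part of the notebook, and the sketch assumes the straight single quote (') rather than the typographic one.

```python
# Hypothetical helper (not from the notebook): removes at most one
# punctuation character from each end of a word.
PUNCTUATION = set(".,:;'!?")

def strip_edge_punct(word):
    if word and word[0] in PUNCTUATION:
        word = word[1:]
    if word and word[-1] in PUNCTUATION:
        word = word[:-1]
    return word

samples = ["can't", "hello,", "'tis", "end.", "!"]
cleaned = [strip_edge_punct(w) for w in samples]

# Words of one character or less are discarded, as the task requires.
kept = [w for w in cleaned if len(w) > 1]
print(kept)
```

The apostrophe inside "can't" is untouched because only the first and last characters are inspected.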

First, import some Spark classes into the program, and create a SparkContext object, which tells Spark how to access a cluster.

In [1]:
from pyspark import SparkContext, SparkConf

In [2]:
import json

In [3]:
conf = SparkConf().setAppName('word count').setMaster('local[3]')
sc = SparkContext(conf = conf)

Create an RDD by referencing datasets in an external storage system. We pass all file paths together (with wildcards), as we are going to count the words across all documents.

In [4]:
lines = sc.textFile('./data/4300-0.txt,./data/pg*.txt')

Create an RDD for the stopwords, then collect it into a set and share it with the workers as a broadcast variable.

In [5]:
stoplist = sc.textFile('./data/stopwords.txt')
broadcastStopList = sc.broadcast(set(stoplist.collect()))

Create a broadcast variable holding the set of punctuation characters that need to be stripped from the start or end of a word.

In [6]:
punctuations = sc.broadcast(set(".,:;'!?"))

In [7]:
punctuations.value

{'!', "'", ',', '.', ':', ';', '?'}

Split each line into a list of words  
Filter out the words that appear in the stopwords list  
Map each word, w, to a tuple, (w, 1)  
Sum the counts of the tuples that share the same key (word)  
Strip leading or trailing punctuation from the words  

In [8]:
words = lines.flatMap(lambda line : line.split())

In [None]:
# Alternative: strip the leading or trailing punctuation first, then filter the stopwords

filteredPunctuation = words.map(lambda word : word[1:] if len(word) > 0 and word[0] in punctuations.value else word).map(lambda word : word[:-1] if len(word) > 0 and word[-1] in punctuations.value else word)
filteredWords = filteredPunctuation.filter(lambda word : word.lower() not in broadcastStopList.value)

In [9]:
filteredWords = words.filter(lambda word : word.lower() not in broadcastStopList.value)

In [10]:
counts = filteredWords.map(lambda word : (word.lower(), 1)).reduceByKey(lambda a, b : a + b)

In [11]:
counts2 = counts.map(lambda x : (x[0][1:], x[1]) if len(x[0]) > 1 and x[0][0] in punctuations.value else x).map(lambda x : (x[0][:-1], x[1]) if len(x[0]) > 1 and x[0][-1] in punctuations.value else x)
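Because this cell strips punctuation after `reduceByKey`, keys that differed only by an edge character (for example `time` and `time,`) map to the same word but stay in separate tuples. Below is a minimal plain-Python sketch of that effect, using made-up counts rather than values from the corpus; in Spark, a second `reduceByKey` after the stripping step would presumably merge them in the same way.

```python
from collections import Counter

PUNCTUATION = set(".,:;'!?")

# Made-up (word, count) pairs standing in for the output of reduceByKey.
counts = [("time", 3), ("time,", 2), ("time.", 1), ("old", 4)]

def strip_key(pair):
    # Same rule as the notebook cell: strip one leading and one trailing
    # punctuation character, but only from keys longer than one character.
    word, n = pair
    if len(word) > 1 and word[0] in PUNCTUATION:
        word = word[1:]
    if len(word) > 1 and word[-1] in PUNCTUATION:
        word = word[:-1]
    return (word, n)

stripped = [strip_key(p) for p in counts]
# "time" now appears three times, each with its own partial count.

merged = Counter()
for word, n in stripped:
    merged[word] += n  # plays the role of a second reduceByKey

print(stripped)
print(dict(merged))
```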

In [12]:
output = counts2.filter(lambda x : x[1] > 2 and len(x[0]) > 1).takeOrdered(40, key=lambda x : -x[1])
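`takeOrdered` returns the elements with the smallest keys, so negating the count puts the most frequent words first. The same selection can be sketched in plain Python with `heapq.nsmallest`; the pairs and the limit of 2 are illustrative assumptions, not data from the notebook.

```python
import heapq

# Made-up (word, count) pairs for illustration.
pairs = [("not", 9), ("a", 50), ("you", 7), ("rare", 1), ("have", 5)]

# Mirror the notebook's filter: count above 2 and word longer than one character.
eligible = [p for p in pairs if p[1] > 2 and len(p[0]) > 1]

# The smallest negated count corresponds to the largest count.
top = heapq.nsmallest(2, eligible, key=lambda p: -p[1])
print(top)
```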

In [13]:
output

[('not', 8141),
 ('you', 6354),
 ('have', 5146),
 ('no', 3620),
 ('one', 3498),
 ('like', 2253),
 ('more', 2087),
 ('out', 2021),
 ('up', 1831),
 ('man', 1783),
 ('now', 1579),
 ('only', 1555),
 ('must', 1523),
 ('little', 1485),
 ('those', 1447),
 ('good', 1444),
 ('should', 1417),
 ('after', 1379),
 ('great', 1358),
 ('every', 1356),
 ('first', 1318),
 ('own', 1289),
 ('did', 1271),
 ('how', 1266),
 ('see', 1251),
 ('these', 1244),
 ('men', 1233),
 ('over', 1209),
 ('where', 1205),
 ('make', 1196),
 ('upon', 1188),
 ('nor', 1181),
 ('never', 1177),
 ('much', 1167),
 ('time', 1166),
 ('said', 1163),
 ('two', 1142),
 ('old', 1140),
 ('made', 1128),
 ('most', 1114)]

In [14]:
output = dict(output)

In [15]:
with open('sp3.json', 'w') as outfile:
    json.dump(output, outfile)
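To check the written file, it can be read back with `json.load`. The sketch below uses a temporary directory and two entries from the output above as stand-in data, so it runs without Spark; the temporary path is an assumption for illustration.

```python
import json
import os
import tempfile

# Two entries copied from the notebook output, used as stand-in data.
sample = dict([("not", 8141), ("you", 6354)])

# Write to a throwaway location instead of the working directory.
path = os.path.join(tempfile.mkdtemp(), "sp3.json")
with open(path, "w") as outfile:
    json.dump(sample, outfile)

with open(path) as infile:
    loaded = json.load(infile)

print(list(loaded.items()))
```

Since Python 3.7 a `dict` keeps insertion order, so the words stay sorted by count in the written file.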

In [16]:
sc.stop()