## Spark assignment 1: Pairs

Find all pairs of consecutive words in which the first word is “narodnaya”. For each pair, count the number of occurrences in the Wikipedia dump. Print all the pairs with their counts in lexicographical order. The output format is “word_pair <tab> count”, for example:

red_apple	100500

crazy_zoo	42

Note that the two words in a pair are joined with an underscore character, and the result is in lowercase.
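
The output format can be illustrated with a small helper. This is only a sketch: `format_pair` is a hypothetical name that is not part of the assignment, and the counts are the ones from the example above.

```python
def format_pair(first, second, count):
    # Join the two words with an underscore, lowercase the result,
    # and separate the pair from its count with a tab.
    return "{}_{}\t{}".format(first, second, count).lower()

# Counts taken from the example above; keys are (first, second) pairs.
example_counts = {("Red", "apple"): 100500, ("crazy", "Zoo"): 42}

# Lowercase the keys before sorting so that case does not affect the order;
# sorting the tuples orders pairs by first word, then by second word.
normalized = {(a.lower(), b.lower()): n for (a, b), n in example_counts.items()}
lines = [format_pair(a, b, normalized[(a, b)]) for (a, b) in sorted(normalized)]
```

Because every pair in this assignment starts with the same first word, ordering by the second word alone gives the same result.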

One motivation for counting these continuations is to better understand the language. Some words, like “the”, have many continuations, while others, like “San”, have just a few (“San Francisco”, for example). These statistics can be used to build a language model. If you are interested in learning more, search for “n-gram language model” on the Internet.
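
As a rough illustration of how continuation counts could feed a language model, the sketch below turns them into conditional probabilities P(next | first) by dividing each count by the total. The function name `continuation_probabilities` is an assumption made for this example; the counts are the two that this notebook reports for “narodnaya” further down.

```python
def continuation_probabilities(counts):
    # counts maps each continuation word to how often it followed the
    # first word; dividing by the total gives a maximum-likelihood
    # estimate of P(next | first).
    total = float(sum(counts.values()))
    return {word: n / total for word, n in counts.items()}

# Counts for continuations of "narodnaya" reported later in this notebook.
probs = continuation_probabilities({"gazeta": 1, "volya": 9})
```

A real n-gram model would usually add smoothing for unseen pairs, which this sketch leaves out.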

In [1]:
from pyspark import SparkConf, SparkContext
sc = SparkContext(conf=SparkConf().setAppName("MyApp").setMaster("local"))

import re

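def parse_article(line):
    # Each input line is "<article_id>\t<text>"; return the article's words.
    try:
        article_id, text = unicode(line.rstrip()).split('\t', 1)
        # Strip leading and trailing non-word characters.
        text = re.sub(r"^\W+|\W+$", "", text, flags=re.UNICODE)
        # Split on whitespace, swallowing any punctuation adjacent to it.
        words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)
        return words
    except ValueError:
        # The line has no tab separator, so there is no article text.
        return []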

wiki = sc.textFile("/data/wiki/en_articles_part/articles-part", 16).map(parse_article).cache()
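
To see what the two regular expressions do, here is a minimal standalone sketch that applies the same steps to a made-up line, without Spark. The sample line and the name `split_words` are assumptions for illustration, and the `unicode()` conversion is left out so that the snippet also runs on Python 3.

```python
import re

def split_words(line):
    # Same steps as parse_article, minus the unicode() conversion.
    article_id, text = line.rstrip().split('\t', 1)
    # Strip leading and trailing non-word characters.
    text = re.sub(r"^\W+|\W+$", "", text, flags=re.UNICODE)
    # Split on whitespace together with any punctuation touching it.
    return re.split(r"\W*\s+\W*", text, flags=re.UNICODE)

# Hypothetical input line: an id, a tab, then the article text.
sample = "12\tAnarchism is a political philosophy, (often) defined."
words = split_words(sample)
# words == ['Anarchism', 'is', 'a', 'political', 'philosophy', 'often', 'defined']
```

Punctuation next to whitespace is dropped, while punctuation inside a token (for example an apostrophe or a hyphen) is kept.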

In [2]:
words = wiki.flatMap(lambda x: [(x[i].lower(), x[i+1].lower()) for i in range(len(x)-1)])

In [3]:
words = words.filter(lambda pair: pair[0] == 'narodnaya')

In [4]:
words = words.map(lambda pair: (pair[1], 1)).reduceByKey(lambda a, b: a + b).sortByKey()

In [5]:
results = words.collect()
for result in results:
    print (u'narodnaya_%s\t%d' % (result[0], result[1])).encode('utf-8')

narodnaya_gazeta	1
narodnaya_volya	9


## Test

In [6]:
%%writefile test.txt
narodnaya phuong narodnaya 1 narodnaya narodnaya 6 thao minh
narodnaya narodnaya narodnaya phuong narodnaya 1 narodnaya 5

Overwriting test.txt


In [7]:
def normalizeWords(text):
    return re.compile(r'\W+', re.UNICODE).split(text.lower())
test = sc.textFile("test.txt", 16).map(normalizeWords)

In [8]:
test = test.flatMap(lambda x: [(x[i], x[i+1]) for i in range(len(x)-1)])
test.collect()

[(u'narodnaya', u'phuong'),
 (u'phuong', u'narodnaya'),
 (u'narodnaya', u'1'),
 (u'1', u'narodnaya'),
 (u'narodnaya', u'narodnaya'),
 (u'narodnaya', u'6'),
 (u'6', u'thao'),
 (u'thao', u'minh'),
 (u'narodnaya', u'narodnaya'),
 (u'narodnaya', u'narodnaya'),
 (u'narodnaya', u'phuong'),
 (u'phuong', u'narodnaya'),
 (u'narodnaya', u'1'),
 (u'1', u'narodnaya'),
 (u'narodnaya', u'5')]

In [9]:
test = test.filter(lambda pair: pair[0] == 'narodnaya')
test.collect()

[(u'narodnaya', u'phuong'),
 (u'narodnaya', u'1'),
 (u'narodnaya', u'narodnaya'),
 (u'narodnaya', u'6'),
 (u'narodnaya', u'narodnaya'),
 (u'narodnaya', u'narodnaya'),
 (u'narodnaya', u'phuong'),
 (u'narodnaya', u'1'),
 (u'narodnaya', u'5')]

In [10]:
test = test.map(lambda pair: (pair[1], 1)).reduceByKey(lambda a, b: a + b).sortBy(lambda kv: kv[1], ascending=False)
test.collect()

[(u'narodnaya', 3), (u'1', 2), (u'phuong', 2), (u'6', 1), (u'5', 1)]

In [11]:
results = test.collect()
for result in results:
    print '%s\t%d' % (result[0], result[1])

narodnaya	3
1	2
phuong	2
6	1
5	1
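
As a cross-check that does not need Spark, the same counts can be reproduced in plain Python. This is a sketch under assumptions: the two lines of test.txt are inlined as a list, and `count_continuations` is a hypothetical helper name. It also lists the pairs in the lexicographical order the assignment asks for, whereas the test cell above sorts by count.

```python
import re
from collections import Counter

def count_continuations(lines, first_word):
    # Count every word that directly follows first_word within a line.
    counts = Counter()
    for line in lines:
        words = re.split(r'\W+', line.lower(), flags=re.UNICODE)
        for a, b in zip(words, words[1:]):
            if a == first_word:
                counts[b] += 1
    return counts

# The two lines written to test.txt above.
test_lines = [
    "narodnaya phuong narodnaya 1 narodnaya narodnaya 6 thao minh",
    "narodnaya narodnaya narodnaya phuong narodnaya 1 narodnaya 5",
]
counts = count_continuations(test_lines, "narodnaya")

# Pairs in lexicographical order, in the assignment's output format.
ordered = ["narodnaya_%s\t%d" % (w, counts[w]) for w in sorted(counts)]
```

Pairs are only formed within a line, matching the Spark version, so the last word of one line is never paired with the first word of the next.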
