# Spark assignment 1: Pairs

Find all pairs of two consecutive words in which the first word is `narodnaya`. For each pair, count the number of occurrences in the Wikipedia dump. Print all the pairs with their counts in lexicographical order. The output format is `word_pair <tab> count`, for example:

```
crazy_zoo	42
red_apple	100500
```

Note that the two words in a pair are joined with an underscore character, and the result is lowercase.

One motivation for counting these continuations is to get a better understanding of the language. Some words, like "the", have many continuations, while others, like "San", have just a few ("San Francisco", for example). One can build a language model from these statistics. If you are interested in learning more, search for "n-gram language model" on the Internet.
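
As a rough illustration of how continuation counts turn into a language model, the sketch below counts which words follow each word in a toy sentence and turns the counts into conditional frequencies. The corpus and the helper names `continuation_counts` and `next_word_probability` are made up for this example and are not part of the assignment.

```python
from collections import Counter, defaultdict


def continuation_counts(words):
    """Map each word to a Counter of the words that directly follow it."""
    following = defaultdict(Counter)
    for first, second in zip(words, words[1:]):
        following[first][second] += 1
    return following


def next_word_probability(counts, word, candidate):
    """Estimate P(candidate | word) as count(word, candidate) / count(word, *)."""
    total = sum(counts[word].values())
    return counts[word][candidate] / total if total else 0.0


# Toy corpus, invented for illustration only.
tokens = "the cat saw the dog in san francisco near the san francisco bay".split()
counts = continuation_counts(tokens)

print(dict(counts["the"]))   # three different continuations
print(dict(counts["san"]))   # a single continuation
print(next_word_probability(counts, "san", "francisco"))
```

In this toy text "the" is followed by three different words, while "san" is always followed by "francisco", which mirrors the observation above.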

Dataset location: */data/wiki/en_articles_part*

The result on the sample dataset:

```
narodnaya_gazeta   1
narodnaya_volya    9
```


### Step 1. Create SparkContext.

In [None]:
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("MyApp").setMaster("yarn"))

# For a local run, use the lines below instead of the line above.
# sc = SparkContext(conf=SparkConf().setAppName("MyApp").setMaster("local"))
# sc.uiWebUrl


### Step 2. Load and parse data.

In [None]:
import re

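def parse_article(line):
    """Split one `id<TAB>text` line into a list of words; return [] if malformed."""
    try:
        article_id, text = line.rstrip().split('\t', 1)
    except ValueError:
        # No tab separator: the line has no article text.
        return []
    # Strip leading and trailing non-word characters.
    text = re.sub(r"^\W+|\W+$", "", text, flags=re.UNICODE)
    # Split on whitespace, dropping punctuation attached to word boundaries.
    return re.split(r"\W*\s+\W*", text, flags=re.UNICODE)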

wiki = sc.textFile("/data/wiki/en_articles_part/articles-part", 4).map(parse_article)
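
To see what the parser produces without starting Spark, the sketch below restates `parse_article` (with raw-string patterns) and applies it to a made-up input line. The sample line is an assumption about the `id<TAB>text` layout implied by the `split('\t', 1)` call; it is not taken from the dataset.

```python
import re


def parse_article(line):
    """Split one `id<TAB>text` line into words; return [] if there is no tab."""
    try:
        article_id, text = line.rstrip().split('\t', 1)
    except ValueError:
        return []
    text = re.sub(r"^\W+|\W+$", "", text, flags=re.UNICODE)
    return re.split(r"\W*\s+\W*", text, flags=re.UNICODE)


# Hypothetical input line, invented for illustration.
sample = "12\tAnarchism. Narodnaya Volya, (the People's Will) was founded!"
words = parse_article(sample)
print(words)
# ['Anarchism', 'Narodnaya', 'Volya', 'the', "People's", 'Will', 'was', 'founded']

# A line without a tab separator yields an empty word list.
print(parse_article("no tab here"))  # []
```

Punctuation next to whitespace is dropped, but punctuation inside a token (the apostrophe in "People's") is kept, and the words are not lowercased at this stage.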


### Step 3. Define main logic.

In [None]:
from collections import Counter


def make_pairs(data, starts_with=""):
    """
    Build "first_second" pairs of adjacent words whose first word equals
    `starts_with`, and return them as a list of (pair, count) tuples.
    """
    if not starts_with:
        return []
    pairs = [
        "%s_%s" % (data[i], data[i + 1])
        for i in range(len(data) - 1)
        if data[i] == starts_with
    ]
    return list(Counter(pairs).items())
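
The behaviour of `make_pairs` on a single article can be checked locally. The sketch below restates the function and runs it on a hand-written word list; the list is invented for illustration.

```python
from collections import Counter


def make_pairs(data, starts_with=""):
    """Return (pair, count) tuples for adjacent words starting with `starts_with`."""
    if not starts_with:
        return []
    pairs = [
        "%s_%s" % (data[i], data[i + 1])
        for i in range(len(data) - 1)
        if data[i] == starts_with
    ]
    return list(Counter(pairs).items())


# Hypothetical, already lowercased article; the trailing "narodnaya" has no successor.
article = ["narodnaya", "volya", "and", "narodnaya", "gazeta",
           "narodnaya", "volya", "narodnaya"]
article_pairs = sorted(make_pairs(article, "narodnaya"))
print(article_pairs)
# [('narodnaya_gazeta', 1), ('narodnaya_volya', 2)]
```

The counts here are per article; counts for the same pair coming from different articles still have to be summed, which is the job of the `reduceByKey` step.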


In [None]:
# Lowercase every word, build per-article pair counts, then sum the counts per pair and sort by pair.
raw_pairs = wiki.map(lambda x: [el.lower() for el in x]).flatMap(lambda x: make_pairs(x, "narodnaya"))
result_pairs = raw_pairs.reduceByKey(lambda a, b: a + b).sortByKey()
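
For intuition about the aggregation step, here is a plain-Python sketch of what `reduceByKey` followed by `sortByKey` does to the per-article counts. `reduce_by_key` is a hypothetical local helper, not a Spark API, and the input records are invented; the real computation runs distributed across partitions.

```python
def reduce_by_key(records, func):
    """Merge the values of equal keys with `func`, like RDD.reduceByKey but locally."""
    merged = {}
    for key, value in records:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


# Hypothetical per-article counts, as produced by flatMap(make_pairs) over several articles.
per_article = [
    ("narodnaya_volya", 2),
    ("narodnaya_gazeta", 1),
    ("narodnaya_volya", 7),
]

# Sum counts per pair, then sort by key (lexicographical order of the pair).
totals = sorted(reduce_by_key(per_article, lambda a, b: a + b).items())

for pair in totals:
    print("%s\t%d" % pair)
```

Because the merge function is addition, which is associative and commutative, the order in which partial counts are combined does not affect the totals.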

### Step 4. Print result.

In [None]:
for pair in result_pairs.collect():
    print("%s\t%d" % pair)

### Step 5. Stop Spark.

In [None]:
sc.stop()