# Spark assignment 2: Collocations
In the second part of the assignment, your task is to extract collocations, that is, combinations of words that frequently occur together, for example “high school” or “roman empire”.

To find collocations, you will use the NPMI (normalized pointwise mutual information) metric.

The PMI of two words, a and b, is defined as “PMI(a, b) = ln(P(ab) / (P(a) * P(b)))”, where P(ab) is the probability of the two words occurring one after the other, and P(a) and P(b) are the probabilities of the words a and b respectively.

You will estimate probabilities with occurrence counts, that is “P(a) = # of occurrences of word a / total number of words”, and “P(ab) = # of occurrences of words ‘a b’ / total number of word pairs”.
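The estimates above can be sketched in plain Python on a tiny, made-up corpus. The `articles` data and the helper names `p_word` and `p_bigram` are illustrative assumptions, not part of the assignment; bigrams are formed within one article only, which is also how the notebook code further down builds them.

```python
from collections import Counter

# Hypothetical toy corpus: one already-filtered word list per article.
articles = [
    ["roman", "empire", "fell"],
    ["roman", "empire", "rose"],
]

word_counts = Counter(w for words in articles for w in words)
# Pair every word with its successor inside the same article.
bigram_counts = Counter(
    pair for words in articles for pair in zip(words, words[1:])
)

total_words = sum(word_counts.values())
total_bigrams = sum(bigram_counts.values())


def p_word(word):
    """P(a) = occurrences of the word / total number of words."""
    return word_counts[word] / float(total_words)


def p_bigram(a, b):
    """P(ab) = occurrences of the pair 'a b' / total number of pairs."""
    return bigram_counts[(a, b)] / float(total_bigrams)


print(p_word("roman"))
print(p_bigram("roman", "empire"))
```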

To build an intuition behind the definition, see Reading material.

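Intuitively, PMI is large when two words occur together much more often than their individual frequencies would suggest. In particular, words that are rare on their own but almost always occur as a pair have a large PMI.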

NPMI is computed as “NPMI(a, b) = PMI(a, b) / -ln P(ab)”. This normalizes the quantity to be within the range [-1; 1].
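A minimal sketch of both formulas as Python functions, assuming the probabilities are passed in as plain floats. The function names `pmi` and `npmi` are illustrative. The two calls at the end check the boundary cases: two words that only ever occur together should score about 1, and two independent words about 0.

```python
import math


def pmi(p_ab, p_a, p_b):
    """PMI(a, b) = ln(P(ab) / (P(a) * P(b)))."""
    return math.log(p_ab / (p_a * p_b))


def npmi(p_ab, p_a, p_b):
    """NPMI(a, b) = PMI(a, b) / -ln P(ab).

    Undefined for p_ab == 1, where -ln P(ab) is zero.
    """
    return pmi(p_ab, p_a, p_b) / -math.log(p_ab)


# The words always occur together: P(ab) = P(a) = P(b).
print(npmi(0.01, 0.01, 0.01))
# The words are independent: P(ab) = P(a) * P(b).
print(npmi(0.0001, 0.01, 0.01))
```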

Your task is a bit more complicated now:

* Extract all the words, as in the previous task.
* Filter out stopwords using the dictionary (/datasets/stop_words_en.txt) (do not forget to convert the words to lowercase!)
* Compute all bigrams (that is, pairs of consecutive words)
* Leave only bigrams with at least 500 occurrences
* Compute NPMI for every bigram (note: when computing probabilities, you need unpruned counts!)
* Sort word pairs by NPMI in descending order
* Print the top 39 word pairs, with words delimited by the underscore “_”

For example,

<code>
roman_empire
south_africa
</code>

Dataset location: /data/wiki/en_articles_part

Part of the result on the sample dataset:

<code>...
references_reading
notes_references
award_best
north_america
new_zealand
...</code>

Hint: if you did everything right, “roman_empire” and “south_africa” are going to be in the result.
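The steps of the task can be sketched end to end in plain Python, without Spark, to check the logic on a small input. This is only an illustration under stated assumptions: the function `top_collocations`, its parameters, the toy texts and the simple `\w+` tokenizer are all made up here, and the tokenizer is not the word-splitting routine from the previous task.

```python
import math
import re
from collections import Counter


def top_collocations(texts, stop_words, min_count, top_n):
    """Rank bigrams by NPMI (plain-Python stand-in for the Spark job)."""
    word_counts, bigram_counts = Counter(), Counter()
    for text in texts:
        # Assumed tokenizer: runs of word characters, lowercased.
        words = [w.lower() for w in re.findall(r"\w+", text, flags=re.UNICODE)]
        words = [w for w in words if w not in stop_words]
        word_counts.update(words)
        bigram_counts.update(zip(words, words[1:]))

    # Totals are taken from the unpruned counts.
    total_words = float(sum(word_counts.values()))
    total_bigrams = float(sum(bigram_counts.values()))

    scored = []
    for (a, b), n in bigram_counts.items():
        if n < min_count:  # pruning only decides which bigrams get scored
            continue
        p_ab = n / total_bigrams
        p_a = word_counts[a] / total_words
        p_b = word_counts[b] / total_words
        pmi = math.log(p_ab / (p_a * p_b))
        scored.append((pmi / -math.log(p_ab), a + "_" + b))
    scored.sort(reverse=True)
    return [pair for _, pair in scored[:top_n]]


texts = [
    "The Roman Empire and the Roman Empire",
    "A Roman road in South Africa",
    "South Africa is in the south",
]
stop_words = {"the", "and", "a", "in", "is"}
for pair in top_collocations(texts, stop_words, min_count=2, top_n=39):
    print(pair)
```

On the real dataset the same steps would run as RDD transformations, with `min_count=500` and `top_n=39` as given in the task.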

### Counts after filters
total words count: 6971026
total bigrams count: 6966926

### Take 3 word probabilities
biennials    1.43450906653e-06
underlyingly    1.43450906653e-07
ancyra    6.16838898607e-06

### Take 3 bigram probabilities
(u'york', u'city')    8.97095792319e-05
(u'north', u'american')    7.32030166533e-05
(u'new', u'york')    0.000601843625151

### Take 3 bigram PMI
york city
 - bigram probability: 8.97095792319e-05
 - "york" probability: 0.000639504141858
 - "city" probability: 0.00146033022973
north american
 - bigram probability: 7.32030166533e-05
 - "north" probability: 0.000812936288001
 - "american" probability: 0.00246993771075
new york
 - bigram probability: 0.000601843625151
 - "new" probability: 0.0027470848624
 - "york" probability: 0.000639504141858
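As a sanity check, the probabilities listed above can be plugged directly into the PMI and NPMI formulas. The sketch below only copies the numbers from this sample output; the underscore-joined names and the `samples` and `scores` variables are illustrative.

```python
import math

# (bigram, P(ab), P(a), P(b)) copied from the sample output above.
samples = [
    ("york_city", 8.97095792319e-05, 0.000639504141858, 0.00146033022973),
    ("north_american", 7.32030166533e-05, 0.000812936288001, 0.00246993771075),
    ("new_york", 0.000601843625151, 0.0027470848624, 0.000639504141858),
]

scores = {}
for name, p_ab, p_a, p_b in samples:
    pmi = math.log(p_ab / (p_a * p_b))
    scores[name] = pmi / -math.log(p_ab)

# Print the bigrams from the highest NPMI to the lowest.
for name in sorted(scores, key=scores.get, reverse=True):
    print("%s\t%.3f" % (name, scores[name]))
```

With these inputs "new york" gets the highest NPMI of the three and "north american" the lowest.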

In [1]:
from pyspark import SparkConf, SparkContext

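# Reuse the running context when the cell is re-executed instead of
# swallowing every exception with a bare "except".
sc = SparkContext.getOrCreate(conf=SparkConf().setAppName("MyApp").setMaster("local").set("spark.cores.max", "4"))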

In [3]:
import re

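def split_words(line):
    """Split one "article_id<TAB>text" line into a list of words."""
    try:
        # textFile() already yields unicode strings, so no unicode() call is needed.
        article_id, text = line.rstrip().split('\t', 1)
    except ValueError:
        # No tab separator: treat the line as an article without words.
        return []
    # Strip non-word characters from both ends, then split on whitespace
    # together with any punctuation attached to it.
    text = re.sub(r"^\W+|\W+$", "", text, flags=re.UNICODE)
    return re.split(r"\W*\s+\W*", text, flags=re.UNICODE)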
    
def make_pairs(words):
    last_word = None
    for word in words:
        if last_word:
            yield "%s_%s" % (last_word, word)
        last_word = word
        
with open('/datasets/stop_words_en.txt') as f:
    # A set gives constant-time membership checks in the filter below.
    stop_words = set(f.read().splitlines())

stop_words_broadcasted = sc.broadcast(stop_words)
        
query = sc.textFile("/data/wiki/en_articles_part/articles-part", 16) \
    .map(split_words) \
    .map(lambda arr: [word.lower() for word in arr]) \
    .map(lambda arr: [word for word in arr if word not in stop_words_broadcasted.value]) \
    .cache()

In [None]:
%%time
# Total number of words left after lowercasing and stopword filtering.
words_count = query.map(lambda x: len(x)).reduce(lambda x, y: x + y)
print(words_count)

In [None]:
%%time
# An article with n words has n - 1 bigrams; an empty article contributes 0, not -1.
bigrams_count = query.map(lambda x: max(len(x) - 1, 0)).reduce(lambda x, y: x + y)
print(bigrams_count)

In [None]:
%%time
# Count every bigram, then keep those with at least 500 occurrences.
bigrams = query.flatMap(make_pairs) \
    .map(lambda x: (x, 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .filter(lambda x: x[1] >= 500) \
    .collect()

In [None]:
bigrams