# \[Honour Task] Spark assignment 2: Collocations

In the second part of the assignment, your task is to extract collocations, that is, word combinations that frequently occur together, for example "high school" or "roman empire".

To find collocations, you will use the NPMI (normalized pointwise mutual information) metric.

The PMI of two words, a and b, is defined as `PMI(a, b) = ln(P(ab) / (P(a) * P(b)))`, where `P(ab)` is the probability of the two words occurring one right after the other, and `P(a)` and `P(b)` are the probabilities of words a and b, respectively.

You will estimate these probabilities from occurrence counts: `P(a) = (# of occurrences of word a) / (total number of words)` and `P(ab) = (# of occurrences of the pair "a b") / (total number of word pairs)`.

As a result, PMI is large for pairs whose words are individually rare but almost always appear together.

NPMI is computed as `NPMI(a, b) = PMI(a, b) / -ln P(ab)`. This normalizes the quantity to be within the range `[-1; 1]`.
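
To make the formulas concrete, here is a minimal sketch in plain Python (no Spark). The function name `npmi_from_counts` and all the counts are made up for illustration; they are not part of the assignment.

```python
from math import log


def npmi_from_counts(count_a, count_b, count_ab, total_words, total_pairs):
    """NPMI of the pair 'a b' estimated from raw occurrence counts."""
    p_a = count_a / total_words
    p_b = count_b / total_words
    p_ab = count_ab / total_pairs
    pmi = log(p_ab / (p_a * p_b))
    return pmi / -log(p_ab)


# Words that only ever appear together: NPMI is 1 (up to rounding).
print(npmi_from_counts(10, 10, 10, 1000, 1000))

# Independent words, P(ab) = P(a) * P(b): NPMI is 0 (up to rounding).
print(npmi_from_counts(100, 100, 10, 1000, 1000))

# Words that co-occur less often than chance: NPMI is negative.
print(npmi_from_counts(100, 100, 1, 1000, 1000))
```

The three calls cover the upper end, the middle, and the negative side of the `[-1; 1]` range.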

### Step 1. Create SparkContext.

In [None]:
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("MyApp").setMaster("yarn"))

# For a local run, create the context with the line below instead of the YARN one above.
# sc = SparkContext(conf=SparkConf().setAppName("MyApp").setMaster("local"))
# sc.uiWebUrl


### Step 2. Load and parse data.

In [None]:
# Define paths to files
# Local run
# STOP_WORDS_FILE = "stop_words_en.txt"
# DATA_FILE = "test_data.txt"

# Remote run
STOP_WORDS_FILE = "/datasets/stop_words_en.txt"
DATA_FILE = "/data/wiki/en_articles_part/articles-part"

In [None]:
# Read the stop words into a set for fast membership checks
with open(STOP_WORDS_FILE, 'r', encoding='utf-8') as f:
    stop_words = {w.strip().lower() for w in f}

# Use a broadcast variable to share the stop words with the executors
stop_words_broadcast = sc.broadcast(stop_words)

In [None]:
import re

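def parse_article(line):
    """
    Splits an article line into lowercase words with stop words removed.
    """
    try:
        article_id, text = line.rstrip().split('\t', 1)
    except ValueError:
        # Skip malformed lines that have no tab-separated text
        return []

    # Strip non-word characters from both ends, then split on whitespace
    # together with any adjacent punctuation
    text = re.sub(r"^\W+|\W+$", "", text, flags=re.UNICODE)
    words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)

    return [w.lower() for w in words if w.lower() not in stop_words_broadcast.value]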

wiki = sc.textFile(DATA_FILE, 4).map(parse_article).cache()
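
The tokenization inside `parse_article` can be tried without Spark. The sketch below applies the same two regular expressions to a made-up input line; `sample_line` and `sample_stop_words` are hypothetical stand-ins for the real data and the real stop word list.

```python
import re

# Hypothetical input: "<article id>\t<article text>"
sample_line = "12\tAnarchism is a political philosophy, (often) anti-state."
sample_stop_words = {"is", "a"}

article_id, text = sample_line.rstrip().split("\t", 1)

# Trim non-word characters at both ends of the text
text = re.sub(r"^\W+|\W+$", "", text)

# Split on whitespace, swallowing punctuation next to it
tokens = re.split(r"\W*\s+\W*", text)

words = [t.lower() for t in tokens if t.lower() not in sample_stop_words]
print(words)
# → ['anarchism', 'political', 'philosophy', 'often', 'anti-state']
```

Punctuation is removed only when it touches whitespace or the ends of the text, so a hyphenated token such as "anti-state" stays in one piece.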


### Step 3. Define main logic.

In [None]:
# Define functions
from collections import Counter

from numpy import log


def make_pairs(data):
    """
    Builds "first_second" keys for every pair of adjacent words in the article
    and returns them with their counts within that article.
    """
    pairs = ["%s_%s" % (data[i], data[i + 1]) for i in range(len(data) - 1)]
    counter = Counter(pairs)
    return list(counter.items())

def npmi(collocation):
    """
    Calculates NPMI for a (collocation, count) tuple.
    Relies on the broadcast variables defined in the next cell.
    """
    colc, colc_count = collocation
    first, second = colc.split("_", 1)
    prob_first = words_counts_map.value[first] / float(total_words.value)
    prob_second = words_counts_map.value[second] / float(total_words.value)
    prob_colc = colc_count / float(total_pairs.value)

    pmi = log(prob_colc / (prob_first * prob_second))
    # Use a name other than `npmi` so the function itself is not shadowed
    score = pmi / (-1.0 * log(prob_colc))

    return colc, score
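
As a quick illustration of what `make_pairs` produces for one article, the sketch below counts adjacent pairs in a made-up token list using the same `"first_second"` key format. The token list is hypothetical, and `zip` is used here only as a compact alternative to the index loop.

```python
from collections import Counter

# Hypothetical, already lowercased and filtered article
tokens = ["roman", "empire", "fell", "roman", "empire"]

# Pair every word with its right-hand neighbour
bigrams = Counter("%s_%s" % (a, b) for a, b in zip(tokens, tokens[1:]))

print(sorted(bigrams.items()))
# → [('empire_fell', 1), ('fell_roman', 1), ('roman_empire', 2)]
```

One caveat of the string key: `npmi` recovers the two words with `split("_", 1)`, so a token that itself contains an underscore would be split in the wrong place.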

In [None]:
# Calculate each word count
words = wiki.flatMap(lambda x: [(w, 1) for w in x]).reduceByKey(lambda a,b: a + b)
words_counts_map = sc.broadcast(words.collectAsMap())

# Calculate pair counts (words are already lowercased in parse_article)
pairs = wiki.flatMap(make_pairs).reduceByKey(lambda a, b: a + b)

# Calculate total number of words and pairs
total_words = words.map(lambda x : x[1]).sum()
total_pairs = pairs.map(lambda x: x[1]).sum()

total_words = sc.broadcast(total_words)
total_pairs = sc.broadcast(total_pairs)

In [None]:
# Calculate NPMI for pairs that occur at least 500 times
npmis = pairs.filter(lambda x: x[1] >= 500).map(npmi)

### Step 4. Print result.

In [None]:
# Print the 39 collocations with the highest NPMI
for collocation, score in npmis.top(39, key=lambda x: x[1]):
    print(collocation)

### Step 5. Stop Spark.

In [None]:
sc.stop()