# Spark assignment 2: Collocations

In the second part of the assignment, your task is to extract collocations, that is, word combinations that frequently occur together, for example “high school” or “roman empire”.

To find collocations, you will use the NPMI (normalized pointwise mutual information) metric.

The PMI of two words a and b is defined as “PMI(a, b) = ln (P(ab) / (P(a) * P(b)))”, where P(ab) is the probability of the two words occurring one right after the other, and P(a) and P(b) are the probabilities of the words a and b respectively.

You will estimate probabilities with occurrence counts, that is “P(a) = # of occurrences of word a / total number of words”, and “P(ab) = # of occurrences of words ‘a b’ / total number of word pairs”.

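To build an intuition behind the definition, see Reading material.

In short, PMI compares how often two words actually occur together with how often they would occur together if they were independent. Therefore, pairs of individually rare words that almost always appear together have a large PMI.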

NPMI is computed as “NPMI(a, b) = PMI(a, b) / (-ln P(ab))”. This normalizes the quantity to the range [-1, 1].
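
The two formulas can be checked on plain counts before moving to Spark. The snippet below is a minimal sketch, assuming Python 3 division; the function and parameter names (`pmi`, `npmi`, `pair_count`, `total_pairs`, and so on) are illustrative and not part of the assignment.

```python
import math

def pmi(pair_count, a_count, b_count, total_pairs, total_words):
    """PMI(a, b) = ln(P(ab) / (P(a) * P(b))), with probabilities estimated from counts."""
    p_ab = pair_count / total_pairs
    p_a = a_count / total_words
    p_b = b_count / total_words
    return math.log(p_ab / (p_a * p_b))

def npmi(pair_count, a_count, b_count, total_pairs, total_words):
    """NPMI(a, b) = PMI(a, b) / (-ln P(ab)).

    Assumes 0 < P(ab) < 1; P(ab) == 1 would make the denominator zero.
    """
    p_ab = pair_count / total_pairs
    return pmi(pair_count, a_count, b_count, total_pairs, total_words) / -math.log(p_ab)

# Hypothetical counts: two words that only ever appear together score close to 1,
# while words that co-occur exactly as often as chance predicts score close to 0.
print(npmi(50, 50, 50, 1000, 1000))
print(npmi(10, 100, 100, 1000, 1000))
```

The first call corresponds to P(ab) = P(a) = P(b), the upper end of the range; the second corresponds to P(ab) = P(a) * P(b), where PMI is zero.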

Your task is a bit more complicated now:

1. Extract all the words, as in the previous task.
2. Filter out stopwords using the dictionary (/datasets/stop_words_en.txt). Do not forget to convert words to lowercase!
3. Compute all bigrams (that is, pairs of consecutive words).
4. Keep only bigrams with at least 500 occurrences.
5. Compute NPMI for every remaining bigram (note: when computing probabilities, you need the unpruned counts!).
6. Sort word pairs by NPMI in descending order.
7. Print the top 39 word pairs, with words delimited by the underscore “_”.

For example:

roman_empire
south_africa

Part of the result on the sample dataset:

...
references_reading
notes_references
award_best
north_america
new_zealand
...

Hint: if you did everything right, “roman_empire” and “south_africa” are going to be in the result.
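
The steps above can be sketched without Spark on a tiny in-memory corpus, which is handy for sanity-checking the logic. This is only an illustration under stated assumptions: the function name `top_collocations`, its parameters, the whitespace tokenization, and the toy documents are all made up here, and Python 3 division is assumed. The totals are computed over all words and all bigrams before the frequency filter is applied, as the note about unpruned counts requires.

```python
import math
from collections import Counter

def top_collocations(docs, stop_words, min_count, top_n):
    """Return the top_n bigrams by NPMI among those seen at least min_count times."""
    word_counts = Counter()
    pair_counts = Counter()
    for doc in docs:
        # Lowercase, drop stop words, then count words and consecutive pairs.
        words = [w.lower() for w in doc.split()]
        words = [w for w in words if w not in stop_words]
        word_counts.update(words)
        pair_counts.update(zip(words, words[1:]))

    # Unpruned totals: taken before any bigram is filtered out.
    total_words = sum(word_counts.values())
    total_pairs = sum(pair_counts.values())

    scored = []
    for (w1, w2), count in pair_counts.items():
        if count < min_count:
            continue
        p_ab = count / total_pairs  # assumed to be strictly below 1
        p_a = word_counts[w1] / total_words
        p_b = word_counts[w2] / total_words
        score = math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)
        scored.append((score, w1 + "_" + w2))

    scored.sort(reverse=True)
    return [pair for _, pair in scored[:top_n]]

toy_docs = [
    "the roman empire fell",
    "the roman empire rose",
    "a big empire and a big roman road",
]
print(top_collocations(toy_docs, {"the", "a", "and"}, min_count=2, top_n=39))
```

On the real dataset the threshold would be 500 rather than 2; the small value here only makes the toy corpus produce a result.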

If you want to deploy the environment on your own machine, please use the bigdatateam/spark-course1 Docker container.

In [1]:
#! /usr/bin/env python

from pyspark import SparkConf, SparkContext
import re
import math

In [2]:
# Reuse the notebook's SparkContext if one already exists; otherwise create it.
conf = SparkConf().setAppName("MyApp").setMaster("local").set("spark.cores.max", "16")
sc = SparkContext.getOrCreate(conf=conf)

In [3]:
def read_stopwords(file_path):
    """Read the stop-word file into a set of lowercased words."""
    with open(file_path) as f:  # close the file once it has been read
        return set(word.strip().lower() for word in f)

In [4]:
stop_file = "/datasets/stop_words_en.txt" # coursera version
wiki_file = "/data/wiki/en_articles_part/articles-part" # coursera version

#stop_file = "/mnt/data/stop_words_en.txt" # local version
#wiki_file = "/mnt/data/articles-part" # local version
pair_thresh = 500

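def parse_article(line):
    """Split one "id<TAB>text" line into a list of words."""
    try:
        article_id, text = line.rstrip().split('\t', 1)
        # Strip leading and trailing non-word characters from the whole text.
        text = re.sub(r"^\W+|\W+$", "", text, flags=re.UNICODE)
        # Split on whitespace, swallowing punctuation adjacent to it.
        words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)
        return words
    except ValueError:
        # The line has no tab separator, so there is no article text.
        return []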
    
# Keep the stop words in a lowercased set so each membership test is O(1).
with open(stop_file, "r") as f:
    stop_words = set(word.strip().lower() for word in f)
    
stopwords_broadcast = sc.broadcast(stop_words)

def filter_stop(words):
    return [word for word in words if (word.lower() not in stopwords_broadcast.value)]
    
def lower(words):
    return [word.lower() for word in words]
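
To see what the two regular expressions in parse_article do, they can be tried on a single string outside Spark. The snippet is a small sketch: `tokenize` is a hypothetical helper that only mirrors the two regex calls, and the sample sentence is made up.

```python
import re

def tokenize(text):
    """Apply the same two regex steps as parse_article to a plain string."""
    # Strip leading and trailing non-word characters.
    text = re.sub(r"^\W+|\W+$", "", text, flags=re.UNICODE)
    # Split on whitespace together with any punctuation touching it.
    return re.split(r"\W*\s+\W*", text, flags=re.UNICODE)

# Punctuation next to whitespace is dropped, but a hyphen inside a word is kept.
print(tokenize("An American singer-songwriter, born 1970."))

# An empty text yields one empty-string token rather than an empty list.
print(tokenize(""))
```

Hyphenated tokens survive as single words, which is consistent with a pair such as american_singer-songwriter showing up in the results.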

In [5]:
def pairs(words):
    out = []
    for w1, w2 in zip(words, words[1:]):
        out.append((w1.lower() + "_" + w2.lower(), 1))
    return out

In [6]:
wiki = (sc.textFile(wiki_file, 16)
        .map(parse_article)
        .map(filter_stop)
        .map(lower)
       ).cache()

In [7]:
words = (wiki.flatMap(lambda words: [(word, 1) for word in words]).reduceByKey(lambda x,y: x+y)).cache()
words_total = words.map(lambda value: value[1]).sum()
words_total = sc.broadcast(words_total)

In [8]:
word_pairs = (wiki.flatMap(pairs).reduceByKey(lambda x,y: x+y)).cache()
pairs_total = word_pairs.map(lambda value: value[1]).sum()
pairs_total = sc.broadcast(pairs_total)

In [9]:
words_count_map = words.collectAsMap()
words_count_map = sc.broadcast(words_count_map)

In [11]:
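def npmi(value):
    """Map a ("w1_w2", count) record to ("w1_w2", NPMI score)."""
    pair, count = value
    # Assumes neither word contains an underscore itself.
    w1, w2 = pair.split("_")
    w1_count = words_count_map.value[w1]
    w2_count = words_count_map.value[w2]

    # Probabilities are estimated from the unpruned totals.
    pair_prob = float(count) / pairs_total.value
    w1_prob = float(w1_count) / words_total.value
    w2_prob = float(w2_count) / words_total.value

    pmi = math.log(pair_prob / (w1_prob * w2_prob))
    score = pmi / (-1 * math.log(pair_prob))  # named so it does not shadow npmi()
    return (pair, score)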

In [12]:
# Store the result under a new name so the npmi() function is not overwritten.
npmi_sorted = (word_pairs
               .filter(lambda value: value[1] >= pair_thresh)  # at least 500 occurrences
               .map(npmi)
               .sortBy(lambda value: value[1], ascending=False)
              ).cache()

In [13]:
for pair, value in npmi_sorted.take(39):
    print(pair)

los_angeles 0.9728998015144634
external_links 0.949690234441423
united_states 0.8833319371051045
prime_minister 0.8827431262049907
san_francisco 0.8522919541303453
et_al 0.8025243465157926
new_york 0.7870689136516178
supreme_court 0.7781368144102505
19th_century 0.7574641862586649
20th_century 0.751460473748708
references_external 0.7318269632728455
soviet_union 0.7278064337810742
air_force 0.7057732284123788
baseball_player 0.6917111626352064
university_press 0.6874245583071967
roman_catholic 0.6836777179540484
united_kingdom 0.6833646421485475
references_reading 0.6704634730189424
notes_references 0.6623632179005733
award_best 0.6581113468922128
north_america 0.652460117823392
new_zealand 0.6473526101706836
civil_war 0.6406165949559051
catholic_church 0.6259735154000262
world_war 0.624021393901398
war_ii 0.6195378356965802
south_africa 0.6188050625179361
took_place 0.6123195539618992
roman_empire 0.6093863488567927
united_nations 0.6061720044648012
american_singer-songwriter 0.565523