## Description
#### This is one of the assignments from the course CS 431/631 (Data-intensive Distributed Analytics) at the University of Waterloo.
#### This assignment focuses on text mining: computing the pointwise mutual information (PMI) scores of specific token pairs.
#### Some modifications have been made to improve the presentation on this platform.
---

#### Overview
**Goal:** Use Python to analyze the [pointwise mutual information (PMI)](http://en.wikipedia.org/wiki/Pointwise_mutual_information) of tokens in the text of Shakespeare's plays.\
**Files needed:** the text file (`Shakespeare.txt`); the Python tokenizer module (`simple_tokenize.py`).

If two events $x$ and $y$ are independent, their PMI will be zero.   A positive PMI indicates that $x$ and $y$ are more likely to co-occur than they would be if they were independent.   Similarly, a negative PMI indicates that $x$ and $y$ are less likely to co-occur.   The PMI of events $x$ and $y$ is given by
\begin{equation*}
PMI(x,y) = \log\frac{p(x,y)}{p(x)p(y)}
\end{equation*}
where $p(x)$ and $p(y)$ are the probabilities of occurrence of events $x$ and $y$, and $p(x,y)$ is the probability of co-occurrence of $x$ and $y$.
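
As a quick numeric illustration of the formula, the sketch below computes PMI from three probabilities. The function name `pmi_from_probs` and the example probabilities are invented for illustration; base-10 logarithms are used to match the convention stated in the Part 2 code.

```python
from math import log10

def pmi_from_probs(p_x, p_y, p_xy):
    """PMI(x, y) = log10(p(x, y) / (p(x) * p(y))); hypothetical helper."""
    return log10(p_xy / (p_x * p_y))

# Independent events: p(x, y) == p(x) * p(y), so the PMI is 0.
independent = pmi_from_probs(0.5, 0.5, 0.25)

# Co-occurring 10 times more often than independence predicts: PMI is about 1.
positive = pmi_from_probs(0.1, 0.1, 0.1)

# Co-occurring 10 times less often than independence predicts: PMI is about -1.
negative = pmi_from_probs(0.1, 0.1, 0.001)

print(independent, positive, negative)
```

With base-10 logarithms, a PMI of 1 therefore means the pair co-occurs roughly ten times more often than it would under independence.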

Here, the "events" that we are interested in are occurrences of tokens on lines of text in the input file.   For example, one event
might represent the occurrence of the token "fire" on a line of text, and another might represent the occurrence of the token "peace".   In that case, $p(fire)$ represents the probability that "fire" will occur on a line of text, and $p(fire,peace)$ represents the probability that *both* "fire" and "peace" will occur on the *same* line.   For the purposes of these PMI computations, it does not matter how many times a given token occurs on a single line.   Either a line contains a particular token (at least once), or it does not.   For example, consider this line of text:

> three three three, said thrice

For this line, the following token-pair events have occurred:
- (three, said)
- (three, thrice)
- (said, three)
- (said, thrice)
- (thrice, three)
- (thrice, said)

Note that we are not interested in "reflexive" pairs, such as (thrice,thrice).
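
The enumeration above can be reproduced with a few lines of Python. This is a minimal sketch: `toy_tokenize` is a hypothetical stand-in for `simple_tokenize` (whose exact rules are not shown here) that lowercases the line and extracts runs of letters, and `line_pairs` is likewise an illustrative name.

```python
import re
from itertools import permutations

def toy_tokenize(line):
    # Hypothetical stand-in for simple_tokenize: lowercase runs of letters.
    return re.findall(r"[a-z]+", line.lower())

def line_pairs(line):
    # Deduplicate first, so a token repeated on a line counts only once.
    distinct = sorted(set(toy_tokenize(line)))
    # permutations(..., 2) yields ordered pairs and never pairs an item with itself.
    return list(permutations(distinct, 2))

pairs = line_pairs("three three three, said thrice")
print(len(pairs))
print(pairs)
```

Because the tokens are deduplicated before pairing, the three occurrences of "three" contribute no extra pairs, and no reflexive pair is produced.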

In addition to the probabilities of events, we will also be interested in the absolute *number* of occurrences of particular events, e.g., the number of lines in which "fire" occurs.   We will use $n(x)$ to represent these numbers.
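
Counts and probabilities are linked through the total number of lines, $N$: the Part 2 code estimates $p(x)$ as $n(x)/N$ and $p(x,y)$ as $n(x,y)/N$, which gives $PMI(x,y) = \log\frac{n(x,y)\,N}{n(x)\,n(y)}$. The sketch below illustrates this on a toy corpus; the four lines and the whitespace tokenization are assumptions made for the example, not part of the assignment data.

```python
from collections import Counter
from math import log10

# Toy corpus, invented for illustration; tokens are split on whitespace.
lines = [
    "fire and peace",
    "fire fire fire",
    "peace at last",
    "no fire here",
]

n = Counter()
for line in lines:
    # set() ensures a token is counted at most once per line.
    n.update(set(line.split()))

n_total = len(lines)
# Number of lines on which both tokens occur.
n_fire_peace = sum(1 for line in lines if {"fire", "peace"} <= set(line.split()))

p_fire = n["fire"] / n_total  # estimated probability that a line contains "fire"
# PMI written directly in terms of counts (base-10 logarithm).
pmi = log10((n_fire_peace * n_total) / (n["fire"] * n["peace"]))

print(n["fire"], n["peace"], n_fire_peace, p_fire, pmi)
```

Here "fire" appears on three lines even though it occurs five times in total, and the PMI is negative because the two tokens share only one line.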

The main task is to write Python code to analyze the PMI of tokens from Shakespeare's plays.    Based on this analysis, we want to be able to answer two types of queries:

* Two-Token Queries: Given a pair of tokens, $x$ and $y$, report the number of lines on which that pair co-occurs ($n(x,y)$) as well as $PMI(x,y)$.
* One-Token Queries: Given a single token, $x$, report the number of lines on which that token occurs ($n(x)$).   In addition, report the five tokens that have the largest PMI with respect to $x$ (and their PMIs).   That is, report the five $y$'s for which $PMI(x,y)$ is largest.

To avoid reporting spurious results for the one-token queries, we are only interested in token pairs that co-occur a sufficient number of times.   Therefore, we will use a *threshold* parameter for one-token queries.   A one-token query should only report pairs of tokens that co-occur on at least *threshold* lines of the input.   For example, given the threshold 12, a one-token query for "fire" should report the five tokens that have the largest PMI (with respect to "fire") among all tokens that co-occur with "fire" on at least 12 lines.   If there are fewer than five such tokens, report fewer than five.
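
The filtering and ranking step of a one-token query can be sketched as follows. This is only an illustration under assumed inputs: the function `top_pmi`, the layout of `candidates`, and all the numbers are hypothetical, and `heapq.nlargest` is one of several reasonable ways to pick the top entries.

```python
import heapq

def top_pmi(candidates, threshold, k=5):
    """candidates maps token y -> (n(x, y), PMI(x, y)); hypothetical layout."""
    eligible = [
        (y, n_xy, pmi) for y, (n_xy, pmi) in candidates.items() if n_xy >= threshold
    ]
    # nlargest returns at most k items, so fewer than k are reported when
    # fewer than k tokens meet the threshold.
    return heapq.nlargest(k, eligible, key=lambda item: item[2])

# Invented numbers, for illustration only.
candidates = {
    "burn": (14, 1.9),
    "water": (30, 1.2),
    "spark": (3, 2.7),   # high PMI but too rare: removed by the threshold
    "the": (120, 0.1),
}

result = top_pmi(candidates, threshold=12)
for y, n_xy, pmi in result:
    print(y, n_xy, pmi)
```

In this toy input, "spark" has the highest PMI but co-occurs on only 3 lines, so the threshold of 12 excludes it and only three tokens are reported.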



---
#### Part 1:

First, write some code to get an idea of how big the PMI analysis problem will be. Specifically, this code determines: (a) the number of *distinct* tokens in `Shakespeare.txt`, and (b) the number of distinct token pairs in `Shakespeare.txt`. Consider the token pair $(x,y)$ to be distinct from the pair $(y,x)$, i.e., count them both, and ignore token pairs of the form $(x,x)$.

In [1]:
# this imports the simple_tokenize function from the simple_tokenize.py file
from simple_tokenize import simple_tokenize

# Now, let's tokenize Shakespeare's plays
tokens_dic = {}        # token -> total number of occurrences
token_pairs_dic = {}   # (token, token) -> number of co-occurrences within a line
with open('Shakespeare.txt') as f:
    for line in f:
        # tokenize, one line at a time
        t = simple_tokenize(line)
        # count every occurrence of each token
        for i in t:
            tokens_dic[i] = tokens_dic.get(i, 0) + 1
        # count every ordered pair of different tokens on the line
        for m in t:
            for n in t:
                if m != n:
                    token_pairs_dic[(m, n)] = token_pairs_dic.get((m, n), 0) + 1

# the lengths of the two dictionaries are the numbers of distinct tokens and token pairs, respectively
print("The number of distinct tokens is {0}".format(len(tokens_dic)))
print("The number of distinct token pairs is {0}".format(len(token_pairs_dic)))

The number of distinct tokens is 25975
The number of distinct token pairs is 1969760
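
To put these two counts in perspective: with $V$ distinct tokens there are at most $V(V-1)$ ordered pairs of different tokens. The short sketch below compares the reported pair count with that upper bound, using only the two numbers printed above; the variable names are illustrative.

```python
n_tokens = 25975      # distinct tokens, from the output above
n_pairs = 1969760     # distinct ordered token pairs, from the output above

max_pairs = n_tokens * (n_tokens - 1)   # every ordered pair (x, y) with x != y
fraction = n_pairs / max_pairs          # share of possible pairs that actually occur
unordered_pairs = n_pairs // 2          # (x, y) and (y, x) always occur together

print(max_pairs)
print(f"{fraction:.4%}")
print(unordered_pairs)
```

Only about 0.3% of the possible ordered pairs actually occur on some line, which is why storing just the observed pairs in a dictionary remains feasible.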


---

#### Part 2:
The Python code below can answer the one-token and two-token queries described above, for 'Shakespeare.txt'.

In [2]:
# this imports the simple_tokenize function from the simple_tokenize.py file
from simple_tokenize import simple_tokenize
# the log function for computing PMI
# for the sake of consistency across solutions, use log base 10
from math import log

###################################################################################################################
tokens_lines = {}        # token -> number of lines containing it
token_pairs_lines = {}   # (token, token) -> number of lines containing both
n_total = 0              # total number of lines in the text file
with open('Shakespeare.txt') as f:
    for line in f:
        n_total += 1
        # tokenize, one line at a time
        t = simple_tokenize(line)
        # keep only the distinct tokens of each line (sorted, for a stable order)
        t_new = sorted(set(t))
        # count each token once per line
        for i in t_new:
            tokens_lines[i] = tokens_lines.get(i, 0) + 1
        # count each ordered pair of different tokens once per line
        for m in t_new:
            for n in t_new:
                if m != n:
                    token_pairs_lines[(m, n)] = token_pairs_lines.get((m, n), 0) + 1
###################################################################################################################
###################################################################################################################

###################################################################################################################
#  the user interface below defines the types of PMI queries that users can ask
###################################################################################################################

while True:
    q = input("Input 1 or 2 space-separated tokens (return to quit): ")
    if len(q) == 0:
        break
    q_tokens = simple_tokenize(q)
    
    if len(q_tokens) == 1:
        # if the token does not occur in the text file, report a count of zero
        if q_tokens[0] not in tokens_lines:
            print("  n({0}) = 0 ".format(q_tokens[0]))
            print("There is no such token in the text file")
        # if there is such a token in the text file, do the following steps
        else:
            threshold = 0
            while threshold <= 0:
                try:
                    threshold = int(input("Input a positive integer frequency threshold: "))
                except ValueError:
                    threshold = 0
                # also reject zero and negative integers with an explicit message
                if threshold <= 0:
                    print("Threshold must be a positive integer!")
            # count the number of lines that a given single token appears in the text file
            n_0 = tokens_lines.get(q_tokens[0])
            # a dictionary storing token pairs that meet certain requirements and the number of lines they appear
            tokens_specific_number = {}
            # a dictionary storing token pairs that meet certain requirements and their PMI
            tokens_specific_PMI = {}
            # for each token in the text that is not equal to the given token, calculate the relevant numbers to get PMI
            for i in tokens_lines:
                if i != q_tokens[0]:
                    if token_pairs_lines.get((q_tokens[0],i)) and (token_pairs_lines.get((q_tokens[0],i)) >= threshold):
                        n_1 = tokens_lines.get(i)
                        n_2 = token_pairs_lines.get((q_tokens[0],i))
                        tokens_specific_number[(q_tokens[0],i)] = n_2
                        tokens_specific_PMI[(q_tokens[0],i)] = log((n_2/n_total)/((n_0/n_total)*(n_1/n_total)),10)
            # sort the 'tokens_specific_PMI' dictionary based on values
            PMI_dic_sort = dict(sorted(tokens_specific_PMI.items(), key = lambda item:item[1], reverse = True))
            result_1 = list(PMI_dic_sort.keys())
            result_2 = list(PMI_dic_sort.values())
            result_3 = []
            # for each token pair in sorted list, get their number of lines appearing in the text file
            for i in range(len(result_1)):
                result_3.append(tokens_specific_number[result_1[i]])
            print("  n({0}) = {1} ".format(q_tokens[0],n_0))
            print("  high PMI tokens with respect to {0} (threshold: {1}):".format(q_tokens[0],threshold))
            # report at most five tokens (fewer if fewer meet the threshold)
            for i in range(min(5, len(result_1))):
                print("    n{0} = {1},  PMI{0} = {2}".format(result_1[i],result_3[i],result_2[i]))

    elif len(q_tokens) == 2:
        # if the token pair never co-occurs, the line count is zero and the PMI cannot be computed
        if (q_tokens[0],q_tokens[1]) not in token_pairs_lines:
            print("  n({0},{1}) = 0 ".format(q_tokens[0],q_tokens[1]))
            print("There is no such token pair in the text file")
        # if the token pair does co-occur, look up the number of lines containing token 1, the number containing
        # token 2, and the number containing both. Then we can compute the PMI of the token pair.
        else:
            n_0 = tokens_lines.get(q_tokens[0]) 
            n_1 = tokens_lines.get(q_tokens[1])
            n_2 = token_pairs_lines.get((q_tokens[0],q_tokens[1]))
            PMI_two_tokens = log((n_2/n_total)/((n_0/n_total)*(n_1/n_total)),10)
            print("  n({0},{1}) = {2} ".format(q_tokens[0],q_tokens[1],n_2))
            print("  PMI({0},{1}) = {2} ".format(q_tokens[0],q_tokens[1],PMI_two_tokens))
    
    else:
        print("Input must consist of 1 or 2 space-separated tokens!")


Input 1 or 2 space-separated tokens (return to quit): sorry
Input a positive integer frequency threshold: 15
  n(sorry) = 91 
  high PMI tokens with respect to sorry (threshold: 15):
    n('sorry', 'am') = 62,  PMI('sorry', 'am') = 1.5950015984520203
    n('sorry', 'i') = 68,  PMI('sorry', 'i') = 0.6906128710555685
    n('sorry', 'for') = 20,  PMI('sorry', 'for') = 0.565108851217973
    n('sorry', 'that') = 21,  PMI('sorry', 'that') = 0.4271311667155323
    n('sorry', 'you') = 16,  PMI('sorry', 'you') = 0.24684833684509966
Input 1 or 2 space-separated tokens (return to quit): am sorry
  n(am,sorry) = 62 
  PMI(am,sorry) = 1.5950015984520203 
Input 1 or 2 space-separated tokens (return to quit): am sory
  n(am,sory) = 0 
There is no such token pair in the text file
Input 1 or 2 space-separated tokens (return to quit): i am sorry
Input must consist of 1 or 2 space-separated tokens!
Input 1 or 2 space-separated tokens (return to quit): 
