# Retrieval of compound terms
Compound terms are n-tuples (2-tuples, 3-tuples, etc.) of terms that can be treated as a single term for linguistic purposes. Examples are New York, Los Angeles, and machine learning.

The idea behind retrieving compound terms is to evaluate the relevance of a sequence of $n$ terms in a corpus by comparing it with the relevance of its component terms. We call a sequence of $n$ terms an **n-gram**.

## Mutual information
Given two discrete random variables $A$ and $B$, we define their mutual information $M(A,B)$ as:

$$
    M(A, B) = \sum\limits_{a \in A}\sum\limits_{b \in B} P(a, b) \log \left(\frac{P(a, b)}{P(a)P(b)}\right)
$$

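The intuition is that we compare the observed joint distribution of $A$ and $B$ with the distribution they would have if they were independent. When $A$ and $B$ are independent, $P(a, b) = P(a)P(b)$ for every pair, each logarithm is zero, and so $M(A, B) = 0$.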
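As a minimal sketch of this definition, the snippet below computes the mutual information of two small joint distributions. The function name `mutual_information` and the two toy distributions are assumptions made for illustration; they are not part of the analysis that follows.

```python
import math

def mutual_information(joint):
    """Mutual information (in nats) of a joint distribution {(a, b): P(a, b)}."""
    # Marginals P(a) and P(b), obtained by summing the joint over the other variable.
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    # Pairs with zero probability contribute nothing to the sum, so they are skipped.
    return sum(p * math.log(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

# Toy distributions (hypothetical): two independent fair coins,
# and two coins that always show the same face.
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
dependent = {(0, 0): 0.5, (1, 1): 0.5}

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
print(mi_independent, mi_dependent)
```

The independent case gives zero, while the fully dependent case gives $\log 2$, which is the entropy of a single fair coin.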

When we observe a specific pair of data $a$ and $b$, such as two terms, we can work with the notion of **pointwise mutual information**, defined as:

$$
PMI(a, b) = \log \frac{P(a, b)}{P(a)P(b)},
$$
where $P(a, b)$ is the observed probability of randomly extracting the pair of terms $(a,b)$ from a corpus $C$, and $P(a)P(b)$ is the expected probability of $(a,b)$ under the assumption that $a$ and $b$ are independent.
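To make the sign of the PMI concrete, here is a small sketch with made-up probabilities. The function name `pmi` and all the numbers are hypothetical and serve only as an illustration.

```python
import math

def pmi(p_ab, p_a, p_b):
    """Pointwise mutual information (natural log) of a pair with joint probability p_ab."""
    return math.log(p_ab / (p_a * p_b))

# Hypothetical probabilities: P(a) = 0.02 and P(b) = 0.05, so under
# independence the pair would have probability 0.02 * 0.05 = 0.001.
associated = pmi(0.01, 0.02, 0.05)     # observed 10 times more often than expected
independent = pmi(0.001, 0.02, 0.05)   # observed as often as expected
avoided = pmi(0.0001, 0.02, 0.05)      # observed 10 times less often than expected
print(associated, independent, avoided)
```

A positive value means the two terms occur together more often than independence would predict, a value close to zero means they behave as if independent, and a negative value means they tend to avoid each other.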

In a text, we can in general estimate the probability of a term by means of the **maximum likelihood estimation (MLE)** method, by dividing the frequency of the term by the total frequency of all the terms in the dictionary $D$:

$$
P(a) = \frac{count(a)}{\sum\limits_{i \in D}count(i)}
$$

The same can be done for **2-grams** in order to estimate $P(a, b)$ (or higher order grams for estimating the probability of longer sequences).
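As a rough sketch of these estimates, assuming a tiny hypothetical corpus given as a plain list of tokens (not the collection queried below), the counts and probabilities can be obtained as follows. The names `tokens`, `p_unigram` and `p_bigram` are illustrative only.

```python
from collections import Counter

# Hypothetical toy corpus of 9 tokens.
tokens = "new york is big and new york is busy".split()

unigrams = Counter(tokens)
# Pair each token with its successor to obtain the 2-grams.
bigrams = Counter(zip(tokens, tokens[1:]))
n_unigrams = sum(unigrams.values())
n_bigrams = sum(bigrams.values())

def p_unigram(a):
    # MLE estimate of P(a): count of a over the total number of tokens.
    return unigrams[a] / n_unigrams

def p_bigram(a, b):
    # MLE estimate of P(a, b): count of the pair over the total number of 2-grams.
    return bigrams[(a, b)] / n_bigrams

print(p_unigram("new"), p_bigram("new", "york"))
```

Note that the two estimates use different normalizers: a sequence of 9 tokens contains 9 terms but only 8 2-grams.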

In [1]:
import pymongo
import nltk
from collections import defaultdict

In [2]:
cran = pymongo.MongoClient()['inforet']['cran_tokens']

In [3]:
def term_frequencies(collection, field='text'):
    # Group the token records by `field` and count the occurrences of each value
    g = {'$group': {'_id': '$' + field, 'count': {'$sum': 1}}}
    cursor = collection.aggregate([g])
    return {x['_id']: x['count'] for x in cursor}

In [5]:
TF = term_frequencies(cran, field='text')
N = sum(TF.values())

In [6]:
for k, v in sorted(TF.items(), key=lambda x: -x[1])[:6]:
    print(k, v, v / N)

the 18848 0.07802099546312548
of 12316 0.0509818856177766
. 9761 0.04040550385799914
, 6801 0.02815263105606517
and 5967 0.024700301354439184
a 5686 0.023537106335066397


### 2-gram indexing
To estimate probabilities for 2-grams as well, we need a dedicated index of them. We rebuild each sentence from its tokens and pad it with the `#start` and `#stop` markers.

In [7]:
def bigram_sentences(collection, field='text'):
    # Sort tokens by position so that each sentence is rebuilt in reading order
    s = {'$sort': {'document': 1, 'sentence': 1, 'position': 1}}
    # Collect the tokens of each (document, sentence) pair into one list
    g = {'$group': {'_id': {'doc': '$document', 'sent': '$sentence'},
                    'tokens': {'$push': '$' + field}}}
    cursor = collection.aggregate([s, g])
    # Pad every sentence with boundary markers before extracting 2-grams
    return [['#start'] + record['tokens'] + ['#stop'] for record in cursor]

In [8]:
bi_sent = bigram_sentences(cran, field='text')

In [10]:
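# B[a][b] counts how many times term b immediately follows term a
B = defaultdict(lambda: defaultdict(int))
Nb = 0  # total number of 2-grams
for sent in bi_sent:
    for a, b in nltk.ngrams(sent, n=2):
        Nb += 1
        B[a][b] += 1
# Each sentence contributes one '#start' and one '#stop' pseudo-term
TF['#start'] = len(bi_sent)
TF['#stop'] = TF['#start']
N += 2 * TF['#start']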

In [17]:
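# M[(a, b)] is the ratio P(a, b) / (P(a) P(b)), i.e. the PMI before taking the logarithm.
# The logarithm is monotonic, so ranking by this ratio is the same as ranking by PMI.
M = {}
for k, v in B.items():
    Nk = TF[k]
    for s, c in v.items():
        Ns = TF[s]
        M[(k, s)] = (c / Nb) / ((Nk / N) * (Ns / N))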

In [28]:
# A threshold of 0 keeps every 2-gram, so pairs of terms seen only once get the top scores
selected_bigrams = [(x[0], x[1], y) for x, y in
                    M.items() if TF[x[0]] > 0 and TF[x[1]] > 0]
for k, v, w in sorted(selected_bigrams, key=lambda x: -x[2])[:10]:
    print(k, v, w, TF[k], TF[v])

indium blue 271004.3139046648 1 1
fort halstead 271004.3139046648 1 1
17,500 meters 271004.3139046648 1 1
8000 calibers 271004.3139046648 1 1
embryonic lobes 271004.3139046648 1 1
decimal digits 271004.3139046648 1 1
/length changes/ 271004.3139046648 1 1
haveg rocketon 271004.3139046648 1 1
rewriting superscript 271004.3139046648 1 1
russian investigator 271004.3139046648 1 1


## Another method for estimating 2-gram relevance
We can also compute the conditional probability of observing $b$ immediately after $a$:

$$
P(b \mid a) = \frac{count(a, b)}{\sum\limits_{i \in D} count(a, i)} = \frac{count(a, b)}{count(a)}
$$
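As a small illustration of this estimate, the sketch below counts successors in two hypothetical padded sentences. The names `sentences`, `follow` and `p_next` are assumptions introduced only for this example.

```python
from collections import Counter, defaultdict

# Hypothetical sentences, already padded with boundary markers.
sentences = [["#start", "new", "york", "is", "big", "#stop"],
             ["#start", "new", "ideas", "#stop"]]

# follow[a][b] counts how many times b immediately follows a.
follow = defaultdict(Counter)
for sent in sentences:
    for a, b in zip(sent, sent[1:]):
        follow[a][b] += 1

def p_next(a, b):
    # P(b | a): count of (a, b) over the count of all 2-grams starting with a.
    total = sum(follow[a].values())
    return follow[a][b] / total if total else 0.0

print(p_next("new", "york"), p_next("york", "is"))
```

Here "new" is followed once by "york" and once by "ideas", so the estimate is 0.5, while "york" is always followed by "is". Because every sentence ends with `#stop`, each ordinary term has exactly one successor per occurrence, so the sum of `follow[a]` equals the term count of `a`.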

In [23]:
K = {}
for k, v in B.items():
    Nk = TF[k]
    for s, c in v.items():
        K[(k, s)] = c / Nk

In [26]:
selected_bigrams = [(x[0], x[1], y) for x, y in K.items() if TF[x[0]] > 20 and TF[x[1]] > 20]
for k, v, w in sorted(selected_bigrams, key=lambda x: -x[2])[:10]:   
    print(k, v, w, TF[k], TF[v])

non - 1.0 82 4632
semi - 1.0 42 4632
proportional to 1.0 23 4437
vicinity of 1.0 29 12316
navier - 1.0 24 4632
subjected to 0.9866666666666667 75 4437
. #stop 0.9857596557729741 9761 9685
re - 0.9795918367346939 49 4632
however , 0.9743589743589743 117 6801
presence of 0.9736842105263158 76 12316


## Note that

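$$
\frac{P(a, b)}{P(a)P(b)} \approx \frac{P(b \mid a)}{P(b)}
$$

The two sides are equal whenever $P(b \mid a) = P(a, b) / P(a)$ holds exactly. With the estimates used here, the left-hand side is larger than the right-hand side by the constant factor $N / N_b$, because $P(a, b)$ is normalized by the number of 2-grams $N_b$ (`Nb`) while the term probabilities are normalized by $N$.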

In [27]:
selected_bigrams = [(x[0], x[1], y) for x, y in K.items() if TF[x[0]] > 20 and TF[x[1]] > 20]
for k, v, w in sorted(selected_bigrams, key=lambda x: -x[2])[:10]:   
    print(k, v, w, round(w / (TF[v] / N), 2), round(M[(k, v)], 2))

non - 1.0 56.34 58.51
semi - 1.0 56.34 58.51
proportional to 1.0 58.81 61.08
vicinity of 1.0 21.19 22.0
navier - 1.0 56.34 58.51
subjected to 0.9866666666666667 58.03 60.26
. #stop 0.9857596557729741 26.56 27.58
re - 0.9795918367346939 55.19 57.31
however , 0.9743589743589743 37.38 38.83
presence of 0.9736842105263158 20.63 21.43
