## Term weight
### Term frequency (TF)
The simplest way to weight the terms in a document is to count their occurrences, $tf_{d_i, t_j}$, where $d_i$ is the $i$th document and $t_j$ is the $j$th term.

However, the weight of a term in a document depends on the document length and on the relative frequency of the term with respect to the other terms in the same document, so raw counts are not directly comparable across documents.

To better capture the relative weights of terms, $tf$ is usually normalized according to one of the following strategies, where $k$ is a smoothing constant between 0 and 1.

<table width=50% style='font-size: 16px;'>
<thead>
<tr>
<th style="text-align: right;">TF Measure</th>
<th style="text-align: center;">Formula</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right;">natural</td>
<td style="text-align: center;">$tf_{d_i,t_j}$</td>
</tr>
<tr>
<td style="text-align: right">augmented</td>
<td style="text-align: center">
$k + (1-k)\frac{tf_{d_i,t_j}}{\max_{t_m \in d_i}{tf_{d_i,t_m}}}$</td>
</tr>
<tr>
<td style="text-align: right">log normalized</td>
<td style="text-align: center">
$1 + \log{tf_{d_i,t_j}}$</td>
</tr>
<tr>
<td style="text-align: right">log avg</td>
<td style="text-align: center">
$\frac{1 + \log{tf_{d_i,t_j}}}{1 + \log{\mathrm{avg}_{t_m \in d_i}{tf_{d_i,t_m}}}}$</td>
</tr>
</tbody>
</table>
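
The four measures in the table can be sketched in plain Python for a single tokenized document. This is a minimal illustration, not part of the original notebook: the function name `tf_variants`, the sample tokens, the default `k=0.5`, and the use of the natural logarithm are all assumptions, and the maximum and average are taken over the terms of the same document.

```python
import math
from collections import Counter

def tf_variants(tokens, k=0.5):
    """Return {term: {measure: weight}} for one tokenized document.

    Assumptions of this sketch: natural logarithm, k=0.5 by default,
    max and average computed over the distinct terms of this document.
    """
    counts = Counter(tokens)
    if not counts:
        return {}
    max_tf = max(counts.values())
    avg_tf = sum(counts.values()) / len(counts)
    weights = {}
    for term, tf in counts.items():
        weights[term] = {
            'natural': tf,
            'augmented': k + (1 - k) * tf / max_tf,
            'log normalized': 1 + math.log(tf),
            'log avg': (1 + math.log(tf)) / (1 + math.log(avg_tf)),
        }
    return weights

# Hypothetical toy document
doc = ['flow', 'pressure', 'flow', 'boundary', 'flow', 'layer']
w = tf_variants(doc)
for term, measures in w.items():
    print(term, measures)
```

With these tokens, the most frequent term (`flow`) gets an augmented weight of exactly 1, while a term occurring once gets $k + (1-k)/3$, which shows how the augmented measure compresses raw counts into the interval $[k, 1]$.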

### Inverse Document Frequency (IDF)
TF alone usually overestimates the weight of very common terms, because those terms appear frequently in almost every document.

A natural way to measure how common a term is in a corpus is to count the **number of documents** in which it appears; this is referred to as the **document frequency (DF)**:

$$
df(t_j) = \left| \{d_i : t_j \in d_i\} \right|
$$

Since we are interested in a measure that grows as a term becomes more **infrequent** in the corpus, we can use the **inverse document frequency (IDF)**, defined as:

$$
idf(t_j) = \log \frac{N}{df(t_j)}
$$

where $N$ denotes the number of documents in the corpus.
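
The two definitions above can be checked on a small in-memory corpus. The following is only a sketch under stated assumptions: the toy `corpus`, the helper names `df` and `idf`, and the choice of the natural logarithm are illustrative and do not come from the original notebook.

```python
import math

# Hypothetical toy corpus: each document is a list of tokens
corpus = [
    ['experimental', 'flow', 'study'],
    ['flow', 'boundary', 'layer'],
    ['flow', 'pressure', 'study'],
    ['heat', 'transfer'],
]

def df(term, docs):
    """Number of documents in which `term` appears at least once."""
    return sum(1 for d in docs if term in d)

def idf(term, docs):
    """log(N / df(term)), using the natural logarithm (an assumption)."""
    n = df(term, docs)
    if n == 0:
        raise ValueError(f'{term!r} does not occur in the corpus')
    return math.log(len(docs) / n)

for term in ['flow', 'study', 'heat']:
    print(term, df(term, corpus), round(idf(term, corpus), 3))
```

Here `flow` appears in three of the four documents and therefore receives the lowest IDF, while `heat` appears in only one and receives the highest.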

Other IDF formulations exist; here $n_t$ is shorthand for $df(t)$:

<table width=50% style='font-size: 16px;'>
<thead>
<tr>
<th style="text-align: right;">IDF Measure</th>
<th style="text-align: center;">Formula</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right;">standard</td>
<td style="text-align: center;">$\log{\frac{N}{n_t}}$</td>
</tr>
<tr>
<td style="text-align: right;">max</td>
<td style="text-align: center;">
$\log{\frac{\max_{t'\in d}n_{t'}}{1 + n_t}}$</td>
</tr>
<tr>
<td style="text-align: right;">probabilistic</td>
<td style="text-align: center;">
$\log{\frac{N - n_t}{n_t}}$</td>
</tr>
</tbody>
</table>
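
To compare the three formulations, the sketch below evaluates them for one term of one document in a toy corpus. The names `idf_variants` and `corpus` are hypothetical, the natural logarithm is assumed, and the maximum in the *max* variant is read as ranging over the terms of the document passed in. Returning negative infinity when a term occurs in every document is a choice made for this sketch, since the probabilistic formula is undefined in that case.

```python
import math

# Hypothetical toy corpus: each document is a list of tokens
corpus = [
    ['experimental', 'flow', 'study'],
    ['flow', 'boundary', 'layer'],
    ['flow', 'pressure', 'study'],
    ['heat', 'transfer'],
]

def idf_variants(term, doc, docs):
    """Return the standard, max and probabilistic IDF of `term` within `doc`."""
    N = len(docs)
    # Document frequency n_t for every term of the document and for the target term
    n = {t: sum(1 for d in docs if t in d) for t in set(doc) | {term}}
    n_t = n[term]
    if n_t == 0:
        raise ValueError(f'{term!r} does not occur in the corpus')
    max_n = max(n[t] for t in doc)
    return {
        'standard': math.log(N / n_t),
        'max': math.log(max_n / (1 + n_t)),
        # Undefined when n_t == N; this sketch returns -inf in that case
        'probabilistic': math.log((N - n_t) / n_t) if n_t < N else float('-inf'),
    }

rare = idf_variants('experimental', corpus[0], corpus)
common = idf_variants('study', corpus[0], corpus)
print(rare)
print(common)
```

Note that the probabilistic variant becomes zero for a term that occurs in exactly half of the documents and negative for terms that occur in more than half.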

In [1]:
import pymongo
import nltk
from collections import defaultdict

In [2]:
cran = pymongo.MongoClient()['inforet']['cran_tokens']

In [5]:
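def tf(doc, term, field='text'):
    """Count the occurrences of `term` in document `doc`, matching on `field`."""
    # Keep only the records of this document whose field equals the term
    m = {'$match': {'document': doc, field: term}}
    # Group the matching records by field value and count them
    g = {'$group': {'_id': '$'+field, 'count': {'$sum': 1}}}
    cursor = cran.aggregate([m, g])
    # At most one group is returned; with no match the count stays at 0
    c = 0
    for record in cursor:
        c = record['count']
    return c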

In [7]:
tf(1, 'experimental')

2
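
The logic of the aggregation can be reproduced without a MongoDB instance. The sketch below assumes that the collection stores one record per token occurrence, with a `document` field and a `text` field; this schema is inferred from the query above and is not confirmed by the source. The sample `records` and the name `tf_in_memory` are hypothetical.

```python
# Hypothetical sample records, assuming one record per token occurrence
records = [
    {'document': 1, 'text': 'experimental'},
    {'document': 1, 'text': 'investigation'},
    {'document': 1, 'text': 'experimental'},
    {'document': 2, 'text': 'experimental'},
]

def tf_in_memory(records, doc, term, field='text'):
    """Count the records of document `doc` whose `field` equals `term`."""
    return sum(1 for r in records
               if r.get('document') == doc and r.get(field) == term)

print(tf_in_memory(records, 1, 'experimental'))  # 2
print(tf_in_memory(records, 2, 'flow'))          # 0
```

The filter plays the role of the `$match` stage and the sum plays the role of the `$group` stage with `{'$sum': 1}`.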