# Indexing
- **Task**: index the tokens in each document and weight their relevance.
- **Input**: tokenized documents
- **Output**: term-document matrix

### Main steps
1. Tf (Term frequency)
2. Idf (Inverse document frequency)
3. TfIdf

In [1]:
import json
import pandas as pd
import numpy as np
from collections import Counter, defaultdict
import spacy
import nltk

In [2]:
dataset_file = '../data/wiki_dataset.json'
with open(dataset_file, 'r') as infile:
    dataset = json.load(infile)
docs = dataset['docs']
queries = dataset['queries']

In [3]:
nlp = spacy.load("en_core_web_sm")

In [4]:
tokens = lambda text: [x.lemma_ for x in nlp(text) if x.pos_ not in ['PUNCT', 'SPACE'] and not x.is_stop]

## Indexing from scratch
Index structure: <code>{doc_id: {token: tf(token, doc_id), ...}, ...}</code>

### Tf
In this example, we use the double normalized K term frequency, with $K = 0.5$:
$$
tf(t, d) = K + (1 - K)\frac{f(t,d)}{\max\limits_{t' \in d} f(t',d)},
$$
where $f(t,d)$ is the frequency (i.e., count) of token $t$ in document $d$.
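
As a quick check of the formula, here is a minimal sketch on a hand-made token list. The list and the helper name `double_norm_tf` are illustrative assumptions, not part of the notebook. The most frequent token gets weight 1, and a token occurring half as often gets $K + (1 - K) \cdot 0.5 = 0.75$.

```python
from collections import Counter

def double_norm_tf(token_list, k=0.5):
    """Double normalized K tf for one already-tokenized, non-empty document."""
    counts = Counter(token_list)
    max_count = max(counts.values())  # frequency of the most common token
    return {t: k + (1 - k) * c / max_count for t, c in counts.items()}

# Hypothetical toy document: two tokens occur twice, two occur once.
toy = ['german', 'german', '1989', '1989', 'Sabine', 'high']
weights = double_norm_tf(toy)
print(weights)
```

With $K = 0.5$ every weight falls in $[0.5, 1]$, so long documents with many repeated tokens do not dominate short ones.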

In [5]:
TF, k = {}, 0.5
for docid, text in enumerate(docs):
    # (token, count) pairs sorted by decreasing count
    f = Counter(tokens(text)).most_common()
    maxf = f[0][1]  # count of the most frequent token in this document
    TF[docid] = {token: k + (1 - k) * (x / maxf) for token, x in f}

In [9]:
docs[124]

'Sabine Bramhoff (born 1 November 1964) is a retired German high jumper. She finished seventh at the 1989 European Indoor Championships. She represented the sports club LC Paderborn, and won the silver medal at the West German champion in 1989. Her personal best jump was 1.94 metres (6.3\xa0ft), achieved in August 1990 in Düsseldorf.'

#### Bag of words

In [11]:
list(sorted(TF[124].items(), key=lambda x: -x[1]))[:10]

[('german', 1.0),
 ('1989', 1.0),
 ('Sabine', 0.75),
 ('Bramhoff', 0.75),
 ('bear', 0.75),
 ('1', 0.75),
 ('November', 0.75),
 ('1964', 0.75),
 ('retire', 0.75),
 ('high', 0.75)]

### Idf
$$
idf(t) = \log \left(\frac{N}{n_t} \right),
$$
where $N$ denotes the corpus size, and $n_t$ denotes the number of documents actually containing $t$.

In [12]:
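DF, N = defaultdict(lambda: 0), len(docs)
for docid, bow in TF.items():  # docid rather than k, which holds the tf constant
    for t in bow.keys():
        DF[t] += 1  # count each document at most once per token
IDF = lambda x: np.log(N / DF[x])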

In [13]:
print(DF['send'], DF['chinese'])
print(IDF('send'), IDF('chinese'))

23 8
4.912654885736052 5.968707559985366
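
Note that `IDF` divides by `DF[x]`, and `DF` is a `defaultdict`: asking for a token that never occurs in the corpus (a query term, for instance) raises `ZeroDivisionError` and also inserts that token into `DF` as a side effect. A minimal sketch of a guarded variant follows. The name `safe_idf` and the choice of returning 0 for unseen tokens are assumptions, not part of the notebook. The toy counts reuse the two document frequencies printed above, with $N = 3128$.

```python
import math

def safe_idf(token, df, n_docs):
    """idf(t) = log(N / n_t); unseen tokens get weight 0 instead of an error."""
    n_t = df.get(token, 0)  # .get does not add missing keys, even on a defaultdict
    return math.log(n_docs / n_t) if n_t > 0 else 0.0

toy_df = {'send': 23, 'chinese': 8}
print(safe_idf('send', toy_df, 3128), safe_idf('chinese', toy_df, 3128))
print(safe_idf('unseen', toy_df, 3128))
```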


### TfIdf

In [14]:
TfIdf = {}
for docid, bow in TF.items():
    # weight of a token in a document: tf(token, doc) * idf(token)
    TfIdf[docid] = {token: w * IDF(token) for token, w in bow.items()}

In [16]:
list(sorted(TfIdf[124].items(), key=lambda x: -x[1]))[:10]

[('Bramhoff', 6.0361118262489),
 ('LC', 6.0361118262489),
 ('Paderborn', 6.0361118262489),
 ('1.94', 6.0361118262489),
 ('ft', 6.0361118262489),
 ('jumper', 5.516251440828943),
 ('Indoor', 5.516251440828943),
 ('6.3', 5.516251440828943),
 ('Düsseldorf', 5.516251440828943),
 ('Sabine', 4.996391055408983)]

## Multiword indexing and compound terms selection
**Task**: find sequences of n words (called n-grams) that should be counted as a single term during indexing (e.g., New York).

**Approach**: use pointwise mutual information (PMI) to score how strongly the words of an n-gram (a 2-gram, or bigram, in this example) are associated; a high score suggests that the n-gram is a single compound term.

$$
pmi(t_i, t_j) = \log \left (\frac{P(t_i, t_j)}{P(t_i)P(t_j)} \right ) 
$$

Denoting by $f(t)$ the frequency of term $t$ in the corpus, and by $f(t_i, t_j)$ the frequency of the bigram $(t_i, t_j)$, the probabilities can be estimated as:

$$
P(t_i, t_j) = \frac{f(t_i, t_j)}{\sum\limits_{(x, y) \in corpus}f(x, y)}, \qquad
P(t_i) = \frac{f(t_i)}{\sum\limits_{t \in corpus}f(t)}
$$
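
To make the estimates concrete, here is a minimal sketch on a tiny hand-made corpus of already tokenized documents. The corpus and the names `unigrams`, `bigrams`, and `pmi` are illustrative assumptions. Bigrams are taken as adjacent pairs within a document, never across documents.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is a list of tokens.
toy_corpus = [
    ['New', 'York', 'city'],
    ['visit', 'New', 'York'],
    ['new', 'city', 'visit'],
]

unigrams, bigrams = Counter(), Counter()
for toks in toy_corpus:
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))  # adjacent pairs within one document

n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())

def pmi(a, b):
    """PMI of an observed bigram (a, b); undefined for pairs that never occur."""
    p_ab = bigrams[(a, b)] / n_bi
    return math.log(p_ab / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))

print(pmi('New', 'York'), pmi('York', 'city'))
```

Here `('New', 'York')` scores higher than `('York', 'city')` because the two words always appear together, while `city` also appears without `York`.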

#### Unigram probability estimation

In [17]:
U, Un = defaultdict(lambda: 0), 0
for doc in docs:
    for token in tokens(doc):
        U[token] += 1
        Un += 1

In [18]:
p_u = lambda x: U[x] / Un

In [19]:
print(U['school'], Un, p_u('school'))

98 129076 0.0007592426167529208


#### Bigram probability estimation

In [20]:
B, Bn = defaultdict(lambda: 0), 0
for doc in docs:
    for a, b in nltk.ngrams(tokens(doc), 2):
        B[(a, b)] += 1
        Bn += 1

In [21]:
p_b = lambda x, y: B[(x, y)] / Bn

In [22]:
print(B[('New', 'York')], Bn, p_b('New', 'York'))

204 125948 0.0016197160733000921


#### Pmi

In [23]:
PMI = {}
for (a, b), _ in B.items():
    PMI[(a, b)] = np.log(p_b(a, b) / (p_u(a) * p_u(b)))

In [24]:
for (a, b), p in sorted(PMI.items(), key=lambda x: -x[1])[:10]:
    print(a, b, p, U[a], U[b])

Glaucium stigma 11.792688911965513 1 1
stigma lobe 11.792688911965513 1 1
lactucoide Himalayan 11.792688911965513 1 1
subgroup Rhaetian 11.792688911965513 1 1
belltower campanile 11.792688911965513 1 1
triassic sedimentary 11.792688911965513 1 1
illustrious forebear 11.792688911965513 1 1
Mayora Rengifo 11.792688911965513 1 1
UT Cajamarca 11.792688911965513 1 1
physalaemus maculiventris 11.792688911965513 1 1
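
All ten pairs share the same score because each word occurs exactly once and each pair occurs exactly once. In that case the estimate reduces to $\log(U_n^2 / B_n)$, which is the largest value the estimate can take on this corpus, so raw PMI ranks rare coincidences above genuine compound terms. A quick check with the unigram and bigram totals printed earlier:

```python
import math

Un, Bn = 129076, 125948  # unigram and bigram totals printed earlier

# f(a, b) = f(a) = f(b) = 1  =>  pmi = log((1 / Bn) / ((1 / Un) * (1 / Un)))
hapax_pmi = math.log((1 / Bn) / ((1 / Un) ** 2))
print(hapax_pmi)
```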


#### Use a threshold on the minimum number of occurrences required

In [25]:
PMI, th = {}, 5
for (a, b), _ in B.items():
    if U[a] > th and U[b] > th:
        PMI[(a, b)] = np.log(p_b(a, b) / (p_u(a) * p_u(b)))

In [26]:
for (a, b), p in sorted(PMI.items(), key=lambda x: -x[1])[:10]:
    print(a, b, p, U[a], U[b])

Hong Kong 10.000929442737458 6 6
Franche Comté 10.000929442737458 6 6
Fissurellidae keyhole 9.8467787629102 6 7
Serbian Cyrillic 9.818607885943504 6 6
Northwest Territories 9.713247370285677 8 7
Singles Chart 9.664457206116245 7 6
Pearl Jam 9.664457206116245 7 6
Pyramidellidae pyram 9.595464334629295 8 9
Terebridae auger 9.595464334629295 9 9
Buenos Aires 9.490103818971468 10 10


### Embed bigrams in tokenization

In [27]:
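def pmi_tokenizer(doc, perc=95):
    bigram = []  # tokens of the compound term currently being built
    words = []
    tks = tokens(doc)
    if len(tks) == 0:
        return words
    # keep only bigrams whose PMI is above the given percentile
    th = np.percentile(list(PMI.values()), perc)
    for (a, b) in nltk.ngrams(tks, 2):
        if (a, b) in PMI and PMI[(a, b)] > th:
            if len(bigram) == 0:
                bigram += [a, b]
            else:
                bigram.append(b)
        else:
            if len(bigram) > 0:
                words.append(" ".join(bigram))
                bigram = []
            else:
                words.append(a)
    # flush a trailing compound term, or emit the last token not yet covered
    if len(bigram) > 0:
        words.append(" ".join(bigram))
    else:
        words.append(tks[-1])
    return words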

In [28]:
pmi_tokens = lambda doc: pmi_tokenizer(doc, perc=90)

In [44]:
print(tokens(docs[180])[40:50])
print(pmi_tokens(docs[180])[40:50])

['model', 'China', 'Special', 'Administrative', 'Regions', 'SARs', 'Hong', 'Kong', 'Macau', 'like']
['Regions', 'SARs', 'Hong Kong', 'Macau', 'like', 'Basic', 'Law', '기본법', 'Kibonpŏp', 'chinese']
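
The two slices above differ because merging shortens the token list, so positions 40 to 50 no longer point at the same words. To see the merging logic in isolation from spaCy and from the PMI table, here is a minimal sketch that works on an already tokenized list. The helper name `merge_compounds` and the set of accepted bigrams are hypothetical, standing in for the pairs whose PMI exceeds the percentile threshold in `pmi_tokenizer`.

```python
def merge_compounds(tks, compounds):
    """Greedily merge chains of adjacent tokens whose bigram is in `compounds`."""
    words, chain = [], []
    for tok in tks:
        if chain and (chain[-1], tok) in compounds:
            chain.append(tok)  # extend the compound being built
        else:
            if chain:
                words.append(" ".join(chain))  # close the previous chain
            chain = [tok]  # start a new chain with the current token
    if chain:
        words.append(" ".join(chain))  # flush the last chain
    return words

toks = ['SARs', 'Hong', 'Kong', 'Macau', 'like', 'New', 'York']
accepted = {('Hong', 'Kong'), ('New', 'York')}
print(merge_compounds(toks, accepted))
```

Chains longer than two tokens are merged as well, provided every adjacent pair in the chain is accepted.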


### Term-document matrix

In [45]:
M = pd.DataFrame(TfIdf)
M.fillna(0, inplace=True)

In [46]:
M.shape

(24808, 3128)
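
`pd.DataFrame` turns the outer keys of a nested dictionary into columns and the inner keys into the row index, which is why `M` has one row per token and one column per document. A minimal sketch on a hypothetical two-document index (the weights are made up for illustration):

```python
import pandas as pd

# Same structure as TfIdf: {doc_id: {token: weight, ...}, ...}
toy_index = {0: {'town': 2.5, 'South': 3.3}, 1: {'town': 2.2}}

toy_M = pd.DataFrame(toy_index).fillna(0)  # tokens missing from a document get 0
print(toy_M.shape)             # (number of tokens, number of documents)
print(toy_M.loc['South', 1])   # 'South' does not occur in document 1
```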

In [47]:
M.head()

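| token | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 3118 | 3119 | 3120 | 3121 | 3122 | 3123 | 3124 | 3125 | 3126 | 3127 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| South | 3.347669 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.347669 | 0.0 | 0.0 | 0.0 | 0.0 |
| Carolina | 4.912655 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Great | 3.95667 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Falls | 4.476531 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| town | 2.567382 | 0.0 | 0.0 | 2.282118 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |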


In [48]:
M.T.head()

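| doc_id | South | Carolina | Great | Falls | town | Chester | County | United | States | locate | ... | Masoala | kona | palm | few | Ochil | Jamestown | Steuben | Morehouse | Artanovsky | 225 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.347669 | 4.912655 | 3.95667 | 4.476531 | 2.567382 | 4.1124 | 1.698243 | 1.546273 | 1.733683 | 1.938238 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 2.282118 | 0.0 | 1.509549 | 1.374465 | 1.541051 | 1.722878 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |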


## Text processing with scikit-learn
A tutorial on text processing with scikit-learn is available [here](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html).

In [49]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

Scikit-learn's text utilities expect raw strings rather than pre-tokenized text. We therefore build pseudo-documents by running our tokenizer and joining the words of each compound term with an underscore, so that every bigram becomes a single pseudo-word.
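
With its default settings, `CountVectorizer` lowercases the text and keeps only tokens made of two or more word characters. Since `_` counts as a word character, a pseudo-word such as `Hong_Kong` survives as the single feature `hong_kong`, while a one-character token such as `1` is dropped. A minimal sketch on two hypothetical pseudo-documents (the `toy_` names are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["SARs Hong_Kong Macau", "Hong_Kong 1 chinese"]
toy_V = CountVectorizer()              # default lowercase=True and token pattern
toy_C = toy_V.fit_transform(toy_docs)  # sparse document-term count matrix

print(sorted(toy_V.vocabulary_))           # the learned feature names
col = toy_V.vocabulary_['hong_kong']       # column index of the pseudo-word
print(toy_C[:, col].toarray().ravel())     # its count in each document
```

Note the orientation: rows are documents and columns are terms, the transpose of the term-document matrix `M` built from scratch.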

In [50]:
pseudo_docs = [" ".join([x.replace(' ', '_') for x in pmi_tokens(d)]) for d in docs]

In [51]:
pseudo_docs[180]

'Sinŭiju Special Administrative Region special administrative region SAR North Korea proclaim 2002 de facto operation 2014 border China establish September 2002 area include part Sinŭiju surround area attempt introduce market economic_directly_govern_case directly_govern Cities special administrative region model China Special Administrative Regions SARs Hong_Kong Macau like Basic Law 기본법 Kibonpŏp chinese dutch businessman Yang Bin appoint governor SPA Presidium 2002 formally assume post arrest_chinese_authority_sentence 18 year prison_tax evasion economic_crime north korean_authority soon announce development sinŭiju SAR continue SAR administration_Commission Foreign Economic Cooperation Promotion plan SAR abandon April 2008 SAR reform_effect widely believe North Korea_abandon_project governor_arrest Julie Sa 沙日香 appoint governor 2004'

In [52]:
V = CountVectorizer()

In [53]:
C = V.fit_transform(pseudo_docs)

In [54]:
C

<3128x25779 sparse matrix of type '<class 'numpy.int64'>'
	with 96325 stored elements in Compressed Sparse Row format>

In [55]:
C.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

#### Map matrix columns to words

In [59]:
V.vocabulary_.get('hong_kong')

11514

In [61]:
np.nonzero(C[:,11514].toarray())

(array([ 180, 1418, 1792, 2749]), array([0, 0, 0, 0]))

In [62]:
C[180, 11514]

1

### TfIdf

In [63]:
tf_idf = TfidfTransformer(use_idf=True)
X = tf_idf.fit_transform(C)

In [64]:
X

<3128x25779 sparse matrix of type '<class 'numpy.float64'>'
	with 96325 stored elements in Compressed Sparse Row format>

In [65]:
X.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [66]:
X[180, 11514]

0.08535316687415362
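
This weight is not directly comparable with the from-scratch values: with its default settings, `TfidfTransformer` uses raw counts as tf, a smoothed idf $\ln\left(\frac{1 + N}{1 + n_t}\right) + 1$, and then L2-normalizes each document row. A minimal sketch that reproduces these defaults by hand on a hypothetical 2x3 count matrix and compares the result with scikit-learn:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical counts: 2 documents (rows) x 3 terms (columns).
counts = np.array([[1, 1, 0],
                   [0, 1, 2]])
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                  # document frequency of each term

idf = np.log((1 + n_docs) / (1 + df)) + 1      # smoothed idf (smooth_idf=True)
manual = counts * idf                          # raw counts used as tf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)  # L2 norm per row

X_toy = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual, X_toy))
```

Because of the L2 normalization, every row of `X` has unit length, so the dot product of two rows is their cosine similarity.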