# Indexing
- **Task**: index tokens in documents and weight their relevance.
- **Input**: tokenized documents
- **Output**: term-document matrix

### Main steps
1. Tf (Term frequency)
2. Idf (Inverse document frequency)
3. TfIdf

In [43]:
import json
import pandas as pd
import numpy as np
from collections import Counter, defaultdict
import spacy
import nltk

In [2]:
dataset_file = '../data/country_dataset.json'
with open(dataset_file, 'r') as infile:
    dataset = json.load(infile)
docs = dataset['docs']
queries = dataset['queries']

In [4]:
nlp = spacy.load("en_core_web_sm")

In [6]:
tokens = lambda text: [x.lemma_ for x in nlp(text) if x.pos_ not in ['PUNCT', 'SPACE'] and not x.is_stop]

## Indexing from scratch
Index structure: <code>{doc_id: {token: tf(token, doc_id), ...}, ...}</code>

### Tf
In this example, we use the double normalization K variant of tf, with K = 0.5:
$$
tf(t, d) = K + (1 - K)\frac{f(t,d)}{\max\limits_{t' \in d} f(t',d)},
$$
where $f(t,d)$ is the frequency (i.e., count) of token $t$ in document $d$.

In [8]:
TF, k = {}, 0.5
for docid, text in enumerate(docs):
    # (token, count) pairs sorted by decreasing count
    f = Counter(tokens(text)).most_common()
    maxf = f[0][1]  # count of the most frequent token in the document
    TF[docid] = {token: k + (1 - k) * (x / maxf) for token, x in f}

In [69]:
docs[504]

'In a 2016 ranking of Chinese high schools that send students to study in American universities, Shenzhen Foreign Language School ranked number 19 in mainland China in terms of the number of students entering top American universities.\n'

#### Bag of words

In [70]:
list(sorted(TF[504].items(), key=lambda x: -x[1]))[:10]

[('student', 1.0),
 ('american', 1.0),
 ('university', 1.0),
 ('number', 1.0),
 ('2016', 0.75),
 ('ranking', 0.75),
 ('chinese', 0.75),
 ('high', 0.75),
 ('school', 0.75),
 ('send', 0.75)]
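
The scores above can be checked by hand: with K = 0.5, a token that occurs as often as the most frequent token of the document gets 1.0, and a token that occurs half as often gets 0.75. A minimal, self-contained sketch on a made-up token list follows; the helper name `double_norm_tf` and the toy document are assumptions for illustration, not part of the notebook.

```python
from collections import Counter

def double_norm_tf(token_list, k=0.5):
    """Double normalization K term frequency for one tokenized document."""
    counts = Counter(token_list)
    if not counts:  # empty document: nothing to weight
        return {}
    max_count = max(counts.values())
    return {t: k + (1 - k) * c / max_count for t, c in counts.items()}

# Toy document: 'student' occurs twice, the other tokens once.
toy = ['student', 'study', 'university', 'student']
weights = double_norm_tf(toy)
print(weights)
```

Because of the additive constant K, no token that appears in a document can score below 0.5, which dampens the effect of raw counts in longer documents.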

### Idf
$$
idf(t) = \log \left(\frac{N}{n_t} \right),
$$
where $N$ denotes the corpus size, and $n_t$ denotes the number of documents actually containing $t$.

In [62]:
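DF, N = defaultdict(int), len(docs)
for docid, bow in TF.items():
    for t in bow:
        DF[t] += 1  # each document counts once per distinct token
# IDF is defined only for tokens seen in the corpus (DF[x] > 0)
IDF = lambda x: np.log(N / DF[x])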

In [71]:
print(DF['send'], DF['chinese'])
print(IDF('send'), IDF('chinese'))

8 2
4.153006474870687 5.539300835990577
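
The two idf values printed above can be reproduced from the formula alone. The sketch below assumes N = 509, the number of documents implied by the shape of the term-document matrix reported further down; the function name `idf` and the explicit error for unseen tokens are illustrative choices, not the notebook's code.

```python
import math

def idf(n_docs, doc_freq):
    """Inverse document frequency; doc_freq must be a positive count."""
    if doc_freq <= 0:
        raise ValueError("idf is undefined for tokens that occur in no document")
    return math.log(n_docs / doc_freq)

# Document frequencies taken from the output above: 'send' -> 8, 'chinese' -> 2.
idf_send = idf(509, 8)
idf_chinese = idf(509, 2)
print(idf_send, idf_chinese)
```

The rarer token ('chinese', found in 2 documents) receives the higher weight, which is the intended effect of idf.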


### TfIdf

In [72]:
TfIdf = {}
for docid, bow in TF.items():
    TfIdf[docid] = {token: w * IDF(token) for token, w in bow.items()}

In [74]:
list(sorted(TfIdf[504].items(), key=lambda x: -x[1]))[:10]

[('university', 5.133835727882413),
 ('ranking', 4.674336012412892),
 ('Shenzhen', 4.674336012412892),
 ('Foreign', 4.674336012412892),
 ('Language', 4.674336012412892),
 ('student', 4.623010104116422),
 ('american', 4.623010104116422),
 ('number', 4.440688547322468),
 ('chinese', 4.154475626992933),
 ('rank', 4.154475626992933)]
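
Each score in this list is simply the tf weight multiplied by the idf of the token. As a quick check under the same assumption of N = 509 documents, 'chinese' has tf 0.75 in document 504 and appears in 2 documents; 'send' has the same tf but appears in 8 documents, so it falls out of the top ten. The helper `tf_idf` is a hypothetical name used only for this sketch.

```python
import math

N_DOCS = 509  # corpus size implied by the outputs above (assumption)

def tf_idf(tf_weight, doc_freq, n_docs=N_DOCS):
    """Combine a tf weight with log(N / n_t)."""
    return tf_weight * math.log(n_docs / doc_freq)

chinese = tf_idf(0.75, 2)  # tf and document frequency from the outputs above
send = tf_idf(0.75, 8)
print(chinese, send)
```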

## Multiword indexing and compound terms selection
**Task**: find sequences of n words (called n-grams) that should be counted as a single term during indexing (e.g., New York).

**Approach**: use pointwise mutual information (PMI) to score how strongly the words of an n-gram (a bigram in this example) are associated; a high score suggests that the n-gram is a single compound term.

$$
pmi(t_i, t_j) = \log \left (\frac{P(t_i, t_j)}{P(t_i)P(t_j)} \right ) 
$$

Denoting by $f(t)$ the frequency of term $t$ in the corpus and by $f(t_i, t_j)$ the frequency of the bigram $(t_i, t_j)$, the probabilities can be estimated as:

$$
P(t_i, t_j) = \frac{f(t_i, t_j)}{\sum\limits_{(x, y) \in corpus}f(x, y)},
\qquad
P(t_i) = \frac{f(t_i)}{\sum\limits_{t \in corpus}f(t)}
$$
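
To make the estimates concrete, here is a small sketch that computes PMI on a made-up, already tokenized corpus. The corpus, the variable names, and the use of `zip` in place of `nltk.ngrams` are assumptions chosen to keep the example dependency-free.

```python
import math
from collections import Counter

# Toy corpus of already tokenized documents (made up for illustration).
toy_docs = [
    ['new', 'york', 'city'],
    ['new', 'york', 'state'],
    ['new', 'city', 'hall'],
]

unigrams = Counter(t for doc in toy_docs for t in doc)
# Adjacent pairs inside each document, never across document boundaries.
bigrams = Counter(pair for doc in toy_docs for pair in zip(doc, doc[1:]))
n_uni = sum(unigrams.values())
n_bi = sum(bigrams.values())

def pmi(a, b):
    """PMI of an observed bigram (a, b); undefined for unseen pairs."""
    p_ab = bigrams[(a, b)] / n_bi
    return math.log(p_ab / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))

print(pmi('new', 'york'), pmi('new', 'city'))
```

In this toy corpus 'new york' scores higher than 'new city' because 'york' almost always follows 'new', while 'city' also appears in other contexts.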

#### unigram probability estimation

In [36]:
U, Un = defaultdict(lambda: 0), 0
for doc in docs:
    for token in tokens(doc):
        U[token] += 1
        Un += 1

In [39]:
p_u = lambda x: U[x] / Un

In [75]:
print(U['school'], Un, p_u('school'))

6 6361 0.0009432479169941833


#### bigram probability estimation

In [45]:
B, Bn = defaultdict(lambda: 0), 0
for doc in docs:
    for a, b in nltk.ngrams(tokens(doc), 2):
        B[(a, b)] += 1
        Bn += 1

In [48]:
p_b = lambda x, y: B[(x, y)] / Bn

In [49]:
print(B[('New', 'York')], Bn, p_b('New', 'York'))

8 5852 0.001367053998632946


#### Pmi

In [50]:
PMI = {}
for (a, b), _ in B.items():
    PMI[(a, b)] = np.log(p_b(a, b) / (p_u(a) * p_u(b)))

In [53]:
for (a, b), p in sorted(PMI.items(), key=lambda x: -x[1])[:10]:
    print(a, b, p, U[a], U[b])

Cyrville Ward 8.841342991217594 1 1
Matthew Julia 8.841342991217594 1 1
Street Manhattan 8.841342991217594 1 1
description M. 8.841342991217594 1 1
M. slaina 8.841342991217594 1 1
slaina ant 8.841342991217594 1 1
Donald Lu 8.841342991217594 1 1
2012 Community 8.841342991217594 1 1
Community Shield 8.841342991217594 1 1
Shield broadcast 8.841342991217594 1 1
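
All ten pairs share the same score because each word occurs exactly once in the corpus: with such counts PMI reaches its maximum regardless of whether the pair is a real compound. The sketch below reproduces the value from the totals reported above (6361 unigrams, 5852 bigrams); the function name and the counts used for the second, more frequent pair are hypothetical.

```python
import math

UN, BN = 6361, 5852  # total unigram and bigram counts reported above

def pmi_from_counts(f_ab, f_a, f_b, n_uni=UN, n_bi=BN):
    """PMI computed directly from raw counts."""
    return math.log((f_ab / n_bi) / ((f_a / n_uni) * (f_b / n_uni)))

hapax = pmi_from_counts(1, 1, 1)       # both words seen once, always together
frequent = pmi_from_counts(8, 10, 10)  # hypothetical counts for a common pair
print(hapax, frequent)
```

A pair that co-occurs 8 times out of 10 is far stronger evidence of a compound, yet it scores lower, which motivates a minimum-frequency filter.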


#### Use a threshold on the minimum number of occurrences required

In [54]:
PMI, th = {}, 5
for (a, b), _ in B.items():
    if U[a] > th and U[b] > th:
        PMI[(a, b)] = np.log(p_b(a, b) / (p_u(a) * p_u(b)))

In [77]:
for (a, b), p in sorted(PMI.items(), key=lambda x: -x[1])[:10]:
    print(a, b, p, U[a], U[b])

Indian Ocean 6.2022856616023345 7 8
July 1915 5.950971233321429 6 6
December 1915 5.950971233321429 6 6
Damascus capital 5.796820553494171 6 7
North Western 5.796820553494171 6 7
main power 5.663289160869648 8 6
World War 5.642669873666912 7 7
power base 5.545506125213264 6 9
base Damascus 5.545506125213264 9 6
high school 5.545506125213264 9 6


### Embed bigrams in tokenization

In [111]:
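def pmi_tokenizer(doc, perc=95):
    """Tokenize doc, joining consecutive tokens whose bigram PMI exceeds the perc-th percentile."""
    bigram = []
    words = []
    tks = tokens(doc)
    th = np.percentile(list(PMI.values()), perc)
    for (a, b) in nltk.ngrams(tks, 2):
        if (a, b) in PMI and PMI[(a, b)] > th:
            if len(bigram) == 0:
                bigram += [a, b]
            else:
                bigram.append(b)
        else:
            if len(bigram) > 0:
                # close the running compound; a is already its last word
                words.append(" ".join(bigram))
                bigram = []
            else:
                words.append(a)
    if len(bigram) > 0:
        # the document ends inside a compound: flush it
        words.append(" ".join(bigram))
    elif len(tks) > 0:
        # the last token is never emitted inside the loop
        words.append(tks[-1])
    return words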

In [112]:
pmi_tokens = lambda doc: pmi_tokenizer(doc, perc=90)

In [116]:
print(tokens(docs[504])[:10])
print(pmi_tokens(docs[504])[:10])

['2016', 'ranking', 'chinese', 'high', 'school', 'send', 'student', 'study', 'american', 'university']
['2016', 'ranking', 'chinese', 'high school', 'send', 'student', 'study', 'american', 'university', 'Shenzhen']
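
The merging step can also be read as a single pass over the tokens that extends a running compound while consecutive pairs are accepted. The sketch below is an alternative formulation, not the notebook's function: `merge_compounds` and the `accepted` set (standing in for the bigrams whose PMI exceeds the percentile threshold) are hypothetical names.

```python
def merge_compounds(tks, accepted):
    """Greedily join adjacent tokens whose bigram is in `accepted`."""
    words, run = [], []
    for tok in tks:
        if run and (run[-1], tok) in accepted:
            run.append(tok)  # extend the current compound
        else:
            if run:
                words.append(" ".join(run))
            run = [tok]  # start a new run with this token
    if run:
        words.append(" ".join(run))  # flush the last run
    return words

accepted = {('high', 'school'), ('new', 'york'), ('york', 'city')}
print(merge_compounds(['chinese', 'high', 'school', 'send'], accepted))
print(merge_compounds(['new', 'york', 'city'], accepted))
```

Note that overlapping accepted bigrams chain into longer terms, so 'new york' and 'york city' together yield the single term 'new york city'.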


### Term-document matrix

In [120]:
M = pd.DataFrame(TfIdf)
M.fillna(0, inplace=True)

In [124]:
M.shape

(3215, 509)
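
Here `pd.DataFrame(TfIdf)` places documents on the columns and tokens on the rows, and `fillna(0)` assigns weight 0 to tokens that a document does not contain. A plain-Python sketch of the same layout, on a made-up index with hypothetical weights, may help to see what the matrix holds; the function `term_document_matrix` is an illustrative name.

```python
def term_document_matrix(index):
    """index: {doc_id: {token: weight}} -> (tokens, doc_ids, rows)."""
    doc_ids = list(index)
    tokens_seen, seen = [], set()
    for bow in index.values():
        for token in bow:
            if token not in seen:  # keep first-seen order of tokens
                seen.add(token)
                tokens_seen.append(token)
    # One row per token, one column per document, 0.0 where absent.
    rows = [[index[d].get(t, 0.0) for d in doc_ids] for t in tokens_seen]
    return tokens_seen, doc_ids, rows

toy_index = {0: {'new': 2.7, 'delhi': 6.2}, 1: {'new': 2.7, 'york': 4.1}}
terms, doc_ids, rows = term_document_matrix(toy_index)
for term, row in zip(terms, rows):
    print(term, row)
```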

In [122]:
M.head()

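| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 499 | 500 | 501 | 502 | 503 | 504 | 505 | 506 | 507 | 508 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| perform | 4.846154 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Habitat | 6.232448 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Center | 5.539301 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| New | 2.706087 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.706087 | 0.0 |
| Delhi | 6.232448 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |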


In [123]:
M.T.head()

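| | perform | Habitat | Center | New | Delhi | visit | India | Zalog | independent | settlement | ... | Burj | weave | 950,000 | kind | turf | Mauro | Badaracchi | Tivoli | sport | shooter |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4.846154 | 6.232448 | 5.539301 | 2.706087 | 6.232448 | 4.286538 | 3.187926 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.232448 | 5.539301 | 4.846154 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |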


## Text processing with scikit-learn
A tutorial on scikit-learn text processing is available [here](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html).

In [125]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

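By default, scikit-learn's text vectorizers take raw strings as input and tokenize them internally, rather than accepting pre-tokenized text. We therefore build a pseudo-text for each document by applying our tokenizer and joining the words of each compound term with an underscore, so that each bigram becomes a single pseudo-word.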

In [129]:
pseudo_docs = [" ".join([x.replace(' ', '_') for x in pmi_tokens(d)]) for d in docs]

In [130]:
pseudo_docs[504]

'2016 ranking chinese high_school send student study american university Shenzhen Foreign Language School rank number 19 mainland China term number_student enter american university'

In [131]:
V = CountVectorizer()

In [132]:
C = V.fit_transform(pseudo_docs)

In [133]:
C

<509x3024 sparse matrix of type '<class 'numpy.int64'>'
	with 5899 stored elements in Compressed Sparse Row format>
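
The vocabulary built by `CountVectorizer` need not coincide with the one produced by our own tokenizer. Assuming the default settings (`lowercase=True` and the token pattern `(?u)\b\w\w+\b`), the text is lowercased and only runs of two or more word characters are kept. The following sketch imitates that behaviour with the standard `re` module on a made-up string; `default_analyzer` is a hypothetical helper, not a scikit-learn function.

```python
import re

# Default token pattern of CountVectorizer: runs of 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def default_analyzer(text):
    """Lowercase the text, then extract tokens with the default pattern."""
    return TOKEN_PATTERN.findall(text.lower())

sample = "ranking high_school Shenzhen 950,000 M. a"
print(default_analyzer(sample))
```

The underscore counts as a word character, so 'high_school' survives as one token, whereas '950,000' is split in two and single-character tokens are dropped.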

In [134]:
C.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

#### Map matrix columns on words

In [137]:
V.vocabulary_.get('high_school')

1371

In [145]:
np.nonzero(C[:,1371].toarray())

(array([ 90, 504]), array([0, 0]))

In [146]:
C[90, 1371]

1

### TfIdf

In [141]:
tf_idf = TfidfTransformer(use_idf=True)
X = tf_idf.fit_transform(C)

In [142]:
X

<509x3024 sparse matrix of type '<class 'numpy.float64'>'
	with 5899 stored elements in Compressed Sparse Row format>

In [143]:
X.toarray()

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.27298188, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [147]:
X[90, 1371]

0.24369107498675482
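
These weights are not directly comparable with the from-scratch TfIdf values computed earlier. With its default parameters (`norm='l2'`, `smooth_idf=True`, `sublinear_tf=False`), `TfidfTransformer` uses raw counts as tf, a smoothed idf equal to ln((1 + N) / (1 + n_t)) + 1, and then rescales each document row to unit Euclidean length. A dependency-free sketch of that computation on made-up counts; the function name `sklearn_style_tfidf` is an assumption for illustration.

```python
import math

def sklearn_style_tfidf(counts):
    """counts: list of {term: raw count} dicts, one per document."""
    n = len(counts)
    df = {}
    for row in counts:
        for term in row:
            df[term] = df.get(term, 0) + 1
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for row in counts:
        weighted = {t: c * idf[t] for t, c in row.items()}
        norm = math.sqrt(sum(w * w for w in weighted.values()))
        # L2 normalization: each non-empty row ends up with unit length.
        rows.append({t: w / norm for t, w in weighted.items()} if norm else {})
    return rows

toy_counts = [{'high_school': 1, 'student': 2}, {'student': 1}]
X_toy = sklearn_style_tfidf(toy_counts)
print(X_toy)
```

Because of the row normalization, every weight lies between 0 and 1, which is why the values in `X` are much smaller than those in the from-scratch matrix.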