# Keywords and collocations

Both keywords and collocations are based on reordering a frequency list. For keywords the frequencies come from documents; for collocations they come from co-occurrence statistics.

The frequencies are compared to a reference frequency list, for example one that represents some kind of normal usage.

### Startup code

In [None]:
import dhlab.nbtext as nb
import dhlab.module_update as mu
import warnings
warnings.filterwarnings('ignore')
mu.css()

## The reference

The reference is a frequency list of the N most frequent words in the book collection (here N = 200000).

In [None]:
tot = nb.frame(nb.totals(200000), 'tot')
tot.head(20)

Normalized to relative frequencies, it becomes:

In [None]:
nb.normalize_corpus_dataframe(tot)
tot.head(20)
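
As a rough illustration of what normalization does, the step can be sketched in plain Python on made-up counts. The sketch assumes that `normalize_corpus_dataframe` divides each count by the column total; that is an assumption about the library, and the `normalize` helper below is hypothetical.

```python
# Made-up absolute counts; the real values come from nb.totals().
counts = {"og": 50, "i": 30, "det": 20}


def normalize(freqs):
    """Return relative frequencies that sum to 1 (hypothetical helper)."""
    total = sum(freqs.values())
    return {word: count / total for word, count in freqs.items()}


relative = normalize(counts)
print(relative)
```

After this step, lists of different sizes are on the same scale and can be compared word by word.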

## Construct a corpus using metadata

A corpus can be selected by metadata such as subject headings, author gender or Dewey classification. Dewey classes can be looked up in [Webdewey](http://deweysearchno.pansoft.de/webdeweysearch/index.html).

Select a corpus of 200 books from, for example, Dewey 641 (food and drink); the cell below uses the pattern `641.2%`. Note that a corpus is a collection of references to books. The text is stored remotely; only the references, in the form of URNs, are local.

In [None]:
corpus = nb.book_corpus(ddk='641.2%', period=(1960, 2020), limit=200)
corpus

### Aggregate the corpus

You could download all the frequencies to a dataframe and aggregate them locally. Here the aggregation is done on the server side instead, using the command `aggregate_urns`.

In [None]:
corpus_agg = nb.aggregate_urns(nb.pure_urn(corpus))

corpus_agg = nb.frame_sort(nb.frame(corpus_agg, 'freq'))

corpus_agg.head(10)
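
Conceptually, aggregation sums the per-book word counts into one frequency list for the whole corpus. The following is a minimal local sketch on invented counts; it is not how the server implements `aggregate_urns`, and the `aggregate` helper is hypothetical.

```python
from collections import Counter

# Invented per-book word counts, standing in for data held on the server.
book_counts = [
    Counter({"vin": 4, "og": 10, "drue": 1}),
    Counter({"vin": 2, "og": 7, "øl": 3}),
]


def aggregate(counters):
    """Sum word counts over all books (hypothetical helper)."""
    total = Counter()
    for counter in counters:
        total.update(counter)  # adds counts word by word
    return total


corpus_counts = aggregate(book_counts)
top = corpus_counts.most_common(2)  # sorted by descending frequency
print(top)
```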

Normalized, the aggregated counts become relative frequencies:

In [None]:
nb.normalize_corpus_dataframe(corpus_agg)
corpus_agg.head(10)

### Compare the discrepancy between frequencies

In terms of frequencies, the comparison is a ratio between the target frequency and the reference frequency of each word:

$$\textrm{association}(w) = \frac{\textrm{freq}_{\textrm{target}}(w)}{\textrm{freq}_{\textrm{reference}}(w)}$$

This formula reflects the probabilistic notion of relevance. The target frequency of a word is a conditional probability, and the reference frequency may be seen as the unconditional probability. The set (or condition) $B$ is irrelevant for $w$ if

$$p(w|w \in B) = p(w)$$

The larger the ratio

$$\frac{p(w|w \in B)}{p(w)}$$

the more relevant $B$ is for a particular word $w$.
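
A small worked example with invented relative frequencies shows how the ratio behaves. The `association` helper is hypothetical; it skips words that are missing from the reference, whereas dividing two pandas Series gives `NaN` for such words.

```python
# Invented relative frequencies: p(w | B) for the target, p(w) for the reference.
target = {"vin": 0.02, "og": 0.03, "drue": 0.005}
reference = {"vin": 0.0004, "og": 0.03, "drue": 0.00005, "hus": 0.001}


def association(target_freqs, reference_freqs):
    """Ratio p(w | B) / p(w) for words found in both lists (hypothetical helper)."""
    return {
        word: freq / reference_freqs[word]
        for word, freq in target_freqs.items()
        if reference_freqs.get(word, 0) > 0
    }


assoc_toy = association(target, reference)
ranked = sorted(assoc_toy, key=assoc_toy.get, reverse=True)
print(ranked)
```

A function word such as "og" is equally frequent in both lists, so its ratio is close to 1 and it ranks last, while the topic words rank first.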


In [None]:
assoc = corpus_agg.freq/tot.tot

In [None]:
assoc.sort_values(ascending = False).head(30)

## Collocation

### Build the collocation

The command is `urn_coll()`. Use a concordance to check how the word is used in the corpus.

In [None]:
collword = 'rødvin'

Construct the collocation

In [None]:
coll = nb.urn_coll(collword, urns = nb.pure_urn(corpus), after = 5, before = 5, limit = 1000)
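
To make the `before` and `after` parameters concrete, here is a local sketch of window-based counting on a tokenized sentence. It illustrates the general idea only; how the server tokenizes text and computes the collocation is not shown in this notebook, and `window_collocates` is a hypothetical helper.

```python
from collections import Counter


def window_collocates(tokens, node, before=5, after=5):
    """Count the words that occur within a window around each hit of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        left = tokens[max(0, i - before):i]    # up to `before` words to the left
        right = tokens[i + 1:i + 1 + after]    # up to `after` words to the right
        counts.update(left)
        counts.update(right)
    return counts


tokens = "vi drakk rødvin til maten og rødvin til osten".split()
colls = window_collocates(tokens, "rødvin", before=2, after=2)
print(colls.most_common())
```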

The collocation is a frequency list, just like the aggregated corpus.

In [None]:
coll.columns = ['freq']

In [None]:
coll.head(10)

In [None]:
nb.normalize_corpus_dataframe(coll)
coll.head(20)

The normalized collocation generally has higher values than the normalized corpus, because the collocation is smaller: the same total is spread over fewer words.

In [None]:
coll_assoc = (coll.freq/tot.tot).sort_values(ascending = False)

In [None]:
coll_assoc.head(20)

### Compare co-occurrence with corpus

Here we can raise the collocation frequency to a power greater than 1 to suppress low-frequency words. An alternative is to multiply the ratio by a compressed absolute frequency (logarithm, square root, ...).

In [None]:
coll_assoc_korp = (coll.freq**1.8/corpus_agg.freq).sort_values(ascending = False)

coll_assoc_korp.head(20)
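
The effect of the exponent can be seen on two invented words. Since $x^{1.8}/y = (x/y) \cdot x^{0.8}$, the powered score is the plain ratio multiplied by a weight that grows with the collocation frequency. The frequencies and helper names below are made up for illustration.

```python
def plain_ratio(coll_freq, corpus_freq):
    """Collocation frequency relative to corpus frequency."""
    return coll_freq / corpus_freq


def powered_ratio(coll_freq, corpus_freq, exponent=1.8):
    """Same ratio, with the collocation frequency raised to `exponent`."""
    return coll_freq ** exponent / corpus_freq


# Invented relative frequencies: (in the collocation, in the corpus).
toy = {"rare": (0.001, 0.00001), "frequent": (0.05, 0.002)}

plain = {word: plain_ratio(*freqs) for word, freqs in toy.items()}
powered = {word: powered_ratio(*freqs) for word, freqs in toy.items()}

print(max(plain, key=plain.get))      # word favoured by the plain ratio
print(max(powered, key=powered.get))  # word favoured by the powered ratio
```

With the plain ratio the rare word comes out on top; with the exponent the frequent word overtakes it.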

# Exercise

1. Change the corpus.
1. Change `collword`.