### Preamble

Contains imports and some utility code.

In [None]:
!curl -s "https://raw.githubusercontent.com/Yoonsen/Modules/master/module_update.py" > "module_update.py"

In [None]:
from module_update import update, css, code_toggle

In [None]:
css()

In [None]:
update('nbtext', overwrite=True)  # may remove overwrite=True
update('graph_networkx_louvain', overwrite=True)
import graph_networkx_louvain as gnl
import nbtext as nb
import networkx as nx
%matplotlib inline

In [None]:
def frame_sort(frame, by=0):
    return frame.sort_values(by=by, ascending=False)

In [None]:
import warnings
warnings.filterwarnings('ignore')

# Collocations

The term collocation has traditionally been restricted to words that are juxtaposed as phrases, like “strong coffee”, “strict regime” or “eat dinner”. Here we take collocations to be realized as skipgrams, that is, as word pairs that simply co-occur within a context that is itself a contiguous sequence of words, typically a paragraph or a window of n words around a given word. Juxtaposed collocates will also be part of the result set.
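To make the window idea concrete, here is a minimal sketch in plain Python. The function `window_collocates` and the toy sentence are illustrative assumptions, not part of `nbtext`; the real counts in this notebook come from `nb.urn_coll`, which may count differently.

```python
from collections import Counter

def window_collocates(tokens, target, before=5, after=5):
    """Count the words found within a window around each hit of `target`.

    Hypothetical helper for illustration only; not the nbtext implementation.
    """
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        # words to the left and right of the hit, excluding the hit itself
        counts.update(tokens[max(0, i - before):i])
        counts.update(tokens[i + 1:i + 1 + after])
    return counts

tokens = "we drank strong coffee and then more strong coffee with cake".split()
coll = window_collocates(tokens, "coffee", before=2, after=2)
print(coll.most_common(3))
```

The juxtaposed collocate “strong” is counted twice, once per hit, while words further away within the window are counted in the same bag.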


### Define a corpus

Use Dewey Decimal Classification codes to restrict the corpus.

In [None]:
corpus_urns = nb.get_urn({
    'ddk':"641.2%", 
    'year':1960, 
    'next':60, 
    'lang':'nob', 
    'limit': 200
})

print(len(corpus_urns), corpus_urns[:5])

### Make a collocation for a word from the corpus

Here we go step by step. The process may be collected into one general script.

First, collect the words around a given word and count the result.

In [None]:
collword = 'rødvin'

In [None]:
rødvin = nb.urn_coll(collword, urns = corpus_urns, after = 5, before = 5, limit = 1000)

In [None]:
rødvin.head()

We want to measure how this differs from a reference. Two candidates stand out: the collection of all books, and the corpus itself.

All books

In [None]:
tot = nb.frame(nb.totals(top = 50000))

In [None]:
tot.head()

So we have the word bags we need: the collocation and the total counts for all books. Counts for the corpus itself could serve as a third. Let us also normalize them, so that the values can be compared more easily.

In [None]:
nb.normalize_corpus_dataframe(tot)
nb.normalize_corpus_dataframe(rødvin)

In [None]:
coll_all = frame_sort(rødvin**1.0/tot)

In [None]:
coll_all.head(20)
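The comparison above can be sketched on toy data. This assumes that `nb.normalize_corpus_dataframe` scales each column to relative frequencies; that is an assumption about `nbtext`, and the counts below are made up for illustration.

```python
import pandas as pd

# Toy counts (invented): a collocation bag and a reference bag
coll = pd.DataFrame({0: {"glass": 30, "drink": 15, "and": 50, "burgundy": 5}})
ref = pd.DataFrame({0: {"glass": 200, "drink": 300, "and": 9000, "the": 500}})

# Assumed normalization: divide each column by its sum
coll_rel = coll / coll.sum()
ref_rel = ref / ref.sum()

# Ratio of relative frequencies, highest first (same idea as frame_sort)
ratio = (coll_rel / ref_rel).sort_values(by=0, ascending=False)
print(ratio)
```

Words present in only one of the two frames get `NaN` after the division, and `sort_values` places them last. With `top = 50000` for the totals, rare collocates that fall outside the reference list would therefore drop out of the ranking.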

A quick check with concordances shows how one of the collocates is used in context:

In [None]:
nb.get_urnkonk('drikkes', {'urns':corpus_urns, 'limit':5})

## Inspect collocations

Let's make a larger graph in which each of the top collocates is expanded with its own collocations. The same exercise can be done with either reference corpus; here we use the totals for all books.

In [None]:
coll_all.index[:20] # Let us select the 20 highest

In [None]:
words_to_expand = list(coll_all.index[:20])
words_to_expand

We want to create a collocation for each of these words and collect them all in a graph. To do so, we repeat the collocation step for each word in our list, using the corpus URNs:

In [None]:
collocations = {word: nb.urn_coll(word, urns=corpus_urns) for word in words_to_expand}

We may inspect the collocations:

In [None]:
for w in collocations:
    print(w, list(collocations[w].index[:20]))
    print()

Normalize each collocation:

In [None]:
for w in collocations:
    nb.normalize_corpus_dataframe(collocations[w])

These undergo the same procedure as the original collocation: divide by the reference and sort.

In [None]:
collocations_weight = {w:frame_sort(collocations[w]/tot) for w in collocations}

In [None]:
for w in collocations_weight:
    print(w, list(collocations_weight[w].index[:20]))
    print()

## Final step

Turn it all into a graph. Create a graph and populate it with edges.

In [None]:
# start with an empty list of edges

edges = []

In [None]:
# add elements from words_to_expand

for x in words_to_expand:
    edges.append((collword, x, float(coll_all.loc[x])))


After the first edges are added, we can have a look at them:

In [None]:
edges

Create a graph over the edges

In [None]:
G = nx.Graph()

Add the edges

In [None]:
G.add_weighted_edges_from(edges)

Draw the graph using module gnl

In [None]:
gnl.show_graph(G)

Next, add all the edges from `collocations_weight`. We select just the first 20 from each collocation, much as in the first step. This could be made sensitive to the actual structure of the collocation.

In [None]:
rest_edges = []

In [None]:
for w in collocations_weight:
    for word in collocations_weight[w].index[:20]:
        rest_edges.append((w, word, float(collocations_weight[w].loc[word])))

Add the latest edges to G

In [None]:
G.add_weighted_edges_from(rest_edges)
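One detail worth keeping in mind when adding edges in two rounds: `nx.Graph` is undirected, so an edge added a second time (in either direction) keeps a single entry and takes the latest weight. A small sketch with made-up weights, using the name `G_demo` to avoid touching `G`:

```python
import networkx as nx

G_demo = nx.Graph()

# First round: edges from the target word (weights are invented)
G_demo.add_weighted_edges_from([("rødvin", "glass", 15.0), ("rødvin", "hvitvin", 9.0)])

# Second round: edges from an expanded word, one of which points back
G_demo.add_weighted_edges_from([("glass", "rødvin", 4.0), ("glass", "flaske", 7.0)])

print(G_demo.number_of_edges())             # edges are not duplicated
print(G_demo["rødvin"]["glass"]["weight"])  # weight from the second round
```

In the graph built above, this means that if the target word shows up among the top collocates of one of its own collocates, the weight from `rest_edges` replaces the one from `edges`.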

Draw the graph, which shows the interconnections and clustering:

In [None]:
gnl.show_graph(G, spread = 0.01)

In [None]:
gnl.show_communities(G)
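For readers without the `gnl` module, a rough equivalent can be sketched directly in networkx. Judging by its name, `graph_networkx_louvain` probably relies on Louvain community detection, but that is an assumption; the sketch below uses networkx's own `louvain_communities` (available in recent networkx releases) on a toy graph with invented weights.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

# Toy graph: two tightly connected triangles joined by one weak edge
G_demo = nx.Graph()
G_demo.add_weighted_edges_from([
    ("a", "b", 5.0), ("b", "c", 5.0), ("a", "c", 5.0),
    ("d", "e", 5.0), ("e", "f", 5.0), ("d", "f", 5.0),
    ("c", "d", 0.1),
])

# Partition the nodes; the seed makes the result reproducible
communities = louvain_communities(G_demo, weight="weight", seed=1)
for community in communities:
    print(sorted(community))
```

Applied to `G`, the same call would return a list of node sets, one per cluster of mutually associated words.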