# Fetch concordances and analyze them

Lars G. Johnsen, National Library of Norway

lars.johnsen@nb.no

ATTR May 2023


In this notebook we fetch arbitrary concordances and analyze them with a parser.

### Initial code

In [140]:
import json
import pandas as pd
import requests
#import spacy
from dhlab.module_update import css
import dhlab.api.dhlab_api as api
import dhlab as dh
css()

The call to `spacy.load` below fails if the model is not installed. In that case, uncomment the next cell and run it to download the model.

In [2]:
#!python -m spacy download nb_core_news_sm

In [3]:
import spacy

nlp = spacy.load("nb_core_news_sm")

In [111]:
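def parse(string):
    """Parse a string with spaCy; return (token DataFrame, Doc), or None on failure."""
    try:
        doc = nlp(string)
        # token.head is a Token object; store its text so the column holds plain strings
        rows = [(token.text, token.lemma_, token.pos_, token.dep_,
                 token.head.text) for token in doc]
        # pass a flat list of column names; a nested list would create a MultiIndex
        return pd.DataFrame(rows, columns='word lemma pos dep head'.split()), doc
    except Exception:
        print("err", string)
        return None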

# Concordances

Each query creates a random collection of newspapers, periodicals and/or books. Formulate searches using SQLite FTS5 syntax: https://www2.sqlite.org/fts5.html
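To get a feel for FTS5 query syntax without calling the API, the sketch below builds a tiny in-memory full-text table with Python's built-in `sqlite3` module. The table name `docs`, the sample sentences and the helper `search` are made up for illustration. It assumes the local SQLite build includes the FTS5 extension, and it does not show which operators the dhlab service itself accepts.

```python
import sqlite3

# In-memory database with one FTS5 table (illustrative names and data)
con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE docs USING fts5(text)")
con.executemany(
    "INSERT INTO docs(text) VALUES (?)",
    [
        ("Isak stod der midt på gulvet",),
        ("Inger svarer Isak",),
        ("Oline kom over fjellet",),
    ],
)

def search(query):
    """Return the matching rows for an FTS5 query string, in insertion order."""
    rows = con.execute(
        "SELECT text FROM docs WHERE docs MATCH ? ORDER BY rowid", (query,)
    )
    return [r[0] for r in rows]

both = search("Isak AND gulvet")       # both words must occur
either = search("Inger OR Oline")      # at least one of the words
prefix = search("gulv*")               # prefix match
near = search("NEAR(Inger Isak, 2)")   # at most 2 tokens between the words
```

The same operators (`AND`, `OR`, `*`, `NEAR`) are described in the FTS5 documentation linked above.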

In [10]:
corpus = dh.Corpus(doctype="digibok", limit = 100, author='knut hamsun', title='markens').frame

In [12]:
corpus.dhlabid[1]

100328521

In [20]:
%%time
df = api.word_concordance(dhlabid = [int(corpus.dhlabid[1])], words=['Isak'],before = 2, after = 20, limit = 300)

CPU times: user 25.7 ms, sys: 4.09 ms, total: 29.8 ms
Wall time: 436 ms


In [32]:
df['string'] = df.before + " " + df.target + " " + df.after

In [33]:
df.string

0      ? — Isak . Du vet ikke av en kvinnfolkhjelp ti...
1      dem . Isak altså , også det ville lappen orde ...
2      ? svarer Isak . For han har bare denne ting i ...
3      av , Isak forstod det ikke . Forstod han det i...
4      . — Isak . — Nå , Isak . Er det du som bor her...
                             ...                        
295    borte . Isak stod der midt på gulvet , og Olin...
296    kokeovnen . Isak kremtet et par ganger for å l...
297    ! sa Isak og var på ytterste nippet til å si m...
298    kunne ikke Isak stå der på gulvet og tie . Han...
299    geit ? Isak grundet og grundet . Om kvelden da...
Name: string, Length: 300, dtype: object

# Parse the concordances using spaCy

## Collect all the concordances

These are all contained in the variable `df`.

The function `parse` returns a pair: the parse as a data frame and the spaCy doc.

In [116]:
%%time
# parse every concordance string; failed parses come back as None
parses = [parse(s) for s in df.string.values]

In [117]:
df_versions = [p[0] for p in parses if p]
spacy_docs = [p[1] for p in parses if p]
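Filtering with `if p` drops failed parses, so position `i` in `spacy_docs` may no longer correspond to row `i` in `df`. A minimal sketch of one way to keep that link is shown below. `fake_parse` is a hypothetical stand-in for `parse`, used only so the example runs without a spaCy model.

```python
def fake_parse(string):
    # Hypothetical stand-in for parse(): returns (tokens, original string),
    # or None for empty input to mimic a failed parse.
    if not string.strip():
        return None
    return string.split(), string

strings = ["Isak svarer", "", "Isak kremtet"]
parses = [fake_parse(s) for s in strings]

# Keep the original position of every successful parse
indexed = {i: p for i, p in enumerate(parses) if p}

tokens_by_row = {i: p[0] for i, p in indexed.items()}
failed_rows = [i for i, p in enumerate(parses) if p is None]
```

Keying the results by position makes it possible to look up the concordance a parse came from, and `failed_rows` lists the strings worth inspecting.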

In [118]:
df_parses = pd.concat(df_versions)
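A plain `pd.concat` stacks the token tables, but each table keeps its own index starting at 0, so the sentence boundaries are lost. The sketch below uses the `keys` argument of `pd.concat` to record which table each row came from. The two small tables are hand-made with invented values and only mimic the shape of the parse output.

```python
import pandas as pd

cols = ["word", "lemma", "pos", "dep", "head"]

# Two tiny token tables standing in for parse results (invented values)
first = pd.DataFrame(
    [("Isak", "Isak", "PROPN", "nsubj", "svarer"),
     ("svarer", "svare", "VERB", "ROOT", "svarer")],
    columns=cols,
)
second = pd.DataFrame(
    [("Isak", "Isak", "PROPN", "nsubj", "kremtet"),
     ("kremtet", "kremte", "VERB", "ROOT", "kremtet")],
    columns=cols,
)

# keys= adds an outer index level identifying the source table
combined = pd.concat([first, second], keys=[0, 1], names=["sentence", "token"])

# All tokens of the second table
second_again = combined.loc[1]
```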

In [126]:
cpy = df_parses[:200].copy()

In [134]:
for _, x in cpy.iterrows():
    if x.word == 'Isak':
        print(x.word, x.dep, x['head'])

Isak ROOT Isak
Isak ROOT Isak
Isak nsubj svarer
Isak nsubj forstod
Isak ROOT Isak
Isak conj Nå
Isak conj Nå
Isak nsubj hette
Isak nmod ja
Isak nsubj ble
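The loop above can also be written as a boolean mask, which returns the matching rows as a data frame instead of printing them. The sketch below uses a small hand-made table with invented values in place of `cpy`.

```python
import pandas as pd

# Hand-made stand-in for a slice of the parse table (invented values)
cpy = pd.DataFrame(
    [("Isak", "nsubj", "svarer"),
     ("svarer", "ROOT", "svarer"),
     ("Isak", "conj", "Nå")],
    columns=["word", "dep", "head"],
)

# Select the rows where the word is 'Isak', keeping three columns
isak_rows = cpy.loc[cpy["word"] == "Isak", ["word", "dep", "head"]]
```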


## Check out frequency of some tags

The expressions here are raw pandas and Python. A wrapper function could make them easier to write.

In [None]:
df_parses.groupby("pos").count()['word'].sort_values(ascending=False)

#### Example of wrapper for the above

In [None]:
def count_parse(relation):
    # count rows per value of `relation`, using a column other than the grouping key
    if relation != 'word':
        res = df_parses.groupby(relation).count()['word'].sort_values(ascending=False)
    else:
        res = df_parses.groupby(relation).count()['dep'].sort_values(ascending=False)
    return res

With the wrapper, inspecting the parse result takes a single short call.

In [None]:
count_parse('dep')
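pandas also provides `Series.value_counts`, which counts and sorts in one step and needs no special case for the grouping column. The sketch below is an alternative wrapper. The name `count_values` is hypothetical, it takes the data frame as an argument rather than reading a global, and the toy table contains invented values.

```python
import pandas as pd

def count_values(frame, column):
    """Hypothetical helper: frequency of each value in `column`, most frequent first."""
    return frame[column].value_counts()

# Toy table in the shape of the parse output (invented values)
toy = pd.DataFrame({
    "word": ["Isak", "svarer", "Isak", "stod"],
    "dep": ["nsubj", "ROOT", "nsubj", "ROOT"],
    "pos": ["PROPN", "VERB", "PROPN", "VERB"],
})

pos_counts = count_values(toy, "pos")
word_counts = count_values(toy, "word")
```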

## Which words are in certain relations?

These searches use standard pandas expressions, so the expressions become more involved again.

In [None]:
df_parses[df_parses['dep'] == 'nsubj'].groupby('word').count()['lemma'].sort_values(ascending=False).head(20)
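This kind of lookup can be wrapped in the same way as the tag counts. The sketch below shows one possible wrapper. The name `words_in_relation` and the toy table are assumptions for illustration and not part of dhlab.

```python
import pandas as pd

def words_in_relation(frame, dep, top=20):
    """Hypothetical helper: most frequent words carrying the dependency label `dep`."""
    return frame.loc[frame["dep"] == dep, "word"].value_counts().head(top)

# Toy table in the shape of the parse output (invented values)
toy = pd.DataFrame({
    "word": ["Isak", "svarer", "Inger", "Isak", "stod"],
    "dep": ["nsubj", "ROOT", "nsubj", "nsubj", "ROOT"],
})

subjects = words_in_relation(toy, "nsubj")
```

On the real data the call would look like `words_in_relation(df_parses, 'nsubj')`, provided the parse table has `word` and `dep` columns.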

# Look at a tree

In [None]:
df.string[1]

In [139]:
spacy.displacy.render(spacy_docs[1])

# Exercise

Run the code again, and check if the frequencies change. 
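One way to carry out the comparison is to save the counts from each run and line them up side by side. The sketch below uses invented tag counts in place of real results. In practice each series would come from a tag count computed after a fresh run.

```python
import pandas as pd

# Invented tag counts standing in for the results of two separate runs
run_a = pd.Series({"NOUN": 40, "VERB": 30, "PROPN": 12})
run_b = pd.Series({"NOUN": 35, "VERB": 33, "ADJ": 5})

# Align on tag; a tag missing from one run gets the count 0
comparison = (
    pd.concat([run_a, run_b], axis=1, keys=["run_a", "run_b"])
    .fillna(0)
    .astype(int)
)
comparison["diff"] = comparison["run_b"] - comparison["run_a"]
```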