# Fetch concordances and analyze them

Lars G. Johnsen, National Library of Norway

lars.johnsen@nb.no

Solstrand, June 2021


In this notebook we fetch arbitrary concordances and analyze them with a parser.

### Initial code

In [None]:
import json
import pandas as pd
import requests
#import spacy
from dhlab.module_update import css

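def search_nb(word='demokrati', window=20, limit=300):
    """Fetch up to `limit` concordances matching `word` and return them as a data frame."""
    parameters = {
        'query': word,
        'window': window,
        'limit': limit
    }
    r = requests.get("https://api.nb.no/ngram/db1/konk", params=parameters)
    r.raise_for_status()  # stop on HTTP errors instead of trying to parse an error page
    return pd.DataFrame(json.loads(r.text), columns='doc-id concordance'.split())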



If `spacy.load` below fails because the model is not installed, run the following cell first to download it.

In [None]:
!python -m spacy download nb_core_news_sm

In [None]:
import spacy

nlp = spacy.load("nb_core_news_sm")

In [None]:
def parse(string):
    """Parse `string` with spaCy and return a tuple: (token data frame, spaCy doc)."""
    cols = ["token.text", "token.lemma_", "token.pos_", "token.tag_",
            "token.dep_", "token.shape_", "token.is_alpha", "token.is_stop"]
    doc = nlp(string)
    # one row per token, with the attributes listed in cols
    rows = [(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop) for token in doc]
    return pd.DataFrame(rows, columns=cols), doc

In [None]:
pd.set_option('max_colwidth', None)
css()

# Concordances

Each query draws a random collection of newspapers, periodicals and/or books. Formulate the search using SQLite FTS5 syntax: https://www2.sqlite.org/fts5.html
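
FTS5 proximity queries have the form `NEAR(term1 term2, N)`, where the optional `N` is the maximum distance between the terms. As a minimal sketch, the helper below builds such query strings. The name `near_query` is hypothetical and not part of the notebook or of `dhlab`.

```python
def near_query(*terms, distance=None):
    """Build an FTS5 NEAR query string from one or more search terms.

    `near_query` is a hypothetical helper, not part of the notebook or dhlab.
    """
    if not terms:
        raise ValueError("at least one term is required")
    body = " ".join(terms)
    if distance is not None:
        # the optional number after the comma is the maximum distance
        body = f"{body}, {int(distance)}"
    return f"NEAR({body})"


print(near_query("spiser", "middag", distance=5))
print(near_query("fiske*", "snøre*"))
```

The resulting strings can be passed as the first argument to `search_nb`.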

In [None]:
%%time
df = search_nb(""" sjarken """, limit = 300)
df.style

### Copy and paste cells to run new versions

In [None]:
search_nb(""" NEAR(spiser middag, 5) """, limit = 20).style

In [None]:
search_nb("NEAR(fiske* snøre*)", limit = 200).style

# Parse the concordances using spaCy

### Example of `parse`

In [None]:
parse('Han liker at hun studerer lingvistikk.')[0]

In [None]:
parse('... Han lik att hun studerer lingvistikk.')[0]

## Collect all the concordances 

These are all contained in the variable `df`.

The function `parse` returns a tuple: the parse as a data frame, followed by the spaCy doc.

In [None]:
# remove the phrase markers and parse the sentences

parses = [parse(s.replace("<b>", "").replace("</b>","")) for s in df['concordance']]

df_versions = [p[0] for p in parses]
spacy_docs = [p[1] for p in parses]
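
The cell above discards the `<b>…</b>` markers before parsing. If the marked words themselves are of interest, they can be collected at the same time. This is a sketch that assumes hits are always wrapped in plain `<b>` and `</b>` tags, as the `replace` calls above suggest. The name `split_hits` is hypothetical.

```python
import re

# assumption: hits are wrapped in plain <b>...</b> tags, with no attributes
HIT = re.compile(r"<b>(.*?)</b>")


def split_hits(concordance):
    """Return the concordance without markers, plus the list of marked words."""
    hits = HIT.findall(concordance)
    clean = re.sub(r"</?b>", "", concordance)
    return clean, hits


clean, hits = split_hits("de <b>spiser</b> alltid <b>middag</b> sent")
print(clean)
print(hits)
```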

In [None]:
df_parses = pd.concat(df_versions)
df_parses.head(50)
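
A plain `pd.concat` stacks the token rows but does not record which concordance each row came from. One possible variant passes `keys` to `pd.concat` so the source is kept as a column. The function name `concat_parses` and the column names `concordance` and `position` are illustrative choices, and the toy frames stand in for `df_versions`.

```python
import pandas as pd


def concat_parses(frames):
    """Stack per-concordance token frames, keeping the source index as a column."""
    keys = list(range(len(frames)))
    # outer index level: which frame the row came from; inner level: token position
    combined = pd.concat(frames, keys=keys, names=["concordance", "position"])
    return combined.reset_index()


# toy frames standing in for df_versions
toy_frames = [
    pd.DataFrame({"token.text": ["Han", "liker"], "token.pos_": ["PRON", "VERB"]}),
    pd.DataFrame({"token.text": ["Hun"], "token.pos_": ["PRON"]}),
]
tagged = concat_parses(toy_frames)
print(tagged)
```

With the `concordance` column in place, counts can be grouped per concordance as well as over the whole sample.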

## Check the frequency of some tags

The expressions here are raw pandas or Python. A wrapper could make them easier to write.

In [None]:
df_parses.groupby("token.pos_").count()['token.text'].sort_values(ascending=False)

#### Example of a wrapper for the above

In [None]:
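def count_parse(relation):
    """Count rows of `df_parses` per value of the column `relation`, most frequent first."""
    # count() is read from a column other than the grouping key, hence the two branches
    if relation != 'token.text':
        res = df_parses.groupby(relation).count()['token.text'].sort_values(ascending=False)
    else:
        res = df_parses.groupby(relation).count()['token.dep_'].sort_values(ascending=False)
    return res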

The wrapper makes it easier to write the command for inspecting the parsing result.

In [None]:
count_parse('token.dep_')
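
The wrapper above reads the global `df_parses` and needs two branches. As an alternative sketch, `value_counts` gives similar counts in one step and works on any frame passed in. The name `count_column` is hypothetical, and the toy frame stands in for `df_parses`.

```python
import pandas as pd


def count_column(frame, column):
    """Count how often each value occurs in `column`, most frequent first."""
    return frame[column].value_counts()


# toy frame standing in for df_parses
toy = pd.DataFrame({
    "token.text": ["Han", "liker", "hun", "studerer"],
    "token.pos_": ["PRON", "VERB", "PRON", "VERB"],
    "token.dep_": ["nsubj", "ROOT", "nsubj", "ccomp"],
})
dep_counts = count_column(toy, "token.dep_")
print(dep_counts)
```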

## Which words are in certain relations?

Searching with standard pandas expressions brings us back to more involved expressions.

In [None]:
df_parses[df_parses['token.dep_'] == 'nsubj'].groupby('token.text').count()['token.lemma_'].sort_values(ascending=False).head(20)
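
The expression above can be wrapped in the same way as the frequency count. This is a sketch only: the name `words_in_relation` and its parameters are assumptions, and the toy frame stands in for `df_parses`.

```python
import pandas as pd


def words_in_relation(frame, relation, top=20, by="token.text"):
    """Return the `top` most frequent values of `by` among tokens with dependency `relation`."""
    mask = frame["token.dep_"] == relation
    return frame.loc[mask, by].value_counts().head(top)


# toy frame standing in for df_parses
toy_parses = pd.DataFrame({
    "token.text": ["Han", "liker", "hun", "Han", "sover"],
    "token.dep_": ["nsubj", "ROOT", "nsubj", "nsubj", "ROOT"],
})
subjects = words_in_relation(toy_parses, "nsubj")
print(subjects)
```

Passing `by="token.lemma_"` would count lemmas instead of word forms, provided the frame has that column.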

# Look at a tree

In [None]:
df.concordance[6]

In [None]:
spacy.displacy.render(spacy_docs[6])

# Exercise

Run the code again and check whether the frequencies change. Since each query draws a new random collection, the counts may differ between runs.
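
One way to make the comparison concrete is to put the relative frequencies from two runs side by side. This is a sketch under the assumption that the parsed tokens from each run are kept in separate data frames. The name `compare_runs` is hypothetical, and the toy frames below are made up for illustration.

```python
import pandas as pd


def compare_runs(first, second, column="token.pos_"):
    """Tabulate relative frequencies of `column` in two runs and their difference."""
    a = first[column].value_counts(normalize=True)
    b = second[column].value_counts(normalize=True)
    # values missing from one run get a relative frequency of 0
    table = pd.concat([a, b], axis=1, keys=["run_1", "run_2"]).fillna(0.0)
    table["difference"] = table["run_2"] - table["run_1"]
    return table.sort_values("difference")


# made-up toy data standing in for the parses of two separate runs
run_1 = pd.DataFrame({"token.pos_": ["PRON", "VERB"]})
run_2 = pd.DataFrame({"token.pos_": ["PRON", "PRON", "NOUN", "VERB"]})
comparison = compare_runs(run_1, run_2)
print(comparison)
```

Relative rather than absolute frequencies are used so that runs of different size remain comparable.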