# Create and evaluate a corpus of books

The corpus is drawn from the works registry. The evaluation selects books that belong to different epochs, where the epochs are defined by the variable `matrix`: a dataframe whose columns correspond to epochs and whose rows are words, each weighted differently within each epoch.

In [71]:
import dhlab.module_update as mu
import dhlab.text as dh
import pandas as pd
import verk as v
import evaluate_corpus as ev
import json
import os

In [72]:
mu.css()

The process has four steps:

1. Create a matrix
2. Find works
3. Evaluate the works against the matrix
4. Create a parallel corpus from the evaluation

In [73]:
matrix = pd.read_csv("old_vs_new_word_list.csv", index_col = 0)

Create a list of books from known authors or books. This can be done with an exhaustive search or with a more concentrated one.

In [74]:
authors = ["bjørnstjerne bjørnson",  "cora sandel", "goethe", "nietsche", "henrik ibsen", "amalie skram", "alexander kielland", "jonas lie", "camilla collett"]

In [75]:
authors += [  "sigrid undset", "knut hamsun", "Per Christen Asbjørnsen","Arne Garborg ", "hulda garborg","Henrik Wergeland", "Sigbjørn Obstfelder" ]

In [76]:
author_corpus = {
    auth: dh.Corpus(doctype="digibok", author= auth, limit = 1000, to_year = 1920, lang='nob') for auth in authors
} 

Here is what the `author_corpus` looks like

Find the work identifiers, stored here in the variable `works`, by looping through the list of authors and then through each author's titles.

In [77]:
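works = []
for a in author_corpus:
    for t in author_corpus[a].corpus.title:
        try:
            # collect the ids of all works matching this author and title
            works.extend(v.find_works(a, t)['id'].values)
        except Exception:
            # skip titles where the lookup fails, but let Ctrl-C through
            pass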
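The loop above silently drops every title whose lookup fails. A minimal sketch of recording those failures instead is shown here. `toy_find_works` and its tiny registry are hypothetical stand-ins for `v.find_works`, and the `KeyError` is an assumption; the notebook does not show what a failed lookup actually raises.

```python
def toy_find_works(author, title):
    """Hypothetical stand-in for v.find_works: returns work ids or raises KeyError."""
    registry = {("henrik ibsen", "Brand"): ["work-1"]}  # invented data
    return registry[(author, title)]


# Invented author -> titles mapping, playing the role of `author_corpus`.
toy_titles = {"henrik ibsen": ["Brand", "Unknown title"]}

found, failed = [], []
for author, titles in toy_titles.items():
    for title in titles:
        try:
            found.extend(toy_find_works(author, title))
        except KeyError:
            # remember what could not be resolved so it can be inspected later
            failed.append((author, title))

print(found)
print(failed)
```

Keeping a `failed` list makes it possible to check afterwards whether the misses come from spelling variants in the author list or from gaps in the registry.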

From this list of works a parallel corpus is created, using the matrix to find modernized versions.

import importlib
importlib.reload(v)
importlib.reload(ev)

Go through the list of works and look up the URNs registered for each one. Keep the works that have more than one URN.

In [78]:
len(works)

428

In [59]:
parallel_candidates = {works_id: v.urns_for_works(works_id) for works_id in works}

In [60]:
candidates = {x:parallel_candidates[x] for x in parallel_candidates if len(parallel_candidates[x]) > 1}
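The filtering step can be illustrated on toy data. The work ids and URNs below are invented; only the shape (work id mapped to a list of URNs) follows the notebook.

```python
# Toy stand-in for `parallel_candidates`: work id -> list of URNs (all invented).
toy_parallel_candidates = {
    "work-a": ["URN:example_1"],
    "work-b": ["URN:example_2", "URN:example_3"],
    "work-c": [],
}

# Keep only works with at least two editions; a single edition
# has nothing to be compared against.
toy_candidates = {
    work_id: urns
    for work_id, urns in toy_parallel_candidates.items()
    if len(urns) > 1
}

print(toy_candidates)
```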

In [61]:
candidates[list(candidates.keys())[3]]

['URN:NBN:no-nb_digibok_2008040200034', 'URN:NBN:no-nb_digibok_2012030524009']

In [62]:
import importlib
importlib.reload(ev)

<module 'evaluate_corpus' from '/mnt/disk1/Github/Verksregister/evaluate_corpus.py'>

In [63]:
c = dh.CorpusFromIdentifiers(candidates[list(candidates.keys())[2]])

In [64]:
c.corpus

| | dhlabid | title | authors | urn | oaiid | sesamid | isbn10 | city | timestamp | year | publisher | langs | subjects | ddc | genres | literaryform | doctype |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100444984 | Digte og sange | Bjørnson , Bjørnstjerne / Bull , Francis | URN:NBN:no-nb_digibok_2008051301010 | oai:nb.bibsys.no:999102893874702202 | e43e126fe3d40d5c970808423d60eb80 | | Oslo | 19570101 | 1957 | Gyldendal | nob | | 839.91 | poetry | Skjønnlitteratur | digibok |


In [65]:
ev.evaluate_corpus_norwegian(corpus = c, matrix=matrix, top_number = 10)

{'URN:NBN:no-nb_digibok_2008051301010': ('1990', 0.10874372971463508)}
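The result maps each URN to a pair of an epoch label and a score. The internals of `ev.evaluate_corpus_norwegian` are not shown in this notebook, so the following is only one plausible way such a score could be computed, not the actual implementation. The function `score_epochs`, the toy matrix, and all words and weights are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical epoch matrix: word -> {epoch: weight}. Words and weights are invented.
toy_matrix = {
    "hvad":  {"1850": 0.9, "1990": 0.0},
    "efter": {"1850": 0.7, "1990": 0.1},
    "hva":   {"1850": 0.0, "1990": 0.8},
    "etter": {"1850": 0.1, "1990": 0.9},
}


def score_epochs(tokens, matrix):
    """Return (best_epoch, score), where each epoch's score is the sum of
    weight * relative frequency over the words in the matrix."""
    counts = Counter(token.lower() for token in tokens)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot score an empty text")
    scores = {}
    for word, weights in matrix.items():
        for epoch, weight in weights.items():
            scores[epoch] = scores.get(epoch, 0.0) + weight * counts[word] / total
    best = max(scores, key=scores.get)
    return best, scores[best]


old_text = "hvad skal vi gjøre efter dette".split()
new_text = "hva skal vi gjøre etter dette".split()

print(score_epochs(old_text, toy_matrix))
print(score_epochs(new_text, toy_matrix))
```

In this sketch the older spelling variants pull the first text towards the `1850` column and the modern ones pull the second towards `1990`.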

In [66]:
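evals = {}
for worksid in candidates:
    try:
        # evaluate every edition of this work against the epoch matrix
        evals[worksid] = ev.evaluate_corpus_norwegian(corpus = dh.CorpusFromIdentifiers(candidates[worksid]), matrix = matrix)
    except KeyboardInterrupt:
        # allow the long-running loop to be stopped by hand, keeping the results so far
        break
    except Exception:
        # skip works that cannot be evaluated
        pass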

In [67]:
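# For each work, split its editions into those assigned to an epoch up to 1900 (old)
# and those assigned to a later epoch (new); keep only works that have both.
triple = {}
for i in evals:
    data = evals[i]  # {urn: (epoch, score)}
    old = [x for x in data if int(data[x][0]) <= 1900]
    new = [x for x in data if int(data[x][0]) > 1900]
    if old and new:
        triple[i] = (old, new)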
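The same split can be traced on toy data. The evaluations below are invented, but they follow the shape of the output shown earlier: each work maps its URNs to an `(epoch, score)` pair.

```python
# Invented evaluations: work id -> {urn: (epoch, score)}.
toy_evals = {
    "work-a": {"URN:example_1": ("1850", 0.12), "URN:example_2": ("1990", 0.10)},
    "work-b": {"URN:example_3": ("1990", 0.09), "URN:example_4": ("1950", 0.11)},
}

CUTOFF = 1900  # same boundary as in the cell above

toy_pairs = {}
for work_id, data in toy_evals.items():
    older = [urn for urn, (epoch, _score) in data.items() if int(epoch) <= CUTOFF]
    newer = [urn for urn, (epoch, _score) in data.items() if int(epoch) > CUTOFF]
    # a work is only useful for a parallel corpus if it has editions on both sides
    if older and newer:
        toy_pairs[work_id] = (older, newer)

print(toy_pairs)
```

Here `work-b` is dropped because both of its editions fall after the cutoff.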

In [79]:
triple

{'262824678d4b0e2d8695df21b6d920c786cd5c019dcb82ad99b458b057e3d243': (['URN:NBN:no-nb_digibok_2012102107020'],
  ['URN:NBN:no-nb_digibok_2009022600051',
   'URN:NBN:no-nb_digibok_2008060904035']),
 '396ddc56ae4b19a533550b9d63df04bd9c25c8675ae3768ac9f1aead4f0e3b56': (['URN:NBN:no-nb_digibok_2012070408109',
   'URN:NBN:no-nb_digibok_2010021803059',
   'URN:NBN:no-nb_digibok_2011031012001'],
  ['URN:NBN:no-nb_digibok_2012070905001',
   'URN:NBN:no-nb_digibok_2010032213006',
   'URN:NBN:no-nb_digibok_2008052204119',
   'URN:NBN:no-nb_digibok_2006082900066',
   'URN:NBN:no-nb_digibok_2010070106207',
   'URN:NBN:no-nb_digibok_2006082400043',
   'URN:NBN:no-nb_digibok_2008051504087',
   'URN:NBN:no-nb_digibok_2008061900071',
   'URN:NBN:no-nb_digibok_2009062204013']),
 '50479b0cfbc89d081154db811ba107c616c1d20bf9e650b898e2d7554bb0f1ab': (['URN:NBN:no-nb_digibok_2008031404070',
   'URN:NBN:no-nb_digibok_2008071812003'],
  ['URN:NBN:no-nb_digibok_2011120108006']),
 '1d4ea7e02197c411eb6670d943664

In [69]:
with open(os.path.join("parallellkorpus", "ny_gammel_versjon3.json"), "w") as f:
    json.dump(triple, f)
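One detail to keep in mind when the file is read back: JSON has no tuple type, so each `(old, new)` tuple comes back as a two-element list. A small round-trip sketch with invented data and a temporary file:

```python
import json
import os
import tempfile

# Invented data in the same shape as `triple`: work id -> (old URNs, new URNs).
toy_pairs = {"work-a": (["URN:example_1"], ["URN:example_2"])}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_pairs.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(toy_pairs, f)
    with open(path, encoding="utf-8") as f:
        restored = json.load(f)

# The tuple has become a list, but unpacking into two names still works.
restored_old, restored_new = restored["work-a"]
print(type(restored["work-a"]).__name__, restored_old, restored_new)
```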

In [88]:
old_urns = [x for y in triple for x in triple[y][0]]
new_urns = [x for y in triple for x in triple[y][1]]

In [93]:
columns = ["authors", "title", "year", "langs", "urn"]
# .copy() gives independent frames, so adding the 'time' column
# does not trigger pandas' SettingWithCopyWarning
old = dh.CorpusFromIdentifiers(old_urns).corpus[columns].copy()
new = dh.CorpusFromIdentifiers(new_urns).corpus[columns].copy()
old['time'] = 'old'
new['time'] = 'new'
triple_corpus = pd.concat([old, new])

In [94]:
triple_corpus.sort_values(by='year').style

| | authors | title | year | langs | urn | time |
|---|---|---|---|---|---|---|
| 94 | Collett , Camilla | Amtmandens Døttre : en Fortælling. D. 1 | 1855 | nob | URN:NBN:no-nb_digibok_2011032520038 | old |
| 8 | Bjørnson , Bjørnstjerne / Dybwad , Jacob | Mellem Slagene : Drama i 1 Akt | 1858 | nob | URN:NBN:no-nb_digibok_2008070812004 | old |
| 93 | Collett , Camilla | Amtmandens Døttre : en Fortælling | 1860 | nob | URN:NBN:no-nb_digibok_2009072112001 | old |
| 80 | Collett , Camilla | I de lange Nætter | 1863 | nob | URN:NBN:no-nb_digibok_2006111400028 | old |
| 86 | Collett , Camilla | Amtmannens döttrar : en norsk berättelse | 1863 | swe / nob | URN:NBN:no-nb_digibok_2009072212001 | old |
| 82 | Collett , Camilla | Under långa nätter | 1866 | swe / nob | URN:NBN:no-nb_digibok_2009072812001 | old |
| 16 | Ibsen , Henrik | Brand : et dramatisk Digt | 1866 | nob | URN:NBN:no-nb_digibok_2014080424009 | old |
| 20 | Ibsen , Henrik | De Unges Forbund : Lystspil i fem Akter | 1869 | nob | URN:NBN:no-nb_digibok_2013050224047 | old |
| 158 | Lie , Jonas | Den Fremsynte, eller Billeder fra Nordland | 1870 | nob | URN:NBN:no-nb_digibok_2012060106021 | new |
| 66 | Lie , Jonas | Den Fremsynte, eller Billeder fra Nordland | 1873 | nob | URN:NBN:no-nb_digibok_2010052806015 | old |
