# Create and evaluate a corpus of books

The corpus is taken from the works registry. The evaluation selects books that belong to different epochs, where the epochs are defined by the variable `matrix`. This variable contains a dataframe whose columns correspond to epochs and whose rows are words, each with a weight within the epoch.
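To make the expected shape concrete, here is a minimal sketch of such a dataframe. The words and weights are invented for illustration, and the column labels are an assumption; the real values come from `old_vs_new_word_list.csv`, which is loaded further down.

```python
import pandas as pd

# Toy stand-in for `matrix`: rows are words, columns are epochs.
# All words, labels and weights here are made up for illustration.
toy_matrix = pd.DataFrame(
    {"1920": [0.8, 0.1, 0.6], "1990": [0.1, 0.9, 0.2]},
    index=["efter", "etter", "sprog"],
)

# For each word, the epoch in which it carries the most weight
strongest_epoch = toy_matrix.idxmax(axis=1)
print(strongest_epoch.to_dict())
# {'efter': '1920', 'etter': '1990', 'sprog': '1920'}
```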

In [30]:
import dhlab.module_update as mu
import dhlab.text as dh
import pandas as pd
import verk as v
import evaluate_corpus as ev
import json
import os

In [4]:
mu.css()

The process is as follows:

1. Create a matrix 
2. Find works
3. Evaluate the works against the matrix
4. Create a parallel corpus from the evaluation

In [5]:
matrix = pd.read_csv("old_vs_new_word_list.csv", index_col = 0)

Create a list of books from known authors or known titles. This can be done by an exhaustive search or by a more concentrated search.

In [4]:
authors = ["bjørnstjerne bjørnson",  "cora sandel", "goethe", "nietsche", "henrik ibsen", "amalie skram", "alexander kielland", "jonas lie", "camilla collett"]

In [33]:
authors = [  "sigrid undset", "knut hamsun", "Per Christen Asbjørnsen","Arne Garborg ", "hulda garborg","Henrik Wergeland", "Sigbjørn Obstfelder" ]

In [34]:
author_corpus = {
    auth: dh.Corpus(doctype="digibok", author= auth, limit = 1000, to_year = 1920) for auth in authors
} 

The variable `author_corpus` is a dictionary that maps each author name to a `dh.Corpus` of up to 1000 books, restricted by `to_year = 1920`.

Find the work identifiers for each author by looping through the titles in that author's corpus. The identifiers are stored in the variable `works`.

In [35]:
works = []
for a in author_corpus:
    for t in author_corpus[a].corpus.title:
        try:
            works.extend(v.find_works(a, t)['id'].values)
        except Exception:
            # Skip titles for which the lookup fails
            pass

From this list of works a parallel corpus is created, using the matrix to find modernized versions.

In [36]:
import importlib
importlib.reload(v)
importlib.reload(ev)

<module 'evaluate_corpus' from '/mnt/disk1/Github/Verksregister/evaluate_corpus.py'>

Look up the URNs registered for each work in the list, and keep only the works that have more than one URN.

In [37]:
len(works)

157

In [38]:
parallel_candidates = {works_id: v.urns_for_works(works_id) for works_id in works}

In [39]:
candidates = {x:parallel_candidates[x] for x in parallel_candidates if len(parallel_candidates[x]) > 1}

In [40]:
candidates[list(candidates.keys())[3]]

['URN:NBN:no-nb_digibok_2007051012004', 'URN:NBN:no-nb_digibok_2008061704148']

In [41]:
import importlib
importlib.reload(ev)

<module 'evaluate_corpus' from '/mnt/disk1/Github/Verksregister/evaluate_corpus.py'>

In [42]:
c = dh.CorpusFromIdentifiers(candidates[list(candidates.keys())[2]])

In [43]:
c.corpus

Unnamed: 0,dhlabid,title,authors,urn,oaiid,sesamid,isbn10,city,timestamp,year,publisher,langs,subjects,ddc,genres,literaryform,doctype
0,100505639,Fortællingen om Viga-Ljot og Vigdis,"Undset , Sigrid",URN:NBN:no-nb_digibok_2007050812003,oai:nb.bibsys.no:999110884824702202,bc57ad39def59311af88a6bdae325f47,,Kristiania,19090101,1909,Aschehoug,nob,,,,Uklassifisert,digibok
1,100478661,Fortellingen om Viga-Ljot og Vigdis : roman,"Undset , Sigrid",URN:NBN:no-nb_digibok_2008022000047,oai:nb.bibsys.no:999502068574702202,28cd3ef4a99a64e480ea89d3a4f63757,8203172210.0,Oslo,19950101,1995,Aschehoug,nob,historisk,839.823,fiction,Skjønnlitteratur,digibok
2,100534160,Fortellingen om Viga-Ljot og Vigdis,"Undset , Sigrid",URN:NBN:no-nb_digibok_2008022504059,oai:nb.bibsys.no:999522347184702202,ffbd0fc88fb6c6e1dd6b83c6cb892066,8252529267.0,[Oslo],19950101,1995,Den norske bokklubben,nob,,839.823,fiction,Skjønnlitteratur,digibok


In [44]:
ev.evaluate_corpus_norwegian(corpus = c, matrix=matrix, top_number = 10)

{'URN:NBN:no-nb_digibok_2008022504059': ('1990', 0.14467340770330372),
 'URN:NBN:no-nb_digibok_2008022000047': ('1990', 0.15744813501148414),
 'URN:NBN:no-nb_digibok_2007050812003': ('1920', 0.17941035464430233)}
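
The result maps each URN to a pair consisting of the best-matching epoch and a score. How `evaluate_corpus_norwegian` computes that score is not shown in this notebook. Purely as an illustration of the shape of the result, one way such a score could be computed is a weighted sum of relative word frequencies per epoch. The function `score_epochs` and all data below are hypothetical and are not taken from `evaluate_corpus`.

```python
import pandas as pd

def score_epochs(word_counts, matrix):
    """Hypothetical scorer, NOT the implementation in evaluate_corpus.

    Weights the relative word frequencies of a text by the epoch
    weights in `matrix` and returns (best_epoch, score).
    """
    counts = pd.Series(word_counts, dtype=float)
    freqs = counts / counts.sum()
    # Keep only the words the matrix knows about; missing words count as 0
    aligned = freqs.reindex(matrix.index).fillna(0.0)
    # Multiply each row by the word's frequency, then sum per epoch column
    scores = matrix.mul(aligned, axis=0).sum()
    best = scores.idxmax()
    return best, float(scores[best])

# Invented toy data
toy_matrix = pd.DataFrame(
    {"1920": [0.8, 0.1], "1990": [0.1, 0.9]},
    index=["efter", "etter"],
)
epoch, score = score_epochs({"efter": 3, "etter": 1, "og": 6}, toy_matrix)
print(epoch, round(score, 2))
```

In this toy case the older spelling dominates, so the text is assigned to the earlier epoch.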

In [45]:
evals = {}
for worksid in candidates:
    try:
        evals[worksid] = ev.evaluate_corpus_norwegian(corpus = dh.CorpusFromIdentifiers(candidates[worksid]), matrix = matrix)
    except KeyboardInterrupt:
        break
    except Exception:
        # Skip works that cannot be evaluated
        pass

In [46]:
triple = {}
for i in evals:
    data = evals[i]
    old = [x for x in data if int(data[x][0]) <= 1900]
    new = [x for x in data if int(data[x][0]) > 1900]
    # Keep only works that have at least one old and one new version
    if old and new:
        triple[i] = (old, new)
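
The split can be tried out without access to the corpus. The following sketch applies the same cutoff to invented evaluation results; the helper `split_by_epoch`, the work identifiers and the URNs are all hypothetical.

```python
def split_by_epoch(evaluation, cutoff=1900):
    """Split URNs into (old, new) lists by their assigned epoch."""
    old = [urn for urn, (epoch, _) in evaluation.items() if int(epoch) <= cutoff]
    new = [urn for urn, (epoch, _) in evaluation.items() if int(epoch) > cutoff]
    return old, new

# Invented results in the same shape as `evals`
toy_evals = {
    "work-a": {"urn-1": ("1890", 0.21), "urn-2": ("1990", 0.15)},
    "work-b": {"urn-3": ("1990", 0.18), "urn-4": ("1990", 0.16)},
}

pairs = {}
for work_id, evaluation in toy_evals.items():
    old, new = split_by_epoch(evaluation)
    # A work is only useful as a parallel text if both sides are present
    if old and new:
        pairs[work_id] = (old, new)

print(pairs)
# {'work-a': (['urn-1'], ['urn-2'])}
```

Here `work-b` is dropped because both of its versions fall in the same epoch.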

In [47]:
triple

{'2c1889bffec1a5c9dddc3550b14693120473d2b8002d506222d341919ba4c406': (['URN:NBN:no-nb_digibok_2007050812003'],
  ['URN:NBN:no-nb_digibok_2008022504059',
   'URN:NBN:no-nb_digibok_2008022000047']),
 '0c4986de249cfcc2e713b0612a02de379b3139d45622cb92207d3812418b8785': (['URN:NBN:no-nb_digibok_2009051513002',
   'URN:NBN:no-nb_digibok_2011011005074'],
  ['URN:NBN:no-nb_digibok_2008101704023']),
 '34e7e5e4fdd73901fc0ff6b359b53e4042da8bf5aeafee4ae638d497c6fac55b': (['URN:NBN:no-nb_digibok_2008050604010',
   'URN:NBN:no-nb_digibok_2009032500001',
   'URN:NBN:no-nb_digibok_2015051929001',
   'URN:NBN:no-nb_digibok_2010030303003'],
  ['URN:NBN:no-nb_digibok_2011102508116',
   'URN:NBN:no-nb_digibok_2011011706058',
   'URN:NBN:no-nb_digibok_2009031900017',
   'URN:NBN:no-nb_digibok_2009011604050',
   'URN:NBN:no-nb_digibok_2014032506066',
   'URN:NBN:no-nb_digibok_2010080620001',
   'URN:NBN:no-nb_digibok_2008101004078',
   'URN:NBN:no-nb_digibok_2008071000094']),
 'eae961d0a8c1ea07631e6083094d0

In [48]:
with open(os.path.join("parallellkorpus", "ny_gammel_versjon2.json"), "w") as f:
    json.dump(triple, f)
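
One thing to be aware of when reading the file back: JSON has no tuple type, so each `(old, new)` tuple comes back as a two-element list. A small sketch with invented identifiers shows that unpacking still works the same way.

```python
import json

# Invented data in the same shape as `triple`
toy = {"work-a": (["urn-1"], ["urn-2", "urn-3"])}

# Round trip through JSON, as when writing and re-reading the file
restored = json.loads(json.dumps(toy))

# The tuple is now a list, but it unpacks as before
old, new = restored["work-a"]
print(type(restored["work-a"]).__name__, old, new)
# list ['urn-1'] ['urn-2', 'urn-3']
```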

In [49]:
dh.CorpusFromIdentifiers([x for y in triple.values() for z in y for x in z]).corpus["authors title year langs".split()].style

Unnamed: 0,authors,title,year,langs
0,"Undset , Sigrid",Fortællingen om Viga-Ljot og Vigdis,1909,nob
1,"Undset , Sigrid",Fortellingen om Viga-Ljot og Vigdis,1995,nob
2,"Undset , Sigrid",Fortellingen om Viga-Ljot og Vigdis : roman,1995,nob
3,"Hamsun , Knut",Livets Spil,1896,mul / dan / nob
4,"Hamsun , Knut",Samlede Verker. 6 : Ved Rikets Port ; Livets Spil ; Aftenrøde ; Livet i Vold,1934,nob
5,"Hamsun , Knut",Samlede verker. B. 14 : Ved rikets port ; Livets spil ; Aftenrøde ; Munken Vendt,2000,nob
6,"Hamsun , Knut",Sult : roman,1953,nob
7,"Hamsun , Knut",Sult,1990,nob
8,"Hamsun , Knut",Sult,1890,nob
9,"Hamsun , Knut / Lyngstad , Sverre",Hunger,1996,eng / nob
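
The nested comprehension used to build this corpus flattens the `(old, new)` pairs of every work into a single list of URNs. As a sketch with invented identifiers, the same flattening can be written with `itertools.chain`, which some readers may find easier to follow.

```python
from itertools import chain

# Invented data in the same shape as `triple`
toy = {
    "work-a": (["urn-1"], ["urn-2", "urn-3"]),
    "work-b": (["urn-4"], ["urn-5"]),
}

# Comprehension as used in the notebook: works -> (old, new) -> URNs
flat = [x for y in toy.values() for z in y for x in z]

# Equivalent: flatten the pairs first, then the lists inside them
flat_chain = list(chain.from_iterable(chain.from_iterable(toy.values())))

print(flat == flat_chain, flat)
# True ['urn-1', 'urn-2', 'urn-3', 'urn-4', 'urn-5']
```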
