# Create and evaluate a corpus of books

The corpus is taken from the works registry, and the evaluation selects books that belong to different epochs. The epochs are defined by the variable `matrix`, a dataframe whose columns correspond to epochs and whose rows are words, each with a weight within each epoch.
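As a rough illustration of this layout, the sketch below builds a small stand-in matrix. The words, the weights, and the epoch label `2000` are invented for the example (only the label `1800` shows up in the results later on); the real values come from `old_vs_new_word_list.csv`.

```python
import pandas as pd

# Toy stand-in for the real matrix: columns are epochs, rows are words.
# All words, weights and the "2000" label are made up for illustration.
toy_matrix = pd.DataFrame(
    {"1800": [0.9, 0.8, 0.1, 0.0], "2000": [0.1, 0.0, 0.9, 0.7]},
    index=["efter", "hvad", "etter", "hva"],
)
print(toy_matrix)

# For each word, the epoch in which it carries the most weight.
strongest_epoch = toy_matrix.idxmax(axis=1)
print(strongest_epoch["efter"])  # 1800
```

Reading the CSV with `index_col=0`, as done below, produces the same shape: the first column of the file becomes the word index.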

In [51]:
import dhlab.module_update as mu
import dhlab.text as dh
import pandas as pd
import verk as v
import evaluate_corpus as ev
import json

In [3]:
mu.css()

In [4]:
matrix = pd.read_csv("old_vs_new_word_list.csv", index_col=0)

The process is as follows:

1. Create a matrix
2. Find works
3. Evaluate the works against the matrix
4. Create a parallel corpus from the evaluation

Create a list of books from known authors or titles. This can be done by an exhaustive search or by a more concentrated search.

In [6]:
authors = ["bjørnstjerne bjørnson", "cora sandel", "goethe", "nietzsche", "henrik ibsen", "amalie skram", "alexander kielland", "jonas lie", "camilla collett"]

# Optional: uncomment to restrict the run to a single author for a quick test.
# authors = ["cora sandel"]

In [7]:
author_corpus = {
    auth: dh.Corpus(doctype="digibok", author=auth, limit=1000, to_year=1920)
    for auth in authors
}

Here is what `author_corpus` looks like for the first author, shown as five randomly sampled rows:

In [22]:
author_corpus[authors[0]].corpus[['urn', 'authors', 'year', 'title']].sample(5)

| | urn | authors | year | title |
|---|---|---|---|---|
| 211 | URN:NBN:no-nb_digibok_2011103124023 | Bjørnson , Bjørnstjerne | 1873 | Fortællinger. 1 |
| 258 | URN:NBN:no-nb_digibok_2014120308100 | Bjørnson , Bjørnstjerne / Dietrichson , L. (Lo... | 1907 | Taler |
| 103 | URN:NBN:no-nb_digibok_2008060613002 | Bjørnson , Bjørnstjerne | 1872 | Sangen har Lysning |
| 107 | URN:NBN:no-nb_digibok_2008060613006 | Bjørnson , Bjørnstjerne | 1872 | Ved kirke-sanger A. Reitan's jordefærd den 4de... |
| 36 | URN:NBN:no-nb_digibok_2016012129003 | Bjørnson , Bjørnstjerne | 1861 | Kong Sverre |


Find the work identifiers by looping over each author and then over each of that author's titles. The identifiers are stored in the variable `works`.

In [23]:
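works = []
for a in author_corpus:
    for t in author_corpus[a].corpus.title:
        try:
            # Collect the work ids found for this author and title.
            works.extend(v.find_works(a, t)['id'].values)
        except Exception:
            # Skip titles for which the lookup fails.
            pass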

From this list of works a parallel corpus is created, using the matrix to find modernized versions.

In [8]:
import importlib
importlib.reload(v)
importlib.reload(ev)

<module 'evaluate_corpus' from '/mnt/disk1/Github/Historisk_ordbok/verksregister/evaluate_corpus.py'>

Go through the list of works and look up the URNs registered for each one. Keep only the works that have more than one URN.

In [24]:
len(works)

384
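Because the identifiers are collected title by title, the same work may well occur more than once in `works`, so the count above is not necessarily the number of distinct works. The dictionary built in the next cell collapses repeats on its own, since keys are unique. If a distinct list is wanted earlier, one option is sketched here on made-up identifiers.

```python
# Made-up work ids; in the notebook this would be the `works` list.
toy_works = ["w1", "w2", "w1", "w3", "w2"]

# dict.fromkeys keeps the first occurrence of each id and preserves order.
unique_works = list(dict.fromkeys(toy_works))

print(len(toy_works), len(unique_works))  # 5 3
print(unique_works)                       # ['w1', 'w2', 'w3']
```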

In [25]:
parallel_candidates = {works_id: v.urns_for_works(works_id) for works_id in works}

In [26]:
candidates = {x:parallel_candidates[x] for x in parallel_candidates if len(parallel_candidates[x]) > 1}

In [39]:
candidates[list(candidates.keys())[3]]

['URN:NBN:no-nb_digibok_2012070905001',
 'URN:NBN:no-nb_digibok_2011031012001',
 'URN:NBN:no-nb_digibok_2012070408109',
 'URN:NBN:no-nb_digibok_2010021803059',
 'URN:NBN:no-nb_digibok_2006082900066',
 'URN:NBN:no-nb_digibok_2008051504087',
 'URN:NBN:no-nb_digibok_2009062204013',
 'URN:NBN:no-nb_digibok_2006082400043',
 'URN:NBN:no-nb_digibok_2008052204119',
 'URN:NBN:no-nb_digibok_2008061900071',
 'URN:NBN:no-nb_digibok_2010070106207',
 'URN:NBN:no-nb_digibok_2010032213006']

In [61]:
import importlib
importlib.reload(ev)

<module 'evaluate_corpus' from '/home/yoonsen/Git/Historisk_ordbok/verksregister/evaluate_corpus.py'>

In [62]:
c = dh.CorpusFromIdentifiers(candidates[list(candidates.keys())[2]])

In [63]:
c.corpus

| | dhlabid | title | authors | urn | oaiid | sesamid | isbn10 | city | timestamp | year | publisher | langs | subjects | ddc | genres | literaryform | doctype |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100017774 | Jeg velger meg april! : 95 dikt | Bjørnson , Bjørnstjerne / Jacobsen , Rolf | URN:NBN:no-nb_digibok_2009022600051 | oai:nb.bibsys.no:990006123834702202 | c4d8df04e679252120e9e17104be3460 | 8252536239 / 8205270953 | [Oslo] | 20000101 | 2000 | Gyldendal | nob | | 839.821 | fiction | Skjønnlitteratur | digibok |
| 1 | 100523238 | Jeg velger meg april! : 95 dikt | Bjørnson , Bjørnstjerne / Jacobsen , Rolf | URN:NBN:no-nb_digibok_2008060904035 | oai:nb.bibsys.no:998221829204702202 | 70edce0e4d9c819ac708e981510923ec | 8205138958 | Oslo | 19820101 | 1982 | Gyldendal | nob | norske / dikt | 839.91 | fiction | Skjønnlitteratur | digibok |


In [64]:
ev.evaluate_corpus_norwegian(corpus=c, matrix=matrix, top_number=10)

{'URN:NBN:no-nb_digibok_2008060904035': ('1800', 0.3838943153285951),
 'URN:NBN:no-nb_digibok_2009022600051': ('1800', 0.38981935504134235)}
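The result maps each URN to an epoch label and a number. How `evaluate_corpus_norwegian` arrives at these values is not shown in this notebook. Purely as an assumption, the sketch below scores a list of tokens against per-epoch word weights and returns the best epoch together with its share of the matched weight. The function `score_epochs`, the toy weights, the label `2000` and the two sentences are all hypothetical.

```python
from collections import Counter

def score_epochs(tokens, weights):
    """Return (best_epoch, share) for a token list.

    `weights` maps epoch -> {word: weight}. The share is the winning
    epoch's part of all matched weight. This rule is an illustrative
    assumption, not the rule used by evaluate_corpus.
    """
    counts = Counter(token.lower() for token in tokens)
    totals = {
        epoch: sum(counts[word] * weight for word, weight in words.items())
        for epoch, words in weights.items()
    }
    grand_total = sum(totals.values())
    if grand_total == 0:
        # No word from the matrix occurs in the text.
        return None, 0.0
    best = max(totals, key=totals.get)
    return best, totals[best] / grand_total

# Hypothetical weights and sentences, for illustration only.
toy_weights = {
    "1800": {"efter": 1.0, "hvad": 1.0},
    "2000": {"etter": 1.0, "hva": 1.0},
}
old_text = "Hvad skete efter slaget".split()
new_text = "Hva skjedde etter slaget".split()

print(score_epochs(old_text, toy_weights))  # ('1800', 1.0)
print(score_epochs(new_text, toy_weights))  # ('2000', 1.0)
```

A dictionary of the shape used for `toy_weights` can be obtained from a dataframe with `matrix.to_dict()`, which returns `{column: {index: value}}`.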

In [65]:
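evals = {}
for worksid in candidates:
    try:
        evals[worksid] = ev.evaluate_corpus_norwegian(
            corpus=dh.CorpusFromIdentifiers(candidates[worksid]), matrix=matrix
        )
    except KeyboardInterrupt:
        # A manual interrupt stops the loop but keeps the results so far.
        break
    except Exception:
        # Skip works whose corpus cannot be built or evaluated.
        continue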

In [66]:
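# Despite its name, `triple` maps each work id to a pair: (older URNs, newer URNs).
triple = {}
for i in evals:
    data = evals[i]
    # Each value in `data` starts with an epoch label; split the URNs at 1900.
    old = [x for x in data if int(data[x][0]) <= 1900]
    new = [x for x in data if int(data[x][0]) > 1900]
    # Keep only works with editions on both sides of the split.
    if old and new:
        triple[i] = (old, new)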
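To make the split concrete, here is the same logic run on made-up evaluation results. The work ids, URNs, scores and the epoch label `2000` are hypothetical, and `split_by_epoch` is a helper introduced only for this sketch.

```python
def split_by_epoch(evaluation, cutoff=1900):
    """Split URNs into (old, new) by comparing their epoch label to a cutoff."""
    old = [urn for urn, (epoch, _score) in evaluation.items() if int(epoch) <= cutoff]
    new = [urn for urn, (epoch, _score) in evaluation.items() if int(epoch) > cutoff]
    return old, new

# Hypothetical results in the shape returned by the evaluation step.
toy_evals = {
    "work-a": {"urn:1": ("1800", 0.41), "urn:2": ("2000", 0.39)},
    "work-b": {"urn:3": ("1800", 0.38), "urn:4": ("1800", 0.39)},
}

pairs = {}
for work_id, evaluation in toy_evals.items():
    old, new = split_by_epoch(evaluation)
    # A work qualifies only if it has at least one URN on each side.
    if old and new:
        pairs[work_id] = (old, new)

print(pairs)  # {'work-a': (['urn:1'], ['urn:2'])}
```

Here `work-b` is dropped because both of its editions fall in the same epoch, which is what happens to the two editions evaluated individually above.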

In [67]:
triple

{'eafd78267a468b309fb4eddc1cb3840479fcdad9089094b8ae075a1fc749242e': (['URN:NBN:no-nb_digibok_2008060904035',
   'URN:NBN:no-nb_digibok_2009022600051'],
  ['URN:NBN:no-nb_digibok_2008071812002']),
 '4405c68f5bb3e077ec405b10d5c5df9ed097eb20adaddfdc052d1215a05b1c5c': (['URN:NBN:no-nb_digibok_2006080900015',
   'URN:NBN:no-nb_digibok_2009062200004'],
  ['URN:NBN:no-nb_digibok_2008070812004']),
 '396ddc56ae4b19a533550b9d63df04bd9c25c8675ae3768ac9f1aead4f0e3b56': (['URN:NBN:no-nb_digibok_2008051504087',
   'URN:NBN:no-nb_digibok_2011031012001',
   'URN:NBN:no-nb_digibok_2006082400043',
   'URN:NBN:no-nb_digibok_2010070106207',
   'URN:NBN:no-nb_digibok_2006082900066',
   'URN:NBN:no-nb_digibok_2010032213006',
   'URN:NBN:no-nb_digibok_2008052204119',
   'URN:NBN:no-nb_digibok_2009062204013',
   'URN:NBN:no-nb_digibok_2012070905001',
   'URN:NBN:no-nb_digibok_2008061900071'],
  ['URN:NBN:no-nb_digibok_2010021803059',
   'URN:NBN:no-nb_digibok_2012070408109']),
 '262824678d4b0e2d8695df21b6d92

In [68]:
with open("ny_gammel_verksid.json", "w") as f:
    json.dump(triple, f)
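One detail to keep in mind when reading the file back: JSON has no tuple type, so each `(old, new)` tuple is written as an array and comes back as a list. The sketch below shows the round trip on a made-up entry written to a temporary file; the file name `toy_pairs.json` is hypothetical.

```python
import json
import os
import tempfile

# Made-up entry in the same shape as the saved dictionary.
toy = {"work-a": (["urn:1"], ["urn:2"])}

path = os.path.join(tempfile.mkdtemp(), "toy_pairs.json")
with open(path, "w") as f:
    json.dump(toy, f)

with open(path) as f:
    loaded = json.load(f)

# The tuple has become a list, but unpacking works the same way.
print(loaded["work-a"])  # [['urn:1'], ['urn:2']]
old, new = loaded["work-a"]
```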

In [75]:
# Flatten the (old, new) URN lists of every work into a single list of URNs.
urns = [x for y in triple.values() for z in y for x in z]
dh.CorpusFromIdentifiers(urns).corpus["authors title year langs".split()].style

| | authors | title | year | langs |
|---|---|---|---|---|
| 0 | Bjørnson , Bjørnstjerne / Jacobsen , Rolf | Jeg velger meg april! : 95 dikt | 1982 | nob |
| 1 | Bjørnson , Bjørnstjerne / Jacobsen , Rolf | Jeg velger meg april! : 95 dikt | 2000 | nob |
| 2 | Bjørnson , Bjørnstjerne | Kongen | 1885 | nob |
| 3 | Bjørnson , Bjørnstjerne | Mellem slagene | 1957 | nob |
| 4 | Bjørnson , Bjørnstjerne / Skonhoft , Sigurd | Mellem slagene | 1923 | nob |
| 5 | Bjørnson , Bjørnstjerne / Dybwad , Jacob | Mellem Slagene : Drama i 1 Akt | 1858 | nob |
| 6 | Bjørnson , Bjørnstjerne / Sørensen , Henrik | Synnøve Solbakken ; Arne | 1954 | nob |
| 7 | Bjørnson , Bjørnstjerne / Lowzow , Ingrid | Bondefortellinger | 1950 | nob |
| 8 | Bjørnson , Bjørnstjerne / Seip , Didrik Arup | Fortellinger | 1969 | nob |
| 9 | Bjørnson , Bjørnstjerne / Sørensen , Henrik | Bondefortellinger | 1957 | nob |

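The nested comprehension that collects the URNs for this table walks three levels: the works, the `(old, new)` pair of each work, and the URNs in each list. The sketch below runs it on made-up data and compares it with an equivalent formulation using `itertools.chain`; the identifiers are hypothetical.

```python
from itertools import chain

# Made-up data in the same shape as the saved dictionary.
toy = {
    "work-a": (["urn:1"], ["urn:2"]),
    "work-b": (["urn:3", "urn:4"], ["urn:5"]),
}

# The comprehension from the notebook: works -> (old, new) -> URNs.
nested = [x for y in toy.values() for z in y for x in z]

# Equivalent: flatten one level to get the lists, then once more for the URNs.
flat = list(chain.from_iterable(chain.from_iterable(toy.values())))

print(flat)  # ['urn:1', 'urn:2', 'urn:3', 'urn:4', 'urn:5']
```

Both forms keep the order of the dictionary, with the old URNs of each work listed before its new ones.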