# Extraction of Nibelungen texts

It is hard to find Nibelungenlied texts online in a form that is:
- freely accessible
- prepared by scholars
- easy to parse

**Our aim is to perform semantic analysis of the Nibelungen texts and the Völsunga saga.**

I found PDF files from [Universität Wien](https://www.univie.ac.at/nibelungenwerkstatt/) that contain only the raw texts. However, parsing them is not feasible: no spaces remain between words after extraction, and splitting unspaced Mittelhochdeutsch (Middle High German) back into words is too hard.

I also found HTML files from [Augsburg Hochschule](https://www.hs-augsburg.de/~harsch/germanica/Chronologie/d_chrono.html), and extracting the text from them was easier.

## From PDF

In [1]:
link_nibelungen = "https://www.univie.ac.at/nibelungenwerkstatt/files/wrkst_codices.zip"

The zip file is downloaded.

In [2]:
import requests
r = requests.get(link_nibelungen)

with open(link_nibelungen.split("/")[-1], "wb") as f:
    f.write(r.content)



The zip file is unzipped.

In [3]:
import zipfile
with zipfile.ZipFile(link_nibelungen.split("/")[-1], "r") as f:
    f.extractall(".")

The PDF files are read and their text is extracted.

In [4]:
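# Note: this cell uses the legacy PyPDF2 API (before 3.0). Newer releases rename
# these calls to PdfReader, len(reader.pages), reader.pages[i] and extract_text().
import PyPDF2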
l_text = []
with open("gr-A_nib.pdf", "rb") as f:
    pdf_reader = PyPDF2.PdfFileReader(f)
    print("There are "+str(pdf_reader.getNumPages())+" pages.")
    for page_index in range(pdf_reader.getNumPages()):
        page = pdf_reader.getPage(page_index)
        l_text.append(page.extractText())

There are 270 pages.


In [5]:
print(repr(l_text[1]))

'2A1Unsistinalten\nmærenwndersvilgeseitvon\nheldenlobebærnvon\ngrozzerchnheitvon\nfroudenhochgeziten\nvon\nweinen\nundvon\nklagenvon\nchnerrechenstritemugetirnuwunderhoerensagenA2EzwhsinBurgondeneinscho\nenemagedindazinallenlandennihtschoenersmohte\nsinChriemhiltwas\nsigeheizzen\n\nundewas\neinscho\nenewipdarumbemsendegenevilverliesen\ndenlipA3Derminnechlichenmeidetrtenwol\ngezaminmteknerreckenniemenwas\nirgramanemazen\nscho\nenesowas\niredellipderjunchfrouwen\ntugende\nliertenanderiu\nwipA4Irphlagendrikunigeedelunderich\nGunthereundeGernot\ndiereckenlobelichundeGiselherderjungeeinzerwelter\ndegendiufrouwewas\nirswester\n\ndiefurstenhetensinirpflegenA5Dieherrenwarn\nmiltevon\nartehoh\ngebornmitkrefteunmazzen\nkuenediereckenzerkorndazenBurgondensowas\nirlantgenant\nsifrumdenstarkiuwndersitinEzelen\nlantA6ZeWornitz\nbidemRinesiwonden\nmitirkraftindiendevon\nirlandenvilstolziuriterschaft\nmitstoltzlicheneren\nunzanirendes\nzitsitsturbensijamerlichevon\nzweier\nedelenfrouwen\nnitA7Einrichi

The words are not separated from each other, so this output is useless for our purpose.
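
To illustrate the problem, here is a minimal sketch that splits the raw page text on the strophe markers (`A1`, `A2`, ...) visible in the output above. The helper name `split_strophes` and the assumption that every strophe starts with the manuscript siglum followed by a number are ours, not something documented by the source files. The strophes can be recovered this way, but the words inside them stay fused.

```python
import re

# Excerpt copied from the start of the page printed above (manuscript A).
sample = (
    "2A1Unsistinalten\nmærenwndersvilgeseitvon\nheldenlobebærnvon\n"
    "grozzerchnheitvon\nfroudenhochgeziten\nvon\nweinen\nundvon\nklagenvon\n"
    "chnerrechenstritemugetirnuwunderhoerensagen"
    "A2EzwhsinBurgondeneinscho\nenemagedin"
)

def split_strophes(raw, siglum="A"):
    """Map strophe number -> strophe text, splitting on markers such as 'A1'."""
    # Line breaks appear both between words and inside them, so they carry
    # no reliable information and are dropped.
    flat = raw.replace("\n", "")
    # With a capturing group, re.split returns [prefix, num, text, num, text, ...].
    parts = re.split(re.escape(siglum) + r"(\d+)", flat)
    return {int(num): text for num, text in zip(parts[1::2], parts[2::2])}

strophes = split_strophes(sample)
print(sorted(strophes))    # [1, 2]
print(" " in strophes[1])  # False: no word boundaries survive
```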

## From HTML

In [6]:
import requests
from bs4 import BeautifulSoup

main_links = [
    "https://www.hs-augsburg.de/~harsch/germanica/Chronologie/12Jh/Nibelungen/nib_c_00.html",
    "https://www.hs-augsburg.de/~harsch/germanica/Chronologie/12Jh/Nibelungen/nib_b_00.html",
    "https://www.hs-augsburg.de/~harsch/germanica/Chronologie/12Jh/Nibelungen/nib_a_00.html",
    "https://www.hs-augsburg.de/~harsch/germanica/Chronologie/12Jh/Nibelungen/nib_n_00.html"
]

n_pages = 39

#### Making links

In [7]:
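def int_to_string(i):
    # Zero-pad to two digits: 0 -> "00", 7 -> "07", 12 -> "12".
    return str(i).zfill(2)

links = {}
for link in main_links:
    # Split ".../nib_c_00.html" into the base URL and the file name.
    base, filename = link.rsplit("/", 1)
    # Drop the extension and the two-digit page number: "nib_c_00.html" -> "nib_c_".
    stem = filename.split(".")[0][:-2]
    # Pages are numbered from 00 to n_pages.
    links[link] = [base+"/"+stem+int_to_string(i)+".html" for i in range(n_pages+1)]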
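
As a quick check of the URL scheme, the sketch below builds the page links for one index URL and inspects the result. The function name `page_links` is ours; it assumes, as the cell above does, that pages are numbered with two digits from `00` to `n_pages`.

```python
def page_links(index_url, n_pages):
    """Return the URLs of pages 00..n_pages next to an index URL ending in '_00.html'."""
    base, filename = index_url.rsplit("/", 1)
    stem = filename.rsplit(".", 1)[0][:-2]  # "nib_c_00" -> "nib_c_"
    return [f"{base}/{stem}{i:02d}.html" for i in range(n_pages + 1)]

index_url = "https://www.hs-augsburg.de/~harsch/germanica/Chronologie/12Jh/Nibelungen/nib_c_00.html"
pages = page_links(index_url, 39)
print(len(pages))             # 40
print(pages[0] == index_url)  # True
print(pages[-1].rsplit("/", 1)[-1])  # nib_c_39.html
```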

#### Retrieving parts

In [8]:
import time
texts = {}
for link in links:
    texts[link] = []
    for page_link in links[link]:
        r = requests.get(page_link)
        # Pause between requests to avoid overloading the server.
        time.sleep(1)
        texts[link].append(r.content)

#### Saving part

In [9]:
import os

for main_link in main_links:
    directory = main_link.split("/")[-1].split(".")[0]
    if not os.path.exists(directory):
        os.mkdir(directory)
    for i, text in enumerate(texts[main_link]):
        filename = os.path.join(directory, str(i)+".html")
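        # Write as UTF-8 explicitly so the result does not depend on the platform default.
        with open(filename, "w", encoding="utf-8") as f:
            # The byte sequence b"s\x8d" is not valid UTF-8, so it is replaced before decoding.
            f.write(text.replace(b"s\x8d", b"i").decode("utf-8"))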
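
The `replace` call in the cell above matters because `\x8d` on its own is not a valid UTF-8 byte, so decoding an affected page would fail. The sketch below shows this on a made-up three-byte fragment. The fragment is hypothetical, and the reading of `s\x8d` as `i` simply follows the cell above; it has not been checked against the manuscripts.

```python
raw = b"s\x8dn"  # hypothetical fragment containing the stray byte

try:
    raw.decode("utf-8")
    decoding_failed = False
except UnicodeDecodeError:
    # 0x8d is a continuation byte with no lead byte before it.
    decoding_failed = True

# Same cleaning step as in the saving cell, applied before decoding.
cleaned = raw.replace(b"s\x8d", b"i").decode("utf-8")

print(decoding_failed)  # True
print(cleaned)          # in
```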


#### Reading part

In [10]:
retrieved_texts = {}
for main_link in main_links:
    directory = main_link.split("/")[-1].split(".")[0]
    retrieved_texts[main_link] = []
    for i, text in enumerate(texts[main_link]):
        filename = os.path.join(directory, str(i)+".html")
        with open(filename, "r", encoding="utf-8") as f:
            text = f.read()
            tree = BeautifulSoup(text, "lxml")
            retrieved_texts[main_link].append(tree)

In [11]:
print(retrieved_texts[main_links[0]][1].text)




bibliotheca Augustana










<<< Übersicht  <<< vorige Seite  nächste Seite >>>




BIBLIOTHECA AUGUSTANA


 


Das Nibelungenlied
1190/1200






 
	 


Handschrift C
 
1. Aventiure
 




___________________________________________________
 
 
Auenture von den Nibelungen
 











UNS IST> In alten     mæren wnders vil geseit
von heleden lobebæren     von grozer arebeit
von frevde vn– hochgeciten     von weinen vn– klagen
von kvner recken striten     mvget ir nv wnder horen sagen
Ez whs <inBvregonden>     ein vil edel magedin
daz in allen landen     niht schoners mohte sin
Chriemhilt geheizen     div wart ein schone wip
dar vmbe mvsin degene     vil verliesen den lip
Ir pflagen dri kunige     edel un– rich
Gunther un– Gernot     die rechen lobelich
vn– Giselher der iunge     ein wetlicher degen
div frowe was ir swester     die helde hetens inir pflegen
4Ein richiv chuniginne     frov Vte ir mvter hiez
ir vater der hiez Dancrât     der in div erbe liez
sit nach sime lebene    

```verbatim
We select the text between the first <h4> and the first <<<< occurrences.
```

In [12]:
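def extract_text(html_text):
    # The text lives in the <h3> elements inside <div class="contentus">.
    content = html_text.find("div", attrs={"class": "contentus"})
    # Drop non-breaking spaces from each element's text.
    lines = [h3.text.replace("\xa0", "") for h3 in content.find_all("h3")]
    # Split each line on double spaces.
    return [line.split("  ") for line in lines]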
    

print(repr(extract_text(retrieved_texts[main_links[0]][1])[0][0]))

'UNS IST> In alten'
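
Depending on how many spaces remain once the non-breaking spaces are removed, splitting on a double space can return empty strings or pieces with a leading space. A possible alternative, sketched here with a hypothetical helper `split_half_lines`, splits on any run of two or more whitespace characters. The sample line is rebuilt from the page output above, and its gap of five plain spaces is an assumption about the real markup.

```python
import re

def split_half_lines(line):
    """Split a verse line into half-lines on runs of 2+ whitespace characters."""
    # Treat non-breaking spaces like ordinary spaces before splitting.
    normalized = line.replace("\xa0", " ").strip()
    return re.split(r"\s{2,}", normalized)

# Assumed spacing: five ordinary spaces between the two half-lines.
line = "UNS IST> In alten" + " " * 5 + "mæren wnders vil geseit"

print(line.split("  "))        # ['UNS IST> In alten', '', ' mæren wnders vil geseit']
print(split_half_lines(line))  # ['UNS IST> In alten', 'mæren wnders vil geseit']
```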


#### Extracted text

In [13]:
import codecs

for main_link in main_links:
    directory = "extracted_"+main_link.split("/")[-1].split(".")[0][:-3]
    if not os.path.exists(directory):
        os.mkdir(directory)
    for i, text in enumerate(retrieved_texts[main_link]):
        filename = os.path.join(directory, str(i)+".txt")
        extracted_text = extract_text(text)
        if len(extracted_text) > 0:
            with codecs.open(filename, mode="w", encoding="utf-8") as f:
                lines = ["\t".join(line) for line in extracted_text]
                final_text = "\n".join(lines)
                f.write(final_text)
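
For the later analysis, the files written above can be read back into lists of half-lines. This is a minimal sketch: `load_extracted` is a hypothetical helper, and the round trip below uses a temporary directory with two lines taken from the page output rather than the real `extracted_*` folders.

```python
import os
import tempfile

def load_extracted(filename):
    """Read one extracted file: one verse per line, half-lines separated by tabs."""
    with open(filename, encoding="utf-8") as f:
        return [line.split("\t") for line in f.read().split("\n")]

verses = [["von heleden lobebæren", "von grozer arebeit"],
          ["daz in allen landen", "niht schoners mohte sin"]]

with tempfile.TemporaryDirectory() as directory:
    filename = os.path.join(directory, "1.txt")
    # Same layout as the cell above: tabs within a verse, newlines between verses.
    with open(filename, "w", encoding="utf-8") as f:
        f.write("\n".join("\t".join(verse) for verse in verses))
    loaded = load_extracted(filename)

print(loaded == verses)  # True
```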