# Visualization of Edition Differences

![screenshot of the visualizations](./img/header.png)

In [1]:
import os
import sys
import shutil
from datetime import datetime
import math
from germansentiment import SentimentModel

import xml.etree.ElementTree as ET
import pandas as pd

To run this notebook, download the provided [sample dataset](https://doi.org/10.5281/zenodo.4992787) and extract it to a folder, then assign that folder's path to the variable `baseDir` in the next cell.

In [2]:
# path to the sbbget temporary result files, e.g. "../sbbget/sbbget_downloads/download_temp" (the base path under which ALTO files are stored)
baseDir="/Users/david/src/python/StabiHacks/sbbget/sbbget_downloads.div_spielebuecher/download_temp/"

# path of the analysis results
analysisPath="./analysis/"
# verbose output
verbose=True

def printLog(text):
    now = str(datetime.now())
    print("[" + now + "]\t" + text)
    # forces to output the result of the print command immediately, see: http://stackoverflow.com/questions/230751/how-to-flush-output-of-python-print
    sys.stdout.flush()
    
def createSupplementaryDirectories():
    if not os.path.exists(analysisPath):
        if verbose:
            print("Creating " + analysisPath)
        os.mkdir(analysisPath)

createSupplementaryDirectories()

Creating ./analysis/
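
Since the later cells run for a long time, it can help to confirm up front that `baseDir` really contains the extracted dataset. The following is a minimal sketch of such a check, not part of the original notebook: the helper name `listPpnDirs` is hypothetical, and the only assumption about the dataset is the one the later cells make as well, namely that each medium sits in a subdirectory whose name starts with `PPN`.

```python
import os
import tempfile

def listPpnDirs(basePath):
    """Return the sorted names of all subdirectories of basePath that start with 'PPN' (hypothetical helper)."""
    if not os.path.isdir(basePath):
        raise FileNotFoundError("baseDir does not exist: " + basePath)
    return sorted(
        name for name in os.listdir(basePath)
        if name.startswith("PPN") and os.path.isdir(os.path.join(basePath, name))
    )

# demonstration on a throwaway directory tree instead of the real dataset
with tempfile.TemporaryDirectory() as demoDir:
    for name in ("PPN745143385", "PPN745158323", "logs"):
        os.mkdir(os.path.join(demoDir, name))
    foundDirs = listPpnDirs(demoDir)
    print(foundDirs)
```

Calling `listPpnDirs(baseDir)` before the main loop and checking that the result is not empty would catch a wrong path early.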


## METS/MODS Processing

The next cell contains the logic to parse a METS/MODS XML file and save its contents to a dataframe.

In [3]:
# XML namespace of MODS
modsNamespace = "{http://www.loc.gov/mods/v3}"

def parseOriginInfo(child):
    """
    Parses an originInfo node and its children
    :param child: The originInfo child in the element tree.
    :return: A dict with the parsed information or None if the originInfo is invalid.
    """
    discardNode = True

    result = dict()
    result["publisher"] = ""
    # check if we can directly process the node
    if "eventType" in child.attrib:
        if child.attrib["eventType"] == "publication":
            discardNode = False
    else:
        # we have to check if the originInfo contains an edition node with "[Electronic ed.]" to discard the node
        # (Element.getchildren() was removed in Python 3.9, so the children are listed directly)
        children = list(child)
        hasEdition = False
        for c in children:
            if c.tag == modsNamespace + "edition":
                hasEdition = True
                if c.text == "[Electronic ed.]":
                    discardNode = True
                else:
                    discardNode = False
        if not hasEdition:
            discardNode = False

    if discardNode:
        return None
    else:
        for c in child:
            cleanedTag = c.tag.replace(modsNamespace, "")
            if cleanedTag == "place":
                result["place"] = c.find("{http://www.loc.gov/mods/v3}placeTerm").text.strip()
            if cleanedTag == "publisher":
                result["publisher"] = c.text.strip()
            # check for the most important date (see https://www.loc.gov/standards/mods/userguide/origininfo.html)
            if "keyDate" in c.attrib:
                result["date"] = c.text.strip()
    return result

def parseTitleInfo(child):
    result = dict()
    result["title"]=""
    result["subTitle"]=""

    for c in child:
        cleanedTag = c.tag.replace(modsNamespace, "")
        result[cleanedTag]=c.text.strip()

    return result

def parseLanguage(child):
    result = dict()
    result["language"]=""

    for c in child:
        cleanedTag = c.tag.replace(modsNamespace, "")
        if cleanedTag=="languageTerm":
            result["language"]=c.text.strip()

    return result

def parseName(child):
    result=dict()
    role=""
    name=""
    for c in child:
        cleanedTag = c.tag.replace(modsNamespace, "")
        if cleanedTag=="role":
            for c2 in c:
                ct=c2.tag.replace(modsNamespace, "")
                if ct=="roleTerm":
                    role=c2.text.strip()
        elif cleanedTag=="displayForm":
            name=c.text.strip()
    result[role]=name
    return result

def parseAccessCondition(child):
    result = dict()
    result["access"]=child.text.strip()
    return result

def processMETSMODS(currentPPN, metsModsPath):
    """
    Processes a given METS/MODS file.
    :param currentPPN: the current PPN
    :param metsModsPath: path to the METS/MODS file

    :return: A dataframe with the parsing results.
    """
    # parse the METS/MODS file
    tree = ET.parse(metsModsPath)
    root = tree.getroot()
    # only process the node types that carry the metadata we need
    nodesOfInterest = ["originInfo", "titleInfo", "language", "name", "accessCondition"]

    # stores the result dicts created by the parsing functions defined above
    resultDicts=[]
    # master dictionary, later used for the creation of a dataframe
    masterDict={'publisher':"",'place':"",'date':"",'title':"",'subTitle':"",'language':"",'aut':"",'rcp':"",'fnd':"",'access':"",'altoPaths':""}
    # find all mods:mods nodes
    for modsNode in root.iter(modsNamespace + 'mods'):
        for child in modsNode:
            # strip the namespace
            cleanedTag = child.tag.replace(modsNamespace, "")
            #print(cleanedTag)
            #print(child)
            if cleanedTag in nodesOfInterest:
                if cleanedTag == "originInfo":
                    r = parseOriginInfo(child)
                    if r:
                        resultDicts.append(r)
                elif cleanedTag=="titleInfo":
                    r = parseTitleInfo(child)
                    if r:
                        resultDicts.append(r)
                elif cleanedTag=="language":
                    r = parseLanguage(child)
                    if r:
                        resultDicts.append(r)
                elif cleanedTag=="name":
                    r = parseName(child)
                    if r:
                        resultDicts.append(r)
                elif cleanedTag=="accessCondition":
                    r = parseAccessCondition(child)
                    if r:
                        resultDicts.append(r)
        # we are only interested in the first occurring mods:mods node
        break

    # copy results to the master dictionary
    for result in resultDicts:
        for key in result:
            masterDict[key]=[result[key]]
    masterDict["ppn"]=[currentPPN]
    return pd.DataFrame(data=masterDict)
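
To illustrate what the parsing functions look for, the following sketch runs a simplified version of the `originInfo` logic on a small, hand-written MODS fragment. The fragment is an assumption made for illustration and is not taken from the dataset. The sketch only covers the `[Electronic ed.]` check and the `keyDate` lookup, not the `eventType` handling of `parseOriginInfo`.

```python
import xml.etree.ElementTree as ET

MODS_NS = "{http://www.loc.gov/mods/v3}"

# hand-written sample record (assumption), not taken from the dataset
sampleXml = """<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:originInfo eventType="publication">
    <mods:place><mods:placeTerm type="text">Leipzig</mods:placeTerm></mods:place>
    <mods:publisher>Spamer</mods:publisher>
    <mods:dateIssued keyDate="yes" encoding="w3cdtf">1891</mods:dateIssued>
  </mods:originInfo>
  <mods:originInfo>
    <mods:edition>[Electronic ed.]</mods:edition>
  </mods:originInfo>
</mods:mods>"""

root = ET.fromstring(sampleXml)
parsed = {}
for originInfo in root.iter(MODS_NS + "originInfo"):
    edition = originInfo.find(MODS_NS + "edition")
    # skip the originInfo that describes the digital copy
    if edition is not None and edition.text == "[Electronic ed.]":
        continue
    for node in originInfo:
        tag = node.tag.replace(MODS_NS, "")
        if tag == "place":
            parsed["place"] = node.find(MODS_NS + "placeTerm").text.strip()
        elif tag == "publisher":
            parsed["publisher"] = node.text.strip()
        # the node flagged with keyDate carries the most important date
        if "keyDate" in node.attrib:
            parsed["date"] = node.text.strip()

print(parsed)
```

Note that ElementTree expands the `mods:` prefix to the full namespace URI in braces, which is why the notebook compares tags against `modsNamespace` rather than against the prefix.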

## Sentiment Analysis

The following cell is based on https://github.com/oliverguhr/german-sentiment-lib. The small fix in line 28 has been offered as a pull request to the original author.

In [4]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from typing import List
import torch
import re

class SentimentModel_dazFix():
    def __init__(self, model_name: str = "oliverguhr/german-sentiment-bert"):
        if torch.cuda.is_available():
            self.device = 'cuda'
        else:
            self.device = 'cpu'        
            
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.model = self.model.to(self.device)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

        self.clean_chars = re.compile(r'[^A-Za-züöäÖÜÄß ]', re.MULTILINE)
        self.clean_http_urls = re.compile(r'https*\S+', re.MULTILINE)
        self.clean_at_mentions = re.compile(r'@\S+', re.MULTILINE)

    def predict_sentiment(self, texts: List[str])-> List[str]:
        texts = [self.clean_text(text) for text in texts]
        # Add special tokens takes care of adding [CLS], [SEP], <s>... tokens in the right way for each model.
        # daz: last two parameters added to limit maximum number of tokens in case of long strings and to prevent crashes
        # such as:
        #"Token indices sequence length is longer than the specified maximum sequence length for this model (603 > 512). 
        # Running this sequence through the model will result in indexing errors"
        input_ids = self.tokenizer.batch_encode_plus(texts,padding=True, add_special_tokens=True,truncation=True,max_length=512)
        input_ids = torch.tensor(input_ids["input_ids"])
        input_ids = input_ids.to(self.device)

        with torch.no_grad():
            logits = self.model(input_ids)    

        label_ids = torch.argmax(logits[0], axis=1)

        labels = [self.model.config.id2label[label_id] for label_id in label_ids.tolist()]
        return labels

    def replace_numbers(self,text: str) -> str:
            return text.replace("0"," null").replace("1"," eins").replace("2"," zwei")\
                .replace("3"," drei").replace("4"," vier").replace("5"," fünf") \
                .replace("6"," sechs").replace("7"," sieben").replace("8"," acht") \
                .replace("9"," neun")         

    def clean_text(self,text: str)-> str:    
            text = text.replace("\n", " ")        
            text = self.clean_http_urls.sub('',text)
            text = self.clean_at_mentions.sub('',text)        
            text = self.replace_numbers(text)                
            text = self.clean_chars.sub('', text) # use only text chars                          
            text = ' '.join(text.split()) # substitute multiple whitespace with single whitespace   
            text = text.strip().lower()
            return text
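
The effect of `clean_text` on OCR output is easier to see in isolation. The sketch below re-implements only the cleaning steps as a standalone function, so it runs without `torch` or `transformers`. The names `cleanText`, `CLEAN_CHARS`, `CLEAN_URLS`, `CLEAN_MENTIONS` and `DIGIT_WORDS` are hypothetical, and the sample string is merely modeled on a typical OCR snippet.

```python
import re

# same patterns as in the class above
CLEAN_CHARS = re.compile(r'[^A-Za-züöäÖÜÄß ]')
CLEAN_URLS = re.compile(r'https*\S+')
CLEAN_MENTIONS = re.compile(r'@\S+')
DIGIT_WORDS = {
    "0": " null", "1": " eins", "2": " zwei", "3": " drei", "4": " vier",
    "5": " fünf", "6": " sechs", "7": " sieben", "8": " acht", "9": " neun",
}

def cleanText(text):
    """Standalone sketch of the cleaning steps: drop URLs and mentions, spell out digits, keep letters only."""
    text = text.replace("\n", " ")
    text = CLEAN_URLS.sub('', text)
    text = CLEAN_MENTIONS.sub('', text)
    for digit, word in DIGIT_WORDS.items():
        text = text.replace(digit, word)
    text = CLEAN_CHARS.sub('', text)
    # collapse runs of whitespace and lowercase the result
    return ' '.join(text.split()).lower()

cleaned = cleanText("\n34 \nFlechtarbeit. \nEine der schönsten")
print(cleaned)
```

Page numbers such as `34` are spelled out digit by digit, so they still reach the model as words, while punctuation is dropped.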

## Preparation of the Visualization

The next cell will read in all images, their accompanying metadata from METS/MODS XML files and the associated fulltext data. A sentiment analysis will be carried out on the fulltexts. The sentiment analysis assumes German texts.
Finally, records are created for each image.

In [5]:
jpgFilePaths = dict()
ppnDirs=[]
rows=[]

startTime = str(datetime.now())
printLog("Loading sentiment model...")
model = SentimentModel_dazFix()

printLog("Fetching files...")
# check all subdirectories starting with PPN as each PPN stands for a different medium
for x in os.listdir(baseDir):
    if x.startswith("PPN"):
        ppnDirs.append(x)

# browse all directories below each PPN directory, collect the images from *_TIFF directories
# and associate each image with its PPN and the matching *_FULLTEXT directory
for ppn in ppnDirs:
    printLog("Processing files for PPN: "+ppn)
    ppnRecord=dict()
    
    metsModsPath=baseDir+ppn+"/__metsmods/"+ppn+".xml"
    r=processMETSMODS(ppn,metsModsPath)
    ppnRecord["_place"]=r["place"].values[0]
    ppnRecord["_title"]=r["title"].values[0]
    ppnRecord["_publisher"]=r["publisher"].values[0]
    ppnRecord["_date"]=r["date"].values[0]
    
                    
    for dirpath, dirnames, files in os.walk(baseDir+ppn):
        for name in files:
            if dirpath.endswith("_TIFF"):
                record=dict()
                # all relevant data is joined into the keywords field as Vikus will use this field for filtering
                record["keywords"]=",".join((ppn,ppnRecord["_place"],ppnRecord["_title"],ppnRecord["_publisher"]+" (Verlag)",ppnRecord["_date"]))
                result=[]
                # if we find no OCR data, we will save a placeholder text instead
                description="Keine OCR-Ergebnisse vorhanden."
                # if we found an image directory, only add JPEG files
                if name.endswith(".jpg") or name.endswith(".JPG"):
                    if not ppn in jpgFilePaths:
                        jpgFilePaths[ppn]=[]
                    fullJPGPath=os.path.join(dirpath, name)
                    jpgFilePaths[ppn].append(fullJPGPath)
                
                    # get the raw fulltext
                    rawTextPath=dirpath.replace("_TIFF","_FULLTEXT")+"/"
                    t=rawTextPath.split("FILE_")[1].split("_FULLTEXT")
                    txtFile=t[0].zfill(8)+"_raw.txt"
                    rawTextPath+=txtFile
                    
                    if os.path.exists(rawTextPath):
                        fileHandler = open(rawTextPath,mode='r')
                        fulltext = fileHandler.read()
                        if len(fulltext)>800:
                            description=fulltext[:800]+"[...]"
                        else:
                            description=fulltext
                        fileHandler.close()
                        # sentiment analysis of the raw OCR fulltext
                        result = model.predict_sentiment([fulltext])
                           
                    # get the physical page number of the current image
                    txtFilePath=ppn+".txt"
                    
                    with open(os.path.join(dirpath, txtFilePath)) as txtFile:
                        dest=""
                        for row in txtFile:
                            logPage=row.split()[1]
                            dest=analysisPath+ppn+"_"+logPage+".jpg"
                            record["id"]=ppn+"_"+logPage

                            # "year" holds the page number rounded up to the next multiple of ten (e.g., page 11 -> 20)
                            record["year"]=math.ceil(int(logPage.split("_")[1])/10)*10
                            record["_realpage"]=int(logPage.split("_")[1])
                            if result:
                                record["_sentiment"]=result[0]
                                record["keywords"]+=","+result[0]+ " (Sentiment)"
                            record["_description"]=description

                    #print("Copy from %s to %s"%(fullJPGPath,dest))
                    # copy the found files to a new location with new unique names as required by Vikus
                    shutil.copy(fullJPGPath,dest)
                    # "join" the current record with its surrounding PPN metadata
                    record.update(ppnRecord)
                    rows.append(record)
                        
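# count the images found across all PPNs (a separate name avoids shadowing the built-in sum)
imageCount=0
for ppn in jpgFilePaths:
    imageCount+=len(jpgFilePaths[ppn])
printLog("Found %i images."%imageCount)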
endTime = str(datetime.now())
print("Started at:\t%s\nEnded at:\t%s" % (startTime, endTime))

[2021-06-19 17:35:45.140873]	Loading sentiment model...
[2021-06-19 17:35:50.051229]	Fetching files...
[2021-06-19 17:35:50.052603]	Processing files for PPN: PPN745143385
[2021-06-19 17:38:53.911120]	Processing files for PPN: PPN745158323
[2021-06-19 17:46:53.587373]	Processing files for PPN: PPN745183891
[2021-06-19 17:50:57.997120]	Processing files for PPN: PPN74602598X
[2021-06-19 18:04:29.943315]	Processing files for PPN: PPN745182844
[2021-06-19 18:14:59.584201]	Processing files for PPN: PPN745846971
[2021-06-19 18:29:49.130390]	Found 1750 images.
Started at:	2021-06-19 17:35:45.140795
Ended at:	2021-06-19 18:29:49.131839
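
Two string manipulations in the cell above are easy to misread: the derivation of the raw text path from an image directory, and the rounding that fills the `year` field. The sketch below isolates both. The function names `rawTextPathFor` and `pageBucket` are hypothetical, and the directory name in the example is an assumption that follows the `FILE_<number>_TIFF` pattern the cell relies on.

```python
import math

def rawTextPathFor(imageDir):
    """Map a '..._TIFF' image directory to the matching raw OCR text file."""
    fulltextDir = imageDir.replace("_TIFF", "_FULLTEXT") + "/"
    # the number between 'FILE_' and '_FULLTEXT' is padded to eight digits
    fileNumber = fulltextDir.split("FILE_")[1].split("_FULLTEXT")[0]
    return fulltextDir + fileNumber.zfill(8) + "_raw.txt"

def pageBucket(logPage):
    """Round the page number in an identifier such as 'PHYS_0011' up to the next multiple of ten."""
    page = int(logPage.split("_")[1])
    return math.ceil(page / 10) * 10

# hypothetical directory name used for illustration only
examplePath = rawTextPathFor("download_temp/PPN745143385/FILE_0011_TIFF")
print(examplePath)
print(pageBucket("PHYS_0011"), pageBucket("PHYS_0050"), pageBucket("PHYS_0103"))
```

Grouping pages into buckets of ten is what produces values such as 20, 50 and 110 in the `year` column of the resulting dataframe.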


In the next cell, a dataframe is created from the records. The resulting dataframe is then saved in a CSV file readable by Vikus.

In [6]:
df=pd.DataFrame.from_dict(rows)
df.to_csv(analysisPath+"edition_vis.csv",sep=",")
df

| | keywords | id | year | _realpage | _sentiment | _description | _place | _title | _publisher | _date |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PPN745143385,Leipzig,Illustriertes Spielbuch f... | PPN745143385_PHYS_0011 | 20 | 11 | neutral | \nVII \nSeite \nGeschwindsprechsätze . 70 \nKa... | Leipzig | Illustriertes Spielbuch für Kinder | Spamer | 1891 |
| 1 | PPN745143385,Leipzig,Illustriertes Spielbuch f... | PPN745143385_PHYS_0046 | 50 | 46 | neutral | \n34 \nFlechtarbeit. \nEine der schönsten und ... | Leipzig | Illustriertes Spielbuch für Kinder | Spamer | 1891 |
| 2 | PPN745143385,Leipzig,Illustriertes Spielbuch f... | PPN745143385_PHYS_0103 | 110 | 103 | neutral | \n91 \nDer Reiter „ vom Ahornbaum. \nDer Ahorn... | Leipzig | Illustriertes Spielbuch für Kinder | Spamer | 1891 |
| 3 | PPN745143385,Leipzig,Illustriertes Spielbuch f... | PPN745143385_PHYS_0050 | 50 | 50 | neutral | \n38 \nihren Puppen und sich selbst allerlei n... | Leipzig | Illustriertes Spielbuch für Kinder | Spamer | 1891 |
| 4 | PPN745143385,Leipzig,Illustriertes Spielbuch f... | PPN745143385_PHYS_0115 | 120 | 115 | neutral | \n103 \nhinausgeschossen hat. Der Treffer schi... | Leipzig | Illustriertes Spielbuch für Kinder | Spamer | 1891 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1745 | PPN745846971,Leipzig,Illustriertes Spielbuch f... | PPN745846971_PHYS_0317 | 320 | 317 | | Keine OCR-Ergebnisse vorhanden. | Leipzig | Illustriertes Spielbuch für Knaben | Spamer | 1909 |
| 1746 | PPN745846971,Leipzig,Illustriertes Spielbuch f... | PPN745846971_PHYS_0069 | 70 | 69 | neutral | \n49 \n98] \nSuch- und Ratespiele. \ndes o ? '... | Leipzig | Illustriertes Spielbuch für Knaben | Spamer | 1909 |
| 1747 | PPN745846971,Leipzig,Illustriertes Spielbuch f... | PPN745846971_PHYS_0086 | 90 | 86 | neutral | \n66 Bewegungsspiele mit erforderlichem Spielg... | Leipzig | Illustriertes Spielbuch für Knaben | Spamer | 1909 |
| 1748 | PPN745846971,Leipzig,Illustriertes Spielbuch f... | PPN745846971_PHYS_0205 | 210 | 205 | neutral | \n360- \n-362] \nSchneespiele und Eisvergnügun... | Leipzig | Illustriertes Spielbuch für Knaben | Spamer | 1909 |


The resulting CSV file can be used directly with a Vikus viewer instance. Details on how to obtain and configure Vikus can be found in a [separate repository](https://github.com/cpietsch/vikus-viewer).
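
Before handing the CSV file to Vikus, a quick structural check can catch problems early. The sketch below is a suggestion only: the helper `checkVikusFrame` is hypothetical, and the assumption that `id`, `keywords` and `year` are the columns to verify is drawn from the fields this notebook fills without an underscore prefix, not from the Vikus documentation. The demo rows are shortened stand-ins for the real records.

```python
import io
import pandas as pd

def checkVikusFrame(frame):
    """Return a list of problems found in a dataframe meant for Vikus (hypothetical helper)."""
    problems = []
    for column in ("id", "keywords", "year"):
        if column not in frame.columns:
            problems.append("missing column: " + column)
    # each image needs its own id because the copied JPEG files are named after it
    if "id" in frame.columns and frame["id"].duplicated().any():
        problems.append("duplicate ids")
    return problems

# shortened stand-in records for illustration
demo = pd.DataFrame([
    {"id": "PPN745143385_PHYS_0011", "keywords": "PPN745143385,Leipzig", "year": 20},
    {"id": "PPN745143385_PHYS_0046", "keywords": "PPN745143385,Leipzig", "year": 50},
])

# write to an in-memory buffer and read it back, as a viewer would read the file
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
roundTrip = pd.read_csv(io.StringIO(buffer.getvalue()))
problems = checkVikusFrame(roundTrip)
print(problems)
```

Passing `index=False` to `to_csv` leaves out the numeric dataframe index, which otherwise ends up as an extra unnamed first column in the file.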

The config files for this visualization are available in the [vikus_config](./vikus_config/) subdirectory. These files have to be stored along with all data in the [vikus_deploy](./vikus_deploy/) subdirectory.

Please note that the sample configuration assumes the Vikus sprites and thumbnails created by the [script](https://github.com/cpietsch/vikus-viewer-script) are placed under `./data/edition_vis_kids`.