# deduplicate ~~bibliographic~~ institution references

The DH community does not consistently follow citation guides, so bibliographic references are quite messy. This script

* extracts all `.//tei:titleStmt//tei:affiliation` elements and writes them to a `.csv` file in the current folder
* this file is fed into the `csvdedupe` command line interface, which returns `org_output.csv` with the deduplicated entries
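
As a rough illustration of the extraction step, the sketch below runs the same XPath against a small inline TEI fragment. It uses the standard library's `xml.etree.ElementTree` instead of `teipy`/`lxml` so that it runs without extra dependencies; the sample document, `SAMPLE_TEI`, and the helper `extract_affiliations` are made up for this example and are not part of the script.

```python
import xml.etree.ElementTree as ET

# TEI namespace, bound to the "tei" prefix used in the XPath
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample document (assumption, for illustration only)
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt>
    <title>Sample paper</title>
    <author ref="#a1">
      <persName>Jane Doe</persName>
      <affiliation>Universität Wien,
         Österreich</affiliation>
    </author>
  </titleStmt></fileDesc></teiHeader>
</TEI>"""


def extract_affiliations(xml_string):
    """Return the whitespace-normalized text of every affiliation in titleStmt."""
    root = ET.fromstring(xml_string)
    result = []
    for node in root.findall(".//tei:titleStmt//tei:affiliation", TEI_NS):
        # join all text nodes, then collapse line breaks and runs of spaces
        result.append(" ".join("".join(node.itertext()).split()))
    return result


affiliations = extract_affiliations(SAMPLE_TEI)
print(affiliations)
```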

In [None]:
import glob
import os
import pandas as pd

from lxml import etree as ET

from teipy import TeiReader

In [None]:
try:
    os.makedirs('../indices')
except FileExistsError:
    print('../indices already exists')

In [None]:
files = glob.glob("../dhd_*/TEI/*.xml")
len(files)

## note

Extraction and disambiguation of institutions is tricky because:
* different names are used for the same institution
* person -> affiliation is a 1:n relation
* but there is no dedicated separator to indicate that one affiliation tag comprises several affiliations
  * `;` is treated as a separator, but not
  * `|` or `/`, because although they are sometimes
    * used as separators, e.g. "forschungsverbund marbach weimar wolfenbüttel / herzog august bibliothek wolfenbüttel",
    * they also appear as part of a name, e.g. "Akademie der Wissenschaften und der Literatur | Mainz", or indicate a part of an institution, as in "Georg-August-Universität Göttingen, Deutschland - GCDH/Archäologisches Institut"

Therefore no automatic splitting is done on `|` or `/`!
Disambiguation is done in a very generous manner.

This means that `eberhard karls universität tübingen, deutschland` and `eberhard karls universität tübingen, deutschland; humboldt universität zu berlin` are treated as one institution.
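
The splitting rule described above can be sketched as follows. The helper name `split_affiliations` is hypothetical, and dropping empty parts (for example after a trailing `;`) is an assumption added for this example, not something the note prescribes.

```python
def split_affiliations(raw):
    """Split an affiliation string on ';' only; '|' and '/' are left alone."""
    # collapse line breaks and repeated spaces before splitting
    normalized = " ".join(raw.split())
    # assumption: empty parts are dropped
    return [part.strip() for part in normalized.split(";") if part.strip()]


examples = [
    "eberhard karls universität tübingen, deutschland; humboldt universität zu berlin",
    "Akademie der Wissenschaften und der Literatur | Mainz",
    "Georg-August-Universität Göttingen, Deutschland - GCDH/Archäologisches Institut",
]

for example in examples:
    print(split_affiliations(example))
```

Only the first example is split into two entries; the other two come back as a single entry each, because `|` and `/` cannot be told apart from name parts.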

In [None]:
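def yield_items(files):
    """Yield one dict per affiliation part found in the given TEI files."""
    for x in files:
        doc = TeiReader(x)
        doc_id = x
        titel = doc.extract_md()['title']
        counter = 0
        for rs in doc.tree.xpath('.//tei:titleStmt//tei:affiliation', namespaces=doc.ns_tei):
            # the affiliation's parent element carries the author's @ref
            author_node = rs.getparent()
            author_id = author_node.xpath('./@ref', namespaces=doc.ns_tei)[0]
            # join all text nodes and collapse whitespace
            rs_text = " ".join("".join(rs.itertext()).split())
            # only ';' is treated as a separator between affiliations
            for y in rs_text.split(';'):
                if not y.strip():
                    continue  # skip empty parts, e.g. after a trailing ';'
                item = {
                    "title": titel,
                    "author_id": author_id,
                    "org": y.strip(),
                    "id": f"{doc_id}__{counter}"
                }
                counter += 1
                yield item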

In [None]:
df = pd.DataFrame(yield_items(files))

In [None]:
df.to_csv('orgs.csv')

## run csvdedupe cmd-tool

```shell
csvdedupe orgs.csv --field_names org --output_file org_output.csv --skip_training true
```

* use the result (saved as `org_output.csv`) for any further processing
* read `org_output.csv` into a `pandas.DataFrame`
* group rows (i.e. affiliation entries) by `Cluster ID` (created by dedupe)

In [None]:
deduped = pd.read_csv('org_output.csv')

In [None]:
from collections import defaultdict

## extra work to circumvent a strange behaviour in dedupe

* as reported in https://github.com/dedupeio/csvdedupe/issues/88, dedupe does not group exact string matches into the same cluster, so some extra work needs to be done

In [None]:
# map every distinct org string to the cluster ID of its first occurrence
org_lookup = {}
for org_name, group in deduped.groupby('org'):
    org_ref = f"#org__{group['Cluster ID'].iloc[0]}"
    org_lookup[org_name] = org_ref
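
To make the workaround easier to follow, here is a pandas-free sketch of the same idea on invented data: every distinct `org` string is mapped to the `Cluster ID` of its first row, so exact duplicates end up with the same reference even when dedupe assigned them different clusters. `SAMPLE_OUTPUT` and `build_org_lookup` are hypothetical names used only here.

```python
import csv
import io

# Invented stand-in for org_output.csv (assumption, for illustration only):
# the same string appears in two different clusters
SAMPLE_OUTPUT = """Cluster ID,org
0,universität wien
1,universität wien
2,tu graz
"""


def build_org_lookup(csv_text):
    """Map each distinct org string to '#org__<Cluster ID of its first row>'."""
    lookup = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # setdefault keeps the first cluster ID seen for an org string
        lookup.setdefault(row["org"], f"#org__{row['Cluster ID']}")
    return lookup


lookup = build_org_lookup(SAMPLE_OUTPUT)
print(lookup)
```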

In [None]:
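for x in files:
    doc = TeiReader(x)
    for rs in doc.tree.xpath('.//tei:titleStmt//tei:affiliation', namespaces=doc.ns_tei):
        # normalize exactly as in yield_items so the lookup keys match
        rs_text = " ".join("".join(rs.itertext()).split())
        orgs = []
        for y in rs_text.split(';'):
            y = y.strip()
            if not y:
                continue  # empty parts carry no institution name
            orgs.append(org_lookup[y])
        # several affiliations become space-separated pointers in @ref
        rs.attrib['ref'] = " ".join(orgs)
    doc.tree_to_file(x)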
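
The effect of this last step on a single element can be sketched with the standard library alone. The helper `set_org_refs`, the sample element, and the lookup values are invented for this example; silently skipping names that are missing from the lookup is an assumption of the sketch, not the behaviour of the cell above.

```python
import xml.etree.ElementTree as ET


def set_org_refs(affiliation, org_lookup):
    """Write space-separated org references into the element's ref attribute."""
    # same normalization as in the extraction step
    text = " ".join("".join(affiliation.itertext()).split())
    parts = [part.strip() for part in text.split(";")]
    # assumption: parts without a lookup entry are skipped
    refs = [org_lookup[part] for part in parts if part in org_lookup]
    affiliation.set("ref", " ".join(refs))
    return affiliation


node = ET.fromstring("<affiliation>Uni A;\n  Uni B</affiliation>")
set_org_refs(node, {"Uni A": "#org__1", "Uni B": "#org__7"})
print(ET.tostring(node, encoding="unicode"))
```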