# Fill in full author references for mentions of authors

For example, if we find `Calderon`, we want to produce the string:

```
<author><name key="cald">Pedro Calderón de la Barca</name></author>
```

In [1]:
testTexts = [
    "Calderón de la Barca, Pedro",
    "CCCCCalderón",
    "Caldeeeeeerón",
    "Pedro Barca",
    "Pedro Barca",
    "Agustin Moreto",
    "A. Moreto",
    "Agustin",
    "Augustine",
]

## Triggers

We are going to find trigger strings for authors in the input texts.

In order to do that successfully, we normalize the text first:

* we remove all accents from accented letters
* we make everything lowercase

We need a function that can strip accents from characters.

The approach is taken from [Stack Overflow](https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string):

In [2]:
import re
import unicodedata

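def normalize(text):
    # Decompose each accented letter into a base letter plus combining marks
    text = unicodedata.normalize('NFD', text)
    # Drop every non-ASCII character, which includes the combining marks
    text = text.encode('ascii', 'ignore')
    # The bytes are pure ASCII at this point
    text = text.decode('ascii')
    return text.lower().strip()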

In [3]:
normalize("Calderón de la Barca, Pedro")

'calderon de la barca, pedro'
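
To see why the NFD step matters, the following minimal sketch inspects what the decomposition produces. The helper names `decompose` and `stripAccents` are illustrative and not part of the notebook. `stripAccents` is an alternative to the ASCII round-trip: it removes only combining marks, so it keeps characters such as `ß` that `normalize` above would drop.

```python
import unicodedata

def decompose(text):
    # Hypothetical helper: Unicode names of the code points after NFD
    return [unicodedata.name(ch) for ch in unicodedata.normalize("NFD", text)]

def stripAccents(text):
    # Hypothetical alternative: drop combining marks, keep everything else
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

names = decompose("ó")
stripped = stripAccents("Calderón")
print(names)     # ['LATIN SMALL LETTER O', 'COMBINING ACUTE ACCENT']
print(stripped)  # Calderon
```

Which variant fits better depends on whether non-ASCII letters without accents should survive normalization.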

## Authors

We compile a list of authors that we want to detect.

For each author we have a full name, a key, and a list of triggers.

We write the specification in *YAML*, which parses into a Python dictionary.

In [4]:
authorSpec = '''
cald:
    full: Pedro Calderón de la Barca
    triggers:
    - calderon
    - barca
more:
    full: Agustín Moreto
    triggers:
    - moreto
    - agustin
    - augustine
'''

In order to parse this specification, you need to install PyYAML first:

``` sh
pip install pyyaml
```

or

``` sh
pip3 install pyyaml
```

In [5]:
import yaml

In [6]:
# safe_load is sufficient here: the specification contains only plain mappings, lists and strings
authors = yaml.safe_load(authorSpec)

In [7]:
authors

{'cald': {'full': 'Pedro Calderón de la Barca',
  'triggers': ['calderon', 'barca']},
 'more': {'full': 'Agustín Moreto',
  'triggers': ['moreto', 'agustin', 'augustine']}}

We need to compile the author specification into a lookup table that maps each trigger to its author key:

In [8]:
triggers = {}
for (key, authorInfo) in authors.items():
    for trigger in authorInfo['triggers']:
        triggers[trigger] = key

In [9]:
triggers

{'calderon': 'cald',
 'barca': 'cald',
 'moreto': 'more',
 'agustin': 'more',
 'augustine': 'more'}
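
Because the lookup table is a plain dictionary, a trigger listed under two authors would silently end up pointing at whichever author comes last. The sketch below is one possible way to surface such conflicts. The function `buildTriggerIndex` and the sample data are hypothetical; in particular, the extra `barca` trigger under `more` is added only to provoke a conflict. Unlike the loop above, this sketch keeps the first owner of a trigger.

```python
def buildTriggerIndex(authors):
    # Hypothetical helper: map each trigger to an author key and
    # collect triggers that are claimed by more than one author
    index = {}
    conflicts = []
    for key, info in authors.items():
        for trigger in info["triggers"]:
            owner = index.setdefault(trigger, key)  # first owner is kept
            if owner != key:
                conflicts.append((trigger, owner, key))
    return index, conflicts

# Sample data for illustration only; "barca" is deliberately duplicated
sampleAuthors = {
    "cald": {"full": "Pedro Calderón de la Barca", "triggers": ["calderon", "barca"]},
    "more": {"full": "Agustín Moreto", "triggers": ["moreto", "agustin", "barca"]},
}

index, conflicts = buildTriggerIndex(sampleAuthors)
print(conflicts)  # [('barca', 'cald', 'more')]
```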

In [10]:
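def fillInAuthorDetails(text):
    """Return the <author> element for the first trigger found in text, or None."""
    normalized = normalize(text)
    output = None
    # Substring test: the first trigger (in dictionary order) that occurs wins
    for trigger in triggers:
        if trigger in normalized:
            authorKey = triggers[trigger]
            authorFull = authors[authorKey]["full"]
            output = f"""<author><name key="{authorKey}">{authorFull}</name></author>"""
            break
    if output is None:
        # Report inputs in which no trigger was found
        print(f"!!! {normalized:<36} => NO AUTHOR DETECTED")
    return output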
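
The f-string above inserts the key and the full name verbatim. That is fine for the two authors in this specification, but a name containing `&` or `<` would yield malformed XML. A minimal sketch of an escaped variant, using the standard library's `xml.sax.saxutils`, could look like this. The function name `formatAuthor` and the sample name `Smith & Sons` are hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr

def formatAuthor(authorKey, authorFull):
    # quoteattr adds the surrounding quotes and escapes the attribute value;
    # escape handles &, < and > in the element text
    return f"<author><name key={quoteattr(authorKey)}>{escape(authorFull)}</name></author>"

plain = formatAuthor("cald", "Pedro Calderón de la Barca")
tricky = formatAuthor("ab", "Smith & Sons")  # made-up entry for illustration
print(plain)
print(tricky)
```

For names without special characters the result is identical to the string built in `fillInAuthorDetails`.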

In [11]:
for text in testTexts:
    result = fillInAuthorDetails(text)
    if result is not None:
        print(f"{text:<40} => {result}")

Calderón de la Barca, Pedro              => <author><name key="cald">Pedro Calderón de la Barca</name></author>
CCCCCalderón                             => <author><name key="cald">Pedro Calderón de la Barca</name></author>
!!! caldeeeeeeron                        => NO AUTHOR DETECTED
Pedro Barca                              => <author><name key="cald">Pedro Calderón de la Barca</name></author>
Pedro Barca                              => <author><name key="cald">Pedro Calderón de la Barca</name></author>
Agustin Moreto                           => <author><name key="more">Agustín Moreto</name></author>
A. Moreto                                => <author><name key="more">Agustín Moreto</name></author>
Agustin                                  => <author><name key="more">Agustín Moreto</name></author>
Augustine                                => <author><name key="more">Agustín Moreto</name></author>
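
The detection relies on a substring test, which is why `CCCCCalderón` matches while `Caldeeeeeerón` does not. The same lookup can be sketched with a single compiled regular expression instead of a loop over the triggers. This is an illustration under assumptions, not the notebook's method: `triggerPattern` and `findAuthorKey` are hypothetical names, and the block repeats `normalize` and the trigger table so that it runs on its own. Note one behavioural difference: the regex reports the leftmost trigger in the text, whereas the loop reports the first trigger in dictionary order.

```python
import re
import unicodedata

def normalize(text):
    # Same normalization as in the notebook: strip accents, lowercase, trim
    text = unicodedata.normalize("NFD", text)
    return text.encode("ascii", "ignore").decode("ascii").lower().strip()

triggers = {
    "calderon": "cald",
    "barca": "cald",
    "moreto": "more",
    "agustin": "more",
    "augustine": "more",
}

# Longest triggers first, so a longer trigger is preferred over one that is its prefix
triggerPattern = re.compile(
    "|".join(re.escape(t) for t in sorted(triggers, key=len, reverse=True))
)

def findAuthorKey(text):
    # Return the author key of the leftmost trigger in text, or None
    match = triggerPattern.search(normalize(text))
    return triggers[match.group(0)] if match else None

print(findAuthorKey("CCCCCalderón"))   # cald
print(findAuthorKey("Caldeeeeeerón"))  # None
print(findAuthorKey("A. Moreto"))      # more
```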
