## Prepare Datasets for Character Lemmatization
There are numerous characters in Game of Thrones, and the text often refers to them by first name only. We want a lemma for each name, i.e. a single canonical form, so that it can be used in the graph later. The goal is to find the most probable lemma for each mention in the text.

To get the lemma, we need a list of characters. To date, the best list is on [A Wiki of Ice and Fire](https://awoiaf.westeros.org/index.php/List_of_characters).

### Scrape the characters list page
The scraper fetches the text of the list page, and we load it into Beautiful Soup to get a DOM representation of it.
We extract all links to character pages from the wiki and download each page.
All HTML is saved to disk for further analysis.

### Get all the links that point to a character in the wiki
To get the links, we browse the HTML looking for

```html
    <a href="">name</a>
```
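tags, keeping only those that follow a certain template (here, anchors whose only attributes are `href` and `title`).
This gives us the list of pages to download as tuples (href, name), where name is the last path segment of the href.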
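
The filtering rule can be tried on a small fragment before running it against the live wiki. The sketch below is only an illustration: it uses the standard library's `html.parser` instead of Beautiful Soup so that it runs without extra packages, and both the `CharacterLinkParser` class and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class CharacterLinkParser(HTMLParser):
    """Collect (href, name) tuples from anchors that carry exactly href and title."""

    def __init__(self):
        super().__init__()
        self.pages = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_dict = dict(attrs)
        # Same template as the notebook: only 'href' and 'title', nothing else.
        if set(attr_dict) == {"href", "title"}:
            href = attr_dict["href"]
            self.pages.append((href, href.split("/")[-1]))


# Made-up fragment for illustration; it is not copied from the wiki.
sample_html = """
<ul>
  <li><a href="/index.php/Arya_Stark" title="Arya Stark">Arya Stark</a></li>
  <li><a href="/index.php/Help" class="nav" title="Help">Help</a></li>
  <li><a href="/index.php/Jon_Snow" title="Jon Snow">Jon Snow</a></li>
  <li><a name="top">anchor without href</a></li>
</ul>
"""

parser = CharacterLinkParser()
parser.feed(sample_html)
pages = parser.pages
print(pages)
```

Anchors with extra attributes (such as `class`) or without an `href` are skipped, which is what lets the template separate character links from navigation links.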


In [36]:
import requests
from bs4 import BeautifulSoup

# request the list-of-characters page
r = requests.get('https://awoiaf.westeros.org/index.php/List_of_characters')
soup = BeautifulSoup(r.text, 'lxml')

# extract all anchor (link) tags: 'a'
links = soup.find_all('a')
pages = []
for link in links:
    # keep only anchors whose attributes are exactly 'title' and 'href'
    if set(link.attrs.keys()) == {'title', 'href'}:
        # store (href, last path segment of the href)
        pages.append((link['href'], link['href'].split('/')[-1]))

pages = pages[:-5] # remove last 5 as they are irrelevant
pages = set(pages) # create a set to remove duplicates

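import os

# for each link, download the page and store it under data/html,
# the directory that the main loop reads later
os.makedirs('data/html', exist_ok=True)
for href, name in pages:
    r = requests.get('https://awoiaf.westeros.org' + href)
    with open(os.path.join('data/html', name + '.html'), 'w', encoding='utf-8') as fp:
        fp.write(r.text)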

/index.php/Aegon_V_Targaryen


### Analysis of the HTML page: getCharacterDict
We are after several pieces of information:
* short name (title of the infobox)
* common name (title of the page)
* full name (field of the infobox)
* aliases (field of the infobox)
* book list (field of the infobox)

If no infobox is found, the character is discarded.

The code uses a variety of techniques. The most difficult parts are:
* aliases in the format `'BranStark[1]'`, which need to be transformed into `['Bran', 'Stark']`
* books, which come in the same format but with parentheses, and whose volume titles are converted into numbers
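
The two transformations can be checked in isolation with the same regular expressions used in `getCharacterDict`. This is a minimal sketch: the helper names `split_aliases` and `parse_books` are hypothetical, and the input strings are made-up examples of what the infobox text might look like.

```python
import re

BOOKS_VOL = {
    "A Game of Thrones": 1,
    "A Clash of Kings": 2,
    "A Storm of Swords": 3,
    "A Feast for Crows": 4,
    "A Dance with Dragons": 5,
}


def split_aliases(raw_alias):
    """Drop reference markers like [1], then split where a lowercase letter meets an uppercase one."""
    no_refs = re.sub(r"\s?\[[0-9]+\]", "", raw_alias)
    separated = re.sub(r"([a-z])([A-Z])", r"\1, \2", no_refs)
    return [part.strip() for part in separated.split(",")]


def parse_books(raw_books):
    """Drop parenthesised notes, split the glued titles, and map known titles to volume numbers."""
    no_parens = re.sub(r"\s\([a-zA-Z]+\)", "", raw_books)
    separated = re.sub(r"([a-z])([A-Z])", r"\1, \2", no_parens)
    return [BOOKS_VOL[t.strip()] for t in separated.split(",") if t.strip() in BOOKS_VOL]


aliases = split_aliases("BranStark[1]")
books = parse_books("A Game of Thrones (appears)A Clash of Kings (appears)")
print(aliases)  # ['Bran', 'Stark']
print(books)    # [1, 2]
```

The lowercase-to-uppercase boundary is a heuristic: an alias that starts with a lowercase letter will not be separated from the previous one, and a name with an internal capital will be split in two.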


In [45]:
import re  # used by getCharacterDict below

books_vol = {
    'A Game of Thrones':1,
    'A Clash of Kings':2,
    'A Storm of Swords':3,
    'A Feast for Crows':4,
    'A Dance with Dragons':5
}


def getCharacterDict(soup):
    fullname = None
    aliases = None
    books = None
    
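    # a missing infobox or caption makes .find() return None, hence AttributeError
    try:
        short_name = soup.find("table", class_="infobox").caption.text.strip()
    except AttributeError:
        short_name = None

    try:
        common_name = soup.find('h1', class_='firstHeading').text.strip()
    except AttributeError:
        common_name = None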

    try:
        tbody = soup.find("table", class_="infobox").tbody
        rows = tbody.find_all("tr")

        for row in rows:
            for child in row.children:
                if child.name == 'th':
                    if row.th.text.lower()=='full name':
                        fullname = row.td.text.strip()

                    if row.th.text.lower()=='alias':
                        raw_alias = row.td.text
                        regex = r"(\s?\[[0-9]+\])"
                        subst = ''
                        remove_ref_aliases = re.sub(regex, subst, raw_alias)
                        # detect aliases separation
                        regex = r"([a-z])([A-Z])"
                        subst = "\\1, \\2"
                        normalized_alias = re.sub(regex, subst, remove_ref_aliases)
                        # recreate an array with the aliases
                        aliases = [t.strip() for t in normalized_alias.split(',')]

                    if row.th.text.lower() == 'book(s)':
                        raw_books = row.td.text
                        regex = r"(\s\([a-zA-Z]+\))"
                        subst = ''
                        remove_parenthesis_books = re.sub(regex, subst, raw_books)
                        regex = r"([a-z])([A-Z])"
                        subst = "\\1, \\2"
                        normalized_books = re.sub(regex, subst, remove_parenthesis_books)
                        books = [books_vol[b.strip()] for b in normalized_books.split(',') if b.strip() in books_vol.keys()]
    except AttributeError:
        # no infobox (or an unexpected row layout): discard the character
        return None
    return {'short_name':short_name, 'common_name':common_name, 'fullname':fullname, 'aliases':aliases, 'books':books}

### Function definition: get character internal links
For graph creation purposes, we save all links from one character page to other characters in a list.

In [59]:
def getCharacterLinks(soup,names):
    cLinks = []
    links = soup.find_all('a')
    for link in links:
        if 'href' in link.attrs.keys():
            name = link['href'].split('/')[-1]
            if name in names:
                cLinks.append(name)
    return cLinks

## Main Loop

We read the whole `data/html` directory containing the files.
For each HTML file, we build the DOM and pass it to the two functions to get the character information as well as the links.


In [64]:
from bs4 import BeautifulSoup
import os
import re

characters = []
characters_links = []
names = []
for root, dirs, files in os.walk('data/html'):
    for name in files:
        if name.endswith('.html'):
            names.append(name[:-5])

    for name in files:
        if name.endswith('.html'):
            # the pages were written as UTF-8, so read them back the same way
            with open(os.path.join(root, name), 'r', encoding='utf-8') as fp:
                html = fp.read()
                soup = BeautifulSoup(html,'lxml')
                character = getCharacterDict(soup)
                if character is not None:
                    character['url'] = name[:-5]
                    characters.append(character)
                    character_links = {name[:-5]:getCharacterLinks(soup,names)}
                    characters_links.append(character_links)
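
The loop leaves `characters_links` as a list of single-key dictionaries. For the graph step it may be more convenient to merge them into one adjacency mapping. The sketch below assumes that shape; the `to_adjacency` helper and the sample data are hypothetical, and dropping duplicates and self-links is a design choice rather than something the notebook does.

```python
def to_adjacency(links_list):
    """Merge [{source: [targets]}, ...] into {source: [unique targets]}, keeping first-seen order."""
    adjacency = {}
    for entry in links_list:
        for source, targets in entry.items():
            seen = adjacency.setdefault(source, [])
            for target in targets:
                # skip links back to the page itself and repeated links
                if target != source and target not in seen:
                    seen.append(target)
    return adjacency


# Made-up sample in the same shape as `characters_links`.
sample_links = [
    {"Jon_Snow": ["Arya_Stark", "Eddard_Stark", "Arya_Stark", "Jon_Snow"]},
    {"Arya_Stark": ["Jon_Snow"]},
]

adjacency = to_adjacency(sample_links)
print(adjacency)
```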

### Save the dictionaries as a pickle for future use

In [66]:
import pickle
with open('characters.pickle','wb+') as fp:
    pickle.dump(characters, fp)
with open('characters_links.pickle','wb+') as fp:
    pickle.dump(characters_links, fp)
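
Reading the data back in a later session is the mirror operation. The round trip below is a small sketch that writes to a temporary directory so that it does not touch the real files; the record it stores is a made-up example of a `getCharacterDict` result.

```python
import os
import pickle
import tempfile

# Hypothetical record with the same keys as the dictionaries built above.
sample_characters = [
    {
        "short_name": "Bran",
        "common_name": "Bran Stark",
        "fullname": None,
        "aliases": ["Bran"],
        "books": [1, 2],
        "url": "Bran_Stark",
    }
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "characters.pickle")
    with open(path, "wb") as fp:
        pickle.dump(sample_characters, fp)
    with open(path, "rb") as fp:
        restored = pickle.load(fp)

print(restored == sample_characters)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.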