# Understanding the Social Networks of Emma B. Andrews

In this notebook, we use Python to parse the TEI files of the Emma B. Andrews diaries. Our aim is to visualise the social networks in Emma B. Andrews's life on the Nile. To parse the TEI documents, we will use several modules:

* [csv](https://docs.python.org/3/library/csv.html)
* [Beautiful Soup 4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
* [lxml](https://pypi.org/project/lxml/)
* [matplotlib](https://matplotlib.org/)
* [nltk](https://www.nltk.org/)

Beautiful Soup, lxml, NLTK, and Matplotlib are third-party modules and need to be installed. If you are running Jupyter Notebook inside the Python virtual environment, these modules were installed when you created that environment. The `csv` module is part of Python's standard library, so it needs no separate installation.

## Import Modules (Dependencies)

To import a module, we use Python's `import` statement.

In [14]:
import csv # Python's Comma-Separated Values parser
from bs4 import BeautifulSoup # Beautiful Soup parses HTML and XML files
import lxml # lxml is the XML parser that Beautiful Soup uses underneath
import nltk # Natural Language Toolkit
import re # Python's regular expression module

## Read Volume into Python to Parse with Beautiful Soup

The tagged TEI files of the diaries are located in the `/diary-volumes` directory, so we need to tell Python where to find the file. Eventually we will want to build this path with Python's `os` module so that it works on both Windows and Mac. For now, the path is hard-coded.
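
One possible way to build the path portably is `pathlib` from the standard library, used here as an alternative to the `os` module mentioned above. This is only a sketch: it assumes the notebook runs from a folder that sits next to `diary-volumes`, and the variable name `volumes_dir` is introduced purely for illustration.

```python
from pathlib import Path

# Assumption: the notebook is started from a folder that sits next to diary-volumes/.
volumes_dir = Path("..") / "diary-volumes"  # hypothetical helper name
journal = volumes_dir / "volume17.xml"

# Path objects use the correct separator for the current operating system
# and can be passed directly to open().
print(journal.name)
print(journal.as_posix())
```

Because `open()` accepts `Path` objects, the rest of the notebook could keep using `journal` unchanged.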

In [3]:
# The journal volume we want lives in the diary-volumes directory, one level above this notebook, so we give its relative path including the file extension.
journal = '../diary-volumes/volume17.xml'

# Now we want to create a Beautiful Soup object with our file. We will unpack what this means in more detail below.
with open(journal) as xml:
    soup = BeautifulSoup(xml, 'lxml-xml')

## Parse Diary for Network Analysis

The diaries are encoded according to the `TEI` standard. Thus, the `<text>…</text>` element encloses the contents of the diary. We want to parse every day of the diary and then further manipulate the data for graph analysis. Each child within the `<text>` root is the entry for a single day.

### Extract Daily Entries from Volume 17 

In [11]:
# To extract the daily entries, we traverse the text root and gather all <div> elements whose type is "entry"
entries = soup.find_all("div", {"type": "entry"}) # Find all div elements whose type attribute is "entry" (each one is a journal entry)
num_entries = len(entries) # Count the entries
f'Volume contains {num_entries} entries' # Report the total number of entries

'Volume contains 43 entries'

### Extract the Dates of the Entries and Create a Timeline

In [20]:
# Search the whole volume for the date identifier of each entry
dates = soup.find_all(attrs={"xml:id": re.compile("EBA-[0-9-–]+")}) # Find every element whose xml:id matches the date pattern
total_dates = len(dates) # Count the dates. This should equal the number of entries; if not, there is either an encoding issue or Andrews did not date the entry
f'Volume contains {total_dates} entries' # Report the total number of dated entries

'Volume contains 43 entries'
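
To see what the date pattern accepts, it can be tried on plain strings outside of Beautiful Soup. The sketch below reuses the pattern from the cell above. The sample `xml:id` values are invented for illustration and are not taken from the diaries. Beautiful Soup filters attribute values with the pattern's `search()` method, so the sketch does the same.

```python
import re

# The same pattern used in the cell above. The en dash inside the
# character class lets an identifier span a range of days.
DATE_ID = re.compile("EBA-[0-9-–]+")

# Hypothetical xml:id values, invented for illustration only.
sample_ids = ["EBA-1900-01-01", "EBA-1900-01-02–03", "Rathbone_Elena"]

# Keep only the values the pattern finds a match in.
date_ids = [value for value in sample_ids if DATE_ID.search(value)]
print(date_ids)
```

Since `search()` is unanchored, any value that merely contains `EBA-` followed by a digit or dash would also pass. If stricter matching were ever needed, `DATE_ID.fullmatch(value)` is one option.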

### Extract the persName Entities from Each Entry

In [8]:
# Create a list of the ref values of every person named in the entries
network = []
for entry in entries:
    people = entry.find_all('persName')
    for person in people:
        network.append(person['ref'])
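
A TEI `ref` attribute may hold several space-separated pointers, which is why values such as `'#Rathbone_Mr #Rathbone_Mrs'` appear in the output below. For network analysis it may help to split these into individual people before counting them. The following is a minimal sketch under that assumption: the helper `split_refs` and the name `sample_network` are hypothetical, and the sample values are a handful copied from the notebook's output.

```python
from collections import Counter

def split_refs(ref_value):
    """Split one ref attribute value into identifiers without the leading '#'."""
    return [pointer.lstrip("#") for pointer in ref_value.split()]

# A few ref values copied from the notebook's output (not the full list).
sample_network = [
    "#Rathbone_Mr",
    "#Rathbone_Mr #Rathbone_Mrs",
    "#Rathbone_Elena",
    "#Davis_Theodore_M",
    "#Davis_Theodore_M",
]

# Flatten the multi-valued refs into one identifier per mention.
people = [person for value in sample_network for person in split_refs(value)]

# Count how often each person is mentioned.
mentions = Counter(people)
print(mentions.most_common())
```

The resulting counts could later serve as node weights in the graph, although that choice is left open here.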

In [21]:
# This is temporary. The next step is to associate the date with each person, which will create a timeline of when Andrews encountered and wrote about each named person.
# Once we zip the dates together with the people, we will process the text of each entry with NLTK to discover the verbal association between Andrews and the named person.
network

['#Troyon_Constant',
 '#Corot_Jean_Baptiste_Camille',
 '#Rathbone_Mr',
 '#Rathbone_Mr #Rathbone_Mrs',
 '#Rathbone_Elena',
 '#Parsons_John #Parsons_Florence_Van_Corltandt',
 '#Draper_Mr',
 '#Gorst_Group',
 '#Trefusis_Walter',
 '#Carter_Bonham_Mr',
 '#Rathbone_Elena',
 '#Lovatt_Mr #Lovatt_Master',
 '#Lovatt_Mr #Lovatt_Master',
 '#Gay_Walter',
 '#Gay_Walter_Mr #Gay_Mrs',
 '#Gay_Mrs',
 '#Rathbone_Elena',
 '#Buckley_Mr #Buckley_Mrs',
 '#Carter_Howard',
 '#Weigall_Arthur',
 '#Nicol_Erskine',
 '#Davis_Theodore_M',
 '#Weigall_Arthur',
 '#Butler_Mrs',
 '#Nicol_Erskine',
 '#Butler_Mrs',
 '#Weigall_Arthur',
 '#Davis_Theodore_M',
 '#Burton_Harry',
 '#Maspero_Gaston',
 '#Davis_Theodore_M',
 '#Davis_Theodore_M',
 '#Jones_Harold',
 '#Jones_Cyril',
 '#Rathbone_Elena',
 '#Davis_Theodore_M',
 '#Carter_Howard',
 '#Nicol_Erskine',
 '#Davis_Theodore_M',
 '#Jones_Harold',
 '#Jones_Cyril',
 '#Jones_Harold',
 '#Mumm_von_Schwarzenstein_Alfons',
 '#Fahnestock_Gibson #Fahnestock_Mrs',
 '#Kelly_Miss',
 '#Whitaker