# Parsing a TEI Document - Homework

## Directions

Parse the TEI of Gibbon's _Decline and Fall_ to extract all the **marginal notes** (XML file provided).
1. Extract all marginal notes
2. Remove extraneous whitespace
3. Place marginal notes in a dataframe
4. Save the dataframe as a CSV file


## Hint

Here is a snippet of what a marginal note in the XML document looks like:

`<note place="margin">A. D. 268. March 20. Death of Gallienus.</note>`

These differ from the footnotes that we saw in class in that (a) they do not have numbers and (b) the whitespace is different. You are free to accommodate those differences however you like.
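As a minimal sketch of the distinction described in the hint, the fragment below selects notes by their `place` attribute and collapses whitespace. It uses only the standard library rather than BeautifulSoup. The footnote line (`place="bottom"`, `n="1"`) is invented for illustration and is not taken from the Gibbon file, and the fragment assumes no XML namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the footnote (place="bottom", n="1") is made up for illustration.
fragment = """<p>
  <note place="margin">A. D. 268. March 20. Death of Gallienus.</note>
  <note place="bottom" n="1">An invented footnote.</note>
</p>"""

root = ET.fromstring(fragment)

# Keep only <note> elements whose place attribute is "margin".
margin = [note for note in root.iter("note") if note.get("place") == "margin"]

# Join each note's text pieces, then collapse runs of whitespace to single spaces.
texts = [" ".join("".join(note.itertext()).split()) for note in margin]
print(texts)
```

Because marginal notes carry no number, any label for them has to be generated while looping over the results.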

### Set up

In [33]:
! pip3 install beautifulsoup4

In [34]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [35]:
# load xml file
url = "https://raw.githubusercontent.com/msaxton/nlp-data/main/gibbon.xml"
response = requests.get(url)
xml_str = response.text

### Parse TEI

In [36]:
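# use BeautifulSoup to create a parsed document object
# (the parser is named explicitly; "xml" is an alternative if lxml is installed)
xml = BeautifulSoup(xml_str, "html.parser")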

In [37]:
# find all marginal notes
margin_notes = xml.find_all('note', attrs={'place': 'margin'})
margin_notes[0]

<note place="margin">
                        Aureolus invades Italy, is defeated and be
                        <g ref="char:EOLhyphen"></g>
                        ſieged at Milan.
                    </note>

In [38]:
# remove extra space (if needed)
def remove_extra_space(text):
    # drop line breaks, then the pairs of spaces left over from the XML indentation
    text = text.replace('\n', '')
    text = text.replace('  ', '')
    return text
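Removing pairs of spaces works here because the indentation appears to contain an even number of spaces, but it would also glue two words together wherever a line breaks without a hyphen. The following is a rough sketch of a more defensive variant, not the method used in class. `clean_note` and `EOL_HYPHEN` are hypothetical names, the marker pattern is assumed from the first note printed above, and the note markup is assumed to be well-formed XML (for a BeautifulSoup tag, `str(margin_note)` would supply the string).

```python
import re
import xml.etree.ElementTree as ET

# Assumption: end-of-line hyphens are always marked as in the first note above.
EOL_HYPHEN = re.compile(r'\s*<g ref="char:EOLhyphen"\s*(?:/>|>\s*</g>)\s*')

def clean_note(markup):
    """Rejoin words split at a marked line break, then collapse whitespace."""
    joined = EOL_HYPHEN.sub("", markup)                # "be <g/> ſieged" -> "beſieged"
    text = "".join(ET.fromstring(joined).itertext())   # all text inside the note
    return " ".join(text.split())                      # single spaces, no edges

sample = """<note place="margin">
    Aureolus invades Italy, is defeated and be
    <g ref="char:EOLhyphen"></g>
    ſieged at Milan.
</note>"""
print(clean_note(sample))
```

The design choice is to handle the hyphen marker before extracting text, so that ordinary line breaks can safely become single spaces.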

In [39]:
# prepare data for dataframe: one dict per note, numbered from 1
processed_margin_notes = []
for i, margin_note in enumerate(margin_notes, start=1):
    processed_margin_notes.append({
        "number": f"margin_note {i}",
        "text": remove_extra_space(margin_note.text),
    })

# sanity check
processed_margin_notes[0]

{'number': 'margin_note 1',
 'text': 'Aureolus invades Italy, is defeated and beſieged at Milan.'}

In [40]:
# convert to dataframe
df = pd.DataFrame(processed_margin_notes)
df.head()

|   | number | text |
|---|---|---|
| 0 | margin_note 1 | Aureolus invades Italy, is defeated and beſieg... |
| 1 | margin_note 2 | A. D. 268. |
| 2 | margin_note 3 | A. D. 268. March 20. Death of Gallienus. |
| 3 | margin_note 4 | Character and elevation of the emperor Claudius. |
| 4 | margin_note 5 | Death of Aureolus. |


In [42]:
# save dataframe as csv
file_name = "gibbon_margin_notes.csv"
df.to_csv(file_name, index=False)
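Several notes contain commas, so the saved file depends on correct quoting. The sketch below is one way to convince yourself that a round trip preserves the text. It uses the standard library `csv` module and an in-memory buffer in place of pandas and a file on disk, and `rows` simply copies two notes from the output above.

```python
import csv
import io

rows = [
    {"number": "margin_note 1",
     "text": "Aureolus invades Italy, is defeated and beſieged at Milan."},
    {"number": "margin_note 2", "text": "A. D. 268."},
]

buffer = io.StringIO()  # stands in for the CSV file on disk
writer = csv.DictWriter(buffer, fieldnames=["number", "text"])
writer.writeheader()
writer.writerows(rows)  # fields containing commas are quoted automatically

buffer.seek(0)
read_back = [dict(row) for row in csv.DictReader(buffer)]
print(read_back == rows)
```

With pandas, the equivalent check would be to load the file again with `pd.read_csv(file_name)` and compare it with `df`.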