# Parsing a TEI Document

## Brief Introduction to TEI

The Text Encoding Initiative (TEI) is both a standard for encoding texts to be machine actionable and an organization that oversees the TEI standards.

As standard, TEI provides a uniform way for humanities scholars to encode literary and documentary texts in a uniform way allowing them to be machine actionable for display, searching, or processing. TEI is a set of tags that piggy-back on basic XML.

To learn more, please see the following:
* [Text Encoding Initiative Home Page](https://tei-c.org/)
* [What is the TEI from the Women Writers Project](https://www.wwp.northeastern.edu/outreach/seminars/tei.html)
* [TEI By Example Project](https://teibyexample.org/)
* [Introduction to XML](https://www.w3schools.com/xml/xml_whatis.asp)

## Parsing TEI

### Set up

In [4]:
! pip install bs4

Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
  Preparing metadata (setup.py) ... [?25ldone
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... [?25ldone
[?25h  Created wheel for bs4: filename=bs4-0.0.1-py3-none-any.whl size=1256 sha256=0541ad02d4c8b597e49fc88c04cec4d849a464985f5a44ffb44ba74d5bec1304
  Stored in directory: /Users/chloethurmgreene/Library/Caches/pip/wheels/d4/c8/5b/b5be9c20e5e4503d04a6eac8a3cd5c2393505c29f02bea0960
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1


In [5]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [6]:
# load xml file
url = "https://raw.githubusercontent.com/msaxton/nlp-data/main/gibbon.xml"
response = requests.get(url)
xml_str = response.text

### Parse TEI

In [7]:
# use BeautifulSoup to creat an xml object
xml = BeautifulSoup(xml_str)

In [8]:
# find all footnotes
footnotes = xml.find_all('note', attrs={'place': 'bottom'})

In [9]:
# remove extra space
def remove_extra_space(text):
    text = text.replace('\n', '')
    text = text.replace('  ', '')
    return text

In [10]:
# prepare data for dataframe
processed_footnotes = []
i = 1
for footnote in footnotes:
    d = {}
    footnote_num = f'footnote {str(i)}'
    footnote_text = remove_extra_space(footnote.text)
    d["number"] = footnote_num
    d["text"] = footnote_text
    processed_footnotes.append(d)
    i += 1

In [11]:
# convert to datafram
df = pd.DataFrame.from_dict(processed_footnotes)

In [12]:
df.head()

Unnamed: 0,number,text
0,footnote 1,"Pons Aureoli,thirteen miles from Bergamo, and ..."
1,footnote 2,"On the death of Gallienus, ſee Trebellius Poll..."
2,footnote 3,"Some ſuppoſed him, oddly enough, to be a baſta..."
3,footnote 4,"Notoria,a periodical and official diſpatch whi..."
4,footnote 5,Hiſt. Auguſt. p. 208. Gallienus deſcribes the ...


In [13]:
file_name = "gibbon_footnotes.csv"
df.to_csv(file_name, index=False)