<a href="https://colab.research.google.com/github/bartliff/catalystdata/blob/main/CatalystWorkshop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Welcome!**

This workshop will give you the chance to explore how statistical textual analysis can help you to see textual data in a different way to that provided by manual analysis.


### **The data and background**

We will be using the corpus I created as part of my PhD thesis. It includes the text from 21 medieval Welsh law manuscripts from the *Cyfraith Hywel* tradition. You don't need to know medieval Welsh to do this workshop - don't worry! - and, in fact, we will be looking at the text in a way that is agnostic of precise meaning.

For a bit of background, *Cyfraith Hywel* is among the most widely copied texts of the medieval Welsh period, surviving in approximately eighty extant manuscripts and books produced between the 13th and 18th centuries. The text is widely considered to have been a cornerstone of Welsh identity throughout the period of upheaval caused by the Anglo-Norman invasion. It presents a cross-section of Welsh social and legal customs throughout, and arguably prior to, the period of composition.

For this workshop, we will be working with manuscripts produced between c.1225 and 1450. They vary in length and quality and use a range of spelling and grammatical conventions, although, at their core, each contains the same text. It is this that makes the dataset a particularly interesting study in comparative statistical analysis. It is also because of this that the corpus can be considered understudied. The convention in scholarship is to rely on a few seminal texts, namely those that most closely represent each branch of the tradition. This means that the particulars of other manuscripts in the tradition are rarely explored. This data-driven approach offers an opportunity to change that!

In order to smooth out the messiness of the medieval text (it's even worse than modern text on that front!) I encoded the textual data (using an encoding language called XML) with a layer of metadata which includes structural information, a lemma (dictionary) form, as well as some grammatical information. We will use this encoding to analyse the shape of the texts (and sections within) as well as compare the different texts.
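
To give a feel for what this encoding looks like, here is a minimal sketch of a made-up fragment and one way of reading it. The `w` tag, the `lemma` attribute and the `type` attribute are all used later in this workshop; the `div` element name and the particular lemma values are assumptions for illustration only and are not copied from the corpus files.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the element name 'div' and the lemma values are
# illustrative assumptions, not taken from the real files.
sample = """
<text>
  <div type="Introduction">
    <w lemma="Hywel">Heuel</w>
    <w lemma="da">da</w>
    <w lemma="mab">uab</w>
  </div>
</text>
""".strip()

root = ET.fromstring(sample)

rows = []
# Find every element that carries a 'type' attribute (a section),
# then collect each word inside it with its lemma.
for section in root.findall(".//*[@type]"):
    for w in section.iter("w"):
        rows.append((section.get("type"), w.text, w.get("lemma")))

for row in rows:
    print(row)
```

Each row pairs the section a word belongs to with the word as written and its dictionary form, which is the information the later steps rely on.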


# The workshop

The aim of today is to conduct a 'scavenger hunt'. There is code in this document that will provide insights about these texts. There are also challenges that will lead you to use and even adapt this code in order to find the answers to key questions.

We will then discuss at the end of the session what the value of such analyses could be, what you found interesting and what was particularly challenging.

If you have never coded before, please don't let this put you off! The best way to get started is to just give it a go. If you approach this workshop with an open mind, you should come out of it with some new skills (and the knowledge of how to develop them further) and, hopefully, some new ideas about how you might approach your humanities data.


# Step one - Getting orientated

Let's start by looking at our data. In this first block of code, we will get a list of the files that we have in our corpus.

To run the code, click the circular 'play' button that appears when you hover over the box below. The output of the code will appear underneath the code box.

> SIDE NOTE: If you look at the two path definitions (pathXML and pathTXT) you will see that there are two file types included in this corpus:
> *   the XML encoded documents that contain the texts augmented with XML encoding.
> *   the plain text txt files.

> It can be as useful to analyse unstructured data (the txt files) as it is to analyse structured data (the xml) depending on your research questions and chosen methods. The reference and resources section provides pointers should you wish to pursue these avenues further.








In [1]:
# Run this box once to load the packages needed for this workshop

import os
import matplotlib
import xml.etree.ElementTree as ET
from collections import Counter
from xml.dom import minidom

import pandas as pd

from google.colab import drive
drive.mount('/content/drive')

# This is called a function. It is a handy bit of code that can be kept separate
# from the main body and reused as needed. It makes the code shorter and neater.
# https://www.w3schools.com/python/python_functions.asp
def getNodeText(node):
    # Collect the text that sits directly inside an XML element (e.g. a <w> tag)
    result = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            result.append(child.data)
    return ''.join(result)

Mounted at /content/drive
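
If you are curious about what `getNodeText` actually does, the sketch below runs the same logic on a tiny made-up fragment. The nested `<hi>` element is an assumption, included purely to show that the helper only collects text sitting directly inside a tag and skips anything wrapped in a further tag.

```python
from xml.dom import minidom

def getNodeText(node):
    # Same logic as the workshop helper: join only the direct text children.
    return ''.join(child.data for child in node.childNodes
                   if child.nodeType == child.TEXT_NODE)

# Hypothetical fragment; the nested <hi> tag is an illustrative assumption.
doc = minidom.parseString(
    '<text><w>brenhin</w><w>bren<hi>h</hi>in</w></text>'
)

texts = [getNodeText(w) for w in doc.getElementsByTagName("w")]
print(texts)  # ['brenhin', 'brenin']
```

The second word loses its `h` because that letter lives inside the nested tag. Whether that matters depends on how a given file is encoded, so it is worth checking before relying on the counts.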


In [3]:
path="/content/drive/MyDrive/Data"

print("A list of my XML files")
for xml in os.listdir(path):
  print(xml)


A list of my XML files
I.xml
C.xml
U.xml
M.xml
X.xml
Boston.xml
W.xml
V.xml
Y.xml
J.xml
Tr.xml
Mk.xml
T.xml
L.xml
N.xml
O.xml
B.xml
R.xml
D.xml
A.xml
E.xml


Now, let's have a look at the text these files actually contain.

This code opens the file (in the first instance A.xml) and uses a library called ElementTree to pull the data from the XML file and then print it, word by word, as an output.

Give it a scan. What do you think?

In [4]:
file="A.xml"

with open(path+"/"+file, 'rb') as source:
        #ET stands for element tree. it is a library that allows us to read or 'parse' XML files
        tree = ET.parse(source)
        #XML files are constructed with nested tags. this bit of code finds the root, then looks
        #for all tags within this that are named 'w' which stands for word. In essence this code
        #is telling the computer to find all the words in the document and print that word in order
        root = tree.getroot()
        string=''
        for word in root.iter("w"):
            w=word.text
            string=(string+' '+w)
        print(string)

 Heuel da uab kadell teuyhauc kemry oll a uelles e kemry en kamarueru or kefreythyeu ac a deuenus atau uy guyr o pop kemud en y tehuyokaet e pduuar en lleycyon ar deu en scolecyon sef achaus e uennuyt er escleycyon rac gossod or lleycyn dym a uey en erbyn er escrftur lan sef amser e doythant eno e garauuys sef amser achaus e doyant e garauuys eno urth delehu o paup bod en yaun en er amser glan hunnu Ac na guenelhey kam en amser gleyndyt Ac o kyd kaghor a kyd synedycaeth e doython a doytant eno er hen kefreythyeu a esteryasant a rey onadunt a adassant y redec a rey a emendassant ac ereyll en kubyl a dyleassant ac ereyll o neuuyt a hosodassant a guedy honny onadunt e kefreythyeu a uarnassant eu cadu heuel a rodes y audurdaut uthunt ac a orckemenus en kadarn eu kadu en craf a heuel ar doythyon a uuant ykyd ac ef a ossodassant eu hemendyth ar hon kamry holl ar e nep eg kemry a lecrey heb eu kadu e kefreythyeu Ac a dodassant eu hemendyt ar er egnat a kamero dyofryt braut ac ar er argluyt ay

My first impression is that the above output is a lot to read, even if you do know medieval Welsh ☺!

So how about accessing just a part of the text? The Introduction sections are quite short, so much easier to take in. Try the code below and see if that is any easier.

In your group, change the file to look at the text from a different manuscript. Do you spot any similarities? Any differences?

<!-- How about if you filter by 'Law of the Courts' or 'Law of the Country' two of the other major sections within the text of *Cyfraith Hywel* -->

In [5]:
file="A.xml"

with open(path+"/"+file, 'rb') as source:
        tree = ET.parse(source)
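        # unlike the above code, this keeps only the words inside a section whose
        # 'type' attribute matches the value in the line below (here 'Law of the Courts');
        # swap in 'Introduction' to see the introduction instead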
        root = tree.getroot()
        string=""
        for w in root.findall(".//*[@type='Law of the Courts']//w"):
                raw=w.text
                string=(string+' '+raw)
        print(string)

 Or llys e kemyrt decreu ac e gosodes peduuar ar ugeyn o guasanaethguyr en llys Penteulu Brahudur llys Gostechur Trullyat Effeiryat Penguastrahut Penkynyt Drysaur Dysteyn Guastauel Medyt Koc Hebogyt Bart teulu Medyc Kanuyllyt Dysteyn e urenynes Llauoruyn Efeyryat e uerenynes Dryssaur Penguastraut e ureny Koc Guastelauel e urenynes Kanhuyllyt suydhocyon e ryfassam ny huchof dyuethaf ar uuyt ynt Terygueyt en e uuluyne deleant e peduuar suyau ar ugeyn huchof kafael heruuy kefreyth eu bredhenguysc y kan e brenyn ac eu llyeynguytc y can e urenynes e nodolyc a pasc a sulguyn E brenyn a dele roy ir urenynes trayan a kafo o enyll o tyr a dayar ac euelly guasanaytguyr e brenyn a deleant roy trayan y guasanaytuyr e urenynes Gurth e brenyn eu y saraet teyr gueyth o teyr forth e gueneyr saraet yr brenyn un eu pan torrer y naud pan roho naud y dyn ay lad arall eu pan del deu urenyn ar eu kydteruyn o achaus emaruoll ac eghuyt e deu urenyn ar deulu llad o hur yr neyll gur yr llall tredet eu kamarueru
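
If you are not sure which section names you can filter on, one option is to list every `type` value in a file along with the number of words it contains. This is a minimal sketch on a made-up fragment: the `div` element name is an assumption, and if typed sections are nested inside one another a word will be counted once for each section that contains it.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment standing in for one of the corpus files.
sample = """<text>
  <div type="Introduction"><w>Heuel</w><w>da</w></div>
  <div type="Law of the Courts"><w>Or</w><w>llys</w><w>e</w></div>
</text>"""

root = ET.fromstring(sample)

section_sizes = Counter()
# Every element with a 'type' attribute is treated as a section.
for section in root.findall(".//*[@type]"):
    section_sizes[section.get("type")] += sum(1 for _ in section.iter("w"))

for name, size in section_sizes.items():
    print(name, size)
```

To try it on a real file, replace `ET.fromstring(sample)` with `ET.parse(source).getroot()` inside the `with open(...)` block used in the cells above.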

Now, I hear what you're saying: "This is 'just reading'. Couldn't we look at the manuscripts themselves to achieve this same thing?"

Challenge accepted!

Let's do something we CAN'T do (at least not without too much time on our hands) with the physical manuscripts.

How about a wordcount?

In [6]:
file="A.xml"

# This is an empty list. Lists can be a handy way to organise data - https://www.w3schools.com/python/python_lists.asp
words=[]

# Now to our code...
with open(path+"/"+file, 'rb') as source:
    # this library is minidom. it is an alternative to element tree, although each works a little differently.
    # if you wish to take this further, investigate which of these (or indeed any of the other parsers) best
    # suit your purposes.
    doc=minidom.parse(source)
    wordlist=doc.getElementsByTagName("w")
    for word in wordlist:
        y=getNodeText(word)
        words.append(y)
print(len(words))



29738


Useful, but how about word counts for all the files?

What do you notice? Which is the biggest manuscript and which is the smallest? Can you think of any reasons why this discrepancy might exist?

In [7]:
# This 'for' creates a loop to make sure each file in the directory is analysed
# using the code
for xml in os.listdir(path):

  words=[]
  print(xml)
  with open(path+"/"+xml, 'rb') as source:
    doc=minidom.parse(source)
    wordlist=doc.getElementsByTagName("w")
    for word in wordlist:
        y=getNodeText(word)
        words.append(y)
  print(len(words))

I.xml
26085
C.xml
24738
U.xml
21369
M.xml
29901
X.xml
18367
Boston.xml
23168
W.xml
23353
V.xml
18169
Y.xml
35790
J.xml
56065
Tr.xml
21719
Mk.xml
25591
T.xml
24504
L.xml
35094
N.xml
9906
O.xml
22004
B.xml
47599
R.xml
13151
D.xml
54841
A.xml
29738
E.xml
40212
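
Scanning a column of numbers by eye gets tiring quickly. As a rough sketch, the counts printed above can be put into a dictionary and ranked. The variable names here are just for illustration.

```python
# Word counts copied from the output printed above.
word_counts = {
    "I": 26085, "C": 24738, "U": 21369, "M": 29901, "X": 18367,
    "Boston": 23168, "W": 23353, "V": 18169, "Y": 35790, "J": 56065,
    "Tr": 21719, "Mk": 25591, "T": 24504, "L": 35094, "N": 9906,
    "O": 22004, "B": 47599, "R": 13151, "D": 54841, "A": 29738,
    "E": 40212,
}

# Sort the (manuscript, count) pairs from longest to shortest.
ranked = sorted(word_counts.items(), key=lambda item: item[1], reverse=True)

largest, smallest = ranked[0], ranked[-1]
ratio = largest[1] / smallest[1]

for name, count in ranked:
    print(f"{name:>6} {count:>6}")
print(f"The longest text is about {ratio:.1f} times the length of the shortest.")
```

In the notebook itself you could build `word_counts` inside the loop instead of typing the numbers in, by storing `len(words)` under each file name.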


From both the manual reading of the text and from looking at some initial numbers, it is hopefully clear that this tradition is perhaps not as cohesive or simple as commonly portrayed!

Welcome to data-driven or statistical textual analysis!

In part two of the workshop, we will look at some more statistical methods to examine these texts.

# Step two - Delving deeper into the texts

To begin, let us look at the words contained within these manuscripts. Words, or tokens, make up the core of a text, and the language choices made by authors can be very revealing with regard to the time they are writing, their location, their education and background, their style and, perhaps most importantly, their priorities in constructing the text.

Medieval manuscripts were very expensive endeavours. That means that their creation was well considered and well planned. This suggests that the language choices and, more specifically, the way these change over time and across locations, might reveal something about the purpose and use of the *Cyfraith Hywel* tradition.


The below code is very similar to the previous code we ran, although a touch more complex. Instead of giving a list of all the words and counting these, this code pulls out a list of unique words, counts these and then presents them in a table.
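
The counting itself is done by `Counter`, from Python's standard library. As a minimal sketch of that one step, here it is applied to the opening words of manuscript A as printed earlier:

```python
from collections import Counter

# The first few words of A.xml, copied from the output earlier on.
tokens = ("Heuel da uab kadell teuyhauc kemry oll a uelles "
          "e kemry en kamarueru").split()

counts = Counter(tokens)

print(len(tokens), "words in total")
print(len(counts), "unique words")
print(counts.most_common(1))  # [('kemry', 2)]
```

`Counter` behaves like a dictionary whose keys are the unique words and whose values are how often each one occurs, which is why it can be turned into a table so easily.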

In [8]:
# first a list of all the words in the corpus and their frequencies

all_Token_Freq=[]
all_Token_list=[]

for file in os.listdir(path):
    with open(path+'/'+file, 'rb') as source:
        doc=minidom.parse(source)
        wordlist=doc.getElementsByTagName("w")
        for word in wordlist:
            y=getNodeText(word)
            all_Token_Freq.append(y)
All_Token_Counts = Counter(all_Token_Freq)
All_Token_Frequency = dict(All_Token_Counts)
Token_DF=pd.Series(All_Token_Frequency).to_frame().reset_index().rename(columns= {'index':'Token', 0: 'Total'})
#sort the table from most common to least and print the top 50 results
Token_DF=Token_DF.sort_values(by=['Total'],ascending=False)
print(Token_DF.head(50))



         Token  Total
12           y  50916
10           a  37909
2            o  12763
84           e  11152
13          yn  11138
73          ar   9855
26          ac   9854
61           r   8865
67          yr   6543
210         y6   5123
86          un   4871
7      brenhin   3969
47          ef   3906
301        neu   3700
316         ny   3574
245        dyn   3532
228       dyly   3013
128         eu   3005
60           A   2911
335          O   2874
57       hynny   2848
3642        Ac   2684
1546        en   2593
39         hyt   2583
382         na   2385
3488        or   2373
846         ae   2371
156        neb   2313
226        tir   2303
213        gan   2269
465         am   2209
237        tal   2069
315        byd   1902
120      arall   1891
53         pan   1859
345        heb   1852
105   kyfreith   1817
182       llys   1793
348        uyd   1775
52        ida6   1746
614          Y   1679
1951        er   1623
222      geiff   1564
1164       pob   1507
533       

In [9]:
# Then we will organise the data to give us the frequencies of the words in each
# manuscript.

for file in os.listdir(path):
    name=file.split('.')
    filename=name[0]
    with open(path+'/'+file, 'rb') as source:
        doc=minidom.parse(source)
        wordlist=doc.getElementsByTagName("w")
        frequencylist=[]
        for word in wordlist:
            y=getNodeText(word)
            frequencylist.append(y)
        text_Frequency = Counter(frequencylist)
        frequency = dict(text_Frequency)
        temp_DF=pd.Series(frequency).to_frame().reset_index().rename(columns= {'index':'Token', 0: filename})
        Token_DF=pd.merge(Token_DF, temp_DF, on='Token', how='left')
print(Token_DF.head(50))

       Token  Total       I       C       U       M       X  Boston       W  \
0          y  50916  2424.0   962.0  1984.0  2881.0  1812.0  2204.0  2173.0   
1          a  37909  1920.0  1222.0  1125.0  1964.0  1093.0  1386.0  1407.0   
2          o  12763   717.0   413.0   271.0   809.0   411.0   634.0   250.0   
3          e  11152   288.0  1369.0    26.0   360.0   173.0   296.0    34.0   
4         yn  11138   586.0     2.0   494.0   440.0   367.0   383.0   496.0   
5         ar   9855   335.0   543.0   337.0   346.0   359.0   262.0   486.0   
6         ac   9854   494.0   542.0   216.0   352.0   109.0   217.0   205.0   
7          r   8865   696.0     1.0     5.0   855.0   200.0   559.0     1.0   
8         yr   6543   222.0   239.0   356.0   240.0   286.0   197.0   356.0   
9         y6   5123   267.0     NaN   238.0   357.0     2.0   295.0   247.0   
10        un   4871   251.0     8.0   163.0   291.0   142.0   243.0   222.0   
11   brenhin   3969   200.0     NaN    29.0   310.0 
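
Because the manuscripts differ so much in length, raw counts are hard to compare across columns. One possible adjustment, sketched here with figures copied from the tables above, is to express each count per 1,000 words. The variable names are hypothetical, and this is only one of several ways to normalise.

```python
# Raw counts for two tokens in manuscripts I and C, copied from the table above.
raw = {
    "y": {"I": 2424, "C": 962},
    "e": {"I": 288, "C": 1369},
}

# Total word counts for the same manuscripts, copied from the earlier output.
lengths = {"I": 26085, "C": 24738}

# Convert each raw count into a rate per 1,000 words.
per_thousand = {
    token: {ms: round(1000 * n / lengths[ms], 1) for ms, n in by_ms.items()}
    for token, by_ms in raw.items()
}

for token, rates in per_thousand.items():
    print(token, rates)
```

Rates like these make it easier to see when a manuscript favours one spelling over another, regardless of how long it is.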

You might notice that there are a large number of small, one- or two-letter words in that list. Many of these are 'stop words': very common function words, like "the" or "a" in English. These can be removed if we like, as they add little to this kind of analysis.
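
As a rough sketch of how that removal might work, the snippet here filters a handful of counts copied from the first table. The stop list is an assumption made up for this example; it is not an established stop list for medieval Welsh.

```python
from collections import Counter

# A few (token, total) pairs copied from the first frequency table.
totals = Counter({
    "y": 50916, "a": 37909, "o": 12763,
    "brenhin": 3969, "dyn": 3532, "kyfreith": 1817, "llys": 1793,
})

# Illustrative stop list only (an assumption for this sketch).
stop_words = {"y", "a", "o", "e", "yn", "ar", "ac"}

# Keep only the tokens that are not in the stop list.
content = Counter({
    token: n for token, n in totals.items()
    if token.lower() not in stop_words
})

print(content.most_common())
```

With a pandas table such as `Token_DF`, a comparable filter would be `Token_DF[~Token_DF['Token'].str.lower().isin(stop_words)]`.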

More importantly - there are lots of repeated or very similar words! By working with the words as they appear in the manuscript, we have not taken into account spelling variations, errors or morphological (the ways we change words to fit a sentence) variations.

As an example, we can look up the words containing the stem 'bren'. "Brenin" means 'king', so it is a very important term in a medieval context. If you run the below code, you should see just how many ways this word can be written (along with several similar and related words).

In [None]:
test=Token_DF[Token_DF['Token'].str.contains('bren')]

col_one_list = test['Token'].tolist()
print(col_one_list)

['brenhin', 'brenhyn', 'brenyn', 'brenin', 'brenhines', 'bren', 'brennyn', 'brenhina6l', 'ybrenhin', 'nenbren', 'brenn', 'talbren', 'nenbrenn', 'brenhined', 'brenhinaeth', 'brenhinyaeth', 'Nenbren', 'brenihin', 'brenhinn', 'brennin', 'talbrenn', 'brennhin', 'brenhinolaf', 'dylyrbren', 'breninn', 'brenhinha6l', 'brenh', 'brenhinyolaf', 'golchbren', 'brenjn', 'brenhinawl', 'brenu', 'ebren', 'bren6', 'brenhinyn', 'brenhinida6', 'brenhiny', 'brenhina', 'brenhineu', 'brenhinyaul', 'Nenbrenn', 'dylerbren', 'helycbrenn', 'b brenhyn', 'geubren', 'messobren', 'dalbrenn', 'brenhiN', 'breninha6l', 'nenbrenin', 'brent', 'breniaul', 'bbrenhin', 'brenh in', 'brenheines', 'ebrenyn', 'golcbren', 'brenynyn']
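
Note that `str.contains` matches 'bren' anywhere in a word, which is why forms such as 'nenbren' and 'talbren' appear in the list. A minimal sketch of one way to narrow the search, using a few forms copied from the output above, is to keep only the words that begin with 'bren':

```python
# A handful of forms copied from the output above.
forms = ['brenhin', 'brenhyn', 'brenyn', 'nenbren',
         'talbren', 'ybrenhin', 'golchbren', 'brenhines']

# Words that begin with 'bren'.
starts_with = [f for f in forms if f.startswith('bren')]

# Words that contain 'bren' somewhere other than the start.
elsewhere = [f for f in forms if 'bren' in f and not f.startswith('bren')]

print(starts_with)
print(elsewhere)
```

There is a trade-off: the stricter test also drops 'ybrenhin', where two words have been run together. With a pandas table, the comparable test is `.str.startswith('bren')` in place of `.str.contains('bren')`.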


It is for this reason that the XML encoding is particularly important. The use of XML allows us to add a layer of information (metadata) to the text. This information includes the lemma, or dictionary form, of each word.
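
To illustrate the idea, here is a minimal sketch that groups spelling variants under their lemma. The fragment is made up: the spellings and lemmas all appear in this workshop's outputs, but pairing them up in this way is an assumption for the example rather than something copied from the corpus files.

```python
from collections import defaultdict
from xml.dom import minidom

# Hypothetical fragment; the form-to-lemma pairings are illustrative.
sample = (
    '<text>'
    '<w lemma="brenin">brenhin</w>'
    '<w lemma="brenin">brenyn</w>'
    '<w lemma="brenin">brenhyn</w>'
    '<w lemma="cyfraith">kyfreith</w>'
    '<w>llys</w>'
    '</text>'
)

doc = minidom.parseString(sample)

variants = defaultdict(set)
for w in doc.getElementsByTagName("w"):
    # Words without a lemma attribute are skipped.
    if w.hasAttribute("lemma"):
        variants[w.getAttribute("lemma")].add(w.firstChild.data)

for lemma, spellings in variants.items():
    print(lemma, sorted(spellings))
```

Three different spellings collapse into a single entry, 'brenin', which is what makes counts based on lemmas easier to compare across manuscripts.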

Let's repeat the frequency counts, but with these lemma forms.

In [11]:
# first a list of all the lemmas in the corpus and their frequencies

all_Lemma_Freq=[]
all_Lemma_list=[]
for file in os.listdir(path):
    with open(path+'/'+file, 'rb') as source:
        doc=minidom.parse(source)
        wordlist=doc.getElementsByTagName("w")
        for w in wordlist:
            if w.hasAttribute("lemma"):
                y = w.getAttribute("lemma")
                all_Lemma_Freq.append(y)
All_Lemma_Counts = Counter(all_Lemma_Freq)
All_Lemma_Frequency = dict(All_Lemma_Counts)
Lemma_DF=pd.Series(All_Lemma_Frequency).to_frame().reset_index().rename(columns= {'index':'Lemma', 0: 'Total'})
#sort the table from most common to least and print the top 50 results
Lemma_DF=Lemma_DF.sort_values(by=['Total'],ascending=False)
print(Lemma_DF.head(50))



I.xml
C.xml
U.xml
M.xml
X.xml
Boston.xml
W.xml
V.xml
Y.xml
J.xml
Tr.xml
Mk.xml
T.xml
L.xml
N.xml
O.xml
B.xml
R.xml
D.xml
A.xml
E.xml
         Lemma  Total
12           y  81693
10           a  60567
2            o  23318
170        bod  20506
13          yn  16421
64          ar   8720
182       dylu   7402
44          ef   5938
190       talu   5793
7       brenin   5163
73          un   4987
85         tri   4633
42       hwnnw   4613
248         ny   4407
102       cael   4389
196        dyn   4123
15    cyfraith   4077
234        neu   4005
288         na   3825
172        gan   3811
48           i   3641
21         pob   3338
183        tir   3174
106         eu   3133
36         hyd   3090
20         gwr   3059
617    ceiniog   2967
197        dau   2851
129        neb   2712
189     gwerth   2654
343         am   2632
306     pedwar   2606
49         pan   2506
99       arall   2457
140      ugain   2286
94   gwneuthur   2174
266        heb   2144
206     rhodio   2127
463    dy

# Step three - presenting our results


# Keep in touch!

Thank you for coming today, and if you have any comments or questions, I'd love to hear from you:

Zoe.Bartliff@glasgow.ac.uk



---


# Resources and References


**Data sources**

The data for the workshop was originally drawn from either:

Luft, Peter and Smith (2013), "*Rhyddiaith Gymraeg 1300-1425*". (accessed: 15 December 2019). URL: http://www.rhyddiaithganoloesol.caerdydd.ac.uk

or

Nurmio, Kapphahn and Sims-Williams (2010), "*Rhyddiaith y 13eg Ganrif Fersiwn 2.0*". (accessed: 15 December 2019). URL: https://cadair.aber.ac.uk/dspace/handle/2160/5811


The XML encoding was then applied manually by Zoe Bartliff for completion of her thesis:

Bartliff, Zoe Louise (2021) *The application of textual encoding for a data-driven analysis of the medieval Welsh legal tradition, Cyfraith Hywel*. PhD thesis, University of Glasgow.
