<h1> Creating the DataFrame </h1>


This notebook reads the raw txt files from the [Knowledge Project](https://tu-plogan.github.io/source/r_releases.html) and converts them to a pandas DataFrame that will later be used to create a [Knowledge Graph](https://github.com/alexyoung13/frances_dissertation_ay55/blob/main/Notebooks/DataFrame2RDF_7thEdition.ipynb). Not all of the metadata available in the Knowledge Project matches the current [Frances EB Ontology](https://francesnlp.github.io/EB-ontology/doc/index-en.html), so some values must be hardcoded.

The final output of this notebook is the file final_eb_7_dataframe_clean, written as JSON to the [data folder](https://github.com/alexyoung13/frances_dissertation_ay55/tree/main/data).

The notebook also loads in the previous dataframe from the XML-based approach and compares the number of terms created.


In [2]:
import glob
import re
import regex
import pandas as pd
import string

<h5>  Dataframe Columns </h5>

MMSID<br>
editionTitle<br>
editor <br>
editor_date <br>
genre<br>
language<br>
termsOfAddress<br>
numberOfPages <br>
physicalDescription<br>
place<br>
publisher<br>
referencedBy<br>
shelfLocator<br>
editionSubTitle<br>
volumeTitle<br>
year<br>
volumeId<br>
metsXML<br>
permanentURL<br>
publisherPersons<br>
volumeNum<br>
letters<br>
part<br>
editionNum<br>
supplementTitle<br>
supplementSubTitle<br>
supplementsTo<br>
numberOfVolumes<br>
term<br>
definition<br>
relatedTerms<br>
header<br>
startsAt<br>
endsAt<br>
numberOfTerms<br>
numberOfWords<br>
positionPage<br>
typeTerm<br>
altoXML<br>
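Because most of the columns listed above hold the same edition-level value on every row, one way to keep values and column names aligned is to build each row as a dictionary rather than a positional list. The following is only a sketch of that idea: `EDITION_CONSTANTS` and `make_row` are hypothetical names, and only a small subset of the columns is shown.

```python
import pandas as pd

# Hypothetical subset of the hardcoded, edition-level metadata.
EDITION_CONSTANTS = {
    "editor": "Stewart, Dugald",
    "genre": "encyclopedia",
    "language": "eng",
    "place": "Edinburgh",
}


def make_row(term, definition, **parsed):
    """Merge per-term values with the edition-level constants."""
    return {"term": term, "definition": definition, **EDITION_CONSTANTS, **parsed}


rows = [
    make_row("A", "The first letter of the alphabet", volumeNum="2", startsAt="1"),
    make_row("AA", "a river of the province of Groningen", volumeNum="2", startsAt="2"),
]

# Column names come from the dictionary keys, so they cannot drift out of order.
frame = pd.DataFrame(rows)
print(list(frame.columns))
```

With dictionaries, adding or removing a column only touches one place, instead of both the row list and the `columns=` list.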

In [3]:
#the path to the data
data_path = "../data/eb07-v1.2-TXT/"

#---------------------hardcoded columns to match dataframe structure---------------#
parsed_data = []
editor = "Stewart, Dugald"
editor_date = "1753-1828"
genre = "encyclopedia"
language = "eng"
termsOfAddress = "Sir"
physicalDescription = "21 v. in 22 ; 4to."
editionSubTitle = "0.0"
place = "Edinburgh"
publisher = "A. & C. Black"
referencedBy = "0.0"
shelfLocator = "EB.15Z"
publisherPersons = []
header = "Not specified"
volumeTitle = "Encyclopaedia Britannica"
year = "1842"
part = "Not specified"
supplementTitle = "Not specified"
supplementsTo = "Not specified"
numberOfVolumes = "20" 
supplementSubTitle = "Not specified"
numberOfTerms = "Not specified"
#----------------------------------------------------------------------------#
#Special section of hardcoding that must change with each new edition
#This array needs to match the volume number to the volumeId; the first two entries are -1 because
# edition 7 has 2 volumes that are not included in the Knowledge Project
volumeId = [-1, -1, 192984259, 193057500, 193108322, 193696083, 193322690, 193819043, 
            193322688, 193696084, 193469090, 193638940, 192693199, 193108323, 
            193322689, 193819044, 194474782, 193469091, 193469092, 193057501, 193913444, 193819045]
metsXML = [str(volumeId[i])+"-mets.xml" for i in range(len(volumeId))]
permanentURL = ["https://digital.nls.uk/"+str(volumeId[i]) for i in range(len(volumeId))]
MMSID = "9910796273804340"
#edTitle should change with each volume
edTitle = "Seventh edition, General index"
numberOfPages = "184"
#this needs to be checked in the graph creation
relatedTerms = "Not specified"
#----------------------------------------------------------------------------#
# Variables that can be parsed from the data
letters = "" 
volumeNum = "" #parse out from file name
typeTerm = "Topic" #"Topic" or "Article", decided by word count below
startsAt = "Not specified" #parse from file
endsAt = "Not specified" #same as starts at
positionPage = 0 #add 1 for each term in a file

#for finding volume and letter
volumeRegex = '/[a-z][0-9]+/'
#for removing punctuation but leaving hyphens
punctuationHyphen = string.punctuation.replace('-', "")

for file in sorted(glob.glob(data_path + "*/*.txt")):
    #get file name from file path. Will change if data path is different
    altoXML = file.removeprefix("../data/")
    
    #get volume and letter from file path
    volume_letter = file.removeprefix(data_path).split("/")[0]
    volumeNum = volume_letter[1:]
    letters = volume_letter[0]
    

    with open (file, 'r') as currFile:
        text = currFile.read()
        text = text.split("=+")
        
        #get data of location in EB like page edition and volume
        ed_info = re.findall(r"\[\d+:\d+:\d+\]", text[1])[0].replace("[", "").replace("]", "")
        ed_info = ed_info.split(":")
        editionNum = ed_info[0]
        volumeNum = ed_info[1]
        startsAt = ed_info[2]
        endsAt = ed_info[2]
        
        #get each line from the text file and remove empty lines
        lines = text[2].split('\n')
        lines = [x for x in lines if x]
        
        #for finding new terms
        capital_pattern = regex.compile(r"^[\p{Lu}\s\']+(?![a-z])")
        
        term = ""
        definition = ""
        positionPage = 0
        new_term = capital_pattern.match(lines[0])
        #if the first line starts with an upper-case term
        if(new_term):
            #start new term and definition
            term = new_term.group().strip()
            definition += lines[0][len(term):]
        #if term is not capitalized like in eb07-v1.2-TXT/s20/kp-eb0720-010203-0116-v1.txt
        if(term == ""):
            term = lines[0].split(",")[0]
            #if there is no comma separating term. 4 is picked to work around the "see" keyword that can appear
            if(len(term.split(" ")) > 4):
                term = lines[0].split(" ")[0].replace(".", "").replace(",", "")
            definition += lines[0][len(term):]
        #for each line in the file decide if it is a new term or continuing definition
        for line in lines[1:]:
            #Checks if the term is exactly the same or a derivative with a '-' like Baal and Ball-Perith
            #if it is store the current term and definition and start a new one
            if(term == line.split(',')[0].upper() or term == line.split(',')[0].split("-")[0].upper()):
                #calculate the remaining columns
                positionPage += 1
                numberOfWords = len(definition.split())
                if(numberOfWords < 1000):
                    typeTerm = "Article"
                else:
                    typeTerm = "Topic"
                #clean definition
                if definition[:2] == ", " or definition[:2] == ". ":
                    definition = definition[2:]
                parsed_data.append([term, definition, MMSID, edTitle, editor, editor_date, genre, language, termsOfAddress, numberOfPages,
                            physicalDescription, place, publisher, referencedBy, shelfLocator, editionSubTitle,
                            volumeTitle, year, volumeId[int(volumeNum)], metsXML[int(volumeNum)], permanentURL[int(volumeNum)], publisherPersons, volumeNum,
                            letters, part, editionNum, supplementTitle, supplementSubTitle, supplementsTo, numberOfVolumes,
                            relatedTerms, header, startsAt, endsAt, numberOfTerms, numberOfWords, positionPage, typeTerm, altoXML])
                
                #new term with a new, different definition
                #punctuation other than hyphens is stripped so that serialization of the KG does not break
                term = line.split(',')[0].upper()
                term = term.translate(str.maketrans('', '', punctuationHyphen))
                definition = "".join(line.split(",")[1:]).strip()
            else:
                #continue definition of current term bc it is a multiline definition
                definition += line
            
        #add last term in file
        #calculate the remaining columns
        numberOfWords = len(definition.split())
        if(numberOfWords < 1000):
            typeTerm = "Article"
        else:
            typeTerm = "Topic"
        positionPage += 1
        #clean definition
        if definition[:2] == ", " or definition[:2] == ". ":
            definition = definition[2:]
        parsed_data.append([term, definition, MMSID, edTitle, editor, editor_date, genre, language, termsOfAddress, numberOfPages,
                            physicalDescription, place, publisher, referencedBy, shelfLocator, editionSubTitle,
                            volumeTitle, year, volumeId[int(volumeNum)], metsXML[int(volumeNum)], permanentURL[int(volumeNum)], publisherPersons, volumeNum,
                            letters, part, editionNum, supplementTitle, supplementSubTitle, supplementsTo, numberOfVolumes,
                            relatedTerms, header, startsAt, endsAt, numberOfTerms, numberOfWords, positionPage, typeTerm, altoXML])
        
#create dataframe with data and columns once, after all files have been parsed
df = pd.DataFrame(parsed_data, columns=["term", "definition", "MMSID", "edTitle", "editor", "editor_date", "genre", "language", "termsOfAddress", "numberOfPages",
                        "physicalDescription", "place", "publisher", "referencedBy", "shelfLocator", "editionSubTitle",
                        "volumeTitle", "year", "volumeId", "metsXML", "permanentURL", "publisherPersons", "volumeNum",
                        "letters", "part", "editionNum", "supplementTitle", "supplementSubTitle", "supplementsTo", "numberOfVolumes",
                        "relatedTerms", "header", "startsAt", "endsAt", "numberOfTerms", "numberOfWords", "positionPage", "typeTerm", "altoXML"])

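The location marker read inside the loop above has the form `[edition:volume:page]`. A small sketch of that extraction as a standalone function is given here; `parse_location`, `LOCATION_RE` and the sample header string are hypothetical, and the real header text in the Knowledge Project files may contain more than is shown.

```python
import re

# Matches a marker such as "[7:2:1]", read as edition:volume:page.
LOCATION_RE = re.compile(r"\[(\d+):(\d+):(\d+)\]")


def parse_location(header):
    """Return (edition, volume, page) as strings from the first marker found."""
    match = LOCATION_RE.search(header)
    if match is None:
        # Fail loudly instead of raising an IndexError further down.
        raise ValueError(f"no [edition:volume:page] marker in {header!r}")
    edition, volume, page = match.groups()
    return edition, volume, page


sample_header = "kp-eb0702-000101-9822-v1 [7:2:1]"  # made-up header text
print(parse_location(sample_header))
```

Using capture groups avoids the separate bracket-stripping and `split(":")` steps, and the explicit error makes a file without a marker easier to trace.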


In [4]:
df.head(5)


Unnamed: 0,term,definition,MMSID,edTitle,editor,editor_date,genre,language,termsOfAddress,numberOfPages,...,numberOfVolumes,relatedTerms,header,startsAt,endsAt,numberOfTerms,numberOfWords,positionPage,typeTerm,altoXML
0,A,The first letter of the alphabet in every know...,9910796273804340,"Seventh edition, General index","Stewart, Dugald",1753-1828,encyclopedia,eng,Sir,184,...,20,Not specified,Not specified,1,1,Not specified,1440,1,Topic,eb07-v1.2-TXT/a2/kp-eb0702-000101-9822-v1.txt
1,A,as an abbreviation is likewise of frequent occ...,9910796273804340,"Seventh edition, General index","Stewart, Dugald",1753-1828,encyclopedia,eng,Sir,184,...,20,Not specified,Not specified,1,1,Not specified,135,2,Article,eb07-v1.2-TXT/a2/kp-eb0702-000101-9822-v1.txt
2,AA,"a river of the province of Groningen, in the k...",9910796273804340,"Seventh edition, General index","Stewart, Dugald",1753-1828,encyclopedia,eng,Sir,184,...,20,Not specified,Not specified,2,2,Not specified,28,1,Article,eb07-v1.2-TXT/a2/kp-eb0702-000201-9835-v1.txt
3,AA,a river in the province of Overyssel. in the N...,9910796273804340,"Seventh edition, General index","Stewart, Dugald",1753-1828,encyclopedia,eng,Sir,184,...,20,Not specified,Not specified,2,2,Not specified,16,2,Article,eb07-v1.2-TXT/a2/kp-eb0702-000201-9835-v1.txt
4,AA,a river of the province of Antwerp in the Neth...,9910796273804340,"Seventh edition, General index","Stewart, Dugald",1753-1828,encyclopedia,eng,Sir,184,...,20,Not specified,Not specified,2,2,Not specified,17,3,Article,eb07-v1.2-TXT/a2/kp-eb0702-000201-9835-v1.txt


In [5]:
df.loc[0]

term                                                                   A
definition             The first letter of the alphabet in every know...
MMSID                                                   9910796273804340
edTitle                                   Seventh edition, General index
editor                                                   Stewart, Dugald
editor_date                                                    1753-1828
genre                                                       encyclopedia
language                                                             eng
termsOfAddress                                                       Sir
numberOfPages                                                        184
physicalDescription                                   21 v. in 22 ; 4to.
place                                                          Edinburgh
publisher                                                  A. & C. Black
referencedBy                                       

In [6]:
print("Text-based dataframe")
print("Number of files: ", len(sorted(glob.glob(data_path + "*/*.txt"))))
print("Number of terms: ", df.shape[0])
print("Number of topics: ", df[df["typeTerm"] == "Topic"].shape[0])
print("Number of articles: ", df[df["typeTerm"] == "Article"].shape[0])


Text-based dataframe
Number of files:  20983
Number of terms:  23122
Number of topics:  2024
Number of articles:  21098
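The Topic and Article counts above follow from the word-count rule in the parsing cell: entries under 1000 words are labelled Article, the rest Topic. Below is a minimal sketch of that rule on made-up definitions, with `collections.Counter` doing the tally; `WORD_THRESHOLD` and `classify` are illustrative names that do not appear in the notebook.

```python
from collections import Counter

WORD_THRESHOLD = 1000  # same cut-off as the parsing cell


def classify(definition):
    """Label an entry by the number of whitespace-separated words."""
    return "Article" if len(definition.split()) < WORD_THRESHOLD else "Topic"


# Made-up definitions: one short entry and one of 1440 words.
definitions = [
    "a river of the province of Groningen",
    "word " * 1440,
]

counts = Counter(classify(d) for d in definitions)
print("Number of topics: ", counts["Topic"])
print("Number of articles: ", counts["Article"])
```

Keeping the rule in one function means the threshold is changed in a single place rather than in both branches of the loop.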


In [7]:
df.to_json(r'../data/final_eb_7_dataframe_clean', orient="index")
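When a file written with `orient="index"` is read back, pandas may infer numeric types for columns that hold digit-only strings such as `year` or `MMSID`. The sketch below round-trips a toy frame in memory and passes `dtype=False` to `pd.read_json` to switch that inference off; the toy data and variable names are assumptions, and whether the inference matters depends on how the graph-creation notebook consumes the file.

```python
import io

import pandas as pd

# Toy frame with a digit-only string column, as in the notebook's metadata.
frame = pd.DataFrame({"term": ["A", "AA"], "year": ["1842", "1842"]})

# With no path argument, to_json returns the JSON text instead of writing a file.
json_text = frame.to_json(orient="index")

# dtype=False asks pandas not to infer column dtypes while reading.
restored = pd.read_json(io.StringIO(json_text), orient="index", dtype=False)
print(restored.iloc[0]["year"])
```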

In [8]:
#Example of a term not following the established pattern, which necessitated the extra checks
df[df["term"] == "See Bangog"]["definition"]

20222    a small island in the Eastern Seas, near the e...
Name: definition, dtype: object

In [9]:
old_df = pd.read_json('../data/final_eb_7_dataframe', orient="index")

In [10]:
print("XML-based dataframe")
print("Number of terms: ", old_df.shape[0])
print("Number of topics: ", old_df[old_df["typeTerm"] == "Topic"].shape[0])
print("Number of articles: ", old_df[old_df["typeTerm"] == "Article"].shape[0])

XML-based dataframe
Number of terms:  13459
Number of topics:  556
Number of articles:  12903
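Row counts alone do not show which terms differ between the two approaches. As a rough sketch, the distinct headwords of two frames can be compared with set operations. The toy frames below stand in for `df` and `old_df`, and the sketch assumes both have a `term` column with comparable casing, which has not been checked against the XML-based file.

```python
import pandas as pd

# Toy stand-ins for the text-based and XML-based dataframes.
text_df = pd.DataFrame({"term": ["A", "AA", "AA", "AARON"]})
xml_df = pd.DataFrame({"term": ["A", "AARON", "ABACUS"]})

text_terms = set(text_df["term"])
xml_terms = set(xml_df["term"])

only_in_text = sorted(text_terms - xml_terms)  # found only by the text parser
only_in_xml = sorted(xml_terms - text_terms)   # found only by the XML parser
shared = sorted(text_terms & xml_terms)        # found by both

print("Only in text-based:", only_in_text)
print("Only in XML-based:", only_in_xml)
print("Shared:", shared)
```

Sets collapse repeated headwords (several rows can share the term AA), so this compares distinct terms rather than row counts.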
