## Setup

This notebook uses the results of the `parse_resources.ipynb` notebook. That step pulls data from ArchivesSpace, builds a dataframe, and writes it to a CSV file. This notebook starts from that CSV file, but it could be adapted fairly easily to take the dataframe directly as input.
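
As a rough sketch of that adaptation, a small loader could accept either a CSV path or a dataframe that is already in memory. The helper name `load_resources` is hypothetical and not part of the original notebook; it only assumes that the input is either a `pandas.DataFrame` or something `pd.read_csv` can read.

```python
import pandas as pd

def load_resources(source):
    """Return the parsed resources as a dataframe.

    `source` may be a CSV path (or file-like object) or an already-built
    dataframe. Hypothetical helper, not part of the original notebook.
    """
    if isinstance(source, pd.DataFrame):
        # copy so later edits do not touch the caller's dataframe
        return source.copy()
    # otherwise treat the input as something pd.read_csv can read
    return pd.read_csv(source, encoding='utf-8')
```

With a helper like this, the later `pd.read_csv(...)` call could be swapped for `load_resources(...)` without changing the rest of the notebook.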

In [1]:
import pandas as pd
from lxml import etree
import os
import re

# pd.set_option('display.max_rows', None)
pd.set_option('display.max_rows', 10)

_Note:_ the following functions and code are based on work by Ella Li, who created an initial version of this project that parsed EAD data from XML files. The process here is similar but continues to use the data pulled from the ArchivesSpace API, which exports data as JSON rather than XML.

## Provide Terms

In [2]:
# read in the txt file term list
term_list_file = 'terms-nativeAmerican.txt'
# term_list_file = 'terms-philippines.txt'

with open(term_list_file, 'r') as f:
    # skip blank lines: an empty term would match at every word boundary
    terms = [line.strip() for line in f if line.strip()]

print(f'Read term list from {term_list_file} and recorded {len(terms)} terms of interest.')

Read term list from terms-nativeAmerican.txt and recorded 52 terms of interest.
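
The term list feeds directly into a whole-word, case-insensitive regular expression, so stray entries matter. The sketch below is one possible way to tidy a term list before matching; the helper name `clean_terms` and the sample lines are illustrative assumptions, not part of the original notebook. It also shows why an empty term is a problem: the pattern it produces matches any text containing a word.

```python
import re

def clean_terms(lines):
    """Strip whitespace, drop blank entries, and remove case-insensitive duplicates."""
    seen = set()
    cleaned = []
    for line in lines:
        term = line.strip()
        key = term.lower()
        if term and key not in seen:
            seen.add(key)
            cleaned.append(term)
    return cleaned

# An empty term builds the pattern r'\b\b', which matches at any word boundary
empty_pattern = re.compile(r'\b' + re.escape('') + r'\b', re.IGNORECASE)
matches_anything = empty_pattern.search('Any paragraph at all') is not None

# Illustrative input only: a blank line and a case variant of an earlier term
sample_lines = ['Treaty\n', '\n', 'Burial\n', 'treaty\n', '  Burials \n']
cleaned_terms = clean_terms(sample_lines)
```

Because matching ignores case, `Treaty` and `treaty` would otherwise each produce the same rows, so dropping case variants avoids duplicate results.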


## Match Terms

In [3]:
def match_terms(row, terms, columns):
    """Return one result dict per (term, column, paragraph) match in a row."""
    # sentence boundary: whitespace after '.' or '?', skipping common abbreviations
    sentence_split = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s'
    results = []
    for term in terms:
        # whole-word, case-insensitive pattern, compiled once per term
        pattern = re.compile(r'\b' + re.escape(term) + r'\b', re.IGNORECASE)
        for col in columns:
            # skip floats: pandas stores empty cells as NaN, which is a float
            if isinstance(row[col], float):
                continue
            # split the column into paragraphs; str() covers non-string
            # values such as integers and leaves strings unchanged
            for paragraph in str(row[col]).split('\n'):
                if not pattern.search(paragraph):
                    continue
                # split the paragraph into sentences and keep the first one
                # containing the term, falling back to the whole paragraph
                sentences = re.split(sentence_split, paragraph)
                matched_sentence = next((s for s in sentences if pattern.search(s)), paragraph)
                results.append({
                    'Term': term,
                    'Occurrence (ead_ID)': row['ead_id'],
                    'Tag': col,
                    'Collection': row.get('titleproper', None),
                    'Context': matched_sentence  # only the matched sentence
                })
    return results

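def match_and_visualize(df, name):
    # Match every row against the module-level `terms` list
    rows = [result for _, row in df.iterrows() for result in match_terms(row, terms, df.columns)]
    # Name the columns explicitly so an empty result still has a 'Term' column to sort on
    results_df = pd.DataFrame(rows, columns=['Term', 'Occurrence (ead_ID)', 'Tag', 'Collection', 'Context'])

    # Sort results by 'Term'
    sorted_results_df = results_df.sort_values(by='Term', ascending=True)

    # Announce the result set; the returned dataframe is shown as the cell output
    print("Matched results for", name)

    # Export to CSV without the index
    sorted_results_df.to_csv('matched_results-' + name + '.csv', index=False)
    return sorted_results_df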

eads_df = pd.read_csv('results-nativeAmerican.csv', encoding='utf-8')
# eads_df = pd.read_csv('results-philippines.csv', encoding='utf-8')

match_and_visualize(eads_df, 'nativeAmerican')
# match_and_visualize(eads_df, 'philippines')

Matched results for nativeAmerican


| | Term | Occurrence (ead_ID) | Tag | Collection | Context |
|---|---|---|---|---|---|
| 135 | Anishnaabe | umich-bhl-2018025 | scopecontent | the John P. Murphy collection | Researchers should note that the collection in... |
| 710 | Burial | umich-bhl-85831 | bioghist | Joseph Beal Steere Papers | On the Island of Marajo he excavated huge preh... |
| 42 | Burial | umich-bhl-85189 | bioghist | Amos R. Green Papers | He threw himself, too, into the practice of ar... |
| 101 | Burials | umich-bhl-2018025 | bioghist | the John P. Murphy collection | The location is a sacred site for the Anishina... |
| 23 | Burials | umich-bhl-0383 | subjects | George R. Fox (1880-1963) Papers | Archaeology.; Mounds; Archaeology.; Mounds (Bu... |
| ... | ... | ... | ... | ... | ... |
| 540 | Treaty | umich-bhl-89425 | scopecontent | the Elmer E. White court file | the Regents of The University of Michigan, con... |
| 663 | Treaty | umich-bhl-8679 | abstract | Helen Hornbeck Tanner papers, 1930s-2009 | Historian of American Indian history and liter... |
| 733 | Treaty | umich-bhl-8722 | scopecontent | Board of Regents (University of Michigan) records | Of particular note is the Land Grant from the ... |
| 426 | Treaty | umich-bhl-87423 | scopecontent | the American Baptist Missionary Union records | Mills and includes a land grant dated January ... |
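
To see how the whole-word matching and sentence extraction in `match_terms` behave, the following minimal sketch applies the same two regular expressions to a single string. The helper name `find_context` and the sample paragraph are made up for illustration; only the patterns come from the notebook.

```python
import re

# Same sentence-boundary pattern as in match_terms
SENTENCE_SPLIT = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s'

def find_context(term, paragraph):
    """Return the first sentence containing `term` as a whole word, or None."""
    pattern = re.compile(r'\b' + re.escape(term) + r'\b', re.IGNORECASE)
    if not pattern.search(paragraph):
        return None
    sentences = re.split(SENTENCE_SPLIT, paragraph)
    # fall back to the whole paragraph if no single sentence contains the term
    return next((s for s in sentences if pattern.search(s)), paragraph)

# Made-up paragraph for illustration only
sample = 'Dr. Smith kept notes on the treaty. The notes also mention burials. Was a map included?'

treaty_context = find_context('Treaty', sample)    # case-insensitive match
burial_context = find_context('Burial', sample)    # 'burials' is not a whole-word match
burials_context = find_context('burials', sample)
```

Here `treaty_context` is the first sentence (the split does not break after the abbreviation `Dr.`), `burial_context` is `None`, and `burials_context` is the second sentence. This is why `Burial` and `Burials` appear as separate terms in the results: the `\b` word boundaries prevent a term from matching inside a longer word, so singular and plural forms each need their own entry in the term list.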
