## Setup

This notebook builds on the output of the `parse_resources.ipynb` notebook. That step pulls data from ArchivesSpace, builds a dataframe, and writes it to a CSV file. This notebook starts from that CSV file, but it could be changed with little effort to take the dataframe directly as input.
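As a rough illustration of that last point, the loading step could be wrapped in a small helper that accepts either form. This is only a sketch: `load_resources` is a hypothetical name that does not appear in the original notebooks, and the CSV branch assumes the file is UTF-8 encoded, as in the `pd.read_csv` call used later.

```python
import pandas as pd

def load_resources(source):
    """Return the parsed resources as a DataFrame.

    `source` may be a DataFrame already in memory (for example, the one
    built by the parse step) or a path to a CSV export of it.
    This helper is hypothetical and not part of the original notebooks.
    """
    if isinstance(source, pd.DataFrame):
        # Copy so later edits do not modify the caller's dataframe
        return source.copy()
    # Otherwise treat `source` as a path to a CSV file
    return pd.read_csv(source, encoding='utf-8')

# Possible usage (left as comments because the file may not exist):
# eads_df = load_resources('results-fromTextFile.csv')
# eads_df = load_resources(previous_df)  # previous_df: hypothetical in-memory dataframe
```

With a helper like this, the rest of the notebook would not need to know whether the data came from disk or from memory.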

In [17]:
import pandas as pd
from lxml import etree
import os
import re

# pd.set_option('display.max_rows', None)
pd.set_option('display.max_rows', 10)

_Note:_ the following functions and code are based on work by Ella Li, who created an initial version of this project that parsed EAD data from XML files. The process here is similar but continues to use the data pulled from the ArchivesSpace API, which exports data in JSON rather than XML.

## Provide Terms

In [18]:
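# read in the txt file term list, one term per line
term_list_file = 'terms_all.txt'

with open(term_list_file, 'r') as f:
    # skip blank lines: an empty term would match at every word boundary
    terms = [line.strip() for line in f if line.strip()]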

print(f'Read term list from {term_list_file} and recorded {len(terms)} terms of interest.')

Read term list from terms_all.txt and recorded 104 terms of interest.


## Match Terms

In [19]:
def match_terms(row, terms, columns):
    results = []
    for term in terms:
        for col in columns:
            # skip floats, which is how pandas represents empty cells (NaN)
            if not isinstance(row[col], float):
                # split the column into paragraphs
                # str() handles integers and other non-string values
                paragraphs = str(row[col]).split('\n')
                # loop through each paragraph
                for paragraph in paragraphs:
                    # check if the term is in the current paragraph
                    if re.search(r'\b' + re.escape(term) + r'\b', paragraph, re.IGNORECASE):
                        # Split paragraph into sentences
                        sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', paragraph)
                        # Find the sentence containing the term
                        matched_sentence = next((sentence for sentence in sentences if re.search(r'\b' + re.escape(term) + r'\b', sentence, re.IGNORECASE)), paragraph)
                        results.append({
                            'Term': term,
                            'Occurrence (ead_ID)': row['ead_id'],
                            'Tag': col, 
                            'Collection': row.get('titleproper', None),
                            'Context': matched_sentence  # Returning only the matched sentence
                        })
                        
    return results

def match_and_visualize(df, name):
    # Match results
    results_df = pd.DataFrame([result for index, row in df.iterrows() for result in match_terms(row, terms, df.columns)])
    
    # Sort results by 'Term'
    sorted_results_df = results_df.sort_values(by='Term', ascending=True)
    
    # Show matched results
    print("Matched results for ", name)

    # Export to CSV without the index
    sorted_results_df.to_csv('matched_results_' + name + '.csv', index=False)
    return sorted_results_df 

eads_df = pd.read_csv('results-fromTextFile.csv', encoding='utf-8')
# eads_df = pd.read_csv('results-allIDs.csv', encoding='utf-8')

match_and_visualize(eads_df, 'Bentley')

Matched results for  Bentley


| | Term | Occurrence (ead_ID) | Tag | Collection | Context |
|---|---|---|---|---|---|
| 140 | Colonial | umich-bhl-8772 | bioghist | Luce Philippine Project interviews, 1975-1980 | In 1977 the University of Michigan Center for ... |
| 63 | Colonial | umich-bhl-851733 | bioghist | Harry Burns Hutchins papers | Mary Hutchins was a member of many organizatio... |
| 145 | Colonial | umich-bhl-8868 | scopecontent | Blanchard Family Papers, ca. 1835-ca. 2000 | The Blanchard Family Papers will be of value t... |
| 66 | Colonial | umich-bhl-851764 | abstract | George A. Malcolm papers, 1896-1965 | Correspondence, scrapbooks, printed reports, a... |
| 90 | Colonial | umich-bhl-85419 | scopecontent | Owen A. Tomlinson papers, 1899-1920 | Within the Photograph series will be found six... |
| ... | ... | ... | ... | ... | ... |
| 43 | Types | umich-bhl-2014136 | bioghist | University Herbarium (University of Michigan) ... | The U-M Herbarium is also a leader in digitizi... |
| 73 | Types | umich-bhl-85193 | scopecontent | Philip A. Hart Papers | Hart himself and his staff had discarded certa... |
| 50 | Types | umich-bhl-851285 | scopecontent | Thomas Francis Papers | Types of records in these unprocessed subserie... |
| 180 | Types | umich-bhl-9840 | scopecontent | Charles W. Lane papers, 1935-1997 | The researcher will be interested in the varie... |
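
The matching logic in `match_terms` can be checked on a toy string without any ArchivesSpace data. The sketch below reuses the same word-boundary and sentence-splitting patterns. `find_context` and the sample text are made up for illustration and are not part of the notebook.

```python
import re

# Same sentence-splitting pattern as in match_terms
SENTENCE_SPLIT = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s'

def find_context(term, paragraph):
    """Return the first sentence containing `term` as a whole word, or None.

    Hypothetical helper for illustration; it mirrors the inner loop of
    match_terms for a single term and a single paragraph.
    """
    pattern = re.compile(r'\b' + re.escape(term) + r'\b', re.IGNORECASE)
    if not pattern.search(paragraph):
        return None
    sentences = re.split(SENTENCE_SPLIT, paragraph)
    # Fall back to the whole paragraph if no single sentence matches
    return next((s for s in sentences if pattern.search(s)), paragraph)

# Made-up sample text
sample = ("Dr. Smith studied colonial records. "
          "The colonialist view was common. "
          "Types of records vary.")

print(find_context('Colonial', sample))     # Dr. Smith studied colonial records.
print(find_context('colonialist', sample))  # The colonialist view was common.
print(find_context('Type', sample))         # None
```

Two behaviors are worth noticing. The `\b` anchors make matching whole-word only, so `Colonial` does not match inside "colonialist" and `Type` does not match "Types". The sentence splitter does not break after "Dr.", because the lookbehind `(?<![A-Z][a-z]\.)` treats a capital letter, a lowercase letter, and a period as an abbreviation.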
