## Setup

This notebook uses the results from the `parse_resources.ipynb` notebook. That notebook pulls data from ArchivesSpace and builds a dataframe, which it writes to a CSV file. This notebook starts from that CSV file, but it could easily be changed to take the dataframe directly as input.

In [2]:
import pandas as pd
from lxml import etree
import os
import re

# pd.set_option('display.max_rows', None)
pd.set_option('display.max_rows', 10)

_Note:_ the following functions and code are based on work by Ella Li, who created an initial version of this project that parsed EAD data from XML files. The process here is similar but uses the data pulled from the ArchivesSpace API, which exports data in JSON rather than XML.

## Provide Terms

In [None]:
# read in the txt file term list
term_list_file = 'terms_all.txt'

with open(term_list_file, 'r') as f:
    # Skip blank lines: an empty term would match at every word boundary
    terms = [line.strip() for line in f if line.strip()]

print(f'Read term list from {term_list_file} and recorded {len(terms)} terms of interest.')
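
The matching step treats each term as a whole word and ignores case. The sketch below illustrates that behavior on its own. The helper name `term_pattern`, the two sample terms, and the sample sentence are made up for illustration and are not taken from `terms_all.txt` or the finding aids.

```python
import re

def term_pattern(term):
    """Hypothetical helper: whole-word, case-insensitive pattern for one term."""
    return re.compile(r'\b' + re.escape(term) + r'\b', re.IGNORECASE)

# Sample terms standing in for the contents of terms_all.txt (assumed values)
sample_terms = ['Colonial', 'Types']

# Invented sample text, not a quotation from any finding aid
text = 'The report shaped colonial policy, not colonialism as a theory.'

# 'Colonial' matches "colonial policy" but not the longer word "colonialism"
hits = [t for t in sample_terms if term_pattern(t).search(text)]
print(hits)

# An empty term matches at the first word boundary of any text,
# so blank lines in the term file would produce a hit in every paragraph
print(term_pattern('').search(text) is not None)
```

Because `re.escape` is applied to each term, punctuation in a term is matched literally rather than as regex syntax.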

## Match Terms

In [2]:
def match_terms(row, terms, columns):
    """Return one result per paragraph of `row` that contains a term as a whole word."""
    results = []
    for term in terms:
        # Whole-word, case-insensitive pattern, compiled once per term
        term_pattern = re.compile(r'\b' + re.escape(term) + r'\b', re.IGNORECASE)
        for col in columns:
            # Skip empty cells (pandas reads them as float NaN) and other non-text values
            if not isinstance(row[col], str):
                continue
            # Split the column into paragraphs and loop through each one
            for paragraph in row[col].split('\n'):
                # Check if the term is in the current paragraph
                if term_pattern.search(paragraph):
                    # Split paragraph into sentences
                    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', paragraph)
                    # Find the first sentence containing the term; fall back to the whole paragraph
                    matched_sentence = next((s for s in sentences if term_pattern.search(s)), paragraph)
                    results.append({
                        'Term': term,
                        'Occurrence (ead_ID)': row['ead_id'],
                        'Field': col,
                        'Collection': row.get('titleproper', None),
                        'Context': matched_sentence  # Returning only the matched sentence
                    })
    return results

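def match_and_visualize(df, name):
    # Match every row against the global `terms` list, searching all columns
    result_columns = ['Term', 'Occurrence (ead_ID)', 'Field', 'Collection', 'Context']
    results_df = pd.DataFrame(
        [result for _, row in df.iterrows() for result in match_terms(row, terms, df.columns)],
        columns=result_columns,  # keeps the sort below from failing when nothing matches
    )

    # Sort results by 'Term'
    sorted_results_df = results_df.sort_values(by='Term', ascending=True)

    # Announce which dataset the results belong to
    print("Matched results for ", name)

    # Export to CSV without the index
    sorted_results_df.to_csv('matched_results_' + name + '.csv', index=False)
    return sorted_results_df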

eads_df = pd.read_csv('results-fromTextFile.csv', encoding='utf-8')
# eads_df = pd.read_csv('results-allIDs.csv', encoding='utf-8')
eads_df.head()

Matched results for  Bentley


|  | Term | Occurrence (ead_ID) | Field | Collection | Context |
| --- | --- | --- | --- | --- | --- |
| 123 | Benevolent Assimilation | umich-bhl-86354 | bioghist | Finding Aid for Dean C. Worcester Papers | McKinley asked Worcester to join a "civilian c... |
| 124 | Colonial | umich-bhl-86354 | bioghist | Finding Aid for Dean C. Worcester Papers | Worcester's influence on American colonial pol... |
| 152 | Colonial | umich-bhl-8868 | scopecontent | Finding aid for Blanchard Family Papers, ca. 1... | The Blanchard Family Papers will be of value t... |
| 147 | Colonial | umich-bhl-8772 | bioghist | Finding aid for Luce Philippine Project interv... | In 1977 the University of Michigan Center for ... |
| 57 | Colonial | umich-bhl-851733 | bioghist | Finding Aid for Harry Burns Hutchins papers | Mary Hutchins was a member of many organizatio... |
| ... | ... | ... | ... | ... | ... |
| 42 | Types | umich-bhl-851285 | scopecontent | Finding Aid for Thomas Francis Papers | Types of records in these unprocessed subserie... |
| 69 | Types | umich-bhl-85193 | scopecontent | Finding Aid for Philip A. Hart Papers | Hart himself and his staff had discarded certa... |
| 145 | Types | umich-bhl-87265.25 | bioghist | Finding aid for News and Information Services ... | News Service has continued to expand its media... |
| 183 | Types | umich-bhl-9840 | scopecontent | Finding aid for Charles W. Lane papers, 1935-1997 | The researcher will be interested in the varie... |
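
The `Context` column holds only the first sentence in a paragraph that contains the term, not the whole paragraph. The sketch below isolates that step so it can be tried on its own. It reuses the sentence-splitting pattern from `match_terms`, but the helper name `first_matching_sentence` and the sample paragraph are invented for illustration.

```python
import re

# Same sentence-boundary pattern as in match_terms: split on whitespace that
# follows "." or "?", except after patterns like "e.g." or "Dr."
SENTENCE_SPLIT = re.compile(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s')

def first_matching_sentence(paragraph, term):
    """Hypothetical helper: first sentence containing `term`, else the whole paragraph."""
    pattern = re.compile(r'\b' + re.escape(term) + r'\b', re.IGNORECASE)
    for sentence in SENTENCE_SPLIT.split(paragraph):
        if pattern.search(sentence):
            return sentence
    return paragraph

# Invented sample text, not a quotation from any finding aid
paragraph = ('Dr. Smith collected bird specimens. '
             'His reports shaped colonial policy for years. '
             'He retired in 1920.')

context = first_matching_sentence(paragraph, 'Colonial')
print(context)
```

One limitation worth knowing: the pattern protects two-letter abbreviations such as `Dr.`, but a single initial followed by a period (as in `Dean C. Worcester`) is still treated as a sentence end, so some `Context` values may be cut short at an initial.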
