## Setup

This notebook uses the results from the `parse_resources.ipynb` notebook. That notebook pulls data from ArchivesSpace and builds a dataframe, which it writes to a CSV file. This notebook starts from that CSV file, but it could easily be changed to take the dataframe directly as input.
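
As a rough illustration of the CSV hand-off described above, the sketch below round-trips a small dataframe through CSV. The dataframe contents and the names `parsed_df`, `naive_df`, and `indexed_df` are hypothetical and not taken from `parse_resources.ipynb`. It shows two side effects worth knowing about: writing with the default index adds an `Unnamed: 0` column on re-read unless `index_col=0` is passed, and empty cells come back as missing values (float NaN) rather than strings.

```python
import io
import pandas as pd

# Hypothetical stand-in for the dataframe built by the parse step.
parsed_df = pd.DataFrame({
    'ead_id': ['example-001', 'example-002'],
    'abstract': ['First abstract.', None],
})

# Write to an in-memory buffer instead of a real file for this sketch.
buffer = io.StringIO()
parsed_df.to_csv(buffer)  # the default index=True writes the index as an unnamed first column

# Re-read the same CSV text twice: once naively, once treating column 0 as the index.
buffer.seek(0)
naive_df = pd.read_csv(buffer)
buffer.seek(0)
indexed_df = pd.read_csv(buffer, index_col=0)

print(list(naive_df.columns))    # includes 'Unnamed: 0'
print(list(indexed_df.columns))  # only the original columns
print(pd.isna(indexed_df.loc[1, 'abstract']))  # the empty cell is read back as missing
```

Passing the dataframe directly between notebooks would avoid both effects, at the cost of having to re-run the parse step each time.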

In [2]:
import pandas as pd
from lxml import etree
import os
import re

# pd.set_option('display.max_rows', None)
pd.set_option('display.max_rows', 10)

_Note:_ The following functions and code are based on work by Ella Li, who created an initial version of this project that parsed EAD data from XML files. The process here is similar but continues to use the data pulled from the ArchivesSpace API, which exports data in JSON rather than XML.

## Provide Terms

In [3]:
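# Read the term list from a plain-text file, one term per line
term_list_file = 'terms_all.txt'

with open(term_list_file, 'r') as f:
    # Skip blank lines: an empty term would match at every word boundary
    terms = [line.strip() for line in f if line.strip()]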

print(f'Read term list from {term_list_file} and recorded {len(terms)} terms of interest.')

Read term list from terms_all.txt and recorded 104 terms of interest.


## Match Terms

In [4]:
def match_terms(row, terms, columns):
    """Return one result dict per (term, column, paragraph) match in a row."""
    results = []
    for term in terms:
        # Whole-word, case-insensitive pattern for this term
        pattern = re.compile(r'\b' + re.escape(term) + r'\b', re.IGNORECASE)
        for col in columns:
            # Skip non-text cells: missing values are float NaN and ID columns are numeric
            if not isinstance(row[col], str):
                continue
            # Split the column into paragraphs and check each one for the term
            for paragraph in row[col].split('\n'):
                if pattern.search(paragraph):
                    # Split paragraph into sentences
                    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', paragraph)
                    # Use the first sentence containing the term; fall back to the whole paragraph
                    matched_sentence = next((s for s in sentences if pattern.search(s)), paragraph)
                    results.append({
                        'Term': term,
                        'Occurrence (ead_ID)': row['ead_id'],
                        'Field': col,
                        'Collection': row.get('titleproper', None),
                        'Context': matched_sentence  # Returning only the matched sentence
                    })
    return results

def match_and_visualize(df, name):
    # Match every row against the module-level `terms` list
    results_df = pd.DataFrame([result for index, row in df.iterrows() for result in match_terms(row, terms, df.columns)])

    # Nothing to sort or export if no term matched
    if results_df.empty:
        print("No matched results for", name)
        return results_df

    # Sort results by 'Term'
    sorted_results_df = results_df.sort_values(by='Term', ascending=True)

    # Report which dataset was matched
    print("Matched results for", name)

    # Export to CSV without the index
    sorted_results_df.to_csv('matched_results_' + name + '.csv', index=False)
    return sorted_results_df

eads_df = pd.read_csv('results-fromTextFile.csv', encoding='utf-8')
# eads_df = pd.read_csv('results-allIDs.csv', encoding='utf-8')
eads_df.head()

Unnamed: 0,resource_id,ead_id,titleproper,abstract,language,scopecontent,bioghist,subject_ids,subjects,subjects_source,...,geognames_source,persname_ids,persnames,persnames_source,corpname_ids,corpnames,corpnames_source,famname_ids,famnames,famnames_source
0,3011,umich-bhl-00138,the Ralph M. Hodnett papers,,The finding aid is written in English,This collection consists of reminiscences (wri...,Ralph M. Hodnett was an officer in the U.S. Ar...,6593; 11286; 16557,"Soldiers; World War, 1914-1918; Soldiers",lcsh; lcsh; lctgm,...,lcsh,12338; 12338,"Hodnett, Ralph M.; Hodnett, Ralph M.",lcnaf; lcnaf,4856,United States. Army.,lcnaf,496.0,Oram family.,lcnaf
1,267,umich-bhl-0052,Bentley Historical Library publications. 1935-...,The Bentley Historical Library (BHL) houses th...,The finding aid is written in English,The PUBLICATIONS (3.7 linear feet) are divided...,The origins of the Bentley Historical Library ...,,,,...,,,,,3398; 5677; 3398,Bentley Historical Library.; Michigan Historic...,lcnaf; lcnaf; lcnaf,,,
2,996,umich-bhl-0142,the Frank C. Gates papers,Frank C. Gates was a professor of botany at th...,The finding aid is written in <language encodi...,The Frank C. Gates papers are dated from 1871-...,"Frank Caleb Gates was born on September 12, 18...",17930; 7968; 7969; 5631,Bird watching.; Botany; Forests and forestry; ...,lcsh; lcsh; lcsh; aat,...,lcsh; lctgm,4944; 954; 4944,"Gates, Frank C. (Frank Caleb), 1887-1955; Gate...",lcnaf; lcnaf; lcnaf,2470,University of the Philippines.,lcnaf,,,
3,2722,umich-bhl-03171,"Mike Wallace CBS 60 Minutes Papers, 1922-2007","Papers of Mike Wallace (1918-2012), broadcast ...",The finding aid is written in English,"The Mike Wallace CBS/ <title render=""italic"">6...",Mike Wallace was born Myron Leon Wallace on Ma...,10698; 7291; 9089; 10699; 10698; 9089; 10700; ...,60 minutes (Television program); Television br...,lcsh; lcsh; lcsh; lcsh; lcsh; lcsh; lcsh; lcsh...,...,,1503; 1503; 1503; 1503,"Wallace, Mike, 1918-2012; Wallace, Mike, 1918-...",lcnaf; lcnaf; lcnaf; lcnaf,3959; 3959,CBS News.; CBS News.,lcnaf; lcnaf,,,
4,1051,umich-bhl-0336,"Grant Kohn Goodman papers, 1943-1995",Grant K. Goodman was a student at the Universi...,The finding aid is written in English,The Grant K. Goodman collection documents the ...,Grant Kohn Goodman was born in 1924 in Clevela...,6103; 6103; 6103,"World War, 1939-1945; World War, 1939-1945; Wo...",lcsh; lcsh; lcsh,...,lcsh,19550; 19550,"Goodman, Grant Kohn, 1924-2014; Goodman, Grant...",lcnaf; lcnaf,3490; 3490; 3490,United States. Army. Japanese Language School ...,lcnaf; lcnaf; lcnaf,,,
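
The whole-word matching and sentence extraction that `match_terms` applies to each paragraph can be tried on a single string. The following is a minimal sketch, not part of the original notebook: the helper `find_term_sentence` and the sample text are hypothetical, while the two regular expressions are the ones used in `match_terms`.

```python
import re

def find_term_sentence(term, paragraph):
    """Return the first sentence containing `term` as a whole word, or None."""
    # Whole-word, case-insensitive match, as in match_terms
    pattern = re.compile(r'\b' + re.escape(term) + r'\b', re.IGNORECASE)
    if not pattern.search(paragraph):
        return None
    # Same sentence-splitting pattern as match_terms: split on whitespace after
    # '.' or '?', but not after abbreviations such as "U.S." or "Mr."
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', paragraph)
    # Fall back to the whole paragraph if no single sentence contains the term
    return next((s for s in sentences if pattern.search(s)), paragraph)

# Hypothetical sample text, not taken from any finding aid
sample = 'He served in the U.S. Army during the war. Later he worked as a planter in Hawaii.'

print(find_term_sentence('army', sample))    # matched despite the different case
print(find_term_sentence('plant', sample))   # None: 'planter' is not a whole-word match
print(find_term_sentence('hawaii', sample))  # second sentence only
```

Because of the `\b` word boundaries, a term only matches when it appears as a complete word, so short terms do not produce hits inside longer words.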
