# Match Terms and Visualize

This notebook uses the results from the `parse_resources.ipynb` notebook.
The parse resources step pulls data from ArchivesSpace and creates a
dataframe that is written to a CSV file. 
This notebook starts from the CSV file, but it could
relatively easily be changed to take the previous dataframe
as an input. 

## Setup

If continuing or adapting this code, you have likely already imported these libraries.
We will use `pandas` for data processing, `re` for text processing and parsing,
and a few others.

In [16]:
import pandas as pd 
import re

import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import plotly.express as px
import plotly.io as pio

**pd.set_option('display.max_colwidth', None)**: controls the maximum width of each column in a pandas dataframe. Setting it to `None` makes pandas display the full contents of each column, without any truncation. (Older pandas releases accepted `-1` for this purpose, but that value has been deprecated in favor of `None`.)

In [17]:
#pd.set_option('display.max_colwidth', None)

**pd.set_option('display.max_rows', None)**: removes the limit on the number of rows shown when a pandas DataFrame or Series is printed. Setting the option to an integer instead, as in the active line `pd.set_option('display.max_rows', 10)` below, caps the display at that many rows. This is useful for controlling the output length, especially when working with large DataFrames.

In [18]:
# pd.set_option('display.max_rows', None)
pd.set_option('display.max_rows', 10)
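
If a wider display is only needed for a single cell, a temporary setting may be less intrusive than changing the global options. The following is a minimal sketch, assuming a small stand-in dataframe (`demo_df` is hypothetical and not part of the project data); `pd.option_context` restores the earlier values when the block exits.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's data
demo_df = pd.DataFrame({'scopecontent': ['A long note about the collection. ' * 5] * 3})

rows_before = pd.get_option('display.max_rows')

# Inside the block, show all rows and untruncated column text
with pd.option_context('display.max_rows', None, 'display.max_colwidth', None):
    full_text = repr(demo_df)

# Outside the block, the earlier settings apply again
rows_after = pd.get_option('display.max_rows')
print(rows_before == rows_after)
```

This keeps the notebook's default output short while still allowing a full view where it is needed.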

Note: the following functions and code are based on
work by Ella Li, who created an initial version of this project
that parsed EAD data from XML files. 
The process here is similar but
continues to use the data pulled from the ArchivesSpace API,
which exports data in JSON rather than XML. 

## Provide Terms

Terms of interest are supplied in a plain text file, with
each term of interest on its own line. 

In [19]:
# read in the txt file term list
term_list_file = 'terms-nativeAmerican.txt'
# term_list_file = 'terms-philippines.txt'

with open(term_list_file, 'r') as f:
    terms = [line.strip() for line in f]

In [20]:
print(f'Read term list from {term_list_file} and recorded {len(terms)} terms of interest.')

Read term list from terms-nativeAmerican.txt and recorded 94 terms of interest.
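
A blank line in the term file would otherwise become an empty term, and an empty term placed between two `\b` word boundaries can match almost any text. The sketch below shows a slightly more defensive reader, under the assumption that blank lines and case-insensitive duplicates should be dropped. `load_terms` is a hypothetical helper, not part of the original notebook, and the temporary file only stands in for a real term list.

```python
import os
import tempfile

def load_terms(path):
    """Read one term per line, skipping blanks and case-insensitive duplicates."""
    terms, seen = [], set()
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            term = line.strip()
            key = term.lower()
            if term and key not in seen:
                seen.add(key)
                terms.append(term)
    return terms

# Demonstration with a temporary file standing in for a real term list
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False, encoding='utf-8') as tmp:
    tmp.write('Indian\n\nindian\nMounds\n')

demo_terms = load_terms(tmp.name)
os.remove(tmp.name)
print(demo_terms)
```

Dropping case-insensitive duplicates fits the case-insensitive search used later, where `Indian` and `indian` would otherwise be counted as two separate terms with identical matches.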


## Match Terms

The next block defines the `match_terms` function, which analyses the EAD data
to identify occurrences of each term, tally them, and record the source tag of each match.
The CSV data is read into a dataframe in a later cell, before the function is applied.
The function searches for each term in the specified columns of a given row in your DataFrame. It matches whole terms only, not parts of longer words (due to the use of the `\b` word boundary in the regular expression). The search is case-insensitive due to the `re.IGNORECASE` flag.

If a term is found in a paragraph of the column, a dictionary is created with details about the match, including the term, the number of matches in the paragraph, the paragraph text itself, and other related information. These dictionaries are appended to the results list.

In [21]:
# match terms

def match_terms(row, terms, columns):
    results = []
    for term in terms:
        for col in columns:
            if not isinstance(row[col], float):
                # split the column into paragraphs
                # convert to a string first so that integer values can be split as well
                paragraphs = str(row[col]).split('\n')
                # loop through each paragraph
                for paragraph in paragraphs:
                    # check if the term is in the current paragraph
                    if re.search(r'\b' + re.escape(term) + r'\b', paragraph, re.IGNORECASE):
                        results.append({
                            'ead_id': row['ead_id'],
#                            'source_filename': row['source_filename'],
                            'resource_id': row['resource_id'],
                            'titleproper': row.get('titleproper', None), 
                            'Term': term, 
                            'Matched_Times': len(re.findall(r'\b' + re.escape(term) + r'\b', paragraph, re.IGNORECASE)),
                            'Matched_From': col, 
                            'Matched_Paragraph': paragraph
                        })
    return results
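
The whole-word, case-insensitive behavior described above can be checked in isolation. This is a small sketch using the same pattern construction as `match_terms`; the helper name `count_whole_term` and the sample strings are illustrative assumptions.

```python
import re

def count_whole_term(term, text):
    """Count case-insensitive whole-word occurrences of `term` in `text`."""
    pattern = re.compile(r'\b' + re.escape(term) + r'\b', re.IGNORECASE)
    return len(pattern.findall(text))

sample = 'Indians of North America; indian agents; Indiana (State)'

# "Indian" matches only the standalone word, not "Indians" or "Indiana"
print(count_whole_term('Indian', sample))
# "Indians" is counted separately as its own term
print(count_whole_term('Indians', sample))
# re.escape keeps the hyphen literal, so hyphenated terms also work
print(count_whole_term('Half-breeds', 'Half-breeds and half-breeds.'))
```

One caveat: `\b` only marks a boundary next to a letter, digit, or underscore, so a term that begins or ends with punctuation (for example, one ending in a period) may not match where you expect.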

Next, using the match term function, tally and record the results.
These outputs will be the basis of additional reports and visualizations,
including the term in context report, charts, and other visualizations.

In [22]:
def match_and_visualize(df, name):
    # Match results
    results_df = pd.DataFrame([result for index, row in df.iterrows() for result in match_terms(row, terms, df.columns)])
    
    # Show matched results
    print('Matched results for', name)
    display(results_df) # control display

    return results_df  # Return the DataFrame for later use

To use the above functions, create a dataframe (or reuse the data from the `parse_resources.ipynb` notebook), then call the functions. 
The following code block demonstrates how to do this with a results CSV file (here, `results-nativeAmerican.csv`)
as an input.

In [23]:
eads_df = pd.read_csv('results-nativeAmerican.csv', encoding='utf-8')
# eads_df = pd.read_csv('results-philippines.csv', encoding='utf-8')

In [24]:
eads_df.head()

Unnamed: 0,resource_id,ead_id,titleproper,abstract,language,scopecontent,bioghist,subject_ids,subjects,subjects_source,...,geognames_source,persname_ids,persnames,persnames_source,corpname_ids,corpnames,corpnames_source,famname_ids,famnames,famnames_source
0,229,umich-bhl-92745,"Ed Beach photographs, 1931-1948",Ed Beach was an amateur photographer whose pho...,The finding aid is written in English,The Ed Beach collection consists of photograph...,"William E. Beach, known to all as Ed Beach, wa...",3784; 3785; 3663; 3786; 3787; 3788; 3789; 3790...,Airports; Animals; Automobiles; Automobile ser...,lctgm; lctgm; lctgm; lctgm; lctgm; lctgm; lctg...,...,lcsh; lcsh; lcsh; lcsh; lcsh; lcsh; lcsh; lcsh...,8552; 4615,"Beach, William Edward.; Winans, Edwin B., 1826...",lcnaf; lcnaf,,,,298.0,Beach family.,lcnaf
1,8482,No EAD ID,the Francis S. Belton drawing collection.,,The finding aid is written in English,Drawing of Mackinac Island from Round Island.,Officer in the United States Army.,6130; 3941,Indians of North America; Ships,lcsh; lctgm,...,lcsh; lcsh,14333,"Belton, Francis S.",local,,,,,,
2,4460,No EAD ID,the Bentley Historical Library Realia Collection,,The finding aid is written in English,A collection of artifacts and three-dimensiona...,,,,,...,lcsh,,,,8371; 6527,Bentley Historical Library.; University of Mic...,local; local,,,
3,256,umich-bhl-93291,"Fred E. Benz motion picture collection, 1929-1950",Amateur photographer; sixty-two reels of film ...,The finding aid is written in English,When the University of Michigan Media Resource...,"In 1911, Fred Benz, an Ann Arbor, Michigan nat...",3773; 3840; 3841; 3582; 6328; 3842; 6329; 3843...,Animals.; Bullfighting.; Cemeteries.; Children...,lctgm; lctgm; lctgm; lctgm; lcsh; lctgm; lcsh;...,...,lcsh; lcsh; lcsh; lcsh; lcsh; lcsh; lcsh; lcsh...,6041,"Benz, Fred E., -1963",lcnaf,,,,,,
4,8480,No EAD ID,the Black Hawk visual material collection.,,The finding aid is written in English,Photoprint of portrait painting.,,6806,Indians of North America.,lcsh,...,,14702,"Black Hawk, Sauk chief, 1767-1838",lcnaf,,,,,,


In [25]:
eads_df.columns

Index(['resource_id', 'ead_id', 'titleproper', 'abstract', 'language',
       'scopecontent', 'bioghist', 'subject_ids', 'subjects',
       'subjects_source', 'genreform_ids', 'genreforms', 'genreforms_source',
       'geogname_ids', 'geognames', 'geognames_source', 'persname_ids',
       'persnames', 'persnames_source', 'corpname_ids', 'corpnames',
       'corpnames_source', 'famname_ids', 'famnames', 'famnames_source'],
      dtype='object')

In [26]:
# rename eadid to ead_id (has no effect when the column is already named ead_id, as above)
eads_df = eads_df.rename(columns={'eadid':'ead_id'})

## Tally and Record

Use the `match_and_visualize()` function to count the matches.

In [27]:
match_and_visualize(eads_df, 'nativeAmerican')
# match_and_visualize(eads_df, 'philippines')


Matched results for nativeAmerican


Unnamed: 0,ead_id,resource_id,titleproper,Term,Matched_Times,Matched_From,Matched_Paragraph
0,umich-bhl-92745,229,"Ed Beach photographs, 1931-1948",Dwellings,1,subjects,Airports; Animals; Automobiles; Automobile ser...
1,umich-bhl-92745,229,"Ed Beach photographs, 1931-1948",Indian,1,scopecontent,Beach created the photograph albums around br...
2,umich-bhl-92745,229,"Ed Beach photographs, 1931-1948",Indians,1,subjects,Airports; Animals; Automobiles; Automobile ser...
3,umich-bhl-92745,229,"Ed Beach photographs, 1931-1948",Copper,1,geognames,Acme (Mich.); Ada (Mich.); Adrian (Mich.); Alb...
4,umich-bhl-92745,229,"Ed Beach photographs, 1931-1948",Cass,1,geognames,Acme (Mich.); Ada (Mich.); Adrian (Mich.); Alb...
...,...,...,...,...,...,...,...
847,umich-bhl-87257,2634,Scientific Club (University of Michigan) Recor...,Mounds,1,bioghist,The Scientific Club succeeded an earlier orga...
848,umich-bhl-87257,2634,Scientific Club (University of Michigan) Recor...,Indian,1,bioghist,The Scientific Club succeeded an earlier orga...
849,umich-bhl-87290,2710,Vice President for Student Life (University of...,Native,4,bioghist,"In February 2000, the Students of Color Coali..."
850,umich-bhl-2014119,2843,William Edward Wise visual materials collectio...,Indian,2,corpnames,"First Unitarian Church (Ann Arbor, Mich.); Wes..."




The `calculate_term_frequency()` function takes the matched term list,
groups it by term, and tallies the frequency of each term.

In [28]:
def calculate_term_frequency(df, df_name):
    term_frequency = df.groupby('Term')['Matched_Times'].sum().reset_index()
    term_frequency.rename(columns={'Matched_Times': 'Total_Frequency'}, inplace=True)
    term_frequency['DataFrame'] = df_name

    # Sort in descending order
    term_frequency.sort_values(by='Total_Frequency', ascending=False, inplace=True)
    
    # Show frequency table
    print('Term frequency for', df_name)
    display(term_frequency)

    return term_frequency
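
To make the grouping step concrete, here is a toy version of the same tally. The rows in `demo_matches` are invented for illustration and only mimic the shape of the `match_terms` output.

```python
import pandas as pd

# Hypothetical match rows in the shape produced by match_terms()
demo_matches = pd.DataFrame({
    'Term': ['Indian', 'Native', 'Indian', 'Mounds'],
    'Matched_Times': [2, 4, 1, 1],
    'Matched_From': ['bioghist', 'bioghist', 'subjects', 'scopecontent'],
})

# Sum the per-paragraph counts for each term, then sort from most to least frequent
demo_frequency = (
    demo_matches.groupby('Term')['Matched_Times'].sum()
    .reset_index()
    .rename(columns={'Matched_Times': 'Total_Frequency'})
    .sort_values(by='Total_Frequency', ascending=False)
)
print(demo_frequency)
```

Note that the tally sums `Matched_Times` rather than counting rows, so a paragraph that contains a term several times contributes each occurrence.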

## Match from multiple sources

In our originating use case at the University of Michigan,
we were interested to see how the terms were used across different
collections. We were, specifically, working with three on-campus repositories:
the Bentley Historical Library, Clements Library, and the Special Collections Research Center (part of the University Library).

If you are working with multiple groups of finding aids, 
you might adapt the following code, which concatenates multiple
lists (may be provided as dataframes or CSVs) and tallies across them. 

In [29]:
# for each dataframe representing matched terms
# file_list = [(df1_Bentley, 'Bentley'), (df2_Clements, 'Clements'), (df3_SCRC, 'SCRC')]
# matched_results = {name: match_and_visualize(df, name) for df, name in file_list}

# Visualize matches

This should probably be split into its own file, but it is kept here for development purposes.

### Visualization values

Use the below to set global values, such as the names of repositories
under analysis, or fonts and colors for charts. 

In [30]:
# set names and colors here
repo_list = ['Bentley']
#repo_list = ['Bentley','Clements','SCRC']
global_font_info = {'font_family':'Georgia'}
colors = {'Bentley': '#CFC096', 'Clements': '#A5A508', 'SCRC': '#FFCB05'}
project_name = 'RCRC Finding Aids'

# create a manifest to select which result lists you want to visualize
match_groups = [(eads_df, 'Bentley')]
matched_results = {name: match_and_visualize(df, name) for df, name in match_groups}

Matched results for Bentley


Unnamed: 0,ead_id,resource_id,titleproper,Term,Matched_Times,Matched_From,Matched_Paragraph
0,umich-bhl-92745,229,"Ed Beach photographs, 1931-1948",Dwellings,1,subjects,Airports; Animals; Automobiles; Automobile ser...
1,umich-bhl-92745,229,"Ed Beach photographs, 1931-1948",Indian,1,scopecontent,Beach created the photograph albums around br...
2,umich-bhl-92745,229,"Ed Beach photographs, 1931-1948",Indians,1,subjects,Airports; Animals; Automobiles; Automobile ser...
3,umich-bhl-92745,229,"Ed Beach photographs, 1931-1948",Copper,1,geognames,Acme (Mich.); Ada (Mich.); Adrian (Mich.); Alb...
4,umich-bhl-92745,229,"Ed Beach photographs, 1931-1948",Cass,1,geognames,Acme (Mich.); Ada (Mich.); Adrian (Mich.); Alb...
...,...,...,...,...,...,...,...
847,umich-bhl-87257,2634,Scientific Club (University of Michigan) Recor...,Mounds,1,bioghist,The Scientific Club succeeded an earlier orga...
848,umich-bhl-87257,2634,Scientific Club (University of Michigan) Recor...,Indian,1,bioghist,The Scientific Club succeeded an earlier orga...
849,umich-bhl-87290,2710,Vice President for Student Life (University of...,Native,4,bioghist,"In February 2000, the Students of Color Coali..."
850,umich-bhl-2014119,2843,William Edward Wise visual materials collectio...,Indian,2,corpnames,"First Unitarian Church (Ann Arbor, Mich.); Wes..."
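
When several groups are matched, the per-group tables in a dictionary like `matched_results` can be combined and tallied across groups. The following is a sketch under the assumption that each table has `Term` and `Matched_Times` columns; the two small tables in `demo_results` are invented for illustration.

```python
import pandas as pd

# Hypothetical per-repository match tables
demo_results = {
    'Bentley': pd.DataFrame({'Term': ['Indian', 'Native'], 'Matched_Times': [3, 1]}),
    'Clements': pd.DataFrame({'Term': ['Indian', 'Chief'], 'Matched_Times': [2, 5]}),
}

# Stack the tables; the dictionary keys become a 'Repository' column
combined = (
    pd.concat(demo_results, names=['Repository'])
    .reset_index(level='Repository')
    .reset_index(drop=True)
)

# Tally each term per repository, and across all repositories
per_repo = combined.groupby(['Repository', 'Term'])['Matched_Times'].sum()
overall = combined.groupby('Term')['Matched_Times'].sum()
print(per_repo)
print(overall)
```

Keeping the repository name as a column, rather than merging the tables blindly, makes it possible to produce both the per-repository and the overall tallies from one dataframe.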


## Visualize group matches

In the use case at U Michigan, the groups represented finding aids
created by different repositories on campus. This can be useful for
working with subgroups of finding aids, or in comparing usage between
different organizations.

In [31]:
# bar charts for individual groups

def visualize_individual_repository_terms_bar(list_of_dataframes):
    '''
    This function takes a list of strings (list_of_dataframes), which identify the names of dataframes (each corresponding to an archival repository),
    then generates a bar chart to show how many times each requested term appears in the corresponding finding aid documents.
    Requires the calculate_term_frequency function, and reads the global matched_results and repo_list values.
    Processes data using pandas and generates charts using plotly express (px); plotly.io (pio) is used only for the optional image export.
    '''
    for i in range(len(list_of_dataframes)):
        # calculate term frequency
        term_frequency = calculate_term_frequency(matched_results[repo_list[i]], repo_list[i])
        term_frequency.head()

        # Visualization in Multiple Charts
        fig = px.bar(term_frequency, x='Term', y='Total_Frequency', text='Total_Frequency', 
                        color='DataFrame', color_discrete_map=colors, text_auto=True,
                        labels={'Total_Frequency':'Term Occurrence Count','DataFrame':'Repository'})
        fig.update_traces(textposition='outside', insidetextanchor='middle',
                          textfont=dict(family='Arial', size=8))
        fig.update_layout(title_text=f'Term Frequency for { repo_list[i] } ({ project_name })', 
                          xaxis_title_standoff=10, 
                          showlegend=False, font_family=global_font_info['font_family'])
        fig.update_xaxes(tickangle=45, tickfont=dict(family='Arial', color='black', size=10))
        fig.update_yaxes(tickfont=dict(family='Arial', size=10))
        fig.show()
        # pio.write_image(fig, f'term_frequency_byrepo_{ repo_list[i] }.png', width=700)

visualize_individual_repository_terms_bar(repo_list)

Term frequency for Bentley


Unnamed: 0,Term,Total_Frequency,DataFrame
27,Indians,292,Bentley
24,Indian,197,Bentley
32,Native,154,Bentley
4,Cass,42,Bentley
5,Chief,41,Bentley
...,...,...,...
22,Half-breeds,1,Bentley
29,Indigenous Person,1,Bentley
26,Indian removal,1,Bentley
43,Savage,1,Bentley


The above shows a basic bar chart, ordered from most frequent
to least frequent term. If looking at multiple groupings, the above
function will create multiple bar charts. 

The following sections demonstrate additional visualizations for a single grouping
of finding aid data. For the case of multiple groups (or, in our case, repositories),
see the additional visualization notebooks.

## Visualization: occurrence in a horizontal bar chart

In [32]:
# bar charts for individual repos - horizontal

def visualize_individual_repository_terms_bar_horiz(list_of_dataframes):
    '''
    This function takes a list of strings (list_of_dataframes), which identify the names of dataframes (each corresponding to an archival repository),
    then generates a horizontal bar chart to show how many times each requested term appears in the corresponding finding aid documents.
    Requires the calculate_term_frequency function, and reads the global matched_results and repo_list values.
    Processes data using pandas and generates charts using plotly express (px); plotly.io (pio) is used only for the optional image export.
    '''
    for i in range(len(list_of_dataframes)):
        # calculate term frequency
        term_frequency = calculate_term_frequency(matched_results[repo_list[i]], repo_list[i]).sort_values('Total_Frequency', ascending=True)
        term_frequency.head()

        # Visualization in a horizontal chart
        fig = px.bar(term_frequency, y='Term', x='Total_Frequency', text='Total_Frequency',
                        color='DataFrame', color_discrete_map=colors, text_auto=True,
                        labels={'Total_Frequency':'Term Occurrence Count','DataFrame':'Repository'})
        fig.update_traces(textposition='outside', insidetextanchor='middle', 
                          textfont=dict(family='Arial',size=8))
        fig.update_layout(title_text=f'Term Frequency for { repo_list[i] } ({project_name})',
                          xaxis_title_standoff=10, height=600, 
                          showlegend=False, font_family=global_font_info['font_family'])
        fig.update_yaxes(tickfont=dict(family='Arial', size=10))
        fig.update_xaxes(tickangle=0, tickfont=dict(family='Arial', color='black', size=10))
        fig.show()
        # pio.write_image(fig, f'term_frequency_byrepo_{ repo_list[i] }_horizbar.png', scale=4)

visualize_individual_repository_terms_bar_horiz(repo_list)

Term frequency for Bentley


Unnamed: 0,Term,Total_Frequency,DataFrame
27,Indians,292,Bentley
24,Indian,197,Bentley
32,Native,154,Bentley
4,Cass,42,Bentley
5,Chief,41,Bentley
...,...,...,...
22,Half-breeds,1,Bentley
29,Indigenous Person,1,Bentley
26,Indian removal,1,Bentley
43,Savage,1,Bentley


The horizontal bar chart seems better suited to displaying occurrences
of the terms, since the terms are easier to read in this orientation.

## Visualization: Occurrence by EAD tag/element

This section of the script helps identify the EAD (Encoded Archival Description) elements with the highest occurrences of the specified harmful terms.

The function `calculate_element_frequency(df, df_name)` is used to calculate the sum of matched terms for each subsection in a DataFrame. The results are sorted in descending order and returned as a DataFrame with columns `Subsection`, `harmful_terms_frequency`, and `Source`.

This function is applied to each DataFrame in matched_results, and the results are concatenated into a single DataFrame, `all_element_frequencies`.

Finally, a grouped bar chart is created to visualize the frequency of terms across different subsections and sources.

In the chart, the x-axis represents the subsections, the y-axis shows the frequency of harmful terms, and different colors distinguish between sources. The `barmode='group'` setting places the bars side by side for easier comparison between sources.

In [33]:
def calculate_element_frequency(df, df_name):
    element_counts = df.groupby('Matched_From')['Matched_Times'].sum()
    element_counts_sorted = element_counts.sort_values(ascending=False)
    df_element_counts = pd.DataFrame(list(element_counts_sorted.items()), columns=['Subsection', 'harmful_terms_frequency'])
    df_element_counts['Source'] = df_name  # Indicate the source dataframe
    return df_element_counts

# Use the function for each dataframe
element_frequencies_list = [calculate_element_frequency(df, name) for name, df in matched_results.items()]

# Concatenate all element frequencies
all_element_frequencies = pd.concat(element_frequencies_list)

# Show the DataFrame
print("Element frequencies across all file pools:")
display(all_element_frequencies)

# Set colors
colors = {'Bentley': '#CFC096', 'Clements': '#A5A508', 'SCRC': '#FFCB05'}

# Visualization
fig = px.bar(all_element_frequencies, x='Subsection', y='harmful_terms_frequency', color='Source', 
             text='harmful_terms_frequency', barmode='group',
             labels={'Subsection':'EAD Tag','Source':'Repository'},
             color_discrete_map=colors)
fig.update_traces(textposition='outside', textfont=dict(family='Arial',size=8))
fig.update_layout(title_text="Term Occurrence Frequency by EAD Tag", yaxis_title="Term Frequency", height=600,
                  font_family=global_font_info['font_family'])
fig.update_yaxes(tickfont=dict(family='Arial', size=10))
fig.update_xaxes(tickangle=0, tickfont=dict(family='Arial', color='black', size=10))
fig.show()
# pio.write_image(fig, 'element_frequency_across_dfs.png', scale=5)

# Visualization Variation: Stacked bars
fig = px.bar(all_element_frequencies, x='Subsection', y='harmful_terms_frequency', color='Source', 
             text='harmful_terms_frequency', barmode='stack',
             labels={'Subsection':'EAD Tag','Source':'Repository'},
             color_discrete_map=colors)
fig.update_traces(textposition='inside', insidetextanchor='middle', textfont=dict(family='Arial',size=8))
fig.update_layout(title_text="Term Occurrence Frequency by EAD Tag (Stacked Totals)", yaxis_title="Term Frequency", height=600,
                  font_family=global_font_info['font_family'])
fig.update_yaxes(tickfont=dict(family='Arial', size=10))
fig.update_xaxes(tickangle=0, tickfont=dict(family='Arial', color='black', size=10))
fig.show()
# pio.write_image(fig, 'element_frequency_across_dfs_stacked.png', scale=5)

Element frequencies across all file pools:


Unnamed: 0,Subsection,harmful_terms_frequency,Source
0,bioghist,389,Bentley
1,subjects,283,Bentley
2,scopecontent,254,Bentley
3,abstract,94,Bentley
4,corpnames,35,Bentley
5,geognames,32,Bentley
6,persnames,22,Bentley
7,titleproper,19,Bentley


### Visualizing Term Frequencies Across Subsections

This section calculates the term frequencies in different subsections of each data source and visualizes the results using treemap and sunburst diagrams.

The `calculate_subsection_term_frequency(df, df_name)` function is used to calculate the sum of matched terms for each term in each subsection of a DataFrame. The results are returned as a DataFrame with columns `Subsection`, `Term`, `Term_Frequency`, and `Source`. This function is applied to each DataFrame in matched_results, and the results are concatenated into a single DataFrame, `all_subsection_term_frequencies`.

Treemap diagrams are then created to visualize the frequencies of terms in each subsection for every data source. The size of each section in the treemap corresponds to the term frequency in that section. Two types of treemaps are generated: one with the default colors and one with a continuous color scale indicating term frequency. The color scale used in the code (`ylgnbu`) ranges from yellow for lower frequencies, through green, to blue for higher frequencies.

Sunburst diagrams are created with similar color scales. In the first of these, the data sources are represented in the inner circle, subsections in the middle ring, and terms in the outer ring. This hierarchical view allows for easy comparison of term frequencies across different sources and subsections. The resulting visualizations are displayed but not saved. If you wish to save them, you can use `pio.write_image(fig, 'filename.png')`, as in the commented-out lines of the previous sections.

In [34]:
def calculate_subsection_term_frequency(df, df_name):
    subsection_term_frequency = df.groupby(['Matched_From', 'Term'])['Matched_Times'].sum().reset_index()
    subsection_term_frequency.rename(columns={'Matched_From': 'Subsection', 'Matched_Times': 'Term_Frequency'}, inplace=True)
    subsection_term_frequency['Source'] = df_name
    return subsection_term_frequency

# Use the function for each pool
subsection_term_frequencies_list = [calculate_subsection_term_frequency(df, name) for name, df in matched_results.items()]

# Concatenate all subsection term frequencies
all_subsection_term_frequencies = pd.concat(subsection_term_frequencies_list)
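
The two-level grouping can be seen more easily on a toy table. In this sketch the rows of `demo_matches` and the source name `'Demo'` are invented; only the column names follow the notebook.

```python
import pandas as pd

# Hypothetical match rows: the same term can occur in several EAD elements
demo_matches = pd.DataFrame({
    'Matched_From': ['bioghist', 'bioghist', 'subjects', 'bioghist'],
    'Term': ['Indian', 'Native', 'Indian', 'Indian'],
    'Matched_Times': [2, 4, 1, 3],
})

# One row per (element, term) pair, with occurrences summed
demo_subsection = (
    demo_matches.groupby(['Matched_From', 'Term'])['Matched_Times'].sum()
    .reset_index()
    .rename(columns={'Matched_From': 'Subsection', 'Matched_Times': 'Term_Frequency'})
)
demo_subsection['Source'] = 'Demo'
print(demo_subsection)
```

Each resulting row corresponds to one leaf in the hierarchical charts that follow: a term within an EAD element within a source.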

### Treemap visualization

The treemap visualization, initially used in a digital context to
represent the proportional size of files and directories on a hard disk,
provides a useful way to show and explore hierarchy. 

The primary grouping in this treemap is the holding repository. In the case of U-Michigan,
we were interested in how occurrence patterns might differ between repositories. But in other cases, the hierarchy might be used to separate finding aid groups by
topic, time period, media, or any other grouping that might be of interest.

The secondary grouping in this treemap is the EAD tag. It would, instead, be
possible to group by term (and then tag), or in other orders. For our case,
it was most interesting to point out the "biggest" (that is, most prevalent)
tags that included terms of interest. Finally, individual terms are grouped in their own box. 

The first treemap uses individual colors for each top-level parent tag. The second treemap uses a gradient so that color intensity, in addition to box size, helps to indicate term prevalence (more intense and larger means more occurrences, while lighter and smaller means fewer). 

In [35]:
# Treemap
color_map = {'Bentley': 'blue', 'Clements': 'red', 'SCRC': 'green'}
fig = px.treemap(all_subsection_term_frequencies, path=['Source', 'Subsection', 'Term'], values='Term_Frequency')
fig.update_layout(width=1500, height=800)
fig.show()

A color gradient indicates higher term prevalence in darker, or more intense, colors,
while lighter colors indicate lower occurrence.

In [36]:
# Add a dummy color column
all_subsection_term_frequencies['color'] = all_subsection_term_frequencies['Term_Frequency']

# Treemap - color scale
fig = px.treemap(all_subsection_term_frequencies, path=['Source', 'Subsection', 'Term'], values='Term_Frequency',
                 color='color', color_continuous_scale='ylgnbu')
fig.update_layout(width=1500, height=800)
fig.show()

### Sunburst visualization

The sunburst visualization is similarly useful for visualizing components in a hierarchy.
It has the added benefit that the segments within each ring of the circle
give a sense of their proportion of the whole group. So, for example, in this case, one may note that roughly a third of the potentially problematic term occurrences (389 of 1,128) fall in the `bioghist`
note. 

In the second example, the EAD tag is used as the central component, or level of
first division for the hierarchy. Given that this visualization presumes
a knowledge of the EAD tags and terminology, we provisionally described
this as an "archivist's view".

The third example places the potentially problematic terms at the center.
In this example, we suggested this might be a "community view" since it focuses
on the potentially problematic terms, which may be useful starting points
for discussion with community members who are interested in how certain terms
are used by the archives, how frequently they occur, and potentially in which
finding aids they may be most prevalent.

In [37]:
# Sunburst - color scale
fig = px.sunburst(all_subsection_term_frequencies, path=['Source', 'Subsection', 'Term'], values='Term_Frequency', 
                  color='Term_Frequency', color_continuous_scale='ylgnbu')
fig.update_layout(width=1300, height=800, title='Sunburst with holding repository as central component')
fig.show()

In [38]:
# Sunburst - color scale
fig = px.sunburst(all_subsection_term_frequencies, path=['Subsection','Term','Source'], values='Term_Frequency', 
                  color='Term_Frequency', color_continuous_scale='blues')
fig.update_layout(width=1300, height=800, title="Sunburst divided by EAD tag, term, then repository (archivist's view?)")
fig.show()

In [39]:
# Sunburst - color scale
fig = px.sunburst(all_subsection_term_frequencies, path=['Term','Source','Subsection'], values='Term_Frequency', 
                  color='Term_Frequency', color_continuous_scale='blues')
fig.update_layout(width=1300, height=800, title='Sunburst divided by term, then repository, then EAD tag ("community" view?)')
fig.show()

## Visualization: Source Frequencies

This section of the script is designed to create bar plots displaying the frequencies of source controlled vocabularies in the EAD `controlaccess` section.

The `source_columns` list includes the following column names:

* 'subjects_source'
* 'genreforms_source'
* 'geognames_source'
* 'persnames_source'
* 'corpnames_source'
* 'famnames_source'

You can modify this list according to the source columns present in your specific DataFrames.

Note: In the case where a DataFrame doesn't have the specified column, an empty series is created to prevent errors. The sources in the column are separated using `str.split('; ').explode()`, which handles multiple sources in the same cell.

In [40]:
#TODO: this code originally assumed three repositories - the conversion was hacky, no longer works with multiple sets?

def plot_source_frequencies(df, df_name, column):
    if column in df.columns:
        df_counts = df[column].fillna('').str.split('; ').explode().value_counts()
    else:
        df_counts = pd.Series(dtype='int')
        
    # Create a new DataFrame to store these counts
    counts_df = pd.DataFrame({
        df_name: df_counts,
        #df2_name: df2_counts,
        #df3_name: df3_counts
    }).fillna(0)
    print(counts_df.columns)

    fig = px.bar(counts_df, barmode='group',
                 labels={'value':'Term Occurence in Controlled Element'}, 
                 color_discrete_map=colors)
    fig.show()

    # Create a color palette for the matplotlib chart, keyed by repository name (df_name must be one of these keys).
    # It has its own name because assigning to `colors` here would make that name
    # local to this function and break the px.bar call above.
    mpl_colors = {'Bentley': '#2F65A7', 'Clements': '#9A3324', 'SCRC': '#A5A508'}

    # Plot with horizontal x-axis labels (rot=0)
    counts_df.plot(kind='bar', figsize=(6, 4), rot=0, color=mpl_colors)
    tag_label = column.split('_')[0].capitalize()
    plt.title(f'Source Frequencies for {tag_label}')
    plt.ylabel('Frequency')
    plt.xlabel(f'{tag_label} Source')
    plt.legend(title='Repository')
    plt.tight_layout()
    plt.show()

In [41]:
# function to create a dataframe with the counts that can be reshaped for visualization
def control_term_frequency(df, df_name, column):
    if column in df.columns:
        df_counts = df[column].fillna('').str.split('; ').explode().value_counts()
    else:
        df_counts = pd.Series(dtype='int')
        
    # Create a new DataFrame to store these counts
    counts_df = pd.DataFrame({
        df_name: df_counts,
        #df2_name: df2_counts,
        #df3_name: df3_counts
    }).fillna(0)

    return counts_df
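
The commented-out `df2_name` and `df3_name` lines hint at counting several repositories at once. A minimal sketch of one way to do that follows, assuming the DataFrames are passed in as a dictionary keyed by repository label. The helper name `control_term_frequency_multi` and the sample data are hypothetical and not part of the original project.

```python
import pandas as pd

def control_term_frequency_multi(frames, column):
    """Count sources per repository (hypothetical helper).

    frames: dict mapping a repository label to its DataFrame.
    """
    counts = {}
    for name, df in frames.items():
        if column in df.columns:
            counts[name] = df[column].fillna('').str.split('; ').explode().value_counts()
        else:
            # Repository lacks this column: contribute an empty series
            counts[name] = pd.Series(dtype='int')
    # Align on the union of sources; a source absent from a repository gets 0
    return pd.DataFrame(counts).fillna(0).astype(int)

# Hypothetical sample data for two repositories
frames = {
    'Bentley': pd.DataFrame({'subjects_source': ['lcsh; local', 'lcsh']}),
    'Clements': pd.DataFrame({'subjects_source': ['aat', None]}),
}
multi_counts_df = control_term_frequency_multi(frames, 'subjects_source')
print(multi_counts_df)
```

The result has one column per repository, which is the wide shape that `px.bar(..., barmode='group')` plots as grouped bars.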

Test the `control_term_frequency()` function with one EAD controlled tag example: 

In [42]:
source_columns = ['subjects_source', 'genreforms_source', 'geognames_source', 'persnames_source', 'corpnames_source', 'famnames_source']

# calculate subject term counts (source_columns[0] is 'subjects_source')
controlterm_counts_df = control_term_frequency(eads_df, 'Bentley', source_columns[0])

controlterm = source_columns[0]
controlterm_label = controlterm.split('_')[0].capitalize()[:-1]

# Visualize in a bar chart
fig = px.bar(controlterm_counts_df, barmode='group',
             labels={'variable':'Repository'},
             color_discrete_map=colors,
             text='value')
fig.update_layout(title_text=f'Term Count per EAD Controlled {controlterm_label} Tags',
                  yaxis_title='Term Count', xaxis_title=f'Term {controlterm_label} Authority Source', width=600,
                  font_family=global_font_info['font_family'])
fig.update_yaxes(tickfont=dict(family='Arial', size=10))
fig.update_xaxes(tickangle=0, tickfont=dict(family='Arial', color='black', size=10))
fig.update_traces(textposition='outside', textfont=dict(family='Arial',size=8))
fig.show()

# pio.write_image(fig, f'term_frequency_by_{controlterm_label}.png', scale=5)

The above function may be used for any individual controlled EAD tag source.
Since we knew there were multiple controlled tags, however, the following
code blocks visualize the authorities consulted for all of the controlled
tags. Where a bar has a blank label, no authority source was given,
which likely indicates usage of a local file
or a one-off occurrence of a name or term.
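
If the blank label is confusing in a chart, the empty-string entry could be renamed before plotting. This is an optional sketch with hypothetical counts; the label text `(no source given)` is an assumption, not something the original code produces.

```python
import pandas as pd

# Hypothetical counts shaped like the output of control_term_frequency()
counts_df = pd.DataFrame({'Bentley': [12, 5, 3]}, index=['lcsh', '', 'aat'])

# Replace the empty-string index entry with an explicit label for the x-axis
labeled_counts_df = counts_df.rename(index={'': '(no source given)'})
print(labeled_counts_df)
```

The renamed DataFrame can be passed to `px.bar()` in place of the original counts.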

The function for multiple controlled tags is `visualize_controlterms()`. It takes
a pandas dataframe (`df`), a group name (`df_name`) that can indicate the source repository
or a subgroup of finding aid data, a `source_list` that must be a list of
one or more source EAD tag columns, and the `savePNG` argument.
The last argument controls whether the function also
creates and saves a standalone graphic file for each visualization.

In [43]:
# function to visualize one or more source columns
def visualize_controlterms(df, df_name, source_list, savePNG=False):
    '''This function requires the control_term_frequency() function to produce the dataframe with controlled term counts.
    source_list should be a list of column names that designate the controlled vocabularies you want to count and visualize.
    Set savePNG=True to also save each chart as a PNG file.
    '''
    for controlterm in source_list:
        controlterm_counts_df = control_term_frequency(df, df_name, controlterm)

        # Derive a singular label from the column name, e.g. 'subjects_source' -> 'Subject'
        controlterm_label = controlterm.split('_')[0].capitalize()[:-1]
        print(f'Generating visualization for {controlterm_label}')

        # Visualize in a bar chart
        fig = px.bar(controlterm_counts_df, barmode='group',
                    labels={'variable':'Repository'},
                    color_discrete_map=colors,
                    text='value')
        fig.update_layout(title_text=f'Term Count per EAD Controlled {controlterm_label} Tags',
                        xaxis_title=f'Authority Sources for {controlterm_label}', 
                        yaxis=dict(
                            # title=dict(...) replaces the deprecated titlefont argument
                            title=dict(text='Term Count', font=dict(size=12)),
                            tickfont=dict(
                                family='Arial',
                                size=10
                            ),
                            showticklabels=False
                        ),
                        width=600,
                        font_family=global_font_info['font_family'])
        fig.update_yaxes(title_standoff=1)
        fig.update_xaxes(tickangle=0, tickfont=dict(family='Arial', color='black', size=10))
        fig.update_traces(textposition='outside', textfont=dict(family='Arial',size=8))
        fig.show()
        if savePNG:
            # Static image export requires the kaleido package
            pio.write_image(fig, f'term_frequency_by_{controlterm_label}.png', scale=5)

In [None]:
source_columns = ['subjects_source', 'genreforms_source', 'geognames_source', 'persnames_source', 'corpnames_source', 'famnames_source']
for source in source_columns:
    print(source, type(source))

In [None]:
source_columns = ['subjects_source', 'genreforms_source', 'geognames_source', 'persnames_source', 'corpnames_source', 'famnames_source']

#for i in range(len(source_columns)):
#    visualize_controlterms(df=eads_df, df_name='Bentley', source=str(source_columns[i]))

visualize_controlterms(eads_df, 'Bentley', source_columns, savePNG=True)