## What if we worked from the Metadata CSV?

The intriguing part is that there's far more information here. For example, I could filter on completion, pull multiple folders at a time, filter on everything that is not at the metadata level, etc.

The downside: we cannot edit document sets via the metadata upload.

The upside: everything in New Content right now has been transcribed and checked for "Needs Review". We could clear out all of New Content in one go, if we wanted to.

Obviously, bulk uploads would take time, but for the most part the New Content stuff has associated box/folder info. I'm going to investigate exactly what does and doesn't have box/folder info today.
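As a rough illustration of the kind of filtering the metadata CSV allows, here is a minimal sketch on a made-up frame. The column names match the export, but the rows, the chosen folders, and the definition of "complete" are assumptions for the example.

```python
import pandas as pd

# toy rows standing in for the metadata CSV (all values are invented)
toy = pd.DataFrame({
    'FromThePage Title': ['Letter A', 'Letter B', 'Letter C', 'Letter D'],
    '*Total Pages*': [2, 3, 1, 4],
    '*Pages Transcribed*': [2, 1, 1, 4],
    '*Pages Needing Review*': [0, 0, 0, 1],
    'document set': ['b14f3', 'b14f3', 'b15f1', 'b16f2'],
})

# several folders at once
wanted_folders = ['b14f3', 'b15f1']
in_folders = toy['document set'].isin(wanted_folders)

# "complete" here means fully transcribed with nothing flagged for review
complete = (
    (toy['*Total Pages*'] == toy['*Pages Transcribed*'])
    & (toy['*Pages Needing Review*'] == 0)
)

ready = toy[in_folders & complete]
print(ready['FromThePage Title'].tolist())
```

Combining boolean masks with `&` keeps each condition readable and easy to swap out.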

In [2]:
# useful code blocks from the old process

# hiding the API key
import os
import dotenv

# change to the directory where the dotenv file is (unique for each person)
os.chdir("/Users/charl/JBPP")

# load in stuff hidden in the .env file
dotenv.load_dotenv()
JBPP_key = os.getenv('JBPP_key')

In [4]:
# import required packages
import requests
import pandas as pd
import json
import re

# code to create post request
apikey = JBPP_key

root = "http://fromthepage.com/iiif"
endpoint = "/collection/2025-summer-program" # this endpoint is the only thing that needs editing 
# use IIIF slug found at bottom of "export" tab in FTP document set you want to export from
headers = {"Authorization": f"Token token={apikey}"}

In [8]:
# helper function

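def html_export_from_id(ftp_id):
    """Return the HTML export for the FromThePage work with the given ID."""
    new_url = f'https://fromthepage.com/iiif/{ftp_id}/export/html' # url that hosts the html export
    final = requests.get(new_url, timeout=60) # get request on html export url
    final.raise_for_status() # fail loudly instead of parsing an error page
    return final.text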

In [46]:
# let's take a look at this metadata file

# API key already loaded, let's navigate back to our folder

# os.chdir('PDF experimentation/FTP')
print(os.getcwd())
messy_df = pd.read_csv('fromthepage_metadata_20250709.csv')

C:\Users\charl\JBPP\PDF experimentation\FTP


In [48]:
messy_df.columns

Index(['FromThePage Title', '*Collection*', '*Document Sets*',
       '*Uploaded Filename*', '*FromThePage ID*', '*FromThePage Slug*',
       '*FromThePage URL*', 'FromThePage Description', 'Identifier',
       '*Originating Manifest ID*', '*Creation Date*', '*Total Pages*',
       '*Pages Transcribed*', '*Pages Corrected*', '*Pages Indexed*',
       '*Pages Translated*', '*Pages Needing Review*', '*Pages Marked Blank*',
       '*Contributors*', '*Contributors Name*', 'document set',
       '*Description Status*', '*Described By*', 'PJB ID'],
      dtype='object')

In [50]:
drop_cols = [
    '*Collection*', '*Uploaded Filename*', '*FromThePage Slug*',
    'FromThePage Description',
    '*Originating Manifest ID*', '*Creation Date*', 
    '*Pages Corrected*', '*Pages Indexed*',
    '*Pages Translated*',
    '*Contributors*', '*Description Status*', '*Described By*'
]
messy_df.drop(drop_cols, axis = 1, inplace = True)

In [52]:
messy_df

Unnamed: 0,FromThePage Title,*Document Sets*,*FromThePage ID*,*FromThePage URL*,Identifier,*Total Pages*,*Pages Transcribed*,*Pages Needing Review*,*Pages Marked Blank*,*Contributors Name*,document set,PJB ID
0,(Online Participant) Speech from the Alabama ...,X_Current List,11081,https://fromthepage.com/julian-bond-papers-pro...,mss13347_b4_f3,11,11,0,0,| The Julian Bond Papers Project,,
1,(Online Participant) Speech made before the H...,X_Current List,11082,https://fromthepage.com/julian-bond-papers-pro...,mss13347_b4_f2,11,11,0,0,Stephanie Requena | | Heather | abreen17 | Th...,,
2,"(Scholars’ Lab) ""Meet the Press"" NBC interview...",X_OLD 2022 Transferred to Drupal,11084,https://fromthepage.com/julian-bond-papers-pro...,mss13337-b1-f3_sl,13,13,0,0,Jan Pilkington | tsherman | Heather | abreen17...,,
3,"(Scholars’ Lab) Article - ""Activism of the Lat...",X_Current List,11085,https://fromthepage.com/julian-bond-papers-pro...,mss13337_b1_f1_sl,3,3,0,0,Jan Pilkington | | Stephanie Requena | The Ju...,,
4,"(Scholars’ Lab) Article - ""The Story of Lewis ...",X_Current List,11200,https://fromthepage.com/julian-bond-papers-pro...,mss13347_b1_f5_sl,2,2,0,0,Jan Pilkington | | The Julian Bond Papers Pro...,,
...,...,...,...,...,...,...,...,...,...,...,...,...
9047,"From Julian Bond to Julian Price, 2 Jan 1974 [...",Zip Test,32200846,https://fromthepage.com/julian-bond-papers-pro...,,2,2,0,0,,,
9048,Test Partial Box 17 Folder 3,Zip Test,32202359,https://fromthepage.com/julian-bond-papers-pro...,,12,1,0,0,,,
9049,Box 17 Folder 3,2025 Summer Program,32213452,https://fromthepage.com/julian-bond-papers-pro...,,147,104,13,0,Nakole Allen | Amira Dennis | Sophia Melo-Maln...,,
9050,Box 17 Test,Zip Test,32213880,https://fromthepage.com/julian-bond-papers-pro...,,12,0,0,0,,,


In [54]:
messy_df['*Document Sets*'].unique()

array(['X_Current List', 'X_OLD 2022 Transferred to Drupal',
       'X_OLD 2022 Transferred to Drupal|X_Protected URLs',
       'X_Protected URLs', 'X_OLD: 2023-2024 Transferred to Drupal',
       'X_Protected URLs|X_Current List',
       'X_Protected URLs|X_OLD: 2023-2024 Transferred to Drupal',
       'X_OLD 2022 Transferred to Drupal|X_NHPRC Sample',
       'X_OLD 2022 Transferred to Drupal|X_Protected URLs|New Content',
       'X_OLD: 2023-2024 Transferred to Drupal|2024-2025_Charlie: Transfer to Drupal',
       '2024-2025 Transferred to Drupal', 'New Content',
       'Ready to Transfer to Drupal',
       'Via CSV 2024-25 Transferred to Drupal',
       'X_Protected URLs|X_NHPRC Sample|X_OLD: 2023-2024 Transferred to Drupal',
       nan, 'CWP-test-document-set|Via CSV 2024-25 Transferred to Drupal',
       'X_Protected URLs|X_NHPRC Sample|2024-2025 Transferred to Drupal',
       '2024-2025 Transferred to Drupal|Docs Not for Public',
       'X_NHPRC Sample|2024-2025 Transferred to Dr

In [56]:
# let's stick to exclusively New Content

filtered = messy_df[messy_df['*Document Sets*'] ==  'New Content']
filtered

Unnamed: 0,FromThePage Title,*Document Sets*,*FromThePage ID*,*FromThePage URL*,Identifier,*Total Pages*,*Pages Transcribed*,*Pages Needing Review*,*Pages Marked Blank*,*Contributors Name*,document set,PJB ID
793,"Speech concerning affirmative action, 2002",New Content,32033742,https://fromthepage.com/julian-bond-papers-pro...,PJB 526,23,23,0,0,Shelagh Mackey | Markeeta Rosenow | | Christa...,,
2292,Brown v Board of Education,New Content,32099898,https://fromthepage.com/julian-bond-papers-pro...,PJB498 a,10,10,0,0,Debra Haraldson | | Jan Pilkington | Karen J ...,,
2293,Missouri v Jenkins (70-176),New Content,32099900,https://fromthepage.com/julian-bond-papers-pro...,PJB498 b,107,107,0,0,Emily Hemlinger | Debra Haraldson | | Judi |...,,
2294,Missouri v Jenkins 2 (NOTE),New Content,32099901,https://fromthepage.com/julian-bond-papers-pro...,PJB498 c,117,117,0,0,Emily Hemlinger | Shelagh Mackey | | Emily Ni...,,
2296,Carter NAACP,New Content,32099903,https://fromthepage.com/julian-bond-papers-pro...,PJB498 e,14,14,0,0,Markeeta Rosenow |,,
...,...,...,...,...,...,...,...,...,...,...,...,...
9042,"To Julian Bond from Panke Bradley, Mrs. L. Fit...",New Content,32198891,https://fromthepage.com/julian-bond-papers-pro...,PJB 8273,2,2,0,0,Privacylover | Carlos Perez,b17f2,
9043,[Fragment] From Julian Bond to Cary Internatio...,New Content,32198892,https://fromthepage.com/julian-bond-papers-pro...,PJB 8274,1,1,0,0,Richie James Gorman | Carlos Perez,b17f2,
9044,To Julian Bond from Dr. Richard Allen Williams...,New Content,32198893,https://fromthepage.com/julian-bond-papers-pro...,PJB 8275,1,1,0,0,Privacylover | Carlos Perez,b17f2,
9045,From Julian Bond Memo Concerning the Westbrook...,New Content,32198894,https://fromthepage.com/julian-bond-papers-pro...,PJB 8276,1,1,0,0,Richie James Gorman | Carlos Perez,b17f2,
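One caveat with the exact-match filter: `*Document Sets*` is pipe-separated, so a row such as 'X_OLD 2022 Transferred to Drupal|X_Protected URLs|New Content' is left out. Whether those rows should count is a judgment call. The sketch below shows a membership test on invented rows; `in_document_set` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# toy rows; the set names come from the unique() output, the titles are invented
toy = pd.DataFrame({
    'FromThePage Title': ['Doc 1', 'Doc 2', 'Doc 3', 'Doc 4'],
    '*Document Sets*': [
        'New Content',
        'X_OLD 2022 Transferred to Drupal|X_Protected URLs|New Content',
        'X_Current List',
        None,
    ],
})

def in_document_set(value, name):
    """True if `name` is one of the pipe-separated sets in `value`."""
    if pd.isna(value):
        return False
    return name in [part.strip() for part in value.split('|')]

# exact match only catches rows whose sole set is 'New Content'
exact = toy[toy['*Document Sets*'] == 'New Content']

# membership also catches rows that belong to several sets
member = toy[toy['*Document Sets*'].apply(lambda v: in_document_set(v, 'New Content'))]

print(exact['FromThePage Title'].tolist())
print(member['FromThePage Title'].tolist())
```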


In [58]:
# this looks about right
filtered['PJB ID'].unique()
filtered.drop('PJB ID', axis = 1, inplace = True)

In [60]:
# the SettingWithCopyWarning raised by the in-place drop above is not a real issue here, because I no longer care about edits reaching the original dataframe (messy_df)
pd.options.mode.chained_assignment = None
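An alternative to silencing the warning globally is to take an explicit copy when subsetting, so later edits clearly apply to the subset only. A minimal sketch on invented rows (`toy` and `subset` are stand-ins for `messy_df` and `filtered`):

```python
import pandas as pd

# toy stand-in for messy_df (values are invented)
toy = pd.DataFrame({
    '*Document Sets*': ['New Content', 'X_Current List', 'New Content'],
    'Identifier': ['PJB 1', 'PJB 2', 'PJB 3'],
    'PJB ID': [None, None, None],
})

# .copy() makes the subset an independent frame
subset = toy[toy['*Document Sets*'] == 'New Content'].copy()

# reassigning instead of inplace=True keeps the step explicit
subset = subset.drop(columns='PJB ID')

print(list(subset.columns))
```

The original frame keeps its `PJB ID` column; only the copy loses it.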

In [62]:
filtered['document set'].unique()

array([nan, 'New Content', 'b14f3', 'b14f4', 'b14f5', 'b14f6', 'b14f7',
       'b14f8', 'b14f9', 'b15f2', 'b15f1', 'b15f3', 'b15f4', 'b15f5',
       'b15f6', 'b15f7', 'b16f1', 'b16f2', 'b16f3', 'b16f5', 'b16f6',
       'b16f7', 'b16f4', 'b17f1', 'b17f2'], dtype=object)

In [64]:
# so we began this with 'New Content' (presumably b14f2, if I had to guess, and I will)
filtered[filtered['*Total Pages*'] != filtered['*Pages Transcribed*']]

Unnamed: 0,FromThePage Title,*Document Sets*,*FromThePage ID*,*FromThePage URL*,Identifier,*Total Pages*,*Pages Transcribed*,*Pages Needing Review*,*Pages Marked Blank*,*Contributors Name*,document set
2328,"To Julian Bond from Leon Quat, Telegram, 7 Oct...",New Content,32101234,https://fromthepage.com/julian-bond-papers-pro...,PJB 2079,2,1,0,1,| Carlos Perez,
2595,"To Julian Bond from Joann Taggart, 5 Apr 1969",New Content,32108582,https://fromthepage.com/julian-bond-papers-pro...,PJB 2354,5,4,0,1,Shelagh Mackey | Debra Haraldson | Carlos Perez,
8311,"To Julian Bond from Wendell and Ellice Givan, ...",New Content,32196518,https://fromthepage.com/julian-bond-papers-pro...,PJB 7815,6,4,0,2,Privacylover | Emily Hemlinger | | Delanee En...,b16f7
8325,"To Julian Bond from Ralph Gaskins, 1 Dec 1973,...",New Content,32196532,https://fromthepage.com/julian-bond-papers-pro...,PJB 7829,4,2,0,2,Patricia M Capps | Carlos Perez,b16f7
8328,"To Julian Bond from Henrietta Eberheart, 27 De...",New Content,32196535,https://fromthepage.com/julian-bond-papers-pro...,PJB 7832,4,3,0,1,Privacylover | Carlos Perez,b16f7
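In all five rows above, the gap between total and transcribed pages equals the number of pages marked blank. A tighter check, sketched below, treats blank pages as accounted for. The first two rows reuse page counts from the output above; the third is invented to show what a genuinely unfinished document would look like.

```python
import pandas as pd

toy = pd.DataFrame({
    'Identifier': ['PJB 2079', 'PJB 7815', 'toy-incomplete'],
    '*Total Pages*': [2, 6, 5],
    '*Pages Transcribed*': [1, 4, 3],
    '*Pages Marked Blank*': [1, 2, 0],
})

# a page counts as done if it was transcribed or marked blank
accounted = toy['*Pages Transcribed*'] + toy['*Pages Marked Blank*']
unfinished = toy[accounted != toy['*Total Pages*']]

print(unfinished['Identifier'].tolist())
```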


In [66]:
# decent way to make checks.
filtered[filtered['*Pages Needing Review*'] != 0]

Unnamed: 0,FromThePage Title,*Document Sets*,*FromThePage ID*,*FromThePage URL*,Identifier,*Total Pages*,*Pages Transcribed*,*Pages Needing Review*,*Pages Marked Blank*,*Contributors Name*,document set


In [68]:
# so now I know that everything in this filtered subset is ready for export.
# obviously, at the folder level, it is very easy to subset

In [74]:
filtered['document set'].unique()

array([nan, 'New Content', 'b14f3', 'b14f4', 'b14f5', 'b14f6', 'b14f7',
       'b14f8', 'b14f9', 'b15f2', 'b15f1', 'b15f3', 'b15f4', 'b15f5',
       'b15f6', 'b15f7', 'b16f1', 'b16f2', 'b16f3', 'b16f5', 'b16f6',
       'b16f7', 'b16f4', 'b17f1', 'b17f2'], dtype=object)

In [76]:
list(filtered['document set'].unique())

[nan,
 'New Content',
 'b14f3',
 'b14f4',
 'b14f5',
 'b14f6',
 'b14f7',
 'b14f8',
 'b14f9',
 'b15f2',
 'b15f1',
 'b15f3',
 'b15f4',
 'b15f5',
 'b15f6',
 'b15f7',
 'b16f1',
 'b16f2',
 'b16f3',
 'b16f5',
 'b16f6',
 'b16f7',
 'b16f4',
 'b17f1',
 'b17f2']

In [84]:
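# unlabeled rows, minus six index labels dropped by hand (the long PJB 526, PJB498 a/b/c/e, and PJB539 items)
unlabeled = filtered[filtered['document set'].isna()].drop([793, 2292, 2293, 2294, 2296, 2297])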
unlabeled

Unnamed: 0,FromThePage Title,*Document Sets*,*FromThePage ID*,*FromThePage URL*,Identifier,*Total Pages*,*Pages Transcribed*,*Pages Needing Review*,*Pages Marked Blank*,*Contributors Name*,document set
2323,"To Julian Bond from Hugh Gloster, 18 Sept 1968...",New Content,32101228,https://fromthepage.com/julian-bond-papers-pro...,PJB 2074,2,2,0,0,| Grace Janssen | Lucy Blase | Carlos Perez,
2325,"To Julian Bond from Charles Everitt, 2 Oct 1968",New Content,32101230,https://fromthepage.com/julian-bond-papers-pro...,PJB 2076,1,1,0,0,Debra Haraldson | Carlos Perez,
2326,"Memo to Julian Bond from Noe Baldwin, 2 Oct 19...",New Content,32101231,https://fromthepage.com/julian-bond-papers-pro...,PJB 2077,1,1,0,0,Janet Cannon | Carlos Perez,
2327,"To Julian Bond from Jerry Godard, 7 Oct 1968",New Content,32101233,https://fromthepage.com/julian-bond-papers-pro...,PJB 2078,1,1,0,0,Aaron VanHove | Carlos Perez,
2328,"To Julian Bond from Leon Quat, Telegram, 7 Oct...",New Content,32101234,https://fromthepage.com/julian-bond-papers-pro...,PJB 2079,2,1,0,1,| Carlos Perez,
...,...,...,...,...,...,...,...,...,...,...,...
2590,"To Julian Bond from Carl Braden, 31 Dec 1968, ...",New Content,32105732,https://fromthepage.com/julian-bond-papers-pro...,PJB 2349,3,3,0,0,Debra Haraldson | Shelagh Mackey | Carlos Perez,
2591,"To Julian Bond from Paul Anthony, 31 Dec 1968,...",New Content,32105733,https://fromthepage.com/julian-bond-papers-pro...,PJB 2350,1,1,0,0,Debra Haraldson | Carlos Perez,
2592,"To Julian Bond from Charles Vogt, 18 Dec 1968,...",New Content,32105734,https://fromthepage.com/julian-bond-papers-pro...,PJB 2351,1,1,0,0,Debra Haraldson | Carlos Perez,
2593,"To Julian Bond from Miss P. A. Stoney, 5 June ...",New Content,32108569,https://fromthepage.com/julian-bond-papers-pro...,PJB 2352,2,2,0,0,Debra Haraldson | Carlos Perez,


In [102]:
from bs4 import BeautifulSoup

In [126]:
def sort_exports(df):
    """Fetch each document's HTML export and collect its ID, title, and body."""
    contents_list = []

    for _, row in df.iterrows():
        pjb_id = row['Identifier']
        ftp_id = row['*FromThePage ID*']
        title = row['FromThePage Title']

        # contributor names are pipe-separated and may contain empty entries or be missing
        raw = row['*Contributors Name*']
        names = [] if pd.isna(raw) else [n.strip() for n in raw.split('|') if n.strip()]
        contributors = ', '.join(names)

        html = html_export_from_id(ftp_id)

        # keep only the transcribed page bodies from the export
        soup = BeautifulSoup(html, "lxml")
        content_tags = soup.find_all("div", class_="page-content")
        content = [tag.decode_contents() for tag in content_tags]
        content = ' '.join(content) + "<p> Thanks to FromThePage transcription contributors: " + contributors + "</p>"

        contents_list.append({'ID': pjb_id, 'Title': title, 'Document Body': content})

    return contents_list

In [128]:
contents_list = sort_exports(unlabeled)

In [134]:
unlabeled_docs = pd.DataFrame(contents_list)
unlabeled_docs.to_csv('unlabeled_docs_export.csv')

In [136]:
os.makedirs('FTP exports', exist_ok=True) # safe to re-run if the folder already exists

In [150]:
unique_sets = list(filtered['document set'].unique())

In [152]:
unique_sets = [s for s in unique_sets if pd.notna(s)] # to lose the nan, wherever it sits in the list

In [160]:
%%time
for label in unique_sets:
    # export each box/folder batch to its own CSV
    slice_df = filtered[filtered['document set'] == label]
    df_temp = pd.DataFrame(sort_exports(slice_df))
    df_temp.to_csv(f'FTP exports/{label}_export_07092025.csv')

CPU times: total: 43.8 s
Wall time: 16min 46s
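CPU time is under a minute of the nearly 17 minutes of wall time, so the loop is mostly waiting on HTTP responses. One possible speed-up is fetching several exports at once with a small thread pool. This is only a sketch: `fetch_all` and `fake_fetch` are hypothetical names, `fake_fetch` stands in for `html_export_from_id` so the example runs offline, and I have not checked whether FromThePage tolerates concurrent requests, so the worker count is kept low.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(ids, fetch, max_workers=4):
    """Call fetch(id) for every id concurrently and return {id: result}."""
    ids = list(ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(fetch, ids)  # results come back in input order
        return dict(zip(ids, results))

# offline stand-in for html_export_from_id (assumption for the example)
def fake_fetch(ftp_id):
    return f"<div class='page-content'>export for {ftp_id}</div>"

exports = fetch_all([32198891, 32198892], fake_fetch)
print(exports[32198891])
```

In the real loop, the fetched HTML would be looked up by ID instead of requested row by row.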


In [166]:
filtered.sort_values(by = '*Total Pages*', ascending = False).head(30)

Unnamed: 0,FromThePage Title,*Document Sets*,*FromThePage ID*,*FromThePage URL*,Identifier,*Total Pages*,*Pages Transcribed*,*Pages Needing Review*,*Pages Marked Blank*,*Contributors Name*,document set
2297,"Research Material for Speech- ""The Broken Prom...",New Content,32099923,https://fromthepage.com/julian-bond-papers-pro...,PJB539,118,118,0,0,Sarah Ahmad | Princess1 | Matyas Niedermeier ...,
2294,Missouri v Jenkins 2 (NOTE),New Content,32099901,https://fromthepage.com/julian-bond-papers-pro...,PJB498 c,117,117,0,0,Emily Hemlinger | Shelagh Mackey | | Emily Ni...,
2293,Missouri v Jenkins (70-176),New Content,32099900,https://fromthepage.com/julian-bond-papers-pro...,PJB498 b,107,107,0,0,Emily Hemlinger | Debra Haraldson | | Judi |...,
793,"Speech concerning affirmative action, 2002",New Content,32033742,https://fromthepage.com/julian-bond-papers-pro...,PJB 526,23,23,0,0,Shelagh Mackey | Markeeta Rosenow | | Christa...,
2296,Carter NAACP,New Content,32099903,https://fromthepage.com/julian-bond-papers-pro...,PJB498 e,14,14,0,0,Markeeta Rosenow |,
8951,Funding Proposal for Operation New Prichard [S...,New Content,32198800,https://fromthepage.com/julian-bond-papers-pro...,PJB 8182,10,10,0,0,Richie James Gorman | | Privacylover | Prince...,b17f2
2292,Brown v Board of Education,New Content,32099898,https://fromthepage.com/julian-bond-papers-pro...,PJB498 a,10,10,0,0,Debra Haraldson | | Jan Pilkington | Karen J ...,
5206,"Proposal for a Journal of Black Politics, [Jul...",New Content,32124915,https://fromthepage.com/julian-bond-papers-pro...,PJB 4325,10,10,0,0,Carlos Perez | Debra Haraldson | kristen allen...,b15f1
2598,"To Julian Bond from Ardena Shanks, 2 May 1969",New Content,32108739,https://fromthepage.com/julian-bond-papers-pro...,PJB 2357,6,6,0,0,Debra Haraldson | Emily Niepraschk | Carlos Perez,New Content
5256,Speech concerning Black Voters and Officials d...,New Content,32124965,https://fromthepage.com/julian-bond-papers-pro...,PJB 4375,6,6,0,0,Carlos Perez | T. Bradley,b15f1
