### FTP API data grab

This file contains the code for downloading a FromThePage (FTP) transcription export from the FromThePage API.

Still a work in progress

I am using Python's requests library because I have more facility in Python, and Windows PowerShell is a curse that I prefer not to suffer under when I can avoid it. If you are on macOS or Linux/Unix, the documentation on the FTP website explains how to accomplish this from the command line: https://content.fromthepage.com/project-owner-documentation/api-keys/

In [1]:
# hiding the API key
import os
import dotenv

# change to the directory where the dotenv file is (unique for each person)
os.chdir("/Users/charl/JBPP")

# load in stuff hidden in the .env file
dotenv.load_dotenv()
JBPP_key = os.getenv('JBPP_key')
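
`os.getenv` returns `None` when the variable is missing, so a misnamed entry in the `.env` file would only show up later as an `Authorization` header reading `Token token=None`. A minimal sketch of a guard, assuming you would rather fail right away; `require_env` is a hypothetical helper name, not part of `dotenv`:

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or fail loudly."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; check that the .env file was found and loaded"
        )
    return value


# Usage, after dotenv.load_dotenv() has run:
# JBPP_key = require_env("JBPP_key")
```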

In [3]:
# import required packages
import requests
import pandas as pd
import json
import re

# code to create post request
apikey = JBPP_key

root = "http://fromthepage.com/iiif"
endpoint = "/collection/charlie-transfer-to-drupal" # this endpoint is the only thing that needs editing 
# use IIIF slug found at bottom of "export" tab in FTP document set you want to export from
headers = {"Authorization": f"Token token={apikey}"}

In [5]:
# submit post request using requests library (operates same as cURL, just in Python)
response = requests.post(root+endpoint, headers=headers)


In [4]:
# to run if you wanna look at the raw text or check status

# print(response.status_code)
# should be 200
# print(response.text)

In [7]:
# convert to dataframe using json_normalize
# record_path=['manifests'] is to ignore metadata associated with the API call that's returned in the response
# but is not connected to the actual doc set content
response_df = pd.json_normalize(json.loads(response.text), record_path=['manifests'])
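
To make the `record_path=['manifests']` step concrete, here is a small sketch using plain `json` on a made-up response. The sample keys and URLs are assumptions that only imitate the shape implied by the code in this notebook (a top-level `manifests` list whose entries carry an `@id`); the real response has more fields.

```python
import json

# Hypothetical, heavily trimmed stand-in for the collection response.
sample_text = json.dumps({
    "@id": "https://fromthepage.com/iiif/collection/example-slug",
    "label": "Example collection",
    "manifests": [
        {"@id": "https://fromthepage.com/iiif/111/manifest", "label": "Letter A"},
        {"@id": "https://fromthepage.com/iiif/222/manifest", "label": "Letter B"},
    ],
})

payload = json.loads(sample_text)
# Top-level keys such as "@id" and "label" describe the collection itself;
# record_path=['manifests'] tells json_normalize to build rows from this list only.
records = payload["manifests"]
for record in records:
    print(record["@id"], record["label"])
```

Each entry in `records` corresponds to one row of `response_df`.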

In [11]:
# if interested in taking a look at the dataframe:

# response_df.head()
# I'm curious if anything else will become metadata because there's more metadata in the bulk uploaded documents
# If it breaks later, refer to that column to see how to fix it

In [7]:
# don't run this cell (you can, you just don't need to)

#bunch_of_tuples = [] # this is where we'll store all the key (PJB ID) - value (Document Body) pairs to convert to a dataframe

#for i in range(len(response_df)): # iterates over each row of the dataframe - 
#    # there are other ways to do this but it's not prohibitively inefficient
#    url = response_df['@id'][i] # indexes into the value in the first column of the dataframe (the IIIF url)
#    cut = url.split('/')[4] # slices out the unique work_id - to be used to locate plaintext export
#    try:
#        pjb_id = response_df['metadata'][i][0]['value'] # tries to make key based on identifier aka PJB ID
#    except TypeError:
#       pjb_id = cut # if it fails, it instead makes key on the basis of the work_id (guaranteed to be unique)
#    new_url = f'https://fromthepage.com/iiif/{cut}/export/plaintext/verbatim' # url that hosts the plaintext export
#    final = requests.get(new_url) # get request on plaintext export url
#    bunch_of_tuples.append((pjb_id, final.text)) # appends the (PJB ID, Document Body) tuple to the list
    

In [9]:
# testing for XHTML compatibility (I think it might be better than plaintext)
bunch_of_tuples = [] # this is where we'll store all the variable pairs (PJB ID, Document Body) to convert to a dataframe

for i in range(len(response_df)): # iterates over each row of the dataframe - 
    # there are other ways to do this but it's not prohibitively inefficient
    url = response_df['@id'][i] # indexes into the value in the first column of the dataframe (the IIIF url)
    cut = url.split('/')[4] # slices out the unique work_id - to be used to locate html export
    try:
        pjb_id = response_df['metadata'][i][0]['value'] # tries to make key based on identifier aka PJB ID
    except TypeError:
        pjb_id = cut # if it fails, it instead makes key on the basis of the work_id (guaranteed to be unique)
    new_url = f'https://fromthepage.com/iiif/{cut}/export/html' # url that hosts the html export
    final = requests.get(new_url) # get request on html export url
    html = final.text
    title_position = html.find('<title>')
    desired_content = html[title_position:] 
    bunch_of_tuples.append((pjb_id, desired_content))# appends to list of tuples
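
The three string-handling steps in the loop can be pulled out into small functions that are easy to check without hitting the API. This is only a sketch: the function names are hypothetical, and the extra exceptions caught in the fallback are my assumption about what else could go wrong, not something the export has been seen to do. One thing it does guard against for certain: `str.find` returns `-1` when `<title>` is absent, and `html[-1:]` would then keep only the last character of the page.

```python
def work_id_from_url(url: str) -> str:
    """Return the fifth '/'-separated piece of the URL, as url.split('/')[4] does."""
    parts = url.split("/")
    if len(parts) <= 4 or not parts[4]:
        raise ValueError(f"no work_id found in {url!r}")
    return parts[4]


def identifier_or_fallback(metadata, fallback: str) -> str:
    """Use the first metadata value when there is one, else the fallback."""
    try:
        return metadata[0]["value"]
    except (TypeError, IndexError, KeyError):
        # TypeError: metadata is missing (NaN/None); IndexError: empty list;
        # KeyError: the first entry has no "value" field.
        return fallback


def trim_to_title(html: str) -> str:
    """Drop everything before <title>; leave the text alone if there is no <title>."""
    position = html.find("<title>")
    return html if position == -1 else html[position:]
```

Inside the loop these would be called as `cut = work_id_from_url(url)`, `pjb_id = identifier_or_fallback(response_df['metadata'][i], cut)` and `desired_content = trim_to_title(final.text)`.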

In [125]:
# if you wanna check out the list of tuples (should be same length as response_df)

# bunch_of_tuples

In [11]:
# create dataframe + label columns (PJB ID stays a regular column; the index is not changed here)
df_final = pd.DataFrame(bunch_of_tuples, columns=['PJB ID', 'Document Body'])

In [13]:
df_final

Unnamed: 0,PJB ID,Document Body
0,PJB 2086,"<title> From Julian Bond to Abraham Feinberg, ..."
1,PJB 2232,"<title> From Julian Bond to Adele Allison, 28 ..."
2,PJB 2243,"<title> From Julian Bond to Alda Lee Boyd, 28 ..."
3,PJB 2155,"<title> From Julian Bond to Allard Lowenstein,..."
4,PJB 2169,"<title> From Julian Bond to Amanda Watts, 21 O..."
...,...,...
180,PJB 2221,"<title> To Julian Bond from Richard Fulton, 27..."
181,PJB 2205,"<title> To Julian Bond from Robert E. Howard, ..."
182,PJB 2101,"<title> To Julian Bond from Robert Kline, 30 S..."
183,PJB 2365,<title> To Julian Bond from W. Hamilton Enslow...


In [19]:
# to check how it looks with a two page doc
# newlist = [x for x in mylist if "To Julian Bond from Margaret Linton" in x] # mylist is built in the cell below
# newlist

In [15]:
# if you wanna take a look at the first document body in the final dataframe

mylist = df_final['Document Body'].tolist()
mylist[0]
# I wonder if the title tags would break this

'<title> From Julian Bond to Abraham Feinberg, 10 Oct 1968</title>\n    </head>\n\n    <body>\n    <h1 class="work-title">From Julian Bond to Abraham Feinberg, 10 Oct 1968</h1>\n    <div class="export-metadata"><span class="translation_missing" title="translation missing: en.export.show.html.erb.export_metadata, work: From Julian Bond to Abraham Feinberg, 10 Oct 1968, collection: The Papers of Julian Bond, time: 2025-04-10 21:21:16 +0000">Export Metadata</span>\n        <p><span class="translation_missing" title="translation missing: en.export.show.html.erb.identifier, work: PJB 2086">Identifier</span></p>\n      <p>\n        <span class="translation_missing" title="translation missing: en.export.show.html.erb.fromthepage_version, version: 22.10">Fromthepage Version</span>\n      </p>\n    </div>\n\n    <hr />\n    <h2 class="divider"><span class="translation_missing" title="translation missing: en.export.show.html.erb.page_transcripts">Page Transcripts</span></h2>\n\n    <div class="p

NEW: "username" now reflects preferred attribution name for all contributors instead of just username. Yay!

In [19]:
editors_tag = 'small'
editors_tag_class = ' class="page-version-username"'
# title_tag = 'title'
# title_tag_class = ''
# I think I would prefer to match on title even if PJB ID is more robust because title is easier to extract
content_tag = 'div'
content_tag_class = ' class="page-content"'
tags = {editors_tag: editors_tag_class,
       # title_tag: title_tag_class,
       content_tag: content_tag_class}
final_list = []

for i in range(len(mylist)):
    dirty = mylist[i]
    dictionary = {}
    for tag, tag_class in tags.items():
        reg_str = "<" + tag + tag_class + ">(.*?)</" + tag + ">"
        res = re.findall(reg_str, dirty, re.DOTALL)
        key = tag
        dictionary[key] = res
    dictionary['small'] = list(dict.fromkeys(dictionary['small'])) # drop duplicate contributors, keeping first-seen order (set() order is arbitrary)
    for k,v in dictionary.items():
        target = ' '.join(v)
        target_stage_2 = target.replace('\n','')
        dictionary[k] = target_stage_2.strip(' ')
    check = dictionary['div'] + "<p>Thanks to FromThePage transcription contributors: " + dictionary['small'] + "</p>"
    final_list.append(check)
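
Here is the same extraction on a tiny made-up fragment, so the behavior can be seen without the full export. The sample HTML and contributor names are hypothetical; only the tag and class names come from the cell above. Two caveats: the non-greedy `(.*?)` stops at the first closing tag, so a `<div>` nested inside `page-content` would cut the match short, and this sketch joins contributor names with commas, which is a swap from the space join used above.

```python
import re

# Hypothetical two-page export fragment using the class names matched above.
sample = (
    '<div class="page-content"><p>Dear Sir,</p>\n</div>'
    '<small class="page-version-username">Contributor A</small>'
    '<div class="page-content"><p>Yours truly,</p>\n</div>'
    '<small class="page-version-username">Contributor A</small>'
    '<small class="page-version-username">Contributor B</small>'
)


def extract(tag: str, css_class: str, html: str) -> list[str]:
    """Return the inner text of every <tag class="css_class">...</tag> match."""
    pattern = f'<{tag} class="{re.escape(css_class)}">(.*?)</{tag}>'
    return re.findall(pattern, html, re.DOTALL)


pages = extract("div", "page-content", sample)
# dict.fromkeys drops duplicates and keeps first-seen order, unlike set().
contributors = list(dict.fromkeys(extract("small", "page-version-username", sample)))

body = " ".join(pages).replace("\n", "").strip()
credit = ", ".join(contributors)
print(body)
print(credit)
```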
    

In [21]:
df_final['Document Body'] = final_list

In [23]:
id_list = df_final['PJB ID'].to_list()
new_ids = []
for raw_id in id_list: # raw_id rather than id, which would shadow the built-in id()
    # Remove any spaces or PJBs for standardization
    cleaned = raw_id.replace(" ", "").replace("PJB", "")
    new_ids.append('PJB ' + cleaned)
        
print(new_ids)
df_final['PJB ID'] = new_ids
df_final

['PJB 2086', 'PJB 2232', 'PJB 2243', 'PJB 2155', 'PJB 2169', 'PJB 2142', 'PJB 2144', 'PJB 2343', 'PJB 2275', 'PJB 2304', 'PJB 2127', 'PJB 2236', 'PJB 2178', 'PJB 2198', 'PJB 2261', 'PJB 2107', 'PJB 2157', 'PJB 2208', 'PJB 2331', 'PJB 2118', 'PJB 2067', 'PJB 2253', 'PJB 2167', 'PJB 2245', 'PJB 2058', 'PJB 2175', 'PJB 2241', 'PJB 2186', 'PJB 2293', 'PJB 2054', 'PJB 2140', 'PJB 2347', 'PJB 2173', 'PJB 2257', 'PJB 2104', 'PJB 2255', 'PJB 2171', 'PJB 2330', 'PJB 2088', 'PJB 2265', 'PJB 2229', 'PJB 2112', 'PJB 2081', 'PJB 2075', 'PJB 2134', 'PJB 2267', 'PJB 2337', 'PJB 2277', 'PJB 2073', 'PJB 2345', 'PJB 2291', 'PJB 2216', 'PJB 2138', 'PJB 2108', 'PJB 2289', 'PJB 2218', 'PJB 2110', 'PJB 2065', 'PJB 2098', 'PJB 2069', 'PJB 2165', 'PJB 2478', 'PJB 2239', 'PJB 2484', 'PJB 2285', 'PJB 2488', 'PJB 2339', 'PJB 2269', 'PJB 2247', 'PJB 2487', 'PJB 2152', 'PJB 2477', 'PJB 2278', 'PJB 2190', 'PJB 2192', 'PJB 2120', 'PJB 2263', 'PJB 2200', 'PJB 2341', 'PJB 2180', 'PJB 2481', 'PJB 2490', 'PJB 2063', 'PJ

Unnamed: 0,PJB ID,Document Body
0,PJB 2086,"<p>October 10, 1968</p><p>Dear Dr. Feinberg,</..."
1,PJB 2232,"<p>October 28, 1968</p><p>Dear Mrs. Allison,</..."
2,PJB 2243,"<p>October 28, 1968</p><p>Dear Ms. Boyd,</p><p..."
3,PJB 2155,"<p>October 18, 1968</p><p>Dear Al,</p><p>I am ..."
4,PJB 2169,"<p>October 21, 1968</p><p>Dear Mrs. Watt<s>a</..."
...,...,...
180,PJB 2221,<p>Richard Fulton INC.<br/>200 W. 57th St.<br/...
181,PJB 2205,"<p>September 23, 1968</p><p>Highlight Society<..."
182,PJB 2101,<p>[Letterhead logo]: Illustrated image of 20t...
183,PJB 2365,"<p>9616 NE 27th<br/>Bellevue, WA 98004<br/>May..."
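
The standardization step as a standalone function, in case it is easier to check that way. `normalize_pjb_id` is a hypothetical name and the inputs are made up. One consequence worth knowing: a row that fell back to the FromThePage work_id earlier (because it had no identifier metadata) also gets the `PJB ` prefix, so it will look like a real PJB ID afterwards.

```python
def normalize_pjb_id(raw: str) -> str:
    """Return 'PJB <rest>' with spaces and any existing 'PJB' removed first."""
    stripped = raw.replace(" ", "").replace("PJB", "")
    return f"PJB {stripped}"


# Made-up inputs covering the spacing variants this is meant to absorb.
for raw in ["PJB 2086", "PJB2232", " 2243 ", "2155"]:
    print(repr(raw), "->", normalize_pjb_id(raw))
```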


In [39]:
example = df_final[df_final['PJB ID'] == 'PJB 2068']
print(example['Document Body'].iloc[0]) # .iloc[0] takes the first match without hard-coding its row label (134)

<p>WESTERN UNION<br/>TELEGRAM</p><p>1104P EDT AUG 30 68 AC504 K616</p><p>WZA467 NL PD WG WELLINGTON KANS 30<br/>HONORABLE JULIAN BOND<br/>162 EAST EUHARLEE ST SOUTHWEST ATLA<br/>CONGRATULATIONS ON MAGNIFICIENT PERFORMANCE IN CHICAGO.  KANSAS<br/>STATE CONFERENCE OF NAACP BRANCHES CORDIALLY INVIT YOU TO ADDRESS<br/>ITS ANNUAL FREEDOM FUN DINNER TO BE HELD IN TOPEKA KANSAS<br/>NOV 16 1968 AT 7PM.  ALTHOUGH YOU HAVE MADE PUBLIC APPEARANCES<br/>IN MANY SECTIONS OF COUNTRY NO ONE IN KANSAS CAN REMEMBER YOUR<br/>EVER VISITING STATE.  FOR THIS AND OTHR REASONS WE URGE YOU<br/>TO SERIOUSLY CONSIDER ACCEPTING OUR INVIGATIONS.<br/>YOU MAY CALL OR WIRE COLLECT<br/>DR CHARLES ROQUEMORE PRESIDENT<br/>220 EAST KANSAS AVE<br/>WELLINGTON KANS FA 64691</p>                       <p>Dear ........:<br/>Please forgive me for taking so long to answer your kind telegram.<br/>I am sorry but I cannot come to Kansas on the date you have indicated.<br/>My schedule is a busy one, and I am trying to limit my time 

This presents a particularly interesting problem, but one where I imagine a fix would cause more problems than it solves. With this new method, line breaks (`<br/>`) are preserved instead of disappearing. This is good when we want line breaks (frequently in Series II, for addresses) but bad when contributors transcribe every line break literally (against transcription guidelines). But I suppose it can always be sorted out in proofreading.
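
If proofreading needs a starting point, one option is to flag documents with an unusually high number of `<br/>` tags per paragraph rather than trying to fix them automatically. This is only a rough heuristic sketch: `br_density` is a hypothetical helper, the sample strings are made up, and any cutoff for "too many" would have to be chosen by looking at real documents.

```python
import re

BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)


def br_density(body: str) -> float:
    """Return the number of <br/> tags per <p> paragraph in an HTML fragment."""
    paragraphs = body.count("<p>") or 1  # avoid dividing by zero
    return len(BR_TAG.findall(body)) / paragraphs


address = "<p>200 W. 57th St.<br/>New York</p><p>Dear Sir,</p>"
wrapped = "<p>I am sorry<br/>but I cannot<br/>come to Kansas<br/>on the date</p>"
print(br_density(address), br_density(wrapped))
```

Applied to the dataframe, this would be something like `df_final['Document Body'].map(br_density)`, sorted in descending order to see the likeliest offenders first.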

In [42]:
# export to csv for transfer to Drupal!
import datetime
# just gonna do UTC minus 5 because timezones are a pain, and this doesn't need to be perfect
# so sometimes it'll be CDT and sometimes EST but so be it
date = datetime.datetime.now(datetime.UTC) - datetime.timedelta(hours = 5) 
print(f'{date:%Y%m%d}')
df_final.to_csv(f'export_to_drupal_{date:%m%d%Y}.csv', index=False)

20250410
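
Note that the cell prints the date as `%Y%m%d` but names the file with `%m%d%Y`, so the printed stamp and the filename will not match. A small sketch that builds the filename from one stamp, keeping the filename's current format and the same fixed UTC-5 approximation; `export_filename` is a hypothetical helper name:

```python
import datetime

# Fixed UTC-5 offset, the same approximation the cell above makes.
UTC_MINUS_5 = datetime.timezone(datetime.timedelta(hours=-5))


def export_filename(now: datetime.datetime) -> str:
    """Build the CSV name from an aware datetime, using one stamp format."""
    stamp = now.astimezone(UTC_MINUS_5).strftime("%m%d%Y")
    return f"export_to_drupal_{stamp}.csv"


# 02:30 UTC on 11 April 2025 is still 10 April at UTC-5.
example_time = datetime.datetime(2025, 4, 11, 2, 30, tzinfo=datetime.timezone.utc)
print(export_filename(example_time))
```

In the notebook this would be called as `export_filename(datetime.datetime.now(datetime.timezone.utc))`, and the result can be both printed and passed to `to_csv`.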
