### FTP API data grab

This file contains the code for downloading a FromThePage (FTP) transcription export through the FromThePage API.

Still a work in progress.

I am using Python's requests library because I have more facility in Python, and Windows PowerShell is a curse that I prefer not to suffer under when I can avoid it. If you are on macOS or a Linux/Unix system, the documentation on the FTP website explains how to accomplish this from the command line: https://content.fromthepage.com/project-owner-documentation/api-keys/

In [11]:
# hiding the API key
import os
import dotenv

# change to the directory where the dotenv file is (unique for each person)
os.chdir("/Users/charl/JBPP")

# load in stuff hidden in the .env file
dotenv.load_dotenv()
JBPP_key = os.getenv('JBPP_key')
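
If the `.env` file is not found or the variable name is misspelled, `os.getenv` quietly returns `None`, and the problem only shows up later as a failed request. A minimal sketch of a guard, using only the standard library; `require_env` is a hypothetical helper name, not something `dotenv` provides:

```python
import os

def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; check the .env file and the working directory")
    return value

# usage, after dotenv.load_dotenv() has run:
# JBPP_key = require_env('JBPP_key')
```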

In [13]:
# import required packages
import requests
import pandas as pd
import json
import re

# code to create post request
apikey = JBPP_key

root = "https://fromthepage.com/iiif" # https (same as the export urls below) so the API token is not sent unencrypted
endpoint = "/collection/charlie-transfer-to-drupal" # this endpoint is the only thing that needs editing 
# use IIIF slug found at bottom of "export" tab in FTP document set you want to export from
headers = {"Authorization": f"Token token={apikey}"}
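
Since the slug is the only part that changes between document sets, the URL and header construction can be wrapped in one function. This is a sketch under assumptions: `build_collection_request` is a hypothetical name, the default `root` is assumed to be the https form of the address, and the key shown is a placeholder rather than a real token.

```python
def build_collection_request(slug, apikey, root="https://fromthepage.com/iiif"):
    """Return (url, headers) for a collection-level request to the given IIIF slug."""
    # strip stray slashes so "slug", "/slug" and "slug/" all give the same URL
    url = f"{root.rstrip('/')}/collection/{slug.strip('/')}"
    headers = {"Authorization": f"Token token={apikey}"}
    return url, headers

# placeholder key, for illustration only
example_url, example_headers = build_collection_request("charlie-transfer-to-drupal", "not-a-real-key")
print(example_url)
```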

In [3]:
# submit post request using requests library (operates same as cURL, just in Python)
response = requests.post(root+endpoint, headers=headers)


In [4]:
# to run if you wanna look at the raw text or check status

# print(response.status_code)
# should be 200
# print(response.text)
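
Instead of checking the status by eye, the same checks can be made to fail loudly before the dataframe step. A minimal sketch, assuming only that the response object has `status_code` and `text` attributes (as `requests` responses do); `check_collection_response` is a hypothetical helper name:

```python
import json

def check_collection_response(response):
    """Return the parsed JSON body, or raise unless it is a 200 response with a 'manifests' key."""
    if response.status_code != 200:
        raise RuntimeError(f"request failed with status {response.status_code}")
    try:
        payload = json.loads(response.text)
    except json.JSONDecodeError as err:
        raise RuntimeError("response body is not valid JSON") from err
    if not isinstance(payload, dict) or "manifests" not in payload:
        raise RuntimeError("response JSON has no 'manifests' key")
    return payload

# usage: payload = check_collection_response(response)
```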

In [5]:
# convert to dataframe using json_normalize
# record_path=['manifests'] keeps one row per manifest and ignores the metadata associated with the API call
# that's returned in the response but is not connected to the actual doc set content
response_df = pd.json_normalize(response.json(), record_path=['manifests'])

In [6]:
# if interested in taking a look at the dataframe:

# response_df.head()
# I'm curious if anything else will become metadata because there's more metadata in the bulk uploaded documents
# If it breaks later, refer to that column to see how to fix it

In [7]:
# don't run this cell (you can, you just don't need to)

#bunch_of_tuples = [] # this is where we'll store all the key (PJB ID) - value (Document Body) pairs to convert to a dataframe

#for i in range(len(response_df)): # iterates over each row of the dataframe - 
#    # there are other ways to do this but it's not prohibitively inefficient
#    url = response_df['@id'][i] # indexes into the value in the first column of the dataframe (the IIIF url)
#    cut = url.split('/')[4] # slices out the unique work_id - to be used to locate plaintext export
#    try:
#        pjb_id = response_df['metadata'][i][0]['value'] # tries to make key based on identifier aka PJB ID
#    except TypeError:
#       pjb_id = cut # if it fails, it instead makes key on the basis of the work_id (guaranteed to be unique)
#    new_url = f'https://fromthepage.com/iiif/{cut}/export/plaintext/verbatim' # url that hosts the plaintext export
#    final = requests.get(new_url) # get request on plaintext export url
#    bunch_of_tuples.append((pjb_id, final.text)) # appends the (key, value) tuple to the list
    

In [8]:
# testing for XHTML compatibility (I think it might be better than plaintext)
bunch_of_tuples = [] # this is where we'll store all the variable pairs (PJB ID, Document Body) to convert to a dataframe

for i in range(len(response_df)): # iterates over each row of the dataframe - 
    # there are other ways to do this but it's not prohibitively inefficient
    url = response_df['@id'][i] # indexes into the value in the first column of the dataframe (the IIIF url)
    cut = url.split('/')[4] # slices out the unique work_id - to be used to locate html export
    try:
        pjb_id = response_df['metadata'][i][0]['value'] # tries to make key based on identifier aka PJB ID
    except TypeError:
        pjb_id = cut # if it fails, it instead makes key on the basis of the work_id (guaranteed to be unique)
    new_url = f'https://fromthepage.com/iiif/{cut}/export/html' # url that hosts the html export
    final = requests.get(new_url) # get request on html export url
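    html = final.text
    title_position = html.find('<title>') # returns -1 if there is no <title> tag
    desired_content = html[title_position:] if title_position != -1 else html # drop everything before <title>; keep the whole page if the tag is missing
    bunch_of_tuples.append((pjb_id, desired_content)) # appends to list of tuples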
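
The loop above relies on two conventions: the work_id is the path segment right after `iiif` in the manifest URL, and the first metadata entry holds the identifier. A sketch of the same logic as small functions that can be checked without hitting the API; the function names and the sample URL are invented for illustration:

```python
from urllib.parse import urlparse

def work_id_from_manifest_url(url):
    """Return the path segment that follows 'iiif' in a manifest URL."""
    segments = [s for s in urlparse(url).path.split('/') if s]
    return segments[segments.index('iiif') + 1]

def export_url(work_id, fmt="html"):
    """Build the export URL in the same shape the loop above uses."""
    return f"https://fromthepage.com/iiif/{work_id}/export/{fmt}"

def identifier_or_fallback(metadata, fallback):
    """Return the first metadata value if there is one, otherwise the fallback."""
    if isinstance(metadata, list) and metadata and isinstance(metadata[0], dict):
        return metadata[0].get('value', fallback)
    return fallback  # covers NaN, None and empty lists

sample = "https://fromthepage.com/iiif/12345/manifest"  # invented URL
print(export_url(work_id_from_manifest_url(sample)))
```

Checking the type of `metadata` explicitly is one alternative to catching `TypeError`; it also covers an empty metadata list, which would otherwise raise `IndexError`.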

In [125]:
# if you wanna check out the list of tuples (should be same length as response_df)

# bunch_of_tuples

In [104]:
# create dataframe + label columns (PJB ID stays a regular column; it is not set as the index)
df_final = pd.DataFrame(bunch_of_tuples, columns=['PJB ID', 'Document Body'])

In [106]:
df_final

Unnamed: 0,PJB ID,Document Body
0,891,"<title> To Julian Bond from M. Steven Lubet, 3..."
1,PJB 1028,<title> To Julian Bond from Mac Barber and Wal...
2,PJB 1526,"<title> To Julian Bond from Mae Nunley, with B..."
3,887,"<title> To Julian Bond from Maggie Gray, 9 Mar..."
4,PJB807,"<title> To Julian Bond from Malcolm Willison, ..."
...,...,...
74,PJB 1385,<title> To Julian Bond from Willie Frank Danie...
75,PJB 1386,"<title> To Julian Bond fron Myrtice Hardeman, ..."
76,PJB 1389,<title> To Julian Bond fron Shirley Ann Woodar...
77,PJB 1126,"<title> To Julian Bond to H. E. Ruark, 5 May 1..."


In [56]:
# to check how it looks with a two page doc
newlist = [x for x in df_final['Document Body'] if "To Julian Bond from Margaret Linton" in x]
# newlist

['<title> To Julian Bond from Margaret Linton, 14 January 1968, with Bond&#39;s draft response</title>\n    </head>\n    \n    <body>\n    <h1 class="work-title">To Julian Bond from Margaret Linton, 14 January 1968, with Bond&#39;s draft response</h1>\n    <div class="export-metadata"><span class="translation_missing" title="translation missing: en.export.show.html.erb.export_metadata, work: To Julian Bond from Margaret Linton, 14 January 1968, with Bond&amp;#39;s draft response, collection: The Papers of Julian Bond, time: 2024-07-31 13:19:44 +0000">Export Metadata</span>\n        <p><span class="translation_missing" title="translation missing: en.export.show.html.erb.identifier, work: PJB 1500">Identifier</span></p>\n      <p>\n        <span class="translation_missing" title="translation missing: en.export.show.html.erb.fromthepage_version, version: 22.10">Fromthepage Version</span>\n      </p>\n    </div>\n\n    <hr />\n    <h2 class="divider"><span class="translation_missing" title

In [80]:
# if you wanna take a look at the final dataframe

mylist = df_final['Document Body'].tolist()
mylist[0]
# I wonder if the title tags would break this

'<title> To Julian Bond from M. Steven Lubet, 3 Mar 1967, with Bond note</title>\n    </head>\n    \n    <body>\n    <h1 class="work-title">To Julian Bond from M. Steven Lubet, 3 Mar 1967, with Bond note</h1>\n    <div class="export-metadata"><span class="translation_missing" title="translation missing: en.export.show.html.erb.export_metadata, work: To Julian Bond from M. Steven Lubet, 3 Mar 1967, with Bond note, collection: The Papers of Julian Bond, time: 2024-07-31 13:19:43 +0000">Export Metadata</span>\n        <p><span class="translation_missing" title="translation missing: en.export.show.html.erb.identifier, work: 891">Identifier</span></p>\n      <p>\n        <span class="translation_missing" title="translation missing: en.export.show.html.erb.fromthepage_version, version: 22.10">Fromthepage Version</span>\n      </p>\n    </div>\n\n    <hr />\n    <h2 class="divider"><span class="translation_missing" title="translation missing: en.export.show.html.erb.page_transcripts">Page Tra

In [88]:
editors_tag = 'small'
editors_tag_class = ' class="page-version-username"'
# title_tag = 'title'
# title_tag_class = ''
# I think I would prefer to match on title even if PJB ID is more robust because title is easier to extract
content_tag = 'div'
content_tag_class = ' class="page-content"'
tags = {editors_tag: editors_tag_class,
       # title_tag: title_tag_class,
       content_tag: content_tag_class}
final_list = []

for dirty in mylist:
    dictionary = {}
    for tag, tag_class in tags.items():
        # non-greedy match on everything between the opening tag (with its class) and the next closing tag
        reg_str = "<" + tag + tag_class + ">(.*?)</" + tag + ">"
        dictionary[tag] = re.findall(reg_str, dirty, re.DOTALL)
    # de-duplicate contributor names, keeping first-seen order (a set would make the order unpredictable)
    dictionary[editors_tag] = list(dict.fromkeys(dictionary[editors_tag]))
    for k, v in dictionary.items():
        # join the matches into one string and drop the newlines
        dictionary[k] = ' '.join(v).replace('\n', '').strip(' ')
    check = dictionary[content_tag] + "<p>Thanks to FromThePage transcription contributors: " + dictionary[editors_tag] + "</p>"
    final_list.append(check)
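
The regex step above can be tried on a small string before running it on the real exports. This sketch uses an invented two-page HTML snippet; the usernames are made up, and `dict.fromkeys` is used for de-duplication as an order-preserving alternative to a set. Note that the non-greedy pattern stops at the first closing tag, so a `div` nested inside the page content would cut the match short.

```python
import re

# invented sample: two pages, with one contributor credited twice
sample_html = (
    '<div class="page-content">\n<p>Dear Mr. Bond,</p>\n</div>'
    '<small class="page-version-username">alice</small>'
    '<div class="page-content"><p>Sincerely,</p></div>'
    '<small class="page-version-username">alice</small>'
    '<small class="page-version-username">bob</small>'
)

def extract(tag, tag_class, html):
    """Return the inner text of every <tag ...class...> element, in document order."""
    pattern = "<" + tag + re.escape(tag_class) + ">(.*?)</" + tag + ">"
    return re.findall(pattern, html, re.DOTALL)

pages = extract('div', ' class="page-content"', sample_html)
editors = list(dict.fromkeys(extract('small', ' class="page-version-username"', sample_html)))

body = ' '.join(pages).replace('\n', '').strip(' ')
credit = ' '.join(editors)
print(body)    # <p>Dear Mr. Bond,</p> <p>Sincerely,</p>
print(credit)  # alice bob
```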
    

In [108]:
df_final['Document Body'] = final_list

In [118]:
id_list = df_final['PJB ID'].to_list()
new_ids = []
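for raw_id in id_list: # raw_id rather than id, which would shadow the Python builtin
    # Remove any spaces or PJBs for standardization, then add a single "PJB " prefix back
    raw_id = raw_id.replace(" ", "").replace("PJB", "")
    new_ids.append('PJB ' + raw_id)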
        
print(new_ids)
df_final['PJB ID'] = new_ids
df_final

['PJB 891', 'PJB 1028', 'PJB 1526', 'PJB 887', 'PJB 807', 'PJB 1500', 'PJB 867', 'PJB 996', 'PJB 785', 'PJB 786', 'PJB 913', 'PJB 771', 'PJB 1099', 'PJB 1482', 'PJB 889', 'PJB 1470', 'PJB 930', 'PJB 831', 'PJB 820', 'PJB 986', 'PJB 1131', 'PJB 1058', 'PJB 1132', 'PJB 847', 'PJB 893', 'PJB 1471', 'PJB 917', 'PJB 982', 'PJB 877', 'PJB 1093', 'PJB 1095', 'PJB 1018', 'PJB 748', 'PJB 1147', 'PJB 1245', 'PJB 1030', 'PJB 1243', 'PJB 1037', 'PJB 1138', 'PJB 1012', 'PJB 810', 'PJB 1510', 'PJB 1230', 'PJB 1229', 'PJB 907', 'PJB 947', 'PJB 994', 'PJB 740', 'PJB 1151', 'PJB 980', 'PJB 823', 'PJB 1192', 'PJB 1085', 'PJB 1323', 'PJB 1467', 'PJB 1104', 'PJB 963', 'PJB 923', 'PJB 744', 'PJB 940', 'PJB 779', 'PJB 1549', 'PJB 897', 'PJB 1542', 'PJB 1160', 'PJB 1101', 'PJB 1469', 'PJB 1299', 'PJB 1060', 'PJB 861', 'PJB 1562', 'PJB 691', 'PJB 932', 'PJB 915', 'PJB 1385', 'PJB 1386', 'PJB 1389', 'PJB 1126', 'PJB 1097']


Unnamed: 0,PJB ID,Document Body
0,PJB 891,<p>Students for a Democratic Society</p><p>Box...
1,PJB 1028,<p>GEORGIA EDUCATIONAL IMPROVEMENT COUNCIL</p>...
2,PJB 1526,<p>P.O. Box 66<br/>Mary Holmes College<br/>Wes...
3,PJB 887,"<p>March 9, 1967</p><p>Savannah, Georgia</p><p..."
4,PJB 807,"<p>66 Union Avenue<br/>Schenectady, N.Y. 1230..."
...,...,...
74,PJB 1385,"<p>129 Marion Place #B2at E.<br/>Atlanta 7, Ge..."
75,PJB 1386,"<p>1811 Goddard Street, Southeast<br/>Atlanta,..."
76,PJB 1389,"<p>697 Windson Street, S.W.<br/>Atlanta, Georg..."
77,PJB 1126,<p>WALLACE ADAMS ...
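
The ID standardization can also be written as a single function, which makes it easy to check against the three input shapes that appear in the data ("891", "PJB 1028", "PJB807"). A minimal sketch; `normalize_pjb_id` is a hypothetical name:

```python
def normalize_pjb_id(raw_id):
    """Strip spaces and any existing 'PJB', then add a single 'PJB ' prefix."""
    digits = str(raw_id).replace(" ", "").replace("PJB", "")
    return "PJB " + digits

print([normalize_pjb_id(x) for x in ["891", "PJB 1028", "PJB807"]])
# ['PJB 891', 'PJB 1028', 'PJB 807']
```

With this in place, the column could be updated in one step with `df_final['PJB ID'].map(normalize_pjb_id)`.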


In [122]:
# export to csv for transfer to Drupal!

df_final.to_csv('export_to_drupal_final.csv', index=False)