### FTP API data grab

This file contains the code for downloading a FromThePage (FTP) transcription export from the FromThePage API.

Still a work in progress.

I am using Python's requests library because I am more comfortable in Python, and Windows PowerShell is a curse that I prefer not to suffer under when I can avoid it. If you are on macOS or a Linux/Unix system, the documentation on the FTP website explains how to accomplish this from the command line: https://content.fromthepage.com/project-owner-documentation/api-keys/

In [94]:
# hiding the API key
import os
import dotenv

# change to the directory where the dotenv file is (unique for each person)
os.chdir("/Users/charl/JBPP")

# load in stuff hidden in the .env file
dotenv.load_dotenv()
JBPP_key = os.getenv('JBPP_key')

In [98]:
# import required packages
import requests
import pandas as pd
import json
import re

# code to create post request
apikey = JBPP_key

root = "https://fromthepage.com/iiif" # https so the API token is not sent in cleartext
endpoint = "/collection/2025-summer-program" # this endpoint is the only thing that needs editing 
# use IIIF slug found at bottom of "export" tab in FTP document set you want to export from
headers = {"Authorization": f"Token token={apikey}"}

In [100]:
# submit post request using requests library (operates same as cURL, just in Python)
response = requests.post(root+endpoint, headers=headers)


In [4]:
# to run if you wanna look at the raw text or check status

# print(response.status_code)
# should be 200
# print(response.text)
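
If the request fails (bad key, wrong slug), the JSON parsing in the next cell raises a much less helpful error. A minimal sketch of a guard, assuming the collection response is JSON with a top-level `manifests` list, as the next cell expects; the helper name `parse_collection` is hypothetical:

```python
import json

def parse_collection(status_code, body):
    """Return the list of manifests from a collection response, or raise a readable error."""
    if status_code != 200:
        raise RuntimeError(f"FromThePage request failed with HTTP status {status_code}")
    data = json.loads(body)
    if "manifests" not in data:
        raise KeyError("response JSON has no 'manifests' key; check the endpoint slug")
    return data["manifests"]

# Usage with the response object from the cells above:
# manifests = parse_collection(response.status_code, response.text)
```

The returned list could then be handed to `pd.json_normalize` directly instead of re-parsing `response.text`.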

In [102]:
# convert to dataframe using json_normalize
# record_path=['manifests'] is to ignore metadata associated with the API call that's returned in the response
# but is not connected to the actual doc set content
response_df = pd.json_normalize(json.loads(response.text), record_path=['manifests'])

In [104]:
# if interested in taking a look at the dataframe:

response_df.head().T
# I'm curious if anything else will become metadata because there's more metadata in the bulk uploaded documents
# If it breaks later, refer to that column to see how to fix it

Unnamed: 0,0,1
@id,https://fromthepage.com/iiif/32213452/manifest,https://fromthepage.com/iiif/32214941/manifest
@type,sc:Manifest,sc:Manifest
label,Box 17 Folder 3,Box 17 Folder 6
metadata,"[{'label': 'dc:source', 'value': ['', 'https:/...","[{'label': 'dc:source', 'value': ['', 'https:/..."
service.@context,http://www.fromthepage.org/jsonld/1/context.json,http://www.fromthepage.org/jsonld/1/context.json
service.@id,https://fromthepage.com/iiif/32213452/status,https://fromthepage.com/iiif/32214941/status
service.label,Work Status,Work Status
service.profile,https://github.com/benwbrum/fromthepage/wiki/F...,https://github.com/benwbrum/fromthepage/wiki/F...
service.pctComplete,66.67,78.67
service.pctTranscribed,66.67,78.67


In [7]:
# don't run this cell (you can, you just don't need to)

#bunch_of_tuples = [] # this is where we'll store all the key (PJB ID) - value (Document Body) pairs to convert to a dataframe

#for i in range(len(response_df)): # iterates over each row of the dataframe - 
#    # there are other ways to do this but it's not prohibitively inefficient
#    url = response_df['@id'][i] # indexes into the value in the first column of the dataframe (the IIIF url)
#    cut = url.split('/')[4] # slices out the unique work_id - to be used to locate plaintext export
#    try:
#        pjb_id = response_df['metadata'][i][0]['value'] # tries to make key based on identifier aka PJB ID
#    except TypeError:
#       pjb_id = cut # if it fails, it instead makes key on the basis of the work_id (guaranteed to be unique)
#    new_url = f'https://fromthepage.com/iiif/{cut}/export/plaintext/verbatim' # url that hosts the plaintext export
#    final = requests.get(new_url) # get request on plaintext export url
#    bunch_of_tuples.append((pjb_id, final.text)) # appends the (key, value) tuple to the list
    

In [108]:
label = response_df['label'][1]
url = response_df['@id'][1]
cut = url.split('/')[4]
iiif_ref = response_df['metadata'][1][0]['value'][1]
label, url, cut, iiif_ref

('Box 17 Folder 6',
 'https://fromthepage.com/iiif/32214941/manifest',
 '32214941',
 'https://iiifman.lib.virginia.edu/pid/tsb:109207?unit=60420')
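
Both lookups above are positional (`split('/')[4]` and `[0]['value']`), so they break quietly if the URL shape or the metadata order changes. A hedged sketch of two more defensive helpers, assuming manifest URLs keep the `/iiif/<work_id>/manifest` shape and that the source entry is labelled `dc:source` as in the output above; the function names are hypothetical:

```python
from urllib.parse import urlparse

def work_id_from_manifest(url):
    """Return the work ID from a URL shaped like https://fromthepage.com/iiif/<work_id>/manifest."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) == 3 and parts[0] == "iiif" and parts[2] == "manifest":
        return parts[1]
    raise ValueError(f"unexpected manifest URL: {url}")

def metadata_value(metadata, label):
    """Return the 'value' of the first metadata entry with this label, or None."""
    if not isinstance(metadata, list):  # e.g. NaN when a work has no metadata
        return None
    for entry in metadata:
        if isinstance(entry, dict) and entry.get("label") == label:
            return entry.get("value")
    return None

# Usage with the dataframe from above:
# cut = work_id_from_manifest(response_df['@id'][1])
# source = metadata_value(response_df['metadata'][1], 'dc:source')
```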

In [110]:
new_url = f'https://fromthepage.com/iiif/{cut}/export/html' # url that hosts the html export
final = requests.get(new_url) # get request on html export url
html = final.text

In [222]:
from bs4 import BeautifulSoup
import re

def extract_pages(string):
    soup = BeautifulSoup(string, "lxml")

    # Find all <div> tags where id starts with "page-"
    page_divs = soup.find_all("div", id=re.compile(r"^page-\d+"))

    pages = []

    for page_div in page_divs:
        # Extract title from the <a name="..."> tag inside the page
        title_tag = page_div.find("a", attrs={"name": True})
        title = title_tag.get_text(strip=True) if title_tag else None

        # Extract page content
        content_tag = page_div.find("div", class_="page-content")
        content = content_tag.decode_contents() if content_tag else None

        # Extract all usernames from <small class="page-version-username">
        user_tags = page_div.find_all("small", class_="page-version-username")
        users = [tag.get_text(strip=True) for tag in user_tags]

        pages.append({
            "FTP_page_id": page_div.get("id"),
            "title": title,
            "content": content,
            "users": users
        })

    return pages

In [224]:
pages = extract_pages(html)

In [238]:
for page in pages:
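    # str.strip(' cont.') strips any of those characters from both ends, not the literal suffix,
    # so remove a trailing "cont." with a regex and then trim stray commas and spaces
    page['title'] = re.sub(r'[\s,]*\bcont\.?\s*$', '', page['title']).strip(', ')
    page['PJB'] = page['title'].split(',')[-1].strip()
    print(page['PJB'])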

PJB 8601
PJB 8602
PJB 8603
PJB 8604
PJB 8605
PJB 8606
PJB 8607
PJB 8607
PJB 8608
PJB 8608
PJB 8608
PJB 8609
PJB 8610
PJB 8611
PJB 8612
PJB 8613
PJB 8613
PJB 8614
PJB 8615
PJB 8616
PJB 8617
PJB 8618
PJB 8619
PJB 8620
PJB 8621
PJB 8622
PJB 8623
PJB 8624
PJB 8625
PJB 8626
PJB 8627
PJB 8627
PJB 8627
PJB 8627
PJB 8628
PJB 8628
PJB 8628
PJB 8629
PJB 8630
PJB 8631
PJB 8632
PJB 8633
PJB 8633
PJB 8634
PJB 8634
PJB 8635
PJB 8636
PJB 8637
PJB 8638
PJB 8638
PJB 8639
PJB 8640
PJB 8641
PJB 8641
PJB 8642
PJB 8643
PJB 8644
PJB 8645
PJB 8645
PJB 8646
PJB 8647
PJB 8648
PJB 8649
PJB 8649
PJB 8650
PJB 8650
PJB 8651
PJB 8652
PJB 8653
PJB 8653
PJB 8654
PJB 8654
PJB 8654
PJB 8655
PJB 8655
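
The ID can also be pulled out of the title directly with a regular expression, which sidesteps suffix trimming altogether. A minimal sketch, assuming every ID looks like `PJB` followed by digits; the function name `extract_pjb` and the sample titles are hypothetical:

```python
import re

PJB_PATTERN = re.compile(r"PJB\s*(\d+)")

def extract_pjb(title):
    """Return the ID as 'PJB <digits>', or None if the title contains no ID."""
    match = PJB_PATTERN.search(title or "")
    return f"PJB {match.group(1)}" if match else None

# Hypothetical titles, for illustration only
for sample in ["Letter to a student, PJB 8601", "Reply, PJB 8607 cont.", "Untitled page"]:
    print(extract_pjb(sample))
```

Returning `None` for titles without an ID makes the problem pages easy to list before grouping.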


In [246]:
from collections import defaultdict

grouped = defaultdict(list)
for page in pages:
    grouped[page['PJB']].append(page)

merged_pages = [
    {
        'PJB': PJB,
        'content': ' '.join(p['content'] for p in group if p['content']),
        'users': sorted(set(u for p in group for u in p['users'])),
        'FTP_page_ids': [p['FTP_page_id'] for p in group]
    }
    for PJB, group in grouped.items()
]

In [248]:
for page in merged_pages:
    contributors = page.pop('users')
    contributors = ', '.join(contributors)
    page['content'] = page['content'] + "<p>Thanks to FromThePage transcription contributors: " + contributors + "</p>"
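
Since the credit line is spliced into HTML, usernames containing `&` or `<` would end up as markup. A small sketch that escapes names and skips the line when nobody is credited; `credit_line` is a hypothetical helper, and skipping empty credits is a design assumption rather than what the cell above does:

```python
from html import escape

def credit_line(users):
    """Build the HTML credit paragraph, or return '' when there are no contributors."""
    names = sorted(set(users))
    if not names:
        return ""
    joined = ", ".join(escape(name) for name in names)
    return f"<p>Thanks to FromThePage transcription contributors: {joined}</p>"

# Usage inside the loop above:
# page['content'] = page['content'] + credit_line(page.pop('users'))
```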

In [210]:
merged_pages

[{'PJB': 'PJB 8601',
  'content': '\n<p>March 1, 1974</p>\n<p>Dear Ms. Henzie,</p>\n<p>Thank you for your kind letter of February 16th. As requested, enclosed is information about me that should be helpful in doing your term paper.</p>\n<p>Sincerely,</p>\n<p>Julian Bond</p>\n<p>Ms. Sandy Henzie <br/>\n212 Hill Church Road<br/>\nSpring City, Pa. 19475</p>\n<p>JB/jj<br/>\ncc. <br/>\nencl.</p>\n<p>Thanks to FromThePage transcription contributors: Nakole Allen, lbaker</p>',
  'FTP_page_ids': ['page-35113595']},
 {'PJB': 'PJB 8602',
  'content': '\n<p><sup>big other matter</sup></p>\n<p>212 Hill Church Road<br/>\nSpring City, PA 19475<br/>\nFebruary 16, 1974</p>\n<p>The Honorable Mr. Julian Bond<br/>\nHouse of Representatives<br/>\nAtlanta, Georgia</p>\n<p>Sir:</p>\n<p>I am doing a term paper for my American Government class at school and have chosen you as my subject. Could you send me any background information on yourself and your views on important matters in the government? Any opinion

In [212]:
os.chdir('PDF experimentation/output CSVs')

FileNotFoundError: [WinError 3] The system cannot find the path specified: 'PDF experimentation/output CSVs'

In [214]:
# No worries: this error only appears because I'm already in that directory. Running the script top to bottom works fine.
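
One way to make this step safe to re-run is to skip `os.chdir` and build the output path explicitly. A sketch using `pathlib`, assuming the output folder lives under the project directory set in the first cell; `ensure_output_dir` is a hypothetical helper:

```python
from pathlib import Path

def ensure_output_dir(base, *parts):
    """Create base/parts... if it does not exist yet and return it as a Path."""
    out_dir = Path(base).joinpath(*parts)
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir

# Usage (the base path is an assumption taken from the first cell):
# out_dir = ensure_output_dir("/Users/charl/JBPP", "PDF experimentation", "output CSVs")
# with open(out_dir / f"{pjb_id}_output.txt", "w", encoding="utf-8") as f:
#     f.write(text)
```

Because the path no longer depends on the current working directory, running the cell twice gives the same result.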

In [256]:
to_remove = []
for page in merged_pages:
    if len(page['content']) >= 2700: # this obviously can and should be higher, but just as an example
        pjb_id = page['PJB']
        text = page['content']
        with open(f"{pjb_id}_output.txt", "w", encoding="utf-8") as f:
            f.write(text)
        to_remove.append(page)
    
for page in to_remove:
    merged_pages.remove(page)

In [258]:
export = pd.DataFrame(merged_pages)

In [260]:
export

Unnamed: 0,PJB,content,FTP_page_ids
0,PJB 8601,"\n<p>March 1, 1974</p>\n<p>Dear Ms. Henzie,</p...",[page-35113595]
1,PJB 8602,\n<p><sup>big other matter</sup></p>\n<p>212 H...,[page-35113596]
2,PJB 8603,"\n<p>March 1, 1974</p>\n<p>Dear Mr. Hamburger,...",[page-35113597]
3,PJB 8604,\n<p>Thanks to FromThePage transcription contr...,[page-35113598]
4,PJB 8605,"\n<p>March 1, 1974</p>\n<p>Dear Mr. Williams:<...",[page-35113599]
5,PJB 8606,\n<p>Thanks to FromThePage transcription contr...,[page-35113600]
6,PJB 8607,"\n<p>March 1, 1974</p>\n<p>Dear Abe:</p>\n<p>O...","[page-35113601, page-35113602]"
7,PJB 8608,\n<p><sup>24</sup></p>\n<p>Dear <u>[Rep. Julia...,"[page-35113603, page-35113604, page-35113605]"
8,PJB 8609,"\n<p>March 2, 1974</p>\n<p>Dear Yancey:</p>\n<...",[page-35113606]
9,PJB 8610,\n<p>Yancey Martin<br/>\nS.E.F. <br/>\n87 Walt...,[page-35113607]


### Joining on Title / PJB ID

At some point during this process we will need to perform a join to connect PJB IDs to titles. Though we can avoid duplication at the folder level, there is no surefire way to avoid duplication at a higher level.

This whole process relies on each document within the folder having a distinctive title. I could presumably also write a script to do the grouping based on a separate list of page counts, though; I would just have to adapt cell [61].
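
A cheap sanity check on the distinctive-title assumption: if pages of one document always sit next to each other in the export (an assumption), then an ID that reappears after a different ID suggests two documents sharing a title. A sketch, with the hypothetical helper `non_contiguous_ids`:

```python
from itertools import groupby

def non_contiguous_ids(ids):
    """Return IDs that occur in more than one separate run, in order of first repeat."""
    seen = set()
    repeated = []
    for key, _ in groupby(ids):  # groupby yields one entry per run of equal neighbours
        if key in seen and key not in repeated:
            repeated.append(key)
        seen.add(key)
    return repeated

# Usage with the pages list built earlier:
# suspicious = non_contiguous_ids([page['PJB'] for page in pages])
```

An empty result does not prove the titles are unique, only that no ID is split across separate runs of pages.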

In [262]:
export.to_csv(f'export_{label}.csv', index=False)