## FTP to Drupal Exports

This notebook contains the updated script for preparing a folder-level export from FromThePage (FTP) to Drupal.

Note that the script still uses the FTP API (Option 1), although it is no longer strictly necessary. The API is mainly useful for retrieving the unique FTP work ID of each work. Because exports are now done at the folder level and each folder is a single work, it is just as easy, if not easier, to look up the work ID by hand and label each export manually (Option 2).

In [14]:
# import required packages

import os
import dotenv

import requests
import pandas as pd
import json
import re

from bs4 import BeautifulSoup
from collections import defaultdict

### Option 1: the old way

In [94]:
# hiding the API key

# change to the directory where the dotenv file is (unique for each person)
os.chdir("/Users/charl/JBPP")

# load in stuff hidden in the .env file
dotenv.load_dotenv()
JBPP_key = os.getenv('JBPP_key')

In [98]:
# code to create post request
apikey = JBPP_key

root = "http://fromthepage.com/iiif"
endpoint = "/collection/2025-summer-program" # this endpoint is the only thing that needs editing 
# use IIIF slug found at bottom of "export" tab in FTP document set you want to export from
headers = {"Authorization": f"Token token={apikey}"}

In [100]:
# submit post request using requests library (operates same as cURL, just in Python)
response = requests.post(root+endpoint, headers=headers)


In [4]:
# to run if you wanna look at the raw text or check status

# print(response.status_code)
# should be 200
# print(response.text)

In [102]:
# convert to dataframe using json_normalize
# record_path=['manifests'] is to ignore metadata associated with the API call that's returned in the response
# but is not connected to the actual doc set content
response_df = pd.json_normalize(json.loads(response.text), record_path=['manifests'])

In [104]:
response_df.head().T

| | 0 | 1 |
|---|---|---|
| @id | https://fromthepage.com/iiif/32213452/manifest | https://fromthepage.com/iiif/32214941/manifest |
| @type | sc:Manifest | sc:Manifest |
| label | Box 17 Folder 3 | Box 17 Folder 6 |
| metadata | `[{'label': 'dc:source', 'value': ['', 'https:/...` | `[{'label': 'dc:source', 'value': ['', 'https:/...` |
| service.@context | http://www.fromthepage.org/jsonld/1/context.json | http://www.fromthepage.org/jsonld/1/context.json |
| service.@id | https://fromthepage.com/iiif/32213452/status | https://fromthepage.com/iiif/32214941/status |
| service.label | Work Status | Work Status |
| service.profile | https://github.com/benwbrum/fromthepage/wiki/F... | https://github.com/benwbrum/fromthepage/wiki/F... |
| service.pctComplete | 66.67 | 78.67 |
| service.pctTranscribed | 66.67 | 78.67 |


After looking at the available folders, choose the one you want to export and set its row index on each line marked `# specify row` in the next cell. Be sure to select a folder with a `service.pctComplete` value of 100.0, meaning it is fully transcribed.
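As a rough sketch of that check, the snippet below lists the row index and label of every fully transcribed folder. It works on the parsed JSON rather than the DataFrame and assumes each manifest nests `pctComplete` under a `service` object (inferred from the flattened `service.pctComplete` column). The function name `ready_rows` and the third sample entry are hypothetical.

```python
def ready_rows(manifests):
    """Return (row index, label) for manifests reported as 100% complete."""
    return [
        (i, m["label"])
        for i, m in enumerate(manifests)
        if m.get("service", {}).get("pctComplete", 0) >= 100
    ]

# Stand-in for json.loads(response.text)["manifests"].
# The first two entries echo the output above; the third is made up.
sample_manifests = [
    {"label": "Box 17 Folder 3", "service": {"pctComplete": 66.67}},
    {"label": "Box 17 Folder 6", "service": {"pctComplete": 78.67}},
    {"label": "Hypothetical Folder", "service": {"pctComplete": 100.0}},
]

ready = ready_rows(sample_manifests)
print(ready)
```

On the DataFrame itself, a boolean filter such as `response_df[response_df['service.pctComplete'] == 100]` should select the same rows.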

In [108]:
label = response_df['label'][1] # specify row
url = response_df['@id'][1] # specify row
cut = url.split('/')[4]
iiif_ref = response_df['metadata'][1][0]['value'][1]  # specify row (after 'metadata', not after 'value')
label, url, cut, iiif_ref

('Box 17 Folder 6',
 'https://fromthepage.com/iiif/32214941/manifest',
 '32214941',
 'https://iiifman.lib.virginia.edu/pid/tsb:109207?unit=60420')

### Option 2: the new way

This route is much simpler, but it requires looking up information outside this notebook.

In [16]:
# instead of changing the collection slug and querying the API, go to FTP and look up the work ID.
# For a detailed explanation of where to find it, see the documentation.

# then fill in the values below

label = 'Box 17 Folder 6' # replace with box/folder number of desired export
cut = '32214941' # replace with unique FTP work ID of desired export (important that this is a string)
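Because the work ID is typed in by hand here and later embedded in the export URL, a small guard can catch a mistyped value early. The sketch below is one possible approach, not part of the original script; the helper names are hypothetical, and the URL patterns are the ones used elsewhere in this notebook.

```python
from urllib.parse import urlparse

def work_id_from_manifest(manifest_url):
    """Pull the work ID out of a manifest URL shaped like .../iiif/<id>/manifest."""
    parts = urlparse(manifest_url).path.strip("/").split("/")
    if len(parts) != 3 or parts[0] != "iiif" or parts[2] != "manifest":
        raise ValueError(f"unexpected manifest URL: {manifest_url}")
    return parts[1]

def normalize_work_id(value):
    """Accept an int or str work ID and return it as a digits-only string."""
    text = str(value).strip()
    if not text.isdigit():
        raise ValueError(f"work ID should contain only digits: {value!r}")
    return text

def html_export_url(work_id):
    """Build the HTML export URL for a work ID."""
    return f"https://fromthepage.com/iiif/{normalize_work_id(work_id)}/export/html"

print(work_id_from_manifest("https://fromthepage.com/iiif/32214941/manifest"))
print(html_export_url(" 32214941 "))
```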

### The rest of it

In [18]:
new_url = f'https://fromthepage.com/iiif/{cut}/export/html' # url that hosts the html export
final = requests.get(new_url) # get request on html export url
html = final.text

In [20]:
def extract_pages(string):
    soup = BeautifulSoup(string, "lxml")

    # Find all <div> tags where id starts with "page-"
    page_divs = soup.find_all("div", id=re.compile(r"^page-\d+"))

    pages = []

    for page_div in page_divs:
        # Extract title from the <a name="..."> tag inside the page
        title_tag = page_div.find("a", attrs={"name": True})
        title = title_tag.get_text(strip=True) if title_tag else None

        # Extract page content
        content_tag = page_div.find("div", class_="page-content")
        content = content_tag.decode_contents() if content_tag else None

        # Extract all usernames from <small class="page-version-username">
        user_tags = page_div.find_all("small", class_="page-version-username")
        users = [tag.get_text(strip=True) for tag in user_tags]

        pages.append({
            "FTP_page_id": page_div.get("id"),
            "title": title,
            "content": content,
            "users": users
        })

    return pages

In [22]:
pages = extract_pages(html)

In [24]:
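for page in pages:
    # str.strip(' cont.') strips any of the characters ' ', c, o, n, t, . from both ends,
    # so remove the trailing " cont." marker with a regex instead
    page['title'] = re.sub(r'\s*cont\.\s*$', '', page['title']).strip(', ')
    page['PJB'] = page['title'].split(',')[-1].strip()
    # print(page['PJB'])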
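Trimming a trailing ` cont.` marker is worth doing with care, because `str.strip(chars)` treats its argument as a set of characters to remove from both ends rather than as a suffix. The minimal sketch below uses a hypothetical title (the real title format in FTP may differ) chosen so the difference is visible; `strip_cont_marker` is a name introduced only for this illustration.

```python
import re

def strip_cont_marker(title):
    """Remove a trailing 'cont.' marker plus any leftover commas or spaces."""
    return re.sub(r"\s*cont\.\s*$", "", title).strip(", ")

# Hypothetical title that starts with characters from the set ' cont.'
title = "notes, PJB 8607 cont."

naive = title.strip(" cont.")   # also eats the leading "not"
safe = strip_cont_marker(title)

print(repr(naive))
print(repr(safe))
```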

In [30]:
grouped = defaultdict(list)
for page in pages:
    grouped[page['PJB']].append(page)

merged_pages = [
    {
        'PJB': PJB,
        'content': ' '.join(p['content'] for p in group if p['content']),
        'users': sorted(set(u for p in group for u in p['users'])),
        'FTP_page_ids': [p['FTP_page_id'] for p in group]
    }
    for PJB, group in grouped.items()
]
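To see what the merge produces, here is a small self-contained sketch on hypothetical page records (the IDs, usernames, and text are made up). It mirrors the grouping logic of the cell above inside a function named `merge_by_pjb`, a name used only for this illustration.

```python
from collections import defaultdict

def merge_by_pjb(pages):
    """Combine page records that share a PJB number into one record each."""
    grouped = defaultdict(list)
    for page in pages:
        grouped[page["PJB"]].append(page)

    return [
        {
            "PJB": pjb,
            # skip pages with no content so join() never sees None
            "content": " ".join(p["content"] for p in group if p["content"]),
            # de-duplicate usernames across pages and sort them
            "users": sorted({u for p in group for u in p["users"]}),
            "FTP_page_ids": [p["FTP_page_id"] for p in group],
        }
        for pjb, group in grouped.items()
    ]

# Hypothetical records: two pages of one letter, then a page with no content
sample_pages = [
    {"PJB": "PJB 8607", "content": "<p>first</p>", "users": ["bob", "alice"], "FTP_page_id": "page-1"},
    {"PJB": "PJB 8607", "content": "<p>second</p>", "users": ["alice"], "FTP_page_id": "page-2"},
    {"PJB": "PJB 8608", "content": None, "users": [], "FTP_page_id": "page-3"},
]

merged = merge_by_pjb(sample_pages)
for record in merged:
    print(record)
```

Records come out in the order each PJB number first appears, since dictionaries preserve insertion order.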

In [32]:
for page in merged_pages:

    # get rid of newline characters
    page['content'] = page['content'].replace('\n','')

    # add contributor language in the appropriate place
    contributors = page.pop('users')
    contributors = ', '.join(contributors)
    page['content'] = page['content'] + "<p>Thanks to FromThePage transcription contributors: " + contributors + "</p>"

In [34]:
# os.chdir('PDF experimentation/output CSVs')

FileNotFoundError: [WinError 3] The system cannot find the path specified: 'PDF experimentation/output CSVs'

In [36]:
# the error above only appears because the working directory is already 'PDF experimentation';
# running the notebook top to bottom avoids it
os.getcwd()

'C:\\Users\\charl\\JBPP\\PDF experimentation'

In [38]:
os.chdir('output CSVs')

**For longer documents:** you may need to write them to individual text files and paste them in manually, because there is sometimes a limit on how many characters a CSV cell can display or import. The threshold of 2700 used below is only an example that worked well for this demonstration; raise it for normal runs (10000 may be a reasonable starting point, though the best value has not been determined).
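The split itself can be sketched as a small partition step, kept separate from the file writing so the threshold is easy to change. This is an illustrative alternative, not the notebook's own code: `split_by_length` and the sample records are hypothetical, and the tiny limit is chosen only to make the example readable.

```python
def split_by_length(records, limit):
    """Partition records into (short, long) by the length of their content."""
    short, long_ = [], []
    for record in records:
        target = long_ if len(record["content"]) >= limit else short
        target.append(record)
    return short, long_

# Hypothetical records; real content would be the merged HTML strings
sample_records = [
    {"PJB": "PJB 0001", "content": "<p>short</p>"},
    {"PJB": "PJB 0002", "content": "<p>" + "x" * 50 + "</p>"},
]

for_csv, for_text_files = split_by_length(sample_records, limit=20)
print([r["PJB"] for r in for_csv])
print([r["PJB"] for r in for_text_files])
```

The records in the second list are the ones to write out as individual text files; the first list goes into the CSV.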

In [40]:
to_remove = []
for page in merged_pages:
    if len(page['content']) >= 2700: # this obviously can and should be higher, but just as an example
        pjb_id = page['PJB']
        text = page['content']
        with open(f"{pjb_id}_output.txt", "w", encoding="utf-8") as f:
            f.write(text)
        to_remove.append(page)
    
for page in to_remove:
    merged_pages.remove(page)

In [42]:
export = pd.DataFrame(merged_pages)

In [44]:
# take a look at the export to make sure everything looks alright
export

| | PJB | content | FTP_page_ids |
|---|---|---|---|
| 0 | PJB 8601 | `<p>March 1, 1974</p><p>Dear Ms. Henzie,</p><p>...` | [page-35113595] |
| 1 | PJB 8602 | `<p><sup>big other matter</sup></p><p>212 Hill ...` | [page-35113596] |
| 2 | PJB 8603 | `<p>March 1, 1974</p><p>Dear Mr. Hamburger,</p>...` | [page-35113597] |
| 3 | PJB 8604 | `<p>Feb. 3, 1974</p><p>Dear Mr. Bond,</p><p>Las...` | [page-35113598] |
| 4 | PJB 8605 | `<p>March 1, 1974</p><p>Dear Mr. Williams:</p><...` | [page-35113599] |
| 5 | PJB 8606 | `<p>February 15, 1974</p><p>Marvin Williams Jr....` | [page-35113600] |
| 6 | PJB 8607 | `<p>March 1, 1974</p><p>Dear Abe:</p><p>Of cour...` | [page-35113601, page-35113602] |
| 7 | PJB 8608 | `<p><sup>24</sup></p><p>Dear <u>[Rep. Julian Bo...` | [page-35113603, page-35113604, page-35113605] |
| 8 | PJB 8609 | `<p>March 2, 1974</p><p>Dear Yancey:</p><p>Encl...` | [page-35113606] |
| 9 | PJB 8610 | `<p>Yancey Martin<br/>S.E.F. <br/>87 Walton St<...` | [page-35113607] |


In [46]:
export.to_csv(f'export_{label}.csv', index=False)