This notebook navigates the [pairtree](https://confluence.ucop.edu/display/Curation/PairTree) format in which HathiTrust delivers full-text data when a dataset request is fulfilled via rsync.

The basic workflow is as follows:
1. Drill down to the final directory that holds volume data, starting at the directory that holds the highest level of pairtree data (in HT, this is the folder named with the institutional prefix for volumes in the dataset from that institution, e.g. 'mdp' for U. Michigan).
2. Identify the zip file for each volume, use its file name to create a new directory named for the volume's HTID, and move the text files into that directory.
3. Expand the text files into this same folder, with each volume in its own folder named with the volume's HTID.
4. Correct edge cases:
    - Some pages have filenames with the HTID at the beginning, separated by an underscore, e.g. `ark+=13960=t4wh2kh33_0000001.txt`
    - Some volumes have sequential pages that are not numbered sequentially, so the code renames the page files in sequence. This was vetted for this dataset and found to be valid, but check your data before using this code to verify that the filenames are the problem and not underlying data issues.
5. Iterate through the directory that holds the folders of pages, find the text files for each page, remove running headers (using `load_vol` and `swinburne_clean_vol`), and write the pages to a new single text file for each volume.
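
To make step 1 concrete, the sketch below shows roughly how a volume ID maps onto the nested pairtree folders. It is a simplified illustration only: `clean_local_id` and `pairtree_dir` are hypothetical helper names, the character substitutions cover just the three that show up in file names like the `ark+=...` example above, and the full pairtree specification has further escaping rules that are not handled here. The ID in the usage line is made up.

```python
def clean_local_id(local_id: str) -> str:
    """Swap characters that are awkward in file names (simplified pairtree rules)."""
    return local_id.replace(':', '+').replace('/', '=').replace('.', ',')


def pairtree_dir(prefix: str, local_id: str) -> str:
    """Build the folder path that would hold a volume's zip (illustrative only)."""
    cleaned = clean_local_id(local_id)
    # pairtree splits the cleaned ID into two-character folder names
    pieces = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return '/'.join([prefix, 'pairtree_root'] + pieces + [cleaned])


# made-up local ID, used only to show the shape of the path
print(pairtree_dir('mdp', '39015012345678'))
```

This is why the notebook walks the whole tree rather than building paths by hand: the depth of the nesting depends on the length of each ID.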

There are some variables that will need to be manually changed to make this workflow work for a given project, and these will be flagged with codes in the comments.

**Note: you need to download this GitHub repo and move it to the same folder where this Jupyter notebook is: https://github.com/htrc/HTRC-Tools-RunningHeaders-Python.** Use the green `clone or download` button on the right, then unzip the downloaded file (which will yield a folder called `htrc`) and move it where this Jupyter notebook is located.

The other libraries we are using are relatively standard, but can be downloaded using `pip` if you do not have them already. If you use Python with Anaconda, it's likely you already have them. If you do not, the `import` statement will fail.

In [1]:
import os # used to navigate file systems, remove data
import glob # used to navigate file systems
import re # regex library used for finding running headers/footers
import shutil # used to move/copy data
import zipfile # used to unzip compressed volumes
from tqdm.notebook import tqdm # optional library that creates a progress bar for final cleaning function (replaces the deprecated `tqdm_notebook` import)

from collections import defaultdict # used for running header/footer removal
from typing import List, TypeVar, Set, Iterator, Optional, Tuple, Dict # used for running header/footer removal

# libraries that actually find and remove running headers and footers:
from htrc.models import Page, PageStructure, HtrcPage
from htrc.utils import clean_text, levenshtein, pairwise_combine_within_distance, flatten 
from htrc.runningheaders import parse_page_structure

First, define variables based on where pairtree data is stored, and where you want to move volume zips, and eventually clean full-text volume files:

In [77]:
# DEFINING A PATH TO THE DIRECTORY WHERE THIS NOTEBOOK IS LOCATED IN THE NAME OF LESS TYPING
root = os.getcwd()
# print(root)

# UPDATE THESE VARIABLES BASED ON YOUR DIRECTORY STRUCTURE!
data_dir = root+'/swinburne-data/' # folder that holds data pairtree
output_path = root+'/swinburne-pages/' # folder to which we'll move the volume zips, and folders holding the textfile pages

Second, find the volume zips at the end of the pairtree and copy them to `output_path`, here a folder called `swinburne-pages`:

In [None]:
# ITERATE THROUGH PAIRTREE STRUCTURE AND FIND AND COPY VOLUME ZIPS
os.makedirs(output_path, exist_ok=True) # make sure the destination folder exists
# using `dirpath` rather than `root` so the loop does not overwrite the `root` variable defined above
for dirpath, dirs, files in tqdm(os.walk(data_dir, topdown=False)):
    # Disregarding files that start with "." because on Mac, you'll get hidden .DS_Store files:
    for zip_name in [i for i in files if not i.startswith(".") and i.endswith(".zip") and not i.endswith(" 2.zip")]:
        # print(zip_name)
        final_path = os.path.join(dirpath, zip_name)
        # print(final_path)
        shutil.copy(final_path, output_path) # copies files; to move them instead, use shutil.move in place of shutil.copy
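
Because the copy step flattens the whole tree into one folder, two zips that share a file name would overwrite each other. The sketch below is one way to check for that before copying. `find_zip_name_collisions` is a hypothetical helper, not part of the original workflow.

```python
import os
from collections import defaultdict
from typing import Dict, List


def find_zip_name_collisions(data_dir: str) -> Dict[str, List[str]]:
    """Map each zip file name that occurs more than once to all of its paths."""
    paths_by_name = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(data_dir):
        for name in filenames:
            # same basic filter as the copy loop: zips only, no hidden files
            if name.endswith('.zip') and not name.startswith('.'):
                paths_by_name[name].append(os.path.join(dirpath, name))
    return {name: paths for name, paths in paths_by_name.items() if len(paths) > 1}
```

Calling `find_zip_name_collisions(data_dir)` and getting back an empty dictionary would suggest the flattening is safe for that dataset.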

Third, generate paths to the zips in the `swinburne-pages` folder, and then expand each zip found using the path:

In [None]:
root = os.getcwd() # sanity check!

zip_dir = root+'/swinburne-pages/' # the folder where zips will expand to, here the same folder in which they are stored

zip_paths = glob.glob(zip_dir+'*.zip')
# print(zip_paths)

# iterate through list of paths to zips and parse filename to create new folder for each volume, and expand
for path in tqdm(zip_paths):
    # print(path)
    zipname = os.path.basename(path)
    expand_dir = zip_dir + zipname[:-4] # drop the '.zip' extension to get the folder name
    with zipfile.ZipFile(path) as zf:
        # print(zf)
        zf.extractall(expand_dir)

    # delete the zip only after it has been closed
    os.remove(path)
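
It can help to look inside a zip before expanding it, for example to see whether the pages sit in a nested folder named after the volume (the notebook mentions such a duplicate folder further down). A minimal sketch using the standard library's `zipfile` module, with `top_level_entries` as a hypothetical helper name:

```python
import zipfile
from typing import Set


def top_level_entries(zip_path: str) -> Set[str]:
    """Return the distinct first path components stored inside a zip."""
    with zipfile.ZipFile(zip_path) as zf:
        # names inside a zip always use forward slashes
        return {name.split('/')[0] for name in zf.namelist() if name}
```

If this returns a single folder name for a zip, expanding into a folder of the same name will produce the extra layer of nesting.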

The cell below only needs to be run once (and, I believe, will only run successfully once). It looks for page files whose names carry a volume ID prefix separated by an underscore (`_`), removes the prefix, and normalizes the page number to 8 total digits, which should make the files work with the rest of the running-headers code.

In [None]:
root = os.getcwd()
# print(root)

swinburne_page_paths = root+'/swinburne-pages/'
# print(swinburne_page_paths)

vol_paths = glob.glob(swinburne_page_paths+'/**/')
print(len(vol_paths))
# print(vol_paths[:10])

for path in tqdm(vol_paths):
    # print(path)
    page_paths = sorted(glob.glob(path+'/**/*.txt', recursive=True))
    vol_root = path.rstrip('/') # the volume folder, without its trailing slash

    # Rename every page file to a sequential, zero-padded 8-digit number and move it to the top
    #  of the volume folder. This covers both edge cases:
    #  - HTID added to page files, separated by an underscore, e.g. `ark+=13960=t4wh2kh33_00000001.txt`
    #  - pages that are not numbered sequentially
    #  Be sure to check data to make sure non-sequential pages are naming errors and not data problems
    for num, page in enumerate(page_paths, start=1):
        # print(page)
        new_filename = str(num).zfill(8)+'.txt'
        new_path = vol_root+'/'+new_filename
        # print(page, new_path)
        if os.path.abspath(page) != os.path.abspath(new_path):
            os.rename(page, new_path)
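
Before running the renaming cell above, it is worth checking which volumes actually have gaps in their page numbering, since the renaming hides that information. The sketch below is one way to do this. `page_numbers` and `missing_page_numbers` are hypothetical helpers, and they assume page files end in digits followed by `.txt`, with or without an HTID prefix.

```python
import os
import re
from typing import List


def page_numbers(filenames: List[str]) -> List[int]:
    """Extract the trailing page number from names like 00000012.txt or id_00000012.txt."""
    numbers = []
    for name in filenames:
        match = re.search(r'(\d+)\.txt$', os.path.basename(name))
        if match:
            numbers.append(int(match.group(1)))
    return sorted(numbers)


def missing_page_numbers(filenames: List[str]) -> List[int]:
    """List the numbers skipped between the lowest and highest page number found."""
    found = page_numbers(filenames)
    if not found:
        return []
    return sorted(set(range(found[0], found[-1] + 1)) - set(found))
```

Passing in the `page_paths` list for a volume and getting back a non-empty result would flag that volume for a manual look before its pages are renumbered.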

Fourth, define a function, `load_vol` which will be used to load a directory of pages and parse its structure to find headers/footers:

In [22]:
# A FUNCTION USED TO LOAD A VOLUME INTO MEMORY IN A FORMAT THAT OUR HEADER/FOOTER CLEANER TAKES AS INPUT
def load_vol(path: str, num_pages: int) -> List[HtrcPage]:
    pages = []
    # page files are numbered from 00000001 up to and including num_pages;
    #  the earlier loop read page 1 twice and skipped the last two pages
    for n in range(1, num_pages+1):
        page_num = str(n).zfill(8)
        with open('{}/{}.txt'.format(path, page_num), encoding='utf-8') as f:
            lines = [line.rstrip() for line in f.readlines()]
            pages.append(HtrcPage(lines))

    return pages

Fifth, define the function that will actually ID and remove running headers/footers, here slightly modified from the original for your project, hence the name!

In [21]:
# FUNCTION THAT CLEANS RUNNING HEADERS/FOOTERS FROM EACH PAGE & CONCATENATE INTO SINGLE TEXT FILE FOR EACH VOLUME
def swinburne_clean_vol(vol_dir_path_list: list, out_dir: str):
    vol_num = 0
    for vol_dir_path in tqdm(vol_dir_path_list):
        # print(f"this is vol_dir_path: {vol_dir_path}")
        # folder name (the HTID), whether or not the path ends with a slash
        filename = os.path.basename(os.path.normpath(vol_dir_path))
        # print(f"this is filename: {filename}")
        page_paths = sorted(glob.glob(vol_dir_path+'/*.txt'))
        # print(page_paths)
        file_count = len(page_paths)
        loaded_vol = load_vol(vol_dir_path, file_count)
        pages = parse_page_structure(loaded_vol)
        outfile = filename + '.txt'
        # print(outfile)
        clean_file_path = os.path.join(os.getcwd(), outfile)

        # write the cleaned pages to a single file in the current working directory...
        with open(clean_file_path, 'w', encoding='utf-8') as f:
            for page in pages:
                f.write(page.body + '\n')
        # ...and move it to the output folder only once it is complete and closed
        shutil.move(clean_file_path, out_dir)
        vol_num += 1

    print(f"Cleaned {vol_num} volume(s)")
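
For readers unfamiliar with running-header removal, the toy example below illustrates the general idea: lines that recur at the top of many pages are probably headers rather than body text. The `htrc` code imported above does the real work. This version only uses exact matching on first lines after stripping digits, and it is not that library's algorithm. The function names and the sample pages are made up for illustration.

```python
import re
from collections import Counter
from typing import List


def normalize(line: str) -> str:
    """Lowercase a line and drop digits so 'PAGE 12' and 'PAGE 13' compare equal."""
    return re.sub(r'\d+', '', line).strip().lower()


def repeated_first_lines(pages: List[List[str]], min_share: float = 0.5) -> List[str]:
    """Return normalized first lines that open at least `min_share` of the pages."""
    firsts = [normalize(page[0]) for page in pages if page]
    counts = Counter(line for line in firsts if line)
    threshold = min_share * len(pages)
    return [line for line, count in counts.items() if count >= threshold]


# invented sample data: each page is a list of lines
toy_pages = [
    ['POEMS AND BALLADS 12', 'first line of verse'],
    ['POEMS AND BALLADS 13', 'more verse'],
    ['A different opening', 'still more verse'],
]
print(repeated_first_lines(toy_pages))
```

A candidate found this way would then be dropped from each page before the pages are joined, which is the role `page.body` plays in the function above.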

**Note: this code should no longer be needed, since our initial code that cleans the filenames will move the text files up one directory level to be stored at `/path/to/data/directory/data-directory/htid/00000001.txt`. However, just in case it's of interest and possible utility, I've left it in.**

Since we have an extra layer of folders, we need to repeat some steps to find the final end directory that holds the textfile pages. To do this, we'll reassert the variable `root` and `output_path` as our current working directory and the `pages` folder that lives within it, respectively. `page_dir_path_list` is a variable that holds a list of all the directories within the folder `pages`, which are folders expanded from original zipfiles. Since the expanded folders have a duplicate folder labeled with the volume's HTID, we need to point one layer deeper to get the paths to the page files. We do this simply by iterating through our initial list of folders, splitting the paths on `/` characters and adding the HTID (the second-to-last item in the split string) to the end of the initial path, which creates our final paths with duplicate folders at the end, stored in list `clean_page_dir_path_list`:

In [67]:
# CODE COMMENTED OUT, SEE BOLD NOTE ABOVE.

# root = os.getcwd()
# output_path = root+'/swinburne-pages/'

# # lists for finding and storing paths to volume pages
# clean_page_dir_path_list = []
# page_dir_path_list = glob.glob(output_path+'/*/') # find all folders within the `pages` folder
# print(len(page_dir_path_list))
# # page_dir_path_list

# # generate new paths for extra directory in expanded zips by parsing initial paths & adding HTID to end of each:
# for path in page_dir_path_list:
#     parsed_path = path.split('/')
#     # print(parsed_paths[-2])
#     path = path+parsed_path[-2]
#     clean_page_dir_path_list.append(path)

# print(len(clean_page_dir_path_list))

Since the process to clean files takes a long time, and I found some edge cases that would break the loop, I've added code to check whether a volume has already been cleaned, and to remove its path from the list if so. It works by parsing the paths in our `page_directory_paths` list to extract the HTID at the end (adding `.txt`) and then comparing these names with the filenames (in the same format) in the final output directory, here `swinburne-clean-volumes`.

This code must be run before you start the final cleaning code (in the last cell):

In [65]:
# CREATE A VARIABLE WITH A PATH TO THE DIRECTORY WHERE WE'LL WRITE CLEAN VOLUME TEXTFILES
clean_vol_out_dir = root+'/swinburne-clean-volumes/'
# print(clean_vol_out_dir)

# find the folder of pages for each volume
page_directory_list = glob.glob('swinburne-pages/*/')
# print(len(page_directory_list))

page_directory_paths = list(page_directory_list) # copying the page directory path list so the original stays untouched
print(f"There are {len(page_directory_paths)} total volumes to clean.")

# checking to see if the volume has already been cleaned and concatenated, and if so, removing the volume's path 
#  from the list we'll feed to the cleaner & concatenate code

# Find paths for all completed volume text files in our output directory using glob:
clean_volume_list  = glob.glob(clean_vol_out_dir+'/*.txt')
print(f"{len(clean_volume_list)} volumes have already been cleaned.")

# new list to store the filenames for already cleaned volumes
clean_file_list = []

# pull clean file names from clean volume paths
for vol in sorted(clean_volume_list):
    # print(vol)
    vol_file = vol.split('/')[-1]
    # print(vol_file)
    clean_file_list.append(vol_file)

# variable to store count of found cleaned volumes
y = 0

# check final path list to be fed to cleaning code, remove path if volume is found in output directory 
#  (meaning it's already been cleaned)
for path in sorted(page_directory_paths):
    # print(path)
    filename = (path.split('/')[-2])+'.txt'
    # print(filename)
    if filename in clean_file_list:
        page_directory_paths.remove(path)
        y += 1
        
print(f"Found {y} volume(s) already cleaned.")

print(f"List of volumes to clean now has {len(page_directory_paths)} items")

There are 1922 total volumes to clean.
80 volumes have already been cleaned.
Found 0 volume(s) already cleaned.
List of volumes to clean now has 1922 items
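
The recorded output above reports 80 cleaned files but 0 matches, which suggests the names did not line up in that run. The cause is not confirmed here, but one way to guard against path-format mismatches is to normalize each path before taking its last component. The sketch below does the same filtering with hypothetical helpers `volume_name` and `volumes_left_to_clean`.

```python
import os
from typing import List


def volume_name(path: str) -> str:
    """Return the last folder or file name in a path, ignoring any trailing slash."""
    return os.path.basename(os.path.normpath(path))


def volumes_left_to_clean(page_dirs: List[str], clean_files: List[str]) -> List[str]:
    """Keep only the page folders that have no matching '<name>.txt' among clean_files."""
    done = {volume_name(f) for f in clean_files}
    return [d for d in page_dirs if volume_name(d) + '.txt' not in done]
```

With the variables from the cell above, this would be called as `volumes_left_to_clean(page_directory_paths, clean_volume_list)`.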


Finally, with all the data management finished and our functions defined, we'll iteratively work through our list of paths to folders that contain individual page text files, remove headers, and then concatenate and write to new, single files stored in a folder called `swinburne-clean-volumes` (**note: you must create this folder first!**):

In [66]:
# feed our list of folder paths to our function that will find and remove headers/footers and concat text pages
swinburne_clean_vol(page_directory_paths, clean_vol_out_dir)

Cleaned 1922 volume(s)
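
The output folder mentioned in the note above can also be created from Python, and a quick check of its contents after a long run can catch volumes that came out empty. This is a minimal sketch with hypothetical helper names `ensure_dir` and `summarize_output`.

```python
import os
from typing import List, Tuple


def ensure_dir(path: str) -> str:
    """Create the folder if it does not exist yet and return its path."""
    os.makedirs(path, exist_ok=True)
    return path


def summarize_output(out_dir: str) -> Tuple[int, List[str]]:
    """Count the .txt files in out_dir and list any that are empty."""
    names = sorted(n for n in os.listdir(out_dir) if n.endswith('.txt'))
    empty = [n for n in names if os.path.getsize(os.path.join(out_dir, n)) == 0]
    return len(names), empty
```

Running `ensure_dir(clean_vol_out_dir)` before the cleaning cell, and `summarize_output(clean_vol_out_dir)` after it, would cover both steps.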


Final results:

`100% 1922/1922 [7:13:01<00:00, 11.34s/it]
Cleaned 1922 volume(s)`
