This notebook navigates the [pairtree](https://confluence.ucop.edu/display/Curation/PairTree) format in which HathiTrust full-text data is delivered when a dataset request is fulfilled via rsync.

The basic workflow is as follows:
1. Drill down to the final directory that holds volume data, starting at the directory that holds the highest level of pairtree data (in HT, this is the folder named with the institutional prefix for the volumes from that institution, e.g. 'mdp' for U. Michigan).
2. Identify the zip file for each volume and copy it to a single working folder.
3. Expand each zip into this same folder, with each volume in a new folder named with the volume's HTID (taken from the zip's file name).
4. Iterate through the directory that holds the folders of pages, find the text files for each page, remove running headers and footers (using `load_vol` and `swinburne_clean_vol`), and write the pages to a new single text file, one for each volume.
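
The folder and zip names at the bottom of the pairtree are filesystem-safe forms of the volume identifier. Under the Pairtree specification's character mapping, `/`, `:` and `.` in an identifier are written as `=`, `+` and `,`, which is why an ARK identifier appears later in this notebook as `ark+=13960=t3mw3px6k`. The sketch below reverses only that three-character mapping. It does not handle the specification's `^hh` hex escapes, and `decode_pairtree_name` is a hypothetical helper, not part of any library used here.

```python
def decode_pairtree_name(name):
    """Undo Pairtree's single-character substitutions ('=' -> '/', '+' -> ':', ',' -> '.')."""
    # Hypothetical helper for illustration; hex escapes such as '^20' are left untouched.
    return name.translate(str.maketrans({"=": "/", "+": ":", ",": "."}))

print(decode_pairtree_name("ark+=13960=t3mw3px6k"))  # ark:/13960/t3mw3px6k
```

Names that contain none of the three substituted characters, such as `mdp.39015007870481`, come back unchanged.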

There are some variables that will need to be changed manually to make this workflow work for a given project; these are flagged in the code comments.

**Note: you need to download this GitHub repo and move it to the same folder where this Jupyter notebook is: https://github.com/htrc/HTRC-Tools-RunningHeaders-Python.** Use the green `clone or download` button on the right, then unzip the downloaded file (which will yield a folder called `htrc`) and move it where this Jupyter notebook is located.

The other libraries we are using are relatively standard, but can be downloaded using `pip` if you do not have them already. If you use Python with Anaconda, it's likely you already have them. If you do not, the `import` statement will fail.
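
As a quick pre-flight check, a sketch like the following reports which modules cannot be found, without raising an error. The helper name `find_missing_modules` is made up for illustration; it relies only on the standard library's `importlib.util.find_spec` and is meant for top-level module names.

```python
import importlib.util

def find_missing_modules(module_names):
    """Return the names in module_names that Python cannot locate."""
    missing = []
    for name in module_names:
        # find_spec returns None when a top-level module cannot be found
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing

# "tqdm" is the third-party progress-bar library; "htrc" is the folder from the GitHub repo.
print(find_missing_modules(["tqdm", "htrc"]))
```

An empty list means both are importable from the notebook's folder; anything listed needs to be installed with `pip` or, in the case of `htrc`, moved next to the notebook.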

In [1]:
import os # used to navigate file systems, remove data
import glob # used to navigate file systems
import re # regex library used for finding running headers/footers
import shutil # used to move/copy data
import zipfile # used to unzip compressed volumes
from tqdm.notebook import tqdm # optional library that creates a progress bar for final cleaning function (older tqdm versions use: from tqdm import tqdm_notebook as tqdm)

from collections import defaultdict # used for running header/footer removal
from typing import List, TypeVar, Set, Iterator, Optional, Tuple, Dict # used for running header/footer removal

# libraries that actually find and remove running headers and footers:
from htrc.models import Page, PageStructure, HtrcPage
from htrc.utils import clean_text, levenshtein, pairwise_combine_within_distance, flatten 
from htrc.runningheaders import parse_page_structure

First, define variables based on where pairtree data is stored, and where you want to move volume zips, and eventually clean full-text volume files:

In [2]:
# DEFINING A PATH TO THE DIRECTORY WHERE THIS NOTEBOOK IS LOCATED IN THE NAME OF LESS TYPING
root = os.getcwd()
# print(root)

# UPDATE THESE VARIABLES BASED ON YOUR DIRECTORY STRUCTURE!
data_dir = root+'/data-download/' # folder that holds data pairtree
output_path = root+'/pages/' # folder to which we'll move the volume zips, and folders holding the textfile pages
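
The later steps write into these folders, so it may help to create them up front rather than by hand. A minimal sketch, where `ensure_dirs` is a hypothetical helper and not part of any library:

```python
import os

def ensure_dirs(*paths):
    """Create each directory if it does not already exist and return the paths as a list."""
    for p in paths:
        os.makedirs(p, exist_ok=True)  # no error if the folder is already there
    return list(paths)
```

Calling `ensure_dirs(output_path, root + '/clean-volumes/')` after the cell above would create both the `pages` folder and the `clean-volumes` folder used at the end of this notebook.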

Second, find the volume zips at the end of the pairtree and copy them to `output_path`, here a folder called `pages`:

In [306]:
# ITERATE THROUGH PAIRTREE STRUCTURE AND FIND AND COPY VOLUME ZIPS
# (loop variables are named dirpath/filename so they do not overwrite `root` or shadow `files`)
for dirpath, dirs, files in os.walk(data_dir, topdown=False):
    # Disregarding files that start with "." because on Mac, you'll get hidden .DS_Store files:
    for filename in [i for i in files if not i.startswith(".") and i.endswith(".zip") and not i.endswith(" 2.zip")]:
        # print(filename)
        final_path = os.path.join(dirpath, filename)
        # print(final_path)
        shutil.copy(final_path, output_path) # copies files; to move them instead, use shutil.move in place of shutil.copy
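
Before copying anything, it can be useful to preview which zip files the walk would pick up. The sketch below applies the same filename filter as the cell above but only collects paths. `find_volume_zips` is an illustrative name, and the reading of the ` 2.zip` filter as a way to skip duplicate copies is an assumption.

```python
import os

def find_volume_zips(top_dir):
    """Return sorted paths of zips under top_dir, skipping hidden files and ' 2.zip' names."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(top_dir):
        for name in filenames:
            if name.startswith("."):
                continue  # hidden files such as .DS_Store
            # assumption: names ending in ' 2.zip' are duplicate copies and can be ignored
            if name.endswith(".zip") and not name.endswith(" 2.zip"):
                found.append(os.path.join(dirpath, name))
    return sorted(found)
```

Comparing `len(find_volume_zips(data_dir))` against the number of volumes you expect is a quick way to catch a wrong `data_dir`.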

Third, generate paths to zips in the `pages` folder, and then expand each zip found using the path:

In [307]:
root = os.getcwd() # sanity check!

zip_dir = root+'/pages/' # the folder where zips will expand to, here the same folder in which they are stored

zip_paths = glob.glob(zip_dir+'*.zip')
# print(zip_paths)

# iterate through list of paths to zips and parse filename to create new folder for each volume, and expand
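for path in zip_paths:
    # print(path)
    zipname = os.path.basename(path)
    expand_dir = zip_dir + zipname[:-4] # drop the '.zip' extension to get the folder name
    with zipfile.ZipFile(path) as file:
        # print(file)
        file.extractall(expand_dir)

    os.remove(path) # delete the zip only after it has been closed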


# import zipfile
# with zipfile.ZipFile("file.zip","r") as zip_ref:
#     zip_ref.extractall("targetdir")
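
To see what a zip will produce before expanding it, `zipfile.ZipFile.namelist()` lists the archive's members without extracting them. A minimal sketch, where `top_level_entries` is a hypothetical helper name:

```python
import zipfile

def top_level_entries(zip_path):
    """Return the set of top-level names (first path component) inside a zip archive."""
    with zipfile.ZipFile(zip_path) as zf:
        # zip member names use '/' as the separator
        return {name.split("/")[0] for name in zf.namelist()}
```

If this returns a single folder name matching the volume's HTID, extracting into a folder of the same name yields the nested `HTID/HTID` layout handled later in this notebook.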

Fourth, define a function, `load_vol`, which will be used to load a directory of pages so that its structure can be parsed to find headers/footers:

In [44]:
# A FUNCTION USED TO LOAD A VOLUME INTO MEMORY IN A FORMAT THAT OUR HEADER/FOOTER CLEANER TAKES AS INPUT
def load_vol(path: str, num_pages: int) -> List[HtrcPage]:
    pages = []
    # page files are assumed to be numbered consecutively, from 00000001.txt up to num_pages;
    # starting the range at 1 loads every page exactly once
    for n in range(1, num_pages + 1):
        page_num = str(n).zfill(8)
        with open('{}/{}.txt'.format(path, page_num), encoding='utf-8') as f:
            lines = [line.rstrip() for line in f.readlines()]
            pages.append(HtrcPage(lines))

    return pages
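
`load_vol` builds each filename from a page counter, which assumes the pages are numbered consecutively starting at `00000001.txt`. If that cannot be guaranteed, one alternative is to read whatever `.txt` files are present, in sorted order. The sketch below is such a variant; `read_page_lines` is a hypothetical helper that returns plain lists of lines, which could then be wrapped with `HtrcPage(lines)` as in `load_vol`.

```python
import glob
import os

def read_page_lines(vol_dir):
    """Return one list of right-stripped lines per page file, in filename order."""
    pages = []
    # zero-padded file names sort correctly as plain strings
    for page_path in sorted(glob.glob(os.path.join(vol_dir, "*.txt"))):
        with open(page_path, encoding="utf-8") as f:
            pages.append([line.rstrip() for line in f])
    return pages
```

This trades the explicit page counter for the order of the file names, so a gap in the numbering is skipped instead of raising `FileNotFoundError`.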

Fifth, define the function that will actually ID and remove running headers/footers, here slightly modified from the original for your project, hence the name!

In [51]:
# FUNCTION THAT CLEANS RUNNING HEADERS/FOOTERS FROM EACH PAGE & CONCATENATES THE PAGES INTO A SINGLE TEXT FILE FOR EACH VOLUME
def swinburne_clean_vol(vol_dir_path_list: list, out_dir: str):
    vol_num = 0
    for vol_dir_path in tqdm(vol_dir_path_list):
        # print(f"this is vol_dir_path: {vol_dir_path}")
        filename = vol_dir_path.split("/", -1)[-2]
        # print(f"this is filename: {filename}")
        page_paths = sorted(glob.glob(vol_dir_path+'/*.txt'))
        # print(page_paths)
        file_count = len(page_paths)
        loaded_vol = load_vol(vol_dir_path, file_count)
        pages = parse_page_structure(loaded_vol)
        # write straight into out_dir instead of moving the file while it is still open
        outfile = os.path.join(out_dir, filename + '.txt')
        # print(outfile)
        vol_num += 1

        with open(outfile, 'w', encoding='utf-8') as f:
            for page in pages:
                # print('.')
                f.write(page.body + '\n')

    print(f"Cleaned {vol_num} volume(s)")

Since the expanded zips contain an extra layer of folders, we need to repeat some steps to find the final directory that holds the page text files. To do this, we'll reassign the variables `root` and `output_path` to our current working directory and the `pages` folder within it, respectively. `page_dir_path_list` holds a list of all the directories within the `pages` folder, which are the folders expanded from the original zip files. Since each expanded folder contains a duplicate folder labeled with the volume's HTID, we need to point one layer deeper to get the paths to the page files. We do this by iterating through our initial list of folders, splitting each path on `/` characters, and appending the HTID (the second-to-last item in the split path, because each path ends in `/`) to the end of the initial path. The resulting paths, with the duplicate folder at the end, are stored in the list `clean_page_dir_path_list`:

In [42]:
root = os.getcwd()
output_path = root+'/pages/'

# lists for finding and storing paths to volume pages
clean_page_dir_path_list = [] # new list to store final paths to page text files
page_dir_path_list = glob.glob(output_path+'*/') # find all folders within the `pages` folder
# print(page_dir_path_list)

# generate new paths for extra directory in expanded zips by parsing initial paths & adding HTID to end of each:
for path in page_dir_path_list:
    parsed_path = path.split('/')
    # print(parsed_path[-2])
    path = path+parsed_path[-2]
    clean_page_dir_path_list.append(path)

clean_page_dir_path_list

['/Users/rdubnic2/Desktop/JupyterNotebooks/pages/ark+=13960=t3mw3px6k/ark+=13960=t3mw3px6k',
 '/Users/rdubnic2/Desktop/JupyterNotebooks/pages/txa.tarb004288/txa.tarb004288',
 '/Users/rdubnic2/Desktop/JupyterNotebooks/pages/mdp.39015007870481/mdp.39015007870481',
 '/Users/rdubnic2/Desktop/JupyterNotebooks/pages/ien.35556044272359/ien.35556044272359']
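
Since these paths are built by string manipulation rather than discovered on disk, a quick check that each one exists and contains page files can catch problems before cleaning. A sketch, with `count_pages` as an illustrative name:

```python
import glob
import os

def count_pages(dir_paths):
    """Map each directory path to the number of .txt files directly inside it (None if missing)."""
    counts = {}
    for d in dir_paths:
        if os.path.isdir(d):
            counts[d] = len(glob.glob(os.path.join(d, "*.txt")))
        else:
            counts[d] = None  # the constructed path does not exist on disk
    return counts
```

For example, `count_pages(clean_page_dir_path_list)` returns a dictionary in which a `None` or `0` value flags a path worth inspecting.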

Finally, with all the data management finished and our functions defined, we'll iteratively work through our list of paths to folders that contain individual page text files, remove headers, and then concatenate and write to new, single files stored in a folder called `clean-volumes` (**note: you must create this folder first!**):

In [52]:
# CREATE A VARIABLE WITH A PATH TO THE DIRECTORY WHERE WE'LL WRITE CLEAN VOLUME TEXTFILES
clean_vol_out_dir = root+'/clean-volumes/'
# print(clean_vol_out_dir)

# feed our list of folder paths to our function that will find and remove headers/footers and concat text pages
swinburne_clean_vol(clean_page_dir_path_list, clean_vol_out_dir)


Cleaned 4 volume(s)
