
ashan8k/htrc-text-processing


HTRC

HTRC-Text-Processing Library

A tool for processing pairtree-format data from the 17 million digitized works at HathiTrust.

Table of Contents

  1. About htrc-text-processing Library
  2. Installation
  3. Usage
  4. Examples

About htrc-text-processing Library

Detailed Description goes here.

Installation

To install, run:

pip install htrc-text-processing

That's it! This library is written for Python 3.6+. If you are new to Python, note that you will also need pip installed.
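
Before running the install command, it can help to confirm that the interpreter meets the stated requirements. The snippet below is a minimal sketch using only the standard library; `environment_ready` is a hypothetical helper name and is not part of htrc-text-processing.

```python
import importlib.util
import sys


def environment_ready(min_version=(3, 6)):
    """Return True when Python is at least min_version and pip is importable.

    Hypothetical helper for illustration; not part of htrc-text-processing.
    """
    python_ok = sys.version_info >= min_version
    pip_ok = importlib.util.find_spec("pip") is not None
    return python_ok and pip_ok


if __name__ == "__main__":
    if environment_ready():
        print("Python and pip look fine; pip install should work.")
    else:
        print("Upgrade Python to 3.6+ or install pip first.")
```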

Usage

  • Function: get_zips()

    A function that finds the zip files at the end of the pairtree, moves them to a new folder and expands them, removing the zips.

    Inputs:

    1. Path (string) to directory that holds the pairtree.
    2. Path (string) to directory that will hold the folders from expanded zips.

    htrc_text_processing.get_zips('<path to pairtree parent/s>', '<path to output directory>')
  • Function: normalize_txt_file_names()

    A function that cleans and normalizes page file names.

    Example: turns 39002088672754_000001.txt into 00000001.txt

    htrc_text_processing.normalize_txt_file_names('txt path or dir to txts') 
  • Function: clean_vol()

    A function that cleans the page files of each volume and concatenates them into a single text file per volume.

    Inputs:

    1. List of paths (strings) to directories that hold page files, one directory per volume.
    2. Path (string) to the output directory where the cleaned, concatenated text files will be stored, one file per volume.
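  • Function: check_vol()

    A function that checks which volumes still need cleaning.

    Inputs:

    1. List of page directories.
    2. Path to the output directory that holds the cleaned volumes.

    Output:

    1. List of page directories that have not been cleaned yet.
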
    new_page_directory_list = htrc_text_processing.check_vol(page_directory_list, clean_vol_out_dir)
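
To make the steps above more concrete, here is a rough standard-library sketch of what each stage might look like. It is not the library's implementation. All four function names are hypothetical stand-ins, and several details are assumptions: each zip is extracted into a folder named after the zip file, page numbers are zero-padded to 8 digits (inferred from the single example above), cleaned volumes are written as `<volume folder name>.txt`, and the actual cleaning rules are omitted because they are not described here.

```python
import os
import re
import shutil
import zipfile


def expand_zips(pairtree_root, out_dir):
    """Stand-in for the get_zips() step: move each zip found under
    pairtree_root into out_dir, extract it, then delete the zip.
    Assumption: each zip is extracted into a folder named after the zip."""
    os.makedirs(out_dir, exist_ok=True)
    expanded = []
    for dirpath, _dirnames, filenames in os.walk(pairtree_root):
        for name in filenames:
            if not name.lower().endswith(".zip"):
                continue
            moved = shutil.move(os.path.join(dirpath, name),
                                os.path.join(out_dir, name))
            target = os.path.join(out_dir, os.path.splitext(name)[0])
            with zipfile.ZipFile(moved) as zf:
                zf.extractall(target)
            os.remove(moved)
            expanded.append(target)
    return sorted(expanded)


def normalized_page_name(file_name, width=8):
    """Stand-in for the renaming rule: keep only the trailing page number
    and zero-pad it. Assumption: width of 8 digits, as in the example."""
    stem, ext = os.path.splitext(file_name)
    match = re.search(r"(\d+)$", stem)
    if match is None:
        return file_name  # no trailing number, leave the name alone
    return f"{int(match.group(1)):0{width}d}{ext}"


def concatenate_pages(page_dir, clean_out_dir):
    """Stand-in for the clean_vol() step: join a volume's pages, in sorted
    order, into one file. The cleaning itself is omitted in this sketch."""
    os.makedirs(clean_out_dir, exist_ok=True)
    volume = os.path.basename(os.path.normpath(page_dir))
    out_path = os.path.join(clean_out_dir, volume + ".txt")
    pages = sorted(n for n in os.listdir(page_dir) if n.endswith(".txt"))
    with open(out_path, "w", encoding="utf-8") as out:
        for name in pages:
            with open(os.path.join(page_dir, name), encoding="utf-8") as page:
                out.write(page.read())
    return out_path


def uncleaned(page_dirs, clean_out_dir):
    """Stand-in for the check_vol() step: keep the page directories that
    have no matching <volume>.txt in clean_out_dir (assumed naming)."""
    remaining = []
    for page_dir in page_dirs:
        volume = os.path.basename(os.path.normpath(page_dir))
        if not os.path.isfile(os.path.join(clean_out_dir, volume + ".txt")):
            remaining.append(page_dir)
    return remaining
```

In this sketch, re-running `uncleaned()` after a partial run returns only the volumes still to be processed, which is the kind of resume-after-interruption workflow that the `check_vol()` output suggests.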

Issues? Please file them here.
