processing-pg-texts-log.txt

##These are some linux commands to help data scientists prepare project gutenberg english ebooks for analysis
##ALTHOUGH THESE SCRIPTS WORKED FOR ME, THEY MAY NOT WORK FOR YOU ON A DIFFERENT SYSTEM. USE AT YOUR OWN RISK
##Despite this, if you have access to a linux terminal, these bash scripts will be super useful.

download
wget -H -w 5 -m http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en 
##where -h is recursive and "-w 5" is a five second delay between requests

unzip
while [ "`find . -type f -name '*.zip' | wc -l`" -gt 0 ]; do find -type f -name "*.zip" -exec unzip -- '{}' \; -exec rm -- '{}' \;; done
##command cycles through subdirectories unzipping, and placing unzipped file in parent directory where script was intiated

##remove first seven files from database; these lack the PG TOS following ebook; impacted by later trimming
rm 1.txt 2.txt 3.txt 4.txt 5.txt 6.txt 7.txt

triming text from files conditionally
##All project gutenberg ebooks contain a terms of service contract at the end of the book; great for lawyers, not
##great for data science tasks. You will need to remove the terms of service, and potentially the beginning too.
sed -i'.bak' '/END OF THIS PROJECT GUTENBERG EBOOK/,$d' *.txt
##after confirming sed is removing correctly, remove '.bak' but leaving the -i, which edits file inplace, otherwise a backup of each
##file will be made. Potential additions to sed include -u -n options

##after trimming the files of the trailing terms of service you may also need to check for other text you need to remove and see how many files have said text. Everytime the quoted text is found it | or piped into a count function
grep -rnw '/home/marvin/projects/archive' -e '*** START OF THIS PROJECT GUTENBERG EBOOK' | wc -l
##this will count the total ebooks with the single quoted text '*** START OF'...etc

##if you find text at the beginning which you want to delete, across the entire project gutenberg collection
##this commands looks the same as trimming the TOS from the end, except the $!d command deletes quoted text prior 
##to the comma, where removing the ! deletes lines after the quoted text. Any files in this directory ending in
##.txt will be trimmed.
sed -i '/START OF THIS PROJECT GUTENBERG EBOOK/,$!d' *.txt


other helpful commands
wc -l < 10002.txt 10002.txt.bak
##to count the lines in the files to see if the licensing data tail has been cleaned

diff 10002.txt 10002.txt.bak 
##will compare the two files and output the differences

du -sh /directory
##this will give you a rough number of storage used by a specific directory, containing your gutenberrg files

##archiving
##you will be making lots of edits to lots of files; this would be a good time to start talking about version
##control. I would recommend getting an external hard drive or remote storage option (assuming you have a super fast
##internet connect) otherwise stick with physical drive backup. After you've unzipped your files and started trim,
##you will need to store the .txt.bak files somewhere, and potentially make copies of the trimmed *.txt files. The files
##expand from 14GB zipped to around 40GB unzipped.

rsync -r --include='*.txt' --exclude='*' . ../dirFullOfEbooks
##run this in the source directory and replace ../dirFullOfEbooks with the desired storage directory or device
##mv and cp commands are also helpful but the cp command has a max number of files it can run against and mv does
##not work since it does not leave a copy of the file behind. Rysnc is the best command for making copies of 80,000
##ebooks txt files

##depending on what tool you pick (NLP or ML tools) you may then have to zip the files again to upload to an S3 bucket
##or a Floyd-hub cloud storage. In which case you may want to keep the original .txt file and zip a copy to another directory
zip -r  /path/to/save/destination_folder.zip /path/to/folder


##you may also need to download all of the project gutenberg metadata
wget http://www.gutenberg.org/cache/epub/feeds/rdf-files.tar.zip && unzip rdf-files.tar.zip && tar -xvf *.tar

##counts files with conditional text
grep -rnw '/home/marvin/projects/archive' -f ../carriage.txt | wc -l
72881

##counts total files without conditional text
grep -rnwL '/home/marvin/projects/archive' -f ../carriage.txt | wc -l

##writes the filenames to a file not matching conditional file.txt text
grep -rwL '/home/marvin/projects/archive' -f ../carriage.txt > ../missing-header.txt

## writes ebook titles in current working directory containing "language: English" into file, and then moves those files
##into directory called not-english
grep -riL 'Language: English' > ../not-english.txt
cat ../not-english.txt | xargs mv -t ../not-english

#there are some file encoding errors python does not like once we start to serialize the data
#serialization seems to like ascii text but not iso-8859; let's count those file types; takes a while to run against 75k #files. Searching across all PG ebooks, turned up 23,918 files with iso-8859 encoding, taking 20 minutes
file archive/* | grep -e 'ISO-8859 text' | wc -l