# NLP and the Pipeline

In this project I apply NLP transforms to a collection of George Eliot texts to create an F3-level digital analytical edition from them.

First, I import and combine the following three novels:
* Middlemarch http://www.gutenberg.org/files/145/145-0.txt
* The Mill on the Floss http://www.gutenberg.org/files/6688/6688-0.txt
* Adam Bede http://www.gutenberg.org/files/507/507-0.txt

Then, I produce the following tables as data frames and save them as CSV tables:
* A library table (LIBRARY) with basic metadata about each book.
* A document table (DOC) with the preserved paragraphs of each book and an appropriate OHCO index.
* A token table (TOKEN) with an appropriate OHCO index including part-of-speech tags derived from NLTK.
* A vocabulary table (VOCAB) of terms with stopword and Porter stem annotations derived from NLTK.

# Set Up

## Configs

In [1]:
OHCO = ['book_id', 'chap_num', 'para_num', 'sent_num', 'token_num']
epub_dir = 'texts'

## Imports

In [2]:
import pandas as pd
import numpy as np
from glob import glob
import re
import nltk

In [3]:
%matplotlib inline

In [4]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('tagsets')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/alexcathcart/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/alexcathcart/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/alexcathcart/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package tagsets to
[nltk_data]     /Users/alexcathcart/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


True

# Inspect

Since Project Gutenberg texts vary widely in their markup, I define the chunking patterns by hand.

In [5]:
roman = '[IVXLCM]+'
caps = "[A-Z';, -]+"
chap_pats = {
    145: {
        'start_line': 206,
        'end_line': 33304,
        'volume': re.compile(r'^\s*BOOK\s+{}\.\s*$'.format(roman)),
        'chapter': re.compile(r'^\s*CHAPTER\s+{}\.\s*$'.format(roman))
    },
    507: {
        'start_line': 39,
        'end_line': 20409,
        'volume': re.compile(r'^\s*Book\s+{}\.\s*$'.format(roman)),
        'chapter': re.compile(r'^\s*Chapter\s+{}\s*$'.format(roman))
    },
    6688: {
        'start_line': 124,
        'end_line': 21265,
        'volume': re.compile(r'^\s*BOOK\s+{}\.\s*$'.format(roman)),
        'chapter': re.compile(r'^\s*Chapter\s+{}\.\s*$'.format(roman))
    }
}
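
Before chunking, it can help to check a pattern against a few candidate heading lines. The following is a minimal sketch, assuming the Middlemarch-style `CHAPTER <roman>.` heading; the sample lines are hypothetical and not taken from the files. It also shows why the `{}` placeholder must be filled with `.format(roman)`: left unformatted, the braces are matched literally.

```python
import re

# Same Roman-numeral character class as the `roman` variable above.
roman = '[IVXLCM]+'

# A raw string keeps \s and \. intact; .format() fills in the numeral class.
chapter_pat = re.compile(r'^\s*CHAPTER\s+{}\.\s*$'.format(roman))

# Hypothetical sample lines (assumption: not copied from the Gutenberg files).
samples = ['CHAPTER I.', '  CHAPTER XLII.  ', 'Chapter I.', 'CHAPTER 1.', 'CHAPTER I. continued']
matches = [bool(chapter_pat.match(s)) for s in samples]
print(matches)

# Without .format(), the pattern contains a literal "{}" and misses real headings.
unformatted = re.compile(r'^\s*BOOK\s+{}\.\s*$')
unformatted_hit = bool(unformatted.match('BOOK I.'))
print(unformatted_hit)
```

Only the first two samples match: the pattern is case-sensitive, expects Roman numerals, and allows nothing but whitespace after the period.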

# Register and Chunk

In [6]:
epubs = [epub for epub in sorted(glob(epub_dir+'/*.txt'))]

In [7]:
def acquire_epubs(epub_list, chap_pats, OHCO=OHCO):
    
    my_lib = []
    my_doc = []

    for epub_file in epub_list:
        
        # Get PG ID from filename
        book_id = int(epub_file.split('-')[-1].split('.')[0].replace('pg',''))
        print("BOOK ID", book_id)
        
        # Import file as lines
        with open(epub_file, 'r', encoding='utf-8-sig') as f:
            lines = f.readlines()
        df = pd.DataFrame(lines, columns=['line_str'])
        df.index.name = 'line_num'
        df.line_str = df.line_str.str.strip()
        df['book_id'] = book_id
    
      
        # FIX CHARACTERS TO IMPROVE TOKENIZATION
        df.line_str = df.line_str.str.replace('—', ' — ')
        df.line_str = df.line_str.str.replace('-', ' - ')
        
        # Get book title and put into LIB table -- note problems, though
        book_title = re.sub(r"The Project Gutenberg eBook( of|,) ", "", df.loc[0].line_str, flags=re.IGNORECASE)
        book_title = re.sub(r"Project Gutenberg's ", "", book_title, flags=re.IGNORECASE)

        # Remove cruft
        a = chap_pats[book_id]['start_line'] - 1
        b = chap_pats[book_id]['end_line'] + 1
        df = df.iloc[a:b].copy() # Copy so later column assignments do not target a slice
        
        # Chunk by chapter
        chap_lines = df.line_str.str.match(chap_pats[book_id]['chapter'])
        chap_nums = [i+1 for i in range(df.loc[chap_lines].shape[0])]
        df.loc[chap_lines, 'chap_num'] = chap_nums
        df.chap_num = df.chap_num.ffill()
    
        # Clean up
        df = df[~df.chap_num.isna()] # Remove everything before Chapter 1
        df = df.loc[~chap_lines] # Remove chapter heading lines
        df['chap_num'] = df['chap_num'].astype('int')
           
        # Group -- Note that we exclude the book level in the OHCO at this point
        df = df.groupby(OHCO[1:2]).line_str.apply(lambda x: '\n'.join(x)).to_frame() # Make big string
        
        # Split into paragraphs
        df = df['line_str'].str.split(r'\n\n+', expand=True).stack().to_frame().rename(columns={0:'para_str'})
        df.index.names = OHCO[1:3] # MAY NOT BE NECESSARY UNTIL THE END
        df['para_str'] = df['para_str'].str.replace('\n', ' ', regex=False).str.strip()
        df = df[~df['para_str'].str.match(r'^\s*$')] # Remove empty paragraphs
        
        # Set index
        df['book_id'] = book_id
        df = df.reset_index().set_index(OHCO[:3])
    
        # Register
        my_lib.append((book_id, book_title, epub_file))
        my_doc.append(df)
    

    docs = pd.concat(my_doc)
    library = pd.DataFrame(my_lib, columns=['book_id', 'book_title', 'book_file']).set_index('book_id')
    print("Done.")
    return library, docs

In [8]:
epubs = [epub for epub in sorted(glob(epub_dir+'/*.txt'))]
LIB, DOC = acquire_epubs(epubs, chap_pats)

BOOK ID 145
BOOK ID 507
BOOK ID 6688
Done.


In [9]:
LIB

book_id,book_title,book_file
145,"Middlemarch, by George Eliot",texts/-pg145.txt
507,"Adam Bede, by George Eliot",texts/-pg507.txt
6688,"The Mill on the Floss, by George Eliot",texts/-pg6688.txt


In [10]:
DOC.sample(10)

book_id,chap_num,para_num,para_str
6688,44,5,"“Let me see, — it’s going on for seven years n..."
145,65,23,“I have only wished to prevent you from hurryi...
507,12,48,But they started asunder with beating hearts: ...
507,11,36,"“Nay, my lad, nay,” Lisbeth burst out in an ea..."
145,12,152,"“At least, Fred, let me advise _you_ not to fa..."
145,50,30,"At this crisis Lydgate was announced, and one ..."
6688,48,59,"“Yes, Lucy, I would choose to marry him. I thi..."
507,53,55,"“Aye, aye!” said Bartle; “then we can have a b..."
6688,29,34,"“What am I to write?” said Tom, with gloomy su..."
145,12,85,“There is no question of liking at present. My...
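
The chunking logic can be tried on a toy text. The sketch below is not the notebook's function: the `lines` list is hypothetical, and it swaps in `explode()` plus `cumcount()` for the `split(expand=True).stack()` step, so paragraphs are numbered after empty ones are removed.

```python
import pandas as pd

# Hypothetical toy text (assumption); the notebook reads lines from the Gutenberg files.
lines = [
    'Front matter',
    'CHAPTER I.',
    '',
    'First paragraph,',
    'continued.',
    '',
    'Second paragraph.',
    'CHAPTER II.',
    '',
    'Third paragraph.',
]
df = pd.DataFrame({'line_str': lines})
df.index.name = 'line_num'

# Mark heading lines, number them, and carry each number forward to the lines below.
is_heading = df.line_str.str.match(r'^\s*CHAPTER\s+[IVXLCM]+\.\s*$')
df.loc[is_heading, 'chap_num'] = list(range(1, int(is_heading.sum()) + 1))
df['chap_num'] = df.chap_num.ffill()

# Drop lines before the first heading, then the heading lines themselves.
df = df[df.chap_num.notna() & ~is_heading].copy()
df['chap_num'] = df.chap_num.astype(int)

# Join each chapter into one string, split on blank lines, and tidy whitespace.
chapters = df.groupby('chap_num').line_str.apply('\n'.join)
paras = chapters.str.split(r'\n\n+').explode()
paras = paras.str.replace('\n', ' ', regex=False).str.strip()
paras = paras[paras != ''].to_frame('para_str')

# Number paragraphs within each chapter and add that number to the index.
paras['para_num'] = paras.groupby(level='chap_num').cumcount()
paras = paras.set_index('para_num', append=True)
print(paras)
```

The result has one row per paragraph, indexed by `chap_num` and `para_num`, which mirrors the shape of DOC without the `book_id` level.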


# Tokenize and Annotate

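I use NLTK to split paragraphs into sentences and sentences into tokens, and to tag each token with its part of speech.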

In [11]:
def tokenize(doc_df, OHCO=OHCO, remove_pos_tuple=False, ws=False):
    
    # Paragraphs to Sentences
    df = doc_df.para_str\
        .apply(lambda x: pd.Series(nltk.sent_tokenize(x)))\
        .stack()\
        .to_frame()\
        .rename(columns={0:'sent_str'})
    
    # Sentences to Tokens
    # Local function to pick tokenizer
    def word_tokenize(x):
        if ws:
            s = pd.Series(nltk.pos_tag(nltk.WhitespaceTokenizer().tokenize(x)))
        else:
            s = pd.Series(nltk.pos_tag(nltk.word_tokenize(x))) # Discards stuff in between
        return s
            
    df = df.sent_str\
        .apply(word_tokenize)\
        .stack()\
        .to_frame()\
        .rename(columns={0:'pos_tuple'})
    
    # Grab info from tuple
    df['pos'] = df.pos_tuple.apply(lambda x: x[1])
    df['token_str'] = df.pos_tuple.apply(lambda x: x[0])
    if remove_pos_tuple:
        df = df.drop(columns='pos_tuple')
    
    # Add index
    df.index.names = OHCO
    
    return df

In [12]:
%%time
TOKEN = tokenize(DOC, ws=False)

CPU times: user 49.7 s, sys: 1.19 s, total: 50.9 s
Wall time: 50.9 s
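
To see how each split adds one level to the OHCO index, here is a small sketch that does not need NLTK. The splitters are stand-ins (a regex after periods for sentences, whitespace for tokens), and the helper `explode_level` and the two-paragraph table are hypothetical names introduced only for illustration. It uses `explode()` with `cumcount()` rather than the `apply(pd.Series).stack()` pattern used in `tokenize`.

```python
import re
import pandas as pd

def explode_level(series, splitter, level_name, value_name):
    """Split each string into pieces, give each piece a row, and number the pieces."""
    parent_levels = list(series.index.names)
    out = series.apply(splitter).explode().to_frame(value_name)
    out[level_name] = out.groupby(level=parent_levels).cumcount()
    return out.set_index(level_name, append=True)[value_name]

# Hypothetical DOC-like table with two paragraphs (assumption).
doc = pd.DataFrame(
    {'para_str': ['It rained. We stayed in.', 'Tom laughed.']},
    index=pd.MultiIndex.from_tuples(
        [(145, 1, 0), (145, 1, 1)], names=['book_id', 'chap_num', 'para_num']
    ),
)

# Stand-in splitters: NOT NLTK's sent_tokenize / word_tokenize.
sents = explode_level(doc.para_str, lambda p: re.split(r'(?<=\.)\s+', p), 'sent_num', 'sent_str')
tokens = explode_level(sents, str.split, 'token_num', 'token_str')
print(tokens)
```

Each call appends one index level, so after two calls the index carries all five OHCO levels.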


In [13]:
TOKEN.head()

book_id,chap_num,para_num,sent_num,token_num,pos_tuple,pos,token_str
145,1,0,0,0,"(Since, IN)",IN,Since
145,1,0,0,1,"(I, PRP)",PRP,I
145,1,0,0,2,"(can, MD)",MD,can
145,1,0,0,3,"(do, VB)",VB,do
145,1,0,0,4,"(no, DT)",DT,no


In [14]:
TOKEN[TOKEN.pos.str.match('^NNP')]

book_id,chap_num,para_num,sent_num,token_num,pos_tuple,pos,token_str
145,1,0,0,10,"(Reach, NNP)",NNP,Reach
145,1,0,1,2,"(Maid, NNP)",NNP,Maid
145,1,0,1,3,"(’, NNP)",NNP,’
145,1,0,1,8,"(BEAUMONT, NNP)",NNP,BEAUMONT
145,1,0,1,9,"(AND, NNP)",NNP,AND
...,...,...,...,...,...,...,...
6688,58,67,1,9,"(Red, NNP)",NNP,Red
6688,58,67,1,10,"(Deeps, NNP)",NNP,Deeps
6688,58,68,0,6,"(Tom, NNP)",NNP,Tom
6688,58,68,0,8,"(Maggie, NNP)",NNP,Maggie


# Reduce

Extract a vocabulary from the TOKEN table.

In [15]:
TOKEN['term_str'] = TOKEN['token_str'].str.lower().str.replace(r'[\W_]', '', regex=True)

In [16]:
# Count each term, name the count column 'n', and sort alphabetically by term
VOCAB = TOKEN.term_str.value_counts().rename('n')\
    .sort_index().rename_axis('term_str').reset_index()
VOCAB.index.name = 'term_id'

In [17]:
VOCAB['num'] = VOCAB.term_str.str.match(r"\d+").astype('int')

In [18]:
VOCAB.head()

term_id,term_str,n,num
0,,141032,0
1,1.0,1,1
2,1790.0,1,1
3,1799.0,2,1
4,1801.0,1,1
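
The first row of VOCAB is an empty term with a very large count. This follows from the normalization step: any token made up only of punctuation loses all its characters. A minimal sketch with hypothetical tokens, using only the standard library:

```python
import re
from collections import Counter

# Hypothetical token strings (assumption), including punctuation-only tokens.
tokens = ['Since', 'I', ',', 'since', '_you_', '.', '1799', '“']

# Lowercase, then delete every non-word character and underscore,
# the same regex used to build term_str.
terms = [re.sub(r'[\W_]', '', t.lower()) for t in tokens]
counts = Counter(terms)
print(counts)
```

Here `','`, `'.'`, and `'“'` all collapse to the empty string, so `''` gets a count of 3, while `'Since'` and `'since'` merge into one term.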


# Annotate (VOCAB)

## Add Stopwords

Using NLTK's built-in stopword list for English.

In [19]:
sw = pd.DataFrame(nltk.corpus.stopwords.words('english'), columns=['term_str'])
sw = sw.reset_index().set_index('term_str')
sw.columns = ['dummy']
sw.dummy = 1

In [20]:
sw.sample(10)

term_str,dummy
any,1
should,1
there,1
its,1
as,1
were,1
most,1
having,1
above,1
this,1


In [21]:
VOCAB['stop'] = VOCAB.term_str.map(sw.dummy)
VOCAB['stop'] = VOCAB['stop'].fillna(0).astype('int')

In [22]:
VOCAB[VOCAB.stop == 1].sample(10)

term_id,term_str,n,num,stop
8935,himself,1079,0,1
5576,doesn,62,0,1
1793,between,374,0,1
18902,them,1475,0,1
522,all,2666,0,1
8664,hasn,17,0,1
12760,off,566,0,1
12193,mustn,19,0,1
4150,couldn,89,0,1
21506,yourself,158,0,1
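
The same 0/1 flag can be produced without a helper table. The sketch below is an alternative, not the notebook's method: it uses `isin()` in place of `map()` plus `fillna()`, and the short `stop_list` is a hypothetical stand-in for `nltk.corpus.stopwords.words('english')`.

```python
import pandas as pd

# Hypothetical stand-in for the NLTK English stopword list (assumption).
stop_list = ['the', 'and', 'doesn', 'himself']
vocab = pd.DataFrame({'term_str': ['and', 'doesn', 'flock', 'himself', 'miser']})

# isin() yields one boolean per term; casting to int gives a 0/1 flag.
vocab['stop'] = vocab.term_str.isin(stop_list).astype(int)
print(vocab)
```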


## Add Stems

In [23]:
from nltk.stem.porter import PorterStemmer
stemmer1 = PorterStemmer()
VOCAB['stem_porter'] = VOCAB.term_str.apply(stemmer1.stem)

from nltk.stem.snowball import SnowballStemmer
stemmer2 = SnowballStemmer("english")
VOCAB['stem_snowball'] = VOCAB.term_str.apply(stemmer2.stem)

from nltk.stem.lancaster import LancasterStemmer
stemmer3 = LancasterStemmer()
VOCAB['stem_lancaster'] = VOCAB.term_str.apply(stemmer3.stem)

In [24]:
VOCAB.sample(10)

term_id,term_str,n,num,stop,stem_porter,stem_snowball,stem_lancaster
4154,council,2,0,0,council,council,council
7369,flock,7,0,0,flock,flock,flock
105,accent,15,0,0,accent,accent,acc
11852,miser,1,0,0,miser,miser,mis
9874,insignificant,16,0,0,insignific,insignific,insign
18179,stupendous,5,0,0,stupend,stupend,stupend
8681,hatchin,1,0,0,hatchin,hatchin,hatchin
15351,regulated,1,0,0,regul,regul,reg
17385,soiling,1,0,0,soil,soil,soil
19122,tightly,7,0,0,tightli,tight,tight


In [25]:
VOCAB[VOCAB.stem_porter != VOCAB.stem_snowball]

term_id,term_str,n,num,stop,stem_porter,stem_snowball,stem_lancaster
50,abjectly,1,0,0,abjectli,abject,abject
64,abruptly,16,0,0,abruptli,abrupt,abrupt
89,abstractedly,3,0,0,abstractedli,abstract,abstract
98,abundantly,4,0,0,abundantli,abund,abund
142,accordingly,11,0,0,accordingli,accord,accord
...,...,...,...,...,...,...,...
21460,yearly,10,0,0,yearli,year,year
21464,yearningly,1,0,0,yearningli,yearn,yearn
21476,yes,453,0,0,ye,yes,ye
21518,zealous,10,0,0,zealou,zealous,zeal


# Save

In [26]:
DOC.to_csv('DOC.csv')
LIB.to_csv('LIB.csv')
VOCAB.to_csv('VOCAB.csv')
TOKEN.to_csv('TOKEN.csv')
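
When the tables are loaded again, the index has to be restored explicitly, since CSV stores index levels as ordinary columns. The following is a minimal sketch that round-trips a hypothetical two-row TOKEN table through an in-memory buffer instead of a file on disk.

```python
import io
import pandas as pd

OHCO = ['book_id', 'chap_num', 'para_num', 'sent_num', 'token_num']

# Hypothetical miniature TOKEN table (assumption).
token = pd.DataFrame(
    {'pos': ['IN', 'PRP'], 'token_str': ['Since', 'I']},
    index=pd.MultiIndex.from_tuples([(145, 1, 0, 0, 0), (145, 1, 0, 0, 1)], names=OHCO),
)

# Write to a buffer in place of 'TOKEN.csv', then rewind it for reading.
buffer = io.StringIO()
token.to_csv(buffer)
buffer.seek(0)

# index_col rebuilds the five OHCO levels as a MultiIndex.
restored = pd.read_csv(buffer, index_col=OHCO)
print(restored)
```

By default `read_csv` treats empty fields and strings such as `NA` or `null` as missing values, so terms like the empty string in VOCAB would come back as `NaN`; passing `keep_default_na=False` is one way to avoid that.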