## Cleaning *The God of Small Things*

In this notebook, we preprocess the raw text in `data/gost_raw.txt` (extracted from the PDF) into one row per sentence.

It's not going to be pretty, but you gotta do what you gotta do.

In [1]:
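import nltk
import pandas as pd

# nltk.sent_tokenize (used below) needs the Punkt tokenizer data.
# If it is missing, run nltk.download('punkt') once ('punkt_tab' on newer NLTK releases).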

In [2]:
def skip_line(line):
    """Return True if the line is blank or boilerplate and should be skipped."""
    # Skip blank (whitespace-only) lines
    if not line.strip():
        return True

    # The title followed by a period is part of the actual text, so keep it.
    # This check has to run before the phrase check below, which matches
    # the same title without the period.
    if 'The God of Small Things.' in line:
        return False

    # Any of these phrases in the line means we can skip the line
    indicator_phrases = [
        'Collected & Compiled by Shashank A Sinha/GTS/CSC',
        'Exclusive for News & Views Readers',
        'The God of Small Things',
    ]
    return any(ip in line for ip in indicator_phrases)
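
To see what the filter keeps and drops, here is a small self-contained check. It restates `skip_line` so the snippet runs on its own. The sample lines are made up for illustration (only the "May in Ayemenem" sentence comes from the text), so treat it as a sketch rather than a test against the real file.

```python
def skip_line(line):
    """Same logic as the notebook's skip_line, restated for a standalone run."""
    if not line.strip():
        return True
    if 'The God of Small Things.' in line:
        return False
    indicator_phrases = [
        'Collected & Compiled by Shashank A Sinha/GTS/CSC',
        'Exclusive for News & Views Readers',
        'The God of Small Things',
    ]
    return any(ip in line for ip in indicator_phrases)


# Hypothetical sample lines, not copied from data/gost_raw.txt
samples = [
    '   \n',                                        # blank
    'Exclusive for News & Views Readers\n',         # boilerplate phrase
    'The God of Small Things\n',                    # bare title (running header)
    'They called it The God of Small Things.\n',    # title with period: kept
    'May in Ayemenem is a hot, brooding month.\n',  # ordinary text: kept
]

results = [skip_line(s) for s in samples]
for sample, skipped in zip(samples, results):
    print(f'{str(skipped):5} {sample!r}')
```

One caveat of this approach: a body line that mentions the title without a period right after it would also be dropped, since it is indistinguishable from the header.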

In [3]:
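chapter_texts = []  # one big string per chapter
chapters = []       # chapter titles, in order
with open('data/gost_raw.txt', 'r') as f:
    lines = []
    collecting_chapter = False
    for line in f:
        if skip_line(line):
            continue
        if line.startswith('Chapter'):
            # The next line is the chapter title. It is read directly with
            # next(f), so it never goes through skip_line.
            chapters.append(next(f).replace('\n', ''))
            # If we have been collecting lines, close out the previous chapter
            if collecting_chapter:
                chapter_texts.append(' '.join(lines).replace('\n', ' '))
            # Reset lines for the new chapter
            lines = []
            collecting_chapter = True
            continue
        lines.append(line)
    # Last chapter: no following 'Chapter' line triggers the append,
    # so add it here with the same newline handling as the others
    chapter_texts.append(' '.join(lines).replace('\n', ' '))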
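
The same state machine can be tried on a tiny in-memory sample. This is a minimal sketch: `split_chapters` is a hypothetical helper name, the default blank-line filter stands in for `skip_line`, and the sample text is invented. Unlike the cell above, it strips each line before joining, which avoids doubled spaces.

```python
import io


def split_chapters(handle, skip_line=lambda line: not line.strip()):
    """Split a line iterator into (titles, texts).

    A line starting with 'Chapter' marks a new chapter, and the line right
    after it is taken as the title. Hypothetical helper for illustration.
    """
    titles, texts = [], []
    lines = []
    collecting = False
    for line in handle:
        if skip_line(line):
            continue
        if line.startswith('Chapter'):
            # Consume the title line directly from the same iterator
            titles.append(next(handle).strip())
            if collecting:
                texts.append(' '.join(part.strip() for part in lines))
            lines = []
            collecting = True
            continue
        lines.append(line)
    # Flush the final chapter, which has no following 'Chapter' line
    if collecting:
        texts.append(' '.join(part.strip() for part in lines))
    return titles, texts


# Invented sample input
sample = io.StringIO(
    'Chapter 1\n'
    'First Title\n'
    'Line one.\n'
    '\n'
    'Line two.\n'
    'Chapter 2\n'
    'Second Title\n'
    'Line three.\n'
)
titles, texts = split_chapters(sample)
print(titles)
print(texts)
```

Note that any body line that happens to start with the word "Chapter" would be treated as a heading, both here and in the cell above.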

Quick check on the chapter count: the number of titles should match the number of chapter texts.

In [4]:
print(len(chapters))
print(len(chapter_texts))

21
21


And the chapters themselves.

In [5]:
chapters

['Paradise Pickles & Preserves',
 'Pappachi’s Moth',
 'Big Man the Laltain, Small Man the Mombatti',
 'Abhilash Talkies',
 'God’s Own Country',
 'Cochin Kangaroos',
 'Wisdom Exercise Notebooks',
 'Welcome Home, Our Sophie Mol',
 'Mrs. Pillai, Mrs. Eapen, Mrs. Rajagopalan',
 'The River in the Boat',
 'The God of Small Things',
 'Kochu Thomban',
 'The Pessimist and the Optimist',
 'Work is Struggle',
 'The Crossing',
 'A Few Hours Later',
 'Cochin Harbor Terminus',
 'The History House',
 'Saving Ammu',
 'The Madras Mail',
 'The Cost of Living']

Right now, each chapter's text is one big string.

In [6]:
# First 100 characters of Chapter 1
chapter_texts[0][:100]

'May in Ayemenem is a hot, brooding month. The days are long and humid. The river shrinks  and black '

Let's split it into sentences and create a DataFrame.

In [7]:
frames = []
for i, (chapter_title, chapter_text) in enumerate(zip(chapters, chapter_texts)):
    sentences = nltk.sent_tokenize(chapter_text)
    chapter_df = pd.DataFrame({'Sentence': sentences, 
                               'ChapterNum': i + 1, 
                               'ChapterTitle': chapter_title})
    frames.append(chapter_df)
df = pd.concat(frames).reset_index(drop=True)
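
The resulting layout is one row per sentence, with the chapter number and title repeated on every row. The sketch below shows that shape using only the standard library. It assumes a naive regex splitter as a stand-in for `nltk.sent_tokenize` (which handles abbreviations and other cases this does not), and the two chapters are invented.

```python
import re


def naive_sentences(text):
    """Rough stand-in for nltk.sent_tokenize: split after '.', '!' or '?'."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [part for part in parts if part]


# Invented sample data in the same shape as chapters / chapter_texts
chapters = ['First Title', 'Second Title']
chapter_texts = ['One. Two!', 'Three?  Four.']

rows = []
for num, (title, text) in enumerate(zip(chapters, chapter_texts), start=1):
    for sentence in naive_sentences(text):
        rows.append({'Sentence': sentence,
                     'ChapterNum': num,
                     'ChapterTitle': title})

for row in rows:
    print(row)
```

Passing a list of dicts like `rows` to `pd.DataFrame` gives the same three columns as the cell above.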

In [8]:
df.shape

(8529, 3)

In [9]:
df.head()

|   | ChapterNum | ChapterTitle | Sentence |
|---|---|---|---|
| 0 | 1 | Paradise Pickles & Preserves | May in Ayemenem is a hot, brooding month. |
| 1 | 1 | Paradise Pickles & Preserves | The days are long and humid. |
| 2 | 1 | Paradise Pickles & Preserves | The river shrinks and black crows gorge on br... |
| 3 | 1 | Paradise Pickles & Preserves | Red bananas ripen. |
| 4 | 1 | Paradise Pickles & Preserves | Jackfruits burst. |


Nice! We have the sentences in a DataFrame.

Let's save it for later analysis.

In [10]:
df.to_csv('data/Gost_sentences.csv', index=False)

That's it for now. 