Download the text of The Count of Monte Cristo, remove the gutenberg_id column, and remove empty rows.

In [2]:
import re
from rpy2.robjects.packages import importr
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri

pandas2ri.activate()
importr('gutenbergr')

df = ro.r('gutenberg_download("1184")')
df.drop('gutenberg_id', inplace=True, axis=1)
df = df[df['text'] != ''].reset_index(drop=True)

In [3]:
df.iloc[:15]['text']

0                  THE COUNT OF MONTE CRISTO
1                  by Alexandre Dumas [père]
2                                      0009m
3                                      0011m
4                                      0019m
5                                   Contents
6                                 VOLUME ONE
7          Chapter 1. Marseilles—The Arrival
8                  Chapter 2. Father and Son
9                    Chapter 3. The Catalans
10                     Chapter 4. Conspiracy
11             Chapter 5. The Marriage Feast
12    Chapter 6. The Deputy Procureur du Roi
13                Chapter 7. The Examination
14               Chapter 8. The Château d’If
Name: text, dtype: object

Capture the title

In [4]:
title = df.iloc[0]['text']
title

'THE COUNT OF MONTE CRISTO'

Capture the author

In [5]:
author = df.iloc[1]['text']
author = re.search(r'^by (\w+ \w+)', author).group(1)
author

'Alexandre Dumas'
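
A minimal sketch of what the pattern above captures, using the byline shown in the earlier output. The helper name extract_author is hypothetical and not part of the notebook; it only adds a guard for lines that do not match.

```python
import re

# Hypothetical helper (not in the original notebook): returns the first two
# words after a leading "by ", or None when the line is not a byline.
def extract_author(byline):
    match = re.search(r'^by (\w+ \w+)', byline)
    return match.group(1) if match else None

# The trailing "[père]" is left out because "[" is not a word character.
print(extract_author('by Alexandre Dumas [père]'))
print(extract_author('THE COUNT OF MONTE CRISTO'))
```

Note that the pattern assumes a two-word name; a longer name would be cut off after the second word.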

See how many occurrences of "Chapter 1." there are.

In [6]:
df[df.text.str.contains(r'^Chapter 1\.')].index.tolist()

[7, 129]

Remove the text up until Chapter 1 begins (there are two occurrences: the first is the table-of-contents entry, the second is where the text starts)

In [7]:
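# Drop everything up to and including the first "Chapter 1." line (the table-of-contents entry)
df.drop(df.index[:df[df.text.str.contains(r'^Chapter 1\.')].index.tolist()[0]+1], inplace=True)
df.reset_index(inplace=True, drop=True)
# Drop everything before the remaining "Chapter 1." line, where the text begins
df.drop(df.index[:df[df.text.str.contains(r'^Chapter 1\.')].index.tolist()[0]], inplace=True)
df.reset_index(inplace=True, drop=True)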
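
The same trimming can be sketched on a toy frame: find both "Chapter 1." rows, then keep everything from the second one onward. The frame toy and its contents are made up for illustration; only the pattern comes from the notebook.

```python
import pandas as pd

# Toy data (an assumption for illustration): a contents list followed by the text.
toy = pd.DataFrame({'text': [
    'THE COUNT OF MONTE CRISTO',
    'Contents',
    'Chapter 1. Marseilles',      # table-of-contents entry
    'Chapter 2. Father and Son',
    'Chapter 1. Marseilles',      # heading where the text really starts
    'First line of the story.',
]})

# Row labels of every line that starts with "Chapter 1."
hits = toy[toy['text'].str.contains(r'^Chapter 1\.')].index.tolist()

# Keep the second hit and everything after it, then renumber the rows.
trimmed = toy.iloc[hits[1]:].reset_index(drop=True)
print(hits)
print(trimmed['text'].tolist())
```

Slicing once from the second hit avoids the two drop-and-reset passes, at the cost of assuming the pattern matches at least twice.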

Remove lines delineating volumes

In [9]:
df[df.text.str.contains('^VOLUME')].index.tolist()

[9232, 19517, 29381, 37420]

In [10]:
df.drop(df[df.text.str.contains('^VOLUME')].index.tolist(), inplace=True)

Remove the stray lines that consist of four digits followed by an "m" (for example 0009m)

In [18]:
df.drop(df[df.text.str.contains(r'^\d{4}m$')].index.tolist(), inplace=True)
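
A small sketch of how the two removal patterns behave, combined here into one boolean mask. The toy rows are invented for illustration; the anchors ^ and $ are what keep lines that merely contain digits and an "m" from being dropped.

```python
import pandas as pd

# Toy rows (assumed for illustration), mixing noise lines with real text.
toy = pd.DataFrame({'text': [
    'VOLUME TWO',
    '0009m',
    'Chapter 2. Father and Son',
    'He lived 1000m away',   # contains digits + m, but is not a bare marker
    '12345m',                # five digits, so the four-digit pattern does not match
]})

# True for volume headings and for lines that are exactly four digits plus "m".
noise = toy['text'].str.contains(r'^VOLUME|^\d{4}m$')

cleaned = toy[~noise].reset_index(drop=True)
print(cleaned['text'].tolist())
```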

Make a table of contents list with chapters

In [19]:
TOC = df[df.text.str.contains('^Chapter')]['text'].tolist()

In [20]:
len(TOC)

117

Combine the lines of each chapter together

In [21]:
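corpus_dict = {}
i = 0
# Skip the first row (the "Chapter 1." heading) so that corpus_dict[0] lines up with TOC[0]
for k,v in df[1:].to_dict()['text'].items():
    if i not in corpus_dict:
        corpus_dict[i] = []
    if v not in TOC:
        corpus_dict[i].append(v)
    else:
        # A chapter heading closes the current chapter and opens the next one
        i += 1
        corpus_dict[i] = []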

In [22]:
len(corpus_dict)

117
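
The grouping idea can also be sketched in plain Python on a toy list of lines. This is an alternative formulation, not the notebook's code: it keeps each heading next to its lines instead of indexing into TOC, and it assumes the first line is a chapter heading. All names here are hypothetical.

```python
import re

# Toy lines (assumed for illustration); the first line must be a heading.
lines = [
    'Chapter 2. Father and Son',
    'first line of this chapter',
    'second line of this chapter',
    'Chapter 3. The Catalans',
    'only line of this chapter',
]

heading = re.compile(r'^Chapter')

# Each entry is (heading, [body lines]); a new heading opens a new entry.
chapters = []
for line in lines:
    if heading.search(line):
        chapters.append((line, []))
    else:
        chapters[-1][1].append(line)

for name, body in chapters:
    print(name, len(body))
```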

Parse the text into sentences and store them in a dict, in preparation for moving them into a dataframe with columns for title, chapter, sentence and author.

In [23]:
from nltk.tokenize import sent_tokenize
sentence_dict = {}
sentence_dict['title'] = []
sentence_dict['chapter'] = []
sentence_dict['sentence'] = []
sentence_dict['author'] = []
for k,v in corpus_dict.items():
    for sentence in sent_tokenize(' '.join(v)):
        sentence_dict['title'].append(title)
        sentence_dict['chapter'].append(TOC[k])
        sentence_dict['sentence'].append(sentence)
        sentence_dict['author'].append(author)
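
A rough sketch of the one-row-per-sentence layout on toy data. To keep it runnable without NLTK data, a naive regex splitter stands in for sent_tokenize; naive_sentences is a hypothetical helper and is much cruder than the real tokenizer (it would wrongly split after an abbreviation such as "M.").

```python
import re

# Hypothetical stand-in for nltk's sent_tokenize: split after ., ! or ?
# when followed by whitespace. Not suitable for text with abbreviations.
def naive_sentences(text):
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

# Toy chapter data (assumed for illustration): heading -> list of lines.
chapters = {'Chapter 2. Father and Son': ['He went in.', 'Who is there? It is I!']}

rows = []
for chapter, lines in chapters.items():
    # Join the lines first so sentences that span a line break stay whole.
    for sentence in naive_sentences(' '.join(lines)):
        rows.append({'title': 'THE COUNT OF MONTE CRISTO',
                     'chapter': chapter,
                     'sentence': sentence})

for row in rows:
    print(row['chapter'], '|', row['sentence'])
```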

Place the dict into a dataframe so the data can easily be written to a CSV file.

In [24]:
import pandas as pd
final_df = pd.DataFrame.from_dict(sentence_dict)

In [25]:
final_df.sample(5)

      author           chapter                                 sentence                                           title
7515  Alexandre Dumas  Chapter 56. Andrea Cavalcanti           At length I received this letter from your fri...  THE COUNT OF MONTE CRISTO
9668  Alexandre Dumas  Chapter 74. The Villefort Family Vault  You may call on the notary, M. Deschamps, Plac...  THE COUNT OF MONTE CRISTO
9473  Alexandre Dumas  Chapter 73. The Promise                 Oh, if such a thought could present itself, I ...  THE COUNT OF MONTE CRISTO
9327  Alexandre Dumas  Chapter 73. The Promise                 I imitate neither Manfred nor Anthony; but wit...  THE COUNT OF MONTE CRISTO
2106  Alexandre Dumas  Chapter 19. The Third Attack            For fear the letter might be some day lost or ...  THE COUNT OF MONTE CRISTO


In [26]:
final_df.to_csv('./monte_cristo.csv', index=False)
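
A quick sanity check one might run on the export, sketched here with an in-memory buffer instead of a file. The toy frame is an assumption for illustration; the point is that sentences containing commas are quoted on write and come back intact on read.

```python
import io
import pandas as pd

# Toy frame (assumed for illustration) with a sentence that contains commas.
frame = pd.DataFrame({
    'chapter': ['Chapter 2. Father and Son'],
    'sentence': ['Yes, father, I am here.'],
})

# Write to an in-memory buffer rather than ./monte_cristo.csv.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Read it back and confirm the sentence survived the round trip.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored['sentence'].tolist())
```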