Download the text of Pride and Prejudice, remove the gutenberg_id column, and remove rows that are empty.

In [13]:
import re
from rpy2.robjects.packages import importr
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri

pandas2ri.activate()
importr('gutenbergr')

df = ro.r('gutenberg_download("1342")')
df.drop('gutenberg_id', inplace=True, axis=1)
df = df[df['text'] != ''].reset_index(drop=True)
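
The cleaning step can be tried without R. The following is a minimal sketch on a hand-made frame; `raw` and its contents are made up for illustration and only mimic the two columns that the cell above works with.

```python
import pandas as pd

# Hypothetical stand-in for the frame returned by gutenberg_download()
raw = pd.DataFrame({
    "gutenberg_id": [1342, 1342, 1342, 1342],
    "text": ["PRIDE AND PREJUDICE", "", "By Jane Austen", ""],
})

# Drop the ID column, then keep only rows whose text is non-empty
cleaned = raw.drop(columns="gutenberg_id")
cleaned = cleaned[cleaned["text"] != ""].reset_index(drop=True)

print(cleaned)
```

Resetting the index matters here because later cells locate rows by position.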

In [14]:
df.iloc[:15]['text']

0                                   PRIDE AND PREJUDICE
1                                        By Jane Austen
2                                             Chapter 1
3     It is a truth universally acknowledged, that a...
4         of a good fortune, must be in want of a wife.
5     However little known the feelings or views of ...
6     first entering a neighbourhood, this truth is ...
7     of the surrounding families, that he is consid...
8              of some one or other of their daughters.
9     "My dear Mr. Bennet," said his lady to him one...
10                    Netherfield Park is let at last?"
11                  Mr. Bennet replied that he had not.
12    "But it is," returned she; "for Mrs. Long has ...
13                               told me all about it."
14                           Mr. Bennet made no answer.
Name: text, dtype: object

Capture the title

In [15]:
title = df.iloc[0]['text']
title

'PRIDE AND PREJUDICE'

Capture the author

In [16]:
author = df.iloc[1]['text']
author = re.sub(r'^By ','',author)
author

'Jane Austen'
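
Other texts may format the byline slightly differently. A small sketch of a more tolerant version follows; `strip_byline` is a hypothetical helper, not part of the notebook, and it assumes the author line starts with "by" in some capitalisation.

```python
import re

def strip_byline(line):
    """Return the author name from a line such as 'By Jane Austen'.

    Hypothetical helper: ignores case and extra spaces after 'by'.
    """
    return re.sub(r'^by\s+', '', line.strip(), flags=re.IGNORECASE)

print(strip_byline('By Jane Austen'))
```

Requiring whitespace after "by" keeps names such as "Byron" intact.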

In [17]:
df[df.text.str.contains('^Chapter 1$')].index.tolist()

[2]

Remove the text before Chapter 1 (two lines in this case: the title and the author)

In [18]:
df.drop(df.index[:df[df.text.str.contains('^Chapter 1$')].index.tolist()[0]], inplace=True)
df.reset_index(inplace=True, drop=True)
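
The same trimming can be written with a boolean mask and a guard for the case where no heading is found. This is a sketch on made-up rows, assuming the default integer index; `lines` and `body` are illustrative names.

```python
import pandas as pd

# Toy stand-in for the notebook's frame
lines = pd.DataFrame({"text": [
    "PRIDE AND PREJUDICE",
    "By Jane Austen",
    "Chapter 1",
    "It is a truth...",
    "Chapter 10",
]})

# str.match anchors at the start of the string; '$' anchors the end,
# so "Chapter 10" is not mistaken for "Chapter 1"
is_first_chapter = lines["text"].str.match(r"Chapter 1$")
if not is_first_chapter.any():
    raise ValueError("no 'Chapter 1' heading found")

start = is_first_chapter.idxmax()  # index label of the first True value
body = lines.loc[start:].reset_index(drop=True)

print(body)
```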

In [19]:
df.head(5)

                                                text
0                                          Chapter 1
1  It is a truth universally acknowledged, that a...
2      of a good fortune, must be in want of a wife.
3  However little known the feelings or views of ...
4  first entering a neighbourhood, this truth is ...


Make a table of contents: a list of the chapter headings (lines beginning with "Chapter")

In [20]:
TOC = df[df.text.str.contains('^Chapter')]['text'].tolist()

In [21]:
len(TOC)

61

Combine chapter data together

In [22]:
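corpus_dict = {}
i = 0  # position in TOC of the chapter currently being filled
# Skip row 0 (the "Chapter 1" heading); each later heading starts a new chapter
for v in df['text'][1:]:
    if i not in corpus_dict:
        corpus_dict[i] = []
    if v not in TOC:
        # Ordinary text line: add it to the current chapter
        corpus_dict[i].append(v)
    else:
        # Chapter heading: move on to the next chapter
        i += 1
        corpus_dict[i] = []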

In [23]:
len(corpus_dict)

61
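
The grouping can also be expressed with pandas itself. The following is a minimal sketch on made-up lines, assuming every heading has the form "Chapter <number>": a cumulative sum over the heading mask labels each row with its chapter, and a group-by collects the lines. Unlike the loop above, the keys here start at 1, and a chapter with no body lines would simply be absent.

```python
import pandas as pd

# Toy stand-in for the trimmed frame
lines = pd.DataFrame({"text": [
    "Chapter 1",
    "First line.",
    "Second line.",
    "Chapter 2",
    "Third line.",
]})

is_heading = lines["text"].str.match(r"Chapter \d+$")
chapter_no = is_heading.cumsum()  # 1, 1, 1, 2, 2

# Keep only the body lines and collect them per chapter number
body = lines[~is_heading]
chapters = (
    body.groupby(chapter_no[~is_heading])["text"]
        .apply(list)
        .to_dict()
)

print(chapters)
```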

Parse the text into sentences and store them in a dict, in preparation for moving them into a data frame with title, chapter, sentence and author columns.

In [25]:
from nltk.tokenize import sent_tokenize
sentence_dict = {}
sentence_dict['title'] = []
sentence_dict['chapter'] = []
sentence_dict['sentence'] = []
sentence_dict['author'] = []
for k,v in corpus_dict.items():
    for sentence in sent_tokenize(' '.join(v)):
        sentence_dict['title'].append(title)
        sentence_dict['chapter'].append(TOC[k])
        sentence_dict['sentence'].append(sentence)
        sentence_dict['author'].append(author)
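
To see the shape of the result without installing NLTK and its tokenizer data, here is a sketch that swaps in a crude regular-expression splitter. `naive_sentences` is a hypothetical stand-in for `sent_tokenize`, not an equivalent: it would wrongly split after abbreviations such as "Mr.", which is one reason to use the real tokenizer. The sample lines are made up.

```python
import re

def naive_sentences(text):
    """Crude stand-in for sent_tokenize: split after '.', '!' or '?'
    when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

title, author = "PRIDE AND PREJUDICE", "Jane Austen"
toc = ["Chapter 1", "Chapter 2"]
# Lines are wrapped mid-sentence, as in the downloaded text
chapter_lines = {
    0: ["It is a truth universally", "acknowledged. Is that so?"],
    1: ["He made no answer."],
}

# One record per sentence, carrying the book and chapter metadata
rows = [
    {"title": title, "author": author, "chapter": toc[k], "sentence": s}
    for k, text_lines in chapter_lines.items()
    for s in naive_sentences(" ".join(text_lines))
]

for row in rows:
    print(row["chapter"], "|", row["sentence"])
```

Joining the lines of a chapter before splitting is what lets a sentence that spans two lines come out whole.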

Place the dict into a data frame so the data can easily be written to a CSV file.

In [26]:
import pandas as pd
final_df = pd.DataFrame.from_dict(sentence_dict)

In [27]:
final_df.sample(5)

           author     chapter                                           sentence                title
2031  Jane Austen  Chapter 24                         Let Wickham be _your_ man.  PRIDE AND PREJUDICE
1083  Jane Austen  Chapter 14  May I ask whether these pleasing attentions pr...  PRIDE AND PREJUDICE
 595  Jane Austen   Chapter 8                My dear Charles, what do you mean?"  PRIDE AND PREJUDICE
4654  Jane Austen  Chapter 51  And it was settled that we should all be there...  PRIDE AND PREJUDICE
3118  Jane Austen  Chapter 38  There is in everything a most remarkable resem...  PRIDE AND PREJUDICE


In [28]:
final_df.to_csv('./pride_prejudice.csv', index=False)
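
Because the sentences contain commas and quotation marks, it can be worth checking that the file reads back unchanged. Here is a sketch of such a round trip on two made-up rows, written to an in-memory buffer rather than to disk; `frame`, `buffer` and `restored` are illustrative names.

```python
import io
import pandas as pd

frame = pd.DataFrame({
    "author": ["Jane Austen", "Jane Austen"],
    "chapter": ["Chapter 8", "Chapter 24"],
    "sentence": [
        'My dear Charles, what do you mean?"',
        "Let Wickham be _your_ man.",
    ],
    "title": ["PRIDE AND PREJUDICE"] * 2,
})

# Same to_csv call as above, but into memory instead of a file
buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# Read the CSV text back and compare it with the original rows
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.to_dict("records") == frame.to_dict("records"))
```

By default `to_csv` quotes any field that contains a comma or a quotation mark, so such sentences stay in a single column.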