Jane Austen is one of the most famous authors of all time. Her work helped establish the novel of manners, along with the female protagonist and the female-driven plot. In this notebook, I explore some characteristics of three of her most popular works: Pride and Prejudice, Emma, and Sense and Sensibility.
I theorize that these three are the most popular because of their prominent romance plotlines and their heroines, who are often characterized as "strong" female leads, in contrast with someone like Fanny Price from Mansfield Park.

Questions to investigate:
- Length of her novels: is there a trend here? 
- Sentiment Analysis: do they follow a common formula? Is there an Austen-esque sentiment trajectory followed by these novels?
- Amount of dialogue: Austen's novels are known for satirical dialogue that makes up a large fraction of the work, so exactly how large a role does dialogue play?
- Can we identify common and differing themes across these novels using topic modeling?

In the future:
- Using n-grams
- Labelling the graphs with a hover function
- Text Generation in Austen's style
- Preprocessing the stopwords for text generation

Learning goals for this project:
- Sentiment Analysis with TextBlob
- Data visualization with Bokeh
- Topic Modeling with LDA


In [2]:
import pandas as pd
import re
import nltk
import numpy as np
import itertools

In [3]:
nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

Since Emma and Sense and Sensibility are built into the NLTK Gutenberg corpus, we only need to download the text file for Pride and Prejudice ourselves.

In [4]:
from nltk.corpus import gutenberg
emma = gutenberg.raw('austen-emma.txt')

emma_data = emma.split('[Emma by Jane Austen 1816]')[1].split('FINIS')[0].strip()
emma_data

'VOLUME I\n\nCHAPTER I\n\n\nEmma Woodhouse, handsome, clever, and rich, with a comfortable home\nand happy disposition, seemed to unite some of the best blessings\nof existence; and had lived nearly twenty-one years in the world\nwith very little to distress or vex her.\n\nShe was the youngest of the two daughters of a most affectionate,\nindulgent father; and had, in consequence of her sister\'s marriage,\nbeen mistress of his house from a very early period.  Her mother\nhad died too long ago for her to have more than an indistinct\nremembrance of her caresses; and her place had been supplied\nby an excellent woman as governess, who had fallen little short\nof a mother in affection.\n\nSixteen years had Miss Taylor been in Mr. Woodhouse\'s family,\nless as a governess than a friend, very fond of both daughters,\nbut particularly of Emma.  Between _them_ it was more the intimacy\nof sisters.  Even before Miss Taylor had ceased to hold the nominal\noffice of governess, the mildness of h

In [5]:
sense = gutenberg.raw('austen-sense.txt')
sense_data = sense.split('[Sense and Sensibility by Jane Austen 1811]')[1].split('THE END')[0].strip()
sense_data



In [6]:
with open(r'..Jane Austen\texts\pride.txt', encoding = 'utf-8') as myfile:
    pride_data = myfile.read().split('Chapter 61')[1].split('End of the Project Gutenberg EBook of Pride and Prejudice, by Jane Austen')[0].strip()


Now that we have the raw text of each novel, we can start with some elementary cleaning, using regexes to get rid of the \n newline characters and extra whitespace.
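
The cleaning steps applied in the next cells can be gathered into one helper. This is a minimal sketch rather than the notebook's own code: the name `normalize_text` and the sample string are assumptions made for illustration.

```python
import re

def normalize_text(raw):
    """Collapse whitespace and straighten curly quotes (illustrative helper)."""
    text = re.sub(r"\s+", " ", raw)             # newlines, tabs, runs of spaces -> one space
    text = re.sub("[\u2018\u2019]", "'", text)  # curly single quotes -> '
    text = re.sub("[\u201c\u201d]", '"', text)  # curly double quotes -> "
    return text.strip()

# Sample built from the opening of Pride and Prejudice shown in this notebook.
sample = "\u201cMy dear Mr. Bennet,\u201d said his lady\n      to him one day"
print(normalize_text(sample))
```

Applying the same helper to all three novels would keep the cleaning identical across texts, which matters when their statistics are compared later.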

In [7]:
pride_data

'Chapter 1\n\n      It is a truth universally acknowledged, that a single man in\n      possession of a good fortune, must be in want of a wife.\n\n      However little known the feelings or views of such a man may be\n      on his first entering a neighbourhood, this truth is so well\n      fixed in the minds of the surrounding families, that he is\n      considered the rightful property of some one or other of their\n      daughters.\n\n      “My dear Mr. Bennet,” said his lady to him one day, “have you\n      heard that Netherfield Park is let at last?”\n\n      Mr. Bennet replied that he had not.\n\n      “But it is,” returned she; “for Mrs. Long has just been here, and\n      she told me all about it.”\n\n      Mr. Bennet made no answer.\n\n      “Do you not want to know who has taken it?” cried his wife\n      impatiently.\n\n      “_You_ want to tell me, and I have no objection to hearing it.”\n\n      This was invitation enough.\n\n      “Why, my dear, you must know, Mrs. Long 

In [8]:
pride_data = str(re.sub(r"\s+", " ", pride_data))
pride_data = re.sub(u"(\u2018|\u2019)", "'", pride_data)
pride_data = re.sub(u"(\u201c|\u201d)", '"', pride_data)


In [9]:
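# gutenberg.raw() returns a str, and split()/strip() keep it a str, so
# emma_data and sense_data are already strings. ''.join() on a string
# returns it unchanged; the helper is kept only as a defensive step.

def convertTuple(tup):
    # Avoid naming the result `str`, which would shadow the built-in type
    return ''.join(tup)

emma_data = convertTuple(emma_data)
sense_data = convertTuple(sense_data)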

emma_data = re.sub('(--)|(_)', ' ', emma_data)
emma_data = re.sub(r"\s+", " ", emma_data)
emma_data = re.sub(u"(\u2018|\u2019)", "'", emma_data)
emma_data = re.sub(u"(\u201c|\u201d)", '"', emma_data)
emma_data = re.sub(r'VOLUME [IXV]+\s+', '', emma_data)


sense_data = re.sub('(--)|(_)', ' ', sense_data)
sense_data = re.sub(r"\s+", " ", sense_data)
sense_data = re.sub(u"(\u2018|\u2019)", "'", sense_data)  # same quote handling as Emma
sense_data = re.sub(u"(\u201c|\u201d)", '"', sense_data)
sense_data = re.sub(r'VOLUME [IXV]+\s+', '', sense_data)

print(type(sense_data))
print(type(emma_data))
print(type(pride_data))


<class 'str'>
<class 'str'>
<class 'str'>


Now that all three texts are cleaned strings, we can compare vocabulary sizes:

In [10]:
def get_word_set(text):
    text = [word.lower() for word in re.findall(r"[A-Za-z']+", text)]
    print(len(text)) #total number of words for each novel
    word_set = set(text)
    word_set.discard('')
    return word_set

pride_set = get_word_set(pride_data)
emma_set = get_word_set(emma_data)
sense_set = get_word_set(sense_data)


print(len(pride_set))
print(len(emma_set))
print(len(sense_set))


120918
161085
119937
6350
7222
6410
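
The first three numbers printed above are total word counts and the last three are unique word counts, in the order Pride and Prejudice, Emma, Sense and Sensibility. A small variation that returns both counts instead of printing one of them would let the chart data be computed rather than typed in by hand. The name `word_stats` is an assumption for this sketch; the tokenizing regex is the one used above.

```python
import re

def word_stats(text):
    """Return (total_words, unique_words) for a text, ignoring case."""
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    return len(words), len(set(words))

total, unique = word_stats("It is a truth universally acknowledged, that it is so.")
print(total, unique, round(unique / total, 3))
```

Dividing unique by total words for the counts above gives roughly 5.3% for Pride and Prejudice, 4.5% for Emma, and 5.3% for Sense and Sensibility, so the longest novel has the lowest share of distinct words.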


Graph the vocabulary sizes for each novel:

In [11]:
from bokeh.core.properties import value
from bokeh.io import show, output_notebook
from bokeh.models import ColumnDataSource, LabelSet
from bokeh.plotting import figure
from bokeh.transform import dodge
from bokeh.resources import INLINE

output_notebook(INLINE) #for inline plotting

texts = ['Pride and Prejudice', 'Emma', 'Sense and Sensibility']

data = {'texts' : texts,
        'Total Words'   : [120918, 161085, 119937],
        'Unique Words'   : [6350, 7222 , 6410],
        'list1'   : ['120,918', '161,085', '119,937'],
        'list2'   : ['6,350', '7,222' , '6,410'],
        'x1' : [.5,1.5,2.5],
        'x2' : [.6,1.6,2.6]
       }

source = ColumnDataSource(data=data)

p = figure(x_range=texts, plot_height=400, plot_width=650, title="Word Count For Each Novel", y_range=[0,200000],
           toolbar_location=None, tools="")

p.vbar(x=dodge('texts', -0.1, range=p.x_range), top='Total Words', width=0.2, source=source,
       color='#580C6D', alpha=.6, legend_label="Total Words")

p.vbar(x=dodge('texts',  0.1,  range=p.x_range), top='Unique Words', width=0.2, source=source,
       color='#E165C1', alpha=.6, legend_label="Unique Words")

labels = LabelSet(x='x1', y='Total Words', text='list1', level='glyph', source=source, 
                  text_align='left', angle=.8, text_font_size='8pt', text_font = "times")
p.add_layout(labels)

labels = LabelSet(x='x2', y='Unique Words', text='list2', level='glyph', source=source, 
                  text_align='left', angle=.8, text_font_size='8pt', text_font = "times")
p.add_layout(labels)

p.x_range.range_padding = 0.1
p.legend.location = "top_right"
p.legend.label_text_font = "times"
p.legend.orientation = "horizontal"
p.legend.label_text_font_size = "10pt"
p.title.align = "center"
p.title.text_font = "times"
p.title.text_font_size = "20px"

show(p)


## Sentiment Analysis:

In [12]:
pap_df = pd.DataFrame([re.findall(r'Chapter \d+', pride_data), re.split(r'\s+Chapter \d+\s+', pride_data)], 
                  index=['chapter', 'text']).T
pap_df.set_index('chapter', inplace=True)

def num_sentences(group):
    return len(nltk.sent_tokenize(group))

#find number of sentences per chapter
pap_df['sentences'] = pap_df['text'].apply(num_sentences)

#shift index down one, so first label is at location 0
pap_df['sentences'] = np.insert(pap_df['sentences'].values, 0, 0)[:-1]
#get cumulative sum, for location (starting at 0)
pap_df['sentences'] = pap_df['sentences'].cumsum()

pap_df['text'] = pap_df['text'].apply(nltk.sent_tokenize)

pap_df = pap_df.reset_index()
pap_df.loc[(pap_df.index % 5 != 0) ^ (pap_df.index ==max(pap_df.index)), 'chapter'] = ''

pap_df.set_index('sentences', inplace=True)


In [13]:
#Sentiment analysis
from textblob import TextBlob
from bokeh.models import ColumnDataSource, Label, LabelSet, Arrow, VeeHead, Toggle, CustomJS
from bokeh.plotting import figure, show, output_file, output_notebook
from bokeh.models.tickers import FixedTicker
from bokeh.layouts import layout
output_notebook()

p_pap = figure(plot_width=650, plot_height=350, title='Pride and Prejudice Sentiment Progression',
           tools=['pan,reset,wheel_zoom'], y_range=[-.05,.3])

# add a line renderer (rolling average by groups of 100)
pap_conv = np.convolve([TextBlob(s).sentiment.polarity for s in nltk.sent_tokenize(pride_data)], np.ones((100,))/100, mode='valid')
p_pap.line([i+1 for i in range(len(pap_conv))], pap_conv, line_width=1, color='#7B0B5E')

p_pap.xgrid.grid_line_color = None
p_pap.xaxis.major_label_orientation = 1
p_pap.x_range.range_padding = 0.05
p_pap.title.align = "center"
p_pap.title.text_font_size = "20px"

p_pap.xaxis.ticker = FixedTicker(ticks=pap_df.index.values)

p_pap.xaxis.major_label_overrides = pap_df.reset_index().astype(str).set_index('sentences').to_dict()['chapter']

show(p_pap)

This progression makes sense within the context of the story's events. The deepest plunges in sentiment are seen around Chapters 40 and 46. In Chapter 40, Elizabeth recounts to Jane Darcy's tumultuous first proposal (made in Chapter 34), in which she expressed disdain and anger towards him. Chapter 46 is when news of Lydia's elopement arrives. The gradual return to positive sentiment towards the end mirrors the storyline, as a series of actions, events, and confessions carries the novel toward its final, happy conclusion.
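
The curve is a 100-sentence moving average of per-sentence polarity. The sketch below shows the same smoothing in plain Python on a tiny made-up list. It should match what `np.convolve` with a uniform kernel and `mode='valid'` computes, but the function name `rolling_mean` and the scores are assumptions for illustration.

```python
def rolling_mean(scores, window):
    """Mean of every run of `window` consecutive scores, with no padding at the ends."""
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]

# Made-up polarity scores, for illustration only (not taken from the novels).
scores = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5]
smoothed = rolling_mean(scores, window=3)
print(len(smoothed))  # len(scores) - window + 1
print(smoothed)
```

Each smoothed value is plotted at the position where its window starts, so a point on the chart summarizes that sentence and the 99 after it. An event can therefore show up on the x-axis up to about 100 sentences before the chapter tick where it actually occurs.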

In [14]:

emma_df1 = pd.DataFrame([re.findall(r'CHAPTER [IVX]+', emma_data), re.split(r'\s+CHAPTER [IVX]+\s+', emma_data)],
                       index = ['chapter', 'text']).T

#find number of sentences per chapter
emma_df1['sentences'] = emma_df1['text'].apply(num_sentences)

#shift index down one, so first label is at location 0
emma_df1['sentences'] = np.insert(emma_df1['sentences'].values, 0, 0)[:-1]
#get cumulative sum, for location (starting at 0)
emma_df1['sentences'] = emma_df1['sentences'].cumsum()

emma_df1['text'] = emma_df1['text'].apply(nltk.sent_tokenize)

emma_df1.reset_index(inplace=True)
emma_df1.loc[(emma_df1.index % 5 != 0) ^ (emma_df1.index ==max(emma_df1.index)), 'chapter'] = ''

emma_df1.set_index('sentences', inplace=True)
del emma_df1['index']


In [15]:
output_notebook()

p_emma = figure(plot_width=650, plot_height=350, title='Emma Sentiment Progression',
           tools=['pan,reset,wheel_zoom'], y_range=[-.05,.3])

# add a line renderer (rolling average by groups of 100)
emma_conv = np.convolve([TextBlob(s).sentiment.polarity for s in nltk.sent_tokenize(emma_data)], np.ones((100,))/100, mode='valid')
p_emma.line([i+1 for i in range(len(emma_conv))], emma_conv, line_width=1, color='#F9047E')

p_emma.xgrid.grid_line_color = None
p_emma.xaxis.major_label_orientation = 1
p_emma.x_range.range_padding = 0.05
p_emma.title.align = "center"
p_emma.title.text_font_size = "20px"

p_emma.xaxis.ticker = FixedTicker(ticks=emma_df1.index.values)

p_emma.xaxis.major_label_overrides = emma_df1.reset_index().astype(str).set_index('sentences').to_dict()['chapter']

show(p_emma)


At first glance, there are far more peaks and troughs in Emma's sentiment progression than in Pride and Prejudice's. This reflects the protagonist herself, whose opinions are often at odds with those of wiser characters. Emma is well known for its wit and satire, even among Austen's works, and its comical events and misunderstandings give the plot the same erratic quality. Notable moments include: 
- The positive peak during Chapter 13, where Emma is confident that her matchmaking efforts for Harriet are working. The chapter sets up her alarming realization that she was in fact wrong, leading to a steep decline in positive sentiment. 
- The final dip in positive sentiment towards the end of the novel, when Emma is convinced that Knightley does not love her back. Her fears are assuaged fairly soon, and the novel ends on a high note.  

In [16]:
sense_df = pd.DataFrame([re.findall(r'CHAPTER \d+', sense_data), re.split(r'\s+CHAPTER \d+\s+', sense_data)], 
                  index=['chapter', 'text']).T
sense_df.set_index('chapter', inplace=True)

#find number of sentences per chapter
sense_df['sentences'] = sense_df['text'].apply(num_sentences)

#shift index down one, so first label is at location 0
sense_df['sentences'] = np.insert(sense_df['sentences'].values, 0, 0)[:-1]
#get cumulative sum, for location (starting at 0)
sense_df['sentences'] = sense_df['sentences'].cumsum()

sense_df['text'] = sense_df['text'].apply(nltk.sent_tokenize)

sense_df = sense_df.reset_index()
sense_df.loc[(sense_df.index % 5 != 0) ^ (sense_df.index ==max(sense_df.index)), 'chapter'] = ''

sense_df.set_index('sentences', inplace=True)


In [17]:
output_notebook()

p_sense = figure(plot_width=650, plot_height=350, title='Sense and Sensibility Sentiment Progression',
           tools=['pan,reset,wheel_zoom'], y_range=[-.05,.3])

# add a line renderer (rolling average by groups of 100)
sense_conv = np.convolve([TextBlob(s).sentiment.polarity for s in nltk.sent_tokenize(sense_data)], np.ones((100,))/100, mode='valid')
p_sense.line([i+1 for i in range(len(sense_conv))], sense_conv, line_width=1, color='#C40243')

p_sense.xgrid.grid_line_color = None
p_sense.xaxis.major_label_orientation = 1
p_sense.x_range.range_padding = 0.05
p_sense.title.align = "center"
p_sense.title.text_font_size = "20px"

p_sense.xaxis.ticker = FixedTicker(ticks=sense_df.index.values)

p_sense.xaxis.major_label_overrides = sense_df.reset_index().astype(str).set_index('sentences').to_dict()['chapter']

show(p_sense)


- This is the calmest trajectory we've seen so far, which is in line with the temperament of the novel's heroine, Elinor Dashwood. 
- The novel starts on a fairly low note, as readers are told of the Dashwoods' unfortunate family circumstances. The trough in Chapter 15 could reflect the gradual reveal of Willoughby's poor character. The chapter ends with Marianne, who "avoided the looks of them all, could neither eat nor speak, and after some time . . . burst into tears and left the room." 
- The peak in positivity at Chapter 20 does not seem to reflect the novel's events: in that chapter the Dashwoods are irritated by the pretense of others. However, the chapter is also full of flattering conversation, which could be what TextBlob picked up on. A similar phenomenon occurs in Chapters 32-33, where Austen's satire is at its peak. With phrases like "a person and face of strong, natural, sterling insignificance," the sarcasm can be hard for a sentiment model to pick up on. 
- The peak in Chapters 45-46 seems well deserved, as this is where it is revealed that Colonel Brandon is in love with Marianne. The final chapters are in a similar vein to Emma's, in that the protagonist believes her feelings are unrequited. Eventually, both Elinor and Marianne are proposed to, and the novel ends on a happy note.

# Finding the amounts of dialogue in each novel

In [22]:
def find_dialogue_ratio(text):
    text = re.sub(r"((?<=[^\w])\'|\'(?=[^\w]))", '"', text)
    dialogue_list = ' '.join(re.findall(r'\"[^\"]+\"', text)).split()
    full_list = text.split()
    return len(dialogue_list)/float(len(full_list))

print(find_dialogue_ratio(pride_data))
print(find_dialogue_ratio(emma_data))
print(find_dialogue_ratio(sense_data))


0.5231143956838719
0.518130133446251
0.45213508700051


So from this we see that approximately half of each novel is dialogue, with Pride and Prejudice having the largest share (about 52%) and Sense and Sensibility the smallest (about 45%). This makes sense, because Austen's novels are considered to be conversation-driven rather than plot-driven.
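
To make the ratio concrete, here is the same counting idea applied to a single line from Pride and Prejudice. It is a simplified sketch: the name `dialogue_ratio` is an assumption, and unlike the function above it skips the step that rewrites apostrophes as double quotes.

```python
import re

def dialogue_ratio(text):
    """Fraction of whitespace-separated tokens that fall inside double quotes."""
    quoted = re.findall(r'"[^"]+"', text)
    dialogue_words = " ".join(quoted).split()
    return len(dialogue_words) / len(text.split())

sample = '"But it is," returned she; "for Mrs. Long has just been here."'
print(dialogue_ratio(sample))  # 10 of the 12 tokens are inside quotes
```

Two caveats apply to the figures above. The apostrophe rewrite in `find_dialogue_ratio` also turns the apostrophe in a plural possessive such as "sisters'" into a quotation mark, which can shift how later quotes pair up. Speeches that run across several paragraphs without a closing quote would be mis-paired in the same way. The ratios are therefore best read as estimates.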

# Topic Modeling Using LDA

In [23]:
#Removing stopwords and cleaning data

data = [['Pride and Prejudice', pride_data], ['Emma', emma_data], ['Sense and Sensibility', sense_data]] 
data_df = pd.DataFrame(data, columns = ['novel', 'text']) 
data_df.head()



Unnamed: 0,novel,text
0,Pride and Prejudice,Chapter 1 It is a truth universally acknowledg...
1,Emma,"CHAPTER I Emma Woodhouse, handsome, clever, an..."
2,Sense and Sensibility,CHAPTER 1 The family of Dashwood had long been...


In [24]:
import string
def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

round1 = lambda x: clean_text_round1(x)
data_clean = pd.DataFrame(data_df.text.apply(round1))
data_clean

Unnamed: 0,text
0,chapter it is a truth universally acknowledge...
1,chapter i emma woodhouse handsome clever and r...
2,chapter the family of dashwood had long been ...


In [25]:
data_df.to_pickle("corpus.pkl")
from nltk.corpus import stopwords
modified_stopwords = stopwords.words("english")

pride_names = [['Elizabeth Bennet', 'Ms. Elizabeth Bennet', 'Elizabeth', 'Eliza', 'Lizzie', 'Ms. Elizabeth', 'Miss', 'Bennets'],
              ['Fitzwilliam Darcy', 'Mr. Darcy', 'Darcy', 'Mr','Mrs', 'Ms','bingley'],
              ['Jane Bennet', 'Jane', 'Ms. Bennet', 'Ms. Jane Bennet', 'Bennet'],
              ['Mrs. Bennet', 'Mr. Bennet', 'Lydia Bennet', 'Lydia', 'Mary Bennet', 'Kitty', 'Catherine Bennet'],
              ['Charles Bingley', 'Mr. Bingley', 'George Wickham','Mr. Wickham', 'Mr. Collins','Mrs. Collins', 'Charlotte', 'Bennets'], 
              ['Charlotte Lucas', 'Brighton', 'Meryton', 'Longbourn', 'Pemberley', 'London', 'collins','wickham']]

emma_names = [['Emma Woodhouse', 'Emma', 'Ms. Woodhouse', 'Woodhouse', 'Woodhouses','Dashwoods\'', 'Mr','Mrs', 'Ms'], 
             ['Mr. Knightley', 'Knightley'],
             ['Frank Churchill', 'Frank', 'Mr. Churchill'],
             ['Jane Fairfax', 'Jane', 'Ms. Fairfax'],
             ['Harriet Smith', 'Harriet', 'Ms. Smith'], 
             ['Miss Bates', 'Bates'],
             ['Mrs. Weston', 'Taylor', 'Taylor Weston'],
             ['Mr. Elton', 'Elton'],
             ['Donwell Abbey', 'Hartfield', 'Highbury', 'Randalls', 'Fairfax', 'weston', 'churchill']]

sense_names = [['Elinor Dashwood', 'Elinor', 'Miss Dashwood', 'Ms.', 'Dashwood'],
            ['Marianne Dashwood', 'Marianne'],
            ['Colonel Brandon', 'Brandon'],
            ['John Willoughby', 'Willoughby', 'Mr. Willoughby'],
            ['Edward Ferrars', 'Edward', 'Mr. Ferrars'],
            ['Miss Grey', 'Sophia', 'Grey'],
            ['Lucy Steele', 'Miss Steele', 'Steele'],
            ['Mrs. Jennings', 'Jennings','Mr','Mrs', 'Ms']]
              
              
#appending these lists to stopwords
names = [pride_names, emma_names, sense_names]
for book in names:
    for sublist in book:
        for name in sublist:
            modified_stopwords.append(name.lower())
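
One thing to watch with this list: `CountVectorizer` compares stop words against single tokens, and its default token pattern keeps only runs of two or more word characters. Entries such as 'elizabeth bennet' or 'mr.' therefore never match a token and are not removed. A possible fix, sketched here with the hypothetical helper `names_to_stop_tokens`, is to break every name into tokens with that same pattern before appending.

```python
import re

# Default CountVectorizer token pattern: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def names_to_stop_tokens(name_groups):
    """Flatten nested name lists into a sorted list of single-word, lowercase tokens."""
    tokens = set()
    for group in name_groups:
        for name in group:
            tokens.update(TOKEN_PATTERN.findall(name.lower()))
    return sorted(tokens)

# Small made-up subset of the name lists above, for illustration.
example_names = [["Elizabeth Bennet", "Mr. Darcy"], ["Mrs. Bennet", "Longbourn"]]
print(names_to_stop_tokens(example_names))
```

Extending `modified_stopwords` with these tokens, instead of the raw names, should remove the character and place names that would otherwise leak into the topics.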


In [26]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words = modified_stopwords)
data_cv = cv.fit_transform(data_clean.text)
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names())
data_dtm.index = data_clean.index
data_dtm



Unnamed: 0,abandoned,abatement,abbey,abbeyland,abbeymill,abbots,abdys,abhor,abhorred,abhorrence,...,younger,youngest,youth,youthful,youths,zeal,zealous,zealously,zigzags,éclat
0,0,1,0,0,0,0,0,0,0,6,...,29,13,8,0,1,0,0,0,0,1
1,0,0,24,0,7,1,1,1,1,0,...,2,4,11,2,0,4,0,0,1,0
2,1,1,0,1,0,0,0,1,2,4,...,6,4,7,2,0,2,2,1,0,0


In [27]:
import pickle
data_df.to_pickle("corpus.pkl")
data_clean.to_pickle('data_clean.pkl')
data_dtm.to_pickle("dtm.pkl")
pickle.dump(cv, open("cv.pkl", "wb"))


In [28]:
data = pd.read_pickle('dtm.pkl')
data

Unnamed: 0,abandoned,abatement,abbey,abbeyland,abbeymill,abbots,abdys,abhor,abhorred,abhorrence,...,younger,youngest,youth,youthful,youths,zeal,zealous,zealously,zigzags,éclat
0,0,1,0,0,0,0,0,0,0,6,...,29,13,8,0,1,0,0,0,0,1
1,0,0,24,0,7,1,1,1,1,0,...,2,4,11,2,0,4,0,0,1,0
2,1,1,0,1,0,0,0,1,2,4,...,6,4,7,2,0,2,2,1,0,0


In [29]:
from gensim import matutils, models
import scipy.sparse

In [30]:
tdm = data.transpose()
tdm.head()

Unnamed: 0,0,1,2
abandoned,0,0,1
abatement,1,0,1
abbey,0,24,0
abbeyland,0,0,1
abbeymill,0,7,0


In [31]:
sparse_counts = scipy.sparse.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(sparse_counts)

In [32]:
cv = pickle.load(open("cv.pkl", "rb"))
id2word = dict((v, k) for k, v in cv.vocabulary_.items())

In [33]:
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=4, passes=10)
lda.print_topics()

[(0,
  '0.011*"could" + 0.010*"would" + 0.008*"said" + 0.006*"much" + 0.006*"must" + 0.006*"one" + 0.006*"every" + 0.005*"know" + 0.004*"time" + 0.004*"though"'),
 (1,
  '0.012*"could" + 0.012*"would" + 0.008*"must" + 0.007*"much" + 0.007*"said" + 0.006*"one" + 0.006*"every" + 0.006*"thing" + 0.006*"think" + 0.006*"well"'),
 (2,
  '0.001*"could" + 0.001*"would" + 0.001*"said" + 0.001*"much" + 0.001*"must" + 0.001*"one" + 0.001*"time" + 0.001*"think" + 0.001*"well" + 0.001*"every"'),
 (3,
  '0.003*"could" + 0.002*"would" + 0.002*"much" + 0.002*"must" + 0.001*"said" + 0.001*"one" + 0.001*"well" + 0.001*"little" + 0.001*"every" + 0.001*"know"')]

In [34]:
from nltk import word_tokenize, pos_tag

def nouns(text):
    '''Given a string of text, tokenize the text and pull out only the nouns.'''
    is_noun = lambda pos: pos[:2] == 'NN'
    tokenized = word_tokenize(text)
    all_nouns = [word for (word, pos) in pos_tag(tokenized) if is_noun(pos)] 
    return ' '.join(all_nouns)

data_clean = pd.read_pickle('data_clean.pkl')
data_clean

Unnamed: 0,text
0,chapter it is a truth universally acknowledge...
1,chapter i emma woodhouse handsome clever and r...
2,chapter the family of dashwood had long been ...


In [35]:
data_nouns = pd.DataFrame(data_clean.text.apply(nouns))
data_nouns

Unnamed: 0,text
0,chapter truth man possession fortune want wife...
1,chapter i clever home disposition blessings ex...
2,chapter family dashwood estate residence norla...


In [36]:
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Re-add the additional stop words since we are recreating the document-term matrix
# add_stop_words = ['like', 'im', 'know', 'just', 'dont', 'thats', 'right', 'people',
#                   'youre', 'got', 'gonna', 'time', 'think', 'yeah', 'said']
# stop_words = text.ENGLISH_STOP_WORDS.union(add_stop_words)

# Recreate a document-term matrix with only nouns
cvn = CountVectorizer(stop_words=modified_stopwords)
data_cvn = cvn.fit_transform(data_nouns.text)
data_dtmn = pd.DataFrame(data_cvn.toarray(), columns=cvn.get_feature_names())
data_dtmn.index = data_nouns.index
data_dtmn



Unnamed: 0,abatement,abbey,abbeyland,abbeymill,abbots,abdys,abhor,abhorrence,abilities,ability,...,york,yorkshire,youll,younge,younger,youth,youths,zeal,zigzags,éclat
0,1,0,0,0,0,0,0,6,6,0,...,1,0,1,4,3,8,1,0,0,1
1,0,19,0,1,1,1,0,0,3,0,...,1,7,0,0,0,10,0,4,1,0
2,1,0,1,0,0,0,1,4,9,3,...,0,0,0,0,0,6,0,2,0,0


In [37]:
corpusn = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmn.transpose()))

# Create the vocabulary dictionary
id2wordn = dict((v, k) for k, v in cvn.vocabulary_.items())


ldan = models.LdaModel(corpus=corpusn, num_topics=5, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.011*"time" + 0.010*"sister" + 0.009*"mother" + 0.009*"nothing" + 0.009*"thing" + 0.007*"house" + 0.007*"day" + 0.007*"john" + 0.006*"colonel" + 0.006*"heart"'),
 (1,
  '0.002*"time" + 0.002*"sister" + 0.001*"man" + 0.001*"day" + 0.001*"sisters" + 0.001*"letter" + 0.001*"family" + 0.001*"nothing" + 0.001*"father" + 0.001*"pleasure"'),
 (2,
  '0.016*"thing" + 0.011*"time" + 0.010*"nothing" + 0.009*"man" + 0.006*"body" + 0.006*"father" + 0.006*"friend" + 0.006*"day" + 0.006*"way" + 0.005*"home"'),
 (3,
  '0.010*"time" + 0.009*"nothing" + 0.008*"sister" + 0.007*"family" + 0.007*"man" + 0.006*"day" + 0.006*"father" + 0.005*"room" + 0.005*"letter" + 0.005*"mother"'),
 (4,
  '0.001*"time" + 0.001*"thing" + 0.001*"man" + 0.001*"nothing" + 0.001*"day" + 0.001*"home" + 0.001*"sister" + 0.001*"friend" + 0.001*"pleasure" + 0.001*"way"')]

In [38]:
def nouns_adj(text):
    '''Given a string of text, tokenize the text and pull out only the nouns and adjectives.'''
    is_noun_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'
    tokenized = word_tokenize(text)
    nouns_adj = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj(pos)] 
    return ' '.join(nouns_adj)

data_nouns_adj = pd.DataFrame(data_clean.text.apply(nouns_adj))

cvna = CountVectorizer(stop_words=modified_stopwords, max_df=.8)
data_cvna = cvna.fit_transform(data_nouns_adj.text)
data_dtmna = pd.DataFrame(data_cvna.toarray(), columns=cvna.get_feature_names())
data_dtmna.index = data_nouns_adj.index
data_dtmna



Unnamed: 0,abatement,abbey,abbeyland,abbeymill,abbots,abdys,abhor,abhorrence,abhorrent,ability,...,york,yorkshire,youll,younge,youthful,youths,zeal,zealous,zigzags,éclat
0,1,0,0,0,0,0,0,6,1,0,...,1,0,1,4,0,1,0,0,0,1
1,0,19,0,3,1,1,0,0,0,0,...,1,7,0,0,2,0,4,0,1,0
2,1,0,1,0,0,0,1,4,0,3,...,0,0,0,0,2,0,2,2,0,0


In [39]:

corpusna = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmna.transpose()))

# Create the vocabulary dictionary
id2wordna = dict((v, k) for k, v in cvna.vocabulary_.items())

In [40]:

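# Note: this refits the noun-only corpus (corpusn, id2wordn) with more passes.
# The noun-and-adjective corpus built above (corpusna, id2wordna) is not used here.
ldan = models.LdaModel(corpus=corpusn, num_topics=5, id2word=id2wordn, passes=80)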
ldan.print_topics()

[(0,
  '0.016*"thing" + 0.011*"time" + 0.010*"nothing" + 0.009*"man" + 0.007*"body" + 0.006*"father" + 0.006*"friend" + 0.006*"way" + 0.006*"day" + 0.005*"home"'),
 (1,
  '0.000*"time" + 0.000*"nothing" + 0.000*"thing" + 0.000*"man" + 0.000*"sister" + 0.000*"day" + 0.000*"mother" + 0.000*"world" + 0.000*"house" + 0.000*"something"'),
 (2,
  '0.011*"time" + 0.010*"sister" + 0.009*"mother" + 0.009*"nothing" + 0.009*"thing" + 0.007*"house" + 0.007*"day" + 0.007*"john" + 0.007*"colonel" + 0.006*"heart"'),
 (3,
  '0.010*"time" + 0.009*"nothing" + 0.008*"sister" + 0.007*"family" + 0.007*"man" + 0.006*"day" + 0.006*"father" + 0.005*"letter" + 0.005*"room" + 0.005*"mother"'),
 (4,
  '0.000*"thing" + 0.000*"time" + 0.000*"day" + 0.000*"nothing" + 0.000*"man" + 0.000*"way" + 0.000*"moment" + 0.000*"sister" + 0.000*"family" + 0.000*"morning"')]

In [41]:
corpus_transformed = ldan[corpusn]
list(zip([a for [(a,b)] in corpus_transformed], data_dtmn.index))

[(3, 0), (0, 1), (2, 2)]

The topics are poor: every topic is dominated by the same generic words (time, thing, nothing, man), and some topics have near-zero weights. The stopword list needs serious preprocessing, the text needs more cleaning, and the number of topics and passes need tuning. Perhaps I could also comb through other parts of speech?
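
Another likely factor is that the model only sees three documents, one per novel, which gives LDA very little co-occurrence signal to separate topics. A commonly used remedy, offered here as an assumption rather than something this notebook tested, is to treat each chapter as its own document. The helper `split_into_chapters` is hypothetical and relies on the 'Chapter 12' and 'CHAPTER XII' headings seen in the cleaned texts.

```python
import re

# Matches headings such as "Chapter 12", "CHAPTER 12" or "CHAPTER XII".
CHAPTER_HEADING = re.compile(r"\s*\b(?:Chapter|CHAPTER)\s+(?:\d+|[IVXLC]+)\b\s*")

def split_into_chapters(text):
    """Split a cleaned novel into chapter bodies, dropping the headings."""
    return [part for part in CHAPTER_HEADING.split(text) if part.strip()]

# Short made-up samples in the two heading styles.
pride_sample = "Chapter 1 It is a truth. Chapter 2 Mr. Bennet was among the earliest."
emma_sample = "CHAPTER I Emma Woodhouse, handsome. CHAPTER II Mr. Weston was a native."
print(split_into_chapters(pride_sample))
print(split_into_chapters(emma_sample))
```

The resulting chapter strings could go through the same cleaning, noun filtering, and `CountVectorizer` steps as before, giving the model well over a hundred documents instead of three.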

# Thoughts and Future Improvements:

Possible improvements could be made by:
- Using n-grams
- Labelling the graphs with a hover function
- Preprocessing the stopwords for text generation

Something I want to do:
- Text Generation in Austen's style

All in all, I learned a lot during this project, particularly in the realm of visualization and sentiment analysis. This was also my first exposure to LDA, so I will continue to refine my knowledge of that area and apply it to the work I have done here.

