<p>&nbsp;</p>

<font size = "4.5">

In previous work, I practiced natural language processing (NLP) techniques on Herman Melville's Moby Dick. In this project, I will build on that work by performing sentiment classification on the same text.

<p>&nbsp;</p>

<font size = "4.5">

I will start by importing the necessary packages and then loading the text from the nltk Gutenberg corpus.

<p>&nbsp;</p>

In [36]:
# Import packages
import pandas
import nltk
import textblob
import sklearn

In [3]:
# Import data
mobydick = nltk.corpus.gutenberg.raw("melville-moby_dick.txt")

<p>&nbsp;</p>

<font size = "4.5">

The text starts in raw form, so it needs to be processed. Because I will be doing sentiment analysis at the sentence level, I will use the nltk sent_tokenize function to break the text into sentences and then print how many sentences there are.

<p>&nbsp;</p>

In [4]:
# Break the text down into sentences
sentences = nltk.sent_tokenize(mobydick)

In [5]:
# How many sentences are there?
print(len(sentences))

9852


<p>&nbsp;</p>

<font size = "4.5">

After breaking the text down into sentences, we see that Moby Dick has 9,852 sentences. Next, I will use the textblob package to assign a sentiment label to each sentence. textblob assigns each sentence a polarity score between -1 and 1, and I will use that score to label the sentence as negative, neutral, or positive. If the polarity score is < 0, the sentiment is negative. If the polarity score is = 0, the sentiment is neutral. If the polarity score is > 0, the sentiment is positive.

<p>&nbsp;</p>

In [17]:
# Create polarity scores for the sentences
polarities = [textblob.TextBlob(sent).sentiment.polarity for sent in sentences]

In [28]:
# Define a function for discretizing the polarity scores
def getsentiment(polarityscore):
    if polarityscore < 0: return "negative"
    elif polarityscore == 0: return "neutral"
    else: return "positive"

In [29]:
# Convert the polarity scores into sentiment labels
sentiments = [getsentiment(polarity) for polarity in polarities]
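The rule above labels a sentence neutral only when its polarity is exactly 0. As a minimal sketch of an alternative, the function below adds a tolerance around zero so that near-zero scores also count as neutral. The name `get_sentiment_banded` and the `neutral_band` parameter are hypothetical and are not used anywhere else in this notebook; the scores are made up for illustration.

```python
def get_sentiment_banded(polarity_score, neutral_band=0.0):
    """Map a polarity score in [-1, 1] to a sentiment label.

    neutral_band is a hypothetical tolerance: scores whose absolute value
    is at most neutral_band are labeled neutral. The default of 0.0
    reproduces the exact-zero rule used in this notebook.
    """
    if polarity_score < -neutral_band:
        return "negative"
    if polarity_score > neutral_band:
        return "positive"
    return "neutral"


# Made-up scores for illustration only (not taken from Moby Dick)
example_scores = [-0.5, -0.05, 0.0, 0.05, 0.5]
strict_labels = [get_sentiment_banded(s) for s in example_scores]
banded_labels = [get_sentiment_banded(s, neutral_band=0.1) for s in example_scores]
print(strict_labels)
print(banded_labels)
```

With the default band the labels match the exact-zero rule. A wider band moves weakly scored sentences into the neutral class, which would change the label counts.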

In [30]:
# What are the counts of each label?
pandas.DataFrame(sentiments).value_counts()

neutral     4197
positive    3637
negative    2018
dtype: int64
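
To put these counts in context, the short sketch below converts them into percentages of all sentences. It only reuses the counts printed above; the variable names are my own.

```python
# Label counts as reported in the output above
label_counts = {"neutral": 4197, "positive": 3637, "negative": 2018}

# The counts should add up to the 9,852 sentences found earlier
total = sum(label_counts.values())

# Share of each label as a percentage, rounded to one decimal place
shares = {label: round(100 * count / total, 1)
          for label, count in label_counts.items()}

print(total)
print(shares)
```

Neutral is the largest class, and among the sentences that received a nonzero score, positive sentences outnumber negative ones.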

<p>&nbsp;</p>

<font size = "4.5">

Now I will analyze the results by looking at the top 50 adjective phrases, adverb phrases, and verb phrases for the positive and negative categories. The top positive will consist of the sentences with the highest polarities, and the phrases in those sentences. The top negative will consist of the sentences with the lowest polarities, and the phrases in those sentences.

<p>&nbsp;</p>

In [41]:
# Create a dataframe with the sentences, the polarities, and the sentiments
sentimentdf = pandas.DataFrame(list(zip(sentences, polarities, sentiments)), columns = ["Sentence", "Polarity", "Sentiment"])

In [54]:
# Define the grammar for an adjective phrase
adjp_grammar = """
ADJP: {<JJ><NN>}
      {<NN><JJ>}
"""

# Define the grammar for an adverb phrase
advp_grammar = """
ADVP: {<RB><VB.*>}
      {<RB><NN>}
      {<RB><JJ>}
"""

# Define the grammar for a verb phrase
vp_grammar = """
VP: {<VBP><DT><NN>}
    {<VBN><DT><NN>}
"""

In [55]:
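# Define a function for retrieving the first phrase of a given type in a sentence
def get_phrase(sentence, pattern, phrase):
    # Tokenize and POS-tag the sentence, then chunk it with the given grammar
    tokens = nltk.word_tokenize(sentence)
    tags = nltk.pos_tag(tokens)
    parser = nltk.RegexpParser(pattern)
    tree = parser.parse(tags)
    # Return the first chunk labeled `phrase` (e.g. "ADJP")
    for subtree in tree.subtrees(filter=lambda t: t.label() == phrase):
        # Keep the leading space so the printed phrases look the same as before
        return " " + " ".join(word for word, tag in subtree.leaves())
    # No chunk of this type was found
    return None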
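
Note that get_phrase returns only the first chunk it finds in a sentence. The sketch below shows how the chunker behaves on a small example and how every matching chunk could be collected instead. The tokens are tagged by hand here, which is an assumption made for illustration so that the example does not depend on the nltk tokenizer and tagger models.

```python
import nltk

# Hand-tagged tokens (an assumption for illustration; the notebook itself
# gets its tags from nltk.pos_tag)
tagged = [("the", "DT"), ("white", "JJ"), ("whale", "NN"),
          ("sank", "VBD"), ("the", "DT"), ("old", "JJ"), ("ship", "NN")]

# Chunk adjective + noun pairs with a one-rule grammar
parser = nltk.RegexpParser("ADJP: {<JJ><NN>}")
tree = parser.parse(tagged)

# Collect every ADJP chunk, not only the first one
all_adjp = [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees(filter=lambda t: t.label() == "ADJP")]
print(all_adjp)
```

Keeping only the first chunk gives one phrase per sentence, which lines up neatly with one row per sentence in a dataframe. Collecting all chunks would instead need a list per sentence.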

In [62]:
# Iterate through the sentences and retrieve the adjective, adverb, and verb phrases
adj_phrases = []
adv_phrases = []
v_phrases = []
for sentence in sentences:
    try:
        adj_phrase = get_phrase(sentence, adjp_grammar, "ADJP")
        adv_phrase = get_phrase(sentence, advp_grammar, "ADVP")
        v_phrase = get_phrase(sentence, vp_grammar, "VP")
        adj_phrases.append(adj_phrase)
        adv_phrases.append(adv_phrase)
        v_phrases.append(v_phrase)
    except Exception:
        adj_phrases.append(None)
        adv_phrases.append(None)
        v_phrases.append(None)

In [63]:
# Put together a dataframe of the sentences, polarities, sentiments, and phrases
mobydf = pandas.DataFrame(list(zip(
    sentences,
    polarities,
    sentiments,
    adj_phrases,
    adv_phrases,
    v_phrases)),
    columns = [
        "Sentence",
        "Polarity",
        "Sentiment",
        "AdjPhrase",
        "AdvPhrase",
        "VPhrase"
    ]
)

In [67]:
# Print out the top 50 verb phrases for negative (lowest polarity first)
mobytemp = mobydf[mobydf.Sentiment == "negative"].sort_values(by = "Polarity", ascending = True, axis = 0)
counter = 0
for phrase in list(mobytemp["VPhrase"]):
    if counter >= 50:
        break
    if phrase:
        print(phrase)
        counter += 1

 give the glory
 crush the quadrant
 make a spread
 go the gait
 been a sprat
 see this whale-steak
 offer this cup
 seen a bird
 granted the ship
 mean a downright
 worsted all round
 mend the matter
 wrenched the ship
 know some o
 round the socket
 get a chance
 turned the round
 know the proverb
 trace the round
 have the heart
 killed some distance
 Have an eye
 remain a part
 broken a finger
 vertebra the bottom
 cleared the foul
 been a whale-boat
 make a point
 robbed a widow
 differ the sea
 nigh the beach
 are the moody
 behold an oarsman
 known any profound
 seem the connecting
 round the waist
 say the word
 have no objection
 been the case
 been a pirate
 are an advance
 found some salvation
 are a plenty
 been a mortar
 heaven a murderer
 furnished a proverb
 reckon a monster
 provided a system
 home the oil
 round the world
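
The cell above covers one combination of sentiment and phrase type. As a rough sketch, the same steps can be wrapped in a helper so that the other combinations (positive or negative, with AdjPhrase, AdvPhrase, or VPhrase) use identical logic. The helper name `top_phrases` is hypothetical, and the toy rows below are made up for illustration; they are not output from this notebook.

```python
import pandas


def top_phrases(df, sentiment, phrase_column, n=50):
    """Return up to n phrases from the most extreme sentences of one sentiment.

    Hypothetical helper; assumes df has the columns Polarity, Sentiment,
    and phrase_column, laid out like mobydf.
    """
    subset = df[df["Sentiment"] == sentiment]
    # Lowest polarity first for "negative", highest polarity first otherwise
    subset = subset.sort_values(by="Polarity", ascending=(sentiment == "negative"))
    # Skip sentences where no phrase was found, then keep the first n
    return subset[phrase_column].dropna().head(n).tolist()


# Toy rows for illustration only (not actual notebook output)
toy = pandas.DataFrame({
    "Polarity": [-0.9, -0.2, 0.8, 0.3, -0.6],
    "Sentiment": ["negative", "negative", "positive", "positive", "negative"],
    "VPhrase": ["lost the boat", None, "found a friend", "sang a song", "broke an oar"],
})
top_negative = top_phrases(toy, "negative", "VPhrase", n=2)
top_positive = top_phrases(toy, "positive", "VPhrase", n=2)
print(top_negative)
print(top_positive)
```

On the real data, a call such as `top_phrases(mobydf, "negative", "VPhrase")` would be expected to list the same phrases as the loop above, since both sort by polarity and skip missing phrases.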
