# MKTG 685 - Machine Learning in Marketing
# Essentials of Natural Language Processing
# NLP Basic II - Sentiment Analysis

Hyunhwan "Aiden" Lee

> Assistant Professor of Marketing, College of Business, California State University Long Beach

Copyright (c) 2021 ~ present

In this homework, we will run Naive Bayes for sentiment analysis, using the preprocessing steps we studied in the first week of class. The homework also serves as a review of the first week's lesson. Please follow the code in the cells below.

To complete this assignment, please "save a copy in Drive" and complete the code.
Also, run all the cells before you submit.
Then, submit your "ipynb" file on Canvas.

If you have any questions, please contact our TA.

In [1]:
# DO NOT CHANGE THIS CODE
import string
import re

# NLTK
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('twitter_samples')
from nltk.corpus import stopwords
from nltk.corpus import twitter_samples
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Unzipping corpora/twitter_samples.zip.


# Naïve Bayes for Sentiment Analysis

The following questions are about the Naïve Bayes model for sentiment analysis.

Q) In this assignment, we will use the Porter stemmer. Please fill in the code in the cell below:

In [2]:
# Complete the following code
STOPWORD = stopwords.words('english')
# for stemming use Porter stemmer
STEMMER = PorterStemmer()

In [3]:
# DO NOT CHANGE THIS CODE
# Create a function to do preprocessing.
def preprocessing_text(text):
  # for Twitter data
  # remove @mentions that are preceded by whitespace
  text = re.sub(r'(\s)@\w+', '', text)
  # remove stock market tickers like $GE
  text = re.sub(r'\$\w*', '', text)
  # remove old-style retweet text "RT"
  text = re.sub(r'^RT[\s]+', '', text)
  # remove hyperlinks
  text = re.sub(r'https?:\/\/.*[\r\n]*', '', text)
  # remove hashtags
  # only removing the hash # sign from the word
  text = re.sub(r'#', '', text)
  # tokenize texts
  text_tokens = word_tokenize(text)
  texts_clean = []
  for word in text_tokens:
    if (word not in STOPWORD and  # remove stopwords
            word not in string.punctuation):  # remove punctuation
        stem_word = STEMMER.stem(word)  # stemming word
        texts_clean.append(stem_word)
  return ' '.join(texts_clean)
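
To see what the regular expressions in `preprocessing_text` do on their own, the sketch below applies the same five patterns, in the same order, to a made-up tweet. The function name `strip_twitter_markup`, the sample tweets, and the final whitespace-collapsing step are assumptions added for illustration; they are not part of the notebook.

```python
import re

def strip_twitter_markup(text):
    """Apply the regex steps of preprocessing_text, in the same order."""
    text = re.sub(r'(\s)@\w+', '', text)              # @mentions preceded by whitespace
    text = re.sub(r'\$\w*', '', text)                 # stock tickers such as $GE
    text = re.sub(r'^RT[\s]+', '', text)              # leading "RT " marker
    text = re.sub(r'https?:\/\/.*[\r\n]*', '', text)  # hyperlink through end of line
    text = re.sub(r'#', '', text)                     # drop '#', keep the word
    # Extra step for readability only (not in the notebook): collapse whitespace.
    return ' '.join(text.split())

# Hypothetical sample tweet
tweet = 'RT @user1 thanks @user2 for the $GE tip! #stocks https://t.co/abc123'
cleaned = strip_twitter_markup(tweet)
print(cleaned)  # thanks for the tip! stocks

# A mention at the very start has no whitespace before it, so it survives.
leading_mention = strip_twitter_markup('@user1 hello')
print(leading_mention)  # @user1 hello
```

Note that the mention pattern requires a whitespace character before `@`, so a mention at the very beginning of a text is left in place and reaches the tokenizer.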

The following code defines your training dataset.

In [4]:
# DO NOT CHANGE THIS CODE
# This is your train dataset
train_positive = [
    'good car ever',
    'best vehicle ever',
    'awesome design',
    'sharp handling',
    'great transmission'
]

train_negative = [
    'bad car ever',
    'poor car design',
    'weird design',
    'worst and slow transmission',
    'hard handling'
]

Q) Using the train data above, please run preprocessing using the predefined 'preprocessing_text' function.

In [5]:
# do preprocessing for the data
all_positive_cleaned = [preprocessing_text(text) for text in train_positive]
all_negative_cleaned = [preprocessing_text(text) for text in train_negative]

If you do this correctly, the expected output of the following cell is:

['good car ever', 'best vehicl ever', 'awesom design', 'sharp handl', 'great transmiss']  
['bad car ever', 'poor car design', 'weird design', 'worst slow transmiss', 'hard handl']

In [6]:
# DO NOT CHANGE THIS CODE
print(all_positive_cleaned)
print(all_negative_cleaned)

['good car ever', 'best vehicl ever', 'awesom design', 'sharp handl', 'great transmiss']
['bad car ever', 'poor car design', 'weird design', 'worst slow transmiss', 'hard handl']


Q) Combine the negative and positive data into the training set.

In [7]:
# Complete the following code
train = []
for text in all_positive_cleaned:
  train.append((text, 'pos'))
for text in all_negative_cleaned:
  train.append((text, 'neg'))

If you do this correctly, the expected output of the following cell is:

('good car ever', 'pos')

('weird design', 'neg')

In [8]:
# DO NOT CHANGE THIS CODE
print(train[0])
print(train[7])

('good car ever', 'pos')
('weird design', 'neg')


Let's train a Naive Bayes model.

In [10]:
# DO NOT CHANGE THIS CODE
dtmvector = CountVectorizer()
train_x_dtm = dtmvector.fit_transform(all_positive_cleaned+all_negative_cleaned)
tfidf_transformer = TfidfTransformer()
train_x_tfidf = tfidf_transformer.fit_transform(train_x_dtm)
print(train_x_tfidf.shape)

(10, 17)
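
The shape `(10, 17)` means 10 documents by 17 distinct terms. The sketch below reproduces those two numbers in plain Python and computes a smoothed inverse document frequency for a few terms. It assumes that splitting on whitespace matches the vectorizer's tokenization for this data (all tokens are lowercase words of two or more letters), and that the formula `ln((1 + n) / (1 + df)) + 1` corresponds to the default smoothed IDF of `TfidfTransformer`; the variable names are illustrative.

```python
import math

# The ten cleaned training documents printed earlier in the notebook.
docs = [
    'good car ever', 'best vehicl ever', 'awesom design', 'sharp handl',
    'great transmiss', 'bad car ever', 'poor car design', 'weird design',
    'worst slow transmiss', 'hard handl',
]

# Vocabulary: every distinct token across the documents.
vocab = sorted({tok for doc in docs for tok in doc.split()})
shape = (len(docs), len(vocab))
print(shape)  # (10, 17)

# Document frequency: in how many documents each term occurs.
df = {term: sum(term in doc.split() for doc in docs) for term in vocab}

# Smoothed inverse document frequency (assumed scikit-learn default).
n = len(docs)
idf = {term: math.log((1 + n) / (1 + df[term])) + 1 for term in vocab}

for term in ('car', 'ever', 'good'):
    print(term, df[term], round(idf[term], 3))
```

Terms that occur in many documents, such as "car" and "ever", receive a lower IDF than terms that occur once, such as "good", but they are not removed from the feature matrix.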


Q) Create the labels (i.e., positive: 1, negative: 0).

In [13]:
# Complete the following code
train_y = np.append(np.ones(len(all_positive_cleaned)), np.zeros(len(all_negative_cleaned)))


In [14]:
# DO NOT CHANGE THIS CODE
mod = MultinomialNB()
mod.fit(train_x_tfidf, train_y)

Let's test using new texts.

In [15]:
# DO NOT CHANGE THIS CODE
def run_naive_bayes_sentiment(text):
  text_dtm = dtmvector.transform([text])
  text_tfidf = tfidf_transformer.transform(text_dtm)
  if mod.predict(text_tfidf)[0] == 0:
    print('The sentence [%s] is negative!' % (text))
  else:
    print('The sentence [%s] is positive!' % (text))

First, let's use "bad car ever."

In [16]:
# DO NOT CHANGE THIS CODE
run_naive_bayes_sentiment('bad car ever')

The sentence [bad car ever] is negative!


Then, let's test "ugly vehicle ever" and "nice car."

In [17]:
# DO NOT CHANGE THIS CODE
run_naive_bayes_sentiment('ugly vehicle ever')

The sentence [ugly vehicle ever] is positive!


In [18]:
# DO NOT CHANGE THIS CODE
run_naive_bayes_sentiment('nice car')

The sentence [nice car] is negative!


Q) Why did the naive Bayes classifier evaluate 'ugly vehicle ever' as positive? And why is "nice car" negative?

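Your Answer: The model associates the word "ever" with the positive class because it appears more often in the positive training data (twice) than in the negative training data (once). Likewise, it associates "car" with the negative class because it appears twice in the negative training data and only once in the positive. The words "ugly", "vehicle", and "nice" are not in the training vocabulary ("vehicle" was stemmed to "vehicl" during training, but the test sentence is not preprocessed), so the vectorizer ignores them. Since Naive Bayes scores a sentence only from the class probabilities of the words it saw in the training set, the prediction rests entirely on "ever" and "car".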
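
The reasoning in the answer above can be checked by hand. The sketch below is a simplification: it uses raw word counts with Laplace smoothing (`alpha=1`) instead of the TF-IDF weights the notebook feeds to `MultinomialNB`, so the exact probabilities will differ from the fitted model, though the direction for each word agrees with the notebook's output. The helper names `likelihood` and `known_tokens` are hypothetical.

```python
from collections import Counter

# Cleaned training documents, as printed earlier in the notebook.
train_pos = ['good car ever', 'best vehicl ever', 'awesom design',
             'sharp handl', 'great transmiss']
train_neg = ['bad car ever', 'poor car design', 'weird design',
             'worst slow transmiss', 'hard handl']

pos_counts = Counter(tok for doc in train_pos for tok in doc.split())
neg_counts = Counter(tok for doc in train_neg for tok in doc.split())
vocab = set(pos_counts) | set(neg_counts)

def likelihood(word, counts, alpha=1.0):
    """Laplace-smoothed P(word | class), estimated from raw counts."""
    return (counts[word] + alpha) / (sum(counts.values()) + alpha * len(vocab))

def known_tokens(sentence):
    """Tokens of a new sentence that exist in the training vocabulary."""
    return [tok for tok in sentence.split() if tok in vocab]

for sentence in ('ugly vehicle ever', 'nice car'):
    for tok in known_tokens(sentence):
        p_pos = likelihood(tok, pos_counts)
        p_neg = likelihood(tok, neg_counts)
        leaning = 'pos' if p_pos > p_neg else 'neg'
        print(f'{sentence!r}: {tok!r} '
              f'P(w|pos)={p_pos:.3f} P(w|neg)={p_neg:.3f} -> {leaning}')
```

Only "ever" survives from the first sentence and only "car" from the second, and with equal class priors (five documents each) those single words decide the prediction.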




Q) What can be a solution to avoid the issue above?

Your Answer: We can remove neutral words such as "car" and "ever" from the training data entirely, for example by treating them as stopwords. Words with positive or negative connotations, such as "nice", "ugly", and "worst", are the ones that should be included in the training data. We can also give the model a much larger dataset, so that words with neutral connotations are not assigned positive or negative weight.
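
A minimal sketch of the two ideas in the answer above, again using a count-based Naive Bayes written by hand rather than the notebook's TF-IDF pipeline. The set `NEUTRAL`, the helper names, and the two extra training examples ('nice ride' and 'ugly shape') are hypothetical and were invented purely for illustration; they are not part of the assignment data.

```python
import math
from collections import Counter

NEUTRAL = {'car', 'ever'}  # domain-neutral words to drop (assumption)

# Original cleaned data plus ONE HYPOTHETICAL example per class (last item).
train_pos = ['good car ever', 'best vehicl ever', 'awesom design',
             'sharp handl', 'great transmiss', 'nice ride']
train_neg = ['bad car ever', 'poor car design', 'weird design',
             'worst slow transmiss', 'hard handl', 'ugly shape']

def tokens(text):
    """Whitespace tokens with the neutral words filtered out."""
    return [tok for tok in text.split() if tok not in NEUTRAL]

counts = {
    'pos': Counter(tok for doc in train_pos for tok in tokens(doc)),
    'neg': Counter(tok for doc in train_neg for tok in tokens(doc)),
}
vocab = set(counts['pos']) | set(counts['neg'])
n_docs = {'pos': len(train_pos), 'neg': len(train_neg)}

def predict(text, alpha=1.0):
    """Return the class with the highest log posterior (Laplace smoothing)."""
    scores = {}
    for label, c in counts.items():
        total = sum(c.values()) + alpha * len(vocab)
        score = math.log(n_docs[label] / sum(n_docs.values()))  # log prior
        for tok in tokens(text):
            if tok in vocab:  # unseen words are ignored
                score += math.log((c[tok] + alpha) / total)
        scores[label] = score
    return max(scores, key=scores.get)

print(predict('ugly vehicle ever'))  # neg
print(predict('nice car'))           # pos
```

Dropping the neutral words alone is not enough: without training examples that contain "nice" or "ugly", a sentence would have no known words left and the model could only fall back on the class prior, which is why the larger dataset matters as well.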