# Opinion Mining and Sentiment Analysis: Teamwork

**Text Mining unit**

_Prof. Gianluca Moro, Dott. Ing. Roberto Pasolini – DISI, University of Bologna_

**Bologna Business School** - Alma Mater Studiorum Università di Bologna

## Instructions

- The exercises must be carried out in teams of 2 or 3 people; different teams must not communicate with each other
- You may consult the course material and the Web for advice
- If still in doubt about anything, ask the teacher

## Setup

The following cell contains all necessary imports

In [1]:
import numpy as np
import pandas as pd
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

Run the following to download the necessary files

In [2]:
import os
from urllib.request import urlretrieve
def download(file, url):
    if not os.path.exists(file):
        urlretrieve(url, file)

In [3]:
download("100k_reviews.tsv.gz", "https://www.dropbox.com/s/9fkjz84dnzfyimt/estore_reviews_100k.tsv.gz?dl=1")
download("positive-words.txt", "https://github.com/datascienceunibo/bbs-opinion-lab-2019/raw/master/positive-words.txt")
download("negative-words.txt", "https://github.com/datascienceunibo/bbs-opinion-lab-2019/raw/master/negative-words.txt")

In [4]:
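nltk.download("punkt")
# Recent NLTK releases (3.9 and later) load the tokenizer from "punkt_tab" instead;
# if word_tokenize raises a LookupError, run nltk.download("punkt_tab") as well.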

True

## Dataset

We provide in the `100k_reviews.tsv.gz` file a dataset of 100,000 reviews posted on Amazon.com about DVDs of movies and TV series. Each review is labeled with a score between 1 and 5 stars.

Run the following to correctly load the file into a pandas DataFrame.

In [5]:
data = pd.read_csv("100k_reviews.tsv.gz", sep="\t", compression="gzip")

In [6]:
data.head()

| Unnamed: 0 | text | stars |
|---|---|---|
| 0 | George Romero did the right thing when he pick... | 5 |
| 1 | OK, that makes it sound like something out of ... | 5 |
| 2 | - At a tribal village, a pensive Elizabeth Cur... | 5 |
| 3 | Wow! This has to be one of the more unusual mo... | 5 |
| 4 | Kevin Costner is one of those actors that I ne... | 5 |


Within the teamwork you will also make use of the Hu and Liu sentiment lexicon: run the following to load sets of positive and negative words.

In [7]:
def scan_hu_liu(f):
    for line in f:
        line = line.decode(errors="ignore").strip()
        if line and not line.startswith(";"):
            yield line

def load_hu_liu(filename):
    with open(filename, "rb") as f:
        return set(scan_hu_liu(f))

hu_liu_pos = load_hu_liu("positive-words.txt")
hu_liu_neg = load_hu_liu("negative-words.txt")

## Exercises

**1)** Verify the distribution of the number of stars

In [8]:
data["stars"].value_counts()

5    47458
4    26486
3    13898
2     6791
1     5367
Name: stars, dtype: int64
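Absolute counts are easier to compare as shares. The sketch below rebuilds the counts printed above as a small Series (they are copied from the output, not recomputed from the file) and divides by the total; on the real data the same result should come from `data["stars"].value_counts(normalize=True)`.

```python
import pandas as pd

# Star counts copied from the output above (not recomputed from the file).
counts = pd.Series({5: 47458, 4: 26486, 3: 13898, 2: 6791, 1: 5367})

# Share of reviews for each score.
shares = counts / counts.sum()
print(shares.round(3))
```

Almost half of the reviews have 5 stars, so the dataset is clearly skewed towards positive scores.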

**2)** Add a `label` column to the DataFrame whose value is `"pos"` for reviews with 4 or 5 stars and `"neg"` for reviews with 3 stars or less

In [9]:
data["label"] = np.where(data["stars"] >= 4, "pos", "neg")

**3)** Split the dataset randomly into a training set with 80% of data and a test set with the remaining 20%

In [10]:
trainset, testset = train_test_split(data, test_size=0.2)
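The split above is different at every run, so the accuracies reported later will vary slightly. As an optional variant, the sketch below fixes the shuffle with `random_state` and keeps the label proportions equal in both parts with `stratify`. The toy DataFrame and the seed value 42 are assumptions made for illustration; they are not part of the exercise.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for `data`: 40 "pos" rows and 10 "neg" rows.
toy = pd.DataFrame({
    "text": [f"review {i}" for i in range(50)],
    "label": ["pos"] * 40 + ["neg"] * 10,
})

# random_state makes the shuffle repeatable (42 is an arbitrary seed);
# stratify keeps the pos/neg ratio the same in both parts.
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["label"]
)
print(len(toy_train), len(toy_test))
print(toy_test["label"].value_counts())
```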

**4)** Create a function which accepts a text as input, counts the occurrences of positive and negative words from the Hu & Liu lexicon and returns `"pos"` if there are more positive words than negative or `"neg"` otherwise

In [11]:
def sentiment_label(text):
    words = nltk.word_tokenize(text)
    pos_count = sum(1 for word in words if word in hu_liu_pos)
    neg_count = sum(1 for word in words if word in hu_liu_neg)
    return "pos" if pos_count > neg_count else "neg"

In [12]:
# test
(sentiment_label("This is awesome!"),
 sentiment_label("This is horrible!"))

('pos', 'neg')
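Note that `nltk.word_tokenize` keeps the original casing, so a capitalized word such as "Awesome" only matches if the lexicon contains it in that exact form. If the lexicon entries are all lowercase (an assumption worth checking on the loaded sets), lowercasing the text first may catch more words. The sketch below illustrates the idea with two tiny made-up lexicons and `str.split()` in place of the NLTK tokenizer, so that it runs without any download.

```python
# Tiny stand-in lexicons; the real sets are hu_liu_pos and hu_liu_neg.
toy_pos = {"awesome", "great"}
toy_neg = {"horrible", "boring"}

def sentiment_label_lower(text):
    # str.split() is a simplification of nltk.word_tokenize, used only
    # to keep this sketch free of external data.
    words = text.lower().split()
    pos_count = sum(1 for word in words if word in toy_pos)
    neg_count = sum(1 for word in words if word in toy_neg)
    # Ties are labeled "neg", as in the function above.
    return "pos" if pos_count > neg_count else "neg"

print(sentiment_label_lower("Awesome movie"))        # pos
print(sentiment_label_lower("Boring and horrible"))  # neg
```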

**4b)** Apply the function above to the test reviews and get the percentage of cases where the function output matches the known label

In [13]:
%%time
lexicon_label = testset["text"].apply(sentiment_label)

CPU times: user 31.2 s, sys: 8 ms, total: 31.2 s
Wall time: 31.4 s


In [14]:
np.mean(lexicon_label == testset["label"])

0.6696
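To put this figure in context, it helps to compare it with a trivial classifier which always predicts the most frequent label. The sketch below derives that baseline from the star counts printed in exercise 1; it refers to the whole dataset, so the value on a random test split will differ slightly.

```python
# Star counts copied from the output of exercise 1.
star_counts = {5: 47458, 4: 26486, 3: 13898, 2: 6791, 1: 5367}

total = sum(star_counts.values())
pos = sum(n for stars, n in star_counts.items() if stars >= 4)

# Accuracy of a classifier that always predicts the most frequent label.
majority_baseline = max(pos, total - pos) / total
print(majority_baseline)  # 0.73944
```

About 74% of all reviews are labeled `"pos"`, so the lexicon-based accuracy obtained above is below what always answering `"pos"` would be expected to achieve.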

**5)** Create a tf.idf vector space model from training reviews, excluding words appearing in fewer than 3 documents, and extract the document-term matrix for them

In [15]:
vect = TfidfVectorizer(min_df=3)
train_dtm = vect.fit_transform(trainset["text"])
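The effect of `min_df` is easier to see on a handful of documents. The following sketch uses four made-up sentences and `min_df=2` (instead of 3, to keep the example small): only the terms found in at least two documents end up in the vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents, used only for illustration.
docs = [
    "great movie great cast",
    "great plot",
    "boring movie",
    "great soundtrack",
]

# min_df=2 keeps only terms found in at least 2 documents.
toy_vect = TfidfVectorizer(min_df=2)
toy_dtm = toy_vect.fit_transform(docs)

print(sorted(toy_vect.vocabulary_))  # ['great', 'movie']
print(toy_dtm.shape)                 # (4, 2)
```

The document-term matrix has one row per document and one column per retained term; "cast", "plot", "boring" and "soundtrack" are dropped because each appears in a single document.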

**6)** Train a logistic regression classifier on the training reviews, using the representation created above

In [16]:
%%time
model = LogisticRegression()
model.fit(train_dtm, trainset["label"]);

CPU times: user 3.1 s, sys: 36 ms, total: 3.13 s
Wall time: 3.14 s


**7)** Verify the accuracy of the classifier on the test set

In [17]:
test_dtm = vect.transform(testset["text"])
model.score(test_dtm, testset["label"])

0.8471
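Since the two classes are unbalanced, accuracy alone does not show how errors are distributed between them. A confusion matrix gives this breakdown; the sketch below computes one on a few made-up labels. On the real data, `y_true` would be `testset["label"]` and `y_pred` would be `model.predict(test_dtm)`.

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels, used only for illustration.
y_true = ["pos", "pos", "pos", "neg", "neg", "pos"]
y_pred = ["pos", "pos", "neg", "neg", "pos", "pos"]

# Rows are true labels, columns are predicted labels, both ordered as in `labels`.
cm = confusion_matrix(y_true, y_pred, labels=["neg", "pos"])
print(cm.tolist())  # [[1, 1], [1, 3]]

# Accuracy is the sum of the diagonal over the total number of cases.
print(cm.trace() / cm.sum())
```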

**8)** Extract the 10 words with the highest regression coefficient and the 10 words with the lowest coefficient

In [18]:
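# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is available from version 1.0
coefs = pd.Series(
    model.coef_[0],
    index=vect.get_feature_names_out()
).sort_values()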

In [19]:
coefs.head(10)

worst           -7.779192
boring          -7.559625
unfortunately   -7.267638
waste           -6.985140
bad             -5.771118
nothing         -5.712796
terrible        -5.704467
wasted          -5.389687
fails           -5.230198
poorly          -5.142444
dtype: float64

In [20]:
coefs.tail(10)

awesome      4.599625
amazing      4.604864
hilarious    4.719440
superb       4.745584
favorite     5.077927
highly       5.789346
wonderful    6.359046
perfect      6.655664
excellent    6.878745
great        7.736444
dtype: float64
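These coefficients can be read as contributions to the log-odds of the `"pos"` class: scikit-learn sorts class labels, so `model.classes_` is `['neg', 'pos']` and positive coefficients push towards `"pos"`. The sketch below combines two coefficients from the outputs above with hypothetical tf.idf weights and a hypothetical intercept of 0, just to show how a score and a probability are obtained.

```python
import math

# Coefficients copied from the outputs above.
coef = {"great": 7.736444, "boring": -7.559625}

# Hypothetical tf.idf weights of one review and a hypothetical intercept;
# the real values would come from test_dtm and model.intercept_.
weights = {"great": 0.30, "boring": 0.10}
intercept = 0.0

# Linear score (log-odds of "pos"), then the logistic function.
score = intercept + sum(coef[word] * x for word, x in weights.items())
prob_pos = 1 / (1 + math.exp(-score))
print(score, prob_pos)
```

With these made-up weights the score is positive, so the review would be classified as `"pos"`.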

**9)** Create a function which accepts a text as input and returns a list containing only the words from the text which are also present in the Hu and Liu lexicon (each distinct word must appear in the list as many times as it appears in the text)

In [21]:
hu_liu_all = hu_liu_pos | hu_liu_neg
def tokenize_hu_liu(text):
    words = nltk.word_tokenize(text)
    return [word for word in words if word in hu_liu_all]

In [22]:
# test
(tokenize_hu_liu("This is awesome!"),
 tokenize_hu_liu("This is horrible!"))

(['awesome'], ['horrible'])

**10)** Repeat points from 5 to 7 with a tf.idf vectorizer which uses the function above to extract tokens from text

In [23]:
vect = TfidfVectorizer(min_df=3, tokenizer=tokenize_hu_liu)
train_dtm = vect.fit_transform(trainset["text"])
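When a custom `tokenizer` is given, `TfidfVectorizer` still applies its default preprocessing (lowercasing) before calling it, so here the tokenizer receives lowercased text. The sketch below shows this with a made-up two-word lexicon and a simplified tokenizer based on `str.split()`; passing `token_pattern=None` is optional and only silences the warning about the unused default pattern.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up lexicon standing in for hu_liu_all.
toy_lexicon = {"awesome", "horrible"}

def toy_tokenize(text):
    # Simplified stand-in for tokenize_hu_liu (no NLTK needed).
    return [word for word in text.split() if word in toy_lexicon]

toy_vect = TfidfVectorizer(tokenizer=toy_tokenize, token_pattern=None)
toy_dtm = toy_vect.fit_transform(
    ["Awesome film", "horrible film", "awesome awesome"]
)

# "Awesome" is matched because the text is lowercased before tokenization.
print(sorted(toy_vect.vocabulary_))  # ['awesome', 'horrible']
print(toy_dtm.shape)                 # (3, 2)
```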

In [24]:
%%time
model = LogisticRegression()
model.fit(train_dtm, trainset["label"]);

CPU times: user 680 ms, sys: 0 ns, total: 680 ms
Wall time: 684 ms


In [25]:
test_dtm = vect.transform(testset["text"])
model.score(test_dtm, testset["label"])

0.8189