# Lab 11: Sentiment analysis

- Apply VADER to hotel reviews
- Apply text classification to sentiment analysis
- Add syntactic features for classification

At the end of each notebook, write a brief error analysis and a statement of what you've learned and your ideas for improvement.

In [None]:
import numpy as np
import pandas as pd
from cytoolz import identity
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report, f1_score
from sklearn.pipeline import make_pipeline
from tqdm.auto import tqdm

tqdm.pandas()

## Parsing the input

In [None]:
import spacy
from spacy import displacy
from spacy.tokens import DocBin

nlp = spacy.load("en_core_web_lg", exclude=["ner"])

In [None]:
df = pd.read_parquet("/data/sentiment.parquet")

In [None]:
displacy.render(nlp("They didn't have any clean towels."))

In [None]:
docs = DocBin(docs=nlp.pipe(tqdm(df['text']), n_process=4))
docs.to_disk('parsed.docbin')

  0%|          | 0/10000 [00:00<?, ?it/s]

In [None]:
docs = DocBin().from_disk("parsed.docbin")
df["doc"] = list(docs.get_docs(nlp.vocab))

In [None]:
train, test = train_test_split(
    df, test_size=0.1, stratify=df["sentiment"], random_state=619
)

----

## Syntactically augmented classification

The easiest way to give `SGDClassifier` syntactic information is to augment the words in the text. For example, to indicate that a word is in the scope of negation, we prefix it with `NOT:`.

In [None]:
from spacy.tokens import Token

Token.set_extension("neg", default=False, force=True)

In [None]:
def simple_negation(doc):
    for tok in doc:
        tok._.neg = False
    for tok in doc:
        if tok.dep_ == "neg":
            tok.head._.neg = True
    return doc


def add_not(tok):
    if tok._.neg:
        return "NOT:" + tok.norm_
    else:
        return tok.norm_


def tokenize_not(negator):
    def tokenize(doc):
        return [add_not(t) for t in negator(doc)]

    return tokenize

In [None]:
test_doc = nlp("They didn't have any clean towels and they didn't care.")

In [None]:
tokenizer = tokenize_not(simple_negation)
tokenizer(test_doc)

['they',
 'do',
 'not',
 'NOT:have',
 'any',
 'clean',
 'towels',
 'and',
 'they',
 'do',
 'not',
 'NOT:care',
 '.']
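
To see why the prefix matters to a bag-of-words model, here is a minimal sketch using only the standard library. The token lists are written by hand for illustration (they are not spaCy output), and `shared_features` is a hypothetical helper, not part of the lab code.

```python
from collections import Counter

# Hand-written token lists (hypothetical; not produced by spaCy) for the
# sentences "I would stay again" and "I would not stay again".
positive = ["i", "would", "stay", "again"]
negative_plain = ["i", "would", "not", "stay", "again"]
negative_marked = ["i", "would", "not", "NOT:stay", "again"]


def shared_features(a, b):
    """Return the sorted features that two bags of words have in common."""
    return sorted(set(Counter(a)) & set(Counter(b)))


# Without the prefix, "stay" is counted in both reviews.
print(shared_features(positive, negative_plain))
# With the prefix, the negated review no longer contributes to "stay".
print(shared_features(positive, negative_marked))
```

With the prefix, the classifier can learn one weight for `stay` and a separate weight for `NOT:stay`.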

In [None]:
m1 = make_pipeline(
    CountVectorizer(
        preprocessor=identity,
        tokenizer=tokenize_not(simple_negation),
        token_pattern=None,
    ),
    TfidfTransformer(),
    SGDClassifier(random_state=1),
)
m1.fit(train["doc"], train["sentiment"])
m1.score(test["doc"], test["sentiment"])

0.904

In [None]:
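def print_top_feats(M, k=10):
    """Print the k most positive and k most negative features side by side."""
    V = M.named_steps["countvectorizer"].get_feature_names_out()
    # For a binary problem, coef_ has a single row of feature weights.
    coef = M.named_steps["sgdclassifier"].coef_[0]
    order = coef.argsort()  # ascending: most negative weights first
    for w1, w2 in zip(order[-k:][::-1], order[:k]):
        print(f"{V[w1]:20s} {coef[w1]:7.3f} | {V[w2]:20s} {coef[w2]:7.3f}")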

In [None]:
print_top_feats(m1, 50)

great                  4.772 | ok                    -4.510
comfortable            3.299 | average               -3.683
excellent              2.923 | poor                  -3.446
perfect                2.923 | NOT:stay              -3.173
amazing                2.772 | okay                  -3.138
quiet                  2.621 | dirty                 -3.120
loved                  2.442 | dated                 -3.014
clean                  2.401 | not                   -2.968
definitely             2.395 | bad                   -2.703
nice                   2.304 | tiny                  -2.603
best                   2.247 | disappointed          -2.520
wonderful              2.110 | outdated              -2.380
recommend              2.103 | worst                 -2.306
fantastic              2.074 | unless                -2.305
helpful                2.052 | no                    -2.168
beautiful              1.935 | terrible              -2.088
everything             1.904 | renovatio

Next step: once we've identified the negated words, we'll spread the negation marker to the dependents on their right. Not to every word on the right, though: only to right-hand dependents of the negated word that carry one of a few selected dependency labels, and then recursively to everything below them.

In [None]:
def negify(tok):
    tok._.neg = True
    for child in tok.children:
        negify(child)


def negate_comps(doc):
    for tok in doc:
        tok._.neg = False
    for tok in doc:
        if tok.dep_ == "neg":
            tok.head._.neg = True
            for right_tok in tok.head.rights:
                if right_tok.dep_ in ["acomp", "advmod", "dobj", "prep", "xcomp"]:
                    negify(right_tok)
    return doc

In [None]:
tokenizer = tokenize_not(negate_comps)
tokenizer(test_doc)

['they',
 'do',
 'not',
 'NOT:have',
 'NOT:any',
 'NOT:clean',
 'NOT:towels',
 'and',
 'they',
 'do',
 'not',
 'NOT:care',
 '.']
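
The spreading rule can also be checked without a parser. The sketch below rebuilds it on a hand-made tree for "They didn't have any clean towels." The `Node` class, the `spread_negation` name, and the parse itself are assumptions made for illustration; a real spaCy parse may attach tokens differently.

```python
from dataclasses import dataclass, field

# Dependency labels that receive the negation marker (same list as negate_comps).
SPREAD_DEPS = {"acomp", "advmod", "dobj", "prep", "xcomp"}


@dataclass
class Node:
    """Hypothetical stand-in for a spaCy token in a dependency tree."""
    text: str
    dep: str
    lefts: list = field(default_factory=list)   # dependents to the left
    rights: list = field(default_factory=list)  # dependents to the right
    neg: bool = False

    @property
    def children(self):
        return self.lefts + self.rights


def negify(node):
    """Mark a node and its whole subtree as negated."""
    node.neg = True
    for child in node.children:
        negify(child)


def spread_negation(head):
    """Mark the negated head, then selected right-hand dependents."""
    head.neg = True
    for right in head.rights:
        if right.dep in SPREAD_DEPS:
            negify(right)


# Hand-built parse (an assumption, not parser output).
they, did, nt = Node("they", "nsubj"), Node("do", "aux"), Node("not", "neg")
any_, clean = Node("any", "det"), Node("clean", "amod")
towels = Node("towels", "dobj", lefts=[any_, clean])
period = Node(".", "punct")
have = Node("have", "ROOT", lefts=[they, did, nt], rights=[towels, period])

spread_negation(have)
in_order = (they, did, nt, have, any_, clean, towels, period)
negated = [n.text for n in in_order if n.neg]
print(negated)
```

The subject, the auxiliary, and the punctuation stay unmarked because they are either left dependents or carry a label outside the selected set.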

In [None]:
m2 = make_pipeline(
    CountVectorizer(
        preprocessor=identity, tokenizer=tokenize_not(negate_comps), token_pattern=None
    ),
    TfidfTransformer(),
    SGDClassifier(alpha=1e-4, random_state=1),
)
m2.fit(train["doc"], train["sentiment"])
m2.score(test["doc"], test["sentiment"])

0.903

In [None]:
print_top_feats(m2, 50)

great                  4.796 | ok                    -4.587
comfortable            3.131 | average               -3.566
excellent              3.011 | poor                  -3.384
perfect                2.963 | dated                 -2.989
amazing                2.640 | okay                  -2.980
quiet                  2.605 | disappointed          -2.962
clean                  2.591 | not                   -2.958
nice                   2.461 | dirty                 -2.890
definitely             2.422 | bad                   -2.745
loved                  2.344 | tiny                  -2.694
best                   2.151 | worst                 -2.378
wonderful              2.146 | NOT:again             -2.270
again                  2.132 | unless                -2.167
helpful                2.083 | outdated              -2.090
everything             1.995 | no                    -1.994
fantastic              1.957 | horrible              -1.987
beautiful              1.867 | when     

Next, we'll add a combined feature for each modifier (`amod` or `advmod`) and its head, in the form `head_modifier`.

In [None]:
def mod_tokenizer(doc):
    doc = negate_comps(doc)
    toks = [add_not(tok) for tok in doc]
    toks = toks + [
        add_not(t.head) + "_" + add_not(t) for t in doc if t.dep_ in ["amod", "advmod"]
    ]
    return toks

In [None]:
mod_tokenizer(test_doc)

['they',
 'do',
 'not',
 'NOT:have',
 'NOT:any',
 'NOT:clean',
 'NOT:towels',
 'and',
 'they',
 'do',
 'not',
 'NOT:care',
 '.',
 'NOT:towels_NOT:clean']
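
The head-modifier pairing can be sketched on plain tuples, independent of spaCy. The `Tok` tuple, the `label` and `modifier_features` helpers, and the two hand-annotated phrases are hypothetical and only illustrate how the extra features are assembled.

```python
from typing import NamedTuple


class Tok(NamedTuple):
    """Hypothetical stand-in for a parsed token."""
    norm: str   # normalized word form
    dep: str    # dependency label
    head: int   # index of the head token in the list
    neg: bool   # whether the token is in the scope of negation


def label(tok):
    """Prefix negated tokens with NOT:, as add_not does."""
    return "NOT:" + tok.norm if tok.neg else tok.norm


def modifier_features(toks):
    """Build one head_modifier feature per amod/advmod token."""
    return [
        label(toks[t.head]) + "_" + label(t)
        for t in toks
        if t.dep in ("amod", "advmod")
    ]


# "very good location", annotated by hand (an assumption, not parser output).
phrase = [
    Tok("very", "advmod", 1, False),
    Tok("good", "amod", 2, False),
    Tok("location", "ROOT", 2, False),
]
print(modifier_features(phrase))

# "clean towels" inside a negated clause, also annotated by hand.
negated_phrase = [
    Tok("clean", "amod", 1, True),
    Tok("towels", "dobj", 1, True),
]
print(modifier_features(negated_phrase))
```

Because the head comes first, an intensified adjective such as "very good" surfaces as `good_very`.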

In [None]:
m3 = make_pipeline(
    CountVectorizer(preprocessor=identity, tokenizer=mod_tokenizer, token_pattern=None),
    TfidfTransformer(),
    SGDClassifier(),
)
m3.fit(train["doc"], train["sentiment"])
m3.score(test["doc"], test["sentiment"])

0.909

In [None]:
print_top_feats(m3, 50)

great                  4.372 | ok                    -4.499
excellent              2.826 | average               -3.086
comfortable            2.758 | poor                  -2.990
perfect                2.738 | not                   -2.930
amazing                2.626 | dated                 -2.723
quiet                  2.466 | okay                  -2.695
loved                  2.374 | dirty                 -2.659
nice                   2.190 | disappointed          -2.624
wonderful              2.135 | bad                   -2.569
best                   2.114 | tiny                  -2.508
stay_again             2.110 | worst                 -2.143
helpful                2.059 | no                    -2.088
good_very              2.055 | unless                -2.000
clean                  2.001 | outdated              -1.912
definitely             1.962 | terrible              -1.844
well                   1.845 | horrible              -1.843
everything             1.803 | NOT:stay_

----

In [None]:
predicted = m3.predict(test["doc"])
error = test[predicted != test["sentiment"]]

In [None]:
error[error["sentiment"] == "bad"]["text"].iloc[0]

"“Needs an update” This hotel has a beautiful lobby and beautiful conference rooms plus a great location. The service is also very good and the beds are quite comfortable. However, the restaurant food is expensive and sub par, the elevator needs work and the guest rooms need updated - the bathrooms in particular. The bathrooms are small with no space for toiletries and the closets are also very small. The cost of the hotel vs what a guest receives- the guest loses.\nWhen I visit Boston again (and I LOVED the city) I would stay at a less expensive hotel near the airport, I would find a hotel with a kitchenette and use Boston's great transit system to explore the city. ."

In [None]:
error[error["sentiment"] == "good"]["text"].iloc[0]

'“Watch Out for Parking Fees” The only incident that made this trip not as pleasant as it could have been were the parking fees. When I booked the hotel I was not notified that parking fees are $18 a day for self parking! When I checked in I was not notified of the parking fees. So when I checked out and was finally notified of the $36 charge to my credit card for parking for 2 days I was shocked. Inform your guests, we hate surprise charges.'

**Observations:**

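1. Giving the classifier syntactic context for each word (negation scope and head-modifier pairs) produced good results in identifying the sentiment of hotel reviews: roughly 0.90 to 0.91 accuracy on the held-out test set.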
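2. This is a good improvement over the previous methods, but the second attempt here (spreading negation to dependents, 0.903) changed little compared with the first (simple negation, 0.904). Adding head-modifier pairs gave the best score, 0.909.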
3. One thing I would try: once the negation from words like "not" and "no" has been attached to the following word as a `NOT:` prefix, drop the original negation word and check whether the score improves. We don't know in advance how much the words matter with and without the prefix across all the texts, so this is only an experiment worth running.
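
A minimal sketch of the experiment proposed in observation 3, on hand-annotated tokens rather than spaCy output. The `Tok` tuple and `tokenize_drop_negator` are hypothetical names, and the filter only removes tokens whose dependency label is `neg`; other negative words are left untouched.

```python
from typing import NamedTuple


class Tok(NamedTuple):
    """Hypothetical stand-in for a parsed token."""
    norm: str   # normalized word form
    dep: str    # dependency label
    neg: bool   # whether the token is in the scope of negation


def add_not(tok):
    """Prefix negated tokens with NOT:."""
    return "NOT:" + tok.norm if tok.neg else tok.norm


def tokenize_drop_negator(toks):
    """Emit NOT:-prefixed features but skip the negation word itself."""
    return [add_not(t) for t in toks if t.dep != "neg"]


# "They didn't have any clean towels.", annotated by hand (an assumption).
toks = [
    Tok("they", "nsubj", False),
    Tok("do", "aux", False),
    Tok("not", "neg", False),
    Tok("have", "ROOT", True),
    Tok("any", "det", True),
    Tok("clean", "amod", True),
    Tok("towels", "dobj", True),
    Tok(".", "punct", False),
]
features = tokenize_drop_negator(toks)
print(features)
```

In the notebook, the same idea would amount to adding an `if t.dep_ != "neg"` filter to the list comprehension in `tokenize_not`, then refitting the pipeline and comparing the test score with the ones above.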