# Lab 11: Sentiment analysis

- Apply VADER to hotel reviews
- Use text classification for sentiment analysis
- Add syntactic features for classification

At the end of each notebook, write a brief error analysis and a statement of what you've learned and your ideas for improvement.

In [None]:
import numpy as np
import pandas as pd
from cytoolz import identity  # the only cytoolz helper used below
from tqdm.auto import tqdm

tqdm.pandas()

In [None]:
df = pd.read_parquet("/data/sentiment.parquet")

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
train, test = train_test_split(
    df, test_size=0.1, stratify=df["sentiment"], random_state=619
)

In [None]:
import spacy

nlp = spacy.load(
    "en_core_web_sm",
    exclude=["tagger", "parser", "ner", "lemmatizer", "attribute_ruler"],
)


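def tokenize(text):
    """Return normalized tokens, skipping whitespace, punctuation, and numbers."""
    # Only the tokenizer is needed, so call it directly instead of the full pipeline.
    doc = nlp.tokenizer(text)
    return [t.norm_ for t in doc if not (t.is_space or t.is_punct or t.like_num)]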

In [None]:
train["tokens"] = train["text"].progress_apply(tokenize)
test["tokens"] = test["text"].progress_apply(tokenize)


-----

## SGDClassifier baseline

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import make_pipeline

In [None]:
baseline = make_pipeline(
    CountVectorizer(analyzer=identity), TfidfTransformer(), SGDClassifier()
)

In [None]:
baseline.fit(train["tokens"], train["sentiment"])
baseline.score(test["tokens"], test["sentiment"])

0.897
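A single accuracy figure does not show which class the errors fall in. The sketch below shows how `classification_report` (imported above) and `confusion_matrix` break results down per class. The `gold` and `pred` lists are hypothetical stand-ins for `test["sentiment"]` and `baseline.predict(test["tokens"])`, not results from this lab.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical gold labels and predictions (illustration only).
gold = ["good", "good", "good", "bad", "bad", "good", "bad", "good"]
pred = ["good", "good", "bad", "bad", "good", "good", "bad", "good"]

labels = ["bad", "good"]

# Rows are gold labels, columns are predicted labels, both in `labels` order.
cm = confusion_matrix(gold, pred, labels=labels)
print(cm)

# Precision, recall, and F1 for each class.
print(classification_report(gold, pred, labels=labels))
```

On the real data, the off-diagonal cells of the matrix would show whether the baseline more often mistakes bad reviews for good ones or the reverse.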

-----

## Errors

In [None]:
predicted = baseline.predict(test["tokens"])

In [None]:
error = test[predicted != test["sentiment"]]

In [None]:
error[error["sentiment"] == "bad"]["text"].iloc[0]

"“Needs an update” This hotel has a beautiful lobby and beautiful conference rooms plus a great location. The service is also very good and the beds are quite comfortable. However, the restaurant food is expensive and sub par, the elevator needs work and the guest rooms need updated - the bathrooms in particular. The bathrooms are small with no space for toiletries and the closets are also very small. The cost of the hotel vs what a guest receives- the guest loses.\nWhen I visit Boston again (and I LOVED the city) I would stay at a less expensive hotel near the airport, I would find a hotel with a kitchenette and use Boston's great transit system to explore the city. ."

In [None]:
error[error["sentiment"] == "good"]["text"].iloc[0]

'“Watch Out for Parking Fees” The only incident that made this trip not as pleasant as it could have been were the parking fees. When I booked the hotel I was not notified that parking fees are $18 a day for self parking! When I checked in I was not notified of the parking fees. So when I checked out and was finally notified of the $36 charge to my credit card for parking for 2 days I was shocked. Inform your guests, we hate surprise charges.'

**Observations:**
1. A simple `SGDClassifier` baseline (accuracy 0.897 on the test set) predicted review sentiment better than VADER.
2. However, the classifier needs labeled training data to perform well on future datasets, whereas VADER requires no training.
3. The first error above is a review labeled bad that the model misclassified. The review opens with what the guest liked and only later turns to the complaints. The model probably missed this shift because the review contains many positive words that also appear in the good reviews of the training data.