## Sentiment Analysis Starter
Goal:
Develop a fine-tuned sentiment classifier that identifies whether a given utterance is positive, neutral, or negative.

Background:
In general, when users interact with Posh's chatbot, they are emotionally neutral in their responses to the bot. We expect the majority of user interactions to be neutral overall. However, it is useful to identify when a user has become visibly frustrated and negative overall. Identifying negative user utterances helps us locate areas where the bot's responses are insufficient, develop new emotionally responsive features, and identify opportunities for escalation.

Task description:
For this NLP task, we want to map user utterances (u_i) to the following labels [-1,0,1] where -1 is negative, 0 is neutral, and 1 is positive. 

For example: 

Utterance: I hate this bot it really doesn't get my question.
Expected Label: -1 (negative)

Utterance: I am having issues logging in, can you help?
Expected Label: 0 (neutral)

Utterance: Thanks, your help was great!
Expected Label: 1 (positive)

Training Data:
As a starting point, you will use the ScenarioSA dataset. You can learn more about the dataset here: https://ieeexplore.ieee.org/abstract/document/9091843. ScenarioSA contains 2,214 manually labeled multi-turn English conversations collected from various websites that provide online communication services. We provide some useful scripts below to get you started, but feel free to explore other avenues.

Note: ScenarioSA is designed for classifying both utterance-level and conversation-level sentiment. For simplicity, the base classifier only considers the former and ignores the overall context of the conversation. For the first version of this project, we just want to classify utterances in isolation. If you have time or feel motivated, feel free to explore the conversation-level classification use case.


In [2]:
import os

# Define path to ScenarioSA dataset
path = 'data/sentiment/ScenarioSA'

# Generate a list of all files within the ScenarioSA folder
files = []
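# os.walk yields (root, directories, files) for every folder under path
for root, _dirs, filenames in os.walk(path):
    for filename in filenames:
        # Match on the extension only, not on '.txt' anywhere in the name
        if filename.endswith('.txt'):
            files.append(os.path.join(root, filename))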

display(files[:5])

['data/sentiment/ScenarioSA/InteractiveSentimentDataset/794.txt',
 'data/sentiment/ScenarioSA/InteractiveSentimentDataset/2007.txt',
 'data/sentiment/ScenarioSA/InteractiveSentimentDataset/769.txt',
 'data/sentiment/ScenarioSA/InteractiveSentimentDataset/1169.txt',
 'data/sentiment/ScenarioSA/InteractiveSentimentDataset/1454.txt']

In [3]:
# For each file, extract the individual lines into a global list called rows
rows = []
for file in files:
    with open(file, "r", encoding = "ISO-8859-1") as f:
        lines = f.readlines()
        for line in lines:
            line = line.strip()
            if len(line) > 4:
                ls = line.split()
                try:
                    label = int(ls[-1])
                    line = " ".join(ls[:-1])
                    if ":" in line:
                        # Drop the speaker prefix but keep any later colons in the utterance
                        line = line.split(":", 1)[1].strip()
                    rows.append({"text": line,
                                "label": label,
                                "source": file.split("/")[1]})
                except ValueError:  # last token is not an integer label, skip the line
                    continue

# Dump rows into a pandas DataFrame
import pandas as pd
df = pd.DataFrame(rows)
df.head()

|   | text | label | source |
|---|------|-------|--------|
| 0 | I need your help. | 0 | sentiment |
| 1 | What's up? | 0 | sentiment |
| 2 | I'm lost. | -1 | sentiment |
| 3 | Where exactly are you trying to go? | 0 | sentiment |
| 4 | I want to go see a movie. | 0 | sentiment |


### Data Exploration
You may want to explore the dataset here. Some useful analysis topics: distribution of labels, distribution of labels by source, average utterance length per label, and common words per label.
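As one possible starting point, the analyses listed above could be sketched with pandas as follows. The `toy_df` frame is a hypothetical stand-in with the same columns as `df` (text, label, source); in the notebook you would run the same calls on `df` itself.

```python
from collections import Counter

import pandas as pd

# Hypothetical stand-in for the `df` built above (columns: text, label, source).
toy_df = pd.DataFrame({
    "text": ["I need your help.", "I'm lost.",
             "Thanks, your help was great!", "What's up?"],
    "label": [0, -1, 1, 0],
    "source": ["sentiment"] * 4,
})

# Distribution of labels, as counts and as proportions
label_counts = toy_df["label"].value_counts().sort_index()
label_share = toy_df["label"].value_counts(normalize=True).sort_index()

# Distribution of labels by source
by_source = pd.crosstab(toy_df["source"], toy_df["label"])

# Average utterance length per label, measured in whitespace-separated tokens
toy_df["n_tokens"] = toy_df["text"].str.split().str.len()
avg_len = toy_df.groupby("label")["n_tokens"].mean()

# Most common lowercase tokens per label (no stop-word removal in this sketch)
top_words = {
    label: Counter(" ".join(texts).lower().split()).most_common(3)
    for label, texts in toy_df.groupby("label")["text"]
}

print(label_counts, label_share, by_source, avg_len, top_words, sep="\n\n")
```

On the real data, a strongly skewed label distribution would be a reason to look at per-class precision and recall rather than accuracy alone.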

## Train/Test Split
It is critical when developing classification models to split your data into train/validation/test segments. Read more here: https://towardsdatascience.com/3-things-you-need-to-know-before-you-train-test-split-869dfabb7e50

We can use sklearn's train_test_split function to split our data. Note that the random_state argument needs to be fixed so that the split is reproducible. We also stratify on the labels so that the test set matches the overall label distribution and every label is represented in it.

In [4]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, 
                               test_size = .2, 
                               stratify=df["label"], 
                               random_state=1988)
print(f"train size: {len(train)}, test size: {len(test)}")

train size: 38914, test size: 9729
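The cell above produces only a train and a test segment. If you also want a validation segment for model selection, one common approach is to call train_test_split twice. The sketch below uses a hypothetical `toy_df` and an assumed 60/20/20 ratio; neither comes from the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for `df`: 10 rows per label.
toy_df = pd.DataFrame({
    "text": [f"utterance {i}" for i in range(30)],
    "label": [-1, 0, 1] * 10,
})

# Step 1: hold out 20% of the data as the test set.
train_val, test_part = train_test_split(
    toy_df, test_size=0.2, stratify=toy_df["label"], random_state=1988)

# Step 2: hold out 25% of the remainder (20% of the total) for validation.
train_part, val_part = train_test_split(
    train_val, test_size=0.25, stratify=train_val["label"], random_state=1988)

print(f"train: {len(train_part)}, val: {len(val_part)}, test: {len(test_part)}")
```

Tune hyperparameters against the validation segment and touch the test segment only for the final evaluation.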


## Baseline Model 

Our strong baseline model uses sentence-bert encodings as features for a simple Random Forest classifier. I'd also suggest creating a simpler baseline using TF-IDF features and experimenting with multinomial logistic regression as a way to better understand how to create a model.
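A minimal sketch of such a simple baseline, assuming scikit-learn's TfidfVectorizer and LogisticRegression chained in a pipeline, might look like this. The toy texts and labels are hypothetical, and the hyperparameters (bigrams, max_iter) are illustrative choices rather than tuned values; in the notebook you would fit on train["text"] and train["label"].

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for train["text"] / train["label"].
toy_texts = [
    "I hate this bot", "this is useless and annoying",
    "I am having issues logging in", "what is my account balance",
    "thanks, your help was great", "perfect, that solved it",
]
toy_labels = [-1, -1, 0, 0, 1, 1]

# TF-IDF over unigrams and bigrams feeding a logistic regression classifier.
tfidf_clf = make_pipeline(
    TfidfVectorizer(lowercase=True, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
tfidf_clf.fit(toy_texts, toy_labels)

# Predict labels and class probabilities for unseen utterances.
new_texts = ["thanks for the help", "I hate waiting"]
preds = tfidf_clf.predict(new_texts)
probs = tfidf_clf.predict_proba(new_texts)
print(preds, probs.round(2))
```

Because the pipeline takes raw strings, it can be evaluated with the same score and classification_report calls used for the Random Forest, which makes the two baselines directly comparable.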

*Preprocessing*
In this model we use sentence-bert (a custom-tuned language model for generating sentence-level representations - https://github.com/UKPLab/sentence-transformers) to encode our utterances. To install sentence-bert, run the following pip command in your terminal/command-line tool: pip install -U sentence-transformers

This model has issues distinguishing between negative sentiment and negative events. For example, the utterance "I broke my laptop, please fix it" would be classified as negative, but the user is not actually frustrated, just stating the facts. The goal of the updated model is to do a better job of discerning between negative events (which should be classified as neutral) and negative sentiment ("I hate this bot...").
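One way to track this failure mode is a small hand-written probe set that is scored separately from the main test set. The sketch below is an assumption about how that could be done, not part of the original notebook: `PROBES`, `probe_errors`, and `always_negative` are hypothetical names, and only the first probe utterance comes from the text above.

```python
from typing import Callable, List, Sequence, Tuple

# Hand-written probes as (utterance, expected label) pairs.
# The first utterance is from the text above; the others are hypothetical.
PROBES: List[Tuple[str, int]] = [
    ("I broke my laptop, please fix it", 0),   # negative event, neutral sentiment
    ("My card was declined at the store", 0),  # negative event, neutral sentiment
    ("I hate this bot", -1),                   # negative sentiment
]


def probe_errors(
    predict: Callable[[List[str]], Sequence[int]],
) -> List[Tuple[str, int, int]]:
    """Return (text, expected, predicted) for every probe the model gets wrong."""
    texts = [text for text, _ in PROBES]
    preds = predict(texts)
    return [
        (text, expected, int(pred))
        for (text, expected), pred in zip(PROBES, preds)
        if int(pred) != expected
    ]


# Stand-in predictor that labels everything negative, mimicking the failure mode.
def always_negative(texts: List[str]) -> List[int]:
    return [-1] * len(texts)


errors = probe_errors(always_negative)
for text, expected, predicted in errors:
    print(f"expected {expected}, got {predicted}: {text}")
```

With the baseline in this notebook, the predictor could be passed as `lambda texts: rfclf.predict(model.encode(texts))` once the encoder and classifier cells have run.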

In [5]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('roberta-base-nli-mean-tokens')

In [None]:
X_train = model.encode(train["text"].tolist())
y_train = train["label"]
print("finished encoding X_train inputs")

In [None]:
# Train Classifier
from sklearn.ensemble import RandomForestClassifier

rfclf = RandomForestClassifier(verbose=True, n_jobs=-1)
rfclf.fit(X_train, y_train)
print(f"training accuracy: {rfclf.score(X_train, y_train)}")

In [None]:
# Evaluate against Test
X_test = model.encode(test["text"].tolist())
y_test = test["label"]

print(f"Test accuracy: {rfclf.score(X_test, y_test)}")

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, rfclf.predict(X_test)))

In [None]:
# Save model
import pickle

# Use a context manager so the file handle is closed after writing
with open("baseline_rf_model.pkl", "wb") as f:
    pickle.dump(rfclf, f)
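To reuse the saved classifier later, it can be loaded back with pickle.load. The sketch below shows the full save/load round trip on a tiny stand-in model: the random feature vectors replace the sentence-bert encodings, and the file name is hypothetical, so treat it as an illustration of the pattern rather than the notebook's actual model.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in features: random vectors instead of sentence-bert encodings (assumption).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(30, 8))
y_toy = np.array([-1, 0, 1] * 10)

toy_clf = RandomForestClassifier(n_estimators=10, random_state=0)
toy_clf.fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_rf_model.pkl")  # hypothetical file name

    # Save the fitted classifier
    with open(model_path, "wb") as fh:
        pickle.dump(toy_clf, fh)

    # Load it back, as a later session would
    with open(model_path, "rb") as fh:
        reloaded = pickle.load(fh)

# The reloaded model should reproduce the original predictions exactly.
same = np.array_equal(toy_clf.predict(X_toy), reloaded.predict(X_toy))
print(f"predictions match after reload: {same}")
```

Only unpickle files you created yourself, and load them with the same scikit-learn version that wrote them, since pickled estimators are not guaranteed to be compatible across versions.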