# Abstract

Your abstract is a one-paragraph summary of the problem you addressed, the approach(es) that you used to address it, and the big-picture results that you obtained.

<a href="https://github.com/dcruzeneil/Mental-Health-Tweetment"><b>GitHub Repository: <i>Mental Health Tweetment</i></b></a>

# Introduction

# Values Statement

# Preparing the Data
#### <font color="orange">Sampling the Data</font>
Before we begin, it is important to note that <i>Sentiment140</i> is a very large data set, which makes many computations time-consuming. We therefore sample $50000$ data points (rows) from it at random. The code used to sample the data is:
```python
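import pandas as pd

# The raw Sentiment140 file has no header row, so read every line as data
df = pd.read_csv(r"datasetOriginal.csv", encoding="latin-1", header=None)
# Draw 50000 rows uniformly at random, without replacement
df = df.sample(n=50000)
# Write without index or header so that loadData can assign the column names
df.to_csv("dataset.csv", index=False, header=False)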
```

#### <font color = "orange">Loading the Data and Feature Selection</font>
Before we perform any of our evaluations, we must prepare our data. For this, we will be using the <i>loadData</i> method from HelperClass:

```python
def loadData(self):
    columns = ['Sentiment Score', 'Tweet ID', 'Time', 'Query', 'Username', 'Tweet']
    df = pd.read_csv(r"../dataset.csv", encoding="latin-1", names = columns)
    df = df[["Sentiment Score", "Tweet"]]
    return df
```
In the above operation, we assign descriptive feature (column) names to the data. The dataset has six features, but we are primarily interested in two:
<ol>
    <li><i><u>Tweet</u></i> feature: contains the text of the tweet, which will be used for Sentiment Analysis
    <li><i><u>Sentiment Score</u></i> feature: contains the true labels, that is, the sentiment actually assigned to each tweet
</ol>

Now, we can go ahead and store our data as a <i>pandas</i> data frame:

In [1]:
import warnings
warnings.filterwarnings('ignore')
from HelperClass import HelperClass

hp = HelperClass()
df = hp.loadData()

To get an understanding of what the data looks like, we can go ahead and examine the first few rows of the data:

In [2]:
df.head()

| | Sentiment Score | Tweet |
|---|---|---|
| 0 | 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | is upset that he can't update his Facebook by ... |
| 2 | 0 | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | my whole body feels itchy and like its on fire |
| 4 | 0 | @nationwideclass no, it's not behaving at all.... |


We can see that our data now consists solely of the Sentiment Score and the Tweet. To understand how the labels are distributed, we can count the tweets in each class:

In [3]:
df.groupby("Sentiment Score").size()

Sentiment Score
0    800000
4    800000
dtype: int64

So, the two classes are perfectly balanced: there are $800000$ Negative tweets (score $0$) and $800000$ Positive tweets (score $4$).

#### <font color = "orange"> Preprocessing Tasks </font>
To improve accuracy and the quality of our input data, we perform certain "preprocessing" tasks on the raw text data. We perform the following tasks:
<ol>
    <li> Casing: converting everything to either uppercase or lowercase
    <li> Noise Removal: eliminating unwanted characters such as HTML tags, punctuation marks, special characters, and so on
    <li> Tokenization: turning all the tweets into tokens - words separated by spaces
    <li> Stopword Removal: ignore common English words ("and", "if", "for"), that are not likely to shape the sentiment of the tweet
    <li> Text Normalization (Lemmatization): reducing words to their root form to eliminate variations of the same word
        <ul>
            <li> Better $\rightarrow$ Good
            <li> Runs/Running/Ran $\rightarrow$ Run
        </ul>
</ol>
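
The five steps above can be sketched in a few lines of plain Python. This is only an illustration and not the <i>preprocess</i> method of HelperClass: the function name `preprocess_sketch`, the tiny stopword set, and the toy lemma table are all assumptions made for the example, and a real pipeline would use a full stopword list and a proper lemmatizer.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "the", "if", "for", "is", "that", "to", "of", "in", "it"}

# Toy lemma lookup (an assumption); a real lemmatizer uses a dictionary plus rules.
LEMMAS = {"better": "good", "runs": "run", "running": "run", "ran": "run"}

def preprocess_sketch(tweet):
    """Hypothetical stand-in for the five preprocessing steps listed above."""
    text = tweet.lower()                       # 1. casing
    text = re.sub(r"<[^>]+>", " ", text)       # 2. noise: HTML tags
    text = re.sub(r"https?://\S+", " ", text)  #    noise: URLs
    text = re.sub(r"@\w+", " ", text)          #    noise: user mentions
    text = re.sub(r"[^a-z\s]", " ", text)      #    noise: punctuation, digits, symbols
    tokens = text.split()                      # 3. tokenization on whitespace
    tokens = [t for t in tokens if t not in STOPWORDS]  # 4. stopword removal
    tokens = [LEMMAS.get(t, t) for t in tokens]         # 5. toy lemmatization
    return " ".join(tokens)

print(preprocess_sketch("@user Running is <b>better</b> than walking! http://example.com"))
# prints: run good than walking
```

Note that the order of the noise-removal steps matters: URLs and mentions are stripped before punctuation, otherwise their fragments would survive as ordinary tokens.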

In [4]:
#for i in range(df.shape[0]):
#    df["Tweet"][i] = hp.preprocess(df["Tweet"][i])

#### <font color="orange">Training Data and Testing Data</font>

Before we proceed with any further operation, it is important to divide our data into <i>training data</i> and <i>testing data</i>:

In [5]:
from sklearn.model_selection import train_test_split 
dfX_train, dfX_test, dfY_train, dfY_test = train_test_split(df["Tweet"], df["Sentiment Score"], test_size=0.2, random_state=0)

The code above splits our data (the tweets and their associated sentiment scores) into training data and testing data, each stored as a <i>pandas</i> Series. Setting <i>test_size</i> = 0.2 tells the function to hold out 20% of the data for testing and use the rest for training, while <i>random_state</i> = 0 makes the split reproducible.

#### <font color="orange">Creating the Term-Document Matrix and the Target Vector</font>
Now that we have our training and testing data, we want to separate each of them into:
<ul>
    <li> Feature Matrix (X), and
    <li> Target Vector (y)
</ul>
Conveniently, the sentiment score is already numerically encoded (0 = Negative, 4 = Positive), although we will convert it to 0 and 1 (for 0 and 4 respectively) to match the usual binary-classification convention. Each individual tweet should correspond to one row of the Feature Matrix X, but the raw text is not yet in that format. To get there, we have to create a "Term-Document Matrix". A term-document matrix is a matrix in which:
<ol>
    <li>Column: each column represents one term (word) that is present in our complete dataset (corpus) of the tweets
    <li>Row: each row contains information about one tweet (document), that is which words are present in that particular tweet
</ol>
Put simply, the columns collectively represent all the terms that occur in the tweet dataset, and each row tells us which of those terms are present (and with what frequency) in one individual tweet. Therefore, our term-document matrix is of the order $\text{tdm} \in \mathbb{R}^{n \times p}$, where $n$ = number of tweets and $p$ = number of unique terms across all tweets.

<br><u>Important Note</u>: in the above description of the columns, the "complete tweet dataset" refers solely to the training data. When we build the vocabulary (all the relevant words present in the tweets), we do not want to look at words in the <i>testing data</i>, as that would leak information about the test set into the model and bias our evaluation!
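
To make the idea concrete, here is a minimal pure-Python sketch of a term-document matrix. It is not how <i>CountVectorizer</i> is implemented; the helper names `fit_vocabulary` and `to_term_document_matrix` and the three toy documents are assumptions made for illustration. The point to notice is that the vocabulary is built from the training documents only, so a word that appears only in the test document gets no column.

```python
from collections import Counter

def fit_vocabulary(train_docs):
    """Collect the sorted set of words seen in the training documents only."""
    return sorted({word for doc in train_docs for word in doc.split()})

def to_term_document_matrix(docs, vocabulary):
    """Return one row of word counts per document, one column per vocabulary word."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts[word] for word in vocabulary])
    return rows

train_docs = ["good day", "bad bad day"]
test_docs = ["good movie"]  # "movie" never appears in the training data

vocabulary = fit_vocabulary(train_docs)
tdm_train = to_term_document_matrix(train_docs, vocabulary)
tdm_test = to_term_document_matrix(test_docs, vocabulary)

print(vocabulary)  # ['bad', 'day', 'good']
print(tdm_train)   # [[0, 1, 1], [2, 1, 0]]
print(tdm_test)    # [[0, 0, 1]]
```

Here $n = 2$ and $p = 3$ for the training matrix, and the test row has the same $p = 3$ columns even though the test tweet contains an unseen word.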

To create our term-document matrix, we can use the <i>CountVectorizer</i> from <i>scikit-learn</i>:

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
#from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

cv = CountVectorizer(min_df = 0.001, ngram_range=(1,2))

In the above code, we create an instance of the <i>CountVectorizer</i> class. There are two important things to unpack here:
<ol>
    <li>min_df = 0.001: if a term appears in fewer than 0.1% of the tweets, ignore it, as we do not have enough data on this term to understand its role in shaping a tweet's sentiment
    <li>ngram_range = (1,2): this tells the vectorizer to look at uni-grams (one word at a time) and bi-grams (two consecutive words taken together). This helps capture words that only make sense in pairs, such as "not good"
</ol>
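
The effect of these two settings can be sketched without <i>scikit-learn</i>. The helpers `ngrams` and `filter_by_min_df` below are hypothetical names written for this example, and the four toy documents are made up; the sketch assumes that a fractional `min_df` is compared against the share of documents containing a term, so a term is kept when its document frequency is at least `min_df` times the number of documents.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams of the token list for n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def filter_by_min_df(docs, min_df):
    """Keep terms whose document frequency is at least min_df (a proportion)."""
    doc_freq = Counter()
    for doc in docs:
        # Count each term at most once per document.
        doc_freq.update(set(ngrams(doc.split())))
    threshold = min_df * len(docs)
    return sorted(term for term, freq in doc_freq.items() if freq >= threshold)

docs = ["not good", "not bad", "good day", "not good day"]

print(ngrams("not good day".split()))
# ['not', 'good', 'day', 'not good', 'good day']
print(filter_by_min_df(docs, min_df=0.5))
# ['day', 'good', 'good day', 'not', 'not good']
```

With `min_df=0.5`, the terms "bad" and "not bad" are dropped because each occurs in only one of the four documents, while the bi-gram "not good" survives and can carry a different signal from "good" alone.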

Now, we can go ahead and create our term-document matrix (<i>tdm</i>) and appropriate <i>target vector</i> for the training data and the testing data:

In [7]:
X_train, y_train = hp.prepData(dfX_train, dfY_train, cv, True)
X_test, y_test = hp.prepData(dfX_test, dfY_test, cv, False)

The <i>prepData</i> function that we use to get the final <i>feature matrix</i> and <i>target vector</i> is defined as:
```python
def prepData(self, dfX, dfY, vectorizer, train = True):
    if train:
        vectorizer.fit(dfX)
    #creating the term document matrix
    counts = vectorizer.transform(dfX) 
    X = pd.DataFrame(counts.toarray(), columns = vectorizer.get_feature_names_out())
    X = X.to_numpy()
    y = dfY.to_numpy()
    y = 1 * (y==4) 
    return X, y
```
There are some things to unpack here:
<ol>
    <li>The check for training ensures that the vocabulary is built only from the training data
    <li>Based on this vocabulary, a term-document matrix is constructed (for both training and testing data)
    <li>We convert our <i>feature matrix</i> and <i>target vector</i> to numpy arrays, which are easier to index and to pass to our models
    <li>We change $y \in \{0, 4\}^{n}$ to $y \in \{0, 1\}^{n}$ to follow the standard binary-label convention
</ol>

In [8]:
#setting the size of: training and testing samples
n_train = 50000
n_test = 8000

def randomSample(n_train, n_test):
    #training data
    indices = np.random.choice(X_train.shape[0], n_train, replace=False)
    X_trainFinal = X_train[indices, :]
    y_trainFinal = y_train[indices]
    #testing data
    indices = np.random.choice(X_test.shape[0], n_test, replace=False)
    #creating new feature matrix and target vector (smaller)
    X_testFinal = X_test[indices, :]
    y_testFinal = y_test[indices]
    return X_trainFinal, y_trainFinal, X_testFinal, y_testFinal

# What is Support Vector Machine - SVM?

Support Vector Machines (SVMs) are a popular family of machine learning models that are often used for classification.

# PEGASOS (<font color="orange">P</font>rimal <font color="orange">E</font>stimated sub-<font color="orange">G</font>r<font color="orange">A</font>dient <font color="orange">SO</font>lver for <font color="orange">S</font>VM) : <font color = "orange">Primal SVM</font>

In PEGASOS (the primal form of SVMs), our empirical risk minimization problem is:<br><br>
<center> $\hat{\mathbf{w}} = \mathop{\mathrm{arg\,min}}_{\mathbf{w}} \; L(\mathbf{w})\;$ </center>
<br>where our loss function $L(\mathbf{w})\;$ is defined as:<br><br>
<center> $L(\mathbf{w}) = \frac{\lambda}{2}\|\mathbf{w}\|^2 + \frac{1}{n} \sum_{i = 1}^n \ell_{\text{hinge}}(\langle \mathbf{w}, \mathbf{x}_i \rangle, y_i)\;$ </center>
<br>The first term is a regularization term, which prevents overfitting and improves the model's ability to generalize to unseen data. The per-sample loss used in PEGASOS is the "Hinge Loss", which is defined as:<br><br>
$$
\ell_{\text{hinge}}(\langle \mathbf{w}, \mathbf{x}_i \rangle, y_i) = \max\{0, 1 - y_i\langle \mathbf{w}, \mathbf{x}_i \rangle\}
$$
This form of the hinge loss assumes labels $y_i \in \{-1, 1\}$. Since the hinge loss is not differentiable where $y_i\langle \mathbf{w}, \mathbf{x}_i \rangle = 1$, we work with a sub-gradient of $L$:<br><br>
$$
\nabla L(\mathbf{w}) = \lambda \mathbf{w} - \frac{1}{n} \sum_{i = 1}^n \mathbb{1}[y_i \langle \mathbf{w}, \mathbf{x}_i \rangle < 1] \, y_i \mathbf{x}_i
$$
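
The objective and its sub-gradient translate almost line by line into <i>numpy</i>. The sketch below is not the <i>PrimalSVM</i> class from this project: the function names, the toy data, and the value of $\lambda$ are assumptions made for illustration. It also makes two simplifications that should be read as assumptions: it uses the full data set at every step rather than a random sample, and it uses the decaying step size $1/(\lambda t)$. Labels in $\{0, 1\}$ are mapped to $\{-1, +1\}$ first, since the hinge loss is written for that convention.

```python
import numpy as np

def hinge_objective(w, X, y, lam):
    """Regularized hinge loss L(w); y must contain labels in {-1, +1}."""
    margins = y * (X @ w)
    return 0.5 * lam * np.dot(w, w) + np.mean(np.maximum(0.0, 1.0 - margins))

def hinge_subgradient(w, X, y, lam):
    """Sub-gradient of L(w): lam * w - (1/n) * sum of y_i x_i over margin violators."""
    violators = (y * (X @ w)) < 1
    return lam * w - (X * (violators * y)[:, None]).sum(axis=0) / X.shape[0]

# Toy, linearly separable data (hypothetical values for illustration).
X = np.array([[2.0, 0.0], [1.5, 0.5], [-2.0, 0.0], [-1.0, -1.0]])
y01 = np.array([1, 1, 0, 0])
y = 2 * y01 - 1  # map {0, 1} labels to {-1, +1}

lam = 0.1
w = np.zeros(X.shape[1])
loss_start = hinge_objective(w, X, y, lam)
for t in range(1, 101):
    step = 1.0 / (lam * t)  # decaying step size (an assumption of this sketch)
    w = w - step * hinge_subgradient(w, X, y, lam)
loss_end = hinge_objective(w, X, y, lam)

print(loss_start, loss_end)
```

At $\mathbf{w} = \mathbf{0}$ every point violates the margin, so the objective starts at exactly $1$; after the updates it should be well below that on this toy data.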

# Hyperparameter Tuning using Cross-Validation

# Training the Model

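Now that we have an understanding of how the PEGASOS (Primal SVM) method works, we can <i>fit</i> a model on our training data and look at the accuracy it achieves. The cell below imports my <i>PrimalSVM</i> implementation, but the calls to it are commented out; the model that is actually fitted is <i>scikit-learn</i>'s <i>SGDClassifier</i> with hinge loss and an L2 penalty, which is a linear SVM trained by stochastic gradient descent. Training accuracy is reported on the random training sample, and testing accuracy on the full <i>testing data</i>: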

In [9]:
from PrimalSVM import PrimalSVM
from sklearn.linear_model import SGDClassifier
#PS = PrimalSVM(0.1)
PS = SGDClassifier(loss='hinge', penalty='l2')

#sampling using the function above
X_trainFin, y_trainFin, X_testFin, y_testFin = randomSample(50000, 8000)

#X_train = np.append(X_train, np.ones((X_train.shape[0],1)), axis=1)
#X_test = np.append(X_test, np.ones((X_test.shape[0],1)), axis=1)

In [10]:
#PS.fit(X_trainFin, y_trainFin, 0.1, 500000, 5000)
PS.fit(X_trainFin, y_trainFin)
#print(f"Accuracy on Training Data: {PS.score_history[-1]}")
print(f"Accuracy on Training Data: {PS.score(X_trainFin, y_trainFin)}")
print(f"Accuracy on Testing Data: {PS.score(X_test, y_test)}")

Accuracy on Training Data: 0.7865
Accuracy on Testing Data: 0.76466875


# Results and Experiments
In this section, we will be performing certain experiments to see how our model performs. 

#### <font color="orange">Model Accuracy for Ambiguous Sentences</font>
First, let us look at how the trained model does in predicting the sentiment of some sentences that could be difficult to classify, for example because they contain negation or mix positive and negative words:

In [25]:
testSentences = ["I am in pain", "I hate that movie", "Today has been a good day", "Yesterday was a pretty bad day", "I am not in pain anymore", "I love animals bad", "I love animals bad bad"]

for sent in testSentences:
    #result = hp.sentencePredict()
    tempMatrix = hp.sentencePredict(sent, cv)
    result = PS.predict(tempMatrix)
    sentiment = "Negative"
    if(result):
        sentiment = "Positive"
    print(f"Sentence:{sent}\nSentiment:{sentiment}\n")

Sentence:I am in pain
Sentiment:Negative

Sentence:I hate that movie
Sentiment:Negative

Sentence:Today has been a good day
Sentiment:Positive

Sentence:Yesterday was a pretty bad day
Sentiment:Negative

Sentence:I am not in pain anymore
Sentiment:Negative

Sentence:I love animals bad
Sentiment:Positive

Sentence:I love animals bad bad
Sentiment:Negative



#### <font color="orange">Evolution of Training Score</font>

# Concluding Discussion

# Personal Reflection