# NLP Classification exercise: Emotional Tweets

In this exercise we will classify text using a bag-of-words model and ngrams. We will use the [Emotion detection from text](https://www.kaggle.com/pashupatigupta/emotion-detection-from-text) dataset which contains tweets along with emotion labels.

## Part 1: EDA and Preprocessing

**Questions:**
1. Load the attached `tweet_emotions.csv` dataset as a Pandas DataFrame. Perform basic EDA, including checking the size of the dataset and distribution of target values.

2. Perform basic preprocessing: First remove 'neutral' tweets since we will only classify tweets with emotional content. Then, using Pandas Series string methods (e.g. `.str.replace(...)`), convert the text to lowercase. Replace special types of tokens with code-words as discussed in class (e.g. `__HASHTAG__` for hashtags; consider what else should be replaced).

3. Use a 90%-10% train-test split, stratifying on the target value. Why does it make sense to stratify?

4. What baseline accuracy would we achieve on predicting the target on the test set if we always predicted the majority class?

5. Using `nltk.word_tokenize` and `collections.Counter`, find the most common 10 words in the training data. What type of words are these?

In [1]:
import pandas as pd

In [2]:
#1.
df = pd.read_csv('tweet_emotions.csv.zip')

In [3]:
df.shape

(40000, 3)

In [4]:
df.sample(5)

Unnamed: 0,tweet_id,sentiment,content
1695,1957375017,worry,"is getting a cold *cough, oink, cough*"
20921,1694128615,neutral,@Adrianna perhaps because it's 3 am Minday mor...
26831,1695546704,sadness,"@amandawilk106 yes unfuzzy, i prefer the fuzzy..."
16248,1965286924,neutral,@mitchelmusso how do you call that number from...
35957,1753216524,worry,hairspray on tv!


In [5]:
df.drop(columns = 'tweet_id', inplace = True)

In [6]:
df.sentiment.value_counts()/df.shape[0]*100

neutral       21.5950
worry         21.1475
happiness     13.0225
sadness       12.9125
love           9.6050
surprise       5.4675
fun            4.4400
relief         3.8150
hate           3.3075
empty          2.0675
enthusiasm     1.8975
boredom        0.4475
anger          0.2750
Name: sentiment, dtype: float64

In [7]:
df.isna().sum()

sentiment    0
content      0
dtype: int64

In [8]:
df.content.value_counts().shape

(39827,)

We have 173 duplicated tweets.

In [9]:
import numpy as np

In [10]:
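#2.
# Keep only tweets with emotional content. A boolean mask selects rows by value,
# so it does not rely on row positions matching index labels as np.where + drop does.
df = df[df.sentiment != 'neutral'].copy()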

In [11]:
# Convert the text to lowercase
df.content = df.content.str.lower()

In [12]:
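# regex=True is needed for the pattern to be treated as a regular expression
# (in pandas >= 2.0 the default is a literal string match)
df.content = df.content.str.replace(r'#\w+', '__HASHTAG__', regex=True)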



In [13]:
df.content = df.content.str.replace(r'@\w+', '__USERNAME__', regex=True)

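
The replacement patterns can be checked on a single string with the standard `re` module before applying them to the whole column. The sketch below is only an illustration: the `__URL__` code-word, the URL pattern, and the helper name `replace_special_tokens` are assumptions that do not appear in the notebook, added as one possible answer to "consider what else should be replaced".

```python
import re

# Code-words from the notebook, plus a hypothetical __URL__ token (an assumption).
# URLs are replaced first so that a '#' inside a URL is not treated as a hashtag.
PATTERNS = [
    (re.compile(r"https?://\S+"), "__URL__"),
    (re.compile(r"#\w+"), "__HASHTAG__"),
    (re.compile(r"@\w+"), "__USERNAME__"),
]

def replace_special_tokens(text: str) -> str:
    """Lowercase the text, then substitute each special token with its code-word."""
    text = text.lower()
    for pattern, code_word in PATTERNS:
        text = pattern.sub(code_word, text)
    return text

example = "@Bob check http://example.com/#intro #Mondays are the WORST"
cleaned = replace_special_tokens(example)
print(cleaned)
```

Lowercasing happens before the substitution, as in the notebook, so the code-words stay in uppercase and remain easy to spot among the tokens.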


In [14]:
#3.
from sklearn.model_selection import train_test_split
X_train,X_test, y_train,y_test = train_test_split(df.drop(columns = 'sentiment'), df.sentiment, test_size = 0.1, stratify = df.sentiment)

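Stratifying keeps the class proportions the same in the train set and the test set. This matters here because the classes are highly imbalanced: with a purely random split, a rare class such as 'anger' (fewer than 0.5% of the tweets) could end up under-represented in, or even missing from, the 10% test set.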
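
To make the idea concrete, here is a hand-rolled sketch of a stratified split using only the standard library. It is an illustration of the principle, not scikit-learn's implementation; the function name `stratified_split` and the toy labels are assumptions.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_frac=0.1, seed=0):
    """Return (train_idx, test_idx) so that each class contributes
    about test_frac of its samples to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)                       # random choice within the class
        n_test = round(len(idx) * test_frac)   # per-class test quota
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx

# Toy, imbalanced labels: 70% / 20% / 10%
labels = ["worry"] * 70 + ["love"] * 20 + ["anger"] * 10
train_idx, test_idx = stratified_split(labels, test_frac=0.1)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

Each class sends the same fraction of its samples to the test set, so the 70/20/10 mix is preserved even though the test set has only ten items.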

In [15]:
#4.
y_test.value_counts()/X_test.shape[0]*100

worry         26.968441
happiness     16.608224
sadness       16.448836
love          12.240995
surprise       6.981192
fun            5.674211
relief         4.877271
hate           4.207842
empty          2.645840
enthusiasm     2.422697
boredom        0.573797
anger          0.350653
Name: sentiment, dtype: float64

Always predicting the majority class ('worry') gives a baseline accuracy of about 27.0% on the test set.
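
The same number can be computed directly from class counts. The sketch below uses the test-set counts that appear in the "support" column of the classification report in Part 3; the variable names are chosen for illustration.

```python
# Test-set class counts, copied from the "support" column of the
# classification report in Part 3 of this notebook.
test_counts = {
    "anger": 11, "boredom": 18, "empty": 83, "enthusiasm": 76,
    "fun": 178, "happiness": 521, "hate": 132, "love": 384,
    "relief": 153, "sadness": 516, "surprise": 219, "worry": 846,
}

# A majority-class predictor is right exactly on the samples of the largest class.
majority_class = max(test_counts, key=test_counts.get)
baseline_accuracy = test_counts[majority_class] / sum(test_counts.values())
print(majority_class, round(baseline_accuracy, 4))
```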

In [16]:
#5.
from itertools import chain
import nltk
from nltk import word_tokenize
nltk.download('punkt')
word_tokenized = list(chain.from_iterable(X_train.content.apply(word_tokenize)))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [17]:
from collections import Counter
counter = Counter(word_tokenized)
counter.most_common(10)

[('i', 18056),
 ('!', 17405),
 ('.', 14625),
 ('__USERNAME__', 13633),
 ('to', 10454),
 ('the', 9503),
 (',', 8895),
 ('a', 7190),
 ('my', 6148),
 ('it', 5763)]

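The most common tokens are punctuation marks ('!', '.', ','), the `__USERNAME__` placeholder, and function words: pronouns ('i', 'my', 'it'), the preposition 'to', and the determiners 'the' and 'a'. Apart from the placeholder, these are stopwords: they are very frequent but carry little emotional content on their own.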
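
One way to look past these tokens is to filter out stopwords and punctuation before counting. The sketch below uses a tiny hand-picked stopword set and toy tokens, both assumptions for illustration; NLTK provides a fuller English stopword list.

```python
import string
from collections import Counter

# A tiny hand-picked stopword set (an assumption, for illustration only).
STOPWORDS = {"i", "to", "the", "a", "my", "it"}

# Toy tokens standing in for the tokenized training tweets.
tokens = ["i", "love", "my", "dog", "!", "i", "miss", "the", "sun", ".", "love", "it", "!"]

def is_content_token(token):
    """Keep tokens that are neither stopwords nor made only of punctuation."""
    only_punctuation = all(ch in string.punctuation for ch in token)
    return token not in STOPWORDS and not only_punctuation

content_counts = Counter(t for t in tokens if is_content_token(t))
print(content_counts.most_common(1))
```

Note that a code-word such as `__USERNAME__` contains letters, so this filter keeps it; it would have to be added to the stopword set explicitly if it should be ignored.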

## Part 2: Bag of Words (BoW) Features

We will now see how a sentence can be transformed into a feature vector using a bag of words model. Consider the following sentences:

In [18]:
sentences = [
  'This is the first document.',
  'This document is the second document.',
  'And this is the third one.',
  'Is this the first document?'
]

We can represent each word as a **one-hot** encoded vector (with a single 1 in the column for that word), and add their vectors together to get the feature vector for a sentence.

**Questions:**

6. Use CountVectorizer from scikit-learn to get an array of one-hot encoded vectors for the sentences in `sentences`. What do the rows and columns of the feature matrix X represent?


In [19]:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer().fit(sentences)
X = vec.transform(sentences)

In [20]:
X

<4x9 sparse matrix of type '<class 'numpy.int64'>'
	with 21 stored elements in Compressed Sparse Row format>

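Each row corresponds to one sentence in `sentences`, and each column corresponds to one unique word (after lowercasing) in the vocabulary built from `sentences`. Entry (i, j) is the number of times word j occurs in sentence i.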

7. What word does the second column of X represent? What about the third column? (If you are stuck, look at `vectorizer.get_feature_names_out()`)


In [21]:
vec.get_feature_names_out()

array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)

The second column of X represents the word 'document', and the third column represents the word 'first'.
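
To see where the rows and columns come from, the count matrix can be rebuilt by hand. This is a minimal sketch that mimics `CountVectorizer`'s default behaviour (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary); it is not the library's implementation, and it builds a dense list of lists rather than a sparse matrix.

```python
import re
from collections import Counter

sentences = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]

# Lowercase, then keep tokens made of at least two word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())

# Columns: the sorted set of unique words across all sentences.
vocabulary = sorted({token for s in sentences for token in tokenize(s)})

# Rows: one count vector per sentence.
rows = []
for s in sentences:
    counts = Counter(tokenize(s))
    rows.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in rows:
    print(row)
```

The second row has a 2 in the 'document' column because that word occurs twice in the second sentence, which is what "adding the one-hot vectors together" produces.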

8. Try using TfidfVectorizer instead of CountVectorizer, and explain why some values of X become smaller than others.

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(sentences)

In [23]:
X_tfidf

<4x9 sparse matrix of type '<class 'numpy.float64'>'
	with 21 stored elements in Compressed Sparse Row format>

In [24]:
for elem in X_tfidf:
  print(elem)

  (0, 1)	0.46979138557992045
  (0, 2)	0.5802858236844359
  (0, 6)	0.38408524091481483
  (0, 3)	0.38408524091481483
  (0, 8)	0.38408524091481483
  (0, 5)	0.5386476208856763
  (0, 1)	0.6876235979836938
  (0, 6)	0.281088674033753
  (0, 3)	0.281088674033753
  (0, 8)	0.281088674033753
  (0, 4)	0.511848512707169
  (0, 7)	0.511848512707169
  (0, 0)	0.511848512707169
  (0, 6)	0.267103787642168
  (0, 3)	0.267103787642168
  (0, 8)	0.267103787642168
  (0, 1)	0.46979138557992045
  (0, 2)	0.5802858236844359
  (0, 6)	0.38408524091481483
  (0, 3)	0.38408524091481483
  (0, 8)	0.38408524091481483


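Each value depends on how often the word occurs in that sentence (term frequency) and on how many of the sentences contain it (inverse document frequency). Words that occur in every sentence, such as 'is', 'the' and 'this', get the lowest IDF weight, so their values are smaller than those of rarer words such as 'first' or 'second'. Within a sentence, a word that occurs more often gets a larger value: 'document' appears twice in the second sentence and has the highest weight there (0.69). By default `TfidfVectorizer` also L2-normalizes each row, which is why the same word can get different values in different sentences.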
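
The values printed above can be reproduced by hand. The sketch below assumes `TfidfVectorizer`'s default settings (raw term counts, smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, and L2 row normalization); it is a plain-Python illustration, not the library's code.

```python
import math
import re
from collections import Counter

sentences = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]

def tokenize(text):
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [Counter(tokenize(s)) for s in sentences]
n_docs = len(docs)
vocabulary = sorted(set().union(*docs))

# Document frequency: in how many sentences does each word appear?
doc_freq = {w: sum(1 for d in docs if w in d) for w in vocabulary}
# Smoothed inverse document frequency.
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}

def tfidf_row(doc):
    """Term count times IDF, then scale the row to unit L2 norm."""
    weights = {w: count * idf[w] for w, count in doc.items()}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / norm for w, v in weights.items()}

first = tfidf_row(docs[0])
second = tfidf_row(docs[1])
print({w: round(v, 4) for w, v in sorted(first.items())})
```

In the first sentence, 'first' (found in two sentences) ends up with a larger weight than 'document' (found in three), and both outweigh 'is', 'the' and 'this' (found in all four).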

**Note:** You might have noticed that there is a single feature for "this" (no separate one for uppercase "This"). Sklearn's CountVectorizer has default parameter `lowercase=True`; see [the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) for more information.

## Part 3: Feature Engineering and Modeling

**Questions:**

9. Using `CountVectorizer`, convert the tweets in our dataset into feature matrices `X_train` and `X_test`. Use the following instructions:
* Use the parameters `max_features=` and `ngram_range=` so that there are 10,000 features used (the most common ngrams) and so that the model uses unigrams, bigrams, and trigrams (i.e. one to three words).
* Make sure to only fit the vectorizer on the train data.

Also, simply use the `sentiment` column as-is for `y_train` and `y_test`.

10. We will use a Logistic regression model for classification because it is efficient and gives us interpretable results. Use Sklearn's `LogisticRegression` with default parameters. Fit on your training data (should take about 15 seconds) and predict on your test data.

11. Did your classifier use regularization? Why is or isn't this important in our use case? (Hint: See the sklearn documentation for logistic regression.)

12. Display your model's metrics on this multiclass problem. Is the performance better than the majority class baseline? Explain your results.

13. Print out the five features with highest importances for your classifier to classify each sentiment class. Make sure to print out the names of the features. **Hint:**  You may want to use `vectorizer.vocabulary_`, `clf.classes_`, and `clf.coef_`.

14. Do your results make sense? Does it look like bigrams or trigrams are important for your classifier?

In [25]:
#9.
vec_tweet = CountVectorizer(max_features = 10000, ngram_range = (1,3)).fit(X_train.content)
X_train = vec_tweet.transform(X_train.content)
X_test = vec_tweet.transform(X_test.content)
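
To make `ngram_range = (1,3)` concrete, the sketch below lists every unigram, bigram, and trigram of a short token sequence. The helper name `word_ngrams` and the example tweet are assumptions for illustration; `CountVectorizer` does the equivalent internally and then keeps only the `max_features` most frequent n-grams.

```python
def word_ngrams(tokens, n_min=1, n_max=3):
    """Return all contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = "cannot wait for friday".split()
grams = word_ngrams(tokens)
print(grams)
```

A four-word tweet already yields nine candidate features (4 unigrams, 3 bigrams, 2 trigrams), which is why capping the vocabulary with `max_features` is useful.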

In [26]:
#10.
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


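11. Yes: `LogisticRegression` applies L2 regularization by default (`penalty='l2'`, `C=1.0`). This matters here because we have 10,000 sparse features, many of them strongly correlated (for example, a bigram or trigram and the words it contains), and many that occur in only a handful of tweets. Without regularization the model could assign very large weights to such rare features and overfit; the penalty shrinks the weights and gives more robust results.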
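
As a rough sketch, L2-regularized multiclass logistic regression minimizes an objective of the following form. The notation ($W$ for the weight matrix, $b$ for the intercepts, $n$ for the number of training tweets) is assumed here, and the expression is only meant to match scikit-learn's formulation up to a constant scaling factor.

```latex
\min_{W,\,b}\quad \frac{1}{2}\,\lVert W \rVert_F^2
\;+\; C \sum_{i=1}^{n} -\log p_{W,b}\!\left(y_i \mid x_i\right)
```

The first term penalizes large weights and the second is the usual cross-entropy loss, so a smaller `C` means stronger regularization.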

In [27]:
#12.
from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

       anger       0.00      0.00      0.00        11
     boredom       0.00      0.00      0.00        18
       empty       0.08      0.02      0.04        83
  enthusiasm       0.00      0.00      0.00        76
         fun       0.23      0.13      0.17       178
   happiness       0.38      0.47      0.42       521
        hate       0.35      0.17      0.23       132
        love       0.44      0.41      0.43       384
      relief       0.18      0.10      0.13       153
     sadness       0.34      0.34      0.34       516
    surprise       0.14      0.09      0.11       219
       worry       0.39      0.54      0.45       846

    accuracy                           0.36      3137
   macro avg       0.21      0.19      0.19      3137
weighted avg       0.33      0.36      0.33      3137





In [28]:
print(confusion_matrix(y_test,y_pred))

[[  0   0   0   0   0   1   0   0   0   1   2   7]
 [  0   0   2   2   0   0   2   0   0   4   1   7]
 [  0   1   2   0   1  15   1   4   0  13   4  42]
 [  0   1   0   0   4  21   3   3   2  13   4  25]
 [  0   0   1   1  24  58   1  25   5  18  10  35]
 [  0   0   3   3  20 246   1  74  20  32  19 103]
 [  0   0   2   1   3   5  22   3   1  26   7  62]
 [  0   0   2   3  13  95   2 159  12  23   6  69]
 [  0   0   3   1   5  43   1  12  16  14   7  51]
 [  0   0   3   1  12  34  14  19   8 177  18 230]
 [  0   0   2   0   4  48   1  20   8  27  19  90]
 [  0   0   4   6  18  78  14  44  17 167  42 456]]


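We get an accuracy of 0.36, which is higher than the majority-class baseline of about 0.27.

The model almost never detects the smallest classes: recall is 0.00 for 'anger', 'boredom' and 'enthusiasm', and 0.02 for 'empty'. This explains the low macro-averaged F1 score of 0.19.
Most predictions fall into the largest classes ('worry', 'happiness', 'sadness', 'love'), and 'worry' in particular absorbs many tweets from the other classes.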
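
The accuracy and the per-class behaviour can be re-derived from the confusion matrix printed above, where rows are true labels and columns are predicted labels, both in the alphabetical order used by the classification report. The variable names below are chosen for illustration.

```python
labels = ["anger", "boredom", "empty", "enthusiasm", "fun", "happiness",
          "hate", "love", "relief", "sadness", "surprise", "worry"]

# Confusion matrix copied from the output above (rows: true, columns: predicted).
cm = [
    [0, 0, 0, 0,  0,   1,  0,   0,  0,   1,  2,   7],
    [0, 0, 2, 2,  0,   0,  2,   0,  0,   4,  1,   7],
    [0, 1, 2, 0,  1,  15,  1,   4,  0,  13,  4,  42],
    [0, 1, 0, 0,  4,  21,  3,   3,  2,  13,  4,  25],
    [0, 0, 1, 1, 24,  58,  1,  25,  5,  18, 10,  35],
    [0, 0, 3, 3, 20, 246,  1,  74, 20,  32, 19, 103],
    [0, 0, 2, 1,  3,   5, 22,   3,  1,  26,  7,  62],
    [0, 0, 2, 3, 13,  95,  2, 159, 12,  23,  6,  69],
    [0, 0, 3, 1,  5,  43,  1,  12, 16,  14,  7,  51],
    [0, 0, 3, 1, 12,  34, 14,  19,  8, 177, 18, 230],
    [0, 0, 2, 0,  4,  48,  1,  20,  8,  27, 19,  90],
    [0, 0, 4, 6, 18,  78, 14,  44, 17, 167, 42, 456],
]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))   # diagonal = correct predictions
accuracy = correct / total

# Column sums: how often each class was predicted.
predicted_counts = {label: sum(row[j] for row in cm) for j, label in enumerate(labels)}
# Diagonal over row sum: recall for each class.
recall = {label: cm[i][i] / sum(cm[i]) for i, label in enumerate(labels)}

print(f"accuracy = {correct}/{total} = {accuracy:.2f}")
print(sorted(predicted_counts.items(), key=lambda kv: kv[1], reverse=True)[:4])
```

The column sums show which classes the model favours, and the zero diagonal entries for the rarest classes correspond to the 0.00 recall values in the report.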

In [29]:
#13.
features = vec_tweet.get_feature_names_out()
for index, label in enumerate(lr.classes_):
  # Indices of the five largest coefficients for this class
  topindex = np.argsort(lr.coef_[index])[::-1][:5]
  print(f"Label : {label}")
  print(f'Most important features : {features[topindex]} \n')

Label : anger
Most important features : ['everything' 'bout' 'ass' 'big' 'knows'] 

Label : boredom
Most important features : ['bored' 'minutes' 'boring' 'math' 'stuck'] 

Label : empty
Most important features : ['bored' 'tip' 'vista' 'im not' 'lights'] 

Label : enthusiasm
Most important features : ['drank' 'it and' 'leave for' 'everyone is' 'cook'] 

Label : fun
Most important features : ['funny' 'fun' 'fans' 'sky' 'ellen'] 

Label : happiness
Most important features : ['laughed' 'woohoo' 'can wait' 'paris' 'excellent'] 

Label : hate
Most important features : ['hate' 'hates' 'stupid' 'sucks' 'rubbish'] 

Label : love
Most important features : ['loving' 'love' 'loved' 'lovely' 'sweetie'] 

Label : relief
Most important features : ['finally' 'at last' 'fine' 'relaxing' 'made it'] 

Label : sadness
Most important features : ['sad' 'sadly' 'disappointed' 'cried' 'missing'] 

Label : surprise
Most important features : ['surprise' 'surprisingly' 'surprised' 'you want to' 'wow'] 

Label : 

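14. For most labels the top features make sense (e.g. 'hate', 'love', 'sadness', 'relief'). However, for 'anger', 'empty' and 'enthusiasm' the top features do not seem closely related to the label, which is consistent with the poor performance on these small classes.
Among the top features shown, a few are bigrams (e.g. 'at last', 'made it', 'can wait') and only one is a trigram ('you want to'), so unigrams dominate. Bigrams seem to add some useful signal, while trigrams contribute little.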
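
A quick count of n-gram lengths among the top features printed above supports this reading. The dictionary below is copied from that output ('worry' is omitted because its features are not shown); the variable names are chosen for illustration.

```python
from collections import Counter

# Top-5 features per label, copied from the output of question 13.
top_features = {
    "anger": ["everything", "bout", "ass", "big", "knows"],
    "boredom": ["bored", "minutes", "boring", "math", "stuck"],
    "empty": ["bored", "tip", "vista", "im not", "lights"],
    "enthusiasm": ["drank", "it and", "leave for", "everyone is", "cook"],
    "fun": ["funny", "fun", "fans", "sky", "ellen"],
    "happiness": ["laughed", "woohoo", "can wait", "paris", "excellent"],
    "hate": ["hate", "hates", "stupid", "sucks", "rubbish"],
    "love": ["loving", "love", "loved", "lovely", "sweetie"],
    "relief": ["finally", "at last", "fine", "relaxing", "made it"],
    "sadness": ["sad", "sadly", "disappointed", "cried", "missing"],
    "surprise": ["surprise", "surprisingly", "surprised", "you want to", "wow"],
}

# The number of space-separated words gives the n-gram size of each feature.
ngram_sizes = Counter(
    len(feature.split())
    for features in top_features.values()
    for feature in features
)
print(dict(sorted(ngram_sizes.items())))
```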