# Lab3 Assignment: testing classifiers on a different data set with tweets

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

In this assignment, you will build and apply classifiers like those in the lab3 notebooks to another data set with emotion labels, which was used in the *WASSA* workshop in 2017:

http://saifmohammad.com/WebPages/EmotionIntensity-SharedTask.html

Saif M. Mohammad and Felipe Bravo-Marquez. In Proceedings of the EMNLP 2017 Workshop on Computational Approaches to Subjectivity, Sentiment, and Social Media (WASSA), September 2017, Copenhagen, Denmark.

The texts are tweets and therefore a different genre than the spoken utterances from the conversations in the MELD data set. The data set is included in the distribution of this lab, where we aggregated the training data and the test data into a single file each. The notebook already includes the code for loading these files into a pandas DataFrame.

We also included the functions to get AveragedWordEmbeddings in a separate Python file: **lab3_util.py**. These can be used to create embedding-based representations for the texts.

In [42]:
import pandas as pd
from collections import Counter
import numpy as np
import nltk
from nltk.corpus import stopwords
import gensim
from gensim.models.word2vec import Word2Vec
import gensim.downloader as api
import pickle
import sklearn
from sklearn import svm
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

## 1. Loading the tweet data set

In [41]:
filepath = './data/wassa/training/all.train.tsv'
dftweets_train = pd.read_csv(filepath, sep='\t')
dftweets_train.head()

| | ID | Tweet | Label | Score |
|---|---|---|---|---|
| 0 | 10000 | How the fu*k! Who the heck! moved my fridge!..... | anger | 0.938 |
| 1 | 10001 | So my Indian Uber driver just called someone t... | anger | 0.896 |
| 2 | 10002 | @DPD_UK I asked for my parcel to be delivered ... | anger | 0.896 |
| 3 | 10003 | so ef whichever butt wipe pulled the fire alar... | anger | 0.896 |
| 4 | 10004 | Don't join @BTCare they put the phone down on ... | anger | 0.896 |


In [6]:
dftweets_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3613 entries, 0 to 3612
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ID      3613 non-null   int64  
 1   Tweet   3613 non-null   object 
 2   Label   3613 non-null   object 
 3   Score   3613 non-null   float64
dtypes: float64(1), int64(1), object(2)
memory usage: 113.0+ KB


We see that this data set has 3613 tweets, each with a label and a score. Let's check the distribution of the labels:
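One way to inspect a label distribution before plotting it is to count the labels with `Counter`, which is already imported above. The sketch below uses a small made-up list of labels instead of the real `Label` column, so the numbers are purely illustrative; with the real data, the list would come from the `Label` column of the dataframe.

```python
from collections import Counter

# Toy labels standing in for the values of the Label column (made-up data).
toy_labels = ['anger', 'joy', 'anger', 'fear', 'sadness', 'joy', 'anger']

label_counts = Counter(toy_labels)
total = sum(label_counts.values())

# Print each label with its absolute count and relative frequency,
# from most to least frequent.
for label, count in label_counts.most_common():
    print(f'{label:10s} {count:3d} {count / total:.2%}')
```

The same counts can serve as the heights of the bars in a bar chart.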

In [9]:
# HERE COMES THE CODE TO GENERATE A BAR CHART FOR THE TRAIN DATA

In [10]:
##### Test data
filepath = './data/wassa/testing/all.test.tsv'
dftweets_test = pd.read_csv(filepath, sep='\t')

In [14]:
# HERE COMES THE CODE TO GENERATE A BAR CHART FOR THE TEST DATA

### YOUR ANALYSIS OF THE DATA DISTRIBUTION [2 points]

HERE COME YOUR COMMENTS ON THE DISTRIBUTION OF THE TRAIN AND TEST DATA AND A COMPARISON WITH THE DISTRIBUTION IN MELD.


## Training and testing a classifier with averaged word embeddings

### Extracting the texts and labels for the training and testing [2 points]

In [12]:
# HERE COMES THE CODE TO EXTRACT THE TRAINING TEXTS AND LABELS

In [13]:
# HERE COMES THE CODE TO EXTRACT THE TEST TEXTS AND LABELS

## 2. Representing and classifying tweets as averaged word embeddings

### 2.1 Representing the tweet training data using the same word embedding model [2 points]

We can use the same functions that we used in the notebook *Lab3.6.ml.emotion-detection.embeddings.ipynb* to get averaged embeddings for the tweets.
In order to use exactly the same functions and to be able to reuse them, we copied these functions to a separate Python file, "lab3_util.py". You can open the file in Jupyter to inspect its content.

We can now import this file as we do with other packages and apply it in this notebook but also in other code. This keeps our notebook readable and compact and makes sure we always use the same functions and do not accidentally change them across notebooks.

For future coding, it is wise to also apply this to your own code. Put reusable code as functions in a separate Python file with a descriptive name and import it in different notebooks or other Python files. In this way, you develop your own tools over time and reuse them when needed.

We first import the Python file *lab3_util.py* in this notebook so that the functions are loaded in working memory. The file should be located in the same folder as this notebook.

In [15]:
import lab3_util as util

From this point onwards, you can call the functions from this file as follows:

* util.getAvgFeatureVecs(.....)
* util.getMostFrequentWords(.....)

Check which parameters the functions require and what they return.
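To make concrete what an averaged word embedding is, here is a minimal sketch. It is not the code in *lab3_util.py*: the function `average_embedding` and the tiny 3-dimensional `toy_vectors` lookup are hypothetical stand-ins for the real functions and the real embedding model. Each text is represented by the mean of the vectors of its known tokens, and tokens without a vector are collected as unknown words.

```python
import numpy as np

# Hypothetical 3-dimensional "embedding model": a plain dict from word to vector.
toy_vectors = {
    'happy': np.array([1.0, 0.0, 0.0]),
    'sad':   np.array([0.0, 1.0, 0.0]),
    'day':   np.array([0.0, 0.0, 1.0]),
}

def average_embedding(tokens, vectors, num_features):
    """Return the mean vector of the known tokens and the list of unknown tokens."""
    known = [vectors[t] for t in tokens if t in vectors]
    unknown = [t for t in tokens if t not in vectors]
    if not known:
        # No token has a vector: fall back to an all-zero vector.
        return np.zeros(num_features), unknown
    return np.mean(known, axis=0), unknown

vec, unknown = average_embedding(['happy', 'day', '#blessed'], toy_vectors, 3)
print(vec)      # mean of the 'happy' and 'day' vectors
print(unknown)  # tokens that have no vector in the toy model
```

Whatever the length of the text, the result has the dimensionality of the embedding model, which is why the same classifier can be applied to texts from different data sets.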

We also need to load the same word embedding model as we used before to get compatible word embeddings for this data. Choose whatever you used before.

In [33]:
# HERE COMES THE CODE TO LOAD THE WORD EMBEDDINGS

In [39]:
# HERE COMES THE CODE TO DERIVE AVERAGED WORD EMBEDDING REPRESENTATIONS FOR THE TRAINING DATA
# TIP: HERE YOU USE THE CODE FROM LAB3_UTIL.PY

As before, we check which words are not in the embedding model's vocabulary.

In [35]:
# HERE COMES THE CODE TO GET THE LIST OF UNKNOWN WORDS  

### ANALYSE THE UNKNOWN WORDS [1 point]
HERE COMES YOUR ANALYSIS OF THE UNKNOWN WORDS AND YOUR EXPECTATION ON THE PERFORMANCE

### 2.2 Representing the tweet test data through embeddings [1 point]

In [36]:
# HERE COMES THE CODE TO REPRESENT THE TEST DATA

In [37]:
# HERE COMES THE CODE TO ANALYSE THE UNKNOWN WORDS

### ANALYSE THE UNKNOWN WORDS [1 point]
HERE COMES YOUR ANALYSIS OF THE UNKNOWN WORDS AND YOUR EXPECTATION ON THE PERFORMANCE

### 2.3 Training and testing the classifier [1 point]

In [38]:
# HERE COMES THE CODE TO TRAIN AND TEST AN SVM WITH THE EMBEDDING REPRESENTATION

### ANALYSIS OF THE TEST RESULT [2 points]

HERE COMES YOUR ANALYSIS OF THE RESULT

## 3. Testing the MELD classifiers on the tweet test set

We have now trained and tested an emotion classifier on tweet data. What about the MELD classifiers that we built before and saved to disk? Can we load these and test them on the same test tweet data set?

With embeddings this is easy because the averaged vectors have a fixed size that does not depend on the vocabulary of the training data, but we need to use the same embedding model for both training and testing. So if we used the *glove_twitter* embeddings with 25 dimensions for MELD, we have to do the same for the tweet test data.

For the BoW classifiers, we need to use the CountVectorizer that was fitted on the MELD data and its *transform* function (not *fit_transform*), so that the test data is represented with the same vocabulary.
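The difference between fitting and transforming can be sketched on a few made-up sentences. The texts below are invented stand-ins for the MELD utterances and the tweets, and the default `CountVectorizer` settings are an assumption; the vectorizer you saved earlier may have been configured differently (for example with a custom tokenizer).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up texts standing in for the MELD utterances (training) and the tweets (test).
train_texts = ['i am so happy today', 'this makes me angry']
test_texts = ['so angry about my parcel']

vectorizer = CountVectorizer()
# fit_transform learns the vocabulary from the training texts and vectorizes them.
X_train = vectorizer.fit_transform(train_texts)
# transform reuses that vocabulary; words unseen in training are ignored.
X_test = vectorizer.transform(test_texts)

print(X_train.shape, X_test.shape)          # both have the same number of columns
print('parcel' in vectorizer.vocabulary_)   # False: 'parcel' was not seen in training
```

Calling `fit_transform` on the test texts instead would build a new vocabulary, and the resulting columns would no longer match the features the classifier was trained on.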

If you saved the MELD models using the previous notebook, you can now load them again. Otherwise, you need to rebuild them, save them, and then load them here.
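As a reminder of how saving and loading works, the sketch below trains a tiny classifier on made-up vectors, writes it to disk with `pickle`, and reads it back. The data, the choice of `LinearSVC`, and the file name `toy_classifier.sav` are assumptions for illustration only; use the file names and classifier type you chose when saving your MELD models.

```python
import os
import pickle
import tempfile

from sklearn import svm

# Made-up 2-dimensional feature vectors and labels, for illustration only.
X_toy = [[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.1]]
y_toy = ['joy', 'joy', 'anger', 'anger']

classifier = svm.LinearSVC()
classifier.fit(X_toy, y_toy)

# Hypothetical file name inside a temporary directory.
model_path = os.path.join(tempfile.mkdtemp(), 'toy_classifier.sav')

# Save the trained classifier to disk ...
with open(model_path, 'wb') as f:
    pickle.dump(classifier, f)

# ... and load it again, as you would in a later notebook.
with open(model_path, 'rb') as f:
    loaded_classifier = pickle.load(f)

print(loaded_classifier.predict([[0.05, 0.95]]))
```

A fitted vectorizer can be saved and loaded in the same way as a classifier.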

### 3.2 Loading the MELD classifier for embedding representations [1 point]

In [21]:
# HERE COMES THE CODE TO LOAD THE MELD EMBEDDING CLASSIFIER

Because the vector representations for MELD and the test tweets were created with the same function and the same embedding model, we can use the MELD classifier directly to classify the tweet test representations we created in the previous section.

### 3.3 Testing the MELD Embedding classifier on the tweet test set [1 point]

In [26]:
# HERE COMES THE CODE TO CLASSIFY THE TEST DATA WITH THE MELD EMBEDDING REPRESENTATION

### ANALYSIS OF THE TEST RESULT [2 points]

HERE COMES YOUR ANALYSIS OF THE RESULT. COMPARE THESE WITH THE ABOVE.

## 4. Representing and classifying the tweets as a bag-of-words vector

### 4.1 Representing the tweets as BoW vectors [2 points]

In order to represent the test tweets according to our BoW vector representation derived from the MELD data, we need to load the *utterance_vec* file that we saved before.

In [27]:
# HERE COMES THE CODE TO LOAD THE MELD BOW VECTORIZER

We can now apply the transform function to the tokenized tweets.

In [28]:
# HERE COMES THE CODE TO REPRESENT THE TEST DATA ACCORDING TO MELD BOW VECTORIZER

We can now load the classifier and make the predictions on this data.

### 4.2 Loading and testing the MELD BoW classifier [2 points]

In [29]:
# HERE COMES THE CODE TO LOAD THE MELD BOW CLASSIFIER

In [30]:
# HERE COMES THE CODE TO APPLY THE CLASSIFIER TO THE TWEET TEST SET

### ANALYSIS OF THE TEST RESULT [2 points]

HERE COMES YOUR ANALYSIS OF THE RESULT. COMPARE THESE WITH THE ABOVE.

# End of the assignment