# Kaggle Tweet Sentiment Extraction
> My journey to learn some NLP.

- toc: true 
- badges: false
- comments: true
- categories: [kaggle, nlp]
- image: images/kaggle_tse/twitter.png

# Intro

Here are my notes on my solution for the [Kaggle Tweet Sentiment Extraction competition](https://www.kaggle.com/c/tweet-sentiment-extraction).

First of all, I am still a novice in the field of natural language processing. 
This means that these NLP concepts, and even the deep learning approaches to this field, are challenging for me to understand and apply. 
Hence, if you find anything wrong in my approach, please let me know via my personal email: [huygdng@gmail.com](mailto:huygdng@gmail.com).


## About this competition

Now let's go to the competition's description. Sentiment classification is a well-known problem in NLP. 
Given a sentence (a tweet, a line from a book, etc.), our algorithm should be able to tell the "attitude" of that input. 
For example, take a sentence like this:


> Kaggle is fun!

The above sentence expresses a "positive" attitude, and a proper sentiment classifier should
return "positive" for it.

Now, back to the competition. The challenge in this competition is not to classify the sentiment of tweets, but to **pick out** parts that reflect the sentiment of those tweets.

This is from the original description of the competition:

> Capturing sentiment in language is important in these times where decisions and reactions are created and updated in seconds.
But, which words actually lead to the sentiment description?
In this competition you will need to pick out the part of the tweet (word or phrase) that reflects the sentiment.

For the positive example above, a proper sentiment extraction model is expected to point out the term 'fun!' as the cause of the sentence's positiveness. Note that, in this example, not only the word 'fun' is marked as the positive term, but also its trailing punctuation. This affects the choice of approach, as we will see later on.

# A first glance at data

Here are some samples from the training dataset:

![kaggle-sample](../images/kaggle_tse/kaggle_sample.png)

So, the training data gives us a lot of information:

- The columns we can use for training are 'text' and 'sentiment'.

- The prediction we have to make is the 'selected_text' column.

- We have 3 classes for sentiment: positive, negative and neutral.

- About the 'text' column: the format is quite diverse. There are incomplete sentences ('is back home ... '), sentences with emoticons ('... <3 <3'), sentences with typos ('Hes just not ...'), and more.

- The sentiment distribution is fairly balanced: neutral 40%, positive 31% and negative 28%.

- The selected text contains words, punctuation and sometimes emoticons as well.
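As a rough illustration of how such a class distribution can be computed, here is a minimal sketch in plain Python. The rows are hypothetical toy examples that only mimic the ('text', 'selected_text', 'sentiment') layout; they are not taken from the real dataset, and the helper name `sentiment_distribution` is my own.

```python
from collections import Counter

# Hypothetical toy rows in the (text, selected_text, sentiment) layout.
# They are NOT rows from the real training data.
rows = [
    ("Kaggle is fun!", "fun!", "positive"),
    ("is back home now", "is back home now", "neutral"),
    ("my phone broke again, so sad", "so sad", "negative"),
    ("going to the store later", "going to the store later", "neutral"),
]


def sentiment_distribution(rows):
    """Return the share of each sentiment class as a fraction of all rows."""
    counts = Counter(sentiment for _, _, sentiment in rows)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


distribution = sentiment_distribution(rows)
for label, share in distribution.items():
    print(f"{label}: {share:.0%}")
```

On the real training file, the same counting would be applied to the 'sentiment' column.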

Listed below are some more basic analyses of this data. We first observe that the sentiment distributions of the training dataset and the test dataset are equivalent.

![kaggle-eda1](../images/kaggle_tse/sentiment_dist.png)

Our task in this competition, once again, is to find the correct pieces of text that emphasize the sentiment of the tweet. Hence, it is a good idea to compare the distribution of word counts in the original tweet and in the selected text (the target) for each sentiment class.

![kaggle-eda2](../images/kaggle_tse/text_length_comparison1.png)

Below is the corresponding histogram of word counts.

![kaggle-eda1](../images/kaggle_tse/text_length_comparison2.png)
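The word-count comparison behind these plots can be sketched as follows. This is only a minimal version under my own assumptions: words are split on whitespace, the rows are hypothetical toy examples (not real data), and `mean_word_counts` is an illustrative helper name.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy rows in the (text, selected_text, sentiment) layout.
rows = [
    ("Kaggle is fun!", "fun!", "positive"),
    ("is back home now", "is back home now", "neutral"),
    ("my phone broke again, so sad", "so sad", "negative"),
]


def mean_word_counts(rows):
    """Average word count of text and selected_text for each sentiment class."""
    text_lens = defaultdict(list)
    selected_lens = defaultdict(list)
    for text, selected_text, sentiment in rows:
        # Assumption: a "word" is any whitespace-separated chunk.
        text_lens[sentiment].append(len(text.split()))
        selected_lens[sentiment].append(len(selected_text.split()))
    return {s: (mean(text_lens[s]), mean(selected_lens[s])) for s in text_lens}


stats = mean_word_counts(rows)
for sentiment, (text_mean, selected_mean) in stats.items():
    print(f"{sentiment}: text={text_mean:.1f} words, selected={selected_mean:.1f} words")
```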

Among the 3 sentiment classes, the 'neutral' class has one interesting characteristic: the length of the input tweet and the length of the target text are almost the same. Hence, we can use this as a simple heuristic post-processing rule: for neutral tweets, return the whole tweet.
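A minimal sketch of such a post-processing rule is shown below. The function name `postprocess` and its arguments are hypothetical, and `predicted_span` stands for whatever text the model selected.

```python
def postprocess(text, sentiment, predicted_span):
    """Return the whole tweet for neutral inputs, otherwise keep the model's span."""
    if sentiment == "neutral":
        # Heuristic: for neutral tweets the target is almost always the full text.
        return text
    return predicted_span


print(postprocess("is back home now", "neutral", "home"))
print(postprocess("Kaggle is fun!", "positive", "fun!"))
```

The rule ignores the model's output entirely for neutral tweets, so it only helps as long as the observation above holds for the test data too.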

# My approaches

## Viewing the problem 

This problem can be treated as a **token classification** problem (e.g., Named Entity Recognition, Part-of-Speech tagging, ...), or as a **question-answering** problem.

However, when treating this problem as token classification, the model's output typically does not include punctuation, while the target text often does. Moreover, treating a word with its attached punctuation as a separate token is not a good idea: not only does that expand the vocabulary (which costs more time and computational resources to learn the language model), but it also does not guarantee that the model learns the similarity between a word and the same word with punctuation attached. Hence, this approach is not optimal.
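To make the vocabulary argument concrete, here is a small sketch that compares two vocabularies built from the same sentences: one that keeps punctuation attached to words, and one that strips it. The sentences and helper names are hypothetical, and whitespace splitting is a simplification of what a real tokenizer does.

```python
import string

# Hypothetical example sentences, not taken from the dataset.
sentences = ["Kaggle is fun!", "That was fun", "So much fun!!"]


def vocab_with_punctuation(sentences):
    """Vocabulary where 'fun', 'fun!' and 'fun!!' are three different tokens."""
    return {token for s in sentences for token in s.lower().split()}


def vocab_without_punctuation(sentences):
    """Vocabulary after stripping leading and trailing punctuation from each token."""
    tokens = {token.strip(string.punctuation) for s in sentences for token in s.lower().split()}
    return tokens - {""}


with_punct = vocab_with_punctuation(sentences)
without_punct = vocab_without_punctuation(sentences)
print(len(with_punct), sorted(with_punct))
print(len(without_punct), sorted(without_punct))
```

With punctuation attached, every variant of 'fun' becomes its own entry, and nothing tells the model that these entries are related.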





### The Question-answering problem

I formulated this task as a question-answering problem: given a question and a context, we train a transformer model to find the answer in the text column (the context).
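As an illustration, one row of the training data could be turned into a question-answering example roughly like this. Using the sentiment label as the question is an assumption on my side (the text above does not say what the question is), and `build_qa_example` and the dictionary keys are hypothetical names.

```python
def build_qa_example(text, selected_text, sentiment):
    """Turn one (text, selected_text, sentiment) row into a QA-style example."""
    start = text.find(selected_text)
    if start == -1:
        raise ValueError("selected_text is not a substring of text")
    end = start + len(selected_text)  # exclusive character offset
    return {
        "question": sentiment,  # assumption: the sentiment acts as the question
        "context": text,
        "answer_start": start,
        "answer_end": end,
    }


example = build_qa_example("Kaggle is fun!", "fun!", "positive")
answer = example["context"][example["answer_start"]:example["answer_end"]]
print(example["question"], "->", answer)
```

Because the answer is a character span of the context, the punctuation in 'fun!' is kept without any special handling. Mapping these character offsets to token positions for a transformer tokenizer is a further step that is not shown here.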

## Models

So far, I have tried the following model architectures:

   - The pre-trained BERT[[1](https://arxiv.org/abs/1810.04805)] with a custom head.

   - The pre-trained ELECTRA[[2](https://arxiv.org/abs/2003.10555)] with a custom head.

   - The pre-trained RoBERTa[[3](https://arxiv.org/abs/1907.11692)] with a custom head.

The custom head consists of 2 linear layers, with ReLU and Dropout applied after the first linear layer. The head's weights are initialized using Kaiming He normal initialization [[4](https://arxiv.org/abs/1502.01852)].
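The two-layer design described above could look roughly like this in PyTorch. This is only a sketch, not my exact training code: the hidden size of 768, the dropout probability of 0.1, the two outputs per token (start and end logits) and the zero-initialized biases are all assumptions, and the class name `SpanHead` is made up for this example.

```python
import torch
import torch.nn as nn


class SpanHead(nn.Module):
    """Two linear layers with ReLU and Dropout after the first one."""

    def __init__(self, hidden_size=768, dropout=0.1, num_outputs=2):
        super().__init__()
        # Assumed sizes: 768 matches base-sized encoders; 2 outputs = start/end logits.
        self.fc1 = nn.Linear(hidden_size, hidden_size)
        self.act = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.fc2 = nn.Linear(hidden_size, num_outputs)
        for layer in (self.fc1, self.fc2):
            # Kaiming He normal initialization for the weights.
            nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
            # Assumption: biases start at zero.
            nn.init.zeros_(layer.bias)

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden_size) from the pre-trained encoder.
        x = self.dropout(self.act(self.fc1(hidden_states)))
        logits = self.fc2(x)  # (batch, seq_len, num_outputs)
        start_logits, end_logits = logits.unbind(dim=-1)
        return start_logits, end_logits


head = SpanHead()
dummy_hidden = torch.randn(4, 16, 768)  # random stand-in for encoder output
start_logits, end_logits = head(dummy_hidden)
print(start_logits.shape, end_logits.shape)
```

In practice the head would sit on top of the last hidden states of BERT, ELECTRA or RoBERTa instead of the random tensor used here.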

## Input data

# References

[1] Devlin, J., Chang, M.W., Lee, K. and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. *arXiv preprint arXiv:1810.04805*.

[2] Clark, K., Luong, M.T., Le, Q.V. and Manning, C.D., 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. *arXiv preprint arXiv:2003.10555*.

[3] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L. and Stoyanov, V., 2019. RoBERTa: A robustly optimized BERT pretraining approach. *arXiv preprint arXiv:1907.11692*.

[4] He, K., Zhang, X., Ren, S. and Sun, J., 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In *Proceedings of the IEEE International Conference on Computer Vision* (pp. 1026-1034).