# Jigsaw Multilingual Toxic Comment Classification
>Note that this blogpost makes use of datasets that may be considered profane, vulgar, or offensive. 
- toc: true
- badges: true
- comments: true
- author: Aman Arora

In [8]:
# from transformers import XLMRobertaConfig, XLMRobertaTokenizer, XLMRobertaModel
import pandas as pd
%matplotlib inline

This is a really interesting competition, and in it I want to learn as well as teach: learn by experimenting with new techniques, and teach by showing how to run the experiments. I hope you find this blogpost interesting, because each week I will perform a new experiment and build an intuition for what works and what doesn't.

## What's so interesting about this competition? 

Well, firstly, the training data is only in English (`en`), while the valid and test sets contain multiple languages.

In [21]:
trn = pd.read_csv("./jigsaw-toxic-comment-train.csv", usecols=['comment_text', 'toxic'])
trn.head()

Unnamed: 0,comment_text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


Here is a list of 5 toxic comments in the train set: 

In [36]:
list(trn.query("toxic==1").comment_text[:5])

['COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK',
 "Bye! \n\nDon't look, come or think of comming back! Tosser.",
 'FUCK YOUR FILTHY MOTHER IN THE ASS, DRY!']

So, the training set consists only of a `comment_text` field, in English, and a `toxic` field that indicates whether the comment is toxic or not.

In [22]:
valid = pd.read_csv('./validation.csv')
valid.head()

Unnamed: 0,id,comment_text,lang,toxic
0,0,Este usuario ni siquiera llega al rango de ...,es,0
1,1,Il testo di questa voce pare esser scopiazzato...,it,0
2,2,Vale. Sólo expongo mi pasado. Todo tiempo pasa...,es,1
3,3,Bu maddenin alt başlığı olarak uluslararası i...,tr,0
4,4,Belçika nın şehirlerinin yanında ilçe ve belde...,tr,0


In [23]:
set(valid.lang)

{'es', 'it', 'tr'}

**The valid set consists of 'es' (ISO code for Spanish), 'it' (ISO code for Italian) and 'tr' (ISO code for Turkish)**. You can find the ISO codes for reference [here](https://www.andiamo.co.uk/resources/iso-language-codes/).

Now what about the test set? 

In [24]:
test = pd.read_csv('./test.csv')
test.head()

Unnamed: 0,id,content,lang
0,0,Doctor Who adlı viki başlığına 12. doctor olar...,tr
1,1,"Вполне возможно, но я пока не вижу необходимо...",ru
2,2,"Quindi tu sei uno di quelli conservativi , ...",it
3,3,Malesef gerçekleştirilmedi ancak şöyle bir şey...,tr
4,4,:Resim:Seldabagcan.jpg resminde kaynak sorunu ...,tr


In [25]:
set(test.lang)

{'es', 'fr', 'it', 'pt', 'ru', 'tr'}

**The test set consists of 'es' (ISO code for Spanish), 'fr' (ISO code for French), 'it' (ISO code for Italian), 'pt' (ISO code for Portuguese), 'ru' (ISO code for Russian) and 'tr' (ISO code for Turkish)**. You can find the ISO codes for reference [here](https://www.andiamo.co.uk/resources/iso-language-codes/).
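
Comparing the two language sets makes the gap explicit. The snippet below is a small sketch that hard-codes the language codes printed by `set(valid.lang)` and `set(test.lang)` above, so it runs without the competition files; the variable names are my own.

```python
# Language codes copied from the outputs of set(valid.lang) and set(test.lang) above.
valid_langs = {"es", "it", "tr"}
test_langs = {"es", "fr", "it", "pt", "ru", "tr"}

# Languages that appear in the test set but never in the valid set.
unseen_in_valid = sorted(test_langs - valid_langs)
print(unseen_in_valid)  # ['fr', 'pt', 'ru']

# Every valid-set language also appears in the test set.
print(valid_langs <= test_langs)  # True
```

So for French, Portuguese and Russian we have no labelled examples at all to validate against.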

Interesting, right? So can we train a toxic comment classifier on English and use it on another language? With a plain English-only model, I say "No".

Why? 
- **Tokens are different**: The first step of a language model is to create a tokenizer and convert the input text to tokens. How will our model treat one or more other languages if it doesn't know how to tokenize them?
- **No common embedding representation**: The tokens are converted to intermediate representations of some fixed dimension (768 in the case of BERT-base), and the model then performs the classification on these representations. How will our model know which representation to use for a language it has never seen, if it was only trained on English?
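
The first point can be illustrated with a deliberately naive sketch. It assumes a toy whole-word tokenizer whose vocabulary is built from a few made-up English sentences; `build_vocab` and `tokenize` are hypothetical helpers written for this example, not part of any library. Real subword tokenizers degrade more gracefully than this, but the underlying problem is the same: text from an unseen language maps poorly onto a vocabulary learned from English.

```python
import re

def build_vocab(texts):
    """Collect the lowercase word tokens seen in the training texts."""
    vocab = set()
    for text in texts:
        vocab.update(re.findall(r"\w+", text.lower()))
    return vocab

def tokenize(text, vocab, unk="[UNK]"):
    """Keep words that are in the vocabulary, replace the rest with `unk`."""
    words = re.findall(r"\w+", text.lower())
    return [word if word in vocab else unk for word in words]

# Toy English-only "training set" (made up for illustration).
english_train = [
    "you are my hero",
    "this user is not my hero",
    "the text of this entry",
]
vocab = build_vocab(english_train)

english_tokens = tokenize("this user is my hero", vocab)
spanish_tokens = tokenize("este usuario es mi héroe", vocab)

print(english_tokens)  # ['this', 'user', 'is', 'my', 'hero']
print(spanish_tokens)  # ['[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]']
```

Once every Spanish word collapses to the same unknown token, there is nothing left for the classifier to learn from.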

## So what do we do then? 

Well, we make use of multilingual transformer-based models. Recently there has been a lot of research into multilingual models, where one model handles multiple languages.

From what I've read on the Kaggle forums and in recent research papers, there are two main families of multilingual models: [XLM](https://arxiv.org/abs/1901.07291) and [XLM-R](https://ai.facebook.com/blog/-xlm-r-state-of-the-art-cross-lingual-understanding-through-self-supervision/) by Facebook AI, and [mBERT](https://github.com/google-research/bert/blob/master/multilingual.md) (multilingual BERT) by Google.

There is a great introduction to this competition using mBERT available [here](https://www.youtube.com/watch?v=vvr_f-X_LaI).

There is also a lot of interesting discussion about using XLM-R [here](https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification/discussion/140254).

Now which of the two is better? I do not know. BUT, we will find out through experimentation. At the time of writing this blogpost, there are still two more months to go. While I cannot aim to win the competition at this point, I do aim to get a gold medal, or at the least implement these new ideas to learn.

## Could we use translation? 

Another idea is to translate the test set to English and use a monolingual English model for toxicity classification. A translated version of the test set is available [here](https://www.kaggle.com/kashnitsky/jigsaw-multilingual-toxic-test-translated).

In [27]:
val_translated = pd.read_csv("./jigsaw_miltilingual_valid_translated.csv")
val_translated.head()

Unnamed: 0,id,comment_text,translated,lang,toxic
0,0,Este usuario ni siquiera llega al rango de ...,This user does not even make it to the rank of...,es,0
1,1,Il testo di questa voce pare esser scopiazzato...,The text of this entry appears to be like I di...,it,0
2,2,Vale. Sólo expongo mi pasado. Todo tiempo pasa...,It is worth it. Only expose my past. All time ...,es,1
3,3,Bu maddenin alt başlığı olarak uluslararası i...,Of this article as a sub-heading with maintain...,tr,0
4,4,Belçika nın şehirlerinin yanında ilçe ve belde...,"I guess while they're At of the city, district...",tr,0


In [35]:
list(val_translated.query("toxic==1").translated[5:10])

["BREED OF IDIOTS!!! You're drunk, your brain???? What is this intrusion????? Your troubles personal risolvetele on your pages (personal)! In addition, I have voted for the cancellation!!! Then go to .....the! ",
 'The title should linklenme of pornography. Though critical of the substance that should have a hanging this pervert get more scientific writing.95.7.150.Forty five ',
 "makes you so disgusting, leave double space?Mpiz (msg) ah, inveve is really nice to have all the paragraphs stuck that you don't know even where start one and ends the other..it is just belloMpiz (msg) is, but the lines so they are all stuck and do not understand a football club. with the double space remaining slightly more spaced out, even at a glance see where finishes one and starts the next.Mpiz (msg) ",
 "Hello really thought about it, that you are not going to find it?, I told you, I'm tracking, I'm tracking your comments, your insults, your whining, you'll soon know me.",
 'Eh excuse me, but She is a 

Above, I have shared 5 translated comments that are labelled toxic. The translated versions do seem toxic when you read them in plain English, so it is possible that this approach to toxicity classification might work. But then our training set needs to contain the same words for the classifier to pick up on them. One approach is to fine-tune the model on the valid set, run the test set through the same translation API, and then make predictions on the test set.

> This approach assumes that the translation API used (in this case Yandex) will not remove the toxicity and that it can maintain a good level of translation accuracy for multiple languages, especially those in the test set.

A downside is that if such a model were served in production, each sentence would have to be translated first before we could classify the toxicity of the comment.
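
The two-step dependency can be sketched as follows. Everything here is a stand-in: `translate_to_english` is a hypothetical placeholder for a call to a translation API (here just a lookup table over made-up sentences), and `english_toxicity_score` is a hypothetical placeholder for a fine-tuned English classifier (here a one-word keyword check). Only the shape of the pipeline is the point.

```python
def translate_to_english(text, lang):
    """Hypothetical stand-in for a translation API call (a lookup table here)."""
    toy_translations = {
        ("es", "eres un idiota"): "you are an idiot",
        ("it", "grazie per l'aiuto"): "thanks for the help",
    }
    return toy_translations[(lang, text)]

def english_toxicity_score(text):
    """Hypothetical stand-in for a fine-tuned English toxicity classifier."""
    toxic_words = {"idiot"}
    return 1.0 if toxic_words & set(text.lower().split()) else 0.0

def predict(text, lang):
    # Step 1: every non-English comment needs a translation call first.
    english = text if lang == "en" else translate_to_english(text, lang)
    # Step 2: only then can the English-only model score it.
    return english_toxicity_score(english)

print(predict("eres un idiota", "es"))      # 1.0
print(predict("grazie per l'aiuto", "it"))  # 0.0
```

Any error or extra latency in step 1 carries straight through to step 2, which is why translation quality matters so much for this approach.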

In [37]:
test_translated = pd.read_csv("./jigsaw_miltilingual_test_translated.csv")
test_translated.head()

Unnamed: 0,id,content,lang,translated
0,0,Doctor Who adlı viki başlığına 12. doctor olar...,tr,Title named Doctor Who wiki 12. doctor has add...
1,1,"Вполне возможно, но я пока не вижу необходимо...",ru,"It is possible, but I don't see the need to a..."
2,2,"Quindi tu sei uno di quelli conservativi , ...",it,"Then you're one of those conservative , who wo..."
3,3,Malesef gerçekleştirilmedi ancak şöyle bir şey...,tr,"Unfortunately, it was not performed, but had s..."
4,4,:Resim:Seldabagcan.jpg resminde kaynak sorunu ...,tr,:Resim:Seldabagcan.jpg the image of the source...


In [43]:
list(test_translated.translated[95:100])

['Hello, Qbert88. I am writing to you because you are in the list of users of Emilia-Romagna, and I wanted to tell you about the next wikiraduno in Bologna from the 11th to the 15th of June 2008. If you have the desire or time to come you will find all the information on the page of the gathering. Xaura (msg) ',
 "Z have understood nothing... It's supposed to be ironic, and rightly so. Steve Rogers represents all that the Nazis (racist and eugenic) dreaming on the physical plane, but on the moral plane hates everything they represent, and becomes their worst enemy.216.15.41.45 (d) ",
 "I bet your mother is still working !!! d the other hand, the whores work late !!! I'm sure you're a loser nerd shit behind a computer who likes to make the dick and then go fuck and blocks I'm fucking ugly nerd fucking D",
 'Hello Huster, I am a new contributor and I note your change of categorization of the article on the Red Cross of Belgium. What principle has motivated this amendment? Sincerely, C is

The translated test set seems to have retained the toxicity too. It is possible that some of the toxic comments were translated properly while others were not.

## Any other ideas? 

Surely, translating the test set and making predictions on it will serve as a good benchmark. In fact, we will be using this approach to create a benchmark model.

Personally, though, I believe that translating every language correctly (with a high BLEU score) while maintaining toxicity is a harder task than classifying toxicity in these languages directly. So here are some other options:

1. For this competition we could train one monolingual model per language and use each one to classify toxicity in its own language.
2. We could find additional datasets beyond the ones provided and train/fine-tune a multilingual model to classify toxicity.
3. We could translate the train set into multiple languages and then use that to train a multilingual model.
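
Since the test set ships with a `lang` column, the per-language idea in the first option mostly comes down to routing each comment to the right model. The sketch below shows only that routing; `group_by_language` and `make_dummy_model` are hypothetical helpers, the rows are shortened placeholders, and the "models" return a constant score where a real setup would load a classifier fine-tuned on that language.

```python
from collections import defaultdict

def group_by_language(rows):
    """Bucket (id, text, lang) rows so each language goes to its own model."""
    buckets = defaultdict(list)
    for row_id, text, lang in rows:
        buckets[lang].append((row_id, text))
    return buckets

def make_dummy_model(lang):
    """Hypothetical placeholder for a classifier fine-tuned on `lang`."""
    def model(text):
        return 0.0  # constant score, just to keep the sketch runnable
    return model

# Shortened placeholder rows in the shape of the test set: (id, content, lang).
rows = [
    (0, "Doctor Who ...", "tr"),
    (1, "Вполне возможно ...", "ru"),
    (3, "Malesef ...", "tr"),
]

buckets = group_by_language(rows)
models = {lang: make_dummy_model(lang) for lang in buckets}

# Score every row with the model that matches its language.
predictions = {
    row_id: models[lang](text)
    for lang, items in buckets.items()
    for row_id, text in items
}

print(sorted(buckets))      # ['ru', 'tr']
print(sorted(predictions))  # [0, 1, 3]
```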

Personally, I currently believe that approach 1 or approach 2 will work best. In the next blog, we will create a benchmark using the translated test set, and then we will move on to approaches 1 and 2. See you in the next blog!