# Contradictory, My Dear Watson

### Detecting contradiction and entailment in multilingual text using TPUs

**Kaggle link:** https://www.kaggle.com/competitions/contradictory-my-dear-watson/overview

### Description

"…when you have eliminated the impossible, whatever remains, however improbable, must be the truth"
-Sir Arthur Conan Doyle

Our brains process the meaning of a sentence like this rather quickly.

We're able to surmise:

- Some things to be true: "You can find the right answer through the process of elimination."
- Others that may have truth: "Ideas that are improbable are not impossible!"
- And some claims are clearly contradictory: "Things that you have ruled out as impossible are where the truth lies."

Natural language processing (NLP) has grown increasingly elaborate over the past few years. Machine learning models tackle question answering, text extraction, sentence generation, and many other complex tasks. But can machines determine the relationships between sentences, or is that still left to humans? If NLP can be applied between sentences, this could have profound implications for fact-checking, identifying fake news, analyzing text, and much more.

### The Challenge:
If you have two sentences, there are three ways they could be related: one could entail the other, one could contradict the other, or they could be unrelated. Natural language inference (NLI) is a popular NLP problem that involves determining how pairs of sentences (consisting of a premise and a hypothesis) are related.

Your task is to create an NLI model that assigns labels of 0, 1, or 2 (corresponding to entailment, neutral, and contradiction) to pairs of premises and hypotheses. To make things more interesting, the train and test set include text in fifteen different languages! You can find more details on the dataset by reviewing the Data page.

Today, the most common approaches to NLI problems include using embeddings and transformers like BERT. In this competition, we’re providing a starter notebook to try your hand at this problem using the power of Tensor Processing Units (TPUs). TPUs are powerful hardware accelerators specialized in deep learning tasks, including Natural Language Processing. Kaggle provides all users TPU Quota at no cost, which you can use to explore this competition. Check out our TPU documentation and Kaggle’s YouTube playlist for more information and resources.

### Recommended Tutorial
We highly recommend the Keras team's tutorial on using KerasNLP to solve this problem, as well as Ana Sofia Uzsoy's tutorial, which walks you through creating your very first submission step by step with TPUs and BERT.

This is a great opportunity to flex your NLP muscles and solve an exciting problem!

### Disclaimer: 
The dataset for this competition contains text that may be considered profane, vulgar, or offensive.

### Evaluation

#### Goal
Your goal is to predict whether a given hypothesis is related to its premise by entailment, by contradiction, or by neither (neutral).
For each sample in the test set, you must predict a value of 0, 1, or 2 in the prediction column.

Those values map to the logical condition as follows:

- 0: entailment
- 1: neutral
- 2: contradiction

#### Metric
Your score is the percentage of relationships you correctly predict. This is known as accuracy.
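As a rough illustration of the metric, the sketch below computes accuracy in plain Python. The `accuracy` helper and the label lists are made up for this example; they are not part of the competition's scoring code.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty set of predictions")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels: 0 = entailment, 1 = neutral, 2 = contradiction
y_true = [0, 2, 0, 0, 1]
y_pred = [0, 2, 1, 0, 2]

score = accuracy(y_true, y_pred)  # 3 of the 5 predictions match
print(f"accuracy = {score:.2%}")
```

Accuracy weights every sample equally, so it does not show which of the three classes a model confuses most often.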


#### Submission File Format
You should submit a CSV file with exactly 5195 entries plus a header row. Your submission will show an error if you have extra columns (beyond id and prediction) or rows.

The file should have exactly 2 columns:

- id (sorted in any order)
- prediction (contains your predictions: 0 for entailment, 1 for neutral, 2 for contradiction)

    id,prediction
    c6d58c3f69,1
    cefcc82292,1
    e98005252c,1
    58518c10ba,1
    c32b0d16df,1
    etc.

You can download an example submission file (sample_submission.csv) on the Data page.
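Before uploading, it can help to check a submission against the format rules above. The following is a minimal sketch using only the standard library; `validate_submission` is a hypothetical helper, and the duplicate-id check is an assumption based on ids being unique per sample. Kaggle's own validation may differ.

```python
import csv
import io

VALID_LABELS = {"0", "1", "2"}


def validate_submission(csv_text, expected_rows):
    """Return a list of problems found; an empty list means the file looks fine."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header != ["id", "prediction"]:
        return [f"unexpected header: {header}"]

    problems = []
    rows = [row for row in reader if row]  # ignore blank lines
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")

    seen = set()
    for row in rows:
        if len(row) != 2:
            problems.append(f"wrong column count: {row}")
            continue
        sample_id, prediction = row
        if sample_id in seen:  # assumption: every id appears exactly once
            problems.append(f"duplicate id: {sample_id}")
        seen.add(sample_id)
        if prediction not in VALID_LABELS:
            problems.append(f"invalid prediction for {sample_id}: {prediction!r}")
    return problems


# Three-row toy file; the real submission needs expected_rows=5195.
sample = "id,prediction\nc6d58c3f69,1\ncefcc82292,1\ne98005252c,1\n"
problems_ok = validate_submission(sample, expected_rows=3)
problems_bad = validate_submission(sample + "58518c10ba,3\n", expected_rows=3)
print(problems_ok)
print(problems_bad)
```

For the real file, read the text of submission.csv and pass `expected_rows=5195`.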

### Code Submission Requirement

In this code competition, your submission.csv file must be generated as an output from a Kaggle notebook. For details on how to submit from a notebook, review the FAQ on "How do I make a submission?"

### Dataset Description
In this Getting Started Competition, we’re classifying pairs of sentences (consisting of a premise and a hypothesis) into three categories - entailment, contradiction, or neutral. Let’s take a look at an example of each of these cases for the following premise:

He came, he opened the door and I remember looking back and seeing the expression on his face, and I could tell that he was disappointed.

#### Hypothesis 1:

Just by the look on his face when he came through the door I just knew that he was let down.

We know that this is true based on the information in the premise. So, this pair is related by entailment.


#### Hypothesis 2:

He was trying not to make us feel guilty but we knew we had caused him trouble.

This very well might be true, but we can’t conclude this based on the information in the premise. So, this relationship is neutral.

#### Hypothesis 3:

He was so excited and bursting with joy that he practically knocked the door off its frame.

We know this isn’t true, because it is the complete opposite of what the premise says. So, this pair is related by contradiction.

This dataset contains premise-hypothesis pairs in fifteen different languages: Arabic, Bulgarian, Chinese, German, Greek, English, Spanish, French, Hindi, Russian, Swahili, Thai, Turkish, Urdu, and Vietnamese.

#### Files:
- train.csv: This file contains the ID, premise, hypothesis, and label, as well as the language of the text and its two-letter abbreviation.
- test.csv: This file contains the ID, premise, hypothesis, language, and language abbreviation, without labels.
- sample_submission.csv: This is a sample submission file in the correct format:
    - id: a unique identifier for each sample
    - prediction: the classification of the relationship between the premise and hypothesis (0 for entailment, 1 for neutral, 2 for contradiction)

Special thanks to TensorFlow Datasets (TFDS) for providing this and many other useful datasets! For more information, visit: https://www.tensorflow.org/datasets

In [1]:
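# Data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Wildcard imports are kept as in the original notebook
from sklearn.model_selection import *
from sklearn.linear_model import *
from math import *
# Provides download(), used in the next cell
from nltk import *
import warnings
warnings.filterwarnings("ignore")
# Provides GoogleTranslator, used further below
from deep_translator import *
import os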

In [2]:
download("all")

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /Users/amith/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     /Users/amith/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /Users/amith/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /Users/amith/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /Users/amith/nltk_data...
[nltk_data]    |   Unzipping grammars/basque_grammars.zip.
[nltk_data]    | Downloading package bcp47 to
[nltk_data]    |     /Users/amith/nltk_data...
[nltk_data]    | Downloading pac

True

In [3]:
df_train = pd.read_csv("data/train.csv")
df_train.head()

Unnamed: 0,id,premise,hypothesis,lang_abv,language,label
0,5130fd2cb5,and these comments were considered in formulat...,The rules developed in the interim were put to...,en,English,0
1,5b72532a0b,These are issues that we wrestle with in pract...,Practice groups are not permitted to work on t...,en,English,2
2,3931fbe82a,Des petites choses comme celles-là font une di...,J'essayais d'accomplir quelque chose.,fr,French,0
3,5622f0c60b,you know they can't really defend themselves l...,They can't defend themselves because of their ...,en,English,0
4,86aaa48b45,ในการเล่นบทบาทสมมุติก็เช่นกัน โอกาสที่จะได้แสด...,เด็กสามารถเห็นได้ว่าชาติพันธุ์แตกต่างกันอย่างไร,th,Thai,1


In [4]:
df_test = pd.read_csv("data/test.csv")
df_test.head()

Unnamed: 0,id,premise,hypothesis,lang_abv,language
0,c6d58c3f69,بکس، کیسی، راہیل، یسعیاہ، کیلی، کیلی، اور کولم...,"کیسی کے لئے کوئی یادگار نہیں ہوگا, کولمین ہائی...",ur,Urdu
1,cefcc82292,هذا هو ما تم نصحنا به.,عندما يتم إخبارهم بما يجب عليهم فعله ، فشلت ال...,ar,Arabic
2,e98005252c,et cela est en grande partie dû au fait que le...,Les mères se droguent.,fr,French
3,58518c10ba,与城市及其他公民及社区组织代表就IMA的艺术发展进行对话&amp,IMA与其他组织合作，因为它们都依靠共享资金。,zh,Chinese
4,c32b0d16df,Она все еще была там.,"Мы думали, что она ушла, однако, она осталась.",ru,Russian


In [5]:
print("Number of training samples : ",len(df_train))
print("Number of test samples     : ",len(df_test))

Number of training samples :  12120
Number of test samples     :  5195


In [6]:
df_train_1 = df_train.copy()

In [7]:
# Default arguments: source language is auto-detected, target language is English
tr = GoogleTranslator()

In [14]:
# End row of each of 50 equal-sized chunks, followed by the total row count
chunk = len(df_train) // 50
for i in range(1, 51):
    print(f"{i:2d} :  ", i * chunk)
print("51 :  ", len(df_train))

 1 :   242
 2 :   484
 3 :   726
 4 :   968
 5 :   1210
 6 :   1452
 7 :   1694
 8 :   1936
 9 :   2178
10 :   2420
11 :   2662
12 :   2904
13 :   3146
14 :   3388
15 :   3630
16 :   3872
17 :   4114
18 :   4356
19 :   4598
20 :   4840
21 :   5082
22 :   5324
23 :   5566
24 :   5808
25 :   6050
26 :   6292
27 :   6534
28 :   6776
29 :   7018
30 :   7260
31 :   7502
32 :   7744
33 :   7986
34 :   8228
35 :   8470
36 :   8712
37 :   8954
38 :   9196
39 :   9438
40 :   9680
41 :   9922
42 :   10164
43 :   10406
44 :   10648
45 :   10890
46 :   11132
47 :   11374
48 :   11616
49 :   11858
50 :   12100
51 :   12120
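The boundaries printed above split the 12120 training rows into 50 chunks of 242 rows plus a 20-row remainder. Assuming the aim is to translate the text chunk by chunk (the notebook does not say so at this point), a sketch of that loop might look like the following. `chunk_bounds` and `translate_in_chunks` are hypothetical helpers, and `translate` is any callable that maps one string to another.

```python
def chunk_bounds(n_rows, n_chunks):
    """Return (start, stop) pairs covering range(n_rows).

    The first n_chunks pairs have equal size; one extra pair holds any remainder.
    """
    size = n_rows // n_chunks
    bounds = [(i * size, (i + 1) * size) for i in range(n_chunks)]
    if n_chunks * size < n_rows:
        bounds.append((n_chunks * size, n_rows))
    return bounds


def translate_in_chunks(texts, translate, n_chunks):
    """Apply `translate` to every text, processing one chunk at a time."""
    out = []
    for start, stop in chunk_bounds(len(texts), n_chunks):
        out.extend(translate(t) for t in texts[start:stop])
    return out


bounds = chunk_bounds(12120, 50)
print(len(bounds), bounds[0], bounds[-1])

# Stand-in translator so the example runs offline; str.upper is NOT a translation.
demo = translate_in_chunks(["a", "b", "c", "d", "e"], str.upper, n_chunks=2)
print(demo)
```

In the notebook, `tr.translate` could be passed as `translate`. That call needs network access, so it is not exercised here.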


In [24]:
# .loc slices by label and includes the end label, so this selects rows 0 through 242 (243 rows)
premise = df_train.loc[0:int(len(df_train)/50),"premise"]
premise

0      and these comments were considered in formulat...
1      These are issues that we wrestle with in pract...
2      Des petites choses comme celles-là font une di...
3      you know they can't really defend themselves l...
4      ในการเล่นบทบาทสมมุติก็เช่นกัน โอกาสที่จะได้แสด...
                             ...                        
238    The data would be presented as required supple...
239                    我的梦想是看到每一个美国人都成为奥运家庭的一员，所以请尽一切可能。
240    जनसंख्या वृद्धि  विपरीत दिशा में  प्रदूषण की त...
241    اصلاحات جو ابھی تک اپنایا گیا ہے گہرے اثرات ہی...
242    Initial demand for land in the New Town was no...
Name: premise, Length: 243, dtype: object
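The output above has length 243 rather than 242 because `.loc` slices by label and includes the end label, whereas `.iloc` slices by position and excludes the end. A small sketch on a made-up Series shows the difference:

```python
import pandas as pd

# Toy data with the default integer index 0..4
s = pd.Series(["a", "b", "c", "d", "e"], name="premise")

by_label = s.loc[0:2]      # label-based: end label included -> rows 0, 1, 2
by_position = s.iloc[0:2]  # position-based: end excluded -> rows 0, 1

print(len(by_label), len(by_position))
```

When chunking with `.loc`, consecutive slices such as `0:242` and `242:484` would both contain row 242, so `.iloc` is usually the safer choice for non-overlapping chunks.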

In [21]:
df_train.index.values[0:int(len(df_train)/50)]

array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
       104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
       117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
       130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
       143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
       156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168,
       169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 18

In [17]:
df_train.loc[0:50,:]

Unnamed: 0,id,premise,hypothesis,lang_abv,language,label
0,5130fd2cb5,and these comments were considered in formulat...,The rules developed in the interim were put to...,en,English,0
1,5b72532a0b,These are issues that we wrestle with in pract...,Practice groups are not permitted to work on t...,en,English,2
2,3931fbe82a,Des petites choses comme celles-là font une di...,J'essayais d'accomplir quelque chose.,fr,French,0
3,5622f0c60b,you know they can't really defend themselves l...,They can't defend themselves because of their ...,en,English,0
4,86aaa48b45,ในการเล่นบทบาทสมมุติก็เช่นกัน โอกาสที่จะได้แสด...,เด็กสามารถเห็นได้ว่าชาติพันธุ์แตกต่างกันอย่างไร,th,Thai,1
5,ed7d6a1e62,"Bir çiftlikte birisinin, ağıla kapatılmış bu ö...",Çiftlikte insanlar farklı terimler kullanırlar.,tr,Turkish,0
6,5a0f4908a0,ریاست ہائے متحدہ امریکہ واپس آنے پر، ہج ایف بی...,ہیگ کی تفتیش ایف بی آئی اہلکاروں کی طرف سے کی...,ur,Urdu,0
7,fdcd1bd867,From Cockpit Country to St. Ann's Bay,From St. Ann's Bay to Cockpit Country.,en,English,2
8,7cfb3d272c,"Look, it's your skin, but you're going to be i...",The boss will fire you if he sees you slacking...,en,English,1
9,8c10229663,Через каждые сто градусов пятна краски меняют ...,Краска изменяется в соответствии с цветом.,ru,Russian,0
