# Contradictory, My Dear Watson

### Detecting contradiction and entailment in multilingual text using TPUs

**Kaggle link:** https://www.kaggle.com/competitions/contradictory-my-dear-watson/overview

### Description

"…when you have eliminated the impossible, whatever remains, however improbable, must be the truth"
-Sir Arthur Conan Doyle

Our brains process the meaning of a sentence like this rather quickly.

We're able to surmise:

- Some things to be true: "You can find the right answer through the process of elimination."
- Others that may have truth: "Ideas that are improbable are not impossible!"
- And some claims are clearly contradictory: "Things that you have ruled out as impossible are where the truth lies."

Natural language processing (NLP) has grown increasingly elaborate over the past few years. Machine learning models tackle question answering, text extraction, sentence generation, and many other complex tasks. But, can machines determine the relationships between sentences, or is that still left to humans? If NLP can be applied between sentences, this could have profound implications for fact-checking, identifying fake news, analyzing text, and much more.

### The Challenge:
If you have two sentences, there are three ways they could be related: one could entail the other, one could contradict the other, or they could be unrelated. Natural Language Inferencing (NLI) is a popular NLP problem that involves determining how pairs of sentences (consisting of a premise and a hypothesis) are related.

Your task is to create an NLI model that assigns labels of 0, 1, or 2 (corresponding to entailment, neutral, and contradiction) to pairs of premises and hypotheses. To make things more interesting, the train and test sets include text in fifteen different languages! You can find more details on the dataset by reviewing the Data page.

Today, the most common approaches to NLI problems include using embeddings and transformers like BERT. In this competition, we’re providing a starter notebook to try your hand at this problem using the power of Tensor Processing Units (TPUs). TPUs are powerful hardware accelerators specialized in deep learning tasks, including Natural Language Processing. Kaggle provides all users TPU Quota at no cost, which you can use to explore this competition. Check out our TPU documentation and Kaggle’s YouTube playlist for more information and resources.
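
As a rough illustration of how a transformer like BERT sees a sentence pair, the sketch below joins a premise and a hypothesis into one sequence with `[CLS]` and `[SEP]` markers, segment ids, and an attention mask. The function name `encode_pair`, the `max_len` parameter, and the whitespace tokenization are assumptions made for this example; a real pipeline would use the model's own subword tokenizer.

```python
def encode_pair(premise, hypothesis, max_len=16):
    """Build BERT-style tokens, segment ids, and attention mask for one pair."""
    # Whitespace splitting stands in for a real subword tokenizer (assumption).
    p_tokens = premise.lower().split()
    h_tokens = hypothesis.lower().split()

    tokens = ["[CLS]"] + p_tokens + ["[SEP]"] + h_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the premise, and the first [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(p_tokens) + 2) + [1] * (len(h_tokens) + 1)

    # Naive truncation (it may cut the final [SEP]), then padding to a fixed length.
    tokens = tokens[:max_len]
    segment_ids = segment_ids[:max_len]
    attention_mask = [1] * len(tokens)
    pad = max_len - len(tokens)
    tokens += ["[PAD]"] * pad
    segment_ids += [0] * pad
    attention_mask += [0] * pad
    return tokens, segment_ids, attention_mask


tokens, segment_ids, attention_mask = encode_pair(
    "He was disappointed", "He was let down", max_len=12
)
print(tokens)
print(segment_ids)
print(attention_mask)
```

The segment ids are what let the model tell the premise apart from the hypothesis inside a single input sequence.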

### Recommended Tutorial
We highly recommend this excellent tutorial from the Keras team on using KerasNLP to solve this problem, as well as Ana Sofia Uzsoy’s tutorial, which walks you through creating your very first submission step by step with TPUs and BERT.
This is a great opportunity to flex your NLP muscles and solve an exciting problem!

### Disclaimer: 
The dataset for this competition contains text that may be considered profane, vulgar, or offensive.

### Evaluation

#### Goal
Your goal is to predict whether a given hypothesis is related to its premise by contradiction, entailment, or whether neither of those is true (neutral).
For each sample in the test set, you must predict a value of 0, 1, or 2 for the prediction variable.

Those values map to the logical condition as:

- 0 == entailment
- 1 == neutral
- 2 == contradiction
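
The mapping above can be kept in one place in code so that predictions and human-readable names stay in sync. This is a minimal sketch; the names `LABEL_NAMES`, `LABEL_IDS`, and `label_name` are made up for illustration.

```python
# Label ids as defined by the competition text.
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}
# Reverse lookup, from name back to id.
LABEL_IDS = {name: idx for idx, name in LABEL_NAMES.items()}


def label_name(label_id):
    """Return the relationship name for a label id, rejecting anything outside 0-2."""
    if label_id not in LABEL_NAMES:
        raise ValueError(f"label must be 0, 1, or 2, got {label_id!r}")
    return LABEL_NAMES[label_id]


print(label_name(2))
print(LABEL_IDS["neutral"])
```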

#### Metric
Your score is the percentage of relationships you correctly predict. This is known as accuracy.
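
Accuracy is simple enough to compute by hand, which is useful for checking a validation split before submitting. The sketch below assumes two equal-length lists of integer labels and returns a fraction between 0 and 1 (multiply by 100 for a percentage); the function name `accuracy` is chosen for this example.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Three of the four predictions match the true labels.
score = accuracy([0, 1, 2, 2], [0, 1, 1, 2])
print(score)
```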


#### Submission File Format
You should submit a CSV file with exactly 5195 entries plus a header row. Your submission will show an error if you have extra columns (beyond id and prediction) or extra rows.

The file should have exactly 2 columns:

- id (sorted in any order)
- prediction (contains your predictions: 0 for entailment, 1 for neutral, 2 for contradiction)

    id,prediction
    c6d58c3f69,1
    cefcc82292,1
    e98005252c,1
    58518c10ba,1
    c32b0d16df,1
    etc.

You can download an example submission file (sample_submission.csv) on the Data page.
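
Before uploading, it can help to check a submission against the rules stated above: an `id,prediction` header, exactly two columns, the expected number of rows, and predictions limited to 0, 1, or 2. The following is a minimal sketch using only the standard library; `validate_submission` is a hypothetical helper, and the duplicate-id check is an extra assumption rather than a stated competition rule.

```python
import csv
import io

EXPECTED_ROWS = 5195  # row count stated in the competition text
VALID_LABELS = {"0", "1", "2"}


def validate_submission(csv_text, expected_rows=EXPECTED_ROWS):
    """Return a list of problems found; an empty list means the file looks valid."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header != ["id", "prediction"]:
        return [f"unexpected header: {header}"]

    problems = []
    rows = [row for row in reader if row]  # skip blank lines
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")

    seen_ids = set()
    for row_no, row in enumerate(rows, start=1):
        if len(row) != 2:
            problems.append(f"row {row_no}: expected 2 columns, found {len(row)}")
            continue
        sample_id, prediction = row
        if sample_id in seen_ids:  # assumption: each id should appear once
            problems.append(f"row {row_no}: duplicate id {sample_id}")
        seen_ids.add(sample_id)
        if prediction not in VALID_LABELS:
            problems.append(f"row {row_no}: invalid prediction {prediction!r}")
    return problems


sample = "id,prediction\nc6d58c3f69,1\ncefcc82292,1\n"
print(validate_submission(sample, expected_rows=2))
```

For a real file, the text of `submission.csv` would be read from disk and passed in with the default `expected_rows`.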

### Code Submission Requirement

In this code competition, your submission.csv file must be generated as an output from a Kaggle notebook. For details on how to submit from a notebook, review the FAQ on "How do I make a submission?"

### Dataset Description
In this Getting Started Competition, we’re classifying pairs of sentences (consisting of a premise and a hypothesis) into three categories - entailment, contradiction, or neutral. Let’s take a look at an example of each of these cases for the following premise:

He came, he opened the door and I remember looking back and seeing the expression on his face, and I could tell that he was disappointed.

#### Hypothesis 1:

Just by the look on his face when he came through the door I just knew that he was let down.

We know that this is true based on the information in the premise. So, this pair is related by entailment.


#### Hypothesis 2:

He was trying not to make us feel guilty but we knew we had caused him trouble.

This very well might be true, but we can’t conclude this based on the information in the premise. So, this relationship is neutral.

#### Hypothesis 3:

He was so excited and bursting with joy that he practically knocked the door off its frame.

We know this isn’t true, because it is the complete opposite of what the premise says. So, this pair is related by contradiction.

This dataset contains premise-hypothesis pairs in fifteen different languages, including:
Arabic, Bulgarian, Chinese, German, Greek, English, Spanish, French, Hindi, Russian, Swahili, Thai, Turkish, Urdu, and Vietnamese.

#### Files:
train.csv: This file contains the ID, premise, hypothesis, and label, as well as the language of the text and its two-letter abbreviation.

test.csv: This file contains the ID, premise, hypothesis, language, and language abbreviation, without labels.

sample_submission.csv: This is a sample submission file in the correct format:
- id: a unique identifier for each sample
- prediction: the classification of the relationship between the premise and hypothesis (0 for entailment, 1 for neutral, 2 for contradiction)
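
To make the train.csv layout concrete, the sketch below builds a three-row file in memory from the example premise and hypotheses above and counts labels per language. The column names (`id`, `premise`, `hypothesis`, `lang_abv`, `language`, `label`) and the ids `ex1` to `ex3` are assumptions for illustration; check the header of the real file before relying on them.

```python
import csv
import io
from collections import Counter

# Assumed column names; the description only lists the fields informally.
COLUMNS = ["id", "premise", "hypothesis", "lang_abv", "language", "label"]

PREMISE = (
    "He came, he opened the door and I remember looking back and seeing "
    "the expression on his face, and I could tell that he was disappointed."
)

# Illustrative rows built from the three example hypotheses; the ids are made up.
ROWS = [
    ("ex1", PREMISE, "Just by the look on his face when he came through the door "
     "I just knew that he was let down.", "en", "English", 0),
    ("ex2", PREMISE, "He was trying not to make us feel guilty but we knew we had "
     "caused him trouble.", "en", "English", 1),
    ("ex3", PREMISE, "He was so excited and bursting with joy that he practically "
     "knocked the door off its frame.", "en", "English", 2),
]


def to_csv_text(rows):
    """Serialize rows with proper quoting, since the premise contains commas."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()


def count_labels_by_language(csv_text):
    """Count how many samples fall under each (language code, label) pair."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[(row["lang_abv"], int(row["label"]))] += 1
    return counts


counts = count_labels_by_language(to_csv_text(ROWS))
print(counts)
```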

Special thanks to TensorFlow Datasets (TFDS) for providing this and many other useful datasets! For more information, visit: https://www.tensorflow.org/datasets

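In [4]:
# Data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Wildcard imports are kept as written: the next cell calls nltk's download() unqualified
from sklearn.model_selection import *
from sklearn.linear_model import *
from math import *
from nltk import *
import warnings
warnings.filterwarnings("ignore")  # silence library warnings in the notebook output
from deep_translator import *
import os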

In [5]:
download("all")

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /Users/amith/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     /Users/amith/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /Users/amith/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /Users/amith/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_ru is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /Users/amith/nltk_data...
[nltk_data]    |   Package basque_grammars is already up-to-date!
[nltk_data]    | Downloading package bcp47 to
[nltk_data]    |

True

In [9]:
df_train = pd.read_csv("data/train.csv")
df_train.head()

FileNotFoundError: [Errno 2] No such file or directory: 'data/train.csv'

In [7]:
df_test = pd.read_csv("data/test.csv")
df_test.head()

FileNotFoundError: [Errno 2] No such file or directory: 'data/test.csv'
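
Both reads above fail because `data/train.csv` and `data/test.csv` do not exist relative to the notebook's working directory. One way to make the failure easier to diagnose is to search a short list of candidate folders and report every location that was tried. This is only a sketch: `find_data_file` is a hypothetical helper, and the second candidate directory assumes the usual Kaggle input mount for this competition's slug.

```python
from pathlib import Path

# Candidate directories are assumptions: the notebook's relative "data" folder
# and the typical Kaggle input mount for this competition.
CANDIDATE_DIRS = [
    Path("data"),
    Path("/kaggle/input/contradictory-my-dear-watson"),
]


def find_data_file(filename, candidate_dirs=CANDIDATE_DIRS):
    """Return the first existing path to `filename`, or raise listing the places searched."""
    for directory in candidate_dirs:
        path = Path(directory) / filename
        if path.is_file():
            return path
    searched = ", ".join(str(Path(d) / filename) for d in candidate_dirs)
    raise FileNotFoundError(f"{filename} not found; looked in: {searched}")
```

With this in place, the cells could call `pd.read_csv(find_data_file("train.csv"))` and `pd.read_csv(find_data_file("test.csv"))` once the files have been downloaded from the Data page.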

In [None]:
print("Number of training samples : ", len(df_train))
print("Number of testing samples  : ", len(df_test))

In [None]:
df_train_1 = df_train.copy()