# <font color='cornflowerblue'>Natural Language Inference Approach UM - Data Team Club</font>

In the second UM Data Team Club project, we'll be diving into Natural Language Inference (NLI). Our goal is to classify the relationship between two sentences, a premise and a hypothesis: whether the hypothesis logically follows from the premise (entailment), contradicts it (contradiction), or neither (neutral).

## Table of Contents
 - [Imports](#1)
 - [Accelerator](#2)
 - [Load Data](#3)
 - [Data Exploring](#4)
 - [Data Preprocessing](#5)
 - [Generative AI Classification](#6)

In [132]:
# !pip install keras_nlp
# !pip install seaborn
# !pip install wordcloud

In [133]:
import warnings
warnings.filterwarnings("ignore")

### Imports <a id='1'></a>

In [134]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS
import seaborn as sns
sns.set_style('whitegrid')


### Accelerator <a id='2'></a>

### Load Data <a id='3'></a>

In [135]:
df_train = pd.read_csv('/kaggle/input/contradictory-my-dear-watson/train.csv')
print(f"Train dataset size: {df_train.shape}")

Train dataset size: (12120, 6)


### Data Exploring <a id='4'></a>

In [136]:
df_train.head(3)

|   | id | premise | hypothesis | lang_abv | language | label |
|---|------------|---------|------------|----------|----------|-------|
| 0 | 5130fd2cb5 | and these comments were considered in formulat... | The rules developed in the interim were put to... | en | English | 0 |
| 1 | 5b72532a0b | These are issues that we wrestle with in pract... | Practice groups are not permitted to work on t... | en | English | 2 |
| 2 | 3931fbe82a | Des petites choses comme celles-là font une di... | J'essayais d'accomplir quelque chose. | fr | French | 0 |


The NLI model will assign labels of 0, 1, or 2 (corresponding to entailment, neutral, and contradiction) to pairs of premises and hypotheses.

| <span style="color:cornflowerblue">Label</span> | <span style="color:cornflowerblue">Description</span>       |
|---------------------------------------|---------------------------------------------------|
| 0                                     | Entailment                                        |
| 1                                     | Neutral                                           |
| 2                                     | Contradiction                                     |
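
The label table can also be kept in code so that integer labels and class names are converted in one place. This is a minimal sketch, not part of the original notebook; the names `LABEL_NAMES`, `LABEL_IDS`, and `describe_label` are illustrative assumptions.

```python
# Label ids and class names, copied from the table above.
LABEL_NAMES = {0: "Entailment", 1: "Neutral", 2: "Contradiction"}

# Reverse lookup (name -> id), handy when a model answers with a class name.
LABEL_IDS = {name: idx for idx, name in LABEL_NAMES.items()}


def describe_label(label_id: int) -> str:
    """Return the class name for an integer label; raise on unknown ids."""
    if label_id not in LABEL_NAMES:
        raise ValueError(f"Unknown label id: {label_id}")
    return LABEL_NAMES[label_id]


# Labels of the three rows shown by df_train.head(3) above.
for label_id in (0, 2, 0):
    print(label_id, "->", describe_label(label_id))
```

Raising on an unknown id, rather than returning a default, makes a corrupted label visible immediately.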


### Generative AI classification <a id='6'></a>

In [137]:
#!pip install langchain
#!pip install langchain_community

In [138]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import pipeline
# HuggingFacePipeline lives in langchain_community in recent LangChain releases
from langchain_community.llms import HuggingFacePipeline

# Hugging Face Hub repo id (namespace/repo_name); the weights are downloaded on first use.
# A path to a local directory containing the saved model can be passed instead.
model_name = "google/flan-t5-large"

# Initialize a tokenizer for the specified model
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Initialize a model for sequence-to-sequence tasks using the specified pretrained model
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

In [139]:
# Create a text-to-text generation pipeline from the model and tokenizer
pipe = pipeline(
    "text2text-generation",  # Task: text-to-text generation
    model=model,             # Use the previously initialized model
    tokenizer=tokenizer,     # Use the previously initialized tokenizer
    max_length=15,           # Cap the generated text at 15 tokens (a class name is short)
    do_sample=False,         # Greedy decoding, so the output is deterministic
    repetition_penalty=1.15, # Penalize repeated tokens (values above 1 discourage repetition)
)
# temperature and top_p are omitted: they only take effect when do_sample=True,
# and a temperature of 0 is not a valid sampling temperature.

# Wrap the pipeline so it can be called as a local LLM through LangChain
local_llm = HuggingFacePipeline(pipeline=pipe)

In [140]:
# Example with a single classification
premise = df_train["premise"][100]
hypothesis = df_train["hypothesis"][100]

print(local_llm.invoke(f'''
You are given a premise and a hypothesis, classify the hypothesis based on the premise into: Entailment, Neutral or Contradiction.
The premise is {premise}, and the hypothesis to classify is {hypothesis}'''))

Entailment
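
Because the same prompt text is used again in the batch loop further down, it can help to build it in one place. The sketch below is an assumption, not the notebook's code: `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names, and the example sentence pair is made up for illustration. Its whitespace differs slightly from the inline f-strings (no leading newline or indentation), so model outputs are not guaranteed to be identical.

```python
# Prompt wording taken from the cell above; the layout is normalized to two lines.
PROMPT_TEMPLATE = (
    "You are given a premise and a hypothesis, classify the hypothesis "
    "based on the premise into: Entailment, Neutral or Contradiction.\n"
    "The premise is {premise}, and the hypothesis to classify is {hypothesis}"
)


def build_prompt(premise: str, hypothesis: str) -> str:
    """Fill the template with one premise/hypothesis pair."""
    return PROMPT_TEMPLATE.format(premise=premise, hypothesis=hypothesis)


# Illustrative pair (not taken from the dataset).
prompt = build_prompt(
    premise="A man is playing a guitar on stage.",
    hypothesis="A person is making music.",
)
print(prompt)
```

In the notebook, the resulting string would be passed to `local_llm` in place of the inline f-string.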


In [141]:
k = 150  # Number of training rows to classify
clasiff = []
for i in range(k):
    premise = df_train["premise"][i]
    hypothesis = df_train["hypothesis"][i]

    # Same prompt wording as in the single example above
    pred = local_llm.invoke(f'''
    You are given a premise and a hypothesis, classify the hypothesis based on the premise into: Entailment, Neutral or Contradiction.
    The premise is {premise}, and the hypothesis to classify is {hypothesis}''')

    clasiff.append(pred)

In [142]:
clasiff[:15]

['Entailment',
 'Contradiction',
 'Entailment',
 'Entailment',
 'Neutral',
 'Neutral',
 'Neutral',
 'Contradiction',
 'Neutral',
 'Neutral',
 'Contradiction',
 'Neutral',
 'Neutral',
 'Neutral',
 'Neutral']

In [143]:
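# Convert the predicted class names to the dataset's integer labels.
# Note: any output other than "Entailment" or "Neutral" is treated as Contradiction (2).
label_ids = {"Entailment": 0, "Neutral": 1}
clasiff = [label_ids.get(pred, 2) for pred in clasiff]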

In [144]:
clasiff[:15]

[0, 2, 0, 0, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1]
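
The mapping cell above sends every output that is not exactly "Entailment" or "Neutral" to 2, so a malformed answer would be counted as a contradiction without any warning. A stricter parser, sketched here under the assumption that the model may add whitespace, change case, or append a period, returns `None` for anything it cannot recognize. The names `NAME_TO_ID` and `parse_prediction` and the sample strings are illustrative, not from the notebook.

```python
from typing import Optional

# Lower-cased class names mapped to the dataset's integer labels.
NAME_TO_ID = {"entailment": 0, "neutral": 1, "contradiction": 2}


def parse_prediction(text: str) -> Optional[int]:
    """Map a raw model answer to a label id, or None if it is not a known class."""
    cleaned = text.strip().lower().rstrip(".")
    return NAME_TO_ID.get(cleaned)


# Made-up raw outputs covering the clean and the malformed case.
raw_outputs = ["Entailment", " neutral ", "Contradiction.", "I am not sure"]
parsed = [parse_prediction(t) for t in raw_outputs]
print(parsed)  # [0, 1, 2, None]
```

Counting the `None` entries separately shows how often the model fails to answer with one of the three class names.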

In [145]:
from sklearn.metrics import accuracy_score

# Accuracy on the first k training rows: true labels vs. predicted labels
score = accuracy_score(df_train.label.head(k), clasiff)
print(score)

0.6266666666666667
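
A single accuracy figure does not show which classes the model confuses. The sketch below computes per-class recall with the standard library only. The function names and the toy label lists are assumptions made for illustration; they are not results from this notebook.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return Counter(zip(y_true, y_pred))


def per_class_recall(y_true, y_pred, labels=(0, 1, 2)):
    """Fraction of examples of each true class that were predicted correctly."""
    counts = confusion_counts(y_true, y_pred)
    support = Counter(y_true)
    return {
        label: counts[(label, label)] / support[label]
        for label in labels
        if support[label] > 0  # skip classes absent from y_true
    }


# Toy labels for illustration only (not the notebook's predictions).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(per_class_recall(y_true, y_pred))  # {0: 0.5, 1: 1.0, 2: 0.5}
```

In the notebook, the same functions could be called with `list(df_train.label.head(k))` and `clasiff`.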
