# A first approach: Word2Vec Embeddings


To tackle the issue, we first use basic Word2Vec embeddings to represent premises and hypotheses, and then train a simple classifier on these embeddings.

## Necessary dependencies

In [None]:
import re

import jieba
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import plotly.express as px
import seaborn as sns
from arabic_reshaper import reshape
from bidi.algorithm import get_display
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import TSNE
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

#nltk.download('punkt')
#nltk.download('wordnet')
#nltk.download('omw-1.4')
#nltk.download('punkt_tab')

## Data Loading

In [None]:
df = pd.read_csv('NLI_dataset.csv')

# Drop exact duplicate rows before any further processing
new_df = df.drop_duplicates()

## Visualisation

In [None]:

print(new_df['id'].count())

print(new_df['lang_abv'].value_counts())

12120
lang_abv
en    6870
zh     411
ar     401
fr     390
sw     385
ur     381
vi     379
ru     376
hi     374
el     372
th     371
es     366
tr     351
de     351
bg     342
Name: count, dtype: int64


As expected, the data is heavily imbalanced: English accounts for 6,870 of the 12,120 sentence pairs. Tokenizers do not behave equally well in every language, so embeddings in the less abundant languages may represent their sentences poorly.
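
To put numbers on the imbalance, the counts printed above can be turned into shares. This is a minimal sketch that hard-codes the counts from that output; the variable names are illustrative and not part of the notebook.

```python
# Counts copied from the `value_counts()` output above
lang_counts = {
    "en": 6870, "zh": 411, "ar": 401, "fr": 390, "sw": 385,
    "ur": 381, "vi": 379, "ru": 376, "hi": 374, "el": 372,
    "th": 371, "es": 366, "tr": 351, "de": 351, "bg": 342,
}

total = sum(lang_counts.values())
shares = {lang: count / total for lang, count in lang_counts.items()}

print(f"Total pairs: {total}")
# List languages from the most to the least represented
for lang, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang}: {share:.1%}")
```

English alone accounts for a bit more than half of the pairs, while each of the other languages contributes roughly 3%.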

We now preprocess the sentences, as usual:

In [None]:
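def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'http\S+', '', text)  # remove URLs
    text = re.sub(r'@\w+', '', text)     # remove @mentions
    # Keep only ASCII letters and whitespace. Note that this also strips accented
    # letters and every character of non-Latin scripts (Russian, Chinese, Hindi, ...).
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text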


new_df['cleaned_premise'] = new_df['premise'].apply(preprocess_text)
new_df['cleaned_hypothesis'] = new_df['hypothesis'].apply(preprocess_text)
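
One side effect of the last substitution is worth checking: the character class `[^a-zA-Z\s]` removes everything except ASCII letters and whitespace, including text written in non-Latin scripts. The sketch below reproduces the notebook's rules on an English and a Russian sentence, and compares them with `preprocess_text_unicode`, a hypothetical variant that is not used anywhere in this notebook and is shown only as one possible alternative.

```python
import re

def preprocess_text(text):
    """Same rules as the notebook's preprocessing function."""
    text = text.lower()
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'@\w+', '', text)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text

def preprocess_text_unicode(text):
    """Hypothetical variant: drop punctuation but keep letters from other scripts."""
    text = text.lower()
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'@\w+', '', text)
    # In Python 3, \w matches Unicode word characters (letters, digits, underscore)
    text = re.sub(r'[^\w\s]', '', text)
    return text

english = "Is any publicity good publicity?"
russian = "Кто? Она спросила его с неожиданным интересом."

print(repr(preprocess_text(english)))          # English words survive
print(repr(preprocess_text(russian)))          # only whitespace is left
print(repr(preprocess_text_unicode(russian)))  # Cyrillic words survive
```

With the original rule, the Russian sentence is reduced to whitespace, so it contributes no tokens downstream. The variant is only an illustration: scripts that rely on combining marks may still need dedicated handling.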

We use `nltk`'s `word_tokenize` when it supports the language, `jieba` for Chinese, and a regular expression for Hindi and Urdu. For the remaining languages, we just split the sentences on whitespace:

In [None]:
def tokenize_text(text, lang):
    if lang == "en":
        return word_tokenize(text, language="english")
    elif lang == "zh":  
        return list(jieba.cut(text))
    elif lang == "ar":  
        reshaped_text = reshape(text)
        bidi_text = get_display(reshaped_text)
        return word_tokenize(bidi_text, language="arabic")
    elif lang == "fr":
        return word_tokenize(text, language="french")
    elif lang == "de":
        return word_tokenize(text, language="german")
    elif lang == "es":
        return word_tokenize(text, language="spanish")
    elif lang == "tr":
        return word_tokenize(text, language="turkish")
    elif lang == "ru":
        return word_tokenize(text, language="russian")
    elif lang in ["hi", "ur"]:  
        return re.findall(r'\w+', text, re.UNICODE)  
    else:
        return text.split()  # Basic tokenization for other languages

tokenized_by_lang_hyp = {}
tokenized_by_lang_pre = {}

for lang in new_df['lang_abv'].unique():
    # Arabic is skipped here, so its sentences do not contribute to the Word2Vec vocabulary
    if lang != 'ar':
        lang_rows = new_df[new_df['lang_abv'] == lang]
        tokenized_by_lang_hyp[lang] = [tokenize_text(s, lang) for s in lang_rows['cleaned_hypothesis']]
        tokenized_by_lang_pre[lang] = [tokenize_text(s, lang) for s in lang_rows['cleaned_premise']]

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.972 seconds.
Prefix dict has been built successfully.


In [None]:
all_tokenized_sentences = []

for lang in tokenized_by_lang_hyp:
    all_tokenized_sentences.extend(tokenized_by_lang_hyp[lang])

for lang in tokenized_by_lang_pre:
    all_tokenized_sentences.extend(tokenized_by_lang_pre[lang])

Let's train a Word2Vec model to generate token embeddings:

In [None]:
model = Word2Vec(
    sentences=all_tokenized_sentences,
    window=4,      # Context window size
    min_count=1,   # Keep every word, even those seen only once
    sg=1,          # Use skip-gram (1) instead of CBOW (0)
    workers=4,     # Number of threads for training
)

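The context window of 4 means that each target word is paired with up to four tokens on either side, which is meant to capture most of the local context of a premise or hypothesis.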
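
To make the role of `window` concrete, here is a simplified sketch of how skip-gram (target, context) pairs can be enumerated for one tokenized sentence. The helper `skipgram_pairs` is hypothetical and only illustrates the idea; it is not how `gensim` trains internally, and real implementations add refinements such as negative sampling and subsampling of frequent words.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs for every token and its neighbours
    located at most `window` positions away."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["there", "are", "a", "lot", "of", "resorts", "in", "the", "national", "park"]
pairs = skipgram_pairs(tokens, window=4)

# Context words seen for the target "resorts" (position 5)
resorts_context = [context for target, context in pairs if target == "resorts"]
print(resorts_context)
```

In this ten-token sentence, "resorts" is paired with every other word except "there", which sits five positions away and therefore falls outside the window.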

Since Word2Vec does not provide a sentence-level token (like the `[CLS]` token added by `BERT`'s tokenizer, for instance), we compute a sentence embedding by averaging the embeddings of its tokens. Each premise/hypothesis pair is then represented by the sum of its two sentence embeddings:

In [None]:
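def get_sentence_embedding(sentence, model):
    # Note: sentences are re-tokenized here with NLTK's default (English) tokenizer,
    # whatever their language, unlike the per-language tokenization used for training
    tokens = word_tokenize(preprocess_text(sentence))
    # Keep only the tokens that are in the Word2Vec vocabulary
    embeddings = [model.wv[word] for word in tokens if word in model.wv]

    if embeddings:
        return np.mean(embeddings, axis=0)
    # No known token: fall back to a zero vector
    return np.zeros(model.vector_size)

# Each pair is represented by the sum of its premise and hypothesis embeddings
premise_emb = np.array([get_sentence_embedding(s, model) for s in new_df['cleaned_premise']])
hypothesis_emb = np.array([get_sentence_embedding(s, model) for s in new_df['cleaned_hypothesis']])
X = premise_emb + hypothesis_emb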

y = new_df['label'].values
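
The feature matrix above adds the premise and hypothesis embeddings. Because addition is commutative, swapping the two sentences yields exactly the same features, so the direction of the pair is lost. The sketch below contrasts this with a commonly used alternative that concatenates the two vectors with their absolute difference and element-wise product. `pair_features_sum` and `pair_features_concat` are hypothetical helpers, and the toy vectors stand in for real sentence embeddings; the notebook itself only uses the sum.

```python
import numpy as np

def pair_features_sum(premise_vec, hypothesis_vec):
    """Pair representation used in the notebook: element-wise sum."""
    return premise_vec + hypothesis_vec

def pair_features_concat(premise_vec, hypothesis_vec):
    """Alternative (assumption, not in the notebook): keep both vectors
    and add simple interaction terms."""
    return np.concatenate([
        premise_vec,
        hypothesis_vec,
        np.abs(premise_vec - hypothesis_vec),
        premise_vec * hypothesis_vec,
    ])

# Toy 3-dimensional "sentence embeddings"
u = np.array([1.0, 0.0, 2.0])
v = np.array([0.5, 1.0, -1.0])

# The sum is identical when premise and hypothesis are swapped
sum_is_symmetric = np.array_equal(pair_features_sum(u, v), pair_features_sum(v, u))
print(sum_is_symmetric)

# The concatenation is four times wider and depends on the order
concat_uv = pair_features_concat(u, v)
concat_vu = pair_features_concat(v, u)
print(concat_uv.shape)
print(np.array_equal(concat_uv, concat_vu))
```

Whether the wider representation actually helps the classifier here has not been tested; the point is only that the sum cannot tell a premise from a hypothesis.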

We then train a vanilla Random Forest classifier to classify these sentence embeddings:

In [None]:

# Keep track of row positions so predictions can be mapped back to the original sentences
indices = np.arange(len(X))

X_train, X_test, y_train, y_test, train_indices, test_indices = train_test_split(
    X, y, indices, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)

Let's run inference on the test set:

In [None]:
y_pred = clf.predict(X_test)

# Recover the original text and labels of the test pairs
test_rows = new_df.iloc[test_indices]

# Show the first 30 test pairs with their true and predicted labels
for premise, hypothesis, true_label, pred_label in zip(
    test_rows['premise'].head(30), test_rows['hypothesis'].head(30),
    test_rows['label'].head(30), y_pred[:30]
):
    print(f"Premise: {premise}\nHypothesis: {hypothesis}\nReal label: {true_label}\nPredicted label: {pred_label}\n")

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

Premise: Кто? Она спросила его с неожиданным интересом.
Hypothesis: Она спросила, как это сделать, так как с её точки зрения это казалось невозможным.
Real label: 1
Predicted label: 2

Premise: Others are Zao (in Tohoku) and a number of resorts in Joshin-etsu Kogen National Park in the Japan Alps, where there are now splendid facilities thanks to the 1998 Winter Olympic Games in Nagano.
Hypothesis: There are a lot of resorts in the national park.
Real label: 0
Predicted label: 0

Premise: trying to keep grass alive during a summer on a piece of ground that big was expensive
Hypothesis: There was no cost in keeping the grass alive in the summer time.
Real label: 2
Predicted label: 2

Premise: so i guess my experience is is just with what we did and and so they didn't really go through the child care route they were able to be home together
Hypothesis: They were able to be home rather than having to worry about getting child care.
Real label: 0
Predicted label: 0

Premise: The Journal pu

As expected, the results aren't good: we get **32% accuracy**, which is about what random guessing achieves on three classes (roughly 33%).
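
For reference, the chance level can be checked with a quick simulation: predictions drawn uniformly from three labels are right about a third of the time, whatever the distribution of the true labels. This sketch uses synthetic labels (an assumption, not the dataset's labels) and only the standard library.

```python
import random

random.seed(0)

n_samples = 30_000
labels = [0, 1, 2]

# Synthetic, deliberately imbalanced "true" labels
y_true = random.choices(labels, weights=[0.5, 0.3, 0.2], k=n_samples)
# Uniform random guesses
y_rand = [random.choice(labels) for _ in range(n_samples)]

accuracy = sum(t == p for t, p in zip(y_true, y_rand)) / n_samples
print(f"Random-guess accuracy: {accuracy:.3f}")
```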

It is easy to see why a Word2Vec-based approach isn't well-suited for an entailment task. Two averaged sentence embeddings will be very close in the embedding space if the sentences contain many of the same words, or words that occur in similar contexts; averaging also discards word order. However, **semantic similarity** and **logical entailment** go beyond surface-level word overlap. Two sentences can have similar words but completely different meanings or intentions.
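
A toy example makes the word-order problem tangible. With averaged embeddings, two sentences built from the same words get the same vector, even when their meanings are opposite. The three-dimensional vectors below are made up for illustration and do not come from the trained model.

```python
import numpy as np

# Made-up word vectors (not taken from the trained Word2Vec model)
toy_vectors = {
    "the": np.array([0.1, 0.0, 0.2]),
    "dog": np.array([0.9, 0.3, 0.1]),
    "bit": np.array([0.2, 0.8, 0.5]),
    "man": np.array([0.7, 0.1, 0.6]),
}

def average_embedding(tokens, vectors):
    """Mean of the vectors of the known tokens (zero vector if none is known)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(3)
    return np.mean(known, axis=0)

a = average_embedding("the dog bit the man".split(), toy_vectors)
b = average_embedding("the man bit the dog".split(), toy_vectors)

# Same bag of words, so the averaged embeddings coincide
print(np.allclose(a, b))
```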

Let’s take an example:

- **Premise:** *The Journal put the point succinctly: "Is any publicity good publicity?"*
- **Hypothesis:** *The Journal asked, "Is this a good political move?"*

The hypothesis neither follows from nor contradicts the premise, yet the classifier labeled the pair as a contradiction.

This misclassification most likely happens because their sentence embeddings, derived from Word2Vec averaging, are close in vector space due to overlapping words like *"The Journal,"* *"is,"* and *"good."* However, the relationship between the sentences is **neutral**, not contradictory. Word2Vec lacks the ability to capture context, negation, or logical structure, which are crucial for tasks like natural language inference.
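
The lexical overlap in this pair can be checked directly. The sketch below lowercases both sentences, keeps alphabetic tokens and lists the shared words; this simple regex tokenization is an assumption made for the illustration, not the notebook's pipeline.

```python
import re

premise = 'The Journal put the point succinctly: "Is any publicity good publicity?"'
hypothesis = 'The Journal asked, "Is this a good political move?"'

def word_set(sentence):
    """Lowercase the sentence and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", sentence.lower()))

premise_words = word_set(premise)
hypothesis_words = word_set(hypothesis)

# Words present in both sentences, and their share of the combined vocabulary
shared = premise_words & hypothesis_words
jaccard = len(shared) / len(premise_words | hypothesis_words)

print(sorted(shared))
print(f"Jaccard overlap: {jaccard:.2f}")
```

The shared words are *the*, *journal*, *is* and *good*, none of which says anything about whether one sentence follows from the other.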

This example shows the limitations of using shallow embedding methods for semantic understanding, and why more **context-aware models** like `BERT` or other transformer-based models are better suited for such tasks. That's what we are going to delve into next.
