#Calculating Relative Polarity Scores for Multilingual Wikipedia Articles Using Contrastive Learning

In this notebook, we compare two widely used machine-learning models for sentiment analysis, K-Nearest Neighbors (KNN) and Support Vector Machine (SVM), with our own model, a Contrastive Learning Model with Relative Polarity. All three models are trained and tested on 100 paired English and German Wikipedia articles about the Second World War.

###Part 1: Installations

In [None]:
!pip install pandas

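!pip install spacy
# en_core_web_sm is loaded later on; download it in case the environment does not ship it
!python -m spacy download en_core_web_sm
!python -m spacy download de_core_news_sm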

!pip install bs4
!pip install textblob
!python -m textblob.download_corpora

!pip install sentence-transformers

Collecting de-core-news-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-3.7.0/de_core_news_sm-3.7.0-py3-none-any.whl (14.6 MB)
     14.6/14.6 MB 49.0 MB/s eta 0:00:00
Installing collected packages: de-core-news-sm
Successfully installed de-core-news-sm-3.7.0
✔ Download and installation successful
You can now load the package via spacy.load('de_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Collecting bs4
  Downloading bs4-0.0.2-py2.py3-none-any.whl.metadata (411 bytes)
Downloading bs4-0.0.2-py2.py3-none-any.whl (1.2 kB)
Installing collected packages: bs4
Successfully installed bs4-0.0.2
[nltk_data] Downloading package brown to /ro

In [None]:
import requests
import re

import pandas as pd
import numpy as np
import spacy
import nltk
from bs4 import BeautifulSoup
from textblob import TextBlob
from sentence_transformers import SentenceTransformer

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset

from sklearn.preprocessing import normalize
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score



###Part 2: Data & Preprocessing

In [None]:
df = pd.read_csv('/content/dataset.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,(Title),Text A (EN),Polarity A,Subjectivity A,Text B (DE),Polarity B,Subjectivity B,relative_polarity,relative_subjectivity
0,0,Bombing of Dresden,The bombing of Dresden was a joint British and...,0.080573,0.336892,Die Luftangriffe auf Dresden und den Großraum ...,0.45,0.0,0.369427,-0.336892
1,1,Berlin Blockade,The Berlin Blockade was one of the first major...,0.019341,0.258009,Als Berlin-Blockade wird die Blockade der drei...,0.0,0.166667,-0.019341,-0.091342
2,2,Battle of Berlin,"The Battle of Berlin, designated as the Berlin...",0.066615,0.248313,Die Schlacht um Berlin war die letzte große Sc...,0.35,0.0,0.283385,-0.248313
3,3,Nuremberg Trail,The Nuremberg trials were held by the Allies a...,0.105769,0.396154,Die Nürnberger Prozesse wurden nach dem Zweite...,-0.172727,0.036364,-0.278497,-0.35979
4,4,East German Uprising of 1953,The East German uprising of 1953 was an uprisi...,0.021795,0.307692,Als Aufstand vom 17. Juni 1953 werden die Vork...,-0.089474,0.144737,-0.111269,-0.162955


In [None]:
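def preprocess(text):
  """Strip Wikipedia markup residue from raw article text."""

  # leave missing values (e.g. NaN) untouched
  if not isinstance(text, str):
    return text

  text = re.sub(r'\[\d+\]', '', text)    # numeric citation markers such as [12]
  text = re.sub(r'\([^)]*\)', '', text)  # parenthetical asides
  text = re.sub(r'\[.*?\]', '', text)    # any remaining bracketed notes
  text = re.sub(r'\s+', ' ', text)       # collapse runs of whitespace
  text = text.strip()

  return text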
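
To make the effect of these substitutions concrete, the sketch below applies the same four patterns to a made-up sentence. The sample text is an assumption for illustration and is not a row from the dataset.

```python
import re

# Hypothetical sample sentence with typical Wikipedia residue
sample = ("The Berlin Blockade (24 June 1948 to 12 May 1949)[1] was a major "
          "crisis[citation needed] of the Cold War.")

cleaned = re.sub(r'\[\d+\]', '', sample)        # numeric citation markers
cleaned = re.sub(r'\([^)]*\)', '', cleaned)     # parenthetical asides
cleaned = re.sub(r'\[.*?\]', '', cleaned)       # remaining bracketed notes
cleaned = re.sub(r'\s+', ' ', cleaned).strip()  # collapse whitespace

print(cleaned)
```

Note that the parenthesis pattern removes every parenthetical, including dates and alternative names, so some factual detail is dropped along with the markup.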


In [None]:
df['Text A (EN)'] = df['Text A (EN)'].apply(preprocess)
df['Text B (DE)'] = df['Text B (DE)'].apply(preprocess)

In [None]:
nlp_en = spacy.load('en_core_web_sm')
nlp_de = spacy.load('de_core_news_sm')

def en_preprocess(text):
  doc = nlp_en(text.lower())
  lemmatized = [token.lemma_ for token in doc if not token.is_stop]
  return " ".join(lemmatized)

def de_preprocess(text):
  doc = nlp_de(text.lower())
  lemmatized = [token.lemma_ for token in doc if not token.is_stop]
  return " ".join(lemmatized)

In [None]:
df.head()

Unnamed: 0.1,Unnamed: 0,(Title),Text A (EN),Polarity A,Subjectivity A,Text B (DE),Polarity B,Subjectivity B,relative_polarity,relative_subjectivity
0,0,Bombing of Dresden,The bombing of Dresden was a joint British and...,0.080573,0.336892,Die Luftangriffe auf Dresden und den Großraum ...,0.45,0.0,0.369427,-0.336892
1,1,Berlin Blockade,The Berlin Blockade was one of the first major...,0.019341,0.258009,Als Berlin-Blockade wird die Blockade der drei...,0.0,0.166667,-0.019341,-0.091342
2,2,Battle of Berlin,"The Battle of Berlin, designated as the Berlin...",0.066615,0.248313,Die Schlacht um Berlin war die letzte große Sc...,0.35,0.0,0.283385,-0.248313
3,3,Nuremberg Trail,The Nuremberg trials were held by the Allies a...,0.105769,0.396154,Die Nürnberger Prozesse wurden nach dem Zweite...,-0.172727,0.036364,-0.278497,-0.35979
4,4,East German Uprising of 1953,The East German uprising of 1953 was an uprisi...,0.021795,0.307692,Als Aufstand vom 17. Juni 1953 werden die Vork...,-0.089474,0.144737,-0.111269,-0.162955


In [None]:
df['Text A (EN)'] = df['Text A (EN)'].apply(en_preprocess)
df['Text B (DE)'] = df['Text B (DE)'].apply(de_preprocess)

In [None]:
df.head()

Unnamed: 0.1,Unnamed: 0,(Title),Text A (EN),Polarity A,Subjectivity A,Text B (DE),Polarity B,Subjectivity B,relative_polarity,relative_subjectivity
0,0,Bombing of Dresden,bombing dresden joint british american aerial ...,0.080573,0.336892,Luftangriffe dresd Großraum Stadt Weltkrieg fi...,0.45,0.0,0.369427,-0.336892
1,1,Berlin Blockade,berlin blockade major international crisis col...,0.019341,0.258009,berlin-blockade Blockade Westsektoren Berlin S...,0.0,0.166667,-0.019341,-0.091342
2,2,Battle of Berlin,"battle berlin , designate berlin strategic off...",0.066615,0.248313,Schlacht Berlin letzter Schlacht weltkrieg Eur...,0.35,0.0,0.283385,-0.248313
3,3,Nuremberg Trail,nuremberg trial hold ally representative defea...,0.105769,0.396154,Nürnberger prozeß Weltkrieg führend Repräsenta...,-0.172727,0.036364,-0.278497,-0.35979
4,4,East German Uprising of 1953,east german uprising 1953 uprising occur east ...,0.021795,0.307692,Aufstand 17. Juni 1953 Vorkommnis DDR bezeichn...,-0.089474,0.144737,-0.111269,-0.162955


###Part 3: Feature Selection

In [None]:
sentence_model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

In [None]:
df['Embeddings Text A (EN)'] = df['Text A (EN)'].apply(lambda x: sentence_model.encode([x])[0])
df['Embeddings Text B (DE)'] = df['Text B (DE)'].apply(lambda x: sentence_model.encode([x])[0])

In [None]:
df.head()

Unnamed: 0.1,Unnamed: 0,(Title),Text A (EN),Polarity A,Subjectivity A,Text B (DE),Polarity B,Subjectivity B,relative_polarity,relative_subjectivity,Embeddings Text A (EN),Embeddings Text B (DE)
0,0,Bombing of Dresden,bombing dresden joint british american aerial ...,0.080573,0.336892,Luftangriffe dresd Großraum Stadt Weltkrieg fi...,0.45,0.0,0.369427,-0.336892,"[0.11225356, 0.06887738, 0.03789141, -0.093371...","[0.03780573, 0.18411472, -0.08682791, -0.11067..."
1,1,Berlin Blockade,berlin blockade major international crisis col...,0.019341,0.258009,berlin-blockade Blockade Westsektoren Berlin S...,0.0,0.166667,-0.019341,-0.091342,"[-0.06444506, 0.15867063, -0.008208185, 0.0100...","[-0.09677547, 0.085402966, -0.0063312207, 0.03..."
2,2,Battle of Berlin,"battle berlin , designate berlin strategic off...",0.066615,0.248313,Schlacht Berlin letzter Schlacht weltkrieg Eur...,0.35,0.0,0.283385,-0.248313,"[-0.12278857, 0.2049316, 0.013694329, 0.100997...","[-0.121530816, 0.22239223, 0.08607685, 0.02441..."
3,3,Nuremberg Trail,nuremberg trial hold ally representative defea...,0.105769,0.396154,Nürnberger prozeß Weltkrieg führend Repräsenta...,-0.172727,0.036364,-0.278497,-0.35979,"[-0.19303313, 0.3098004, -0.13188314, -0.03126...","[-0.07023801, 0.24247734, 0.03249222, -0.04121..."
4,4,East German Uprising of 1953,east german uprising 1953 uprising occur east ...,0.021795,0.307692,Aufstand 17. Juni 1953 Vorkommnis DDR bezeichn...,-0.089474,0.144737,-0.111269,-0.162955,"[-0.12760314, 0.15674329, -0.18420413, 0.20283...","[-0.2353333, 0.2751347, -0.1799308, 0.12173430..."


###Part 4: Dataset Split

In [None]:
X_en = np.array([np.array(x).flatten() for x in df['Embeddings Text A (EN)']])
X_de = np.array([np.array(x).flatten() for x in df['Embeddings Text B (DE)']])

In [None]:
y_en = df['Polarity A'].values
y_de = df['Polarity B'].values
y = df['relative_polarity'].values
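
Judging from the rows displayed earlier, the relative_polarity column appears to be the German score minus the English score (Polarity B minus Polarity A). The short check below tests that reading against the first three rows of the table, with the values copied by hand; it is a sanity check under that assumption, not a statement of how the column was originally computed.

```python
# (title, Polarity A (EN), Polarity B (DE), relative_polarity) copied from df.head()
rows = [
    ("Bombing of Dresden", 0.080573, 0.45, 0.369427),
    ("Berlin Blockade",    0.019341, 0.0, -0.019341),
    ("Battle of Berlin",   0.066615, 0.35, 0.283385),
]

differences = []
for title, polarity_en, polarity_de, relative_polarity in rows:
    # German minus English
    diff = polarity_de - polarity_en
    differences.append(diff)
    print(title, round(diff, 6), relative_polarity)
```

A positive value therefore means the German article scored as more positive than its English counterpart, and a negative value the opposite.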

In [None]:
X = np.hstack((X_en, X_de))

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)
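
Because train_test_split shuffles the rows, any other per-row array that is used later (such as y_en and y_de) only lines up with X_train and X_test if it is split in the same call, or in a separate call with the same number of rows, test_size and random_state. The sketch below illustrates this on toy arrays; `ids`, `X_demo` and the derived arrays are hypothetical names, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

ids = np.arange(10)
X_demo = ids.reshape(-1, 1).astype(float)  # the feature value doubles as a row id
y_a = ids * 10                             # per-row target derived from the row id
y_b = ids * 100                            # a second per-row array

# Option 1: split everything in a single call
X_tr, X_te, a_tr, a_te = train_test_split(X_demo, y_a, test_size=0.2, random_state=42)

# Option 2: a later call with the same size and seed reproduces the same row order
b_tr, b_te = train_test_split(y_b, test_size=0.2, random_state=42)

print(X_tr[:, 0], a_tr, b_tr)
```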

###Part 5: Implementing Contrastive Learning and Relative Polarity Model

We will implement the model in two phases:

Phase 1: training the model with contrastive learning.

Phase 2: adjusting the model's scoring mechanism for polarity scoring.

Finally, we evaluate the model's performance using Mean Squared Error (MSE).

###Part 5.1: Text Embedding Dataset

In [None]:
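class TextEmbeddingDataset(Dataset):
    """Paired EN/DE article embeddings together with their polarity scores."""

    def __init__(self, embeddings, scores, y_en, y_de):
        # each row of `embeddings` is [EN embedding | DE embedding], 384 values each
        self.en_embeddings = embeddings[:, :384]
        self.de_embeddings = embeddings[:, 384:]
        self.scores = scores  # precomputed relative polarity, kept for reference
        # y_en / y_de must be in the same row order as `embeddings`
        self.en_scores = y_en
        self.de_scores = y_de

    def __len__(self):
        return len(self.en_embeddings)

    def __getitem__(self, idx):
        # relative polarity of the pair: German score minus English score
        relative_polarity = self.de_scores[idx] - self.en_scores[idx]
        return (self.en_embeddings[idx], self.en_scores[idx]), (self.de_embeddings[idx], self.de_scores[idx]), relative_polarity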
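
Each item of this dataset is a nested tuple, and the default DataLoader collation keeps that nesting while stacking the leaves into batches. The sketch below shows the resulting batch structure with a hypothetical stand-in dataset (`PairDemo`, filled with made-up values) rather than the real embeddings.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class PairDemo(Dataset):
    """Hypothetical stand-in that returns items shaped like TextEmbeddingDataset."""

    def __init__(self):
        self.en = torch.zeros(4, 384)   # fake English embeddings
        self.de = torch.ones(4, 384)    # fake German embeddings
        self.en_scores = torch.tensor([0.1, 0.2, 0.3, 0.4])
        self.de_scores = torch.tensor([0.4, 0.1, 0.3, 0.0])

    def __len__(self):
        return len(self.en)

    def __getitem__(self, idx):
        relative = self.de_scores[idx] - self.en_scores[idx]
        return (self.en[idx], self.en_scores[idx]), (self.de[idx], self.de_scores[idx]), relative

loader = DataLoader(PairDemo(), batch_size=2, shuffle=False)

# unpack one batch the same way a training loop would
(en_x, en_y), (de_x, de_y), relative = next(iter(loader))
print(en_x.shape, en_y.shape, de_x.shape, relative.shape)
```

With a batch size of 2, the embeddings arrive as 2 x 384 tensors and the scores as vectors of length 2.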


###Part 5.2: Create Dataloaders

In [None]:
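# train_test_split shuffles the rows, so the per-language scores are split with the
# same test_size and random_state as X to keep them aligned with X_train / X_test
y_en_train, y_en_test, y_de_train, y_de_test = train_test_split(
    y_en, y_de, test_size=0.2, random_state=42)

embeddings_tensor = torch.tensor(X_train, dtype=torch.float32)
scores_tensor = torch.tensor(y_train, dtype=torch.float32)

dataset = TextEmbeddingDataset(embeddings_tensor, scores_tensor, y_en_train, y_de_train)

batch_size = 2

dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

In [None]:
embeddings_tensor_test = torch.tensor(X_test, dtype=torch.float32)
scores_tensor_test = torch.tensor(y_test, dtype=torch.float32)

dataset_test = TextEmbeddingDataset(embeddings_tensor_test, scores_tensor_test, y_en_test, y_de_test)

# evaluation does not depend on batch order, so the test set is not shuffled
dataloader_test = DataLoader(dataset_test, batch_size=batch_size, shuffle=False)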

###Part 5.2: Setting up Phase 1

###Part 5.2.1: The Contrastive Block

In [None]:
class ContrastiveBlock(nn.Module):
  """Linear layer, batch norm and LeakyReLU, followed by L2 normalization."""

  def __init__(self, in_features, out_features):
    super(ContrastiveBlock, self).__init__()

    self.linear1 = nn.Linear(in_features, out_features)
    self.bn1 = nn.BatchNorm1d(out_features)
    self.leaky_relu1 = nn.LeakyReLU(0.01)

  def forward(self, x):
    x = self.linear1(x)
    x = self.bn1(x)
    x = self.leaky_relu1(x)
    x = self.unit_normalize(x)
    return x

  def freeze_model(self, model):
    # stop gradient updates for every parameter of `model`
    for param in model.parameters():
      param.requires_grad = False

  def unit_normalize(self, tensor, dim=-1):
    # scale each row to unit L2 norm; the epsilon guards against division by zero
    norm = tensor.norm(p=2, dim=dim, keepdim=True)
    normalized_tensor = tensor / (norm + 1e-8)
    return normalized_tensor
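
The last step of the block rescales every output row to unit length, so distances between outputs are bounded between 0 and 2. The sketch below rebuilds the same sequence of operations on random input (an assumption, standing in for real embeddings) and checks the row norms. Note that BatchNorm1d in training mode needs more than one sample per batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for one block: Linear -> BatchNorm1d -> LeakyReLU (sizes of the first block)
layers = nn.Sequential(nn.Linear(384, 512), nn.BatchNorm1d(512), nn.LeakyReLU(0.01))

x = torch.randn(4, 384)  # random batch of 4 fake embeddings
h = layers(x)

# same arithmetic as unit_normalize: divide each row by its L2 norm
z = h / (h.norm(p=2, dim=-1, keepdim=True) + 1e-8)

row_norms = z.norm(p=2, dim=-1)
print(z.shape, row_norms)
```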

###Part 5.2.2: The Branch Class

In [None]:
class Branch(nn.Module):

  def __init__(self):
    super(Branch, self).__init__()

    self.block1 = ContrastiveBlock(384, 512)
    self.block2 = ContrastiveBlock(512, 768)
    self.block3 = ContrastiveBlock(768, 384)

  def forward(self, x):
    """
    Pass the input embeddings through the three contrastive blocks
    (384 -> 512 -> 768 -> 384) and return the output embedding.
    """
    x = self.block1(x)
    x = self.block2(x)
    z = self.block3(x)
    return z



###Part 5.2.3: The Model Class

In [None]:
class Model(nn.Module):

    def __init__(self):
      super(Model, self).__init__()

      self.branch1 = Branch()
      self.branch2 = Branch()

      self.regression_layer = nn.Linear(384, 1)

    def freeze(self):
      for param_group in [self.branch1.parameters(), self.branch2.parameters()]:
          for param in param_group:
              param.requires_grad = False

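    def unfreeze(self):
      # mirror of freeze(): re-enable gradients for every parameter in both branches
      for param_group in [self.branch1.parameters(), self.branch2.parameters()]:
          for param in param_group:
              param.requires_grad = True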

    def embed(self, x1, x2):
      z1 = self.branch1(x1)
      z2 = self.branch2(x2)
      return z1, z2

    def predict(self, x1, x2, return_embeddings=False):
      z1, z2 = self.embed(x1, x2)
      combined_output = torch.stack((z1, z2), dim=1)

      if return_embeddings:
        output = (self.regression_layer(combined_output), combined_output)
      else:
        output = (self.regression_layer(combined_output))

      return output

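    def combined_loss(self, z1, z2, polarity1, polarity2, margin=1.0, alpha=0.5):
      # margin term: non-zero only while the two embeddings are closer than `margin`
      dist = F.pairwise_distance(z1, z2)
      contrastive = torch.mean(F.relu(margin-dist))

      # polarity term: embedding distance weighted by the gap between the two polarity scores
      polarity_dist = torch.abs(polarity1-polarity2)
      polarity_effect = F.relu(dist*polarity_dist)
      polarity_loss = torch.mean(polarity_effect)

      # alpha balances the two terms
      total_loss = (1-alpha)*contrastive + alpha*polarity_loss
      return total_loss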

    def forward(self, x1, x2):
      # predict() returns the embeddings only when asked for them
      output, z = self.predict(x1, x2, return_embeddings=True)
      return output, z
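
To see how the two terms of combined_loss behave, the sketch below evaluates the same arithmetic on two toy pairs. The free function, the 2-dimensional unit vectors and the polarity values are assumptions chosen for illustration; they are not outputs of the model.

```python
import torch
import torch.nn.functional as F

def combined_loss(z1, z2, polarity1, polarity2, margin=1.0, alpha=0.5):
    # same arithmetic as Model.combined_loss, written as a free function
    dist = F.pairwise_distance(z1, z2)
    contrastive = torch.mean(F.relu(margin - dist))
    polarity_loss = torch.mean(F.relu(dist * torch.abs(polarity1 - polarity2)))
    return (1 - alpha) * contrastive + alpha * polarity_loss

# Toy unit vectors: pair 0 is orthogonal, pair 1 is identical
z_en = torch.tensor([[1.0, 0.0], [1.0, 0.0]])
z_de = torch.tensor([[0.0, 1.0], [1.0, 0.0]])

# Made-up polarity scores: pair 0 differs in polarity, pair 1 does not
pol_en = torch.tensor([0.080573, 0.2])
pol_de = torch.tensor([0.45, 0.2])

dist = F.pairwise_distance(z_en, z_de)
loss = combined_loss(z_en, z_de, pol_en, pol_de)
print(dist, loss)
```

In this toy batch, the identical pair contributes almost the full margin to the first term, while the orthogonal pair lies beyond the margin and contributes only through the polarity term, in proportion to its distance and its polarity gap.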



###Part 5.3: Training Phase 1

Now we define the functions responsible for training Phase 1.

###Part 5.3.1: Initializing the model

In [None]:
model = Model()

###Part 5.3.2: Defining the Optimizer

In [None]:
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

###Part 5.3.3: Early Stopping

In [None]:
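def early_stopping(validation_loss, min_loss, counter, patience=20):
  """Return True once the loss has failed to improve for `patience` checks in a row.

  `counter` is the number of consecutive non-improving epochs seen so far.
  Integers are passed by value, so the caller has to keep track of it.
  """

  if validation_loss > min_loss:
    # count the current epoch as one more epoch without improvement
    return counter + 1 >= patience

  return False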
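
Patience-based early stopping generally needs two pieces of state that outlive a single call: the best loss seen so far and the number of consecutive checks without improvement. The sketch below shows that bookkeeping in isolation, using a hypothetical helper `stopping_epoch` and made-up loss values.

```python
def stopping_epoch(losses, patience=3):
    """Return the epoch at which training would stop, or None if it never does."""
    best = float("inf")
    counter = 0  # consecutive epochs without a new best loss

    for epoch, loss in enumerate(losses):
        if loss < best:
            # improvement: remember the new best and reset the counter
            best = loss
            counter = 0
        else:
            counter += 1
            if counter >= patience:
                return epoch
    return None

# Made-up loss curve: the best value (0.4) is reached at epoch 1
demo_losses = [0.5, 0.4, 0.45, 0.41, 0.42, 0.3]

stopped_at = stopping_epoch(demo_losses, patience=3)
never_stopped = stopping_epoch(demo_losses, patience=20)
print(stopped_at, never_stopped)
```

With a patience of 3 the loop stops before the final improvement at epoch 5 is ever seen, which is the usual trade-off when the loss is noisy.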

###Part 5.3.4: Train Contrastive One Epoch

In [None]:
def train_contrastive_one_epoch(train_loader, test_loader, model, optimizer):

  model.train()

  total_loss = 0
  loss_function = model.combined_loss

  for idx, data in enumerate(train_loader):
    (en_embeddings, en_scores), (de_embeddings, de_scores), relative_embedding = data

    # embed each language with its own branch
    z1, z2 = model.embed(en_embeddings, de_embeddings)

    loss = loss_function(z1, z2, en_scores, de_scores)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    total_loss += loss.item()

    # only the first batch is processed, so the returned loss covers a single batch
    break

  return total_loss


###Part 5.3.5: Train Contrastive Function

In [None]:
def train_contrastive(train_loader, test_loader, model, optimizer, num_epochs):

  counter = 0        # consecutive epochs without improvement
  min_loss = np.inf  # best loss seen so far

  for epoch in range(num_epochs):

    loss = train_contrastive_one_epoch(train_loader, test_loader, model, optimizer)
    print("Loss for Epoch", epoch, ":", loss)

    if loss < min_loss:
      # new best loss: remember it and reset the patience counter
      min_loss = loss
      counter = 0
    else:
      if early_stopping(loss, min_loss, counter):
        break
      counter += 1
  return model

###Part 5.3.6: Function call

We now call the train_contrastive function to train the model for up to 100 epochs.



In [None]:
model = train_contrastive(dataloader, dataloader_test, model, optimizer, num_epochs=100)

Loss for Epoch 0 : 0.08033482012332598
Loss for Epoch 1 : 0.04463420304369749
Loss for Epoch 2 : 0.13510667481739602
Loss for Epoch 3 : 0.1134540793342623
Loss for Epoch 4 : 0.08046410596901428
Loss for Epoch 5 : 0.09295470308957693
Loss for Epoch 6 : 0.14329698752180065
Loss for Epoch 7 : 0.13061777363634283
Loss for Epoch 8 : 0.09628063893117345
Loss for Epoch 9 : 0.14646101138066678
Loss for Epoch 10 : 0.025729739852022173
Loss for Epoch 11 : 0.14642192732418974
Loss for Epoch 12 : 0.11506896086988024
Loss for Epoch 13 : 0.12469338298080458
Loss for Epoch 14 : 0.06342361288560108
Loss for Epoch 15 : 0.09939674772387534
Loss for Epoch 16 : 0.06648940373239953
Loss for Epoch 17 : 0.132594510274274
Loss for Epoch 18 : 0.1400083897214461
Loss for Epoch 19 : 0.11799948796775697
Loss for Epoch 20 : 0.07568559971792155
Loss for Epoch 21 : 0.08962237412678746
Loss for Epoch 22 : 0.04287717945904196
Loss for Epoch 23 : 0.1431401289137853
Loss for Epoch 24 : 0.019704895696292305
Loss for Epoc