# **LAB-2: Scalable Machine Learning and Deep Learning**

## **Paolo Teta & Ralfs Zangis**
---
**TASK:** Implement the **S-BERT** model

**Outline:**
- Load the dataset
- Regression
- Classification
- Evaluation with STS benchmark dataset (cosine similarity and Spearman correlation)
- Semantic search
---


**REMEMBER:** UPLOAD DATA TO SESSION STORAGE (*sts-benchmark* and *news*)

## **Requirements**

### Install dependencies

In [1]:
!pip install sentence_transformers
!pip install transformers
!pip install tokenizers
!pip install wget
!pip install torch



### Spark

In [2]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import *
from pyspark.sql.functions import *
from pyspark.sql.types import *

spark = SparkSession.builder.getOrCreate()

21/12/12 10:19:14 WARN Utils: Your hostname, bubrik-VirtualBox resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface enp0s3)
21/12/12 10:19:14 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
21/12/12 10:19:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


### ML

In [3]:
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import Input

from sentence_transformers import SentenceTransformer
from sentence_transformers import LoggingHandler
from sentence_transformers import models, losses, util
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from sentence_transformers.readers import InputExample

from transformers import BertTokenizer, TFBertModel, BertConfig

2021-12-12 10:19:20.624240: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-12-12 10:19:20.624269: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


### Other

In [4]:
import os
import re
import csv
import wget
import json
import math
import scipy
import torch
import string
import sklearn

import numpy as np
import pandas as pd

from tokenizers import BertWordPieceTokenizer
from torch.utils.data import DataLoader
from datetime import datetime

**Mount Google Drive to load saved models**

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [17]:
use_trained_model = True

## **REGRESSION**

### Loading the datasets

In [5]:
schema = StructType([
    StructField("genre", StringType(), True),
    StructField("filename", StringType(), True),
    StructField("year", StringType(), True),
    StructField("year_id", IntegerType(), True),
    StructField("score", FloatType(), True),
    StructField("sentence1", StringType(), True),
    StructField("sentence2", StringType(), True)])

train = spark.read.csv("stsbenchmark/sts-train.csv", sep ='\t', header=False, schema=schema)
test = spark.read.csv("stsbenchmark/sts-test.csv", sep ='\t', header=False, schema=schema)
dev = spark.read.csv("stsbenchmark/sts-dev.csv", sep ='\t', header=False, schema=schema)

train.show()


+-------------+--------+--------+-------+-----+--------------------+--------------------+
|        genre|filename|    year|year_id|score|           sentence1|           sentence2|
+-------------+--------+--------+-------+-----+--------------------+--------------------+
|main-captions|  MSRvid|2012test|      1|  5.0|A plane is taking...|An air plane is t...|
|main-captions|  MSRvid|2012test|      4|  3.8|A man is playing ...|A man is playing ...|
|main-captions|  MSRvid|2012test|      5|  3.8|A man is spreadin...|A man is spreadin...|
|main-captions|  MSRvid|2012test|      6|  2.6|Three men are pla...|Two men are playi...|
|main-captions|  MSRvid|2012test|      9| 4.25|A man is playing ...|A man seated is p...|
|main-captions|  MSRvid|2012test|     11| 4.25|Some men are figh...|Two men are fight...|
|main-captions|  MSRvid|2012test|     12|  0.5|   A man is smoking.|   A man is skating.|
|main-captions|  MSRvid|2012test|     13|  1.6|The man is playin...|The man is playin...|
|main-capt

                                                                                

### Normalize

In [6]:
train = train.withColumn("label", col("score")/2.5-1)
test = test.withColumn("label", col("score")/2.5-1)
dev = dev.withColumn("label", col("score")/2.5-1)

dev.select("label").describe().show()

+-------+--------------------+
|summary|               label|
+-------+--------------------+
|  count|                1500|
|   mean|-0.05443697837591158|
| stddev|  0.6001942581590352|
|    min|                -1.0|
|    max|                 1.0|
+-------+--------------------+
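The normalization above rescales the STS scores from the range [0, 5] to [-1, 1], which is the range of a cosine similarity. A minimal sketch of the same arithmetic in plain Python (the helper name `normalize_score` is hypothetical and not part of the notebook):

```python
def normalize_score(score, max_score=5.0):
    """Map a similarity score from [0, max_score] to [-1, 1]."""
    half_range = max_score / 2.0  # 2.5 for the 0-5 STS scale
    return score / half_range - 1.0


# The two ends and the midpoint of the original scale
for raw in (0.0, 2.5, 5.0):
    print(raw, "->", normalize_score(raw))
```

This is consistent with the `min` of -1.0 and `max` of 1.0 reported by `describe()` for the dev split.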



### Fill NAN

In [7]:
train = train.na.fill(value="",subset=["sentence1", "sentence2"])
test = test.na.fill(value="",subset=["sentence1", "sentence2"])
dev = dev.na.fill(value="",subset=["sentence1", "sentence2"])

### Create samples

In [8]:
def CreatInputExampleList(df):
    samples = []
    for index, row in df.iterrows():
        input_example = InputExample(texts=[row['sentence1'], row['sentence2']], label=row['label'])
        samples.append(input_example)
    return samples

In [9]:
df_train = train.select("sentence1", "sentence2", "label").toPandas()

train_samples = CreatInputExampleList(df_train)

In [10]:
df_test = test.select("sentence1", "sentence2", "label").toPandas()

test_samples = CreatInputExampleList(df_test)

In [11]:
df_dev = dev.select("sentence1", "sentence2", "label").toPandas()

dev_samples = CreatInputExampleList(df_dev)

## Define the model

In [12]:
#model_name = 'distilbert-base-uncased' # smaller model
model_name = 'bert-base-uncased' # original model
word_embedding = models.Transformer(model_name)

# Set mean-pooling strategy
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(),
                         pooling_mode_mean_tokens=True,
                         pooling_mode_cls_token=False,
                         pooling_mode_max_tokens=False)

model = SentenceTransformer(modules=[word_embedding, pooling])

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
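To illustrate what the mean-pooling strategy does, here is a small sketch in plain Python. It assumes that pooling averages the token vectors of a sentence while skipping padded positions; the function `mean_pool` and the toy vectors are hypothetical and only meant to show the idea, not the library's implementation.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring positions where the mask is 0."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    n_real = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            n_real += 1
            for i, x in enumerate(vec):
                summed[i] += x
    # Guard against an all-padding input
    return [s / max(n_real, 1) for s in summed]


# Three token vectors of dimension 2; the last one is padding
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_vec = mean_pool(tokens, mask)
print(sentence_vec)
```

The result is a single fixed-size vector per sentence, regardless of how many tokens the sentence contains.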


### Load the training set and define the loss function as the cosine similarity

In [13]:
# train_batch_size = 16
train_batch_size = 32 # try to speed up the training

train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=train_batch_size)
train_loss = losses.CosineSimilarityLoss(model=model)
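As a rough sketch of the regression objective, the loss can be thought of as the mean squared error between the cosine similarity of the two sentence embeddings and the normalized label. The helper names and the toy vectors below are hypothetical, and the sketch assumes the default mean-squared-error formulation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_mse_loss(pairs, labels):
    """Mean squared error between cos(u, v) and the target label."""
    errors = [(cosine_similarity(u, v) - y) ** 2
              for (u, v), y in zip(pairs, labels)]
    return sum(errors) / len(errors)


# Toy embeddings: an identical pair and an orthogonal pair
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
labels = [1.0, -1.0]  # normalized labels in [-1, 1]
loss = cosine_mse_loss(pairs, labels)
print(loss)
```

This is why the labels were rescaled to [-1, 1]: the target must live in the same range as the cosine similarity it is compared against.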

### Define the evaluator for the sentence embeddings

In [14]:
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name='sts-dev')

Use 10% of the training steps for warm-up

In [15]:
num_epochs = 1

warmup_steps = math.ceil(len(train_dataloader) * num_epochs * 0.1)
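The warm-up computation can be sketched on its own. The example below assumes a linear warm-up followed by a linear decay of the learning rate; that schedule is an assumption about what `model.fit` does with `warmup_steps`. The batch count of 185 is a hypothetical value chosen for illustration, not the size of the actual dataloader.

```python
import math


def warmup_step_count(num_batches, num_epochs, warmup_fraction=0.1):
    """Number of optimizer steps spent ramping the learning rate up."""
    return math.ceil(num_batches * num_epochs * warmup_fraction)


def linear_warmup_lr(step, warmup_steps, total_steps, base_lr):
    """Learning rate at a given step: linear ramp up, then linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


num_batches = 185  # hypothetical number of batches per epoch
num_epochs = 1
total_steps = num_batches * num_epochs
warmup = warmup_step_count(num_batches, num_epochs)

for step in (0, warmup // 2, warmup, total_steps):
    print(step, linear_warmup_lr(step, warmup, total_steps, base_lr=2e-5))
```

The learning rate starts at zero, peaks at the base value once warm-up ends, and returns to zero at the last step.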

**Training**

In [18]:
model_location = './training_sts'

if use_trained_model:
    model = SentenceTransformer(model_location)
else:
    learn_rate = 2e-5
    model.fit(train_objectives=[(train_dataloader, train_loss)],
                optimizer_class=torch.optim.Adam,
                optimizer_params={'lr': learn_rate},
                evaluator=evaluator,
                epochs=num_epochs,
                evaluation_steps=1000,
                warmup_steps=warmup_steps,
                output_path=model_location)

## Evaluating the model

### Evaluation on STS benchmark dataset

In [19]:
evaluation_location = "./regression"

if not os.path.exists(evaluation_location):
    os.makedirs(evaluation_location)

test_eval = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name='sts-test')
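# Note: the returned score is a correlation between the cosine similarities and the gold labels, not a raw cosine similarity
c_s = test_eval(model, output_path=evaluation_location)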
print('Cosine similarity with the sentence_transformers library = ', c_s)

Cosine similarity with the sentence_transformers library =  0.5624505001853983


### Embedding sentences

In [20]:
embed_1 = model.encode(df_test['sentence1'], convert_to_numpy=True, batch_size=train_batch_size)
embed_2 = model.encode(df_test['sentence2'], convert_to_numpy=True, batch_size=train_batch_size)

### Compute the cosine similarity

Mathematical relationship: *cosine_similarity = 1 - cosine_distance*

In [21]:
cos_sim = 1 - sklearn.metrics.pairwise.paired_cosine_distances(embed_1, embed_2)
print('Cosine similarity = ', cos_sim)

Cosine similarity =  [ 0.46030778  0.89910275  0.57540166 ...  0.0405072   0.27599466
 -0.03714967]
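For reference, the relationship used in the cell above can be written out for two embedding vectors $u$ and $v$ (standard definitions, stated here for completeness):

```latex
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert},
\qquad
d_{\cos}(u, v) = 1 - \cos(u, v)
\;\;\Longrightarrow\;\;
\cos(u, v) = 1 - d_{\cos}(u, v)
```

Since $\cos(u, v) \in [-1, 1]$, slightly negative values such as the last entry of the printed array are possible.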


### Spearman correlation coefficient

In [22]:
spr_corr = scipy.stats.spearmanr(cos_sim, df_test['label'])
print('Spearman correlation coefficient = ', spr_corr[0])

Spearman correlation coefficient =  0.5624505569513745


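**Comment:** the two results match each other: the score returned by the library evaluator and the manually computed rank correlation are both approximately 0.56245, so the evaluator score is a rank correlation between the cosine similarities and the labels rather than a raw cosine similarity.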
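To make the metric concrete, the rank correlation can be sketched without SciPy: rank both lists (ties share an average rank) and take the Pearson correlation of the ranks. The helper names and the toy values below are hypothetical; this is an illustrative sketch, not a replacement for `scipy.stats.spearmanr`.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


predicted = [0.9, 0.1, 0.5, 0.7]  # toy cosine similarities
gold = [1.0, -1.0, 0.2, 0.1]      # toy normalized labels
rho = spearman(predicted, gold)
print(rho)
```

Because only the ordering matters, the metric is unaffected by any monotonic rescaling of the similarities or of the labels.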

---

## **CLASSIFICATION**

### Setup

In [23]:
# Download the file (if we haven't already)
if not os.path.exists('./snli_1.0'):
    print('***** Downloading dataset ...')
    wget.download('https://nlp.stanford.edu/projects/snli/snli_1.0.zip', './snli_1.0.zip')
    !unzip snli_1.0.zip

### StringIndexer

In [24]:
indexer = StringIndexer(inputCol="gold_label", outputCol="label")
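By default, `StringIndexer` assigns indices by descending label frequency, so the most frequent label gets index 0. The sketch below mimics that behaviour in plain Python to show why the mapping depends on the data the indexer is fitted on. The function `frequency_index` and the toy splits are hypothetical, and breaking ties alphabetically is an assumption.

```python
from collections import Counter


def frequency_index(labels):
    """Map each label to an index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending count, then alphabetically to break ties
    ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    return {lab: idx for idx, lab in enumerate(ordered)}


split_a = ["entailment"] * 3 + ["neutral"] * 2 + ["contradiction"]
split_b = ["contradiction"] * 3 + ["neutral"] * 2 + ["entailment"]

mapping_a = frequency_index(split_a)
mapping_b = frequency_index(split_b)
print(mapping_a)
print(mapping_b)
```

If two splits have different label frequencies, fitting a separate indexer on each one can give the same class different indices, so reusing a single fitted model across splits keeps the mapping consistent.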

### Create samples

In [25]:
train_class_path = 'snli_1.0/snli_1.0_train.jsonl'
train_class = spark.read.json(train_class_path)
train_class = train_class.filter(col("gold_label") != "-")
train_class = indexer.fit(train_class).transform(train_class)
train_class = train_class.withColumn("label", col("label").cast('int'))

df_class_train = train_class.select("sentence1", "sentence2", "label").toPandas()

train_class_samples = CreatInputExampleList(df_class_train)

                                                                                

In [26]:
test_class_path = 'snli_1.0/snli_1.0_test.jsonl'
test_class = spark.read.json(test_class_path)
test_class = test_class.filter(col("gold_label") != "-")
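# Note: refitting the indexer on each split can assign different indices to the same class (StringIndexer orders by frequency by default)
test_class = indexer.fit(test_class).transform(test_class)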
test_class = test_class.withColumn("label", col("label").cast('int'))

df_class_test = test_class.select("sentence1", "sentence2", "label").toPandas()

test_class_samples = CreatInputExampleList(df_class_test)

In [27]:
dev_class_path = 'snli_1.0/snli_1.0_dev.jsonl'
dev_class = spark.read.json(dev_class_path)
dev_class = dev_class.filter(col("gold_label") != "-")
dev_class = indexer.fit(dev_class).transform(dev_class)
dev_class = dev_class.withColumn("label", col("label").cast('int'))

df_class_dev = dev_class.select("sentence1", "sentence2", "label").toPandas()

dev_class_samples = CreatInputExampleList(df_class_dev)

In [28]:
train_dataloader_cl = DataLoader(train_class_samples, shuffle=True, batch_size=train_batch_size)

num_labels = test_class.select('label').distinct().count()

train_loss_cl = losses.SoftmaxLoss(model=model, sentence_embedding_dimension=model.get_sentence_embedding_dimension(), num_labels=num_labels)
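As a sketch of the classification objective, the S-BERT setup concatenates the two sentence embeddings with their element-wise absolute difference, feeds the result to a linear classifier, and applies softmax with cross-entropy. The helper names, the toy embeddings and the weight matrix below are hypothetical; the sketch assumes the default concatenation `(u, v, |u - v|)` and omits the bias term.

```python
import math


def classifier_features(u, v):
    """Concatenate u, v and |u - v| into one feature vector."""
    diff = [abs(a - b) for a, b in zip(u, v)]
    return list(u) + list(v) + diff


def softmax(logits):
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-likelihood of the correct class."""
    return -math.log(probs[label])


u = [0.2, 0.7]  # toy embedding of sentence 1
v = [0.5, 0.1]  # toy embedding of sentence 2
features = classifier_features(u, v)  # length 3 * embedding dimension

# Hypothetical weights for 3 classes (one row per class)
weights = [[0.1] * 6, [0.0] * 6, [-0.1] * 6]
logits = [sum(w * f for w, f in zip(row, features)) for row in weights]
probs = softmax(logits)
loss = cross_entropy(probs, label=0)
print(probs, loss)
```

Only the encoder is kept after training; the classifier on top serves to shape the embeddings and is not used at inference time.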

                                                                                

In [29]:
evaluator_cl = EmbeddingSimilarityEvaluator.from_input_examples(dev_class_samples, batch_size=train_batch_size, name='snli-dev')

In [30]:
warmup_steps_cl = math.ceil(len(train_dataloader_cl) * num_epochs * 0.1) # 10% of the SNLI training steps for warm-up

In [31]:
model_class_location = './training_snli'

if use_trained_model:
    model = SentenceTransformer(model_class_location)
else:
    model.fit(train_objectives=[(train_dataloader_cl, train_loss_cl)],
             evaluator=evaluator_cl,
             epochs=num_epochs,
             evaluation_steps=1000,
             warmup_steps=warmup_steps_cl,
             output_path=model_class_location)

## Evaluation with snli-test

In [32]:
evaluation_class_location = "./classification"

if not os.path.exists(evaluation_class_location):
    os.makedirs(evaluation_class_location)

test_eval_cl = EmbeddingSimilarityEvaluator.from_input_examples(test_class_samples, batch_size=train_batch_size, name='snli-test')
c_s_cl = test_eval_cl(model, output_path=evaluation_class_location)
print('Cosine similarity with the sentence_transformers library = ', c_s_cl)

Cosine similarity with the sentence_transformers library =  -0.13774307853266043


In [34]:
c_s_sts = test_eval(model, output_path=evaluation_class_location) # from regression task
print('Cosine similarity with the sentence_transformers library = ', c_s_sts)

Cosine similarity with the sentence_transformers library =  0.7230442341354506


## Evaluation on SNLI and STS benchmark datasets (no library)

In [35]:
embed_1_snli = model.encode(df_class_test['sentence1'], convert_to_numpy=True, batch_size=train_batch_size)
embed_2_snli = model.encode(df_class_test['sentence2'], convert_to_numpy=True, batch_size=train_batch_size)

embed_1 = model.encode(df_test['sentence1'], convert_to_numpy=True, batch_size=train_batch_size)
embed_2 = model.encode(df_test['sentence2'], convert_to_numpy=True, batch_size=train_batch_size)

### Compute the cosine similarity

In [36]:
cos_sim_cl = 1 - sklearn.metrics.pairwise.paired_cosine_distances(embed_1_snli, embed_2_snli)
print('SNLI-test: cosine similarity = ', cos_sim_cl)

cos_sim_sts = 1 - sklearn.metrics.pairwise.paired_cosine_distances(embed_1, embed_2)
print('STS benchmark: cosine similarity = ', cos_sim_sts)

SNLI-test: cosine similarity =  [0.19246674 0.81542057 0.81191134 ... 0.18281555 0.8461983  0.796359  ]
STS benchmark: cosine similarity =  [0.94786173 0.9772164  0.95169765 ... 0.5865485  0.63525355 0.28816307]


### Spearman correlation coefficient

In [43]:
spr_corr_cl = scipy.stats.spearmanr(cos_sim_cl, df_class_test['label'])
print('SNLI-test: Spearman correlation coefficient = ', spr_corr_cl[0])

spr_corr_sts = scipy.stats.spearmanr(cos_sim_sts, df_test['label'])
print('STS benchmark: Spearman correlation coefficient = ', spr_corr_sts[0])

SNLI-test: Spearman correlation coefficient =  -0.1547163393040419
STS benchmark: Spearman correlation coefficient =  0.7230441788400894


## Semantic search

**Link to dataset:** https://www.kaggle.com/rmisra/news-category-dataset

In [38]:
news_path = 'News_Category_Dataset_v2.json'
news = spark.read.json(news_path)

df_news = news.select("headline", "short_description").toPandas()

news_samples = []
for index, row in df_news.iterrows():
    news_samples.append([row['headline'], row['short_description']])

In [39]:
embed_news = model.encode(news_samples[0:100], convert_to_tensor=True, show_progress_bar=True)


In [40]:
search = input("Find close to: ")

embed_query = model.encode(search, convert_to_tensor=True)

Find close to: police


In [41]:
n_close = 5 # number of similar records to return

cos_sim = util.pytorch_cos_sim(embed_query, embed_news)[0]
top_close = torch.topk(cos_sim, k=n_close)
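The search step boils down to scoring every corpus embedding against the query embedding and keeping the best `k`. A minimal sketch of that idea in plain Python, with hypothetical helper names and toy two-dimensional vectors standing in for real sentence embeddings:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query_vec, corpus_vecs, k):
    """Return the k (score, index) pairs with the highest similarity."""
    scored = [(cosine_similarity(query_vec, vec), idx)
              for idx, vec in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]  # toy embeddings
query = [1.0, 0.0]
best = top_k(query, corpus, k=2)
for score, idx in best:
    print(idx, "(score: {:.4f})".format(score))
```

Scoring every item is fine for the 100 items encoded here; for much larger corpora an approximate nearest-neighbour index would be a more practical choice.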

In [42]:
print(f"\nTop {n_close} closest news in the dataset:")

for score, idx in zip(top_close[0], top_close[1]):
    print(news_samples[idx], "(score: {:.4f})".format(score))


Top 5 closest news in the dataset:
['Publix Suspends Contributions To NRA-Backed Politician Amid Protests', '"Publix knows we\'re not going away," one gun-control activist said.'] (score: 0.5007)
["Monsanto And Bayer Are Set To Merge. Here's Why You Should Care.", '“Together they will influence markets all over the world on a scale we’ve never seen before.”'] (score: 0.4562)
['2 People Injured In Indiana School Shooting', 'A male student, believed to be the suspect, has been detained, according to police.'] (score: 0.4074)
['Morgan Freeman Dropped From Marketing Campaigns After Harassment Accusations', "Both Visa and Vancouver's public transit system have suspended work with him."] (score: 0.3852)
["Twitter Bots May Have Delivered Donald Trump's Victory, Research Paper Says", '“Our results suggest that, given narrow margins of victories in each vote, bots’ effect was likely marginal but possibly large enough to affect the outcomes."'] (score: 0.3770)
