# **LAB-2: Scalable Machine Learning and Deep Learning**

## **Paolo Teta & Ralfs Zangis**
---
**TASK:** Implement the **S-BERT** (Sentence-BERT) model

**Outline:**
- Load the dataset
- Regression
- Classification
- Evaluation with STS benchmark dataset (cosine similarity and Spearman correlation)
- Semantic search
---

### **Download and extract the following Google Drive folder:**
https://drive.google.com/drive/folders/1SbiN2jpliKi_9B0gsRTKDbMRgsFJTOjV?usp=sharing

## **Requirements**

### Install dependencies

In [1]:
!pip install sentence_transformers
!pip install transformers
!pip install tokenizers
!pip install pyspark
!pip install torch
!pip install wget



### Spark

In [2]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import *
from pyspark.sql.functions import *
from pyspark.sql.types import *

spark = SparkSession.builder.getOrCreate()

### ML

In [3]:
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import Input

from sentence_transformers import SentenceTransformer
from sentence_transformers import LoggingHandler
from sentence_transformers import models, losses, util
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from sentence_transformers.readers import InputExample

from transformers import BertTokenizer, TFBertModel, BertConfig
# from transformers import DistilBertTokenizer, DistilBertModel # -> smaller model

### Other

In [4]:
import os
import re
import csv
import wget
import json
import math
import scipy
import torch
import pickle
import string
import sklearn

import numpy as np
import pandas as pd

from tokenizers import BertWordPieceTokenizer
from torch.utils.data import DataLoader
from datetime import datetime

---

## **REGRESSION**

### Loading the datasets

In [5]:
schema = StructType([
    StructField("genre", StringType(), True),
    StructField("filename", StringType(), True),
    StructField("year", StringType(), True),
    StructField("year_id", IntegerType(), True),
    StructField("score", FloatType(), True),
    StructField("sentence1", StringType(), True),
    StructField("sentence2", StringType(), True)])

train = spark.read.csv("./lab2_files/dataset/stsbenchmark/sts-train.csv", sep ='\t', header=False, schema=schema)
test = spark.read.csv("./lab2_files/dataset/stsbenchmark/sts-test.csv", sep ='\t', header=False, schema=schema)
dev = spark.read.csv("./lab2_files/dataset/stsbenchmark/sts-dev.csv", sep ='\t', header=False, schema=schema)

train.show()

+-------------+--------+--------+-------+-----+--------------------+--------------------+
|        genre|filename|    year|year_id|score|           sentence1|           sentence2|
+-------------+--------+--------+-------+-----+--------------------+--------------------+
|main-captions|  MSRvid|2012test|      1|  5.0|A plane is taking...|An air plane is t...|
|main-captions|  MSRvid|2012test|      4|  3.8|A man is playing ...|A man is playing ...|
|main-captions|  MSRvid|2012test|      5|  3.8|A man is spreadin...|A man is spreadin...|
|main-captions|  MSRvid|2012test|      6|  2.6|Three men are pla...|Two men are playi...|
|main-captions|  MSRvid|2012test|      9| 4.25|A man is playing ...|A man seated is p...|
|main-captions|  MSRvid|2012test|     11| 4.25|Some men are figh...|Two men are fight...|
|main-captions|  MSRvid|2012test|     12|  0.5|   A man is smoking.|   A man is skating.|
|main-captions|  MSRvid|2012test|     13|  1.6|The man is playin...|The man is playin...|
... (output truncated)

### Normalize

In [6]:
train = train.withColumn("label", col("score")/2.5-1)
test = test.withColumn("label", col("score")/2.5-1)
dev = dev.withColumn("label", col("score")/2.5-1)

dev.select("label").describe().show()

+-------+--------------------+
|summary|               label|
+-------+--------------------+
|  count|                1500|
|   mean|-0.05443697837591158|
| stddev|  0.6001942581590352|
|    min|                -1.0|
|    max|                 1.0|
+-------+--------------------+
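The rescaling above maps the STS scores from the range [0, 5] to [-1, 1], which is the range of a cosine similarity. A minimal sketch of the same arithmetic in plain Python follows; the helper names `normalize_score` and `denormalize_label` are illustrative and do not appear elsewhere in this notebook.

```python
def normalize_score(score: float) -> float:
    """Map an STS score in [0, 5] to a label in [-1, 1]."""
    return score / 2.5 - 1


def denormalize_label(label: float) -> float:
    """Inverse mapping, from [-1, 1] back to [0, 5]."""
    return (label + 1) * 2.5


# A few raw scores and their normalized labels.
for s in (0.0, 2.5, 3.8, 5.0):
    print(s, "->", round(normalize_score(s), 3))
```

With this mapping, the dev-set mean label of about -0.054 corresponds to a raw score of about 2.36.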



### Fill missing values (NaN)

In [7]:
train = train.na.fill(value="",subset=["sentence1", "sentence2"])
test = test.na.fill(value="",subset=["sentence1", "sentence2"])
dev = dev.na.fill(value="",subset=["sentence1", "sentence2"])

### Create samples

In [8]:
def CreatInputExampleList(df):
    samples = []
    for index, row in df.iterrows():
        input_example = InputExample(texts=[row['sentence1'], row['sentence2']], label=row['label'])
        samples.append(input_example)
    return samples

In [9]:
df_train = train.select("sentence1", "sentence2", "label").toPandas()

train_samples = CreatInputExampleList(df_train)

In [10]:
df_test = test.select("sentence1", "sentence2", "label").toPandas()

test_samples = CreatInputExampleList(df_test)

In [11]:
df_dev = dev.select("sentence1", "sentence2", "label").toPandas()

dev_samples = CreatInputExampleList(df_dev)

### **Define the model**

In [12]:
model_name = 'bert-base-uncased' # original model
# model_name = 'distilbert-base-uncased' # smaller model
word_embedding = models.Transformer(model_name)

# Set mean-pooling strategy
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(),
                         pooling_mode_mean_tokens=True,
                         pooling_mode_cls_token=False,
                         pooling_mode_max_tokens=False)

model_reg = SentenceTransformer(modules=[word_embedding, pooling])

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
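The pooling module configured above averages the token embeddings of a sentence into one fixed-size vector. The sketch below illustrates the idea in plain Python, assuming that padded positions are excluded through an attention mask; `mean_pool` and the toy vectors are hypothetical and are not the library's implementation.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring positions where the mask is 0."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    # Guard against an all-padding input to avoid division by zero.
    count = max(count, 1)
    return [value / count for value in summed]


tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]  # last row is padding
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```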


In [13]:
# train_batch_size = 16
train_batch_size = 32 # try to speed up the training

learn_rate = 2e-5
num_epochs = 1

### Load the training set and define the loss function as the cosine similarity

In [14]:
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=train_batch_size)

train_loss = losses.CosineSimilarityLoss(model=model_reg)
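To our understanding, `CosineSimilarityLoss` with its default settings compares the cosine similarity of the two sentence embeddings with the gold label through a mean squared error. The sketch below reproduces that objective in plain Python under this assumption; `cosine_similarity` and `cosine_mse_loss` are illustrative names, and the vectors are toy data.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_mse_loss(pairs, labels):
    """Mean squared error between cos(u, v) and the gold label."""
    errors = [(cosine_similarity(u, v) - y) ** 2
              for (u, v), y in zip(pairs, labels)]
    return sum(errors) / len(errors)


# Identical vectors with label 1 and orthogonal vectors with label 0.
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
labels = [1.0, 0.0]
print(cosine_mse_loss(pairs, labels))  # 0.0
```

This is also why the labels were rescaled to [-1, 1]: the regression target has to live in the same range as the cosine similarity.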

### Define the evaluator for the sentence embeddings

In [15]:
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name='sts-dev')

### Use 10% of the training steps for warm-up

In [16]:
warmup_steps = math.ceil(len(train_dataloader) * num_epochs * 0.1)
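The warm-up length is counted in optimizer steps, since `len(train_dataloader)` is the number of batches per epoch. The sketch below works through the arithmetic with a hypothetical dataset of 5,000 examples (not the actual STS size), and shows a linear warm-up followed by a linear decay, which we assume is what the default scheduler of `fit` does. The function names are illustrative.

```python
import math


def warmup_steps_for(num_examples, batch_size, num_epochs, fraction=0.1):
    """Number of warm-up steps as a fraction of all training steps."""
    # Batches per epoch when the last, smaller batch is kept.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return math.ceil(steps_per_epoch * num_epochs * fraction)


def warmup_linear_factor(step, warmup_steps, total_steps):
    """Learning-rate multiplier: linear ramp from 0 to 1, then linear decay to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


# Hypothetical size: 5,000 examples, batch size 32, one epoch -> 157 steps.
print(warmup_steps_for(5000, 32, 1))  # 16
```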

### **Training**

In [17]:
model_location = './lab2_files/saved_models/training_sts'

if os.path.exists(model_location):
    model_reg = SentenceTransformer(model_location)
else:
    model_reg.fit(train_objectives=[(train_dataloader, train_loss)],
                    optimizer_class=torch.optim.Adam,
                    optimizer_params={'lr': learn_rate},
                    evaluator=evaluator,
                    epochs=num_epochs,
                    evaluation_steps=1000,
                    warmup_steps=warmup_steps,
                    output_path=model_location)

### **Evaluation**

#### Evaluation on STS benchmark dataset (with library)

In [18]:
evaluation_location = "./lab2_files/regression_results"

if not os.path.exists(evaluation_location):
    os.makedirs(evaluation_location)

test_eval = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, batch_size=train_batch_size, name='sts-test')
# Note: the evaluator returns a Spearman correlation score, not a raw cosine similarity
c_s = test_eval(model_reg, output_path=evaluation_location)
print('Cosine similarity with the sentence_transformers library = ', c_s)

Cosine similarity with the sentence_transformers library =  0.5624506124304559


#### Evaluation on STS benchmark dataset (no library)

##### Embedding sentences

In [19]:
embed_1 = model_reg.encode(df_test['sentence1'], convert_to_numpy=True, batch_size=train_batch_size)
embed_2 = model_reg.encode(df_test['sentence2'], convert_to_numpy=True, batch_size=train_batch_size)

##### Compute the cosine similarity

Mathematical relationship: *cosine_similarity = 1 - cosine_distance*

In [20]:
cos_sim = 1 - sklearn.metrics.pairwise.paired_cosine_distances(embed_1, embed_2)
print('Cosine similarity = ', cos_sim)

Cosine similarity =  [ 0.4603076   0.8991027   0.5754015  ...  0.04050708  0.27599466
 -0.03714991]
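The relationship *cosine_similarity = 1 - cosine_distance* is applied pairwise: row *i* of the first matrix is compared only with row *i* of the second, not with every other row. A plain-Python sketch of this row-wise computation follows; `paired_cosine_similarities` is an illustrative helper, not the scikit-learn function, and the inputs are toy data.

```python
import math


def paired_cosine_similarities(a_rows, b_rows):
    """Cosine similarity between a_rows[i] and b_rows[i] for every i."""
    sims = []
    for u, v in zip(a_rows, b_rows):
        dot = sum(x * y for x, y in zip(u, v))
        norm_u = math.sqrt(sum(x * x for x in u))
        norm_v = math.sqrt(sum(y * y for y in v))
        sims.append(dot / (norm_u * norm_v))
    return sims


a = [[1.0, 0.0], [1.0, 1.0]]
b = [[0.0, 1.0], [2.0, 2.0]]
sims = paired_cosine_similarities(a, b)
print([round(s, 6) for s in sims])  # [0.0, 1.0]
```

The second pair shows that the measure ignores vector length: `[2.0, 2.0]` points in the same direction as `[1.0, 1.0]`, so the similarity is 1.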


##### Spearman correlation coefficient

In [21]:
spr_corr = scipy.stats.spearmanr(cos_sim, df_test['label'])
print('Spearman correlation coefficient = ', spr_corr[0])

Spearman correlation coefficient =  0.5624506124304559
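The Spearman coefficient is the Pearson correlation computed on the ranks of the values, so it measures whether the predicted similarities order the sentence pairs in the same way as the gold labels. The sketch below is a plain-Python illustration with average ranks for ties; the helper names are illustrative and this is not the SciPy implementation.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Same ordering in both lists, so the correlation is perfect.
print(spearman([0.1, 0.4, 0.9], [-1.0, 0.2, 0.3]))  # 1.0
```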


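**Comment:** the library evaluator and the manual computation give the same value (0.5624), so the score reported by the evaluator is the Spearman correlation of the cosine similarities, not a cosine similarity itself.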

---

## **CLASSIFICATION**

### Setup

In [22]:
if not os.path.exists('./lab2_files/dataset/snli'):
    print('Downloading dataset ...')
    wget.download('https://nlp.stanford.edu/projects/snli/snli_1.0.zip', './lab2_files/dataset/snli.zip')
    !unzip ./lab2_files/dataset/snli.zip -d ./lab2_files/dataset
    !mv ./lab2_files/dataset/snli_1.0 ./lab2_files/dataset/snli

### Loading the datasets

In [23]:
train_class_path = './lab2_files/dataset/snli/snli_1.0_train.jsonl'
train_class = spark.read.json(train_class_path)

test_class_path = './lab2_files/dataset/snli/snli_1.0_test.jsonl'
test_class = spark.read.json(test_class_path)

dev_class_path = './lab2_files/dataset/snli/snli_1.0_dev.jsonl'
dev_class = spark.read.json(dev_class_path)

### String-2-Integer

In [24]:
indexer = StringIndexer(inputCol="gold_label", outputCol="label")

### Create samples

In [25]:
def CreatClassSamples(df):
    df = df.filter(col("gold_label") != "-")
    df = indexer.fit(df).transform(df)
    df = df.withColumn("label", col("label").cast('int'))

    df_class = df.select("sentence1", "sentence2", "label").toPandas()

    samples = CreatInputExampleList(df_class)
    return samples

In [26]:
train_class_samples = CreatClassSamples(train_class)
test_class_samples = CreatClassSamples(test_class)
dev_class_samples = CreatClassSamples(dev_class)
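By default, `StringIndexer` assigns indices by descending label frequency, and `CreatClassSamples` fits the indexer separately on each split. If the label frequencies are ordered differently across splits, the same string may receive different integers. The sketch below illustrates this with invented toy labels (not SNLI counts); `fit_frequency_index` is a hypothetical stand-in for the indexer, with ties broken alphabetically.

```python
from collections import Counter


def fit_frequency_index(labels):
    """Map each label to an index, most frequent label first."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


# Toy data, invented for illustration only.
train_labels = ["entailment", "entailment", "neutral",
                "contradiction", "neutral", "entailment"]
test_labels = ["contradiction", "contradiction", "neutral", "entailment"]

print(fit_frequency_index(train_labels))
print(fit_frequency_index(test_labels))
```

A possible safeguard, under the same assumptions, is to fit the indexer once on the training split and reuse the fitted model for the test and dev splits, or to use a fixed dictionary from label string to integer.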

### **Define the model**

In [27]:
model_class = SentenceTransformer(modules=[word_embedding, pooling])

### Load the training set and define the loss function as the softmax classification loss

In [28]:
train_dataloader_cl = DataLoader(train_class_samples, shuffle=True, batch_size=train_batch_size)

# Count the distinct gold labels, excluding the "-" placeholder (pairs without annotator consensus)
num_labels = test_class.select('gold_label').distinct().count() - 1

train_loss_cl = losses.SoftmaxLoss(model=model_class, sentence_embedding_dimension=model_class.get_sentence_embedding_dimension(), num_labels=num_labels)
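In the S-BERT classification objective, the two sentence embeddings *u* and *v* are concatenated with their element-wise difference |*u* - *v*| and passed to a softmax classifier trained with cross-entropy. The sketch below illustrates this in plain Python, assuming that feature layout; the weights are set to zero only to keep the example verifiable, and all names are illustrative.

```python
import math


def build_features(u, v):
    """Concatenate u, v and |u - v| into one feature vector."""
    return u + v + [abs(a - b) for a, b in zip(u, v)]


def softmax(logits):
    """Numerically stable softmax."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(features, weights, biases, label):
    """Cross-entropy of a linear softmax classifier for one example."""
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    return -math.log(softmax(logits)[label])


u, v = [1.0, 0.0], [0.0, 1.0]
features = build_features(u, v)
print(len(features))  # 6, three times the embedding dimension

# Untrained (all-zero) classifier with 3 classes: uniform probabilities.
weights = [[0.0] * len(features) for _ in range(3)]
biases = [0.0] * 3
loss = cross_entropy(features, weights, biases, label=0)
print(loss)  # equals log(3) for a uniform prediction over 3 classes
```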

### Define the evaluator for the sentence embeddings

In [29]:
evaluator_cl = EmbeddingSimilarityEvaluator.from_input_examples(dev_class_samples, batch_size=train_batch_size, name='snli-dev')

### Use 10% of the training steps for warm-up

In [30]:
warmup_steps_cl = math.ceil(len(train_dataloader_cl) * num_epochs * 0.1)

### **Training**

In [31]:
model_class_location = './lab2_files/saved_models/training_snli'

if os.path.exists(model_class_location):
    model_class = SentenceTransformer(model_class_location)
else:
    model_class.fit(train_objectives=[(train_dataloader_cl, train_loss_cl)],
                        optimizer_class=torch.optim.Adam,
                        optimizer_params={'lr': learn_rate},
                        evaluator=evaluator_cl,
                        epochs=num_epochs,
                        evaluation_steps=1000,
                        warmup_steps=warmup_steps_cl,
                        output_path=model_class_location)

### **Evaluation**

#### Evaluation on STS benchmark dataset (with library)

In [32]:
evaluation_class_location = "./lab2_files/classification_results"

if not os.path.exists(evaluation_class_location):
    os.makedirs(evaluation_class_location)

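test_eval_cl = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, batch_size=train_batch_size, name='sts-test')
# Note: the returned score is a Spearman correlation. Depending on the library version, it may be the best
# score across several similarity functions, so it can differ from the manual cosine-based value below.
c_s_cl = test_eval_cl(model_class, output_path=evaluation_class_location)
print('Cosine similarity with the sentence_transformers library = ', c_s_cl)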

Cosine similarity with the sentence_transformers library =  0.6720169505392981


#### Evaluation on STS benchmark dataset (no library)

##### Embedding sentences

In [33]:
embed_1_sts = model_class.encode(df_test['sentence1'], convert_to_numpy=True, batch_size=train_batch_size)
embed_2_sts = model_class.encode(df_test['sentence2'], convert_to_numpy=True, batch_size=train_batch_size)

##### Compute the cosine similarity

In [34]:
cos_sim_cl = 1 - sklearn.metrics.pairwise.paired_cosine_distances(embed_1_sts, embed_2_sts)
print('STS-test: cosine similarity = ', cos_sim_cl)

STS-test: cosine similarity =  [0.9020396  0.96123344 0.8992804  ... 0.4095918  0.6290741  0.20586675]


##### Spearman correlation coefficient

In [35]:
spr_corr_cl = scipy.stats.spearmanr(cos_sim_cl, df_test['label'])
print('STS-test: Spearman correlation coefficient = ', spr_corr_cl[0])

STS-test: Spearman correlation coefficient =  0.6667147963855763


### **Train on SNLI dataset, selecting the best checkpoint with the STS dev evaluator**

In [36]:
model_class_location = './lab2_files/saved_models/training_snli_32batch-evalSTS'

if os.path.exists(model_class_location):
    model_class = SentenceTransformer(model_class_location)
else:
    model_class.fit(train_objectives=[(train_dataloader_cl, train_loss_cl)],
                        optimizer_class=torch.optim.Adam,
                        optimizer_params={'lr': learn_rate},
                        evaluator=evaluator,
                        epochs=num_epochs,
                        evaluation_steps=1000,
                        warmup_steps=warmup_steps_cl,
                        output_path=model_class_location)

### **Evaluation**

#### Evaluation on STS benchmark dataset (with library)

In [37]:
evaluation_class_location = "./lab2_files/classification-STS_results"

if not os.path.exists(evaluation_class_location):
    os.makedirs(evaluation_class_location)

c_s_sts = test_eval(model_class, output_path=evaluation_class_location)
print('Cosine similarity with the sentence_transformers library = ', c_s_sts)

Cosine similarity with the sentence_transformers library =  0.7230449331218566


#### Evaluation on STS (no library)

##### Embedding sentences

In [38]:
embed_1 = model_class.encode(df_test['sentence1'], convert_to_numpy=True, batch_size=train_batch_size)
embed_2 = model_class.encode(df_test['sentence2'], convert_to_numpy=True, batch_size=train_batch_size)

##### Compute the cosine similarity

In [39]:
cos_sim_sts = 1 - sklearn.metrics.pairwise.paired_cosine_distances(embed_1, embed_2)
print('STS benchmark: cosine similarity = ', cos_sim_sts)

STS benchmark: cosine similarity =  [0.94786173 0.9772165  0.95169765 ... 0.58654845 0.63525337 0.2881629 ]


##### Spearman correlation coefficient

In [40]:
spr_corr_sts = scipy.stats.spearmanr(cos_sim_sts, df_test['label'])
print('STS benchmark: Spearman correlation coefficient = ', spr_corr_sts[0])

STS benchmark: Spearman correlation coefficient =  0.7230449331218566


**Comment:** this model gives the best Spearman correlation on the STS test set (0.7230), compared with the regression model (0.5624) and the first SNLI training run (0.6667).

In [41]:
print('Difference after fine-tuning = ', spr_corr_sts[0] - spr_corr_cl[0])

Difference after fine-tuning =  0.05633013673628029


---

### **Evaluation EXTRA**

In [42]:
'''
model_class_location = './lab2_files/saved_models/training_snli'

if os.path.exists(model_class_location):
    model = SentenceTransformer(model_class_location)
'''

#### Evaluation on SNLI dataset (with library)

In [43]:
'''
evaluation_class_location = "./lab2_files/classification-SNLI_results"

if not os.path.exists(evaluation_class_location):
    os.makedirs(evaluation_class_location)

test_eval_cl = EmbeddingSimilarityEvaluator.from_input_examples(test_class_samples, batch_size=train_batch_size, name='snli-test')
c_s_cl = test_eval_cl(model, output_path=evaluation_class_location)
print('Cosine similarity with the sentence_transformers library = ', c_s_cl)
'''

#### Evaluation on SNLI (no library)

##### Embedding sentences

In [44]:
'''
embed_1_snli = model.encode(test_class['sentence1'], convert_to_numpy=True, batch_size=train_batch_size)
embed_2_snli = model.encode(test_class['sentence2'], convert_to_numpy=True, batch_size=train_batch_size)
'''

##### Compute the cosine similarity

In [45]:
'''
cos_sim_cl = 1 - sklearn.metrics.pairwise.paired_cosine_distances(embed_1_snli, embed_2_snli)
print('SNLI-test: cosine similarity = ', cos_sim_cl)
'''

##### Spearman correlation coefficient

In [46]:
'''
spr_corr_cl = scipy.stats.spearmanr(cos_sim_cl, df_class_test['label'])
print('SNLI-test: Spearman correlation coefficient = ', spr_corr_cl[0])
'''

---

## **SEMANTIC SEARCH**

**Link to dataset:** https://www.kaggle.com/rmisra/news-category-dataset

### Create samples

In [47]:
news_path = './lab2_files/dataset/news.json'

if os.path.exists(news_path):
    news = spark.read.json(news_path)
    news_samples = list(news.select('headline').toPandas()['headline'])

### Define the model

In [48]:
best_model_location = './lab2_files/saved_models/training_snli_32batch-evalSTS'

if os.path.exists(best_model_location):
    model_best = SentenceTransformer(best_model_location)

### Embedding news

In [49]:
news_samples_path = './lab2_files/embedding/embed_news.txt'

if os.path.exists(news_samples_path):
    with open(news_samples_path, "rb") as file:
        embed_news = pickle.load(file)
else:
    embed_news = model_best.encode(news_samples, convert_to_tensor=True, show_progress_bar=True)
    with open(news_samples_path, "wb") as file:
        pickle.dump(embed_news, file)

### Searching

In [50]:
search = input("Find news close to: ")

embed_query = model_best.encode(search, convert_to_tensor=True)

Find news close to: nba results


In [51]:
cos_sim = util.pytorch_cos_sim(embed_query, embed_news)[0]

In [52]:
n_close = 10 # number of similar records

top_close = torch.topk(cos_sim, k=n_close)
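The search step reduces to picking the *k* largest similarity scores together with their positions, which are then used to look up the headlines. The sketch below shows the same selection in plain Python with `heapq`; `top_k` and the scores are illustrative, not the `torch.topk` implementation.

```python
import heapq


def top_k(scores, k):
    """Return (score, index) pairs for the k highest scores, best first."""
    return heapq.nlargest(k, ((s, i) for i, s in enumerate(scores)))


# Toy similarity scores between one query and four documents.
scores = [0.12, 0.83, 0.40, 0.80]
print(top_k(scores, 2))  # [(0.83, 1), (0.8, 3)]
```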

### **Results**

In [53]:
print(f"Top {n_close} closest news in the dataset:\n")

for score, idx in zip(top_close[0], top_close[1]):
    print(news_samples[idx], "(score: {:.3f})".format(score))

Top 10 closest news in the dataset:

NBA Game 7 Score (score: 0.828)
Amar'e Stoudemire And Jermaine O'Neal Among NBA Players With Clothing Collections (PHOTOS) (score: 0.803)
International Intrigue and the NBA (score: 0.796)
Of All Things NBA All-Star 2013 (score: 0.795)
Marv Albert On The Knicks, Brad Stevens And The State Of The NBA (score: 0.779)
College Basketball and Your Kidneys (score: 0.768)
NBA: Kevin Love Goes East. Now What? (score: 0.768)
GPS Guide: Austin Rivers, NBA Player, Shares His Off-Court Routine (score: 0.765)
Andrew Wiggins Shows Why NBA Tanking Is A Thing (VIDEO) (score: 0.765)
NBA Says Draymond Green And Dahntay Jones Situations Are 'Completely Different' (score: 0.765)


---