# **LAB-2: Scalable Machine Learning and Deep Learning**

## **Paolo Teta & Ralfs Zangis**
---
**TASK:** Implement the **S-BERT** model

**Outline:**
- Load the dataset
- Regression
- Classification
- Evaluation with STS benchmark dataset (cosine similarity and Spearman correlation)
- Semantic search
---

**Download and extract the following Google Drive folder:** https://drive.google.com/drive/folders/1SbiN2jpliKi_9B0gsRTKDbMRgsFJTOjV?usp=sharing

## **Requirements**

### Install dependencies

In [1]:
!pip install sentence_transformers
!pip install transformers
!pip install tokenizers
!pip install torch
!pip install wget



### Spark

In [2]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import *
from pyspark.sql.functions import *
from pyspark.sql.types import *

spark = SparkSession.builder.getOrCreate()

### ML

In [3]:
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import Input

from sentence_transformers import SentenceTransformer
from sentence_transformers import LoggingHandler
from sentence_transformers import models, losses, util
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from sentence_transformers.readers import InputExample

from transformers import BertTokenizer, TFBertModel, BertConfig
# from transformers import DistilBertTokenizer, DistilBertModel # -> smaller model

### Other

In [90]:
import os
import re
import csv
import wget
import json
import math
import scipy
import torch
import pickle
import string
import sklearn

import numpy as np
import pandas as pd

from tokenizers import BertWordPieceTokenizer
from torch.utils.data import DataLoader
from datetime import datetime

---

## **REGRESSION**

### Loading the datasets

In [5]:
schema = StructType([
    StructField("genre", StringType(), True),
    StructField("filename", StringType(), True),
    StructField("year", StringType(), True),
    StructField("year_id", IntegerType(), True),
    StructField("score", FloatType(), True),
    StructField("sentence1", StringType(), True),
    StructField("sentence2", StringType(), True)])

train = spark.read.csv("./lab2_files/dataset/stsbenchmark/sts-train.csv", sep ='\t', header=False, schema=schema)
test = spark.read.csv("./lab2_files/dataset/stsbenchmark/sts-test.csv", sep ='\t', header=False, schema=schema)
dev = spark.read.csv("./lab2_files/dataset/stsbenchmark/sts-dev.csv", sep ='\t', header=False, schema=schema)

train.show()

+-------------+--------+--------+-------+-----+--------------------+--------------------+
|        genre|filename|    year|year_id|score|           sentence1|           sentence2|
+-------------+--------+--------+-------+-----+--------------------+--------------------+
|main-captions|  MSRvid|2012test|      1|  5.0|A plane is taking...|An air plane is t...|
|main-captions|  MSRvid|2012test|      4|  3.8|A man is playing ...|A man is playing ...|
|main-captions|  MSRvid|2012test|      5|  3.8|A man is spreadin...|A man is spreadin...|
|main-captions|  MSRvid|2012test|      6|  2.6|Three men are pla...|Two men are playi...|
|main-captions|  MSRvid|2012test|      9| 4.25|A man is playing ...|A man seated is p...|
|main-captions|  MSRvid|2012test|     11| 4.25|Some men are figh...|Two men are fight...|
|main-captions|  MSRvid|2012test|     12|  0.5|   A man is smoking.|   A man is skating.|
|main-captions|  MSRvid|2012test|     13|  1.6|The man is playin...|The man is playin...|
|main-capt

### Normalize

In [6]:
train = train.withColumn("label", col("score")/2.5-1)
test = test.withColumn("label", col("score")/2.5-1)
dev = dev.withColumn("label", col("score")/2.5-1)

dev.select("label").describe().show()

+-------+--------------------+
|summary|               label|
+-------+--------------------+
|  count|                1500|
|   mean|-0.05443697837591158|
| stddev|  0.6001942581590352|
|    min|                -1.0|
|    max|                 1.0|
+-------+--------------------+
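The rescaling above maps STS scores from the range [0, 5] onto [-1, 1], which is the range of the cosine similarity, so labels and model outputs become directly comparable. A minimal plain-Python sketch of the mapping and its inverse follows; the helper names `score_to_label` and `label_to_score` are illustrative and not part of the notebook.

```python
def score_to_label(score: float) -> float:
    """Map an STS score in [0, 5] to a label in [-1, 1]."""
    return score / 2.5 - 1


def label_to_score(label: float) -> float:
    """Inverse mapping: recover the original STS score."""
    return (label + 1) * 2.5


for s in (0.0, 2.5, 5.0):
    print(s, "->", score_to_label(s))
```

The inverse is handy when a predicted similarity should be reported back on the original 0 to 5 scale.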



### Fill missing values

In [8]:
train = train.na.fill(value="",subset=["sentence1", "sentence2"])
test = test.na.fill(value="",subset=["sentence1", "sentence2"])
dev = dev.na.fill(value="",subset=["sentence1", "sentence2"])

### Create samples

In [9]:
def CreatInputExampleList(df):
    samples = []
    for index, row in df.iterrows():
        input_example = InputExample(texts=[row['sentence1'], row['sentence2']], label=row['label'])
        samples.append(input_example)
    return samples

In [10]:
df_train = train.select("sentence1", "sentence2", "label").toPandas()

train_samples = CreatInputExampleList(df_train)

In [11]:
df_test = test.select("sentence1", "sentence2", "label").toPandas()

test_samples = CreatInputExampleList(df_test)

In [12]:
df_dev = dev.select("sentence1", "sentence2", "label").toPandas()

dev_samples = CreatInputExampleList(df_dev)

### **Define the model**

In [13]:
model_name = 'bert-base-uncased' # original model
# model_name = 'distilbert-base-uncased' # smaller model
word_embedding = models.Transformer(model_name)

# Set mean-pooling strategy
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(),
                         pooling_mode_mean_tokens=True,
                         pooling_mode_cls_token=False,
                         pooling_mode_max_tokens=False)

model = SentenceTransformer(modules=[word_embedding, pooling])

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
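The mean-pooling strategy configured above turns the per-token vectors produced by BERT into one fixed-size sentence vector by averaging them while ignoring padding positions. The plain-Python sketch below illustrates the idea on toy data; `mean_pool` and its inputs are hypothetical and do not reproduce the library's internals.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors where attention_mask == 1 (padding is skipped)."""
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                sums[i] += value
    # Guard against an all-padding input
    return [s / max(count, 1) for s in sums]


tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]  # last token is padding
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)
```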



In [15]:
# train_batch_size = 16
train_batch_size = 32 # try to speed up the training

learn_rate = 2e-5
num_epochs = 1

### Load the training set and define the loss function as the cosine similarity

In [16]:
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=train_batch_size)

train_loss = losses.CosineSimilarityLoss(model=model)
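As far as the library documentation describes it, `CosineSimilarityLoss` compares the cosine similarity of the two sentence embeddings with the gold label, by default through a mean squared error. A toy sketch with hand-made vectors is shown below; the function names are illustrative only.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_mse_loss(pairs, labels):
    """Mean squared error between cos(u, v) and the target label."""
    errors = [(cosine_similarity(u, v) - y) ** 2 for (u, v), y in zip(pairs, labels)]
    return sum(errors) / len(errors)


pairs = [([1.0, 0.0], [1.0, 0.0]),   # same direction
         ([1.0, 0.0], [0.0, 1.0])]   # orthogonal
labels = [1.0, 0.0]
print(cosine_mse_loss(pairs, labels))
```

The loss is zero here because each label equals the cosine similarity of its pair; any mismatch is penalised quadratically.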

### Define the evaluator for the sentence embeddings

In [17]:
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name='sts-dev')

### Use 10% of the training steps for warm-up

In [18]:
warmup_steps = math.ceil(len(train_dataloader) * num_epochs * 0.1)
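During warm-up the learning rate is raised gradually instead of starting at its full value. The sketch below assumes a linear warm-up followed by a linear decay, which is how the library's default scheduler is understood to behave; `lr_factor` and the step counts are illustrative values, not taken from this run.

```python
def lr_factor(step: int, warmup_steps: int, total_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)  # linear ramp from 0 to 1
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))  # decay to 0


total_steps, warmup = 180, 18  # hypothetical: 180 batches, 10% warm-up
base_lr = 2e-5
for step in (0, 9, 18, 99, 180):
    print(step, base_lr * lr_factor(step, warmup, total_steps))
```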

### **Training**

In [22]:
model_location = './lab2_files/saved_models/training_sts'

if os.path.exists(model_location):
    model = SentenceTransformer(model_location)
else:
    model.fit(train_objectives=[(train_dataloader, train_loss)],
                optimizer_class=torch.optim.Adam,
                optimizer_params={'lr': learn_rate},
                evaluator=evaluator,
                epochs=num_epochs,
                evaluation_steps=1000,
                warmup_steps=warmup_steps,
                output_path=model_location)

### **Evaluation**

### Evaluation on STS benchmark dataset

In [21]:
evaluation_location = "./lab2_files/regression_results"

if not os.path.exists(evaluation_location):
    os.makedirs(evaluation_location)

test_eval = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, batch_size=train_batch_size, name='sts-test')
c_s = test_eval(model, output_path=evaluation_location)
print('Cosine similarity with the sentence_transformers library = ', c_s)

Cosine similarity with the sentence_transformers library =  0.4818116487039285


### Embedding sentences

In [23]:
embed_1 = model.encode(df_test['sentence1'], convert_to_numpy=True, batch_size=train_batch_size)
embed_2 = model.encode(df_test['sentence2'], convert_to_numpy=True, batch_size=train_batch_size)

### Compute the cosine similarity

Mathematical relationship: *cosine_similarity = 1 - cosine_distance*

In [24]:
cos_sim = 1 - sklearn.metrics.pairwise.paired_cosine_distances(embed_1, embed_2)
print('Cosine similarity = ', cos_sim)

Cosine similarity =  [ 0.4603076   0.8991027   0.5754015  ...  0.04050708  0.27599466
 -0.03714991]
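The relationship above is applied pair by pair: row *i* of the first embedding matrix is compared only with row *i* of the second, not with every other row. A small plain-Python sketch of this paired computation follows; the function names are illustrative and the vectors are made up.

```python
import math


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def paired_cosine_similarity(rows_a, rows_b):
    """Similarity of rows_a[i] with rows_b[i] only (not all pairs)."""
    if len(rows_a) != len(rows_b):
        raise ValueError("both inputs need the same number of rows")
    return [1.0 - cosine_distance(u, v) for u, v in zip(rows_a, rows_b)]


a = [[1.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
b = [[3.0, 0.0], [1.0, -1.0], [-1.0, 0.0]]
print(paired_cosine_similarity(a, b))
```

The three pairs are parallel, orthogonal and opposite, so they cover the full range of possible similarity values.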


### Spearman correlation coefficient

In [25]:
spr_corr = scipy.stats.spearmanr(cos_sim, df_test['label'])
print('Spearman correlation coefficient = ', spr_corr[0])

Spearman correlation coefficient =  0.5624506124304559
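The Spearman coefficient measures how well the ordering of the predicted similarities agrees with the ordering of the gold labels; it equals the Pearson correlation of the two rank vectors. The plain-Python sketch below illustrates that definition on made-up numbers (the helper names are illustrative; in the notebook `scipy.stats.spearmanr` does this work).

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


predicted = [0.9, 0.1, 0.5, 0.7]
gold = [5.0, 0.5, 2.0, 4.0]
print(spearman(predicted, gold))
```

Because only ranks matter, the predictions do not need to be on the same scale as the labels.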


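**Comment:** the library evaluator reports 0.4818, while the manually computed Spearman correlation is 0.5625. The evaluation cell (In [21]) was executed before the training/loading cell (In [22]), so the two values were likely computed on different model states; re-running the cells in order should bring them close to each other.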

---

## **CLASSIFICATION**

### Setup

In [33]:
if not os.path.exists('./lab2_files/dataset/snli'):
    print('Downloading dataset ...')
    wget.download('https://nlp.stanford.edu/projects/snli/snli_1.0.zip', './snli.zip')
    !unzip snli.zip

### Loading the datasets

In [34]:
train_class_path = './lab2_files/dataset/snli/snli_1.0_train.jsonl'
train_class = spark.read.json(train_class_path)

test_class_path = './lab2_files/dataset/snli/snli_1.0_test.jsonl'
test_class = spark.read.json(test_class_path)

dev_class_path = './lab2_files/dataset/snli/snli_1.0_dev.jsonl'
dev_class = spark.read.json(dev_class_path)

### StringIndexer

In [41]:
indexer = StringIndexer(inputCol="gold_label", outputCol="label")
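By default `StringIndexer` orders labels by descending frequency, so the most frequent label gets index 0. The `CreatClassSamples` helper in this notebook fits the indexer separately on each split, which means the label-to-index mapping could differ between splits if their label frequencies are ordered differently. The plain-Python sketch below mimics the frequency-based indexing on made-up labels to illustrate this; `fit_string_index` is a hypothetical helper, and its alphabetical tie-breaking is an assumption of the sketch.

```python
from collections import Counter


def fit_string_index(labels):
    """Assign 0 to the most frequent label, 1 to the next, and so on.
    Ties are broken alphabetically in this sketch."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    return {lab: idx for idx, lab in enumerate(ordered)}


train_labels = ["entailment", "neutral", "entailment",
                "contradiction", "neutral", "entailment"]
test_labels = ["contradiction", "contradiction", "neutral", "entailment"]

train_mapping = fit_string_index(train_labels)
test_mapping = fit_string_index(test_labels)
print(train_mapping)
print(test_mapping)
```

Fitting the indexer once on the training split and reusing the fitted model for the other splits would keep the mapping consistent.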

### Create samples

In [43]:
def CreatClassSamples(df):
    df = df.filter(col("gold_label") != "-")
    df = indexer.fit(df).transform(df)
    df = df.withColumn("label", col("label").cast('int'))

    df_class = df.select("sentence1", "sentence2", "label").toPandas()

    samples = CreatInputExampleList(df_class)
    return samples

In [44]:
train_class_samples = CreatClassSamples(train_class)
test_class_samples = CreatClassSamples(test_class)
dev_class_samples = CreatClassSamples(dev_class)

### Load the training set and define the softmax classification loss

In [67]:
train_dataloader_cl = DataLoader(train_class_samples, shuffle=True, batch_size=train_batch_size)

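# Number of classes, excluding the "-" placeholder used for pairs without annotator consensus
num_labels = test_class.filter(col("gold_label") != "-").select('gold_label').distinct().count()

train_loss_cl = losses.SoftmaxLoss(model=model, sentence_embedding_dimension=model.get_sentence_embedding_dimension(), num_labels=num_labels)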
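In the S-BERT classification objective, the two sentence embeddings u and v are concatenated with their element-wise difference |u - v| and passed to a softmax classifier over the labels; `SoftmaxLoss` is understood to use this concatenation by default. The sketch below shows the forward computation in plain Python with hypothetical, untrained parameters.

```python
import math


def classifier_features(u, v):
    """Concatenate u, v and the element-wise |u - v|."""
    return u + v + [abs(a - b) for a, b in zip(u, v)]


def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def classify(u, v, weights, bias):
    feats = classifier_features(u, v)
    logits = [sum(w * f for w, f in zip(row, feats)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


u, v = [0.2, -0.1], [0.1, 0.4]
# Hypothetical, untrained parameters: 3 labels x 6 features
weights = [[0.1] * 6, [0.0] * 6, [-0.1] * 6]
bias = [0.0, 0.0, 0.0]
probs = classify(u, v, weights, bias)
print(probs)
```

With embeddings of dimension n the classifier therefore sees 3n input features, which is why the loss needs the sentence embedding dimension as a parameter.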

### Define the evaluator for the sentence embeddings

In [68]:
evaluator_cl = EmbeddingSimilarityEvaluator.from_input_examples(dev_class_samples, batch_size=train_batch_size, name='snli-dev')

### Use 10% of the training steps for warm-up

In [70]:
warmup_steps_cl = math.ceil(len(train_dataloader_cl) * num_epochs * 0.1)

### **Training**

In [71]:
model_class_location = './lab2_files/saved_models/training_snli'

if os.path.exists(model_class_location):
    model = SentenceTransformer(model_class_location)
else:
    model.fit(train_objectives=[(train_dataloader_cl, train_loss_cl)],
                optimizer_class=torch.optim.Adam,
                optimizer_params={'lr': learn_rate},
                evaluator=evaluator_cl,
                epochs=num_epochs,
                evaluation_steps=1000,
                warmup_steps=warmup_steps_cl,
                output_path=model_class_location)

### **Evaluation**

### Evaluation on SNLI dataset (with library)

In [72]:
evaluation_class_location = "./lab2_files/classification_results"

if not os.path.exists(evaluation_class_location):
    os.makedirs(evaluation_class_location)

test_eval_cl = EmbeddingSimilarityEvaluator.from_input_examples(test_class_samples, batch_size=train_batch_size, name='snli-test')
c_s_cl = test_eval_cl(model, output_path=evaluation_class_location)
print('Cosine similarity with the sentence_transformers library = ', c_s_cl)

Cosine similarity with the sentence_transformers library =  -0.1377429903577867


### Evaluation on SNLI (no library)

In [78]:
# test_class is a Spark DataFrame: filter and index it as in CreatClassSamples, then collect to pandas
test_class_filtered = test_class.filter(col("gold_label") != "-")
df_class_test = indexer.fit(test_class_filtered).transform(test_class_filtered).select("sentence1", "sentence2", "label").toPandas()

embed_1_snli = model.encode(df_class_test['sentence1'].tolist(), convert_to_numpy=True, batch_size=train_batch_size)
embed_2_snli = model.encode(df_class_test['sentence2'].tolist(), convert_to_numpy=True, batch_size=train_batch_size)

### Compute the cosine similarity

In [80]:
cos_sim_cl = 1 - sklearn.metrics.pairwise.paired_cosine_distances(embed_1_snli, embed_2_snli)
print('SNLI-test: cosine similarity = ', cos_sim_cl)

### Spearman correlation coefficient

In [81]:
spr_corr_cl = scipy.stats.spearmanr(cos_sim_cl, df_class_test['label'])
print('SNLI-test: Spearman correlation coefficient = ', spr_corr_cl[0])

### **Train on SNLI dataset and evaluate on STS dataset**

In [83]:
model_class_location = './lab2_files/saved_models/training_snli_32batch-evalSTS'

if os.path.exists(model_class_location):
    model = SentenceTransformer(model_class_location)
else:
    model.fit(train_objectives=[(train_dataloader_cl, train_loss_cl)],
                optimizer_class=torch.optim.Adam,
                optimizer_params={'lr': learn_rate},
                evaluator=evaluator,
                epochs=num_epochs,
                evaluation_steps=1000,
                warmup_steps=warmup_steps_cl,
                output_path=model_class_location)

### **Evaluation**

### Evaluation on STS benchmark dataset (with library)

In [84]:
evaluation_class_location = "./lab2_files/classification-STS_results"

if not os.path.exists(evaluation_class_location):
    os.makedirs(evaluation_class_location)

c_s_sts = test_eval(model, output_path=evaluation_class_location)
print('Cosine similarity with the sentence_transformers library = ', c_s_sts)

Cosine similarity with the sentence_transformers library =  0.6720169505392981


### Evaluation on STS (no library)

In [85]:
embed_1 = model.encode(df_test['sentence1'], convert_to_numpy=True, batch_size=train_batch_size)
embed_2 = model.encode(df_test['sentence2'], convert_to_numpy=True, batch_size=train_batch_size)

### Compute the cosine similarity

In [86]:
cos_sim_sts = 1 - sklearn.metrics.pairwise.paired_cosine_distances(embed_1, embed_2)
print('STS benchmark: cosine similarity = ', cos_sim_sts)

STS benchmark: cosine similarity =  [0.9020396  0.96123344 0.8992804  ... 0.4095918  0.6290741  0.20586675]


### Spearman correlation coefficient

In [87]:
spr_corr_sts = scipy.stats.spearmanr(cos_sim_sts, df_test['label'])
print('STS benchmark: Spearman correlation coefficient = ', spr_corr_sts[0])

STS benchmark: Spearman correlation coefficient =  0.6667147963855763


**Comment:** the model trained on SNLI scores higher on the STS test set (Spearman correlation 0.6667) than the model from the regression task (0.5625).

---

## **SEMANTIC SEARCH**

**Link to dataset:** https://www.kaggle.com/rmisra/news-category-dataset

In [101]:
news_path = './lab2_files/dataset/news.json'

if os.path.exists(news_path):
    news = spark.read.json(news_path)
    news_samples = list(news.select('headline').toPandas()['headline'])

### Embedding news

In [102]:
news_samples_path = './lab2_files/embedding/embed_news.txt'

if os.path.exists(news_samples_path):
    with open(news_samples_path, "rb") as file:
        embed_news = pickle.load(file)
else:
    embed_news = model.encode(news_samples, convert_to_tensor=True, show_progress_bar=True)
    with open(news_samples_path, "wb") as file:
        pickle.dump(embed_news, file)
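The cell above follows a load-or-compute pattern: embeddings are read from disk when a cached file exists and are computed and saved otherwise. A generic sketch of that pattern is given below using only the standard library; `load_or_compute` is a hypothetical helper and the "embeddings" are placeholder lists. Unlike the cell above, the sketch also creates the target folder if it is missing.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Return the pickled object at `path`, computing and saving it if missing."""
    if os.path.exists(path):
        with open(path, "rb") as file:
            return pickle.load(file)
    result = compute()
    os.makedirs(os.path.dirname(path), exist_ok=True)  # make sure the folder exists
    with open(path, "wb") as file:
        pickle.dump(result, file)
    return result


calls = []


def expensive():
    calls.append(1)  # count how often the computation really runs
    return [[0.1, 0.2], [0.3, 0.4]]


cache_path = os.path.join(tempfile.mkdtemp(), "embedding", "demo.pkl")
first = load_or_compute(cache_path, expensive)
second = load_or_compute(cache_path, expensive)
print(first == second, len(calls))
```

Note that a cached file reflects the model that produced it, so it should be deleted or renamed whenever the model changes.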

### Searching

In [103]:
search = input("Find close to: ")

embed_query = model.encode(search, convert_to_tensor=True)

Find close to: police arrests mexican


In [104]:
cos_sim = util.pytorch_cos_sim(embed_query, embed_news)[0]

In [105]:
n_close = 5 # number of most similar records to return

top_close = torch.topk(cos_sim, k=n_close)

In [106]:
print(f"Top {n_close} closest news in the dataset:\n")

for score, idx in zip(top_close[0], top_close[1]):
    print(news_samples[idx], "(score: {:.4f})".format(score))

Top 5 closest news in the dataset:

Child Support Offender, Robert Sand, Arrested In Los Angeles (score: 0.2243)
Police Arrest Suspect In Shooting Death Of Auburn Player (score: 0.2198)
Police In Texas Kill Bat-Wielding Robbery Suspect Outside Head Shop (score: 0.2190)
Fleeing Kidnapping Suspect Identified (score: 0.2189)
Mexican Vigilantes: Caught Between Cartels and U.S. Drug Consumers (score: 0.2122)
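The search above scores every headline embedding against the query embedding and keeps the highest-scoring entries. The same logic can be sketched in plain Python as follows; the 2-D vectors are made up for illustration and `top_k` is a hypothetical helper standing in for `util.pytorch_cos_sim` plus `torch.topk`.

```python
import heapq
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def top_k(query_vec, corpus_vecs, k):
    """Return (score, index) pairs of the k corpus vectors closest to the query."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(corpus_vecs)]
    return heapq.nlargest(k, scored)


corpus = ["police arrest suspect", "stock markets fall", "police report released"]
vectors = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.3]]  # made-up 2-D embeddings
query = [1.0, 0.0]

for score, idx in top_k(query, vectors, k=2):
    print(corpus[idx], "(score: {:.4f})".format(score))
```

Scoring the whole corpus is linear in its size, which is workable for a single query but may call for an approximate nearest-neighbour index on much larger collections.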


---