# Imports

In [1]:
import os
import pandas as pd
import torch
from sentence_transformers import SentenceTransformer, util

# Preprocessing Sentence Vectors

Some features require sentence vectors, for example the cosine similarity between speakers' messages or Discursive Diversity.

The goal of this notebook is to pre-process all vectors *once*, so that every feature that needs them can load the same saved set instead of re-calculating them each time.

Everything will be saved in the folder `embeddings` at the top level.

In [2]:
# source - https://www.sbert.net/
# https://www.sbert.net/docs/usage/semantic_textual_similarity.html
# https://huggingface.co/blog/how-to-train-sentence-transformers

model = SentenceTransformer('all-MiniLM-L6-v2')

In [3]:
# list of datasets
dataset_names = [
    "juries_tiny_for_testing",
    "jury_conversations_with_outcome_var",
    "csop_conversations_withblanks",
    "csopII_conversations_withblanks",
    "DAT_conversations_withblanks",
    "pgg_conversations_withblanks",
    "gurcay2015_group_estimation",
    "becker_group_estimation",
]

In [4]:
directory = '../../data/raw_data/'
output_csv_folder = '../../embeddings/'
os.makedirs(output_csv_folder, exist_ok=True)  # create the output folder if it is missing

for dataset in dataset_names:
    csv_path = os.path.join(directory, dataset + '.csv')
    messages = pd.read_csv(csv_path, encoding='mac_roman')['message'].astype(str)

    # Encode every message, then store each vector as a plain Python list
    embedding_arr = [row.tolist() for row in model.encode(messages.tolist())]
    embedding_df = pd.DataFrame({'message': messages, 'message_embedding': embedding_arr})

    embedding_df.to_csv(os.path.join(output_csv_folder, dataset + '.csv'))
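
Because the goal is to compute each dataset's vectors only once, a re-run of the loop above could skip datasets whose output file already exists. The following is a minimal sketch of that check. `pending_datasets` is a hypothetical helper, not part of this notebook, and it assumes the one-CSV-per-dataset layout used above.

```python
import os


def pending_datasets(dataset_names, output_folder):
    """Return the datasets that do not yet have a saved embeddings CSV.

    Hypothetical helper: assumes one `<dataset>.csv` file per dataset
    inside `output_folder`, matching the layout written by the loop above.
    """
    return [
        name
        for name in dataset_names
        if not os.path.exists(os.path.join(output_folder, name + ".csv"))
    ]
```

Under these assumptions, the loop could iterate over `pending_datasets(dataset_names, '../../embeddings/')` instead of `dataset_names`. Note that this only checks for the file's presence, so a stale or partially written CSV would still be treated as done.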

Once the embeddings are saved, other features can load them from this folder and use them to compute measures such as cosine similarity. For freshly encoded sentences, the computation looks like this:

```
# `sentences` is a list of strings to embed
embeddings = model.encode(sentences, convert_to_tensor=True)

# Compute the cosine similarity between every pair of sentences
cosine_scores = util.cos_sim(embeddings, embeddings)
```
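
Since the vectors are written to CSV, each `message_embedding` cell is stored as a string such as `"[0.1, 0.2]"` and has to be parsed back into numbers before use. The sketch below shows one way to reload a saved file and compare two messages. It uses only the standard library to stay self-contained; `load_embeddings` and `cosine_similarity` are hypothetical helpers, and the column names are assumed to match the ones written above.

```python
import ast
import csv
import math


def load_embeddings(csv_path):
    """Read a saved embeddings CSV and return (messages, vectors).

    Assumes the `message` and `message_embedding` columns written above,
    with each embedding serialized as a string like "[0.1, 0.2]".
    """
    messages, vectors = [], []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            messages.append(row["message"])
            # Turn the stored string back into a list of floats.
            vectors.append(ast.literal_eval(row["message_embedding"]))
    return messages, vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

With the notebook's own imports, the parsed lists could instead be converted with `torch.tensor(vectors)` and passed to `util.cos_sim` to get the full pairwise matrix in one call.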