# Notebook Overview
## Goal
* Now that the tweets are separated into chunks of 300 words, we need their embeddings. We used [SBERT](https://www.sbert.net/) to create a dense vector representation of each chunk. In particular, we used the bi-encoder [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2), which maps sentences and paragraphs to a 768-dimensional dense vector space. It was fine-tuned on a dataset of 1B sentence pairs using a self-supervised contrastive learning objective.
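
Dense vectors like these are typically compared with cosine similarity. The sketch below illustrates that comparison with plain NumPy. The helper name `cosine_similarity_matrix` and the random `toy_embeddings` are assumptions for illustration, not part of this notebook or real model output.

```python
import numpy as np

def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Return the cosine similarity between every row of a and every row of b."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T

# Hypothetical stand-ins for chunk embeddings (random values, not model output).
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(4, 768))

similarities = cosine_similarity_matrix(toy_embeddings, toy_embeddings)
print(similarities.shape)  # (4, 4)
```

Each row of the result holds the similarity of one chunk to every other chunk, and the diagonal is 1 because each vector is compared with itself.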

# Imports

In [None]:
!pip install transformers
!pip install sentence-transformers

import transformers
from transformers import AutoConfig, AutoModel, AutoTokenizer
from sentence_transformers import SentenceTransformer, util

import pandas as pd
import numpy as np

import torch
import torch.nn as nn


# Use the preprocessed dataset

In [None]:
df = pd.read_csv("data/100_tweets_explode.csv")
df.shape


(2807, 5)

## Generate the embeddings

In [None]:
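def get_embeddings(queries: list, model_name: str = 'all-mpnet-base-v2'):
  """
  Encode each text chunk as a dense vector (768 dimensions for the default model).
  ----
  queries: list of the tweet chunks we want to encode
  model_name: name of the SentenceTransformer model used to embed the chunks.
  Returns a NumPy array with one row per chunk.
  """

  model = SentenceTransformer(model_name)
  # max_seq_length is counted in tokens (word pieces), not words; longer inputs are truncated
  model.max_seq_length = 300
  print("Start encoding queries")
  # convert_to_tensor=False returns a NumPy array rather than a torch.Tensor
  queries_embedding = model.encode(queries, convert_to_tensor=False)

  return queries_embedding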

In [None]:
# Pass a plain Python list, matching the function's type hint
embeddings = get_embeddings(df["Tweets"].tolist())

Start encoding queries
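
After encoding, it can be worth confirming that the result has one 768-dimensional row per chunk. The following is a minimal sketch of such a check. The helper `check_embeddings` is hypothetical and not part of the original notebook, and the toy array stands in for real embeddings so the example runs without downloading the model.

```python
import numpy as np

def check_embeddings(embeddings: np.ndarray, n_texts: int, dim: int = 768) -> None:
    """Raise ValueError if the embeddings have an unexpected shape or non-finite values."""
    if embeddings.shape != (n_texts, dim):
        raise ValueError(
            f"expected shape {(n_texts, dim)}, got {embeddings.shape}"
        )
    if not np.isfinite(embeddings).all():
        raise ValueError("embeddings contain NaN or infinite values")

# Hypothetical stand-in for the output of get_embeddings on three chunks.
toy = np.zeros((3, 768), dtype=np.float32)
check_embeddings(toy, n_texts=3)  # passes silently
```

In the notebook itself, the equivalent call would presumably be `check_embeddings(embeddings, n_texts=len(df))`.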


In [None]:
# Save the embeddings
torch.save(embeddings, "data/100_embeddings_explode.pt")
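
Because `convert_to_tensor=False` makes `encode` return a NumPy array, `torch.save` is pickling a NumPy object here. Recent PyTorch releases default `torch.load` to `weights_only=True`, which may refuse to unpickle such an object unless `weights_only=False` is passed. One alternative, sketched below under the assumption that a `.npy` file is acceptable, is to round-trip the array with NumPy directly. The file name and the random `toy_embeddings` are placeholders, not the notebook's data.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for the real embeddings (random values, float32).
toy_embeddings = np.random.default_rng(0).normal(size=(5, 768)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_embeddings.npy")
    np.save(path, toy_embeddings)   # write the array in NumPy's .npy format
    restored = np.load(path)        # read it back as an ndarray

print(restored.shape, restored.dtype)  # (5, 768) float32
```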