# Embedding with BERT-based models

## Embedding with SamLowe/roberta-base-go_emotions

This model was fine-tuned from roberta-base on the go_emotions dataset for multi-label classification. Because of that fine-tuning, its embeddings are more tailored to capturing emotional nuance in text.

Load our dataset

In [52]:
import pandas as pd

# Read the CSV file
df_sample = pd.read_csv("echo_reviews/sampled_reviews.csv")

The fine-tuned model scores a sentence against a set of emotion labels. Below is an example of how it works.

In [54]:
from transformers import pipeline

classifier = pipeline(task="text-classification", model="SamLowe/roberta-base-go_emotions", top_k=None)

review = df_sample['Complete_Review'][0]
print(review)

# The classifier returns one list of {label, score} dicts per input text
model_outputs = classifier(review)

print(model_outputs[0])


Four Stars. I love it!!
[{'label': 'love', 'score': 0.9538168907165527}, {'label': 'admiration', 'score': 0.041505444794893265}, {'label': 'approval', 'score': 0.017941292375326157}, {'label': 'joy', 'score': 0.013167016208171844}, {'label': 'gratitude', 'score': 0.011048303917050362}, {'label': 'neutral', 'score': 0.008578701876103878}, {'label': 'optimism', 'score': 0.007808485999703407}, {'label': 'desire', 'score': 0.005459534004330635}, {'label': 'realization', 'score': 0.004758387338370085}, {'label': 'excitement', 'score': 0.004660877864807844}, {'label': 'annoyance', 'score': 0.004308581817895174}, {'label': 'disapproval', 'score': 0.004191291518509388}, {'label': 'disappointment', 'score': 0.003716349834576249}, {'label': 'anger', 'score': 0.0036460391711443663}, {'label': 'caring', 'score': 0.0035526370629668236}, {'label': 'sadness', 'score': 0.0034961311612278223}, {'label': 'amusement', 'score': 0.003221776569262147}, {'label': 'confusion', 'score': 0.002853807993233204}, 
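
Because the model is multi-label, each label gets its own score and the scores do not have to sum to 1, so a review can carry several emotions at once. A minimal sketch of turning the scores into predicted labels, assuming a cutoff of 0.5 (the `threshold` value and the helper name `labels_above` are illustrative choices, not part of the original notebook). The scores are copied from the output above and rounded:

```python
# Scores copied (rounded) from the classifier output above; only the top few labels are shown.
model_output = [
    {"label": "love", "score": 0.9538},
    {"label": "admiration", "score": 0.0415},
    {"label": "approval", "score": 0.0179},
    {"label": "joy", "score": 0.0132},
]


def labels_above(scores, threshold=0.5):
    """Return the labels whose score is at least `threshold` (hypothetical helper)."""
    return [item["label"] for item in scores if item["score"] >= threshold]


predicted = labels_above(model_output)
print(predicted)
```

Lowering the threshold admits weaker emotions as well, so the cutoff is a judgment call that depends on how the labels will be used.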

The following cell extracts the embedding of the review above.

In [55]:
from transformers import RobertaTokenizer, RobertaForSequenceClassification

# Load the model and tokenizer
model = RobertaForSequenceClassification.from_pretrained("SamLowe/roberta-base-go_emotions")
tokenizer = RobertaTokenizer.from_pretrained("SamLowe/roberta-base-go_emotions")

# Take the first review as an example
review = df_sample['Complete_Review'][0]

# Tokenize the input
inputs = tokenizer(review, return_tensors="pt")

# Get the model's output. Set output_hidden_states=True to get all hidden states
outputs = model(**inputs, output_hidden_states=True)

# All hidden states are returned. The last hidden state is the one we typically use for tasks.
last_hidden_state = outputs.hidden_states[-1]

# The first token in RoBERTa-based models (<s>) plays the role of BERT's [CLS] token.
# Its final hidden state can be used as a sentence representation.
review_representation = last_hidden_state[0, 0, :]

print(review_representation)


tensor([-1.3823e-01,  3.6221e-01,  1.8394e-01, -1.3743e+00,  4.9657e-01,
        -7.4158e-01,  6.0812e-01,  1.0362e+00,  1.4110e+00, -4.2896e-01,
         4.4659e-01,  2.3940e-01,  8.3547e-01, -1.1596e+00, -6.7594e-01,
        -2.1210e-02,  3.3588e-01,  1.7358e+00, -1.3255e+00,  1.3773e+00,
         5.3103e-01,  2.9124e-01,  9.5056e-01,  6.2420e-01, -7.7635e-01,
        -4.0170e-02, -4.6556e-01,  2.6299e-01, -2.1238e+00, -4.0225e-01,
         9.0090e-01, -5.2204e-01, -7.6936e-02,  9.6186e-01, -9.0079e-01,
        -1.0949e+00, -1.4532e-02,  7.3689e-01, -2.1115e-01, -1.5729e-02,
        -8.3382e-01, -9.5083e-01,  2.2123e-01,  1.0158e+00,  1.1403e-01,
         1.0546e+00,  3.2027e-01, -4.3995e-01,  7.2140e-01,  1.1777e-01,
         7.6089e-01, -2.7956e-01,  3.2243e-01, -3.8173e-01, -3.2758e-01,
        -4.8508e-01, -1.9553e-01, -1.8071e-01,  4.3836e-01, -5.5792e-01,
         4.1648e-01, -8.6362e-01,  1.9303e-01,  2.2992e+00,  8.0229e-03,
         2.1143e-01,  3.1020e-02, -3.9709e-01,  8.2
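
The indexing `last_hidden_state[0, 0, :]` picks the first review in the batch, then its first token, then every hidden dimension. A small sketch with nested Python lists standing in for the tensor (the toy values and the hidden size of 3 are made up for illustration; in roberta-base the hidden size is 768):

```python
# Toy stand-in for a tensor of shape (batch_size, sequence_length, hidden_size).
# The numbers are invented; a real last_hidden_state holds one vector per token.
last_hidden_state = [
    [  # review 0
        [0.1, 0.2, 0.3],  # token 0: <s>, the start-of-sequence token
        [0.4, 0.5, 0.6],  # token 1
        [0.7, 0.8, 0.9],  # token 2
    ],
]

batch_size = len(last_hidden_state)
sequence_length = len(last_hidden_state[0])
hidden_size = len(last_hidden_state[0][0])

# Equivalent of last_hidden_state[0, 0, :] on a tensor
review_representation = last_hidden_state[0][0][:]
print((batch_size, sequence_length, hidden_size), review_representation)
```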

In [87]:
import numpy as np

# review_representation is a PyTorch tensor (from the previous cell)

# Detach it from the autograd graph and convert it to a NumPy array
array_representation = review_representation.detach().numpy()
# Equivalent list form, used later when storing embeddings in the DataFrame
list_representation = list(array_representation)

print(array_representation)

[-1.38225079e-01  3.62206638e-01  1.83940262e-01 -1.37426960e+00
  4.96574461e-01 -7.41584361e-01  6.08121932e-01  1.03622627e+00
  1.41100073e+00 -4.28964317e-01  4.46586818e-01  2.39404887e-01
  8.35470796e-01 -1.15960956e+00 -6.75941229e-01 -2.12101340e-02
  3.35880309e-01  1.73577917e+00 -1.32552814e+00  1.37733841e+00
  5.31030357e-01  2.91242033e-01  9.50560451e-01  6.24198318e-01
 -7.76348412e-01 -4.01701219e-02 -4.65562940e-01  2.62993217e-01
 -2.12377191e+00 -4.02246326e-01  9.00901973e-01 -5.22041678e-01
 -7.69360140e-02  9.61862266e-01 -9.00790453e-01 -1.09492671e+00
 -1.45323947e-02  7.36894548e-01 -2.11145639e-01 -1.57288872e-02
 -8.33824456e-01 -9.50830877e-01  2.21226618e-01  1.01582694e+00
  1.14025995e-01  1.05458546e+00  3.20267677e-01 -4.39949870e-01
  7.21400738e-01  1.17771119e-01  7.60893166e-01 -2.79555500e-01
  3.22426558e-01 -3.81728441e-01 -3.27578634e-01 -4.85077411e-01
 -1.95531309e-01 -1.80705726e-01  4.38362211e-01 -5.57915807e-01
  4.16484654e-01 -8.63615

Now let's do this for every review. A review that raises an error is stored as NaN so it can be dropped later.

In [88]:
def amazon_review_classifier(review):
    try:
        # Tokenize the input
        inputs = tokenizer(review, return_tensors="pt")

        # Get the model's output. Set output_hidden_states=True to get all hidden states
        outputs = model(**inputs, output_hidden_states=True)

        # All hidden states are returned. The last hidden state is the one we typically use for tasks.
        last_hidden_state = outputs.hidden_states[-1]

        # The first token can be used as a sentence representation.
        sentence_representation = last_hidden_state[0, 0, :]

        # Convert the tensor to a numpy array and then to a list
        list_representation = list(sentence_representation.detach().numpy())
        
        return list_representation
    except Exception as e:
        print(f"Error processing review: {e}")  # Optional: print the error for debugging
        return np.nan


In [89]:
from transformers import RobertaTokenizer, RobertaForSequenceClassification
from tqdm import tqdm


# Load the model and tokenizer
model = RobertaForSequenceClassification.from_pretrained("SamLowe/roberta-base-go_emotions")
tokenizer = RobertaTokenizer.from_pretrained("SamLowe/roberta-base-go_emotions")


# Get embeddings for the "Complete_Review" column with a tqdm progress bar
df_sample['embedding'] = [amazon_review_classifier(review) for review in tqdm(df_sample['Complete_Review'], desc="Processing reviews")]



Processing reviews:   1%|▏         | 70/5000 [00:14<18:14,  4.50it/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (628 > 512). Running this sequence through the model will result in indexing errors


Error processing review: The expanded size of the tensor (628) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 628].  Tensor sizes: [1, 514]


Processing reviews:   2%|▏         | 110/5000 [00:25<37:18,  2.18it/s]

Error processing review: The expanded size of the tensor (678) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 678].  Tensor sizes: [1, 514]


Processing reviews:   3%|▎         | 147/5000 [00:34<18:06,  4.47it/s]

Error processing review: The expanded size of the tensor (1406) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 1406].  Tensor sizes: [1, 514]


Processing reviews:   3%|▎         | 153/5000 [00:34<09:34,  8.44it/s]

Error processing review: The expanded size of the tensor (660) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 660].  Tensor sizes: [1, 514]


Processing reviews:  10%|▉         | 480/5000 [01:55<16:58,  4.44it/s]

Error processing review: The expanded size of the tensor (585) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 585].  Tensor sizes: [1, 514]


Processing reviews:  15%|█▌        | 750/5000 [02:58<26:19,  2.69it/s]

Error processing review: The expanded size of the tensor (578) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 578].  Tensor sizes: [1, 514]


Processing reviews:  15%|█▌        | 773/5000 [03:04<13:05,  5.38it/s]

Error processing review: The expanded size of the tensor (694) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 694].  Tensor sizes: [1, 514]


Processing reviews:  16%|█▌        | 788/5000 [03:07<10:17,  6.83it/s]

Error processing review: The expanded size of the tensor (700) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 700].  Tensor sizes: [1, 514]


Processing reviews:  19%|█▉        | 943/5000 [03:42<17:04,  3.96it/s]

Error processing review: The expanded size of the tensor (576) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 576].  Tensor sizes: [1, 514]


Processing reviews:  20%|█▉        | 975/5000 [03:50<07:58,  8.41it/s]

Error processing review: The expanded size of the tensor (672) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 672].  Tensor sizes: [1, 514]


Processing reviews:  20%|██        | 1016/5000 [03:58<11:39,  5.70it/s]

Error processing review: The expanded size of the tensor (844) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 844].  Tensor sizes: [1, 514]


Processing reviews:  34%|███▍      | 1705/5000 [06:34<13:06,  4.19it/s]

Error processing review: The expanded size of the tensor (965) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 965].  Tensor sizes: [1, 514]


Processing reviews:  36%|███▌      | 1791/5000 [06:50<07:44,  6.90it/s]

Error processing review: The expanded size of the tensor (581) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 581].  Tensor sizes: [1, 514]


Processing reviews:  38%|███▊      | 1920/5000 [07:22<16:33,  3.10it/s]

Error processing review: The expanded size of the tensor (794) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 794].  Tensor sizes: [1, 514]


Processing reviews:  45%|████▌     | 2266/5000 [08:46<08:40,  5.25it/s]

Error processing review: The expanded size of the tensor (858) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 858].  Tensor sizes: [1, 514]


Processing reviews:  47%|████▋     | 2354/5000 [09:06<09:51,  4.47it/s]

Error processing review: The expanded size of the tensor (680) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 680].  Tensor sizes: [1, 514]


Processing reviews:  48%|████▊     | 2378/5000 [09:12<06:56,  6.30it/s]

Error processing review: The expanded size of the tensor (840) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 840].  Tensor sizes: [1, 514]


Processing reviews:  48%|████▊     | 2392/5000 [09:20<16:08,  2.69it/s]

Error processing review: The expanded size of the tensor (521) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 521].  Tensor sizes: [1, 514]


Processing reviews:  56%|█████▌    | 2799/5000 [11:05<03:52,  9.47it/s]

Error processing review: The expanded size of the tensor (793) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 793].  Tensor sizes: [1, 514]


Processing reviews:  56%|█████▌    | 2804/5000 [11:06<05:04,  7.22it/s]

Error processing review: The expanded size of the tensor (530) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 530].  Tensor sizes: [1, 514]


Processing reviews:  60%|█████▉    | 2978/5000 [11:50<07:44,  4.36it/s]

Error processing review: The expanded size of the tensor (613) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 613].  Tensor sizes: [1, 514]


Processing reviews:  60%|██████    | 3010/5000 [11:58<05:56,  5.58it/s]

Error processing review: The expanded size of the tensor (549) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 549].  Tensor sizes: [1, 514]


Processing reviews:  61%|██████▏   | 3064/5000 [12:13<03:50,  8.39it/s]

Error processing review: The expanded size of the tensor (947) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 947].  Tensor sizes: [1, 514]


Processing reviews:  63%|██████▎   | 3173/5000 [12:41<04:56,  6.15it/s]

Error processing review: The expanded size of the tensor (695) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 695].  Tensor sizes: [1, 514]


Processing reviews:  64%|██████▍   | 3203/5000 [12:47<06:37,  4.53it/s]

Error processing review: The expanded size of the tensor (1654) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 1654].  Tensor sizes: [1, 514]


Processing reviews:  68%|██████▊   | 3413/5000 [13:38<03:18,  7.99it/s]

Error processing review: The expanded size of the tensor (526) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 526].  Tensor sizes: [1, 514]


Processing reviews:  71%|███████   | 3551/5000 [14:09<04:54,  4.91it/s]

Error processing review: The expanded size of the tensor (1224) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 1224].  Tensor sizes: [1, 514]


Processing reviews:  72%|███████▎  | 3625/5000 [14:30<03:49,  5.99it/s]

Error processing review: The expanded size of the tensor (523) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 523].  Tensor sizes: [1, 514]


Processing reviews:  74%|███████▍  | 3713/5000 [14:48<02:10,  9.87it/s]

Error processing review: The expanded size of the tensor (643) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 643].  Tensor sizes: [1, 514]


Processing reviews:  75%|███████▍  | 3731/5000 [14:53<05:38,  3.75it/s]

Error processing review: The expanded size of the tensor (2393) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 2393].  Tensor sizes: [1, 514]


Processing reviews:  79%|███████▉  | 3965/5000 [15:42<04:05,  4.22it/s]

Error processing review: The expanded size of the tensor (586) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 586].  Tensor sizes: [1, 514]


Processing reviews:  81%|████████  | 4037/5000 [16:00<03:37,  4.43it/s]

Error processing review: The expanded size of the tensor (545) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 545].  Tensor sizes: [1, 514]


Processing reviews:  88%|████████▊ | 4403/5000 [20:54<00:50, 11.84it/s]  

Error processing review: The expanded size of the tensor (665) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 665].  Tensor sizes: [1, 514]


Processing reviews:  96%|█████████▌| 4805/5000 [21:51<00:15, 12.48it/s]

Error processing review: The expanded size of the tensor (645) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 645].  Tensor sizes: [1, 514]


Processing reviews:  97%|█████████▋| 4850/5000 [21:56<00:12, 12.27it/s]

Error processing review: The expanded size of the tensor (1010) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 1010].  Tensor sizes: [1, 514]


Processing reviews:  97%|█████████▋| 4854/5000 [21:57<00:13, 10.86it/s]

Error processing review: The expanded size of the tensor (683) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 683].  Tensor sizes: [1, 514]


Processing reviews: 100%|██████████| 5000/5000 [22:17<00:00,  3.74it/s]
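
The errors above come from reviews that tokenize to more than the 512 tokens the model accepts. One common alternative to skipping them is to truncate long inputs, which the Hugging Face tokenizer supports by passing `truncation=True, max_length=512` in the call. The idea, sketched in plain Python on a list of token ids (the helper `truncate_ids` is hypothetical, and the ids 0 and 2 for `<s>` and `</s>` are assumed from the RoBERTa vocabulary):

```python
def truncate_ids(token_ids, max_length=512, bos_id=0, eos_id=2):
    """Keep at most `max_length` ids, preserving the start and end tokens.

    Hypothetical helper: `token_ids` is assumed to already start with
    `bos_id` and end with `eos_id`, as a RoBERTa tokenizer would produce.
    """
    if len(token_ids) <= max_length:
        return list(token_ids)
    # Keep the start token and the first (max_length - 2) content tokens,
    # then close the sequence with the end token.
    return [bos_id] + list(token_ids[1:max_length - 1]) + [eos_id]


# A fake 628-token review, matching the length reported in the first error above
long_review_ids = [0] + list(range(100, 726)) + [2]
truncated = truncate_ids(long_review_ids)
print(len(long_review_ids), len(truncated))
```

Truncation keeps the review in the dataset at the cost of ignoring its tail, so it is a trade-off rather than a strict improvement.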


In [82]:
df_sample

Unnamed: 0,Uniq Id,Pageurl,Complete_Review,Rating out of 5,embedding
0,afffe2013b28b4f315309524e70fbd0a,https://www.amazon.com/All-New-Amazon-Echo-Dot...,Four Stars. I love it!!,4.0,[-1.38225079e-01 3.62206638e-01 1.83940262e-...
1,a76e8f02c13607341913a6f9621272b2,https://www.amazon.com/All-New-Amazon-Echo-Dot...,I replaced Siri with Alexa... I bought family ...,3.0,[-3.62012237e-01 3.03088009e-01 6.12026572e-...
2,95989b4f7b36af92938eb1ff8b47bf75,https://www.amazon.com/All-New-Amazon-Echo-Dot...,It was cool while it lasted. It was cool while...,2.0,[-6.36660695e-01 -2.91785002e-01 -5.31184971e-...
3,f4d2291f4ee22563e68f7f84cdca3823,https://www.amazon.com/All-New-Amazon-Echo-Dot...,Love using. Second one I have. Love using them,5.0,[-1.04730457e-01 3.72939408e-01 3.21483836e-...
4,42edfc6f411a7527e680c4f139dc14d3,https://www.amazon.com/All-New-Amazon-Echo-Dot...,Five Stars. love it,5.0,[-2.47140765e-01 4.46669042e-01 9.36792642e-...
...,...,...,...,...,...
4959,2db2237f8a732e7df9bbe3043f03042b,https://www.amazon.com/All-New-Amazon-Echo-Dot...,"Not a good portable/satellite speaker, but ser...",3.0,[-6.51395679e-01 2.82417059e-01 -3.09159815e-...
4960,6424dfbd428b73d882ecb59e2486ae92,https://www.amazon.com/All-New-Amazon-Echo-Dot...,Two Stars. Sound not as great for playing musi...,2.0,[-0.20475331 -0.3199976 0.28622904 0.145005...
4961,3c42ff782a7fc594bd495869b0cefff3,https://www.amazon.com/All-New-Amazon-Echo-Dot...,do love asking Alexa for jokes. I wanted to be...,3.0,[ 2.30700910e-01 4.09763724e-01 1.03990448e+...
4962,133d25ebfd7c57076c73c82661afd09c,https://www.amazon.com/All-New-Amazon-Echo-Dot...,Luckily it was free with my dish. I talk to it...,1.0,[-1.04902041e+00 -4.16988969e-01 -5.14964461e-...


With the embeddings stored as lists rather than arrays, drop the reviews that could not be embedded and save the result to CSV.

In [92]:
import pandas as pd

df_sample_final = df_sample.dropna(subset=['embedding'])

# Save to CSV
df_sample_final.to_csv('embeddings/sample_embed_roberta_base_go_emotions_reviews.csv', index=False)


In [93]:
df_sample_final = pd.read_csv('embeddings/sample_embed_roberta_base_go_emotions_reviews.csv')
df_sample_final

Unnamed: 0,Uniq Id,Pageurl,Complete_Review,Rating out of 5,embedding
0,afffe2013b28b4f315309524e70fbd0a,https://www.amazon.com/All-New-Amazon-Echo-Dot...,Four Stars. I love it!!,4.0,"[-0.13822508, 0.36220664, 0.18394026, -1.37426..."
1,a76e8f02c13607341913a6f9621272b2,https://www.amazon.com/All-New-Amazon-Echo-Dot...,I replaced Siri with Alexa... I bought family ...,3.0,"[-0.36201224, 0.303088, 0.6120266, -0.43149918..."
2,95989b4f7b36af92938eb1ff8b47bf75,https://www.amazon.com/All-New-Amazon-Echo-Dot...,It was cool while it lasted. It was cool while...,2.0,"[-0.6366607, -0.291785, -0.531185, 0.9993306, ..."
3,f4d2291f4ee22563e68f7f84cdca3823,https://www.amazon.com/All-New-Amazon-Echo-Dot...,Love using. Second one I have. Love using them,5.0,"[-0.10473046, 0.3729394, 0.032148384, -1.40663..."
4,42edfc6f411a7527e680c4f139dc14d3,https://www.amazon.com/All-New-Amazon-Echo-Dot...,Five Stars. love it,5.0,"[-0.24714077, 0.44666904, 0.093679264, -1.0492..."
...,...,...,...,...,...
4959,2db2237f8a732e7df9bbe3043f03042b,https://www.amazon.com/All-New-Amazon-Echo-Dot...,"Not a good portable/satellite speaker, but ser...",3.0,"[-0.6513957, 0.28241706, -0.30915982, -0.61445..."
4960,6424dfbd428b73d882ecb59e2486ae92,https://www.amazon.com/All-New-Amazon-Echo-Dot...,Two Stars. Sound not as great for playing musi...,2.0,"[-0.20475331, -0.3199976, 0.28622904, 0.145005..."
4961,3c42ff782a7fc594bd495869b0cefff3,https://www.amazon.com/All-New-Amazon-Echo-Dot...,do love asking Alexa for jokes. I wanted to be...,3.0,"[0.23070091, 0.40976372, 1.0399045, -0.5402973..."
4962,133d25ebfd7c57076c73c82661afd09c,https://www.amazon.com/All-New-Amazon-Echo-Dot...,Luckily it was free with my dish. I talk to it...,1.0,"[-1.0490204, -0.41698897, -0.51496446, 0.40294..."
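
After the round trip through CSV, each embedding comes back as a string such as `"[-0.13822508, 0.36220664, ...]"` rather than a list of numbers. A minimal sketch of converting such a string back into floats with the standard library (the helper name `parse_embedding` is illustrative, and the sample string uses only the first three values shown above):

```python
import ast


def parse_embedding(text):
    """Convert a stringified list like "[0.1, 0.2]" back into a list of floats."""
    values = ast.literal_eval(text)
    return [float(v) for v in values]


sample = "[-0.13822508, 0.36220664, 0.18394026]"
embedding = parse_embedding(sample)
print(embedding)
```

Assuming the column holds strings in this form, the same function could be applied to the whole column with `df_sample_final['embedding'].apply(parse_embedding)`.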


## Embedding with roberta-base

The base RoBERTa model provides more general-purpose embeddings than the fine-tuned version above.

In [100]:
# Read the CSV file
df_sample_roberta_base = pd.read_csv("echo_reviews/sampled_reviews.csv")

In [101]:
def amazon_review_classifier(review):
    try:
        # Tokenize the input
        inputs = tokenizer(review, return_tensors="pt")

        # Get the model's output. Set output_hidden_states=True to get all hidden states
        outputs = model(**inputs, output_hidden_states=True)

        # All hidden states are returned. The last hidden state is the one we typically use for tasks.
        last_hidden_state = outputs.hidden_states[-1]

        # The first token can be used as a sentence representation.
        sentence_representation = last_hidden_state[0, 0, :]

        # Convert the tensor to a numpy array and then to a list
        list_representation = list(sentence_representation.detach().numpy())
        
        return list_representation
    except Exception as e:
        print(f"Error processing review: {e}")  # Optional: print the error for debugging
        return np.nan


In [102]:
from transformers import RobertaTokenizer, RobertaForSequenceClassification
from tqdm import tqdm


# Load the model and tokenizer
model = RobertaForSequenceClassification.from_pretrained("roberta-base")
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")


# Get embeddings for the "Complete_Review" column with a tqdm progress bar
df_sample_roberta_base['embedding'] = [amazon_review_classifier(review) for review in tqdm(df_sample_roberta_base['Complete_Review'], desc="Processing reviews")]



Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.weight', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.dense.bias']
You should pr

Error processing review: The expanded size of the tensor (628) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 628].  Tensor sizes: [1, 514]


Processing reviews:   2%|▏         | 112/5000 [00:28<23:04,  3.53it/s]

Error processing review: The expanded size of the tensor (678) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 678].  Tensor sizes: [1, 514]


Processing reviews:   3%|▎         | 147/5000 [00:38<19:05,  4.24it/s]

Error processing review: The expanded size of the tensor (1406) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 1406].  Tensor sizes: [1, 514]


Processing reviews:   3%|▎         | 152/5000 [00:38<12:38,  6.39it/s]

Error processing review: The expanded size of the tensor (660) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 660].  Tensor sizes: [1, 514]


Processing reviews:  10%|▉         | 480/5000 [02:09<19:08,  3.93it/s]

Error processing review: The expanded size of the tensor (585) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 585].  Tensor sizes: [1, 514]


Processing reviews:  15%|█▌        | 751/5000 [03:20<22:46,  3.11it/s]

Error processing review: The expanded size of the tensor (578) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 578].  Tensor sizes: [1, 514]


Processing reviews:  15%|█▌        | 773/5000 [03:26<15:39,  4.50it/s]

Error processing review: The expanded size of the tensor (694) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 694].  Tensor sizes: [1, 514]


Processing reviews:  16%|█▌        | 789/5000 [03:30<10:53,  6.44it/s]

Error processing review: The expanded size of the tensor (700) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 700].  Tensor sizes: [1, 514]


Processing reviews:  19%|█▉        | 941/5000 [04:08<27:40,  2.44it/s]

Error processing review: The expanded size of the tensor (576) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 576].  Tensor sizes: [1, 514]


Processing reviews:  19%|█▉        | 974/5000 [04:17<09:28,  7.09it/s]

Error processing review: The expanded size of the tensor (672) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 672].  Tensor sizes: [1, 514]


Processing reviews:  20%|██        | 1016/5000 [04:26<13:51,  4.79it/s]

Error processing review: The expanded size of the tensor (844) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 844].  Tensor sizes: [1, 514]


Processing reviews:  34%|███▍      | 1707/5000 [06:26<06:51,  8.01it/s]

Error processing review: The expanded size of the tensor (965) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 965].  Tensor sizes: [1, 514]


Processing reviews:  36%|███▌      | 1789/5000 [06:36<07:05,  7.55it/s]

Error processing review: The expanded size of the tensor (581) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 581].  Tensor sizes: [1, 514]


Processing reviews:  38%|███▊      | 1920/5000 [06:56<09:48,  5.24it/s]

Error processing review: The expanded size of the tensor (794) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 794].  Tensor sizes: [1, 514]


Processing reviews:  45%|████▌     | 2266/5000 [07:50<05:30,  8.28it/s]

Error processing review: The expanded size of the tensor (858) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 858].  Tensor sizes: [1, 514]


Processing reviews:  47%|████▋     | 2352/5000 [08:02<09:37,  4.58it/s]

Error processing review: The expanded size of the tensor (680) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 680].  Tensor sizes: [1, 514]


Processing reviews:  48%|████▊     | 2377/5000 [08:06<05:00,  8.73it/s]

Error processing review: The expanded size of the tensor (840) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 840].  Tensor sizes: [1, 514]


Processing reviews:  48%|████▊     | 2393/5000 [08:11<08:49,  4.92it/s]

Error processing review: The expanded size of the tensor (521) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 521].  Tensor sizes: [1, 514]


Processing reviews:  56%|█████▌    | 2798/5000 [09:14<03:08, 11.66it/s]

Error processing review: The expanded size of the tensor (793) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 793].  Tensor sizes: [1, 514]


Processing reviews:  56%|█████▌    | 2807/5000 [09:15<02:47, 13.08it/s]

Error processing review: The expanded size of the tensor (530) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 530].  Tensor sizes: [1, 514]


Processing reviews:  60%|█████▉    | 2978/5000 [09:42<05:18,  6.35it/s]

Error processing review: The expanded size of the tensor (613) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 613].  Tensor sizes: [1, 514]


Processing reviews:  60%|██████    | 3012/5000 [09:47<03:51,  8.61it/s]

Error processing review: The expanded size of the tensor (549) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 549].  Tensor sizes: [1, 514]


Processing reviews:  61%|██████▏   | 3064/5000 [09:56<02:42, 11.93it/s]

Error processing review: The expanded size of the tensor (947) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 947].  Tensor sizes: [1, 514]


Processing reviews:  64%|██████▎   | 3176/5000 [10:13<02:58, 10.23it/s]

Error processing review: The expanded size of the tensor (695) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 695].  Tensor sizes: [1, 514]


Processing reviews:  64%|██████▍   | 3205/5000 [10:17<03:59,  7.50it/s]

Error processing review: The expanded size of the tensor (1654) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 1654].  Tensor sizes: [1, 514]


Processing reviews:  68%|██████▊   | 3414/5000 [10:50<02:22, 11.10it/s]

Error processing review: The expanded size of the tensor (526) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 526].  Tensor sizes: [1, 514]


Processing reviews:  71%|███████   | 3549/5000 [11:10<04:49,  5.01it/s]

Error processing review: The expanded size of the tensor (1224) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 1224].  Tensor sizes: [1, 514]


Processing reviews:  72%|███████▎  | 3625/5000 [11:23<02:36,  8.76it/s]

Error processing review: The expanded size of the tensor (523) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 523].  Tensor sizes: [1, 514]


Processing reviews:  74%|███████▍  | 3713/5000 [11:34<01:41, 12.65it/s]

Error processing review: The expanded size of the tensor (643) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 643].  Tensor sizes: [1, 514]


Processing reviews:  75%|███████▍  | 3733/5000 [11:38<03:08,  6.74it/s]

Error processing review: The expanded size of the tensor (2393) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 2393].  Tensor sizes: [1, 514]


Processing reviews:  79%|███████▉  | 3965/5000 [12:09<02:35,  6.65it/s]

Error processing review: The expanded size of the tensor (586) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 586].  Tensor sizes: [1, 514]


Processing reviews:  81%|████████  | 4037/5000 [12:21<02:22,  6.75it/s]

Error processing review: The expanded size of the tensor (545) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 545].  Tensor sizes: [1, 514]


Processing reviews:  88%|████████▊ | 4403/5000 [13:16<00:53, 11.14it/s]

Error processing review: The expanded size of the tensor (665) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 665].  Tensor sizes: [1, 514]


Processing reviews:  96%|█████████▌| 4805/5000 [14:17<00:19,  9.94it/s]

Error processing review: The expanded size of the tensor (645) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 645].  Tensor sizes: [1, 514]


Processing reviews:  97%|█████████▋| 4849/5000 [14:23<00:14, 10.62it/s]

Error processing review: The expanded size of the tensor (1010) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 1010].  Tensor sizes: [1, 514]


Processing reviews:  97%|█████████▋| 4856/5000 [14:24<00:15,  9.09it/s]

Error processing review: The expanded size of the tensor (683) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [1, 683].  Tensor sizes: [1, 514]


Processing reviews: 100%|██████████| 5000/5000 [14:45<00:00,  5.64it/s]


In [103]:
import pandas as pd

df_sample_roberta_base_final = df_sample_roberta_base.dropna(subset=['embedding'])

# Save to CSV
df_sample_roberta_base_final.to_csv('embeddings/sample_embed_roberta_base_reviews.csv', index=False)


In [104]:
df_sample_roberta_base_final

Unnamed: 0,Uniq Id,Pageurl,Complete_Review,Rating out of 5,embedding
0,afffe2013b28b4f315309524e70fbd0a,https://www.amazon.com/All-New-Amazon-Echo-Dot...,Four Stars. I love it!!,4.0,"[-0.077856794, 0.09048438, -0.022385973, -0.10..."
1,a76e8f02c13607341913a6f9621272b2,https://www.amazon.com/All-New-Amazon-Echo-Dot...,I replaced Siri with Alexa... I bought family ...,3.0,"[-0.06333335, 0.09018205, -0.044614114, -0.153..."
2,95989b4f7b36af92938eb1ff8b47bf75,https://www.amazon.com/All-New-Amazon-Echo-Dot...,It was cool while it lasted. It was cool while...,2.0,"[-0.12731871, 0.1091575, -0.030448508, -0.1625..."
3,f4d2291f4ee22563e68f7f84cdca3823,https://www.amazon.com/All-New-Amazon-Echo-Dot...,Love using. Second one I have. Love using them,5.0,"[-0.06350201, 0.06525059, -0.026214935, -0.104..."
4,42edfc6f411a7527e680c4f139dc14d3,https://www.amazon.com/All-New-Amazon-Echo-Dot...,Five Stars. love it,5.0,"[-0.06970922, 0.08771731, -0.020580553, -0.107..."
...,...,...,...,...,...
4995,2db2237f8a732e7df9bbe3043f03042b,https://www.amazon.com/All-New-Amazon-Echo-Dot...,"Not a good portable/satellite speaker, but ser...",3.0,"[-0.09142549, 0.1281499, -0.037414804, -0.1301..."
4996,6424dfbd428b73d882ecb59e2486ae92,https://www.amazon.com/All-New-Amazon-Echo-Dot...,Two Stars. Sound not as great for playing musi...,2.0,"[-0.06628171, 0.112053216, -0.0050543044, -0.1..."
4997,3c42ff782a7fc594bd495869b0cefff3,https://www.amazon.com/All-New-Amazon-Echo-Dot...,do love asking Alexa for jokes. I wanted to be...,3.0,"[-0.09986623, 0.129924, -0.0419501, -0.1332095..."
4998,133d25ebfd7c57076c73c82661afd09c,https://www.amazon.com/All-New-Amazon-Echo-Dot...,Luckily it was free with my dish. I talk to it...,1.0,"[-0.09662871, 0.114080794, -0.047377393, -0.10..."
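
One way to explore how the two embedding spaces differ is to compare pairs of reviews by cosine similarity. A minimal pure-Python sketch (the function name `cosine_similarity` and the choice of metric are assumptions for illustration, not something the notebook computes). The vectors use only the first three components shown in the table above, so the resulting number is not meaningful on its own:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# First three components of the roberta-base embeddings for rows 0 and 4
review_0 = [-0.077856794, 0.09048438, -0.022385973]
review_4 = [-0.06970922, 0.08771731, -0.020580553]
similarity = cosine_similarity(review_0, review_4)
print(similarity)
```

With full 768-dimensional vectors, the same function could be run on the embeddings from both models to see which one separates reviews with different ratings more clearly.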
