# **Make Models**

* import pretrained GloVe
* make word counts for reddit data $\to$ `word_frequencies_reddit.csv`
* finetune with reddit data from `reddit_csv_path`
* save embedding vectors to `embedding_files`

 
## Fine Tuned Model with GloVe
## Self Trained Model

In [1]:
!pip install numpy==1.25.2
!pip install packaging==22.0
!pip install shapely==2.0.1
!pip install scipy==1.14.0
!pip install gensim

Collecting numpy==1.25.2
  Using cached numpy-1.25.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Using cached numpy-1.25.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.24.4
    Uninstalling numpy-1.24.4:
      Successfully uninstalled numpy-1.24.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
numba 0.57.1 requires numpy<1.25,>=1.21, but you have numpy 1.25.2 which is incompatible.
Successfully installed numpy-1.25.2
Collecting packaging==22.0
  Using cached packaging-22.0-py3-none-any.whl.metadata (3.1 kB)
Using cached packaging-22.0-py3-none-any.whl (42 kB)
Installing collected packages: packaging
  Attempting uninstall: packaging
    Found existing installation: packaging 23.2


## **Packages and file paths**

In [2]:
import gensim
from gensim.models import KeyedVectors
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
import json
import pandas as pd
import numpy as np
import ast
from tqdm import tqdm
from collections import Counter

In [3]:
# set file paths 

# base glove
glove_path_zip = "embedding_files/glove.42B.300d.zip"
glove_path = "embedding_files/glove.42B.300d.txt"

# glove with dimensions
glove_with_dimensions_path = "embedding_files/glove_with_dimensions.txt"
fixed_glove_with_dimensions_path = "embedding_files/glove_with_dimensions.txt"


# reddit tokens for fine tuning
reddit_txt_path = '../../preprocess_outputs/tokens_reddit.txt'
reddit_csv_path = "../../preprocess_outputs/tokens_reddit.csv"


# NAIVE PATHS (no word lim border no window)


## **GloVe embeddings**

#### Import base GloVe

In [4]:
#embeddings_dict = {}
#with open(glove_path, 'r') as f:
#    for line in f:
#        values = line.split()
#        word = values[0]
#        vector = np.asarray(values[1:], "float32")
#        embeddings_dict[word] = vector

#embeddings_dict

In [5]:
#embeddings_dict

#### Add dimensions to the embeddings file

In [6]:
# write glove with dimensions to txt

#num_embeddings = len(embeddings_dict.keys())
#num_dimensions = len(next(iter(embeddings_dict.values())))

#with open(glove_with_dimensions_path, 'w', encoding="utf-8") as f:
#    f.write(f"{num_embeddings} {num_dimensions}\n")
    
#    for word, vector in embeddings_dict.items():
#        vector_str = " ".join(map(str, vector))
#        f.write(f"{word} {vector_str}\n")
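The header step above holds all embeddings in a dictionary before writing them out again. A minimal sketch of an alternative, assuming the only goal is to prepend the `<rows> <dims>` line: stream the file twice and never keep the vectors in memory. The function name `add_word2vec_header` and the toy file are made up for illustration.

```python
import os
import tempfile


def add_word2vec_header(src_path, dst_path):
    """Copy a headerless GloVe text file to dst_path with a '<rows> <dims>' first line."""
    num_rows = 0
    num_dims = 0
    # First pass: count rows and read the dimensionality off the first row.
    with open(src_path, "r", encoding="utf-8") as src:
        for line in src:
            if num_rows == 0:
                num_dims = len(line.rstrip("\n").split(" ")) - 1
            num_rows += 1
    # Second pass: write the header, then stream the rows through unchanged.
    with open(src_path, "r", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(f"{num_rows} {num_dims}\n")
        for line in src:
            dst.write(line)
    return num_rows, num_dims


# Tiny made-up file standing in for the real GloVe file
tmp_dir = tempfile.mkdtemp()
toy_src = os.path.join(tmp_dir, "toy_glove.txt")
toy_dst = os.path.join(tmp_dir, "toy_glove_with_dimensions.txt")
with open(toy_src, "w", encoding="utf-8") as f:
    f.write("the 0.1 0.2 0.3\n, 0.4 0.5 0.6\n")

header = add_word2vec_header(toy_src, toy_dst)
with open(toy_dst, "r", encoding="utf-8") as f:
    first_line = f.readline().strip()
print(header, first_line)
```

Because the rows are copied verbatim, the vectors are not re-serialized through `str()`, so their text representation stays exactly as in the source file.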


#### Import file with dimensions

In [4]:
embeddings_dict = {}

with open(fixed_glove_with_dimensions_path, 'r', encoding="utf-8") as f:

    first_line = f.readline().strip()  # dimensions
    num_embeddings, num_dimensions = map(int, first_line.split())

    # Check by printing the next 2 rows.
    # Note: these rows are consumed here, so the loop below never adds them to embeddings_dict.
    print(f"Dimensions: {first_line}")
    for _ in range(2):
        print(f.readline().strip())

    # read file
    for line in f:
        values = line.strip().split()
        word = values[0]  # First element is the word
        vector = np.asarray(values[1:], dtype="float32")  # Convert vector values to float
        embeddings_dict[word] = vector

    


Dimensions: 1917494 300
, 0.18378 -0.12123 -0.11987 0.015227 -0.19121 -0.066074 -2.9876 0.80795 0.067338 -0.13184 -0.5274 0.44521 0.12982 -0.21823 -0.4508 -0.22478 -0.30766 -0.11137 -0.162 -0.21294 -0.46022 -0.086593 -0.24902 0.46729 -0.6023 -0.44972 0.43946 0.014738 0.27498 -0.078421 0.36009 0.12172 0.4298 -0.055345 0.4495 -0.74444 -0.26702 0.16431 -0.19335 0.13468 0.2887 0.23924 -0.23579 -0.28972 0.20149 0.048135 -0.18322 -0.15492 -0.19255 0.40271 0.16051 0.17721 0.32557 0.011625 -0.42572 0.34205 -0.45865 -0.2486 0.034128 0.03306 -0.057065 0.18136 -0.43638 0.0005709 -0.11935 -0.2195 0.16429 -0.18119 -0.19145 -0.081672 -0.2962 0.25803 0.073848 0.54213 -0.15405 -0.49256 0.091719 0.13329 -0.05253 -0.20518 0.34576 -1.0449 0.072779 -0.0003453 -0.16926 0.051019 -0.14753 0.23848 -0.40749 -0.58278 -0.48695 0.25863 -0.20531 -0.4775 0.40645 -0.038512 -2.403 -0.12421 0.63149 0.089419 0.08557 -0.20757 -0.1617 -0.29506 -0.13948 0.14202 -0.30138 -0.15806 0.52984 0.24229 0.075169 0.13792 0.90416 -0

## **Import Fine Tuning Data**

In [5]:
# import preprocessed debagree tokens
reddit_preprocessed = pd.read_csv(reddit_csv_path)

# check
type(reddit_preprocessed['text'].iloc[10])
reddit_preprocessed['text'][0]

'regarding perry vs brown current name prop case think likely plaintiffs win perry et al allowed marry though quite possible end ruling based california specific circumstances personal well sourced speculations opinions justices logic ted olson david boies extremely good keeping track sort thing much better anyone subreddit could likely planned expected along case end supreme court would even started case first place reasonably sure victory'

In [6]:
print("Self trained will be trained on", len(reddit_preprocessed), "texts.")

Self trained will be trained on 7035721 texts.


## **Check appropriate word count limits**

* Context window and min_count now matter
* $\to$ check 
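The check can be sketched compactly as follows. This is an illustration on toy data, not the notebook's own code: `count_tokens` and `coverage` are hypothetical helper names, and counting text by text is just one way to avoid building a single very large string first.

```python
from collections import Counter


def count_tokens(texts):
    """Count whitespace-separated tokens without joining all texts into one string."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts


def coverage(counts, threshold):
    """Return (n_types, type_share, n_tokens, token_share) for words with count >= threshold."""
    total_types = len(counts)
    total_tokens = sum(counts.values())
    kept = [c for c in counts.values() if c >= threshold]
    return (len(kept), len(kept) / total_types, sum(kept), sum(kept) / total_tokens)


# Made-up mini corpus
toy_texts = ["a a a b", "a b c", "a d"]
toy_counts = count_tokens(toy_texts)
toy_coverage = coverage(toy_counts, 2)
print(toy_coverage)
```

The type share says how much of the vocabulary survives a given `min_count`; the token share says how much of the running text is still covered.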

In [7]:
# count unique words

text = " ".join(reddit_preprocessed['text'])
word_counts = Counter(text.split())

word_freq_df = pd.DataFrame(word_counts.items(), columns=["word", "frequency"]).sort_values(by="frequency", ascending=False).reset_index(drop = True)
word_freq_df

word_freq_df['len'] = word_freq_df['word'].str.len()

print("Corpus contains:", len(word_freq_df), "individual words, and a total of",  word_freq_df['frequency'].sum())
word_freq_df

Corpus contains: 457064 individual words, and a total of 124570312


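| | word | frequency | len |
|---|---|---|---|
| 0 | trump | 1220659 | 5 |
| 1 | people | 1028121 | 6 |
| 2 | would | 1003710 | 5 |
| 3 | like | 943124 | 4 |
| 4 | think | 697822 | 5 |
| ... | ... | ... | ... |
| 457059 | ndrewlawrence | 1 | 13 |
| 457060 | brassica | 1 | 8 |
| 457061 | bluegreenred | 1 | 12 |
| 457062 | officelt | 1 | 8 |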


In [8]:
# "at least": the filter uses >=, so a word with exactly `threshold` occurrences is included
for i, threshold in enumerate([5, 10, 50, 100]):
    if i > 0:
        print("---------------")
    frequent = word_freq_df[word_freq_df['frequency'] >= threshold]

    print(len(frequent), "words appear at least", threshold, "times")
    print("That is a share of", len(frequent) / len(word_freq_df))

    print("together, these are", frequent['frequency'].sum(), "words")
    print("which is a share of", frequent['frequency'].sum() / word_freq_df['frequency'].sum())

97686 words appear at least 5 times
That is a share of 0.2137249925612168
together, these are 124076892 words
which is a share of 0.9960390241296015
---------------
70449 words appear at least 10 times
That is a share of 0.1541337755762869
together, these are 123898970 words
which is a share of 0.9946107383916643
---------------
37266 words appear at least 50 times
That is a share of 0.08153343951831692
together, these are 123165527 words
which is a share of 0.988722955113093
---------------
28304 words appear at least 100 times
That is a share of 0.06192568218017608
together, these are 122532201 words
which is a share of 0.9836388705520782


In [9]:
word_freq_distribution = (
    word_freq_df['frequency']
    .value_counts()
    .to_frame('Count')
    .join(word_freq_df['frequency'].value_counts(normalize=True).to_frame('%'))
    .reset_index()
    .rename(columns={'frequency': 'word_frequency'})
)

# Sort by Count (descending), then by Frequency (ascending)
word_freq_distribution = word_freq_distribution.sort_values(by=['Count', 'word_frequency'], ascending=[False, True])

# Add cumulative sum column
word_freq_distribution['Cumulative Sum'] = word_freq_distribution['Count'].cumsum()

# Add cumulative percentage column
word_freq_distribution['Cumulative %'] = word_freq_distribution['%'].cumsum()

word_freq_distribution


| | word_frequency | Count | % | Cumulative Sum | Cumulative % |
|---|---|---|---|---|---|
| 0 | 1 | 273161 | 0.597643 | 273161 | 0.597643 |
| 1 | 2 | 51254 | 0.112137 | 324415 | 0.709780 |
| 2 | 3 | 22101 | 0.048354 | 346516 | 0.758135 |
| 3 | 4 | 12862 | 0.028140 | 359378 | 0.786275 |
| 4 | 5 | 8602 | 0.018820 | 367980 | 0.805095 |
| ... | ... | ... | ... | ... | ... |
| 6915 | 697822 | 1 | 0.000002 | 457060 | 0.999991 |
| 2747 | 943124 | 1 | 0.000002 | 457061 | 0.999993 |
| 2746 | 1003710 | 1 | 0.000002 | 457062 | 0.999996 |
| 2745 | 1028121 | 1 | 0.000002 | 457063 | 0.999998 |


In [10]:
word_freq_df.to_csv("word_frequencies_reddit.csv", index = False)

***
****

## **Finetuned model**

**Steps**

* load pretrained
* filter the finetuning vocabulary: keep every word that is in the pretrained vocabulary; keep a new word only if its count is >= 10
* rewrite the reddit texts so that they contain only the words kept in the previous step
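The filter rule from the steps above can be sketched in plain Python. This is a toy illustration: `build_keep_set`, `filter_tokens` and the example vocabulary are made up, and real data would use the pretrained vocabulary and the corpus word counts instead.

```python
def build_keep_set(pretrained_vocab, corpus_counts, min_count=10):
    """Words to keep: everything pretrained, plus new words seen at least min_count times."""
    frequent_new = {word for word, count in corpus_counts.items() if count >= min_count}
    return set(pretrained_vocab) | frequent_new


def filter_tokens(tokens, keep):
    """Drop every token that is not in the keep set, preserving order."""
    return [token for token in tokens if token in keep]


# Made-up example data
toy_pretrained = {"court", "case"}
toy_corpus_counts = {"court": 50, "case": 3, "subreddit": 12, "officelt": 1}

toy_keep = build_keep_set(toy_pretrained, toy_corpus_counts, min_count=10)
toy_filtered = filter_tokens(["case", "officelt", "subreddit", "court"], toy_keep)
print(toy_filtered)
```

Building the keep set once and testing membership against a set keeps the per-token cost constant, which matters with millions of texts.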

In [10]:
# Load the pretrained GloVe embeddings (stored as text in word2vec format, i.e. with a dimensions header)
pretrained_model = KeyedVectors.load_word2vec_format(fixed_glove_with_dimensions_path, binary=False)

# get pretrained vocabulary order
pretrained_vocab_order = pretrained_model.index_to_key

In [11]:
# Convert KeyedVectors to a Word2Vec model for fine-tuning

# include all GloVe embeddings --> min_count = 1
model_fine_tuned = Word2Vec(vector_size = pretrained_model.vector_size, min_count=1, window = 5)
model_fine_tuned

<gensim.models.word2vec.Word2Vec at 0x781d1f59ee10>

In [12]:
# build_vocab expects an iterable of sentences, each given as a list of tokens.
# Only single tokens are available here, so wrap each one as a one-word sentence: [[word_1], [word_2], ...]
model_fine_tuned.build_vocab([[word] for word in pretrained_vocab_order], update=False)

# manually ensure the word order
model_fine_tuned.wv.index_to_key = pretrained_vocab_order  
model_fine_tuned.wv.key_to_index = {word: i for i, word in enumerate(pretrained_vocab_order)}

# test
print(pretrained_model.index_to_key == model_fine_tuned.wv.index_to_key) 

True


In [13]:
# ensured correct order, so I can do
model_fine_tuned.wv.vectors[:] = pretrained_model.vectors[:]

In [24]:

print("Pretrained Vocabulary:", len(pretrained_vocab_order))
print("Embeddings Dictionary (sth. missing)", len(embeddings_dict))
#pretrained_vocab_order.notin(embeddings_dict.keys())

missing_words = set(pretrained_vocab_order) - embeddings_dict.keys()
missing_words

Pretrained Vocabulary: 1917494
Embeddings Dictionary (sth. missing) 1917492


{',', 'the'}
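The two missing keys match the rows that the import cell prints as a preview (the first printed row starts with `,`). That suggests the preview's `readline()` calls consumed them before the main loop ran. A minimal sketch of a reader that previews rows without skipping them, using a made-up in-memory file and a hypothetical function name `read_embeddings`:

```python
import io


def read_embeddings(handle, preview=2):
    """Parse a word2vec-format text stream; remember the first `preview` words without skipping rows."""
    header = handle.readline().split()
    num_rows, num_dims = int(header[0]), int(header[1])

    embeddings = {}
    previewed = []
    for line in handle:
        parts = line.rstrip("\n").split(" ")
        word, vector = parts[0], [float(value) for value in parts[1:]]
        if len(previewed) < preview:
            previewed.append(word)  # record the row for inspection
        embeddings[word] = vector   # every row, including previewed ones, is stored
    return num_rows, num_dims, embeddings, previewed


# Made-up three-row file in word2vec text format
toy_file = io.StringIO("3 2\n, 0.1 0.2\nthe 0.3 0.4\ncourt 0.5 0.6\n")
rows, dims, toy_embeddings, toy_preview = read_embeddings(toy_file)
print(rows, dims, len(toy_embeddings), toy_preview)
```

With this pattern the number of parsed entries can be compared directly against the row count in the header.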

## **Preprocess finetuning sentences**

In [26]:
# import finetuning sentences
x = 10
reddit_txt_path_wordlim = f"../../preprocess_outputs/reddit_txt_path_wordlim_{x}.txt"

sentences = pd.read_csv(reddit_txt_path, header=None, names=["text"])

# individual words appearing often enough (at least x times)
words_over_limit = set(word_freq_df[word_freq_df['frequency'] >= x]['word'].tolist())


In [None]:
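# per sentence, only keep words that are in GloVe or appear at least x times
# (membership is tested against the key_to_index dict; `in` on the index_to_key list scans the whole list per token)
#sentences['tokens'] = sentences['text'].apply(
#    lambda text: [word for word in text.split() if word in pretrained_model.key_to_index or word in words_over_limit]
#)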

In [None]:

# make sentences again
#sentences['cleaned_text'] = sentences['tokens'].apply(lambda tokens: ' '.join(tokens))

# Export
#sentences['cleaned_text'].to_csv(reddit_txt_path_wordlim, index=False, header=False, sep='\n')

#sentences

In [31]:
cleaned_text = pd.read_csv(reddit_txt_path_wordlim, header=None, names=["cleaned_text"])

sentences = sentences.merge(cleaned_text, left_index = True, right_index = True)

In [30]:
# Print a few tokenized sentences for checking

for i, sentence in enumerate(LineSentence(reddit_txt_path_wordlim)):
    print(sentence)
    if i == 4:  # Only show first 5 sentences
        break

['regarding', 'perry', 'vs', 'brown', 'current', 'name', 'prop', 'case', 'think', 'likely', 'plaintiffs', 'win', 'perry', 'et', 'al', 'allowed', 'marry', 'though', 'quite', 'possible', 'end', 'ruling', 'based', 'california', 'specific', 'circumstances', 'personal', 'well', 'sourced', 'speculations', 'opinions', 'justices', 'logic', 'ted', 'olson', 'david', 'boies', 'extremely', 'good', 'keeping', 'track', 'sort', 'thing', 'much', 'better', 'anyone', 'subreddit', 'could', 'likely', 'planned', 'expected', 'along', 'case', 'end', 'supreme', 'court', 'would', 'even', 'started', 'case', 'first', 'place', 'reasonably', 'sure', 'victory']
['wrong']
['adhering', 'askscience', 'style', 'moderation', 'could', 'elaborate']
['conference', 'november', 'grad', 'student', 'yale', 'presented', 'paper', 'looking', 'citizens', 'opinions', 'political', 'polarization', 'broken', 'three', 'parts', 'civility', 'discussion', 'bipartisanship', 'gridlock', 'study', 'found', 'partisan', 'citizens', 'favored', '

In [32]:
# Fine-tune with Reddit data
model_fine_tuned.build_vocab(LineSentence(reddit_txt_path_wordlim), update=True)
model_fine_tuned.train(LineSentence(reddit_txt_path_wordlim), total_examples=model_fine_tuned.corpus_count, epochs=model_fine_tuned.epochs)



(602081321, 621097100)
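If I read the gensim documentation correctly, `train()` returns a tuple of (effective word count, raw word count). A small arithmetic sketch of what the two numbers imply, assuming the model ran with gensim's default of 5 epochs (that value is an assumption, it is not printed anywhere above):

```python
effective_words, raw_words = 602081321, 621097100  # tuple printed above
epochs = 5  # assumption: gensim's default number of epochs

raw_words_per_epoch = raw_words / epochs
effective_share = effective_words / raw_words

print(f"raw words per epoch: {raw_words_per_epoch:,.0f}")
print(f"effective share:     {effective_share:.4f}")
```

Under that assumption the raw count per epoch is slightly below the 124570312 tokens counted in the unfiltered corpus, which would fit the word filter applied to the fine-tuning file. The gap between effective and raw words is presumably due to gensim's downsampling of very frequent words.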

In [35]:
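# note: pretrained_vocab_order appears to be the same list object as model_fine_tuned.wv.index_to_key
# (assigned above), so its length now includes the newly added words
print(len(pretrained_vocab_order))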
print("Vocabulary size:", len(model_fine_tuned.wv.key_to_index))

print("New Words: ", len(model_fine_tuned.wv.key_to_index) - 1917494)

1921604
Vocabulary size: 1921604
New Words:  4110


In [36]:
# save embeddings as txt for SBERT

fine_tuned_model_path_wordlim = f'embedding_files/fine_tuned_glove_word2vec_wordlim_{x}.txt'
model_fine_tuned.wv.save_word2vec_format(fine_tuned_model_path_wordlim, binary=False)


In [37]:
# check ordering (caveat: both sides appear to reference the same list object, so this comparison cannot fail)
pretrained_model.index_to_key == model_fine_tuned.wv.index_to_key

True
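A caveat on this check: `pretrained_vocab_order` was assigned directly to `model_fine_tuned.wv.index_to_key`, and its length grew to 1921604 after fine-tuning, so both sides of the comparison are probably the same list. A sketch with plain Python lists of why such a comparison is always `True`, and of a stricter check against a copy taken before the update (`prefix_preserved` is a hypothetical helper):

```python
def prefix_preserved(original_keys, updated_keys):
    """True if updated_keys starts with original_keys in the same order."""
    return updated_keys[:len(original_keys)] == original_keys


shared = ["the", ",", "court"]   # stands in for the pretrained key list
snapshot = list(shared)          # independent copy taken before the update
alias = shared                   # plain assignment: same list object, no copy
alias.append("subreddit")        # a vocabulary update appends a new word

aliased_check = shared == alias              # compares the list with itself
same_object = shared is alias
order_kept = prefix_preserved(snapshot, alias)
print(aliased_check, same_object, order_kept)
```

Comparing against a snapshot tests what the cell is after: that the original words still occupy the first positions in unchanged order.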

***
***
## **Self Trained Model**

In [40]:
# make texts a list of lists of tokens

sentences = list(reddit_preprocessed['text'])

sentences = [sentence.split() for sentence in sentences]
sentences

[['regarding',
  'perry',
  'vs',
  'brown',
  'current',
  'name',
  'prop',
  'case',
  'think',
  'likely',
  'plaintiffs',
  'win',
  'perry',
  'et',
  'al',
  'allowed',
  'marry',
  'though',
  'quite',
  'possible',
  'end',
  'ruling',
  'based',
  'california',
  'specific',
  'circumstances',
  'personal',
  'well',
  'sourced',
  'speculations',
  'opinions',
  'justices',
  'logic',
  'ted',
  'olson',
  'david',
  'boies',
  'extremely',
  'good',
  'keeping',
  'track',
  'sort',
  'thing',
  'much',
  'better',
  'anyone',
  'subreddit',
  'could',
  'likely',
  'planned',
  'expected',
  'along',
  'case',
  'end',
  'supreme',
  'court',
  'would',
  'even',
  'started',
  'case',
  'first',
  'place',
  'reasonably',
  'sure',
  'victory'],
 ['wrong'],
 ['adhering', 'askscience', 'style', 'moderation', 'could', 'elaborate'],
 ['conference',
  'november',
  'grad',
  'student',
  'yale',
  'presented',
  'paper',
  'looking',
  'citizens',
  'opinions',
  'political',

In [41]:
# initialize an empty model

word_lim = 10
window = 5

model = Word2Vec(
    vector_size = 300,
    window = window,
    min_count = word_lim, 
    workers = 4,
)

print("Building Vocabulary")
model.build_vocab(sentences)
# find all unique words in the dataset


Building Vocabulary


In [42]:
# sanity check: number of texts and total number of tokens (should match the word counts above)
print(len(sentences))

# separate name so that the word limit `x` set earlier is not overwritten
total_tokens = sum(len(sentence) for sentence in sentences)

print(total_tokens)

7035721
124570312


In [43]:
# Training with progress bar for each epoch


epochs = 10
total_sentences = len(sentences)

for epoch in range(epochs):
    print(f'Epoch {epoch+1}/{epochs}')
    pbar = tqdm(total=total_sentences, desc=f"Epoch {epoch+1}", unit="sentence")
    

    np.random.shuffle(sentences)
    
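    # Train in batches to update the progress bar.
    # Caveat: every model.train() call runs its own learning-rate decay from alpha to min_alpha,
    # so the decay restarts with each batch instead of spanning the whole run.
    batch_size = 100000  # number of sentences per train() call
    for i in range(0, total_sentences, batch_size):
        batch_sentences = sentences[i:i + batch_size]
        model.train(batch_sentences, total_examples=len(batch_sentences), epochs=1)
        pbar.update(len(batch_sentences))
    
    pbar.close()


# No `sg` argument was passed to Word2Vec above, so gensim's default (CBOW, sg=0) is used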

Epoch 1/10


Epoch 1: 100%|██████████| 7035721/7035721 [01:54<00:00, 61629.09sentence/s]


Epoch 2/10


Epoch 2: 100%|██████████| 7035721/7035721 [01:56<00:00, 60197.26sentence/s]


Epoch 3/10


Epoch 3: 100%|██████████| 7035721/7035721 [01:59<00:00, 59057.73sentence/s]


Epoch 4/10


Epoch 4: 100%|██████████| 7035721/7035721 [01:57<00:00, 59643.41sentence/s]


Epoch 5/10


Epoch 5: 100%|██████████| 7035721/7035721 [01:59<00:00, 58898.27sentence/s]


Epoch 6/10


Epoch 6: 100%|██████████| 7035721/7035721 [01:58<00:00, 59326.35sentence/s]


Epoch 7/10


Epoch 7: 100%|██████████| 7035721/7035721 [01:58<00:00, 59337.61sentence/s]


Epoch 8/10


Epoch 8: 100%|██████████| 7035721/7035721 [02:00<00:00, 58511.87sentence/s]


Epoch 9/10


Epoch 9: 100%|██████████| 7035721/7035721 [01:58<00:00, 59146.32sentence/s]


Epoch 10/10


Epoch 10: 100%|██████████| 7035721/7035721 [01:56<00:00, 60346.80sentence/s]
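One thing to keep in mind with the batched loop: as far as I understand gensim's `train()`, the learning rate decays linearly from `alpha` to `min_alpha` within a single call. Calling it once per batch would therefore restart the decay each time. A plain-Python sketch of the two schedules, assuming gensim's default values `alpha=0.025` and `min_alpha=0.0001`; the helper names are made up and no gensim code is run:

```python
def linear_alpha(progress, start_alpha=0.025, min_alpha=0.0001):
    """Learning rate after a fraction `progress` (0..1) of one training call."""
    return max(min_alpha, start_alpha - (start_alpha - min_alpha) * progress)


def alpha_at_batch_starts(num_batches, per_batch_calls):
    """Learning rate at the start of each batch under the two calling patterns."""
    if per_batch_calls:
        # Each call starts a fresh decay, so every batch begins at start_alpha.
        return [linear_alpha(0.0) for _ in range(num_batches)]
    # One call over all batches: the decay is spread across the whole run.
    return [linear_alpha(i / num_batches) for i in range(num_batches)]


restarting = alpha_at_batch_starts(4, per_batch_calls=True)
single_call = alpha_at_batch_starts(4, per_batch_calls=False)
print(restarting)
print(single_call)
```

If a run-wide decay is wanted, one option would be a single `train()` call with `epochs=10`, with progress reported through gensim's callback mechanism rather than a manual batch loop.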


In [44]:
print("Vocabulary size:", len(model.wv.key_to_index))

Vocabulary size: 70449


In [45]:
# Save the word vectors to a text file

self_model_path_wordlim = f"embedding_files/self_build_word2vec_wordlim_{word_lim}_window_{window}.txt"
model.wv.save_word2vec_format(self_model_path_wordlim, binary=False)


In [46]:
# Save the entire model for future use
model.save(f"embedding_files/self_build_model/word2vec_model_wordlim_{word_lim}_window_{window}.model")



***

## **Test my embeddings**

In [47]:
reddit_preprocessed['text'][10]
text_test = reddit_preprocessed['text'][10]
text_test

'personally propose policy fact would suggest opposite subsidies work least well intended look rail subsidies old people running rail system saw incentive efficient would mean checks stop coming quicker instead would build enough keep getting subsidy lining pockets farm subsidies give monsanto incentive raise prices seed lobby legislation force farmers buy mean hey already getting check gov afford right green subsidies much luck either echoing problems rail farm subsidies instead would propose finding someone rich enough eccentric enough call privately funded information campaign educate everyday citizen options provided market options would hopefully untainted subsidies would affordable prices guaranteed new demand created information campaign basically think government damn thing look already done create new legislation fix problems caused old legislation'

In [48]:
valid_words_ft = [word for word in text_test.split() if word in model_fine_tuned.wv]
valid_words_ft

valid_words_self = [word for word in text_test.split() if word in model.wv]
valid_words_self

['personally',
 'propose',
 'policy',
 'fact',
 'would',
 'suggest',
 'opposite',
 'subsidies',
 'work',
 'least',
 'well',
 'intended',
 'look',
 'rail',
 'subsidies',
 'old',
 'people',
 'running',
 'rail',
 'system',
 'saw',
 'incentive',
 'efficient',
 'would',
 'mean',
 'checks',
 'stop',
 'coming',
 'quicker',
 'instead',
 'would',
 'build',
 'enough',
 'keep',
 'getting',
 'subsidy',
 'lining',
 'pockets',
 'farm',
 'subsidies',
 'give',
 'monsanto',
 'incentive',
 'raise',
 'prices',
 'seed',
 'lobby',
 'legislation',
 'force',
 'farmers',
 'buy',
 'mean',
 'hey',
 'already',
 'getting',
 'check',
 'gov',
 'afford',
 'right',
 'green',
 'subsidies',
 'much',
 'luck',
 'either',
 'echoing',
 'problems',
 'rail',
 'farm',
 'subsidies',
 'instead',
 'would',
 'propose',
 'finding',
 'someone',
 'rich',
 'enough',
 'eccentric',
 'enough',
 'call',
 'privately',
 'funded',
 'information',
 'campaign',
 'educate',
 'everyday',
 'citizen',
 'options',
 'provided',
 'market',
 'options

In [49]:
[model_fine_tuned.wv[word] for word in valid_words_ft]

[model.wv[word] for word in valid_words_self]

[array([ 6.69041753e-01, -4.42957550e-01, -3.47753376e-01, -5.42514175e-02,
        -1.50472060e-01,  7.63537407e-01, -9.36194509e-03, -1.03831872e-01,
         1.88889906e-01,  1.19042730e+00,  1.24072455e-01,  7.70567060e-01,
         2.63663441e-01,  7.14695394e-01, -1.64461029e+00,  7.08617032e-01,
        -8.12101085e-03, -6.10056281e-01,  4.74342890e-02, -1.75331786e-01,
        -2.70325691e-01,  3.10037434e-01, -2.23513916e-01, -9.71370637e-01,
         1.23182201e+00, -5.23137331e-01, -4.94168311e-01,  1.82676041e+00,
         7.06101954e-01,  1.49376422e-01, -1.27157569e+00,  7.37085879e-01,
        -3.17902386e-01, -2.36425638e-01,  1.44355297e-01,  1.18010025e-02,
        -4.77207243e-01,  5.64469576e-01,  4.66429353e-01, -6.02013879e-02,
        -2.08693668e-01, -7.52652764e-01, -5.28756380e-01, -6.43651843e-01,
        -6.39793098e-01, -1.92406625e-02, -6.77218616e-01,  4.88617122e-01,
         5.33136368e-01, -5.89472950e-01,  1.21708646e-01, -2.53598303e-01,
        -2.4

In [50]:
def embed_text(text, model):
    """Return the mean of the word vectors of all in-vocabulary tokens of `text`."""
    valid_words = [word for word in text.split() if word in model.wv]  # Filter words present in the vocabulary
    #print(valid_words)

    if not valid_words:  # Handle the case where no valid words are found
        print("Warning: No valid words found in the text for embedding.")
        return np.zeros(model.vector_size)  # Return a zero vector of the same size as the embedding

    # Compute the sentence embedding as the average of word embeddings
    sentence_embedding = np.mean([model.wv[word] for word in valid_words], axis=0)
    
    return sentence_embedding




In [52]:
text = reddit_preprocessed['text'][10]
embedding_ft = embed_text(text, model_fine_tuned)

print("Text Embedding Shape:", embedding_ft.shape)
print("Mean of embedding vector:", embedding_ft.mean())

Text Embedding Shape: (300,)
Mean of embedding vector: -0.02726319


In [54]:
text = reddit_preprocessed['text'][10]
embedding_self = embed_text(text, model)

print("Text Embedding Shape:", embedding_self.shape)
print("Mean of embedding vector:", embedding_self.mean())

Text Embedding Shape: (300,)
Mean of embedding vector: 0.0022689104
