# Word2Vec fine-tuning project

### Description
This project demonstrates a simple use of word2vec to create word embeddings with the Gensim library. As the sections below show, word2vec can produce word embeddings efficiently while still achieving results comparable to those of neural network language models.

This post is based on the research paper ["Efficient Estimation of Word Representations in Vector Space"](https://arxiv.org/pdf/1301.3781).

### Table of contents
1. [Dataset Download](#dataset-download)
2. [Exploratory Data Analysis](#exploratory-data-analysis)
3. [Text Preprocessing](#text-preprocessing)
4. [Fine-tuning](#word2vec-fine-tuning)
5. [Data Visualization](#data-visualization)

## Dataset download

In [10]:
import requests
import zipfile
import os

In [11]:
url = 'https://www.kaggle.com/api/v1/datasets/download/emirhanai/2024-u-s-election-sentiment-on-x'

response = requests.get(url)

In [12]:
if response.status_code == 200:
    with open('dataset.zip', 'wb') as file:
        file.write(response.content)
    print("Dataset saved as 'dataset.zip'")
else:
    print(f"Error downloading dataset: {response.status_code}")

Dataset saved as 'dataset.zip'


In [13]:
with zipfile.ZipFile('dataset.zip', 'r') as zip_ref:
    zip_ref.extractall('dataset')
print("Dataset unpacked as 'dataset'")

Dataset unpacked as 'dataset'


In [14]:
files = os.listdir('dataset')
print("Dataset files:", files)

Dataset files: ['test.csv', 'train.csv', 'val.csv']


## Exploratory Data Analysis

[Back to top](#word2vec-fine-tuning-project)

In [15]:
import pandas as pd
import plotly.express as px

In [16]:
df_train = pd.read_csv('dataset/train.csv')
df_test = pd.read_csv('dataset/test.csv')
df_val = pd.read_csv('dataset/val.csv')

# For this task we don't need separate train, test, and validation splits, so we merge them
df = pd.concat([df_train, df_test, df_val], ignore_index=True, verify_integrity=True)

print(f"df size: {df.shape[0]}")

df size: 600


In [17]:
df.head()

Unnamed: 0,tweet_id,user_handle,timestamp,tweet_text,candidate,party,retweets,likes,sentiment
0,1,@user123,2024-11-03 08:45:00,Excited to see Kamala Harris leading the Democ...,Kamala Harris,Democratic Party,120,450,positive
1,2,@politicsFan,2024-11-03 09:15:23,Donald Trump's policies are the best for our e...,Donald Trump,Republican Party,85,300,positive
2,3,@greenAdvocate,2024-11-03 10:05:45,Jill Stein's environmental plans are exactly w...,Jill Stein,Green Party,60,200,positive
3,4,@indieVoice,2024-11-03 11:20:10,Robert Kennedy offers a fresh perspective outs...,Robert Kennedy,Independent,40,150,neutral
4,5,@libertyLover,2024-11-03 12:35:55,Chase Oliver's libertarian stance promotes tru...,Chase Oliver,Libertarian Party,30,120,positive


In [18]:
party_colors = {
    "Republican Party": "#BB0000",
    "Democratic Party": "#0000BB",
    "Green Party": "#00BB00",
    "Independent": "#BBBBBB",
    "Libertarian Party": "#CCCC00"
}

In [19]:
# Aggregate and sort the data
grouped_data = df.groupby('party', as_index=False)['likes'].sum()
grouped_data = grouped_data.sort_values(by='likes', ascending=False)

# Create the bar plot
fig = px.bar(
    grouped_data,
    x='party',
    y='likes',
    color='party',
    title="Total Likes per Party",
    labels={'party': 'Political Party', 'likes': 'Total Likes'},
    color_discrete_map=party_colors
)

fig.update_layout(xaxis=dict(categoryorder='total descending'))
fig.show()


In [20]:
# Create the box plot from the combined dataframe
fig = px.box(
    df,
    x='party',
    y='retweets',
    color='party',
    title="Retweet Distribution per Party",
    labels={'party': 'Political Party', 'retweets': 'Retweet Count'},
    color_discrete_map=party_colors
)

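# Order the parties by mean retweet count, computed on the same dataframe that is plotted
party_order = df.groupby('party')['retweets'].mean().sort_values(ascending=False).index.tolist()
fig.update_layout(xaxis=dict(categoryorder='array', categoryarray=party_order))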

fig.show()

## Text Preprocessing

[Back to top](#word2vec-fine-tuning-project)

In [21]:
import nltk
import string
import re

nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt_tab to /home/jose/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /home/jose/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/jose/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [22]:
def preprocess(raw_text):
    # Whitespace removal
    text = raw_text.strip()       # Leading and trailing
    text = " ".join(text.split()) # Collapse repeated whitespace

    # Lowercasing
    text = text.lower()

    # Candidate name normalization (easier done before tokenization)
    candidates = {
         "donald trump": "trump",
         "kamala harris": "harris",
         "jill stein":   "stein",
         "robert kennedy": "kennedy",
         "chase oliver":  "oliver"
    }
    for fullname, surname in candidates.items():
        text = re.sub(fullname, surname, text)

    # URL removal
    pattern = r"(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?"
    text = re.sub(pattern, "", text)

    # Tokenization (separating words into a list of tokens)
    tokens = nltk.word_tokenize(text)

    # Filtering punctuation
    filtered_tokens = [token for token in tokens if token not in string.punctuation]

    # Stopword removal (removing words with little value such as 'the', 'of', etc.)
    stopwords = set(nltk.corpus.stopwords.words("english")) | {"'s"}
    filtered_tokens = [token for token in filtered_tokens if token not in stopwords]

    # Lemmatization (reducing words to their lemma form)
    lemmatizer = nltk.stem.WordNetLemmatizer()
    processed_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
    return processed_tokens

example = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language. Check out this article for more information: https://en.wikipedia.org/wiki/Natural_language_processing"
print(example)
print(preprocess(example))

Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language. Check out this article for more information: https://en.wikipedia.org/wiki/Natural_language_processing
['natural', 'language', 'processing', 'field', 'artificial', 'intelligence', 'deal', 'interaction', 'computer', 'human', 'natural', 'language', 'check', 'article', 'information']
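The candidate name normalization step inside `preprocess` can be looked at in isolation. The sketch below is a minimal, standalone version of that step; the helper name `normalize_candidates` is hypothetical, and the `\b` word-boundary anchors are an addition that the original function does not use (they only guard against matching inside longer words).

```python
import re

# Same mapping as in preprocess(); the helper itself is a hypothetical illustration.
CANDIDATES = {
    "donald trump": "trump",
    "kamala harris": "harris",
    "jill stein": "stein",
    "robert kennedy": "kennedy",
    "chase oliver": "oliver",
}

def normalize_candidates(text):
    """Lowercase the text and replace each full candidate name with the surname."""
    text = text.lower()
    for fullname, surname in CANDIDATES.items():
        # re.escape keeps the name literal; \b restricts matches to whole words
        text = re.sub(rf"\b{re.escape(fullname)}\b", surname, text)
    return text

print(normalize_candidates("Donald Trump and Kamala Harris debated"))
```

Collapsing both "Kamala Harris" and "Harris" into a single token means that every mention of a candidate contributes to the same embedding.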


In [23]:
processed_texts = df['tweet_text'].apply(preprocess)
processed_texts

0      [excited, see, harris, leading, democratic, ch...
1                         [trump, policy, best, economy]
2            [stein, environmental, plan, exactly, need]
3      [kennedy, offer, fresh, perspective, outside, ...
4      [oliver, libertarian, stance, promotes, true, ...
                             ...                        
595            [harris, symbol, progressive, leadership]
596    [trump, economic, strategy, showing, mixed, re...
597    [stein, solar, project, leading, way, renewabl...
598    [kennedy, offer, pragmatic, solution, outside,...
599    [oliver, expanding, base, among, libertarian, ...
Name: tweet_text, Length: 600, dtype: object

## Word2Vec Fine-tuning

[Back to top](#word2vec-fine-tuning-project)

In [24]:
from collections import defaultdict # For word_freq

import multiprocessing
from gensim.models import Word2Vec
import logging
from time import time  # To time our operations

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [25]:
word_freq = defaultdict(int)
for sent in processed_texts:
    for word in sent:
        word_freq[word] += 1
print(f"Unique words: {len(word_freq)}")

Unique words: 368


In [26]:
most_freq_words = sorted(word_freq, key=word_freq.get, reverse=True)[:7]

print("Most frequent words")
for word in most_freq_words:
    print(f"{word}: {word_freq[word]}")

Most frequent words
harris: 120
trump: 120
stein: 120
kennedy: 120
oliver: 120
policy: 119
new: 56
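These frequencies matter because the `min_count` argument used when configuring the model discards every word that occurs fewer times than the threshold. The sketch below shows the same counting with `collections.Counter` and a simple pruning step on a made-up toy corpus; the helper names `count_words` and `surviving_vocab` are hypothetical, and the pruning is only an approximation of what Gensim does when it builds the vocabulary.

```python
from collections import Counter

def count_words(sentences):
    """Count how often each token appears across all tokenized sentences."""
    counts = Counter()
    for sent in sentences:
        counts.update(sent)
    return counts

def surviving_vocab(counts, min_count):
    """Return the words that would be kept under a min_count threshold."""
    return {word for word, count in counts.items() if count >= min_count}

# Toy corpus, invented for illustration only
toy_corpus = [
    ["harris", "policy", "new"],
    ["trump", "policy"],
    ["stein", "policy", "new"],
]

counts = count_words(toy_corpus)
print(counts.most_common(2))               # [('policy', 3), ('new', 2)]
print(sorted(surviving_vocab(counts, 2)))  # ['new', 'policy']
```

Running `surviving_vocab(word_freq, 20)` on the real `word_freq` dictionary should give a rough preview of how much of the 368-word vocabulary remains after pruning.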


In [35]:
# Configuring the model's parameters
cores = multiprocessing.cpu_count()     # Count the number of cores

vector_size = 50 # Embedding dimensionality; it should grow with the amount of training data [https://arxiv.org/pdf/1301.3781]

w2v_model = Word2Vec(min_count=20,      # Ignores all words with total absolute frequency lower than this.
                     window=4,          # Maximum distance between the current and predicted word within a sentence.
                     vector_size=vector_size,   # Dimensionality of the feature vectors.
                     sample=6e-5,       # The threshold for configuring which higher-frequency words are randomly downsampled.
                     alpha=0.03,        # Initial learning rate.
                     min_alpha=0.0007,  # Learning rate will linearly drop to min_alpha as training progresses.
                     negative=20,       # If > 0, negative sampling will be used; the int for negative specifies how many "noise words" should be drawn.
                     workers=cores-1)   # Worker threads to train the model.

2024-12-28 18:00:01,361 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec<vocab=0, vector_size=50, alpha=0.03>', 'datetime': '2024-12-28T18:00:01.361580', 'gensim': '4.3.3', 'python': '3.12.3 (main, Nov  6 2024, 18:32:19) [GCC 13.2.0]', 'platform': 'Linux-6.8.0-51-generic-x86_64-with-glibc2.39', 'event': 'created'}


In [36]:
# Building the vocabulary table
t = time()

w2v_model.build_vocab(processed_texts, progress_per=30)

print('Time to build vocab: {} mins'.format(round((time() - t) / 60, 2)))

2024-12-28 18:00:01,734 : INFO : collecting all words and their counts
2024-12-28 18:00:01,735 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2024-12-28 18:00:01,736 : INFO : PROGRESS: at sentence #30, processed 163 words, keeping 113 word types
2024-12-28 18:00:01,736 : INFO : PROGRESS: at sentence #60, processed 334 words, keeping 194 word types
2024-12-28 18:00:01,737 : INFO : PROGRESS: at sentence #90, processed 508 words, keeping 208 word types
2024-12-28 18:00:01,737 : INFO : PROGRESS: at sentence #120, processed 671 words, keeping 242 word types
2024-12-28 18:00:01,737 : INFO : PROGRESS: at sentence #150, processed 860 words, keeping 292 word types
2024-12-28 18:00:01,738 : INFO : PROGRESS: at sentence #180, processed 1041 words, keeping 334 word types
2024-12-28 18:00:01,738 : INFO : PROGRESS: at sentence #210, processed 1222 words, keeping 359 word types
2024-12-28 18:00:01,738 : INFO : PROGRESS: at sentence #240, processed 1408 words, keeping 362 w

Time to build vocab: 0.0 mins


In [37]:
# Training the model
t = time()

w2v_model.train(processed_texts, total_examples=w2v_model.corpus_count, epochs=3, report_delay=1)

print('Time to train the model: {} mins'.format(round((time() - t) / 60, 2)))

2024-12-28 18:00:02,697 : INFO : Word2Vec lifecycle event {'msg': 'training model with 19 workers on 38 vocabulary and 50 features, using sg=0 hs=0 sample=6e-05 negative=20 window=4 shrink_windows=True', 'datetime': '2024-12-28T18:00:02.697009', 'gensim': '4.3.3', 'python': '3.12.3 (main, Nov  6 2024, 18:32:19) [GCC 13.2.0]', 'platform': 'Linux-6.8.0-51-generic-x86_64-with-glibc2.39', 'event': 'train'}
2024-12-28 18:00:02,701 : INFO : EPOCH 0: training on 3627 raw words (64 effective words) took 0.0s, 51530 effective words/s
2024-12-28 18:00:02,705 : INFO : EPOCH 1: training on 3627 raw words (83 effective words) took 0.0s, 72247 effective words/s
2024-12-28 18:00:02,710 : INFO : EPOCH 2: training on 3627 raw words (86 effective words) took 0.0s, 315357 effective words/s
2024-12-28 18:00:02,710 : INFO : Word2Vec lifecycle event {'msg': 'training on 10881 raw words (233 effective words) took 0.0s, 18043 effective words/s', 'datetime': '2024-12-28T18:00:02.710726', 'gensim': '4.3.3', 'py

Time to train the model: 0.0 mins
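The log reports only a few dozen "effective words" per epoch out of 3627 raw words. Words below `min_count` are dropped, and the very small `sample` threshold removes most occurrences of the remaining frequent words. The sketch below computes the keep probability with the formula from the original word2vec C code, which Gensim appears to follow. The helper name `keep_probability` and the retained-token total of 2000 are assumptions for illustration, not values taken from this run.

```python
import math

def keep_probability(word_count, retained_total, sample):
    """Probability that a single occurrence of a word survives subsampling.

    word_count:     how often the word appears in the corpus
    retained_total: total token count of all words kept after min_count pruning
    sample:         the subsampling threshold (assumed to be below 1.0)
    """
    threshold_count = sample * retained_total
    prob = (math.sqrt(word_count / threshold_count) + 1) * (threshold_count / word_count)
    # Rare words get a value above 1, which means they are always kept
    return min(prob, 1.0)

# Assumed numbers: a word seen 120 times among 2000 retained tokens
p = keep_probability(word_count=120, retained_total=2000, sample=6e-5)
print(f"keep probability: {p:.4f}")
```

Under these assumed numbers a word such as `harris` would be kept roughly 3% of the time, which is consistent in spirit with the tiny effective word counts in the log. A larger `sample` value (for example `1e-3`) would keep far more tokens on a corpus this small.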


In [38]:
# Words most associated with Trump
w2v_model.wv.most_similar(positive=["trump"])

[('leading', 0.37599775195121765),
 ('latest', 0.22983933985233307),
 ('energy', 0.22080369293689728),
 ('individual', 0.21913261711597443),
 ('libertarian', 0.19480209052562714),
 ('independent', 0.16086997091770172),
 ('solution', 0.1487838476896286),
 ('oliver', 0.12577636539936066),
 ('education', 0.11054280400276184),
 ('issue', 0.09323759377002716)]
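The scores returned by `most_similar` are cosine similarities between word vectors. As a minimal sketch of that ranking, the code below uses plain Python lists and invented two-dimensional vectors; the function names and the toy vectors are hypothetical and are not taken from the trained model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, vectors, topn=3):
    """Rank every other word by cosine similarity to the query word."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query  # the query word itself is excluded from the results
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy 2-D vectors, invented for illustration only
toy_vectors = {
    "trump": [1.0, 0.0],
    "leading": [1.0, 1.0],
    "policy": [0.0, 1.0],
    "harris": [-1.0, 0.0],
}

for word, score in rank_by_similarity("trump", toy_vectors):
    print(f"{word}: {score:.4f}")
```

A score near 1 means the vectors point in almost the same direction, a score near 0 means they are unrelated, and a negative score means they point in opposite directions. The low scores in the output above suggest that the model has learned only weak associations from this small corpus.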

In [39]:
# Words most associated with Harris
w2v_model.wv.most_similar(positive=["harris"])

[('solution', 0.269701212644577),
 ('education', 0.24052664637565613),
 ('advocate', 0.21081115305423737),
 ('diverse', 0.18617048859596252),
 ('healthcare', 0.16715364158153534),
 ('gaining', 0.16122256219387054),
 ('economic', 0.1497458964586258),
 ('among', 0.1463119387626648),
 ('initiative', 0.1323530226945877),
 ('stein', 0.12762555480003357)]

In [55]:
# Odd one out
print(f"[Trump,Harris,Oliver] = {w2v_model.wv.doesnt_match(['trump', 'harris', 'oliver'])}")
print(f"[Oliver,Stein,Trump] = {w2v_model.wv.doesnt_match(['oliver', 'stein', 'trump'])}")
print(f"[Economy,Education,Tax] = {w2v_model.wv.doesnt_match(['economy', 'education', 'tax'])}")



[Trump,Harris,Oliver] = harris
[Oliver,Stein,Trump] = trump
[Economy,Education,Tax] = education
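To our understanding, `doesnt_match` normalizes each word vector to unit length, averages them, and returns the word whose vector is least similar to that average. The sketch below illustrates this idea in plain Python; the helper names and the toy vectors are hypothetical, and the code is an approximation of the technique rather than Gensim's actual implementation.

```python
import math

def unit(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def odd_one_out(words, vectors):
    """Return the word whose unit vector is least similar to the group mean."""
    units = [unit(vectors[w]) for w in words]
    dims = len(units[0])
    # Mean of the unit vectors, normalized again so that dot products are cosines
    mean = unit([sum(u[i] for u in units) / len(units) for i in range(dims)])
    scores = {w: sum(a * b for a, b in zip(u, mean)) for w, u in zip(words, units)}
    return min(scores, key=scores.get)

# Toy 2-D vectors, invented for illustration only
toy_vectors = {
    "economy": [1.0, 0.0],
    "tax": [0.9, 0.1],
    "education": [0.0, 1.0],
}

print(odd_one_out(["economy", "education", "tax"], toy_vectors))
```

In this toy example `economy` and `tax` point in nearly the same direction, so `education` is the outlier.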


In [41]:
# Words closest to the direction "harris - trump": similar to 'harris' while dissimilar to 'trump'
w2v_model.wv.most_similar(positive=["harris"], negative=["trump"], topn=10)

[('gaining', 0.288592129945755),
 ('freedom', 0.23809373378753662),
 ('support', 0.22510948777198792),
 ('initiative', 0.22240379452705383),
 ('healthcare', 0.19982220232486725),
 ('among', 0.19050924479961395),
 ('tax', 0.1694035679101944),
 ('advocate', 0.14407287538051605),
 ('diverse', 0.14275474846363068),
 ('economic', 0.11687185615301132)]

In [42]:
# Words closest to the direction "trump - harris": similar to 'trump' while dissimilar to 'harris'
w2v_model.wv.most_similar(positive=["trump"], negative=["harris"], topn=10)

[('trade', 0.2792196273803711),
 ('leading', 0.25624996423721313),
 ('libertarian', 0.24009783565998077),
 ('oliver', 0.17464560270309448),
 ('city', 0.14476251602172852),
 ('energy', 0.14260192215442657),
 ('latest', 0.1223384216427803),
 ('independent', 0.10840028524398804),
 ('individual', 0.10302483290433884),
 ('new', 0.09602224826812744)]
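With both `positive` and `negative` words, `most_similar` does not count co-occurrences. As far as we understand it, the method combines the unit vectors of the positive words (weight +1) and the negative words (weight -1) into one query vector and ranks the remaining vocabulary by cosine similarity to it. The sketch below shows this for one positive and one negative word; the function names and toy vectors are hypothetical.

```python
import math

def unit(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def contrast_query(positive, negative, vectors, topn=2):
    """Rank words by cosine similarity to unit(positive) - unit(negative)."""
    pos, neg = unit(vectors[positive]), unit(vectors[negative])
    query = unit([p - n for p, n in zip(pos, neg)])
    scores = [
        (word, sum(q * v for q, v in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in (positive, negative)  # input words are excluded
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy 2-D vectors, invented for illustration only
toy_vectors = {
    "harris": [1.0, 1.0],
    "trump": [1.0, 0.0],
    "healthcare": [0.0, 1.0],
    "trade": [1.0, -1.0],
}

for word, score in contrast_query("harris", "trump", toy_vectors):
    print(f"{word}: {score:.4f}")
```

Here `healthcare` ranks first because it lies along the direction that separates `harris` from `trump`, while `trade` lies on the opposite side and gets a negative score.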

## Data Visualization

This section shows how to view word embeddings in 3D using t-SNE and Plotly.

[Back to top](#word2vec-fine-tuning-project)

In [64]:
from sklearn.manifold import TSNE
import plotly.express as px
import numpy as np

In [82]:
def get_similar_words(word):
    similar_words = w2v_model.wv.most_similar(positive=[word])
    # Keep only the words, dropping the similarity scores
    return [similar for similar, _ in similar_words]

In [87]:
def display_similar_embeddings(word):

    similar_words = get_similar_words(word) + [word]
    embeddings = np.array([w2v_model.wv[w] for w in similar_words])

    tsne = TSNE(n_components=3, random_state=0, perplexity=2)
    projections = tsne.fit_transform(embeddings)

    df = pd.DataFrame(projections, columns=['x', 'y', 'z'])
    df['word'] = similar_words  # Add the words as a column

    fig = px.scatter_3d(
        df, x='x', y='y', z='z',
        text='word',  # Draw each word as a label next to its point
        title="Word2Vec Embedding Visualization",
        labels={'x': 'Dimension 1', 'y': 'Dimension 2', 'z': 'Dimension 3'}
    )

    fig.update_traces(marker=dict(size=8))

    fig.show()

In [88]:
display_similar_embeddings('harris')

In [89]:
display_similar_embeddings('trump')

In [91]:
display_similar_embeddings('healthcare')

In [None]:
# Other example embeddings
# display_similar_embeddings('oliver')
# display_similar_embeddings('tax')
# display_similar_embeddings('diverse')