In [1]:
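import nltk
import numpy as np
import pandas as pd
from gensim.models import Word2Vec
from itertools import combinations
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Fetch the NLTK resources used below (a no-op if already installed).
# Recent NLTK releases may also need 'punkt_tab' for word_tokenize.
for resource in ('punkt', 'stopwords', 'wordnet'):
    nltk.download(resource, quiet=True)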

In [2]:
df = pd.read_csv('nlp_papers.csv')
df['abstract'] = df['abstract'].fillna('')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    """Lowercase, tokenize, lemmatize, and drop stop words from one abstract."""
    # Tokenize the lowercased text
    tokens = word_tokenize(text.lower())
    # Keep alphabetic tokens only and lemmatize them
    tokens = [lemmatizer.lemmatize(token) for token in tokens if token.isalpha()]
    # Remove stop words
    tokens = [token for token in tokens if token not in stop_words]
    return ' '.join(tokens)


df['processed_text'] = df['abstract'].apply(preprocess_text)
sentences = df['processed_text'].apply(lambda x: x.split()).tolist()

# Step 1: Train Word2Vec models with different vector sizes

In [3]:
vector_sizes = [50, 100, 150, 200, 250, 300]
window_size = 5
models = {}

for vector_size in vector_sizes:
    model = Word2Vec(sentences=sentences, vector_size=vector_size, window=window_size, min_count=5, workers=4)
    models[vector_size] = model
    model.save(f'word2vec_vector_size_{vector_size}.model')

# Step 2: Define a function to calculate cosine similarity

In [4]:
def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
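
As a quick sanity check of the function above, the sketch below repeats its definition and applies it to a few hand-picked vectors. The vectors are illustrative only and do not come from the trained models.

```python
import numpy as np

def cosine_similarity(vec1, vec2):
    # Dot product divided by the product of the Euclidean norms
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Hand-picked 2-D vectors (illustrative, not model output)
a = np.array([1.0, 0.0])
b = np.array([0.0, 2.0])
c = np.array([3.0, 0.0])

sim_orthogonal = cosine_similarity(a, b)  # perpendicular vectors
sim_parallel = cosine_similarity(a, c)    # same direction, different length
sim_opposite = cosine_similarity(a, -c)   # opposite direction

print(sim_orthogonal, sim_parallel, sim_opposite)
```

Cosine similarity ignores vector length and depends only on direction, so it ranges from -1 to 1. For two words in a trained model, gensim's `model.wv.similarity(word1, word2)` should give the same quantity.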

# Step 3: For each model, find the top 20 word pairs with the highest similarity

In [5]:
top_pairs_results = {}

for vector_size, model in models.items():
    word_pairs = list(combinations(model.wv.index_to_key, 2))  # Generate all possible word pairs
    similarities = []

    for word1, word2 in word_pairs:
        vec1 = model.wv[word1]
        vec2 = model.wv[word2]
        similarity = cosine_similarity(vec1, vec2)
        similarities.append((word1, word2, similarity))

    # Sort the pairs by similarity and get the top 20
    top_pairs = sorted(similarities, key=lambda x: x[2], reverse=True)[:20]
    top_pairs_results[vector_size] = top_pairs
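
The loop above visits every word pair in pure Python, which becomes slow as the vocabulary grows. A possible alternative, sketched here under the assumption that the full similarity matrix fits in memory, normalizes the embedding matrix once and uses matrix multiplication. The helper name `top_k_pairs` and the toy vocabulary are hypothetical and only illustrate the idea.

```python
import numpy as np

def top_k_pairs(vectors, words, k=20):
    """Return the k most similar (word1, word2, similarity) tuples."""
    # Normalize rows so that a plain dot product equals cosine similarity
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    # Keep each unordered pair once: indices strictly above the diagonal
    rows, cols = np.triu_indices(len(words), k=1)
    pair_sims = sims[rows, cols]
    k = min(k, pair_sims.size)
    if k == 0:
        return []
    # argpartition selects the k largest values without sorting every pair
    top = np.argpartition(-pair_sims, k - 1)[:k]
    # Sort only the selected k pairs, highest similarity first
    top = top[np.argsort(-pair_sims[top])]
    return [(words[rows[i]], words[cols[i]], float(pair_sims[i])) for i in top]

# Toy vocabulary and vectors (hypothetical, not from the trained models)
toy_words = ["a", "b", "c"]
toy_vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
result = top_k_pairs(toy_vectors, toy_words, k=2)
print(result)
```

With a trained model, the call would look like `top_k_pairs(model.wv.vectors, model.wv.index_to_key)`. The matrix has one row and one column per vocabulary word, so this approach suits vocabularies of a few thousand words. Larger ones would need to be processed in chunks.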

# Step 4: Display the results

In [6]:
for vector_size, top_pairs in top_pairs_results.items():
    print(f"\nTop 20 word pairs with highest similarity for vector size {vector_size} and window size {window_size}:")
    for word1, word2, similarity in top_pairs:
        print(f"{word1} - {word2}: {similarity:.4f}")


Top 20 word pairs with highest similarity for vector size 50 and window size 5:
influence - formation: 0.9989
profile - policy: 0.9988
appropriate - account: 0.9988
subject - taken: 0.9988
investigation - action: 0.9987
region - strain: 0.9987
element - involved: 0.9987
receptor - elegans: 0.9987
allows - theory: 0.9987
largest - taken: 0.9987
ratio - interval: 0.9987
without - transcript: 0.9987
list - map: 0.9987
form - collection: 0.9987
certain - video: 0.9987
involved - regulation: 0.9987
movement - climate: 0.9987
receptor - sleep: 0.9987
element - regulation: 0.9987
increased - change: 0.9987

Top 20 word pairs with highest similarity for vector size 100 and window size 5:
response - cell: 0.9995
behavior - site: 0.9993
regarding - china: 0.9993
subject - following: 0.9993
policy - nm: 0.9993
appropriate - compound: 0.9993
subject - leading: 0.9993
especially - despite: 0.9993
stability - profile: 0.9992
region - return: 0.9992
company - regulation: 0.9992
world - society: 0.99

# Comparison of Results Based on Changing Vector Size

Objective: This comparison evaluates how the vector size, one of Word2Vec's key hyperparameters, affects the word pairs the model ranks as most similar, with the window size held constant at 5. The aim is to determine how vector size changes the semantic relationships the model captures.

## Summary of Results:
Vector Size 50:

* Key Pairs: "influence - formation," "profile - policy," "appropriate - account"
* Observations: The pairs are reasonable but seem somewhat generic, lacking in nuanced relationships that might be more evident with larger vector sizes.


Vector Size 100:

* Key Pairs: "response - cell," "behavior - site," "regarding - china"
* Observations: The pairs start to capture more specific relationships, such as "response - cell," indicating a move towards more meaningful word pairings.

Vector Size 150:

* Key Pairs: "element - regulation," "required - must," "chemical - stream"
* Observations: There's a noticeable improvement in the specificity and relevance of word pairs. Pairs like "required - must" suggest the model is capturing logical relationships.
 
Vector Size 200:

* Key Pairs: "region - surface," "subject - appropriate," "behavior - growth"
* Observations: The model continues to improve, with pairs reflecting stronger semantic links, particularly in technical contexts (e.g., "region - surface," "behavior - growth").
 
Vector Size 250:

* Key Pairs: "de - la," "cycle - expected," "global - uncertainty"
* Observations: The model captures both generic and specific relationships. However, some pairs like "de - la" might indicate that the model is picking up on language artifacts rather than meaningful semantic relationships.
 
Vector Size 300:

* Key Pairs: "major - concern," "primary - ct," "appropriate - caused"
* Observations: The model exhibits high specificity with strong contextual relevance, capturing pairs like "major - concern" and "behavior - customer." However, as with the "de - la" pair seen at vector size 250, some pairs may reflect overfitting or noise rather than meaningful relationships.

# Analysis of Results

Vector Size 50: The model captures basic relationships but lacks depth and specificity. The pairs are somewhat generic and do not convey nuanced meanings.

Vector Size 100: An improvement over size 50, with more relevant pairs that suggest a better grasp of context. This size offers a reasonable balance between generalization and specificity.

Vector Size 150: The model improves further and begins to capture more complex relationships and logical pairings, making it a strong candidate for capturing nuanced meanings.

Vector Size 200: The word pairs are highly specific and contextually relevant, especially in technical or scientific contexts, making this size a strong option for domains that require detailed semantic understanding.

Vector Size 250: While still strong, the results show some signs of overfitting or of picking up irrelevant language artifacts, as in the pair "de - la." This may indicate that the model is becoming too specialized.

Vector Size 300: This size provides very high specificity and relevance and captures nuanced semantic relationships, but it carries the same risk of overfitting or noise seen at vector size 250.
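
The observations above come from reading the top pairs by eye. A rough quantitative complement, sketched here, is the Jaccard overlap between the top-pair sets of two vector sizes. The helper names `pair_set` and `jaccard` are hypothetical, and the two short lists are excerpts of the pairs quoted in this notebook rather than the full top-20 lists.

```python
def pair_set(top_pairs):
    """Turn (word1, word2, ...) tuples into a set of unordered pairs."""
    # frozenset makes ("a", "b") and ("b", "a") count as the same pair
    return {frozenset((w1, w2)) for w1, w2, *_ in top_pairs}

def jaccard(set_a, set_b):
    """Size of the intersection divided by size of the union."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

# Excerpts of pairs quoted in this notebook (not the full top-20 lists)
pairs_50 = [("influence", "formation"), ("profile", "policy"), ("element", "regulation")]
pairs_150 = [("element", "regulation"), ("required", "must"), ("chemical", "stream")]

overlap = jaccard(pair_set(pairs_50), pair_set(pairs_150))
print(f"Jaccard overlap: {overlap:.2f}")
```

Applied to the entries of `top_pairs_results`, a low overlap between neighboring vector sizes would suggest that the top pairs are unstable, which is worth knowing before reading much into any single pair.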

# Conclusion

Best Vector Size: Vector size 150 and vector size 200 both offer a good balance of specificity and relevance, making them the best choices depending on the domain and context. Vector size 150 might be more versatile, while vector size 200 is excellent for highly technical or detailed text analysis.

Trade-offs: Smaller vector sizes (like 50 and 100) might miss out on capturing deeper semantic relationships, while larger sizes (250 and 300) could introduce noise or overfit to specific contexts.
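
The top pairs printed for vector sizes 50 and 100 all score close to 0.999, so it may help to check how widely similarities are spread across the whole vocabulary before weighing these trade-offs. The sketch below summarizes the pairwise cosine similarities of an embedding matrix. The helper name `similarity_spread` is hypothetical, and the two random matrices are synthetic stand-ins rather than trained embeddings.

```python
import numpy as np

def similarity_spread(vectors):
    """Summarize the cosine similarities over all distinct row pairs."""
    # Normalize rows so that a plain dot product equals cosine similarity
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    # Use each unordered pair once (strictly above the diagonal)
    rows, cols = np.triu_indices(len(vectors), k=1)
    pair_sims = sims[rows, cols]
    return {
        "mean": float(pair_sims.mean()),
        "std": float(pair_sims.std()),
        "max": float(pair_sims.max()),
    }

rng = np.random.default_rng(0)
# Synthetic case 1: vectors pointing in nearly the same direction
collapsed = 1.0 + 0.01 * rng.standard_normal((50, 20))
# Synthetic case 2: vectors spread in all directions
spread_out = rng.standard_normal((50, 20))

collapsed_stats = similarity_spread(collapsed)
spread_stats = similarity_spread(spread_out)
print(collapsed_stats)
print(spread_stats)
```

If `similarity_spread(model.wv.vectors)` reported a mean close to the maximum for a trained model, the ranking of top pairs would rest on very small differences, and the qualitative comparison between vector sizes should be read with corresponding caution.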