# Word2Vec Pretrained Model Analysis

## Assignment: SNLP - Word Embeddings with Word2Vec

**Objectives:**
1. Use a pretrained Word2Vec model (`word2vec-google-news-300`)
2. Find similar words for 5 chosen words
3. Test word analogies using vector arithmetic
4. Compare with a custom-trained model

---

## 1. Import Required Libraries

In [3]:
import gensim
import gensim.downloader as api
from gensim.models import Word2Vec
from pprint import pprint

## 2. Load Pretrained Word2Vec Model

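We first try the Google News Word2Vec model, which contains 300-dimensional vectors trained on the Google News dataset. If it cannot be loaded, the cell falls back to the smaller `glove-wiki-gigaword-50` GloVe vectors, so the later results depend on which model was actually loaded.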

In [4]:
print("Loading pretrained model...")
print("Attempting to load word2vec-google-news-300...")

try:
    # Try to load the large Google News model
    pretrained_model = api.load("word2vec-google-news-300")
    print("✓ Successfully loaded Google News Word2Vec model!")
    model_loaded = True
    
except Exception as e:
    print(f"✗ Error loading Google News model: {e}")
    print("\nTrying alternative smaller model...")
    
    try:
        # Try loading a smaller alternative model
        pretrained_model = api.load("glove-wiki-gigaword-50")
        print("✓ Successfully loaded GloVe Wiki model as alternative!")
        model_loaded = True
        
    except Exception as e2:
        print(f"✗ Error loading alternative model: {e2}")
        print("\nWill demonstrate with custom model only...")
        model_loaded = False

print(f"\nModel status: {'Loaded' if model_loaded else 'Not loaded'}")
if model_loaded:
    print(f"Model type: {type(pretrained_model)}")
    print(f"Vector size: {pretrained_model.vector_size}")
    print(f"Vocabulary size: {len(pretrained_model.index_to_key)}")

Loading pretrained model...
Attempting to load word2vec-google-news-300...

Trying alternative smaller model...
✓ Successfully loaded GloVe Wiki model as alternative!

Model status: Loaded
Model type: <class 'gensim.models.keyedvectors.KeyedVectors'>
Vector size: 50
Vocabulary size: 400000
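
The nested `try`/`except` in the cell above can also be written as a loop over candidate model names. The sketch below is one possible way to do that, not the notebook's own code: `load_first_available`, its `loader` parameter, and `fake_loader` are hypothetical names. In the notebook, `loader` would be `api.load`; a stand-in loader is used here so the example runs without a download.

```python
from typing import Any, Callable, Iterable, Optional, Tuple


def load_first_available(
    names: Iterable[str],
    loader: Callable[[str], Any],
) -> Tuple[Optional[str], Optional[Any]]:
    """Return (name, model) for the first name that loads, else (None, None).

    `loader` is a hypothetical parameter; in the notebook it would be
    gensim.downloader's `api.load`.
    """
    for name in names:
        try:
            return name, loader(name)
        except Exception as exc:  # broad on purpose, mirroring the cell above
            print(f"Could not load {name}: {exc}")
    return None, None


def fake_loader(name: str) -> str:
    """Stand-in loader that fails for the large model (assumption for the demo)."""
    if name == "word2vec-google-news-300":
        raise OSError("simulated download failure")
    return f"<vectors for {name}>"


candidates = ["word2vec-google-news-300", "glove-wiki-gigaword-50"]
loaded_name, model = load_first_available(candidates, fake_loader)
model_loaded = model is not None
print(f"Loaded: {loaded_name} (model_loaded={model_loaded})")
```

Keeping the name of the model that actually loaded makes it easy to state which vectors produced the results in the following sections.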


## 3. Find Similar Words for 5 Chosen Words

We query the loaded model with five words from different domains and list the five nearest neighbours of each, ranked by cosine similarity.
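
For intuition, a nearest-neighbour query of this kind ranks every other word by the cosine similarity between its vector and the query vector. The sketch below does this over a tiny hand-made table. The 3-dimensional vectors and the `toy_most_similar` helper are invented for illustration and are not taken from any pretrained model.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Toy vectors, made up for this example only.
toy_vectors: Dict[str, List[float]] = {
    "coffee": [0.9, 0.1, 0.0],
    "tea": [0.8, 0.2, 0.1],
    "music": [0.0, 0.9, 0.3],
    "songs": [0.1, 0.8, 0.4],
}


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_most_similar(
    word: str,
    vectors: Dict[str, List[float]],
    topn: int = 2,
) -> List[Tuple[str, float]]:
    """Rank all other words by cosine similarity to `word`."""
    query = vectors[word]  # raises KeyError for out-of-vocabulary words
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


neighbours = toy_most_similar("coffee", toy_vectors)
for rank, (neighbour, score) in enumerate(neighbours, 1):
    print(f"  {rank}. {neighbour:15} (similarity: {score:.4f})")
```

With these toy values, "tea" ranks first for "coffee" because its vector points in almost the same direction.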

In [5]:
# Selected 5 words from different domains
words = ['science', 'coffee', 'music', 'apple', 'teacher']

print("SIMILAR WORDS FROM PRETRAINED MODEL")
print("=" * 50)

if model_loaded:
    for word in words:
        print(f"\nTop 5 similar words to '{word}':")
        try:
            similar_words = pretrained_model.most_similar(word, topn=5)
            for i, (similar, score) in enumerate(similar_words, 1):
                print(f"  {i}. {similar:15} (similarity: {score:.4f})")
        except KeyError:
            print(f"  Word '{word}' not found in vocabulary")
        print("-" * 40)
else:
    print("Cannot demonstrate - pretrained model not loaded.")
    print("This section requires a working internet connection")
    print("to download the pretrained Word2Vec model.")
    print()
    print("Expected output would show:")
    print("- 'science' similar to: research, scientific, biology, etc.")
    print("- 'coffee' similar to: tea, espresso, caffeine, etc.")
    print("- 'music' similar to: songs, musical, audio, etc.")
    print("- 'apple' similar to: fruit, iPhone, company, etc.")
    print("- 'teacher' similar to: instructor, educator, professor, etc.")

SIMILAR WORDS FROM PRETRAINED MODEL

Top 5 similar words to 'science':
  1. sciences        (similarity: 0.8548)
  2. research        (similarity: 0.8437)
  3. institute       (similarity: 0.8386)
  4. studies         (similarity: 0.8369)
  5. physics         (similarity: 0.8314)
----------------------------------------

Top 5 similar words to 'coffee':
  1. drink           (similarity: 0.8187)
  2. drinks          (similarity: 0.8176)
  3. wine            (similarity: 0.8141)
  4. tea             (similarity: 0.8080)
  5. beer            (similarity: 0.8042)
----------------------------------------

Top 5 similar words to 'music':
  1. musical         (similarity: 0.8854)
  2. pop             (similarity: 0.8682)
  3. dance           (similarity: 0.8531)
  4. songs           (similarity: 0.8526)
  5. recording       (similarity: 0.8392)
----------------------------------------

Top 5 similar words to 'apple':
  1. blackberry      (similarity: 0.7543)
  2. chips           (similarity: 

## 4. Word Analogies using Vector Arithmetic

We test well-known word analogies with vector arithmetic: the word whose vector is nearest to **A - B + C** should be **D**.

Examples:
- king - man + woman ≈ queen
- paris - france + italy ≈ rome  
- doctor - hospital + school ≈ teacher
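
The arithmetic can be followed on a toy example. The sketch below builds the target vector A - B + C and returns the remaining words ranked by cosine similarity to it, excluding the three input words. The 3-dimensional vectors and the `toy_analogy` helper are made up for illustration. gensim's `most_similar(positive=..., negative=...)` also leaves out the input words, and it works on length-normalised vectors, a step this sketch skips for brevity.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Toy vectors with axes (royal, male, female); invented for this example.
toy_vectors: Dict[str, List[float]] = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "prince": [0.8, 1.0, 0.0],
    "princess": [0.8, 0.0, 1.0],
}


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_analogy(
    a: str,
    b: str,
    c: str,
    vectors: Dict[str, List[float]],
    topn: int = 3,
) -> List[Tuple[str, float]]:
    """Rank candidate words by similarity to vec(a) - vec(b) + vec(c)."""
    target = [
        x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])
    ]
    scored = [
        (word, cosine(target, vec))
        for word, vec in vectors.items()
        if word not in (a, b, c)  # the input words are not valid answers
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


candidates = toy_analogy("king", "man", "woman", toy_vectors)
for rank, (word, score) in enumerate(candidates, 1):
    print(f"  {rank}. {word} ({score:.4f})")
```

Here king - man + woman equals the toy vector for "queen" exactly, so "queen" ranks first. Real embeddings only approximate this, which is why the cell below prints the top three candidates.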

In [6]:
print("WORD ANALOGIES USING VECTOR ARITHMETIC")
print("=" * 50)

if model_loaded:
    # Define analogy examples: (A, B, C) where A - B + C ≈ D
    analogies = [
        ('king', 'man', 'woman', 'Expected: queen'),
        ('paris', 'france', 'italy', 'Expected: rome'),
        ('doctor', 'hospital', 'school', 'Expected: teacher'),
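        # Note: the next two rows compute base - comparative + base. To target
        # 'smaller' / 'worse', the order would be bigger - big + small and
        # better - good + bad. They are kept as run so the output below matches.
        ('big', 'bigger', 'small', 'Expected: smaller'),
        ('good', 'better', 'bad', 'Expected: worse')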
    ]

    for i, (a, b, c, expected) in enumerate(analogies, 1):
        print(f"\n{i}. Analogy: {a} - {b} + {c}")
        print(f"   {expected}")
        
        try:
            # The scores returned by most_similar are cosine similarities, not probabilities
            result = pretrained_model.most_similar(positive=[a, c], negative=[b], topn=3)
            print(f"   Result: {result[0][0]} (confidence: {result[0][1]:.4f})")
            
            # Show top 3 results
            print("   Top 3 candidates:")
            for j, (word, score) in enumerate(result, 1):
                print(f"     {j}. {word} ({score:.4f})")
                
        except KeyError as e:
            print(f"   Error: Word not found in vocabulary - {e}")
        
        print("-" * 50)
        
else:
    print("Cannot demonstrate - pretrained model not loaded.")
    print("This section requires a working internet connection")
    print("to download the pretrained Word2Vec model.")
    print()
    print("Expected analogy results:")
    print("1. king - man + woman ≈ queen")
    print("2. paris - france + italy ≈ rome") 
    print("3. doctor - hospital + school ≈ teacher")
    print("4. big - bigger + small ≈ smaller")
    print("5. good - better + bad ≈ worse")
    print()
    print("These analogies demonstrate that Word2Vec captures")
    print("semantic relationships through vector arithmetic.")

WORD ANALOGIES USING VECTOR ARITHMETIC

1. Analogy: king - man + woman
   Expected: queen
   Result: queen (confidence: 0.8524)
   Top 3 candidates:
     1. queen (0.8524)
     2. throne (0.7664)
     3. prince (0.7592)
--------------------------------------------------

2. Analogy: paris - france + italy
   Expected: rome
   Result: rome (confidence: 0.8466)
   Top 3 candidates:
     1. rome (0.8466)
     2. milan (0.7766)
     3. turin (0.7666)
--------------------------------------------------

3. Analogy: doctor - hospital + school
   Expected: teacher
   Result: teacher (confidence: 0.8208)
   Top 3 candidates:
     1. teacher (0.8208)
     2. taught (0.7912)
     3. master (0.7836)
--------------------------------------------------

4. Analogy: big - bigger + small
   Expected: smaller
   Result: large (confidence: 0.8236)
   Top 3 candidates:
     1. large (0.8236)
     2. one (0.8007)
     3. along (0.7915)
--------------------------------------------------

5. Analogy: good - 