# Visualizing Word Embeddings: Deep Learning for NLP

This notebook will guide you through the implementation of Word2Vec, a popular model used in Natural Language Processing (NLP) to capture semantic meaning from text. We will apply this model to a dataset and visualize the learned word embeddings.

<center><img src="https://education-team-2020.s3.eu-west-1.amazonaws.com/ai-eng/w4-s3/deep_learning.png" width="1000"/></center>

## Tutorial

In this notebook, we will apply the Word2Vec model to a text dataset. The Word2Vec model learns to represent words in a continuous vector space, where semantically similar words are mapped to nearby points. We will walk through the following steps:

1. **Download the Training Data:** We'll start by downloading a large text corpus for training.
2. **Set up Word2Vec in TensorFlow:** We'll configure the Word2Vec model using TensorFlow, a popular machine learning library.
3. **Train the Model:** We'll train the Word2Vec model on the dataset to learn word embeddings.
4. **Visualize the Embeddings:** Finally, we'll visualize the learned word embeddings using dimensionality reduction techniques.

In [1]:
reset -fs

In [3]:
# I like to live dangerously ☠
import warnings
warnings.filterwarnings('ignore')

In [4]:
import collections
import math
import os
from pprint import pprint
import random
import urllib.request
import zipfile

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf 
from sklearn.manifold import TSNE

%matplotlib inline  

### Step 1: Download the Dataset

To train our Word2Vec model, we need a large text corpus. A popular choice for training word embeddings is the "text8" dataset, which contains 100 MB of English Wikipedia text from 2006. The download cell for text8 is commented out below; the cells that follow instead verify and load a local copy of the Harry Potter books, so change `path` to point at your own data.

In [None]:
# url = 'http://mattmahoney.net/dc/'

# def maybe_download(path, filename, expected_bytes):
#     """Download a file if not present, and make sure it's the right size."""
    
#     if not os.path.exists(path+filename):
#         filename, _ = urllib.request.urlretrieve(url + filename, filename)
    
#     statinfo = os.stat(path+filename)
#     if statinfo.st_size == expected_bytes:
#         print('Found and verified', filename)
#     else:
#         print(statinfo.st_size)
#         raise Exception(
#             'Failed to verify ' + filename + '. Can you get to it with a browser?')
    
#     return filename

# path = ""
# filename = maybe_download(path, 'text8.zip', 31_344_016)

In [None]:
import os

def verify_files(path, filenames):
    missing_files = []
    empty_files = []
    
    for filename in filenames:
        file_path = os.path.join(path, filename)
        
        if not os.path.exists(file_path):
            missing_files.append(filename)
        else:
            statinfo = os.stat(file_path)
            if statinfo.st_size == 0:
                empty_files.append(filename)
                
    if missing_files:
        raise Exception(f"Missing files: {', '.join(missing_files)}")
    if empty_files:
        raise Exception(f"Empty files: {', '.join(empty_files)}")
    
    print("All files found and verified!")

# Path to your local dataset
path = "../datasets/harry_potter_books"  # Change this to your actual path

# List of books to check
harry_potter_books = [
    "Book 1 - The Philosopher's Stone.txt",
    "Book 2 - The Chamber of Secrets.txt",
    "Book 3 - The Prisoner of Azkaban.txt",
    "Book 4 - The Goblet of Fire.txt",
    "Book 5 - The Order of the Phoenix.txt",
    "Book 6 - The Half Blood Prince.txt",
    "Book 7 - The Deathly Hallows.txt"
]

# Verify files exist
verify_files(path, harry_potter_books)


> **Note:** While the dataset is being downloaded, feel free to look ahead at the upcoming code cells to get a sense of what we'll be doing next.

In [None]:
import os

# Assumes you're reading plain text files directly
def read_data(path, filenames):
    """Read multiple files from a given path and compile their contents into a list of words."""
    vocabulary = []  # Initialize an empty list to hold all words
    for filename in filenames:
        with open(os.path.join(path, filename), 'r', encoding='utf-8') as file:
            # Read the file's content, split into words, and extend the vocabulary list
            words = file.read().split()
            vocabulary.extend(words)
    return vocabulary

# Assuming 'path' is already defined as you showed earlier
# And 'harry_potter_books' is your list of filenames
vocabulary = read_data(path, harry_potter_books)
print('Dataset size: {:,} words'.format(len(vocabulary)))

In [None]:
# Take a peek at the head
vocabulary[:20]

### Observations on Preprocessing

- **Case and Punctuation:** A corpus such as text8 is distributed already lowercased and stripped of punctuation. Text read with a plain `split()`, as above, keeps its capitalization and punctuation, so it needs an explicit preprocessing step to reach the same uniformity.
- **Preprocessing Considerations:** Preprocessing is a crucial part of NLP tasks, and its specifics depend on the data and the intended use. For instance, while we're working with unigrams (single words), other tasks might benefit from encoding n-grams (combinations of words) to capture context better.
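
To make the unigram versus n-gram distinction concrete, here is a minimal sketch that builds n-grams from a list of tokens. The helper name `make_ngrams` and the sample tokens are assumptions made for illustration; they are not part of the notebook's pipeline.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple (hypothetical helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["the", "boy", "who", "lived"]  # illustrative sample, not read from the dataset

unigrams = make_ngrams(tokens, 1)  # single words
bigrams = make_ngrams(tokens, 2)   # pairs of adjacent words

print(unigrams)
print(bigrams)
```

Bigrams keep some local word order that unigrams discard, at the cost of a much larger vocabulary.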

In [None]:
import os
from gensim.utils import simple_preprocess

def read_and_preprocess_data(directory):
    """Reads multiple text files from a directory and preprocesses the text."""
    words = []
    
    for filename in sorted(os.listdir(directory)):  # Ensure order (HP1 → HP7)
        file_path = os.path.join(directory, filename)
        
        if filename.endswith(".txt"):  # Process only text files
            with open(file_path, 'r', encoding='utf-8') as f:
                text = f.read()
                tokens = simple_preprocess(text, deacc=True)  # Tokenize & remove punctuation
                words.extend(tokens)  # Add to vocabulary

    return words

# Define dataset directory (relative path)
dataset_path = "../datasets/harry_potter_books"

# Read and preprocess the dataset
vocabulary = read_and_preprocess_data(dataset_path)

print('Dataset size after preprocessing: {:,} words'.format(len(vocabulary)))
print('Sample words:', vocabulary[:50])  # Print a preview of tokens


In [None]:
from gensim.models import Word2Vec

# Train Word2Vec model
def train_word2vec(vocabulary):
    """Trains a Word2Vec model on the processed Harry Potter dataset."""
    
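    # Word2Vec expects an iterable of token lists. gensim truncates any single
    # "sentence" longer than 10,000 tokens, so split the corpus into chunks
    chunk_size = 10_000
    sentences = [vocabulary[i:i + chunk_size] for i in range(0, len(vocabulary), chunk_size)]
    
    # Train model using Skip-Gram (sg=1), vector_size=100 (vector dimensions), window=5 (context size)
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=2, sg=1, workers=4)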
    
    return model

# Train Word2Vec on the Harry Potter dataset
w2v_model = train_word2vec(vocabulary)

# Save model for later use
w2v_model.save("harry_potter_word2vec.model")

print("Word2Vec model training complete!")


In [None]:
# Load the trained model
from gensim.models import Word2Vec

model = Word2Vec.load("harry_potter_word2vec.model")

# Find similar words to "harry"
print("Most similar words to 'harry':")
print(model.wv.most_similar("harry", topn=5))

# Find similar words to "magic"
print("\nMost similar words to 'magic':")
print(model.wv.most_similar("magic", topn=5))

# Find similarity between two words
print("\nSimilarity between 'harry' and 'voldemort':", model.wv.similarity("harry", "voldemort"))
print("Similarity between 'hogwarts' and 'castle':", model.wv.similarity("hogwarts", "castle"))


In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE
from gensim.models import Word2Vec

# Load trained Word2Vec model
model = Word2Vec.load("harry_potter_word2vec.model")

def plot_tsne(model, num_words=100):
    """Reduces word embeddings to 2D using t-SNE and plots them."""
    
    # Select most common words in the vocabulary
    words = list(model.wv.index_to_key)[:num_words]
    
    # Get corresponding word vectors
    word_vectors = np.array([model.wv[word] for word in words])

    # Reduce dimensions using t-SNE
    tsne = TSNE(n_components=2, perplexity=30, random_state=42)
    reduced_vectors = tsne.fit_transform(word_vectors)

    # Plot the words in 2D space
    plt.figure(figsize=(12, 8))
    plt.scatter(reduced_vectors[:, 0], reduced_vectors[:, 1], alpha=0.7)

    # Annotate points with words
    for i, word in enumerate(words):
        plt.annotate(word, (reduced_vectors[i, 0], reduced_vectors[i, 1]), fontsize=9)

    plt.title("t-SNE visualization of Word2Vec embeddings (Harry Potter)")
    plt.show()

# Visualize embeddings
plot_tsne(model)


In [None]:
from sklearn.cluster import KMeans
import numpy as np
from gensim.models import Word2Vec

# Load trained Word2Vec model
model = Word2Vec.load("harry_potter_word2vec.model")

def cluster_words(model, num_clusters=5):
    """Clusters words into groups using K-Means."""
    
    words = list(model.wv.index_to_key)[:500]  # Select top 500 words
    word_vectors = np.array([model.wv[word] for word in words])  # Get word embeddings

    # Perform K-Means clustering
    kmeans = KMeans(n_clusters=num_clusters, random_state=42, n_init=10)
    clusters = kmeans.fit_predict(word_vectors)

    # Group words by cluster
    clustered_words = {i: [] for i in range(num_clusters)}
    for word, cluster in zip(words, clusters):
        clustered_words[cluster].append(word)

    return clustered_words

# Get clustered words
num_clusters = 5  # Choose number of clusters
clusters = cluster_words(model, num_clusters)

# Print clusters
for cluster, words in clusters.items():
    print(f"\n🟢 Cluster {cluster+1}:")
    print(", ".join(words[:15]))  # Show top words per cluster


In [None]:
def find_analogies(word1, word2, word3, model):
    """Solves word analogies: word1 - word2 + word3 = ?"""
    try:
        result = model.wv.most_similar(positive=[word1, word3], negative=[word2], topn=1)
        return result[0][0]  # Return the best match
    except KeyError as e:
        return str(e)  # Handle missing words

# Example analogies
print("\n🔄 Word Analogies:")
print(f"'harry' - 'gryffindor' + 'slytherin' = {find_analogies('harry', 'gryffindor', 'slytherin', model)}")
print(f"'dumbledore' - 'good' + 'evil' = {find_analogies('dumbledore', 'good', 'evil', model)}")
print(f"'wand' - 'magic' + 'weapon' = {find_analogies('wand', 'magic', 'weapon', model)}")
print(f"'quidditch' - 'sport' + 'game' = {find_analogies('quidditch', 'sport', 'game', model)}")
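
The analogy queries above boil down to vector arithmetic followed by a nearest-neighbor search. The sketch below shows that idea on hand-made 2-dimensional vectors. The vectors, the `analogy` helper, and the `cosine` helper are all assumptions for illustration, and the sketch skips the vector normalization that gensim applies inside `most_similar`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hand-made toy vectors with axes (royalty, maleness); not learned embeddings
toy = {
    "king":  [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man":   [0.1, 0.9],
    "woman": [0.1, 0.1],
    "apple": [0.5, 0.5],
}

def analogy(word1, word2, word3, vectors):
    """Return the word closest to word1 - word2 + word3, excluding the three inputs."""
    query = [a - b + c for a, b, c in zip(vectors[word1], vectors[word2], vectors[word3])]
    candidates = [w for w in vectors if w not in (word1, word2, word3)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

result = analogy("king", "man", "woman", toy)
print(result)
```

On a small, single-domain corpus, real analogy results are often much noisier than this toy case suggests.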


In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from gensim.models import Word2Vec

# Load trained Word2Vec model
model = Word2Vec.load("harry_potter_word2vec.model")

def plot_clusters_tsne(model, num_clusters=5):
    """Plots word embeddings using t-SNE and colors by clusters."""
    
    words = list(model.wv.index_to_key)[:500]  # Select top words
    word_vectors = np.array([model.wv[word] for word in words])

    # Reduce dimensions using t-SNE
    tsne = TSNE(n_components=2, perplexity=30, random_state=42)
    reduced_vectors = tsne.fit_transform(word_vectors)

    # Cluster words using K-Means
    kmeans = KMeans(n_clusters=num_clusters, random_state=42, n_init=10)
    clusters = kmeans.fit_predict(word_vectors)

    # Plot words with cluster colors
    plt.figure(figsize=(12, 8))
    scatter = plt.scatter(reduced_vectors[:, 0], reduced_vectors[:, 1], c=clusters, cmap='viridis', alpha=0.7)

    # Annotate words
    for i, word in enumerate(words):
        plt.annotate(word, (reduced_vectors[i, 0], reduced_vectors[i, 1]), fontsize=8)

    plt.title("t-SNE Visualization of Word Clusters (Harry Potter)")
    plt.colorbar(scatter)
    plt.show()

# Run the visualization
plot_clusters_tsne(model)


### Step 2: Build the Dictionary

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Example dataset: Sentences classified into Hogwarts houses
sentences = [
    ("Bravery and courage are the most important virtues.", "Gryffindor"),
    ("Books and learning are what truly matter.", "Ravenclaw"),
    ("Cunning and ambition will take you far.", "Slytherin"),
    ("Loyalty and hard work define us.", "Hufflepuff"),
    ("We must fight for what is right.", "Gryffindor"),
    ("Intelligence and wisdom are our strengths.", "Ravenclaw"),
    ("Power is the path to greatness.", "Slytherin"),
    ("Kindness and dedication make a difference.", "Hufflepuff"),
]

# Convert sentences to vectors
X = []
y = []

for sentence, label in sentences:
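    # Strip surrounding punctuation so tokens such as "virtues." match the vocabulary
    words = [w.strip(".,!?") for w in sentence.lower().split()]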
    word_vectors = [model.wv[word] for word in words if word in model.wv]
    sentence_vector = np.mean(word_vectors, axis=0) if word_vectors else np.zeros(model.vector_size)
    X.append(sentence_vector)
    y.append(label)

# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a simple classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Test the classifier
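sample_sentence = "I want to be professor of magic"
sample_word_vectors = [model.wv[word] for word in sample_sentence.lower().split() if word in model.wv]
# Fall back to a zero vector when none of the words are in the vocabulary
sample_vector = np.mean(sample_word_vectors, axis=0) if sample_word_vectors else np.zeros(model.vector_size)
predicted_class = clf.predict([sample_vector])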

print(f"\n📚 Sentence: '{sample_sentence}' → Predicted House: {predicted_class[0]}")


In [None]:
import random
import logging
from gensim.models import Word2Vec

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Example knowledge base
responses = {
    "harry": ["Harry is the Boy Who Lived.", "Harry Potter is a Gryffindor.", "Harry faced Voldemort many times."],
    "hermione": ["Hermione is the brightest witch of her age.", "Hermione loves books and learning."],
    "ron": ["Ron Weasley is Harry's best friend.", "Ron comes from a big wizarding family."],
    "hogwarts": ["Hogwarts is the best magical school.", "Hogwarts has four houses: Gryffindor, Slytherin, Ravenclaw, and Hufflepuff."],
    "spell": ["Expecto Patronum is a powerful spell.", "Avada Kedavra is an unforgivable curse."]
}

# Load trained Word2Vec model
model = Word2Vec.load("harry_potter_word2vec.model")
logging.info("Word2Vec model loaded successfully.")

def chatbot_response(user_input, model):
    """Finds a suitable response using exact matches, phrase handling, or embeddings with logging."""
    if not user_input.strip():  # Handle empty input
        logging.warning("User provided empty input.")
        return "I didn't catch that. Could you try asking again?"

    logging.info(f"User input: {user_input}")
    words = user_input.lower().split()

    # Step 1: Check for exact matches for multi-word phrases
    if user_input.lower() in responses:
        response = random.choice(responses[user_input.lower()])
        logging.info(f"Exact match found for phrase '{user_input}': {response}")
        return response

    # Step 2: Check for exact matches for individual words
    for word in words:
        if word in responses:
            response = random.choice(responses[word])
            logging.info(f"Exact match found for '{word}': {response}")
            return response

    # Step 3: Use Word2Vec to find similar words, with a similarity threshold
    for word in words:
        if word in model.wv:
            logging.info(f"Word '{word}' found in embeddings.")
            try:
                # Get the most similar word and its similarity score
                similar_word, similarity = model.wv.most_similar(word, topn=1)[0]
                logging.info(f"Most similar word to '{word}': {similar_word} (similarity: {similarity:.2f})")

                # Use the similar word only if the similarity is above a threshold
                if similarity > 0.5 and similar_word in responses:
                    response = random.choice(responses[similar_word])
                    logging.info(f"Response chosen for similar word '{similar_word}': {response}")
                    return response
                else:
                    logging.warning(f"Similar word '{similar_word}' ignored: similarity {similarity:.2f} is too low or no response exists for it.")
            except KeyError as e:
                logging.warning(f"KeyError for word '{word}': {e}")

    # Step 4: Fallback response
    logging.warning("No suitable response found. Falling back to default.")
    return "I don't know much about that. Maybe ask about Hogwarts or spells?"


while True:
    user_input = input("You: ")
    if user_input.lower() == "exit":
        logging.info("User exited the chatbot.")
        print("Chatbot: Goodbye, and may the magic be with you! ✨")
        break
    response = chatbot_response(user_input, model)
    print(f"Chatbot: {response}")


# Using spaCy

In [2]:
import spacy

# Load the English language model
nlp = spacy.load("en_core_web_lg")

# Process some text
text = "Harry Potter is a wizard who studies at Hogwarts."
doc = nlp(text)

# Tokenization: Break text into words
print("Tokens:")
for token in doc:
    print(token.text)

# Named Entity Recognition (NER): Identify entities in text
print("\nNamed Entities:")
for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")

# Part-of-Speech (POS) tagging
print("\nPart-of-Speech Tags:")
for token in doc:
    print(f"{token.text} - {token.pos_}")

# Dependency Parsing: Relationships between words
print("\nDependency Parsing:")
for token in doc:
    print(f"{token.text} -> {token.dep_} -> {token.head.text}")


Tokens:
Harry
Potter
is
a
wizard
who
studies
at
Hogwarts
.

Named Entities:
Harry Potter (PERSON)
Hogwarts (GPE)

Part-of-Speech Tags:
Harry - PROPN
Potter - PROPN
is - AUX
a - DET
wizard - NOUN
who - PRON
studies - VERB
at - ADP
Hogwarts - PROPN
. - PUNCT

Dependency Parsing:
Harry -> compound -> Potter
Potter -> nsubj -> is
is -> ROOT -> is
a -> det -> wizard
wizard -> attr -> is
who -> nsubj -> studies
studies -> relcl -> wizard
at -> prep -> studies
Hogwarts -> pobj -> at
. -> punct -> is


In [None]:
import spacy

# Load spaCy English model
nlp = spacy.load("en_core_web_lg")

# Sample Harry Potter text
text = """
Harry Potter and Hermione Granger walked through the halls of Hogwarts. 
They were looking for Professor Dumbledore in his office. Meanwhile, 
Lord Voldemort was plotting an attack on the Ministry of Magic.
"""

# Process text with spaCy
doc = nlp(text)

# Extract entities related to characters & places
characters = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
places = [ent.text for ent in doc.ents if ent.label_ in ["GPE", "ORG", "LOC"]]

# Print extracted results
print("🧙 Characters:", set(characters))
print("📍 Places:", set(places))


Now that we have our text data, the next step is to build a dictionary. This dictionary will map each unique word in our dataset to a unique integer code. 

Additionally, we'll keep track of the frequency of each word, which will help us prioritize the most common words.

In [9]:
vocabulary_size = 50_000

def build_dataset(words, n_words):
    """Process raw inputs into a dataset."""
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(n_words - 1))
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words:
        index = dictionary.get(word, 0)
        if index == 0:  # dictionary['UNK']
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reversed_dictionary

In [10]:
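data, count, dictionary, reverse_dictionary = build_dataset(vocabulary,
                                                            vocabulary_size)
# The corpus may hold fewer unique words than the cap; keep the two in sync
vocabulary_size = len(dictionary)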

### Understanding the Key Variables

- **`data`:** A list of integer codes representing the words in the original text. Each word is replaced by its corresponding code from the `dictionary`.
- **`count`:** A list of word and frequency pairs, ordered from most to least frequent, with the `UNK` entry (the number of out-of-vocabulary words) in first position.
- **`dictionary`:** A dictionary mapping each word to its unique integer code.
- **`reverse_dictionary`:** A reverse mapping that converts integer codes back to their corresponding words.
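
To see how the four variables relate, here is a minimal sketch that repeats the same logic as `build_dataset` on a toy corpus. The word list and the tiny `n_words` cap are assumptions chosen so the result is easy to read.

```python
import collections

words = ["the", "owl", "saw", "the", "cat", "and", "the", "owl"]  # toy corpus
n_words = 3  # keep 'UNK' plus the 2 most frequent words

count = [["UNK", -1]]
count.extend(collections.Counter(words).most_common(n_words - 1))

# Codes follow frequency rank; code 0 is reserved for 'UNK'
dictionary = {word: code for code, (word, _) in enumerate(count)}
data = [dictionary.get(word, 0) for word in words]
count[0][1] = data.count(0)  # number of words that fell outside the vocabulary
reverse_dictionary = {code: word for word, code in dictionary.items()}

print(data)
print(count)
print(reverse_dictionary)
```

Every word outside the two most frequent ones collapses to code 0, which is why `count[0]` ends up holding the number of `UNK` occurrences.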

In [11]:
del vocabulary # Reduce memory by getting rid of the "heavy" list of strings

In [None]:
data[:5] # Each word, in order of appearance, replaced by its frequency rank, so we no longer have to reference the string

In [None]:
dictionary['the'] # word: rank

In [None]:
reverse_dictionary[5234] # rank: word

In [None]:
print('Most common words:') 
print(*count[:5], sep="\n")

In [None]:
print('Least common words:')
print(*count[-5:], sep="\n")

### Step 3: Generate a Training Batch for the Skip-Gram Model

In this step, we'll create a function to generate training batches for our skip-gram model. The skip-gram model is a variant of Word2Vec that predicts the context words given a target word. This function will help us prepare the data in the correct format for training the model.

In [17]:
data_index = 0

def generate_batch(batch_size, num_skips, skip_window):
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1  # [ skip_window target skip_window ]
    buffer = collections.deque(maxlen=span)
    if data_index + span > len(data):
        data_index = 0
    buffer.extend(data[data_index:data_index + span])
    data_index += span
    for i in range(batch_size // num_skips):
        context_words = [w for w in range(span) if w != skip_window]
        words_to_use = random.sample(context_words, num_skips)
        for j, context_word in enumerate(words_to_use):
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[context_word]
        if data_index == len(data):
            buffer.extend(data[:span])  # deque does not support slice assignment
            data_index = span
        else:
            buffer.append(data[data_index])
            data_index += 1
    # Backtrack a little bit to avoid skipping words in the end of a batch
    data_index = (data_index + len(data) - span) % len(data)
    return batch, labels

In [18]:
batch, labels = generate_batch(batch_size=8, 
                               num_skips=2, 
                               skip_window=1)

In [None]:
# Example of self-supervised learning
for i in range(8):
    print(batch[i], reverse_dictionary[batch[i]],
        '->', labels[i, 0], reverse_dictionary[labels[i, 0]])

### Step 4: Build and Train a Skip-Gram Model

The Skip-Gram model is a type of neural network model used in Natural Language Processing (NLP) to learn distributed representations of words, commonly known as word embeddings. 

Word embeddings are dense vector representations of words that capture semantic relationships between them, meaning that words with similar meanings are mapped to similar points in the vector space.
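
"Similar points in the vector space" is usually measured with cosine similarity, which is also what the validation step later in this notebook computes. A minimal sketch, assuming hand-made 3-dimensional vectors rather than learned embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-made vectors, purely illustrative (not learned embeddings)
toy_vectors = {
    "wizard": [0.9, 0.8, 0.1],
    "witch":  [0.8, 0.9, 0.2],
    "broom":  [0.1, 0.2, 0.9],
}

sim_close = cosine_similarity(toy_vectors["wizard"], toy_vectors["witch"])
sim_far = cosine_similarity(toy_vectors["wizard"], toy_vectors["broom"])
print(f"wizard~witch: {sim_close:.3f}, wizard~broom: {sim_far:.3f}")
```

Vectors pointing in nearly the same direction score close to 1, while unrelated directions score closer to 0.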

### How the Skip-Gram Model Works
The goal of the Skip-Gram model is to predict the context words given a target word. Here's how it works:

Input: The model takes a single word (the target word) from the text as input.

Output: The model is trained to predict words that are likely to appear in the context of the target word within a specified window of words before and after the target word. This context window is typically referred to as the "skip window."

For example, if the sentence is "The cat sat on the mat," the target word is "sat," and the skip window is 2, the model is trained to predict "The," "cat," "on," and "the," the two words on each side of "sat."
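
The pairing of target and context words can be sketched in a few lines. The helper `skipgram_pairs` below is a hypothetical illustration that lists every pair inside the window; the notebook's own `generate_batch` instead samples `num_skips` context words per target at random.

```python
def skipgram_pairs(tokens, skip_window):
    """Return (target, context) pairs for every token (hypothetical helper)."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - skip_window)
        end = min(len(tokens), i + skip_window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = "The cat sat on the mat".split()
pairs = skipgram_pairs(tokens, skip_window=2)

# Context words collected for the target "sat"
sat_contexts = [context for target, context in pairs if target == "sat"]
print(sat_contexts)
```

Words near the start or end of the text simply get fewer context words, because the window is clipped at the boundaries.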

In [20]:
batch_size = 128
embedding_size = 128  # Dimension of the embedding vector.
skip_window = 1       # How many words to consider left and right.
num_skips = 2         # How many times to reuse an input to generate a label.
num_sampled = 64      # Number of negative examples to sample.

# We pick a random validation set to sample nearest neighbors. Here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent.
valid_size = 16     # Random set of words to evaluate similarity on.
valid_window = 100  # Only pick dev samples in the head of the distribution.
valid_examples = np.random.choice(valid_window, valid_size, replace=False)

In [21]:
import tensorflow.compat.v1 as v1

In [None]:
graph = tf.Graph()

with graph.as_default():

    # Input data.
    train_inputs = v1.placeholder(tf.int32, shape=[batch_size])
    train_labels = v1.placeholder(tf.int32, shape=[batch_size, 1])
    valid_dataset = v1.constant(valid_examples, dtype=tf.int32)

    with tf.device('/cpu:0'):
        # Look up embeddings for inputs.
        embeddings = tf.Variable(
            v1.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
        embed = tf.nn.embedding_lookup(embeddings, train_inputs)

        # Construct the variables for the NCE loss
        nce_weights = tf.Variable(
            v1.truncated_normal([vocabulary_size, embedding_size],
                                stddev=1.0 / math.sqrt(embedding_size)))
        nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

        # Compute the average NCE loss for the batch.
        # tf.nn.nce_loss automatically draws a new sample of the negative labels each
        # time we evaluate the loss.
        # Explanation of the meaning of NCE loss:
        #   http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
        loss = tf.reduce_mean(
          tf.nn.nce_loss(weights=nce_weights,
                         biases=nce_biases,
                         labels=train_labels,
                         inputs=embed,
                         num_sampled=num_sampled,
                         num_classes=vocabulary_size))

        # Construct the SGD optimizer using a learning rate of 1.0.
        optimizer = v1.train.GradientDescentOptimizer(1.0).minimize(loss)

        # Compute the cosine similarity between the validation examples and all embeddings.
        norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True))
        normalized_embeddings = embeddings / norm
        valid_embeddings = tf.nn.embedding_lookup(
          normalized_embeddings, valid_dataset)
        similarity = tf.matmul(
          valid_embeddings, normalized_embeddings, transpose_b=True)

        # Add variable initializer.
        init = v1.global_variables_initializer()
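
The loss in the graph above avoids scoring the whole vocabulary by contrasting each true context word with `num_sampled` randomly drawn negative words. As a rough intuition only, the sketch below computes the closely related negative-sampling objective for a single target vector. It is a simplification and not the exact computation performed by `tf.nn.nce_loss`; the function names and the toy vectors are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def negative_sampling_loss(target_vec, true_context_vec, negative_vecs):
    """Low when the target scores high with its true context and low with the negatives."""
    loss = -math.log(sigmoid(dot(target_vec, true_context_vec)))
    for neg in negative_vecs:
        loss -= math.log(sigmoid(-dot(target_vec, neg)))
    return loss

target = [1.0, 0.5]

# True context aligned with the target, negatives pointing away from it
good_loss = negative_sampling_loss(target, [1.0, 0.5], [[-1.0, -0.5], [-0.5, -1.0]])
# True context pointing away from the target, negatives aligned with it
bad_loss = negative_sampling_loss(target, [-1.0, -0.5], [[1.0, 0.5], [0.5, 1.0]])

print(f"good: {good_loss:.3f}, bad: {bad_loss:.3f}")
```

Gradient descent on a loss of this shape pulls a word's vector toward the words it really co-occurs with and pushes it away from the sampled negatives.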

----
Step 5: Begin training.
-----

In [None]:
num_steps = 2_001  # Raise this (e.g. to 100_001) for a longer training run

with v1.Session(graph=graph) as session:
    # We must initialize all variables before we use them.
    init.run()
    print("Initialized")

    average_loss = 0
    for step in range(num_steps):
        batch_inputs, batch_labels = generate_batch(
            batch_size, num_skips, skip_window)
        feed_dict = {train_inputs : batch_inputs, train_labels : batch_labels}

        # We perform one update step by evaluating the optimizer op (including it
        # in the list of returned values for session.run())
        _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += loss_val

        if step % 2_000 == 0:
            if step > 0:
                average_loss /= 2000
            # The average loss is an estimate of the loss over the last 2000 batches.
            print("Average loss at step ", step, ": ", average_loss)
            average_loss = 0

        # Note that this is expensive (~20% slowdown if computed every 500 steps)
        if step % 2_000 == 0:
            sim = similarity.eval()
            for i in range(valid_size):
                valid_word = reverse_dictionary[valid_examples[i]]
                top_k = 4 # number of nearest neighbors
                nearest = (-sim[i, :]).argsort()[1:top_k+1]
                log_str = "Nearest to '%s':" % valid_word
                for k in range(top_k):
                    close_word = reverse_dictionary[nearest[k]]
                    log_str = "%s %s," % (log_str, close_word)
                print(log_str)
            print()
    final_embeddings = normalized_embeddings.eval()

<center><img src="../images/watiing.jpg" width="700"/></center>

### Step 6: Visualize the Embeddings with t-SNE


In this step, we use the t-SNE algorithm to reduce the dimensionality of the learned word embeddings for visualization. 

The top 500 words (based on frequency) are plotted in a 2D space, allowing us to visually inspect the relationships and groupings among words. 

This visualization is a powerful tool to understand how well the model has learned semantic relationships.
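
t-SNE itself is nonlinear and too involved to reproduce here, but the `init='pca'` option used below starts from a PCA projection, which is easy to sketch. The example assumes random vectors standing in for embeddings and shows PCA only, not t-SNE.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(50, 16))  # 50 fake "words", 16 dimensions each

# Center the data, then project onto the two directions of largest variance
centered = toy_embeddings - toy_embeddings.mean(axis=0)
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
low_dim = centered @ vt[:2].T  # rows of vt are the principal directions

print(low_dim.shape)
```

PCA preserves large-scale, linear structure, whereas t-SNE focuses on keeping each point's nearest neighbors close in the 2D layout.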

In [None]:
tsne = TSNE(perplexity=30, 
            n_components=2, 
            init='pca', 
            n_iter=5_000)
plot_only = 500
low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only,:])
labels = [reverse_dictionary[i] for i in range(plot_only)]

n_words_to_visualize = 40

for i, label in enumerate(labels[:n_words_to_visualize]):
    x, y = low_dim_embs[i,:]
    plt.scatter(x, y)
    plt.annotate(label,
                 xy=(x, y),
                 xytext=(5, 2),
                 textcoords='offset points',
                 ha='right',
                 va='bottom')

------

# Let's render and plot more samples.

In [25]:
def plot_with_labels(low_dim_embs, labels, filename='../images/tsne.png'):
    assert low_dim_embs.shape[0] >= len(labels), "More labels than embeddings"
    plt.figure(figsize=(18, 18))  #in inches
    for i, label in enumerate(labels):
        x, y = low_dim_embs[i,:]
        plt.scatter(x, y)
        plt.annotate(label,
                     xy=(x, y),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.savefig(filename)  # Write the figure to the path given in `filename`

In [None]:
tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
plot_only = 500
low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only,:])
labels = [reverse_dictionary[i] for i in range(plot_only)]
plot_with_labels(low_dim_embs, labels)

# And voila!

Now you have a visual understanding of embeddings.

Thank you for following this tutorial!