# **Programming Assessment \#5**

Names: Alyanna Abalos, Loben Tipan

More information on the assessment is found in our Canvas course.

# **Load Pre-trained Embeddings**

*While you don't have to separate your code into blocks, it might be easier if you separated loading / downloading your data from the main part of your solution. Consider placing all loading of data into the code block below.*

In [10]:
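import gensim.downloader as api
import nltk
from nltk.corpus import words

# Pre-trained 100-dimensional GloVe vectors (downloaded on first use)
word_vectors = api.load("glove-wiki-gigaword-100")

# NLTK's English word list, used to filter out non-dictionary tokens
nltk.download('words')
english_words = set(words.words())

# Candidate words: present in both the embedding vocabulary and the word list
valid_words = set(word_vectors.key_to_index) & english_words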



[nltk_data] Downloading package words to
[nltk_data]     /Users/alyannaabalos/nltk_data...
[nltk_data]   Package words is already up-to-date!


# **Your Implementation**

*Again, you don't have to have everything in one block. Use the notebook according to your preferences with the goal of fulfilling the assessment in mind.*

# Random Word Generator

In [11]:
import random

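# Convert once to a list: random.choice needs an indexable sequence
valid_words_list = list(valid_words)

def get_random_word():
    """Return a random target word; raises IndexError if valid_words_list is empty."""
    return random.choice(valid_words_list).lower()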
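
For repeatable test runs, it can help to draw the target word from a seeded generator instead of the global one. The sketch below is only an illustration: `get_seeded_word` and `sample_words` are hypothetical names that do not appear in the notebook, and the small set stands in for `valid_words`.

```python
import random

def get_seeded_word(words, seed):
    """Pick a word reproducibly (hypothetical helper, not part of the notebook)."""
    rng = random.Random(seed)  # private generator; the global random state is untouched
    # Sort first: set iteration order can differ between interpreter runs
    return rng.choice(sorted(words))

sample_words = {"regain", "retake", "reclaim", "regroup"}  # stand-in for valid_words
first = get_seeded_word(sample_words, seed=5)
second = get_seeded_word(sample_words, seed=5)  # same seed, same word
```

Using the same seed twice yields the same word, which makes a game session easy to replay while debugging.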

# Most Similar Words

In [12]:
def get_similar_words(word_vectors, target_word, valid_words, indices=(10, 50, 100), attempt_limit=10):
    needed = max(indices)
    vocab_size = len(word_vectors.key_to_index)
    topn = min(needed * attempt_limit, vocab_size)
    similar_words = []

    # Widen the search until enough dictionary words are found, stopping after
    # attempt_limit tries or once the whole vocabulary has been searched.
    for _ in range(attempt_limit):
        batch = word_vectors.most_similar(target_word, topn=topn)
        # most_similar returns (word, score) pairs, best match first
        similar_words = [(word, score) for word, score in batch if word in valid_words]
        if len(similar_words) >= needed or topn >= vocab_size:
            break
        topn = min(topn + needed, vocab_size)

    for idx in indices:
        if idx - 1 < len(similar_words):
            word, score = similar_words[idx - 1]
            print(f"{idx}th most similar word to '{target_word}': {word} with a similarity score of {score}")
        else:
            print(f"Only {len(similar_words)} similar words were found for '{target_word}', not enough to show the {idx}th word.")

    return [word for word, score in similar_words]
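
The core of this step is filtering a ranked neighbor list and then reading off fixed ranks. A minimal sketch on toy data, without loading the model, might look like the following; `pick_ranked` is a hypothetical helper and the scores are made up for illustration.

```python
def pick_ranked(neighbors, allowed, ranks):
    """Return {rank: (word, score)} for each 1-based rank that exists after filtering."""
    # Filtering keeps the original order, so the best match stays first
    kept = [(w, s) for w, s in neighbors if w in allowed]
    return {r: kept[r - 1] for r in ranks if 1 <= r <= len(kept)}

# Toy data in the (word, score) shape that most_similar returns; scores are made up
neighbors = [("retake", 0.9), ("xyzzy", 0.8), ("reclaim", 0.7), ("regroup", 0.6)]
allowed = {"retake", "reclaim", "regroup"}
picked = pick_ranked(neighbors, allowed, ranks=[1, 2, 5])
```

Here `picked` holds entries for ranks 1 and 2 only. Rank 5 is absent because just three neighbors survive the filter, which mirrors the "not enough to show" branch above.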

# Cosine Similarity

In [13]:
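def get_similarity_score(word_vectors, correct_word, guess, precision=8):
    """Return the cosine similarity of two words, or None if either is out of vocabulary."""
    try:
        similarity_score = word_vectors.similarity(correct_word, guess)
    except KeyError:
        # gensim raises KeyError for words missing from the vocabulary
        return None
    # Clamp to [-1, 1] in case floating-point rounding overshoots the valid range
    similarity_score = max(min(similarity_score, 1.0), -1.0)
    return round(similarity_score, precision)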
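
The score returned by `word_vectors.similarity` is the cosine of the angle between the two word vectors: the dot product divided by the product of the vector lengths. The sketch below computes the same quantity by hand on tiny made-up vectors; `cosine_similarity` is a hypothetical stand-alone function, not the gensim implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # parallel vectors, close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # perpendicular vectors
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # anti-parallel, close to -1

# Floating-point rounding can push a result slightly past 1 or -1, hence the clamp
clamped = max(-1.0, min(1.0, same_direction))
```

Parallel vectors score close to 1 regardless of their length, which is why a word compared with itself reaches the top score in the game.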

# Semantle Replication

In [17]:
def semantle_dupe():
    try:
        random_word = get_random_word()
    except IndexError:
        # random.choice raises IndexError when the word list is empty
        print("No valid words found in the model's vocabulary.")
        return

    print(f"Randomly selected word: {random_word}")
    print("\nClosest...")
    
    try:
        get_similar_words(word_vectors, random_word, valid_words, indices=[10, 50, 100])
    except KeyError:
        print(f"'{random_word}' is missing from the model's vocabulary.")
        return

    print("\nInput your guess. Type 'q' to exit.\n")
    while True:
        user_guess = input("Enter your guess: ").strip().lower()
        if user_guess == 'q':
            print(f"The target word was: {random_word}")
            break
        similarity = get_similarity_score(word_vectors, random_word, user_guess)
        if similarity is not None:
            print(f"Similarity score: {similarity}")
            if abs(similarity - 1.0) < 1e-7:
                print("Great job on guessing the word!")
                break
        else:
            print("Word not found in the model's vocabulary")

if __name__ == "__main__":
    semantle_dupe()

Randomly selected word: regain

Closest...
10th most similar word to 'regain': retake with a similarity score of 0.622226357460022
50th most similar word to 'regain': regroup with a similarity score of 0.532402515411377
100th most similar word to 'regain': reach with a similarity score of 0.497180312871933

Input your guess. Type 'q' to exit.



Enter your guess:  take


Similarity score: 0.5233028531074524


Enter your guess:  reclaim


Similarity score: 0.7803837656974792


Enter your guess:  regain


Similarity score: 1.0
Great job on guessing the word!
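
Because the game loop reads from `input`, it is awkward to test automatically. One possible approach, sketched below under stated assumptions, is to replay a prepared list of guesses against a scoring function. `play`, `toy_score`, and `toy_scores` are hypothetical names, and the scores are rounded stand-ins for the values printed in the session above.

```python
def play(target, guesses, score_fn):
    """Replay guesses in order; return the (guess, score) log and whether the target was hit."""
    log = []
    for guess in guesses:
        guess = guess.strip().lower()  # normalize the same way as interactive input
        if guess == "q":
            break
        score = score_fn(target, guess)
        log.append((guess, score))  # score is None for out-of-vocabulary words
        if guess == target:
            return log, True
    return log, False

# Rounded stand-in scores; a real run would call the embedding model instead
toy_scores = {"take": 0.52, "reclaim": 0.78, "regain": 1.0}

def toy_score(target, guess):
    return toy_scores.get(guess)

log, won = play("regain", [" Take", "banana", "reclaim", "regain", "never reached"], toy_score)
```

The replay stops at the correct guess, so the final entry in the list is never scored, and the unknown word `banana` is logged with a score of `None`.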
