# Faux Semantle

As part of Homework 3, we will be playing two games of Semantle, a sometimes frustrating guessing game built on word vectors. In each game there is a **target word** that you must guess by typing in words; after each guess, you are told the cosine similarity between your guess and the target word. The word vectors come from the `spaCy` package that we have seen before.
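
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following is a minimal sketch using plain Python; the three-dimensional `toy_vectors` are invented for illustration (the `en_core_web_lg` vectors have 300 dimensions), and returning 0.0 for a zero vector is a convention chosen for this sketch.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption of this sketch: zero vectors score 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy "word vectors", not real spaCy values
toy_vectors = {
    "cat": [1.0, 0.9, 0.1],
    "lion": [0.9, 1.0, 0.2],
    "invoice": [0.0, 0.1, 1.0],
}

sim_cat_lion = cosine_similarity(toy_vectors["cat"], toy_vectors["lion"])
sim_cat_invoice = cosine_similarity(toy_vectors["cat"], toy_vectors["invoice"])
print(round(sim_cat_lion, 2), round(sim_cat_invoice, 2))
```

Vectors that point in nearly the same direction score close to 1, while unrelated directions score close to 0.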

In this version of the game, we will tell you what your top 5 guesses were, how similar those guesses were, and whether you are in the top 1000 most similar terms ("getting warmer"). You may type something that is not in the official vocabulary and for which we cannot estimate a similarity score -- but do not despair. The game may take many, many guesses, or it may be extremely easy. This is part of the point of the exercise. You may give up (after an honest attempt) by typing **STOP** in the input box.
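
The bookkeeping described above, keeping the five best guesses and checking a "getting warmer" cutoff, can be sketched in a few lines. The scores and the helper names `top_k` and `warmth_threshold` are hypothetical and are not part of the game code; the toy cutoff uses the 3rd best score rather than the 1000th.

```python
from operator import itemgetter

def top_k(guesses, k=5):
    """Return the k highest-scoring (word, score) pairs, best first."""
    return sorted(guesses.items(), key=itemgetter(1), reverse=True)[:k]

def warmth_threshold(vocab_scores, n=1000):
    """Score of the n-th most similar vocabulary word.

    Assumes vocab_scores is non-empty; falls back to the lowest
    score when the vocabulary has fewer than n words.
    """
    ranked = sorted(vocab_scores, reverse=True)
    return ranked[min(n, len(ranked)) - 1]

# Hypothetical similarity scores for six guesses
guesses = {"river": 0.12, "boat": 0.31, "water": 0.45,
           "fish": 0.28, "sky": 0.05, "lake": 0.52}
best_five = top_k(guesses)

# Hypothetical scores for a five-word vocabulary, with a toy cutoff at rank 3
cutoff = warmth_threshold([0.9, 0.1, 0.4, 0.7, 0.2], n=3)
water_is_warm = guesses["water"] >= cutoff
boat_is_warm = guesses["boat"] >= cutoff
print(best_five, cutoff, water_is_warm, boat_is_warm)
```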

The target words in Game 1 and Game 2 are fixed. We have a feeling that Game 1 will be much easier than Game 2. While we are interested in how many guesses it takes you to find each word, we are also interested in what you guess and what you try. So, try to be strategic -- if you get stuck, it may be worth trying something with a completely different meaning.

While you are playing, it will be good to take notes about challenges you face:

1. Once you know the answer for a given game, think about your guesses again. Which ones led to big changes in similarity? Are there any distinguishing features that stand out to you (there may not be)?
2. How long do you find yourself working within a specific semantic space? Do you generally find you always get closer and closer? How often do you move further away from the correct answer?
3. Once you know the answers, can you think of any factors that might make Game 1 easier or harder than Game 2?

## Gameplay

The game is not perfect, but it is very important that you do not look at the code yet. We will use some of the code in the subsequent questions when we probe the target words. If you find any bugs, please report them so we can make this game better 🙂

# NOTE: Run the first cell, then comment out the first line, restart the runtime/notebook, and re-run the first cell

In [None]:
 #!python -m spacy download en_core_web_lg  # <------ THIS LINE
# ^^^^^^ THIS LINE ^^^^^

Collecting en_core_web_lg==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.5/en_core_web_lg-2.2.5.tar.gz (827.9 MB)
     |████████████████████████████████| 827.9 MB 1.3 MB/s
Building wheels for collected packages: en-core-web-lg
  Building wheel for en-core-web-lg (setup.py) ... done
  Created wheel for en-core-web-lg: filename=en_core_web_lg-2.2.5-py3-none-any.whl size=829180942 sha256=5d4612a129a35929a464a94c5dd515436dd0f81b55a8507fdc1ca6d20ebfe8e5
  Stored in directory: /tmp/pip-ephem-wheel-cache-qpxxai58/wheels/11/95/ba/2c36cc368c0bd339b44a791c2c1881a1fb714b78c29a4cb8f5
Successfully built en-core-web-lg
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-2.2.5
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')


In [None]:
#@title Run this to load the data sources { vertical-output: true }
from nltk.corpus import words, treebank
import nltk
from numpy.random import choice as random_choice
import re
from operator import itemgetter
from collections import Counter
import spacy
from tqdm.notebook import tqdm as tqdm_notebook
import warnings
warnings.filterwarnings("ignore", message=r"\[W008\]", category=UserWarning)

nlp = spacy.load('en_core_web_lg')
nltk.download('words')
nltk.download('treebank')

vocabulary = set(words.words())
nlp_vocab = [
  w for w, count in Counter([x.lower() for x in treebank.words()]).most_common()
  ]
forbidden = 'STOP'
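# Remove the reserved word as a whole word; passing the bare string would be
# treated as the characters 'S', 'T', 'O', 'P'.
my_vocabulary = list(vocabulary.intersection(nlp_vocab).difference({forbidden}))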

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.
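
One detail worth knowing when excluding a reserved word such as `STOP` from a set: Python set methods accept any iterable, so a bare string is treated as a sequence of characters rather than as a single word. The following sketch uses an invented three-word set to show the difference; the variable names are for illustration only.

```python
sample_words = {"stop", "cat", "STOP"}
forbidden = "STOP"

# A bare string is iterated character by character: 'S', 'T', 'O', 'P'.
# symmetric_difference then ADDS those letters, and "STOP" itself survives.
by_chars = sample_words.symmetric_difference(forbidden)

# Wrapping the word in a set removes it as a whole word.
by_word = sample_words.difference({forbidden})

print(sorted(by_chars))
print(sorted(by_word))
```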


In [None]:
#@title Top secret game code
class Game():
  def __init__(self, vocab, target_word=None):
    self.vocab = vocab
    if target_word is None:
      self.target_word = str(random_choice(self.vocab))
    else:
      self.target_word = target_word
    self.top_1000 = 0.0
    self.nearest_neighbor = 0.0
    self.n_guesses = 0
    self.three_fifty = None
    self.neighbors = {}

  def compute_neighbors(self, nlp):
    if self.target_word is not None:
      for w in tqdm_notebook(self.vocab):
        self.neighbors[w] = self._similarity(nlp, w, self.target_word)
    tmp_neighbors = sorted(self.neighbors.items(), key=itemgetter(1))
    _, thousand_score = tmp_neighbors[-1000]
    _, nearest_score = tmp_neighbors[-2]
    self.top_1000 = thousand_score
    self.nearest_neighbor = nearest_score
    self.neighbors = tmp_neighbors

  def _similarity(self, nlp, w1, w2):
    if w1 in nlp.vocab and w2 in nlp.vocab:
      return nlp(w1).similarity(nlp(w2))
    else:
      return 0.
  
  def prune_vocab(self, re_str: str="", max_len: int=0):
    if max_len > 0:
      self.vocab = [x for x in self.vocab if len(x) < max_len]
    if re_str != "":  # compare by value; `is not` tests object identity
      regex = re.compile(re_str)
      self.vocab = [regex.match(x) for x in self.vocab]
      self.vocab = [x.group(0) for x in self.vocab if x is not None]
  
  def _assess_guess(self, nlp, current_guess):
      self.n_guesses += 1
      sim = self._similarity(nlp, current_guess, self.target_word)
      if current_guess=='STOP':
        print(f"All done! The correct word was \"{self.target_word}\".\n")
      elif current_guess==self.target_word:
        print(f"Congrats! The correct word was \"{self.target_word}\".\n")
      else:
        if sim >= self.top_1000:
          dist = 'within'
        else:
          dist = 'outside'
        if sim > self.nearest_neighbor:
          # Only the target itself should beat the nearest in-vocabulary
          # neighbor, so a higher score means the guess is outside the vocabulary.
          print("Woops, you've found a word outside the vocabulary!")
        else:
          # Record the guess and keep only the five best-scoring guesses.
          self.top_five[current_guess] = round(sim, 2)
          self.top_five = dict(list(
              sorted(self.top_five.items(), key=itemgetter(1), reverse=True)
            )[0:5])
          print(self.top_five)
          print(f"{current_guess} is {round(sim, 2)} close to the correct word! You are {dist} the top 1000 words.")
      return sim
  
  def play(self, nlp):
    print("Loading word vectors and computing neighbors...")
    self.compute_neighbors(nlp)
    print(f"Nearest neighbor similarity is: {round(self.nearest_neighbor, 2)}")
    self.top_five = {}
    current_guess = ''
    while current_guess!=self.target_word and current_guess!='STOP':
      current_guess = input("Next guess?: ")
      sim = self._assess_guess(nlp, current_guess)
    print(" ")
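
The `prune_vocab` step keeps only the words that match a regular expression. A minimal standalone sketch of that kind of filter follows; the helper name `keep_matching` and the sample words are hypothetical, and the pattern is the one passed to `prune_vocab` in the game cells.

```python
import re

def keep_matching(vocab, pattern):
    """Return the words in vocab that match the pattern from the start."""
    regex = re.compile(pattern)
    return [w for w in vocab if regex.match(w) is not None]

# Hypothetical sample; tokens with digits, punctuation, or capitals are dropped
sample = ["cat", "U.S.", "1980s", "biweekly", "Mr"]
lowercase_only = keep_matching(sample, r"(^[a-z]+$)")
print(lowercase_only)
```

Because the pattern is anchored with `^` and `$`, a word is kept only if every character is a lowercase ASCII letter.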

# Run the cell below to play Game 1
# Remember, you may always type STOP to end the game

In [None]:
#@title Run this to play Game 1
my_game = Game(my_vocabulary, target_word='jaguar')
my_game.prune_vocab(re_str=r"(^[a-z]+$)")
my_game.play(nlp)

Loading word vectors and computing neighbors...


  0%|          | 0/5274 [00:00<?, ?it/s]

Nearest neighbor similarity is: 0.57
Next guess?: name
{'name': 0.09}
name is 0.09 close to the correct word! You are outside the top 1000 words.
Next guess?: fruit
{'fruit': 0.1, 'name': 0.09}
fruit is 0.1 close to the correct word! You are outside the top 1000 words.
Next guess?: animal
{'animal': 0.32, 'fruit': 0.1, 'name': 0.09}
animal is 0.32 close to the correct word! You are within the top 1000 words.
Next guess?: cat
{'cat': 0.33, 'animal': 0.32, 'fruit': 0.1, 'name': 0.09}
cat is 0.33 close to the correct word! You are within the top 1000 words.
Next guess?: dog
{'cat': 0.33, 'animal': 0.32, 'dog': 0.23, 'fruit': 0.1, 'name': 0.09}
dog is 0.23 close to the correct word! You are within the top 1000 words.
Next guess?: lion
{'lion': 0.51, 'cat': 0.33, 'animal': 0.32, 'dog': 0.23, 'fruit': 0.1}
lion is 0.51 close to the correct word! You are within the top 1000 words.
Next guess?: cougar
{'lion': 0.51, 'cougar': 0.47, 'cat': 0.33, 'animal': 0.32, 'dog': 0.23}
cougar is 0.47 close

In [None]:
#@title Run this to play Game 2
my_game = Game(my_vocabulary, target_word='biweekly')
my_game.prune_vocab(re_str=r"(^[a-z]+$)")
my_game.play(nlp)

Loading word vectors and computing neighbors...


  0%|          | 0/5274 [00:00<?, ?it/s]

Nearest neighbor similarity is: 0.67
Next guess?: animal
{'animal': 0.03}
animal is 0.03 close to the correct word! You are outside the top 1000 words.
Next guess?: fruit
{'animal': 0.03, 'fruit': 0.01}
fruit is 0.01 close to the correct word! You are outside the top 1000 words.
Next guess?: pen
{'animal': 0.03, 'pen': 0.03, 'fruit': 0.01}
pen is 0.03 close to the correct word! You are outside the top 1000 words.
Next guess?: class
{'class': 0.04, 'animal': 0.03, 'pen': 0.03, 'fruit': 0.01}
class is 0.04 close to the correct word! You are outside the top 1000 words.
Next guess?: class
{'class': 0.04, 'animal': 0.03, 'pen': 0.03, 'fruit': 0.01}
class is 0.04 close to the correct word! You are outside the top 1000 words.
Next guess?: word
{'word': 0.05, 'class': 0.04, 'animal': 0.03, 'pen': 0.03, 'fruit': 0.01}
word is 0.05 close to the correct word! You are outside the top 1000 words.
Next guess?: next
{'next': 0.11, 'word': 0.05, 'class': 0.04, 'animal': 0.03, 'pen': 0.03}
next is 0.11