# SI630 Homework 2: Word2vec Vector Analysis

*Important Note:* Start this notebook only after you've gotten your word2vec model up and running!

Many NLP packages support working with word embeddings. In this notebook you can work through the various problems assigned in Task 3. We've provided the basic functionality for loading and working with word vectors using [Gensim](https://radimrehurek.com/gensim/models/keyedvectors.html), a good library for learning and using word vectors.

One of the fun parts of word vectors is getting a sense of what they learned. Feel free to explore the vectors here! 

In [2]:
from gensim.models import KeyedVectors
from gensim.test.utils import datapath

In [3]:
word_vectors = KeyedVectors.load_word2vec_format('Word2Vec_Model', binary=False)
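
With `binary=False`, the loader reads the plain-text word2vec format: a header line holding the vocabulary size and the vector dimension, followed by one line per word with its space-separated components. The sketch below parses that layout with only the standard library, to show what the file is expected to contain. The function name `parse_word2vec_text` and the sample data are made up for illustration; this is not how Gensim implements loading.

```python
import io

def parse_word2vec_text(handle):
    """Parse word2vec text format: a 'vocab_size dim' header, then 'word v1 v2 ...' lines."""
    header = handle.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        if not line.strip():
            continue  # tolerate blank lines
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            raise ValueError(f"expected {dim} values for word {parts[0]!r}")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError("header vocabulary size does not match the number of rows")
    return vectors

# Hypothetical two-word, three-dimensional file contents.
sample_file = io.StringIO("2 3\nthe 0.1 0.2 0.3\nbooks 0.4 0.5 0.6\n")
toy_vectors = parse_word2vec_text(sample_file)
print(toy_vectors["books"])
```

If `load_word2vec_format` fails on your own model file, checking it against this layout (header first, consistent dimension on every row) is a reasonable first debugging step.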

In [4]:
word_vectors['the']

array([ 0.25596148,  0.27567005,  0.24742034,  0.31901658, -0.32142407,
       -0.2558528 , -0.24441342, -0.32059112,  0.38258576,  0.3101376 ,
        0.27823207,  0.33833006,  0.31902236, -0.28096232,  0.3110848 ,
        0.33002234,  0.14471766,  0.33966514, -0.29717088,  0.22672425,
        0.3324133 , -0.2197797 , -0.29371443,  0.2642938 , -0.26269996,
       -0.23402067,  0.32403487,  0.33519778,  0.29100204, -0.26296166,
       -0.28934327, -0.3110694 , -0.25694886, -0.2846095 , -0.30052316,
       -0.34136784, -0.31313464, -0.24052876,  0.2446698 , -0.2857243 ,
        0.20365894,  0.24637224,  0.24282731, -0.18036401,  0.29452068,
       -0.32079816,  0.20149298,  0.2767359 ,  0.28902394, -0.31774268],
      dtype=float32)

In [5]:
word_vectors.similar_by_word("books")

[('songs', 0.9999173283576965),
 ('paintings', 0.9998435378074646),
 ('encouraged', 0.9998093843460083),
 ('close', 0.9998033046722412),
 ('critical', 0.9997953772544861),
 ('supported', 0.9997947216033936),
 ('musicians', 0.9997925162315369),
 ('gabirol', 0.9997822046279907),
 ('stories', 0.9997695088386536),
 ('inspired', 0.999757707118988)]

In [26]:
target_words = [
    'doctor',
    'technical',
    'motivated',
    'ethical',
    'earnings',
    'sports',
    'dancer',
    'artistic',
    'lazy'
]

for t in target_words:
    print(f"Target word: {t}")
    print(word_vectors.similar_by_word(t))
    print("")

Target word: doctor
[('founding', 0.999644935131073), ('founder', 0.9995342493057251), ('chair', 0.9995135068893433), ('mathematics', 0.9993555545806885), ('foreign', 0.9993261098861694), ('schools', 0.9992579817771912), ('medicine', 0.9992494583129883), ('associate', 0.9992328882217407), ('fellow', 0.9991998076438904), ('taught', 0.9991545081138611)]

Target word: technical
[('industrial', 0.9999467134475708), ('environmental', 0.9999301433563232), ('finance', 0.9998579621315002), ('obtained', 0.9998190402984619), ('applied', 0.9998027682304382), ('biology', 0.9997803568840027), ('advisory', 0.9997748136520386), ('sociology', 0.9997384548187256), ('pontifical', 0.9997186660766602), ('directors', 0.9997177124023438)]

Target word: motivated
[('documented', 0.9999889731407166), ('interesting', 0.9999594688415527), ('looked', 0.9999566674232483), ('bit', 0.9999517798423767), ('perhaps', 0.9999509453773499), ('truly', 0.9999485611915588), ('incorporate', 0.9999483823776245), ('funds', 0.9

For most of the target words above, the nearest neighbors appear to be semantically related to the target. For example, doctor returned terms such as chair, schools, medicine, and taught. The target word technical returned industrial, finance, applied, sociology, and more.

Some exceptions include sports, where the top terms were chinese, african, middle, vitality, and banking. Some of these seem peripherally related at best (e.g. chinese gymnastics, african football), while others seem completely off (e.g. banking, vitality).
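
The scores returned by `similar_by_word` are cosine similarities between the target vector and every other vector in the vocabulary. As a minimal sketch of that ranking, the code below computes cosine similarity by hand and sorts a toy vocabulary by it. The vectors and the helper names (`cosine_similarity`, `nearest_neighbors`) are hypothetical and exist only for illustration; Gensim performs the equivalent computation with NumPy over the full vocabulary.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def nearest_neighbors(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`, highest first."""
    target = vectors[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up 3-dimensional vectors, not taken from the trained model.
toy_vectors = {
    "books": [1.0, 0.0, 0.0],
    "stories": [0.9, 0.1, 0.0],
    "songs": [0.5, 0.5, 0.0],
    "banking": [0.0, 0.0, 1.0],
}

neighbors = nearest_neighbors("books", toy_vectors)
for word, score in neighbors:
    print(f"{word}: {score:.4f}")
```

Because cosine similarity depends only on direction, a model whose vectors all point in nearly the same direction will report scores close to 1 for almost every pair, which is worth keeping in mind when reading the outputs above.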

In [6]:
def get_analogy(a, b, c):
    return word_vectors.most_similar(positive=[b, c], negative=[a])[0][0]

In [75]:
get_analogy('immigration', 'policy', 'legal')

'social'

In [66]:
get_analogy('safety', 'fund', 'republican')

'liberal'

In [53]:
get_analogy('male', 'female', 'doctor')

'editor'

In [51]:
get_analogy('college', 'university', 'technical')

'state'

In [49]:
get_analogy('sports', 'studying', 'physical')

'dictation'

In [45]:
get_analogy('baseball', 'boxing', 'relaxing')

'kamikaze'

Above are six examples of word analogies in the style of queen - woman = king - man. Since get_analogy(a, b, c) returns the word nearest to b - a + c, each result can be read as result - b = c - a. Some of these appear to be pretty sound, others are puzzling. For example, *social* - policy = legal - immigration feels pretty accurate. Another accurate comparison might be *dictation* - studying = physical - sports. One example that's a little puzzling, but not inaccurate if you think about it, is *kamikaze* - boxing = relaxing - baseball. Kamikaze refers to the WW2 kamikaze attacks, in which a Japanese pilot would deliberately crash an aircraft loaded with explosives into an enemy target, killing the pilot. This form of suicide feels extreme, but it is not exactly off the mark when associated with boxing, which makes extreme physical demands.
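
To make the vector arithmetic behind these analogies concrete, here is a minimal sketch of the additive analogy method: build the query b - a + c from unit-normalized vectors and return the nearest remaining word by cosine similarity. The vectors are hypothetical and `toy_analogy` is an illustrative helper, not Gensim's implementation, although `most_similar` with `positive` and `negative` lists follows roughly the same idea and also excludes the input words from the results.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def toy_analogy(a, b, c, vectors):
    """Return the word whose vector is closest to b - a + c, excluding a, b, and c."""
    va, vb, vc = (normalize(vectors[w]) for w in (a, b, c))
    query = normalize([y - x + z for x, y, z in zip(va, vb, vc)])
    best_word, best_score = None, float("-inf")
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # skip the query words themselves
        # Dot product of unit vectors equals cosine similarity.
        score = sum(q * x for q, x in zip(query, normalize(vec)))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Made-up vectors: dimensions loosely stand for (royalty, male, female).
toy_vectors = {
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "prince": [1.0, 1.0, 0.2],
}

print(toy_analogy("man", "woman", "king", toy_vectors))
```

The argument order matches `get_analogy(a, b, c)` above: `a` is subtracted, while `b` and `c` are added.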

## Debiased Model Results

In [18]:
debiased_word_vectors = KeyedVectors.load_word2vec_format('Debiased_Word2Vec_Model', binary=False)

In [21]:
import csv

# The with-block closes the file automatically, so no explicit close() is needed.
with open('word_pair_similarity_predictions.csv', newline='') as f:
    reader = csv.reader(f)
    incomplete = list(reader)

incomplete[0]

['word1', 'word2', 'sim']

In [20]:
# Score each (word1, word2) pair with the debiased model, skipping the header row.
complete = []
for word1, word2, *_ in incomplete[1:]:
    complete.append([word1, word2, debiased_word_vectors.similarity(word1, word2)])

complete

[['old', 'new', 0.8739896],
 ['smart', 'intelligent', 0.9959426],
 ['hard', 'difficult', 0.961006],
 ['happy', 'cheerful', 0.9766615],
 ['hard', 'easy', 0.98150903],
 ['fast', 'rapid', 0.989959],
 ['happy', 'glad', 0.9993435],
 ['short', 'long', 0.9945617],
 ['stupid', 'dumb', 0.9916815],
 ['weird', 'strange', 0.9998329],
 ['wide', 'narrow', 0.8304432],
 ['bad', 'awful', 0.9980965],
 ['easy', 'difficult', 0.99586153],
 ['bad', 'terrible', 0.9981045],
 ['hard', 'simple', 0.9806286],
 ['smart', 'dumb', 0.9955488],
 ['insane', 'crazy', 0.9991684],
 ['happy', 'mad', 0.9518202],
 ['large', 'huge', 0.9728311],
 ['hard', 'tough', 0.9854515],
 ['new', 'fresh', 0.9426073],
 ['sharp', 'dull', 0.9896795],
 ['quick', 'rapid', 0.9883007],
 ['dumb', 'foolish', 0.9875476],
 ['wonderful', 'terrific', 0.9989617],
 ['strange', 'odd', 0.99894124],
 ['happy', 'angry', 0.99707395],
 ['narrow', 'broad', 0.87837625],
 ['simple', 'easy', 0.99912465],
 ['old', 'fresh', 0.9688447],
 ['apparent', 'obvious', 0.99

In [23]:
# Overwrite the CSV with the original header row followed by the predicted similarities.
with open('word_pair_similarity_predictions.csv', 'w', newline='') as completed_file:
    writer = csv.writer(completed_file)
    writer.writerow(incomplete[0])
    writer.writerows(complete)