# Homework 5: Distributional semantics

This is due on **11/27 (11:55pm)**, submitted electronically. 

## How to do this problem set

Most of these questions require writing Python code and computing results; the rest have textual answers. Write all the textual answers in this document, show the output of your experiments in this document, and implement the functions in `distsim.py`. Once you are finished, upload this `.ipynb` file and `distsim.py` to Moodle.

* When creating your final version of the problem set to hand in, please do a fresh restart and execute every cell in order.  Then you'll be sure it's actually right.  Make sure to press "Save"!

**Your Name:**

**List collaborators, and how you collaborated, here:** (see our [grading and policies page](http://people.cs.umass.edu/~brenocon/inlp2016/grading.html) for details on our collaboration policy).

* _Aditya Shastry_ 

## Cosine Similarity

Recall that, where $i$ indexes over the context types, cosine similarity is defined as follows. $x$ and $y$ are both vectors of context counts (each for a different word), where $x_i$ is the count of context $i$.

$$cossim(x,y) = \frac{ \sum_i x_i y_i }{ \sqrt{\sum_i x_i^2} \sqrt{\sum_i y_i^2} }$$

The nice thing about cosine similarity is that it is normalized: for non-negative vectors such as context counts, the output is always between 0 and 1 (for arbitrary real-valued vectors the range is -1 to 1). One way to think of this is that cosine similarity is just, um, the cosine of the angle between the two vectors, and two vectors with no negative entries are never more than 90 degrees apart. Another way to think of it is to work through the situations of maximum and minimum similarity between two context vectors, starting from the definition above.
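
As a short worked version of those two extremes, under the assumption that $x$ and $y$ are non-negative, non-zero count vectors: if $y$ is a positive multiple of $x$, say $y = c\,x$ with $c > 0$, then

$$cossim(x, cx) = \frac{ c \sum_i x_i^2 }{ \sqrt{\sum_i x_i^2} \; c \sqrt{\sum_i x_i^2} } = 1$$

and if the two words never share a context, so that $x_i y_i = 0$ for every $i$, then

$$cossim(x,y) = \frac{ 0 }{ \sqrt{\sum_i x_i^2} \sqrt{\sum_i y_i^2} } = 0$$

Every other pair of count vectors falls between these two cases.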

Note: a good way to understand the cosine similarity function is that the numerator cares about whether the $x$ and $y$ vectors are correlated. If $x$ and $y$ tend to have high values for the same contexts, the numerator tends to be big. The denominator can be thought of as a normalization factor: if all the values of $x$ are really large, for example, dividing by the square root of their sum-of-squares prevents the whole thing from getting arbitrarily large. In fact, dividing by both these things (aka their norms) means the whole thing can’t go higher than 1.
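
To make the numerator and denominator concrete, here is a minimal sketch of the formula for sparse vectors stored as Python dictionaries mapping a context to its count. The function name `cosine_sparse`, the toy counts, and the choice to return 0 for an empty vector are assumptions made for illustration; this is not necessarily how `cossim_sparse` in `distsim.py` is or should be written.

```python
import math

def cosine_sparse(v1, v2):
    """Cosine similarity of two sparse vectors given as {context: count} dicts."""
    # Iterate over the smaller dict; only shared contexts add to the dot product.
    if len(v1) > len(v2):
        v1, v2 = v2, v1
    dot = sum(count * v2[ctx] for ctx, count in v1.items() if ctx in v2)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention assumed here for all-zero or empty vectors
    return dot / (norm1 * norm2)

# Toy context counts, made up for illustration (not taken from the data file).
x = {"the_": 2, "_barked": 1}
y = {"the_": 4, "_barked": 2}   # same direction as x, twice as large
z = {"of_": 3}                  # shares no context with x

sim_xy = cosine_sparse(x, y)    # approximately 1.0: scaling does not matter
sim_xz = cosine_sparse(x, z)    # 0.0: no shared contexts
```

Doubling every count in `y` leaves the similarity unchanged, which is the normalization the note describes.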

## Question 1 (10 points)

See the file `nytcounts.university_cat_dog`, which contains context count vectors for three words: “dog”, “cat”, and “university”. These are immediate left and right contexts from a New York Times corpus. You can open the file in a text editor since it’s quite small.

Please complete `cossim_sparse(v1,v2)` in `distsim.py` to compute and display the cosine similarities between each pair of these words. Briefly comment on whether the relative similarities make sense.

Note that we’ve provided very simple code that tests the context count vectors from the data file.

In [1]:
import distsim; reload(distsim)

word_to_ccdict = distsim.load_contexts("nytcounts.university_cat_dog")
print "Cosine similarity between cat and dog" ,distsim.cossim_sparse(word_to_ccdict['cat'],word_to_ccdict['dog'])
print "Cosine similarity between cat and university" ,distsim.cossim_sparse(word_to_ccdict['cat'],word_to_ccdict['university'])
print "Cosine similarity between university and dog" ,distsim.cossim_sparse(word_to_ccdict['university'],word_to_ccdict['dog'])

file nytcounts.university_cat_dog has contexts for 3 words
Cosine similarity between cat and dog 0.966891672715
Cosine similarity between cat and university 0.660442421144
Cosine similarity between university and dog 0.659230248969


**Write your response here:**

## Question 2 (15 points)

Implement `show_nearest()`. 
Given a dictionary of word-context vectors, the context vector of a particular query word `w`, the set of words to exclude from the results (in this question, the query word `w` itself), and the similarity metric to use (here, the `cossim_sparse` function you just implemented), `show_nearest()` finds the 20 words most similar to `w`. For each, display the other word and its similarity to the query word `w`.
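
The ranking step described above could be sketched roughly as follows. The names `nearest_words` and `cosine_sparse`, the `k` parameter, and the toy vectors are hypothetical; the real `show_nearest()` in `distsim.py` may differ in its return value and display format.

```python
import math

def cosine_sparse(v1, v2):
    """Cosine similarity of two {context: count} dicts (illustrative helper)."""
    dot = sum(c * v2[ctx] for ctx, c in v1.items() if ctx in v2)
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def nearest_words(word_to_vec, query_vec, exclude, sim, k=20):
    """Return up to k (similarity, word) pairs, most similar first."""
    scored = [(sim(query_vec, vec), word)
              for word, vec in word_to_vec.items()
              if word not in exclude]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy vectors, made up for illustration.
toy = {
    "dog": {"the_": 3, "_barked": 2},
    "cat": {"the_": 3, "_meowed": 2},
    "university": {"of_": 4, "the_": 1},
}
ranked = nearest_words(toy, toy["dog"], {"dog"}, cosine_sparse)
ranked_words = [word for _, word in ranked]   # ['cat', 'university']
```

Passing the similarity function as an argument means the same ranking code can be reused with a different metric.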

To make sure it’s working, feel free to use the small `nytcounts.university_cat_dog` database as follows.

In [2]:
import distsim
reload(distsim)
word_to_ccdict = distsim.load_contexts("nytcounts.university_cat_dog")
distsim.show_nearest(word_to_ccdict, word_to_ccdict['dog'], set(['dog']), distsim.cossim_sparse)

file nytcounts.university_cat_dog has contexts for 3 words


['cat', 'university']

## Question 3 (20 points)

Explore similarities in `nytcounts.4k`, which contains context counts for about 4000 words in a sample of the New York Times. The news data was lowercased and URLs were removed. The context counts are for the 2000 most common words on Twitter, as well as the 2000 most common words in the New York Times. (But all context counts are from the New York Times.) The context counts only contain contexts that appeared for more than one word. The file `vocab` contains the list of all terms in this data, along with their total frequency.
Choose **six** words. For each, show the output of `show_nearest()` and comment on whether the output makes sense. Comment on whether this approach to distributional similarity makes more or less sense for certain terms.
Four of your words should be:

 * a name (for example: person, organization, or location)
 * a common noun
 * an adjective
 * a verb

You may also want to try exploring further words that are returned from a most-similar list from one of these. You can think of this as traversing the similarity graph among words.

*Implementation note:* 
On my laptop, `load_contexts()` needs several hundred MB of memory to load the file. If you don’t have enough memory available, your computer will get very slow because the OS will start swapping. If you have to use a machine without that much memory, you can instead take a streaming approach, using the `stream_contexts()` generator function to access the data; this lets you iterate through the data from disk, one vector at a time, without putting everything into memory. You can see its use in the loading function. (You could alternatively use a key-value store or another type of database, but that’s too much work for this assignment.)
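
A rough sketch of the streaming idea follows. It assumes a generator that yields `(word, context_dict)` pairs one at a time; the exact output format of `stream_contexts()` is not shown in this document, so `toy_stream` below is a hypothetical stand-in, and `nearest_streaming` and `cosine_sparse` are illustrative names. The query word's own vector would have to be found first, for example in an earlier pass over the file.

```python
import heapq
import math

def cosine_sparse(v1, v2):
    """Cosine similarity of two {context: count} dicts (illustrative helper)."""
    dot = sum(c * v2[ctx] for ctx, c in v1.items() if ctx in v2)
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def nearest_streaming(vector_stream, query_vec, exclude, k=20):
    """Score vectors as they arrive; heapq.nlargest retains only k candidates."""
    scored = ((cosine_sparse(query_vec, vec), word)
              for word, vec in vector_stream
              if word not in exclude)
    return heapq.nlargest(k, scored)

def toy_stream():
    # Hypothetical stand-in for a generator that reads vectors from disk.
    yield "dog", {"the_": 3, "_barked": 2}
    yield "cat", {"the_": 3, "_meowed": 2}
    yield "university", {"of_": 4, "the_": 1}

query = {"the_": 3, "_barked": 2}   # the vector for "dog"
top = nearest_streaming(toy_stream(), query, {"dog"}, k=1)
top_words = [word for _, word in top]   # ['cat']
```

Because the scores come from a generator expression, only the current vector and the `k` best candidates are held in memory at any time.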

*Extra note:* 
You don’t need this, but for reference, the preprocessing scripts we used to create the context data are in the `preproc/` directory.

In [2]:
import distsim; reload(distsim)
word_to_ccdict = distsim.load_contexts("nytcounts.4k")
###Provide your answer below; perhaps in another cell so you don't have to reload the data each time

file nytcounts.4k has contexts for 3648 words


In [3]:
### Name of a person
distsim.show_nearest(word_to_ccdict, word_to_ccdict['john'],set(['john']),distsim.cossim_sparse)

['peter',
 'joseph',
 'robert',
 'david',
 'james',
 'richard',
 'william',
 'andrew',
 'charles',
 'daniel',
 'eric',
 'stephen',
 'mark',
 'jonathan',
 'anthony',
 'steven',
 'susan',
 'jim',
 'christopher',
 'edward']

In [5]:
### Common Noun - inanimate
distsim.show_nearest(word_to_ccdict, word_to_ccdict['restaurant'],set(['restaurant']),distsim.cossim_sparse)

['hotel',
 'hospital',
 'studio',
 'gym',
 'newspaper',
 'bar',
 'table',
 'team',
 'store',
 'situation',
 'book',
 'car',
 'settlement',
 'moment',
 'farm',
 'movie',
 'song',
 'program',
 'project',
 'scene']

In [6]:
### Adjective
distsim.show_nearest(word_to_ccdict, word_to_ccdict['good'],set(['good']),distsim.cossim_sparse)

['strong',
 'rare',
 'wonderful',
 'terrible',
 'small',
 'tough',
 'bad',
 'simple',
 'large',
 'strange',
 'healthy',
 'lovely',
 'nice',
 'special',
 'sharp',
 'huge',
 'brief',
 'tight',
 'statement',
 'single']

In [7]:
### Verb
distsim.show_nearest(word_to_ccdict, word_to_ccdict['eat'],set(['eat']),distsim.cossim_sparse)

['marry',
 'shoot',
 'hide',
 'stop',
 'sell',
 'kill',
 'buy',
 'teach',
 'treat',
 'win',
 'grow',
 'steal',
 'help',
 'watch',
 'write',
 'pass',
 'burn',
 'produce',
 'draw',
 'hear']

In [8]:
### Common Noun - animate
distsim.show_nearest(word_to_ccdict, word_to_ccdict['man'],set(['man']),distsim.cossim_sparse)

['woman',
 'doctor',
 'person',
 'boy',
 'girl',
 'guy',
 'kid',
 'child',
 'patient',
 'student',
 'song',
 'car',
 'tree',
 'soldier',
 'dog',
 'giant',
 'pitcher',
 'reporter',
 'restaurant',
 'player']

In [14]:
### Adjective
distsim.show_nearest(word_to_ccdict, word_to_ccdict['terrible'],set(['terrible']),distsim.cossim_sparse)

['wonderful',
 'lovely',
 'small',
 'rare',
 'huge',
 'strange',
 'strong',
 'good',
 'large',
 'brief',
 'single',
 'giant',
 'special',
 'brilliant',
 'massive',
 'statement',
 'sharp',
 'tiny',
 'handsome',
 'great']

**Response to Question 3:**

From the outputs shown above, cosine similarity retrieves words that share the query word's part-of-speech (POS) tag. E.g., a proper noun like john returns more proper nouns, and an adjective like good returns mostly adjectives. So this could be used to identify the POS of various words from one or more known examples of each tag.

While it is good at identifying words that share the same POS tag, the ranking does not reliably reflect which words are closest in meaning to the given word. E.g., for restaurant, one would expect kitchen to rank above hospital, yet hospital is ranked second and kitchen does not appear in the top 20.

## Question 4 (10 points)

In the next several questions, you'll examine similarities in trained word embeddings, instead of raw context counts.

See the file `nyt_word2vec.university_cat_dog`, which contains word embedding vectors pretrained by word2vec [1] for three words: “dog”, “cat”, and “university”. You can open the file in a text editor since it’s quite small.

Please complete `cossim_dense(v1,v2)` in `distsim.py` to compute and display the cosine similarities between each pair of these words.

*Implementation note:*
Notice that the inputs of `cossim_dense(v1,v2)` are numpy arrays. If you are not very familiar with the basic operations in numpy, you can find some examples in the basic operations section here:
https://docs.scipy.org/doc/numpy-dev/user/quickstart.html

If you know how to use Matlab but haven't tried numpy before, the following link should be helpful:
https://docs.scipy.org/doc/numpy-dev/user/numpy-for-matlab-users.html
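
For readers new to numpy, the same cosine formula on dense arrays might look like the sketch below. The name `cosine_dense` and the toy vectors are illustrative assumptions, not the required `cossim_dense` implementation. It relies on `np.dot` and `np.linalg.norm`, and assumes neither input is the all-zero vector.

```python
import numpy as np

def cosine_dense(v1, v2):
    """Cosine similarity of two 1-D numpy arrays (assumes neither is all zeros)."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# Toy vectors, made up for illustration.
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])     # same direction as a
c = np.array([-1.0, -2.0, 0.0])   # opposite direction to a

sim_ab = cosine_dense(a, b)   # approximately 1.0
sim_ac = cosine_dense(a, c)   # approximately -1.0
```

Unlike count vectors, dense embeddings can have negative components, so the similarity can be negative.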

[1] Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." NIPS 2013.

In [10]:
import distsim; reload(distsim)
word_to_vec_dict = distsim.load_word2vec("nyt_word2vec.university_cat_dog")
print "Cosine similarity between cat and dog" ,distsim.cossim_dense(word_to_vec_dict['cat'],word_to_vec_dict['dog'])
print "Cosine similarity between cat and university" ,distsim.cossim_dense(word_to_vec_dict['cat'],word_to_vec_dict['university'])
print "Cosine similarity between university and dog" ,distsim.cossim_dense(word_to_vec_dict['university'],word_to_vec_dict['dog'])

word_to_vec_dict = distsim.load_word2vec("nyt_word2vec.university_cat_dog")
print distsim.show_nearest(word_to_vec_dict, word_to_vec_dict['dog'], set(['dog']),distsim.cossim_dense)

Cosine similarity between cat and dog 0.827517295965
Cosine similarity between cat and university -0.205394745036
Cosine similarity between university and dog -0.190753135501
['cat', 'university']


## Question 5 (25 points)

Repeat the process from Question 3, but now use the dense vectors from word2vec. Comment on whether the outputs make sense. Compare the outputs of `show_nearest()` on the word2vec vectors with its outputs on the sparse context vectors (so we suggest you use the same words as in Question 3). Which method works better on the query words you chose? Please briefly explain why one method works better than the other in each case.

Note that we used the default parameters of word2vec in [gensim](https://radimrehurek.com/gensim/models/word2vec.html) to train the word embeddings.

In [4]:
import distsim; reload(distsim)
word_to_vec_dict = distsim.load_word2vec("nyt_word2vec.4k")
###Provide your answer below

In [5]:
### Name of a person
distsim.show_nearest(word_to_vec_dict, word_to_vec_dict['john'],set(['john']),distsim.cossim_dense)

['paul',
 'joseph',
 'william',
 'edward',
 'richard',
 'james',
 'robert',
 'charles',
 'donald',
 'david',
 'thomas',
 'patrick',
 'peter',
 'anthony',
 'stephen',
 'andrew',
 'mark',
 'alan',
 'jonathan',
 'michael']

In [18]:
### Common Noun - inanimate
distsim.show_nearest(word_to_vec_dict, word_to_vec_dict['restaurant'],set(['restaurant']),distsim.cossim_dense)

['hotel',
 'shop',
 'bar',
 'mall',
 'restaurants',
 'store',
 'chef',
 'factory',
 'pizza',
 'kitchen',
 'menu',
 'garden',
 'beer',
 'breakfast',
 'hotels',
 'apartment',
 'gym',
 'starbucks',
 'studio',
 'lunch']

In [19]:
### Adjective
distsim.show_nearest(word_to_vec_dict, word_to_vec_dict['good'],set(['good']),distsim.cossim_dense)

['bad',
 'tough',
 'nice',
 'great',
 'happy',
 'smart',
 'fun',
 'healthy',
 'lucky',
 'positive',
 'big',
 'hard',
 'stupid',
 'easy',
 'terrible',
 'best',
 'perfect',
 'better',
 'really',
 'weird']

In [20]:
### Verb
distsim.show_nearest(word_to_vec_dict, word_to_vec_dict['eat'],set(['eat']),distsim.cossim_dense)

['drink',
 'enjoy',
 'sleep',
 'feed',
 'breathe',
 'wear',
 'forget',
 'ate',
 'burn',
 'get',
 'eating',
 'treat',
 'smell',
 'buy',
 'listen',
 'sit',
 'see',
 'cook',
 'stick',
 'hang']

In [21]:
### Common Noun - animate
distsim.show_nearest(word_to_vec_dict, word_to_vec_dict['man'],set(['man']),distsim.cossim_dense)

['woman',
 'boy',
 'girl',
 'guy',
 'soldier',
 'person',
 'kid',
 'someone',
 'dog',
 'girlfriend',
 'doctor',
 'cat',
 'driver',
 'child',
 'friend',
 'men',
 'hero',
 'actor',
 'character',
 'smile']

In [22]:
### Adjective
distsim.show_nearest(word_to_vec_dict, word_to_vec_dict['terrible'],set(['terrible']),distsim.cossim_dense)

['horrible',
 'sad',
 'strange',
 'scary',
 'shame',
 'weird',
 'stupid',
 'bad',
 'ridiculous',
 'fantastic',
 'worst',
 'funny',
 'true',
 'definitely',
 'good',
 'wonderful',
 'basically',
 'serious',
 'tough',
 'sorry']

**Response to Question 5:**

The word2vec vectors also return similar words that tend to share the query's POS tag. In addition, they retain topical context from the query word. E.g., for restaurant, the most similar words include hotel, shop, bar, and chef, which are more closely associated with restaurant than hospital, studio, and gym, which rank near the top of the sparse-vector list. Beyond context, for adjectives some form of sentiment also seems to be reflected in the similar words chosen. E.g., good returns great and happy (compared to small and terrible from the sparse vectors), and terrible returns horrible, sad, and scary (compared to wonderful and lovely from the sparse vectors). There was no noticeable difference for proper nouns. So, for this corpus and the examples above, the word2vec dense representation seems to give better results.

## Question 7 (15 points)
Once you have word embeddings, one of the interesting things you can do is perform analogical reasoning tasks. In the following example, we provide code that finds the closest word to the vector $v_{king}-v_{man}+v_{woman}$ to fill the blank in the question:

king : man = ____ : woman

Note that word2vec is trained in an unsupervised manner; it is impressive that it can apparently do an interesting type of reasoning.  (For a contrary opinion, see [Linzen 2016](http://www.aclweb.org/anthology/W/W16/W16-2503.pdf).)

Please come up with another analogical reasoning task (another triple of words), and output the answer using the same method. Comment on whether the output makes sense. If the output makes sense, explain why we can capture such a relation between words using an unsupervised algorithm. Where does the information come from? On the other hand, if the output does not make sense, propose an explanation for why the algorithm fails in this case.
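
To see the vector arithmetic in isolation, here is a small sketch on hand-made three-dimensional vectors. The dimensions (roughly "royal", "male", "female"), the vocabulary, and the helper names `cosine_dense` and `solve_analogy` are all invented for illustration; real word2vec dimensions are not interpretable this way, and this is not the code in `distsim.py`.

```python
import numpy as np

def cosine_dense(v1, v2):
    """Cosine similarity of two 1-D numpy arrays (assumes neither is all zeros)."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def solve_analogy(vecs, a, b, c):
    """Return the word closest to vecs[a] - vecs[b] + vecs[c], excluding a, b, c."""
    target = vecs[a] - vecs[b] + vecs[c]
    candidates = [w for w in vecs if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_dense(target, vecs[w]))

# Hand-made toy embeddings: dimensions are (royal, male, female).
toy_vecs = {
    "king":  np.array([1.0, 1.0, 0.0]),
    "man":   np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "child": np.array([0.0, 0.5, 0.5]),
}
answer = solve_analogy(toy_vecs, "king", "man", "woman")   # 'queen'
```

Excluding the three query words matters: without that filter, the nearest neighbor of the target vector is often one of the inputs themselves, which is one of the points raised in the Linzen paper linked above.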


In [26]:
import distsim
king = word_to_vec_dict['king']
man = word_to_vec_dict['man']
woman = word_to_vec_dict['woman']
print distsim.show_nearest(word_to_vec_dict,
                     king-man+woman,
                     set(['king','man','woman']),
                     distsim.cossim_dense)[0]
###Provide your answer below
restaurant = word_to_vec_dict['restaurant']
chef = word_to_vec_dict['chef']
doctor = word_to_vec_dict['doctor']
print distsim.show_nearest(word_to_vec_dict,
                     restaurant-chef+doctor,
                     set(['restaurant','chef','doctor']),
                     distsim.cossim_dense)[0]


queen
hospital


**Write your response here:**
The reasoning task chosen was restaurant : chef = ____ : doctor. As shown above, the expected answer, "hospital", is returned.
Since each pair relates a subject to an object, the nature of the relationship is captured by the vector difference (king - man captures something like "is a", and restaurant - chef captures something like "has employee", among other possible readings). When this difference is added to another vector, the resulting vector is most similar to the word that stands in the same relationship to it.

## Extra credit (up to 5 points)

Analyze word similarities with WordNet, and compare and contrast against the distributional similarity results. For a fair comparison, limit yourself to words in the `nytcounts.4k` vocabulary. First, calculate how many of the words are present in WordNet, at least according to whatever lookup method you use. (There is an issue that WordNet requires a part-of-speech for lookup, but we don’t have those in our data; you’ll have to devise a solution.)

Second, for the words you analyzed with distributional similarity above, now do the same with WordNet-based similarity as implemented in NLTK, as described [here](http://www.nltk.org/howto/wordnet.html), or search for “nltk wordnet similarity”. For a fair comparison, do the nearest-similarity ranking among the words in the `nytcounts.4k` vocabulary. You may use `path_similarity`, or any of the other similarity methods (e.g. `res_similarity` for Resnik similarity, which is one of the better ones). Describe what you are doing. Compare and contrast the words you get. Does WordNet give similar or very different results? Why?

## Extra credit (up to 5 points)

Investigate a few of the alternative methods described in [Linzen 2016](http://www.aclweb.org/anthology/W/W16/W16-2503.pdf) on the man/woman/king/queen and your new example.  What does this tell you about the legitimacy of analogical reasoning tasks?  How do you assess Linzen's arguments?