<a href="https://colab.research.google.com/github/ashmcmn/NLE_Notes/blob/main/Labs/Lab4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro
This week you are going to investigate linguistic regularities in pre-built word embeddings. You should have linked or downloaded the embeddings in week 1 (see Canvas), uncompressed them and made them accessible in your working directory. As before, you can load them into Python using the following code. It may take a while to run because the embeddings file is very large (1.5 GB).

In [None]:
!gdown --id '1sQ2x3j4Yr6g6uqMmb41ulOACwGp0AJPK'
!unzip 'lab4data.zip'

In [None]:
from gensim.models import KeyedVectors 
mymodel = KeyedVectors.load_word2vec_format('lab4data/GoogleNews-vectors-negative300.bin.gz', binary=True)

You can now query the model with calls to methods of mymodel such as:

In [None]:
print(mymodel.similarity('man', 'woman'))
print(mymodel.most_similar(positive=['man']))
print(mymodel['man'])

0.76640123
[('woman', 0.7664012908935547), ('boy', 0.6824870109558105), ('teenager', 0.6586930155754089), ('teenage_girl', 0.6147903800010681), ('girl', 0.5921714305877686), ('suspected_purse_snatcher', 0.5716364979743958), ('robber', 0.5585119128227234), ('Robbery_suspect', 0.5584409236907959), ('teen_ager', 0.5549196600914001), ('men', 0.5489763021469116)]
[ 0.32617188  0.13085938  0.03466797 -0.08300781  0.08984375 -0.04125977
 -0.19824219  0.00689697  0.14355469  0.0019455   0.02880859 -0.25
 -0.08398438 -0.15136719 -0.10205078  0.04077148 -0.09765625  0.05932617
  0.02978516 -0.10058594 -0.13085938  0.001297    0.02612305 -0.27148438
  0.06396484 -0.19140625 -0.078125    0.25976562  0.375      -0.04541016
  0.16210938  0.13671875 -0.06396484 -0.02062988 -0.09667969  0.25390625
  0.24804688 -0.12695312  0.07177734  0.3203125   0.03149414 -0.03857422
  0.21191406 -0.00811768  0.22265625 -0.13476562 -0.07617188  0.01049805
 -0.05175781  0.03808594 -0.13378906  0.125       0.0559082  

Mikolov et al. (2013) propose the use of an offset vector method in order to answer analogy questions. For example, if you want to find the concept X which satisfies the analogy “X is to China as London is to England”, you need to find the concept closest to the point X in the vector space:

$X=vector_{China}-(vector_{England}-vector_{London})$

$X=vector_{China}+vector_{London}-vector_{England}$

You can do this with gensim using the following code:

In [None]:
mymodel.most_similar(positive=['China','London'], negative=['England'])

[('Beijing', 0.6737731695175171),
 ('Shanghai', 0.646628737449646),
 ('Beijng', 0.5856549739837646),
 ('Hong_Kong', 0.5709935426712036),
 ('Chinese', 0.5639771223068237),
 ('Guangdong', 0.5119545459747314),
 ('Shenzhen', 0.5102902054786682),
 ('Yanqi', 0.5076327323913574),
 ('Nanjing', 0.5056864023208618),
 ('Guangzhou', 0.5043154954910278)]
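
To see the arithmetic behind the offset method without loading the full model, here is a minimal sketch on made-up 2-D vectors. The vectors, the helper `cosine` and the function `toy_analogy` are all hypothetical and only illustrate the formula above; gensim's `most_similar` handles details such as vector normalisation internally, which this sketch does not try to reproduce.

```python
import numpy as np

# Hypothetical 2-D toy embeddings, invented for illustration (not taken from the model).
toy_vectors = {
    "England": np.array([1.0, 0.0]),
    "London": np.array([1.0, 1.0]),
    "China": np.array([3.0, 0.0]),
    "Beijing": np.array([3.0, 1.0]),
    "Chinese": np.array([3.0, -0.5]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def toy_analogy(positive, negative, vectors):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    query = np.zeros_like(next(iter(vectors.values())))
    for word in positive:
        query = query + vectors[word]
    for word in negative:
        query = query - vectors[word]
    # Leave the query words out of the candidates so the answer is a new word.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return sorted(candidates, key=lambda w: cosine(vectors[w], query), reverse=True)

# X = vector_China + vector_London - vector_England
ranking = toy_analogy(["China", "London"], ["England"], toy_vectors)
print(ranking[0])
```

In this toy space the query point lands exactly on the vector for 'Beijing', so it is ranked first, ahead of 'Chinese'.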

The file relations.json contains lists of pairs of words which satisfy some syntactic or semantic
relation. You can load it into a dictionary using the following code:

In [None]:
import json
with open('lab4data/relations.json', 'r') as fp:
  testtuples=json.load(fp) 
  print(testtuples)

{'gram3-comparative': [['bad', 'worse'], ['big', 'bigger'], ['bright', 'brighter'], ['cheap', 'cheaper'], ['cold', 'colder'], ['cool', 'cooler'], ['deep', 'deeper'], ['easy', 'easier'], ['fast', 'faster'], ['good', 'better'], ['great', 'greater'], ['hard', 'harder'], ['heavy', 'heavier'], ['high', 'higher'], ['hot', 'hotter'], ['large', 'larger'], ['long', 'longer'], ['loud', 'louder'], ['low', 'lower'], ['new', 'newer'], ['old', 'older'], ['quick', 'quicker'], ['safe', 'safer'], ['sharp', 'sharper'], ['short', 'shorter'], ['simple', 'simpler'], ['slow', 'slower'], ['small', 'smaller'], ['smart', 'smarter'], ['strong', 'stronger'], ['tall', 'taller'], ['tight', 'tighter'], ['tough', 'tougher'], ['warm', 'warmer'], ['weak', 'weaker'], ['wide', 'wider'], ['young', 'younger']], 'gram8-plural': [['banana', 'bananas'], ['bird', 'birds'], ['bottle', 'bottles'], ['building', 'buildings'], ['car', 'cars'], ['cat', 'cats'], ['child', 'children'], ['cloud', 'clouds'], ['color', 'colors'], ['comp


# Tasks
1. Write a function which, when given one (capital city, country) training pair, can predict the capitals of the other countries in the capital-common-countries list in testtuples.



2. Use the correct answers, also given, to evaluate how accurate your capital-predictor is. You should calculate the average accuracy over all possible training pairs.


In [None]:
def predict_capital(target_country, known_capital, known_country):
  # Offset method: target_country + known_capital - known_country; return the top-ranked word
  return mymodel.most_similar(positive=[target_country, known_capital], negative=[known_country])[0][0]

In [None]:
predict_capital('Spain', 'London', 'England')

'Madrid'

In [None]:
# Each entry in the list is a [capital, country] pair
results = [predict_capital(country, 'London', 'England') == capital for [capital, country] in testtuples['capital-common-countries']]
print(f'Accuracy: {results.count(True)/len(results):.2%}')

Accuracy: 78.26%
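
The figure above uses a single training pair (London, England) and scores every pair in the list, including the training pair itself. Task 2 asks for the average over all possible training pairs, so here is one possible sketch of that loop. `average_accuracy` is a hypothetical helper, and `predict` stands for any callable with the argument order `(target_country, known_capital, known_country)`. The toy lookup predictor is made up so the block runs without the embeddings.

```python
def average_accuracy(pairs, predict):
    """Mean accuracy over every choice of training pair.

    For each training pair, the remaining pairs are used as the test set,
    so the training pair itself is never scored.
    """
    accuracies = []
    for i, (known_capital, known_country) in enumerate(pairs):
        test = pairs[:i] + pairs[i + 1:]
        correct = sum(
            predict(country, known_capital, known_country) == capital
            for capital, country in test
        )
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)

# Toy data in the same [capital, country] order as the relations file.
toy_pairs = [["London", "England"], ["Paris", "France"], ["Madrid", "Spain"]]

# Hypothetical stand-in predictor: a fixed lookup that gets Spain wrong.
toy_lookup = {"England": "London", "France": "Paris", "Spain": "Barcelona"}

def toy_predict(target_country, known_capital, known_country):
    return toy_lookup[target_country]

toy_average = average_accuracy(toy_pairs, toy_predict)
print(f'Average accuracy: {toy_average:.2%}')
```

With the real model, the single-pair predictor defined earlier could be passed as `predict` together with `testtuples['capital-common-countries']`. Note that this makes one `most_similar` call for every (training pair, test pair) combination, so it is noticeably slower than the single-pair evaluation.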


3. Looking at your predictions, can you think of an easy way to improve performance?

Repeat the prediction with several known (capital, country) pairs and select the most common (modal) prediction.


In [None]:
def predict_capital(target_country, known_pairs):
  # Make one offset prediction per known (capital, country) pair
  predictions = []
  for (known_capital, known_country) in known_pairs:
    predictions.append(mymodel.most_similar(positive=[target_country, known_capital], negative=[known_country])[0][0])
  # Return the most frequent (modal) prediction
  return max(set(predictions), key=predictions.count)

In [None]:
results = [predict_capital(country, [('London', 'England'), ('Paris', 'France'), ('Madrid', 'Spain')]) == capital for [capital, country] in testtuples['capital-common-countries']]
print(f'Accuracy: {results.count(True)/len(results):.2%}')

Accuracy: 82.61%
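
One caveat with `max(set(predictions), key=predictions.count)` is that ties are broken by set iteration order, which for strings can change between Python runs. A possible alternative, sketched here with a hypothetical helper `majority_vote`, uses `collections.Counter`, which orders equally frequent items by first occurrence (Python 3.7+), so ties go to the first training pair's prediction.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent prediction; ties go to the earliest-seen value."""
    return Counter(predictions).most_common(1)[0][0]

# Clear majority: two votes against one.
print(majority_vote(["Madrid", "Barcelona", "Madrid"]))

# Three-way tie: the first prediction wins, on every run.
print(majority_vote(["Lisbon", "Porto", "Braga"]))
```

This could replace the final `return` line of the voting predictor. Whether it changes the accuracy depends on how often the three predictions all disagree.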


4. Adapt your code so that you can predict the country of which a city is capital. Is performance the same, higher or lower this way round?


In [None]:
def predict_country(target_capital, known_capital, known_country):
  return mymodel.most_similar(positive=[target_capital, known_country], negative=[known_capital])[0][0]

In [None]:
print(predict_country('Madrid', 'London', 'England'))

Spain


In [None]:
results = [predict_country(capital, 'London', 'England') == country for [capital, country] in testtuples['capital-common-countries']]
print(f'Accuracy: {results.count(True)/len(results):.2%}')

Accuracy: 78.26%


In [None]:
def predict_country(target_capital, known_pairs):
  predictions = []
  for (known_capital, known_country) in known_pairs:
    predictions.append(mymodel.most_similar(positive=[target_capital, known_country], negative=[known_capital])[0][0])
  return max(set(predictions), key=predictions.count)

In [None]:
results = [predict_country(capital, [('London', 'England'), ('Paris', 'France'), ('Madrid', 'Spain')]) == country for [capital, country] in testtuples['capital-common-countries']]
print(f'Accuracy: {results.count(True)/len(results):.2%}')

Accuracy: 82.61%


5. Adapt your code so that you can consider any of the relationships in testtuples. Rank the relationships by how easy they are to predict. Why do you think some are easier than others?

In [64]:
import random
n_train = 3
n_test = 5

ranks = {}

for k in testtuples:
  print(f'Processing {k}...')

  relationships = testtuples[k]
  sample = random.sample(relationships, n_train+n_test)
  train = sample[:n_train]
  test = sample[n_train:]

  c = 0

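  for (target1, target2) in test:
    # Apply the offset from each training pair to the test word
    predictions = []
    for (known1, known2) in train:
      predictions.append(mymodel.most_similar(positive=[target1,known2], negative=[known1])[0][0])
    # Count the item as correct if the most common prediction matches
    if max(set(predictions), key=predictions.count) == target2:
      c += 1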

  ranks[k] = c/len(test)

for k, v in sorted(ranks.items(), key=lambda x: x[1], reverse=True):
  print(f'{k} has an accuracy of {v:.2%}')

Processing gram3-comparative...
Processing gram8-plural...
Processing capital-common-countries...
Processing city-in-state...
Processing family...
Processing gram2-opposite...
Processing currency...
Processing gram4-superlative...
Processing gram6-nationality-adjective...
Processing gram7-past-tense...
Processing gram5-present-participle...
Processing capital-world...
Processing gram1-adjective-to-adverb...
gram3-comparative has an accuracy of 100.00%
city-in-state has an accuracy of 100.00%
family has an accuracy of 100.00%
gram4-superlative has an accuracy of 100.00%
gram6-nationality-adjective has an accuracy of 100.00%
gram5-present-participle has an accuracy of 100.00%
capital-world has an accuracy of 100.00%
gram8-plural has an accuracy of 80.00%
gram7-past-tense has an accuracy of 80.00%
capital-common-countries has an accuracy of 60.00%
gram1-adjective-to-adverb has an accuracy of 60.00%
gram2-opposite has an accuracy of 40.00%
currency has an accuracy of 20.00%


Prediction accuracy depends on how often the 'known' word of a relationship appears in the context of the 'unknown' word in the training text. For example, capitals are likely to be mentioned alongside their countries more often than currencies are, which would explain why capitals are predicted better than currencies. The result also depends on the random sample used in my method: if the economies of the sampled countries are often discussed in the text, the currency predictions may be more accurate. Another potential problem is that a known 'positive' word can have other senses. For example, 'real' is the currency of Brazil but is used far more often in English with another meaning.

In [74]:
pairs = [('Brazil', 'real'), ('Germany', 'euro'), ('Macedonia', 'denar')]
for (known_country, known_currency) in pairs:
  results = [mymodel.most_similar(positive=[country, known_currency], negative=[known_country])[0][0] == currency for [country, currency] in testtuples['currency']]
  print(f'Predictions using {known_currency} from {known_country} had an accuracy of {results.count(True)/len(results):.2%}')

Predictions using real from Brazil had an accuracy of 0.00%
Predictions using euro from Germany had an accuracy of 46.67%
Predictions using denar from Macedonia had an accuracy of 6.67%
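
Because each accuracy in the ranking comes from a single random split with only five test items, it moves in steps of 20% and can change from run to run. One way to reduce that noise, sketched here under assumptions, is to repeat the split several times with a seeded random generator and average the results. `repeated_split_accuracy` is a hypothetical helper, `predict(source, train)` stands for any voting predictor that takes a word and a list of training pairs, and the toy "add an s" predictor is invented so the block runs without the embeddings.

```python
import random

def repeated_split_accuracy(pairs, predict, n_train=3, n_test=5, n_repeats=20, seed=0):
    """Average accuracy over several random train/test splits."""
    rng = random.Random(seed)  # seeded so the result is reproducible
    accuracies = []
    for _ in range(n_repeats):
        sample = rng.sample(pairs, n_train + n_test)
        train, test = sample[:n_train], sample[n_train:]
        correct = sum(predict(source, train) == target for source, target in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)

# Toy plural pairs; 'child' is irregular so the toy predictor gets it wrong.
toy_pairs = [["cat", "cats"], ["dog", "dogs"], ["car", "cars"], ["bird", "birds"],
             ["cloud", "clouds"], ["bottle", "bottles"], ["banana", "bananas"],
             ["building", "buildings"], ["child", "children"]]

def toy_predict(source, train):
    # Hypothetical stand-in for the offset predictor: ignores the training pairs.
    return source + "s"

mean_accuracy = repeated_split_accuracy(toy_pairs, toy_predict)
print(f'Mean accuracy over 20 splits: {mean_accuracy:.2%}')
```

With the real model, `predict` could wrap the `most_similar` voting loop from task 5. More repeats give a smoother ranking at the cost of more `most_similar` calls.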


6. A critic might say that the evaluation carried out in Mikolov et al. (2013) does not test the importance of the direction of the vector offset. London is close to England, so the vector difference is very small. Therefore, the method might do as well if it predicted the nearest neighbour of China as its capital. Implement this naive baseline which predicts the closest neighbour of the test item. Evaluate it for the different relationships in testtuples. Does it come close to doing as well as the vector offset method for any of the relationships?

In [65]:
n_train = 3
n_test = 5

ranks = {}

for k in testtuples:
  print(f'Processing {k}...')

  relationships = testtuples[k]
  sample = random.sample(relationships, n_train+n_test)
  train = sample[:n_train]
  test = sample[n_train:]

  c = 0

  for (target1, target2) in test:
    # Naive baseline: ignore the training pairs and predict the nearest neighbour of the test word
    if mymodel.most_similar(positive=[target1])[0][0] == target2:
      c += 1

  ranks[k] = c/len(test)

for k, v in sorted(ranks.items(), key=lambda x: x[1], reverse=True):
  print(f'{k} has an accuracy of {v:.2%}')

Processing gram3-comparative...
Processing gram8-plural...
Processing capital-common-countries...
Processing city-in-state...
Processing family...
Processing gram2-opposite...
Processing currency...
Processing gram4-superlative...
Processing gram6-nationality-adjective...
Processing gram7-past-tense...
Processing gram5-present-participle...
Processing capital-world...
Processing gram1-adjective-to-adverb...
gram8-plural has an accuracy of 100.00%
gram3-comparative has an accuracy of 60.00%
gram6-nationality-adjective has an accuracy of 60.00%
gram5-present-participle has an accuracy of 40.00%
city-in-state has an accuracy of 20.00%
family has an accuracy of 20.00%
gram7-past-tense has an accuracy of 20.00%
capital-world has an accuracy of 20.00%
capital-common-countries has an accuracy of 0.00%
gram2-opposite has an accuracy of 0.00%
currency has an accuracy of 0.00%
gram4-superlative has an accuracy of 0.00%
gram1-adjective-to-adverb has an accuracy of 0.00%
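
The baseline and the offset method above were each scored on their own random sample, so the two rankings are not compared on the same test items. A possible sketch of a like-for-like comparison scores several predictors on one shared, seeded split. `compare_on_same_split` and the two toy predictors are hypothetical and only stand in for the real offset and nearest-neighbour predictors.

```python
import random

def compare_on_same_split(pairs, predictors, n_train=3, n_test=5, seed=0):
    """Score every predictor on the same random train/test split."""
    rng = random.Random(seed)
    sample = rng.sample(pairs, n_train + n_test)
    train, test = sample[:n_train], sample[n_train:]
    return {
        name: sum(predict(source, train) == target for source, target in test) / len(test)
        for name, predict in predictors.items()
    }

# Toy plural pairs, all regular.
toy_pairs = [["cat", "cats"], ["dog", "dogs"], ["car", "cars"], ["bird", "birds"],
             ["cloud", "clouds"], ["bottle", "bottles"], ["banana", "bananas"],
             ["building", "buildings"]]

# Hypothetical stand-ins: one always right on regular plurals, one always wrong.
toy_predictors = {
    "add_s": lambda source, train: source + "s",
    "echo": lambda source, train: source,
}

scores = compare_on_same_split(toy_pairs, toy_predictors)
for name, accuracy in scores.items():
    print(f'{name} has an accuracy of {accuracy:.2%}')
```

With the real model, the dictionary could hold one function wrapping the offset vote and one wrapping `mymodel.most_similar(positive=[source])[0][0]`, so both are judged on identical test words.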
