# Word Semantics and Embeddings

March 2, 2024

Your name: Nguyen Son

Student ID: BI12-389

In this programming assignment, you will write Python code to complete exercises on word vectors.

You can use the [gensim](https://radimrehurek.com/gensim/) library or other libraries to complete the exercises.

## How to submit

**Due date**: March 15, 2024

- Make a copy of the notebook
- Write your name and student ID into this notebook
- Name your file YourName_StudentID_Assignment1.ipynb, e.g., Nguyen_Van_A_ST099834_Assignment1.ipynb
- Complete the exercises
- Attach the notebook file (.ipynb) and submit your work to Google Classroom
- Copying others' assignments is strictly prohibited.

## Exercise 1 (20 points)

Download [word vectors](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing) pretrained on the Google News dataset (approx. 100 billion words). The file contains vectors for 3 million words and phrases, each with 300 dimensions. Print the word vector of the term “United States”. Note that “United States” is represented as “United_States” in the file.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
#TODO: Write your code here
from gensim.models import KeyedVectors

filename = '/content/drive/My Drive/GoogleNews-vectors-negative300.bin.gz'
model = KeyedVectors.load_word2vec_format(filename, binary=True)

united_states_vector = model['United_States']
print(united_states_vector)

[-3.61328125e-02 -4.83398438e-02  2.35351562e-01  1.74804688e-01
 -1.46484375e-01 -7.42187500e-02 -1.01562500e-01 -7.71484375e-02
  1.09375000e-01 -5.71289062e-02 -1.48437500e-01 -6.00585938e-02
  1.74804688e-01 -7.71484375e-02  2.58789062e-02 -7.66601562e-02
 -3.80859375e-02  1.35742188e-01  3.75976562e-02 -4.19921875e-02
 -3.56445312e-02  5.34667969e-02  3.68118286e-04 -1.66992188e-01
 -1.17187500e-01  1.41601562e-01 -1.69921875e-01 -6.49414062e-02
 -1.66992188e-01  1.00585938e-01  1.15722656e-01 -2.18750000e-01
 -9.86328125e-02 -2.56347656e-02  1.23046875e-01 -3.54003906e-02
 -1.58203125e-01 -1.60156250e-01  2.94189453e-02  8.15429688e-02
  6.88476562e-02  1.87500000e-01  6.49414062e-02  1.15234375e-01
 -2.27050781e-02  3.32031250e-01 -3.27148438e-02  1.77734375e-01
 -2.08007812e-01  4.54101562e-02 -1.23901367e-02  1.19628906e-01
  7.44628906e-03 -9.03320312e-03  1.14257812e-01  1.69921875e-01
 -2.38281250e-01 -2.79541016e-02 -1.21093750e-01  2.47802734e-02
  7.71484375e-02 -2.81982
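
Since multi-word phrases are stored with underscores, a small helper can turn a phrase into the key used for lookup. The function name `phrase_to_key` is hypothetical and only illustrates the naming convention described in the exercise; whether a given phrase actually exists in the vocabulary still has to be checked, for example with a membership test such as `key in model`.

```python
def phrase_to_key(phrase):
    """Join the words of a phrase with underscores (hypothetical helper)."""
    # split() without arguments also collapses repeated or surrounding spaces
    return "_".join(phrase.split())


key = phrase_to_key("United States")
print(key)
```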

## Exercise 2 (20 points)

Compute the cosine similarity between “United States” and “U.S.”

In [3]:
#TODO: Write your code here
# KeyedVectors.similarity returns the cosine similarity between two keys
cosine_similarity = model.similarity('United_States', 'U.S.')
print(cosine_similarity)

0.73107743
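
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand in plain Python on toy 3-dimensional vectors; the values are made up for illustration and are not taken from the Google News model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (hypothetical values, for illustration only)
u = [1.0, 2.0, 2.0]
v = [2.0, 1.0, 2.0]
sim = cosine_similarity(u, v)
print(round(sim, 4))
```

Applied to the two 300-dimensional model vectors, the same formula should agree with `model.similarity` up to floating-point rounding.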


## Exercise 3 (20 points)

Find the 10 words with the highest cosine similarity to “United States” and print each word with its similarity score.

In [4]:
#TODO: Write your code here
top_10_similar_words = model.most_similar('United_States', topn=10)
for word, similarity in top_10_similar_words:
    print(f"{word}: {similarity}")


Unites_States: 0.7877248525619507
Untied_States: 0.7541370987892151
United_Sates: 0.7400724291801453
U.S.: 0.7310774326324463
theUnited_States: 0.6404393911361694
America: 0.6178410053253174
UnitedStates: 0.6167312264442444
Europe: 0.6132988929748535
countries: 0.6044804453849792
Canada: 0.601906955242157
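
Conceptually, a nearest-neighbour query scores every vocabulary entry against the query vector and keeps the highest-scoring ones. The following is a minimal sketch of that idea on a toy vocabulary, not gensim's actual implementation; the vectors and the helper name `top_n` are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def top_n(query, vectors, n):
    """Return the n keys most similar to `query`, excluding the query itself."""
    scores = [(key, cosine(vectors[query], vec))
              for key, vec in vectors.items() if key != query]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:n]


# Hypothetical 2-dimensional vectors, for illustration only
toy_vectors = {
    "United_States": [1.0, 0.0],
    "U.S.": [0.9, 0.1],
    "Canada": [0.6, 0.8],
    "banana": [0.0, 1.0],
}
neighbours = top_n("United_States", toy_vectors, n=2)
for word, score in neighbours:
    print(f"{word}: {score:.4f}")
```

A full scan like this is linear in the vocabulary size, which is why libraries typically vectorize it with matrix operations.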


## Exercise 4 (20 points)

Subtract the vector of “Madrid” from the vector of “Spain” and then add the vector of “Athens”. Find the 10 words most similar to the resulting vector.

In [5]:
#TODO: Write your code here
result_vector = model['Spain'] - model['Madrid'] + model['Athens']

top_10_similar_words = model.most_similar([result_vector], topn=10)

for word, similarity in top_10_similar_words:
    print(f"{word}: {similarity}")


Athens: 0.7528455853462219
Greece: 0.6685472130775452
Aristeidis_Grigoriadis: 0.5495778322219849
Ioannis_Drymonakos: 0.5361457467079163
Greeks: 0.5351786017417908
Ioannis_Christou: 0.5330225825309753
Hrysopiyi_Devetzi: 0.5088489055633545
Iraklion: 0.5059264302253723
Greek: 0.5040615797042847
Athens_Greece: 0.5034108757972717
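
In the output above, “Athens” ranks first, most likely because a query by raw vector does not filter out the words used to build it. The sketch below shows the effect on a toy vocabulary: with the input words kept, the added word wins; with them excluded, the expected answer comes first. The vectors and the `analogy` helper are hypothetical. In gensim, passing keys instead of a vector, as in `model.most_similar(positive=['Spain', 'Athens'], negative=['Madrid'])`, excludes the input keys, though it also normalizes the vectors first, so the scores can differ from the raw arithmetic.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(a, b, c, vectors, exclude_inputs=True):
    """Rank keys by similarity to vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    skipped = {a, b, c} if exclude_inputs else set()
    ranked = [(key, cosine(target, vec))
              for key, vec in vectors.items() if key not in skipped]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Hypothetical 3-dimensional vectors, for illustration only
toy_vectors = {
    "Madrid": [0.2, 0.0, 1.0],
    "Spain": [0.5, 0.0, 1.0],
    "Athens": [0.2, 1.0, 0.0],
    "Greece": [1.0, 1.0, 0.2],
    "Rome": [0.2, 0.1, 0.3],
}
with_inputs = analogy("Madrid", "Spain", "Athens", toy_vectors, exclude_inputs=False)
without_inputs = analogy("Madrid", "Spain", "Athens", toy_vectors)
print(with_inputs[0][0], without_inputs[0][0])
```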


## Exercise 5 (20 points)

Download the [word analogy evaluation dataset](http://download.tensorflow.org/data/questions-words.txt). For each row, compute vec(word in second column) - vec(word in first column) + vec(word in third column) and find the word most similar to the resulting vector. Append that word and its similarity to the row.

In [6]:
#TODO: Write your code here
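# Path to the downloaded analogy file on Google Drive
data_path = '/content/drive/My Drive/capital-common-countries.txt'

with open(data_path, 'r') as file:
    lines = file.readlines()

modified_lines = []
# lines[1:] skips only the first section header; later headers (": name") fail
# the vocabulary lookup below and are reported by the except branch
for line in lines[1:]:
    words = line.strip().split()
    try:
        # vec(second column) - vec(first column) + vec(third column)
        result_vector = model[words[1]] - model[words[0]] + model[words[2]]
        # Nearest key to the result vector, as a (word, similarity) pair
        most_similar = model.similar_by_vector(result_vector, topn=1)[0]
        modified_line = f"{line.strip()} {most_similar[0]} {most_similar[1]}"
        modified_lines.append(modified_line)
    except KeyError as e:
        print(f"Error processing line: {line.strip()} - {e}")

# Write each original row with the predicted word and its similarity appended
modified_data_path = '/content/drive/My Drive/capital-common-countries-modify.txt'
with open(modified_data_path, 'w') as modified_file:
    for line in modified_lines:
        modified_file.write(f"{line}\n")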


Error processing line: : capital-world - "Key 'capital-world' not present"
Error processing line: : currency - "Key ':' not present"
Error processing line: : city-in-state - "Key 'city-in-state' not present"
Error processing line: : family - "Key ':' not present"
Error processing line: : gram1-adjective-to-adverb - "Key 'gram1-adjective-to-adverb' not present"
Error processing line: : gram2-opposite - "Key 'gram2-opposite' not present"
Error processing line: : gram3-comparative - "Key 'gram3-comparative' not present"
Error processing line: : gram4-superlative - "Key 'gram4-superlative' not present"
Error processing line: : gram5-present-participle - "Key 'gram5-present-participle' not present"
Error processing line: : gram6-nationality-adjective - "Key 'gram6-nationality-adjective' not present"
Error processing line: : gram7-past-tense - "Key 'gram7-past-tense' not present"
Error processing line: : gram8-plural - "Key 'gram8-plural' not present"
Error processing line: : gram9-plural-ve
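
The errors above come from section header lines, which begin with `:` and are not analogy rows. One way to avoid them is to detect headers explicitly while reading the file. The sketch below parses a small inline sample written in the style of the dataset; the helper name `parse_analogy_lines` is hypothetical, and it assumes every question row has exactly four whitespace-separated words.

```python
def parse_analogy_lines(lines):
    """Yield (section, [w1, w2, w3, w4]) per question row, skipping headers."""
    section = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(":"):
            # Header such as ": capital-world" starts a new section
            section = line[1:].strip()
            continue
        words = line.split()
        if len(words) == 4:
            yield section, words


# Sample rows in the style of the dataset (for illustration only)
sample = """: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Bangkok Thailand
: currency
Algeria dinar Angola kwanza
""".splitlines()

rows = list(parse_analogy_lines(sample))
for section, words in rows:
    print(section, words)
```

Keeping the section name with each row also makes it possible to report results per category later.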

## Exercise 6 (Bonus points)

Using the output of Exercise 5, compute the accuracy score, i.e., the percentage of rows in which the most similar word returned by your code matches the word in the 4th column.



In [7]:
#TODO: Write your code here
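correct_predictions = 0
total_predictions = 0

for line in modified_lines:
    parts = line.strip().split()
    # Expect four original words plus the appended prediction and its similarity
    if len(parts) >= 6:
        correct_word = parts[3]    # 4th column of the original file
        predicted_word = parts[4]  # most similar word appended in Exercise 5
        total_predictions += 1
        if predicted_word == correct_word:
            correct_predictions += 1

# Guard against division by zero when no row was processed
accuracy = (correct_predictions / total_predictions) * 100 if total_predictions > 0 else 0

print(f"Accuracy Score: {accuracy:.2f}%")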

Accuracy Score: 20.19%
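
A single overall figure can hide large differences between question categories. The sketch below breaks accuracy down by section; it assumes each result is available as a `(section, expected, predicted)` tuple, which the Exercise 5 output does not store as written, and both the helper name `accuracy_by_section` and the toy results are hypothetical.

```python
from collections import defaultdict


def accuracy_by_section(results):
    """Map each section name to its accuracy in percent, plus an 'overall' entry."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for section, expected, predicted in results:
        # Count each row once for its section and once for the overall score
        for key in (section, "overall"):
            total[key] += 1
            correct[key] += int(predicted == expected)
    return {key: 100.0 * correct[key] / total[key] for key in total}


# Hypothetical results: (section, word in 4th column, predicted word)
toy_results = [
    ("capital-common-countries", "Iraq", "Iraq"),
    ("capital-common-countries", "Thailand", "Bangkok"),
    ("currency", "kwanza", "kwanza"),
    ("currency", "dram", "Armenia"),
    ("currency", "peso", "Argentina"),
]
scores = accuracy_by_section(toy_results)
for name, value in scores.items():
    print(f"{name}: {value:.2f}%")
```

If many predictions simply repeat one of the three input words, excluding those words before picking the nearest neighbour is a common adjustment, and it would be expected to change the score.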
