# Word Vector Analysis Project

In this project, we explore word vectors derived from the word2vec algorithm, examining the relationships between words captured via these vectors. We will also use them as features for a classification task.

## Project Task

This project uses pre-computed word vectors to find similar words, solve word analogies, and build features for a classifier.

### Imports and Setup

In [31]:
import numpy as np
import re
import pandas as pd
from collections import Counter
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS  # public import path
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from scipy.stats import uniform
from sklearn.feature_extraction.text import TfidfVectorizer

from tqdm import tqdm
from IPython.display import display

# Fixed seed so that results are reproducible
RANDOM_SEED = 655

### Part 1: Loading the Word Vectors

#### Task 1.1: Load Pre-computed Word Vectors

Load the pre-computed word vectors into a dictionary `wiki_wv`, with the vocabulary under `'vocab'` and the vector array under `'wv'`:

In [3]:
wiki_wv = {}
with open(r'C:/users/akama/downloads/word_vector_analysis/wiki_vocab.txt', encoding='utf-8') as f:
    wiki_wv['vocab'] = f.read().split('\n')
wiki_wv['wv'] = np.load(r'C:/users/akama/downloads/word_vector_analysis/wiki_vects.npy')

#### Basic Facts about Our Word Vectors

Record the number of word vectors (`n_words`) and the dimensionality of each vector (`n_dims`):

In [4]:
n_words = len(wiki_wv['wv'])
n_dims = wiki_wv['wv'].shape[1]
print(f'{n_words} words and {n_dims} dimensions')

134282 words and 100 dimensions


#### Task 1.2: Construct a Word Index

Create a dictionary `word_index` to map words to their indices in the word vector array:

In [6]:
word_index = dict(zip(wiki_wv['vocab'], range(len(wiki_wv['vocab']))))
wiki_wv['index'] = word_index
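
As a small illustration of this lookup, the sketch below builds the same kind of mapping for a made-up four-word vocabulary. The names `toy_vocab` and `toy_index` are hypothetical and not part of the project data.

```python
# Toy vocabulary (hypothetical; the real one is read from wiki_vocab.txt)
toy_vocab = ['the', 'France', 'Paris', 'biologist']

# Same mapping as dict(zip(vocab, range(len(vocab)))), written with enumerate
toy_index = {word: i for i, word in enumerate(toy_vocab)}

print(toy_index['Paris'])       # 2: position of 'Paris' in the vocabulary
print(toy_index.get('paris'))   # None: lookups are case-sensitive
```

Using `.get` rather than square brackets returns `None` for a missing word instead of raising a `KeyError`.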

#### Task 1.2.1: Get the Vector for a Given Word

Complete the `get_word_vector` function:

In [7]:
def get_word_vector(word, wv=wiki_wv):
    word_idx = wv['index'].get(word)    
    if word_idx is not None:
        return wv['wv'][word_idx]
    else:
        return None

In [8]:
# Example

the_wiki_vect = get_word_vector('the')
print(the_wiki_vect)

[-9.4823521e-01  2.3072057e+00  2.1295850e+00 -9.1626394e-01
  1.0018171e+00  1.4503113e+00  4.5253047e-01  1.4185722e+00
  1.6953593e+00  1.0559855e+00  2.7143070e-01  1.5554521e+00
 -3.4456417e-01  1.9864979e-01  2.7903378e+00 -8.6733478e-01
 -4.2806253e-01  3.4738421e-01 -1.2050803e+00  1.0622426e+00
 -5.6063598e-01  5.9032774e-01 -1.0257747e+00 -7.6530254e-01
  3.3861476e-01 -8.8240725e-01  7.9443592e-01 -1.3805602e+00
 -1.2598097e+00  9.5285571e-01 -1.9514867e+00 -2.8805381e-01
  2.0856528e-01  2.3167922e+00  1.9959958e-01  2.1145422e+00
  2.2699206e-01  6.3021503e-02  8.1352192e-01  2.5985492e-03
  1.8291645e+00  1.1634727e+00 -8.0728352e-01  8.4317046e-01
 -1.2693784e-01 -1.6598600e-01  2.2009730e+00  8.4855229e-01
  2.9602451e+00 -1.3164682e+00 -5.9475186e-03 -3.0840275e-01
 -1.9252799e-01  1.5872755e+00  1.2728233e+00  1.1041660e+00
 -5.9441972e-01  1.4613307e+00 -6.2290657e-01  1.7193762e+00
 -5.8605254e-01 -2.7472922e-01 -2.0975404e+00  1.6103998e+00
  2.7877734e+00 -1.65887

### Part 2: Examine What's Represented by the Word Vectors

#### Task 2.1: Get Similar Words

Define the `get_most_similar` function to find the `k` words most similar to a given word:

In [9]:
def get_vector_cossims(vect, wv=wiki_wv):
    return cosine_similarity(np.array([vect]), wv['wv']).flatten()

def get_most_similar(word, wv=wiki_wv, k=10):
    word_idx = wv['index'].get(word)
    if word_idx is None:
        return []  # out-of-vocabulary word
    cossims = get_vector_cossims(wv['wv'][word_idx], wv)
    cossims[word_idx] = -1  # exclude the query word itself
    most_similar_indices = np.argsort(cossims)[-k:][::-1]  # top k, most similar first
    return [wv['vocab'][idx] for idx in most_similar_indices]

In [10]:
# Example

print('most similar to biologist:')
for word in get_most_similar('biologist'):
    print(word)
print('===\n')
print('most similar to France:')
for word in get_most_similar('France'):
    print(word)
print('===')

most similar to biologist:
geneticist
biochemist
physicist
microbiologist
physiologist
paleontologist
geophysicist
virologist
zoologist
neuroscientist
===

most similar to France:
Belgium
Spain
Algeria
Italy
Marseille
Portugal
Morocco
Bordeaux
Brazil
Switzerland
===
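
The ranking above rests on cosine similarity, the dot product of two vectors divided by the product of their norms. The sketch below uses made-up 2-d vectors (`cosine` and `toy_vectors` are illustrative names, not part of the notebook) to show the range of the measure and how the `argsort` slice picks the top `k` scores.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 2-d vectors, for illustration only
toy_vectors = {
    'a': np.array([1.0, 0.0]),
    'b': np.array([2.0, 0.0]),   # same direction as 'a', different length
    'c': np.array([0.0, 3.0]),   # orthogonal to 'a'
    'd': np.array([-1.0, 0.0]),  # opposite direction to 'a'
}

print(cosine(toy_vectors['a'], toy_vectors['b']))  # 1.0
print(cosine(toy_vectors['a'], toy_vectors['c']))  # 0.0
print(cosine(toy_vectors['a'], toy_vectors['d']))  # -1.0

# Top-k selection: argsort is ascending, so take the last k and reverse
scores = np.array([0.1, 0.9, -1.0, 0.5])
top2 = np.argsort(scores)[-2:][::-1]
print(top2.tolist())  # [1, 3]
```

Because cosine similarity never falls below -1, overwriting the query word's own score with -1 pushes it to the bottom of the ranking.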


#### Task 2.2: Examine Word Analogies

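Define the `get_analogy` function, which answers "`a` is to `b` as `c` is to ?" by returning the word whose vector is closest to `b - a + c`, computed on unit-normalized vectors: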

In [11]:
def normalize(vect):
    return vect / np.linalg.norm(vect)

def get_analogy(a, b, c, wv=wiki_wv):
    a_vec = get_word_vector(a, wv)
    b_vec = get_word_vector(b, wv)
    c_vec = get_word_vector(c, wv)
    if a_vec is None or b_vec is None or c_vec is None:
        return None
    norm_a = normalize(a_vec)
    norm_b = normalize(b_vec)
    norm_c = normalize(c_vec)
    # Predicted vector for the missing fourth word
    d_prime = norm_b - norm_a + norm_c
    cossims = get_vector_cossims(d_prime, wv)
    # Exclude the three input words from the candidates
    for word in [a, b, c]:
        word_idx = wv['index'][word]
        cossims[word_idx] = -1
    most_similar_idx = np.argmax(cossims)
    most_similar_word = wv['vocab'][most_similar_idx]
    return most_similar_word

In [12]:
# Example

print(get_analogy('France','Paris','England'))
print(get_analogy('biologist', 'biology', 'chemist'))

London
chemistry
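
To make the vector arithmetic concrete, here is a minimal sketch on made-up 3-d vectors. The dictionary `toy`, the helper `unit`, and the function `toy_analogy` are hypothetical; the axes are invented so that they loosely read as (France-ness, England-ness, capital-ness).

```python
import numpy as np

# Hypothetical embeddings, for illustration only
toy = {
    'France':     np.array([1.0, 0.0, 0.0]),
    'Paris':      np.array([1.0, 0.0, 1.0]),
    'Lyon':       np.array([1.0, 0.0, 0.2]),
    'England':    np.array([0.0, 1.0, 0.0]),
    'London':     np.array([0.0, 1.0, 1.0]),
    'Manchester': np.array([0.0, 1.0, 0.2]),
}

def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

def toy_analogy(a, b, c, vectors=toy):
    # Same arithmetic as get_analogy: b - a + c on unit-normalized vectors
    target = unit(unit(vectors[b]) - unit(vectors[a]) + unit(vectors[c]))
    # Score every word except the three inputs by cosine similarity
    scores = {
        word: float(np.dot(unit(vec), target))
        for word, vec in vectors.items()
        if word not in (a, b, c)
    }
    return max(scores, key=scores.get)

print(toy_analogy('France', 'Paris', 'England'))  # London
```

Subtracting `France` from `Paris` leaves mostly the "capital" direction, and adding `England` moves that direction into English territory, which is why `London` beats `Manchester` here.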


#### Task 2.2.1: Test Word Vector Analogies on Professions

Evaluate the performance of word vector analogies on different professions:

In [14]:
test_professions = ['archaeologist', 'botanist', 'economist', 'entomologist', 'linguist', 'mathematician', 'oncologist', 
                    'physicist', 'statistician', 'zoologist']
for profession in test_professions:
    print('input: %s; output: %s' % (profession, get_analogy('biologist', 'biology', profession)))

input: archaeologist; output: archaeology
input: botanist; output: botany
input: economist; output: economics
input: entomologist; output: botany
input: linguist; output: mathematics
input: mathematician; output: mathematics
input: oncologist; output: oncology
input: physicist; output: physics
input: statistician; output: microbiology
input: zoologist; output: botany


In [15]:
# Identify incorrect outputs

wrong_professions = ['entomologist', 'linguist', 'statistician', 'zoologist']
print(wrong_professions)

['entomologist', 'linguist', 'statistician', 'zoologist']


#### Task 2.2.2: Test Word Vector Analogies on Countries and Cities

Evaluate the performance of word vector analogies on different countries:

In [16]:
test_countries = ['Austria', 'Belgium', 'Canada', 'China', 'Germany', 'India', 'Japan', 'Portugal', 'Spain', 'Tanzania']
results = {}
for country in test_countries:
    result = get_analogy('France', 'Paris', country)
    results[country] = result
    print('input: %s; output: %s' % (country, result))

expected_capitals = {
    'Austria': 'Vienna',
    'Belgium': 'Brussels',
    'Canada': 'Ottawa',
    'China': 'Beijing',
    'Germany': 'Berlin',
    'India': 'New Delhi',
    'Japan': 'Tokyo',
    'Portugal': 'Lisbon',
    'Spain': 'Madrid',
    'Tanzania': 'Dodoma'
}

# A None output (an out-of-vocabulary input) also counts as wrong
wrong_countries = [country for country, output in results.items()
                   if output is None or output.lower() != expected_capitals[country].lower()]
print(wrong_countries)

input: Austria; output: Vienna
input: Belgium; output: Brussels
input: Canada; output: Toronto
input: China; output: Shanghai
input: Germany; output: Berlin
input: India; output: Calcutta
input: Japan; output: Tokyo
input: Portugal; output: Lisbon
input: Spain; output: Madrid
input: Tanzania; output: Nairobi
['Canada', 'China', 'India', 'Tanzania']
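
The list of wrong answers can be summarized as a single accuracy figure. The sketch below copies the outputs printed above into a dictionary; `analogy_outputs`, `capitals`, and `analogy_accuracy` are illustrative names rather than part of the notebook.

```python
# Outputs copied from the run above
analogy_outputs = {
    'Austria': 'Vienna', 'Belgium': 'Brussels', 'Canada': 'Toronto', 'China': 'Shanghai',
    'Germany': 'Berlin', 'India': 'Calcutta', 'Japan': 'Tokyo', 'Portugal': 'Lisbon',
    'Spain': 'Madrid', 'Tanzania': 'Nairobi',
}
# Same expected answers as expected_capitals above
capitals = {
    'Austria': 'Vienna', 'Belgium': 'Brussels', 'Canada': 'Ottawa', 'China': 'Beijing',
    'Germany': 'Berlin', 'India': 'New Delhi', 'Japan': 'Tokyo', 'Portugal': 'Lisbon',
    'Spain': 'Madrid', 'Tanzania': 'Dodoma',
}

def analogy_accuracy(outputs, expected):
    """Fraction of inputs whose output matches the expected answer (case-insensitive)."""
    hits = sum(
        out is not None and out.lower() == expected[key].lower()
        for key, out in outputs.items()
    )
    return hits / len(outputs)

print(analogy_accuracy(analogy_outputs, capitals))  # 0.6
```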


### Part 3: Use Word Vectors as Classifier Features

#### Task 3.1.1: Filter out Infrequent Labels

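Load the biography data, keep the first 75,000 complete rows, reduce each nationality to its last comma-separated part, and keep only the rows whose label occurs at least 500 times: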

In [21]:
nationality_df = pd.read_csv(r'C:/users/akama/downloads/word_vector_analysis/bio_name_nationality.tsv.gz', 
                             sep='\t', compression='gzip')
nationality_df = nationality_df.dropna()
nationality_df = nationality_df[:75000]
MIN_NATIONALITY_COUNT = 500

def standardize_nationality(nationality):
    # Keep only the last comma-separated part of the label
    parts = nationality.split(',')
    return parts[-1].strip()

nationality_df['nationality'] = nationality_df['nationality'].apply(standardize_nationality)
# Number of rows sharing each row's label
label_counts = nationality_df.groupby('nationality')['nationality'].transform('count')
cleaned_nationality_df = nationality_df[label_counts >= MIN_NATIONALITY_COUNT]
print(len(cleaned_nationality_df), cleaned_nationality_df.nationality.nunique())

51931 19
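
The filter works because `transform('count')` returns one value per row (the size of that row's group) rather than one value per group, so the result can be used directly as a boolean mask. A minimal sketch on a made-up frame, where `toy_df` and the threshold `MIN_COUNT = 2` are hypothetical:

```python
import pandas as pd

toy_df = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd', 'e', 'f'],
    'nationality': ['French', 'French', 'French', 'German', 'German', 'Thai'],
})
MIN_COUNT = 2  # hypothetical threshold; the notebook uses 500

# One count per row, aligned with toy_df's index
label_counts = toy_df.groupby('nationality')['nationality'].transform('count')
filtered = toy_df[label_counts >= MIN_COUNT]

print(label_counts.tolist())                      # [3, 3, 3, 2, 2, 1]
print(filtered['nationality'].unique().tolist())  # ['French', 'German']
```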


#### Task 3.1.2: Create Train/Dev/Test Data Splits

Split the cleaned data into train, development, and test sets:

In [23]:
TRAIN_SIZE = .8
DEV_SIZE = .1
TEST_SIZE = .1

train_dev_df, test_df = train_test_split(
    cleaned_nationality_df, test_size=TEST_SIZE, random_state=RANDOM_SEED, shuffle=True
)

train_proportion_in_train_dev = TRAIN_SIZE / (TRAIN_SIZE + DEV_SIZE)

train_df, dev_df = train_test_split(
    train_dev_df, test_size=(1 - train_proportion_in_train_dev), random_state=RANDOM_SEED, shuffle=True
)

print(len(train_df), len(dev_df), len(test_df))

y_train = list(train_df.nationality)
y_dev = list(dev_df.nationality)
y_test = list(test_df.nationality)

41544 5193 5194
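
The second split cannot simply reuse `DEV_SIZE`, because it is applied to the 90% of the data left after the test set is removed. The arithmetic below is a small check of the fraction used in the cell above; `remaining` and `dev_fraction_of_remaining` are illustrative names.

```python
TRAIN_SIZE, DEV_SIZE, TEST_SIZE = 0.8, 0.1, 0.1

# Stage 1 holds out TEST_SIZE of everything, leaving train + dev
remaining = 1 - TEST_SIZE

# Stage 2 takes this share of what is left as the dev set
dev_fraction_of_remaining = 1 - TRAIN_SIZE / (TRAIN_SIZE + DEV_SIZE)

print(round(dev_fraction_of_remaining, 4))                    # 0.1111
print(round(remaining * dev_fraction_of_remaining, 4))        # 0.1 of the full data
print(round(remaining * (1 - dev_fraction_of_remaining), 4))  # 0.8 of the full data
```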


#### Task 3.2: Tokenize Text and Remove Stopwords

Tokenize each biography and remove stopwords:

In [24]:
# Use a set for fast membership tests
stop_words = set(ENGLISH_STOP_WORDS)
# Same token pattern as scikit-learn's default: runs of two or more word characters
token_pattern = re.compile(r'(?u)\b\w\w+\b')

def tokenize_and_remove_stopwords(text, token_pattern, stop_words):
    tokens = token_pattern.findall(text)
    filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
    return filtered_tokens

tokenized_train_items = [tokenize_and_remove_stopwords(bio, token_pattern, stop_words) for bio in train_df['bio']]
tokenized_dev_items = [tokenize_and_remove_stopwords(bio, token_pattern, stop_words) for bio in dev_df['bio']]

print(len(tokenized_train_items[0]), len(tokenized_dev_items[0]))

30 1220
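
A short example shows what the token pattern and the stopword filter each remove. The sentence and the four-word stopword set below are made up for illustration; the notebook itself uses the full `ENGLISH_STOP_WORDS` list.

```python
import re

token_pattern = re.compile(r'(?u)\b\w\w+\b')
toy_stop_words = {'was', 'the', 'in', 'of'}  # hypothetical subset of the real list

text = 'Marie Curie was a Polish-born physicist.'

tokens = token_pattern.findall(text)
print(tokens)    # ['Marie', 'Curie', 'was', 'Polish', 'born', 'physicist']

# Compare in lowercase, but keep the original casing of the surviving tokens
filtered = [t for t in tokens if t.lower() not in toy_stop_words]
print(filtered)  # ['Marie', 'Curie', 'Polish', 'born', 'physicist']
```

The pattern already drops the one-letter word "a" and splits "Polish-born" at the hyphen. Keeping the original casing matters later, since vocabulary lookups such as `'France'` are case-sensitive.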


#### Task 3.3: Compute Word Vector-Based Features

Represent each document by the mean of the vectors of its in-vocabulary tokens; a document with no in-vocabulary tokens gets the zero vector:

In [25]:
def generate_word_vector_features(tokenized_texts, wv=wiki_wv):
    word_vectors = wv['wv']
    index = wv['index']
    features = np.zeros((len(tokenized_texts), word_vectors.shape[1]))
    for i, tokens in enumerate(tokenized_texts):
        # Keep only tokens that have a vector (lookups are case-sensitive)
        valid_vectors = [word_vectors[index[token]] for token in tokens if token in index]
        if valid_vectors:
            mean_vector = np.mean(valid_vectors, axis=0)
        else:
            mean_vector = np.zeros(word_vectors.shape[1])
        features[i] = mean_vector
    return features

# Test with a small subset
X_sample = generate_word_vector_features(tokenized_train_items[:200], wiki_wv)
print(X_sample.shape)
print(X_sample)

(200, 100)
[[ 0.39517793 -0.07868216 -0.92518407 ...  0.15139848 -0.08917703
   0.43999135]
 [-0.22381873 -0.78749293  0.40706831 ...  0.41614151 -0.24410585
   0.23181398]
 [ 0.08421799 -0.16369714 -0.53750986 ...  0.86370862  0.90777248
   0.47205114]
 ...
 [ 0.13395222 -0.50426126  0.24911262 ... -0.08104277  0.00547608
   0.99351317]
 [-0.09968048 -0.68094653  0.14089933 ... -0.46044263 -0.25196463
   0.33508641]
 [ 0.58683908  0.43996775 -0.965339   ... -0.31179458  0.26025856
  -0.47247902]]
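
The averaging step is easier to see on a tiny example. This sketch uses a made-up two-word vocabulary with 2-d vectors; `toy_index`, `toy_matrix`, and `mean_vector` are hypothetical names.

```python
import numpy as np

toy_index = {'physicist': 0, 'Polish': 1}
toy_matrix = np.array([[1.0, 3.0],    # vector for 'physicist'
                       [3.0, 1.0]])   # vector for 'Polish'

def mean_vector(tokens, index=toy_index, matrix=toy_matrix):
    # Row numbers of the tokens that have a vector; unknown tokens are skipped
    rows = [index[t] for t in tokens if t in index]
    if not rows:
        return np.zeros(matrix.shape[1])  # no known tokens: zero vector
    return matrix[rows].mean(axis=0)

print(mean_vector(['Polish', 'physicist', 'Curie']))  # [2. 2.]
print(mean_vector(['Curie']))                         # [0. 0.]
```

Every document maps to a vector of the same length regardless of how many tokens it has, which is what lets the classifier treat the features as a fixed-width matrix.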


#### Task 3.3.1: Compute Features for the Entire Data

Generate word-vector-based features for the train and dev sets:

In [26]:
X_train_wv = generate_word_vector_features(tokenized_train_items, wiki_wv)
X_dev_wv = generate_word_vector_features(tokenized_dev_items, wiki_wv)
print(X_train_wv.shape, X_dev_wv.shape)

(41544, 100) (5193, 100)


#### Task 3.4: Train and Evaluate Classifier

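Train a logistic regression classifier on the standardized word-vector features, tune the regularization strength `C` with a randomized search, and report macro-averaged F1 on the dev set: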

In [29]:
pipeline = Pipeline([
    ('scaler', StandardScaler()),  
    ('clf', LogisticRegression(random_state=RANDOM_SEED, max_iter=1000, solver='lbfgs', n_jobs=-1))
])

param_dist = {
    'clf__C': uniform(loc=0, scale=4),
}

random_search = RandomizedSearchCV(
    pipeline, param_distributions=param_dist, n_iter=20, cv=3, scoring='f1_macro',
    n_jobs=None, random_state=RANDOM_SEED, verbose=2
)
random_search.fit(X_train_wv, y_train)
best_model = random_search.best_estimator_
y_pred_dev = best_model.predict(X_dev_wv)
lr_wv_f1 = f1_score(y_dev, y_pred_dev, average='macro')

print(lr_wv_f1)

Fitting 3 folds for each of 20 candidates, totalling 60 fits
[CV] END ..........................clf__C=0.8003453423381894; total time=   7.1s
[CV] END ..........................clf__C=0.8003453423381894; total time=   4.2s
[CV] END ..........................clf__C=0.8003453423381894; total time=   4.6s
[CV] END ..........................clf__C=2.7509546782812846; total time=   4.7s
[CV] END ..........................clf__C=2.7509546782812846; total time=   5.2s
[CV] END ..........................clf__C=2.7509546782812846; total time=   5.5s
[CV] END ...........................clf__C=3.734248334862691; total time=   5.1s
[CV] END ...........................clf__C=3.734248334862691; total time=   4.9s
[CV] END ...........................clf__C=3.734248334862691; total time=   6.1s
[CV] END ..........................clf__C=0.5574663215139735; total time=   3.8s
[CV] END ..........................clf__C=0.5574663215139735; total time=   4.2s
[CV] END ..........................clf__C=0.5574
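
Macro-averaged F1, the metric used both for the search and for the dev-set score, is the unweighted mean of the per-class F1 scores. The hand-rolled function below is only an illustration of that definition on made-up labels, not scikit-learn's implementation.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Hypothetical predictions that ignore the minority class entirely
y_true = ['French'] * 8 + ['Thai'] * 2
y_pred = ['French'] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                            # 0.8
print(round(macro_f1(y_true, y_pred), 4))  # 0.4444
```

A classifier that never predicts the rare label still reaches 80% accuracy here, but its macro F1 is far lower, which makes the metric a reasonable choice when the 19 nationality labels are unevenly represented.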

#### Task 3.5: Consider Model Size

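A multiclass logistic regression learns one weight per feature per class, so the number of feature weights is the number of features times the number of classes. Compare this count for tf-idf features and word-vector features: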

In [32]:
vectorizer = TfidfVectorizer(min_df=500, stop_words='english')
X_train = vectorizer.fit_transform(train_df.bio)
num_tfidf_features = X_train.shape[1]
num_wv_features = X_train_wv.shape[1]
num_classes = len(set(y_train))
num_tfidf_feature_weights = num_tfidf_features * num_classes
num_wv_feature_weights = num_wv_features * num_classes
print('num feature weights for tfidf', num_tfidf_feature_weights)
print('num feature weights for word vects', num_wv_feature_weights)

num feature weights for tfidf 52440
num feature weights for word vects 1900
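
The two totals can be unpacked with a little arithmetic. The sketch below starts from the printed counts and the 19 classes reported earlier; it ignores the one intercept per class that logistic regression also learns.

```python
num_classes = 19
num_tfidf_feature_weights = 52440
num_wv_feature_weights = 1900

# Features per model: total weights divided by the number of classes
print(num_tfidf_feature_weights // num_classes)  # 2760 tf-idf terms
print(num_wv_feature_weights // num_classes)     # 100 word-vector dimensions

# Relative model size
print(num_tfidf_feature_weights / num_wv_feature_weights)  # 27.6
```

With these settings the tf-idf model carries roughly 28 times as many feature weights as the word-vector model.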
