# **Generate Sense Embeddings from word2vec Embeddings**

#### 0. Constants and imports

In [1]:
import os
import sys
import subprocess
from gensim.models.keyedvectors import KeyedVectors

In [4]:
setup = True
generate_txt_embeddings = True
data_file = "data/corpus.txt"
embeddings_file = "data/word2vec_twitter_tokens.bin"
embeddings_txt_file = "data/word2vec_twitter_tokens.word_vectors"

#### 1. Install the packages

All base packages are listed in the requirements file; faiss is installed separately because of its OS compatibility issues.
The spaCy language data is downloaded with a shell command.

In [5]:
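if setup:
    # Call pip and spaCy through the current interpreter so packages land in this kernel's environment
    subprocess.call([sys.executable, "-m", "pip", "install", "-r", "requirements.txt"])
    subprocess.call([sys.executable, "-m", "spacy", "download", "en"])
    subprocess.call([sys.executable, "-m", "pip", "install", "faiss-cpu"])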
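
Before re-running the install cell, it can help to see which packages are already importable. The following is a minimal sketch, not part of the original notebook: `missing_modules` is a hypothetical helper, and the import names in `needed` are assumptions (for example, the `faiss-cpu` distribution is imported as `faiss`).

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [
        name
        for name in module_names
        if importlib.util.find_spec(name) is None
    ]


# Assumed import names for the packages used in this notebook.
needed = ["gensim", "spacy", "faiss"]
print("Missing modules:", missing_modules(needed))
```

If the printed list is empty, `setup` can be set to `False` to skip the install step.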

--------

#### 2. Generate embeddings *.txt* file

The Sensegram generator repo expects the embeddings as either a *.txt* or a *.gz* file, so we first read in the *.bin* file obtained from the download link in Goldin's repo.  
We then use Gensim to write the vectors out in text format to a *.word_vectors* file, as described by the creators of Sensegram.

In [6]:
if (not os.path.exists(embeddings_txt_file)) and generate_txt_embeddings:
    model = KeyedVectors.load_word2vec_format(embeddings_file, binary=True, unicode_errors='ignore')
    model.save_word2vec_format(embeddings_txt_file, binary=False)
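
The word2vec text format starts with a header line holding the vocabulary size and the vector dimension, followed by one line per word with its space-separated values. As a quick sanity check on the converted file, the sketch below reads the header and the first few rows. `preview_word2vec_txt` is a hypothetical helper, and it assumes tokens contain no spaces.

```python
def preview_word2vec_txt(path, n_preview=3):
    """Return (vocab_size, vector_size, first_words) for a word2vec text file."""
    with open(path, encoding="utf-8", errors="ignore") as fh:
        # Header line: "<vocab_size> <vector_size>"
        vocab_size, vector_size = (int(x) for x in fh.readline().split())
        first_words = []
        for _ in range(n_preview):
            line = fh.readline()
            if not line:
                break  # file has fewer rows than requested
            parts = line.rstrip().split(" ")
            word, values = parts[0], parts[1:]
            if len(values) != vector_size:
                raise ValueError(
                    f"{word!r} has {len(values)} values, expected {vector_size}"
                )
            first_words.append(word)
    return vocab_size, vector_size, first_words
```

Calling `preview_word2vec_txt(embeddings_txt_file)` after the conversion should return the same vocabulary size and dimension as the loaded model.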

--------

#### 3. Generate Sense Embeddings
Make the *sensegram_package* GitHub repo importable (by appending it to the system path), so that the imports between the package's files resolve.

In [12]:
sys.path.append("sensegram_package/")

Run the *train.py* script to generate sense embeddings from the pretrained word2vec embeddings.

In [15]:
subprocess.call([sys.executable, "sensegram_package/train.py", embeddings_txt_file])

0
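
The `0` above is the exit status returned by `subprocess.call`; a nonzero value would mean the script failed, but the cell itself would still finish without an error. A minimal sketch of a stricter variant follows. `run_checked` is a hypothetical helper, and the command passed to it here is a stand-in rather than the real training call.

```python
import subprocess
import sys


def run_checked(args):
    """Run a command and raise CalledProcessError on a nonzero exit status."""
    completed = subprocess.run(args, check=True)
    return completed.returncode


# Stand-in command; in the notebook this would be the train.py invocation.
run_checked([sys.executable, "-c", "print('training step would run here')"])
```

With `check=True`, a failed run stops the notebook at that cell instead of silently returning a nonzero code.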