# ConEc (Context Encoders), an extension of word2vec

This notebook implements the following procedure:
- Use the CBOW word2vec model with the negative sampling objective
- Train the model on the "text8", "OneBilCorpus" or "conll2003" dataset
- Multiply the trained word2vec embeddings with each word's average context vectors (CVs)
- Each word has a global CV and a local CV
- The choice of alpha in the equation given in the paper determines how much emphasis is placed on the word's local context
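
The combination step in the list above can be sketched with NumPy. This is a minimal illustration, not the conec package's code: the function name `conec_embedding`, the toy matrices, and the convention that `alpha` weights the global CV (with `1 - alpha` going to the local CV) are all assumptions; check the paper and the package for the exact form.

```python
import numpy as np

def conec_embedding(W0, global_cv, local_cv, alpha=0.5):
    """Blend a word's global and local context vectors, then project through W0.

    W0        : (vocab_size, embed_dim) trained word2vec embedding matrix
    global_cv : (vocab_size,) average context vector over the whole corpus
    local_cv  : (vocab_size,) average context vector in the current text
    alpha     : weight of the global CV (assumed convention);
                the local CV receives 1 - alpha
    """
    context = alpha * global_cv + (1.0 - alpha) * local_cv
    return context @ W0

# Toy example: 4-word vocabulary, 2-dimensional embeddings (made-up numbers).
W0 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [1.0, 1.0],
               [0.5, -0.5]])
global_cv = np.array([0.5, 0.5, 0.0, 0.0])  # corpus-wide: co-occurs with words 0 and 1
local_cv = np.array([0.0, 0.0, 1.0, 0.0])   # in this text: appears next to word 2

emb_global = conec_embedding(W0, global_cv, local_cv, alpha=1.0)  # global CV only
emb_local = conec_embedding(W0, global_cv, local_cv, alpha=0.0)   # local CV only
emb_mixed = conec_embedding(W0, global_cv, local_cv, alpha=0.5)   # equal blend
```

Under this assumed convention, lowering `alpha` moves the resulting embedding toward the words that surround the target word in the current text.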

Download the conec folder from this repository and import the conec class into your script.

Follow the instructions in README.md for details on the prerequisites for running this notebook.

In [2]:
from conec import conec

Define a conec class object, context2vec
----------------------------------------

In [3]:
context2vec = conec()

Conec object created ... 
Use this object to get text embeddings or text similarity...


---
Loading Dataset
---
Ensure the text8, OneBillionCorpus or CoNLL 2003 data is downloaded and present in the data/ folder.

In [5]:
context2vec.read_Dataset('text8')

Reading dataset ....
Dataset loaded into conec object ....
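
The training log in this notebook reports 17005207 words and 17006 sentences for text8, which is consistent with the corpus (one long whitespace-separated stream) being cut into chunks of 1000 words. A minimal sketch of such a reader follows. The function name `iter_text8_sentences` and the demo file are hypothetical, and the actual `conec.Text8Corpus` class may work differently.

```python
import os
import tempfile

def iter_text8_sentences(path, max_words=1000):
    """Yield lists of at most max_words tokens from a whitespace-separated file.

    For simplicity the whole file is read into memory, which is acceptable
    for text8 (about 100 MB) but not for much larger corpora.
    """
    with open(path, encoding="utf-8") as f:
        words = f.read().split()
    for start in range(0, len(words), max_words):
        yield words[start:start + max_words]

# Demo on a tiny stand-in file so the sketch runs without downloading text8.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "tiny_text8")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("anarchism originated as a term of abuse first used")
    sentences = list(iter_text8_sentences(demo_path, max_words=4))
```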


---
# Training 
Train the conec object with the word2vec model and negative sampling (5 iterations in a full run), then introduce the context vectors into this pre-trained model.
Modify the number of iterations, the modelType ("cbow" or "sg") and the embedding dimension (embed_dim) as required.
Here, the model is trained for only 1 iteration to save time.

In [7]:
context2vec.train(iterations=1, saveInterm=False, modelType="cbow", embed_dim=200 )

2019-04-30 02:13:10,380 : INFO : collecting all words and their counts
2019-04-30 02:13:10,386 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 unique words


conec object being trained using text8 dataset present in the directory data/text8 ....
Creating word2vec model ....
<conec.Text8Corpus object at 0x7f3b689a2358>


2019-04-30 02:13:19,176 : INFO : PROGRESS: at sentence #10000, processed 10000000 words and 189074 unique words
2019-04-30 02:13:26,616 : INFO : collected 253854 unique words from a corpus of 17005207 words and 17006 sentences
2019-04-30 02:13:26,776 : INFO : total of 71290 unique words after removing those with count < 5
2019-04-30 02:13:26,814 : INFO : constructing a table with noise distribution from 71290 words
2019-04-30 02:14:38,368 : INFO : training model on 71290 vocabulary and 200 features
2019-04-30 02:14:58,460 : INFO : PROGRESS: at 0.38% words, alpha 0.00500, 3203 words/s
2019-04-30 02:15:18,584 : INFO : PROGRESS: at 0.71% words, alpha 0.00500, 2970 words/s
2019-04-30 02:15:38,802 : INFO : PROGRESS: at 1.10% words, alpha 0.00500, 3054 words/s
2019-04-30 02:15:59,224 : INFO : PROGRESS: at 1.42% words, alpha 0.00500, 2943 words/s
2019-04-30 02:16:19,547 : INFO : PROGRESS: at 1.72% words, alpha 0.00500, 2848 words/s
2019-04-30 02:16:39,694 : INFO : PROGRESS: at 2.04% words, al

2019-04-30 02:44:31,911 : INFO : PROGRESS: at 35.64% words, alpha 0.00500, 3322 words/s
2019-04-30 02:44:52,089 : INFO : PROGRESS: at 36.06% words, alpha 0.00500, 3324 words/s
2019-04-30 02:45:12,135 : INFO : PROGRESS: at 36.49% words, alpha 0.00500, 3327 words/s
2019-04-30 02:45:32,373 : INFO : PROGRESS: at 36.91% words, alpha 0.00500, 3328 words/s
2019-04-30 02:45:52,438 : INFO : PROGRESS: at 37.28% words, alpha 0.00500, 3326 words/s
2019-04-30 02:46:12,642 : INFO : PROGRESS: at 37.67% words, alpha 0.00500, 3325 words/s
2019-04-30 02:46:32,792 : INFO : PROGRESS: at 38.04% words, alpha 0.00500, 3322 words/s
2019-04-30 02:46:53,034 : INFO : PROGRESS: at 38.43% words, alpha 0.00500, 3321 words/s
2019-04-30 02:47:13,074 : INFO : PROGRESS: at 38.81% words, alpha 0.00500, 3319 words/s
2019-04-30 02:47:33,182 : INFO : PROGRESS: at 39.22% words, alpha 0.00500, 3320 words/s
2019-04-30 02:47:53,314 : INFO : PROGRESS: at 39.62% words, alpha 0.00500, 3321 words/s
2019-04-30 02:48:13,337 : INFO :

2019-04-30 03:16:04,703 : INFO : PROGRESS: at 74.68% words, alpha 0.00500, 3387 words/s
2019-04-30 03:16:24,915 : INFO : PROGRESS: at 75.11% words, alpha 0.00500, 3388 words/s
2019-04-30 03:16:45,144 : INFO : PROGRESS: at 75.54% words, alpha 0.00500, 3389 words/s
2019-04-30 03:17:05,373 : INFO : PROGRESS: at 75.93% words, alpha 0.00500, 3388 words/s
2019-04-30 03:17:25,517 : INFO : PROGRESS: at 76.25% words, alpha 0.00500, 3384 words/s
2019-04-30 03:17:45,543 : INFO : PROGRESS: at 76.55% words, alpha 0.00500, 3380 words/s
2019-04-30 03:18:05,911 : INFO : PROGRESS: at 76.83% words, alpha 0.00500, 3374 words/s
2019-04-30 03:18:26,084 : INFO : PROGRESS: at 77.15% words, alpha 0.00500, 3370 words/s
2019-04-30 03:18:46,403 : INFO : PROGRESS: at 77.53% words, alpha 0.00500, 3368 words/s
2019-04-30 03:19:06,571 : INFO : PROGRESS: at 77.90% words, alpha 0.00500, 3367 words/s
2019-04-30 03:19:26,754 : INFO : PROGRESS: at 78.28% words, alpha 0.00500, 3366 words/s
2019-04-30 03:19:46,878 : INFO :

Training the word2vec model on multiple iterations ....
PROGRESS: at sentence #0, processed 0 words and 0 unique words
PROGRESS: at sentence #1000, processed 1000000 words and 52754 unique words
PROGRESS: at sentence #2000, processed 2000000 words and 78382 unique words
PROGRESS: at sentence #3000, processed 3000000 words and 96644 unique words
PROGRESS: at sentence #4000, processed 4000000 words and 111460 unique words
PROGRESS: at sentence #5000, processed 5000000 words and 125354 unique words
PROGRESS: at sentence #6000, processed 6000000 words and 139565 unique words
PROGRESS: at sentence #7000, processed 7000000 words and 151933 unique words
PROGRESS: at sentence #8000, processed 8000000 words and 164114 unique words
PROGRESS: at sentence #9000, processed 9000000 words and 178162 unique words
PROGRESS: at sentence #10000, processed 10000000 words and 189074 unique words
PROGRESS: at sentence #11000, processed 11000000 words and 198757 unique words
PROGRESS: at sentence #12000, pro

MemoryError: 

Predicting the embedding of a word
---
Use the model trained above to get the embedding of any word

In [12]:
context2vec.predict_embedding("student")

array([ 0.06313061, -0.07602396, -0.05714705,  0.00653351, -0.11048223,
        0.0252326 , -0.02434733, -0.02137714, -0.01423272,  0.06427327,
       -0.03336669, -0.07254225,  0.00441391,  0.01748119,  0.07599923,
       -0.01571278,  0.05168891, -0.0277834 ,  0.03054369, -0.10963402,
        0.09322368,  0.03089322,  0.03466579,  0.07251599, -0.09093159,
       -0.03380995, -0.02431794, -0.00702254, -0.05796552, -0.00326498,
        0.04588396,  0.19899118,  0.03095079,  0.00246658, -0.01261412,
        0.00416606,  0.06471258, -0.09287103,  0.015593  , -0.07240081,
       -0.10006761, -0.03702702,  0.10057703,  0.08622302,  0.05144106,
       -0.12438264, -0.03972326,  0.10357558, -0.06007904, -0.057866  ,
        0.07848126, -0.05514053,  0.00301895,  0.05253316,  0.1047078 ,
       -0.07706435, -0.01753215, -0.14088754, -0.05128637, -0.11780416,
       -0.02187855, -0.10276379, -0.06722963, -0.11287896, -0.04223928,
       -0.04004713, -0.08583377, -0.07594945, -0.10564983, -0.03

Predicting the average embedding of a sentence
---
Use the model trained above to predict the embedding of a sentence

In [20]:
context2vec.predict_sent_embedding("This is a great class")

array([-0.01163509, -0.07884874,  0.00262184, -0.09123751, -0.06797978,
        0.05508037,  0.10401928, -0.0891546 , -0.0365746 , -0.11399917,
       -0.02391525, -0.12577286, -0.07804844,  0.04706455,  0.00269985,
        0.11023757, -0.02244621, -0.17133969,  0.10515701,  0.1027593 ,
       -0.02952521, -0.00150653, -0.01417139,  0.0075647 ,  0.09811858,
       -0.02553854, -0.03945464,  0.05370576, -0.09607009,  0.04525451,
        0.05001477,  0.12098357,  0.04369219, -0.02442538, -0.04528957,
        0.00766742,  0.05725306, -0.00101892,  0.082896  , -0.06109586,
       -0.06224906, -0.00975781,  0.06704762,  0.07212045,  0.06790268,
        0.00537759, -0.03744291,  0.10451163, -0.01319254, -0.01651715,
       -0.00783649, -0.08851766, -0.01789069, -0.02096957,  0.12992635,
       -0.0729432 , -0.12817314, -0.05297756, -0.01879428, -0.02234147,
        0.05093659, -0.09001711, -0.15655198, -0.06050745, -0.12529044,
       -0.04523863, -0.0414156 ,  0.00205496, -0.07549787,  0.05
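
As a rough illustration of what an average sentence embedding is, the sketch below takes the mean of the word vectors of all in-vocabulary tokens. The helper `average_sentence_embedding`, the lowercasing and whitespace tokenization, the zero-vector fallback, and the toy vectors are assumptions for illustration; `predict_sent_embedding` may tokenize and handle unknown words differently.

```python
import numpy as np

def average_sentence_embedding(sentence, word_vectors, embed_dim):
    """Return the mean of the vectors of all known tokens in the sentence."""
    tokens = sentence.lower().split()
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        # No known words: fall back to a zero vector (an assumed convention).
        return np.zeros(embed_dim)
    return np.mean(vectors, axis=0)

# Toy 2-dimensional vectors; "a" and "class" are deliberately out of vocabulary.
toy_vectors = {
    "this": np.array([1.0, 0.0]),
    "is": np.array([0.0, 1.0]),
    "great": np.array([1.0, 1.0]),
}
sent_vec = average_sentence_embedding("This is a great class", toy_vectors, embed_dim=2)
empty_vec = average_sentence_embedding("unknown words only", toy_vectors, embed_dim=2)
```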

Similarity between words using their embeddings
---
Obtain the similarity between two words

In [21]:
context2vec.predict_similarity("wars", "war")

0.6506592455816812
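
Similarity between embeddings is commonly measured with cosine similarity. The sketch below assumes that this is the measure used here, which the notebook does not state explicitly. The function name and the toy vectors are illustrative only.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        return 0.0
    return float(np.dot(u, v) / denom)

same = cosine_similarity(np.array([1.0, 0.0]), np.array([2.0, 0.0]))        # same direction
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 3.0]))  # unrelated
opposite = cosine_similarity(np.array([1.0, 0.0]), np.array([-1.0, 0.0]))   # opposite direction
```

Because the measure ignores vector length, it can be applied unchanged to averaged sentence vectors as well as to single word vectors.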

Similarity between sentences using their average embeddings
---
Obtain the similarity between two sentences

In [4]:
context2vec.predict_sent_similarity("I am a girl", "I am a woman")

0.9734589682087043

Evaluation on Google Analogy Dataset
---
Evaluate the trained model on the Google Analogy Dataset.

Ensure questions-words.txt is present in the data/ folder.

In [5]:
context2vec.evaluate_analogy()

EOFError: Ran out of input
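
Each question in questions-words.txt consists of four words, "a b c d", and a model is scored on whether it predicts d from a, b and c. A common way to answer such a question is vector arithmetic followed by a nearest-neighbour search. The sketch below illustrates that idea with made-up vectors; `solve_analogy` is a hypothetical helper and not necessarily what `evaluate_analogy` does internally.

```python
import numpy as np

def solve_analogy(a, b, c, vocab, E):
    """Answer 'a is to b as c is to ?' by nearest neighbour to b - a + c."""
    index = {w: i for i, w in enumerate(vocab)}
    # Normalize rows so that dot products rank candidates by cosine similarity.
    E_norm = E / np.linalg.norm(E, axis=1, keepdims=True)
    target = E_norm[index[b]] - E_norm[index[a]] + E_norm[index[c]]
    scores = E_norm @ target
    # The three question words are excluded from the candidate answers.
    for w in (a, b, c):
        scores[index[w]] = -np.inf
    return vocab[int(np.argmax(scores))]

# Toy vocabulary with hand-made 3-dimensional vectors (not trained embeddings).
vocab = ["man", "woman", "king", "queen", "apple"]
E = np.array([[1.0, 0.0, 0.0],    # man
              [1.0, 1.0, 0.0],    # woman
              [1.0, 0.0, 1.0],    # king
              [1.0, 1.0, 1.0],    # queen
              [-1.0, 0.0, 0.0]])  # apple

answer = solve_analogy("man", "woman", "king", vocab, E)
```

Accuracy on the dataset is then the fraction of questions for which the predicted word equals d.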

Evaluation on CoNLL 2003 NER Task
---
Evaluate the trained model on the CoNLL 2003 NER task.

Ensure train.txt, testa.txt and testb.txt are present in the data/conll2003 folder.

In [6]:
context2vec.evaluate_ner()

FileNotFoundError: [Errno 2] No such file or directory: ''