# Word2vec Experiments using DCS

For a [Sanskrit parser project](https://github.com/kmadathil/sanskrit_parser) that I am collaborating on, we have been investigating different language models and their applicability to parsing Sanskrit. I have been particularly interested in deep learning approaches to language modeling, such as Seq2Seq (+ attention). A building block in many of these approaches is the embedding of words in a vector space using word2vec or GloVe. This notebook contains some of my experiments with word2vec on the Digital Corpus of Sanskrit (DCS), to investigate the feasibility of training word2vec on just the root words (prAtipadikas/dhAtus) in Sanskrit.

The DCS database is quite small from a deep learning perspective (about 30 MB if we count just the root words), so it was unclear how good the results would be or what to expect. (Spoiler: I was pleasantly surprised by the quality of the results for a first pass.)

## Import libraries and configure logging

In [1]:
from __future__ import print_function
import gensim
import logging
import codecs
import zipfile
import os
import itertools
import pprint
from indic_transliteration import sanscript

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', 
                    level=logging.INFO)

Using Theano backend.


## Prepare the data
For this experiment, let us investigate the feasibility of using just the root words for embedding, ignoring the POS tags (which in Sanskrit would be vibhakti/vacana/puruSha/lakAra). The DCS data has been preprocessed, based on the annotations in the DCS database, and saved as a CSV file with rows of the form:
`id, sentence, roots`
Note that the database treats half a shloka as a sentence in most cases, so a row might not be a full sentence / vAkya.

The file sent_roots.zip in the git repository has a zipped form of this file, which the next cell will unzip if necessary.

In [2]:
data_file = "sent_roots.csv"
if not os.path.exists(data_file):
    print("Extracting from zip file ...")
    with zipfile.ZipFile(os.path.splitext(data_file)[0] + ".zip", 'r') as myzip:
        myzip.extract(data_file)
    print("Done")
else:
    print(data_file, "already exists")

Extracting from zip file ...
Done


Word2vec in gensim only requires an iterable that yields one sentence at a time, as a list of words, as described in this [tutorial](https://rare-technologies.com/word2vec-tutorial/). Let us adapt the example class given there to our data.

In [3]:
class DCSSentences(object):
    def __init__(self, data_file):
        self.data_file = data_file
 
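    def __iter__(self):
        # Re-open the file on every call so the corpus can be iterated over multiple times
        with codecs.open(self.data_file, 'rb', "utf8") as f:
            for line in f:
                # The roots are the last comma-separated field; split them on whitespace
                yield line.split(",")[-1].split()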
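
One reason to wrap the file in a class with `__iter__`, rather than pass a plain generator, is that word2vec training makes several passes over the data (one to build the vocabulary, then one per epoch), and a generator is exhausted after its first pass. The sketch below illustrates the difference on a small in-memory stand-in for the CSV file. `RestartableRoots`, `one_shot_roots`, the ids and the `<sentence text>` placeholders are hypothetical; only the root words are taken from the sample output in this notebook.

```python
# Hypothetical stand-in for lines of sent_roots.csv (id, sentence, roots).
# The ids and sentence placeholders are invented for illustration.
SAMPLE_LINES = [
    "1,<sentence text>,rUpa tad BU",
    "2,<sentence text>,evam SAnti kAma",
]


class RestartableRoots(object):
    """Iterable that yields the root words of each line; can be looped over repeatedly."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            # The roots are the last comma-separated field, separated by whitespace.
            yield line.split(",")[-1].split()


def one_shot_roots(lines):
    """Generator version of the same logic: exhausted after a single pass."""
    for line in lines:
        yield line.split(",")[-1].split()


restartable = RestartableRoots(SAMPLE_LINES)
first_pass = list(restartable)
second_pass = list(restartable)   # yields the same sentences again

gen = one_shot_roots(SAMPLE_LINES)
gen_first = list(gen)
gen_second = list(gen)            # empty: the generator is already used up

print(len(first_pass), len(second_pass))
print(len(gen_first), len(gen_second))
```

The second pass over the generator yields nothing, which is why an object with its own `__iter__` is the safer input for multi-epoch training.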

In [4]:
# Look at first 10 sentence roots
sentences = DCSSentences(data_file)
pprint.pprint(list(itertools.islice(sentences, 10)))

[[u'paYcan', u'ratna', u'muKya', u'ca', u'uparatna', u'catuzwaya'],
 [u'pravAla', u'lohita', u'pravac', u'vEqUrya', u'harita', u'pARqura'],
 [u'ABIra', u'pAnTa', u'taTA', u'api', u'ca', u'vanya'],
 [u'jYA', u'maDUla', u'saMjYA', u'api', u'maDUka', u'vArisaMsTita'],
 [u'aTa', u'atas', u'katiDApuruzIya', u'SArIra', u'vyAKyA'],
 [u'aNgAraka', u'iti', u'KyAti', u'gam', u'DarAtmaja'],
 [u'devaloka', u'ca', u'tvad', u'rUpa', u'BU'],
 [u'yad', u'ca', u'tvad', u'pUjay', u'caturTI', u'tvad', u'nara'],
 [u'rUpa', u'tad', u'BU'],
 [u'evam', u'SAnti', u'kAma']]


## Word2Vec Exploration

### Setup parameters

In [5]:
# We can use the hierarchical softmax strategy if we want to be able to score sentences using roots later
# The default word2vec model does not use it.
use_hier_softmax = False

# Prepare keyword args
workers = 4        # number of threads to use
iterations = 10    # number of iterations (epochs) over the data

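# Note: `iter` is the argument name in the gensim release used here; gensim 4.0+ renamed it to `epochs`
kwargs = dict(workers=workers, iter=iterations)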

if use_hier_softmax:
    save_filename = "model_sent_roots_hs.dat"
    kwargs["hs"] = 1
    kwargs["negative"] = 0
else:
    save_filename = "model_sent_roots.dat"

### Train and save the model

In [6]:
model = gensim.models.Word2Vec(sentences, **kwargs)
# Since we will not be doing any more training, we switch over to the KeyedVectors that are generated and only save those
model = model.wv
model.save(save_filename)

2017-08-06 23:56:27,819 : INFO : collecting all words and their counts
2017-08-06 23:56:27,819 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-08-06 23:56:27,989 : INFO : PROGRESS: at sentence #10000, processed 65098 words, keeping 10218 word types
2017-08-06 23:56:28,121 : INFO : PROGRESS: at sentence #20000, processed 130561 words, keeping 13805 word types
2017-08-06 23:56:28,253 : INFO : PROGRESS: at sentence #30000, processed 194878 words, keeping 16185 word types
2017-08-06 23:56:28,385 : INFO : PROGRESS: at sentence #40000, processed 256745 words, keeping 18054 word types
2017-08-06 23:56:28,530 : INFO : PROGRESS: at sentence #50000, processed 318673 words, keeping 20161 word types
2017-08-06 23:56:28,684 : INFO : PROGRESS: at sentence #60000, processed 387302 words, keeping 22092 word types
2017-08-06 23:56:28,805 : INFO : PROGRESS: at sentence #70000, processed 444988 words, keeping 23405 word types
2017-08-06 23:56:28,930 : INFO : PROGRESS: at s

### Load a pre-trained model

In [7]:
model = gensim.models.KeyedVectors.load(save_filename)

2017-08-06 23:58:54,618 : INFO : loading KeyedVectors object from model_sent_roots.dat
2017-08-06 23:58:54,756 : INFO : setting ignored attribute syn0norm to None
2017-08-06 23:58:54,756 : INFO : loaded model_sent_roots.dat


### Look at some (interesting?) relationships

In [9]:
# Convenience function for devanagari output
def print_similar_devanagari(word_slp):
    s = sanscript.transliterate(word_slp, sanscript.SLP1, sanscript.DEVANAGARI) + " -- "
    # `model` is already a KeyedVectors instance, so query it directly
    similar = model.most_similar(word_slp)
    for result in similar:
        s += sanscript.transliterate(result[0], sanscript.SLP1, sanscript.DEVANAGARI) + u" ({:0.4f}) | ".format(result[1])
    print(s + "\n")
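
The numbers printed in parentheses are the scores returned by `most_similar`, which in gensim are cosine similarities between word vectors. As a rough illustration of that ranking, here is a pure-Python sketch. The three-dimensional toy vectors and the helper names `cosine` and `toy_most_similar` are made up for illustration; they are neither the trained embeddings nor gensim's implementation.

```python
import math

# Toy 3-d vectors, invented for illustration only (not the trained embeddings).
toy_vectors = {
    "putra": [1.0, 0.0, 0.0],
    "suta":  [0.9, 0.1, 0.0],
    "gam":   [0.1, 1.0, 0.0],
    "yA":    [0.0, 0.9, 0.2],
}


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


for other, score in toy_most_similar("putra", toy_vectors):
    print("{} ({:0.4f})".format(other, score))
```

A score near 1 means the two vectors point in almost the same direction; a score near 0 means they are nearly orthogonal.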

**The model appears to have learned sensible synonyms and common contexts for brAhmaNa, dharma, idAnIm, etc.**

In [10]:
words = ["brAhmaRa", "Darma", "idAnIm", "putra", "patnI"]
for word in words:
    print_similar_devanagari(word)

2017-08-06 23:59:35,224 : INFO : precomputing L2-norms of word weight vectors


ब्राह्मण -- द्विजाति (0.6738) | द्विज (0.6690) | विप्र (0.6587) | अतिथि (0.6003) | श्रोत्रिय (0.5574) | शूद्र (0.5392) | द्विजोत्तम (0.5375) | याजक (0.5345) | वैश्य (0.5314) | ऋत्विज् (0.5281) | 

धर्म -- स्वधर्म (0.6565) | धर्म्य (0.6315) | आचार (0.5989) | अधर्म (0.5866) | कृतात्मन् (0.5559) | स्मार्त (0.5440) | व्यवसाय (0.5299) | धार्मिक (0.5162) | नय (0.5116) | वृत्ति (0.5096) | 

इदानीम् -- अधुना (0.6937) | किमर्थ (0.6628) | सांप्रतम् (0.6600) | अवश्यम् (0.6560) | हन्त (0.6541) | सम्प्रति (0.6315) | विवक्ष् (0.6172) | स्वामिन् (0.5934) | कस्मात् (0.5916) | यथातथ (0.5904) | 

पुत्र -- सुत (0.8519) | तनय (0.7341) | आत्मज (0.6885) | अपत्य (0.6275) | दायाद (0.6238) | सूनु (0.5822) | स्नुषा (0.5657) | स्याल (0.5491) | भार्या (0.5377) | सुता (0.5363) | 

पत्नी -- भार्या (0.7047) | जाया (0.6891) | अरुन्धती (0.6738) | सती (0.6578) | दाक्षायणी (0.6513) | दुहितृ (0.6300) | सुता (0.6262) | मेना (0.6261) | स्वसृ (0.6036) | देवर (0.5940) | 



In [11]:
roots = ["Sru", "vac", "gam"]
for root in roots:
    print_similar_devanagari(root)

श्रु -- निशामय् (0.6804) | निबुध् (0.6460) | आकर्णय् (0.6092) | संश्रु (0.6029) | कथय् (0.5691) | उपश्रु (0.5514) | श्रावय् (0.5473) | कीर्तय् (0.5161) | समाचक्ष् (0.5140) | कथ् (0.5139) | 

वच् -- अभिधा (0.7199) | अह् (0.7132) | ब्रू (0.7086) | भाष् (0.7014) | प्रतिवच् (0.6943) | प्राह् (0.6919) | विज्ञापय् (0.6333) | व्याहृ (0.6307) | प्रवच् (0.6287) | आचक्ष् (0.6264) | 

गम् -- प्रया (0.7556) | व्रज् (0.7523) | या (0.7329) | नी (0.6761) | आगम् (0.6707) | प्रस्था (0.6436) | उपागम् (0.6378) | उपया (0.6370) | प्रतिगम् (0.6304) | उपगम् (0.6289) | 



**Let's look at some characters**

In [12]:
characters = ["yuDizWira", "BIma", "arjuna", "kfzRa"]
for c in characters:
    print_similar_devanagari(c)

युधिष्ठिर -- सुयोधन (0.6517) | धर्मसुत (0.6478) | पाण्डव (0.6346) | धर्मराज (0.6299) | धनंजय (0.6193) | महीपति (0.6065) | वृकोदर (0.6054) | दुर्योधन (0.5988) | धर्मपुत्र (0.5983) | अजातशत्रु (0.5955) | 

भीम -- भीमसेन (0.7958) | वृकोदर (0.6609) | किरीटिन् (0.6493) | सूतपुत्र (0.6425) | भैमसेनि (0.6417) | सात्यकि (0.6379) | मारुति (0.6364) | राधेय (0.6282) | युयुधान (0.6266) | प्रहस्त (0.6250) | 

अर्जुन -- फल्गुन (0.8275) | धनंजय (0.7889) | बीभत्सु (0.7321) | सात्यकि (0.7128) | सव्यसाचिन् (0.7106) | पार्थ (0.6977) | राधेय (0.6902) | पाण्डव (0.6768) | गाण्डीवधन्वन् (0.6687) | वृकोदर (0.6624) | 

कृष्ण -- शुक्ल (0.5317) | पीत (0.5152) | वासुदेव (0.5089) | गोविन्द (0.4953) | जनार्दन (0.4815) | पार्थ (0.4739) | केशव (0.4666) | चतुर्दशी (0.4662) | श्वेत (0.4648) | हृषीकेश (0.4527) | 



It's interesting (ironic?) that the model places duryodhana/suyodhana in contexts similar to yudhiShThira. For bhIma and arjuna, karNa shows up among the similar words (as rAdheya, and for bhIma also as sUtaputra), and several synonyms are learned correctly.
For kRShNa, the embedding picks up names of the character kRShNa (vAsudeva, govinda, janArdana, keshava, hRShIkesha) along with colour words such as shukla and shveta (white) and pIta (yellow), which reflect the other meaning of kRShNa = black.

**Some other examples**

In [13]:
words = ["pravAla", "catur", "kavi", "BUpAla", "SAlmalI"]
for word in words:
    print_similar_devanagari(word)

प्रवाल -- विद्रुम (0.8978) | मुक्ताफल (0.8927) | वैडूर्य (0.8830) | मरकत (0.8827) | मुक्ता (0.8776) | कुट्टिम (0.8745) | मौक्तिक (0.8674) | इन्द्रनील (0.8508) | पद्मराग (0.8479) | पुष्पराग (0.8462) | 

चतुर् -- त्रि (0.6457) | अष्टन् (0.6245) | षष् (0.6100) | द्वि (0.5971) | पञ्चन् (0.5778) | नवन् (0.5525) | एकैक (0.5377) | द्वादशन् (0.5293) | षोडशन् (0.5104) | पञ्चदशन् (0.5022) | 

कवि -- काव्य (0.7199) | अविनाशिन् (0.6559) | सूरि (0.6375) | मन्तृ (0.6307) | आयुर्वेद (0.6290) | वेत्तृ (0.6277) | व्याकरण (0.6223) | उपनिषद् (0.6197) | विपरिलोप (0.6098) | नाट्य (0.6096) | 

भूपाल -- महीपाल (0.6893) | वीरवर (0.6221) | अम्बा (0.6214) | द्वाःस्थ (0.6097) | गुह (0.6016) | धर्मसूनु (0.6003) | वसुधाधिप (0.5971) | रघुनन्दन (0.5923) | दर्पसार (0.5903) | शैलेन्द्र (0.5896) | 

शाल्मली -- शाल्मलि (0.9005) | शिरीष (0.8990) | पीलु (0.8976) | फलिनी (0.8938) | वेतस (0.8923) | उशीर (0.8914) | शेलु (0.8875) | धातकी (0.8875) | कोविदार (0.8869) | इङ्गुद (0.8860) | 



## Next Steps
For a first pass, the above results are surprisingly good and pretty interesting. They suggest that embeddings of just the roots (prAtipadikas, dhAtus) could serve as a building block for deep learning. It would be interesting to see what happens once we add in the vibhakti/vacana/puruSha/lakAra/etc. tags. Perhaps it would make sense to learn a separate embedding for them, or to combine each tag with its root word into a single token? More experiments are certainly needed ...
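
To make the two options concrete, here is a hedged sketch of how an annotated sentence might be turned into word2vec input either way. The `(root, tag)` pairs, the tag strings and the function names are hypothetical placeholders, not the DCS annotation scheme or anything implemented in this notebook.

```python
# Hypothetical annotated sentence as (root, tag) pairs.
# The tag strings are placeholders, not real DCS tags.
annotated = [("rUpa", "TAG_A"), ("tad", "TAG_B"), ("BU", "TAG_C")]


def combined_tokens(pairs, sep="|"):
    """Option 1: fuse each root with its tag into a single vocabulary entry."""
    return [root + sep + tag for root, tag in pairs]


def separate_streams(pairs):
    """Option 2: keep parallel root and tag sequences, one embedding table each."""
    roots = [root for root, _ in pairs]
    tags = [tag for _, tag in pairs]
    return roots, tags


print(combined_tokens(annotated))
roots, tags = separate_streams(annotated)
print(roots)
print(tags)
```

Fusing roots and tags multiplies the vocabulary size and makes each token rarer, which may matter for a corpus this small. Keeping separate streams leaves the root vocabulary unchanged but needs a model that can combine the two embeddings.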