In [23]:
import pandas as pd

# Language Modelling

RNN language models (with LSTM cells) were trained on monolingual word lists in two configurations:

- one-hot embeddings
- phonetic vectors

## Experimental Settings

- LSTM-based LM implementation: a modification of TensorFlow's PTB LM sample code
- Configuration: 
  - init_scale = 0.1
  - learning_rate = 1.0
  - max_grad_norm = 5
  - num_layers = 2
  - num_steps = 20
  - hidden_size = 200
  - max_epoch = 4
  - max_max_epoch = 13
  - keep_prob = 1.0
  - lr_decay = 0.5
  - batch_size = 20
  - GradientDescent Optimizer
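
The settings above can be collected into a single configuration object. The following is a minimal sketch, not the code used for these experiments: the class name `LMConfig` and the helper `learning_rate_for_epoch` are hypothetical, and the decay schedule (constant for the first `max_epoch` epochs, then multiplied by `lr_decay` every epoch) is an assumption that the modified code keeps the schedule of the PTB sample.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LMConfig:
    """Hyperparameters from the list above (the class name is hypothetical)."""
    init_scale: float = 0.1
    learning_rate: float = 1.0
    max_grad_norm: float = 5.0
    num_layers: int = 2
    num_steps: int = 20
    hidden_size: int = 200
    max_epoch: int = 4
    max_max_epoch: int = 13
    keep_prob: float = 1.0
    lr_decay: float = 0.5
    batch_size: int = 20


def learning_rate_for_epoch(config: LMConfig, epoch: int) -> float:
    """Learning rate for a 0-based epoch index.

    Assumed schedule: the rate stays at `learning_rate` for the first
    `max_epoch` epochs and is then multiplied by `lr_decay` once per epoch.
    """
    decay = config.lr_decay ** max(epoch + 1 - config.max_epoch, 0)
    return config.learning_rate * decay


config = LMConfig()
schedule = [learning_rate_for_epoch(config, e) for e in range(config.max_max_epoch)]
print(schedule)
```

Under this assumption, training runs for `max_max_epoch` epochs in total, and `keep_prob = 1.0` means no dropout is applied.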

**Note**

- For characters which don't have phonetic embeddings, one-hot embeddings were used
- LMs were trained with varying amounts of training corpora (number of words)

## Results 
**(See tables below)**

_Experiment name: 1-train-size_

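- Models trained with the phonetic representation generally reach lower perplexity than models trained with the one-hot representation. The gap is small at 1000 to 5000 words, where one-hot is occasionally better (e.g. ta at 1000 and 2000 words in the CoNLL 2016 data).
- With more training data, the phonetic representation seems to do better: from 15000 words (CoNLL 2016) or 8000 words (NEWS 2012) upward, it has lower perplexity than one-hot for every language.
- Corpus normalization reduces perplexity for Hindi at every training size in both configurations (compare the two NEWS 2012 tables); for Kannada the effect is mixed. The original NEWS corpus encodes nukta-adjoined characters as single codepoints. Normalization separates the nukta from the base character, a case where a phonetic embedding is clearly useful.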
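
To illustrate what separating the nukta from the base character means at the codepoint level, here is a small sketch using Python's standard `unicodedata` module. This is only a stand-in for illustration; the normalizer actually used to produce the normalized NEWS corpus is not specified here and may behave differently.

```python
import unicodedata

# DEVANAGARI LETTER QA: a nukta-adjoined character stored as one codepoint.
precomposed = "\u0958"

# Canonical decomposition splits it into the base letter KA and the nukta sign.
decomposed = unicodedata.normalize("NFD", precomposed)

print([f"U+{ord(ch):04X}" for ch in precomposed])  # ['U+0958']
print([f"U+{ord(ch):04X}" for ch in decomposed])   # ['U+0915', 'U+093C']
```

After decomposition the base letter is an ordinary character that already has a phonetic embedding, instead of a separate symbol the model has to learn from scratch.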


In [42]:
lm_results_fname='lm_results.csv'

In [43]:
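# The results file is pipe-separated and has no header row, so column names are supplied here.
lm_results = pd.read_csv(lm_results_fname, sep='|', names=['set', 'size', 'exp', 'lang', 'perplexity'])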

**Dataset: CoNLL 2016 paper dataset**

In [44]:
lm_results[lm_results['set']=='conll16'].pivot_table(index=['size'],columns=['lang','exp'],values=['perplexity'])

Perplexity by training size (rows) and language / representation (columns):

| size | bn onehot | bn phonetic | hi onehot | hi phonetic | kn onehot | kn phonetic | ta onehot | ta phonetic |
|---|---|---|---|---|---|---|---|---|
| 1000 | 15.5 | 15.367 | 16.774 | 16.661 | 14.657 | 14.648 | 12.966 | 13.211 |
| 2000 | 13.923 | 13.89 | 15.369 | 15.217 | 12.398 | 12.468 | 11.042 | 11.412 |
| 5000 | 12.43 | 12.463 | 13.7 | 13.872 | 11.783 | 11.341 | 9.672 | 9.657 |
| 15000 | 12.687 | 11.923 | 13.7 | 12.79 | 10.918 | 9.916 | 9.706 | 8.842 |
| 25000 | 12.031 | 11.126 | 13.067 | 12.258 | 9.771 | 9.141 | 8.78 | 8.376 |
| 35000 | 11.577 | 10.693 | 12.571 | 11.636 | 9.22 | 8.675 | 8.379 | 8.106 |


**Dataset: Old NEWS 2012 dataset (the one used by Gurneet in his experiments)**

In [45]:
lm_results[lm_results['set']=='news12_old'].pivot_table(index=['size'],columns=['lang','exp'],values=['perplexity'])

Perplexity by training size (rows) and language / representation (columns):

| size | hi onehot | hi phonetic | kn onehot | kn phonetic |
|---|---|---|---|---|
| 1000 | 17.73 | 17.309 | 12.436 | 12.283 |
| 2000 | 15.629 | 15.206 | 11.241 | 11.152 |
| 5000 | 13.095 | 12.981 | 9.875 | 9.876 |
| 8000 | 12.577 | 12.188 | 9.498 | 9.285 |
| 10000 | 12.313 | 11.999 | 9.388 | 9.209 |
| 13000 | 12.223 | 11.783 | 9.135 | 8.942 |


**Dataset: NEWS 2012 dataset (Old NEWS 2012 corpus normalized)**

In [46]:
lm_results[lm_results['set']=='news12'].pivot_table(index=['size'],columns=['lang','exp'],values=['perplexity'])

Perplexity by training size (rows) and language / representation (columns):

| size | hi onehot | hi phonetic | kn onehot | kn phonetic |
|---|---|---|---|---|
| 1000 | 17.524 | 16.789 | 12.494 | 12.315 |
| 2000 | 14.879 | 15.157 | 11.159 | 11.142 |
| 5000 | 12.704 | 12.735 | 9.926 | 9.882 |
| 8000 | 12.285 | 11.869 | 9.511 | 9.488 |
| 10000 | 12.017 | 11.698 | 9.368 | 9.111 |
| 13000 | 12.036 | 11.209 | 9.289 | 8.773 |
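
To make the comparison between the two representations easier to read, the relative perplexity reduction of phonetic over one-hot can be computed directly from the long-format results. The sketch below is illustrative only: it uses a handful of rows copied from the normalized NEWS 2012 results rather than reading `lm_results.csv`, and the names `sample`, `wide` and `rel_reduction_pct` are hypothetical.

```python
import pandas as pd

# A few rows copied from the normalized NEWS 2012 results, in the same
# long format as lm_results.csv: set | size | exp | lang | perplexity.
rows = [
    ("news12", 1000, "onehot", "hi", 17.524),
    ("news12", 1000, "phonetic", "hi", 16.789),
    ("news12", 13000, "onehot", "hi", 12.036),
    ("news12", 13000, "phonetic", "hi", 11.209),
    ("news12", 1000, "onehot", "kn", 12.494),
    ("news12", 1000, "phonetic", "kn", 12.315),
    ("news12", 13000, "onehot", "kn", 9.289),
    ("news12", 13000, "phonetic", "kn", 8.773),
]
sample = pd.DataFrame(rows, columns=["set", "size", "exp", "lang", "perplexity"])

# One row per (size, lang), one column per representation.
wide = sample.pivot_table(index=["size", "lang"], columns="exp", values="perplexity")

# Relative perplexity reduction of phonetic over one-hot, in percent.
wide["rel_reduction_pct"] = 100 * (wide["onehot"] - wide["phonetic"]) / wide["onehot"]

print(wide.round(2))
```

For these rows, the relative reduction is larger at 13000 words than at 1000 words for both languages. The same pivot and ratio could be applied to the full `lm_results` frame after filtering on `set`.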
