# Feature Extraction from Kannada Words

Kannada has a rather complicated script. Our usual choice of features, such as uppercase, lowercase, isdigit, prefix, and suffix, does not carry over well to such a language (the script has no letter case, for instance). Hence, we need a better way to extract features from these words.
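To see why the case-based features carry no signal, here is a minimal check. The sample word is only an illustration and is not taken from the corpus.

```python
# Kannada letters are caseless, so the usual case features are constant.
word = "ಕನ್ನಡ"  # illustrative sample word ("Kannada"), not from the corpus

case_features = {
    "isupper": word.isupper(),
    "islower": word.islower(),
    "changed_by_lower": word.lower() != word,
    "changed_by_upper": word.upper() != word,
}
print(case_features)  # every value is False
```

Since each of these features has the same value for every word, a tagger cannot learn anything from them.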

My choice was **Word2Vec**.

This script:
1. Parses and structures the corpus file into sentences.
2. Builds a Word2Vec model (using gensim).
3. Creates a Vocabulary from the corpus (using gensim).
4. Trains the Word2Vec Model on the sentences from (1).
5. Extracts and saves the word vectors from the Word2Vec model.

In [1]:
import codecs
import pickle
import numpy as np
import pandas as pd

#### Reading in the corpus

In [2]:
with codecs.open("corpus.txt", "r", encoding="utf8") as f:
    raw = f.readlines()

#### Splitting the long list of words into sentences and decoding from UTF-8

In [3]:
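lines = []
l = []  # (word, tag) pairs of the sentence currently being built

for w in raw:
    w = w.encode('utf-8')
    x = w.strip(b'\n').split(b' ')
    # A '*' or '.' token, or a line without a tag, ends the current sentence.
    if x[0] == b'*' or x[0] == b'.' or len(x) < 2:
        if len(l) > 0:
            lines.append(l)
        l = []
        continue
    l.append((x[0], x[1]))

# Keep the last sentence if the file does not end with a separator.
if len(l) > 0:
    lines.append(l)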

In [4]:
sentences = []
labels = []
num_words = 0

for l in lines:
    s = []
    lab = []
    for w in l:
        x = w[0].decode('utf-8')
        y = w[1].decode('utf-8')
        s.append(x)
        num_words += 1
        lab.append(y)
    labels.append(lab)
    sentences.append(s)
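The two cells above can also be sketched as a single pass over text lines, without the round trip through bytes. This is only a sketch: `parse_tagged_corpus` and the sample lines are hypothetical, and the one-`word tag`-pair-per-line format is assumed from the parsing code rather than documented anywhere. It uses `strip()` instead of `strip(b'\n')`, so it also drops other surrounding whitespace.

```python
def parse_tagged_corpus(raw_lines):
    """Group 'word tag' lines into sentences; '*', '.', or a tagless line ends one."""
    sentences, labels = [], []
    words, tags = [], []
    for line in raw_lines:
        parts = line.strip().split(" ")
        if parts[0] in ("*", ".") or len(parts) < 2:
            if words:
                sentences.append(words)
                labels.append(tags)
            words, tags = [], []
            continue
        words.append(parts[0])
        tags.append(parts[1])
    if words:  # flush a final sentence that has no trailing separator
        sentences.append(words)
        labels.append(tags)
    return sentences, labels


# Hypothetical sample in the assumed format (not taken from the corpus).
sample = ["ನಾನು PRP\n", "ಬಂದೆ VM\n", ". SYM\n", "ಅವನು PRP\n", "ಹೋದ VM\n"]
sample_sentences, sample_labels = parse_tagged_corpus(sample)
print(sample_sentences)
print(sample_labels)
```

As in the notebook cells, the separator token itself is not kept as part of any sentence.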

#### Create a Word2Vec model and build vocabulary from sentences

In [5]:
from gensim.models import Word2Vec

In [6]:
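min_count = 1   # keep every word, even those seen only once
size = 100      # dimensionality of each word vector
window = 20     # context window used during Word2Vec training

# Note: gensim 4.0 and later renamed the `size` argument to `vector_size`.
model = Word2Vec(window=window, min_count=min_count, size=size)
model.build_vocab(sentences)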

#### Train the model on the sentences

In [7]:
for i in range(1000):
    model.train(sentences, total_words=num_words, epochs=1)
    if i%100 == 0:
        print("Iteration ",i)

Iteration  0
Iteration  100
Iteration  200
Iteration  300
Iteration  400
Iteration  500
Iteration  600
Iteration  700
Iteration  800
Iteration  900


#### Extract vectors from the trained model and generate features

The features of the current word include the vectors of its neighboring words, up to three words on each side (seven vectors in total), to provide context for the linear-chain CRF. This avoids the need to add additional hidden units to our CRF.

In [8]:
X = []
y = []
for idx,s in enumerate(sentences):
    x = []
    yy = []
    for wi,w in enumerate(s):
        # On gensim 4.0 and later, use `model.wv` for membership tests and lookups.
        if w in model:
            m = model[w]
            # Previous words: s[wi-k] exists only when wi >= k; otherwise pad with zeros.
            if wi > 0:
                m = np.hstack([m,model[s[wi-1]]])
            else:
                m = np.hstack([m,np.zeros_like(model[w])])
            
            if wi > 1:
                m = np.hstack([m,model[s[wi-2]]])
            else:
                m = np.hstack([m,np.zeros_like(model[w])])
            
            if wi > 2:
                m = np.hstack([m,model[s[wi-3]]])
            else:
                m = np.hstack([m,np.zeros_like(model[w])])
            
            if wi < len(s)-1:
                m = np.hstack([m,model[s[wi+1]]])
            else:
                m = np.hstack([m,np.zeros_like(model[w])])
            
            if wi < len(s)-2:
                m = np.hstack([m,model[s[wi+2]]])
            else:
                m = np.hstack([m,np.zeros_like(model[w])])
            
            if wi < len(s)-3:
                m = np.hstack([m,model[s[wi+3]]])
            else:
                m = np.hstack([m,np.zeros_like(model[w])])
            x.append(m)
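    # Pad every sentence to 100 positions (assumes no sentence has more than 100 words).
    offset = 100 - len(x)
    # Padding vector: -1 in each of the 7 * size feature slots.
    pad_x = np.zeros_like(model[w]) - 1
    pad_x = np.hstack([pad_x,pad_x,pad_x,pad_x,pad_x, pad_x, pad_x])
    yy = yy + labels[idx]
    for i in range(offset):
        x.append(pad_x)
        yy.append("IRR")  # padding positions get the label "IRR"
    X.append(x)
    y.append(yy)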
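The intended windowing, where each position includes a neighbor whenever that neighbor exists and zeros otherwise, can be sketched more compactly. This is a sketch rather than the notebook's code: `window_features` and the toy vectors are hypothetical, and it works on plain arrays so it does not need a trained model.

```python
import numpy as np


def window_features(vectors, window=3):
    """Concatenate each word vector with its `window` previous and next vectors.

    Layout per word: [current, prev1..prevK, next1..nextK]; missing neighbors are zeros.
    """
    vectors = np.asarray(vectors, dtype=float)
    n, dim = vectors.shape
    zeros = np.zeros(dim)
    rows = []
    for i in range(n):
        parts = [vectors[i]]
        for k in range(1, window + 1):  # left context
            parts.append(vectors[i - k] if i - k >= 0 else zeros)
        for k in range(1, window + 1):  # right context
            parts.append(vectors[i + k] if i + k < n else zeros)
        rows.append(np.hstack(parts))
    return np.array(rows)


# Toy 2-dimensional "embeddings" for a 4-word sentence (made-up numbers).
toy = [[1, 1], [2, 2], [3, 3], [4, 4]]
feats = window_features(toy)
print(feats.shape)  # (4, 14): 7 vectors of size 2 per word
print(feats[1])
```

With 100-dimensional word vectors the same layout gives 700 features per word, which matches the seven-fold `pad_x` used for padding.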

#### Encoding target labels

In [9]:
X = np.array(X)
y = np.array(y)
np.unique(y)

array(['CC', 'DEM', 'DET', 'INJ', 'IRR', 'JJ', 'NN', 'NUM', 'PRP', 'PSP',
       'QC', 'RB', 'SYM', 'UT', 'VM', 'WQ'], 
      dtype='<U3')

In [10]:
original_shape = y.shape
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y.ravel()).reshape(original_shape)
np.unique(y)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])
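`LabelEncoder` assigns integer ids in sorted order of the unique labels, which is why `'CC'` becomes 0 and the padding label `'IRR'` becomes 4 above. A pure-Python sketch of the same mapping on an illustrative subset of the tags (the variable names are hypothetical):

```python
# Pure-Python equivalent of fitting a label encoder on a few tags.
tags = ["NN", "VM", "IRR", "NN", "CC"]  # illustrative subset of the tag set

classes = sorted(set(tags))                      # unique labels in sorted order
to_id = {tag: i for i, tag in enumerate(classes)}
encoded = [to_id[t] for t in tags]               # tag -> integer id
decoded = [classes[i] for i in encoded]          # integer id -> tag

print(classes)  # ['CC', 'IRR', 'NN', 'VM']
print(encoded)  # [2, 3, 1, 2, 0]
```

On the fitted encoder itself, `le.classes_` holds the sorted labels and `le.inverse_transform` maps ids back to tags, so predictions can be decoded later.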

#### Saving the feature and label vectors

In [11]:
with open("kannada-features.numpy", "wb") as f:
    np.save(f, X)

In [12]:
with open("kannada-labels.numpy", "wb") as f:
    np.save(f,y)
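The saved arrays can be read back with `np.load`, which recognizes the format from the file contents, so the `.numpy` extension is not a problem. A minimal round-trip sketch with a small stand-in array; the temporary path and `X_demo` are hypothetical, and the shape only mirrors (sentences, positions, features):

```python
import os
import tempfile

import numpy as np

# Small stand-in for the feature array: 2 sentences, 100 positions, 7 * 100 features.
X_demo = np.zeros((2, 100, 700), dtype=np.float32)

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo-features.numpy")  # hypothetical file name

with open(path, "wb") as f:
    np.save(f, X_demo)  # writing to a file object keeps the name exactly as given

X_loaded = np.load(path)
print(X_loaded.shape, X_loaded.dtype)
```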