# Owner Categorization with an RNN

In this notebook, I implement a recurrent neural network that categorizes owners based on their names. An RNN is a better fit than a feedforward network here because it can use information about the *sequence* of words. We'll use a dataset of owner names from the Philippines and India.

The architecture for this network is shown below.

Here, we'll pass in words to an embedding layer.

From the embedding layer, the new representations will be passed to LSTM cells. These add recurrent connections to the network so we can include information about the sequence of words in the data. Finally, the LSTM outputs go to a sigmoid output layer. Since there are only two categories, the output layer is a single unit with a sigmoid activation function.

We are only interested in the sigmoid output of the very last time step and can ignore the rest. We'll calculate the cost from the output of that last step and the training label.

Charles Jansen

In [1]:
import numpy as np
import tensorflow as tf #TensorFlow 1.0
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

In [2]:
!pip install xlrd
fullExcel = pd.read_excel("owners.xlsx")



In [3]:
fullExcel.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 181337 entries, 0 to 181336
Data columns (total 3 columns):
ownername                     181337 non-null object
nonPortfolioHolderTypeName    181337 non-null object
Country                       181337 non-null object
dtypes: object(3)
memory usage: 4.2+ MB


In [4]:
fullExcel.head()

Unnamed: 0,ownername,nonPortfolioHolderTypeName,Country
0,DBH International Private Limited,Company,India
1,Karan Thapar,Person,India
2,Karun Carpets Private Limited,Company,India
3,Lotus Global Investments Ltd,Company,India
4,S. K. Toshniwal,Person,India


In [5]:
names = np.array(fullExcel.ownername)
categories =  np.array(fullExcel.nonPortfolioHolderTypeName)

In [6]:
print(names[:3])
print(categories[:3])

['DBH International Private Limited' 'Karan Thapar'
 'Karun Carpets Private Limited']
['Company' 'Person' 'Company']


## Data preprocessing

Since we're using embedding layers, we'll need to encode each word with an integer.

In [7]:
all_names = ' '.join(names)
words = all_names.split()

In [8]:
all_names[:1000]

'DBH International Private Limited Karan Thapar Karun Carpets Private Limited Lotus Global Investments Ltd S. K. Toshniwal Vijay D. Rai Bharath Chmpaklal Sutaria Sridhar reddy Gireddy Srinivasa Reddy Arikatla Surender Reddy Bhimavarapu Vasantha Madasu Vijay SanathBai Chokshi Dharmayug Investments ltd Dilip Kumar Lakhi Gurpreet N S Sobti India - Central Government /State Government Innovative Money Matters Pvt. Ltd. Jagdeep Singh Jasmeet Kaur Sethi KKM Enterprises Pvt. Ltd. Navjeet Singh Sobti Parmeet Kaur Rakan Infrastructures Pvt Ltd Swift Buildwell Pvt. Ltd. Vishvdeva Leasing and Investments pvt ltd Anil Pandit Credit Renaissance Development Fund LP Credit Renaissance Fund Limited Garuda Plant Products Limited Hima Sheth Indrani Khanna Mahindra & Mahindra Limited Smita Patel Surjit Uberoi Trenton Investments Company Private Limited Columbia Wanger Asset Management, L.P FMR LLC Nalanda India Fund Limited Warburg Pincus International Partners, L.P Warburg Pincus Netherlands Internation

In [9]:
words[:20]

['DBH',
 'International',
 'Private',
 'Limited',
 'Karan',
 'Thapar',
 'Karun',
 'Carpets',
 'Private',
 'Limited',
 'Lotus',
 'Global',
 'Investments',
 'Ltd',
 'S.',
 'K.',
 'Toshniwal',
 'Vijay',
 'D.',
 'Rai']

### Encoding the words

The embedding lookup requires that we pass in integers to our network. The easiest way to do this is to create dictionaries that map the words in the vocabulary to integers. Then we can convert each of our owner names into integers so they can be passed into the network.

In [10]:
from collections import Counter
counts = Counter(words)
vocab = sorted(counts, key=counts.get, reverse=True)
vocab_to_int = {word: ii-1 for ii, word in enumerate(vocab, 1)}

names_ints = []
for each in names:
    names_ints.append([vocab_to_int[word] for word in each.split()])
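
If padding and real words should not share an integer, one option is to reserve the lowest indices. The sketch below is only an illustration of that idea, not what this notebook runs: `PAD`, `UNK`, `build_vocab` and `encode` are hypothetical names, and the three sample names are taken from the data shown earlier.

```python
from collections import Counter

PAD = 0  # hypothetical: index reserved for padding
UNK = 1  # hypothetical: index shared by all words not seen in training


def build_vocab(names):
    """Map each word to an integer >= 2, most frequent words first."""
    counts = Counter(word for name in names for word in name.split())
    ordered = sorted(counts, key=counts.get, reverse=True)
    return {word: idx for idx, word in enumerate(ordered, start=2)}


def encode(name, vocab_to_int):
    """Turn a name into a list of integers, falling back to UNK."""
    return [vocab_to_int.get(word, UNK) for word in name.split()]


sample_names = ["Karun Carpets Private Limited",
                "Karan Thapar",
                "DBH International Private Limited"]
sample_vocab = build_vocab(sample_names)
print(encode("Karan Private Holdings", sample_vocab))
```

With this variant the embedding layer would need `len(vocab) + 2` rows, since two indices are no longer tied to a word.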

### Encoding the labels (= the category here)

Our labels are "Company" or "Person". To use these labels in our network, we convert them to 1 and 0 respectively.


In [11]:
categories = np.array([1 if each == 'Company' else 0 for each in categories])

In [12]:
print(categories)

[1 0 1 ..., 1 0 1]


In [13]:
names_lens = Counter([len(x) for x in names_ints])
print("Zero-length names: {}".format(names_lens[0]))
print("Maximum name length: {}".format(max(names_lens)))

Zero-length names: 0
Maximum name length: 28


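For names shorter than 28 words, we'll left-pad with 0s so that every row has the same length. Keep in mind that 0 is also the index of the most frequent word, 'Ltd'.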

In [14]:
seq_len = max(names_lens)
features = np.zeros((len(names_ints), seq_len), dtype=int)
for i, row in enumerate(names_ints):
    features[i, -len(row):] = np.array(row)[:seq_len]
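
To make the left-padding easier to see, here is a small standalone sketch on toy data. `left_pad` is a hypothetical helper, not part of the notebook; unlike the loop above, it also skips empty rows instead of failing on them.

```python
import numpy as np


def left_pad(rows, seq_len, pad_value=0):
    """Right-align each integer sequence in a (len(rows), seq_len) array."""
    features = np.full((len(rows), seq_len), pad_value, dtype=int)
    for i, row in enumerate(rows):
        if not row:
            continue  # an empty row stays all padding
        trimmed = row[:seq_len]  # keep only the first seq_len words
        features[i, -len(trimmed):] = trimmed
    return features


padded = left_pad([[5, 7], [3], [1, 2, 3, 4, 5, 6]], seq_len=4)
print(padded)
```

Short rows get zeros on the left, and the row that is too long is cut after its first four entries.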

In [15]:
features[:10,:100]

array([[    0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0, 13552,    86,     6,
            2],
       [    0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,   588,
         1297],
       [    0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,  5857,  8031,     6,
            2],
       [    0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,  3005,    91,    14,
            0],
       [    0,     0,     0,     0,     0,     0,     0,     0,     0,
            0

## Training, Validation, Test



With our data in nice shape, we'll split it into training, validation, and test sets.

10% of the data is randomly held out for testing.

The remaining 90% is split again, 90/10, into training and validation sets. K-fold cross-validation on that 90% was also tried, then dropped: it took much longer and gave the same 98.9 result.

In [16]:
split_frac = 0.9
train_val_x, test_x, train_val_y, test_y = train_test_split(
    features, categories, 
    train_size = split_frac)

# without KFold
train_x, val_x, train_y, val_y = train_test_split(
    train_val_x, train_val_y, 
    train_size = split_frac)
'''
#Kfold
train_x = []
val_x   = []
train_y = []
val_y   = []
train_x =  np.empty([0,max(names_lens)])
val_x   =  np.empty([0,max(names_lens)])
train_y =  np.empty(0)
val_y   =  np.empty(0)

kf = KFold(n_splits = 9, shuffle=True)
for train_index, val_index in kf.split(train_val_x):
    train_temp_x, val_temp_x = train_val_x[train_index], train_val_x[val_index]
    train_temp_y, val_temp_y = train_val_y[train_index], train_val_y[val_index]
    train_x = np.concatenate((train_x, train_temp_x), axis=0)
    val_x   = np.concatenate((val_x, val_temp_x), axis=0)
    train_y = np.concatenate((train_y, train_temp_y), axis=0)
    val_y = np.concatenate((val_y, val_temp_y), axis=0)

'''
print("\t\t\tFeature Shapes:")
print("X\nTrain set: \t\t{}".format(train_x.shape), 
      "\nValidation set: \t{}".format(val_x.shape),
      "\nTest set: \t\t{}".format(test_x.shape),
      "\nY\nTrain set: \t\t{}".format(train_y.shape), 
      "\nValidation set: \t{}".format(val_y.shape),
      "\nTest set: \t\t{}".format(test_y.shape),
     )

			Feature Shapes:
X
Train set: 		(146882, 28) 
Validation set: 	(16321, 28) 
Test set: 		(18134, 28) 
Y
Train set: 		(146882,) 
Validation set: 	(16321,) 
Test set: 		(18134,)
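
The set sizes above can be checked by hand. This sketch assumes that `train_test_split` rounds the training share down and gives the remainder to the other part; under that assumption, applying the 0.9 split twice reproduces the printed shapes.

```python
import math

n_total = 181337      # rows reported by fullExcel.info()
split_frac = 0.9

# First split: train+validation versus test.
n_train_val = math.floor(n_total * split_frac)
n_test = n_total - n_train_val

# Second split: train versus validation.
n_train = math.floor(n_train_val * split_frac)
n_val = n_train_val - n_train

print(n_train, n_val, n_test)
```

So the training set ends up with about 81% of all rows, and validation with about 9%.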


## Build the graph

Here, we'll build the graph. First up, defining the hyperparameters.

* `lstm_size`: Number of units in the hidden layers in the LSTM cells. 
* `lstm_layers`: Number of LSTM layers in the network. I'd start with 1, then add more if I'm underfitting.
* `batch_size`: The number of names to feed the network in one training pass.
* `learning_rate`: Learning rate passed to the Adam optimizer.

In [56]:
lstm_size = 64
lstm_layers = 1
batch_size = 1024
learning_rate = 0.01
tf.reset_default_graph()

For the network itself, we'll be passing in our 28-element name vectors. Each batch will contain `batch_size` of them. We'll also be using dropout on the LSTM layer, so we'll make a placeholder for the keep probability.

In [57]:
n_words = len(vocab_to_int)

# Add nodes to the graph
with tf.name_scope('inputs'):
    inputs_ = tf.placeholder(tf.int32, [None, None], name='inputs')
with tf.name_scope('labels'):
    labels_ = tf.placeholder(tf.int32, [None, None], name='labels')
with tf.name_scope('keep_prob'):    
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')

### Embedding

Now we'll add an embedding layer. 


In [58]:
len(vocab)

79782

In [59]:
vocab[:20]

['Ltd',
 'Pvt',
 'Limited',
 '&',
 'Ltd.',
 'LIMITED',
 'Private',
 'Kumar',
 'Shah',
 'Pvt.',
 'PRIVATE',
 'Fund',
 'Investment',
 'Patel',
 'Investments',
 'S',
 'LTD',
 'P',
 'K',
 'Company']

In [60]:
# Size of the embedding vectors (number of units in the embedding layer)
embed_size = 300 

with tf.name_scope("Embedded"):
    embed = tf.contrib.layers.embed_sequence(inputs_, vocab_size=n_words, embed_dim=embed_size)
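
Conceptually, an embedding layer is a trainable table with one row per word index, and the lookup is plain row selection. The numpy sketch below illustrates only that lookup on toy sizes; the table here is random and untrained, and all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

toy_vocab_size = 10   # hypothetical, far smaller than the real vocabulary
toy_embed_size = 4    # hypothetical, the notebook uses 300

# One row of weights per word index.
embedding_table = rng.uniform(-1.0, 1.0, size=(toy_vocab_size, toy_embed_size))

# A batch of two padded names, three time steps each.
toy_inputs = np.array([[0, 6, 2],
                       [0, 0, 9]])

# Integer-array indexing picks one table row per word index.
toy_embed = embedding_table[toy_inputs]
print(toy_embed.shape)
```

The result has shape `[batch, time steps, embedding size]`, which is the layout the recurrent layer consumes.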

### LSTM cell



Next, we'll create the LSTM cells to use in the recurrent network.

In [61]:
with tf.name_scope("RNN_cells"):
    # Your basic LSTM cell
    lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
    
    # Add dropout to the cell
    drop = tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
    
    # Stack up multiple LSTM layers, for deep learning
    cell = tf.contrib.rnn.MultiRNNCell([drop] * lstm_layers)
with tf.name_scope("RNN_init_state"):    
    # Getting an initial state of all zeros
    initial_state = cell.zero_state(batch_size, tf.float32)
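
The `keep_prob` placeholder controls how much of the cell output survives dropout. As a rough illustration, assuming the usual "inverted dropout" formulation in which kept units are scaled by `1 / keep_prob`, the effect can be sketched in numpy. The `dropout` helper is hypothetical and is not the TensorFlow implementation.

```python
import numpy as np


def dropout(values, keep_prob, rng):
    """Inverted dropout: zero units at random and rescale the survivors."""
    if keep_prob >= 1.0:
        return values  # evaluation mode: nothing is dropped
    mask = rng.random(values.shape) < keep_prob
    return values * mask / keep_prob


rng = np.random.default_rng(0)
activations = np.ones((4, 8))
dropped = dropout(activations, keep_prob=0.5, rng=rng)
print(dropped)
```

This is why the training loop feeds a keep probability below 1, while validation and testing feed 1 so that no units are dropped.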

### RNN forward pass


Now we need to actually run the data through the RNN nodes. 

Above I created an initial state, `initial_state`, to pass to the RNN. This is the LSTM state (cell state and hidden state) that is carried from one time step to the next.


In [62]:
with tf.name_scope("RNN_forward"):
    outputs, final_state = tf.nn.dynamic_rnn(cell, embed,
                                             initial_state=initial_state)
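
To show what "running the data through the RNN" means, here is a hand-unrolled LSTM in numpy. It is a sketch under simplifying assumptions: one layer, random untrained weights, and a gate layout chosen for readability rather than the one TensorFlow uses internally. All names and sizes are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step; W maps [x_t, h_prev] to the four gates."""
    units = h_prev.shape[1]
    gates = np.concatenate([x_t, h_prev], axis=1) @ W + b
    i = sigmoid(gates[:, :units])                 # input gate
    f = sigmoid(gates[:, units:2 * units])        # forget gate
    o = sigmoid(gates[:, 2 * units:3 * units])    # output gate
    g = np.tanh(gates[:, 3 * units:])             # candidate values
    c = f * c_prev + i * g                        # new cell state
    h = o * np.tanh(c)                            # new hidden state
    return h, c


rng = np.random.default_rng(0)
batch, steps, embed_dim, units = 2, 5, 3, 4      # toy sizes
inputs = rng.normal(size=(batch, steps, embed_dim))
W = rng.normal(scale=0.1, size=(embed_dim + units, 4 * units))
b = np.zeros(4 * units)

h = np.zeros((batch, units))                     # zero initial state
c = np.zeros((batch, units))
outputs = []
for t in range(steps):
    h, c = lstm_step(inputs[:, t], h, c, W, b)
    outputs.append(h)
outputs = np.stack(outputs, axis=1)              # [batch, steps, units]

print(outputs.shape, np.allclose(outputs[:, -1], h))
```

The output at the last time step is the same as the final hidden state, which is the slice the output layer reads.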

### Output

We only want the output of the final time step, so we grab it with `outputs[:, -1]`.

In [63]:
with tf.name_scope('predictions'):
    predictions = tf.contrib.layers.fully_connected(outputs[:, -1], 1, activation_fn=tf.sigmoid)
    tf.summary.histogram('predictions', predictions)
with tf.name_scope('cost'):
    cost = tf.losses.mean_squared_error(labels_, predictions)
    tf.summary.scalar('cost', cost)
with tf.name_scope('train'):    
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
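
The cost used above is the mean squared error between the sigmoid outputs and the 0/1 labels. A small numpy sketch of the same arithmetic, with made-up pre-activation values:

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


toy_logits = np.array([[2.0], [-1.0], [0.5]])    # hypothetical pre-activation values
toy_labels = np.array([[1], [0], [1]])           # 1 = Company, 0 = Person

toy_predictions = sigmoid(toy_logits)            # shape [batch, 1], values in (0, 1)
toy_cost = np.mean((toy_labels - toy_predictions) ** 2)
print(toy_predictions.ravel(), toy_cost)
```

Because every prediction lies strictly between 0 and 1, the squared error per example is always below 1.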

### Validation accuracy

Here we can add a few nodes to calculate the accuracy which we'll use in the validation pass.

In [64]:
with tf.name_scope('accuracy'): 
    correct_pred = tf.equal(tf.cast(tf.round(predictions), tf.int32), labels_)
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
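
The accuracy nodes round each sigmoid output to 0 or 1 and compare it with the label. The same computation in numpy, on made-up predictions:

```python
import numpy as np

toy_predictions = np.array([[0.92], [0.10], [0.40], [0.75]])  # made-up sigmoid outputs
toy_labels = np.array([[1], [0], [1], [1]])

# Round to the nearest class, then compare element by element.
toy_correct = np.round(toy_predictions).astype(int) == toy_labels
toy_accuracy = toy_correct.astype(float).mean()
print(toy_accuracy)
```

Both arrays have shape `[batch, 1]` here. That is why the labels are fed as `y[:, None]` later on: a flat label vector would broadcast against the predictions into a `[batch, batch]` comparison.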

### Batching

This is a simple function for returning batches from our data. First it removes data such that we only have full batches. Then it iterates through the `x` and `y` arrays and returns slices out of those arrays with size `[batch_size]`.

In [65]:
def get_batches(x, y, batch_size=100):
    
    n_batches = len(x)//batch_size
    x, y = x[:n_batches*batch_size], y[:n_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size], y[ii:ii+batch_size]
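
A quick usage example on toy arrays shows the "full batches only" behaviour. The function is repeated so the snippet runs on its own, and the last two lines apply the same arithmetic to the training set size printed earlier.

```python
import numpy as np


def get_batches(x, y, batch_size=100):
    # Same logic as the function above, repeated so the example is standalone.
    n_batches = len(x) // batch_size
    x, y = x[:n_batches * batch_size], y[:n_batches * batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii + batch_size], y[ii:ii + batch_size]


toy_x = np.arange(10 * 3).reshape(10, 3)   # 10 toy rows
toy_y = np.arange(10)

shapes = [(bx.shape, by.shape) for bx, by in get_batches(toy_x, toy_y, batch_size=4)]
print(shapes)

# With the training set above: full batches per epoch and rows left out.
n_rows, batch_size = 146882, 1024
print(n_rows // batch_size, n_rows % batch_size)
```

With 10 rows and a batch size of 4, only two batches are produced and the last two rows are dropped.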

## Training



In [66]:
'''
import os
epochs = 1

with graph.as_default():
    saver = tf.train.Saver()

with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    iteration = 1
    for e in range(epochs):
        state = sess.run(initial_state)
        
        for ii, (x, y) in enumerate(get_batches(train_x, train_y, 2), 1):
            print(ii)
            print(x)
            print(y)
            print(x.shape)
            print(y.shape)
            print(y[:, None])
            if ii == 5:
                raise Exception("Manual Stop")
#''' ;           

In [67]:
epochs = 3   
merged = tf.summary.merge_all()

In [None]:
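# Note: the graph above was already built with lstm_size=64, lstm_layers=1 and
# learning_rate=0.01, so the three outer loops only change the log directory
# name; only `epochs` changes what is actually run.
for lstm_size in [64,128,256,512]: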
    for num_layers in [1, 2, 3, 5]:
        for learning_rate in [0.001, 0.003, 0.01, 0.03, 0.1]:
            for epochs in [1,3,5,10]:
                log_string = 'logs/4/lr={},rl={},ru={},e={}'.format(learning_rate, num_layers, lstm_size, epochs)
                writer = tf.summary.FileWriter(log_string)
                with tf.Session() as sess:
                    saver = tf.train.Saver()
                    sess.run(tf.global_variables_initializer())
                    #file_writer = tf.summary.FileWriter('./logs/3', sess.graph)
                    #train_writer = tf.summary.FileWriter('./logs/2/train', sess.graph)
                    #test_writer = tf.summary.FileWriter('./logs/2/test')

                    iteration = 1
                    for e in range(epochs):
                        state = sess.run(initial_state)

                        for ii, (x, y) in enumerate(get_batches(train_x, train_y, batch_size), 1):
                            feed = {inputs_: x,
                                    labels_: y[:, None],
                                    keep_prob: 0.5,
                                    initial_state: state}
                            summary, loss, state, _ = sess.run([merged, cost, final_state, optimizer], feed_dict=feed)

                            #train_writer.add_summary(summary, iteration)
                            writer.add_summary(summary, iteration)

                            if iteration%5==0:
                                print("Epoch: {}/{}".format(e, epochs),
                                      "Iteration: {}".format(iteration),
                                      "Train loss: {:.3f}".format(loss))

                            if iteration%285==0:
                                val_acc = []
                                val_state = sess.run(cell.zero_state(batch_size, tf.float32))
                                for x, y in get_batches(val_x, val_y, batch_size):
                                    feed = {inputs_: x,
                                            labels_: y[:, None],
                                            keep_prob: 1,
                                            initial_state: val_state}
                                    summary, batch_acc, val_state = sess.run([merged, accuracy, final_state], feed_dict=feed)
                                    val_acc.append(batch_acc)
                                print("Val acc: {:.3f}".format(np.mean(val_acc)))

                                #test_writer.add_summary(summary, iteration)

                            iteration +=1
                    saver.save(sess, "checkpoints/ownerNameCateg.ckpt")

Epoch: 0/1 Iteration: 5 Train loss: 0.152
Epoch: 0/1 Iteration: 10 Train loss: 0.027
Epoch: 0/1 Iteration: 15 Train loss: 0.022
Epoch: 0/1 Iteration: 20 Train loss: 0.015
Epoch: 0/1 Iteration: 25 Train loss: 0.020
Epoch: 0/1 Iteration: 30 Train loss: 0.013
Epoch: 0/1 Iteration: 35 Train loss: 0.023
Epoch: 0/1 Iteration: 40 Train loss: 0.016
Epoch: 0/1 Iteration: 45 Train loss: 0.011
Epoch: 0/1 Iteration: 50 Train loss: 0.012
Epoch: 0/1 Iteration: 55 Train loss: 0.012
Epoch: 0/1 Iteration: 60 Train loss: 0.006
Epoch: 0/1 Iteration: 65 Train loss: 0.012
Epoch: 0/1 Iteration: 70 Train loss: 0.007
Epoch: 0/1 Iteration: 75 Train loss: 0.006
Epoch: 0/1 Iteration: 80 Train loss: 0.008
Epoch: 0/1 Iteration: 85 Train loss: 0.005
Epoch: 0/1 Iteration: 90 Train loss: 0.009
Epoch: 0/1 Iteration: 95 Train loss: 0.009
Epoch: 0/1 Iteration: 100 Train loss: 0.011
Epoch: 0/1 Iteration: 105 Train loss: 0.007
Epoch: 0/1 Iteration: 110 Train loss: 0.006
Epoch: 0/1 Iteration: 115 Train loss: 0.011
Epoch: 0

## Testing

In [None]:
test_acc = []
with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    test_state = sess.run(cell.zero_state(batch_size, tf.float32))
    for ii, (x, y) in enumerate(get_batches(test_x, test_y, batch_size), 1):
        feed = {inputs_: x,
                labels_: y[:, None],
                keep_prob: 1,
                initial_state: test_state}
        batch_acc, test_state = sess.run([accuracy, final_state], feed_dict=feed)
        test_acc.append(batch_acc)
    print("Test accuracy: {:.3f}".format(np.mean(test_acc)))

## Predictions

In [None]:
x = "Dushyant Sekhar "

# Note: raises KeyError if the name contains a word that is not in the vocabulary
x_int = [vocab_to_int[word] for word in x.split()]
x_int = x_int[:seq_len]  # ignore words after the 28th

x_int_sized = np.zeros((1,seq_len), dtype=int)
x_int_sized[0,-len(x_int):] = np.array(x_int)[:seq_len]
print(x_int_sized)
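
New names will often contain words that never appeared in the training data. One possible way to handle that, sketched here with a hypothetical helper `encode_for_prediction`, is to fall back to a fixed index. Using 0 as the fallback is an assumption made for illustration, and the mapping below is a toy one.

```python
import numpy as np


def encode_for_prediction(name, vocab_to_int, seq_len, unknown_index=0):
    """Encode one name as a left-padded (1, seq_len) integer array."""
    ids = [vocab_to_int.get(word, unknown_index) for word in name.split()]
    ids = ids[:seq_len]                      # ignore words past seq_len
    row = np.zeros((1, seq_len), dtype=int)
    if ids:
        row[0, -len(ids):] = ids
    return row


toy_vocab_to_int = {"Ltd": 0, "Pvt": 1, "Limited": 2}   # toy mapping
print(encode_for_prediction("Sekhar Holdings Pvt Limited", toy_vocab_to_int, seq_len=6))
```

The two unseen words are mapped to the fallback index instead of stopping the cell with an error.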

In [None]:
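# initial_state was built for exactly batch_size rows, so the single name is
# padded out to a full batch with all-zero filler rows
fillerSize = batch_size - 1
filler = np.zeros((fillerSize, seq_len), dtype=int)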
#print(filler.shape)
prodBatch = np.append(x_int_sized,filler, axis=0)
#print(prodBatch)
#print(prodBatch.shape)

In [None]:
with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    test_state = sess.run(cell.zero_state(batch_size, tf.float32))
    feed = {inputs_: prodBatch,
            keep_prob: 1,
            initial_state: test_state}
    output = sess.run(predictions, feed_dict=feed)
    print(output[0])
    if output[0]>0.5:
        print(x,"\nCompany")
        print("probability {}%".format(np.round(output[0][0]*100,2)))
    else:
        print(x,"\nPerson")
        print("probability {}%".format(np.round(100-output[0][0]*100,2)))
        