# Abstract

In this paper we model the character classification and a simple version of language drift based on a popular TV show "Friends" as well as other shows running in the same era. The modeling methods applied in this paer include word2vec, DNN, CNN and LSTM. 

# Introduction

<img src="friends-reunion-series-ftr.jpg" style="width: 400px;"> 

"Friends" is one of the most popular TV shows ran from 1994 to 2004 and we are big fans of "Friends" as well. There are 6 main charatcers in "Friends" who are Ross, Rachel, Joey, Chandler, Monica and Phoebe. While we watching the show, we both noticed that each character has a distinct personality and of course, style of speech, and it sometimes makes it very easy to tell who the characters are from their lines. Another interesting trend we noticed is the language drift. The show ran from 1994 to 2004, which was the 10 years that the world drastically changed, and so did the language. In this paper we will show the classification of the characters based on their lines, and briefly talk about the language drifts based on the show.

*A "Friends" icon painting by Aysa:*

<img src="friends_painting.jpg" style="width: 200px;">


# Background

# Methods


* Modeling using word2vec, DNN, CNN and LSTM models, the folders and scripts are shown as below. We are still looking to further improve the accuracy, if it's possible. 
* The baseline accuracy would be random guessing, which has an accuracy of 1/6, or 16.67%. If we get a higher accuracy in our modeling, the model is (at least somewhat) useful!
* We have attched part of the codes in this report, the complete version of all codes will be uploaded into this repository: https://github.com/aysafanxm/w266-final-project


### Folders
* Model folder: word2vec model and DNN model
* Cnnmodel folder: CNN model
* Rnnmodel folder: LSTM model


### Scripts
* *handle_json.py* - The original lines.json file contains all lines from the show "Friends", which was in JSON format, we converted the character names into numbers and write them into data/feature_raw.txt. We also pick only the 6 main characters' lines (6 main characters: Ross, Rachel, Joey, Chandler, Monica and Phoebe).


* *extract_label_and_sentence.py* - Extract labels from data/feature_raw.txt and write them into data/label.txt, also extract the segmantations into data/sentence.txt.


* *extract_feature.py* - Train word vectors using word2vec (4 dimensions) and calculate the feature vectors of each sentence (take the average of the word vectors in each sentence), then write feature vectors into data/feature.txt. This process is mainly for DNN training because CNN and LSTM use embedding which doesn't train word vector the same way.


* *main_word2vec.py* - DNN with 3 hidden layers. The neuron numbers of each layer is 40, 20 and 10, respectively. The input dimension is 4 (4 features) and the output dimension is 6 (6 characters). The first 2 layers‘ activation function is sigmoid and the last layer's is softmax. The learning rate is 0.0001 and there are 1000 iterations. Note that there is a parameter, is_train, in the model, if is_train is True, it starts to train a new model, otherwise it takes the trained model.


* *main.py* - Similar to main_word2vec.py, but it is DNN with embedding.


* *data_helpers.py* - Helps to batch process the data


* *cnn_model.py* - CNN with an embedding layer (100 dimensions word vectos), a CNN layer, a pool layer and a softmax layer to output the probability of each label.


* *textCNN.py* - It takes the cnn_model.py to train or test the lines data. It takes 90% of the lines for training and 10% of them for testing. *The accuracy of CNN is 28%~30%.* The learn rate is 0.0001.


* *textRNN.py* - RNN with an embedding layer, a bi-lstm layer, a concat layer, a fully connected layer and a softmax layer. It takes 90% of the lines for training and 10% of them for testing. *The accuracy of RNN is 39%~40%.* The learn rate is 0.0001.


## Here are some blocks of the codes*

In [13]:
import tensorflow as tf
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import data_helpers
from cnn_model import TextCNN
from tensorflow.contrib import learn
from sklearn.metrics import accuracy_score
import os
import time
import datetime
from tensorflow.contrib import rnn

### main_word2vec.py

In [14]:
# Read features
def read_feature(file):
  print("reading feature information...\n")
  res = []
  with open(file, 'r', encoding='utf-8') as f:
    lines = f.readlines()
  for line in lines:
    line = line.split()
    for i in range(len(line)):
      line[i] = float(line[i])
    res.append(line)
  return np.array(res)

# Read labels
def read_label(file):
  print("reading label information...\n")
  res = []
  with open(file, 'r', encoding='utf-8') as f:
    lines = f.readlines()
  for line in lines:
    line = int(line.strip())
    res.append(line)
  return np.array(res)

def addLayer(inputData, inSize, outSize, activity_function = None):  
    Weights = tf.Variable(tf.random_normal([inSize, outSize]))   
    basis = tf.Variable(tf.random_uniform([1,outSize], -1, 1))    
    weights_plus_b = tf.matmul(inputData, Weights) + basis  
    #Wx_plus_b = tf.nn.dropout(weights_plus_b, keep_prob = 0.8)     # To prevent overfitting

    if activity_function is None:  
        ans = weights_plus_b  
    else:  
        ans = activity_function(weights_plus_b)
    return ans  

def net(x_data, y_data, x_test, y_test):
    is_train = True


    insize = x_data.shape[1]
    outsize = 8
    xs = tf.placeholder(tf.float32,[None, insize]) 
    ys = tf.placeholder(tf.float32,[None, outsize]) 
    keep_prob = tf.placeholder(tf.float32)  
      
    l1 = addLayer(xs, insize, 40,activity_function=None)  
    l2 = addLayer(l1, 40, 20,activity_function=tf.nn.sigmoid)  
    l3 = addLayer(l2, 20, 10,activity_function=tf.nn.softmax)  
    l4 = addLayer(l3, 10, outsize,activity_function=tf.nn.softmax)


    y = l4
    #loss = tf.reduce_sum(tf.reduce_sum(tf.square((ys-l4)),reduction_indices = [1]))  
    #loss = -tf.reduce_mean(ys * tf.log(l3))
    #loss = tf.reduce_sum(tf.square((ys-y)))
    loss = -tf.reduce_sum(ys * tf.log(y))
    #loss = tf.reduce_sum(-tf.reduce_sum(ys * tf.log(y),reduction_indices=[1]))  # loss  
    train =  tf.train.GradientDescentOptimizer(0.00001).minimize(loss) 

    # Turn 1 dimensional label vectors to 14 dimensional vectors which has only one element = 1
    new_ydata = []
    for i in range(y_data.shape[0]):
      new_ydata.append([0]*outsize)
      new_ydata[i][y_data[i]] = 1
      # print(new_ydata[i])
    new_ydata = np.array(new_ydata)
        
    saver=tf.train.Saver()
    with tf.Session() as sess:
        init = tf.global_variables_initializer()
        sess.run(init)
        if is_train: 
            run_step = 4000
            for i in range(run_step):  
                sess.run(train,feed_dict={xs:x_data,ys:new_ydata})  
                if i%50 == 0:  
                    print(sess.run(loss,feed_dict={xs:x_data,ys:new_ydata}))
            # save the model
            saver=tf.train.Saver(max_to_keep=1)
            saver.save(sess,'model/net.ckpt')
        else:     # take a trained model
            saver.restore(sess, 'model/net.ckpt')
            print("save success!")

        # Prediction
        res = sess.run(fetches=y, feed_dict={xs: x_test})
        new_res = []
        for ele in res:
            mmax = -1111
            index = -1
            for i in range(outsize):
                if ele[i] > mmax:
                    index, mmax  = i, ele[i]
            new_res.append(index)
        #print(new_res)
        new_res = np.array(new_res)
        counter = 0
        for i in range(len(new_res)):
          if(y_test[i] == new_res[i]):
            counter += 1
        print("Accuracy: ", counter/len(new_res))
        print(classification_report(new_res, y_test))

def main():
  feature = read_feature('data/feature.txt')
  label = read_label('data/label.txt')

  x_train , x_test , y_train , y_test = train_test_split(feature, label, test_size = 0.1,random_state=0)
  net(x_train, y_train, x_test, y_test)

if __name__ == '__main__':
  main()

reading feature information...

reading label information...

57168.875
43253.44
42323.68
41977.105
41805.535
41703.83
41635.438
41586.19
41549.297
41520.805
41498.117
41479.613
41464.203
41451.184
41440.0
41430.324
41421.84
41414.35
41407.68
41401.68
41396.26
41391.36
41386.9
41382.81
41379.055
41375.61
41372.4
41369.457
41366.68
41364.098
41361.67
41359.4
41357.273
41355.258
41353.363
41351.566
41349.87
41348.266
41346.715
41345.25
41343.83
41342.484
41341.168
41339.934
41338.742
41337.594
41336.49
41335.44
41334.414
41333.445
41332.508
41331.605
41330.73
41329.887
41329.055
41328.27
41327.516
41326.79
41326.066
41325.383
41324.695
41324.04
41323.414
41322.78
41322.18
41321.586
41321.008
41320.445
41319.9
41319.348
41318.812
41318.29
41317.793
41317.26
41316.8
41316.3
41315.85
41315.39
41314.945
41314.5
Accuracy:  0.1823139851967277
             precision    recall  f1-score   support

          0       0.13      0.16      0.15       354
          1       0.00      0.00      0.00    

  'recall', 'true', average, warn_for)


### main.py

In [17]:
def read_feature(file):
  print("reading feature information...\n")
  res = []
  with open(file, 'r', encoding='utf-8') as f:
    lines = f.readlines()
  for line in lines:
    line = line.split()
    for i in range(len(line)):
      line[i] = float(line[i])
    res.append(line)
  return np.array(res)

def read_label(file):
  print("reading label information...\n")
  res = []
  with open(file, 'r', encoding='utf-8') as f:
    lines = f.readlines()
  for line in lines:
    line = int(line.strip())
    res.append(line)
  return np.array(res)

def addLayer(inputData, inSize, outSize, activity_function = None):  
    Weights = tf.Variable(tf.random_normal([inSize, outSize]))   
    basis = tf.Variable(tf.random_uniform([1,outSize], -1, 1))    
    weights_plus_b = tf.matmul(inputData, Weights) + basis  
    Wx_plus_b = tf.nn.dropout(weights_plus_b, keep_prob = 1)     # To prevent overfitting

    if activity_function is None:  
        ans = weights_plus_b  
    else:  
        ans = activity_function(weights_plus_b)
    return ans  

def net(x_data, y_data, x_test, y_test):
    is_train = True


    insize = x_data.shape[1]
    outsize = 6
    xs = tf.placeholder(tf.float32,[None, insize])   
    ys = tf.placeholder(tf.float32,[None, outsize])  
    keep_prob = tf.placeholder(tf.float32)  
      
    l1 = addLayer(xs, insize, 40,activity_function=tf.nn.sigmoid)  
    l2 = addLayer(l1, 40, 20,activity_function=tf.nn.sigmoid)  
    l3 = addLayer(l2, 20, 10,activity_function=tf.nn.sigmoid)  
    l4 = addLayer(l3, 10, outsize,activity_function=tf.nn.softmax)
    #l5 = addLayer(l4, 10, outsize,activity_function=tf.nn.softmax)


    y = l4
    #loss = tf.reduce_sum(tf.reduce_sum(tf.square((ys-l4)),reduction_indices = [1]))  
    #loss = -tf.reduce_mean(ys * tf.log(l3))
    #loss = tf.reduce_sum(tf.square((ys-y)))
    #oss = -tf.reduce_sum(ys * tf.log(y))
    loss = tf.reduce_mean(-tf.reduce_sum(ys * tf.log(y),reduction_indices=[1]))  # loss  
    train =  tf.train.GradientDescentOptimizer(0.0001).minimize(loss) 

    new_ydata = []
    for i in range(y_data.shape[0]):
      new_ydata.append([0]*outsize)
      new_ydata[i][y_data[i]] = 1
      # print(new_ydata[i])
    new_ydata = np.array(new_ydata)
        
    saver=tf.train.Saver()
    with tf.Session() as sess:
        init = tf.global_variables_initializer()
        sess.run(init)
        if is_train: 
            run_step = 1000
            for i in range(run_step):  
                sess.run(train,feed_dict={xs:x_data,ys:new_ydata})  
                if i%50 == 0:  
                    print(sess.run(loss,feed_dict={xs:x_data,ys:new_ydata}))
            # save the model
            saver=tf.train.Saver(max_to_keep=1)
            saver.save(sess,'model/net.ckpt')
        else:     # use an existing model
            saver.restore(sess, 'model/net.ckpt')
            print("save success!")

        # Prediction
        res = sess.run(fetches=y, feed_dict={xs: x_test})
        new_res = []
        for ele in res:
            mmax = -1111
            index = -1
            for i in range(outsize):
                if ele[i] > mmax:
                    index, mmax  = i, ele[i]
            new_res.append(index) 
        #print(new_res)
        new_res = np.array(new_res)
        counter = 0
        for i in range(len(new_res)):
          if (y_test[i] == new_res[i]):
            counter += 1
        #print("Accuracy: ", counter/len(new_res))
        print("Accuracy: ", accuracy_score(y_test, new_res))

def main():
  #feature = read_feature('data/feature.txt')
  #label = read_label('data/label.txt')

  print("Loading data...")
  x_text, y = data_helpers.load_data_and_labels("data/sentence.txt", "data/label.txt")

  '''
  outsize = 8
  new_ydata = []
  for i in range(len(y)):
    new_ydata.append([0]*outsize)
    new_ydata[i][y[i]] = 1
    #print(new_ydata[i])
  new_ydata = np.array(new_ydata)
  y = new_ydata'''

  # Build vocabulary
  max_document_length = max([len(x.split(" ")) for x in x_text])
  vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
  #feature = np.array(list(vocab_processor.fit_transform(x_text)))
  #label = np.array(y)
  x = np.array(list(vocab_processor.fit_transform(x_text)))

  #print(x.shape)
  #print(max_document_length)

  # Randomly shuffle data
  np.random.seed(10)
  shuffle_indices = np.random.permutation(np.arange(len(y)))

  feature = np.array(x)[shuffle_indices]
  label = np.array(y)[shuffle_indices]

  x_train , x_test , y_train , y_test = train_test_split(feature, label, test_size = 0.1)
  net(x_train, y_train, x_test, y_test)

if __name__ == '__main__':
  main()

Loading data...
2.5228014
2.516215
2.5097437
2.5033638
2.4970734
2.4908638
2.484738
2.4786823
2.4727101
2.466812
2.4609876
2.4552302
2.449538
2.4439137
2.4383428
2.4328444
2.4274163
2.4220476
2.416739
2.4114907
Accuracy:  0.15504479937670432


### textCNN.py

In [None]:
# Parameters
# ==================================================

# Data loading params
tf.flags.DEFINE_float("dev_sample_percentage", 0.1, "Percentage of the training data to use for validation")
tf.flags.DEFINE_string("feature_file", "data/sentence.txt", "feature data (sentence).")
tf.flags.DEFINE_string("label_file", "data/label.txt", "label data (number).")

# Model Hyperparameters
tf.flags.DEFINE_integer("embedding_dim", 100, "Dimensionality of character embedding (default: 128)")
tf.flags.DEFINE_string("filter_sizes", "2,3,4,5", "Comma-separated filter sizes (default: '3,4,5')")
tf.flags.DEFINE_integer("num_filters", 256, "Number of filters per filter size (default: 128)")
tf.flags.DEFINE_float("dropout_keep_prob", 0.5, "Dropout keep probability (default: 0.5)")
tf.flags.DEFINE_float("l2_reg_lambda", 0.0001, "L2 regularization lambda (default: 0.0)")

# Training parameters
tf.flags.DEFINE_integer("batch_size", 120, "Batch Size (default: 64)")
tf.flags.DEFINE_integer("num_epochs", 20, "Number of training epochs (default: 200)")
tf.flags.DEFINE_integer("evaluate_every", 100, "Evaluate model on dev set after this many steps (default: 100)")
tf.flags.DEFINE_integer("checkpoint_every", 1, "Save model after this many steps (default: 100)")
tf.flags.DEFINE_integer("num_checkpoints", 5, "Number of checkpoints to store (default: 5)")
# Misc Parameters
tf.flags.DEFINE_boolean("allow_soft_placement", True, "Allow device soft device placement")
tf.flags.DEFINE_boolean("log_device_placement", False, "Log placement of ops on devices")

FLAGS = tf.flags.FLAGS
FLAGS._parse_flags()
print("\nParameters:")
for attr, value in sorted(FLAGS.__flags.items()):
    print("{}={}".format(attr.upper(), value))
print("")


# Data Preparation
# ==================================================

# Load data
print("Loading data...")
x_text, y = data_helpers.load_data_and_labels(FLAGS.feature_file, FLAGS.label_file)

outsize = 6
new_ydata = []
for i in range(len(y)):
  new_ydata.append([0]*outsize)
  new_ydata[i][y[i]] = 1
  #print(new_ydata[i])
new_ydata = np.array(new_ydata)
y = new_ydata

# Build vocabulary
max_document_length = max([len(x.split(" ")) for x in x_text])
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
x = np.array(list(vocab_processor.fit_transform(x_text)))

print(x.shape)
print(max_document_length)

# Randomly shuffle data
np.random.seed(10)
shuffle_indices = np.random.permutation(np.arange(len(y)))

x_shuffled = np.array(x)[shuffle_indices]
y_shuffled = np.array(y)[shuffle_indices]

# Split train/test set
# TODO: This is very crude, should use cross-validation
dev_sample_index = -1 * int(FLAGS.dev_sample_percentage * float(len(y)))
x_train, x_dev = x_shuffled[:dev_sample_index], x_shuffled[dev_sample_index:]
y_train, y_dev = y_shuffled[:dev_sample_index], y_shuffled[dev_sample_index:]

del x, y, x_shuffled, y_shuffled


print("Vocabulary Size: {:d}".format(len(vocab_processor.vocabulary_)))
print("Train/Dev split: {:d}/{:d}".format(len(y_train), len(y_dev)))


# Training
# ==================================================


os.environ['CUDA_VISIBLE_DEVICES'] = '3'
config = tf.ConfigProto(allow_soft_placement=True)          #my modification
config.gpu_options.allow_growth = True

with tf.Graph().device('/gpu:3'):
    session_conf = tf.ConfigProto(
      allow_soft_placement=FLAGS.allow_soft_placement,
      log_device_placement=FLAGS.log_device_placement)
    sess = tf.Session(config=session_conf)
    with sess.as_default():
        cnn = TextCNN(
            sequence_length=x_train.shape[1],
            num_classes=y_train.shape[1],
            vocab_size=len(vocab_processor.vocabulary_),
            embedding_size=FLAGS.embedding_dim,
            filter_sizes=list(map(int, FLAGS.filter_sizes.split(","))),
            num_filters=FLAGS.num_filters,
            l2_reg_lambda=FLAGS.l2_reg_lambda)

        # Define Training procedure
        global_step = tf.Variable(0, name="global_step", trainable=False)
        optimizer = tf.train.AdamOptimizer(0.0001)
        grads_and_vars = optimizer.compute_gradients(cnn.loss)
        train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)

        # Keep track of gradient values and sparsity (optional)
        grad_summaries = []
        for g, v in grads_and_vars:
            if g is not None:
                grad_hist_summary = tf.summary.histogram("{}/grad/hist".format(v.name), g)
                sparsity_summary = tf.summary.scalar("{}/grad/sparsity".format(v.name), tf.nn.zero_fraction(g))
                grad_summaries.append(grad_hist_summary)
                grad_summaries.append(sparsity_summary)
        grad_summaries_merged = tf.summary.merge(grad_summaries)

        # Output directory for models and summaries
        timestamp = str(int(time.time()))
        #out_dir = os.path.abspath(os.path.join(os.path.curdir, "runs", timestamp))
        #print("Writing to {}\n".format(out_dir))

        # Summaries for loss and accuracy
        loss_summary = tf.summary.scalar("loss", cnn.loss)
        acc_summary = tf.summary.scalar("accuracy", cnn.accuracy)
        saver = tf.train.Saver(tf.global_variables(), max_to_keep=FLAGS.num_checkpoints)

        # Write vocabulary
        #vocab_processor.save(os.path.join(out_dir, "vocab"))

        # Initialize all variables
        sess.run(tf.global_variables_initializer())

        def train_step(x_batch, y_batch):
            """
            A single training step
            """
            feed_dict = {
              cnn.input_x: x_batch,
              cnn.input_y: y_batch,
              cnn.dropout_keep_prob: FLAGS.dropout_keep_prob
            }
            _, step, loss, accuracy = sess.run(
                [train_op, global_step, cnn.loss, cnn.accuracy],
                feed_dict)
            time_str = datetime.datetime.now().isoformat()
            print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy))
            #train_summary_writer.add_summary(summaries, step)

        def dev_step(x_batch, y_batch, writer=None):
            """
            Evaluates model on a dev set
            """
            feed_dict = {
              cnn.input_x: x_batch,
              cnn.input_y: y_batch,
              cnn.dropout_keep_prob: 1
            }
            step, loss, accuracy = sess.run(
                [global_step, cnn.loss, cnn.accuracy],
                feed_dict)
            time_str = datetime.datetime.now().isoformat()
            print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy))
            #if writer:
                #writer.add_summary(summaries, step)
            return accuracy



        is_train = False         

        if is_train:
            for kk in range(10):
                # Generate batches
                batches = data_helpers.batch_iter(
                    list(zip(x_train, y_train)), FLAGS.batch_size, FLAGS.num_epochs)
                # Training loop. For each batch...
                for batch in batches:
                    x_batch, y_batch = zip(*batch)
                    train_step(x_batch, y_batch)
                    current_step = tf.train.global_step(sess, global_step)
                    if current_step % FLAGS.evaluate_every == 0:
                        print("\n\n\nEvaluation:")
                        test_acc = dev_step(x_dev, y_dev, writer=None)
                        print("accuracy on test data is: {}\n\n\n".format(test_acc))
                        saver.save(sess,'cnnmodel/net.ckpt')
        else:
            saver.restore(sess, 'cnnmodel/net.ckpt')
            print("reload success!")
            test_acc = dev_step(x_dev, y_dev, writer=None)            
            print("\n\nmodel accuracy on test data is: {}%\n\n".format(test_acc*100))


### textRNN.py

In [None]:
#TextRNN: 1. embeddding layer, 2.Bi-LSTM layer, 3.concat output, 4.FC layer, 5.softmax
class TextRNN:
    def __init__(self,num_classes, learning_rate, batch_size, decay_steps, decay_rate,sequence_length,
                 vocab_size,embed_size,is_training,initializer=tf.random_normal_initializer(stddev=0.1)):
        """init all hyperparameter here"""
        # set hyperparamter
        self.num_classes = num_classes
        self.batch_size = batch_size
        self.sequence_length=sequence_length
        self.vocab_size=vocab_size
        self.embed_size=embed_size
        self.hidden_size=embed_size
        self.is_training=is_training
        self.learning_rate=learning_rate
        self.initializer=initializer
        self.num_sampled=20


        # add placeholder (X,label)
        self.input_x = tf.placeholder(tf.int32, [None, self.sequence_length], name="input_x")  # X
        self.input_y = tf.placeholder(tf.int32,[None], name="input_y")  # y [None,num_classes]
        self.dropout_keep_prob=tf.placeholder(tf.float32,name="dropout_keep_prob")
        self.global_step = tf.Variable(0, trainable=False, name="Global_Step")
        self.epoch_step=tf.Variable(0,trainable=False,name="Epoch_Step")
        self.epoch_increment=tf.assign(self.epoch_step,tf.add(self.epoch_step,tf.constant(1)))
        self.decay_steps, self.decay_rate = decay_steps, decay_rate

        #print(self.input_y.shape)

        self.instantiate_weights()
        self.logits = self.inference() #[None, self.label_size]. main computation graph is here.
        if not is_training:
            return
        self.loss_val = self.loss() #-->self.loss_nce()
        self.train_op = self.train()
        self.predictions = tf.argmax(self.logits, axis=1, name="predictions")  # shape:[None,]
        correct_prediction = tf.equal(tf.cast(self.predictions,tf.int32), self.input_y) #tf.argmax(self.logits, 1)-->[batch_size]
        self.accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32), name="Accuracy") # shape=()
    def instantiate_weights(self):
        """define all weights here"""
        with tf.name_scope("embedding"): # embedding matrix
            self.Embedding = tf.get_variable("Embedding",shape=[self.vocab_size, self.embed_size],initializer=self.initializer) #[vocab_size,embed_size] tf.random_uniform([self.vocab_size, self.embed_size],-1.0,1.0)
            self.W_projection = tf.get_variable("W_projection",shape=[self.hidden_size*2, self.num_classes],initializer=self.initializer) #[embed_size,label_size]
            self.b_projection = tf.get_variable("b_projection",shape=[self.num_classes])       #[label_size]

    def inference(self):
        """main computation graph here: 1. embeddding layer, 2.Bi-LSTM layer, 3.concat, 4.FC layer 5.softmax """
        #1.get emebedding of words in the sentence
        self.embedded_words = tf.nn.embedding_lookup(self.Embedding,self.input_x) #shape:[None,sentence_length,embed_size]
        #2. Bi-lstm layer
        # define lstm cess:get lstm cell output
        lstm_fw_cell=rnn.BasicLSTMCell(self.hidden_size) #forward direction cell
        lstm_bw_cell=rnn.BasicLSTMCell(self.hidden_size) #backward direction cell
        if self.dropout_keep_prob is not None:
            lstm_fw_cell=rnn.DropoutWrapper(lstm_fw_cell,output_keep_prob=self.dropout_keep_prob)
            lstm_bw_cell=rnn.DropoutWrapper(lstm_bw_cell,output_keep_prob=self.dropout_keep_prob)
        # bidirectional_dynamic_rnn: input: [batch_size, max_time, input_size]
        #                            output: A tuple (outputs, output_states)
        #                                    where:outputs: A tuple (output_fw, output_bw) containing the forward and the backward rnn output `Tensor`.
        outputs,_=tf.nn.bidirectional_dynamic_rnn(lstm_fw_cell,lstm_bw_cell,self.embedded_words,dtype=tf.float32) #[batch_size,sequence_length,hidden_size] #creates a dynamic bidirectional recurrent neural network
        print("outputs:===>",outputs) #outputs:(<tf.Tensor 'bidirectional_rnn/fw/fw/transpose:0' shape=(?, 5, 100) dtype=float32>, <tf.Tensor 'ReverseV2:0' shape=(?, 5, 100) dtype=float32>))
        #3. concat output
        output_rnn=tf.concat(outputs,axis=2) #[batch_size,sequence_length,hidden_size*2]
        self.output_rnn_last=tf.reduce_mean(output_rnn,axis=1) #[batch_size,hidden_size*2] #output_rnn_last=output_rnn[:,-1,:] ##[batch_size,hidden_size*2] #TODO
        print("output_rnn_last:", self.output_rnn_last) # <tf.Tensor 'strided_slice:0' shape=(?, 200) dtype=float32>
        #4. logits(use linear layer)
        with tf.name_scope("output"): #inputs: A `Tensor` of shape `[batch_size, dim]`.  The forward activations of the input network.
            logits = tf.matmul(self.output_rnn_last, self.W_projection) + self.b_projection  # [batch_size,num_classes]
        return logits

    def loss(self,l2_lambda=0.0001):
        with tf.name_scope("loss"):
            #input: `logits` and `labels` must have the same shape `[batch_size, num_classes]`
            #output: A 1-D `Tensor` of length `batch_size` of the same type as `logits` with the softmax cross entropy loss.
            losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=self.input_y, logits=self.logits);#sigmoid_cross_entropy_with_logits.#losses=tf.nn.softmax_cross_entropy_with_logits(labels=self.input_y,logits=self.logits)
            #print("1.sparse_softmax_cross_entropy_with_logits.losses:",losses) # shape=(?,)
            loss=tf.reduce_mean(losses)#print("2.loss.loss:", loss) #shape=()
            l2_losses = tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables() if 'bias' not in v.name]) * l2_lambda
            loss=loss+l2_losses
        return loss

    def loss_nce(self,l2_lambda=0.0001): #0.0001-->0.001
        """calculate loss using (NCE)cross entropy here"""
        # Compute the average NCE loss for the batch.
        # tf.nce_loss automatically draws a new sample of the negative labels each
        # time we evaluate the loss.
        if self.is_training: #training
            #labels=tf.reshape(self.input_y,[-1])               #[batch_size,1]------>[batch_size,]
            labels=tf.expand_dims(self.input_y,1)                   #[batch_size,]----->[batch_size,1]
            loss = tf.reduce_mean( #inputs: A `Tensor` of shape `[batch_size, dim]`.  The forward activations of the input network.
                tf.nn.nce_loss(weights=tf.transpose(self.W_projection),#[hidden_size*2, num_classes]--->[num_classes,hidden_size*2]. nce_weights:A `Tensor` of shape `[num_classes, dim].O.K.
                               biases=self.b_projection,                 #[label_size]. nce_biases:A `Tensor` of shape `[num_classes]`.
                               labels=labels,                 #[batch_size,1]. train_labels, # A `Tensor` of type `int64` and shape `[batch_size,num_true]`. The target classes.
                               inputs=self.output_rnn_last,# [batch_size,hidden_size*2] #A `Tensor` of shape `[batch_size, dim]`.  The forward activations of the input network.
                               num_sampled=self.num_sampled,  #scalar. 100
                               num_classes=self.num_classes,partition_strategy="div"))  #scalar. 1999
        l2_losses = tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables() if 'bias' not in v.name]) * l2_lambda
        loss = loss + l2_losses
        return loss

    def train(self):
        """based on the loss, use SGD to update parameter"""
        learning_rate = tf.train.exponential_decay(self.learning_rate, self.global_step, self.decay_steps,self.decay_rate, staircase=True)
        train_op = tf.contrib.layers.optimize_loss(self.loss_val, global_step=self.global_step,learning_rate=learning_rate, optimizer="Adam")
        return train_op

#test started
def test():
    #below is a function test; if you use this for text classifiction, you need to tranform sentence to indices of vocabulary first. then feed data to the graph.

    tf.flags.DEFINE_string("feature_file", "data/sentence.txt", "feature data (sentence).")
    tf.flags.DEFINE_string("label_file", "data/label.txt", "label data (number).")
    tf.flags.DEFINE_float("dev_sample_percentage", 0.1, "Percentage of the training data to use for validation")
    # Training parameters
    tf.flags.DEFINE_integer("batch_size", 256, "Batch Size (default: 64)")
    tf.flags.DEFINE_integer("num_epochs", 20, "Number of training epochs (default: 200)")

    FLAGS = tf.flags.FLAGS
    FLAGS._parse_flags()
    print("\nParameters:")
    for attr, value in sorted(FLAGS.__flags.items()):
        print("{}={}".format(attr.upper(), value))
    print("")

    # Load data
    print("Loading data...")
    x_text, y = data_helpers.load_data_and_labels(FLAGS.feature_file, FLAGS.label_file)

    y = np.array(y)

    outsize = 6
    new_ydata = []
    for i in range(len(y)):
      new_ydata.append([0]*outsize)
      new_ydata[i][y[i]] = 1   
      #print(new_ydata[i])
    new_ydata = np.array(new_ydata)
    new_y = new_ydata

    # Build vocabulary
    max_document_length = max([len(x.split(" ")) for x in x_text])
    vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
    x = np.array(list(vocab_processor.fit_transform(x_text)))

    # Randomly shuffle data
    np.random.seed(10)
    shuffle_indices = np.random.permutation(np.arange(len(y)))

    x_shuffled = np.array(x)[shuffle_indices]
    y_shuffled = np.array(y)[shuffle_indices]

    # Split train/test set
    # TODO: This is very crude, should use cross-validation
    dev_sample_index = -1 * int(FLAGS.dev_sample_percentage * float(len(y)))
    x_train, x_dev = x_shuffled[:dev_sample_index], x_shuffled[dev_sample_index:]
    y_train, y_dev = y_shuffled[:dev_sample_index], y_shuffled[dev_sample_index:]

    del x, y, x_shuffled, y_shuffled

    num_classes=6
    learning_rate=0.0001
    batch_size=x_train.shape[0]
    decay_steps=1000
    decay_rate=0.9
    sequence_length=max_document_length
    vocab_size=20570
    embed_size=100
    is_training=True
    dropout_keep_prob=0.5
    textRNN=TextRNN(num_classes, learning_rate, batch_size, decay_steps, decay_rate,sequence_length,vocab_size,embed_size,is_training)
    
    os.environ['CUDA_VISIBLE_DEVICES'] = '3'
    config = tf.ConfigProto(allow_soft_placement=True)          #my modification
    config.gpu_options.allow_growth = True

    saver=tf.train.Saver()
    is_train = False

    with tf.Graph().device('/gpu:3'), tf.Session(config=config) as sess:
        sess.run(tf.global_variables_initializer())

        if is_train:
            for kk in range(30):
                # Generate batches
                batches = data_helpers.batch_iter(
                    list(zip(x_train, y_train)), FLAGS.batch_size, FLAGS.num_epochs)
                step = 0
                # Training loop. For each batch...
                tmp_batches = batches
                for batch in tmp_batches:
                    x_batch, y_batch = zip(*batch)
                    #train_step(x_batch, y_batch)
                    input_x = x_batch
                    input_y = y_batch
                    loss,acc,predict,_=sess.run([textRNN.loss_val,textRNN.accuracy,textRNN.predictions,textRNN.train_op],feed_dict={textRNN.input_x:input_x,textRNN.input_y:input_y,textRNN.dropout_keep_prob:dropout_keep_prob})
                    print("iteration", kk, "step", step, "loss:",loss,"acc:",acc)#"label:",input_y,"prediction:",predict)
                    step += 1
                    if step%100 == 0:
                        loss,acc,predict,_=sess.run([textRNN.loss_val,textRNN.accuracy,textRNN.predictions,textRNN.train_op],feed_dict={textRNN.input_x:x_dev,textRNN.input_y:y_dev,textRNN.dropout_keep_prob:dropout_keep_prob})
                        print("**********************************", "iteration", kk, "step", step, "loss:",loss,"acc:",acc)#"label:",input_y,"prediction:",predict)

                        saver=tf.train.Saver(max_to_keep=1)
                        saver.save(sess,'rnnmodel/net.ckpt')
        else:
            saver.restore(sess, 'rnnmodel/net.ckpt')
            print("reload success!")
            loss, acc, predict,_=sess.run([textRNN.loss_val, textRNN.accuracy, textRNN.predictions, textRNN.train_op],feed_dict={textRNN.input_x:x_dev,textRNN.input_y:y_dev,textRNN.dropout_keep_prob:1})
            print("\n\nmodel accuracy on test data is: {}%\n\n".format(acc*100))

test()


# Results and discussion

**The baseline we used for the classification is random guessing. Since there are 6 characters, the probability of correctly guessing a classification is 1/6 which is 16.67%. The accuracies for our modelings are as below:**

* word2vec model and DNN model (with accuracy of 18%~19%)
* CNN model (with acuracy of 28%~30%)
* LSTM model (with accuracy of 39%~40%)

**Apparently, LSTM model has the best accuracy for character classification. It is not as high as we expected, but it is still (much) higher than random guessing. The main issue of the low accuracy is that we only have fewer than 20,000 lines of sentences for the 6 main characters, and there is no way to gather more data because the show is over. Also, the lines from a show are a good representative of the natural language, but definitely not the natural enough: for example, people tend to use the same sentences over and over again in reality, but a show can't have too many same lines for a character or the audience gets bored. We are satisfied with the results so far.**


# Next Steps...

**We are still in the early stage of the language drift part of our project, and that part will be included in the final paper.**
**Different from character classification, we can gather more data from the shows during the same period of time to improve the data size, and that may help with the final results.**