# Using Neural Networks to Solve NLP Problems

## Part-of-speech (POS) tagging

- Assign a category to a word according to its syntactic function.
    - noun, pronoun, adjective, determiner, verb, adverb, preposition, conjunction, interjection
- Data download link: https://www.clips.uantwerpen.be/conll2000/chunking/
- F1 score, the harmonic mean of precision and recall:

$$ F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} $$
$$ \text{precision} = \frac{\text{TruePositives}}{\text{TruePositives} + \text{FalsePositives}} $$
$$ \text{recall} = \frac{\text{TruePositives}}{\text{TruePositives} + \text{FalseNegatives}} $$
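
As a rough illustration of the formulas above, the sketch below computes precision, recall and F1 for a single tag, treating that tag as the positive class. The function name `f1_for_tag` and the toy tag lists are assumptions made for this example; they are not part of the notebook.

```python
def f1_for_tag(y_true, y_pred, tag):
    """Precision, recall and F1 for one tag, treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == tag and p == tag)  # true positives
    fp = sum(1 for t, p in pairs if t != tag and p == tag)  # false positives
    fn = sum(1 for t, p in pairs if t == tag and p != tag)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold and predicted tags for five tokens.
gold = ['NN', 'VB', 'NN', 'DT', 'NN']
pred = ['NN', 'NN', 'NN', 'DT', 'VB']
precision, recall, f1 = f1_for_tag(gold, pred, 'NN')
print(precision, recall, f1)
```

For a multi-class problem such as POS tagging, the per-tag scores are usually averaged; which averaging scheme to use is a separate choice not covered here.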

In [80]:
import tensorflow as tf
import numpy as np
import pandas as pd
import time
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.metrics import classification_report

In [32]:
def preprocess(dataset):
    dataset = dataset.drop('drop', axis=1)
    dataset['word'] = dataset['word'].apply(lambda x: x.lower())
    return dataset

In [33]:
trainData = preprocess(pd.read_csv('../large_files/chunking/train.txt', sep=' ', names=['word', 'tag', 'drop']))
testData = preprocess(pd.read_csv('../large_files/chunking/test.txt', sep=' ', names=['word', 'tag', 'drop']))

In [34]:
trainData.head()

| | word | tag |
|---|---|---|
| 0 | confidence | NN |
| 1 | in | IN |
| 2 | the | DT |
| 3 | pound | NN |
| 4 | is | VBZ |


In [39]:
vocabulary = trainData['word'].unique()

In [40]:
tags = trainData['tag'].unique()

In [41]:
tags

array(['NN', 'IN', 'DT', 'VBZ', 'RB', 'VBN', 'TO', 'VB', 'JJ', 'NNS',
       'NNP', ',', 'CC', 'POS', '.', 'VBP', 'VBG', 'PRP$', 'CD', '``',
       "''", 'VBD', 'EX', 'MD', '#', '(', '$', ')', 'NNPS', 'PRP', 'JJS',
       'WP', 'RBR', 'JJR', 'WDT', 'WRB', 'RBS', 'PDT', 'RP', ':', 'FW',
       'WP$', 'SYM', 'UH'], dtype=object)

### Logistic regression

- Does not capture sequence information: 
    - p(tag | word) = softmax(W[word_index])
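- It maps each word to a single tag, regardless of the surrounding words.
- Ambiguity is not handled by this model
    - A word with more than one possible tag always receives the same prediction
    - "Book a ship to France" ("book" is a verb, "ship" is a noun)
    - "Ship a book to France" ("ship" is a verb, "book" is a noun)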
- Accuracy: roughly 90% (0.8935 on the test set in the run below)
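
To make the context-independence concrete, here is a minimal sketch of the lookup `softmax(W[word_index])` in plain Python. The vocabulary, tag list and weight values are hypothetical toy numbers chosen for illustration, not weights learned by the model in this notebook.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy vocabulary, tag set and weight matrix (one row per word).
word_to_index = {'book': 0, 'ship': 1, 'a': 2}
tag_list = ['NN', 'VB', 'DT']
W = [
    [2.0, 1.0, -1.0],   # scores for 'book'
    [1.5, 1.0, -1.0],   # scores for 'ship'
    [-1.0, -1.0, 3.0],  # scores for 'a'
]


def predict_tag(word):
    """Pick the most probable tag using only the word's own row of W."""
    probs = softmax(W[word_to_index[word]])
    best = max(range(len(tag_list)), key=lambda i: probs[i])
    return tag_list[best]


tags_1 = [predict_tag(w) for w in ['book', 'a', 'ship']]
tags_2 = [predict_tag(w) for w in ['ship', 'a', 'book']]
print(tags_1, tags_2)
```

Because each prediction depends only on the word's own row of `W`, "book" and "ship" receive the same tag in both word orders, which is exactly the ambiguity problem described above.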

In [None]:
class LogisticRegression:
    def __init__(self):
        pass
    
    def fit(self, X, Y, vocab_list, tag_list, epochs=10, batch_size=100):
        features = [
            tf.feature_column.categorical_column_with_vocabulary_list('word', vocabulary_list=vocab_list)
        ]
        self.model = tf.estimator.LinearClassifier(feature_columns=features, n_classes=len(tag_list), label_vocabulary=tag_list)
        input_func = tf.estimator.inputs.pandas_input_fn(x=X,y=Y,batch_size=batch_size,num_epochs=epochs,shuffle=True)
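        # num_epochs in the input function already bounds training; pass steps as an integer upper limit.
        self.model.train(input_func, steps=int(np.ceil(epochs * len(X) / batch_size)))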

    def evaluate(self, X, Y, batch_size=10):
        eval_input_func = tf.estimator.inputs.pandas_input_fn(
            x=X,
            y=Y,
            batch_size=batch_size,
            num_epochs=1,
            shuffle=False
        )
        results = self.model.evaluate(eval_input_func)
        return results

    def predict(self, words):
        pred_input_func = tf.estimator.inputs.pandas_input_fn(
              x=pd.DataFrame.from_dict({'word': words}),
              batch_size=100,
              num_epochs=1,
              shuffle=False
        )
        predictions = self.model.predict(pred_input_func)
        return list(predictions)        

In [87]:
model = LogisticRegression()
model.fit(X=trainData, Y=trainData['tag'], vocab_list=vocabulary.tolist(), tag_list=tags.tolist())

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_global_id_in_cluster': 0, '_keep_checkpoint_max': 5, '_task_type': 'worker', '_eval_distribute': None, '_num_worker_replicas': 1, '_model_dir': '/tmp/tmphd4s7hb3', '_is_chief': True, '_save_summary_steps': 100, '_tf_random_seed': None, '_save_checkpoints_secs': 600, '_save_checkpoints_steps': None, '_service': None, '_evaluation_master': '', '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_experimental_distribute': None, '_keep_checkpoint_every_n_hours': 10000, '_train_distribute': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f7a80638f28>, '_protocol': None, '_device_fn': None, '_log_step_count_steps': 100, '_task_id': 0, '_master': '', '_num_ps_replicas': 0}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
...

INFO:tensorflow:loss = 30.987915, step = 7101 (0.438 sec)
INFO:tensorflow:global_step/sec: 231.047
INFO:tensorflow:loss = 39.588947, step = 7201 (0.423 sec)
INFO:tensorflow:global_step/sec: 256.925
INFO:tensorflow:loss = 29.076149, step = 7301 (0.388 sec)
INFO:tensorflow:global_step/sec: 226.277
INFO:tensorflow:loss = 39.514095, step = 7401 (0.441 sec)
INFO:tensorflow:global_step/sec: 274.123
INFO:tensorflow:loss = 35.54767, step = 7501 (0.371 sec)
INFO:tensorflow:global_step/sec: 228.236
INFO:tensorflow:loss = 35.500275, step = 7601 (0.436 sec)
INFO:tensorflow:global_step/sec: 214.017
INFO:tensorflow:loss = 57.660866, step = 7701 (0.465 sec)
INFO:tensorflow:global_step/sec: 213.491
INFO:tensorflow:loss = 42.751644, step = 7801 (0.474 sec)
INFO:tensorflow:global_step/sec: 255.021
INFO:tensorflow:loss = 35.18807, step = 7901 (0.387 sec)
INFO:tensorflow:global_step/sec: 231.665
INFO:tensorflow:loss = 55.695854, step = 8001 (0.432 sec)
INFO:tensorflow:global_step/sec: 222.105
...

INFO:tensorflow:global_step/sec: 226.124
INFO:tensorflow:loss = 20.012203, step = 15401 (0.441 sec)
INFO:tensorflow:global_step/sec: 199.853
INFO:tensorflow:loss = 35.40974, step = 15501 (0.497 sec)
INFO:tensorflow:global_step/sec: 226.678
INFO:tensorflow:loss = 41.794556, step = 15601 (0.441 sec)
INFO:tensorflow:global_step/sec: 227.052
INFO:tensorflow:loss = 37.339996, step = 15701 (0.440 sec)
INFO:tensorflow:global_step/sec: 272.925
INFO:tensorflow:loss = 18.189425, step = 15801 (0.367 sec)
INFO:tensorflow:global_step/sec: 263.724
INFO:tensorflow:loss = 33.17208, step = 15901 (0.380 sec)
INFO:tensorflow:global_step/sec: 183.913
INFO:tensorflow:loss = 35.617386, step = 16001 (0.546 sec)
INFO:tensorflow:global_step/sec: 178.552
INFO:tensorflow:loss = 33.488907, step = 16101 (0.565 sec)
INFO:tensorflow:global_step/sec: 190.908
INFO:tensorflow:loss = 32.82385, step = 16201 (0.517 sec)
INFO:tensorflow:global_step/sec: 227.49
INFO:tensorflow:loss = 21.339346, step = 16301 (0.447 sec)
...

In [89]:
model.evaluate(X=testData, Y=testData['tag'])

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-01-02-16:02:18
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmphd4s7hb3/model.ckpt-21173
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2019-01-02-16:02:31
INFO:tensorflow:Saving dict for global step 21173: accuracy = 0.89351374, average_loss = 0.3801966, global_step = 21173, loss = 3.8017251
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 21173: /tmp/tmphd4s7hb3/model.ckpt-21173


{'accuracy': 0.89351374,
 'average_loss': 0.3801966,
 'global_step': 21173,
 'loss': 3.8017251}

In [106]:
words = ['car', 'book', 'house', 'run', 'ship', 'of', 'really']
predictions = model.predict(words)
dict(zip(
    words, 
    [ p['classes'][0].decode('utf-8') for p in predictions ]
))

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmphd4s7hb3/model.ckpt-21173
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


{'book': 'NN',
 'car': 'NN',
 'house': 'NNP',
 'of': 'IN',
 'really': 'RB',
 'run': 'VB',
 'ship': 'NN'}

### Recurrent neural networks

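- Operate on sequences, so each prediction can use the context of the preceding words
- Similar to the logistic regression model, but the hidden state h(t-1) from the previous step is fed back into the network together with the current input x(t)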

$$ h(t) = \sigma(W_x^T x(t) + W_h^T h(t-1) + b) $$

- Modern RNNs:
    - LSTMs
    - GRUs
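
The recurrence above can be sketched in a few lines of dependency-free Python. This is a simple (non-gated) recurrent step, not an LSTM or GRU, and the sizes and weight values are hypothetical; the helper names `rnn_step` and `run` are assumptions made for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One update: h(t) = sigmoid(W_x^T x(t) + W_h^T h(t-1) + b)."""
    hidden_size = len(b)
    h_t = []
    for j in range(hidden_size):
        z = b[j]
        z += sum(W_x[i][j] * x_t[i] for i in range(len(x_t)))        # input term
        z += sum(W_h[k][j] * h_prev[k] for k in range(hidden_size))  # recurrent term
        h_t.append(sigmoid(z))
    return h_t


# Hypothetical weights: 3-dimensional one-hot inputs, 2 hidden units.
W_x = [[0.5, -0.5], [1.0, 0.0], [0.0, 1.0]]
W_h = [[0.1, 0.2], [-0.2, 0.1]]
b = [0.0, 0.0]


def run(sequence):
    """Feed a sequence through the recurrence, starting from a zero state."""
    h = [0.0, 0.0]
    for x_t in sequence:
        h = rnn_step(x_t, h, W_x, W_h, b)
    return h


word_a = [1.0, 0.0, 0.0]
word_b = [0.0, 1.0, 0.0]
h_ab = run([word_a, word_b])  # word_b preceded by word_a
h_bb = run([word_b, word_b])  # word_b preceded by word_b
print(h_ab, h_bb)
```

The final hidden state for `word_b` differs depending on which word came before it, which is what lets a tagger built on top of h(t) resolve cases the per-word logistic regression cannot.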

### Hidden Markov Models

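- Hidden states = POS tags, observations = words
- An HMM is defined by three parameters: Pi, A, B
    - Pi = p(tag(1)), the relative frequency of each tag at the start of a sentence
    - A = p(tag(t) | tag(t-1)), the tag transition probabilities
    - B = p(word(t) | tag(t)), the word emission probabilities
- Because the tags are observed in the training data, all three can be estimated by simply counting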
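
A minimal sketch of the counting approach, assuming the corpus is available as a list of sentences, each a list of `(word, tag)` pairs. The function name `estimate_hmm` and the toy corpus are hypothetical, and no smoothing is applied, so unseen transitions and emissions are simply absent from the resulting dictionaries.

```python
from collections import Counter, defaultdict


def estimate_hmm(tagged_sentences):
    """Estimate Pi, A and B by relative-frequency counting (no smoothing)."""
    start_counts = Counter()
    transition_counts = defaultdict(Counter)
    emission_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        prev_tag = None
        for position, (word, tag) in enumerate(sentence):
            if position == 0:
                start_counts[tag] += 1               # contributes to Pi
            else:
                transition_counts[prev_tag][tag] += 1  # contributes to A
            emission_counts[tag][word] += 1          # contributes to B
            prev_tag = tag

    def normalize(counter):
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    Pi = normalize(start_counts)
    A = {prev: normalize(nxt) for prev, nxt in transition_counts.items()}
    B = {tag: normalize(words) for tag, words in emission_counts.items()}
    return Pi, A, B


# Hypothetical toy corpus of three tagged sentences.
corpus = [
    [('book', 'VB'), ('a', 'DT'), ('ship', 'NN')],
    [('ship', 'VB'), ('a', 'DT'), ('book', 'NN')],
    [('the', 'DT'), ('ship', 'NN')],
]
Pi, A, B = estimate_hmm(corpus)
print(Pi)
print(A)
print(B)
```

In practice some form of smoothing is typically added so that words or tag pairs missing from the training data do not get zero probability.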
