# NLPTW2
The first model, nlptw, was rather simple. It predicted words fairly well, but it couldn't really take context beyond the first sentence into account. NLPTW2 is an attempt to fix that.
Essentially, nlptw2 is just a bigger version of the first model, with more input nodes so that previous sentences can be fed in as well.

## Todo
- [ ] Make the model respond with full sentences rather than a fixed number of words.

In [None]:
%pip install tensorflow --user

In [None]:
import tensorflow as tf
import numpy as np

## Optional: Reading text data
For testing, I just used a few publicly available books from [Project Gutenberg](https://www.gutenberg.org/), but you could also copy a bunch of tweets into the text file to train on.

In [None]:
with open('../data/tweetdata.txt', encoding="utf8") as f:
    lines = f.read()

# Preparing Data
This works by reading in the text, splitting it into sentences, and feeding the result into a tokenizer. The tokenizer then splits the sentences into words and converts the words into numbers.
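To make the idea concrete, here is a minimal pure-Python sketch of the same steps. It is a simplified stand-in and not the Keras `Tokenizer` itself: the helper names `build_word_index` and `to_sequence` are made up for illustration, and the sketch only lowercases and splits on spaces without filtering punctuation.

```python
from collections import Counter


def build_word_index(text):
    """Split text into sentences on '.', then map each word to an integer.

    Hypothetical helper: the most frequent word gets index 1, and index 0
    is left unused so it can serve as the padding value later on.
    """
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    counts = Counter(word for s in sentences for word in s.lower().split())
    word_index = {
        word: i for i, (word, _) in enumerate(counts.most_common(), start=1)
    }
    return sentences, word_index


def to_sequence(sentence, word_index):
    """Convert one sentence into word indices, skipping unknown words."""
    return [word_index[w] for w in sentence.lower().split() if w in word_index]


sentences, word_index = build_word_index("The cat sat. The cat ran.")
sequences = [to_sequence(s, word_index) for s in sentences]
print(word_index)  # {'the': 1, 'cat': 2, 'sat': 3, 'ran': 4}
print(sequences)   # [[1, 2, 3], [1, 2, 4]]
```

The numbers carry no meaning on their own; they are just stable IDs that an embedding layer can look up.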

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical, pad_sequences


tokenizer = Tokenizer(
    filters='"#$%&()*+-/<=>@[\\]^_`{|}~\t\n',
    split=' ',
)
data = lines
corpus = data.split(".")

tokenizer.fit_on_texts(corpus)
# word indices start at 1; index 0 is reserved for padding, hence the + 1
# (no oov_token is set above, so unknown words are simply dropped)
total_words = len(tokenizer.word_index) + 1

In [None]:
# create input sequences using list of tokens
input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)
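The loop above turns every sentence into a set of growing prefixes, so that each prefix can later act as one training example. A small sketch with made-up token IDs (the function name `prefix_sequences` is only for illustration) shows what ends up in `input_sequences` for a single sentence:

```python
def prefix_sequences(token_list):
    """Return every prefix of length >= 2: [t0, t1], [t0, t1, t2], ..."""
    return [token_list[: i + 1] for i in range(1, len(token_list))]


# Hypothetical token IDs for a four-word sentence.
example = prefix_sequences([4, 7, 2, 9])
print(example)  # [[4, 7], [4, 7, 2], [4, 7, 2, 9]]
```

A sentence with `n` tokens therefore yields `n - 1` sequences, and one-word sentences contribute nothing.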

In [None]:
# padding sequences
max_sequence_length = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_length, padding='pre'))

In [None]:
# slice list by using the last element as the label
xs = input_sequences[:,:-1]
labels = input_sequences[:,-1]

ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)
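As a rough illustration of the padding and slicing steps, the sketch below redoes them in plain Python on three made-up sequences. The helpers `pad_pre` and `one_hot` are hypothetical stand-ins for `pad_sequences(..., padding='pre')` and `to_categorical`, and the vocabulary size of 10 is an assumption for the example.

```python
def pad_pre(sequences, maxlen):
    """Left-pad each sequence with zeros (keeping at most the last maxlen tokens)."""
    return [[0] * (maxlen - len(s)) + s[-maxlen:] for s in sequences]


def one_hot(label, num_classes):
    """Return a list of zeros with a single 1 at position `label`."""
    row = [0] * num_classes
    row[label] = 1
    return row


sequences = [[4, 7], [4, 7, 2], [4, 7, 2, 9]]
maxlen = max(len(s) for s in sequences)
padded = pad_pre(sequences, maxlen)

xs = [row[:-1] for row in padded]   # everything but the last token
labels = [row[-1] for row in padded]  # the last token is the word to predict
ys = [one_hot(label, num_classes=10) for label in labels]

print(padded)  # [[0, 0, 4, 7], [0, 4, 7, 2], [4, 7, 2, 9]]
print(xs)      # [[0, 0, 4], [0, 4, 7], [4, 7, 2]]
print(labels)  # [7, 2, 9]
```

Padding on the left keeps the label in the last column of every row, which is what makes the simple `[:, :-1]` and `[:, -1]` slices work.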

# RNN Architecture
This is where the model's layers are defined. The model is then trained on the data prepared earlier using the `fit( ... )` method.

In [None]:
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

In [None]:
model = Sequential()
model.add(Embedding(total_words,240, input_length=max_sequence_length-1))
model.add(Bidirectional(LSTM(150)))
model.add(Dense(total_words, activation='softmax'))
# `lr` is deprecated (and rejected by newer Keras versions); use `learning_rate`
adam = Adam(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=["acc"])
history = model.fit(xs, ys, epochs=10, verbose=1)
model.save('../models/model_tweetdata8k.h5')
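For a rough sense of how large this model gets, the trainable parameters can be estimated by hand from the layer sizes above (240-dimensional embedding, bidirectional LSTM with 150 units, softmax over the vocabulary). This is a back-of-the-envelope sketch assuming default layer settings with biases enabled; the vocabulary size of 8000 is a made-up example value, and `model.summary()` remains the authoritative source.

```python
def count_parameters(vocab_size, embedding_dim=240, lstm_units=150):
    """Estimate trainable parameters for Embedding -> BiLSTM -> Dense(softmax)."""
    # One embedding vector per vocabulary entry.
    embedding = vocab_size * embedding_dim
    # An LSTM has 4 gates, each with input weights, recurrent weights and a bias.
    one_direction = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
    # The bidirectional wrapper runs two independent LSTMs.
    bilstm = 2 * one_direction
    # The dense layer sees the concatenated forward and backward outputs.
    dense = (2 * lstm_units) * vocab_size + vocab_size
    return {
        "embedding": embedding,
        "bilstm": bilstm,
        "dense": dense,
        "total": embedding + bilstm + dense,
    }


# Hypothetical vocabulary size, only for illustration.
params = count_parameters(vocab_size=8000)
print(params)
# {'embedding': 1920000, 'bilstm': 469200, 'dense': 2408000, 'total': 4797200}
```

Both the embedding and the output layer grow linearly with the vocabulary, so a larger dataset mostly inflates those two layers rather than the LSTM.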

# Testing
This is where you can provide simple input to the model, which will then try to complete your sentence.
Note that all of this is just a testing environment. Sooner or later I will launch the model on Twitter with a much larger dataset but essentially the same code, so stay tuned!
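The completion loop is plain greedy decoding: predict the most likely next word, append it to the text, and repeat. The sketch below shows that loop in isolation, with a toy lookup table standing in for the trained model. `toy_predict`, `generate`, and the four-word vocabulary are all made up for illustration and are not part of the notebook.

```python
# Hypothetical four-word vocabulary; index 0 is the padding slot.
index_word = {1: "the", 2: "cat", 3: "sat", 4: "down"}
word_index = {word: index for index, word in index_word.items()}


def toy_predict(token_ids):
    """Stand-in for model.predict: returns one score per vocabulary index."""
    successor = {1: 2, 2: 3, 3: 4, 4: 1}
    last = token_ids[-1] if token_ids else 4
    scores = [0.0] * (len(index_word) + 1)
    scores[successor[last]] = 1.0
    return scores


def generate(seed_text, next_words):
    """Greedy decoding: always append the highest-scoring next word."""
    for _ in range(next_words):
        token_ids = [word_index[w] for w in seed_text.split() if w in word_index]
        scores = toy_predict(token_ids)
        predicted = max(range(len(scores)), key=scores.__getitem__)
        word = index_word.get(predicted, "")
        if not word:  # the padding index has no word attached
            break
        seed_text += " " + word
    return seed_text


result = generate("the", 3)
print(result)  # the cat sat down
```

Because the most likely word is always picked, the same seed produces the same continuation every time.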

In [None]:
import random

# pick a random known word as the seed (random.sample no longer accepts dict views)
seed_text = random.choice(list(tokenizer.word_index))
next_words = 25

for _ in range(next_words):
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_length-1, padding='pre')
    predicted = model.predict(token_list, verbose=0)
    predicted = int(np.argmax(predicted, axis=-1)[0])
    # look the word up directly; index 0 (padding) has no word attached
    output_word = tokenizer.index_word.get(predicted, "")
    if output_word:
        seed_text += " " + output_word
print(seed_text)