# Generating text with deep learning

**Note: none of this works on my PC because my graphics card isn't good enough to run TensorFlow!**

**When I have a computer that can run TensorFlow, redo this using the Keras tutorial https://keras.io/examples/nlp/lstm_seq2seq/**

Documentation: Keras: Natural Language Processing https://keras.io/examples/nlp/

Tutorial: TensorFlow: Neural Machine Translation with Attention https://www.tensorflow.org/text/tutorials/nmt_with_attention

One of the most common neural models used for text generation is the sequence-to-sequence model, commonly referred to as seq2seq (pronounced “seek-to-seek”). A type of encoder-decoder model, seq2seq uses recurrent neural networks (RNNs) such as LSTMs to generate output one token at a time, either word by word or character by character. Typical applications of seq2seq include:

- Machine translation software like Google Translate
- Text summary generation
- Chatbots
- Named Entity Recognition (NER)
- Speech recognition

Seq2seq networks have two parts:

- An encoder that accepts language (or audio or video) input. The output matrix of the encoder is discarded, but its state is preserved as a vector.
- A decoder that takes the encoder’s final state (or memory) as its initial state. During training we use a technique called “teacher forcing”: the decoder is fed the true previous text (characters or words) of the target sequence and learns to predict the text that follows.
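
To make teacher forcing concrete, here is a minimal sketch in plain Python (no TensorFlow required). The helper name `teacher_forcing_pairs` and the sample sentence are made up for illustration; only the `<START>` and `<END>` markers come from the preprocessing code in this notebook. It shows one common way to set up the training pair: the decoder input and the decoder target are the same sentence, offset by one token.

```python
def teacher_forcing_pairs(target_tokens):
    """Return (decoder_input, decoder_target) for one target sentence.

    The decoder input is the sentence without its last token, and the
    decoder target is the same sentence shifted one step ahead, so at
    every timestep the decoder sees the true previous token and has to
    predict the next one.
    """
    decoder_input = target_tokens[:-1]
    decoder_target = target_tokens[1:]
    return decoder_input, decoder_target


# Hypothetical target sentence, already tokenized
sentence = ["<START>", "Hola", "mundo", ".", "<END>"]
decoder_input, decoder_target = teacher_forcing_pairs(sentence)

# Print each timestep: the token the decoder is given and the one it should predict
for seen, expected in zip(decoder_input, decoder_target):
    print(f"given {seen!r:10} predict {expected!r}")
```

In a real model these token lists would be converted to numeric arrays before training; the sketch leaves that step out to keep the offset visible.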


## Preprocessing for seq2seq

We’ll be using TensorFlow with the Keras API to build a pretty limited English-to-Spanish translator.

In [1]:
from tensorflow import keras
import re
# Importing our translations
data_path = "span-eng.txt"
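# Defining lines as a list of each non-empty line
# (a blank or trailing empty line would break the tab split below)
with open(data_path, 'r', encoding='utf-8') as f:
  lines = [line for line in f.read().splitlines() if line.strip()]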

# Building empty lists to hold sentences
input_docs = []
target_docs = []
# Building empty vocabulary sets
input_tokens = set()
target_tokens = set()

for line in lines:
  # Input and target sentences are separated by tabs
  input_doc, target_doc = line.split('\t')
  # Appending each input sentence to input_docs
  input_docs.append(input_doc)
  # Splitting words from punctuation
  target_doc = " ".join(re.findall(r"[\w']+|[^\s\w]", target_doc))
  # Wrapping the target sentence in <START> and <END> markers
  # and appending it to target_docs
  target_doc = '<START> ' + target_doc + ' <END>'
  target_docs.append(target_doc)
  
  # Now we split up each sentence into words
  # and add each unique word to our vocabulary set
  # (set.add ignores duplicates, so no membership check is needed)
  for token in re.findall(r"[\w']+|[^\s\w]", input_doc):
    input_tokens.add(token)
  for token in target_doc.split():
    target_tokens.add(token)

input_tokens = sorted(input_tokens)
target_tokens = sorted(target_tokens)

# Create num_encoder_tokens and num_decoder_tokens:
num_encoder_tokens = len(input_tokens)
num_decoder_tokens = len(target_tokens)

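# Longest sentence (in tokens) on each side; default=0 covers an empty dataset
max_encoder_seq_length = max((len(re.findall(r"[\w']+|[^\s\w]", input_doc)) for input_doc in input_docs), default=0)
# target_docs are already space-separated, so split() keeps <START> and <END> as single tokens
max_decoder_seq_length = max((len(target_doc.split()) for target_doc in target_docs), default=0)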




Running this cell without `span-eng.txt` in the working directory raises:

FileNotFoundError: [Errno 2] No such file or directory: 'span-eng.txt'
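
Since the cell above depends on a data file, it can help to check the tokenization on a couple of in-memory lines first. The sketch below applies the same regular expression to two made-up tab-separated pairs (illustrative sentences, not lines taken from `span-eng.txt`) and builds the two vocabularies the same way. The helper name `tokenize` and the names `sample_lines`, `input_vocab`, and `target_vocab` are hypothetical.

```python
import re

# Same pattern as in the cell above: a run of word characters or apostrophes,
# or a single character that is neither whitespace nor a word character
TOKEN_PATTERN = r"[\w']+|[^\s\w]"


def tokenize(sentence):
    """Split a sentence into word and punctuation tokens."""
    return re.findall(TOKEN_PATTERN, sentence)


# Made-up sample lines in "English<TAB>Spanish" form (an assumption about the file layout)
sample_lines = ["Go away!\t¡Vete!", "I don't know.\tNo lo sé."]

input_vocab, target_vocab = set(), set()
for line in sample_lines:
    input_doc, target_doc = line.split("\t")
    # Separate punctuation with spaces, then add the sentence markers
    target_doc = "<START> " + " ".join(tokenize(target_doc)) + " <END>"
    input_vocab.update(tokenize(input_doc))
    target_vocab.update(target_doc.split())

print(sorted(input_vocab))
print(sorted(target_vocab))
```

Note that the apostrophe in the pattern keeps contractions such as `don't` together as one token, while punctuation like `¡` and `!` becomes its own token.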