# Overview
The neural network approach to language modeling can be described using the following three model properties, taken from A Neural Probabilistic Language Model, 2003.

1. Associate each word in the vocabulary with a distributed word feature vector.
2. Express the joint probability function of word sequences in terms of the feature vectors of these words in the sequence.
3. Learn simultaneously the word feature vector and the parameters of the probability function.
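The three properties above can be sketched as a single forward pass. This is a minimal NumPy illustration, not the paper's exact architecture: the sizes, the parameter names `C`, `H`, `U`, and the omission of bias terms are all assumptions made for brevity, and no training is performed.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 20   # illustrative vocabulary size (assumption)
embed_dim = 8     # size of each word feature vector (assumption)
context = 3       # number of preceding words fed to the model (assumption)
hidden = 16       # hidden layer width (assumption)

# Property 1: one distributed feature vector per word in the vocabulary.
C = rng.normal(scale=0.1, size=(vocab_size, embed_dim))

# Parameters of the probability function (biases omitted for brevity).
H = rng.normal(scale=0.1, size=(context * embed_dim, hidden))
U = rng.normal(scale=0.1, size=(hidden, vocab_size))

def next_word_probs(context_ids):
    """Return P(next word | context words) as a vector over the vocabulary."""
    # Property 2: the probability is computed from the context words' feature vectors.
    x = C[context_ids].reshape(-1)        # concatenate the feature vectors
    h = np.tanh(x @ H)                    # hidden layer
    logits = h @ U
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()

# Property 3 (not shown): training would update C, H and U together by gradient descent.
probs = next_word_probs([4, 7, 1])
print(probs.shape, probs.sum())
```

Because `C` is just another parameter matrix, a training loop would adjust the word vectors and the probability function at the same time, which is what the third property describes.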

This represents a relatively simple model where both the representation and the probabilistic model are learned together directly from raw text data. Recently, neural-based approaches have started to outperform classical statistical approaches.

    We provide ample empirical evidence to suggest that connectionist
    language models are superior to standard n-gram techniques,
    except their high computational (training) complexity.
                — Recurrent neural network based language model, 2010.


Initially, feedforward neural network models were used to introduce the approach. More recently, recurrent neural networks, and then networks with a long-term memory such as the Long Short-Term Memory network (LSTM), have allowed models to learn the relevant context over much longer input sequences than the simpler feedforward networks.

    [an RNN language model] provides further generalization: instead
    of considering just several preceding words, neurons with input
    from recurrent connections are assumed to represent short term
    memory. The model learns itself from the data how to represent
    memory. While shallow feedforward neural networks (those with just
    one hidden layer) can only cluster similar words, recurrent neural
    network (which can be considered as a deep architecture) can
    perform clustering of similar histories. This allows for instance
    efficient representation of patterns with variable length.
                — Extensions of recurrent neural network language model, 2011.
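The idea of a recurrent state acting as memory over a history of any length can be illustrated with a simple recurrent update. The sketch below is an untrained toy, with assumed sizes, assumed weight names (`W_in`, `W_rec`, `W_out`) and a `tanh` activation chosen for convenience; it is not the configuration used in the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden = 10, 6   # illustrative sizes (assumptions)

W_in = rng.normal(scale=0.1, size=(vocab_size, hidden))    # input-to-hidden weights
W_rec = rng.normal(scale=0.1, size=(hidden, hidden))       # recurrent (memory) weights
W_out = rng.normal(scale=0.1, size=(hidden, vocab_size))   # hidden-to-output weights

def rnn_next_probs(token_ids):
    """Run a simple recurrent network over a history of any length."""
    h = np.zeros(hidden)                      # empty memory before the first token
    for t in token_ids:
        # W_in[t] is equivalent to multiplying a one-hot input vector by W_in
        h = np.tanh(W_in[t] + h @ W_rec)      # new state mixes the current token with the old state
    logits = h @ W_out
    exp = np.exp(logits - logits.max())       # numerically stable softmax
    return exp / exp.sum()

# The same parameters handle a short history and a longer one.
short_probs = rnn_next_probs([1, 2])
longer_probs = rnn_next_probs([5, 3, 1, 2, 7, 7, 0, 4])
```

A feedforward model would need a fixed number of input positions; here the loop simply runs for as many steps as the history has tokens.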
    
    
Some heuristics for developing high-performing neural language models in general:

- **Size matters:** The best models were the largest models, specifically in terms of the number of memory units.
- **Regularization matters:** Using regularization such as dropout on input connections improves results.
- **CNNs vs. embeddings:** Character-level Convolutional Neural Network (CNN) models can be used on the front end instead of word embeddings, achieving similar and sometimes better results.
- **Ensembles matter:** Combining the predictions from multiple models can offer large improvements in model performance.
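As a small illustration of the ensemble heuristic, one simple way to combine models is to average their predicted distributions. The probabilities below are made-up numbers for three hypothetical models, and plain averaging is only one of several possible combination schemes.

```python
import numpy as np

# Hypothetical next-token distributions from three models over a 4-token vocabulary.
model_probs = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.20, 0.60, 0.10, 0.10],
    [0.50, 0.30, 0.10, 0.10],
])

ensemble = model_probs.mean(axis=0)    # average the distributions element-wise
prediction = int(ensemble.argmax())    # most likely token under the ensemble
print(ensemble, prediction)
```

The average of valid probability distributions is itself a valid distribution, so the result can be used exactly like a single model's output.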

## Character-based Neural Language Models

A language model predicts the next word in the sequence based on the specific words that have come before it in the sequence. It is also possible to develop language models at the character level using neural networks. The benefit of character-based language models is their small vocabulary and flexibility in handling any words, punctuation, and other document structure. This comes at the cost of requiring larger models that are slower to train. Nevertheless, in the field of neural language models, character-based models offer a lot of promise for a general, flexible and powerful approach to language modeling.
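The flexibility of a character vocabulary can be seen in a toy comparison. The sample sentence and the unseen word `moth` are assumptions chosen for illustration; they are not taken from the data used later.

```python
text = "the cat sat on the mat"   # illustrative training text (assumption)

# Word-level vocabulary: one entry per distinct word.
word_to_id = {w: i for i, w in enumerate(sorted(set(text.split())))}

# Character-level vocabulary: one entry per distinct character, including the space.
char_to_id = {c: i for i, c in enumerate(sorted(set(text)))}

new_word = "moth"   # never appears in the training text

word_known = new_word in word_to_id                       # False: out of vocabulary
chars_known = all(c in char_to_id for c in new_word)      # True: every character was seen
encoded = [char_to_id[c] for c in new_word]               # so the word can still be encoded
print(word_known, chars_known, encoded)
```

A word-level model has no representation for the unseen word, while the character-level mapping encodes it from characters it already knows.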

### Language Model Design
A language model must be trained on the text, and in the case of a character-based language model, the input and output sequences must be characters. The number of characters used as input will also define the number of characters that will need to be provided to the model in order to elicit the first predicted character. After the first character has been generated, it can be appended to the input sequence and used as input for the model to generate the next character.
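The generation loop described above can be sketched with a stand-in for the trained model. Here `build_lookup` is a hypothetical helper that simply memorizes which character followed each context in a sample sentence; a real neural model would predict a distribution instead, and the sample text is an assumption.

```python
def build_lookup(text, length):
    """Stand-in for a trained model: remember which character followed each context."""
    table = {}
    for i in range(length, len(text)):
        table[text[i - length:i]] = text[i]
    return table

def generate(table, seed, length, n_chars):
    """Generate up to n_chars characters, feeding each prediction back in as input."""
    out = seed
    for _ in range(n_chars):
        context = out[-length:]    # the most recent `length` characters
        nxt = table.get(context)
        if nxt is None:            # context never seen, so the stand-in cannot predict
            break
        out += nxt                 # append the prediction to the input sequence
    return out

text = "the quick brown fox jumps over the lazy dog"   # illustrative text (assumption)
length = 10
table = build_lookup(text, length)

# The seed must supply `length` characters before the first prediction can be made.
generated = generate(table, text[:length], length, 15)
print(generated)
```

The seed has to be exactly as long as the model's input window, which is why longer input sequences place more burden on whoever starts the generation.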

Longer sequences offer more context for the model to learn what character to output next but take longer to train and impose more burden on seeding the model when generating text. We will use an arbitrary length of 10 characters for this model. There is not a lot of text, and 10 characters is a few words. We can now transform the raw text into a form that our model can learn; specifically, input and output sequences of characters.
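To make the input and output sequences concrete, the sketch below slides a window of 10 input characters plus 1 output character over a sample sentence. The sentence is an assumption used only for illustration.

```python
length = 10
text = "the quick brown fox jumps over the lazy dog"   # illustrative text (assumption)

# Each pair is (10 input characters, the single character that follows them).
pairs = [(text[i - length:i], text[i]) for i in range(length, len(text))]

for x, y in pairs[:3]:
    print(repr(x), "->", repr(y))
```

A text of `n` characters yields `n - length` such pairs, so even a short text produces a reasonable number of training examples.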

### Data Preparation

1. Tokenize and clean text
2. Create sequences
3. Encode sequences

In [19]:
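# Load the document into memory as a single string
def load_doc(fn):
    # The with block closes the file automatically, so no explicit close() is needed
    with open(fn, 'r') as file:
        text = file.read()
    return text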

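def generate_sequences(txt, length):
    txt = ' '.join(txt.split())  # Collapse newlines and repeated whitespace into single spaces
    # Each sequence holds `length` input characters plus one output character;
    # starting at i = length avoids negative slice indices
    return [txt[i - length: i + 1] for i in range(length, len(txt))]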

def encode_seq(txt):
    chars = sorted(set(txt))                       # Distinct characters in a stable order
    mapping = {c: i for i, c in enumerate(chars)}  # Character -> integer index
    return len(mapping), [mapping[char] for char in txt]
    
txt = load_doc("../data/rhyme.txt")
len(generate_sequences(txt, 10))

vocab_size, es = encode_seq(txt)
print(vocab_size)

38
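Before a network can consume the sequences, the integer codes are typically split into inputs and outputs and turned into vectors. The sketch below uses one-hot encoding, which is one common choice and an assumption here rather than something stated above; it also uses a sample sentence in place of the rhyme file so that it runs on its own.

```python
import numpy as np

text = "the quick brown fox jumps over the lazy dog"   # stand-in for the rhyme (assumption)
length = 10

# Same windowing as above: `length` input characters plus one output character.
sequences = [text[i - length:i + 1] for i in range(length, len(text))]
mapping = {c: i for i, c in enumerate(sorted(set(text)))}
vocab = len(mapping)

# Integer-encode every sequence, then separate inputs from the final output character.
encoded = np.array([[mapping[c] for c in seq] for seq in sequences])
X_ids, y_ids = encoded[:, :-1], encoded[:, -1]

# One-hot encode by indexing rows of an identity matrix.
eye = np.eye(vocab, dtype=np.float32)
X = eye[X_ids]   # shape: (number of sequences, length, vocab)
y = eye[y_ids]   # shape: (number of sequences, vocab)
print(X.shape, y.shape)
```

Each input character becomes a vector with a single 1 at its index, so the final axis of `X` and `y` has the same size as the vocabulary.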
