# Character-level language models

Make sure you read about [language models](LanguageModel.ipynb) before proceeding to create your own!

Now that you understand the different aspects of creating a language model, you will try making your own character-level language model. This notebook contains some tips and pointers to get you started in the right direction.

### Finding a corpus to use

You can use any text file you like. Make sure it is large enough that your model actually has something to learn and doesn't overfit to a small amount of data. If you can't find one, you can use [this one](https://norvig.com/big.txt).
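As a rough sketch, loading the corpus could look like the following. The helper names `load_corpus` and `describe` are made up for illustration, and the usage lines assume you have already saved a file such as `big.txt` next to the notebook.

```python
from pathlib import Path


def load_corpus(path, encoding="utf-8"):
    """Read the whole text file into a single string (hypothetical helper)."""
    # errors="replace" keeps reading even if a few bytes are invalid in this encoding
    return Path(path).read_text(encoding=encoding, errors="replace")


def describe(text):
    """Return (number of characters, number of distinct characters)."""
    return len(text), len(set(text))


# Assumed usage, once a corpus file has been saved locally:
# text = load_corpus("big.txt")
# print(describe(text))
```

Checking the number of distinct characters early is a quick way to see whether the corpus contains characters that your vocabulary will not cover.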

### Building the vocabulary

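For a character-level vocabulary, you can use `from string import printable`. This string contains the 100 printable ASCII characters (digits, letters, punctuation and whitespace), which covers most of what you are likely to find in an English corpus. Any non-ASCII characters in your corpus are not included, so you will need to either skip them or add them to the vocabulary yourself.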
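One possible way to turn `printable` into lookup tables is sketched below. The names `stoi` and `itos` follow the ones used elsewhere in this notebook; the `encode` and `decode` helpers are hypothetical, and silently skipping characters outside the vocabulary is just one assumed policy.

```python
from string import printable

# Character-level vocabulary: the 100 printable ASCII characters
vocab = list(printable)

# stoi maps a character to its integer index; itos is the inverse lookup
stoi = {ch: i for i, ch in enumerate(vocab)}
itos = {i: ch for ch, i in stoi.items()}


def encode(text):
    """Convert a string to a list of indices, skipping characters outside the vocabulary."""
    return [stoi[ch] for ch in text if ch in stoi]


def decode(indices):
    """Convert a list of indices back to a string."""
    return "".join(itos[i] for i in indices)
```

With these helpers, `decode(encode(text))` gives back the original text with any out-of-vocabulary characters removed.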

### Input format

You may choose to use one-hot encoding to create your input. A one-hot encoded vector is a zero vector with a single `1` at the index of the current token. Hence, the size of the vector is equal to the size of the vocabulary.

```python
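import torch

vocab_size = len(vocab)
vector = torch.zeros(1, vocab_size)  # shape: (1, vocab_size)
vector[0, stoi[token]] = 1  # `token` is the current character; index the column, not the row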
```
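To encode a whole string at once, something like the following sketch could work. It uses `torch.nn.functional.one_hot`; the `one_hot_encode` helper is a hypothetical name, and the vocabulary is assumed to be `string.printable`, so a character outside it raises a `KeyError`.

```python
import torch
import torch.nn.functional as F
from string import printable

stoi = {ch: i for i, ch in enumerate(printable)}  # character -> index


def one_hot_encode(text):
    """Return a float tensor of shape (len(text), vocab_size) with a single 1 per row."""
    indices = torch.tensor([stoi[ch] for ch in text], dtype=torch.long)
    # F.one_hot returns an integer tensor, so cast to float for use as model input
    return F.one_hot(indices, num_classes=len(stoi)).float()


encoded = one_hot_encode("hello")
print(encoded.shape)  # torch.Size([5, 100])
```

Each row corresponds to one character of the input, so a sequence of length `n` becomes an `(n, vocab_size)` tensor.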

### Model architecture

You can use a simple LSTM/GRU based architecture. Make sure the sequence length is long enough to capture at least moderate-length dependencies and the general structure of the data. The last layer of the architecture will have `len(itos)` nodes, one per character in the vocabulary. This is because you are trying to predict the next character given a sequence of characters, so the output of the last layer should represent a probability distribution over all the characters in your vocabulary.
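A minimal sketch of such an architecture is shown below, assuming one-hot input and a vocabulary of 100 characters. The class name `CharLSTM`, the hidden size of 128 and the dummy batch dimensions are illustrative assumptions, not recommended settings.

```python
import torch
import torch.nn as nn


class CharLSTM(nn.Module):
    """Hypothetical minimal model: one-hot input -> LSTM -> linear layer over the vocabulary."""

    def __init__(self, vocab_size, hidden_size=128, num_layers=1):
        super().__init__()
        self.lstm = nn.LSTM(vocab_size, hidden_size, num_layers, batch_first=True)
        # One output node per character in the vocabulary, i.e. len(itos)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden=None):
        # x: (batch, seq_len, vocab_size) one-hot vectors
        out, hidden = self.lstm(x, hidden)
        # logits: (batch, seq_len, vocab_size), one next-character prediction per position
        return self.fc(out), hidden


model = CharLSTM(vocab_size=100)
x = torch.zeros(4, 50, 100)  # dummy batch: 4 sequences of 50 characters
logits, _ = model(x)
probs = torch.softmax(logits, dim=-1)  # probability distribution over the vocabulary
print(logits.shape)  # torch.Size([4, 50, 100])
```

Note that `nn.CrossEntropyLoss` expects the raw logits rather than probabilities, so in this sketch the softmax would only be applied when sampling characters from the trained model.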

### If all else fails

You can read my [blog post](https://aniketsanap.github.io/Character-Level-Language-Model/) on how to build this project.