# Language models

### What is a language model?

A language model is a probability distribution over sequences of tokens. These tokens can be words, characters, or even longer units. Using deep learning, we will build a model that predicts the next token given a sequence of tokens as input.
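
To make "a probability distribution over the next token" concrete, here is a minimal sketch that estimates it from raw counts instead of a neural network. The function name `bigram_distribution` and the toy corpus are illustrative assumptions, and conditioning on only the previous token is a deliberate simplification of the deep learning model described above.

```python
from collections import Counter, defaultdict

def bigram_distribution(tokens):
    """Estimate P(next token | current token) from raw bigram counts."""
    counts = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        counts[current][nxt] += 1
    # Normalise each row of counts so the probabilities sum to 1
    return {
        current: {tok: c / sum(following.values()) for tok, c in following.items()}
        for current, following in counts.items()
    }

# Hypothetical toy corpus, split on whitespace
toy_corpus = "the cat sat on the mat".split()
probs = bigram_distribution(toy_corpus)
print(probs["the"])  # {'cat': 0.5, 'mat': 0.5}
```

A neural language model plays the same role as `probs`, but it conditions on the whole preceding sequence and can generalise to contexts it has never seen.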

### Why is language modelling important?

Computer vision had its big boom in the early 2010s with the ImageNet Large Scale Visual Recognition Challenge, where deep CNN architectures proved to be hands down the best approach to vision problems. With transfer learning, these models could be reused for a variety of tasks and applications, which led to rapid progress in the field.

In Natural Language Processing, transfer learning is commonly done by building a language model first and then modifying it to suit our use case. This makes sense because human language is very complex, and there is no single correct way to write or say anything. When we train a model directly on a language-specific task, it has no prior information about the structure or nuances of human language. We cannot expect a model which has only just been introduced to English to perform well on such a task without understanding the language first. Hence, we train a language model on a large dataset (like Wikipedia) and then use this model, which already has an understanding of the language, on our task.

### How does it work?

#### Vocabulary

Every language model needs a vocabulary. The vocabulary consists of all the tokens which can appear in your input and which can be predicted as output. For a character-level language model, the vocabulary consists of every distinct character in your dataset. For a general range of characters you can use `from string import printable`. `printable` contains every printable ASCII character (digits, letters, punctuation, and whitespace), which covers most plain English text; text with accented letters or other non-ASCII characters needs additional entries. For a word-level model, your vocabulary consists of all the different words in your dataset. Word-level models suffer from the sparsity problem: many words in the corpus occur only once or twice. Hence, we generally limit the size of our vocabulary to the `n` most frequently used words in the corpus. Your vocabulary will be structured in the following way:

1. stoi:<br>
This is a mapping of the input token to its corresponding index. This mapping is necessary as we cannot pass strings as input to our neural network. This mapping is generally in the form of a dictionary with the keys as the tokens and the values as the indices. 

```python
stoi = {
    'token0': 0,
    'token1': 1,
    'token2': 2,
    # ...
}
```

2. itos:<br>
This is the reverse mapping from indices to their corresponding tokens. Either a list or a dictionary works for itos; with a list of tokens, each token's index is implied by its position.
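
As a rough sketch, both mappings can be built from a tokenised corpus as shown here. The helper name `build_vocab`, the `max_size` parameter, and the `<unk>` placeholder for out-of-vocabulary words are assumptions made for this example, not something prescribed above.

```python
from collections import Counter

def build_vocab(tokens, max_size, unk_token="<unk>"):
    """Keep the `max_size` most frequent tokens, plus one slot for unknown tokens."""
    most_common = [tok for tok, _ in Counter(tokens).most_common(max_size)]
    itos = [unk_token] + most_common  # a list: the position is the index
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos

# Hypothetical toy corpus, split on whitespace
tokens = "the cat sat on the mat the end".split()
stoi, itos = build_vocab(tokens, max_size=3)

# Encode, sending tokens outside the vocabulary to the unknown index
ids = [stoi.get(tok, stoi["<unk>"]) for tok in ["the", "dog"]]
# Decode back to tokens by indexing into the list
decoded = [itos[i] for i in ids]
```

Reserving an index for unknown tokens is one common way to handle words that fall outside the `n` most frequent; without it, encoding a rare word would raise a `KeyError`.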

#### Input format

The input to the model should be a sequence of tokens. The label should be the same sequence shifted ahead by one token. For example:

```python
input_ = ['My', 'favourite', 'colour']
label = ['favourite', 'colour', 'is']
```
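
One way to produce such pairs from a longer token list is to take two slices offset by one position. This is a minimal sketch; the helper name `make_example`, its `start` and `seq_len` parameters, and the toy sentence are assumptions for illustration.

```python
def make_example(tokens, start, seq_len):
    """Return (input, label), where label is the input shifted ahead by one token."""
    input_ = tokens[start : start + seq_len]
    label = tokens[start + 1 : start + seq_len + 1]
    return input_, label

# Hypothetical toy sentence
tokens = ['My', 'favourite', 'colour', 'is', 'blue']
input_, label = make_example(tokens, start=0, seq_len=3)
# input_ is ['My', 'favourite', 'colour']
# label  is ['favourite', 'colour', 'is']
```

At every position the model sees the input up to that token and is trained to predict the label at the same position, so one sequence of length `seq_len` yields `seq_len` training targets.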


### State of the art language models

[OpenAI's GPT-2](https://openai.com/blog/better-language-models/) <br>
This language model uses a large Transformer-based architecture, which processes the tokens of a sequence in parallel, and has been trained for a long time on a tremendous amount of data. Its output can be fluent enough to convince a reader that the content was written by a human rather than a neural network. You can play around with this model yourself [here](https://openai.com/blog/better-language-models/).