# Word-level language model

Make sure you have finished implementing the character-level language model before you proceed!

Now that you have some experience with LSTMs and language models, you will attempt to build a word-level language model. We passed one-hot encoded vectors directly as input to our character model, but we need to do better than that for a word-level model. A one-hot vector for a word would be as long as the whole vocabulary, and it treats every pair of words as equally unrelated, even though a word carries far more information than a single character does.

This is why we use word embeddings. Word embeddings are dense vector representations of the input words. These vectors can capture a lot of semantic information that would be lost if we one-hot encoded the words. Many pretrained word embeddings are available, created with different algorithms such as Word2Vec, GloVe, CoVe and ELMo. You can use any of these or simply add your own `nn.Embedding` layer to your model.

The `nn.Embedding` layer acts as a trainable lookup table. It contains `len(vocab)` rows and `embedding_size` columns: one row per word in the vocabulary, and each word index selects its row.
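To make the lookup-table idea concrete, here is a minimal sketch of an `nn.Embedding` layer. The sizes `vocab_size = 10` and `embedding_size = 4` are made-up values for illustration only; in your model the first would be `len(vocab)`.

```python
import torch
import torch.nn as nn

vocab_size = 10      # hypothetical stand-in for len(vocab)
embedding_size = 4   # hypothetical embedding dimension

embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_size)

# The trainable lookup table: one row per word, one column per embedding dimension
weight_shape = tuple(embedding.weight.shape)

# A batch of 2 sequences, each made of 3 word indices
token_ids = torch.tensor([[1, 2, 3], [4, 5, 6]])
vectors = embedding(token_ids)
output_shape = tuple(vectors.shape)  # one embedding vector per input index

# Looking up index 1 returns row 1 of the weight matrix
same_row = torch.equal(vectors[0, 0], embedding.weight[1])

print(weight_shape, output_shape, same_row)
```

The input holds integer word indices rather than one-hot vectors, and the output gains one trailing dimension of size `embedding_size`.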

### Using torchtext

Text preprocessing can be very tedious and take a lot of time. `torchtext` does a lot of the heavy lifting you would otherwise have to do yourself and provides several iterators, tokenizers and datasets. 

#### The wiki dataset

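To train a language model, you can use the WikiText-2 dataset, which `torchtext` can download for you. The `"spacy"` tokenizer used below requires the `spacy` package and its English model to be installed.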

```python
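import torch
import torchtext
from torchtext.data.utils import get_tokenizer
# Field and BPTTIterator are part of torchtext's legacy API: this code targets
# torchtext < 0.9. From 0.9 to 0.11 the same classes live under
# `torchtext.legacy`, and they were removed in 0.12.
from torchtext import data

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Define how the raw text is tokenized and preprocessed
TEXT = data.Field(
    tokenize=get_tokenizer("spacy"),
    lower=True
)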

# Download and split the data
train_txt, val_txt, test_txt = torchtext.datasets.WikiText2.splits(TEXT)

# Building the vocabulary
TEXT.build_vocab(train_txt)

# To iterate through data arranged for a language model
train_iter, valid_iter, test_iter = data.BPTTIterator.splits(
    (train_txt, val_txt, test_txt),
    batch_size=32,
    bptt_len=30,  # this is where we specify the sequence length
    device=device,
    repeat=False
)
```
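
Each batch produced by `BPTTIterator` exposes `batch.text` and `batch.target`, both of shape `(bptt_len, batch_size)`, where the target is the text shifted by one token. The sketch below shows one way a word-level model could consume such a batch. The class name `WordLM`, the layer sizes and the vocabulary size are all hypothetical, and random tensors stand in for a real batch so that the example runs without downloading the dataset.

```python
import torch
import torch.nn as nn


class WordLM(nn.Module):
    """Hypothetical minimal word-level model: embedding -> LSTM -> linear decoder."""

    def __init__(self, vocab_size, embedding_size=64, hidden_size=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        # nn.LSTM expects (seq_len, batch, features) by default,
        # which matches the (bptt_len, batch_size) layout of the batches
        self.lstm = nn.LSTM(embedding_size, hidden_size)
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def forward(self, text, hidden=None):
        embedded = self.embedding(text)               # (seq_len, batch, embedding_size)
        output, hidden = self.lstm(embedded, hidden)  # (seq_len, batch, hidden_size)
        logits = self.decoder(output)                 # (seq_len, batch, vocab_size)
        return logits, hidden


vocab_size = 1000             # assumed stand-in for len(TEXT.vocab)
bptt_len, batch_size = 30, 32  # same values as in the iterator above

model = WordLM(vocab_size)
criterion = nn.CrossEntropyLoss()

# Random stand-ins for batch.text and batch.target
text = torch.randint(0, vocab_size, (bptt_len, batch_size))
target = torch.randint(0, vocab_size, (bptt_len, batch_size))

logits, hidden = model(text)

# CrossEntropyLoss expects (N, num_classes) scores and (N,) class indices,
# so the sequence and batch dimensions are flattened together
loss = criterion(logits.reshape(-1, vocab_size), target.reshape(-1))
loss.backward()

logits_shape = tuple(logits.shape)
print(logits_shape, loss.item())
```

In a real training loop you would replace the random tensors with `batch.text` and `batch.target` from `train_iter`, move the model to `device`, and step an optimizer after `loss.backward()`.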