For this task, you are going to train a name generator! Ever wondered where **Gwyneth** Paltrow got her name, or how Hawaiian woman Janice **Keihanaikukauakahihuliheekahaunaele** came to be? Probably from this very generator.

Let's get going!

### Overall pipeline

- **Training**

  You must implement a model that, given the beginning of a word, predicts the next letter.

  For example, if the model `M` is trained on the word `"house"`, the trained model should output `M("h") = "o"`, `M("o") = "u"`, and so on.

  Of course, the model will be trained on an entire dataset of words, in which the same letter is followed by many different letters. Therefore, the output of `M("h")` should *not* be deterministic, meaning that ***each time you run the trained model, you may get a different result***.

- **Inference**

  At test time, the model generates only one new letter, given an input letter.

  However, you can still generate entire words with an approach that is as simple as it is weird-sounding, called *ancestral sampling*:

  1. Select an initial letter, say `"b"`.
  2. Predict the next letter, say `M("b") = "r"`.
  3. Concatenate: `"br"`.
  4. Iterate: go back to step 2, and predict `M("r") = "e"`.
  5. Concatenate: `"bre"`.
  6. Keep going, and you might eventually get the word `"bread"`.

- **Delimiters**

  Of course, ancestral sampling will keep generating *ad infinitum*.

  To avoid this, you also want to train the model to *end* the generation.

  This is easily done: simply take your training data and augment it with beginning / end delimiters:

  `"house" -> ".house."`

  We chose `"."` here, but you can choose your own, as long as it never appears inside a name.

  This way, you can ask the model to create a new word completely from scratch, by simply starting inference with `M(".") = ...`. And you can stop generating whenever you get `M(...) = "."`.
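
To make the pipeline above concrete, here is a minimal sketch of ancestral sampling with delimiters. It is not the model you are asked to build: as a stand-in for `M`, it just counts which letter follows which in a tiny, made-up word list (`words`) and samples in proportion to those counts. The helper `generate` and the `max_len` safety cap are also assumptions made for this illustration.

```python
import random
from collections import Counter, defaultdict

# Toy training set (made up for this sketch); the real data comes from the names file.
words = ["house", "horse", "mouse"]

# Count how often each character follows another, using "." as the delimiter.
counts = defaultdict(Counter)
for w in words:
    padded = "." + w + "."
    for ch_in, ch_out in zip(padded, padded[1:]):
        counts[ch_in][ch_out] += 1


def M(ch_in, rng):
    """Sample the next character in proportion to the observed counts."""
    options = counts[ch_in]
    return rng.choices(list(options.keys()), weights=list(options.values()), k=1)[0]


def generate(rng, max_len=20):
    """Start from the delimiter and keep sampling until the delimiter comes back."""
    out = []
    ch = "."
    for _ in range(max_len):  # safety cap, in case "." is not sampled for a long time
        ch = M(ch, rng)
        if ch == ".":
            break
        out.append(ch)
    return "".join(out)


rng = random.Random(0)
samples = [generate(rng) for _ in range(5)]
print(samples)
```

Because `M` samples instead of always picking the most frequent successor, repeated calls to `generate` can return different words, which is exactly the non-determinism described above.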

### What you'll need

- A list of names: your training data (*see human_names.txt*)
- A way of encoding names to numbers, and vice versa
- A model that generates new names (duh)

In [None]:
import torch  # let's do this here, to break the wall of text
torch.manual_seed(42)

### Training data

Download the files *human_names.txt* and *pokemon_names.txt* from the course GitHub page.

From the sidebar on the left, click on the folder icon at the very bottom.

Drag the *human_names.txt* and *pokemon_names.txt* files into the folder and wait for the upload to complete.

In [None]:
# read one name per line; assumes the file is UTF-8 encoded
with open('human_names.txt', 'r', encoding='utf-8') as f:
    names = f.read().splitlines()

len(names)  # should be 32033 for human names, 1302 for Pokémon names

### Name encoding and decoding

Your trained model must be able to digest and process text characters. You'll do this by writing a simple encoder/decoder such that:

```"hello" -> [13, 2, 5, 5, 7]```

...and of course, the opposite direction.

The specific numbers are not important, but you need your `encode` and `decode` functions to behave correctly:

```decode(encode(s)) == s```

for any string `s`.
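
If you are unsure where to start, one common approach is a pair of lookup tables built from the characters that appear in the data. The sketch below assumes a tiny, made-up `names` list and hypothetical table names `stoi` and `itos`; it is one possible shape, not the expected solution. Note that with this approach the round trip only holds for strings made of characters in the vocabulary.

```python
# Toy data, made up for this sketch; in the notebook this would be the `names` list.
names = ["emma", "olivia", "ava"]

# Vocabulary: the delimiter "." plus every character seen in the data.
chars = ["."] + sorted(set("".join(names)))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer
itos = {i: ch for ch, i in stoi.items()}      # integer -> character


def encode(s):
    """Turn a string into a list of integer indices."""
    return [stoi[ch] for ch in s]


def decode(ids):
    """Turn a list of integer indices back into a string."""
    return "".join(itos[i] for i in ids)


print(encode(".ava."))          # [0, 1, 7, 1, 0]
print(decode(encode(".ava.")))  # .ava.
```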

In [None]:
# do it here

### Dataset and data loaders

The training data should simply be a bunch of pairs `(char_in, char_out)`, since you want `M(char_in) = char_out`.
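
As a quick illustration of what these pairs look like, here is a small sketch on a made-up two-name list, using `"."` as the delimiter. The variable names are placeholders.

```python
names = ["ava", "emma"]  # toy data, made up for this sketch

pairs = []
for name in names:
    padded = "." + name + "."
    # every character is paired with the one that follows it
    for ch_in, ch_out in zip(padded, padded[1:]):
        pairs.append((ch_in, ch_out))

print(pairs[:4])  # [('.', 'a'), ('a', 'v'), ('v', 'a'), ('a', '.')]
```

Notice that `"a"` appears as an input with two different targets (`"v"` and `"."`): this is why the model has to output a distribution rather than a single letter.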

In [None]:
# prepare the data here: this will be wrapped in a Dataset in the next cell.

Create your `NGramDataset` class and `DataLoader`s for training, validation and test:
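
One possible shape for this, as a rough sketch under a few assumptions: the word list is a toy one, the pairs are stored as index tensors, and the 80/10/10 split, batch size and seed are arbitrary choices. Only the class name `NGramDataset` comes from the text above.

```python
import torch
from torch.utils.data import Dataset, DataLoader, random_split


class NGramDataset(Dataset):
    """Holds (input index, target index) pairs as tensors."""

    def __init__(self, words, stoi):
        xs, ys = [], []
        for w in words:
            padded = "." + w + "."
            for ch_in, ch_out in zip(padded, padded[1:]):
                xs.append(stoi[ch_in])
                ys.append(stoi[ch_out])
        self.x = torch.tensor(xs)
        self.y = torch.tensor(ys)

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]


words = ["emma", "olivia", "ava", "isabella", "sophia"]  # toy data
chars = ["."] + sorted(set("".join(words)))
stoi = {ch: i for i, ch in enumerate(chars)}

dataset = NGramDataset(words, stoi)

# Arbitrary 80/10/10 split; the remainder goes to the test set.
n_train = int(0.8 * len(dataset))
n_val = int(0.1 * len(dataset))
n_test = len(dataset) - n_train - n_val
generator = torch.Generator().manual_seed(42)
train_set, val_set, test_set = random_split(
    dataset, [n_train, n_val, n_test], generator=generator
)

train_loader = DataLoader(train_set, batch_size=8, shuffle=True)
val_loader = DataLoader(val_set, batch_size=8)
test_loader = DataLoader(test_set, batch_size=8)
```

This sketch splits at the level of pairs. Splitting the list of names first, and building one dataset per split, is an equally reasonable design and keeps all pairs of a name in the same split.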

In [None]:
# here

### Model

Now create your model!
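
The architecture is up to you. As a starting point, here is a minimal sketch of one option: an embedding followed by a small MLP that outputs one score (logit) per character of the vocabulary. The class name `CharModel`, the layer sizes, and the vocabulary size of 27 are all assumptions made for this illustration.

```python
import torch
import torch.nn as nn


class CharModel(nn.Module):
    """Maps a batch of character indices to logits over the next character."""

    def __init__(self, vocab_size, emb_dim=16, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Embedding(vocab_size, emb_dim),
            nn.Linear(emb_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, vocab_size),
        )

    def forward(self, x):
        return self.net(x)  # shape: (batch, vocab_size)


vocab_size = 27  # assumption: 26 lowercase letters plus the "." delimiter
model = CharModel(vocab_size)

x = torch.tensor([0, 5, 13])           # three arbitrary input indices
logits = model(x)                      # unnormalized scores
probs = torch.softmax(logits, dim=-1)  # one distribution per input

# Sampling (instead of taking the argmax) keeps the output non-deterministic.
next_idx = torch.multinomial(probs, num_samples=1)
print(logits.shape, next_idx.shape)
```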

In [None]:
# here

### Training

Time to train!
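
For reference, a bare-bones training loop might look like the sketch below. It makes several simplifying assumptions: a toy word list, a deliberately tiny stand-in model, full-batch updates instead of a `DataLoader`, and arbitrary hyperparameters (Adam, learning rate `0.05`, 100 steps). Next-letter prediction is treated as classification, so the loss is cross-entropy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(42)

# Toy data and vocabulary, made up for this sketch.
words = ["emma", "olivia", "ava"]
chars = ["."] + sorted(set("".join(words)))
stoi = {ch: i for i, ch in enumerate(chars)}

# Build the (input, target) index tensors.
xs, ys = [], []
for w in words:
    padded = "." + w + "."
    for ch_in, ch_out in zip(padded, padded[1:]):
        xs.append(stoi[ch_in])
        ys.append(stoi[ch_out])
x = torch.tensor(xs)
y = torch.tensor(ys)

# Tiny stand-in model: embedding -> logits over the vocabulary.
model = nn.Sequential(nn.Embedding(len(chars), 8), nn.Linear(8, len(chars)))
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)

losses = []
for step in range(100):
    logits = model(x)                  # (num_pairs, vocab_size)
    loss = F.cross_entropy(logits, y)  # targets are class indices
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

Don't expect the loss to reach zero: the same input letter has several valid successors in the data, so the best the model can do is match their frequencies.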

In [None]:
# here

### Inference

Test your model and generate new crazy names!

In [None]:
# here

### Larger context

Now that you have a working basic model, generalize it so that it can take as input *more than one character*; this should improve the quality of the generated names!

For example, using a context length of `3`:

`M(".an") = "t"`

...may eventually get you `"anthony"`, because the inference steps see a longer context that can better condition the generation.

In contrast, with our current context length of `1`:

`M(".") = "a"`

...the generation may drift and produce unrealistic names like `"axyzyll"`.
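
On the data side, a longer context only changes how the training examples are cut out of each name. The sketch below shows one way to do it, left-padding with `"."` so that the first letters of a name also get a full-length context. The helper name `make_examples` and the padding scheme are assumptions, not requirements.

```python
def make_examples(name, context_len=3):
    """Return (context, next_char) pairs, left-padding the context with "."."""
    padded = "." * context_len + name + "."
    examples = []
    for i in range(len(name) + 1):
        context = padded[i:i + context_len]  # the last `context_len` characters seen
        target = padded[i + context_len]     # the character to predict
        examples.append((context, target))
    return examples


for context, target in make_examples("anthony"):
    print(f'M("{context}") = "{target}"')
```

On the model side, each input then has `context_len` indices instead of one. If you use an embedding layer, a simple option is to flatten the `context_len` embeddings into a single vector before the first linear layer.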

In [None]:
# here

### Pokémon names

We managed to generate these artificial Pokémon names:
- beakwily
- mordar
- dortytel
- bymanlona
- amlozinder

Can you do better than us? 😇

Download the *pokemon_names.txt* file from the course webpage and try to come up with an appropriate solution to beat our names!

Go catch them all!

In [None]:
# here