[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/baggiponte/makemore/blob/main/notebooks/01-data-processing.ipynb)


# Making a computer understand words

To make a neural network learn language, we need to perform two transformations:

1. Represent words as numbers: computers do not understand strings, so we need to transform them into numbers that we can perform operations on.
2. Find a way to represent the *meaning* of a word as a number: for example, a `seal` can be something that keeps something else closed, but also a very cute animal:

![seal](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fwww.nrcm.org%2Fwp-content%2Fuploads%2F2019%2F04%2FSeals-taken-from-Kayak-on-Taunton-Bay-in-Hancock-4.jpg&f=1&nofb=1&ipt=16b91e07919db7caec5c308599acaae538cc4c4305133456ca3f1e581ea1cc1f&ipo=images)

Here, we will do the first part. It turns out we can leave the second one to the computer itself!

# Our data

In [None]:
try:
    from makemore.datasets import fetch_names
except ModuleNotFoundError:
    !pip install --quiet -- makemore
    from makemore.datasets import fetch_names

names = fetch_names(shuffle=True, seed=42)

`names` is a special object that I created, called `NamesDataset`.

In [None]:
type(names)

Since I created the object, I added some special behaviour. For example, when we print it, we get this:

In [None]:
print(names)

It's nice to see that our dataset contains almost 30_000 names. But, under the hood, this is a glorified `list`, so don't worry: we can still use `[]` to access its elements.

In [None]:
first_name = names[0]

print(first_name)
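
To get a feel for what "a glorified `list`" could mean, here is a minimal sketch. It is not the actual `makemore` implementation: `TinyNamesDataset` and `toy_names` are hypothetical names made up for this illustration, and the three names are toy data.

```python
class TinyNamesDataset(list):
    """Hypothetical stand-in: a plain list with a friendlier printout."""

    def __str__(self) -> str:
        # print() calls str(), which ends up here.
        return f"TinyNamesDataset with {len(self)} names"


toy_names = TinyNamesDataset(["emma", "olivia", "ava"])

print(toy_names)     # uses the custom __str__ defined above
print(toy_names[0])  # indexing with [] is inherited from list
```

Because the class inherits from `list`, everything you already know about lists (indexing, `len`, `for` loops) keeps working.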

# Now it's your turn

To make these names understandable by a deep neural network, we can simply replace every letter with its position in the alphabet. For example, `ada` becomes `[1, 4, 1]` and `emma` becomes `[5, 13, 13, 1]`. Step by step, you will apply what you saw in the first notebook to solve this data processing task!

To make things easier, inside `makemore` we can find some utilities, like `character_to_int`:

In [None]:
from makemore.utils import character_to_int

character_to_int("a")
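
If you are curious how such a helper might work, here is one possible sketch. It assumes that `a` maps to 1, as the `ada` example above suggests; `letter_position` is a hypothetical name and may differ from how `character_to_int` is really written.

```python
def letter_position(character: str) -> int:
    """Hypothetical helper: 1-based position of a lowercase English letter."""
    # ord() gives the character's code point; "a" is the reference point.
    return ord(character) - ord("a") + 1


print(letter_position("a"))  # 1
print(letter_position("d"))  # 4
```

This sketch only handles lowercase English letters; anything else would produce a number outside 1 to 26.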

1. Map every character in `'dannika'` to its position in the (English) alphabet, like this:

```python
character_to_int("d")
character_to_int("a")
```

And so on.

2. What happens if we call `character_to_int` on an element of a name? Recall that every string is a sequence of characters, and that you can get the $i$th element by using the `[]` operator, like this:

```python
first_name[0]
```

Doing this by hand is pretty tedious. Since we are running the same operation over and over, maybe we can use a `for` loop.

3. What happens if you run the following cell?

In [None]:
for letter in first_name:
    print(letter)

4. Maybe you can combine the `character_to_int` function with this `for` loop to solve exercise 1 faster...

5. Now we can print all the numbers that correspond to a letter, but we need to store them inside a suitable container, like a `list`. Write some code that appends every integer to a list inside a `for` loop. Recall you can create an empty list with `numbers = []` and call `numbers.append(...)` to append an element.
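
As a reminder of the pattern, here is a small sketch on a different task, so the exercise is still yours to solve. The variables `words` and `lengths` are made up for this illustration.

```python
words = ["ada", "emma", "maria"]

lengths = []  # start with an empty list
for word in words:
    # add one element per iteration
    lengths.append(len(word))

print(lengths)  # [3, 4, 5]
```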

Congratulations! The easy part is over. Now you can transform a string into a list of integers. For example, `anna` becomes `[1, 14, 14, 1]`.

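To train the neural network, we need to feed it inputs shaped like this, where each group of three consecutive integers is paired with the integer that follows it:

```
# "anna" -> [1, 14, 14, 1]
[1, 14, 14] [1]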

# "maria" -> [13, 1, 18, 9, 1]
[13, 1, 18] [9]
[1, 18, 9] [1]
```

You will need to use a `for` loop. A hint: you can use the function `len` to obtain the length of an item and `range(x)` to generate a sequence of numbers from 0 up to, but not including, `x`:

In [None]:
name = "maria"

print(len(name))

for i in range(len(name)):
    print(i)
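
One more hint, sketched under the assumption that each input holds three integers, as in the `maria` example above: list slicing lets you cut out a piece of a list. The names `context` and `target` are made up for this illustration.

```python
numbers = [13, 1, 18, 9, 1]  # "maria"

context = numbers[0:3]  # elements at positions 0, 1 and 2
target = numbers[3]     # the element right after the slice

print(context, target)  # [13, 1, 18] 9
```

Here the start of the slice is fixed at 0. Think about what should change if the start came from a `for` loop instead.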