This is an excerpt from chapter 4 of Deep Learning with PyTorch, on converting text into tensors.

### Approach

There are two particularly intuitive levels at which networks operate on text: at the
character level, by processing one character at a time, and at the word level, where
individual words are the finest-grained entities to be seen by the network. The technique with which we encode text information into tensor form is the same whether we
operate at the character level or the word level. And it’s not magic, either. We stumbled upon it earlier: one-hot encoding.

### Pride and Prejudice

 Let’s start with a character-level example. First, let’s get some text to process. An
amazing resource here is Project Gutenberg (www.gutenberg.org), a volunteer effort
to digitize and archive cultural work and make it available for free in open formats,
including plain text files. If we’re aiming at larger-scale corpora, the Wikipedia corpus
stands out: it’s the complete collection of Wikipedia articles, containing 1.9 billion
words and more than 4.4 million articles. Several other corpora can be found at the
English Corpora website (www.english-corpora.org).
 Let’s load Jane Austen’s Pride and Prejudice from the Project Gutenberg website:
www.gutenberg.org/files/1342/1342-0.txt. We’ll just save the file and read it in
(code/p1ch4/5_text_jane_austen.ipynb).

In [2]:
# getting the text file
!wget "www.gutenberg.org/files/1342/1342-0.txt" 

--2021-01-31 06:13:55--  http://www.gutenberg.org/files/1342/1342-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 799738 (781K) [text/plain]
Saving to: ‘1342-0.txt’


2021-01-31 06:13:57 (609 KB/s) - ‘1342-0.txt’ saved [799738/799738]



In [3]:
with open('./1342-0.txt', encoding='utf8') as f:
    text = f.read()

### One-hot encoding of characters

There’s one more detail we need to take care of before we proceed: encoding. This is
a pretty vast subject, and we will just touch on it. Every written character is represented
by a code: a sequence of bits of appropriate length so that each character can be
uniquely identified. The simplest such encoding is ASCII (American Standard Code
for Information Interchange), which dates back to the 1960s. ASCII encodes 128 characters using 128 integers. For instance, the letter a corresponds to binary 1100001 or
decimal 97, the letter b to binary 1100010 or decimal 98, and so on. The encoding fits
8 bits, which was a big bonus in 1965.
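
The correspondence between characters and integer codes can be checked directly with Python's built-in `ord`, `chr`, and `format`. This is a minimal sketch; the variable names are illustrative and not part of the notebook.

```python
# Inspect the ASCII codes of a few characters using Python built-ins.
for ch in "ab":
    code = ord(ch)                      # integer code of the character
    print(ch, code, format(code, "b"))  # decimal and binary forms

ascii_codes = {ch: ord(ch) for ch in "ab"}
binary_a = format(ord("a"), "b")        # binary string for the letter a
round_trip = chr(97)                    # back from code to character

# Every character of a plain English phrase has a code below 128.
fits_in_ascii = all(ord(c) < 128 for c in "Michaelmas, and some")
```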

At this point, we need to parse through the characters in the text and provide a
one-hot encoding for each of them. Each character will be represented by a vector of
length equal to the number of different characters in the encoding. This vector will
contain all zeros except a one at the index corresponding to the location of the character in the encoding.
 We first split our text into a list of lines and pick an arbitrary line to focus on:

In [4]:
text[:10]

'\ufeff\nThe Proj'

In [6]:
lines = text.split('\n')
line = lines[200]
line

'      Michaelmas, and some of his servants are to be in the house by'

Next, we create a one-hot encoding for the whole line:



In [8]:
import torch

In [9]:
letter_t = torch.zeros(len(line), 128)
letter_t.shape

torch.Size([68, 128])

Note that letter_t holds a one-hot-encoded character per row. Now we just have to
set a one on each row in the correct position so that each row represents the correct
character. The index where the one has to be set corresponds to the index of the character in the encoding:

In [10]:
# strip() drops the leading spaces, so row i refers to the stripped line
for i, letter in enumerate(line.lower().strip()):
    # characters outside the 128 ASCII codes fall back to index 0
    letter_index = ord(letter) if ord(letter) < 128 else 0
    letter_t[i][letter_index] = 1

In [17]:
# the loop ran over the stripped, lowercased line, so row 10 encodes its
# character 10: the comma (ASCII 44), not line[10]
letter_t[10], line.lower().strip()[10]

(tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0.]),
 ',')
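
The explicit loop can also be written without Python-level indexing. The sketch below is an alternative, not the notebook's method: it assumes `torch.nn.functional.one_hot` is acceptable here and uses a short stand-in string (`sample`, a hypothetical name) in place of the line from the book.

```python
import torch
import torch.nn.functional as F

sample = "michaelmas, and"  # stand-in for line.lower().strip()

# Map each character to its ASCII code; anything outside 0..127 goes to
# index 0, mirroring the loop above.
indices = torch.tensor([ord(c) if ord(c) < 128 else 0 for c in sample])

# one_hot returns an integer tensor of shape (len(sample), 128);
# convert to float to match a tensor created with torch.zeros.
sample_t = F.one_hot(indices, num_classes=128).float()

print(sample_t.shape)                 # one row per character
print(sample_t[10].argmax().item())   # ASCII code of character 10
```

Each row still holds a single one, so the result should be interchangeable with the loop-built tensor for the same input.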

### One-hot encoding of whole words

We have one-hot encoded our sentence into a representation that a neural network
could digest. Word-level encoding can be done the same way by establishing a vocabulary and one-hot encoding sentences—sequences of words—along the rows of our
tensor. Since a vocabulary has many words, this will produce very wide encoded vectors, which may not be practical. We will see in the next section that there is a more
efficient way to represent text at the word level, using embeddings. For now, let’s stick
with one-hot encodings and see what happens.
 We’ll define clean_words, which takes text and returns it in lowercase and
stripped of punctuation. 

In [28]:
def clean_words(input_str):
    punctuation = './;:",!?_$*-()'
    wordlist = input_str.lower().replace('\n', ' ').split()
    wordlist = [word.strip(punctuation) for word in wordlist]
    return wordlist
    

In [29]:
words_in_line = clean_words(line)
line, words_in_line

('      Michaelmas, and some of his servants are to be in the house by',
 ['michaelmas',
  'and',
  'some',
  'of',
  'his',
  'servants',
  'are',
  'to',
  'be',
  'in',
  'the',
  'house',
  'by'])

Next, we build a mapping of words to indexes in our encoding:

In [30]:
wordlist = sorted(set(clean_words(text)))
wordlist[1000:1010]

['brightest',
 'brighton',
 'brighton!”',
 'brighton?”',
 'brilliancy',
 'bring',
 'bringing',
 'brings',
 'bring—good',
 'brink']
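
Entries such as 'brighton!”' show that the curly quotes and dashes used in this edition are not in the punctuation string, so they survive cleaning. One possible refinement, offered as a sketch and not as the notebook's code, is to strip curly quotes as well and to split on the em dash. `clean_words_extended` is a hypothetical name, and applying it to the whole text would change the vocabulary size and the indexes reported in the rest of this section.

```python
def clean_words_extended(input_str):
    # Same ASCII punctuation as clean_words, plus curly quotes (assumption).
    punctuation = './;:",!?_$*-()' + '\u201c\u201d\u2018\u2019'
    # Treat the em dash (U+2014) as a word separator (assumption).
    text = input_str.lower().replace('\n', ' ').replace('\u2014', ' ')
    words = [word.strip(punctuation) for word in text.split()]
    # Drop tokens that consisted of punctuation only.
    return [word for word in words if word]

sample = 'Brighton!\u201d she cried. \u201cBring\u2014good news?\u201d'
print(clean_words_extended(sample))
```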

In [31]:
word2index_dict = {word: i for (i, word) in enumerate(wordlist)}

In [32]:
len(word2index_dict), word2index_dict['impossible']

(8450, 3777)

Note that word2index_dict is now a dictionary with words as keys and integers as
values. We will use it to efficiently find the index of a word as we one-hot encode it.
Let’s now focus on our sentence: we break it up into words and one-hot encode it; that
is, we populate a tensor with one one-hot-encoded vector per word. We create a
tensor of zeros and, in each row, set a one at the index of the corresponding word:

In [33]:
word_t = torch.zeros(len(words_in_line), len(word2index_dict))
for i, word in enumerate(words_in_line):
    word_index = word2index_dict[word]
    word_t[i][word_index] = 1
    print('{:2} {:4} {}'.format(i, word_index, word))

 0 4696 michaelmas
 1  436 and
 2 6808 some
 3 5090 of
 4 3588 his
 5 6571 servants
 6  547 are
 7 7375 to
 8  760 be
 9 3804 in
10 7271 the
11 3630 house
12 1051 by


At this point, word_t represents one sentence of length 13 in an encoding space of size
8450, the number of words in our dictionary.

There are several options for splitting text (and for using the embeddings we’ll look at in the next section).
The choice between character-level and word-level encoding leaves us to make a
trade-off. In many languages, there are significantly fewer characters than words: representing characters has us representing just a few classes, while representing words
requires us to represent a very large number of classes and, in any practical application, deal with words that are not in the dictionary. On the other hand, words convey
much more meaning than individual characters, so a representation of words is considerably more informative by itself.
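
Words that are not in the dictionary are commonly handled by reserving one index for unknown words. The following is a minimal sketch under that assumption; the toy vocabulary and the `<unk>` token are hypothetical and do not appear in the notebook.

```python
UNK = '<unk>'  # hypothetical placeholder for out-of-vocabulary words

# Toy vocabulary; in the notebook this role is played by word2index_dict.
vocab = sorted({'and', 'house', 'servants', 'the'})
word2index = {word: i for i, word in enumerate([UNK] + vocab)}

def encode(words):
    # Fall back to the <unk> index when a word is missing.
    return [word2index.get(word, word2index[UNK]) for word in words]

print(encode(['the', 'servants', 'and', 'the', 'zeppelin']))
```

With this scheme every unseen word shares a single row of the encoding, which keeps the encoding size fixed at the cost of discarding the identity of those words.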

### Why text embeddings

One-hot encoding is a very useful technique for representing categorical data in tensors. However, as we have anticipated, one-hot encoding starts to break down when
the number of items to encode is effectively unbounded, as with words in a corpus. In
just one book, we had over 8,000 items!
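
A quick calculation makes the cost concrete. The sketch below counts the values needed to store the 13-word example sentence as one-hot rows over the 8450-word dictionary, and compares that with a dense vector of 100 values per word, the size the text considers next. The variable names are illustrative.

```python
vocab_size = 8450        # words in the dictionary built from the book
sentence_length = 13     # words in the example line
dense_dim = 100          # dense vector size used for comparison

one_hot_values = sentence_length * vocab_size   # mostly zeros
dense_values = sentence_length * dense_dim
nonzero_fraction = sentence_length / one_hot_values

print(one_hot_values, dense_values, nonzero_fraction)
```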

### What are text embeddings

How can we compress our encoding down to a more manageable size and put a cap on the size growth? Well, instead of vectors of many zeros and a single one, we can use vectors of floating-point numbers. A vector of, say, 100 floating-point numbers can
indeed represent a large number of words. The trick is to find an effective way to map
individual words into this 100-dimensional space in a way that facilitates downstream
learning. This is called an embedding.
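
In PyTorch, such a mapping is typically stored as a lookup table of learnable vectors. The sketch below uses `torch.nn.Embedding`, which the excerpt itself does not use. The vocabulary size matches the dictionary built earlier, the word indexes are three taken from the printed list, and the vectors are randomly initialized, so they carry no meaning until they are trained.

```python
import torch

vocab_size = 8450      # number of words in the dictionary built earlier
embedding_dim = 100    # size suggested in the text

# A lookup table with one 100-dimensional row per word.
embedding = torch.nn.Embedding(vocab_size, embedding_dim)

# Indexes of 'michaelmas', 'and', and 'some' from the notebook output.
word_indexes = torch.tensor([4696, 436, 6808])
vectors = embedding(word_indexes)

print(vectors.shape)   # one dense vector per word
```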

In principle, we could simply iterate over our vocabulary and generate a set of 100
random floating-point numbers for each word. This would work, in that we could
cram a very large vocabulary into just 100 numbers, but it would forgo any concept of
distance between words based on meaning or context. A model using this word
embedding would have to deal with very little structure in its input vectors. An ideal
solution would be to generate the embedding in such a way that words used in similar
contexts mapped to nearby regions of the embedding.
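
The lack of structure in a random embedding can be seen with a small experiment. This sketch uses only the standard library; the three words, the Gaussian initialization, and the use of cosine similarity as the distance measure are assumptions made for illustration.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable

def random_embedding(dim=100):
    # One random vector per word, with no relation to meaning.
    return [random.gauss(0.0, 1.0) for _ in range(dim)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

embeddings = {word: random_embedding() for word in ['house', 'home', 'servants']}

# With random vectors, a related pair is expected to score no higher
# than an unrelated one: both similarities should sit close to zero.
sim_related = cosine_similarity(embeddings['house'], embeddings['home'])
sim_unrelated = cosine_similarity(embeddings['house'], embeddings['servants'])
print(round(sim_related, 3), round(sim_unrelated, 3))
```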

### Example

Well, if we were to design a solution to this problem by hand, we might decide to
build our embedding space by choosing to map basic nouns and adjectives along the
axes. We can generate a 2D space where axes map to nouns—fruit (0.0-0.33), flower
(0.33-0.66), and dog (0.66-1.0)—and adjectives—red (0.0-0.2), orange (0.2-0.4), yellow
(0.4-0.6), white (0.6-0.8), and brown (0.8-1.0). Our goal is to take actual fruit, flowers,
and dogs and lay them out in the embedding.
 As we start embedding words, we can map apple to a number in the fruit and red
quadrant. Likewise, we can easily map tangerine, lemon, lychee, and kiwi (to round out
our list of colorful fruits). Then we can start on flowers, and assign rose, poppy, daffodil,
lily, and … Hmm. Not many brown flowers out there. Well, sunflower can get flower, yellow, and brown, and then daisy can get flower, white, and yellow. Perhaps we should
update kiwi to map close to fruit, brown, and green.
 For dogs and color, we can embed redbone near red; uh, fox perhaps for orange;
golden retriever for yellow, poodle for white, and … most kinds of dogs are brown.
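
The hand-built embedding can be sketched in a few lines. Placing each word at the midpoint of its noun and color ranges is an assumption made for illustration, as is reading the flower list in color order (rose as red, daffodil as yellow). The names `noun_axis`, `color_axis`, and `nearest` are hypothetical.

```python
import math

# Midpoints of the ranges given in the text (an illustrative choice).
noun_axis = {'fruit': 0.165, 'flower': 0.495, 'dog': 0.83}
color_axis = {'red': 0.1, 'orange': 0.3, 'yellow': 0.5, 'white': 0.7, 'brown': 0.9}

labels = {
    'apple': ('fruit', 'red'),
    'tangerine': ('fruit', 'orange'),
    'lemon': ('fruit', 'yellow'),
    'rose': ('flower', 'red'),
    'daffodil': ('flower', 'yellow'),
    'redbone': ('dog', 'red'),
    'golden retriever': ('dog', 'yellow'),
    'poodle': ('dog', 'white'),
}
embedding_2d = {w: (noun_axis[n], color_axis[c]) for w, (n, c) in labels.items()}

def nearest(word):
    # Closest other word by Euclidean distance in the 2D space.
    others = (w for w in embedding_2d if w != word)
    return min(others, key=lambda w: math.dist(embedding_2d[word], embedding_2d[w]))

print(nearest('apple'), '|', nearest('golden retriever'))
```

In this layout, neighbors share a noun or a color, which is the kind of structure a random embedding lacks.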