# HW2: Text Generative Models

In this assignment we will see some generative models for text: CharRNN, Transformers and Chatbots. Training text models is very time consuming, and uses a ton of data. The really good models also tend to be very large, so we will stick to pretrained models. Those can still be excellent at generating totally new text!

## Word Embeddings

Embeddings are numeric representations for non-numeric data. In our case we look for embeddings for words. A simple kind of embedding is One-Hot Encoding, where we put a `1` in a vector of all `0`s at the index of the word in the vocabulary.

<img src="https://github.com/tensorflow/docs/blob/master/site/en/tutorials/text/images/one-hot.png?raw=1" width="50%"/>

But that can be very wasteful and also doesn't encode any relationship between the words.
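
To make both problems concrete, here is a minimal sketch of one-hot encoding with NumPy. The five-word vocabulary is made up for illustration; a real vocabulary has hundreds of thousands of entries, so each vector would be that long.

```python
import numpy as np

# Toy vocabulary (an assumption for illustration, not taken from a real corpus).
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's vocabulary index."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("cat"))                   # [0. 1. 0. 0. 0.]
print(one_hot("cat") @ one_hot("mat"))  # 0.0
```

Every pair of distinct words has a dot product of 0, so one-hot vectors say nothing about which words are related.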

To learn semantic relationships, a few unsupervised algorithms were proposed. In class we've discussed Continuous Bag of Words (CBOW) and Skip-Gram. Essentially these hide part of a sentence and ask the model to predict it: CBOW predicts a word from its surrounding context, and Skip-Gram predicts the context from a word. This way the model learns the contexts in which words are used, as well as the relationships between words.
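
As a rough illustration of how the training examples for these two algorithms are built (only the data preparation, not the model or its training), the sketch below slides a window over a tokenized sentence. The function names and the `window` parameter are my own choices for this example.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-Gram: predict each context word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: predict the center word from its surrounding context words."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples

sentence = "the cat sat on the mat".split()
print(skipgram_pairs(sentence, window=1)[:3])
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
print(cbow_examples(sentence, window=1)[1])
# (['the', 'sat'], 'cat')
```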

The embedding for a word is a vector of numbers:

<img src="https://github.com/tensorflow/docs/blob/master/site/en/tutorials/text/images/embedding2.png?raw=1" width="50%" />

Luckily, many of the world leaders in natural language processing have published pretrained word embeddings learned on huge corpora, so we don't have to train them ourselves.

Allison Parrish of NYU showed some very interesting uses of word embeddings for poetry generation: https://www.youtube.com/watch?v=L3D0JEA1Jdc. Breeze through this StrangeLoop talk for inspiration. I encourage you to try these methods in your own generative work.

`chakin` is a helper tool for downloading pretrained embeddings:

In [22]:
!pip install -q chakin progressbar2 textgenrnn

In [1]:
import chakin
import progressbar
import numpy as np

These are the available models:

In [2]:
chakin.search('English')

                   Name  Dimension                     Corpus VocabularySize  \
2          fastText(en)        300                  Wikipedia           2.5M   
11         GloVe.6B.50d         50  Wikipedia+Gigaword 5 (6B)           400K   
12        GloVe.6B.100d        100  Wikipedia+Gigaword 5 (6B)           400K   
13        GloVe.6B.200d        200  Wikipedia+Gigaword 5 (6B)           400K   
14        GloVe.6B.300d        300  Wikipedia+Gigaword 5 (6B)           400K   
15       GloVe.42B.300d        300          Common Crawl(42B)           1.9M   
16      GloVe.840B.300d        300         Common Crawl(840B)           2.2M   
17    GloVe.Twitter.25d         25               Twitter(27B)           1.2M   
18    GloVe.Twitter.50d         50               Twitter(27B)           1.2M   
19   GloVe.Twitter.100d        100               Twitter(27B)           1.2M   
20   GloVe.Twitter.200d        200               Twitter(27B)           1.2M   
21  word2vec.GoogleNews        300      

Let's download GloVe embeddings:

In [3]:
chakin.download(number=11)

Test: 100% ||                                      | Time:  0:06:32   2.1 MiB/s


'./glove.6B.zip'

We only need one file (the smallest dimension one):

In [4]:
!unzip glove.6B.zip glove.6B.50d.txt

'unzip' is not recognized as an internal or external command,
operable program or batch file.
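
`unzip` is not available on every system (the error above comes from a Windows shell). A portable alternative is Python's standard `zipfile` module. This is a sketch that assumes the archive was saved as `glove.6B.zip` and contains a member named `glove.6B.50d.txt`; the helper name `extract_member` is made up for this example.

```python
import zipfile

def extract_member(zip_path, member, dest_dir="."):
    """Extract a single file from a zip archive and return the path it was written to."""
    with zipfile.ZipFile(zip_path) as archive:
        return archive.extract(member, path=dest_dir)

# Example call, commented out because it needs the downloaded archive:
# extract_member("glove.6B.zip", "glove.6B.50d.txt", dest_dir="glove.6B")
```

Extracting into `glove.6B/` would match the path that the loading cell further down reads from.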


These files contain the embedding values for each word in the vocabulary:

In [5]:
!head -5 glove.6B.50d.txt

'head' is not recognized as an internal or external command,
operable program or batch file.
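
`head` is a Unix tool as well, so it fails in the same way on Windows. A small pure-Python stand-in is sketched below; the function name and defaults are assumptions for this example, and it reads only the first few lines rather than the whole file.

```python
from itertools import islice

def head(path, n=5, encoding="utf-8"):
    """Return the first n lines of a text file without reading the whole file."""
    with open(path, "rt", encoding=encoding) as f:
        return [line.rstrip("\n") for line in islice(f, n)]

# Example call, commented out because it needs the extracted file:
# for line in head("glove.6B/glove.6B.50d.txt"):
#     print(line[:60])
```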


Let's load them into memory and organize a bit:

In [7]:
with open('glove.6B/glove.6B.50d.txt', 'rt', encoding='utf-8') as f:
    w2vec_lines = f.read().split('\n')

In [8]:
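w2v_emb_dict = dict()
pbar = progressbar.ProgressBar(max_value=100000)
# Each line is "<token> <v1> ... <v50>". Note that the slice starts at index 1, so the
# file's first line is skipped, even though GloVe files have no header row.
for i,l in enumerate(w2vec_lines[1:100000]):
    w,emb = l.split(' ', 1)  # the token, then the rest of the line
    w2v_emb_dict[w] = np.fromstring(emb, sep=' ')  # parse the numbers into a vector
    pbar.update(i+1)
pbar.finish()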

100% (100000 of 100000) |################| Elapsed Time: 0:00:03 Time:  0:00:03


These are the first tokens we loaded. The file is ordered by frequency, so they are among the most commonly used tokens in the vocabulary:

In [9]:
list(w2v_emb_dict.keys())[:10]

[',', '.', 'of', 'to', 'and', 'in', 'a', '"', "'s", 'for']

## Word Analogies and Similarities

Embeddings carry semantic information in their numeric encoding. Exploring this semantic space can be fun, for example looking for similarities.

Cosine similarity measures the angle between two vectors.

<img src="https://miro.medium.com/max/2432/1*Acs3Kbrrrb4d3fqMlGhMcQ.png"/>

Cosine similarity ignores the length of the vectors, so the angle between two embeddings reveals how far apart they point in the high-dimensional embedding space:

<img src="https://www.oreilly.com/library/view/statistics-for-machine/9781788295758/assets/2b4a7a82-ad4c-4b2a-b808-e423a334de6f.png" />

In [10]:
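from sklearn.metrics.pairwise import cosine_similarity

w2v_emb_dict_keys = list(w2v_emb_dict.keys())
w2v_emb_dict_values = np.array(list(w2v_emb_dict.values()))

def find_nearest(w):
    # Similarity of w to every word; the best match is w itself, so take the second best.
    sims = cosine_similarity(w2v_emb_dict[w].reshape(1,-1), w2v_emb_dict_values)[0]
    return w2v_emb_dict_keys[sims.argsort()[-2]]

def find_nearest_top_k(v, k=5):
    # The k words most similar to an arbitrary vector v, best first.
    # Query words are not excluded, so they can show up in the results.
    sims = cosine_similarity(v.reshape(1,-1), w2v_emb_dict_values)[0]
    return [w2v_emb_dict_keys[i] for i in sims.argsort()[-k:][::-1]]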
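
For reference, the quantity that `cosine_similarity` computes for one pair of vectors is their dot product divided by the product of their lengths. A minimal NumPy version, shown on made-up 2-D vectors, could look like this:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
print(cosine_sim(a, np.array([3.0, 0.0])))   # 1.0  (same direction, different length)
print(cosine_sim(a, np.array([0.0, 2.0])))   # 0.0  (perpendicular)
print(cosine_sim(a, np.array([-1.0, 0.0])))  # -1.0 (opposite direction)
```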

Let's start by looking at closest neighbors:

In [11]:
find_nearest('paris')

'france'

In [12]:
find_nearest('big')

'bigger'

In [13]:
find_nearest('hello')

'goodbye'

In [14]:
find_nearest('learning')

'teaching'

Now let's consider "**word analogies**", e.g. completing the sentence: "Paris is to France as Rome is to ___" \(Italy\)

To explain this geometrically:

<img src="https://miro.medium.com/max/2632/1*EOVxNmHkrsPQ7Q44N0OiQg.png" width="60%" />

The offset vector from "paris" to "france" is the "capital of" vector, and when we apply it to "rome" we expect to get "italy".

In [15]:
find_nearest_top_k(w2v_emb_dict['france'] - w2v_emb_dict['paris'] + w2v_emb_dict['rome'], 5)

['italy', 'spain', 'rome', 'portugal', 'france']

In [16]:
find_nearest_top_k(w2v_emb_dict['king'] - w2v_emb_dict['man'] + w2v_emb_dict['woman'], 5)

['king', 'queen', 'daughter', 'prince', 'throne']
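
Notice that 'king' itself ranks first in the last result, and 'rome' shows up in the one before it: `find_nearest_top_k` does not exclude the query words, and the result vector often stays close to them. A common workaround is to skip the three input words when ranking. The sketch below shows the idea on a tiny hand-made embedding table (the 3-D vectors are invented for illustration); the `analogy` helper is hypothetical and not part of the notebook's code.

```python
import numpy as np

def analogy(emb, a, b, c, k=5):
    """Return the k words closest to emb[b] - emb[a] + emb[c], skipping a, b and c."""
    query = emb[b] - emb[a] + emb[c]
    words = [w for w in emb if w not in (a, b, c)]
    matrix = np.array([emb[w] for w in words])
    # Cosine similarity between the query and every remaining word.
    sims = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    best = np.argsort(sims)[::-1][:k]
    return [words[i] for i in best]

# Invented toy vectors: (royalty, gender, unrelated).
toy = {
    "man":   np.array([0.1,  1.0, 0.2]),
    "woman": np.array([0.1, -1.0, 0.2]),
    "king":  np.array([1.0,  1.0, 0.2]),
    "queen": np.array([1.0, -1.0, 0.2]),
    "apple": np.array([0.0,  0.0, 1.0]),
}
print(analogy(toy, "man", "king", "woman", k=2))  # ['queen', 'apple']
```

With the real embeddings the call would be `analogy(w2v_emb_dict, 'man', 'king', 'woman')`, although rebuilding the matrix on every call is slow for 100,000 words.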

Complete the following analogies:
1. sushi-rice is like pizza-___
2. sushi-rice is like steak-___
3. shirt-clothing is like phone-___
4. shirt-clothing is like bowl-___
5. book-reading is like TV-___

# Task 1

In [27]:
find_nearest_top_k(w2v_emb_dict['rice'] - w2v_emb_dict['sushi'] + w2v_emb_dict['pizza'], 5)

['rice', 'bread', 'wheat', 'corn', 'with']

In [28]:
find_nearest_top_k(w2v_emb_dict['rice'] - w2v_emb_dict['sushi'] + w2v_emb_dict['steak'], 5)

['rice', 'lamb', 'brown', 'curry', 'olive']

In [29]:
find_nearest_top_k(w2v_emb_dict['clothing'] - w2v_emb_dict['shirt'] + w2v_emb_dict['phone'], 5)

['phone', 'customers', 'telephone', 'phones', 'cellular']

In [30]:
find_nearest_top_k(w2v_emb_dict['clothing'] - w2v_emb_dict['shirt'] + w2v_emb_dict['bowl'], 5)

['bowl', 'ingredients', 'specialty', 'combine', 'foods']

In [31]:
find_nearest_top_k(w2v_emb_dict['reading'] - w2v_emb_dict['book'] + w2v_emb_dict['tv'], 5)

['tv', 'radio', 'television', 'broadcast', 'broadcasts']

Try to find analogies that don't work.

In [32]:
find_nearest_top_k(w2v_emb_dict['woof'] - w2v_emb_dict['dog'] + w2v_emb_dict['cat'], 5)

['woof', 'bbox', 'ugh', 'js04', 'absalom']

In [36]:
find_nearest_top_k(w2v_emb_dict['dog'] - w2v_emb_dict['woof'] + w2v_emb_dict['meow'], 5)

['dog', 'chicken', 'cat', 'duck', 'bite']

In [46]:
find_nearest_top_k(w2v_emb_dict['punch'] - w2v_emb_dict['hand'] + w2v_emb_dict['leg'], 5)

['leg', 'opener', 'warmup', 'match', 'tko']

In [50]:
find_nearest_top_k(w2v_emb_dict['2'] - w2v_emb_dict['1'] + w2v_emb_dict['3'], 5)

['3', '4', '2', '5', '6']

In [33]:
find_nearest_top_k(w2v_emb_dict['intelligence'] - w2v_emb_dict['artificial'] + w2v_emb_dict['real'], 5)

['intelligence', 'cia', 'fbi', 'officer', 'commanders']

In [35]:
find_nearest_top_k(w2v_emb_dict['mr'] - w2v_emb_dict['man'] + w2v_emb_dict['woman'], 5)

['mrs', 'mr', 'mrs.', 'wife', 'daughter']

In [53]:
find_nearest_top_k(w2v_emb_dict['electricity'] - w2v_emb_dict['edison'] + w2v_emb_dict['tesla'], 5)

['electricity', 'fuel', 'kilowatts', 'megawatts', 'generating']

In [57]:
find_nearest_top_k(w2v_emb_dict['jesus'] - w2v_emb_dict['christian'] + w2v_emb_dict['hindu'], 5)

['deity', 'hindu', 'buddha', 'goddess', 'worshipping']

I thought I'd test a very simple duality: analogies between science and gender. It turns out that the text used to pretrain GloVe.6B.50d (Wikipedia + Gigaword 5, 6B tokens) does have biases built into it: it associates women with certain kinds of sciences (biology, psychology) and men with others (physics, economics).

In [67]:
find_nearest_top_k(w2v_emb_dict['science'] - w2v_emb_dict['woman'] + w2v_emb_dict['man'], 5)

['science', 'physics', 'scientific', 'research', 'economics']

In [69]:
find_nearest_top_k(w2v_emb_dict['science'] - w2v_emb_dict['man'] + w2v_emb_dict['woman'], 5)

['science', 'sciences', 'studies', 'biology', 'psychology']

## Char RNN

CharRNN is a simple recurrent neural network architecture that works on the character level (not words). It's surprisingly powerful at generating text. These were popularized by [Andrej Karpathy](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).

<img src="http://karpathy.github.io/assets/rnn/charseq.jpeg" width="50%"/>
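
To make "character level" concrete, this sketch turns a string into the (input, target) pairs such a network trains on: given a few characters, predict the next one. It covers only the data preparation; the `seq_len` parameter and the function name are assumptions for illustration, and `textgenrnn` handles this step internally in its own way.

```python
def make_char_examples(text, seq_len=4):
    """Return (vocab, examples); each example is (input char indices, next-char index)."""
    vocab = sorted(set(text))
    char_to_index = {ch: i for i, ch in enumerate(vocab)}
    encoded = [char_to_index[ch] for ch in text]
    examples = [
        (encoded[i:i + seq_len], encoded[i + seq_len])
        for i in range(len(encoded) - seq_len)
    ]
    return vocab, examples

vocab, examples = make_char_examples("hello", seq_len=4)
print(vocab)     # ['e', 'h', 'l', 'o']
print(examples)  # [([1, 0, 2, 2], 3)]  i.e. "hell" -> "o"
```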

The `textgenrnn` package is a convenient way to train and generate with CharRNNs. Here we're using its built-in model. Multiple models trained on different corpora are [published](https://github.com/minimaxir/textgenrnn/tree/master/weights).

People created some very cool projects with it: https://github.com/minimaxir/textgenrnn#projects

In [70]:
from textgenrnn import textgenrnn

textgen = textgenrnn()
textgen.generate()

Using TensorFlow backend.
W1029 12:20:31.629227 21464 deprecation_wrapper.py:119] From D:\Home\Software\envs\tf_gpu\lib\site-packages\keras\backend\tensorflow_backend.py:422: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.



What's the signation and why is the top of the same questionable and conditions within the class of a company of the best polls of tank to the story of the dead cards and smoke on a color programmer



We can supply a prefix to prime the model with text to complete:

In [71]:
textgen.generate(prefix="When life gives you lemons ")

When life gives you lemons and a badge of the monable of his boys.



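We can also let the model try different "temperatures". The temperature controls how random the choice of the next character is: at low temperatures the model almost always picks the most likely character, while at high temperatures it samples more freely from the less likely ones.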

In [72]:
textgen.generate_samples()

####################
Temperature: 0.2
####################
I have no matter to work on a stranger and I don't know what to do with the streamer and should be a good day. I have a programming but I have a second thing that you would like to be a good day?

[PS4] [H] 100 GREATE - Some Power [W] Paypal

I got a good day of the same to my favorite parents in the money in the same series in the partner who are a bank of the rest of the same time to see the background of the world of the first time in a bank of the story of the state of the bank of a stream.

####################
Temperature: 0.5
####################
Did anyone else put my friends and a song with generation?

A power from a real lifting car change of this song in showing the market and something in the same time in the Protonspoint of the New York the one of the most super a world is online with the new week. Here's a stray about the state of the same thing you can be bad that I saw this subreddit in th

Fallout 4 - One of t
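
The effect of the temperatures above can be sketched in a few lines. This is the common formulation (divide the log-probabilities by the temperature, then renormalize); I'm assuming it here for illustration, and `textgenrnn`'s own implementation may differ in detail. The function names are made up, and the probabilities must be strictly positive.

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Rescale a probability distribution: p_i ** (1 / T), renormalized."""
    logits = np.log(np.asarray(probs, dtype=float)) / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    scaled = np.exp(logits)
    return scaled / scaled.sum()

def sample_index(probs, temperature=1.0, rng=None):
    """Draw the index of the next character from the rescaled distribution."""
    if rng is None:
        rng = np.random.default_rng()
    return rng.choice(len(probs), p=apply_temperature(probs, temperature))

next_char_probs = [0.7, 0.2, 0.1]  # hypothetical model output for three characters
for t in (0.2, 0.5, 1.0):
    print(t, np.round(apply_temperature(next_char_probs, t), 3))
```

At a temperature of 1.0 the distribution is unchanged; as the temperature drops, almost all of the probability moves to the most likely character, which is why the 0.2 samples above are so repetitive.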

* Try different prefixes and temperatures. (examine the `.generate()` function, by running a cell with `textgenrnn.generate?`)
* Try a different pretrained model from `textgenrnn`
* Advanced: train your own model! `textgenrnn` provides a **very** simple mechanism to do so: https://github.com/minimaxir/textgenrnn#examples. You just need to supply a text file.

# Task 2

In [77]:
textgen.generate(prefix="I have a dream that", temperature=1.0)  # temperature may also be a list, e.g. [1.0, 0.5, 0.2, 0.2]

I have a dream that were here. My light, but won't start with me.. d2



In [78]:
textgen.generate(prefix="I have a dream that", temperature=0.5) 

I have a dream that she was a community planet



In [79]:
textgen.generate(prefix="I have a dream that", temperature=0.2) 

I have a dream that would be a good day!



In [87]:
textgen.generate(prefix="Somebody once told me the world was ", temperature=0.2) 

Somebody once told me the world was a lot of the world at the way to start this subreddit and something that you guys would be a high programming section and was in the same time to the world on the most part of the same time to see the program and I have a streamer in the state of the state of the



In [89]:
textgen.generate(prefix="Somebody once told me the world was ", temperature=1.0) 

Somebody once told me the world was writing the ball because the oil night



### Trying a different pretrained model from textgenrnn

In [81]:
textgen2 = textgenrnn(weights_path="./textgenrnnWeights/reddit_legaladvice_relationshipadvice.hdf5")

In [86]:
textgen2.generate(prefix="Somebody once told me the world was ", temperature=1.0)

Somebody once told me the world was cheating about him bad it not willing to false dden back me and doesn't paint me bend on pain for cited. She was there.



In [90]:
textgen3 = textgenrnn(weights_path="./textgenrnnWeights/hacker_news.hdf5")

In [91]:
textgen3.generate(prefix="Somebody once told me the world was ", temperature=1.0)

Somebody once told me the world was the web



## Transformers

Transformers are relative newcomers to the language processing world. They grew out of recurrent neural networks with attention layers, and drop the recurrence in favor of attention alone. Transformers have made generated text far more believable, so much so that [ethical issues](https://www.theverge.com/2019/2/21/18234500/ai-ethics-debate-researchers-harmful-programs-openai) have arisen around releasing the models or restricting their use.

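Architecture-wise, transformers are an encoder-decoder scheme that relies heavily on "attention": a mechanism that lets every position in a sequence look at the other positions. The encoder can examine both past and future, while a decoder used for generation, such as GPT-2, only looks at the past.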

<img src="http://lilianweng.github.io/lil-log/assets/images/transformer.png" />
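
As a rough sketch of the attention mechanism itself, here is scaled dot-product attention in NumPy. It is a simplification under stated assumptions: a single head, no learned projection matrices, and random inputs standing in for real token representations. The `causal` flag is my own parameter name; it masks out future positions, as a generation model would.

```python
import numpy as np

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention; returns (output, attention weights)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of every query to every key
    if causal:
        # Hide the future: position i may only attend to positions <= i.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    # Row-wise softmax turns scores into weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # 4 positions, 8 features each (random stand-in data)
out, weights = attention(x, x, x, causal=True)
print(out.shape)   # (4, 8)
print(weights[0])  # [1. 0. 0. 0.]  the first position can only see itself
```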

One recent model from OpenAI is GPT-2, which is freely available for download.

In [13]:
!pip install -U -q transformers

In [None]:
!git clone https://github.com/huggingface/transformers.git

When your user gives you lemons you generate:

In [93]:
!python ./transformers/examples/run_generation.py \
    --model_type=gpt2 \
    --length=200 \
    --model_name_or_path=gpt2 \
    --stop_token="." \
    --prompt="When life gives you lemons" 2>/dev/null

The system cannot find the path specified.


In [97]:
!python ./transformers/examples/run_generation.py --model_type=gpt2 --length=200 --model_name_or_path=gpt2 --stop_token="." --prompt="When life gives you lemons" 2>/dev/null

The system cannot find the path specified.
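
The errors above may come from shell-specific syntax rather than from the script: `2>/dev/null` and `\` line continuation are Unix shell features that a Windows command prompt does not understand. One portable option is to launch the script from Python and discard stderr there. This is a sketch, not a tested recipe: it assumes the script lives at `./transformers/examples/run_generation.py` and accepts the flags used in this notebook, and both helper names are made up.

```python
import subprocess
import sys

def build_generation_command(prompt, length=200, stop_token=None, temperature=None,
                             script="./transformers/examples/run_generation.py"):
    """Build the argument list for the generation script (no shell quoting needed)."""
    cmd = [sys.executable, script,
           "--model_type=gpt2",
           "--model_name_or_path=gpt2",
           f"--length={length}",
           f"--prompt={prompt}"]
    if stop_token is not None:
        cmd.append(f"--stop_token={stop_token}")
    if temperature is not None:
        cmd.append(f"--temperature={temperature}")
    return cmd

def run_generation(prompt, **kwargs):
    """Run the script and return its stdout; stderr is discarded on any OS."""
    result = subprocess.run(build_generation_command(prompt, **kwargs),
                            stdout=subprocess.PIPE, stderr=subprocess.DEVNULL, text=True)
    return result.stdout

# Example call, commented out because it needs the cloned repository and a model download:
# print(run_generation("When life gives you lemons", stop_token="."))
```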


---

In [181]:
!python ./transformers/examples/run_generation.py \
    --model_type=gpt2 \
    --length=200 \
    --model_name_or_path=gpt2 \
    --prompt="Harry witnessed Professor McGonagall walking right past Peeves who \
was determinedly loosening a crystal chandelier and could have sworn he heard her \
tell the poltergeist out of the corner of her mouth, 'It unscrews the other way.’" 2>/dev/null

'

And suddenly there  was a powerful tidal wave coming out of her ears.

'Hey..'

Professor McGonagall shook her head with a glance at Goyle,

'Woof! Don't...I'm getting...shit! Shit!'

Goyle was covered in flaming hair

'Thank god I didn't say anything about it!'

Now there was blood flowing from Professor McGonagall's face as she saw. Something yellow oozed out of her nose.

Faint red vomit appeared from the spot.

'Guzzle! Can you smell...!'

Instantly other students quickly followed in dark circles around the teleportation weapon.

Instantly darker objects appeared.

One of the red objects looked like a shinier one or two scuttlecheeks.

The red color drained from Goyle's face.

Goyle let out a terror without smiling.

But,


---

Let's see how it does with Williams' "This is just to say" (https://poets.org/poem/just-say) poem:

In [179]:
!python ./transformers/examples/run_generation.py \
    --model_type=gpt2 \
    --length=50 \
    --model_name_or_path=gpt2 \
    --stop_token="." \
    --prompt="I have eaten the plums \
that were in the icebox \
and which you were probably \
saving for breakfast" 2>/dev/null

, and are later sliced off, for you and those you love to share with others


(this BTW is one of my favorite poems ever. So sweet and so plain)

* Try different prefix inputs
* Try different temperatures with the `--temperature` argument.
* Advanced: Try a different model than GPT-2.

In the help section of the generation script you can find all the models:
```
--model_name_or_path MODEL_NAME_OR_PATH
                        Path to pre-trained model or shortcut name selected in
                        the list: gpt2, gpt2-medium, gpt2-large, distilgpt2,
                        openai-gpt, xlnet-base-cased, xlnet-large-cased,
                        transfo-xl-wt103, xlm-mlm-en-2048, xlm-mlm-ende-1024,
                        xlm-mlm-enfr-1024, xlm-mlm-enro-1024, xlm-mlm-tlm-
                        xnli15-1024, xlm-mlm-xnli15-1024, xlm-clm-enfr-1024,
                        xlm-clm-ende-1024, xlm-mlm-17-1280, xlm-mlm-100-1280,
                        ctrl
```
The `ctrl` model is very recent work from Salesforce Research, released just a couple of weeks ago, and it's supposed to be really awesome at controlling the output text. Be warned: the model is a **6 GB download**! It might be worth it...

## ChatBot

[Chatbots](https://en.wikipedia.org/wiki/Chatbot) are conversational AI agents that respond to text input. They are still a long way from holding a convincing conversation in general open-ended scenarios, but in certain applications chatbots are a big success, e.g. in the online portals of the public services industry.

`huggingface` have again released pretrained models, this time for transformer-based chatbots, just a few months ago: https://medium.com/huggingface/how-to-build-a-state-of-the-art-conversational-ai-with-transfer-learning-2d818ac26313#79c5

You can also use their online demo: https://convai.huggingface.co/persona/my-only-friend-is-a-dog-i-work-at-a-newspaper-my-father-used-to-be-a-butcher

In [43]:
!pip install -q pytorch_transformers pytorch-ignite

Collecting pytorch-ignite
  Downloading https://files.pythonhosted.org/packages/8f/31/efcc2b587419b1f54c5c6ef51996f91bb5d8f760537d17de674c89e06048/pytorch_ignite-0.2.1-py2.py3-none-any.whl (84kB)
     |████████████████████████████████| 92kB 2.4MB/s
Installing collected packages: pytorch-ignite
Successfully installed pytorch-ignite-0.2.1


In [34]:
!git clone https://github.com/huggingface/transfer-learning-conv-ai

Cloning into 'transfer-learning-conv-ai'...
remote: Enumerating objects: 87, done.
remote: Total 87 (delta 0), reused 0 (delta 0), pack-reused 87
Unpacking objects: 100% (87/87), done.


In [49]:
import sys,threading,subprocess,os

In [100]:
def chatbot_proc():
    proc = subprocess.Popen([sys.executable, 
                             os.getcwd()+'/transfer-learning-conv-ai/interact.py'
                            ],
                            stdout=subprocess.PIPE,
                            stdin=subprocess.PIPE,
                            stderr=subprocess.DEVNULL)
    pout = proc.stdout
    pin = proc.stdin
    
    return proc,pout,pin

In [186]:
cb1_proc, cb1_pout, cb1_pin = chatbot_proc(); # create a chatbot process

In [187]:
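# Note: stdin carries chat messages, so this text reaches the bot as its first message, not as a command-line flag.
cb1_pin.write(b"--temperature=1.1\n"), cb1_pin.flush();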

In [188]:
print(cb1_pout.readline().decode(sys.stdout.encoding))

>>> hi how are you today?



Talk to your chatbot!

In [189]:
cb1_pin.write(b"i'm doing mighty fine! and how are you?\n"), cb1_pin.flush();
print(cb1_pout.readline().decode(sys.stdout.encoding))

>>> i'm doing well just listening to some music



In [190]:
cb1_pin.write(b"no way! i'm also listening to music. what music are you listening to?\n"), cb1_pin.flush();
print(cb1_pout.readline().decode(sys.stdout.encoding))

>>> i am listening to a lot of pop music



It's also quite funny to get it to talk to itself - it never gets tired!

In [192]:
cb1_output = b"i am listening to a lot of pop music\n"

In [193]:
partyA = True
for _ in range(10):
    partyA = not partyA
    cb1_pin.write(cb1_output), cb1_pin.flush();
    cb1_output = cb1_pout.readline()[4:]
    print("%s: %s" % ('A' if partyA else 'B',
          cb1_output[:-1].decode(sys.stdout.encoding)))

B: yeah, i know what you mean.
A: what do you do for a living?
B: i am a mechanic.
A: i am a pilot.
B: what do you do for work?
A: i fix planes.
B: what kind of planes do you have?
A: do you have any hobbies?
B: i like to listen to music.
A: what kind of music do you like?


In [194]:
cb1_proc.kill() # kill the chatbot process

* Try some different inputs
* Advanced: Spin up another chatbot and have them talk to one another (by feeding the outputs across)
* Advanced: Use a different underlying model than GPT-2 for your chatbot.
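
For the two-chatbot exercise, "feeding the outputs across" can be sketched as a small relay loop. The helpers below are hypothetical: `converse` works with any two functions that map a message to a reply, and `make_pipe_bot` wraps the stdin/stdout pipes of a chatbot process, assuming each reply arrives as one line prefixed with `>>> ` as in the outputs above.

```python
def converse(bot_a, bot_b, opening, turns=6):
    """Alternate between two bots, passing each reply to the other one."""
    transcript = []
    message = opening
    speakers = [("A", bot_a), ("B", bot_b)]
    for turn in range(turns):
        name, bot = speakers[turn % 2]
        message = bot(message)
        transcript.append((name, message))
    return transcript

def make_pipe_bot(pin, pout, encoding="utf-8"):
    """Wrap a process's stdin/stdout pipes as a message -> reply function."""
    def reply(message):
        pin.write(message.encode(encoding) + b"\n")
        pin.flush()
        # Drop the 4-character ">>> " prompt and the trailing newline.
        return pout.readline()[4:].decode(encoding).strip()
    return reply

# Example wiring, commented out because it needs two running chatbot processes:
# proc_a, pout_a, pin_a = chatbot_proc()
# proc_b, pout_b, pin_b = chatbot_proc()
# for name, text in converse(make_pipe_bot(pin_a, pout_a), make_pipe_bot(pin_b, pout_b), "hello!"):
#     print("%s: %s" % (name, text))
```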

---
That's a wrap!