# HW2: Text Generative Models

In this assignment we will look at some generative models for text: CharRNN, Transformers and Chatbots. Training text models is very time-consuming and requires a lot of data. The really good models also tend to be very large, so we will stick to pretrained models. These can still be excellent at generating totally new text!

## Word Embeddings

Embeddings are numeric representations for non-numeric data. In our case we look for embeddings for words. A simple kind of embedding is One-Hot Encoding, where we put a `1` in a vector of all `0`s at the index of the word in the vocabulary.

<img src="https://github.com/tensorflow/docs/blob/master/site/en/tutorials/text/images/one-hot.png?raw=1" width="50%"/>

But that can be very wasteful and also doesn't encode any relationship between the words.
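To make this concrete, here is a minimal sketch of one-hot encoding over a made-up five-word vocabulary (the vocabulary and the `one_hot` helper are assumptions for illustration only). It shows both problems: the vector is as long as the vocabulary, and any two different words look equally unrelated.

```python
import numpy as np

# Toy vocabulary (hypothetical); real vocabularies hold hundreds of thousands of words.
vocab = ["cat", "mat", "on", "sat", "the"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's vocabulary index."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

cat_vec = one_hot("cat")
mat_vec = one_hot("mat")

# Every pair of distinct words has dot product 0, so one-hot vectors
# say nothing about which words are related.
cat_dot_mat = float(cat_vec @ mat_vec)
```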

To learn semantic relationships, a few unsupervised algorithms have been proposed. In class we've discussed Continuous Bag of Words and Skip-Gram. Essentially these mask out part of a sentence and ask the model to predict the missing part. This way the model learns the contexts in which words are used, as well as the relationships between words.
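As a rough illustration of how the two training setups frame the data (a sketch only; the function names and the `window` parameter are assumptions, and no model is trained here): Skip-Gram uses a word to predict each of its neighbors, while Continuous Bag of Words uses the neighbors to predict the hidden word.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs: the center word is used to predict each neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """Return (context_words, target) examples: the neighbors are used to predict the hidden word."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, target))
    return examples

sentence = "the cat sat on the mat".split()
pairs = skipgram_pairs(sentence, window=1)
examples = cbow_examples(sentence, window=1)
```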

The embedding for a word is a vector of numbers:

<img src="https://github.com/tensorflow/docs/blob/master/site/en/tutorials/text/images/embedding2.png?raw=1" width="50%" />

Luckily many of the world leaders in natural language processing have pretrained word embeddings learned on huge corpora, so we don't have to do it ourselves.

Allison Parrish of NYU showed some very interesting uses of word embeddings for poetry generation: https://www.youtube.com/watch?v=L3D0JEA1Jdc. Breeze through this StrangeLoop talk for inspiration. I encourage you to try these methods in your own generative work.

`chakin` is a helper tool for downloading pretrained embeddings:

In [6]:
!pip3 install -q chakin progressbar2 textgenrnn

In [8]:
!python3 -V

Python 3.5.3


In [6]:
import chakin
import progressbar
import numpy as np

These are the available models:

In [10]:
chakin.search('English')

                   Name  Dimension                     Corpus VocabularySize  \
2          fastText(en)        300                  Wikipedia           2.5M   
11         GloVe.6B.50d         50  Wikipedia+Gigaword 5 (6B)           400K   
12        GloVe.6B.100d        100  Wikipedia+Gigaword 5 (6B)           400K   
13        GloVe.6B.200d        200  Wikipedia+Gigaword 5 (6B)           400K   
14        GloVe.6B.300d        300  Wikipedia+Gigaword 5 (6B)           400K   
15       GloVe.42B.300d        300          Common Crawl(42B)           1.9M   
16      GloVe.840B.300d        300         Common Crawl(840B)           2.2M   
17    GloVe.Twitter.25d         25               Twitter(27B)           1.2M   
18    GloVe.Twitter.50d         50               Twitter(27B)           1.2M   
19   GloVe.Twitter.100d        100               Twitter(27B)           1.2M   
20   GloVe.Twitter.200d        200               Twitter(27B)           1.2M   
21  word2vec.GoogleNews        300      

Let's download GloVe embeddings:

In [11]:
chakin.download(number=11)

Test: 100% ||                                      | Time:  0:06:27   2.1 MiB/s


'./glove.6B.zip'

We only need one file (the smallest dimension one):

In [12]:
!unzip glove.6B.zip glove.6B.50d.txt

Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        


These files contain the embedding values for each word in the vocabulary:

In [13]:
!head -5 glove.6B.50d.txt

the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 -0.6566 0.27843 -0.14767 -0.55677 0.14658 -0.0095095 0.011658 0.10204 -0.12792 -0.8443 -0.12181 -0.016801 -0.33279 -0.1552 -0.23131 -0.19181 -1.8823 -0.76746 0.099051 -0.42125 -0.19526 4.0071 -0.18594 -0.52287 -0.31681 0.00059213 0.0074449 0.17778 -0.15897 0.012041 -0.054223 -0.29871 -0.15749 -0.34758 -0.045637 -0.44251 0.18785 0.0027849 -0.18411 -0.11514 -0.78581
, 0.013441 0.23682 -0.16899 0.40951 0.63812 0.47709 -0.42852 -0.55641 -0.364 -0.23938 0.13001 -0.063734 -0.39575 -0.48162 0.23291 0.090201 -0.13324 0.078639 -0.41634 -0.15428 0.10068 0.48891 0.31226 -0.1252 -0.037512 -1.5179 0.12612 -0.02442 -0.042961 -0.28351 3.5416 -0.11956 -0.014533 -0.1499 0.21864 -0.33412 -0.13872 0.31806 0.70358 0.44858 -0.080262 0.63003 0.32111 -0.46765 0.22786 0.36034 -0.37818 -0.56657 0.044691 0.30392
. 0.15164 0.30177 -0.16763 0.17684 0.31719 0.33973 -0.43478 -0.31086 -0.44999 -0.29486 0.16608 0.11963 -0.41328 -0.42353

Let's load them into memory and organize a bit:

In [14]:
w2vec_lines = open('glove.6B.50d.txt','rt', encoding='utf-8').read().split('\n')

In [15]:
w2v_emb_dict = dict()
pbar = progressbar.ProgressBar(max_value=100000)
# GloVe files have no header line, so start at index 0 (starting at 1 would skip 'the')
for i,l in enumerate(w2vec_lines[:100000]):
    w,emb = l.split(' ', 1)
    w2v_emb_dict[w] = np.fromstring(emb, sep=' ')
    pbar.update(i+1)
pbar.finish()

100% (100000 of 100000) |################| Elapsed Time: 0:00:01 Time:  0:00:01
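An alternative sketch that reads the file line by line instead of loading it all at once. The `load_embeddings` helper and its `limit` parameter are hypothetical names, and the in-memory "file" holds made-up values just so the snippet runs without the download.

```python
import io
import numpy as np

def load_embeddings(lines, limit=None):
    """Parse 'word v1 v2 ...' lines into a {word: vector} dict, stopping after `limit` words."""
    embeddings = {}
    for line in lines:
        if limit is not None and len(embeddings) >= limit:
            break
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines, e.g. a trailing newline at the end of the file
        word, values = line.split(" ", 1)
        embeddings[word] = np.array([float(v) for v in values.split()])
    return embeddings

# Tiny in-memory stand-in for the real file (made-up values, 3 dimensions).
fake_file = io.StringIO("the 0.1 0.2 0.3\n, 0.4 0.5 0.6\n. 0.7 0.8 0.9\n")
toy_embeddings = load_embeddings(fake_file, limit=2)
```

With the real data, the same function could be fed an open file handle, e.g. `open('glove.6B.50d.txt', encoding='utf-8')`, since iterating over a file yields its lines.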


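These would be the most commonly used tokens in the vocabulary, since the file is sorted by frequency. Note that dictionaries in Python 3.5 do not preserve insertion order, which is why the keys below come out in an arbitrary order: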

In [16]:
list(w2v_emb_dict.keys())[:10]

['authoritatively',
 '1,520',
 'quinton',
 'editorialized',
 'beutel',
 'gashes',
 'pronounced',
 'bettered',
 'jagdish',
 'eglin']

## Word Analogies and Similarities

Embeddings carry semantic information in their numeric encoding. Exploring this semantic space can be fun, for example looking for similarities.

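Cosine similarity measures the cosine of the angle between two vectors: 1 means they point in the same direction, 0 means they are perpendicular, and -1 means they point in opposite directions.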

<img src="https://miro.medium.com/max/2432/1*Acs3Kbrrrb4d3fqMlGhMcQ.png"/>

Cosine similarity ignores the length of the vectors (our GloVe vectors are not unit length), so looking at the angle between two vectors reveals how closely their directions agree in the high-dimensional embedding space:

<img src="https://www.oreilly.com/library/view/statistics-for-machine/9781788295758/assets/2b4a7a82-ad4c-4b2a-b808-e423a334de6f.png" />
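A minimal sketch of the formula itself, written with plain NumPy on toy 2-D vectors (the `cosine_sim` helper is only for illustration; the cells below use scikit-learn's `cosine_similarity`):

```python
import numpy as np

def cosine_sim(a, b):
    """cos(theta) = (a . b) / (|a| * |b|); ranges from -1 (opposite) to 1 (same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
b = np.array([3.0, 0.0])   # same direction as a, three times as long
c = np.array([0.0, 2.0])   # perpendicular to a

same_direction = cosine_sim(a, b)
perpendicular = cosine_sim(a, c)
```

Because the lengths are divided out, `a` and `b` count as identical in direction even though one is three times longer.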

In [24]:
from sklearn.metrics.pairwise import cosine_similarity

w2v_emb_dict_keys = list(w2v_emb_dict.keys())
w2v_emb_dict_values = np.array(list(w2v_emb_dict.values()))

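def find_nearest(w):
    # Rank all words by similarity to w; the top hit is w itself, so take the runner-up.
    sims = cosine_similarity(w2v_emb_dict[w].reshape(1,-1), w2v_emb_dict_values)[0]
    return w2v_emb_dict_keys[sims.argsort()[-2]]

def find_nearest_top_k(v, k=5):
    # Return the k words most similar to the vector v, best match first.
    sims = cosine_similarity(v.reshape(1,-1), w2v_emb_dict_values)[0]
    return [w2v_emb_dict_keys[i] for i in sims.argsort()[-k:].tolist()[::-1]]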

Let's start by looking at closest neighbors:

In [25]:
find_nearest('paris')

'france'

In [26]:
find_nearest('big')

'bigger'

In [27]:
find_nearest('hello')

'goodbye'

In [28]:
find_nearest('learning')

'teaching'

Now let's consider "**word analogies**", e.g. completing the sentence: "Paris is to France like Rome is to ___" \(Italy\)

To explain this geometrically:

<img src="https://miro.medium.com/max/2632/1*EOVxNmHkrsPQ7Q44N0OiQg.png" width="60%" />

The offset vector from "paris" to "france" is the "capital of" vector, and when we apply it to "rome" we expect to get "italy".

In [29]:
find_nearest_top_k(w2v_emb_dict['france'] - w2v_emb_dict['paris'] + w2v_emb_dict['rome'], 5)

['italy', 'spain', 'rome', 'portugal', 'france']

In [30]:
find_nearest_top_k(w2v_emb_dict['king'] - w2v_emb_dict['man'] + w2v_emb_dict['woman'], 5)

['king', 'queen', 'daughter', 'prince', 'throne']
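In the outputs above, the query words themselves ('rome', 'king') show up among the nearest neighbors. One common remedy is to leave the input words out of the ranking. The sketch below assumes a hypothetical `analogy(a, b, c, ...)` helper that answers "a is to b as c is to ?" with the vector `b - a + c`, and runs on hand-made 2-D toy vectors rather than real GloVe values.

```python
import numpy as np

def analogy(a, b, c, embeddings, k=3):
    """Answer 'a is to b as c is to ?' by ranking words near b - a + c, skipping a, b and c."""
    query = embeddings[b] - embeddings[a] + embeddings[c]
    scores = {}
    for word, vec in embeddings.items():
        if word in (a, b, c):
            continue  # the input words tend to dominate the ranking otherwise
        scores[word] = np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hand-made 2-D toy vectors (not real GloVe values): axis 0 ~ "royalty", axis 1 ~ "female".
toy = {
    "king":  np.array([1.0, 0.0]),
    "queen": np.array([1.0, 1.0]),
    "man":   np.array([0.1, 0.1]),
    "woman": np.array([0.1, 1.1]),
    "apple": np.array([-1.0, -1.0]),
}
result = analogy("man", "woman", "king", toy, k=1)
```

On the real embeddings the same helper could be called as `analogy('paris', 'france', 'rome', w2v_emb_dict)`. Note the argument order: the offset is always "second word minus first word", matching `france - paris + rome` above.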

Complete the following analogies:
1. sushi-rice is like pizza-___
2. sushi-rice is like steak-___
3. shirt-clothing is like phone-___
4. shirt-clothing is like bowl-___
5. book-reading is like TV-___

In [31]:
print('1. sushi-rice is like pizza-___')
print(find_nearest_top_k(w2v_emb_dict['sushi'] - w2v_emb_dict['rice'] + w2v_emb_dict['pizza'], 5))

1. sushi-rice is like pizza-___
['pizza', 'sushi', 'fast-food', 'diner', 'nachos']


In [32]:
# 2. sushi-rice is like steak-___
print(find_nearest_top_k(w2v_emb_dict['sushi'] - w2v_emb_dict['rice'] + w2v_emb_dict['steak'], 5))

['steak', 'sushi', 'cheeseburger', 'steaks', 'meatball']


In [33]:
# 3. shirt-clothing is like phone-___ (note: the query below adds 'pizza', not 'phone', so the output answers pizza-___)
print(find_nearest_top_k(w2v_emb_dict['shirt'] - w2v_emb_dict['clothing'] + w2v_emb_dict['pizza'], 5))

['pizza', 'sandwich', 'hat', 'sandwiches', 'pie']


In [34]:
# 4. shirt-clothing is like bowl-___
print(find_nearest_top_k(w2v_emb_dict['shirt'] - w2v_emb_dict['clothing'] + w2v_emb_dict['bowl'], 5))

['bowl', 'crimson', 'afc', 'gator', 'super']


In [36]:
# 5. book-reading is like TV-___
print(find_nearest_top_k(w2v_emb_dict['book'] - w2v_emb_dict['reading'] + w2v_emb_dict['tv'], 5))

['tv', 'television', 'movie', 'hbo', 'movies']


Try to find analogies that don't work.

#### Testing around to find analogies that do not work…

Breakfast-morning is like dinner-____

Breakfast-morning is like lunch-_____

Pie-dessert is like broccoli-____

cake-dessert is like spinach-____

Ketchup-burger is like syrup-_____

Hummus-carrots is like mustard-____

Art-paint is like literature-_____

In [38]:
# Breakfast-morning is like dinner-____
print(find_nearest_top_k(w2v_emb_dict['breakfast'] - w2v_emb_dict['morning'] + w2v_emb_dict['dinner'], 5))

['dinners', 'breakfast', 'dinner', 'buffet', 'breakfasts']


In [39]:
# Breakfast-morning is like lunch-_____
print(find_nearest_top_k(w2v_emb_dict['breakfast'] - w2v_emb_dict['morning'] + w2v_emb_dict['lunch'], 5))

['breakfast', 'buffet', 'breakfasts', 'dinners', 'lunch']


In [40]:
# Pie-dessert is like broccoli-____
print(find_nearest_top_k(w2v_emb_dict['pie'] - w2v_emb_dict['dessert'] + w2v_emb_dict['broccoli'], 5))

['broccoli', 'cauliflower', 'zucchini', 'sprouts', 'celery']


In [44]:
# Cake-dessert is like spinach-____
print(find_nearest_top_k(w2v_emb_dict['cake'] - w2v_emb_dict['dessert'] + w2v_emb_dict['spinach'], 5))

['spinach', 'lettuce', 'potatoes', 'peeled', 'carrots']


In [45]:
# Ketchup-burger is like syrup-_____
print(find_nearest_top_k(w2v_emb_dict['ketchup'] - w2v_emb_dict['burger'] + w2v_emb_dict['syrup'], 5))

['syrup', 'vinegar', 'molasses', 'juice', 'vanilla']


In [46]:
# Hummus-carrots is like mustard-____
print(find_nearest_top_k(w2v_emb_dict['hummus'] - w2v_emb_dict['carrots'] + w2v_emb_dict['mustard'], 5))

['hummus', 'naim', 'malak', 'ales', 'kassem']


In [47]:
# Art-paint is like literature-_____
print(find_nearest_top_k(w2v_emb_dict['art'] - w2v_emb_dict['paint'] + w2v_emb_dict['literature'], 5))

['literature', 'literary', 'poetry', 'scholar', 'contemporary']


## Char RNN

CharRNN is a simple recurrent neural network architecture that works on the character level (not words). It's surprisingly powerful at generating text. These were popularized by [Andrej Karpathy](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).

<img src="http://karpathy.github.io/assets/rnn/charseq.jpeg" width="50%"/>
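To illustrate what "working on the character level" means for the training data, here is a small sketch (not `textgenrnn`'s actual internals; the helper name and `context_len` are assumptions). Each window of characters is paired with the character that follows it, and characters are mapped to integer ids before being fed to the network.

```python
def char_training_pairs(text, context_len=4):
    """Slide a window over the text: each window is paired with the character that follows it."""
    pairs = []
    for i in range(len(text) - context_len):
        pairs.append((text[i:i + context_len], text[i + context_len]))
    return pairs

text = "hello world"
vocab = sorted(set(text))                      # the "vocabulary" is just the distinct characters
char_to_id = {ch: i for i, ch in enumerate(vocab)}
pairs = char_training_pairs(text, context_len=4)
encoded_first = [char_to_id[ch] for ch in pairs[0][0]]
```

The vocabulary here has only 8 entries, which is why character-level models need no word list at all and can invent new words.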

The `textgenrnn` package is a convenient way to train and generate with CharRNNs. Here we're using its built-in model. They have multiple models [published](https://github.com/minimaxir/textgenrnn/tree/master/weights), trained on different corpora.

People created some very cool projects with it: https://github.com/minimaxir/textgenrnn#projects

In [1]:
from textgenrnn import textgenrnn

textgen = textgenrnn()
textgen.generate()

Using TensorFlow backend.



The key confirmed for a random burger to be on the floor?



We can supply a prefix to prime the model with text to complete:

In [2]:
textgen.generate(prefix="When life gives you lemons ")

When life gives you lemons as a month of the company.



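We can also let the model try different "temperatures". The temperature controls how much randomness goes into picking the next character: at low temperatures the model almost always picks the most likely character, while higher temperatures give less likely characters a better chance.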
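A minimal sketch of one common way to apply temperature, assuming the model outputs a probability for each candidate character (the probabilities and helper names below are made up; `textgenrnn`'s own sampling code may differ in detail). The probabilities are raised to the power `1/T` and renormalized before sampling.

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Rescale a probability distribution: p_i^(1/T), renormalized."""
    logits = np.log(np.asarray(probs, dtype=np.float64)) / temperature
    logits -= logits.max()            # subtract the max for numerical stability
    scaled = np.exp(logits)
    return scaled / scaled.sum()

def sample_next(chars, probs, temperature, rng):
    """Draw one character from the temperature-adjusted distribution."""
    return rng.choice(chars, p=apply_temperature(probs, temperature))

chars = ["a", "b", "c"]
probs = [0.6, 0.3, 0.1]               # made-up model output for the next character

cold = apply_temperature(probs, 0.2)  # sharper: mass concentrates on "a"
hot = apply_temperature(probs, 2.0)   # flatter: closer to uniform
rng = np.random.default_rng(0)
picked = sample_next(chars, probs, 0.2, rng)
```

This is consistent with the repetitive samples at temperature 0.2 below: when nearly all the probability sits on the top choice, the model keeps producing the same phrases.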

In [3]:
textgen.generate_samples()

####################
Temperature: 0.2
####################
[Specific] Can someone please remove the story of the streets and the most support is the best computer and why are the posts of the state of the best community to the post in the top of the state of the state of the same time to a super of the state of the state of the state of the starter of the 

[Specific] Can someone please remove the story of the programming in the most state of the state of the most planet in the state of the season in the back of the same things to get a girl to the strange with a huge character and become still going to be a man who has a stranger to the programming t

The subreddit of the state of the state of the discovery of the same state of the same starting to the state of the first time that we say they are a big defender of the story of the state of the state of the state of the state of the story of the streets of the state of the same time in the first 

####################
Temperature: 0.5


* Try different prefixes and temperatures. (examine the `.generate()` function, by running a cell with `textgenrnn.generate?`)
* Try a different pretrained model from `textgenrnn`
* Advanced: train your own model! `textgenrnn` provides a **very** simple mechanism to do so: https://github.com/minimaxir/textgenrnn#examples. You just need to supply a text file.

In [4]:
textgenrnn.generate?

Signature:
textgenrnn.generate(
    self,
    n=1,
    return_as_list=False,
    prefix=None,
    temperature=[1.0, 0.5, 0.2, 0.2],
    max_gen_length=300,
    interactive=False,
    top_n=3,
    progress=True,
)
Docstring: <no docstring>
File:    

In [3]:
# I'm going to add so much randomness to my day.... (using the default model)
my_day_story="Today I had a meeting at 11am and thought I would do work beforehand but then I ran into my thesis reader in the kitchen."
temp=0.9
textgen.generate(prefix=my_day_story, temperature=temp)

Today I had a meeting at 11am and thought I would do work beforehand but then I ran into my thesis reader in the kitchen...



In [4]:
# That was too random... I will lower the temperature.
temp=0.7
textgen.generate(prefix=my_day_story, temperature=temp)

Today I had a meeting at 11am and thought I would do work beforehand but then I ran into my thesis reader in the kitchen.



# Using RNNs to generate fake cryptocurrency data

I used the [textgenrnn](https://github.com/minimaxir/textgenrnn) RNN machine learning module to generate fake/speculative cryptocurrency data from current cryptocurrency data (isn’t this how it’s done anyhow? ;-)).

The output is in the form of __symbol, name, $USD price__

Some favorites...

```
BOTC Botcoin 0.00021108
CRT "Credit Token" 0.000653993
BOT BootCoin 0.000129398
SPC Spacecoin 0.00012733

XCC "Cash Coin" 0.0
XCC "Chin Coin" 0.000139998
GET2 "Gene Token" 0.000113792
BET Betters 0.0
DARE Darto 0.0
STT3 "Start Coin" 0.0
PCC2 "Place Coin" 0.000148341
XBC "Bitcoin Blockchain" 0.000523123
PARE Paris 0.0
BARE BitcoinA 0.001190911
```

And then by increasing the ‘temperature’ (adding more ‘entropy’) they get even better…

```
CTA1 "Currenity Token" 0.0
KROB Krep 0.0
FXC FuxxCoin 0.0
EMC "Decert Money Coin" 0.001999197
BROTC Brostribs 0.0
XBAC2 "Blockchain Adiamend Coin" 0.001187232
DECO Delcoin 0.0
HCC HickCoin 0.0
WOP Wo 0.695537238
MAPL "Marpa Chain" 5.189e-06
BITM Bitmi 0.0
BRIX Biocoin 0.000149343
SHHT "SHT Token" 0.0
KM2 Kikera 0.0
SECN Secus 0.0
NSC Nucoin 0.0
PCOP PoperCoin 0.000465216
```

## Method / Data / How
The data source is all cryptocurrencies listed by CoinCodex (including those no longer trading).
The data was ingested from CoinCodex as JSON and then transformed into a csv/txt file that works with the minimaxir/textgenrnn module.
[Data source](https://coincodex.com/apps/coincodex/cache/all_coins_packed.json?t=26199381&coincodex.com)

All the code for this data transformation is on [GitHub](https://github.com/aberke/city-visions/blob/master/week-2) at `./crypto_data_script.ipynb`

Output datafile: `./data/cryptocurrencies_data.txt`

*Note: Many of the cryptocurrencies used as training input are no longer trading, in which case their $USD price was set to 0.0.*
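The transformation could look roughly like the sketch below. This is not the linked script: the JSON field names (`symbol`, `name`, `price_usd`) and the sample records are hypothetical, and the real JSON layout may differ. Python's `csv` writer with a space delimiter quotes only the names that contain spaces, which matches the `symbol, name, price` lines shown above.

```python
import csv
import io
import json

# Made-up sample in a hypothetical schema; the real JSON layout may differ.
raw_json = """
[
  {"symbol": "BTC", "name": "Bitcoin", "price_usd": 7661.3},
  {"symbol": "BCH", "name": "Bitcoin Cash", "price_usd": 222.35},
  {"symbol": "OLD", "name": "Old Coin", "price_usd": null}
]
"""

def coins_to_lines(coins):
    """Render one 'symbol name price' line per coin, quoting names that contain spaces."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=" ", lineterminator="\n")
    for coin in coins:
        # Assumption: a missing price means the coin no longer trades, so use 0.0.
        price = coin["price_usd"] if coin["price_usd"] is not None else 0.0
        writer.writerow([coin["symbol"], coin["name"], price])
    return buffer.getvalue().splitlines()

lines = coins_to_lines(json.loads(raw_json))
```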


In [9]:
fake_cryptos_gen = textgenrnn(name="fake_cryptos")
fake_cryptos_gen.reset()
fake_cryptos_gen.train_from_file(
    './data/cryptocurrencies_data.txt',
    new_model=True,
    rnn_bidirectional=True,
    dim_embeddings=300,
    num_epochs=3)

6,318 texts collected.
Training new model w/ 2-layer, 128-cell Bidirectional LSTMs
Training on 150,441 character sequences.
Epoch 1/3
####################
Temperature: 0.2
####################
TRC2 TronCoin 0.0

LTC Lotthron 0.000999989

SPC SppperCoin 0.0

####################
Temperature: 0.5
####################
HNG "Mond Coin" 0.001072389

LLT Lomernes 0.0

SBC StareCoin 0.0

####################
Temperature: 1.0
####################
FHU Frium 0.000988283

ORV2 "Orvine Prad" 0.005528285

SBX2 Smenf 0.0

Epoch 2/3
####################
Temperature: 0.2
####################
CONT CONTOKEN 0.0

CORE CORON 0.0

ENT Enter 0.0

####################
Temperature: 0.5
####################
BUNT Buntor 0.0

GRA GaraCoin 0.000367625

VONT VOANTOKONTOKEN 0.0

####################
Temperature: 1.0
####################
WAWBTA "ARBTCU Token" 0.165626623

ACA2 Alcohia 0.000282005

BBORCONY BORARAR 0.4

Epoch 3/3
####################
Temperature: 0.2
####################
CRO Coinoma 0.001222939

BTC B

In [12]:
N=50
fake_cryptos_gen.generate(
    n=N,
    return_as_list=False,
    temperature=[0.5, 0.3, 0.2],
    max_gen_length=300,
)

BOT BootCoin 0.000129398

STRT Strea 0.0

CRT "Credit Token" 0.000653993

TORO TronCoin 0.001070109

BOTC Botcoin 0.00021108

XPA Payancoin 0.000219937

SPC Spacecoin 0.00012733

XCC "Cash Coin" 0.0

MONA Mononio 0.0

XCC "Chin Coin" 0.000139998

BTC2 "Bitcoin Token" 0.001148301

VERE Verio 0.0

CNT CoinTraden 0.0

JOON "JOO Coin" 0.000197318

ACOIN Acine 0.0

RED Redio 0.0

CORC Corencoin 0.000329609

MENT Mentrenity 0.0

SHT "Sher Token" 0.0

GET2 "Gene Token" 0.000113792

PRC2 Priplic 0.000171769

KOND Konda 0.0

TRO TronLite 0.0

BITE BITTY 0.0

HAT "Hash Token" 0.0

BTR Bitcoin 0.000109439

BET Betters 0.0

TRT1 TRRES 0.0

STC Starto 0.0

BAT Batcoin 0.000186939

CORE CoinCoin 0.000197231

EXC Excoin 0.001291339

TRT2 Tronerto 0.0

DARE Darto 0.0

STT3 "Start Coin" 0.0

STC Starter 0.0

MOD "Moder Coin" 0.000103397

DTC Dectream 0.0

PCC2 "Place Coin" 0.000148341

VET Vetalio 0.0

XBC "Bitcoin Blockchain" 0.000523123

CORX Corex 0.0

STC Stace 0.0

BCT2 "Bitcoin Token" 0.0

UPT UPToken 0.0

BETB Bitcoin 0.0

ALT Allation 0.0

ANCA Anacoin 0.000113473

PARE Paris 0.0

BARE BitcoinA 0.001190911

In [13]:
# And then trying with more 'entropy' or a higher 'temperature'
fake_cryptos_gen.generate(
    n=N,
    return_as_list=False,
    temperature=[0.8, 0.7],
    max_gen_length=300,
)

CTA1 "Currenity Token" 0.0

KROB Krep 0.0

XBAC2 "Blockchain Adiamend Coin" 0.001187232

BIC Bitcoin 0.000153249

FXC FuxxCoin 0.0

ATH Author 0.0

VEB Deceneurbit 7.613e-06

SUP Spoper 0.0

EMC "Decert Money Coin" 0.001999197

BROTC Brostribs 0.0

2OAS "Soacoin Sure" 0.130983886

SELC Selus 0.003067594

BHP Bitpocoin 0.000109744

ERA ERP 0.000160913

LOLI Luocoin 0.000160972

GOA GAGO 0.0

DECO Delcoin 0.0

HCC HickCoin 0.0

WOP Wo 0.695537238

BIS Bitcoin 0.00094139

KDO "KODP Coin" 0.0

ATV "Atreamond Token" 0.00138699

SDT "Steper Disiance Coin" 0.000299439

LAR Elargar 0.0

TTX Tartexuri 0.0

DUARD DRORP 0.0

FRN Fraxi 0.000283347

BUT2 Buste 0.000081548

NENA Nexy 0.011165955

BTA Blockports 0.000284311

EBL Ecrep 0.030643852

BBC2 Bitcoin 0.0

CART2 "Car Trader Coin" 0.0

XME "Hextine Coin" 0.007289132

MAPL "Marpa Chain" 5.189e-06

BITM Bitmi 0.0

BRIX Biocoin 0.000149343

SHHT "SHT Token" 0.0

KM2 Kikera 0.0

SECN Secus 0.0

NSC Nucoin 0.0

RDD Redvolo 0.0

XSD "Spold Findation" 0.0

EXLC EmeroCoin 0.000253889

RCTC Recoin 0.0

WONO WoonCoin 0.000091916

STS STRT 0.000383109

FR2 FSG 0.0

PCOP PoperCoin 0.000465216

EOT Eather 0.0

## Transformers

Transformers are relative newcomers to the language processing world. They grew out of recurrent neural networks with attention layers, and drop the recurrence to rely on attention alone. Using transformers has greatly increased the capability of generating believable text, so much so that [ethical issues](https://www.theverge.com/2019/2/21/18234500/ai-ethics-debate-researchers-harmful-programs-openai) have arisen around releasing such models or restricting their use.

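Architecture-wise, the original transformer is an encoder-decoder scheme that relies heavily on "attention" - a mechanism that lets every position in the sequence examine the other positions, both past and future. Generative models such as GPT-2 use only the decoder part and mask the attention, so each step sees only the text that came before it.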

<img src="http://lilianweng.github.io/lil-log/assets/images/transformer.png" />
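The core of the attention mechanism can be sketched in a few lines of NumPy. This is a simplified single-head version on random toy input: real transformers add learned projections for queries, keys and values, multiple heads, and many stacked layers. The `causal` flag is a name chosen here for illustration; it shows the masked variant in which a position cannot look at later positions.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with the usual max-subtraction for stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v, causal=False):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    if causal:
        # Hide future positions: position i may only look at positions <= i.
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores)
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))           # 4 token positions, 8 features each (random toy input)
_, full_weights = attention(x, x, x)
_, causal_weights = attention(x, x, x, causal=True)
```

Each row of the weight matrix says how much one position draws from every other position. In the causal case the upper triangle is zero, so the first position can only attend to itself.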

One recent model from OpenAI is GPT-2, which is freely available for download.

In [7]:
!pip install -U -q transformers

In [8]:
!git clone https://github.com/huggingface/transformers.git

fatal: destination path 'transformers' already exists and is not an empty directory.


In [15]:
# Verify the install:
import tensorflow as tf
!python3 -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"

Tensor("Sum:0", shape=(), dtype=float32)


When your user gives you lemons you generate:

In [21]:
!python ./transformers/examples/run_generation.py \
    --model_type=gpt2 \
    --length=200 \
    --model_name_or_path=gpt2 \
    --stop_token="." \
    --prompt="When life gives you lemons" 2>/dev/null

, it might not require much, but it can definitely have its benefits


---

In [19]:
!python ./transformers/examples/run_generation.py \
    --model_type=gpt2 \
    --length=200 \
    --model_name_or_path=gpt2 \
    --prompt="Harry witnessed Professor McGonagall walking right past Peeves who \
was determinedly loosening a crystal chandelier and could have sworn he heard her \
tell the poltergeist out of the corner of her mouth, 'It unscrews the other way.’" 2>/dev/null

'

And suddenly there  was a powerful tidal wave coming out of her ears.

'Hey..'

Professor McGonagall shook her head with a glance at Goyle,

'Woof! Don't...I'm getting...shit! Shit!'

Goyle was covered in flaming hair

'Thank god I didn't say anything about it!'

Now there was blood flowing from Professor McGonagall's face as she saw. Something yellow oozed out of her nose.

Faint red vomit appeared from the spot.

'Guzzle! Can you smell...!'

Instantly other students quickly followed in dark circles around the teleportation weapon.

Instantly darker objects appeared.

One of the red objects looked like a shinier one or two scuttlecheeks.

The red color drained from Goyle's face.

Goyle let out a terror without smiling.

But,


---

Let's see how it does with Williams' "This is just to say" (https://poets.org/poem/just-say) poem:

In [24]:
!python ./transformers/examples/run_generation.py \
    --model_type=gpt2 \
    --length=50 \
    --model_name_or_path=gpt2 \
    --stop_token="." \
    --prompt="I have eaten the plums \
that were in the icebox \
and which you were probably \
saving for breakfast" 2>/dev/null

, and at about noon you  told me to lie back and lay on my stomach  and get a taste of the plums with my mouth  then your interest was  amazing  and you   prepared for the picnic, so it was


(this BTW is one of my favorite poems ever. So sweet and so plain)

* Try different prefix inputs
* Try different temperatures with the `--temperature` argument.
* Advanced: Try a different model than GPT-2.

On the help section for the generation script you can find all the models:
```
--model_name_or_path MODEL_NAME_OR_PATH
                        Path to pre-trained model or shortcut name selected in
                        the list: gpt2, gpt2-medium, gpt2-large, distilgpt2,
                        openai-gpt, xlnet-base-cased, xlnet-large-cased,
                        transfo-xl-wt103, xlm-mlm-en-2048, xlm-mlm-ende-1024,
                        xlm-mlm-enfr-1024, xlm-mlm-enro-1024, xlm-mlm-tlm-
                        xnli15-1024, xlm-mlm-xnli15-1024, xlm-clm-enfr-1024,
                        xlm-clm-ende-1024, xlm-mlm-17-1280, xlm-mlm-100-1280,
                        ctrl
```
The `ctrl` model is very recent work (from Salesforce Research), released just a couple of weeks ago, and it's supposed to be really awesome at controlling the output text. Be warned - the model is a **6 GB download**! It might be worth it...

I attempted to use the transformer models as an alternative way to generate more fake cryptos.
First with GPT-2 and minimal ‘temperature’….


In [28]:
!python ./transformers/examples/run_generation.py \
    --model_type=gpt2 \
    --length=50 \
    --model_name_or_path=gpt2 \
    --temperature=0 \
    --prompt="BTC Bitcoin 7661.3\
ETH Ethereum 167.42\
XRP Ripple 0.282570591\
UNFLD UnfoldU 34.34\
USDT Tether 1.000337543\
TRX TRON 0.015509159\
ADA Cardano 0.038299367\
LINK ChainLink 2.75" 2>/dev/null

 USDT ChainLink 0.015514094 CMC CMC 0.015514094 CMC 0.015514094 CMC 0.015514094 CMC 0.015514094 C


This wasn't quite right.  
I want the machine to produce sequences of `symbol, name, USD price`, so I reintroduced stop tokens.
This worked a bit better...
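The effect of a stop token can be sketched as a simple string cut: everything from the first occurrence of the token onward is dropped. This is an illustration of the idea only (the helper name is hypothetical and the generation script's own implementation may differ), run on a made-up continuation.

```python
def truncate_at_stop_token(text, stop_token):
    """Return the text up to (not including) the first stop token, or all of it if none occurs."""
    if not stop_token:
        return text
    cut = text.find(stop_token)
    return text if cut == -1 else text[:cut]

# Made-up continuation, standing in for raw model output.
raw = " BTC BTC 0.0075| ETH ETH 0.01| XRP"
first_record = truncate_at_stop_token(raw, "|")
```

Ending every record in the prompt with `|` gives the model a pattern to imitate, so the first `|` it emits marks the end of exactly one generated record.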

In [30]:
!python ./transformers/examples/run_generation.py \
    --model_type=gpt2 \
    --length=50 \
    --model_name_or_path=gpt2 \
    --temperature=0 \
    --stop_token='|' \
    --prompt='BTC Bitcoin 7661.3|\
ETH Ethereum 167.42|\
XRP Ripple 0.282570591|\
UNFLD UnfoldU 34.34|\
USDT Tether 1.000337543|\
TRX TRON 0.015509159|\
ADA Cardano 0.038299367|\
LINK ChainLink 2.75|' 2>/dev/null

 BTC BTC 0.0075


Okay BTC already exists... clearly I need to increase the temperature (and maybe add more cryptos ;-))
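Checking for collisions like this one can be automated. The sketch below uses hypothetical helper names and assumes every generated line has exactly three fields with multi-word names in quotes, which the model does not always respect. It parses a line and tests whether its symbol is already among the known ones.

```python
import shlex

def parse_coin(line):
    """Split 'SYMBOL Name price' into parts; shlex keeps quoted multi-word names together."""
    symbol, name, price = shlex.split(line)
    return symbol, name, float(price)

def is_new_symbol(line, known_symbols):
    """True if the generated line's symbol is not already in the known set."""
    symbol, _, _ = parse_coin(line)
    return symbol not in known_symbols

# Symbols used in the prompt above.
known = {"BTC", "ETH", "XRP", "UNFLD", "USDT", "TRX", "ADA", "LINK"}

repeat = is_new_symbol("BTC BTC 0.0075", known)
fresh = is_new_symbol("SIC 'SIC Network' 0.012328581", known)
```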

In [42]:
!python ./transformers/examples/run_generation.py \
    --model_type=gpt2 \
    --length=100 \
    --model_name_or_path=gpt2 \
    --temperature=0.5 \
    --stop_token="|" \
    --prompt="ETH Ethereum 167.42|\
AUD2 'Aussie  Digital' 0.65982667|\
XRP Ripple 0.282570591|\
UNFLD UnfoldU 34.34|\
USDT Tether 1.000337543|\
BCH 'Bitcoin Cash' 222.35|\
GRAM 'Telegram Open Network' 1.807366053|\
LTC Litecoin 52.63|\
BNB 'Binance Coin' 17.78|\
FCT Factom 2.59|\
CVC Civic 0.037939729|\
BOTX botXcoin 0.015452991|\
ICN Iconomi 0.211692693|\
RHOC RChain 0.066656079|\
PAI2 'Project Pai' 0.01704574|\
QBIT Qubitica 30.48|\
BNK Bankera 0.000996052|\
RDD ReddCoin 0.000845153|\
MOF 'Molecular Future' 0.583074748|\
R Revain 0.048948372|\
AION Aion 0.066565099|" #2>/dev/null

10/27/2019 14:44:45 - INFO - transformers.tokenization_utils -   loading file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json from cache at /Users/aberke/.cache/torch/transformers/f2808208f9bec2320371a9f5f891c184ae0b674ef866b79c58177067d15732dd.1512018be4ba4e8726e41b9145129dc30651ea4fec86aa61f4b9f40bf94eac71
10/27/2019 14:44:45 - INFO - transformers.tokenization_utils -   loading file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt from cache at /Users/aberke/.cache/torch/transformers/d629f792e430b3c76a1291bb2766b0a047e36fae0588f9dbc1ae51decdff691b.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
10/27/2019 14:44:45 - INFO - tr

Increasing the temperature by a bit and mixing up the prefix subset of input coins....

In [49]:
!python ./transformers/examples/run_generation.py \
    --model_type=gpt2 \
    --length=100 \
    --model_name_or_path=gpt2 \
    --temperature=0.8 \
    --stop_token="|" \
    --prompt="ICN Iconomi 0.211692693|\
RHOC RChain 0.066656079|\
PAI2 'Project Pai' 0.01704574|\
QBIT Qubitica 30.48|\
BNK Bankera 0.000996052|\
RDD ReddCoin 0.000845153|\
MOF 'Molecular Future' 0.583074748|\
R Revain 0.048948372|\
AION Aion 0.066565099|" 2>/dev/null

 SIC SIC Network 0.012328581


Trying out the openai-gpt model...

In [54]:
!python ./transformers/examples/run_generation.py \
    --model_type='openai-gpt' \
    --length=100 \
    --model_name_or_path='openai-gpt' \
    --temperature=0.5 \
    --stop_token="|" \
    --prompt="GRAM 'Telegram Open Network' 1.807366053|\
LTC Litecoin 52.63|\
BNB 'Binance Coin' 17.78|\
FCT Factom 2.59|\
CVC Civic 0.037939729|\
BOTX botXcoin 0.015452991|\
ICN Iconomi 0.211692693|\
RHOC RChain 0.066656079|\
PAI2 'Project Pai' 0.01704574|\
QBIT Qubitica 30.48|\
BNK Bankera 0.000996052|\
RDD ReddCoin 0.000845153|\
MOF 'Molecular Future' 0.583074748|\
R Revain 0.048948372|\
AION Aion 0.066565099|" 2>/dev/null

  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
  np_resource = np.dtype([("resource", np.ubyte, 1)])
10/27/2019 14:58:28 - INFO - transformers.file_utils -   https://s3.amazonaws.com/models.huggingface.co/bert/openai-gpt-vocab.json not found in cache or force_download set to True, downloading to /var/folders/wt/5pyhsskj1hgdzv_4hb1gpkpm0000gn/T/tmpajygfsux
100%|███████████████████████████████| 815973/815973 [00:00<00:00, 4582443.03B/s]
10/27/2019 14:58:28 - INFO - transformers.file_utils -   copying /var/folders/wt/5pyhsskj1hgdzv_4hb1gpkpm0000gn/T/tmpajygfsux to cache at /Users/aberke/.cache/torch/transformers/4ab93d0cd78ae80e746c27c9cd34e90b470abdabe0590c9ec742df61625ba310.b9628f6fe5519626534b82ce7ec72b22ce0ae79550325f45c604a25c0ad87fd6
10/27/2019 14:58:28 - INFO - transformers.fil

10/27/2019 15:01:03 - INFO - __main__ -   Namespace(device=device(type='cpu'), length=100, model_name_or_path='openai-gpt', model_type='openai-gpt', n_gpu=0, no_cuda=False, padding_text='', prompt="GRAM 'Telegram Open Network' 1.807366053| LTC Litecoin 52.63| BNB 'Binance Coin' 17.78| FCT Factom 2.59| CVC Civic 0.037939729| BOTX botXcoin 0.015452991| ICN Iconomi 0.211692693| RHOC RChain 0.066656079| PAI2 'Project Pai' 0.01704574| QBIT Qubitica 30.48| BNK Bankera 0.000996052| RDD ReddCoin 0.000845153| MOF 'Molecular Future' 0.583074748| R Revain 0.048948372| AION Aion 0.066565099|", repetition_penalty=1.0, seed=42, stop_token='|', temperature=0.5, top_k=0, top_p=0.9, xlm_lang='')
100%|█████████████████████████████████████████| 100/100 [00:35<00:00,  2.29it/s]
rdl. 9093982744. 
 " i don't know what to say, " said the reporter. 
 " i don't either, " said the reporter. " but i think we should do something about this. " 
 " what? " 
 " i mean, i think we should do something about this. " 
 

That didn't produce cryptos at all!

## The Real Data is weird enough...

While debugging, I came across names of existing crypto coins that are weirder than anything ML could have randomly come up with, and they are trading at non-zero $USD values.

```
MAY "Theresa May Coin" 0.000197574
TSE "Tattoocoin (Standard Edition)" 0.000217536
FLUZ 'Fluz Fluz' 0.022218693
LALA 'LALA World' 0.017091302
POE Po.et 0.00214985
EMC2 Einsteinium 0.042217778
FAT Fatcoin 0.018319265
GBC2 'Gold Bits Coin' 0.022218693
```

The full list of my input cryptos is on GitHub: https://github.com/aberke/city-visions/blob/master/week-2/data/cryptocurrencies_data.txt

This is a classic example of why you should inspect your data before processing it. I should have realized the futility of this art project from the start: I wasn't going to make something weirder than what the world of cryptofans had already invested in.
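A quick way to look over data like this before feeding it to a model is to parse each line and print a few summaries. The sketch below assumes every line follows the `SYMBOL Name price` pattern shown above, with multi-word names wrapped in single or double quotes. The function name `parse_coin_line` is made up, and the real data file may contain lines that break this assumption.

```python
import shlex

def parse_coin_line(line):
    """Split 'SYMBOL Name price' into (symbol, name, price); names may be quoted."""
    symbol, name, price = shlex.split(line)
    return symbol, name, float(price)

lines = [
    'MAY "Theresa May Coin" 0.000197574',
    "FLUZ 'Fluz Fluz' 0.022218693",
    "POE Po.et 0.00214985",
]
coins = [parse_coin_line(line) for line in lines]
longest = max(coins, key=lambda coin: len(coin[1]))
print(len(coins), "coins; longest name:", longest[1])
```

A line that does not split into exactly three fields raises a `ValueError`, which is a cheap way to spot malformed entries early.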

## ChatBot

[Chatbots](https://en.wikipedia.org/wiki/Chatbot) are conversational AI agents that can respond to text input. They are still a long way from holding a convincing conversation in general open-ended scenarios, but in certain applications chatbots are a big success, e.g. in the online portals of the public services industry.

`huggingface` has again released pretrained models, this time for transformer-based chatbots, just a few months ago: https://medium.com/huggingface/how-to-build-a-state-of-the-art-conversational-ai-with-transfer-learning-2d818ac26313#79c5

You can also use their online demo: https://convai.huggingface.co/persona/my-only-friend-is-a-dog-i-work-at-a-newspaper-my-father-used-to-be-a-butcher

In [55]:
!pip install -q pytorch_transformers pytorch-ignite

In [56]:
!git clone https://github.com/huggingface/transfer-learning-conv-ai

Cloning into 'transfer-learning-conv-ai'...
remote: Enumerating objects: 7, done.
remote: Counting objects: 100% (7/7), done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 96 (delta 0), reused 1 (delta 0), pack-reused 89
Unpacking objects: 100% (96/96), done.


In [75]:
import sys,threading,subprocess,os

In [96]:
def chatbot_proc():
    proc = subprocess.Popen([sys.executable, 
                             os.getcwd()+'/transfer-learning-conv-ai/interact.py'
                            ],
                            stdout=subprocess.PIPE,
                            stdin=subprocess.PIPE,
                            stderr=subprocess.DEVNULL
                           )
    pout = proc.stdout
    pin = proc.stdin
    
    return proc,pout,pin

In [97]:
cb1_proc, cb1_pout, cb1_pin = chatbot_proc() # create a chatbot process

In [98]:
print('cb1_proc')
cb1_proc

cb1_proc


<subprocess.Popen at 0xb3763de10>

In [99]:
cb1_pin.write(b"--temperature=1.1\n"), cb1_pin.flush()

(18, None)
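Note that anything written to the process's stdin is presumably read as a chat message, so a string such as `--temperature=1.1` would be treated as conversation input rather than as an option. Sampling options are normally passed on the command line when the process is started. The sketch below only builds the argument list. It assumes `interact.py` accepts a `--temperature` flag (check `python interact.py --help` to confirm), and the helper name `chatbot_command` is made up.

```python
import os
import sys

def chatbot_command(script_path, **options):
    """Build an argv list such as [python, interact.py, '--temperature', '1.1']."""
    cmd = [sys.executable, script_path]
    for name, value in options.items():
        cmd += ["--" + name, str(value)]
    return cmd

script = os.path.join(os.getcwd(), "transfer-learning-conv-ai", "interact.py")
cmd = chatbot_command(script, temperature=1.1)  # --temperature is an assumed flag
print(cmd)
```

The resulting list could replace the argument list given to `subprocess.Popen` in `chatbot_proc()`.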

In [100]:
print(cb1_pout.readline().decode(sys.stdout.encoding))
# Note: readline() blocks until the chatbot process writes a complete line.

KeyboardInterrupt: 

Talk to your chatbot!

In [None]:
cb1_pin.write(b"i'm doing mighty fine! and how are you?\n"), cb1_pin.flush();
print(cb1_pout.readline().decode(sys.stdout.encoding))

In [None]:
cb1_pin.write(b"no way! i'm also listening to music. what music are you listening to?\n"), cb1_pin.flush();
print(cb1_pout.readline().decode(sys.stdout.encoding))

It's also quite funny to get it to talk to itself - it never gets tired!

In [192]:
cb1_output = b"i am listening to a lot of pop music\n"

In [193]:
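partyA = True
for _ in range(10):
    partyA = not partyA  # alternate the speaker label each turn
    # Feed the previous reply back in as the next message.
    cb1_pin.write(cb1_output)
    cb1_pin.flush()
    # Drop the first 4 bytes of the reply line (presumably the ">>> " input prompt).
    cb1_output = cb1_pout.readline()[4:]
    print("%s: %s" % ('A' if partyA else 'B',
          cb1_output[:-1].decode(sys.stdout.encoding)))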

B: yeah, i know what you mean.
A: what do you do for a living?
B: i am a mechanic.
A: i am a pilot.
B: what do you do for work?
A: i fix planes.
B: what kind of planes do you have?
A: do you have any hobbies?
B: i like to listen to music.
A: what kind of music do you like?


In [194]:
cb1_proc.kill() # kill the chatbot process

* Try some different inputs
* Advanced: Spin up another chatbot and have them talk to one another (by feeding the outputs across)
* Advanced: Use a different underlying model than GPT-2 for your chatbot.
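
For the two-chatbot exercise, one possible structure is to wrap each chatbot process as a function that takes a message and returns a reply, then alternate between the two. This is only a sketch: `make_pipe_bot` and `converse` are made-up names, the 4-character prompt prefix is assumed from the self-talk loop above, and the demo uses stand-in functions instead of real models so that it runs on its own.

```python
def make_pipe_bot(pin, pout, prompt_len=4):
    """Wrap a chatbot's stdin/stdout pipes as a function: message in, reply out."""
    def reply(message):
        pin.write(message.encode() + b"\n")
        pin.flush()
        # Drop the leading prompt characters, as the self-talk loop does with [4:].
        return pout.readline()[prompt_len:].decode().strip()
    return reply

def converse(bot_a, bot_b, opener, turns=6):
    """Alternate between two bots, feeding each one the other's last reply."""
    transcript = []
    message = opener
    bots = [("A", bot_a), ("B", bot_b)]
    for i in range(turns):
        name, bot = bots[i % 2]
        message = bot(message)
        transcript.append((name, message))
    return transcript

# Stand-in bots so the sketch runs without loading a model.
echo = lambda m: "you said: " + m
shout = lambda m: m.upper()
for name, line in converse(echo, shout, "hello", turns=4):
    print(name + ":", line)
```

With real processes, `bot_a` could be `make_pipe_bot(cb1_pin, cb1_pout)` and `bot_b` could be built the same way from a second call to `chatbot_proc()`.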

---
That's a wrap!