<a href="https://colab.research.google.com/github/avivajpeyi/investigating-gpt2/blob/master/GPT2_Investigation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Investigating OpenAI's GPT-2


In this notebook, we can try out a small version (345M parameters vs. the full 1.5B-parameter model) of OpenAI's GPT-2 model, described in the paper [Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf). OpenAI has also published a more human-friendly [blog post](https://openai.com/blog/better-language-models/) about the model.

## Notes about model







### Background
The GPT-2 algorithm was trained on the task of *language modeling*, which tests a program's ability to predict the next word in a given sentence, by ingesting a massive amount of text data (40GB of articles, blogs, and websites). Using this data, it achieved:

*   "*zero-shot learning*": state-of-the-art scores on a number of unseen language tests
*   [Unintentional (albeit not-so-great) Language translation](https://i1.wp.com/slatestarcodex.com/blog_images/english-french.png?zoom=2&w=700)
*   *TLDR summarization*
*   *Text completion*
*  *Reading comprehension*
*  [Essay Writing](https://pbs.twimg.com/media/DzYpsJOU0AA1PO9.png:large)
*   more...
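
To make the *language modeling* task described above concrete, here is a toy sketch that predicts the next word by counting which word most often follows the current one. This is only an illustration of the task: GPT-2 itself is a large neural network, not a word-pair counter, and the function names and the tiny corpus are made up for this example.

```python
from collections import Counter, defaultdict

def train_bigram_model(text):
    """Count, for every word, which words follow it in the training text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        following[current_word][next_word] += 1
    return following

def predict_next_word(model, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Hypothetical toy corpus, purely for illustration.
corpus = "the cat sat on the mat . the cat ate the fish ."
model = train_bigram_model(corpus)
print(predict_next_word(model, "the"))  # prints "cat"
```

A model like this only ever looks one word back; the point of GPT-2 is that it conditions on a much longer context.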



### Why this is cool
*   Another step to AGI
*   Improving question answering
*   Recovering historical data (interpolation through text)
*   Helping explain difficult-to-understand texts to non-native English speakers/novices
*   Potentially use it as a tool to weed out "fake news"

### Why it's scary: FAKE NEWS

OpenAI hasn't released the dataset, training code, or the full GPT-2 model weights. This is due to concerns about large language models being used to generate deceptive, biased, or abusive language at scale. Some examples of malicious applications of these models are:
* Generate misleading news articles
* Impersonate others online
* Automate the production of abusive or faked content to post on social media
* Automate the production of spam/phishing content

As one can imagine, this combined with recent advances in generation of synthetic imagery, audio, and video implies that it's never been easier to create fake content and spread disinformation at scale (check out [this paper](http://grail.cs.washington.edu/projects/AudioToObama/siggraph17_obama.pdf) on making an AI that synthesizes photorealistic, lip-synced [video](https://www.youtube.com/watch?v=9Yq67CjDqvw) of Obama). The public at large will need to become more skeptical of the content they consume online. 

Also scary because this is yet another step to AGI.

### My opinions
I think GPT-2 is a brute-force statistical pattern matcher which chews up the internet and gives you back a slightly confusing dump of it when asked... It seems like a plagiarism model that transforms its sources so much, thanks to the large dataset, that none of the original context remains... so is it no longer plagiarism?

From what I understand, there is not really a new algorithmic contribution here; I think they are scaling up previous research. But I think seeing exactly how strong these scaled-up models are is an awesome contribution and challenge. It’s easy to say in retrospect “of course more data and compute gives you better models”.

#### Skeptical
I am a little skeptical of OpenAI's results. They haven't released the full model (and they probably shouldn't), but have they just cherry-picked their results? Need to look more into this.

Maybe trying out their 345M Parameter model will make me a believer! 

#### Is it really closer to a true AGI?
Sure, this could do cool stuff with infinite training data and limitless computing resources, but that's true of a lot of ML projects. Scaling up the data needed to learn is a tough problem.

A true AGI will have to be much better at learning from limited datasets with limited computational resources. It will have to investigate the physical world with the same skill that GPT-2 investigates text....

#### Wake up non-believers! 
AIs already pick up abilities that we don't expect them to learn, e.g. English-to-French translation with hardly any French text in their training corpus. GPT-2 is a good counterexample to the claim that AIs only learn what you program them to learn, and it shows that they can do more than one specific task.

#### Should it be released? Tough question!
The resources needed to train the full model are beyond the average person and small companies, which could use it for potentially very interesting non-malicious applications. However, the large organizations and state actors that are most likely to use this for malicious purposes typically already have easy access to the resources needed to replicate the full model.

Therefore, by not releasing the full model, "Open"AI is in a way ensuring that this sort of AI tech remains in the hands of the powerful organizations and state actors that are most likely to misuse it, while at the same time unintentionally tricking the general public into thinking this tech is not "really" available yet....

## Setup

In [0]:
# testing GPU connection
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found; Go to Runtime > Change Runtime Type > Hardware Accelerator > GPU')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


In [0]:
# Clone repo 
! git clone https://github.com/openai/gpt-2.git

Cloning into 'gpt-2'...
remote: Enumerating objects: 174, done.

In [0]:
# Move to gpt-2 repo
import os
os.chdir('gpt-2')

In [0]:
# download model and install dependencies
!python download_model.py 345M
!pip3 install -r requirements.txt

Fetching checkpoint: 1.00kit [00:00, 746kit/s]
Fetching encoder.json: 1.04Mit [00:00, 51.0Mit/s]
Fetching hparams.json: 1.00kit [00:00, 585kit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 1.42Git [00:22, 62.6Mit/s]                                 
Fetching model.ckpt.index: 11.0kit [00:00, 7.78Mit/s]                                               
Fetching model.ckpt.meta: 927kit [00:00, 52.8Mit/s]                                                 
Fetching vocab.bpe: 457kit [00:00, 49.9Mit/s]                                                       
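
Before generating anything, it can be worth confirming that the download finished. The sketch below checks for the files named in the `Fetching ...` lines above. It assumes the download script stores them under `models/345M` relative to the repo root; that path and the helper name `missing_model_files` are assumptions for this example.

```python
import os

# File names taken from the "Fetching ..." lines printed by download_model.py.
EXPECTED_FILES = [
    "checkpoint",
    "encoder.json",
    "hparams.json",
    "model.ckpt.data-00000-of-00001",
    "model.ckpt.index",
    "model.ckpt.meta",
    "vocab.bpe",
]

def missing_model_files(model_dir):
    """Return the expected files that are not present in model_dir."""
    return [name for name in EXPECTED_FILES
            if not os.path.isfile(os.path.join(model_dir, name))]

# Assumption: the weights live under models/<model_name> inside the repo.
missing = missing_model_files(os.path.join("models", "345M"))
print("All files present" if not missing else "Missing: {}".format(missing))
```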


# Unconditional sample generation
Generates text samples on the whim of the AI

In [0]:
!python3 src/generate_unconditional_samples.py --model_name='345M' --nsamples=3 --top_k=10

2019-05-10 05:13:12.112505: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300000000 Hz
2019-05-10 05:13:12.112776: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x2301700 executing computations on platform Host. Devices:
2019-05-10 05:13:12.112813: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-05-10 05:13:12.252840: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-05-10 05:13:12.253371: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x2301440 executing computations on platform CUDA. Devices:
2019-05-10 05:13:12.253403: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2019-05-10 05:13:12.253774: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found de

# Conditional sample generation
Generates text samples conditional on some user input

## Notes on flags:
The code comes with a few flags available, with a default value:



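* **seed** = None || a random seed is used unless specified. Give a specific integer value if you want to reproduce the same results in the future.
* **nsamples** = 1 || the number of samples to generate and print.
* **length** = None || number of tokens to generate for each sample. Tokens are byte-pair-encoded word pieces, not whole words. If None, the length is determined by the model's hyperparameters.
* **batch_size** = 1 || how many samples are generated simultaneously. Only affects speed and memory use, not the results; it must divide `nsamples`.
* **temperature** = 1 || the logits are divided by this value before the softmax. Lower values make the output less random (more deterministic and repetitive); higher values make it more random.
* **top_k** = 0 || at each step, only the `k` tokens with the highest logits are considered. 0 (the default) means no restriction; 40 generally works well.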
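
To get a feel for what `temperature` and `top_k` do, here is a minimal pure-Python sketch of temperature scaling followed by top-k filtering and sampling. It is not the repo's TensorFlow implementation; the function names and the example logits are invented for illustration.

```python
import math
import random

def next_token_probs(logits, temperature=1.0, top_k=0):
    """Turn raw logits into sampling probabilities using temperature and top-k."""
    # Dividing by the temperature sharpens (<1) or flattens (>1) the distribution.
    scaled = [x / temperature for x in logits]
    # top_k == 0 means "no restriction": every token stays a candidate.
    if top_k > 0:
        # The k-th largest value is the cutoff; smaller values are masked out.
        # Ties at the cutoff are all kept, so more than k tokens may survive.
        cutoff = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [x if x >= cutoff else float("-inf") for x in scaled]
    # Numerically stable softmax; masked entries get probability 0.
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def sample_next_token(logits, temperature=1.0, top_k=0, seed=None):
    """Draw one token index; pass an integer seed for reproducible draws."""
    rng = random.Random(seed)
    probs = next_token_probs(logits, temperature, top_k)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical logits for a three-token vocabulary.
example_logits = [2.0, 1.0, 0.1]
print(next_token_probs(example_logits, top_k=1))  # prints [1.0, 0.0, 0.0]
print(sample_next_token(example_logits, temperature=0.8, top_k=2, seed=42))
```

With `top_k=1` the choice is fully deterministic, which is why very small `top_k` or `temperature` values tend to give repetitive text.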

## Different use cases:


**1.   Text completion**

**2.  Question answering**

**3.   Summarisation**

**4.  Translation**


### 1. Text Completion



Feed in some random text and see what the AI generates from that!



#### Example usage

> `!python3 src/interactive_conditional_samples.py --nsamples=2 --top_k=40 --temperature=.80 --model_name='345M'`


> Model prompt >>> "Our solar system consists of the inner and outer planets, separated by an asteroid belt. It has "

> Model prompt >>> "The 10 best foods are: 1. Peanut butter jelly sandwiches 2. Marshmallows 3. Broccoli 4."


#### Try it out yourself!

In [0]:
!python3 src/interactive_conditional_samples.py --nsamples=2 --top_k=40 --temperature=.80 --model_name='345M'


### 2. Question-Answering



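Feed in a passage, followed by some question/answer pairs (`Q: blah blah? A: Blah blah.`), then a final question and the token `A:`. The AI will answer the last `Q:`.


Note: for a very short answer (e.g. Yes/No, or a city name), set the flag `--length=1`. This limits the output to a single token, which may be only part of a word.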



#### Example usage



> `!python3 src/interactive_conditional_samples.py --nsamples=10 --top_k=40 --temperature=.80 --length=1 --model_name='345M'`

> Model Prompt >>> 
 
> ```
The 2008 Summer Olympics torch relay was run from March 24 until August 8, 2008, prior to the 2008 Summer
Olympics, with the theme of “one world, one dream”. Plans for the relay were announced on April 26, 2007, in
Beijing, China. The relay, also called by the organizers as the “Journey of Harmony”, lasted 129 days and carried
the torch 137,000 km (85,000 mi) – the longest distance of any Olympic torch relay since the tradition was started
ahead of the 1936 Summer Olympics.
After being lit at the birthplace of the Olympic Games in Olympia, Greece on March 24, the torch traveled to the Panathinaiko Stadium in Athens, and then to Beijing, arriving on March 31. From Beijing, the torch was
following a route passing through six continents. The torch has visited cities along the Silk Road, symbolizing
ancient links between China and the rest of the world. The relay also included an ascent with the flame to the top of
Mount Everest on the border of Nepal and Tibet, China from the Chinese side, which was closed specially for the
event.
Q: What was the length of the race?
A: 137,000 km
Q: Was it larger than previous ones?
A: No
Q: Where did the race begin?
A: Olympia, Greece
Q: Where did they go after?
A: Athens
Q: How many days was the race?
A: seven
Q: Did they visit any notable landmarks?
A: Panathinaiko Stadium
Q: And did they climb any mountains?
A:
```
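
The prompt layout above (passage, answered `Q:`/`A:` pairs, then an open `A:`) can also be assembled programmatically. This is a small sketch; `build_qa_prompt` is a hypothetical helper, not part of the gpt-2 repo.

```python
def build_qa_prompt(passage, answered_pairs, new_question):
    """Assemble a passage, example Q/A pairs, and a final open question."""
    lines = [passage.strip()]
    # Each answered pair shows the model the Q:/A: pattern to continue.
    for question, answer in answered_pairs:
        lines.append("Q: {}".format(question))
        lines.append("A: {}".format(answer))
    # End on an empty "A:" so the model's continuation is the answer.
    lines.append("Q: {}".format(new_question))
    lines.append("A:")
    return "\n".join(lines)

prompt = build_qa_prompt(
    "The relay lasted 129 days and carried the torch 137,000 km.",
    [("What was the length of the race?", "137,000 km")],
    "How many days was the race?",
)
print(prompt)
```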


#### Try it out yourself!

In [0]:
!python3 src/interactive_conditional_samples.py --nsamples=10 --top_k=40 --temperature=.80 --length=1 --model_name='345M'


### 3. Summarization



Feed in a passage and add the text *`TL;DR:`* or *`Summary:`* at the end, and the AI will try to summarise the text.




#### Example usage

Note the following passage was obtained from a blog post about the [mars-water paradox](http://www.planetary.org/blogs/guest-blogs/2019/mars-water-stable-paradox.html).


> `!python3 src/interactive_conditional_samples.py --nsamples=3 --length=100 --temperature=1 --model_name='345M'`

> Model Prompt >>>

>```
Mars has been the most extensively studied planet in the Solar System, except of course Earth. For the last 25 years, these missions have focused on the search for life by “following the water.” Although we have acquired compelling evidence of flowing liquid water on early Mars, the fundamental question about how water could be stable under Martian atmospheric conditions remains unsolved. Everything we have learned about Mars points towards a freezing cold Martian climate that would be incapable of stabilizing liquid water throughout Mars’ history.
The two ideas that suggest liquid water could not be stable on early Mars are the “Faint Young Sun Paradox” and the Martian orbit. The following is a summary of two recent papers about the problem of Mars’ early climate: “The climate of early Mars,” by Robin Wordsworth, and a book chapter by Robert Haberle and coauthors, “The Early Mars Climate System.” Mars today as we know it is a cold and dry desert with a thin atmosphere not capable of stabilizing liquid water on its surface. However, there is ample evidence that Mars had flowing liquid water on its surface about 4 to 3.7 billion years ago (named as the Noachian Period). The evidence gathered by Mars orbiters, rovers, and landers is geomorphological; (valley networks, crater lakes, purported Northern ocean, glacial landforms, etc.); mineralogical (iron- and magnesium-rich clay minerals, sulfates, chlorides, iron oxides, and oxyhydroxides, etc.); and isotopic (noble gases, nitrogen, hydrogen, oxygen and carbon).
TL;DR: 
```

> Model Prompt >>>
>```
Theodore McCarrick is the most senior Catholic figure to be dismissed from the priesthood in modern times.
US Church officials said allegations he had sexually assaulted a teenager five decades ago were credible.
Mr McCarrick, 88, had previously resigned but said he had "no recollection" of the alleged abuse.
"No bishop, no matter how influential, is above the law of the Church," Cardinal Daniel DiNardo, president of the United States Conference of Catholic Bishops said in a statement.
"For all those McCarrick abused, I pray this judgment will be one small step, among many, toward healing."
The alleged abuses may have taken place too long ago for criminal charges to be filed because of the statute of limitations.
Mr McCarrick was the archbishop of Washington DC from 2001 to 2006. Since his resignation last year from the College of Cardinals, he has been living in seclusion in a monastery in Kansas.
He was the first person to resign as a cardinal since 1927.
He is among hundreds of members of the clergy accused of sexually abusing children over several decades and his dismissal comes days before the Vatican hosts a summit on preventing child abuse.
The Vatican said Pope Francis had ruled Mr McCarrick's expulsion from the clergy as definitive, and would not allow any further appeals against the decision. 
TL;DR: 
```
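
Both prompts above follow the same recipe: the passage, then a summarisation cue on its own line. A minimal sketch of that recipe, with a hypothetical helper name, could look like this:

```python
def build_summary_prompt(passage, cue="TL;DR:"):
    """Append a summarisation cue (e.g. "TL;DR:" or "Summary:") to a passage."""
    # Strip stray whitespace so the cue always starts on its own line.
    return passage.strip() + "\n" + cue

print(build_summary_prompt("Mars today is a cold and dry desert.", cue="Summary:"))
```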


#### Try it out yourself!

In [0]:
!python3 src/interactive_conditional_samples.py --nsamples=3 --length=100 --temperature=1 --model_name='345M'


### 4. Translation



Provide a few example translations, using the format of *`english_text = other_language_text`*, and then *`english_text =`*  at the end. The AI will try to translate the English text into the other language.




#### Example usage

> `!python3 src/interactive_conditional_samples.py --nsamples=3 --temperature=1 --model_name='345M' `

> Model Prompt >>>

> ```
Good morning. = Buenos días.
I am lost. Where is the restroom? = Estoy perdido. ¿Dónde está el baño?
How much does it cost? = ¿Cuánto cuesta?
How do you say maybe in Spanish? = ¿Cómo se dice maybe en Español?
Would you speak slower, please. = Por favor, habla mas despacio.
Where is the book store? = ¿Dónde está la librería?
At last a feminist comedian who makes jokes about men. = Por fin un cómico feminista que hace chistes sobre hombres.
How old are you? = 
```
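
The few-shot layout above (several `english = translation` lines, then one unfinished line) is easy to generate from a list of pairs. The helper below is a hypothetical sketch, not something shipped with the gpt-2 repo.

```python
def build_translation_prompt(example_pairs, text_to_translate):
    """Build a few-shot prompt of "english = translation" lines."""
    lines = ["{} = {}".format(english, translated)
             for english, translated in example_pairs]
    # Leave the right-hand side empty so the model fills in the translation.
    lines.append("{} =".format(text_to_translate))
    return "\n".join(lines)

examples = [
    ("Good morning.", "Buenos días."),
    ("How much does it cost?", "¿Cuánto cuesta?"),
]
print(build_translation_prompt(examples, "How old are you?"))
```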

#### Try it out yourself!

In [0]:
!python3 src/interactive_conditional_samples.py --nsamples=3 --temperature=1 --model_name='345M' 
