**IU000128 Coding3: Exploring Machine Intelligence Part 1** 

For more information, see [Coding3 part 2](https://colab.research.google.com/drive/1l3z4I9KehQgtLqnOCSWH3m6II_xshuiF#scrollTo=PiyjEOM1XoL7)

---


## GPT-2 (small) and Text Generation - by YIFAN FENG


#### Main Reference 


*   [GPT-2 Fine Tune](https://colab.research.google.com/gist/MattPitlyk/45541145ad48b93da395f0a72ec2e7dc/fine-tuning-gpt-2-on-a-custom-dataset.ipynb)

*   [GPT-2 Custom Dataset](https://medium.com/ai-innovation/beginners-guide-to-retrain-gpt-2-117m-to-generate-custom-text-content-8bb5363d8b7f)

*   [Training Dataset: Oxford Handbook of Ethics of AI](https://www.oxfordhandbooks.com/view/10.1093/oxfordhb/9780190067397.001.0001/oxfordhb-9780190067397)


### Import Libs and Set up Env

In [None]:
from google.colab import drive #Access GoogleDrive 
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
#set-up 
import nltk, re, string
from string import punctuation 
from nltk.tokenize import word_tokenize

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')
stopwords_list = stopwords.words('english') 

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


### Basic Text Cleaning
#### Reference 
*   [IU000131: NLP for the Creative Industries Week 1.2](https://git.arts.ac.uk/lmccallum/nlp-21-22/blob/master/NLP%20Week%201.2-Solutions.ipynb)
*   [Clean and Tokenize Text With Python](https://dylancastillo.co/nlp-snippets-clean-and-tokenize-text-with-python/)

In [None]:
#Use the following functions to clean the dataset
#Exclude excessive whitespace at the start and end of each line

def remove_whitespace(textfile):
  return "\n".join(re.findall(r"^\s*\b(.*?)\s*$", textfile, flags=re.MULTILINE))

#Exclude excessive paragraph breaks (runs of three line-break characters)
def remove_newlines(textfile):
  return re.sub(r'[\r\n]{3}',"", textfile) 

#Exclude punctuation
def remove_punctuation(textfile):
  return re.sub(f"[{re.escape(punctuation)}]", "", textfile)

#Exclude stopwords
def remove_stopwords(textfile):
  wordtokens = textfile.split()
  clean_wordtokens = [word for word in wordtokens if word not in stopwords_list]
  clean_text = " ".join(clean_wordtokens)
  return clean_text 

#Exclude standalone numbers
def remove_numbers(textfile):
  return re.sub(r"\b[0-9]+\b\s*", "", textfile)

#Exclude any remaining tokens made up only of digits
def remove_digits(textfile):
  new_text = " ".join([w for w in textfile.split() if not w.isdigit()])
  return new_text

def processText(text):
  rm_whitespace = remove_whitespace(text)
  rm_newlines = remove_newlines(rm_whitespace)
  rm_punctuation = remove_punctuation (rm_newlines)
  rm_numbers = remove_numbers(rm_punctuation)
  rm_digits = remove_digits(rm_numbers)
  rm_stoppers = remove_stopwords(rm_digits)
  return rm_stoppers
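
To see what this kind of cleaning does to a short passage, here is a minimal, self-contained sketch. It is not the exact pipeline above: `DEMO_STOPWORDS` is a tiny hypothetical stand-in for the NLTK stopword list, and `clean_demo` is an illustrative helper that applies similar steps in the same order.

```python
import re
from string import punctuation

# Hypothetical stand-in for nltk's English stopword list, kept tiny so the
# example runs without downloading any corpus.
DEMO_STOPWORDS = {"the", "is", "of", "and", "a", "in", "to"}

def clean_demo(text):
    """Apply steps similar to processText, in the same order."""
    text = text.lower()
    # Trim leading/trailing whitespace on every line.
    text = "\n".join(line.strip() for line in text.splitlines())
    # Collapse runs of three or more line breaks into one paragraph break.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Drop ASCII punctuation.
    text = re.sub(f"[{re.escape(punctuation)}]", "", text)
    # Drop tokens that consist only of digits, then drop stopwords.
    tokens = [w for w in text.split() if not w.isdigit()]
    tokens = [w for w in tokens if w not in DEMO_STOPWORDS]
    # Joining on single spaces also flattens every remaining line break.
    return " ".join(tokens)

sample = "  The Ethics of AI, 2020:  \n\n\n\n  Accountability is a key principle in design.  "
cleaned = clean_demo(sample)
print(cleaned)
```

Because the stopword step splits on whitespace and re-joins with single spaces, the cleaned text ends up as one long line. The same holds for `remove_stopwords` and `remove_digits` above, so the fine-tuning dataset contains no paragraph structure or punctuation.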

In [None]:
#Read the text file and lowercase it
textfile_path = '/content/drive/MyDrive/Coding3/oxford_ai.txt'
with open(textfile_path, 'rt') as file:
  text = file.read().lower()

#Process text file for a new document
rm_stoppers = processText(text)

In [None]:
updated_textfile = rm_stoppers #reassign a variable name to the string

In [None]:
#Write the cleaned text to a new file
newfile_path = '/content/drive/MyDrive/Coding3/oxford_ai_cleaned.txt'
with open(newfile_path, 'w') as new_file:
  new_file.write(updated_textfile)

### GPT-2 Fine-Tuning

*For the main references, see above. Additional reading:*


*   [Understanding GPT-2 source code 1](https://medium.com/analytics-vidhya/understanding-the-gpt-2-source-code-part-1-4481328ee10b)
*   [Understanding GPT-2 source code 2](https://medium.com/analytics-vidhya/understanding-the-gpt-2-source-code-part-2-4a980c36c68b)



In [None]:
!pip install -q gpt-2-simple
%tensorflow_version 1.x
import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files
import tensorflow as tf

  Building wheel for gpt-2-simple (setup.py) ... done
TensorFlow 1.x selected.
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



In [None]:
gpt2.download_gpt2(model_name="355M") #other options: 124M

Fetching checkpoint: 1.05Mit [00:00, 491Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 2.86Mit/s]
Fetching hparams.json: 1.05Mit [00:00, 284Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 1.42Git [00:48, 29.5Mit/s]                                 
Fetching model.ckpt.index: 1.05Mit [00:00, 282Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 2.88Mit/s]
Fetching vocab.bpe: 1.05Mit [00:00, 3.69Mit/s]


In [None]:
'''
#for file store in google drive
gpt2.mount_gdrive()
gpt2.copy_file_from_gdrive(file_name)
'''
file_name = "/content/drive/MyDrive/Coding3/oxford_ai_cleaned.txt"
checkpoint_name = "/content/drive/MyDrive/Coding3/gpt2_fine_tuning_run_1" 
model_size = '355M'
checkpoint_file = '/content/drive/MyDrive/Coding3/gpt2_fine_tuning_run_1/model-1000.data-00000-of-00001'

tf.reset_default_graph() #reset the graph so finetune() can be re-run if training fails
sess = gpt2.start_tf_sess() 
gpt2.finetune(sess,
              dataset=file_name,
              model_name= model_size, #smaller option: 124M
              model_dir='models', 
              steps=1000, #training iteration, total ~8.5k 
              restore_from= 'latest', #other options: fresh or checkpoint_file
              run_name = checkpoint_name, 
              save_every= 400, #write a checkpoint every N steps
              use_memory_saving_gradients = True,
              only_train_transformer_layers = False,
              accumulate_gradients = 1,
              learning_rate=0.0001, 
              sample_every=100) #generate samples every N steps

#Save checkpoint to Google Drive 
gpt2.copy_checkpoint_to_gdrive(checkpoint_name) 

For larger models, the recommended finetune() parameters are:
	use_memory_saving_gradients = True
	only_train_transformer_layers = True
	accumulate_gradients = 1

Loading checkpoint /content/drive/MyDrive/Coding3/gpt2_fine_tuning_run_1/model-5000
INFO:tensorflow:Restoring parameters from /content/drive/MyDrive/Coding3/gpt2_fine_tuning_run_1/model-5000
Loading dataset...


100%|██████████| 1/1 [00:03<00:00,  3.33s/it]


dataset has 326722 tokens
Training...
�joshua kroll et al eth r evans “regulating scheduling autonomous intelligent systems” science joshua kroll et al “regulating scheduling autonomous intelligent systems human role scheduling practices” lancet joshcia calders karen ec levy greg hager “assessing risk assessment action” minnesota law review –depaulo bel deborah kashy susan leigh starck “bias computer systems” acm joshua kroll et al “algorithmic decisionmaking based machine learning” figure slight unorthodoxy may result positive attributes—inability human oversight oversight oversight—but also potential risks outlined following discussion two aspects kashy susan leigh starck’s book society algorithms1Initialized many discussions algorithmic bias theory ai2 become involved controversies particular researchers project building neural networks causal ancestors tomahawks phase issues causal dependence neu evolution epistemology ethics ai depend data tradition goes way back natural social cr
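
As a rough sanity check on the training length, the log's token count can be compared with the number of tokens seen during one run. This is a back-of-the-envelope sketch: `tokens_per_step = 1024` is an assumption about how many tokens gpt-2-simple samples per step with a batch size of 1, not a value taken from this notebook.

```python
def approx_passes(steps, dataset_tokens, tokens_per_step=1024):
    """Estimate how many times training sweeps over the dataset."""
    # ASSUMPTION: each step trains on one window of `tokens_per_step` tokens.
    return steps * tokens_per_step / dataset_tokens

dataset_tokens = 326722  # "dataset has 326722 tokens" in the log above
steps = 1000             # steps argument passed to finetune()

passes = approx_passes(steps, dataset_tokens)
print(f"~{passes:.2f} passes over the dataset per {steps}-step run")
```

Under that assumption, each 1000-step run covers the cleaned text roughly three times, so several restored runs in a row give the model many exposures to a fairly small corpus.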

### Explore Conditional Text Generation


*   [Conditional Text GPT-2](https://www.ivanlai.project-ds.net/post/conditional-text-generation-by-fine-tuning-gpt-2)




In [None]:
'''
#Load the model from checkpoint
gpt2.copy_checkpoint_from_gdrive(run_name=checkpoint_name)
tf.reset_default_graph()
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name=checkpoint_name)
'''

In [None]:
input_prompt = "Artificial Intelligence is… "
gpt2_text_generator = gpt2.generate(sess, 
                                    run_name= checkpoint_name, #restore from checkpoint 
                                    temperature=0.2, 
                                    length=100, 
                                    prefix=input_prompt, 
                                    top_k=70, 
                                    nsamples=3) #number of samples to generate
                                    #destination_path = "/content/drive/MyDrive/Coding3/gpt2_text.txt")  

Artificial Intelligence is… ict impact every human endeavor ict design london nicholas g shal ows internet brains new york liveright publishing corp next set technologies underwriting ai ethical principles including transparency accountability among paul rauwolf joanna j bryson yourself34 ai could always use feedback improve core algorithms make better even without additional enhancements could always benefit existing ai systems since require input data models different audiences also feedback improve predictive accuracy although noble intentions crow many cases leads people developing ai become tech
Artificial Intelligence is… ict impact every human endeavor ict design ict user experience ict design ict user experience designer x   x ai system generates value user experience leads better customer acquisition experience wide range ethical desirable outcomes ict users google facebook twitter could potential lead better understanding trust algorithms google facebook twitter could help 
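
The `temperature` and `top_k` arguments control how each next token is drawn. The sketch below illustrates the general technique on a toy list of scores; it is not gpt-2-simple's own code, and `top_k_temperature_probs`, `sample_token`, and the example logits are hypothetical names and values chosen for illustration.

```python
import math
import random

def top_k_temperature_probs(logits, temperature=1.0, top_k=0):
    """Return sampling probabilities after top-k filtering and temperature scaling."""
    # Keep only the top_k highest-scoring tokens (top_k=0 keeps everything).
    if top_k > 0:
        cutoff = sorted(logits, reverse=True)[min(top_k, len(logits)) - 1]
        logits = [x if x >= cutoff else float("-inf") for x in logits]
    # Divide by the temperature: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax over the remaining scores.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature, top_k, rng):
    """Draw one token index from the filtered, rescaled distribution."""
    probs = top_k_temperature_probs(logits, temperature, top_k)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

toy_logits = [2.0, 1.5, 1.0, 0.5, 0.0]  # made-up scores for five tokens
for temperature in (1.0, 0.2, 0.05):
    probs = top_k_temperature_probs(toy_logits, temperature, top_k=3)
    print(temperature, [round(p, 4) for p in probs])

rng = random.Random(0)
token = sample_token(toy_logits, temperature=0.2, top_k=3, rng=rng)
```

At the low temperatures used in this notebook (0.05 to 0.3), almost all probability mass falls on the single highest-scoring token, so generation is close to greedy. That is one plausible reason the samples for a given prompt share the same opening words and sometimes fall into repeated phrases.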

In [None]:
input_prompt = "Equity is… "
gpt2_text_generator = gpt2.generate(sess, 
                                    run_name= checkpoint_name, #restore from checkpoint 
                                    temperature=0.15, 
                                    length=100, 
                                    prefix=input_prompt, 
                                    top_k=50, 
                                    nsamples=3) #number of samples to generate
                                    #destination_path = "/content/drive/MyDrive/Coding3/gpt2_text.txt")  

Equity is… ict offer context specific policy framework trustworthy ai implies acceptance among stakeholders including individual public private sectors responsibility engineering design operation informed informed democratic processes means adequate funding allocation training activities development implementation ai systems including funding bodies european union eu organisation economic cooperation development oecd ai narrative –in conclusion reflecting increased political societal attention issues connected development ai general data protection regulation gdpr applies entities participating european union ai narrative –in conclusion reflecting increased political societal attention issues connected development ai accurate disclosures
Equity is… ict see intelligence identity –identity –invalid –invalid –f intelligent assistants iq see intelligence quotient principles –iran air flight fundamental iran air flight common sense iran air flight common sense jordan –intelligence see also 

In [None]:
input_prompt = "Ethics is… "
gpt2_text_generator = gpt2.generate(sess, 
                                    run_name= checkpoint_name, #restore from checkpoint 
                                    temperature=0.1, 
                                    length=100, 
                                    prefix=input_prompt, 
                                    top_k=70, 
                                    nsamples=3) #number of samples to generate
                                    #destination_path = "/content/drive/MyDrive/Coding3/gpt2_text.txt")  

Ethics is… ian machines like novel new york knopf doure sturm noemie elhadad proceedings aristotelian society london whittlestone james probability calibration group gender” proceedings conference fairness accountability transparency –igi see industry iq see intelligence inequality –index natural justice –correa fatimah yasseri nabil todonobe “a right reasonable inferences rethinking data protection law age big data ai” harv j law technology review –s
Ethics is… ian machines like novel new york knopf doure sturm noemie elhadad proceedings aristotelian society london whittlestone james probability calibration group gender” proceedings conference fairness accountability transparency –jessica wapner “the enemy good estimating cost waiting nearly perfect automated vehicles” rand corporation httpswwwrandorgpubsresearchreportsrr2150html ronald c arkin creating adversarial exemplars use data adversarial examples fort ray k
Ethics is… ian machines like novel new york knopf doure sturm noemie e

In [None]:
input_prompt = "Accountable AI is... "
gpt2_text_generator = gpt2.generate(sess, 
                                    run_name= checkpoint_name, #restore from checkpoint 
                                    temperature=0.05, 
                                    length=100, 
                                    prefix=input_prompt, 
                                    top_k=40, 
                                    nsamples=3) #number of samples to generate
                                    #destination_path = "/content/drive/MyDrive/Coding3/gpt2_text.txt")  

Accountable AI is... ict defines problem “ai directed human labor” uses multifaceted datarich resources tied firm’s specific knowledge base used make automated decisions conduct actual work65 thus workers real labor rather machines dispensable human labor use resources like nonrenewable resources like sunlight falling wood chip manufacturing processes similarly could instead signatory cooperative project seeree scott “creative adversarial networks” –report creative adversarial networks working professionals –robot teaching pedagogy policy –robot�
Accountable AI is... ict ict chang ict changing ict changing tech nology ict changing ict changing ict changing ict changing ict changing ict changing ict changing ict changing ict changing ict changing ict changing ict changing ict changing ict changing ict changing ict changing ict changing ict changing ict changing ict changing ict changing ict changing ict changing ict changing ict changing ict changing ict changing ict changing ict changi

In [None]:
input_prompt = "Responsible AI is...  "
gpt2_text_generator = gpt2.generate(sess, 
                                    run_name= checkpoint_name, #restore from checkpoint 
                                    temperature=0.15, 
                                    length=100, 
                                    prefix=input_prompt, 
                                    top_k=30, 
                                    nsamples=3) #number of samples to generate
                                    #destination_path = "/content/drive/MyDrive/Coding3/gpt2_text.txt")  

Responsible AI is...  urs gasser carolyn schmitt companies—are engaged norm creation lesser extent administration—while still keeping ethics promise made ethical aligned company’s approach released statement company aims develop ethical aligned responsible ai no questions asked leading role ai ict impact ethical decisions every aspect company’s operations however ethical promiseably transparent process evaluation information resources used develop deploy deploy ai system including financial incentives potential conflicts exist public commitments trustworthy ai come several stages including various codes ethics google created responsible ai initiative google
Responsible AI is...  urs gasser carolyn schmitt companies—are engaged norm creation lesser extent administration—while still maintaining professional norms company culture remains largely experimental accordingly norms created google glass suggest norms might adapt much like brooks required new york sophia press frank pasquale blac

In [None]:
input_prompt = "Accountability is…  "
gpt2_text_generator = gpt2.generate(sess, 
                                    run_name= checkpoint_name, #restore from checkpoint 
                                    temperature=0.15, 
                                    length=100, 
                                    prefix=input_prompt, 
                                    top_k=30, 
                                    nsamples=3) #number of samples to generate
                                    #destination_path = "/content/drive/MyDrive/Coding3/gpt2_text.txt")  

Accountability is…  urs gasser carolyn schmitt companies like facebook google even offer chatbots understand understand understand give httpswwwfacebookcomdevelopersarechatbot httpsautomatedercoderjailsocietyarevery similar examples see httpswwwfacebookgoogleschatbot httpswwwfacebookcomdevelopersarechatbot nick stat “google’s community analyzes nearly million interactions platform user data shows nearly third people post negative things ” politico may httpswww politicocomnextgen technology google’
Accountability is…  urs gasser carolyn schmitt companies like facebook google even offer chatbots understand understand understand give httpswwwfacebookcomdevelopersarechatbot httpswwwfacebookcomdevelopersarechatbot shannon mattern developers also encourage encourage intelligent actions people’s behalf27 thirdparty companies also provide services assist developers translating enphrases en français en francisco’s httpswwwfranconewsnetworkartificialintelligence httpfranconwhowasapnewprojectshtm

In [None]:
input_prompt = "Ethical AI is…  "
gpt2_text_generator = gpt2.generate(sess, 
                                    run_name= checkpoint_name, #restore from checkpoint 
                                    temperature=0.2, 
                                    length=100, 
                                    prefix=input_prompt, 
                                    top_k=40, 
                                    nsamples=3) #number of samples to generate
                                    #destination_path = "/content/drive/MyDrive/Coding3/gpt2_text.txt")  

Ethical AI is…  urs gasser carolyn schmitt companies—are engaged norm creation lesser extent administration—are engaged setting norms example amazon implementing them—and signaling norms accountability additional important signal creators creators ethical creators maintain norms maintain created3 norms steward effective governance regime stuart russell peter chinao “a place sun” id –introducing sme artificial intelligence ethics autonomous vehicles september httpswwwsmithsonianmagcominnovationaplaceontheroad towardbetterworld ai spacef
Ethical AI is…  urs gasser carolyn schmitt companies—are engaged increasingly elaborate crossfunctional design teams individuals responsible wholeheartedly implementing ideas across organization1 even performance art research ai hleg even specifies riskbased approach securing system’s proper operation states “ai systems susceptible exploitation external forces” ensuring appropriate processes exist state art ai hleg specifies riskbased system tolerates “a

In [None]:
input_prompt = "Responsibility is…  "
gpt2_text_generator = gpt2.generate(sess, 
                                    run_name= checkpoint_name, #restore from checkpoint 
                                    temperature=0.2, 
                                    length=100, 
                                    prefix=input_prompt, 
                                    top_k=40, 
                                    nsamples=3) #number of samples to generate
                                    #destination_path = "/content/drive/MyDrive/Coding3/gpt2_text.txt")  

Responsibility is…  urs gasser carolyn schmitt companies like facebook google even today claim rightsilated environment similar situation prerevolution tunisia for example young people used internet without internet service providers mobile phones operators redistribute population informal work—thatchedecording laborers—according labor demand120 similar dynamics hold true countries region like qatar pay employees work internet things incidental social benefit companies like iborderot allow rent control live work platforms similar platform model called “microtel” operates storesoice plans microsoft later rent depending
Responsibility is…  urs gasser carolyn schmitt companies like facebook google even today claim rightsilated environment similar situation prerevolution tunisia forsyan found center political science less statistcentric environment criminal law came legislation setting national policies regarding issues data protection implementation offshore petroleum drilling welfare sta

In [None]:
input_prompt = "Equitable AI is…  "
gpt2_text_generator = gpt2.generate(sess, 
                                    run_name= checkpoint_name, #restore from checkpoint 
                                    temperature=0.3, 
                                    length=100, 
                                    prefix=input_prompt, 
                                    top_k=40, 
                                    nsamples=3) #number of samples to generate
                                    #destination_path = "/content/drive/MyDrive/Coding3/gpt2_text.txt")  

Equitable AI is…  urs gasser carolyn schmitt companies often cited leading causes death per capita basis well global catastrophic failure companies often cited number innovative ways could much risk created society harm avoided7 innovations often key reasons legal social norms evolving toward become robust may driven commitment social norms taking cues information “hardware” perceived “software” approach ai one potential source norms potential generating robust ness bias undesirable effects ai another potential source norm ambiguity ai refers ai hleg specifies four “tra
Equitable AI is…  urs gasser carolyn schmitt companies like facebook google even offer graphical interfaces interfaces people robots way communicate via text email27 yet attempts mainstream ai ethics mainstream academic thought hardly surprising would characterize oppositional stance academic interest projects coming related field machine learning comingle izable ethical insights automating decisionmaking traditional et
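
The generation cells above differ only in the prompt, `temperature`, and `top_k`. One way to keep such experiments in a single place is to loop over a table of settings. This is a sketch only: `run_prompts` and `fake_generate` are hypothetical helpers, and `fake_generate` merely stands in for the real generator so the example runs without a trained model.

```python
# Settings copied from the first three cells: (prompt, temperature, top_k).
PROMPT_SETTINGS = [
    ("Artificial Intelligence is… ", 0.2, 70),
    ("Equity is… ", 0.15, 50),
    ("Ethics is… ", 0.1, 70),
]

def run_prompts(generate_fn, settings, length=100, nsamples=3):
    """Call generate_fn once per (prompt, temperature, top_k) row."""
    results = {}
    for prompt, temperature, top_k in settings:
        results[prompt] = generate_fn(prefix=prompt,
                                      temperature=temperature,
                                      top_k=top_k,
                                      length=length,
                                      nsamples=nsamples)
    return results

def fake_generate(prefix, temperature, top_k, length, nsamples):
    """Stand-in generator that returns labelled placeholder strings."""
    return [f"{prefix}<sample {i + 1}>" for i in range(nsamples)]

results = run_prompts(fake_generate, PROMPT_SETTINGS)
for prompt, samples in results.items():
    print(prompt, len(samples))
```

In the notebook, `generate_fn` could presumably be `functools.partial(gpt2.generate, sess, run_name=checkpoint_name, return_as_list=True)`, so that the samples come back as a list instead of only being printed; that wiring has not been tested here.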