## Text Generation: Pick-a-Philosopher!




Using this [data set](https://www.kaggle.com/datasets/christopherlemke/philosophical-texts?resource=download), I set out to fine-tune a transformer model that generates random philosophical statements in the style of a particular philosopher.

The data set gives us access to quotes from five historical philosophers, each with a distinct writing style and subject matter: Aristotle, Hume, Kant, Nietzsche, and Plato.

In [None]:
!nvidia-smi -L  # List the GPUs attached to this runtime

GPU 0: Tesla T4 (UUID: GPU-968bd564-0b2c-9ac8-e2a7-6af8f396ebdd)


In [None]:
!pip install transformers # Install Hugging Face transformers
import transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.0-py3-none-any.whl (190 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.12.0 tokenizers-0.13.2 transformers-4.26.1


In [None]:
!pip install --upgrade gdown #gdown is used to download files directly from Google Drive

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting gdown
  Downloading gdown-4.6.2-py3-none-any.whl (14 kB)
Installing collected packages: gdown
  Attempting uninstall: gdown
    Found existing installation: gdown 4.4.0
    Uninstalling gdown-4.4.0:
      Successfully uninstalled gdown-4.4.0
Successfully installed gdown-4.6.2


In [None]:
# Download the 'sentences.csv' dataset into the Colab environment
import gdown
!gdown --id 1GHWKsSadOXe_sgxMMx5-tl7c3w-d6zWe  # Google Drive file ID of the sentences file
# https://drive.google.com/uc?id=1GHWKsSadOXe_sgxMMx5-tl7c3w-d6zWe

Downloading...
From: https://drive.google.com/uc?id=1GHWKsSadOXe_sgxMMx5-tl7c3w-d6zWe
To: /content/sentences.csv
100% 26.1M/26.1M [00:00<00:00, 109MB/s] 


In [None]:
# Import the data into a pandas dataframe
import pandas as pd
sentences = pd.read_csv('sentences.csv')
sentences.head()


Unnamed: 0,label,sentence,author,word_count,mean_word_length,stop_words_ratio,stop_words_count,ADJ_count,ADV_count,ADP_count,...,X_count,INTJ_count,CONJ_count,CCONJ_count,SCONJ_count,PROPN_count,NOUN_count,PRON_count,PART_count,VERB_count
0,0,We may call the faculty of cognition from prin...,Kant,64,5.2,56.25,36,6,2,11,...,1,0,0,2,3,0,17,3,1,7
1,0,That goes merely into our faculty of knowing t...,Kant,82,5.13,57.32,47,5,4,14,...,3,0,0,6,2,3,17,4,0,9
2,0,"The Critique, then, which sifts them all, as r...",Kant,50,4.84,62.0,31,4,1,6,...,1,0,0,1,3,1,10,3,1,6
3,0,It relegates all other pure concepts under Ide...,Kant,27,5.26,62.96,17,5,1,4,...,0,0,0,2,0,1,3,1,2,2
4,0,For they serve as regulative principles; partl...,Kant,63,5.03,61.9,39,4,5,11,...,0,0,0,2,4,0,12,4,3,8


In [None]:
sentences.shape

(107134, 23)

In [None]:
# Keep only the sentence and author columns, then add a 'writer' column in the form
# 'Author: sentence <|endoftext|>' that will be used for fine-tuning
sentences_cl = sentences.drop(columns = sentences.columns[3:]) 
sentences_cl = sentences_cl.drop('label', axis=1)
sentences_cl['writer'] = sentences_cl.author + ": " + sentences_cl.sentence + " <|endoftext|>"
sentences_cl.shape

(107134, 3)

In [None]:
sentences_cl.shape  # check that the dataframe has stayed the same size

(107134, 3)

In [None]:
# Shuffle the dataset so that the training and test sets each get a similar mix of authors
df = sentences_cl.sample(frac=1).reset_index(drop=True) 

In [None]:
df['writer'][0]

'Nietzsche: Thank Heaven, it is not in honour of pure foolery! <|endoftext|>'

In [None]:
df.shape

(107134, 3)

In [None]:
dataset_train = df.writer.values[:105000]
dataset_test = df.writer.values[105000:]

In [None]:
len(dataset_train)

105000

In [None]:
len(dataset_test)

2134
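
The shuffle-then-slice split above relies on chance to balance the authors. A minimal sketch of how that could be checked is shown below, using only the standard library and a toy stand-in for the `writer` column. The helper names `shuffle_and_split` and `author_counts`, the toy data, and the fixed seed are all assumptions for illustration, not part of the original notebook.

```python
import random
from collections import Counter

def shuffle_and_split(lines, n_train, seed=0):
    """Shuffle a copy of `lines` reproducibly and split it into train and test lists."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

def author_counts(lines):
    """Count examples per author, reading the author from the 'Author: ' prefix."""
    return Counter(line.split(":", 1)[0] for line in lines)

# Toy stand-in for the real 'writer' column (hypothetical data).
toy = [f"{author}: sentence {i} <|endoftext|>"
       for author in ("Aristotle", "Hume", "Kant", "Nietzsche", "Plato")
       for i in range(20)]

train, test = shuffle_and_split(toy, n_train=80, seed=42)
print("train:", author_counts(train))
print("test: ", author_counts(test))
```

On the real data the same counting could be applied to `dataset_train` and `dataset_test`. If one author turns out to be badly under-represented in the test set, a stratified split would be the usual alternative.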

In [None]:
# Look at some of the training file inputs to make sure they are formatted correctly for a text file
dataset_train[:5]

array(['Nietzsche: Thank Heaven, it is not in honour of pure foolery! <|endoftext|>',
       'Hume:  Sir Dudley Carletons Letters, p. 27, 28. <|endoftext|>',
       'Plato: : And of argumentation, one sort wastes money, and the other makes money. <|endoftext|>',
       'Kant: It must rather, so far as we can judge in a rational way, hold the derivation, by the aid of such causes, of such a consequence of our good conduct from mere nature without God and immortality, to be an ungrounded and vain, though well-meant, expectation; and if it could have complete certainty of this judgement, it would regard the moral law itself as the mere deception of our Reason in a practical aspect. <|endoftext|>',
       'Hume: They burnt St. Anthony and St. Helens, two towns on the coast of Florida. <|endoftext|>'],
      dtype=object)

In [None]:
# Write text files for model input
with open('dataset_train.txt','w') as f:
  f.write('\n'.join(dataset_train))
with open('dataset_test.txt','w') as f:
  f.write('\n'.join(dataset_test))

In [None]:
!head -20 dataset_train.txt #head of the training .txt file 

Nietzsche: Thank Heaven, it is not in honour of pure foolery! <|endoftext|>
Hume:  Sir Dudley Carletons Letters, p. 27, 28. <|endoftext|>
Plato: : And of argumentation, one sort wastes money, and the other makes money. <|endoftext|>
Kant: It must rather, so far as we can judge in a rational way, hold the derivation, by the aid of such causes, of such a consequence of our good conduct from mere nature without God and immortality, to be an ungrounded and vain, though well-meant, expectation; and if it could have complete certainty of this judgement, it would regard the moral law itself as the mere deception of our Reason in a practical aspect. <|endoftext|>
Hume: They burnt St. Anthony and St. Helens, two towns on the coast of Florida. <|endoftext|>
Aristotle: Eels are not produced from sexual intercourse, nor are they oviparous, nor have they ever been detected with semen or ova, nor when dissected do they appear to possess either seminal or uterine viscera; and this is the only kind of

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

In [None]:
# Does not need to be run if you will be using the fine-tuned model
tokenizer = AutoTokenizer.from_pretrained('distilgpt2')
model = AutoModelForCausalLM.from_pretrained('distilgpt2')

Downloading (…)lve/main/config.json:   0%|          | 0.00/762 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Downloading (…)"pytorch_model.bin";:   0%|          | 0.00/353M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

In [None]:
# Download the fine-tuned model folder (named 'recipe_generation_model') from Google Drive into the Colab environment
import gdown
gdrivelink='https://drive.google.com/drive/folders/1c0b7exs0gthp6Ub-q17PJOymvP4um1Uk?usp=share_link' # this folder contains the fine-tuned model
gdown.download_folder(gdrivelink, quiet=True) 

['/content/recipe_generation_model/config.json',
 '/content/recipe_generation_model/generation_config.json',
 '/content/recipe_generation_model/merges.txt',
 '/content/recipe_generation_model/pytorch_model.bin',
 '/content/recipe_generation_model/special_tokens_map.json',
 '/content/recipe_generation_model/tokenizer_config.json',
 '/content/recipe_generation_model/tokenizer.json',
 '/content/recipe_generation_model/vocab.json']
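
Before calling `from_pretrained` on a local folder, it can help to confirm that the folder name is right and that the files are really there. The sketch below is one hedged way to do that with the standard library. The helper `missing_model_files` is hypothetical, and treating these particular files from the listing above as required is an assumption of the sketch.

```python
from pathlib import Path

# File names taken from the download listing above; treating all of them as
# required is an assumption of this sketch.
EXPECTED_FILES = (
    "config.json",
    "pytorch_model.bin",
    "tokenizer.json",
    "vocab.json",
    "merges.txt",
)

def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected files that are absent from `model_dir` (all of them if the folder is missing)."""
    root = Path(model_dir)
    if not root.is_dir():
        return list(expected)
    return [name for name in expected if not (root / name).is_file()]

missing = missing_model_files("recipe_generation_model")
if missing:
    print("Not ready to load, missing:", missing)
else:
    print("All expected files found")
```

An empty result means the folder name passed to `from_pretrained` points at a complete download.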

In [None]:
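# Load the fine-tuned tokenizer and model from the folder downloaded from Google Drive.
# The folder is named 'recipe_generation_model' (see the download listing above);
# loading from 'philosophy_generation_model' raises an OSError because no such folder exists.
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained('recipe_generation_model')
model = AutoModelForCausalLM.from_pretrained('recipe_generation_model')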

In [None]:
#Only need to run this if fine tuning a model
!curl https://raw.githubusercontent.com/huggingface/transformers/27c1b656cca75efa0cc414d3bf4e6aacf24829de/examples/run_lm_finetuning.py > run_lm_finetuning.py

In [None]:
#Only need to run if fine-tuning a model
!mkdir experiments

In [None]:
#Only need to run if fine-tuning a model
#Defines the training set document and number of epochs to run in training
epochs=3
file_with_training_set = 'dataset_train.txt'

In [None]:
# Only need to run if fine-tuning a model
# Bash script to fine-tune a language model. File from here: https://github.com/alontalmor/pytorch-transformers/blob/master/examples/run_lm_finetuning.py
text = f"for epoch in {epochs} \n"+\
"do \n"+\
"python run_lm_finetuning.py "+\
f"--output_dir=experiments/epoch_{epochs} "+\
"--model_type=gpt2 "+\
"--model_name_or_path=distilgpt2 "+\
f"--train_data_file={file_with_training_set} "+\
"--do_train "+\
"--overwrite_output_dir "+\
"--save_steps=10000 " +\
f"--num_train_epochs={epochs} \n" +\
"done"

In [None]:
#Only need to run if fine-tuning the model
#Write the script to a file for running in shell
with open('run_experiments.sh', mode='w') as f:
  f.write(text)
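
The script written to `run_experiments.sh` boils down to one long command. As a rough sketch, the same command could be assembled from a list of flags, which makes it easier to inspect before running. The function name `build_finetune_command` and the `output_root` parameter are hypothetical; the flags themselves are the ones used in the script above.

```python
import shlex

def build_finetune_command(train_file, epochs, output_root="experiments"):
    """Return the fine-tuning command as a single shell-safe string."""
    args = [
        "python", "run_lm_finetuning.py",
        f"--output_dir={output_root}/epoch_{epochs}",
        "--model_type=gpt2",
        "--model_name_or_path=distilgpt2",
        f"--train_data_file={train_file}",
        "--do_train",
        "--overwrite_output_dir",
        "--save_steps=10000",
        f"--num_train_epochs={epochs}",
    ]
    # shlex.join (Python 3.8+) quotes any argument that the shell would misread
    return shlex.join(args)

command = build_finetune_command("dataset_train.txt", epochs=3)
print(command)
```

Printing the command first is a cheap way to catch a wrong file name or epoch count before committing to a long training run.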

In [None]:
#Only need to run if fine-tuning the model 
#Run the fine-tuning script but be aware that it will take a long time...
!bash run_experiments.sh

In [None]:
# Only need to run if you have trained a model
# Load the fine-tuned tokenizer and model from the training output directory
tokenizer = AutoTokenizer.from_pretrained('experiments/epoch_3')
model = AutoModelForCausalLM.from_pretrained('experiments/epoch_3')

# Save both into a single folder
model_path = "recipe_generation_model"
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive',force_remount=True)
# Copy the folder as a new directory in the root of your Google Drive
import shutil
shutil.copytree(model_path,'/content/drive/MyDrive/'+ model_path)
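
One thing to watch with `shutil.copytree` is that, by default, it raises `FileExistsError` when the destination folder already exists, for example after a second training run. A minimal sketch of a re-runnable copy is shown below. The helper `copy_model_dir` is hypothetical, and the demo uses a temporary directory rather than a real Drive mount.

```python
import shutil
import tempfile
from pathlib import Path

def copy_model_dir(src, dst_root):
    """Copy the folder `src` into `dst_root`, overwriting files left by an earlier copy."""
    src = Path(src)
    dst = Path(dst_root) / src.name
    # dirs_exist_ok requires Python 3.8 or newer
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return dst

# Demo in a temporary directory standing in for Google Drive (hypothetical layout).
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "recipe_generation_model"
    model_dir.mkdir()
    (model_dir / "config.json").write_text("{}")
    fake_drive = Path(tmp) / "MyDrive"
    fake_drive.mkdir()
    copy_model_dir(model_dir, fake_drive)
    copy_model_dir(model_dir, fake_drive)  # a second call does not raise
    print(sorted(p.name for p in (fake_drive / model_dir.name).iterdir()))
```

In Colab the second argument would be the mounted Drive root, such as `/content/drive/MyDrive`.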

**Now let us test the generated text output! Enter one of the five philosophers followed by a colon (e.g. 'Plato:') to see some philosophical nonsense, or see what these philosophers would say to an existential question by entering, for example, 'Plato: What is the meaning of life'.**

Try it out yourself! It's also possible to add some leading words following the philosopher, but I for one don't like telling people what they should say. Let's see what Plato has to say!

In [None]:
input_text = 'Plato:' # do not leave a space between the last character and the closing quote
enc_input = tokenizer.encode(input_text,return_tensors='pt',add_special_tokens=False)
output_sequences = model.generate(
    input_ids = enc_input,
    max_length= 150,  # maximum total length in tokens, prompt included
    temperature = 0.9, # values near zero make sampling more deterministic; higher values make it more random
    top_k = 20, # sample only from the 20 most likely next tokens
    top_p = 0.9, # and only from the smallest set of tokens whose cumulative probability reaches 0.9
    repetition_penalty = 1, # penalty for repeating tokens already in the sequence (1 = no penalty)
    do_sample = True, # True -> sample probabilistically (output varies between runs)
    num_return_sequences = 5 # number of output sequences
)
for i in range(len(output_sequences)):
  print(f'{i}: {tokenizer.decode(output_sequences[i])}\n')
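
To get a feel for what `temperature`, `top_k` and `top_p` do, here is a toy sketch in plain Python that filters a made-up next-token distribution. It is a simplified illustration of the idea and not the code that `transformers` runs internally. The function name `filtered_distribution` and the example logits are assumptions.

```python
import math

def filtered_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return next-token probabilities after temperature, top-k and top-p filtering.

    `logits` maps token -> raw score. Simplified illustration only.
    """
    # Temperature: dividing by a value below 1 sharpens the distribution.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    # Softmax, subtracting the maximum for numerical stability.
    peak = max(scaled.values())
    exps = {tok: math.exp(score - peak) for tok, score in scaled.items()}
    total = sum(exps.values())
    probs = sorted(((tok, e / total) for tok, e in exps.items()),
                   key=lambda pair: pair[1], reverse=True)
    # Top-k: keep the k most likely tokens, then renormalise.
    if top_k > 0:
        probs = probs[:top_k]
        mass = sum(p for _, p in probs)
        probs = [(tok, p / mass) for tok, p in probs]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}

# Hypothetical scores for the token that follows 'Plato:'.
toy_logits = {"The": 2.0, "And": 1.0, "Virtue": 0.5, "Eels": -1.0}
print(filtered_distribution(toy_logits, temperature=0.9, top_k=20, top_p=0.9))
print(filtered_distribution(toy_logits, temperature=0.5, top_k=2))
```

Lowering the temperature or tightening `top_k` and `top_p` concentrates probability on the likeliest tokens, which is why the outputs become more repetitive and less surprising.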

Well that's pretty fun! Here are some outputs that I found amusing when I ran this with other philosophers:



*   Aristotle: They also say that the gods are also good, because they are good to the gods.
*   Hume: They had been the most considerable and most numerous men in the kingdom; and it was thought proper to give them a view of the progress and progress of the kingdom; and they were desirous of receiving a supply of money, which they could not afford them.
*   Kant: For he has no concept of this nature, and cannot be an idea of the nature of man, because he cannot make a concept of him; for he cannot say that he has no conception of man; for his concept of the nature of man cannot be an idea of a being.
*   Plato: The one who is in need of help, or is in need of support, and who has been deprived of the means of support, may not be able to help, either from sickness or from the desire to live; but he is able to help.

Deep stuff!
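
Raw decoded sequences keep the 'Author:' prompt and can run on past the first `<|endoftext|>` marker into another sentence. One possible way to cut them down to a single sentence is sketched below. The helper `first_sentence` and the sample string are hypothetical, so adjust them to your own output.

```python
EOS = "<|endoftext|>"

def first_sentence(decoded, author):
    """Return the first generated sentence, without the 'Author:' prompt or the end-of-text marker."""
    # Keep only the text before the first end-of-text marker.
    text = decoded.split(EOS, 1)[0]
    prefix = f"{author}:"
    if text.startswith(prefix):
        text = text[len(prefix):]
    return text.strip()

# Hypothetical decoded output containing the start of a second sentence.
raw = "Plato: The one who is in need of help may not be able to help. <|endoftext|>Kant: For he has"
print(first_sentence(raw, "Plato"))
```

In the generation loop, this would be applied to `tokenizer.decode(output_sequences[i])` before printing.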



