# Making a Nietzsche - Day 1. Research Journal and simple out-of-the-box ULMFit

## Welcome to my transfer learning for NLP research journal

Long story short: to **get acquainted with the most recent pre-trained NLP models (ULMFit, BERT, GPT, GPT-2 and XLNet)**, I'm trying to create a language model able to **generate Nietzsche-like prose of reasonable quality**. I've just started and **my results are not even close to good...** but I think the process of trying to build one with these technologies is interesting in itself, so I'm writing **a research journal** rather than a result-oriented post.

For now, these are the plans:

1. [Day 1 - Making a Nietzsche - Research Journal and simple out-of-the-box ULMFit](https://github.com/dataista0/making-a-nietzsche/blob/master/nbs/Making%20a%20Nietzsche%20-%201.%20Research%20Journal%20%26%20ULMFit.ipynb) (this notebook)
2. [Day 2 - Out-of-the-box GPT-2 Large](https://www.kaggle.com/julian3833/gpt2-large-774m-w-pytorch-not-that-impressing) (new!)
3. *Making a Nietzsche - Day 3 - GPT-2 for Nietzsche* (Soon!)
4. Define what to do from here... (train more, find better data, another architecture, another thing)


### The last self-love wound to humanity

Just a brief, non-technical personal opinion on the philosophical impact of being able to train machines to generate human-like prose. You can skip it if it's not your thing.


In *Introduction to Psychoanalysis*, Freud describes three great narcissistic wounds that science and theory have dealt to humanity's self-perception:

> Humanity has in the course of time had to endure from the hands of science two great outrages upon its naive self-love. 
The first was when it realized that our earth was not the center of the universe, but only a tiny speck in a 
world-system of a magnitude hardly conceivable; this is associated in our minds with the name of Copernicus, ...
The second was when biological research robbed man of his peculiar privilege of having been specially created, and 
relegated him to a descent from the animal world, implying an ineradicable animal nature in him: this transvaluation 
has been accomplished in our own time upon the instigation of Charles Darwin... 
But man's craving for grandiosity is now suffering the third and most bitter blow from present-day psychological 
research which is endeavoring to prove to the ego of each one of us that he is not even master in his own house, but 
that he must remain content with the veriest scraps of information about what is going on unconsciously in his own mind.
    
I see the ability of machines to generate human-like prose as a narcissistic wound in the same sense: it robs us of something we considered essential to our nature, a distinctive skill that we thought made us completely different from, and more important than, the rest of existence: the ability to produce language, coherent speech.

Since the deep-learning-for-NLP boom of late 2018, I have wanted to get familiar with transfer learning in NLP. This is a personal project in which I will try to put into practice all the theory and novelties that are in the air, trying to reproduce Nietzsche's prose with state-of-the-art generative models.


### Message for forkers: remember to turn on the GPU and Internet!

# Day 1: shitty data and out-of-the-box ULMFit with default parameters

Today we are just jumping in. We will set up the full workflow for a simple, trivial model and iterate afterwards.

I'm currently learning fast.ai, so I will start with the ULMFit weights pre-trained on WikiText-103, which are, I think, far less powerful than GPT-2. I'm also not sure whether BERT can generate text, but I know for sure that OpenAI's GPT-2 [can](https://openai.com/blog/better-language-models/) and I'm eager to get my hands on it!

## ULMFit out-of-the-box without fine-tuning

It turns out that the unsupervised task Jeremy used to pre-train ULMFit is vanilla language modelling, which means the network knows how to generate text even without any fine-tuning.
But let's check it out.
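
To make "a language model can generate text" concrete, here is a minimal sketch with a toy bigram model instead of ULMFit. The names `train_bigram` and `generate` are mine and purely illustrative; the real model is an AWD-LSTM that predicts a distribution over the whole vocabulary, but the generation loop (pick a next token, append it, repeat) has the same shape.

```python
import random
from collections import defaultdict

def train_bigram(tokens):
    """Record, for every token, the tokens that followed it in the corpus."""
    followers = defaultdict(list)
    for current, nxt in zip(tokens, tokens[1:]):
        followers[current].append(nxt)
    return followers

def generate(model, prompt, n_words, seed=123):
    """Extend `prompt` one token at a time by sampling an observed follower."""
    rng = random.Random(seed)
    out = list(prompt)
    for _ in range(n_words):
        candidates = model.get(out[-1])
        if not candidates:  # token never seen with a follower: stop early
            break
        out.append(rng.choice(candidates))
    return out

corpus = "fake sentence 1 . fake sentence 2 .".split() * 3
model = train_bigram(corpus)
generated = generate(model, ["fake"], n_words=5)
print(" ".join(generated))
```

The only "knowledge" here is which token followed which; a pre-trained network replaces that lookup table with a learned next-token distribution.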

In a nutshell, fast.ai is a super high-level deep learning API. It is not the equivalent of Keras, but of something built on top of Keras that gives structure to, and automates, some high-level workflows that were not automated before.
The main concept is the `Learner`.

A `Learner` trains a `model` on some `train data`, validating against some `validation data` using an `optimizer`, a `loss` function and zero or more `metrics`.
So the signature of a learner is something like: `Learner(data, model, optimizer, loss, metrics)`.
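
As a rough illustration of that idea (and not fast.ai's actual class), here is a toy learner in plain Python. `ToyLearner`, its one-parameter model `y = w * x`, the squared-error loss and the hand-written SGD step are all assumptions made up for this sketch; the point is only that one object bundles data, model, optimizer and loss, and reports train and validation loss per epoch.

```python
class ToyLearner:
    """Toy stand-in for the Learner idea: data + model + optimizer + loss."""

    def __init__(self, train, valid, lr=0.01):
        self.train, self.valid = train, valid
        self.w = 0.0   # the whole "model" is y = w * x
        self.lr = lr   # learning rate of plain SGD

    def loss(self, data):
        """Mean squared error of the current model on `data`."""
        return sum((self.w * x - y) ** 2 for x, y in data) / len(data)

    def fit(self, epochs):
        """Run SGD and return (train_loss, valid_loss) for every epoch."""
        history = []
        for _ in range(epochs):
            for x, y in self.train:
                grad = 2 * (self.w * x - y) * x  # d/dw of the squared error
                self.w -= self.lr * grad
            history.append((self.loss(self.train), self.loss(self.valid)))
        return history

train = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
valid = [(5.0, 15.0)]
learner = ToyLearner(train, valid)
history = learner.fit(epochs=20)
print(f"w = {learner.w:.4f}")
```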

Each of the supported deep learning applications (`text`, `vision`, `tabular`, `collab` and `vision.gan`) has a high-level function that creates a good out-of-the-box learner, with pre-trained weights where they are available, given only the data.

The data is modelled as a `DataBunch`, which is just a container holding a training set, a validation set and optionally a test set, i.e. 2 or 3 lists of (x, y) pairs.
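
A minimal sketch of that container idea, assuming nothing about fast.ai's internals: `ToyDataBunch` and `holdout_split` are hypothetical names, and the real class also handles tokenization, batching and more.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Pair = Tuple[str, int]  # one (x, y) example

@dataclass
class ToyDataBunch:
    train: List[Pair]
    valid: List[Pair]
    test: Optional[List[Pair]] = None  # the test set is optional

def holdout_split(pairs, n_valid):
    """Keep the last `n_valid` pairs for validation (n_valid must be > 0)."""
    return ToyDataBunch(train=pairs[:-n_valid], valid=pairs[-n_valid:])

pairs = [(f"sentence {i}.", 0) for i in range(10)]
bunch = holdout_split(pairs, n_valid=3)
print(len(bunch.train), len(bunch.valid), bunch.test)
```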


### A note on reproducibility

We are using a lot of random functions and methods, so we need to fix the randomness in order to achieve reproducibility.
We need to set NumPy's random seed and also two PyTorch ones (CPU and CUDA).

In [1]:
def make_reproducible():
    import numpy as np
    import torch
    seed = 123
    np.random.seed(seed)
    torch.random.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    
make_reproducible()
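
Why several seeds? Each library keeps its own random generator, so seeding one says nothing about the others. The sketch below shows the principle with two independent generators from Python's standard `random` module; it only stands in for NumPy and PyTorch, which are not used here.

```python
import random

gen_a = random.Random()  # stands in for one library's generator
gen_b = random.Random()  # stands in for another library's generator

# Re-seeding a generator makes it replay exactly the same sequence.
gen_a.seed(123)
first_run = [gen_a.random() for _ in range(3)]
gen_a.seed(123)
second_run = [gen_a.random() for _ in range(3)]
print(first_run == second_run)

# gen_b was never seeded by us, so it must be seeded separately
# before its numbers become reproducible as well.
gen_b.seed(123)
third_run = [gen_b.random() for _ in range(3)]
print(first_run == third_run)
```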

## Creating a databunch from a list of sentences

In [2]:
import pandas as pd
from fastai.text import TextLMDataBunch, language_model_learner, AWD_LSTM

# Amount of sentences to use as validation. This is hardcoded for simplicity.
N_VAL = 25

def get_lm_databunch(sentences):
    """ Create a `TextLMDataBunch` for list of sentences """
    train_df = pd.DataFrame({'label': 0, 'text': sentences[:-N_VAL]}) # This is how you specify data for language modeling
    valid_df = pd.DataFrame({'label': 0, 'text': sentences[-N_VAL:]})    
    return TextLMDataBunch.from_df(path='.', train_df=train_df, valid_df=valid_df, bs=192)


data_lm = get_lm_databunch(["fake sentence 1.", "fake sentence 2."]*500)
data_lm

TextLMDataBunch;

Train: LabelList (975 items)
x: LMTextList
xxbos fake sentence 1 .,xxbos fake sentence 2 .,xxbos fake sentence 1 .,xxbos fake sentence 2 .,xxbos fake sentence 1 .
y: LMLabelList
,,,,
Path: .;

Valid: LabelList (25 items)
x: LMTextList
xxbos fake sentence 2 .,xxbos fake sentence 1 .,xxbos fake sentence 2 .,xxbos fake sentence 1 .,xxbos fake sentence 2 .
y: LMLabelList
,,,,
Path: .;

Test: None

## Creating a pre-trained ULMFit language model for a given databunch

In [3]:
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.5)

In [4]:
# The vocabulary is tied to the databunch, which has only 16 tokens
len(learn.data.vocab.itos)

16

In [5]:
print(learn.data.vocab.itos)

['xxunk', 'xxpad', 'xxbos', 'xxeos', 'xxfld', 'xxmaj', 'xxup', 'xxrep', 'xxwrep', 'fake', 'sentence', '.', '1', '2', 'xxfake', 'xxfake']
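
Where do those 16 entries come from? As far as I can tell, the special `xx...` tokens go first, then the corpus tokens ordered by frequency, and finally `xxfake` entries pad the list up to a multiple of 8. The sketch below reproduces that layout; `build_vocab` is a hypothetical helper of mine and the padding rule is an assumption about fastai v1, not a copy of its code.

```python
from collections import Counter

# Special tokens, in the order they appear in the vocabulary printed above.
SPECIAL_TOKENS = ["xxunk", "xxpad", "xxbos", "xxeos", "xxfld",
                  "xxmaj", "xxup", "xxrep", "xxwrep"]

def build_vocab(tokens, min_freq=1, pad_multiple=8):
    """Return (itos, stoi): special tokens, then words by frequency, then padding."""
    counts = Counter(tokens)
    words = [tok for tok, count in counts.most_common()
             if count >= min_freq and tok not in SPECIAL_TOKENS]
    itos = SPECIAL_TOKENS + words
    while len(itos) % pad_multiple != 0:  # assumed padding rule
        itos.append("xxfake")
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

tokens = "xxbos fake sentence 1 . xxbos fake sentence 2 .".split() * 3
itos, stoi = build_vocab(tokens)
print(len(itos))
print(itos[9:])
# Anything outside the vocabulary falls back to index 0, i.e. 'xxunk'.
print(stoi.get("zarathustra", 0))
```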


In [6]:
# It predicts: generate 20 words before any fine-tuning
sample = """fake sentence 1. fake"""
print(learn.predict(sample, n_words=20))

fake sentence 1. fake sentence Sentence 2 SENTENCE Sentence Sentence Sentence Sentence . Sentence Sentence .


In [7]:
learn.fit_one_cycle(10, max_lr=1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,6.265474,6.0544,0.185714,00:02
1,6.066381,5.760055,0.185714,00:03
2,6.002691,5.15142,0.185714,00:03
3,5.899597,4.478835,0.3,00:03
4,5.733174,3.911566,0.3,00:03
5,5.511778,3.435199,0.3,00:03
6,5.229397,3.053329,0.3,00:03
7,4.992468,2.794451,0.3,00:03
8,4.803389,2.656074,0.3,00:02
9,4.613796,2.615304,0.385714,00:03
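
`fit_one_cycle` does not train with a constant learning rate: it warms up towards `max_lr` and then anneals down again. Here is a small sketch of such a schedule. The cosine shape and the defaults (`div_factor=25`, `pct_start=0.3`, a final rate 1e4 times smaller than the starting one) are what I believe fastai v1 uses, so treat them as assumptions; momentum, which is cycled too, is left out.

```python
import math

def cos_anneal(start, end, pct):
    """Cosine interpolation from `start` (pct=0) to `end` (pct=1)."""
    return end + (start - end) / 2 * (1 + math.cos(math.pi * pct))

def one_cycle_lr(step, total_steps, max_lr=1e-2, div_factor=25.0, pct_start=0.3):
    """Learning rate at `step` for a warm-up followed by a long decay."""
    warmup_steps = int(total_steps * pct_start)
    low_lr = max_lr / div_factor   # starting learning rate
    final_lr = low_lr / 1e4        # assumed floor at the end of training
    if step < warmup_steps:
        return cos_anneal(low_lr, max_lr, step / warmup_steps)
    decay_steps = max(1, total_steps - warmup_steps - 1)
    return cos_anneal(max_lr, final_lr, (step - warmup_steps) / decay_steps)

schedule = [one_cycle_lr(s, total_steps=100) for s in range(100)]
print(f"start={schedule[0]:.1e} peak={max(schedule):.1e} end={schedule[-1]:.1e}")
```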


In [8]:
# It improves a little
print(learn.predict("fake sentence 1. fake", n_words=20))

fake sentence 1. fake sentence 2 . Sentence 2 . Sentence 2 . Fake sentence . . . 1 .


### Let's find out what Zarathustra has to say

Here we get some sentences from *Thus Spoke Zarathustra*. The cleaning exploration was collapsed into a minimal, readable function.

In [9]:
#

import re
import requests
import bs4
import numpy as np

ZARATHUSTRA = "https://archive.org/stream/thusspokezarathu00nietuoft/thusspokezarathu00nietuoft_djvu.txt"

def get_sentences():
    response = requests.get(ZARATHUSTRA)
    html = response.content
    soup = bs4.BeautifulSoup(html, 'html.parser')
    txt_data = soup.pre.contents[0]
    pattern = re.compile(r"\n\n\n[ \w'\(\)]+\n\n", re.MULTILINE)
    txt_data = re.sub(pattern, '', txt_data)
    lines = txt_data.split("\n")
    
    lines = [l for l in lines if len(l) > 1 and not l.isupper() and not l.istitle()]
    
    # Collapse the \n (they are breaking sentences in a wide format), 
    # After collapsing the \n we got a large string, split it by "."
    # Add again the . at the end of the lines
    sentences = [ f"{l}." for l in (' '.join(lines)).split(".")]
    
    return sentences

# We bring 3319 sentences from Zarathustra
s = get_sentences()
len(s)

3319
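
To see what the line filtering and the split on "." actually do without hitting the network, here is the same logic applied to a small inline string. `lines_to_sentences` is a hypothetical offline variant; unlike the cell above, it strips whitespace and skips empty fragments, so it never emits a sentence that is just ".".

```python
def lines_to_sentences(raw_text):
    """Drop heading-like lines, re-join the wrapped lines and split on '.'."""
    lines = raw_text.split("\n")
    # Headings in the scan are ALL CAPS or Title Case; real prose is neither.
    lines = [l for l in lines if len(l) > 1 and not l.isupper() and not l.istitle()]
    joined = " ".join(lines)
    return [f"{part.strip()}." for part in joined.split(".") if part.strip()]

sample = (
    "THE PROLOGUE\n"
    "When Zarathustra was thirty years old, he left\n"
    "his home. There he enjoyed his spirit\n"
    "and his solitude."
)
for sentence in lines_to_sentences(sample):
    print(sentence)
```

Splitting on every "." is crude (abbreviations and ellipses get cut as well), which is part of why the data is far from clean.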

In [10]:
# Checking everything looks good
np.random.choice(s)

'"  It is a lie! Creators were they who created peoples, and hung  a faith and a love over them : thus they served life.'

## Putting it all together:

In [11]:
def get_trained_lm(sentences):
    data_lm = get_lm_databunch(sentences)
    learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.5)
    learn.fit_one_cycle(10, 1e-2)
    return learn

lm = get_trained_lm(s)

epoch,train_loss,valid_loss,accuracy,time
0,5.452039,3.121433,0.410417,00:29
1,5.266574,3.085846,0.372247,00:26
2,4.998517,3.004556,0.375,00:25
3,4.737266,2.712997,0.445164,00:26
4,4.509505,2.617201,0.462946,00:28
5,4.317868,2.534178,0.484003,00:26
6,4.157368,2.518567,0.489509,00:28
7,4.025093,2.502783,0.494568,00:29
8,3.92069,2.492634,0.497693,00:29
9,3.844982,2.507464,0.494345,00:25
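
A quick way to read those numbers: assuming `valid_loss` is the mean cross-entropy per token in nats (I believe that is what the learner reports, but take it as an assumption), its exponential is the perplexity, roughly "how many tokens the model is hesitating between". The helper name `perplexity` is mine.

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity for a mean per-token cross-entropy given in nats."""
    return math.exp(cross_entropy_loss)

first_epoch = perplexity(3.121433)  # valid_loss of epoch 0 in the table above
last_epoch = perplexity(2.507464)   # valid_loss of epoch 9 in the table above
print(f"perplexity: {first_epoch:.1f} -> {last_epoch:.1f}")
```

With only 25 validation sentences, these values are a very noisy estimate.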


In [13]:
def show_sample(lm, s):
    ss = np.random.choice(s)
    print(f"Input sentence: {ss}\n\n")
    print(f"Generation: {lm.predict(ss, n_words=200)}")
show_sample(lm, s)

Input sentence:   Then lived they shamelessly in temporary pleasures, and  beyond the day had hardly an aim.


Generation:   Then lived they shamelessly in temporary pleasures, and  beyond the day had hardly an aim. At last they held the trembling hands of one another and lived hours full of noise ; then should i bear them to avoid sleeping and not to bear be the spirit of a man . xxbos For everything that is around it , it is time for future or fragrance to meet with an enemy : for breast - through - eyes often sit beside your hearts . xxbos my reason is that God may be treated on this patient patient , sneezeth it , when at night such circumstances have come " : they be quite nondescript ; and they make a hard conscience what came before me . So it is the great fate to strength there . " This thing is more of a story than i have lieth in public opinion . xxbos But to not be able to be sure , i am very weary of choosing Zarathustra to torturest his heart . xxbos But there is also him 

In [14]:
# Vocabulary size: the number of distinct tokens this model can produce, I think.
len(lm.data.vocab.itos)

4616

In [15]:
show_sample(lm, s)

Input sentence:   "From on high," drippeth the star, and the gracious spittle;  for the hi^h, longeth every starless bobom.


Generation:   "From on high," drippeth the star, and the gracious spittle;  for the hi^h, longeth every starless bobom. xxbos " already our old hearts have now turned their heads on earth ! Already the Super ones ! Now did we make this mantle new happiness ! Have they ever this flock ? How could your fruits become smaller , according to your values ! xxbos And even with a valuing . xxbos Do ye divine only ones ! By your submission : i call your hands , secrets , and knowledge ? xxbos And had three soldiers be killed in this war ? And there is not much in the world ! Anne is the day , and the world was about to be revealed ; it had begun to press up . xxbos Beyond the Star and 51 - 30 - 235 , the latter is saying everything round he may also be a failure . " a good bite about the serpent Santa Margherita came to me , and but that did he always bite ; and now lear

In [16]:
# Not that bad, huh?
lm.predict("God is a non-sense, a spiritual weakness of a degraded society.", n_words=200)

'God is a non-sense, a spiritual weakness of a degraded society. They possess an outward and type so far as they could . xxbos There is much tension between God and the man ; in the moonlight and the path from the gods , it is persistent that God maketh the lightning in the depth . The other feminine form of something which lay home to one another is the Superman , formerly the Superman . a middle - born man , calling similar who jewel his whip with the past , is called the Purple Superman . However , the true meaning of the truth can nor express the coming of the new spirit , The Dead . The " Superman " already called itself the Superman is only an attempt . xxbos Such a star also seem to ye being older than today , but to create that it is perfect because it is only a little harder than one else * all things . xxbos Once i began to see in the last one no longer ingly , and all lands are my least , whilst ye did my own ruined'

In [17]:
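# The extra spaces are annoying me
def predict(lm, sample, n_words=200):
    # Tidy the tokenized output: drop the space before commas and turn `xxbos` into a newline.
    # Note that `.replace(" .", "")` deletes the periods too; `.replace(" .", ".")` would keep them.
    p = lm.predict(sample, n_words=n_words).replace(" ,", ",").replace(" .", "").replace("xxbos", "\n")
    print(p)
    return p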

THE_BEGGINING="""When Zarathustra was thirty years old, he left his home and the lake of his home, and went into the mountains. There he 
enjoyed his spirit and his solitude, and for ten years did not weary of it. But at last his heart changed, and rising one 
morning with the rosy dawn, he went before the sun, and spake thus unto it: "Thou great star! What would be thy happiness if thou hadst not those for whom thou shinest! 
For ten years hast thou climbed hither unto my cave: thou wouldsthave wearied of thy light and of the journey, had it not been for me, mine eagle, and my serpent. 
But we awaited thee every morning, took from thee thine overflow, and blessed thee for it". """

predict(lm, THE_BEGGINING);

When Zarathustra was thirty years old, he left his home and the lake of his home, and went into the mountains. There he 
enjoyed his spirit and his solitude, and for ten years did not weary of it. But at last his heart changed, and rising one 
morning with the rosy dawn, he went before the sun, and spake thus unto it: "Thou great star! What would be thy happiness if thou hadst not those for whom thou shinest! 
For ten years hast thou climbed hither unto my cave: thou wouldsthave wearied of thy light and of the journey, had it not been for me, mine eagle, and my serpent. 
But we awaited thee every morning, took from thee thine overflow, and blessed thee for it".  
 All of his joyous aspiration is to the sake of the evil and the virtues All things, all the creating, the holy will of all things ; and by earth it is to be all loving and loved : full of force to be surpassed : truth would every one, with all pharisees feel the swine and values of power, they know thou also must have an " im
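
The chained `.replace` calls get clumsy quickly, so here is a sketch of a slightly more general cleanup based on regular expressions. `clean_generation` is a hypothetical helper, not part of fast.ai; it turns `xxbos` into a newline and removes the space the tokenizer leaves before punctuation, keeping the punctuation itself.

```python
import re

def clean_generation(text):
    """Tidy tokenized model output for reading."""
    text = re.sub(r"\s*xxbos\s*", "\n", text)     # sentence marker -> newline
    text = re.sub(r" +([,.;:!?])", r"\1", text)   # no space before punctuation
    text = re.sub(r" {2,}", " ", text)            # collapse runs of spaces
    return text.strip()

raw = "an aim . xxbos For everything , it is time ;  now"
print(clean_generation(raw))
```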

# Wrapping up and exporting the code to a `.py` file

In [18]:
# export
import re
import requests
import bs4
import numpy as np
import pandas as pd
from fastai.text import TextLMDataBunch, language_model_learner, AWD_LSTM

# Amount of sentences to use as validation. This is hardcoded for simplicity.
N_VAL = 25

ZARATHUSTRA_URL = "https://archive.org/stream/thusspokezarathu00nietuoft/thusspokezarathu00nietuoft_djvu.txt"

ANTI_GOD_SAMPLE="God is a non-sense, a spiritual weakness of a degraded society."

THE_BEGGINING="""When Zarathustra was thirty years old, he left his home and the lake of his home, and went into the mountains. There he 
enjoyed his spirit and his solitude, and for ten years did not weary of it. But at last his heart changed, and rising one 
morning with the rosy dawn, he went before the sun, and spake thus unto it: "Thou great star! What would be thy happiness if thou hadst not those for whom thou shinest! 
For ten years hast thou climbed hither unto my cave: thou wouldsthave wearied of thy light and of the journey, had it not been for me, mine eagle, and my serpent. 
But we awaited thee every morning, took from thee thine overflow, and blessed thee for it". """

SAMPLES = [ANTI_GOD_SAMPLE, THE_BEGGINING]

def make_reproducible():
    import numpy as np
    import torch
    seed = 123
    np.random.seed(seed)
    torch.random.manual_seed(seed)
    torch.cuda.manual_seed(seed)

def get_sentences():
    response = requests.get(ZARATHUSTRA_URL)
    html = response.content
    soup = bs4.BeautifulSoup(html, 'html.parser')
    txt_data = soup.pre.contents[0]
    pattern = re.compile(r"\n\n\n[ \w'\(\)]+\n\n", re.MULTILINE)
    txt_data = re.sub(pattern, '', txt_data)
    lines = txt_data.split("\n")
    
    lines = [l for l in lines if len(l) > 1 and not l.isupper() and not l.istitle()]
    
    # Collapse the \n (they are breaking sentences in a wide format), 
    # After collapsing the \n we got a large string, split it by "."
    # Add again the . at the end of the lines
    sentences = [ f"{l}." for l in (' '.join(lines)).split(".")]
    
    return sentences

def get_lm_databunch(sentences):
    """ Create a `TextLMDataBunch` for list of sentences """
    train_df = pd.DataFrame({'label': 0, 'text': sentences[:-N_VAL]}) # This is how you specify data for language modeling
    valid_df = pd.DataFrame({'label': 0, 'text': sentences[-N_VAL:]})    
    return TextLMDataBunch.from_df(path='.', train_df=train_df, valid_df=valid_df, bs=192)

def get_trained_lm(sentences):
    data_lm = get_lm_databunch(sentences)
    learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.5)
    learn.fit_one_cycle(10, 1e-2)
    return learn

def predict(lm, sample, n_words=200):
    # Drop the space the tokenizer leaves before "," and "." and turn `xxbos` into a newline.
    p = lm.predict(sample, n_words=n_words).replace(" ,", ",").replace(" .", ".").replace("xxbos", "\n")
    print(p)
    return p

def show_sample(lm, s):
    ss = np.random.choice(s)
    print(f"Input sentence: {ss}\n\n")
    print(f"Generation: {lm.predict(ss, n_words=200)}")

    
def run():
    make_reproducible()
    s = get_sentences()
    lm = get_trained_lm(s)
    
    show_sample(lm, s)
    show_sample(lm, s)
    
    for sample in SAMPLES:
        predict(lm, sample)

In [23]:
!python notebook2script.py "Making a Nietzsche - 1. Research Journal & ULMFit.ipynb" "day_1.py"

Converted Making a Nietzsche - 1. Research Journal & ULMFit.ipynb to exp/day_1.py
