# Scientific Abstract Text Generation

In this project, I use the scientific abstracts previously scraped from [Pubmed](https://pubmed.ncbi.nlm.nih.gov/) to train a Recurrent Neural Network for text generation. Instead of working with the DataFrame/csv format from our scraped output [data](https://github.com/chauvu/chauvu.github.io/blob/main/Data/pubmed/manuscripts.csv), I have already extracted all the *lower-case* abstract text into a text file [abstracts_str.txt](https://github.com/chauvu/chauvu.github.io/blob/main/Data/pubmed/abstracts_str.txt). I will now build an RNN to generate the scientific text based on an initial 100-character seed.

In [1]:
import numpy as np
import scipy as sp
import pandas as pd
import pickle
import tensorflow as tf
import matplotlib.pyplot as plt 
from sklearn import model_selection
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D, Dropout, Flatten, MaxPooling2D, LSTM, SimpleRNN
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint

In [2]:
# variable to keep track of whether training has been done
already_trained = 1

In [3]:
filename = "../Data/pubmed/abstracts_str.txt"
# the context manager closes the file automatically
with open(filename, 'r', encoding='utf-8') as abstract_file:
    abstract = abstract_file.read()
print(abstract[:5000])

objectives: to explore the characteristics of idh and tert promoter mutations in gliomas in chinese patients.methods: a total of 124 chinese patients with gliomas were enrolled to study the frequencies of mutations in isocitrate dehydrogenase (idh) and telomerase reverse transcriptase promoter (tertp). among the 124 patients, 59 patients were enrolled to study the classification of gliomas based on mutations in idh and tertp.results: isocitrate dehydrogenase mutations are positively correlated with a good prognosis but mutations in tertp cannot predict prognoses independently. the combined analysis of the mutations of idh and tertp can predict the prognosis more accurately. patients with idh and tertp glioma mutations have the best prognosis, followed by only idh mutation patients and only tertp mutation patients, which have the worst prognosis. idh and tertp mutations occur frequently in males, younger patients or lower-grade patients. in contrast, only tertp mutations occur frequentl

The block of text above looks daunting, but each paragraph is a single abstract, and the abstracts are separated from each other by a `\n` newline character. Therefore, we can split the string on `\n` and look at each abstract independently. Below I print out the 1st and 2nd abstracts as an illustration.

In [4]:
abstracts = abstract.split('\n')

In [5]:
print(abstracts[0])

objectives: to explore the characteristics of idh and tert promoter mutations in gliomas in chinese patients.methods: a total of 124 chinese patients with gliomas were enrolled to study the frequencies of mutations in isocitrate dehydrogenase (idh) and telomerase reverse transcriptase promoter (tertp). among the 124 patients, 59 patients were enrolled to study the classification of gliomas based on mutations in idh and tertp.results: isocitrate dehydrogenase mutations are positively correlated with a good prognosis but mutations in tertp cannot predict prognoses independently. the combined analysis of the mutations of idh and tertp can predict the prognosis more accurately. patients with idh and tertp glioma mutations have the best prognosis, followed by only idh mutation patients and only tertp mutation patients, which have the worst prognosis. idh and tertp mutations occur frequently in males, younger patients or lower-grade patients. in contrast, only tertp mutations occur frequentl

In [6]:
print(abstracts[1])

objective: to investigate the demographics, natural history and treatment outcomes of non-molar gestational choriocarcinoma.design: a retrospective national population-based study setting: uk 1995-2015 population: a total of 234 women with a diagnosis of gestational choriocarcinoma, in the absence of a prior molar pregnancy, managed at the uks two gestational trophoblast centres in london and sheffield.methods: retrospective review of the patient's demographic and clinical data. comparison with contemporary uk birth and pregnancy statistics.main outcomes: incidence statistics for non-molar choriocarcinoma across the maternal age groups. cure rates for patients by figo prognostic score group.results: over the 21-year study period there were a total of 234 cases of non-molar gestational choriocarcinoma, giving an incidence of 1:66,775 relative to live births and 1:84,226 to viable pregnancies. for women aged under 20 the incidence relative to viable pregnancies was 1:223,494, for ages 30

For an RNN, I pass in a sequence of a fixed length and predict the next character. I will create a dictionary to convert each character to a number and a corresponding dictionary to convert each number back to a character. I will also use a sequence length of `seq_len=100`, meaning the RNN takes in a sequence of 100 characters and predicts the 101st character.

* Variable `sequences` is a list of all possible 100-char sequences generated from these abstracts.
* Variable `pred_chars` is the list of the 101st characters corresponding to the items in `sequences`.

Note that since the `abstract` string contains **all** abstracts separated by a newline `\n` character, I will exclude any sequence that contains a newline, as well as any sequence whose next character is a newline. This is done because these abstracts are independent of each other, and it makes no sense for a sequence at the end of one abstract to predict the start of the next abstract.

In [7]:
seq_len = 100
sequences = []
pred_chars = []

for i in range(0, len(abstract)-seq_len):
    sequence = abstract[i:i+seq_len].lower()
    pred_char = abstract[i+seq_len].lower()
    if('\n' not in sequence and '\n'!=pred_char):
        sequences.append(sequence)
        pred_chars.append(pred_char)
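
To make the windowing concrete, here is a minimal sketch of the same idea on a toy string. The helper name `make_windows` and the toy text are illustrative assumptions and not part of the notebook; the filtering rule mirrors the loop above.

```python
def make_windows(text, seq_len):
    """Return (sequences, next_chars), skipping windows that touch a newline."""
    sequences, next_chars = [], []
    for i in range(len(text) - seq_len):
        window = text[i:i + seq_len]
        next_char = text[i + seq_len]
        # a window spanning two abstracts (or predicting a newline) is dropped
        if '\n' not in window and next_char != '\n':
            sequences.append(window)
            next_chars.append(next_char)
    return sequences, next_chars

# two toy "abstracts" separated by a newline, window length 3
toy_sequences, toy_next = make_windows("abcd\nefgh", 3)
print(toy_sequences, toy_next)
```

Only the windows that lie entirely within one abstract survive, so no training example asks the network to predict across an abstract boundary.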

In [8]:
char_list = sorted(list(set(abstract))) 
char_list = char_list[1:] # remove \n newline
index_list = np.arange(len(char_list))
print(char_list)

[' ', '"', '%', '&', "'", '(', ')', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '@', '[', ']', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '~', '\xa0', '©', '®', '°', '±', 'µ', '·', '×', 'á', 'í', 'ï', 'ö', 'ü', 'α', 'β', 'γ', 'δ', 'ε', 'κ', 'μ', 'ρ', 'σ', '\u2003', '\u2009', '‑', '•', '\u202f', '‰', '€', '∑', '∙', '∞', '∼', '≅', '≤', '≥', '△', 'ﬂ']


Looking at `char_list`, we see some uncommon symbols like `\xa0` (a non-breaking space, shown as a hexadecimal escape) or `\u2003` (an em space, shown as a Unicode escape). However, since we expect these characters to be infrequent, we don't have to remove them. In total, we can count **96** characters. We will then create 2 dictionaries, `char2index_dict` and `index2char_dict`, to convert between characters and index numbers.

In [9]:
char2index_dict = dict(zip(char_list,index_list))
index2char_dict = dict(zip(index_list,char_list))
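
As a quick illustration of how the two dictionaries work together, the sketch below builds the same kind of mapping on a toy string and checks that encoding followed by decoding returns the original text. All `toy_*` names are hypothetical and only used for this example.

```python
toy_text = "abca"
toy_chars = sorted(set(toy_text))          # ['a', 'b', 'c']

# character -> index and index -> character lookups
toy_char2index = {ch: i for i, ch in enumerate(toy_chars)}
toy_index2char = {i: ch for ch, i in toy_char2index.items()}

encoded = [toy_char2index[ch] for ch in toy_text]
decoded = "".join(toy_index2char[i] for i in encoded)
print(encoded, decoded)
```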

The following code converts the two lists `sequences` and `pred_chars` from lists of *characters* to lists of *index numbers*, generating `sequences_char_array` and `pred_char_list_onehot`. The latter is in *one-hot* form, which is similar to dummifying categorical variables.

In [10]:
sequences_char_list = []
pred_char_list = []
for i in range(0,len(sequences)):
    seq = sequences[i]
    pred_char = pred_chars[i]
    seq_char = [char2index_dict[char] for char in list(seq)]
    sequences_char_list.append(seq_char)
    pred_char_list.append(char2index_dict[pred_char])

sequences_char_array = np.array(sequences_char_list)
sequences_char_array = np.reshape(sequences_char_array, (len(sequences_char_array),seq_len,1)) # X
pred_char_list_onehot = tf.keras.utils.to_categorical(np.array(pred_char_list)) # y

In [11]:
# normalize X between 0 and 1 values
sequences_char_array_norm = sequences_char_array / (len(char_list))
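
The shapes involved are easier to see on a tiny example. The sketch below re-implements one-hot encoding with plain NumPy (the `one_hot` helper and the toy arrays are assumptions for illustration, not the notebook's code). One detail worth knowing: when `num_classes` is not passed, `to_categorical` infers the number of classes from the largest index it sees, so passing `num_classes=len(char_list)` explicitly may be safer if the highest-index character never occurs as a target.

```python
import numpy as np

def one_hot(indices, num_classes):
    """One row per target, with a single 1.0 at the target's index."""
    indices = np.asarray(indices)
    out = np.zeros((len(indices), num_classes), dtype=np.float32)
    out[np.arange(len(indices)), indices] = 1.0
    return out

toy_vocab_size = 4
toy_targets = [2, 0, 1]                        # index of each "next character"
toy_onehot = one_hot(toy_targets, toy_vocab_size)

# two toy input windows of length 3, reshaped to (samples, timesteps, features)
toy_X = np.array([[0, 1, 2], [1, 2, 3]]).reshape(2, 3, 1) / toy_vocab_size
print(toy_onehot.shape, toy_X.shape)
```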

I also create 2 functions:
* `convert_index_to_charseq` (input is list of integers and output is sequence of characters)
* `convert_char_to_indexseq` (input is a sequence/list of characters and output is list of integers)

In [12]:
def convert_index_to_charseq(index_list): # pass in integers
    charseq = [index2char_dict[index] for index in index_list]
    return charseq

def convert_char_to_indexseq(char_list): # pass in list of chars
    indexseq = [char2index_dict[char] for char in char_list]
    return indexseq

To test the neural networks after training, I create 2 functions:
* `text_generation_random`: input is the neural network; prints out the 100 predicted characters that follow a randomly chosen seed sequence.
* `text_generation_seq`: input is the neural network and a 100-char sequence, prints out the 100-char predicted sequence.

I have set `s` as 'patients with portal vein invasion (main trunk or the 1st order branch) were enrolled. during a 5-we', which is a random string I have chosen to test all trained models.

In [13]:
def text_generation_random(model):
    start = np.random.randint(0, len(sequences_char_array_norm)-1)
    pattern = list(sequences_char_list[start])  # copy, so append() below does not modify the training data
    seq = convert_index_to_charseq(pattern)
    print(''.join(seq))
    for i in range(100):
        x = np.reshape(pattern, (1, len(pattern), 1))
        x = x / float(len(char_list))
        prediction = model.predict(x, verbose=0)
        index = np.argmax(prediction)
        result = convert_index_to_charseq([index])
        seq += result
        pattern.append(index)
        pattern = pattern[1:len(pattern)]
    print(''.join(seq))
    
def text_generation_seq(seq, model):
    seq = list(seq)
    pattern = convert_char_to_indexseq(seq)
    print(''.join(seq))
    for i in range(100):
        x = np.reshape(pattern, (1, len(pattern), 1))
        x = x / float(len(char_list))
        prediction = model.predict(x, verbose=0)
        index = np.argmax(prediction)
        result = convert_index_to_charseq([index])
        seq += result
        pattern.append(index)
        pattern = pattern[1:len(pattern)]
    print(''.join(seq))
    
s = 'patients with portal vein invasion (main trunk or the 1st order branch) were enrolled. during a 5-we'
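
The core of both functions is a sliding window: predict one character, append it, and drop the oldest character. The sketch below isolates that loop using a hypothetical stand-in `toy_predict` in place of `model.predict`, so it runs without a trained network; `generate_greedy` is likewise an illustrative name.

```python
def generate_greedy(seed_indices, predict_fn, n_steps):
    """Greedy decoding: always pick the most probable next index."""
    window = list(seed_indices)            # copy so the caller's seed is untouched
    generated = []
    for _ in range(n_steps):
        probs = predict_fn(window)
        next_index = max(range(len(probs)), key=probs.__getitem__)  # argmax
        generated.append(next_index)
        window = window[1:] + [next_index]  # slide the window by one step
    return generated

def toy_predict(window):
    # stand-in for a trained model: favours (last index + 1) modulo 3
    probs = [0.1, 0.1, 0.1]
    probs[(window[-1] + 1) % 3] = 0.8
    return probs

seed = [0, 1, 2]
print(generate_greedy(seed, toy_predict, 5))
```

Because the choice at every step is the single most probable character, the output is fully determined by the seed.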

To start off, I will build a simple RNN. I use `ModelCheckpoint` to save a weights file after each epoch in which the loss has improved.

After 10 epochs, the model yields a loss of `2.7367` during training. When passed the sequence in `s`, the simple RNN generates a meaningless sequence, which is definitely not what we expect from a scientific abstract. Keep in mind that this is after only 10 epochs of training; at this step we only want a preliminary result.

In [14]:
model = Sequential()
model.add(SimpleRNN(256, input_shape=(sequences_char_array_norm.shape[1], sequences_char_array_norm.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(pred_char_list_onehot.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

filepath="../Data/pubmed/weights/abstracts-simplernn256-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

if already_trained==0:
    model.fit(sequences_char_array_norm, pred_char_list_onehot, epochs=10, batch_size=128, callbacks=callbacks_list)
else:
    model.load_weights('../Data/pubmed/weights/abstracts-simplernn256-10-2.7367.hdf5')
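
The placeholders in `filepath` are filled in from the epoch number and the logged metrics when a checkpoint is written. The sketch below mimics that with plain `str.format` (an approximation of what the callback does, not its actual code) to show how the template leads to the file name loaded above.

```python
template = "abstracts-simplernn256-{epoch:02d}-{loss:.4f}.hdf5"

# epoch is zero-padded to two digits, loss is rounded to four decimals
checkpoint_name = template.format(epoch=10, loss=2.7367)
print(checkpoint_name)
```

Note that each separate call to `model.fit` starts counting epochs from 1 again unless `initial_epoch` is passed, which is worth keeping in mind when training is continued in several rounds.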

In [15]:
text_generation_seq(s, model)

patients with portal vein invasion (main trunk or the 1st order branch) were enrolled. during a 5-we
patients with portal vein invasion (main trunk or the 1st order branch) were enrolled. during a 5-we ao  noe) and poe soeti the see aoreriateo wi the aod aod ao the pereo aod tee pereo aod tee pereo a


Next, I will build a very simple Long Short-Term Memory (LSTM) neural network. I have chosen an LSTM instead of a simple RNN because its gating mechanism makes it better at retaining or forgetting information over long sequences, which suits predictions based on sequential data.

After 10 epochs (loss=`2.3221`), the predicted result from `text_generation_seq` was also meaningless, but compared to the simple RNN model, the prediction had longer words. Therefore, I decided to train this network for an extra 10 epochs to assess the prediction further.

After a total of 20 epochs (loss=`2.2017`), the predicted sequence contains some strings that resemble actual words, such as `sesults`, which resembles `results`, and `costelation`, which resembles `constellation`.

In [16]:
model = Sequential()
model.add(LSTM(256, input_shape=(sequences_char_array_norm.shape[1], sequences_char_array_norm.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(pred_char_list_onehot.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

filepath="../Data/pubmed/weights/abstracts-lstm256-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

if already_trained==0:
    model.fit(sequences_char_array_norm, pred_char_list_onehot, epochs=10, batch_size=128, callbacks=callbacks_list)
else:
    model.load_weights('../Data/pubmed/weights/abstracts-lstm256-10-2.3221.hdf5')

In [17]:
text_generation_seq(s, model)

patients with portal vein invasion (main trunk or the 1st order branch) were enrolled. during a 5-we
patients with portal vein invasion (main trunk or the 1st order branch) were enrolled. during a 5-wea sesuete teses and the cnnteot of the cnnteoi tas see pote tf the reetttonn of the cnnteoi tf the r


In [18]:
if already_trained==0:
    model.fit(sequences_char_array_norm, pred_char_list_onehot, epochs=10, batch_size=128, callbacks=callbacks_list)
else:
    model.load_weights('../Data/pubmed/weights/abstracts-lstm256-20-2.2017.hdf5')

In [19]:
text_generation_seq(s, model)

patients with portal vein invasion (main trunk or the 1st order branch) were enrolled. during a 5-we
patients with portal vein invasion (main trunk or the 1st order branch) were enrolled. during a 5-weik and tee ans oe the pesuots oe the sesults seree to tee costelation botpeeation of the sesuete in 


All in all, a direct comparison between the simple RNN and the LSTM after 10 epochs shows a lower loss and more promising predictions from the LSTM. After 20 epochs, the LSTM's predictions are better still. Therefore, I will discard the simple RNN and focus only on the LSTM from here on.

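I notice that the loss decreased only slightly between 10 epochs and 20 epochs (from `2.3221` to `2.2017`), which made me suspect that the learning rate was too high. Note, however, that the default learning rate of the Keras Adam optimizer is `0.001`, not `0.01`, so the next run, which sets the learning rate to `0.001` explicitly, uses the same value as the previous network; differences between the two runs therefore reflect random initialization and shuffling rather than the optimizer setting. I will initially train for 10 epochs, evaluate the prediction, then train for an extra 10 epochs.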

* After 10 epochs, loss=`2.3492`, and the predicted sequence shows some strange repetitions.
* After 20 epochs, loss=`2.1168`, and the predicted sequence also shows suspicious repetitions of the phrase `of the petionmance`. It is encouraging that the network generates words like `of` and `the` together correctly, and the word `petionmance` resembles the word `performance`. I take this repetition as a sign that a learning rate of `0.001` may still be too high.

In [20]:
model = Sequential()
model.add(LSTM(256, input_shape=(sequences_char_array_norm.shape[1], sequences_char_array_norm.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(pred_char_list_onehot.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=0.001))  # `lr` is a deprecated alias

filepath="../Data/pubmed/weights/abstracts-lstm256-adam001-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

if already_trained==0:
    model.fit(sequences_char_array_norm, pred_char_list_onehot, epochs=10, batch_size=128, callbacks=callbacks_list)
else:
    model.load_weights('../Data/pubmed/weights/abstracts-lstm256-adam001-10-2.3492.hdf5')

In [21]:
# after 10 epochs
text_generation_seq(s,model)

patients with portal vein invasion (main trunk or the 1st order branch) were enrolled. during a 5-we
patients with portal vein invasion (main trunk or the 1st order branch) were enrolled. during a 5-wegl she petiona cortelation oe the petionanc toent and the ceneriii and tee petion of the petionanc t


In [22]:
if already_trained==0:
    model.fit(sequences_char_array_norm, pred_char_list_onehot, epochs=10, batch_size=128, callbacks=callbacks_list)
else:
    model.load_weights('../Data/pubmed/weights/abstracts-lstm256-adam001-20-2.1168.hdf5')

In [23]:
# after 20 epochs
text_generation_seq(s,model)

patients with portal vein invasion (main trunk or the 1st order branch) were enrolled. during a 5-we
patients with portal vein invasion (main trunk or the 1st order branch) were enrolled. during a 5-wegk alti and tee contint of the petionmance of the petionmance of the petionmance of the petionmance 
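
Loops such as `of the petionmance of the petionmance` are also typical of greedy decoding, because `np.argmax` always returns the same character for the same 100-character window. One common alternative, not used in this notebook, is to sample the next character from the predicted distribution, optionally rescaled by a temperature. The sketch below is a minimal illustration under that assumption; `sample_index` is a hypothetical helper.

```python
import numpy as np

def sample_index(probs, temperature=1.0, rng=None):
    """Sample an index from `probs` after temperature rescaling."""
    rng = np.random.default_rng() if rng is None else rng
    probs = np.asarray(probs, dtype=np.float64)
    logits = np.log(probs + 1e-12) / temperature   # low temperature sharpens
    logits -= logits.max()                         # numerical stability
    scaled = np.exp(logits)
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))

toy_probs = [0.1, 0.7, 0.2]
rng = np.random.default_rng(0)
draws = [sample_index(toy_probs, temperature=1.0, rng=rng) for _ in range(200)]
print(sorted(set(draws)))
```

With a temperature close to 0 this behaves like `argmax`; with a temperature of 1 the characters are drawn in proportion to the predicted probabilities, which tends to break exact repetitions at the cost of more noise.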


I will now construct the same LSTM network but using an Adam optimizer with a learning rate of `0.0001`, again training for 10 and 20 epochs to compare with the previous networks. After 10 and 20 epochs, the network has losses of `2.8654` and `2.7270` respectively, both much higher than the losses shown above. This is as expected because the learning rate is much smaller, so the network requires more epochs to reach the same loss. Additionally, the predicted sequences are nonsensical repetitions of a few characters.

In [24]:
model = Sequential()
model.add(LSTM(256, input_shape=(sequences_char_array_norm.shape[1], sequences_char_array_norm.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(pred_char_list_onehot.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=0.0001))  # `lr` is a deprecated alias

filepath="../Data/pubmed/weights/abstracts-lstm256-adam0001-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

if already_trained==0:
    model.fit(sequences_char_array_norm, pred_char_list_onehot, epochs=10, batch_size=128, callbacks=callbacks_list)
else:
    model.load_weights('../Data/pubmed/weights/abstracts-lstm256-adam0001-10-2.8654.hdf5')

In [25]:
# after 10 epochs
text_generation_seq(s,model)

patients with portal vein invasion (main trunk or the 1st order branch) were enrolled. during a 5-we
patients with portal vein invasion (main trunk or the 1st order branch) were enrolled. during a 5-wee toe tee tee tee tee tee tee tee tee tee tee tee tee tee tee tee tee tee tee tee tee tee tee tee te


In [26]:
if already_trained==0:
    model.fit(sequences_char_array_norm, pred_char_list_onehot, epochs=10, batch_size=128, callbacks=callbacks_list)
else:
    model.load_weights('../Data/pubmed/weights/abstracts-lstm256-adam0001-20-2.7270.hdf5')

In [27]:
# after 20 epochs
text_generation_seq(s,model)

patients with portal vein invasion (main trunk or the 1st order branch) were enrolled. during a 5-we
patients with portal vein invasion (main trunk or the 1st order branch) were enrolled. during a 5-wee  and  and tee aatee te the tee tere te tee teee to tee tee toet fo the pete te tee tete tee teee t
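
To put these loss values in context, it can help to convert them to perplexity and compare them with the loss of a model that guesses uniformly over the 96 characters. The sketch below assumes the reported categorical cross-entropy is an average measured in nats (natural logarithm), which is how Keras computes it; the helper names are illustrative.

```python
import math

def perplexity(cross_entropy_nats):
    """Effective number of equally likely characters the model chooses between."""
    return math.exp(cross_entropy_nats)

def uniform_baseline_loss(num_classes):
    """Cross-entropy of a model that assigns every character equal probability."""
    return math.log(num_classes)

baseline = uniform_baseline_loss(96)
for loss in (2.7270, 2.1168):
    print(f"loss={loss:.4f}  perplexity={perplexity(loss):.2f}")
print(f"uniform baseline loss={baseline:.4f}")
```

A loss well below the uniform baseline means the network has learned the character frequencies and some local structure, even when the generated text is not yet readable.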


I will continue training this LSTM network with `lr=0.0001` to epoch 50, 80 and 100. The predicted sequences are shown in the cells below.

Note the result for epoch 80 is particularly similar in form to what we would expect a scientific abstract to be. It includes numbers `19.7`, `0.05` and even a whole phrase `(n = 0.001)`. Abstracts tend to include an `(n=XYZ)` to denote how many subjects the study has enrolled. 

Despite the good result from epoch 80, the prediction from epoch 100 is not as encouraging. The one good thing about this prediction is the network's ability to generate the whole word `analyzed` correctly, which is a very frequently occurring word in scientific reports.

In [28]:
# 50 epochs
if already_trained==0:
    model.fit(sequences_char_array_norm, pred_char_list_onehot, epochs=30, batch_size=128, callbacks=callbacks_list)
else:
    model.load_weights('../Data/pubmed/weights/abstracts-lstm256-adam0001-50-2.4821.hdf5')
text_generation_seq(s,model)

patients with portal vein invasion (main trunk or the 1st order branch) were enrolled. during a 5-we
patients with portal vein invasion (main trunk or the 1st order branch) were enrolled. during a 5-wee tere setelted  nore tere seteled  nore  and ses  and aod tere tee poteit of the peteont se tee pet


In [29]:
# 80 epochs
if already_trained==0:
    model.fit(sequences_char_array_norm, pred_char_list_onehot, epochs=30, batch_size=128, callbacks=callbacks_list)
else:
    model.load_weights('../Data/pubmed/weights/abstracts-lstm256-adam0001-80-2.3372.hdf5')
text_generation_seq(s,model)

patients with portal vein invasion (main trunk or the 1st order branch) were enrolled. during a 5-we
patients with portal vein invasion (main trunk or the 1st order branch) were enrolled. during a 5-wea tete  sai)saa aad  19.7))  neseitianl  a d 0.05) and c.teen sirh (n = 0.001) and dod (0 h k   2.15


In [30]:
# 100 epochs
if already_trained==0:
    model.fit(sequences_char_array_norm, pred_char_list_onehot, epochs=20, batch_size=128, callbacks=callbacks_list)  # 80 + 20 = 100 epochs
else:
    model.load_weights('../Data/pubmed/weights/abstracts-lstm256-adam0001-100-2.2762.hdf5')
text_generation_seq(s,model)

patients with portal vein invasion (main trunk or the 1st order branch) were enrolled. during a 5-we
patients with portal vein invasion (main trunk or the 1st order branch) were enrolled. during a 5-wea teted cane sereeted toene the canee on the cate  oh tae cig  cnd se analyzed the tee cot cot oetio


I will continue training this network to epochs 120, 130, 140 and 150. The results of this training are shown in the cells below.

* Epoch 120: encouraging. Words such as `theneficantly` and `hicrerse` likely stand for `significantly` and `increase`. The appearance of digits, especially decimal numbers written as percentages, is also encouraging.
* Epoch 130: simple phrases like `were a` or `with the` are correct. The phrase `a = 0.05` is very good, since `a` likely stands in for the alpha level of 0.05 typically used in statistical tests. The token `5nd` is likely a blend of `2nd` and `5th`, but it is encouraging because it shows an ordinal.
* Epoch 140: `alalyses` likely refers to `analyses`. However, the rest of the words seem meaningless.
* Epoch 150: `aosociated` likely refers to `associated`, but the rest of the sequence seems meaningless.

In [31]:
# 120 epochs
if already_trained==0:
    model.fit(sequences_char_array_norm, pred_char_list_onehot, epochs=20, batch_size=128, callbacks=callbacks_list)
else:
    model.load_weights('../Data/pubmed/weights/abstracts-lstm256-adam0001-120-2.2314.hdf5')
text_generation_seq(s,model)

patients with portal vein invasion (main trunk or the 1st order branch) were enrolled. during a 5-we
patients with portal vein invasion (main trunk or the 1st order branch) were enrolled. during a 5-wea tete  c00.,  and theneficantly hicrerse th  55.5%  55%  ii 1..5- ...5)  nortivev ver rereiredn fn 


In [32]:
# 130 epochs
if already_trained==0:
    model.fit(sequences_char_array_norm, pred_char_list_onehot, epochs=10, batch_size=128, callbacks=callbacks_list)
else:
    model.load_weights('../Data/pubmed/weights/abstracts-lstm256-adam0001-130-2.2137.hdf5')
text_generation_seq(s,model)

patients with portal vein invasion (main trunk or the 1st order branch) were enrolled. during a 5-we
patients with portal vein invasion (main trunk or the 1st order branch) were enrolled. during a 5-weaal doales were ansoditted with the sitil afner toret (a = 0.05  o = 0.00), 5nd phe rote of thriint 


In [33]:
# 140 epochs
if already_trained==0:
    model.fit(sequences_char_array_norm, pred_char_list_onehot, epochs=10, batch_size=128, callbacks=callbacks_list)
else:
    model.load_weights('../Data/pubmed/weights/abstracts-lstm256-adam0001-140-2.1997.hdf5')
text_generation_seq(s,model)

patients with portal vein invasion (main trunk or the 1st order branch) were enrolled. during a 5-we
patients with portal vein invasion (main trunk or the 1st order branch) were enrolled. during a 5-wean alalyses  nep)  nn th tee sees tf petertian ienetr so the reteltial  ohreite toees the reren and 


In [34]:
# 150 epochs
if already_trained==0:
    model.fit(sequences_char_array_norm, pred_char_list_onehot, epochs=10, batch_size=128, callbacks=callbacks_list)  # 140 + 10 = 150 epochs
else:
    model.load_weights('../Data/pubmed/weights/abstracts-lstm256-adam0001-150-2.1818.hdf5')
text_generation_seq(s,model)

patients with portal vein invasion (main trunk or the 1st order branch) were enrolled. during a 5-we
patients with portal vein invasion (main trunk or the 1st order branch) were enrolled. during a 5-weaat were eesnttideed the tesu tf tee poetinen of shein tise sas aosociated with an  nerrerel afn to 


Since we lack a validation set and a quantitative measure of how well the trained network performs, we cannot determine how many epochs are necessary for a good text generation network. However, we do note that the network after ~120-130 epochs seems to yield better results than after 140 or more epochs, which suggests that early stopping is necessary. Therefore, we will use the model trained for 130 epochs and show some of its results.
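
One way to obtain such a quantitative measure would be to hold out a fraction of the abstracts before building the 100-character windows, so that no validation window overlaps a training window. The sketch below is a minimal illustration under that assumption; `split_abstracts` and the toy data are hypothetical and were not part of the experiments reported here.

```python
import random

def split_abstracts(abstracts, val_fraction=0.1, seed=0):
    """Shuffle whole abstracts and hold out a fraction for validation."""
    shuffled = list(abstracts)
    random.Random(seed).shuffle(shuffled)      # seeded for reproducibility
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

toy_abstracts = [f"abstract {i}" for i in range(10)]
train_abstracts, val_abstracts = split_abstracts(toy_abstracts, val_fraction=0.2)
print(len(train_abstracts), len(val_abstracts))
```

The windows for each subset would then be built separately, and the held-out arrays could be passed to `model.fit` through its `validation_data` argument, with `ModelCheckpoint` monitoring `val_loss` instead of `loss`.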

In [35]:
model.load_weights('../Data/pubmed/weights/abstracts-lstm256-adam0001-130-2.2137.hdf5')

In [36]:
s = '2015 and april 2017 was performed. the final cohort consisted of 1508 thyroidectomy specimens from 1'
text_generation_seq(s,model)

2015 and april 2017 was performed. the final cohort consisted of 1508 thyroidectomy specimens from 1
2015 and april 2017 was performed. the final cohort consisted of 1508 thyroidectomy specimens from 1005  00.%  oo the pitiooan sesuotseo of the magi and 10.7 ± 1..   .. -..3     = 0.001, and phe lothr


In [37]:
s = 'by the low surface soil water content significantly depressed photosynthesis and et during the dayti'
text_generation_seq(s,model)

by the low surface soil water content significantly depressed photosynthesis and et during the dayti
by the low surface soil water content significantly depressed photosynthesis and et during the dayti of the petional and soetianes and the petionaliy ofseod oo tee cene conte ti tee set domisiiinal ae


In [38]:
s = 'e samples. hpv-16 was found in 75% of the positive samples for an overall prevalence of 13.5%. p16 i'
text_generation_seq(s,model)

e samples. hpv-16 was found in 75% of the positive samples for an overall prevalence of 13.5%. p16 i
e samples. hpv-16 was found in 75% of the positive samples for an overall prevalence of 13.5%. p16 ii a.. ced    5.. % 5.%% 1.%  ni, a., 95%  ..%  ne aad ci  0005, ...5))  nif a torg ff tee seree tren


In [39]:
s = 'production, the mechanism involved in this process is unclear. in the present study, we demonstrated'
text_generation_seq(s,model)

production, the mechanism involved in this process is unclear. in the present study, we demonstrated
production, the mechanism involved in this process is unclear. in the present study, we demonstrated that the poesiatid ro cn the resuett on the peselntn cffecte te tee crriciation of the finer cnnpen


In [40]:
s = 'eliminating aftertreatment tampering for china iv and china v hddvs.copyright © 2020 elsevier ltd. a'
text_generation_seq(s,model)

eliminating aftertreatment tampering for china iv and china v hddvs.copyright © 2020 elsevier ltd. a
eliminating aftertreatment tampering for china iv and china v hddvs.copyright © 2020 elsevier ltd. all rights reserved..bheliad by llseoved.ty calerv rh .al  iugh pirhonsdr  n = 0.001). the soenes cfr


## Final words
We can see that this model does not perform very well in generating scientific text for PubMed abstracts. After 130 epochs, however, we can see glimpses of potential: some generated phrases resemble what we would expect from an abstract. Examples from the previous cells:
* The generation of `95%` and `ci` (confidence interval) in the same sentence.
* When the seed sequence ends with `we demonstrated`, the generated sequence continues with `that the`.
* When the seed sequence ends with a digit, the generated sequence continues with numerical characters.
* When the seed sequence ends with `2020 elsevier ltd. a` (which a number of abstracts end with to show copyright by Elsevier as the publishing company), the network is able to generate the whole phrase `all rights reserved`.

With a total of > 400,000 sequences of 100 characters in length, we did not have enough training data to build a more robust recurrent neural network. Additionally, scientific abstracts pose a number of challenges of their own.
* *Too much abbreviation*. Not only are frequent abbreviations like `CI` (confidence interval) or `MRI` (magnetic resonance imaging) used, but more obscure scientific abbreviations are also common, on the assumption that the reader is familiar with the particular topic.
* *Variation in writing style*. Text generation benefits from having sequences written in a specific style, which makes it easier for the network to learn. Since each abstract is (likely) written by a different author, many of whom are not native English speakers, the paragraph and sentence structure is more difficult to learn.
* *Overuse of non-alphabet characters*. Scientific abstracts contain more symbols and digits than normal texts. If these characters were used frequently and uniformly, the network would be able to learn them. However, due to the formatting of the abstracts, a symbol like alpha can be written as `alpha`, `α` or `a`. Numbers may contain a non-breaking space (`\xa0`) or a `,` as a thousands separator, or neither. This non-uniformity in style contributes to training error.
* *Non-uniform spacing*. Sometimes a sentence starts with no space after the preceding punctuation, and sometimes it starts after multiple spaces.

But most importantly,
* *Too many scientific terms*. To predict everyday speech, only a relatively small set of colloquial terms and structures needs to be learned. The set of scientific phrases is much larger, so our 400,000 training sequences did not suffice for this application.
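
Some of the character and spacing issues listed above could be reduced with a normalization pass before building the training sequences. The sketch below is one possible approach, not something applied in this project: it uses Unicode NFKC normalization from the standard library and collapses runs of spaces. The function name `normalize_text` is hypothetical, and NFKC does not unify every variant (for example, it leaves `∼` unchanged).

```python
import re
import unicodedata

def normalize_text(text):
    """Fold compatibility characters and collapse repeated spaces."""
    # NFKC maps e.g. non-breaking and thin spaces to ' ', the ligature 'ﬂ'
    # to 'fl', and the micro sign to the Greek letter mu
    text = unicodedata.normalize("NFKC", text)
    # collapse runs of spaces/tabs but keep '\n', which separates abstracts
    return re.sub(r"[ \t]+", " ", text)

sample = "in\u00a0vitro  \ufb02ow 5\u2009\u00b5m"
print(normalize_text(sample))
```

A smaller and more consistent character set would leave the network fewer rare symbols to learn from the same amount of data.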