<a href="https://colab.research.google.com/github/XuZuoLizzie/Archived_Work/blob/main/Generating_PubMed_Abstracts.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Generating PubMed abstracts

The objective of this homework is to train a character-level RNN on PubMed abstracts so that it can generate new abstract-like text. Both the training data and the generated output use a simplified Medline-style layout: a `PMID - ` line followed by an `AB - ` line.

## Load Data

In [None]:
! pip install biopython

Collecting biopython
  Downloading biopython-1.79-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (2.3 MB)
Installing collected packages: biopython
Successfully installed biopython-1.79


In [None]:
class Record:
  """Minimal container for the fields of one PubMed article used in this notebook."""
  def __init__(self, pmid, title, abstract):
    self.pmid = pmid
    self.title = title
    self.abstract = abstract

In [None]:
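from Bio import Medline

fp = "/content/medline.0.txt"

# Parse the Medline file and keep only articles that have an abstract.
# Each parsed article behaves like a dict, so indexing a missing 'AB'
# field directly would raise a KeyError.
records = []
with open(fp) as handle:
  for article in Medline.parse(handle):
    if 'AB' not in article:
      continue
    records.append(Record(article['PMID'], article.get('TI', ''), article['AB']))

print(records[0].pmid, records[0].title, records[0].abstract)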

22997744 [Value of magnetic resonance imaging in the diagnosis of recurrent colorectal cancer]. To diagnose recurrent colorectal cancer is an urgent problem of oncoproctology. Eighty patients with suspected recurrent colon tumor were examined. All the patients underwent irrigoscopy, colonoscopy, magnetic resonance imaging of the abdomen and small pelvis. The major magnetic resonance symptoms of recurrent colon tumors were studied; a differential diagnosis of recurrent processes and postoperative changes at the site of intervention was made.


In [None]:
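# Format each record as a two-line entry: "PMID - <id>" then "AB - <abstract>".
abstracts = []
for record in records:
  content = "PMID - " + record.pmid + "\n" + "AB - " + record.abstract
  abstracts.append(content)

# Separate entries with a blank line. The name `corpus` avoids shadowing the built-in input().
corpus = "\n\n".join(abstracts)

# Note: encoding='ascii' raises UnicodeEncodeError if an abstract contains non-ASCII characters.
with open("/content/medline_input.txt", 'w', encoding='ascii') as input_file:
  input_file.write(corpus)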
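
Because the file is opened with `encoding='ascii'`, the write raises `UnicodeEncodeError` as soon as an abstract contains a non-ASCII character (accented letters, Greek symbols, and so on). A minimal sketch of one way to normalize text beforehand, using only the standard library; `to_ascii` is a hypothetical helper name, not part of the original notebook:

```python
import unicodedata

def to_ascii(text):
    """Best-effort ASCII version of `text` (hypothetical helper)."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Dropping everything outside ASCII removes the combining marks, and also
    # any character that has no ASCII base (for example Greek letters).
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("Sjögren syndrome, naïve T cells"))
```

This approach silently drops symbols such as `β`, so a transliteration library such as `unidecode` (installed later in this notebook) may be preferable when those symbols matter.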

## Train the RNN model

In [None]:
! git clone https://github.com/spro/char-rnn.pytorch.git

Cloning into 'char-rnn.pytorch'...
remote: Enumerating objects: 54, done.
Unpacking objects: 100% (54/54), done.
remote: Total 54 (delta 0), reused 0 (delta 0), pack-reused 54


In [None]:
%cd char-rnn.pytorch/

/content/char-rnn.pytorch


In [None]:
! ls -l

total 24
-rwxr-xr-x 1 root root 1784 Oct 18 20:14 generate.py
-rw-r--r-- 1 root root  756 Oct 18 20:14 helpers.py
-rw-r--r-- 1 root root 1081 Oct 18 20:14 LICENSE
-rw-r--r-- 1 root root 1617 Oct 18 20:14 model.py
-rw-r--r-- 1 root root 1950 Oct 18 20:14 README.md
-rwxr-xr-x 1 root root 3160 Oct 18 20:14 train.py


In [None]:
! pip install unidecode

Collecting unidecode
  Downloading Unidecode-1.3.2-py3-none-any.whl (235 kB)

In [None]:
%%writefile train.py

#!/usr/bin/env python
# https://github.com/spro/char-rnn.pytorch

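import argparse
import os
import random  # used by random_training_set; imported explicitly rather than via the star imports
import time    # used for the training timer

import torch
import torch.nn as nn
from torch.autograd import Variable
from tqdm import tqdm

from helpers import *
from model import *
from generate import *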

# Parse command line arguments
argparser = argparse.ArgumentParser()
argparser.add_argument('filename', type=str)
argparser.add_argument('--model', type=str, default="gru")
argparser.add_argument('--n_epochs', type=int, default=2000)
argparser.add_argument('--print_every', type=int, default=100)
argparser.add_argument('--hidden_size', type=int, default=100)
argparser.add_argument('--n_layers', type=int, default=2)
argparser.add_argument('--learning_rate', type=float, default=0.01)
argparser.add_argument('--chunk_len', type=int, default=200)
argparser.add_argument('--batch_size', type=int, default=100)
argparser.add_argument('--shuffle', action='store_true')
argparser.add_argument('--cuda', action='store_true')
args = argparser.parse_args()

if args.cuda:
    print("Using CUDA")

file, file_len = read_file(args.filename)

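def random_training_set(chunk_len, batch_size):
    inp = torch.LongTensor(batch_size, chunk_len)
    target = torch.LongTensor(batch_size, chunk_len)
    for bi in range(batch_size):
        # Each chunk needs chunk_len + 1 characters (input plus shifted target).
        # random.randint is inclusive on both ends, so the upper bound is
        # file_len - chunk_len - 1 to keep the whole slice inside the file.
        start_index = random.randint(0, file_len - chunk_len - 1)
        end_index = start_index + chunk_len + 1
        chunk = file[start_index:end_index]
        inp[bi] = char_tensor(chunk[:-1])    # characters 0 .. chunk_len-1
        target[bi] = char_tensor(chunk[1:])  # the same characters shifted by one
    inp = Variable(inp)
    target = Variable(target)
    if args.cuda:
        inp = inp.cuda()
        target = target.cuda()
    return inp, target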

def train(inp, target):
    hidden = decoder.init_hidden(args.batch_size)
    if args.cuda:
        # A GRU has a single hidden tensor; an LSTM returns a (hidden, cell) tuple.
        if args.model == "gru":
            hidden = hidden.cuda()
        else:
            hidden = (hidden[0].cuda(), hidden[1].cuda())
    decoder.zero_grad()
    loss = 0

    # Accumulate the cross-entropy loss over every character position in the chunk.
    for c in range(args.chunk_len):
        output, hidden = decoder(inp[:,c], hidden)
        loss += criterion(output.view(args.batch_size, -1), target[:,c])

    loss.backward()
    decoder_optimizer.step()

    # Return the mean per-character loss as a plain Python float.
    return loss.item() / args.chunk_len

def save():
    save_filename = os.path.splitext(os.path.basename(args.filename))[0] + '.pt'
    torch.save(decoder, save_filename)
    print('Saved as %s' % save_filename)

# Initialize models and start training

decoder = CharRNN(
    n_characters,
    args.hidden_size,
    n_characters,
    model=args.model,
    n_layers=args.n_layers,
)
decoder_optimizer = torch.optim.Adam(decoder.parameters(), lr=args.learning_rate)
criterion = nn.CrossEntropyLoss()

if args.cuda:
    decoder.cuda()

start = time.time()
all_losses = []
loss_avg = 0

try:
    print("Training for %d epochs..." % args.n_epochs)
    for epoch in tqdm(range(1, args.n_epochs + 1)):
        loss = train(*random_training_set(args.chunk_len, args.batch_size))
        loss_avg += loss

        if epoch % args.print_every == 0:
            print('[%s (%d %d%%) %.4f]' % (time_since(start), epoch, epoch / args.n_epochs * 100, loss))
            print(generate(decoder, 'Wh', 100, cuda=args.cuda), '\n')

    print("Saving...")
    save()

except KeyboardInterrupt:
    print("Saving before quit...")
    save()

Overwriting train.py
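
The training script draws random windows from the corpus and trains the model to predict each next character, so the target is simply the input shifted by one position. A minimal pure-Python sketch of that sampling step, without tensors; `sample_pair` and the toy `text` are illustrative names, not part of `train.py`:

```python
import random

def sample_pair(text, chunk_len, rng=random):
    """Return (input, target) strings of length chunk_len (illustrative helper)."""
    # A window of chunk_len + 1 characters is needed, and randint is inclusive
    # on both ends, so the last valid start is len(text) - chunk_len - 1.
    start = rng.randint(0, len(text) - chunk_len - 1)
    chunk = text[start:start + chunk_len + 1]
    # Input drops the last character; target drops the first.
    return chunk[:-1], chunk[1:]

text = "PMID - 22997744\nAB - To diagnose recurrent colorectal cancer"
inp, target = sample_pair(text, chunk_len=10, rng=random.Random(0))
print(repr(inp), repr(target))
```

Each position `i` of `inp` is paired with position `i` of `target`, which is the character that follows it in the corpus.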


In [None]:
%%writefile generate.py

#!/usr/bin/env python
# https://github.com/spro/char-rnn.pytorch

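import torch
from torch.autograd import Variable  # imported explicitly rather than relying on the star imports below
import os
import argparse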

from helpers import *
from model import *

def generate(decoder, prime_str='A', predict_len=100, temperature=0.8, cuda=False):
    hidden = decoder.init_hidden(1)
    prime_input = Variable(char_tensor(prime_str).unsqueeze(0))

    if cuda:
        if isinstance(hidden, tuple):
            hidden = (hidden[0].cuda(), hidden[1].cuda())
        else:
            hidden = hidden.cuda()
        prime_input = prime_input.cuda()
    predicted = prime_str

    # Use priming string to "build up" hidden state
    for p in range(len(prime_str) - 1):
        _, hidden = decoder(prime_input[:,p], hidden)

    inp = prime_input[:,-1]

    for p in range(predict_len):
        output, hidden = decoder(inp, hidden)

        # Sample from the network as a multinomial distribution
        output_dist = output.data.view(-1).div(temperature).exp()
        top_i = torch.multinomial(output_dist, 1)[0]

        # Add predicted character to string and use as next input
        predicted_char = all_characters[top_i]
        predicted += predicted_char
        inp = Variable(char_tensor(predicted_char).unsqueeze(0))
        if cuda:
            inp = inp.cuda()

    return predicted

# Run as standalone script
if __name__ == '__main__':

    # Parse command line arguments
    argparser = argparse.ArgumentParser()
    argparser.add_argument('filename', type=str)
    argparser.add_argument('-p', '--prime_str', type=str, default='A')
    argparser.add_argument('-l', '--predict_len', type=int, default=100)
    argparser.add_argument('-t', '--temperature', type=float, default=0.8)
    argparser.add_argument('--cuda', action='store_true')
    args = argparser.parse_args()

    decoder = torch.load(args.filename)
    del args.filename
    print(generate(decoder, **vars(args)))

Overwriting generate.py
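
`generate` divides the network's output scores by `temperature` and exponentiates them before sampling, so lower temperatures let the most likely character dominate and higher ones flatten the distribution. A small standard-library sketch of that effect on made-up scores; the function name and the `logits` values are assumptions for illustration, and the real script samples with `torch.multinomial`:

```python
import math
import random

def temperature_probs(logits, temperature):
    """Turn raw scores into sampling probabilities at a given temperature."""
    scaled = [x / temperature for x in logits]
    # Subtracting the max keeps exp() numerically stable without changing the result.
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

logits = [2.0, 1.0, 0.1]  # made-up scores for three characters
for t in (0.2, 0.8, 2.0):
    print(t, [round(p, 3) for p in temperature_probs(logits, t)])

# Sample one character index from the tempered distribution.
rng = random.Random(0)
choice = rng.choices(range(len(logits)), weights=temperature_probs(logits, 0.8), k=1)[0]
```

The default of 0.8 used by `generate.py` sharpens the distribution slightly relative to a temperature of 1.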


In [None]:
! python train.py /content/medline_input.txt \
--model lstm \
--n_epochs 2000 \
--print_every 100 \
--learning_rate 0.01 \
--batch_size 100 \
--cuda

Using CUDA
Training for 2000 epochs...
  5% 99/2000 [00:49<15:56,  1.99it/s][0m 49s (100 5%) 1.8705]
Whition in the 6.879 nospatial dissaibker, ound prenionalles restations were with regressiectiated the 

 10% 199/2000 [01:38<14:50,  2.02it/s][1m 39s (200 10%) 1.5181]
Whome in 5 time and reate that the formatient depin cases. studies the gastric cancer in the case, and 

 15% 299/2000 [02:28<13:59,  2.03it/s][2m 29s (300 15%) 1.4106]
Whogeneges and the patients with women with to despited the endostanced with the levels carcinominal c 

 20% 399/2000 [03:18<13:25,  1.99it/s][3m 18s (400 20%) 1.3475]
Whe prostate and gene additiorations, malignancy with a single sage eye of metastatic associated to be 

 25% 499/2000 [04:07<12:17,  2.03it/s][4m 8s (500 25%) 1.2893]
Whology (CD32) (CRC) proscopy.

PMID - 23006454
AB - Curring neoplasion-chronit cancer potential were  

 30% 599/2000 [04:57<11:38,  2.01it/s][4m 57s (600 30%) 1.3078]
Whis provide DSS4 in HDAT axmation in pathways determin
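
The bracketed number in each log line is the mean per-character cross-entropy. Assuming it is measured in nats, which is what `nn.CrossEntropyLoss` computes with natural logarithms, it can be converted into more interpretable quantities. A short sketch (the helper names are illustrative):

```python
import math

def perplexity(loss_nats):
    """Effective number of equally likely next characters."""
    return math.exp(loss_nats)

def bits_per_char(loss_nats):
    """The same loss expressed in bits per character."""
    return loss_nats / math.log(2)

loss = 1.2893  # value logged at iteration 500 above
print(round(perplexity(loss), 2), round(bits_per_char(loss), 2))
```

A perplexity between 3 and 4 means the model is, on average, about as uncertain as if it were choosing uniformly among three or four characters at each step.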

## Generate PubMed abstracts

In [None]:
! python generate.py medline_input.pt \
--prime_str "PMID" \
--predict_len 2000 \
--cuda

PMID - 23016853
AB - Multidiations and the predominant and most provide inflammation. A prigus and labeling the probable more uptanglientially underly to evaluating overexpressed in patients with some knowledge after chosen in plasma (ROS) to patient demonstrated and management and comparative GORC the proposed and the overall effective and obstruction of 42% increase-related specific cancer didenclity time to overcome gloes of cancer tissue methods are not related to be increased types of Gaves. The androgen chroperative and to active organ treatment. The increasingly associated with diagnostic tests as a conventions to physical reconstroke polymorphism setting cancer cells. CONCLUSIONS: The physical components that management of the potent behavior understanding CRC and ties suggest that the preliminary cell lung cancer and an important predicting staging and overexpression of it population. PURPOSE: The women, integarded the EUS and a potently high cofies histologic and chemobracter
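
Since the model emits the same `PMID - ` / `AB - ` layout it was trained on, the generated text can be split back into records. A minimal sketch using a regular expression; `split_generated` is a hypothetical helper, and it assumes entries are separated by a blank line as in the training file:

```python
import re

# One entry: a "PMID - <digits>" line, then "AB - <text>" up to the next entry or the end.
ENTRY = re.compile(r"PMID - (\d+)\nAB - (.*?)(?=\n\nPMID - |\Z)", re.DOTALL)

def split_generated(text):
    """Return a list of (pmid, abstract) tuples found in generated text."""
    return [(m.group(1), m.group(2).strip()) for m in ENTRY.finditer(text)]

sample = "PMID - 111\nAB - First made-up abstract.\n\nPMID - 222\nAB - Second one"
for pmid, abstract in split_generated(sample):
    print(pmid, "->", abstract)
```

The PMIDs in generated text are sampled character by character, so they should not be treated as real PubMed identifiers.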