
# Work 3: Hybrid Summarization with SumBasic and Pegasus-large Out-of-the-Box (claims/description sections)

Before this worksheet, the claims and description sections of organic chemistry patents from HUPD were condensed with extractive summarization (the SumBasic technique). Here we try to produce better summaries by passing those extractive summaries through Pegasus-large.
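For readers who have not seen the earlier worksheet, the idea behind SumBasic can be sketched in a few lines. This is a simplified illustration, not the code used to build `extractive_summaries`: the function name `sumbasic`, the regex-based sentence and word splitting, and the `num_sentences` parameter are all assumptions, and the sketch skips the step in the original algorithm that restricts candidates to sentences containing the currently most probable word.

```python
import re
from collections import Counter

def sumbasic(text, num_sentences=2):
    """Pick sentences by average word probability, damping words already used."""
    # Naive sentence and word splitting; a real pipeline would use a proper tokenizer.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]

    # Unigram probabilities over the whole input text.
    counts = Counter(w for words in tokenized for w in words)
    total = sum(counts.values())
    prob = {w: c / total for w, c in counts.items()}

    chosen = []
    candidates = [i for i, words in enumerate(tokenized) if words]
    while candidates and len(chosen) < num_sentences:
        # Score each remaining sentence by the mean probability of its words.
        best = max(
            candidates,
            key=lambda i: sum(prob[w] for w in tokenized[i]) / len(tokenized[i]),
        )
        chosen.append(best)
        candidates.remove(best)
        # Square the probability of every word just used to discourage redundancy.
        for w in set(tokenized[best]):
            prob[w] = prob[w] ** 2

    # Return the selected sentences in document order.
    return [sentences[i] for i in sorted(chosen)]
```

Calling `sumbasic(patent_text, num_sentences=5)` on a long claims/description string would return the five highest-scoring sentences in their original order.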

## Setup

In [None]:
#install libraries
!pip install -q datasets
!pip install -q transformers
!pip install --quiet --upgrade accelerate
!pip install -q sentencepiece
!pip install -q evaluate
!pip install -q rouge_score

In [None]:
# Import packages

#standard data science libraries
import pandas as pd
import numpy as np
import random
import matplotlib

#visualization
import matplotlib.pyplot as plt
from pprint import pprint
from IPython.display import display, HTML

#datasets
import datasets
from datasets import load_from_disk
# load_metric is deprecated in favor of the evaluate library imported below
from datasets import load_dataset

#transformers
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

# PyTorch
import torch
from torch.utils.data import DataLoader

#rouge
import evaluate

In [None]:
# This cell will authenticate you and mount your Drive in the Colab.
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
save_dir="/content/drive/MyDrive/W266/HUPD"

Mounted at /content/drive


## Load Data

In [None]:
extracted_train = load_from_disk(save_dir + '/extracted_train_sample_cd')

In [None]:
extracted_train

Dataset({
    features: ['patent_number', 'decision', 'title', 'abstract', 'cpc_label', 'ipc_label', 'filing_date', 'patent_issue_date', 'date_published', 'examiner_id', 'claims_desc', 'extractive_summaries'],
    num_rows: 720
})

## Prepare Data for Summarization


In [None]:
# Fixed indices so that we inspect the same examples on every run.
random_indices = [515, 175, 380, 364, 552, 238, 558, 6, 692, 504]
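An alternative to hard-coding the list is to draw the indices from a seeded generator, which stays reproducible while making the sampling explicit. The sketch below is only an illustration: the seed value and the names `num_rows`, `sample_size`, and `sampled_indices` are assumptions, and this seed is not expected to reproduce the ten indices listed above.

```python
import random

num_rows = 720      # matches num_rows reported for extracted_train
sample_size = 10    # same number of examples as the hard-coded list
seed = 42           # hypothetical seed, chosen only for illustration

# A dedicated Random instance keeps the draw independent of global random state.
rng = random.Random(seed)
sampled_indices = rng.sample(range(num_rows), sample_size)
```

Because `rng.sample` draws without replacement, the ten indices are guaranteed to be distinct.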

## Summarizing Extractive Summaries with Pegasus-large


In [None]:
#specify base model
model_name = "google/pegasus-large"

#specify compute device (GPU if available)
device = "cuda" if torch.cuda.is_available() else "cpu"

#specify tokenizer
tokenizer = PegasusTokenizer.from_pretrained(model_name)

#instantiate model
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(device)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-large and are newly initialized: ['model.encoder.embed_positions.weight', 'model.decoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
#create a function to tokenize the data
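def tokenize_input(example, column_name):
    # Truncate to the model's maximum input length and pad every example to that length
    tokenized = tokenizer(example[column_name], truncation=True, padding='max_length', return_tensors="pt")
    # Drop the leading batch dimension of size 1 so each row stores a 1-D tensor
    tokenized['input_ids'] = tokenized['input_ids'].squeeze()
    tokenized['attention_mask'] = tokenized['attention_mask'].squeeze()
    return tokenized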
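To make the effect of `truncation=True` and `padding='max_length'` concrete, here is a small pure-Python sketch of what happens to a single list of token ids. The helper `pad_or_truncate` and the default `pad_id=0` are assumptions for illustration; the real tokenizer also handles details such as keeping the end-of-sequence token when it truncates, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    # Truncation: keep only the first max_length tokens.
    clipped = list(token_ids[:max_length])
    num_padding = max_length - len(clipped)
    # Padding: fill the remainder with pad_id and mark those positions with 0.
    input_ids = clipped + [pad_id] * num_padding
    attention_mask = [1] * len(clipped) + [0] * num_padding
    return input_ids, attention_mask
```

The attention mask is what lets the model ignore the padded positions, which is why it is passed to `model.generate` alongside `input_ids`.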

### Summarize Extracted Text using Pegasus Large

In [None]:
# Tokenize training set
tokenized_input = extracted_train.map(lambda obs: tokenize_input(obs,'extractive_summaries'))

In [None]:
#set format to Torch
tokenized_input.set_format(type='torch', columns=['input_ids', 'attention_mask']) #device = device to load the torch formatted tensors to your GPU.

#set up DataLoader
batch_size = 5 #note that the PyTorch DataLoader defaults to batch_size=1
train_dataloader = DataLoader(tokenized_input, batch_size = batch_size) #num_workers = 4 << increase if you need higher speed.


In [None]:
summarized_texts = []
for batch in train_dataloader:
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    summary = model.generate(input_ids = input_ids,
                             attention_mask = attention_mask,
                             max_length = 150
                             )
    # skip_special_tokens=False keeps markers such as <pad> and </s> in the decoded text
    summarized_text = tokenizer.batch_decode(summary, skip_special_tokens=False)
    summarized_texts.extend(summarized_text)
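Because the decoding step keeps special tokens, the generated strings carry `<pad>` and `</s>` markers. Passing `skip_special_tokens=True` to `tokenizer.batch_decode` would drop them at decode time; as a post-hoc alternative, the sketch below cleans already-decoded strings. The helper name `strip_special_tokens` and its default token list are assumptions for illustration.

```python
import re

def strip_special_tokens(text, special_tokens=("<pad>", "</s>")):
    """Remove literal special-token markers and collapse leftover whitespace."""
    # Build one alternation pattern from the escaped token strings.
    pattern = "|".join(re.escape(tok) for tok in special_tokens)
    cleaned = re.sub(pattern, " ", text)
    # Collapse runs of whitespace introduced by the removals.
    return re.sub(r"\s+", " ", cleaned).strip()
```

Applied as `[strip_special_tokens(t) for t in summarized_texts]`, this would give cleaner predictions to score or display.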

In [None]:
#Generate Rouge Scores
rouge = evaluate.load('rouge')
predictions = summarized_texts
references = tokenized_input['abstract']
results = rouge.compute(predictions=predictions,
                        references=references)
print(results)

{'rouge1': 0.22505036775063114, 'rouge2': 0.08565479109805345, 'rougeL': 0.15808531994130587, 'rougeLsum': 0.15792722163262335}
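To read the `rouge1` number, it helps to see what a unigram-overlap F1 looks like for a single prediction/reference pair. The sketch below is a simplified stand-in, not the `evaluate` implementation: the function name `rouge1_f1` is an assumption, it tokenizes on whitespace only, and it applies no stemming, so its values will generally differ from the scores reported above.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between one prediction and one reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each unigram counts at most as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A score near 0.22 therefore means that, on average, only a modest share of the words in the generated summaries and the abstracts coincide.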


In [None]:
# Select the example pairs at the fixed indices
sampled_summarized_texts = [summarized_texts[i] for i in random_indices]
sampled_abstracts = [extracted_train['abstract'][i] for i in random_indices]

# Zip the lists together
comparisons = list(zip(sampled_summarized_texts, sampled_abstracts))

# Create a pandas DataFrame from the zipped lists
comparison_df = pd.DataFrame(comparisons, columns=['predicted_summary', 'abstract'])

# Display df
display(HTML(comparison_df.to_html()))

,predicted_summary,abstract
0,"<pad>The method of claim 19, wherein the naturally occurring fatty acids and esters comprise chicken fat, beef tallow, yellow grease, choice white grease, lard, soybean oil, corn oil, cottonseed oil, canola oil, sunflower oil, palm oil, rapeseed oil, or a combination of any two or more thereof; and the at least 75 wt % even carbon number paraffins selected from the C12-C24 range comprise n-hexadecane and n-octadecane. The hydrogen-rich gas 105 is supplied at a rate of about 3,000 to about 10,000 SCF/bbl (volume gas per volume biological feedstock). The diameter of bubble column reactor vessel 104 is selected such that the gas flow rate is in the churn-</s>","Paraffin compositions including mainly even carbon number paraffins, and a method for manufacturing the same, is disclosed herein. In one embodiment, the method involves contacting naturally occurring fatty acid/glycerides with hydrogen in a slurry bubble column reactor containing bimetallic catalysts with equivalent particle diameters from about 10 to about 400 micron. The even carbon number compositions are particularly useful as phase change material."
1,"<pad>Then, a mixture with a purity of about 70.0%, 71.0%, 73.0%, 74.0%, 75.0%, 76.0%, 77.0%, 78.0%, 79.0%, 80.0%, 81.0%, 82.0%, 83.0%, 84.0%, 85.0%, 86.0%, 87.0%, 88.5%, 89.0%, 89.5%, 90.0%, about 90.5%, about 91.0%, about 91.5%, about 92.0%, about 92.5%, about 93.0%, about 93.5%, about 94.0%, about 94.5%, about 95.0%, about 95.5%, about 96.0%, about 96.5%, about 97.0%, about 97.5%, about 98</s>","The present application relates to a method for photochlorination, and specifically to photochlorination by a photochemical reaction of an aromatic compound with gaseous chlorine so as to prepare a trichloromethyl-substituted benzene, and to a method using bis-(trichloromethyl)-benzene as the trichloromethyl-substituted benzene to prepare by further reaction bis-(chloroformyl)-benzene. Through the control of temperature, illuminance and consumption of gaseous chlorine, the method of this application can greatly improve the purity of trichloromethyl-substituted benzene and further prepare polymer-grade bis-(chloroformyl)-benzene with low cost. The present application also relates to a method for purifying trichloromethyl-substituted benzene, and specifically to a method for purifying trichloromethyl-substituted benzene via molecular distillation. The present application further relates to a photochlorination reactor for use in photochlorination reactions (such as those of the present application)."
2,"<pad>Human CD133 is a cell surface glycoprotein that has been used as a marker of hematopoietic stem cells, neural stem cells, and for enrichment of a tumor initiating cell population in many cancers including colon carcinoma and glioblastoma. The cancer stem cell compartment contains more than one stem cell population and CD133KDEL is highly effective against at least one of these populations expressing CD133. A fusion polypeptide comprising: a targeting moiety comprising: the monoclonal antibody of embodiment 2; or the scFv of any one of embodiments 3-5; and a toxin moiety comprising a therapeutically active portion of a cytolytic toxin. EXAMPLES Example 1 Cell Culture Caco-2 cells were grown in Minimum Essential Medium (MEM) containing 20% FBS, 1%</s>",Disclosed herein are a monoclonal antibody that specifically binds to human CD133 and single-chain variable fragments thereof. Also disclosed herein is a hybridoma that produces the monoclonal antibody that specifically binds to human CD133.
3,"<pad>The receptors of several hematopoietic growth factors are tyrosine kinases; these include c-fms, which is the colony stimulating factor 1 receptor [Sherr et al., Cell, 41: 665-676 (1985)] and c-kit, a primitive hematopoietic growth factor receptor [Huang et al., Cell, 63: 225-33 (1990)]. The invention is further directed to a method of screening for a neoplastic disease state, comprising the steps of: (a) contacting tissue from a mammalian organism suspected of having a neoplastic disease state with a composition comprising an antibody or antibody fragment that specifically binds Flt4 receptor tyrosine kinase; (b) detecting the antibody or antibody fragment bound to cells in the mammalian organism; and (</s>","The present invention provide purified Flt4 receptor tyrosine kinase polypeptides and fragments thereof, polynucleotides encoding such polypeptides, antibodies that specifically bind such polypeptides, and uses therefor."
4,"<pad>The sequence of human albumin is as shown, in the prealbumin form with the N-terminal signaling residues MKWVTFISLLFLFSSAYSRGVFRR (SEQ ID NO: 175): SEQ ID NO: 187 MKWVTFISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGEENFKALVLIA FAQYLQNOQCPFEDHVKLVNEVTEFAKTCVSAEADENCDKSLHTLFGDKLCT polypeptides that are either monomeric or fused in the host cell; Type 2 diabetes mellitus or cancer; Cancer can be assessed by fusing the N-terminus (NM3CQAADKAA CLLPKLDELRDEGKASSAKQRLK</s>","Provided herein are multifunctional heteromultimer proteins. In specific embodiments is a heteromultimer comprising: at least two polypeptide constructs, each polypeptide construct comprising at least one cargo polypeptide attached to a transporter polypeptide, said transporter polypeptides derived from a monomeric native protein such that said monomeric constructs associate to form the heteromultimer and said transporter polypeptides associate to form a quasi-native structure of the monomeric native protein or analog thereof. These therapeutically novel molecules encompass heteromultimers comprising constructs that function as scaffolds for the conjugation or fusion of therapeutic molecular entities (cargo polypeptides) resulting in the creation of bispecific or multivalent molecular species. Provided herein is a method for creation of bispecific or multivalent molecular species."
5,"<pad>The administering may be carried out on a daily basis, every other day or other periodic dosing schedule, at a dose between 0.5 and 25 mg/kg body weight. For the treatment of pancreatic or lung cancer, the compound may be administered at a daily, every other day or other periodic dosing schedule, at a dose of between 0.5 and 25 mg/kg body weight of the compound over a period of at least fit a five weeks.</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>","Silylalkyloxyaryl compounds useful as anti-cancer agents. The compounds and pharmaceutical compositions containing them are particularly useful for the treatment of melanoma, colon, neuroblastoma, bladder, breast, lung, pancreatic, melanoma, sarcoma, lymphoma or gastric cancer."
6,"<pad>“Acyl” by itself or as part of another substituent refers to a radical —C(O)R400, where R400 is hydrogen, alkyl, substituted alkyl, aryl, substituted aryl, arylalkyl, substituted arylalkyl, substituted heteroalkyl, heteroarylalkyl or substituted heteroarylalkyl as defined herein. “Arylalkyl” by itself or as part of another substituent, refers to an acyclic alkyl group in which one of the hydrogen atoms bonded to a carbon atom, typically a terminal or sp3 carbon atom, is replaced with an aryl group as, as defined herein.</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>","Provided herein are ergoline compounds and pharmaceutical compositions thereof. In some embodiments, provided herein are methods of treatment, prevention, or amelioration of a variety of medical disorders such as, for example, migraine using the compounds and pharmaceutical compositions disclosed herein. In still other embodiments, provided herein are methods of agonizing receptors such as, for example, the 5-HT1A, 5-HT1B and 5-HT1D receptors without agonizing the 5-HT2B receptor using the compounds and pharmaceutical compositions disclosed herein. In still other embodiments, provided herein are methods of antagonizing the 5-HT2B adrenergic alpha2A and/or the alpha2B receptors using the compounds and pharmaceutical compositions disclosed herein. In still other embodiments, provided herein are methods of antagonizing the D2 and D3 receptor using the compounds and pharmaceutical compositions disclosed herein."
7,"<pad>The invention of claim 3, wherein said fragment comprises an amino acid sequence ranging from an amino acid residue at position 86 to an amino acid residue at position 189 in SEQ ID NO:1, an amino acid sequence ranging from an amino acid residue at position 86 to an amino acid residue at position 174 in SEQ ID NO:2, or an amino acid sequence ranging from an amino acid residue at position 86 to an amino acid residue at position 175 in SEQ ID NO:4. In some embodiments, the linker has 4; 5; 6; 7; 8; 9; 10; 11; 12; 13; 14; 15; 16; 17; 18; 19; 20; 21; 22; 23; 24; 25;</s>","The present invention relates to novel alternative splice transcripts (AST) for MICA (MHC class I related chain alpha) encoding novel MICA protein isoforms and uses thereof. In particular, the present invention relates to an isolated polypeptide at least 80% of identity with a sequence selected from the group consisting of SEQ ID NO:1 (MICA-A), SEQ ID NO:2 (MICA-B1), SEQ ID NO:3 (MICA-B2); SEQ ID NO:4 (MICA-C) and SEQ ID NO: (MICA-D)."
8,"<pad>The acyl radical contents of the individual betaines, the acyl radicals of which comprise from 8 to 14 carbon atoms, and also the sum of the acyl radical contents of the betaines, the acyl radicals percentage of which comprise from 16 to 18 (C16, C18:0, C18:1 and C18:2) carbon atoms, can consequently be determined from the peak area percentages of the HPLC/RI measurement. Particularly preferred fatty acid amidoalkyl betaines according to the invention are characterized in that the mixture of acyl radicals of the palm kernel oil exhibits a content of oleic acid acyl radicals, based on all acyl radicals of the mixture, of 12% by weight to 21% by weight and a content</s>","A subject-matter of the invention are betaines with a special fatty acid chain distribution, for example based on palm kernel oil, the palm kernel oil comprising an enhanced content of unsaturated C18 fatty acids."
9,"<pad>An antibody specific for the epitope peptide of claim 1, wherein the antibody is a nanobody, wherein the nanobody comprises an amino acid sequence comprising framework region 1, CDR1, framework region 2, CDR2, framework region 3, CDR3, and framework region 4, wherein CDR3 is having the amino acid sequence as defined in SEQ ID NO:12, and wherein a cysteine is present in the framework region 2, which is capable of forming a disulfide bridge with the cysteine present in CDR3. The tag is specifically recognized by a novel epitope specific antibody, which allows antibody-tag interaction. The BC2-peptide (BC2T) folds into a <unk>-strand that is part of a <unk>-sheet structure formed</s>","Provided herein is a novel epitope that can be used as a tag in methods for rapid and effective characterization, purification, and subcellular localization of polypeptides of interest, which comprise the tag. The tag is specifically recognized by an epitope specific antibody, which can be used to detect, capture, quantify, and/or purify polypeptides of interest that are tagged with the epitope. Also provided is novel epitope specific antibody."


*  The extractive summary did well on example 2, which describes a synthesis pathway, but the compound names appear to be defined in a list elsewhere in the document (compound 2, compound 3, ...), so the synthesis cannot be reconstructed from the summary alone.
*  We also see references such as "Formula 1", which point to a formula listed elsewhere in the document that the model cannot easily access.
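One way to quantify the observations above would be to flag summaries that contain references whose targets live elsewhere in the patent. The sketch below is a rough heuristic under stated assumptions: the set of labels it looks for (formula, compound, claim, SEQ ID NO) and the names `REFERENCE_PATTERN` and `find_dangling_references` are illustrative choices, not part of the original analysis.

```python
import re

# Labels followed by a number, e.g. "Formula 1", "compound 2", "claim 19", "SEQ ID NO:12".
REFERENCE_PATTERN = re.compile(
    r"\b(?:formula|compound|claim|seq id no:?)\s*\d+",
    flags=re.IGNORECASE,
)

def find_dangling_references(summary):
    """Return every label-plus-number reference found in the summary, in order."""
    return [match.group(0) for match in REFERENCE_PATTERN.finditer(summary)]
```

Counting the matches per summary would give a rough measure of how often a summary depends on content it does not include.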
