In [1]:
import re
import warnings
from typing import List
 
import torch
from langchain import PromptTemplate
from langchain.chains import ConversationChain
from langchain.chains.conversation.memory import ConversationBufferWindowMemory
from langchain.llms import HuggingFacePipeline
from langchain.schema import BaseOutputParser
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    StoppingCriteria,
    StoppingCriteriaList,
    pipeline,
)
 
warnings.filterwarnings("ignore", category=UserWarning)

## Using the transformers pipeline

In [2]:
%%time
# Generate Text
query = "what is KBase?"
llama_tokenizer = AutoTokenizer.from_pretrained("fine_tuned_llama2_model_kbase_3_epochs", trust_remote_code=True)
llama_tokenizer.pad_token = llama_tokenizer.eos_token
llama_tokenizer.padding_side = "right"  # Fix for fp16
text_gen = pipeline(task="text-generation", model="fine_tuned_llama2_model_kbase_3_epochs", tokenizer=llama_tokenizer, max_length=200)
output = text_gen(f"<s>[INST] {query} [/INST]")
print(output[0]['generated_text'])

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


<s>[INST] what is KBase? [/INST]  KBase (Knowledge Base) is a web-based platform developed by the US Department of Energy (DOE) that provides a suite of tools and resources for the analysis, visualization, and sharing of large-scale biological data. It is designed to support the analysis of complex biological systems, such as microbial communities, plant genomes, and metabolic pathways, and to facilitate collaboration among researchers across different disciplines.

KBase was originally developed as a collaboration between the DOE Joint Genome Institute (JGI) and the Lawrence Berkeley National Laboratory (LBNL), and has since expanded to include partnerships with other organizations, including the University of California, the University of Texas, and the University of Illinois.

KBase provides a range of tools and resources for working with large-scale biological data, including:

1
CPU times: user 2h 18min 27s, sys: 2min 54s, total: 2h 21min 21s
Wall time: 4min 40s
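
The cell above wraps the query in the Llama 2 instruction markers by hand. A minimal sketch of a reusable helper follows. `build_llama2_prompt` and its optional `system_prompt` argument are hypothetical names introduced here for illustration; they are not part of transformers or LangChain. Whether the leading `<s>` should be written by hand depends on whether the tokenizer already adds a BOS token, which this notebook does not show.

```python
from typing import Optional


def build_llama2_prompt(query: str, system_prompt: Optional[str] = None) -> str:
    """Wrap a user query in Llama 2 chat instruction markers.

    Hypothetical helper for illustration; it mirrors the f-string used in
    the cell above and optionally adds a <<SYS>> system block.
    """
    query = query.strip()
    if system_prompt:
        return (
            f"<s>[INST] <<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n"
            f"{query} [/INST]"
        )
    return f"<s>[INST] {query} [/INST]"


# Reproduces the prompt string passed to text_gen in the cell above.
print(build_llama2_prompt("what is KBase?"))
```

Keeping the template in one place makes it harder to send one query with the markers and the next one without them.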


In [3]:
%%time
# Generate text from a raw prompt (no [INST] ... [/INST] wrapper this time)
output = text_gen("how to use KBase narrative?")
print(output[0]['generated_text'])

how to use KBase narrative?

KBase Narrative is a tool for creating and sharing narratives in the context of scientific research. It allows users to create and edit narratives, as well as to share them with others. Here are some steps on how to use KBase Narrative:

1. Sign up for a KBase account: To use KBase Narrative, you need to sign up for a KBase account. You can sign up for a free account on the KBase website.
2. Log in to your KBase account: Once you have signed up for a KBase account, you can log in to your account using your email address and password.
3. Access KBase Narrative: Once you are logged in to your KBase account, you can access KBase Narrative by clicking on the "Narrative" tab in the top navigation bar.
4. Create a new narrative: To
CPU times: user 2h 17min 17s, sys: 30.5 s, total: 2h 17min 48s
Wall time: 4min 8s
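
Both pipeline outputs above start by echoing the prompt. One way to keep only the continuation is to strip the prompt afterwards, sketched below. `strip_prompt` is a hypothetical helper, and `full_text` is a shortened stand-in for the real output, not a new model result. The text-generation pipeline also accepts `return_full_text=False`, which should have a similar effect without post-processing.

```python
def strip_prompt(generated_text: str, prompt: str) -> str:
    """Return only the continuation when the output echoes the prompt.

    Hypothetical helper for illustration. If the text does not start with
    the prompt, it is returned unchanged apart from surrounding whitespace.
    """
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    return generated_text.strip()


prompt = "how to use KBase narrative?"
# Shortened stand-in for output[0]['generated_text'] from the cell above.
full_text = prompt + "\n\nKBase Narrative is a tool for creating and sharing narratives."
print(strip_prompt(full_text, prompt))
```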


## Using `model.generate` and the tokenizer decoder

In [4]:
# load the model
MODEL_NAME = "fine_tuned_llama2_model_kbase_3_epochs"
 
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, device_map="auto"
)
model = model.eval()
 
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [5]:
# model generation config
generation_config = model.generation_config
generation_config.temperature = 0
generation_config.num_return_sequences = 1
generation_config.max_new_tokens = 256
generation_config.use_cache = False
generation_config.repetition_penalty = 1.7
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id

# generate prompt
prompt = """
Question: What is KBase?
Answer:
""".strip()
 
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.to(model.device)

with torch.inference_mode():
    outputs = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
    )
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Question: What is KBase?
Answer: The full name of the platform mentioned in this question, which stands for "Knowledge Base," refers to a publicly available online resource that provides tools and services related specifically to bioinformatics. It was developed by JGI (Joint Genome Institute) with funding from DOE Office Of Science Bioenergy Technologies office as part of an effort called Systems Biology Knowledgebase or SBKB project aimed at integrating data across different organisms into one comprehensive database accessible through web-based interfaces such as RAILS/KEGG Interaction Networks Viewer Applications using RESTful APIs provided within their API portal section on GitHub under open source license terms allowing users free access without any restrictions including commercial use rights; however please note there may be some limitations depending upon how much storage space you need compared against what's offered here before deciding whether it suits your needs best!


In [6]:
# model generation config
generation_config = model.generation_config
generation_config.temperature = 0
generation_config.num_return_sequences = 1
generation_config.max_new_tokens = 256
generation_config.use_cache = False
generation_config.repetition_penalty = 1.7
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id

# generate prompt
prompt = """
Question: How to use KBase narrative?
Answer:
""".strip()
 
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.to(model.device)
with torch.inference_mode():
    outputs = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
    )
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Question: How to use KBase narrative?
Answer: The Narratives in the Knowledge Base (Kbase) are a way of organizing and presenting information about genes, proteins or other biological entities. They provide an overview by summarising key details such as gene function/description; protein structure data from PDB files with links for more detailed viewings if desired! Additionally they include relevant literature references that can be accessed through hyperlinks at no cost whatsoever – all within one convenient location so you don't have search multiple sources separately before finding exactly where this particular piece fits into your research project’s puzzle pieces perfectly fitting together like clockwork 🕰️✨
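
The two cells above differ only in the question text, so the prompt construction and the answer extraction can be factored out. The sketch below uses hypothetical helper names (`build_qa_prompt`, `extract_answer`) and runs on plain strings, so it does not call the model. The `decoded` value is a shortened stand-in for `tokenizer.decode(...)`, not a real output.

```python
def build_qa_prompt(question: str) -> str:
    """Build the same 'Question: ... / Answer:' prompt used in the cells above."""
    return f"Question: {question.strip()}\nAnswer:"


def extract_answer(decoded: str) -> str:
    """Return the text after the first 'Answer:' marker.

    Hypothetical helper for illustration. If the marker is missing, the
    whole string is returned (stripped) rather than an empty answer.
    """
    _, marker, answer = decoded.partition("Answer:")
    return answer.strip() if marker else decoded.strip()


prompt = build_qa_prompt("How to use KBase narrative?")
# Shortened stand-in for the decoded model output.
decoded = prompt + " The Narratives in the Knowledge Base are a way of organizing information."
print(extract_answer(decoded))
```

With helpers like these, each new question needs only one changed line instead of a copied cell.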


## Using the Hugging Face pipeline with LangChain

In [7]:
generation_pipeline = pipeline(
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,
    task="text-generation",
    generation_config=generation_config,
)
 
llm = HuggingFacePipeline(pipeline=generation_pipeline)

In [8]:
%%time
llm("what is KBase?")

CPU times: user 51.9 s, sys: 83.5 ms, total: 51.9 s
Wall time: 51.9 s


'\nKbase (formerly known as the Kansas Bioenergy and Bioproducts Institute) was established in 2013 with a mission to develop new technologies for biofuels, biomaterial production from agricultural waste streams. It has since expanded its focus beyond these areas of research into other aspects related or adjacent thereto such an environmental monitoring systems using IoT sensors; biodata analysis tools that can be used by scientists across various disciplines like genetics/genome engineering when working on projects involving microbes involved either directly within their own ecosystem environment outside thereof but also indirect ways through interactions between different organisms themselves! Additionally they provide access points where users may upload data sets collected during experiments conducted at sites around world so long hey adhere strict protocol guidelins set forth kbsi staff members before sharing any information externally via web portals provided exclusively here ins

In [9]:
%%time
llm("How to use KBase narrative?")

CPU times: user 54.5 s, sys: 83.6 ms, total: 54.5 s
Wall time: 54.5 s


'\nKbase Narratives are a new feature in the latest version of their platform. They allow users, such as researchers and scientists working on projects related to bioinformatics or computational biology tasks like genome analysis (either whole-genomes/metagenoms) for example from sequencing reads generated by Illumina HiSeq 2500 run with paired end library preparation protocols using TruFaSTER chemistry), metabolic modeling simulations based off reactions extracted via MetaCyc database integration into Biochemical Network Model Reconstruction workflow steps within an existing project inside your account at kbseeker dot com! To access this tool go through these easy step: Step1 - Log Into Your Account At https://kbsnavigator/.com And Click On "My Project" Tab located near top right corner next logout button; then select one already created if none exist otherwise create brand spanky fresh ones following prompt given below until all necessary details have been filled out completely befor
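
To compare the `%%time` figures reported in this notebook, the duration strings can be converted to seconds. This is a rough sketch: `to_seconds` is a hypothetical helper that only handles the `h`, `min`, `s`, and `ms` units that appear above, and the dictionary keys are labels chosen here for readability.

```python
import re

# Seconds per unit, limited to the units that appear in this notebook's output.
_UNIT_SECONDS = {"h": 3600.0, "min": 60.0, "s": 1.0, "ms": 0.001}
_PART = re.compile(r"(\d+(?:\.\d+)?)\s*(h|min|ms|s)\b")


def to_seconds(duration: str) -> float:
    """Convert a %%time style duration such as '4min 40s' to seconds."""
    return sum(float(value) * _UNIT_SECONDS[unit] for value, unit in _PART.findall(duration))


# Wall times copied from the cell outputs above.
wall_times = {
    "transformers pipeline, In [2]": "4min 40s",
    "transformers pipeline, In [3]": "4min 8s",
    "LangChain wrapper, In [8]": "51.9 s",
    "LangChain wrapper, In [9]": "54.5 s",
}
for label, text in wall_times.items():
    print(f"{label}: {to_seconds(text):.1f} s")

# Total CPU time divided by wall time for In [2].
print(f"CPU/wall ratio, In [2]: {to_seconds('2h 21min 21s') / to_seconds('4min 40s'):.1f}")
```

The numbers are not directly comparable: the first two cells use `max_length=200`, the later ones use `max_new_tokens=256`, and the first cell also loads the model. The large CPU-to-wall ratio for In [2] would be consistent with multi-threaded CPU inference, though the notebook does not state the hardware.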