---
title: Mistral Demo
format:
    html:
        code-fold: true
---

In [1]:
#| echo: false
#| output: false
# Do not run this cell unless you are on Perlmutter
%env HF_DATASETS_CACHE=/pscratch/sd/a/azaidi/llm/cache
%env HF_HOME=/pscratch/sd/a/azaidi/llm/cache
cache_dir = '/pscratch/sd/a/azaidi/llm/cache'

env: HF_DATASETS_CACHE=/pscratch/sd/a/azaidi/llm/cache
env: HF_HOME=/pscratch/sd/a/azaidi/llm/cache


In [2]:
#| echo: false
#| output: false
from transformers import AutoTokenizer, AutoModelForCausalLM
import requests
import torch
import glob

First, let's set up a function that calls the [DCE](https://metadata.namesforlife.com) to retrieve the articles we want. By default it returns the full text of the article in question. If you would prefer to receive the metadata, set the `full_text` argument to `False`. In both cases the function returns a `(response_text, pmid)` tuple.

- To get the full text, you need to provide the PubMed Central ID (`pmcid`) <br>
- To get the metadata, you need to provide the PubMed ID (`pmid`)
<br><br>

By default, our function pulls one of our main reference papers (pmid: 31801294 / pmcid: PMC6955870)
- Paper title: "Comparative Genomics Reveals Metabolic Specificity of Endozoicomonas Isolated from a Marine Sponge and the Genomic Repertoire for Host-Bacteria Symbioses."

In [3]:
#| echo: false
#| output: false
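def call_dce(url=None, pmid=None, pmcid=None, full_text=True):
    """Fetch an article's full text (by pmcid) or metadata (by pmid) from the DCE."""
    # Fall back to the default reference paper
    if pmid is None: pmid = '31801294'
    if pmcid is None: pmcid = 'PMC6955870'
    if url is None:
        if full_text: url = "https://metadata.namesforlife.com/context?pmcid=" + pmcid
        else: url = "https://metadata.namesforlife.com/reference?pmid=" + pmid
    response = requests.get(url, timeout=60)
    response.raise_for_status()  # fail loudly on HTTP errors instead of returning an error page
    return response.text, pmid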

In [4]:
#| echo: false
#| output: false
txt, pmid = call_dce(full_text=True)
len(txt), txt[:100]

(37995,
 'Sponges (Phylum Porifera) interact and co-evolve with microbes belonging to different lineages.\nDesp')

In [5]:
call_dce(full_text=False)[0]

'{"title":"Comparative Genomics Reveals Metabolic Specificity of Endozoicomonas Isolated from a Marine Sponge and the Genomic Repertoire for Host-Bacteria Symbioses","doi":"10.3390/microorganisms7120635","pmc":"PMC6955870","pmid":"31801294","date":"2019-12-01","authors":[],"publication_title":"Microorganisms","publication_abbrev":"Microorganisms","publication_issn":null,"volume":null,"issue":null,"e_location":null,"first_page":null,"last_page":null}'
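The metadata comes back as a JSON string, so it can be turned into a dictionary with the standard library. A minimal sketch, using an excerpt of the response shown above as a stand-in for `call_dce(full_text=False)[0]` (the variable names are illustrative and not part of the original notebook):

```python
import json

# Excerpt of the metadata response shown above; in the notebook this string
# would come from call_dce(full_text=False)[0]
raw_metadata = '{"title":"Comparative Genomics Reveals Metabolic Specificity of Endozoicomonas Isolated from a Marine Sponge and the Genomic Repertoire for Host-Bacteria Symbioses","doi":"10.3390/microorganisms7120635","pmc":"PMC6955870","pmid":"31801294","date":"2019-12-01","authors":[],"publication_title":"Microorganisms","volume":null}'

metadata = json.loads(raw_metadata)  # JSON null becomes Python None

print(metadata["doi"])
print(metadata["publication_title"], metadata["date"])
```

Working with a dictionary rather than the raw string makes it easier to spot fields the service left empty, such as `authors` and `volume` in this record.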

In [6]:
#| echo: false
#| output: false
def get_assets(model_name, four_bit=False, eight_bit=False, cache_dir=None):
    # The tokenizer has no weights, so the quantization flags only apply to the model
    tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)

    model = AutoModelForCausalLM.from_pretrained(model_name,
                                                 cache_dir=cache_dir,
                                                 load_in_4bit=four_bit,
                                                 load_in_8bit=eight_bit)
    return tokenizer, model

In [7]:
#model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"
model_name = "mistralai/Mistral-7B-Instruct-v0.2"

In [8]:
#| echo: True
# Do we want to quantize our model?
four_bit = False
eight_bit = False

In [9]:
#| echo: false
#| output: false
tokenizer, model = get_assets(model_name, four_bit, eight_bit, cache_dir=cache_dir)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Let's take a look at our Mistral model, [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)

In [10]:
#| echo: false
#| output: true
model

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralSdpaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm()
        (post_attention_layernorm): MistralRMSNorm()
      )
    )
    (norm): MistralRMSNorm(
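
The layer shapes in this printout are enough to estimate the model's size by hand. The sketch below adds up the weights of each `Linear`, `Embedding`, and `MistralRMSNorm` module. It rests on two assumptions that the (truncated) printout does not show: each RMSNorm holds one weight per hidden dimension, and the model ends with an untied `lm_head` of shape 4096 x 32000. The memory figures are rough, weight-only estimates that ignore activations and any quantization overhead.

```python
hidden, vocab, n_layers = 4096, 32000, 32
kv_dim, mlp_dim = 1024, 14336

attention = 2 * hidden * hidden + 2 * hidden * kv_dim   # q_proj, o_proj + k_proj, v_proj
mlp = 3 * hidden * mlp_dim                               # gate_proj, up_proj, down_proj
norms = 2 * hidden                                       # assumed: one weight per hidden dim
per_layer = attention + mlp + norms

embed = vocab * hidden
lm_head = vocab * hidden    # assumed: untied output projection (not shown in the printout)
final_norm = hidden

total_params = embed + n_layers * per_layer + final_norm + lm_head
print(f"{total_params:,} parameters")

# Approximate memory needed for the weights alone at different precisions
for label, bits in [("32-bit", 32), ("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    gib = total_params * bits / 8 / 2**30
    print(f"{label}: ~{gib:.1f} GiB of weights")
```

Estimates like these are one way to judge whether the `four_bit` or `eight_bit` flags are needed for a given GPU.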

In [12]:
# Now that we have our model, we can move it onto the GPU (if it's available).
# Quantized (4-bit / 8-bit) models are placed on the device at load time, so skip them.
if torch.cuda.is_available() and not (four_bit or eight_bit):
    model.to('cuda')

Before feeding any text into our model, let's quickly get a sense of how our tokenizer breaks our text down into tokens

In [34]:
#| echo: false
#| output: false
num_words = len(txt.split(' '))
num_tokens = len(tokenizer(txt).input_ids)
print(f'Our text has {num_words} individual words (separated by white space) --> this gets broken out into {num_tokens} tokens via our tokenization process')

Our text has 5166 individual words (separated by white space) --> this gets broken out into 10866 tokens via our tokenization process
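
The counts above work out to roughly two tokens per whitespace-separated word, which is a handy rule of thumb when deciding how much of an article fits in a prompt. The sketch below is an illustration only: `words_that_fit` and `truncate_words` are hypothetical helpers, the 8,000-token budget is an arbitrary example rather than a model limit, and the ratio will differ for other texts.

```python
num_words, num_tokens = 5166, 10866   # counts reported for this article above

tokens_per_word = num_tokens / num_words
print(f"~{tokens_per_word:.2f} tokens per word")

def words_that_fit(token_budget, tokens_per_word):
    """Estimate how many words fit into a token budget (hypothetical helper)."""
    return int(token_budget / tokens_per_word)

def truncate_words(text, max_words):
    """Keep only the first max_words space-separated words of text."""
    return ' '.join(text.split(' ')[:max_words])

budget_words = words_that_fit(8000, tokens_per_word)   # 8000 is an arbitrary example budget
print(budget_words)
print(truncate_words("Sponges (Phylum Porifera) interact and co-evolve with microbes", 3))
```

Because the estimate is approximate, it is safer to re-tokenize the truncated text and confirm the real token count before sending it to the model.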


### Now let's declare a function that tokenizes our text and generates a response to it, based on the parameters we choose

In [35]:
#| echo: false
#| output: false

def get_response(prompt, new_tokens=250, rep_penalty=1.0, length_penalty=1.0, device=None):
    # Default to whichever device the model lives on
    if device is None: device = model.device
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    # Unpack inputs so the attention mask is passed along with the input ids
    generated = model.generate(**inputs,
                               max_new_tokens=new_tokens,
                               repetition_penalty=rep_penalty,
                               length_penalty=length_penalty,
                               pad_token_id=tokenizer.eos_token_id)
    output = tokenizer.batch_decode(generated,
                                    skip_special_tokens=True,
                                    clean_up_tokenization_spaces=False)[0]
    print(output)

In [38]:
prompt = "The ostrich pooped its pants while walking across the street. But, the Tyrannosaurus rex was one of the most ferocious predators to ever walk the Earth. With a massive body, sharp teeth, and jaws so powerful they could crush a car, this famous carnivore dominated the forested river valleys in western North America during the late Cretaceous period, 68 million years ago. \n \n"

In [39]:
get_response(prompt=f'what was the previous text about? \n{prompt}')

what was the previous text about? 
The ostrich pooped its pants while walking across the street. But, the Tyrannosaurus rex was one of the most ferocious predators to ever walk the Earth. With a massive body, sharp teeth, and jaws so powerful they could crush a car, this famous carnivore dominated the forested river valleys in western North America during the late Cretaceous period, 68 million years ago. 
 
The previous text was about an ostrich having an accident and the description of a Tyrannosaurus rex, a dinosaur known for its ferocity and dominance during the late Cretaceous period.


### That seems to make sense. Now let's build a more useful and relevant prompt

In [41]:
def build_prompt(questions, input_text):
    return f"[INST] You are a microbiology journal editor. Please answer as concisely as possible: {questions} \n here is the text in question: {input_text} \n [/INST]"

In [43]:
get_response(prompt=build_prompt(questions="can you provide the exact quote surrounding a mention of GCA_00092485.1?",
                                 input_text=call_dce(full_text=True)[0][:3700]),
             new_tokens=1000)

[INST] You are a microbiology journal editor. Please answer as concisely as possible: can you provide the exact quote surrounding a mention of GCA_00092485.1? 
 here is the text in question: Sponges (Phylum Porifera) interact and co-evolve with microbes belonging to different lineages.
Despite the ubiquity of microbes in a wide range of environments, associations between sponge and microbes are not random and often result in sharing of the resources in a particular niche.
Moreover, sponge-associated bacteria play a crucial role in sponge biology, metabolism, and ecology [1].
Studies using whole-genome sequencing of microbes isolated from sponges and metagenomic binning approaches have shown the genomic and molecular mechanisms involved in the successful association between the sponges and symbiotic microbes.
For instance, genome streamlining [2], evolution of bacterial genome through transposable insertion elements [3], presence of adhesion-related genes, the genes encoding eukaryotic-

Some work is left to be done to make this more sensible and workable :)

In [None]:
#mistral prompt format
# <s>[INST] Instruction [/INST] Model answer</s>[INST] Follow-up instruction [/INST]

In [None]:
#prompt = f'<s>[INST] Can you summarize this text? {sample_text} [/INST]'
#prompt = f'<s>[INST] {text} [/INST]'
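
The prompt format noted above can be assembled with a small helper. This is a minimal sketch: `format_mistral_prompt` is a hypothetical name, and the spacing simply mirrors the template in the comment. The Hugging Face tokenizer typically prepends the `<s>` token itself, so the helper makes it optional. The tokenizer's own `apply_chat_template` method is the library's route to a similar result and is likely the safer choice when it is available.

```python
def format_mistral_prompt(turns, add_bos=True):
    """Build an instruction prompt from (instruction, answer) pairs.

    Pass answer=None for the final turn that the model should complete.
    """
    prompt = "<s>" if add_bos else ""
    for instruction, answer in turns:
        prompt += f"[INST] {instruction} [/INST]"
        if answer is not None:
            # A completed turn: append the model's answer and close it with </s>
            prompt += f" {answer}</s>"
    return prompt

turns = [("Instruction", "Model answer"), ("Follow-up instruction", None)]
print(format_mistral_prompt(turns))
```

Passing `add_bos=False` avoids a duplicated start token when the string is later fed through a tokenizer that adds one on its own.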