<a href="https://colab.research.google.com/github/frank-morales2020/MLxDL/blob/main/Mixtral_8x7B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dependencies

In [None]:
#https://platform.openai.com/docs/guides/text-generation

#!pip install gradio --quiet
#!pip install xformers --quiet
#!pip install chromadb --quiet

!pip install langchain --quiet
!pip install accelerate --quiet
!pip install transformers --quiet
!pip install bitsandbytes --quiet


#!pip install unstructured --quiet
#!pip install sentence-transformers --quiet


%pip install openai==0.28  --root-user-action=ignore
%pip install tiktoken

!pip install -U transformers

# https://huggingface.co/docs/transformers/model_doc/mixtral
# https://github.com/Dao-AILab/flash-attention

#!pip install -U flash-attn --no-build-isolation
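
Before enabling the `flash-attn` install above, it can help to check whether the current GPU can use it at all. FlashAttention 2 needs an Ampere or newer GPU, which corresponds to CUDA compute capability 8.0 or higher (a Colab T4 reports 7.5, an A100 reports 8.0). The sketch below is one possible check, not part of the original notebook. The helper names `supports_flash_attention_2` and `choose_attn_implementation` are hypothetical, and the `"sdpa"` fallback assumes a `transformers` version that accepts that value.

```python
import importlib.util


def supports_flash_attention_2(compute_capability):
    """Ampere and newer GPUs report CUDA compute capability 8.0 or higher."""
    major, _minor = compute_capability
    return major >= 8


def choose_attn_implementation():
    """Return a value for `attn_implementation` (hypothetical helper)."""
    try:
        import torch
    except ImportError:
        # torch is not installed, so there is no GPU backend to choose from.
        return "eager"
    has_package = importlib.util.find_spec("flash_attn") is not None
    if (has_package and torch.cuda.is_available()
            and supports_flash_attention_2(torch.cuda.get_device_capability())):
        return "flash_attention_2"
    # Assumption: fall back to PyTorch scaled-dot-product attention.
    return "sdpa"


print(choose_attn_implementation())
```

The returned string could then be passed as `attn_implementation=...` to `AutoModelForCausalLM.from_pretrained`, so the same notebook runs on both a T4 and an A100.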

# Load and run a model using Flash Attention 2
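
The cell below loads the weights in 4-bit (`load_in_4bit=True`) rather than 16-bit. A rough back-of-the-envelope estimate shows why this matters for the choice between a T4 and an A100. The sketch assumes approximate parameter counts of 7.24 billion for Mistral-7B and 46.7 billion for Mixtral-8x7B, and it counts weights only: quantization constants, layers kept in higher precision, activations, and the KV cache are all ignored. `weight_memory_gb` is a hypothetical helper name.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumed, approximate parameter counts (not taken from the notebook).
MODELS = {
    "Mistral-7B": 7.24e9,
    "Mixtral-8x7B": 46.7e9,
}

for name, n_params in MODELS.items():
    fp16 = weight_memory_gb(n_params, 16)
    nf4 = weight_memory_gb(n_params, 4)
    print(f"{name}: fp16 ~{fp16:.1f} GB, 4-bit ~{nf4:.1f} GB")
```

Under these assumptions the 4-bit Mistral-7B weights need only a few GB and fit on a 16 GB T4, while the 4-bit Mixtral-8x7B weights come to roughly 23 GB, which is in line with the notebook reserving that model for the A100.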

In [None]:
import torch
#from textwrap import fill
#from IPython.display import Markdown, display

#from langchain.prompts.chat import (
#    ChatPromptTemplate,
#    HumanMessagePromptTemplate,
#    SystemMessagePromptTemplate,
#    )

#from langchain import PromptTemplate
from langchain import HuggingFacePipeline

#from langchain.vectorstores import Chroma
#from langchain.schema import AIMessage, HumanMessage
#from langchain.memory import ConversationBufferMemory
#from langchain.embeddings import HuggingFaceEmbeddings
#from langchain.text_splitter import RecursiveCharacterTextSplitter
#from langchain.document_loaders import UnstructuredMarkdownLoader, UnstructuredURLLoader
#from langchain.chains import LLMChain, SimpleSequentialChain, RetrievalQA, ConversationalRetrievalChain

from transformers import BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline
import warnings
warnings.filterwarnings('ignore')

### T4: runs WITHOUT FlashAttention. Running WITH FlashAttention needs an A100, otherwise:
#RuntimeError: FlashAttention only supports Ampere GPUs or newer.
# https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1
MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.1"


### A100
# https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
#MODEL_NAME = "mistralai/Mixtral-8x7B-Instruct-v0.1"

# https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
# Quantization
# Quantization techniques reduce memory and computational costs by representing weights and activations
# with lower-precision data types such as 8-bit integers (int8). This makes it possible to load larger models
# that would not normally fit into memory, and it can speed up inference. Transformers supports the AWQ and GPTQ
# quantization algorithms, as well as 8-bit and 4-bit quantization with bitsandbytes.


# BitsAndBytesConfig signature and defaults:
# (
#     load_in_8bit = False,
#     load_in_4bit = False,
#     llm_int8_threshold = 6.0,
#     llm_int8_skip_modules = None,
#     llm_int8_enable_fp32_cpu_offload = False,
#     llm_int8_has_fp16_weight = False,
#     bnb_4bit_compute_dtype = None,
#     bnb_4bit_quant_type = 'fp4',
#     bnb_4bit_use_double_quant = False,
#     **kwargs
# )


quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    llm_int8_enable_fp32_cpu_offload=True,  # correct parameter name; keeps CPU-offloaded modules in fp32
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map="auto",
    quantization_config=quantization_config,
    #FlashAttention only supports Ampere GPUs or newer.
    #attn_implementation="flash_attention_2" NEED A100 IN GOOGLE COLAB
)

generation_config = GenerationConfig.from_pretrained(MODEL_NAME)
generation_config.max_new_tokens = 1024
generation_config.temperature = 0.8
generation_config.top_p = 0.95
generation_config.do_sample = True
generation_config.repetition_penalty = 1.15

# Use a distinct name so the imported `pipeline` factory is not shadowed if this cell is re-run.
text_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,
    generation_config=generation_config,
    pad_token_id=tokenizer.eos_token_id
)

In [4]:
llm = HuggingFacePipeline(pipeline=text_pipeline)

# Prompt Completion - Use Cases
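
The helper in the next cell removes the prompt from the decoded output by replacing the exact string `<s> [INST] {query} [/INST]`. That only works when the decoded text reproduces the query character for character. A slightly more tolerant variant, sketched here with plain strings, keeps whatever follows the last `[/INST]` tag instead. The function names `build_prompt` and `extract_answer` are hypothetical, the tag layout is copied from the `replace` call in the cell below, and the sample string is a hand-written stand-in rather than real model output.

```python
def build_prompt(query: str) -> str:
    """Mimic the decoded prompt layout that the notebook strips out."""
    return f"<s> [INST] {query} [/INST]"


def extract_answer(decoded: str) -> str:
    """Keep the text after the last [/INST] tag and drop the end-of-sequence token."""
    # rpartition returns the whole string as the tail if the tag is absent.
    tail = decoded.rpartition("[/INST]")[2]
    return tail.replace("</s>", "").strip()


# Hand-written stand-in for a decoded generation.
decoded = build_prompt("What is 2 + 2?") + " 2 + 2 equals 4.</s>"
print(extract_answer(decoded))
```

Splitting on the closing tag avoids depending on how the tokenizer re-renders whitespace or special characters inside the query.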

In [5]:
query='I bought an ice cream for 6 kids. Each cone was $1.25 and I paid with a $10 bill. How many dollars did I get back? Explain first before answering.'
query1 = "who is the President of the USA?"
query2 = "Who won the baseball World Series in 2020? and Who Lost"

device="cuda"
def prompt_completion(query):
    messages = [
        {"role": "user", "content": query}
    ]

    encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")
    model_inputs = encodeds.to(device)

    #https://stackoverflow.com/questions/69609401/suppress-huggingface-logging-warning-setting-pad-token-id-to-eos-token-id
    #RuntimeError: FlashAttention only supports Ampere GPUs or newer.
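    # `negative_prompt_attention_mask` expects a tensor and is only used with classifier-free guidance,
    # so it is dropped. The prompt is a single unpadded sequence, so every input token is attended to.
    generated_ids = model.generate(model_inputs, attention_mask=torch.ones_like(model_inputs),
                    max_new_tokens=512, do_sample=True,
                    pad_token_id=tokenizer.eos_token_id
    )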

    decoded = tokenizer.batch_decode(generated_ids)
    print()
    print()
    result=decoded[0].replace('<s> [INST] %s [/INST]'%query,"")
    result=result.replace('</s>',"")
    print('Prompt: %s'%query)
    print('-'*80)
    print('Answer: %s'%result)

prompt_completion(query)
print()
print('='*80)
prompt_completion(query1)
print()
print('='*80)
prompt_completion(query2)

query3 = "what is the 20.5% of 40?"
query4 = "As a data scientist, can you explain the concept of regularization in machine learning?"
query5 ='Which country has the most natural lakes? Answer with only the country name.'

print()
print('='*80)
prompt_completion(query3)
print()
print('='*80)
prompt_completion(query4)
print()
print('='*80)
prompt_completion(query5)


query6 = "How AWS has evolved?"
print()
print('='*80)
prompt_completion(query6)



Prompt: I bought an ice cream for 6 kids. Each cone was $1.25 and I paid with a $10 bill. How many dollars did I get back? Explain first before answering.
--------------------------------------------------------------------------------
Answer:  When answering, I will use the following steps:

1. Calculate the total cost of the ice creams.
2. Calculate how much money I paid and subtract that from the total cost to find out how much money I got back.

Step 1: The total cost of the ice creams is found by multiplying the price of each cone by the number of kids, which gives us $1.25 x 6 = $<<1.25*6=7.50>>7.50.

Step 2: I paid with a $10 bill, so I subtract the cost of the ice creams from $10, which gives us $10 - $7.50 = $<<10-7.5=2.50>>2.50.

Therefore, I got $2.50 back.
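
The arithmetic in the answer above can be checked independently of the model. A minimal sketch using `Decimal`, so that the dollar amounts are not affected by binary floating-point rounding (`change_due` is a hypothetical helper name):

```python
from decimal import Decimal


def change_due(unit_price: str, quantity: int, paid: str):
    """Return (total cost, change) for `quantity` items at `unit_price`."""
    total = Decimal(unit_price) * quantity
    return total, Decimal(paid) - total


total, change = change_due("1.25", 6, "10")
print(f"total = ${total:.2f}, change = ${change:.2f}")
```

This agrees with the model's steps: six cones at $1.25 cost $7.50, leaving $2.50 from a $10 bill.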



Prompt: who is the President of the USA?
--------------------------------------------------------------------------------
Answer:  The current President of the United States of America is Joe Biden. Biden took office