<a href="https://colab.research.google.com/github/akimi-yano/data-science/blob/main/LLM_HuggingFace_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install datasets transformers



NOTES: Use T4 GPU :)

## Import libraries

In [2]:
from tqdm.notebook import tqdm  # progress bars (with ETA) for long-running loops
from typing import Optional, List, Tuple
from datasets import Dataset
import pandas as pd
import matplotlib.pyplot as plt

## Dataset

Link: https://www.kaggle.com/datasets/chaitanyakck/medical-text

NOTES: change the file extension from .dat to .txt by renaming the files

NOTES: upload the dataset files through the Files panel on the left

In [3]:
!pip install langchain



## Loading and chunking dataset

![](https://miro.medium.com/v2/resize:fit:1127/1*Jq9bEbitg1Pv4oASwEQwJg.png)

In [4]:
with open("train.txt", "r") as f:
    data = f.read()

In [5]:
data[:100]

'4\tCatheterization laboratory events and hospital outcome with direct angioplasty for acute myocardia'

In [6]:
from langchain.docstore.document import Document as LangchainDocument

RAW_KNOWLEDGE_BASE = LangchainDocument(page_content=data)

In [7]:
MARKDOWN_SEPARATORS = [
    "\n#{1,6} ",
    "```\n",
    "\n\\*\\*\\*+\n",
    "\n---+\n",
    "\n___+\n",
    "\n\n",
    "\n",
    " ",
    "",
]

In [8]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [9]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # The maximum number of characters in a chunk: we selected this value arbitrarily
    chunk_overlap=100,  # The number of characters to overlap between chunks
    add_start_index=True,  # If `True`, includes chunk's start index in metadata
    strip_whitespace=True,  # If `True`, strips whitespace from the start and end of every document
    separators=MARKDOWN_SEPARATORS,  # separators are tried in order, from coarsest to finest
)

In [10]:
docs_processed = text_splitter.split_documents([RAW_KNOWLEDGE_BASE])
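
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification: `RecursiveCharacterTextSplitter` also tries to break on the separators, so its chunks will generally differ. The function name `sliding_chunks` is hypothetical and not part of LangChain.

```python
def sliding_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghij" * 25  # 250 characters of dummy text
chunks = sliding_chunks(sample, chunk_size=100, chunk_overlap=20)
print([len(c) for c in chunks])
```

The overlap means the end of one chunk is repeated at the start of the next, so a sentence cut at a chunk boundary still appears whole in at least one chunk when it is shorter than the overlap.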

NOTES: install the packages below. `langchain_community` provides the Hugging Face embeddings wrapper, which relies on `sentence-transformers` under the hood.

In [11]:
!pip install langchain_community  # provides the HuggingFaceEmbeddings wrapper
!pip install sentence-transformers  # library for encoding sentences and text into embeddings



## Tokenizing/Vectorizing the dataset

In [12]:
from langchain_community.embeddings import HuggingFaceEmbeddings
EMBEDDING_MODEL_NAME = "thenlper/gte-small"

In [13]:
embedding_model = HuggingFaceEmbeddings(
    model_name=EMBEDDING_MODEL_NAME,
    multi_process=True,
    model_kwargs={"device": "cuda"}, # Use GPU
    encode_kwargs={"normalize_embeddings": True},  # Set `True` for cosine similarity
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
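
The `normalize_embeddings` flag matters because, for unit-length vectors, the dot product equals the cosine similarity. A small sketch with made-up 3-dimensional vectors (the real embeddings from this model have 384 dimensions); the helper names are illustrative only.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [1.0, 2.0, 2.0]  # made-up vectors, not real embeddings
b = [2.0, 0.0, 1.0]

print(cosine(a, b))
print(dot(normalize(a), normalize(b)))  # same value, up to floating-point error
```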


In [14]:
emb = embedding_model.embed_query(docs_processed[0].page_content)

In [15]:
print(len(emb))  # use this as the dimension of the Pinecone index

384


In [16]:
import numpy as np
np.array(emb).shape

(384,)

In [17]:
!pip install pinecone-client



## Storing dataset into a vector database

Using: https://pinecone.com

In [18]:
from tqdm.notebook import tqdm
from pinecone import Pinecone
from google.colab import userdata

pc = Pinecone(api_key = userdata.get('PINECONE_KEY'))
index = pc.Index(userdata.get('PINECONE_INDEX'))

In [19]:
upsert_data = []

for i, entry in tqdm(enumerate(docs_processed[:10])):
    text = entry.page_content
    vector = embedding_model.embed_query(text)
    upsert_data.append(
        {
            "id": "vec{}".format(i),
            "values": vector,
            "metadata": {"text": text}
        }
    )

0it [00:00, ?it/s]

In [20]:
index.upsert(
    vectors=upsert_data,
    namespace="ns1"  # any name works; "ns1" is just a conventional placeholder
)

{'upserted_count': 10}
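
The cell above sends only 10 vectors in a single request. When indexing the whole dataset, it is usually safer to upsert in batches. A minimal sketch; the helper `batched` and the batch size of 100 are assumptions, not values taken from this notebook, and the records below are dummies with the same shape as `upsert_data`.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[dict], batch_size: int = 100) -> Iterator[List[dict]]:
    """Yield consecutive slices of `items` with at most `batch_size` entries each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Stand-in records shaped like `upsert_data` (dummy 3-d vectors, empty text).
records = [
    {"id": f"vec{i}", "values": [0.0, 0.0, 0.0], "metadata": {"text": ""}}
    for i in range(250)
]
all_batches = list(batched(records))
batch_sizes = [len(batch) for batch in all_batches]
print(batch_sizes)

# In the notebook, each batch would then be sent with the existing index object:
# for batch in batched(upsert_data):
#     index.upsert(vectors=batch, namespace="ns1")
```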

## Loading a LLM

In [21]:
from transformers import pipeline
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_name = "HuggingFaceH4/zephyr-7b-beta"

In [22]:
!pip install bitsandbytes
!pip install accelerate



In [23]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)

`low_cpu_mem_usage` was None, now set to True since model is quantized.
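
Why 4-bit quantization helps on a T4 (roughly 16 GB of GPU memory): a back-of-the-envelope estimate of the memory needed for the weights alone. The parameter count of 7 billion is a round-number assumption, and the estimate ignores activations, the KV cache, and quantization overhead.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights only, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7e9  # assumed round figure for a "7B" model

for label, bits in [("float16", 16), ("int8", 8), ("4-bit (nf4)", 4)]:
    print(f"{label}: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions the half-precision weights alone come close to the T4's capacity, while the 4-bit weights leave room for generation.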



In [24]:
llm_model = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    do_sample=True,
    temperature=0.4,
    repetition_penalty=1.1,
    return_full_text=False,
    max_new_tokens=500,
)

In [25]:
llm_model("Hey there!")

[{'generated_text': '\n\nI’m so glad you could join me today. I have a really exciting post for you all, as I’ve been working on this for quite some time now. I wanted to share with you my top 10 tips for getting started with affiliate marketing. These are the things that I’ve learned along the way and have found to be most helpful in building a successful affiliate marketing business. So let’s dive right into it!\n\nTip #1: Choose the Right Niche\nThe first step in starting an affiliate marketing business is to choose the right niche. This is the area of interest or expertise that you will focus on promoting products related to. It’s important to choose a niche that you’re passionate about because it will make the process more enjoyable and easier to promote products.\n\nSome factors to consider when choosing a niche include:\n- Passion: Are you genuinely interested in this topic?\n- Profitability: Is there demand for products in this niche?\n- Competition: How competitive is this nic

## Prompting the model

In [26]:
prompt = """
<|system|>
You are a helpful assistant that answers medical questions based on the real information provided from different sources and in the context.
Give a rational and well-written response. If you don't have proper info in the context, answer "I don't know"
Respond only to the question asked.

<|user|>
Context:
{}
---
Here is the question you need to answer.

Question: {}
<|assistant|>
"""
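
The template above writes the role tags by hand. A sketch of the same idea as a function, assuming the turn format shown on the Zephyr model card, where each turn is closed with an end-of-sequence token (`</s>`). `build_zephyr_prompt` is a hypothetical helper; in practice `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` should produce the model's own format.

```python
def build_zephyr_prompt(system: str, user: str, eos: str = "</s>") -> str:
    """Assemble a single-turn prompt with system, user, and assistant tags.

    The `eos` default is an assumption based on the Zephyr model card.
    """
    return (
        f"<|system|>\n{system}{eos}\n"
        f"<|user|>\n{user}{eos}\n"
        f"<|assistant|>\n"
    )


example = build_zephyr_prompt(
    system="Answer only from the provided context.",
    user="Context:\n(retrieved text goes here)\n---\nQuestion: What is cardiogenic shock?",
)
print(example)
```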

In [27]:
user_input = input("User: ")

vectorized_input = embedding_model.embed_query(user_input)

context = index.query(
    namespace="ns1",
    vector=vectorized_input,
    top_k=1,
    include_metadata=True
)

answer = llm_model(prompt.format(context['matches'][0]['metadata']['text'], user_input))

print("AI response: ", answer[0]['generated_text'])

User: What is Cardiogenic shock?
AI response:  Cardiogenic shock is a life-threatening condition characterized by inadequate organ perfusion resulting from weak heart function, typically caused by a severe heart attack or damage to the heart muscle. In this context, it is mentioned that cardiogenic shock was present in eight patients with an infarction (heart attack) affecting the left anterior descending coronary artery, four patients with an infarction affecting the right coronary artery, and four patients with an infarction affecting the circumflex coronary artery. This indicates that the location of the heart attack can contribute to the development of cardiogenic shock. Patients experiencing cardiogenic shock may require interventions such as cardioversion, cardiopulmonary resuscitation, medication like dopamine or intra-aortic balloon pump support for hypotension, and urgent surgery to manage their symptoms and improve outcomes. It should be noted that while cardiogenic shock can

In [28]:
context['matches'][0]['metadata']['text']

'artery, 90%). Cardiogenic shock was present in eight patients with infarction of the left anterior descending coronary artery, four with infarction of the right coronary artery, and four with infarction of the circumflex coronary artery. Major catheterization laboratory events (cardioversion, cardiopulmonary resuscitation, dopamine or intra-aortic balloon pump support for hypotension, and urgent surgery) occurred in 10 patients with infarction of the left anterior descending coronary artery, eight with infarction of the right coronary artery, and four with infarction of the circumflex coronary artery (16 of 16 shock and six of 234 nonshock patients, p less than 0.001). There was one in-laboratory death (shock patient with infarction of the left anterior descending coronary artery).'
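
The query cell uses `top_k=1` and indexes `matches[0]` directly, which raises an error if nothing is returned. A minimal sketch of a more defensive context builder; `build_context` and the `min_score` threshold are hypothetical, and the response is mocked as a plain dict with the same `matches` and `metadata` keys used above plus a `score` field.

```python
def build_context(response: dict, min_score: float = 0.0, separator: str = "\n---\n") -> str:
    """Join the text of all matches whose score is at least `min_score`.

    Returns an empty string when no match qualifies.
    """
    texts = [
        match["metadata"]["text"]
        for match in response.get("matches", [])
        if match.get("score", 0.0) >= min_score
    ]
    return separator.join(texts)


# Mocked query result: the ids, scores, and texts are invented for illustration.
mock_response = {
    "matches": [
        {"id": "vec0", "score": 0.91, "metadata": {"text": "Passage about cardiogenic shock."}},
        {"id": "vec3", "score": 0.42, "metadata": {"text": "Unrelated passage."}},
    ]
}

context_text = build_context(mock_response, min_score=0.5)
print(context_text or "(no relevant context found)")
```

With an empty context, the system prompt's "I don't know" instruction gives the model a way out instead of answering from unrelated text.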

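NOTES: temperature controls how random the sampling is. Values near 0 make the output almost deterministic, 1 samples from the model's unscaled distribution, and higher values are more random. Low values such as 0.1 or 0.2 are a good default for factual answers.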
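
To see what the temperature does: the logits are divided by the temperature before the softmax, so lower values sharpen the distribution toward the top token. A sketch with made-up logits; `softmax_with_temperature` is a hypothetical helper, not part of transformers.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens

for t in (0.1, 0.4, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```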