<a href="https://colab.research.google.com/github/chow33/Applied-Deep-Learning-and-Artificial-Intelligence-assignment-3/blob/main/improved%20assignment_3_AI_Chow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Introduction**
This assignment focuses on leveraging retrieval-augmented generation (RAG) techniques, particularly in the context of extracting and synthesizing information from various documents (or a document). You’ll be using Langchain to implement these concepts and create a system that not only generates responses but also retrieves relevant information from a database.

# **Objective**
# Task Description
Your task is to create a system that uses RAG for extracting information from a set of documents or a document which can be either a scientific paper or report. This involves integrating a database to store vectors of document information and designing customized prompts to effectively use GPT models for generation. Here are some project ideas:



*   Build a QA system that retrieves information from a given set of documents (or a document) to answer complex queries.
*   Develop a tool for summarizing research papers, where the system extracts key points from a database of paper vectors.
* Create a recommendation engine that suggests content based on user queries and retrieved document data.
* Explore other innovative applications of RAG, such as automated content generation, data analysis, or any other creative use case you can envision.

# **Key Components**
* **Database Integration:** Set up a database to store and retrieve vectors representing document information.
* **Customized Prompts**: Design and implement prompts that effectively utilize GPT models for generation based on retrieved data.
* **RAG Implementation**: Use Langchain to integrate retrieval-augmented generation in your system.

# **Data**
* Utilize open-source datasets or create your own corpus of documents for retrieval.
* Ensure the chosen datasets are suitable for demonstrating the capabilities of your RAG system.

In this project I will build a QA system that retrieves information from documents I uploaded from the National Library of Medicine. The documents are about cancer in young people, and the system answers general, complex questions about cancer.

I start by installing LangChain and the other required libraries.

In [1]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [5]:
!pip install -Uqqq pip --progress-bar off
!pip install -qqq torch==2.0.1 --progress-bar off
!pip install -qqq transformers==4.33.2 --progress-bar off
!pip install -qqq langchain==0.0.299 --progress-bar off
!pip install -qqq chromadb==0.4.10 --progress-bar off
!pip install -qqq xformers==0.0.21 --progress-bar off
!pip install -qqq sentence_transformers==2.2.2 --progress-bar off
!pip install -qqq tokenizers==0.14.0 --progress-bar off
!pip install -qqq optimum==1.13.1 --progress-bar off
!pip install -qqq auto-gptq==0.4.2 --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ --progress-bar off
!pip install -qqq unstructured==0.10.16 --progress-bar off

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for lit (pyproject.toml) ... done
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchaudio 2.1.0+cu121 requires torch==2.1.0, but you have torch 2.0.1 which is incompatible.
torchdata 0.7.0 requires torch==2.1.0, but you have torch 2.0.1 which is incompatible.
torchtext 0.16.0 requires torch==2.1.0, but you have torch 2.0.1 which is incompatible.
torchvision 0.16.0+cu121 requires torch==2.1.0, but you have torch 2.0.1 which is incompatible.
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ...

In [1]:
!pip install accelerate --q


In [2]:
!pip install pypdf --q


Loading the research papers, which are openly available from the National Library of Medicine:

In [26]:
from langchain.document_loaders import PyPDFLoader

# List of file paths to the PDF documents
pdf_file_paths = [
    "/content/Genomic and molecular feature on young adult cancer.pdf",
    "/content/Global burdn of young cancer patient.pdf",
    "/content/Young adult cancer survivorship and tips.pdf"
]

docs = []

# Load documents from each PDF file
for file_path in pdf_file_paths:
    try:
        # Use PyPDFLoader to load the documents from each PDF file
        loader = PyPDFLoader(file_path)
        # Extend the 'docs' list with the loaded documents
        docs.extend(loader.load())
    except Exception as e:
        print(f"Error loading {file_path}: {e}")

# PyPDFLoader returns one Document per PDF page, so this is the total page count
total_documents = len(docs)
print(f"Total number of documents extracted: {total_documents}")


Total number of documents extracted: 63


Splitting the text into chunks so that each piece is small enough for the model to process:

In [28]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=64)
# Extract text from all documents
texts = [doc.page_content for doc in docs]

# Split the texts into chunks
split_texts = []

for text in texts:
    # Split the text using the text splitter
    split_texts.extend(text_splitter.split_text(text))
# Get the total number of text chunks
total_chunks = len(split_texts)
print(f"Total number of text chunks: {total_chunks}")


Total number of text chunks: 458


In [29]:
len(texts)

63
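
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-window splitter in plain Python. It is a simplification: `RecursiveCharacterTextSplitter` also tries to cut at natural separators such as paragraph breaks, which this sketch ignores. The function name `split_with_overlap` and the sample string are hypothetical, chosen only for illustration.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut `text` into windows of at most `chunk_size` characters,
    each sharing `chunk_overlap` characters with the previous window."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


# Toy input, small enough to inspect by eye
sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

The overlap means that a sentence cut at a chunk boundary still appears, at least partly, in the neighbouring chunk, which is the reason for the `chunk_overlap=64` setting used above.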

Splitting the documents into chunks is required because an LLM can only look at a limited number of tokens at once (4096 for Llama 2). Next, I will use the HuggingFaceEmbeddings class to create embeddings for the chunks:

In [8]:
!pip install langchain pypdf --q


In [43]:
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="thenlper/gte-large",
    model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": True},
)

# Sanity check: embed the first page and print the embedding dimension
query_result = embeddings.embed_query(texts[0])
print(len(query_result))

1024
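
The `normalize_embeddings` option scales every embedding to unit length. A small sketch of why that is convenient, using toy 2-dimensional vectors in place of the real 1024-dimensional ones (the helper names are hypothetical): for unit vectors the dot product equals the cosine similarity, and the squared Euclidean distance is just `2 - 2 * similarity`, so ranking by either measure gives the same order.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy vectors standing in for real embeddings
a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])

similarity = dot(a, b)                 # cosine similarity, since both have length 1
distance_sq = squared_distance(a, b)   # squared Euclidean distance

print(similarity, distance_sq, 2 - 2 * similarity)
```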


I use a Chroma database to store the embeddings and make them easy to search. Later, the RetrievalQA chain will combine the LLM with this database.

In [44]:
from langchain.vectorstores import Chroma

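# Note: this indexes the 63 full pages in `texts`; pass `split_texts` instead
# to index the 458 chunks produced by the text splitter above.
db = Chroma.from_texts(texts, embeddings, persist_directory="db")
# Quick retrieval check: fetch the 2 entries most similar to the query
results = db.similarity_search("Transformer models", k=2)
print(results[0].page_content)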


Figure 7. Clinical actionability in young
adult versus later-onset tumors
(A) Percentages of unique young adult and later-
onset cases with somatic variants druggable atthe A or B evidence levels, further subdivided byon versus off-label status.(B) Percentages of unique young adult and later-onset cases expressing each of the top ﬁve(ranked by frequency of appearance in the sevenselected cancers) clinically druggable somaticvariants. Young adult percentages are red; later-onset are black. After correcting for confoundingfactors, a single signiﬁcant association (FDR<0.05) was found between BRAF V600E mutations
and young adult SKCM.(C) Percentages of unique young adult and later-
onset cases with copy-number ampliﬁcations
druggable at the A or B evidence levels, furthersubdivided by on versus off-label status.(D) Percentages of unique young adult and later-onset cases expressing each of the top ﬁve(ranked by frequency of appearance in the sixselected cancers) clinically druggable copy-nu
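
Conceptually, a similarity search embeds the query, scores it against every stored vector, and returns the `k` best matches. The brute-force sketch below illustrates that idea with hand-made 3-dimensional vectors. It is only an illustration under those assumptions: Chroma uses its own index and distance settings internally, and `top_k`, `toy_vectors`, and `toy_query` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]


# Hand-made vectors standing in for three stored passages
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
toy_query = [0.9, 0.1, 0.0]

best = top_k(toy_query, toy_vectors, k=2)
print(best)
```

Because the search always returns the `k` nearest entries, it returns something even for a query unrelated to the corpus, as the "Transformer models" query above shows.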

Creating an LLM wrapper for use in the QA chain:

In [52]:
import torch
from langchain import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline

MODEL_NAME = "TheBloke/Llama-2-7B-Chat-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, trust_remote_code=True, device_map="auto"
)

# Create a configuration for text generation based on the specified model name
generation_config = GenerationConfig.from_pretrained(MODEL_NAME)

# Set the maximum number of new tokens in the generated text to 1024.
# This limits the length of the generated output to 1024 tokens.
generation_config.max_new_tokens = 1024

# Set the temperature for text generation. Lower values (e.g., 0.0001) make output more deterministic, following likely predictions.
# Higher values make the output more random.
generation_config.temperature = 0.0001

# Set the top-p sampling value. A value of 0.95 means focusing on the most likely words that make up 95% of the probability distribution.
generation_config.top_p = 0.95

# Enable text sampling. When set to True, the model randomly selects words based on their probabilities, introducing randomness.
generation_config.do_sample = True

# Set the repetition penalty. A value of 1.15 discourages the model from repeating the same words or phrases too frequently in the output.
generation_config.repetition_penalty = 1.15


# Create a text generation pipeline using the initialized model, tokenizer, and generation configuration
text_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    generation_config=generation_config,
)

# Create a LangChain pipeline that wraps the text generation pipeline and set a specific temperature for generation
llm = HuggingFacePipeline(pipeline=text_pipeline, model_kwargs={"temperature": 0})
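
The sampling settings above interact: temperature rescales the model's scores before they become probabilities, and top-p then trims the low-probability tail. The sketch below is a simplified, pure-Python illustration of those two steps, not the code path that `transformers` actually runs. The `toy_logits` values and both helper functions are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability
    reaches top_p, then renormalize the kept probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0]  # scores for four hypothetical tokens

warm = softmax_with_temperature(toy_logits, 1.0)
cold = softmax_with_temperature(toy_logits, 0.0001)

print("temperature 1.0   :", top_p_filter(warm, 0.95))
print("temperature 0.0001:", top_p_filter(cold, 0.95))
```

With a temperature of 0.0001, nearly all probability mass lands on the single highest-scoring token, so generation behaves almost like greedy decoding even though `do_sample` is set to True.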



This code creates the prompt template for the QA system, which instructs the model to answer as a cancer specialist. The first question is about how to avoid getting cancer.

In [53]:
from langchain.chains import RetrievalQA
from langchain import PromptTemplate

template = """
<s>[INST] <<SYS>>
Act as cancer specialist giving teenager health advise to avoid cancer. Use the following information to answer the question at the end.
<</SYS>>

{context}

{question} [/INST]
"""

prompt = PromptTemplate(template=template, input_variables=["context", "question"])


qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever(search_kwargs={"k": 2}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt},
)

result = qa_chain(
    "How do i avoid getting cancer."
)
print(result["result"].strip())

As a cancer specialist, I must first emphasize that while there are certain steps you can take to reduce your risk of developing cancer, it is essential to understand that cancer can occur even without any known causes or risk factors. Nonetheless, here are some general tips that may help reduce your risk of getting cancer:
1. Don't smoke or use tobacco products: Tobacco use is the leading cause of preventable cancer deaths worldwide. Quitting smoking and avoiding exposure to secondhand smoke can significantly reduce your risk of developing lung, mouth, throat, and other cancers.
2. Maintain a healthy weight: Being overweight or obese increases the risk of several types of cancer, including breast, colon, and pancreatic cancer. Eating a balanced diet and exercising regularly can help maintain a healthy weight.
3. Exercise regularly: Physical activity has been shown to reduce the risk of developing certain types of cancer, including breast and colon cancer. Aim for at least 150 minutes 
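
For intuition about what the "stuff" chain type does with the template, here is a plain-Python sketch: the retrieved passages are joined into one block of text, substituted for `{context}`, and the user's question fills `{question}`. This is an approximation under stated assumptions, not LangChain's implementation. The helper `build_stuff_prompt`, the `"\n\n"` separator, and the placeholder passages are all assumptions made for illustration.

```python
TEMPLATE = """
<s>[INST] <<SYS>>
Act as cancer specialist giving teenager health advise to avoid cancer. Use the following information to answer the question at the end.
<</SYS>>

{context}

{question} [/INST]
"""


def build_stuff_prompt(template, retrieved_chunks, question, separator="\n\n"):
    """Join all retrieved passages and place them in the template."""
    context = separator.join(retrieved_chunks)
    return template.format(context=context, question=question)


# Placeholder passages standing in for the k=2 retrieved documents
toy_chunks = ["(first retrieved passage)", "(second retrieved passage)"]
prompt_text = build_stuff_prompt(TEMPLATE, toy_chunks, "How do i avoid getting cancer.")
print(prompt_text)

# Rough token budget from the numbers used in this notebook
context_window = 4096   # Llama 2 limit mentioned earlier
max_new_tokens = 1024   # generation_config.max_new_tokens
prompt_budget = context_window - max_new_tokens
print(prompt_budget)
```

Since everything retrieved is stuffed into a single prompt, the system text, the retrieved passages, and the question together have to fit in the roughly 3072 tokens left after reserving room for the answer.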

The second question is about the consequences of cancer at a young age.

In [54]:
from textwrap import fill

result = qa_chain(
    "Summarize the consequent of cancer at a young age."
)
print(fill(result["result"].strip(), width=80))

Based on the article provided, here are some key points about the consequences
of cancer at a young age: 1. Incident cancer cases: There were approximately 1.2
million new cancer cases diagnosed in people aged 15-39 worldwide in 2019. 2.
Deaths due to cancer: Cancer was responsible for 396,000 deaths among people
aged 15-39 worldwide in 2019. 3. Age-standardized incidence rates: The highest
age-standardized incidence rates occurred in high-middle SDI countries (53.2 per
100,000 person-years) and high SDI countries (59.6 per 100,000 person-years). 4.
Age-standardized mortality rates: The highest age-standardized mortality rates
were in low-middle SDI (14.2 per 10,000 person-years) and middle SDI (13.6 per
10,000 person-years) countries. 5. Disability-adjusted life-years (DALYs):
Adolescent and young adult cancers contributed 2.35 million DALYs to the global
burden of disease in 2019, with 2.7% of these DALYs coming from years lived with
disability (YLDs) and 97.3% from years lost due to

The third question asks for a general persona of young people who get cancer.

In [55]:
from textwrap import fill

result = qa_chain(
    "What is a general persona for a young person who got cancer"
)
print(fill(result["result"].strip(), width=80))

As a cancer specialist, I would advise a young person who has been diagnosed
with cancer to take the following steps to reduce their risk of developing
cancer: 1. Maintain a healthy weight: Being overweight or obese increases the
risk of developing certain types of cancer, such as breast, colon, and
pancreatic cancer. Eating a healthy diet and exercising regularly can help
maintain a healthy weight. 2. Don't smoke: Smoking is a significant risk factor
for many types of cancer, including lung, mouth, throat, and bladder cancer.
Quitting smoking can significantly reduce the risk of developing cancer. 3.
Limit alcohol consumption: Drinking too much alcohol can increase the risk of
developing certain types of cancer, including liver, breast, and colon cancer.
Limiting alcohol consumption to moderate levels (one drink per day for women and
two drinks per day for men) can help reduce the risk of cancer. 4. Protect
yourself from the sun: Exposure to UV radiation from the sun or tanning beds c