# Conversational Agent on Medical Research Papers
Medical professionals must keep abreast of the latest research, both within and beyond their specialization. Given the volume of research papers and reports published online, it is hard for them to keep up with trusted, approved research. Access to trusted repositories helps, but there are many sources, such as Nature, PubMed, and assorted journals, so tracking them all is still a lot of work. A knowledge system that curates trusted papers and supports fast retrieval through a question-answering agent would greatly simplify medical professionals' efforts to stay informed.

Another key point is that an LLM can hallucinate when answering a question. A knowledge base supplies contextual data that grounds the LLM and reduces hallucination. It also gives the LLM access to information that it was not trained on.

This notebook provides instructions for building such a system using DataRobot's Generative AI Solution framework. It shows how to build a pipeline that creates a knowledge base containing only trusted research papers, and a conversational agent that answers questions from medical professionals.

# Setup
## READ BEFORE STARTING NOTEBOOK
1. Enable the following **feature flags** on your account:
    - Enable Notebooks Filesystem Management
    - Enable Proxy models
    - Enable Public Network Access for all Custom Models
    - Enable the Injection of Runtime Parameters for Custom Models
    - Enable Monitoring Support for Generative Models (Staging-only)
    - Enable Custom Inference Models (GA: on by default)
2. Enable the notebook filesystem for this notebook in the notebook sidebar
3. Add the notebook environment variables `OPENAI_API_KEY`, `OPENAI_ORGANIZATION`,
   and `OPENAI_API_BASE`, and set their values to your Azure OpenAI credentials
4. Set the notebook session timeout to 180 minutes
5. Restart the notebook container using at least a "Medium" (16 GB RAM) instance
6. Upload your documents archive (`files.zip`) to the notebook's `storage` directory

In [None]:
try:
    import os
    assert 'OPENAI_API_KEY' in os.environ
    assert 'OPENAI_ORGANIZATION' in os.environ
    assert 'OPENAI_API_BASE' in os.environ
    assert os.path.isfile('./storage/files.zip')
except Exception as e:
    raise RuntimeError('Please follow the setup steps before running the notebook.') from e

### Installing prerequisite libraries
We will use <a href='https://python.langchain.com/docs/get_started/introduction.html'>LangChain</a> to develop the agent, and <a href='https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/'>FAISS</a> and <a href='https://www.sbert.net/'>Sentence Transformers</a> for the <a href='https://arxiv.org/abs/2005.11401'>RAG</a> system. The LLM is an OpenAI model hosted on Azure. DataRobot lets users choose their preferred components for their stack.
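
To make the retrieval-augmented flow concrete, here is a minimal sketch using only the Python standard library. It is not the notebook's implementation: a bag-of-words count vector stands in for a Sentence Transformers embedding, a brute-force cosine ranking stands in for a FAISS index, and the LLM call is left out. The function names (`embed`, `cosine`, `retrieve`, `build_prompt`) and the sample abstracts are hypothetical and exist only for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding'; a stand-in for a Sentence Transformers model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, documents, k=2):
    """Return the k documents most similar to the question (the role FAISS plays)."""
    q = embed(question)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(question, context_docs):
    """Place retrieved passages ahead of the question so the LLM can ground its answer."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Hypothetical abstracts, invented purely for illustration.
docs = [
    "aspirin reduces the risk of recurrent stroke in adults",
    "vitamin d supplementation and bone density in older women",
    "statin therapy lowers cholesterol in patients with diabetes",
]
question = "does aspirin reduce stroke risk"
top = retrieve(question, docs, k=1)
prompt = build_prompt(question, top)
```

The resulting `prompt` string is what would be sent to the LLM. Swapping the toy pieces for real embeddings and a vector index changes the quality of the ranking, not the overall shape of the flow.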

In [None]:
!pip install "langchain==0.0.244" \
             "faiss-cpu==1.7.4" \
             "sentence-transformers==2.2.2" \
             "unstructured==0.8.4" \
             "openai==0.27.8" \
             "datarobotx==0.1.14"


In [None]:
!pip install datarobotx[llm] json2html

Successfully installed json2html-1.3.0 tiktoken-0.4.0


### Document Corpus
Below is the corpus of both trusted and non-trusted medical research abstracts. These simulate the real-world documents that need to be processed and added to the agent's knowledge base. The dataset is sourced from <a href='https://www.kaggle.com/datasets/anshulmehtakaggl/200000-abstracts-for-seq-sentence-classification'>Kaggle</a>. This demo uses a subset of the papers so that the notebook runs quickly. The `files.zip` archive is available <a href='https://s3.amazonaws.com/datarobot_public_datasets/ai_accelerators/medical_agent/files.zip'>here</a>.

In [None]:
# Decompress the documents archive into the notebook's storage directory
import shutil
shutil.unpack_archive('/home/notebooks/storage/files.zip', '/home/notebooks/storage/', 'zip')


In [None]:
import os
len(os.listdir('/home/notebooks/storage/files/'))

2500
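
As a hedged sketch of the next step, the unpacked files can be read into memory before any filtering or indexing. The helper `load_corpus` is hypothetical, and it assumes that each file is plain UTF-8 text, which the source does not state. The demonstration runs against a throwaway temporary directory rather than the real `storage/files/` folder.

```python
import tempfile
from pathlib import Path


def load_corpus(directory):
    """Read every regular file in `directory` into a {filename: text} dict."""
    corpus = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            # errors="replace" keeps the loop going if a file is not valid UTF-8
            corpus[path.name] = path.read_text(encoding="utf-8", errors="replace")
    return corpus


# Demonstrate on a temporary directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "paper_1.txt").write_text("Abstract one.", encoding="utf-8")
    Path(tmp, "paper_2.txt").write_text("Abstract two.", encoding="utf-8")
    demo_corpus = load_corpus(tmp)
```

In the notebook environment, the same helper could be pointed at `/home/notebooks/storage/files/`, provided the plain-text assumption holds for those files.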

### Trusted Research Papers
Because the aim is to include only trusted papers in the knowledge base, we need a way to check whether a paper can be trusted. In this demo, we build a DataRobot AutoML predictive model that predicts whether a research paper's trust level is high. The <a href='https://datarobot-public-api-client.readthedocs-hosted.com/en/latest-release/'>DataRobot</a> and <a href='https://drx.datarobot.com/model/automl.html'>DataRobotX</a> APIs make it easy to build and deploy this model. The dataset `medical_papers_trust_scoring.csv` is available <a href='https://s3.amazonaws.com/datarobot_public_datasets/ai_accelerators/medical_agent/medical_papers_trust_scoring.csv'>here</a>.
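
One way to picture the trust check is as a filter that sits between the raw corpus and the knowledge base. The sketch below is an illustration under stated assumptions, not the notebook's code: `filter_trusted` and `stub_predict_trust` are hypothetical names, the scorer is assumed to return a probability-like value between 0 and 1, the 0.5 threshold is arbitrary, and the `source` field is invented for the example. In a real pipeline the stub would be replaced by a call to the deployed model.

```python
from typing import Callable, Dict, List


def filter_trusted(
    papers: List[Dict[str, str]],
    predict_trust: Callable[[Dict[str, str]], float],
    threshold: float = 0.5,
) -> List[Dict[str, str]]:
    """Keep only the papers whose predicted trust score meets the threshold."""
    return [paper for paper in papers if predict_trust(paper) >= threshold]


def stub_predict_trust(paper: Dict[str, str]) -> float:
    """Placeholder scorer; a real pipeline would query the deployed model here."""
    return 0.9 if paper["source"] == "peer_reviewed_journal" else 0.2


# Hypothetical records, invented purely for illustration.
papers = [
    {"title": "Paper A", "source": "peer_reviewed_journal"},
    {"title": "Paper B", "source": "unknown_blog"},
]
trusted = filter_trusted(papers, stub_predict_trust)
```

Passing the scorer in as a function keeps the filter independent of how trust is predicted, so the same code works with a stub during development and a deployed model later.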

In [None]:
import pandas as pd
import datarobotx as drx
import time
from sklearn.model_selection import train_test_split

# Initialize the client if running this notebook outside the DataRobot platform
#drx.Client()

# Hold out 40% of the labeled papers for scoring after deployment
df = pd.read_csv('storage/medical_papers_trust_scoring.csv')
df_train, df_test = train_test_split(df, test_size=0.4, random_state=42)

# Run Autopilot on the training split, then deploy the resulting model
model = drx.AutoMLModel()
model.fit(df_train, target='trust')
deployment = model.deploy(wait_for_autopilot=True)

[1m[34m#[0m[1m Waiting for autopilot to complete...[0m
[1m[34m#[0m[1m Creating project[0m
[1m  - [0mUploading project dataset...
    100%|██████████████████████████████████| 2.98M/2.98M [00:00<00:00, 45.7MB/s]
[1m  - [0mAwaiting project initialization...


[1m  - [0mCreated project
    [hungry-pare](https://app.datarobot.com/projects/64f0231eced0f1b02a290ca0/eda)
[1m[34m#[0m[1m Running autopilot[0m
[1m  - [0mAwaiting autopilot initialization...


[1m  - [0mFitting models...
     ,,,,,,,,,
   #-#       #-#
   # #   *   # #    CHAMPION
    # #     # #     Elastic-Net Classifier with Naive Bayes Feature Weighting (L2)
      #*# #*#
        /,\
       `````
# Autopilot complete
# Creating deployment
  - Calculating feature impact...


In [None]:
predictions = deployment.predict(df_test)
df_test['predictions'] = predictions.prediction.values
predictions.info()

# Waiting for deployment to be initialized...
  - Initializing model for prediction explanations...


  - Awaiting deployment creation...
# Making predictions
  - Making predictions with deployment
    [confident-agnesi](https://app.datarobot.com/deployments/64f02556e90d38197cd2a5a2/overview)
  - Uploading dataset to be scored...
    100%|██████████████████████████████████| 2.00M/2.00M [00:00<00:00, 34.2MB/s]
  - Scoring...
  - Created deployment
    [confident-agnesi](https://app.datarobot.com/deployments/64f02556e90d38197cd2a5a2/overview)
    from model [Elastic-Net Classifier with Naive Bayes Feature Weighting
    (L2)](https://app.datarobot.com/projects/64f0231eced0f1b02a290ca0/models/64f024d35dfcafa3b95be4ae/blueprint)
    in project
    [hungry-pare](https://app.datarobot.com/projects/64f0231eced0f1b02a290ca0/eda)
# Deployment complete


# Predictions complete
<class 'datarobotx.common.utils.FutureDataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 1 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   prediction  1000 non-null   object
dtypes: object(1)
memory usage: 7.9+ KB


In [None]:
%%time
def get_paper_trust_level(file_path):
    """Return the predicted trust level for the paper stored at file_path."""
    # Read-only access is enough; the context manager closes the file for us
    with open(file_path, "r") as file_paper:
        paper_content = file_paper.read()
    pred = deployment.predict(pd.DataFrame({'abstract':[paper_content]}), wait_for_autopilot=True)
    return pred['prediction'].iloc[0]

print("Trust level for paper # 24219891 ", get_paper_trust_level('/home/notebooks/storage/files/24219891.txt'))
print("Trust level for paper # 24229754 ", get_paper_trust_level('/home/notebooks/storage/files/24229754.txt'))

# Making predictions
  - Making predictions with deployment
    [confident-agnesi](https://app.datarobot.com/deployments/64f02556e90d38197cd2a5a2/overview)
  - Uploading dataset to be scored...
    100%|███████████████████████████████████| 1.81k/1.81k [00:00<00:00, 103kB/s]
  - Scoring...
# Predictions complete
Trust level for paper # 24219891  low
# Making predictions
  - Making predictions with deployment
    [confident-agnesi](https://app.datarobot.com/deployments/64f02556e90d38197cd2a5a2/overview)
  - Uploading dataset to be scored...
    100%|███████████████████████████████████| 1.89k/1.89k [00:00<00:00, 115kB/s]
  - Scoring...


# Predictions complete
Trust level for paper # 24229754  high
CPU times: user 105 ms, sys: 13.5 ms, total: 118 ms
Wall time: 4.01 s


# Load and Split Text

If applying this recipe to a different use case, consider:

- Using additional or alternative document loaders
- Filtering out extraneous or noisy documents
- Choosing an appropriate `chunk_size` and `chunk_overlap`. Both are counted in characters, NOT tokens
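
To make the character-based counting concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is an illustration only and not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on natural separators); the helper name `chunk_by_characters` is hypothetical.

```python
def chunk_by_characters(text, chunk_size, overlap):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share `overlap` characters. Sizes are counted in
    characters, not tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghij"  # 10 characters
chunks = chunk_by_characters(sample, chunk_size=4, overlap=2)
print(chunks)
```

A larger overlap repeats more text across neighbouring chunks, which increases the number of chunks to embed and store.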

In [None]:
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

SOURCE_DOCUMENTS_DIR = "/home/notebooks/storage/files/"
SOURCE_DOCUMENTS_FILTER = "*.txt"

# Load every file in the directory that matches the filter
loader = DirectoryLoader(SOURCE_DOCUMENTS_DIR, glob=SOURCE_DOCUMENTS_FILTER)
# Sizes are in characters; consecutive chunks may share up to 1000 characters
splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=1000,
)

print(f"Loading {SOURCE_DOCUMENTS_DIR} directory")
data = loader.load()
print(f"Splitting {len(data)} documents")
docs = splitter.split_documents(data)
print(f"Created {len(docs)} documents")

Loading /home/notebooks/storage/files/ directory
[nltk_data] Downloading package punkt to /home/notebooks/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/notebooks/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Splitting 2500 documents
Created 3474 documents


### Filtration
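Keep only the chunks that come from papers the classifier rated as `high` trust, so that only trusted papers are loaded into the knowledge base.

In [None]:
from tqdm import tqdm

# Build the set of trusted filenames once instead of on every iteration
trusted_files = set(df_test[df_test.predictions=='high']['filename'])

approved_docs = []
for doc in tqdm(docs):
    # metadata['source'] holds the full path; compare on the bare filename
    if doc.metadata['source'].split('/')[-1] in trusted_files:
        approved_docs.append(doc)
len(approved_docs)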

100%|██████████| 3474/3474 [00:01<00:00, 2204.90it/s]
254

In [None]:
approved_docs[0]

Document(page_content="24432712 BACKGROUND\tThe EXAcerbations of Chronic Pulmonary Disease Tool ( EXACT ) is a patient-reported outcome measure to standardize the symptomatic assessment of chronic obstructive pulmonary disease exacerbations , including reported and unreported events . BACKGROUND\tThe instrument has been validated in a short-term study of patients with acute exacerbation and stable disease ; its performance in longer-term studies has not been assessed . OBJECTIVE\tTo test the EXACT 's performance in three randomized controlled trials and describe the relationship between resource-defined medically treated exacerbations ( MTEs ) and symptom ( EXACT ) - defined events . METHODS\tPrespecified secondary analyses of data from phase II randomized controlled trials testing new drugs for the management of chronic obstructive pulmonary disease : one 6-month trial ( United States ) ( n = 235 ) and two 3-month , multinational trials ( AZ 1 [ n = 749 ] , AZ 2 [ n = 597 ] ) . METHOD

# Create Vector Database from Documents

1. This notebook uses FAISS, an open source, in-memory vector store that can be serialized to disk and loaded back later.
2. Without a GPU, it uses the open source HuggingFace `all-MiniLM-L6-v2` [embeddings model](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2); when a GPU is available, the code below switches to `all-mpnet-base-v2`. Users are free to experiment with other embedding models.
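
Conceptually, a vector store embeds every chunk once and, at query time, returns the chunks whose vectors lie closest to the query vector. The sketch below shows that idea with a brute-force Euclidean search over hand-written 3-dimensional toy vectors in plain Python. The vectors, texts, and the helper `nearest_chunks` are made up for illustration; the notebook itself relies on FAISS and a sentence-transformer embedding model.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_chunks(query_vector, index, k=2):
    """Return the k (distance, text) pairs closest to query_vector."""
    scored = [(l2_distance(query_vector, vec), text) for vec, text in index]
    scored.sort(key=lambda pair: pair[0])  # smallest distance first
    return scored[:k]


# Toy "embeddings": in practice these come from the embedding model
toy_index = [
    ([0.9, 0.1, 0.0], "bariatric surgery outcomes"),
    ([0.1, 0.9, 0.0], "rheumatoid arthritis therapy"),
    ([0.8, 0.2, 0.1], "lifestyle interventions for obesity"),
]
query = [1.0, 0.0, 0.0]  # pretend this is the embedded question

results = nearest_chunks(query, toy_index, k=2)
for distance, text in results:
    print(f"{distance:.3f}  {text}")
```
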

In [None]:
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.vectorstores.faiss import FAISS
from langchain.docstore.document import Document
import torch

if not torch.cuda.is_available():
    EMBEDDING_MODEL_NAME = "all-MiniLM-L6-v2"
else:
    EMBEDDING_MODEL_NAME = "all-mpnet-base-v2"

# Will download the model the first time it runs
embedding_function = SentenceTransformerEmbeddings(
    model_name=EMBEDDING_MODEL_NAME,
    cache_folder="storage/deploy/sentencetransformers",
)
try:
    # Load existing db from disk if previously built
    db = FAISS.load_local("storage/deploy/faiss-db", embedding_function)
except Exception:
    # No usable index on disk yet, so build one from the approved chunks
    texts = [doc.page_content for doc in approved_docs]
    metadatas = [doc.metadata for doc in approved_docs]   
    # Build and save the FAISS db to persistent notebook storage; this can take some time w/o GPUs
    db = FAISS.from_texts(texts, embedding_function, metadatas=metadatas)  
    db.save_local("storage/deploy/faiss-db")

print(f"FAISS VectorDB has {db.index.ntotal} documents")

FAISS VectorDB has 254 documents


### Sanity Tests on Vector Database
Test that the vector database retrieves relevant passages for our <a href='https://arxiv.org/abs/2005.11401'>RAG</a> pipeline.

In [None]:
# Test the database
#db.similarity_search("Find papers around obesity")
db.similarity_search("Can antioxidants impact exercise performance in normobaric hypoxia")
#db.max_marginal_relevance_search("How do I replace a custom model on an existing custom environment?")

[Document(page_content='24516645 OBJECTIVE\tThe training response of an intensified period of high-intensity exercise is not clear . OBJECTIVE\tTherefore , we compared the cardiovascular adaptations of completing 24 high-intensity aerobic interval training sessions carried out for either three or eight weeks , respectively . METHODS\tTwenty-one healthy subjects ( 23.02.1 years , 10 females ) completed 24 high-intensity training sessions throughout a time-period of either eight weeks ( moderate frequency , MF ) or three weeks ( high frequency , HF ) followed by a detraining period of nine weeks without any training . METHODS\tIn both groups , maximal oxygen uptake ( VO2max ) was evaluated before training , at the 9 ( th ) and 17 ( th ) session and four days after the final 24 ( th ) training session . METHODS\tIn the detraining phase VO2max was evaluated after 12 days and thereafter every second week for eight weeks . METHODS\tLeft ventricular echocardiography , carbon monoxide lung dif

# Define Hooks for Deploying an Unstructured Custom Model
Deploying unstructured custom models in DataRobot requires two hooks, `load_model` and `score_unstructured`. They tell DataRobot how to load the model and how to handle its inputs, outputs, and monitoring. More information is available <a href='https://drx.datarobot.com/consume/deploy.html#example-3-thin-monitored-openai-wrapper-with-secret-handling'>here</a>.
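
As a rough sketch of how the two hooks fit together, the toy example below uses the same signatures as the hooks defined in this notebook but replaces the knowledge base and the LLM with a plain dictionary lookup. The names `toy_load_model` and `toy_score_unstructured`, the file `answers.json`, and its contents are hypothetical, and the hooks are called by hand here rather than by DataRobot.

```python
import json
import os
import tempfile


def toy_load_model(input_dir):
    """Runs once at startup: load whatever the scoring hook will need."""
    with open(os.path.join(input_dir, "answers.json")) as f:
        return json.load(f)


def toy_score_unstructured(model, data, query, **kwargs) -> str:
    """Runs per request: `model` is whatever the load hook returned."""
    try:
        request = json.loads(data)
        question = request["question"]
        rv = {"question": question, "answer": model.get(question, "I don't know.")}
    except Exception as e:
        # Report failures in the response instead of raising
        rv = {"error": f"{e.__class__.__name__}: {e}"}
    return json.dumps(rv)


# Simulate the two calls by hand with a temporary model directory
with tempfile.TemporaryDirectory() as model_dir:
    with open(os.path.join(model_dir, "answers.json"), "w") as f:
        json.dump({"ping": "pong"}, f)
    toy_model = toy_load_model(model_dir)

response = json.loads(toy_score_unstructured(toy_model, json.dumps({"question": "ping"}), None))
bad_response = json.loads(toy_score_unstructured(toy_model, "not json", None))
print(response)
print(bad_response)
```

Returning the error inside the JSON response, as the real hook below also does, keeps a malformed request from failing the whole prediction call.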

In [None]:
import os
OPENAI_API_BASE = os.environ['OPENAI_API_BASE']
OPENAI_ORGANIZATION = os.environ['OPENAI_ORGANIZATION']
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
OPENAI_API_TYPE = os.environ["OPENAI_API_TYPE"]
OPENAI_API_VERSION = os.environ["OPENAI_API_VERSION"]
OPENAI_DEPLOYMENT_NAME = os.environ["OPENAI_DEPLOYMENT_NAME"]

def load_model(input_dir):
    """Custom model hook for loading our knowledge base."""
    import os
    from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
    from langchain.vectorstores.faiss import FAISS
    os.environ["OPENAI_API_TYPE"] = OPENAI_API_TYPE
    os.environ["OPENAI_API_BASE"] = OPENAI_API_BASE
    embedding_function = SentenceTransformerEmbeddings(
        model_name=EMBEDDING_MODEL_NAME,
        cache_folder=input_dir + '/' + 'storage/deploy/sentencetransformers',
    )
    db = FAISS.load_local(input_dir + "/" + "storage/deploy/faiss-db", embedding_function)
    return OPENAI_DEPLOYMENT_NAME, db


def score_unstructured(model, data, query, **kwargs) -> str:
    """Custom model hook for making completions with our knowledge base.

    When requesting predictions from the deployment, pass a dictionary
    with the following keys:
    - 'question' the question to be passed to the retrieval chain
    - 'openai_api_key' the openai token to be used
    - 'chat_history' (optional) a list of two-element lists corresponding to
      preceding dialogue between the Human and AI, respectively

    datarobot-user-models (DRUM) handles loading the model and calling
    this function with the appropriate parameters.

    Returns:
    --------
    rv : str
        JSON-encoded dictionary with keys:
            - 'question' user's original question
            - 'chat_history' chat history that was provided with the original question
            - 'answer' the generated answer to the question
            - 'references' list of references that were used to generate the answer
            - 'error' - error message if exception in handling request
    """
    import json
    from langchain.chains import ConversationalRetrievalChain
    from langchain.vectorstores.base import VectorStoreRetriever
    from langchain.chat_models import AzureChatOpenAI
    try:
        deployment_name, db = model
        data_dict = json.loads(data)
        llm = AzureChatOpenAI(
            deployment_name=OPENAI_DEPLOYMENT_NAME,
            openai_api_type=OPENAI_API_TYPE,
            openai_api_base=OPENAI_API_BASE,
            openai_api_version=OPENAI_API_VERSION,
            openai_api_key=data_dict["openai_api_key"],
            openai_organization=OPENAI_ORGANIZATION,
            model_name=OPENAI_DEPLOYMENT_NAME,
            temperature=0,
            verbose=True
        )
        retriever = VectorStoreRetriever(vectorstore=db,
                                         # search_kwargs={"filter": {"trust_level": "high"}}
                                        )
        chain = ConversationalRetrievalChain.from_llm(llm, 
                                                      retriever=retriever, 
                                                      return_source_documents=True)
        if 'chat_history' in data_dict:
            chat_history = [(human, ai,) for human, ai in data_dict['chat_history']]
        else:
            chat_history = []
        rv = chain(
                inputs={
                    'question': data_dict['question'],
                    'chat_history': chat_history,
                },
             )
        rv['references'] = [doc.metadata['source'] for doc in rv.pop('source_documents')]
    except Exception as e:
        rv = {'error': f"{e.__class__.__name__}: {str(e)}"}
    return json.dumps(rv)
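
The request format described in the docstring can be sketched as follows. The helper `build_request` is hypothetical and the key and dialogue strings are placeholders; the sketch only shows how a follow-up question and the preceding dialogue could be packed into the JSON string that `score_unstructured` receives as `data`.

```python
import json


def build_request(question, openai_api_key, chat_history=None):
    """Build the JSON payload expected by the score_unstructured hook."""
    payload = {"question": question, "openai_api_key": openai_api_key}
    if chat_history:
        # Each turn is a two-element list: [human_message, ai_message]
        payload["chat_history"] = [[human, ai] for human, ai in chat_history]
    return json.dumps(payload)


history = [("<previous question>", "<previous answer>")]
request = build_request("<follow-up question>", "<your-api-key>", history)
decoded = json.loads(request)
print(decoded)
```

Tuples become JSON arrays when serialized, which matches the "list of two-element lists" shape the hook unpacks.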

## Examples
Here are some examples of the agent answering questions using the research papers as context. 

In [None]:
import json
import warnings
from json2html import json2html

warnings.filterwarnings('ignore')

def get_completion(question):
    """Ask the agent a question locally and render the result as an HTML table."""
    output = score_unstructured(
        load_model("."),
        json.dumps(
            {
                "question": question,
                "openai_api_key": os.environ["OPENAI_API_KEY"],
            }
        ),
        None,
    )
    output = json.loads(output)
    # Show the first 300 characters of each referenced paper instead of its path
    references = []
    for file in output['references']:
        with open(file, 'r') as f:
            references.append(f.read()[0:300].replace('\t', ' ').replace('\n', ' ') + '....')
    output_cleaned = {
        'question': output['question'],
        'answer': output['answer'],
        'references': references,
    }
    return json2html.convert(json=output_cleaned)


In [None]:
from IPython.display import display, HTML

question = "How to treat obesity? Please provide conclusions from papers where the methodology is robust."
display(HTML(get_completion(question)))

0,1
question,How to treat obesity? Please provide conclusions from papers where the methodology is robust.
answer,"Based on the provided context, here are the conclusions from the papers that have robust methodologies: 1. From the first paper (NCT00432809): Among obese patients with uncontrolled type 2 diabetes, 3 years of intensive medical therapy plus bariatric surgery resulted in glycemic control in significantly more patients than did medical therapy alone. Surgical groups showed greater reductions in weight and better quality of life compared to the medical therapy group. 2. From the second paper (NCT00842426): Bariatric surgery (Roux-en-Y gastric bypass or sleeve gastrectomy) combined with intensive medical therapy resulted in a glycated hemoglobin level of 6.0% or less in a higher percentage of patients compared to intensive medical therapy alone. The use of glucose-lowering medications, including insulin, was lower in the surgical groups. Please note that these conclusions are specific to the context provided and may not encompass all possible conclusions related to the treatment of obesity. It is always recommended to consult with a healthcare professional for personalized advice and treatment options."
references,"24679060 BACKGROUND In short-term randomized trials ( duration , 1 to 2 years ) , bariatric surgery has been associated with improvement in type 2 diabetes mellitus . METHODS We assessed outcomes 3 years after the randomization of 150 obese patients with uncontrolled type 2 diabetes to receive eithe....24369008 OBJECTIVE To examine whether baseline obesity severity modifies the effects of two different , primary care-based , technology-enhanced lifestyle interventions among overweight or obese adults with prediabetes and/or metabolic syndrome . METHODS We compared mean differences in changes from ....24679060 BACKGROUND In short-term randomized trials ( duration , 1 to 2 years ) , bariatric surgery has been associated with improvement in type 2 diabetes mellitus . METHODS We assessed outcomes 3 years after the randomization of 150 obese patients with uncontrolled type 2 diabetes to receive eithe....24754911 BACKGROUND The Canola Oil Multicenter Intervention Trial ( COMIT ) was a randomized controlled crossover study designed to evaluate the effects of five diets that provided different oils and/or oil blends on cardiovascular disease ( CVD ) risk factors in individuals with abdominal obesity ....."


In [None]:
question="What are the effective treatments for rheumatoid arthritis? Please provide \
conclusions from papers where the methodology is robust."
display(HTML(get_completion(question)))

0,1
question,What are the effective treatments for rheumatoid arthritis? Please provide conclusions from papers where the methodology is robust.
answer,"Based on the provided context, there are two papers that discuss effective treatments for rheumatoid arthritis: 1. The first paper (24941177) compares the efficacy of tofacitinib, an oral Janus kinase inhibitor, with methotrexate monotherapy in patients with rheumatoid arthritis who had not previously received methotrexate or therapeutic doses of methotrexate. The study found that tofacitinib was effective in reducing joint damage and improving disease symptoms. However, it does not provide a direct comparison with other treatments. 2. The second paper (not provided) discusses the use of nonsteroidal anti-inflammatory drugs (NSAIDs), specifically diclofenac, for the treatment of osteoarthritis. It mentions that NSAIDs, including diclofenac, are commonly used to treat osteoarthritis but are associated with dose-related adverse events. The study evaluates the efficacy and safety of low-dose submicron diclofenac in patients with osteoarthritis pain. Unfortunately, the provided context does not include robust conclusions from papers specifically discussing effective treatments for rheumatoid arthritis."
references,"24941177 BACKGROUND Methotrexate is the most frequently used first-line antirheumatic drug . BACKGROUND We report the findings of a phase 3 study of monotherapy with tofacitinib , an oral Janus kinase inhibitor , as compared with methotrexate monotherapy in patients with rheumatoid arthritis who had....25050589 OBJECTIVE NSAIDs , such as diclofenac , are the most commonly used medications to treat osteoarthritis ( OA ) , but they are associated with dose-related adverse events ( AEs ) . OBJECTIVE Low-dose submicron diclofenac was developed using a new , proprietary dry milling process that creates....25199526 BACKGROUND Knee osteoarthritis ( OA ) causes pain and long-term disability with annual healthcare costs exceeding $ 185 billion in the United States . BACKGROUND Few medical remedies effectively influence the course of the disease . BACKGROUND Finding effective treatments to maintain functi....24885354 BACKGROUND Radiotherapy has a good effect in palliation of painful bone metastases , with a pain response rate of more than 60 % . BACKGROUND However , shortly after treatment , in approximately 40 % of patients a temporary pain flare occurs , which is defined as a two-point increase of the...."


## Adversarial Example
Here is an example where the knowledge base does not contain the information needed to answer the question, because no trusted paper on the topic has been added yet. Setting `temperature=0` and grounding the agent in the knowledge base keeps it in check and reduces the risk of hallucination: the agent declines to answer rather than inventing a response.

In [None]:
question="Can high sweetener intake worsen pathogenesis of cardiometabolic disorders?"
display(HTML(get_completion(question)))

0,1
question,Can high sweetener intake worsen pathogenesis of cardiometabolic disorders?
answer,I don't have any information on the specific effects of high sweetener intake on the pathogenesis of cardiometabolic disorders.
references,"25319187 BACKGROUND Whether the type of dietary fat could alter cardiometabolic responses to a hypercaloric diet is unknown . BACKGROUND In addition , subclinical cardiometabolic consequences of moderate weight gain require further study . RESULTS In a 7-week , double-blind , parallel-group , random....24980134 BACKGROUND Managing cardiovascular risk factors is important for reducing vascular complications in type 2 diabetes , even in individuals who have achieved glycemic control . BACKGROUND Nut consumption is associated with reduced cardiovascular risk ; however , there is mixed evidence about ....24284442 BACKGROUND Leucine is a key amino acid involved in the regulation of skeletal muscle protein synthesis . OBJECTIVE We assessed the effect of the supplementation of a lower-protein mixed macronutrient beverage with varying doses of leucine or a mixture of branched chain amino acids ( BCAAs )....25833983 BACKGROUND Abdominal obesity and exaggerated postprandial lipemia are independent risk factors for cardiovascular disease ( CVD ) and mortality , and both are affected by dietary behavior . OBJECTIVE We investigated whether dietary supplementation with whey protein and medium-chain saturate...."


### Adding new papers into the knowledge base
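Let's add a paper on the above topic to the knowledge base and see what happens. LangChain provides methods such as `add_texts` to add new documents to the vector database index. Here the new chunk is tagged by hand with `trust_level = 'high'` in its metadata.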

In [None]:
SOURCE_DOCUMENTS_DIR = "/home/notebooks/storage/files/"
SOURCE_DOCUMENTS_FILTER = "24219891.txt"

loader = DirectoryLoader(f"{SOURCE_DOCUMENTS_DIR}", glob=SOURCE_DOCUMENTS_FILTER)
print(f"Loading {SOURCE_DOCUMENTS_DIR} directory")
data = loader.load()
print(f"Splitting {len(data)} documents")
docs = splitter.split_documents(data)
print(f"Created {len(docs)} documents")
for i in tqdm(range(len(docs))):
    docs[i].metadata['trust_level'] = 'high'
    
texts = [doc.page_content for doc in docs]
metadatas = [doc.metadata for doc in docs]
db.add_texts(texts, metadatas)
db.save_local("storage/deploy/faiss-db")

print(f"\n FAISS VectorDB has {db.index.ntotal} documents")

Loading /home/notebooks/storage/files/ directory
Splitting 1 documents
Created 1 documents
100%|██████████| 1/1 [00:00<00:00, 10618.49it/s]
 FAISS VectorDB has 255 documents


### Voila!
The Agent now has the context to answer the question with the trusted paper that we just added to our knowledge base.


In [None]:
question="Can high sweetener intake worsen pathogenesis of cardiometabolic disorders?"
display(HTML(get_completion(question)))

0,1
question,Can high sweetener intake worsen pathogenesis of cardiometabolic disorders?
answer,"Yes, high intake of added sweeteners, especially high-fructose intake, is considered to have a causal role in the pathogenesis of cardiometabolic disorders. It may not only cause weight gain but also low-grade inflammation, which is an independent risk factor for developing type 2 diabetes and cardiovascular disease."
references,"24219891 OBJECTIVE High intake of added sweeteners is considered to have a causal role in the pathogenesis of cardiometabolic disorders . OBJECTIVE Especially , high-fructose intake is regarded as potentially harmful to cardiometabolic health . OBJECTIVE It may cause not only weight gain but also lo....25319187 BACKGROUND Whether the type of dietary fat could alter cardiometabolic responses to a hypercaloric diet is unknown . BACKGROUND In addition , subclinical cardiometabolic consequences of moderate weight gain require further study . RESULTS In a 7-week , double-blind , parallel-group , random....24980134 BACKGROUND Managing cardiovascular risk factors is important for reducing vascular complications in type 2 diabetes , even in individuals who have achieved glycemic control . BACKGROUND Nut consumption is associated with reduced cardiovascular risk ; however , there is mixed evidence about ....24284442 BACKGROUND Leucine is a key amino acid involved in the regulation of skeletal muscle protein synthesis . OBJECTIVE We assessed the effect of the supplementation of a lower-protein mixed macronutrient beverage with varying doses of leucine or a mixture of branched chain amino acids ( BCAAs )...."


# Deploy the knowledge base to staging and query it

The `drx.deploy` convenience method:

- Builds a new Custom Model Environment containing the contents of `storage/deploy/`
- Assembles a new Custom Model with the provided hooks
- Deploys an Unstructured Custom Model to your Deployments
- Returns an object which can be used to make predictions

Pass `environment_id` to reuse an existing Custom Model Environment that already has the right requirements; this shortens iteration cycles when only the custom model hooks change.

In [None]:
import datarobotx as drx

deployment = drx.deploy(
    "storage/deploy/",
    name="Medical Research Papers redux",
    hooks={
        "score_unstructured": score_unstructured,
        "load_model": load_model
    },
    extra_requirements=["langchain", "faiss-cpu", "sentence-transformers", "openai"],
    # Re-use existing environment if you want to change the hook code,
    # and not requirements
    # environment_id="646e81c124b3abadc7c66da0",
)
# enable storing prediction data, necessary for Data Export for monitoring purposes
deployment.dr_deployment.update_predictions_data_collection_settings(enabled=True)

# Deploying custom model
  - Unable to auto-detect model type; any provided paths and files will be
    exported - dependencies should be explicitly specified using
    extra_requirements
  - Preparing model and environment...
  - Configured environment [[Custom] Medical Research Papers
    redux](https://app.datarobot.com/model-registry/custom-environments/64edfad0abee78c9e6b9dc45)
    with requirements:
      python 3.9.16
      datarobot-drum==1.10.3
      datarobot-mlops==8.2.7
      cloudpickle>=2.0.0
      langchain==0.0.244
      faiss-cpu==1.7.4
      sentence-transformers==2.2.2
      openai==0.27.8
  - Awaiting custom environment build...


  - Configuring and uploading custom model...
    100%|███████████████████████████████████| 92.4M/92.4M [00:00<00:00, 240MB/s]


  - Registered custom model [Medical Research Papers
    redux](https://app.datarobot.com/model-registry/custom-models/64ee013fb4482185322c1375/info)
    with target type: Unstructured
  - Creating and deploying model package...


  - Created deployment [Medical Research Papers
    redux](https://app.datarobot.com/deployments/64ee0150da79fc4182e4e537/overview)
# Custom model deployment complete


In [None]:
# Test the deployment
deployment.predict_unstructured(
    {
        "question": "Can high sweetener intake worsen pathogenesis of cardiometabolic disorders?",
        "openai_api_key": os.environ["OPENAI_API_KEY"],
    }
)

# Making predictions
  - Making predictions with deployment [Medical Research Papers
    redux](https://app.datarobot.com/deployments/64ee0150da79fc4182e4e537/overview)


# Predictions complete
{'question': 'Can high sweetener intake worsen pathogenesis of cardiometabolic disorders?',
 'chat_history': [],
 'answer': 'Yes, high intake of added sweeteners, especially high-fructose intake, is considered to have a causal role in the pathogenesis of cardiometabolic disorders. It may not only cause weight gain but also low-grade inflammation, which is an independent risk factor for developing type 2 diabetes and cardiovascular disease.',
 'references': ['/home/notebooks/storage/files/24219891.txt',
  '/home/notebooks/storage/files/25319187.txt',
  '/home/notebooks/storage/files/24980134.txt',
  '/home/notebooks/storage/files/24284442.txt']}
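
Because several chunks of the same paper can be retrieved for one question, the `references` list may name a file more than once (the obesity example above cites paper 24679060 twice). A small order-preserving de-duplication step, sketched below, could be applied before showing references to users. The helper name `unique_references` is hypothetical and is not part of the deployed hooks; the paths assume the same source directory used throughout this notebook.

```python
def unique_references(references):
    """Drop repeated references while keeping first-seen order."""
    # dict keys are unique and preserve insertion order
    return list(dict.fromkeys(references))


refs = [
    "/home/notebooks/storage/files/24679060.txt",
    "/home/notebooks/storage/files/24369008.txt",
    "/home/notebooks/storage/files/24679060.txt",
    "/home/notebooks/storage/files/24754911.txt",
]
deduped = unique_references(refs)
print(deduped)
```
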

# Conclusion
In this notebook, we have seen how to:

- use predictive models to classify text files
- create a vector store from research paper abstracts
- use Retrieval Augmented Generation with a Generative AI model
- deploy that Generative AI model to the DataRobot platform

Together, these steps create a Conversational Agent that can be used by healthcare professionals.