# Motivation

A more practice-based approach may prove more fruitful.

# Dependencies

In [1]:
!pip install -qU datasets pinecone-client sentence-transformers torch

# Dataset

In [6]:
from datasets import load_dataset

# load the dataset from the Hugging Face Hub in streaming mode and shuffle it
radiology_data = load_dataset(
    path='Ka4on/radiology',
    split='train',
    streaming=True
).shuffle(seed=960)

In [7]:
# streaming mode allows us to iterate over the dataset without needing to download it
# show the contents of a single document in the dataset
next(iter(radiology_data))

{'instruction': 'Generate impression based on medical findings.',
 'input': 'Dysarthria. There is no evidence of intracranial hemorrhage, mass, or acute infarct. There are mild scattered foci of cerebral white matter T2 hyperintensity. There is diffuse cerebral volume loss, which is most pronounced in the medial temporal lobes. There is no midline shift or herniation. The major cerebral flow voids are intact. The orbits, skull, paranasal sinuses, and scalp soft tissues are grossly unremarkable.',
 'output': '1. Nonspecific mild scattered foci of cerebral white matter T2 hyperintensity may represent chronic small vessel ischemic disease. Otherwise, no evidence of acute infarction.2. Diffuse cerebral volume loss, which is most pronounced in the medial temporal lobes, which may represent Alzheimer disease in the appropriate clinical setting. '}

In [8]:
from itertools import islice

from tqdm.auto import tqdm

total_prognosis_count = 50000

# islice stops the stream after exactly 50k records
# (the earlier counter check ran after the append, so it collected 50,001)
stream = islice(radiology_data, total_prognosis_count)

# extract the fields we need from each record
docs = [
    {"input": d["input"], "output": d["output"]}
    for d in tqdm(stream, total=total_prognosis_count)
]

In [9]:
import pandas as pd

df = pd.DataFrame(docs)
df.head()

| | input | output |
|---|---|---|
| 0 | Dysarthria. There is no evidence of intracrani... | 1. Nonspecific mild scattered foci of cerebral... |
| 1 | Male 8 years old Reason: Ao root dilatation Le... | 1. Status post arterial switch operation.2. No... |
| 2 | Pituitary adenoma status post TSH in 11/2013: ... | Interval evolution postoperative findings rela... |
| 3 | History of neuroblastoma of lumbar spine, rela... | 1. Postoperative findings related to laminecto... |
| 4 | Encephalopathy: confusion, encephalopathy. Man... | Scattered chronic infarcts and probable chroni... |


# Initialize Pinecone index

In [17]:
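import os

import pinecone

# connect to pinecone environment
# read the API key from an environment variable rather than hard-coding it
# (the variable name PINECONE_API_KEY is an assumption; use whatever name you export)
pinecone.init(
    api_key=os.environ["PINECONE_API_KEY"],
    environment="us-east1-gcp"  # find next to API key in console
)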

In [18]:
index_name = "qa"

# check if the "qa" index exists
if index_name not in pinecone.list_indexes():
    # create the index if it does not exist
    pinecone.create_index(
        index_name,
        dimension=768,  # must match the retriever's embedding size
        metric="cosine"
    )

# connect to the "qa" index
index = pinecone.Index(index_name)

# Initialize Retriever

##### Retriever Tasks:

- Generate embeddings for all historical passages (context vectors/embeddings)
- Generate embeddings for our questions (query vector/embedding)

The retriever will create embeddings such that the questions and passages that hold the answers to our queries are close to one another in the vector space. 

We use a SentenceTransformer model based on Microsoft's MPNet as our retriever.
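
To make "close to one another in the vector space" concrete, here is a minimal sketch of cosine similarity, the metric the index above is configured with. The 3-dimensional vectors and passage names are made-up stand-ins for the retriever's 768-dimensional embeddings, not real model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (made-up values) standing in for real embeddings.
query = [0.9, 0.1, 0.0]
passages = {
    "relevant passage": [0.8, 0.2, 0.1],
    "unrelated passage": [0.0, 0.1, 0.9],
}

# Rank passages by similarity to the query, highest first.
ranked = sorted(
    passages,
    key=lambda name: cosine_similarity(query, passages[name]),
    reverse=True,
)
print(ranked)
```

A vector index performs the same kind of ranking, only over many stored vectors and with approximate search instead of a full sort.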

In [23]:
import torch
from sentence_transformers import SentenceTransformer

# set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)

# load the retriever model from huggingface model hub
retriever = SentenceTransformer("flax-sentence-embeddings/all_datasets_v3_mpnet-base", device=device)
retriever

ImportError: cannot import name 'logger' from 'joblib' (unknown location)
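
Before reinstalling anything, it can help to check what Python actually sees for the packages involved. An "(unknown location)" in an ImportError often means the name resolves to a package with no file on disk, such as a leftover empty directory from a partial install. The sketch below uses only the standard library; `describe_package` is a hypothetical helper name, and the package list is an assumption about which packages are relevant here.

```python
import importlib.util
from importlib import metadata


def describe_package(import_name, dist_name):
    """Report whether a module is importable, where it lives, and its installed version."""
    try:
        spec = importlib.util.find_spec(import_name)
    except (ImportError, ValueError):
        spec = None

    info = {"found": spec is not None, "origin": None, "version": None}
    if spec is not None:
        # origin is typically None when the name resolves to a namespace package
        info["origin"] = spec.origin
    try:
        info["version"] = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        pass  # no installed distribution metadata under this name
    return info


# (import name, distribution name) pairs; adjust to your environment.
for import_name, dist_name in [
    ("joblib", "joblib"),
    ("sklearn", "scikit-learn"),
    ("sentence_transformers", "sentence-transformers"),
]:
    print(import_name, describe_package(import_name, dist_name))
```

If `joblib` is reported as found with no origin, or found with no version, reinstalling that one package in the active environment is a narrower fix to try first.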

In [None]:
# -y answers the confirmation prompt, which a notebook cell cannot do interactively
!conda install -y -c anaconda scikit-learn

Retrieving notices: ...working... done
Collecting package metadata (current_repodata.json): done
Solving environment: / 
The environment is inconsistent, please check the package plan carefully
The following packages are causing the inconsistency:

  - defaults/osx-arm64::nbclassic==0.5.5=py311hca03da5_0
  - conda-forge/noarch::jupyterlab==4.0.3=pyhd8ed1ab_0
  - defaults/osx-arm64::evaluate==0.4.0=py311hca03da5_0
  - defaults/osx-arm64::ipykernel==6.19.2=py311h37496c9_0
  - defaults/osx-arm64::scikit-learn==1.2.2=py311h313beb8_1
  - conda-forge/noarch::notebook==7.0.1=pyhd8ed1ab_0
done


  current version: 23.7.2
  latest version: 23.10.0

Please update conda by running

    $ conda update -n base -c defaults conda

Or to minimize the number of packages updated during conda update use

     conda install conda=23.10.0



## Package Plan ##

  environment location: /Users/andychen2121/anaconda3

  added / updated specs:
    - scikit-learn


The following packages will be downloaded:

  
