### How to access the OpenAI application programming interface

In your command prompt window:
>> pip install openai

Before storing or sharing a key, see OpenAI's best practices for API key safety: https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety
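One way to follow that guidance is to keep the key out of the notebook and read it from an environment variable. The sketch below is a minimal illustration, not part of the original notebook: the helper name `load_api_key` is hypothetical, and `OPENAI_API_KEY` is assumed to be the variable holding the key (it is the name the `openai` package looks for by default).

```python
import os

def load_api_key(env=None, var_name="OPENAI_API_KEY"):
    """Return the API key stored in an environment variable.

    `env` defaults to os.environ; a plain dict can be passed for testing.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an empty key
        raise RuntimeError(f"Environment variable {var_name} is not set.")
    return key
```

With this in place, `api_key_epl_shared = load_api_key()` would replace a key typed directly into a cell.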



In [10]:
import pandas as pd 
import numpy as np
import sklearn
import os

import matplotlib.pyplot as plt
import seaborn as sns
from kneed import KneeLocator
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
import re


if os.getlogin()=="JVARGH7":
    path_equity_precision_llm_folder = "C:/Cloud/OneDrive - Emory University/Papers/Global Equity in Diabetes Precision Medicine LLM"
    path_equity_precision_llm_repo =  'C:/code/external/equity_precision_llm'

if os.getlogin()=='aamnasoniwala':
    path_equity_precision_llm_folder = '/Users/aamnasoniwala/Library/CloudStorage/OneDrive-Emory/Global Equity in Diabetes Precision Medicine LLM'
    path_equity_precision_llm_repo = ' ' #Please add your repo path here


excel_path = path_equity_precision_llm_folder + '/llm training/Methods.xlsx'
# path_equity_precision_llm_repo = os.path.abspath("").replace("preprocessing", "")


In [17]:
api_key_epl_shared = ""

from openai import OpenAI
# https://stackoverflow.com/questions/36959031/how-to-source-file-into-python-script
# execfile() is not a built-in in Python 3; exec(open(...).read()) is the equivalent
exec(open(path_equity_precision_llm_repo + "/constants.py").read())
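
Sourcing a helper file into the notebook namespace can also be wrapped in a small function. This is a minimal sketch assuming Python 3, where the `execfile` built-in is not available. The name `source_file` and the demo file are hypothetical and not part of the repository.

```python
import tempfile
from pathlib import Path

def source_file(path, namespace):
    """Execute a Python file so its top-level names land in `namespace`."""
    source = Path(path).read_text(encoding="utf-8")
    # compile() with the file name keeps tracebacks pointing at the right file
    exec(compile(source, str(path), "exec"), namespace)
    return namespace

# Demonstration with a throwaway file standing in for a helper module
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo_constants.py"
    demo_path.write_text(
        "EXAMPLE_CONSTANT = 42\n\ndef double(x):\n    return 2 * x\n",
        encoding="utf-8",
    )
    demo_namespace = source_file(demo_path, {})

print(demo_namespace["double"](demo_namespace["EXAMPLE_CONSTANT"]))
```

In the notebook this would be called as `source_file(path_equity_precision_llm_repo + "/constants.py", globals())`.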


In [18]:
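# execfile() is not a built-in in Python 3; exec(open(...).read()) is the equivalent
exec(open(path_equity_precision_llm_repo + "/functions/prompt_generator.py").read())
exec(open(path_equity_precision_llm_repo + "/functions/base_prompt_append.py").read())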


base_prompt_files_v4 = ['p1v4', 'p2v4', 'p3v4']
base_prompts_v4 = base_prompt_append(base_prompt_files_v4)

prompt_pmid_v4 = prompt_generator(22744164, base_prompts_v4, excel_path)


print(base_prompts_v4[0])
print(base_prompts_v4[1])
print(prompt_pmid_v4)

I am going to outline inclusion criteria for four categories: diabetes, precision medicine, source population, and primary study. 

Please wait for me to prompt you on what to do based on these criteria.  
Here are the inclusion criteria:  

DIABETES: Do not exclude any type of diabetes or prediabetes. The presence of certain conditions or risk factors may not definitively confirm that the study is related to diabetes or prediabetes, unless there is a clear link to diabetes pathophysiology, diagnosis, or management. 

PRECISION MEDICINE: Precision medicine is an assessment of genetic or metabolic state to guide preventive and therapeutic decisions in humans. Exclude epidemiological studies using traditional biomarkers only, focusing on omics (genomics, metabolomics, proteomics, lipidomics etc.) or multi-omics studies. 

SOURCE POPULATION: The source population is correct if the population based on the prompt and the study participants’ racial/ethnic background, regardless of geograph

In [19]:
# https://platform.openai.com/docs/api-reference/chat/create?lang=python

client = OpenAI(api_key= api_key_epl_shared)

completion = client.chat.completions.create(

    model="gpt-3.5-turbo",

    messages=[

        {"role": "system", "content": base_prompts_v4[0]},
        {"role": "system", "content": base_prompts_v4[1]},
        {"role": "user", "content": prompt_pmid_v4}

    ],

    max_tokens = 1000

)


print(completion.choices[0].message)

InternalServerError: Error code: 500 - {'error': {'message': 'Internal server error', 'type': 'auth_subrequest_error', 'param': None, 'code': 'internal_error'}}
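
A 500 response like the one above can be transient, so a retry with exponential backoff is sometimes worth trying before investigating further. Whether a retry resolves this particular error is not known. The sketch below is generic: `call_with_retries` is a hypothetical helper, and the broad `except Exception` is an assumption that should be narrowed to the errors worth retrying (for example `openai.InternalServerError`).

```python
import time

def call_with_retries(request_fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call request_fn(), retrying with exponential backoff when it raises."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request_fn()
        except Exception:  # assumption: narrow this to the errors worth retrying
            if attempt == max_attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            sleep(base_delay * 2 ** (attempt - 1))

# Demonstration with a stand-in function that fails twice, then succeeds
attempts = {"count": 0}
recorded_delays = []

def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated server error")
    return "ok"

demo_result = call_with_retries(flaky_request, sleep=recorded_delays.append)
```

In the notebook, `request_fn` would be a function or lambda wrapping the `client.chat.completions.create(...)` call.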

### 1. Preparing Your Batch File

https://platform.openai.com/docs/guides/batch

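The batch input is a .jsonl file in which each line contains the details of one individual request to the API.

Each request must include a unique custom_id value, which is used to match each result to its request after the batch completes.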

{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-3.5-turbo-0125", "messages": [{"role": "system", "content": "You are a helpful assistant."},{"role": "user", "content": "Hello world!"}],"max_tokens": 1000}}


{"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-3.5-turbo-0125", "messages": [{"role": "system", "content": "You are an unhelpful assistant."},{"role": "user", "content": "Hello world!"}],"max_tokens": 1000}}
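
Request lines of this shape can be built and checked programmatically. The sketch below is illustrative only: `build_batch_line` and `find_duplicate_ids` are hypothetical helper names, and the defaults mirror the example lines above.

```python
import json

def build_batch_line(custom_id, system_prompts, user_prompt,
                     model="gpt-3.5-turbo-0125", max_tokens=1000):
    """Return one JSON line describing a chat completion request."""
    messages = [{"role": "system", "content": p} for p in system_prompts]
    messages.append({"role": "user", "content": user_prompt})
    request = {
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {"model": model, "messages": messages, "max_tokens": max_tokens},
    }
    return json.dumps(request)

def find_duplicate_ids(lines):
    """Return the custom_id values that appear on more than one line."""
    seen, duplicates = set(), set()
    for line in lines:
        custom_id = json.loads(line)["custom_id"]
        if custom_id in seen:
            duplicates.add(custom_id)
        seen.add(custom_id)
    return sorted(duplicates)

# Demonstration with placeholder prompts
line_a = build_batch_line("request-1", ["You are a helpful assistant."], "Hello world!")
line_b = build_batch_line("request-1", ["You are a helpful assistant."], "Hello again!")
duplicate_ids = find_duplicate_ids([line_a, line_b])
```

Running the duplicate check before uploading catches repeated identifiers early.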

In [13]:
pmid_list = pd.read_excel(excel_path, sheet_name='Training Data')['PMID'].tolist()

json_list = []

for index, pmid in enumerate(pmid_list):
    prompt_pmid_v4 = prompt_generator(pmid, base_prompts_v4, excel_path)
    dict_pmid = {"custom_id": str(index) + "_" + str(pmid), 
                 "method": "POST", 
                 "url": "/v1/chat/completions", 
                 "body": {"model": "gpt-3.5-turbo-0125", 
                          "messages": [ {"role": "system", "content": base_prompts_v4[0]},
                                        {"role": "system", "content": base_prompts_v4[1]},
                                        {"role": "user", "content": prompt_pmid_v4}],
                            "max_tokens": 1000
                        }
                }

    json_list.append(dict_pmid)



In [15]:
import json
with open(path_equity_precision_llm_folder + '/llm training/Training.jsonl', 'w') as outfile:
    for entry in json_list:
        json_line = json.dumps(entry)
        outfile.write(json_line + '\n')

### 2. Uploading Your Batch Input File

In [16]:
client = OpenAI(api_key= api_key_epl_shared)

batch_input_file = client.files.create(
  file=open(path_equity_precision_llm_folder + '/llm training/Training.jsonl', "rb"),
  purpose="batch"
)



### 3. Creating the Batch

In [23]:
batch_input_file_id = batch_input_file.id
print(batch_input_file_id)
batch_created = client.batches.create(
    input_file_id=batch_input_file_id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
    metadata={
      "description": "training data for PMID query"
    }
)


file-59SQmdAF274a0YJwu2HXQ668


### 4. Checking the Status of a Batch

In [64]:
client = OpenAI(api_key= api_key_epl_shared)

batch_status = client.batches.retrieve(batch_created.id)
batch_status.status

'completed'
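
A batch can take up to the completion window to finish, so the status check is usually repeated until a terminal status is reached. The sketch below is one way to do that. `wait_for_batch` is a hypothetical helper, the status source is passed in as a callable so the loop can be tried without network access, and the terminal statuses listed are those described in the Batch API guide.

```python
import time

# Statuses after which the batch will not change any further
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}

def wait_for_batch(get_status, poll_seconds=60, max_polls=1440, sleep=time.sleep):
    """Poll get_status() until it returns a terminal status, then return it."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError("Batch did not reach a terminal status in time.")

# Demonstration with a simulated sequence of statuses
status_feed = iter(["validating", "in_progress", "finalizing", "completed"])
recorded_waits = []
final_status = wait_for_batch(lambda: next(status_feed),
                              poll_seconds=0, sleep=recorded_waits.append)
```

In the notebook the callable would be `lambda: client.batches.retrieve(batch_created.id).status`.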

### 5. Retrieving the Results

In [65]:
client = OpenAI(api_key= api_key_epl_shared)

file_response = client.files.content(batch_status.output_file_id)


In [66]:
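import pandas as pd
from io import StringIO

# Passing a literal JSON string to read_json is deprecated, so wrap the text in StringIO
df = pd.read_json(StringIO(file_response.content.decode('utf-8')), lines=True)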






In [67]:
df.head()

Unnamed: 0,id,custom_id,response,error
0,batch_req_67378baaf98081909d4735ec47990630,22744164,"{'status_code': 200, 'request_id': 'a33ff4fffa...",
1,batch_req_67378bab06d88190b6aaaf6dede6b157,115561964,"{'status_code': 200, 'request_id': '0716eefcbc...",
2,batch_req_67378bab13c88190ba7e780166b01d04,228770629,"{'status_code': 200, 'request_id': '84793c8a30...",
3,batch_req_67378bab21288190ab3a38c529f41bea,333764184,"{'status_code': 200, 'request_id': '72f7d12a34...",
4,batch_req_67378bab2ea48190acb4d494b404a1f8,436155119,"{'status_code': 200, 'request_id': 'e08cc7a75a...",
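
The `custom_id` values displayed above (for example `115561964`) no longer show the `index_pmid` form built earlier (`1_15561964`), which suggests they were converted to numbers while loading. A minimal sketch of an alternative, assuming each output line has the `custom_id`, `response`, and `error` fields shown above: parse the lines with the standard `json` module so the identifiers stay strings. The helper names and the sample record are hypothetical.

```python
import json

def parse_batch_output(jsonl_text):
    """Parse batch output lines, keeping custom_id as the original string."""
    rows = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        response = record.get("response") or {}
        choices = (response.get("body") or {}).get("choices") or []
        content = choices[0]["message"]["content"] if choices else None
        rows.append({
            "custom_id": record["custom_id"],
            "status_code": response.get("status_code"),
            "content": content,
            "error": record.get("error"),
        })
    return rows

def split_custom_id(custom_id):
    """Split an 'index_pmid' identifier into (index, pmid) integers."""
    index_part, pmid_part = custom_id.split("_", 1)
    return int(index_part), int(pmid_part)

# Demonstration with a made-up record in the same shape as the output file
sample_line = json.dumps({
    "id": "batch_req_demo",
    "custom_id": "1_15561964",
    "response": {"status_code": 200,
                 "body": {"choices": [{"message": {"content": "| pmid |"}}]}},
    "error": None,
})
parsed_rows = parse_batch_output(sample_line + "\n")
```

In the notebook this would be called as `parse_batch_output(file_response.content.decode('utf-8'))`.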


In [80]:

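from io import StringIO

parsed_rows = []
for index in range(len(df)):
    # The model's reply is a Markdown table stored in the response body
    markdown_table = df['response'][index]['body']['choices'][0]['message']['content']
    # Parse the text before the first blank line as pipe-delimited CSV
    out = pd.read_csv(StringIO(markdown_table.split('\n\n')[0]),
                      sep="|", skipinitialspace=False,
                      skipfooter=0, engine='python', header=0)
    out.columns = out.columns.str.lower().str.strip()
    # Row 0 is the Markdown separator line, so row 1 is the first data row
    parsed_rows.append(out.iloc[[1]])

results = pd.concat(parsed_rows)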
    
results = results.filter(regex=r'^(?!unnamed)')

results.to_csv(path_equity_precision_llm_folder + '/llm training/Training_results.csv', index=False)

results.head()

Unnamed: 0,pmid,title,precision medicine,diabetes,correct source population,primary study
1,22744164,Acculturation and glycemic control of Asian I...,no,yes,South Asia (SA),yes
1,15561964,Linkage analysis of diabetes status among hyp...,No,Yes,NA (The article focuses on Caucasian and Afri...,Yes
1,28770629,"Ipragliflozin, a sodium glucose co-transporte...",No,Yes,EA,Yes
1,33764184,Gastrodin protects against high glucose-induc...,No,Yes,East Asia (EA),Yes
1,36155119,Curcumin supplementation reduces blood glucos...,No,Yes,LAC,Yes
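
The rows above mix `no`/`No` and carry leading spaces, so the parsed cells benefit from trimming and case normalization. A minimal sketch of a standalone Markdown table parser follows. It assumes every table row starts with `|` and that no cell contains a literal `|`. The function names and the sample table are hypothetical.

```python
def parse_markdown_table(text):
    """Turn a simple Markdown table into a list of dicts keyed by lowercased header."""
    lines = [ln.strip() for ln in text.strip().splitlines()
             if ln.strip().startswith("|")]
    rows = [[cell.strip() for cell in ln.strip("|").split("|")] for ln in lines]
    if not rows:
        return []
    header = [name.lower() for name in rows[0]]
    records = []
    for cells in rows[1:]:
        # Skip the separator row, whose cells contain only dashes and colons
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        records.append(dict(zip(header, cells)))
    return records

def normalize_label(value):
    """Lowercase and trim a yes/no style label."""
    return value.strip().lower()

# Demonstration with a made-up reply in the expected layout
sample_reply = """| PMID | Title | Diabetes |
|---|---|---|
| 22744164 | Example title | Yes |"""
sample_records = parse_markdown_table(sample_reply)
```

Each record can then be compared with the training labels after passing the label cells through `normalize_label`.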


### Comparison of ChatGPT API with Training Data



In [81]:
input = pd.read_excel(excel_path, sheet_name='Training Data')
input.head()



Unnamed: 0,PMID,Title,Abstract,MeSH,Source Population,Precision Medicine,Diabetes,Correct Source Population,Primary Study
0,22744164,Acculturation and glycemic control of Asian In...,The prevalence of type 2 diabetes is dispropor...,Acculturation*; Asian / psychology; Asian / st...,South Asia,No,Yes,Yes,Yes
1,15561964,Linkage analysis of diabetes status among hype...,Type 2 diabetes susceptibility is determined b...,"Chromosome Mapping*; Diabetes Mellitus, Type 2...",South Asia,Yes,Yes,No,Yes
2,28770629,"Ipragliflozin, a sodium glucose co-transporter...",Objective: We recently investigated the effect...,"Blood Glucose / analysis; Diabetes Mellitus, T...",East Asia,No,Yes,Yes,Yes
3,33764184,Gastrodin protects against high glucose-induce...,Diabetic cardiomyopathy (DCM) is one of the ma...,Aryl Hydrocarbon Receptor Nuclear Translocator...,East Asia,No,Yes,No,Yes
4,36155119,Curcumin supplementation reduces blood glucose...,Objective: To evaluate the effect of curcumin ...,Blood Glucose* / metabolism; Body Mass Index; ...,Latin America & Caribbean,No,No,Yes,Yes
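
One simple comparison is the share of articles on which the model and the training labels agree, column by column. The sketch below assumes both tables are available as lists of dicts with lowercased column names (for a DataFrame, `to_dict("records")` produces this shape). `agreement_by_column` is a hypothetical helper. The `correct source population` column is left out because the model output above holds region labels there rather than yes/no values. The two demo rows are copied from the tables displayed above.

```python
def normalize_label(value):
    """Lowercase and trim a yes/no style label."""
    return str(value).strip().lower()

def agreement_by_column(predicted, reference, columns, key="pmid"):
    """Return, per column, the fraction of matched rows with identical labels."""
    reference_by_key = {str(row[key]): row for row in reference}
    counts = {column: [0, 0] for column in columns}  # [matches, total]
    for row in predicted:
        ref_row = reference_by_key.get(str(row[key]))
        if ref_row is None:
            continue  # no training label for this article
        for column in columns:
            counts[column][1] += 1
            if normalize_label(row[column]) == normalize_label(ref_row[column]):
                counts[column][0] += 1
    return {column: (matches / total if total else None)
            for column, (matches, total) in counts.items()}

# Two rows taken from the model output and the training data shown above
model_rows = [
    {"pmid": 22744164, "precision medicine": "no", "diabetes": "yes", "primary study": "yes"},
    {"pmid": 36155119, "precision medicine": "No", "diabetes": "Yes", "primary study": "Yes"},
]
training_rows = [
    {"pmid": 22744164, "precision medicine": "No", "diabetes": "Yes", "primary study": "Yes"},
    {"pmid": 36155119, "precision medicine": "No", "diabetes": "No", "primary study": "Yes"},
]
agreement = agreement_by_column(
    model_rows, training_rows,
    columns=["precision medicine", "diabetes", "primary study"],
)
```

Matching on `pmid` rather than on row position keeps the comparison valid if either table is reordered or filtered.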
