# TESTING: GPT FINE-TUNED MODELS

GROUP MEMBERS:
- Rishabh TIWARI;
- Felipe BAGNI;
- Erfan AMIDI;
- Federica VINCIGUERRA;
- Dan LIONIS.

---

# Testing the GPT Fine-Tuned Model on the Original Dataset

The goal of this notebook is to test the fine-tuned GPT model (davinci-002) on the Medical Flashcards dataset.

## Install requirements

In [1]:
!pip uninstall -y openai
!pip install openai==0.28
!pip install datasets
!pip install scikit-learn sentence-transformers
!pip install nltk

Collecting openai==0.28
  Downloading openai-0.28.0-py3-none-any.whl (76 kB)
Installing collected packages: openai
Successfully installed openai-0.28.0
Collecting datasets
  Downloading datasets-2.19.1-py3-none-any.whl (542 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
Collecting multiprocess (from datasets)
  Downlo

In [2]:
import json
import openai
import pandas as pd
import numpy as np
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from google.colab import drive
import os

## Connect to GDrive

In [4]:
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
path = '_NLP/Project'

os.chdir(f'/content/drive/MyDrive/{path}')
os.getcwd()

'/content/drive/MyDrive/_NLP/Project'

## Add API key

In [14]:
api_key ="sk-proj-###" # ADD YOUR API KEY HERE
openai.api_key = api_key
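
Hard-coding the key in a cell makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the key was exported beforehand under an environment variable named `OPENAI_API_KEY`. The variable name and the helper `load_api_key` are illustrative and not part of the original notebook.

```python
import os

def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in `env_var`, or raise if it is missing.

    Both `env_var` and this helper are hypothetical names used for illustration.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key

# Possible usage in this notebook (assumes the variable is set):
# openai.api_key = load_api_key()
```

With this pattern the key never appears in the saved notebook, so the placeholder string above does not need to be edited before sharing.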

## Split dataset

In [15]:
dataset = load_dataset('arrow', data_files='data-00000-of-00001.arrow')

In [16]:
df = dataset['train'].to_pandas()

In [17]:
df.head()

Unnamed: 0,input,output,instruction
0,What is the relationship between very low Mg2+...,Very low Mg2+ levels correspond to low PTH lev...,Answer this question truthfully
1,What leads to genitourinary syndrome of menopa...,Low estradiol production leads to genitourinar...,Answer this question truthfully
2,What does low REM sleep latency and experienci...,Low REM sleep latency and experiencing halluci...,Answer this question truthfully
3,What are some possible causes of low PTH and h...,"PTH-independent hypercalcemia, which can be ca...",Answer this question truthfully
4,How does the level of anti-müllerian hormone r...,The level of anti-müllerian hormone is directl...,Answer this question truthfully


In [18]:
df = dataset['train'].to_pandas()
df = df.iloc[:, :-1]

In [19]:
train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)

In [20]:
test_data.head()

Unnamed: 0,input,output
27911,What are some physical signs that may indicate...,What are some physical signs that may indicate...
7251,What is the name of the amino acid that serves...,Arginine is the amino acid that acts as the pr...
32050,Do high or low potency typical antipsychotics ...,High potency typical antipsychotics are more l...
7969,Which type of heart valves are commonly affect...,Viridans streptococci infection is typically s...
6904,"Among all bugs, which one is the most frequent...",Staphylococcus aureus is the bug that is the m...


## Generate answers from the model

In [21]:
def generate_answer(question):
    prompt = question + " ->"
    response = openai.Completion.create(
        model='ft:davinci-002:personal::9KLi6nKN',
        prompt=prompt,
        max_tokens=100,
        top_p=0.9,
        frequency_penalty=2,
        presence_penalty=1,
        stop=["\n"]
    )
    return response.choices[0].text

total_questions = len(test_data)
predictions = []
count = 0
for index, row in test_data.iterrows():
    if count >= 100:
        break
    print(f"Processed question {count + 1} out of {total_questions}")
    question = row["input"]
    predicted_answer = generate_answer(question)
    predictions.append(predicted_answer)
    count = count + 1

Processed question 1 out of 6791
Processed question 2 out of 6791
Processed question 3 out of 6791
Processed question 4 out of 6791
Processed question 5 out of 6791
Processed question 6 out of 6791
Processed question 7 out of 6791
Processed question 8 out of 6791
Processed question 9 out of 6791
Processed question 10 out of 6791
Processed question 11 out of 6791
Processed question 12 out of 6791
Processed question 13 out of 6791
Processed question 14 out of 6791
Processed question 15 out of 6791
Processed question 16 out of 6791
Processed question 17 out of 6791
Processed question 18 out of 6791
Processed question 19 out of 6791
Processed question 20 out of 6791
Processed question 21 out of 6791
Processed question 22 out of 6791
Processed question 23 out of 6791
Processed question 24 out of 6791
Processed question 25 out of 6791
Processed question 26 out of 6791
Processed question 27 out of 6791
Processed question 28 out of 6791
Processed question 29 out of 6791
Processed question 30 o
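
The loop above caps the run at 100 questions with a manual counter. As a sketch only, the same pattern can be written with `itertools.islice`, which makes the cap explicit and reports progress against the number of questions actually sent rather than the full test-set size. The helper `collect_predictions` and the stand-in `fake_generate` are hypothetical names; in the notebook, `generate_answer` would be passed instead.

```python
from itertools import islice

def collect_predictions(questions, generate_fn, limit=100):
    """Call `generate_fn` on at most `limit` questions and return the answers in order."""
    predictions = []
    for i, question in enumerate(islice(questions, limit), start=1):
        print(f"Processed question {i} out of {limit}")
        predictions.append(generate_fn(question))
    return predictions

# Stand-in for generate_answer so the sketch runs without an API call.
def fake_generate(question):
    return question.upper()

demo = collect_predictions(["a?", "b?", "c?"], fake_generate, limit=2)
```

In this notebook the call would look like `collect_predictions(test_data["input"], generate_answer)`, since iterating over a pandas Series yields its values in row order.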

# Extracting Reference Answers from the Test Dataset

## Overview
In this section, we extract a subset of reference answers from the test dataset for evaluation purposes.

## Steps Involved

### 1. Initialize Variables
- Initialized a counter (`count`) to track the number of extracted answers.
- Created an empty list (`references`) to store the reference answers.

### 2. Iterate Through the Test Data
- Iterated through each row of the DataFrame `test_data`.
- Added the value of the `output` column to the `references` list.
- Stopped after extracting 100 answers.

## Result
- A list of 100 reference answers is prepared for further evaluation.

In [22]:
# Get answers from the dataset
count = 0
references = []
for index, row in test_data.iterrows():
    if count >= 100:
        break
    count = count + 1
    references.append(row["output"])

# Computing Embeddings for Evaluation

## Load a Pre-trained Model
- Loaded the pre-trained `SentenceTransformer` model (`paraphrase-MiniLM-L6-v2`).

## Compute Embeddings
- Generated embeddings for the reference answers.
- Generated embeddings for the predicted answers.

In [23]:
# Load a pre-trained model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Compute embeddings
reference_embeddings = model.encode(references)
prediction_embeddings = model.encode(predictions)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [24]:
# Log the shape of the embeddings for debugging
print(f"Reference embeddings shape: {np.array(reference_embeddings).shape}")
print(f"Prediction embeddings shape: {np.array(prediction_embeddings).shape}")

Reference embeddings shape: (100, 384)
Prediction embeddings shape: (100, 384)


# Cosine Similarity Calculation

- Calculated the cosine similarity for each pair of reference and predicted embeddings.
- Stored the cosine similarity scores in a list for further analysis.

In [25]:
# Compute cosine similarity for each pair
cosine_similarities = []
for ref_emb, pred_emb in zip(reference_embeddings, prediction_embeddings):
    cos_sim = cosine_similarity([ref_emb], [pred_emb])[0][0]
    cosine_similarities.append(cos_sim)

In [26]:
# Calculate average cosine similarity
average_cosine_similarity = np.mean(cosine_similarities)
print(f'Average Cosine Similarity: {average_cosine_similarity:.2f}')

Average Cosine Similarity: 0.79
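
For reference, the per-pair score computed above is the dot product of the two embeddings divided by the product of their norms. A minimal pure-Python sketch of that formula, assuming non-zero vectors (the helper name `cosine` is illustrative, and the notebook itself relies on scikit-learn's `cosine_similarity`):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||). Assumes non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])      # perpendicular vectors
```

Parallel vectors score close to 1 and perpendicular vectors score 0, so the average reported above summarizes how closely the predicted and reference embeddings point in the same direction.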


# BLEU Score Calculation Function

## Purpose
- To evaluate the quality of predicted text by comparing it to reference text using the BLEU score.

## Function Definition
- **Tokenization:** Tokenizes both reference and prediction text into words.
- **Smoothing:** Applies a smoothing function to handle short sequences and avoid zero scores.
- **BLEU Score Calculation:** Uses the `sentence_bleu` function from NLTK to compute the BLEU score.

In [27]:
# Function to calculate BLEU score
def calculate_bleu(reference, prediction):
    reference_tokens = [nltk.word_tokenize(reference)]
    prediction_tokens = nltk.word_tokenize(prediction)
    # Using smoothing function to avoid zero scores for short sequences
    smoothing_function = SmoothingFunction().method1
    bleu_score = sentence_bleu(reference_tokens, prediction_tokens, smoothing_function=smoothing_function)
    return bleu_score

In [28]:
# Ensure NLTK resources are downloaded
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [29]:
# Calculate BLEU scores for all predictions
bleu_scores = [calculate_bleu(ref, pred) for ref, pred in zip(references, predictions)]

In [30]:
# Calculate average BLEU score
average_bleu_score = sum(bleu_scores) / len(bleu_scores)
print(f'Average BLEU Score: {average_bleu_score:.2f}')

Average BLEU Score: 0.18
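
To make the BLEU number easier to interpret: the score is built from clipped n-gram precisions (by default for n = 1 to 4 in `sentence_bleu`), combined with a brevity penalty. The sketch below shows only the clipped-precision ingredient, using whitespace splitting instead of `nltk.word_tokenize`. The helper names are illustrative, and this is not a replacement for the NLTK implementation.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in `tokens`."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(reference_tokens, candidate_tokens, n=1):
    """Fraction of candidate n-grams found in the reference, with each
    n-gram credited at most as often as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate_tokens, n))
    ref_counts = Counter(ngrams(reference_tokens, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total

reference = "the cat is on the mat".split()
candidate = "the the the the the the the".split()
p1 = clipped_precision(reference, candidate, n=1)  # 2 of 7 candidate unigrams are credited
```

Clipping is what prevents a prediction from scoring well by repeating a common word, and it also explains why a fluent paraphrase with different wording can receive a low BLEU score while still having high embedding similarity.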


# Manual check with a random question from the test dataset

In [31]:
import random

# Assuming you have a pandas DataFrame named test_data containing your test dataset
random_index = random.randint(0, len(test_data) - 1)
random_row = test_data.iloc[random_index]
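question = random_row["input"]
prompt = question + " ->"  # printed below; generate_answer() appends the separator itself
actual_answer = random_row["output"]

# Pass the raw question: passing `prompt` would send a doubled " -> ->" separator
bot_answer = generate_answer(question)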

print('*************************************')
print('Question: ', prompt)
print('Actual Answer:', actual_answer)
print('Bot Answer: ', bot_answer)


*************************************
Question:  What is methimazole and what are some of the potential teratogenic complications associated with its use during the first trimester of pregnancy? ->
Actual Answer: Methimazole is a medication used to treat hyperthyroidism. However, if taken during the first trimester of pregnancy, it can be teratogenic and cause birth defects in the developing fetus. One potential complication is aplasia cutis, which is the absence of skin on the scalp or other parts of the body. Other potential teratogenic complications of methimazole include choanal atresia, esophageal atresia, and congenital heart defects. It is important for pregnant women to discuss any medications they are taking with their healthcare provider to determine if they are safe to use during pregnancy.
Bot Answer:   Methimazole is a medication used to treat hyperthyroidism, which can cause complications during pregnancy if it is taken in the first trimester. Specifically, methimazole ha

---

# Testing the GPT Fine-Tuned Model on a Different Dataset

The goal of this notebook is to test the fine-tuned GPT model (davinci-002) on a different dataset. Instead of the Medical Flashcards, here we test the model on this dataset: [GokulWork/QuestionAnswer_MCQ](https://huggingface.co/datasets/GokulWork/QuestionAnswer_MCQ)

## Install requirements

In [1]:
!pip uninstall -y openai
!pip install openai==0.28
!pip install datasets
!pip install scikit-learn sentence-transformers
!pip install nltk

Found existing installation: openai 0.28.0
Uninstalling openai-0.28.0:
  Successfully uninstalled openai-0.28.0
Collecting openai==0.28
  Using cached openai-0.28.0-py3-none-any.whl (76 kB)
Installing collected packages: openai
Successfully installed openai-0.28.0


In [2]:
import json
import openai
import pandas as pd
import numpy as np
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
import os
from google.colab import drive
import random

## Connect to GDrive

In [3]:
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
path = '_NLP/Project'

os.chdir(f'/content/drive/MyDrive/{path}')
os.getcwd()

'/content/drive/MyDrive/_NLP/Project'

## Add API key

In [5]:
api_key ="sk-proj-###" #ADD YOUR API KEY HERE
openai.api_key = api_key

## Load and Split dataset

In [6]:
dataset = load_dataset("GokulWork/QuestionAnswer_MCQ")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [7]:
df = dataset['train'].to_pandas()
df.head()

Unnamed: 0.1,Unnamed: 0,question,answer,text
0,0,What is a force?,Correct Answer- A force is a push or pull that...,"###Human:\ngenerate a correct answer, a ration..."
1,1,What is Newton's First Law of Motion?,"Correct Answer- Newton's First Law of Motion, ...","###Human:\ngenerate a correct answer, a ration..."
2,2,What is the difference between speed and veloc...,Correct Answer- Speed is a scalar quantity tha...,"###Human:\ngenerate a correct answer, a ration..."
3,3,Explain that when the kinetic energy of an obj...,Correct Answer- The change in kinetic energy c...,"###Human:\ngenerate a correct answer, a ration..."
4,4,What is the SI unit of electric current?,Correct Answer- Ampere\n\nRationale- Ampere is...,"###Human:\ngenerate a correct answer, a ration..."


In [8]:
df = df.drop(['Unnamed: 0', 'text'], axis=1)
df.head()

Unnamed: 0,question,answer
0,What is a force?,Correct Answer- A force is a push or pull that...
1,What is Newton's First Law of Motion?,"Correct Answer- Newton's First Law of Motion, ..."
2,What is the difference between speed and veloc...,Correct Answer- Speed is a scalar quantity tha...
3,Explain that when the kinetic energy of an obj...,Correct Answer- The change in kinetic energy c...
4,What is the SI unit of electric current?,Correct Answer- Ampere\n\nRationale- Ampere is...


In [9]:
train_data, test_data = train_test_split(df, test_size=100, random_state=42) # limit to 100

In [10]:
len(test_data)

100

In [11]:
test_data.head()

Unnamed: 0,question,answer
15,What is the unit of measure for electric poten...,Correct Answer- Volt.\n\nRationale- The volt i...
9,What property of a wave determines its loudness?,Correct Answer- Amplitude.\n\nRationale- Ampli...
100,What is a planet?,"Correct Answer- A large, spherical body that o..."
132,Which planet mentioned in the context is an ex...,Correct Answer- Jupiter.\n\nRationale- Jupiter...
68,What type of rock is formed from the compactio...,Correct Answer- Sedimentary rock.\n\nRationale...


## Generate answers from the model

In [12]:
def generate_answer(question):
    prompt = question + " ->"
    response = openai.Completion.create(
        model='ft:davinci-002:personal::9KLi6nKN',
        prompt=prompt,
        max_tokens=100,
        top_p=0.9,
        frequency_penalty=2,
        presence_penalty=1,
        stop=["\n"]
    )
    return response.choices[0].text

total_questions = len(test_data)
predictions = []
count = 0
for index, row in test_data.iterrows():
    print(f"Processed question {count + 1} out of {total_questions}")
    question = row["question"]
    predicted_answer = generate_answer(question)
    predictions.append(predicted_answer)
    count = count + 1

Processed question 1 out of 100
Processed question 2 out of 100
Processed question 3 out of 100
Processed question 4 out of 100
Processed question 5 out of 100
Processed question 6 out of 100
Processed question 7 out of 100
Processed question 8 out of 100
Processed question 9 out of 100
Processed question 10 out of 100
Processed question 11 out of 100
Processed question 12 out of 100
Processed question 13 out of 100
Processed question 14 out of 100
Processed question 15 out of 100
Processed question 16 out of 100
Processed question 17 out of 100
Processed question 18 out of 100
Processed question 19 out of 100
Processed question 20 out of 100
Processed question 21 out of 100
Processed question 22 out of 100
Processed question 23 out of 100
Processed question 24 out of 100
Processed question 25 out of 100
Processed question 26 out of 100
Processed question 27 out of 100
Processed question 28 out of 100
Processed question 29 out of 100
Processed question 30 out of 100
Processed question 

## Get the reference answers

In [13]:
# Get answers from the dataset
references = []
for index, row in test_data.iterrows():
    references.append(row["answer"])

## Compute some metrics

In [14]:
# Load a pre-trained model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Compute embeddings
reference_embeddings = model.encode(references)
prediction_embeddings = model.encode(predictions)



In [15]:
# Log the shape of the embeddings for debugging
print(f"Reference embeddings shape: {np.array(reference_embeddings).shape}")
print(f"Prediction embeddings shape: {np.array(prediction_embeddings).shape}")

Reference embeddings shape: (100, 384)
Prediction embeddings shape: (100, 384)


### Cosine Similarity

In [16]:
# Compute cosine similarity for each pair
cosine_similarities = []
for ref_emb, pred_emb in zip(reference_embeddings, prediction_embeddings):
    cos_sim = cosine_similarity([ref_emb], [pred_emb])[0][0]
    cosine_similarities.append(cos_sim)

In [17]:
# Calculate average cosine similarity
average_cosine_similarity = np.mean(cosine_similarities)
print(f'Average Cosine Similarity: {average_cosine_similarity:.2f}')

Average Cosine Similarity: 0.54


### BLEU Score

In [18]:
# Function to calculate BLEU score
def calculate_bleu(reference, prediction):
    reference_tokens = [nltk.word_tokenize(reference)]
    prediction_tokens = nltk.word_tokenize(prediction)
    # Using smoothing function to avoid zero scores for short sequences
    smoothing_function = SmoothingFunction().method1
    bleu_score = sentence_bleu(reference_tokens, prediction_tokens, smoothing_function=smoothing_function)
    return bleu_score

In [19]:
# Ensure NLTK resources are downloaded
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [20]:
# Calculate BLEU scores for all predictions
bleu_scores = [calculate_bleu(ref, pred) for ref, pred in zip(references, predictions)]

In [21]:
# Calculate average BLEU score
average_bleu_score = sum(bleu_scores) / len(bleu_scores)
print(f'Average BLEU Score: {average_bleu_score:.2f}')

Average BLEU Score: 0.03


# Manual check with a random question from the test dataset

In [24]:
# Assuming you have a pandas DataFrame named test_data containing your test dataset
random_index = random.randint(0, len(test_data) - 1)
random_row = test_data.iloc[random_index]
prompt = random_row["question"] + " ->"
actual_answer = random_row["answer"]

response = openai.Completion.create(
    model='ft:davinci-002:personal::9KLi6nKN',
    prompt=prompt,
    max_tokens=100,
    top_p=0.9,
    frequency_penalty=2,
    presence_penalty=1,
    stop=["\n"]
)

print('*************************************')
print('Question: ', prompt)
print('Actual Answer:', actual_answer)
print('Bot Answer: ', response.choices[0].text)


*************************************
Question:  What is the term for the transfer of heat through electromagnetic waves? ->
Actual Answer: Correct Answer- Radiation.

Rationale- Radiation involves the emission of energy in the form of electromagnetic waves.

Distractor 1- Conduction.
Distractor 2- Convection.
Distractor 3- Reflection.
Bot Answer:   Conduction is a term for heat transfer through molecular contact. However, convection and radiation are other methods of heat transfer that do not involve direct physical contact..
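
The reference printed above bundles the correct answer with a rationale and three distractors, while the model output is cut at the first newline. Comparing against the full reference string may therefore penalize lexical overlap. The scores reported in this notebook were computed on the full reference strings. As a hedged sketch, the correct-answer portion could be isolated before scoring; the helper `extract_correct_answer` is hypothetical and assumes the `Correct Answer- ... Rationale- ...` layout seen in this dataset.

```python
import re

def extract_correct_answer(reference):
    """Return the text after 'Correct Answer-' up to the first blank line
    or 'Rationale-' marker; fall back to the stripped input if no marker."""
    match = re.search(
        r"Correct Answer-\s*(.*?)(?:\n\s*\n|\s*Rationale-|$)",
        reference,
        flags=re.DOTALL,
    )
    return match.group(1).strip() if match else reference.strip()

sample = (
    "Correct Answer- Radiation.\n\n"
    "Rationale- Radiation involves the emission of energy in the form of electromagnetic waves.\n\n"
    "Distractor 1- Conduction.\nDistractor 2- Convection.\nDistractor 3- Reflection."
)
answer_only = extract_correct_answer(sample)
```

Applied to the list of references, this would be `[extract_correct_answer(r) for r in references]`; whether it changes the metrics would need to be measured by re-running the evaluation.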


---

# Testing the GPT 3.5 Turbo Fine-Tuned Model on the Original Dataset

The goal of this notebook is to test the fine-tuned GPT model (gpt-3.5-turbo-0125) on the Medical Flashcards dataset.

## Install requirements

In [1]:
!pip uninstall -y openai
!pip install openai==0.28
!pip install datasets
!pip install scikit-learn sentence-transformers
!pip install nltk

Found existing installation: openai 0.28.0
Uninstalling openai-0.28.0:
  Successfully uninstalled openai-0.28.0
Collecting openai==0.28
  Using cached openai-0.28.0-py3-none-any.whl (76 kB)
Installing collected packages: openai
Successfully installed openai-0.28.0


In [2]:
import json
import openai
import pandas as pd
import numpy as np
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
import random
from google.colab import drive
import os

## Connect to GDrive

In [3]:
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
path = '_NLP/Project'

os.chdir(f'/content/drive/MyDrive/{path}')
os.getcwd()

'/content/drive/MyDrive/_NLP/Project'

## Add API key

In [5]:
api_key ="sk-proj-###" #ADD YOUR API KEY HERE
openai.api_key = api_key

## Split dataset

In [6]:
dataset = load_dataset('arrow', data_files='data-00000-of-00001.arrow')

In [7]:
df = dataset['train'].to_pandas()

In [8]:
df.head()

Unnamed: 0,input,output,instruction
0,What is the relationship between very low Mg2+...,Very low Mg2+ levels correspond to low PTH lev...,Answer this question truthfully
1,What leads to genitourinary syndrome of menopa...,Low estradiol production leads to genitourinar...,Answer this question truthfully
2,What does low REM sleep latency and experienci...,Low REM sleep latency and experiencing halluci...,Answer this question truthfully
3,What are some possible causes of low PTH and h...,"PTH-independent hypercalcemia, which can be ca...",Answer this question truthfully
4,How does the level of anti-müllerian hormone r...,The level of anti-müllerian hormone is directl...,Answer this question truthfully


In [9]:
df = dataset['train'].to_pandas()
df = df.iloc[:, :-1]

In [10]:
train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)

In [11]:
test_data.head()

Unnamed: 0,input,output
27911,What are some physical signs that may indicate...,What are some physical signs that may indicate...
7251,What is the name of the amino acid that serves...,Arginine is the amino acid that acts as the pr...
32050,Do high or low potency typical antipsychotics ...,High potency typical antipsychotics are more l...
7969,Which type of heart valves are commonly affect...,Viridans streptococci infection is typically s...
6904,"Among all bugs, which one is the most frequent...",Staphylococcus aureus is the bug that is the m...


## Generate answers from the model

In [32]:
DEFAULT_SYSTEM_PROMPT = 'Answer this question truthfully.'

def generate_answer(question):
  response = openai.ChatCompletion.create(
              model="ft:gpt-3.5-turbo-0125:personal::9So7gDaT",
              messages=[{"role": "system", "content": DEFAULT_SYSTEM_PROMPT},
                        {"role": "user", "content": question}],
              max_tokens=100,
              top_p=0.9,
              frequency_penalty=2,
              presence_penalty=1,
              stop=["\n"]
              )
  return response["choices"][0]["message"]["content"]
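
Unlike the completion endpoint used for davinci-002, which takes a single prompt string ending in the ` ->` separator, the chat endpoint takes a list of role-tagged messages. A small sketch of how that list is assembled; the helper `build_messages` is an illustrative name and simply mirrors the structure passed to `openai.ChatCompletion.create` above.

```python
def build_messages(question, system_prompt="Answer this question truthfully."):
    """Return the chat message list: one system instruction, then the user question."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]

messages = build_messages("What is a force?")
```

Keeping the system prompt identical to the one used during fine-tuning is presumably the intent of `DEFAULT_SYSTEM_PROMPT`; the default value here is copied from that constant.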

In [34]:
total_questions = len(test_data)
predictions = []
count = 0
for index, row in test_data.iterrows():
    if count >= 100:
        break
    print(f"Processed question {count + 1} out of {total_questions}")
    question = row["input"]
    predicted_answer = generate_answer(question)
    predictions.append(predicted_answer)
    count = count + 1

Processed question 1 out of 6791
Processed question 2 out of 6791
Processed question 3 out of 6791
Processed question 4 out of 6791
Processed question 5 out of 6791
Processed question 6 out of 6791
Processed question 7 out of 6791
Processed question 8 out of 6791
Processed question 9 out of 6791
Processed question 10 out of 6791
Processed question 11 out of 6791
Processed question 12 out of 6791
Processed question 13 out of 6791
Processed question 14 out of 6791
Processed question 15 out of 6791
Processed question 16 out of 6791
Processed question 17 out of 6791
Processed question 18 out of 6791
Processed question 19 out of 6791
Processed question 20 out of 6791
Processed question 21 out of 6791
Processed question 22 out of 6791
Processed question 23 out of 6791
Processed question 24 out of 6791
Processed question 25 out of 6791
Processed question 26 out of 6791
Processed question 27 out of 6791
Processed question 28 out of 6791
Processed question 29 out of 6791
Processed question 30 o

## Compute some metrics

In [37]:
# Get answers from the dataset
count = 0
references = []
for index, row in test_data.iterrows():
    if count >= 100:
        break
    count = count + 1
    references.append(row["output"])

In [38]:
# Load a pre-trained model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Compute embeddings
reference_embeddings = model.encode(references)
prediction_embeddings = model.encode(predictions)



In [39]:
# Log the shape of the embeddings for debugging
print(f"Reference embeddings shape: {np.array(reference_embeddings).shape}")
print(f"Prediction embeddings shape: {np.array(prediction_embeddings).shape}")

Reference embeddings shape: (100, 384)
Prediction embeddings shape: (100, 384)


### Cosine Similarity

In [40]:
# Compute cosine similarity for each pair
cosine_similarities = []
for ref_emb, pred_emb in zip(reference_embeddings, prediction_embeddings):
    cos_sim = cosine_similarity([ref_emb], [pred_emb])[0][0]
    cosine_similarities.append(cos_sim)

In [41]:
# Calculate average cosine similarity
average_cosine_similarity = np.mean(cosine_similarities)
print(f'Average Cosine Similarity: {average_cosine_similarity:.2f}')

Average Cosine Similarity: 0.81


### BLEU Score

In [42]:
# Function to calculate BLEU score
def calculate_bleu(reference, prediction):
    reference_tokens = [nltk.word_tokenize(reference)]
    prediction_tokens = nltk.word_tokenize(prediction)
    # Using smoothing function to avoid zero scores for short sequences
    smoothing_function = SmoothingFunction().method1
    bleu_score = sentence_bleu(reference_tokens, prediction_tokens, smoothing_function=smoothing_function)
    return bleu_score

In [43]:
# Ensure NLTK resources are downloaded
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [44]:
# Calculate BLEU scores for all predictions
bleu_scores = [calculate_bleu(ref, pred) for ref, pred in zip(references, predictions)]

In [45]:
# Calculate average BLEU score
average_bleu_score = sum(bleu_scores) / len(bleu_scores)
print(f'Average BLEU Score: {average_bleu_score:.2f}')

Average BLEU Score: 0.20


### Exact Match

In [46]:
def exact_match(prediction, ground_truth):
    return prediction == ground_truth

# Assuming predictions and references are lists of answers
em_score = sum(exact_match(pred, ref) for pred, ref in zip(predictions, references)) / len(predictions)
print(f'Exact Match Score: {em_score:.2f}')

Exact Match Score: 0.02
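
The strict comparison above treats differences in case, punctuation, or surrounding whitespace as mismatches. A relaxed variant is sketched below under the assumption that such differences should be ignored. The helpers `normalize` and `normalized_exact_match` are illustrative, and the score reported above was not computed with them.

```python
import re
import string

def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def normalized_exact_match(prediction, ground_truth):
    """True when the two strings are equal after normalization."""
    return normalize(prediction) == normalize(ground_truth)

strict = " Radiation." == "radiation"                           # strict string equality
relaxed = normalized_exact_match(" Radiation.", "radiation")    # equality after normalization
```

Even with normalization, exact match remains a harsh metric for long free-text answers, which is why it is reported here alongside cosine similarity and BLEU.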


# Manual check with a random question from the test dataset

In [48]:
# Assuming you have a pandas DataFrame named test_data containing your test dataset
random_index = random.randint(0, len(test_data) - 1)
random_row = test_data.iloc[random_index]
prompt = random_row["input"]
actual_answer = random_row["output"]

bot_answer = generate_answer(prompt)

print('*************************************')
print('Question: ', prompt)
print('Actual Answer:', actual_answer)
print('Bot Answer: ', bot_answer)


*************************************
Question:  Do patients with secondary adrenocortical insufficiency exhibit symptoms such as hyperkalemia, metabolic acidosis, or hypotension?
Actual Answer: No, patients with secondary adrenocortical insufficiency do not exhibit those symptoms because their aldosterone levels are normal.
Bot Answer:  No, patients with secondary adrenocortical insufficiency do not typically exhibit symptoms such as hyperkalemia, metabolic acidosis or hypotension. Adrenal insufficiency is a condition in which the adrenal glands do not produce enough of certain hormones, including cortisol and aldosterone. In primary adrenal insufficiency (Addison's disease), the adrenal glands themselves are damaged or destroyed. In contrast, secondary adrenal insufficiency occurs when there is a problem with the pituitary


---

# Testing the GPT 3.5 Turbo Fine-Tuned Model on a Different Dataset

The goal of this notebook is to test the fine-tuned GPT model (gpt-3.5-turbo-0125) on a different dataset. Instead of the Medical Flashcards, here we test the model on this dataset: [GokulWork/QuestionAnswer_MCQ](https://huggingface.co/datasets/GokulWork/QuestionAnswer_MCQ)

## Install requirements

In [1]:
!pip uninstall -y openai
!pip install openai==0.28
!pip install datasets
!pip install scikit-learn sentence-transformers
!pip install nltk

Collecting openai==0.28
  Downloading openai-0.28.0-py3-none-any.whl (76 kB)
Installing collected packages: openai
Successfully installed openai-0.28.0
Collecting datasets
  Downloading datasets-2.19.1-py3-none-any.whl (542 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
Collecting multiprocess (from datasets)
  Download

In [2]:
import json
import openai
import pandas as pd
import numpy as np
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
import random
from google.colab import drive
import os

## Connect to GDrive

In [3]:
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
path = '_NLP/Project'

os.chdir(f'/content/drive/MyDrive/{path}')
os.getcwd()

'/content/drive/MyDrive/_NLP/Project'

## Add API key

In [5]:
api_key = "sk-proj-##"  # ADD YOUR API KEY HERE
openai.api_key = api_key

## Load and Split dataset

In [6]:
dataset = load_dataset("GokulWork/QuestionAnswer_MCQ")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/81.0 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/154k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/35.3k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/205 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/50 [00:00<?, ? examples/s]

In [7]:
df = dataset['train'].to_pandas()
df.head()

Unnamed: 0.1,Unnamed: 0,question,answer,text
0,0,What is a force?,Correct Answer- A force is a push or pull that...,"###Human:\ngenerate a correct answer, a ration..."
1,1,What is Newton's First Law of Motion?,"Correct Answer- Newton's First Law of Motion, ...","###Human:\ngenerate a correct answer, a ration..."
2,2,What is the difference between speed and veloc...,Correct Answer- Speed is a scalar quantity tha...,"###Human:\ngenerate a correct answer, a ration..."
3,3,Explain that when the kinetic energy of an obj...,Correct Answer- The change in kinetic energy c...,"###Human:\ngenerate a correct answer, a ration..."
4,4,What is the SI unit of electric current?,Correct Answer- Ampere\n\nRationale- Ampere is...,"###Human:\ngenerate a correct answer, a ration..."


In [8]:
df = df.drop(['Unnamed: 0', 'text'], axis=1)
df.head()

Unnamed: 0,question,answer
0,What is a force?,Correct Answer- A force is a push or pull that...
1,What is Newton's First Law of Motion?,"Correct Answer- Newton's First Law of Motion, ..."
2,What is the difference between speed and veloc...,Correct Answer- Speed is a scalar quantity tha...
3,Explain that when the kinetic energy of an obj...,Correct Answer- The change in kinetic energy c...
4,What is the SI unit of electric current?,Correct Answer- Ampere\n\nRationale- Ampere is...


In [9]:
train_data, test_data = train_test_split(df, test_size=100, random_state=42)

In [10]:
test_data.head()

Unnamed: 0,question,answer
15,What is the unit of measure for electric poten...,Correct Answer- Volt.\n\nRationale- The volt i...
9,What property of a wave determines its loudness?,Correct Answer- Amplitude.\n\nRationale- Ampli...
100,What is a planet?,"Correct Answer- A large, spherical body that o..."
132,Which planet mentioned in the context is an ex...,Correct Answer- Jupiter.\n\nRationale- Jupiter...
68,What type of rock is formed from the compactio...,Correct Answer- Sedimentary rock.\n\nRationale...


## Generate answers from the model

In [11]:
DEFAULT_SYSTEM_PROMPT = 'Answer this question truthfully.'

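def generate_answer(question):
    # Query the fine-tuned chat model through the openai==0.28 ChatCompletion API
    response = openai.ChatCompletion.create(
        model="ft:gpt-3.5-turbo-0125:personal::9So7gDaT",
        messages=[{"role": "system", "content": DEFAULT_SYSTEM_PROMPT},
                  {"role": "user", "content": question}],
        max_tokens=100,
        top_p=0.9,
        frequency_penalty=2,
        presence_penalty=1,
        stop=["\n"],  # stop at the first newline, so each answer is a single line
    )
    # Return only the text of the first (and only) completion
    return response["choices"][0]["message"]["content"]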

In [14]:
total_questions = len(test_data)
predictions = []
for count, question in enumerate(test_data["question"], start=1):
    print(f"Processed question {count} out of {total_questions}")
    predictions.append(generate_answer(question))

Processed question 1 out of 100
Processed question 2 out of 100
Processed question 3 out of 100
Processed question 4 out of 100
Processed question 5 out of 100
Processed question 6 out of 100
Processed question 7 out of 100
Processed question 8 out of 100
Processed question 9 out of 100
Processed question 10 out of 100
Processed question 11 out of 100
Processed question 12 out of 100
Processed question 13 out of 100
Processed question 14 out of 100
Processed question 15 out of 100
Processed question 16 out of 100
Processed question 17 out of 100
Processed question 18 out of 100
Processed question 19 out of 100
Processed question 20 out of 100
Processed question 21 out of 100
Processed question 22 out of 100
Processed question 23 out of 100
Processed question 24 out of 100
Processed question 25 out of 100
Processed question 26 out of 100
Processed question 27 out of 100
Processed question 28 out of 100
Processed question 29 out of 100
Processed question 30 out of 100
Processed question 
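
The loop above issues one API request per question, so a single transient failure (a rate limit or a timeout, for instance) would abort the whole run. Below is a minimal sketch of a retry wrapper with exponential backoff. `with_retries` and its parameters are hypothetical helpers, not part of this notebook or of the `openai` library, and for simplicity the wrapper retries on any exception.

```python
import time


def with_retries(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Wrap fn so that failed calls are retried with exponential backoff.

    Hypothetical helper: it retries on any Exception, which is an assumption;
    in practice you would likely narrow this to transient API errors.
    """
    def wrapped(*args, **kwargs):
        for attempt in range(max_attempts):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if attempt == max_attempts - 1:
                    raise  # out of attempts: surface the last error
                # Wait base_delay, then 2x, 4x, ... before the next attempt
                sleep(base_delay * 2 ** attempt)
    return wrapped
```

With this in place, one could define `safe_generate_answer = with_retries(generate_answer)` and call `safe_generate_answer(question)` inside the loop. The `sleep` parameter exists only so the delay can be replaced in tests.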

## Compute some metrics

In [16]:
# Get the reference answers from the dataset
references = test_data["answer"].tolist()

In [17]:
# Load a pre-trained model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Compute embeddings
reference_embeddings = model.encode(references)
prediction_embeddings = model.encode(predictions)

modules.json:   0%|          | 0.00/229 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/122 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/3.73k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]



config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/314 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [18]:
# Log the shape of the embeddings for debugging
print(f"Reference embeddings shape: {np.array(reference_embeddings).shape}")
print(f"Prediction embeddings shape: {np.array(prediction_embeddings).shape}")

Reference embeddings shape: (100, 384)
Prediction embeddings shape: (100, 384)


### Cosine Similarity

In [19]:
# Compute cosine similarity for each pair
cosine_similarities = []
for ref_emb, pred_emb in zip(reference_embeddings, prediction_embeddings):
    cos_sim = cosine_similarity([ref_emb], [pred_emb])[0][0]
    cosine_similarities.append(cos_sim)

In [20]:
# Calculate average cosine similarity
average_cosine_similarity = np.mean(cosine_similarities)
print(f'Average Cosine Similarity: {average_cosine_similarity:.2f}')

Average Cosine Similarity: 0.56
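
For reference, the cosine similarity of two vectors is their dot product divided by the product of their Euclidean norms, so it depends only on the angle between the embeddings and not on their lengths. The sketch below spells this out in plain Python; `cosine` is a hypothetical helper, and the three-dimensional vectors are toy values, not real sentence embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # same direction
print(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal vectors
print(cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # parallel, different lengths
```

Vectors pointing the same way score close to 1 regardless of magnitude, and orthogonal vectors score 0.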


### BLEU Score

In [21]:
# Function to calculate BLEU score
def calculate_bleu(reference, prediction):
    reference_tokens = [nltk.word_tokenize(reference)]
    prediction_tokens = nltk.word_tokenize(prediction)
    # Using smoothing function to avoid zero scores for short sequences
    smoothing_function = SmoothingFunction().method1
    bleu_score = sentence_bleu(reference_tokens, prediction_tokens, smoothing_function=smoothing_function)
    return bleu_score

In [22]:
# Ensure NLTK resources are downloaded
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [23]:
# Calculate BLEU scores for all predictions
bleu_scores = [calculate_bleu(ref, pred) for ref, pred in zip(references, predictions)]

In [24]:
# Calculate average BLEU score
average_bleu_score = sum(bleu_scores) / len(bleu_scores)
print(f'Average BLEU Score: {average_bleu_score:.2f}')

Average BLEU Score: 0.03
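
A score this low is partly expected in this setup. The references displayed in this notebook bundle the answer with a rationale and distractors, while `generate_answer` stops at the first newline and returns one short line. BLEU multiplies its n-gram precision by a brevity penalty that shrinks exponentially when the candidate is shorter than the reference. The sketch below computes that penalty alone; `brevity_penalty` is a hypothetical helper, and the token counts in the usage note are illustrative, not measured on this dataset.

```python
import math


def brevity_penalty(candidate_len, reference_len):
    """Standard BLEU brevity penalty for one candidate/reference pair."""
    if candidate_len == 0:
        return 0.0  # an empty candidate gets no credit
    if candidate_len > reference_len:
        return 1.0  # no penalty when the candidate is longer
    # exp(1 - r/c): equals 1 when lengths match, decays as the candidate shrinks
    return math.exp(1 - reference_len / candidate_len)
```

For example, `brevity_penalty(8, 40)` is `exp(-4)`, roughly 0.018, so an 8-token answer scored against a 40-token reference is capped far below 1 even if every n-gram matches.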


### Exact Match

In [25]:
def exact_match(prediction, ground_truth):
    return prediction == ground_truth

# Assuming predictions and references are lists of answers
em_score = sum(exact_match(pred, ref) for pred, ref in zip(predictions, references)) / len(predictions)
print(f'Exact Match Score: {em_score:.2f}')

Exact Match Score: 0.00
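
An exact match of 0.00 is hard to avoid with the strict comparison above: each reference displayed in this notebook contains `Correct Answer- ...`, `Rationale- ...`, and `Distractor` fields, while the model returns a single sentence. Below is a sketch of a more lenient check that extracts only the `Correct Answer-` field and tests whether its normalized form occurs in the normalized prediction. `extract_correct_answer`, `normalize`, and `lenient_match` are hypothetical helpers, and the field layout is assumed from the rows displayed in this notebook.

```python
import re
import string


def extract_correct_answer(reference):
    """Return the text after 'Correct Answer-' up to the first blank line."""
    match = re.search(r"Correct Answer-\s*(.*?)(?:\n\s*\n|$)", reference, flags=re.DOTALL)
    # Fall back to the whole reference if the assumed marker is missing
    return match.group(1).strip() if match else reference.strip()


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def lenient_match(prediction, reference):
    """True if the normalized correct answer occurs in the normalized prediction."""
    answer = normalize(extract_correct_answer(reference))
    return bool(answer) and answer in normalize(prediction)
```

A lenient score could then be computed as `sum(lenient_match(p, r) for p, r in zip(predictions, references)) / len(predictions)`. Substring matching is deliberately loose (a short answer such as "volt" would also match "voltage"), so this should be read as a rough upper bound rather than a replacement for manual checks.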


# Manual check with a random question from the test dataset

In [31]:
# Assuming you have a pandas DataFrame named test_data containing your test dataset
random_index = random.randint(0, len(test_data) - 1)
random_row = test_data.iloc[random_index]
prompt = random_row["question"]
actual_answer = random_row["answer"]

bot_answer = generate_answer(prompt)

print('*************************************')
print('Question: ', prompt)
print('Actual Answer:', actual_answer)
print('Bot Answer: ', bot_answer)


*************************************
Question:  What is the term for the process by which a solid changes directly into a gas without becoming a liquid?
Actual Answer: Correct Answer- Sublimation.

Rationale- Sublimation is the phase transition in which a solid changes directly into a gas without passing through the liquid phase.

Distractor 1- Evaporation.
Distractor 2- Melting.
Distractor 3- Condensation.
Bot Answer:  The term for this process is sublimation.


---