### Imports

In [113]:
import pickle
import numpy
import pandas as pd

import json
import os
import openai

### Results Description:
1. How many questions are generated per video? 5
2. What format should the generated results take so that they are easy to evaluate during testing?
    Currently, the results are stored in a JSON object:
    {'video_id': [q1, q2, q3]}
3. Prompt length is 100 to 900. 
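
As a rough illustration of the storage format in item 2, the sketch below builds a results dictionary keyed by video ID and round-trips it through JSON. The video ID and the question strings are placeholders, not real outputs.

```python
import json

# Placeholder data: one video ID mapped to its list of generated questions.
results = {
    "video_id": ["q1", "q2", "q3", "q4", "q5"],
}

# Serialize to a JSON string and load it back, as a file round trip would.
serialized = json.dumps(results)
restored = json.loads(serialized)

# Count the questions stored for each video (five are expected, per item 1).
questions_per_video = {vid: len(qs) for vid, qs in restored.items()}
print(questions_per_video)
```

Keeping one list per video ID makes it easy to check during evaluation that every video received the expected number of questions.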


### Challenges for Completion API:
1. The same prompt does not give similar results every time.
2. The combined length of the prompt and the generated output must be less than the model's maximum context length.
3. The data has to be preprocessed so that the prompt length is between 100 and 700, which leaves enough relevant content to generate cohesive results. Preprocessing involves removing stopwords and dropping samples that are too long or too short.
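
The preprocessing and length constraints above could be sketched roughly as follows. This is a minimal illustration, not the pipeline that produced the pickle file: the stopword list is a tiny hypothetical one, length is assumed to be measured in whitespace-separated words after stopword removal, and `completion_budget` takes the context limit as a parameter rather than assuming any particular model's value.

```python
# Hypothetical, deliberately tiny stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}


def remove_stopwords(text):
    """Drop stopwords from a whitespace-tokenized text."""
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)


def within_length(text, min_words=100, max_words=700):
    """Check that the word count falls inside the allowed range (assumed unit: words)."""
    n = len(text.split())
    return min_words <= n <= max_words


def preprocess(subtitles, min_words=100, max_words=700):
    """Clean every subtitle, then keep only those inside the length range."""
    cleaned = {vid: remove_stopwords(t) for vid, t in subtitles.items()}
    return {vid: t for vid, t in cleaned.items()
            if within_length(t, min_words, max_words)}


def completion_budget(prompt_tokens, context_limit):
    """Tokens left for the generated output once the prompt is counted."""
    return max(context_limit - prompt_tokens, 0)


# Usage with toy thresholds so the example stays short.
sample = {
    "short": "the cat",
    "ok": "nate trains workers in fifty countries",
    "long": " ".join(["word"] * 50),
}
print(preprocess(sample, min_words=3, max_words=10))
```

Filtering after stopword removal is one possible ordering; filtering on the raw text instead would keep a different set of videos.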

In [141]:
with open('preprocessed_YoutubeData_668.pkl', 'rb') as file:
    # Load the contents of the file using pickle.load()
    data = pickle.load(file)

In [142]:
print(data.shape)
print(data)
print(numpy.max(data['Subtitles'].apply(len)))


(192, 2)
        Video_ID                                          Subtitles
0    9MwScz5OFVY  the mg brand has been around since 1924 and si...
1    826Nd2DQpEw  nate is a world leader in training and capacit...
2    qT1QpkBG-bU  we get our meal prep so I wake up in the morni...
3    MJZTcKXcLJc  what is the best vehicle access control system...
4    9egAVV5J_WM  Amero engineering is an additive manufacturing...
..           ...                                                ...
187  dQ1xxoP7NJk  When did I start to forget All of the great th...
188  yZIS65TJwxo  In this video I ll show you why you should spr...
189  6qkNFQIi6jw  t 15 Championship battle e Rojo D15 Championsh...
190  HD0atZKG2BY  I stopped talking you do not come up to me Dr ...
191  I5G2BPydg7Y  foreign foreign foreign foreign foreign foreig...

[192 rows x 2 columns]
3824


In [170]:
data['Subtitles'][1]

'nate is a world leader in training and capacity building over the past 40 years our polytechnic institution has worked with organizations in over 50 countries we train your workers to compete in a constantly evolving business and technology climate we can meet your training needs whether you operate in the private or public sector you can build a program using nate s established courses and certificates or create a totally customized program from the ground up we re flexible because we know every organization has unique needs nate works closely with its industry partners we will be with you every step of the way providing support before during and after your training and delivery the success of your workforce development program is our priority we know a skilled workforce leads to the improvement of your organization and the lives of people in your community make an inquiry today to learn more'

In [175]:
# openai.organization = "YOUR_ORG_ID"
# Read the API key from the environment instead of hard-coding a secret in the notebook
openai.api_key = os.getenv("OPENAI_API_KEY")

# Set the GPT-3 model ID
# davinci is the most capable GPT-3 model family
model_id = "text-davinci-002"

### Prompt Engineering

Using the prompt below, we can generate 5 MCQs along with their answers.
Setting top_p and temperature below 1 makes the model's sampling more deterministic, which helps control the randomness of the choices in the MCQs.
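
To make the effect of these two parameters concrete, here is a small sketch of temperature scaling and top-p (nucleus) filtering on a toy distribution. It follows the textbook formulation and is not a description of how the OpenAI API implements sampling internally; the logits are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative mass
    reaches top_p, then renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
print(softmax_with_temperature(toy_logits, 1.0))
print(softmax_with_temperature(toy_logits, 0.6))
print(top_p_filter([0.5, 0.3, 0.15, 0.05], 0.7))
```

At the lower temperature the most likely token takes a larger share of the probability, and the top-p filter discards the low-probability tail entirely, so repeated runs vary less.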

In [176]:
prompt = f'''Generate FIVE different Multiple Choice Questions from the text. Provide the correct answer for each multiple choice question generated. 
    Text: nate is a world leader in training and capacity building over the past 40 years our polytechnic institution has worked with organizations in over 50 countries we train your workers to compete in a constantly evolving business and technology climate we can meet your training needs whether you operate in the private or public sector you can build a program using nate s established courses and certificates or create a totally customized program from the ground up we re flexible because we know every organization has unique needs nate works closely with its industry partners we will be with you every step of the way providing support before during and after your training and delivery the success of your workforce development program is our priority we know a skilled workforce leads to the improvement of your organization and the lives of people in your community make an inquiry today to learn mor
    Answer:
    '''

result = openai.Completion.create(
    engine=model_id,
    prompt=prompt,
    max_tokens=1024,
    n=1,
    top_p = 0.8,
    temperature=0.6,
    frequency_penalty=0.1,
    presence_penalty=0.0,
)

print(result['choices'][0]["text"])


   1. What does nate specialize in?
   A. Training and capacity building
   B. Business and technology
   C. Established courses and certificates
   D. Customized programs

   Answer: A. Training and capacity building

   2. How many countries has nate worked with in the past?
   A. 10
   B. 40
   C. 50
   D. Over 50

   Answer: D. Over 50

   3. Which of the following is not a service that nate offers?
   A. Support before training
   B. Delivery of training
   C. Support after training
   D. Consultation

   Answer: D. Consultation

   4. What is the main goal of nate?
   A. To provide skilled workers for organizations
   B. To improve the lives of people in the community
   C. To work with industry partners
   D. To be flexible

   Answer: A. To provide skilled workers for organizations

   5. What does a skilled workforce lead to?
   A. The success of an organization
   B. The improvement of an organization
   C. The improvement of people's lives
   D. All of the above

   Answer:

### Observations

#### Unprocessed Data
1. Remove YouTube boilerplate, such as requests to like, comment, and subscribe. These phrases tend to show up in the generated questions and answers, so stripping them is an additional preprocessing step specific to our problem statement.
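
One possible way to strip such phrases is a small set of regular expressions applied before the text is sent to the model. The patterns below are hypothetical examples chosen for illustration (the subtitles have no punctuation, so apostrophes appear as spaces, e.g. `don t`); a real list would need to be tuned against the dataset.

```python
import re

# Hypothetical patterns for common channel boilerplate.
BOILERPLATE_PATTERNS = [
    r"\b(?:don t|do not) forget to (?:like|comment|subscribe)"
    r"(?: (?:and )?(?:like|comment|subscribe))*\b",
    r"\bsubscribe to (?:my|our|this|the) channel\b",
    r"\bhit the (?:like|bell|subscribe) (?:button|icon)\b",
]


def strip_channel_boilerplate(text):
    """Remove like/comment/subscribe phrases and collapse leftover whitespace."""
    cleaned = text
    for pattern in BOILERPLATE_PATTERNS:
        cleaned = re.sub(pattern, " ", cleaned, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", cleaned).strip()


example = "welcome back don t forget to like and subscribe today we build a safe"
print(strip_channel_boilerplate(example))
```

Phrase-level patterns are used here because the subtitles cannot be split into sentences reliably; the trade-off is that unusual wordings will slip through.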


### Results generation using Text Completion in GPT3

In [118]:
# Function that generates the questions for a particular video.
def send_to_gpt(text):
    prompt = f'''Generate FIVE different Multiple Choice Questions from the text. 
    Provide the correct answer for each multiple choice question generated. 
    Text: {text} 
    Answer:'''

    result = openai.Completion.create(
        engine=model_id,
        prompt=prompt,
        max_tokens=1024,
        n=1,
        top_p = 0.8,
        temperature=0.6,
        frequency_penalty=0.1,
        presence_penalty=0.0,
    )

    return result['choices'][0]["text"]


In [119]:
# Generate results in the directory
dir_base = "outputs"

# Define the directory path to check/create
# dir_base = "/path/to/directory"

# Check if the directory exists
if not os.path.exists(dir_base):
    # Create the directory if it does not exist
    os.makedirs(dir_base)
    print("Directory created!")
else:
    print("Directory already exists.")

output_dict = {}


Directory already exists.


In [None]:
# Get questions for all the captions for the YouTube dataset
def perform_analysis(data):
    # Each row holds a video ID and its subtitles
    for _, row in data.iterrows():
        video_id = row['Video_ID']
        subtitle = row['Subtitles']
        print(len(subtitle.split(' ')))
        output = send_to_gpt(subtitle)

        # Questions are separated by blank lines; drop empty chunks
        text_list = [chunk for chunk in output.split('\n\n') if len(chunk) > 0]

        output_dict[video_id] = text_list


perform_analysis(data)


In [40]:
# save the file 
file_path = dir_base + "/run1"

with open(file_path, "w") as f:
    json.dump(output_dict, f)


In [58]:
# read the file
with open(file_path, "r") as f:
    return_dict = json.load(f)
    print(return_dict['826Nd2DQpEw'][4])
#     print(return_dict.keys(), return_dict[list(return_dict.keys())[0]])


5. What is the institution's priority?
a. The success of the workforce development program
b. The improvement of the organization
c. The lives of the people in the community
d. All of the above


### Verify the answers using GPT3.

In [None]:
# Verify the answers using GPT3. 

response = openai.Completion.create(
  model="text-davinci-003",
  prompt='''I am a highly intelligent question answering bot. If you ask me a question that is rooted in truth, I will give you the answer. If you ask me a question that is nonsense, trickery, or has no clear answer, I will respond with \"Unknown\".\n\nQ: What was the name of the British mortar employed during the Crimean War?
A. Little David
B. Big Bertha
C. Mallet's Mortar
D. Paris Gun''',
  temperature=0,
  max_tokens=100,
  top_p=1,
  frequency_penalty=0.0,
  presence_penalty=0.0,
  
)

In [None]:
response
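
To compare a generated answer with the verifier's reply, one option is to extract the answer letter from both texts and check that they match. The sketch below assumes both texts contain a marker of the form `Answer: X` or `Answer = X` (both forms appear in the generated outputs in this notebook); the verifier's reply would need to follow, or be converted to, that form. The helper names are hypothetical.

```python
import re

# Matches "Answer: D", "Answer = B", or lowercase letters such as "Answer: a".
ANSWER_PATTERN = re.compile(r"Answer\s*[:=]\s*([A-Da-d])\b")


def extract_answer_letter(text):
    """Return the answer letter in upper case, or None if no answer is present."""
    match = ANSWER_PATTERN.search(text)
    return match.group(1).upper() if match else None


def answers_agree(generated_block, verifier_reply):
    """True only when both texts contain an answer letter and the letters match."""
    generated = extract_answer_letter(generated_block)
    verified = extract_answer_letter(verifier_reply)
    return generated is not None and generated == verified


print(extract_answer_letter("Answer: D. Over 50"))
print(answers_agree("Answer = B", "Answer: C"))
```

Returning `None` for a missing letter lets truncated generations, where the answer line is cut off, be flagged rather than counted as agreement.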

In [37]:
file_path = dir_base + "/results_2.txt"

with open(file_path, "w") as f:
    f.write(format(row_string))


### Fine-tuned model Results Generation

In [164]:
# Function that generates the questions for YouTube Subtitles using the fine-tuned model.
def send_to_finetuned_gpt(text):
    prompt = f''' {text} \n\n###\n\n'''

    result = openai.Completion.create(
        engine="davinci:ft-personal-2023-04-22-22-35-10",
        prompt=prompt,
        max_tokens=1024,
        n=1,
        top_p = 0.8,
    #     temperature=0.6,
        frequency_penalty=0.1,
        presence_penalty=0.0,
    )
#     print(result)
    return result['choices'][0]["text"]


In [165]:
def parse_fine_tune_result(text):
    # The model output alternates question blocks and option blocks,
    # separated by blank lines.
    initial_split = text.split('\n\n')
    final_output = []
    count = len(initial_split)
    i = 0
    # Require a complete (question, options) pair so that
    # initial_split[i + 1] never goes out of range.
    while i + 1 < count and i < 6:
        option1 = initial_split[i]
        option2 = initial_split[i + 1]

        # A question contains '?' or a fill-in-the-blank '_'
        q_flag = '?' in option1 or '_' in option1
        # The options block must list all four choices
        t_flag = all(label in option2 for label in ('A.', 'B.', 'C.', 'D.'))

        if q_flag and t_flag:
            final_output.append(option1 + "\n\n" + option2)
            i += 2
        else:
            print("Error", option1, option2)
            break

    return final_output

In [None]:
def perform_analysis_finetuned(data):
    # Each row holds a video ID and its subtitles
    for _, row in data.iterrows():
        video_id = row['Video_ID']
        subtitle = row['Subtitles']
        print(len(subtitle.split(' ')))
        output = send_to_finetuned_gpt(subtitle)

        text_list = parse_fine_tune_result(output)
        output_dict[video_id] = text_list


perform_analysis_finetuned(data)


In [167]:
# save the file
file_path = dir_base + "/finetune_run1"

with open(file_path, "w") as f:
    json.dump(output_dict, f)


In [168]:
# read the file
with open(file_path, "r") as f:
    return_dict = json.load(f)


In [108]:
ans = openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=[
        {"role": "system", "content": "You are a question answer generation expert."},
        {"role": "user", "content": "Generate, 5 Multiple Choice Question and Answers for this text : Today I ll make a very small and pretty Mini Safe Let s start with making a cylinder form Check the size Marking Cutting I ll use water Okay Sharpening It has to be sophisticated Umm Good Holding Use a rubber band Marking Next Gluing Okay Le me put it Done Perfect Um great Let s check the overall size I ll draw a rough sketch wait a minute Check Okay Let s make a lid Done It s going to be like this right Gluing Cutting the overlapping parts I m going to cut the circle shape by using a wooden pattern decorative lines Good Mark the area for gluing Sanding Let me put it Sharpening For today s work Red color a symbol of wealth Edge Painting edges court heating Again Check the size This part will be finished in advance Let s match the dark brown on the inside Sharpening It has to be sophisticated Again Gluing Um It s amazing right Cut out the parts you don t need I m gonna drill a hole in this part Again Sharpening Punch the hole Okay This is the outside of the with the dial Pre Cutting Aga"},
    ]
)

In [144]:
print(result['choices'][0]["text"])

 Amero engineering is an additive manufacturing solution provider using   _  . 

A. selective laser melting
B. direct laser deposition
C. cutting edge technology
D. leading research in additive manufacturing
Answer = B

The objective of this project was to create a technology demonstrator that would showcase Amira's expertise in   _  . 

A. selective laser melting
B. direct laser deposition
C. cutting edge technology
D. leading research in additive manufacturing
Answer = D

 According to the text what is the final design of the rocket engine? 

A. A narrow spike rocket engine with a unique multi chamber design.
B. A spike nozzle engine that can be most efficient at any altitude compared to a conventional rocket nozzle design.
C. A set one piece rocket design that could only be additively manufactured and beneficial in terms of performance outcomes of the design.
D. A special rocket nozzle and that means that it can be most efficient at any altitude compared to a conventional rocket noz