### Import Statements

In [1]:
from langchain.prompts import PromptTemplate
import openai
import os
import re
from typing import List, Tuple
import pandas as pd
import csv

In [2]:
# Please note: based on the parameters below, to generate, for instance, 60 Q/A pairs per course, you could set the values as follows:
# NUM_RUNS = 12
# NUM_QA = 5
# ALL_COURSES = True

NUM_RUNS=12             # number of runs (a run is an iteration through all the courses with a call to an LLM for each course to generate questions)
NUM_QA=5                # the number of Q/A pairs the LLM is asked to generate with a single prompt (parameter included in the prompt)
MAT_TOKENS = 10_000     # max tokens parameter for the LLM

ALL_COURSES = True      # If True, iterate through all courses; if False, iterate only through the first MAX_COURSES courses
MAX_COURSES = 2
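
As a quick sanity check on these settings, the expected dataset size can be computed up front. This is only a sketch: `expected_pairs` is a hypothetical helper that is not part of the notebook, and it assumes the LLM returns exactly `NUM_QA` pairs on every call, which is not guaranteed.

```python
from typing import Tuple


def expected_pairs(num_runs: int, num_qa: int, num_courses: int) -> Tuple[int, int]:
    """Return (pairs per course, pairs in total), assuming every call yields num_qa pairs."""
    per_course = num_runs * num_qa
    return per_course, per_course * num_courses


# Values taken from the settings above, with the 16 courses listed in this notebook
per_course, total = expected_pairs(num_runs=12, num_qa=5, num_courses=16)
print(per_course, total)  # 60 960
```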

### Set OpenAI API key and create OpenAI client

In [4]:
# Loading the OpenAI API key from an .env file
from dotenv import load_dotenv
import os
from pathlib import Path

env_path = Path("../../openai/.env")
load_dotenv(dotenv_path=env_path)

openai_key = os.getenv("OPENAI_API_KEY")
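
`os.getenv` returns `None` when the variable is missing, for instance when the relative path to the `.env` file does not resolve. A minimal sketch of failing fast in that case follows; `require_env` is a hypothetical helper, not part of the notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; check the path passed to load_dotenv")
    return value
```

With this helper, the assignment above could read `openai_key = require_env("OPENAI_API_KEY")`, so a missing key surfaces here rather than at the first API call.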

In [5]:
# Creating the OpenAI client that will be used to call the LLM with a prompt

openai.api_key = openai_key
client = openai.OpenAI(api_key=openai_key)

### Define data for courses and prompt to be used with LLM

In [6]:
# List of all courses
courses = [
    "CSCI E-25 Computer Vision",
    "CSCI E-63C Elements of Data Science and Statistical Learning with R",
    "CSCI E-80 Introduction to Artificial Intelligence with Python",
    "CSCI E-83 Fundamentals of Data Science",
    "CSCI E-89 Deep Learning",
    "CSCI E-89B Introduction to Natural Language Processing",
    "CSCI E-96 Data Mining for Business", 
    "CSCI E-101 Foundations of Data Science and Engineering", 
    "CSCI E-102 Econometrics and Causal Inference with R",
    "CSCI E-109A Introduction to Data Science", 
    "CSCI E-109B Advanced Topics in Data Science", 
    "CSCI E-116 Dynamic Modeling and Forecasting in Big Data",
    "ISMT E-136 Time Series Analysis with Python",
    "ISMT E-161 Computational Bayesian Inference",
    "MATH E-156 Mathematical Statistics",
    "STAT E-109 Introduction to Statistical Modeling"
]

# List of descriptions of all courses
# Descriptions are passed on to the prompt and are used to 
# provide the LLM with more information about a given course
descriptions = [
    """
    Computer vision is an exciting and rapidly changing field. In a little over ten years, deep learning algorithms have 
    revolutionized several aspects of computer vision. Applications that were infeasible or impractical a few years ago are now 
    in routine production. These advances allow intelligent systems to interact with the real-world using vision. Examples of modern 
    computer vision (CV) applications include digital photography, robotic vision, autonomous vehicles, medical imaging, and 
    scientific imaging. This course is a fast-moving survey of both fundamental theory of CV algorithms along with hands-on practical 
    assignments applying these methods using Python. Successfully deploying CV applications often requires a combination of classical 
    methods and state-of-the-art algorithms. Therefore, this course covers the classical image processing and CV techniques often 
    found in practical CV solutions. From this foundation the course moves to the deep learning algorithms that have revolutionized 
    CV. Students apply tools drawn from the extensive universe of Python CV related packages in the hands-on assignments to reinforce 
    key principles. Major topics covered in the course include: algorithms used to prepare images, transform images, and extract 
    features; statistical properties of images and methods of decomposition; machine learning algorithms for CV, including deep 
    learning; classification of objects in images; motion in images and optical flow; object detection and tracking algorithms; models 
    for stereo vision; segmentation of images; and generative models.
    """,
    """
    One of the broad goals of data science is examining raw data with the purpose of identifying its structure and trends, and of 
    deriving conclusions and hypotheses from it. In the modern world awash with data, data analytics is more important than ever to 
    fields ranging from biomedical research, space and weather science, finance, business operations and production, to marketing and 
    social media applications. This course introduces various statistical learning methods and their applications. The R programming 
    language, a very popular and powerful platform for scientific and statistical analysis and visualization, is introduced and used 
    throughout the course. We discuss the fundamentals of statistical testing and learning, and cover topics of linear and non-linear 
    regression, clustering and classification, support vector machines, and decision trees. The datasets used in the examples are 
    drawn from diverse domains such as finance, genomics, and customer sales and survey data.
    """,
    """
    This course explores the concepts and algorithms at the foundation of modern artificial intelligence, diving into the ideas that 
    give rise to technologies like game-playing engines, handwriting recognition, and machine translation. Through hands-on projects, 
    students gain exposure to the theory behind graph search algorithms, classification, optimization, machine learning, large language 
    models, and other topics in artificial intelligence as they incorporate them into their own Python programs. By course's end, 
    students emerge with experience in libraries for machine learning as well as knowledge of artificial intelligence principles that 
    enable them to design intelligent systems of their own.
    """,
    """
    This course builds on CSCI E-101, giving students a solid foundation for advanced data modeling, machine learning, and artificial 
    intelligence (AI). The course focuses on the modern computational statistical methods underpinning advanced data science. In the 
    twenty-first century, these powerful, computationally intensive models are both practical and widely used. Such models enable us 
    to explore and model the complex datasets commonly encountered in the real world. The course employs a combination of theory and 
    hands-on experience using Python programming tools. The focus is on foundational computational statistical algorithms, statistical 
    inference methods, and effective visualization methods. The hands-on component of the course uses the Python packages, NumPy, 
    Pandas, Seaborn, Statsmodels, and PyMC3, along with selected other open-source packages. The focus of this course is on methods to 
    address the exploration, inference, and modeling changes arising from the analysis of increasingly complex datasets. Three 
    approaches to large scale computational statistical inference are addressed: maximum likelihood, modern resampling methods, and 
    Bayesian models. The properties and behavior of the rich family of linear models and Bayesian models, foundational to many 
    statistical, machine learning and AI algorithms are surveyed. Additionally, time series models are explored.
    """,
    """
    The ability of computerized systems to acquire vast amounts of data and display them in informative ways raises our expectations 
    for fast, accurate identification or recognition of events or objects and for predictions about future events. Machine learning and 
    artificial intelligence (AI) have fulfilled those needs to some degree. Over the last 10 years, a versatile architectural style of 
    artificial neural networks called deep learning has emerged as the most promising answer to those expectations. Today, deep 
    learning is the primary technique for analysis and resolution of many issues in data analyses and natural sciences, linguistics, 
    and engineering. We use deep learning for image classification, manipulation and generation, speech recognition and synthesis, 
    natural language translation, sound and music manipulation and generation, navigation of self-driving cars, and many other 
    activities. In this course, students master several key architectures for implementation of deep learning networks, such as 
    convolutional neural networks (CNNs), recurrent neural networks (RNNs), long short-term memory networks (LSTMs), autoencoders, 
    generative adversarial networks (GANs), transformers with attention, and graph neural networks. We provide references to many 
    practical applications where those architectures are successfully used. The course starts with a review of the theoretical 
    foundations of the neural networks approach to machine learning including auto-differentiation and backpropagation. The emphasis 
    of the course is on practical applications of deep learning using Keras (packages within TensorFlow 2.x framework) and PyTorch.
    """,
    """
    Students are introduced to modern techniques of natural language processing (NLP) and learn foundations of text classification, 
    named entity recognition, parsing, language modeling including text generation, topic modeling, and machine translation. Methods 
    for representing text as data studied in the course are tokenization, n-grams, bag of words, term frequency-inverse document 
    frequency (TF-IDF) weighting, word embeddings like Word2Vec and GloVe, autoencoders, t-SNE, character embeddings, and topic 
    modeling. The machine learning algorithms for NLP covered in the course are recurrent neural networks (RNNs) including long 
    short-term memory (LSTM), conditional random fields (CRFs), bidirectional LSTM with a CRF (BiLSTM-CRF), generative adversarial 
    networks (GANs), attention models, transformers, bidirectional encoder representations from transformers (BERT), latent Dirichlet 
    allocation (LDA), non-negative matrix factorization (NMF), and structural topic modeling (STM). Students get hands-on experience 
    using both Python and R.
    """,
    """
    This course introduces non-mathematical business professionals to data science principles widely used in today's corporations. 
    Quantitative methods affect many of today's interactions for business leaders, students, and consumers. Emphasis is placed on 
    practical uses and case studies utilizing data to inform business decisions rather than theoretical or complex mathematics. 
    Case study topics include understanding customer demand, marketing, new market forecasting, revenue projections, and data mining 
    to improve decisions. Learning goals include quantitative business application, basic programming, algorithm development, and 
    process workflow. The course highlights methods that business leaders and data scientists have found to be the most useful. 
    It introduces the basic concepts of R for data mining. This course is for students who want an introduction to how data science 
    improves business outcomes.
    """,
    """
    Most data scientists spend 20 percent of their time building data models and analyzing model results. What do they do with the 
    remaining 80 percent of their time? The answer is data engineering. Data engineering is a subdiscipline of software engineering 
    that focuses on the transportation, transformation, and management of data. This course takes a comprehensive approach to explore 
    data science, which includes data engineering concepts and techniques. Key topics include data management and transformation, 
    exploratory data analysis and visualization, statistical thinking and machine learning, natural language processing, and 
    storytelling with data, emphasizing the integration of Python, MySQL, Tableau, development, and big data analytics platforms. 
    """,
    """
    Supervised learning algorithms, such as support-vector machines, random forests, and neural networks have demonstrated phenomenal 
    performance in the era of big data. However, they often fail in answering the question, what would happen if the world changed in 
    some specific way while holding other variables fixed? Such problems arise in many business applications including in finance, 
    policymaking, and healthcare. This course covers modern econometric techniques for evaluating causal effects based on observational 
    (that is, non-experimental) data. Topics covered in the course include multivariate linear regression, heteroscedasticity and 
    weighted least squares (WLS), dummy variables and interactions, difference in differences (DD), logistic regression, probit model, 
    censored regression models, exact matching, propensity score matching (PSM), regression discontinuity design (RDD), fuzzy 
    regression discontinuity (FRD), synthetic control, instrumental variables (IV), and two-stage least squares (2SLS). Students get 
    hands-on experience using R.
    """,
    """
    This course focuses on the analysis of messy, real life data to perform predictions using statistical 
    and machine learning methods. Material covered integrates the five key facets of an investigation using data: 
    data collection—data wrangling, cleaning, and sampling to get a suitable data set; data management—accessing data quickly and 
    reliably; exploratory data analysis—generating hypotheses and building intuition; prediction or statistical learning; 
    and communication—summarizing results through visualization, stories, and interpretable summaries.
    """,
    """
    Building upon the material in CSCI E-109a, the course introduces advanced methods for statistical modeling, representation, and 
    prediction. Topics include multiple deep learning architectures such as convolutional neural networks (CNNs), recurrent neural 
    networks (RNNs), transformers, language models, autoencoders, and generative models, as well as basic Bayesian methods and 
    unsupervised learning. Students are strongly encouraged to enroll in both the fall and spring course within the same academic year. 
    """,
    """
    Most machine learning models focus on cross-sectional data, while most time-series models focus on time series with few 
    variables and low-frequency data. This course covers the skills and models to handle big data that are both rich in variables 
    and time. We discuss both structural models and reduced-form models. Students learn dynamic regression model, 
    dynamic factor model, vector autoregressions model, error correction model, dimensional reduction tools for fat dataset, 
    and state-space model. Students also learn advanced methods to decompose trend, cycle, and seasonality in high-frequency data 
    and to make more reliable time series forecasting.
    """,
    """
    Time series data (for example, closing prices of an exchange-traded fund, maximum yearly temperatures, monthly PC sales, or daily 
    numbers of visitors) arise whenever correlations of adjacent observations in time cannot be ignored. This course covers modern 
    methods for time series analysis and forecasting. In addition to mathematical foundations of time series, students get hands-on 
    experience building predictive models in cases of both stationary and non-stationary time series. Topics covered in the course 
    include autocorrelation and partial autocorrelation, Fourier analysis, stationarity, time series decomposition, autoregressive 
    integrated moving average (ARIMA) process and the Box-Jenkins methodology, generalized autoregressive conditional 
    heteroskedasticity (GARCH) model, and long short-term memory (LSTM), a special type of recurrent neural network (RNN) that has 
    been demonstrated to be superior to classical time series models in many applications.
    """,
    """
    The techniques of statistical inference for studying properties of data generating processes include method of moments, maximum 
    likelihood estimation, Bayesian inference, and nonparametric statistics. Bayesian inference is an important approach to data 
    analysis in which unknown parameters are treated as random variables whose probability distributions can be updated in light of 
    new information. Bayesian inference is particularly advantageous for sequential data analysis and hypothesis testing when data are 
    being collected sequentially. In this course, students learn foundations of Bayesian inference, including contemporary 
    computational methods such as Markov Chain Monte Carlo (MCMC) and get hands-on experience using R. Topics covered in the course 
    include Bayes' rule, prior and posterior distributions, Markov Chain (MC), MCMC methods, the celebrated Metropolis-Hastings 
    algorithm, and the Gibbs sampler.
    """,
    """
    This course is an introduction to mathematical statistics and data analysis. It starts by introducing central concepts of 
    probability theory (events, probability measure, random variables, distributions, joint distributions, and conditional 
    distributions) and then moves on to the development of mathematical foundations of statistical inference. Topics covered in the 
    course include random variables, expectations, parameter estimation (method of moments, method of maximum likelihood, and Bayesian 
    approach), properties of point estimators (bias, variance, consistency, and efficiency), confidence intervals, hypotheses testing, 
    likelihood ratio test, data summary methods, and introduction to linear regression. A class of distributions, including chi-squared, 
    t, and F distributions, the distributions derived from normal that occur in many applications of hypothesis testing and statistical 
    inference, is introduced.
    """,
    """
    This is a second course in statistical inference and is a further examination of statistics and data analysis beyond the 
    introductory course. Topics include t-tools and permutation-based alternatives including bootstrapping, analysis of variance, 
    linear regression, model checking, and refinement. Statistical computing and simulation-based emphasis is also covered as well as 
    basic programming in the R statistical package. Emphasis is placed on thinking statistically, evaluating assumptions, and 
    developing tools for real-life applications. By the end of the course, students should be able to evaluate the strengths and 
    weaknesses of a variety of statistical techniques appearing in the media, scientific literature, or students' own work. 
    """
]

In [7]:
# First version of the prompt - this does not keep track of questions already obtained from the LLM.
# If this same prompt is used in several iterations for a given course, the LLM is likely to generate duplicates.
# The next version of the prompt (prompt_2) addresses this.

prompt_1 = """
You are an expert data scientist and educator for the course "{course}" at the Harvard Extension School, which is part of the 
Data Science curriculum. Here is a description of the course:

{description}

Your task is to generate high-quality question and answer pairs specifically for fine-tuning a large language model (LLM) to behave like
a data science expert for this course.

Please follow these rules when generating the Q/A pairs:

1.) The course description is only provided here to delimit the domain at hand. Questions should only relate to this domain and not specifically to the course as such.

2.) Each question should cover a specific and meaningful topic of the course "{course}"

3.) Each answer should be clear, concise, scientifically sound and accurate, ideally between 2 to 3 short paragraphs (i.e., under 200-300 words).

4.) The tone should be professional, didactic, and aimed at intermediate to advanced learners of this course.

5.) Avoid extremely niche topics unless they are relevant in industry or academia.

6.) Do not use code unless necessary, and if code is included, keep it minimal and explained.

7.) Ensure that the content is original and informative, and not hallucinated or vague.

Produce {num_qa} Q/A pairs per generation in the following format:

### Question 1
<insert question text here>

### Answer 1
<insert answer text here>

### Question 2
...

### Answer 2
...

--> 

Please begin generating the Q/A pairs now...!

"""

In [8]:
# Second version of the prompt.
# This includes an additional parameter that contains the list of questions that the LLM has already
# generated for a given course. This is to avoid duplicate questions for the same course (which was a problem
# in early versions of the notebook).

prompt_2 = """

You are an expert data scientist and educator for the course "{course}" at the Harvard Extension School, which is part of the 
Data Science curriculum. Here is a description of the course:

{description}

Your task is to generate high-quality question and answer pairs specifically for fine-tuning a large language model (LLM) to behave like
a data science expert for this course.

Please follow these rules when generating the Q/A pairs:

1.) The course description is only provided here to delimit the domain at hand. Questions should only relate to this domain and not specifically to the course as such.

2.) Each question should cover a specific and meaningful topic of the course "{course}"

3.) Each answer should be clear, concise, scientifically sound and accurate, ideally between 2 to 3 short paragraphs (i.e., under 200-300 words).

4.) The tone should be professional, didactic, and aimed at intermediate to advanced learners of this course.

5.) Avoid extremely niche topics unless they are relevant in industry or academia.

6.) Do not use code unless necessary, and if code is included, keep it minimal and explained.

7.) Ensure that the content is original and informative, and not hallucinated or vague.

Produce {num_qa} Q/A pairs per generation in the following format:

### Question 1
<insert question text here>

### Answer 1
<insert answer text here>

### Question 2
...

### Answer 2
...

--> 

VERY IMPORTANT: Please avoid questions that are already in the following list, since you had already
generated them in a previous run (if the list is empty, then please ignore):

List of questions to avoid: {avoid_questions_l}

Please begin generating the Q/A pairs now...!

"""

In [9]:
# Define a prompt_template for the prompt (will be used for generating prompts for the LLM)
# The parameters are:
# course:               the title of the course
# description:          description of the course
# num_qa:               number of the Q/A pairs that the LLM is asked to generate in a single prompt
# avoid_questions_l:    list of questions that LLM had already generated before for the course
#                       (i.e. avoid re-generating these questions)

prompt_template = PromptTemplate(
    template=prompt_2,
    input_variables=["course", "description", "num_qa", "avoid_questions_l"]
)
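
When a Python list is passed as `avoid_questions_l`, the template should insert its `str()` representation, brackets and quotes included. A sketch of rendering the list as one bullet per line instead is shown below. `format_avoid_list` is a hypothetical helper, and plain `str.format` stands in for `PromptTemplate.format` so that the example runs without LangChain.

```python
from typing import List


def format_avoid_list(questions: List[str]) -> str:
    """Render previously generated questions as one bullet per line, or '(none)' if empty."""
    if not questions:
        return "(none)"
    return "\n".join(f"- {q}" for q in questions)


# Stand-in for the last part of prompt_2 (assumption: same placeholder name as in the notebook)
snippet = "List of questions to avoid:\n{avoid_questions_l}"

print(snippet.format(avoid_questions_l=format_avoid_list(["What is optical flow?", "What is SIFT?"])))
```

Whether the bulleted form actually reduces duplicates compared with the raw list has not been measured here; it mainly makes the prompt easier to read when debugging.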

In [10]:
# Testing the prompt with an empty list of questions...

for i in range(len(courses)):
    prompt = prompt_template.format(
        course = courses[i],
        description = descriptions[i],
        num_qa = NUM_QA,
        avoid_questions_l = []
    )
    print(f"prompt --> {prompt}\n")

prompt --> 

You are an expert data scientist and educator for the course "CSCI E-25 Computer Vision" at the Harvard Extension School, which is part of the 
Data Science curriculum. Here is a description of the course:


    Computer vision is an exciting and rapidly changing field. In a little over ten years, deep learning algorithms have 
    revolutionized several aspects of computer vision. Applications that were infeasible or impractical a few years ago are now 
    in routine production. These advances allow intelligent systems to interact with the real-world using vision. Examples of modern 
    computer vision (CV) applications include digital photography, robotic vision, autonomous vehicles, medical imaging, and 
    scientific imaging. This course is a fast-moving survey of both fundamental theory of CV algorithms along with hands-on practical 
    assignments applying these methods using Python. Successfully deploying CV applications often requires a combination of classical 

### Get Question/Answer pairs from the LLM

In [11]:
# Function that will be used to extract Q/A pairs from the response of the LLM
# Questions and answers are formatted according to the template provided in the prompt:

# ### Question 1
# <insert question text here>

# ### Answer 1
# <insert answer text here>

# ### Question 2
# ...

# ### Answer 2
# ...
#
# This function then uses a regular expression to extract all the Q/A pairs from the response
# of the LLM.

def extract_qa_pairs(text: str) -> Tuple[List[str], List[str]]:

    pattern = r"### Question \d+\n(.*?)\n\n### Answer \d+\n(.*?)(?=(\n\n### Question \d+|$))"
    matches = re.findall(pattern, text, re.DOTALL)
    
    questions = [q.strip() for q, _, _ in matches]
    answers = [a.strip() for _, a, _ in matches]
    
    return questions, answers
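
To see what the function returns, it can be run on a small hand-written response. The sketch below repeats the function so that it runs standalone; `sample_response` is made up for illustration and is not real LLM output.

```python
import re
from typing import List, Tuple


def extract_qa_pairs(text: str) -> Tuple[List[str], List[str]]:
    # Same pattern as above; the group inside the lookahead makes findall return 3-tuples
    pattern = r"### Question \d+\n(.*?)\n\n### Answer \d+\n(.*?)(?=(\n\n### Question \d+|$))"
    matches = re.findall(pattern, text, re.DOTALL)
    return [q.strip() for q, _, _ in matches], [a.strip() for _, a, _ in matches]


sample_response = (
    "### Question 1\n"
    "What is overfitting?\n\n"
    "### Answer 1\n"
    "Overfitting occurs when a model memorizes noise in the training data.\n\n"
    "It typically shows up as a gap between training and validation error.\n\n"
    "### Question 2\n"
    "What is a confusion matrix?\n\n"
    "### Answer 2\n"
    "A confusion matrix tabulates predicted against actual class labels.\n"
)

sample_q, sample_a = extract_qa_pairs(sample_response)
print(sample_q)       # ['What is overfitting?', 'What is a confusion matrix?']
print(len(sample_a))  # 2
```

Multi-paragraph answers are kept whole, as the first answer shows. The pattern does depend on the blank line before each `### Answer` heading: if the model omits it, the lazy groups run on to the next well-formed heading and pairs get merged instead of raising an error.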

In [12]:
# Main loop for generating the questions and answers

questions = []      # stores all generated questions 
answers = []        # stores all generated answers
meta_data = []      # stores the context for every Q/A pair (in our case the context corresponds to the course title)

questions_dict = {} # stores, per course, the list of questions generated so far for a given course (a dictionary is used for that)

for course in courses:              # initialize the dictionary with an empty list for each course
    questions_dict[course] = []

if ALL_COURSES:                     # If ALL_COURSES is true, generate Q/A pairs for *all* courses
    num_courses = len(courses)
else:                               # ... else generate Q/A pairs only for the first MAX_COURSES courses
    num_courses = MAX_COURSES

for j in range(NUM_RUNS): 
    print("\n--------------------------------------------------------------------------------------------------------")
    print(f"Working on NUM_RUN = {j}\n")
    for i in range(num_courses):  
        current_course = courses[i]  

        # Format the prompt based on the current course
        prompt = prompt_template.format(
            course = current_course,
            description = descriptions[i],
            num_qa = NUM_QA,
            avoid_questions_l = questions_dict[current_course]
        )

        print(f"Getting Q/A pairs for course '{courses[i]}'...")

        # Send the request to the OpenAI API. Several models were tested here;
        # gpt-4o worked best, mainly because of its generous max-token limit.
        response = client.chat.completions.create(
            model="gpt-4o",  # or "gpt-3.5-turbo"
            messages=[
                {"role": "user", "content": prompt}
            ],
            temperature=0.7,
            max_tokens=MAT_TOKENS
        )

        q, a = extract_qa_pairs(response.choices[0].message.content)
        num_q = len(q)
        num_a = len(a)

        # Making sure, just in case, that the number of answers corresponds to the number of questions.
        # If not, exit with a fatal error...
        assert num_a == num_q, "Fatal error --> number of questions not equal to number of answers"

        # Add the newly generated questions and answers to the lists, as well as the meta data (i.e. course title)
        questions.extend(q)
        questions_dict[current_course].extend(q) # assuming q only contains new questions
        answers.extend(a)
        meta_data.extend([current_course] * num_q)


--------------------------------------------------------------------------------------------------------
Working on NUM_RUN = 0

Getting Q/A pairs for course 'CSCI E-25 Computer Vision'...
Getting Q/A pairs for course 'CSCI E-63C Elements of Data Science and Statistical Learning with R'...
Getting Q/A pairs for course 'CSCI E-80 Introduction to Artificial Intelligence with Python'...
Getting Q/A pairs for course 'CSCI E-83 Fundamentals of Data Science'...
Getting Q/A pairs for course 'CSCI E-89 Deep Learning'...
Getting Q/A pairs for course 'CSCI E-89B Introduction to Natural Language Processing'...
Getting Q/A pairs for course 'CSCI E-96 Data Mining for Business'...
Getting Q/A pairs for course 'CSCI E-101 Foundations of Data Science and Engineering'...
Getting Q/A pairs for course 'CSCI E-102 Econometrics and Causal Inference with R'...
Getting Q/A pairs for course 'CSCI E-109A Introduction to Data Science'...
Getting Q/A pairs for course 'CSCI E-109B Advanced Topics in Data Science
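
The loop above assumes that each response contains only new questions. The prompt asks for this, but nothing enforces it. A minimal sketch of filtering out repeats before extending the lists is given below; `normalize` and `filter_new_pairs` are hypothetical helpers and not part of the notebook.

```python
from typing import List, Tuple


def normalize(question: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return " ".join(question.lower().split())


def filter_new_pairs(
    new_q: List[str], new_a: List[str], seen_q: List[str]
) -> Tuple[List[str], List[str]]:
    """Keep only the (question, answer) pairs whose question is not already in seen_q."""
    seen = {normalize(q) for q in seen_q}
    kept_q, kept_a = [], []
    for q, a in zip(new_q, new_a):
        key = normalize(q)
        if key not in seen:
            seen.add(key)  # also guards against repeats within the same response
            kept_q.append(q)
            kept_a.append(a)
    return kept_q, kept_a
```

In the loop, this could be applied as `q, a = filter_new_pairs(q, a, questions_dict[current_course])` right after `extract_qa_pairs`. It only catches repeats that differ in case or whitespace; paraphrased duplicates would need a similarity measure.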

In [13]:
# Print a sample of NUM_QA * 3 Q/A pairs
for i in range(NUM_QA*3):
    print("---------------------------------------------------------------------")
    print(f"question {i} -->\n")
    print(questions[i], "\n")
    print(answers[i])
    print("\n")


---------------------------------------------------------------------
question 0 -->

What are the key differences between classical image processing techniques and deep learning approaches in computer vision? 

Classical image processing techniques and deep learning approaches represent two distinct paradigms in computer vision. Classical techniques often involve handcrafted algorithms that are designed to perform specific tasks, such as edge detection, filtering, or feature extraction. These methods rely heavily on mathematical models and domain knowledge to process images. For instance, techniques like the Sobel or Canny filters are used for edge detection, and algorithms such as SIFT or SURF are employed for feature extraction. These methods are generally well-understood and can be computationally efficient, but they may struggle with complex image patterns or require extensive tuning for different applications.

In contrast, deep learning approaches, particularly convolutional neu

### Generate json and csv files in different formats (including files with a split in train/test)

In [14]:
prompt_l = [[{"role": "user", "content": q}] for q in questions]
completion_l = [[{"role": "assistant", "content": a}] for a in answers]

dataset_dict = {
    "prompt": prompt_l,
    "completion": completion_l
}

In [15]:
import json

with open("dataset.json", "w") as f:
    json.dump(dataset_dict, f, indent=2)
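
A quick way to confirm that the file has the intended shape is to write a tiny dataset, read it back, and compare. The sketch below uses a temporary directory and a placeholder example rather than the real `dataset.json`.

```python
import json
import tempfile
from pathlib import Path

# One placeholder example in the same column-oriented chat format as dataset_dict
dataset = {
    "prompt": [[{"role": "user", "content": "placeholder question"}]],
    "completion": [[{"role": "assistant", "content": "placeholder answer"}]],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "dataset.json"
    with open(path, "w", encoding="utf-8") as f:
        json.dump(dataset, f, indent=2)
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

# The round trip should preserve the structure, and both columns must have the same length
assert loaded == dataset
assert len(loaded["prompt"]) == len(loaded["completion"])
```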

In [16]:
dataset_dict_alternative = {
    "prompt": questions,
    "context": meta_data,
    "answer": answers
}

In [17]:
import json

with open("dataset_alternative_format.json", "w") as f:
    json.dump(dataset_dict_alternative, f, indent=2)

In [18]:
# Split the dataset into train and test
import random

total_size = len(questions)
test_size = int(0.2 * total_size)  

all_indices = list(range(total_size))

# Use a set for fast membership tests when selecting the train indices
test_indices = random.sample(all_indices, test_size)
test_index_set = set(test_indices)
train_indices = [i for i in all_indices if i not in test_index_set]

print("Test indices:", test_indices)
print("Train indices:", train_indices)

Test indices: [14, 3, 36, 51, 61, 45, 17, 6, 33, 15, 57, 41]
Train indices: [0, 1, 2, 4, 5, 7, 8, 9, 10, 11, 12, 13, 16, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 34, 35, 37, 38, 39, 40, 42, 43, 44, 46, 47, 48, 49, 50, 52, 53, 54, 55, 56, 58, 59, 60, 62, 63]
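
Because no seed is set, the split above draws a different sample on every run. A sketch of a reproducible variant follows; `split_indices` and the seed value of 42 are assumptions for illustration, not part of the notebook.

```python
import random
from typing import List, Tuple


def split_indices(
    total_size: int, test_fraction: float = 0.2, seed: int = 42
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices) using a private, seeded random generator."""
    rng = random.Random(seed)  # does not touch the global random state
    test_size = int(test_fraction * total_size)
    test_indices = rng.sample(range(total_size), test_size)
    test_set = set(test_indices)
    train_indices = [i for i in range(total_size) if i not in test_set]
    return train_indices, test_indices


train_idx, test_idx = split_indices(64)
print(len(train_idx), len(test_idx))  # 52 12
```

Calling `split_indices` again with the same arguments returns the same split, which keeps the train and test files consistent across reruns of the notebook.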


In [19]:
prompt_l_train = [[{"role": "user", "content": q}] for i, q in enumerate(questions) if i in train_indices]
prompt_l_test = [[{"role": "user", "content": q}] for i, q in enumerate(questions) if i in test_indices]

completion_l_train = [[{"role": "assistant", "content": a}] for i, a in enumerate(answers) if i in train_indices]
completion_l_test = [[{"role": "assistant", "content": a}] for i, a in enumerate(answers) if i in test_indices]

dataset_dict_train = {
    "prompt": prompt_l_train,
    "completion": completion_l_train
}

dataset_dict_test = {
    "prompt": prompt_l_test,
    "completion": completion_l_test
}

In [20]:
with open("dataset_train.json", "w") as f:
    json.dump(dataset_dict_train, f, indent=2)

with open("dataset_test.json", "w") as f:
    json.dump(dataset_dict_test, f, indent=2)

In [22]:
with open("dataset.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["prompt", "context", "answer"])  # Header

    for prompt, context, answer in zip(questions, meta_data, answers):
        writer.writerow([prompt, context, answer])

with open("dataset_train.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["prompt", "context", "answer"])  # Header    

    for i, (prompt, context, answer) in enumerate(zip(questions, meta_data, answers)):
        if i in train_indices:
            writer.writerow([prompt, context, answer])

with open("dataset_test.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["prompt", "context", "answer"])  # Header    

    for i, (prompt, context, answer) in enumerate(zip(questions, meta_data, answers)):
        if i in test_indices:
            writer.writerow([prompt, context, answer])
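
The three blocks above differ only in the file name and in which rows they keep, so they could be folded into one function. This is a sketch under that assumption; `write_qa_csv` is a hypothetical helper and not part of the notebook.

```python
import csv
from typing import Iterable, Optional, Sequence


def write_qa_csv(
    path: str,
    questions: Sequence[str],
    contexts: Sequence[str],
    answers: Sequence[str],
    keep: Optional[Iterable[int]] = None,
) -> int:
    """Write prompt/context/answer rows to `path`; if `keep` is given, write only those row indices.

    Returns the number of data rows written (the header is not counted).
    """
    keep_set = None if keep is None else set(keep)
    written = 0
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["prompt", "context", "answer"])  # Header
        for i, row in enumerate(zip(questions, contexts, answers)):
            if keep_set is None or i in keep_set:
                writer.writerow(row)
                written += 1
    return written
```

The train file would then be produced with `write_qa_csv("dataset_train.csv", questions, meta_data, answers, keep=train_indices)`, and the full dataset by omitting `keep`.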
