# Metricizing LLMaAiTB-E

- Our Focus: Generation Quality
- Measurement Techniques: 
    - Vector Comparison
    - Human Preference Sample (A/B/C Testing)
    - ???
- Iterative documents to measure:
    - Concepts (Generation phase)
    - Slides (Teaching phase)
    - Questions (Testing phase)
- Resources for testing:
    - Expert (From classes)
    - GPT4 (Generation)
    - LLMaAiTB-E (Teachabull)
- Main concepts to cover:
    - Object Oriented Programming
    - Programming Language Semantics
    - Math
    - History
    - 


# OpenAI Helper Functions
We demonstrate our metrics using OpenAI's vector embeddings of our generated documents. We chose OpenAI's embeddings because of their large input capacity, which we expect to make them well suited to comparing large documents.

## LLM Prompt/Text Completion


## Vector Comparison
Embeddings: OpenAI’s text embeddings measure the relatedness of text strings.
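
For reference, the comparisons in this notebook score relatedness as the cosine similarity of two embedding vectors $\mathbf{u}$ and $\mathbf{v}$, which is what the `cosine_similarity` helper defined later in this section computes:

$$
\operatorname{sim}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}
$$

The value lies in $[-1, 1]$. Vectors that point in the same direction, such as $(1, 2, 3)$ and $(2, 4, 6)$, score exactly 1 regardless of their lengths.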

In [14]:
import openai
import myenv
import os
import pickle as pkl
from AITutor_Backend.src.TutorUtils.concepts import *
from AITutor_Backend.src.TutorUtils.notebank import NoteBank
from AITutor_Backend.src.TutorUtils.slides import SlidePlan, Slide, SlidePlanner, Purpose
from AITutor_Backend.src.TutorUtils.questions import Question, QuestionSuite

In [15]:
### OPENAI HELPER FUNCTIONS
def request_output_from_llm(prompt, model: str):
    """Requests a completion for the given prompt from an OpenAI chat model.

    Args:
        prompt: (str) - string passed to the model as the system message
        model: (str) - name of the OpenAI chat model to query

    Returns:
        str: text content of the first completion choice
    """
    client = openai.OpenAI()

    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": prompt,
            },
        ],
        temperature=1,
        max_tokens=8000,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )

    return response.choices[0].message.content

In [16]:
### Vector Functions
import numpy as np
from openai import OpenAI
import tiktoken
import json

client = OpenAI()

def get_embedding(text, model="text-embedding-ada-002"):
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding

def tokenizer(text):
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)
    return tokens

def process_in_batches(tokens, batch_size=8000):
    for i in range(0, len(tokens), batch_size):
        yield tokens[i:i + batch_size]

def create_embeddings(text):
    tokens = tokenizer(text)
    embeddings = []
    encoding = tiktoken.get_encoding("cl100k_base")  # Reuse the encoding for decoding

    for token_batch in process_in_batches(tokens):
        # Convert token batch back to string
        batch_text = encoding.decode(token_batch)
        batch_embedding = get_embedding(batch_text)
        embeddings.append(batch_embedding)

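    # Average the batch embeddings with equal weight. Wrapping the list keeps a
    # leading axis, so the result has shape (1, dim) and callers index it with [0].
    return np.mean([embeddings], axis=1)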

def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    similarity = dot_product / (norm_vec1 * norm_vec2)
    return similarity

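# Smoke test: this embeds the path string itself, not the contents of the file
embeddings = create_embeddings("Research/generation_data/slides/Expert/codingSlides_expert.json")
print(f"Vector Embeddings {embeddings.shape}:", embeddings)

# Sanity check: parallel vectors should have a cosine similarity of 1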
vec1 = np.array([1, 2, 3])
vec2 = np.array([2, 4, 6])
similarity = cosine_similarity(vec1, vec2)
print(f"Cosine Similarity: {similarity}")


Vector Embeddings (1, 1536): [[-0.00125975  0.00951579  0.0048352  ... -0.01831474 -0.00588236
  -0.04087433]]
Cosine Similarity: 1.0
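
`create_embeddings` above averages the per-batch embeddings with equal weight, so a short final batch counts as much as a full 8000-token batch. Below is a minimal sketch of one alternative, assuming it is acceptable to weight each batch by its token count and renormalize the result. `weighted_mean_embedding` is a hypothetical helper that is not part of this notebook, and it works on plain lists so it can be tried without API access.

```python
import math

def weighted_mean_embedding(chunk_embeddings, chunk_lengths):
    """Average chunk embeddings weighted by token count, then L2-normalize.

    Hypothetical helper: chunk_embeddings is a list of equal-length float
    lists, chunk_lengths holds the number of tokens in each chunk.
    """
    if not chunk_embeddings or len(chunk_embeddings) != len(chunk_lengths):
        raise ValueError("need exactly one length per chunk embedding")
    total = sum(chunk_lengths)
    if total <= 0:
        raise ValueError("chunk lengths must sum to a positive number")

    dim = len(chunk_embeddings[0])
    mean = [0.0] * dim
    for emb, length in zip(chunk_embeddings, chunk_lengths):
        weight = length / total
        for i, value in enumerate(emb):
            mean[i] += weight * value

    norm = math.sqrt(sum(v * v for v in mean))
    # Leave an all-zero mean unnormalized to avoid dividing by zero
    return [v / norm for v in mean] if norm > 0 else mean

# Toy example: a full 8000-token chunk and a short 2000-token tail
doc_vector = weighted_mean_embedding([[1.0, 0.0], [0.0, 1.0]], [8000, 2000])
print(doc_vector)
```

With this weighting the short tail pulls the document vector less than the full chunk does. Whether that changes the similarity scores in practice would need to be checked on the actual documents.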


# Concepts
- Preprocessing
- Generation of Graph for (Expert, GPT, LLMaAiT-BE)
- Comparison and Data Analysis

In [17]:
### Notebanks from AI Tutor
current_topic = "ds"
main_concept = "Graph Data Structure"
tutor_plan_nlp = '''Main Concept: Regular Expressions, Text Normalization, Edit Distance in Natural Language Processing (NLP)
Student is a computer science student with no prior knowledge of the topic, requiring an introductory lesson.
Student is taking an NLP class, suggesting the lessons are for academic purposes and should cover necessary conceptual detail.
Student provided a chapter summary that includes key subtopics; this will be a guide in structuring the lesson plan.
Tutor shall educate on the following concepts:
Subconcept: Introduction to Regular Expressions
Subconcept: Uses of Regular Expressions in NLP
Subconcept: Basic Syntax and Operators of Regular Expressions
Subconcept: Practical Examples and Exercises Using Regular Expressions
Subconcept: Introduction to Text Normalization
Subconcept: Tokenization of Text
Subconcept: Lemmatization and its Importance
Subconcept: Sentence Segmentation Techniques
Subconcept: Introduction to Edit Distance
Subconcept: Applications of Edit Distance Algorithm in NLP
Subconcept: Calculation of Edit Distance and String Alignment
Tutor will apply practical examples relevant to modern NLP applications, such as chatbots, using the chapter summary as a conversational context.
Tutor will provide hands-on practice problems and ensure the student understands the implementation of the concepts.
Students objective: To gain a foundational understanding of the chapter's main points, to apply this understanding in an academic setting, and to perform well in the NLP class.
Since the student might need to have a deep understanding of the class material, the lesson should provide a solid theoretical basis, followed by practical application.
Concept: Regular Expressions, Text Normalization, Edit Distance in Natural Language Processing (NLP)
Concept: Introduction to Regular Expressions
Concept: Uses of Regular Expressions in NLP
Concept: Basic Syntax and Operators of Regular Expressions
Concept: Practical Examples and Exercises Using Regular Expressions
Concept: Introduction to Text Normalization
Concept: Tokenization of Text
Concept: Lemmatization and its Importance
Concept: Sentence Segmentation Techniques
Concept: Introduction to Edit Distance
Concept: Applications of Edit Distance Algorithm in NLP
Concept: Calculation of Edit Distance and String Alignment
Concept: Practical Examples and Exercises in Modern NLP Applications (e.g., Chatbots)
Concept: Hands-on Practice Problems
Concept: Foundational Understanding of Main Points
Concept: Academic Application of Concepts
Concept: Theoretical Basis Followed by Practical Application
Student's Interest Statement: I find natural language processing interesting and important since I am taking it as a course in college where I will be tested
Student's Slides Preference Statement: I want to be taught by information and examples
Student's Questions Preference Statement: 2 of multiple choice, 2 of free response and 2 coding questions'''

tutor_plan_economics = """

"""
tutor_plan_calc = """

"""
tutor_plan_data_structures = """Comprehensive overview of graph data structures planned.\n Student wants to learn about graph data structures, their formalization, complexities (time and space), representations, algorithms, and applications.\n 'Subconcept: Definitions and Formalization of Graph Theory\n Subconcept: Time Complexity of Graph Algorithms\n 'Subconcept: Space Complexity of Graph Data Structures\n 'Subconcept: Representations of Graphs (Adjacency Matrix and List)\n 'Subconcept: Graph Traversal Algorithms (DFS and BFS)\n "Subconcept: Graph Pathfinding Algorithms (Dijkstra's, A*, Bellman-Ford)", 'Subconcept: Network Flow (Ford-Fulkerson, Edmonds-Karp)\n 'Subconcept: Graph Coloring and Scheduling (Chromatic Number, Greedy Algorithm)\n "Subconcept: Trees and Special Graphs (Spanning Trees, Minimum Spanning Trees: Prim's and Kruskal's Algorithms)", 'Subconcept: Graph Invariants (Degree Sequence, Hamiltonian, Eulerian Paths and Circuits)\n 'Subconcept: Practical Applications of Graph Theory in Various Fields (Computer Science, Biology, Social Sciences, etc.)\n 'Tutor will explain and demystify complex topics with easily digestible examples, ensuring theoretical knowledge is bolstered by practical application.\n 'Tutor will discuss the computational considerations involved in using graphs with a focus on optimizations and real-world constraints.\n 'Tutor will present common problems and solutions in graph theory to illustrate course concepts.\n 'Since the conversation is leading toward a structured and comprehensive overview, it will be necessary to propose structured lessons that build upon each other, to cement understanding and facilitate retention.
Graph Data Structures\n 'Definitions and Formalization of Graph Theory\n 'Time Complexity of Graph Algorithms\n 'Space Complexity of Graph Data Structures\n 'Representations of Graphs (Adjacency Matrix and List)\n 'Graph Traversal Algorithms (DFS and BFS)\n "Graph Pathfinding Algorithms (Dijkstra's, A*, Bellman-Ford)", 'Network Flow (Ford-Fulkerson, Edmonds-Karp)\n 'Graph Coloring and Scheduling (Chromatic Number, Greedy Algorithm)\n "Trees and Special Graphs (Spanning Trees, Minimum Spanning Trees: Prim's and Kruskal's Algorithms)", 'Graph Invariants (Degree Sequence, Hamiltonian, Eulerian Paths and Circuits)\n 'Practical Applications of Graph Theory in Various Fields (Computer Science, Biology, Social Sciences, etc.)\n 'Graph Theory Learning Approach\n 'Demystifying Complex Topics with Easily Digestible Examples\n 'Theoretical Knowledge Bolstered by Practical Application\n 'Computational Considerations in Graph Theory\n 'Optimizations and Real-World Constraints\n 'Common Problems and Solutions in Graph Theory\n Structured and Comprehensive Lesson Plan\n


"""
tutor_plan_history = """

"""

current_plan = {'NLP': tutor_plan_nlp, "history": tutor_plan_history, "ds":tutor_plan_data_structures, "econ": tutor_plan_economics, "calc": tutor_plan_calc}[current_topic]

In [18]:
# Concept generation from AITutor:
notebank = NoteBank()
for n in current_plan.split("\n"):
    notebank.add_note(n)

concepts_path = f"Research/temp_data/temp_concepts_{current_topic}.pkl"
# Check if the file exists
if os.path.exists(concepts_path):
    # Load the object from the file
    with open(concepts_path, "rb") as f:  # 'rb' mode is for reading in binary format
        concepts = pkl.load(f)
    concept_db = ConceptDatabase(main_concept, notebank.env_string(), False)
    concept_db.Concepts = concepts
else:
    concept_db = ConceptDatabase(main_concept, notebank.env_string())
    with open(concepts_path, "wb") as f:  # 'wb' mode is for writing in binary format
        pkl.dump(concept_db.Concepts, f)

print("\n\n".join([concept.format_json() for concept in concept_db.Concepts]))


# Slides: 
- Preprocessing
- Generation of Document for (Expert, GPT, LLMaAiT-BE)
- Comparison and Data Analysis

In [None]:
### SLIDE OBJ PROMPTs
prompt = ''' #Your task is to create a JSON object from a slide string. View the example Input and output, and then repeat the same for the provided input. 
Perform the conversion for each slide s in the input string such that s->json_object(s). You should be able to figure out which is the title and which is the description.
IMPORTANT: Backslashes in JSON data cause parse errors unless they are properly escaped.
Avoid errors such as: Invalid \escape: line 24 column 72 (char 2199)
by writing every backslash inside a JSON string as a doubled backslash ('\\\\').
IMPORTANT: If two words appear joined without white space, such as "functionwhere", there is most probably a new line ('\\n') or space (' ') between them, e.g. "function where".

// Input:
Page 1 Content:
Natural Language ProcessingProfessor John LicatoUniversity of South FloridaChapter 2:RegEx, Edit Distance

----------------------------------------
Page 2 Content:
"Knowing [regular expressions] can mean the difference between solving a problem in 3 steps and solving it in 3,000 steps. When you’re a nerd, you forget that the problems you solve with a couple keystrokes can take other people days of tedious, error-prone work to slog through."Regular Expressions
----------------------------------------
Page 3 Content:
The following function called `isPhoneNumber(text)` is designed to check if the provided string is a phone number in a specific format using regex. def isPhoneNumber(text):    if len(text) != 12:        return False    for i in range(0, 3):        if not text[i].isdecimal():            return False    if text[3] != '-':        return False    for i in range(4, 7):        if not text[i].isdecimal():            return False    if text[7] != '-':        return False    for i in range(8, 12):        if not text[i].isdecimal():            return False    return Trueprint('415-555-4242 is a phone number:')print(isPhoneNumber('415-555-4242'))print('Moshi moshi is a phone number:')print(isPhoneNumber('Moshi moshi'))Regular Expressions
----------------------------------------
Page 4 Content:
The following Python code uses the previously defined `isPhoneNumber` function within a loop to search through a longer string for valid phone number formats. message = 'Call me at 415-555-1011 tomorrow. 415-555-9999 is my office.'for i in range(len(message)):    chunk = message[i:i+12]    if isPhoneNumber(chunk):        print('Phone number found: ' + chunk)print('Done')Regular Expressions
----------------------------------------
Page 5 Content:
Creating regex objectsr’ = raw string\d – placeholder for a single digit>>> import re>>> phoneNumRegex = re.compile(r’\d\d\d-\d\d\d-\d\d\d\d’)
----------------------------------------
Page 6 Content:
Matching regex objects
mo = match object – contains the result of our search>>> import re>>> phoneNumRegex = re.compile(r’\d\d\d-\d\d\d-\d\d\d\d’)>>> mo = phoneNumRegex.search(‘My number is 415-555-4242.’)>>> print(‘Phone number found: ’ + mo.group())Phone number found: 415-555-4242
----------------------------------------
Page 7 Content:
Text Normalization•We will work a lot with large datasets / corpora•We often need to pre-process text•Tokenizing (segmenting) words•Normalizing word formats•Segmenting sentences (e.g. by using punctuation)
----------------------------------------
Page 8 Content:
Tokenization – segmenting running text into words (or word-like units)>>> text = 'That U.S.A. poster-print costs $12.40...'>>> pattern = r\'\'\', (?x)  # set flag to allow verbose regexps...     ([A-Z]\.)+        # abbreviations, e.g. U.S.A....     | \w+(-\w+)*      # words with optional internal hyphens...     | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%...     | \.\.\.          # ellipsis...     | [][.,;"'?():-_`]  # these are separate tokens; includes ], [... \'\'\'>>> nltk.regexp_tokenize(text, pattern)['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
----------------------------------------
Page 9 Content:
Subword tokenization•How do we capture relations between words like:–new, newer–blow, blowing–precipitation, precipitate•Often useful to break tokens into *sub*words•Usually split into token learners, and token segmenters
----------------------------------------
Page 10 Content:
Byte-pair encoding (BPE)•A way of performing subword tokenizationfunction BYTE-PAIR ENCODING(strings C, number of merges k) returns vocab VV <- all unique characters in C                  # initial set of tokens is charactersfor i = 1 to k do                                # merge tokens til k times    t_L, t_R <- Most frequent pair of adjacent tokens in C    t_new <- t_L + t_R                           # make new token by concatenating    V <- V + t_new                               # update the vocabulary    Replace each occurrence of t_L, t_R in C with t_new # and update the corpusreturn Vcorpus5 low_2 lowest_6 newer_3 wider_2 new_vocabulary_, d, e, i, l, n, o, r, s, t, wcorpus5 low _2 lowest _6 newer _3 wider _2 new _vocabulary_, d, e, i, l, n, o, r, s, t, w, er
----------------------------------------
...
        
// Output:
        { 
                \"slides\":[
                        {
                                \"Title\":\"Natural Language Processing\", 
                                \"Description\": \"Professor John Licato University of South Florida Chapter 2:RegEx, Edit Distance\",
                                \"Latex\": []
                        },
                        {
                                \"Title\":\"Regular Expressions\", 
                                \"Description\": \"Knowing [regular expressions] can mean the difference between solving a problem in 3 steps and solving it in 3,000 steps. When you’re a nerd, you forget that the problems you solve with a couple keystrokes can take other people days of tedious, error-prone work to slog through.\",
                                \"Latex\": []
                        },
                        {
                                \"Title\":\"Regular Expressions\", 
                                \"Description\": \"The following function called `isPhoneNumber(text)` is designed to check if the provided string is a phone number in a specific format using regex. def isPhoneNumber(text):    if len(text) != 12:        return False    for i in range(0, 3):        if not text[i].isdecimal():            return False    if text[3] != '-':        return False    for i in range(4, 7):        if not text[i].isdecimal():            return False    if text[7] != '-':        return False    for i in range(8, 12):        if not text[i].isdecimal():            return False    return Trueprint('415-555-4242 is a phone number:')print(isPhoneNumber('415-555-4242'))print('Moshi moshi is a phone number:')print(isPhoneNumber('Moshi moshi'))\",
                                \"Latex\": []
                        },
                        { 
                                \"Title\":\"Regular Expressions\", 
                                \"Description\": \"The following Python code uses the previously defined `isPhoneNumber` function within a loop to search through a longer string for valid phone number formats. message = 'Call me at 415-555-1011 tomorrow. 415-555-9999 is my office.'for i in range(len(message)):    chunk = message[i:i+12]    if isPhoneNumber(chunk):        print('Phone number found: ' + chunk)print('Done')\",
                                \"Latex\": []
                        },
                        { 
                                \"Title\":\"Creating regex objects\", 
                                \"Description\": \"r’ = raw string\d – placeholder for a single digit>>> import re>>> phoneNumRegex = re.compile(r’\d\d\d-\d\d\d-\d\d\d\d’)\",
                                \"Latex\": []
                        },
                        {
                                \"Title\":\"Matching regex objects\", 
                                \"Description\": \">>> import re>>> phoneNumRegex = re.compile(r’\d\d\d-\d\d\d-\d\d\d\d’)>>> mo = phoneNumRegex.search(‘My number is 415-555-4242.’)>>> print(‘Phone number found: ’ + mo.group())Phone number found: 415-555-4242 mo = match object – contains the result of our search\",
                                \"Latex\": []
                        },
                        {
                                \"Title\":\"Text Normalization\", 
                                \"Description\": \"•We will work a lot with large datasets / corpora\n•We often need to pre-process text\n•Tokenizing (segmenting) words\n•Normalizing word formats\n•Segmenting sentences (e.g. by using punctuation) )\",
                                \"Latex\": []
                        },
                        { 
                                \"Title\":\"Tokenization – segmenting running text into words (or word-like units)\", 
                                \"Description\": \">>> text = 'That U.S.A. poster-print costs $12.40...'\n>>> pattern = r\'\'\', (?x)  # set flag to allow verbose regexps\n...     ([A-Z]\.)+        # abbreviations, e.g. U.S.A\n....     | \w+(-\w+)*      # words with optional internal hyphens\n...     | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%\n...     | \.\.\.          # ellipsis...     | [][.,;"'?():-_`]  # these are separate tokens; includes ], [\n... \'\'\'\n>>> nltk.regexp_tokenize(text, pattern)\n['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']\",
                                \"Latex\": []
                        },
                        {
                                \"Title\":\"Subword tokenization\", 
                                \"Description\": \"•How do we capture relations between words like:\n–new, newer\n–blow, blowing\n–precipitation, precipitate\n•Often useful to break tokens into *sub*words•Usually split into token learners, and token segmenters\",
                                \"Latex\": []
                        },
                        {
                                \"Title\":\"Byte-pair encoding (BPE)\", 
                                \"Description\": \"•A way of performing subword tokenizationfunction BYTE-PAIR ENCODING(strings C, number of merges k) returns vocab V\nV <- all unique characters in C                  # initial set of tokens is characters\nfor i = 1 to k do                                # merge tokens til k times    \nt_L, t_R <- Most frequent pair of adjacent tokens in C    \nt_new <- t_L + t_R                           # make new token by concatenating    \nV <- V + t_new                               # update the vocabulary    \nReplace each occurrence of t_L, t_R in C with t_new # and update the corpus\nreturn V\ncorpus\n5 low_\n2 lowest_\n6 newer_\n3 wider_\n2 new_\nvocabulary\n_, d, e, i, l, n, o, r, s, t, w\ncorpus\n5 low _\n2 lowest _\n6 newer _\n3 wider _\n2 new _\nvocabulary\n_, d, e, i, l, n, o, r, s, t, w, er\",
                                \"Latex\": []
                        },
                        ...
                ]
        }
Remember! Escape Characters in JSON Data: If the JSON Object or JSON data contains backslashes, they need to be properly escaped.
Avoid these errors: Invalid \escape: line 24 column 72 (char 2199)

// Input:
        $SLIDE$

// Output:
        '''


In [None]:
### Slide helper functions
import PyPDF2
from pptx import Presentation
import json

def read_pdf(file_path):
    """Reads a PDF file and returns the text of every page as one string"""
    slide_str = ""
    with open(file_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        num_pages = len(reader.pages)

        for i in range(num_pages):
            page = reader.pages[i]
            text = page.extract_text()
            slide_str += f"Page {i+1} Content:\n{text}"
            slide_str += "\n" + ("-" * 40) + "\n"
    return slide_str

def extract_text_from_slide(slide):
    """Extracts title and content from a slide"""
    title = slide.shapes.title.text if slide.shapes.title else "No Title"
    content = []

    for shape in slide.shapes:
        if hasattr(shape, "text"):
            content.append(shape.text)

    return title, content

def read_pptx(file_path):
    """Reads a pptx file and returns the title and content of each slide as one string"""
    prs = Presentation(file_path)
    s = ""
    for slide in prs.slides:
        title, content = extract_text_from_slide(slide)
        s += f"Title: {title}\n"  # f-string so the actual title is interpolated
        s += "Content:\n" + "\n".join(content) + "\n"
        s += "-" * 40 + "\n"
    return s


def get_slide_prompt(slide_template, data):
    """Fills the $SLIDE$ placeholder of the prompt template with the slide text"""
    return slide_template.replace("$SLIDE$", data)


In [None]:
### TEST SLIDE OBJ GEN FROM GPT FOR EXPERT
slide_str = read_pptx('Research/generation_data/slides/Expert/L18_ Graphs.pptx')


curr_prompt = get_slide_prompt(prompt, slide_str)
try:
    json_data = request_output_from_llm(prompt=curr_prompt, model="gpt-3.5-turbo-16k")
    slide_obj = json.loads(json_data)
    print(slide_obj)

    # Convert the dictionary to a JSON-formatted string
    json_str = json.dumps(slide_obj, indent=4)  # indent for pretty-printing

    # Write the JSON string to a file
    with open("Research/generation_data/slides/Expert/dsSlides_expert.json", "w") as f:
        f.write(json_str)

except Exception as e:
    print(e)



Invalid \escape: line 21 column 38 (char 1854)
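
The `Invalid \escape` error above is raised by `json.loads` when the model returns a backslash that does not begin a valid JSON escape, for example the `\d` of a regular expression. Below is a minimal sketch of one workaround, assuming it is acceptable to double such backslashes and retry the parse. `loads_with_escape_repair` is a hypothetical helper; it is not part of this notebook or of the `json` module.

```python
import json
import re

# Matches either a valid JSON escape (captured in group 1) or a lone backslash.
# Valid escapes: \" \\ \/ \b \f \n \r \t and \uXXXX.
_ESCAPE = re.compile(r'\\(["\\/bfnrt]|u[0-9a-fA-F]{4})|\\')

def _double_lone_backslash(match):
    # Keep valid escapes untouched; double any other backslash
    return match.group(0) if match.group(1) else "\\\\"

def loads_with_escape_repair(raw):
    """Parse JSON text, retrying once with invalid backslash escapes doubled."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        repaired = _ESCAPE.sub(_double_lone_backslash, raw)
        return json.loads(repaired)

# Toy input that reproduces the failure: \d is not a valid JSON escape
raw = r'{"Title": "Creating regex objects", "Description": "\d is a placeholder for a single digit"}'
slide = loads_with_escape_repair(raw)
print(slide["Description"])
```

One caveat of this sketch: sequences such as `\b` or `\n` are valid JSON escapes, so a regex like `\bword` would still parse, but as a backspace character rather than a literal backslash. The repair only helps with escapes that JSON rejects outright.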


In [9]:
### Slide generation from AITutor
notebank = NoteBank()
[notebank.add_note(n) for n in current_plan.split("\n")]
slide_planner = SlidePlanner(notebank, concept_db)
# Check if the file exists
if os.path.exists(f"Research/temp_data/temp_slideplan_{current_topic}.pkl"):
    # Load the object from the file
    with open(f"Research/temp_data/temp_slideplan_{current_topic}.pkl", "rb") as f:  # 'rb' mode is for reading in binary format
        slide_plans = pkl.load(f)
        print("\n\n".join([str(slide.format_json()) for slide in slide_plans]))
        slide_planner.SlidePlans = slide_plans
else:
    slide_planner.generate_slide_plan()
    with open(f"Research/temp_data/temp_slideplan_{current_topic}.pkl", "wb") as f:  # 'wb' mode is for writing in binary format
        pkl.dump(slide_planner.SlidePlans, f)

if os.path.exists(f"Research/temp_data/temp_slides_{current_topic}.pkl"):
    # Load the object from the file
    with open(f"Research/temp_data/temp_slides_{current_topic}.pkl", "rb") as f:  # 'rb' mode is for reading in binary format
        slides = pkl.load(f)
        slide_planner.Slides = slides
else:
    slide_planner.generate_slide_deque()
    with open(f"Research/temp_data/temp_slides_{current_topic}.pkl", "wb") as f:  # 'wb' mode is for writing in binary format
        pkl.dump(slide_planner.Slides, f)
import json
print(json.dumps(slide_planner.format_json(), indent=4))

{'title': 'Introduction to Natural Language Processing (NLP) and Regular Expressions', 'purpose': 0, 'purpose_statement': 'To give the student an initial overview of NLP with a focus on Regular Expressions and how they play a foundational role in text analysis.', 'concepts': ['Natural Language Processing (NLP)', 'Regular Expressions', 'Text Normalization', 'Tokenization of Text', 'Edit Distance']}

{'title': 'Decoding Patterns with Regular Expressions in NLP', 'purpose': 0, 'purpose_statement': 'This slide will serve as an initial deep dive into the world of Regular Expressions, enabling students to understand their syntax, basic operators, and fundamental uses in NLP, setting the groundwork for more sophisticated text processing tasks.', 'concepts': ['Regular Expressions']}

{'title': 'Discovering Lemmatization: Enhancing Text Analysis in NLP', 'purpose': 0, 'purpose_statement': "This slide aims to introduce Lemmatization as an essential NLP text preprocessing technique, building on t

# Questions
- Preprocessing
- Generation of questions from (Expert, GPT, LLMaAiT-BE)
- Comparison and Data Analysis

In [None]:
### Question generation from AITutor
notebank = NoteBank()
for n in current_plan.split("\n"):
    notebank.add_note(n)
q_suite = QuestionSuite(5, notebank, concept_db)
# Check if the file exists
if os.path.exists(f"Research/temp_data/temp_questions_{current_topic}.pkl"):
    # Load the object from the file
    with open(f"Research/temp_data/temp_questions_{current_topic}.pkl", "rb") as f:  # 'rb' mode is for reading in binary format
        questions = pkl.load(f)
        q_suite.Questions = questions
else:
    q_suite.generate_question_data()
    with open(f"Research/temp_data/temp_questions_{current_topic}.pkl", "wb") as f:  # 'wb' mode is for writing in binary format
        pkl.dump(q_suite.Questions, f)

print(json.dumps(q_suite.format_json(), indent=4))

In [None]:
### Testing Concepts

coding_concepts_aitutor = "Research/generation_data/concept_graph/Teachabull/codingConcepts_teachabull.json"
with open(coding_concepts_aitutor, "r") as f:
    coding_concepts_aitutor = json.load(f)
coding_concepts_aitutor = {"concepts": [
    {"name": concept['name']} for concept in coding_concepts_aitutor['concepts']
]}

# Normalize and create embedding
coding_concepts_aitutor = json.dumps(coding_concepts_aitutor, indent=4)
coding_concepts_aitutor = create_embeddings(coding_concepts_aitutor)
# expert
coding_concepts_expert = "Research/generation_data/concept_graph/Expert/codingConcepts_expert.json"
with open(coding_concepts_expert, "r") as f:
    coding_concepts_expert = json.load(f)
coding_concepts_expert = {"concepts": [
    {"name": concept['name']} for concept in coding_concepts_expert['concepts']
]}
coding_concepts_expert = json.dumps(coding_concepts_expert, indent=4)
coding_concepts_expert = create_embeddings(coding_concepts_expert)

# chatgpt
coding_concepts_chatgpt = "Research/generation_data/concept_graph/ChatGPT/codingConcepts_chatgpt4.json"
with open(coding_concepts_chatgpt, "r") as f:
    coding_concepts_chatgpt = json.load(f)
coding_concepts_chatgpt = {"concepts": [
    {"name": concept['name']} for concept in coding_concepts_chatgpt['concepts']
]}
coding_concepts_chatgpt = json.dumps(coding_concepts_chatgpt, indent=4)
coding_concepts_chatgpt = create_embeddings(coding_concepts_chatgpt)

coschatgpt = cosine_similarity(coding_concepts_chatgpt[0], coding_concepts_expert[0])
cosaitutor = cosine_similarity(coding_concepts_aitutor[0], coding_concepts_expert[0])
print(coschatgpt, cosaitutor)

0.9867042631283537 0.9690687772047663


In [None]:
### Testing Slides

coding_slides_aitutor = "Research/generation_data/slides/Teachabull/codingSlides_aitutor.json"
with open(coding_slides_aitutor, "r") as f:
    coding_slides_aitutor = json.load(f)
coding_slides_aitutor = {"slides": [
    {"title": slide['title'],"content": slide['content']} for slide in coding_slides_aitutor['slides']
]}

# Normalize and create embedding
coding_slides_aitutor = json.dumps(coding_slides_aitutor, indent=4)
coding_slides_aitutor = create_embeddings(coding_slides_aitutor)
# expert
coding_slides_expert = "Research/generation_data/slides/Expert/codingSlides_expert.json"
with open(coding_slides_expert, "r") as f:
    coding_slides_expert = json.load(f)
coding_slides_expert = {"slides": [
    {"title": slide['Title'],"content": slide['Description']} for slide in coding_slides_expert['slides']
]}
coding_slides_expert = json.dumps(coding_slides_expert, indent=4)
coding_slides_expert = create_embeddings(coding_slides_expert)

# chatgpt
coding_slides_chatgpt = "Research/generation_data/slides/ChatGPT/codingSlides_chatgpt.json"
with open(coding_slides_chatgpt, "r") as f:
    coding_slides_chatgpt = json.load(f)
coding_slides_chatgpt = {"slides": [
    {"title": slide['Title'],"content": slide['Description']} for slide in coding_slides_chatgpt['slides']
]}
coding_slides_chatgpt = json.dumps(coding_slides_chatgpt, indent=4)
coding_slides_chatgpt = create_embeddings(coding_slides_chatgpt)

coschatgpt = cosine_similarity(coding_slides_chatgpt[0], coding_slides_expert[0])
cosaitutor = cosine_similarity(coding_slides_aitutor[0], coding_slides_expert[0])
print(coschatgpt, cosaitutor)

0.9469275623592007 0.9354448040938711


In [13]:
### Testing Questions

coding_questions_aitutor = "Research/generation_data/questions/Teachabull/codingQuestions_aitutor.json"
with open(coding_questions_aitutor, "r") as f:
    coding_questions_aitutor = json.load(f)
s = ""
for i, question in enumerate(coding_questions_aitutor["questions"]):
    s+=f"{i}.\n"
    for k, v in question["data"].items():
        if isinstance(v, str):
            s+=v+"\n"
    s+="\n"
coding_questions_aitutor = s

# Normalize and create embedding
coding_questions_aitutor = create_embeddings(coding_questions_aitutor)

# expert
coding_questions_expert = "Research/generation_data/questions/Expert/codingQuestions_expert_RAW.txt"
with open(coding_questions_expert, "r") as f:
    coding_questions_expert = f.read()

coding_questions_expert = create_embeddings(coding_questions_expert)

# chatgpt
coding_questions_chatgpt = "Research/generation_data/questions/ChatGPT/codingQuestions_chatgpt_RAW.txt"
with open(coding_questions_chatgpt, "r") as f:
    coding_questions_chatgpt = f.read()

coding_questions_chatgpt = create_embeddings(coding_questions_chatgpt)

coschatgpt = cosine_similarity(coding_questions_chatgpt[0], coding_questions_expert[0])
cosaitutor = cosine_similarity(coding_questions_aitutor[0], coding_questions_expert[0])
print(coschatgpt, cosaitutor)

0.8775049669925067 0.8033181749423033
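
To read the three comparisons side by side, the scores printed above can be collected in one place. This is a small sketch: the numbers are copied from the cell outputs above, while the `scores` layout and the gap column are illustrative choices and not part of the original analysis.

```python
# Cosine similarity to the expert documents, copied from the cell outputs above
scores = {
    "concepts": {"chatgpt": 0.9867042631283537, "aitutor": 0.9690687772047663},
    "slides": {"chatgpt": 0.9469275623592007, "aitutor": 0.9354448040938711},
    "questions": {"chatgpt": 0.8775049669925067, "aitutor": 0.8033181749423033},
}

# Gap between the ChatGPT score and the AI Tutor score for each phase
gaps = {phase: s["chatgpt"] - s["aitutor"] for phase, s in scores.items()}

for phase, s in scores.items():
    print(f"{phase:<10} chatgpt={s['chatgpt']:.4f} "
          f"aitutor={s['aitutor']:.4f} gap={gaps[phase]:+.4f}")
```

All three scores come from a single topic, so the gaps describe this run only.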


# ERRORS


For errors, we measure the number of API calls per error: API calls during translation / errors during translation.
We compare this for GPT-4 and GPT-3.5.
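
The ratio described here can be sketched as a small function. This is only an illustration under stated assumptions: `api_calls_per_error` is a hypothetical name, the counts in the example are made-up placeholders rather than measured results, and returning `None` for a run with no errors is one possible convention.

```python
def api_calls_per_error(num_api_calls, num_errors):
    """Return API calls per error, or None when no errors occurred."""
    if num_api_calls < 0 or num_errors < 0:
        raise ValueError("counts must be non-negative")
    if num_errors == 0:
        return None  # ratio is undefined without errors
    return num_api_calls / num_errors

# Placeholder counts for illustration only, not measurements
example_runs = {
    "model_a": {"api_calls": 120, "errors": 4},
    "model_b": {"api_calls": 50, "errors": 0},
}
for name, run in example_runs.items():
    print(name, api_calls_per_error(run["api_calls"], run["errors"]))
```

A higher value means fewer errors per call, so under this convention larger is better.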

### CONCEPTS: RATIO OF NUMBER OF RELEVANT CONCEPTS OVER NUMBER OF CONCEPTS
GPT-3.5 vs GPT-4
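
One way this ratio could be computed is sketched below. It assumes relevance is decided by a case-insensitive exact match against a reference list of concept names, which is a simplification; `relevant_concept_ratio` is a hypothetical helper and the example lists are illustrative, not generated data.

```python
def relevant_concept_ratio(generated, relevant):
    """Return the share of generated concept names found in the relevant list."""
    generated_set = {name.strip().lower() for name in generated}
    relevant_set = {name.strip().lower() for name in relevant}
    if not generated_set:
        return 0.0  # nothing generated, so nothing relevant
    return len(generated_set & relevant_set) / len(generated_set)

# Illustrative lists only
generated = [
    "Graph Traversal Algorithms (DFS and BFS)",
    "Network Flow",
    "History of Computing",
]
relevant = ["graph traversal algorithms (dfs and bfs)", "network flow"]
ratio = relevant_concept_ratio(generated, relevant)
print(f"{ratio:.2f}")
```

Exact matching will miss paraphrased concept names, so a human judgment or an embedding similarity threshold may be a better relevance test in practice.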