# Research task

**Analyze the problem's data:** 

Review the step_title, step_description, the wrong_code_submission, and the correct_code_submission from the attached JSON files. The default_feedback represents the default response from Hyperskill's testing framework, and the gpt_current_feedback is the current AI-generated feedback. Use these as reference points for the improved feedback your solution will provide.

**Develop improved prompts:**   

Create new prompt(s) that would guide an AI model (like gpt-3.5-turbo or mixtral-small) to produce better feedback for the incorrect code submissions. Your prompts should aim to:
- Highlight the specific errors in the student's code.
- Offer actionable advice for correcting these errors.
- Encourage best coding practices without providing the solution directly.
- Consider the code_template and test_cases fields in your analysis to understand the initial setup and examples provided to the students, which may influence the context of their code submissions and the feedback needed.

# The approach:
1. **Research the dataset**
2. **Develop a prompt-building and evaluation system**
3. **Test initial prompt results**
4. **Test a simple prompt**
5. **Make several attempts to improve it**
6. **Compare the results of the original and improved prompts** 
7. **Draw conclusions**


-----------

# Importing libraries

In [1]:
from ipywidgets import HTML
import pandas as pd
import yaml
import markdown
import json

# Exploring data
### Checking input data

In [2]:
with open('./input_data/python_code_problems_submissions.json', 'r') as f:
    python = json.load(f)
    
with open('./input_data/java_code_problems_submissions.json', 'r') as f:
    java = json.load(f)


python_df = pd.DataFrame(python['python']['code_problems'])
java_df = pd.DataFrame(java['java']['code_problems'])
print("Python df:")
display(python_df.head(2))
print("Java df:")
display(java_df.head(2))

Python df:


Unnamed: 0,topic_id,topic_title,topic_theory_url,step_id,step_title,step_description,step_url,code_template,test_cases,wrong_code_submission,default_feedback,gpt_current_feedback,correct_code_submission
0,399,Program with numbers,https://hyperskill.org/learn/step/5872,6464,Divide nuts equally between squirrels,<p></p><p>N squirrels found K nuts and decided...,https://hyperskill.org/learn/step/6464,# put your python code here,"[{'input': '3 14', 'output': '4'}]","# put your python code here\nN=int(input(""Ente...",Failed test #1 of 6. Wrong answer\n\nThis is a...,1. Ensure to separate inputs clearly by adding...,N = int(input())\nK = int(input())\nprint(K //...
1,534,"Function, arguments and parameters",https://hyperskill.org/learn/step/7248,10302,Equation,"<p></p>\n<p>Write a function <code class=""java...",https://hyperskill.org/learn/step/10302,def equation_writing(...):\n ...,"[{'input': '3 4 3', 'output': '3 x + 4 = 3'}]","def equation_writing(a, b, c):\n print(a, '...",Failed test #1 of 3. Wrong answer\n\nThis is a...,1. Change the existing print statement to use ...,"def equation_writing(a, b, c):\n print(f""{a..."


Java df:


Unnamed: 0,topic_id,topic_title,topic_theory_url,step_id,step_title,step_description,step_url,code_template,test_cases,wrong_code_submission,default_feedback,gpt_current_feedback,correct_code_submission
0,89,For-loop,https://hyperskill.org/learn/step/3505,2187,The sum of integers from a to b,<p>Print the sum of all integers from <strong>...,https://hyperskill.org/learn/step/2187,import java.util.Scanner;\n\nclass Main {\n ...,"[{'input': '5 21 ', 'output': '221'}]",import java.util.Scanner;\n\nclass Main {\n ...,Failed test #1 of 10. Runtime error\n\nThis is...,1. Correct the input reading: Ensure you read ...,import java.util.Scanner;\n\nclass Main {\n ...
1,7,Grouping classes with packages,https://hyperskill.org/learn/step/3621,9896,Sorting arrays,"<p></p>\n<p>The class <code class=""java"">Array...",https://hyperskill.org/learn/step/9896,class ArraySorting {\n /**\n * @param a...,"[{'input': 'y d b u t f m h x q', 'output': 'b...",class ArraySorting {\n /**\n * @param a...,Compilation error\nMain.java:17: error: incomp...,1. Change java.util.Arrays.sort(array) to java...,class ArraySorting {\n /**\n * @param a...


In [3]:
python_test_cases_count = len(python_df)
print("Python test cases")
python_test_cases_count

Python test cases


10

In [4]:
java_test_cases_count = len(java_df)
print("Java test cases")
java_test_cases_count

Java test cases


10

In [6]:
print("Submission response keys:")
list(python_df.columns)

Submission response keys:


['topic_id',
 'topic_title',
 'topic_theory_url',
 'step_id',
 'step_title',
 'step_description',
 'step_url',
 'code_template',
 'test_cases',
 'wrong_code_submission',
 'default_feedback',
 'gpt_current_feedback',
 'correct_code_submission']

In [7]:
def render_data(data, llm_feedback=None, passed=None, grading=None):
    # Display topic title
    print(f"\n{'=' * 80}\nTopic Title: {data['topic_title']}\n{'=' * 80}")

    # Display step title with link to the theory
    print(f"Step Title: {data['step_title']} (Link: {data['topic_theory_url']})")

    # Display the description of the step
    print(f"\nDescription:\n{data['step_description']}")

    # Wrong Code submission
    print(f"\nWrong Code Submission:\n{data['wrong_code_submission']}")

    # Display test cases
    print("\nTest Cases:")
    for case in data['test_cases']:
        print(f"  Input: {case['input']} Output: {case['output']}")

    # Display feedback from a wrong submission
    print(f"\nFeedback for Wrong Submission:\n{data['default_feedback']}")

    # Display feedback from GPT
    print(f"\nOriginal GPT Feedback:\n{data['gpt_current_feedback']}")

    # Display correct code submission
    print(f"\nCorrect Code Submission:\n{data['correct_code_submission']}")

    if llm_feedback:
        print("\n" + "-" * 80)
        if 'User-Facing Feedback:' in llm_feedback:
            llm_feedback = llm_feedback.split('User-Facing Feedback:')[-1]
        print(f"GPT New Feedback (Passed: {passed}):\n{llm_feedback}")
        print("\nEvaluation Feedback:")
        print(f"Score: {grading['score']}. Passed: {grading['pass']}")
        print(f"Reason:\n{grading['reason']}\n")


In [9]:
step = 3
data = python['python']['code_problems'][step]
render_data(data)


Topic Title: Loop control: break, continue, pass
Step Title: Game over (Link: https://hyperskill.org/learn/step/6302)

Description:
<p></p><p>In online test games, there is usually a limited number of lives: if, for example, you make 3 mistakes, you lose, and cannot continue with the game. Let's try to implement this system.</p><p><br/>Read the line with user answers from the input<strong>: C </strong>for a correct answer and <strong>I </strong>for an incorrect one, separated by space. The objective is to count the final score, which is the number of correct answers. Count them until the user reaches <strong>3 incorrect answers</strong>. If they make no more than 2 mistakes, they win. In this case print <strong>"You won" </strong>and their score, otherwise print "<strong>Game over" </strong>along with their score. Display the message and score on separate lines, without quotation marks.</p><p><strong>The input format:</strong></p><p>A line with <strong>N</strong> user answers in a gam

# Making test-cases

## Making assertion rules

In [10]:
assertion_rules = [
    "The user-faced feedback does not give direct answer",
    "The feedback is correct and highlight the specific errors in the student's code.",
    "The feedback offers actionable advice and encourages best coding practices.",
    "The feedback is easy to understand for a adult programming newbie with no experience."
]

In [11]:
def extract_important_data(data):
    return {
#         "topic": data['topic_title'],
#         "task_name": data['step_title'],
        "task_description": data['step_description'],
        "test_cases": data['test_cases'],
        "submission_system_output": data['default_feedback'],
#         "init_code_template": data['code_template'],
        "user_wrong_answer": data['wrong_code_submission'],
        "test_correct_answer": data['correct_code_submission'],
        # We will ignore it in the future
        "gpt_current_feedback": data['gpt_current_feedback'],
    }

In [12]:
def make_testcase(prefix, id_, assertion_rules, extracted_data):
    return {
        "vars": {
            "id": f"{prefix}_{id_}", 
            **extracted_data
        },
        "assert": [
            {
                "type": "model-graded-closedqa",
                "value": rule,
            }
            for rule in assertion_rules
        ]
    }

def make_testcases(prefix, test_data, assertion_rules, extract_data):
    return [
        make_testcase(prefix, i, assertion_rules, extract_data(t))
        for i, t in enumerate(test_data)
    ]

# Writing data to a YAML file
with open('./tests/python.yaml', 'w') as file:
    python_test_cases = make_testcases(
        "python",
        test_data=python['python']['code_problems'],
        extract_data=extract_important_data,
        assertion_rules=assertion_rules,
    )
    yaml.dump(python_test_cases, file, default_flow_style=False, indent=2)
    
    
# Writing data to a YAML file
with open('./tests/java.yaml', 'w') as file:
    java_test_cases = make_testcases(
        "java",
        test_data=java['java']['code_problems'],
        extract_data=extract_important_data,
        assertion_rules=assertion_rules,
    )
    yaml.dump(java_test_cases, file, default_flow_style=False, indent=2)
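
To make the generated structure concrete, here is a minimal sketch of one test case as `make_testcase` would build it. The record and the single rule below are hypothetical, shortened stand-ins for real dataset entries. The helper is repeated so the block runs on its own, and the result is printed as JSON rather than YAML only to avoid an extra dependency.

```python
import json


def make_testcase(prefix, id_, assertion_rules, extracted_data):
    # Same shape as the helper defined above: variables plus one graded assertion per rule.
    return {
        "vars": {"id": f"{prefix}_{id_}", **extracted_data},
        "assert": [
            {"type": "model-graded-closedqa", "value": rule}
            for rule in assertion_rules
        ],
    }


# Hypothetical, shortened record used only for illustration.
sample_data = {
    "task_description": "Print the film title in all caps.",
    "user_wrong_answer": "print(word.upper())",
}
sample_rules = ["The feedback does not give the answer directly."]

testcase = make_testcase("python", 4, sample_rules, sample_data)
print(json.dumps(testcase, indent=2))
```

Each rule becomes its own assertion entry, so a test case passes only if the grader accepts the response against every rule.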

# Evaluation function

In [13]:
def find_async_feedback_by_id(prefix, feedbacks, id_):
    for feedback in feedbacks['results']['results']:
        if feedback['vars']['id'] == f"{prefix}_{id_}":
            return feedback

In [14]:
def view_results_table():
    !npx promptfoo@latest view -y

In [15]:
def make_eval(config_file_path, output_file_path):
    !npx promptfoo@latest eval -c $config_file_path --no-progress-bar --output $output_file_path

# Original prompt evaluation

## Defining mock functions to evaluate the initial prompt results

In [13]:
%%writefile providers/mock_provider.py
import json

def call_api(prompt, options, context):
    result = {
        "output": context['vars']['gpt_current_feedback'],
    }
    return result

Overwriting providers/mock_provider.py


In [14]:
%%writefile prompts/mock_prompt.py
import json
import sys

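def mock_prompt_function(context: dict) -> str:
    # Build the prompt from a filtered copy of the vars instead of deleting the key,
    # so the caller's context is not mutated.
    prompt_vars = {k: v for k, v in context['vars'].items() if k != 'gpt_current_feedback'}

    # Mocked prompt, does not matter what it contains
    return f"Give some feedback for this submission {json.dumps(prompt_vars)}"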

if __name__ == "__main__":
    # If you don't specify a `function_name` in the provider string, it will run the main
    print(mock_prompt_function(json.loads(sys.argv[1])))

Overwriting prompts/mock_prompt.py


## Making init evaluation config

In [15]:
%%writefile configs/init.yaml
description: 'Init Eval'

prompts:
  - file://../prompts/mock_prompt.py
    

providers:
  - id: 'python:../providers/mock_provider.py'

tests:
  - ../tests/python.yaml
  - ../tests/java.yaml



Overwriting configs/init.yaml


In [16]:
make_eval(
    "./configs/init.yaml",
    "./outputs/init.json",
)


┌────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┐
│ gpt_c… │ id     │ submi… │ task_… │ test_… │ test_… │ user_… │ import │
│        │        │        │        │        │        │        │ json   │

Done.


In [17]:
# view_results_table()

----

## ORIGINAL PROMPT TOTAL SCORE: 50%

----
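
The notebook does not show how the total score is derived. One plausible reading, and only an assumption here, is the share of test cases whose `success` flag is true in the promptfoo output file. The sketch below computes that on a hypothetical, hand-written result set that mimics the `results.results` nesting used by `find_async_feedback_by_id`.

```python
def pass_rate(feedbacks):
    """Return the percentage of results marked as successful."""
    results = feedbacks["results"]["results"]
    if not results:
        return 0.0
    passed = sum(1 for r in results if r["success"])
    return 100.0 * passed / len(results)


# Hypothetical output with the same nesting as the promptfoo JSON loaded in this notebook.
sample_feedbacks = {
    "results": {
        "results": [
            {"vars": {"id": "python_0"}, "success": True},
            {"vars": {"id": "python_1"}, "success": False},
            {"vars": {"id": "java_0"}, "success": True},
            {"vars": {"id": "java_1"}, "success": False},
        ]
    }
}

print(f"Total score: {pass_rate(sample_feedbacks):.0f}%")
```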

# First attempt to improve prompt results

In [28]:
%%writefile prompts/prompt_1.txt

This is feedback of the coding testing system. 
Write a human feedback explaining the student what to fix

task_description: {{task_description}}
test_cases: {{test_cases}}
submission_system_output: {{submission_system_output}}
user_wrong_answer: {{user_wrong_answer}}
test_correct_answer: {{test_correct_answer}}

Highlight the specific errors in the student's code.
Offer actionable advice for correcting these errors.
Encourage best coding practices without providing the solution directly.
Make the answer simple to understand for a adult programming newbie with no experience.

Overwriting prompts/prompt_1.txt


In [29]:
%%writefile configs/first_attempt.yaml
description: 'First improving attempt Eval'

prompts:
  - file://../prompts/prompt_1.txt
    

providers:
  - openai:gpt-3.5-turbo

tests:
  - ../tests/python.yaml
  - ../tests/java.yaml



Overwriting configs/first_attempt.yaml


In [30]:
make_eval(
    "./configs/first_attempt.yaml",
    "./outputs/first_attempt.json",
)


┌────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┐
│ gpt_c… │ id     │ submi… │ task_… │ test_… │ test_… │ user_… │ This   │
│        │        │        │        │        │        │        │ is     │

Done.


----

## FIRST ATTEMPT TOTAL SCORE: 70%

----

# Second attempt to improve prompt results

In [31]:
%%writefile prompts/prompt_2.txt

Submission system output:

task_description: {{task_description}}
test_cases: {{test_cases}}
submission_system_output: {{submission_system_output}}
user_wrong_answer: {{user_wrong_answer}}
test_correct_answer: {{test_correct_answer}}
    
Based on the provided system feedback, 
perform an internal analysis of the specific errors identified in the student's wrong code submission. 
This internal analysis should summarize the issues without resolving them and is meant for internal
processing only. 
Following the analysis, prepare a user-facing feedback message that explains 
these errors and offers actionable advice without directly solving the problem. 
The output should be structured as follows:

Internal Analysis:
- Identify specific errors based on the error message and the code submitted.
- Assess the implications of these errors on the student's understanding of the programming concepts.

++++++
User-Facing Feedback:
- IMPORTANT: Highlight the specific errors identified in student's code.
- Provide structured, actionable advice on how to correct these errors.
- Encourage best coding practices and explain these practices in a manner simple enough for an adult programming novice.
- Use a list format for clarity and simplicity.
IMPORTANT: Do NOT provide the correct solution directly, only hints and guidance.
IMPORTANT: Follow the answer stucture of the examples


Examples of complete output:

Example 1:
```
Internal Analysis:
- Error: 'NameError' for undefined variable 'word'.
- Issue: The student did not define 'word' before attempting to use it, indicating a misunderstanding of how variables must be declared and used.

++++++
User-Facing Feedback:
1. **Issue Identified:** You attempted to use a variable named 'word' which wasn't defined anywhere in your code, resulting in a 'NameError'.
2. **Actionable Advice:** Start by defining all variables before they are used. For instance, include a line such as 'word = input("Please enter a word: ")' at the beginning of your script.
3. **Best Practice:** Always test your program with different inputs to ensure that all variables are properly initialized and used within their scope. This practice helps avoid common errors and improves the reliability of your code.
```

Example 2:
```
Internal Analysis:
- Error: 'SyntaxError' due to missing colon after an if statement.
- Issue: The student forgot to put a colon at the end of the if statement, which is crucial for defining block scopes in Python.

++++++
User-Facing Feedback:
1. **Issue Identified:** Your if statement is missing a colon at the end, which caused a 'SyntaxError'.
2. **Actionable Advice:** Remember to add a colon right after the condition in your if statements. This tells Python that you're about to start a block of statements that should only execute under certain conditions.
3. **Best Practice:** Writing small test cases to check each part of your code incrementally can help catch these types of syntax errors before they affect larger parts of your program.
```

Example 3:
```
Internal Analysis:
- Error: 'IndexError' because the code attempts to access a list index that is out of range.
- Issue: The student seems unaware of how to properly check the bounds of a list before accessing its elements.

++++++
User-Facing Feedback:
1. **Issue Identified:** You tried to access a list element with an index that exceeds the list's boundaries, leading to an 'IndexError'.
2. **Actionable Advice:** Always check that your index is within the valid range of the list by comparing it against the length of the list minus one. For example, you can use a condition like `if index < len(my_list):` before accessing `my_list[index]`.
3. **Best Practice:** Implementing boundary checks for list accesses is a crucial habit that can prevent runtime errors and ensure your program handles data gracefully.
```

Writing prompts/prompt_2.txt
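
The prompt above asks the model for two parts, an internal analysis and a user-facing message. A minimal sketch of separating them follows. It assumes the model reproduces the `User-Facing Feedback:` label verbatim, which is the same assumption `render_data` makes. The function name `split_feedback` and the sample response are hypothetical.

```python
def split_feedback(response, marker="User-Facing Feedback:"):
    """Split a model response into (internal_analysis, user_facing_feedback)."""
    if marker not in response:
        # No marker found: treat the whole response as user-facing text.
        return "", response.strip()
    # Split at the last occurrence of the marker, like split(marker)[-1] does.
    internal, _, user_facing = response.rpartition(marker)
    # Drop the '++++++' separator line and surrounding whitespace.
    internal = internal.replace("++++++", "").strip()
    return internal, user_facing.strip()


sample_response = """Internal Analysis:
- Error: 'NameError' for undefined variable 'word'.

++++++
User-Facing Feedback:
1. **Issue Identified:** The variable 'word' is used before it is defined.
"""

internal, user_facing = split_feedback(sample_response)
print(user_facing)
```

Only the second part should reach the student; the first part is there to make the model reason before it writes the hint.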


In [32]:
%%writefile configs/second_attempt.yaml
description: 'Second improving attempt Eval'

prompts:
  - file://../prompts/prompt_2.txt
    

providers:
  - openai:gpt-3.5-turbo

tests:
  - ../tests/python.yaml
  - ../tests/java.yaml



Writing configs/second_attempt.yaml


In [33]:
make_eval(
    "./configs/second_attempt.yaml",
    "./outputs/second_attempt.json",
)


┌────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┐
│ gpt_c… │ id     │ submi… │ task_… │ test_… │ test_… │ user_… │ Submi… │
│        │        │        │        │        │        │        │ system │

│ string │        │ Wrong  │ that   │        │ # run  │ # run  │ Error: │
│ is in  │        │ answer │ takes  │        │ the    │ the    │ The    │
│ the    │        │        │ two    │        │ membe… │ membe… │ wrong  │
│ second │        │        │ strin… │        │ test   │ test   │ answer │
│ string │        │        │ checks │        │ below  │ below  │ is due │
│ using  │        │        │ wheth… │        │ print… │ print… │ to     │

Done.


----

## SECOND ATTEMPT TOTAL SCORE: 90%

----

----

# TL;DR

- The original GPT feedback supplied with the task inputs scores 50% against the assertion rules I created.
- After two prompt iterations, the GPT feedback improved to a 90% score.

---

# Conclusion



I think a score of 90% is enough to demonstrate the prompt improvement, so let's end the improvement session here.


## Which corner cases the evaluation system does not cover yet

- The evaluation system has no assertion that all code issues are covered (this needs manual dataset markup).
- The evaluation system has no assertion on the response format (this needs to be agreed with the backend developer).
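
As a sketch of what a response-format assertion could look like, the check below verifies that the user-facing part is a numbered list containing the three labels used in the prompt examples. The function name and the exact rules are assumptions; the real format would have to be agreed with the backend developer, and wiring the check into promptfoo is left out.

```python
import re

REQUIRED_LABELS = ("Issue Identified", "Actionable Advice", "Best Practice")


def has_expected_format(feedback):
    """Return True if feedback is a numbered list that contains every required label."""
    # Collect lines that start with '1.', '2.', ... as list items.
    items = [
        line.strip()
        for line in feedback.splitlines()
        if re.match(r"^\s*\d+\.\s", line)
    ]
    if len(items) < len(REQUIRED_LABELS):
        return False
    return all(
        any(f"**{label}:**" in item for item in items)
        for label in REQUIRED_LABELS
    )


good = (
    "1. **Issue Identified:** 'word' is not defined.\n"
    "2. **Actionable Advice:** Define the variable before using it.\n"
    "3. **Best Practice:** Test with several inputs.\n"
)
bad = "Define a variable and use upper()."

print(has_expected_format(good), has_expected_format(bad))
```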

## Why a score of 90% is acceptable

- GPT is not fully deterministic, so reaching a 100% score may take a lot of time.
- We still have to add more requirements to the prompt assertions to make sure the prompt produces output of good quality and format.

## Results comparison (original and improved)

Let's take a good result and compare it with the initial model answer:

In [17]:
with open("./outputs/second_attempt.json", 'r') as f:
    feedbacks2 = json.load(f)

In [18]:
step = 4
data = python['python']['code_problems'][step]
feedback = find_async_feedback_by_id("python", feedbacks2, step)
llm_feedback = feedback['response']['output']
passed = feedback['success']
grading = feedback['gradingResult']
render_data(data, llm_feedback, passed, grading)


Topic Title: Working with strings: basic methods
Step Title: Poster artist (Link: https://hyperskill.org/learn/step/6842)

Description:
<p>Imagine that you design film posters for a living. Write a program that prints each film title in all caps.</p>

Wrong Code Submission:
print(word.upper())





Test Cases:
  Input: the lion king Output: THE LION KING

Feedback for Wrong Submission:
Failed test #1 of 7. Runtime error

This is a sample test from the problem statement!

Test input:
the lion king
Correct output:
THE LION KING

Your code output:


Error:
Traceback (most recent call last):
  File "/sandbox/main.py", line 1, in <module>
    print(word.upper())
          ^^^^
NameError: name 'word' is not defined. Did you mean: 'ord'?


Original GPT Feedback:
1. Define a variable and assign it the input provided.
2. Use the variable with the film title to apply the `upper()` method for converting to uppercase.
3. Ensure the variable defined in step 1 is used inside the `print()` function 

**Initial model answer**
1. Define a variable and assign it the input provided.
2. Use the variable with the film title to apply the `upper()` method for converting to uppercase.
3. Ensure the variable defined in step 1 is used inside the `print()` function for the correct output.


----

**New GPT Feedback**
1. **Issue Identified:** You attempted to use a variable named 'word' which wasn't defined anywhere in your code, resulting in a 'NameError'.
2. **Actionable Advice:** Start by defining all variables before they are used. For instance, include a line such as 'word = input("Please enter a word: ")' at the beginning of your script.
3. **Best Practice:** Always test your program with different inputs to ensure that all variables are properly initialized and used within their scope. This practice helps avoid common errors and improves the reliability of your code.

----
**Prompt comparison conclusions**

- The new prompt gives a more human-like explanation, which is especially helpful for newbies.
- It gives a more general hint, but also adds best-practice advice.

## How to improve the prompt evaluation system
- Add more examples
- Perform manual data markup with feedback and add it to asserts for test cases

## How to set up continuous prompt improvement

- Once a day, take a few samples of the responses from production.
- Evaluate them and build a dashboard with the average score metric.
- Monitor it and set improvement goals.
- Continuously add examples to the testing dataset.
- Collect examples of bad responses and add them to the asserts to make the scoring metric more accurate.
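
The monitoring loop above could start from something as small as the sketch below, which averages evaluation scores per day. The sample records, the field names, and the 0 to 1 score scale are hypothetical; in practice they would come from evaluating sampled production responses.

```python
from collections import defaultdict
from statistics import mean


def daily_average_scores(records):
    """Group records by date and return the mean score per day, ordered by date."""
    by_day = defaultdict(list)
    for record in records:
        by_day[record["date"]].append(record["score"])
    return {day: mean(scores) for day, scores in sorted(by_day.items())}


# Hypothetical evaluation results for sampled production responses.
sample_records = [
    {"date": "2024-05-01", "score": 1.0},
    {"date": "2024-05-01", "score": 0.5},
    {"date": "2024-05-02", "score": 0.75},
]

for day, avg in daily_average_scores(sample_records).items():
    print(day, f"{avg:.2f}")
```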