In [21]:
import os
print(os.getcwd())

/Users/akshaysundru2004/Desktop/Year 4/AI-Learning-Outcome-Builder/app


# Bulk Testing AI Evaluation of Learning Outcomes

This notebook is an interactive record of the team's procedure for evaluating the efficacy of the AI tool developed for this project. The main purpose of these tests is to evaluate the AI's ability to classify unit learning outcomes accurately according to Bloom's taxonomy and in line with expert review, and to grade the quality of the AI's rewrites.

The notebook walks through each test conducted by the team, step by step. Each test can be executed individually, and the accompanying notes explain its purpose and methodology.

The first step was to gather the data sources used to perform the tests. As agreed and recorded in meetings between the team and the client, the client would consult experts and provide a set of flagged learning outcomes to serve as the test set for our AI, in addition to a larger set of unit learning outcome data. Using this dataset, we can compute metrics such as the F1 score, confusion matrix and ROUGE score to evaluate the AI's capabilities.

### Pre-Cleaning

Before we can construct the AI tests, we must first clean the data provided to us. For the test CSV data, we wrote a SQL query that joins unit information with the corresponding learning outcomes, then exported the joined table to a CSV. This CSV contained the following columns:

- Unit-id: The unique id assigned to each unit in the Unit table.
- Unit code: A unique 8-character code (for example, EMPL3301) assigned by UWA to each unit hosted by UWA.
- Level: The course "level" of each unit, indicating the complexity of the unit.
- Description: The text of a learning outcome. Each unit has a set of learning outcomes associated with it; in our website these outcomes are stored in their own separate table, and here we have joined the basic unit information with each learning outcome. Each learning outcome describes a goal that the unit should help students achieve or a skill that will be assessed.
- Position: This parameter is mainly used for front-end rendering on the learning outcome editor page, indicating the position of a learning outcome in the table.
- Credit Points: Completing a unit adds a certain number of credit points to a student's total. Completing a degree typically requires specific units as well as a requisite number of credit points.

After extracting this data into a CSV, each learning outcome was reviewed manually and an additional column was added.

- Flag: This parameter indicates human analysis of learning outcomes, and categorises learning outcomes into three levels, 'Good', 'Needs Revision' and 'Could Improve'.

The first step of cleaning is to load this CSV into Python using a pandas dataframe.

In [49]:
import pandas as pd
import os


testing_set = pd.read_csv("de.csv")

In [50]:
testing_set.head()

Unnamed: 0,unit id,unitcode,level,description,position,flag
0,4,EMPL3301,3,describe the core debates over the meaning of ...,1,GOOD
1,4,EMPL3301,3,explain the relationship between globalisation...,2,GOOD
2,4,EMPL3301,3,identify organisations and institutions centra...,3,NEEDS_REVISION
3,4,EMPL3301,3,gain a critical appreciation of how globalisat...,4,NEEDS_REVISION
4,4,EMPL3301,3,develop a critical understanding of individual...,5,GOOD


Running the commands above shows the first 5 rows of the test dataset. The pandas library allows us to perform some elementary analysis of these outcomes, the first of which is counting the number of outcomes in each of the three flag categories.

In [54]:
good_rows = testing_set[testing_set.flag == "GOOD"]
needs_revision_rows = testing_set[testing_set.flag == "NEEDS_REVISION"]
could_improve_rows = testing_set[testing_set.flag == "COULD_IMPROVE"]

count_good = len(good_rows)
count_needs_revision = len(needs_revision_rows)
count_could_improve = len(could_improve_rows)

print(f"The number of good learning outcomes is {count_good}.")
print(f"The number of learning outcomes that need revision is {count_needs_revision}.")
print(f"The number of learning outcomes that could improve is {count_could_improve}.")

The number of good learning outcomes is 42.
The number of learning outcomes that need revision is 19.
The number of learning outcomes that could improve is 4.


These counts give the class distribution behind the first set of tests, the F1 score and confusion matrix. These tests measure the classification accuracy of our model: they count how many outcomes receive the same flag from the AI and from the human reviewers, and how many are false positives and false negatives for each flag.
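As a rough illustration of how these scores are derived, the sketch below computes the confusion counts, per-class F1, and the macro and weighted averages by hand. The label lists are made up for the example and are not taken from the test set, and the helper names (`confusion_counts`, `f1_per_class`, `macro_f1`, `weighted_f1`) are hypothetical; the notebook itself relies on scikit-learn for the real calculation.

```python
from collections import Counter

LABELS = ["GOOD", "COULD_IMPROVE", "NEEDS_REVISION"]

def confusion_counts(y_true, y_pred):
    """Count (human_flag, ai_flag) pairs; pairs with equal flags are agreements."""
    return Counter(zip(y_true, y_pred))

def f1_per_class(y_true, y_pred, labels=LABELS):
    """Return {label: F1}, using F1 = 2*TP / (2*TP + FP + FN)."""
    pairs = confusion_counts(y_true, y_pred)
    scores = {}
    for label in labels:
        tp = pairs[(label, label)]
        fp = sum(n for (t, p), n in pairs.items() if p == label and t != label)
        fn = sum(n for (t, p), n in pairs.items() if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)

def weighted_f1(y_true, y_pred, labels=LABELS):
    """Mean of the per-class F1 scores, weighted by each class's true count."""
    scores = f1_per_class(y_true, y_pred, labels)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in labels) / len(y_true)

# Illustrative labels only (not the real test set)
human = ["GOOD", "GOOD", "GOOD", "NEEDS_REVISION", "NEEDS_REVISION", "COULD_IMPROVE"]
ai = ["GOOD", "GOOD", "NEEDS_REVISION", "NEEDS_REVISION", "GOOD", "COULD_IMPROVE"]

print(round(macro_f1(human, ai), 3), round(weighted_f1(human, ai), 3))  # 0.722 0.667
```

The macro average treats the three flags equally, so a rare flag such as COULD_IMPROVE influences it as much as GOOD does, while the weighted average follows the class distribution counted above.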

In [55]:
descriptions = testing_set["description"]

In [56]:
# Take the first five outcomes (the EMPL3301 rows shown above)
text = [str(descriptions[i]) for i in range(5)]
print(text)

# Join them into one newline-separated string to pass to run_eval below
outcomes_text = "\n".join(text)

["describe the core debates over the meaning of the term 'globalisation'", 'explain the relationship between globalisation and labour market and workplace restructuring', 'identify organisations and institutions central to globalisation and their impact on work', 'gain a critical appreciation of how globalisation reshapes the experience of work and worker identity', 'develop a critical understanding of individual and collective responses to the impact of globalisation on work']


In [57]:
outcomes_text

"describe the core debates over the meaning of the term 'globalisation'\nexplain the relationship between globalisation and labour market and workplace restructuring\nidentify organisations and institutions central to globalisation and their impact on work\ngain a critical appreciation of how globalisation reshapes the experience of work and worker identity\ndevelop a critical understanding of individual and collective responses to the impact of globalisation on work"

In [28]:
unitcode = "EMPL3301"
level = "3"
creditpoints = "6"
print(outcomes_text)

describe the core debates over the meaning of the term 'globalisation'
explain the relationship between globalisation and labour market and workplace restructuring
identify organisations and institutions central to globalisation and their impact on work
gain a critical appreciation of how globalisation reshapes the experience of work and worker identity
develop a critical understanding of individual and collective responses to the impact of globalisation on work


In [None]:
from ai_evaluate import run_eval
os.environ["GOOGLE_API_KEY"] = 'your-api-key'

response = run_eval(level=level, unit_name=unitcode, credit_points=creditpoints, outcomes_text=outcomes_text)

print(response)

**LO Analysis**

'describe the core debates over the meaning of the term 'globalisation'' - STATUS:NEEDS_REVISION - This outcome is at the Comprehension level (Bloom's Taxonomy) rather than Application. SUGGESTION: 'Apply knowledge of core debates surrounding globalisation to analyse contemporary case studies.'

'explain the relationship between globalisation and labour market and workplace restructuring' - STATUS:NEEDS_REVISION - This outcome is at the Comprehension level (Bloom's Taxonomy) rather than Application. SUGGESTION: 'Demonstrate how globalisation impacts labour market and workplace restructuring through the analysis of specific industry examples.'

'identify organisations and institutions central to globalisation and their impact on work' - STATUS:NEEDS_REVISION - This outcome is at the Knowledge level (Bloom's Taxonomy) rather than Application. SUGGESTION: 'Utilise knowledge of key organisations and institutions involved in globalisation to assess their impact on a chosen 

Here we have taken the first unit's set of learning outcomes and general information and fed them through our AI evaluation tool. The output is constructed to provide a STATUS flag, general feedback and a suggested rewrite that aligns more closely with Bloom's taxonomy. For classification testing, we can isolate the output flag with a regular expression (regex) and add it to a new column.

In [30]:
import re

matches = re.findall(r"'(.*?)'\s*-\s*STATUS:\s*(\w+)", response, re.IGNORECASE)

status_flag = []
for outcome, status in matches:
    status_flag.append(status)

status_flag

['NEEDS_REVISION',
 'NEEDS_REVISION',
 'NEEDS_REVISION',
 'COULD_IMPROVE',
 'NEEDS_REVISION']
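To see what the pattern captures, the sketch below runs it on a short sample string. The first sample line is abridged from the response above; the second line, with its lowercase `status:` and placeholder feedback, is made up to show the effect of `re.IGNORECASE`. `parse_statuses` is a hypothetical helper name, not part of `ai_evaluate`.

```python
import re

# Same pattern as in the cell above, compiled once for reuse
STATUS_PATTERN = re.compile(r"'(.*?)'\s*-\s*STATUS:\s*(\w+)", re.IGNORECASE)

def parse_statuses(response_text):
    """Return a list of (outcome, STATUS) tuples found in the AI response."""
    return [
        (outcome.strip(), status.upper())
        for outcome, status in STATUS_PATTERN.findall(response_text)
    ]

sample = (
    "'describe the core debates over the meaning of the term 'globalisation'' "
    "- STATUS:NEEDS_REVISION - This outcome is at the Comprehension level. "
    "SUGGESTION: 'Apply knowledge of core debates to case studies.'\n"
    "\n"
    "'explain the relationship between globalisation and labour market "
    "and workplace restructuring' - status: could_improve - Example feedback.\n"
)

parsed = parse_statuses(sample)
for outcome, status in parsed:
    print(status, "|", outcome)
```

Because the lazy group must be followed by `' - STATUS:`, the quoted word inside the first outcome and the quoted SUGGESTION text are not mistaken for outcomes. Normalising the status with `.upper()` keeps the parsed flags comparable with the upper-case flags in the test set.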

In [61]:
grouped_data = testing_set.groupby('unitcode')

ai_data = []
for unitcode, group in grouped_data:
    unit_data = group.iloc[0]  # unit metadata is the same on every row of the group
    outcomes = group['description'].dropna().tolist()

    # One newline-separated block of outcomes per unit
    outcomes_text = "\n".join(outcomes)

    ai_data.append({
        "unitcode": unitcode,
        "level": int(unit_data['level']),
        "credit_points": 6,  # hard-coded to 6 for every unit
        "outcomes": outcomes_text
    })


In [62]:
print(ai_data[0])

{'unitcode': 'ACCT2242', 'level': 2, 'credit_points': 6, 'outcomes': 'identify and explain the roles and components of AIS\ndescribe the role of an internal control system in maintaining data integrity and recommend internal control improvements to protect key business processes\ncritically evaluate business processes and system documentation\ndemonstrate basic data analytics and visualisation techniques that enhances the efficiency and effectiveness of communication\nexplain the impact of ICT on current and emerging accounting practices\ndevelop competencies to work effectively in teams to resolve ICT issues.'}


In [63]:
import time
import random
from httpx import RemoteProtocolError

flags = []

DELAY_SECONDS = 3
MAX_RETRIES = 3

for i, entry in enumerate(ai_data, start=1):
    print(f"\n🔹 Processing {i}/{len(ai_data)}: {entry['unitcode']}")

    retries = 0
    success = False

    while retries < MAX_RETRIES and not success:
        try:
            response = run_eval(
                level=entry['level'],
                unit_name=entry['unitcode'],
                credit_points=entry['credit_points'],
                outcomes_text=entry['outcomes']
            )

            matches = re.findall(r"'(.*?)'\s*-\s*STATUS:\s*(\w+)", response, re.IGNORECASE)
            for outcome, status in matches:
                flags.append({
                    "unitcode": entry['unitcode'],
                    "outcome": outcome,
                    "status": status
                })

            print(f"✅ Completed {entry['unitcode']} — {len(matches)} outcomes parsed.")
            success = True

        except RemoteProtocolError:
            retries += 1
            wait_time = DELAY_SECONDS * retries + random.uniform(1, 2)
            print(f"⚠️ Server disconnected. Retrying in {wait_time:.1f}s (attempt {retries}/{MAX_RETRIES})...")
            time.sleep(wait_time)

        except Exception as e:
            print(f"❌ Unexpected error for {entry['unitcode']}: {e}")
            flags.append({
                "unitcode": entry['unitcode'],
                "outcome": None,
                "status": "ERROR"
            })
            break

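    if success:
        # Delay between units to reduce load
        time.sleep(DELAY_SECONDS + random.uniform(0.5, 1.5))
    elif retries >= MAX_RETRIES:
        # All retries failed: record the unit so it is not silently dropped
        print(f"❌ Giving up on {entry['unitcode']} after {MAX_RETRIES} attempts.")
        flags.append({
            "unitcode": entry['unitcode'],
            "outcome": None,
            "status": "ERROR"
        })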



🔹 Processing 1/10: ACCT2242
✅ Completed ACCT2242 — 6 outcomes parsed.

🔹 Processing 2/10: CITS2211
✅ Completed CITS2211 — 4 outcomes parsed.

🔹 Processing 3/10: DENT3005
✅ Completed DENT3005 — 10 outcomes parsed.

🔹 Processing 4/10: ELEC3016
✅ Completed ELEC3016 — 7 outcomes parsed.

🔹 Processing 5/10: ELEC5505
✅ Completed ELEC5505 — 7 outcomes parsed.

🔹 Processing 6/10: EMPL3301
✅ Completed EMPL3301 — 6 outcomes parsed.

🔹 Processing 7/10: LAWS3214
✅ Completed LAWS3214 — 5 outcomes parsed.

🔹 Processing 8/10: PHIL1001
✅ Completed PHIL1001 — 10 outcomes parsed.

🔹 Processing 9/10: PHYS2003
✅ Completed PHYS2003 — 5 outcomes parsed.

🔹 Processing 10/10: PSYC1101
⚠️ Server disconnected. Retrying in 4.1s (attempt 1/3)...
✅ Completed PSYC1101 — 5 outcomes parsed.
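The retry logic in the loop above could also be factored into a small reusable helper. The following is a minimal sketch under stated assumptions, not the team's implementation: `call_with_retries` and its parameters are hypothetical, the built-in `ConnectionError` stands in for `httpx.RemoteProtocolError` so the block runs without third-party packages, and the helper makes at most `max_retries` attempts in total before re-raising.

```python
import random
import time

def call_with_retries(func, *, max_retries=3, base_delay=3.0,
                      retry_on=(ConnectionError,), sleep=time.sleep):
    """Call func(); on a retryable error wait base_delay * attempt plus jitter, then retry.

    Returns func()'s result, or re-raises the last retryable error once
    max_retries attempts have failed.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_retries:
                raise  # out of attempts: surface the failure to the caller
            # Linear backoff with 1-2 s of random jitter, as in the loop above
            sleep(base_delay * attempt + random.uniform(1, 2))

# Usage with a stand-in function that fails once, then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("server disconnected")
    return "ok"

waits = []
result = call_with_retries(flaky, sleep=waits.append)  # record waits instead of sleeping
print(result, len(waits))  # ok 1
```

Passing `sleep` as a parameter lets the wait schedule be checked without actually pausing; in the notebook, `func` would be a small wrapper around `run_eval` and `retry_on` would be `(RemoteProtocolError,)`.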


In [65]:
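# Collect the AI status for each parsed outcome, in the order the units were processed
flags_list = [entry['status'] for entry in flags]

# NOTE: this assignment is positional, so it assumes flags_list has exactly one
# entry per row of testing_set and follows the same row order
testing_set['AI_Flag'] = flags_list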

testing_set.head(5)

Unnamed: 0,unit id,unitcode,level,description,position,flag,AI_Flag
0,4,EMPL3301,3,describe the core debates over the meaning of ...,1,GOOD,GOOD
1,4,EMPL3301,3,explain the relationship between globalisation...,2,GOOD,COULD_IMPROVE
2,4,EMPL3301,3,identify organisations and institutions centra...,3,NEEDS_REVISION,NEEDS_REVISION
3,4,EMPL3301,3,gain a critical appreciation of how globalisat...,4,NEEDS_REVISION,NEEDS_REVISION
4,4,EMPL3301,3,develop a critical understanding of individual...,5,GOOD,GOOD
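One caveat about the cell above: the assignment is positional. `groupby('unitcode')` iterates units in sorted order (ACCT2242 was processed first), whereas `testing_set` keeps the CSV's row order (EMPL3301 comes first), so the AI flags appear to land on rows belonging to a different unit unless the CSV is already sorted by unit code. Matching on a key avoids this. The sketch below is one possible approach, assuming the AI echoes each outcome text essentially verbatim; `align_flags` and `_key` are hypothetical helpers and the sample data is illustrative.

```python
def _key(unitcode, text):
    """Normalise a (unit, outcome) pair so whitespace and case differences still match."""
    return (unitcode, " ".join(str(text).split()).lower())

def align_flags(rows, flags):
    """Return one AI status per row, matched on (unitcode, outcome text).

    rows  : iterable of (unitcode, description) in the DataFrame's row order
    flags : list of dicts with 'unitcode', 'outcome', 'status' (as built above)
    Rows without a matching parsed outcome get None instead of a shifted flag.
    """
    lookup = {
        _key(f["unitcode"], f["outcome"]): f["status"].upper()
        for f in flags
        if f["outcome"] is not None  # skip ERROR placeholders
    }
    return [lookup.get(_key(unitcode, description)) for unitcode, description in rows]

# Illustrative data: rows in CSV order, flags in sorted-unit order
rows = [
    ("EMPL3301", "describe the core debates"),
    ("ACCT2242", "identify and explain the roles and components of AIS"),
]
sample_flags = [
    {"unitcode": "ACCT2242",
     "outcome": "identify and explain the roles and components of AIS",
     "status": "GOOD"},
    {"unitcode": "EMPL3301",
     "outcome": "Describe the core  debates",
     "status": "needs_revision"},
]

print(align_flags(rows, sample_flags))  # ['NEEDS_REVISION', 'GOOD']
```

With the DataFrame above, this could be applied as `testing_set['AI_Flag'] = align_flags(zip(testing_set['unitcode'], testing_set['description']), flags)`. Any rows left as `None` would need to be reviewed or dropped before scoring.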


In [66]:
testing_set["flag"] = testing_set["flag"].str.upper()

testing_set

Unnamed: 0,unit id,unitcode,level,description,position,flag,AI_Flag
0,4,EMPL3301,3,describe the core debates over the meaning of ...,1,GOOD,GOOD
1,4,EMPL3301,3,explain the relationship between globalisation...,2,GOOD,COULD_IMPROVE
2,4,EMPL3301,3,identify organisations and institutions centra...,3,NEEDS_REVISION,NEEDS_REVISION
3,4,EMPL3301,3,gain a critical appreciation of how globalisat...,4,NEEDS_REVISION,NEEDS_REVISION
4,4,EMPL3301,3,develop a critical understanding of individual...,5,GOOD,GOOD
...,...,...,...,...,...,...,...
60,5815,ELEC3016,3,"analyse the performance (regulation, losses an...",3,GOOD,NEEDS_REVISION
61,5815,ELEC3016,3,explain the working principle of transformers ...,4,NEEDS_REVISION,GOOD
62,5815,ELEC3016,3,analyse torque-speed characteristics to develo...,5,GOOD,NEEDS_REVISION
63,5815,ELEC3016,3,develop transmission line parameters and power...,6,GOOD,NEEDS_REVISION


In [68]:
testing_set.to_excel('Testing Set.xlsx', index=False)

In [70]:
from sklearn.metrics import f1_score

manual_flagged = testing_set.flag
ai_flagged = testing_set.AI_Flag

f1_macro = f1_score(manual_flagged, ai_flagged, average='macro')
print(round(f1_macro, 3))

f1_weighted = f1_score(manual_flagged, ai_flagged, average='weighted')
print(round(f1_weighted, 3))


0.236
0.314
