In [23]:
import os
print(os.getcwd())

/Users/akshaysundru2004/Desktop/Year 4/AI-Learning-Outcome-Builder/app


# Bulk Testing AI Evaluation of Learning Outcomes

This notebook is an interactive record of the team's procedure for evaluating the efficacy of the AI tool developed for this project. The tests have two main purposes: to evaluate the AI's ability to classify unit learning outcomes accurately according to Bloom's taxonomy and in line with expert review, and to grade the quality of the AI's rewrites.

This notebook demonstrates each test conducted by the team step by step. The tests can be executed individually, and notes explain the purpose and methodology of each one.

The first step was to gather the data used to perform the tests. As agreed and recorded in meetings between the team and the client, the client would consult experts and provide a set of flagged learning outcomes to serve as the test set for our AI, in addition to a larger set of unit learning outcome data. Using this dataset, we can compute metrics such as the F1 score, confusion matrix and ROUGE score to evaluate the AI's capabilities.
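To make the ROUGE metric mentioned above concrete, here is a minimal sketch of ROUGE-1 (unigram overlap) in plain Python. It assumes simple lowercase whitespace tokenisation, which is cruder than what a dedicated ROUGE package would do, and the two sentences are made up for illustration rather than taken from the test set. The function name `rouge1` is hypothetical.

```python
from collections import Counter


def rouge1(reference: str, candidate: str) -> dict:
    """Unigram-overlap precision, recall and F1 between two strings."""
    # Assumption: lowercase whitespace tokenisation, no stemming.
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())

    ref_total = sum(ref_counts.values())
    cand_total = sum(cand_counts.values())
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative sentences only, not outcomes from the dataset.
scores = rouge1(
    reference="explain how drugs act on their targets",
    candidate="explain how drugs act at the molecular level",
)
print(scores)
```

A higher recall means the AI rewrite retains more of the words in the reference rewrite, while a higher precision means it adds fewer words that the reference does not contain.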

### Pre-Cleaning

Before we can construct the AI tests, we must first clean the data provided to us. To produce the test CSV, we wrote a SQL query that joins unit information with the corresponding learning outcomes, then exported the joined table to a CSV file. This CSV contains the following columns:

- Unit id: The unique id assigned to each unit in the Unit table.
- Unit code: A unique 8-character code (for example, PHAR2220) assigned by UWA to each unit it offers.
- Level: The course "level" of each unit, indicating the complexity of the unit.
- Description: Each unit has a set of learning outcomes associated with it. On our website these outcomes are stored in their own separate table; here we have joined the basic unit information with each learning outcome. Each learning outcome describes a goal that the unit should help students achieve or a skill that will be assessed.
- Position: This parameter is mainly used for front-end rendering on the learning outcome editor page, indicating the position of a learning outcome in the table.
- Credit Points: Completing a unit earns the student a certain number of credit points. Completing a degree typically requires both certain units and a requisite number of credit points.
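The join-and-export step described above could look roughly like the sketch below. The table names (`unit`, `learning_outcome`), column names and the in-memory SQLite database are assumptions made for illustration; the project's actual schema and database engine are not shown in this notebook.

```python
import csv
import io
import sqlite3

# Hypothetical schema: the real table and column names may differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE unit (
        id INTEGER PRIMARY KEY, unitcode TEXT, level INTEGER, creditpoints INTEGER
    );
    CREATE TABLE learning_outcome (
        id INTEGER PRIMARY KEY, unit_id INTEGER, description TEXT, position INTEGER
    );
    INSERT INTO unit VALUES (1, 'PHAR2220', 2, 6);
    INSERT INTO learning_outcome VALUES
        (1, 1, 'discuss ethical approaches to responsible conduct in learning and research', 3),
        (2, 1, 'describe drug action at molecular, cellular, tissue and whole-body levels', 1);
""")

# One output row per learning outcome, with the unit's details repeated.
query = """
    SELECT u.id AS unit_id, u.unitcode, u.level, u.creditpoints AS credit_points,
           lo.description, lo.position
    FROM unit AS u
    JOIN learning_outcome AS lo ON lo.unit_id = u.id
    ORDER BY u.id, lo.position
"""
cursor = conn.execute(query)
header = [column[0] for column in cursor.description]
rows = cursor.fetchall()

# Write to an in-memory buffer here; a real export would open a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Because each unit has several outcomes, the unit columns repeat on every row, which matches the shape of the CSV loaded later in this notebook.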

After extracting this data into a CSV, each learning outcome was reviewed manually and an additional column was added.

- Flag: This parameter records the human assessment of each learning outcome, categorising it into one of three levels: 'Good', 'Needs Revision' or 'Could Improve'.

The first step of cleaning is to load this CSV into Python using a pandas dataframe.

In [85]:
import pandas as pd
import os


testing_set = pd.read_csv("Sample-Test-Data.csv")

In [86]:
testing_set.head()

Unnamed: 0,unit id,unitcode,level,credit points,description,position,flag
0,1,PHAR2220,2,6,"describe drug action at molecular, cellular, t...",1,Good
1,1,PHAR2220,2,6,describe both the effects of the drug on its t...,2,Good
2,1,PHAR2220,2,6,discuss ethical approaches to responsible cond...,3,Good
3,1,PHAR2220,2,6,"select, critically appraise, and communicate s...",4,Needs Revision
4,1,PHAR2220,2,6,perform laboratory experiments relevant to ass...,5,Good


Running the above commands shows the first 5 rows of the test dataset. The pandas library lets us perform some elementary analysis of these outcomes, the first of which is counting the number of outcomes in each of the three flag categories.

In [55]:
good_rows = testing_set[testing_set.flag == "Good"]
needs_revision_rows = testing_set[testing_set.flag == "Needs Revision"]
could_improve_rows = testing_set[testing_set.flag == "Could Improve"]

count_good = len(good_rows)
count_needs_revision = len(needs_revision_rows)
count_could_improve = len(could_improve_rows)

print(f"The number of 'Good' learning outcomes is {count_good}.")
print(f"The number of 'Needs Revision' learning outcomes is {count_needs_revision}.")
print(f"The number of 'Could Improve' learning outcomes is {count_could_improve}.")

The number of 'Good' learning outcomes is 351.
The number of 'Needs Revision' learning outcomes is 124.
The number of 'Could Improve' learning outcomes is 38.
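The same tally can be produced in one call with pandas' `value_counts`, which also gives each flag's share of the total. The sketch below uses a small stand-in frame named `sample` (an assumption for illustration); in this notebook the equivalent call would be on `testing_set["flag"]`.

```python
import pandas as pd

# Stand-in data for illustration; replace with testing_set in the notebook.
sample = pd.DataFrame(
    {"flag": ["Good", "Needs Revision", "Good", "Could Improve", "Good"]}
)

# Absolute count per flag, sorted from most to least frequent.
flag_counts = sample["flag"].value_counts()

# Proportion of rows per flag, useful for spotting class imbalance.
flag_share = sample["flag"].value_counts(normalize=True)

for flag, count in flag_counts.items():
    print(f"{flag}: {count} ({flag_share[flag]:.1%})")
```

With the counts reported above (351, 124 and 38 out of 513), roughly 68% of the outcomes are flagged 'Good', so the classes are imbalanced and plain accuracy alone would be a weak measure.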


These class counts let us construct the first set of tests: the F1 score and the confusion matrix. These tests measure the classification accuracy of our model by counting how many AI evaluations match the expert flag, as well as the number of false positives and false negatives for each flag.
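As a rough illustration of how these two measures relate, the sketch below builds a three-class confusion matrix and derives a per-flag F1 score from it in plain Python. The helper names (`confusion_matrix`, `per_class_f1`) and the label lists are made up for illustration; they are not the team's implementation or results from the test set.

```python
from collections import Counter

LABELS = ["Good", "Needs Revision", "Could Improve"]


def confusion_matrix(expected, predicted, labels=LABELS):
    """Rows are expert flags, columns are AI flags."""
    counts = Counter(zip(expected, predicted))
    return [[counts[(e, p)] for p in labels] for e in labels]


def per_class_f1(matrix, labels=LABELS):
    """F1 per flag: 2*TP / (2*TP + FP + FN), or 0.0 if undefined."""
    scores = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fp = sum(row[i] for row in matrix) - tp  # AI said label, expert did not
        fn = sum(matrix[i]) - tp                 # expert said label, AI did not
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Made-up flags for illustration only.
expert = ["Good", "Good", "Needs Revision", "Could Improve", "Good", "Needs Revision"]
ai = ["Good", "Needs Revision", "Needs Revision", "Good", "Good", "Needs Revision"]

matrix = confusion_matrix(expert, ai)
f1 = per_class_f1(matrix)
macro_f1 = sum(f1.values()) / len(f1)  # unweighted mean across flags

for label, row in zip(LABELS, matrix):
    print(f"{label:>15}: {row}")
print(f1, macro_f1)
```

The macro average weights each flag equally, so a flag the AI never predicts correctly pulls the score down even if that flag is rare in the data.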

In [56]:
descriptions = testing_set["description"]

In [81]:
# Take the first five outcomes, which all belong to PHAR2220.
text = [str(descriptions[i]) for i in range(5)]
print(text)

# Join the outcomes into one newline-separated string for the evaluator.
outcomes_text = "\n".join(text)

['describe drug action at molecular, cellular, tissue and whole-body levels', 'describe both the effects of the drug on its target, and the effects of disease processes and other drugs on this relationship', 'discuss ethical approaches to responsible conduct in learning and research', 'select, critically appraise, and communicate scientific information on a selected drug', 'perform laboratory experiments relevant to assessing the action of drugs and their impact on pathophysiological processes']


In [82]:
unitcode = "PHAR2220"
level = "2"
creditpoints = "6"
print(outcomes_text)

describe drug action at molecular, cellular, tissue and whole-body levels
describe both the effects of the drug on its target, and the effects of disease processes and other drugs on this relationship
discuss ethical approaches to responsible conduct in learning and research
select, critically appraise, and communicate scientific information on a selected drug
perform laboratory experiments relevant to assessing the action of drugs and their impact on pathophysiological processes


In [84]:
from ai_evaluate import run_eval
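# Read the key from the environment instead of hard-coding it in the notebook.
if not os.environ.get("GOOGLE_API_KEY", "").strip():
    raise RuntimeError("Set the GOOGLE_API_KEY environment variable before running this cell.")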

response = run_eval(level=level, unit_name=unitcode, credit_points=creditpoints, outcomes_text=outcomes_text)

print(response)

**LO Analysis**

'describe drug action at molecular, cellular, tissue and whole-body levels' - STATUS:GOOD - This outcome aligns well with the Comprehension level, requiring students to explain and articulate drug action across different biological scales.

'describe both the effects of the drug on its target, and the effects of disease processes and other drugs on this relationship' - STATUS:GOOD - This outcome also fits the Comprehension level, asking students to explain complex interactions and relationships between drugs, targets, and disease states.

'discuss ethical approaches to responsible conduct in learning and research' - STATUS:GOOD - The verb 'discuss' is appropriate for Comprehension, requiring students to elaborate on and explain ethical principles.

'select, critically appraise, and communicate scientific information on a selected drug' - STATUS:NEEDS_REVISION - This outcome contains verbs from multiple Bloom's levels (Application - select, Analysis - critically apprais