## Qualitative Feedback Process

Requires python >= 3.9.7

```bash
pip install pandas langid txtai sacremoses sentencepiece langchain nltk
```

Before getting started, install Ollama:

```bash
curl https://ollama.ai/install.sh | sh
ollama pull bodeby/qsf
ollama run bodeby/qsf
```

For increased reproducibility, use the modelfile in the project
to build the model with a fixed seed and a temperature of 0.

```bash
curl https://ollama.ai/install.sh | sh
ollama create bodeby/qsf -f ./modelfile
ollama run bodeby/qsf
```
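
The project's modelfile is not reproduced in this notebook. As a rough sketch of what a reproducibility-focused modelfile could contain, the helper below renders one from Python. The base model `llama2`, the seed value `42`, and the function name `render_modelfile` are assumptions for illustration; `FROM` and `PARAMETER` are standard Modelfile instructions.

```python
def render_modelfile(base_model="llama2", seed=42, temperature=0):
    """Return Modelfile text that pins the sampling seed and temperature."""
    lines = [
        f"FROM {base_model}",                    # assumed base model
        f"PARAMETER temperature {temperature}",  # 0 removes sampling randomness
        f"PARAMETER seed {seed}",                # assumed seed value
    ]
    return "\n".join(lines) + "\n"

modelfile_text = render_modelfile()
print(modelfile_text)
```

The rendered text could be saved as `./modelfile` and passed to `ollama create` as shown above.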

In [464]:
# installation of required packages
%pip install pandas langid txtai sacremoses sentencepiece langchain nltk matplotlib

Note: you may need to restart the kernel to use updated packages.


In [465]:
# Package imports
import pandas as pd
import langid
import nltk
import re, os

from langchain.llms import Ollama
from txtai.pipeline import Translation
from nltk.tokenize import word_tokenize

# Download the punkt tokenizer data
nltk.download('punkt') 

[nltk_data] Downloading package punkt to /Users/vk64lk/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Processing Configuration

**Notes on Translation**
- True:  translate text manually using ```translate_text()```.
- False: leave translation to the language model (Ollama), which may or may not translate the text itself.


**Model Selection**
- Llama2:7b served with Ollama
- QSF:7b (llama2) served with Ollama

**File Paths**
- Input: the dataset to work with
- Output: the complete actionable feedback
- Logs: intermediary log-level events

In [466]:
translate_text_manually = False     # toggle translation
max_token_context_length = 2000     # size of context
# ollama_model = "llama2"           # language model (zero-shot)
ollama_model = "bodeby/qsf"         # language model (parameter tuned)

input_path = 'data/complete.csv'    # original dataset
output_path = "data/actionable.csv" # the generated actionable feedback
log_path = "data/log.csv"           # log-level data

**Output Config**
- campus: limit feedback by campus ["copenhagen", "aalborg", "all"]
- semester: limit feedback by semester [1,3,5,7,9]
- course: limit feedback by course [52]

In [467]:
campus = "all"                      # ["all", input...]
semester = "all"                    # ["all", input...]
course = "all"                      # ["all", input...]
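
The cells shown in this notebook set these three values, but no cell applies them as a filter. A minimal sketch of how such a filter could look follows; `matches_filter` is a hypothetical helper, and the case-insensitive text comparison is an assumption about how the settings should be interpreted.

```python
def matches_filter(value, setting):
    """Return True when `setting` is "all" or equals `value` (compared as lowercase text)."""
    if str(setting).strip().lower() == "all":
        return True
    return str(value).strip().lower() == str(setting).strip().lower()

# Example: keep only Copenhagen rows from a small list of records
rows = [
    {"Campus": "Aalborg", "Semester": "9"},
    {"Campus": "Copenhagen", "Semester": "9"},
]
kept = [row for row in rows if matches_filter(row["Campus"], "copenhagen")]
print(kept)
```

With a DataFrame, the same predicate could be applied per column, for example `df[df["Campus"].map(lambda v: matches_filter(v, campus))]`.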

## Supporting Functions

In [468]:
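def split_text_into_array(text, max_tokens=20):
    # NOTE: despite its name, max_tokens caps the summed character length
    # of the tokens in each chunk (len(token)), not the number of tokens
    tokens = word_tokenize(text) # use nltk to tokenize input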
    result_array = []

    current_chunk = []
    current_chunk_length = 0

    for token in tokens:
        if current_chunk_length + len(token) <= max_tokens:
            current_chunk.append(token)
            current_chunk_length += len(token)
        else:
            result_array.append(" ".join(current_chunk))
            current_chunk = [token]
            current_chunk_length = len(token)

    if current_chunk:
        result_array.append(" ".join(current_chunk))

    return result_array
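
Because `split_text_into_array` adds up `len(token)`, its budget is effectively measured in characters. If a budget in tokens is wanted instead, a variant could look like the sketch below. `chunk_by_token_count` is a hypothetical name, and whitespace splitting stands in for `word_tokenize` so the example runs without nltk.

```python
def chunk_by_token_count(text, max_tokens=20):
    """Split text into chunks of at most `max_tokens` whitespace-separated tokens."""
    tokens = text.split()  # stand-in for nltk's word_tokenize
    return [
        " ".join(tokens[start:start + max_tokens])
        for start in range(0, len(tokens), max_tokens)
    ]

chunks = chunk_by_token_count("one two three four five", max_tokens=2)
print(chunks)  # ['one two', 'three four', 'five']
```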

## Translation via txtai

```python
lang, confidence = langid.classify(text) # language classification
result = translate_text(text, lang) # utilizing txtai translation
```

In [469]:
# Original
def translate_text(text):
    lang, confidence = langid.classify(text)
    print(lang)
    if lang == "da":
        # Translate the text to a specific language
        translate = Translation()
        translation = []
        for x in text:
            translation.append(translate(x,"da", "en"))
        return translation
    if lang == "sv":
        # Translate the Swedish text to English
        translate = Translation()
        translation = []
        for x in text:
            translation.append(translate(x,"sv", "en"))
        return translation
    if lang == "no":
        # Translate the Norwegian text to English
        translate = Translation()
        translation = []
        for x in text:
            translation.append(translate(x,"no", "en"))
        return translation
    return text

In [470]:
# Refactored
def translate_text(text, lang):
    # Translate a list of strings from `lang` into English
    target = "en"
    translate = Translation()
    translation = []

    for item in text:
        # txtai's Translation takes the target language before the source language
        translated = translate(item, target, lang)
        translation.append(translated)
    
    return translation
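
Both versions above construct `Translation()` inside the function, so a new pipeline object is created on every call. One possible alternative is to build the translator once and pass it in. The sketch below assumes a callable `translate_fn(text, target, source)`, which mirrors the argument order txtai's `Translation` is understood to use; `translate_all` and `fake_translate` are hypothetical names, and the stand-in translator exists only so the example runs without txtai.

```python
def translate_all(texts, lang, translate_fn, target="en"):
    """Translate each string in `texts` from `lang` to `target` using `translate_fn`."""
    if lang == target:
        return list(texts)  # already in the target language
    return [translate_fn(text, target, lang) for text in texts]

# Stand-in translator used only for demonstration
def fake_translate(text, target, source):
    return f"[{source}->{target}] {text}"

translated = translate_all(["Godt kursus"], "da", fake_translate)
print(translated)
```

In the notebook, a single `Translation()` instance created at module level could be passed as `translate_fn`.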

## Logging result to CSV

1. Check whether a previous log file exists
    - if it does, read the file and load it into a pandas DF
    - otherwise, create a new pandas DF with the required columns
2. Create a new row with query and output information
3. Append new row to results DF
4. Save modified results DF to log-file.

In [471]:
# Original Code
def log_query(query,output):

    # Check if the CSV file exists
    if os.path.exists(log_path):
        # If the file exists, load it into a DataFrame
        log_df = pd.read_csv(log_path)
    else:
        # If the file doesn't exist, generate new data and save it to the CSV file
        log_df = pd.DataFrame(columns=['let_llama_translate', 'max_token_context_length','model','query','output'])
    new_row = {'let_llama_translate':translate_text_manually, 'max_token_context_length':max_token_context_length,'model':ollama_model,'query':query,'output':output}
    log_df = pd.concat([log_df, pd.DataFrame([new_row])], ignore_index=True)
    log_df.to_csv(log_path, index=False)

In [472]:

# Log the query and its output to a CSV file.
# @arg query: string -> the query string.
# @arg output: string -> The output generated from the query.
def log_query(query, output):

    # Check if the CSV file exists
    if os.path.exists(log_path):
        # Read file, and load it into a pandas DF
        log_df = pd.read_csv(log_path) 
    else:
        # Create new pandas DF with required columns
        log_df = pd.DataFrame(columns=['let_llama_translate', 'max_token_context_length', 'model', 'query', 'output'])

    # Create a new row with query and output information
    new_row = {
        # NOTE: despite the column name, this stores the manual-translation flag
        'let_llama_translate': translate_text_manually,
        'max_token_context_length': max_token_context_length,
        'model': ollama_model,
        'query': query,
        'output': output
    }

    # Append the new row to the DataFrame
    log_df = pd.concat([log_df, pd.DataFrame([new_row])], ignore_index=True)

    # Save the DataFrame to the CSV file
    log_df.to_csv(log_path, index=False)
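
`log_query` reads the whole log file and rewrites it on every call. For longer runs, appending a single row may be cheaper. The following is a minimal sketch using the standard `csv` module; `append_log_row` and `LOG_COLUMNS` are hypothetical names, and the demo writes to a temporary file rather than `log_path`.

```python
import csv
import os
import tempfile

LOG_COLUMNS = ["let_llama_translate", "max_token_context_length", "model", "query", "output"]

def append_log_row(path, row):
    """Append one row to the CSV at `path`, writing the header only for a new or empty file."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=LOG_COLUMNS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)

# Demo against a throwaway file
demo_path = os.path.join(tempfile.mkdtemp(), "log.csv")
demo_row = {
    "let_llama_translate": False,
    "max_token_context_length": 2000,
    "model": "bodeby/qsf",
    "query": "example query",
    "output": "example output",
}
append_log_row(demo_path, demo_row)
append_log_row(demo_path, demo_row)

# Read the file back to confirm one header and two data rows
with open(demo_path, newline="", encoding="utf-8") as handle:
    logged = list(csv.DictReader(handle))
print(len(logged))
```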


## Data Processing and Cleaning

**refactoring required** to increase presentability

1. Identify **Campus** from dataset
2. Identify **Semester** from dataset

In [473]:
# Fixes the awful formatting of the raw data.
# --> Requirements for dataframe: 63 columns are assumed; the first 42 hold feedback, and 11 hold course names
# @dataframe    : the dataframe to be processed
# @write_file   : whether to write the processed data to a CSV file
def process_data(dataframe, write_file):
    # Create a new DataFrame with selected columns
    new_df = pd.DataFrame(columns=['Campus', 'Semester', 'Course','Feedback_good','Feedback_bad','Feedback_extra'])
    
    # Iterate over the rows of the raw dataset
    for index, row in dataframe.iterrows():
        # Figure out the campus (Aalborg unless the label says otherwise)
        campus = "Aalborg"
        semester = re.search(r'\d+', row['Semesterbetegnelse'])

        # Check if we can find a semester number in string
        if semester:
            semester = semester.group()
        if '-KBH' in str(row['Semesterbetegnelse']):
            campus = "Copenhagen"
        
        # harvesting feedback, assuming 63 columns.
        feedback_good = ""
        feedback_bad = ""
        feedback_extra = ""
        course = ""
        row_arr = row.to_numpy()

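        # Each group of three columns holds good/bad/extra feedback for one course
        for i in range(0,42,3):
            feedback_good = row_arr[i]
            feedback_bad = row_arr[i+1]
            feedback_extra = row_arr[i+2]

            if translate_text_manually:
                # translate_text expects a list plus a language code and returns a list;
                # empty (NaN) cells are passed through unchanged
                cells = [feedback_good, feedback_bad, feedback_extra]
                cells = [c if pd.isna(c) else translate_text([c], langid.classify(c)[0])[0] for c in cells]
                feedback_good, feedback_bad, feedback_extra = cells

            # Course names start at column 46, one per feedback group
            course = row_arr[46 + i // 3]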
            if i == 33:
                course = "Project"
            if i == 36:
                course = "Semester" 
            if i == 39:
                course = "Studiemiljø"

            if not ((pd.isna(feedback_good)) and (pd.isna(feedback_bad)) and (pd.isna(feedback_extra))):
                new_row = {'Campus':campus, 'Semester': semester, 'Course':course, 'Feedback_good':feedback_good,'Feedback_bad':feedback_bad,'Feedback_extra':feedback_extra}
                new_df = pd.concat([new_df, pd.DataFrame([new_row])], ignore_index=True)

    if write_file:
        new_df.to_csv("processed_data/proc_dataset.csv", index=False)
        
    return new_df
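
The campus and semester detection inside `process_data` can be isolated into a small function, which makes it easier to check against sample labels. This is a sketch of the same two rules (first run of digits is the semester, `-KBH` marks Copenhagen); `parse_semester_label` is a hypothetical name, and the label `"SW9-KBH"` is a made-up example rather than a value taken from the dataset.

```python
import re

def parse_semester_label(label):
    """Return (campus, semester) parsed from a 'Semesterbetegnelse' value."""
    text = str(label)
    match = re.search(r"\d+", text)               # first run of digits
    semester = match.group() if match else None   # None when no number is present
    campus = "Copenhagen" if "-KBH" in text else "Aalborg"
    return campus, semester

parsed = parse_semester_label("SW9-KBH")  # made-up label
print(parsed)  # ('Copenhagen', '9')
```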


## Helper functions

NB: Only used for CLI portion of the program

Directly used for user input matching, ex.:

```python
if check_strings_in_array(user_input,['kbh','cph','copenhagen','københavn'])[0]:
    campus = "Copenhagen"
```

In [474]:
def check_strings_in_array(main_string, string_array):
    for substring in string_array:
        if substring.lower() in main_string.lower():
            return [True,substring]
    return [False,""]

## Synthesising Formatted Dataset

In [475]:
# Load dataset from csv file (tab separated, utf-16 encoded)
df = pd.read_csv(input_path, sep='\t', encoding='utf-16')

In [476]:
# Apply preproc. function to dataframe and override
df = process_data(df, False)

In [477]:
# Sanity-check of preproc.
df.head()

Unnamed: 0,Campus,Semester,Course,Feedback_good,Feedback_bad,Feedback_extra
0,Aalborg,9,IT-ret,Alle undervisere var rigtig gode og engageret ...,Fire timers forelæsning i træk er meget lang t...,
1,Aalborg,9,Specialiseringskursus i programmeringsteknologi,Vi nåede alle at lave tre fremlæggelser i løbe...,"Holdet var opdelt på en måde, så fremlæggelser...",Kurset fungerede fint selvom det var online.
2,Aalborg,9,Project,Vejledere er engageret og gode til at give fee...,,
3,Aalborg,9,Semester,Godt samspil mellem specialiseringskurset og p...,,
4,Aalborg,9,Studiemiljø,Vi er glade for at have vores eget grupperum!,Bedre køkkenfaciliteter i FRB 7 G (fx mikroovn...,De fleste grupperum på FRB 7 G står tomme/bliv...


### Generating the Actionable Feedback

- original values: columns inherited from the input dataset
- analysis values: columns left empty for later analysis
- outcome values: the final actionable feedback generated with the LLM

In [478]:
# configuration of Ollama instance
ollama = Ollama(base_url="http://127.0.0.1:11434", model=ollama_model)

In [479]:
# Create query and process with Ollama
def gen_summarized(arr):
    shortened_feedback_text = ""
    arr_length = len(arr)
    max_size = str(int((max_token_context_length / arr_length)))

    for i in range(arr_length):
        # Query builder
        query = f"Summarize, to a maximum of {max_size} tokens this text that is based on course evaluations: {arr[i]}"

        # Process query
        output = ollama(query)  # Query Ollama Model
        log_query(query, output)  # Log output to CSV
        shortened_feedback_text += output  # Append output
    
    return {
        "query": query,  # only the query built for the last chunk is returned
        "output": shortened_feedback_text
    }
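
The token budget in `gen_summarized` is the context length divided evenly across the chunks, so two chunks with a context of 2000 give 1000 tokens each. The sketch below isolates that arithmetic and the prompt construction. `build_summary_queries` is a hypothetical name, and the guard for an empty list is an addition that the original function does not have.

```python
def build_summary_queries(chunks, context_length=2000):
    """Return one summarization prompt per chunk, sharing the context budget evenly."""
    if not chunks:
        return []  # avoids dividing by zero for empty input
    budget = context_length // len(chunks)
    return [
        f"Summarize, to a maximum of {budget} tokens this text that is based on course evaluations: {chunk}"
        for chunk in chunks
    ]

queries = build_summary_queries(["first chunk", "second chunk"])
print(queries[0])
```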

In [480]:
# TEST: gen_summarized
test_feedback = "There are plenty of material. The invited lectures are interesting."

# Create array of chunks
test_array = split_text_into_array(test_feedback, max_tokens=max_token_context_length)
print(test_array)

# run ollama inference
result_summ = gen_summarized(test_array)

# destructure results
print(result_summ["query"])
print(result_summ["output"])

['There are plenty of material . The invited lectures are interesting .']
Summarize, to a maximum of 2000 tokens this text that is based on course evaluations: There are plenty of material . The invited lectures are interesting .

Here's a summary of the text in up to 2000 tokens:

There are many interesting materials in the course. Invited lectures are also enjoyable.


In [481]:
# Generate actionable feedback
def gen_actionable(summarized):
    query = f"You are now an actionable feedback bot. Summarize and give actionable feedback, based upon these summarized course evaluations, to the instructor of the course. Leave out names that could identify entities. Make sure that the feedback is factual, actionable, and appropriate to the instructor: {summarized}"
    output = ollama(query)  # Query Ollama Model
    log_query(query, output)  # Log query-output pair to CSV

    return {
        "query": query,
        "output": output
    }
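
The two stages (summarize each chunk, then request actionable feedback on the combined summaries) can be exercised without a running Ollama server by passing the model in as a callable. This is a sketch under that assumption; `summarize_then_act` and `fake_llm` are hypothetical names, and the shortened prompts are placeholders rather than the prompts used above.

```python
def summarize_then_act(chunks, llm):
    """Summarize each chunk with `llm`, then ask `llm` for actionable feedback on the summaries."""
    summaries = [llm(f"Summarize this course evaluation text: {chunk}") for chunk in chunks]
    combined = "\n".join(summaries)
    return llm(f"Give actionable feedback based on these summaries: {combined}")

# Stand-in model that records every prompt it receives
calls = []

def fake_llm(prompt):
    calls.append(prompt)
    return f"reply {len(calls)}"

final_output = summarize_then_act(["chunk a", "chunk b"], fake_llm)
print(final_output)  # reply 3
```

In the notebook, the `ollama` object could be passed as `llm`, since it is called the same way (`ollama(query)`).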

In [482]:
# test of gen_actionable
# (hand-written alternative input; unused, the call below feeds in result_summ["output"])
test_summarized = "Provide better slides and planned activities"

# run ollama inference
result = gen_actionable(result_summ["output"])

# destructure results
print(result["query"])
print(result["output"])

You are now an actionable feedback bot. Summarize and give actionable feedback, based upon these summarized course evaluations, to the instructor of the course. Leave out names that could identify entities. Make sure that the feedback is factual, actionable, and appropriate to the instructor: 
Here's a summary of the text in up to 2000 tokens:

There are many interesting materials in the course. Invited lectures are also enjoyable.

Based on the summarized course evaluations provided, here is some actionable feedback for the instructor:

"Overall, the course offers a diverse range of materials and lectures that appear engaging and informative for students. However, there may be opportunities to enhance the course content or delivery to further support student learning. Specifically:

* Consider incorporating more interactive elements, such as in-class activities or online discussions, to keep students engaged and motivated.
* Explore ways to provide additional context or background inf

In [483]:
# Desired Columns in the output file
master = pd.DataFrame(
    columns=[
        "campus",           # original
        "semester",         # original
        "course",           # original
        "factuality",       # analysis
        "actionability",    # analysis
        "appropriateness",  # analysis
        "feedback_good",    # original
        "feedback_bad",     # original
        "feedback_extra",   # original
        "summarize_query",  # product secondary
        "summarize_output", # product secondary
        "actionable_query", # product primary
        "actionable_output" # product primary
    ]
)

In [484]:
# initialize iteration counter (max courses: 52)
time_total = len(df["Course"].unique()) # iter-tracking
time_spent = 0                          # iter-tracking

In [485]:
# Notification
print("Starting generation of actionable feedback")

# Generate master file
for course in df["Course"].unique():

    # iteration tracking for console prints
    time_spent += 1
    print("step ", time_spent, "/", time_total)

    # generate copy of dataframe
    df_cpy = df.copy()
    df_cpy_course = df_cpy[df_cpy["Course"] == course]

    for campus in df_cpy_course["Campus"].unique():
        df_cpy_course_campus = df_cpy_course[df_cpy_course["Campus"] == campus]

        for semester in df_cpy_course_campus["Semester"].unique():
            # Narrow the course/campus subset down to a single semester
            df_cpy_course_campus_semester = df_cpy_course_campus[
                df_cpy_course_campus["Semester"] == semester
            ]
            
            print(f"amount of responses: {len(df_cpy_course_campus_semester)}")

            if not df_cpy_course_campus_semester.empty:
                feedback_good = ". ".join(df_cpy_course_campus_semester["Feedback_good"].dropna())
                feedback_bad = ". ".join(df_cpy_course_campus_semester["Feedback_bad"].dropna())
                feedback_extra = ". ".join(df_cpy_course_campus_semester["Feedback_extra"].dropna())

                # Concat feedback strings
                feedback = f'{feedback_good}. {feedback_bad}. {feedback_extra}'

                # Create array of chunks
                feedback_array = split_text_into_array(feedback, max_tokens=max_token_context_length)
                feedback_array_length = len(feedback_array)

                # Summarized feedback
                summarized_feedback = gen_summarized(feedback_array)
                print(f"Sum. result: {summarized_feedback['output']}")

                # Generate Actionable feedback
                actionable_feedback = gen_actionable(summarized_feedback["output"])
                print(f"Act. result: {actionable_feedback['output']}")


                # Define the new entry
                new_row = {
                    "campus": campus,
                    "semester": semester,
                    "course": course,
                    "factuality": "",
                    "actionability": "",
                    "appropriateness": "",
                    "feedback_good": feedback_good,
                    "feedback_bad": feedback_bad,
                    "feedback_extra": feedback_extra,
                    "summarize_query": summarized_feedback["query"],
                    "summarize_output": summarized_feedback["output"],
                    "actionable_query": actionable_feedback["query"],
                    "actionable_output": actionable_feedback["output"],
                }

                # Add the new row to master dataframe
                master = pd.concat([master, pd.DataFrame([new_row])], ignore_index=True)

master.to_csv(output_path, index=False)
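
The three nested loops above select every (course, campus, semester) combination in turn. A `groupby` over the same three columns is one way to express that selection more compactly. The sketch below covers only the feedback-joining step and leaves out the model calls; `collect_feedback` is a hypothetical name and the demo frame is made-up data. Note that `groupby` skips rows whose key is missing by default.

```python
import pandas as pd

def collect_feedback(df):
    """Join the feedback of each (course, campus, semester) group into one string per column."""
    rows = []
    keys = ["Course", "Campus", "Semester"]
    # sort=False keeps groups in order of first appearance, like .unique()
    for (course, campus, semester), group in df.groupby(keys, sort=False):
        rows.append({
            "course": course,
            "campus": campus,
            "semester": semester,
            "feedback_good": ". ".join(group["Feedback_good"].dropna()),
            "feedback_bad": ". ".join(group["Feedback_bad"].dropna()),
            "feedback_extra": ". ".join(group["Feedback_extra"].dropna()),
        })
    return pd.DataFrame(rows)

# Made-up demo data
demo = pd.DataFrame({
    "Course": ["A", "A", "B"],
    "Campus": ["Aalborg", "Aalborg", "Aalborg"],
    "Semester": ["9", "9", "9"],
    "Feedback_good": ["x", "y", None],
    "Feedback_bad": [None, "z", "w"],
    "Feedback_extra": [None, None, None],
})
grouped = collect_feedback(demo)
print(grouped)
```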

Starting generation of actionable feedback
step  1 / 52
amount of responses: 6
Sum. result: Here is a summary of the text in 1000 tokens or less:

* Students found all lecturers to be good and engaged in their teaching.
* Course content was relevant to the study material.
* Lecturers were good at making the subject interesting, involving students actively, and making it relevant for future use.
* Lectures were well-structured and engaging.
* Students were enthusiastic and relatable to many technical things.
* Four hours of lecturing per week was too long, especially in the afternoon.
* Lectures could be split up into two per week for more practical exercises.
* The course on criminal law was maybe a bit too broad.
* There were some confusion about what constitutes a legally valid paragraph.
* Both synopses and PMM assignments contained contradictory information.
* There was no opportunity to follow the course material at home due to illness or uncertainty.
* Synopses were not well-know