# Data exploration
Each of the files within `data/` should contain the following:
* A JSON-formatted string with the keys 'score' and 'explanation', which give the score the autograder assigned to the student response and the explanation behind it.
* A PredictionStats object formatted as a string, which contains information on the token count, time taken, etc. This will be important later for calculating computing power consumption.
However, this is not the case in every file:

In [85]:
with open('data/expt_temp_0_resp_10_run_4.txt', 'r') as f:
    for line in f:
        print(line, end='')

{
  'explanation': 'The student has partially met all criteria: The dataset is resampled with replacement multiple times (1 point), but it\'s not specified if this is done n times where n is the size of the dataset, so 1/2 point for the first criterion. The new resampled datasets are used to obtain coefficients such as residuals and gradient of the line of best fit, which is then plotted together in a histogram and summary statistics are obtained (4 points).',
  'score': 5
}
LlmPredictionStats.from_dict({
  "numGpuLayers": -1.0,
  "predictedTokensCount": 109,
  "promptTokensCount": 233,
  "stopReason": "eosFound",
  "timeToFirstTokenSec": 0.039,
  "tokensPerSecond": 26.32290085129227,
  "totalTokensCount": 342
})

In addition to the JSON object, there is also a short output from the LLM. One way to address this would be to use structured output; however, preliminary studies using structured output produced simple, one-line explanations that did not address the rubric. Standardizing the output is something we will investigate further down the line.
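As a rough illustration of how the prediction stats could feed a compute estimate later on, the sketch below derives an approximate wall-clock time from the values in the example file above. It assumes that `tokensPerSecond` describes generation throughput only and that `timeToFirstTokenSec` covers prompt processing; neither assumption has been verified against LM Studio's definitions.

```python
# Values copied from the prediction stats of the example file above
example_stats = {
    "predictedTokensCount": 109,
    "timeToFirstTokenSec": 0.039,
    "tokensPerSecond": 26.32290085129227,
}

# Assumption: tokensPerSecond measures generation speed only, so the
# generation time is roughly predicted tokens divided by throughput
generation_sec = example_stats["predictedTokensCount"] / example_stats["tokensPerSecond"]

# Assumption: time to first token accounts for prompt processing
total_sec = example_stats["timeToFirstTokenSec"] + generation_sec

print(f"Estimated wall-clock time: {total_sec:.2f} s")
```

Multiplying such a time by an average power draw would give an energy estimate, but the power figure would have to be measured separately.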

# Data processing
The most important thing for us to extract from this data is the score, specifically the integer that follows the `score` key. Since all scores seem to follow the format `'score': [some int]`, we will use regular expressions to extract the score from every file. Along with the score, we will also collect diagnostic information from the `LlmPredictionStats` object. The regex should be able to detect the following patterns:
* `'score': 6`
* `"score": 6`
* `'score' : 6`

At the same time, we'll extract information about the student response ID, the temperature, as well as the repeat number for each of the trials.
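The run information is encoded in the filename, so it can be recovered with a second regular expression. The sketch below is a minimal, standalone version of that idea. The helper name `parse_run_filename` and the generalized temperature pattern are illustrative assumptions, and the second filename is a hypothetical example that follows the naming scheme of the file shown earlier.

```python
import re

# Hypothetical generalization: accept any non-negative decimal temperature
RUN_PATTERN = re.compile(r"temp_(\d+(?:\.\d+)?)_resp_(\d+)_run_(\d+)")


def parse_run_filename(filename):
    """Return (temperature, response, repeat), or None if the name does not match."""
    found = RUN_PATTERN.search(filename)
    if found is None:
        return None
    return float(found.group(1)), int(found.group(2)), int(found.group(3))


print(parse_run_filename("expt_temp_0_resp_10_run_4.txt"))
print(parse_run_filename("expt_temp_0.6_resp_9_run_8.txt"))  # hypothetical filename
```

Returning `None` for a non-matching name lets the caller decide whether to skip the file or raise an error, rather than failing on `.group()`.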

In [106]:
# Import dependencies
import re
import numpy as np
import pandas as pd
import json
import os
from scipy import stats

In [81]:
test = ["'score': 6", "\"score\": 6", "'score' : 6", "'score':6"]

pattern = r"[\'\"]\s*score\s*[\'\"]\s*:\s*(\d+)"
scores = []

for snippet in test:
    match = re.search(pattern, snippet)
    if match:
        match = match.group(1)
    else:
        match = None
    scores.append(match)

scores

# Regex works!

['6', '6', '6', '6']

In [82]:
pattern = r"[\'\"]\s*score\s*[\'\"]\s*:\s*(\d+)"
data = []

for file in os.listdir('data'):
    file = 'data/'+file

    # print(f"Analyzing file: {file}")

    # Default to None so that a file without a recognizable score yields a missing value
    match = None

    # We open the file and search each line for the regex pattern
    with open(file, 'r') as f:
        lines = f.readlines()
        for line in lines:
            # print(f"Analyzing line: {line}", end="")
            found = re.search(pattern, line)
            # If we find a match, we store the score as an int and break out of the loop.
            # We test the match object rather than the score, so a legitimate score of 0 is kept.
            if found:
                match = int(found.group(1))
                # print(f"Score found: {match}")
                break
        # Since the inference metadata is always the last 9 lines, we extract the last 9 lines for processing later
        inference_metadata = lines[-9:]

    # Then we use regex to extract information about the run itself
    run_metadata = re.search(r"temp_(0|0\.2|0\.4|0\.6|0\.8|1)_resp_(\d*)_run_(\d*)", file)
    temperature, response, repeat = float(run_metadata.group(1)), int(run_metadata.group(2)), int(run_metadata.group(3))

    # Although this should not yield an error, just in case, we will wrap it inside a try/except block.
    try:
        # We then extract the inference metadata. Because this was directly generated by LM Studio, there should not be any formatting issues, so simple string operations are enough.
        inference_metadata = ''.join(inference_metadata)
        # Remove the wrapper call and its closing bracket, leaving a JSON formatted string
        inference_metadata = inference_metadata.replace('LlmPredictionStats.from_dict(', '').strip().rstrip(')')
        # Then parse the JSON string into a dictionary
        inference_metadata = json.loads(inference_metadata)
    # This way, if any error happens, the code does not crash. We fall back to an empty
    # dictionary so that entry.update() below still works and the columns are left as NaN.
    except Exception:
        inference_metadata = {}

    # Now that all data is prepared, we can store it into a dictionary to append to our data array
    entry = {
        'response': response,
        'score': match,
        'temperature': temperature,
        'repeat': repeat,
    }

    entry.update(inference_metadata)
    data.append(entry)

Now that we have all the data, we can clean up the dataframe for data analysis. However, there is a small issue. When we use `DataFrame.dropna()`, we lose 5 rows, which means the score could not be extracted from those files.

In [83]:
df = pd.DataFrame(data)

# By creating a separate "clean_df", we can see which indices were removed when we dropna.
clean_df = df.dropna()
dropped_indices = set(df.index) - set(clean_df.index)
dropped_df = df.loc[list(dropped_indices)]
dropped_df

| | response | score | temperature | repeat | numGpuLayers | predictedTokensCount | promptTokensCount | stopReason | timeToFirstTokenSec | tokensPerSecond | totalTokensCount |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 705 | 9 | NaN | 0.6 | 8 | -1.0 | 319 | 233 | eosFound | 0.039 | 25.702527 | 552 |
| 71 | 2 | NaN | 0.6 | 7 | -1.0 | 311 | 233 | eosFound | 0.039 | 25.724358 | 544 |
| 651 | 4 | NaN | 0.8 | 11 | -1.0 | 185 | 233 | eosFound | 0.04 | 26.844138 | 418 |
| 559 | 1 | NaN | 0.0 | 3 | -1.0 | 105 | 233 | eosFound | 0.04 | 25.531741 | 338 |
| 26 | 9 | NaN | 0.6 | 10 | -1.0 | 334 | 233 | eosFound | 0.039 | 25.821087 | 567 |


It seems that the regex pattern did not match these files because some expressed the score as a fraction and others did not use the JSON format. Instead of discarding these results, we will enter the scores manually, since there are only 5 data points.

In [87]:
df.loc[(df['temperature'] == 0.6) & (df['response'] == 2) & (df['repeat'] == 7), 'score'] = 4
df.loc[(df['temperature'] == 0.6) & (df['response'] == 9) & (df['repeat'] == 10), 'score'] = 4
df.loc[(df['temperature'] == 0.6) & (df['response'] == 9) & (df['repeat'] == 8), 'score'] = 5
df.loc[(df['temperature'] == 0.0) & (df['response'] == 1) & (df['repeat'] == 3), 'score'] = 0
df.loc[(df['temperature'] == 0.8) & (df['response'] == 4) & (df['repeat'] == 11), 'score'] = 3

# dropna() returns a new DataFrame, so we check its shape directly to confirm that no rows are lost
df.dropna().shape

(900, 11)
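If more files had failed, manual entry would not scale. One possible alternative is a fallback pattern that is tried only when the primary pattern fails. The sketch below assumes the fraction-style scores look like `Score: 4/6`; the exact wording in the affected files is not shown here, so the `LOOSE` pattern, the helper name `extract_score`, and the test strings are all hypothetical.

```python
import re

# Primary pattern from the notebook: a quoted "score" key followed by an integer
KEYED = re.compile(r"[\'\"]\s*score\s*[\'\"]\s*:\s*(\d+)")

# Hypothetical fallback: the word "score", up to five non-word characters,
# then an integer that may be followed by "/ total"
LOOSE = re.compile(r"score\W{0,5}(\d+)(?:\s*/\s*\d+)?", re.IGNORECASE)


def extract_score(text):
    """Try the strict pattern first, then the loose one; return None if both fail."""
    for pattern in (KEYED, LOOSE):
        found = pattern.search(text)
        if found:
            return int(found.group(1))
    return None


print(extract_score("'score': 6"))
print(extract_score("Score: 4/6"))  # hypothetical fraction-style output
print(extract_score("no grade here"))
```

A loose pattern like this can pick up the wrong number in free text, so any score it recovers would still be worth checking by hand.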

# Data analysis

The questions we want to answer are as follows:
* How does temperature affect variance in solutions?
* Is variance a predictor for potential inaccuracies?

In [89]:
df.columns

Index(['response', 'score', 'temperature', 'repeat', 'numGpuLayers',
       'predictedTokensCount', 'promptTokensCount', 'stopReason',
       'timeToFirstTokenSec', 'tokensPerSecond', 'totalTokensCount'],
      dtype='object')

In [132]:
# We generate descriptive statistics for every response/temperature combination
stats_df = df.groupby(['response', 'temperature'])['score'].describe().reset_index()

# To ensure that mean/std is appropriate, we perform the Shapiro-Wilk test to test for normality.
normality_results = []

for (resp, temp), group in df.groupby(['response', 'temperature']):
    stat, p_value = stats.shapiro(group['score'])
    normality_results.append({
        'response': resp,
        'temperature': temp,
        'p_value': p_value,
        'is_normal': p_value > 0.05
    })

normality_df = pd.DataFrame(normality_results)
stats_df = pd.merge(stats_df, normality_df, on=['response', 'temperature'])

# Now we want to see how many of these data points are non-normal
print(f"Temperature/Response combinations that have non-normally distributed results: {((stats_df['is_normal'] == False).sum()/len(stats_df['is_normal'])) * 100}%")

Temperature/Response combinations that have non-normally distributed results: 95.0%


Since 95% of the response/temperature combinations are not normally distributed, we will instead use the median and IQR to summarize the center and spread of our grades.

In [147]:
# First create temporary dataframe with all the statistics we need
stats_df = df.groupby(['response', 'temperature'])['score'].agg([
    'median',
    lambda x: x.quantile(0.25),
    lambda x: x.quantile(0.75)
]).reset_index()

# Rename the columns for clarity
stats_df.columns = ['response', 'temperature', 'median', 'q1', 'q3']

# Calculate IQR
stats_df['iqr'] = stats_df['q3'] - stats_df['q1']

# Keep only the requested columns
stats_df = stats_df[['temperature', 'response', 'median', 'iqr']].sort_values('temperature')
stats_df.head()

| | temperature | response | median | iqr |
|---|---|---|---|---|
| 0 | 0.0 | 1 | 4.0 | 1.0 |
| 48 | 0.0 | 9 | 4.0 | 1.0 |
| 42 | 0.0 | 8 | 4.0 | 1.0 |
| 36 | 0.0 | 7 | 4.0 | 1.0 |
| 6 | 0.0 | 2 | 4.0 | 0.0 |


Now, we want to see if temperature has an effect on the variance of the data.

In [144]:
stats_df2 = stats_df.groupby(['temperature'])['iqr'].describe().reset_index()

normality_results = []

# Group by the column name (not a one-element list) so that each key is a plain float
for temp, group in stats_df.groupby('temperature'):
    stat, p_value = stats.shapiro(group['iqr'])
    normality_results.append({
        'temperature': temp,
        'p_value': p_value,
        'is_normal': p_value > 0.05
    })

normality_df = pd.DataFrame(normality_results)
stats_df2 = pd.merge(stats_df2, normality_df, on=['temperature'])

stats_df2

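| | temperature | count | mean | std | min | 25% | 50% | 75% | max | p_value | is_normal |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 10.0 | 0.9 | 0.394405 | 0.0 | 1.0 | 1.0 | 1.0 | 1.5 | 0.003622 | False |
| 1 | 0.2 | 10.0 | 0.6 | 0.516398 | 0.0 | 0.125 | 0.5 | 1.0 | 1.5 | 0.190991 | True |
| 2 | 0.4 | 10.0 | 0.75 | 0.353553 | 0.0 | 0.5 | 1.0 | 1.0 | 1.0 | 0.002088 | False |
| 3 | 0.6 | 10.0 | 1.0 | 0.408248 | 0.5 | 0.625 | 1.0 | 1.375 | 1.5 | 0.035215 | False |
| 4 | 0.8 | 10.0 | 0.8 | 0.537484 | 0.0 | 0.5 | 1.0 | 1.0 | 1.5 | 0.176996 | True |
| 5 | 1.0 | 10.0 | 0.6 | 0.459468 | 0.0 | 0.125 | 0.75 | 1.0 | 1.0 | 0.004219 | False |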


Due to the non-normality of the data, median and IQR will be plotted instead.

The rest of the analysis was done in Prism.
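For readers who want to stay in Python, one non-parametric option for comparing the per-response IQRs across temperatures is the Friedman test, which treats each response as a repeated measure. This is only a sketch of one possible approach and not necessarily the test that was run in Prism. The data below are toy values made up for illustration and are not the notebook's results.

```python
import pandas as pd
from scipy import stats

# Toy data for illustration only; these are NOT the notebook's results
toy = pd.DataFrame({
    'response':    [1, 2, 3, 4] * 3,
    'temperature': [0.0] * 4 + [0.5] * 4 + [1.0] * 4,
    'iqr':         [0.0, 0.5, 1.0, 1.0, 0.5, 1.0, 1.0, 1.5, 1.0, 1.5, 1.5, 2.0],
})

# One row per response, one column per temperature, so the samples stay paired
wide = toy.pivot(index='response', columns='temperature', values='iqr')

# The Friedman test needs at least three related samples of equal length
statistic, p_value = stats.friedmanchisquare(*[wide[col] for col in wide.columns])
print(f"Friedman chi-square = {statistic:.3f}, p = {p_value:.3f}")
```

With the real `stats_df`, the same pivot would give ten responses by six temperatures.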