<a name="top"></a>

# Identify and upload stimuli sequences to Mongo

## Contents:
* [Import Packages + Set up Paths](#import)

* [Compute & summarize puzzle metrics](#compute)  

* [Identify experimental puzzles](#identify)  

* [Shuffle and counterbalance experiment stimuli sequences](#shuffle)  

* [Push stimuli sequences to Mongo](#mongo)

In [None]:
import os, sys
import pymongo as pm
import pandas as pd
import ast
import numpy as np
from scipy.stats import skew, kurtosis
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context('talk')
sns.set_style('white')

import random
import itertools
import json

from operator import itemgetter
from collections import Counter

from IPython.display import clear_output
from pprint import pprint

# mongo parameters
project_name = 'fun-puzzles'
experiment_name = 'fun-puzzles-exp1'
iterationName = 'production2' # pilot1_debug
mongo_collection_name = experiment_name #'fun-puzzles-debug' # this should match experiment name

# repo directory and file hierarchy
proj_dir =  os.path.abspath('../..')
stimuli_dir = os.getcwd()
output_dir = os.path.join(stimuli_dir, experiment_name)

dumpjson = False ## do we want to save the stimuli json?
write = False ## do we ACTUALLY want to write to mongo?


# <a name="import"></a> import csv of novice levels([^](#top))

### Some settings + set up paths

### Load level identifiers
Load in manually identified collections that are recommended for novices (e.g., from online forums, from author descriptions)

In [None]:
novice_levels = pd.read_csv("sokoban-for-novices_metadata.csv",
                            converters={"layout": ast.literal_eval,
                                        "top_solutions": ast.literal_eval})
novice_levels['sokobanonline__top_solutions'] = novice_levels['sokobanonline__top_solutions'].apply(ast.literal_eval)
novice_levels.head()

# <a name="compute"></a> Compute difficulty and enjoyment metrics([^](#top))

The information from sokobanonline allows computing three sets of metrics for each puzzle: 

- Visual features
- Difficulty
- Enjoyment 

### Visual features: 

- **area**: Rectangular area occupied by puzzle (in number of tiles)

In [None]:
# approx area of puzzle
novice_levels["level_area"] = novice_levels["level_width"]*novice_levels["level_height"]

### Gleaned Difficulty:

- **shortest_solution**: Step length of shortest solution
- **solve_rate**: Number completed / attempted
- From top solutions table (i.e., number of completions for each of top 10 shortest solution lengths)
    - **Kurtosis**: How concentrated vs. wide is the distribution? Does everyone find the same best solution?
    - **Skew**: How asymmetric is this distribution? Right (positive) skew, with most completions at the low-step end and a tail toward longer solutions, suggests many people found the best solution. Alternatively, most people may find a non-optimal solution, or the distribution may be bimodal.

Note: **High kurtosis indicates**:
- Sharp peakedness in the distribution’s center.
- More values concentrated around the mean than in a normal distribution.
- Heavier tails because of a higher concentration of extreme values or outliers in tails.
- Greater likelihood of extreme events.
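
To make these shape statistics concrete, here is a minimal sketch that expands a hypothetical top-solutions table into one entry per player and computes skewness and excess kurtosis in plain Python. The table values are invented for illustration, and the helper names (`flatten_top_solutions`, `central_moment`, `skewness`, `excess_kurtosis`) are not part of this notebook. The formulas are the standard biased-moment definitions, which is what `scipy.stats.skew` and `scipy.stats.kurtosis` compute with their default `bias=True`.

```python
def flatten_top_solutions(top_solutions):
    """Repeat each solution length once per player who achieved it."""
    return [entry["steps"]
            for entry in top_solutions
            for _ in range(entry["num_players"])]

def central_moment(values, order):
    """Biased (population) central moment of the given order."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)

def skewness(values):
    """Third standardized moment: m3 / m2^1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Fourth standardized moment minus 3 (a normal distribution scores 0)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

# Hypothetical table: most players found the 20-step solution
toy_top_solutions = [
    {"steps": 20, "num_players": 5},
    {"steps": 22, "num_players": 2},
    {"steps": 30, "num_players": 1},
]

toy_steps = flatten_top_solutions(toy_top_solutions)
toy_skew = skewness(toy_steps)
toy_kurt = excess_kurtosis(toy_steps)
print(toy_steps)
print(f"skewness = {toy_skew:.2f}, excess kurtosis = {toy_kurt:.2f}")
```

In this toy table most players sit at the shortest length with a tail toward longer solutions, so the skewness comes out positive.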

In [None]:
# Step length of shortest solution
# (stored under the column name that the plots and filters below refer to)
novice_levels["sokobanonline__shortest_solution"] = novice_levels["sokobanonline__top_solutions"].apply(
    lambda x: x[0]['steps'] if isinstance(x, list) and len(x) > 0 else None
)
# ratio completed/attempted 
novice_levels["sokobanonline__solve_rate"] = novice_levels["sokobanonline__num_solved"]/ novice_levels["sokobanonline__num_played"]
## record of every attempt that made it to top 10 solutions 
novice_levels["sokobanonline__top_solutions_flattened"] = novice_levels["sokobanonline__top_solutions"].apply(
    lambda x: [step for a in x for step in [a['steps']]* a['num_players']] if isinstance(x, list) and len(x) > 0 else None
)
## Kurtosis, skewness: density curve of height (num solved), x-axis (number of steps)
# novice_levels["sokobanonline___skewness"] = novice_levels["sokobanonline__top_solutions_flattened"].apply(lambda x: abs(skew(x, bias=True)))
# novice_levels["sokobanonline___kurtosis"] = novice_levels["sokobanonline__top_solutions_flattened"].apply(lambda x: kurtosis(x, bias=True))

In [None]:
difficulty_columns = ["level_area", "sokobanonline__shortest_solution", "sokobanonline__solve_rate"]

# univariate histograms
fig, axes = plt.subplots(1,len(difficulty_columns), figsize=(15, 3))

for i, ax in enumerate(axes.flatten()):
    sns.histplot(novice_levels[difficulty_columns[i]],
                                  stat='percent', ax=ax,)
    
fig.suptitle('Gleaned difficulty metrics')
plt.tight_layout()
plt.show()

In [None]:
## What do top 10 solutions give us? 
level_ids = random.sample(range(len(novice_levels)), 8)

fig, axes = plt.subplots(2,4, figsize=(12, 6))

for i, ax in enumerate(axes.flatten()):
    this_level = novice_levels.iloc[level_ids[i]]  # row for the i-th sampled level
    steps = this_level.sokobanonline__top_solutions_flattened
    sns.histplot(steps,
                 bins=10, discrete=True,
                 ax=ax)
    # title uses the sampled level's own collection and level name
    ax.set_title("{}_{}".format(this_level.collection_name[0:7], str(this_level.level_name)))
    ax.set_xlabel("steps taken")
    ax.set_xticks([min(steps), max(steps)])
    
fig.suptitle("Frequency distribution of top 10 solutions per puzzle")
plt.tight_layout()
plt.show()

In [None]:
correlation_matrix = novice_levels[difficulty_columns].corr()

# Create a heatmap for the correlation matrix
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Gleaned difficulty metrics - Correlations")
plt.show()

In [None]:
# Scatterplot matrix with regression lines
# (pairplot creates its own figure, so no separate plt.figure call is needed)
sns.pairplot(
    novice_levels[difficulty_columns],
    kind="reg",  # Adds regression lines to scatterplots
    diag_kind="kde",  # Kernel density estimation for the diagonal
    plot_kws={'line_kws':{'color':'red'}}
)
plt.suptitle('Gleaned difficulty metrics')
plt.tight_layout()
plt.show()

### Gleaned Enjoyment: 

- **reaction_rate**: total_reactions / num_attempted
- **likes_rate**: num_liked / num_attempted
- **dislikes_rate**: num_disliked / num_attempted
- **likes_perc**: num_liked / total_reactions
- **dislikes_perc**: num_disliked / total_reactions
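
As a worked example of these ratios, the sketch below computes all five for a single made-up puzzle. The function name `enjoyment_metrics` and the counts are hypothetical. Returning NaN when a puzzle has no reactions is a choice made here to mirror what pandas yields for a 0/0 division.

```python
def enjoyment_metrics(likes, dislikes, num_played):
    """Compute the five enjoyment ratios for one puzzle."""
    reactions = likes + dislikes
    nan = float("nan")
    return {
        "reaction_rate": reactions / num_played,
        "likes_rate": likes / num_played,
        "dislikes_rate": dislikes / num_played,
        # the *_perc metrics are undefined when nobody reacted
        "likes_perc": likes / reactions if reactions else nan,
        "dislikes_perc": dislikes / reactions if reactions else nan,
    }

# Hypothetical counts: 400 plays, 30 likes, 10 dislikes
toy_metrics = enjoyment_metrics(likes=30, dislikes=10, num_played=400)
for name, value in toy_metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that `likes_perc` and `dislikes_perc` always sum to 1, so they carry the same information, whereas the two `*_rate` metrics can vary independently.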

In [None]:
# Reaction Rate: total_reactions/num_attempted
novice_levels["sokobanonline__reaction_rate"] = (novice_levels["sokobanonline__likes"]+novice_levels["sokobanonline__dislikes"])/ novice_levels["sokobanonline__num_played"]
# Like Rate: num_liked/num_attempted
novice_levels["sokobanonline__likes_rate"] = novice_levels["sokobanonline__likes"]/ novice_levels["sokobanonline__num_played"]
# Like Score: num_liked / total reactions
novice_levels["sokobanonline__likes_perc"] = novice_levels["sokobanonline__likes"]/(novice_levels["sokobanonline__likes"]+ novice_levels["sokobanonline__dislikes"])
# Dislike Rate: num_disliked/num_attempted
novice_levels["sokobanonline__dislikes_rate"] = novice_levels["sokobanonline__dislikes"]/ novice_levels["sokobanonline__num_played"]
# Dislike Score: num_disliked / total reactions
novice_levels["sokobanonline__dislikes_perc"] = novice_levels["sokobanonline__dislikes"]/(novice_levels["sokobanonline__likes"]+ novice_levels["sokobanonline__dislikes"])


In [None]:
enjoyment_columns = ["sokobanonline__reaction_rate", 
                     "sokobanonline__likes_rate", "sokobanonline__likes_perc", 
                     "sokobanonline__dislikes_rate", "sokobanonline__dislikes_perc"]

# univariate histograms
fig, axes = plt.subplots(1,len(enjoyment_columns), figsize=(15, 3))

for i, ax in enumerate(axes.flatten()):
    this_var = enjoyment_columns[i]
    sns.histplot(novice_levels[this_var],
                 stat='percent',
                 ax=ax)
    ax.set_xlabel(this_var.split("__")[-1])
    
fig.suptitle('Gleaned enjoyment metrics')
plt.tight_layout()
plt.show()

In [None]:
correlation_matrix = novice_levels[enjoyment_columns].corr()

# Create a heatmap for the correlation matrix
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Gleaned enjoyment metrics - Correlations")
plt.show()

In [None]:
# exclude outlier of dislike ratio
df = novice_levels.query('sokobanonline__dislikes_rate < .03')[enjoyment_columns]
# Scatterplot matrix with regression lines
# (pairplot creates its own figure, so no separate plt.figure call is needed)
sns.pairplot(
    df,
    kind="reg",  # Adds regression lines to scatterplots
    diag_kind="kde",  # Kernel density estimation for the diagonal
    plot_kws={'line_kws':{'color':'red'}}
)
plt.suptitle("Gleaned enjoyment metrics - correlations")
plt.tight_layout()
plt.show()

## Compare enjoyment and difficulty

In [None]:
# Enjoyment vs. Difficulty -- novice collections
sns.pairplot(
    novice_levels,
    x_vars=difficulty_columns,
    y_vars=enjoyment_columns,
    hue="collection_name",
    plot_kws={'alpha': 0.5}
)

In [None]:
# Grid layout: one row per difficulty metric, one column per enjoyment metric
num_rows = len(difficulty_columns)
num_cols = len(enjoyment_columns)
# Create a figure and a grid of subplots
fig, axes = plt.subplots(num_rows, num_cols, figsize=(3 * num_cols, 5 * num_rows))

# Iterate over each pair of columns and create a scatter plot with a regression line
for i, col1 in enumerate(difficulty_columns):
    for j, col2 in enumerate(enjoyment_columns):
        ax = axes[i, j]
        sns.regplot(x=novice_levels[col1], y=novice_levels[col2], ax=ax, ci=None, line_kws={"color": "red"})
        # correlate the same rows that are plotted (novice_levels, not the outlier-filtered df)
        pearson_r = novice_levels[col1].corr(novice_levels[col2])
        spearman_rho = novice_levels[col1].corr(novice_levels[col2], method='spearman')
        pearson_text = f"Pearson $r$ = {pearson_r:.2f}"
        spearman_text = rf"Spearman $\rho$ = {spearman_rho:.2f}"
        ax.set_title(pearson_text + "\n" + spearman_text)
        # ax.label_outer()  # Only show outer labels and tick labels
        

# Add column labels at the top
# for ax, col in zip(axes[-1], enjoyment_columns):
#     ax.annotate(col, xy=(0.5, -0.1), xytext=(0, -5),
#                 xycoords='axes fraction', textcoords='offset points',
#                 ha='center', va='top', fontsize=14, fontweight='bold')

# Add row labels on the left
# for ax, row in zip(axes[:,0], difficulty_columns):
#     ax.annotate(row, xy=(-0.1, 0.5), xytext=(-ax.yaxis.labelpad - 5, 0),
#                 xycoords='axes fraction', textcoords='offset points',
#                 ha='right', va='center', fontsize=14, fontweight='bold', rotation=90)

plt.suptitle("Enjoyment vs. Difficulty -- novice collections")
# Adjust layout to prevent overlap
plt.tight_layout()
plt.show()

In [None]:
selected_col = enjoyment_columns + difficulty_columns
correlation_matrix = novice_levels[selected_col].corr()

# Create a heatmap for the correlation matrix
plt.figure(figsize=(10, 10))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Matrix Heatmap")
plt.show()

# <a name="identify"></a> Identify 24 levels([^](#top))

## Filter for not too complex or hard

First, we identify levels with the following features:
- Width and height each between 6 and 9 tiles (inclusive; the queries below use `5 < x < 10`)
- 3 boxes
- at least 100 attempts in corpus
- more than 50% solved
- shortest solution of at most 99 moves

In [None]:
novice_3box_df = (
    novice_levels
    .query('5 < level_width < 10')
    .query('5 < level_height < 10')
    .query('num_boxes == 3')
    .query('100 <= sokobanonline__num_played')
    .query('100 > sokobanonline__shortest_solution')
    .query('sokobanonline__solve_rate > 0.5') # more than 50% solved
)

print(f"Filtered to {novice_3box_df.shape[0]} levels")


## Define 2x2x2 parameters

Next we take three features and split each into 2 bins at its median (giving the 2x2x2 design):
- enjoyment: #likes / #played
- difficulty: #solved / #played
- shortest solution found in corpus
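
The median split can be sketched in plain Python as follows. This is only an illustration with invented solve rates, and `median_split` is a hypothetical helper. The cell below uses `pd.qcut(..., q=2)` instead, whose bins are right-closed, so a value exactly at the median should land in the lower bin, which the `<=` here is meant to mirror.

```python
from statistics import median

def median_split(values, low_label, high_label):
    """Label each value by whether it falls at/below or above the median."""
    cut = median(values)
    return [low_label if v <= cut else high_label for v in values]

# Hypothetical solve rates: a low solve rate means a hard puzzle
toy_solve_rates = [0.55, 0.62, 0.71, 0.80, 0.88, 0.93]
toy_difficulty_cat = median_split(toy_solve_rates, "hard", "easy")

for rate, label in zip(toy_solve_rates, toy_difficulty_cat):
    print(f"{rate:.2f} -> {label}")
```

Applying the same split to all three features yields 2 x 2 x 2 = 8 cells, which is where the 24 levels (3 per cell) come from.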

In [None]:
novice_3box_df = (
    novice_3box_df
    .assign(enjoyment_cat = pd.qcut(novice_3box_df['sokobanonline__likes_rate'], q=2, labels=['low','high']),
         difficulty_cat = pd.qcut(novice_3box_df['sokobanonline__solve_rate'], q=2, labels=['hard', 'easy']),
         shortestPath_cat = pd.qcut(novice_3box_df['sokobanonline__shortest_solution'], q=2, labels=['short', 'long']))
)

# export
novice_3box_df.to_csv(os.path.join(stimuli_dir, "small-3box_puzzles.csv"),index=False)

# tabulate
print(pd.crosstab([novice_3box_df["shortestPath_cat"], novice_3box_df["difficulty_cat"]], novice_3box_df["enjoyment_cat"]))


## Sample 3 levels for each 2x2x2 cell

In [None]:
exp1_levels = (
    novice_3box_df
    .groupby(['shortestPath_cat', 'difficulty_cat', 'enjoyment_cat'])
    .sample(3, random_state=40)
    .filter(['author_name','collection_name', 'level_name', 'layout','level_width','level_height',
           'sokobanonline__likes_rate','sokobanonline__solve_rate','sokobanonline__shortest_solution',
           'enjoyment_cat', 'difficulty_cat', 'shortestPath_cat'])
    .reset_index(drop=True))

exp1_levels['stimuli_set'] = np.tile(['A','B','C'], int(len(exp1_levels)/3))

In [None]:
pd.options.display.float_format = '{:.4f}'.format

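# Check: summary statistics per stimuli set
# (wrapped in print because a notebook only auto-displays a cell's last expression)
print(exp1_levels
      .groupby('stimuli_set')
      .agg({'sokobanonline__solve_rate': ['min', 'median', 'max'],
            'sokobanonline__likes_rate': ['min', 'median', 'max'],
            'sokobanonline__shortest_solution': ['min', 'median', 'max']}))

# Check: number of levels per 2x2x2 cell in each stimuli set (expect 1 each)
pd.pivot_table(exp1_levels, index=['difficulty_cat', 'enjoyment_cat', 'shortestPath_cat'], columns='stimuli_set', values='layout',aggfunc='count')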


## Save levels

In [None]:
def get_boxes(layout):
    # print(get_boxes(original_json['SokobanLevels']['LevelCollection']['Level'][1]['L']))
    all_boxes = []
    for y, row in enumerate(layout):
        for x, obj in enumerate(row):
            if obj == "$" or obj == '*':  # $ if on floor, * if on goal
                all_boxes.append({'x': x, 'y': y, 'state': obj})
    return (all_boxes)

def get_start_position(layout):
    # print(get_start_position(original_json['SokobanLevels']['LevelCollection']['Level'][0]['L']))
    for y, row in enumerate(layout):
        for x, symbol in enumerate(row):
            if symbol == '@' or symbol == '+':    # @ if on floor, + if on goal
                return {"x": x, "y": y}
    return None

exp1_levels['boxes'] = exp1_levels['layout'].apply(get_boxes)
exp1_levels['start_position'] = exp1_levels['layout'].apply(get_start_position)

## inspect first few rows of metadata object
exp1_levels.head()

In [None]:
exp1_levels = exp1_levels.reset_index(names="puzzle_id")
## tell us some useful information
print(f'We have {len(exp1_levels)} stimuli represented in our metadata dataframe.')
print(' ')
print(f'These are the columns in this dataframe: {exp1_levels.columns}.')

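# Export (create the output directory first if it does not exist yet)
os.makedirs(output_dir, exist_ok=True)
exp1_levels.to_csv(os.path.join(output_dir, f'{iterationName}_puzzles-test.csv'),index=False)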

# <a name="shuffle"></a> Shuffle and counterbalance experiment stimuli sequences([^](#top))

We generally take one of two approaches when inserting our metadata into mongo:

- **DIRECT INSERTION**: direct insertion of individual trials (a.k.a. 'items') as individual records in mongo. This option is reasonable when it doesn't matter which stimulus a participant gets on any given trial, e.g., if we simply plan to annotate a bunch of stimuli.
- **BATCHING**: grouping metadata from multiple trials into a batch and then inserting these complete batches into mongo. This option is reasonable when we want to control which exact combination of stimuli a specific participant gets.

For this study, we use **BATCHING** since we need to counterbalance across pretest, test, and posttest trials.
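
The difference between the two approaches is easiest to see in the shape of the records. The sketch below uses made-up trial metadata, and the field names `session_id` and `trials` are illustrative assumptions rather than the schema this notebook inserts.

```python
# Hypothetical per-trial metadata
toy_trials = [
    {"level_name": "toy-1"},
    {"level_name": "toy-2"},
    {"level_name": "toy-3"},
    {"level_name": "toy-4"},
]

# DIRECT INSERTION: one database record per trial
direct_records = [dict(trial) for trial in toy_trials]

# BATCHING: one database record per session, each holding a fixed group of trials
batch_size = 2
batched_records = [
    {"session_id": n, "trials": toy_trials[start:start + batch_size]}
    for n, start in enumerate(range(0, len(toy_trials), batch_size))
]

print(len(direct_records), "direct records")
print(len(batched_records), "batched records")
```

With batching, a participant who is served one record receives exactly the stimuli grouped in it, which is what makes counterbalancing possible.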

In [None]:
df = (exp1_levels[['stimuli_set', 'collection_name', 'level_name',
                  'level_width', 'level_height', 'layout', 'start_position', 'boxes']]
                  .rename(columns={'level_width': 'width', 'level_height': 'height'}))
df.head()

### Sample comparison pairs

For each stimuli set, we make 20 sessions, each with 8 comparison pairs.
Each set is assigned to one of 2 conditions and to either the pre or post phase, so every comparison trial has matched Ns across condition and study phase.

- 20 stimuli matches x 2 conditions x 2 study phases = 80 sessions per set
- 80 x 3 sets = 240 sessions
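
To see why 8 levels give 8 comparison pairs, here is a small sketch of the ring pairing that `make_comparison_trials` performs: shuffle the levels, pair them off, then pair them off again shifted by one position. The function name `ring_pairs` and the integer level ids are placeholders for illustration.

```python
import random
from collections import Counter

def ring_pairs(levels, rng):
    """Shuffle levels, then pair neighbours twice: once as-is, once shifted by one."""
    order = list(levels)
    rng.shuffle(order)
    shifted = order[1:] + order[:1]   # same order, rotated left by one
    sequence = order + shifted        # twice as long as the input
    return [tuple(sequence[k:k + 2]) for k in range(0, len(sequence), 2)]

toy_levels = list(range(8))           # placeholder level ids
toy_pairs = ring_pairs(toy_levels, random.Random(0))
toy_appearances = Counter(level for pair in toy_pairs for level in pair)

print(toy_pairs)
print(toy_appearances)
```

Each level ends up in two different pairs per session, so every level is compared equally often.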

In [None]:
# sampling and printing functions
def make_comparison_trials(list_of_level_indexes, n_sessions):
    '''
    Build n_sessions sessions of comparison trials. Within a session, the shuffled
    levels are paired around a ring, so each level appears in two pairs.
    '''
    sessions = []
    for i in range(n_sessions):
        l1 = list(list_of_level_indexes)
        random.shuffle(l1)
        # [a, b, ..., h] -> [a, b, ..., h, b, ..., h, a], then chunk into consecutive pairs:
        # (a, b), (c, d), ..., (b, c), ..., (h, a)
        l1.extend(l1[1:])
        l1.append(l1[0])
        pairs = list(itertools.batched(l1, 2))  # itertools.batched requires Python 3.12+
        trials = []
        for pair in pairs:
            # note: rows come back in df index order, not in the shuffled order within the pair
            trials.append(df[df.index.isin(pair)].to_dict('records'))
        sessions.append(trials)

    # print helpful stuff
    if sessions:
        print(f"Generated {n_sessions} sessions of {len(sessions[0])} comparison trials each with {len(sessions[0][0])} items...")

    # Return the list of sessions
    return sessions

def count_pairs_in_comparisons(list_of_sessions, verbose=False):
    '''
    Given list of sessions, each session containing a list of comparison trials, each trial a list of two levels,
    count the number of times that each pair of levels occurs
    '''
    pairings = []
    for session in list_of_sessions:
        for trial in session:
            pairings.append([itemgetter('collection_name', 'level_name')(x) for x in trial])
    my_dict = Counter([tuple(i) for i in pairings])
    if verbose:
        pprint(my_dict)
    return my_dict

def count_levels_in_comparisons(list_of_sessions):
    '''
    Given list of sessions, each session containing a list of comparison trials, each trial a list of two levels,
    count the number of comparisons that each level occurs in
    '''
    # Count occurrences of pairs
    # Counter([tuple(i) for i in list_of_sessions])
    # Count occurrences of each level
    flat = list(itertools.chain(*list_of_sessions))
    flat_df = pd.concat(list(map(pd.json_normalize, flat)))
    print(flat_df.value_counts(['collection_name', 'level_name']))

In [None]:
# Set up 
stim_level_indexes = {
    'A' : df[df['stimuli_set']=='A'].index,
    'B': df[df['stimuli_set']=='B'].index,
    'C': df[df['stimuli_set']=='C'].index
    }

# Parameters
n_sessions = 20 # per condition and counterbalancing
n_compare_trials = 8
random.seed(42)

# Let's make some stimuli
stim_comparisons = {}
for letter in ['A', 'B', 'C']:
    print("\nCOMPARISONS FOR SET " + letter)
    stim_comparisons[letter] = make_comparison_trials(stim_level_indexes[letter], n_sessions)
    count_levels_in_comparisons(stim_comparisons[letter])
    count_pairs_in_comparisons(stim_comparisons[letter], False)

In [None]:
# sanity check
print(f"Comparison A has {len(stim_comparisons['A'])} sessions with {len(stim_comparisons['A'][0])} trials, each contrasting {len(stim_comparisons['A'][0][0])} levels")

### Assign condition and counterbalance exp phases
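
As a quick check on the design arithmetic, the sketch below enumerates the counterbalancing cells without touching any stimuli. It assumes, as in the cell that follows, that the first set in each order is the test set and that there are two conditions and 20 sessions per cell. The tuple layout and variable names are only for illustration.

```python
import itertools
from collections import Counter

set_orders = list(itertools.permutations(["A", "B", "C"]))  # 3! = 6 orders
conditions = ["difficult", "enjoyable"]
sessions_per_cell = 20

# One (condition, set order, session id) tuple per planned record
planned = [
    (condition, "".join(order), session_id)
    for order in set_orders
    for condition in conditions
    for session_id in range(sessions_per_cell)
]

# How often does each set serve as the test set (first letter of the order)?
test_set_counts = Counter(order[0] for _, order, _ in planned)

print(len(set_orders), "orders x", len(conditions), "conditions x",
      sessions_per_cell, "sessions =", len(planned), "records")
print(test_set_counts)
```

Each set serves as the test set in 80 of the 240 records, matching the per-set count above.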

In [None]:
J = []
for [a, b, c] in itertools.permutations(['A', 'B', 'C']):
  for condition in ['difficult', 'enjoyable']: 
    print(condition, a, b, c)
    for session_id in range(n_sessions):
      J.append({"condition": condition,
        "stimuli_set_order": a + b + c,
        "stims": {
          "stimuli_test": df[df.index.isin(stim_level_indexes[a])].to_dict('records'),
          "stimuli_compare1": stim_comparisons[b][session_id],
          "stimuli_compare2": stim_comparisons[c][session_id]
      }})
    # print(f"Test: {len(blah['stims']['stimuli_test'])}; Pretest: {len(blah['stims']['stimuli_compare1'])}; Posttest: {len(blah['stims']['stimuli_compare2'])}")

### Double check before mongo

In [None]:
## optionally, save all session records to a json file
if dumpjson:
    with open(os.path.join(output_dir, f'{iterationName}_all-stimuli-records.json'), 'w') as fout:
        json.dump(J, fout)

In [None]:
## let's look at a single record before inserting it into mongo
single_record = J[0]
single_record

# <a name="mongo"></a> Push To Mongo ([^](#top))

## Connect to Mongo

Note: This command establishes an SSH tunnel using local port forwarding (`-L`): connections to port 27017 on your machine are forwarded to port 27017 on the server.

You'll need to re-run this command every time your internet connection resets. A clue that this has happened is that `Broken pipe` appears in your terminal. No worries, just run it again!
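
One way to tell whether the tunnel is still up, before any Mongo call fails, is to try opening a TCP connection to the forwarded local port. This is a minimal sketch using only the standard library. `port_is_open` is a hypothetical helper, and it only confirms that something is listening on the port, not that it is the Mongo server.

```python
import socket

def port_is_open(host="127.0.0.1", port=27017, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # connection refused, timed out, or otherwise unreachable
        return False

if port_is_open():
    print("Port 27017 is reachable; the tunnel appears to be up.")
else:
    print("Nothing is listening on port 27017; re-run the ssh command.")
```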

In [None]:
! ssh -fNL 27017:127.0.0.1:27017 junyichu@cogtoolslab.org

In [None]:
# set vars
auth = pd.read_json(os.path.join(proj_dir,'auth.json'), typ='series') # this auth.json file contains the password
pswd = auth.password
user = auth.user
host = 'cogtoolslab.org'

# have to fix this to be able to analyze from local
import socket
conn = pm.MongoClient('mongodb://sketchloop:' + pswd + '@127.0.0.1:27017')
db = conn['stimuli'] # for data, it's experimentName
coll = db[mongo_collection_name] #FIXME check me everytime.

## Insert each session as a record into mongoDB

In [None]:
## re-load the saved records to check the file (it only exists if dumpjson was True above)
with open(os.path.join(output_dir, f'{iterationName}_all-stimuli-records.json'), 'r') as f:
    data = json.load(f)

print(type(data))
data[1].keys()

In [None]:
num_existing_records = coll.estimated_document_count() ## how many existing records are there?
print(num_existing_records)

if write:
    ## first drop existing records from this collection only if it is NOT empty (be careful!)
    if num_existing_records>0:    
        db.drop_collection(mongo_collection_name)
        print('Dropped existing records from this collection.')

    ## ok, now let's actually add our metadata to the database
    for (i,m) in enumerate(J):
        coll.insert_one(m)
        print(f'{i+1} of {len(J)}| Inserting condition {m["condition"]}')
        clear_output(wait=True)

    print('Done inserting records into mongo!')
else:
    print('write is False, so no records were inserted.')

In [None]:
## check collection to see what records look like
coll.find_one()

In [None]:
## how many records are there in this collection?
coll.estimated_document_count()