# DREAM dataset
The DREAM dataset is one candidate for exploration. It contains the data from a study involving 79 children previously diagnosed with Autism Spectrum Disorder (ASD), run as a controlled experiment with two distinct assistance modalities.

In this experiment, the **Control group** takes part in a conventional Applied Behavior Analysis (ABA) session under the guidance of a human therapist. The **Treatment group** takes part in an identical ABA session, except that it is facilitated by a robotic assistant tasked with stimulating the children's responses.

The primary objective of the study is to determine whether Robot Enhanced Therapy (hereinafter RET) can improve therapeutic outcomes.

Regrettably, even the most recent version of the dataset publicly available lacks the "post therapy results". Consequently, while it is feasible to plot the children's reactions to the test, it remains uncertain whether there is a distinguishable discrepancy between the two groups.

In [1]:
import pandas as pd
import numpy as np
import altair as alt
import glob
import json
import dream_loader

import warnings
warnings.filterwarnings('ignore')

## Loading data
Right from the start we have a difficult task: loading and cleaning up the dataset, which is large and stored in JSON format.

Each child has a folder (with no identifying information), and in this folder there is one file for each therapy session undertaken. The [JSON schema](https://raw.githubusercontent.com/dream2020/data/master/specification/dream.1.1.json) contains some information about the columns, but it does not explicitly define their meanings.

The columns that we could infer from the [original paper](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0236939#pone-0236939-t002) are as follows:
- Child ID (numerical index)
- Child’s gender
- Child’s age in months
- 3D skeleton comprising joint positions for upper body, 3D head position and orientation, 3D eye gaze vectors
- Therapy condition:  SHT for Standard Human Therapy and RET for Robot Enhanced Therapy
- Therapy task:  JA for Joint attention, IM for Imitation and TT for Turn-taking
- Date and time of recording
- Pre test:  Scores for the ADOS (Autism Diagnostic Observation Schedule) standard test taken before the study began
    - Communication
    - Interaction
    - Module
    - Play
    - SocialCommunicationQuestionnaire

Unfortunately, the post-test ADOS scores were not released to the public, so we cannot compare the outcomes of the two groups.

In [2]:
# We'll list all JSON files recursively on the dataset folder
files = glob.glob('./assets/DREAMdataset/**/*.json', recursive=True)

data = []
for filename in files:
    # Normalize each JSON file with the function defined in the dream_loader .py file.
    # This mode normalizes one step further and produces rows for each coordinate on the gazes.
    # To switch modes, comment out the two lines below and uncomment the "with open..." block.
    file_rows = dream_loader.normalize_dream_json(filename)
    data.extend(file_rows)
    # with open(filename, 'r') as f:
    #     df = pd.json_normalize(json.load(f))
    #     data = data + [df]

# dream_df = pd.concat(data)  # use this instead when the "with open..." block is active
dream_df = pd.DataFrame(data)
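
The helper `dream_loader.normalize_dream_json` is not shown in this notebook. As a rough idea of what such a normalization step could look like, the sketch below flattens nested metadata into dotted column names such as `preTest.total` and `task.end`, and skips list-valued per-frame signals. The function name `flatten_metadata` and the sample record are hypothetical; this is not the actual loader, and the values are not real data.

```python
def flatten_metadata(record, prefix=""):
    """Flatten nested dicts into one dict with dotted keys, skipping lists."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, e.g. "task" -> "task.end"
            flat.update(flatten_metadata(value, prefix=f"{name}."))
        elif not isinstance(value, list):
            # Keep scalar metadata only; per-frame series (lists) are dropped
            flat[name] = value
    return flat


# Made-up record shaped loosely like one session file (assumption)
sample_record = {
    "condition": "RET",
    "ageInMonths": 53,
    "task": {"ability": "JA", "difficultyLevel": 2, "end": 1862},
    "eye_gaze": {"rx": [0.1, 0.2], "ry": [0.3, 0.4]},
}

row = flatten_metadata(sample_record)
print(row)
```

A list of such rows, one per file, can be passed directly to `pd.DataFrame`, which is how `dream_df` is built above.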

In [3]:
# Let's print out some statistics from our data
dream_df.describe()

Unnamed: 0,frame_rate,preTest.communication,preTest.interaction,preTest.module,preTest.play,preTest.socialCommunicationQuestionnaire,preTest.stereotype,preTest.total,ageInMonths,id,task.difficultyLevel,task.end,task.index,task.start
count,3121.0,3121.0,3121.0,3121.0,3121.0,3107.0,3121.0,3121.0,3121.0,3121.0,3121.0,3121.0,2802.0,3121.0
mean,24.982478,5.206024,8.770266,1.096123,1.915091,16.298037,2.646267,13.97629,53.675425,45.544056,1.393464,-5942076.0,31.673804,12.704261
std,0.99364,1.730989,2.280089,0.294807,1.416308,5.703334,1.405516,3.579601,11.479304,24.386736,0.757193,25841630.0,34.562641,116.25728
min,4.37,2.0,3.0,1.0,0.0,3.0,0.0,7.0,34.0,3.0,0.0,-150085300.0,0.0,0.0
25%,25.16,4.0,8.0,1.0,1.0,13.0,2.0,12.0,45.0,26.0,1.0,3570.0,11.0,0.0
50%,25.17,5.0,9.0,1.0,2.0,17.0,2.0,14.0,53.0,47.0,1.0,6487.0,27.0,0.0
75%,25.17,7.0,10.0,1.0,3.0,21.0,4.0,17.0,62.0,69.0,2.0,10383.0,43.0,0.0
max,25.19,10.0,13.0,2.0,4.0,26.0,6.0,20.0,76.0,81.0,3.0,46473.0,258.0,6129.0


In [4]:
# And also, take a look at its data
dream_df.head()

Unnamed: 0,user_id,file_index,evaluation_step,date,time,frame_rate,condition,preTest.communication,preTest.interaction,preTest.module,...,preTest.stereotype,preTest.total,ageInMonths,gender,id,task.ability,task.difficultyLevel,task.end,task.index,task.start
0,58,64,Final diagnosis,20180222,144244,25.16,SHT,5,10,1.0,...,4,15,67,female,58,IM,3,2977,64.0,0
1,58,5,Initial diagnosis,20180118,145145,25.17,SHT,5,10,1.0,...,4,15,67,female,58,TT,1,8022,5.0,0
2,58,34,Intervention 4,20180205,144645,25.1,SHT,5,10,1.0,...,4,15,67,female,58,TT,1,3392,34.0,0
3,58,6,Initial diagnosis,20180122,130049,25.17,SHT,5,10,1.0,...,4,15,67,female,58,IM,2,4325,6.0,0
4,58,16,Intervention 2,20180125,144324,25.15,SHT,5,10,1.0,...,4,15,67,female,58,JA,2,1862,16.0,0


In [5]:
# Let's check its columns as well
dream_df.columns

Index(['user_id', 'file_index', 'evaluation_step', 'date', 'time',
       'frame_rate', 'condition', 'preTest.communication',
       'preTest.interaction', 'preTest.module', 'preTest.play',
       'preTest.protocol', 'preTest.socialCommunicationQuestionnaire',
       'preTest.stereotype', 'preTest.total', 'ageInMonths', 'gender', 'id',
       'task.ability', 'task.difficultyLevel', 'task.end', 'task.index',
       'task.start'],
      dtype='object')

# Cleaning Data

In [6]:
# We can see that our task duration is represented in seconds.  Let's create another column so we can check the values in minutes:
# (60 seconds per minute; this assumes 'task.end' really is in seconds)
dream_df['task.end_minutes'] = dream_df['task.end'] / 60

In [7]:
dream_df[['task.end', 'task.end_minutes']].head()

Unnamed: 0,task.end,task.end_minutes
0,2977,49.616667
1,8022,133.700000
2,3392,56.533333
3,4325,72.083333
4,1862,31.033333


In [8]:
# Let's convert our file_index to a number
dream_df['file_index'] = pd.to_numeric(dream_df['file_index'])
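
The `date` and `time` columns appear to be stored as `YYYYMMDD` and `HHMMSS` numbers (for example `20180222` and `144244` in the preview above). Assuming that reading is correct, a minimal sketch for combining them into a single timestamp could look like this; `parse_recording_timestamp` is a hypothetical helper name, not part of the dataset tooling.

```python
from datetime import datetime


def parse_recording_timestamp(date_value, time_value):
    """Combine a YYYYMMDD date and an HHMMSS time into a datetime."""
    date_text = str(date_value)
    # Pad the time so that e.g. 93015 is read as 09:30:15
    time_text = str(time_value).zfill(6)
    return datetime.strptime(date_text + time_text, "%Y%m%d%H%M%S")


stamp = parse_recording_timestamp(20180222, 144244)
print(stamp.isoformat())  # 2018-02-22T14:42:44
```

Applied row by row, this would give a sortable timestamp per session, which could serve as an alternative to `file_index` for ordering sessions.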

# Data Analysis
Unfortunately, there are not many insights to be drawn from the dataset as published, apart from the 3D vectors of the children's responses.  To keep the analysis manageable, we chose to focus on simpler evaluations.

## Task completion times by Therapist
Since the DREAM dataset comes from an A/B test, we can use it to compare the treatment and control groups.  We split the data into the two groups, RET (Robot Enhanced Therapy) and SHT (Standard Human Therapy), to check whether the mean time per session differs between them.

The figure below indicates that the RET group had slightly higher mean session times.  Unfortunately, there is not enough data to establish the cause, so we cannot discuss the reasons for the difference or the efficacy of either approach.

In [9]:
alt.Chart(dream_df[(dream_df['task.end'] > 0) & (dream_df['file_index'] < 70)]).mark_line().transform_aggregate(
    mean_duration='mean(task.end_minutes)',
    groupby=['condition', 'file_index']
).encode(
    x=alt.X('file_index:Q', sort='x'),
    y='mean_duration:Q',
    color='condition:N'
)
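
To complement the chart with numbers, a per-condition summary can be computed with a plain `groupby`. The sketch below uses a small made-up frame in place of `dream_df` (the values are illustrative only) and keeps `task.end` in the dataset's own units. It applies the same `task.end > 0` filter as the chart.

```python
import pandas as pd

# Toy rows standing in for dream_df; the numbers are invented for illustration
toy_df = pd.DataFrame({
    "condition": ["RET", "RET", "SHT", "SHT", "SHT"],
    "task.end": [3000, 6000, 2400, -1, 3600],
})

# Drop non-positive end values, mirroring the filter used for the chart
valid = toy_df[toy_df["task.end"] > 0]

# Count, mean and standard deviation of the end value per therapy condition
summary = valid.groupby("condition")["task.end"].agg(["count", "mean", "std"])
print(summary)
```

Looking at the spread next to the mean gives a first impression of whether a small difference between the groups is likely to be meaningful.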

## Task difficulty by ADOS score
Each child who participated in the study was evaluated with the ADOS test and assigned a score.  Each participant was required to complete three different tasks:  JA for Joint attention, IM for Imitation and TT for Turn-taking.  Each task also has a difficulty level, which increases as the child progresses through the therapy.

We decided to look for a correlation between the ADOS score and the task difficulty.  Do children with higher or lower scores tend to progress more?

The following cells prepare the dataframe to answer this question:

In [10]:
# Group the dataframe by 'user_id' and get the index of the row with the smallest 'file_index'
user_index_of_min_task_start = dream_df.groupby('user_id')['file_index'].idxmin()

# Return a Dataframe with the initial task evaluation for each user
initial_task_per_user_df = dream_df.loc[user_index_of_min_task_start, ['preTest.total', 'task.ability', 'task.difficultyLevel']].reset_index(drop=True)

initial_task_per_user_df

Unnamed: 0,preTest.total,task.ability,task.difficultyLevel
0,14,TT,1
1,20,TT,1
2,15,TT,1
3,20,TT,1
4,15,TT,1
...,...,...,...
56,17,,0
57,14,TT,1
58,16,,0
59,10,,0


In [11]:
# Group the dataframe by 'user_id' and get the index of the row with the highest 'file_index'
user_index_of_max_task_end = dream_df.groupby('user_id')['file_index'].idxmax()

# Return a Dataframe with the final task evaluation for each user
final_task_per_user_df = dream_df.loc[user_index_of_max_task_end, ['preTest.total', 'task.ability', 'task.difficultyLevel']].reset_index(drop=True)

final_task_per_user_df

Unnamed: 0,preTest.total,task.ability,task.difficultyLevel
0,14,TT,2
1,20,TT,2
2,15,IM,3
3,20,TT,2
4,15,TT,1
...,...,...,...
56,17,,0
57,14,TT,2
58,16,,0
59,10,,0


In [12]:
# Now let's merge the initial and final dataframes into one
# Note: the join keys are the ADOS total and the task, not the user
merged_df = pd.merge(initial_task_per_user_df, final_task_per_user_df, on=['preTest.total', 'task.ability'], how='inner')

# And calculate the variation for each difficulty level
merged_df['variation'] = merged_df['task.difficultyLevel_y'] - merged_df['task.difficultyLevel_x']

# Filter out rows without an explicit task
merged_df = merged_df[merged_df['task.ability'].str.len() > 0]

# And prepare the data to be plotted:
grouped_df = merged_df.groupby(['task.ability', 'preTest.total'])['variation'].mean().reset_index()

grouped_df.columns = ['task_ability', 'preTest_total', 'variation']

grouped_df

Unnamed: 0,task_ability,preTest_total,variation
0,IM,8,1.0
1,IM,15,1.0
2,IM,18,0.916667
3,IM,19,0.0
4,JA,13,1.0
5,TT,7,1.0
6,TT,8,0.75
7,TT,9,1.0
8,TT,12,0.666667
9,TT,13,1.0


The figure below shows the mean change in difficulty level for each task between the first and last recorded sessions, split by ADOS score.  It is intended as a rough indication of whether children in each score group progressed during the therapy.

In [13]:
alt.Chart(grouped_df).mark_bar().encode(
    x=alt.X('task_ability', axis=alt.Axis(title='')),
    y=alt.Y('variation', axis=alt.Axis(title='Task Difficulty Growth')),
    column=alt.Column('preTest_total', title='ADOS Score'),
    color=alt.Color('task_ability', title='Task')
)
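
Since the stated goal is to look for a correlation, the relationship can also be summarized with a single coefficient. The sketch below computes a Pearson correlation in plain Python over the ten rows of `grouped_df` shown above, pooled across tasks. With so few points, and with tasks mixed together, the result should be read as illustrative only; the `pearson` helper is written out here for clarity and is not part of the original analysis.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Values copied from the grouped_df output above
ados_scores = [8, 15, 18, 19, 13, 7, 8, 9, 12, 13]
variation = [1.0, 1.0, 0.916667, 0.0, 1.0, 1.0, 0.75, 1.0, 0.666667, 1.0]

r = pearson(ados_scores, variation)
print(round(r, 3))
```

A value near zero would suggest no linear relationship between ADOS score and difficulty growth, while a clearly negative or positive value would hint at a trend worth checking on per-child data.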