# Merging the DREAM and Kaggle Datasets
One of the goals of our study was to merge the two datasets and try to predict success in one from the data available in the other.

Unfortunately, the DREAM dataset did not contain enough data on the study results, so this work was cut short.  We did, however, find one representation worth showing:

In [1]:
import pandas as pd
import numpy as np
import altair as alt
import glob
import json
import dream_loader

import warnings
warnings.filterwarnings('ignore')

## Loading Data

In [2]:
# First we load the Kaggle dataset
kaggle_df = pd.read_csv('./assets/data_csv.csv', header=0, sep=',')

kaggle_df.head()

Unnamed: 0,CASE_NO_PATIENT'S,A1,A2,A3,A4,A5,A6,A7,A8,A9,...,Global developmental delay/intellectual disability,Social/Behavioural Issues,Childhood Autism Rating Scale,Anxiety_disorder,Sex,Ethnicity,Jaundice,Family_mem_with_ASD,Who_completed_the_test,ASD_traits
0,1,0,0,0,0,0,0,1,1,0,...,Yes,Yes,1,Yes,F,middle eastern,Yes,No,Family Member,No
1,2,1,1,0,0,0,1,1,0,0,...,Yes,Yes,2,Yes,M,White European,Yes,No,Family Member,Yes
2,3,1,0,0,0,0,0,1,1,0,...,Yes,Yes,4,Yes,M,Middle Eastern,Yes,No,Family Member,Yes
3,4,1,1,1,1,1,1,1,1,1,...,Yes,Yes,2,Yes,M,Hispanic,No,No,Family Member,Yes
4,5,1,1,0,1,1,1,1,1,1,...,Yes,Yes,1,Yes,F,White European,No,No,Family Member,Yes


In [3]:
# And now we move on to the DREAM dataset
# We'll list all JSON files in the dataset folder, recursively
files = glob.glob('./assets/DREAMdataset/**/*.json', recursive=True)

data = []
for filename in files:
    # Normalize each listed JSON file with the function defined in the .py file.
    # This goes one step further than pd.json_normalize and produces a row for each gaze coordinate.
    # To switch modes, comment out the two lines below and uncomment the "with open..." block
    # (and use the pd.concat line instead of pd.DataFrame further down).
    file_rows = dream_loader.normalize_dream_json(filename)
    data.extend(file_rows)
    # with open(filename, 'r') as f:
    #     df = pd.json_normalize(json.load(f))
    #     data = data + [df]

# dream_df = pd.concat(data)
dream_df = pd.DataFrame(data)

dream_df.head()

Unnamed: 0,user_id,file_index,evaluation_step,date,time,frame_rate,condition,preTest.communication,preTest.interaction,preTest.module,...,preTest.stereotype,preTest.total,ageInMonths,gender,id,task.ability,task.difficultyLevel,task.end,task.index,task.start
0,58,64,Final diagnosis,20180222,144244,25.16,SHT,5,10,1.0,...,4,15,67,female,58,IM,3,2977,64.0,0
1,58,5,Initial diagnosis,20180118,145145,25.17,SHT,5,10,1.0,...,4,15,67,female,58,TT,1,8022,5.0,0
2,58,34,Intervention 4,20180205,144645,25.1,SHT,5,10,1.0,...,4,15,67,female,58,TT,1,3392,34.0,0
3,58,6,Initial diagnosis,20180122,130049,25.17,SHT,5,10,1.0,...,4,15,67,female,58,IM,2,4325,6.0,0
4,58,16,Intervention 2,20180125,144324,25.15,SHT,5,10,1.0,...,4,15,67,female,58,JA,2,1862,16.0,0
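
The dotted column names in the output above (`preTest.total`, `task.ability`) come from flattening nested JSON objects. The `dream_loader` module itself is not shown here, so the following is only a minimal sketch of the general idea, not the actual loader: `flatten` is a hypothetical helper and the record is made up to resemble the columns above.

```python
import json

def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one dict whose keys are joined with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the prefix along
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat

# Made-up record for illustration only (not taken from the DREAM files)
raw = json.loads(
    '{"condition": "SHT", "ageInMonths": 67, '
    '"preTest": {"communication": 5, "total": 15}, '
    '"task": {"ability": "IM", "difficultyLevel": 3}}'
)
row = flatten(raw)
print(sorted(row))
```

Each flattened dictionary becomes one row when a list of them is passed to `pd.DataFrame`, which is how the `data` list is consumed in the cell above.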


## Analysing Data
Both datasets share two comparable columns, the child's biological sex and age, so we can use them to merge the data.  A little cleaning is needed first, however, because the Kaggle dataset records age in years while the DREAM dataset records it in months:

In [4]:
kaggle_df['age_months'] = kaggle_df['Age_Years'] * 12

kaggle_df[['Age_Years', 'age_months']].head()

Unnamed: 0,Age_Years,age_months
0,2,24
1,3,36
2,3,36
3,2,24
4,2,24


Now let's check how biological sex is represented in each dataset.  Do we need to clean it as well?

In [5]:
print(kaggle_df['Sex'].unique())
print(dream_df['gender'].unique())

['F' 'M']
['female' 'male']


In [6]:
# Map the DREAM labels onto the Kaggle convention ('M' / 'F')
dream_df['Sex'] = dream_df['gender'].map({'male': 'M', 'female': 'F'})

dream_df[['gender', 'Sex']].head()

Unnamed: 0,gender,Sex
0,female,F
1,female,F
2,female,F
3,female,F
4,female,F


### AQ-10 / ADOS Score correlation
Now that both datasets are cleaned, we are ready to merge them.  The table and figure below are created by merging both datasets on the features “age” and “sex” to relate the two ASD evaluations.  They show the pairwise correlations between age, the AQ-10 score and the ADOS score.  The two scores are only very weakly positively correlated with each other (about 0.06), while both are negatively correlated with age (about -0.33 for ADOS and -0.15 for AQ-10).  This could suggest that the younger the children, the better the chances of diagnosing autism with these specific tests.

Please note that since ASD is a spectrum, it’s not a simple task to link different tests and this should not be considered representative of the population.
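
One reason for this caution is that neither sex nor age identifies a child, so an inner merge on these two keys pairs every DREAM row with every Kaggle row that shares the same sex and age. The merged frame can therefore be much larger than either input, with individual children repeated many times. A minimal sketch with made-up toy frames (only the column names come from this notebook):

```python
import pandas as pd

# Toy data with invented values; not taken from either dataset
left = pd.DataFrame({
    "Sex": ["F", "F", "M"],
    "ageInMonths": [24, 24, 36],
    "preTest.total": [15, 12, 9],
})
right = pd.DataFrame({
    "Sex": ["F", "F", "F", "M"],
    "age_months": [24, 24, 24, 48],
    "Qchat_10_Score": [3, 7, 5, 8],
})

pairs = pd.merge(
    left, right,
    left_on=["Sex", "ageInMonths"],
    right_on=["Sex", "age_months"],
    how="inner",
)

# 2 left rows x 3 right rows share the key ("F", 24);
# the "M" rows have different ages and drop out of the inner join
print(len(left), len(right), len(pairs))
```

Comparing `len(merged_df)` with `len(dream_df)` and `len(kaggle_df)` is a quick way to see how strongly this pairing inflates the real data.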

In [7]:
merged_df = pd.merge(dream_df, kaggle_df, left_on=['Sex', 'ageInMonths'], right_on=['Sex', 'age_months'], how='inner')[['Sex', 'age_months', 'preTest.total', 'Qchat_10_Score']]

merged_df.columns = ['Gender', 'Age', 'ADOS Score', 'AQ-10 Score']

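# 'Gender' holds strings, so restrict the correlation to numeric columns
# (numeric_only needs pandas >= 1.5 and is required from pandas 2.0 on)
merged_df.corr(numeric_only=True)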

Unnamed: 0,Age,ADOS Score,AQ-10 Score
Age,1.0,-0.328281,-0.14802
ADOS Score,-0.328281,1.0,0.058945
AQ-10 Score,-0.14802,0.058945,1.0
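
The values in the table above are Pearson correlation coefficients, which is the default method of pandas `DataFrame.corr`. As a rough illustration of what each cell measures, here is a minimal pure-Python sketch; `pearson_r` is a hypothetical helper and the numbers are invented, not drawn from the merged data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Invented example: scores that tend to fall as age rises
ages = [24, 36, 48, 60]
scores = [9, 7, 8, 4]
r = pearson_r(ages, scores)
print(round(r, 3))
```

A value near 0, such as the 0.06 between the two scores, means the points show almost no linear relationship, while values toward -1 or 1 indicate a stronger one.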


In [8]:
# Calculate the correlation matrix (numeric columns only, since 'Gender' holds strings)
corr_matrix = merged_df.corr(numeric_only=True)

# Reshape the correlation matrix into long-form for Altair
corr_matrix = corr_matrix.stack().reset_index()
corr_matrix.columns = ['Variable1', 'Variable2', 'Correlation']

# Create the correlation matrix plot
heatmap = alt.Chart(corr_matrix).mark_rect().encode(
    x=alt.X('Variable2:N', title=None),
    y=alt.Y('Variable1:N', title=None),
    color='Correlation:Q',
).properties(
    width=500,
    height=400
)

# Add text labels to the plot
text = heatmap.mark_text(baseline='middle').encode(
    text=alt.Text('Correlation:Q', format='.2f'),  # round the labels to two decimals
    color=alt.condition(
        alt.datum.Correlation > 0.5,
        alt.value('white'),
        alt.value('black')
    )
)

# Combine the heatmap and text layers
heatmap + text