## Investigating whether the scoring of some items needs to be reversed when using my item subsets (wrong correlations between diagnoses and responses)

### Read data

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import sys
sys.path.append("..")  # Adds the parent directory (project_root) to the path
from data_reading import DataReader

data_reader = DataReader()

df = data_reader.read_data(data_type = "item_lvl", 
                            params = ["parent_and_sr", "multiple_assessments", "all_assessments", "learning_and_consensus_diags"])

Reading data from:  ../diagnosis_predictor_data_archive/data/create_datasets/2023-08-17 12.47.56___only_parent_report__0___first_assessment_to_drop__WHODAS_P___use_other_diags_as_input__0___only_free_assessments__0___learning?__0___NIH?__0___fix_n__1/


### Check whether correlations expected to be negative are indeed negative, and those expected to be positive are indeed positive

#### SDQ
Expected a negative correlation between ASD diagnosis and item "Generally liked by other children (for 11-17 year olds: Generally liked by other youth) - 0=Not True, 1=Somewhat True, 2=Certainly True"

In [2]:
diag = "Diag.Autism Spectrum Disorder"
expected_negative = [
    "SDQ,SDQ_01", # Considerate of other people's feelings
    "SDQ,SDQ_11", # Has at least one good friend 
    "SDQ,SDQ_14", # Generally liked by other children (for 11-17 year olds: Generally liked by other youth)
    "SDQ,SDQ_20", # Often offers to help others (parents, teachers, children)
]
expected_positive = [
    "SDQ,SDQ_06", # Rather solitary, prefers to play alone (for 11-17 year olds: Would rather be alone than with other youth)
    "SDQ,SDQ_19", # Picked on or bullied by other children (for 11-17 year olds: Picked on or bullied by other youth)
    "SDQ,SDQ_23", # Gets along better with adults than other children (for 11-17 year olds: Gets along better with adults than with other youth)
]

print("Expected negative:")
print(df[expected_negative + [diag]].corr()[diag])
print()
print("Expected positive:")
print(df[expected_positive + [diag]].corr()[diag])

Expected negative:
SDQ,SDQ_01                      -0.112997
SDQ,SDQ_11                       0.156190
SDQ,SDQ_14                       0.170005
SDQ,SDQ_20                      -0.089845
Diag.Autism Spectrum Disorder    1.000000
Name: Diag.Autism Spectrum Disorder, dtype: float64

Expected positive:
SDQ,SDQ_06                       0.197547
SDQ,SDQ_19                       0.114496
SDQ,SDQ_23                       0.160057
Diag.Autism Spectrum Disorder    1.000000
Name: Diag.Autism Spectrum Disorder, dtype: float64


Two of the items expected to correlate negatively with ASD (SDQ_01 and SDQ_20) do so; the other two (SDQ_11 and SDQ_14) correlate positively.
All items expected to correlate positively with ASD do so.
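
The sign check above is repeated by eye for each diagnosis. A minimal sketch of how it could be automated is given below; `check_correlation_signs` is a hypothetical helper (not part of this project), and the data frame is synthetic toy data rather than HBN responses.

```python
import numpy as np
import pandas as pd


def check_correlation_signs(df, diag, expected_signs):
    """Compare the sign of each item's correlation with `diag` to the expected sign.

    `expected_signs` maps an item column name to +1 or -1 (hypothetical helper).
    """
    items = list(expected_signs)
    # Correlation of every item with the diagnosis column, without the diagnosis itself
    corr = df[items + [diag]].corr()[diag].drop(diag)
    report = pd.DataFrame({
        "corr": corr,
        "expected_sign": pd.Series(expected_signs),
    })
    report["matches"] = np.sign(report["corr"]) == report["expected_sign"]
    return report


# Synthetic toy data, not HBN data
toy = pd.DataFrame({
    "diag": [0, 0, 0, 1, 1, 1],
    "item_a": [2, 2, 1, 0, 1, 0],  # lower in the diagnosed group
    "item_b": [0, 1, 0, 2, 2, 1],  # higher in the diagnosed group
    "item_c": [0, 0, 1, 2, 1, 2],  # higher in the diagnosed group, but expected lower
})

report = check_correlation_signs(
    toy, "diag", {"item_a": -1, "item_b": 1, "item_c": -1}
)
print(report)
```

Rows where `matches` is `False` are the candidates for a reversed response scale.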

In [3]:
diag = "Diag.ADHD-Combined Type"

expected_negative = [
    "SDQ,SDQ_21", # Thinks things out before acting
    "SDQ,SDQ_25", # Good attention span, sees chores or homework through to the end
]
expected_positive = [
    "SDQ,SDQ_10", # Constantly fidgeting or squirming
    "SDQ,SDQ_15", # Easily distracted, concentration wanders
    "SDQ,SDQ_17", # Restless, overactive, cannot stay still for long
]

print("Expected negative:")
print(df[expected_negative + [diag]].corr()[diag])
print()
print("Expected positive:")
print(df[expected_positive + [diag]].corr()[diag])

Expected negative:
SDQ,SDQ_21                 0.225885
SDQ,SDQ_25                 0.218915
Diag.ADHD-Combined Type    1.000000
Name: Diag.ADHD-Combined Type, dtype: float64

Expected positive:
SDQ,SDQ_10                 0.319331
SDQ,SDQ_15                 0.240324
SDQ,SDQ_17                -0.082391
Diag.ADHD-Combined Type    1.000000
Name: Diag.ADHD-Combined Type, dtype: float64


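Both items expected to correlate negatively with ADHD-Combined (SDQ_21 and SDQ_25) correlate positively. Of the items expected to correlate positively, SDQ_10 and SDQ_15 do, while SDQ_17 correlates weakly negatively.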

#### SRS

Expected negative correlation (ASD, "Plays appropriately with children his or her age. - 0= Not True, 1= Sometimes True, 2= Often True, 3= Almost Always True")

In [4]:
diag = "Diag.Autism Spectrum Disorder"

expected_negative = [
    "SRS,SRS_03", # Seems self-confident when interacting with others.
    "SRS,SRS_07", # 7. Is aware of what others are thinking or feeling.
    "SRS,SRS_11", # 11. Has good self-confidence.
    "SRS,SRS_12", # 12. Is able to communicate his or her feelings to others.
    "SRS,SRS_15", # 15. Is able to understand the meaning of other people's tone of voice and facial expressions.
    "SRS,SRS_22", # Plays appropriately with children his or her age. - 0= Not True, 1= Sometimes True, 2= Often True, 3= Almost Always True
]

expected_positive = [
    "SRS,SRS_01", # 1. Seems much more fidgety in social situations than when alone.
    "SRS,SRS_02", # 2. Expressions on his or her face don't match what he or she is saying.
    "SRS,SRS_04", # 4. When under stress, he or she shows rigid or inflexible patterns of behavior that seem odd.
    "SRS,SRS_05", # 5. Doesn't recognize when others are trying to take advantage of him or her.
]

print("Expected negative:")
print(df[expected_negative + [diag]].corr()[diag])
print()

print("Expected positive:")
print(df[expected_positive + [diag]].corr()[diag])


Expected negative:
SRS,SRS_03                       0.105485
SRS,SRS_07                       0.128704
SRS,SRS_11                       0.084484
SRS,SRS_12                       0.085100
SRS,SRS_15                       0.139448
SRS,SRS_22                       0.230247
Diag.Autism Spectrum Disorder    1.000000
Name: Diag.Autism Spectrum Disorder, dtype: float64

Expected positive:
SRS,SRS_01                       0.133270
SRS,SRS_02                       0.119858
SRS,SRS_04                       0.154877
SRS,SRS_05                       0.177707
Diag.Autism Spectrum Disorder    1.000000
Name: Diag.Autism Spectrum Disorder, dtype: float64


All tested items are positively correlated with ASD, even though six of them were expected to correlate negatively.

#### ICU_P
Expected negative correlation between ODD and "Does not let feelings control him/her. - 0=Not at all true, 1=Somewhat true, 2=Very true, 3=Definitely true"

In [5]:
diag = "Diag.Oppositional Defiant Disorder"

expected_negative = [
    "ICU_P,ICU_P_10", # 10. Does not let feelings control him/her. 
    "ICU_P,ICU_P_13", # 13. Easily admits to being wrong. 
    "ICU_P,ICU_P_16", # 16. Apologizes (“says he/she is sorry”) to persons he/she has hurt. 
    "ICU_P,ICU_P_17", # 17. Tries not to hurt others’ feelings.
    "ICU_P,ICU_P_24", # 24. Does things to make others feel good. 
]

print("Expected negative:")
print(df[expected_negative + [diag]].corr()[diag])

Expected negative:
ICU_P,ICU_P_10                       -0.104132
ICU_P,ICU_P_13                        0.170675
ICU_P,ICU_P_16                        0.177124
ICU_P,ICU_P_17                        0.232849
ICU_P,ICU_P_24                        0.143438
Diag.Oppositional Defiant Disorder    1.000000
Name: Diag.Oppositional Defiant Disorder, dtype: float64


## Check items that seem wrong even after changing the response options:
SDQ_25: Good attention span, sees chores or homework through to the end - 0=Not True, 1=Somewhat True, 2=Certainly True

CBCL_111: Withdrawn, doesn't get involved with others - 0=Not true, 1=Somewhat or sometimes true, 2=Very true or often true

In the ICU_P check above, all five tested items were expected to correlate negatively with ODD; instead, all except ICU_P_10 correlate positively.

In [6]:
df_scale = df[[
    "SDQ,SDQ_25",
    "Diag.ADHD-Combined Type"
]]

display(df_scale.corr()["Diag.ADHD-Combined Type"])

df_scale = df[[
    "CBCL,CBCL_111",
    "Diag.Autism Spectrum Disorder",
]]

display(df_scale.corr()["Diag.Autism Spectrum Disorder"])

SDQ,SDQ_25                 0.218915
Diag.ADHD-Combined Type    1.000000
Name: Diag.ADHD-Combined Type, dtype: float64

CBCL,CBCL_111                    0.108389
Diag.Autism Spectrum Disorder    1.000000
Name: Diag.Autism Spectrum Disorder, dtype: float64

In [None]:
## Check correlations for items that are included in the total score and those that are not

In [None]:
diag = "Diag.Autism Spectrum Disorder"
included_in_total = [
    "SDQ,SDQ_07",
    "SDQ,SDQ_11", # Has at least one good friend 
    "SDQ,SDQ_14", # Generally liked by other children (for 11-17 year olds: Generally liked by other youth)
    "SDQ,SDQ_21", 
    "SDQ,SDQ_25", 
]
not_included_in_total = [
    "SDQ,SDQ_01", # Considerate of other people's feelings
    "SDQ,SDQ_20", # Often offers to help others (parents, teachers, children)
]

# Print the correlations for the two lists defined in this cell
print("Included in total:")
print(df[included_in_total + [diag]].corr()[diag])
print()
print("Not included in total:")
print(df[not_included_in_total + [diag]].corr()[diag])

## Check if pre-calculated subscale scores are affected
In the following subscales, some items are scored positively (they should be added when calculating the sum score) and some are scored negatively (they should be subtracted). If everything is correct, the subscale value should be smaller than the plain sum of all item response values. If instead all values are added (ignoring that some are scored negatively), the plain sum and the subscale score will be equal.
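
To make the comparison concrete, here is a small sketch on synthetic data. It assumes one common reverse-keying convention, recoding a response `x` as `max_response - x`; that convention, the item names, and the 0-2 scale are assumptions for illustration, not something confirmed for the HBN subscales. Whichever convention applies, a correctly scored subscale will generally differ from the plain sum.

```python
import pandas as pd

# Synthetic responses on a 0-2 scale; item names are hypothetical
responses = pd.DataFrame({
    "item_1": [0, 2, 1],
    "item_2": [2, 0, 1],  # treated as reverse-keyed in this sketch
    "item_3": [1, 2, 0],
})
reverse_keyed = ["item_2"]
max_response = 2  # assumed top of the response scale

# Plain sum that ignores reverse keying
naive_sum = responses.sum(axis=1)

# Sum after recoding reverse-keyed items as (max_response - x)
recoded = responses.copy()
recoded[reverse_keyed] = max_response - recoded[reverse_keyed]
reverse_scored_sum = recoded.sum(axis=1)

print(pd.DataFrame({"naive": naive_sum, "reverse_scored": reverse_scored_sum}))
```

Under this convention a row can still match by coincidence when the reverse-keyed response sits at the scale midpoint (the last row here), so the comparison is best judged over many participants rather than a few.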

#### Read data

In [7]:
# Read original dataset before preprocessing
df = pd.read_csv("../diagnosis_predictor/data/raw/LORIS-release-10.csv")



#### SRS MOT
Compare the pre-calculated sum score of the SRS_MOT subscale, which has both positively and negatively coded items, with the plain sum of all subscale item scores (as if all items were positively coded).

In [8]:
subscale_col = "SRS,SRS_MOT"
subscale_cols = [
    "SRS,SRS_01",
    "SRS,SRS_03", 
    "SRS,SRS_06", 
    "SRS,SRS_09",
    "SRS,SRS_11",
    "SRS,SRS_23",
    "SRS,SRS_27",
    "SRS,SRS_34",
    "SRS,SRS_43",
    "SRS,SRS_64",
    "SRS,SRS_65"
]

df_scale = df[subscale_cols + [subscale_col]].replace('.', np.nan).dropna().apply(pd.to_numeric, errors='coerce')

values_from_df = df_scale[subscale_col].values
calculated_values = df_scale[subscale_cols].sum(axis=1).values

print("values_from_df", values_from_df)
print("calculated_values", calculated_values)

values_from_df [ 0  6 21 ... 11  5 12]
calculated_values [ 0  6 21 ... 11  5 12]


The values in the "SRS,SRS_MOT" column appear to equal the plain sum of all MOT subscale items for each participant, without subtracting the items that should be negatively scored.

#### SDQ,Conduct_Problems_Total
Same analysis for SDQ,Conduct_Problems_Total (one of the items should be negatively coded)

In [9]:
subscale_col = "SDQ,Conduct_Problems_Total"
subscale_cols = [
    "SDQ,SDQ_05",
    "SDQ,SDQ_07", # Negative ("Generally obedient")
    "SDQ,SDQ_12", 
    "SDQ,SDQ_18",
    "SDQ,SDQ_22"
]

df_scale = df[subscale_cols + [subscale_col]].replace('.', np.nan).dropna().apply(pd.to_numeric, errors='coerce')

values_from_df = df_scale[subscale_col].values
calculated_values = df_scale[subscale_cols].sum(axis=1).values

print("values_from_df", values_from_df)
print("calculated_values", calculated_values)

values_from_df []
calculated_values []


No rows remain after dropping missing values: all SDQ subscale values are missing.
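
An empty result after `dropna()` only shows that every row has at least one missing value among the selected columns. A quick way to see which column is responsible is to look at the share of missing values per column. The sketch below uses a synthetic stand-in for the raw file, with "." marking a missing value as in the cells above; the column names are hypothetical.

```python
import pandas as pd

# Synthetic stand-in for the raw data, where "." marks a missing value
raw = pd.DataFrame({
    "item_1": ["0", "2", "1"],
    "item_2": ["1", ".", "2"],
    "subscale_total": [".", ".", "."],  # entirely missing
})

# Non-numeric entries such as "." become NaN
numeric = raw.apply(pd.to_numeric, errors="coerce")

# Share of missing values per column (1.0 means the column is empty)
missing_share = numeric.isna().mean()
print(missing_share)

# Number of rows with no missing value in any column
complete_rows = len(numeric.dropna())
print("complete rows:", complete_rows)
```

If only the subscale column shows a share of 1.0, the items themselves are available and the subscale could be recomputed from them.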

#### ICU_P,ICU_P_Total
Same analysis for ICU_P total score

In [10]:
subscale_col = "ICU_P,ICU_P_Total"
subscale_cols = [
    "ICU_P,ICU_P_01", 
    "ICU_P,ICU_P_02", 
    "ICU_P,ICU_P_03", 
    "ICU_P,ICU_P_04",
    "ICU_P,ICU_P_05",
    "ICU_P,ICU_P_06",
    "ICU_P,ICU_P_07",
    "ICU_P,ICU_P_08",
    "ICU_P,ICU_P_09",
    "ICU_P,ICU_P_10",
    "ICU_P,ICU_P_11",
    "ICU_P,ICU_P_12",
    "ICU_P,ICU_P_13",
    "ICU_P,ICU_P_14",
    "ICU_P,ICU_P_15",
    "ICU_P,ICU_P_16",
    "ICU_P,ICU_P_17",
    "ICU_P,ICU_P_18",
    "ICU_P,ICU_P_19",
    "ICU_P,ICU_P_20",
    "ICU_P,ICU_P_21",
    "ICU_P,ICU_P_22",
    "ICU_P,ICU_P_23",
    "ICU_P,ICU_P_24",
]

df_scale = df[subscale_cols + [subscale_col]].replace('.', np.nan).dropna().apply(pd.to_numeric, errors='coerce')

values_from_df = df_scale[subscale_col].values
calculated_values = df_scale[subscale_cols].sum(axis=1).values

print("values_from_df", values_from_df, sum(values_from_df))
print("calculated_values", calculated_values, sum(calculated_values))

values_from_df [16 16 24 ... 46 24 17] 77889
calculated_values [16 16 24 ... 46 24 17] 77888


Same result as for SRS: the HBN ICU_P_Total column seems to be calculated by summing all ICU_P response values, instead of subtracting the negatively coded ones.

Also printing the sums of both arrays to check whether the pre-calculated ICU_P_Total values are the same as the sums of all ICU_P responses. The two totals differ by 1 (77889 vs. 77888), so the values agree for nearly all participants, but at least one row differs.
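
Because the printed arrays are truncated, an element-wise comparison is a more direct check than comparing the visible entries or the overall sums. A minimal sketch, using short synthetic arrays in place of `values_from_df` and `calculated_values`:

```python
import numpy as np

# Synthetic stand-ins for values_from_df and calculated_values
stored = np.array([16, 16, 24, 46, 25, 17])
recomputed = np.array([16, 16, 24, 46, 24, 17])

# Boolean mask of positions where the two arrays disagree
differs = stored != recomputed
n_diff = int(differs.sum())
diff_positions = np.flatnonzero(differs)

print("identical:", np.array_equal(stored, recomputed))
print("number of differing rows:", n_diff)
print("positions:", diff_positions)
print("stored vs recomputed at those positions:",
      stored[differs], recomputed[differs])
```

Applied to the real arrays, this would show exactly which participants account for any difference between the two sums.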