## Defining the Goal
Challenge Question and Task:
**“What brain activity patterns are associated with ADHD; are they different between males and females, and, if so, how?”**

The task is to create a multi-outcome model to predict two separate target variables: 
1) ADHD (1=yes or 0=no) and 
2) female (1=yes or 0=no).
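
To make "multi-outcome" concrete, here is a minimal sketch of one way to predict two binary targets at once. It is not the model this project settles on: the synthetic features, the target names `y_adhd` and `y_female`, and the choice of logistic regression wrapped in scikit-learn's `MultiOutputClassifier` are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Synthetic stand-in data (assumption): 200 examples, 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

# Two synthetic binary targets, each driven by a different feature
y_adhd = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
y_female = (X[:, 1] + 0.5 * rng.normal(size=200) > 0).astype(int)
Y = np.column_stack([y_adhd, y_female])   # shape (200, 2)

# MultiOutputClassifier fits one independent classifier per target column
model = MultiOutputClassifier(LogisticRegression(max_iter=1000))
model.fit(X, Y)

pred = model.predict(X[:5])   # one row per example, one column per target
```

Each row of `pred` holds two 0/1 predictions, in the same column order as `Y`.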


### Summary of what this notebook does (not in order)
1. Imports all the data into dataframes
2. Encodes the categorical features from `TRAIN_CATEGORICAL_METADATA.XLSX` appropriately
3. Label encodes the `ADHD_Outcome` feature from `TRAINING_SOLUTIONS.XLSX` (previously str)
4. Checks for nan values and imputes with most frequent feature value/discards the example as appropriate.
5. Standardizes the numeric values from `TRAIN_QUANTITATIVE_METADATA.XLSX` (all except `EHQ_EHQ_Total`) so that they have mean 0 and standard deviation 1.
6. ADD
7. ADD
8. Merges the data into one single dataframe and saves it to a csv file called `merged_data.csv`

In [81]:
#Importing the required libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

### Importing and Examining Data

In [82]:
# Importing the datasets
# Forward slashes avoid invalid escape sequences such as '\T' in non-raw strings and also work on Windows
cat_metadata = pd.read_excel('data/TRAIN/TRAIN_CATEGORICAL_METADATA.xlsx') # Categorical metadata
connectome_matrices = pd.read_csv('data/TRAIN/TRAIN_FUNCTIONAL_CONNECTOME_MATRICES.csv') # Connectome matrices
quant_metadata = pd.read_excel('data/TRAIN/TRAIN_QUANTITATIVE_METADATA.xlsx') # Quantitative metadata
sol = pd.read_excel('data/TRAIN/TRAINING_SOLUTIONS.xlsx') # Solutions to the training data examples

In [83]:
cat_metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1213 entries, 0 to 1212
Data columns (total 10 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   participant_id                    1213 non-null   object 
 1   Basic_Demos_Enroll_Year           1213 non-null   int64  
 2   Basic_Demos_Study_Site            1213 non-null   int64  
 3   PreInt_Demos_Fam_Child_Ethnicity  1202 non-null   float64
 4   PreInt_Demos_Fam_Child_Race       1213 non-null   int64  
 5   MRI_Track_Scan_Location           1213 non-null   int64  
 6   Barratt_Barratt_P1_Edu            1213 non-null   int64  
 7   Barratt_Barratt_P1_Occ            1213 non-null   int64  
 8   Barratt_Barratt_P2_Edu            1213 non-null   int64  
 9   Barratt_Barratt_P2_Occ            1213 non-null   int64  
dtypes: float64(1), int64(8), object(1)
memory usage: 94.9+ KB


In [84]:
cat_metadata.head()

Unnamed: 0,participant_id,Basic_Demos_Enroll_Year,Basic_Demos_Study_Site,PreInt_Demos_Fam_Child_Ethnicity,PreInt_Demos_Fam_Child_Race,MRI_Track_Scan_Location,Barratt_Barratt_P1_Edu,Barratt_Barratt_P1_Occ,Barratt_Barratt_P2_Edu,Barratt_Barratt_P2_Occ
0,UmrK0vMLopoR,2016,1,0.0,0,1,21,45,21,45
1,CPaeQkhcjg7d,2019,3,1.0,2,3,15,15,0,0
2,Nb4EetVPm3gs,2016,1,1.0,8,1,18,40,0,0
3,p4vPhVu91o4b,2018,3,0.0,8,3,15,30,18,0
4,M09PXs7arQ5E,2019,3,0.0,1,3,15,20,0,0


In [85]:
connectome_matrices.head()

Unnamed: 0,participant_id,0throw_1thcolumn,0throw_2thcolumn,0throw_3thcolumn,0throw_4thcolumn,0throw_5thcolumn,0throw_6thcolumn,0throw_7thcolumn,0throw_8thcolumn,0throw_9thcolumn,...,195throw_196thcolumn,195throw_197thcolumn,195throw_198thcolumn,195throw_199thcolumn,196throw_197thcolumn,196throw_198thcolumn,196throw_199thcolumn,197throw_198thcolumn,197throw_199thcolumn,198throw_199thcolumn
0,70z8Q2xdTXM3,0.093473,0.146902,0.067893,0.015141,0.070221,0.063997,0.055382,-0.035335,0.068583,...,0.003404,-0.010359,-0.050968,-0.014365,0.128066,0.112646,-0.05898,0.028228,0.133582,0.143372
1,WHWymJu6zNZi,0.02958,0.179323,0.112933,0.038291,0.104899,0.06425,0.008488,0.077505,-0.00475,...,-0.008409,-0.008479,0.020891,0.017754,0.09404,0.035141,0.032537,0.075007,0.11535,0.1382
2,4PAQp1M6EyAo,-0.05158,0.139734,0.068295,0.046991,0.111085,0.026978,0.151377,0.021198,0.083721,...,0.053245,-0.028003,0.028773,0.024556,0.166343,0.058925,0.035485,0.063661,0.042862,0.162162
3,obEacy4Of68I,0.016273,0.204702,0.11598,0.043103,0.056431,0.057615,0.055773,0.07503,0.001033,...,-0.023918,-0.005356,0.018607,0.016193,0.072955,0.130135,0.05612,0.084784,0.114148,0.190584
4,s7WzzDcmDOhF,0.065771,0.098714,0.097604,0.112988,0.071139,0.085607,0.019392,-0.036403,-0.020375,...,0.066439,-0.07668,-0.04753,-0.031443,0.221213,0.007343,0.005763,0.08382,0.079582,0.067269


In [86]:
quant_metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1213 entries, 0 to 1212
Data columns (total 19 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   participant_id              1213 non-null   object 
 1   EHQ_EHQ_Total               1213 non-null   float64
 2   ColorVision_CV_Score        1213 non-null   int64  
 3   APQ_P_APQ_P_CP              1213 non-null   int64  
 4   APQ_P_APQ_P_ID              1213 non-null   int64  
 5   APQ_P_APQ_P_INV             1213 non-null   int64  
 6   APQ_P_APQ_P_OPD             1213 non-null   int64  
 7   APQ_P_APQ_P_PM              1213 non-null   int64  
 8   APQ_P_APQ_P_PP              1213 non-null   int64  
 9   SDQ_SDQ_Conduct_Problems    1213 non-null   int64  
 10  SDQ_SDQ_Difficulties_Total  1213 non-null   int64  
 11  SDQ_SDQ_Emotional_Problems  1213 non-null   int64  
 12  SDQ_SDQ_Externalizing       1213 non-null   int64  
 13  SDQ_SDQ_Generating_Impact   1213 

In [87]:
sol.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1213 entries, 0 to 1212
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   participant_id  1213 non-null   object
 1   ADHD_Outcome    1213 non-null   int64 
 2   Sex_F           1213 non-null   int64 
dtypes: int64(2), object(1)
memory usage: 28.6+ KB


### Processing NAN Data

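Each dataset has 1213 rows in the training data, but not every column is complete: the `info()` output above already shows only 1202 non-null values for `PreInt_Demos_Fam_Child_Ethnicity`. We now count the NaN values in each dataset and decide, column by column, whether to impute them or discard the affected rows.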

In [88]:
# Counting how many nan values in each column of Categorical Metadata
print(f"NAN for Categorical Data:{cat_metadata.isna().sum()}")

NAN for Categorical Data:participant_id                       0
Basic_Demos_Enroll_Year              0
Basic_Demos_Study_Site               0
PreInt_Demos_Fam_Child_Ethnicity    11
PreInt_Demos_Fam_Child_Race          0
MRI_Track_Scan_Location              0
Barratt_Barratt_P1_Edu               0
Barratt_Barratt_P1_Occ               0
Barratt_Barratt_P2_Edu               0
Barratt_Barratt_P2_Occ               0
dtype: int64


**Design Decision:** `PreInt_Demos_Fam_Child_Ethnicity` has 11 NaN values out of 1213 entries (under 1%), so we replace them with the most common value (the mode) of that feature.

In [89]:
# Replacing nan in the Categorical Data with the most common value in the dataset
col = 'PreInt_Demos_Fam_Child_Ethnicity'
cat_metadata[col] = cat_metadata[col].fillna(cat_metadata[col].mode()[0])
print(f"NAN for Categorical Data:{cat_metadata.isna().sum()}") # Values in all columns should be 0 now.

NAN for Categorical Data:participant_id                      0
Basic_Demos_Enroll_Year             0
Basic_Demos_Study_Site              0
PreInt_Demos_Fam_Child_Ethnicity    0
PreInt_Demos_Fam_Child_Race         0
MRI_Track_Scan_Location             0
Barratt_Barratt_P1_Edu              0
Barratt_Barratt_P1_Occ              0
Barratt_Barratt_P2_Edu              0
Barratt_Barratt_P2_Occ              0
dtype: int64
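
The same mode imputation can also be expressed with scikit-learn's `SimpleImputer`, which stores the fitted mode so that it can be reused on other data later. This is a sketch on a toy frame; the column name `ethnicity` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column with two missing entries (values are illustrative only)
toy = pd.DataFrame({"ethnicity": [0.0, 1.0, 0.0, np.nan, 2.0, 0.0, np.nan]})

# strategy="most_frequent" fills NaN with the mode of each column
imputer = SimpleImputer(strategy="most_frequent")
toy["ethnicity_imputed"] = imputer.fit_transform(toy[["ethnicity"]]).ravel()

learned_mode = imputer.statistics_[0]   # the value used to fill the gaps
```

Calling `imputer.transform(...)` on a held-out set would fill its gaps with the mode learned from the training data rather than recomputing it.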


In [90]:
# Counting how many nan values in each column of Quantitative Metadata
print(f"NAN for Quantitative Data:{quant_metadata.isna().sum()}")

NAN for Quantitative Data:participant_id                  0
EHQ_EHQ_Total                   0
ColorVision_CV_Score            0
APQ_P_APQ_P_CP                  0
APQ_P_APQ_P_ID                  0
APQ_P_APQ_P_INV                 0
APQ_P_APQ_P_OPD                 0
APQ_P_APQ_P_PM                  0
APQ_P_APQ_P_PP                  0
SDQ_SDQ_Conduct_Problems        0
SDQ_SDQ_Difficulties_Total      0
SDQ_SDQ_Emotional_Problems      0
SDQ_SDQ_Externalizing           0
SDQ_SDQ_Generating_Impact       0
SDQ_SDQ_Hyperactivity           0
SDQ_SDQ_Internalizing           0
SDQ_SDQ_Peer_Problems           0
SDQ_SDQ_Prosocial               0
MRI_Track_Age_at_Scan         360
dtype: int64


**DESIGN DECISION:** `MRI_Track_Age_at_Scan` is missing for 360 of the 1213 participants (~30%). That is too many values to impute reliably, so we discard those rows from the dataset instead, which leaves 853 participants.

In [91]:
column_to_check = 'MRI_Track_Age_at_Scan'
quant_metadata = quant_metadata.dropna(subset=column_to_check)
print(f"NAN for Quantitative Data:{quant_metadata.isna().sum()}") # Value for all columns should be 0

NAN for Quantitative Data:participant_id                0
EHQ_EHQ_Total                 0
ColorVision_CV_Score          0
APQ_P_APQ_P_CP                0
APQ_P_APQ_P_ID                0
APQ_P_APQ_P_INV               0
APQ_P_APQ_P_OPD               0
APQ_P_APQ_P_PM                0
APQ_P_APQ_P_PP                0
SDQ_SDQ_Conduct_Problems      0
SDQ_SDQ_Difficulties_Total    0
SDQ_SDQ_Emotional_Problems    0
SDQ_SDQ_Externalizing         0
SDQ_SDQ_Generating_Impact     0
SDQ_SDQ_Hyperactivity         0
SDQ_SDQ_Internalizing         0
SDQ_SDQ_Peer_Problems         0
SDQ_SDQ_Prosocial             0
MRI_Track_Age_at_Scan         0
dtype: int64
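
To weigh this decision, it can help to compare the two options side by side. The sketch below uses a made-up 10-row frame with the same ~30% missing rate; the column names `age` and `age_missing`, and the median-plus-indicator alternative, are assumptions for illustration and not what this notebook does.

```python
import numpy as np
import pandas as pd

# Toy frame: 3 of 10 ages are missing (illustrative values)
toy = pd.DataFrame({
    "participant_id": list("abcdefghij"),
    "age": [8.5, np.nan, 10.2, 12.0, np.nan, 9.1, 14.3, np.nan, 11.0, 7.8],
})

missing_fraction = toy["age"].isna().mean()   # share of rows with no age

# Option 1 (used in this notebook): drop the incomplete rows
dropped = toy.dropna(subset=["age"])

# Option 2 (alternative): keep every row, fill with the median,
# and record which rows were filled in a separate indicator column
imputed = toy.copy()
imputed["age_missing"] = imputed["age"].isna().astype(int)
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
```

Dropping keeps only observed ages at the cost of rows; imputing keeps all rows but adds values that were never measured.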


In [92]:
# Counting how many nan values in each column of MRI Data
print(f"NAN for MRI Data:{connectome_matrices.isna().sum().sum()}")
# Since there are no nan values, no need for further processing of this dataset.

NAN for MRI Data:0


## Normalizing Quantitative Data

Many of the quantitative data features come from scores of different types of scales. As such, it may be necessary to normalize them so that one scale does not overpower another by virtue of its definition.

In the output of the cell below, notice that the means of the feature columns differ widely, which indicates that the features are on very different scales. Features with larger absolute values might then dominate the others (i.e. the model might learn to treat them as 'more important' than features with smaller absolute values). To mitigate this, we normalize these features.

In [93]:
# Examining the range of the different columns of the quantitative data
quant_metadata.describe()

Unnamed: 0,EHQ_EHQ_Total,ColorVision_CV_Score,APQ_P_APQ_P_CP,APQ_P_APQ_P_ID,APQ_P_APQ_P_INV,APQ_P_APQ_P_OPD,APQ_P_APQ_P_PM,APQ_P_APQ_P_PP,SDQ_SDQ_Conduct_Problems,SDQ_SDQ_Difficulties_Total,SDQ_SDQ_Emotional_Problems,SDQ_SDQ_Externalizing,SDQ_SDQ_Generating_Impact,SDQ_SDQ_Hyperactivity,SDQ_SDQ_Internalizing,SDQ_SDQ_Peer_Problems,SDQ_SDQ_Prosocial,MRI_Track_Age_at_Scan
count,853.0,853.0,853.0,853.0,853.0,853.0,853.0,853.0,853.0,853.0,853.0,853.0,853.0,853.0,853.0,853.0,853.0,853.0
mean,59.623036,13.128957,3.80891,13.30129,39.370457,17.746776,16.534584,25.199297,2.141852,12.14068,2.282532,7.611958,4.028136,5.470106,4.528722,2.24619,7.656506,11.245678
std,48.394417,2.844114,1.407497,3.810901,6.136833,3.710362,5.483461,3.913099,2.088375,6.758156,2.150147,4.251285,2.844838,2.857427,3.548548,2.104182,2.206573,3.234372
min,-100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,46.67,14.0,3.0,11.0,36.0,16.0,13.0,23.0,0.0,7.0,1.0,4.0,2.0,4.0,2.0,0.0,6.0,8.803901
50%,74.47,14.0,3.0,13.0,40.0,18.0,16.0,26.0,2.0,12.0,2.0,8.0,4.0,5.0,4.0,2.0,8.0,10.739219
75%,94.47,14.0,4.0,16.0,43.0,20.0,20.0,28.0,3.0,17.0,4.0,11.0,6.0,8.0,7.0,4.0,9.0,13.460871
max,100.0,14.0,12.0,28.0,50.0,27.0,37.0,30.0,10.0,32.0,10.0,20.0,10.0,10.0,16.0,9.0,10.0,21.564453


In [94]:
quant_metadata.info()

<class 'pandas.core.frame.DataFrame'>
Index: 853 entries, 2 to 1212
Data columns (total 19 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   participant_id              853 non-null    object 
 1   EHQ_EHQ_Total               853 non-null    float64
 2   ColorVision_CV_Score        853 non-null    int64  
 3   APQ_P_APQ_P_CP              853 non-null    int64  
 4   APQ_P_APQ_P_ID              853 non-null    int64  
 5   APQ_P_APQ_P_INV             853 non-null    int64  
 6   APQ_P_APQ_P_OPD             853 non-null    int64  
 7   APQ_P_APQ_P_PM              853 non-null    int64  
 8   APQ_P_APQ_P_PP              853 non-null    int64  
 9   SDQ_SDQ_Conduct_Problems    853 non-null    int64  
 10  SDQ_SDQ_Difficulties_Total  853 non-null    int64  
 11  SDQ_SDQ_Emotional_Problems  853 non-null    int64  
 12  SDQ_SDQ_Externalizing       853 non-null    int64  
 13  SDQ_SDQ_Generating_Impact   853 non-nul

In [95]:
# We standardize every numeric column except EHQ_EHQ_Total (discussed separately under Feature Creation) to mean 0 and standard deviation 1.
# participant_id is a string, so select_dtypes already leaves it out.
scaler = StandardScaler()
numeric_columns = quant_metadata.select_dtypes(include=['number']).columns
numeric_columns = numeric_columns.difference(['EHQ_EHQ_Total'])
quant_metadata[numeric_columns] = scaler.fit_transform(quant_metadata[numeric_columns])
quant_metadata.describe()       # Scaled means should be ~0; std shows ~1.0006 rather than 1 because describe() uses the sample std (ddof=1)

Unnamed: 0,EHQ_EHQ_Total,ColorVision_CV_Score,APQ_P_APQ_P_CP,APQ_P_APQ_P_ID,APQ_P_APQ_P_INV,APQ_P_APQ_P_OPD,APQ_P_APQ_P_PM,APQ_P_APQ_P_PP,SDQ_SDQ_Conduct_Problems,SDQ_SDQ_Difficulties_Total,SDQ_SDQ_Emotional_Problems,SDQ_SDQ_Externalizing,SDQ_SDQ_Generating_Impact,SDQ_SDQ_Hyperactivity,SDQ_SDQ_Internalizing,SDQ_SDQ_Peer_Problems,SDQ_SDQ_Prosocial,MRI_Track_Age_at_Scan
count,853.0,853.0,853.0,853.0,853.0,853.0,853.0,853.0,853.0,853.0,853.0,853.0,853.0,853.0,853.0,853.0,853.0,853.0
mean,59.623036,1.655573e-16,-1.457737e-16,6.663941e-17,5.164554e-16,-2.165781e-16,2.457328e-16,3.665168e-16,-8.329927e-17,0.0,1.665985e-17,-9.995912000000001e-17,-1.332788e-16,1.665985e-16,-9.995912000000001e-17,4.9979560000000006e-17,6.663941e-17,-1.16619e-16
std,48.394417,1.000587,1.000587,1.000587,1.000587,1.000587,1.000587,1.000587,1.000587,1.000587,1.000587,1.000587,1.000587,1.000587,1.000587,1.000587,1.000587,1.000587
min,-100.0,-4.618893,-2.707746,-3.492374,-6.419199,-4.785837,-3.017124,-6.443506,-1.026209,-1.797503,-1.062193,-1.791558,-1.416776,-1.915469,-1.276967,-1.068115,-3.471899,-3.478968
25%,46.67,0.3064414,-0.5750523,-0.6042245,-0.5495399,-0.4710594,-0.6449681,-0.5623642,-1.026209,-0.761109,-0.5968357,-0.850114,-0.7133361,-0.5147875,-0.7130257,-1.068115,-0.7511551,-0.7553892
50%,74.47,0.3064414,-0.5750523,-0.07910631,0.1026445,0.0682878,-0.09754742,0.2047413,-0.06796457,-0.020829,-0.1314784,0.09133,-0.009895993,-0.1646171,-0.1490842,-0.1170689,0.1557597,-0.1566785
75%,94.47,0.3064414,0.1358457,0.7085709,0.5917828,0.607635,0.6323468,0.716145,0.4111576,0.719452,0.7992361,0.797413,0.6935442,0.8858943,0.6968281,0.8339768,0.6092171,0.6852929
max,100.0,0.3064414,5.82303,3.85928,1.733105,2.49535,3.734397,1.227549,3.765013,2.940295,3.59138,2.915662,2.100424,1.586235,3.234565,3.211591,1.062674,3.19222
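
The `std` row above reads 1.000587 rather than exactly 1. `StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`), so the two differ by a factor of sqrt(n / (n - 1)). The sketch below reproduces this on synthetic data with n = 853; the column name `score` and the random values are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

n = 853   # number of rows left after dropping missing ages
rng = np.random.default_rng(0)
toy = pd.DataFrame({"score": rng.normal(loc=12.0, scale=6.0, size=n)})

# Standardize, then measure the spread both ways
scaled = pd.Series(StandardScaler().fit_transform(toy[["score"]]).ravel())
pop_std = scaled.std(ddof=0)      # what StandardScaler normalizes to 1
sample_std = scaled.std(ddof=1)   # what describe() prints

expected = np.sqrt(n / (n - 1))   # ratio between the two definitions
```

Here `expected` rounds to 1.000587, matching the value printed in the table.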


## Encode Categorical Features/Feature Engineering

Read `Data Dictionary.xlsx`. It details the values/descriptions of the data in all the files. Note that all the variables in the Categorical Metadata file can be one hot encoded (or preprocessed in some way).

**Design Decision 1**: HOW DO WE ENCODE THESE?
1. One Hot Encoding
2. Label Encoding

Asmi's thoughts: One Hot Encoding might lead to sparse data because there are a lot of categories for some of the features. But label encoding might imply some sort of hierarchy among the feature values (that may not necessarily be the case).

ASK  TA

Final Decision [ADD HERE]
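
As a reference for the decision, here is a small sketch of what each option produces. The column name `site` and its values are a toy stand-in, not the real metadata.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical column with three distinct codes (illustrative)
toy = pd.DataFrame({"site": [1, 3, 1, 4, 3]})

# Option 1: one hot encoding gives one 0/1 column per category
one_hot = pd.get_dummies(toy["site"], prefix="site", dtype=int)

# Option 2: label encoding maps the sorted categories to 0, 1, 2, ...
label = LabelEncoder().fit_transform(toy["site"])
```

One hot encoding adds a column per category (three here), while label encoding keeps a single column but introduces an ordering (0 < 1 < 2) that some models will treat as meaningful.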

In [96]:
# Encoding the categorical variables

The target variables are given in `TRAINING_SOLUTIONS.XLSX`. The cell below label encodes the `ADHD_Outcome` field as a categorical int. Note that `sol.info()` above already reports `ADHD_Outcome` as `int64`; for a column that already holds both 0 and 1 as integers, label encoding leaves the values unchanged.

In [97]:
# Converting ADHD_Outcome to categorical int
label_encoder = LabelEncoder()
sol['ADHD_Outcome'] = label_encoder.fit_transform(sol['ADHD_Outcome'])  #Converts it to a categorical int
sol.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1213 entries, 0 to 1212
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   participant_id  1213 non-null   object
 1   ADHD_Outcome    1213 non-null   int64 
 2   Sex_F           1213 non-null   int64 
dtypes: int64(2), object(1)
memory usage: 28.6+ KB
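
For reference, this sketch shows how `LabelEncoder` behaves on string labels versus labels that are already 0/1 integers. The `"Yes"`/`"No"` strings are hypothetical; the source does not say what string values the column might hold.

```python
from sklearn.preprocessing import LabelEncoder

# String labels (hypothetical): classes are sorted, so "No" -> 0 and "Yes" -> 1
enc = LabelEncoder()
from_str = enc.fit_transform(["Yes", "No", "No", "Yes"])
str_classes = list(enc.classes_)   # sorted class names, index = encoded value

# Integer labels that are already 0/1 come back unchanged
from_int = LabelEncoder().fit_transform([1, 0, 0, 1])
```

Checking `enc.classes_` after fitting is a quick way to confirm which original value each integer stands for.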


### Feature Creation

Notice from the `DATA DICTIONARY.XLSX` file that the `EHQ_EHQ_Total` variable from `TRAIN_QUANTITATIVE_METADATA.XLSX`, which represents the Laterality Index (LI) score, is a float, but its values correspond to 3 broad categories:
- LI = -100: 10th left (strongly left handed)
- -28 <= LI < 48: middle (ambidextrous)
- LI = 100: 10th right (strongly right handed)

**Design Decision 2:** DO WE ENCODE THIS DATA AS WELL?

In [None]:
# Add code to potentially encode/process EHQ_EHQ_Total Data
print(f"Number of unique values in this column: {quant_metadata['EHQ_EHQ_Total'].nunique()}")
quant_metadata['EHQ_EHQ_Total'].describe()


Number of unique values in this column: 132


count    853.000000
mean      59.623036
std       48.394417
min     -100.000000
25%       46.670000
50%       74.470000
75%       94.470000
max      100.000000
Name: EHQ_EHQ_Total, dtype: float64

In [106]:
# Counting how many values fall into each category
range_1 = (-float('inf'), -100)     # Strongly left handed (LI = -100)
range_2 = (-28, 48)                 # Ambidextrous
range_3 = (100, float('inf'))       # Strongly right handed (LI = 100)

count_range_1 = ((quant_metadata['EHQ_EHQ_Total'] >= range_1[0]) & (quant_metadata['EHQ_EHQ_Total'] <= range_1[1])).sum()
# Note: the data dictionary gives -28 <= LI < 48; this check uses -28 < LI <= 48
count_range_2 = ((quant_metadata['EHQ_EHQ_Total'] > range_2[0]) & (quant_metadata['EHQ_EHQ_Total'] <= range_2[1])).sum()
# Note: the strict '>' excludes scores equal to 100 (the column maximum), so this count is 0; use '>=' to include them
count_range_3 = ((quant_metadata['EHQ_EHQ_Total'] > range_3[0]) & (quant_metadata['EHQ_EHQ_Total'] <= range_3[1])).sum()

print(f"Count of values in range {range_1[0]} to {range_1[1]}: {count_range_1}")
print(f"Count of values in range {range_2[0]} to {range_2[1]}: {count_range_2}")
print(f"Count of values in range {range_3[0]} to {range_3[1]}: {count_range_3}")

Count of values in range -inf to -100: 9
Count of values in range -28 to 48: 163
Count of values in range 100 to inf: 0
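
If the score is turned into a categorical feature, `pd.cut` handles the bin boundaries in one call. The sketch below is one possible binning, not a decision: it assumes that everything below -28 counts as "left" and everything from 48 upward counts as "right", which goes beyond the three categories quoted from the data dictionary. The scores are made-up test values.

```python
import numpy as np
import pandas as pd

# Made-up scores chosen to sit on and around the boundaries
scores = pd.Series([-100.0, -40.0, -28.0, 0.0, 47.9, 48.0, 74.47, 100.0])

# right=False makes each bin closed on the left: [-inf, -28), [-28, 48), [48, inf)
bins = [-np.inf, -28, 48, np.inf]
labels = ["left", "middle", "right"]
handedness = pd.cut(scores, bins=bins, labels=labels, right=False)

# Count how many scores land in each category
counts = handedness.astype(str).value_counts().to_dict()
```

With left-closed bins, a score of exactly -28 falls in "middle" and a score of exactly 100 falls in "right", so no boundary value is silently dropped.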


In [99]:
# Inspecting other data
quant_metadata['MRI_Track_Age_at_Scan'].values

array([-9.29867566e-01, -7.13075534e-01,  1.70844791e+00, -7.53893477e-03,
       -8.27488764e-01,  1.28834454e+00,  1.75338845e-01, -1.01968324e+00,
        3.82849445e-01, -2.68480098e-01, -2.26060291e-01, -7.51224930e-01,
        1.12064182e+00, -7.86021788e-01, -1.61592343e+00,  9.36882050e-01,
       -7.27474095e-01, -6.79266774e-01,  1.98445833e+00,  1.98368184e+00,
       -5.00447494e-01, -1.38805011e+00, -1.58218552e+00, -6.24918649e-01,
       -5.36056113e-01, -3.18346209e-01,  5.87925070e-01, -4.33712269e-01,
        1.93125520e-01,  1.83373083e+00,  7.20054442e-01, -1.49903996e+00,
        1.61782079e+00, -3.43049874e-01, -5.48725349e-01,  1.09604426e+00,
       -8.82472006e-01,  1.42377113e-01,  1.82628422e+00, -1.34319533e+00,
       -1.12809674e+00, -1.47207765e+00,  5.24295515e-01, -3.47896753e+00,
        1.14195768e+00, -7.57965584e-01, -4.20795854e-01, -6.75843392e-01,
       -1.10953387e+00, -3.93021787e-01, -1.46953688e+00, -7.19039693e-01,
       -1.21357147e+00, -

### Ideas for Dummy Variables
Quantitative Metadata
1. `MRI_Track_Age_at_Scan` - change this into categorical variables (e.g)
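
One possible way to build such dummy variables is quantile binning with `pd.qcut`, sketched below on made-up ages. The number of bands (4) and the labels `q1` to `q4` are assumptions. Because quantile bins depend only on the ordering of the values, the same rows land in the same bands whether the ages are raw or already standardized.

```python
import pandas as pd

# Made-up ages in years (illustrative only)
ages = pd.Series([5.5, 7.2, 8.8, 9.9, 10.7, 12.1, 13.5, 15.0, 17.3, 21.6])

# Split into 4 quantile-based bands of roughly equal size
age_band = pd.qcut(ages, q=4, labels=["q1", "q2", "q3", "q4"])

# One 0/1 dummy column per band
dummies = pd.get_dummies(age_band, prefix="age", dtype=int)

# Standardizing first does not change band membership
standardized = (ages - ages.mean()) / ages.std(ddof=0)
age_band_std = pd.qcut(standardized, q=4, labels=["q1", "q2", "q3", "q4"])
```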

## Combining the datasets into one giant training dataset

In [100]:
merge_column = 'participant_id'

# how='inner' keeps only the participant ids present in every table, so the rows
# dropped from quant_metadata earlier (missing MRI_Track_Age_at_Scan) are discarded here too
merged_df = cat_metadata.merge(connectome_matrices, on=merge_column, how='inner')
merged_df = merged_df.merge(quant_metadata, on=merge_column, how='inner')
merged_df = merged_df.merge(sol, on=merge_column, how='inner')
merged_df.columns


Index(['participant_id', 'Basic_Demos_Enroll_Year', 'Basic_Demos_Study_Site',
       'PreInt_Demos_Fam_Child_Ethnicity', 'PreInt_Demos_Fam_Child_Race',
       'MRI_Track_Scan_Location', 'Barratt_Barratt_P1_Edu',
       'Barratt_Barratt_P1_Occ', 'Barratt_Barratt_P2_Edu',
       'Barratt_Barratt_P2_Occ',
       ...
       'SDQ_SDQ_Emotional_Problems', 'SDQ_SDQ_Externalizing',
       'SDQ_SDQ_Generating_Impact', 'SDQ_SDQ_Hyperactivity',
       'SDQ_SDQ_Internalizing', 'SDQ_SDQ_Peer_Problems', 'SDQ_SDQ_Prosocial',
       'MRI_Track_Age_at_Scan', 'ADHD_Outcome', 'Sex_F'],
      dtype='object', length=19930)
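
A sanity check on the merge can catch duplicated ids and show which participants were lost. The sketch below uses three tiny made-up tables; passing `validate="one_to_one"` makes pandas raise an error if `participant_id` is duplicated on either side.

```python
import pandas as pd

# Tiny stand-in tables (illustrative); "b" is missing from quant
cat = pd.DataFrame({"participant_id": ["a", "b", "c", "d"], "site": [1, 3, 1, 4]})
quant = pd.DataFrame({"participant_id": ["a", "c", "d"], "age": [8.5, 10.2, 12.0]})
sol = pd.DataFrame({"participant_id": ["a", "b", "c", "d"], "ADHD_Outcome": [1, 0, 1, 1]})

# Inner merges keep only ids present in every table
merged = cat.merge(quant, on="participant_id", how="inner", validate="one_to_one")
merged = merged.merge(sol, on="participant_id", how="inner", validate="one_to_one")

# Ids that were in the first table but did not survive the merge
lost = set(cat["participant_id"]) - set(merged["participant_id"])
```

In this notebook the analogous check would be that `merged_df` has 853 rows, the number left in `quant_metadata` after dropping missing ages, provided every one of those ids also appears in the other three tables.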

In [101]:
# Saving the merged data
merged_df.to_csv('merged_data.csv', index=False)