## Defining the Goal
Challenge Question and Task:
**“What brain activity patterns are associated with ADHD; are they different between males and females, and, if so, how?”**

The task is to create a multi-outcome model to predict two separate target variables: 
1) ADHD (1=yes or 0=no) and 
2) female (1=yes or 0=no).
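
To make the two-target setup concrete, here is a minimal sketch of a multi-outcome classifier. It assumes scikit-learn's `MultiOutputClassifier` wrapped around a `LogisticRegression`, and it uses synthetic data rather than the challenge files; the model choice is purely illustrative and is not a decision made in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Synthetic stand-in data: 200 "participants" with 5 features (NOT the challenge data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

# Two binary targets, ordered like the solutions file: ADHD_Outcome, Sex_F
y = np.column_stack([
    (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int),
    (X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int),
])

# MultiOutputClassifier fits one independent classifier per target column
model = MultiOutputClassifier(LogisticRegression(max_iter=1000))
model.fit(X, y)

preds = model.predict(X)  # one column of 0/1 predictions per target
print(preds.shape)
```

Because each target gets its own estimator here, this baseline cannot capture any interaction between ADHD status and sex; it only shows the expected shape of the inputs and outputs.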


### Summary of what this notebook does
1. Imports all the data into dataframes
2. Encodes the categorical features from `TRAIN_CATEGORICAL_METADATA.XLSX` appropriately
3. Label encodes the `ADHD_Outcome` feature from `TRAINING_SOLUTIONS.XLSX` (previously str)
4. ADD
5. ADD
6. ADD
7. ADD
8. Merges the data into one single dataframe and saves it to a csv file called `merged_data.csv`

In [10]:
#Importing the required libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

In [3]:
# Importing and preprocessing the dataset
# Forward slashes avoid invalid escape sequences such as '\T' and also work on Windows
cat_metadata = pd.read_excel('data/TRAIN/TRAIN_CATEGORICAL_METADATA.xlsx')  # Categorical metadata
connectome_matrices = pd.read_csv('data/TRAIN/TRAIN_FUNCTIONAL_CONNECTOME_MATRICES.csv')  # Connectome matrices
quant_metadata = pd.read_excel('data/TRAIN/TRAIN_QUANTITATIVE_METADATA.xlsx')  # Quantitative metadata
sol = pd.read_excel('data/TRAIN/TRAINING_SOLUTIONS.xlsx')  # Solutions to the training data examples

In [4]:
cat_metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1213 entries, 0 to 1212
Data columns (total 10 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   participant_id                    1213 non-null   object 
 1   Basic_Demos_Enroll_Year           1213 non-null   int64  
 2   Basic_Demos_Study_Site            1213 non-null   int64  
 3   PreInt_Demos_Fam_Child_Ethnicity  1202 non-null   float64
 4   PreInt_Demos_Fam_Child_Race       1213 non-null   int64  
 5   MRI_Track_Scan_Location           1213 non-null   int64  
 6   Barratt_Barratt_P1_Edu            1213 non-null   int64  
 7   Barratt_Barratt_P1_Occ            1213 non-null   int64  
 8   Barratt_Barratt_P2_Edu            1213 non-null   int64  
 9   Barratt_Barratt_P2_Occ            1213 non-null   int64  
dtypes: float64(1), int64(8), object(1)
memory usage: 94.9+ KB


In [5]:
cat_metadata.head()

Unnamed: 0,participant_id,Basic_Demos_Enroll_Year,Basic_Demos_Study_Site,PreInt_Demos_Fam_Child_Ethnicity,PreInt_Demos_Fam_Child_Race,MRI_Track_Scan_Location,Barratt_Barratt_P1_Edu,Barratt_Barratt_P1_Occ,Barratt_Barratt_P2_Edu,Barratt_Barratt_P2_Occ
0,UmrK0vMLopoR,2016,1,0.0,0,1,21,45,21,45
1,CPaeQkhcjg7d,2019,3,1.0,2,3,15,15,0,0
2,Nb4EetVPm3gs,2016,1,1.0,8,1,18,40,0,0
3,p4vPhVu91o4b,2018,3,0.0,8,3,15,30,18,0
4,M09PXs7arQ5E,2019,3,0.0,1,3,15,20,0,0


In [5]:
connectome_matrices.head()

Unnamed: 0,participant_id,0throw_1thcolumn,0throw_2thcolumn,0throw_3thcolumn,0throw_4thcolumn,0throw_5thcolumn,0throw_6thcolumn,0throw_7thcolumn,0throw_8thcolumn,0throw_9thcolumn,...,195throw_196thcolumn,195throw_197thcolumn,195throw_198thcolumn,195throw_199thcolumn,196throw_197thcolumn,196throw_198thcolumn,196throw_199thcolumn,197throw_198thcolumn,197throw_199thcolumn,198throw_199thcolumn
0,70z8Q2xdTXM3,0.093473,0.146902,0.067893,0.015141,0.070221,0.063997,0.055382,-0.035335,0.068583,...,0.003404,-0.010359,-0.050968,-0.014365,0.128066,0.112646,-0.05898,0.028228,0.133582,0.143372
1,WHWymJu6zNZi,0.02958,0.179323,0.112933,0.038291,0.104899,0.06425,0.008488,0.077505,-0.00475,...,-0.008409,-0.008479,0.020891,0.017754,0.09404,0.035141,0.032537,0.075007,0.11535,0.1382
2,4PAQp1M6EyAo,-0.05158,0.139734,0.068295,0.046991,0.111085,0.026978,0.151377,0.021198,0.083721,...,0.053245,-0.028003,0.028773,0.024556,0.166343,0.058925,0.035485,0.063661,0.042862,0.162162
3,obEacy4Of68I,0.016273,0.204702,0.11598,0.043103,0.056431,0.057615,0.055773,0.07503,0.001033,...,-0.023918,-0.005356,0.018607,0.016193,0.072955,0.130135,0.05612,0.084784,0.114148,0.190584
4,s7WzzDcmDOhF,0.065771,0.098714,0.097604,0.112988,0.071139,0.085607,0.019392,-0.036403,-0.020375,...,0.066439,-0.07668,-0.04753,-0.031443,0.221213,0.007343,0.005763,0.08382,0.079582,0.067269


In [6]:
quant_metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1213 entries, 0 to 1212
Data columns (total 19 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   participant_id              1213 non-null   object 
 1   EHQ_EHQ_Total               1213 non-null   float64
 2   ColorVision_CV_Score        1213 non-null   int64  
 3   APQ_P_APQ_P_CP              1213 non-null   int64  
 4   APQ_P_APQ_P_ID              1213 non-null   int64  
 5   APQ_P_APQ_P_INV             1213 non-null   int64  
 6   APQ_P_APQ_P_OPD             1213 non-null   int64  
 7   APQ_P_APQ_P_PM              1213 non-null   int64  
 8   APQ_P_APQ_P_PP              1213 non-null   int64  
 9   SDQ_SDQ_Conduct_Problems    1213 non-null   int64  
 10  SDQ_SDQ_Difficulties_Total  1213 non-null   int64  
 11  SDQ_SDQ_Emotional_Problems  1213 non-null   int64  
 12  SDQ_SDQ_Externalizing       1213 non-null   int64  
 13  SDQ_SDQ_Generating_Impact   1213 

In [7]:
sol.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1213 entries, 0 to 1212
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   participant_id  1213 non-null   object
 1   ADHD_Outcome    1213 non-null   int64 
 2   Sex_F           1213 non-null   int64 
dtypes: int64(2), object(1)
memory usage: 28.6+ KB


In [23]:
# Counting how many NaN values are in each column of the categorical metadata
print(f"NAN for Categorical Data:{cat_metadata.isna().sum()}")

NAN for Categorical Data:participant_id                       0
Basic_Demos_Enroll_Year              0
Basic_Demos_Study_Site               0
PreInt_Demos_Fam_Child_Ethnicity    11
PreInt_Demos_Fam_Child_Race          0
MRI_Track_Scan_Location              0
Barratt_Barratt_P1_Edu               0
Barratt_Barratt_P1_Occ               0
Barratt_Barratt_P2_Edu               0
Barratt_Barratt_P2_Occ               0
dtype: int64


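All of the datasets contain 1213 examples (in the training data), and most columns are fully populated. The exceptions seen in this notebook are `PreInt_Demos_Fam_Child_Ethnicity`, which has 11 missing values, and `MRI_Track_Age_at_Scan`, which also contains NaN values (see the inspection further down). These columns need imputation or another missing-value strategy before modelling.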
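
A quick way to check every dataframe at once is to tabulate the NaN counts per column and keep only the non-zero entries. The sketch below uses small toy frames in place of the real data, and `missing_summary` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(frames):
    """Return NaN counts for every column that has at least one NaN.

    `frames` maps a dataset name to its DataFrame; the result is a Series
    indexed by (dataset name, column name).
    """
    counts = pd.concat({name: df.isna().sum() for name, df in frames.items()})
    return counts[counts > 0]

# Toy stand-ins for cat_metadata and quant_metadata (values are made up)
toy_cat = pd.DataFrame({
    "participant_id": ["a", "b", "c"],
    "PreInt_Demos_Fam_Child_Ethnicity": [0.0, np.nan, 1.0],
})
toy_quant = pd.DataFrame({
    "participant_id": ["a", "b", "c"],
    "MRI_Track_Age_at_Scan": [np.nan, np.nan, 8.2],
})

summary = missing_summary({"cat_metadata": toy_cat, "quant_metadata": toy_quant})
print(summary)
```

In the notebook itself, the same helper could be called with the real `cat_metadata`, `connectome_matrices`, `quant_metadata`, and `sol` frames.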

## Encode Categorical Features/Feature Engineering

Read `Data Dictionary.xlsx`. It details the values/descriptions of the data in all the files. Note that all the variables in the Categorical Metadata file can be one hot encoded (or preprocessed in some way).

**Design Decision 1**: HOW DO WE ENCODE THESE?
1. One Hot Encoding
2. Label Encoding

Asmi's thoughts: One Hot Encoding might lead to sparse data because there are a lot of categories for some of the features. But label encoding might imply some sort of hierarchy among the feature values (that may not necessarily be the case).

ASK  TA

Final Decision [ADD HERE]
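
To compare the two options side by side, here is a minimal sketch on a toy column. The column name comes from the categorical metadata, but the values are made up, and neither option is presented as the final decision.

```python
import pandas as pd

# Toy column standing in for a categorical feature (values are illustrative)
toy = pd.DataFrame({"Basic_Demos_Study_Site": [1, 3, 1, 2]})

# Option 1: one-hot encoding, one 0/1 indicator column per category
one_hot = pd.get_dummies(toy, columns=["Basic_Demos_Study_Site"], dtype=int)

# Option 2: label encoding, a single integer column whose codes follow
# the sorted category order (and therefore imply an ordering)
label_codes = toy["Basic_Demos_Study_Site"].astype("category").cat.codes

print(one_hot)
print(label_codes.tolist())
```

One-hot encoding adds one column per category, which is where the sparsity concern comes from; label encoding keeps a single column but introduces an order that a linear model would treat as meaningful.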

In [8]:
# Encoding the categorical variables

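The summary above describes the `ADHD_Outcome` target in `TRAINING_SOLUTIONS.XLSX` as a string, although the `sol.info()` output earlier already reports it as `int64`. The cell below label encodes the column anyway: this guarantees integer class labels if the file is ever read with string values, and it leaves a column that already contains only 0 and 1 unchanged.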

In [13]:
# Converting ADHD_Outcome to categorical int
label_encoder = LabelEncoder()
sol['ADHD_Outcome'] = label_encoder.fit_transform(sol['ADHD_Outcome'])  #Converts it to a categorical int
sol.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1213 entries, 0 to 1212
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   participant_id  1213 non-null   object
 1   ADHD_Outcome    1213 non-null   int64 
 2   Sex_F           1213 non-null   int64 
dtypes: int64(2), object(1)
memory usage: 28.6+ KB


### Feature Creation

According to `DATA DICTIONARY.XLSX`, the `EHQ_EHQ_Total` variable in `TRAIN_QUANTITATIVE_METADATA.XLSX` is the Laterality Index (score). It is stored as a float, but the values correspond to 3 broad categories:
- -100 = 10th left
- -28 <= LI < 48 = middle
- 100 = 10th right

**Design Decision 2:** DO WE ENCODE THIS DATA AS WELL?
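
If the answer is yes, one possible approach is to bin the score with `pd.cut`. The sketch below is hedged: the data dictionary only pins down -100, 100, and the middle range, so treating everything below -28 as "left" and everything at or above 48 as "right" is an assumption, and the toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy laterality scores (illustrative values, not taken from the dataset)
toy_ehq = pd.Series([-100.0, -28.0, 0.0, 47.9, 48.0, 100.0], name="EHQ_EHQ_Total")

# ASSUMPTION: scores below -28 count as "left", scores >= 48 count as "right".
# right=False makes each bin closed on the left, so -28 <= LI < 48 is "middle".
bins = [-np.inf, -28, 48, np.inf]
labels = ["left", "middle", "right"]
handedness = pd.cut(toy_ehq, bins=bins, labels=labels, right=False)

print(handedness.tolist())
```

The resulting categorical column could then be one-hot or label encoded like the other categorical features, depending on Design Decision 1.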

In [None]:
# Add code to potentially encode/process EHQ_EHQ_Total Data
print(f"Number of unique values in this column: {quant_metadata['EHQ_EHQ_Total'].nunique()}")


Number of unique values in this column: 158


In [22]:
# Inspecting other data
quant_metadata['MRI_Track_Age_at_Scan'].values

array([      nan,       nan,  8.239904, ...,       nan, 12.089094,
       12.59571 ], shape=(1213,))

### Ideas for Dummy Variables
Quantitative Metadata
1. `MRI_Track_Age_at_Scan` - change this into categorical variables (e.g.)
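
A minimal sketch of this idea, assuming hypothetical age bins (under 10, 10 to 12, 13 and over) that are not taken from the data dictionary. Because the column contains NaN values, missing ages get their own "missing" category instead of being dropped.

```python
import numpy as np
import pandas as pd

# A few ages in the style of the array printed above
toy_age = pd.Series([np.nan, 8.239904, 12.089094, 12.59571],
                    name="MRI_Track_Age_at_Scan")

# HYPOTHETICAL bin edges, chosen only for illustration
age_group = pd.cut(
    toy_age,
    bins=[0, 10, 13, np.inf],
    labels=["under_10", "10_to_12", "13_plus"],
    right=False,  # bins are [0, 10), [10, 13), [13, inf)
)

# Keep participants with no recorded age by giving them an explicit category
age_group = age_group.cat.add_categories("missing").fillna("missing")

# One 0/1 dummy column per age group, including the "missing" group
age_dummies = pd.get_dummies(age_group, prefix="age", dtype=int)
print(age_dummies)
```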

## Combining the datasets into one giant training dataset

In [8]:
merge_column = 'participant_id'

merged_df = cat_metadata.merge(connectome_matrices, on=merge_column, how='outer')
merged_df = merged_df.merge(quant_metadata, on=merge_column, how='outer')
merged_df = merged_df.merge(sol, on=merge_column, how='outer')
merged_df.columns


Index(['participant_id', 'Basic_Demos_Enroll_Year', 'Basic_Demos_Study_Site',
       'PreInt_Demos_Fam_Child_Ethnicity', 'PreInt_Demos_Fam_Child_Race',
       'MRI_Track_Scan_Location', 'Barratt_Barratt_P1_Edu',
       'Barratt_Barratt_P1_Occ', 'Barratt_Barratt_P2_Edu',
       'Barratt_Barratt_P2_Occ',
       ...
       'SDQ_SDQ_Emotional_Problems', 'SDQ_SDQ_Externalizing',
       'SDQ_SDQ_Generating_Impact', 'SDQ_SDQ_Hyperactivity',
       'SDQ_SDQ_Internalizing', 'SDQ_SDQ_Peer_Problems', 'SDQ_SDQ_Prosocial',
       'MRI_Track_Age_at_Scan', 'ADHD_Outcome', 'Sex_F'],
      dtype='object', length=19930)
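
Since an outer merge silently keeps unmatched rows and duplicates join keys if an ID repeats, it may be worth validating the merge. The sketch below uses toy frames (values are made up) together with the `validate` and `indicator` options of `DataFrame.merge`; it is a suggested sanity check, not part of the original pipeline.

```python
import pandas as pd

# Toy stand-ins for two of the frames being merged
toy_cat = pd.DataFrame({
    "participant_id": ["a", "b", "c"],
    "Basic_Demos_Study_Site": [1, 3, 1],
})
toy_sol = pd.DataFrame({
    "participant_id": ["a", "b", "c"],
    "ADHD_Outcome": [1, 0, 1],
    "Sex_F": [0, 1, 0],
})

# validate="one_to_one" raises MergeError if an ID is duplicated on either side;
# indicator=True adds a "_merge" column saying where each row came from
merged = toy_cat.merge(
    toy_sol, on="participant_id", how="outer",
    validate="one_to_one", indicator=True,
)

# Rows that are not present in both frames would show up here
unmatched = merged[merged["_merge"] != "both"]
print(f"Unmatched participants: {len(unmatched)}")

merged = merged.drop(columns="_merge")
```

If every participant appears in all four files, the merged frame should still have 1213 rows.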

In [None]:
# Saving the merged data
merged_df.to_csv('merged_data.csv', index=False)