# RBC Dataset

The Reproducible Brain Charts (RBC) dataset contains data from many studies that have been pre-processed and harmonized. This notebook demonstrates how to access these data using an extension of the library [`CloudPathLib`](https://cloudpathlib.drivendata.org/stable/).

* Give students the participants with some p-factor cells (for test dataset) set to NaN; they have to predict these.
* We report at the end how everyone did.
* Notebook should contain example of linear regression on something like BA1 size.

## Getting Started

In [1]:
# We will need the RBCPath type from the rbclib package to load data from the RBC.
from rbclib import RBCPath

# We'll want to load some of the data using pandas.
import pandas as pd

In [10]:
# An RBC path is formatted as follows:
# rbc://GITHUB-REPO/GITHUB-PATH
# So we can represent the directory "freesurfer/sub-1000393599" of the RBC
# GitHub repo github.com/ReproBrainChart/PNC_FreeSurfer as follows:
path = RBCPath('rbc://PNC_FreeSurfer/freesurfer/sub-1000393599')

In [11]:
# List the directory contents:
contents = list(path.iterdir())
contents

[RBCPath('rbc://PNC_FreeSurfer/freesurfer/sub-1000393599/sub-1000393599_brainmeasures.json'),
 RBCPath('rbc://PNC_FreeSurfer/freesurfer/sub-1000393599/sub-1000393599_brainmeasures.tsv'),
 RBCPath('rbc://PNC_FreeSurfer/freesurfer/sub-1000393599/sub-1000393599_freesurfer.tar.xz'),
 RBCPath('rbc://PNC_FreeSurfer/freesurfer/sub-1000393599/sub-1000393599_fsLR_den-164k.tar.xz'),
 RBCPath('rbc://PNC_FreeSurfer/freesurfer/sub-1000393599/sub-1000393599_fsaverage.tar.xz'),
 RBCPath('rbc://PNC_FreeSurfer/freesurfer/sub-1000393599/sub-1000393599_regionsurfacestats.tsv')]
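Entries in a listing like the one above can also be picked out by file name rather than by position, which does not depend on the order in which `iterdir()` returns them. The following is a minimal sketch: `find_by_suffix` is a hypothetical helper (not part of `rbclib`), and it assumes that `RBCPath` objects expose a `.name` attribute as `CloudPathLib` and `pathlib` paths do. So that the example runs without network access, `PurePosixPath` objects stand in for the `RBCPath` entries.

```python
from pathlib import PurePosixPath


def find_by_suffix(paths, suffix):
    """Return the first path whose file name ends with `suffix`.

    Hypothetical helper: works on any path-like object with a `.name` attribute.
    """
    for p in paths:
        if p.name.endswith(suffix):
            return p
    raise FileNotFoundError(f"no entry ending in {suffix!r}")


# Stand-in for `contents`: same file names as in the listing above, but as
# PurePosixPath objects (an assumption made so the sketch runs offline).
subdir = PurePosixPath("PNC_FreeSurfer/freesurfer/sub-1000393599")
listing = [
    subdir / "sub-1000393599_brainmeasures.json",
    subdir / "sub-1000393599_brainmeasures.tsv",
    subdir / "sub-1000393599_regionsurfacestats.tsv",
]

print(find_by_suffix(listing, "_regionsurfacestats.tsv").name)
```

If the `.name` assumption holds for `RBCPath`, the same call on `contents` should return the region-surface-statistics file regardless of listing order.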

## Loading Atlas data from a Participant

In [13]:
# Use pandas to read in the final TSV file in the list from the above code-cell.
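# This TSV file contains per-region surface statistics (e.g., surface area,
# gray-matter volume, cortical thickness) for each atlas parcellation.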
rbcfile = contents[-1]
# We can alternatively construct the file path like so:
#    sub = '1000393599'
#    dataname = 'regionsurfacestats'
#    rbcroot = RBCPath('rbc://PNC_FreeSurfer/')
#    rbc_subdir = rbcroot / f'freesurfer/sub-{sub}'
#    rbcfile = rbc_subdir / f'sub-{sub}_{dataname}.tsv'

print(f"Loading {rbcfile} ...")
with rbcfile.open() as f:
    data = pd.read_csv(f, sep='\t')

data

Loading rbc://PNC_FreeSurfer/freesurfer/sub-1000393599/sub-1000393599_regionsurfacestats.tsv ...


Unnamed: 0,subject_id,session_id,atlas,hemisphere,StructName,NumVert,SurfArea,GrayVol,ThickAvg,ThickStd,...,StdDev_wgpct,Min_wgpct,Max_wgpct,Range_wgpct,SNR_wgpct,Mean_piallgi,StdDev_piallgi,Min_piallgi,Max_piallgi,Range_piallgi
0,sub-1000393599,,aparc.DKTatlas,lh,caudalanteriorcingulate,1668,1121,3493,2.870,0.588,...,5.8371,-1.8413,42.8855,44.7269,4.4281,1.9877,0.0777,1.8054,2.1455,0.3402
1,sub-1000393599,,aparc.DKTatlas,lh,caudalmiddlefrontal,3308,2236,7030,2.882,0.537,...,4.6666,7.1531,40.4774,33.3243,5.0341,3.3898,0.2448,2.7003,3.8032,1.1029
2,sub-1000393599,,aparc.DKTatlas,lh,cuneus,4102,2619,5753,2.019,0.490,...,5.2623,-13.1617,33.8137,46.9754,3.0343,3.2453,0.3093,2.4099,3.5491,1.1392
3,sub-1000393599,,aparc.DKTatlas,lh,entorhinal,737,549,2714,3.655,0.585,...,6.0438,2.5989,37.5099,34.9110,3.4560,2.6710,0.1285,2.4654,2.9647,0.4993
4,sub-1000393599,,aparc.DKTatlas,lh,fusiform,4115,2822,8180,2.738,0.526,...,5.2854,-5.9378,39.6908,45.6286,3.9405,2.8272,0.1093,2.3304,3.1105,0.7800
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13735,sub-1000393599,,Yeo2011_7Networks_N1000,rh,7Networks_3,14937,9936,27688,2.611,0.492,...,5.0774,-10.8846,39.2314,50.1161,4.1769,3.1173,0.3747,2.4544,4.7044,2.2500
13736,sub-1000393599,,Yeo2011_7Networks_N1000,rh,7Networks_4,13382,9146,29555,2.909,0.582,...,5.8317,-41.1954,52.2013,93.3967,3.8157,3.5262,0.9928,1.8828,5.1531,3.2703
13737,sub-1000393599,,Yeo2011_7Networks_N1000,rh,7Networks_5,10558,7677,31072,3.196,0.792,...,7.1063,-22.2837,88.8118,111.0955,3.3020,2.5300,0.3971,2.0215,4.7753,2.7538
13738,sub-1000393599,,Yeo2011_7Networks_N1000,rh,7Networks_6,20144,13602,41999,2.696,0.641,...,6.0781,-11.6287,43.5814,55.2101,3.6592,3.0563,0.5547,1.8599,4.9149,3.0550
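The table above is in long format: one row per combination of atlas, hemisphere, and region. A common next step is to restrict it to a single atlas and look up a measure for one region. The sketch below shows one way to do this with plain pandas. `select_regions` is a hypothetical helper, and the small `data` frame is a stand-in built from three of the rows displayed above (only a few columns are kept) so the example runs without loading the real file.

```python
import pandas as pd

# Toy stand-in for the `data` table above: a few columns of three rows copied
# from the displayed output (the real table has many more rows and columns).
data = pd.DataFrame({
    "atlas": ["aparc.DKTatlas", "aparc.DKTatlas", "Yeo2011_7Networks_N1000"],
    "hemisphere": ["lh", "lh", "rh"],
    "StructName": ["cuneus", "entorhinal", "7Networks_3"],
    "SurfArea": [2619, 549, 9936],
    "ThickAvg": [2.019, 3.655, 2.611],
})


def select_regions(table, atlas, columns=("SurfArea", "ThickAvg")):
    """Return the rows of one atlas, indexed by (hemisphere, StructName)."""
    rows = table[table["atlas"] == atlas]
    return rows.set_index(["hemisphere", "StructName"])[list(columns)]


dkt = select_regions(data, "aparc.DKTatlas")
print(dkt)

# Look up a single value: surface area of the left-hemisphere cuneus.
print(dkt.loc[("lh", "cuneus"), "SurfArea"])
```

Indexing by `(hemisphere, StructName)` makes single-region lookups explicit and avoids relying on row positions in the full table.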


## Load in meta-data about the participants

In [7]:
# Participant meta-data is generally located in the BIDS repository for each study:
participants_file = RBCPath('rbc://PNC_BIDS/study-PNC_desc-participants.tsv')

print("Loading PNC participants TSV file...")
with participants_file.open() as f:
    participants = pd.read_csv(f, sep='\t')

participants

Loading PNC participants TSV file...


Unnamed: 0,participant_id,study,study_site,session_id,wave,age,sex,race,ethnicity,bmi,handedness,participant_education,parent_1_education,parent_2_education,p_factor_mcelroy_harmonized_all_samples,internalizing_mcelroy_harmonized_all_samples,externalizing_mcelroy_harmonized_all_samples,attention_mcelroy_harmonized_all_samples,cubids_acquisition_group
0,1000393599,PNC,PNC1,PNC1,1,15.583333,Male,Black,not Hispanic or Latino,22.15,Right,9th Grade,Complete primary,Complete secondary,0.589907,-0.449373,-0.630780,-1.842178,1
1,1000881804,PNC,PNC1,PNC1,1,14.916667,Male,Black,not Hispanic or Latino,21.52,Right,7th Grade,Complete secondary,Complete secondary,-0.655377,0.097355,0.387355,-0.467807,113
2,1001970838,PNC,PNC1,PNC1,1,17.833333,Male,Other,Hispanic or Latino,23.98,Right,11th Grade,Complete tertiary,Complete tertiary,-0.659061,0.531072,0.392751,0.190706,1
3,100527940,PNC,PNC1,PNC1,1,8.250000,Male,Black,not Hispanic or Latino,,Ambidextrous,1st Grade,Complete secondary,Complete primary,-0.591516,0.699062,-0.781881,-0.982040,3
4,1006151876,PNC,PNC1,PNC1,1,21.500000,Female,Other,not Hispanic or Latino,,Right,12th Grade,Complete tertiary,Complete secondary,-0.377828,0.495947,0.806481,-0.832210,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1596,985910486,PNC,PNC1,PNC1,1,18.750000,Female,Black,not Hispanic or Latino,24.50,Right,12th Grade,Complete primary,,-1.233807,-0.896835,-0.449099,0.111167,1
1597,986035435,PNC,PNC1,PNC1,1,9.916667,Female,White,not Hispanic or Latino,,Right,3rd Grade,Complete primary,Complete primary,-0.872749,1.581768,-0.619987,0.556958,1
1598,987544292,PNC,PNC1,PNC1,1,18.416667,Female,White,not Hispanic or Latino,,Right,11th Grade,Complete secondary,Complete primary,-1.136788,0.399735,-0.490472,0.018679,1
1599,993394555,PNC,PNC1,PNC1,1,19.500000,Female,White,not Hispanic or Latino,,Right,Some College,Complete secondary,Complete secondary,-1.420477,0.750985,-0.377146,-0.519601,7
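The prediction task outlined at the top of this notebook (p-factor cells set to NaN for test participants) can be sketched with a simple linear regression. The example below is only an illustration under stated assumptions: it uses `age` as the single predictor, whereas a brain measure such as a region's surface area could be substituted once it has been joined to the participants table. The `toy` frame contains the ages and p-factors of the first five participants shown above, with the last p-factor deliberately replaced by NaN to mimic a held-out participant. `fit_and_fill` is a hypothetical helper, not part of `rbclib`.

```python
import numpy as np
import pandas as pd

P_FACTOR = "p_factor_mcelroy_harmonized_all_samples"

# Toy stand-in for `participants`: values from the first five rows shown above.
# The last p-factor is set to NaN on purpose to mimic a test participant.
toy = pd.DataFrame({
    "participant_id": [1000393599, 1000881804, 1001970838, 100527940, 1006151876],
    "age": [15.583333, 14.916667, 17.833333, 8.250000, 21.500000],
    P_FACTOR: [0.589907, -0.655377, -0.659061, -0.591516, np.nan],
})


def fit_and_fill(table, feature, target):
    """Fit target = slope * feature + intercept on rows where target is known,
    then return a copy in which missing targets are replaced by predictions."""
    known = table[target].notna() & table[feature].notna()
    slope, intercept = np.polyfit(
        table.loc[known, feature].to_numpy(),
        table.loc[known, target].to_numpy(),
        deg=1,
    )
    filled = table.copy()
    missing = filled[target].isna() & filled[feature].notna()
    filled.loc[missing, target] = slope * filled.loc[missing, feature] + intercept
    return filled, slope, intercept


filled, slope, intercept = fit_and_fill(toy, "age", P_FACTOR)
print(f"slope={slope:.4f}, intercept={intercept:.4f}")
print(filled[["participant_id", "age", P_FACTOR]])
```

With only four training rows the fit itself is not meaningful; the point is the workflow of fitting on rows with known p-factor and predicting the rows where it is NaN.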
