__Exploratory Data Analysis__

Monday, April 18, 2022

The goal is to get a high-level understanding of the data we have.

Conclusions: 
- Each row of the metadata gives a diagnosis label (Autism vs. Control), the subject's sex, and the path to one functional MRI image
- 391 images in total
- 146 Autism cases, 245 Controls
- 124 Autistic Male cases, 184 Male Controls 
- 22 Autistic Female cases, 61 Female Controls

In [3]:
# Load Packages
import pandas as pd
import os

In [6]:
# First and foremost, explore the metadata sheet
# ??? Did Chelsea create this sheet, or was it given to us?
meta = pd.read_csv("/gpfs/gpfs0/project/ds6050-soa2wg/team_lambda_II/ASD_DSM_CasesvsControls.csv")
meta

Unnamed: 0,FILE_ID,DX_GROUP,DSM_IV_TR,SEX,DX_Control,DX_DSM,SEX_,PATH
0,Pitt_0050005,1,1,2,Autism,Autism,Female,/project/ds6050-soa2wg/team_lambda_II/Outputs/...
1,Pitt_0050006,1,1,1,Autism,Autism,Male,/project/ds6050-soa2wg/team_lambda_II/Outputs/...
2,Pitt_0050007,1,1,1,Autism,Autism,Male,/project/ds6050-soa2wg/team_lambda_II/Outputs/...
3,Pitt_0050011,1,1,1,Autism,Autism,Male,/project/ds6050-soa2wg/team_lambda_II/Outputs/...
4,Pitt_0050014,1,1,1,Autism,Autism,Male,/project/ds6050-soa2wg/team_lambda_II/Outputs/...
...,...,...,...,...,...,...,...,...
386,UCLA_1_0051280,2,0,1,Control,Control,Male,/project/ds6050-soa2wg/team_lambda_II/Outputs/...
387,UCLA_1_0051281,2,0,1,Control,Control,Male,/project/ds6050-soa2wg/team_lambda_II/Outputs/...
388,UCLA_1_0051282,2,0,2,Control,Control,Female,/project/ds6050-soa2wg/team_lambda_II/Outputs/...
389,UCLA_2_0051303,2,0,2,Control,Control,Female,/project/ds6050-soa2wg/team_lambda_II/Outputs/...


In [7]:
# Check the columns. It looks like we used 'DX_Control' for the class label, 'SEX' for gender, and 'PATH' for the image path
# ??? What do the rest of the columns mean?
meta.columns 

Index(['FILE_ID', 'DX_GROUP', 'DSM_IV_TR', 'SEX', 'DX_Control', 'DX_DSM',
       'SEX_', 'PATH'],
      dtype='object')

In [8]:
# Check for missing data. Result: no missing data - hooray!
meta.isnull().sum(axis=0) 

FILE_ID       0
DX_GROUP      0
DSM_IV_TR     0
SEX           0
DX_Control    0
DX_DSM        0
SEX_          0
PATH          0
dtype: int64
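
The null check only covers the table itself; it does not confirm that the files listed in `PATH` are actually on disk. The following is a minimal sketch of such a check, not part of the original notebook. It uses temporary placeholder files and hypothetical names (`subject_a`, `subject_b`, `demo`, `file_exists`) so it runs anywhere; on the real data the same `map` call would presumably be applied to `meta["PATH"]`.

```python
import os
import tempfile

import pandas as pd

# Stand-in files so the sketch runs anywhere; the real notebook would
# use the PATH column of `meta` instead.
tmp_dir = tempfile.mkdtemp()
existing = os.path.join(tmp_dir, "subject_a_func_mean.nii.gz")  # hypothetical name
missing = os.path.join(tmp_dir, "subject_b_func_mean.nii.gz")   # hypothetical name
open(existing, "wb").close()  # create an empty placeholder file

demo = pd.DataFrame({"FILE_ID": ["subject_a", "subject_b"],
                     "PATH": [existing, missing]})

# True where the file is present on disk
demo["file_exists"] = demo["PATH"].map(os.path.exists)

# Rows whose image file could not be found
missing_rows = demo.loc[~demo["file_exists"], ["FILE_ID", "PATH"]]
print(missing_rows)
```

An empty `missing_rows` on the real sheet would mean every listed image is reachable from the current machine.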

In [9]:
# Check for unique values
for col in meta.columns: 
    print(col, "unique values:", meta[col].nunique(), "\n")

FILE_ID unique values: 391 

DX_GROUP unique values: 2 

DSM_IV_TR unique values: 2 

SEX unique values: 2 

DX_Control unique values: 2 

DX_DSM unique values: 2 

SEX_ unique values: 2 

PATH unique values: 391 
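
Since `FILE_ID` has 391 unique values for 391 rows, each row describes a different image. A small sketch of how that could be asserted directly, and how the prefix of `FILE_ID` might be split off, is shown here. It assumes (this is not confirmed anywhere in the sheet) that everything before the last underscore is a collection-site label; the IDs are copied from the table output above.

```python
import pandas as pd

# A few FILE_ID values copied from the table above
ids = pd.Series(["Pitt_0050005", "Pitt_0050006", "UCLA_1_0051280",
                 "UCLA_1_0051281", "UCLA_2_0051303"], name="FILE_ID")

# Every row should describe a different image
assert ids.is_unique

# Assumption: everything before the last underscore is a site label
site = ids.str.rsplit("_", n=1).str[0]
site_counts = site.value_counts()
print(site_counts)
```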



In [10]:
# Excluding the first and last columns (FILE_ID and PATH), what are the unique values?
for col in meta.columns[1:-1]: 
    print(col, "unique values:", meta[col].unique(), "\n")

DX_GROUP unique values: [1 2] 

DSM_IV_TR unique values: [1 0] 

SEX unique values: [2 1] 

DX_Control unique values: ['Autism' 'Control'] 

DX_DSM unique values: ['Autism' 'Control'] 

SEX_ unique values: ['Female' 'Male'] 



In [11]:
# Check whether columns are redundant. Result: DX_Control and DX_DSM are identical
meta.groupby(["DX_Control", "DX_DSM"]).size()

DX_Control  DX_DSM 
Autism      Autism     146
Control     Control    245
dtype: int64

In [12]:
# # Double-check. Result: DX_Control and DX_DSM are the same
# meta.query("DX_Control == 'Autism' & DX_DSM == 'Control'")

In [13]:
# Check whether columns are redundant. Result: SEX and SEX_ carry the same information (1 = Male, 2 = Female)
meta.groupby(["SEX", "SEX_"]).size()

SEX  SEX_  
1    Male      308
2    Female     83
dtype: int64

In [14]:
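# Check whether columns are redundant. Result: DX_GROUP and DSM_IV_TR map one-to-one onto DX_Control
# (DX_GROUP 1 / DSM_IV_TR 1 = Autism, DX_GROUP 2 / DSM_IV_TR 0 = Control), so they add no new information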
for col in meta.columns[1:-1]: 
    print(col, "\n"*2, meta.groupby(["DX_GROUP", col]).size(), "\n"*3)

DX_GROUP 

 DX_GROUP  DX_GROUP
1         1           146
2         2           245
dtype: int64 



DSM_IV_TR 

 DX_GROUP  DSM_IV_TR
1         1            146
2         0            245
dtype: int64 



SEX 

 DX_GROUP  SEX
1         1      124
          2       22
2         1      184
          2       61
dtype: int64 



DX_Control 

 DX_GROUP  DX_Control
1         Autism        146
2         Control       245
dtype: int64 



DX_DSM 

 DX_GROUP  DX_DSM 
1         Autism     146
2         Control    245
dtype: int64 



SEX_ 

 DX_GROUP  SEX_  
1         Female     22
          Male      124
2         Female     61
          Male      184
dtype: int64 
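
The redundancy checks above are read off the `groupby` output by eye. One possible way to turn them into a yes/no test is sketched below. The helper `is_one_to_one` is hypothetical (it is not a pandas function), and `demo` is a small stand-in frame that uses the codes seen in the outputs above rather than the full sheet.

```python
import pandas as pd


def is_one_to_one(df, a, b):
    """Return True if each value of `a` pairs with exactly one value of `b`, and vice versa."""
    pairs = df[[a, b]].drop_duplicates()
    return bool(pairs[a].is_unique and pairs[b].is_unique)


# Small stand-in frame using the codes seen in the outputs above
demo = pd.DataFrame({
    "DX_GROUP":   [1, 1, 2, 2],
    "DSM_IV_TR":  [1, 1, 0, 0],
    "SEX":        [2, 1, 1, 2],
    "DX_Control": ["Autism", "Autism", "Control", "Control"],
    "SEX_":       ["Female", "Male", "Male", "Female"],
})

print(is_one_to_one(demo, "DX_GROUP", "DX_Control"))
print(is_one_to_one(demo, "DX_GROUP", "DSM_IV_TR"))
print(is_one_to_one(demo, "SEX", "SEX_"))
print(is_one_to_one(demo, "DX_GROUP", "SEX"))
```

The first three calls return `True` because those column pairs are just relabelings of each other; the last returns `False` because diagnosis and sex vary independently.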





In [15]:
# Value counts by class and gender
meta.groupby(["DX_Control", "SEX_"]).size()

DX_Control  SEX_  
Autism      Female     22
            Male      124
Control     Female     61
            Male      184
dtype: int64
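
The same counts can also be laid out as a two-way table with totals, which makes the class imbalance easier to see. This is a sketch only: `demo` is rebuilt from the four group sizes printed above instead of being read from the CSV, and the variable names are illustrative.

```python
import pandas as pd

# Rebuild a frame with the same group sizes as the output above
counts = {("Autism", "Female"): 22, ("Autism", "Male"): 124,
          ("Control", "Female"): 61, ("Control", "Male"): 184}
rows = [(dx, sex) for (dx, sex), n in counts.items() for _ in range(n)]
demo = pd.DataFrame(rows, columns=["DX_Control", "SEX_"])

# Counts with row and column totals in the "All" margin
table = pd.crosstab(demo["DX_Control"], demo["SEX_"], margins=True)
print(table)

# Share of Autism vs. Control within each sex (each column sums to 1)
shares = pd.crosstab(demo["DX_Control"], demo["SEX_"], normalize="columns")
print(shares.round(3))
```

With these counts, Autism cases make up about 27% of the female images (22 of 83) and about 40% of the male images (124 of 308), so the classes are imbalanced differently within each sex.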

In [16]:
# Show full paths instead of truncating them
pd.set_option("display.max_colwidth", None)
# Print relevant columns
meta[['DX_Control', 'SEX_', 'PATH']]
# pd.reset_option("display.max_colwidth")

Unnamed: 0,DX_Control,SEX_,PATH
0,Autism,Female,/project/ds6050-soa2wg/team_lambda_II/Outputs/ccs/filt_global/func_mean/female_asd/Pitt_0050005_func_mean.nii.gz
1,Autism,Male,/project/ds6050-soa2wg/team_lambda_II/Outputs/ccs/filt_global/func_mean/male_asd/Pitt_0050006_func_mean.nii.gz
2,Autism,Male,/project/ds6050-soa2wg/team_lambda_II/Outputs/ccs/filt_global/func_mean/male_asd/Pitt_0050007_func_mean.nii.gz
3,Autism,Male,/project/ds6050-soa2wg/team_lambda_II/Outputs/ccs/filt_global/func_mean/male_asd/Pitt_0050011_func_mean.nii.gz
4,Autism,Male,/project/ds6050-soa2wg/team_lambda_II/Outputs/ccs/filt_global/func_mean/male_asd/Pitt_0050014_func_mean.nii.gz
...,...,...,...
386,Control,Male,/project/ds6050-soa2wg/team_lambda_II/Outputs/ccs/filt_global/func_mean/male_control/UCLA_1_0051280_func_mean.nii.gz
387,Control,Male,/project/ds6050-soa2wg/team_lambda_II/Outputs/ccs/filt_global/func_mean/male_control/UCLA_1_0051281_func_mean.nii.gz
388,Control,Female,/project/ds6050-soa2wg/team_lambda_II/Outputs/ccs/filt_global/func_mean/female_control/UCLA_1_0051282_func_mean.nii.gz
389,Control,Female,/project/ds6050-soa2wg/team_lambda_II/Outputs/ccs/filt_global/func_mean/female_control/UCLA_2_0051303_func_mean.nii.gz
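
The full paths show that each image sits in a folder such as `female_asd` or `male_control`. A quick consistency check, sketched below, compares that folder name against the `DX_Control` and `SEX_` labels. The `<sex>_<asd|control>` naming convention is inferred from the rows shown above and is an assumption; the three rows in `demo` are copied from that output, and the column names `folder` and `expected_folder` are illustrative.

```python
import os

import pandas as pd

base = "/project/ds6050-soa2wg/team_lambda_II/Outputs/ccs/filt_global/func_mean"

# Three rows copied from the table above
demo = pd.DataFrame({
    "DX_Control": ["Autism", "Autism", "Control"],
    "SEX_": ["Female", "Male", "Female"],
    "PATH": [
        base + "/female_asd/Pitt_0050005_func_mean.nii.gz",
        base + "/male_asd/Pitt_0050006_func_mean.nii.gz",
        base + "/female_control/UCLA_1_0051282_func_mean.nii.gz",
    ],
})

# Name of the folder that holds each image, e.g. "female_asd"
demo["folder"] = demo["PATH"].map(lambda p: os.path.basename(os.path.dirname(p)))

# Folder name expected from the labels (assumed "<sex>_<asd|control>" convention)
dx_tag = demo["DX_Control"].map({"Autism": "asd", "Control": "control"})
demo["expected_folder"] = demo["SEX_"].str.lower() + "_" + dx_tag

# Rows where the folder and the labels disagree
mismatches = demo[demo["folder"] != demo["expected_folder"]]
print(len(mismatches))
```

Any rows left in `mismatches` on the full sheet would point to a label or a file placed in the wrong group.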
