In [40]:
import pandas as pd

# Universal DEV data check

This notebook accesses all of the DEV data sources and cross-references them to see where data is missing.

This should run locally, because the Dropbox folder is not accessible from Talapas. Anything that lives on Talapas is read over SFTP.

## Setup

### Talapas FTP

Let's set up the SFTP connection to Talapas.

In [12]:
import pysftp
import paramiko
# Not sure whether this will keep working without an explicit reference to a key file, but let's see!
pagent = paramiko.agent.Agent()
srv = pysftp.Connection(host="talapas-ln2.uoregon.edu", username="bsmith16", private_key=pagent.get_keys()[0])
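
The uncertainty noted in the comment above could be handled with a fallback: use the first key the SSH agent offers, and otherwise point to a key file on disk. The sketch below is one possible way to do that, not something this notebook currently does. The helper name `sftp_connection_kwargs` and the `~/.ssh/id_rsa` path are hypothetical, chosen only for illustration.

```python
import os

def sftp_connection_kwargs(host, username, agent_keys, key_file=None):
    """Build keyword arguments for pysftp.Connection (hypothetical helper).

    agent_keys: keys offered by the SSH agent, e.g. paramiko.agent.Agent().get_keys()
    key_file:   optional path to a private key, used only if the agent has no keys
    """
    kwargs = {"host": host, "username": username}
    if len(agent_keys) > 0:
        # Prefer the first key held by the running SSH agent
        kwargs["private_key"] = agent_keys[0]
    elif key_file is not None:
        # Fall back to an explicit key file; expand "~" to the home directory
        kwargs["private_key"] = os.path.expanduser(key_file)
    else:
        raise RuntimeError("No SSH agent keys found and no key file given")
    return kwargs

# Intended use (not run here, since it needs network access and credentials):
# pagent = paramiko.agent.Agent()
# srv = pysftp.Connection(**sftp_connection_kwargs(
#     "talapas-ln2.uoregon.edu", "bsmith16", pagent.get_keys(),
#     key_file="~/.ssh/id_rsa"))  # assumed key location
```

Failing loudly when neither source provides a key makes a missing agent easier to spot than the `IndexError` that `get_keys()[0]` would raise on an empty result.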


### Paths

In [61]:
#paths for accessing from local machine
redcap_csv_path = '../dicom_check/DEV-Sessions_DATA_2022-10-19_2234.csv'
behavioral_data_sst_all_csv = '~/Dropbox (University of Oregon)/UO-SAN Lab/Berkman Lab/Devaluation/analysis_files/data/sst_behavioral_data_all_20230119.csv'
self_report_behav_summary_data = '/Users/benjaminsmith/Dropbox (University of Oregon)/UO-SAN Lab/Berkman Lab/Devaluation/analysis_files/data/data_by_ppt.csv'

#absolute paths for accessing directly from talapas
fmriprep_server_path = '/gpfs/projects/sanlab/shared/DEV/bids_data/derivatives/fmriprep'

## Load items




### DICOMs

In [26]:
import re

# Regex-filter the directory listing: folders named like "DEV123_..." count as DEV scans.
# A raw string avoids invalid-escape warnings for \d.
dicom_dev_regex = r'DEV\d\d\d_'
raw_dicom_folder_list = srv.listdir('/gpfs/projects/lcni/dcm/sanlab/Berkman/DEV/')
raw_dicom_dev_folder_list = [x for x in raw_dicom_folder_list if re.search(dicom_dev_regex, x)]
non_dev_folders = [x for x in raw_dicom_folder_list if not re.search(dicom_dev_regex, x)]

In [27]:
non_dev_folders

['159_20200126_115140',
 'DEV0903_20190607_090700',
 'DEV310`_20220911_120849',
 'DEV_20171214',
 'DEV_20210608_170401',
 'DEV_20210714_173417',
 'Matlabtest_20191023_154406',
 'Phantom^of the opera_20180430_110701',
 'TEST999_20180219_170145',
 'dev125_20190823_102411',
 'dev133_20190808_171200',
 'matlabtest2_20191025_121247',
 'phantom_20181211_112538',
 'test dev_20180219_155612',
 'wwe_20180131_135555']

There are a number of incorrectly named DICOM folders here that we'll need to go through. We already have a process for this in `DEV_scripts/org/dicom_check/`, so we need to follow that.
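
As a first triage step before following that process, the misnamed folders could be split into those where a DEV ID can be guessed safely and those that need manual review. The sketch below is only an illustration and is not the procedure in `DEV_scripts/org/dicom_check/`. The function `suggest_dev_id` and its matching rule (the name starts with "dev" in any letter case, followed by exactly three digits) are assumptions.

```python
import re

# Case-insensitive "dev" followed by exactly three digits (no fourth digit allowed)
LOOSE_DEV_PATTERN = re.compile(r'^dev(\d{3})(?!\d)', re.IGNORECASE)

def suggest_dev_id(folder_name):
    """Return a normalized ID like 'DEV125', or None if no safe guess exists."""
    match = LOOSE_DEV_PATTERN.match(folder_name)
    if match is None:
        return None
    return 'DEV' + match.group(1)

# A few of the folder names listed above
sample_folders = [
    'DEV0903_20190607_090700',   # four digits: ambiguous, left for manual review
    'DEV310`_20220911_120849',   # stray backtick after the ID
    'DEV_20171214',              # no participant number at all
    'dev125_20190823_102411',    # lowercase prefix
    'phantom_20181211_112538',   # not a participant scan
]

suggestions = {name: suggest_dev_id(name) for name in sample_folders}
needs_manual_review = [name for name, dev_id in suggestions.items() if dev_id is None]
```

Anything that ends up in `needs_manual_review` would still have to be resolved by hand, for example by checking scan dates against the REDCap session list.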

For now, we can skip ahead to the fMRIPrep data and record what is missing there relative to the REDCap data.

### BIDS data

Let's get a list of the participants with BIDS folders

In [32]:
bids_folder_all = srv.listdir('/gpfs/projects/sanlab/shared/DEV/bids_data')

In [33]:
bids_folder_regex = r'sub-DEV\d\d\d'
bids_participants = [x for x in bids_folder_all if re.search(bids_folder_regex, x)]

### fMRIPrep data

In [55]:
fMRIPrep_folder_all = srv.listdir(fmriprep_server_path)
# The trailing $ keeps only names that end right after the three-digit ID
fmriprep_folder_regex = r'sub-DEV\d\d\d$'
fmriprep_folder_list = [x for x in fMRIPrep_folder_all if re.search(fmriprep_folder_regex, x)]

# Now get a list of fMRIPrep sessions: one row per (subject folder, session folder) pair
fmri_prep_session_rows = []
for fmriprep_subj_folder in fmriprep_folder_list:
    fmriprep_subj_path = fmriprep_server_path + '/' + fmriprep_subj_folder
    fmriprep_subj_dirlist = srv.listdir(fmriprep_subj_path)
    fmri_prep_session_list = [x for x in fmriprep_subj_dirlist if re.search(r'ses-wave\d+', x)]
    for fmriprep_session_folder in fmri_prep_session_list:
        fmri_prep_session_rows.append({'fmriprep_subj_folder': fmriprep_subj_folder, 'fmriprep_session_folder': fmriprep_session_folder})
# Build the DataFrame once instead of concatenating inside the loop
fmri_prep_session_df = pd.DataFrame(fmri_prep_session_rows, columns=['fmriprep_subj_folder', 'fmriprep_session_folder'])
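
With both the BIDS and fMRIPrep subject folders listed, a quick check is to compare the two sets of subject names. The sketch below uses made-up subject IDs rather than the real listings, and `compare_subject_lists` is a hypothetical helper. In the notebook it would be called with `bids_participants` and `fmriprep_folder_list`.

```python
def compare_subject_lists(bids_subjects, fmriprep_subjects):
    """Report subjects present in one listing but not in the other."""
    bids_set = set(bids_subjects)
    fmriprep_set = set(fmriprep_subjects)
    return {
        # converted to BIDS but never run through fMRIPrep
        'bids_only': sorted(bids_set - fmriprep_set),
        # fMRIPrep output with no matching BIDS folder
        'fmriprep_only': sorted(fmriprep_set - bids_set),
    }

# Made-up example data, for illustration only
example_bids = ['sub-DEV001', 'sub-DEV002', 'sub-DEV003']
example_fmriprep = ['sub-DEV001', 'sub-DEV003', 'sub-DEV004']

subject_diff = compare_subject_lists(example_bids, example_fmriprep)
```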

### REDCap session list

In [43]:
redcap_sessions_list = pd.read_csv(redcap_csv_path)

### Behavioral data

In [62]:
sr_behav_summary_df = pd.read_csv(self_report_behav_summary_data)

That isn't quite the right file: it only has data for the first wave, and at this point we need data for all waves. So let's put aside cross-referencing the behavioral data until we decide how to do that.

## Matching

In [77]:
#prepare the redcap list for matching
redcap_sessions_list['redcap_wave'] = redcap_sessions_list['redcap_event_name'].str.extract(r'session_(\d+)')[0]
#rename dev_id to redcap_dev_id
redcap_sessions_list.rename(columns={'dev_id': 'redcap_dev_id'}, inplace=True)

In [84]:
#prepare the fmriprep list for matching

fmri_prep_session_df['fmriprep_dev_id'] = fmri_prep_session_df['fmriprep_subj_folder'].str.extract(r'sub-(DEV\d\d\d)')[0]
fmri_prep_session_df['fmriprep_wave'] = fmri_prep_session_df['fmriprep_session_folder'].str.extract(r'ses-wave(\d+)')[0]


In [86]:
matched_list_1 = pd.merge(redcap_sessions_list,fmri_prep_session_df, how='outer', 
    left_on=['redcap_dev_id', 'redcap_wave'],
    right_on=['fmriprep_dev_id', 'fmriprep_wave'])

In [89]:
matched_list_1.to_csv('match_out.csv')
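
To see directly which sessions are missing from which source, the outer merge can be run with pandas' `indicator=True` option, which adds a `_merge` column with the values `left_only`, `right_only`, or `both`. The sketch below uses small made-up tables rather than the real REDCap and fMRIPrep lists. The column names follow the ones used above, and the wave columns are strings, as `str.extract` produces.

```python
import pandas as pd

# Made-up example data, for illustration only
redcap_example = pd.DataFrame({
    'redcap_dev_id': ['DEV001', 'DEV001', 'DEV002'],
    'redcap_wave': ['1', '2', '1'],
})
fmriprep_example = pd.DataFrame({
    'fmriprep_dev_id': ['DEV001', 'DEV003'],
    'fmriprep_wave': ['1', '1'],
})

matched_example = pd.merge(
    redcap_example, fmriprep_example, how='outer',
    left_on=['redcap_dev_id', 'redcap_wave'],
    right_on=['fmriprep_dev_id', 'fmriprep_wave'],
    indicator=True,  # adds the _merge column
)

# Sessions recorded in REDCap with no fMRIPrep output
missing_fmriprep = matched_example[matched_example['_merge'] == 'left_only']
# fMRIPrep output with no matching REDCap session
missing_redcap = matched_example[matched_example['_merge'] == 'right_only']
```

Both key pairs need the same dtype for the match to work. If one wave column were numeric and the other a string, rows that should match would not line up.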


## Cross-reference DEV QC and Exclusions with REDCap to get a new list of DEV subjects

In [90]:
# Access a public Google Sheet with the Google Sheets API





ImportError: cannot import name 'gspread' from 'pydrive2.drive' (/Users/benjaminsmith/opt/anaconda3/envs/dataanalysis/lib/python3.11/site-packages/pydrive2/drive.py)

In [99]:
#access a google sheet from python
from gsheets import Sheets
sheets = Sheets.from_files('~/.ssh/client_secret.json', '~/.ssh/storage.json')


Your browser has been opened to visit:

    https://accounts.google.com/o/oauth2/auth?client_id=435162197765-u17g2s1q1mac7fsulm6ebbb79s1tq4hm.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Flocalhost%3A8080%2F&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fspreadsheets.readonly+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.readonly&access_type=offline&response_type=code

If your browser is on a different machine then exit and re-run this
application with the command-line parameter

  --noauth_local_webserver

Authentication successful.
