**AUTHOR:** <br>
Vasilis Raptis

**DATE:** <br>
21.05.2024 

**PURPOSE:** <br>
This notebook:
- runs regenie step 1 using the filtered microarray genotypes and the delirium phenotype files (eur, afr, amr)
- copies the regenie step 1 output to the bucket

**NOTES:** <br>
- uses the lists of filtered sample IDs and SNPs from the bucket, and the arrays_noY genotypes from the workspace (created with 02_part1_genotypes_preprocessing.ipynb)
- the afr population has >1M filtered SNPs, so regenie throws an error; use --force-step1 for that run
- run in the background, using 00_run_notebook_in_background.ipynb
- running times:
 - amr: 5h 15m 19s (with 4 CPUs & 15G RAM)
 - eur: 8.22h (with 16 CPUs, 14.4G RAM)
 - afr: 9.51h (with 4 CPUs & 15G RAM)
 
**UPDATE 09.07.24:**
- running times (with 16 CPUs, 14.4G RAM):
 - amr: 3.2h
 - eur: 7.9h
 - afr: 
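
The `--force-step1` note above can be turned into an explicit check before a run is started. This is a minimal sketch, assuming the `.snplist` files hold one variant ID per line and taking 1,000,000 as the cutoff implied by the note; `needs_force_step1` is a hypothetical helper, not part of regenie.

```r
# Hypothetical helper: decide whether a step 1 run needs --force-step1.
# Assumes one variant ID per line; the 1e6 cutoff is taken from the note above.
needs_force_step1 <- function(snplist_path, max_snps = 1e6) {
  n_snps <- length(readLines(snplist_path))
  n_snps > max_snps
}

# Example with a small temporary list and a lowered cutoff
tmp <- tempfile(fileext = ".snplist")
writeLines(paste0("rs", 1:5), tmp)
needs_force_step1(tmp, max_snps = 3)   # TRUE
needs_force_step1(tmp)                 # FALSE
```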

**Setup:**

In [1]:
# libraries
library(data.table)
library(tidyverse)

## Get my bucket name
my_bucket  <- Sys.getenv("WORKSPACE_BUCKET")
## Google project name
GOOGLE_PROJECT <- Sys.getenv("GOOGLE_PROJECT")

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::between()     masks data.table::between()
✖ dplyr::filter()      masks stats::filter()
✖ dplyr::first()       masks data.table::first()
✖ lubridate::hour()    masks data.table::hour()
✖ lubridate::isoweek() masks data.table::isoweek()

In [3]:
# List data in my bucket folder
system(paste0("gsutil ls ", my_bucket, "/data/pheno/clean"), intern=T)
system(paste0("gsutil ls ", my_bucket, "/data/arrays/clean/all"), intern=T)
# List storage usage in workspace
system("du -h", intern=T)

**Regenie Step 1:**

In [8]:
## create output folder in workspace, including the tmpdir used by --lowmem-prefix
system(paste0("mkdir -p regenie_out/step1/tmpdir"), intern=T)
system(paste0("ls regenie_out/step1/tmpdir"), intern=T)

In [4]:
## copy phenotype tables from bucket into the workspace
system(paste0("mkdir -p ./pheno/"), intern=T)
system(paste0("gsutil cp ", my_bucket, "/data/pheno/clean/*pheno_clean.txt", " ./pheno/"), intern=T)
system(paste0("ls ./pheno/"), intern=T)
## copy filtered sample-ID (.id) and SNP (.snplist) lists from bucket
system(paste0("gsutil cp ", my_bucket, "/data/arrays/clean/all/arrays_qc_*.snplist", " ./microarray/plink_v7.1/"), intern=T)
system(paste0("gsutil cp ", my_bucket, "/data/arrays/clean/all/arrays_qc_*.id", " ./microarray/plink_v7.1/"), intern=T)
system(paste0("ls ./microarray/plink_v7.1/"), intern=T)
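
Before launching a multi-hour run, it may help to confirm that every input copied above is in place. This is a minimal sketch using base R only; `check_step1_inputs` is a hypothetical helper, and the file-name pattern is assumed from the paths used in the regenie cells of this notebook.

```r
# Hypothetical helper: report which step 1 inputs are missing for one population.
# File names are assumed to follow the pattern used in this notebook.
check_step1_inputs <- function(pop,
                               array_dir = "microarray/plink_v7.1",
                               pheno_dir = "pheno") {
  files <- c(
    file.path(array_dir, paste0("arrays_noY", c(".bed", ".bim", ".fam"))),
    file.path(array_dir, paste0("arrays_qc_", pop, "_clean", c(".id", ".snplist"))),
    file.path(pheno_dir, paste0(pop, "_pheno_clean.txt"))
  )
  missing <- files[!file.exists(files)]
  if (length(missing) > 0) {
    warning("missing inputs for ", pop, ": ", paste(missing, collapse = ", "))
  }
  invisible(missing)  # character(0) when everything is present
}

# Check all three populations; a warning lists anything that is absent
for (pop in c("amr", "eur", "afr")) check_step1_inputs(pop)
```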

In [6]:
system(paste0("mkdir -p ./regenie_out/step1"), intern=T)


American set:

In [9]:
# paths
bed     = "microarray/plink_v7.1/arrays_noY"
keep    = "microarray/plink_v7.1/arrays_qc_amr_clean.id"
extract = "microarray/plink_v7.1/arrays_qc_amr_clean.snplist"
pheno   = "pheno/amr_pheno_clean.txt"
out     = "regenie_out/step1/del_amr_clean_step1"

system(paste0("regenie --step 1 ",
              " --bed ", bed,
              " --keep ", keep,
              " --extract ", extract,
              " --phenoFile ", pheno,
              " --phenoCol delirium_status ",
              " --covarFile ", pheno,
              " --covarColList age,PC{1:10} ",
              " --catCovarList sex ",
              " --bt ", 
              " --bsize 1000 ", 
              " --lowmem --lowmem-prefix regenie_out/step1/tmpdir/regenie_tmp_preds ",
              " --threads $(nproc) ",
              " --out ", out
             ),
       intern=T)

European set:

In [10]:
# paths
bed     = "microarray/plink_v7.1/arrays_noY"
keep    = "microarray/plink_v7.1/arrays_qc_eur_clean.id"
extract = "microarray/plink_v7.1/arrays_qc_eur_clean.snplist"
pheno   = "pheno/eur_pheno_clean.txt"
out     = "regenie_out/step1/del_eur_clean_step1"

system(paste0("regenie --step 1 ",
              " --bed ", bed,
              " --keep ", keep,
              " --extract ", extract,
              " --phenoFile ", pheno,
              " --phenoCol delirium_status ",
              " --covarFile ", pheno,
              " --covarColList age,PC{1:10} ",
              " --catCovarList sex ",
              " --bt ", 
              " --bsize 1000 ", 
              " --lowmem --lowmem-prefix regenie_out/step1/tmpdir/regenie_tmp_preds ",
              " --threads $(nproc) ",
              " --out ", out
             ),
       intern=T)

African set:

In [11]:
# paths
bed     = "microarray/plink_v7.1/arrays_noY"
keep    = "microarray/plink_v7.1/arrays_qc_afr_clean.id"
extract = "microarray/plink_v7.1/arrays_qc_afr_clean.snplist"
pheno   = "pheno/afr_pheno_clean.txt"
out     = "regenie_out/step1/del_afr_clean_step1"

system(paste0("regenie --step 1 ",
              # afr has >1M filtered SNPs; regenie stops with an error unless this flag is set
              " --force-step1 ",
              " --bed ", bed,
              " --keep ", keep,
              " --extract ", extract,
              " --phenoFile ", pheno,
              " --phenoCol delirium_status ",
              " --covarFile ", pheno,
              " --covarColList age,PC{1:10} ",
              " --catCovarList sex ",
              " --bt ", 
              " --bsize 1000 ", 
              " --lowmem --lowmem-prefix regenie_out/step1/tmpdir/regenie_tmp_preds ",
              " --threads $(nproc) ",
              " --out ", out
             ),
       intern=T)
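
The three cells above differ only in the population label and the extra `--force-step1` flag for afr. This is a minimal sketch of a helper that builds the same command string; `build_step1_cmd` is a hypothetical name, and the flags are copied from the cells above rather than extended.

```r
# Hypothetical helper: assemble the regenie step 1 command for one population.
# All flags and paths mirror the cells above; nothing new is passed to regenie.
build_step1_cmd <- function(pop, force = FALSE,
                            array_dir = "microarray/plink_v7.1",
                            out_dir = "regenie_out/step1") {
  pheno <- file.path("pheno", paste0(pop, "_pheno_clean.txt"))
  parts <- c(
    "regenie --step 1",
    if (force) "--force-step1",  # NULL (dropped by c()) when force is FALSE
    "--bed", file.path(array_dir, "arrays_noY"),
    "--keep", file.path(array_dir, paste0("arrays_qc_", pop, "_clean.id")),
    "--extract", file.path(array_dir, paste0("arrays_qc_", pop, "_clean.snplist")),
    "--phenoFile", pheno,
    "--phenoCol delirium_status",
    "--covarFile", pheno,
    "--covarColList age,PC{1:10}",
    "--catCovarList sex",
    "--bt",
    "--bsize 1000",
    "--lowmem --lowmem-prefix", file.path(out_dir, "tmpdir", "regenie_tmp_preds"),
    "--threads $(nproc)",
    "--out", file.path(out_dir, paste0("del_", pop, "_clean_step1"))
  )
  paste(parts, collapse = " ")
}

# Inspect the command before running it
cat(build_step1_cmd("afr", force = TRUE), "\n")
# To run: system(build_step1_cmd("afr", force = TRUE), intern = TRUE)
```

Printing the string first makes it easy to compare against the hand-written cells before committing to a long run.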

**Copy to bucket:**

In [7]:
# step1 loco predictions
system(paste0("gsutil cp regenie_out/step1/del_*_step1_1.loco", " ", my_bucket, "/data/regenie/step1/"), intern=T)
# prediction list file: stores the paths to the .loco files (edit the paths if the files are used from another location)
system(paste0("gsutil cp regenie_out/step1/del_*_step1_pred.list", " ", my_bucket, "/data/regenie/step1/"), intern=T)
# log files
#system(paste0("gsutil cp regenie_out/step1/del_*_step1.log", " ", my_bucket, "/data/regenie/step1/"), intern=T)
# check
system(paste0("gsutil ls ", my_bucket, "/data/regenie/step1/"), intern=T)
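
The path-file comment above matters because each line of a `_pred.list` file pairs a phenotype name with the path to its `.loco` file, so the paths go stale once the files are moved. This is a minimal sketch of rewriting them, assuming a two-column, space-separated layout and paths without spaces; `relocate_pred_list`, the example path, and the target directory are all hypothetical.

```r
# Hypothetical helper: point each pred.list entry at a new directory.
# Assumes "<phenotype> <path>" per line, separated by a single space.
relocate_pred_list <- function(lines, new_dir) {
  fields <- strsplit(lines, " ", fixed = TRUE)
  vapply(fields, function(f) {
    paste(f[1], file.path(new_dir, basename(f[2])))
  }, character(1))
}

# Made-up example line; the real file is read with readLines() and
# written back with writeLines()
old <- "delirium_status /home/jupyter/regenie_out/step1/del_amr_clean_step1_1.loco"
relocate_pred_list(old, "/new/location")
# "delirium_status /new/location/del_amr_clean_step1_1.loco"
```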