**AUTHOR:** <br>
Vasilis Raptis

**DATE:** <br>
20.05.2024 

**PURPOSE:** <br>
This notebook: 
- makes summary tables by ancestry.
- creates clean phenotype and covariate files ready for plink & regenie, split by genetic ancestry.
- creates {ancestry}_ids.txt files to use with plink2 --keep (two columns, no header: FID==0, IID==person_id).

**NOTES:** <br>
- covariates: age, sex, PCs 1-10.
- pheno/covariate files are space separated (.txt)
- FID == 0 for the microarray data
- We will use only EUR, AMR & AFR ancestries, as they have adequate (>100) delirium cases.
- This notebook needs to be run only once, as it saves the phenotype/covariate files to the bucket. After this, just load the *"{my_bucket}/data/###"* file into the workspace.
- **UPDATE 03.06.2024:** Use the *"{my_bucket}/data/full_pheno_clean_del_df.csv"* file (see 01_part1_pheno_preprocessing.ipynb) and save the updated output files to: {my_bucket}/data/pheno/clean/*_clean.txt
- **UPDATE 07.10.2024:** make descriptive statistics table for publication (ST1) + add dementia / AD stats

**Setup:**

In [1]:
# libraries
library(data.table)
library(tidyverse)

## Get my bucket name
my_bucket  <- Sys.getenv("WORKSPACE_BUCKET")
## Google project name
GOOGLE_PROJECT <- Sys.getenv("GOOGLE_PROJECT")

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::between()     masks data.table::between()
✖ dplyr::filter()      masks stats::filter()
✖ dplyr::first()       masks data.table::first()
✖ lubridate::hour()    masks data.table::hour()
✖ lubridate::isoweek() masks data.table::isoweek()

In [2]:
# List data in my bucket pheno folder
system(paste0("gsutil ls ", my_bucket, "/data/pheno"), intern=T)
# List object in workspace
system("ls .", intern=T)
# List storage usage in workspace
system("du -h", intern=T)

**Load full pheno table:**

In [3]:
## Copy the file from the bucket to the current workspace
#system(paste0("gsutil cp ", my_bucket, "/data/pheno/", "full_pheno_df.csv", " ."), intern=T)
system(paste0("gsutil cp ", my_bucket, "/data/pheno/", "full_pheno_clean_del_df.csv", " ."), intern=T)

## Load the file into an R dataframe
#pheno  <- fread("full_pheno_df.csv")
pheno  <- fread("full_pheno_clean_del_df.csv")

## filter participants with genomic data & male/female sex 
pheno <- pheno %>% filter(!is.na(gen_ancestry) & !is.na(sex))
head(pheno)
cat(nrow(pheno), "participants with genomic data")

person_id,dob,sex,race,depind,depind_date,gen_ancestry,PC1,PC2,PC3,⋯,PC12,PC13,PC14,PC15,PC16,delirium_code,delirium_date,delirium_count,delirium_status,age
<int>,<chr>,<int>,<chr>,<dbl>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<int>,<int>,<int>
1583400,1971/06/15 00:00:00,1,None of these,0.2289,2020/01/08 00:00:00,oth,0.092957,-0.02499529,-0.026522432,⋯,3.907187e-05,0.000228023,-5.0049e-06,0.0001372348,-0.0005315775,,,,0,51
1193078,1978/06/15 00:00:00,1,None of these,0.2619931,2019/06/06 00:00:00,eur,0.10155976,0.12976805,-0.010792255,⋯,-0.0001879208,-0.0001394847,-0.0009090964,0.0008948572,0.001538901,,,,0,44
2403498,1942/06/15 00:00:00,1,None of these,0.241524,2019/02/11 00:00:00,eur,0.10040533,0.13062048,-0.008823804,⋯,-0.0007487362,0.0002732151,0.000274647,-0.0013316116,-7.235247e-05,,,,0,80
1883620,1998/06/15 00:00:00,1,None of these,0.3884576,2020/02/02 00:00:00,oth,0.04925685,0.09593613,-0.008628618,⋯,-0.00216507,0.0017120609,0.0022692362,0.0024377976,0.001059544,,,,0,24
3454673,1996/06/15 00:00:00,1,None of these,0.3325541,2019/08/29 00:00:00,oth,-0.09021576,0.06932578,-0.006370319,⋯,0.002625063,0.0030603247,-0.002571081,0.0025824752,0.002779786,,,,0,26
1802369,1970/06/15 00:00:00,1,None of these,0.3312764,2020/02/25 00:00:00,eur,0.09648393,0.1255298,-0.013973749,⋯,2.26877e-05,0.0033589271,0.0004961625,0.0008061035,0.0003523692,,,,0,52


240158 participants with genomic data
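
A few quick sanity checks on the filtered table can catch coding problems before any summaries are made. The sketch below runs on a small made-up table (`toy`, not the real `pheno`), and the specific checks (one row per participant, 0/1 coding for `sex` and `delirium_status`) are assumptions about what the downstream steps expect rather than checks taken from the original notebook.

```r
library(dplyr)

# Toy stand-in for `pheno`; all values are made up for illustration
toy <- tibble(
  person_id       = c(1L, 2L, 3L, 4L),
  sex             = c(0L, 1L, NA, 1L),
  gen_ancestry    = c("eur", "afr", "eur", NA),
  delirium_status = c(0L, 1L, 0L, 0L)
)

# Same filter as in the cell above: keep rows with ancestry and sex
toy_clean <- toy %>% filter(!is.na(gen_ancestry) & !is.na(sex))

# Assumed expectations: unique participants and binary coding
stopifnot(!anyDuplicated(toy_clean$person_id))
stopifnot(all(toy_clean$sex %in% c(0L, 1L)))
stopifnot(all(toy_clean$delirium_status %in% c(0L, 1L)))

cat(nrow(toy_clean), "toy participants kept\n")
```

If any `stopifnot()` fails on the real table, it is probably worth inspecting the offending rows before writing the regenie files.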

**Load dementia/AD data:**

In [52]:
system(paste0("gsutil cp ", my_bucket, "/data/pheno/", "with_dementia/full.txt", " ."), intern=T)
dem  <- fread("full.txt") 
dim(dem)
head(dem)

person_id,dob,sex,delirium_code,delirium_date,delirium_count,delirium_status,dementia_code,dementia_date,dementia_concept_name,dementia_count,dementia_status,drug_code,drug_date,drug_count,drug_status,earliest_dementia_date,earliest_dementia_source,dementia_incident
<int>,<IDate>,<int>,<chr>,<IDate>,<int>,<int>,<chr>,<IDate>,<chr>,<int>,<int>,<int>,<IDate>,<int>,<int>,<IDate>,<chr>,<int>
1036100,1996-06-15,,,,,0,,,,,0,,,,,,,-9
1938775,1986-06-15,,,,,0,,,,,0,,,,,,,-9
3028890,1982-06-15,,,,,0,,,,,0,,,,,,,-9
2061175,1997-06-15,,,,,0,,,,,0,,,,,,,-9
9251240,1979-06-15,,,,,0,,,,,0,,,,,,,-9
3419434,1967-06-15,,,,,0,,,,,0,,,,,,,-9


In [67]:
system(paste0("gsutil cp ", my_bucket, "/data/pheno/", "with_dementia/full_AD.txt", " ."), intern=T)
ad  <- fread("full_AD.txt") 
ad %>% head

person_id,source_concept_name,source_concept_code,alzheimer_status,alzheimer_date
<int>,<chr>,<chr>,<int>,<IDate>
1001959,Alzheimer's disease with late onset,G30.1,1,2020-07-30
1004198,Alzheimer's disease,331.0,1,2012-12-10
1006354,"Alzheimer's disease, unspecified",G30.9,1,2021-01-19
1008635,"Alzheimer's disease, unspecified",G30.9,1,2016-07-27
1009326,"Alzheimer's disease, unspecified",G30.9,1,2022-02-24
1010116,"Alzheimer's disease, unspecified",G30.9,1,2018-03-21


In [71]:
## merge pheno with dementia table
pheno2 <-
    left_join(pheno, dem[,c("person_id","dementia_status","dementia_concept_name")], by="person_id") %>%
    # add AD status
    mutate(AD_status = ifelse(person_id %in% ad$person_id, 1, 0))
pheno2 %>% head

person_id,dob,sex,race,depind,depind_date,gen_ancestry,PC1,PC2,PC3,⋯,PC15,PC16,delirium_code,delirium_date,delirium_count,delirium_status,age,dementia_status,dementia_concept_name,AD_status
<int>,<chr>,<int>,<chr>,<dbl>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<chr>,<chr>,<int>,<int>,<int>,<int>,<chr>,<dbl>
1583400,1971/06/15 00:00:00,1,None of these,0.2289,2020/01/08 00:00:00,oth,0.092957,-0.02499529,-0.026522432,⋯,0.0001372348,-0.0005315775,,,,0,51,0,,0
1193078,1978/06/15 00:00:00,1,None of these,0.2619931,2019/06/06 00:00:00,eur,0.10155976,0.12976805,-0.010792255,⋯,0.0008948572,0.001538901,,,,0,44,0,,0
2403498,1942/06/15 00:00:00,1,None of these,0.241524,2019/02/11 00:00:00,eur,0.10040533,0.13062048,-0.008823804,⋯,-0.0013316116,-7.235247e-05,,,,0,80,0,,0
1883620,1998/06/15 00:00:00,1,None of these,0.3884576,2020/02/02 00:00:00,oth,0.04925685,0.09593613,-0.008628618,⋯,0.0024377976,0.001059544,,,,0,24,0,,0
3454673,1996/06/15 00:00:00,1,None of these,0.3325541,2019/08/29 00:00:00,oth,-0.09021576,0.06932578,-0.006370319,⋯,0.0025824752,0.002779786,,,,0,26,0,,0
1802369,1970/06/15 00:00:00,1,None of these,0.3312764,2020/02/25 00:00:00,eur,0.09648393,0.1255298,-0.013973749,⋯,0.0008061035,0.0003523692,,,,0,52,0,,0
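
A `left_join()` silently adds rows when the right-hand table has more than one row per key, which would inflate the counts in the summary tables. The sketch below shows one way to guard against that on made-up data; the `toy_*` tables are hypothetical, and whether `dem` and `ad` actually contain repeated `person_id` values is not known from this notebook.

```r
library(dplyr)

# Toy tables; all values are made up for illustration
toy_pheno <- tibble(person_id = c(1L, 2L, 3L),
                    delirium_status = c(1L, 0L, 0L))
toy_dem   <- tibble(person_id = c(1L, 2L, 3L, 4L),
                    dementia_status = c(1L, 0L, 0L, 1L))
# Assumption: a person could appear more than once in the AD table
toy_ad    <- tibble(person_id = c(1L, 1L))

# Guard: the joined table must have one row per person
stopifnot(!anyDuplicated(toy_dem$person_id))

toy_merged <- left_join(toy_pheno, toy_dem, by = "person_id") %>%
  # %in% is unaffected by repeated ids in toy_ad
  mutate(AD_status = as.integer(person_id %in% toy_ad$person_id))

# The join should not change the number of participants
stopifnot(nrow(toy_merged) == nrow(toy_pheno))
toy_merged
```

Flagging AD with `%in%`, as the cell above does, avoids a second join and therefore cannot duplicate rows.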


**Delirium by ancestry stats:**

In [4]:
## stats by ancestry

pheno %>% 
    group_by(gen_ancestry) %>% 
    summarise(n = n(),
              del_cases = sum(delirium_status==1), 
              `del_prev (%)` = round((sum(delirium_status==1)/n())*100,2),
              #mean_age = round(mean(age),2),
              `female (%)` = round((sum(sex==0)/n())*100,2),
              `age (median)` = round(median(age),2),
              `age_at_onset (median)` = round(median(age[delirium_status==1]),2),
              mean_depind = round(mean(depind),2)) %>%
    arrange(desc(del_cases))

gen_ancestry,n,del_cases,del_prev (%),female (%),age (median),age_at_onset (median),mean_depind
<chr>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
eur,120476,652,0.54,60.25,61,63.0,0.31
afr,52536,233,0.44,56.99,55,56.0,0.36
amr,40096,117,0.29,66.81,47,59.0,0.35
oth,18905,97,0.51,60.02,55,69.0,0.33
eas,5320,6,0.11,63.61,45,62.0,0.31
mid,511,4,0.78,55.77,46,63.5,0.31
sas,2314,4,0.17,49.91,42,66.5,0.31
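
The choice of EUR, AFR and AMR follows from the >100 cases rule stated in the notes. As a small sketch, the same selection can be made programmatically from the case counts in the table above; `case_counts`, `min_cases` and `keep_ancestries` are names introduced here for illustration only.

```r
library(dplyr)

# Case counts copied from the summary table above
case_counts <- tibble(
  gen_ancestry = c("eur", "afr", "amr", "oth", "eas", "mid", "sas"),
  del_cases    = c(652L, 233L, 117L, 97L, 6L, 4L, 4L)
)

# Threshold from the notes: keep ancestries with more than 100 cases
min_cases <- 100

keep_ancestries <- case_counts %>%
  filter(del_cases > min_cases) %>%
  pull(gen_ancestry)

keep_ancestries
```

With these counts, `oth` falls just below the threshold (97 cases), so only `eur`, `afr` and `amr` remain.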


**Descriptive stats by delirium status:**

In [73]:
## for st1
pheno2 %>%
    #filter(gen_ancestry=="eur") %>%
    group_by(gen_ancestry) %>% 
    summarise(`n cases (prev)` = paste0(sum(delirium_status==1), " (", round(100*(sum(delirium_status==1)/n()),2), ")"),
              `n controls`     = paste0(sum(delirium_status==0)),
              `age cases; mean (sd)`  = paste0(round(mean(age[delirium_status==1]),1), " (", round(sd(age[delirium_status==1]),1), ")"),
              `age contr; mean (sd)`  = paste0(round(mean(age[delirium_status==0]),1), " (", round(sd(age[delirium_status==0]),1), ")"),
              `sex cases; female (%)` = paste0(sum(sex[delirium_status==1]==0), " (", round(100*(sum(sex[delirium_status==1]==0))/sum(delirium_status==1),1), ")"),
              `sex contr; female (%)` = paste0(sum(sex[delirium_status==0]==0), " (", round(100*(sum(sex[delirium_status==0]==0))/sum(delirium_status==0),1), ")"),
              `dementia in cases; (%)`= paste0(sum(dementia_status[delirium_status==1]==1), " (", round(100*(sum(dementia_status[delirium_status==1]==1)/sum(delirium_status==1)),1), ")"),
              `dementia in contr; (%)`= paste0(sum(dementia_status[delirium_status==0]==1), " (", round(100*(sum(dementia_status[delirium_status==0]==1)/sum(delirium_status==0)),1), ")"),
              `AD in cases; (%)`      = paste0(sum(AD_status[delirium_status==1]==1), " (", round(100*(sum(AD_status[delirium_status==1]==1)/sum(delirium_status==1)),1), ")"),
              `AD in contr; (%)`      = paste0(sum(AD_status[delirium_status==0]==1), " (", round(100*(sum(AD_status[delirium_status==0]==1)/sum(delirium_status==0)),1), ")"),
              fct  = sum(delirium_status==1)
             ) %>%
    arrange(desc(fct)) %>% select(-c(fct))


gen_ancestry,n cases (prev),n controls,age cases; mean (sd),age contr; mean (sd),sex cases; female (%),sex contr; female (%),dementia in cases; (%),dementia in contr; (%),AD in cases; (%),AD in contr; (%)
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
eur,652 (0.54),119824,62.3 (15.4),58.1 (17),297 (45.6),72286 (60.3),133 (20.4),2023 (1.7),30 (4.6),309 (0.3)
afr,233 (0.44),52303,54.2 (15),52.3 (14.7),118 (50.6),29820 (57),39 (16.7),577 (1.1),4 (1.7),62 (0.1)
amr,117 (0.29),39979,57.7 (16.8),47.5 (15.8),66 (56.4),26723 (66.8),31 (26.5),471 (1.2),7 (6),80 (0.2)
oth,97 (0.51),18808,66.5 (16.1),54 (18.8),34 (35.1),11312 (60.1),23 (23.7),322 (1.7),12 (12.4),57 (0.3)
eas,6 (0.11),5314,61.3 (18),47.1 (17.3),5 (83.3),3379 (63.6),1 (16.7),47 (0.9),0 (0),6 (0.1)
mid,4 (0.78),507,53 (26.3),47.7 (17.2),0 (0),285 (56.2),1 (25),6 (1.2),1 (25),0 (0)
sas,4 (0.17),2310,67 (14.6),45 (16.6),0 (0),1155 (50),1 (25),23 (1),0 (0),1 (0)
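
The `summarise()` call above repeats the same "count (percent)" and "mean (sd)" patterns many times. A minimal sketch of two formatting helpers that could shorten it is shown below; `n_pct()` and `mean_sd()` are hypothetical names, not functions from any package. One difference is an assumption made here: `n_pct()` drops missing values when counting, whereas the original expressions would return `NA`.

```r
# Hypothetical helper: format a logical flag as "n (pct)",
# where pct is the share of `denom` (default: all entries of `flag`)
n_pct <- function(flag, denom = length(flag), digits = 1) {
  n <- sum(flag, na.rm = TRUE)
  paste0(n, " (", round(100 * n / denom, digits), ")")
}

# Hypothetical helper: format a numeric vector as "mean (sd)"
mean_sd <- function(x, digits = 1) {
  paste0(round(mean(x), digits), " (", round(sd(x), digits), ")")
}

# Toy usage on made-up values
status <- c(1, 0, 0)
age    <- c(1, 2, 3)

n_pct(status == 1)    # share of cases among all rows
mean_sd(age)          # mean and sd of age
```

Inside `summarise()`, a column such as the female share among cases could then be written as `n_pct(sex[delirium_status == 1] == 0)`.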


**Make regenie pheno/covariate files by ancestry (eur, afr & amr):**

In [10]:
## european
pheno_eur <-
pheno %>% 
    filter(gen_ancestry == "eur") %>%
    mutate(FID = 0) %>%
    mutate(IID = person_id) %>%
    # num_range() selects PC1..PC10 by name, independent of column order
    select(FID, IID, delirium_status, age, sex, num_range("PC", 1:10))

## african
pheno_afr <-
pheno %>% 
    filter(gen_ancestry == "afr") %>%
    mutate(FID = 0) %>%
    mutate(IID = person_id) %>%
    select(FID, IID, delirium_status, age, sex, num_range("PC", 1:10))

## american
pheno_amr <-
pheno %>% 
    filter(gen_ancestry == "amr") %>%
    mutate(FID = 0) %>%
    mutate(IID = person_id) %>%
    select(FID, IID, delirium_status, age, sex, num_range("PC", 1:10))

## save .txt to workspace (space separated)
write.table(pheno_eur, "eur_pheno.txt", sep=" ", row.names=F, col.names=T, quote=F)
write.table(pheno_afr, "afr_pheno.txt", sep=" ", row.names=F, col.names=T, quote=F)
write.table(pheno_amr, "amr_pheno.txt", sep=" ", row.names=F, col.names=T, quote=F)

## save .txt to bucket
# system(paste0("gsutil cp ./", "eur_pheno.txt", " ", my_bucket, "/data/pheno/"), intern=T)
# system(paste0("gsutil cp ./", "afr_pheno.txt", " ", my_bucket, "/data/pheno/"), intern=T)
# system(paste0("gsutil cp ./", "amr_pheno.txt", " ", my_bucket, "/data/pheno/"), intern=T)
system(paste0("gsutil cp ./", "eur_pheno.txt", " ", my_bucket, "/data/pheno/eur_pheno_clean.txt"), intern=T)
system(paste0("gsutil cp ./", "afr_pheno.txt", " ", my_bucket, "/data/pheno/afr_pheno_clean.txt"), intern=T)
system(paste0("gsutil cp ./", "amr_pheno.txt", " ", my_bucket, "/data/pheno/amr_pheno_clean.txt"), intern=T)

## check
system(paste0("gsutil ls ", my_bucket, "/data/pheno/*.txt"), intern=T)
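
The three per-ancestry blocks above differ only in the ancestry label, so they could be collapsed into one helper. The sketch below is one possible way to do that, run on a made-up table; `make_ancestry_pheno()`, its arguments and the `toy` data are hypothetical and not part of the original notebook.

```r
library(dplyr)

# Hypothetical helper: build the regenie-style table for one ancestry
make_ancestry_pheno <- function(df, anc, n_pcs = 10) {
  pc_cols <- paste0("PC", seq_len(n_pcs))   # PC1 .. PC<n_pcs>
  df %>%
    filter(gen_ancestry == anc) %>%
    mutate(FID = 0, IID = person_id) %>%    # FID == 0 for the microarray data
    select(FID, IID, delirium_status, age, sex, all_of(pc_cols))
}

# Toy stand-in for `pheno`; all values are made up for illustration
toy <- tibble(
  person_id       = 1:4,
  gen_ancestry    = c("eur", "afr", "eur", "amr"),
  delirium_status = c(0L, 1L, 0L, 0L),
  age             = c(51L, 44L, 80L, 24L),
  sex             = c(1L, 0L, 1L, 0L)
)
set.seed(1)
for (i in 1:16) toy[[paste0("PC", i)]] <- rnorm(nrow(toy))

# One table per ancestry, as a named list
ancestries <- c(eur = "eur", afr = "afr", amr = "amr")
pheno_list <- lapply(ancestries, function(a) make_ancestry_pheno(toy, a))

names(pheno_list)
colnames(pheno_list$eur)
```

Selecting the PCs by name with `all_of()` makes the helper fail loudly if an expected PC column is missing, rather than silently picking whichever columns come first.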

**Make _ids files (to use with --keep in plink2):** <br>
*two columns; no header: FID==0, IID==person_id*

In [12]:
eur_ids <- pheno_eur %>% select(FID,IID)
afr_ids <- pheno_afr %>% select(FID,IID)
amr_ids <- pheno_amr %>% select(FID,IID)


In [14]:
## save .txt to workspace
write.table(eur_ids, "eur_ids.txt", sep=" ", row.names=F, col.names=F, quote=F)
write.table(afr_ids, "afr_ids.txt", sep=" ", row.names=F, col.names=F, quote=F)
write.table(amr_ids, "amr_ids.txt", sep=" ", row.names=F, col.names=F, quote=F)
## check
system("wc -l *_ids.txt", intern=T)

## save .txt to bucket
# system(paste0("gsutil cp ./", "eur_ids.txt", " ", my_bucket, "/data/pheno/"), intern=T)
# system(paste0("gsutil cp ./", "afr_ids.txt", " ", my_bucket, "/data/pheno/"), intern=T)
# system(paste0("gsutil cp ./", "amr_ids.txt", " ", my_bucket, "/data/pheno/"), intern=T)
system(paste0("gsutil cp ./", "eur_ids.txt", " ", my_bucket, "/data/pheno/eur_ids_clean.txt"), intern=T)
system(paste0("gsutil cp ./", "afr_ids.txt", " ", my_bucket, "/data/pheno/afr_ids_clean.txt"), intern=T)
system(paste0("gsutil cp ./", "amr_ids.txt", " ", my_bucket, "/data/pheno/amr_ids_clean.txt"), intern=T)

## move all clean pheno files to a bucket folder:
system(paste0("gsutil cp ", my_bucket, "/data/pheno/*_clean.txt", " ", my_bucket, "/data/pheno/clean/"), intern=T)
system(paste0("gsutil rm ", my_bucket, "/data/pheno/*_clean.txt"), intern=T)

## check
system(paste0("gsutil ls ", my_bucket, "/data/pheno/*"), intern=T)
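
Because the ids files have no header, a mistake in their layout is easy to miss. The sketch below writes a small made-up ids table with the same `write.table()` settings used above and reads it back to confirm the two-column, header-free format described for `--keep`; the ids and the temporary file are illustrative only.

```r
# Toy ids table; the IID values are made up
ids <- data.frame(FID = 0, IID = c(101L, 102L))

# Write with the same settings as the cell above, to a temporary file
ids_file <- tempfile(fileext = ".txt")
write.table(ids, ids_file, sep = " ", row.names = FALSE,
            col.names = FALSE, quote = FALSE)

# Read back without a header and check the expected layout
ids_back <- read.table(ids_file, header = FALSE,
                       col.names = c("FID", "IID"))
stopifnot(ncol(ids_back) == 2)
stopifnot(all(ids_back$FID == 0))
stopifnot(!anyDuplicated(ids_back$IID))

readLines(ids_file)
```

The same read-back check could be pointed at `eur_ids.txt`, `afr_ids.txt` and `amr_ids.txt` in the workspace before they are copied to the bucket.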
