# UKB Blood Cells Prepare Data

## Approach

1. For each data set (region), take the strongest SNP, i.e. the one with the largest |z| across traits, and add it to the "strong" set.
2. Also select 2 "null" SNPs from each data set.
3. Then estimate the residual covariance Vhat first, and run Yunqi / Peter's ED using that estimate.

In the UKB blood cells data, we have about 600 regions.
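
Steps 1 and 2 above can be illustrated on a toy z-score matrix. This is a minimal Python sketch, not the workflow's R code: the function name `select_strong_and_null` and the toy values are illustrative assumptions, while the defaults (2 null SNPs, null defined as max |z| < 2, seed 999) mirror the parameters used in the workflow.

```python
import random


def select_strong_and_null(z, n_null=2, null_threshold=2.0, seed=999):
    """Return (index of strongest row, sorted indices of sampled null rows).

    z is a list of rows (SNPs), each row holding one z-score per trait.
    """
    rng = random.Random(seed)
    # Largest absolute z-score across traits, per SNP
    row_max = [max(abs(v) for v in row) for row in z]
    # Strong SNP: the row containing the overall largest |z|
    strong_idx = max(range(len(z)), key=lambda i: row_max[i])
    # Null candidates: every trait has |z| below the threshold
    candidates = [i for i, m in enumerate(row_max) if m < null_threshold]
    # Sample without replacement; fewer than n_null if not enough candidates
    null_idx = rng.sample(candidates, min(n_null, len(candidates)))
    return strong_idx, sorted(null_idx)


# Toy matrix: 5 SNPs (rows) by 3 traits (columns); values are made up.
z = [
    [0.5, -1.2, 0.3],
    [6.1, 4.8, -0.2],
    [1.9, 0.1, -0.7],
    [-2.5, 0.4, 1.1],
    [0.0, 0.9, -1.5],
]
strong_idx, null_idx = select_strong_and_null(z)
print(strong_idx, null_idx)
```

With about 600 regions, this scheme would yield roughly 600 strong rows and at most twice as many null rows, fewer when a region has no SNP below the threshold.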

## Workflow

In [None]:
[global]
parameter: cwd = path('/project2/mstephens/yuxin/ukb-bloodcells')
parameter: name = 'ukbbloodcells_prepare'

In [1]:
%cd /project2/mstephens/yuxin/ukb-bloodcells

/project2/mstephens/gaow/mvarbvs/dsc/mnm_prototype/mnm_sumstats

### Get top SNP and random close-to-null SNPs per region

In [31]:
# extract data from summary stats
[extract_effects_1]
parameter: datadir = path
parameter: seed = 999
parameter: n_null = 2
# Analysis units file. For RDS files it can be generated by `ls *.rds | sed 's/\.rds//g' > analysis_units.txt`
parameter: analysis_units = path
# handle N = per_chunk data sets in one job
parameter: per_chunk = 1000
regions = [x.strip().split() for x in open(analysis_units).readlines() if x.strip() and not x.strip().startswith('#')]
input: [f'{datadir}/{x[0]}.rds' for x in regions], group_by = per_chunk
output: f"{cwd}/{name}/cache/{name}_{_index+1}.rds"
task: trunk_workers = 1, walltime = '1h', trunk_size = 1, mem = '4G', cores = 1, tags = f'{_output:bn}'
R: expand = "${ }"
    set.seed(${seed})
    matxMax <- function(mtx) {
      return(arrayInd(which.max(mtx), dim(mtx)))
    }
    remove_rownames = function(x) {
        for (name in names(x)) rownames(x[[name]]) = NULL
        return(x)
    }
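    handle_nan_etc = function(x) {
      # Set NaN z-scores to zero; nothing to do when zhat is absent (e.g. empty null set)
      if (is.null(x$zhat)) return(x)
      x$zhat[is.nan(x$zhat)] = 0
      return(x)
    }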
    extract_one_data = function(infile, n_null) {
        # If the input cannot be read for some reason, skip it, assuming we have enough other data sets to use.
        dat = tryCatch(readRDS(infile)$Z, error = function(e) return(NULL))
        if (is.null(dat)) return(NULL)
        dat = as.matrix(dat)
        z = as.matrix(abs(dat))
        max_idx = matxMax(z)
        # strong effect samples
        strong = list(zhat = dat[max_idx[1],,drop=F])
        # null samples: SNPs whose largest |z| across all traits is < 2
        null.id = which(apply(abs(dat), 1, max) < 2)
        if (length(null.id) == 0) {
          warning(paste("Null data is empty for input file", infile))
          null = list()
        } else {
          null_idx = sample(null.id, min(n_null, length(null.id)), replace = F)
          null = list(zhat = dat[null_idx,,drop=F])
        }
        dat = (list(null = remove_rownames(null), strong = remove_rownames(strong)))
        dat$null = handle_nan_etc(dat$null)
        dat$strong = handle_nan_etc(dat$strong)
        return(dat)
    }
    reformat_data = function(dat) {
        # make output consistent in format with 
        # https://github.com/stephenslab/gtexresults/blob/master/workflows/mashr_flashr_workflow.ipynb      
        res = list(strong.z = dat$strong$zhat, null.z = dat$null$zhat)
      return(res)
    }
    merge_data = function(res, one_data) {
      if (length(res) == 0) {
          return(one_data)
      } else if (is.null(one_data)) {
          return(res)
      } else {
          for (d in names(one_data)) {
            if (is.null(one_data[[d]])) {
              next
            } else {
                res[[d]] = rbind(res[[d]], one_data[[d]])
            }
          }
          return(res)
      }
    }
    res = list()
    for (f in c(${_input:r,})) {
      res = merge_data(res, reformat_data(extract_one_data(f, ${n_null})))
    }
    saveRDS(res, ${_output:r})

[extract_effects_2]
input: group_by = "all"
output: f"{cwd}/{name}.rds"
task: trunk_workers = 1, walltime = '1h', trunk_size = 1, mem = '16G', cores = 1, tags = f'{_output:bn}'
R: expand = "${ }"
    merge_data = function(res, one_data) {
      if (length(res) == 0) {
          return(one_data)
      } else {
          for (d in names(one_data)) {
            res[[d]] = rbind(res[[d]], one_data[[d]])
          }
          return(res)
      }
    }
    dat = list()
    for (f in c(${_input:r,})) {
      dat = merge_data(dat, readRDS(f))
    }
    # compute empirical covariance of the strong z-scores, XtX = t(Z) %*% Z / n (not mean-centered)
    dat$XtX = t(as.matrix(dat$strong.z)) %*% as.matrix(dat$strong.z) / nrow(dat$strong.z)
                       
    saveRDS(dat, ${_output:r})

To run it:

```
m=/project2/mstephens/yuxin/ukb-bloodcells/zscores
cd $m && ls *.rds | sed 's/\.rds//g' > analysis_units.txt && cd -
sos run /project2/mstephens/yuxin/mvarbvs/analysis/multivariate/20201221_ukb_ED_prior.ipynb extract_effects \
        --analysis-units $m/analysis_units.txt \
        --datadir $m &> extract_effects.log
```
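
The `XtX` quantity saved by `extract_effects_2` is the matrix product of the strong z-scores with themselves, divided by the number of strong SNPs. The following is a small Python sketch of that computation on made-up numbers; the function name `empirical_second_moment` is hypothetical and only serves the illustration.

```python
def empirical_second_moment(z):
    """Return Z^T Z / n for a matrix given as a list of n rows."""
    n = len(z)
    p = len(z[0])
    return [
        [sum(row[j] * row[k] for row in z) / n for k in range(p)]
        for j in range(p)
    ]


# Hypothetical strong z-scores: 4 SNPs (rows) by 2 traits (columns).
strong_z = [[2.0, 0.0], [0.0, 4.0], [2.0, 4.0], [0.0, 0.0]]
xtx = empirical_second_moment(strong_z)
print(xtx)
```

Because the columns are not mean-centered, this is a second-moment matrix rather than a sample covariance in the strict sense; the two coincide only when the column means are zero.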

## Residual Covariance from Y

In [None]:
[Ycov]
parameter: Ydata = 'bloodcells.pheno.resid.txt'
output: f"{cwd}/{name}.Vcov.rds"
R: expand = "${ }"
    library(data.table)
    traits = fread('${cwd}/${Ydata}')
    Ycov = cov(traits[,3:18])
    saveRDS(Ycov, ${_output:r})

To run it:

```
sos run /project2/mstephens/yuxin/mvarbvs/analysis/multivariate/20201221_ukb_ED_prior.ipynb Ycov \
    &> Ycov.log
```

In [None]:
[Ycor]
parameter: Ydata = 'bloodcells.pheno.resid.txt'
output: f"{cwd}/{name}.Vcor.rds"
R: expand = "${ }"
    library(data.table)
    traits = fread('${cwd}/${Ydata}')
    Ycor = cor(traits[,3:18])
    saveRDS(Ycor, ${_output:r})

To run it:

```
sos run /project2/mstephens/yuxin/mvarbvs/analysis/multivariate/20201221_ukb_ED_prior.ipynb Ycor \
    &> Ycor.log
```
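
The `Ycov` and `Ycor` outputs carry the same information up to scaling: a correlation matrix is the covariance matrix with each entry divided by the product of the two corresponding standard deviations. A minimal Python sketch of that relation, on made-up numbers and with a hypothetical helper name `cov_to_cor` (R's `cov2cor()` performs the same conversion):

```python
import math


def cov_to_cor(cov):
    """Rescale a covariance matrix (list of lists) to a correlation matrix."""
    p = len(cov)
    # Standard deviations are the square roots of the diagonal entries
    sd = [math.sqrt(cov[i][i]) for i in range(p)]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(p)] for i in range(p)]


# Hypothetical 2-trait covariance matrix
ycov = [[4.0, 2.0], [2.0, 9.0]]
ycor = cov_to_cor(ycov)
print(ycor)
```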

## Residual Correlation from z

In [None]:
[zcor]
input: f"{cwd}/{name}.rds"
output: f"{cwd}/{name}.zcor.rds"
R: expand = "${ }"
    dat = readRDS(${_input:r})
    saveRDS(cor(dat$null.z), ${_output:r})
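
The `zcor` step takes the correlation between trait columns of the null z-scores. The usual rationale, stated here as an assumption rather than something this notebook spells out, is that SNPs with no effect on any trait have z-scores that co-vary across traits only through the residual correlation. A small Python sketch of the column-wise Pearson correlation on made-up values (the function name `column_correlation` is hypothetical):

```python
import math


def column_correlation(z):
    """Pearson correlation between the columns of a list-of-rows matrix."""
    n, p = len(z), len(z[0])
    means = [sum(row[j] for row in z) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in z]

    def dot(j, k):
        # Sum of products of centered columns j and k
        return sum(r[j] * r[k] for r in centered)

    return [
        [dot(j, k) / math.sqrt(dot(j, j) * dot(k, k)) for k in range(p)]
        for j in range(p)
    ]


# Hypothetical null z-scores: 6 SNPs (rows) by 2 traits (columns).
null_z = [
    [1.0, 1.0],
    [-1.0, -1.0],
    [1.0, -1.0],
    [-1.0, 1.0],
    [1.0, 1.0],
    [-1.0, -1.0],
]
zcor = column_correlation(null_z)
print(zcor)
```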