# Individual-level genotype-expression data generation

## Goal
This pipeline prepares individual-level genotype-expression data for cis-eQTL fine-mapping.

## Overview of methods
Specifically, we organize the genotype, expression, and covariate data gene-wise, i.e., one data file per gene. We also need the part of the expression data that is 'perpendicular' (orthogonal) to the covariates; this residual expression is later regressed on genotype, in the spirit of instrumental-variable analysis.

## Details of methods
covariates (`Z`): These do not depend on the gene, so they are identical across the per-gene files and can be copied directly from the original covariates data.

expression (`y`): Because running `grep` over the expression files is too time-consuming, we build a `gene:line_number` index as preprocessing. For each gene, the index records the line number of that gene in each of the tissue-specific expression files. Extracting a gene's expression then only requires reading those lines from the files and combining them.

genotype (`X`): Obtain the transcription start site (TSS) of every gene as preprocessing. When a gene is queried, take its TSS, extract the SNPs within a predefined flanking region around the TSS, and use their genotypes as `X`.
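
As a small illustration of the flanking-region idea, the sketch below builds a tabix-style region string around a TSS. The function name `cis_window` and the example positions are hypothetical; the 1 Mb default mirrors the flank used in the `extract` step further down, and clamping the start at 1 is an assumption about how genes near a chromosome start should be handled.

```python
def cis_window(chrom, tss, flank=1_000_000):
    """Return a region string 'chrN:start-end' centred on a TSS.

    `chrom` is given without the 'chr' prefix (as Ensembl reports it).
    The start is clamped at 1 so that a TSS closer than `flank` to the
    chromosome start does not produce a negative coordinate.
    """
    start = max(1, tss - flank)
    end = tss + flank
    return f"chr{chrom}:{start}-{end}"


# Hypothetical TSS positions, for illustration only.
print(cis_window("1", 29_553))       # chr1:1-1029553
print(cis_window("15", 43_000_000))  # chr15:42000000-44000000
```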

expression 'perpendicular' to covariates (`y_res`): The residuals from regressing expression on the covariates, i.e., the part of the expression data that the covariates do not linearly explain.
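
To make 'perpendicular to covariates' concrete, here is a minimal NumPy sketch of residualization on simulated data. It is not the pipeline's code (the `extract` step uses R's `.lm.fit`); the function name `residualize` and the toy data are assumptions. Like `.lm.fit`, it adds no intercept unless `Z` already contains one.

```python
import numpy as np


def residualize(y, Z):
    """Return the residuals of y after a least-squares fit on the columns of Z."""
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return y - Z @ coef


rng = np.random.default_rng(0)
n_samples, n_cov = 50, 3
Z = rng.normal(size=(n_samples, n_cov))  # toy covariates: samples x covariates
y = Z @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n_samples)  # toy expression

y_res = residualize(y, Z)

# The residuals are orthogonal to every covariate column, up to rounding error.
print(np.allclose(Z.T @ y_res, 0.0, atol=1e-8))
```

The orthogonality check is what 'perpendicular' means here: the inner product of `y_res` with each column of `Z` is zero.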

### Concrete plan for generating `y`, built around the `gene:line_number` index

1) extract the `gene_id` column (Ensembl gene IDs) from each of the 49 expression files with `pandas`

2) from that column, record `gene:line_number` for each gene in each tissue

3) for each gene, extract its expression row, together with the sample names, from each tissue file by line number (planned with `sed`; the `extract` step below uses `data.table::fread` with `skip`)

4) that's it!
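
The plan above can be sketched on toy data as follows. This only illustrates the indexing idea with the standard library; the helper names `index_gene_lines` and `read_line`, the gene IDs, and the values are all made up, and the preprocessing step below uses `pandas` instead.

```python
import csv
import gzip
import os
import tempfile


def index_gene_lines(paths):
    """Map gene_id -> {tissue index: 1-based line number} over gzipped BED files."""
    index = {}
    for tissue, path in enumerate(paths):
        with gzip.open(path, "rt", newline="") as handle:
            reader = csv.reader(handle, delimiter="\t")
            next(reader)  # line 1 is the header
            for line_number, row in enumerate(reader, start=2):
                index.setdefault(row[3], {})[tissue] = line_number  # column 4 is gene_id
    return index


def read_line(path, line_number):
    """Return the tab-separated fields of one line (1-based) of a gzipped file."""
    with gzip.open(path, "rt") as handle:
        for current, line in enumerate(handle, start=1):
            if current == line_number:
                return line.rstrip("\n").split("\t")
    raise IndexError(f"{path} has fewer than {line_number} lines")


# Two toy "tissue" files with made-up gene IDs and values.
header = "#chr\tstart\tend\tgene_id\tS1\tS2\n"
toy = [
    header + "chr1\t10\t11\tGENE_A\t0.1\t0.2\n" + "chr1\t20\t21\tGENE_B\t0.3\t0.4\n",
    header + "chr1\t20\t21\tGENE_B\t0.5\t0.6\n",
]
tmp_dir = tempfile.mkdtemp()
paths = []
for k, text in enumerate(toy):
    path = os.path.join(tmp_dir, f"tissue{k}.bed.gz")
    with gzip.open(path, "wt") as handle:
        handle.write(text)
    paths.append(path)

index = index_gene_lines(paths)
print(index)  # {'GENE_A': {0: 2}, 'GENE_B': {0: 3, 1: 2}}
row = read_line(paths[1], index["GENE_B"][1])
print(row[4:])  # ['0.5', '0.6']
```

Note that `GENE_A` is absent from the second toy file, so its index entry has fewer items than there are tissues; any code that consumes the index has to allow for that.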

# Documentation

## Input
Genotype data are stored in a single file; expression and covariate data are organized tissue-wise (one file per tissue).

- genotype: .vcf.gz

    #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  GTEX-1117F      GTEX-111CU      GTEX-111FC      GTEX-111VG      GTEX-111YS      GTEX-1122O
    chr1    13526   chr1_13526_C_T_b38      C       T       .       PASS    AN=1676;AF=0.000596659;AC=1     GT      0|0     0|0     0|0     0|0     0|0     0|0
    

- expression: .bed.gz

    #chr    start   end     gene_id GTEX-1117F      GTEX-111CU      GTEX-111FC      GTEX-111VG      GTEX-111YS      GTEX-1122O      GTEX-1128S
    chr1    29552   29553   ENSG00000227232.5       1.3135329173730264      -0.9007944960568992     -0.29268956046586164    -0.7324113622418431     -0.27475411245874887    -0.6990255198601908     0.18188866123299216


- covariates: .txt

    ID      GTEX-1117F      GTEX-111CU      GTEX-111FC      GTEX-111VG      GTEX-111YS
    PC1     -0.0867  0.0107  0.0099  0.0144  0.0154
    PC2     -0.0132 -0.0026 -0.0050 -0.0081 -0.0093
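
The `GT` entries in the VCF example above are phased genotype calls such as `0|0`. The `extract` step recodes them as alternate-allele counts with `sed`; the sketch below shows the same mapping in Python for biallelic calls. The function name `gt_to_dosage` is hypothetical, and returning `None` for missing or multi-allelic calls is an assumption of this sketch, not something the pipeline does.

```python
def gt_to_dosage(gt):
    """Convert a biallelic GT string such as '0|1' or '1/1' to an ALT allele count.

    Returns None for missing calls ('.') and for multi-allelic codes, which
    this sketch deliberately does not try to interpret.
    """
    alleles = gt.replace("/", "|").split("|")
    if len(alleles) != 2 or any(a not in ("0", "1") for a in alleles):
        return None
    return int(alleles[0]) + int(alleles[1])


calls = ["0|0", "0|1", "1|0", "1|1", ".|.", "1|2"]
print([gt_to_dosage(gt) for gt in calls])  # [0, 1, 1, 2, None, None]
```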
    

## Output
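One RDS file per gene, containing a list with the genotypes of the SNPs around that gene (`X`), the gene's expression (`y`), the covariates (`Z`), and the expression residualized on the covariates (`y_res`).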


In [1]:
sos run Multi_tissues.ipynb -h

usage: sos run Multi_tissues.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  preprocessing
  extract

Sections
  preprocessing:        preprocessing: prepare gene expressions, gene TSS.
    Workflow Options:
      --cwd . (as path)
      --expression-directory '/project/compbio/GTEx_dbGaP/GTEx_Analysis_2017-06-05_v8/eqtl/GTEx_Analysis_v8_eQTL_expression_matrices'
      --genotype-file '/project/compbio/GTEx_dbGaP/GTEx_Analysis_2017-06-05_v8/genotypes/WGS/variant_calls/GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_838Indiv_Analysis_Freeze.SHAPEIT2_phased.vcf.gz'
  extract:              extract Z, y, X from covariates, expression and genotype
                        data and generate

## Minimal working example
**You can edit the following 2 bash variables.** Set them first.
```
work_dir=/scratch/midway2/chj1ar/GTEx_Analysis_v8_eQTL_expression_genewise
ensembl_gene_id=ENSG00000140265.12
```
Then run as follows.
```
sos run Multi_tissues.ipynb preprocessing --cwd $work_dir
sos run Multi_tissues.ipynb extract --cwd $work_dir --ensembl-gene-id $ensembl_gene_id
```

To generate output files for all genes, run the following. (**You can edit the bash variable `$work_dir`.**)
```
work_dir=/scratch/midway2/chj1ar/GTEx_Analysis_v8_eQTL_expression_genewise
sos run Multi_tissues.ipynb preprocessing --cwd $work_dir
sos run Multi_tissues.ipynb parallelize --cwd $work_dir --no-self-defined
```

To generate output files for a self-defined list of genes, write the gene IDs to `ensembl_gene_id_run.txt` in `work_dir`, one gene ID per line (this can be done after preprocessing), and run the following. (**You can edit the bash variable `$work_dir`.**)
```
work_dir=/scratch/midway2/chj1ar/GTEx_Analysis_v8_eQTL_expression_genewise
sos run Multi_tissues.ipynb preprocessing --cwd $work_dir
sos run Multi_tissues.ipynb parallelize --cwd $work_dir --self-defined
```
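
One way to prepare `ensembl_gene_id_run.txt` is sketched below. The helper `write_gene_list` is hypothetical and not part of the workflow; a temporary directory stands in for `work_dir`, and the two gene IDs are simply the ones that appear elsewhere in this document.

```python
import tempfile
from pathlib import Path


def write_gene_list(work_dir, gene_ids, filename="ensembl_gene_id_run.txt"):
    """Write one versioned Ensembl gene ID per line, dropping blanks and duplicates."""
    unique_ids = []
    for gene_id in gene_ids:
        gene_id = gene_id.strip()
        if gene_id and gene_id not in unique_ids:
            unique_ids.append(gene_id)
    path = Path(work_dir) / filename
    path.write_text("\n".join(unique_ids) + "\n")
    return path


work_dir = tempfile.mkdtemp()  # stand-in for the real work_dir
genes = ["ENSG00000140265.12", "ENSG00000227232.5", "ENSG00000140265.12", ""]
list_path = write_gene_list(work_dir, genes)
print(list_path.read_text())
```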


In [25]:
# preprocessing: prepare gene expressions, gene TSS.
[preprocessing]
depends: R_library("biomaRt"), R_library("data.table")
parameter: cwd = path()
parameter: expression_directory = '/project2/compbio/GTEx_dbGaP/GTEx_Analysis_2017-06-05_v8/eqtl/GTEx_Analysis_v8_eQTL_expression_matrices'
parameter: genotype_file = '/project2/compbio/GTEx_dbGaP/GTEx_Analysis_2017-06-05_v8/genotypes/WGS/variant_calls/GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_838Indiv_Analysis_Freeze.SHAPEIT2_phased.vcf.gz'
input: expression_directory, genotype_file
output: 'ensembl_gene_id.txt', 'samplenames_X.txt', 'gene-line_number/*.csv', 'gene_TSS.csv'
bash: expand = '${ }', workdir = cwd
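    # `ensembl_gene_id.txt`
    # zcat streams the decompressed file to the pipe (zless is a pager)
    for file in ${expression_directory}/*.gz; do zcat "$file" | grep -oh "ENSG\w*\.\w*" >> ensembl_gene_id_original.txt ; done
    sort -u ensembl_gene_id_original.txt > ensembl_gene_id.txt
    rm ensembl_gene_id_original.txt
    
    # mkdir gene-line_number (in cwd); -p avoids an error if it already exists
    mkdir -p gene-line_number
    
    # `samplenames_X.txt`
    # line 3386 is assumed to be the `#CHROM` header line of this particular VCF file
    zcat ${genotype_file} | sed '3386q;d' > samplenames_X.txt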

python3: expand = '${ }', workdir = cwd
    # `gene:line_number`
    import csv
    import glob
    import pandas as pd
    
    with open("ensembl_gene_id.txt") as f:
        ensembl_gene_id = [line.rstrip("\n") for line in f]
    
    filenames_y = sorted(glob.glob("${expression_directory}/*.gz"))
    fields = ['gene_id']
    # gene_linenumber is a dict with one key per gene (~20000 keys); each value is a dict of
    # tissue index -> line number (up to 49 items), so it can be saved as one file per gene.
    gene_linenumber = {gene: {} for gene in ensembl_gene_id}

    for i, filename in enumerate(filenames_y): # build `gene:line_number` for each of the 49 tissues
        gene_id = pd.read_csv(filename, sep='\t', skipinitialspace=True, usecols=fields)
        # convert the column to a list once per file rather than once per row
        for j, gene in enumerate(gene_id['gene_id'].tolist()): # go over lines
            gene_linenumber[gene][i] = j + 2 # +1 for the header line, +1 for 1-based line numbers
    
    for gene, line_numbers in gene_linenumber.items(): # write ~20000 files
        with open('gene-line_number/%s.csv' % gene, 'w', newline='') as f:
            writer = csv.writer(f)
            for key, value in line_numbers.items():
                writer.writerow([key, value])
        
R: expand = True, workdir = cwd
    # `gene_TSS.csv`
    ensembl_gene_id <- data.table::fread(file = "ensembl_gene_id.txt", sep = "\n", quote = "", header = FALSE)
    mart <- biomaRt::useDataset("hsapiens_gene_ensembl", biomaRt::useMart("ensembl"))
    gene_TSS <- biomaRt::getBM(attributes = c("chromosome_name", "transcript_start", "ensembl_gene_id", "ensembl_gene_id_version"), filters = "ensembl_gene_id_version", values = ensembl_gene_id, mart = mart)
    rm(ensembl_gene_id)
    rm(mart)
    write.table(x = gene_TSS, file = "gene_TSS.csv", sep = '\t', quote = FALSE, col.names = TRUE, row.names = FALSE)
    rm(gene_TSS)
    # gene_TSS_retrieval <- read.table(file = "gene_TSS.csv", sep = '\t', quote = "", stringsAsFactors = FALSE, header = TRUE)
    

In [2]:
# parallelize genes
[parallelize]
parameter: cwd = path()
parameter: self_defined = bool
import os
if self_defined:
    with open(os.path.join(cwd, 'ensembl_gene_id_run.txt')) as f:
        ensembl_gene_id = f.readlines()
    
else:
    with open(os.path.join(cwd, 'ensembl_gene_id.txt')) as f:
        ensembl_gene_id = f.readlines()
    
# strip whitespace and skip blank lines so that no empty gene ID is passed to `extract`
ensembl_gene_id_list = [x.strip() for x in ensembl_gene_id if x.strip()]
input: for_each={'x': ensembl_gene_id_list}
sos_run('extract', ensembl_gene_id=ensembl_gene_id_list[_index])

In [None]:
# extract Z, y, X from covariates, expression and genotype data and generate y_res
[extract]
depends: R_library("data.table")
parameter: cwd = path()
parameter: ensembl_gene_id = str
parameter: expression_directory = '/project2/compbio/GTEx_dbGaP/GTEx_Analysis_2017-06-05_v8/eqtl/GTEx_Analysis_v8_eQTL_expression_matrices'
parameter: genotype_file = '/project2/compbio/GTEx_dbGaP/GTEx_Analysis_2017-06-05_v8/genotypes/WGS/variant_calls/GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_838Indiv_Analysis_Freeze.SHAPEIT2_phased.vcf.gz'
parameter: covariates_directory = '/project2/compbio/GTEx_dbGaP/GTEx_Analysis_2017-06-05_v8/eqtl/GTEx_Analysis_v8_eQTL_covariates'
parameter: ntissues = 49
input: expression_directory, genotype_file, covariates_directory
# output: "Multi_Tissues.%s.RDS" % ensembl_gene_id
bash: expand = '${ }', workdir = cwd
    # X
    # match the versioned gene ID exactly (column 4 of gene_TSS.csv) and clamp the window start at 1
    awk -F '\t' -v id="${ensembl_gene_id}" '$4 == id' gene_TSS.csv | head -n 1 | awk '{s = $2 - 1000000; if (s < 1) s = 1; e = $2 + 1000000; print "chr" $1 ":" s "-" e}' > ${ensembl_gene_id}_TSSregion.txt
    tabix ${genotype_file} "$(< ${ensembl_gene_id}_TSSregion.txt)" | sed -e 's/0|0/0/g' -e 's/0|1/1/g' -e 's/1|0/1/g' -e 's/1|1/2/g' > ${ensembl_gene_id}_genotype.txt # multi-allelic sites are not handled and remain a potential problem.
    
R: expand = True, workdir = cwd
    # Z
    filenames <- list.files(path = "{covariates_directory}", pattern = "[.]txt$", full.names = TRUE) # `pattern` is a regular expression, not a glob
    Z <- lapply(filenames, function(x) t(as.matrix(read.table(file = x, header = TRUE, sep = '\t', quote = "", row.names = 1))))
    for (i in 1:{ntissues}) {{ # read.table (check.names = TRUE by default) turns the sample names "GTEX-*" into "GTEX.*", so we convert them back.
        rownames(Z[[i]]) <- gsub(pattern = "GTEX.", replacement = "GTEX-", x = rownames(Z[[i]]), fixed = TRUE)
    }}
    
    # y
    
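    filenames_y <- list.files(path = "{expression_directory}", pattern = "[.]gz$", full.names = TRUE) # `pattern` is a regular expression, not a glob
    line_numbers <- read.table(file = "gene-line_number/{ensembl_gene_id}.csv", sep = ',', header = FALSE, row.names = 1, quote = "")
    # NOTE: line_numbers is indexed by position below, which assumes the gene has an entry for every tissue, in tissue order.
    extract_y <- function(i) {{
        yi <- data.table::fread(file = filenames_y[i], skip = line_numbers[i,] - 1, nrows = 1)
        samplenames_yi <- data.table::fread(file = filenames_y[i], skip = 0, nrows = 1)
        colnames(yi) <- colnames(samplenames_yi)
        yi <- t(as.matrix(yi))
        yi <- yi[-1:-4, , drop = FALSE] # drop the #chr, start, end and gene_id fields
        storage.mode(yi) <- "numeric" # as.matrix() on mixed-type columns gives a character matrix; .lm.fit needs numeric input
        return(yi)
    }}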
    
    y <- lapply(1:{ntissues}, extract_y)
    
    ## sample names matching between y and Z
    samplenamesmatching <- function(x, reference) {{
        lapply(1:{ntissues}, function(i) x[[i]][match(rownames(reference[[i]]), rownames(x[[i]])), , drop = FALSE])
    }}
    
    y_matchZ <- samplenamesmatching(x = y, reference = Z)
    y <- y_matchZ
    
    # X
    X <- read.csv(file = "{ensembl_gene_id}_genotype.txt", sep = '\t', header = FALSE, row.names = 3, stringsAsFactors = FALSE)
    X <- X[,-(1:8)]
    samplenames_X <- scan(file = "samplenames_X.txt", what = character(), quote = "")
    samplenames_X <- samplenames_X[-(1:9)]
    colnames(X) <- samplenames_X
    X <- as.matrix(x = X)
    X <- t(X)
    
    # y_res
    y_res <- lapply(1:{ntissues}, function(i) .lm.fit(x = Z[[i]], y = y[[i]])$residuals)

    # save
    saveRDS(object = list(X = X, y = y, Z = Z, y_res = y_res), file = "Multi_Tissues.{ensembl_gene_id}.RDS")

bash: expand = '${ }', workdir = cwd
    # remove intermediate files
    rm ${ensembl_gene_id}_TSSregion.txt
    rm ${ensembl_gene_id}_genotype.txt
    