# Fine mapping data preparation

## 1000 Genome data
SNP positions read from the 1000 Genomes data (via `variant.start` in the workflow below) **start from $0$** instead of $1$. To be consistent with the summary statistics, we need to **add $1$** to every 1000 Genomes SNP position.
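
As a minimal sketch of this offset, assuming positions arrive as 0-based `start` values (the helper name `to_one_based_id` is hypothetical and not part of the workflow):

```python
def to_one_based_id(chrom: str, start0: int) -> str:
    """Build a `chr:bp` ID from a 0-based start position (hypothetical helper)."""
    return f"{chrom}:{start0 + 1}"


# A SNP stored at 0-based start 1847855 lines up with bp 1847856 in the summary statistics.
snp_id = to_one_based_id("1", 1847855)
print(snp_id)
```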

## Test results 11/07/18

|dataset|chunk_653|chunk_654|chunk_655|
|:--:|:--:|:--:|:--:|
|S1|8286510|8286510|8286510|
|G1|23749|22424|35634|
|S2|169|141|112|
|G2|170|141|112|
|no strand flip or switch ref/alt|11|10|10|
|flipped by strand|73|75|50|
|flipped by reference and alternative|13|13|7|
|S3|97|98|67|

The number of SNPs shared by $S_1$ and $G_1$ is small, far fewer than in either the original summary statistics or the genotype matrix.

## Test results 11/08/18

|dataset|chunk_653|chunk_655|
|:--:|:--:|:--:|
|S1|3080|1944|
|G1|23749|35634|
|S2|3048|1933|
|G2|3050|1933|
|no strand flip or switch ref/alt|1397|939|
|flipped by strand|0|0|
|flipped by reference and alternative|1631|987|
|S3|3028|1926|

## Input
Data are prepared based on human LD chunks/blocks.
- Summary statistics in a single human LD chunk, including 7 columns
    - chromosome
    - position
    - reference allele
    - alternative allele
    - ${\beta}$ (effect size)
    - standard error (se)
    - z score


- LD chunk file. There are 1703 chunks in the human genome (excluding the X chromosome). Each row represents one chunk, for example:

      1st column is chr; 2nd is chunk start position; 3rd is chunk end position; 4th is chunk number.

        chr22	44995308	46470495	1699
        chr22	46470495	47596318	1700
        chr22	47596318	48903703	1701
        chr22	48903703	49824534	1702
        chr22	49824534	51243298	1703


- Prior information from enrichment analysis. For example:

      1st column is in the format "chr:bp:ref:alt"; 2nd is the prior

        1:1847856:T:G  1.4413e-04
        1:1847979:T:C  7.3716e-05
        1:1848109:C:G  1.4413e-04
        1:1848160:A:G  1.4413e-04
        1:1848734:A:G  7.3716e-05


- 1000 Genomes genotype files. Available online [here](http://hgdownload.cse.ucsc.edu/gbdb/hg19/1000Genomes/phase3/). They are split by chromosome, so it is not necessary to download all of them; download only the chromosome on which the LD chunk is located.


- Population definition: `EUR` (European). Available online [here](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel). There are $503$ European samples.
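
The LD chunk file described above can be queried roughly as follows. This is a minimal sketch using the example rows from this section; `load_chunks` and `find_chunk` are hypothetical helper names, and the half-open interval `[start, end)` is an assumption that mirrors the `bp >= start` and `bp < end` filter used later in the workflow.

```python
import csv
import io

# Example rows copied from the LD chunk file shown above (tab-separated).
CHUNK_TEXT = (
    "chr22\t44995308\t46470495\t1699\n"
    "chr22\t46470495\t47596318\t1700\n"
    "chr22\t47596318\t48903703\t1701\n"
    "chr22\t48903703\t49824534\t1702\n"
    "chr22\t49824534\t51243298\t1703\n"
)


def load_chunks(handle):
    """Parse rows of chr, start, end, chunk number into tuples."""
    chunks = []
    for chrom, start, end, number in csv.reader(handle, delimiter="\t"):
        chunks.append((chrom, int(start), int(end), int(number)))
    return chunks


def find_chunk(chunks, chrom, bp):
    """Return the number of the chunk whose [start, end) interval contains bp, or None."""
    for c_chrom, start, end, number in chunks:
        if c_chrom == chrom and start <= bp < end:
            return number
    return None


chunks = load_chunks(io.StringIO(CHUNK_TEXT))
# A position on a shared boundary belongs to the chunk that starts there.
print(find_chunk(chunks, "chr22", 46470495))
```

With half-open intervals, every position falls into at most one chunk even though adjacent chunks share a boundary coordinate.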

## Steps

- Denote the summary statistics matrix in the specific LD chunk as $S_1$, and the annotation (prior) matrix as $A_1$. The number of rows in each of these two matrices is the number of SNPs in the GWAS summary statistics.

- Extract the EUR (European) genotypes from the 1000 Genomes genotype file. Rows are SNPs and columns are samples. Convert genotypes to $0,1,2$ and `nan` as shown below. Remove non-variant sites (rows where every sample has the same genotype). Denote this genotype matrix as $G_1$.

        0|0    0
        1|0    1
        1|1    2
        2|0    nan
        2|1    nan
        2|2    nan


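    **Update**: genotypes are no longer marked as `nan`; zero is used instead, because `nan` values break the correlation matrix (it will not be positive definite).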

- Find overlapped SNPs of $S_1$ and $G_1$, then extract new genotype matrix from $G_1$, denote as $G_2$, and new summary statistics matrix from $S_1$, denote as $S_2$.

- Compare the reference and alternative alleles of $G_2$ and $S_2$, then generate a new summary statistics matrix $S_3$ and a new genotype matrix $G_3$. Each SNP falls into one of the following cases:

    - Completely identical: keep as is.
    - Not identical, but identical after switching ref and alt in $S_2$: reverse the sign of the z score and beta. Does not apply when `ref/alt` is `A/T`, `T/A`, `G/C` or `C/G`.
    - Not identical, but identical after flipping the strand of ref and alt in $S_2$: keep the sign of the z score and beta. Does not apply when `ref/alt` is `A/T`, `T/A`, `G/C` or `C/G`.
    - Not identical, but identical after flipping the strand of ref and alt and then switching them: reverse the sign of the z score and beta. Only applies to `A/G`, `G/A`, `A/C`, `C/A`, `T/C`, `C/T`, `T/G` and `G/T`.
    - Not identical, and `ref/alt` is `A/T`, `T/A`, `G/C` or `C/G`: handle this case last, because a strand flip and a ref/alt switch give the same alleles here. If a strand flip has already been applied in this LD block, flip the strand and keep the sign of the z score and beta; otherwise switch ref and alt in $S_2$ and reverse the sign of the z score and beta.
    - Still not identical after the previous 5 cases: drop the SNP.


- Calculate row correlation matrix of $G_3$, denote as $R$. The number of rows and columns of $R$ is the number of SNPs in $G_3$.

- Obtain overlapped SNPs for $S_3$ and $A_1$ and use overlapped SNPs to generate new annotation/prior $A_2$.
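
The allele comparison rules in the steps above can be sketched as a single function. This is an illustrative sketch, not the workflow's implementation: `harmonize` and its `block_has_strand_flip` flag are hypothetical names, and the flag is passed in by the caller, whereas the workflow below tracks it while scanning the chunk.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
AMBIGUOUS = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}


def harmonize(s_ref, s_alt, g_ref, g_alt, z, beta, block_has_strand_flip=False):
    """Align a summary statistics SNP to the genotype alleles.

    Returns (ref, alt, z, beta, action), or None when the SNP should be dropped.
    """
    if (s_ref, s_alt) == (g_ref, g_alt):
        return g_ref, g_alt, z, beta, "identical"
    if (s_ref, s_alt) in AMBIGUOUS:
        # For A/T and G/C pairs a strand flip and a ref/alt switch look the same.
        if (s_alt, s_ref) != (g_ref, g_alt):
            return None
        if block_has_strand_flip:
            return g_ref, g_alt, z, beta, "strand flip"
        return g_ref, g_alt, -z, -beta, "switch"
    if (s_alt, s_ref) == (g_ref, g_alt):
        return g_ref, g_alt, -z, -beta, "switch"
    # Alleles outside A/C/G/T (e.g. indels) have no complement and are dropped here.
    c_ref, c_alt = COMPLEMENT.get(s_ref), COMPLEMENT.get(s_alt)
    if c_ref is None or c_alt is None:
        return None
    if (c_ref, c_alt) == (g_ref, g_alt):
        return g_ref, g_alt, z, beta, "strand flip"
    if (c_alt, c_ref) == (g_ref, g_alt):
        return g_ref, g_alt, -z, -beta, "strand flip + switch"
    return None


print(harmonize("A", "G", "C", "T", 1.5, 0.1))
```

Passing the flag explicitly keeps the function free of hidden state, at the cost of requiring a first pass over the non-ambiguous SNPs to decide whether the block contains strand flips.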

## Outputs
- $S_3$: the number of rows of $S_3$ is the same as in $A_2$.

- $R$: correlation matrix, #SNPs $\times$ #SNPs of $G_3$.

- $A_2$: adjusted annotation/prior information; the number of rows is the same as in $S_3$.
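
To make the shape of $R$ concrete, here is a minimal sketch of a row correlation on a toy genotype matrix. The matrix values are made up for illustration; only the use of `np.corrcoef` on a SNPs-by-samples array follows the workflow.

```python
import numpy as np

# Hypothetical toy genotype matrix: 3 SNPs (rows) x 5 samples (columns), coded 0/1/2.
G3 = np.array([
    [0, 1, 2, 1, 0],
    [0, 1, 2, 1, 0],
    [2, 1, 0, 1, 2],
])

# np.corrcoef treats each row as one variable, so R is #SNPs x #SNPs.
R = np.corrcoef(G3)
print(R.shape)
print(np.round(R, 5))
```

A row with no variation has zero variance and would give `nan` entries in $R$, which is one reason non-variant sites are removed before this step.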

In [1]:
%cd /data/fine_mapping_data/

/data/fine_mapping_data

In [2]:
[global]
parameter: WDir=path("/data/fine_mapping_data")
# chrom = f'{WDir:b}'.split("_")[0][3:]
# paths to input files
# parameter: kgenome=path(f"{WDir}/1000Genome/ALL.chr{chrom}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz")
parameter: kgenome_cwd=path(f"{WDir}/1000Genome")
parameter: IDrace="EUR"
parameter: IDfile=path(f"{WDir}/1000Genome/integrated_call_samples_v3.20130502.ALL.panel.txt")
parameter: LDchunk = path(f"{WDir}/chunk.dat")
parameter: summaryfile = path(f"{WDir}/GWAS/Summary_statistics.gz")
parameter: anno = "atac-seq_asca"
annofile = path(f"{WDir}/annotation/Annotation_{anno}.gz")
# Use --no-strand-flip to set it to false
parameter: strand_flip = True

### Obtain EUR genotype data
Using the EUR sample IDs in `integrated_call_samples_v3.20130502.ALL.panel.txt`, export the genotypes of those samples from the per-chromosome VCF, for example `ALL.chr6.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz`.

In [3]:
# Get information about a specified race
[prepare_reference_vcf]
depends: executable("bcftools"), executable("tabix")
input: f"{kgenome_cwd}/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz", # group_by = 1, concurrent = True
output: f"{_input:bnn}.{IDrace}.vcf.gz"
bash: expand = True
    grep -w "{IDrace}" {IDfile} | cut -f 1 > {_output:nn}_extracted.txt
    bcftools view -S {_output:nn}_extracted.txt {_input} -Oz > {_output}
    tabix -p vcf {_output}
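
The `grep -w ... | cut -f 1` line above selects sample IDs from the panel file. A rough Python equivalent is sketched below, assuming the panel file is tab-separated with a header containing `sample` and `super_pop` columns; the rows are made-up placeholders, and `samples_in_super_pop` is a hypothetical helper.

```python
import csv
import io

# Made-up placeholder rows in the assumed panel layout (not real sample IDs).
PANEL_TEXT = (
    "sample\tpop\tsuper_pop\tgender\n"
    "SAMPLE_A\tGBR\tEUR\tmale\n"
    "SAMPLE_B\tCHS\tEAS\tmale\n"
    "SAMPLE_C\tTSI\tEUR\tfemale\n"
)


def samples_in_super_pop(handle, super_pop):
    """Return the sample IDs whose super_pop column equals the requested code."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["sample"] for row in reader if row["super_pop"] == super_pop]


eur_ids = samples_in_super_pop(io.StringIO(PANEL_TEXT), "EUR")
print(eur_ids)
```

Unlike `grep -w`, which matches the word anywhere on the line, this sketch only compares the `super_pop` column.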

In [4]:
# !sos run /home/min/GIT/atac-gwas/fineMapping/20181101_FineMapping_Data_preparation_workflow.ipynb prepare_reference_vcf -s build

Rename chrX
```
cd /data/fine_mapping_data/1000Genome/
mv ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.EUR_extracted.txt ALL.chrX.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.EUR_extracted.txt
mv ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.EUR.vcf.gz ALL.chrX.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.EUR.vcf.gz
mv ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.EUR.vcf.gz.tbi ALL.chrX.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.EUR.vcf.gz.tbi
```

### Obtain genotype matrix `𝐺1`

In [5]:
[default_1]
import pandas as pd
## FIXME: chrX snps, LD chunk does not include chromosome X
chunks = [x.tolist() for idx, x in pd.read_table(f'{LDchunk}',header=None,sep=r'\s+').iterrows() if not x[0].startswith('#')]
print (chunks)
input: for_each = 'chunks', concurrent = True
output: f"{WDir:d}/fine_mapping/{_chunks[0]}_{_chunks[-1]}/{summaryfile:bn}/chunk_{_chunks[-1]}.pkl"
kgenome = path(f'{kgenome_cwd}/ALL.chr{_chunks[0][3:]}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.{IDrace}.vcf.gz')
python3: expand="${ }"
    from cyvcf2 import VCF
    import pandas as pd
    import numpy as np
    chromosome="${_chunks[0]}".replace("chr","")
    tit_tmp = pd.read_table("${kgenome:nn}_extracted.txt", header=None)
    tit = tit_tmp.T
    se = pd.Series(["index","ID","ref","alt"])
    # Series.append was removed in pandas 2.0; pd.concat gives the same result
    ss = pd.concat([se, tit.iloc[0,:].map(str)])
    # chunk region
    queryid=str(chromosome)+":"+str("${_chunks[1]}")+"-"+str("${_chunks[2]}")
    # scan VCF chunk
    vcf = VCF(${kgenome:r}, gts012=False)
    res = []
    ## Attention: variant.start is 0-based, not 1-based, so add 1 to match the summary statistics positions
    for variant in vcf(queryid):
        for i in range(len(variant.ALT)):
            line = [f'{variant.CHROM}:{variant.start+1}:{variant.REF}:{variant.ALT[i]}', 
                    f'{variant.CHROM}:{variant.start+1}', variant.REF, variant.ALT[i]] + \
                    [x[:-1].count(i+1) for x in variant.genotypes] # do not use else np.nan because it will mess up the cor matrix (will not be P.D.)
            if len(set(line[4:])) == 1:
                # remove non-variant site in 1000 data
                continue
            res.append(line)
    gt = pd.DataFrame(res, columns = ss)
    gt.set_index('index',inplace=True)
    gt.to_pickle(${_output:r})
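
The expression `x[:-1].count(i+1)` in the cell above counts, for each sample, the copies of the `i`-th ALT allele. A minimal sketch of that counting follows, assuming each genotype is a list `[allele_a, allele_b, phased]` as provided by cyvcf2's `variant.genotypes`; `alt_allele_count` is a hypothetical helper and the genotypes are made up.

```python
def alt_allele_count(genotype, alt_index):
    """Count copies of one ALT allele in a [allele_a, allele_b, phased] genotype.

    alt_index is 0 for the first ALT allele, which is coded as allele 1.
    The trailing phased flag is dropped first because True == 1 in Python
    and would otherwise be counted as a copy of the first ALT allele.
    """
    return genotype[:-1].count(alt_index + 1)


# Made-up genotypes at a site with two ALT alleles: 0|0, 1|0, 1|1, 2|1, 2|2.
genotypes = [[0, 0, True], [1, 0, True], [1, 1, True], [2, 1, True], [2, 2, True]]
first_alt = [alt_allele_count(g, 0) for g in genotypes]
second_alt = [alt_allele_count(g, 1) for g in genotypes]
print(first_alt)
print(second_alt)
```

Each ALT allele of a multi-allelic site becomes its own row, and copies of the other ALT allele contribute zero to that row.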

### Obtain genotype matrix 𝐺2, summary statistics 𝑆2, compare ref/alt in 𝑆2 and 𝐺2 => 𝑆3 and 𝐺3, calculate 𝑅=row_corr(𝐺3)

In [6]:
[default_2]
input: group_by = 1, concurrent = True
output: f"{_input:n}.matched_ss.txt", f"{_input:n}.LD.txt"
python3: expand="${ }", stdout=f'{_input:n}.flip_summary'
    import pandas as pd, numpy as np
    import datetime, sys
    print(datetime.datetime.now(),"\n")
    n_chunk = int("${_input:bn}".split("_")[-1])
    chunks = pd.read_table(f'${LDchunk}',header=None,sep=r'\s+')
    chunk_info = chunks[chunks[3] == n_chunk].iloc[0, 0:3].tolist()
    S1 = pd.read_table("${summaryfile}",compression='gzip',header = 0)
    S1 = S1[(S1["chr"] == chunk_info[0]) & (S1["bp"] >= int(chunk_info[1])) & (S1["bp"] < int(chunk_info[2]))]
    print("SAMPLE: %s\tLENGTH:%d\n" % ("S1",len(S1)))
    if S1.empty:
        open("${_output[0]}", 'w').close()
        open("${_output[1]}", 'w').close()
        sys.exit(0)
    S1["chr"] = S1["chr"].str.split("chr",expand=True)[1]
    S1["ID"] = S1["chr"].map(str)+":"+S1["bp"].map(str)
    G1 = pd.read_pickle("${_input}")
    print("SAMPLE: %s\tLENGTH:%d\n" % ("G1",len(G1)))
    # S2
    S2 = S1[S1["ID"].isin(G1["ID"])]
    #S2["ID"] = S2["ID"].fillna
    print("SAMPLE: %s\tLENGTH:%d\n" % ("S2",len(S2)))
    # 𝐺2
    G2 = G1[G1["ID"].isin(S1["ID"])]
    print("SAMPLE: %s\tLENGTH:%d\n" % ("G2",len(G2)))
    # 𝑆2 and 𝐺2 : reference and alternative
    # at ta cg gc
    def gt(s1,s2,s3,s4):
        if (s1+s2 == "AT" and s3+s4 == "TA") or (s1+s2 == "GC" and s3+s4 == "CG" ) or (s1+s2 == "TA" and s3+s4 == "AT") or (s1+s2 == "CG" and s3+s4 == "GC"):
            return ${0 if strand_flip else 1}
        else:
            return 1
    # flip
    def atcg(inp):
        if inp =="A":
            return "T"
        elif inp == "T":
            return "A"
        elif inp == "G":
            return "C"
        elif inp == "C":
            return "G"
    # S3 
    LDFLIP="N"
    FLIPD=[]
    # equal
    EQUAL = 0
    # flipped by strand 
    FLIPNUM = 0
    # flipped by reference and alternative
    REFALTNUM = 0 
    
    for i in range(len(S2)):
        row = S2.iloc[i, :]
        line = list(row.iloc[:2])
        s_ref, s_alt = row.iloc[2], row.iloc[3]
        # look up the first G2 record at the same chr:bp once, instead of re-filtering G2 in every test
        g_row = G2[G2["ID"] == row.iloc[-1]].iloc[0]
        g_ref, g_alt = g_row["ref"], g_row["alt"]
        if {s_ref, s_alt} != {g_ref, g_alt}:
            print(row.iloc[1], s_ref, s_alt, row.iloc[-1], g_ref, g_alt)
        # gt() returns 0 only for reversed at/ta/gc/cg pairs when strand_flip is enabled
        ambiguous = gt(s_ref, s_alt, g_ref, g_alt) == 0
        # not at/ta/gc/cg
        if not ambiguous:
            if g_ref == s_ref and g_alt == s_alt:
                # identical: keep the row unchanged
                line = row.tolist()
                EQUAL += 1
                FLIPD.append(line)
            elif g_alt == s_ref and g_ref == s_alt:
                # ref/alt switched: reverse the sign of columns 4 and 5
                REFALTNUM += 1
                line.extend([s_alt, s_ref, 0-row.iloc[4], 0-row.iloc[5], row.iloc[6]])
                FLIPD.append(line)
            elif atcg(s_ref) == g_ref and atcg(s_alt) == g_alt:
                # strand flipped: keep the sign
                LDFLIP = "Y"
                FLIPNUM += 1
                line.extend([atcg(s_ref), atcg(s_alt), row.iloc[4], row.iloc[5], row.iloc[6]])
                FLIPD.append(line)
            elif atcg(s_ref) == g_alt and atcg(s_alt) == g_ref:
                # strand flipped and ref/alt switched: reverse the sign
                LDFLIP = "Y"
                FLIPNUM += 1
                line.extend([atcg(s_alt), atcg(s_ref), 0-row.iloc[4], 0-row.iloc[5], row.iloc[6]])
                FLIPD.append(line)
        # at/ta/cg/gc AND flip
        elif LDFLIP == "Y":
            FLIPNUM += 1
            line.extend([atcg(s_ref), atcg(s_alt), row.iloc[4], row.iloc[5], row.iloc[6]])
            FLIPD.append(line)
        # at/ta/cg/gc AND not flip
        else:
            REFALTNUM += 1
            line.extend([atcg(s_alt), atcg(s_ref), 0-row.iloc[4], 0-row.iloc[5], row.iloc[6]])
            FLIPD.append(line)
            
    print("equal:%s\tflipped by strand:%d\tflipped by reference and alternative:%d\n" % (EQUAL,FLIPNUM,REFALTNUM))
    S3 = pd.DataFrame(FLIPD)
    if S3.shape[1] == 7:
        S3.columns=["chr","bp","ref","alt","z","beta","se"]
    else:
        S3.columns=["chr","bp","ref","alt","z","beta","se","ID"]
    S3["ID"] = S3["chr"].map(str)+":"+S3["bp"].map(str)
    # 𝐺3
    G3 = G2[G2["ID"].isin(S3["ID"])]
    G3 = G3.drop_duplicates(subset = ["ID"])
    # S3 OUT
    S3 = S3.drop(columns=['ID'])
    S3.sort_values(by = ["chr", "bp"]).to_csv("${_output[0]}",sep="\t",index=False)
    print("SAMPLE: %s\tLENGTH:%d\n" % ("S3",len(S3)))
    # 𝑅=row_corr(𝐺3) OUT 
    G3 = G3.drop(columns=['ID',"ref","alt"]).values
    np.savetxt("${_output[1]}", np.corrcoef(G3), fmt = '%.5f')
#_input.zap()

bash: expand="${ }"
    cat ${_output[0]} | awk '{print $1":"$2"\t"$6"\t"$7}' | tail -n +2 > ${_output[0]:nn}.dat

```
cd /data/05-Nov-2018/Summary_statistics
cat chunk_5.matched_ss.txt | awk '{print $1":"$2"\t"$6"\t"$7}' | tail -n +2 > chunk_5.dat
cat chunk_5.matched_ss.txt | awk '{print $1":"$2"\t"$5}' | tail -n +2 > chunk_5.dat
```

### Obtain annotation `𝐴2`

In [7]:
[default_3]
input: group_by = 2, concurrent = True
print(_input)
output: f"{_input[0]:nn}.{annofile:bn}.txt"
stop_if(os.stat(_input[0]).st_size == 0)
python3: expand="${ }"
    import pandas as pd
    A1 = pd.read_table("${annofile}", compression='gzip', sep=r"\s+", header=None)
    A1.columns=["ID","VALUE"]
    A1["ID_2"] = A1["ID"].apply(lambda x: ':'.join(map(str, x.split(":")[0:2])))
    S3 = pd.read_table("${_input[0]}",sep="\t")
    S3["ID"]=S3.iloc[:,0].map(str)+":"+S3.iloc[:,1].map(str)#+":"+S3.iloc[:,2]+":"+S3.iloc[:,3]
    A2 = A1[A1["ID_2"].isin(S3["ID"])][["ID_2","VALUE"]]
    A2 = A2.drop_duplicates(subset = ["ID_2"])
    A2.to_csv("${_output}",sep="\t",index=False,header=None)
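
The matching in the cell above reduces each `chr:bp:ref:alt` prior ID to `chr:bp` and keeps the first prior per position that also appears in $S_3$. A minimal sketch of that logic without pandas, using the prior rows shown in the Input section; `position_id` and `match_priors` are hypothetical helpers, and the $S_3$ IDs are made up for illustration.

```python
# Prior rows copied from the Input section above.
PRIOR_LINES = [
    "1:1847856:T:G  1.4413e-04",
    "1:1847979:T:C  7.3716e-05",
    "1:1848109:C:G  1.4413e-04",
    "1:1848160:A:G  1.4413e-04",
    "1:1848734:A:G  7.3716e-05",
]


def position_id(variant_id):
    """Reduce 'chr:bp:ref:alt' to 'chr:bp'."""
    return ":".join(variant_id.split(":")[:2])


def match_priors(prior_lines, s3_ids):
    """Keep the first prior for every chr:bp that also appears in s3_ids."""
    wanted = set(s3_ids)
    matched = {}
    for line in prior_lines:
        variant_id, value = line.split()
        key = position_id(variant_id)
        if key in wanted and key not in matched:
            matched[key] = float(value)
    return matched


# Hypothetical S3 positions; the last one has no prior and is simply absent from A2.
A2 = match_priors(PRIOR_LINES, ["1:1847856", "1:1848109", "1:9999999"])
print(A2)
```

Because the match is on position only, two variants at the same `chr:bp` with different alleles share one prior, which is why duplicates are dropped.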

In [6]:
!sos run /home/min/GIT/atac-gwas/fineMapping/20181101_FineMapping_Data_preparation_workflow.ipynb --no-strand_flip --anno atac-seq_asca -s build

INFO: Running default_1: 
[['chr5', 140023664, 140222664, 108]]
INFO: Step default_1 (index=0) is ignored with signature constructed
INFO: output:   /data/fine_mapping/chr5_108/Summary_statistics/chunk_108.pkl
INFO: Running default_2: 
INFO: Step default_2 (index=0) is ignored with signature constructed
INFO: output:   /data/fine_mapping/chr5_108/Summary_statistics/chunk_108.matched_ss.txt /data/fine_mapping/chr5_108/Summary_statistics/chunk_108.LD.txt
INFO: Running default_3: 
INFO: output:   /data/fine_mapping/chr5_108/Summary_statistics/chunk_108.Annotation_atac-seq_asca.txt
INFO: Workflow default (ID=6aea0b9f1f7f3627) is executed successfully with 1 completed step and 2 ignored steps.


In [2]:
#!sos run /home/min/GIT/atac-gwas/fineMapping/20181101_FineMapping_Data_preparation_workflow.ipynb --no-strand_flip --anno atac-seq -s build

In [3]:
#!sos run /home/min/GIT/atac-gwas/fineMapping/20181101_FineMapping_Data_preparation_workflow.ipynb --no-strand_flip --anno general -s build

In [7]:
#!sos run /home/min/GIT/atac-gwas/fineMapping/20181101_FineMapping_Data_preparation_workflow.ipynb --no-strand_flip --anno all -s build

### Results of LD chunk 655

### Flip summary

In [24]:
%preview /data/chr12_4/Summary_statistics/chunk_4.flip_summary -n --limit 20

2018-11-19 22:48:44.925929 

SAMPLE：S1	LENGTH:596

SAMPLE：G1	LENGTH:1792

SAMPLE：S2	LENGTH:588

SAMPLE：G2	LENGTH:588

equal:283	flipped by strand:0	flipped by reference and alternative:303

SAMPLE：S3	LENGTH:586

2018-11-19 23:39:44.591608 

SAMPLE：S1	LENGTH:4933

SAMPLE：G1	LENGTH:13745


The number of SNPs in each dataset:

- S1: 8286510
- G1: 35634
- S2: 112
- G2: 112
- S3: 67

Of the SNPs shared by S2 and G2, 10 have identical alleles, 50 were matched by strand flip, and 7 by switching ref/alt.

### Summary statistics matrix S3

In [11]:
%preview Summary_statistics/chunk_1696.matched_ss.txt -n

chr	bp	ref	alt	z	beta	se
22	40546153	G	A	2.181181	0.023903	0.011
22	40546525	A	G	-2.058627	-0.076201	0.037000000000000005
22	40546633	C	T	2.197286	0.023699	0.0108
22	40547688	A	G	0.367294	0.006797	0.0184

### Annotation/prior matrix A2

In [12]:
%preview Summary_statistics/chunk_1696.annotation.txt -n

22:40546153	7.3409e-05
22:40546525	7.3409e-05
22:40546633	7.3409e-05
22:40547688	7.3409e-05
22:40548084	7.3409e-05

### 𝑅: LD matrix

In [None]:
# %preview Summary_statistics/chunk_1696.LD.txt -n -l 1