# Annotation of exome variants using Annovar

## Aim

Prepare annotated variant data for downstream rare-variant association analyses with `LMM.ipynb`.

## Description of the pipeline

The pipeline supports three scenarios, depending on the input data you start with:

### Scenario 1: You have multiple bim files (e.g. one per chromosome) and want to merge them into one file for later annotation with ANNOVAR

Run `bim_merge` to concatenate all the bim files, then run `annovar` to annotate all the variants at once.

### Scenario 2: You want to work with either common or rare variants

Run `get_snps` with `--maf_filter` (common variants) or `--max_maf_filter` (rare variants), which are passed to plink2 as `--maf` and `--max-maf`, and then run `annovar`.

### Scenario 3: You already have a specific list of variants to annotate, stored in a bim file

Run `annovar`

## Command interface

In [1]:
!sos run annovar.ipynb -h

ERROR: Notebook JSON is invalid: %s
usage: sos run annovar.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  bim_merge
  get_snps
  annovar

Global Workflow Options:
  --cwd VAL (as path, required)
                        the output directory for generated files
  --numThreads 2 (as int)
                        Specific number of threads to use
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --build hg38
                        Human genome build
  --bim-name VAL (as path, required)
                        Name for the merged bimfiles
  --name-prefix VAL (as str, required)
                        Prefix for the n

## Illustration with minimal working example

**Scenario 3:** On Yale's cluster. Modify the `humandb` and `ukbb` paths to match the location of the databases that ANNOVAR needs.

```
sos run ~/project/bioworkflows/variant-annotation/annovar.ipynb annovar \
    --cwd output \
    --bim_name ukb23156_c22.merged.filtered.bim \
    --humandb /gpfs/ysm/datasets/db/annovar/humandb \
    --ukbb /gpfs/gibbs/pi/dewan/data/UKBiobank \
    --job_size 1 \
    --name_prefix mwe_chr22 \
    --container_annovar /gpfs/gibbs/pi/dewan/data/UKBiobank/annovar.sif
```

On Columbia's cluster running `annovar`

```
sos run ~/project/bioworkflows/variant-annotation/annovar.ipynb annovar \
    --cwd output \
    --bim_name /mnt/mfs/statgen/UKBiobank/data/exome_files/project_VCF/plink_files/ukb23156_c22.merged.filtered.bim \
    --humandb /mnt/mfs/statgen/isabelle/REF/humandb  \
    --ukbb /mnt/mfs/statgen/isabelle/REF/humandb \
    --job_size 1 \
    --name_prefix mwe_chr22 \
    --container_annovar /mnt/mfs/statgen/containers/gatk4-annovar.sif
```

On Columbia's cluster running `burden_files`

```
sos run ~/project/bioworkflows/variant-annotation/annovar.ipynb burden_files\
    --cwd ~/output \
    --annotated_file /mnt/mfs/statgen/UKBiobank/results/annovar_exome/ukb32285_exomespb_chr1_22.hg38.hg38_multianno.csv\
    --bim_name /mnt/mfs/statgen/UKBiobank/data/exome_files/project_VCF/plink_files/ukb23156_c1.merged.filtered.bim \
    --job_size 1 \
    --name_prefix test \
    --container_lmm /mnt/mfs/statgen/containers/lmm.sif
```

In [None]:
[global]
# the output directory for generated files
parameter: cwd = path
# Specific number of threads to use
parameter: numThreads = 2
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Name for the merged bimfiles to use
parameter: bim_name = path
# Human genome build hg19 or hg38
parameter: build = 'hg38'
# Prefix for the name based on common/rare variant filtering
parameter: name_prefix = str
# Wall clock time expected
parameter: walltime = "15h"
# Memory expected
parameter: mem = "30G"
# Load annovar module from cluster
parameter: annovar_module = '''
module load Annovar/202004
echo "Module annovar loaded"
{cmd}
'''
# Software container option
parameter: container_annovar = 'gaow/gatk4-annovar'
parameter: container_lmm = 'statisticalgenetics/lmm:2.4'

### Format file for plink .bim

A text file with no header line, and one line per variant with the following six fields:
1. Chromosome code (either an integer, or 'X'/'Y'/'XY'/'MT'; '0' indicates unknown) or name
2. Variant identifier
3. Position in morgans or centimorgans (safe to use dummy value of '0')
4. Base-pair coordinate (1-based; limited to 2^31-2)
5. Allele 1 (corresponding to clear bits in .bed; usually minor)
6. Allele 2 (corresponding to set bits in .bed; usually major)

In the bim files used here, the variant identifier in the second column (e.g. `1:930232:C:T`) lists the alleles in ref/alt order.
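
As a minimal sketch of the six-column layout described above, the following Python snippet parses one `.bim` line and splits a `chr:pos:ref:alt` identifier. The names `BimRecord`, `parse_bim_line` and `split_variant_id`, as well as the sample line, are illustrative assumptions and are not part of the workflow.

```python
from typing import NamedTuple, Tuple


class BimRecord(NamedTuple):
    chrom: str       # column 1: chromosome code
    variant_id: str  # column 2: variant identifier
    cm: float        # column 3: position in morgans/centimorgans (often 0)
    pos: int         # column 4: base-pair coordinate (1-based)
    allele1: str     # column 5: allele 1 (usually minor)
    allele2: str     # column 6: allele 2 (usually major)


def parse_bim_line(line: str) -> BimRecord:
    """Parse one whitespace-delimited .bim line into a BimRecord."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    chrom, variant_id, cm, pos, allele1, allele2 = fields
    return BimRecord(chrom, variant_id, float(cm), int(pos), allele1, allele2)


def split_variant_id(variant_id: str) -> Tuple[str, int, str, str]:
    """Split an identifier of the form chr:pos:ref:alt (assumed format)."""
    chrom, pos, ref, alt = variant_id.split(":")
    return chrom, int(pos), ref, alt


# Hypothetical line: allele 1 is the alternative allele, allele 2 the reference
record = parse_bim_line("1\t1:930232:C:T\t0\t930232\tT\tC")
print(record)
print(split_variant_id(record.variant_id))
```

Identifiers that do not follow the `chr:pos:ref:alt` pattern (for example rsIDs) would make `split_variant_id` raise a `ValueError`, so this helper only applies to files that use that naming scheme.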

## Step to merge `*.bim` files from plink-formatted data (e.g. exome data in the UKBB, genotype array data)

In [None]:
# Merge all the *.bim files into a single file. Needs to be run once per type of data (e.g. genotype, exome)
[bim_from_plink]
# Path to the *.bim files to merge
parameter: bimfiles= paths
# Specify path of the merged bim file
parameter: bim_name = path
input: bimfiles 
output: bim_name
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout' 
      cat ${_input} > ${_output}
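
The merge is a plain concatenation of the per-chromosome files. A rough Python equivalent, for illustration only, is sketched below. The function name `merge_bim_files` and the temporary files are assumptions made for the example; the output is opened in write mode so that a rerun starts from an empty file.

```python
import tempfile
from pathlib import Path


def merge_bim_files(bim_paths, merged_path):
    """Concatenate .bim files in the given order; return the number of lines written."""
    n_written = 0
    with open(merged_path, "w") as out:  # "w" truncates any previous result
        for bim_path in bim_paths:
            with open(bim_path) as handle:
                for line in handle:
                    if not line.strip():
                        continue  # skip empty lines
                    out.write(line if line.endswith("\n") else line + "\n")
                    n_written += 1
    return n_written


# Demonstration with two small, made-up per-chromosome files
with tempfile.TemporaryDirectory() as tmp:
    chr1 = Path(tmp) / "chr1.bim"
    chr2 = Path(tmp) / "chr2.bim"
    chr1.write_text("1 1:100:A:G 0 100 G A\n1 1:200:C:T 0 200 T C\n")
    chr2.write_text("2 2:300:G:A 0 300 A G\n")
    merged = Path(tmp) / "merged.bim"
    n_written = merge_bim_files([chr1, chr2], merged)
    merged_lines = merged.read_text().splitlines()

print(n_written, merged_lines)
```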

## Step to create a list of variants from `*.bgen` files and a merged `*.bim` file to annotate (e.g. imputed genotype data UKBB)

In [None]:
# Create a merged *.bim file from *.bgen files
[bim_from_bgen]
# Specify bgen files path
parameter: genoFile = paths
# Specify name of the merged bim file
parameter: bim_name = str
# The input here is the bgen file from which to extract the list of variants
input: genoFile, group_by=1
output: f'{cwd}/{_input:bn}.bim'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container=container_lmm, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    bgenix -g ${_input} -list | awk 'NR>2 { gsub("_",":",$1); print $3, $1, $4, $7, $6 }' | awk 'BEGIN{FS=OFS=" "}{$2 = $2 OFS 0}1'  > ${_output}
    cat ${_output} | awk '{x=$1+0;print x,$2,$3,$4,$5,$6}' >> ${cwd}/${bim_name}.merged.bim

In [None]:
# Get a list of common SNPs above (--maf) or rare SNPs below (--max-maf) certain MAF
[get_snps_1]
# bed files plink format
parameter: bfiles = paths
# Filter based on minor allele frequency (use when filtering common variants)
parameter: maf_filter = 0.0
# Filter based on the maximum maf allowed (use when filtering rare variants)
parameter: max_maf_filter = 0.001
# Filter out variants with missing call rate higher than this value
parameter: geno_filter = 0.0
# Filter according to Hardy-Weinberg equilibrium
parameter: hwe_filter = 0.0
# Filter out samples with missing rate higher than this value
parameter: mind_filter = 0.0
input: bfiles, group_by=1
output: f'{cwd}/cache/{_input:bn}.{name_prefix}.snplist'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container=container_lmm, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout' 
    plink2 \
      --bfile ${_input:n}\
      ${('--maf %s' % maf_filter) if maf_filter > 0 else ''} \
      ${('--max-maf %s' % max_maf_filter) if max_maf_filter > 0 else ''} \
      ${('--geno %s' % geno_filter) if geno_filter > 0 else ''} \
      ${('--hwe %s' % hwe_filter) if hwe_filter > 0 else ''} \
      ${('--mind %s' % mind_filter) if mind_filter > 0 else ''} \
      --write-snplist --no-id-header\
      --freq \
      --threads ${numThreads} \
      --out ${_output:n} 
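
In the step above, each filter flag is only passed to plink2 when its value is greater than zero. The Python sketch below reproduces that logic outside of SoS; the function name `plink2_filter_args` is an assumption for illustration. It also shows that, with the default `max_maf_filter = 0.001`, selecting common variants requires setting `max_maf_filter` to 0 explicitly, otherwise both `--maf` and `--max-maf` are applied.

```python
def plink2_filter_args(maf_filter=0.0, max_maf_filter=0.001,
                       geno_filter=0.0, hwe_filter=0.0, mind_filter=0.0):
    """Return the plink2 filter flags, skipping every filter whose value is not > 0."""
    candidates = [
        ("--maf", maf_filter),
        ("--max-maf", max_maf_filter),
        ("--geno", geno_filter),
        ("--hwe", hwe_filter),
        ("--mind", mind_filter),
    ]
    args = []
    for flag, value in candidates:
        if value > 0:
            args.extend([flag, str(value)])
    return args


# Defaults: rare variants only
rare_args = plink2_filter_args()
# Common variants: disable the max-MAF filter explicitly
common_args = plink2_filter_args(maf_filter=0.01, max_maf_filter=0.0)

print(rare_args)
print(common_args)
```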

In [None]:
# Merge all of the common_var.snplist into a single file and all the rare_var.snplist into another single file
[get_snps_2]
input: group_by='all'
output: f'{cwd}/cache/{name_prefix}.snplist'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output:n}.stdout' 
      cat ${_input} > ${_output}

In [None]:
# Search for common or rare variants in bimfile and generate annovar input file
[get_snps_3]
depends: bim_name
output: f'{cwd}/{_input:bn}.avinput'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout' 
    awk -F" " 'FNR==NR {lines[$1]; next} $2 in lines ' ${_input} ${bim_name} > ${_output:n}.tmp
    awk '{if ($2 ~ /D/) {print $1, $4, $4 + (length ($6) - length ($5)), $6, $5 } else {print $1, $4, $4, $6, $5 }}'  ${_output:n}.tmp >  ${_output}
    # remove temporary files
    rm -f ${_output:n}.tmp 
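
The first `awk` command above is a two-file lookup: it stores the variant IDs from the snplist and then keeps the bim rows whose second column is one of those IDs. A minimal Python sketch of the same idea follows; the function name `select_bim_rows` and the sample data are assumptions for illustration.

```python
def select_bim_rows(snp_ids, bim_lines):
    """Keep the .bim lines whose variant ID (column 2) appears in snp_ids."""
    wanted = set(snp_ids)  # set membership is O(1) per lookup
    selected = []
    for line in bim_lines:
        fields = line.split()
        if len(fields) >= 2 and fields[1] in wanted:
            selected.append(line)
    return selected


# Made-up snplist and bim content
snplist = ["1:200:C:T", "2:300:G:A"]
bim = [
    "1 1:100:A:G 0 100 G A",
    "1 1:200:C:T 0 200 T C",
    "2 2:300:G:A 0 300 A G",
]
kept = select_bim_rows(snplist, bim)
print(kept)
```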

## Annovar details

For a list of available databases, see the [UCSC hg38 database directory](https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/).

On Yale's Farnam HPC, the shared databases are in `/gpfs/ysm/datasets/db/annovar/humandb` and the x_ref database is `/gpfs/gibbs/pi/dewan/data/UKBiobank/mart_export_2019_LOFtools3.txt`.

On Columbia's cluster, the shared databases for build hg19 are in Isabelle's folder, `/mnt/mfs/statgen/isabelle/REF/humandb`, and the x_ref database is in that same folder.


### Important note

Make sure you use the correct genome build for your annotations. The UKBB exome data for 200K individuals require the hg38 build.

### Format file for annovar input

On each line, the first five space- or tab-delimited columns are:

1. chromosome 
2. start position 
3. end position 
4. the reference nucleotides
5. the observed nucleotides
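
As a small illustration of the five columns listed above, the sketch below builds an avinput row from a position and a pair of alleles, using the same rule as the `annovar_1` step in this notebook: the end position is the start plus the length of the reference allele minus one. The function name `to_avinput_row` and the example variants are assumptions.

```python
def to_avinput_row(chrom, pos, ref, alt):
    """Return (chromosome, start, end, ref, alt), with end = start + len(ref) - 1."""
    start = int(pos)
    end = start + len(ref) - 1
    return (str(chrom), start, end, ref, alt)


# Single-nucleotide variant: start and end coincide
snv_row = to_avinput_row("1", 930232, "C", "T")
# Multi-base reference allele (hypothetical deletion): end is shifted by len(ref) - 1
del_row = to_avinput_row("1", 100, "CAG", "C")

print(" ".join(map(str, snv_row)))
print(" ".join(map(str, del_row)))
```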

In [None]:
# Create annovar input file
[annovar_1]
# Input is the file to be annotated and can be a bim file, a pvar file or a summary stats file generated by regenie
## Format of sumstats should be CHROM, GENPOS, ID, ALLELE0 (REF) and ALLELE1 (ALT)
## Format of the pvar file should be #CHROM POS ID REF ALT
input: bim_name
output: f'{cwd}/{_input:bn}.{build}.avinput'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f'{_output:bn}'
bash: expand= "${ }", stderr = f'{_output:n}.err', stdout = f'{_output:n}.out' 
    # $1 chromosome, $2 variant_id, $3 cM, $4 POS, $6 ref_allele (allele 2 usually major), $5 alt_allele (allele 1 usually minor) in the bim files 
    # Output as annovar avinput chr, start, end (has to be calculated depending on reference allele length), reference, alternative
    filename="${_input}"
    suffix=$(echo "$filename" | awk -F. '{print $NF}')
    echo $suffix
    if [[ "$suffix" == "bim" ]]; then
        echo "Input file has a .bim extension"
        awk '{if (length ($6) > 1) {print $1, $4, $4 + (length ($6) - 1), $6, $5, $2} else {print $1, $4, $4, $6, $5, $2}}'   ${_input} >  ${_output}
    elif [[ "$suffix" == "pvar" ]]; then
        echo "Input file has .pvar extension, it will be considered as a plink pvar format and the columns are CHROM, POS, ID,  REF, ALT. Please confirm that your input file has this format"
        awk  'NR>1 {if (length ($4) > 1) {print $1,$2,$2 + (length ($4) - 1),$4,$5} else {print $1,$2,$2,$4,$5}}'   ${_input} >  ${_output}
    else
        echo "Input file has neither a .bim nor a .pvar extension, it will be considered as a summary stats file and the columns are chrom, genpos, ID, allele0(ref), allele1(alt). Please confirm that your file has this format"
        awk 'NR>1 {if (length ($4) > 1) {print $1, $2, $2 + (length ($4) - 1), $4, $5, $3} else {print $1, $2, $2, $4, $5, $3}}'   ${_input} >  ${_output}
    fi

## The version of annovar used to annotate for the RAP system

In [None]:
# Annotate variants file using ANNOVAR
[annovar_2]
# humandb path for ANNOVAR
parameter: humandb = path
# Path to x-ref file
parameter: xref_path = path
# Annovar protocol
if build == 'hg19':
    protocol = ['refGene', 'refGeneWithVer', 'knownGene', 'ensGene', 'phastConsElements46way', 'gwasCatalog', 'gnomad211_exome', 'avsnp150', 'dbnsfp42a', 'dbscsnv11', 'gene4denovo201907']
    operation = ['g', 'g', 'g', 'g', 'r', 'r', 'f', 'f', 'f', 'f', 'f']
    arg = ['"-splicing 12"', '"-splicing 2"', '"-splicing 12"', '"-splicing 12"', '', '', '', '', '', '', '']
else:
    protocol = ['refGene', 'refGeneWithVer', 'knownGene', 'ensGene', 'gwasCatalog', 'gnomad312_genome', 'gnomad211_exome', 'avsnp150', 'dbnsfp42a', 'dbscsnv11', 'clinvar_20220320', 'gene4denovo201907']
    operation = ['g', 'g', 'g', 'gx', 'r', 'f', 'f', 'f', 'f', 'f', 'f', 'f']
    arg = ['"-splicing 12"', '"-splicing 2"', '"-splicing 12"', '"-splicing 12"', '', '', '', '', '', '', '', '']

#add xreffile to option without -exonicsplicing
#mart_export_2019_LOFtools3.txt #xreffile latest option -> Phenotype description,HGNC symbol,MIM morbid description,CGD_CONDITION,CGD_inh,CGD_man,CGD_comm,LOF_tools
parameter: x_ref = path(f"{xref_path}/mart_export_2021_LOFtools.txt")
output: f'{cwd}/{_input:bn}.{build}_multianno.csv'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}', template = '{cmd}' if executable('annotate_variation.pl').target_exists() else annovar_module
bash:  expand="${ }", stderr=f'{_output:n}.err', stdout=f'{_output:n}.out'
    #do not add -intronhgvs as option -> writes cDNA variants as HGVS but creates issues (+2 splice site reported only)
    #-nastring . can only be . for VCF files
    #regsnpintron might cause shifted lines (be careful when using)
    table_annovar.pl \
        ${_input} \
        ${humandb} \
        -buildver ${build} \
        -out ${_output:nn}\
        -otherinfo\
        -remove \
        -polish \
        -nastring . \
        -protocol ${",".join(protocol)}\
        -operation ${",".join(operation)} \
        -arg ${",".join(arg)} \
        -csvout 
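
The `protocol`, `operation` and `arg` lists are positional: the i-th operation and the i-th argument apply to the i-th protocol, so the three lists need the same length before they are joined with commas. The Python sketch below checks this and builds the corresponding options; the function name `build_table_annovar_options` and the shortened lists are assumptions for illustration.

```python
def build_table_annovar_options(protocol, operation, arg):
    """Join the three positional lists into -protocol/-operation/-arg options."""
    if not (len(protocol) == len(operation) == len(arg)):
        raise ValueError(
            f"length mismatch: {len(protocol)} protocols, "
            f"{len(operation)} operations, {len(arg)} args"
        )
    return [
        "-protocol", ",".join(protocol),
        "-operation", ",".join(operation),
        "-arg", ",".join(arg),
    ]


# Shortened example: one gene-based, one region-based, one filter-based database
protocol = ["refGene", "gwasCatalog", "gnomad211_exome"]
operation = ["g", "r", "f"]
arg = ['"-splicing 12"', "", ""]

options = build_table_annovar_options(protocol, operation, arg)
print(" ".join(options))
```

Empty entries in `arg` still take up a slot, which is why the joined string ends with consecutive commas.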

## Generate files for burden_test regenie from the annotated file

This workflow's aim is to generate the `--anno_file` and the `--set_list` files needed to run regenie_burden in the LMM.ipynb

Required files
1. The anno file: defines variant sets and functional annotations that will be used to generate the masks. The format is `chr:start:ref:alt gene_name functional_annot`
2. The set-list file: lists the variants within each set/gene to use when building masks. The format is `set/gene chr start_pos` followed by a comma-separated list of the variants included in that gene
3. Mask file: specifies which annotation categories should be combined into masks

Optional files

4. Set inclusion/exclusion file: one column with a list of sets/genes to be included in or excluded from the set-list file
5. Alternative allele frequency (AAF) file: by default the AAF is computed from the sample, but you can specify an AAF for each variant using this file
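
To make the two required formats concrete, here is a minimal Python sketch that builds anno-file and set-list lines from a handful of annotated variants. The gene names, variants and function names (`build_anno_lines`, `build_set_list_lines`) are hypothetical and only illustrate the layout described above.

```python
from collections import defaultdict

# Hypothetical annotated variants: (variant_id, gene, annotation, chromosome, start)
variants = [
    ("1:930232:C:T", "GENE_A", "missense", "1", 930232),
    ("1:930100:G:A", "GENE_A", "LOF", "1", 930100),
    ("1:1000500:A:G", "GENE_B", "synonymous", "1", 1000500),
]


def build_anno_lines(records):
    """One line per variant: variant_id gene_name functional_annot."""
    return [f"{vid} {gene} {annot}" for vid, gene, annot, _chrom, _start in records]


def build_set_list_lines(records):
    """One line per gene: gene chr start_pos comma-separated-variant-ids."""
    by_gene = defaultdict(list)
    for vid, gene, _annot, chrom, start in records:
        by_gene[gene].append((start, chrom, vid))
    lines = []
    for gene in sorted(by_gene):
        members = sorted(by_gene[gene])  # order variants by start position
        first_start, chrom, _vid = members[0]
        ids = ",".join(vid for _start, _chrom, vid in members)
        lines.append(f"{gene} {chrom} {first_start} {ids}")
    return lines


anno_lines = build_anno_lines(variants)
set_list_lines = build_set_list_lines(variants)
print("\n".join(anno_lines))
print("\n".join(set_list_lines))
```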


In [None]:
[burden_files]
parameter: annovar_anno = paths
parameter: vep_anno = paths
parameter: rsid = False
input: annovar_anno, vep_anno, group_by = 'pairs'
output:f'{cwd}/{_input[0]:bn}.anno_file',
       f'{cwd}/{_input[0]:bn}.aaf_file',
       f'{cwd}/{_input[0]:bn}.set_list_file'
task: trunk_workers = 1, walltime = '10h', mem = mem, cores = numThreads, tags = f'{_output[0]:bn}'
R:  expand="${ }", stderr=f'{_output[0]:n}.err', stdout=f'{_output[0]:n}.out'
  rm(list = ls())
  # R.version
  packages_list <- c("devtools", "readr", "dplyr", "tidyr", "data.table", "R.utils", "knitr", "stringr", "stringi")
  invisible(lapply(packages_list, library, character.only = TRUE, warn.conflicts = F, quietly = T))
  # Read in the annotated file with annovar
  df_anno <- fread(${_input[0]:r},na.strings = ".")
  dim(df_anno)
  # Read in the annotated file with VEP
  df_vep <- fread(${_input[1]:r})
  dim(df_vep)
  # subset the columns we care about
  df_vep_cadd <- df_vep[, c("Uploaded_variation", "CADD_PHRED")]
  # change the name of the columns
  setnames(df_vep_cadd , c("Uploaded_variation", "CADD_PHRED"), c("Otherinfo1","CADD_PHRED_VEP"))
  # Remove the duplicates in the VEP file
  df_vep_nondup <- df_vep_cadd[!duplicated(df_vep_cadd)]
  dim(df_vep_nondup)
  # Merge the two databases 
  df_orig_orig <- merge(df_anno, df_vep_nondup, by="Otherinfo1", all=T)
  # Create another database applying the gnomAD_nfe rule and flipping the alleles with REF/ALT issues
  df_orig<- df_orig_orig %>%
  mutate(ID_r = case_when(gnomad312_AF_nfe>0.5 & AF_nfe>0.5 ~ paste(Chr, Start, Alt, Ref, sep = ":"),
                          gnomad312_AF_nfe>0.5 & is.na(AF_nfe) ~ paste(Chr, Start, Alt, Ref, sep = ":"),
                          is.na(gnomad312_AF_nfe) & AF_nfe>0.5 ~ paste(Chr, Start, Alt, Ref, sep = ":"),
                          TRUE ~ paste(Chr, Start, Ref, Alt, sep = ":"))) %>%
  mutate(AF_gnomAD=case_when(gnomad312_AF_nfe>0.5 & AF_nfe>0.5 ~ as.character(1 - pmax(gnomad312_AF_nfe, AF_nfe, na.rm=TRUE)),
                          gnomad312_AF_nfe>0.5 & is.na(AF_nfe) ~ as.character(1 - gnomad312_AF_nfe),
                          is.na(gnomad312_AF_nfe) & AF_nfe>0.5 ~ as.character(1 - AF_nfe),
                          TRUE ~ as.character(pmax(gnomad312_AF_nfe, AF_nfe, na.rm=TRUE))))

  # Print some summaries of the file
  dim(df_orig_orig)
  
  # Keep only the categories we need for creation of burden files
  categories_to_keep <- c('exonic', 'exonic;splicing', 'splicing', 'ncRNA_exonic;splicing')
  mask_column1 <- df_orig$Func.refGene %in% categories_to_keep
  mask_column2 <- df_orig$Func.refGeneWithVer %in% categories_to_keep
  df <- df_orig[mask_column1 | mask_column2, ]
  
  # Print a cross table of the two columns that contain the important annotations
  kable(table(df$Func.refGene, df$Func.refGeneWithVer))
  # add a unique ID to each row
  df$uniqueID <- c(1:nrow(df))
  df$Func.refGene.num <- str_count(df$Func.refGene, ";") + 1
  df$Gene.refGene.num <- str_count(df$Gene.refGene, ";") + 1
  df$Func.refGeneWithVer.num <- str_count(df$Func.refGeneWithVer, ";") + 1
  df$Gene.refGeneWithVer.num <- str_count(df$Gene.refGeneWithVer, ";") + 1
  #Single gene and single function
  df_1111 <- df %>%
    filter(Func.refGene.num == 1 & Gene.refGene.num == 1 & Func.refGeneWithVer.num == 1 & Gene.refGeneWithVer.num == 1)
    dim(df_1111)
    df_1111_output <- df_1111
  table(df_1111_output$Func.refGene.num, df_1111_output$Gene.refGene.num)
  table(df_1111_output$Func.refGeneWithVer.num, df_1111_output$Gene.refGeneWithVer.num)
  # Equal number of genes and functions
  df_equal <- df %>%
    filter(!(Func.refGene.num == 1 & Gene.refGene.num == 1 & Func.refGeneWithVer.num == 1 & Gene.refGeneWithVer.num == 1)) %>%
    filter(Func.refGene.num == Gene.refGene.num & Func.refGeneWithVer.num == Gene.refGeneWithVer.num)
  dim(df_equal)
  table(df_equal$Func.refGene.num, df_equal$Gene.refGene.num)
  table(df_equal$Func.refGeneWithVer.num, df_equal$Gene.refGeneWithVer.num)
  # separate this into different rows
  df_equal_output_1 <- df_equal %>%
  separate_longer_delim(c(Func.refGene, Gene.refGene), delim = ";")
  df_equal_output_2 <- df_equal_output_1 %>%
  separate_longer_delim(c(Func.refGeneWithVer, Gene.refGeneWithVer), delim = ";")
  dim(df_equal_output_2)
  df_equal_output_2$Func.refGene.num <- str_count(df_equal_output_2$Func.refGene, ";") + 1
  df_equal_output_2$Gene.refGene.num <- str_count(df_equal_output_2$Gene.refGene, ";") + 1
  df_equal_output_2$Func.refGeneWithVer.num <- str_count(df_equal_output_2$Func.refGeneWithVer, ";") + 1
  df_equal_output_2$Gene.refGeneWithVer.num <- str_count(df_equal_output_2$Gene.refGeneWithVer, ";") + 1

  table(df_equal_output_2$Func.refGene.num, df_equal_output_2$Gene.refGene.num)
  table(df_equal_output_2$Func.refGeneWithVer.num, df_equal_output_2$Gene.refGeneWithVer.num)
  ## Multiple genes to one function or vice versa
  df_1_to_n <- df %>%
    filter( (Func.refGene.num == 1 & Gene.refGene.num != 1 | Func.refGene.num != 1 & Gene.refGene.num == 1) & (Func.refGeneWithVer.num == 1 & Gene.refGeneWithVer.num != 1 | Func.refGeneWithVer.num != 1 & Gene.refGeneWithVer.num == 1))
  # we do the 1-n column in Func.refGene.num and Gene.refGene.num on the output of the step above
  tmp1 <- df_1_to_n %>%
    filter(Func.refGene.num == 1 & Gene.refGene.num != 1)
  tmp1_separated <- tmp1 %>%
    separate_longer_delim(c(Gene.refGene), delim = ";")
  tmp1_separated_excluding <- df_1_to_n %>% filter(!(Func.refGene.num == 1 & Gene.refGene.num != 1))
  tmp1_separated_output = rbind(tmp1_separated, tmp1_separated_excluding)
  # then we do the n-1 column in Func.refGene.num and Gene.refGene.num on the output of the step above
  tmp2 <- tmp1_separated_output %>%
    filter(Func.refGene.num != 1 & Gene.refGene.num == 1)
  tmp2_separated <- tmp2 %>%
    separate_longer_delim(c(Func.refGene), delim = ";")
  tmp2_separated_excluding <- tmp1_separated_output %>% filter(!(Func.refGene.num != 1 & Gene.refGene.num == 1))
  tmp2_separated_output = rbind(tmp2_separated, tmp2_separated_excluding)
  # then we do the 1-n column in Func.refGeneWithVer.num and Gene.refGeneWithVer.num on the output of the step above
  tmp3 <- tmp2_separated_output %>%
    filter(Func.refGeneWithVer.num == 1 & Gene.refGeneWithVer.num != 1)
  tmp3_separated <- tmp3 %>%
    separate_longer_delim(c(Gene.refGeneWithVer), delim = ";")
  tmp3_separated_excluding <- tmp2_separated_output %>% filter(!(Func.refGeneWithVer.num == 1 & Gene.refGeneWithVer.num != 1))
  tmp3_separated_output = rbind(tmp3_separated, tmp3_separated_excluding)

  # at last we do the n-1 column in Func.refGeneWithVer.num and Gene.refGeneWithVer.num on the output of the step above
  tmp4 <- tmp3_separated_output %>%
    filter(Func.refGeneWithVer.num != 1 & Gene.refGeneWithVer.num == 1)
  tmp4_separated <- tmp4 %>%
    separate_longer_delim(c(Func.refGeneWithVer), delim = ";")
  tmp4_separated_excluding <- tmp3_separated_output %>% filter(!(Func.refGeneWithVer.num != 1 & Gene.refGeneWithVer.num == 1))
  tmp4_separated_output = rbind(tmp4_separated, tmp4_separated_excluding)

  tmp4_separated_output$Func.refGene.num <- str_count(tmp4_separated_output$Func.refGene, ";") + 1
  tmp4_separated_output$Gene.refGene.num <- str_count(tmp4_separated_output$Gene.refGene, ";") + 1
  tmp4_separated_output$Func.refGeneWithVer.num <- str_count(tmp4_separated_output$Func.refGeneWithVer, ";") + 1
  tmp4_separated_output$Gene.refGeneWithVer.num <- str_count(tmp4_separated_output$Gene.refGeneWithVer, ";") + 1

  table(tmp4_separated_output$Func.refGene.num, tmp4_separated_output$Gene.refGene.num)
  table(tmp4_separated_output$Func.refGeneWithVer.num, tmp4_separated_output$Gene.refGeneWithVer.num)
  # Combine single valued dataframes
  df_output <- rbind(df_1111_output, df_equal_output_2)
  df_output <- rbind(df_output, tmp4_separated_output)
  ## Set the Func to nan for splice variants
  df_splicing = df_output %>% filter(Func.refGeneWithVer == "splicing")
  print("--before conversion, ExonicFunc.refGene column:--")
  table(df_splicing$ExonicFunc.refGene)
  print("--before conversion, ExonicFunc.refGeneWithVer column:--")
  table(df_splicing$ExonicFunc.refGeneWithVer)
  df_output_splicing_adjusted <- df_output %>%
    mutate(ExonicFunc.refGeneWithVer = ifelse(Func.refGeneWithVer == "splicing", "nan", ExonicFunc.refGeneWithVer))
  df_output_splicing_adjusted_only_splicing = df_output_splicing_adjusted %>% filter(Func.refGeneWithVer == "splicing")
  print("--after conversion, ExonicFunc.refGeneWithVer:--")
  table(df_output_splicing_adjusted_only_splicing$ExonicFunc.refGeneWithVer)
  
  ## From now on work on the file with separated and non-duplicated rows 
  df <- df_output_splicing_adjusted %>%
    mutate(if_match = ifelse(Func.refGeneWithVer == Func.refGene, TRUE, FALSE),
          ExonicFunc.refGeneWithVer.1 = ifelse(Func.refGeneWithVer == "exonic" & ExonicFunc.refGeneWithVer == "unknown", ExonicFunc.knownGene, ExonicFunc.refGeneWithVer),
          condition1 = ifelse(
            CADD_phred > 20 | dbscSNV_ADA_SCORE > 0.8 | dbscSNV_RF_SCORE > 0.8, TRUE, FALSE
        ))
  table(df$condition1)
  matched <- df %>%
    filter(if_match)
  unmatched <- df %>% filter(!if_match)
  print(paste0("There are ", as.character(nrow(matched)), " matched rows, and ", as.character(nrow(unmatched)), " unmatched rows."))
  kable(table(matched$Func.refGeneWithVer, matched$ExonicFunc.refGeneWithVer))
  kable(table(matched$Func.refGeneWithVer, matched$ExonicFunc.knownGene))
  # for the matched ones
  matched <- matched %>%
    mutate(consequence = case_when(
        Func.refGeneWithVer == "splicing" ~ "LOF",
        Func.refGeneWithVer == "exonic" & ExonicFunc.refGeneWithVer.1 == "nonsynonymous SNV" ~ "missense",
        Func.refGeneWithVer == "exonic" & ExonicFunc.refGeneWithVer.1 == "synonymous SNV" ~ "synonymous",
        Func.refGeneWithVer == "exonic" & ExonicFunc.refGeneWithVer.1 %in% c("frameshift substitution", "stoploss", "stopgain", "startloss") ~ "LOF",
        Func.refGeneWithVer == "exonic" & ExonicFunc.refGeneWithVer.1 == "nonframeshift substitution" ~ "inframe",
        Func.refGeneWithVer == "exonic" & ExonicFunc.refGeneWithVer.1 == "unknown" ~ ExonicFunc.knownGene,
        TRUE ~ "zzz")
           )
  print("--For the matched functions these are the consequences:--")
  table(matched$consequence)
  # for the unmatched
  kable(table(unmatched$Func.refGeneWithVer, unmatched$Func.refGene))
  kable(table(unmatched$Func.refGeneWithVer, unmatched$ExonicFunc.refGeneWithVer))
  unmatched <- unmatched %>%
    mutate(consequence = case_when(
        Func.refGeneWithVer == "exonic" & Func.refGene == "exonic;splicing" & ExonicFunc.refGeneWithVer %in% c("frameshift substitution", "stoploss", "stopgain", "startloss") ~ "LOF",
        Func.refGeneWithVer == "exonic" & Func.refGene == "exonic;splicing" & !(ExonicFunc.refGeneWithVer %in% c("frameshift substitution", "stoploss", "stopgain", "startloss")) & condition1 ~ "LOF",
        Func.refGeneWithVer == "exonic" & Func.refGene == "exonic;splicing" & !(ExonicFunc.refGeneWithVer %in% c("frameshift substitution", "stoploss", "stopgain", "startloss")) & (!condition1 | is.na(condition1)) ~ "splicing",
        Func.refGeneWithVer == "splicing" & Func.refGene %in% c("exonic", "ncRNA_exonic", "exonic;splicing")  ~ "LOF",
        Func.refGeneWithVer %in% c("UTR3", "UTR5", "intronic", "ncRNA_exonic", "ncRNA_intronic") & Func.refGene == "splicing" ~ "splicing",
        Func.refGeneWithVer == "exonic" & Func.refGene == "splicing" & ExonicFunc.refGeneWithVer %in% c("frameshift substitution", "stoploss", "stopgain", "startloss") ~ "LOF",
        Func.refGeneWithVer == "exonic" & Func.refGene == "splicing" & !(ExonicFunc.refGeneWithVer %in% c("frameshift substitution", "stoploss", "stopgain", "startloss")) ~ "splicing",
        TRUE ~ "zzz"
    ))
  print("--For the unmatched functions these are the consequences:--")
  table(unmatched$consequence)
  df_output_wf = rbind(matched,unmatched)
  df_output_wf_no_zzz <- df_output_wf %>% filter(consequence != "zzz")
  ## Please note that to this point duplicated variants have not been removed and they will be removed below in the generation of the output files
  ## Create output files
  ### Annotation file
  df_annotation_file <- df_output_wf_no_zzz %>%
    select(Chr, Start, Ref, Alt,gnomad312_AF_nfe, AF_nfe, Gene.refGeneWithVer, consequence, ID_r, AF_gnomAD, CADD_PHRED_VEP)
  
  head(df_annotation_file, 2)
  dim(df_annotation_file)
  print("=== after removing duplicated === ")
  # Remove duplicated variants in the annotation file
  df_annotation_file_no_dup <- df_annotation_file[!duplicated(df_annotation_file, by = c("Gene.refGeneWithVer", "ID_r", "consequence")), ]
  dim(df_annotation_file_no_dup)
  table(df_annotation_file_no_dup$consequence)

  print("=== after removing duplicated then keep the most damaging function === ") # update on 20230925
  df_annotation_file_no_dup$consequence <- factor(df_annotation_file_no_dup$consequence, levels = c("LOF", "splicing", "missense", "synonymous", "inframe", "unknown"))
  df_annotation_file_no_dup_in_order = df_annotation_file_no_dup %>%
     arrange(ID_r, Gene.refGeneWithVer, consequence)
  df_annotation_file_no_dup_in_order_most_damaging = df_annotation_file_no_dup_in_order[!duplicated(df_annotation_file_no_dup_in_order, by = c("Chr", "Start", "Ref", "Alt","gnomad312_AF_nfe", "AF_nfe", "Gene.refGeneWithVer", "ID_r","CADD_PHRED_VEP")),]
  table(df_annotation_file_no_dup_in_order_most_damaging$consequence)
  dim(df_annotation_file_no_dup_in_order_most_damaging)

  df_annotation_file_no_dup_in_order_most_damaging_output <- df_annotation_file_no_dup_in_order_most_damaging %>%
    select(ID_r, Gene.refGeneWithVer, consequence, CADD_PHRED_VEP)
  # This step is to make sure we deal with non-numeric characters in the VEP annotated files
  df_annotation_cadd <- df_annotation_file_no_dup_in_order_most_damaging_output %>%
    mutate(CADD_PHRED_VEP = ifelse(CADD_PHRED_VEP == "-", NA, CADD_PHRED_VEP))
  # Now we have to deal with the NA's and we generate the CADD_consequence for the burden analysis including CADD_phred
  df_annotation_cadd_nona <- df_annotation_cadd %>%
    mutate(CADD_cat = case_when(
            is.na(CADD_PHRED_VEP) ~ "NA",
            as.numeric(CADD_PHRED_VEP) <10 ~ "[0-10)",
            as.numeric(CADD_PHRED_VEP) >=10 & as.numeric(CADD_PHRED_VEP) <20 ~ "[10-20)",
            as.numeric(CADD_PHRED_VEP) >=20 ~ ">=20",
            TRUE ~ "zzz")) %>%
    mutate(CADD_consequence = paste(consequence,CADD_cat, sep = ""))
  df_annotation_cadd_nona_output <- df_annotation_cadd_nona %>%
      select(ID_r, Gene.refGeneWithVer, CADD_consequence)
  
  ### Set list file
  length(unique(df_annotation_file_no_dup$Gene.refGeneWithVer))
  df_set_list = df_annotation_file_no_dup %>% 
    arrange(Gene.refGeneWithVer, Start) %>%
    group_by(Gene.refGeneWithVer) %>%
    summarise(Chr = Chr[1], var_list = list(ID_r), start_list = list(Start), n_var = n())

  columns = c("Gene.refGeneWithVer", "Chr", "start", "var")
  df_set_list_output = data.frame(matrix(nrow=nrow(df_set_list), ncol = length(columns)))
  colnames(df_set_list_output) = columns
  for (index in c(1:nrow(df_set_list))){
      df_set_list_output[index,1] = df_set_list$Gene.refGeneWithVer[index]
      # Use the chromosome of each gene so that inputs spanning several chromosomes are handled
      df_set_list_output$Chr[index] = df_set_list$Chr[index]
      df_set_list_output$start[index] = min(unlist(df_set_list$start_list[index]))
      df_set_list_output$var[index] = stri_paste(unlist(df_set_list$var_list[index]), collapse=',')
      }
  
  ## Allele frequency file
  #df_AAF = df_annotation_file_no_dup %>%
  #  select(ID_r, AF_nfe) %>% 
  #  mutate(AF_nfe.1 = ifelse(is.na(AF_nfe) | AF_nfe == 0 | AF_nfe == ".", "0.0", as.character(AF_nfe))) %>%
  #  select(!AF_nfe)
  #head(df_AAF)
  #df_AAF_no_dup <- df_AAF[!duplicated(df_AAF), ]
  #head(df_AAF_no_dup)

  frequency_df_max_AF <- df_annotation_file_no_dup %>%
      mutate(AF_gnomAD = ifelse(is.na(AF_gnomAD) | AF_gnomAD == 0 | AF_gnomAD == ".", "0.0", as.character(AF_gnomAD))) %>%
      select(ID_r, AF_gnomAD)

  # Final allele frequency file with changes and no duplicated variants
  frequency_df_max_AF_no_dup <- frequency_df_max_AF[!duplicated(frequency_df_max_AF),]
  
  write.table(df_annotation_cadd_nona_output, file = '${_output[0]}', row.names = FALSE, quote = FALSE, col.names = FALSE, sep = " ", na = "nan")
  write.table(frequency_df_max_AF_no_dup, file = '${_output[1]}', row.names = FALSE, quote = FALSE, col.names = FALSE, sep = " ", na = "nan")
  write.table(df_set_list_output, file = '${_output[2]}', row.names = FALSE, quote = FALSE, col.names = FALSE, sep = " ", na = "nan")
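
The allele-flipping rule used in the R code above (the `ID_r` and `AF_gnomAD` columns) can be summarised as follows: when every available gnomAD NFE frequency is above 0.5, REF and ALT are swapped in the identifier and the reported frequency becomes one minus the largest available frequency; otherwise the identifier is kept and the largest available frequency is reported. The Python sketch below restates that rule for a single variant. The function name `apply_flip_rule` is an assumption, and `None` stands in for R's `NA`.

```python
def apply_flip_rule(chrom, start, ref, alt, gnomad312_af_nfe=None, af_nfe=None):
    """Return (variant_id, allele_frequency) following the flipping rule.

    Flip when at least one frequency is available and all available
    frequencies are > 0.5; a missing frequency is passed as None.
    """
    known = [af for af in (gnomad312_af_nfe, af_nfe) if af is not None]
    flip = bool(known) and all(af > 0.5 for af in known)
    if flip:
        variant_id = f"{chrom}:{start}:{alt}:{ref}"
    else:
        variant_id = f"{chrom}:{start}:{ref}:{alt}"
    if not known:
        frequency = None
    else:
        top = max(known)
        frequency = 1 - top if flip else top
    return variant_id, frequency


flipped = apply_flip_rule("1", 100, "C", "T", gnomad312_af_nfe=0.75)
kept = apply_flip_rule("1", 100, "C", "T", gnomad312_af_nfe=0.75, af_nfe=0.25)
missing = apply_flip_rule("1", 100, "C", "T")
print(flipped, kept, missing)
```

Note that when the two sources disagree (one frequency above 0.5 and the other not), the variant is not flipped and the larger frequency is kept, as in the `TRUE ~ ...` branch of the `case_when` calls.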
