# Gene Coordinate Annotation

This workflow adds genomic coordinate annotation to gene-level molecular phenotype files generated in `.gct` format and converts them to `.bed` format for downstream analysis.

## Input

1. `protocol_example.protein.csv`
2. `reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf`

## Output

1. `protocol_example.protein.bed.gz`
2. `protocol_example.protein.region_list.txt`
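
To make the relationship between the two inputs and the BED output concrete, here is a minimal sketch of the annotation idea using toy data. The gene names, IDs, coordinates, and sample values are invented for illustration, and the sorting is simplified compared with the workflow code (which sorts by `start` within each chromosome while keeping the chromosome order of the annotation).

```python
import pandas as pd

# Toy coordinate table standing in for what is derived from the GTF.
# All values here are made up for illustration.
coords = pd.DataFrame(
    {
        "#chr": ["chr1", "chr1", "chr2"],
        "start": [999, 499, 199],
        "end": [1000, 500, 200],
        "gene_id": ["ENSG_B", "ENSG_A", "ENSG_C"],
    },
    index=["GENE_B", "GENE_A", "GENE_C"],  # indexed by gene_name
)

# Toy phenotype matrix: rows are genes, columns are samples.
pheno = pd.DataFrame(
    {"sample1": [0.1, 0.2], "sample2": [0.3, 0.4]},
    index=["GENE_A", "GENE_B"],
)

# Inner join on the gene identifier: genes without coordinates
# (or coordinates without a phenotype row) are dropped.
bed = coords.merge(pheno, left_index=True, right_index=True)

# Simplified ordering: sort by chromosome, then by start position.
bed = bed.sort_values(["#chr", "start"], kind="stable")
print(bed)
```

Only `GENE_A` and `GENE_B` survive the join, because `GENE_C` has no phenotype row.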

## Minimal Working Example

The proteomics data used in this MWE can be found on [Synapse](https://www.synapse.org/#!Synapse:syn52369482).

### Step 1: Phenotype Annotation
Timing: < 1 minute

In [None]:
sos run pipeline/gene_annotation.ipynb annotate_coord_protein \
    --cwd output/phenotype \
    --phenoFile input/protocol_example.protein.csv \
    --annotation-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf \
    --phenotype-id-type gene_name \
    --sample-participant-lookup output/sample_meta/protocol_example.protein.sample_overlap.txt \
    --container singularity/rna_quantification.sif

```
INFO: Running annotate_coord_protein:
INFO: annotate_coord_protein is completed.
INFO: annotate_coord_protein output:   /Users/alexmccreight/xqtl-pipeline-new/output/phenotype/protocol_example.protein.bed.gz /Users/alexmccreight/xqtl-pipeline-new/output/phenotype/protocol_example.protein.region_list.txt
INFO: Workflow annotate_coord_protein (ID=we5e99e82ff5b579b) is executed successfully with 1 completed step.
```
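
The `--sample-participant-lookup` file passed in the command above is described in the command interface as a table with two columns, `sample_id` and `participant_id`. The following sketch illustrates, on toy data, how such a table can be used to rename the sample columns of a phenotype matrix. The IDs are invented, and the tab separator is an assumption based on the workflow code.

```python
import io
import pandas as pd

# Toy lookup table (hypothetical IDs); tab-separated with a header row.
lookup_text = "sample_id\tparticipant_id\nS1\tP1\nS2\tP2\n"
lookup = pd.read_csv(io.StringIO(lookup_text), sep="\t", dtype=str)

# Build a plain {sample_id: participant_id} dictionary.
mapping = dict(zip(lookup["sample_id"], lookup["participant_id"]))

# Toy phenotype matrix whose columns are sample IDs.
pheno = pd.DataFrame({"S1": [0.1], "S2": [0.2], "S3": [0.3]}, index=["GENE_A"])

# Columns found in the mapping are renamed; others (S3) are left unchanged.
renamed = pheno.rename(columns=mapping)
print(list(renamed.columns))  # ['P1', 'P2', 'S3']
```

Samples that are missing from the lookup keep their original names, so it may be worth checking for leftovers before downstream analysis.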

## Command interface

In [None]:
sos run gene_annotation.ipynb -h

In [None]:
usage: sos run gene_annotation.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  region_list_generation
  annotate_coord_gene
  annotate_coord_protein
  annotate_coord_biomart
  map_leafcutter_cluster_to_gene
  annotate_leafcutter_isoforms
  annotate_psichomics_isoforms

Global Workflow Options:
  --cwd output (as path)
                        Work directory & output directory
  --annotation-gtf VAL (as path, required)
                        gene gtf annotation table
  --phenoFile VAL (as path, required)
                        Molecular phenotype matrix
  --phenotype-id-type 'gene_id'
                        Whether the input data is named by gene_id or gene_name.
                        By default it is gene_id, if not, please change it to
                        gene_name
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 5h
                        Wall clock time expected
  --mem 16G
                        Memory expected
  --numThreads 1 (as int)
                        Number of threads
  --container ''
  --entrypoint {('micromamba run -a "" -n' + ' ' + container.split('/')[-1][:-4]) if container.endswith('.sif') else f''}


Sections
  region_list_generation:
  annotate_coord_gene:
    Workflow Options:
      --sample-participant-lookup . (as path)
                        A file to map sample ID from expression to genotype,
                        must contain two columns, sample_id and participant_id,
                        mapping IDs in the expression files to IDs in the
                        genotype (these can be the same).
  annotate_coord_protein:
    Workflow Options:
      --sample-participant-lookup . (as path)
                        A file to map sample ID from expression to genotype,
                        must contain two columns, sample_id and participant_id,
                        mapping IDs in the expression files to IDs in the
                        genotype (these can be the same).
      --protein-name-index . (as path)
      --protein-ID-type SOMAseqID
  annotate_coord_biomart:
    Workflow Options:
      --ensembl-version VAL (as int, required)
  map_leafcutter_cluster_to_gene:
    Workflow Options:
      --intron-count VAL (as path, required)
                        Extract the code in case psichromatic needs to be
                        processed the same way PheoFile in this step is the
                        intron_count file
  annotate_leafcutter_isoforms:
    Workflow Options:
      --sample-participant-lookup . (as path)
  annotate_psichomics_isoforms:
    Workflow Options:
      --sample-participant-lookup . (as path)

### Step 1: Phenotype Annotation

In [None]:
[annotate_coord_protein]
# A file to map sample ID from expression to genotype, must contain two columns, sample_id and participant_id, mapping IDs in the expression files to IDs in the genotype (these can be the same).
parameter: sample_participant_lookup = path()
parameter: protein_name_index = path()
parameter: protein_ID_type = "SOMAseqID"
input: phenoFile, annotation_gtf
output: f'{cwd:a}/{_input[0]:bn}.bed.gz', f'{cwd:a}/{_input[0]:bn}.region_list.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output[0]:bn}'  
python: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint

    import pandas as pd
    import qtl.io
    from pathlib import Path

    def prepare_bed(df, bed_template_df, chr_subset=None):
        bed_df = pd.merge(bed_template_df, df, left_index=True, right_index=True)
        bed_df = bed_df.groupby('#chr', sort=False, group_keys=False).apply(lambda x: x.sort_values('start'))
        if chr_subset is not None:
            # The chromosome column is named '#chr' in the BED template
            bed_df = bed_df[bed_df['#chr'].isin(chr_subset)]
        return bed_df

    def load_and_preprocess_data(input_path, drop_columns):
        df = pd.read_csv(input_path, skiprows=0)
        # Drop coordinate columns if present; columns that are absent are ignored
        df = df.drop(columns=drop_columns, errors='ignore')
        return df

    def rename_samples_using_lookup(df, lookup_path):
        sample_participant_lookup = Path(lookup_path)
        if sample_participant_lookup.is_file():
            sample_participant_lookup_s = pd.read_csv(sample_participant_lookup, sep="\t", index_col=0, dtype={0:str,1:str})
            # Use the first remaining column as a {sample_id: participant_id} mapping
            df.rename(columns=sample_participant_lookup_s.iloc[:, 0].to_dict(), inplace=True)
        return df

    def load_bed_template(input_path, phenotype_id_type):
        if sum(qtl.io.gtf_to_tss_bed(input_path, feature='gene',phenotype_id = "gene_id").index.duplicated()) > 0:
            raise ValueError(f"gtf file {input_path} needs to be collapsed into gene model by reference data processing module")

        bed_template_df_id = qtl.io.gtf_to_tss_bed(input_path, feature='transcript', phenotype_id="gene_id")
        bed_template_df_name = qtl.io.gtf_to_tss_bed(input_path, feature='transcript', phenotype_id="gene_name")
        bed_template_df = bed_template_df_id.merge(bed_template_df_name, on=["chr", "start", "end"])
        bed_template_df.columns = ["#chr", "start", "end", "gene_id", "gene_name"]
        bed_template_df = bed_template_df.set_index(phenotype_id_type, drop=False)

        return bed_template_df

    df = load_and_preprocess_data(${_input[0]:ar}, ["chr", "start", "end"])
    protein_ID = df.columns.values[0]
    protein_name_index = Path("${protein_name_index:a}")
    if protein_name_index.is_file():
        df_info = pd.read_csv(protein_name_index).rename(columns={'${protein_ID_type}': protein_ID, 'EntrezGeneSymbol':'gene_name'})[['gene_name',protein_ID,'UniProt']]
        df = df_info.merge(df, on=protein_ID).drop(protein_ID,axis=1)
    else:
        # Split "<gene>|<UniProt>" identifiers at the first '|' only
        df[[protein_ID, 'UniProt']] = df[protein_ID].astype(str).str.split('|', n=1, expand=True)
    df.set_index(df.columns[0], inplace=True)
    df = rename_samples_using_lookup(df, "${sample_participant_lookup:a}")
    bed_template_df = load_bed_template(${_input[1]:ar}, "${phenotype_id_type}")
    bed_df = prepare_bed(df, bed_template_df)
    bed_df["ID"] = bed_df["gene_id"] + "_" + bed_df["UniProt"]
    bed_df = bed_df.drop_duplicates("ID", keep=False)[["#chr","start","end","ID"] + df.drop(["UniProt"],axis=1).columns.values.tolist()]
    qtl.io.write_bed(bed_df, ${_output[0]:r})
    bed_df[["#chr","start","end","ID"]].assign(path = ${_output[0]:r}).to_csv(${_output[1]:r}, sep="\t", index = False)
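
The last few steps of the workflow split combined protein identifiers, build an `ID` of the form `<gene_id>_<UniProt>`, and discard every row whose `ID` is not unique. The sketch below reproduces that behaviour on toy data; the identifiers and the `gene_ids` dictionary are invented for illustration and stand in for the GTF-derived template.

```python
import pandas as pd

# Toy input: the first column holds "<gene_name>|<UniProt>" identifiers.
df = pd.DataFrame(
    {
        "protein": ["GENE_A|P11111", "GENE_B|P22222", "GENE_B|P22222"],
        "sample1": [0.1, 0.2, 0.3],
    }
)

# Split at the first '|' only, as in the workflow code.
df[["protein", "UniProt"]] = df["protein"].astype(str).str.split("|", n=1, expand=True)

# Hypothetical gene_name -> gene_id mapping standing in for the GTF template.
gene_ids = {"GENE_A": "ENSG_A", "GENE_B": "ENSG_B"}
df["ID"] = df["protein"].map(gene_ids) + "_" + df["UniProt"]

# keep=False removes every row that shares an ID, not just the later copies.
unique = df.drop_duplicates("ID", keep=False)
print(unique["ID"].tolist())  # ['ENSG_A_P11111']
```

Because `keep=False` drops all members of a duplicated group, a protein measured twice under the same gene and UniProt accession disappears from the output entirely rather than being collapsed to one row.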