# make_reference workflow

This notebook handles the case where the GWAS data to be integrated with single-cell RNA-seq data come from a mixed-ancestry sample. Instead of using a single off-the-shelf 1000 Genomes data set, it uses PLINK to merge the ancestry-specific 1000 Genomes file sets into one reference data set from which to estimate linkage disequilibrium (LD).

## Load required packages

In [1]:
from pathlib import Path

from src.utils import make_work_dir, move_output, setup
import src.make_reference.commands as commands
from src.make_reference.classes import Ancestry
from src.make_reference.utils import use_bfiles

## Set up the working environment

The `work_dir` directory will contain any intermediate files that are generated as a part of this process. The `output_dir` directory should contain any final outputs.
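The directory layout described above can be sketched as follows. This is only an illustration, not the project's `setup` or `make_work_dir` code: the function name `make_dirs_sketch` is hypothetical, and the use of a random 32-character hex name for the working directory is an assumption based on the directory names printed by the next cell.

```python
import tempfile
from pathlib import Path
from typing import Tuple
from uuid import uuid4


def make_dirs_sketch(output_path: str) -> Tuple[Path, Path, Path]:
    """Hypothetical stand-in for setup() + make_work_dir(); not the project's code."""
    output_dir = Path(output_path)
    tmp_dir = output_dir.parent / "tmp"  # assumed sibling of the output directory
    work_dir = tmp_dir / uuid4().hex     # unique per run, so intermediates never collide
    for directory in (output_dir, work_dir):
        directory.mkdir(parents=True, exist_ok=True)
    return output_dir, tmp_dir, work_dir


# Demonstrate inside a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as root:
    output_dir, tmp_dir, work_dir = make_dirs_sketch(f"{root}/make_reference/output")
    created = (output_dir.is_dir(), work_dir.is_dir())

print(created)
```

Giving each run its own working directory means a failed or interrupted run cannot leave stale intermediates that a later run picks up by accident.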

In [2]:
output_dir, tmp_dir = setup("src/make_reference/output")
work_dir = make_work_dir(tmp_dir)

making directory make_reference/output
making directory make_reference/tmp
making directory tmp/6e368aee1e87b6848cb1aec10c990170


## Find available PLINK binary file sets to merge

- **bfile_dir**: the directory that contains the PLINK binary file sets
- **bfile_sets**: the PLINK binary file sets found in **bfile_dir** whose names appear to correspond to the requested ancestries

In [3]:
bfile_dir = Path("bin/1k_genomes")
bfile_sets = use_bfiles(bfile_dir, Ancestry.EUROPEAN, Ancestry.EAST_ASIAN)

Available bfile sets that match ancestry: 
bin/1k_genomes/eur/eur_cm_filled
bin/1k_genomes/eur/g1000_eur
bin/1k_genomes/eas/g1000_eas
bin/1k_genomes/merged_bfiles/merged_eur_eas_cm_filled
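
For readers who want a feel for how such a search might work, here is a minimal sketch. It is not the implementation of `use_bfiles`: the function name `find_bfile_prefixes` is hypothetical, and it assumes that an ancestry is identified by a short code such as `eur` or `eas` appearing in the file name. A PLINK binary file set is the trio `.bed`/`.bim`/`.fam` sharing one prefix, so the sketch only reports prefixes for which all three files exist.

```python
import tempfile
from pathlib import Path
from typing import List


def find_bfile_prefixes(bfile_dir: Path, *ancestry_codes: str) -> List[Path]:
    """Return prefixes of complete .bed/.bim/.fam trios whose name contains a code."""
    prefixes = []
    for bed in sorted(bfile_dir.rglob("*.bed")):
        # A file set is only usable if the matching .bim and .fam are present.
        if not all(bed.with_suffix(ext).exists() for ext in (".bim", ".fam")):
            continue
        prefix = bed.with_suffix("")
        if any(code in prefix.name for code in ancestry_codes):
            prefixes.append(prefix)
    return prefixes


# Demonstration on empty placeholder files in a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    base = Path(root)
    for name in ("eur/g1000_eur", "eas/g1000_eas", "afr/g1000_afr"):
        (base / name).parent.mkdir(parents=True, exist_ok=True)
        for ext in (".bed", ".bim", ".fam"):
            (base / (name + ext)).touch()
    (base / "eur" / "partial_eur.bed").touch()  # incomplete set: no .bim/.fam
    found = [p.name for p in find_bfile_prefixes(base, "eur", "eas")]

print(found)  # ['g1000_eas', 'g1000_eur']
```

Matching on a substring is deliberately loose, which is why previously merged sets such as `merged_eur_eas_cm_filled` can also show up in the listing above.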


## Merge the PLINK file sets

`commands.merge_bfiles` returns the tuple (**merged_bed**, **merged_bim**, **merged_fam**), three `pathlib.Path` objects pointing to the files of the merged file set.

Because several PLINK file sets can correspond to a given ancestry, **keep** lists the string patterns used to select which entries of **bfile_sets** are merged.
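
The effect of **keep** can be illustrated with a small filter. This is a sketch under the assumption that each pattern is matched as a plain substring of the file set's name; `select_bfile_sets` is a hypothetical helper, not part of `src.make_reference`.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def select_bfile_sets(bfile_sets: Iterable[str], keep: Iterable[str]) -> List[str]:
    """Keep only the file sets whose final path component contains a pattern."""
    patterns = list(keep)
    return [
        bfile
        for bfile in bfile_sets
        if any(pattern in PurePosixPath(bfile).name for pattern in patterns)
    ]


available = [
    "bin/1k_genomes/eur/eur_cm_filled",
    "bin/1k_genomes/eur/g1000_eur",
    "bin/1k_genomes/eas/g1000_eas",
    "bin/1k_genomes/merged_bfiles/merged_eur_eas_cm_filled",
]
selected = select_bfile_sets(available, keep=["g1000_eur", "g1000_eas"])
print(selected)  # ['bin/1k_genomes/eur/g1000_eur', 'bin/1k_genomes/eas/g1000_eas']
```

Choosing the full names `g1000_eur` and `g1000_eas` rather than the bare ancestry codes excludes both `eur_cm_filled` and the previously merged set.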

In [4]:
merged_bed, merged_bim, merged_fam = commands.merge_bfiles(bfile_sets, work_dir, keep=["g1000_eur", "g1000_eas"])

move_output(output_dir, merged_bed, merged_bim, merged_fam)
print(f"Moving output {merged_bed} from {work_dir} to {output_dir}")
print(f"Moving output {merged_bim} from {work_dir} to {output_dir}")
print(f"Moving output {merged_fam} from {work_dir} to {output_dir}")

Attempting to merge bfile sets ['bin/1k_genomes/eur/g1000_eur', 'bin/1k_genomes/eas/g1000_eas']
Failed to merge bfile sets ['bin/1k_genomes/eur/g1000_eur', 'bin/1k_genomes/eas/g1000_eas']. Attempting to prune problem variants before trying again
Pruning problem variants from bin/1k_genomes/eur/g1000_eur
Pruning problem variants from bin/1k_genomes/eas/g1000_eas
Retrying bfile set merge...
...success
Moving output src/make_reference/tmp/6e368aee1e87b6848cb1aec10c990170/merge.bed from src/make_reference/tmp/6e368aee1e87b6848cb1aec10c990170 to src/make_reference/output
Moving output src/make_reference/tmp/6e368aee1e87b6848cb1aec10c990170/merge.bim from src/make_reference/tmp/6e368aee1e87b6848cb1aec10c990170 to src/make_reference/output
Moving output src/make_reference/tmp/6e368aee1e87b6848cb1aec10c990170/merge.fam from src/make_reference/tmp/6e368aee1e87b6848cb1aec10c990170 to src/make_reference/output
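
The log shows a merge that fails, a pruning pass over each input, and a successful retry. The sketch below shows one way such a flow could be driven with PLINK 1.9; it is not the code behind `commands.merge_bfiles`. The flags `--bfile`, `--bmerge`, `--exclude`, `--make-bed` and `--out` are standard PLINK 1.9 options. The function names, the `pruned_N` prefixes, and the assumption that the failed merge leaves its list of problem variants in `<out>-merge.missnp` are illustrative and should be checked against the PLINK version in use. The command runner is injectable so the control flow can be exercised without PLINK installed.

```python
import subprocess
from pathlib import Path
from typing import Callable, List


def merge_command(first: str, second: str, out_prefix: str) -> List[str]:
    """PLINK 1.9 call that merges two binary file sets into out_prefix.*"""
    return ["plink", "--bfile", first, "--bmerge", second, "--make-bed", "--out", out_prefix]


def prune_command(bfile: str, variant_list: str, out_prefix: str) -> List[str]:
    """PLINK 1.9 call that drops the variants listed in variant_list."""
    return ["plink", "--bfile", bfile, "--exclude", variant_list, "--make-bed", "--out", out_prefix]


def run_plink(cmd: List[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(cmd).returncode


def merge_with_retry(first: str, second: str, work_dir: Path,
                     run: Callable[[List[str]], int] = run_plink) -> bool:
    out = str(work_dir / "merge")
    if run(merge_command(first, second, out)) == 0:
        return True
    # Assumption: the failed merge wrote its problem variants to this file.
    missnp = f"{out}-merge.missnp"
    pruned = []
    for index, bfile in enumerate((first, second)):
        pruned_prefix = str(work_dir / f"pruned_{index}")
        if run(prune_command(bfile, missnp, pruned_prefix)) != 0:
            return False
        pruned.append(pruned_prefix)
    return run(merge_command(pruned[0], pruned[1], out)) == 0


# Dry run with a fake runner: the first merge "fails", every later call "succeeds".
calls: List[List[str]] = []


def fake_run(cmd: List[str]) -> int:
    calls.append(cmd)
    return 1 if len(calls) == 1 else 0


merged = merge_with_retry("bin/1k_genomes/eur/g1000_eur",
                          "bin/1k_genomes/eas/g1000_eas",
                          Path("work"), run=fake_run)
print(merged, len(calls))  # True 4
```

The dry run issues four commands: the failed merge, one pruning pass per input, and the retried merge, which mirrors the sequence of messages in the log above.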
