# PROJECT

# Download 1000 Genomes Project Phase 3 phased data

First, we need to download the data from the 1000 Genomes Project. We will use the phased data from Phase 3, which are available at the following link: [1000 Genomes Project](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/). To download the data, we use the `wget` command as follows:

```bash
ftp_path="https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/"
file_name_prefix="ALL.chr"
file_name_suffix=".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz"
directory_path="./1000GenomesPhase3_phased/"
# Create the target directory if it does not exist yet
mkdir -p "$directory_path"
# Download the autosomes (chromosomes 1 to 22), one file per chromosome
for chr in {1..22}; do
    echo "Downloading chr$chr"
    wget "${ftp_path}${file_name_prefix}${chr}${file_name_suffix}" -P "$directory_path"
done
```

The data will be downloaded into the directory `1000GenomesPhase3_phased`. The files are in compressed VCF format and are split by chromosome, giving 22 files, one for each autosome.
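
Interrupted transfers can leave missing or empty files, so it may be worth checking the result before moving on. The following is a minimal sketch, assuming the file naming used in the download loop above; the function name `missing_chromosomes` is hypothetical and not part of this repository.

```bash
# Print the chromosomes whose VCF file is missing or empty in a directory.
# Assumes the 1000 Genomes file naming used in the download loop.
missing_chromosomes() {
    prefix="ALL.chr"
    suffix=".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz"
    for chr in $(seq 1 22); do
        # -s is true only if the file exists and is larger than zero bytes
        if [ ! -s "$1/${prefix}${chr}${suffix}" ]; then
            echo "chr${chr}"
        fi
    done
}
```

Calling `missing_chromosomes ./1000GenomesPhase3_phased` prints nothing when all 22 files are present. It only checks that each file exists and is non-empty, not that the download is complete or uncorrupted.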

# American individuals

The `data` folder contains four files with subsets of American individuals from the 1000 Genomes Project, including people with ancestry from Mexico, Peru, Colombia and Puerto Rico. We will use these files to filter the 1000 Genomes Project data and to benchmark our tools with different numbers of individuals. The files are:

- `AMR_pop_150_samples.txt`
- `AMR_pop_250_samples.txt`
- `AMR_pop_350_samples.txt`
- `AMR_pop_all_samples.txt`

These files were created by subsetting the individuals with AMR ancestry among the 1000 Genomes Project donors. However, some donors listed in the general data are not present in the VCF files. The final number of individuals for each file is:

| File Name | Requested Individuals | Missing Individuals | Individuals Found in VCF |
|-----------|-----------------------|---------------------|--------------------------|
| AMR_pop_150_samples.txt | 150 | 40 | 110 |
| AMR_pop_250_samples.txt | 250 | 77 | 173 |
| AMR_pop_350_samples.txt | 350 | 106 | 244 |
| AMR_pop_all_samples.txt | 497 | 150 | 347 |

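The counts of missing individuals can be reproduced by comparing each sample list with the sample IDs stored in a VCF header. The sketch below assumes one sample ID per line in both inputs; the function name `count_missing_samples` and the file `vcf_samples.txt` are hypothetical.

```bash
# Count the IDs in a sample list that are absent from a VCF's sample list.
# $1: requested sample IDs, one per line (e.g. data/AMR_pop_150_samples.txt)
# $2: sample IDs present in the VCF, one per line, obtained for example with
#     bcftools query -l some_chromosome.vcf.gz > vcf_samples.txt
count_missing_samples() {
    # -f "$2": read patterns from the VCF sample list
    # -F -x:   treat patterns as fixed strings that must match a whole line
    # -v -c:   count the lines of "$1" that match none of the patterns
    # grep exits with status 1 when the count is 0, hence the "|| true"
    grep -v -x -F -c -f "$2" "$1" || true
}
```

Since every chromosome file in the release should carry the same samples, the list extracted from any one of them should be enough.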

# Analysis

We will use [Snakemake](https://snakemake.readthedocs.io/en/stable/index.html) as the workflow manager to parallelize and automate our analysis. The workflow is defined in `Snakefile_gnomix`. The file `snk_cmmd.sh` is a bash script that runs the analysis on a cluster. It uses a Slurm profile, but it can easily be modified to use other [cluster systems](https://snakemake.readthedocs.io/en/stable/executing/cli.html), or even to run sequentially by removing the `profile` flag.
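
As a rough illustration of the two execution modes, the sketch below wraps both invocations in one function. It is not the content of `snk_cmmd.sh`: the function name `run_workflow`, the profile name `slurm` and the job limit of 22 are assumptions made for the example.

```bash
# Run the workflow on a cluster or locally, depending on the first argument.
# The profile name "slurm" and the limit of 22 jobs are placeholders.
run_workflow() {
    if [ "$1" = "cluster" ]; then
        # --profile loads the cluster settings; --jobs caps concurrent jobs
        snakemake --snakefile Snakefile_gnomix --profile slurm --jobs 22
    else
        # Without a profile Snakemake runs locally; with --cores 1
        # the jobs are executed one after another
        snakemake --snakefile Snakefile_gnomix --cores 1
    fi
}
```

Adding `--dry-run` to either command lists the jobs that would run without executing them, which is a cheap way to check the setup first.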

The workflow has two main steps:

## Preprocessing
In this step we extract the donors listed in the `AMR_pop_*_samples.txt` files from the 1000 Genomes Project VCF files.

```bcftools view -S {input.popcount} {input.vcf} --force-samples -Oz -o {output.vcf}```

Here, `{input.popcount}` is the file with the list of individuals, `{input.vcf}` is the VCF file from the 1000 Genomes Project, and `{output.vcf}` is the resulting VCF file with the subset of individuals. The `--force-samples` flag makes `bcftools` only warn about listed individuals that are absent from the VCF (the missing individuals discussed above) instead of stopping with an error.
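
Outside Snakemake, the same step can be run by hand and checked against the table of retained individuals. This is a minimal sketch, assuming `bcftools` is on the `PATH`; the function name `subset_and_count` and the file names in the usage note are hypothetical.

```bash
# Subset a VCF to the listed samples, then print how many samples were kept.
# $1: sample list, $2: input VCF, $3: output VCF (bgzip-compressed)
subset_and_count() {
    bcftools view -S "$1" --force-samples -Oz -o "$3" "$2" || return 1
    # bcftools query -l prints one sample ID per line of the output VCF
    bcftools query -l "$3" | wc -l
}
```

For example, `subset_and_count data/AMR_pop_150_samples.txt chr22.vcf.gz chr22.AMR_150.vcf.gz` would be expected to report 110 samples if the counts in the table hold.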

## Local Ancestry Inference
In this step we will use [G-Nomix](https://github.com/AI-sandbox/gnomix), a tool to infer local ancestry.
We will use the following command:

```python3 gnomix {input_vcf} {output_dir} {chromosome} False {pretrained_model}```

Here, the input VCF file comes from the previous step, the output directory is where the output files will be saved, the chromosome is the number of the chromosome being analyzed (we focused only on autosomal chromosomes), and the last argument is the model pretrained by the authors of the tool. The fourth argument controls whether the predicted ancestry is used for phasing correction, which is recommended if there is evidence of phasing errors in the data. We set it to `False`, so no phasing correction is applied.
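
For a single chromosome run outside Snakemake, the call can be wrapped as shown below. This is a sketch under assumptions: the function name `run_gnomix` is hypothetical, and the path to the G-Nomix entry script is passed as an argument because it depends on where the repository was cloned.

```bash
# Run G-Nomix on one chromosome without phasing correction.
# $1: path to the G-Nomix entry script   $2: query VCF from the preprocessing step
# $3: output directory                   $4: chromosome number
# $5: pretrained model file
run_gnomix() {
    # Create the output directory up front; harmless if it already exists
    mkdir -p "$3"
    # The literal False disables phasing correction, as in the command above
    python3 "$1" "$2" "$3" "$4" False "$5"
}
```

Looping over `$(seq 1 22)` and calling `run_gnomix` once per chromosome covers all autosomes; in this project that loop is handled by the Snakemake workflow instead.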

# Analysis of the results


