# Introduction

Over the last few decades, the human genome has attracted intense research interest. It was not until 2003 that the Human Genome Project, an international scientific research project, succeeded in sequencing the human genome for the first time. This milestone was accompanied by breakthroughs in DNA sequencing and genotyping technologies that have greatly influenced recent genetic endeavors. For example, in 2015 the 1000 Genomes Project, another international scientific effort, produced the most extensive catalogue of human genetic variation across populations worldwide. As a result, genotyping, the process of examining genetic variation amongst humans, has become ever more achievable for geneticists and for biotechnology companies like 23andMe. Examining such information opens pathways for investigating genetic ancestry, determining gene function, and even detecting and combating disease.

Given the significance of the 1000 Genomes Project's findings and the public availability of its data, this project will use that data. The aim is to replicate the findings of John Novembre et al. as described in the article "Genes mirror geography within Europe". In that study, genetic variation amongst 3,000 European individuals, collected by the Population Reference Sample project, is investigated. Interestingly, despite relatively low levels of genetic differentiation within the sample, a strong correlation between genetic variation and geographic distance was found. In fact, a plot of the clustered individuals closely resembles the map of Europe, as the name of the article suggests. However, given that worldwide genetic variation data is available thanks to the 1000 Genomes Project, this project will aim to extend the analysis from Europe to the whole world.

Before beginning the investigation, it is important to evaluate the data at hand using the information provided in the article "A global reference for human genetic variation", published in Nature at https://www.nature.com/articles/nature15393. As previously mentioned, the data set available through the 1000 Genomes Project will be used. The data set contains the reconstructed genomes of 2,548 reportedly healthy individuals from 26 self-reported populations in Africa, East Asia, Europe, South Asia, and the Americas. However, the International Genome Sample Resource (IGSR), an organization created to ensure the usability of the data, has expanded the data set to cover more populations. The genomes of the individuals were sequenced using both whole-genome sequencing and targeted exome sequencing. Variant discovery was handled with particular care: 24 sequence analysis tools were used along with machine learning techniques to minimize false positives and to balance sensitivity and specificity. In total, 88 million variant sites appear in the data set. The estimated power to detect SNPs and indels is >99% and >85%, respectively, at frequencies >1%, and >95% and >80% at frequencies >0.5%.

There are also possible issues with the data that may introduce bias. First, the few thousand individuals in this data set may not be fully representative of a world population of nearly 8 billion. A larger sample would of course be better, but for this project we will work with the data available through the 1000 Genomes Project. Additionally, the populations are not represented equally. The proportions of African, European, East Asian, South Asian, and Admixed American individuals in the sample are 26%, 20%, 20%, 19%, and 14%, respectively. Latin American ancestries are clearly the least represented, at only 14% of the data. In fact, the Latin American samples were collected only from Puerto Rico, Los Angeles, Colombia, and Peru. More information on the populations in the data can be accessed at https://www.internationalgenome.org/faq/which-populations-are-part-your-study/. It must also be noted that errors may exist in the data set, as the technologies used to sequence and generate the data are not perfect.

# Data Considerations & Ingestion

The data set mentioned above is housed on the IGSR website at https://www.internationalgenome.org/data and is openly available to the public. However, as mentioned at https://www.internationalgenome.org/IGSR_disclaimer, the IGSR data comes from many different owners and, as a result, different restrictions may apply to different pieces of data. It is also noted that the data should not be exploited to infringe on the rights of third parties or the data owners. As for data privacy, the data provided is anonymized and does not contain any medical or phenotype data that could identify individuals who are part of the sample. No attempt will be made to de-anonymize the data during the downloading and processing portions of the data ingestion pipeline.

In terms of ingesting the data, IGSR makes its data conveniently available through an FTP server at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp. Therefore, the data will be accessed and downloaded here. The genetic data of interest is mainly in 3 file types: FASTQ, BAM, and VCF. The VCF files contain data for all individuals in a single file whereas BAM and FASTQ files contain data for individual samples (a single person's genes). Therefore, the data ingestion code to download each file type will vary.

Of course, not all of the data will be downloaded at once. This is simply infeasible given storage constraints, as terabytes of data are available. Instead, individual files will be downloaded to a `data` directory when they are needed for analysis. To achieve this, a series of functions will make specific requests for data of different file types and units. When downloading VCF files, a specific chromosome (10, 20, X, etc.) can be specified. When downloading BAM files, an individual's identifier is specified along with a chromosome. Lastly, for FASTQ files, a specific sample can be specified and its corresponding data will be downloaded. This scheme allows for flexibility and prevents the downloading of unwanted files.
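
The download-on-demand idea described above can be sketched in Python as follows. This is a minimal illustration, not the contents of `etl.py`: the function names, the injectable `downloader` argument, and the path template are assumptions made for the example. In particular, the directory layout under the FTP base is not given in this document, so the template must be filled in from the actual server listing.

```python
import os
import urllib.request

# Base location quoted in the text above.
FTP_BASE = "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp"


def build_url(relative_path, base=FTP_BASE):
    """Join the FTP base and a path relative to it with exactly one slash."""
    return base.rstrip("/") + "/" + relative_path.lstrip("/")


def vcf_relative_path(chrom, template):
    """Fill a chromosome label (e.g. 10, 20, "X") into a path template.

    The template is supplied by the caller (hypothetical here); it must be
    checked against the real directory listing on the server.
    """
    return template.format(chrom=chrom)


def fetch_if_missing(relative_path, data_dir="data", base=FTP_BASE,
                     downloader=urllib.request.urlretrieve):
    """Download one file into data_dir unless a local copy already exists.

    Returns the local path. `downloader` is injectable so the logic can be
    exercised without network access.
    """
    os.makedirs(data_dir, exist_ok=True)
    local_path = os.path.join(data_dir, os.path.basename(relative_path))
    if not os.path.exists(local_path):
        downloader(build_url(relative_path, base), local_path)
    return local_path
```

Because the file is only fetched when it is absent from `data`, repeated analysis runs do not trigger repeated downloads.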

As of now, there are no plans to extract data from other sources, as this project is specific to 1000 Genomes Project data. It is therefore sufficient for the data ingestion pipeline to work robustly with this FTP server only.

**Note:** The script `etl.py` was written to automate this process. This script allows the user to download the FASTQ files for a specific sample, download a sample's BAM file for a specific chromosome, or extract all VCF files from the FTP server.

# General Workflow

## File Conversion

The end goal of this project is to visualize genetic variation amongst individuals in our sample. In order to do this, Variant Call Format (VCF) files will need to be generated as they contain the genetic variation information that is needed. Thankfully, IGSR provides these files through their FTP. However, one would typically begin with a FASTQ file, convert it into a BAM file, and eventually convert it into a VCF file. For the sake of completeness, this conversion process will be provided.

**<u>FASTQ to BAM</u>**

1. First we gather the reference genome that will be used for alignment:

> ```console
user@dsmlp:~$ wget [reference path]
```

2. Next we will generate the BWA index:

> ```console
user@dsmlp:~$ bwa index [reference path]
```


3. Now we will generate the FASTA file index using SAMtools:

> ```console
user@dsmlp:~$ samtools faidx [reference path]
```

**Note:** These first three steps are only needed if you don't have a reference genome file and its corresponding index files. Thankfully, these files are provided to us on the DSMLP server, so we can access them there instead.

4. The sequence dictionary will be generated using Picard:

> ```console
user@dsmlp:~$ java -jar [picard.jar path] \
                        CreateSequenceDictionary \
                        REFERENCE=[reference path] \
                        OUTPUT=[reference name].dict
```

5. Prepare the read group information. The read group will act as the metadata of our FASTQ file:

> Example read group for sample HG00096: <b>"@RG\tID:group1\tSM:HG00096\tPL:illumina\tLB:lib1\tPU:unit1"</b>
    
6. The FASTQ file will now be mapped to the reference file in order to create a SAM file:

> ```console
user@dsmlp:~$ bwa mem -R '<read group info from above>' \
                  -p [reference path] \
                  [FASTQ path] > [SAM name].sam
```


7. Now that the SAM file is created, this file can be converted into a BAM file:

> ```console
user@dsmlp:~$ java -jar [picard.jar path] \
                    SortSam \
                    INPUT=[SAM name].sam \
                    OUTPUT=[BAM name].bam \
                    SORT_ORDER=coordinate
```  

8. Now that we have a BAM file, we must create an index (.bai) file for it:
> ```console
user@dsmlp:~$ samtools index [BAM name].bam  [BAM name].bam.bai
``` 
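
Steps 5 through 8 above can be assembled programmatically. The sketch below only builds the argument lists and does not run anything; it is an illustration and not the contents of `conversion.py`. The function names and the `prefix` parameter are hypothetical, and the default read group values simply mirror the HG00096 example above.

```python
import shlex


def read_group(sample, group_id="group1", platform="illumina",
               library="lib1", unit="unit1"):
    """Build the @RG string with tabs written as a literal backslash-t,
    matching the form shown in step 5."""
    fields = [f"ID:{group_id}", f"SM:{sample}", f"PL:{platform}",
              f"LB:{library}", f"PU:{unit}"]
    return "@RG\\t" + "\\t".join(fields)


def fastq_to_bam_commands(fastq, reference, sample, picard_jar, prefix):
    """Return the argument lists for steps 6-8 without executing them."""
    sam, bam = f"{prefix}.sam", f"{prefix}.bam"
    return [
        # Step 6: align reads; the caller must redirect stdout to `sam`.
        ["bwa", "mem", "-R", read_group(sample), "-p", reference, fastq],
        # Step 7: sort by coordinate and write the BAM file.
        ["java", "-jar", picard_jar, "SortSam", f"INPUT={sam}",
         f"OUTPUT={bam}", "SORT_ORDER=coordinate"],
        # Step 8: index the BAM file.
        ["samtools", "index", bam, f"{bam}.bai"],
    ]


if __name__ == "__main__":
    # Placeholder file names, for display only.
    for cmd in fastq_to_bam_commands("sample.fastq", "ref.fa", "HG00096",
                                     "picard.jar", "HG00096"):
        print(shlex.join(cmd))
```

Keeping each command as a list of arguments avoids shell quoting problems with the read group string when the commands are later passed to `subprocess.run`.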

**<u>BAM to VCF</u>**
1. Finally, we can convert the BAM file to a VCF file:

> ```console
user@dsmlp:~$ gatk \
                HaplotypeCaller \
                -R [reference path] \
                -I [BAM name].bam -O [VCF name].vcf
```

**Note:** The script `conversion.py` was written to automate this process. This script allows the conversion of FASTQ to BAM, BAM to VCF, or FASTQ straight to VCF.

## VCF Preparation and Visualization

The general workflow for the FASTQ->BAM->VCF conversion was outlined above. However, please note that we will be focusing on the VCF files provided by the IGSR FTP server. We therefore need software that is capable of manipulating and analyzing VCF files, and we will use PLINK2, an open-source genome analysis toolset, for this purpose. This tool will be used to filter SNPs, recode data, run PCA to detect outliers, and run PCA to visualize population substructure. The necessary steps for each of these processes, along with some preparatory steps, are outlined below.

**<u>Merge VCFs Across Chromosomes</u>**\
The 1000 Genomes data set contains separate VCF files for chromosomes 1-22. Since we are interested in genetic variation amongst individuals, merging these files together in order to get a full view of the data is necessary.
    
\
Let's begin by storing all VCF file paths and names into a list file `input.list`:

> ```console
user@dsmlp:~$ ls [filepath] \
                | grep '\.vcf\.gz' \
                | grep -v '\.tbi' \
                | sed -e 's/^/[filepath]/' > input.list
```

Now we will be storing the header of a single VCF file so that we can later concatenate to it:

> ```console
user@dsmlp:~$ zgrep '^#' "$(head -1 input.list)" | gzip > full_chroms.vcf.gz
```

And finally we concatenate all VCF files to a single file `full_chroms.vcf.gz`:
> ```console
user@dsmlp:~$ zgrep -h -v "^#" $(cat input.list) | gzip >> full_chroms.vcf.gz
```
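
The same merge can be sketched in pure Python, which makes the logic explicit: the header of the first file is written once, followed by the data records of every file. This is an illustrative alternative to the shell commands above, not the author's implementation. The function names are hypothetical, and the sort key assumes file names that contain `chr<label>.`, which should be checked against the actual files.

```python
import gzip
import os
import re


def chrom_sort_key(path):
    """Order chr2 before chr10; plain alphabetical order would not.

    Assumes (hypothetically) that the file name contains "chr<label>.".
    """
    name = os.path.basename(path)
    match = re.search(r"chr(\w+?)\.", name)
    label = match.group(1) if match else name
    return (0, int(label), "") if label.isdigit() else (1, 0, label)


def concat_vcfs(paths, out_path):
    """Write the first file's header once, then the records of every file."""
    with gzip.open(out_path, "wt") as out:
        for index, path in enumerate(paths):
            with gzip.open(path, "rt") as handle:
                for line in handle:
                    if line.startswith("#"):
                        # Header lines are kept from the first file only.
                        if index == 0:
                            out.write(line)
                    else:
                        out.write(line)
```

A call such as `concat_vcfs(sorted(paths, key=chrom_sort_key), "full_chroms.vcf.gz")` would keep the chromosomes in numeric order in the merged file.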


**<u>Recode Data</u>**\
Since we now have a single VCF file, we must apply a series of variant filters to our data and recode it into an analysis ready format (.bed, .bim, etc.) using PLINK2.
    
This is all conveniently done using this one command:

> ```console
user@dsmlp:~$ plink2 \
                  --vcf [filepath]/full_chroms.vcf.gz \
                  --snps-only \
                  --maf 0.05 \
                  --geno 0.1 \
                  --mind 0.05 \
                  --recode \
                  --allow-extra-chr \
                  --make-bed \
                  --out [filepath]/chromosomes
```

Note that several parameters were passed in to filter the data. Each is explained and justified below. Any of these parameters can be changed later if necessary, since the code will expose them as options:

- **--snps-only:** For the purposes of this project, we are focused on genetic variation. Since SNPs are the most common form of genetic variation, we will be filtering the data to include only SNPs.

- **--maf:** Controls the minor allele frequency threshold. It is common practice to set this threshold at 5% to differentiate between rare and common alleles, so we will work with a 5% threshold.

- **--geno:** Controls the missing call rate threshold for variants. We will keep the default value of 0.1, meaning we keep only SNPs with no more than 10% missing data.

- **--mind:** Similar to --geno, but filters out samples whose rate of missing calls exceeds a threshold. We will work with a threshold of 5% here.

- **--recode:** Generates a text file set from the VCF file taken in as input (.ped and .map).

- **--allow-extra-chr:** Permits chromosome or contig codes that PLINK2 does not recognize by default. It is included so that the command runs on the merged, multi-chromosome file even if such codes are present.

- **--make-bed:** Generates a binary file set from the VCF taken in as input (.bed, .bim, and .fam).
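
Since the thresholds are meant to be adjustable, the command can be built from parameters. The following is a minimal sketch that uses only the flags listed above; the function name and the range check are assumptions for illustration and are not taken from `process_data.py`.

```python
def build_recode_command(vcf, out_prefix, maf=0.05, geno=0.1, mind=0.05):
    """Assemble the plink2 filter-and-recode call as an argument list."""
    for name, value in (("maf", maf), ("geno", geno), ("mind", mind)):
        # All three thresholds are rates, so reject anything outside [0, 1].
        if not 0 <= value <= 1:
            raise ValueError(f"{name} must lie between 0 and 1, got {value}")
    return [
        "plink2",
        "--vcf", vcf,
        "--snps-only",
        "--maf", str(maf),
        "--geno", str(geno),
        "--mind", str(mind),
        "--recode",
        "--allow-extra-chr",
        "--make-bed",
        "--out", out_prefix,
    ]
```

The returned list can be handed to `subprocess.run(cmd, check=True)`, so that changing a threshold is a matter of changing one keyword argument.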

**<u>Run PCA to Detect Outliers</u>** \
We will now be running an initial PCA to detect potential outliers in the data. 

PCA is run on the data generated above with the following command:

> ```console
user@dsmlp:~$ plink2 \
                  --bfile [filepath]/chromosomes \
                  --pca [num_pc] \
                  --allow-extra-chr \
                  --out chrom_pc
```

The **--pca** argument is used to specify the number of principal components needed. We will be working with the top three principal components so that we can visualize the genomic clusters in three-dimensional space.

To identify outliers, we will read the principal component (eigenvector) file produced by the above command into Python. Samples that lie more than 3 standard deviations from the mean will be deemed outliers and stored in the file `outliers.txt`.
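
The outlier rule described above can be sketched with the Python standard library alone. Several details are assumptions made for this example: the eigenvector file is taken to be whitespace-delimited with a header row in which component columns start with `PC`, a sample is flagged if it exceeds the threshold on at least one component, and the ID columns are written out unchanged. The layout that `--remove` expects should be confirmed against the PLINK2 documentation.

```python
import statistics


def read_eigenvec(path):
    """Parse an eigenvector file into a list of (ids, pcs) pairs.

    Columns whose header starts with "PC" are treated as components; all
    other columns are treated as sample identifiers (an assumption).
    """
    with open(path) as handle:
        header = handle.readline().lstrip("#").split()
        pc_idx = [i for i, name in enumerate(header) if name.startswith("PC")]
        id_idx = [i for i in range(len(header)) if i not in pc_idx]
        rows = []
        for line in handle:
            parts = line.split()
            if parts:
                ids = tuple(parts[i] for i in id_idx)
                pcs = [float(parts[i]) for i in pc_idx]
                rows.append((ids, pcs))
    return rows


def find_outliers(rows, n_sd=3.0):
    """Return IDs of samples more than n_sd standard deviations from the
    mean on at least one principal component."""
    if not rows:
        return []
    flagged = set()
    for k in range(len(rows[0][1])):
        values = [pcs[k] for _, pcs in rows]
        mean = statistics.mean(values)
        sd = statistics.pstdev(values)
        if sd == 0:
            continue  # a constant component cannot flag anything
        for ids, pcs in rows:
            if abs(pcs[k] - mean) > n_sd * sd:
                flagged.add(ids)
    return [ids for ids, _ in rows if ids in flagged]


def write_outliers(outlier_ids, path="outliers.txt"):
    """Write one outlier per line, ID columns separated by tabs."""
    with open(path, "w") as handle:
        for ids in outlier_ids:
            handle.write("\t".join(ids) + "\n")
```

If `find_outliers` returns an empty list, no file needs to be written and the removal step can be skipped.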

**<u>Remove Outlier Samples & Rerun PCA</u>**\
We will now be filtering out outliers. If no outliers were found in the previous step, this step will be ignored. Otherwise, the following command will be run to remove samples and rerun PCA.

> ```console
user@dsmlp:~$ plink2 \
                  --bfile [filepath]/chromosomes \
                  --pca [num_pc] \
                  --remove outliers.txt \
                  --allow-extra-chr \
                  --out chrom_pc
```

Note that the only additional argument here, **--remove**, is given the path to the text file produced in the previous step, which lists the samples we would like to remove.
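
The "skip if there are no outliers" rule can be folded into the command construction. This is a small sketch using only the flags shown above; the function name and the choice to treat a missing or empty file as "no outliers" are assumptions for illustration.

```python
import os


def build_pca_command(bfile, num_pc, out_prefix, remove_file=None):
    """Assemble the plink2 PCA call, adding --remove only when needed."""
    cmd = ["plink2", "--bfile", bfile, "--pca", str(num_pc)]
    # Treat a missing or empty outlier file as "nothing to remove".
    if (remove_file and os.path.exists(remove_file)
            and os.path.getsize(remove_file) > 0):
        cmd += ["--remove", remove_file]
    cmd += ["--allow-extra-chr", "--out", out_prefix]
    return cmd
```

With this approach the first PCA run and the rerun share one code path, and the rerun differs only by the presence of `--remove`.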

**<u>Visualize Population Substructure</u>**\
Now that we have filtered out the outlier samples, we can finally visualize population substructure. Again, we will read in the principal component (eigenvector) file produced in the previous step. Additionally, IGSR conveniently provides data on every sample and its associated population. This will be used to map colors to the different super populations when plotting. This data has been slightly modified and stored in the file `sample_pop.csv`.
    
We can now generate and produce the final plot using the Python library Plotly:

<img align="left" src="notebook-resources/cluster1.png" width="460"/>
<img align="left" src="notebook-resources/cluster2.png" width="530"/>


**Note:** The script `process_data.py` was written to automate this process. This script allows you to choose maf, geno, and mind parameters to filter the master VCF file appropriately and even specify the number of principal components.
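
The plotting step described above could look roughly like the sketch below. It is not the code that produced the figures: the column names `sample` and `super_pop` are hypothetical, since the layout of `sample_pop.csv` is not shown here, and `plotly` is a third-party dependency that is imported only when the figure is actually built.

```python
import csv
from collections import defaultdict


def load_super_pops(csv_path, sample_col="sample", pop_col="super_pop"):
    """Read a sample -> super population mapping (column names assumed)."""
    with open(csv_path, newline="") as handle:
        return {row[sample_col]: row[pop_col]
                for row in csv.DictReader(handle)}


def group_by_super_pop(coords, sample_to_pop):
    """Group (PC1, PC2, PC3) points by super population.

    `coords` maps sample ID -> (PC1, PC2, PC3). Samples without a known
    population are collected under "unknown".
    """
    groups = defaultdict(list)
    for sample, point in coords.items():
        groups[sample_to_pop.get(sample, "unknown")].append(point)
    return dict(groups)


def make_figure(groups):
    """Build a 3D scatter plot with one colored trace per super population."""
    import plotly.graph_objects as go  # third-party, imported lazily

    fig = go.Figure()
    for pop, points in sorted(groups.items()):
        xs, ys, zs = zip(*points)
        fig.add_trace(go.Scatter3d(x=xs, y=ys, z=zs, mode="markers",
                                   name=pop, marker=dict(size=3)))
    fig.update_layout(scene=dict(xaxis_title="PC1", yaxis_title="PC2",
                                 zaxis_title="PC3"))
    return fig
```

Using one trace per super population lets Plotly assign a distinct color and legend entry to each group without any manual color mapping.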

# Description of Results & Limitations

The purpose of this project was to map genetic variation amongst individuals in the 1000 Genomes data set and see how the different samples cluster together. The final number of samples included remained at 2548, meaning that no samples were filtered out or found to be outliers in the pipeline. In the PCA plot above, based on approximately 70-80 million SNPs from chromosomes 1-22, the samples clearly cluster together into their respective populations. The plot also shows the genetic relationships between the different populations. For example, the South Asian cluster appears to be the most isolated, which suggests that this population contains genetic variation that strongly distinguishes it from the other populations. The same can be said about the East Asian population, although it lies a bit closer to the other populations. The African, Admixed American, and European clusters are distinguishable but show some clear overlap. This may be due in part to historical events such as the colonization of the Americas by European nations and the Atlantic slave trade, which spanned roughly the 16th to 19th centuries. In short, the populations are genetically different, but some are more closely related than others.

Despite these findings, there are some limitations that should be noted. As previously mentioned, there are imbalances in who is represented in the data set, most notably the Admixed American population, which represents only 14% of the data set. It is possible that this small sample size is partly the reason why the Admixed American cluster has a dispersed, wing-like structure. A larger sample size may have helped create a more compact cluster or even helped find subclusters within it. Another issue that may have arisen from the data generation process is the presence of points that seem out of place given their respective populations. As an example, there is an individual of African ancestry right in the middle of the Admixed American cluster. Since the individuals self-reported their populations, this may be an artifact of self-reporting. Another limitation is that these clusters could not be mapped onto a map of the world, as was done in the project being replicated. The issue is that when plotted on a two-dimensional plane, the distances between the clusters and their positions do not reflect their true locations on the globe. A likely reason is that the original project focused only on European nations, which are tightly clustered according to the above plots, so the relatively short distances between them are more manageable than distances on a global scale. As distances get larger, more of the relative positioning between the clusters is lost. It is possible that more data could correct this issue in the future.

# Conclusion

Although the population clusters could not be mapped onto a map of the world, a lot was learned from this project. Genetic variation is indeed a powerful indicator of what population an individual belongs to. It is no surprise that companies like 23andMe use these methods to inform individuals of their genetic ancestries. With the collection of more and more genetic data, it is easy to imagine that these methods will continue to advance and offer more accurate results. In addition, the project showed that relationships between populations can be determined through genetic variation. It is very interesting to see how some populations are more closely associated with one population than with others, especially when this can be explained by historical events. It goes to show that the actions and decisions we make as a society can ultimately affect the makeup of our future DNA. I wonder, however, whether humans will continue to diversify and form distinct clusters or whether, in an age of mobility and technology, admixture will foster a world of indistinguishable populations.