# Tutorial 1: Salmonella dataset

# 1. Background

This tutorial uses an index built from a *Salmonella enterica* dataset previously published (https://doi.org/10.1128/jcm.02200-15). To prepare the data, the genomes were downloaded, downsampled (to a minimum of 30x coverage) to reduce data size, and run through [snippy](https://github.com/tseemann/snippy) to identify SNVs. These files were then loaded into a genomics index using the `gdi load snippy` command.

# 2. Getting data

Let's first download the data for this tutorial. To do this please run the below commands:

*Note: In a Jupyter Python notebook, prepending a command with `!` runs the command in a shell instead of the Python interpreter (e.g., `!unzip` runs the command `unzip`).*

In [None]:
!wget -O salmonella-project.zip https://ndownloader.figshare.com/files/27771615?private_link=0405199820a13aedca42
!unzip -n salmonella-project.zip | head -n 3
!echo
!ls

Great. Now that we've got some data (in the `salmonella-project/` directory), let's explore the command-line interface to access this data.

# 3. Command-line interface (`gdi`)

Let's first print out the version of `gdi` being used.

In [None]:
!gdi --version

## 3.1 List samples

The below command lists the samples loaded in this index. Note that instead of passing the project with `--project-dir` you can change to the project directory and `gdi` will figure out which project you're in.

In [None]:
!gdi --project-dir salmonella-project list samples | head -n 5

## 3.2 List reference genomes

This lists all the loaded reference genomes in this genomics index. This can be useful for commands later which require you to pass the name of the reference genome.

In [None]:
!gdi --project-dir salmonella-project list genomes

## 3.3 Query for a particular mutation

This searches for a particular mutation (in this case, the reference sequence `CG` at position 2865334 is replaced by `C`, i.e., the `G` is deleted). If you are familiar with the VCF format, the mutation is given in `CHROM:POS:REF:ALT` format.

In [None]:
!gdi --project-dir salmonella-project query mutation 'NC_011083.1:2865334:CG:C' | column -s$'\t' -t
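
To make the `CHROM:POS:REF:ALT` identifier more concrete, here is a minimal sketch that splits such an identifier into its fields and classifies the change by comparing the lengths of `REF` and `ALT`. The names `Mutation`, `parse_mutation_id` and `mutation_type` are hypothetical and exist only for this illustration; they are not part of `gdi`.

```python
from typing import NamedTuple


class Mutation(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def parse_mutation_id(mutation_id: str) -> Mutation:
    """Split a 'CHROM:POS:REF:ALT' string into its four fields (hypothetical helper)."""
    chrom, pos, ref, alt = mutation_id.split(':')
    return Mutation(chrom, int(pos), ref, alt)


def mutation_type(m: Mutation) -> str:
    """Classify by allele length: equal -> substitution, shorter ALT -> deletion, longer ALT -> insertion."""
    if len(m.ref) == len(m.alt):
        return 'substitution'
    return 'deletion' if len(m.ref) > len(m.alt) else 'insertion'


m = parse_mutation_id('NC_011083.1:2865334:CG:C')
kind = mutation_type(m)
print(m, kind)
```

Here `REF` (`CG`) is longer than `ALT` (`C`), so the change is classified as a deletion.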

## 3.4 Build an alignment

To build an alignment for use in further phylogenetics software you can do the following.

In [None]:
!gdi --project-dir salmonella-project build alignment --reference-name S_HeidelbergSL476 --align-type full --output-file out.aln

In [None]:
!head -n 2 out.aln

This will produce an alignment with length equal to the reference genome, masking any missing data or gaps with `N`, and concatenating individual sequences (contigs) in the alignment together to construct a single, whole-genome alignment.

You can pass in specific samples to include with `--sample`.
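
The idea behind a `full` alignment can be sketched in a few lines. This is only an illustration of the description above, not the `gdi` implementation: the toy reference, the variant and missing-position inputs, and the function `build_alignment_row` are all assumptions made for this example.

```python
def build_alignment_row(reference, snvs, missing):
    """Build one sample's row of a whole-genome alignment.

    reference: dict of contig name -> reference sequence (dict order is the concatenation order)
    snvs:      dict of (contig, 1-based position) -> alternate base
    missing:   set of (contig, 1-based position) with no reliable call
    """
    row = []
    for contig, seq in reference.items():
        bases = list(seq)
        for i in range(len(bases)):
            key = (contig, i + 1)
            if key in missing:
                bases[i] = 'N'          # mask missing data or gaps
            elif key in snvs:
                bases[i] = snvs[key]    # apply the sample's substitution
        row.append(''.join(bases))
    return ''.join(row)                 # concatenate contigs into one sequence


reference = {'contig1': 'ACGTAC', 'contig2': 'GGTT'}
row = build_alignment_row(
    reference,
    snvs={('contig1', 2): 'T', ('contig2', 4): 'A'},
    missing={('contig1', 5)},
)
print(row)
```

Every row has the same length as the concatenated reference, which is what lets the rows line up as an alignment.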

# 4. Python API (Genomics Data Index API)

Now let's move on to the Python interface for loading and querying data in this index. This is a much more powerful (and flexible) way to work with your data, which attempts to provide seamless data flow between this index and Python [pandas](https://pandas.pydata.org/).

We first start out by connecting to our project through `GenomicsDataIndex`.

In [None]:
import genomics_data_index.api as gdi

db = gdi.GenomicsDataIndex.connect('salmonella-project')
db

Great. We're connected. The `samples=59` tells us how many samples are in this database.

You can run `db.sample_names()` or `db.reference_names()` to list the samples and reference genomes in this database.

In [None]:
db.sample_names()[0:5]

In [None]:
db.reference_names()

## 4.1 List all features (mutations)

To list a summary of all indexed mutations we can do:

In [None]:
summary_df = db.mutations_summary(reference_genome='S_HeidelbergSL476').sort_values('Count', ascending=False)
summary_df

This gives us back a DataFrame summarizing the mutations.

The **Count**, **Total**, and **Percent** columns tell us how many samples have a particular mutation.

The **Annotation** and beyond columns give us detailed information about the mutation's impact (as derived from [snpeff](https://pcingola.github.io/SnpEff/) if this was run on the genomes (VCF files) prior to indexing).

Let's look for mutations where less than half of the genomic samples have a particular mutation.

In [None]:
summary_df[summary_df['Percent'] < 50].head(5)
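
As a rough sketch of what the **Count**, **Total**, and **Percent** columns represent, the following computes them by hand from made-up data. I am assuming here that **Percent** is `100 * Count / Total`; the sample names and mutations are invented for the example.

```python
# Toy data: sample name -> set of mutations it carries (made up for illustration)
sample_mutations = {
    'S1': {'chr:10:A:T', 'chr:20:G:C'},
    'S2': {'chr:10:A:T'},
    'S3': {'chr:10:A:T', 'chr:30:T:G'},
    'S4': set(),
}

total = len(sample_mutations)
counts = {}
for mutations in sample_mutations.values():
    for mutation in mutations:
        counts[mutation] = counts.get(mutation, 0) + 1

# One row per mutation: (mutation, Count, Total, Percent)
rows = [(m, c, total, 100 * c / total) for m, c in counts.items()]
rows.sort(key=lambda r: (-r[1], r[0]))  # highest Count first, then by name

# Keep mutations found in fewer than half of the samples
rare = [r for r in rows if r[3] < 50]
for r in rows:
    print(r)
```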

Let's use this table to select a mutation to search for. We will select `NC_011083.1:1000708:T:C`.

## 4.2 Search for a particular mutation (`hasa()`)

We can search for genomic samples containing a particular mutation by starting a query and using the `hasa()` method.

In [None]:
q = db.samples_query().hasa('NC_011083.1:1000708:T:C', kind='mutation')
q

The `hasa()` method can be read as "select samples that **have a** particular mutation". The selected samples are printed above (30/59 have this particular mutation).

### 4.2.1. Not including `kind`

The default **kind** for `hasa()` is `mutation` so you can leave out `kind='mutation'` if you wish.

In [None]:
q = db.samples_query().hasa('NC_011083.1:1000708:T:C')
q

### 4.2.2. Select by HGVS id

We can also select samples using the [HGVS](https://varnomen.hgvs.org/) identifier.

In [None]:
q = db.samples_query().hasa('hgvs:NC_011083.1:SEHA_RS05365:p.L110P')
q

The HGVS identifier is given as `hgvs:sequence:gene_locus_id:variant`. You can find the corresponding HGVS identifiers from the `mutations_summary()` table shown above.

In [None]:
summary_df.loc['NC_011083.1:1000708:T:C'][['ID_HGVS.c', 'ID_HGVS.p']]

*Note: As of right now, HGVS identifiers are derived when using SnpEff, so this option will only work if SnpEff was run on the VCF files and corresponding HGVS identifiers were stored in the index.*
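
The structure of the `hgvs:sequence:gene_locus_id:variant` identifier can be illustrated with a small parser. The function `parse_hgvs_id` is a hypothetical helper written for this example and is not part of the API; it only relies on the HGVS convention that `p.` marks a protein-level change and `c.` a coding-DNA change.

```python
def parse_hgvs_id(hgvs_id):
    """Split 'hgvs:sequence:gene_locus_id:variant' into a dict (hypothetical helper)."""
    prefix, sequence, gene, variant = hgvs_id.split(':', 3)
    if prefix != 'hgvs':
        raise ValueError(f'Not an HGVS identifier: {hgvs_id}')
    # 'p.' marks a protein-level change and 'c.' a coding-DNA change in HGVS nomenclature
    level = {'p.': 'protein', 'c.': 'coding DNA'}.get(variant[:2], 'other')
    return {'sequence': sequence, 'gene': gene, 'variant': variant, 'level': level}


protein = parse_hgvs_id('hgvs:NC_011083.1:SEHA_RS05365:p.L110P')
coding = parse_hgvs_id('hgvs:NC_011083.1:SEHA_RS17780:c.1080T>C')
print(protein)
print(coding)
```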

### 4.2.3. Unknown samples

You may notice that in the results of the query we find `unknown=2% (1/59) samples`.

In [None]:
db.samples_query().hasa('hgvs:NC_011083.1:SEHA_RS05365:p.L110P')

This shows us that there is 1 sample where it is unknown whether or not it has the `hgvs:NC_011083.1:SEHA_RS05365:p.L110P` mutation. This can happen when this region of the genome was identified as having gaps (`-`) or ambiguous characters (`N`) in that sample, so it cannot be determined whether or not the mutation in question exists.

Samples where the result of a query is `unknown` are tracked and can be printed out using the commands shown below on accessing more details about a query.
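
The three possible outcomes for a sample can be sketched as follows for the simple case of a single-base substitution. This is an illustration of the idea only; the function `mutation_status` and the toy observations are assumptions made for this example, not how the index stores or evaluates data.

```python
def mutation_status(sample_base, alt):
    """Return 'present', 'absent' or 'unknown' for a single-base substitution.

    sample_base is the character observed in the sample at the queried position.
    """
    if sample_base in ('N', '-'):
        return 'unknown'   # ambiguous base or gap: cannot decide either way
    return 'present' if sample_base == alt else 'absent'


# Toy observations at one position for a T -> C mutation (made up)
observed = {'S1': 'C', 'S2': 'T', 'S3': 'N', 'S4': '-'}
status = {name: mutation_status(base, alt='C') for name, base in observed.items()}
print(status)
```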

### 4.2.4. Details about query

Once we have a query/selection of samples the specific samples can be shown with the `toframe()` method:

In [None]:
q.toframe(include_unknown=True)

You can use `include_unknown=True` to include samples where the status of the query is **Unknown** (unknown whether it is True or False). By default unknowns are not included.

To summarize, you can use `summary()`:

In [None]:
q.summary()

Or you can use `tolist()` (by default unknowns are not included, you can use `include_unknown=True` to include them).

In [None]:
q.tolist()

## 4.3 Chaining queries

Queries can be chained together to select samples that match every criterion given in the chained `hasa()` calls.

In [None]:
q = db.samples_query() \
    .hasa('hgvs:NC_011083.1:SEHA_RS05365:p.L110P') \
    .hasa('NC_011083.1:3371274:C:A')
q

This can be read as "select all samples that **have a** `L110P` mutation (amino acid) on gene `SEHA_RS05365` **AND** select all samples that **have a** `C to A` mutation on position `3371274` of sequence `NC_011083.1`".

### 4.3.1. Handling unknowns

You may notice that the unknowns have increased to 36%. In the data model used by this software, both ambiguous characters (e.g., `N`) and gaps (e.g., `-`) are considered as unknown (or missing). So, if genomes have a large deletion in a particular region overlapping a mutation being queried, then these would all show up as having an unknown status. I do not know if this is the case here (further investigation is required) but this is something to keep in mind.

You can convert all those samples/genomes that have an Unknown status to be considered selected if you wish using some boolean logic on queries and `select_unknown()`. In particular:

In [None]:
r = q.or_(q.select_unknown())

# Alternatively
#r = q | q.select_unknown()

r

Here, `q.select_unknown()` selects only the unknown samples and `q.or_()` means select everything that is in `q` OR in `q.select_unknown()`.
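
One way to picture how chaining and unknowns interact is to track two sets per query: the samples known to match and the samples whose status is unknown. The sketch below assumes unknowns propagate as in three-valued logic (a sample stays unknown unless some condition is definitely false for it); I have not confirmed that this matches the library in every detail. The `Selection` class is hypothetical and only borrows the method names `or_()` and `select_unknown()` from the API above.

```python
from typing import FrozenSet, NamedTuple


class Selection(NamedTuple):
    selected: FrozenSet[str]   # samples known to match
    unknown: FrozenSet[str]    # samples where the match cannot be determined

    def and_(self, other):
        # Selected only if selected in both; unknown if neither side is definitely false
        selected = self.selected & other.selected
        maybe = (self.selected | self.unknown) & (other.selected | other.unknown)
        return Selection(selected, maybe - selected)

    def or_(self, other):
        # Selected if selected in either; remaining unknowns stay unknown
        selected = self.selected | other.selected
        return Selection(selected, (self.unknown | other.unknown) - selected)

    def select_unknown(self):
        # Promote the unknown samples to a selection of their own
        return Selection(self.unknown, frozenset())


# Toy results of two separate mutation queries (sample names are made up)
a = Selection(frozenset({'S1', 'S2', 'S3'}), frozenset({'S4'}))
b = Selection(frozenset({'S1', 'S4'}), frozenset({'S2', 'S5'}))

q = a.and_(b)                    # chaining two conditions
r = q.or_(q.select_unknown())    # treat unknowns as selected
print(sorted(q.selected), sorted(q.unknown))
print(sorted(r.selected), sorted(r.unknown))
```

In this toy data only `S1` is known to satisfy both conditions, while `S2` and `S4` end up unknown because one of the two conditions could not be determined for them. This is why the unknown fraction can grow as more conditions are chained.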

## 4.4 Searching for a particular sample (`isa()` and `isin()`)

The `isa()` and `isin()` methods let us search for particular samples by name. The difference between the two is that:

1. `isa()` is meant to be read "select samples that **are** (**is a**) type matching the expression".
2. `isin()` is meant to be read "select samples that are **in** a set defined by the passed criteria".

The differences between these become more apparent for more advanced queries later on. For now, we can use these to select samples by name.

In [None]:
q = db.samples_query().isa('SH12-001')
q

In [None]:
q = db.samples_query().isin(['SH12-001', 'SH13-001'])
q

## 4.5 Searching within a tree

Queries are not limited to what mutations a sample has or by sample name. We can also use `isin()` to select samples that match criteria related to a phylogenetic tree.

To do this, we must specify that our query has a tree attached to it. The example data for this tutorial does have such a tree (though it can be built on-the-fly if needed using `build_tree()`, or joined to an existing tree using `join_tree()`).

In [None]:
t = db.samples_query(universe='mutations', reference_name='S_HeidelbergSL476')
t

Here, the type of query is a `MutationTreeSamplesQuery` which means it has a tree attached to it (derived from mutations). You can access the underlying tree with the `tree` property (as an ete3 Tree object).

In [None]:
t.tree

# Print as newick format
#t.tree.write()

You can quickly visualize the tree in-line by using the `tree_styler()` method:

In [None]:
t.tree_styler().render(w=300)

### 4.5.1. Select by distance in a tree

Now that we have a phylogenetic tree, we can search using this tree with `isin()`. To do this, let's search for samples within some distance from `SH14-001`.

In [None]:
tdist = t.isin('SH14-001', kind='distance', distance=3e-7, units='substitutions/site')
tdist

This selects a subset of samples within the above distance (given in `'substitutions/site'`; you can also use `units='substitutions'`).

It can be hard to see what is going on, so we can combine our query with the tree visualization using the `highlight()` method to highlight the selected samples in the tree.

*Note: Selecting by distance to `SH14-001`, won't necessarily select genomes belonging to the same clade.*

In [None]:
t.tree_styler().highlight(tdist).render(w=300)
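
To see why selecting by distance does not necessarily follow clades, here is a small self-contained sketch on a toy tree. The tree, its branch lengths (in substitutions), and the helper functions are all made up for this example; the real query works on the ete3 tree attached to the query.

```python
# Toy rooted tree: child -> (parent, branch length in substitutions). Names are made up.
parents = {
    'n1': ('root', 1),
    'C': ('root', 2),
    'A': ('n1', 1),
    'B': ('n1', 5),
}
leaves = ['A', 'B', 'C']


def path_to_root(node):
    """Return {ancestor: distance from node to that ancestor}, including the node itself."""
    distances = {node: 0}
    total = 0
    while node in parents:
        parent, length = parents[node]
        total += length
        distances[parent] = total
        node = parent
    return distances


def leaf_distance(a, b):
    """Sum of branch lengths on the path between two leaves."""
    up_a, up_b = path_to_root(a), path_to_root(b)
    # The path meets at the shared ancestor that minimizes the summed distance
    return min(up_a[n] + up_b[n] for n in up_a if n in up_b)


def within_distance(leaf, max_distance):
    """Leaves whose path distance to `leaf` is at most `max_distance`."""
    return sorted(x for x in leaves if leaf_distance(leaf, x) <= max_distance)


near_a = within_distance('A', 4)
print(near_a)
```

In this toy tree, `B` is the sister of `A` but sits on a long branch, so it falls outside the distance threshold while the more distantly related `C` is included.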

### 4.5.2. Select by most recent common ancestor

Instead of distance, you can also query by `mrca`, which selects all samples that descend from the most recent common ancestor of the given samples.

In [None]:
tmrca = t.isin(['SH14-001', 'SH14-027'], kind='mrca')
t.tree_styler().highlight(tmrca).render(w=300)
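
The idea of an `mrca` selection can be sketched on a toy tree: find the most recent common ancestor of the given samples, then take every leaf below it. The tree and the helper functions here are assumptions made for illustration and are not taken from the library.

```python
# Toy rooted tree: child -> parent. Names are made up.
parent = {'n1': 'root', 'D': 'root', 'n2': 'n1', 'C': 'n1', 'A': 'n2', 'B': 'n2'}
leaves = ['A', 'B', 'C', 'D']


def ancestors(node):
    """Return the node followed by its ancestors up to the root."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain


def mrca(names):
    """First node on the path from names[0] to the root that is an ancestor of every name."""
    others = [set(ancestors(n)) for n in names[1:]]
    for node in ancestors(names[0]):
        if all(node in s for s in others):
            return node


def clade(names):
    """All leaves descending from the most recent common ancestor of `names`."""
    top = mrca(names)
    return [leaf for leaf in leaves if top in ancestors(leaf)]


selected = clade(['A', 'C'])
print(mrca(['A', 'C']), selected)
```

Note that `B` is selected even though it was not named in the query, because it descends from the same common ancestor as `A` and `C`.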

### 4.5.3. Select by distance and mrca

If you wish to select by distance, but restrict yourself to some particular clade you can chain the above two queries together.

In [None]:
tmrca_and_dist = t.isin(['SH14-001', 'SH14-027'], kind='mrca')\
    .isin('SH14-001', kind='distance', distance=3e-7, units='substitutions/site')

t.tree_styler().highlight(tmrca_and_dist).render(w=300)

## 4.6 Attach external metadata

So far we've been looking at only the genomics data. But often, many detailed insights can be derived from the metadata associated with the genomic samples. External metadata can be attached and tracked by our queries. This also gives us another method for selecting samples using pandas selection statements (e.g., `metadata['Column'] == 'value'`).

To attach external metadata we first must load it up in Python as a DataFrame (note `head(3)` just means only print the first 3 rows, which avoids printing a very large table for this tutorial).

In [None]:
import pandas as pd

metadata_df = pd.read_csv('salmonella-project/metadata.tsv', sep='\t', dtype=str)
metadata_df.head(3)

Now we can attach to our query using the `join()` method:

In [None]:
# Setup a new query (you don't have to do this, but this makes sure the results are all as expected in the tutorial)
q = db.samples_query().hasa('hgvs:NC_011083.1:SEHA_RS05365:p.L110P')

# Join our query with the given data frame
q = q.join(metadata_df, sample_names_column='Strain')
q

To join we had to define a column containing the sample names. We now get back a query of type `DataFrameSamplesQuery`.
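
Conceptually, joining matches each metadata row to a sample in the query through the sample-name column. The following is a minimal sketch of that idea using only the standard library; the toy table, the `Status` values, the output column names, and the `join` function are all made up for illustration and do not reflect the actual structure of a `DataFrameSamplesQuery`.

```python
import csv
import io

# Toy metadata table (tab-separated); the column names mirror the tutorial, the rows are made up
metadata_tsv = 'Strain\tSource\nS1\tFood\nS2\tHuman\nS3\tFood\n'
rows = list(csv.DictReader(io.StringIO(metadata_tsv), delimiter='\t'))

# Toy query result: sample name -> status
query_status = {'S1': 'Present', 'S2': 'Absent', 'S3': 'Unknown'}


def join(status, metadata_rows, sample_names_column):
    """Attach each metadata row to the query status of the sample named in sample_names_column."""
    joined = []
    for row in metadata_rows:
        name = row[sample_names_column]
        if name in status:
            joined.append({'Sample Name': name, 'Status': status[name], **row})
    return joined


joined = join(query_status, rows, sample_names_column='Strain')
present = [r['Sample Name'] for r in joined if r['Status'] == 'Present']
print(joined[0])
print(present)
```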

## 4.7 Query external metadata

We can continue using this to select samples and when we're done use the `toframe()` method to dump out our selected data as a DataFrame (change to `toframe(include_unknown=True)` if you wish to include unknown results).

In [None]:
q.hasa('hgvs:NC_011083.1:SEHA_RS17780:c.1080T>C').toframe()

Running `toframe()` will add some additional columns to the front of the external data frame defining the genomics query information.

## 4.8 Selecting by column values using `isa()`

We can now use the `isa()` method to select samples by values in a particular metadata column. For example, to select all samples with a **Source** of `Food` we can use:

In [None]:
q.isa('Food', isa_column='Source', kind='dataframe').toframe().head(3)

To make it easier to write these query expressions, you can set a default column for `isa()` queries when joining to a dataframe:

In [None]:
q = db.samples_query().join(metadata_df, sample_names_column='Strain',
                            default_isa_kind='dataframe', default_isa_column='Source')
q.isa('Food').toframe().head(3)

You can also pass `regex=True` to `isa()` to query by a regex.
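
The difference between an exact match and a regex match can be sketched with Python's `re` module. This toy `isa` function is a stand-in written for this example: I am assuming the regex only needs to match somewhere in the value (search semantics), which may differ from how the library applies it.

```python
import re

# Toy metadata column: sample name -> Source value (made up)
sources = {'S1': 'Food', 'S2': 'Human', 'S3': 'Food product', 'S4': 'Animal'}


def isa(value, column, regex=False):
    """Select sample names whose column value equals `value`, or contains a regex match for it."""
    if regex:
        pattern = re.compile(value)
        return sorted(name for name, v in column.items() if pattern.search(v))
    return sorted(name for name, v in column.items() if v == value)


exact = isa('Food', sources)
by_regex = isa('^Food', sources, regex=True)
either = isa('Human|Animal', sources, regex=True)
print(exact, by_regex, either)
```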

## 4.9 Selecting by pandas selection expressions

You can also select samples using the more powerful pandas selection expressions. You use the `isin()` method for this.

For example, an alternative way to select samples where **Source** is `Food` is:

In [None]:
q.isin(metadata_df['Source'] == 'Food', kind='dataframe').toframe().head(3)

# 5. Putting it all together

So far we've seen connecting to an index (project), querying by mutations and by relationships in a tree, as well as attaching external metadata. We can put this all together and do some basic visualizations of our selections on the tree (using the `ete3` toolkit).

## 5.1. Highlight everything in each outbreak and show a unique mutation

### 5.1.1 Load tree and attach data frame

In [None]:
q = db.samples_query(universe='mutations', reference_name='S_HeidelbergSL476')\
    .join(metadata_df, sample_names_column='Strain')
q

### 5.1.2 Select samples in outbreak 1 and show mutations

In [None]:
q_o1 = q.isa('1', isa_column='Outbreak number', kind='dataframe')
q_o1

In [None]:
q_o1.features_summary(selection='unique').sort_values('Count', ascending=False).head(8)

Passing `selection='unique'` will select only those mutations that are unique to the selected set (useful for searching for lineage-defining mutations or mutations in subsets of the selected samples).

Let's pick one of the lineage-defining mutations (i.e., a mutation uniquely found in only **Outbreak 1**). We can detect these by using `selection='unique'` and filtering to mutations where `Count` is equal to `Total` (count of samples with mutation equals total samples in selection).

In [None]:
o1_df = q_o1.features_summary(selection='unique')
o1_df[o1_df['Count'] == o1_df['Total']].head(3)
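
The logic of "unique to the selection" and "`Count` equals `Total`" can be sketched with plain Python sets. The data and the function `unique_features_summary` below are made up for illustration; this is my reading of what `selection='unique'` does based on the description above, not the library's implementation.

```python
# Toy data: sample name -> set of mutations (made up)
sample_mutations = {
    'O1-a': {'m1', 'm2', 'm3'},
    'O1-b': {'m1', 'm2'},
    'Other-a': {'m2', 'm4'},
    'Other-b': {'m4'},
}
selection = {'O1-a', 'O1-b'}


def unique_features_summary(mutations_by_sample, selected):
    """Count, within `selected`, the mutations that never occur outside of it."""
    outside = set()
    for name, muts in mutations_by_sample.items():
        if name not in selected:
            outside |= muts
    counts = {}
    for name in selected:
        for m in mutations_by_sample[name] - outside:
            counts[m] = counts.get(m, 0) + 1
    total = len(selected)
    return {m: {'Count': c, 'Total': total} for m, c in counts.items()}


summary = unique_features_summary(sample_mutations, selection)
# Lineage-defining: unique to the selection AND present in every selected sample
lineage_defining = sorted(m for m, row in summary.items() if row['Count'] == row['Total'])
print(summary)
print(lineage_defining)
```

Here `m2` is dropped because it also occurs outside the selection, `m3` is unique but found in only one of the two selected samples, and `m1` is the only mutation that is both unique and present in every selected sample.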

Let's highlight `hgvs:NC_011083.1:SEHA_RS00825:p.I23F` (`NC_011083.1:58804:T:A`) in the tree.

### 5.1.3 Highlight samples in **Outbreak 1** and highlight those with a particular mutation in a different color

In [None]:
# Use `highlight()` to highlight Outbreak 1
# Chain with other `highlight()` methods to highlight other outbreak samples
# Use `annotate()` to add a column indicating the presence/absence of '58804:T:A' mutation
# Use `highlight_style` to change the color scheme of the highlights
ts = q.tree_styler(highlight_style='pastel', annotate_show_box_label=True, legend_nsize=30, legend_fsize=14)\
        .highlight(q_o1, legend_label='Outbreak 1')\
        .highlight(q.isa('2', isa_column='Outbreak number', kind='dataframe'), legend_label='Outbreak 2')\
        .highlight(q.isa('3', isa_column='Outbreak number', kind='dataframe'), legend_label='Outbreak 3')\
        .annotate(q.hasa('hgvs:NC_011083.1:SEHA_RS00825:p.I23F'), label='SEHA_RS00825:I23F')
ts.render(w=400)

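This shows the tree with the samples from each outbreak highlighted, along with an annotation column marking the samples that have the `SEHA_RS00825:I23F` mutation.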

### 5.1.4 Save the output to a file

In [None]:
# Save this to a PDF file
file = 'output1.pdf'
x = ts.render(file)
print(f'Saved to {file}')
# (I assign to x to suppress printing text when saving in Jupyter)

## 5.2 Working with the mutations table

### 5.2.1 Get mutations for all samples with Outbreak number `2`

In [None]:
q = db.samples_query().join(metadata_df, sample_names_column='Strain',
                            default_isa_column='Outbreak number', default_isa_kind='dataframe')
q.isa('2')

In [None]:
q.isa('2').features_summary()

### 5.2.2 Plot distribution of mutations on genome for outbreak `2`

In [None]:
q_o2_positions = q.isa('2').features_summary()\
    .groupby('Position')\
    .agg({'Position': 'first', 'Count': 'sum'})
q_o2_positions

In [None]:
import matplotlib.pyplot as plt

reference_genome = db.reference_names()[0]

# I'm just showing a histogram of positions, I'm ignoring sample counts
q_o2_positions['Position'].hist(bins=100)

plt.title('Distribution of mutations for Outbreak 2', fontdict={'size': 16})
plt.xlabel(f'Position on {reference_genome} (bp)', fontdict={'size': 14})
plt.ylabel('Count of mutation positions', fontdict={'size': 14})

### 5.2.3 Compare to distribution of mutations in outbreak 1 and 3

We'll compare to outbreak 1 and 3 using the same histogram.

In [None]:
q_o1_positions = q.isa('1').features_summary()\
    .groupby('Position')\
    .agg({'Position': 'first', 'Count': 'sum'})
q_o1_positions

In [None]:
q_o3_positions = q.isa('3').features_summary()\
    .groupby('Position')\
    .agg({'Position': 'first', 'Count': 'sum'})
q_o3_positions

In [None]:
o1_positions_list = q_o1_positions['Position'].tolist()
o2_positions_list = q_o2_positions['Position'].tolist()
o3_positions_list = q_o3_positions['Position'].tolist()
data = [o1_positions_list, o2_positions_list, o3_positions_list]
labels = ['Outbreak 1', 'Outbreak 2', 'Outbreak 3']
colors = ['#a6cee3', '#1f78b4', '#b2df8a']

# Create histogram
plt.figure(figsize=(15,6))
plt.hist(data,
         label=labels, color=colors, edgecolor='black',
         bins=25)

plt.legend(prop={'size': 14})
plt.title('Distribution of mutations for Outbreaks 1, 2, and 3', fontdict={'size': 16})
plt.xlabel(f'Position on {reference_genome} (bp)', fontdict={'size': 14})
plt.ylabel('Count of mutation positions', fontdict={'size': 14})

### 5.2.4 (Optional) View distribution of only unique mutations in each outbreak

Try replacing `features_summary()` with `features_summary(selection='unique')`. This will change the plot from a distribution of all mutations in each outbreak to only a distribution of the mutations uniquely found in each outbreak.

# 6. End

You've made it to the end. You are amazing 😀🥳. Way to go. I hope you enjoyed the tutorial.