# Hail
#### Hail is a Python library that enables scalable analysis of structured data, with specialized support for accessing, transforming, and analyzing massive genomic datasets

### Main Hail Objects

Hail provides a set of powerful, distributed data structures designed for scalable genomic analysis. The main Hail objects are:

---

- **[MatrixTable](https://hail.is/docs/0.2/overview/matrix_table.html)**  
  The `MatrixTable` is the **primary data structure** for genomic datasets. It can import genomic data from many formats, such as VCF.  
  The `MatrixTable` builds upon the Hail `Table` and is conceptually similar to a two-dimensional matrix with two tables attached to it. It comprises four components:

  - **Row fields**: A set of fields that are constant for every column. These typically represent **variants** (e.g., positions in the genome).
  - **Column fields**: A set of fields that are constant for every row. These typically represent **samples**.
  - **Entry fields**: A **two-dimensional matrix**, where each entry is **indexed by row key(s) and column key(s)** and stores per-variant, per-sample data (e.g., genotypes).
  - **Global fields**: A structured set of information associated with the entire dataset.
  
  The `MatrixTable` supports rich annotations for each of its fields and is typically used for:

  - Quality control (QC)  
  - Variant filtering and annotation  
  - GWAS
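
As a rough mental model, the four components can be sketched with plain Python dictionaries. This is only an illustration and not how Hail stores data; the names `toy_rows`, `toy_cols`, `toy_entries`, `toy_globals`, and `lookup_entry`, and every value in them, are invented for this sketch.

```python
# Toy in-memory model of the four MatrixTable components.
# This is NOT how Hail stores data; every value below is invented.

# Global fields: one record for the whole dataset.
toy_globals = {"source": "toy example"}

# Row fields: one record per variant, keyed by (locus, alleles).
variant_1 = ("1:1000", ("G", "A"))
variant_2 = ("1:2000", ("C", "T"))
toy_rows = {
    variant_1: {"qual": 50.0},
    variant_2: {"qual": 30.0},
}

# Column fields: one record per sample, keyed by sample ID.
toy_cols = {
    "sample_A": {"is_female": True},
    "sample_B": {"is_female": False},
}

# Entry fields: one record per (row key, column key) pair.
toy_entries = {
    (row_key, col_key): {"GT": "0/0"}
    for row_key in toy_rows
    for col_key in toy_cols
}
toy_entries[(variant_1, "sample_B")] = {"GT": "0/1"}


def lookup_entry(row_key, col_key):
    """Return the entry at the intersection of one row and one column."""
    return toy_entries[(row_key, col_key)]


n_entries = len(toy_entries)  # number of rows times number of columns
```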

---

- **[Table](https://hail.is/docs/0.2/hail.Table.html#hail.Table)**  
  A `Table` is a **general-purpose distributed table**, similar to a Pandas DataFrame, but designed to scale across clusters (like Spark DataFrames).  
  It represents one axis of a `MatrixTable` or standalone data.  

  A Hail `Table` consists of:

  - **Row fields**: Structured data stored in the table rows (i.e., columns in tabular format).
  - **Global fields**: A structured set of information associated with the entire table.  

  Tables are typically used to store:

  - Phenotype or sample metadata  
  - Variant annotation databases  
  - Aggregation results

---

- **[Expression](https://hail.is/docs/0.2/overview/expressions.html)**  
  Hail uses expression objects to represent different types of data and their operations.  
  Hail expressions are **lazily evaluated**, meaning they define computations without executing them immediately. Evaluation occurs during pipeline execution or when triggered by an **action**.

  Each Hail data type has a corresponding expression class.  
  For example:

  - `Int32Expression` represents a 32-bit integer value.  
  - `BooleanExpression` represents a boolean value (`True` or `False`).

  Expressions are used to define computations, filters, and annotations. They are evaluated by Hail’s backend during execution of **actions**, such as:

  - `show()`  
  - `take()`  
  - `collect()`  
  - `eval()`
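
The idea of lazy evaluation can be illustrated without Hail. The sketch below is a toy stand-in, not Hail's expression system: `LazyExpr`, `lazy_value`, and `evaluations` are hypothetical names. Building the expression does no work; only the call to `eval()` (the "action") runs the computation.

```python
evaluations = []  # records every value that is actually read


class LazyExpr:
    """A deferred computation: constructing it does no work."""

    def __init__(self, thunk, description):
        self._thunk = thunk
        self.description = description

    def __add__(self, other):
        return LazyExpr(lambda: self.eval() + other.eval(),
                        f"({self.description} + {other.description})")

    def __mul__(self, other):
        return LazyExpr(lambda: self.eval() * other.eval(),
                        f"({self.description} * {other.description})")

    def eval(self):
        """The 'action': only now is the computation executed."""
        return self._thunk()


def lazy_value(value):
    """Wrap a constant so that we can observe when it is read."""
    def thunk():
        evaluations.append(value)
        return value
    return LazyExpr(thunk, repr(value))


# Building the expression only records what should be computed.
expr = (lazy_value(2) + lazy_value(3)) * lazy_value(4)
reads_before_eval = len(evaluations)

# Triggering the action runs the whole computation.
result = expr.eval()
reads_after_eval = len(evaluations)
```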

---

- **Keys**

  Every Hail `Table` has a **key** that determines the **ordering of rows** and enables **joins or annotations** with other tables.

  `MatrixTable` objects have **two keys**:

  - **Row key**: Indexes the row fields (e.g., variants).
  - **Column key**: Indexes the column fields (e.g., samples).

  **Entry fields** have no key of their own: each entry is indexed by the combination of the **row key** and the **column key**.


---

In [None]:
import os
from glob import glob
from tqdm import tqdm
import datetime
import hail as hl

from hail.plot import show
import pandas as pd
from pprint import pprint
hl.plot.output_notebook()

from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

#### Start an [Apache Spark](https://en.wikipedia.org/wiki/Apache_Spark) instance

In [None]:
log_file_name = f"logs/hail-{datetime.datetime.now():%Y-%m-%d-%H-%M-%S}.log"
# run spark
spark_conf = SparkConf().setAppName("hail-test")
# .setMaster("spark://spark-master:7077")
spark_conf.set("spark.hadoop.fs.s3a.endpoint", "http://lifemap-minio:9000/")
spark_conf.set("spark.hadoop.fs.s3a.access.key", "root")
spark_conf.set("spark.hadoop.fs.s3a.secret.key", "passpass" )
spark_conf.set("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
spark_conf.set("spark.hadoop.fs.s3a.path.style.access", "true")
spark_conf.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark_conf.set("spark.hadoop.fs.s3a.connection.maximum", "1024")
spark_conf.set("spark.hadoop.fs.s3a.threads.max", "1024")
spark_conf.set("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

# Reuse the running SparkContext if there is one (e.g., when this cell is re-run);
# otherwise create a new one from spark_conf.
sc = SparkContext.getOrCreate(conf=spark_conf)

#### Create bucket on [Minio](https://min.io/) if it does not exist

In [None]:
import boto3
from botocore.exceptions import ClientError

# S3 configuration
s3 = boto3.client(
    's3',
    endpoint_url="http://lifemap-minio:9000",
    aws_access_key_id="root",
    aws_secret_access_key="passpass",
)

bucket_name = "data-hail"

# Check whether the bucket exists; if not, create it
try:
    s3.head_bucket(Bucket=bucket_name)
    print(f"Bucket '{bucket_name}' exists.")
except ClientError as err:
    # A missing bucket is reported as a 404; re-raise any other error
    if err.response["Error"]["Code"] not in ("404", "NoSuchBucket"):
        raise
    s3.create_bucket(Bucket=bucket_name)
    print(f"Bucket '{bucket_name}' created.")

### [Hail](https://hail.is/) initialization

In [None]:
hl.init(sc=sc, log=log_file_name)

#### Set filenames

In [None]:
## VCF
vcf_fn = 'data/1kg.vcf'
#vcf_fn = 'data/hs1.vcf'

## Annotation file
annotations_fn = 'data/1kg_annotations.txt'
## Matrix table
mt_fn = 's3://data-hail/1kg.mt'

print (f"VCF fn: {vcf_fn}")
print (f"Annotation file fn: {annotations_fn}")
print (f"Matrix table fn: {mt_fn}")

#### Reading vcf with Pandas (N/A if the vcf is stored on s3)

In [None]:
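vcf_pd = None
if not vcf_fn.startswith("s3://"):
    # header=109 is specific to this file: it skips the leading "##" meta-information
    # lines so that the column-header line is used. Adjust it for other VCFs.
    vcf_pd = pd.read_csv(vcf_fn, sep="\t", header=109, low_memory=False)
vcf_pd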

### Import VCF to Hail Matrix Table
To work with genomic data stored in a **VCF**, we first need to import it and convert it into a **Hail Matrix Table**.


In [None]:
## Read a vcf file, convert to a matrix table and save it.
mt = hl.import_vcf(vcf_fn, reference_genome="GRCh37") 
mt.write(mt_fn, overwrite=True)

In [None]:
## Read the matrix table from the file and assign it to the mt variable
mt = hl.read_matrix_table(mt_fn)

## Assign MatrixTable fields to Hail tables
row_table = mt.rows() # Returns the row field table
col_table = mt.cols() # Returns the col field table
entry_fields = mt.entries() # Returns the entry field matrix in coordinate table form.

![alt text](immagini/vcf_matrix_table.png "Title")

#### Counting samples and variants: MatrixTable `count_rows` and `count_cols` methods and Table `count` method

In [None]:
## Counts of samples and variants
n_variants = mt.count_rows() 
n_samples = mt.count_cols()

print (f"\n\nThe dataset has {n_variants} variants and {n_samples} samples") 

##### Show a description of the MatrixTable components: the `describe` and `show` methods

##### MatrixTable

In [None]:
mt.describe()

##### Row table:

In [None]:
row_table.describe()

In [None]:
row_table.show()

##### Column table 

In [None]:
col_table.describe()

In [None]:
col_table.show()

##### Entry fields

In [None]:
entry_fields.describe()

In [None]:
entry_fields.select(entry_fields.GT, entry_fields.AD, entry_fields.DP, entry_fields.GQ, entry_fields.PL).show(n_samples + 1)

#### Show attributes of entry fields. An example with the genotype field.
 - `entry` attribute
 - **Call** `phased` attributes
 - **Call** `summarize` method

In [None]:
## Attributes of entry fields
entry_structure = mt.entry

# The StructExpression
print (entry_structure)

# To show only entry field names
print (list(entry_structure))

To look at the first few genotype calls, we can use entries along with select and take. The **`take` method collects the first n rows into a Python list**. Alternatively, we can use the `show` method, which prints the first n rows to the console in a table format.

In [None]:
gt_expr = mt.GT # Takes the GT entry field for all samples 
gt_expr.phased.show(5) # Show the phased attribute of the GT field (False for unphased calls)

In [None]:
gt_expr.summarize()

##### Global values
Global fields hold values that are shared by the entire matrix table.

In [None]:
mt.globals_table().show()

### How to access and insert data into a Hail Table using the [`select`](https://hail.is/docs/0.2/hail.Table.html#hail.Table.select) and [`annotate`](https://hail.is/docs/0.2/hail.Table.html#hail.Table.annotate) methods

We can use the Hail Table `select` method to extract specific fields from a table.

The `select` method takes either a string referring to a field name in the table or a Hail expression. If no arguments are provided, only the row key fields (`locus` and `alleles`) are retained.

The `select` method can also be used to add or transform fields, but the resulting table will include **only** the specified fields and any new fields defined in the expression; all other fields will be removed.

In contrast, the `annotate` method adds or modifies fields while preserving the entire original table structure. It returns the full table with the new or updated fields included.
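
The difference between the two methods can be sketched on a plain list of dictionaries. This is a toy imitation, not Hail code: the helper functions `select` and `annotate`, the `key_fields` argument, and the sample rows are all invented for illustration.

```python
def select(rows, key_fields, *field_names, **new_fields):
    """Keep only the key fields, the named fields, and the newly computed fields."""
    selected_rows = []
    for row in rows:
        kept = {name: row[name] for name in key_fields}
        kept.update({name: row[name] for name in field_names})
        kept.update({name: compute(row) for name, compute in new_fields.items()})
        selected_rows.append(kept)
    return selected_rows


def annotate(rows, **new_fields):
    """Keep every existing field and add (or overwrite) the computed ones."""
    annotated_rows = []
    for row in rows:
        updated = dict(row)  # copy, so the input rows stay unchanged
        updated.update({name: compute(row) for name, compute in new_fields.items()})
        annotated_rows.append(updated)
    return annotated_rows


toy_variant_rows = [
    {"locus": "1:1000", "alleles": ("G", "A"), "qual": 100.0, "rsid": "rs1"},
    {"locus": "1:2000", "alleles": ("C", "T"), "qual": 40.0, "rsid": "rs2"},
]
keys = ("locus", "alleles")

# select: 'rsid' is dropped because it is neither a key nor listed.
selected = select(toy_variant_rows, keys, "qual", new_col=lambda r: r["qual"] * 2)

# annotate: every original field survives, plus the new one.
annotated = annotate(toy_variant_rows, new_col=lambda r: r["qual"] * 10)
```

In both cases the input rows are left untouched and a new collection is returned, which mirrors the way Hail tables are immutable.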


In [None]:
row_table.show(5)

In [None]:
row_table.select().show(5)

In [None]:
### Select a field of the row table: Simple field and nested field
row_table.select("qual", row_table.info.AC).show(5)

In [None]:
## Add a new field with select: fields that are not listed (apart from the keys) are dropped from the new table
row_table.select(row_table.qual, new_col= row_table.qual * 2).show(5)

In [None]:
## Add a new field with annotate: all existing fields are kept in the new table
row_table.annotate(new_col= row_table.qual * 10).show(5)

In [None]:
mt.rows().show(5)

### Annotating a MatrixTable Using a Sample Metadata Table

Metadata about samples—such as phenotypes or geographic origin—is often stored in a separate text file, which can be imported into Hail as a `Table`.

A Hail `MatrixTable` can have any number of **row fields** and **column fields** for storing metadata associated with each row (e.g., variants) and column (e.g., samples). Annotations are a critical part of genetic studies. **Column fields** are used to store information such as sample phenotypes, ancestry, sex, and covariates. **Row fields** can be used to store attributes like gene membership or predicted functional impact, often used in QC or analysis.

In this example, we’ll use a text file to annotate the columns of a `MatrixTable`.

The annotation file includes:
- Sample ID
- Population and super-population labels
- Sample sex
- Two simulated phenotypes: one binary (Purple Hair), one continuous (Caffeine Consumption)

This file can be imported using Hail’s `import_table` function, which returns a **`Table` object**. This object behaves similarly to a Pandas or R DataFrame but is distributed across Spark and not limited by the memory of a single machine. Like the `MatrixTable`, a `Table` is **immutable**.   
To inspect or interact with the data locally in Python, you can use the `.take()` method or convert the table to a Pandas DataFrame using `.to_pandas()`.


In [None]:
annotation_table = (hl.import_table(annotations_fn, impute=True).key_by('Sample')) #  impute=True, guess field types from the file.

In [None]:
annotation_table.describe()

In [None]:
annotation_table.show(5)

#### Query functions for gathering statistics: The Table `aggregate` method and Hail `aggregators` (see [Aggregation](https://hail.is/docs/0.2/guides/agg.html) and [Aggregators](https://hail.is/docs/0.2/aggregators.html#sec-aggregators) for details)

Hail provides a number of useful query functions for gathering statistics from your dataset. These functions take **Hail aggregate expressions** as arguments.

For example, [`counter`](https://hail.is/docs/0.2/aggregators.html#hail.expr.aggregators.counter) is an aggregation function that counts the number of occurrences of each unique element. You can use this to compute the population distribution by passing in a Hail expression referencing the field you'd like to count.

The `aggregate` method is used to perform aggregation across rows in a `Table`. Aggregator functions like `counter`, `stats`, and others specify what statistics to compute and how to compute them.
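
To make the two aggregators concrete, here is a small plain-Python analogue using the standard library. It is only a sketch: the rows in `toy_annotations` are invented, and the use of the population standard deviation is an assumption made for this example rather than a statement about what Hail computes.

```python
from collections import Counter
from statistics import mean, pstdev

# Invented rows that mimic the shape of the annotation table.
toy_annotations = [
    {"Sample": "S1", "SuperPopulation": "AFR", "CaffeineConsumption": 4},
    {"Sample": "S2", "SuperPopulation": "EUR", "CaffeineConsumption": 6},
    {"Sample": "S3", "SuperPopulation": "AFR", "CaffeineConsumption": 5},
    {"Sample": "S4", "SuperPopulation": "EAS", "CaffeineConsumption": 3},
]

# Analogue of a counter aggregation: occurrences of each unique label.
population_counts = Counter(row["SuperPopulation"] for row in toy_annotations)

# Analogue of a stats aggregation: summary statistics of a numeric field.
caffeine = [row["CaffeineConsumption"] for row in toy_annotations]
caffeine_stats = {
    "mean": mean(caffeine),
    "stdev": pstdev(caffeine),  # population form, chosen for this sketch
    "min": min(caffeine),
    "max": max(caffeine),
    "n": len(caffeine),
    "sum": sum(caffeine),
}
```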


In [None]:
## Population distribution
## Here counter counts the samples for each geographical origin (super-population) label
aggregate_expression = hl.agg.counter(annotation_table.SuperPopulation)
print (aggregate_expression)
pprint(annotation_table.aggregate(aggregate_expression))

[`stats`](https://hail.is/docs/0.2/aggregators.html#hail.expr.aggregators.stats) is an aggregation function that produces some useful statistics about numeric collections. We can use this to see the distribution of the CaffeineConsumption phenotype.

In [None]:
## stats computes summary statistics for the specified field
## Here we summarize the caffeine consumption phenotype

aggregate_expression = hl.agg.stats(annotation_table.CaffeineConsumption)
pprint(annotation_table.aggregate(aggregate_expression))

#### Grouping to Summarize Information Within Superpopulations: The Table [`group_by`](https://hail.is/docs/0.2/hail.Table.html#hail.Table.group_by) Method

The Table `group_by` method allows you to apply aggregation functions to groups of rows based on specified keys. When working with a grouped table, the [`aggregate`](https://hail.is/docs/0.2/hail.GroupedTable.html#hail.GroupedTable.aggregate) method behaves slightly differently from `Table.aggregate`: it requires *named expressions* (keyword arguments) rather than a single positional expression, and it returns a new table rather than a single result (such as a number or a struct). Each aggregation must therefore be assigned a field name, which becomes a field in the resulting aggregated table.
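
A grouped aggregation can be pictured in plain Python as follows. The helper `group_by_aggregate` and the rows in `toy_samples` are hypothetical; the sketch only shows that each keyword argument becomes a field of the resulting table, with one output row per group.

```python
from collections import Counter, defaultdict


def group_by_aggregate(rows, key, **named_aggregations):
    """Group rows by `key`, then apply each named aggregation to every group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    result = []
    for key_value in sorted(groups):
        out_row = {key: key_value}
        for field_name, aggregate in named_aggregations.items():
            out_row[field_name] = aggregate(groups[key_value])
        result.append(out_row)
    return result


# Invented rows that mimic the shape of the annotation table.
toy_samples = [
    {"Sample": "S1", "SuperPopulation": "AFR", "isFemale": True},
    {"Sample": "S2", "SuperPopulation": "AFR", "isFemale": False},
    {"Sample": "S3", "SuperPopulation": "EUR", "isFemale": True},
    {"Sample": "S4", "SuperPopulation": "EUR", "isFemale": True},
]

# The keyword name 'cnt' becomes a field of the output rows.
summary = group_by_aggregate(
    toy_samples,
    "SuperPopulation",
    cnt=lambda group: Counter(row["isFemale"] for row in group),
)
```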

In [None]:
grp = annotation_table.group_by('SuperPopulation')

In [None]:
grp.aggregate(cnt=hl.agg.counter(annotation_table.isFemale)).show()

In [None]:
grp.aggregate(stats=hl.agg.stats(annotation_table.CaffeineConsumption)).show()

#### Annotate the MatrixTable column fields: The MatrixTable [`annotate_cols`](https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.annotate_cols) method

The `annotate_cols` method makes it possible to join the annotation table with the MatrixTable containing the dataset.
First, we’ll print the existing column schema using `col`. MatrixTable [`col`](https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.col) is an attribute that returns a struct expression of all column-indexed fields, including keys.
It differs from the [`cols()`](https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.cols) method, which returns a table with all column fields in the matrix.

In [None]:
#Column table before adding per sample annotation:
mt.col.describe()

Setting the sample IDs as the key of the annotation table allows you to index sample information using the sample IDs from the `mt.s` column field of the `MatrixTable`. The indexed data can then be used to annotate the `MatrixTable`, adding a `pheno` field to its column annotations.
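
Conceptually, indexing a keyed table by sample ID is a lookup by key. The plain-Python sketch below uses invented data (`toy_metadata`, `toy_columns`) and represents a sample that has no metadata with `None`, standing in for a missing value.

```python
# Metadata keyed by sample ID, like a table keyed by 'Sample' (values invented).
toy_metadata = {
    "S1": {"SuperPopulation": "AFR", "CaffeineConsumption": 4},
    "S3": {"SuperPopulation": "EUR", "CaffeineConsumption": 6},
}

# Column records of a toy matrix table; 's' holds the sample ID.
toy_columns = [{"s": "S1"}, {"s": "S2"}, {"s": "S3"}]

# Look up each sample's metadata by key and attach it as a 'pheno' field.
# dict.get returns None when a sample has no metadata row.
annotated_columns = [
    {**column, "pheno": toy_metadata.get(column["s"])}
    for column in toy_columns
]
```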

In [None]:
mt = mt.annotate_cols(pheno = annotation_table[mt.s])

# After the annotation the columns have a new field, pheno,
# a struct that contains sample metadata

mt.col.describe()

In [None]:
print(f"Metadata table samples: {annotation_table.count()}")
print(f"Matrix table samples: {mt.cols().count()}")

Our dataset contains fewer samples than the full 1000 Genomes cohort, so statistics computed on the whole annotation table do not describe our samples exactly. We can use [`aggregate_cols`](https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.aggregate_cols) to get the metrics for only the samples in our dataset.

In [None]:
mt.aggregate_cols(hl.agg.counter(mt.pheno.SuperPopulation))

In [None]:
pprint(mt.aggregate_cols(hl.agg.stats(mt.pheno.CaffeineConsumption)))

**The same Python, R, and Unix tools could do this work as well, but we’re starting to hit a wall - the latest gnomAD release publishes about 250 million variants, and that won’t fit in memory on a single computer.**

What about genotypes? Hail can query the collection of all genotypes in the dataset, and this is getting large even for our tiny dataset. With 284 samples and roughly 10,000 variants, it already holds almost 3 million genotypes. The gnomAD dataset has about 5 trillion unique genotypes.

---

### Quality Control and Data Filtering

Quality control (QC) is an iterative process that varies across projects—there is no simple, “push-button” solution. However, through open science and collaboration, the community has established a set of best practices to guide QC decisions.

Effective QC depends on a deep understanding of the dataset’s properties. Hail supports this by providing the [`sample_qc`](https://hail.is/docs/0.2/methods/genetics.html#hail.methods.sample_qc) and [`variant_qc`](https://hail.is/docs/0.2/methods/genetics.html#hail.methods.variant_qc) functions, which compute useful summary metrics. These metrics are stored in column fields (for samples) and row fields (for variants), respectively.
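
One of the metrics used below is the call rate. As a rough sketch, assuming the call rate is the fraction of genotypes that are not missing, it can be computed per sample (across variants) or per variant (across samples). The data in `toy_calls` and both helper functions are invented for illustration and ignore any further bookkeeping Hail may do.

```python
# Toy genotype calls: variant -> sample -> alt-allele count (None = no call).
toy_calls = {
    "v1": {"S1": 0, "S2": 1, "S3": None},
    "v2": {"S1": 2, "S2": None, "S3": None},
    "v3": {"S1": 0, "S2": 0, "S3": 1},
    "v4": {"S1": 1, "S2": 0, "S3": 0},
}


def sample_call_rate(calls, sample):
    """Fraction of variants at which this sample has a non-missing call."""
    per_variant = [by_sample[sample] for by_sample in calls.values()]
    return sum(g is not None for g in per_variant) / len(per_variant)


def variant_call_rate(calls, variant):
    """Fraction of samples with a non-missing call at this variant."""
    per_sample = list(calls[variant].values())
    return sum(g is not None for g in per_sample) / len(per_sample)


sample_rates = {s: sample_call_rate(toy_calls, s) for s in ("S1", "S2", "S3")}
variant_rates = {v: variant_call_rate(toy_calls, v) for v in toy_calls}
```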


#### Sample QC function

In [None]:
mt.col.describe()

In [None]:
# sample_qc is a hail genetic method to compute per-sample metrics useful for quality control.

mt = hl.sample_qc(mt)

mt.col.describe()

##### Hail plotting functions
Hail plotting functions accept Hail fields (expressions) as arguments.  
For `hl.plot.histogram`, if the `range` and `bins` arguments are not set, the range is computed from the minimum and maximum values of the field and the default of 50 bins is used.

In [None]:
##Plotting the QC metrics is a good place to start.

## Call rate
p = hl.plot.histogram(mt.sample_qc.call_rate, range=(.88,1), legend='Call Rate')
show(p)

##### Removing samples: `filter_cols` 

Removing outliers from the dataset will generally improve association results. We can make arbitrary cutoffs and use them to filter.
With the MatrixTable [`filter_cols`](https://hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.filter_cols) method we can **create a new matrix table that keeps only the samples with a mean DP >= 4 and a call rate >= 0.97**. Samples that don't satisfy these criteria are removed.
Filtering is not performed in place, so the result must be assigned to a variable for the changes to take effect.

In [None]:
## Checking the correlation between the mean DP and the call rate
p = hl.plot.scatter(mt.sample_qc.dp_stats.mean, mt.sample_qc.call_rate, xlabel='Mean DP', ylabel='Call Rate')
p.line([2,22], [0.97,0.97], color='red', line_width=2)
p.line([4,4], [0.878,1.0], color='red', line_width=2)
show(p)

In [None]:
mt = mt.filter_cols((mt.sample_qc.dp_stats.mean >= 4) & (mt.sample_qc.call_rate >= 0.97))
print('After filter, %d/284 samples remain.' % mt.count_cols())

##### Removing genotypes: `filter_entries`
Next is genotype QC. It’s a good idea to filter out genotypes where the reads aren’t where they should be: if we find a genotype called homozygous reference with >10% alternate reads, a genotype called homozygous alternate with >10% reference reads, or a genotype called heterozygote without a ref / alt balance near 1:1, it is likely to be an error.

In a low-depth dataset like 1KG, it is hard to detect bad genotypes using this metric, since a read ratio of 1 alt to 10 reference can easily be explained by binomial sampling. However, in a high-depth dataset, a read ratio of 10:100 is a sure cause for concern!
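
The allele-balance rule can be written out for a single genotype in plain Python. This sketch uses the same thresholds as the filter in this notebook (0.1, 0.25 to 0.75, and 0.9); the function name `passes_allele_balance` and the choice to reject genotypes with no reads at all are assumptions made for the example.

```python
def passes_allele_balance(n_alt_alleles, allelic_depths):
    """Check one biallelic genotype against the allele-balance thresholds.

    n_alt_alleles: 0 (hom-ref), 1 (het) or 2 (hom-var).
    allelic_depths: (ref_reads, alt_reads).
    """
    total_reads = sum(allelic_depths)
    if total_reads == 0:
        return False  # no reads: the balance is undefined, so reject (assumption)

    alt_fraction = allelic_depths[1] / total_reads
    if n_alt_alleles == 0:
        return alt_fraction <= 0.1
    if n_alt_alleles == 1:
        return 0.25 <= alt_fraction <= 0.75
    return alt_fraction >= 0.9


checks = {
    "hom_ref_clean": passes_allele_balance(0, (20, 1)),  # 1/21 alt reads
    "hom_ref_noisy": passes_allele_balance(0, (8, 2)),   # 2/10 alt reads
    "het_balanced": passes_allele_balance(1, (5, 5)),    # 5/10 alt reads
    "het_skewed": passes_allele_balance(1, (9, 1)),      # 1/10 alt reads
    "hom_var_clean": passes_allele_balance(2, (0, 10)),  # 10/10 alt reads
}
```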



In [None]:
ab = mt.AD[1] / hl.sum(mt.AD)

filter_condition_ab = ((mt.GT.is_hom_ref() & (ab <= 0.1)) |
                        (mt.GT.is_het() & (ab >= 0.25) & (ab <= 0.75)) |
                        (mt.GT.is_hom_var() & (ab >= 0.9)))

fraction_filtered = mt.aggregate_entries(hl.agg.fraction(~filter_condition_ab))
print(f'Filtering {fraction_filtered * 100:.2f}% entries out of downstream analysis.')
mt = mt.filter_entries(filter_condition_ab)

##### Variant QC function

Variant QC computes per-variant metrics useful for quality control. It works much like `sample_qc`: we can use the [`variant_qc`](https://hail.is/docs/0.2/methods/genetics.html#hail.methods.variant_qc) function to produce a variety of useful statistics, plot them, and then filter. The statistics are stored at the row level because they describe variants.

In [None]:
mt.row.describe()

In [None]:
mt = hl.variant_qc(mt)
mt.row.describe()

##### Removing variants: `filter_rows`
Restrict to variants that are:
- common (we’ll use a cutoff of 1%)
- not so far from Hardy-Weinberg equilibrium as to suggest sequencing error
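
Both criteria rest on simple genotype-count arithmetic, sketched here in plain Python. The functions and the counts are invented for illustration. The sketch only shows the alternate allele frequency and the genotype counts expected under Hardy-Weinberg equilibrium; it does not reproduce the statistical test behind `p_value_hwe`.

```python
def alt_allele_frequency(n_hom_ref, n_het, n_hom_var):
    """Alternate allele frequency from genotype counts at a biallelic site."""
    n_called = n_hom_ref + n_het + n_hom_var
    return (n_het + 2 * n_hom_var) / (2 * n_called)


def hwe_expected_counts(n_hom_ref, n_het, n_hom_var):
    """Expected (hom-ref, het, hom-var) counts under Hardy-Weinberg equilibrium."""
    n_called = n_hom_ref + n_het + n_hom_var
    q = alt_allele_frequency(n_hom_ref, n_het, n_hom_var)
    p = 1.0 - q
    return (n_called * p * p, 2 * n_called * p * q, n_called * q * q)


# Observed counts that match the expectation: q = 0.1, expected about (81, 18, 1).
in_equilibrium = hwe_expected_counts(81, 18, 1)

# No heterozygotes at all although about 50 are expected: a strong deviation.
far_from_equilibrium = hwe_expected_counts(50, 0, 50)

# The 1% frequency cutoff used for "common" variants.
is_common = alt_allele_frequency(81, 18, 1) > 0.01
```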

In [None]:
print('Samples: %d  Variants: %d' % (mt.count_cols(), mt.count_rows()))
mt = mt.filter_rows(mt.variant_qc.AF[1] > 0.01) # Keep variants whose alternate allele frequency is above 1%
print('Samples: %d  Variants: %d' % (mt.count_cols(), mt.count_rows()))

mt = mt.filter_rows(mt.variant_qc.p_value_hwe > 1e-6) # Hardy-Weinberg equilibrium pvalue cut-off
print('Samples: %d  Variants: %d' % (mt.count_cols(), mt.count_rows()))

---

## GWAS with a quantitative phenotype

In Hail, the association tests accept column fields for the sample phenotype and covariates. Since we’ve already got our phenotype of interest (caffeine consumption) in the dataset, we are good to go:

In [None]:
gwas = hl.linear_regression_rows(y=mt.pheno.CaffeineConsumption,
                                 x=mt.GT.n_alt_alleles(),
                                 covariates=[1.0])
gwas.row.describe()

Looking at the bottom of the above printout, you can see the linear regression adds new row fields for the beta, standard error, t-statistic, and p-value.
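
For a single variant with only an intercept as covariate (the `1.0` in the covariates list), these quantities reduce to ordinary least squares with one predictor. The sketch below is a plain-Python illustration on invented data, not Hail's implementation; the p-value is omitted because it needs the t-distribution, which the standard library does not provide.

```python
import math


def simple_linear_regression(x, y):
    """Ordinary least squares of y on x with an intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    beta = sxy / sxx                      # effect per alternate allele
    intercept = mean_y - beta * mean_x
    residual_ss = sum((yi - (intercept + beta * xi)) ** 2 for xi, yi in zip(x, y))
    dof = n - 2                           # two fitted parameters
    standard_error = math.sqrt(residual_ss / dof / sxx)
    return {
        "beta": beta,
        "intercept": intercept,
        "standard_error": standard_error,
        "t_stat": beta / standard_error,
    }


# Invented data: alt-allele counts and a phenotype for six samples.
alt_allele_counts = [0, 0, 1, 1, 2, 2]
phenotype = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]
fit = simple_linear_regression(alt_allele_counts, phenotype)
```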

Hail makes it easy to visualize results! Let’s make a Manhattan plot:

In [None]:
p = hl.plot.manhattan(gwas.p_value)
show(p)

Let’s check whether our GWAS was well controlled using a Q-Q (quantile-quantile) plot.

In [None]:
p = hl.plot.qq(gwas.p_value)
show(p)

## Confounded!

The observed p-values deviate from the expected distribution immediately. Either every SNP in our dataset is causally linked to caffeine consumption (which is highly unlikely), or there's an underlying confounder.

In fact, the phenotype was simulated using sample ancestry (in addition to a specific locus associated with caffeine consumption). This results in a **stratified phenotype distribution**. To correct for this, we need to include ancestry as a covariate in our regression model.

The `linear_regression_rows` function allows us to include column fields as covariates. Although we’ve already annotated samples with reported ancestry, such labels can be unreliable due to human error. Genomes, however, don't suffer from this issue. Rather than using reported ancestry, we’ll use **genetic ancestry** by incorporating the computed principal components (PCs) into our model.

The `pca` function outputs eigenvalues as a list and sample PCs as a `Table`; it can also generate variant loadings if requested. The `hwe_normalized_pca` function offers similar outputs, but uses Hardy-Weinberg Equilibrium (HWE)-normalized genotypes for the PCA.
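
The normalization step can be sketched in plain Python. The sketch assumes the usual HWE normalization, in which each alternate-allele count is centered on twice the variant's allele frequency p, scaled by the square root of 2p(1 - p) times the number of variants, and missing calls are set to 0. Whether this matches Hail's internals in every detail is an assumption, and `hwe_normalize` is a hypothetical helper.

```python
import math


def hwe_normalize(genotype_matrix):
    """HWE-normalize a variants x samples matrix of alt-allele counts.

    Missing calls are given as None. Every variant is assumed to be
    polymorphic (0 < p < 1); otherwise a ValueError is raised.
    """
    n_variants = len(genotype_matrix)
    normalized = []
    for calls in genotype_matrix:
        called = [g for g in calls if g is not None]
        p = sum(called) / (2 * len(called))  # alternate allele frequency
        if not 0.0 < p < 1.0:
            raise ValueError("monomorphic variant cannot be normalized")
        scale = math.sqrt(2 * p * (1 - p) * n_variants)
        normalized.append(
            [0.0 if g is None else (g - 2 * p) / scale for g in calls]
        )
    return normalized


# One variant, four samples, the last call missing: p = 0.5.
toy_normalized = hwe_normalize([[0, 1, 2, None]])
```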


In [None]:
mt.GT.show()

In [None]:
eigenvalues, pcs, _ = hl.hwe_normalized_pca(mt.GT)

In [None]:
pprint(eigenvalues)

In [None]:
pcs.show(5, width=100)

In [None]:
pcs.describe()

Since we now have the principal components for each sample, we can add them to the MatrixTable's column fields and plot them to examine how well they align with the major human populations.

In [None]:
mt = mt.annotate_cols(scores = pcs[mt.s].scores)

In [None]:
p = hl.plot.scatter(mt.scores[0],
                    mt.scores[1],
                    label=mt.pheno.SuperPopulation,
                    title='PCA', xlabel='PC1', ylabel='PC2')
show(p)

Now we can rerun our linear regression, controlling for sample sex and the first few principal components.

In [None]:
gwas = hl.linear_regression_rows(
    y=mt.pheno.CaffeineConsumption,
    x=mt.GT.n_alt_alleles(),
    covariates=[1.0, mt.pheno.isFemale, mt.scores[0], mt.scores[1], mt.scores[2]])


In [None]:
gwas.show()

Q-Q plot and Manhattan plot:

In [None]:
p = hl.plot.qq(gwas.p_value)
show(p)

In [None]:
p = hl.plot.manhattan(gwas.p_value)
show(p)

#### How to save Table and MatrixTable

In [None]:
gwas_ht_fn = 's3://data-hail/gwas_results.ht'
gwas.write(gwas_ht_fn, overwrite=True)
    
mt_out_fn = 's3://data-hail/1kg_after_gwas.mt'
mt.write(mt_out_fn, overwrite=True)