UG100 S18-NGX

The objective of this project is to generate a genomic VCF (gVCF) for the genotype S18-NGX, which is derived from RegenSY27x. This genotype was used to generate Pho2 mutants. The goal is to identify the sequences (alleles) of the Pho2 region located on Chromosome 2 (2.1, 2.2, 2.3, and 2.4).

UG100 present challenges regarding homopolymers and a rigorous quality control is necessary at every step of the pipeline.

progress 2026-05-01

1.1 Preliminary analysis. A preliminary review of the UMNGC report was conducted.

ID	Sample	Mean_Read	Mean_Q	Q30_FLOWS	Q40_SNVQ	total_bases	total_reads	reads_billion
1	S12-PALOUSE	302.3	31.3	82.2	93.5	275818187432	912306934	0.91
2	S13-VENDRE	301.1	31.5	82.9	93.8	286951842713	953034450	0.95
3	S14-GIZA	300.1	31.5	82.7	93.7	299560687609	998146554	1.00
4	S15-Misija	301.3	31.4	82.6	93.7	278773652091	925398267	0.93
5	S16-ILLsev	300.3	31.4	82.5	93.6	238739357777	794933321	0.79
6	S17-G2510	299.7	31.5	82.9	93.8	320724651816	1070245726	1.07
7	S18-NGX	297.8	31.2	81.9	93.5	297354215135	998463718	1.00
8	S19-Mini	297.8	31.5	83.2	94.0	295261567061	991572358	0.99
9	S20-drb2abcdT0	300.7	31.4	82.6	93.7	276955428077	921022796	0.92
10	S21-drb2abcdT1-6	301.2	31.3	82.4	93.7	312721036047	1038199685	1.04
11	S22-drb2abcdT1-5	299.6	31.3	82.5	93.7	269818615860	900525488	0.90
	Mean/Total	300.2	31.4	82.6	93.7	3152679241618	10503849297	10.50
	Official_output*	300		>85%	>85%			10-12**

* Official output was obtained from https://genomics.umn.edu/service/ug-100-sequencing
** Wafer contains 11 samples

Sample S18-NGX will be aligned to the RegenSY27x genome. This genome is phased and contains 32 linkage groups (LG), representing eight chromosomes across four monoploid genomes. Given a genome size of 653 Mb for Hap_01, the coverage is $455\times$. For the full phased genome (32 LG) of 2.59 Gb, the coverage is $114\times$.

chrom	Chrom	Hap_1	Hap_2	Hap_3	Hap_4
medsa.RegenSY27x.gnm1.Chr1.1	1	85129743	87534413	86698797	81593309
medsa.RegenSY27x.gnm1.Chr2.1	2	69581426	69606329	75653446	72621877
medsa.RegenSY27x.gnm1.Chr3.1	3	96190425	90827970	91408138	83189900
medsa.RegenSY27x.gnm1.Chr4.1	4	83716887	83291558	86231604	92808212
medsa.RegenSY27x.gnm1.Chr5.1	5	79699470	62554310	75148351	71615600
medsa.RegenSY27x.gnm1.Chr6.1	6	83742631	73507436	69463442	82998842
medsa.RegenSY27x.gnm1.Chr7.1	7	79053817	87713645	88635038	96423042
medsa.RegenSY27x.gnm1.Chr8.1	8	76300915	70660710	74845475	86071966
Sub-Total		653415314	625696371	648084291	667322748
Total					2594518724

Two approaches will be tested

Alignment to Hap_1 only.
Alignment to the full genome (all 32 LGs).

Prior to genome indexing, 2,902 scaffolds (totaling 691,555,158 bp) will be removed. The monoploid version of the genome was generated using samtools:

samtools faidx all.haps.rename.fasta
medsa.RegenSY27x.gnm1.Chr1.1
medsa.RegenSY27x.gnm1.Chr2.1
medsa.RegenSY27x.gnm1.Chr3.1
medsa.RegenSY27x.gnm1.Chr4.1
medsa.RegenSY27x.gnm1.Chr5.1
medsa.RegenSY27x.gnm1.Chr6.1
medsa.RegenSY27x.gnm1.Chr7.1
medsa.RegenSY27x.gnm1.Chr8.1
> haplo_01.fa

Data Locations (MSI)

The files were downloaded from MSI and uploaded to SCinet Ceres:

Pipeline

The genotype S18-NGX was sequenced using the UG100 platform. The analysis pipeline follows these steps:

CRAM to BAM conversion (Reference-oriented Alignment Map to Binary Alignment Map).
Alignment to the reference genome.
Duplicate marking
Varaint calling: Ultima-specific DeepVariant or Sentieon software
Output GVCF

Although standard software like samtools or bwa can be used, the best choice is to use the official Ultima Genomics (UG) repositories. These include the Ultima Aligner (UA) and flow-optimized DeepVariant models, which are specifically tuned for Ultima flow-based chemistry and are not available in standard toolkits.

The UG Docker registry contains several repositories https://hub.docker.com/u/ultimagenomics and the project follows these specific workflows:

Progress: The reference genome was indexed successfully. Steps 1.1 (UA indexing), 1.2 (UA build) and 2 (UA aligment) were run succefully.

Task for the next week: I need to solve a problem with GPU hardware compatibility. GPU will be required in the step 4 (Varaint calling with DeepVariant). Ceres cluster does not have GPUs, but Atlas does it. Therefore, all data generated on Ceres must be synchronized to Atlas via Globus. Furthermore, the Apptainer images from https://hub.docker.com/u/ultimagenomics are not functioning correctly on the Atlas cluster; I will write to SCinet to solve it.

progress 2026-05-08

The pipeline for triming and sorting was corrected and now works ok: the script is 04_pipeline_works.sh. This pipeline is composed of three steps:

Reference indexing
Align with UA
Sort UA Aligned BAM with Demux and Sorter

The output files of interest are:
output_basename.cram
output_basename.cram.crai

The next steps are:
4. Prepare Reference for Variant Calling https://github.com/Ultimagen/healthomics-workflows/blob/main/workflows/efficient_dv/howto-custom-reference-genome.md
5. Variant calling https://github.com/Ultimagen/healthomics-workflows/blob/main/workflows/efficient_dv/howto-germline-calling-efficient-dv.md

The expected output is a gVCF with both variant and non-variant sites alligned to the monoploid reference genome.

Input files for calling variants

The workflow takes three major inputs:

Reads aligned to a reference genome, sorted and duplicate-marked, stored in CRAM format, and indexed with CRAI. The output files of interest are (done):
output_basename.cram
output_basename.cram.crai
Reference genome files (done):
haplo_01.fa
haplo_01.fa.fai
haplo_01.fa.dict
haplo_01.interval_list \
A model checkpoint in ONNX format. where to get it?

Workflow details

The workflow is composed of three steps:

make_examples
call_variants
post_process

During the step make_examples there is necesary to generate an ${REF%.fa}.interval_list file with the script 05_interval_list.sh. The output of this script will be required for the next step during the make_examples step.

The script 05.1_interval_bed.sh create a $N$ interval list with function IntervalListTools from PICARD and convert the interval list files into .bed.

The last step for make_examples is to run the script 06_make.sh. This script is a foor loop to run make_examples.sif for each .bed file generated from 05.1_interval_bed.sh.

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
03_pipeline		03_pipeline
05_UG_scripts		05_UG_scripts
README.md		README.md
commands_Shaun.md		commands_Shaun.md
output demux.md		output demux.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

UG100 S18-NGX

progress 2026-05-01

Two approaches will be tested

Data Locations (MSI)

Pipeline

progress 2026-05-08

Input files for calling variants

Workflow details

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

UG100 S18-NGX

progress 2026-05-01

Two approaches will be tested

Data Locations (MSI)

Pipeline

progress 2026-05-08

Input files for calling variants

Workflow details

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages