# SG10K_Health manifest

SG10K_Health is composed of 10,323 individuals.

In [1]:
%load_ext dotenv

In [2]:
%dotenv

In [3]:
import os
import pandas as pd
import re
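
The cells in this notebook read file paths and a bucket name from environment variables loaded by `%dotenv`. As a minimal sketch, an early check can report any variable that is unset before a later cell fails on it. The variable names are the ones used in this notebook; `missing_env_vars` is a hypothetical helper, not part of the original analysis.

```python
import os

# Variable names taken from the cells in this notebook.
REQUIRED_VARS = [
    "DGO_META", "CRAM_LIST", "GVCF_LIST",
    "CRAM_BUCKET", "CRAM_MANIFEST", "CRAI_MANIFEST",
]

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```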

## Sample metadata

Initially, 10,714 samples were registered. A subset of 10,323 samples was successfully sequenced and included in the SG10K_Health dataset.

In [5]:
# Load DGO meta
df_meta = pd.read_csv(f"../{os.environ['DGO_META']}", compression='gzip', sep='\t')
# df_meta # 10,714
print(f'n = {len(df_meta):,}')

n = 10,714




In [6]:
print(df_meta.columns.tolist())

['NPM Research ID', 'Multiplex Pool ID', 'Supplier ID', 'GIS Internal Sample ID', 'Site Supplying Sample', 'Year Of Birth', 'Supplied Gender', 'Self Reported Ethnicity', 'Extraction Kit', 'Date Of DNA Extraction', 'Plate Position', 'Plate Name', 'Version Of Consent Form Signed', 'Sequencing Depth', 'NPM Research ID Created By Username', 'NPM Research ID Creation Date', 'Comments Entered When NPM Research ID Created', 'Description Entered When NPM Research ID Created', 'ELM Project ID', 'ELM Project Title', 'ELM Project PI', 'Species Of Sample Sequenced', 'Tehcnique For Sequencing', 'Tissue Type Sequenced', 'Library Found In Sequencing Run ID', 'Library Found In Passed Sequencing Run ID', 'Vendor Sequencing Centre', 'DNA Sample Passed QC', 'Library Prep Kit', 'Run ID', 'Instrument ID', 'Instrument Serial Number', 'Hiseq XTM SBS Kit 300 Cycles V2 (box 1of 2) Lot', 'Hiseq XTM SBS Kit 300 Cycles V2 (box 2 Of 2) Lot', 'Hiseq XTM PE Cluster Kit Cbottm V2 (box 1 Of 2) Lot', 'Hiseq XTM PE Clus

In [7]:
# Select available samples
df_meta_sg = df_meta.loc[(df_meta['Sequencing Complete'] == 'Y') & (df_meta['Current'] == 'Y')]
# df_meta_sg # 10,323
print(f'n = {len(df_meta_sg):,}')

n = 10,323


## GATK4 CRAM files available

In [9]:
# Load CRAM manifest
df_cram = pd.read_csv(f"../{os.environ['CRAM_LIST']}", header=None, names=['cram'])
# Extract sample name
df_cram['sample'] = df_cram['cram'].apply(lambda x: re.search(r'([^/]+)\.bqsr', x).group(1))

# df_cram
print(f'n = {len(df_cram):,}')

n = 10,323
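
The extraction above calls `.group(1)` directly, which raises `AttributeError` if a line in the CRAM list does not contain `.bqsr`. A minimal sketch of a more defensive variant follows; it uses the same pattern, `cram_sample` is a hypothetical helper, and the example paths are made up for illustration.

```python
import re

# Same pattern as the cell above: the last path component before ".bqsr".
SAMPLE_RE = re.compile(r'([^/]+)\.bqsr')

def cram_sample(path):
    """Return the sample name from a CRAM path, or None if the pattern is absent."""
    match = SAMPLE_RE.search(path)
    return match.group(1) if match else None

# Hypothetical paths, for illustration only.
print(cram_sample("some/prefix/1234-5678.bqsr.cram"))  # 1234-5678
print(cram_sample("some/prefix/README.txt"))           # None
```

Applied with `df_cram['cram'].apply(cram_sample)`, unmatched rows would show up as missing values that can be inspected rather than stopping the cell.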


## DRAGEN gVCF files available

The DRAGEN reanalysis effort includes both SG10K_Health and SG10K_Disease. The gVCF manifest therefore lists more files than SG10K_Health alone.

[2024-10-16] As of today, 1,543 SG10K_Health samples are missing from the DRAGEN reanalysis.

In [10]:
# Load gvcf manifest
df_gvcf = pd.read_csv(f"../{os.environ['GVCF_LIST']}", header=None, names=['gvcf'])

# Note that for some samples the CRAM is named 1234-5678.bqsr.cram
# while the gVCF is named 1234-5678-ABC-DEF.hard-filtered.gvcf.gz,
# so the sample IDs need to be aligned.

# Take the sample name from the parent folder (third path component):
# if it contains '-', split on '-', keep the first two items and re-join them with '-';
# otherwise keep the full folder name.

df_gvcf['sample'] = df_gvcf['gvcf'].apply(lambda x: '-'.join(x.split('/')[2].split('-')[:2]) if '-' in x.split('/')[2] else x.split('/')[2])

# df_gvcf
print(f'n = {len(df_gvcf):,}')

n = 11,064
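
The one-line lambda above can be easier to check when written as a named function. The following is a sketch with the same logic; `gvcf_sample` and its `folder_index` parameter are hypothetical names, and the example paths are invented for illustration (the real manifest layout is only assumed to place the sample folder at index 2).

```python
def gvcf_sample(path, folder_index=2):
    """Derive the sample ID from the folder at position `folder_index` of the path."""
    folder = path.split('/')[folder_index]
    # Keep only the first two dash-separated fields when the folder name has dashes.
    if '-' in folder:
        return '-'.join(folder.split('-')[:2])
    return folder

# Hypothetical paths, for illustration only.
print(gvcf_sample("dragen/batch1/1234-5678-ABC-DEF/1234-5678-ABC-DEF.hard-filtered.gvcf.gz"))  # 1234-5678
print(gvcf_sample("dragen/batch1/WHB1234/WHB1234.hard-filtered.gvcf.gz"))                      # WHB1234
```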


In [11]:
# Left-merge the SG10K_Health CRAM list with the DRAGEN gVCF list
df_sg = df_cram.merge(df_gvcf, how='left', on='sample')
# df_sg
print(f'n = {len(df_sg):,}')

n = 10,323


In [12]:
# Count samples with and without a DRAGEN gVCF
print(f"samples with VCF: {len(df_sg.loc[df_sg['gvcf'].notnull()]):,}")
print(f"samples without VCF: {len(df_sg.loc[df_sg['gvcf'].isnull()]):,}")

samples with VCF: 8,780
samples without VCF: 1,543
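
The left merge and the counts above rely on each sample name appearing at most once in each manifest. As a sketch on toy data, the `validate` and `indicator` options of `DataFrame.merge` can make that assumption explicit and label where each row came from. The sample names and file names below are hypothetical, and whether the real manifests satisfy `one_to_one` is not shown here.

```python
import pandas as pd

# Toy data with hypothetical sample names.
cram = pd.DataFrame({"sample": ["S1", "S2", "S3"],
                     "cram": ["S1.bqsr.cram", "S2.bqsr.cram", "S3.bqsr.cram"]})
gvcf = pd.DataFrame({"sample": ["S1", "S3", "S9"],
                     "gvcf": ["S1.gvcf.gz", "S3.gvcf.gz", "S9.gvcf.gz"]})

# validate raises pandas.errors.MergeError if a sample appears more than once
# on either side; indicator adds a "_merge" column naming the source of each row.
merged = cram.merge(gvcf, how="left", on="sample",
                    validate="one_to_one", indicator=True)

n_with = int((merged["_merge"] == "both").sum())
n_without = int((merged["_merge"] == "left_only").sum())
print(f"samples with gVCF: {n_with:,}")
print(f"samples without gVCF: {n_without:,}")
```

With a left merge, rows present only in the gVCF list (such as `S9` here) are dropped, which matches the intent of restricting the DRAGEN manifest to SG10K_Health.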


## Generate CRAM & CRAI manifest file

To generate the missing DRAGEN gVCFs, we start from the GATK4 CRAMs, re-create the FASTQ files, and run DRAGEN on those FASTQ files.
The first step is to create manifests of the CRAM and CRAI files so that they can be restored from the archive.

In [13]:
# Select samples with a missing gVCF
df_f = df_sg.loc[df_sg['gvcf'].isnull()].copy()
# Add bucket
df_f['cram_bucket'] = os.environ['CRAM_BUCKET']
# Add crai path
df_f['crai'] = df_f['cram'] + '.crai'

print(f'n = {len(df_f):,}')

# Create cram manifest
df_f.to_csv(f"../{os.environ['CRAM_MANIFEST']}", columns=['cram_bucket', 'cram'], header=False, index=False)
# Create crai manifest
df_f.to_csv(f"../{os.environ['CRAI_MANIFEST']}", columns=['cram_bucket', 'crai'], header=False, index=False)

n = 1,543
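
The manifests are two-column CSV files without a header: bucket, then path. A minimal sketch of the same write on toy data follows, with a read-back check that the row count and the `.crai` suffix survive the round trip. The bucket name and paths are hypothetical, and an in-memory buffer stands in for the real manifest file.

```python
import io
import pandas as pd

# Toy rows; the bucket name and CRAM paths are hypothetical.
df = pd.DataFrame({"cram": ["a/S1.bqsr.cram", "a/S2.bqsr.cram"]})
df["cram_bucket"] = "example-bucket"
df["crai"] = df["cram"] + ".crai"

# Write to an in-memory buffer instead of a file, with the same options as above.
buffer = io.StringIO()
df.to_csv(buffer, columns=["cram_bucket", "crai"], header=False, index=False)

# Read the manifest back and check it against the source frame.
buffer.seek(0)
check = pd.read_csv(buffer, header=None, names=["cram_bucket", "crai"])
assert len(check) == len(df)
assert check["crai"].str.endswith(".cram.crai").all()
print(buffer.getvalue())
```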
