# Preprocessing ONT WGS VCFs for Deep Learning

Starting from the three VCF files containing the true positives, false positives, and false negatives, the goal is to produce a one-hot encoded NumPy array of the sequence context around each variant, for use as direct input to deep learning models. The hope is that neural networks can learn features from the sequence context that distinguish true variants from artifacts.

The following steps get us there:
- extract the sequence context of every variant in the VCF files from the reference genome
- create a pandas features-and-labels dataframe, with the sequence context as the feature and artifact/not-artifact as the label
- split the data into train-validation and test sets
- one-hot encode the sequences and labels and save them as NumPy arrays for input into the deep learning models

## 1. Extract from reference genome 201 bases centered at each variant call

In [3]:
!find 06_vcf_TP_FP_FN -type f -name "*.vcf"

06_vcf_TP_FP_FN/FP_snvs.vcf
06_vcf_TP_FP_FN/FN_snvs.vcf
06_vcf_TP_FP_FN/TP_snvs.vcf


In [4]:
%%bash
# converting VCFs to BED with bedops vcf2bed
# Loop through all .vcf files in the directory
for vcf_file in 06_vcf_TP_FP_FN/*.vcf; do
    vcf2bed \
        --do-not-sort \
        --do-not-split \
        < "$vcf_file" \
            | awk 'BEGIN {OFS="\t"} {print $(1), $(2), $(3)}' \
                > "${vcf_file%.vcf}_vcf2bed.bed"
    cat "${vcf_file%.vcf}_vcf2bed.bed" | head -n 10
done

chr1	853662	853663
chr1	853669	853670
chr1	853718	853719
chr1	853875	853876
chr1	925397	925398
chr1	933600	933601
chr1	950625	950626
chr1	960325	960326
chr1	995542	995543
chr1	1000078	1000079
chr1	643599	643600
chr1	800908	800909
chr1	818025	818026
chr1	832881	832882
chr1	855379	855380
chr1	883023	883024
chr1	883030	883031
chr1	883037	883038
chr1	883038	883039
chr1	883054	883055
chr1	783005	783006
chr1	783174	783175
chr1	784859	784860
chr1	785416	785417
chr1	797391	797392
chr1	798617	798618
chr1	798661	798662
chr1	800045	800046
chr1	801141	801142
chr1	801142	801143
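
The BED rows above use 0-based, half-open coordinates, whereas VCF `POS` is 1-based, so an SNV at position p becomes the interval [p-1, p). A minimal sketch of that conversion for single-base records follows; the helper name `snv_to_bed` is hypothetical and not part of BEDOPS.

```python
def snv_to_bed(chrom: str, pos: int) -> tuple[str, int, int]:
    """Convert a 1-based VCF SNV position to a 0-based, half-open BED interval."""
    if pos < 1:
        raise ValueError("VCF POS is 1-based and must be >= 1")
    return chrom, pos - 1, pos


# An SNV at chr1:853663 (1-based) would map to the BED row chr1 853662 853663.
print(snv_to_bed("chr1", 853663))
```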


In [6]:
%%bash
# expand bed files +100bp on each side
for bed_file in 06_vcf_TP_FP_FN/*.bed; do
    bedtools slop \
        -i "$bed_file" \
        -g ~/tools/bedtools/human.hg38.genome \
        -b 100 \
        > "${bed_file%_vcf2bed.bed}_vcf_slop_100.bed"
    echo "${bed_file%_vcf2bed.bed}_vcf_slop_100.bed"
    cat "${bed_file%_vcf2bed.bed}_vcf_slop_100.bed" | head -n 10
    echo ""
done

06_vcf_TP_FP_FN/FN_snvs_vcf_slop_100.bed
chr1	853562	853763
chr1	853569	853770
chr1	853618	853819
chr1	853775	853976
chr1	925297	925498
chr1	933500	933701
chr1	950525	950726
chr1	960225	960426
chr1	995442	995643
chr1	999978	1000179

06_vcf_TP_FP_FN/FP_snvs_vcf_slop_100.bed
chr1	643499	643700
chr1	800808	801009
chr1	817925	818126
chr1	832781	832982
chr1	855279	855480
chr1	882923	883124
chr1	882930	883131
chr1	882937	883138
chr1	882938	883139
chr1	882954	883155

06_vcf_TP_FP_FN/TP_snvs_vcf_slop_100.bed
chr1	782905	783106
chr1	783074	783275
chr1	784759	784960
chr1	785316	785517
chr1	797291	797492
chr1	798517	798718
chr1	798561	798762
chr1	799945	800146
chr1	801041	801242
chr1	801042	801243
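
The arithmetic behind `bedtools slop -b 100` on a single-base interval can be sketched as below: 100 bases are added on each side, giving a 201 bp window, and the result is clamped to the chromosome bounds. The function `slop` here is a hypothetical stand-in for illustration, not the bedtools implementation.

```python
def slop(start: int, end: int, chrom_size: int, flank: int = 100) -> tuple[int, int]:
    """Extend a 0-based, half-open interval by `flank` bases per side, clamped to the chromosome."""
    return max(0, start - flank), min(chrom_size, end + flank)


# chr1 is 248,956,422 bp long in GRCh38.
new_start, new_end = slop(853662, 853663, chrom_size=248_956_422)
print(new_start, new_end, new_end - new_start)
```

Because of the clamping, a variant within 100 bp of a chromosome end would yield a window shorter than 201 bp, which is worth checking before stacking sequences into a fixed-shape array.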



In [20]:
# create a modified reference that carries the SNVs called in the sample in place of the REF alleles
# the 201 bp sequences will be pulled from this modified reference
!bcftools sort \
    05_vcf_isec_GIAB/dorado_bcftools_P_099_isec.vcf \
    -o 05_vcf_isec_GIAB/dorado_bcftools_P_099_isec_sorted.vcf.gz \
    -m 16G \
    --write-index
!mkdir -p 07_bedtools_getfasta
!bcftools consensus \
    -s - \
    -f ~/tools/GIAB/GRCh38/refs/GRCh38_GIABv3.fasta \
    05_vcf_isec_GIAB/dorado_bcftools_P_099_isec_sorted.vcf.gz \
    -o 07_bedtools_getfasta/modified_reference.fasta

Writing to /tmp/bcftools.62iVE9
Merging 1 temporary files
Cleaning
Done
Applied 3773914 variants
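
Conceptually, building the modified reference means substituting each called ALT base into the reference sequence. For SNVs only, the idea can be sketched on a toy sequence in plain Python. This is an illustration and not how bcftools is implemented; `apply_snvs` is a hypothetical helper.

```python
def apply_snvs(reference: str, snvs: dict[int, tuple[str, str]]) -> str:
    """Return `reference` with each SNV's ALT base substituted in.

    `snvs` maps a 1-based position to (REF, ALT); only single-base changes are handled.
    """
    seq = list(reference)
    for pos, (ref, alt) in snvs.items():
        if seq[pos - 1] != ref:
            raise ValueError(f"REF mismatch at position {pos}")
        seq[pos - 1] = alt
    return "".join(seq)


print(apply_snvs("ACGTACGT", {3: ("G", "T"), 8: ("T", "C")}))
```

Since single-base substitutions do not change sequence length, coordinates on the modified reference line up with the original reference, so the slopped BED intervals remain valid as long as only SNVs are applied.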


In [42]:
%%bash
# extract the actual slop sequences from the modified human reference genome using bedtools getfasta
for bed_file in 06_vcf_TP_FP_FN/*_vcf_slop_100.bed; do
    filename=$(basename "$bed_file" _vcf_slop_100.bed)
    output_file="07_bedtools_getfasta/${filename}_bed_getfasta.bed"

    bedtools getfasta \
        -fi 07_bedtools_getfasta/modified_reference.fasta \
        -bed "$bed_file" \
        -bedOut \
        > "$output_file"

    echo "$output_file"
    cat "$output_file" | head -n 10
    echo ""
done




07_bedtools_getfasta/FN_snvs_bed_getfasta.bed
chr1	853562	853763	CTGATGAACGTGAGATGACCGCCGTGTGGTAAACTGATGAACCCCGACCCTGATGAACGTGAGATGACCGCCGTGTGGTAAACTGATGAACCCCGACCCTGATGAACGTGAGATGACCGCCGTGTGGTAAACTGATGAACCCCGACCCTGATCAACGTGAGATGACCGCCGTGTGGTAAACTGATGAACCCCGACCCTGAT
chr1	853569	853770	ACGTGAGATGACCGCCGTGTGGTAAACTGATGAACCCCGACCCTGATGAACGTGAGATGACCGCCGTGTGGTAAACTGATGAACCCCGACCCTGATGAACGTGAGATGACCGCCGTGTGGTAAACTGATGAACCCCGACCCTGATCAACGTGAGATGACCGCCGTGTGGTAAACTGATGAACCCCGACCCTGATGAACATG
chr1	853618	853819	ACGTGAGATGACCGCCGTGTGGTAAACTGATGAACCCCGACCCTGATGAACGTGAGATGACCGCCGTGTGGTAAACTGATGAACCCCGACCCTGATCAACGTGAGATGACCGCCGTGTGGTAAACTGATGAACCCCGACCCTGATGAACATGAGATGACCGCCGTGTGGTAAACTGATGAACCCCGACCCTGATCAACATG
chr1	853775	853976	ACCGCCGTGTGGTAAACTGATGAACCCCGACCCTGATCAACATGAGATGACCGCCGTGTGGTAAACTGATGAACCCCGACCCTGATCAACATGAGATGACTGCCGTGTGGTAAACTGATGAACCCTGACCCATTAGGCTTTGGCTACAGAATGTGGAAATAAGTTGTGTTACTACATGTGTGTAATCCTAGGGTGCAGGAC
chr1	925297	925498	CTGGAGGGGCACACGTGCGGCCGGGTGCGAGCGCGCGGCGGGGGAGGCTGC



07_bedtools_getfasta/FP_snvs_bed_getfasta.bed
chr1	643499	643700	ACCGGCAAATTCTGTTGTTTGTATAAACATCAGCCATGTTTATATAACTAAACTAGTGTTTTGTTTTGTCAATTCAGCAAGAAATTAGACCACATGGTGGTTTAATGCTGCATTGATTTGGCTATCAATTTGTTTTCACTTTTCTGCAAAATATTTAATACATTATTAAATTGAATTATGCTGATGCCACAGTTGTTCTTA
chr1	800808	801009	TTGATGGGTGCAGCAAACCACCATGGCATGTGTATACCTATGTAACAAACCTGCACATTGAGCATATGTATACCAGAACTTAAAAGTATAATTTAAAAAAATTTTTAAAAAGTCATATGATGCATTTAAGAAAGTCACTTAATTTACATCAGAGGAAAATCAAAGTTTATAGACTTAGGAAATAAAGTCGTAATGAAGAAG
chr1	817925	818126	AGTTTTATCCACTTTATGTGAAGAAAGCCAACAAGGGGCATGGAGTGAGTTCCGCAGGTTTTAGCGGCTGCGGCGGCTGGTGCTCAGTGGGGATGATGGAAGGAAGGCGCCTCCCTCTGCGGGCCCCGAGGTCTGTGCGGGAATCAGCTCTGCAGCTGTGTCCAGGGGGAGCCGTAGACCACACACGGCAGGCTCACAGCT
chr1	832781	832982	AGATGGAGTCTTGCTTTGTCGCCCAGGCTGTAGTGCAGTGGCGTGATCTTGGCTCACTGCAGCCTCCACCTTAGAGCAATCCTCTTGCCTCCTCCTCCCGAGTAGTTGGGACTACATGTGCATGCCACATGCCTGGCTAATTTTTGTATTTTTAGTAGAGACACGGTTTCACCATGTTGGCCAGGCTGGTGTCCAACTCCT
chr1	855279	855480	TGATGTACGGGTGTATCTGTGTATTGTGTATGCACACACGAGCATATGTGT



07_bedtools_getfasta/TP_snvs_bed_getfasta.bed
chr1	782905	783106	TAAAATATGCCACAGATTTCTAAGACTGAGCATGGAAAAGAAAATCTCCATAATTTTTTATATTGATTGTATACTGCAGTGATAATATTTTGGATGTATCGGGTTAAATAAAATTGACTGATTTCACCTTTTTCCTATTTTAAAAGTGGCTACTAAGAAAATTTTAAATTACTTACATGACCGACATGGTATTTTTATTTG
chr1	783074	783275	TACTTACATGACCGACATGGTATTTTTATTTGGCAGCGCTGCTCTAAGCTGTTGATGAAAAATATTGTTGGTGAGCTCTGCTTAGGTAATATATAGGACACGAGCAGAGAGGAGGCACGTGAACAGTTCTGGCCTGGAGTAGGCTTCATTGAGGCTGTGATGCTTTTAGCTGGATTTGAAGAAGTGGTAGTGATCATCCCA
chr1	784759	784960	GGAAATGAGATATTTATGTAGTCTTAAGGTATCTTCTCACAAATTATGTATTACTTACAAAGGGGAAAAGTGCACCTTGACGTGAGAATCTTGGCAGAACCACTTTAATCAAGAGGTTTAGTTGAGCATTATCAGTAACAGAGTAAATCAAAATCATAAGCCACTTGATAGATACAATGAGAAAATAAAGCATCACTTCTG
chr1	785316	785517	AGGGTGACAGGGCATTATATCAGCCACTTATCAACAAACATTTCAGAGAAAAATAGTTCCTTGTATTGTTATTGCAATTTTCTTTGAGATTGCCCCCTCCAAAACAGTAAGAACTTTCAAAAACAAACAAAGATATCAAGCCACAGATTCAAAGTGCTATAAACTCCACGCATGATTAGTTCTCACTCATAGGTGGGAATT
chr1	797291	797492	AACTTTGCTCAATCTTCGCCTTTGGGTTCCTTTAGGTTTATAATGAACTGT
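
Before building the dataframe, it may be worth confirming that each extracted sequence spans its full interval and that the variant sits at the window center (index 100 of 201). A minimal sketch of such a check on one `getfasta -bedOut` row, using a hypothetical helper `check_window` and a toy row:

```python
def check_window(line: str, flank: int = 100) -> str:
    """Validate one tab-separated chrom/start/end/sequence row and return the center base."""
    chrom, start, end, seq = line.rstrip("\n").split("\t")
    if len(seq) != int(end) - int(start):
        raise ValueError(f"{chrom}:{start}-{end}: sequence length {len(seq)} != interval length")
    if len(seq) != 2 * flank + 1:
        raise ValueError(f"{chrom}:{start}-{end}: window is not {2 * flank + 1} bp long")
    return seq[flank]


# Toy 5 bp window (flank of 2); the center base is at index 2.
print(check_window("chrT\t0\t5\tACGTA", flank=2))
```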

In [43]:
# index modified reference fasta
!ls -alth 07_bedtools_getfasta
!samtools faidx 07_bedtools_getfasta/modified_reference.fasta
!ls -alth 07_bedtools_getfasta

total 3.8G
drwxr-xr-x 11 fmbuga wlee 4.0K Nov 17 18:45 ..
-rw-r--r--  1 fmbuga wlee 698M Nov 17 18:44 TP_snvs_bed_getfasta.bed
drwxr-xr-x  2 fmbuga wlee  194 Nov 17 18:44 .
-rw-r--r--  1 fmbuga wlee 116M Nov 17 18:44 FP_snvs_bed_getfasta.bed
-rw-r--r--  1 fmbuga wlee  26M Nov 17 18:44 FN_snvs_bed_getfasta.bed
-rw-r--r--  1 fmbuga wlee 3.0G Nov 17 18:02 modified_reference.fasta
-rw-r--r--  1 fmbuga wlee 8.8K Nov 17 17:25 modified_reference.fasta.fai
total 3.8G
-rw-r--r--  1 fmbuga wlee 8.8K Nov 17 18:49 modified_reference.fasta.fai
drwxr-xr-x 11 fmbuga wlee 4.0K Nov 17 18:45 ..
-rw-r--r--  1 fmbuga wlee 698M Nov 17 18:44 TP_snvs_bed_getfasta.bed
drwxr-xr-x  2 fmbuga wlee  194 Nov 17 18:44 .
-rw-r--r--  1 fmbuga wlee 116M Nov 17 18:44 FP_snvs_bed_getfasta.bed
-rw-r--r--  1 fmbuga wlee  26M Nov 17 18:44 FN_snvs_bed_getfasta.bed
-rw-r--r--  1 fmbuga wlee 3.0G Nov 17 18:02 modified_reference.fasta


## 2. Create features and labels Pandas dataframe

In [4]:
import pandas as pd

pd.read_csv('07_bedtools_getfasta/FN_snvs_bed_getfasta.bed', 
            sep='\t', 
            header=None).head(10)

Unnamed: 0,0,1,2,3
0,chr1,853562,853763,CTGATGAACGTGAGATGACCGCCGTGTGGTAAACTGATGAACCCCG...
1,chr1,853569,853770,ACGTGAGATGACCGCCGTGTGGTAAACTGATGAACCCCGACCCTGA...
2,chr1,853618,853819,ACGTGAGATGACCGCCGTGTGGTAAACTGATGAACCCCGACCCTGA...
3,chr1,853775,853976,ACCGCCGTGTGGTAAACTGATGAACCCCGACCCTGATCAACATGAG...
4,chr1,925297,925498,CTGGAGGGGCACACGTGCGGCCGGGTGCGAGCGCGCGGCGGGGGAG...
5,chr1,933500,933701,TCTCTGGGAAGCTAGTTCTCCTCCTGCAGGCGTCCTGGGGACACCA...
6,chr1,950525,950726,CAAAGGTACACGCAAAGGTACACACATGCATACACACAGGTGTACA...
7,chr1,960225,960426,CTGTGGGGGGCTTCCCGGGGAAGAAGGAAGGCGAGACCTAGGGGGG...
8,chr1,995442,995643,GATGTATAGATCGATGGAAAAGAATTGAGGGTTCAGAAATAGGTCC...
9,chr1,999978,1000179,GCCCCGCGCCTCCCCGTGCCGGGTGGAGCGCGCCGCCACGGACCAC...


#### Add `labels01` and `labels012` columns to the getfasta dataframes

`labels01` labels true positives as 0 and everything else (the artifacts) as 1. `labels012` labels true positives as 0, false positives as 1, and false negatives as 2.
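
The two labelling schemes can be written down as a small lookup, sketched here with hypothetical names (`LABELS012`, `labels01`) that do not appear elsewhere in the notebook:

```python
# Three-class labels: true positive, false positive, false negative.
LABELS012 = {"TP": 0, "FP": 1, "FN": 2}


def labels01(call_class: str) -> int:
    """Binary label: 0 for a true positive, 1 for any artifact (FP or FN)."""
    return 0 if LABELS012[call_class] == 0 else 1


for call_class in ("TP", "FP", "FN"):
    print(call_class, labels01(call_class), LABELS012[call_class])
```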

In [5]:
FN_seqs = pd.read_csv('07_bedtools_getfasta/FN_snvs_bed_getfasta.bed', 
            sep='\t', 
            header=None)
FP_seqs = pd.read_csv('07_bedtools_getfasta/FP_snvs_bed_getfasta.bed', 
            sep='\t', 
            header=None)
TP_seqs = pd.read_csv('07_bedtools_getfasta/TP_snvs_bed_getfasta.bed', 
            sep='\t', 
            header=None)
len(FN_seqs), len(FP_seqs), len(TP_seqs)

(119744, 537140, 3236774)

In [6]:
TP_seqs['labels01'] = TP_seqs['labels012'] = 0
TP_seqs.head(n=10)

Unnamed: 0,0,1,2,3,labels01,labels012
0,chr1,782905,783106,TAAAATATGCCACAGATTTCTAAGACTGAGCATGGAAAAGAAAATC...,0,0
1,chr1,783074,783275,TACTTACATGACCGACATGGTATTTTTATTTGGCAGCGCTGCTCTA...,0,0
2,chr1,784759,784960,GGAAATGAGATATTTATGTAGTCTTAAGGTATCTTCTCACAAATTA...,0,0
3,chr1,785316,785517,AGGGTGACAGGGCATTATATCAGCCACTTATCAACAAACATTTCAG...,0,0
4,chr1,797291,797492,AACTTTGCTCAATCTTCGCCTTTGGGTTCCTTTAGGTTTATAATGA...,0,0
5,chr1,798517,798718,AACATACTTTGCTAATACATTTTAATCTGGCATTTTTATGGGGGTA...,0,0
6,chr1,798561,798762,TAATTATAGGAAATGCCTGGAATTAAATAGCCTACAACCAATTCTT...,0,0
7,chr1,799945,800146,TTTTGCTATGCAGAAGCTCTTTAGTTTAATTAGATCCCATTTGTCA...,0,0
8,chr1,801041,801242,GGTCTCTGTCCTTGATTCCATTGTGACCTTCAGCCCATCTCTCTGG...,0,0
9,chr1,801042,801243,GTCTCTGTCCTTGATTCCATTGTGACCTTCAGCCCATCTCTCTGGG...,0,0


In [7]:
FP_seqs['labels01'] = FP_seqs['labels012'] = 1
FP_seqs.head(n=10)

Unnamed: 0,0,1,2,3,labels01,labels012
0,chr1,643499,643700,ACCGGCAAATTCTGTTGTTTGTATAAACATCAGCCATGTTTATATA...,1,1
1,chr1,800808,801009,TTGATGGGTGCAGCAAACCACCATGGCATGTGTATACCTATGTAAC...,1,1
2,chr1,817925,818126,AGTTTTATCCACTTTATGTGAAGAAAGCCAACAAGGGGCATGGAGT...,1,1
3,chr1,832781,832982,AGATGGAGTCTTGCTTTGTCGCCCAGGCTGTAGTGCAGTGGCGTGA...,1,1
4,chr1,855279,855480,TGATGTACGGGTGTATCTGTGTATTGTGTATGCACACACGAGCATA...,1,1
5,chr1,882923,883124,GCAGTGTTTGGTTTTCTTTTCTTTTTTTCTTTCTCTCTTTTCTTTT...,1,1
6,chr1,882930,883131,TTGGTTTTCTTTTCTTTTTTTCTTTCTCTCTTTTCTTTTTTTTTTT...,1,1
7,chr1,882937,883138,TCTTTTCTTTTTTTCTTTCTCTCTTTTCTTTTTTTTTTTTTGAGAC...,1,1
8,chr1,882938,883139,CTTTTCTTTTTTTCTTTCTCTCTTTTCTTTTTTTTTTTTTGAGACA...,1,1
9,chr1,882954,883155,TCTCTCTTTTCTTTTTTTTTTTTTGAGACAAACTTTCACTCTTGTT...,1,1


In [8]:
FN_seqs['labels01'] = 1
FN_seqs['labels012'] = 2
FN_seqs.head(n=10)

Unnamed: 0,0,1,2,3,labels01,labels012
0,chr1,853562,853763,CTGATGAACGTGAGATGACCGCCGTGTGGTAAACTGATGAACCCCG...,1,2
1,chr1,853569,853770,ACGTGAGATGACCGCCGTGTGGTAAACTGATGAACCCCGACCCTGA...,1,2
2,chr1,853618,853819,ACGTGAGATGACCGCCGTGTGGTAAACTGATGAACCCCGACCCTGA...,1,2
3,chr1,853775,853976,ACCGCCGTGTGGTAAACTGATGAACCCCGACCCTGATCAACATGAG...,1,2
4,chr1,925297,925498,CTGGAGGGGCACACGTGCGGCCGGGTGCGAGCGCGCGGCGGGGGAG...,1,2
5,chr1,933500,933701,TCTCTGGGAAGCTAGTTCTCCTCCTGCAGGCGTCCTGGGGACACCA...,1,2
6,chr1,950525,950726,CAAAGGTACACGCAAAGGTACACACATGCATACACACAGGTGTACA...,1,2
7,chr1,960225,960426,CTGTGGGGGGCTTCCCGGGGAAGAAGGAAGGCGAGACCTAGGGGGG...,1,2
8,chr1,995442,995643,GATGTATAGATCGATGGAAAAGAATTGAGGGTTCAGAAATAGGTCC...,1,2
9,chr1,999978,1000179,GCCCCGCGCCTCCCCGTGCCGGGTGGAGCGCGCCGCCACGGACCAC...,1,2


In [9]:
# concatenate the 3 dfs
TP_FP_FN_seqs_labels01_labels012 = pd.concat([TP_seqs, FP_seqs, FN_seqs], ignore_index=True)
TP_FP_FN_seqs_labels01_labels012.shape

(3893658, 6)

In [10]:
# rename columns
TP_FP_FN_seqs_labels01_labels012.rename(columns={0: "chrom", 
                                                 1: "pos_start", 
                                                 2: "pos_end", 
                                                 3: "sequences"}, 
                                       inplace=True)
TP_FP_FN_seqs_labels01_labels012.head(n=10)

Unnamed: 0,chrom,pos_start,pos_end,sequences,labels01,labels012
0,chr1,782905,783106,TAAAATATGCCACAGATTTCTAAGACTGAGCATGGAAAAGAAAATC...,0,0
1,chr1,783074,783275,TACTTACATGACCGACATGGTATTTTTATTTGGCAGCGCTGCTCTA...,0,0
2,chr1,784759,784960,GGAAATGAGATATTTATGTAGTCTTAAGGTATCTTCTCACAAATTA...,0,0
3,chr1,785316,785517,AGGGTGACAGGGCATTATATCAGCCACTTATCAACAAACATTTCAG...,0,0
4,chr1,797291,797492,AACTTTGCTCAATCTTCGCCTTTGGGTTCCTTTAGGTTTATAATGA...,0,0
5,chr1,798517,798718,AACATACTTTGCTAATACATTTTAATCTGGCATTTTTATGGGGGTA...,0,0
6,chr1,798561,798762,TAATTATAGGAAATGCCTGGAATTAAATAGCCTACAACCAATTCTT...,0,0
7,chr1,799945,800146,TTTTGCTATGCAGAAGCTCTTTAGTTTAATTAGATCCCATTTGTCA...,0,0
8,chr1,801041,801242,GGTCTCTGTCCTTGATTCCATTGTGACCTTCAGCCCATCTCTCTGG...,0,0
9,chr1,801042,801243,GTCTCTGTCCTTGATTCCATTGTGACCTTCAGCCCATCTCTCTGGG...,0,0


In [71]:
# save mega pd.DataFrame to csv
TP_FP_FN_seqs_labels01_labels012.to_csv('231117_TP_FP_FN_seqs_labels01_labels012.csv', index=False)
TP_FP_FN_seqs_labels01_labels012.shape

(3893658, 6)

## 3. Split the data into train, validation and test

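- target a 70% train / 20% validation / 10% test split; this section separates only the 10% test set, leaving a combined 90% train-validation set
- include all of chr20 in the test split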
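
Because chr20 goes entirely into the test set, the stratified split of the remaining chromosomes has to contribute somewhat less than 10% for the combined test set to reach 10% of all rows. A sketch of that calculation follows; the function name is hypothetical, the total row count is the one reported in this notebook, and the chr20 row count is inferred from the reported test-set sizes (389,366 - 306,449).

```python
def stratified_test_fraction(n_total: int, n_holdout: int, target_test: float = 0.10) -> float:
    """Fraction of the non-held-out rows to sample so the combined test set is `target_test` of all rows."""
    n_rest = n_total - n_holdout
    n_needed = target_test * n_total - n_holdout
    if n_needed < 0:
        raise ValueError("held-out rows alone exceed the target test size")
    return n_needed / n_rest


# 3,893,658 rows in total, of which 82,917 are on chr20 (inferred, see above).
print(stratified_test_fraction(3_893_658, 82_917))
```

The result is roughly 0.0804, the kind of value that would be passed as `test_size` to `train_test_split`.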

In [11]:
df = pd.read_csv('231117_TP_FP_FN_seqs_labels01_labels012.csv')
df_no_20 = df[df['chrom'] != 'chr20'] # chr20 will be exclusively test split; ~ 2% of entire dataset

In [12]:
# stratified split of df_no_20; stratify on chrom and labels012
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# Create a combined categorical variable from chrom and labels012.
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
# that assigning a column directly on the filtered slice would trigger.
df_no_20 = df_no_20.assign(combined=df_no_20['chrom'] + '_' + df_no_20['labels012'].astype(str))

# Perform a stratified train-test split based on the combined variable.
# test_size is chosen so that this test subset plus all of chr20 is 10% of the full dataset.
train, test = train_test_split(df_no_20, test_size=0.08041711572631151, 
                               stratify=df_no_20['combined'], 
                               random_state=42)

train.shape, test.shape

((3504292, 7), (306449, 7))

In [13]:
# check distribution of labels 0, 1, 2 in train-validation set
contingency_table = pd.crosstab(train['chrom'], train['labels012'])

# convert counts to row percentages so each chromosome sums to 100
normalized_table1 = (contingency_table.div(contingency_table.sum(axis=1), axis=0) * 100).round(2)
normalized_table1

labels012,0,1,2
chrom,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
chr1,82.74,14.18,3.07
chr10,82.65,14.36,2.99
chr11,83.84,13.11,3.05
chr12,82.79,14.13,3.08
chr13,84.75,12.43,2.82
chr14,82.91,14.25,2.84
chr15,83.63,13.22,3.15
chr16,81.79,14.66,3.55
chr17,79.84,15.93,4.23
chr18,84.9,12.21,2.89


In [14]:
# check distribution of labels 0, 1, 2 in test set
contingency_table = pd.crosstab(test['chrom'], test['labels012'])

# convert counts to row percentages so each chromosome sums to 100
normalized_table2 = (contingency_table.div(contingency_table.sum(axis=1), axis=0) * 100).round(2)
normalized_table2

labels012,0,1,2
chrom,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
chr1,82.74,14.19,3.07
chr10,82.65,14.36,2.99
chr11,83.85,13.11,3.04
chr12,82.79,14.13,3.08
chr13,84.76,12.43,2.82
chr14,82.91,14.25,2.84
chr15,83.63,13.22,3.15
chr16,81.79,14.66,3.56
chr17,79.85,15.92,4.23
chr18,84.9,12.21,2.89


In [15]:
# check difference in distribution of labels 0, 1, 2 in train-validation set vs test set
normalized_table1 - normalized_table2

labels012,0,1,2
chrom,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
chr1,0.0,-0.01,0.0
chr10,0.0,0.0,0.0
chr11,-0.01,0.0,0.01
chr12,0.0,0.0,0.0
chr13,-0.01,0.0,0.0
chr14,0.0,0.0,0.0
chr15,0.0,0.0,0.0
chr16,0.0,0.0,-0.01
chr17,-0.01,0.01,0.0
chr18,0.0,0.0,0.0


- the per-chromosome label distribution of the train-validation set matches that of the test set to within 0.02 percentage points

In [16]:
# combine chr20 with rest of test set
chr20_df = df[df['chrom'] == 'chr20'].copy()  # copy so the new column is not set on a slice of df
chr20_df['combined'] = chr20_df['chrom'] + '_' + chr20_df['labels012'].astype(str)
# concatenate chr20_df and test
test_df = pd.concat([test, chr20_df], ignore_index=True)
len(test_df), len(train)

(389366, 3504292)

In [18]:
test_df.to_csv('231117_TP_FP_FN_seqs_labels01_labels012_test.csv', index=False)
train.to_csv('231117_TP_FP_FN_seqs_labels01_labels012_train_validation.csv', index=False)
train.head(n=10)

Unnamed: 0,chrom,pos_start,pos_end,sequences,labels01,labels012,combined
3198699,chr22,17920762,17920963,TGGAAATCTGGGGTCATCTCAGACCCTTCTCTGTTCTCCACTCCTC...,0,0,chr22_0
2552114,chr14,26643125,26643326,AGTTTCTTAATCCTGAGTTCTAATTTGATTGCACTGTGGTCTGAGA...,0,0,chr14_0
3189440,chr21,41464158,41464359,AAAAAATGGGAGTTTCTGTGCACAAACTCTCTCCCTGCCTGCTGCC...,0,0,chr21_0
3234028,chr22,49080339,49080540,CATCCCTGGTCGCCTCCCAGGAAGCCTGCCCAGGCCTAGTGCCTGT...,0,0,chr22_0
3596339,chr11,78303101,78303302,CTCGGGTGATGAGTGCACCAAAATCTCACAAATCACCACTAAAGAA...,1,1,chr11_1
2441643,chr13,37652807,37653008,CTCGGTGAGTCAGAAATTATGTTATCTTATTGCAATCACTAAGCAA...,0,0,chr13_0
473873,chr2,198235696,198235897,AGAGGCAGAGTAGCTATTATACCATCCCCAGCAGGAGATTCATGTA...,0,0,chr2_0
2813728,chr16,78365185,78365386,GTGCCCAGTTTAGTGCCTGCCAAGAGTAACTGGTGGGTCTGTGTCA...,0,0,chr16_0
2036432,chr10,87194820,87195021,GTTAAGAAATTATAAACTTAAGGCCAGGCACAGTGGCTCACGCCTA...,0,0,chr10_0
173879,chr1,183634163,183634364,AAAAGTTCTTACACAAAATCTCTGGTTGACAAGCTGTTATTTTCAA...,0,0,chr1_0


## 4. One-hot encode sequences and labels and save NumPy arrays

In [20]:
%%time
# Define a mapping of nucleotides to one-hot vectors
nucleotide_mapping = {
    'A': [1, 0, 0, 0],
    'C': [0, 1, 0, 0],
    'G': [0, 0, 1, 0],
    'T': [0, 0, 0, 1]
}

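# Function to one-hot encode a DNA sequence into a (len(sequence), 4) array.
# Only uppercase A/C/G/T are mapped; any other character (e.g. N) raises a KeyError.
def one_hot_encode(sequence):
    encoding = np.array([nucleotide_mapping[n] for n in sequence])
    return encoding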

# Apply one-hot encoding to the DNA sequences
encoded_sequences = train['sequences'].apply(one_hot_encode)
train['sequences'].shape, encoded_sequences.shape

CPU times: user 5min 57s, sys: 10.7 s, total: 6min 7s
Wall time: 6min 7s


((3504292,), (3504292,))

In [22]:
encoded_sequences = np.stack(encoded_sequences)
encoded_sequences.shape

(3504292, 201, 4)
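
The per-sequence `apply` above takes about six minutes. One possible alternative, sketched below under the assumption that all sequences are equal-length uppercase ASCII strings, is a NumPy lookup table indexed by byte value. `LOOKUP` and `one_hot_encode_batch` are hypothetical names. Unlike the dictionary version, this sketch encodes any character other than A/C/G/T as an all-zero row instead of raising an error.

```python
import numpy as np

# Lookup table indexed by ASCII code; rows for anything other than A/C/G/T stay all-zero.
LOOKUP = np.zeros((256, 4), dtype=np.int8)
for column, base in enumerate("ACGT"):
    LOOKUP[ord(base), column] = 1


def one_hot_encode_batch(sequences: list[str]) -> np.ndarray:
    """Encode equal-length DNA strings into an (n, length, 4) int8 array."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    # View the concatenated sequences as byte codes and index the table once.
    codes = np.frombuffer("".join(sequences).encode("ascii"), dtype=np.uint8)
    return LOOKUP[codes].reshape(len(sequences), length, 4)


encoded = one_hot_encode_batch(["ACGT", "TTNA"])
print(encoded.shape)
print(encoded[1])
```

The output is already `int8` and already three-dimensional, so no separate `np.stack` or `astype` step would be needed with this approach.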

In [23]:
%%time
# one-hot encode labels
from sklearn.preprocessing import OneHotEncoder
labels = train['labels01']

one_hot_encoder = OneHotEncoder(categories='auto')
labels = np.array(labels).reshape(-1, 1)
input_labels = one_hot_encoder.fit_transform(labels).toarray()

print('Labels:\n',labels.T)
print('One-hot encoded labels:\n',input_labels.T)

Labels:
 [[0 0 0 ... 0 1 1]]
One-hot encoded labels:
 [[1. 1. 1. ... 1. 0. 0.]
 [0. 0. 0. ... 0. 1. 1.]]
CPU times: user 280 ms, sys: 6.08 ms, total: 286 ms
Wall time: 283 ms


In [25]:
input_labels.shape

(3504292, 2)
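
For integer labels that run consecutively from 0, indexing an identity matrix produces the same column layout as the encoder output above. A minimal sketch, with `one_hot_labels` as a hypothetical helper name:

```python
import numpy as np


def one_hot_labels(labels, n_classes: int) -> np.ndarray:
    """One-hot encode integer class labels in the range [0, n_classes) as int8."""
    labels = np.asarray(labels, dtype=np.int64)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(n_classes, dtype=np.int8)[labels]


print(one_hot_labels([0, 0, 1], n_classes=2))
```

Passing `n_classes=3` with the `labels012` column would give the three-class encoding in the same way.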

In [28]:
# save the train-validation sequences and labels as int8 arrays in a single .npz file
encoded_sequences_int8 = encoded_sequences.astype(np.int8)
input_labels_int8 = input_labels.astype(np.int8)
np.savez("231117_encoded_seqs_labels_int8.npz", encoded_sequences=encoded_sequences_int8, input_labels=input_labels_int8) 
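
A quick round trip can confirm that the saved archive loads back with the expected keys, shapes, and dtype. The sketch below uses small toy arrays and a temporary file in place of the real data; the file name is made up for the example.

```python
import os
import tempfile

import numpy as np

# Toy arrays standing in for the real encoded sequences and labels.
seqs = np.zeros((3, 201, 4), dtype=np.int8)
labels = np.eye(2, dtype=np.int8)[[0, 1, 1]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_encoded_seqs_labels_int8.npz")
    np.savez(path, encoded_sequences=seqs, input_labels=labels)
    # Arrays are read into memory on access, so they remain usable after the file closes.
    with np.load(path) as data:
        loaded_seqs = data["encoded_sequences"]
        loaded_labels = data["input_labels"]

print(loaded_seqs.shape, loaded_seqs.dtype, loaded_labels.shape)
```

If disk space matters, `np.savez_compressed` takes the same arguments and may shrink one-hot data considerably, at the cost of slower saving and loading.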

---
### NEXT:

- test various neural network architectures