**Gene of Interest: TP53**

TP53 is the most frequently mutated tumor suppressor gene in human cancers. It regulates cell cycle arrest, apoptosis, and genomic stability. Mutations in TP53 are often associated with poor prognosis, therapy resistance, and aggressive tumor behavior. By focusing on TP53 in circulating tumor DNA (ctDNA), this pipeline aims to detect actionable mutations and monitor clonal dynamics across treatment.

**Introduction**

In this analysis, we processed circulating tumor DNA (ctDNA) sequencing data from a patient with metastatic colorectal cancer to investigate variants in the TP53 gene pre- and post-treatment with first-line chemotherapy and anti-EGFR therapy. Raw FASTQ reads were downloaded from the NCBI SRA database and subjected to quality control using FastQC and MultiQC to ensure high-quality sequencing data. Reads were then aligned to the targeted TP53 region of the human reference genome (hg19) using BWA, followed by sorting and indexing with Samtools. Variants were identified with Bcftools, and low-confidence calls were filtered to retain high-quality variants with sufficient read depth. This workflow enables the detection and comparison of TP53 mutations before and after treatment, providing insights into tumor evolution and potential mechanisms of therapy response, while also producing visual and tabular summaries suitable for interpretation, presentation, and clinical relevance assessment.

**Why specific tools were chosen for the analysis**

The tools selected for this TP53 variant calling pipeline were chosen for their efficiency, compatibility with Google Colab, and relevance to the task of targeted DNA variant analysis. SRA Toolkit’s fasterq-dump was used to retrieve raw sequencing data directly from NCBI, offering parallelization and streamlined file handling. FastQC and MultiQC provided essential quality control, allowing for rapid assessment of read integrity and adapter contamination. BWA MEM was selected for alignment due to its speed and accuracy with short reads, especially in targeted regions like TP53. Samtools was used for sorting, indexing, and manipulating BAM files, while BCFtools handled variant calling and filtering with minimal overhead. These tools are lightweight, scriptable, and widely adopted in clinical genomics workflows. Heavier alternatives like GATK were intentionally avoided due to their complexity and resource demands, which are unnecessary for single-gene analysis in a cloud-based notebook. Similarly, RNA-seq aligners like STAR or HISAT2, and automation frameworks like Snakemake or Nextflow, were excluded to maintain simplicity and focus. By using Miniconda and Bioconda, the environment remained reproducible and stable, avoiding dependency conflicts common in Colab. Overall, this toolset reflects a resource-conscious, streamlined approach tailored to ctDNA analysis of TP53.


**Patient Description**

The selected patient is part of a publicly available study investigating circulating tumor DNA (ctDNA) dynamics in colorectal cancer. Two timepoints were analyzed: SRR13973710 (pre-treatment) and SRR13973711 (post-treatment). These samples represent paired plasma-derived ctDNA collected before and after therapeutic intervention, allowing for longitudinal tracking of tumor-associated mutations. The focus on TP53, a key tumor suppressor gene, enables detection of clinically relevant variants that may reflect treatment response, clonal evolution, or residual disease. This patient was chosen to illustrate how targeted variant calling in ctDNA can reveal actionable insights and support precision oncology.


**VARIANT CALLING ANALYSIS**

Installs essential bioinformatics tools via the system package manager (apt-get).

sra-toolkit: To download FASTQ files from NCBI SRA.

fastqc: For quality control of raw reads.

bwa: To align reads to a reference genome.

samtools: To manipulate BAM/SAM files.

bcftools: To call and filter variants.

default-jre: Java runtime required by FastQC.

Installs MultiQC via pip.

In [42]:
# Core tools
!apt-get install -y sra-toolkit fastqc bwa samtools bcftools default-jre
!pip install multiqc

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
default-jre is already the newest version (2:1.11-72build2).
bcftools is already the newest version (1.13-1).
bwa is already the newest version (0.7.17-6).
fastqc is already the newest version (0.11.9+dfsg-5).
samtools is already the newest version (1.13-4).
sra-toolkit is already the newest version (2.11.3+dfsg-1ubuntu1).
0 upgraded, 0 newly installed, 0 to remove and 47 not upgraded.
Collecting multiqc
  Using cached multiqc-1.32-py3-none-any.whl.metadata (46 kB)
Collecting boto3 (from multiqc)
  Using cached boto3-1.40.74-py3-none-any.whl.metadata (6.8 kB)
Collecting humanize (from multiqc)
  Downloading humanize-4.14.0-py3-none-any.whl.metadata (7.8 kB)
Collecting importlib_metadata (from multiqc)
  Downloading importlib_metadata-8.7.0-py3-none-any.whl.metadata (4.8 kB)
Collecting jinja2>=3.0.0 (from multiqc)
  Downloading jinja2-3.1.6-py3-none-any.whl.metadata (2.9 kB)
Collecting kalei

Creates directories for organizing raw data, reference genomes, results (QC, BAMs, VCFs), scripts, and notebooks.

In [43]:
!mkdir -p ctDNA_TP53_Analysis/{data/{raw_fastq,reference,metadata},results/{qc,bam,vcf,annotations},scripts,notebook}
%cd ctDNA_TP53_Analysis/data/raw_fastq/

/content/ctDNA_TP53_Analysis/data/raw_fastq
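The brace expansion in the `mkdir` command depends on the shell that executes it. As a shell-independent alternative, the same layout could be created from Python. This is a minimal sketch; the helper name `make_project_tree` is an assumption and not part of the original notebook.

```python
from pathlib import Path

# Sub-directories mirroring the mkdir command above.
SUBDIRS = [
    "data/raw_fastq", "data/reference", "data/metadata",
    "results/qc", "results/bam", "results/vcf", "results/annotations",
    "scripts", "notebook",
]

def make_project_tree(root="ctDNA_TP53_Analysis"):
    """Create the project layout under `root` and return the created paths."""
    created = []
    for sub in SUBDIRS:
        path = Path(root) / sub
        path.mkdir(parents=True, exist_ok=True)  # same effect as mkdir -p
        created.append(path)
    return created
```

Calling `make_project_tree()` from `/content` would reproduce the directory tree used in the rest of the notebook.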


We need the pre-treatment (SRR13973710) and post-treatment (SRR13973711) reads for the ctDNA analysis.

fasterq-dump downloads and converts SRA files to FASTQ format.

-t /root/sra_tmp: temp folder for SRA processing.

--split-files: splits paired-end reads into separate _1 and _2 files.

--threads 4: uses 4 CPU threads for faster extraction and conversion.

-p: shows a progress bar.

gzip: compresses FASTQ files to save disk space.

In [44]:
!mkdir -p /root/sra_tmp
!fasterq-dump SRR13973711 -t /root/sra_tmp --split-files --threads 4 -p
!fasterq-dump SRR13973710 -t /root/sra_tmp --split-files --threads 4 -p
!gzip SRR13973711_1.fastq SRR13973711_2.fastq
!gzip SRR13973710_1.fastq SRR13973710_2.fastq

lookup :|  0.00% 0.01% 0.02% 0.03% 0.04% 0.05% 0.06% 0.07% 0.08% 0.09% 0.10% 0.11% 0.12% 0.13% 0.14% 0.15% 0.16% 0.17% 0.18% 0.19% 0.20% 0.21% 0.22% 0.23% 0.24% 0.25% 0.26% 0.27% 0.28% 0.29% 0.30% 0.31% 0.32% 0.33% 0.34% 0.35% 0.36% 0.37% 0.38% 0.39% 0.40% 0.41% 0.42% 0.43% 0.44% 0.45% 0.46% 0.47% 0.48% 0.49% 0.50% 0.51% 0.52% 0.53% 0.54% 0.55% 0.56% 0.57% 0.58% 0.59% 0.60% 0.61% 0.62% 0.63% 0.64% 0.65% 0.66% 0.67% 0.68% 0.69% 0.70% 0.71% 0.72% 0.73% 0.74% 0.75% 0.76% 0.77% 0.78% 0.79% 0.80% 0.81% 0.82%
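After the download, it can be worth confirming that both mates of each pair contain the same number of reads before alignment. The following is a minimal sketch, assuming standard four-line FASTQ records; the function names `count_fastq_reads` and `check_pair` are hypothetical helpers, not part of the original workflow.

```python
import gzip

def count_fastq_reads(path):
    """Count records in a gzipped FASTQ file (assumes 4 lines per record)."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: line count {n_lines} is not a multiple of 4")
    return n_lines // 4

def check_pair(r1_path, r2_path):
    """Return the shared read count, or raise if the two mates differ."""
    n1 = count_fastq_reads(r1_path)
    n2 = count_fastq_reads(r2_path)
    if n1 != n2:
        raise ValueError(f"read count mismatch: {n1} vs {n2}")
    return n1
```

For example, `check_pair("SRR13973710_1.fastq.gz", "SRR13973710_2.fastq.gz")` would return the number of read pairs in the pre-treatment sample or raise if the files are inconsistent.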

I encountered persistent errors running FastQC in Google Colab because its default installation method (via .zip) relies on a Java-based GUI and classpath structure that doesn't work reliably in Colab’s virtual environment. To resolve this, I installed Miniconda, a lightweight version of Anaconda, which makes it possible to install bioinformatics tools like FastQC from the Bioconda channel in a more stable and reproducible way. Conda installs FastQC together with its dependencies, which avoids the broken classpath issues. However, because Colab is non-interactive, Conda requires explicit acceptance of its Terms of Service for certain channels before proceeding. Once those were accepted, I could install FastQC cleanly and run it without the Java class errors. This approach ensures compatibility, avoids manual setup pitfalls, and aligns with best practices for reproducible bioinformatics workflows in cloud-based notebooks.


Installs Miniconda, which allows you to manage packages and dependencies easily.

Ensures Python can find packages installed via Conda.

In [50]:
# Install Miniconda
!wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
!chmod +x Miniconda3-latest-Linux-x86_64.sh
!bash ./Miniconda3-latest-Linux-x86_64.sh -b -f -p /usr/local

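# Configure Conda
# NOTE: this path is hard-coded for Python 3.8; it must match the Python version
# bundled with the installed Miniconda (the conda init output below lists python3.13).
import sys
sys.path.append('/usr/local/lib/python3.8/site-packages')

# Install FastQC via Bioconda
# (fails on the first run until the channel Terms of Service are accepted; see the next cells)
!conda install -y -c bioconda fastqc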

PREFIX=/usr/local
Unpacking bootstrapper...
Unpacking payload...

Installing base environment...

Preparing transaction: ...working... done
Executing transaction: ...working... done
installation finished.
    You currently have a PYTHONPATH environment variable set. This may cause
    unexpected behavior when running the Python interpreter in Miniconda3.
    For best results, please verify that your PYTHONPATH only points to
    directories of packages that are compatible with the Python interpreter
    in Miniconda3: /usr/local
Jupyter detected...

CondaToSNonInteractiveError: Terms of Service have not been accepted for the following channels. Please accept or remove them before proceeding:
    - https://repo.anaconda.com/pkgs/main
    - https://repo.anaconda.com/pkgs/r

To accept these channels' Terms of Service, run the following commands:
    conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
    conda tos accept --override-chan

Initializes Conda and ensures licenses are accepted.

In [51]:
!conda init
!conda config --set always_yes yes
!conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
!conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r

no change     /usr/local/condabin/conda
no change     /usr/local/bin/conda
no change     /usr/local/bin/conda-env
no change     /usr/local/bin/activate
no change     /usr/local/bin/deactivate
no change     /usr/local/etc/profile.d/conda.sh
no change     /usr/local/etc/fish/conf.d/conda.fish
no change     /usr/local/shell/condabin/Conda.psm1
no change     /usr/local/shell/condabin/conda-hook.ps1
no change     /usr/local/lib/python3.13/site-packages/xontrib/conda.xsh
no change     /usr/local/etc/profile.d/conda.csh
modified      /root/.bashrc

==> For changes to take effect, close and re-open your current shell. <==

accepted Terms of Service for https://repo.anaconda.com/pkgs/main
accepted Terms of Service for https://repo.anaconda.com/pkgs/r


Installs FastQC from the Bioconda channel, which often carries more recent builds of bioinformatics tools than the system package manager.

In [52]:
!conda install -y -c bioconda fastqc

Jupyter detected...
2 channel Terms of Service accepted
Retrieving notices: done
Channels:
 - bioconda
 - conda-forge
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /usr/local

  added / updated specs:
    - fastqc


The following packages will be downloaded:

    packag

In [55]:
import os
os.getcwd()


'/content/ctDNA_TP53_Analysis/results/qc'

In [56]:
%cd /content/ctDNA_TP53_Analysis/data/raw_fastq/


/content/ctDNA_TP53_Analysis/data/raw_fastq


Runs FastQC on all gzipped FASTQ files.

-o: output folder

-t 4: 4 threads for faster processing

In [58]:
!mkdir -p /content/ctDNA_TP53_Analysis/results/qc
!fastqc *.fastq.gz -o /content/ctDNA_TP53_Analysis/results/qc -t 4


application/gzip
application/gzip
Started analysis of SRR13973710_1.fastq.gz
application/gzip
application/gzip
Started analysis of SRR13973710_2.fastq.gz
Started analysis of SRR13973711_1.fastq.gz
Started analysis of SRR13973711_2.fastq.gz
Approx 5% complete for SRR13973710_1.fastq.gz
Approx 5% complete for SRR13973710_2.fastq.gz
Approx 5% complete for SRR13973711_1.fastq.gz
Approx 5% complete for SRR13973711_2.fastq.gz
Approx 10% complete for SRR13973710_1.fastq.gz
Approx 10% complete for SRR13973710_2.fastq.gz
Approx 10% complete for SRR13973711_1.fastq.gz
Approx 10% complete for SRR13973711_2.fastq.gz
Approx 15% complete for SRR13973710_1.fastq.gz
Approx 15% complete for SRR13973710_2.fastq.gz
Approx 20% complete for SRR13973710_1.fastq.gz
Approx 20% complete for SRR13973710_2.fastq.gz
Approx 15% complete for SRR13973711_2.fastq.gz
Approx 15% complete for SRR13973711_1.fastq.gz
Approx 25% complete for SRR13973710_1.fastq.gz
Approx 25% complete for SRR13973710_2.fastq.gz
Approx 20% c
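Each FastQC run writes a `<sample>_fastqc.zip` archive that contains a tab-separated `summary.txt` with one PASS/WARN/FAIL line per module. A quick programmatic check of those statuses could look like the sketch below; the function names are hypothetical, and the three-column layout (status, module, file name) is assumed from FastQC's standard output.

```python
import zipfile

def parse_fastqc_summary(text):
    """Map each FastQC module name to its PASS/WARN/FAIL status."""
    statuses = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        status, module, _filename = line.split("\t")
        statuses[module] = status
    return statuses

def read_summary_from_zip(zip_path):
    """Read and parse summary.txt from a FastQC result archive."""
    with zipfile.ZipFile(zip_path) as archive:
        name = next(n for n in archive.namelist() if n.endswith("summary.txt"))
        return parse_fastqc_summary(archive.read(name).decode())

def flagged_modules(statuses):
    """Return the modules that did not PASS, sorted by name."""
    return sorted(m for m, s in statuses.items() if s != "PASS")
```

Running `flagged_modules(read_summary_from_zip(path))` on each archive in `results/qc` would list the modules that deserve a closer look in the MultiQC report.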

Interpretation of FastQC output:


Aggregates all FastQC results into a single interactive MultiQC report, making it easier to spot low-quality reads or adapter contamination.

In [68]:
%cd /content/ctDNA_TP53_Analysis/results/qc
!multiqc . -o ./multiqc_report

/content/ctDNA_TP53_Analysis/results/qc

/// MultiQC 🔍 v1.32

       file_search | Search path: /content/ctDNA_TP53_Analysis/results/qc
         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 10/10
            fastqc | Found 4 reports
     write_results | Data        : multiqc_report/multiqc_data
     write_results | Report      : multiqc_report/multiqc_report.html
           multiqc | MultiQC complete


Based on the “Per Sequence Quality Scores” section of the MultiQC report, the sequencing data looks excellent.
- The peak is around Phred 33–34, meaning most reads have very high average quality.
- No significant tailing into the red zone, which means there's no subset of poor-quality reads.
- This distribution is typical of well-prepared libraries and clean sequencing runs.

Based on the MultiQC report, specifically the Adapter Content and Overrepresented Sequences sections, trimming was likely not needed.
- In all 4 FASTQ files, overrepresented sequences made up less than 1% of reads, so adapter contamination and biased sequences were minimal.

- The adapter content graph shows low levels (<5%) across all positions. The lines rise slightly toward the end of the reads, which is normal for Illumina libraries. Adapter presence is minimal and unlikely to affect alignment or variant calling.

The reads are high quality, with low adapter contamination and minimal overrepresented sequences. Trimming is not necessary for this dataset.
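To make the Phred values above concrete, the relationship between a Phred score Q and the base-calling error probability is P = 10^(-Q/10). The sketch below decodes a FASTQ quality string and converts scores to error probabilities; the ASCII offset of 33 is the standard Sanger/Illumina 1.8+ encoding and is assumed here, and the function names are illustrative.

```python
def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def decode_quality(qual_string, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in qual_string]

def mean_quality(qual_string, offset=33):
    """Arithmetic mean of the Phred scores of one read."""
    scores = decode_quality(qual_string, offset)
    return sum(scores) / len(scores)
```

Under this definition, a mean score of 33 corresponds to an expected error rate of roughly 1 in 2,000 base calls.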




Focusing on TP53 reduces computation and makes variant calling easier for targeted sequencing.

Downloads the hg19 reference genome. hg19 was chosen because the original study used it.

samtools faidx: indexes reference for rapid access.

Extracts TP53 region (chr17:7571720-7590868) for targeted analysis.

bwa index: prepares reference for alignment.

In [59]:
%cd ../../data/reference/
!wget -O hg19.fa.gz http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
!gunzip hg19.fa.gz
!samtools faidx hg19.fa
!samtools faidx hg19.fa chr17:7571720-7590868 > TP53.fa
!bwa index TP53.fa

/content/ctDNA_TP53_Analysis/data/reference
--2025-11-16 08:52:57--  http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
Resolving hgdownload.cse.ucsc.edu (hgdownload.cse.ucsc.edu)... 128.114.119.163
Connecting to hgdownload.cse.ucsc.edu (hgdownload.cse.ucsc.edu)|128.114.119.163|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 948731419 (905M) [application/x-gzip]
Saving to: ‘hg19.fa.gz’


2025-11-16 08:53:18 (43.8 MB/s) - ‘hg19.fa.gz’ saved [948731419/948731419]

[bwa_index] Pack FASTA... 0.00 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.00 seconds elapse.
[bwa_index] Update BWT... 0.00 sec
[bwa_index] Pack forward-only FASTA... 0.00 sec
[bwa_index] Construct SA from BWT and Occ... 0.00 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index TP53.fa
[main] Real time: 0.054 sec; CPU: 0.014 sec


In [61]:
%cd /content/ctDNA_TP53_Analysis/data/reference/
!ls TP53.fa*


/content/ctDNA_TP53_Analysis/data/reference
TP53.fa  TP53.fa.amb  TP53.fa.ann  TP53.fa.bwt	TP53.fa.pac  TP53.fa.sa
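Because the extracted FASTA contains only the TP53 region, every position reported downstream is relative to that region rather than to chromosome 17. `samtools faidx` regions are 1-based and inclusive, so local position 1 corresponds to the region start. A minimal sketch of the conversion, with hypothetical helper names:

```python
def parse_region(region):
    """Split 'chr17:7571720-7590868' into (chrom, start, end), 1-based inclusive."""
    chrom, span = region.split(":")
    start, end = (int(x.replace(",", "")) for x in span.split("-"))
    return chrom, start, end

def region_length(region):
    """Number of bases in an inclusive region."""
    _, start, end = parse_region(region)
    return end - start + 1

def local_to_genomic(local_pos, region):
    """Map a 1-based position in the extracted FASTA back to the chromosome."""
    _, start, end = parse_region(region)
    genomic = start + local_pos - 1
    if not start <= genomic <= end:
        raise ValueError(f"position {local_pos} falls outside {region}")
    return genomic
```

For the region used here, `region_length("chr17:7571720-7590868")` gives 19149 bases, and `local_to_genomic(1, "chr17:7571720-7590868")` gives 7571720.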


Sorted, indexed BAMs are required for variant calling and downstream analysis.

bwa mem: aligns paired-end reads to TP53.

samtools sort: sorts BAM file by genomic coordinates.

samtools index: creates an index for visualization and variant calling.



In [65]:
%cd /content/ctDNA_TP53_Analysis/data/raw_fastq/

!bwa mem /content/ctDNA_TP53_Analysis/data/reference/TP53.fa SRR13973710_1.fastq.gz SRR13973710_2.fastq.gz | samtools sort -o /content/ctDNA_TP53_Analysis/results/bam/pre.bam
!samtools index /content/ctDNA_TP53_Analysis/results/bam/pre.bam

!bwa mem /content/ctDNA_TP53_Analysis/data/reference/TP53.fa SRR13973711_1.fastq.gz SRR13973711_2.fastq.gz | samtools sort -o /content/ctDNA_TP53_Analysis/results/bam/post.bam
!samtools index /content/ctDNA_TP53_Analysis/results/bam/post.bam

Streaming output truncated to the last 5000 lines.
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 277)
[M::mem_pestat] mean and std.dev: (89.67, 43.51)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 352)
[M::mem_pestat] analyzing insert size distribution for orientation RF...
[M::mem_pestat] (25, 50, 75) percentile: (1208, 2209, 4792)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 11960)
[M::mem_pestat] mean and std.dev: (3281.96, 2773.29)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 15544)
[M::mem_pestat] analyzing insert size distribution for orientation RR...
[M::mem_pestat] (25, 50, 75) percentile: (2784, 6668, 9226)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 22110)
[M::mem_pestat] mean and std.dev: (6031.33, 3086.92)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 28552)
[M::mem_pestat] skip orientation FF
[M::mem_pestat] skip orien
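The two alignment commands differ only in their input and output file names, so they could be generated from one helper to avoid copy-paste errors. The sketch below only builds the shell command string and does not run it; the function name `alignment_pipeline` and the `threads` parameter are assumptions (`-t` is the `bwa mem` thread option).

```python
import shlex

def alignment_pipeline(ref, r1, r2, out_bam, threads=1):
    """Build the 'bwa mem | samtools sort && samtools index' command line."""
    bwa = ["bwa", "mem", "-t", str(threads), ref, r1, r2]
    sort = ["samtools", "sort", "-o", out_bam, "-"]  # "-" reads SAM from stdin
    index = ["samtools", "index", out_bam]

    def quote(args):
        # Quote each argument so paths with spaces stay intact.
        return " ".join(shlex.quote(a) for a in args)

    return f"{quote(bwa)} | {quote(sort)} && {quote(index)}"
```

In a notebook, the returned string could be passed to the shell with `!{cmd}` once for the pre-treatment and once for the post-treatment sample.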

In [76]:
!samtools flagstat /content/ctDNA_TP53_Analysis/results/bam/pre.bam
!samtools flagstat /content/ctDNA_TP53_Analysis/results/bam/post.bam

9518695 + 0 in total (QC-passed reads + QC-failed reads)
9465190 + 0 primary
0 + 0 secondary
53505 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
1687858 + 0 mapped (17.73% : N/A)
1634353 + 0 primary mapped (17.27% : N/A)
9465190 + 0 paired in sequencing
4732595 + 0 read1
4732595 + 0 read2
1418096 + 0 properly paired (14.98% : N/A)
1540302 + 0 with itself and mate mapped
94051 + 0 singletons (0.99% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
11602018 + 0 in total (QC-passed reads + QC-failed reads)
11535792 + 0 primary
0 + 0 secondary
66226 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
1981108 + 0 mapped (17.08% : N/A)
1914882 + 0 primary mapped (16.60% : N/A)
11535792 + 0 paired in sequencing
5767896 + 0 read1
5767896 + 0 read2
1679380 + 0 properly paired (14.56% : N/A)
1805310 + 0 with itself and mate mapped
109572 + 0 singletons (0.95% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate 

The alignment statistics are consistent between the pre-treatment and post-treatment samples. The pre-treatment sample contained approximately 9.5 million reads, of which 1.69 million (17.73%) mapped to the TP53 locus. The post-treatment sample had 11.6 million reads, of which 1.98 million (17.08%) mapped. A low overall mapping rate is expected because the reference was restricted to a single region of about 19 kb, so reads originating elsewhere in the genome have nothing to align to. The properly paired percentages reported by flagstat (14.98% pre-treatment, 14.56% post-treatment) are relative to all paired reads; relative to primary mapped reads, about 87% are properly paired in both samples, which indicates good fragment integrity among the reads that did align. Singleton rates were low in both cases (under 1% of paired reads). No duplicates were reported because duplicate marking was not part of this pipeline: flagstat only counts reads that already carry the duplicate flag, so this value says nothing about library redundancy, which is an important consideration in ctDNA analysis where input material is limited. Overall, the two samples behave similarly through alignment and are suitable for downstream variant calling.
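The percentages printed by flagstat use all paired reads as the denominator, so ratios relative to mapped reads have to be derived separately. The following sketch parses flagstat text and computes two such ratios; the function names and the choice of "primary mapped" as the denominator are assumptions made for illustration.

```python
import re

def parse_flagstat(text):
    """Extract QC-passed counts from samtools flagstat output, keyed by label."""
    counts = {}
    for line in text.splitlines():
        match = re.match(r"(\d+) \+ (\d+) (.+?)(?: \(.*\))?$", line.strip())
        if match:
            # Trailing parentheticals are stripped; if two labels collide
            # after stripping, the first occurrence is kept.
            counts.setdefault(match.group(3), int(match.group(1)))
    return counts

def pairing_summary(text):
    """Fractions of mapped and properly paired reads among primary reads."""
    c = parse_flagstat(text)
    return {
        "mapped_fraction": c["primary mapped"] / c["primary"],
        "properly_paired_of_mapped": c["properly paired"] / c["primary mapped"],
    }
```

Applied to the pre-treatment output above, `properly_paired_of_mapped` comes out at about 0.87, compared with the 14.98% that flagstat prints relative to all paired reads.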

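Indexing the VCF enables rapid querying and visualization.

bcftools mpileup: computes per-position genotype likelihoods from the BAM pileup. By default it limits the pileup to roughly 250 reads per position (see the -d 250 message in the output), which caps the depth considered at high-coverage sites.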

bcftools call -mv: calls variants (-m=multiallelic, -v=variants only).

-Oz: compressed VCF output.

In [66]:
%cd /content/ctDNA_TP53_Analysis/results/vcf/
!bcftools mpileup -f /content/ctDNA_TP53_Analysis/data/reference/TP53.fa /content/ctDNA_TP53_Analysis/results/bam/pre.bam | bcftools call -mv -Oz -o pre.vcf.gz
!bcftools mpileup -f /content/ctDNA_TP53_Analysis/data/reference/TP53.fa /content/ctDNA_TP53_Analysis/results/bam/post.bam | bcftools call -mv -Oz -o post.vcf.gz
!bcftools index pre.vcf.gz
!bcftools index post.vcf.gz

/content/ctDNA_TP53_Analysis/results/vcf
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 250
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 250


Ensures that only reliable variants are considered for clinical interpretation

Filters out low-quality calls:

QUAL>20: only confident variants

DP>10: only variants with sufficient read depth

In [67]:
!bcftools filter -i 'QUAL>20 && DP>10' pre.vcf.gz -Oz -o pre.filtered.vcf.gz
!bcftools filter -i 'QUAL>20 && DP>10' post.vcf.gz -Oz -o post.filtered.vcf.gz
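To make the filter expression explicit, the same rule can be written out in Python for a single VCF data line. This is a sketch of the logic only, not a replacement for `bcftools filter`; the function names are hypothetical, `DP` is taken from the INFO column, and treating a missing QUAL or DP as a failed filter is an assumption of this sketch.

```python
def parse_info(info):
    """Turn 'DP=34;VDB=0.1;INDEL' into a dict; flag-only keys map to True."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields

def passes_filter(vcf_line, min_qual=20.0, min_depth=10):
    """Mimic the expression 'QUAL>20 && DP>10' for one VCF data line."""
    cols = vcf_line.rstrip("\n").split("\t")
    qual, info = cols[5], parse_info(cols[7])
    if qual == "." or "DP" not in info:
        return False  # assumption: missing values fail the filter
    return float(qual) > min_qual and int(info["DP"]) > min_depth
```

Both comparisons are strict, so a variant with exactly QUAL 20 or DP 10 is excluded.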

Filtered VCF files generated by BCFtools serve as a refined list of high-confidence variants, making them ideal for downstream interpretation. To utilize these files effectively, one should begin by parsing key fields such as chromosome, position, reference and alternate alleles, quality scores (QUAL), read depth (DP), and variant-specific metrics like allele frequency (VAF), which can be calculated from the DP4 field. Converting relative positions to genomic coordinates allows for accurate mapping to known loci. These variants can then be annotated using public databases like ClinVar or COSMIC to identify clinically relevant mutations, particularly in genes like TP53 where hotspot mutations (e.g., R175H, R248Q, R273C) are well-characterized. Comparing variant profiles across timepoints (e.g., pre- vs post-treatment) enables tracking of clonal dynamics, emergence of resistance mutations, or clearance of tumor-associated variants. Ultimately, filtered VCFs provide a focused dataset that, when combined with annotation and contextual analysis, can yield insights into disease progression, therapeutic response, and potential clinical actionability.

In [79]:
!bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%QUAL\t%INFO\n' /content/ctDNA_TP53_Analysis/results/vcf/pre.filtered.vcf.gz > pre_variants.tsv

In [80]:
import pandas as pd

# Load the pre-treatment variants exported above
df = pd.read_csv('pre_variants.tsv', sep='\t', header=None,
                 names=['CHROM','POS','REF','ALT','QUAL','INFO'])

In [81]:
# POS is 1-based within the extracted region, whose first base is chr17:7571720 (hg19),
# so local position 1 maps to 7571720.
df['GENOMIC_POS'] = 7571720 + df['POS'].astype(int) - 1

In [82]:
# Pull read depth, allele count, and DP4 out of the INFO string
df['DP'] = df['INFO'].str.extract(r'DP=(\d+)').astype(float)
df['AC'] = df['INFO'].str.extract(r'AC=([\d,]+)')
df['DP4'] = df['INFO'].str.extract(r'DP4=([\d,]+)')

In [83]:
def calculate_vaf(dp4):
    """Variant allele fraction from DP4 (ref-fwd, ref-rev, alt-fwd, alt-rev)."""
    try:
        ref_fwd, ref_rev, alt_fwd, alt_rev = map(int, dp4.split(','))
        alt_total = alt_fwd + alt_rev
        total = ref_fwd + ref_rev + alt_total
        return round(alt_total / total, 3) if total > 0 else None
    except (AttributeError, ValueError):
        # Missing DP4 (NaN) or an unexpected number of fields
        return None

df['VAF'] = df['DP4'].apply(calculate_vaf)

In [84]:
df

Unnamed: 0,CHROM,POS,REF,ALT,QUAL,INFO,GENOMIC_POS,DP,AC,DP4,VAF
0,chr17:7571720-7590868,153,A,G,153.3990,DP=34;VDB=6.58107e-05;SGB=-0.662043;RPBZ=-1.08...,7571872,34.0,1,12063,0.429
1,chr17:7571720-7590868,168,A,"C,T",193.2390,DP=251;VDB=5.617e-38;SGB=-0.693147;RPBZ=0.6149...,7571887,251.0,11,8411136,0.925
2,chr17:7571720-7590868,169,T,C,222.1800,DP=253;VDB=1.31137e-35;SGB=-0.693147;RPBZ=-1.4...,7571888,253.0,1,47107431,0.648
3,chr17:7571720-7590868,185,C,T,222.4110,DP=263;VDB=1.51347e-18;SGB=-0.693147;RPBZ=-7.6...,7571904,263.0,1,52228019,0.572
4,chr17:7571720-7590868,194,T,C,221.2770,DP=217;VDB=1.05629e-11;SGB=-0.693147;RPBZ=-0.9...,7571913,217.0,1,79253510,0.302
...,...,...,...,...,...,...,...,...,...,...,...
766,chr17:7571720-7590868,17280,A,G,92.1296,DP=56;VDB=0.800881;SGB=-0.676189;RPBZ=0.147716...,7588999,56.0,1,171365,0.268
767,chr17:7571720-7590868,17283,A,G,215.3430,DP=68;VDB=0.84099;SGB=-0.692717;RPBZ=2.27585;M...,7589002,68.0,1,10131211,0.500
768,chr17:7571720-7590868,17291,C,T,32.4825,DP=143;VDB=0.529851;SGB=-0.692562;RPBZ=1.16112...,7589010,143.0,1,3230814,0.262
769,chr17:7571720-7590868,17325,A,C,221.0210,DP=73;VDB=3.41264e-12;SGB=-0.692914;RPBZ=3.176...,7589044,73.0,1,571411,0.676
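Once a table like this exists for both timepoints, the pre- and post-treatment variant sets can be compared directly. The sketch below classifies each variant as shared, lost, or gained; it assumes each sample has been reduced to a dict mapping `(position, ref, alt)` to VAF, and the function name and the values in the usage example are purely illustrative, not results from this dataset.

```python
def classify_variants(pre, post):
    """Compare two {(pos, ref, alt): vaf} dicts from paired timepoints.

    Returns {key: (status, pre_vaf, post_vaf)} where status is
    'shared', 'lost' (pre only), or 'gained' (post only).
    """
    result = {}
    for key in sorted(set(pre) | set(post)):
        pre_vaf, post_vaf = pre.get(key), post.get(key)
        if pre_vaf is None:
            status = "gained"
        elif post_vaf is None:
            status = "lost"
        else:
            status = "shared"
        result[key] = (status, pre_vaf, post_vaf)
    return result

# Illustrative input only (made-up positions and VAFs)
example_pre = {(100, "A", "G"): 0.40, (200, "C", "T"): 0.10}
example_post = {(100, "A", "G"): 0.15, (300, "G", "A"): 0.05}
example = classify_variants(example_pre, example_post)
```

A variant labeled "lost" here only means it did not pass the filters in the second sample; low depth at that position could produce the same result, so such calls would need to be checked against the post-treatment coverage.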


While I do not yet have hands-on experience interpreting VCF files or working extensively with clinical variant datasets, I am a fast and motivated learner. I’ve already begun exploring the structure and content of filtered VCFs using BCFtools, and I’m actively building my understanding of variant annotation, allele frequency analysis, and clinical relevance mapping. I’m confident in my ability to quickly grasp new tools and workflows, and I’m committed to developing the skills needed to perform rigorous and insightful genomic interpretation.

**IDEAS/VISION TO DISPLAY FINDINGS**

To effectively communicate my TP53 ctDNA findings in marketing materials or presentations, I envision a suite of visually engaging and clinically informative displays tailored to both scientific and non-technical audiences. A central component would be a mutation summary dashboard—either as a clean table or infographic—highlighting key variants with their genomic positions, allele frequencies, quality scores, and clinical annotations such as protein changes and ClinVar status. To illustrate tumor evolution, I would include bar plots comparing variant allele frequencies (VAFs) between pre- and post-treatment samples, emphasizing emerging or disappearing mutations. A TP53 lollipop plot would map mutations along the protein structure, spotlighting known hotspots and their functional domains. For longitudinal studies, a timeline graphic could show how specific mutations evolve over time, offering insights into clonal dynamics and treatment response. These visuals would be integrated into a slide deck or interactive dashboard, combining scientific rigor with accessible storytelling. This approach would not only support data-driven insights but also enhance communication with collaborators, clinicians, and potential stakeholders in translational research or precision oncology.