<a href="https://colab.research.google.com/github/carolnmqs/tcc-germinativo/blob/main/chr1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Environment setup

In [None]:
%%bash

#!/bin/bash

echo '1 - Installing programs'
mkdir -p logs

# Each install sends stdout and stderr to a single log file (2>&1).
# -y answers the confirmation prompt, which a notebook cell cannot do interactively.
echo 'Installing bwa'
sudo apt-get install -y bwa >logs/bwa.log 2>&1

echo 'Installing fastqc'
sudo apt-get install -y fastqc >logs/fastqc.log 2>&1

echo 'Installing samtools'
sudo apt-get install -y samtools >logs/samtools.log 2>&1

echo 'Installing bedtools'
sudo apt-get install -y bedtools >logs/bedtools.log 2>&1

echo 'Installing bgzip'
# Note: on Debian/Ubuntu, bgzip is normally shipped with the tabix package;
# if this step fails, check logs/bgzip.log.
sudo apt-get install -y bgzip >logs/bgzip.log 2>&1

echo 'Installing tabix'
sudo apt-get install -y tabix >logs/tabix.log 2>&1

1 - Installing programs
Installing bwa
Installing fastqc
Installing samtools
Installing bedtools
Installing bgzip
Installing tabix
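
Because the installation output goes to log files, a failed install is easy to miss. The sketch below is one possible check, not part of the original pipeline: it looks up each command on the `PATH` with `shutil.which`. The tool list mirrors the installation cell, and the function name `find_missing_tools` is a hypothetical helper.

```python
import shutil

# Command names taken from the installation cell above; adjust as needed.
REQUIRED_TOOLS = ["bwa", "fastqc", "samtools", "bedtools", "bgzip", "tabix"]


def find_missing_tools(tools):
    """Return the tools from `tools` that are not found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All tools are available.")
```

If a tool is reported as missing, the matching file under `logs/` is the first place to look.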


In [None]:
%%bash

echo 'Installing gatk'
wget https://github.com/broadinstitute/gatk/releases/download/4.1.8.1/gatk-4.1.8.1.zip >logs/gatk.log 2>&1
# Append (>>) so the unzip output does not overwrite the wget log
unzip gatk-4.1.8.1.zip >>logs/gatk.log 2>&1
rm gatk-4.1.8.1.zip

Installing gatk


In [None]:
%%bash

echo 'Installing picard'
wget https://github.com/broadinstitute/picard/releases/download/2.24.2/picard.jar >logs/picard.log 2>&1

echo 'Installing snpEff'
wget https://snpeff.blob.core.windows.net/versions/snpEff_latest_core.zip >logs/snpEff.log 2>&1
# Append (>>) so the unzip output does not overwrite the wget log
unzip snpEff_latest_core.zip >>logs/snpEff.log 2>&1
rm snpEff_latest_core.zip

echo 'Installing multiqc'
sudo apt-get install -y multiqc >logs/multiqc.log 2>&1

Installing picard
Installing snpEff
Installing multiqc


In [None]:
%%bash

echo '2 - Preparing the Reference Genome'

echo 'Downloading the Reference Genome'

mkdir -p reference

# Download chr1 only (the file is named hg38.fasta but contains just chromosome 1)
curl -s "https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1.fa.gz" | \
  gunzip -c > reference/hg38.fasta

echo 'Indexing the Reference Genome'
bwa index \
  -a bwtsw \
  /content/reference/hg38.fasta >logs/bwa.log 2>&1

samtools faidx /content/reference/hg38.fasta

java -jar picard.jar CreateSequenceDictionary \
    REFERENCE=/content/reference/hg38.fasta \
    OUTPUT=/content/reference/hg38.dict >logs/picard.log 2>&1

2 - Preparing the Reference Genome
Downloading the Reference Genome
Indexing the Reference Genome
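
The indexing commands write their messages to log files, so it can help to confirm that the expected files exist before aligning. The sketch below is one way to do that, assuming the usual outputs: `bwa index` adds `.amb`, `.ann`, `.bwt`, `.pac` and `.sa` next to the FASTA, `samtools faidx` adds `.fai`, and the Picard command above writes `hg38.dict`. The helper name `missing_reference_files` is hypothetical.

```python
from pathlib import Path

# Suffixes appended to the FASTA file name by `bwa index` and `samtools faidx`.
BWA_SUFFIXES = [".amb", ".ann", ".bwt", ".pac", ".sa"]
FAIDX_SUFFIX = ".fai"


def missing_reference_files(fasta_path):
    """Return the reference and index files that do not exist yet."""
    fasta = Path(fasta_path)
    expected = [fasta]
    expected += [Path(str(fasta) + suffix) for suffix in BWA_SUFFIXES + [FAIDX_SUFFIX]]
    # The sequence dictionary replaces the FASTA extension: hg38.fasta -> hg38.dict
    expected.append(fasta.with_suffix(".dict"))
    return [str(path) for path in expected if not path.exists()]


if __name__ == "__main__":
    for path in missing_reference_files("/content/reference/hg38.fasta"):
        print("Missing:", path)
```

An empty result means every file the later steps rely on is in place.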


## Germline Analysis Script

Upload the required files and remember to initialize the variables with your sample's information.

In [None]:
%%bash

#!/bin/bash

amostra="cap-ngse-b-2019-chr1"
echo "Starting the processing of $amostra"

fastq1="/content/cap-ngse-b-2019-chr1_S1_L001_R1_001.fastq.gz"
fastq2="/content/cap-ngse-b-2019-chr1_S1_L001_R2_001.fastq.gz"

# Create the directory structure
mkdir -p "$amostra/input" "$amostra/output"

# Move the first FASTQ file into the input directory if it exists
if [ -e "$fastq1" ]; then
  echo "The file '$fastq1' exists."
  mv "$fastq1" "$amostra/input/"
else
  echo "The file '$fastq1' does not exist."
fi

# Move the second FASTQ file into the input directory if it exists
if [ -e "$fastq2" ]; then
  echo "The file '$fastq2' exists."
  mv "$fastq2" "$amostra/input/"
else
  echo "The file '$fastq2' does not exist."
fi

# Print a header for the pipeline
echo "==========================="
echo "Start of the Germline Variant Analysis Pipeline"
echo "==========================="

# Step 1: Data preprocessing
echo "Step 1: Data preprocessing (e.g. QC, trimming)"
# This is where FastQC, Trimmomatic, etc. go.
# $fastq1 and $fastq2 hold absolute paths, so only the file name is used inside the input directory.
fastqc -o "$amostra/output" "$amostra/input/$(basename "$fastq1")" >logs/fastqc.log 2>&1
fastqc -o "$amostra/output" "$amostra/input/$(basename "$fastq2")" >>logs/fastqc.log 2>&1

Starting the processing of cap-ngse-b-2019-chr1
The file '/content/cap-ngse-b-2019-chr1_S1_L001_R1_001.fastq.gz' exists.
The file '/content/cap-ngse-b-2019-chr1_S1_L001_R2_001.fastq.gz' exists.
Start of the Germline Variant Analysis Pipeline
Step 1: Data preprocessing (e.g. QC, trimming)


In [None]:
%%bash

# Quick check: define a shell variable in this cell and print it
nome=carol
echo $nome

carol


In [None]:
%%bash

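# Each %%bash cell runs in a new shell, so $nome from the previous cell is not defined here and this prints an empty line
echo $nome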
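
The two cells above show that a variable set in one `%%bash` cell is gone in the next one, because each cell starts a new bash process. The sketch below reproduces this with `subprocess` and shows one possible workaround: setting the value in `os.environ` from a Python cell, so that child processes started afterwards inherit it. The helper `run_bash` is hypothetical, and the sketch assumes `bash` is available on the machine.

```python
import os
import subprocess


def run_bash(script):
    """Run `script` in a new bash process and return its stdout without the trailing newline."""
    result = subprocess.run(
        ["bash", "-c", script], capture_output=True, text=True, check=True
    )
    return result.stdout.rstrip("\n")


# A variable set in one bash process does not exist in the next one.
first = run_bash("nome=carol; echo $nome")
second = run_bash("echo $nome")

# Variables placed in os.environ are inherited by child processes started afterwards.
os.environ["amostra"] = "cap-ngse-b-2019-chr1"
third = run_bash("echo $amostra")

print(repr(first), repr(second), repr(third))
```

The alternative used later in this notebook is simply to redefine the variables at the top of every `%%bash` cell that needs them.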




In [None]:
%%bash

fastqc -o /content/cap-ngse-b-2019-chr1/output \
  /content/cap-ngse-b-2019-chr1/input/cap-ngse-b-2019-chr1_S1_L001_R1_001.fastq.gz


fastqc -o /content/cap-ngse-b-2019-chr1/output \
  /content/cap-ngse-b-2019-chr1/input/cap-ngse-b-2019-chr1_S1_L001_R2_001.fastq.gz

## Open the HTML reports for the quality information

Analysis complete for cap-ngse-b-2019-chr1_S1_L001_R1_001.fastq.gz
Analysis complete for cap-ngse-b-2019-chr1_S1_L001_R2_001.fastq.gz


Started analysis of cap-ngse-b-2019-chr1_S1_L001_R1_001.fastq.gz
Approx 5% complete for cap-ngse-b-2019-chr1_S1_L001_R1_001.fastq.gz
Approx 10% complete for cap-ngse-b-2019-chr1_S1_L001_R1_001.fastq.gz
Approx 15% complete for cap-ngse-b-2019-chr1_S1_L001_R1_001.fastq.gz
Approx 20% complete for cap-ngse-b-2019-chr1_S1_L001_R1_001.fastq.gz
Approx 25% complete for cap-ngse-b-2019-chr1_S1_L001_R1_001.fastq.gz
Approx 30% complete for cap-ngse-b-2019-chr1_S1_L001_R1_001.fastq.gz
Approx 35% complete for cap-ngse-b-2019-chr1_S1_L001_R1_001.fastq.gz
Approx 40% complete for cap-ngse-b-2019-chr1_S1_L001_R1_001.fastq.gz
Approx 45% complete for cap-ngse-b-2019-chr1_S1_L001_R1_001.fastq.gz
Approx 50% complete for cap-ngse-b-2019-chr1_S1_L001_R1_001.fastq.gz
Approx 55% complete for cap-ngse-b-2019-chr1_S1_L001_R1_001.fastq.gz
Approx 60% complete for cap-ngse-b-2019-chr1_S1_L001_R1_001.fastq.gz
Approx 65% complete for cap-ngse-b-2019-chr1_S1_L001_R1_001.fastq.gz
Approx 70% complete for cap-ngse-b-2019

In [None]:
%%bash
# Step 2: Sequence alignment
echo "Step 2: Sequence alignment (e.g. BWA)"
# The alignment code goes here, e.g. BWA or Bowtie2: bwa mem reference.fasta input_data.fastq > aligned_output.sam
# NOTE: fails as written, because $amostra, $fastq1 and $fastq2 are not defined in the shell of this cell
bwa mem -R "@RG\tID:$amostra\tSM:$amostra\tLB:$amostra\tPL:$amostra" \
    /content/reference/hg38.fasta \
    $amostra/input/$fastq1 \
    $amostra/input/$fastq2 >$amostra/output/$amostra.sam 2>logs/bwa.log

Step 2: Sequence alignment (e.g. BWA)


bash: line 5: /output/.sam: No such file or directory


CalledProcessError: Command 'b'# Step 2: Sequence alignment\necho "Step 2: Sequence alignment (e.g. BWA)"\n# The alignment code goes here, e.g. BWA or Bowtie2: bwa mem reference.fasta input_data.fastq > aligned_output.sam\n# NOTE: fails as written, because $amostra, $fastq1 and $fastq2 are not defined in the shell of this cell\nbwa mem -R "@RG\\tID:$amostra\\tSM:$amostra\\tLB:$amostra\\tPL:$amostra" \\\n    /content/reference/hg38.fasta \\\n    $amostra/input/$fastq1 \\\n    $amostra/input/$fastq2 >$amostra/output/$amostra.sam 2>logs/bwa.log\n'' returned non-zero exit status 1.

In [None]:
%%bash

NOME="cap-ngse-b-2019-chr1"
Biblioteca="Exoma"
Plataforma="Illumina"

bwa mem -K 100000000 -R "@RG\tID:$NOME\tSM:$NOME\tLB:$Biblioteca\tPL:$Plataforma" \
  /content/reference/hg38.fasta \
  /content/cap-ngse-b-2019-chr1/input/cap-ngse-b-2019-chr1_S1_L001_R1_001.fastq.gz \
  /content/cap-ngse-b-2019-chr1/input/cap-ngse-b-2019-chr1_S1_L001_R2_001.fastq.gz > /content/cap-ngse-b-2019-chr1/output/cap-ngse-b-2019-chr1.sam

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 665752 sequences (100000077 bp)...
[M::process] read 665804 sequences (100000286 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (13, 292822, 8, 9)
[M::mem_pestat] analyzing insert size distribution for orientation FF...
[M::mem_pestat] (25, 50, 75) percentile: (148, 208, 446)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 1042)
[M::mem_pestat] mean and std.dev: (257.17, 163.20)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 1340)
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (161, 197, 246)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 416)
[M::mem_pestat] mean and std.dev: (206.76, 63.65)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 501)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as t
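
The failed attempt above produced an empty read group because its variables were undefined, while the working cell sets `ID`, `SM`, `LB` and `PL` explicitly. The sketch below builds the same `-R` string in Python and refuses empty fields. The function name `build_read_group` is a hypothetical helper, and reusing the sample name for both `ID` and `SM` follows the cell above rather than any general rule.

```python
def build_read_group(sample, library, platform):
    """Return an @RG line with literal backslash-t separators, as passed to `bwa mem -R`."""
    fields = [("ID", sample), ("SM", sample), ("LB", library), ("PL", platform)]
    for tag, value in fields:
        if not value:
            raise ValueError(f"Read group field {tag} is empty")
    # bwa expects the two characters backslash and t, not a real tab.
    return "@RG" + "".join(f"\\t{tag}:{value}" for tag, value in fields)


if __name__ == "__main__":
    print(build_read_group("cap-ngse-b-2019-chr1", "Exoma", "Illumina"))
```

Failing early on an empty field avoids producing a SAM file whose header has to be fixed later.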

In [None]:
%%bash

# Step 3: Format conversion and indexing
echo "Step 3: Conversion to BAM and indexing"
# Sort the SAM file, write it as BAM, and index the result
samtools sort -O bam -o /content/cap-ngse-b-2019-chr1/output/cap-ngse-b-2019-chr1.bam /content/cap-ngse-b-2019-chr1/output/cap-ngse-b-2019-chr1.sam
samtools index /content/cap-ngse-b-2019-chr1/output/cap-ngse-b-2019-chr1.bam

Step 3: Conversion to BAM and indexing


[bam_sort_core] merging from 2 files and 1 in-memory blocks...


In [None]:
%%bash

bedtools bamtobed -i /content/cap-ngse-b-2019-chr1/output/cap-ngse-b-2019-chr1.bam >/content/cap-ngse-b-2019-chr1/output/cap-ngse-b-2019-chr1.bed
bedtools merge -i /content/cap-ngse-b-2019-chr1/output/cap-ngse-b-2019-chr1.bed >/content/cap-ngse-b-2019-chr1/output/cap-ngse-b-2019-chr1.merged.bed
bedtools sort -i /content/cap-ngse-b-2019-chr1/output/cap-ngse-b-2019-chr1.merged.bed >/content/cap-ngse-b-2019-chr1/output/cap-ngse-b-2019-chr1.sorted.bed
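
To make the `bedtools merge` step more concrete, the sketch below shows the same idea in plain Python: intervals on the same chromosome that overlap or touch are collapsed into one. It illustrates the technique only and is not how bedtools is implemented. The function name `merge_intervals` and the coordinates are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (chrom, start, end) intervals."""
    merged = []
    # Sorting by chromosome and start lets each interval be compared with the last merged one.
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(item) for item in merged]


example = [("chr1", 100, 200), ("chr1", 150, 250), ("chr1", 250, 300), ("chr1", 400, 500)]
regions = merge_intervals(example)
print(regions)
```

The sketch sorts before merging because the comparison only looks at the previous interval, which is also why `bedtools merge` expects sorted input.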

In [None]:
## At this stage you can run MarkDuplicates (it flags duplicates that may have arisen during amplification, i.e. PCR or optical duplicates, so they can be excluded) and/or Base Quality Score Recalibration, BQSR (https://gatk.broadinstitute.org/hc/en-us/articles/360035890531-Base-Quality-Score-Recalibration-BQSR), which reduces noise in the BAM
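
As a rough illustration of the optional duplicate-marking step, the sketch below only assembles a `MarkDuplicates` command line and prints it without running anything. It assumes the short arguments `-I`, `-O` and `-M` of the GATK4 tool, which should be confirmed with `gatk MarkDuplicates --help`. The output file names and the helper `mark_duplicates_command` are hypothetical.

```python
import shlex


def mark_duplicates_command(gatk, in_bam, out_bam, metrics):
    """Build (but do not run) a MarkDuplicates command as a list of arguments."""
    return [gatk, "MarkDuplicates", "-I", in_bam, "-O", out_bam, "-M", metrics]


sample_dir = "cap-ngse-b-2019-chr1/output"
command = mark_duplicates_command(
    "gatk-4.1.8.1/gatk",
    f"{sample_dir}/cap-ngse-b-2019-chr1.bam",
    f"{sample_dir}/cap-ngse-b-2019-chr1.markdup.bam",  # hypothetical output name
    f"{sample_dir}/cap-ngse-b-2019-chr1.markdup_metrics.txt",  # hypothetical metrics name
)
print(shlex.join(command))
```

If this step is used, the marked BAM would need to be indexed and passed to the variant caller in place of the original BAM.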

In [None]:
%%bash

amostra="cap-ngse-b-2019-chr1"

# Step 4: Variant calling
echo "Step 4: Variant calling (GATK HaplotypeCaller)"
# Call variants
gatk-4.1.8.1/gatk HaplotypeCaller --verbosity ERROR \
    -R reference/hg38.fasta \
    -I $amostra/output/$amostra.bam \
    -O $amostra/output/$amostra.vcf

# Compress and index the VCF
bgzip -f $amostra/output/$amostra.vcf
tabix -f $amostra/output/$amostra.vcf.gz

Step 4: Variant calling (GATK HaplotypeCaller)


Using GATK jar /content/gatk-4.1.8.1/gatk-package-4.1.8.1-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /content/gatk-4.1.8.1/gatk-package-4.1.8.1-local.jar HaplotypeCaller --verbosity ERROR -R reference/hg38.fasta -I cap-ngse-b-2019-chr1/output/cap-ngse-b-2019-chr1.bam -O cap-ngse-b-2019-chr1/output/cap-ngse-b-2019-chr1.vcf
[January 18, 2025 at 1:27:57 PM UTC] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 19.27 minutes.
Runtime.totalMemory()=1083179008


In [None]:
%%bash

amostra="cap-ngse-b-2019-chr1"

# Step 5: Final report
echo "Step 5: Generating the final report"
# Generate reports or visualizations
multiqc $amostra/

## Open multiqc_report and also check which tools MultiQC supports, so this command can be rerun with their quality reports

# End of the pipeline
echo "==========================="
echo "Bioinformatics Pipeline Completed!"
echo "==========================="

Step 5: Generating the final report
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 15/15  
Bioinformatics Pipeline Completed!



  /// MultiQC 🔍 | v1.12

|           multiqc | MultiQC Version v1.26 now available!
|           multiqc | Search path : /content/cap-ngse-b-2019-chr1
|            fastqc | Found 2 reports
|           multiqc | Compressing plot data
|           multiqc | Previous MultiQC output found! Adjusting filenames..
|           multiqc | Use -f or --force to overwrite existing reports instead
|           multiqc | Report      : multiqc_report_1.html
|           multiqc | Data        : multiqc_data_1
|           multiqc | MultiQC complete


In [None]:
%%bash

## STEP 6 - VARIANT ANNOTATION

wget https://snpeff.blob.core.windows.net/versions/snpEff_latest_core.zip
unzip -o snpEff_latest_core.zip
rm snpEff_latest_core.zip

Archive:  snpEff_latest_core.zip
  inflating: snpEff/LICENSE.md       
  inflating: snpEff/snpEff.jar       
  inflating: snpEff/SnpSift.jar      
  inflating: snpEff/galaxy/snpSift_int.xml  
  inflating: snpEff/galaxy/tool-data/snpEff_genomes.loc  
  inflating: snpEff/galaxy/tool-data/snpEff_genomes.loc.sample  
  inflating: snpEff/galaxy/snpEffWrapper.pl  
  inflating: snpEff/galaxy/snpEff.xml  
  inflating: snpEff/galaxy/tool_conf.xml  
  inflating: snpEff/galaxy/snpSift_caseControl.xml  
  inflating: snpEff/galaxy/snpSift_filter.xml  
  inflating: snpEff/galaxy/snpSift_annotate.xml  
  inflating: snpEff/galaxy/snpSiftWrapper.pl  
  inflating: snpEff/galaxy/tool_dependencies.xml  
  inflating: snpEff/galaxy/snpEff_download.xml  
  inflating: snpEff/snpEff.config    
  inflating: snpEff/examples/samples_cancer.txt  
  inflating: snpEff/examples/example_motif.vcf  
  inflating: snpEff/examples/cancer.eff.vcf  
  inflating: snpEff/examples/examples.sh  
  inflating: snpEff/examples/tes

--2025-01-18 13:50:26--  https://snpeff.blob.core.windows.net/versions/snpEff_latest_core.zip
Resolving snpeff.blob.core.windows.net (snpeff.blob.core.windows.net)... 52.239.234.228
Connecting to snpeff.blob.core.windows.net (snpeff.blob.core.windows.net)|52.239.234.228|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 66475427 (63M) [application/zip]
Saving to: ‘snpEff_latest_core.zip.3’

     0K .......... .......... .......... .......... ..........  0%  141K 7m41s
    50K .......... .......... .......... .......... ..........  0%  280K 5m46s
   100K .......... .......... .......... .......... ..........  0%  282K 5m7s
   150K .......... .......... .......... .......... ..........  0%  282K 4m48s
   200K .......... .......... .......... .......... ..........  0%  283K 4m36s
   250K .......... .......... .......... .......... ..........  0% 56.8M 3m50s
   300K .......... .......... .......... .......... ..........  0% 60.3M 3m17s
   350K .......... .......... .

In [None]:
%%bash

sudo apt-get update
# -y answers the confirmation prompt, which a notebook cell cannot do interactively
sudo apt-get install -y openjdk-21-jre

java -jar snpEff/snpEff.jar download -v GRCh38.p14

In [None]:
%%bash

java -Xmx8g -jar snpEff/snpEff.jar -v GRCh38.p14 \
    -stats /content/cap-ngse-b-2019-chr1/output/cap-ngse-b-2019-chr1.html \
    /content/cap-ngse-b-2019-chr1/output/cap-ngse-b-2019-chr1.vcf.gz > /content/cap-ngse-b-2019-chr1/output/cap-ngse-b-2019-chr1.ann.vcf

00:00:00 SnpEff version SnpEff 5.2e (build 2024-10-04 18:09), by Pablo Cingolani
00:00:00 Command: 'ann'
00:00:00 Reading configuration file 'snpEff.config'. Genome: 'GRCh38.p14'
00:00:00 Looking for config file: '/content/snpEff.config'
00:00:00 Reading config file: /content/snpEff/snpEff.config
00:00:02 Codon table 'Vertebrate_Mitochondrial' assigned to chromosome 'MT'
00:00:02 Codon table 'Vertebrate_Mitochondrial' assigned to chromosome 'M'
00:00:02 done
00:00:02 Reading database for genome version 'GRCh38.p14' from file '/content/snpEff/./data/GRCh38.p14/snpEffectPredictor.bin' (this might take a while)
00:01:04 done
00:01:04 Loading Motifs and PWMs
00:01:04 Building interval forest
00:01:21 done.
00:01:21 Genome stats :
#-----------------------------------------------
# Genome name                : 'Human genome GRCh38 using RefSeq transcripts'
# Genome version             : 'GRCh38.p14'
# Genome ID                  : 'GRCh38.p14[0]'
# Has protein coding info    : true
# Has Tr. 

In [None]:
%%bash

# -p creates the parent directories and does not fail if the cell is run again
mkdir -p snpEff/./db/GRCh38/clinvar
mkdir -p snpEff/./db/GRCh38/dbSnp


wget -O snpEff/./db/GRCh38/clinvar/clinvar-latest.vcf.gz \
    https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz
wget -O snpEff/./db/GRCh38/clinvar/clinvar-latest.vcf.gz.tbi \
    https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz.tbi

wget -O snpEff/./db/GRCh38/dbSnp/dbSnp.vcf.gz \
    ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/00-common_all.vcf.gz
wget -O snpEff/./db/GRCh38/dbSnp/dbSnp.vcf.gz.tbi \
    ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/00-common_all.vcf.gz.tbi

--2025-01-18 13:56:30--  https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.13, 130.14.250.10, 130.14.250.11, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.13|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 108576392 (104M) [application/x-gzip]
Saving to: ‘snpEff/./db/GRCh38/clinvar/clinvar-latest.vcf.gz’

     0K .......... .......... .......... .......... ..........  0%  130K 13m33s
    50K .......... .......... .......... .......... ..........  0%  261K 10m9s
   100K .......... .......... .......... .......... ..........  0%  261K 9m1s
   150K .......... .......... .......... .......... ..........  0%  187M 6m46s
   200K .......... .......... .......... .......... ..........  0% 52.6M 5m25s
   250K .......... .......... .......... .......... ..........  0%  263K 5m37s
   300K .......... .......... .......... .......... ..........  0%  218M 4m49s
  

In [None]:
%%bash

java -Xmx1g -jar snpEff/SnpSift.jar \
    annotate \
    snpEff/./db/GRCh38/clinvar/clinvar-latest.vcf.gz \
    /content/cap-ngse-b-2019-chr1/output/cap-ngse-b-2019-chr1.ann.vcf \
    > /content/cap-ngse-b-2019-chr1/output/cap-ngse-b-2019-chr1.clinvar.ann.vcf


In [None]:
%%bash

gatk-4.1.8.1/gatk VariantsToTable -V /content/cap-ngse-b-2019-chr1/output/cap-ngse-b-2019-chr1.clinvar.ann.vcf \
    -F CHROM \
    -F POS \
    -F QUAL \
    -F TYPE \
    -F ID \
    -F ALLELEID \
    -F CLNDN \
    -F CLNSIG \
    -F CLNSIGCONF \
    -F CLNSIGINCL \
    -F CLNVC \
    -F GENEINFO \
    -F AF_EXAC \
    -F CLNHGVS \
    -GF AD \
    -GF DP \
    -GF GQ \
    -GF GT \
    -O /content/cap-ngse-b-2019-chr1/output/cap-ngse-b-2019-chr1.clinvar.ann.txt

Using GATK jar /content/gatk-4.1.8.1/gatk-package-4.1.8.1-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /content/gatk-4.1.8.1/gatk-package-4.1.8.1-local.jar VariantsToTable -V /content/cap-ngse-b-2019-chr1/output/cap-ngse-b-2019-chr1.clinvar.ann.vcf -F CHROM -F POS -F QUAL -F TYPE -F ID -F ALLELEID -F CLNDN -F CLNSIG -F CLNSIGCONF -F CLNSIGINCL -F CLNVC -F GENEINFO -F AF_EXAC -F CLNHGVS -GF AD -GF DP -GF GQ -GF GT -O /content/cap-ngse-b-2019-chr1/output/cap-ngse-b-2019-chr1.clinvar.ann.txt
14:23:39.995 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/content/gatk-4.1.8.1/gatk-package-4.1.8.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
14:23:40.291 INFO  VariantsToTable - ------------------------------------------------------------
14:23:40.291 INFO  VariantsToTable - The Genome Analysis Toolkit (GATK) v4.1.8.1
14

In [None]:
# Variant interpretation

import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np

df = pd.read_csv("/content/cap-ngse-b-2019-chr1/output/cap-ngse-b-2019-chr1.clinvar.ann.txt", sep="\t")
df

Unnamed: 0,CHROM,POS,QUAL,TYPE,ID,ALLELEID,CLNDN,CLNSIG,CLNSIGCONF,CLNSIGINCL,CLNVC,GENEINFO,AF_EXAC,CLNHGVS,cap-ngse-b-2019-chr1.AD,cap-ngse-b-2019-chr1.DP,cap-ngse-b-2019-chr1.GQ,cap-ngse-b-2019-chr1.GT
0,chr1,12198,170.64,SNP,.,,,,,,,,,"56,13",69,99,G/C
1,chr1,12332,34.64,SNP,.,,,,,,,,,"8,2",10,42,G/A
2,chr1,12383,53.64,SNP,.,,,,,,,,,"2,2",4,54,G/A
3,chr1,13684,32.64,SNP,.,,,,,,,,,"14,3",17,40,C/T
4,chr1,14930,60.64,SNP,.,,,,,,,,,"1,2",3,23,A/G
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17692,chr1,248682198,1146.06,SNP,.,,,,,,,,,"0,42",42,99,G/G
17693,chr1,248778632,37.32,SNP,.,,,,,,,,,"0,2",2,6,G/G
17694,chr1,248785683,37.32,SNP,.,,,,,,,,,"0,2",2,6,C/C
17695,chr1,248816707,305.64,SNP,.,,,,,,,,,"45,17",62,99,G/C
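
The `AD` column is easier to interpret as a fraction of reads. The sketch below assumes `AD` holds comma-separated read depths per allele with the reference allele first, as the VCF `AD` field is defined, and computes the share of reads supporting non-reference alleles. The function name `allele_balance` and the example value `10,5` are hypothetical.

```python
def allele_balance(ad):
    """Return the fraction of reads supporting non-reference alleles, or None if undefined."""
    try:
        depths = [int(value) for value in str(ad).split(",")]
    except ValueError:
        # Missing values such as "." or NaN cannot be parsed.
        return None
    total = sum(depths)
    if len(depths) < 2 or total == 0:
        return None
    return sum(depths[1:]) / total


print(allele_balance("10,5"))
```

With pandas this could be applied as `df["cap-ngse-b-2019-chr1.AD"].map(allele_balance)`. Heterozygous calls would be expected near 0.5 and homozygous alternate calls near 1.0.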


In [None]:
df["TYPE"].value_counts()

TYPE,count
SNP,15722
INDEL,1975


In [None]:
df["QUAL"].describe()

Unnamed: 0,QUAL
count,17697.0
mean,500.685856
std,857.223447
min,30.28
25%,58.32
50%,153.92
75%,571.06
max,13745.06


In [None]:
df[["cap-ngse-b-2019-chr1.DP", "QUAL", "cap-ngse-b-2019-chr1.GQ"]].describe()

Unnamed: 0,cap-ngse-b-2019-chr1.DP,QUAL,cap-ngse-b-2019-chr1.GQ
count,17697.0,17697.0,17697.0
mean,29.098661,500.685856,52.209358
std,52.526911,857.223447,41.583644
min,1.0,30.28,0.0
25%,2.0,58.32,6.0
50%,9.0,153.92,48.0
75%,35.0,571.06,99.0
max,986.0,13745.06,99.0
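
The summary above shows that a quarter of the calls have a depth of 2 or less and a genotype quality of 6 or less, so some filtering is usually applied before interpretation. The sketch below is a minimal example of such a filter. The thresholds (`min_qual=50`, `min_dp=10`, `min_gq=20`) are illustrative assumptions rather than recommended values, and the function names are hypothetical.

```python
def passes_filters(qual, dp, gq, min_qual=50.0, min_dp=10, min_gq=20):
    """Return True when a call meets all three minimum thresholds."""
    return qual >= min_qual and dp >= min_dp and gq >= min_gq


def filter_calls(rows, **thresholds):
    """Keep the rows (dicts with QUAL, DP and GQ keys) that pass the filters."""
    return [row for row in rows if passes_filters(row["QUAL"], row["DP"], row["GQ"], **thresholds)]


# Two calls copied from the table shown earlier in the notebook.
example_rows = [
    {"POS": 12198, "QUAL": 170.64, "DP": 69, "GQ": 99},
    {"POS": 12383, "QUAL": 53.64, "DP": 4, "GQ": 54},
]
kept = filter_calls(example_rows)
print([row["POS"] for row in kept])
```

The same condition can be written as a boolean mask on the DataFrame columns `QUAL`, `cap-ngse-b-2019-chr1.DP` and `cap-ngse-b-2019-chr1.GQ`.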


In [None]:
# Work on a copy so the original DataFrame is left untouched
df1 = df.copy()
# Assign the result back instead of calling replace(..., inplace=True) on a column selection
df1['AF_EXAC'] = df1['AF_EXAC'].replace('.', '0.0')
df1['AF_EXAC'].value_counts()


AF_EXAC,count
1.00000,4
0.22069,2
0.67575,2
0.47748,2
0.00630,2
...,...
0.89330,1
0.59782,1
0.61175,1
0.60778,1


In [None]:
patho = (df['CLNSIG'] == "Pathogenic")
df[patho]

Unnamed: 0,CHROM,POS,QUAL,TYPE,ID,ALLELEID,CLNDN,CLNSIG,CLNSIGCONF,CLNSIGINCL,CLNVC,GENEINFO,AF_EXAC,CLNHGVS,cap-ngse-b-2019-chr1.AD,cap-ngse-b-2019-chr1.DP,cap-ngse-b-2019-chr1.GQ,cap-ngse-b-2019-chr1.GT
174,chr1,976215,3190.06,SNP,1320032,1310278.0,Renal_tubular_epithelial_cell_apoptosis|Neutro...,Pathogenic,,,single_nucleotide_variant,PERM1:84808,,NC_000001.11:g.976215A>G,"2,129",131,99,G/G
7920,chr1,94005441,579.6,INDEL,917611,905881.0,not_provided|Severe_early-childhood-onset_reti...,Pathogenic,,,Deletion,ABCA4:24,,NC_000001.11:g.94005445del,"21,21",42,99,CT/C
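
The exact comparison `df['CLNSIG'] == "Pathogenic"` only matches rows whose value is exactly that string. The sketch below is a slightly broader check that also accepts `Likely_pathogenic` and combined values. It assumes combined ClinVar terms are joined with `/`, `|` or `,`, which should be checked against the values actually present in the table. The function name `is_pathogenic` is hypothetical.

```python
import re

# Terms treated as pathogenic in this sketch; extend or narrow as needed.
PATHOGENIC_TERMS = {"Pathogenic", "Likely_pathogenic"}


def is_pathogenic(clnsig):
    """Return True if any term of a CLNSIG value is Pathogenic or Likely_pathogenic."""
    if not isinstance(clnsig, str):
        # Missing values are read by pandas as NaN (a float).
        return False
    terms = re.split(r"[/|,]", clnsig)
    return any(term.strip() in PATHOGENIC_TERMS for term in terms)


print(is_pathogenic("Pathogenic/Likely_pathogenic"))
```

Matching whole terms instead of substrings keeps values such as `Conflicting_interpretations_of_pathogenicity` out of the result. With pandas the mask would be `df["CLNSIG"].map(is_pathogenic)`.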
