**DEVs Aryan Das & Rajul Matho**

->  The following code was run on Ubuntu with Python.

->  Create the following folders inside the base data folder on the local drive (/mnt/d/PRJNA762469_srafiles/ in the code below) -

    "bam" | "output" | "text"
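
The same folders can be created from Python. This is a minimal sketch, assuming the base folder is the `/mnt/d/PRJNA762469_srafiles/` path used by the later cells and that the FASTQ folder is named `output`, as in the alignment loop. `make_work_dirs` is a hypothetical helper name, not part of any library.

```python
import os

def make_work_dirs(base_path, names=("bam", "output", "text")):
    """Create the working subfolders under base_path and return their paths."""
    created = []
    for name in names:
        path = os.path.join(base_path, name)
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
        created.append(path)
    return created

# Example call (path assumed from the later cells):
# make_work_dirs("/mnt/d/PRJNA762469_srafiles/")
```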

->  System Requirements to Run the STAR Aligner -
 1.   CPU - Intel i3 with 4 threads or equivalent
 2.   RAM - 32 GB (minimum)

->  About the Dataset -
 *  30 pairs of normal and cancerous tissue samples, each pair taken from the same excision, sequenced as PAIRED-END reads
 *  Accession - PRJNA762469 ; GEO: GSE183947
 *  To obtain high-quality clean data, the pass-filter FASTQ data were processed with Cutadapt (V1.9.1) to remove technical sequences
    (adapters, polymerase chain reaction (PCR) primers, or fragments thereof) and bases with quality lower than 20.
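
To make the quality cutoff concrete, here is a simplified sketch of 3'-end quality trimming in the style documented for Cutadapt: subtract the cutoff from each base quality, accumulate the sum from the 3' end, and cut where that sum is lowest. It assumes Phred+33 encoded qualities and that the "quality lower than 20" filter refers to this kind of end trimming. The function names are hypothetical and this is not the Cutadapt implementation itself.

```python
def quality_trim_index(qualities, cutoff=20):
    """Return the index at which to cut a read to drop its low-quality 3' end."""
    running_sum = 0
    min_sum = 0
    cut = len(qualities)  # default: keep the whole read
    for i in reversed(range(len(qualities))):
        running_sum += qualities[i] - cutoff
        if running_sum > 0:
            break  # good bases now outweigh the bad tail, stop scanning
        if running_sum < min_sum:
            min_sum = running_sum
            cut = i
    return cut

def phred33(quality_string):
    """Decode a FASTQ quality string (Phred+33) into integer scores."""
    return [ord(ch) - 33 for ch in quality_string]

def trim_read(sequence, quality_string, cutoff=20):
    """Trim the sequence and its quality string at the computed cut point."""
    cut = quality_trim_index(phred33(quality_string), cutoff)
    return sequence[:cut], quality_string[:cut]
```

For example, `trim_read("ACGTACGT", "IIIIII##")` drops the last two bases, whose quality of 2 is far below the cutoff.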

In [None]:
!pip install pysradb
!pip install parallel-fastq-dump
!sudo apt-get install rna-star
!sudo apt-get install subread

In [None]:
# Getting the Metadata (Accession List)
!pysradb metadata --detailed PRJNA762469
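
The run accessions needed later can be pulled out of the metadata table. This is a sketch that assumes the table was saved as tab-separated text (for example with the `--saveto` option of `pysradb metadata`) and that it contains a `run_accession` column. `run_accessions` is a hypothetical helper name.

```python
import csv
import io

def run_accessions(tsv_text):
    """Return the sorted, unique run accessions found in a metadata TSV string."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return sorted({row["run_accession"] for row in reader if row.get("run_accession")})

# Example usage, assuming the metadata was saved to metadata.tsv:
# with open("metadata.tsv") as handle:
#     print(run_accessions(handle.read()))
```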

In [None]:
# Downloading data (SRA FILES)
# The output directory matches the base_path used in the next cell
!pysradb download -y --out-dir /mnt/d/PRJNA762469_srafiles/ -p PRJNA762469

**Moving the SRA Files from Individual Folders to the Base Folder**

In [None]:
import os
import shutil

# Set the path to the base directory
base_path = r"/mnt/d/PRJNA762469_srafiles/"

# Walk every nested subdirectory (pysradb stores runs under project/experiment
# folders) and move the .sra files up to the base directory
for dirpath, _, filenames in os.walk(base_path):
    if os.path.abspath(dirpath) == os.path.abspath(base_path):
        continue  # files already in the base directory stay where they are
    for file in filenames:
        if file.endswith(".sra"):
            shutil.move(os.path.join(dirpath, file), os.path.join(base_path, file))

**Downloading the Reference Genome FASTA and Annotation GTF Files from Ensembl**

In [None]:
import os
import shutil

# Make sure the STAR working directory exists before moving files into it
os.makedirs("/mnt/d/STAR/", exist_ok=True)

# Download the annotation GTF from Ensembl (release 108)
!wget https://ftp.ensembl.org/pub/release-108/gtf/homo_sapiens/Homo_sapiens.GRCh38.108.gtf.gz
!gzip -d Homo_sapiens.GRCh38.108.gtf.gz
src_path = "Homo_sapiens.GRCh38.108.gtf"
shutil.move(src_path, "/mnt/d/STAR/")

# Download the genome FASTA from Ensembl (release 104, also GRCh38)
!wget https://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
!gzip -d Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
src_path = "Homo_sapiens.GRCh38.dna.primary_assembly.fa"
shutil.move(src_path, "/mnt/d/STAR/")

In [None]:
# Building STAR Index Using hg38 Genome Assembly -
!STAR --runThreadN 4 \
     --runMode genomeGenerate \
     --genomeDir /mnt/d/STAR/genome \
     --genomeFastaFiles /mnt/d/STAR/Homo_sapiens.GRCh38.dna.primary_assembly.fa \
     --sjdbGTFfile /mnt/d/STAR/Homo_sapiens.GRCh38.108.gtf \
     --sjdbOverhang 100
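
The STAR manual recommends setting `--sjdbOverhang` to the maximum read length minus 1, and notes that the default of 100 works well in most cases. A hedged sketch for checking this against one of the FASTQ files is shown below. It assumes a standard four-line FASTQ layout, and the helper names are hypothetical.

```python
import gzip
from itertools import islice

def max_read_length(fastq_path, n_reads=10000):
    """Return the longest read length among the first n_reads records."""
    opener = gzip.open if fastq_path.endswith(".gz") else open
    longest = 0
    with opener(fastq_path, "rt") as handle:
        for i, line in enumerate(islice(handle, 4 * n_reads)):
            if i % 4 == 1:  # the second line of each record holds the sequence
                longest = max(longest, len(line.strip()))
    return longest

def sjdb_overhang(fastq_path, n_reads=10000):
    """Suggested --sjdbOverhang value: maximum read length minus 1."""
    return max_read_length(fastq_path, n_reads) - 1
```

If the reads turn out to be, say, 151 bases long, rebuilding the index with `--sjdbOverhang 150` would follow that recommendation.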

**NOTE -** 
1. Only the BAM and featureCounts text files are saved.
2. To keep ALL the files, remove the "rm" commands accordingly.
3. The following SRA files were already trimmed using Cutadapt.

In [None]:
%%bash
SEQLIBS=(SRR15852393 SRR15852394 SRR15852395 SRR15852396 SRR15852397 SRR15852398 SRR15852399 SRR15852400 SRR15852401 SRR15852402 SRR15852403 SRR15852404 SRR15852405 SRR15852406 SRR15852407 SRR15852408 SRR15852409 SRR15852410 SRR15852411 SRR15852412 SRR15852413 SRR15852414 SRR15852415 SRR15852416 SRR15852417 SRR15852418 SRR15852419 SRR15852420 SRR15852421 SRR15852422 SRR15852423 SRR15852424 SRR15852425 SRR15852426 SRR15852427 SRR15852428 SRR15852429 SRR15852430 SRR15852431 SRR15852432 SRR15852433 SRR15852434 SRR15852435 SRR15852436 SRR15852437 SRR15852438 SRR15852439 SRR15852440 SRR15852441 SRR15852442 SRR15852443 SRR15852444 SRR15852445 SRR15852446 SRR15852447 SRR15852448 SRR15852449 SRR15852450 SRR15852451 SRR15852452)

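# Make sure the working directories exist before the loop starts
mkdir -p /mnt/d/PRJNA762469_srafiles/output /mnt/d/PRJNA762469_srafiles/bam /mnt/d/PRJNA762469_srafiles/text tmpdir

for seqlib in "${SEQLIBS[@]}"; do
    # Convert the .sra file into paired FASTQ files (_1 and _2)
    parallel-fastq-dump --threads 4 --split-files --outdir /mnt/d/PRJNA762469_srafiles/output --tmpdir tmpdir -s /mnt/d/PRJNA762469_srafiles/${seqlib}.sra
    # Align the read pairs and write a coordinate-sorted BAM
    STAR --runThreadN 4 --runMode alignReads --genomeDir /mnt/d/STAR/genome --readFilesIn /mnt/d/PRJNA762469_srafiles/output/${seqlib}_1.fastq /mnt/d/PRJNA762469_srafiles/output/${seqlib}_2.fastq --outFileNamePrefix /mnt/d/PRJNA762469_srafiles/bam/${seqlib} --outSAMtype BAM SortedByCoordinate
    # STAR writes its logs and splice-junction table next to the BAM (same prefix), so clean up there
    rm -f /mnt/d/PRJNA762469_srafiles/bam/*.out /mnt/d/PRJNA762469_srafiles/bam/*.tab
    rm -f /mnt/d/PRJNA762469_srafiles/output/*.fastq
    # Count reads per gene. Note: Subread 2.0.2 and later also need --countReadPairs to count fragments rather than reads
    featureCounts -p -t exon -g gene_id -a /mnt/d/STAR/Homo_sapiens.GRCh38.108.gtf -o /mnt/d/PRJNA762469_srafiles/text/${seqlib}_u.txt /mnt/d/PRJNA762469_srafiles/bam/${seqlib}Aligned.sortedByCoord.out.bam
done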
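
After a long loop like this it can help to confirm that every run produced both outputs. The sketch below assumes the folder layout and file naming used in the loop (`bam/<run>Aligned.sortedByCoord.out.bam` and `text/<run>_u.txt`). The helper names are hypothetical.

```python
import os

def accession_range(first, last, prefix="SRR"):
    """Build consecutive run accessions, e.g. SRR15852393 to SRR15852452."""
    return [f"{prefix}{n}" for n in range(first, last + 1)]

def missing_outputs(accessions, base_path):
    """Return the accessions that lack a BAM file or a featureCounts text file."""
    missing = []
    for acc in accessions:
        bam = os.path.join(base_path, "bam", f"{acc}Aligned.sortedByCoord.out.bam")
        counts = os.path.join(base_path, "text", f"{acc}_u.txt")
        if not (os.path.isfile(bam) and os.path.isfile(counts)):
            missing.append(acc)
    return missing

# Example usage (base path assumed from the cells above):
# runs = accession_range(15852393, 15852452)
# print(missing_outputs(runs, "/mnt/d/PRJNA762469_srafiles/"))
```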

In [None]:
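# Generate a CSV file of the raw gene counts from all featureCounts text files
import glob
import pandas as pd

# Collect the per-sample featureCounts outputs in a stable (sorted) order
count_files = sorted(glob.glob('/mnt/d/PRJNA762469_srafiles/text/*_u.txt'))

# header=1 skips the comment line that featureCounts writes as the first row.
# The first table keeps the annotation columns (Geneid, Chr, Start, End, Strand, Length).
raw = pd.read_table(count_files[0], header=1)

# Every other file contributes only its last column (the counts for that sample)
for path in count_files[1:]:
    counts = pd.read_table(path, header=1)
    raw = pd.concat([raw, counts.iloc[:, -1]], axis=1)

raw.to_csv('RawCounts.csv', index=False)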
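
featureCounts labels each count column with the full path of the input BAM file, which makes the CSV header hard to read. Here is a small sketch for shortening those labels to the run accession, assuming the STAR output suffix `Aligned.sortedByCoord.out.bam` used above. `sample_name` is a hypothetical helper.

```python
import os

STAR_SUFFIX = "Aligned.sortedByCoord.out.bam"

def sample_name(column):
    """Map a BAM-path column label to its run accession; leave other labels unchanged."""
    base = os.path.basename(column)
    if base.endswith(STAR_SUFFIX):
        return base[: -len(STAR_SUFFIX)]
    return column
```

With pandas this could be applied before writing the CSV, for example `raw = raw.rename(columns=sample_name)`.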

**"RawCounts.csv" was then imported into R to apply DESeq2 for normalization. See "Deseq.R".**
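
For readers who want to see what this normalization does, below is a pure-Python sketch of the median-of-ratios size factors that DESeq2 estimates by default. It is an illustration only, not a substitute for "Deseq.R". The input layout (one list of per-sample counts per gene) and the function names are assumptions made for this example.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors; counts is a list of genes, each a list of per-sample counts."""
    n_samples = len(counts[0])
    # Only genes with a nonzero count in every sample have a usable geometric mean
    log_rows = [[math.log(c) for c in gene] for gene in counts if all(c > 0 for c in gene)]
    log_geo_means = [sum(row) / n_samples for row in log_rows]
    factors = []
    for j in range(n_samples):
        log_ratios = [row[j] - mean for row, mean in zip(log_rows, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors

def normalize(counts):
    """Divide each sample's counts by that sample's size factor."""
    factors = size_factors(counts)
    return [[c / f for c, f in zip(gene, factors)] for gene in counts]
```

If one sample is simply sequenced twice as deeply as another, its size factor comes out twice as large, so the normalized counts of the two samples agree.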