## How to Run a Python Script on an HPC Server


### To run a Python script on an HPC server, you need two files:
#### 1. A .sh file
#### 2. A .py file

###### 1. The .py file contains the code you want to run; an example Python file is shown later.
###### 2. The .sh file contains the job settings (nodes, memory, packages, etc.) and the commands to run; it is also shown later.
######   A .sh file is a Bourne shell script, a format used on many UNIX-like operating systems. It is a plain-text list of commands that your shell (the terminal's command interpreter) reads and runs. To learn more, see https://stackoverflow.com/questions/13805295/whats-a-sh-file.
    

## Python Script:
##### Here is a simple Python script that will be executed on the HPC server.
##### Everything, from importing the required libraries to writing and calling the function, is the same as in code you run on your local machine.
#### You just need to save the code with a .py extension; let's say we save this file as myscript.py.

In [None]:

import pandas as pd
import os
def coverage_normalised(gtf_file, cov_file): 
    """
    This code is written by Md Abuzar Khan (Data Management supervisor)
    Date : 30/08/2023 CSIR-IGIB, New Delhi.
    This function takes two input files:
    1. gtf_file: the human reference annotation (GTF) saved as a .csv file
    2. cov_file: the tab-separated coverage file (.txt) produced from BAM files with bedtools
    It first processes cov_file, keeps the gene rows, merges them with the
    annotation, and writes the normalised coverage per gene to a .csv file.

    """
    df = pd.read_csv(cov_file, sep = '\t', comment = '#', header = None, dtype = {0 : str})
    df.rename(columns ={
        df.columns[0]: 'Chromosome',
        df.columns[1]: 'Source',
        df.columns[2]: 'Feature',
        df.columns[3]: 'Start',
        df.columns[4]: 'End',
        df.columns[5]: 'needtodrop',
        df.columns[6]: 'neetodrop',
        df.columns[7]: 'neetdrop',
        df.columns[8]: 'Attributes',
        df.columns[9]:'Depth',
        df.columns[10]:'Mapped',
        df.columns[11]:'Length',
        df.columns[12]:'Coverage'

    }, inplace=True)

    df = df[['Chromosome', 'Source', 'Feature', 'Start', 'End','Attributes', 'Depth', 'Mapped', 'Length','Coverage']]
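    # Keep gene rows only; .copy() avoids pandas' SettingWithCopyWarning when columns are added below
    df = df[df['Feature'] == 'gene'].copy()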
    df['Gene_ID'] = df['Attributes'].apply(lambda x: x.split(';')[0].split('"')[1])
    df['total_bases'] = df['End'] - df['Start'] + 1

    gtf_file = pd.read_csv(gtf_file, low_memory = False)
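    # Keep gene rows only; .copy() avoids pandas' SettingWithCopyWarning when Gene_ID is added below
    gtf_file = gtf_file[gtf_file['feature'] == 'gene'].copy()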
    gtf_file['Gene_ID'] = gtf_file['attribute'].apply(lambda x: x.split(';')[0].split('"')[1])

    df = df[['Gene_ID','Depth', 'Mapped', 'Length','Coverage','total_bases']]

    merged_df = gtf_file.merge(df, on = 'Gene_ID', how = 'left').fillna(0)

    merged_df['Total_Bases'] = merged_df['end']- merged_df['start'] + 1
    merged_df = merged_df[['Gene_ID','Total_Bases','Mapped','Depth','Coverage','total_bases']]
    merged_df['coverage_normalised'] = (merged_df['Mapped'] / merged_df['Total_Bases'])*100
    merged_df['Depth'] = merged_df['Depth'].apply(lambda x: 0 if x < 10 else x)
    merged_df = merged_df[['Total_Bases','Gene_ID','Mapped','total_bases','coverage_normalised','Depth']]
    # Build the output name from the file name only, so cov_file may be a full path
    sample_name = os.path.basename(cov_file).split('_')[0]
    merged_df.to_csv('/lustre/abuzar.khan/dengue_bdcov_files/norm_cov_data_gene/' + sample_name + '_gene.csv')

    return merged_df

# Collect the coverage (.txt) files from the working directory
directory = "/lustre/user_name/path/to/workingdir"
txt_files = [file_name for file_name in os.listdir(directory) if file_name.endswith(".txt")]
for txt_file in txt_files:
    txt_path = os.path.join(directory, txt_file)

    # Call the function with the full path so it works from any current directory
    result_df = coverage_normalised('human_gtf.csv', txt_path)
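
To make the core steps of the function easier to follow, here is a minimal sketch that runs the same Gene_ID parsing, left merge, normalisation, and depth cut-off on two tiny tables. The gene names, coordinates, and counts are made up for illustration and are not taken from any real GTF or coverage file.

```python
import pandas as pd

# Hypothetical annotation rows in GTF style (values invented for this sketch)
genes = pd.DataFrame({
    "attribute": [
        'gene_id "GENE_A"; gene_name "A";',
        'gene_id "GENE_B"; gene_name "B";',
    ],
    "start": [101, 1001],
    "end": [200, 1400],
})

# Take the first ';'-separated field, then the text between its quotes
genes["Gene_ID"] = genes["attribute"].apply(lambda x: x.split(";")[0].split('"')[1])

# Hypothetical coverage rows: only GENE_A has any coverage
coverage = pd.DataFrame({"Gene_ID": ["GENE_A"], "Mapped": [50], "Depth": [7]})

# A left merge keeps every annotated gene; genes without coverage get 0
merged = genes.merge(coverage, on="Gene_ID", how="left").fillna(0)

merged["Total_Bases"] = merged["end"] - merged["start"] + 1
merged["coverage_normalised"] = merged["Mapped"] / merged["Total_Bases"] * 100

# Depth values below 10 are set to 0, as in the function above
merged["Depth"] = merged["Depth"].apply(lambda x: 0 if x < 10 else x)

print(merged[["Gene_ID", "Total_Bases", "Mapped", "coverage_normalised", "Depth"]])
```

In this sketch GENE_A spans 100 bases with 50 of them covered, so its normalised coverage is 50 percent, while GENE_B has no coverage row and ends up at 0.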





### Our .sh file: let's say we save it as myshfile.sh. Make sure both files are in the same directory; if they are not, specify the absolute path. To learn more about the code below, please refer to this link:
#### (https://researchcomputing.princeton.edu/support/knowledge-base/python#:~:text=This%20guide%20presents%20an%20overview%20of%20installing%20Python,are%20to%20be%20run%20on%20the%20command%20line.)

#### Now let's look at the .sh file (the file itself starts at the #!/bin/bash line):
#!/bin/bash
#SBATCH --job-name=coverage_norm
#SBATCH --output=log.%j  # Standard output and error log
#SBATCH --nodes=1
##SBATCH --ntasks-per-node=20
#SBATCH --cpus-per-task=1
#SBATCH --mem=256GB  # Memory (RAM) per node. Number followed by unit prefix.
#SBATCH --partition=compute  # Partition/queue in which to run the job
#SBATCH --time=1193046:27:16  # Wall-time limit (hours:minutes:seconds); set this within your partition's limit

echo "SLURM_JOBID=$SLURM_JOBID"
echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
echo "SLURM_NNODES=$SLURM_NNODES"
echo "SLURMTMPDIR=$SLURMTMPDIR"
echo "Date = $(date)"
echo "Hostname = $(hostname -s)"
echo ""
echo "Number of Nodes Allocated = $SLURM_JOB_NUM_NODES"
echo "Number of Tasks Allocated = $SLURM_NTASKS"
echo "Number of Cores/Task Allocated = $SLURM_CPUS_PER_TASK"
echo "Working Directory = $(pwd)"
echo "Submit Directory = $SLURM_SUBMIT_DIR"

cd /lustre/user_name/path/to/workingdirectory  # Change working directory to the desired path
module purge
module load anaconda3/2023.3
conda init bash  # Initialize conda for bash
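source ~/.bashrc  # Load the conda initialization into this shell
# Create a new conda environment with the required third-party packages (--yes skips the confirmation prompt).
# csv and os are part of the Python standard library, so they are not installed through conda.
conda create --name my-env --yes pandas numpy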
conda activate my-env  # Activate the created environment
python myscript.py
/bin/hostname | tee result
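
The echo lines in the .sh file print Slurm's environment variables to the log. The same values can be read from inside the Python script, which can help when debugging a job. The sketch below is only an illustration: the helper name `slurm_context` and the "not set" fallback are assumptions, and the exact set of variables available depends on your cluster.

```python
import os


def slurm_context(environ=None):
    """Collect a few Slurm variables; values are 'not set' outside a job."""
    env = os.environ if environ is None else environ
    names = [
        "SLURM_JOB_ID",
        "SLURM_JOB_NODELIST",
        "SLURM_CPUS_PER_TASK",
        "SLURM_SUBMIT_DIR",
    ]
    return {name: env.get(name, "not set") for name in names}


if __name__ == "__main__":
    # Inside a job this prints the allocation details; on a laptop it prints 'not set'
    for name, value in slurm_context().items():
        print(f"{name} = {value}")
```

Passing a plain dictionary instead of the real environment makes the helper easy to try out locally before submitting a job.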




### Once both the .py and .sh files are saved in your working directory, go to the command line and type:

In [None]:
 $ sbatch myshfile.sh

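#### This submits the job and writes a log file (named log.<job ID>, as set by --output in the .sh file). To check the status of your job, type the command below (showq is available on some clusters; the standard Slurm command is squeue).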

In [None]:
 $ showq
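
The submit-and-check steps above can also be driven from Python. This is a rough sketch, not part of the original workflow: the function names are hypothetical, it assumes `sbatch` prints its usual "Submitted batch job <ID>" line, and it uses the standard Slurm `squeue` command rather than `showq`, which may not exist on every cluster.

```python
import re
import subprocess


def parse_job_id(sbatch_output):
    """Extract the job ID from sbatch's 'Submitted batch job <ID>' message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    return match.group(1) if match else None


def log_file_name(job_id):
    """Name of the log written by '#SBATCH --output=log.%j' for this job."""
    return f"log.{job_id}"


def submit(script="myshfile.sh"):
    """Submit the script with sbatch and return the job ID as a string."""
    result = subprocess.run(
        ["sbatch", script], capture_output=True, text=True, check=True
    )
    return parse_job_id(result.stdout)


def job_state(job_id):
    """Ask squeue for the job state; an empty answer means it left the queue."""
    result = subprocess.run(
        ["squeue", "-j", str(job_id), "-h", "-o", "%T"],
        capture_output=True, text=True,
    )
    return result.stdout.strip() or "not in queue"
```

On a login node, `job_id = submit()` followed by `job_state(job_id)` and `log_file_name(job_id)` would give the job's state and the name of the log file to open.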

## Thank you
#### Md Abuzar Khan