# Genomic Analysis of Resistance Genes in Staphylococcus: Code Walkthrough and Output Guide

This notebook explains the functionality of the four main code files:

- conversionFASTA.py: converts the gbff files collected from the NCBI genome collections into FASTA so that each genome can be screened for antibiotic resistance genes.

- resFinder.py: runs Abricate, a screening tool that bundles several gene databases (including ResFinder), on the converted FASTA files to detect sequences that match known resistance genes.

- results.py: compiles the results into CSV files, including 'basic_stats.csv', 'degradation_genes.csv', 'plasmid_genes.csv', and 'synthesis_genes.csv', using the gbff files downloaded directly from the NCBI database.

- total_gbff.py: counts the downloaded gbff files before any results are curated from them, to catch duplicates or extra files that may have been downloaded by accident.

The purpose of this report is to help researchers in the microbiology department at the University of Galway use this tool to organize their data and eventually use it for machine learning.

## Python Environment Overview

This project was developed and tested on **Python 3.11**, using both general-purpose and bioinformatics-specific tools. I used VS Code to run all these commands and set up the environment for coding. 

To enable DNA sequence parsing and resistance gene detection, I used tools from the **Biopython** library, as well as the **Abricate** toolset (which includes ResFinder).

Since I'm on an Apple Silicon Mac (M1/M2/M3 chip), I used **Rosetta Terminal** to ensure compatibility with x86_64 binaries.

## Setting Up Visual Studio Code (VS Code)

Visual Studio Code is a free, lightweight editor developed by Microsoft that supports Python, Jupyter, and Conda environments.

### Step 1: Download VS Code
Visit: https://code.visualstudio.com/  
Click "Download for macOS" (or your OS), and follow installation instructions.

---

### Step 2: Install Python Extension
1. Open VS Code
2. Go to the **Extensions** sidebar (square icon on the left)
3. Search for "**Python**" and click **Install** (provided by Microsoft)

---

### Step 3: Use Your Conda Environment in VS Code
After creating your Conda environment (e.g., `abricate-env`), follow the step below:

1. Launch VS Code from a terminal where your environment is active (optional):
```bash
conda activate abricate-env
code .
```

## Biopython Setup (for sequence parsing)

To parse `.gbff` files and extract DNA sequences, the following import was used in all scripts:

```python
from Bio import SeqIO
```

To install Biopython, run `pip install biopython` in the terminal.

## For Other Installations:
#### 📄 Script: resFinder.py

Run `pip install biopython pandas tqdm` in the terminal.

Also used:
- `subprocess` and `os`: come pre-installed with Python
- `StringIO`: part of Python's built-in `io` module (no install needed)

#### 📄 Script: results.py

Run `pip install biopython pandas tqdm natsort` in the terminal.

Also used:
- `glob`, `csv`, `datetime`, `defaultdict` (from `collections`): all are built-in (no install needed)

#### 📄 Script: total_gbff.py

Run `pip install pandas` in the terminal.

Also used:
- `os`, `csv`, `hashlib`, `glob`, `defaultdict` (from `collections`): all are part of Python's standard library
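
Before running the scripts, it can help to confirm that the packages listed above are importable in the active environment. The following is a minimal sketch, not part of the original scripts; the `REQUIRED` table and the `missing_packages` helper are hypothetical names introduced here for illustration.

```python
from importlib.util import find_spec

# Import names for the packages listed above. Note that Biopython is
# installed as "biopython" but imported as "Bio".
REQUIRED = {
    "resFinder.py": ["Bio", "pandas", "tqdm"],
    "results.py": ["Bio", "pandas", "tqdm", "natsort"],
    "total_gbff.py": ["pandas"],
}


def missing_packages(modules):
    """Return the module names from `modules` that cannot be imported."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    for script, modules in REQUIRED.items():
        missing = missing_packages(modules)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{script}: {status}")
```

Any script reported as missing a module needs the matching `pip install` command from this section.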

## Conda Environment for Abricate + ResFinder

For resistance gene detection, I created a dedicated Conda environment using **Bioconda**.

### Commands:
1. `conda create -n abricate-env -c bioconda -c conda-forge abricate`
2. `conda activate abricate-env`
3. `abricate --setupdb`

I used [Miniforge](https://github.com/conda-forge/miniforge) to ensure compatibility with Rosetta Terminal on macOS.
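
To check from Python that the environment is set up, one option is to look for the `abricate` executable on the PATH and ask it to list its installed databases. This is a hedged sketch rather than code from the project; `find_tool` and `list_databases` are hypothetical helper names, and it assumes `abricate-env` is active when the script runs.

```python
import shutil
import subprocess


def find_tool(name="abricate"):
    """Return the full path of `name` if it is on PATH, else None."""
    return shutil.which(name)


def list_databases(tool_path):
    """Return the text printed by `abricate --list` (the installed databases)."""
    result = subprocess.run(
        [tool_path, "--list"], capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    path = find_tool()
    if path is None:
        print("abricate not found; is abricate-env activated?")
    else:
        print(list_databases(path))
```

If `resfinder` does not appear in the listing, rerunning `abricate --setupdb` is a reasonable first step.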


After installing everything above, you can move on to the scripts and how they can assist your research.

## How to Download GBFF Files and Save Them on Your Local Computer
The scripts need the path to the folder where the downloaded gbff files are saved, so it helps to know exactly where those files live. This section is a guide to creating the folders and saving the files in them.

When downloading `.gbff` files for Staphylococcus or any other genus, it's important to:
- Get complete, well-annotated genomes
- Organize your local folders correctly
- Point your code to the correct path for processing

We will use Staphylococcus as an example.

#### Step 1: Access the NCBI Genome Dataset
Visit this link:  
🔗 [https://www.ncbi.nlm.nih.gov/datasets/taxonomy/tree/](https://www.ncbi.nlm.nih.gov/datasets/taxonomy/tree/)

1. In the **Taxonomy tree**, search for and select **Staphylococcus (Genus)**  
2. Confirm that the Taxonomy ID is `1279`

---

#### Step 2: Apply Filters
To avoid incomplete or low-quality data, apply the following filters **before downloading**:
- ✅ **RefSeq annotation**
- ✅ **GenBank annotation**
- ✅ **Complete genomes**

This ensures the genomes have full annotations and sequences for resistance gene detection.

---

#### Step 3: Download the Files
Click the **Download** button, then select the following format:

- **GenBank (.gbff)**: for genome annotations

The files will be downloaded as a single **`.zip` file** (e.g., `ncbi_dataset.zip`).

---

#### Step 4: Unzip the Files
Unzip the downloaded file using your system's unzip tool or the command line:
```bash
unzip ncbi_dataset.zip
```

This is what the folder setup should look like in VS Code:

    ncbi_dataset/
    ├── GCF_000XXXXXX/
    │   └── genome.gbff
    ├── GCF_000XXXXXX/
    │   └── genome.gbff
    └── ...

#### Step 5: Load Folder into VS Code
1. Open Visual Studio Code
2. Click File → Open Folder...
3. Choose the project_root/ folder you created

You should now see all your scripts and folders in the sidebar.
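
Once the folder is open, a quick way to confirm that the layout matches the one described in Step 4 is to list the `.gbff` files from Python. The sketch below is illustrative only: `find_gbff_files` and `count_per_folder` are hypothetical helpers, and the folder name `ncbi_dataset` is assumed from the unzipped download.

```python
from pathlib import Path


def find_gbff_files(root):
    """Return every .gbff file under `root`, sorted by path."""
    return sorted(Path(root).rglob("*.gbff"))


def count_per_folder(gbff_files):
    """Map each parent folder name (e.g. an accession) to its file count."""
    counts = {}
    for path in gbff_files:
        counts[path.parent.name] = counts.get(path.parent.name, 0) + 1
    return counts


if __name__ == "__main__":
    root = Path("ncbi_dataset")  # assumed folder name; adjust to your setup
    if root.is_dir():
        files = find_gbff_files(root)
        print(f"{len(files)} .gbff files found")
        for folder, n in count_per_folder(files).items():
            print(f"{folder}: {n}")
    else:
        print(f"Folder not found: {root}")
```

With the layout above, each accession folder would be expected to report one file.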



This is how I saved and organized my files:

![Folder layout](project.jpg)


## Step 1: conversionFASTA.py // Convert GBFF to FASTA

This script parses `.gbff` files and extracts relevant nucleotide sequences into FASTA format. This is required for compatibility with resistance gene databases like ResFinder.

Main operations:
- Iterates over `.gbff` files in a specified directory
- Uses Biopython to extract sequence data
- Writes `.fasta` output files for downstream use
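
The three operations above can be sketched with Biopython's `SeqIO.convert`. This is a minimal illustration under stated assumptions, not the contents of conversionFASTA.py: the function names are hypothetical, and naming each output after its parent folder is a choice made here because every input file in the layout shown earlier shares the name `genome.gbff`.

```python
from pathlib import Path


def fasta_path_for(gbff_path, output_dir):
    """Name the output after the parent folder (the accession), since
    every input file may share the same file name."""
    return Path(output_dir) / (Path(gbff_path).parent.name + ".fasta")


def convert_all(input_dir, output_dir):
    """Convert every .gbff file under input_dir; return records written."""
    # Imported here so the path helper above works without Biopython installed.
    from Bio import SeqIO

    Path(output_dir).mkdir(parents=True, exist_ok=True)
    total = 0
    for gbff in sorted(Path(input_dir).rglob("*.gbff")):
        out_path = fasta_path_for(gbff, output_dir)
        # SeqIO.convert reads the GenBank records and writes them as FASTA,
        # returning the number of records converted.
        total += SeqIO.convert(str(gbff), "genbank", str(out_path), "fasta")
    return total
```

A call such as `convert_all("ncbi_dataset", "fasta_out")` would write one FASTA file per genome folder; `fasta_out` is a placeholder name.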

#### Important Note:
`input_dir = "/Users/sanjoydasgupta/Desktop/genomics_env/dna_playground/scripts/ncbi_dataset"`

This line of the code holds the path to the downloaded NCBI dataset folder (called `ncbi_dataset`). To find the matching path on your own computer, first move into that folder:

`cd ~/Downloads/ncbi_dataset`

Make sure you are in the Conda environment you set up and inside the dataset folder, which you can check in the sidebar on the left.

Then use the `pwd` command to print the current working directory:

`pwd`

You'll get something like:

`/Users/yourname/Downloads/ncbi_dataset`

Copy that path. It's what you'll use in your Python code, for example:

`input_dir = "/Users/yourname/Downloads/ncbi_dataset/data/"`

To make this process easier, keep the folder names and layout the same as shown here.
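
As an alternative to pasting an absolute path, the folder can be built relative to the home directory and checked before any processing starts. This is a sketch under assumptions, not how the original scripts do it; `resolve_input_dir` is a hypothetical helper.

```python
from pathlib import Path


def resolve_input_dir(*parts, base=None):
    """Join `parts` onto `base` (default: the home folder) and check it exists."""
    base = Path.home() if base is None else Path(base)
    input_dir = base.joinpath(*parts)
    if not input_dir.is_dir():
        # Fail early with a readable message instead of processing nothing.
        raise FileNotFoundError(f"Dataset folder not found: {input_dir}")
    return input_dir
```

For the example above, `resolve_input_dir("Downloads", "ncbi_dataset")` would return the same folder that `pwd` printed, or raise an error if the folder is somewhere else.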


## Step 2: resFinder.py // Run ResFinder via Abricate

This script applies Abricate's ResFinder database to each FASTA file to detect known resistance genes.

Main operations:
- Iterates over `.fasta` files
- Runs Abricate with ResFinder
- Produces output CSVs with detected resistance genes
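
The operations above could look roughly like the following. It is a hedged sketch, not the actual resFinder.py: the helper names are hypothetical, and it assumes `abricate` is on the PATH with the `resfinder` database installed. The `--db` and `--csv` options are Abricate command-line flags.

```python
import csv
import io
import subprocess


def abricate_command(fasta_path, db="resfinder"):
    """Build the Abricate call for one FASTA file, with CSV output."""
    return ["abricate", "--db", db, "--csv", str(fasta_path)]


def parse_hits(csv_text):
    """Parse Abricate's CSV report into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def run_abricate(fasta_path, db="resfinder"):
    """Run Abricate on one file and return its parsed hits."""
    result = subprocess.run(
        abricate_command(fasta_path, db),
        capture_output=True, text=True, check=True,
    )
    return parse_hits(result.stdout)
```

Each returned row is keyed by the report's header names, so the detected gene name should be available under the `GENE` column.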


## Step 3: results.py // Compile Results into CSVs

This script merges and processes outputs from previous steps. It generates summaries like:
- `basic_stats.csv`: overview of genomes
- `degradation_genes.csv`, `plasmid_genes.csv`, `synthesis_genes.csv`: gene categories

The data here can serve as features for ML models (e.g., one-hot encoding of gene presence).
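
To illustrate the one-hot encoding idea, here is a small sketch that turns a mapping of genome to detected genes into a presence/absence table. The function name and the genome and gene labels are placeholders, not values from the project's output.

```python
def one_hot_gene_matrix(genes_by_genome):
    """Return (sorted gene columns, {genome: 0/1 row}) for gene presence."""
    columns = sorted({g for genes in genes_by_genome.values() for g in genes})
    rows = {
        genome: [1 if gene in genes else 0 for gene in columns]
        for genome, genes in genes_by_genome.items()
    }
    return columns, rows


# Placeholder labels, for illustration only.
columns, rows = one_hot_gene_matrix(
    {"genome_A": {"gene_x", "gene_y"}, "genome_B": {"gene_y"}}
)
# columns -> ["gene_x", "gene_y"]
# rows    -> {"genome_A": [1, 1], "genome_B": [0, 1]}
```

Sorting the columns keeps the feature order stable between runs, which matters when the same table feeds a model more than once.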

This script also contains a line with the dataset directory:

`GBFF_ROOT = "/Users/sanjoydasgupta/Desktop/genomics_env/dna_playground/scripts/ncbi_dataset"`

Change this to match the path on your own computer. To see how to find the path, go back to Step 1 (conversionFASTA.py).

## Step 4: total_gbff.py // Verify Total GBFF Files

Before analysis, we check the integrity and count of downloaded GBFF files. This ensures:
- No duplicates
- No missing or corrupted downloads
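
One way to detect duplicates is to group files by a hash of their contents. The sketch below assumes this approach and uses SHA-256; the original total_gbff.py imports `hashlib`, but its exact method is not shown here, and the helper names are hypothetical. It only covers the duplicate check, not corrupted downloads.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Group .gbff files under `root` by content; return groups with >1 file."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*.gbff")):
        groups[file_digest(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

An empty list from `find_duplicates` means no two files under the folder have identical contents.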

This script also contains a line with the dataset directory:

`GBFF_ROOT = "/Users/sanjoydasgupta/Desktop/genomics_env/dna_playground/scripts/ncbi_dataset"`

Change this to match the path on your own computer. To see how to find the path, go back to Step 1 (conversionFASTA.py).