
# **BSEEJ: Example Workflow**

This notebook demonstrates the **BSEEJ** algorithm on the A2ML1 gene dataset. The overall workflow includes preprocessing, training, and saving the results and run information. You do not need to run these steps one by one; it is enough to provide four main arguments:
1. Junction path (`-p`): the directory containing the gene's junction files. To run this example, unzip `A2ML1.zip` and pass the path to the extracted junctions.
2. Result path (`-o`): the directory where the results are saved.
3. Gene name (`-g`): used only for naming the saved results, so it does not need to be an exact gene name.
4. Number of clusters (`-k`).
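
To make the four arguments concrete, here is a minimal sketch of how a command line with these flags could be parsed using Python's standard `argparse`. This is not the parser in `bseej.py`, which this notebook does not show. The `dest` names (`junction_path`, `result_path`, `gene_name`, `n_clusters`), the choice to make all four flags required, and the placeholder paths are assumptions for illustration only.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the -p, -o, -g and -k flags described above."""
    parser = argparse.ArgumentParser(prog="bseej.py")
    parser.add_argument("-p", dest="junction_path", required=True,
                        help="directory containing the junction files")
    parser.add_argument("-o", dest="result_path", required=True,
                        help="directory where results are written")
    parser.add_argument("-g", dest="gene_name", required=True,
                        help="label used when naming the saved results")
    parser.add_argument("-k", dest="n_clusters", type=int, required=True,
                        help="number of clusters")
    return parser


# Placeholder paths; replace them with your own locations.
argv = ["bseej.py", "-k", "3", "-p", "A2ML1", "-o", "results", "-g", "A2ML1"]

# argv[0] is the script name, so only the remaining items are parsed.
parsed = build_parser().parse_args(argv[1:])
print(parsed.n_clusters, parsed.junction_path, parsed.result_path, parsed.gene_name)
```

Converting `-k` with `type=int` means a non-numeric cluster count is rejected at parse time rather than later during training.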

## **Steps in the Workflow**

1. **Setup and Configuration**  
   - Install necessary dependencies and set up the Conda environment.
   - Define parameters such as the number of clusters, dataset paths, and model hyperparameters.

2. **Data Loading and Preprocessing**  
   - Load `.junc` files containing gene junction data.
   - Process the data to generate input structures like interval graphs, conflict matrices, and feature matrices.

3. **Model Initialization**  
   - Initialize the **Gene** object, representing the dataset.
   - Set up the **Model** object with the desired hyperparameters for training.

4. **Model Training**  
   - Train the model using the Gibbs sampling algorithm:
     - Update cluster assignments.
     - Optimize model parameters (e.g., `theta`, `pi`, `beta`).
     - Track convergence metrics such as log-likelihood.

5. **Results Saving and Visualization**  
   - Save outputs, including model parameters and cluster assignments.
   - Visualize convergence trends and cluster information.
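
The preprocessing step above mentions conflict matrices built from junction data. The sketch below shows one plausible way to build such a matrix. It assumes that each junction is a `(start, end)` interval and that two junctions conflict when their intervals overlap. Both the definition and the function name `conflict_matrix` are illustrative assumptions; BSEEJ's own preprocessing may define conflicts differently.

```python
def conflict_matrix(junctions):
    """Return a symmetric boolean matrix marking overlapping junction pairs.

    `junctions` is a list of (start, end) tuples with start < end.
    Assumption: two junctions "conflict" when their intervals overlap.
    """
    n = len(junctions)
    matrix = [[False] * n for _ in range(n)]
    for i in range(n):
        start_i, end_i = junctions[i]
        for j in range(i + 1, n):
            start_j, end_j = junctions[j]
            # Two open intervals overlap when each starts before the other ends.
            if start_i < end_j and start_j < end_i:
                matrix[i][j] = True
                matrix[j][i] = True
    return matrix


# Toy coordinates, not taken from the A2ML1 dataset.
toy_junctions = [(100, 200), (150, 300), (250, 400), (500, 600)]
conflicts = conflict_matrix(toy_junctions)
for row in conflicts:
    print(row)
```

The pairwise loop is quadratic in the number of junctions, which is usually acceptable for a single gene. A sort-based sweep would scale better for larger inputs.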

In [None]:
# Step 1: Import dependencies
import sys
from BSEEJ.gene import Gene
from BSEEJ.model import Model
from utilities import save_results

In [None]:
# Step 2: Import the Main class
from bseej import Main

In [None]:
# Step 3: Build the argument list (same flags as on the command line)
args = [
    "bseej.py",                   # Simulating the script name
    "-k", "3",                    # Number of clusters
    "-p", "/labs/Aguiar/BSEEJ/A2ML1",  # Path to gene data
    "-o", "/labs/Aguiar/BSEEJ/results", # Path to results
    "-g", "A2ML1"                 # Gene name
]

In [None]:
# Step 4: Run the Main function
Main.main(args)

Gene: A2ML1
junction path: /labs/Aguiar/BSEEJ/A2ML1
result path: /labs/Aguiar/BSEEJ/results
Number of clusters: 3
Maximum number of iterations: 1000
model parameter, eta: 0.01
model parameter, alpha: 1
model parameter, r: 1
model parameter, s: 1
training gene A2ML1 with k = 3
Gene A2ML1 , Iteration 0 , Likelihood = -45238.2731 , Converged: False
Gene A2ML1 , Iteration 100 , Likelihood = -23215.0153 , Converged: False
Gene A2ML1 , Iteration 200 , Likelihood = -23670.9603 , Converged: False
Gene A2ML1 , Iteration 300 , Likelihood = -24367.3119 , Converged: False
Gene A2ML1 , Iteration 400 , Likelihood = -24807.777 , Converged: False
Gene A2ML1 , Iteration 500 , Likelihood = -25005.5194 , Converged: True
Gene A2ML1 , Iteration 600 , Likelihood = -25221.8276 , Converged: True
Saving the results for gene A2ML1
/labs/Aguiar/BSEEJ/results/A2ML1/run_info_gene_A2ML1_alpha_1_eta_0.01_epsilon_1e-06_rs_1_K_3.pkl saved.
/labs/Aguiar/BSEEJ/results/A2ML1/bseej_A2ML1_K_3.csv saved.
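
If you capture the training output as text, the progress lines can be turned into records for plotting or inspection. The sketch below is a helper written for this notebook, not part of BSEEJ. It assumes the line format printed above (`Gene <name> , Iteration <n> , Likelihood = <value> , Converged: <bool>`), and the names `LOG_PATTERN` and `parse_training_log` are hypothetical.

```python
import re

# Matches progress lines in the format printed by the run above.
LOG_PATTERN = re.compile(
    r"Gene\s+(?P<gene>\S+)\s*,\s*Iteration\s+(?P<iteration>\d+)\s*,\s*"
    r"Likelihood\s*=\s*(?P<likelihood>-?\d+(?:\.\d+)?)\s*,\s*"
    r"Converged:\s*(?P<converged>True|False)"
)


def parse_training_log(text):
    """Extract one record per progress line found in `text`."""
    records = []
    for match in LOG_PATTERN.finditer(text):
        records.append({
            "gene": match.group("gene"),
            "iteration": int(match.group("iteration")),
            "likelihood": float(match.group("likelihood")),
            "converged": match.group("converged") == "True",
        })
    return records


# Three lines copied from the output above.
sample_log = """\
Gene A2ML1 , Iteration 0 , Likelihood = -45238.2731 , Converged: False
Gene A2ML1 , Iteration 400 , Likelihood = -24807.777 , Converged: False
Gene A2ML1 , Iteration 500 , Likelihood = -25005.5194 , Converged: True
"""

records = parse_training_log(sample_log)

# First logged iteration at which the run reports convergence, if any.
first_converged = next((r["iteration"] for r in records if r["converged"]), None)
print(len(records), first_converged)
```

Because the run logs only every 100 iterations, `first_converged` is the first logged checkpoint that reports convergence, not necessarily the exact iteration at which the criterion was first met.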
