updated the vignette
danymukesha committed May 21, 2024
1 parent 864ba14 commit 3ced2bb
Showing 4 changed files with 37 additions and 201 deletions.
4 changes: 2 additions & 2 deletions src/crossover.cpp
@@ -27,8 +27,8 @@ NumericMatrix crossover_cpp(const NumericMatrix& selected_parents, int offspring

// Perform crossover between selected parents
for (int i = 0; i < offspring_size; ++i) {
int parent1_index = R::unif_rand() / num_parents;
int parent2_index = R::unif_rand() / num_parents;
int parent1_index = int(R::unif_rand()) % num_parents;
int parent2_index = int(R::unif_rand()) % num_parents;
for (int j = 0; j < num_genes; ++j) {
offspring(i, j) = (selected_parents(parent1_index, j) + selected_parents(parent2_index, j)) / 2.0;
}
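
A note on the index computation in this hunk: `R::unif_rand()` returns a value strictly between 0 and 1, so both `R::unif_rand() / num_parents` and `int(R::unif_rand()) % num_parents` become 0 once converted to `int`, and the same parent would be picked every time. One common way to draw a uniform index in [0, n) is to scale the draw before truncating. The sketch below is standalone C++ and uses `std::mt19937` as a stand-in for R's generator; `unit_uniform` and `uniform_index` are hypothetical helper names, not part of BioGA.

```cpp
#include <cassert>
#include <random>

// Stand-in for R::unif_rand(): a double in [0, 1).
// (R's generator excludes both endpoints; the standard one excludes only 1.)
double unit_uniform(std::mt19937& rng) {
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    return dist(rng);
}

// Hypothetical helper: map a unit-interval draw u to an index in [0, n).
int uniform_index(double u, int n) {
    assert(n > 0);
    int idx = static_cast<int>(u * n);  // scale first, then truncate
    return idx < n ? idx : n - 1;       // guard in case u rounds up to 1.0
}
```

In the Rcpp sources the call would presumably read `uniform_index(R::unif_rand(), num_parents)`; the same pattern would apply to `sample_index` and `index_to_replace` in the other two files.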
2 changes: 1 addition & 1 deletion src/initialize_population.cpp
@@ -27,7 +27,7 @@ using namespace Rcpp;
// Generate random population using genomic data
for (int i = 0; i < population_size; ++i) {
for (int j = 0; j < num_genes; ++j) {
int sample_index = R::unif_rand() / num_samples;
int sample_index = int(R::unif_rand()) % num_samples;
population(i, j) = genomic_data(j, sample_index);
}
}
3 changes: 2 additions & 1 deletion src/replacement.cpp
@@ -1,4 +1,5 @@
#include <Rcpp.h>
#include <cmath>
using namespace Rcpp;

//' Function to replace non-selected individuals in the population
@@ -31,7 +32,7 @@ NumericMatrix replacement_cpp(const NumericMatrix& population, const NumericMatr

// Replace non-selected individuals in the population with offspring
for (int i = 0; i < num_to_replace; ++i) {
int index_to_replace = R::unif_rand() / population_size;
int index_to_replace = int(R::unif_rand()) % population_size;
for (int j = 0; j < num_genes; ++j) {
updated_population(index_to_replace, j) = offspring(i, j);
}
229 changes: 32 additions & 197 deletions vignettes/Introduction.Rmd
@@ -11,7 +11,7 @@ abstract:
optimal or near-optimal solutions.

In the field of genomics, where data sets are often large, complex,
and high-dimensional, genetic algorithms offer a promising approach
and high-dimensional, genetic algorithms offer a good approach
for addressing optimization challenges such as feature selection,
parameter tuning, and model optimization. By harnessing the power
of evolutionary principles, genetic algorithms can effectively explore
@@ -22,7 +22,7 @@ abstract:
The BioGA package extends the capabilities of genetic algorithms
to the realm of genomic data analysis, providing a suite of functions
optimized for handling high throughput genomic data. Implemented in C++
for enhanced performance, BioGA offers efficient algorithms for tasks
for enhanced performance, BioGA offers efficient algorithms for tasks
such as feature selection, classification, clustering, and more.
By integrating seamlessly with the Bioconductor ecosystem,
BioGA empowers researchers and analysts to leverage the power
@@ -71,10 +71,10 @@ We showcase its interoperability with Bioconductor classes, demonstrating
how genetic algorithm optimization can be seamlessly integrated into
existing genomics pipelines for improved analysis and interpretation.

The BioGA package provides a comprehensive set of functions for
genetic algorithm optimization tailored for analyzing high throughput
genomic data. This vignette demonstrates the usage of BioGA in the context
of selecting the best combination of genes for predicting a certain trait,
The BioGA package provides a set of functions for genetic algorithm
optimization tailored for analyzing high throughput genomic data.
This vignette demonstrates the usage of BioGA in the context of selecting
the best combination of genes for predicting a certain trait,
such as disease susceptibility.

## Overview
@@ -108,7 +108,28 @@ each column represents a sample. The values in the matrix represent
some measurement of gene expression, such as mRNA levels or protein abundance,
in each sample.

## Example Scenario
For instance, the value 0.1 in Sample 1 for Gene1 indicates the expression
level of Gene1 in Sample 1. Similarly, the value 2.2 in Sample 2 for Gene3
indicates the expression level of Gene3 in Sample 2.

Genomic data can be used in various analyses, including genetic association
studies, gene expression analysis, and comparative genomics. In the context
of the `evaluate_fitness_cpp` function, genomic data is used to calculate
fitness scores for individuals in a population, typically in the context
of genetic algorithm optimization.

The population represents a set of candidate combinations of genes that
could be predictive of the trait.
Each individual in the population is represented by a binary vector indicating
the presence or absence of each gene.
For example, an individual in the population might be represented as
[1, 0, 1],
indicating the presence of Gene1 and Gene3 but the absence of Gene2.
The population undergoes genetic algorithm operations such as selection,
crossover, mutation, and replacement to evolve towards individuals with higher
predictive power for the trait.
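
To make the binary encoding concrete, the following standalone C++ sketch lists the genes that a binary individual marks as present. It is only an illustration of the representation described above; `selected_genes` is a hypothetical helper and is not a BioGA function.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: return the names of the genes marked present (1)
// in a binary individual such as {1, 0, 1}.
std::vector<std::string> selected_genes(const std::vector<int>& individual,
                                        const std::vector<std::string>& gene_names) {
    std::vector<std::string> present;
    for (std::size_t j = 0; j < individual.size() && j < gene_names.size(); ++j) {
        if (individual[j] == 1) {
            present.push_back(gene_names[j]);
        }
    }
    return present;
}
```

Called with the individual `{1, 0, 1}` and the names `{"Gene1", "Gene2", "Gene3"}`, the helper returns `Gene1` and `Gene3`.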

## Example Scenario

Consider an example scenario of using genetic algorithm optimization to select
the best combination of genes for predicting a certain trait, such as disease
@@ -154,11 +175,10 @@ such as RNAseq count matrices or microarray data.
head(genomic_data)
```


## Initialization

```{r}
# Initialize population
# Initialize population (`population_size` sets the number of candidates)
population <- BioGA::initialize_population_cpp(genomic_data,
    population_size = 5
)
@@ -264,27 +284,15 @@ gene expression profiles that are more similar to the genomic data and
are therefore more likely to be selected for further optimization
in the genetic algorithm.


```{r}
# Plot fitness change over generations
BioGA::plot_fitness_history(fitness_history)
```



This vignette demonstrates how genetic algorithm optimization can be applied
to select the best combination of genes for predicting a certain trait using
the BioGA package. It showcases the integration of genetic algorithms
with genomic data analysis and highlights the potential of genetic algorithms
This showcases the integration of genetic algorithms with genomic
data analysis and highlights the potential of genetic algorithms
for feature selection in genomics.

BioGA is a computational tool designed to analyze and optimize high
throughput genomic data using genetic algorithms (GAs). Genetic algorithms
are a type of optimization algorithm inspired by the process of natural
selection and genetics. They operate by iteratively evolving a population
of candidate
solutions towards better solutions.

Here's how BioGA could work in the context of high throughput genomic data
analysis:

@@ -326,186 +334,13 @@ the final population to identify the best solution(s) found. This could
involve further validation or interpretation of the results in the context
of the original problem.

Applications of BioGA in genomic data analysis could include genome-wide
Other applications of BioGA in genomic data analysis could include genome-wide
association studies (GWAS), gene expression analysis, pathway analysis,
and predictive modeling for personalized medicine, among others.
By leveraging genetic algorithms, BioGA offers a powerful approach
to exploring complex genomic datasets and identifying meaningful patterns
and associations.

The BioGA package provides a comprehensive set of functions for
genetic algorithm optimization tailored for analyzing high throughput
genomic data. This vignette demonstrates the usage of BioGA in the context
of selecting the best combination of genes for predicting a certain trait,
such as disease susceptibility.

Let's consider an example scenario of using genetic algorithm optimization
to select the best combination of genes for predicting a certain trait,
such as disease susceptibility.

```{r}
# Load the BioGA package
library(BioGA)
# Define parameters for genetic algorithm
population_size <- 100
generations <- 6
mutation_rate <- 0.1
# Generate example genomic data
genomic_data <- matrix(rnorm(100), nrow = 10, ncol = 10)
```


# Overview

Genomic data refers to the genetic information stored in an organism's DNA.
It includes the sequence of nucleotides (adenine, thymine, cytosine,
and guanine) that make up the DNA molecules. Genomic data can provide
valuable insights into various biological processes, such as gene expression,
genetic variation, and evolutionary relationships.

Genomic data in this context could consist of gene expression profiles
measured across different individuals (e.g., patients).

- Each row in the genomic_data matrix represents a gene, and each column
represents a patient sample.

- The values in the matrix represent the expression levels of each gene
in each patient sample.


Here's an example of genomic data:

```
Sample 1 Sample 2 Sample 3 Sample 4
Gene1 0.1 0.2 0.3 0.4
Gene2 1.2 1.3 1.4 1.5
Gene3 2.3 2.2 2.1 2.0
```

In this example, each row represents a gene (or genomic feature), and
each column represents a sample. The values in the matrix represent
some measurement of gene expression, such as mRNA levels or protein abundance,
in each sample.

For instance, the value 0.1 in Sample 1 for Gene1 indicates the expression
level of Gene1 in Sample 1. Similarly, the value 2.2 in Sample 2 for Gene3
indicates the expression level of Gene3 in Sample 2.

Genomic data can be used in various analyses, including genetic association
studies, gene expression analysis, and comparative genomics. In the context
of the `evaluate_fitness_cpp` function, genomic data is used to calculate
fitness scores for individuals in a population, typically in the context
of genetic algorithm optimization.

```{r}
# Initialize population
population <- initialize_population_cpp(genomic_data,
    population_size = 5
)
```

The population represents a set of candidate combinations of genes that
could be predictive of the trait.
Each individual in the population is represented by a binary vector indicating
the presence or absence of each gene.
For example, an individual in the population might be represented as
[1, 0, 1],
indicating the presence of Gene1 and Gene3 but the absence of Gene2.
The population undergoes genetic algorithm operations such as selection,
crossover, mutation, and replacement to evolve towards individuals with higher
predictive power for the trait.

```{r}
# Initialize fitness history
fitness_history <- list()
# Initialize time progress
start_time <- Sys.time()
# Run genetic algorithm optimization
generation <- 0
while (TRUE) {
    generation <- generation + 1
    # Evaluate fitness
    fitness <- evaluate_fitness_cpp(genomic_data, population)
    fitness_history[[generation]] <- fitness
    # Check termination condition
    if (generation == generations) { # defined number of generations
        break
    }
    # Selection
    selected_parents <- selection_cpp(population, fitness, num_parents = 2)
    # Crossover and Mutation
    offspring <- crossover_cpp(selected_parents, offspring_size = 2)
    # (no mutation in this example)
    mutated_offspring <- mutation_cpp(offspring, mutation_rate = 0)
    # Replacement
    population <- replacement_cpp(population, mutated_offspring,
        num_to_replace = 1
    )
    # Calculate time progress
    elapsed_time <- difftime(Sys.time(), start_time, units = "secs")
    # Print time progress
    cat(
        "\rGeneration:", generation, "- Elapsed Time:",
        format(elapsed_time, units = "secs"), " "
    )
}
```


The fitness calculation described in the provided code calculates a measure
of dissimilarity between the gene expression profiles of individuals
in the population and the genomic data. This measure of dissimilarity,
or "fitness", quantifies how well the gene expression profile of an individual
matches the genomic data.

Mathematically, the fitness calculation can be represented as follows:

Let:

- \( g_{jk} \) be the gene expression level of gene \( j \)
in sample \( k \) from the genomic data.

- \( p_{ij} \) be the gene expression level of gene \( j \)
in individual \( i \) from the population.

- \( N \) be the number of individuals in the population.

- \( G \) be the number of genes.

- \( S \) be the number of samples.

Then, the fitness \( F_i \) for individual \( i \) in the population can be
calculated as the sum of squared differences between the gene expression
levels of individual \( i \) and the corresponding gene expression levels
in the genomic data, across all genes and samples:

\[ F_i = \sum_{j=1}^{G} \sum_{k=1}^{S} (g_{jk} - p_{ij})^2 \]

This fitness calculation aims to minimize the overall dissimilarity between
the gene expression profiles of individuals in the population and
the genomic data. Individuals with lower fitness scores are considered to have
gene expression profiles that are more similar to the genomic data and
are therefore more likely to be selected for further optimization
in the genetic algorithm.
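
The sum of squared differences can be sketched in standalone C++ as follows. This is a minimal sketch, assuming the genomic data is a genes-by-samples matrix and each individual holds one value per gene; `Matrix` and `fitness_scores` are hypothetical names, and the code is not the source of `evaluate_fitness_cpp`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// genomic_data: G rows (genes) x S columns (samples)
// population:   N rows (individuals) x G columns (genes)
// Returns one score per individual: the sum over genes and samples of the
// squared difference between the data value and the individual's value.
std::vector<double> fitness_scores(const Matrix& genomic_data,
                                   const Matrix& population) {
    std::vector<double> fitness(population.size(), 0.0);
    for (std::size_t i = 0; i < population.size(); ++i) {
        assert(population[i].size() == genomic_data.size());
        for (std::size_t j = 0; j < genomic_data.size(); ++j) {
            for (double g : genomic_data[j]) {
                double diff = g - population[i][j];
                fitness[i] += diff * diff;
            }
        }
    }
    return fitness;
}
```

For a two-gene, two-sample data matrix `{{1, 2}, {3, 4}}`, the individual `{1, 3}` scores 2 and the individual `{0, 0}` scores 30, so the first would be treated as the closer match.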

```{r}
# Plot fitness change over generations
plot_fitness_history(fitness_history)
```



<details>
