updated the vignette

danymukesha · May 21, 2024 · 3ced2bb
1 parent 864ba14
commit 3ced2bb
Showing 4 changed files with 37 additions and 201 deletions.
diff --git a/src/crossover.cpp b/src/crossover.cpp
@@ -27,8 +27,8 @@ NumericMatrix crossover_cpp(const NumericMatrix& selected_parents, int offspring
 
   // Perform crossover between selected parents
   for (int i = 0; i < offspring_size; ++i) {
-    int parent1_index = R::unif_rand() / num_parents;
-    int parent2_index = R::unif_rand() / num_parents;
+    int parent1_index = int(R::unif_rand() * num_parents);
+    int parent2_index = int(R::unif_rand() * num_parents);
     for (int j = 0; j < num_genes; ++j) {
       offspring(i, j) = (selected_parents(parent1_index, j) + selected_parents(parent2_index, j)) / 2.0;
     }

diff --git a/src/initialize_population.cpp b/src/initialize_population.cpp
@@ -27,7 +27,7 @@ using namespace Rcpp;
    // Generate random population using genomic data
    for (int i = 0; i < population_size; ++i) {
      for (int j = 0; j < num_genes; ++j) {
-       int sample_index = R::unif_rand() / num_samples;
+       int sample_index = int(R::unif_rand() * num_samples);
        population(i, j) = genomic_data(j, sample_index);
      }
    }

diff --git a/src/replacement.cpp b/src/replacement.cpp
@@ -1,4 +1,5 @@
 #include <Rcpp.h>
+#include <cmath>
 using namespace Rcpp;
 
 //' Function to replace non-selected individuals in the population
@@ -31,7 +32,7 @@ NumericMatrix replacement_cpp(const NumericMatrix& population, const NumericMatr
 
   // Replace non-selected individuals in the population with offspring
   for (int i = 0; i < num_to_replace; ++i) {
-    int index_to_replace = R::unif_rand() / population_size;
+    int index_to_replace = int(R::unif_rand() * population_size);
     for (int j = 0; j < num_genes; ++j) {
       updated_population(index_to_replace, j) = offspring(i, j);
     }
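
Note on the random index computations in the three C++ files: `R::unif_rand()` returns a draw strictly between 0 and 1, so truncating it to an integer before any scaling (or dividing it by the range size) always produces index 0. A uniform index in `[0, n)` needs the draw to be multiplied by `n` first and truncated afterwards. The following is a minimal standalone sketch of that idea, not the package's code: `uniform_index` and `count_indices` are hypothetical helper names, and `std::mt19937` with `std::uniform_real_distribution` is only a stand-in for `R::unif_rand()` so the example compiles without R.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Map a uniform draw u in [0, 1) to an index in [0, n).
// Scaling happens BEFORE truncation; int(u) % n would always give 0.
inline int uniform_index(double u, int n) {
    assert(n > 0);
    int idx = static_cast<int>(u * n);
    return idx < n ? idx : n - 1;  // guard in case u * n rounds up to n
}

// Hypothetical stand-in for repeated R::unif_rand() calls (assumption:
// a seeded standard-library generator is used instead of R's RNG).
// Returns how often each index in [0, n) was drawn.
inline std::vector<int> count_indices(int n, int draws, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    std::vector<int> counts(n, 0);
    for (int d = 0; d < draws; ++d) {
        ++counts[uniform_index(unif(gen), n)];
    }
    return counts;
}
```

With this mapping every parent, sample, or population slot can be reached; with truncate-then-modulo only the first one ever is, which would make crossover average a parent with itself.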

diff --git a/vignettes/Introduction.Rmd b/vignettes/Introduction.Rmd
@@ -11,7 +11,7 @@ abstract:
     optimal or near-optimal solutions.
 
     In the field of genomics, where data sets are often large, complex, 
-    and high-dimensional, genetic algorithms offer a promising approach 
+    and high-dimensional, genetic algorithms offer a good approach 
     for addressing optimization challenges such as feature selection, 
     parameter tuning, and model optimization. By harnessing the power 
     of evolutionary principles, genetic algorithms can effectively explore 
@@ -22,7 +22,7 @@ abstract:
     The BioGA package extends the capabilities of genetic algorithms 
     to the realm of genomic data analysis, providing a suite of functions 
     optimized for handling high throughput genomic data. Implemented in C++ 
-      for enhanced performance, BioGA offers efficient algorithms for tasks 
+    for enhanced performance, BioGA offers efficient algorithms for tasks 
     such as feature selection, classification, clustering, and more. 
     By integrating seamlessly with the Bioconductor ecosystem, 
     BioGA empowers researchers and analysts to leverage the power 
@@ -71,10 +71,10 @@ We showcase its interoperability with Bioconductor classes, demonstrating
 how genetic algorithm optimization can be seamlessly integrated into 
 existing genomics pipelines for improved analysis and interpretation.
 
-The BioGA package provides a comprehensive set of functions for 
-genetic algorithm optimization tailored for analyzing high throughput 
-genomic data. This vignette demonstrates the usage of BioGA in the context 
-of selecting the best combination of genes for predicting a certain trait, 
+The BioGA package provides a set of functions for genetic algorithm 
+optimization tailored for analyzing high throughput genomic data. 
+This vignette demonstrates the usage of BioGA in the context of selecting
+the best combination of genes for predicting a certain trait, 
 such as disease susceptibility.
 
 ## Overview
@@ -108,7 +108,28 @@ each column represents a sample. The values in the matrix represent
 some measurement of gene expression, such as mRNA levels or protein abundance,
 in each sample.
 
-## Example Scenario
+For instance, the value 0.1 in Sample 1 for Gene1 indicates the expression 
+level of Gene1 in Sample 1. Similarly, the value 2.2 in Sample 2 for Gene3 
+indicates the expression level of Gene3 in Sample 2.
+
+Genomic data can be used in various analyses, including genetic association 
+studies, gene expression analysis, and comparative genomics. In the context 
+of the `evaluate_fitness_cpp` function, genomic data is used to calculate 
+fitness scores for individuals in a population, typically in the context 
+of genetic algorithm optimization.
+
+The population represents a set of candidate combinations of genes that 
+could be predictive of the trait.
+Each individual in the population is represented by a binary vector indicating
+the presence or absence of each gene.
+For example, an individual in the population might be represented as 
+[1, 0, 1],
+indicating the presence of Gene1 and Gene3 but the absence of Gene2.
+The population undergoes genetic algorithm operations such as selection, 
+crossover, mutation, and replacement to evolve towards individuals with higher
+predictive power for the trait.
+
+## Example Scenario
 
 Consider an example scenario of using genetic algorithm optimization to select
 the best combination of genes for predicting a certain trait, such as disease
@@ -154,11 +175,10 @@ such as RNAseq count matrices or microarray data.
 head(genomic_data)
 ```
 
-
 ## Initialization
 
 ```{r}
-# Initialize population
+# Initialize population (set the number of candidates via `population_size`)
 population <- BioGA::initialize_population_cpp(genomic_data,
     population_size = 5
 )
@@ -264,27 +284,15 @@ gene expression profiles that are more similar to the genomic data and
 are therefore more likely to be selected for further optimization 
 in the genetic algorithm.
 
-
 ```{r}
 # Plot fitness change over generations
 BioGA::plot_fitness_history(fitness_history)
 ```
 
-
-
-This vignette demonstrates how genetic algorithm optimization can be applied 
-to select the best combination of genes for predicting a certain trait using 
-the BioGA package. It showcases the integration of genetic algorithms 
-with genomic data analysis and highlights the potential of genetic algorithms
+This showcases the integration of genetic algorithms with genomic 
+data analysis and highlights the potential of genetic algorithms
 for feature selection in genomics.
 
-BioGA is a computational tool designed to analyze and optimize high 
-throughputgenomic data using genetic algorithms (GAs). Genetic algorithms
-are a type of optimization algorithm inspired by the process of natural 
-selection and genetics. They operate by iteratively evolving a population
-of candidate 
-solutions towards better solutions.
-
 Here's how BioGA could work in the context of high throughput genomic data 
 analysis:
 
@@ -326,186 +334,13 @@ the final population to identify the best solution(s) found. This could
 involve further validation or interpretation of the results in the context
 of the original problem.
 
-Applications of BioGA in genomic data analysis could include genome-wide
+Other applications of BioGA in genomic data analysis could include genome-wide
 association studies (GWAS), gene expression analysis, pathway analysis,
 and predictive modeling for personalized medicine, among others.
 By leveraging genetic algorithms, BioGA offers a powerful approach
 to exploring complex genomic datasets and identifying meaningful patterns
 and associations.
 
-The BioGA package provides a comprehensive set of functions for 
-genetic algorithm optimization tailored for analyzing high throughput 
-genomic data. This vignette demonstrates the usage of BioGA in the context
-of selecting the best combination of genes for predicting a certain trait,
-such as disease susceptibility.
-
-Let's consider an example scenario of using genetic algorithm optimization 
-to select the best combination of genes for predicting a certain trait, 
-such as disease susceptibility.
-
-```{r}
-# Load the BioGA package
-library(BioGA)
-
-# Define parameters for genetic algorithm
-population_size <- 100
-generations <- 6
-mutation_rate <- 0.1
-
-# Generate example genomic data
-genomic_data <- matrix(rnorm(100), nrow = 10, ncol = 10)
-```
-
-
-# Overview
-
-Genomic data refers to the genetic information stored in an organism's DNA. 
-It includes the sequence of nucleotides (adenine, thymine, cytosine, 
-and guanine) that make up the DNA molecules. Genomic data can provide 
-valuable insights into various biological processes, such as gene expression, 
-genetic variation, and evolutionary relationships.
-
-Genomic data in this context could consist of gene expression profiles 
-measured across different individuals (e.g., patients).
-
-- Each row in the genomic_data matrix represents a gene, and each column 
-represents a patient sample.
-
-- The values in the matrix represent the expression levels of each gene 
-in each patient sample.
-
-
-Here's an example of genomic data:
-
-```
-      Sample 1   Sample 2   Sample 3   Sample 4
-Gene1    0.1        0.2        0.3        0.4
-Gene2    1.2        1.3        1.4        1.5
-Gene3    2.3        2.2        2.1        2.0
-```
-
-In this example, each row represents a gene (or genomic feature), and 
-each column represents a sample. The values in the matrix represent 
-some measurement of gene expression, such as mRNA levels or protein abundance,
-in each sample. 
-
-For instance, the value 0.1 in Sample 1 for Gene1 indicates the expression 
-level of Gene1 in Sample 1. Similarly, the value 2.2 in Sample 2 for Gene3 
-indicates the expression level of Gene3 in Sample 2.
-
-Genomic data can be used in various analyses, including genetic association 
-studies, gene expression analysis, and comparative genomics. In the context 
-of the `evaluate_fitness_cpp` function, genomic data is used to calculate 
-fitness scores for individuals in a population, typically in the context 
-of genetic algorithm optimization.
-
-```{r}
-# Initialize population
-population <- initialize_population_cpp(genomic_data,
-    population_size = 5
-)
-```
-
-The population represents a set of candidate combinations of genes that 
-could be predictive of the trait.
-Each individual in the population is represented by a binary vector indicating
-the presence or absence of each gene.
-For example, an individual in the population might be represented as 
-[1, 0, 1],
-indicating the presence of Gene1 and Gene3 but the absence of Gene2.
-The population undergoes genetic algorithm operations such as selection, 
-crossover, mutation, and replacement to evolve towards individuals with higher
-predictive power for the trait.
-
-```{r}
-# Initialize fitness history
-fitness_history <- list()
-
-# Initialize time progress
-start_time <- Sys.time()
-
-# Run genetic algorithm optimization
-generation <- 0
-while (TRUE) {
-    generation <- generation + 1
-
-    # Evaluate fitness
-    fitness <- evaluate_fitness_cpp(genomic_data, population)
-    fitness_history[[generation]] <- fitness
-
-    # Check termination condition
-    if (generation == generations) { # defined number of generations
-        break
-    }
-
-    # Selection
-    selected_parents <- selection_cpp(population, fitness, num_parents = 2)
-
-    # Crossover and Mutation
-    offspring <- crossover_cpp(selected_parents, offspring_size = 2)
-    # (no mutation in this example)
-    mutated_offspring <- mutation_cpp(offspring, mutation_rate = 0)
-
-    # Replacement
-    population <- replacement_cpp(population, mutated_offspring,
-        num_to_replace = 1
-    )
-
-    # Calculate time progress
-    elapsed_time <- difftime(Sys.time(), start_time, units = "secs")
-
-    # Print time progress
-    cat(
-        "\rGeneration:", generation, "- Elapsed Time:",
-        format(elapsed_time, units = "secs"), "     "
-    )
-}
-```
-
-
-The fitness calculation described in the provided code calculates a measure 
-of dissimilarity between the gene expression profiles of individuals 
-in the population and the genomic data. This measure of dissimilarity, 
-or "fitness", quantifies how well the gene expression profile of an individual
-matches the genomic data.
-
-Mathematically, the fitness calculation can be represented as follows:
-
-Let:
-
-- \( g_{ijk} \) be the gene expression level of gene \( j \) 
-in individual \( i \) and sample \( k \) from the genomic data.
-
-- \( p_{ij} \) be the gene expression level of gene \( j \) 
-in individual \( i \) from the population.
-
-- \( N \) be the number of individuals in the population.
-
-- \( G \) be the number of genes.
-
-- \( S \) be the number of samples.
-
-Then, the fitness \( F_i \) for individual \( i \) in the population can be 
-calculated as the sum of squared differences between the gene expression 
-levels
-of individual \( i \) and the corresponding gene expression levels 
-in the genomic data, across all genes and samples:
-
-\[ F_i = \sum_{j=1}^{G} \sum_{k=1}^{S} (g_{ijk} - p_{ij})^2 \]
-
-This fitness calculation aims to minimize the overall dissimilarity between 
-the gene expression profiles of individuals in the population and 
-the genomic data. Individuals with lower fitness scores are considered to have
-gene expression profiles that are more similar to the genomic data and 
-are therefore more likely to be selected for further optimization 
-in the genetic algorithm.
-
-```{r}
-# Plot fitness change over generations
-plot_fitness_history(fitness_history)
-```
-
-
 
 <details>