Bitworm is an AI simulation project aiming for "transparent intelligence" with a minimal number of neurons and weights. Using a genetic algorithm (GA), worm behaviors were evolved, and changes in survival rate and efficiency were analyzed through various experiments. Clear learning and convergence appeared with the 84-weight structure, but optimal results were not achieved due to environmental factors, GA limitations, and lack of statistical/mathematical background. The process highlighted the power of numpy, the challenge of designing an analyzer, and the importance of trial and error. It was also a chance to reflect on the growth of XAI (explainable AI) and personal development.
- Introduction
- Tech Stack
- Environment Overview
- Worm
- Data Analysis
- Failure Analysis
- Lessons Learned
- Conclusion
This project began with a conversation with Gemini about artificial intelligence and the AI black box problem. The AI black box refers to the phenomenon where the internal process of deriving results becomes incomprehensible to humans due to the complexity of models with billions of parameters.
I wondered if reducing the number of parameters would make it much easier to explain and interpret the values of weights. Knowing about the nematode C. elegans, which survives with only 302 neurons, and the Open Worm project that aims to simulate it, I thought it might be possible to create intelligence with a small number of neurons.
So, I started the Bitworm project with the goal of building transparent intelligence operating with fewer than 100 weights. After many trials and experiments, I achieved results such as increasing the survival rate from 0% to 50%, but ultimately, the experiment was not a success. This document covers the project design, reasons for failure, and lessons learned.
- Pygame: simulation visualization
- Numpy/Pandas: numerical computation and data analysis
The experiment environment is structured as follows:
- The worm is the main agent, defined by position (x, y), direction (angle), unique id, and neural weights.
- Key environmental variables such as food energy, decoy penalty, world size, and torus state are provided to the worm and affect its behavior.
- The worm moves based on neural (weight) inputs from its surroundings, and its behavior is the main subject of analysis.
- The world is a 2D plane with multiple entities.
- Main entities are food and decoys, each with properties like position, radius, brightness, smell, and energy.
- Interactions include distance calculations and collision detection.
- World settings are managed in world_config.py and used as experiment variables.
- The simulation creates the world and worms, and simulates worm behavior by generation.
- Statistics (mean/max/min fitness, food/decoy consumption, energy, etc.) are recorded for each generation.
- Results are saved in various formats (csv, json) for analysis.
Thus, worm.py handles agent behavior and properties, world.py manages environment and entity interactions, and simulation.py oversees the overall experiment and data logging. This structure enables analysis of worm intelligence and behavior.
Worm intelligence and evolution are driven by a genetic algorithm (GA). The GA, based on natural selection, repeats the process of random generation → fitness evaluation → selection → crossover → mutation to search for optimal solutions.
Reasons for choosing GA in this simulation:
- No restrictions on the form or number of neurons/weights, allowing flexible design.
- The goal of Bitworm was to create a small virtual organism, so a biologically inspired evolutionary method was preferred.
- The environment posed a problem where the optimal solution was hard to find directly, requiring global search via GA.
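The GA loop described above (random generation → fitness evaluation → selection → crossover → mutation) can be sketched as follows. This is a minimal illustration, not Bitworm's actual implementation; the function name and defaults are my own, though the tournament/elite/mutation parameters echo values used in the project.

```python
import random

def evolve(pop_size, n_genes, generations, evaluate,
           mut_rate=0.08, tour_size=10, elite_ratio=0.1):
    """Minimal GA loop: evaluate -> elite retention -> tournament
    selection -> uniform crossover -> per-gene Gaussian mutation."""
    pop = [[random.uniform(-1, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=evaluate, reverse=True)
        n_elite = max(1, int(elite_ratio * pop_size))
        next_pop = scored[:n_elite]                      # elites survive unchanged
        while len(next_pop) < pop_size:
            # tournament selection: best of a random sample becomes a parent
            p1 = max(random.sample(pop, tour_size), key=evaluate)
            p2 = max(random.sample(pop, tour_size), key=evaluate)
            # uniform crossover: each gene comes from either parent
            child = [random.choice(pair) for pair in zip(p1, p2)]
            # mutation: occasionally nudge a gene
            child = [g + random.gauss(0, 0.1) if random.random() < mut_rate else g
                     for g in child]
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=evaluate)

# Toy usage: evolve a 5-gene vector toward the zero vector
best = evolve(pop_size=20, n_genes=5, generations=30,
              evaluate=lambda g: -sum(x * x for x in g))
```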
Bitworm's fitness is a numerical indicator of worm behavior and survival performance. The evaluation method evolved over time, with key items, weights, and logic changes as follows:
- Food eaten (`foods_eaten`): score proportional to the number of foods eaten
- Decoys eaten (`decoys_eaten`): penalty or weighted score for decoys eaten
- Remaining energy (`energy`): energy at the end of the simulation
- Survival bonus (`alive_fitness_bonus`): extra score for surviving
- (Later) Discrimination bonus and ratio bonus for more nuanced evaluation
Initially, the score was simply calculated as:
```python
score = (
    self.foods_eaten * self.worm_config["food_fitness_weight"] +
    self.decoys_eaten * self.worm_config["decoy_fitness_weight"] +
    self.energy +
    (self.worm_config["alive_fitness_bonus"] if self.alive else 0)
) + 5000
```

- Food adds to the score by its weight; decoys act as a penalty.
- Remaining energy and the survival bonus are simply added.
- 5000 is added to raise the overall score.
Later, the evaluation was refined to better reflect behavior ratios and survival bonus:
```python
score = (self.foods_eaten * self.worm_config["food_fitness_weight"] +
         self.decoys_eaten * self.worm_config["decoy_fitness_weight"] +
         self.energy)
total_eaten = self.foods_eaten + self.decoys_eaten
if total_eaten > 0:
    alive_bonus = ((self.foods_eaten / total_eaten) * self.worm_config["alive_fitness_bonus"]
                   if self.decoys_eaten != 0
                   else 1.5 * self.worm_config["alive_fitness_bonus"])
else:
    alive_bonus = 0.05 * self.worm_config["alive_fitness_bonus"]
score += alive_bonus if self.alive else 0
self.fitness = max(score, 10)
```

- Food/decoy consumption and energy are always included.
- The survival bonus is scaled by the ratio of food to decoys eaten.
- Even if nothing is eaten, a small bonus is given.
- The final score is at least 10.
The latest version comprehensively reflects discrimination ability (how well food and decoy are distinguished), survival, energy, etc. (see worm.py's calculate_fitness):
```python
base_score = (self.foods_eaten * self.config["food_fitness_weight"] +
              self.decoys_eaten * self.config["decoy_fitness_weight"])
disc_bonus = 0
if self.foods_eaten >= 5 and self.decoys_eaten == 0:
    disc_bonus += 1.5 * self.config["disc_bonus"]
if self.foods_eaten > self.decoys_eaten * self.config["disc_goal_ratio"]:
    ratio = self.foods_eaten / (self.foods_eaten + self.decoys_eaten)
    disc_bonus += ratio * self.config["disc_bonus"]
base_score += disc_bonus
if self.alive:
    base_score += self.config["alive_fitness_bonus"] + self.energy
self.fitness = base_score
```

- Food/decoy consumption is always included.
- Eating 5+ foods and no decoys gives a large bonus.
- A high food-to-decoy ratio gives an additional bonus.
- Survival adds a bonus plus the remaining energy.
- Initial: Simple weighted sum + constant adjustment. Only survival considered, no discrimination.
- Intermediate: Survival bonus differentiated by food/decoy ratio. Qualitative behavior evaluation introduced.
- Final: Discrimination bonus added, bonus structure refined. The closer to the goal behavior (eating only food), the higher the score.
These changes reflect a shift from simple survival/consumption to evaluating how "smart" the worm behaves in the environment (discrimination, strategy, etc.).
The initial structure had 3 layers: {"input":5, "brain":1, "leg":2} nodes, with 18 weights (w0~w17: 17 weights and 1 memory). Five sensory neurons were directly connected to two motors. However, learning and convergence did not occur.
To solve learning and convergence issues, an h1 layer was added and the number of weights greatly increased: {"input":5, "h1":8, "brain":4, "leg":2}, totaling 84 (80 weights + 4 memory). With this structure, models appeared with survival rates rising from 0% to 50% and good convergence. However, the worms still could not distinguish food from decoy.
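The 84-parameter figure can be checked with a simple count, assuming fully connected layer-to-layer weights plus one memory value per brain node (my reading of the structure above; the earlier 18-weight layout apparently used a different wiring, so this count applies to the expanded structure only).

```python
# Sketch: count fully connected layer-to-layer weights plus memory values.
def count_params(layers, memory_nodes):
    sizes = list(layers.values())
    # sum of products of adjacent layer sizes: 5*8 + 8*4 + 4*2 = 80
    weights = sum(a * b for a, b in zip(sizes, sizes[1:]))
    return weights, weights + memory_nodes

# 80 weights plus 4 memory values = 84 parameters in total
print(count_params({"input": 5, "h1": 8, "brain": 4, "leg": 2}, memory_nodes=4))
# -> (80, 84)
```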
Originally, input was calculated as (food_light - decoy_light = input_light). If closer to decoy, input_light became negative, and it was hoped that worms with strong negative weights would be eliminated, but learning did not occur. So, the input was changed to use color channels instead of light, increasing the number of inputs and total weights to 100. Still, learning and convergence did not occur, even after increasing generations to 3000. More weights seemed to make the search space too large. Claude suggested switching to RL, but my goal was transparent intelligence with few neurons/weights, so I ended the project.
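The sign behavior of the original light input can be illustrated with a hypothetical sketch. The `1 / (1 + distance)` falloff and all names here are my assumptions; the point is only that the sensed value flips sign depending on whether food or decoy dominates at the sensor.

```python
import numpy as np

# Hypothetical sensor model: each sensor reports food brightness minus
# decoy brightness; the distance falloff is an illustrative assumption.
def input_light(sensor_pos, food_pos, decoy_pos, brightness=1.0):
    food_light = brightness / (1.0 + np.linalg.norm(sensor_pos - food_pos))
    decoy_light = brightness / (1.0 + np.linalg.norm(sensor_pos - decoy_pos))
    return food_light - decoy_light

food = np.array([1.0, 0.0])
decoy = np.array([10.0, 0.0])
print(input_light(np.array([0.0, 0.0]), food, decoy))  # near food: positive
print(input_light(np.array([9.0, 0.0]), food, decoy))  # near decoy: negative
```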
- Fitness change: Analyze trends in best_fitness, moving average, growth_rate, etc. by generation.
- Explosion detection: Automatically detect rapid growth (explosion) and record the relevant generations and growth rates. Analyze statistical changes (mean, std) before/after explosion.
- Population diversity and convergence: Aggregate key metrics by generation/worm to evaluate evolution and convergence.
- Reliability and evaluation: Calculate improvement_score, reliability_score, discrimination_score, and overall_score.
- Context analysis: Aggregate data before/after explosion to interpret evolutionary patterns.
- overall_score = improvement_score * 0.3 + reliability_score * 0.4 + discrimination_score * 0.3 (EXCELLENT >= 0.8, GOOD >= 0.6, ACCEPTABLE >= 0.4, POOR else)
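The weighting and grading above can be expressed as a small helper. The component values in the usage line are illustrative (the analyzer's actual component scores are not listed here); they show how a zero improvement score with otherwise high components can still land around 0.48.

```python
def overall_score(improvement, reliability, discrimination):
    """Weighted combination and grade thresholds, per the formula above."""
    score = improvement * 0.3 + reliability * 0.4 + discrimination * 0.3
    if score >= 0.8:
        grade = "EXCELLENT"
    elif score >= 0.6:
        grade = "GOOD"
    elif score >= 0.4:
        grade = "ACCEPTABLE"
    else:
        grade = "POOR"
    return score, grade

# Illustrative inputs: zero improvement, high reliability/discrimination
print(overall_score(0.0, 0.7, 0.67))  # roughly 0.48, "ACCEPTABLE"
```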
This was the first experiment with 84 weights, showing clear learning and convergence not seen before.
- Generations: 300
- Worms: 120
- Time per generation: 50
- World size: 1200 x 900
- Food/decoy count: 300 each
- Food/decoy ratio: 1.0
- Food energy: 100, decoy penalty: -50
- Max worm energy: 100 (initial 100)
- Max speed: 50, turn speed: 5
- Energy consumption: move 0.2, turn 0.03, base 0.5
- Fitness weights: food +400, decoy -800, survival bonus +500
- Mutation rate: 0.08
- Tournament size: 10, elite ratio: 0.1
- World torus: False
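The settings above can be collected in `world_config.py` style. Only the values come from the run described; the key names are my guesses at the config schema, not the project's actual identifiers.

```python
# Sketch of the run's settings in world_config.py style.
# Key names are assumptions; values are from the experiment above.
WORLD_CONFIG = {
    "generations": 300,
    "num_worms": 120,
    "time_per_generation": 50,
    "world_size": (1200, 900),
    "food_count": 300,
    "decoy_count": 300,
    "food_decoy_ratio": 1.0,
    "food_energy": 100,
    "decoy_penalty": -50,
    "max_energy": 100,          # initial energy also 100
    "max_speed": 50,
    "turn_speed": 5,
    "move_cost": 0.2,
    "turn_cost": 0.03,
    "base_cost": 0.5,
    "food_fitness_weight": 400,
    "decoy_fitness_weight": -800,
    "alive_fitness_bonus": 500,
    "mutation_rate": 0.08,
    "tournament_size": 10,
    "elite_ratio": 0.1,
    "torus": False,
}
```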
The simulation was divided into Initial (generations 1-75), Middle (76-225), and Last (226-300) periods, and average survival rates were analyzed.
| Period | Survival Mean | Survival Std |
|---|---|---|
| Initial | 0.003 | 0.006 |
| Middle | 0.174 | 0.172 |
| Last | 0.425 | 0.067 |
The table summarizes the average survival rate and standard deviation for each period. As the experiment progressed, the average survival rate increased significantly, and the standard deviation decreased, indicating not only higher survival but also more stable evolution.
This shows efficiency by period, with a slight upward trend (efficiency = foods_eaten / (foods_eaten + decoys_eaten)).
The standard deviation of efficiency for top individuals steadily decreased.
| Period | Mean Efficiency | Efficiency Std | Top Mean Efficiency | Top Efficiency Std |
|---|---|---|---|---|
| Initial | 0.715 | 0.355 | 0.909 | 0.127 |
| Middle | 0.730 | 0.368 | 0.926 | 0.120 |
| Last | 0.734 | 0.357 | 0.930 | 0.114 |
The table summarizes mean and std of efficiency for all and top individuals. Efficiency increased slightly, and the std for top individuals steadily decreased, indicating more consistent behavior among the best worms.
- High convergence_score and discrimination_score, but an improvement_score of 0 (comparing top fitness in the first/last 20%), resulting in an overall score of 0.48 ("ACCEPTABLE").
- Efficiency was high from the start (>0.7), with only a slight increase by the end. No significant improvement in mean/max fitness.
- This suggests the environment was easy, and the high decoy penalty and survival bonus led to safe, non-exploratory behavior.
- Overall, not a satisfying result, but the 84-weight structure showed potential in survival and convergence.
The above analysis is from the only experiment that showed clear learning among many attempts. As mentioned, the environment led to high survival but little growth. Many changes to settings and fitness algorithms were tried, but no clear success in survival, efficiency, or stability was achieved. This section analyzes three main reasons for failure.
Food and decoys are placed as follows:
- The world is divided into zones, and food/decoys are placed in each zone according to config ratios.
- Within each zone, random coordinates are generated, ensuring a minimum distance from existing food/decoys (with a max attempt limit).
- Some coordinates in each zone are assigned as food, the rest as decoys.
- This ensures even distribution and no overlap.
This is a form of stratified sampling, preventing clustering and ensuring fairness/diversity. Once eaten, food/decoys are respawned randomly.
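The zone-based placement can be sketched as rejection sampling per zone. This is a simplified illustration of the steps described above, not the project's `world.py` code; the zone grid, minimum distance, attempt limit, and the rule for splitting food vs. decoys are all placeholder choices.

```python
import random, math

def place_items(world_w, world_h, n_food, n_decoys, zones=(3, 3),
                min_dist=15.0, max_attempts=100):
    """Divide the world into zones and place points per zone, rejecting
    candidates closer than min_dist to anything already placed."""
    zx, zy = zones
    total = n_food + n_decoys
    per_zone = math.ceil(total / (zx * zy))
    placed = []
    for i in range(zx):
        for j in range(zy):
            x0, y0 = i * world_w / zx, j * world_h / zy
            x1, y1 = x0 + world_w / zx, y0 + world_h / zy
            count = 0
            while count < per_zone and len(placed) < total:
                for _ in range(max_attempts):
                    p = (random.uniform(x0, x1), random.uniform(y0, y1))
                    if all(math.dist(p, q) >= min_dist for q in placed):
                        placed.append(p)
                        count += 1
                        break
                else:
                    break  # zone saturated after max_attempts; move on
    # simplified split: first coordinates become food, the rest decoys
    return placed[:n_food], placed[n_food:]
```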
Worms, however, are placed completely randomly. Thus, a lucky worm may spawn in a food-rich area, while others may be unlucky. This means that genes of lucky worms may be passed on regardless of intelligence.
Weights range from -1 to +1. The average absolute difference between top and bottom individuals was only 0.002~0.003. This suggests:
- Weights are highly sensitive, so tiny differences have big effects.
- Both top and bottom individuals have nearly identical genes, and top individuals may just be lucky.
In most experiments (except the one analyzed), the CV (std/mean) was very high, supporting the second interpretation: environmental factors (spawn location) are the main cause of ranking, not genetic difference.
NASA's ST5 antenna is a classic GA success story. The GA evolved a set of commands to build the antenna, with each variable having a direct meaning (e.g., "bend 30°", "move 2cm").
| Item | NASA ST5 Project |
|---|---|
| Population | ~50-100 |
| Evaluations | Up to 50,000 per process |
| Mutation/Cross | Command order/angle tweaks |
| Selection | Elite + tournament mix |
| Representation | Generative (command-based) |
| Variable Role | "Bend 30°", "Move 2cm" |
| Problem Type | Static optimization |
| Search Space | Smooth, physics-based |
| Item | Bitworm Project |
|---|---|
| Population | 30-120 (single process, repeated) |
| Evaluations | Up to 300 generations |
| Mutation/Cross | Small weight mutations/crossover |
| Selection | Tournament + elite retention |
| Representation | Direct weights |
| Variable Role | "Scale this neuron's signal" |
| Problem Type | Dynamic control (real-time) |
| Search Space | Rugged, highly sensitive |
The biggest differences are in variable meaning, problem type, and search space. In ST5, each variable had a clear, structural meaning, and mutations led to meaningful changes. In Bitworm, weights are just numbers, making it much harder for GA to find meaningful combinations. Small changes can have unpredictable effects, and the search space is much more rugged, limiting GA's effectiveness.
Many values (weight range, bonuses, penalties, generations, population, etc.) were set arbitrarily, based on intuition and trial/error rather than statistical or theoretical analysis. Hyperparameter tuning was also done by feel, not systematic search. This undermined reliability, reproducibility, and the ability to find optimal combinations.
Lack of mathematical/statistical background and simulation experience was a major limitation. I was unsure which metrics to analyze (moving average, CV, confidence interval, etc.) and how to interpret them. Even with analyzer output, I struggled to draw meaningful conclusions.
I also underestimated the difficulty and complexity of Bitworm's goal. From GA selection to experiment design, I failed to recognize structural limitations or explore better approaches. There was little research into prior work or alternative methods (like RL), and most experiments relied on trial and error. This lack of background affected experiment design, data interpretation, and result reliability.
Initially, signal calculation and worm state updates were handled with Python for-loops, making 300 generations take nearly 2 hours—too slow for flexible experimentation. After vectorizing core functions with numpy (thanks to Gemini's advice), simulation speed improved dramatically, enabling more experiments. I used to think of numpy as just a math library, but its vector operations, flexible syntax, and even SIMD concepts were eye-opening.
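A toy version of that change: a per-worm Python loop versus one broadcasted numpy computation. All names and sizes here are illustrative (120 worms, 300 foods, matching the scale of the experiments), not the project's actual functions.

```python
import numpy as np

rng = np.random.default_rng(0)
worms = rng.uniform(0, 1200, size=(120, 2))   # worm positions
foods = rng.uniform(0, 1200, size=(300, 2))   # food positions

def nearest_food_loop(worms, foods):
    # per-worm Python loop, as in the original implementation style
    out = []
    for w in worms:
        out.append(min(((w[0] - f[0])**2 + (w[1] - f[1])**2) ** 0.5 for f in foods))
    return out

def nearest_food_numpy(worms, foods):
    # (120, 1, 2) - (1, 300, 2) broadcasts to a (120, 300) distance matrix
    d = np.linalg.norm(worms[:, None, :] - foods[None, :, :], axis=-1)
    return d.min(axis=1)

# Same result; the vectorized version is typically far faster at this scale
assert np.allclose(nearest_food_loop(worms, foods), nearest_food_numpy(worms, foods))
```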
The simulation described in the data analysis section showed clear learning: survival rate grew from 0% to nearly 50%, with high stability. This was never seen in the 18-weight experiments, but appeared as soon as the structure was expanded to 84 weights—a striking (and perhaps indirect) experience of the scaling law.
The hardest part was designing the analyzer: I had to decide which metrics to analyze and how to weight them, with no standards or benchmarks. Unlike previous web/browser projects, I had to define and design everything from scratch. Debugging was also different—no compiler/interpreter to point out errors, so I had to track down the causes of learning stagnation myself. After hundreds of experiments, I learned a lot about perseverance and problem-solving, even if I didn't fully succeed.
I started out confident, but as the project progressed, I realized just how difficult and complex it is to create, explain, and interpret intelligence. I didn't reach my original goal, and even if I had built a successful Bitworm model, interpreting its weights would have been even more complex and perhaps impossible. If I were an expert, I might never have attempted such a reckless experiment.
But it was precisely my ignorance that allowed me to try, and I gained new experiences in the process. As a freshman in computer science, I hope to keep learning and not let knowledge turn into fear of new challenges. Even if this was a foolish experiment, it brought new insights and learning.
The AI black box problem is a challenge we must solve, and through this experiment and my research into XAI, I realized just how difficult it is.
Still, it's impressive how much XAI has advanced in recent years. I hope XAI technology continues to grow to make software more trustworthy, and I hope to keep learning and one day contribute to the advancement of software myself.