# Demonstrating `mobile-env:smart-city`

`mobile-env` is a simple, open platform for training, testing, and evaluating control algorithms in a decentralized metaverse environment.

* `mobile-env:smart-city` is written in pure Python
* It allows simulating various scenarios with moving users in a cellular network with a single base station and multiple stationary sensors
* `mobile-env:smart-city` implements the standard [Gymnasium](https://gymnasium.farama.org/) (previously [OpenAI Gym](https://gym.openai.com/)) interface such that it can be used with all common frameworks for reinforcement learning
* `mobile-env:smart-city` is not restricted to reinforcement learning approaches but can also be used with conventional control approaches or dummy benchmark algorithms
* It can be configured easily (e.g., adjusting number and movement of users, properties of cells, etc.)
* It is also easy to extend `mobile-env:smart-city`, e.g., by implementing different observations, actions, or rewards

As such, `mobile-env:smart-city` is a simple platform for testing RL algorithms in a decentralized metaverse environment.


**Demonstration Steps:**

This demonstration consists of the following steps:

1. Installation and usage of `mobile-env` with dummy actions
2. Configuration of `mobile-env` and adjustment of the observation space (optional)
3. Training a single-agent reinforcement learning approach with [`stable-baselines3`](https://github.com/DLR-RM/stable-baselines3)
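
The dummy-action usage in step 1 follows the standard Gymnasium loop: `reset()` returns `(observation, info)` and `step()` returns `(observation, reward, terminated, truncated, info)`. The following is a minimal sketch of that loop only. `ToyEnv`, its `sample_action` method, and its reward rule are hypothetical stand-ins so the snippet runs without `mobile_env` installed; they do not reflect the smart-city scenario's observations, actions, or rewards.

```python
import random


class ToyEnv:
    """Hypothetical stand-in that mimics the Gymnasium reset/step signatures."""

    def __init__(self, max_steps=100, num_actions=3):
        self.max_steps = max_steps
        self.num_actions = num_actions
        self.t = 0

    def sample_action(self):
        # Plays the role of env.action_space.sample() in Gymnasium.
        return random.randrange(self.num_actions)

    def reset(self, seed=None):
        if seed is not None:
            random.seed(seed)
        self.t = 0
        return [0.0], {}  # (observation, info)

    def step(self, action):
        self.t += 1
        obs = [self.t / self.max_steps]
        reward = 1.0 if action == 0 else 0.0  # placeholder, not mobile-env's reward
        terminated = False                     # no natural end state in this toy
        truncated = self.t >= self.max_steps   # episode is cut off by the time limit
        return obs, reward, terminated, truncated, {}


def run_dummy_episode(env, seed=0):
    """Run one episode with random (dummy) actions; return (steps, total reward)."""
    obs, info = env.reset(seed=seed)
    total_reward, steps, done = 0.0, 0, False
    while not done:
        action = env.sample_action()
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        steps += 1
        done = terminated or truncated
    return steps, total_reward


steps, total_reward = run_dummy_episode(ToyEnv())
print(steps, total_reward)
```

With a real environment, `ToyEnv()` would be replaced by the object returned from `gymnasium.make(...)` and `env.sample_action()` by `env.action_space.sample()`.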

In [1]:
# First, install stable baselines; only SB3 v2.0.0+ supports Gymnasium
%pip install stable-baselines3==2.0.0 tensorboard

Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


In [2]:
# Importing necessary libraries
import gymnasium
import mobile_env
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.ppo import MlpPolicy
from stable_baselines3.common.env_checker import check_env

# predefined smart-city scenario
from mobile_env.scenarios.smart_city import MComSmartCity

# easy access to the default configuration
MComSmartCity.default_config()

{'width': 200,
 'height': 200,
 'EP_MAX_TIME': 100,
 'seed': 666,
 'reset_rng_episode': False,
 'arrival': mobile_env.core.arrival.NoDeparture,
 'channel': mobile_env.core.channels.OkumuraHata,
 'scheduler': mobile_env.core.schedules.ResourceFair,
 'movement': mobile_env.core.movement.RandomWaypointMovement,
 'utility': mobile_env.core.utilities.BoundedLogUtility,
 'handler': mobile_env.handlers.smart_city_handler.MComSmartCityHandler,
 'bs': {'bw': 100000000.0,
  'freq': 2500,
  'tx': 40,
  'height': 50,
  'computational_power': 100},
 'ue': {'velocity': 1.5, 'snr_tr': 2e-08, 'noise': 1e-09, 'height': 1.5},
 'sensor': {'height': 1.5, 'snr_tr': 2e-08, 'noise': 1e-09},
 'ue_job': {'job_generation_probability': 0.7,
  'communication_job_lambda_value': 10.0,
  'computation_job_lambda_value': 10.0},
 'sensor_job': {'communication_job_lambda_value': 5.0,
  'computation_job_lambda_value': 5.0},
 'e2e_delay_threshold': 3.0,
 'reward_calculation': {'ue_penalty': -5.0,
  'discount_factor': 0.95
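
A common way to customize such defaults is to pass a partial `config` dictionary. How `MComSmartCity` combines overrides with its defaults is not shown here, so the sketch below only contrasts two plausible behaviours: a shallow merge, where an overridden nested entry such as `'ue'` replaces the whole default sub-dictionary, and a nested merge, where only the given keys change. The helper names `shallow_merge` and `nested_merge` are hypothetical; the default values are copied from the output above.

```python
# Subset of the defaults printed above.
defaults = {
    "width": 200,
    "height": 200,
    "EP_MAX_TIME": 100,
    "seed": 666,
    "ue": {"velocity": 1.5, "snr_tr": 2e-08, "noise": 1e-09, "height": 1.5},
}


def shallow_merge(defaults, overrides):
    """Top-level overlay: a nested dict in overrides replaces the default one."""
    return {**defaults, **overrides}


def nested_merge(defaults, overrides):
    """Recursive overlay: nested dicts are merged key by key."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = nested_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


overrides = {"seed": 42, "ue": {"velocity": 3.0}}
shallow = shallow_merge(defaults, overrides)
nested = nested_merge(defaults, overrides)

print(sorted(shallow["ue"]))  # ['velocity']
print(sorted(nested["ue"]))   # ['height', 'noise', 'snr_tr', 'velocity']
```

If the environment merges shallowly, an override for a nested entry should therefore list every key of that entry, not just the changed one.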

In [3]:
from gymnasium.envs.registration import register

# Register the new environment
register(
    id='mobile-smart_city-smart_city_handler-rl-v0',
    entry_point='mobile_env.scenarios.smart_city:MComSmartCity',  # Adjust this if the entry point is different
    kwargs={'config': {}, 'render_mode': None}
)
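
The `entry_point` string uses the `"module.path:ClassName"` form: the part before the colon is imported as a module and the part after it is looked up as an attribute, which is then called with the registered `kwargs`. The snippet below is a simplified sketch of that resolution, not Gymnasium's actual code. `load_entry_point` is a hypothetical helper, and `collections:OrderedDict` is used as the target only so that the example runs without `mobile_env` installed.

```python
import importlib


def load_entry_point(entry_point):
    """Resolve a 'module.path:AttrName' string to the object it names."""
    module_name, attr_name = entry_point.split(":")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


# Stand-in target from the standard library; the notebook's real target is
# 'mobile_env.scenarios.smart_city:MComSmartCity'.
cls = load_entry_point("collections:OrderedDict")

# The registered kwargs are forwarded to the resolved class as keyword arguments.
instance = cls(config={}, render_mode=None)
print(cls.__name__, list(instance))  # OrderedDict ['config', 'render_mode']
```

If the class lives in a different module, only the part before the colon needs to change.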

In [4]:
# Step 3: Train a single-agent reinforcement learning agent

import gymnasium
import numpy as np
from gymnasium.wrappers import TimeLimit
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import EvalCallback, CheckpointCallback, BaseCallback
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.logger import configure

# Custom TensorBoard Callback
class TensorboardCallback(BaseCallback):
    def __init__(self, verbose=1):
        super().__init__(verbose)
    
    def _on_step(self) -> bool:
        # Log reward
        reward = self.locals['rewards'][-1] if 'rewards' in self.locals else 0
        self.logger.record('custom/reward', reward)
        
        # Log mean action
        actions = self.locals['actions'] if 'actions' in self.locals else np.array([])
        mean_action = np.mean(actions) if actions.size > 0 else 0
        self.logger.record('custom/mean_action', mean_action)
        
        # Log policy loss if available. During PPO rollouts `policy_loss` is not in
        # self.locals, so this records 0; SB3 logs the loss as train/policy_gradient_loss.
        policy_loss = self.locals.get('policy_loss', 0)
        self.logger.record('custom/policy_loss', policy_loss)
        
        return True

# Wrap the environment with a TimeLimit wrapper to enforce 100 timesteps per episode (the default below)
def wrap_environment(env_name, max_episode_steps=100):
    env = gymnasium.make(env_name)
    env = TimeLimit(env, max_episode_steps=max_episode_steps)
    env = Monitor(env)
    return env

# Train RL Model
def train_rl_model(env_name, eval_env_name):
    """Train a PPO RL model with callbacks and logging."""
    # Wrap training and evaluation environments
    env = wrap_environment(env_name, max_episode_steps=100)
    eval_env = wrap_environment(eval_env_name, max_episode_steps=100)

    # Logger setup
    log_dir = "results_sb"
    new_logger = configure(log_dir, ["tensorboard"])
    
    # Define model
    model = PPO("MlpPolicy", env, tensorboard_log=log_dir, verbose=1)
    model.set_logger(new_logger)
    
    # Define callbacks
    eval_callback = EvalCallback(eval_env, best_model_save_path='./logs/best_model',
                                 log_path='./logs/results', eval_freq=500)
    checkpoint_callback = CheckpointCallback(save_freq=1000, save_path='./logs/checkpoints/',
                                             name_prefix='ppo_smartcity')
    tensorboard_callback = TensorboardCallback()
    
    # Train model
    print("Starting training...")
    model.learn(total_timesteps=3000, callback=[eval_callback, checkpoint_callback, tensorboard_callback])
    print("Training finished!")

    # Save the trained model
    model.save("ppo_smartcity_model")
    return model

# To visualize the logs, run `tensorboard --logdir results_sb` in your terminal

In [5]:
# Train model
trained_model = train_rl_model("mobile-smart_city-smart_city_handler-rl-v0", "mobile-smart_city-smart_city_handler-rl-v0")

Using cpu device
Wrapping the env in a DummyVecEnv.
Starting training...
Eval num_timesteps=500, episode_reward=0.00 +/- 0.00
Episode length: 100.00 +/- 0.00
New best mean reward!
Eval num_timesteps=1000, episode_reward=0.00 +/- 0.00
Episode length: 100.00 +/- 0.00
Eval num_timesteps=1500, episode_reward=0.00 +/- 0.00
Episode length: 100.00 +/- 0.00
Eval num_timesteps=2000, episode_reward=0.00 +/- 0.00
Episode length: 100.00 +/- 0.00
Eval num_timesteps=2500, episode_reward=0.00 +/- 0.00
Episode length: 100.00 +/- 0.00
Eval num_timesteps=3000, episode_reward=0.00 +/- 0.00
Episode length: 100.00 +/- 0.00
Eval num_timesteps=3500, episode_reward=0.00 +/- 0.00
Episode length: 100.00 +/- 0.00
Eval num_timesteps=4000, episode_reward=0.00 +/- 0.00
Episode length: 100.00 +/- 0.00
Training finished!
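
The log shows evaluations up to 4000 timesteps although `total_timesteps=3000` was requested. This is consistent with how SB3's PPO collects experience: it gathers full rollouts of `n_steps` per environment (2048 by default) and keeps going until the step counter reaches the requested total, so the run ends at the next multiple of the rollout size. The callbacks fire whenever their call counter is a multiple of `eval_freq` or `save_freq`. The arithmetic can be sketched as follows; the helper names are hypothetical, and a single environment with default PPO settings is assumed.

```python
import math


def ppo_actual_timesteps(total_timesteps, n_steps=2048, n_envs=1):
    """Timesteps actually run: the requested total rounded up to whole rollouts."""
    rollout_size = n_steps * n_envs
    return math.ceil(total_timesteps / rollout_size) * rollout_size


def trigger_points(freq, actual_timesteps, n_envs=1):
    """Timesteps at which a callback with the given call frequency fires."""
    n_calls = actual_timesteps // n_envs  # one callback call per vectorized step
    return [calls * n_envs for calls in range(freq, n_calls + 1, freq)]


actual = ppo_actual_timesteps(3000)
print(actual)                        # 4096
print(trigger_points(500, actual))   # [500, 1000, 1500, 2000, 2500, 3000, 3500, 4000]
print(trigger_points(1000, actual))  # [1000, 2000, 3000, 4000]
```

The eight evaluation lines in the output above match the first list; the second list gives the timesteps at which checkpoints would be written under the same assumptions.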


In [10]:
from stable_baselines3 import PPO
from stable_baselines3.ppo import MlpPolicy
import gymnasium as gym

env = gym.make("mobile-smart_city-smart_city_handler-v0", render_mode="rgb_array")

# Train a PPO agent on the environment; this takes a while
model = PPO(MlpPolicy, env, tensorboard_log='results_sb', verbose=1)
model.learn(total_timesteps=30000)

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Logging to results_sb/PPO_1
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 100      |
|    ep_rew_mean     | -627     |
| time/              |          |
|    fps             | 102      |
|    iterations      | 1        |
|    time_elapsed    | 19       |
|    total_timesteps | 2048     |
---------------------------------
--------------------------------------------
| rollout/                |                |
|    ep_len_mean          | 100            |
|    ep_rew_mean          | -592           |
| time/                   |                |
|    fps                  | 102            |
|    iterations           | 2              |
|    time_elapsed         | 40             |
|    total_timesteps      | 4096           |
| train/                  |                |
|    approx_kl            | 0.001124599    |
|    clip_fraction        | 9.77e-05       

<stable_baselines3.ppo.ppo.PPO at 0x14c286070>