Model "remembers" instead of learning #260

Open
jarlva opened this issue Jan 14, 2023 · 12 comments

@jarlva

jarlva commented Jan 14, 2023

Hey, after training (~200M steps) with good reward, enjoy shows poor reward numbers on unseen data. When I include the training data in the enjoy run, the reward matches training. So it seems the model "remembers" the data instead of learning from it.

What's the best way to deal with this (other than adding more data and introducing random noise)? Are there settings worth trying?

I'm training a gym-like env with the following config:

{
  "help": false,
  "algo": "APPO",
  "env": "Myrl-v0",
  "experiment": "0114-1156.2-62",
  "train_dir": "./train_dir",
  "restart_behavior": "resume",
  "device": "gpu",
  "seed": 5,
  "num_policies": 1,
  "async_rl": true,
  "serial_mode": false,
  "batched_sampling": false,
  "num_batches_to_accumulate": 2,
  "worker_num_splits": 2,
  "policy_workers_per_policy": 1,
  "max_policy_lag": 1000,
  "num_workers": 32,
  "num_envs_per_worker": 28,
  "batch_size": 1024,
  "num_batches_per_epoch": 1,
  "num_epochs": 1,
  "rollout": 32,
  "recurrence": 1,
  "shuffle_minibatches": false,
  "gamma": 0.99,
  "reward_scale": 1.0,
  "reward_clip": 1000.0,
  "value_bootstrap": false,
  "normalize_returns": true,
  "exploration_loss_coeff": 0.003,
  "value_loss_coeff": 0.5,
  "kl_loss_coeff": 0.0,
  "exploration_loss": "entropy",
  "gae_lambda": 0.95,
  "ppo_clip_ratio": 0.1,
  "ppo_clip_value": 1.0,
  "with_vtrace": false,
  "vtrace_rho": 1.0,
  "vtrace_c": 1.0,
  "optimizer": "adam",
  "adam_eps": 1e-06,
  "adam_beta1": 0.9,
  "adam_beta2": 0.999,
  "max_grad_norm": 4.0,
  "learning_rate": 0.0001,
  "lr_schedule": "constant",
  "lr_schedule_kl_threshold": 0.008,
  "obs_subtract_mean": 0.0,
  "obs_scale": 1.0,
  "normalize_input": true,
  "normalize_input_keys": null,
  "decorrelate_experience_max_seconds": 0,
  "decorrelate_envs_on_one_worker": true,
  "actor_worker_gpus": [],
  "set_workers_cpu_affinity": true,
  "force_envs_single_thread": false,
  "default_niceness": 0,
  "log_to_file": true,
  "experiment_summaries_interval": 10,
  "flush_summaries_interval": 30,
  "stats_avg": 100,
  "summaries_use_frameskip": true,
  "heartbeat_interval": 20,
  "heartbeat_reporting_interval": 180,
  "train_for_env_steps": 985000000,
  "train_for_seconds": 10000000000,
  "save_every_sec": 60,
  "keep_checkpoints": 1,
  "load_checkpoint_kind": "best",
  "save_milestones_sec": -1,
  "save_best_every_sec": 15,
  "save_best_metric": "7.ARGPB",
  "save_best_after": 20000000,
  "benchmark": false,
  "encoder_mlp_layers": [
    512,
    512
  ],
  "encoder_conv_architecture": "convnet_simple",
  "encoder_conv_mlp_layers": [
    512
  ],
  "use_rnn": false,
  "rnn_size": 512,
  "rnn_type": "gru",
  "rnn_num_layers": 1,
  "decoder_mlp_layers": [],
  "nonlinearity": "elu",
  "policy_initialization": "orthogonal",
  "policy_init_gain": 1.0,
  "actor_critic_share_weights": true,
  "adaptive_stddev": true,
  "continuous_tanh_scale": 0.0,
  "initial_stddev": 1.0,
  "use_env_info_cache": false,
  "env_gpu_actions": false,
  "env_gpu_observations": true,
  "env_frameskip": 1,
  "env_framestack": 1,
  "pixel_format": "CHW",
  "use_record_episode_statistics": false,
  "with_wandb": false,
  "wandb_user": null,
  "wandb_project": "sample_factory",
  "wandb_group": null,
  "wandb_job_type": "SF",
  "wandb_tags": [],
  "with_pbt": true,
  "pbt_mix_policies_in_one_env": true,
  "pbt_period_env_steps": 5000000,
  "pbt_start_mutation": 20000000,
  "pbt_replace_fraction": 0.3,
  "pbt_mutation_rate": 0.15,
  "pbt_replace_reward_gap": 0.1,
  "pbt_replace_reward_gap_absolute": 1e-06,
  "pbt_optimize_gamma": false,
  "pbt_target_objective": "true_objective",
  "pbt_perturb_min": 1.1,
  "pbt_perturb_max": 1.5,
  "command_line": "--train_dir=./train_dir --learning_rate=0.0001 --with_pbt=True --save_every_sec=60 --load_checkpoint_kind=best --save_best_every_sec=15 --use_rnn=False --seed=5 --num_envs_per_worker=28 --keep_checkpoints=1 --device=gpu --train_for_env_steps=985000000 --algo=APPO --experiment=0114-1156.2-62 --with_vtrace=False --experiment_summaries_interval=10 --save_best_after=20000000 --recurrence=1 --num_workers=32 --batch_size=1024 --env=Myrl-v0 --save_best_metric=7.ARGPB",
  "cli_args": {
    "algo": "APPO",
    "env": "Myrl-v0",
    "experiment": "0114-1156.2-62",
    "train_dir": "./train_dir",
    "device": "gpu",
    "seed": 5,
    "num_workers": 32,
    "num_envs_per_worker": 28,
    "batch_size": 1024,
    "recurrence": 1,
    "with_vtrace": false,
    "learning_rate": 0.0001,
    "experiment_summaries_interval": 10,
    "train_for_env_steps": 985000000,
    "save_every_sec": 60,
    "keep_checkpoints": 1,
    "load_checkpoint_kind": "best",
    "save_best_every_sec": 15,
    "save_best_metric": "7.ARGPB",
    "save_best_after": 20000000,
    "use_rnn": false,
    "with_pbt": true
  },
  "git_hash": "cf6f93c8109e48faf7bca746ce2184808f6513c1",
  "git_repo_name": "not a git repository",
  "train_script": "train_gym_env2"
}
@alex-petrenko
Owner

You're encountering a general machine learning problem called "overfitting". It is generally a challenge to make sure a model generalizes beyond the training distribution, and this is not specific to RL or Sample Factory.

Some things to look at:

  1. Look up general anti-overfitting techniques from deep learning. Dropout and a larger learning rate come to mind, although I haven't had much success with these.
  2. Domain randomization. Make sure your training distribution is as diverse as possible, so it is harder to overfit. Randomize parameters of the environment where possible. Check out some automatic domain randomization ideas from this paper https://dextreme.org/ and the papers it references.
  3. Data augmentation. Making the training distribution larger always helps. Augment training scenarios to provide more data. Augment observations (e.g. for visual observations you can crop, change colors, flip images, and use other techniques from computer vision).
  4. Noise injection can help (e.g. injecting noise into observations and actions; see the sketch after this list).
  5. Adversarial learning and self-play can help if they are applicable to your setting.
  6. Use population-based training with a performance metric (true_objective) that is a proxy for generalization performance (i.e. performance on unseen data).
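
A minimal sketch of the noise-injection idea (point 4), assuming a vector Box observation space; NoisyObsWrapper, noise_scale, and make_base_env are illustrative names, not part of Sample Factory:

import gym
import numpy as np

class NoisyObsWrapper(gym.ObservationWrapper):
    """Adds zero-mean Gaussian noise to Box observations as a train-time augmentation."""

    def __init__(self, env, noise_scale=0.05):
        super().__init__(env)
        self.noise_scale = noise_scale

    def observation(self, obs):
        noise = np.random.normal(0.0, self.noise_scale, size=obs.shape).astype(obs.dtype)
        # keep the noisy observation inside the declared observation space
        return np.clip(obs + noise, self.observation_space.low, self.observation_space.high)

# Apply it inside your env factory so only training rollouts see the noise,
# e.g. env = NoisyObsWrapper(make_base_env(), noise_scale=0.05)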

@jarlva
Author

jarlva commented Jan 16, 2023

Thanks again for your reply @alex-petrenko !

@jarlva jarlva closed this as completed Jan 16, 2023
@jarlva
Author

jarlva commented Jan 22, 2023

I tried the following, but none of it worked. I'd like to try dropout; I noticed it's possible to apply it in PyTorch, but I'm not sure how to do it in the SF2 code (maybe add an optional parameter?).

Update: I also tried editing sample-factory/tests/test_precheck.py (lines 15, 18):
[screenshot: edited lines in sample-factory/tests/test_precheck.py]

. added noise to observations, up to +/-5%
. tried PBT
. simplified the model to 256,256
. changed the LR to 0.00001 and 0.001, from the default 0.0001
. increased the data from 30k to 100k rows
. augmenting the data is not possible in my case

@jarlva jarlva reopened this Jan 22, 2023
@jarlva
Author

jarlva commented Jan 25, 2023

Hi @alex-petrenko, would it be possible to reply to my latest question from 2 days ago, above?

@alex-petrenko
Owner

I think your best option is to implement a custom model (an encoder alone should be sufficient, but you can override the entire actor-critic module). See the documentation here: https://www.samplefactory.dev/03-customization/custom-models/

Just add dropout as a layer and, fingers crossed, it should work. You should be careful about eval() and train() modes for your PyTorch module, but I think you're already covered here.
See here for an example: https://discuss.pytorch.org/t/if-my-model-has-dropout-do-i-have-to-alternate-between-model-eval-and-model-train-during-training/83007/2
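
Not the actual code from this thread, just a rough sketch of what a custom MLP encoder with dropout could look like, following the custom-models documentation linked above. The Encoder base class and the register_encoder_factory hook are the ones described in that doc (exact signatures may differ across versions); DropoutMlpEncoder, the 512-unit layers, and the dropout probability of 0.2 are arbitrary choices for illustration:

import torch.nn as nn

from sample_factory.algo.utils.context import global_model_factory
from sample_factory.model.encoder import Encoder

class DropoutMlpEncoder(Encoder):
    def __init__(self, cfg, obs_space):
        super().__init__(cfg)
        obs_dim = obs_space["obs"].shape[0]
        # Linear -> ELU -> Dropout for each hidden layer
        self.mlp = nn.Sequential(
            nn.Linear(obs_dim, 512), nn.ELU(), nn.Dropout(p=0.2),
            nn.Linear(512, 512), nn.ELU(), nn.Dropout(p=0.2),
        )

    def forward(self, obs_dict):
        return self.mlp(obs_dict["obs"])

    def get_out_size(self):
        return 512

def make_dropout_encoder(cfg, obs_space):
    return DropoutMlpEncoder(cfg, obs_space)

# Register the factory before starting training, as described in the customization docs:
global_model_factory().register_encoder_factory(make_dropout_encoder)

If the framework toggles train()/eval() as suggested above, nn.Dropout is automatically disabled at evaluation time.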

@alex-petrenko
Owner

alex-petrenko commented Jan 25, 2023

Hmm, I guess your confusion might come from the fact that Dropout can't just be added as a model layer; you have to actually call it explicitly in forward().

If I were you, I would simply modify the forward() method of the actor_critic class to call dropout when needed.

Sorry, I don't think I can properly help you without knowing the context and details of your problem. Overfitting is one of the hardest problems in all of ML, and there's no single magical recipe for fixing it.
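
For reference, a self-contained illustration (not the actual actor_critic code) of what calling dropout explicitly inside forward() looks like; F.dropout honors self.training, which is what the eval()/train() caveat above is about:

import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderWithFunctionalDropout(nn.Module):
    """Illustration only: dropout applied as a function call inside forward()."""

    def __init__(self, in_dim=4, hidden=512, p=0.2):
        super().__init__()
        self.fc = nn.Linear(in_dim, hidden)
        self.p = p

    def forward(self, x):
        x = F.elu(self.fc(x))
        # self.training is toggled by .train() / .eval(), so this is a no-op at eval time
        return F.dropout(x, p=self.p, training=self.training)

m = EncoderWithFunctionalDropout()
m.eval()  # e.g. during enjoy/evaluation, dropout is disabled
out = m(torch.zeros(1, 4))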

@jarlva
Author

jarlva commented Jan 29, 2023

Hi @alex-petrenko, sorry, I'm not an expert at this. I'm using a customized cartpole-like gym env.
Do you mean editing sample_factory/model/actor_critic.py at the locations shown below, lines 154 and 184?

1/30 update: I also updated sample_factory/model/encoder.py at lines 216 and 221.

Also, would it make sense to add dropout as a config switch?

[screenshots: proposed edit locations in actor_critic.py and encoder.py]

@alex-petrenko
Owner

The first thing I would try is adding dropout after each layer in the encoder.
If you're using a cartpole-like environment, you would need to modify the MLP encoder, which is defined here:

class MlpEncoder(Encoder):

The convolutional encoder probably has nothing to do with your task if your observations are just vectors of numbers; it is meant for image observations.
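
A hedged sketch of the kind of change being suggested here: interleave nn.Dropout with the Linear/ELU pairs when the MLP is built from the encoder_mlp_layers config. This mirrors the structure of the MLP encoder but is not a copy of the actual MlpEncoder code; mlp_with_dropout and dropout_p are illustrative names:

import torch.nn as nn

def mlp_with_dropout(input_size, layer_sizes, dropout_p=0.2):
    # Builds a Linear -> ELU -> Dropout block for each hidden layer size,
    # e.g. layer_sizes == [512, 512] as in the encoder_mlp_layers config above.
    layers = []
    for size in layer_sizes:
        layers.extend([nn.Linear(input_size, size), nn.ELU(), nn.Dropout(p=dropout_p)])
        input_size = size
    return nn.Sequential(*layers)

encoder_mlp = mlp_with_dropout(input_size=4, layer_sizes=[512, 512])

This produces the same Linear/ELU/Dropout stack shown in the next comment.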

@jarlva
Author

jarlva commented Jan 31, 2023

I added it in the model_utils.py file at line 52. The layers are now:

RecursiveScriptModule(
original_name=Sequential
(0): RecursiveScriptModule(original_name=Linear)
(1): RecursiveScriptModule(original_name=ELU)
(2): RecursiveScriptModule(original_name=Dropout)
(3): RecursiveScriptModule(original_name=Linear)
(4): RecursiveScriptModule(original_name=ELU)
(5): RecursiveScriptModule(original_name=Dropout)
)

But, alas, that's still not solving overfitting...

[screenshot]

@alex-petrenko
Owner

Dropout is one way to combat overfitting, but it is not a panacea.

I'm sorry I can't help figure out your exact issue. As I said previously, overfitting is a general machine learning phenomenon, and most likely your problem has nothing to do with Sample Factory but rather with the overall problem formulation and approach.

@jarlva
Author

jarlva commented Feb 1, 2023

Hi @alex-petrenko, I understand. I appreciate the guidance and advice!
Would you be open to advising for pay?

@alex-petrenko
Owner

@jarlva not sure if this is realistic right now. I'm starting a full-time job very soon, which will keep me busy for the foreseeable future.

You said you're able to fit your training data, right?
That is, the trained policy does well on the training data when you evaluate it,
but completely fails on out-of-distribution data?

If I could get some idea of what your environment is and what exactly the difference between your training and test data is, I could be more helpful. Maybe we can set up a call in ~2 weeks. Feel free to reach out via Discord DM or email to discuss further.
