Model "remembers" instead of learning #260

Open
jarlva opened this issue Jan 14, 2023 · 12 comments

@jarlva

jarlva commented Jan 14, 2023

Hey, after training (~200M steps) with good reward, enjoy shows poor reward numbers on unseen data. When I include the training data in the enjoy run, the reward matches training. So it seems the model "remembers" the data instead of learning from it.

What's the best way to deal with this (other than adding more data and introducing random noise)? Are there settings worth trying?

I'm training a gym-like env with the following config:

{
  "help": false,
  "algo": "APPO",
  "env": "Myrl-v0",
  "experiment": "0114-1156.2-62",
  "train_dir": "./train_dir",
  "restart_behavior": "resume",
  "device": "gpu",
  "seed": 5,
  "num_policies": 1,
  "async_rl": true,
  "serial_mode": false,
  "batched_sampling": false,
  "num_batches_to_accumulate": 2,
  "worker_num_splits": 2,
  "policy_workers_per_policy": 1,
  "max_policy_lag": 1000,
  "num_workers": 32,
  "num_envs_per_worker": 28,
  "batch_size": 1024,
  "num_batches_per_epoch": 1,
  "num_epochs": 1,
  "rollout": 32,
  "recurrence": 1,
  "shuffle_minibatches": false,
  "gamma": 0.99,
  "reward_scale": 1.0,
  "reward_clip": 1000.0,
  "value_bootstrap": false,
  "normalize_returns": true,
  "exploration_loss_coeff": 0.003,
  "value_loss_coeff": 0.5,
  "kl_loss_coeff": 0.0,
  "exploration_loss": "entropy",
  "gae_lambda": 0.95,
  "ppo_clip_ratio": 0.1,
  "ppo_clip_value": 1.0,
  "with_vtrace": false,
  "vtrace_rho": 1.0,
  "vtrace_c": 1.0,
  "optimizer": "adam",
  "adam_eps": 1e-06,
  "adam_beta1": 0.9,
  "adam_beta2": 0.999,
  "max_grad_norm": 4.0,
  "learning_rate": 0.0001,
  "lr_schedule": "constant",
  "lr_schedule_kl_threshold": 0.008,
  "obs_subtract_mean": 0.0,
  "obs_scale": 1.0,
  "normalize_input": true,
  "normalize_input_keys": null,
  "decorrelate_experience_max_seconds": 0,
  "decorrelate_envs_on_one_worker": true,
  "actor_worker_gpus": [],
  "set_workers_cpu_affinity": true,
  "force_envs_single_thread": false,
  "default_niceness": 0,
  "log_to_file": true,
  "experiment_summaries_interval": 10,
  "flush_summaries_interval": 30,
  "stats_avg": 100,
  "summaries_use_frameskip": true,
  "heartbeat_interval": 20,
  "heartbeat_reporting_interval": 180,
  "train_for_env_steps": 985000000,
  "train_for_seconds": 10000000000,
  "save_every_sec": 60,
  "keep_checkpoints": 1,
  "load_checkpoint_kind": "best",
  "save_milestones_sec": -1,
  "save_best_every_sec": 15,
  "save_best_metric": "7.ARGPB",
  "save_best_after": 20000000,
  "benchmark": false,
  "encoder_mlp_layers": [
    512,
    512
  ],
  "encoder_conv_architecture": "convnet_simple",
  "encoder_conv_mlp_layers": [
    512
  ],
  "use_rnn": false,
  "rnn_size": 512,
  "rnn_type": "gru",
  "rnn_num_layers": 1,
  "decoder_mlp_layers": [],
  "nonlinearity": "elu",
  "policy_initialization": "orthogonal",
  "policy_init_gain": 1.0,
  "actor_critic_share_weights": true,
  "adaptive_stddev": true,
  "continuous_tanh_scale": 0.0,
  "initial_stddev": 1.0,
  "use_env_info_cache": false,
  "env_gpu_actions": false,
  "env_gpu_observations": true,
  "env_frameskip": 1,
  "env_framestack": 1,
  "pixel_format": "CHW",
  "use_record_episode_statistics": false,
  "with_wandb": false,
  "wandb_user": null,
  "wandb_project": "sample_factory",
  "wandb_group": null,
  "wandb_job_type": "SF",
  "wandb_tags": [],
  "with_pbt": true,
  "pbt_mix_policies_in_one_env": true,
  "pbt_period_env_steps": 5000000,
  "pbt_start_mutation": 20000000,
  "pbt_replace_fraction": 0.3,
  "pbt_mutation_rate": 0.15,
  "pbt_replace_reward_gap": 0.1,
  "pbt_replace_reward_gap_absolute": 1e-06,
  "pbt_optimize_gamma": false,
  "pbt_target_objective": "true_objective",
  "pbt_perturb_min": 1.1,
  "pbt_perturb_max": 1.5,
  "command_line": "--train_dir=./train_dir --learning_rate=0.0001 --with_pbt=True --save_every_sec=60 --load_checkpoint_kind=best --save_best_every_sec=15 --use_rnn=False --seed=5 --num_envs_per_worker=28 --keep_checkpoints=1 --device=gpu --train_for_env_steps=985000000 --algo=APPO --experiment=0114-1156.2-62 --with_vtrace=False --experiment_summaries_interval=10 --save_best_after=20000000 --recurrence=1 --num_workers=32 --batch_size=1024 --env=Myrl-v0 --save_best_metric=7.ARGPB",
  "cli_args": {
    "algo": "APPO",
    "env": "Myrl-v0",
    "experiment": "0114-1156.2-62",
    "train_dir": "./train_dir",
    "device": "gpu",
    "seed": 5,
    "num_workers": 32,
    "num_envs_per_worker": 28,
    "batch_size": 1024,
    "recurrence": 1,
    "with_vtrace": false,
    "learning_rate": 0.0001,
    "experiment_summaries_interval": 10,
    "train_for_env_steps": 985000000,
    "save_every_sec": 60,
    "keep_checkpoints": 1,
    "load_checkpoint_kind": "best",
    "save_best_every_sec": 15,
    "save_best_metric": "7.ARGPB",
    "save_best_after": 20000000,
    "use_rnn": false,
    "with_pbt": true
  },
  "git_hash": "cf6f93c8109e48faf7bca746ce2184808f6513c1",
  "git_repo_name": "not a git repository",
  "train_script": "train_gym_env2"
}
@alex-petrenko
Owner

You're encountering a general machine learning problem called "overfitting". It is generally a challenge to make sure a model generalizes beyond the training distribution, and this is not specific to RL or Sample Factory.

Some things to look at:

  1. Look up general anti-overfitting techniques from deep learning. Dropout and a larger learning rate come to mind, although I haven't had much success with these.
  2. Domain randomization. Make sure your training distribution is as diverse as possible, so it is harder to overfit. Randomize parameters of the environment where possible. Check out some automatic domain randomization ideas from this paper https://dextreme.org/ and the papers it references.
  3. Data augmentation. Making the training distribution larger always helps. Augment training scenarios to provide more data. Augment observations (e.g. for visual observations you can crop, change colors, flip images, and use other techniques from computer vision).
  4. Noise injection can help (e.g. injecting noise into observations and actions; see the sketch after this list).
  5. Adversarial learning and self-play can help if they are applicable to your setting.
  6. Use population-based training with a performance metric (true_objective) that is a proxy for generalization performance (i.e. performance on unseen data).
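
A minimal sketch of the noise-injection idea (point 4), assuming a vector Box observation space; NoisyObsWrapper, noise_scale, and make_base_env are illustrative names, not part of Sample Factory:

import gym
import numpy as np

class NoisyObsWrapper(gym.ObservationWrapper):
    """Adds zero-mean Gaussian noise to Box observations as a train-time augmentation."""

    def __init__(self, env, noise_scale=0.05):
        super().__init__(env)
        self.noise_scale = noise_scale

    def observation(self, obs):
        noise = np.random.normal(0.0, self.noise_scale, size=obs.shape).astype(obs.dtype)
        # keep the noisy observation inside the declared observation space
        return np.clip(obs + noise, self.observation_space.low, self.observation_space.high)

# Apply it inside your env factory so only training rollouts see the noise,
# e.g. env = NoisyObsWrapper(make_base_env(), noise_scale=0.05)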

@jarlva
Author

jarlva commented Jan 16, 2023

Thanks again for your reply @alex-petrenko !

@jarlva jarlva closed this as completed Jan 16, 2023
@jarlva
Author

jarlva commented Jan 22, 2023

I tried the following, but none of it worked. I'd like to try dropout; I noticed it's possible to apply it in PyTorch, but I'm not sure how to do it in the SF2 code (maybe add an optional parameter?).

Update: I also tried editing sample-factory/tests/test_precheck.py (lines 15, 18):
[screenshot: edited lines in sample-factory/tests/test_precheck.py]

. added noise to observations, up to +/-5%
. tried PBT
. simplified the model to 256,256
. changed the LR to 0.00001 and 0.001, from the default 0.0001
. increased the data from 30k to 100k rows
. augmenting the data is not possible in my case

@jarlva jarlva reopened this Jan 22, 2023
@jarlva
Author

jarlva commented Jan 25, 2023

Hi @alex-petrenko, would it be possible to reply to my latest question from 2 days ago, above?

@alex-petrenko
Owner

I think your best option is to implement a custom model (an encoder alone should be sufficient, but you can override the entire actor-critic module). See the documentation here: https://www.samplefactory.dev/03-customization/custom-models/

Just add dropout as a layer and, fingers crossed, it should work. You should be careful about eval() and train() modes for your PyTorch module, but I think you're already covered here.
See here for an example: https://discuss.pytorch.org/t/if-my-model-has-dropout-do-i-have-to-alternate-between-model-eval-and-model-train-during-training/83007/2
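
Not the actual code from this thread, just a rough sketch of what a custom MLP encoder with dropout could look like, following the custom-models documentation linked above. The Encoder base class and the register_encoder_factory hook are the ones described in that doc (exact signatures may differ across versions); DropoutMlpEncoder, the 512-unit layers, and the dropout probability of 0.2 are arbitrary choices for illustration:

import torch.nn as nn

from sample_factory.algo.utils.context import global_model_factory
from sample_factory.model.encoder import Encoder

class DropoutMlpEncoder(Encoder):
    def __init__(self, cfg, obs_space):
        super().__init__(cfg)
        obs_dim = obs_space["obs"].shape[0]
        # Linear -> ELU -> Dropout for each hidden layer
        self.mlp = nn.Sequential(
            nn.Linear(obs_dim, 512), nn.ELU(), nn.Dropout(p=0.2),
            nn.Linear(512, 512), nn.ELU(), nn.Dropout(p=0.2),
        )

    def forward(self, obs_dict):
        return self.mlp(obs_dict["obs"])

    def get_out_size(self):
        return 512

def make_dropout_encoder(cfg, obs_space):
    return DropoutMlpEncoder(cfg, obs_space)

# Register the factory before starting training, as described in the customization docs:
global_model_factory().register_encoder_factory(make_dropout_encoder)

If the framework toggles train()/eval() as suggested above, nn.Dropout is automatically disabled at evaluation time.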

@alex-petrenko
Owner

alex-petrenko commented Jan 25, 2023

Hmm, I guess your confusion might come from the fact that Dropout can't just be added as a model layer; you have to actually call it explicitly in forward().

If I were you, I would simply modify the forward() method of the actor_critic class to call dropout when needed.

Sorry, I don't think I can properly help you without knowing the context and details of your problem. Overfitting is one of the hardest problems in all of ML, and there's no single magical recipe for fixing it.
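
For reference, a self-contained illustration (not the actual actor_critic code) of what calling dropout explicitly inside forward() looks like; F.dropout honors self.training, which is what the eval()/train() caveat above is about:

import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderWithFunctionalDropout(nn.Module):
    """Illustration only: dropout applied as a function call inside forward()."""

    def __init__(self, in_dim=4, hidden=512, p=0.2):
        super().__init__()
        self.fc = nn.Linear(in_dim, hidden)
        self.p = p

    def forward(self, x):
        x = F.elu(self.fc(x))
        # self.training is toggled by .train() / .eval(), so this is a no-op at eval time
        return F.dropout(x, p=self.p, training=self.training)

m = EncoderWithFunctionalDropout()
m.eval()  # e.g. during enjoy/evaluation, dropout is disabled
out = m(torch.zeros(1, 4))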

@jarlva
Author

jarlva commented Jan 29, 2023

Hi @alex-petrenko, sorry, I'm not an expert at this. I'm using a customized cartpole-like gym env.
Do you mean editing sample_factory/model/actor_critic.py at the locations shown below, lines 154 and 184?

1/30 update: I also updated sample_factory/model/encoder.py at lines 216 and 221.

Also, would it make sense to add dropout as a config switch?

[screenshots: proposed edit locations in actor_critic.py and encoder.py]

@alex-petrenko
Owner

The first thing I would try is adding dropout after each layer in the encoder.
If you're using a cartpole-like environment, you would need to modify the MLP encoder, which is defined here:

class MlpEncoder(Encoder):

The convolutional encoder probably has nothing to do with your task if your observations are just vectors of numbers; it is meant for image observations.
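
A hedged sketch of the kind of change being suggested here: interleave nn.Dropout with the Linear/ELU pairs when the MLP is built from the encoder_mlp_layers config. This mirrors the structure of the MLP encoder but is not a copy of the actual MlpEncoder code; mlp_with_dropout and dropout_p are illustrative names:

import torch.nn as nn

def mlp_with_dropout(input_size, layer_sizes, dropout_p=0.2):
    # Builds a Linear -> ELU -> Dropout block for each hidden layer size,
    # e.g. layer_sizes == [512, 512] as in the encoder_mlp_layers config above.
    layers = []
    for size in layer_sizes:
        layers.extend([nn.Linear(input_size, size), nn.ELU(), nn.Dropout(p=dropout_p)])
        input_size = size
    return nn.Sequential(*layers)

encoder_mlp = mlp_with_dropout(input_size=4, layer_sizes=[512, 512])

This produces the same Linear/ELU/Dropout stack shown in the next comment.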

@jarlva
Author

jarlva commented Jan 31, 2023

I added it in the model_utils.py file at line 52. The layers are now:

RecursiveScriptModule(
original_name=Sequential
(0): RecursiveScriptModule(original_name=Linear)
(1): RecursiveScriptModule(original_name=ELU)
(2): RecursiveScriptModule(original_name=Dropout)
(3): RecursiveScriptModule(original_name=Linear)
(4): RecursiveScriptModule(original_name=ELU)
(5): RecursiveScriptModule(original_name=Dropout)
)

But, alas, that's still not solving overfitting...

[screenshot]

@alex-petrenko
Owner

Dropout is one way to combat overfitting, but it is not a panacea.

I'm sorry I can't help figure out your exact issue. As I said previously, overfitting is a general machine learning phenomenon, and most likely your problem has nothing to do with Sample Factory but rather with the overall problem formulation and approach.

@jarlva
Author

jarlva commented Feb 1, 2023

Hi @alex-petrenko, I understand. I appreciate the guidance and advice!
Would you be open to advising for pay?

@alex-petrenko
Owner

@jarlva not sure if this is realistic right now. I'm starting a full-time job very soon, which will keep me busy for the foreseeable future.

You said you're able to fit your training data, right?
That is, the trained policy does well on the training data when you evaluate it,
but completely fails on out-of-distribution data?

If I could get some idea of what your environment is and what exactly the difference between your training and test data is, I could be more helpful. Maybe we can set up a call in ~2 weeks. Feel free to reach out via Discord DM or email to discuss further.
