Discussion + Planning: How to unittest an agent? #72

Open
DannyWeitekamp opened this issue Dec 7, 2020 · 8 comments

Comments

DannyWeitekamp (Collaborator) commented Dec 7, 2020

There are a lot of bugs we have run into that come down to different behavior across different implementations/generations of the code, and these issues are pretty hard to track down. I would like to move toward having a way to unit test agents.

The way we 'test' agents right now involves running them with altrain (which spins up another server that hosts a tutor in the browser). In the short term we look at the general behavior of the agent transactions as they are printed in the terminal, and maybe additionally print out some of what is happening internally in the agent. In the long term we look at the learning curves, which usually involves first doing some pre-processing/KC labeling and uploading to DataShop.

It would be nice to have unit tests at the level of "the agent is at this point in training with these skills, we'll give it these interactions, and we expect this to happen".

Some impediments to doing this:

  1. Modularity: At the moment the most straightforward way to run AL is via altrain, but this is more of an integration test: it requires spinning up a new process that uses a completely different library (which we have kept separate from AL_Core for good reason).
  2. Performance: Right now (ballpark estimate) AL_Core is about 50% of training time and AL_Train is the other 50%, and together all that back and forth takes several minutes per agent. It would be nice if our tests ran on the order of seconds, to make iterating on the code faster.
  3. No robust way to "construct" skills: Skills, at least in the ModularAgent, are currently kind of a hodgepodge of learning mechanisms linked together. They are also learned, not defined. It would be nice if there were a language for writing and representing skills.
  4. Randomness: Some of the process of learning skills is random, so this may need to be controlled in some way.

I've been flirting with the idea of having a sort of virtual tutor that is just a knowledge base that gets updated in Python, executes the tutor logic, and calls request/train. This would at least address 1) and maybe also 2).
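
Something along these lines is what I'm picturing. This is only a rough sketch: the `VirtualTutor` class, the state/SAI dictionaries, and the exact `request()`/`train()` signatures are placeholders I'm making up here, not the real AL_Core interfaces.

```python
# Rough sketch of a "virtual tutor" driving an agent directly in Python.
# NOTE: VirtualTutor, the state/SAI dicts, and the request()/train()
# signatures below are hypothetical placeholders, not real AL_Core APIs.

class VirtualTutor:
    """Minimal in-memory tutor: holds a state dict and checks steps."""

    def __init__(self, problem):
        self.state = dict(problem["start_state"])  # knowledge base the agent sees
        self.answers = problem["answers"]          # field -> correct input

    def check(self, sai):
        """Return 1 if the proposed (selection, action, input) is correct, else 0."""
        return int(self.answers.get(sai["selection"]) == sai["input"])

    def apply(self, sai):
        """Lock in a correct step by updating the visible state."""
        self.state[sai["selection"]] = sai["input"]


def run_problem(agent, problem, max_steps=100):
    """Drive the agent through one problem, training it on every step."""
    tutor = VirtualTutor(problem)
    for _ in range(max_steps):
        remaining = tutor.answers.keys() - tutor.state.keys()
        if not remaining:
            break                                  # problem finished
        sai = agent.request(tutor.state)           # ask the agent for a step
        if sai is None:                            # no skill fired -> demonstrate
            sel = next(iter(remaining))
            sai = {"selection": sel, "input": tutor.answers[sel]}
            reward = 1
        else:
            reward = tutor.check(sai)
        agent.train(tutor.state, sai, reward)      # feedback on that step
        if reward:
            tutor.apply(sai)
```

A unit test then becomes: construct an agent, push a handful of problems through `run_problem`, and assert on what `request()` returns for the next one, all in-process and on the order of seconds.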

eharpste (Member) commented Dec 7, 2020

Related to #57, which has always been my big concern with testing whole agents. It would be great to have someone go through and try to catalog all the things that use some kind of random choice and see if we can set a master seed.
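
For the seeding half of that, a minimal sketch of what a shared pytest fixture could look like, assuming the only sources of randomness are Python's `random` module and numpy (any agent-internal RNGs would have to be threaded through here as well):

```python
# conftest.py sketch: pin every known source of randomness before each test.
# Assumes the only RNGs in play are Python's `random` and numpy; any
# agent-internal random state would need to be seeded here too.
import random

import numpy as np
import pytest


@pytest.fixture(autouse=True)
def master_seed():
    random.seed(0)
    np.random.seed(0)
```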

eharpste (Member) commented Dec 7, 2020

@cmaclell Don't you have something like the virtual tutor idea in the works? One concern with some of that would be overfitting to a test case. It might be nice to have some kind of regular batch process (probably not using current CI tools, but maybe) that re-runs some of our canonical experiments and reports fit to human data, just so we can see cases where we totally break something there.

DannyWeitekamp (Collaborator, Author) commented

Made a new issue for CI #74

cmaclell (Collaborator) commented Dec 9, 2020 via email

eharpste closed this as completed Dec 9, 2020
eharpste reopened this Dec 9, 2020
eharpste (Member) commented Dec 9, 2020

So, beyond the issues with why this will be hard to implement, what would some good tests be? Some I can think of:

  1. Basic API sanity checks (a rough sketch follows this list)
    1. Train A, A, A, A, A: Request A -> A
    2. Train A, A, A, A, A: Request B -> hint request
    3. Train A, B, A, B, A, B: Request A -> A, Request B -> B
    4. Train +A, -A, +A, -A: Request A -> hint request
  2. Rerun the fractions example
    1. basically codify our current de facto testing paradigm
    2. take a collection of items (probably mixed problem types?)
    3. do an incremental-prediction-style task where you train on 1 problem, then iterate through the items with requests for "grading"; test that the error rate decreases monotonically and hits 0 by at least 3x(?) runs through the training data. If we control the training set we can probably safely assume this.
    4. could also have a blocked vs. interleaved variant or other main-effect-style tests. Not sure we want to assume this effect for all agent types.
  3. Benchmark drift
    1. We can run our current agents and come up with some kind of benchmark metric (could be something like mastery level at each opportunity for a population of ~10 agents); we then peg that benchmark and track that agents remain within some tolerance of it.
    2. This could be a good way to make sure we didn't break something.
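
For item 1, a sketch of what the sanity checks could look like; `make_agent`, `train_example`, and `request_answer` are hypothetical helpers standing in for however the real agent's `train()`/`request()` calls end up being wrapped:

```python
# Sketch of the basic API sanity checks in item 1.
# make_agent(), train_example(), and request_answer() are hypothetical
# helpers wrapping the real agent's train()/request() calls; HINT stands
# in for whatever a "no answer / hint request" response looks like.
HINT = None


def test_train_a_request_a():
    agent = make_agent()
    for _ in range(5):
        train_example(agent, problem="A", answer="a", reward=1)
    assert request_answer(agent, problem="A") == "a"


def test_train_a_request_b_gives_hint():
    agent = make_agent()
    for _ in range(5):
        train_example(agent, problem="A", answer="a", reward=1)
    assert request_answer(agent, problem="B") == HINT


def test_interleaved_training():
    agent = make_agent()
    for _ in range(3):
        train_example(agent, problem="A", answer="a", reward=1)
        train_example(agent, problem="B", answer="b", reward=1)
    assert request_answer(agent, problem="A") == "a"
    assert request_answer(agent, problem="B") == "b"


def test_contradictory_feedback_gives_hint():
    agent = make_agent()
    for reward in (1, -1, 1, -1):
        train_example(agent, problem="A", answer="a", reward=reward)
    assert request_answer(agent, problem="A") == HINT
```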

I guess part of this is: what high-level assumptions do we want to make about all agents? Some that seem reasonable, though they might not apply to all agent types or goals:

  1. Reaching mastery (could be hitting 0 error rate or some kind of 95% threshold) after some amount of training.
  2. Monotonic decrease in error rate, assuming training without intentional interference effects (see the sketch after this list).
  3. Demonstrating a blocked versus interleaved interference effect. (I could see this not holding for all agent types.)
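
For assumptions 1 and 2, the checks themselves are simple once we have per-opportunity error rates; this sketch assumes `error_rates` comes out of whatever incremental prediction loop we settle on (e.g. the fraction of a population of agents answering incorrectly at each opportunity):

```python
# Sketch of checks for assumptions 1 and 2, given a list of per-opportunity
# error rates produced by an (assumed) incremental prediction loop.

def check_mastery(error_rates, threshold=0.05):
    """Assumption 1: error rate eventually drops to (near) zero."""
    return error_rates[-1] <= threshold


def check_monotonic(error_rates, tolerance=0.0):
    """Assumption 2: error rate never increases by more than `tolerance`."""
    return all(later <= earlier + tolerance
               for earlier, later in zip(error_rates, error_rates[1:]))
```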

DannyWeitekamp (Collaborator, Author) commented

I'm not entirely clear on what you both mean by overfitting: the agent overfits, or there is some fitting in the environment?

Thanks @cmaclell, these seem like a step in the right direction; I'll take a closer look.

@eharpste these are all things that would be good to incorporate. Expanding on 1, for the non-deep-learning-based agents (or at least the ModularAgent) it would be nice to also test things that are inside the agent (more of the unit-test variety), beyond behavior, for example (a rough sketch follows this list):

  1. If the agent runs the planner, the true explanation is in the set of found explanations.
  2. A new skill is created if the agent falls back on the planner.
  3. The where conditions never over-generalize (i.e., test that they don't).
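
Roughly the shape I have in mind, though the attribute names here (`planner_explanations`, `skills`, `where_matches`) are made up for illustration and don't correspond to the actual ModularAgent internals:

```python
# Rough shape of the internal checks above.  The attributes used here
# (planner_explanations, skills, where_matches) are made-up names for
# illustration; the real ModularAgent internals would differ.

def test_planner_finds_true_explanation(agent, state, true_explanation):
    explanations = agent.planner_explanations(state)  # hypothetical accessor
    assert true_explanation in explanations


def test_planner_fallback_creates_skill(agent, state, sai):
    n_before = len(agent.skills)
    agent.train(state, sai, reward=1)  # assume this forces a planner fallback
    assert len(agent.skills) == n_before + 1


def test_where_conditions_do_not_overgeneralize(agent, state, legal_selections):
    for skill in agent.skills:
        assert set(skill.where_matches(state)) <= set(legal_selections)
```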

eharpste (Member) commented Dec 9, 2020

I meant overfitting in a general software engineering sense, not an ML sense. Basically, we don't want to stay myopically focused on the few test cases we define when things might change as we explore new directions. For example, there are some known cases where the blocked vs. interleaved effect should invert.

DannyWeitekamp (Collaborator, Author) commented

Ahh, I see. The unit tests I'm suggesting would be written on an agent-by-agent basis. For a given implementation there is a set of intended behaviors that should be directly enforced via unit tests. If the intended behaviors change then the unit tests should change, but if they don't, they should still pass regardless of implementation changes or additions.

eharpste added a commit that referenced this issue Jul 7, 2021
This merge adds several additional integration tests that were on the [expanded_tests](https://github.com/apprenticelearner/AL_Core/tree/expanded-tests) branch; it is relevant to, but does not address, #72. It also implements a test for agent serialization and re-introduces database saving of agents, as discussed in #60. The changes also remove all un-commented print statements and replace them with the built-in logging library, as discussed in #73.