
FileStore Backend Support #17

Closed

chovanecm opened this issue Jan 28, 2017 · 10 comments

Comments
@chovanecm
Owner

Currently, only MongoDB is supported as a backend.
However, Sacred is able to use other databases as well, and this could be supported in Sacredboard too.

@cicobalico

This would be awesome, especially for the local backends like TinyDB or file-storage.

@gideonite
Contributor

I think the first step would be to decide upon a web API. Ideally, this would somehow be codified in the form of an abstract Python class, which would then be subclassed by each database backend.

Any thoughts? I'm happy to look through the current implementation and propose an interface, but it might be more appropriate for the main contributors to do this kind of "deep design" thinking.

@chovanecm
Owner Author

My idea was to make an abstract class that PyMongoDataAccess would inherit from (if that makes sense in a language like Python, where we can do "duck typing" - but perhaps it does, for the sake of intelligibility).
The data format accepted and returned by PyMongoDataAccess would then be considered the reference for other implementations - which is basically what @gideonite suggested.
I've tried to document the format of the parameters requested by the MongoDB backend, and I should do the same with the return values.
Unfortunately, I've prioritised other features and aspects of the project (such as decoupling the HTML view from its model to make it testable), so I've "postponed" this - not to mention that new features like filtering (currently in the development branch) could be more difficult to implement with backends like file-storage.

Of course, any help would be appreciated. Basically, the important methods of the current interface are

def get_runs(self, sort_by=None, sort_direction=None,
             start=0, limit=None, query={"type": "and", "filters": []}):
def get_run(self, run_id):
def connect(self):  # not as important for other implementations - only referenced in bootstrap.py
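Codified as an abstract base class, the interface above might look like the following sketch (the class name DataStorage and the docstrings are assumptions for illustration, not Sacredboard's actual code):

```python
from abc import ABC, abstractmethod


class DataStorage(ABC):
    """Interface every Sacredboard data backend would implement.

    A sketch only - names and signatures are up for discussion.
    """

    @abstractmethod
    def connect(self):
        """Open the connection to the underlying store."""

    @abstractmethod
    def get_run(self, run_id):
        """Return a single run as a dict, or None if it does not exist."""

    @abstractmethod
    def get_runs(self, sort_by=None, sort_direction=None,
                 start=0, limit=None, query=None):
        """Return an iterable of runs that also exposes a count() method."""
```

A subclass that forgets to implement one of these methods cannot be instantiated, which is exactly the "tell developers what to implement" guarantee discussed here.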

The approach Sacredboard currently uses is to initialise the data gateway in bootstrap.py and put it into the current_app.config["data"] variable.
It's definitely not the best solution, but it might be enough for now.

If anyone creates a proof of concept for other backends using the hints provided here, I'm happy to consult on it, but I cannot start developing it on my own now because of my other duties.

I was also thinking about whether we might try replacing (or extending?) the current pseudo-API with something nicer, such as GraphQL. But that would just be a nice-to-have.

@gideonite
Contributor

I've implemented a basic interface called DataStorage, and PyMongoDataAccess inherits from it. I've also started implementing a FileStorage class which simply does get_run and get_runs. I'm getting some weird DataTables errors, though, when I start up the web app. I assume this is because either:

1 - the data format doesn't match the MongoDB format, or
2 - I haven't implemented any filtering.

To create a run, I simply load run.json and set run["config"] = load_config("config.json").
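That step could look roughly like this sketch (run.json and config.json follow Sacred's file observer layout; the helper name create_run is illustrative, not the fork's actual code):

```python
import json
import os


def create_run(run_dir):
    """Build a run dict from one FileStorage run directory.

    Loads run.json and attaches the parsed config.json under
    the "config" key, mirroring what MongoDB-backed runs contain.
    """
    with open(os.path.join(run_dir, "run.json")) as f:
        run = json.load(f)
    with open(os.path.join(run_dir, "config.json")) as f:
        run["config"] = json.load(f)
    return run
```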

Any pointers at this point? Obviously it would be nice to get it to run. I'm also a noob when it comes to writing Pythonic code, so I will happily take code style comments.

Here's the commit on my fork: gideonite@8cfac39

@chovanecm
Owner Author

Thanks!
You can run Sacredboard with debugging enabled (--debug) and navigate to 127.0.0.1:5000/api/runs.
There you can see that the returned object has to implement a "count" method exposing the total number of records (DataTables uses this for pagination etc.).

One of the solutions to achieve it is to add this to datastorage.py:

class DataIterator:
    """An iterable that also reports the total number of items."""

    def __init__(self, count, iterable):
        self.iterable = iterable
        self._count = count

    def count(self):
        """Return the number of items."""
        return self._count

    def __iter__(self):
        return iter(self.iterable)

and to modify your get_runs method:

def get_runs(self, sort_by=None, sort_direction=None,
             start=0, limit=None, query={"type": "and", "filters": []}):
    blacklist = {"_sources"}
    # Skip Sacred's helper directories; everything else is a run id.
    run_ids = [run_id for run_id in os.listdir(self.path_to_dir)
               if run_id not in blacklist]

    def run_iterator():
        for run_id in run_ids:
            yield self.get_run(run_id)

    return DataIterator(len(run_ids), run_iterator())

But even then there is a problem: the loaded JSON does not contain information about data types. Therefore, dates and times, such as the experiment start time and the last heartbeat time, are loaded as strings and need to be converted to the appropriate Python representation using

    datetime.datetime.strptime('2017-05-28T12:18:33.091144', '%Y-%m-%dT%H:%M:%S.%f')
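Wrapped in a small helper, the conversion might look like this (the key names in the default argument are assumptions based on Sacred's run format, and the helper itself is illustrative):

```python
from datetime import datetime

# ISO-8601 with microseconds, as written by Sacred's JSON output.
ISO_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"


def parse_dates(run, keys=("start_time", "stop_time", "heartbeat")):
    """Convert ISO-formatted date strings in a loaded run dict in place."""
    for key in keys:
        value = run.get(key)
        if isinstance(value, str):
            run[key] = datetime.strptime(value, ISO_FORMAT)
    return run
```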

I hope this is enough to make it work (without sorting, pagination and filtering).

Regarding coding style, if you have the tox package installed, you can run
tox -e flake8
It's quite pedantic, so there is still a lot of work to do even in my code. Once everything is fine (line breaks, spaces etc.), it starts checking the documentation strings.

What I personally found useful is to write tests together with the main code. You can find inspiration in test_mongodb.py (it uses a library that mocks MongoDB, so the test behaves almost as if it were talking to a real database).

@chovanecm chovanecm removed the draft label May 28, 2017
@gideonite
Contributor

Great feedback! Sorry for the delay in responding.

I've made some changes that (I think) address these issues. Namely, I created a FileStoreCursor object. This brings up another point: the object-oriented programmer in me says to create another abstraction to match "DataStorage", called "DataCursor". I don't know if this is Pythonic, but I don't know how else to guarantee that others who want to extend this code for their own datastore know what to implement.

I've had some trouble with tox because I use Anaconda instead of virtualenv. I couldn't find a developer wiki page, and at this point it might be a good idea to create one. We could put in some of the info that's been mentioned in this thread, such as tox, --debug, etc., so that future developers can quickly see how to get started. I'm interested to know your development setup.

Here's the latest:
develop...gideonite:filestore

@chovanecm
Owner Author

chovanecm commented May 31, 2017

Great job!
Don't worry about the delay - I'm currently busy with my university anyway.
I'll try to have a look at it tomorrow (I'm in Europe), so for now only in brief:

  1. The solution with FileStoreCursor basically corresponds to my DataIterator proposed earlier. If you look at the code, it does not depend on FileStorage - it is generic enough to be on the same level of abstraction as DataStorage, because it is also reusable by the MongoDB layer and possibly by any other (let's name it DataIterator). <<<< UPDATE: DataIterator would be a better name than DataCursor
  2. I had problems with Anaconda too. That's why I started using the Windows Subsystem for Linux (aka Bash for Windows) in Windows 10. But as it's a beta, it sometimes simply stops working, so perhaps a Linux virtual machine (e.g. in VirtualBox) is better.
  3. I am sorry about the missing documentation. In fact, Sacredboard was created as part of my master's thesis, so I spent more time writing the thesis than proper documentation :-( Chapters 6 and 7 briefly describe the development environment and commands (I just noticed I didn't mention the --debug option).
    In two weeks, I am going to defend the thesis and take the "final state examination", which is necessary to graduate here in the Czech Republic. I probably won't have much time until then, because I basically need to go through almost all the courses I took to prepare for the exam.
    But I was thinking about either creating a wiki page or generating Sphinx documentation (it is a kind of standard for Python projects). That's why I started completing the code documentation docstrings - they would become part of the generated docs.

But I'm glad you asked; it should probably be one of my priorities now.

@chovanecm
Owner Author

I tried it now. Nice that the list actually shows something.
Now I see I should definitely add tests for the interaction with the backend (#40) so that developers on the backend side always know whether the data they return are in the expected format.

Basically, the object you are building with _create_run should correspond to the object key in the "/api/run/id" example in #40.
So I think that for even the Run detail view to work, you need to set run["_id"] and run["info"].
Again, thanks for doing this. It is helping not only with the functionality but also with realising what's necessary to document and test.

@gideonite
Contributor

Ah, ok, now I understand a bit more about the context of the project. Makes sense.

I see your point now. Perhaps the more Pythonic way of doing this, instead of heavy-handedly forcing conformity via interfaces/abstract classes and the like, is to specify the required functionality in tests.

My only motivation for using the term "cursor" was to match the return type of the MongoDB interface.

Ok, I will make sure that the data type conforms to #40 and implement a test to make sure that the right keys are returned, etc. It will probably be some kind of integration test in which I create some experiments in /tmp and make sure the resulting JSON has the right keys.

@chovanecm
Owner Author

chovanecm commented Jun 2, 2017

It's OK to have abstract classes in Python where appropriate, such as the DataStorage class you created. It is good that it tells developers what methods they have to implement.
In the case of FileStoreCursor (aka DataIterator), I was just thinking the class could be put directly on the datastore level so that developers of other database backends can use it too.

When you talk about integration tests, do you mean they would involve running Sacred too? That would be amazing, but if a test fails, it might not be easy to tell whether the problem was caused by Sacred or by Sacredboard. So I would see this as an additional level of testing.
I think testing of your module could be solved, for instance, by generating a few runs with Sacred, copying the directory e.g. to tests/data/fixtures/filestore, and loading them from there.
You could perhaps also try to decouple loading the JSON files from building the Run objects, as it looks like the latter part could be reused e.g. for the TinyDB backend (maybe by extracting the _create_run method somewhere else - but once I have time again, I can try to design a better solution for it).
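A fixture-based test along those lines might look like this sketch (the expected keys and the load_run helper are illustrative; here setUp builds a tiny fixture in a temporary directory rather than shipping one in tests/data/fixtures/filestore):

```python
import json
import os
import tempfile
import unittest

# Keys the frontend expects per the /api/run/id example in #40 (illustrative subset).
EXPECTED_KEYS = {"_id", "experiment", "status", "info", "config"}


def load_run(run_dir):
    """Minimal stand-in for FileStorage.get_run: merge run.json and config.json."""
    with open(os.path.join(run_dir, "run.json")) as f:
        run = json.load(f)
    with open(os.path.join(run_dir, "config.json")) as f:
        run["config"] = json.load(f)
    return run


class TestFileStoreFormat(unittest.TestCase):
    def setUp(self):
        # Build a tiny fixture directory, as if copied from a real Sacred run.
        self.run_dir = tempfile.mkdtemp()
        with open(os.path.join(self.run_dir, "run.json"), "w") as f:
            json.dump({"_id": "1", "experiment": {"name": "demo"},
                       "status": "COMPLETED", "info": {}}, f)
        with open(os.path.join(self.run_dir, "config.json"), "w") as f:
            json.dump({"seed": 42}, f)

    def test_run_has_expected_keys(self):
        run = load_run(self.run_dir)
        self.assertTrue(EXPECTED_KEYS.issubset(run))
```

Run with `python -m unittest`. This keeps the backend test independent of Sacred itself, in line with the "additional level of testing" point above.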

@chovanecm chovanecm changed the title Different database backends FileStore Backend Support Jun 30, 2017
@chovanecm chovanecm added the Epic label Jun 30, 2017