create_db.py: refactor database design for Flux accounting #6
Conversation
"state" is actually the job state. We don't have the equivalent for exit code yet. The PR I have up right now for "success" is the cloest thing we have to exit code. flux-framework/flux-core#2831. For fair share, I assume "success" doesn't need to be in the table.
Just want to double check: is this the numeric user ID and not the string username?
src/create_db.py
mod_time bigint(20) DEFAULT 0 NOT NULL,
deleted tinyint(4) DEFAULT 0 NOT NULL,
id_assoc integer PRIMARY KEY AUTOINCREMENT,
user tinytext NOT NULL,
i notice "user" in this table, but "user_name" in the "user_table". Just copied from slurm?
Yes, the names of these fields were just copied from Slurm. We could (and probably should) rename them to keep them consistent and easy to interpret, especially because there are multiple ways (at least three that I can think of) to identify a user:
- Unix user ID
- User's "Association" ID
- User name (string)
src/create_db.py
print("Creating job_table in DB...") | ||
conn.execute( | ||
""" | ||
CREATE TABLE IF NOT EXISTS job_table ( |
In general with the job table, is there a reason to keep the column names as the Slurm names? Perhaps b/c reporting scripts will initially assume those column names?
No reason in particular; it just made it easier for me to see which columns I was copying from. I'd be open to renaming them if different names make more sense!
although we should be flexible b/c things are still changing, I think going with flux names vs slurm names is a good idea
Yes, this is the Unix user ID.
From our discussion in #7, if we just go with the flux-core job-info DB, then this script wouldn't in fact need to create a job table at all.
If job info creates the DB, then correct, you would only query it.
Just pushed a change that removes the creation of a job record table. Instead, it contains just the following types of tables:
This database structure represents more of a user accounting database, where the focus is on user/cluster-usage relationships and user limits enforcement.
Dropped the [POC] tag and pushed a change to keep the
Just some quick thoughts, sorry if these aren't all that helpful. Before merging this PR, it might be useful to write down what this first step should be capable of, design a set of functionality tests that demonstrate that capability, and include these tests along with the PR. For example, if the current work is just a tool that would be used to create a user/account db for a flux instance, then add a test that creates the DB from scratch using the tool, along with some tests that verify proper creation of the db file. As a casual reviewer, it would be easier to reason about the database design if there were additionally some functionality tests with mock data. Perhaps it is too early for that, but I don't really have a grasp of how the data from a Flux instance (presumably inactive job data) gets reduced and incorporated into this database, and how subsequently a tool queries the resulting database to create a job priority number. You could even do something very simple at first with some mock data to verify that all the parts are present for the DB to do the most basic things. Finally, we might find that we could start the Flux accounting database in a much simpler form rather than just copying the Slurm accounting DB fields, and still get what we need even in the medium term. (e.g. I don't think we are going to support QoS in the near term?) To summarize, steps that might be helpful:
The above is just my suggestion, but in general this kind of test driven development has worked well for other flux-framework projects.
Force-pushed from baa4c24 to cc7698c.
Thanks for your feedback @grondo - it was really helpful for me to look through. Hopefully I addressed your suggestions in my changes to this PR. I made a couple of changes:
You're absolutely right. And as of now, I don't think we have this functionality. So the creation of association usage tables, QoS tables, and a job table probably isn't really needed right now. I reduced the number of tables created by create_db.py to just two for now - a user table and an association table. Like you mentioned, for the short term we can start out with something simple, where we just have a table of users and a table of associations. The other tables that were originally in the database were dropped, but they can be added at a later time, when we have functionality to grab job data and populate usage tables, generate priority values, change job priorities, etc. I also created a unit test file, db_unittest.py, which tests for the following:
As we add more tables, functionality, and front-end commands (a couple that come to mind are adding a user to the database and editing an association's limits), we can add more unit tests, either to this file or to a new one to keep tests organized.
would be nice to run the tests via
The conventional thing would be to use setuptools and make a setup.py file. That would support building the package, running tests, building a wheel, etc. The closest equivalent to
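As a rough sketch of that conventional setuptools approach (the package name, version, and test layout below are assumptions, not what this repo actually uses):

# setup.py -- illustrative sketch only; name, version, and layout are assumptions
from setuptools import setup, find_packages

setup(
    name="flux-accounting",
    version="0.0.1",
    description="Accounting database tooling for Flux",
    packages=find_packages(),
    install_requires=["pandas"],
    # assumes test/ is an importable package; "python setup.py test" would then run its tests
    test_suite="test",
)

The package could then be built and tested through setup.py, which would be the rough equivalent of a make check style target.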
Thanks @trws. Do you think we should first merge a PR that adds setup.py and setuptools before merging this PR?
I think it could reasonably be a separate step. As long as you feel confident that this is working and tested appropriately, I'd go forward and add that on. It's the kind of thing I normally would either do first, so I can use it while developing, or do on top of something I know is working.
Force-pushed from f310704 to 58c7ce9.
OK, rebased and pushed to catch up after #11, but I also made some additional changes:
So, as of now, the Accounting Database has three static limits per user: shares allocated (related to fairshare calculations), max jobs, and max wall time per job. But since we are still in discussions about user policy limits in flux-framework/flux-sched #638, these limits could definitely be refined, removed, or added to. Perhaps the limit fields in this particular PR aren't all that necessary since we haven't come up with a solid stance just yet; I can always remove those fields for now until we have a better idea of how we want to enforce user limits.
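Purely as an illustration (this is not the actual schema from the PR), an association table holding those static limits might look roughly like this in SQLite; the column names below are guesses:

# illustrative sketch only -- column names and defaults are assumptions, not the PR's schema
import sqlite3

conn = sqlite3.connect("FluxAccounting.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS association_table (
        id_assoc      integer PRIMARY KEY AUTOINCREMENT,
        user_name     tinytext NOT NULL,
        admin_level   smallint DEFAULT 1 NOT NULL,
        shares        integer DEFAULT 1 NOT NULL,   -- shares allocated (fairshare input)
        max_jobs      integer DEFAULT 1 NOT NULL,   -- static per-user job limit
        max_wall_pj   integer DEFAULT 60 NOT NULL   -- max wall time per job
    );
    """
)
conn.commit()
conn.close()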
The unit tests currently cover:
- valid database file creation,
- association table creation, and
- successful addition of an association to the table

As more tables or fields are added, this unit test file will be expanded to account for those additions.
Pushed a couple more changes that I forgot to add yesterday:
test/test_create_db.py
conn = sqlite3.connect("test/FluxAccounting_test.db")

# create association table
conn.execute(
Shouldn't this be calling create_db.py?
Yes, you're right. I'm not sure why I was making the database from scratch... I can't remember. I just pushed a change so that the unit test file calls create_db.py with its own file path to the test directory.
reduce the tables created by create_db.py to just an association table, which holds user account information like an association id, administrative level, and static job limit info such as max jobs and max wall time per job

replace the print statements with logging to a file, db_creation.log, which contains status messages about both database creation and table initialization

Fixes flux-framework#1
LGTM! I'll approve the PR. Although @dongahn should take a look too.
Coming at this late. This is now the highest priority for me (probably tonight or tomorrow).
I'm probably missing some setup steps. I tried to run a test in this PR (
BTW, there is no shebang line. I initially typed this python script in by itself and realized the executable permission wasn't set. Then, I mistakenly used
OK. I manually created the accounting subdirectory. But now,
Once I manually created
class TestDB(unittest.TestCase):
    # create database and make sure it exists
    def test_00_test_create_db(self):
        c.create_db("test/FluxAccounting.db")
You will want to at least test the existence of the path. Better yet, perhaps create_db can do some more sanity checks on the given path.
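For example (a sketch only, assuming the module imports as create_db), the existence check could be as simple as:

# sketch of an existence check in the unit test -- not the PR's actual test code
import os
import unittest

import create_db as c


class TestDB(unittest.TestCase):
    # create the database and make sure the file actually exists on disk
    def test_00_test_create_db(self):
        c.create_db("test/FluxAccounting.db")
        self.assertTrue(os.path.exists("test/FluxAccounting.db"))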
LOGGER = logging.basicConfig(filename="accounting/db_creation.log", level=logging.INFO)


def create_db(filepath):
This function should do some sanity checks on the filepath input and try to fail over if recoverable (e.g., missing subdirectories leading to the db file), or raise an exception or return an error code?
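One possible shape for that check, as a sketch only (the fail-over behavior here is just one option, not what the PR does):

# sketch of a possible sanity check on create_db's filepath argument -- not the PR's code
import os
import sqlite3


def create_db(filepath):
    # recover from missing subdirectories leading to the db file by creating them
    db_dir = os.path.dirname(filepath)
    if db_dir and not os.path.isdir(db_dir):
        try:
            os.makedirs(db_dir)
        except OSError as exc:
            # not recoverable (e.g., permissions); surface the error to the caller
            raise RuntimeError("cannot create directory for %s: %s" % (filepath, exc))
    return sqlite3.connect(filepath)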
def create_db():
    LOGGER = logging.basicConfig(filename="accounting/db_creation.log", level=logging.INFO)
When I didn't have the accounting subdirectory, logging failed. Either recover from such a failure or dump the log file into the current working directory? Ultimately you may want to make this configurable, but the project is so early that I gather you want to make the fastest progress on core business logic.
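A fallback along those lines might look like this (illustrative only, not code from this PR):

# sketch of a possible fallback for the log file location -- an assumption, not the PR's code
import logging
import os

LOG_DIR = "accounting"
LOG_FILE = "db_creation.log"

# prefer accounting/db_creation.log, but fall back to the current working
# directory if the subdirectory is missing and cannot be created
try:
    if not os.path.isdir(LOG_DIR):
        os.makedirs(LOG_DIR)
    log_path = os.path.join(LOG_DIR, LOG_FILE)
except OSError:
    log_path = LOG_FILE

logging.basicConfig(filename=log_path, level=logging.INFO)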
Ultimately, I understand these test cases will be hooked into a high-level test driver like a make check equivalent. But for the time being, it would be good to have a short README.md file explaining how to run this properly? It doesn't have to be fancy.
You can add a couple of sentences to the top-level README.md file as well. The first thing a developer would typically want to do is run a "hello world" type test, and documenting that in README.md goes a long way towards helping them ensure things are working after git clone, make, make install, or equivalents.
@cmoussa1: this seems to bring in lots of fundamental stuff! I added a few review comments related to developer/reviewer experiences on this project. If you want, we can merge this first and address my comments in a separate PR, or address them in this PR before landing it. With a project as early as yours, IMHO it doesn't make a lot of sense to try to minimize churn in the repo -- making quicker progress should be our priority for a while... Sorry I came to this so late in that regard.
add unit test file for create_db.py, which tests:
- valid database file creation,
- association table creation, and
- successful addition of an association to the table
remove build directory from version control since it is auto-generated by setup.py
add the following files and directories to .gitignore:
FluxAccounting.db
db_creation.log
FluxAccounting_test.db
build/
test/FluxAccounting.db
write_jobs.py was a first pass at writing inactive jobs to a SQLite DB and no longer belongs in this repo, so remove it
Thanks for all of your feedback @dongahn! I'll try my best to answer some of your comments, but if I miss anything else feel free to correct me:
No, this was not on purpose. I went ahead and pushed a change that adds the shebang line to the top of
Thanks for pointing this out. In this project's directory there should be a requirements.txt file that lists the pandas dependency, along with the version that I used. Once you have
You can run this command while in the
Thanks for poking around with the unit test functionality. I'm sorry you were running into multiple issues getting the unit tests to work properly. Here is how I run all of those unit test files at the same time: when I am in the top-level directory
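The exact command got cut off above; purely as an illustration, unittest's built-in discovery can run everything under the test directory, e.g. with a small driver script like this (the file and directory names are assumptions):

# run_tests.py -- illustrative only; the command actually used in the thread is elided above
import unittest

if __name__ == "__main__":
    # discover every test*.py module under test/ and run it
    suite = unittest.defaultTestLoader.discover(start_dir="test")
    unittest.TextTestRunner(verbosity=2).run(suite)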
I absolutely agree. A README describing the steps to build this directory and run unit tests could be very helpful. I could post a subsequent PR after this one is merged with a README containing those instructions along with improved error handling for the .db file path creation; if you feel it would be better to include it in this PR, however, that is fine with me. 🙂
Sounds good. Is this ready to go in now then?
Yes, this should be ready!
Thanks!
This is a proof-of-concept PR designed to gather feedback/suggestions on short-term steps related to the design of the accounting database for Flux. The code itself is probably not fully fleshed out yet (but the design is pretty much there), but it can and will be revised and improved as this PR matures.
Background
Currently, create_db.py just creates one table called inactive that sits in the .db file JobCompletion.db. I knew pretty much from the start that just one table would not be sufficient, and eventually we would have to create something similar to Slurm's current accounting database.

This PR's Proposal
This PR adds a brand new database creation to create_db.py that creates a .db file called FluxAccounting.db (both .db files are left in this branch for the time being). It creates multiple tables that range from job completion data to user account data. Specifically, it creates the following tables:

user_table
Contains user information as well as their admin level on a scale from 1 to 3 (1 being the lowest, 3 being the highest). This table is essentially just copied from Slurm.
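Purely illustrative (the PR's actual column names aren't shown here), such a user_table might look something like this in SQLite:

# illustrative sketch of a possible user_table -- column names are assumptions
import sqlite3

conn = sqlite3.connect("FluxAccounting.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS user_table (
        user_name    tinytext NOT NULL,
        admin_level  smallint DEFAULT 1 NOT NULL  -- 1 = lowest, 3 = highest
    );
    """
)
conn.commit()
conn.close()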
association_table
Contains an entry for every "association," which consists of a combination of a cluster, account, user name, and optional partition name. It contains limits: the number of shares, max jobs, max Trackable Resources (referred to as TRES from here on out) per job, max wall time per job, group TRES limits, and group wall time. This table is essentially just copied from Slurm, but trimmed down to only contain a subset of the limits present in Slurm's assoc_table.

qos_table
Contains a number of Quality-of-Service (QoS) categories that are used to help categorize a job's priority. This also adjusts limits that a job faces when submitted and run (to my knowledge, this is used in combination with an association's limits, meaning they can override them if they are higher). This table is essentially just copied from Slurm.
tres_table
Contains a label for every trackable resource. In Slurm, the following resources are trackable, labeled, and assigned a number: 1 = cpu, 2 = mem, 3 = energy, 4 = node, 5 = billing, 6 = disk, 7 = vmem, 8 = pages. An example entry in a job completion table would look like the following:
These values are also found in the assoc_usage_(hour | day | month) tables, where reports are generated using sinfo. This table is essentially just copied from Slurm.

assoc_usage_(hour | day | month)_table
This contains the usage for an association over a certain period of time: hour, day, or month. Like I mentioned above, these tables are used to generate accounting reports quickly with sinfo. These tables are essentially just copied from Slurm.

job_table
Contains an entry for every job that has run on a cluster. It contains information that's presently available with the flux jobs tool. Here is a comparison chart comparing what's needed for the Slurm job table versus what we can currently capture with flux jobs (if I miss something or mislabel an attribute, I apologize in advance!):

[comparison table: Slurm job table columns vs. flux jobs job attributes]

The Next Steps
In my opinion, only a subset of these tables might be useful to implement and take a closer look at for the time being: maybe just the job table, association table, user table, and qos table. These are probably the most relevant to implementing our fair-tree algorithm for calculating job priority.
Things like tracking association usage and managing WCKeys might not be the highest priority at the moment, and we can add/implement those tables at a later date. However, I'm completely open to suggestions and advice.
This was a lot of explanation; I hope I made sense of the progress I've made and the information I've found! Let me know if I need to explain anything further 🙂.