create_db.py: refactor database design for Flux accounting #6

Merged: 5 commits from design-DB into flux-framework:master, May 4, 2020

Conversation

cmoussa1
Member

This is a proof-of-concept PR designed to gather feedback and suggestions on short-term steps for the design of the accounting database for Flux. The code itself is probably not fully fleshed out yet (though the design is pretty much there), but it can and will be revised and improved as this PR matures.

Background

Currently, create_db.py just creates one table called inactive that sits in the .db file JobCompletion.db. I knew pretty much from the start that just one table would not be sufficient, and eventually we would have to create something similar to Slurm's current accounting database.

This PR's Proposal

This PR adds new database-creation code to create_db.py that produces a .db file called FluxAccounting.db (both .db files are left in this branch for the time being). It creates multiple tables, ranging from job completion data to user account data. Specifically, it creates the following tables:

user_table

Contains user information as well as their admin level on a scale from 1 to 3 (1 being the lowest, 3 being the highest). This table is essentially just copied from Slurm.

association_table

Contains an entry for every "association," which is a combination of a cluster, account, user name, and optional partition name. It also contains limits: the number of shares, max jobs, max trackable resources (referred to as TRES from here on out) per job, max wall time per job, group TRES limits, and group wall time. This table is essentially just copied from Slurm, but trimmed down to contain only a subset of the limits present in Slurm's assoc_table.
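
For illustration, here is a minimal sqlite3 sketch of a schema along these lines; the column names, types, and defaults are my guesses for the sketch, not necessarily the exact ones in create_db.py:

import sqlite3

# Illustrative sketch of the association_table described above; the
# exact column names, types, and defaults in create_db.py may differ.
conn = sqlite3.connect("FluxAccounting.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS association_table (
        id_assoc       integer PRIMARY KEY AUTOINCREMENT,
        cluster        tinytext NOT NULL,
        account        tinytext NOT NULL,
        user_name      tinytext NOT NULL,
        partition_name tinytext,           -- optional
        shares         int(11) DEFAULT 1,  -- fair-share allocation
        max_jobs       int(11),            -- max concurrent running jobs
        max_tres_pj    text,               -- max TRES per job
        max_wall_pj    int(11),            -- max wall time per job
        grp_tres       text,               -- group TRES limits
        grp_wall       int(11)             -- group wall time
    )
    """
)
conn.commit()
conn.close()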

qos_table

Contains a number of Quality-of-Service (QOS) categories that are used to help categorize a job's priority. A QOS also adjusts the limits a job faces when submitted and run (to my knowledge, QOS limits are applied in combination with an association's limits and can override them when they are higher). This table is essentially just copied from Slurm.

tres_table

Contains a label for every trackable resource. In Slurm, the following resources are trackable, labeled, and assigned a number: 1 = cpu, 2 = mem, 3 = energy, 4 = node, 5 = billing, 6 = disk, 7 = vmem, 8 = pages. An example entry in a job completion table would look like the following:

+-----------------+-------------+
| tres_alloc      | tres_req    |
+-----------------+-------------+
| 1=108,4=3,5=108 | 1=3,4=3,5=3 |
+-----------------+-------------+

These values are also found in the assoc_usage_(hour | day | month) tables, from which reports are generated using sreport. This table is essentially just copied from Slurm.
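
Since a TRES string is just a comma-separated list of id=count pairs, decoding one is straightforward. A small sketch (the helper name is mine, not something from this PR or Slurm):

# Hypothetical helper (not part of this PR): decode a Slurm-style
# TRES string such as "1=108,4=3,5=108" into a {tres_id: count} dict.
def parse_tres(tres_str):
    pairs = (item.split("=") for item in tres_str.split(",") if item)
    return {int(tres_id): int(count) for tres_id, count in pairs}

# With the labels above (1 = cpu, 4 = node, 5 = billing):
print(parse_tres("1=108,4=3,5=108"))  # {1: 108, 4: 3, 5: 108}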

assoc_usage_(hour | day | month)_table

This contains the usage for an association over a certain period of time: hour, day, or month. As I mentioned above, these tables are used to generate accounting reports quickly with sreport. These tables are essentially just copied from Slurm.

job_table

Contains an entry for every job that has run on a cluster. It holds much of the information that's presently available with the flux jobs tool. Here is a chart comparing what's needed for the Slurm job table versus what we can currently capture with flux jobs (if I've missed something or mislabeled an attribute, I apologize in advance!):

+---------------------+-------------------------+
| Slurm Job Attribute | flux jobs Job Attribute |
+---------------------+-------------------------+
| cpus_req            |                         |
| exit_code           | state                   |
| job_name            | name                    |
| id_assoc            |                         |
| id_job              | id                      |
| id_qos              |                         |
| id_wckey            |                         |
| id_user             | userid                  |
| id_group            |                         |
| nodelist            |                         |
| nodes_alloc         | nnodes                  |
| partition           |                         |
| priority            |                         |
| state               |                         |
| timelimit           |                         |
| time_submit         | t_submit                |
| time_sched          | t_sched                 |
| time_depend         | t_depend                |
| time_start          | t_run                   |
| time_end            | t_cleanup               |
| time_inactive       | t_inactive              |
| work_dir            |                         |
| tres_alloc          |                         |
| tres_req            |                         |
+---------------------+-------------------------+
The Next Steps

In my opinion, only a subset of these tables might be useful to implement and take a closer look at for the time being: maybe just the job table, association table, user table, and QOS table. These seem the most relevant to implementing our fair-tree algorithm for calculating job priority.

Things like tracking association usage and managing WCKeys might not be the highest priority at the moment, and we can add/implement those tables at a later date. However, I'm completely open to suggestions and advice.

This was a lot of explanation; I hope I made some sense in what progress I've made and what information I've found out! Let me know if I need to explain anything else further 🙂.

@chu11
Member

chu11 commented Mar 16, 2020

| exit_code | state |

"state" is actually the job state. We don't have the equivalent for exit code yet. The PR I have up right now for "success" is the cloest thing we have to exit code. flux-framework/flux-core#2831. For fair share, I assume "success" doesn't need to be in the table.

| id_user | userid |

just want to double check this is the numeric user ID and not the string username.

src/create_db.py Outdated
mod_time bigint(20) DEFAULT 0 NOT NULL,
deleted tinyint(4) DEFAULT 0 NOT NULL,
id_assoc integer PRIMARY KEY AUTOINCREMENT,
user tinytext NOT NULL,
Member

I notice "user" in this table, but "user_name" in the user_table. Just copied from Slurm?

Member Author

Yes, the names of these fields were just copied from Slurm. We could (and probably should) rename them to keep them consistent and easy to interpret, especially because there are multiple ways (at least three that I can think of) to identify a user:

  • Unix user ID
  • User's "Association" ID
  • User name (string)

src/create_db.py Outdated
print("Creating job_table in DB...")
conn.execute(
"""
CREATE TABLE IF NOT EXISTS job_table (
Member

In general with the job table, is there a reason to keep the column names as the Slurm names? Perhaps because reporting scripts will initially assume those column names?

Member Author

No reason in particular; it just made it easier for me to see which columns I was copying from. I'd be open to renaming them if they make more sense being named to something different!

Member

Although we should be flexible because things are still changing, I think going with Flux names vs. Slurm names is a good idea.

@cmoussa1
Member Author

| id_user | userid |

just want to double check this is the numeric user ID and not the string username.

Yes, this is the Unix user ID.

@cmoussa1
Member Author

From our discussion in #7, if we just go with the flux-core job-info DB, then this script wouldn't in fact need to create a job_table, right? We could instead just query that job table to populate our association_usage_<hour | day | month> tables and related information

@chu11
Member

chu11 commented Mar 19, 2020

From our discussion in #7, if we just go with the flux-core job-info DB, then this script wouldn't in fact need to create a job_table, right? We could instead just query that job table to populate our association_usage_<hour | day | month> tables and related information

If job info creates the DB, then correct, you would only query it.
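
To illustrate, rolling up hourly usage from that job table might look something like the sketch below; the table name "jobs" and the "userid"/"t_inactive" columns are assumptions for the sketch, not the actual flux-core schema:

import sqlite3
import time

# Hypothetical roll-up query; "jobs", "userid", and "t_inactive" are
# assumed names, not necessarily those in the flux-core job-info DB.
conn = sqlite3.connect("job-info.db")
hour_ago = time.time() - 3600
rows = conn.execute(
    """
    SELECT userid, COUNT(*) AS jobs_completed
    FROM jobs
    WHERE t_inactive >= ?
    GROUP BY userid
    """,
    (hour_ago,),
)
for userid, jobs_completed in rows:
    print(userid, jobs_completed)
conn.close()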

@cmoussa1
Member Author

Just pushed a change that removes the creation of a job record table. Instead, the database contains just the following types of tables:

  • users
  • associations
  • quality-of-service
  • trackable resources
  • association usage

This database structure represents more of a user accounting database, where the focus is on user/cluster-usage relationships and user limits enforcement.

@cmoussa1 changed the title from "[POC] create_db.py: refactor database design for Flux accounting" to "create_db.py: refactor database design for Flux accounting" on Mar 27, 2020
@cmoussa1
Member Author

Dropped the [POC] tag and pushed a change to keep the user_name field consistent in both user_table and association_table. I think this is ready for a review.

@grondo
Contributor

grondo commented Mar 27, 2020

Just some quick thoughts, sorry if these aren't all that helpful.

Before merging this PR, it might be useful to write down what this first step should be capable of, design a set of functionality tests that demonstrate that capability, and include these tests along with the PR. For example, if the current work is just a tool that would be used to create a user/account db for a flux instance, then add a test that creates the DB from scratch using the tool, along with some tests that verify proper creation of the db file.

As a casual reviewer, it would be easier to reason about the database design if there were additionally some functionality tests with mock data. Perhaps it is too early for that, but I don't really have a grasp of how the data from a Flux instance (presumably inactive job data) gets reduced and incorporated into this database, and how subsequently a tool queries the resulting database to create a job priority number.

You could even do something very simple at first with some mock data to verify that all the parts are present for the DB to do the most basic things.

Finally, we might find that we could start the Flux accounting database in a much simpler form rather than just copying the Slurm accounting DB fields, and still get what we need even in the medium term. (e.g. I don't think we are going to support QoS in the near term?)

To summarize, steps that might be helpful:

  1. Lay out the high level flow of data that goes into this DB, and at least one example of how and what data we'll get out, i.e. simple use case.
  2. Devise a simple test that demonstrates the use case presented in 1.
  3. Check in test(s) that run in GitHub workflow along with this PR

The above is just my suggestion, but in general this kind of test driven development has worked well for other flux-framework projects.

@cmoussa1 force-pushed the design-DB branch 3 times, most recently from baa4c24 to cc7698c, on March 30, 2020 18:44
@cmoussa1
Member Author

Thanks for your feedback @grondo - it was really helpful. Hopefully I addressed your suggestions; I made a couple of changes to this PR:

As a casual reviewer, it would be easier to reason about the database design if there were additionally some functionality tests with mock data. Perhaps it is too early for that, but I don't really have a grasp of how the data from a Flux instance (presumably inactive job data) gets reduced and incorporated into this database, and how subsequently a tool queries the resulting database to create a job priority number.

You're absolutely right, and as of now, I don't think we have this functionality. So the creation of association usage tables, QOS tables, and a job table probably isn't really needed right now.

I reduced the number of tables created by create_db.py to just two for now - a user table and an association table. Like you mentioned, for the short term, we can start out with something simple, where we just have a table of users and a table of associations. The other tables that were originally in the database were dropped, but can be added in at a later time, when we have functionality to grab job data and populate usage tables, generate priority values, change job priorities, etc.

I also created a unit test file, db_unittest.py, which tests for the following:

  • Successful creation of a SQLite database
  • Successful creation of tables within the database
  • An example entry in both the user table and the association table
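
A condensed sketch of what those checks might look like (the real db_unittest.py may differ in structure and names; the accounting module path is assumed for the sketch):

import os
import sqlite3
import unittest

from accounting import create_db  # module path assumed for this sketch


class TestDB(unittest.TestCase):
    DB_PATH = "test/FluxAccounting_test.db"

    @classmethod
    def setUpClass(cls):
        # create the database once for all of the tests below
        create_db.create_db(cls.DB_PATH)

    def test_00_db_file_created(self):
        # the .db file should exist on disk after creation
        self.assertTrue(os.path.exists(self.DB_PATH))

    def test_01_tables_exist(self):
        # both tables should be present in the new database
        with sqlite3.connect(self.DB_PATH) as db:
            rows = db.execute(
                "SELECT name FROM sqlite_master WHERE type='table'"
            )
            tables = {name for (name,) in rows}
        self.assertIn("user_table", tables)
        self.assertIn("association_table", tables)


if __name__ == "__main__":
    unittest.main()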

As we add more tables, functionality, and front-end commands (a couple that come to mind are adding a user to the database and editing an association's limits), we can add more unit tests, either to this file or to a new one to keep tests organized.

@chu11
Member

chu11 commented Mar 30, 2020

It would be nice to run the tests via make check, although I now notice there is no Makefile. That may be OK given it's so early in this project and it's all Python. Is there a common Pythonic way to do the equivalent of make dist and run unit tests, @trws, @SteVwonder?

@trws
Member

trws commented Mar 31, 2020

The conventional thing would be to use setuptools and make a setup.py file. That would support building the package, running tests, building a wheel, etc. The closest equivalent to make dist would be using one of the targets of a setup.py to build an egg, a wheel, a zip, or something similar. Like most sane build systems, there isn't really a make dist, since the expectation is that you build with the full source tree and a working environment. That said, you can use a setup.py to install it in a virtual environment and run tests on the result, or test in place, or whatever. It also makes it possible to be compatible with PyPI.
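
For reference, a minimal setup.py along those lines might look like the sketch below; the name, version, and other metadata are placeholders, not this project's actual configuration:

# Minimal illustrative setup.py; all metadata values here are
# placeholders, not this project's actual configuration.
from setuptools import setup, find_packages

setup(
    name="flux-accounting",
    version="0.0.1",
    description="Accounting database tooling for Flux",
    packages=find_packages(),
    install_requires=["pandas"],  # mirrors requirements.txt
)

With that in place, python setup.py sdist would build a source distribution (the rough make dist analogue), and the package could be installed into a virtual environment for testing.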

@cmoussa1
Member Author

Thanks @trws. Do you think we should first merge a PR that adds setup.py and setuptools support before merging this PR?

@trws
Member

trws commented Mar 31, 2020

I think it could reasonably be a separate step. As long as you feel confident that this is working and tested appropriately, I'd go forward and add that on. It's the kind of thing I normally would either do first, so I can use it while developing, or do on top of something I know is working.

@cmoussa1 force-pushed the design-DB branch 6 times, most recently from f310704 to 58c7ce9, on April 15, 2020 21:01
@cmoussa1
Member Author

cmoussa1 commented Apr 15, 2020

OK, rebased and pushed to catch up after #11, but I also made some additional changes:

  • one of the only differences between the user_table and association_table was an administrator level that was defined in the user table. I went ahead and merged the two tables into just one association table that also contains an admin_level field. I also narrowed down the fields in the association table to the following:
+---------------+-------------------------------------------------------------+
| Field         | Description                                                 |
+---------------+-------------------------------------------------------------+
| creation_time |                                                             |
| mod_time      |                                                             |
| deleted       |                                                             |
| id_assoc      | association ID                                              |
| user_name     |                                                             |
| admin_level   | administrative level; a basic user is level 1               |
| account       |                                                             |
| parent_acct   |                                                             |
| shares        | number of shares allocated to the user                      |
| max_jobs      | max number of jobs a user can have running at the same time |
| max_wall_pj   | max wall time per job submitted by the user                 |
+---------------+-------------------------------------------------------------+

So, as of now, the accounting database has three static limits per user: shares allocated (related to fair-share calculations), max jobs, and max wall time per job (a rough SQLite sketch of this table follows the list below). But since we are still discussing user policy limits in flux-framework/flux-sched #638, these limits could definitely be refined, removed, or added to. Perhaps the limit fields in this particular PR aren't all that necessary since we haven't reached a solid stance just yet; I can always remove those fields for now until we have a better idea of how we want to enforce user limits.

  • create_db.py now uses the logging Python module instead of print statements to provide updates on successful database and table creation, written as info statements. These get written to a log file named db_creation.log. Example entries:
INFO:root:Creating Flux Accounting DB
INFO:root:Created Flux Accounting DB successfully
  • the changes made to create_db.py are also reflected in its unit test file, test_create_db.py, which tests for the following:
    - valid database file creation,
    - association table creation, and
    - successful addition of an association to the table

As more tables or fields are added, this unit test file will be expanded to account for those additions.

  • the .db and .log files were added to .gitignore
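
As mentioned above, here is a rough SQLite rendering of the narrowed-down association_table; the types follow the Slurm-style ones quoted earlier in this thread, but the exact DDL in create_db.py may differ:

import sqlite3

# Rough sketch of the narrowed-down association_table; the exact
# types and defaults in create_db.py may differ.
conn = sqlite3.connect("FluxAccounting.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS association_table (
        creation_time bigint(20) NOT NULL,
        mod_time      bigint(20) DEFAULT 0 NOT NULL,
        deleted       tinyint(4) DEFAULT 0 NOT NULL,
        id_assoc      integer PRIMARY KEY AUTOINCREMENT,
        user_name     tinytext NOT NULL,
        admin_level   smallint DEFAULT 1 NOT NULL,
        account       tinytext NOT NULL,
        parent_acct   tinytext,
        shares        int(11) DEFAULT 1 NOT NULL,
        max_jobs      int(11),
        max_wall_pj   int(11)
    )
    """
)
conn.commit()
conn.close()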

@cmoussa1
Member Author

Pushed a couple more changes that I forgot to add yesterday:

  • write_jobs.py was just a first pass at writing inactive jobs to a database, but I think that functionality is being tackled by flux-framework/flux-core #2880's job archive, so I removed it from this repo,

  • the build directory gets auto-generated from setup.py, so I added it to .gitignore and removed it from version control as well.

@cmoussa1 cmoussa1 requested review from chu11 and dongahn April 20, 2020 16:47
conn = sqlite3.connect("test/FluxAccounting_test.db")

# create association table
conn.execute(
Member

Shouldn't this be calling create_db.py?

Member Author

Yes, you're right. I'm not sure why I was making the database from scratch... I can't remember. I just pushed a change so that the unit test file calls create_db.py with its own file path to the test directory.

reduce the tables created by create_db.py to
just an association table, which holds user
account information like an association id,
administrative level, and static job limit info
such as max jobs and max wall time per job

replace the print statements with logging to a
file, db_creation.log, which contains status messages
about both database creation and table initialization

Fixes flux-framework#1
@chu11 (Member) left a comment

LGTM! I'll approve the PR. Although @dongahn should take a look too.

@dongahn
Member

dongahn commented May 2, 2020

Coming at this late. This is now the highest priority for me (probably tonight or tomorrow).

@dongahn
Member

dongahn commented May 2, 2020

I'm probably missing some setup steps.

I tried to run a test in this PR (test_create_db.py) to see create_db.py in action, but I seem to be missing a dependency. I manually installed pip3 and pandas (Issue #12 created to capture some of these).

flux@9dd8a52ae9c3:/usr/src/test$ python3 test_create_db.py
Traceback (most recent call last):
  File "test_create_db.py", line 14, in <module>
    from accounting import create_db as c
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 656, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 626, in _load_backward_compatible
  File "/usr/local/lib/python3.6/dist-packages/flux_accounting-0.0.1-py3.6.egg/accounting/create_db.py", line 17, in <module>
  File "/usr/lib/python3.6/logging/__init__.py", line 1808, in basicConfig
    h = FileHandler(filename, mode)
  File "/usr/lib/python3.6/logging/__init__.py", line 1032, in __init__
    StreamHandler.__init__(self, self._open())
  File "/usr/lib/python3.6/logging/__init__.py", line 1061, in _open
    return open(self.baseFilename, self.mode, encoding=self.encoding)
FileNotFoundError: [Errno 2] No such file or directory: '/usr/src/test/accounting/db_creation.log'

@dongahn
Member

dongahn commented May 2, 2020

BTW, there is no #! python3 within test_create_db.py. Was this on purpose?

I initially tried to run this Python script by itself and realized the executable permission wasn't set. Then I mistakenly used python test_create_db.py, where python points to Python 2 (not 3), and got the pandas import error.

@dongahn
Member

dongahn commented May 2, 2020

I tried to run a test in this PR (test_create_db.py) to see create_db.py in action but I seem to miss a dependency. I manually installed pip3 and pandas (Issue #12 created to capture some of these).

OK. I manually created test/accounting/db_creation.log and got past the error!

But now,

flux@b2f80e8db92a:/usr/src/test$ python3 test_create_db.py
EEE
======================================================================
ERROR: test_00_test_create_db (__main__.TestDB)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_create_db.py", line 20, in test_00_test_create_db
    c.create_db("test/FluxAccounting.db")
  File "/usr/local/lib/python3.6/dist-packages/flux_accounting-0.0.1-py3.6.egg/accounting/create_db.py", line 23, in create_db
    conn = sqlite3.connect(filepath)
sqlite3.OperationalError: unable to open database file

======================================================================
ERROR: test_01_user_table_exists (__main__.TestDB)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_create_db.py", line 26, in test_01_user_table_exists
    with sqlite3.connect("test/FluxAccounting.db") as db:
sqlite3.OperationalError: unable to open database file

======================================================================
ERROR: test_02_create_association (__main__.TestDB)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_create_db.py", line 42, in test_02_create_association
    with sqlite3.connect("test/FluxAccounting.db") as db:
sqlite3.OperationalError: unable to open database file

----------------------------------------------------------------------
Ran 3 tests in 0.014s

@dongahn
Member

dongahn commented May 2, 2020

Once I manually created a test directory under test, I could run test_create_db.py! I will add some inline comments related to this.

class TestDB(unittest.TestCase):
# create database and make sure it exists
def test_00_test_create_db(self):
c.create_db("test/FluxAccounting.db")
Member

You should at least test the existence of the path. Better yet, perhaps create_db can do some more sanity checking on the given path.

LOGGER = logging.basicConfig(filename="accounting/db_creation.log", level=logging.INFO)


def create_db(filepath):
Member

This function should do some sanity checking on the filepath input and try to fail over if recoverable (e.g., missing subdirectories leading to the db file), or raise an exception or return an error code?



def create_db():
LOGGER = logging.basicConfig(filename="accounting/db_creation.log", level=logging.INFO)
Member

When I didn't have the accounting subdirectory, logging failed. Either recover from such a failure or dump the log file into the current working directory? Ultimately you may want to make this configurable, but the project is so early that I gather you want to make the fastest progress on core business logic.
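
One possible shape for that recovery, sketched below (this is not the code in this PR): create any missing parent directories before opening the database, and log to the current working directory so a missing accounting/ subdirectory can't break logging itself.

import logging
import os
import sqlite3

# Log to the current working directory so that a missing accounting/
# subdirectory cannot make logging itself fail.
logging.basicConfig(filename="db_creation.log", level=logging.INFO)


def create_db(filepath):
    # Sketch of the defensive behavior suggested above, not the
    # actual create_db.py from this PR.
    dirname = os.path.dirname(filepath)
    if dirname:
        # recover from missing subdirectories leading to the db file
        os.makedirs(dirname, exist_ok=True)
    try:
        conn = sqlite3.connect(filepath)
    except sqlite3.OperationalError:
        logging.exception("unable to open database file: %s", filepath)
        raise
    return conn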

@@ -0,0 +1,59 @@
###############################################################
Member

Ultimately, I understand these test cases will be hooked into a high-level test driver like a make check equivalent. But for the time being, it would be good to have a short README.md file explaining how to run this properly? It doesn't have to be fancy.

You can add a couple of sentences to your top-level README.md file as well. The first thing a developer typically wants to do is run a "hello world" type test, and documenting that in README.md goes a long way toward helping them ensure things are working after git clone, make, make install, or the equivalents.

@dongahn
Member

dongahn commented May 2, 2020

@cmoussa1: this seems to bring in lots of fundamental stuff!

I added a few review comments related to developers/reviewers experiences on this project. If you want, we can merge this first and address my comments in a separate PR or address them in this PR before landing it.

With a project as early as yours, IMHO it doesn't make a lot of sense to try to minimize churn in the repo -- making quicker progress should be our priority for a while... Sorry I came to this so late in that regard.

add unit test file for create_db.py, which tests:
- valid database file creation,
- association table creation, and
- successful addition of an association to the table

remove build directory from version control since it is auto-generated by setup.py

add the following files and directories to .gitignore:
FluxAccounting.db
db_creation.log
FluxAccounting_test.db
build/
test/FluxAccounting.db

write_jobs.py was a first pass at writing inactive jobs to a SQLite DB and no longer belongs in this repo, so remove it
@cmoussa1
Member Author

cmoussa1 commented May 4, 2020

Thanks for all of your feedback @dongahn! I'll try my best to answer your comments, but if I miss anything, feel free to correct me:

BTW, there is no #! python3 within test_create_db.py. Was this on purpose?

No, this was not on purpose. I went ahead and pushed a change that adds the shebang line to the top of test_create_db.py.

I tried to run a test in this PR (test_create_db.py) to see create_db.py in action but I seem to miss a dependency. I manually installed pip3 and pandas (Issue #12 created to capture some of these).

Thanks for pointing this out. In this project's directory there should be a requirements.txt file that lists the pandas dependency, along with the version that I used. Once you have pip installed, the command to install all dependencies listed in the requirements.txt file is:

pip install -r requirements.txt

You can run this command while in the accounting directory, where the requirements file is located.

Once I manually created test directory under test, I can now run test_create_db.py! I will add some inline comments related to this.

Thanks for poking around with the unit test functionality. I'm sorry that you ran into multiple issues getting the unit tests to work properly. Here is how I run all of those unit test files at the same time:

When I am in the top-level directory accounting, I run the following command: python -m unittest discover. This goes into the test directory and runs all of the unit tests for you, so you don't have to run each file individually. There is no need to make a test directory inside of the first test directory if you run python -m unittest discover from the top-level directory accounting.

Ultimately, I understand these test cases will be hooked into a high level test driver like make check equivalent. But for the time being, it would be good to have a short README.md file explaining how to run this properly? Don't have to be fancy though.

I absolutely agree. A README describing the steps to build this directory and run the unit tests could be very helpful. I could post a subsequent PR after this one is merged with a README containing those instructions, along with improved error handling for the .db file path creation; if you feel it would be better to include it in this PR, however, that is fine with me. 🙂

@dongahn
Member

dongahn commented May 4, 2020

Sounds good. Is this ready to go in now then?

@cmoussa1
Member Author

cmoussa1 commented May 4, 2020

Yes, this should be ready!

@dongahn dongahn merged commit 69d134d into flux-framework:master May 4, 2020
@cmoussa1
Member Author

cmoussa1 commented May 4, 2020

Thanks!
