sparse data --- in search of a good representation #2

Closed
BurntSushi opened this issue Aug 20, 2013 · 1 comment
@BurntSushi (Owner)

In designing the schema for nfldb, I've run into an interesting problem: sparse data. I'll talk briefly about what I've looked into and what I've decided on for now. My goal is to lay out my decision procedure so that it can be improved upon by future willing participants who are smarter than me.

Each play has a set of statistics associated with it. Some of them are metadata about the play itself: for example, whether it was a third down attempt (and whether it succeeded), or any penalty yards that accrued on the play. The other statistics are tied to a particular player: for example, passing yards or a tackle.

There is only one set of meta statistics about a play, but there is usually more than one set of player statistics for a play.

The problem is that the set of possible statistics is quite large (a little over 100), but any particular combination of them in a play is quite small. Moreover, each statistical category must be searchable using the standard set of comparison operators: =, !=, <, <=, > and >=. To my knowledge, there are no fewer than four different approaches to storing this kind of data in PostgreSQL.
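To make the searchability requirement concrete, here is a hypothetical example of the kind of query the nfldb API would need to generate. The table and column names are placeholders for illustration, not the actual schema:

```sql
-- Hypothetical: find plays where a player gained at least 15 passing yards.
SELECT play_id, player_id
FROM play_player
WHERE passing_yds >= 15;
```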

Use an entity-attribute-value ("EAV") table

An EAV table is a "skinny" table with only a few columns: one is a foreign key (which would link it to a play in this case), another is an attribute name and the last is a value. In this case, the value is always an integer except for sacks, which can be fractional. Therefore, the value column would need type real, which wastes at least 2 bytes for the vast majority of rows (smallint versus real).

An index on this table would be simple.
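A minimal sketch of what the EAV layout might look like, assuming a play table keyed by play_id (all names here are illustrative):

```sql
-- Illustrative EAV ("skinny") table: one row per (play, statistic) pair.
CREATE TABLE play_stat (
    play_id   integer NOT NULL,      -- foreign key to the play
    stat_name varchar NOT NULL,      -- e.g. 'passing_yds', 'defense_sk'
    stat_val  real    NOT NULL,      -- real because sacks can be fractional
    PRIMARY KEY (play_id, stat_name)
);

-- A single composite index covers searches by statistic and value.
CREATE INDEX play_stat_name_val ON play_stat (stat_name, stat_val);
```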

Querying this table would be extremely difficult. Every query would require some kind of pivot of rows into columns, which can be exceedingly complex to write. Since the main focus of nfldb is to provide a simple API that will automatically write queries for you, this would greatly increase the complexity of query generation.
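For example, a search for plays with at least 15 passing yards and at least one sack needs a separate self-join (or pivot) per statistic being filtered on, all of which the query generator would have to assemble. A rough sketch against the hypothetical play_stat table above:

```sql
-- One join per statistic in the filter.
SELECT pass.play_id
FROM play_stat AS pass
JOIN play_stat AS sack
  ON sack.play_id = pass.play_id
 AND sack.stat_name = 'defense_sk'
 AND sack.stat_val >= 1
WHERE pass.stat_name = 'passing_yds'
  AND pass.stat_val >= 15;
```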

For these reasons, EAV is typically considered an anti-pattern. I would be willing to accept the overhead of using real for each statistic, but the complexity of writing queries is a showstopper for me here.

Have a single table with many columns

In this case, most rows will have NULL for the majority of columns. In my reading, I've learned that PostgreSQL stores NULL values efficiently: each NULL takes up a single bit in a bitmap for the row. I would consider that acceptable overhead.

The problem here is having a table with a huge number of columns and the overhead that entails. In particular, I would very much like to have an index on each of the columns for efficient searching. But how does that scale to ~100 columns?
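A sketch of the wide-table approach, showing only a handful of the ~100 columns (names are illustrative):

```sql
-- Wide, sparse table: one nullable column per statistical category.
CREATE TABLE play_player (
    play_id       integer  NOT NULL,
    player_id     varchar  NOT NULL,
    passing_yds   smallint NULL,
    rushing_yds   smallint NULL,
    receiving_yds smallint NULL,
    defense_sk    real     NULL,    -- sacks can be fractional
    -- ... roughly 100 more statistic columns ...
    PRIMARY KEY (play_id, player_id)
);

-- One index per searchable column.
CREATE INDEX play_player_passing_yds ON play_player (passing_yds);
CREATE INDEX play_player_rushing_yds ON play_player (rushing_yds);
```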

Use the hstore extension

This looked really promising at first, but it stores data as text. This would require casts every single time a query used a comparison operator on a particular statistic. (I don't know what the overhead of that is, but I'm hard-pressed to believe a cast from text to a numeric type is free.) Moreover, while indexes can be defined on an hstore column, I don't think they can be (easily) defined on a particular key in an hstore column. That's two showstoppers IMO. If I'm wrong about my understanding of hstore, I'd be happy to be corrected.
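To illustrate the casting problem, assuming the statistics lived in an hstore column called stats on a hypothetical play_player table, every comparison would look something like this:

```sql
-- hstore values come back as text, so every comparison needs a cast.
SELECT play_id
FROM play_player
WHERE (stats -> 'passing_yds')::real >= 15;
```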

Use an array

I think this is a strictly superior solution to hstore since it can store an array of real as opposed to text. But this suffers from using extra space just like the EAV solution does, and I don't think indexes can be created on a particular element of an array. Moreover, arrays are indexed numerically, which implies we'd need a mapping between each statistic and its numeric index. That increases the complexity of query generation too.
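Assuming the statistics were packed into a real[] column and the application maintained a fixed mapping from statistic name to array position (say, passing yards at slot 17), a query would look something like:

```sql
-- The numeric slot (17 here) is an arbitrary, application-defined mapping.
SELECT play_id
FROM play_player
WHERE stats[17] >= 15;
```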


In light of the above analysis, I've decided to just go ahead with creating a sparse table. The trade-offs seem more attractive to me, particularly given that the space overhead will be fairly low and the query generation will remain simple. A possible alternative is to divide statistics into common groupings and use a table for each grouping (for example, passing, rushing, receiving, defense, etc.); a rough sketch follows below. I think that's a very viable option going forward, but I consider it a premature optimization at this point. It may help a lot if it turns out that indexes on a 100-column table suck.
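For reference, the grouping alternative might look roughly like this, with one narrow table per category (the split and names are illustrative, not a proposal for the actual schema):

```sql
CREATE TABLE play_player_passing (
    play_id     integer NOT NULL,
    player_id   varchar NOT NULL,
    passing_att smallint,
    passing_yds smallint,
    passing_tds smallint,
    PRIMARY KEY (play_id, player_id)
);

CREATE TABLE play_player_defense (
    play_id     integer NOT NULL,
    player_id   varchar NOT NULL,
    defense_sk  real,               -- sacks can be fractional
    defense_int smallint,
    PRIMARY KEY (play_id, player_id)
);
```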

@BurntSushi (Owner, Author)

The above was written before I designed the database. Now that things are in full swing with a single table and a bunch of columns, I'm fairly happy. Adding indexes to each column when the database is first imported takes a bit of time (seems to be about 5 minutes on decent hardware), but otherwise things are pretty snappy.

If anyone else has other ideas, please open a new issue.
