In designing the schema for nfldb, I've run into an interesting problem: sparse data. I'll talk briefly about what I've looked into and what I've decided on for now. My goal is to lay out my decision procedure so that it can be improved upon by future willing participants who are smarter than I am.
Each play has a set of statistics associated with it. Some of them are metadata about the play itself: for example, whether it was a third down attempt (and whether it succeeded), or any penalty yards that accrued on the play. The other statistics are tied to a particular player: for example, passing yards or a tackle.
There is only one set of meta statistics about a play, but there is usually more than one set of player statistics for a play.
The problem is that the set of possible statistics is quite large (a little over 100), but any particular play uses only a small subset of them. Moreover, each statistical category must be searchable using the standard set of comparison operators: =, !=, <, <=, > and >=. To my knowledge, there are no fewer than four different approaches to storing this kind of data in PostgreSQL.
Use an entity-attribute-value ("EAV") table
An EAV table is a "skinny" table with only a few columns: one is a foreign key (which would link it to a play in this case), another is an attribute name and the last is a value. In this case, the value is always an integer except for sacks, which can be fractional. Therefore, the value column would need type real (which wastes at least 2 bytes per row for the vast majority of values, i.e., smallint versus real).
An index on this table would be simple.
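To make the shape concrete, here is a minimal sketch of the EAV layout; the table and column names are hypothetical, and play_id stands in for whatever key the play table actually uses.

```sql
-- Hypothetical EAV layout: one row per (play, statistic) pair.
CREATE TABLE play_stat_eav (
    play_id   integer NOT NULL,  -- would reference the play table's key
    stat_name text    NOT NULL,  -- e.g. 'passing_yds', 'defense_sk'
    stat_val  real    NOT NULL,  -- real because sacks can be fractional
    PRIMARY KEY (play_id, stat_name)
);

-- A single composite index covers searches by statistic and value.
CREATE INDEX play_stat_eav_name_val ON play_stat_eav (stat_name, stat_val);
```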
Querying this table would be extremely difficult. Every query would require some kind of pivot of rows into columns, which can be exceedingly complex to write. Since the main focus of nfldb is to provide a simple API that will automatically write queries for you, this would greatly increase the complexity of query generation.
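As a rough illustration of the pivot problem, here is what a search for plays with, say, at least 15 passing yards and a passing touchdown might look like against the hypothetical EAV table above; every statistic mentioned in the search needs its own conditional aggregate (or self-join).

```sql
-- One conditional aggregate per statistic referenced in the search.
SELECT s.play_id
FROM play_stat_eav AS s
GROUP BY s.play_id
HAVING max(CASE WHEN s.stat_name = 'passing_yds' THEN s.stat_val END) >= 15
   AND max(CASE WHEN s.stat_name = 'passing_tds' THEN s.stat_val END) >= 1;
```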
For these reasons, EAV is typically considered an anti-pattern. I would be willing to accept the overhead of using real for each statistic, but the complexity of writing queries is a showstopper for me here.
Have a single table with many columns
In this case, most rows will have NULL for the majority of columns. In my reading, I've learned that PostgreSQL stores NULL values efficiently: each NULL takes up a single bit in a bitmap for the row. I would consider that acceptable overhead.
The problem here is having a table with a huge number of columns and the overhead that entails. In particular, I would very much like to have an index on each of the columns for efficient searching. But how does that scale to ~100 columns?
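For reference, this is the general shape I mean, abridged to a handful of hypothetical statistic columns:

```sql
-- Wide, sparse layout (abridged): one column per statistic, NULL when absent.
CREATE TABLE play_stat_wide (
    play_id     integer PRIMARY KEY,
    passing_yds smallint,
    passing_tds smallint,
    rushing_yds smallint,
    defense_sk  real,      -- sacks can be fractional
    penalty_yds smallint
    -- ... and so on for each of the ~100 statistics
);

-- One index per statistic column; this is the part whose scaling worries me.
CREATE INDEX play_stat_wide_passing_yds ON play_stat_wide (passing_yds);
CREATE INDEX play_stat_wide_rushing_yds ON play_stat_wide (rushing_yds);
```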
Use the hstore extension
This looked really promising at first. But it stores data as text. This would require a cast every single time a query used a comparison operator on a particular statistic. (I don't know what the overhead of that is, but I'm hard-pressed to believe a "cast" from text to numeric is free.) Moreover, while indexes can be defined on an hstore column, I don't think they can be (easily) defined on a particular key in an hstore column. That's two showstoppers IMO. If I'm wrong about my understanding of hstore, I'd be happy to be corrected.
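For comparison, here is roughly what the same search would look like against a hypothetical hstore column, casts and all. (An expression index on a single key does appear to be possible, as sketched below, but that puts us back at one index per statistic anyway.)

```sql
-- Every comparison goes through a text-to-real cast.
SELECT play_id
FROM play_hstore
WHERE (stats -> 'passing_yds')::real >= 15
  AND (stats -> 'passing_tds')::real >= 1;

-- A per-key expression index, if it comes to that:
CREATE INDEX play_hstore_passing_yds
    ON play_hstore (((stats -> 'passing_yds')::real));
```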
Use an array
I think this is a strictly superior solution to hstore since it can store an array of real as opposed to text. But it suffers from the same extra space usage as the EAV solution, and I don't think indexes can be created on a particular element of an array. Moreover, arrays are indexed numerically, which means we'd need a mapping between each statistic and its numeric position. That increases the complexity of query generation too.
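A quick sketch of the array layout, with made-up positions, shows where that mapping creeps in:

```sql
-- Array layout: the name-to-position mapping lives outside the database.
CREATE TABLE play_stat_arr (
    play_id integer PRIMARY KEY,
    stats   real[]  -- e.g. stats[1] = passing_yds, stats[2] = passing_tds, ...
);

-- Query generation has to translate 'passing_yds' into its position first.
SELECT play_id
FROM play_stat_arr
WHERE stats[1] >= 15
  AND stats[2] >= 1;
```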
In light of the above analysis, I've decided to just go ahead with creating a sparse table. The trade-offs seem more attractive to me, particularly given that the space overhead will be fairly low and the query generation will remain simple. A possible alternative is to divide statistics into common groupings and use a table for each grouping (for example, passing, rushing, receiving, defense, etc.). I think that's a very viable option going forward, but I consider it to be a premature optimization at this point. It may help a lot if it turns out that indexes on a 100-column table suck.
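For completeness, the grouped alternative would look roughly like this; the groupings and names are hypothetical:

```sql
-- One narrower table per statistical category instead of one ~100-column table.
CREATE TABLE play_passing (
    play_id     integer PRIMARY KEY,
    passing_yds smallint,
    passing_tds smallint,
    passing_int smallint
);

CREATE TABLE play_rushing (
    play_id     integer PRIMARY KEY,
    rushing_yds smallint,
    rushing_tds smallint
);
-- ... plus receiving, defense, kicking, etc., joined on play_id as needed.
```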
The above was written before I designed the database. Now that things are in full swing with a single table and a bunch of columns, I'm fairly happy. Adding indexes to each column when the database is first imported takes a bit of time (seems to be about 5 minutes on decent hardware), but otherwise things are pretty snappy.
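For the record, the index creation itself is mechanical. This isn't the actual import code, just a sketch of the idea, assuming a hypothetical statistics table named play_stat_wide whose statistic columns are all smallint or real:

```sql
-- Generate one index per statistic column in a single pass (PL/pgSQL sketch).
DO $$
DECLARE
    col text;
BEGIN
    FOR col IN
        SELECT column_name
        FROM information_schema.columns
        WHERE table_schema = 'public'
          AND table_name = 'play_stat_wide'
          AND data_type IN ('smallint', 'real')
    LOOP
        EXECUTE format('CREATE INDEX %I ON play_stat_wide (%I)',
                       'play_stat_wide_' || col || '_idx', col);
    END LOOP;
END$$;
```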
If anyone else has other ideas, please open a new issue.