Skip to content

GSOC 2015 Michael Mueller: Adding indexing capability to Table object

Michael Mueller edited this page Mar 27, 2015 · 5 revisions

Google Summer of Code Application for 2015

Background

I am currently a first-year student at the University of Massachusetts Amherst, majoring in mathematics and computer science. I am also interested in physics and enjoy reading about physics (especially the Feynman Lectures). This semester, I am working on a research project to study the melting of "active" crystals, whose constituent elements have a biased rather than purely random pattern of motion. The melting is simulated by small (6 mm wide) plastic particles vibrated on a shaker at ~50 Hz, with particular geometries that cause biased motion. I have found it really interesting to see how programming work can be useful for physics experiments, as a major part of my project has involved image processing in Python using scikit-image and scipy.ndimage (for convolutions and so forth).

Last year, I participated in GSOC with Astropy and completed a project to write an optimized C/Cython engine for io.ascii based on the approach used by Pandas for read_csv. My blog from the summer is viewable at http://muellergsoc.blogspot.com/ and the primary pull request involved is at https://github.com/astropy/astropy/pull/2716. My mentors were Tom Aldcroft, Erik Bray, and Michael Droettboom.

In addition to programming, I enjoy running, reading, and playing or listening to music in my spare time.

Programming Information

I run Ubuntu Linux on my laptop, and my preferred editors generally vary by language. For standard text editing and programming in C++, I use emacs. However, I use Eclipse for Java projects and in Python will use either emacs or occasionally IDLE. Although emacs has a bit of a learning curve, I find it very useful as an editor because its numerous commands and macros allow for faster and more powerful editing.

My programming background extends back to 7th grade, when I discovered MIT’s educational programming tool Scratch. After playing around with Scratch and reading more about programming, I used online tutorials to teach myself Java. From making small computer games with friends to trying out programming challenges like Project Euler and Code Golf, I then continued to immerse myself in the world of programming and soon picked up experience with C++ and Python. Since then, I have continued to enjoy programming recreationally. While I haven’t often worked on large programming projects, one project I particularly enjoyed working on was the creation of an OpenGL-based 3D engine in C++ (viewable at https://github.com/amras1/opengl-engine). This project was exciting to work on because it involved learning about graphics programming, which contains interesting mathematical underpinnings (such as matrix transformations and quaternions). I also incorporated simple models of mathematical and physical phenomena in the engine, such as Lindenmayer systems (or L-systems), which allow for the rendering of fractal patterns which imitate such natural objects as plants, and particle systems, which can be used to create interesting effects like fireworks or the flow of a water fountain. Although I didn’t implement very advanced versions of these effects, I enjoyed discovering new intersections between math, physics, and programming. These intricate relationships continue to excite me, and I hope to explore the applications of computer science to other fields in the future. I have also previously worked on an extensible zombie apocalypse text adventure game, which may be viewed at https://github.com/amras1/zombie-text-adventure.

I have been using Python in particular for about three years, and I find it very useful as a language whenever high-level programming is appropriate. Although there is somewhat of a performance hit in using Python compared to more mid-level languages like C or C++, its ease of use and natural syntax have allowed me to program more quickly and with less propensity for error. I particularly enjoy the most distinctive, or “Pythonic”, aspects of Python, such as list comprehensions, generators, and lambda expressions. In fact, I’ve often found that when I return to C++ or Java after using Python for a while, I become annoyed at having to translate one of these features into a more cumbersome syntax. In my opinion, Python’s most useful language feature is the existence of iterables and functions relating to them. for i, elem in enumerate(elem_set): foo(elem, i) is far clearer and easier to use than the C++ equivalent for (std::set<int>::iterator it = elem_set.begin(); it != elem_set.end(); ++it) { foo(*it, it - elem_set.begin()); } and standard library functions like map() and zip() make it much simpler to operate on elements of a container.

From working on my GSOC project last year, I am fairly comfortable with Cython and git, although I could still use some experience with more advanced commands, complicated rebases, etc. For my previous work, see my primary PR from last summer: https://github.com/astropy/astropy/pull/2716. I also have PRs from last year for minor issues (http://github.com/astropy/astropy/pull/2110, https://github.com/astropy/astropy/pull/2114, and https://github.com/astropy/astropy/pull/2142). These pull requests have involved adding documentation for time formats supported by Astropy, removing exceptions from comparison operators in the Time class, and documenting classes in astropy.io.ascii.core. I also implemented an HTML reader and writer in astropy.io.ascii, which may be viewed at https://github.com/astropy/astropy/pull/2160.

Project Details

Abstract

The Table class is a central data structure in Astropy and is useful for manipulating tabular data in general. As of now, the Table class offers functionality for certain database operations analogous to those in SQL, such as grouping (splitting the table into chunks by common fields), joining (combining tables based on matching values), etc. It would be useful to add more advanced database operations to the Table class; in particular, adding indexing capability to Table would allow for increased efficiency in searching for individual records in a table, and primary key indexing (such as by time) would improve the functionality of Table in dealing with time series data. My proposal will involve implementing indexing functionality for Table while ensuring that adding and removing records from each table works correctly in terms of the internal indexing system and that addition/removal operations suffer minimal slowdown due to indexing.

Detailed Description

In addition to serving as a general-purpose container for tabular data, the Table class in Astropy has useful functionality such such as metadata and missing data handling, and is designed to interface with other packages in the Astropy project; for example, tables can be input or output with astropy.io.fits and astropy.io.ascii. The Table class also supports some of the database operations present in SQL and Pandas, such as grouping, joining, and stacking tables together. However, Table currently lacks the capability to perform indexing on columns to improve lookup performance, which other database software systems provide. My proposal will aim to bring this functionality to Astropy and thereby increase the speed of table lookups as well as allowing for a system of unique row identifiers in an Astropy Table.

In an ordinary table without indexing, searching for a row based on one of its column values takes O(n), or linear time. For example, in a table of 100,000 astronomical observations where one column records the exact time at which an observation was taken, it might take up to 100,000 operations to search through the table for an observation taking place at 9:00 AM. The principle behind indexing is to maintain some sorted order for a given column, which will reduce the time complexity of a search for a given record. For example, if the entries are sorted by their observation time, a binary search on the table is possible and so the number of operations required is at most log2(100,000) = 17, rounded up. This is a clear computational speedup for large data sets, and the principle can be applied either to individual columns or to a "primary key" that uniquely identifies each table row. One drawback of table indexing is that insertions and removals of table records become more difficult as the sorted index order must be maintained, which slows down these operations to some degree. Therefore, one challenge in designing an indexing system is to minimize this unavoidable slowdown.

I will begin by assessing the current performance of database operations in Table using the open source asv benchmarking tool (https://github.com/spacetelescope/asv), which can be used for comparison later to determine the effectiveness of column indexing. After doing so and then writing high-level code to dispatch the task of column indexing to a separate data structure, I will implement this data structure and test it extensively. One possible data structure, used in MySQL, is a B-tree. A B-tree is similar to a binary search tree with the exception that each node can have more than two children, and each operation on a B-tree (searching, sequential access, insertion, and deletion) runs in O(log n) time. This is in contrast with a non-indexed table system, in which insertion and deletion are constant-time operations and searching is linear, or O(n). Thus, a B-tree is advantageous when the number of insertions or deletions is small relative to the number of expected searches. Another potential data structure is a hash table, which functions as an associative array mapping ID values to records.

Once this is done, I plan on integrating this new data structure together with the underlying data implementation in Table to complete a bare-bones indexing system for Table. At this stage, I will organize the column indexes in a non-clustered fashion, meaning that the sorting information for the column will be separate from the internal structure of the column itself. The next step will be to ensure that the indexing order remains consistent as the columns change or properties of the overall table change, and I plan to research any algorithms that might be used to minimize the time/memory impact of these changes. Another useful addition will be to designate a single column as the "primary key" for a table, meaning that each row can be uniquely identified by its value in the column in the same way that the Series class in Pandas associates each data point with a unique ID. Ideally, the primary key will be a "clustered" index, so that the data itself will be sorted internally based on the primary key values. A significant application of this primary key functionality will be the ability to create a Table for time series data and designate a time column as the primary key, so that there exists an efficient API for retrieving table rows by time value.

After further optimization of the structures previously implemented, I will make use of table indexing wherever possible by altering the database operations in Table (as well as another relevant Astropy code) to use table indexing. Finally, if there is time left over at the end of the summer after I have implemented the ideas defined above, I will implement additional features for the Table class suggested by my mentors, such as allowing for nested tables or the creation of dynamically computed columns. While it is possible that there will be insufficient time to begin work on any extra features, such additions to the Table class could definitely be useful.

Timeline

Community bonding period (April 27 — May 24): Become more familiar with the existing database functionality in Table, look at the Pandas codebase to understand how the Series class handles indexing, and read about B-trees (which might be used as an indexing data structure).
May 25 — May 31 (1 week): Using asv, create benchmarks for current Table database operations and row retrieval by column value, which should scale linearly with number of rows. Write high-level code to keep a (not yet implemented) indexing data structure in memory and pass data from the Table class into this data structure. This will serve as a temporary means for creating an index by passing control to the indexing data structure through new methods (e.g. create_index(col)).
June 1 — June 14 (2 weeks): After discussing with my mentors which data structure would make the most sense to use for indexing in Table, such as a B-tree or a hash table, implement this structure. Include a testing suite to ensure that the structure works properly, and take initial benchmarks of the structure's performance using asv.
June 15 — June 21 (1 week): Without worrying about maintaining the integrity of the index as column or table properties change, incorporate the new indexing data structure into the Table class by writing a basic indexing API (e.g. a method create_index(col) and methods to select rows based on column values). Dispatch a separate data structure for each indexed column without affecting the internal data itself, i.e. make the indexes non-clustered.
June 22 — July 5 (2 weeks): Draw on indexing implementations in Pandas and MySQL, as well as any other possible algorithms, to decide how to maintain the integrity of the index data structure upon row insertion/removal or any other change of table properties. Implement this functionality and test with ordinary data as well as corner cases.
July 6 — July 12 (1 week): For the special case where a unique identifier for each row is useful, implement a method to designate a single column as a so-called "primary key". If feasible, make the primary key a clustered index, i.e. make the indexing information part of the Table itself. For example, this might mean sorting rows in memory based on their primary key values.
July 13 — July 19 (1 week): Take another set of benchmarks and work on optimization of the indexing data structure, the algorithm(s) used to preserve the sorted order of the index, and any other time-critical operations. Rewrite performance bottlenecks in Cython.
July 20 — August 2 (2 weeks): Go through the existing code in Table, particularly the basic database operations currently supported, and adjust these methods to make use of table indexing wherever doing so would improve performance. Also modify functions outside the astropy.table package that would benefit from the new indexing capability of Table. Document the new API for table indexing and write tests for all new changes to the Astropy code base.
August 3 — August 16 (2 weeks): These two weeks will serve as a buffer period in case earlier steps of the proposal require more time, unforeseen difficulties arise, etc. If everything runs smoothly and the buffer period is unnecessary, then work on implementing additional features for the Table class. For example, it would be nice to have nested tables or Excel-style dynamically computed columns with dependencies.
August 17 (suggested pencils down date) — August 21 (final evaluation deadline):
  Write documentation for all previous code and add any tests I may have missed during the main coding period. Look for potential bugs and fix any that arise.

Additional Information

I have a blog from last summer (http://muellergsoc.blogspot.com/), which I intend to use again weekly this summer for progress reports. I will also be available for regular contact with my project mentors over the course of the summer through IRC, email, phone calls, or any other mutually convenient form of communication.

Clone this wiki locally