GSoC 2014 Michael Mueller: High performance ASCII table reader and memory view tables

Google Summer of Code Application for 2014

Background

I am a current high school senior, and my academic interests lie primarily in math, physics, and programming. Mathematics has always interested me as a subject, and I have taken online classes in multivariable calculus and differential equations with a local community college, as well as linear algebra with Oklahoma State University, after exhausting my high school's curriculum. More recently, I have become interested in physics because I enjoy learning about how mathematics applies to the natural world. In addition to programming, I enjoy running, reading, and playing or listening to music in my spare time.

Programming Information

I run Ubuntu Linux on my laptop, and my preferred editors generally vary by language. For standard text editing and programming in C++, I use emacs. However, I use Eclipse for Java projects and will use either emacs or IDLE when programming in Python. Although emacs has a bit of a learning curve, I find it very useful as an editor because its numerous commands and macros allow for faster and more powerful editing. I also often use IDLE simply because I’ve been using it longer than emacs and I like its syntax highlighting and interactive shell.

My programming background extends back to 7th grade, when I discovered MIT’s educational programming tool Scratch. After playing around with Scratch and reading more about programming, I used online tutorials to teach myself Java. From making small computer games with friends to trying out programming challenges like Project Euler and Code Golf, I then continued to immerse myself in the world of programming and soon picked up experience with C++ and Python. Since then, I have continued to enjoy programming recreationally. While I haven’t often worked on large programming projects, one project I particularly enjoyed working on was the creation of an OpenGL-based 3D engine in C++ (viewable at https://github.com/amras1/opengl-engine). This project was exciting to work on because it involved learning about graphics programming, which contains interesting mathematical underpinnings (such as matrix transformations and quaternions). I also incorporated simple models of mathematical and physical phenomena in the engine, such as Lindenmayer systems (or L-systems), which allow for the rendering of fractal patterns which imitate such natural objects as plants, and particle systems, which can be used to create interesting effects like fireworks or the flow of a water fountain. Although I didn’t implement very advanced versions of these effects, I enjoyed discovering new intersections between math, physics, and programming. These intricate relationships continue to excite me, and I hope to explore the applications of computer science to other fields in the future. I have also previously worked on an extensible zombie apocalypse text adventure game, which may be viewed at https://github.com/amras1/zombie-text-adventure.

I have been using Python in particular for about three years, and I find it very useful as a language whenever high-level programming is appropriate. Although there is somewhat of a performance hit in using Python compared to more mid-level languages like C or C++, its ease of use and natural syntax have allowed me to program more quickly and with less propensity for error. I particularly enjoy the most distinctive, or “Pythonic”, aspects of Python, such as list comprehensions, generators, and lambda expressions. In fact, I’ve often found that when I return to C++ or Java after using Python for a while, I become annoyed at having to translate one of these features into a more cumbersome syntax. In my opinion, Python’s most useful language feature is the existence of iterables and functions relating to them. For example, `for i, elem in enumerate(elem_set): foo(elem, i)` is far clearer and easier to use than the C++ equivalent `for (std::set<int>::iterator it = elem_set.begin(); it != elem_set.end(); ++it) { foo(*it, std::distance(elem_set.begin(), it)); }` (where `std::distance` is needed because `std::set` iterators do not support subtraction), and standard library functions like `map()` and `zip()` make it much simpler to operate on elements of a container.
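As a small illustration of the iterable helpers mentioned above, the following sketch combines `enumerate()`, `zip()`, and `map()` on a made-up table row. The column names, values, and both function names are hypothetical and chosen only for the example.

```python
# Hypothetical column names and string values, as a reader might see them.
names = ["ra", "dec", "flux"]
values = ["10.5", "-3.2", "1.0e-3"]


def label_columns(names):
    """Pair each column name with its position using enumerate()."""
    return [f"{i}:{name}" for i, name in enumerate(names)]


def to_row(names, values):
    """Convert the strings to floats with map() and pair them with names via zip()."""
    return dict(zip(names, map(float, values)))


if __name__ == "__main__":
    print(label_columns(names))
    print(to_row(names, values))
```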

I intend to use Cython in my project, and although I haven’t used Cython before, I am comfortable with C and intend to gain a working knowledge of Cython before summer coding begins. I am reasonably comfortable with git, having previously contributed to a friend’s project on BitBucket and having begun contributing to Astropy on GitHub. As of this writing, I have three merged pull requests for minor issues (http://github.com/astropy/astropy/pull/2110, https://github.com/astropy/astropy/pull/2114, and https://github.com/astropy/astropy/pull/2142). These pull requests have involved adding documentation for time formats supported by Astropy, removing exceptions from comparison operators in the Time class, and documenting classes in astropy.io.ascii.core. I have also been working on a more significant open pull request implementing an HTML reader and writer in astropy.io.ascii, which may be viewed at https://github.com/astropy/astropy/pull/2160.

Project Details

Abstract

Currently, the astropy.io.ascii package contains support for reading and writing a number of text-based formats. For simple formats, it would be very useful to have an optimized parser that can efficiently read table data from large files. My proposal will involve implementing fast reading and writing for these formats. It will also include the possibility of implementing memory mapping for ASCII parsers. If all of this is done, I hope to work on general performance enhancement for Astropy.

Detailed Description

The existing astropy.io.ascii package supports a variety of formats for text-based table data, from simpler formats like CSV and RDB to more complex formats such as IPAC and DAOphot. Although the current model of inheritance from ascii.BaseReader is highly configurable and allows for the relatively easy creation of new readers and writers, speed is an important concern in table parsing, and the current implementation of simpler formats is not optimized for maximum performance. In particular, ASCII readers and writers extending from ascii.Basic simply use the functionality of base classes in ascii.core (such as BaseSplitter, BaseHeader, and BaseData) in order to convert input rows into table data using delimiters.
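To make the delimiter-based approach concrete, here is a minimal pure-Python sketch of the split, header, and data steps. It is not the code in ascii.core: the function names `split_line`, `convert_column`, and `read_basic` are hypothetical, and the type conversion (try int, then float, else keep strings) is a simplifying assumption.

```python
def split_line(line, delimiter=","):
    """Split one input row on the delimiter and strip whitespace from each field."""
    return [field.strip() for field in line.split(delimiter)]


def convert_column(strings):
    """Try int, then float; fall back to leaving the column as strings."""
    for cast in (int, float):
        try:
            return [cast(s) for s in strings]
        except ValueError:
            continue
    return list(strings)


def read_basic(lines, delimiter=",", comment="#"):
    """Read a simple delimited table into a dict mapping column name to values."""
    rows = [
        split_line(line, delimiter)
        for line in lines
        if line.strip() and not line.lstrip().startswith(comment)
    ]
    header, data = rows[0], rows[1:]  # first non-comment row holds the names
    columns = zip(*data)              # transpose rows into columns
    return {name: convert_column(col) for name, col in zip(header, columns)}


if __name__ == "__main__":
    print(read_basic(["# a comment", "a,b,c", "1,2.5,x", "3,4.0,y"]))
```

Every field passes through several Python-level function calls and intermediate lists here, which is the kind of per-value overhead a specialized fast reader would try to avoid.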

My proposal, which aims to improve Astropy’s reading and writing performance in cases when flexibility is unimportant, will involve the reimplementation of table readers and writers for simple formats like CSV, RDB, and commented header files. My approach will involve extensive use of the open source asv benchmarking tool (https://github.com/spacetelescope/asv). I will begin by writing benchmarks for the current readers and writers in io.ascii, as well as the relevant parts of astropy.table.Table, in order to assess the performance of the current implementation and look for areas that require improvement. After I find potential bottlenecks, I will use line_profiler (http://pythonhosted.org/line_profiler/) to see if these functions can be redesigned.
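A benchmark for this step might look roughly like the sketch below. asv discovers plain Python classes in a benchmark suite, calls `setup` before timing, and times methods whose names start with `time_`. The names `make_csv`, `read_table`, and `TimeCSVRead` are hypothetical, and `read_table` uses the standard library `csv` module only as a stand-in so the sketch runs without Astropy; in an actual suite I would expect it to call `astropy.io.ascii.read(text, format="csv")` instead.

```python
import csv
import io


def make_csv(n_rows, n_cols=3):
    """Build an in-memory CSV table of integers with a header row."""
    header = ",".join(f"col{i}" for i in range(n_cols))
    body = "\n".join(
        ",".join(str(r * n_cols + c) for c in range(n_cols)) for r in range(n_rows)
    )
    return header + "\n" + body + "\n"


def read_table(text):
    # Stand-in reader (assumption): replace with the reader under test,
    # e.g. astropy.io.ascii.read(text, format="csv").
    return list(csv.reader(io.StringIO(text)))


class TimeCSVRead:
    """asv times each time_* method once per entry in params."""

    params = [100, 10000]
    param_names = ["n_rows"]

    def setup(self, n_rows):
        # Runs before timing, so table generation is excluded from the result.
        self.text = make_csv(n_rows)

    def time_read(self, n_rows):
        read_table(self.text)
```

Parameterizing on the number of rows should make it easier to see whether read time grows linearly with table size or whether fixed overhead dominates for small files.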

Once this is done, I plan on studying the parsing methods of the Pandas library in pandas.io.parsers in order to determine whether these methods can be incorporated effectively into the Astropy framework. If not, then I plan to learn from Pandas’ approach in order to create fast parsers for CSV and other formats from scratch. In any case, I will reimplement the reading and writing of tables for simple formats and ensure the improved performance of my new implementation using asv. Cython, a superset of Python which allows for improved performance through static typing and other features, will come in useful in this reimplementation as the high-level nature of Python presents an additional obstacle to improving parsing performance. I intend to use Cython wherever performance cannot be improved enough in pure Python using the line_profiler tool.
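Before moving any code to Cython, a profiling pass can show where the time goes. The sketch below uses the standard library `cProfile`, which reports per-function rather than per-line timings, as a stand-in for line_profiler so that it runs without extra dependencies. The toy parser functions and `profile_parse` are hypothetical names used only for illustration.

```python
import cProfile
import io
import pstats


def parse_line(line):
    """Convert one comma-separated line into a list of floats."""
    return [float(field) for field in line.split(",")]


def parse_table(text):
    """Parse every line of the input text."""
    return [parse_line(line) for line in text.splitlines()]


def profile_parse(text):
    """Profile parse_table and return the report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    parse_table(text)
    profiler.disable()
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats()
    return out.getvalue()


# A made-up table of 2000 identical rows, large enough to register in the profile.
sample = "\n".join("1.5,2.5,3.5" for _ in range(2000))

if __name__ == "__main__":
    print(profile_parse(sample))
```

Functions that dominate the report and resist improvement in pure Python would be the natural candidates for static typing in Cython.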

A major benefit of the Flexible Image Transport System (FITS) format in Astropy table parsing is that the use of memory mapping allows for the FITS reader to parse fixed-format FITS tables without having to store an intermediate string representation of file data. This approach, currently implemented with numpy.memmap, increases the performance of the FITS reader from both a time and a memory perspective. It is not yet apparent whether memory mapping would be a feasible option for variable-length ASCII tables, but I intend to explore the option by carefully looking through the code in io.fits and looking for a way to adapt this feature to CSV and other formats. If this turns out to be possible, I will implement memory mapping for simple ASCII formats and test the performance of my implementation using asv. Unlike timing benchmarks, the memory benchmarking capacity of asv is still experimental according to its documentation (http://spacetelescope.github.io/asv/writing_benchmarks.html). If this presents a challenge, I will either look at asv’s source to better understand any issues or switch to another tool for memory benchmarking, such as https://github.com/fabianp/memory_profiler.
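One way the idea could be explored for variable-length rows is sketched below. It uses the standard library `mmap` module rather than `numpy.memmap`, since rows of unequal length do not map onto a fixed array layout, and it scans the mapped file for newlines so that only one row at a time is copied into a Python object. The names `iter_rows_mmap` and `demo` are hypothetical, and the sketch does not handle quoted fields or empty files (an empty file cannot be memory-mapped).

```python
import mmap
import os
import tempfile


def iter_rows_mmap(path, delimiter=b","):
    """Yield each row of a delimited file as a list of bytes fields.

    The file is memory-mapped, so the operating system pages data in on
    demand instead of the whole file being read into one Python string.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start, size = 0, len(mm)
            while start < size:
                end = mm.find(b"\n", start)
                if end == -1:          # last line without a trailing newline
                    end = size
                line = mm[start:end]   # copies only this one row
                yield line.rstrip(b"\r").split(delimiter)
                start = end + 1


def demo():
    """Write a tiny CSV to a temporary file and read it back via mmap."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"a,b\n1,2\n3,4\n")
        return list(iter_rows_mmap(path))
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

Whether this actually beats reading the file in one call is exactly what the time and memory benchmarks would need to establish.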

Finally, if there is time left over at the end of the summer after I have implemented the ideas defined above, I will work on improving the overall performance of Astropy. This would involve the continued use of Cython and the asv benchmarking tool. Since the task of increasing general performance is highly open-ended, this addition ensures that I will be able to contribute as much as possible to Astropy over the course of the summer. While it is very possible that there will be insufficient time to begin work on this extra project, I plan to leave the option open in the hopes of dealing with all potential eventualities.

Timeline

Community bonding period (April 21 – May 18):	Become more familiar with Cython and the Pandas data analysis library. Continue working on general Astropy issues on GitHub.
May 19 – May 25 (1 week):	Carefully read through the documentation of the asv benchmarking tool as well as line_profiler. Begin to experiment with asv, using code samples in both Cython and regular Python to compare performance. Look more closely at the structure of `io.ascii` and `table.Table` to get a preliminary understanding of the code.
May 26 – June 8 (2 weeks):	Write benchmarks for the major ASCII formats supported by Astropy, whose readers and writers are implemented in `io.ascii`. These will at least include each format in `ascii.basic` (CSV, RDB, tab-separated, etc.), fixed-width formats, and SExtractor, although I hope to cover other ASCII formats like IPAC, DAOphot, etc. as well. In addition, write benchmarks for the relevant parts of the `Table` class in `astropy.table`. If there is enough time, add benchmarks for other parts of `astropy.table`. This could be of use at the end of the summer if I have the opportunity to improve performance in other packages like `astropy.table`.
June 9 – June 15 (1 week):	Look closely at the Pandas code in `pandas.io.parsers`, particularly `read_csv()` and `read_table()`, in order to determine whether these methods can be modified for use in Astropy. This will involve analyzing the infrastructure of Pandas and comparing its data storage to Astropy’s table storage in `astropy.table.Table`. After presenting my findings to my project mentor and reaching a mutual decision whether or not to use the Pandas library, formulate a plan for implementing efficient parsing of CSV and other simple formats.
June 16 – June 29 (2 weeks):	Using the plan developed during the previous week, implement a fast reader and writer for simple ASCII formats. If it turns out that using Pandas is the best option, this will involve adapting the reading and writing methods in Pandas (`read_csv()`, `to_csv()`, etc.) to the frameworks of `astropy.io.ascii` and `astropy.table`. If not, I will use my knowledge of how Pandas implements efficient parsing and writing in order to create fast readers and writers for simple ASCII formats outside the Pandas framework.
June 30 – July 6 (1 week):	Thoroughly test the performance of my new implementation using asv. After looking closely at the performance of my implementation and finding areas of weakness, use line_profiler to see if these areas can be sufficiently improved in pure Python. If not, use Cython to target these areas and improve the implementation’s speed across the board. As a last resort, rewrite performance-critical portions of code in C for use with Cython.
July 7 – July 13 (1 week):	Look at how `io.fits` implements memory mapping for FITS tables and determine, after discussing my findings with my project mentor, if it will be possible to add functionality for memory mapping with ASCII tables. If this is the case, test the performance of FITS memory mapping using asv benchmarks and create a plan for implementing ASCII memory mapping. If this is not the case, then continue the work of the previous week in improving the performance of my newly implemented ASCII readers using asv, line_profiler, Cython, and possibly C.
July 14 – July 27 (2 weeks):	Assuming that the memory mapping project turns out to be feasible, use these two weeks to implement memory mapping for ASCII formats. Since `numpy.memmap` will be of use in this project, read over the pertinent numpy documentation and incorporate numpy’s memory-mapping features into the ASCII implementation, using asv to ensure enhanced performance. If the memory mapping project is not feasible, these two weeks will allow for greater flexibility in the overall project plan. If everything is completed up to this point, these two weeks might be used in beginning the overall performance enhancement project. If not, then this will serve as a buffer period in which previous work may be completed as necessary.
July 28 – August 10 (2 weeks):	These two weeks will serve as a buffer period in case earlier steps of the proposal require more time, unforeseen difficulties arise, etc. If everything runs smoothly and the buffer period is unnecessary, then work on improving Astropy’s overall performance. If there was sufficient time in weeks 2 and 3 (May 26 – June 8) to write extensive benchmarks for `astropy.table`, then begin work on performance enhancements for this package. Improve performance for other Astropy packages if time permits. Throughout this project, use line_profiler, asv, and Cython as optimization tools.
August 11 (suggested pencils down date) – August 22 (final evaluation deadline):	Write documentation for all previous code and add any tests I may have missed during the main coding period. Look for potential bugs and fix any that arise.

Additional Information

The last day of classes for me will be May 27 (the beginning of the second week of coding), so I will not be able to devote 40 hours to the first week. However, I anticipate that I will have very little, if any, schoolwork, and I should be able to complete the set tasks for week 1. If not, I can make up for missed work over the next couple of weeks. Except for the first week, I will be fully capable of spending 40 hours per week on Google Summer of Code.

I have set up a blog for Google Summer of Code (http://muellergsoc.blogspot.com), which I intend to use weekly during the summer for progress reports. I will also be available for regular contact with my project mentor over the course of the summer through IRC, email, phone calls, or any other mutually convenient form of communication.
