GSOC 2014 Application Shantanu Srivastava : Astropy (High performance ASCII table reader and memory view tables)

Shantanu Srivastava edited this page Apr 5, 2014 · 22 revisions

Google Summer of Code 2014 Proposal - Astropy

Python Software Foundation

Sub Organization

Astropy - Development of a core package for astronomy in Python that fosters interoperability between Python astronomy packages

Student Information

University Information

  • University: Indian Institute of Information Technology and Management, Gwalior, India
  • Major: Information Technology
  • Current year and expected completion date: 2nd year (sophomore), 2017
  • Degree: Integrated Post Graduation (Bachelor of Technology + Master of Technology)

Project Proposal

High-performance ASCII table reader and memory view tables

Tables are the most widely used form for storing data. Compared with regular stores, tables carry a write-performance penalty, so write times are generally longer. Their value shows when handling large data sets (gigabytes in size): rows can be appended, deleted, and queried even for very large amounts of data. Query times can be quite fast, especially on an indexed axis.

Astropy currently provides the astropy.io.ascii module, a flexible table reader and writer. It can be optimized into a high-performance module for reading and writing large files.

  • Optimizing Performance with Benchmarking - airspeed velocity (asv) Python benchmarking package

The asv package can be used to benchmark the read and write operations of the io.ascii module. Performance warnings can be raised when a benchmark fails for any of the ASCII formats. Both timing benchmarks and memory benchmarks can be used for performance evaluation.
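A minimal sketch of what such a benchmark suite could look like is given here. asv discovers methods by prefix (`time_` for timing, `peakmem_` for peak memory) and calls `setup` before each one. The helper names `make_csv_text` and `read_table` are hypothetical, and `read_table` uses the standard-library `csv` module purely as a stand-in so the sketch runs without Astropy; a real suite would call `astropy.io.ascii.read` at that point.

```python
import csv
import io


def make_csv_text(n_rows, n_cols=5):
    """Build a synthetic comma-separated table as one string (hypothetical helper)."""
    header = ",".join(f"col{i}" for i in range(n_cols))
    rows = (
        ",".join(str(r * n_cols + c) for c in range(n_cols))
        for r in range(n_rows)
    )
    return header + "\n" + "\n".join(rows) + "\n"


def read_table(text):
    """Stand-in reader; a real suite would call astropy.io.ascii.read here."""
    return list(csv.reader(io.StringIO(text)))


class ReadSuite:
    # asv runs every benchmark once per value listed in `params`.
    params = [1000, 10000]
    param_names = ["n_rows"]

    def setup(self, n_rows):
        # setup() runs before each benchmark and is excluded from the timing.
        self.text = make_csv_text(n_rows)

    def time_read(self, n_rows):
        # Methods prefixed with time_ are timed by asv.
        read_table(self.text)

    def peakmem_read(self, n_rows):
        # Methods prefixed with peakmem_ record peak memory use.
        read_table(self.text)
```

Parameterizing over the row count makes it possible to see how read time and memory grow with table size, which is the quantity of interest for large files.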

  • Fast Reader

The ASCII reader can read data from various file formats, including AASTeX, CDS, and LaTeX. It is built around a table parser that can be used for reading data from large files. The basic class and inheritance structure of astropy.io.ascii can be developed further for better performance. Possible optimizations include avoiding the storage of intermediate data in memory while reading, faster guessing of the data format, and making use of attributes and parameters such as the column delimiter. The guess function can be given sniffing functionality, as in the pandas project, for faster operation. The splitter and separator functions in the existing code can also be optimized further.

Under heavy data load, sizeable speed-ups can be achieved by offloading work to Cython. Providing type information, using ndarray, and looping over NumPy arrays in Cython are some basic implementation ideas. More advanced optimization can be carried out by calling C libraries from Cython.
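The delimiter sniffing and the avoidance of intermediate storage described above can be sketched in plain Python. This is only an illustration under stated assumptions: it uses the standard-library `csv.Sniffer` as a stand-in for pandas-style sniffing, assumes every column is numeric, and the function names `guess_delimiter` and `read_columns` are hypothetical rather than part of astropy.io.ascii.

```python
import csv
import io


def guess_delimiter(sample, candidates=",;\t| "):
    """Guess the column delimiter from a small sample of the input."""
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter


def read_columns(stream, sample_size=4096):
    """Stream a delimited text table into a dict of per-column lists.

    Assumes a seekable text stream, one header line, and numeric data.
    """
    # Inspect only the first few kilobytes instead of trying every format.
    sample = stream.read(sample_size)
    stream.seek(0)
    delimiter = guess_delimiter(sample)

    reader = csv.reader(stream, delimiter=delimiter)
    names = next(reader)
    columns = {name: [] for name in names}
    # Convert each value as it is read, so the full list of string rows
    # is never held in memory.
    for row in reader:
        for name, value in zip(names, row):
            columns[name].append(float(value))
    return columns
```

A short usage example:

```python
text = "ra;dec;flux\n1.5;2.5;10\n3.0;4.0;20\n"
columns = read_columns(io.StringIO(text))
print(columns["flux"])
```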

  • Fast Writer

The present Astropy writer can write astropy.table data to a number of formats. It can be developed further, especially for writing the CSV format. Performance can be improved by using explicit buffering (on top of Python's own buffering). Using a "chunksize" argument can reduce memory usage during writing. Similarly, using "expectedrows" can optimize read/write performance. Normalization of missing or NaN values would also be handled by the writer.
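A rough sketch of chunked writing with NaN normalization follows. The functions `format_value` and `write_csv`, and the `fill` parameter, are hypothetical names chosen for illustration, not an existing Astropy API. The sketch assumes values never contain commas or newlines, so no quoting is done.

```python
import io
import math


def format_value(value, fill="nan"):
    """Render one cell, normalizing None and NaN to a single fill string."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return fill
    return str(value)


def write_csv(rows, stream, names, chunksize=1000, fill="nan"):
    """Write rows to stream, issuing one write call per `chunksize` rows."""
    stream.write(",".join(names) + "\n")
    chunk = []
    for row in rows:
        chunk.append(",".join(format_value(v, fill) for v in row))
        if len(chunk) >= chunksize:
            # Flush the accumulated lines in a single call, then reuse the list.
            stream.write("\n".join(chunk) + "\n")
            chunk.clear()
    if chunk:
        # Write whatever is left after the last full chunk.
        stream.write("\n".join(chunk) + "\n")
```

Because `rows` can be any iterable, at most `chunksize` formatted lines are held in memory at once. When writing to disk, a larger explicit buffer can be requested through the `buffering` argument of the built-in `open`, for example `open(path, "w", buffering=1 << 20)`.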

  • Memory mapped ASCII files

Memory mapping can be applied to reading and writing files for faster performance. Using a memory-mapped file can significantly increase performance over explicitly reading chunks of the file. The io.fits module uses memory mapping when working with FITS tables. A similar approach can be developed for ASCII read and write operations.
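As an illustration of the idea, the sketch below reads one column from a delimited text file through a read-only memory map, using the standard-library `mmap` module. The function name `read_column` is hypothetical, and the sketch assumes a non-empty file with a single header line and numeric data.

```python
import mmap


def read_column(path, index, delimiter=b","):
    """Read one numeric column from a delimited file via a read-only mmap."""
    values = []
    with open(path, "rb") as handle:
        # Length 0 maps the whole file; this raises ValueError for an empty file.
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            mapped.readline()  # skip the header line
            # readline() walks the mapping; the OS pages data in on demand.
            for line in iter(mapped.readline, b""):
                fields = line.rstrip(b"\r\n").split(delimiter)
                values.append(float(fields[index].decode("ascii")))
    return values
```

The file contents are never copied into a Python string as a whole; only the line currently being parsed is materialized.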

Proposed Timeline

  • 19th May - 15th June - Fast ASCII Reader - During this period I would like to work on the fast ASCII reader, starting by developing the class structure and methods based on the present astropy.io.ascii classes and then improving performance using the optimizations mentioned above.
  • 16th June - 30th June - Fast ASCII Writer - The next step would be to develop a fast ASCII writer for io.ascii using the optimization techniques described. The main focus would be the ASCII writer for the CSV format.
  • 1st July - 15th July - Memory mapping for ASCII files
  • 16th July - 31st July - Benchmarking - Developing the benchmarking framework using the asv tool
  • 1st August - 18th August - Further improvements and optimizations, guided by the results and statistics from the benchmarking tool.

Contribution to the Project

Documentation improvement for issue #1221 "Fix non-functional API links" with pull request #2217: https://github.com/astropy/astropy/issues/2217. The merged version: https://github.com/astropy/astropy/pull/2279

Code Samples

About Me

"Coding is Life"

Dedicated to learning the leading-edge technologies and skills of the dynamic software world through live projects and opportunities. I combine innovation with creativity to solve coding issues, and I can assimilate and rapidly put to use emerging technologies.

Aim and Goal

To become an Artificial Intelligence professional specializing in Machine Intelligence and Learning, Data Mining, Robotics, Natural Language Processing and Information Retrieval, Search Optimization, and Social Intelligence.

"Live your dreams to the fullest! Work hard to make your dreams come true, otherwise you will be hired by someone else to make his dream come true!"

Why me ?

I have started contributing to the project with documentation bug fixes and am developing a good understanding of the Astropy workflow. I assure you that I will contribute to the maximum possible extent to take this project to completion. I am very passionate about getting my hands on real-world projects. I have programming experience in Python, C++, and C#, and I am familiar with the associated data structures and algorithms. My passion is for coding and computer technologies. I love learning and sharing new ideas, and I am always open to new opportunities and to exploring challenging research areas.
