
Google Summer of Code 2015 Proposal for Boost Document Library Development

Anurag Ghosh edited this page Mar 24, 2015 · 20 revisions

Proposal for Boost Document Library Development

Personal Details

Name: Anurag Ghosh

College/University: International Institute of Information Technology, Hyderabad, India

Course/Major: Computer Science and Engineering

Degree Program (B.Sc., M.Sc., PhD, etc.): B.Tech. and MS by Research

Email: ghoshanurag1995@gmail.com

Homepage: http://researchweb.iiit.ac.in/~anurag.ghosh

Availability

How much time do you plan to spend on your GSoC?

I can commit around 4-5 hours a day on weekdays and about 8 hours a day on weekends (Saturday and Sunday). Barring the occasional holiday, I should be able to devote around 35-40 hours per week to my GSoC project (4.5 hours * 5 days + 8 hours * 2 days = 38.5 hours).

Also, I’ll be at home during July (which I expect to be the peak period for project progress), so I can comfortably devote around 8 hours every day except Sundays. That comes to about 45-50 hours per week in July (6 days * 8 hours = 48 hours).

What are your intended start and end dates?

Ideally, I would have liked to start in early May and finish in early August, but I'm fine with the given GSoC timeline as it nearly aligns with my preferences, and I’ll start the pre-work earlier. I have tried to make the best use of the time, as described in the ‘Proposed Milestones and Schedule’ section later in the proposal.

What other factors affect your availability (exams, courses, moving, work, etc.)?

I will be working in an introductory capacity at a research lab in my college for 8 weeks during May and June, in accordance with my course requirements. My college semester also starts around August. Having accounted for both commitments, I think I'll still be able to find time for the project.

Background Information

Please summarize your educational background (degrees earned, courses taken, etc.).

I’m currently in my 4th semester, pursuing a B.Tech and MS by Research dual degree majoring in Computer Science and Engineering at IIIT-Hyderabad, India. In the past 2 years, I have completed the following technical courses, listed in chronological order:

Semester - 1

Computer Programming, Digital Logic and Processors, Mathematics 1, IT Workshop 1 (Linux, HTML, CSS, jQuery)

Semester - 2

Data Structures, Computer System Organization, Basic Electronic Circuits, Mathematics 2, IT Workshop 2 (Python, Javascript and Web2Py)

Semester - 3

Algorithms, Operating Systems (mainly a Unix perspective), Introduction to Databases, Structured System Analysis and Design (SSAD), Mathematics 3, Science I

Semester - 4 (Ongoing Courses)

Graphics, Formal Methods, Artificial Intelligence, Digital Signal Analysis and Application, Computer Networks (CN)

Please summarize your programming background (OSS projects, internships, jobs, etc.).

I had no programming experience before I came to college, although I was an avid Linux user. During my first year, we did algorithmic programming in C in the Computer Programming and Data Structures courses, implementing various graph algorithms, heaps, stacks, queues and trees. I also attempted some competitive coding, learning Union-Find, Segment Trees, BITs and other data structures and algorithms in the process, during which I migrated to C++ and the STL. Around the same time, I started working on a simple processor simulator in C++ (https://github.com/anuragxel/iiit-processer-sim) with a friend and single-handedly wrote a 2-pass assembler for the toy instruction set. The project also led to an expansion of the course material for the course in which processor design is taught. I have also programmed with threads and forks during our Operating Systems course.

I have used C++ for our Graphics course, where I made mini games using OpenGL 2 and OpenGL 3. I have played with the basics of socket programming by designing a simple file download/upload protocol and writing an application-layer program for it, although in C. I have also used Java for my SSAD project, but I did not like it much.

As a regular contributor to our college festival website and contest (specifically http://felicity.iiit.ac.in/threads), I have contributed to a generic contest portal made for holding quizzing and programming contests, written in Python using Django and Celery. I also set some questions in an esoteric programming language (SNUSP) for a contest, which required me to write an interpreter for the language in C++ (https://github.com/anuragxel/modular-snusp).

Please tell us a little about your programming interests. Please tell us why you are interested in contributing to the Boost C++ Libraries.

I knew a bit about open source before I entered college, since I had used Linux before. Later, I was exposed to the OSDG Club in my college and came to know about various open source organisations whose projects I was using in my daily life.

I came to know about Boost when I was trying to write some threaded code in my OS course. That sparked my interest in the Boost C++ Libraries, as I found Boost to be a very useful set of libraries for any budding or experienced developer. Later, I talked about this to one of my seniors, who has contributed to Boost, and he encouraged me to have a look at the projects as they are a very good way to start contributing. That is how I came across “Boost Document Library Development”, a project which I think I’m capable of doing, as my prototype hopefully demonstrates.

The fact that the Boost libraries reach a lot of people in the programming world gives me a very big impetus to contribute. It is really exciting to think that someone would be productive and happy using code that I have written, because I was in the same place when I used the Boost libraries myself.

What is your interest in the project you are proposing?

My primary interest in programming lies in building applications which can be used for academic or other free interests. The Document Library Development project has a wide variety of applications in different work environments.

I remember that I once had to work with Excel sheets containing data to be populated in a database for a college portal. As I knew of no way to automate such a task directly, I had to convert the sheets to CSV, use the cut command to extract specific columns, and then painfully write a Python script to generate the SQL query. I believe that is the case at a lot of places. This is where the Boost Document Library comes into the picture. I think that I will learn a lot about software development, and open source project development in particular, if I work on this project. Also, I wish to become a regular open source contributor at Boost and this seems to be a very good way to start, considering the very good guidance of my mentor, Mr. Antony Polukhin.

Have you done any previous work in this area before or on similar projects?

I have been developing the prototype for the past week with the help of my mentor. Prior to that, I had never worked on a C++ project of this magnitude.

What are your plans beyond this Summer of Code time frame for your proposed work?

The scope of this project is huge and thus may require adding new methods and functionality to the library. I hope to complete the project within the defined scope during the summer and extend it after that either by broadening the scope or from the feedback that the project receives from the users. I would also like to fix any bugs that may arise during the usage.

In addition, I hope that after the project ends, I become competent enough to be a regular contributor to Boost. I would be more than happy to make more libraries (maybe small ones like the sorely missing fully featured JSON library, instead of the PropertyTree version).

Please rate, from 0 to 5 (0 being no experience, 5 being expert), your knowledge of the following languages, technologies, or tools:

C++ 98/03 (traditional C++) : 3.5

C++ 11/14 (modern C++) : 4

C++ Standard Library : 3.5

Boost C++ Libraries : 2.5

Git : 4

What software development environments are you most familiar with (Visual Studio, Eclipse, KDevelop, etc.)?

I normally use vim (in a guake terminal) together with gdb/ddd, and sometimes Sublime Text, to code in C++ because that is what I’m used to. I have used Eclipse a few times, but I don’t like the interface much as it seems too cluttered. I have used Visual Studio 2010 on Windows, although I haven’t used Windows at all lately. I’m also familiar with the CodeBlocks IDE on Windows.

What software documentation tool are you most familiar with (Doxygen, DocBook, Quickbook, etc.)?

I’m not familiar with any of these but I’m going to try to use Doxygen on the prototype I made so that I understand how to use it before starting the project. With Antony's help, I have already made some strides in this regard.

Project Proposal

Short Overview:

To unify the APIs of different office suites and provide a library that is capable of doing simple tasks with office documents (creation, PDF export, file format changes, data extraction and cell manipulation).

Major Points:

  • The library must allow users to create object instances of Excel/LibreOffice Calc/OpenOffice Calc documents and manipulate those documents.
  • The library will be cross-platform (i.e. work on both Linux and Windows) and should work if at least one of the dependencies is satisfied, i.e. either LibreOffice/OpenOffice or Microsoft Office is installed.
  • The user must be able to do basic tasks such as creating a new document at a given path, opening a document instance and exporting the document instance to various file formats. I intend to support at least PDF, CSV and XML. Optional: functionality support for Writer/Word and Impress/PowerPoint as well. The LibreOffice/OpenOffice API code is already functional in this manner in the prototype I have written.
  • The user must be able to get a given sheet of a spreadsheet document. Every sheet will have associated functions to access and manipulate its cells.
  • Cell manipulation functions will include row and column iterators, which integrate naturally with the C++ programming paradigm.
  • Adding support for simple functions like SUM, AVERAGE, MAX, MIN and SUMPRODUCT to both Excel and OpenOffice. Optional: adding macro execution support on the spreadsheet for both OpenOffice/LibreOffice and Microsoft Excel.
  • Adding support for drawing charts, namely pie charts and bar charts, from data ranges and labels.
  • Optional: adding support for converting a LibreOffice/OpenOffice Calc spreadsheet (.ods) to the Excel file format (.xls/.xlsx, whichever the OpenOffice API supports).
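
As a rough illustration of the simple functions mentioned in the list above, the sketch below computes them on the C++ side over values that have already been read out of a cell range. It assumes the cells were copied into a std::vector<double>; the namespace and the range_* function names are hypothetical and do not come from the prototype. A real implementation might instead delegate to the formula engine of the office suite.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <stdexcept>
#include <vector>

// Hypothetical helpers: none of these names come from the prototype.
// They operate on numeric cell values already copied out of a sheet.
namespace sketch {

inline double range_sum(const std::vector<double>& cells) {
    return std::accumulate(cells.begin(), cells.end(), 0.0);
}

inline double range_average(const std::vector<double>& cells) {
    if (cells.empty()) throw std::invalid_argument("AVERAGE of an empty range");
    return range_sum(cells) / static_cast<double>(cells.size());
}

inline double range_max(const std::vector<double>& cells) {
    if (cells.empty()) throw std::invalid_argument("MAX of an empty range");
    return *std::max_element(cells.begin(), cells.end());
}

inline double range_min(const std::vector<double>& cells) {
    if (cells.empty()) throw std::invalid_argument("MIN of an empty range");
    return *std::min_element(cells.begin(), cells.end());
}

// SUMPRODUCT: sum of the element-wise products of two equally sized ranges.
inline double range_sumproduct(const std::vector<double>& a,
                               const std::vector<double>& b) {
    if (a.size() != b.size())
        throw std::invalid_argument("SUMPRODUCT needs ranges of equal size");
    return std::inner_product(a.begin(), a.end(), b.begin(), 0.0);
}

}  // namespace sketch
```

Empty ranges are rejected with an exception here; whether the library should throw, return a default value or mirror the behaviour of the office suite is a design decision to be settled with the mentor.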

Requirement of the Project

The library aims to simplify Office API usage to such an extent that casual users can build their own applications with it. Currently, the Office APIs are not very intuitive or easy to use, and this abstraction layer would provide easier access and thus higher productivity. Such a library is essential for banking software, CRMs and many other programs. It can also be very useful in academia.

Implementation Details

I will use object-oriented programming principles to abstract away the implementation details of the wrapped OpenOffice and Microsoft Excel functions separately, leading to fewer bugs, greater coding ease and good modularity.

The project code base will be divided into two parts:

  • OpenOffice/LibreOffice functionality
  • Microsoft Office functionality
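
The split into two parts behind a common object-oriented interface could look roughly like the sketch below. Every name in it (document_backend, libreoffice_backend, ms_office_backend, office_suite, make_backend) is hypothetical and chosen only for illustration, and the method bodies are stubs where the real office API calls would go.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>

namespace sketch {

// Hypothetical common interface; the method names are illustrative only.
class document_backend {
public:
    virtual ~document_backend() = default;
    virtual std::string name() const = 0;
    virtual void open(const std::string& path) = 0;
    virtual void export_as(const std::string& format) = 0;
};

// Stub standing in for code that would call the OpenOffice/LibreOffice API.
class libreoffice_backend : public document_backend {
public:
    std::string name() const override { return "libreoffice"; }
    void open(const std::string&) override { /* office API calls go here */ }
    void export_as(const std::string&) override { /* office API calls go here */ }
};

// Stub standing in for code that would call the Microsoft Office API.
class ms_office_backend : public document_backend {
public:
    std::string name() const override { return "ms_office"; }
    void open(const std::string&) override { /* office API calls go here */ }
    void export_as(const std::string&) override { /* office API calls go here */ }
};

enum class office_suite { libreoffice, ms_office };

// User code only ever sees document_backend; the factory picks the part of
// the code base that matches the installed office suite.
inline std::unique_ptr<document_backend> make_backend(office_suite suite) {
    switch (suite) {
        case office_suite::libreoffice:
            return std::unique_ptr<document_backend>(new libreoffice_backend());
        case office_suite::ms_office:
            return std::unique_ptr<document_backend>(new ms_office_backend());
    }
    throw std::invalid_argument("unknown office suite");
}

}  // namespace sketch
```

With this shape, each backend can be compiled only on the platforms where its office suite exists, while user code stays the same.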

The functions themselves will require calls to the APIs of OpenOffice/LibreOffice and Microsoft Office. One of the big challenges will be to ensure that the code runs on both Windows and Linux, with either Microsoft Office or OpenOffice on Windows and with OpenOffice on Linux, on both 32-bit and 64-bit machines, depending on the discussion with the mentor. A lot of boilerplate is required before the APIs themselves can be used, and the library should automate such tasks.

There also needs to be a discussion on the amount of customization ability provided for functionalities such as export (which can have many options in the formatting, like tab separation instead of comma separation) and charts from data.
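
One possible shape for such customization is an options object passed to the export call. The sketch below is only an illustration of that idea on plain strings: text_export_options and to_delimited_text are hypothetical names, and the code does not touch any office API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

namespace sketch {

// Hypothetical option bundle for a text export; field names are illustrative.
struct text_export_options {
    char delimiter = ',';            // ',' for CSV, '\t' for tab-separated output
    std::string line_ending = "\n";  // written after every row
};

// Joins rows of already stringified cells using the chosen options.
// Quoting and escaping of delimiters inside cells is deliberately left out.
inline std::string to_delimited_text(
        const std::vector<std::vector<std::string>>& rows,
        const text_export_options& opts = text_export_options()) {
    std::string out;
    for (const auto& row : rows) {
        for (std::size_t i = 0; i < row.size(); ++i) {
            if (i != 0) out += opts.delimiter;
            out += row[i];
        }
        out += opts.line_ending;
    }
    return out;
}

}  // namespace sketch
```

Default-constructed options give comma separation, so the common case needs no extra code, while setting delimiter to '\t' switches to tab separation.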

Row and Column Transforms

Providing iterators for rows and columns would make spreadsheet document manipulations very easy in C++.

  document d = ...;
  row r = d.row(0);
  for(cell& c : d.row(0)) { /*do something*/ }
  for(cell& c : d.column(0)) { /*do something*/ }
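
To make the idea above concrete, here is a minimal in-memory sketch of how row() and column() could return ranges that work with range-based for loops. It assumes cells are plain doubles stored row by row in a std::vector; the sheet, cell_range and strided_iterator types are hypothetical stand-ins, not the prototype's classes, and a real implementation would fetch cells through the office API instead.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {

using cell = double;  // stand-in for a real cell type

// Minimal iterator that walks a flat cell buffer with a fixed stride.
class strided_iterator {
public:
    strided_iterator(std::vector<cell>& cells, std::size_t index, std::size_t stride)
        : cells_(&cells), index_(index), stride_(stride) {}
    cell& operator*() const { return (*cells_)[index_]; }
    strided_iterator& operator++() { index_ += stride_; return *this; }
    bool operator!=(const strided_iterator& other) const { return index_ != other.index_; }
private:
    std::vector<cell>* cells_;
    std::size_t index_;
    std::size_t stride_;
};

// A view over `count` cells starting at index `first`, `stride` apart.
class cell_range {
public:
    cell_range(std::vector<cell>& cells, std::size_t first,
               std::size_t count, std::size_t stride)
        : cells_(&cells), first_(first), count_(count), stride_(stride) {}
    strided_iterator begin() const { return strided_iterator(*cells_, first_, stride_); }
    strided_iterator end() const {
        return strided_iterator(*cells_, first_ + count_ * stride_, stride_);
    }
private:
    std::vector<cell>* cells_;
    std::size_t first_, count_, stride_;
};

// Row-major sheet: a row is contiguous, a column has a stride of `cols`.
class sheet {
public:
    sheet(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), cells_(rows * cols, 0.0) {}
    cell_range row(std::size_t r) { return cell_range(cells_, r * cols_, cols_, 1); }
    cell_range column(std::size_t c) { return cell_range(cells_, c, rows_, cols_); }
    cell& at(std::size_t r, std::size_t c) { return cells_[r * cols_ + c]; }
private:
    std::size_t rows_, cols_;
    std::vector<cell> cells_;
};

}  // namespace sketch
```

Usage mirrors the snippet above: `for (sketch::cell& c : s.row(0)) c = 1.0;` writes to every cell of the first row of a sheet `s`. The iterator only provides what a range-based for loop needs; full standard iterator support would require more members.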

Proposed Milestones and Schedule

Pre Project Phase

  • 27th March

    • Submit the final draft of proposal.
  • 28th March - 17th April

    • Go through C++ generic and template programming paradigm resources. I’m familiar with them already but it is better to revise them first.

    • Get familiar with doxygen or some other documentation tool (after discussion with mentor) and use it on the prototype developed to gain an understanding.

    • Discuss with my mentor, and then perform, any other tasks needed to get me up to speed before starting the actual project. Note: The tasks that I have mentioned are generic and would help me in the long run even if I am not selected.

  • 18th April - 7th May

    • I have my end-semester examinations in the week after the 18th, hence I will be committed to them.
  • 10th May - 25th May

    • Go through the OpenOffice documentation (the LibreOffice documentation points to the OpenOffice documentation most of the time, as both descend from the StarOffice code base) and learn the use of SAL and CPPUHelper. I already have a brief idea of these, but a thorough understanding would be good. Go through the major differences occurring between
    • Similarly, go through the Microsoft Office documentation (if any) for Office API usage.
    • Explore the possibility of a format change from Microsoft Excel to OpenOffice using the Microsoft Excel API.
    • Make notes of the requisite classes and flows that need to be used, and write small snippets of code if they seem useful.
    • Discuss the concept and architecture of the project with the mentor. Also, learn the various standards that need to be followed while writing a library for Boost. Discuss the methods of implementation specified in this proposal.

Project Phase

I’m not listing testing and documentation explicitly; however, each piece of functionality will be tested and documented in the week in which it is implemented.

  • Week-1 -- Week-2

    • Extend prototype features to include compatibility for Excel API and extend format export support to the other formats as proposed for both OpenOffice and Microsoft Office.

    • Learn how to write code that is compatible across all supported platforms, and make any changes to the existing code that this requires.

  • Week-3

    • Extend functionality to give the user access to the sheet containing the cells, along with methods to get and set cells and cell ranges.
  • Week-4 -- Week-5

    • Extend Functionality to add support for manipulating rows and columns by providing iterators over the cells of the spreadsheet rows/columns.
  • Week-6

    • Review and catch-up week. If everything is going as planned without roadblocks, try to accomplish the first optional item given in the third major point, i.e. implement basic document usage for Word/PowerPoint as well as for their OpenOffice/LibreOffice counterparts.
  • Week-7 -- Week-8

    • Adding support for simple functions like SUM, AVERAGE, MAX, MIN and SUMPRODUCT to both Excel and OpenOffice.
  • Week-9 -- Week-10

    • Adding support for making charts from the spreadsheet, specifically bar charts and pie charts.
  • Week-11 -- Week-12

    • I believe I will have completed all the requirements of the project by now, so code cleanup will be the main priority.

    • If I still have time left, complete the other optional parts, or at least evaluate whether they are feasible.

Programming Competency

Please have a look at the code base for the prototype that I was asked to write at https://github.com/anuragxel/boost-generic-document-library

The prototype depends on LibreOffice/OpenOffice and provides a document.hpp header containing open_document(), close_document(), export_document() (which exports to PDF and CSV) and save_document(). Usage examples and other tests are provided in test.cpp. The prototype also requires an OpenOffice/LibreOffice server to be running.

Other Smaller C++ Projects

I have also been building a C++-based processor simulator with a friend for a toy instruction set taught in our college. You can have a look at the code base here (the code in src/Assembler is written mostly by me): https://github.com/anuragxel/iiit-processer-sim

Also, I have written an interpreter for Modular SNUSP (an esoteric programming language), which can be found here: https://github.com/anuragxel/modular-snusp