Project Ideas

runab edited this page Feb 6, 2014 · 12 revisions

Ankur.org.in welcomes all students, developers and mentors to participate in our projects. We encourage interested students and developers to talk with us about their project proposals. We look forward to clearly articulated proposals that set out a plan of action with timelines and realistic checkpoints for measuring progress.

To start our conversation we would like to know about the following:

  • self introduction (courses attending etc)
  • what is the scope of the proposal (especially what is outside the scope)
  • familiarity with the tools, infrastructure and concepts of the project
  • how many hours per week will you be able to commit to your project
  • do you want to inform the project about any other commitments you have
  • how will you adjust to a mentor who will be virtually present
  • are you comfortable in English

Ankur.org.in participated in Google Summer of Code 2012 and 2013, and will apply to be a mentoring organization for Google Summer of Code 2014. Through its involvement in programs like Google Summer of Code 2014, the organisation aims to achieve some of the goals it has set itself for this year. These relate to the availability of simpler and more reliable tools for end-users of platforms to communicate and share knowledge in their local language. The project proposals are constructed to enable the architecture and development of ‘frameworks’ rather than language-specific tools. This would enable a larger part of the work to be re-used and improved upon by contributions from other language communities, especially Indic language communities.

Given below is a set of ideas around which we would like to see proposals. These are by no means the only ideas we will consider. An interesting approach to solving a relevant problem in the domain of language technologies is always welcome, and we encourage you to discuss it on our mailing list or IRC. The earlier we begin the conversation, the easier it will be to familiarize ourselves with each other's patterns and ways of working. This will help us ensure that the selected project ideas are successfully delivered.

Our mailing list is project-ideas @ lists dot ankur dot org dot in and the subscription interface is open for memberships. We also use IRC for discussion; our channel is #ankur.org.in on Freenode IRC. If you are interested in any of the ideas, please get in touch with the mentors (their IRC nicknames are provided alongside their names).

Tentative list of Project Mentors

  • Sankarshan Mukhopadhyay (sankarshan)
  • Sayamindu Dasgupta (unmadindu)
  • Shreyank Gupta (shrink)
  • Runa Bhattacharjee (arrbee)
  • Sucheta Ghoshal (sucheta)

An implementation of NLP rules using a widely available and comprehensive corpus of Bengali language texts

In recent times there have been many discussions around creating corpora (through organic and inorganic methods). We feel that, in addition to the creation of a body of texts, it is important to provide a well-defined and comprehensive set of example implementations of NLP rules for a language (e.g. Bengali).

This is a project of moderate complexity, but it requires familiarity with NLP (Natural Language Processing), scripting languages and the grammar rules of the language. Familiarity with NLP libraries is a must.

Improving information retrieval methods for OCR data sets consisting of Indic scripts

The availability of archived and digitized documents for Indic scripts has gradually increased in recent times. The project aims to improve existing methods and algorithms for retrieving information from digitized text. More effective information retrieval from such text enables the content to be made available via standardized and structured text processing software. Current methods of retrieval result in significant degradation, which makes information retrieval and use ineffective. As part of the proposal, the search algorithms should make use of additional error-correction methods to improve performance.

This is an exploratory/research-centric project with a high degree of complexity.

Improve the accuracy of OCR tools for Bengali language to 98%

Existing Free and Open Source OCR software produces results with a significantly high error rate. The proposal requires a study of the currently open items for an existing tool and the development of patches that would improve the accuracy of the software to ~98%.

This is a risky/exploratory project. Current methods of OCR for Indic languages have reached a plateau. The intent of this project is to devise technology constructs which will help improve the accuracy. Attaining such a goal requires prior knowledge of the existing tools and programs, along with an in-depth understanding of the current problem sets.
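
The source does not say how the ~98% figure would be measured. One common convention, assumed here purely for illustration, is character accuracy: one minus the edit distance between the OCR output and a ground-truth transcription, divided by the length of the ground truth. The function names below are hypothetical and the sketch is not tied to any particular OCR tool.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum inserts, deletes and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Assumed metric: 1 - (edit distance / reference length), floored at 0.

    Note: Python strings are compared per Unicode code point, so a Bengali
    conjunct counts as several characters. A real evaluation may prefer
    grapheme clusters; that choice is left open here.
    """
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))
```

Under this assumed metric, a proposal could report accuracy before and after each patch on a fixed set of ground-truth pages, which makes progress toward the target checkable.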

A platform to integrate into an OCR workflow pipeline to enable collaborative correction of OCR text.

This is a project of low to moderate complexity. It requires the creation of a collaboration platform which can be integrated into an OCR workflow pipeline and can thus be used by a diverse set of distributed editors and reviewers. The corrections to the text will need to be tracked and tagged so that they can complete the feedback loop to the OCR pipeline for training and improvement. A voting mechanism in the platform is preferred, in order to enable greater participation and easy identification of the correct usage of popular terms.
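
As a rough illustration of the voting and feedback loop described above, the sketch below keeps a tally of proposed corrections per text segment and emits (OCR text, accepted correction) pairs once a proposal has enough votes. The class name, the segment identifiers and the `min_votes` threshold are all hypothetical; a real platform would also need user identity, persistence and review states.

```python
from collections import Counter, defaultdict


class CorrectionBoard:
    """In-memory tally of proposed corrections, keyed by segment id."""

    def __init__(self):
        # segment id -> Counter mapping proposed text to number of votes
        self._votes = defaultdict(Counter)

    def vote(self, segment_id, proposed_text):
        """Record one vote for a proposed correction of a segment."""
        self._votes[segment_id][proposed_text] += 1

    def accepted(self, segment_id, min_votes=2):
        """Return the most-voted proposal if it reached min_votes, else None."""
        counts = self._votes.get(segment_id)
        if not counts:
            return None
        text, votes = counts.most_common(1)[0]
        return text if votes >= min_votes else None

    def training_pairs(self, ocr_text_by_segment, min_votes=2):
        """Pairs of (original OCR text, accepted correction) for retraining."""
        pairs = []
        for segment_id, ocr_text in ocr_text_by_segment.items():
            correction = self.accepted(segment_id, min_votes)
            if correction is not None and correction != ocr_text:
                pairs.append((ocr_text, correction))
        return pairs
```

The pairs returned by `training_pairs` are one possible form for the data fed back to the OCR pipeline; the actual format would depend on the OCR tool chosen.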

A method/tool to measure performance indicators of web fonts.

The popularity of web fonts has resulted in a number of approaches towards measuring (or scoring) aspects of their usage.

This is a project of moderate complexity which requires an assessment of existing methods, followed by the development of a lightweight framework which can be used in the context of performance engineering, DevOps and Quality Engineering.

An application UI testing framework for validating translation completeness and quality

A typical problem in the translation work-flow is the incomplete coverage of translated strings for an application. This creates an inconsistent experience for the end user.

The basic premise of this project idea is to allow a desktop application testing tool or framework to check the consistency and coverage of translations across a GUI, as well as whether the translated UI contravenes known UI Guidelines. Knowledge of available testing frameworks and the ability to develop code in a scripting language such as Python are required. This is an infrastructure/automation-centric project which will help improve the quality and consistency of translated interfaces. There are existing tools/scripts which do parts of the outlined idea. However, a unified approach to the solution is lacking. Proposals which are based on extending an existing tool would also be accepted.
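
To make the coverage part of the idea concrete, here is a minimal sketch that assumes the UI strings have already been extracted into a plain mapping from source string to translated string (an empty value meaning "not translated"). How that mapping is obtained from a running GUI or from translation files is outside the sketch, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple


def translation_coverage(catalog: Dict[str, str]) -> Tuple[float, List[str]]:
    """Return (fraction of strings translated, untranslated source strings).

    A string counts as untranslated when its translation is empty or
    whitespace only. An empty catalog is treated as fully covered.
    """
    untranslated = [source for source, translated in catalog.items()
                    if not translated.strip()]
    if not catalog:
        return 1.0, []
    coverage = 1.0 - len(untranslated) / len(catalog)
    return coverage, untranslated
```

A testing framework could run such a check per window or dialog and fail a test run when coverage drops below an agreed threshold.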

Add language grammar rules to a machine translation system

Existing Machine Translation systems are less accurate when translating from English to Bengali. The proposal involves documenting existing language grammar rules and developing enhancements to an existing system of machine translation.

The proposal also desires that the system be deployed as a proof-of-concept to a project of scale (e.g. Wikipedia) and generate auto-translated content for review, scoring and curation.

Add a language model for speech recognition software for Bengali language

Develop a language model for speech processing by extending a freely available corpus.

The proposal also requires an understanding of existing models of processing used by speech recognition software and, devising a proof-of-concept deployment for use in a project of scale.

Speech based query and result retrieval system for Indian languages

This project intends to provide an easily extensible framework for using speech input to query a datastore of content and return a result set. Similar implementations exist, e.g. IRCTC uses an implementation with Asterix for the "Speech based dialog query system" in Hindi. The proposal aims to create an easy framework which can be deployed by local language service providers to serve content.

Familiarity with the speech-to-text and text-to-speech aspects of Indian languages, knowledge of word-sense disambiguation and familiarity with parts-of-speech taggers are expected. There exists a large body of published papers on this subject, and the interested student would be expected to study and assess the approaches. The end result of the project would be a reference implementation in one Indian language (other than Hindi), along with a documented process for extending it to other Indian languages.

A validation system for translated strings based upon Translation Style Guides of language communities.

Every language community creates a set of Style Guides for translations. These pertain to the specific ways in which translatable aspects like Trademarks, Shortcuts, Hot-keys, accelerators etc are translated.

The project idea is to create a script-based workflow, manual or automated, which can accept a Style Guide as input, test a corpus of translated files against it and generate a result which can also be used to score the quality of a translation. Knowledge of Style Guides is preferable. The ability to develop code in scripting languages or web frameworks is required. This is an infrastructure project aimed at someone who would like to begin contributing towards i18n/l10n development. Adequate guidelines are available in the form of style guides. The intent of this project is to convert such guidelines into a scoring mechanism which will enable teams to reach conclusions on the quality of the translations. Familiarity with the plug-in systems of existing translation content management systems is preferable but by no means mandatory.
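
One way to picture "a Style Guide as input, a score as output" is to express each guideline as a small check over a (source, translation) pair and report the fraction of checks passed. The two rules below (an accelerator marker must be kept, and listed terms such as trademarks must appear unchanged) are assumptions made for illustration only; real style guides differ per language community, and all names here are hypothetical.

```python
def accelerator_preserved(source, translation, marker="_"):
    """Assumed rule: the accelerator marker is present in both or in neither."""
    return (marker in source) == (marker in translation)


def protected_terms_kept(source, translation, protected_terms):
    """Assumed rule: protected terms found in the source appear unchanged."""
    return all(term in translation
               for term in protected_terms if term in source)


def style_score(pairs, protected_terms, marker="_"):
    """Fraction of rule checks passed over all (source, translation) pairs."""
    passed = 0
    total = 0
    for source, translation in pairs:
        results = [
            accelerator_preserved(source, translation, marker),
            protected_terms_kept(source, translation, protected_terms),
        ]
        passed += sum(results)  # True counts as 1, False as 0
        total += len(results)
    return passed / total if total else 1.0
```

Keeping each rule as a separate function is a deliberate choice in this sketch: a community could then add or drop rules to match its own guide without touching the scoring loop.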

Design and development of a print ready OpenType font for Bengali

A specification for developing OpenType fonts for Indic languages is being drawn up by Santhosh Thottingal and others.

The available fonts for the Bengali language were not specifically developed for use in printed content. The proposal requires the design and development of an OpenType font for Bengali which is suitable for printed content, has aesthetic appeal and is compliant with the current specifications mandated by the Unicode Consortium. A specific set of test cases for the font would also need to be developed. Prior knowledge of font design/development, familiarity with the specifications and discussions of the Unicode Consortium and familiarity with font development tools on Linux are preferred. This is core development work as part of the organization’s focus area.

Terminology Query dashboard

For translators of specialized content, help with terminology reduces the dependence on the in-depth research that would otherwise be required. The project aims at building a flexible and automated dashboard to query for terminology references. It will provide visualized content from an underlying database that stores the mapping between original content and translations in multiple languages, along with references to the origins of the terminology context. The dashboard will also allow cross-referencing and provide an interactive system to suggest modifications to the content. The actual content would be editable only through an admin console. Additionally, the query system would be available for use within other translation systems through an API.
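
The mapping described above (a source term, its translations in several languages and references to where the term comes from) could be modelled roughly as follows. This is an in-memory sketch with hypothetical names; the actual project would sit on a database, with the admin-only editing and the API layered on top.

```python
from dataclasses import dataclass, field


@dataclass
class TermEntry:
    source: str
    # language code -> translated term, e.g. {"bn": "..."}
    translations: dict = field(default_factory=dict)
    # free-form notes on where the terminology originates
    references: list = field(default_factory=list)


class TermStore:
    """Case-insensitive lookup of terminology entries by source term."""

    def __init__(self):
        self._entries = {}

    def add(self, entry):
        """Add or replace the entry for a source term."""
        self._entries[entry.source.lower()] = entry

    def lookup(self, term, language):
        """Return the translation for a language, or None if unknown."""
        entry = self._entries.get(term.lower())
        if entry is None:
            return None
        return entry.translations.get(language)
```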
