WEIGHTING SCHEMES

About You

I am a Computer Science student at the Indian Institute of Technology Kanpur, India. I am proficient in C and C++. As a hobby, I practise competitive programming. Apart from programming, I am also very fond of mathematics.

Background Information

Have you taken part in GSoC and/or GCI (https://codein.withgoogle.com/) and/or similar programmes before? If so, tell us about how it went, and any areas you would have liked more help with.

No, this is the first time.

Please tell us about any previous experience you have with Xapian, or other systems for indexed text search.

Before deciding to apply for GSoC, I didn't have much experience. I have now been using Xapian for the past month. I am familiar with its indexed text search and with the implementation of the current weighting schemes (the part I intend to work on).

Tell us about any previous experience with Free Software and Open Source other than Xapian.

No prior experience other than Xapian.

What other relevant prior experience do you have (courses taken at college, hobbies, holiday jobs, etc)?

I have taken courses in C and C++. I have also taken mathematics courses in Linear Algebra and Number Theory.

What development platforms, tools and methods do you prefer to use?

I use Ubuntu as my OS. I am familiar with GitHub.

Have you previously worked on a project of a similar scope? If so, tell us about it.

No, this is the first time.

What timezone will you be in during the coding period?

IST (GMT+5:30)

Will your Summer of Code project be the main focus of your time during the program?

Yes, I won't have any other major activity during the program.

Expected work hours (e.g. Monday–Friday 9am–5pm UTC)

Monday-Friday 9am-5pm IST

Are you applying for other projects in GSoC this year? If so, with which organisation(s)?

No.

Your Project

Motivations

Why have you chosen this particular project?

This project suits my skills. I am also eager to compare the effectiveness of different weighting schemes.

Who will benefit from your project and in what ways?

Anybody using Xapian will have more choice of weighting scheme. There are many other weighting schemes worth implementing in Xapian: some because they are potentially more effective than BM25 (the default scheme in Xapian), others because they are of interest to Information Retrieval students and academics.

Using Weight::create() in xapian-evaluation would help to establish a standard format for specifying the weighting scheme, which is helpful for users.

Project Details

Describe any existing work and concepts on which your project is based.

Xapian already supports the vector space model used in TF-IDF weighting schemes. Some normalisations (described by SMART) are already implemented by subclassing Weight. We can add more normalisations if we make some more statistics available to our weighting schemes. For example, getting the max-tf would enable us to implement the "aug-norm" and "max-norm" normalisations described by SMART.

I would also implement the following normalisations (described in the research paper http://www.kolda.net/publication/ornl-tm-13756.pdf): Entropy, Global frequency IDF, Changed-coefficient ATF1, Augmented average term frequency, Augmented log, Square root, Log-global frequency IDF, Incremented global frequency IDF, and Square root global frequency IDF.

These normalisations have proven to be more effective than other popular weighting schemes in certain cases. (For details, please refer to the research paper.)

The normalisations listed above don't need any extra statistics, i.e. all the required stats are already available. So only a small patch is required for each.
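As a rough illustration of how small each patch could be, here is a minimal standalone sketch of a few of these formulas as I understand them from the paper. The function names are hypothetical and are not part of Xapian's API; `wdf` is the number of occurrences of the term in the document, `collfreq` is the total number of occurrences in the collection, and `termfreq` is the number of documents containing the term. The natural logarithm is used here as an assumption; the paper's base may differ.

```cpp
#include <cassert>
#include <cmath>

// Local (within-document) normalisations. A term that does not occur
// in the document gets weight 0.

// Augmented log: 0.2 + 0.8 * log(wdf + 1)
double wdfn_aug_log(double wdf) {
    assert(wdf >= 0);
    return wdf == 0 ? 0.0 : 0.2 + 0.8 * std::log(wdf + 1.0);
}

// Square root: sqrt(wdf - 0.5) + 1
double wdfn_square_root(double wdf) {
    assert(wdf >= 0);
    return wdf == 0 ? 0.0 : std::sqrt(wdf - 0.5) + 1.0;
}

// Global normalisations, all built on the ratio collfreq / termfreq.
// This ratio is at least 1 because each matching document contains
// the term at least once.

// Global frequency IDF: collfreq / termfreq
double idfn_global_freq(double collfreq, double termfreq) {
    assert(termfreq > 0 && collfreq >= termfreq);
    return collfreq / termfreq;
}

// Log-global frequency IDF: log(collfreq / termfreq + 1)
double idfn_log_global_freq(double collfreq, double termfreq) {
    return std::log(idfn_global_freq(collfreq, termfreq) + 1.0);
}

// Incremented global frequency IDF: collfreq / termfreq + 1
double idfn_incremented_global_freq(double collfreq, double termfreq) {
    return idfn_global_freq(collfreq, termfreq) + 1.0;
}

// Square root global frequency IDF: sqrt(collfreq / termfreq - 0.9)
double idfn_sqrt_global_freq(double collfreq, double termfreq) {
    return std::sqrt(idfn_global_freq(collfreq, termfreq) - 0.9);
}
```

All inputs here are statistics a weighting scheme can already request, which is why no framework changes are expected for this group.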

I would also implement the "aug-norm" and "max-norm" normalisations described by SMART. For that, "max-tf" needs to be made visible to our weighting scheme. This can be done using the Weight::Internal class, which has a map (named termfreqs) holding the required information.
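To make the role of max-tf concrete, here is a minimal standalone sketch of the two SMART formulas. It assumes max-tf is the largest within-document frequency of any term in the document. The `TermCounts` map and the function names are hypothetical stand-ins for illustration only; they are not the `termfreqs` map in Weight::Internal or any other Xapian type.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in: term -> number of occurrences in one document.
using TermCounts = std::map<std::string, unsigned>;

// Largest within-document frequency in the document (0 if it is empty).
unsigned max_wdf(const TermCounts& doc) {
    unsigned result = 0;
    for (const auto& entry : doc) {
        result = std::max(result, entry.second);
    }
    return result;
}

// SMART "max-norm": wdf / max_wdf, which lies in (0, 1] for present terms.
double wdfn_max(unsigned wdf, unsigned max) {
    assert(max > 0 && wdf <= max);
    return static_cast<double>(wdf) / max;
}

// SMART "aug-norm": 0.5 + 0.5 * wdf / max_wdf, which lies in (0.5, 1]
// for present terms.
double wdfn_aug(unsigned wdf, unsigned max) {
    return 0.5 + 0.5 * wdfn_max(wdf, max);
}
```

Both formulas need only one extra number per document, so exposing max-tf once should be enough for the pair.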

It would also be very useful to see how the different schemes compare for speed and retrieval effectiveness, so that we can offer solid advice to users wondering which to use. For this we can use https://github.com/samuelharden/xapian-evaluation to evaluate the modified weighting functions and compare them with their counterparts, to assess their speed and retrieval effectiveness.

The code in xapian-evaluation currently requires some cleanup. At present we need to provide separate information for each weighting scheme and its parameters, which is a lot of extra work. All of this can be avoided by using Weight::create(). First we make changes in config_file.cc and config_file.h. To use Weight::create(), we can change the config_value string input: instead of providing just the name of the weighting scheme, we can use a string with the scheme name followed by its parameters (the same format used by Weight::create()). Then we won't need to write extra code for the parameters. We will also make changes in trec_search. This becomes simple, as we won't need separate checks for each scheme: we would simply call Weight::create() and pass it the string described above.
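To show what such a single config value carries, here is a minimal standalone sketch that splits a value like "bm25 1.0 0.8" into the scheme name and the rest of the string holding the parameters. This is only an illustration of the string format under my assumptions: `SchemeSpec` and `split_scheme` are hypothetical names, the example value is made up, and in the actual change the whole string would be handed to Weight::create() rather than parsed in xapian-evaluation.

```cpp
#include <string>

// Hypothetical holder for the two parts of a config value.
struct SchemeSpec {
    std::string name;    // e.g. "bm25"
    std::string params;  // e.g. "1.0 0.8" (empty if none were given)
};

// Split "name param1 param2 ..." at the first run of whitespace.
SchemeSpec split_scheme(const std::string& value) {
    const char* whitespace = " \t";
    SchemeSpec spec;

    // Skip leading whitespace; an all-blank value yields an empty spec.
    std::string::size_type begin = value.find_first_not_of(whitespace);
    if (begin == std::string::npos) return spec;

    // The name runs up to the next whitespace character (or the end).
    std::string::size_type end = value.find_first_of(whitespace, begin);
    if (end == std::string::npos) {
        spec.name = value.substr(begin);
        return spec;
    }
    spec.name = value.substr(begin, end - begin);

    // Everything after the separating whitespace is the parameter string.
    std::string::size_type rest = value.find_first_not_of(whitespace, end);
    if (rest != std::string::npos) spec.params = value.substr(rest);
    return spec;
}
```

With one string per scheme, the evaluation code no longer needs a separate config key for every parameter of every scheme.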

Then I would add support for the newly implemented normalisations in Weight::create(). This requires some changes in Registry; a few additions to the map implemented there should be enough.
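The idea of a name-to-scheme map can be sketched as follows. This is a simplified standalone model, not Xapian's Registry: `Scheme`, `SchemeRegistry` and their member functions are hypothetical names chosen for illustration.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for a weighting scheme object.
struct Scheme {
    std::string name;
};

// A factory builds a fresh scheme object on demand.
using Factory = std::function<Scheme()>;

class SchemeRegistry {
    std::map<std::string, Factory> factories_;

  public:
    // Adding a new scheme is one extra entry in the map.
    void register_scheme(const std::string& name, Factory factory) {
        factories_[name] = std::move(factory);
    }

    // True if a scheme was registered under this name.
    bool known(const std::string& name) const {
        return factories_.count(name) != 0;
    }

    // Build the scheme registered under `name`.
    // Throws std::out_of_range if the name is unknown.
    Scheme create(const std::string& name) const {
        return factories_.at(name)();
    }
};
```

In this model, supporting a new scheme by name needs no change to the lookup code, only one more registration.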

Finally, I would compare the effectiveness of the different normalisations using xapian-evaluation.

Stretch Goals

If time permits, we can also implement the following:

Providing the exact number of unique terms: currently the number of unique terms is not exact. We can get an exact value by storing the number of unique terms for each document, as we already do for the document length.

There are more weighting schemes which we can add. Some of them were developed using genetic programming and have been shown to outperform BM25 in some cases. For details, please refer to http://www.genetic-programming.org/hc2007/07-Cummins/Cummins-JIR-2006.pdf

Do you have any preliminary findings or results which suggest that your approach is possible and likely to succeed?

  1. http://www.kolda.net/publication/ornl-tm-13756.pdf
  2. http://www.iaeng.org/publication/IMECS2010/IMECS2010_pp690-692.pdf
  3. http://people.csail.mit.edu/jrennie/ecoc-svm/smart.html

The first paper suggests that the new normalisations are more effective than other popular weighting schemes in certain cases. Since that paper is fairly old, I have also cited the second paper, which is more recent and also discusses the weighting schemes from the first paper. The third reference describes the SMART normalisations, which are widely used.

The implementation and evaluation of the new normalisations are similar to those of the normalisations already implemented, so that shouldn't be a problem either.

What other approaches have you considered, and why did you reject those in favour of your chosen approach?

I also wanted to implement the "sum", "cosine", "max" and "fourth" normalisations described by SMART for the third parameter. But unlike those I have chosen to implement, these require visibility of the weights of other terms. Since the weights of other terms are internal to the scheme, these stats can't be fed directly to our weighting scheme. So, as far as I can see, they can't be implemented in the current weighting framework.

Another approach to adding support in xapian-evaluation was to simply add code without using Weight::create(). I rejected this because using Weight::create() avoids the step of providing separate information for each weighting scheme and its parameters.

Please note any uncertainties or aspects which depend on further research or investigation.

I don't foresee any significant uncertainties in this project.

How useful will your results be when not everything works out exactly as planned?

Even if things don't go as planned, the work would still be useful. This project involves implementing several different normalisations, and their implementations do not depend directly on each other. Even if some normalisations are left unimplemented, those that are implemented will be in full working condition.

The changes in xapian-evaluation to use Weight::create() are also an independent sub-project.

Project Timeline

Community bonding (4 MAY-31 MAY)
Week 1 (4 MAY-10 MAY)
Going through the weight files again in full detail and discussing doubts on IRC.
Week 2 (11 MAY-17 MAY)
Understanding how automated test cases are written.
Week 3 (18 MAY-24 MAY)
Going through the xapian-evaluation files in detail.
Week 4 (25 MAY-31 MAY)
Time for any other discussion needed prior to coding.
Coding
Week 1 (1 JUNE-6 JUNE)
  • Implement Entropy and Global frequency IDF (1 day)
  • Write test cases for Entropy and Global frequency IDF (3 days)
  • Implement Changed-coefficient ATF1 and Augmented average term frequency (1 day)
Week 2 (7 JUNE-13 JUNE)
  • Write test cases for Changed-coefficient ATF1 and Augmented average term frequency (3 days)
  • Open a PR for these changes, get it reviewed, and complete the user guide documentation in parallel (2 days)
Week 3 (14 JUNE-20 JUNE)
  • Time for any changes to the PR after review by mentors (2 days)
  • Implement Augmented log and Square root (1 day)
  • Write test cases for Augmented log and Square root (2 days)
Week 4 (21 JUNE-27 JUNE)
  • Write more test cases for Augmented log and Square root (1 day)
  • Implement Log-global frequency IDF and Incremented global frequency IDF (1 day)
  • Write test cases for Log-global frequency IDF and Incremented global frequency IDF (3 days)
Week 5 (28 JUNE-4 JULY) [Evaluation 1]
  • Submit evaluation.
  • Open a PR for these changes, get it reviewed, and complete the user guide documentation in parallel (4 days)
Week 6 (5 JULY-11 JULY)
  • Implement Square root global frequency IDF (1 day)
  • Write test cases for Square root global frequency IDF (2 days)
  • Get max-tf from Weight::Internal (2 days)
Week 7 (12 JULY-18 JULY)
  • Implement "max-norm" and "aug-norm" (1 day)
  • Write test cases for max-tf (2 days)
  • Write test cases for "max-norm" and "aug-norm" (2 days)
Week 8 (19 JULY-25 JULY)
  • Write more test cases for "max-norm" and "aug-norm" (1 day)
  • Open a PR for these changes, get it reviewed, and complete the user guide documentation in parallel (4 days)
Week 9 (26 JULY-1 AUG) [Evaluation 2]
  • Submit evaluation.
  • Add support for the new normalisations in Weight::create(). This includes the changes to be made in Registry (3 days)
Week 10 (2 AUG-8 AUG)
  • Make changes in config_file.cc (2 days)
  • Make changes in config_file.h (1 day)
  • Make changes in trec_search.cc (2 days)
Week 11 (9 AUG-15 AUG)
  • Open a PR for these changes (4 days)
  • Compare the effectiveness of the different normalisations. We will use the FIRE database for this.
Week 12 (16 AUG-22 AUG)
  • Write user guide documentation for these results.
  • This week is kept as a buffer week. Because of the COVID-19 outbreak in my country, our exams have been rescheduled, but the dates have not been decided yet, so I might have exams during the coding period. They won't last more than a week. If I have time, I can work on the stretch goals.

Previous Discussion of your Project

I have discussed it on IRC.

Licensing of your contributions to Xapian

Do you agree to dual-license all your contributions to Xapian under the GNU GPL version 2 and all later versions, and the MIT/X licence?

For the avoidance of doubt this includes all contributions to our wiki, mailing lists and documentation, including anything you write in your project's wiki pages.

I already agreed to that when I made my first contribution to Xapian.

Use of Existing Code

If you already know about existing code you plan to incorporate or libraries you plan to use, please give details.

I would be using Xapian::Weight and its subclasses. I would also be using xapian-evaluation.