
Fix for NUTCH-2245 NGram Model for Cosine Similarity by bhavyasanghavi #101

Closed
bhavyasanghavi wants to merge 1 commit into apache:master from bhavyasanghavi:NUTCH-2245

Conversation

@bhavyasanghavi
No description provided.

@bhavyasanghavi
Author

Description:

  1. The user should enable the plugin by enabling scoring-similarity in the plugin.includes property in nutch-site.xml.
  2. Copy the gold-standard file into the conf directory and enter the name of this file in nutch-site.xml.
  3. The user should add the following property to nutch-site.xml to specify the window size for the ngram model:
    <property>
    <name>scoring.similarity.ngrams</name>
    <description>Specifies the 'n' in ngrams</description>
    </property>
  4. If the user does not include the above-mentioned property, the default size will be 1, i.e., unigrams will be used for similarity scoring.

Use Case:
This is especially useful when the user specifies named entities in the gold standard. For example, suppose "Leonardo DiCaprio" is one of the terms in the gold standard. With unigrams, a page about "Leonardo da Vinci" still matches on the shared term "Leonardo". The ngram model with a window size of 2 distinguishes the two, because the bigram "Leonardo DiCaprio" does not occur in such a page. Another example is when the user is looking for phrases or dialogues: the window size can be adjusted to suit the content of the gold standard.
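The use case above can be illustrated with a minimal sketch. It is not the code from this patch: the class name `NGramCosineSketch` and its methods are hypothetical, and it assumes plain whitespace tokenization with lowercasing rather than the token stream the patch iterates over. It only shows how raising the window size from 1 to 2 changes the cosine score for the two names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: NGramCosineSketch and its methods are hypothetical
// names and do not come from the patch under review.
class NGramCosineSketch {

  // Lowercase, split on whitespace, and join every run of n adjacent tokens.
  static List<String> ngrams(String text, int n) {
    List<String> grams = new ArrayList<>();
    String trimmed = text.toLowerCase().trim();
    if (trimmed.isEmpty() || n < 1) {
      return grams;
    }
    String[] tokens = trimmed.split("\\s+");
    for (int i = 0; i + n <= tokens.length; i++) {
      grams.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
    }
    return grams;
  }

  // Build a term-frequency vector keyed by ngram.
  static Map<String, Integer> counts(List<String> grams) {
    Map<String, Integer> freq = new HashMap<>();
    for (String g : grams) {
      freq.merge(g, 1, Integer::sum);
    }
    return freq;
  }

  // Cosine similarity of two sparse frequency vectors; 0.0 if either is empty.
  static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
    double dot = 0, normA = 0, normB = 0;
    for (Map.Entry<String, Integer> e : a.entrySet()) {
      dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
      normA += e.getValue() * e.getValue();
    }
    for (int v : b.values()) {
      normB += v * v;
    }
    if (normA == 0 || normB == 0) {
      return 0.0;
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  static double similarity(String gold, String page, int n) {
    return cosine(counts(ngrams(gold, n)), counts(ngrams(page, n)));
  }

  public static void main(String[] args) {
    String gold = "Leonardo DiCaprio";
    String page = "Leonardo da Vinci";
    System.out.println("n=1: " + similarity(gold, page, 1));
    System.out.println("n=2: " + similarity(gold, page, 2));
  }
}
```

With a window size of 1 the two strings share the term "leonardo", so the score is positive (about 0.41). With a window size of 2 they share no bigram, so the score is 0.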


//Check if user has specified n for ngram cosine model
int ngram = conf.getInt("scoring.similarity.ngrams", 1);
LOG.info("Value of ngram: "+ngram);
Member

Please use the correct, efficient slf4j parameterized notation here,
e.g. LOG.info("Value of ngram: {} ", ngram);

Author

Yes, I will update that. Thanks.
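The efficiency point behind the parameterized notation can be sketched as follows. This is an analogue, not slf4j itself: to stay runnable without extra dependencies it uses `java.util.logging` from the JDK, whose placeholder syntax is `{0}` rather than slf4j's `{}`. The class `DeferredLoggingSketch`, the `Term` type, and the `renderCount` counter are hypothetical names introduced only to make the cost visible.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Illustrative sketch only: uses java.util.logging as a stand-in for slf4j.
// All names here are hypothetical and not part of the patch.
class DeferredLoggingSketch {

  // Counts how many times an argument was rendered to a string.
  static int renderCount = 0;

  static class Term {
    final String text;
    Term(String text) { this.text = text; }
    @Override public String toString() {
      renderCount++;
      return text;
    }
  }

  static final Logger LOG = Logger.getLogger("DeferredLoggingSketch");
  static {
    // Pin the level so FINE (debug-like) messages are disabled.
    LOG.setLevel(Level.INFO);
  }

  // Eager: concatenation renders the argument before the level is checked.
  static void logEager(Term t) {
    LOG.fine("Term: " + t);
  }

  // Deferred: the argument is passed as a parameter and is only rendered
  // if the message is actually going to be logged.
  static void logDeferred(Term t) {
    LOG.log(Level.FINE, "Term: {0}", t);
  }

  public static void main(String[] args) {
    Term t = new Term("leonardo dicaprio");
    logEager(t);
    System.out.println("renders after eager call: " + renderCount);
    logDeferred(t);
    System.out.println("renders after deferred call: " + renderCount);
  }
}
```

Because the level is disabled, the eager call still pays for building the string, while the parameterized call returns before rendering its argument. The same reasoning applies to the `{}` form requested in the review.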

@lewismc
Member

lewismc commented Apr 1, 2016

Overall, the patch is looking excellent.

tStream.reset();
while(tStream.incrementToken()) {
String term = charTermAttribute.toString();
LOG.info(term);
Member

This seems like it's used for debugging, so please change it to LOG.debug(). It helps keep the log clean.
Thanks!

Author

Updated it. Thanks.

@sujen1412
Member

I have tried out the patch and it works as expected. Going to commit now.
Thanks @bhavyasanghavi!

@asfgit asfgit closed this in b62f43f Apr 4, 2016
3 participants