Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

a performance question and proposal to expose the Idf trait #5

Open
btc opened this issue May 18, 2021 · 2 comments
Open

a performance question and proposal to expose the Idf trait #5

btc opened this issue May 18, 2021 · 2 comments

Comments

@btc
Copy link
Contributor

btc commented May 18, 2021

Hi! Thanks for publishing this software. It's quite helpful to potentially be able to use your library instead of re-implementing TFIDF myself. I am grateful for the time and attention you've given to this.

For my use-case, I am working with a large corpus of documents and trying to understand if I can use this library in a way which will have suitable performance.

Examples in this repo show examples of the form:

for term in terms:
  for doc in docs:
    score = compute_tfidf(term, doc) 

where compute_tfidf is either TfIdfDefault::tfidf or MyTfIdfStrategy::tfidf


  1. Is it true that the idf implementations exposed by this crate all require a O(n) linear iteration over the documents/corpus?
  2. Is it possible to use the idf functions on their own, without going through tfidf?

Presented in pseudocode here, I would like to do the following:

for term in terms:
  idf = compute_idf(term, docs)
  for doc in docs:
    score = compute_tf(term, doc) * idf

Concretely, I have tried to use the library in the following way, but ran into an error that I don't quite understand yet:

use tfidf::idf::{InverseFrequencySmoothIdf};
use tfidf::tf::DoubleHalfNormalizationTf;
use tfidf::{Tf, Idf};

for term in terms {
  let idf = idf::InverseFrequencyIdf::idf(term, docs)
  for doc in docs {
    let tf = tf::DoubleHalfNormalizationTf::tf(term, doc)
    let tfidf = tf * idf
  }
}

Error 1:

Unresolved import `tfidf::Idf`

Error 2:

No function or associated item named `idf` found for struct `InverseFrequencySmoothIdf` in the current scope

Further exploration has led me to discover that this may be occurring simply because Idf isn't exposed. Would it be okay for me to submit a patch which modifies lib.rs to expose Idf?

by changing:

pub use prelude::{
  Document, ExpandableDocument, NaiveDocument, NormalizationFactor, ProcessedDocument,
  SmoothingFactor, Tf, TfIdf,
};

proposed:

pub use prelude::{
  Document, ExpandableDocument, NaiveDocument, NormalizationFactor, ProcessedDocument,
  SmoothingFactor, Tf, TfIdf, Idf,
};
@ferristseng
Copy link
Owner

Further exploration has led me to discover that this may be occurring simply because Idf isn't exposed. Would it be okay for me to submit a patch which modifies lib.rs to expose Idf?

Hi! That would be fine to submit that patch. I'm surprised that that trait isn't exposed already!

Is it true that the idf implementations exposed by this crate all require a O(n) linear iteration over the documents/corpus?

Just an FYI, I'm only passively maintaining this project. I'm not sure if I'm able to answer your question, since I haven't look at this code in a long time.

@btc btc mentioned this issue May 18, 2021
@btc
Copy link
Contributor Author

btc commented May 18, 2021

Hi! That would be fine to submit that patch. I'm surprised that that trait isn't exposed already!

Here you go:

#6

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants