benbroks/tfidf

tfidf

Implementation of Term Frequency - Inverse Document Frequency using TextBlob. This algorithm is typically used to measure the relative importance of words in documents without relying exclusively on raw frequency. Raw frequency disproportionately values stop words and other common words that show up in nearly any corpus.
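
To make the point about stop words concrete, here is a minimal sketch of one common tf-idf variant. It is not this repository's code: the helper names (`tf`, `idf`, `tfidf`), the plain `log(N / df)` weighting, and the toy corpus are all assumptions chosen for illustration, and it uses whitespace splitting rather than TextBlob.

```python
import math
from collections import Counter

def tf(word, doc_words):
    # Term frequency: fraction of the document's tokens equal to `word`.
    return Counter(doc_words)[word] / len(doc_words)

def idf(word, corpus):
    # Inverse document frequency, assumed here to be log(N / df).
    # A word found in every document gets log(1) = 0.
    df = sum(1 for doc_words in corpus.values() if word in doc_words)
    return math.log(len(corpus) / df) if df else 0.0

def tfidf(word, doc_id, corpus):
    return tf(word, corpus[doc_id]) * idf(word, corpus)

# Hypothetical toy corpus: doc ID -> list of tokens.
corpus = {
    "a": "the cat sat on the mat".split(),
    "b": "the dog ate the bone".split(),
    "c": "the bird sang".split(),
}

the_score = tfidf("the", "a", corpus)  # most frequent word, present in every doc
cat_score = tfidf("cat", "a", corpus)  # rarer word, present in one doc
```

Under this variant, "the" is the most frequent token in document "a" but scores zero because it appears in all three documents, while "cat" scores above zero.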

From what I've seen, implementations of tf-idf rarely allow for batch training. Given that I'm often working with datasets of large-ish sizes (~100k docs), I decided to build one out.

Data can be input in a couple of different ways:

  1. List of Strings + Corresponding List of String IDs. (i.e. String ID at index i corresponds to the String at index i)
    • call batch_train_w_lists
  2. Dictionary of ID keys + String values. Each ID directly maps to a String.
    • call batch_train_w_dict
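
The two input shapes carry the same information. The sketch below shows how parallel lists map onto the dictionary form. `lists_to_dict` is a hypothetical helper written for this illustration, not part of the library, and the argument order is an assumption.

```python
def lists_to_dict(doc_ids, docs):
    # Pair the ID at index i with the string at index i.
    if len(doc_ids) != len(docs):
        raise ValueError("doc_ids and docs must have the same length")
    return dict(zip(doc_ids, docs))

# Hypothetical example data.
doc_ids = ["doc1", "doc2"]
docs = ["Pancakes for breakfast", "Waffles for dinner"]

as_dict = lists_to_dict(doc_ids, docs)
```

Note that a dictionary cannot hold duplicate IDs, so a repeated ID in the list form would keep only its last string in this sketch.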

When initializing your TFIDF object, toggle the boolean clean parameter to apply string pre-processing prior to training. This way, words like "PaNcakes" and "pancakes" will be considered one and the same! One thing to note: symbols will be replaced with spaces, so be prepared for "Ben's" to be converted to "ben" and "s", two separate words.
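
As a rough illustration of the cleaning behavior described above, the sketch below lowercases the text and replaces symbols with spaces. `clean_text` is a hypothetical stand-in, not the library's own routine, and treating every character other than a-z, 0-9, and whitespace as a symbol is an assumption.

```python
import re

def clean_text(text):
    # Lowercase so "PaNcakes" and "pancakes" become the same token.
    lowered = text.lower()
    # Replace anything that is not a-z, 0-9, or whitespace with a space.
    # (Assumption: this also turns non-ASCII letters into spaces.)
    spaced = re.sub(r"[^a-z0-9\s]", " ", lowered)
    return spaced.split()

tokens = clean_text("Ben's PaNcakes!")
```

Here the apostrophe becomes a space, so "Ben's" splits into "ben" and "s".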

Output structures:

  1. tfidf(doc_id,word): Returns typical tf-idf value
  2. large_doc_normalized_tfidf(doc_id,word): Scales values so that the largest tf-idf value in a given document is 1. Inspired by Stanford's description.
    • First, calculate the typical value, t = tfidf(doc_id,word).
    • Second, calculate the maximum tf-idf value within doc_id, max.
    • Choose some a (usually 0.4) and return a + (1-a) * (t / max).
  3. small_doc_normalized_tfidf(doc_id,word): Largest tf-idf value in the entire corpus is 1.
    • After calculating every typical tf-idf value, we find the absolute max.
    • Return tfidf(doc_id,word)/absolute max.
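
The two normalizations above can be sketched as follows, working from a precomputed table of typical tf-idf values. The function names, the `scores` layout (doc ID -> word -> value), and the handling of an all-zero maximum are assumptions made for this example, not the library's implementation.

```python
def large_doc_normalized(scores, doc_id, word, a=0.4):
    # Scale within one document: the document's top word maps to 1.
    doc_scores = scores[doc_id]
    doc_max = max(doc_scores.values())
    if doc_max == 0:
        return a  # assumption: fall back to the floor value when every score is 0
    return a + (1 - a) * (doc_scores[word] / doc_max)

def small_doc_normalized(scores, doc_id, word):
    # Scale across the corpus: the single highest value anywhere maps to 1.
    corpus_max = max(v for doc in scores.values() for v in doc.values())
    if corpus_max == 0:
        return 0.0  # assumption: avoid dividing by zero
    return scores[doc_id][word] / corpus_max

# Hypothetical precomputed tf-idf values.
scores = {
    "d1": {"cat": 0.2, "mat": 0.1},
    "d2": {"dog": 0.4, "bone": 0.3},
}

cat_large = large_doc_normalized(scores, "d1", "cat")
mat_large = large_doc_normalized(scores, "d1", "mat")
cat_small = small_doc_normalized(scores, "d1", "cat")
dog_small = small_doc_normalized(scores, "d2", "dog")
```

With the per-document version, every word in a document lands between a and 1, so the top word of a short document and the top word of a long one both score 1. With the corpus-wide version, only the word holding the absolute maximum scores 1.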

Larger Scale Outputs:

  1. top_n(doc_id,n=5,large_doc_normalized=False,small_doc_normalized=False): Returns the n words with the greatest tf-idf values in a document, ordered by tf-idf value.
  2. every_word(doc_id,large_doc_normalized=False,small_doc_normalized=False): Returns every word in a given document along with its tf-idf value.
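
The ranking behind a top-n query can be sketched as below. This standalone `top_n` takes a word -> value mapping for one document instead of a doc_id, and the return shape (a list of (word, value) pairs, highest first) is an assumption rather than the library's documented output.

```python
def top_n(doc_scores, n=5):
    # Sort (word, value) pairs by value, highest first, and keep the first n.
    ranked = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]

# Hypothetical tf-idf values for a single document.
doc_scores = {"the": 0.0, "cat": 0.2, "mat": 0.1}

best_two = top_n(doc_scores, n=2)
```

Leaving n larger than the vocabulary simply returns every word in ranked order, which is one way to think about how every_word relates to top_n.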

About

term frequency–inverse document frequency w/ batch training
