tfidf

An implementation of Term Frequency - Inverse Document Frequency (tf-idf) using TextBlob. tf-idf is typically used to measure the relative importance of words in documents without relying exclusively on raw frequency. Raw frequency disproportionately values stop words and other common words that show up in nearly every document of a corpus.
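
To make the idea concrete, here is a minimal sketch of one common tf-idf variant. It is not the code in tfidf.py: it uses plain str.split instead of TextBlob, and it assumes tf = count / document length and idf = log(N / document frequency). The function name tfidf_scores is made up for this example.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return {doc_id: {word: score}} for a dict of doc_id -> text.

    Assumed variant: tf = count / doc length, idf = log(N / df).
    """
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: how many documents contain each word at least once.
    df = Counter()
    for words in tokenized.values():
        df.update(set(words))

    scores = {}
    for doc_id, words in tokenized.items():
        counts = Counter(words)
        total = len(words)
        scores[doc_id] = {
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        }
    return scores


docs = {
    "a": "the cat sat on the mat",
    "b": "the dog chased the cat",
    "c": "the bird sang",
}
scores = tfidf_scores(docs)
```

With this variant, "the" appears in every document, so its idf (and its score) is zero even though it is the most frequent word, while "mat" outranks "cat" in document "a" because it appears in fewer documents.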

From what I've seen, implementations of tf-idf rarely allow for batch training. Given that I'm often working with datasets of large-ish sizes (~100k docs), I decided to build one out.

Data can be input in a couple of different ways:

List of Strings + Corresponding List of String IDs (i.e. the String ID at index i corresponds to the String at index i).
- call batch_train_w_lists

Dictionary of ID keys + String values. Each ID maps directly to a String.
- call batch_train_w_dict
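
The two input shapes carry the same information. The sketch below only shows how they relate. It does not call batch_train_w_lists or batch_train_w_dict, since their exact argument order is not documented here, and lists_to_dict is a hypothetical helper.

```python
def lists_to_dict(doc_ids, docs):
    """Convert parallel lists (ID at index i <-> string at index i) to a dict."""
    if len(doc_ids) != len(docs):
        raise ValueError("doc_ids and docs must have the same length")
    if len(set(doc_ids)) != len(doc_ids):
        raise ValueError("doc_ids must be unique")
    return dict(zip(doc_ids, docs))


doc_ids = ["doc1", "doc2", "doc3"]
docs = ["Pancakes for breakfast", "Waffles for dinner", "Pancakes again"]

# Same data, dictionary form: each ID maps directly to its string.
docs_by_id = lists_to_dict(doc_ids, docs)
```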

When initializing your TFIDF object, toggle the boolean clean parameter to apply string pre-processing prior to training. This way, words like "PaNcakes" and "pancakes" will be considered one and the same! One thing to note: symbols will be replaced with spaces, so be prepared for "Ben's" to be converted to "ben" and "s", two separate words.
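
As a rough illustration of the kind of pre-processing described above, here is a sketch that lowercases text and replaces symbols with spaces. The function name clean_text and the exact character rules are assumptions, not the library's code. In particular, this version treats anything outside a-z and 0-9 as a symbol, so accented letters are also replaced.

```python
import re


def clean_text(text):
    """Lowercase the text and replace every symbol with a space."""
    lowered = text.lower()
    # Anything that is not a lowercase ASCII letter, digit, or whitespace
    # becomes a space (assumed rule for this sketch).
    spaced = re.sub(r"[^a-z0-9\s]", " ", lowered)
    # Collapse runs of whitespace left behind by the substitution.
    return " ".join(spaced.split())


cleaned = clean_text("Ben's PaNcakes")
```

Here cleaned is "ben s pancakes": the apostrophe turns into a space, so "Ben's" splits into "ben" and "s".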

Output structures:

tfidf(doc_id,word): Returns the typical tf-idf value.

large_doc_normalized_tfidf(doc_id,word): Scaled so that the largest tf-idf value in a given document is 1. Inspired by Stanford's description.
- First, we calculate the typical value, t = tfidf(doc_id,word).
- Second, we calculate the maximum tf-idf value within doc_id, max.
- Choose some a (usually 0.4), return a + (1-a)*t / max.

small_doc_normalized_tfidf(doc_id,word): Scaled so that the largest tf-idf value in the entire corpus is 1.
- After calculating every typical tf-idf value, we find the absolute max.
- Return tfidf(doc_id,word) / absolute max.
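
The two normalizations can be sketched as follows, working from a precomputed table of tf-idf values. The function names, the scores layout ({doc_id: {word: value}}), and the handling of an all-zero maximum are assumptions made for this example. The formulas themselves follow the steps listed above.

```python
def large_doc_normalized(scores, doc_id, word, a=0.4):
    """a + (1 - a) * t / max, where max is taken within the document."""
    t = scores[doc_id][word]
    doc_max = max(scores[doc_id].values())
    if doc_max == 0:
        # Assumed fallback: every word in the document scored 0.
        return a
    return a + (1 - a) * t / doc_max


def small_doc_normalized(scores, doc_id, word):
    """t / absolute max, where the max is taken over the entire corpus."""
    corpus_max = max(v for doc in scores.values() for v in doc.values())
    if corpus_max == 0:
        # Assumed fallback: every word in the corpus scored 0.
        return 0.0
    return scores[doc_id][word] / corpus_max


# Hand-picked values for illustration, not real tf-idf output.
scores = {
    "d1": {"cat": 0.2, "mat": 0.5},
    "d2": {"dog": 1.0, "cat": 0.25},
}
large_top = large_doc_normalized(scores, "d1", "mat")
small_top = small_doc_normalized(scores, "d2", "dog")
```

Note that with the large-document form, no word scores below a, so even the weakest word in a document keeps a floor value of 0.4 by default.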

Larger Scale Outputs:

top_n(doc_id,n=5,large_doc_normalized=False,small_doc_normalized=False): Returns the n words with the greatest tf-idf values in a document, ordered by tf-idf value. Set one of the boolean flags to rank by the corresponding normalized value instead.
every_word(doc_id,large_doc_normalized=False,small_doc_normalized=False): Returns every word and its corresponding tf-idf value in a given document.
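
For intuition, selecting the top n words from one document's scores might look like the sketch below. The helper name top_n_words, the descending order, and the list-of-tuples return shape are assumptions. The actual return type of top_n may differ.

```python
def top_n_words(word_scores, n=5):
    """Return the n (word, score) pairs with the highest scores, best first."""
    ranked = sorted(word_scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Hand-picked values for illustration.
word_scores = {"cat": 0.2, "mat": 0.5, "sat": 0.3}
best_two = top_n_words(word_scores, n=2)
```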
