added documentation #21
base: master
Conversation
Please make the suggested changes
self.doc_freqs = []  # list of dictionaries of term_frequency of each document
self.idf = {}  # idf score of each word in whole corpus
self.doc_len = []  # list of length of each document in corpus
self.tokenizer = tokenizer  # user input tokenizer, defaults to none

if tokenizer:
    corpus = self._tokenize_corpus(corpus)

nd = self._initialize(corpus)
What is `nd` here? You should explain it.
Example:
corpus = [['ram', 'is', 'a', 'good', 'boy'], ['ram', 'does', 'cycling', 'and', 'racing'], ['ram', 'is', 'healthy'], ['rita', 'likes', 'shyam'], ['good', 'luck']]
nd = {'ram': 3, 'is': 2, 'a': 1, 'good': 2, 'boy': 1, 'does': 1, 'cycling': 1, 'and': 1, 'racing': 1, 'healthy': 1, 'rita': 1, 'likes': 1, 'shyam': 1, 'luck': 1}
Shorten the examples so that I don't need to scroll. The functionality can be explained just as well with only two items in the list.
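A minimal sketch of what a shorter example could look like, assuming `nd` maps each word to the number of documents that contain it (which is what the counts in the longer example suggest). The helper name `document_frequencies` is hypothetical and is not the PR's `_initialize`:

```python
def document_frequencies(corpus):
    """Map each word to the number of documents that contain it."""
    nd = {}
    for document in corpus:
        # set() ensures a word is counted once per document,
        # even if it occurs several times in that document
        for word in set(document):
            nd[word] = nd.get(word, 0) + 1
    return nd


# Two documents are enough to show shared and unshared words.
corpus = [["ram", "is", "good"], ["ram", "is", "healthy"]]
nd = document_frequencies(corpus)
```

Here `nd["ram"]` and `nd["is"]` are both 2 because those words appear in both documents, while `nd["good"]` and `nd["healthy"]` are 1.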
  for document in corpus:
      self.doc_len.append(len(document))
-     num_doc += len(document)
+     num_words += len(document)  # total number of words in whole corpus
The purpose of the variable `num_words` has already been explained.
- frequencies = {}
+ term_frequencies = (
+     {}
+ )  # term frequency of each word in a document........ changed frequencies to term_frequencies
You don't need a comment saying that you renamed the variable. Git keeps track of that.
if word not in term_frequencies:
    term_frequencies[word] = 0
This block of code can be removed by using `defaultdict` instead of a normal dictionary.
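A rough sketch of the `defaultdict` version, assuming the line that follows the membership check increments the count by one (that line is not shown in this hunk). The function name `count_terms` is hypothetical and only there to keep the example self-contained:

```python
from collections import defaultdict


def count_terms(document):
    """Count how often each word occurs in a single tokenized document."""
    # defaultdict(int) returns 0 for a missing key, so the explicit
    # "if word not in term_frequencies" initialization is not needed
    term_frequencies = defaultdict(int)
    for word in document:
        term_frequencies[word] += 1
    # convert back to a plain dict so later lookups of unseen words
    # do not silently insert new keys
    return dict(term_frequencies)


tf = count_terms(["ram", "is", "a", "good", "boy", "ram"])
```

With this input, `tf["ram"]` is 2 and every other word maps to 1. `collections.Counter(document)` would give the same counts in one call, if that reads better in the surrounding code.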
Added inline comments and docstrings to explain what the code is actually doing.