https://medium.com/rapids-ai/show-me-the-word-count-3146e1173801
- filter punctuation
- to_lower
- remove stop words (from nltk corpus)
- replace multiple spaces with a single space
- remove leading and trailing spaces
- Get the frequency count for the whole dataset
- Compare word counts for two authors (Albert Einstein vs. Charles Dickens)
- Get word counts for all the authors
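The cleaning and counting steps above can be sketched in plain Python. This is a minimal CPU-only illustration, not the original GPU implementation: the function names (`clean_text`, `word_counts`, `counts_by_author`) are hypothetical, and `STOP_WORDS` is a tiny stand-in for the NLTK stop-word list.

```python
import re
import string
from collections import Counter

# Assumption: a tiny stand-in for the NLTK English stop-word list.
STOP_WORDS = {"the", "a", "of", "and", "was", "it", "is", "in"}

# Translation table that maps every ASCII punctuation character to a space.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text, stop_words=STOP_WORDS):
    """Apply the five cleaning steps in the order listed above."""
    text = text.translate(_PUNCT_TABLE)        # filter punctuation
    text = text.lower()                        # to_lower
    # Remove stop words; splitting on a single space keeps empty tokens,
    # so runs of spaces survive until the next step.
    text = " ".join(w for w in text.split(" ") if w not in stop_words)
    text = re.sub(r"\s+", " ", text)           # collapse repeated whitespace
    return text.strip()                        # drop leading/trailing spaces


def word_counts(texts):
    """Frequency count over an iterable of raw text strings."""
    counts = Counter()
    for text in texts:
        counts.update(clean_text(text).split())
    return counts


def counts_by_author(texts_by_author):
    """Map each author to a Counter built from that author's texts."""
    return {author: word_counts(texts) for author, texts in texts_by_author.items()}
```

Calling `word_counts` on every text gives the dataset-wide frequency count, while `counts_by_author` gives one counter per author, so two authors can be compared by looking up the same word in each counter.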
We do this in two steps:
- Encode the string Series using the top 20k most used words in the dataset, which we calculated earlier.
  - We encode anything not in the series to string_id = 20_000 (threshold).
- With the encoded count series for all authors, we create an aligned word-count vector for them, where:
  - Each column corresponds to a word_id from the top 20k words
  - Each row corresponds to the count vector for that author
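The two steps above can be illustrated with a small pure-Python sketch. The helper names (`build_vocab`, `encode`, `count_vector`) are hypothetical, and the example uses a threshold of 2 instead of 20_000 to stay readable. Keeping one extra column for the out-of-vocabulary id is an assumption made here; the source does not say whether that column is kept.

```python
from collections import Counter


def build_vocab(total_counts, top_n):
    """Map the top_n most frequent words to ids 0 .. top_n - 1."""
    return {word: i for i, (word, _) in enumerate(total_counts.most_common(top_n))}


def encode(words, vocab, threshold):
    """Replace each word with its id; unknown words get id == threshold."""
    return [vocab.get(word, threshold) for word in words]


def count_vector(word_ids, threshold):
    """Aligned count vector: index i holds the number of times id i occurs.

    Assumption: the last slot (index == threshold) collects every
    out-of-vocabulary word.
    """
    vec = [0] * (threshold + 1)
    for word_id in word_ids:
        vec[word_id] += 1
    return vec


# Tiny example with a threshold of 2 (the text uses 20_000).
total = Counter({"space": 3, "time": 2, "relative": 1})
vocab = build_vocab(total, top_n=2)
ids = encode(["time", "space", "relative", "space"], vocab, threshold=2)
row = count_vector(ids, threshold=2)
```

Because every author is encoded against the same vocabulary, the resulting rows have the same length and column meaning, so they can be stacked into one matrix with a row per author.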
- Fit a k-nearest neighbors (kNN) model
- Find the authors nearest to each other in the count vector space
- Reduce dimensionality using UMAP
- Find the authors nearest to each other in the latent space
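As a rough illustration of the nearest-author lookup, here is a brute-force Euclidean search in plain Python. It stands in for whatever kNN implementation the original uses, and `nearest_authors` is a hypothetical name. The UMAP step is not sketched here; the same function could be applied to the lower-dimensional vectors it produces.

```python
import math


def nearest_authors(vectors, k):
    """Return the k nearest other authors for each author.

    vectors: dict mapping author name -> list of numbers (all the same length).
    Distances are Euclidean; ties are broken alphabetically by author name.
    """
    result = {}
    for name, vec in vectors.items():
        distances = [
            (math.dist(vec, other_vec), other_name)
            for other_name, other_vec in vectors.items()
            if other_name != name
        ]
        distances.sort()
        result[name] = [other_name for _, other_name in distances[:k]]
    return result


# Toy count vectors for three hypothetical authors.
toy_vectors = {"a": [0, 0], "b": [1, 0], "c": [5, 5]}
neighbors = nearest_authors(toy_vectors, k=1)
```

A brute-force search compares every pair of authors, which is fine for a sketch but scales quadratically with the number of authors.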