Topic modeling_Split the novels #13
Comments
If you have reliable information about paragraph boundaries, there's no reason not to model paragraphs, other than the increase in computation time. Chapters work well too. Otherwise, splitting every 1000 words would work, but that choice is arbitrary.
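For illustration, here is a minimal sketch of the two simplest options mentioned above: fixed-size chunks of words, and paragraphs. It is not the preprocessing used in the tutorial. The function names are hypothetical, the 1000-word default just mirrors the arbitrary figure above, and the paragraph splitter assumes paragraphs are separated by blank lines, which may not hold for every edition of a novel.

```python
import re


def split_into_chunks(text, words_per_chunk=1000):
    """Split text into consecutive chunks of at most `words_per_chunk` words.

    Words are whitespace-separated tokens; the last chunk may be shorter.
    """
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


def split_into_paragraphs(text):
    """Split text on blank lines (assumption: blank lines mark paragraph breaks)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
```

Each resulting chunk or paragraph would then be treated as one document when fitting the topic model.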
All valid choices, I guess. Ideally, I think we would split on borders between "scenes", the assumption(s) being that scenes form a meaningful unit, that they may have just the right size (although they may be very unequal in length), and that it makes sense to keep such units intact for best results from topic modeling. However, we usually don't have any information about scene boundaries.
Hi Allen,
Sorry to bother you again, I have a dumb question:
https://de.dariah.eu/tatom/topic_model_mallet.html
Here you wrote "Because these are lengthy texts, the novels are split up into smaller sections—a preprocessing step which improves results considerably."
My question is: are there any rules about the length (or size) of the smaller sections? One paragraph per section? One chapter of the novel? Or maybe the length of the smaller sections is not important, since we will combine the results of topic modelling in the end anyway.
I've noticed that almost all of your data files are about 6 or 7 kB. Is that the right size to aim for?
Thanks a lot!