
Topic modeling_Split the novels #13

Closed
dkltimon opened this issue Mar 24, 2015 · 2 comments

@dkltimon

Hi Allen,

Sorry to bother you again, I have a dumb question:

https://de.dariah.eu/tatom/topic_model_mallet.html

Here you wrote "Because these are lengthy texts, the novels are split up into smaller sections—a preprocessing step which improves results considerably."

My question is: are there any rules about the length (or size) of the smaller sections? One paragraph as a section? One chapter of the novel? Or maybe the length of the smaller sections is not important, since we will combine the results of the topic modeling in the end anyway.

I've noticed that almost all of your data files are about 6 or 7 kB. I assume this might be the right size?

Thanks a lot!

@ariddell
Owner

If you have reliable information about paragraphs there's no reason not to model paragraphs other than the increase in computation time. Chapters are great as well. Otherwise every 1000 words would work -- but the choice is arbitrary.
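For illustration, the "every 1000 words" fallback could look roughly like the sketch below. This is not the code used for the tutorial data; the function name `split_into_chunks` and the `chunk_size` parameter are made up here, and "word" simply means a whitespace-separated token.

```python
def split_into_chunks(text, chunk_size=1000):
    """Split text into consecutive chunks of at most chunk_size words.

    Assumption: a "word" is any whitespace-separated token. The last chunk
    may be shorter than chunk_size.
    """
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]
```

Each returned chunk would then be treated as its own document when building the corpus. Note that joining on single spaces discards the original line breaks, which should not matter for a bag-of-words model.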

@christofs
Collaborator

All valid choices, I guess. Ideally, I think we would split on borders between "scenes", the assumption(s) being that scenes form a meaningful unit, that they may have just the right size (although they may be very unequal in length), and that it makes sense to keep such units intact for best results from topic modeling. However, we usually don't have any information about scene boundaries.
So the next larger unit we tend to have information about is the chapter, and the next smaller unit we tend to have information about is the paragraph. However, paragraphs in novels are sometimes extremely short if you define their border by a newline. For example, in an extended dialogue, each statement by a person will be one paragraph. So I think the best solution we currently have is to use something like "around n words (maybe 1000, or 2000), but cutting only on paragraph boundaries".
What would be really interesting is a study comparing results of topic modeling for the same texts but using different splitting strategies. Just to see whether it even makes a big difference.
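A minimal sketch of the "around n words, but cutting only on paragraph boundaries" idea, under some assumptions: paragraphs are separated by blank lines (adjust the regular expression if a single newline marks a paragraph in your files), and the names `split_on_paragraphs` and `target_words` are hypothetical, chosen only for this example.

```python
import re


def split_on_paragraphs(text, target_words=1000):
    """Group whole paragraphs into sections of roughly target_words words.

    Assumption: paragraphs are separated by one or more blank lines.
    A section is closed as soon as it reaches target_words, so sections
    can overshoot the target by up to one paragraph; the final section
    may be shorter.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    sections, current, count = [], [], 0
    for para in paragraphs:
        current.append(para)
        count += len(para.split())
        if count >= target_words:
            sections.append("\n\n".join(current))
            current, count = [], 0
    if current:
        # Keep whatever is left over as a final, possibly short, section.
        sections.append("\n\n".join(current))
    return sections
```

Because paragraphs are never cut, very short dialogue paragraphs are simply merged with their neighbours until the target size is reached.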
