Topic modeling_Split the novels #13

dkltimon · 2015-03-24T09:38:54Z

Hi Allen,

Sorry to bother you again, I have a dumb question:

https://de.dariah.eu/tatom/topic_model_mallet.html

Here you wrote "Because these are lengthy texts, the novels are split up into smaller sections—a preprocessing step which improves results considerably."

My question is: are there any rules about the length (or size) of the smaller sections? One paragraph per section? One chapter of the novel? Or maybe the length of the smaller sections is not important, since we combine the results of topic modelling in the end anyway.

I've noticed that almost all of your data are about 6 or 7 kB. I assume this may be the right way?

Thanks a lot!

ariddell · 2015-03-24T15:09:04Z

If you have reliable information about paragraphs, there's no reason not to model paragraphs, other than the increase in computation time. Chapters are great as well. Otherwise, splitting every 1000 words would work -- but the choice is arbitrary.
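
For the "every 1000 words" option, a minimal sketch might look like the following. It assumes words are simply whitespace-separated tokens; the function name `split_by_words` and the `n_words` parameter are made up for illustration and are not taken from the tutorial.

```python
def split_by_words(text, n_words=1000):
    """Split text into consecutive chunks of at most n_words tokens.

    Assumption: a "word" is any whitespace-separated token.
    The last chunk may be shorter than n_words.
    """
    words = text.split()
    return [
        " ".join(words[i:i + n_words])
        for i in range(0, len(words), n_words)
    ]
```

Each chunk could then be written to its own file and passed to the topic model as a separate document. Note that this cuts mid-sentence and mid-paragraph, and the original line breaks are lost.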

christofs · 2015-03-26T10:30:39Z

All valid choices, I guess. Ideally, I think we would split on borders between "scenes", the assumption(s) being that scenes form a meaningful unit, that they may have just the right size (although they may be very unequal in length), and that it makes sense to keep such units intact for best results from topic modeling. However, we usually don't have any information about scene boundaries.
So the next larger unit we tend to have information about is the chapter. And the next smaller unit we tend to have information about is the paragraph. However, paragraphs in novels are sometimes extremely short, if you define their border by a newline. For example, if there is an extended dialogue, each statement by a person will be one paragraph. So I think the best solution we currently have is to use something like "around n words (maybe 1000, or 2000), but cutting only on paragraph boundaries".
What would be really interesting is a study comparing results of topic modeling for the same texts but using different splitting strategies. Just to see whether it even makes a big difference.
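
The "around n words, but cutting only on paragraph boundaries" strategy described above could be sketched roughly like this. It is only one possible reading of the idea: it assumes paragraphs are separated by blank lines (for texts with one paragraph per line, the split pattern would need to change), and the names `split_on_paragraphs` and `target_words` are hypothetical.

```python
import re


def split_on_paragraphs(text, target_words=1000):
    """Group whole paragraphs into sections of roughly target_words words.

    Paragraphs are never cut. A section is closed as soon as its word
    count reaches target_words, so sections can overshoot the target.
    Assumption: paragraphs are separated by one or more blank lines.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    sections, current, count = [], [], 0
    for para in paragraphs:
        current.append(para)
        count += len(para.split())
        if count >= target_words:
            sections.append("\n\n".join(current))
            current, count = [], 0
    if current:
        # Keep any leftover paragraphs as a final, shorter section.
        sections.append("\n\n".join(current))
    return sections
```

With this approach, runs of short dialogue paragraphs are merged into one section instead of becoming many tiny documents, while a single very long paragraph still ends up as a section of its own.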

ariddell closed this as completed Mar 24, 2015
