In [1]:
meta_ch = read.csv("~/github/NovelProject/features_ch.csv")
meta_jp = read.csv("~/github/NovelProject/features_jp.csv")

First we fit a multinomial regression with the following features:
- thought
- pronouns
- kl_score
- ttr_mean
- ent_mean
- con_ent2_mean
- period
- punct
- stopword

In [2]:
library(nnet)
fit_ch = multinom(genre~thought+pronouns+kl_score+ttr_mean+ent_mean+con_ent2_mean+period+punct+stopword,data=meta_ch,maxit=1000,trace=FALSE)

Get summary of the fitted coefficients.

In [3]:
z = summary(fit_ch)$coefficients/summary(fit_ch)$standard.errors
pval = 2*(1-pnorm(abs(z)))
fit_summary = list()
fit_summary$Romantic = cbind(coef(fit_ch)[1,],summary(fit_ch)$standard.errors[1,],z[1,],pval[1,])
colnames(fit_summary$Romantic) = c("Estimate","Std. Error","Z-score","Pr(>|Z|)")
fit_summary$SR = cbind(coef(fit_ch)[2,],summary(fit_ch)$standard.errors[2,],z[2,],pval[2,])
colnames(fit_summary$SR) = c("Estimate","Std. Error","Z-score","Pr(>|Z|)")
print(fit_summary,digits=3)

$Romantic
              Estimate Std. Error Z-score Pr(>|Z|)
(Intercept)      8.234      32.60   0.253 8.01e-01
thought        233.073      25.80   9.034 0.00e+00
pronouns       344.887      68.47   5.037 4.73e-07
kl_score         0.913       1.34   0.683 4.95e-01
ttr_mean        20.532      15.78   1.301 1.93e-01
ent_mean        -9.126       4.99  -1.827 6.77e-02
con_ent2_mean   40.275       6.80   5.926 3.11e-09
period          35.198      37.83   0.930 3.52e-01
punct         -100.263      23.21  -4.320 1.56e-05
stopword       -52.135      16.11  -3.236 1.21e-03

$SR
              Estimate Std. Error Z-score Pr(>|Z|)
(Intercept)     -13.82      32.02  -0.431 6.66e-01
thought          85.29      26.13   3.264 1.10e-03
pronouns        263.74      66.95   3.940 8.16e-05
kl_score         -3.95       2.33  -1.697 8.98e-02
ttr_mean         25.80      15.36   1.680 9.30e-02
ent_mean         -7.84       4.95  -1.586 1.13e-01
con_ent2_mean    39.21       6.70   5.853 4.82e-09
period          

The underlying model here is
$$\log\left(\frac{P(\text{Romantic})}{P(\text{Popular})}\right)=\beta_0+\beta_1\cdot\text{thought}+\cdots+\beta_9\cdot\text{stopword}.$$
To interpret the result, for example, we can say that the odds ratio between being a romantic and a popular novel for a 0.1% increase of the thought word proportion while holding other variable fixed is 
$$\exp(233.073\times 0.001)=1.26.$$

For the Japanese corpus:

In [4]:
library(nnet)
fit_jp = multinom(genre~thought+narrator+kl_score+ttr_mean+ent_mean+con_ent2_mean+period+punct+stopword,data=meta_jp,maxit=1000,trace=FALSE)

In [5]:
z = summary(fit_jp)$coefficients/summary(fit_jp)$standard.errors
pval = 2*(1-pnorm(abs(z)))
fit_summary = list()
fit_summary$Prolet = cbind(coef(fit_jp)[1,],summary(fit_jp)$standard.errors[1,],z[1,],pval[1,])
colnames(fit_summary$Prolet) = c("Estimate","Std. Error","Z-score","Pr(>|Z|)")
fit_summary$Shishosetsu = cbind(coef(fit_jp)[2,],summary(fit_jp)$standard.errors[2,],z[2,],pval[2,])
colnames(fit_summary$Shishosetsu) = c("Estimate","Std. Error","Z-score","Pr(>|Z|)")
print(fit_summary,digits=3)

$Prolet
              Estimate Std. Error Z-score Pr(>|Z|)
(Intercept)    -70.829     23.848  -2.970 2.98e-03
thought       -125.012    176.752  -0.707 4.79e-01
narratorthird    0.695      0.461   1.506 1.32e-01
kl_score         3.257      2.438   1.336 1.82e-01
ttr_mean        15.937     14.616   1.090 2.76e-01
ent_mean         8.248      4.257   1.938 5.27e-02
con_ent2_mean    5.911      5.842   1.012 3.12e-01
period         152.005     28.060   5.417 6.05e-08
punct          -24.622     12.370  -1.990 4.65e-02
stopword        18.816     13.114   1.435 1.51e-01

$Shishosetsu
              Estimate Std. Error Z-score Pr(>|Z|)
(Intercept)    -42.083     24.568  -1.713 0.086727
thought        375.385    164.131   2.287 0.022190
narratorthird   -0.516      0.458  -1.128 0.259472
kl_score        -1.702      2.588  -0.658 0.510799
ttr_mean        25.193     16.020   1.573 0.115807
ent_mean         0.450      4.364   0.103 0.917940
con_ent2_mean    2.097      6.249   0.336 0.737132
period   

Now let's look at the predictive performance of the multinomial fit.

In [6]:
confusion_ch = matrix(0,3,3)
B = 100
for (i in 1:B) {
    romantic = sample(which(meta_ch$genre=="Romantic"))
    sr = sample(which(meta_ch$genre=="SR"))
    pop = sample(which(meta_ch$genre=="Pop"))
    
    ptrain = 4/5
    train = meta_ch[c(romantic[1:floor(length(romantic)*ptrain)],sr[1:floor(length(sr)*ptrain)],pop[1:floor(length(pop)*ptrain)]),]
    test = meta_ch[c(romantic[(floor(length(romantic)*ptrain)+1):length(romantic)],sr[(floor(length(sr)*ptrain)+1):length(sr)],pop[(floor(length(pop)*ptrain)+1):length(pop)]),]

    fit = multinom(genre~thought+pronouns+kl_score+ttr_mean+ent_mean+con_ent2_mean+period+punct+stopword,data=meta_ch,maxit=1000,trace=FALSE)
    true = test$genre
    pred = predict(fit,newdata=test)
    confusion_ch = confusion_ch + table(as.character(true),pred)
}

In [7]:
confusion_jp = matrix(0,3,3)
B = 100
for (i in 1:B) {
    control = sample(which(meta_jp$genre=="CONTROL"))
    prolet = sample(which(meta_jp$genre=="PROLET"))
    shishosetsu = sample(which(meta_jp$genre=="SHISHOSETSU"))
    
    ptrain = 4/5
    train = meta_jp[c(control[1:floor(length(control)*ptrain)],prolet[1:floor(length(prolet)*ptrain)],shishosetsu[1:floor(length(shishosetsu)*ptrain)]),]
    test = meta_jp[c(control[(floor(length(control)*ptrain)+1):length(control)],prolet[(floor(length(prolet)*ptrain)+1):length(prolet)],shishosetsu[(floor(length(shishosetsu)*ptrain)+1):length(shishosetsu)]),]

    fit = multinom(genre~thought+narrator+kl_score+ttr_mean+ent_mean+con_ent2_mean+period+punct+stopword,data=meta_jp,maxit=1000,trace=FALSE)
    true = test$genre
    pred = predict(fit,newdata=test)
    confusion_jp = confusion_jp + table(as.character(true),pred)
}

In [8]:
confusion_ch/B

          pred
             Pop Romantic    SR
  Pop       9.36     0.41  2.23
  Romantic  1.32    18.70  4.98
  SR        1.30     5.92 15.78

In [9]:
confusion_jp/B

             pred
              CONTROL PROLET SHISHOSETSU
  CONTROL        8.66   2.93        2.41
  PROLET         3.22   8.44        2.34
  SHISHOSETSU    2.09   2.70        9.21