# Paper Trail

## Classification Benchmarks


### Full 50-class task (pre-analysis, 10.5 million paragraphs)

| Model          | F1 score | F1 (no math) |
|:---------------|---------:|-------------:|
| Zero Rule      |    0.201 |        0.206 |
| 3-level BiLSTM |     0.67 |         0.67 |

### Core 13-class task (10.4 million paragraphs)

| Model          | F1 score | F1 (no math) |
|:---------------|---------:|-------------:|
| Zero Rule      |    0.388 |        0.369 |
| LogReg         |     0.30 |         0.35 |
| LogReg + GloVe |     0.77 |         0.77 |
| Perceptron     |     0.83 |         0.83 |
| HAN            |     0.89 |         0.88 |
| 3-level BiLSTM |     0.91 |         0.90 |


**Legend:**
 1. Zero Rule - a trivial baseline that always makes a single constant prediction: the most common class in the dataset
 2. LogReg - logistic regression on raw word indexes, where each paragraph is an array of 480 integers
 3. LogReg + GloVe - logistic regression on arXMLiv 08.2018 GloVe word embeddings, where each paragraph is a real-valued (480,300) matrix
 4. Perceptron - one fully-connected hidden layer of neurons, followed by a final softmax layer. Also based on the GloVe-embedded (480,300) data
 5. HAN - Hierarchical Attention Network, with each paragraph split into 8 sentences of 60 words each. Also uses GloVe embeddings, for an embedded paragraph shape of (8,60,300). This size was established with a grid search on 3% of the data.
 6. 3-level BiLSTM - an encoder/decoder BiLSTM pair with an LSTM follow-up: BiLSTM(128)→BiLSTM(64)→LSTM(64)→Dense(13)
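
The fixed-size input described in item 2 can be sketched as follows. This is a minimal illustration rather than the repository's actual preprocessing: the padding index `0`, the out-of-vocabulary index `1`, right-side padding, and the toy vocabulary are all assumptions made for the example.

```python
from typing import Dict, List

MAX_WORDS = 480  # paragraph cap stated in the legend
PAD_INDEX = 0    # assumption: index 0 is reserved for padding
OOV_INDEX = 1    # assumption: index 1 stands for out-of-vocabulary words


def encode_paragraph(words: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map words to vocabulary indexes, then truncate or right-pad to MAX_WORDS."""
    indexes = [vocab.get(word, OOV_INDEX) for word in words[:MAX_WORDS]]
    return indexes + [PAD_INDEX] * (MAX_WORDS - len(indexes))


# Hypothetical toy vocabulary; the real one would come from the GloVe model.
toy_vocab = {"let": 2, "be": 3, "a": 4, "group": 5}
encoded = encode_paragraph("let G be a group".split(), toy_vocab)
print(len(encoded), encoded[:6])
```

Replacing each index with its 300-dimensional embedding row then yields the (480,300) shape used by the GloVe-based models.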

**Math-free control experiment:**
 1. Stripped out all mathematics, instead of normalizing it into math lexemes
 2. Regenerated the GloVe model (vocabulary size drops from just over 1 million to 0.75 million words)
 3. Re-extracted the paragraph dataset (keeping unique SHA256 names reduces the data from 10.5 to 10.1 million paragraphs)
 4. Confirmed that the 50-class pre-analysis confusion matrix is analogous, then reduced to the 13-class task (10 million paragraphs)
 5. Re-ran all 13-class model training notebooks on the math-free data, to co-report results and measure the effect of the math modality
 
*Inescapable control impurity:* we need to run *identical* models to claim a fair comparison, but the average paragraph is significantly shorter once math is omitted. Additionally, ~10% of the paragraphs exceed the 480-word cap, so stripping the math pulls in more of the text from these extra-long items. The control models therefore see additional context that was never visible to the math-lexeme-enabled models.
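
A toy illustration of this effect, with the cap shortened to 6 tokens for readability. The `MATH_` prefix for math lexemes and the sample paragraph are hypothetical; only the truncation behaviour is the point.

```python
CAP = 6  # stand-in for the 480-word cap, shortened for readability


def visible_words(tokens, strip_math):
    """Return the tokens a model sees after optional math removal and the cap."""
    if strip_math:
        tokens = [t for t in tokens if not t.startswith("MATH_")]
    return tokens[:CAP]


# Hypothetical paragraph; "MATH_*" marks normalized math lexemes (naming is an assumption).
paragraph = ["let", "MATH_x", "MATH_in", "MATH_R", "be", "fixed",
             "then", "the", "claim", "holds"]

with_math = visible_words(paragraph, strip_math=False)
without_math = visible_words(paragraph, strip_math=True)

# Words that only the math-free control model gets to see.
extra_context = [t for t in without_math if t not in with_math]
print(with_math)
print(without_math)
print(extra_context)
```

In this sketch the math-free input reaches three words further into the paragraph, which is the kind of extra context the control models receive on over-long items.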

**Final assembly:** 25 of the original 50 classes are grouped into 13 strongly separable unions:

| Class           | Additional Members | Frequency |
|:----------------|:-------------------|----------:|
| abstract        | -                  | 1,030,774 |
| acknowledgement | -                  |   162,230 |
| conclusion      | discussion         |   401,235 |
| definition      | -                  |   686,717 |
| example         | -                  |   295,152 |
| introduction    | -                  |   688,530 |
| keywords        | -                  |     1,565 |
| proof           | demonstration      | 2,148,793 |
| proposition     | assumption, claim, condition, conjecture, corollary, fact, lemma, theorem | 4,060,029 |
| problem         | question           |    57,609 |
| related work    | -                  |    26,299 |
| remark          | note               |   643,500 |
| result          | -                  |   239,931 |
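
As a sanity check, the frequencies above can be summed and the majority-class share computed directly. The sketch below assumes the Zero Rule score is micro-averaged, in which case it equals the share of the most common class; the averaging mode is not stated in this document.

```python
# Class frequencies copied from the table above.
class_frequency = {
    "abstract": 1_030_774,
    "acknowledgement": 162_230,
    "conclusion": 401_235,
    "definition": 686_717,
    "example": 295_152,
    "introduction": 688_530,
    "keywords": 1_565,
    "proof": 2_148_793,
    "proposition": 4_060_029,
    "problem": 57_609,
    "related work": 26_299,
    "remark": 643_500,
    "result": 239_931,
}

total = sum(class_frequency.values())
majority_class, majority_count = max(class_frequency.items(), key=lambda kv: kv[1])
majority_share = majority_count / total

print(f"total paragraphs: {total:,}")
print(f"majority class:   {majority_class}")
print(f"majority share:   {majority_share:.4f}")
```

The total comes to 10,442,364 paragraphs, matching the "10.4 million" in the 13-class heading, and the `proposition` share is just under 0.389, in line with the Zero Rule row of that table.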


**Dropped (25):**
notice, expansion, hint, expectation, explanation, affirmation, answer, issue, bound, summary, experiment,
solution, criterion, principle, comment, exercise, constraint, rule, convention, case, step, overview, notation, observation, method
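
The grouping and the dropped list can be expressed as a label map from the 50 original classes to the 13 core classes. This is a sketch built only from the table and list above; the names `UNIONS`, `DROPPED`, `LABEL_MAP`, and `to_core_label` are hypothetical and not taken from the repository.

```python
# Core class -> additional members, as listed in the "Final assembly" table.
UNIONS = {
    "abstract": [],
    "acknowledgement": [],
    "conclusion": ["discussion"],
    "definition": [],
    "example": [],
    "introduction": [],
    "keywords": [],
    "proof": ["demonstration"],
    "proposition": ["assumption", "claim", "condition", "conjecture",
                    "corollary", "fact", "lemma", "theorem"],
    "problem": ["question"],
    "related work": [],
    "remark": ["note"],
    "result": [],
}

# Classes excluded from the 13-class task.
DROPPED = {
    "notice", "expansion", "hint", "expectation", "explanation", "affirmation",
    "answer", "issue", "bound", "summary", "experiment", "solution", "criterion",
    "principle", "comment", "exercise", "constraint", "rule", "convention",
    "case", "step", "overview", "notation", "observation", "method",
}

# Flatten the unions into original label -> core label.
LABEL_MAP = {}
for core, members in UNIONS.items():
    LABEL_MAP[core] = core
    for member in members:
        LABEL_MAP[member] = core


def to_core_label(original):
    """Return the core label, or None if the class was dropped.

    Raises KeyError for labels outside the 50 original classes.
    """
    if original in DROPPED:
        return None
    return LABEL_MAP[original]


print(to_core_label("lemma"), to_core_label("note"), to_core_label("hint"))
```

Together the 25 mapped labels and the 25 dropped labels account for all 50 classes of the pre-analysis task.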


## Pre-analysis 50-class task, BiLSTM

![normalized 50 class confusion matrix](https://github.com/dginev/arxiv-ams-paragraph-classification/blob/49-class-dataset/figures/confusion_matrix_normalized_50class.png?raw=true)

## Core 13-class task, BiLSTM

![normalized 13 class confusion matrix](https://github.com/dginev/arxiv-ams-paragraph-classification/blob/49-class-dataset/figures/confusion_matrix_normalized_13class.png?raw=true)