# Second LLM Pass (or labeling large text portions) research
Our initial attempts to create an e2e pipeline produced low-quality results, so we want to explore other design options. Here I gather information about using ML tools to categorize a section of text (possibly XML/CSV) as a coupon.

## Similar existing problems
### Sentence classification
There is a lot of work on classifying single sequences, along with datasets focused on this task (e.g. [trec](https://huggingface.co/datasets/CogComp/trec)).
### Larger text classification
Closer to our problem is the review classification task. Reviews can be quite [long](https://medium.com/codex/fine-tune-bert-for-text-classification-cef7a1d6cdf1), but we still miss two pieces of information: where a possible coupon begins and where it ends. This stops us from applying the method directly.
### Text Retrieval and Question answering
There are also question answering and text retrieval tasks, which are more similar to our problem. Given a portion of text and possibly a question, the model selects the beginning and the end of a span containing the answer. This addresses our lack of knowledge about exact coupon placement. However, we still have to deal with the fact that one provided XML portion might contain more than one coupon. I discuss this problem in the following section.

Note: text retrieval seems to be quite niche; there is not even a dedicated model task for it on HF.
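
To make the "select a begin and an end" step concrete, here is a minimal sketch of how a single answer span can be picked from per-token start and end scores. The scores, the `best_span` helper, and the `max_len` limit are all made up for illustration; a real extractive QA model would supply the scores, and its exact post-processing may differ.

```python
from typing import List, Optional, Tuple


def best_span(
    start_scores: List[float],
    end_scores: List[float],
    max_len: int = 30,
) -> Optional[Tuple[int, int, float]]:
    """Return (start, end, score) of the highest-scoring span (end inclusive).

    The span score is start_scores[start] + end_scores[end]; only spans with
    start <= end and at most max_len tokens are considered.
    """
    best = None
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best


# Hypothetical scores for a 7-token input.
tokens = ["Save", "20%", "with", "code", "SPRING", "today", "only"]
start = [0.1, 0.2, 0.0, 2.5, 0.3, 0.0, 0.0]
end = [0.0, 0.1, 0.0, 0.2, 3.0, 0.4, 0.1]

s, e, _ = best_span(start, end)
print(tokens[s : e + 1])
```

Note that this returns exactly one span per input, which is the limitation mentioned above when a portion contains several coupons.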
## Known challenges
### Random coupon placement
Coupons might appear in groups, and there will also be large, sparse portions of text without any. This presents a challenge to both the QA and the standard classification approach. Possible solutions:
* combining with the heuristic used in the first PoC (analyze the count of labels from the first BERT pass and decide whether an XML node is possibly a coupon and worth feeding to the classification model).
* simply chunking the text and classifying each section. We should make two overlapping passes to avoid having a coupon on a chunk border.
* multi-span question answering: there are [attempts](https://aclanthology.org/2020.emnlp-main.248.pdf) to modify the QA task to produce multiple outputs. The provided example is actually close to what we discussed in our meetings (casting the QA task as NER-like token labeling).
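
The chunking option with two overlapping passes could look roughly like the sketch below. The helper name `chunk_spans` and the choice of shifting the second pass by half a chunk are my assumptions, not something we have agreed on.

```python
from typing import List, Tuple


def chunk_spans(n_tokens: int, size: int, offset: int = 0) -> List[Tuple[int, int]]:
    """Split range(n_tokens) into consecutive [start, end) chunks of `size`.

    `offset` shifts the chunk grid; the tokens before the first shifted
    chunk form one shorter leading chunk.
    """
    spans = []
    if offset > 0 and n_tokens > 0:
        spans.append((0, min(offset, n_tokens)))
    start = offset
    while start < n_tokens:
        spans.append((start, min(start + size, n_tokens)))
        start += size
    return spans


n_tokens, size = 10, 4
first_pass = chunk_spans(n_tokens, size)
# Second pass is shifted by half a chunk (an assumed choice), so every
# border of the first pass falls strictly inside a second-pass chunk.
second_pass = chunk_spans(n_tokens, size, offset=size // 2)
print(first_pass)
print(second_pass)
```

With this layout, a coupon no longer than half a chunk always fits entirely inside a chunk of at least one pass. Longer coupons would need larger chunks or a merge step over the two passes.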
### Providing input in reasonable format
Feeding an LLM raw CSV is possibly not the best idea. It contains a lot of boilerplate data and probably does not represent the XML structure well.
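
One possible direction, shown only as a sketch: flatten the XML into short `path: text` lines so that the structure is kept but most of the markup is dropped. The `flatten` helper, the output format, and the sample document are all hypothetical; whether an LLM actually handles this format better is still an open question.

```python
import xml.etree.ElementTree as ET
from typing import Iterator


def flatten(node: ET.Element, path: str = "") -> Iterator[str]:
    """Yield one 'tag/path: text' line per element that carries text."""
    here = f"{path}/{node.tag}" if path else node.tag
    text = (node.text or "").strip()
    if text:
        yield f"{here}: {text}"
    for child in node:
        yield from flatten(child, here)


# Made-up sample input, not taken from our data.
sample = """
<offers>
  <offer>
    <title>20% off shoes</title>
    <code>SPRING20</code>
  </offer>
</offers>
"""
lines = list(flatten(ET.fromstring(sample)))
print("\n".join(lines))
```

This sketch ignores attributes and tail text, which real inputs may need.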
## Solutions requiring model training
Below I present some ideas that involve training a custom model from scratch:
### 1D convolution over text sequence
This solution is conceptually similar to multi-answer QA. To each token we assign one of the labels `UNKNOWN`, `BEGIN-COUPON`, `MID-COUPON` or `END-COUPON`. This time, however, we will use CNNs.

Several years ago one-dimensional convolution over text sequences was a promising approach to some NLP tasks, like [classification](https://aclanthology.org/D14-1181.pdf) or sentiment analysis. There were even results suggesting that the common association between NLP tasks and RNNs should be [reconsidered](https://arxiv.org/pdf/1803.01271).

The main idea is to combine this convolution-based approach with a technique known from image processing: semantic segmentation. In the [classical semantic segmentation task](https://arxiv.org/pdf/1505.04597) we assign a class to each pixel. In our case, we would assign one of the labels mentioned above to each input token.
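
A toy illustration of the idea, in plain Python: a width-3 convolution with zero padding produces one score vector per token, and the arg-max gives the label. Everything here is an assumption made for the sketch: the single input feature per token (a "looks coupon-like" flag), the hand-set kernels and biases, and the helper names. A trained model would learn the weights and use real token embeddings.

```python
from typing import List

LABELS = ["UNKNOWN", "BEGIN-COUPON", "MID-COUPON", "END-COUPON"]


def conv1d_same(
    x: List[List[float]],              # [seq_len][in_dim]
    kernels: List[List[List[float]]],  # [n_labels][width][in_dim]
    biases: List[float],               # [n_labels]
) -> List[List[float]]:
    """1D convolution with zero padding, so output length == input length."""
    width = len(kernels[0])
    pad = width // 2
    zero = [0.0] * len(x[0])
    padded = [zero] * pad + x + [zero] * pad
    out = []
    for t in range(len(x)):
        window = padded[t : t + width]
        out.append([
            b + sum(w * v for w_row, v_row in zip(k, window)
                    for w, v in zip(w_row, v_row))
            for k, b in zip(kernels, biases)
        ])
    return out


# One hypothetical feature per token: 1.0 if the token looks coupon-like.
x = [[0.0], [0.0], [1.0], [1.0], [1.0], [0.0]]
# Hand-set kernels over the window (previous, current, next); not learned.
kernels = [
    [[0.0], [-1.0], [0.0]],  # UNKNOWN: current token is not coupon-like
    [[-1.0], [1.0], [0.0]],  # BEGIN-COUPON: coupon-like run starts here
    [[1.0], [1.0], [1.0]],   # MID-COUPON: neighbours are coupon-like too
    [[0.0], [1.0], [-1.0]],  # END-COUPON: coupon-like run ends here
]
biases = [0.5, 0.0, -2.0, 0.0]

scores = conv1d_same(x, kernels, biases)
predicted = [LABELS[max(range(len(LABELS)), key=s.__getitem__)] for s in scores]
print(predicted)
```

The point of the sketch is the shape of the computation: each token's label depends only on a small local window, exactly like per-pixel classification in segmentation.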
###### Why not RNN then?
If we want to train a custom model anyway, why not simply run a bidirectional multi-input multi-output RNN on our data and expect it to generate the mentioned labels?

The only reason is intuition: selecting coupon boundaries should be easier when we consider only local context, which is not how RNNs work. That said, we could also try RNNs; they seem like a valid approach too.
###### Why not second BERT pass then?
Generally, it is likely that we won't be able to achieve LLM-level performance, as we are performing a version of NER here, which is known to be a task that suits LLMs well. I suggest testing this solution only after we find existing LLMs unsatisfactory.
### Problems
We still have not escaped the problem with the input format. We also introduced additional labels for detecting coupon boundaries (`BEGIN-COUPON` and `END-COUPON`). Removing them might increase accuracy, but it would leave us with coupons that are not separated from each other.
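
The trade-off can be seen in a small decoding sketch. With boundary labels, two adjacent coupons decode into two spans; with a single hypothetical `COUPON` label (my naming, used only for this comparison) they merge into one. Both decoders are illustrative assumptions, not an agreed post-processing step.

```python
from typing import List, Tuple


def decode_spans(labels: List[str]) -> List[Tuple[int, int]]:
    """Turn BEGIN/MID/END labels into inclusive (start, end) coupon spans."""
    spans, start = [], None
    for i, label in enumerate(labels):
        if label == "BEGIN-COUPON":
            start = i
        elif label == "END-COUPON" and start is not None:
            spans.append((start, i))
            start = None
        elif label == "UNKNOWN":
            start = None  # drop a span that was never closed
    return spans


def decode_binary(labels: List[str]) -> List[Tuple[int, int]]:
    """Without boundary labels: every maximal run of COUPON is one span."""
    spans, start = [], None
    for i, label in enumerate(labels + ["UNKNOWN"]):
        if label == "COUPON" and start is None:
            start = i
        elif label != "COUPON" and start is not None:
            spans.append((start, i - 1))
            start = None
    return spans


# Two coupons placed directly next to each other.
with_bounds = ["BEGIN-COUPON", "MID-COUPON", "END-COUPON",
               "BEGIN-COUPON", "MID-COUPON", "END-COUPON", "UNKNOWN"]
binary = ["COUPON"] * 6 + ["UNKNOWN"]

two_spans = decode_spans(with_bounds)
merged = decode_binary(binary)
print(two_spans)
print(merged)
```

Single-token coupons are not covered by this labeling scheme as written, which is another detail to settle.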
### Recursive Neural Network / Tree-LSTM
TODO

## TODO
* research whether (and if so, how) LLMs can handle CSV/XML inputs. Do we need some token mapping of XML tags?
* add some suggestions (is an option good or bad, and why)
* Tree-LSTM section