# 🚀 Thesis Roadmap
---

The objective of this thesis is to build a **pipeline**, based on state-of-the-art NLP, that starts from PDFs of scientific papers, computes an embedding for each paper, and uses a custom learned metric to find similar papers.

The main topic will be
> **Deep metric learning for scientific paper similarities**

Based on this problem, the roadmap for the thesis is:

## Datasets
---
1. [x] collection of fulltext/title-abstract datasets
2. [ ] *(optional) creation of a crawler for public journals*
3. [ ] *(optional) definition of the pdf-to-json pipeline*

## Tasks
---
1. [ ] (papers similarity) bag-of-words of papers, defining a metric on those BoWs
2. [x] (embedding analysis) summarization of {abstract} - obtaining {title}
    - [ ] (papers similarity) based on these embeddings, cosine similarities of papers
3. [ ] (embedding analysis) summarization of {intro,method,conclusion} - obtaining {abstract}
    - [ ] (papers similarity) based on these embeddings, cosine similarities of papers
4. [ ] (visualization) clustering of papers based on {keyword}
    - (methods) we can do this either by
        - [ ] (approximate) averaging the keywords (finding the mean keyword)
    - or
        - [ ] (graph) building a graph of keyphrases (or keyphrase embeddings) that represents the paper
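
As a rough illustration of task 1 above (bag-of-words plus a metric on the BoWs), here is a minimal sketch using only the Python standard library. The tokenizer, the function names, and the toy texts are assumptions made for the example; cosine similarity is just one possible metric, not necessarily the one the thesis will adopt.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count word occurrences (a deliberately naive tokenizer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy texts, for illustration only.
paper_a = "Deep metric learning for paper similarity"
paper_b = "Metric learning for scientific paper similarity"
paper_c = "Protein folding with molecular dynamics"

sim_ab = cosine_similarity(bag_of_words(paper_a), bag_of_words(paper_b))
sim_ac = cosine_similarity(bag_of_words(paper_a), bag_of_words(paper_c))
```

The same `cosine_similarity` idea carries over to the embedding-based tasks, where dense vectors would replace the word counts.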

# What first?
---

What is left:
1. [x] Move the functions in use into a module, so they can be imported and reused
2. [x] Check whether tokenizing beforehand is useful, or whether MLM needs a different preprocessing step in that file
3. [ ] The custom tokenization function is probably suitable for STS


Since the main goal is to build something that works and is useful, the first tasks are:

1. [x] collection of fulltext/title-abstract datasets
2. [x] defining classes Dataset and DatasetLoader (using 🤗huggingface/pytorch)
3. [x] (embedding analysis) summarization of {abstract} - obtaining {title}
    1. [ ] (papers similarity) based on these embeddings, cosine similarities of papers
4. [ ] (visualization) clustering of papers based on {keyword}
    - (methods) we can do this either by
        - [ ] (approximate) averaging the keywords (finding the mean keyword)
    - or
        - [ ] (graph) building a graph of keyphrases (or keyphrase embeddings) that represents the paper
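
The "(approximate) averaging the keywords" option above could look roughly like the sketch below. The keyword vectors, the `KEYWORD_VECTORS` table, and the `mean_keyword` helper are hypothetical; in practice the vectors would come from whichever embedding model is chosen.

```python
from typing import Dict, List

# Hypothetical 3-dimensional keyword embeddings, for illustration only.
KEYWORD_VECTORS: Dict[str, List[float]] = {
    "metric learning": [1.0, 0.0, 0.0],
    "embeddings": [0.0, 1.0, 0.0],
    "clustering": [0.0, 0.0, 1.0],
}


def mean_keyword(keywords: List[str], vectors: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of the known keywords, component by component."""
    known = [vectors[k] for k in keywords if k in vectors]
    if not known:
        raise ValueError("no keyword has a known embedding")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Keywords without a known vector are skipped (an assumption of this sketch).
paper_vector = mean_keyword(
    ["metric learning", "embeddings", "unknown keyword"], KEYWORD_VECTORS
)
```

The resulting "mean keyword" is a single point per paper, which makes it easy to cluster but loses the structure that the graph option would keep.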

# What next?
---

The main goal here is to build the pipelines:

1. [ ] Data preparation
    1. [ ] fulltext line-by-line
    2. [x] title line-by-line
    3. [x] abstract line-by-line
2. [ ] Data preparation (S2ORC)
    1. [x] S2ORC analysis: extract a batch (Computer Science subset, full/not full)
    2. [x] add mag_id
    3. [x] train/test dataset of text-abstract pairs from Computer Science (at most 1M papers)
    4. [ ] <span style="color:red">Full text from S2ORC</span>
3. [ ] Train
    1. [x] Train on S2ORC
    2. [ ] Train on the S2ORC Computer Science subset
    3. [ ] (?) vocab + BERT pre-training
4. [ ] Eval
    1. [ ] 3.1 vs. imported scibert-uncased -> which wins?
    2. [ ] 3.2 vs. 3.1 -> which wins?
    3. [ ] 3.3 vs all
5. [x] (embedding analysis) summarization of {abstract} - obtaining {title}
    1. [ ] (papers similarity) based on these embeddings, cosine similarities of papers
        0. [ ] (papers similarity) on a conference
            - [ ] take papers from a Computer Science conference
            - [ ] embed them all
            - [ ] n(n-1) similarities
            - [ ] clustering
                - [ ] sbert-wk (ask Francesco)
                - [ ] S2ORC
                - [ ] S2ORC fine-tuned on Computer Science
6. [ ] (visualization) clustering of papers based on {keyword}
    - (methods) we can do this either by
        - [ ] (approximate) averaging the keywords (finding the mean keyword)
    - or
        - [ ] (graph) building a graph of keyphrases (or keyphrase embeddings) that represents the paper
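
For the conference experiment in step 5 (embed all papers, then compute the n(n-1) similarities), a minimal sketch could be the following. The paper ids and the 2-dimensional embeddings are made up for the example, and `pairwise_similarities` is a hypothetical helper; since cosine similarity is symmetric, each unordered pair is computed once and stored in both directions.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two dense vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pairwise_similarities(
    embeddings: Dict[str, List[float]]
) -> Dict[Tuple[str, str], float]:
    """Fill all n(n-1) ordered pairs, computing each unordered pair only once."""
    sims: Dict[Tuple[str, str], float] = {}
    for a, b in combinations(embeddings, 2):
        s = cosine(embeddings[a], embeddings[b])
        sims[(a, b)] = s
        sims[(b, a)] = s
    return sims


# Hypothetical paper ids and embeddings, for illustration only.
papers = {"p1": [1.0, 0.0], "p2": [1.0, 1.0], "p3": [0.0, 1.0]}
sims = pairwise_similarities(papers)

# Most similar paper to p1 according to the table above.
nearest_to_p1 = max((p for p in papers if p != "p1"), key=lambda p: sims[("p1", p)])
```

The resulting table is the natural input for the clustering step, whichever embedding model (sbert-wk, S2ORC, or the fine-tuned variant) produced the vectors.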

# What now?
---

0. [ ] Utils
    1. [ ] Dataset config
        1. [ ] dataset name ("s2orc", "keyphrase", and the others)
        2. [ ] create generic DatasetConfig
        3. [ ] create s2orc DatasetConfig
        4. [ ] create keyphrase DatasetConfig
1. [ ] <span style="color:red">S2ORC dataset (train/test)</span>
    1. [ ] full/sample
    2. [ ] mag_field specification (e.g. \["Computer Science", "Physics"\] )
    3. [ ] (only for full) chunk ids (e.g. \[0, 1, 2\] out of the 99 we have downloaded)
    4. [ ] data, target, classes (e.g. dictionary_input = { "data": \["abstract"\], "target": \["title"\], "classes": \["mag_field_of_study"\]})
2. [ ] <span style="color:blue">KeyPhrase dataset (train)</span>
    1. [ ] Find it!
3. [ ] <span style="color:green">KeyPhrase dataset (test)</span>
    1. [ ] title/abstract/keyphrase
    2. [ ] field specification (e.g. \["Computer Science", "Physics"\] )
4. [ ] <span style="color:orange">Fusion s2orc + keyphrase</span>
    1. [ ] having fulltext + keyphrases (matching the same papers by title | paper_id | arxiv_id)
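
One possible shape for the generic and per-dataset configs listed above, sketched with standard-library dataclasses. Only the name `DatasetConfig` and the option names (full/sample, mag_field, chunk ids, `dictionary_input`) come from the list; the class names `S2orcConfig` and `KeyPhraseConfig`, the field names, and the validation rule are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DatasetConfig:
    """Generic settings shared by every dataset."""
    dataset_name: str
    # Which columns act as input data, target, and class labels.
    dictionary_input: Dict[str, List[str]] = field(default_factory=dict)


@dataclass
class S2orcConfig(DatasetConfig):
    """Hypothetical S2ORC-specific settings."""
    dataset_name: str = "s2orc"
    full: bool = False  # False means the sample version
    mag_field: List[str] = field(default_factory=list)
    chunk_ids: Optional[List[int]] = None  # only meaningful when full=True

    def __post_init__(self) -> None:
        if self.chunk_ids is not None and not self.full:
            raise ValueError("chunk_ids only apply to the full dataset")


@dataclass
class KeyPhraseConfig(DatasetConfig):
    """Hypothetical keyphrase-dataset settings."""
    dataset_name: str = "keyphrase"
    fields_of_study: List[str] = field(default_factory=list)


config = S2orcConfig(
    full=True,
    mag_field=["Computer Science", "Physics"],
    chunk_ids=[0, 1, 2],
    dictionary_input={
        "data": ["abstract"],
        "target": ["title"],
        "classes": ["mag_field_of_study"],
    },
)
```

Keeping the shared fields in the base class means a fused s2orc + keyphrase config could later be added as one more subclass without touching the loaders that only read `dictionary_input`.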

# What after next?
---

The things I'd like to do after the next steps (building on the pipelines above) are:

0. [ ] <span style="color:lightblue">Using</span>
    1. [ ] Fast.ai 
    2. [ ] 🤗 huggingface
    3. [ ] wandb / Tensorboard+vscode / neptune.ai
    4. [ ] nbdev
1. [ ] <span style="color:blue">Data preparation</span>
    1. [ ] (s2orc) {title, abstract, mag} line-by-line
    2. [ ] (keyphrase) {title, abstract, mag, keyphrases} line-by-line
    3. [ ] (s2orc) {title, abstract, fulltext, mag} line-by-line
    4. [ ] (s2orc+keyphrase) {title, abstract, fulltext, mag, keyphrases} line-by-line
2. [ ] <span style="color:purple">Vocab choice</span>
    1. [ ] with the model
    2. [ ] scibert
    3. [ ] custom
    4. [ ] tfidf
3. [ ] <span style="color:magenta">Embedding choice</span>
    1. [ ] BoW
    2. [ ] CBoW
    3. [ ] tfidf
    4. [ ] Random
    5. [ ] Vocab
4. [ ] <span style="color:red">Encoding (Model) choice</span>
    1. [ ] BERT-based
        1. [ ] scibert
        2. [ ] sbert
        3. [ ] rbert
        4. [ ] RoBERTa
    2. [ ] old methods
        1. [ ] Doc2Vec
        2. [ ] GloVe
        3. [ ] ELMo
    3. [ ] other methods
        1. [ ] xlnet
        2. [ ] keyphrase
5. [ ]  <span style="color:orange">Pre Processing</span>
    1. [ ] Vocabulary
    2. [ ] Tokenizer
    3. [ ] Encoder
    4. [ ] Dataset
    5. [ ] DataLoader
6. [ ]  <span style="color:#cc7722">Pre Training (MLM)</span>
    1. [ ] from scratch
    2. [ ] from checkpoint
7. [ ] <span style="color:#cc7722">Pooling</span>
    - [ ] mean
    - [ ] max
    - [ ] bert-wk
8. [ ]  <span style="color:brown">Evaluation (MLM)</span>
    1. [ ] loss
    2. [ ] accuracy
9. [ ] <span style="color:black">Fine Tuning</span>
    1. [ ] (papers similarity) STS
    2. [ ] (title generation) title from abstract
    3. [ ] (abstract generation) abstract from fulltext
10. [ ] <span style="color:pink">Evaluation</span>
    1. [ ] (benchmarks) SentEval
    2. [ ] (papers similarity) STS
    3. [ ] (title generation) title from abstract
    4. [ ] (abstract generation) abstract from fulltext
11. [ ] <span style="color:lime">Ablation studies</span>
    1. [ ] ?
    2. [ ] ?
12. [ ] <span style="color:lime">(visualization) clustering of papers based on {keyword}</span>
    - (methods) we can do this either by
        - [ ] (approximate) averaging the keywords (finding the mean keyword)
    - or
        - [ ] (graph) building a graph of keyphrases (or keyphrase embeddings) that represents the paper
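
To make the "Pooling" step in the list above more concrete, here is a minimal sketch of mean and max pooling over token vectors with a padding mask. It uses plain Python lists so it runs without dependencies; a real pipeline would more likely operate on tensors from the encoder. The function names and the toy values are assumptions, and bert-wk is not covered here.

```python
from typing import List

Vector = List[float]


def mean_pool(token_vectors: List[Vector], mask: List[int]) -> Vector:
    """Average token vectors, ignoring positions where mask is 0 (padding)."""
    kept = [vec for vec, m in zip(token_vectors, mask) if m]
    if not kept:
        raise ValueError("mask selects no tokens")
    return [sum(col) / len(kept) for col in zip(*kept)]


def max_pool(token_vectors: List[Vector], mask: List[int]) -> Vector:
    """Component-wise maximum over the non-padding token vectors."""
    kept = [vec for vec, m in zip(token_vectors, mask) if m]
    if not kept:
        raise ValueError("mask selects no tokens")
    return [max(col) for col in zip(*kept)]


# Hypothetical token vectors for one sentence; the last position is padding.
tokens = [[1.0, 4.0], [3.0, 2.0], [9.0, 9.0]]
mask = [1, 1, 0]

mean_vec = mean_pool(tokens, mask)
max_vec = max_pool(tokens, mask)
```

Either pooled vector can then feed the cosine-similarity and clustering steps; which pooling works better is exactly what the evaluation items are meant to find out.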