Notebooks and other Python code for competing against the results of garrettgoon.com/arxiv-vixra-quiz.
The goal of the project is to accurately assess whether a given paper is from arXiv or viXra based on the title or abstract alone.
The arXiv data used in the notebooks was taken from Kaggle's arXiv Dataset, while the viXra data was scraped from viXra and can be downloaded here as an 18 MB .feather file.
The notebooks were run on Google Colab Pro GPUs and written using tools from PyTorch Lightning and Weights and Biases for organization and hyperparameter sweeps. The notebooks are written to read from Google Drive, so they require alterations to run locally.
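One possible way to make the Drive-dependent paths easier to adapt is to resolve the data directory in a single place. The sketch below is an assumption about how that could look, not how the notebooks are actually written: `DRIVE_ROOT`, `LOCAL_ROOT`, and both function names are hypothetical and should be changed to match your own layout.

```python
import sys
from pathlib import Path

# Hypothetical locations; adjust both to match your own setup.
DRIVE_ROOT = Path("/content/drive/MyDrive/arxiv_vixra")
LOCAL_ROOT = Path("./data")


def running_in_colab() -> bool:
    """Return True when the google.colab module is loaded in this kernel."""
    return "google.colab" in sys.modules


def data_root(force_local: bool = False) -> Path:
    """Pick the Drive folder on Colab and the local folder everywhere else."""
    if running_in_colab() and not force_local:
        return DRIVE_ROOT
    return LOCAL_ROOT
```

With a helper like this, a notebook cell can build file paths as `data_root() / "some_file"` instead of hard-coding a Drive path.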
The present repo consists of the following directories:
- arxiv_vixra_models: Python package in which all torch/pl architectures, classes, helper functions, etc. are defined.
- data_processing: Colab notebooks for exploring properties of the dataset, as well as filtering, normalizing, and tokenizing the text.
- figures: Various figures generated from notebooks.
- glove: Colab notebooks for running the GloVe algorithm for word embeddings.
- baseline_models: Colab notebooks for simple baseline models (logistic regression and random forest) against which to compare.
- simple_recurrent: Basic RNN/LSTM/GRU models, either at character level or using word embeddings.
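
To illustrate what character-level input to a recurrent model can look like, here is a small sketch that maps a title to a fixed-length list of integer ids. It is an illustration only: the `<pad>` and `<unk>` tokens, the function names, and the padding scheme are assumptions, and the tokenization actually used in the data_processing notebooks may differ.

```python
PAD, UNK = "<pad>", "<unk>"


def build_char_vocab(texts):
    """Map every character seen in `texts` to an integer id."""
    vocab = {PAD: 0, UNK: 1}
    # Sorting makes the id assignment deterministic across runs.
    for ch in sorted({ch for text in texts for ch in text}):
        vocab[ch] = len(vocab)
    return vocab


def encode(text, vocab, seq_len):
    """Truncate or right-pad `text` to exactly `seq_len` character ids."""
    ids = [vocab.get(ch, vocab[UNK]) for ch in text[:seq_len]]
    return ids + [vocab[PAD]] * (seq_len - len(ids))
```

For example, with `vocab = build_char_vocab(["ab", "bc"])`, the call `encode("abz", vocab, 5)` returns `[2, 3, 1, 0, 0]`: the unseen character `z` maps to the unknown id and the remainder is padding. Lists of this shape can be batched into a tensor and passed to an embedding layer ahead of an RNN, LSTM, or GRU.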