garrett361/arxiv-vixra-ml

arxiv-vixra-ml

Notebooks and other Python code for competing against the results of garrettgoon.com/arxiv-vixra-quiz.

The goal of the project is to classify a given paper as coming from arXiv or viXra based on its title or abstract alone.

Data

The arXiv data used in the notebooks was taken from Kaggle's arXiv Dataset, while the viXra data was scraped from viXra and can be downloaded here as an 18 MB .feather file.
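
As a rough sketch of how the two sources might be combined for classification, the snippet below reads a .feather file with pandas and stacks two frames under a binary label. The `title` column, the `is_vixra` label, and the helper names `load_feather` and `label_and_combine` are assumptions made for illustration and are not taken from the notebooks. Note that `pandas.read_feather` needs `pyarrow` installed.

```python
import pandas as pd


def load_feather(path: str) -> pd.DataFrame:
    """Read a .feather file into a DataFrame (requires pyarrow)."""
    return pd.read_feather(path)


def label_and_combine(arxiv_df: pd.DataFrame, vixra_df: pd.DataFrame) -> pd.DataFrame:
    """Stack both sources and add a binary `is_vixra` label column."""
    arxiv = arxiv_df.assign(is_vixra=0)
    vixra = vixra_df.assign(is_vixra=1)
    return pd.concat([arxiv, vixra], ignore_index=True)


# Made-up placeholder rows; real data would come from load_feather(<path>).
arxiv_demo = pd.DataFrame({"title": ["placeholder arXiv title"]})
vixra_demo = pd.DataFrame({"title": ["placeholder viXra title"]})
combined = label_and_combine(arxiv_demo, vixra_demo)
```

Keeping the label as a plain integer column makes the combined frame usable by both the baseline models and the neural models without further conversion.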

Setup and Workflow

The notebooks were run on Google Colab Pro GPUs and written using PyTorch Lightning and Weights and Biases for organization and hyperparameter sweeps. They read from Google Drive, so they require alterations to run locally.
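
One possible way to make a notebook run both on Colab and locally is to choose the data directory at startup, as in the sketch below. The helper names `running_in_colab` and `resolve_data_root`, the Drive folder `arxiv_vixra_data`, and the local `data` directory are hypothetical; only `google.colab.drive.mount` is an actual Colab call.

```python
from pathlib import Path


def running_in_colab() -> bool:
    """Return True when the google.colab module can be imported."""
    try:
        import google.colab  # noqa: F401  (only available inside Colab)
    except ImportError:
        return False
    return True


def resolve_data_root(in_colab: bool, local_root: str = "data") -> Path:
    """Pick the data directory for the current environment."""
    if in_colab:
        from google.colab import drive

        drive.mount("/content/drive")
        # "arxiv_vixra_data" is a hypothetical folder name inside Drive.
        return Path("/content/drive/MyDrive/arxiv_vixra_data")
    return Path(local_root)


DATA_ROOT = resolve_data_root(running_in_colab())
```

With this in place, the remaining cells can build paths from `DATA_ROOT` instead of hard-coding a Drive location.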

Repo Contents

The present repo consists of the following directories:

  • arxiv_vixra_models: Python package in which all torch/pl architectures, classes, helper functions, etc. are defined.
  • data_processing: Colab notebooks for exploring properties of the dataset, as well as filtering, normalizing, and tokenizing the text.
  • figures: Various figures generated from notebooks.
  • glove: Colab notebooks for running the GloVe algorithm for word embeddings.
  • baseline_models: Colab notebooks for simple baseline models (logistic regression and random forest) against which to compare.
  • simple_recurrent: Basic RNN/LSTM/GRU models, either at character level or using word embeddings.
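
To illustrate the kind of normalizing and tokenizing that feeds a character-level model, here is a minimal standard-library sketch. The normalization rules (lowercasing, accent stripping, whitespace collapsing), the `<pad>`/`<unk>` tokens, and all function names are assumptions for illustration, not the actual pipeline in data_processing.

```python
import re
import unicodedata

PAD, UNK = "<pad>", "<unk>"


def normalize(text: str) -> str:
    """Lowercase, strip accents, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"\s+", " ", text.lower()).strip()


def build_char_vocab(texts):
    """Map each character seen in `texts` to an integer index."""
    chars = sorted({ch for t in texts for ch in normalize(t)})
    return {tok: i for i, tok in enumerate([PAD, UNK] + chars)}


def encode(text, vocab, seq_len):
    """Convert text to a fixed-length index list (truncate or right-pad)."""
    ids = [vocab.get(ch, vocab[UNK]) for ch in normalize(text)][:seq_len]
    return ids + [vocab[PAD]] * (seq_len - len(ids))


vocab = build_char_vocab(["Dark  Matter", "Gödel"])
encoded = encode("Dark matter?", vocab, seq_len=16)
```

Fixed-length index lists like `encoded` can be batched directly into tensors for an RNN, LSTM, or GRU; characters outside the vocabulary fall back to the `<unk>` index.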
