Things to do with arXiv metadata :-)
This repository is (currently) a collection of python scripts and notebooks that
Do a Word2Vec encoding of physics jargon (using gensim's CBOW or skip-gram, if you care for specifics).
Examples: "particle + charge = electron" and "majorana + braiding = non-abelian"
Remark: These examples were learned from the cond-mat section titles only.
Analyze the n-grams (i.e. fixed n-word expressions) in the titles over the years (what should we work on? ;-))
These scripts were tested and run using Python 3. I have not checked backwards compatibility, but I have heard from people who managed to get it to work in Python 2 too! Feel free to reach out to me in case things don't work out-of-the-box. I have not (yet) tried to make the scripts and notebooks super user-friendly, though I did try to comment the code such that you may figure things out by trial-and-error.
If you're already familiar with python, all you need to have are the modules numpy, pyoai, inflect and gensim. These should all be easy to install using pip/pip3. Then the workflow is as follows (I used python3):
- python arXivHarvest.py --section physics:cond-mat --output condmattitles.txt
- python parsetitles.py --input condmattitles.txt --output condmattitles.npy
- python trainmodel.py --input condmattitles.npy --size 100 --window 10 --mincount 5 --output condmatmodel-100-10-5
- python askmodel.py --input condmatmodel-100-10-5 --add particle charge
In step 1, we get the titles from arXiv. This is a time-consuming step; it took 1.5hrs for the physics:cond-mat section, and so I've provided the files for those in the repository already (i.e. you can skip steps 1 and 2). In step 2 we take out the weird symbols etc, and parse it into a *.npy file. In the third step, we train a model with vector size 100, window size 10 and minimum count for words to participate of 5. Step 4 can be repeated as often as one desires.
Apart from the above scripts, I provide 3 python notebooks that perform more than just the analysis of arXiv titles. I highly recommend using notebooks. Very easy to install, and super useful. See here: http://jupyter.org/. You can also just copy-and-paste the code from the notebooks into a *.py script and run those.
You are going to need to following python modules in addition, all installable using pip3 (sudo pip3 install [module-name]).
Must-have for anything scientific you want to do with python (arrays, linalg)
Open Archive Initiaive module for querying the arXiv servers for metadata
Module for generating/checking plural/singular versions of words
Very versatile module for topic modelling (analyzing basically anything you want from text, including word2vec)
Not required, but highly recommended is the module "matplotlib" for creating plots. You can comment/remove the sections in the code that refer to it if you really don't want to.
Optionally, if you wish to make a WordCloud, you will need