BibTeX files are in bib/.
Note: this is a work in progress and still contains only a fraction of recent articles.
The following non-standard fields are used to describe how the publications relate to Common Crawl:
- cc-author-affiliation
  - affiliation of the authors
- cc-class
  - classification of the publication: domain of research, topics, keywords
- cc-snippet
  - snippet citing Common Crawl
- cc-dataset-used
  - subset of Common Crawl used, e.g., CC-MAIN-2016-07
- cc-derived-dataset-about
  - the publication describes a dataset which has been derived from Common Crawl, e.g., GloVe-word-embeddings
- cc-derived-dataset-used
  - a dataset has been used which is derived from Common Crawl, e.g., GloVe-word-embeddings
- cc-derived-dataset-cited
  - a derived dataset is cited but not used
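To illustrate how the cc-* fields might sit next to the standard BibTeX fields, here is a minimal Python sketch. The entry is hypothetical and invented for illustration only (the key, author, title, and the cc-class value are assumptions, not taken from bib/), and the helper `cc_fields` is not part of this repository. It assumes brace-delimited field values without nested braces.

```python
import re

# Hypothetical entry, invented for illustration only; it is not taken from bib/.
ENTRY = """
@inproceedings{doe2016example,
  author = {Jane Doe},
  title = {An Example Study},
  year = {2016},
  cc-author-affiliation = {Example University},
  cc-class = {nlp/word-embeddings},
  cc-dataset-used = {CC-MAIN-2016-07},
  cc-derived-dataset-used = {GloVe-word-embeddings},
}
"""

# Matches `name = {value}` for fields whose name starts with "cc-".
# Assumption: values are brace-delimited and contain no nested braces.
CC_FIELD = re.compile(r"^\s*(cc-[a-z-]+)\s*=\s*\{([^{}]*)\}", re.MULTILINE)


def cc_fields(entry):
    """Return the cc-* fields of a single BibTeX entry as a dict."""
    return {name: value.strip() for name, value in CC_FIELD.findall(entry)}


fields = cc_fields(ENTRY)
for name, value in fields.items():
    print(f"{name}: {value}")
```

A regular expression is enough for a quick look at one entry. For real files with nested braces or quoted values, a proper BibTeX parser would be the safer choice.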
The Makefile contains targets that apply consistent formatting to the citations. It can also export the citations. The following BibTeX tools are required: bibtex2html, bibclean, bibtool.
(Do not confuse bibclean with the PyPI package of the same name, which is entirely different. bibclean, bibtool, and bibtex2html are available as OS packages, at least in apt-based distros.)
As an initial step, and to increase coverage, citations are extracted from Google Scholar Alert e-mails received from April 2016 to date. See gscholar_alerts.