centillion: a pan-github-markdown-issues-google-docs search engine.
a centillion: a very large number consisting of a 1 with 303 zeros after it.
one centillion is 3.03 log-times better than a googol.
What Is It
centillion (https://github.com/dcppc/centillion) is a search engine that can index different kinds of document collections: Google Documents (.docx files), Google Drive files, Github issues, Github files, Github Markdown files, and Groups.io email threads.
How It Works
We define the types of documents that centillion should index,
what information to index, and how. centillion then builds and
updates a search index. That's all done in
centillion also provides a simple web frontend for running
queries against the search index. That's done using a Flask server
centillion keeps it simple.
centillion lives behind a Github authentication layer, implemented with flask-dance. When you first visit the site it will ask you to authenticate with Github so that it can verify you have permission to access the site.
There is a master list of all content indexed by centillion at the master list page, https://search.nihdatacommons.us/master_list.
A master list for each type of document indexed by the search engine is displayed in a table:
The metadata shown in these tables can be filtered and sorted:
There's also a control panel at https://search.nihdatacommons.us/control_panel that allows you to rebuild the search index from scratch. The search index stores versions/contents of files locally, so re-indexing involves going out and asking each API for new versions of a file/document/web page. When you re-index the main search index, it will ask every API for new versions of every document. You can also update only specific types of documents in the search index.
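The re-indexing step described above can be pictured as a comparison between the versions stored locally and the versions each API reports. The following is a minimal sketch of that idea, not centillion's actual code: the function name `plan_reindex`, the document ids, and the use of date strings as version markers are all hypothetical, and whether centillion removes documents that have disappeared upstream is an assumption.

```python
def plan_reindex(local_versions, remote_versions):
    """Compare locally stored versions with what an API reports.

    Both arguments map a document id to a version marker (here a date
    string). Returns three sorted lists: ids to add, update, and delete.
    This is an illustrative sketch, not centillion's implementation.
    """
    local_ids = set(local_versions)
    remote_ids = set(remote_versions)

    # Documents the API knows about but the local index does not.
    to_add = sorted(remote_ids - local_ids)
    # Documents in the local index that the API no longer returns
    # (assumption: such documents are dropped from the index).
    to_delete = sorted(local_ids - remote_ids)
    # Documents present on both sides whose version marker changed.
    to_update = sorted(
        doc_id
        for doc_id in local_ids & remote_ids
        if local_versions[doc_id] != remote_versions[doc_id]
    )
    return to_add, to_update, to_delete


# Hypothetical data for illustration only.
local = {"issue-1": "2018-08-01", "issue-2": "2018-08-03", "doc-9": "2018-07-15"}
remote = {"issue-1": "2018-08-01", "issue-2": "2018-08-10", "issue-3": "2018-08-11"}

to_add, to_update, to_delete = plan_reindex(local, remote)
print(to_add, to_update, to_delete)
```

Updating only one type of document would amount to running the same comparison against a single API instead of all of them.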
centillion is a Python program built using whoosh (a search engine library). It indexes the full text of docx files in Google Documents, and just the filenames of non-docx files. The full text of issues and their comments is indexed, and results are grouped by issue. centillion requires Google Drive and Github OAuth apps. Once you provide credentials to Flask you're all set to go.
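To illustrate what "results are grouped by issue" could look like, here is a small sketch in plain Python. It does not use whoosh and is not taken from centillion; the function `group_hits_by_issue` and the hit fields `issue_url`, `kind`, and `text` are hypothetical names chosen for the example.

```python
def group_hits_by_issue(hits):
    """Group search hits (issue bodies and comments) by their parent issue.

    Each hit is a dict with a hypothetical "issue_url" key. Issues keep the
    order in which their first hit appeared, which preserves ranking order.
    """
    grouped = {}
    for hit in hits:
        grouped.setdefault(hit["issue_url"], []).append(hit)
    return grouped


# Hypothetical hits, ordered by relevance.
hits = [
    {"issue_url": "https://github.com/dcppc/centillion/issues/2", "kind": "comment", "text": "reindex fails"},
    {"issue_url": "https://github.com/dcppc/centillion/issues/1", "kind": "issue", "text": "reindex button"},
    {"issue_url": "https://github.com/dcppc/centillion/issues/2", "kind": "issue", "text": "reindex error"},
]

grouped = group_hits_by_issue(hits)
for issue_url, issue_hits in grouped.items():
    print(issue_url, len(issue_hits))
```

With this shape, a matching comment and its matching parent issue show up as one result entry rather than two.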
You will need to configure both the centillion search index and the flask app.
The centillion search index is configured with
config_centillion.py; this file
sets the names of repositories to crawl when indexing issues and files.
The flask app is configured with
config_flask.py. This file contains sensitive
information (API credentials for Github and Groups.io),
so it is listed in the
.gitignore file.
Examples are provided in
The search engine will need to connect to several APIs when it re-indexes the search index:
- Google Drive
Github API credentials (both an OAuth token for the centillion app's Github
authentication mechanism, and a personal access token for accessing repositories
during the re-indexing process) are provided in
The Groups.io API token is used to index email threads. This token is provided in
The Google Drive API credentials are provided in a file,
credentials.json. This is
the file that is generated when the OAuth process is complete.
When you enable the Google Drive API in the Google Cloud Console, you will be provided
with a file
client_secrets.json. To authenticate centillion with Google Drive, you should
download this file, and run the Google Drive utility directly:
This will initiate the authentication procedure. Sign in as a user that has access to the documents you want to index, and only the documents you want to index (it is useful to set up a bot account for this purpose).
Once you log in as that user, it will create
credentials.json, and the Google Drive
re-indexing procedure should not have any problems authenticating using that file.
Quickstart (With Github Auth)
Start by creating a Github OAuth application. Get the public and private application keys (client token and client secret token) from the Github application's page. You will also need a Github access token (in addition to the app tokens).
When you create the application, set the callback URL to the path
/login/github/authorized, as in:
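As a rough illustration (not taken from the centillion repository), the full callback URL is the site's base URL joined with the path /login/github/authorized. The helper name `callback_url` is hypothetical, and the `http://localhost:5000` address assumes a local Flask server on its default port.

```python
from urllib.parse import urljoin

# Callback path named in this README.
CALLBACK_PATH = "/login/github/authorized"


def callback_url(base_url):
    """Join a site's base URL with the Github OAuth callback path."""
    return urljoin(base_url, CALLBACK_PATH)


# Deployed site mentioned earlier in this README.
print(callback_url("https://search.nihdatacommons.us"))
# Local development (assumed address; Flask's default port is 5000).
print(callback_url("http://localhost:5000"))
```

The scheme matters here: a callback registered with http will not match a site served over https, and the reverse is also true.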
Edit the Flask configuration
and set the public and private application keys.
Now run centillion:
or if you used http instead of https:
OAUTHLIB_INSECURE_TRANSPORT="true" python centillion.py
This will start a Flask server, and you can view the minimal search engine
interface in your browser at
If you are having problems with your callback URL being treated as HTTP by Github, even though there is an HTTPS address, and everything else seems fine, try deleting the Github OAuth app and creating a new one.