The first step in our approach is the gradual organization and systematization of existing code and data on local machines. Users can gradually classify their files by assigning them to a new or existing CK module written in Python. Such modules are used as wrappers (containers) to abstract, describe and manage related data. Each module exposes common actions (such as add, delete, load, update and find) as well as internal actions specific to its class.
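The idea of a module as a wrapper with common actions behind a single JSON-in/JSON-out entry point can be sketched as follows. This is a minimal, self-contained illustration, not the real CK code; the class name, in-memory storage and return-code convention are assumptions for the example:

```python
import uuid

class CKModule:
    """Minimal sketch of a CK-style module: a wrapper that manages one
    class of data entries and exposes common actions (add, load, delete)."""

    def __init__(self, alias):
        self.alias = alias
        self.uid = uuid.uuid4().hex[:16]  # illustrative short hex UID
        self.entries = {}                 # data_uoa -> meta-description dict

    def access(self, request):
        """Single JSON-in/JSON-out entry point, in the spirit of CK's API."""
        action = request.get("action")
        uoa = request.get("data_uoa")
        if action == "add":
            self.entries[uoa] = request.get("meta", {})
            return {"return": 0, "data_uoa": uoa}
        if action == "load":
            if uoa not in self.entries:
                return {"return": 1, "error": "entry not found"}
            return {"return": 0, "meta": self.entries[uoa]}
        if action == "delete":
            self.entries.pop(uoa, None)
            return {"return": 0}
        return {"return": 1, "error": "unknown action"}

m = CKModule("dataset")
m.access({"action": "add", "data_uoa": "image-jpeg-0001",
          "meta": {"size": "512x512"}})
print(m.access({"action": "load", "data_uoa": "image-jpeg-0001"}))
```

Because every action takes and returns a plain dictionary, new actions can be added without changing any caller's calling convention.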
Modules and related data entries are always assigned a unique identifier (UID) and may also have a user-friendly alias (together referenced as UOA, i.e. UID Or Alias) and a brief description. Any data entry in the system can be found and cross-linked using CID=module_uoa:data_uoa.
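Resolving such a CID string amounts to splitting it into its module and data parts. A minimal sketch (the helper name is hypothetical, not part of CK's actual API):

```python
def parse_cid(cid):
    """Split a CK-style CID of the form module_uoa:data_uoa into its parts.

    Each part may be either a UID or a human-readable alias
    (UOA = UID Or Alias)."""
    module_uoa, sep, data_uoa = cid.partition(":")
    if not sep or not module_uoa or not data_uoa:
        raise ValueError("CID must have the form module_uoa:data_uoa")
    return {"module_uoa": module_uoa, "data_uoa": data_uoa}

# Example: resolve a data entry under a hypothetical 'program' module
print(parse_cid("program:cbench-automotive-susan"))
```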
All data entries are kept in the native file system, ensuring platform portability. Users can add or gradually update the meta-description of any data entry in the popular, human-readable JSON format, which can be modified with any available editor. This meta-description can be transparently indexed by the third-party ElasticSearch tool to enable fast and powerful search capabilities. Furthermore, data entries in our format can now easily be archived, shared via Git and moved between different local repositories and user machines.
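Storing meta-descriptions as plain JSON files on the file system can be illustrated as below. The directory layout and file name here are assumptions for the example, not CK's exact on-disk format:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical layout: <repo>/<module_uoa>/<data_uoa>/meta.json
repo = Path(tempfile.mkdtemp())
entry = repo / "dataset" / "image-jpeg-0001"
entry.mkdir(parents=True)

# Write the human-readable meta-description; any editor can modify it later
meta = {"tags": ["image", "jpeg"], "size": "512x512"}
(entry / "meta.json").write_text(json.dumps(meta, indent=2))

# Any other tool (an editor, Git, an indexer) can read back the same file
loaded = json.loads((entry / "meta.json").read_text())
print(loaded["tags"])
```

Because each entry is just a directory with a JSON file, archiving or sharing an entry is an ordinary copy or `git add` of that directory.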
This relatively simple approach has already allowed us to gradually abstract, organize and share all our past knowledge (not only results and data but also code) along with publications, protect it from continuous changes in the system, make it searchable, and connect it together to implement various research scenarios, as conceptually shown in the figures below:
In the end, CK is just a small Python module with a JSON API that glues together the user's code and data in local directories registered as CK repositories (so that any code and data can always be found by CID).
Furthermore, CK can help deal with ever-changing and possibly proprietary software and hardware by abstracting access to them via CK tool wrappers, as conceptually shown below:
For example, users just need to set up a CK environment for a given version of already pre-installed software (such as Intel compilers or SPEC benchmarks) and can then use CK modules with the JSON API to access that software, as described in detail in this section. This allows researchers to share their experimental setups while excluding proprietary software, providing instead a simple recipe for installing it and setting up the CK environment for unified communication. It also allows multiple versions of related tools, such as different versions of the LLVM and GCC compilers, to co-exist easily.
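Conceptually, an environment entry records where a pre-installed tool lives, so workflows reference the entry rather than hard-coded paths, and several tool versions can co-exist side by side. The keys and entry names below are illustrative, not CK's actual environment schema:

```python
# Hypothetical environment entries for two co-existing compiler versions
envs = {
    "env-gcc-7":  {"tool": "gcc",  "version": "7.3.0",
                   "bin": "/usr/bin/gcc-7"},
    "env-llvm-6": {"tool": "llvm", "version": "6.0.0",
                   "bin": "/opt/llvm-6/bin/clang"},
}

def pick(tool, version, envs):
    """Select the environment entry matching a tool name and version prefix."""
    for uid, e in envs.items():
        if e["tool"] == tool and e["version"].startswith(version):
            return uid, e
    raise LookupError(f"no environment registered for {tool} {version}")

# A workflow asks for 'gcc 7' and gets the registered entry, not a raw path
print(pick("gcc", "7", envs)[0])
```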
Such organization allows users to gradually convert any ad-hoc, hardwired analysis and experimental setups into unified pipelines (or workflows) assembled, LEGO®-style, from interconnected CK modules and data entries. Furthermore, the simple CK API with unified input and output makes it possible to expose the information flow to existing and powerful statistical analysis, classification and predictive modeling tools, including R and SciPy. For example, CK helped us convert and share all the hardwired, script-based experimental setups from our past and current R&D on program auto-tuning and machine learning as shareable CK pipelines, conceptually presented in the following figure:
Furthermore, such pipelines can be replayed (repeated) at any later time given a JSON input, module_uoa and action, thus supporting our initiative on collaborative and reproducible R&D. At the same time, whenever any unexpected behavior is detected, the community can help improve modules, provide missing descriptions, or add more tools, modules and data to the pipelines in order to gradually and collaboratively explain the unexpected behavior, ensure reproducibility and improve collective knowledge.
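The chaining-and-replay idea can be sketched in a few lines: each step is identified by (module_uoa, action, JSON input), one step's JSON output feeds the next, and recording that triple is enough to replay the step later. The module names and dispatch mechanism here are toy assumptions, not CK's implementation:

```python
def run(module_uoa, action, inp, registry):
    """Dispatch a JSON-style request to one module action (a pipeline step)."""
    return registry[module_uoa][action](inp)

# Two toy 'modules'; the JSON output of one step is the JSON input of the next
registry = {
    "compiler": {"build": lambda i: {"return": 0, "binary": i["src"] + ".out"}},
    "runner":   {"run": lambda i: {"return": 0, "binary": i["binary"],
                                   "time_s": 1.23}},
}

# Record the full invocation so the pipeline step can be replayed later
record = {"module_uoa": "compiler", "action": "build",
          "inp": {"src": "susan.c"}}
step1 = run(record["module_uoa"], record["action"], record["inp"], registry)
step2 = run("runner", "run", step1, registry)

# Replaying from the saved record reproduces the original step exactly
replayed = run(record["module_uoa"], record["action"], record["inp"], registry)
print(step2["binary"], replayed == step1)
```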
Internally, modules and data should always be referenced by UID rather than by alias to ensure compatibility between various modules: whenever the API or data format of a given module becomes backward-incompatible, we may keep the same alias (or add a version), but we must change its UID. Thus, new and old modules can co-exist without breaking shared experimental workflows.
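The versioning rule can be made concrete with a toy registry where one alias maps to several UIDs. Workflows that pin a UID keep working after a backward-incompatible revision introduces a new UID; the UIDs and version field below are invented for illustration:

```python
# Toy registry: the same alias ('compiler') maps to two UIDs because a
# backward-incompatible API change forced a new UID while keeping the alias.
registry = {
    "uid-a1b2": {"alias": "compiler", "api_version": 1},
    "uid-c3d4": {"alias": "compiler", "api_version": 2},  # incompatible rev.
}

def resolve(alias):
    """Resolve an alias to its newest UID (old UIDs remain addressable)."""
    uids = [u for u, m in registry.items() if m["alias"] == alias]
    return max(uids, key=lambda u: registry[u]["api_version"])

print(resolve("compiler"))   # new workflows pick up the latest UID
print(registry["uid-a1b2"])  # old workflows pinned to uid-a1b2 still work
```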
Finally, CK's implementation as a simple and open knowledge management SDK makes it easy to integrate with other third-party technology such as IPython, web services, file managers, GUI frameworks, MediaWiki, Drupal, Visual Studio, Android Studio and Eclipse. It can also be extended through higher-level, user-friendly tools similar to TortoiseGit, IPython and phpMyAdmin. We expect that if the community finds CK useful, it will help us improve CK and develop extensions for various practical research and experimentation scenarios.
We hope that our approach will let industry, academia and volunteers work together to gradually improve research techniques and continuously share realistic benchmarks and data sets. We believe that this can eventually enable truly open, collaborative, interdisciplinary and reproducible research, to some extent similar to physics and other natural sciences, Wikipedia, literate programming and the open-source movement. It may also help computer engineers and researchers become data scientists and focus on innovation while liberating them from ad-hoc, repetitive, boring and time-consuming tasks. It should also help solve some of the big data problems we have faced since 1993 by preserving predictive models (knowledge) and finding missing features rather than keeping large and possibly useless amounts of raw data.
List of public repositories, data, modules and actions (customizable and multi-objective autotuning, realistic benchmarks and workloads, experiment crowdsourcing, predictive analytics, co-existence of multiple tools, interactive graphs and articles, etc.)
Our past publications as reusable and customizable CK components.
You are welcome to get in touch with the CK community if you have questions or comments!