Minerva Dataset Generation

Project Overview

Why is there a need to generate a dataset?

To implement any machine learning or deep learning algorithm, we need a better and bigger dataset of SPDX licenses. For lack of such a dataset, all 10 algorithms that have been tested on Atarashi are restricted to 59% accuracy. Unfortunately, no such dataset for open source licenses exists on the web.
Advanced architectures and algorithms such as LSTMs, GRU, BERT, WordNet, etc. require large volumes of data before they can outperform even traditional algorithms such as TF-IDF and n-gram.
Licenses also differ from traditional corpora: 50-60% of the keywords are shared between any two licenses, and licenses with the same heading but different versions are around 90% similar.

SPDX recent release : SPDX

 python ./Download-licenses-Script/spdx.py

SPDX-exceptions recent release : SPDX-exceptions

 python ./Download-licenses-Script/exceptions.py

Licenses in Fossology Database : licenseRef

 python ./Download-licenses-Script/database-foss.py

GENERATED FILES THROUGH INITIAL SPLIT

The basic idea is to n-gram each license at the paragraph level with a sliding window. For a license with 4 paragraphs, the files to generate are: para1, para2, para3, para4, para1+para2, para2+para3, para3+para4, para1+para2+para3, para2+para3+para4, and para1+para2+para3+para4. Non-contiguous combinations such as para1+para3 or para1+para3+para4 are not generated, because the structure of the license needs to be maintained.

 python ./Script-Initial-Split/initial_split.py

Script : initial_split
Files : SPDX
Files : FOSSologyDatabase
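
The sliding window described in this section can be sketched as follows. This is a minimal illustration, not the code in initial_split.py: the function names and the choice of blank lines as the paragraph delimiter are assumptions made for the example.

```python
import re


def split_paragraphs(text):
    """Split license text on blank lines (an assumed paragraph delimiter)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def contiguous_windows(paragraphs):
    """Return every contiguous run of paragraphs, shortest windows first."""
    n = len(paragraphs)
    windows = []
    for size in range(1, n + 1):              # window length
        for start in range(0, n - size + 1):  # window start position
            windows.append(paragraphs[start:start + size])
    return windows


license_text = "para1\n\npara2\n\npara3\n\npara4"
for window in contiguous_windows(split_paragraphs(license_text)):
    # Each window would become one generated file.
    print("+".join(window))
```

Because every window is a slice of consecutive paragraphs, combinations that skip a paragraph (para1+para3, for example) can never be produced.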

ADDING REGEX TO FILES

Regexes from the STRINGS.in file are added to the split files. Regex expansion is done with free and open-source libraries such as xeger and intxeger.

HANDLING REGEX EXPANSION

To handle expansions such as .{1,32} and .{1,64}, two algorithms are being considered:

A. NGRAM
(basically a set of co-occurring words within a given window)
B. MARKOV
(An extension of Naive Bayes to sequential data: the Hidden Markov Model defines a joint distribution over the observed letters x and their tags y, assuming dependencies only between adjacent tags.)
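
Whichever algorithm is chosen, the bounded wildcard first has to be located and replaced by generated text of a permitted length. The sketch below shows only that step, using the standard library. The pattern string, the vocabulary, and both function names are hypothetical, and the random-word filler is a placeholder for an n-gram or Markov generator rather than the repository's actual logic.

```python
import random
import re

# Bounded wildcards of the form .{m,n}, e.g. .{1,32} or .{1,64}
PLACEHOLDER = re.compile(r"\.\{(\d+),(\d+)\}")


def fill_placeholders(pattern, generate):
    """Replace every .{m,n} in pattern with the text returned by generate(m, n)."""
    def substitute(match):
        low, high = int(match.group(1)), int(match.group(2))
        return generate(low, high)
    return PLACEHOLDER.sub(substitute, pattern)


def make_word_filler(vocabulary, seed=0):
    """Hypothetical stand-in generator: random words cut to a length in [m, n]."""
    rng = random.Random(seed)

    def generate(low, high):
        target = rng.randint(low, high)  # number of characters to emit
        words = []
        while len(" ".join(words)) < target:
            words.append(rng.choice(vocabulary))
        return " ".join(words)[:target]
    return generate


# Hypothetical pattern; real STRINGS.in entries use more regex syntax than this.
pattern = "released under .{1,32} license"
filler = make_word_filler(("the", "terms", "of", "this", "software"))
print(fill_placeholders(pattern, filler))
```

Passing the generator in as a function keeps the wildcard handling separate from the choice of expansion algorithm.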

Multiprocessing has been added to the script to speed up data generation.

Codebase : Ngram
To generate licenses with ngram expansion:

 python ./ngram/licenses.py
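
As a rough illustration of the n-gram idea (not the contents of ngram/licenses.py), the sketch below extracts word n-grams and counts which word tends to follow a given context. The function names and the sample sentence are assumptions made for the example.

```python
from collections import Counter, defaultdict


def word_ngrams(tokens, n):
    """Return all runs of n consecutive tokens (a sliding window of size n)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def continuation_counts(tokens, n=2):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for gram in word_ngrams(tokens, n):
        table[gram[:-1]][gram[-1]] += 1
    return table


tokens = "the software is provided as is and the software is free".split()
table = continuation_counts(tokens, n=2)

print(word_ngrams(tokens, 3)[:2])            # first two trigrams
print(table[("software",)].most_common(1))   # most frequent word after "software"
```

Counts like these could be used to pick a plausible filler for a wildcard from the words that precede it in the license text.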

Codebase : Markov
To generate licenses with Markov expansion:

 python ./markov/markov_licenses.py
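
For intuition, here is a minimal sketch of Markov-style text generation. It is a plain first-order Markov chain over words, which is simpler than the Hidden Markov Model mentioned above, and it is not taken from markov/markov_licenses.py. The function names, the sample corpus, and the 32-character limit (chosen to mirror .{1,32}) are assumptions.

```python
import random
from collections import defaultdict


def build_chain(tokens):
    """Record, for every word, the list of words observed right after it."""
    chain = defaultdict(list)
    for current, following in zip(tokens, tokens[1:]):
        chain[current].append(following)
    return chain


def generate(chain, start, max_chars, rng):
    """Walk the chain from start, stopping before the text exceeds max_chars."""
    words = [start]
    while True:
        options = chain.get(words[-1])
        if not options:
            break  # dead end: no observed successor
        candidate = rng.choice(options)  # frequent successors are picked more often
        if len(" ".join(words + [candidate])) > max_chars:
            break
        words.append(candidate)
    return " ".join(words)[:max_chars]


corpus = "the software is provided as is and the software is free to use".split()
chain = build_chain(corpus)
print(generate(chain, "the", 32, random.Random(0)))
```

Storing repeated successors in a list means `rng.choice` samples each next word in proportion to how often it followed the current word in the corpus.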

Validating Dataset Generated Using NOMOS in Fossology

Nomos is used to validate the generated files. This is a baseline regex-based validation of the text files generated by both algorithms. The terminal command to run it is:

 sudo nomos -J -d <folder_with_files>

To validate files using multiple cores (3 cores in this example):

 sudo nomos -J -d <folder_with_files> -n 3
