Feat/ngram #23
6a45aad to 44225bb
Please make the changes.
scripts/CosineSimNgram.py
Outdated
parser = argparse.ArgumentParser()
parser.add_argument("inputFile", help="Specify the input file which needs to be scanned")
parser.add_argument("licenseList", help="Specify the license list file which contains licenses")
parser.add_argument("Similarity", choices=["CosineSim", "DiceSim", "BigramCosineSim"],
Consider making `similarity` an optional flag, like `verbose`, instead of a positional argument.
Adding a default value would also be really helpful.
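A minimal sketch of this suggestion follows. The flag names `-s`/`--similarity`, the `CosineSim` default, and the way `verbose` is declared are assumptions for illustration; only the `choices` list and the positional arguments come from the diff.

```python
import argparse


def build_parser():
    """Build a parser where the similarity type is an optional flag with a default."""
    parser = argparse.ArgumentParser()
    parser.add_argument("inputFile", help="Specify the input file which needs to be scanned")
    parser.add_argument("processedLicenseList",
                        help="Specify the processed license list file which contains licenses")
    # Assumed flag names and default value; the choices are the ones listed in the diff.
    parser.add_argument("-s", "--similarity",
                        choices=["CosineSim", "DiceSim", "BigramCosineSim"],
                        default="CosineSim",
                        help="Specify the similarity algorithm to use")
    # Assumed shape of the existing verbose flag.
    parser.add_argument("-v", "--verbose", action="count", default=0,
                        help="Increase output verbosity")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["sample.c", "licenses.csv"])
print(args.similarity)
```

With this layout, existing invocations that omit the algorithm keep working, and `-s DiceSim` selects another similarity explicitly.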
scripts/CosineSimNgram.py
Outdated
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("inputFile", help="Specify the input file which needs to be scanned")
    parser.add_argument("licenseList", help="Specify the license list file which contains licenses")
Please use `processedLicenseList` instead of plain `licenseList`, which can be confusing.
scripts/initial_match.py
Outdated
for x in identifiers.split(" "):
    if x in shortnames:
        spdx_identifiers.append({
            'shortname': x,
Same `namedtuple` here.
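The `namedtuple` suggestion could look roughly like the sketch below. Only the `shortname` and `sim_type` keys appear in the diff; the `sim_score` and `description` fields, the `LicenseMatch` name, the `find_identifiers` helper, and the `"SPDXIdentifier"` value are hypothetical.

```python
from collections import namedtuple

# 'shortname' and 'sim_type' come from the diff; the other two fields are assumed.
LicenseMatch = namedtuple("LicenseMatch",
                          ["shortname", "sim_type", "sim_score", "description"])


def find_identifiers(identifiers, shortnames):
    """Collect a LicenseMatch for every known identifier in a space-separated string."""
    matches = []
    for x in identifiers.split(" "):
        if x in shortnames:
            matches.append(LicenseMatch(shortname=x, sim_type="SPDXIdentifier",
                                        sim_score=1.0, description=""))
    return matches


results = find_identifiers("MIT GPL-2.0 Unknown", {"MIT", "GPL-2.0"})
# _asdict() gives a plain mapping, which is convenient for JSON output.
print([m._asdict() for m in results])
```

Compared with ad hoc dictionaries, a single record type keeps the field names consistent across every script that reports a match.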
scripts/initial_match.py
Outdated
if full_text in processedData:
    exact_match_fulltext.append({
        'shortname': license[0],
        'sim_type': 'ExactFullText',
Same `namedtuple` here.
scripts/initial_match.py
Outdated
if ngram_sim >= 0.7:
    header_sim_match.append({
        'shortname': license[0],
        'sim_type': 'HeaderNgramSimilarity',
Same `namedtuple` here.
scripts/license_clustering.py
Outdated
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("licenseList", help="Specify the license list file which contains licenses")
Please rename `licenseList` to `processedLicenseList`, as above.
scripts/ngram.py
Outdated
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("licenseList", help="Specify the license list file which contains licenses")
Same `processedLicenseList` here as well, please.
scripts/pariksha.py
Outdated
@@ -72,5 +72,10 @@
if temp in text[4]:
matched += 1
tqdm.write("{0} {1} {2}".format(temp, text[1], text[4]))
elif agent_name == "Ngram":
temp = str(NgramSim(pathto + filePath, processedLicense, "BigramCosineSim"))
Please consider using other similarities here as well.
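One way to cover the other similarity types in the test loop is sketched below with a stand-in scanner. `fake_scan` and `count_matches` are hypothetical and only imitate the call shape `NgramSim(path, processedLicense, simType)` seen in the diff; the similarity names are taken from the `choices` list in CosineSimNgram.py.

```python
SIMILARITY_TYPES = ("CosineSim", "DiceSim", "BigramCosineSim")


def count_matches(scan, file_path, processed_license, expected):
    """Return, per similarity type, whether the scan result appears in `expected`."""
    outcome = {}
    for sim_type in SIMILARITY_TYPES:
        result = str(scan(file_path, processed_license, sim_type))
        # Mirrors the substring check `if temp in text[4]` from the diff.
        outcome[sim_type] = result in expected
    return outcome


def fake_scan(file_path, processed_license, sim_type):
    # Hypothetical stand-in for NgramSim: pretends DiceSim misses the license.
    return "MIT" if sim_type != "DiceSim" else "Unknown"


print(count_matches(fake_scan, "tests/mit.c", "licenses.csv", "MIT, Apache-2.0"))
```

Passing the scanner in as a callable keeps the loop independent of which agent is under test.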
scripts/tfidf.py
Outdated
sim_score = sum(value)
score_arr.append({
    'shortname': licenses[result][0],
    'sim_type': "Sum of TF-IDF score",
Same `namedtuple` here as well.
scripts/tfidf.py
Outdated
if sim_score >= 0.8:
    matches.append({
        'shortname': licenses[counter][0],
        'sim_type': "TF-IDF Cosine Sim",
Same `namedtuple` here as well.
I have one more idea: we can use class objects as the scan result.
For example, there could be a class called `ScanResult` that has all 4 attributes as constants. These attributes are assigned by the class constructor, so whenever you get a result, you just create a `ScanResult` object and pass the 4 values as arguments to the constructor.
This object can then be serialized to JSON (see here).
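A rough sketch of such a class is shown below. Only `shortname` and `sim_type` appear in the diff; the `sim_score` and `description` attributes, the `to_dict` method, and the `results_to_json` helper are assumptions made for illustration.

```python
import json


class ScanResult:
    """Holds one scan result; all attributes are assigned by the constructor."""

    # Only 'shortname' and 'sim_type' appear in the diff; the rest are assumed.
    __slots__ = ("shortname", "sim_type", "sim_score", "description")

    def __init__(self, shortname, sim_type, sim_score, description):
        self.shortname = shortname
        self.sim_type = sim_type
        self.sim_score = sim_score
        self.description = description

    def to_dict(self):
        """Return the attributes as a plain dict, ready for json.dumps."""
        return {name: getattr(self, name) for name in self.__slots__}


def results_to_json(results):
    """Serialize a list of ScanResult objects to a JSON string."""
    return json.dumps([r.to_dict() for r in results])


print(results_to_json([ScanResult("MIT", "ExactFullText", 1.0, "")]))
```

Using `__slots__` fixes the set of attribute names, so a typo such as `result.shortnme = ...` raises an error instead of silently adding a new field.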
Please review PR #28 before finally merging this.
… licenses to make unique ngrams for each cluster
…x to spdx identifier
…mented feature in imtihaan to check with any given specific file
6bcd147 to 8399606
This is working in #26. Can you please check?
Made changes to include Pandas changes.
Use iloc instead of loc when traversing using indexes in DataFrame.
Compressed Ngram_keywords_new.json to Ngram-json.tar.gz, which is extracted while running ngram.py and CosineSimNgram.py using a function defined in scripts/utils.py.
Signed-off-by: Gaurav Mishra <mishra.gaurav@siemens.com>
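The extraction step mentioned in this commit could be sketched as follows, using only the standard library. The function name `unpack_json_tar` and its parameters are hypothetical; the actual helper in scripts/utils.py is not shown in this conversation.

```python
import os
import tarfile


def unpack_json_tar(archive_path, dest_dir):
    """Extract the .json members of a .tar.gz archive into dest_dir.

    Returns the list of paths written. Members are written under their base
    name only, so entries with unusual paths cannot land outside dest_dir.
    """
    os.makedirs(dest_dir, exist_ok=True)
    written = []
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            # Skip directories, links, and anything that is not a JSON file.
            if not (member.isfile() and member.name.endswith(".json")):
                continue
            target = os.path.join(dest_dir, os.path.basename(member.name))
            with tar.extractfile(member) as src, open(target, "wb") as dst:
                dst.write(src.read())
            written.append(target)
    return written
```

A call such as `unpack_json_tar("Ngram-json.tar.gz", "data")` would then leave the JSON file in `data/` for the scanning scripts to load.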
…am-json fix(utils): change file name
Implemented a unified script to run any algorithm from atarashi
Feat/unified script reviewed and tested by : anupam.ghosh@siemens.com, mishra.gaurav@siemens.com
Code looks good. Can proceed to test
code looks good
Implemented N-gram matching with bi-gram cosine similarity.
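For readers unfamiliar with the technique, bi-gram cosine similarity can be sketched as below. This is an illustration under stated assumptions (word-level bigrams, lowercase whitespace tokenization, raw counts as vector weights) and is not the implementation in CosineSimNgram.py; the function names are hypothetical.

```python
import math
from collections import Counter


def bigrams(text):
    """Return the word-level bigrams of `text` as a Counter."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def bigram_cosine_sim(text_a, text_b):
    """Cosine similarity between the bigram count vectors of two texts."""
    a, b = bigrams(text_a), bigrams(text_b)
    # Dot product over the bigrams of `a`; a Counter returns 0 for missing keys.
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a text with fewer than two words has no bigrams
    return dot / (norm_a * norm_b)


print(bigram_cosine_sim("permission is hereby granted",
                        "permission is hereby denied"))
```

Texts that share most of their word pairs score close to 1, while texts with no word pair in common score 0.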
The Ngram JSON can be created with ngram.py, which also supports multiprocessing.
python ngram.py <processed licenseList> -t <threads>
License_clustering groups similar licenses into clusters so that unique n-grams can be created for each cluster.
CosineSimNgram scans a file as:
python CosineSimNgram.py <input file> <processed licenseList> <algorithm type>
Pariksha (Test suite)
python pariksha.py <processed licenseList> <Algo type>
Imtihaan (Test suite for SPDX files and Bigram Cosine similarity only)
python imtihaan.py <processed licenseList> <path to test folder>
Also, please check the `--help` flag to learn more about the arguments.
TODO: