Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat/ngram #23

Merged
merged 36 commits into from
Aug 1, 2018
Merged

Feat/ngram #23

merged 36 commits into from
Aug 1, 2018

Conversation

amanjain97
Copy link
Collaborator

@amanjain97 amanjain97 commented Jul 14, 2018

Implemented N-gram matching with bi-gram cosine similarity.

Ngram JSON can be created with ngram.py and it has multiprocessing too.
python ngram.py <processed licenseList> -t <threads>

License_clustering makes the cluster for similar licenses so that unique ngrams can be created for each cluster.

CosineSimNgram scans file as:

  1. SPDX identifier
  2. Exact Match for license header
  3. Header Similarity with percentage Ngram Matching
  4. Exact Match for license text
  5. Filtered list with Ngram matching
  6. Bi-gram cosine sim with the filtered list/ Dice Similarity/ uni-gram Cosine similarity
    python CosineSimNgram.py <input file> <processed licenseList> <algorithm type>

Pariksha (Test suite)
python pariksha.py <processed licenseList> <Algo type>

Imtihaan (Test suite for SPDX files and Bigram Cosine similarity only)
python imtihaan.py <processed licenseList> <path to test folder>

Also, please check with --help flag to know more about arguments.

TODO:

  • Add more comments to make code more readable.
  • Modularize the code so that the same flow can be applied with Levenshtein Distance and TF-IDF algorithm.

@amanjain97 amanjain97 requested review from GMishx and ag4ums July 14, 2018 14:13
@amanjain97 amanjain97 added enhancement New feature or request Need review labels Jul 14, 2018
@amanjain97 amanjain97 force-pushed the feat/ngram branch 2 times, most recently from 6a45aad to 44225bb Compare July 14, 2018 20:56
@GMishx GMishx added the WIP Work in progress label Jul 15, 2018
Copy link
Member

@GMishx GMishx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make the changes.

parser = argparse.ArgumentParser()
parser.add_argument("inputFile", help="Specify the input file which needs to be scanned")
parser.add_argument("licenseList", help="Specify the license list file which contains licenses")
parser.add_argument("Similarity", choices=["CosineSim", "DiceSim", "BigramCosineSim"],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using similarity as a flag like verbose instead of making it as a positional argument.
It will also be useful to add a default value which will be really helpful.

if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("inputFile", help="Specify the input file which needs to be scanned")
parser.add_argument("licenseList", help="Specify the license list file which contains licenses")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use processedLicenseList instead of plain licenseList which can be confusing.

for x in identifiers.split(" "):
if x in shortnames:
spdx_identifiers.append({
'shortname': x,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same namedtuple here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@GMishx Considering PR #26, do we need to add namedtuple here ?

if full_text in processedData:
exact_match_fulltext.append({
'shortname': license[0],
'sim_type': 'ExactFullText',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same namedtuple here.

if ngram_sim >= 0.7:
header_sim_match.append({
'shortname': license[0],
'sim_type': 'HeaderNgramSimilarity',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same namedtuple here.


if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("licenseList", help="Specify the license list file which contains licenses")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rewrite licenseList as processedLicenseList as above.

scripts/ngram.py Outdated

if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument("licenseList", help="Specify the license list file which contains licenses")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same processedLicenseList here as well please.

@@ -72,5 +72,10 @@
if temp in text[4]:
matched += 1
tqdm.write("{0} {1} {2}".format(temp, text[1], text[4]))
elif agent_name == "Ngram":
temp = str(NgramSim(pathto + filePath, processedLicense, "BigramCosineSim"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please consider using other similarities here as well.

scripts/tfidf.py Outdated
sim_score = sum(value)
score_arr.append({
'shortname': licenses[result][0],
'sim_type': "Sum of TF-IDF score",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same namedtuple here as well.

scripts/tfidf.py Outdated
if sim_score >= 0.8:
matches.append({
'shortname': licenses[counter][0],
'sim_type': "TF-IDF Cosine Sim",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same namedtuple here as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have one more idea, we can use class objects are the scan result.
Like there is a class called ScanResult and have all the 4 attributes as a constant. These attributes will be assigned by the class constructor. So whenever you are getting a result, you just create an object of ScanResult and pass the 4 values as argument to the constructor.
This object can then be serialized to JSON see here

@amanjain97
Copy link
Collaborator Author

Please review PR #28 before finally merging this.

@GMishx GMishx added the has merge conflicts The PR has merge conflicts which needs to be resolved label Jul 20, 2018
@amanjain97 amanjain97 force-pushed the feat/ngram branch 2 times, most recently from 6bcd147 to 8399606 Compare July 21, 2018 07:05
@GMishx GMishx removed the has merge conflicts The PR has merge conflicts which needs to be resolved label Jul 21, 2018
@amanjain97
Copy link
Collaborator Author

image
We rae getting only alphabets as output
@GMishx can you please fix the error.

@GMishx
Copy link
Member

GMishx commented Jul 22, 2018

This is working in #26. Can you please check?

Made changes to include Pandas changes.
Use iloc instead of loc when traversing using indexes in DataFrame

Compressed Ngram_keywords_new.json to Ngram-json.tar.gz which is
extracted while runing ngram.py and CosineSimNgram.py using a
function defined in scripts/utils.py

Signed-off-by: Gaurav Mishra <mishra.gaurav@siemens.com>
This was referenced Jul 27, 2018
@amanjain97 amanjain97 removed the WIP Work in progress label Jul 30, 2018
Feat/unified script
  reviewed and tested by : anupam.ghosh@siemens.com, mishra.gaurav@siemens.com
Copy link
Member

@GMishx GMishx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good. Can proceed to test

Copy link
Collaborator

@ag4ums ag4ums left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code looks good

@ag4ums ag4ums merged commit 7db915a into master Aug 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Need review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants