Feat/ngram #23

amanjain97 · 2018-07-14T14:05:51Z

Implemented N-gram matching with bi-gram cosine similarity.

The n-gram JSON can be generated with ngram.py, which also supports multiprocessing:
python ngram.py <processed licenseList> -t <threads>

license_clustering.py groups similar licenses into clusters so that unique n-grams can be created for each cluster.

CosineSimNgram scans a file in the following order:

1. SPDX identifier
2. Exact match for license header
3. Header similarity with percentage n-gram matching
4. Exact match for license text
5. Filtered list with n-gram matching
6. Bi-gram cosine similarity, Dice similarity, or uni-gram cosine similarity against the filtered list

python CosineSimNgram.py <input file> <processed licenseList> <algorithm type>
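To make the last step of the scan order concrete, here is a minimal sketch of bi-gram cosine similarity and Dice similarity over word n-grams. It is not the code from this PR: the function names, the whitespace tokenization, and the choice to compute Dice over sets of distinct n-grams are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(text, n=2):
    """Count word n-grams in `text` (lowercased, split on whitespace)."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def dice_similarity(a, b):
    """Sorensen-Dice coefficient over the sets of distinct n-grams."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


# Hypothetical inputs: a scanned snippet and one candidate license text.
scanned = ngrams("permission is hereby granted free of charge")
candidate = ngrams("permission is hereby granted without charge")
bigram_cosine = cosine_similarity(scanned, candidate)
bigram_dice = dice_similarity(scanned, candidate)
```

Passing `n=1` to `ngrams` gives the uni-gram variant of the same cosine measure.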

Pariksha (Test suite)
python pariksha.py <processed licenseList> <Algo type>

Imtihaan (Test suite for SPDX files and Bigram Cosine similarity only)
python imtihaan.py <processed licenseList> <path to test folder>

Run any of these scripts with the --help flag to learn more about its arguments.

TODO:

Add more comments to make code more readable.
Modularize the code so that the same flow can be applied to the Levenshtein distance and TF-IDF algorithms.

GMishx

Please make the changes.

GMishx · 2018-07-17T05:27:13Z

scripts/CosineSimNgram.py

+  parser = argparse.ArgumentParser()
+  parser.add_argument("inputFile", help="Specify the input file which needs to be scanned")
+  parser.add_argument("licenseList", help="Specify the license list file which contains licenses")
+  parser.add_argument("Similarity", choices=["CosineSim", "DiceSim", "BigramCosineSim"],


Consider making similarity an optional flag, like verbose, instead of a positional argument.
Adding a default value would also be helpful.

GMishx · 2018-07-17T05:28:28Z

scripts/CosineSimNgram.py

+if __name__ == "__main__":
+  parser = argparse.ArgumentParser()
+  parser.add_argument("inputFile", help="Specify the input file which needs to be scanned")
+  parser.add_argument("licenseList", help="Specify the license list file which contains licenses")


Please use processedLicenseList instead of plain licenseList, which can be confusing.
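The two review comments on scripts/CosineSimNgram.py could be applied roughly as in the sketch below. The flag spellings `-s/--similarity` and `-v/--verbose`, the choice of `BigramCosineSim` as the default, and the reworded help strings are assumptions; the review only asks for an optional flag with some default and for the rename to processedLicenseList.

```python
import argparse


def build_parser():
    """Build the CLI parser with similarity as an optional flag."""
    parser = argparse.ArgumentParser()
    parser.add_argument("inputFile",
                        help="Specify the input file which needs to be scanned")
    # Renamed from licenseList, as requested in the review.
    parser.add_argument("processedLicenseList",
                        help="Specify the processed license list file")
    # Optional flag instead of a positional argument; the default is assumed.
    parser.add_argument("-s", "--similarity",
                        choices=["CosineSim", "DiceSim", "BigramCosineSim"],
                        default="BigramCosineSim",
                        help="Similarity algorithm to use (default: %(default)s)")
    # Assumed shape of the existing verbose flag mentioned in the review.
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Increase output verbosity")
    return parser


# Omitting the flag falls back to the default algorithm.
args = build_parser().parse_args(["input.c", "processedLicenses.csv"])
```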

GMishx · 2018-07-17T05:33:12Z

scripts/initial_match.py

+    for x in identifiers.split(" "):
+      if x in shortnames:
+        spdx_identifiers.append({
+          'shortname': x,


Same namedtuple here.

@GMishx Considering PR #26, do we need to add namedtuple here?
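For reference, the namedtuple suggestion applied to this hunk might look like the sketch below. Only `shortname` and `sim_type` are visible in the diff; the `sim_score` and `description` fields, the `LicenseMatch` name, the `find_spdx_identifiers` wrapper, and the `"SPDXIdentifier"` value are assumptions.

```python
from collections import namedtuple

# Only 'shortname' and 'sim_type' appear in the diff; the other fields are assumed.
LicenseMatch = namedtuple(
    "LicenseMatch", ["shortname", "sim_type", "sim_score", "description"]
)


def find_spdx_identifiers(identifiers, shortnames):
    """Return one LicenseMatch per known SPDX identifier found in `identifiers`."""
    spdx_identifiers = []
    for x in identifiers.split(" "):
        if x in shortnames:
            spdx_identifiers.append(LicenseMatch(
                shortname=x,
                sim_type="SPDXIdentifier",  # hypothetical label
                sim_score=1.0,
                description="",
            ))
    return spdx_identifiers


found = find_spdx_identifiers("MIT unknown GPL-2.0", {"MIT", "GPL-2.0"})
```

A namedtuple keeps attribute access (`match.shortname`) while still converting to a dict via `match._asdict()` when JSON output is needed.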

GMishx · 2018-07-17T05:35:00Z

scripts/initial_match.py

+    if full_text in processedData:
+      exact_match_fulltext.append({
+        'shortname': license[0],
+        'sim_type': 'ExactFullText',


Same namedtuple here.

GMishx · 2018-07-17T05:35:14Z

scripts/initial_match.py

+      if ngram_sim >= 0.7:
+        header_sim_match.append({
+          'shortname': license[0],
+          'sim_type': 'HeaderNgramSimilarity',


Same namedtuple here.

GMishx · 2018-07-17T05:54:27Z

scripts/license_clustering.py

+
+if __name__ == "__main__":
+  parser = argparse.ArgumentParser()
+  parser.add_argument("licenseList", help="Specify the license list file which contains licenses")


Please rewrite licenseList as processedLicenseList as above.

GMishx · 2018-07-17T05:55:52Z

scripts/ngram.py

+
+if __name__ == '__main__':
+  parser = argparse.ArgumentParser()
+  parser.add_argument("licenseList", help="Specify the license list file which contains licenses")


Same processedLicenseList here as well please.

GMishx · 2018-07-17T05:56:42Z

scripts/pariksha.py

@@ -72,5 +72,10 @@
        if temp in text[4]:
          matched += 1
        tqdm.write("{0} {1} {2}".format(temp, text[1], text[4]))
+      elif agent_name == "Ngram":
+        temp = str(NgramSim(pathto + filePath, processedLicense, "BigramCosineSim"))


Please consider using other similarities here as well.

GMishx · 2018-07-17T05:58:08Z

scripts/tfidf.py

+    sim_score = sum(value)
+    score_arr.append({
+      'shortname': licenses[result][0],
+      'sim_type': "Sum of TF-IDF score",


Same namedtuple here as well.

GMishx · 2018-07-17T05:58:25Z

scripts/tfidf.py

+    if sim_score >= 0.8:
+      matches.append({
+        'shortname': licenses[counter][0],
+        'sim_type': "TF-IDF Cosine Sim",


Same namedtuple here as well.

I have one more idea: we can use class objects as the scan result.
For example, a class called ScanResult could hold all 4 attributes as constants, assigned by the class constructor. Whenever you get a result, you create a ScanResult object and pass the 4 values as arguments to the constructor.
This object can then be serialized to JSON.
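A minimal sketch of the ScanResult idea follows. The review does not name the 4 attributes; `shortname` and `sim_type` come from the diffs, while `sim_score` and `description` are assumptions, as is the `to_json` helper. This sketch also does not enforce that the attributes stay constant after construction.

```python
import json


class ScanResult:
    """One match reported by a scan; attribute names are partly assumed."""

    def __init__(self, shortname, sim_type, sim_score, description):
        self.shortname = shortname
        self.sim_type = sim_type
        self.sim_score = sim_score
        self.description = description


def to_json(results):
    """Serialize a list of ScanResult objects to a JSON string."""
    return json.dumps([vars(result) for result in results], indent=2)


# Hypothetical result, shaped like the TF-IDF hunk above.
results = [ScanResult("MIT", "TF-IDF Cosine Sim", 0.92, "")]
serialized = to_json(results)
```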

amanjain97 · 2018-07-19T18:01:37Z

Please review PR #28 before finally merging this.

amanjain97 · 2018-07-22T05:07:35Z

We are getting only alphabetic characters as output.
@GMishx, can you please fix the error?

GMishx · 2018-07-22T09:55:47Z

This is working in #26. Can you please check?

Updated the code to accommodate Pandas changes.
Use iloc instead of loc when traversing a DataFrame by index.
Compressed Ngram_keywords_new.json to Ngram-json.tar.gz, which is extracted while running ngram.py and CosineSimNgram.py using a function defined in scripts/utils.py.

Signed-off-by: Gaurav Mishra <mishra.gaurav@siemens.com>
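The extraction helper mentioned in this commit message is not shown in the PR, so the following is only a sketch of what such a function could look like using the standard tarfile module. The name `unpack_json_tar` and its signature are hypothetical.

```python
import os
import tarfile


def unpack_json_tar(archive_path, target_dir):
    """Extract a .tar.gz archive into `target_dir` and return the extracted paths.

    Hypothetical helper; the real function in scripts/utils.py may differ.
    """
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        # The archive ships with the repository, so it is treated as trusted here.
        archive.extractall(path=target_dir)
    return [os.path.join(target_dir, name) for name in names]
```

A caller such as ngram.py could then run `unpack_json_tar("Ngram-json.tar.gz", some_directory)` before loading the JSON, where the directory is whatever location the script expects.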

GMishx

Code looks good. Can proceed to test

ag4ums

code looks good

amanjain97 requested review from GMishx and ag4ums July 14, 2018 14:13

amanjain97 added the enhancement (New feature or request) and Need review labels Jul 14, 2018

amanjain97 force-pushed the feat/ngram branch 2 times, most recently from 6a45aad to 44225bb July 14, 2018 20:56

GMishx added the WIP (Work in progress) label Jul 15, 2018

GMishx requested changes Jul 17, 2018

amanjain97 mentioned this pull request Jul 18, 2018

Remove nltk dependency from project #24

Closed

GMishx force-pushed the feat/ngram branch from 4770ad5 to 536e2a8 July 19, 2018 05:18

GMishx added the has merge conflicts (The PR has merge conflicts which need to be resolved) label Jul 20, 2018

amanjain97 added 18 commits July 21, 2018 12:13

feat(ngram): Generate the JSON file that has unique ngram for every license

0a59f0f

feat(CosineSim): Added Dice similarity

72f1799

perf(ngram): Added Multithreading to reduce time

8814e89

refactor(ngram): Store the Data frame of Ngram generated in CSV

68bdc18

feat(CosineSimNgram): Added Sorenson Dice Similarity

2d4e9d3

feat(CosineSimNgram): Added Cosine Similarity with bigrams

58ef98c

refactor(CosineSimNgram): added arguement to specify which sim algo to execute

1a4f8db

feat(pariksha): Added ngram sim to pariksha

aee95b0

feat(CosineSimNgram): Updated algorithm to check the full names and headers

8afd9ef

feat(CosineSim): Implemented similarity score to approximately match headers

a4ad954

feat(CosineSimNgram): Feature to check for SPDX identifiers also

2a01bf2

refactor(CosineSimNgram): Get results from exact match and sim match separately

cdf3f15

feat(license_clustering): Implemented Clustering algorithm to cluster licenses to make unique ngrams for each cluster

6acf43c

refactor(Ngram): Change the output type to JSON and return more information

8726fbb

docs(NgramKeywords): Added json file containing ngram keywords

430768d

Feat(License_clustering): Implemented license clustering and minor fix to spdx identifier

d6cd8fa

feat(imtihaan): implemented test suite to run for SPDX test cases

7097f8a

test(SPDXFiles): Added SPDX test suite

13744ae

amanjain97 added 4 commits July 21, 2018 12:13

style(imtihaan): Beautify the output log

46f029d

perf(ngram): Changed the value of n to create unique ngrams and implemented feature in imtihaan to check with any given specific file

f293347

fix(remove_nltk): Removed all implementations of nltk to resolve license issue

a27c2b2

fix/refactor(atarashi): fixed according to reviews and refactored

8399606

amanjain97 force-pushed the feat/ngram branch 2 times, most recently from 6bcd147 to 8399606 July 21, 2018 07:05

GMishx removed the has merge conflicts (The PR has merge conflicts which need to be resolved) label Jul 21, 2018

GMishx mentioned this pull request Jul 23, 2018

feat(Atarashi): Added Copyright statements to Atarshi files #28

Merged

This was referenced Jul 27, 2018

Feat/unified script #30

Merged

Spelling Mistake #31

Closed

amanjain97 and others added 8 commits July 28, 2018 03:05

feat(Atarashi): Added Copyright statements to Atarshi files

ad568f5

fix(LicensePreprocessor): fix copyright in LicensePreprocessor.py

36441e7

feat(atarashi): Add email to copyright statements

027e93b

feat(licenseDownloader): Download exceptional licenses from SPDX website

145575e

chore(license): Updated the ngram json and licenses

bd1281e

Signed-off-by: Gaurav Mishra <mishra.gaurav@siemens.com>

refactor(atarashi): Organize File structure, refactoring, rebuild ngram-json fix(utils): change file name

0376ebc

style(ngram_keywords_zip): removed ngram json and replaced with ngram json zip

2ee90ec

fix(spelling): Fixed spelling mistake of depricate

a9c372d

amanjain97 force-pushed the feat/ngram branch from ad7ae05 to a9c372d July 27, 2018 21:39

amanjain97 added 2 commits July 28, 2018 03:13

feat(atarashi): Add email to copyright statements

2610c1f

feat(atarashii): Implemented a unified script

277a351

Implemented a unified script to run any algorithm from atarashi

amanjain97 removed the WIP (Work in progress) label Jul 30, 2018

Merge pull request #30 from siemens/feat/unified-script

9aca4b1

Feat/unified script reviewed and tested by : anupam.ghosh@siemens.com, mishra.gaurav@siemens.com

GMishx approved these changes Aug 1, 2018

ag4ums approved these changes Aug 1, 2018

ag4ums merged commit 7db915a into master Aug 1, 2018

Aman-Codes mentioned this pull request Dec 28, 2020

Parallelize the evaluator algorithm #64

Closed
