feat(semanticTextSim): Semantic Text sim algorithm using doc2vec #60

Draft: wants to merge 1 commit into base: master
1 change: 1 addition & 0 deletions MANIFEST.in
@@ -11,6 +11,7 @@ include requirements*.*
include pyproject.toml
include atarashi/data/licenses/processedLicenses.csv
include atarashi/data/Ngram_keywords.json
include atarashi/agents/semanticTextSim/spdxDoc2Vec.model

prune .git
prune venv
Empty file.
73 changes: 73 additions & 0 deletions atarashi/agents/semanticTextSim/semanticTextSim.py
@@ -0,0 +1,73 @@
#!/usr/bin/env python3
"""
Copyright 2019 Ayush Bhardwaj (classicayush@gmail.com)

SPDX-License-Identifier: GPL-2.0

This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
version 2 as published by the Free Software Foundation.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
"""
import os
import argparse
from gensim.models.doc2vec import Doc2Vec
from atarashi.libs.commentPreprocessor import CommentPreprocessor

__author__ = "Ayush Bhardwaj"
__email__ = "classicayush@gmail.com"

temp = os.path.dirname(os.path.abspath(__file__))
path = os.path.join(temp, 'spdxDoc2Vec.model')

def semanticTextSim(filePath):
  '''
  Load the trained Doc2Vec model and return the license most similar to the input file.
  The file is preprocessed to extract its comments, i.e. the license statements.
  The extracted text is converted to a vector, and the trained document with the
  highest cosine similarity to that vector is returned.

  :param filePath: Input file path to scan
  :return: result with license name, sim score, sim type and description
  :rtype: list (JSON Format)
  '''
  commentFile = CommentPreprocessor.extract(filePath)
  with open(commentFile) as file:
    doc = file.read()
  matches = []

  # Load the trained model
  model = Doc2Vec.load(path)

  # Infer a vector for the lower-cased, whitespace-tokenized document
  data = doc.lower().split()
  vector = model.infer_vector(data)

  # Find the most similar trained documents, best match first.
  # Note: in gensim >= 4.0 the `docvecs` attribute is named `dv`.
  similar_doc = model.docvecs.most_similar([vector])

  matches.append({
    'shortname': similar_doc[0][0],
    'sim_score': similar_doc[0][1],
    'sim_type': "semanticTextSim",
    'description': ""
  })

  return matches

if __name__ == '__main__':
  parser = argparse.ArgumentParser()
  parser.add_argument("inputFile", help="Specify the input file which needs to be scanned")

  args = parser.parse_args()
  filename = args.inputFile

  scanner = semanticTextSim(filename)
  # Print the result so that running the agent from the command line shows the match
  print(scanner)
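Review note: the docstring above says the agent returns the document with the highest cosine similarity to the input vector. The sketch below illustrates only that selection step. It swaps in plain term-count vectors for the Doc2Vec embeddings, so it is not the PR's algorithm; `bag_of_words`, `most_similar` and the two-entry `REFERENCES` dictionary are hypothetical names introduced for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Turn text into a sparse term-count vector (a stand-in for infer_vector)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as Counters."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(doc, references):
    """Return the (name, score) pair of the reference text closest to doc."""
    vector = bag_of_words(doc)
    scored = [(name, cosine_similarity(vector, bag_of_words(text)))
              for name, text in references.items()]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical reference set; the real agent compares against the
# document vectors stored in the trained model instead.
REFERENCES = {
    "0BSD": "permission to use copy modify and/or distribute this software "
            "for any purpose with or without fee is hereby granted",
    "ADSL": "this software code is made available as is without warranties "
            "of any kind",
}

name, score = most_similar(
    "Permission to use, copy and distribute this software is hereby granted",
    REFERENCES)
print(name, round(score, 3))
```

The returned pair has the same shape as one entry of `similar_doc` in the agent: a license tag followed by a similarity score, which is what ends up in `shortname` and `sim_score`.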
Binary file added atarashi/agents/semanticTextSim/spdxDoc2Vec.model
Binary file not shown.
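Review note: the script that produced `spdxDoc2Vec.model` is not part of this diff, so how the model was trained is unknown. Because the agent reports `similar_doc[0][0]` as the license `shortname`, one plausible reading is that each file under `text/` was tagged with its file name minus the extension. The sketch below shows only that corpus-preparation step under this assumption; `load_tagged_corpus` is a hypothetical helper, and the tokenization simply mirrors the `lower().split()` used in the agent.

```python
import tempfile
from pathlib import Path


def load_tagged_corpus(text_dir):
    """Return a list of (tag, tokens) pairs, one per *.txt file, sorted by path."""
    corpus = []
    for path in sorted(Path(text_dir).glob("*.txt")):
        tokens = path.read_text(encoding="utf-8").lower().split()
        # Assumption: the tag is the file name without ".txt",
        # e.g. "AFL-1.1.txt" -> "AFL-1.1"
        corpus.append((path.stem, tokens))
    return corpus


# Demo on a throwaway directory so the sketch runs without the repository.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "0BSD.txt").write_text(
        "Permission to use, copy, modify, and/or distribute this software",
        encoding="utf-8")
    Path(tmp, "AFL-1.1.txt").write_text(
        "Academic Free License Version 1.1", encoding="utf-8")
    demo_corpus = load_tagged_corpus(tmp)

print([tag for tag, _ in demo_corpus])
```

Each pair could then be wrapped as `TaggedDocument(words=tokens, tags=[tag])` from `gensim.models.doc2vec` before training; the training parameters themselves would have to come from the PR author.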
12 changes: 12 additions & 0 deletions atarashi/agents/semanticTextSim/text/0BSD.txt
@@ -0,0 +1,12 @@
Copyright (C) 2006 by Rob Landley <rob@landley.net>

Permission to use, copy, modify, and/or distribute this software for any purpose
with or without fee is hereby granted.

THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH
REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY
AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT,
INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM
LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE
OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR
PERFORMANCE OF THIS SOFTWARE.
7 changes: 7 additions & 0 deletions atarashi/agents/semanticTextSim/text/389-exception.txt
@@ -0,0 +1,7 @@
This Program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; version 2 of the License.

This Program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this Program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.

In addition, as a special exception, Red Hat, Inc. gives You the additional right to link the code of this Program with code not covered under the GNU General Public License ("Non-GPL Code") and to distribute linked combinations including the two, subject to the limitations in this paragraph. Non-GPL Code permitted under this exception must only link to the code of this Program through those well defined interfaces identified in the file named EXCEPTION found in the source code files (the "Approved Interfaces"). The files of Non-GPL Code may instantiate templates or use macros or inline functions from the Approved Interfaces without causing the resulting work to be covered by the GNU General Public License. Only Red Hat, Inc. may make changes or additions to the list of Approved Interfaces. You must obey the GNU General Public License in all respects for all of the Program code and other code used in conjunction with the Program except the Non-GPL Code covered by this exception. If you modify this file, you may extend this exception to your version of the file, but you are not obligated to do so. If you do not wish to provide this exception without modification, you must delete this exception statement from your version and license this file solely under the GPL without exception.
49 changes: 49 additions & 0 deletions atarashi/agents/semanticTextSim/text/AAL.txt
@@ -0,0 +1,49 @@
Attribution Assurance License Copyright (c) 2002 by AUTHOR PROFESSIONAL IDENTIFICATION
* URL "PROMOTIONAL SLOGAN FOR AUTHOR'S PROFESSIONAL PRACTICE"

All Rights Reserved ATTRIBUTION ASSURANCE LICENSE (adapted from the original
BSD license)

Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the conditions below are met. These conditions
require a modest attribution to <AUTHOR> (the "Author"), who hopes that its
promotional value may help justify the thousands of dollars in otherwise billable
time invested in writing this and other freely available, open-source software.

1. Redistributions of source code, in whole or part and with or without modification
(the "Code"), must prominently display this GPG-signed text in verifiable
form.

2. Redistributions of the Code in binary form must be accompanied by this
GPG-signed text in any documentation and, each time the resulting executable
program or a program dependent thereon is launched, a prominent display (e.g.,
splash screen or banner text) of the Author's attribution information, which
includes:

(a) Name ("AUTHOR"),

(b) Professional identification ("PROFESSIONAL IDENTIFICATION"), and

(c) URL ("URL").

3. Neither the name nor any trademark of the Author may be used to endorse
or promote products derived from this software without specific prior written
permission.

4. Users are entirely responsible, to the exclusion of the Author and any
other persons, for compliance with (1) regulations set by owners or administrators
of employed equipment, (2) licensing terms of any other software, and (3)
local regulations regarding use, including those regarding import, export,
and use of encryption software.

THIS FREE SOFTWARE IS PROVIDED BY THE AUTHOR "AS IS" AND ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
AUTHOR OR ANY CONTRIBUTOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
EFFECTS OF UNAUTHORIZED OR MALICIOUS NETWORK ACCESS; PROCUREMENT OF SUBSTITUTE
GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
DAMAGE.
7 changes: 7 additions & 0 deletions atarashi/agents/semanticTextSim/text/ADSL.txt
@@ -0,0 +1,7 @@
This software code is made available "AS IS" without warranties of any kind.
You may copy, display, modify and redistribute the software code either by
itself or as incorporated into your code; provided that you do not remove
any proprietary notices. Your use of this software code is at your own risk
and you waive any claim against Amazon Digital Services, Inc. or its affiliates
with respect to your use of this software code. (c) 2006 Amazon Digital Services,
Inc. or its affiliates.
79 changes: 79 additions & 0 deletions atarashi/agents/semanticTextSim/text/AFL-1.1.txt
@@ -0,0 +1,79 @@
Academic Free License

Version 1.1 The Academic Free License applies to any original work of authorship
(the "Original Work") whose owner (the "Licensor") has placed the following
notice immediately following the copyright notice for the Original Work:

"Licensed under the Academic Free License version 1.1."

Grant of License. Licensor hereby grants to any person obtaining a copy of
the Original Work ("You") a world-wide, royalty-free, non-exclusive, perpetual,
non-sublicenseable license

(1) to use, copy, modify, merge, publish, perform, distribute and/or sell
copies of the Original Work and derivative works thereof, and

(2) under patent claims owned or controlled by the Licensor that are embodied
in the Original Work as furnished by the Licensor, to make, use, sell and
offer for sale the Original Work and derivative works thereof, subject to
the following conditions.

Right of Attribution. Redistributions of the Original Work must reproduce
all copyright notices in the Original Work as furnished by the Licensor, both
in the Original Work itself and in any documentation and/or other materials
provided with the distribution of the Original Work in executable form.

Exclusions from License Grant. Neither the names of Licensor, nor the names
of any contributors to the Original Work, nor any of their trademarks or service
marks, may be used to endorse or promote products derived from this Original
Work without express prior written permission of the Licensor.

WARRANTY AND DISCLAIMERS. LICENSOR WARRANTS THAT THE COPYRIGHT IN AND TO THE
ORIGINAL WORK IS OWNED BY THE LICENSOR OR THAT THE ORIGINAL WORK IS DISTRIBUTED
BY LICENSOR UNDER A VALID CURRENT LICENSE FROM THE COPYRIGHT OWNER. EXCEPT
AS EXPRESSLY STATED IN THE IMMEDIATELY PRECEEDING SENTENCE, THE ORIGINAL WORK
IS PROVIDED UNDER THIS LICENSE ON AN "AS IS" BASIS, WITHOUT WARRANTY, EITHER
EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, THE WARRANTY OF NON-INFRINGEMENT
AND WARRANTIES THAT THE ORIGINAL WORK IS MERCHANTABLE OR FIT FOR A PARTICULAR
PURPOSE. THE ENTIRE RISK AS TO THE QUALITY OF THE ORIGINAL WORK IS WITH YOU.
THIS DISCLAIMER OF WARRANTY CONSTITUTES AN ESSENTIAL PART OF THIS LICENSE.
NO LICENSE TO ORIGINAL WORK IS GRANTED HEREUNDER EXCEPT UNDER THIS DISCLAIMER.

LIMITATION OF LIABILITY. UNDER NO CIRCUMSTANCES AND UNDER NO LEGAL THEORY,
WHETHER TORT (INCLUDING NEGLIGENCE), CONTRACT, OR OTHERWISE, SHALL THE LICENSOR
BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR
CONSEQUENTIAL DAMAGES OF ANY CHARACTER ARISING AS A RESULT OF THIS LICENSE
OR THE USE OF THE ORIGINAL WORK INCLUDING, WITHOUT LIMITATION, DAMAGES FOR
LOSS OF GOODWILL, WORK STOPPAGE, COMPUTER FAILURE OR MALFUNCTION, OR ANY AND
ALL OTHER COMMERCIAL DAMAGES OR LOSSES, EVEN IF SUCH PERSON SHALL HAVE BEEN
INFORMED OF THE POSSIBILITY OF SUCH DAMAGES. THIS LIMITATION OF LIABILITY
SHALL NOT APPLY TO LIABILITY FOR DEATH OR PERSONAL INJURY RESULTING FROM SUCH
PARTY'S NEGLIGENCE TO THE EXTENT APPLICABLE LAW PROHIBITS SUCH LIMITATION.
SOME JURISDICTIONS DO NOT ALLOW THE EXCLUSION OR LIMITATION OF INCIDENTAL
OR CONSEQUENTIAL DAMAGES, SO THIS EXCLUSION AND LIMITATION MAY NOT APPLY TO
YOU.

License to Source Code. The term "Source Code" means the preferred form of
the Original Work for making modifications to it and all available documentation
describing how to access and modify the Original Work. Licensor hereby agrees
to provide a machine-readable copy of the Source Code of the Original Work
along with each copy of the Original Work that Licensor distributes. Licensor
reserves the right to satisfy this obligation by placing a machine-readable
copy of the Source Code in an information repository reasonably calculated
to permit inexpensive and convenient access by You for as long as Licensor
continues to distribute the Original Work, and by publishing the address of
that information repository in a notice immediately following the copyright
notice that applies to the Original Work.

Mutual Termination for Patent Action. This License shall terminate automatically
and You may no longer exercise any of the rights granted to You by this License
if You file a lawsuit in any court alleging that any OSI Certified open source
software that is licensed under any license containing this "Mutual Termination
for Patent Action" clause infringes any patent claims that are essential to
use that software.

This license is Copyright (C) 2002 Lawrence E. Rosen. All rights reserved.

Permission is hereby granted to copy and distribute this license without modification.
This license may not be modified without the express written permission of
its copyright owner.
81 changes: 81 additions & 0 deletions atarashi/agents/semanticTextSim/text/AFL-1.2.txt
@@ -0,0 +1,81 @@
Academic Free License

Version 1.2 This Academic Free License applies to any original work of authorship
(the "Original Work") whose owner (the "Licensor") has placed the following
notice immediately following the copyright notice for the Original Work:

Licensed under the Academic Free License version 1.2

Grant of License. Licensor hereby grants to any person obtaining a copy of
the Original Work ("You") a world-wide, royalty-free, non-exclusive, perpetual,
non-sublicenseable license (1) to use, copy, modify, merge, publish, perform,
distribute and/or sell copies of the Original Work and derivative works thereof,
and (2) under patent claims owned or controlled by the Licensor that are embodied
in the Original Work as furnished by the Licensor, to make, use, sell and
offer for sale the Original Work and derivative works thereof, subject to
the following conditions.

Attribution Rights. You must retain, in the Source Code of any Derivative
Works that You create, all copyright, patent or trademark notices from the
Source Code of the Original Work, as well as any notices of licensing and
any descriptive text identified therein as an "Attribution Notice." You must
cause the Source Code for any Derivative Works that You create to carry a
prominent Attribution Notice reasonably calculated to inform recipients that
You have modified the Original Work.

Exclusions from License Grant. Neither the names of Licensor, nor the names
of any contributors to the Original Work, nor any of their trademarks or service
marks, may be used to endorse or promote products derived from this Original
Work without express prior written permission of the Licensor.

Warranty and Disclaimer of Warranty. Licensor warrants that the copyright
in and to the Original Work is owned by the Licensor or that the Original
Work is distributed by Licensor under a valid current license from the copyright
owner. Except as expressly stated in the immediately proceeding sentence,
the Original Work is provided under this License on an "AS IS" BASIS and WITHOUT
WARRANTY, either express or implied, including, without limitation, the warranties
of NON-INFRINGEMENT, MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
THE ENTIRE RISK AS TO THE QUALITY OF THE ORIGINAL WORK IS WITH YOU. This DISCLAIMER
OF WARRANTY constitutes an essential part of this License. No license to Original
Work is granted hereunder except under this disclaimer.

Limitation of Liability. Under no circumstances and under no legal theory,
whether in tort (including negligence), contract, or otherwise, shall the
Licensor be liable to any person for any direct, indirect, special, incidental,
or consequential damages of any character arising as a result of this License
or the use of the Original Work including, without limitation, damages for
loss of goodwill, work stoppage, computer failure or malfunction, or any and
all other commercial damages or losses. This limitation of liability shall
not apply to liability for death or personal injury resulting from Licensor's
negligence to the extent applicable law prohibits such limitation. Some jurisdictions
do not allow the exclusion or limitation of incidental or consequential damages,
so this exclusion and limitation may not apply to You.

License to Source Code. The term "Source Code" means the preferred form of
the Original Work for making modifications to it and all available documentation
describing how to modify the Original Work. Licensor hereby agrees to provide
a machine-readable copy of the Source Code of the Original Work along with
each copy of the Original Work that Licensor distributes. Licensor reserves
the right to satisfy this obligation by placing a machine-readable copy of
the Source Code in an information repository reasonably calculated to permit
inexpensive and convenient access by You for as long as Licensor continues
to distribute the Original Work, and by publishing the address of that information
repository in a notice immediately following the copyright notice that applies
to the Original Work.

Mutual Termination for Patent Action. This License shall terminate automatically
and You may no longer exercise any of the rights granted to You by this License
if You file a lawsuit in any court alleging that any OSI Certified open source
software that is licensed under any license containing this "Mutual Termination
for Patent Action" clause infringes any patent claims that are essential to
use that software.

Right to Use. You may use the Original Work in all ways not otherwise restricted
or conditioned by this License or by law, and Licensor promises not to interfere
with or be responsible for such uses by You.

This license is Copyright (C) 2002 Lawrence E. Rosen. All rights reserved.

Permission is hereby granted to copy and distribute this license without modification.
This license may not be modified without the express written permission of
its copyright owner.