<h1>Get Ontology Data</h1>
CodeOntology is an ontology that models object-oriented programming languages and source code. More information: http://codeontology.org/

In [1]:
from rdfpandas.graph import to_dataframe
import pandas as pd
import rdflib
import preprocessing
import numpy as np
import tfidf
import scipy

<h2>Load the RDF file into a dataframe</h2>

Using the rdfpandas.graph library, we parse the ontology file, which contains the function names and their descriptions, into a dataframe.

In [2]:
g = rdflib.Graph()
g.parse('DB/Code_Ontology/comments.nt', format = 'nt')
df = to_dataframe(g)
df

Unnamed: 0,rdfs:comment{Literal},rdfs:comment{Literal}(xsd:string),rdfs:comment{Literal}@en
http://rdf.webofcode.org/woc/,,,An ontology that represents object-oriented pr...
http://rdf.webofcode.org/woc/Abstract,,,The abstract modifier
http://rdf.webofcode.org/woc/AccessModifier,,,An access modifier
http://rdf.webofcode.org/woc/ActualArgument,The actual argument of a method,,
http://rdf.webofcode.org/woc/AnnotationType,,,An annotation
...,...,...,...
http://rdf.webofcode.org/woc/org.jcp.xml.dsig.internal.dom.XMLDSigRI,Defines the XMLDSigRI provider.,,
http://rdf.webofcode.org/woc/overrides,,,The overrides property relates a method to the...
http://rdf.webofcode.org/woc/references,,,The references property relates a method or a ...
http://rdf.webofcode.org/woc/returns,,,The returns property relates a method to the v...


<h2>Manipulate data</h2>

As the dataframe shows, the data needs to be manipulated: for our purposes we want to keep only the function name, its description, and its return value.

In [3]:
# Turn the index, which holds the function name, into a regular column
df.reset_index(level=0, inplace=True)


In [4]:
# Take a 20% random sample of the dataset for testing
data = df.sample(frac = 0.2)
data

Unnamed: 0,index,rdfs:comment{Literal},rdfs:comment{Literal}(xsd:string),rdfs:comment{Literal}@en
27580,http://rdf.webofcode.org/woc/java.security.cer...,Creates an {@code X509CertSelector}. Initially...,,
69738,http://rdf.webofcode.org/woc/javax.swing.filec...,Returns true if the file (directory) can be vi...,<code>true</code> if the file/directory can be...,
54477,http://rdf.webofcode.org/woc/javax.naming.ldap...,The control's ASN.1 BER encoded value.\n\n @se...,,
73592,http://rdf.webofcode.org/woc/javax.swing.plaf....,Width of the area to paint to,,
66094,http://rdf.webofcode.org/woc/javax.swing.JTabl...,Invoked when the underlying model has complete...,,
...,...,...,...,...
7294,http://rdf.webofcode.org/woc/java.awt.event.In...,the source of the event,,
11494,http://rdf.webofcode.org/woc/java.awt.image.Lo...,"the specified <code>RenderingHints</code>, or ...",,
84986,http://rdf.webofcode.org/woc/jdk.internal.org....,"Performs a simple DFS of the instructions, ass...",,
54096,http://rdf.webofcode.org/woc/javax.naming.dire...,Retrieves the number of values in this attribu...,The nonnegative number of values in this attri...,
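The sample above changes on every run because no seed is given. A minimal sketch, on a made-up toy dataframe, of how passing `random_state` to `DataFrame.sample` makes the test sample reproducible; the seed value 42 is arbitrary.

```python
import pandas as pd

# Toy stand-in for the ontology dataframe (hypothetical function names)
toy = pd.DataFrame({'function': [f'pkg.Class-method{i}()' for i in range(10)]})

# The same seed yields the same 20% sample every time
sample_a = toy.sample(frac=0.2, random_state=42)
sample_b = toy.sample(frac=0.2, random_state=42)

print(sample_a.index.equals(sample_b.index))
```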


In [5]:
# Rename the columns
df.rename({'index': 'function', 'rdfs:comment{Literal}' : 'description', 'rdfs:comment{Literal}(xsd:string)' : 'return_value', 'rdfs:comment{Literal}@en' : 'description3' }, axis=1, inplace=True)

# Keep only the rows that really describe a function, i.e. whose name ends with ')'
df = df.loc[df['function'].str.endswith(')')]

df.function

85       http://rdf.webofcode.org/woc/com.oracle.net.Sd...
86       http://rdf.webofcode.org/woc/com.oracle.net.Sd...
87       http://rdf.webofcode.org/woc/com.oracle.net.Sd...
88       http://rdf.webofcode.org/woc/com.oracle.net.Sd...
89       http://rdf.webofcode.org/woc/com.oracle.net.Sd...
                               ...                        
87244    http://rdf.webofcode.org/woc/org.jcp.xml.dsig....
87247    http://rdf.webofcode.org/woc/org.jcp.xml.dsig....
87253    http://rdf.webofcode.org/woc/org.jcp.xml.dsig....
87254    http://rdf.webofcode.org/woc/org.jcp.xml.dsig....
87255    http://rdf.webofcode.org/woc/org.jcp.xml.dsig....
Name: function, Length: 38636, dtype: object
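The index-to-column, rename, and `endswith(')')` steps can be tried on a tiny example. This is a sketch on made-up data: the two `example.org` URIs and their comments are hypothetical, and only one of the original columns is reproduced.

```python
import pandas as pd

# Two hypothetical ontology entries: one method and one non-method concept
toy = pd.DataFrame(
    {'rdfs:comment{Literal}': ['Creates a socket', 'An access modifier']},
    index=[
        'http://example.org/woc/pkg.Sdp-openSocket()',
        'http://example.org/woc/AccessModifier',
    ],
)

# An unnamed index becomes a column called 'index' after reset_index()
toy = toy.reset_index()
toy = toy.rename(columns={'index': 'function', 'rdfs:comment{Literal}': 'description'})

# Method URIs end with a closing parenthesis; everything else is dropped
functions_only = toy[toy['function'].str.endswith(')')]
print(functions_only)
```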

In [6]:
# Drop the rows whose function name contains 'parameter', since those rows are not function definitions
df = df[~df.function.str.contains("parameter")]
df

Unnamed: 0,function,description,return_value,description3
85,http://rdf.webofcode.org/woc/com.oracle.net.Sd...,Creates a SDP enabled SocketImpl,,
86,http://rdf.webofcode.org/woc/com.oracle.net.Sd...,Creates an unbound SDP server socket. The {@co...,a new ServerSocket,
87,http://rdf.webofcode.org/woc/com.oracle.net.Sd...,Opens a socket channel to a SDP socket.\n\n <p...,a new ServerSocketChannel,
88,http://rdf.webofcode.org/woc/com.oracle.net.Sd...,Creates an unconnected and unbound SDP socket....,a new Socket,
89,http://rdf.webofcode.org/woc/com.oracle.net.Sd...,Opens a socket channel to a SDP socket.\n\n <p...,a new SocketChannel,
...,...,...,...,...
87244,http://rdf.webofcode.org/woc/org.jcp.xml.dsig....,Creates a <code>DOMXMLSignature</code> from XM...,,
87247,http://rdf.webofcode.org/woc/org.jcp.xml.dsig....,Initializes a new instance of this class.,,
87253,http://rdf.webofcode.org/woc/org.jcp.xml.dsig....,"Returns the ID from a same-document URI (ex: ""...",,
87254,http://rdf.webofcode.org/woc/org.jcp.xml.dsig....,"Returns true if uri is a same-document URI, fa...",,
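A short sketch of how the substring filter behaves, using hypothetical URIs (the real layout of parameter entries in the ontology is not shown in this notebook). `str.contains` is case-sensitive by default, so `setParameter` survives the filter, while any method whose name contains lower-case `parameter` would be removed as well.

```python
import pandas as pd

# Hypothetical URIs: a method, one of its parameters, and a setter method
uris = pd.Series([
    'http://example.org/woc/pkg.Foo-bar(int)',
    'http://example.org/woc/pkg.Foo-bar(int)/parameter/0',
    'http://example.org/woc/pkg.Foo-setParameter(int)',
])

# regex=False treats the pattern as a literal substring
mask = uris.str.contains('parameter', regex=False)
kept = uris[~mask]
print(kept.tolist())
```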


In [7]:
# Clean the function names and descriptions with the project's helper (it modifies df in place)
preprocessing.clean_ontology(df)
df

Unnamed: 0,function,description,return_value,description3
85,com.oracle.net.Sdp-createSocketImpl(),Creates a SDP enabled Socket tImpl,,
86,com.oracle.net.Sdp-openServerSocket(),,a new ServerSocket,
87,com.oracle.net.Sdp-openServerSocketChannel(),,a new ServerSocketChannel,
88,com.oracle.net.Sdp-openSocket(),,a new Socket,
89,com.oracle.net.Sdp-openSocketChannel(),,a new SocketChannel,
...,...,...,...,...
87244,org.jcp.xml.dsig.internal.dom.DOMXMLSignature-...,Creates a DOMXMLSignature from XML.,,
87247,org.jcp.xml.dsig.internal.dom.DOMXMLSignatureF...,Initializes a new instance of this class.,,
87253,org.jcp.xml.dsig.internal.dom.Utils-parseIdFro...,"Returns the ID from a same-document URI (ex: ""...",,
87254,org.jcp.xml.dsig.internal.dom.Utils-sameDocume...,"Returns true if uri is a same-document URI, fa...",,
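The source of `preprocessing.clean_ontology` is not shown here. The following is only a hypothetical sketch of two effects visible in the output above: the URI prefix disappears from the function name and HTML tags such as `<code>` disappear from the description. The name `clean_ontology_sketch` and the toy row are assumptions, and the sketch does not reproduce the real helper's behaviour of emptying some descriptions.

```python
import pandas as pd

PREFIX = 'http://rdf.webofcode.org/woc/'

def clean_ontology_sketch(frame):
    """Hypothetical stand-in for the cleaning step; returns a cleaned copy."""
    cleaned = frame.copy()
    # Strip the namespace so only the qualified method name remains
    cleaned['function'] = cleaned['function'].str.replace(PREFIX, '', regex=False)
    # Remove HTML tags but keep the text between them
    cleaned['description'] = (
        cleaned['description']
        .str.replace(r'<[^>]+>', '', regex=True)
        .str.strip()
    )
    return cleaned

toy = pd.DataFrame({
    'function': [PREFIX + 'pkg.Dom-create()'],
    'description': ['Creates a <code>DOMXMLSignature</code> from XML.'],
})
result = clean_ontology_sketch(toy)
print(result.iloc[0].tolist())
```

Unlike the notebook's helper, this sketch returns a new dataframe rather than changing its argument, which avoids modifying a filtered slice in place.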


In [8]:
# Find the rows whose description became an empty string during preprocessing (the index length gives the count)
df[df['description'] == ''].index


Int64Index([   86,    87,    88,    89,   211,   214,   473,   519,   674,
              726,
            ...
            87043, 87055, 87060, 87089, 87125, 87166, 87168, 87174, 87190,
            87199],
           dtype='int64', length=10260)

In [9]:
# Drop the rows whose description became an empty string during preprocessing
df = df[df['description'] != '']
df

Unnamed: 0,function,description,return_value,description3
85,com.oracle.net.Sdp-createSocketImpl(),Creates a SDP enabled Socket tImpl,,
189,java.applet.Applet$AccessibleApplet-getAccessi...,Get the role of this object.,an instance of AccessibleRole describing the r...,
190,java.applet.Applet$AccessibleApplet-getAccessi...,Get the state of this object.,an instance of AccessibleStateSet containing t...,
191,java.applet.Applet-Applet(),Constructs a new Applet. Note: Many methods...,,
192,java.applet.Applet-destroy(),Called by the browser or applet viewer to info...,,
...,...,...,...,...
87244,org.jcp.xml.dsig.internal.dom.DOMXMLSignature-...,Creates a DOMXMLSignature from XML.,,
87247,org.jcp.xml.dsig.internal.dom.DOMXMLSignatureF...,Initializes a new instance of this class.,,
87253,org.jcp.xml.dsig.internal.dom.Utils-parseIdFro...,"Returns the ID from a same-document URI (ex: ""...",,
87254,org.jcp.xml.dsig.internal.dom.Utils-sameDocume...,"Returns true if uri is a same-document URI, fa...",,
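Empty strings and missing values are handled in separate cells in this notebook. As an alternative sketch on made-up rows, empty strings can be converted to `NaN` first so that a single `dropna` covers both cases.

```python
import numpy as np
import pandas as pd

# Toy rows: a real description, an emptied one, and a missing one
toy = pd.DataFrame({
    'function': ['pkg.A-a()', 'pkg.B-b()', 'pkg.C-c()'],
    'description': ['Creates a socket.', '', np.nan],
})

# Count the empty strings (NaN compares unequal to '')
n_empty = (toy['description'] == '').sum()

# Treat empty strings as missing, then drop both kinds in one step
toy['description'] = toy['description'].replace('', np.nan)
toy = toy.dropna(subset=['description'])
print(n_empty, toy['function'].tolist())
```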


In [10]:
# Drop the rows with a missing description or function name
df = df[~df['description'].isna()]
df = df[~df['function'].isna()]
descriptions = df['description']
descriptions

85                      Creates a SDP enabled Socket tImpl
189                        Get the role of this object.   
190                       Get the state of this object.   
191      Constructs a new Applet.    Note: Many methods...
192      Called by the browser or applet viewer to info...
                               ...                        
87244               Creates a DOMXMLSignature from XML.   
87247            Initializes a new instance of this class.
87253    Returns the ID from a same-document URI (ex: "...
87254    Returns true if uri is a same-document URI, fa...
87255    Converts an Iterator to a Set of Nodes, accord...
Name: description, Length: 28365, dtype: object

In [11]:
# Clean the text of each description
descriptions_processed = descriptions.apply(preprocessing.clear_text)
descriptions_processed

85                       creates sdp enabled socket timpl 
189                                       get role object 
190                                      get state object 
191      constructs new applet note many methods java a...
192      called browser applet viewer inform applet rec...
                               ...                        
87244                         creates domxmlsignature xml 
87247                      initializes new instance class 
87253                       returns id document uri ex id 
87254       returns true uri document uri false otherwise 
87255    converts iterator set nodes according xpath da...
Name: description, Length: 28365, dtype: object
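The implementation of `preprocessing.clear_text` is not included in this notebook. Judging only from the output above, it lower-cases the text, removes punctuation, and drops common stop words. A hypothetical sketch of that kind of cleaning follows; the function name `clear_text_sketch` and the small stop-word set are assumptions, and the real helper may use a different list or extra steps.

```python
import re

# Tiny illustrative stop-word set; an assumption, not the project's actual list
STOPWORDS = {'a', 'an', 'the', 'of', 'this', 'if', 'is', 'to', 'from', 'same', 'in', 'and', 'or'}

def clear_text_sketch(text):
    """Lower-case, replace non-letters with spaces, and drop stop words."""
    text = text.lower()
    text = re.sub(r'[^a-z\s]', ' ', text)
    tokens = [token for token in text.split() if token not in STOPWORDS]
    return ' '.join(tokens)

print(clear_text_sketch('Get the role of this object.'))
print(clear_text_sketch('Returns true if uri is a same-document URI, false otherwise'))
```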

In [12]:
df['description_processed'] = descriptions_processed
df = df.drop('description3',axis=1)


In [13]:
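# Count the missing values per column
df.isna().sum()

# Drop every row with a missing value, e.g. functions without a documented return value
df = df.dropna()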
df.head()

Unnamed: 0,function,description,return_value,description_processed
189,java.applet.Applet$AccessibleApplet-getAccessi...,Get the role of this object.,an instance of AccessibleRole describing the r...,get role object
190,java.applet.Applet$AccessibleApplet-getAccessi...,Get the state of this object.,an instance of AccessibleStateSet containing t...,get state object
193,java.applet.Applet-getAccessibleContext(),Gets the Accessible eContext associated with t...,an AccessibleApplet that serves as the Accessi...,gets accessible econtext associated applet app...
194,java.applet.Applet-getAppletContext(),"Determines this applet's context, which allows...",the applet's context.,determines applet context allows applet query ...
195,java.applet.Applet-getAppletInfo(),Returns information about this applet. An appl...,a string containing information about the auth...,returns information applet applet override met...
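Calling `dropna()` without arguments removes a row if any column is missing, which is why functions without a return-value comment no longer appear above. A small sketch on made-up rows contrasting it with `dropna(subset=...)`, which would keep such functions:

```python
import numpy as np
import pandas as pd

# Toy rows: a method without a return-value comment and one with it
toy = pd.DataFrame({
    'function': ['pkg.A-a()', 'pkg.A-get()'],
    'description': ['Creates a socket.', 'Gets the role.'],
    'return_value': [np.nan, 'the role'],
})

missing_per_column = toy.isna().sum()

# Any missing value removes the row
all_complete = toy.dropna()

# Only a missing description removes the row
described = toy.dropna(subset=['description'])

print(missing_per_column['return_value'], len(all_complete), len(described))
```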


In [14]:
df.to_csv('DB/Preprocessed_ontology.csv', index=False)
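A quick sketch of what `index=False` does when the file is read back, using an in-memory buffer and a made-up row instead of the real CSV file: the saved table contains only the data columns, without the original row labels.

```python
import io
import pandas as pd

toy = pd.DataFrame(
    {'function': ['pkg.A-get()'], 'description_processed': ['get role object']},
    index=[189],
)

# Write without the index, then read the text back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.columns.tolist(), restored.index.tolist())
```

The original row label 189 is not stored, so the restored dataframe gets a fresh index starting at 0.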