<h1><center>Mini-Project  -  Foundations of Knowledge Graphs</center></h1> 

<h2><center>Classification of the remaining individuals from carcinogenesis</center></h2> 

<h3><center>Knowledge Group LRJ</center></h3> 

<center><b>Team members (IMT user name):</b></center> 

   <center>Jonas Thorben Becker (becks100)</center> 

   <center>Lukas Kneilmann (lukn)</center> 
   
   <center>Rupesh Sapkota (rupezzz) (now deregistered)</center> 

<br> 
<br> 
<br> 
<br> 

This Jupyter notebook was created by the 'Knowledge Group LRJ' for the mini-project of the module 'Foundations of Knowledge Graphs'. It reads 'kg-mini-project-grading', determines for each learning problem the carcinogenesis individuals that are missing from it, and classifies them with a machine learning model trained on the individuals that the learning problem already contains. For more details on the approach and its motivation, please see the Readme.md file provided in the submission. 

<br> 
<br> 
To run the notebook or individual cells, click the 'Run' icon in the toolbar above. 
<br> 
<br> 
<br>
Necessary imports: 

In [1]:
import gensim
import numpy as np
import rdflib
from imblearn.over_sampling import SVMSMOTE
from rdflib import Graph, Literal, Namespace, URIRef
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

<br>

 - Parses the graph in the file <font color='blue'>kg-mini-project-grading.ttl</font> and stores it as <font color='blue'>g</font>  
 - Loads the pre-trained OWL2Vec* embedding and stores it as <font color='blue'>owl2vec_model</font>

In [2]:
g = Graph()
g.parse('kg-mini-project-grading.ttl', format='n3')

# Load pre-trained OWL2Vec* model.
owl2vec_model = gensim.models.Word2Vec.load("output")
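
Indexing `owl2vec_model.wv` with an IRI that is not in the embedding vocabulary raises a `KeyError`, so it can help to check coverage before embedding anything. The sketch below is illustrative only: `split_by_coverage` is a hypothetical helper, the IRIs are made up, and a plain dict stands in for `owl2vec_model.wv` (gensim's `KeyedVectors` supports the same `in` test).

```python
def split_by_coverage(iris, vectors):
    """Split IRIs into those with and without an embedding vector."""
    covered = [iri for iri in iris if iri in vectors]
    missing = [iri for iri in iris if iri not in vectors]
    return covered, missing


# Toy stand-in for owl2vec_model.wv (hypothetical IRIs and vectors).
toy_vectors = {
    "http://example.org/a": [0.1, 0.2],
    "http://example.org/b": [0.3, 0.4],
}

covered, missing = split_by_coverage(
    ["http://example.org/a", "http://example.org/c"], toy_vectors
)
print("with embedding:   ", covered)
print("without embedding:", missing)
```

In the notebook, the same check could be run on `all_carcis` with `owl2vec_model.wv` in place of the toy dict.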

<br>

Opens <font color='blue'>'all_carcinogenesis.txt'</font> and stores the individuals as a list of strings in <font color='blue'>all_carcis</font> to obtain a complete list of all carcinogenesis individuals

In [3]:
# The context manager closes the file even if reading fails.
with open('all_carcinogenesis.txt', 'r') as text_file:
    all_carcis = text_file.read().split()

<br>

Function <font color='blue'>data</font>:  

- for the learning problem specified by parameter <font color='blue'>i</font>: 

    - returns the test dataset (<font color='blue'>X_np_test</font>) by determining the missing individuals of the learning problem and transforming them into their embedding 

    - returns a list of the missing instances (<font color='blue'>lp_grading_str</font> )

    - returns the training data set (<font color='blue'>y_np, X_np</font>) : 

        - creates <font color='blue'>y_np</font> by determining the class of the included individuals 

        - creates <font color='blue'>X_np</font> by embedding the included individuals using <font color='blue'>owl2vec_model</font> 

In [4]:
def data(i):
    lp_object_str = []
    lp_y = []

    # Tag each individual listed in learning problem i with its class:
    # 0 = excluded resource, 1 = included resource.
    lp_uri = 'https://lpbenchgen.org/resource/lp_' + str(i)
    print(lp_uri)
    for _, p, o in g.triples((rdflib.term.URIRef(lp_uri), None, None)):
        # Skip the triple that only declares the learning problem's type.
        if str(o) == 'https://lpbenchgen.org/class/LearningProblem':
            continue
        lp_object_str.append(str(o))
        if str(p) == 'https://lpbenchgen.org/property/excludesResource':
            lp_y.append(0)
        else:
            lp_y.append(1)

    # Missing individuals for the prediction dataset: every carcinogenesis
    # individual that the learning problem does not mention. A plain set
    # difference is used so that listed individuals can never end up here;
    # sorting makes the order reproducible.
    lp_grading_str = sorted(set(all_carcis) - set(lp_object_str))

    # Embed the listed individuals and the missing individuals.
    lp_emb = [owl2vec_model.wv[ind] for ind in lp_object_str]
    lp_grading_emb = [owl2vec_model.wv[ind] for ind in lp_grading_str]

    # Convert to NumPy arrays.
    X_np = np.array(lp_emb)
    y_np = np.array(lp_y)
    X_np_test = np.array(lp_grading_emb)
    return X_np, y_np, X_np_test, lp_grading_str
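
The missing individuals are those in the full list that the learning problem does not mention. The following minimal sketch, with hypothetical identifiers in place of real IRIs, illustrates why a plain set difference is the safer operation for this step: a symmetric difference gives the same result only as long as every listed individual also appears in the full list.

```python
# Hypothetical stand-ins: the full individual list and the individuals
# listed in one learning problem ("x9" is deliberately not in the full list).
all_individuals = ["d1", "d2", "d3", "d4"]
listed = ["d1", "d3", "x9"]

listed_set = set(listed)

# Difference: individuals from the full list that are not listed.
missing = [ind for ind in all_individuals if ind not in listed_set]

# Symmetric difference: additionally returns listed items that are
# absent from the full list, which are not "missing" individuals.
sym = set(all_individuals).symmetric_difference(listed_set)

print(missing)
print(sorted(sym))
```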

<br>

Iterates over the learning problems 26 to 50 in <font color='blue'>g</font> and, for each one, predicts the classes of its missing individuals, saving the individuals and the corresponding predictions in dictionaries. The process for each learning problem is as follows:  

- Calls <font color='blue'>data</font> to obtain the training dataset and the data to be predicted   

- standardizes the training dataset (applying the same scaling to the data to be predicted) and resamples the training dataset with the <font color='blue'>SVMSMOTE</font> sampler   

- fits a linear SVM on the resampled dataset, using a grid search combined with cross-validation (<font color='blue'>GridSearchCV</font>) to find the best regularization parameter for the SVM and learning problem   

- predicts the classes for the missing individuals in the test dataset <font color='blue'>X_test</font> using the SVM and saves the predictions in the dictionary <font color='blue'>predictions</font> and the missing individuals in the dictionary <font color='blue'>grading_strings</font> 

In [5]:
sampler = SVMSMOTE(random_state=42,n_jobs=-1) #best sampler
predictions = {}
grading_strings = {}

for i in range(26,51):
    X_train, y_train, X_test, grad_str = data(i)
    grading_strings[i] = grad_str
    scaler = StandardScaler()
    
    # standardize and resample
    scaler = scaler.fit(X_train)  
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)                                    
    X_train_re, y_train_re = sampler.fit_resample(X_train, y_train)

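    # Grid search to find the best regularization strength C
    parameters = {'C': [0.1, 0.5, 1]}
    m = LinearSVC(dual=False, penalty='l2', max_iter=2000)
    # scoring is passed as a list, the documented form for multiple metrics
    # (a set is not a documented input type); the model is refit on the best F1 score
    clf = GridSearchCV(m, parameters, verbose=0, n_jobs=-1, scoring=['precision', 'f1', 'accuracy', 'recall'], refit='f1')
    clf.fit(X_train_re, y_train_re)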
        
    #predict missing individuals
    prediction = clf.predict(X_test)
    predictions[i] = prediction     

https://lpbenchgen.org/resource/lp_26
https://lpbenchgen.org/resource/lp_27
https://lpbenchgen.org/resource/lp_28
https://lpbenchgen.org/resource/lp_29
https://lpbenchgen.org/resource/lp_30
https://lpbenchgen.org/resource/lp_31
https://lpbenchgen.org/resource/lp_32
https://lpbenchgen.org/resource/lp_33
https://lpbenchgen.org/resource/lp_34
https://lpbenchgen.org/resource/lp_35
https://lpbenchgen.org/resource/lp_36
https://lpbenchgen.org/resource/lp_37
https://lpbenchgen.org/resource/lp_38
https://lpbenchgen.org/resource/lp_39
https://lpbenchgen.org/resource/lp_40
https://lpbenchgen.org/resource/lp_41
https://lpbenchgen.org/resource/lp_42
https://lpbenchgen.org/resource/lp_43
https://lpbenchgen.org/resource/lp_44
https://lpbenchgen.org/resource/lp_45
https://lpbenchgen.org/resource/lp_46
https://lpbenchgen.org/resource/lp_47
https://lpbenchgen.org/resource/lp_48
https://lpbenchgen.org/resource/lp_49
https://lpbenchgen.org/resource/lp_50
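
The standardization step computes its statistics on the training data only and then applies them unchanged to the data to be predicted. As a rough illustration of that idea, the sketch below does the same with the standard library instead of `StandardScaler`; the helper names `fit_standardizer` and `standardize` and the toy numbers are assumptions made for this example.

```python
from statistics import mean, pstdev


def fit_standardizer(rows):
    """Return per-column means and population standard deviations."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Constant columns get a scale of 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def standardize(rows, means, stds):
    """Scale every row with previously fitted means and deviations."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


train = [[1.0, 10.0], [3.0, 10.0]]   # toy training embeddings
test = [[2.0, 12.0]]                 # toy embeddings to be predicted

means, stds = fit_standardizer(train)        # statistics from training data only
train_std = standardize(train, means, stds)
test_std = standardize(test, means, stds)    # reuses the training statistics

print(train_std)
print(test_std)
```

Fitting on the training data alone keeps information about the individuals to be predicted out of the preprocessing step.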


<br>

Creates the graph <font color='blue'>grading_graph</font> with the missing individuals and their class prediction for all learning problems

In [6]:
grading_graph = Graph()
carcinogenesis = Namespace('http://dl-learner.org/carcinogenesis#')
lpprop = Namespace('https://lpbenchgen.org/property/')
lpres = Namespace('https://lpbenchgen.org/resource/')

grading_graph.bind('carcinogenesis', carcinogenesis)
grading_graph.bind('lpprop', lpprop)
grading_graph.bind('lpres', lpres)

for i, key in enumerate(predictions.keys(),1):
    grading_graph.add((URIRef('https://lpbenchgen.org/resource/result_{}neg'.format(i)), lpprop.belongsToLp, Literal(False)))
    grading_graph.add((URIRef('https://lpbenchgen.org/resource/result_{}pos'.format(i)), lpprop.belongsToLp, Literal(True)))
    grading_graph.add((URIRef('https://lpbenchgen.org/resource/result_{}neg'.format(i)), lpprop.pertainsTo, URIRef('https://lpbenchgen.org/resource/lp_{}'.format(key))))
    grading_graph.add((URIRef('https://lpbenchgen.org/resource/result_{}pos'.format(i)), lpprop.pertainsTo, URIRef('https://lpbenchgen.org/resource/lp_{}'.format(key))))
       
    for j in range(len(predictions[key])):
        if predictions[key][j] == 0:
            grading_graph.add((URIRef('https://lpbenchgen.org/resource/result_{}neg'.format(i)), lpprop.resource, URIRef(grading_strings[key][j])))
        else:
            grading_graph.add((URIRef('https://lpbenchgen.org/resource/result_{}pos'.format(i)), lpprop.resource, URIRef(grading_strings[key][j])))
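
To make the shape of the result resources easier to see, the grouping logic can be sketched without rdflib. This is only an illustration under assumptions: `group_predictions` is a hypothetical helper, and the individuals are toy identifiers rather than carcinogenesis IRIs. Each learning problem receives one `result_<n>pos` and one `result_<n>neg` resource, and every predicted individual is attached to exactly one of them.

```python
def group_predictions(predictions, individuals):
    """Map each result IRI to the individuals predicted for it."""
    base = "https://lpbenchgen.org/resource/result_{}{}"
    grouped = {}
    # Results are numbered from 1 in the iteration order of the dictionary.
    for result_no, lp_id in enumerate(predictions, start=1):
        pos = base.format(result_no, "pos")
        neg = base.format(result_no, "neg")
        grouped[pos], grouped[neg] = [], []
        for label, individual in zip(predictions[lp_id], individuals[lp_id]):
            # Label 0 goes to the negative result, everything else to the positive one.
            grouped[neg if label == 0 else pos].append(individual)
    return grouped


# Toy data: one learning problem with three predicted individuals.
toy_predictions = {26: [1, 0, 1]}
toy_individuals = {26: ["d1", "d2", "d3"]}

grouped = group_predictions(toy_predictions, toy_individuals)
for result_iri, members in grouped.items():
    print(result_iri, members)
```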

<br>

Writes <font color='blue'>grading_graph</font> to  <font color='blue'>'grading.ttl'</font>

In [7]:
grading_graph.serialize('grading.ttl',format='turtle')