## INK: explanatory notebook

This notebook gives a simple example of how the INK library can be used to extract the neighbourhood of selected subjects of interest.<br>
It also shows the mining capabilities for both task-specific and task-agnostic cases.


The example dataset can be found in the /datasets folder.<br>
We use the well-known animal dataset, which describes several animals and their properties.

In [1]:
from ink.base.connectors import RDFLibConnector
from ink.base.structure import InkExtractor
from ink.miner.rulemining import RuleSetMiner

To start, three components are imported:
* A connector, which is used to load the original dataset. Here we use the RDFLib connector, but other connectors are available.
* The INK extractor, which transforms the neighbourhood of selected nodes into a binary representation.
* The Rule Set Miner, which is our rule mining module.
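
To make the idea of a binary neighbourhood representation concrete, the sketch below builds one by hand with pandas for a tiny, made-up set of triples. It does not use the INK API: the `predicate§object` feature names only mimic the column names that appear later in this notebook, and all data and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy triples (subject, predicate, object); not taken from the dataset.
triples = pd.DataFrame(
    [
        ("User1", "Gender", "Male"),
        ("User1", "Profession", "Artist"),
        ("User2", "Gender", "Female"),
        ("User2", "Profession", "Artist"),
    ],
    columns=["subject", "predicate", "object"],
)

# One feature per (predicate, object) pair in the direct neighbourhood of a subject.
triples["feature"] = triples["predicate"] + "§" + triples["object"]

# Rows are subjects, columns are features, cells say whether the pair occurs.
binary = pd.crosstab(triples["subject"], triples["feature"]).ne(0)
print(binary)
```

Each row of `binary` can then be read as the set of facts that hold for one subject, which is the kind of input a rule miner works on.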

In [2]:
import pandas as pd
import numpy as np

customer = pd.read_csv('train.tsv', sep='\t')
cus_seg = pd.read_csv('train.csv')

In [3]:
customer.head(2)

Unnamed: 0,subject,predicate,object
0,User462809,Gender,GenderMale
1,User462809,Ever_Married,Ever_MarriedNo


In [4]:
cus_seg.head(2)

Unnamed: 0,ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1,Segmentation
0,462809,Male,No,22,No,Healthcare,1.0,Low,4.0,Cat_4,D
1,462643,Female,Yes,38,Yes,Engineer,,Average,3.0,Cat_4,A
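
The two files describe the same customers in two shapes: `train.tsv` holds subject, predicate, object triples, while `train.csv` holds one row per customer. How the triple file was actually produced is not shown here; the sketch below is one possible way to derive such triples from a wide table with `DataFrame.melt`, assuming the object is the predicate name concatenated with the value (as in `GenderMale`). The toy frame and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical excerpt of a wide table, one row per customer.
wide = pd.DataFrame(
    {
        "ID": [462809, 462643],
        "Gender": ["Male", "Female"],
        "Ever_Married": ["No", "Yes"],
        "Work_Experience": [1.0, None],
    }
)

# One row per (customer, attribute); missing values yield no triple.
melted = wide.melt(id_vars="ID", var_name="predicate", value_name="value")
melted = melted.dropna(subset=["value"])

triples = pd.DataFrame(
    {
        "subject": "User" + melted["ID"].astype(str),
        "predicate": melted["predicate"],
        # Assumption: the object is the predicate name followed by the value.
        "object": melted["predicate"] + melted["value"].astype(str),
    }
).reset_index(drop=True)
print(triples)
```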


In [5]:
# Prefix the numeric IDs with "User" so they match the subjects in the triple file.
cus_seg["ID"] = "User" + cus_seg["ID"].astype(str)
# Encode the segment letters as integers (A=1, B=2, C=3, D=4) in a single vectorised step.
cus_seg["Segmentation"] = cus_seg["Segmentation"].map({"A": 1, "B": 2, "C": 3, "D": 4})
cus_seg.head(3)

Unnamed: 0,ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1,Segmentation
0,User462809,Male,No,22,No,Healthcare,1.0,Low,4.0,Cat_4,4
1,User462643,Female,Yes,38,Yes,Engineer,,Average,3.0,Cat_4,1
2,User466315,Female,Yes,67,Yes,Engineer,1.0,Low,1.0,Cat_6,2


In [6]:
cus_sub = customer['subject'].unique()  # 6665 unique subjects, one per row of the feature matrix

In [7]:
cus_pre = customer['predicate'].unique()  # 9 unique predicates
#cus_pre

In [8]:
cus_obj = customer['object'].unique()  # 116 unique objects
#cus_obj

##### Four sets of customers

In [9]:
A = set()
B = set()
C = set()
D = set()

In [10]:
# Look up the segment of every subject and collect the subjects per segment.
seg = []
segment_sets = {1: A, 2: B, 3: C, 4: D}
for subject in cus_sub:
    # Segment of the first row in cus_seg whose ID matches this subject.
    label = cus_seg.loc[cus_seg['ID'] == subject, 'Segmentation'].iloc[0]
    seg.append(label)
    if label in segment_sets:
        segment_sets[label].add(subject)

In [11]:
y_train = np.array(seg)  # labels, in the same order as cus_sub
y_train

array([4, 2, 2, ..., 4, 2, 2])
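
The per-subject lookup above can also be written without an explicit loop. The sketch below is one way to do so on hypothetical toy data: it indexes the label column by ID and selects the labels in subject order. The names `cus_seg_toy`, `cus_sub_toy`, and `groups` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for cus_seg and cus_sub.
cus_seg_toy = pd.DataFrame(
    {"ID": ["User1", "User2", "User3"], "Segmentation": [4, 1, 2]}
)
cus_sub_toy = np.array(["User3", "User1", "User2"])

# Index the labels by ID, then pick them in the order of the subjects.
labels = cus_seg_toy.set_index("ID")["Segmentation"]
ordered = labels.loc[cus_sub_toy]
y_toy = ordered.to_numpy()

# One set of subjects per segment label.
groups = {
    int(seg): set(ordered.index[ordered == seg])
    for seg in sorted(ordered.unique())
}
print(y_toy, groups)
```

Selecting with `.loc` keeps the labels in the same order as the subjects, which matters once the labels are paired with a feature matrix.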

###### Rule miner (not sure whether this can be used)

In [12]:
miner = RuleSetMiner(chains=100, max_len_rule_set=3, forest_size=10)

###### Task-specific rule mining

In [13]:
df = pd.crosstab(customer['subject'], customer['object'])  # append .ne(0) to get True/False instead of 0/1
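
`pd.crosstab` counts how often each subject and object pair occurs, so a repeated triple gives a value above 1. The sketch below, on hypothetical toy data, shows the difference between these counts and the presence flags obtained with `.ne(0)`.

```python
import pandas as pd

# Hypothetical triples; the first one is deliberately duplicated.
toy = pd.DataFrame(
    {
        "subject": ["User1", "User1", "User2"],
        "object": ["Age22", "Age22", "Age38"],
    }
)

counts = pd.crosstab(toy["subject"], toy["object"])  # occurrence counts
present = counts.ne(0)                               # True where the pair occurs at all
print(counts)
print(present)
```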

In [14]:
df.head(2)

subject,Age18,Age19,Age20,Age21,Age22,Age23,Age25,Age26,Age27,Age28,...,Work_Experience13.0,Work_Experience14.0,Work_Experience2.0,Work_Experience3.0,Work_Experience4.0,Work_Experience5.0,Work_Experience6.0,Work_Experience7.0,Work_Experience8.0,Work_Experience9.0
User458982,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
User458983,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [15]:
df.filter(regex='^Age',axis=1).head(3)

subject,Age18,Age19,Age20,Age21,Age22,Age23,Age25,Age26,Age27,Age28,...,Age80,Age81,Age82,Age83,Age84,Age85,Age86,Age87,Age88,Age89
User458982,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
User458983,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
User458984,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
# Prefix every column with 'rdf#<predicate>§', where the predicate is the prefix the column name starts with.
predicates = ['Gender', 'Ever_Married', 'Age', 'everGraduated', 'Profession',
              'Work_Experience', 'Spending_Score', 'Family_Size', 'Var_1']

def add_predicate(col):
    for predicate in predicates:
        if str(col).startswith(predicate):
            return 'rdf#' + predicate + '§' + str(col)
    return col  # leave columns without a known prefix untouched

df.columns = [add_predicate(col) for col in df.columns]

In [24]:
df.head(2)

subject,rdf#Age§Age18,rdf#Age§Age19,rdf#Age§Age20,rdf#Age§Age21,rdf#Age§Age22,rdf#Age§Age23,rdf#Age§Age25,rdf#Age§Age26,rdf#Age§Age27,rdf#Age§Age28,...,rdf#Work_Experience§Work_Experience13.0,rdf#Work_Experience§Work_Experience14.0,rdf#Work_Experience§Work_Experience2.0,rdf#Work_Experience§Work_Experience3.0,rdf#Work_Experience§Work_Experience4.0,rdf#Work_Experience§Work_Experience5.0,rdf#Work_Experience§Work_Experience6.0,rdf#Work_Experience§Work_Experience7.0,rdf#Work_Experience§Work_Experience8.0,rdf#Work_Experience§Work_Experience9.0
User458982,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
User458983,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6665 entries, User458982 to User467974
Columns: 116 entries, rdf#Age§Age18 to rdf#Work_Experience§Work_Experience9.0
dtypes: int64(116)
memory usage: 5.9+ MB


In [19]:
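from scipy.sparse import csr_matrix  # convert df to CSR format
# pd.crosstab sorts its index, so reorder the rows to follow cus_sub (and therefore y_train).
df_csr = csr_matrix(df.loc[cus_sub].values)
#df_csr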

In [25]:
col = list(df.columns)
#col

In [21]:
X_train = (df_csr,list(cus_sub),col)
#X_train
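
`X_train` bundles a sparse matrix, the list of subjects, and the list of column names, so the three parts have to describe the same rows and columns in the same order. The sketch below, on hypothetical toy data, shows that `pd.crosstab` returns a sorted index while `unique()` keeps the order of first appearance, and how the rows can be realigned before the tuple is built. The tuple layout is taken from the cell above; everything else is illustrative.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical triples in which the first subject is not the alphabetically first one.
triples = pd.DataFrame(
    {
        "subject": ["User9", "User9", "User1"],
        "object": ["GenderMale", "Age22", "GenderFemale"],
    }
)

subjects = triples["subject"].unique()                       # order of first appearance
table = pd.crosstab(triples["subject"], triples["object"])   # index is sorted

# Reorder the rows so that row i of the matrix belongs to subjects[i].
aligned = table.loc[subjects]
matrix = csr_matrix(aligned.values)
X = (matrix, list(aligned.index), list(aligned.columns))

# Sanity check: matrix shape matches the number of subjects and columns.
assert matrix.shape == (len(X[1]), len(X[2]))
print(X[1], X[2])
```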

In [22]:
#X_train: data
#y_train: label
acc, rules = miner.fit(X_train, y_train)

ValueError: math domain error
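
Python's `math` module raises a `ValueError` when a function receives an argument outside its domain, for example the logarithm of 0 (the exact message can differ between Python versions). Whether this is what happens inside `miner.fit` has not been verified here; the full traceback would show the failing line. The sketch below only reproduces that class of error in isolation, and `safe_log` is a hypothetical helper, not part of INK.

```python
import math

def safe_log(value, floor=1e-12):
    """Hypothetical helper: clamp the argument so math.log never receives 0 or a negative number."""
    return math.log(max(value, floor))

try:
    math.log(0)  # 0 is outside the domain of the logarithm
    raised = False
except ValueError as err:
    raised = True
    print(type(err).__name__, err)

print(safe_log(0))    # finite, because the argument is clamped to the floor
print(safe_log(1.0))
```

A typical trigger for such a value is a probability or count of exactly 0, for instance when a rule or class covers no samples.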

In [25]:
col = []
#for i in range (0,len(customer)):  #fill with True/False
#    col.append('count.rdf#')

In [26]:
# Build every column name as 'rdf#<predicate>§<object>', in order of first appearance (0/1 features).
predicates = ['Gender', 'Ever_Married', 'Age', 'everGraduated', 'Profession',
              'Work_Experience', 'Spending_Score', 'Family_Size', 'Var_1']
for obj in customer['object']:
    text = None  # reset, so an unmatched object cannot reuse the previous name
    for predicate in predicates:
        if predicate in obj:
            text = 'rdf#' + predicate + '§' + obj
    if text is not None and text not in col:
        col.append(text)  # store each possible column once

In [38]:
col

['rdf#Gender§GenderMale',
 'rdf#Ever_Married§Ever_MarriedNo',
 'rdf#Age§Age22',
 'rdf#Profession§ProfessionHealthcare',
 'rdf#Work_Experience§Work_Experience1.0',
 'rdf#Spending_Score§Spending_ScoreLow',
 'rdf#Family_Size§Family_Size4.0',
 'rdf#Var_1§Var_1Cat_4',
 'rdf#Gender§GenderFemale',
 'rdf#Ever_Married§Ever_MarriedYes',
 'rdf#Age§Age67',
 'rdf#Profession§ProfessionEngineer',
 'rdf#Family_Size§Family_Size1.0',
 'rdf#Var_1§Var_1Cat_6',
 'rdf#Profession§ProfessionLawyer',
 'rdf#Work_Experience§Work_Experience0.0',
 'rdf#Spending_Score§Spending_ScoreHigh',
 'rdf#Family_Size§Family_Size2.0',
 'rdf#Age§Age56',
 'rdf#Profession§ProfessionArtist',
 'rdf#Spending_Score§Spending_ScoreAverage',
 'rdf#Age§Age32',
 'rdf#Family_Size§Family_Size3.0',
 'rdf#Age§Age33',
 'rdf#Age§Age61',
 'rdf#Var_1§Var_1Cat_7',
 'rdf#Age§Age55',
 'rdf#Age§Age26',
 'rdf#Age§Age19',
 'rdf#Work_Experience§Work_Experience4.0',
 'rdf#Age§Age58',
 'rdf#Profession§ProfessionDoctor',
 'rdf#Var_1§Var_1Cat_3',
 'rdf#Age§