## INK: explanatory notebook

This notebook gives a simple example of how the INK library can be used to extract the neighbourhood of subjects of interest.<br>
It also shows the mining capabilities for both task-specific and task-agnostic cases.


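The example dataset can be found in the /datasets folder.<br>
We have used the commonly known animal dataset, which describes several animals and their properties.<br>
Note that the cells in this copy of the notebook load customer data (`train_full.tsv` and `Train.csv`) instead.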

In [13]:
from ink.base.connectors import RDFLibConnector
from ink.base.structure import InkExtractor
from ink.miner.rulemining import RuleSetMiner

To start, three components are imported.
* A connector, which is used to load the original dataset. Here we have used an RDFLib connector, but other connectors are available.
* The INK extractor, which transforms the neighbourhood of certain nodes into a binary representation.
* The Rule Set Miner, which is our rule mining module.
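
To make "binary representation" concrete, the following is a minimal sketch that uses only pandas, not the INK extractor itself. The toy triples and the `feature` column are hypothetical; the `§` separator simply mirrors the column naming used further down in this notebook.

```python
import pandas as pd

# Toy triples (hypothetical data) in a subject/predicate/object layout.
triples = pd.DataFrame({
    'subject':   ['a', 'a', 'b', 'b'],
    'predicate': ['Gender', 'Geography', 'Gender', 'Geography'],
    'object':    ['Female', 'France', 'Male', 'France'],
})

# One feature per predicate/object pair.
triples['feature'] = triples['predicate'] + '§' + triples['object']

# Rows are subjects, columns are features; True marks that the triple exists.
binary = pd.crosstab(triples['subject'], triples['feature']).ne(0)
print(binary)
```

Each row then describes the direct neighbourhood of one subject as a set of present or absent features.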

In [14]:
import pandas as pd
import numpy as np

customer = pd.read_csv('train_full.tsv', sep='\t')
cus_seg = pd.read_csv('Train.csv')

In [15]:
customer.head(8)

,subject,predicate,object
0,Customer15634602,Gender,GenderFemale
1,Customer15634602,Age,Age42
2,Customer15634602,Geography,GeographyFrance
3,Customer15634602,Tenure,Tenure2
4,Customer15634602,NumOfProducts,NumOfProducts1
5,Customer15634602,HasCrCard,HasCrCardYes
6,Customer15634602,IsActiveMember,IsActiveMemberYes
7,Customer15634602,CreditScore,CreditScore619


In [16]:
cus_seg.head(5)

,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [17]:
cus_sub = customer['subject'].unique()  # 10000 unique subjects (one per customer)
cus_sub

array(['Customer15634602', 'Customer15647311', 'Customer15619304', ...,
       'Customer15584532', 'Customer15682355', 'Customer15628319'],
      dtype=object)

In [18]:
cus_pre = customer['predicate'].unique()  # 10 unique predicates
cus_pre

array(['Gender', 'Age', 'Geography', 'Tenure', 'NumOfProducts',
       'HasCrCard', 'IsActiveMember', 'CreditScore', 'Balance',
       'EstimatedSalary'], dtype=object)

In [19]:
cus_obj = customer['object'].unique()  # unique objects; the crosstab later in this notebook reports 16935 columns, not 94
cus_obj

array(['GenderFemale', 'Age42', 'GeographyFrance', ...,
       'EstimatedSalary92888.52', 'Balance130142.79',
       'EstimatedSalary38190.78'], dtype=object)

###### rule miner (not sure whether this can be used)

In [59]:
miner = RuleSetMiner()

###### task specific rule mining

In [21]:
df = pd.crosstab(customer['subject'], customer['object'])  # append .ne(0) to turn the 0/1 counts into True/False

In [22]:
df.head(2)

subject,Age18,Age19,Age20,Age21,Age22,Age23,Age24,Age25,Age26,Age27,...,Tenure1,Tenure10,Tenure2,Tenure3,Tenure4,Tenure5,Tenure6,Tenure7,Tenure8,Tenure9
Customer15565701,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
Customer15565706,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0


In [23]:
#df.filter(regex='^Age',axis=1).head()

In [24]:
# Prefix every column with its predicate, e.g. 'Age42' -> 'rdf#Age§Age42'.
predicates = ['Gender', 'Age', 'Geography', 'Tenure', 'NumOfProducts',
              'HasCrCard', 'IsActiveMember', 'CreditScore', 'Balance',
              'EstimatedSalary']

def add_predicate_prefix(col):
    # Return the column name prefixed with the predicate it starts with.
    for pred in predicates:
        if str(col).startswith(pred):
            return 'rdf#' + pred + '§' + str(col)
    return col  # leave columns that match no predicate untouched

df.columns = [add_predicate_prefix(col) for col in df.columns]

In [25]:
df.head(10)

subject,rdf#Age§Age18,rdf#Age§Age19,rdf#Age§Age20,rdf#Age§Age21,rdf#Age§Age22,rdf#Age§Age23,rdf#Age§Age24,rdf#Age§Age25,rdf#Age§Age26,rdf#Age§Age27,...,rdf#Tenure§Tenure1,rdf#Tenure§Tenure10,rdf#Tenure§Tenure2,rdf#Tenure§Tenure3,rdf#Tenure§Tenure4,rdf#Tenure§Tenure5,rdf#Tenure§Tenure6,rdf#Tenure§Tenure7,rdf#Tenure§Tenure8,rdf#Tenure§Tenure9
Customer15565701,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
Customer15565706,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
Customer15565714,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
Customer15565779,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
Customer15565796,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
Customer15565806,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
Customer15565878,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
Customer15565879,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
Customer15565891,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
Customer15565996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10000 entries, Customer15565701 to Customer15815690
Columns: 16935 entries, rdf#Age§Age18 to rdf#Tenure§Tenure9
dtypes: int64(16935)
memory usage: 1.3+ GB


In [53]:
from scipy.sparse import csr_matrix  #convert df to csr
df_csr = csr_matrix(df.values)
df_csr

<10000x16935 sparse matrix of type '<class 'numpy.int64'>'
	with 100000 stored elements in Compressed Sparse Row format>
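
The dense crosstab above takes 1.3+ GB even though only 100000 entries are non-zero. As a hedged alternative, the sketch below builds the CSR matrix straight from the triples without a dense intermediate. The helper name `triples_to_csr` and the toy data are hypothetical and not part of INK.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

def triples_to_csr(triples):
    """Build a subject x object CSR count matrix without a dense intermediate."""
    # Categories are sorted, which gives the same row/column order as pd.crosstab.
    rows = pd.Categorical(triples['subject'])
    cols = pd.Categorical(triples['object'])
    ones = np.ones(len(triples), dtype=np.int64)
    shape = (len(rows.categories), len(cols.categories))
    # Duplicate (row, col) pairs are summed, matching crosstab counts.
    matrix = csr_matrix((ones, (rows.codes, cols.codes)), shape=shape)
    return matrix, list(rows.categories), list(cols.categories)

# Toy triples (hypothetical data).
toy = pd.DataFrame({
    'subject': ['Customer2', 'Customer1', 'Customer1'],
    'object':  ['Age42', 'Age42', 'GenderFemale'],
})
matrix, subjects, objects = triples_to_csr(toy)
print(matrix.toarray(), subjects, objects)
```

The returned `subjects` and `objects` lists give the row and column labels in matrix order, so they can be passed along with the matrix. A predicate prefix could be added to `objects` afterwards if the column names need it.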

In [54]:
col = list(df.columns)
#col

In [55]:
cus_seg_sort = cus_seg.sort_values(by=['CustomerId'])
#cus_seg_sort

In [56]:
label = cus_seg_sort['Exited']
y_train = np.array(label)  # labels (10000 entries), ordered by CustomerId
y_train

array([0, 1, 0, ..., 1, 0, 0], dtype=int64)
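
The labels above line up with the matrix rows only because sorting by `CustomerId` happens to produce the same order as the crosstab index. A minimal sketch that makes the alignment explicit is shown below. The toy values and variable names are hypothetical; only the `CustomerId` and `Exited` columns and the `Customer<id>` naming are taken from the data above.

```python
import pandas as pd

# Toy stand-ins for the label table and the matrix row order (hypothetical values).
labels_table = pd.DataFrame({'CustomerId': [30, 10, 20], 'Exited': [1, 0, 1]})
row_subjects = ['Customer10', 'Customer20', 'Customer30']

# Key the labels by the same 'Customer<id>' strings used as subjects.
keys = 'Customer' + labels_table['CustomerId'].astype(str)
labels = labels_table.set_index(keys)['Exited']

# Reorder to the matrix rows; a subject without a label would show up as NaN.
aligned = labels.reindex(row_subjects)
assert aligned.notna().all(), 'some subjects have no label'
y = aligned.to_numpy()
print(y)
```

In the notebook, `row_subjects` would be `list(df.index)` and `labels_table` would be `cus_seg`.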

In [57]:
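# Row labels must follow the matrix row order: pd.crosstab sorts the subjects,
# whereas cus_sub keeps the original file order, so use df.index here.
X_train = (df_csr, list(df.index), col)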
#X_train

In [65]:
#X_train: data
#y_train: label
acc, rules = miner.fit(X_train, y_train)

  self.rules = np.asarray(self.rules)[supp_select]


In [66]:
print(acc)
miner.print_rules(rules)

0.7464
['rdf#NumOfProducts§NumOfProducts1', 'rdf#IsActiveMember§IsActiveMemberNo']
['rdf#NumOfProducts§NumOfProducts3']
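
One way to read this output is as a rule set in which each printed list is a conjunction of features and a sample is predicted positive when any list matches. This reading is an assumption, not something the output confirms. Under it, the sketch below applies a rule set to a small binary matrix and computes accuracy. The function `predict_with_rules` and the toy data are hypothetical and not part of the INK API.

```python
import numpy as np

def predict_with_rules(X, feature_names, rules):
    """Predict 1 where any rule fires; a rule fires when all its features are 1."""
    position = {name: i for i, name in enumerate(feature_names)}
    fired = np.zeros(X.shape[0], dtype=bool)
    for rule in rules:
        cols = [position[name] for name in rule]
        fired |= X[:, cols].all(axis=1)
    return fired.astype(int)

# Toy binary features and labels (hypothetical data).
features = ['NumOfProducts1', 'NumOfProducts3', 'IsActiveMemberNo']
X = np.array([[1, 0, 1],
              [1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]])
y_true = np.array([1, 0, 1, 1])
rules = [['NumOfProducts1', 'IsActiveMemberNo'], ['NumOfProducts3']]

y_pred = predict_with_rules(X, features, rules)
accuracy = (y_pred == y_true).mean()
print(y_pred, accuracy)
```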


##### Generating the sparse matrix takes too long -> 10,000 records took 3 days to run

0.4069
['rdf#NumOfProducts§NumOfProducts2']

0.7715
['rdf#NumOfProducts§NumOfProducts3']
['rdf#NumOfProducts§NumOfProducts1', 'rdf#Geography§GeographyGermany']
['rdf#NumOfProducts§NumOfProducts1', 'rdf#Balance§Balance0.0']

0.7241
['rdf#Geography§GeographyGermany', 'rdf#NumOfProducts§NumOfProducts1']
['rdf#NumOfProducts§NumOfProducts3']
['rdf#IsActiveMember§IsActiveMemberNo', 'rdf#NumOfProducts§NumOfProducts1']

0.7464
['rdf#NumOfProducts§NumOfProducts1', 'rdf#IsActiveMember§IsActiveMemberNo']
['rdf#NumOfProducts§NumOfProducts3']