## FAIR-DB: TITANIC
1. Data Preparation and Exploration (saltata, solo Data Acquisition)
2. ACFDs Discovery and Filtering (CFDDiscovery algorithm)
3. ACFDs Selection
4. ACFDs Ranking
5. ACFDs User Selection and Scoring


## 1. Data acquisition

In [1]:
import pandas
import numpy as np
file_path = '~/Dropbox/Tesi/cdfAlgorithm/cfddiscovery/datasets/preprocessedTitanic.csv'

df = pandas.read_csv(file_path)
all_tuples = len(df)
cols = df.columns

#For Titanic dataset:
#Because there are numbers in the dataset, I tranform it into strings to perform comparison in completion phase
df = df.applymap(lambda x : str(x) if type(x) == int else x)

print("Total number of tuples in dataframe: " ,len(df))
df.head()

Total number of tuples in dataframe:  876


Unnamed: 0,Survived,Pclass,Sex,SibSp,Parch,Fare,Embarked
0,0,3,male,1,0,0-8,S
1,1,1,female,1,0,41-80,C
2,1,3,female,0,0,0-8,S
3,1,1,female,1,0,41-80,S
4,0,3,male,0,0,9-20,S


In [2]:
import json
# Make it work for Python 2+3 and with Unicode
import io
try:
    to_unicode = unicode
except NameError:
    to_unicode = str

data = df.to_json(orient="split")



In [3]:
# Write JSON file
with io.open('data.json', 'w', encoding='utf8') as outfile:
    str_ = json.dumps(data,
                      indent=4, sort_keys=True,
                      separators=(',', ': '), ensure_ascii=False)
    outfile.write(to_unicode(str_))

In [4]:
dfFirst = df[df.Pclass == '1']
dfFM = dfFirst[dfFirst.Sex == 'male']
print(len(dfFirst), len(dfFM))
dfFirst = df[df.Pclass == '2']
dfFM = dfFirst[dfFirst.Sex == 'male']
print(len(dfFirst), len(dfFM))
dfFirst = df[df.Pclass == '3']
dfFM = dfFirst[dfFirst.Sex == 'male']
print(len(dfFirst), len(dfFM))

211 117
178 102
487 343


In [2]:
#INPUTS
#array of protected attributes
protected_attr = ['Pclass', 'Sex']
#target class
target = 'Survived'

#input parameters
confidence =  0.80
supportCount = 80
support = supportCount/len(df)
maxSize = 2
grepValue = target+'='
minDiff = 0.07

## 2. ACFDs Discovery and Filtering

1. Establish a support, confidence and a maxSize rule (= number of attributes that at most appears in the lhs part of the rule) apply CFDDiscovery algorithm (remember to apply cmake before!). We use the grep command to establish a particular attribute or value, in this case the target must be present.


In [3]:
#Apply CFDDiscovery algorithm
output = !../cdfAlgorithm/cfddiscovery/CFDD {file_path} {supportCount} {confidence} {maxSize} | grep {grepValue}

#all rules obtianed
print("Total number of dependencies found: " ,len(output), "\n")

for i in range(0,2):
    print("Dependency n.", i, ": " ,output[i])

Total number of dependencies found:  46 

Dependency n. 0 :  (Survived=0) => Parch=0
Dependency n. 1 :  (SibSp, Survived=0) => Parch


Output contains all rules: approximated FDs and  approximated CFDs

#### Now we parse the rule, first deleting the AFDs - in this notebook we study only CFDs.
#### Second condition to select rules we can establish some conditions like:
- condition1 (both on LHS or RHS) : it could be a particular value of an attribute in which we are not interested 
- condition2: it could be a particular value of an attribute in which we need for our rules (not implemented yet)

In [4]:
#Transform the '<=' in '<' and viceversa to not have problem with the following '=' detection
o1 = list(map(lambda x: x.replace("<=", "<"), output))
#o1 = list(map(lambda x: x.replace(">=", ">"), output))
#Delete the parenthesis
o1 = list(map(lambda x: x.replace("(", ""), o1))
o1 = list(map(lambda x: x.replace(")", ""), o1))
#Split the entire rule in a lhs and rhs 
o2 = list(map(lambda x: x.split(' => '), o1))

print(o2, o2[0], o2[1])

[['Survived=0', 'Parch=0'], ['SibSp, Survived=0', 'Parch'], ['Survived=0, SibSp=0', 'Parch=0'], ['Survived=0, Parch=0', 'SibSp=0'], ['Survived=0, Fare=9-20', 'SibSp=0'], ['Survived=0, Fare=0-8', 'SibSp=0'], ['Fare, Survived=0', 'Parch'], ['Survived=0, Fare=9-20', 'Parch=0'], ['Survived=0, Fare=0-8', 'Parch=0'], ['Pclass, Survived=0', 'Parch'], ['Survived=0, Pclass=2', 'Parch=0'], ['Survived=0, Fare=0-8', 'Pclass=3'], ['Survived=1, Fare=21-40', 'Embarked=S'], ['Survived=0, Fare=9-20', 'Embarked=S'], ['Fare=0-8, Embarked=S', 'Survived=0'], ['Embarked, Survived=0', 'Parch'], ['Survived=0, Embarked=S', 'Parch=0'], ['Survived=1, Pclass=2', 'Embarked=S'], ['Survived=0, Pclass=2', 'Embarked=S'], ['Pclass=3, Embarked=S', 'Survived=0'], ['Survived=0', 'Sex=male'], ['Sex=male', 'Survived=0'], ['SibSp, Survived=0', 'Sex'], ['Survived=0, SibSp=0', 'Sex=male'], ['Sex=male, SibSp=0', 'Survived=0'], ['Fare, Survived=0', 'Sex'], ['Survived=0, Fare=9-20', 'Sex=male'], ['Survived=0, Fare=0-8', 'Sex=male

In [8]:
#Function to select only CFDs from all rules
#x is the single rule
def parseCFD(x):
    #Flag indicates if the rule is a CFD (True) or and FD (False)
    isCFD = True
    rawLHS = x[0].split(', ')
    #lhs rule
    for i, y in enumerate(rawLHS):
        for attr in cols:
            if (y in str(attr+'=!')):
                isCFD = False
        
       
    rawRHS = x[1].split(', ')
    for i, y in enumerate(rawRHS):
        for attr in cols:
            if (y in str(attr+'=!')):
                isCFD = False
      
        #To keep only CFDs
        if(isCFD == True):
            return [rawLHS, rawRHS]
        else:
            return None
        
#conditions is an array of conditions to delete some rules that are not interesting, for example:
#  ex: conditionslhs = ['age-range=15-30', 'native-country=NC-White']     

def parseCFDWithCond(x, conditionslhs, conditionsrhs):
    #Flag indicates if the rule is a CFD (True) or and FD (False)
    isCFD = True
    #Flag indicates if the rule contains unwanted condition(s) (rhs or lhs) - it doesn't contain the condition (true)
    takenRule = True
    rawLHS= x[0].split(', ')
    #lhs rule
    for i, y in enumerate(rawLHS):
        for attr in cols:
            if (y in str(attr+'=!')):
                isCFD = False
            for condlhs in conditionslhs:
                if (y == condlhs):
                    takenRule = False
        
       
    rawRHS = x[1].split(', ')
    for i, y in enumerate(rawRHS):
        for attr in cols:
            if (y in str(attr+'=!')):
                isCFD = False
            for condrhs in conditionsrhs:
                if (y == condrhs):
                    takenRule = False
      
        #To keep only CFDs
        if(isCFD == True and takenRule == True):
            return [rawLHS, rawRHS]
        else:
            return None
    
    
#condition to delete some rules that are not interesting, for example:
#conditionslhs = ['age-range=15-30']
conditionslhs = []
conditionsrhs = []

o3 = list()   
if (not conditionslhs and not conditionsrhs):
    for i in o2:
        #print(i)
        x = parseCFD(i)
        if (x != None):
            o3.append(x)
else:
    for i in o2:
        x = parseCFDWithCond(i,conditionslhs, conditionsrhs)
        if (x != None):
            o3.append(x)
            
for i in range(0,3):
    print(o3[i])

[['Survived=0'], ['Parch=0']]
[['Survived=0', 'SibSp=0'], ['Parch=0']]
[['Survived=0', 'Parch=0'], ['SibSp=0']]


#### Create the dictionary for CFDs

In [9]:
#To split every couple attribute-value
def splitElem(l1):
    return list(map(lambda x: x.split('='), l1))

#To create an array that contains all rules with the lhs and rhs separated
def createSplitting(elem):
    elemLhs = elem[0]
    elemRhs = elem[1]
    LHS = splitElem(elemLhs)
    RHS = splitElem(elemRhs)
    return [LHS, RHS]

#Now that we have deleted all the '=' we can replace the "<" with '<='
def createDictionaryElem(side):
    elem = {}
    for x in side:
        replacedX = x[1].replace('<', '<=')
        elem[x[0]]= replacedX
    return elem

o4 = list(map(createSplitting, o3))
#for i in range(0,4):
#    print(o4[i])

#Create the dictionary with the LHS and RHS that contains all CFDs
parsedRules = list(map(lambda x: {'lhs' : createDictionaryElem(x[0]), 'rhs': createDictionaryElem(x[1])}, o4))

print("Total number of dependencies in the dictionary: " ,len(parsedRules))

Total number of dependencies in the dictionary:  36


#### ParsedRules is the final dictionary of approximated CFDs

In [10]:
for i in range(0,3):
    print("ACFD n.", i, ": " ,parsedRules[i])

ACFD n. 0 :  {'lhs': {'Survived': '0'}, 'rhs': {'Parch': '0'}}
ACFD n. 1 :  {'lhs': {'Survived': '0', 'SibSp': '0'}, 'rhs': {'Parch': '0'}}
ACFD n. 2 :  {'lhs': {'Survived': '0', 'Parch': '0'}, 'rhs': {'SibSp': '0'}}


#### Create table: for each rule create a table that presents the main metrics

In [11]:
import math

def countOccur(elem):
    #How many times appears the lhs of the rule
    countX = 0
    #How many times appears the rhs of the rule
    countY = 0
    #How many times appears the entire rule
    countXY = 0
    
    #for every row of the database, count the LHS, RHS and the total count
    for index, row in df.iterrows():
        #The flags help in dealing with missing values
        flagX = True
        flagY = True
        
        for key in list(elem['lhs'].keys()):
            value = elem['lhs'][key]
            
            #add the constraint to manage '?' that could be a missing values
            if (str(row[key]) != value):
                flagX = False
                
        for key in list(elem['rhs'].keys()):
            value = elem['rhs'][key]
            
            #add the constraint to manage missing values
            if (str(row[key]) != value):
                flagY = False
                
        if flagX:
            #increase the lhs support count
            countX = countX + 1
        if flagY:
             #increase the rhs support count
            countY = countY + 1
        if flagX and flagY:
             #increase the entire rule support count
            countXY = countXY + 1
    #return the lhs supp count, rhs supp count and the entire rule supp count 
    return  (countX, countY, countXY)

def computeConfidenceNoProtectedAttr(elem):
    
    filteredRule = {}
    filteredRule['lhs'] = {k: v for k, v in elem['lhs'].items() if ((k not in (protected_attr)) and (k != target))}
    filteredRule['rhs'] = elem['rhs']
    
    fCount = countOccur(filteredRule)
    #if the rule is valid for at least one tuple
    if(fCount[2] != 0 and fCount[0] != 0):
        ratio = fCount[2]/fCount[0]
    else: 
        ratio = 0
    return ratio

def computeConfidenceForProtectedAttr(elem, protAttr):
    
    filteredRule = {}
    filteredRule['lhs'] = {k: v for k, v in elem['lhs'].items() if (k != protAttr)}
    filteredRule['rhs'] = elem['rhs']
    
    fCount = countOccur(filteredRule)
    #if the rule is valid for at least one tuple
    if(fCount[2] != 0 and fCount[0] != 0):
        ratio = fCount[2]/fCount[0]
    else:
        ratio = 0
    return ratio

def computePDifference(rule, conf, attribute):
    if(attribute in protected_attr):
        diffp = 0
        if(attribute in rule['lhs'].keys()):
            RHSConfidence = computeConfidenceForProtectedAttr(rule, attribute)
            diffp = conf - RHSConfidence
            return diffp
    return None

In [12]:
def equalRules(rule1,rule2):

    flagR = True
    flagL = True
    
    for keyL in rule1['lhs'].keys():
        if(keyL in rule2['lhs'].keys()):
            if(rule1['lhs'][keyL]!=rule2['lhs'][keyL]):
                flagL = False
        else: 
            flagL = False
            
    for keyR in rule1['rhs'].keys():
        if(keyR in rule2['rhs'].keys()):
            if(rule1['rhs'][keyR]!=rule2['rhs'][keyR]):
                flagR = False
        else:
            flagR = False
            
    if(flagL==True and flagR == True):
        return True
    else:
        return False
    
def removeDuplicates(df):
    dfColumns = df.columns
    k=0
    dfClean= pandas.DataFrame(columns = dfColumns)
    for i, row in df.iterrows():
        flag = True
        rule1 = df.loc[i, 'Rule']
        j=k-1
        while(j>=0):
            rule2 = dfClean.loc[j, 'Rule']
            if(equalRules(rule1,rule2)==True):
                flag = False
            j=j-1
        if(flag == True):
            dfClean.loc[k] = df.loc[i]
            k=k+1
            
    return dfClean 
    
def removeDuplicatesArray(arrayList):
    
    cleanedArray = []
    k=0
    for i in range(0, len(arrayList)):
        flag = True
        rule1 = arrayList[i]
        j=k-1
        while(j>=0):
            rule2 = cleanedArray[j]
            if(equalRules(rule1,rule2)==True):
                flag = False
            j=j-1
        if(flag == True):
            cleanedArray.append(arrayList[i])
            k=k+1
            
    return cleanedArray

C'è un bug dopo aver fatto il refactoring del codice!

In [13]:
print("Total number of rules: ", len(parsedRules))

Total number of rules:  36


In [14]:
def createTable(parsedRules):
    df3 = pandas.DataFrame(columns=['Rule', 'Support', 'Confidence', 'Diff'])
    for attribute in protected_attr:
        column = attribute+'Diff'
        df3[column] = 0

    row_index = 0
    for i,parsedRule in enumerate(parsedRules):

        count = countOccur(parsedRule)
        support = tuple(map(lambda val: val/all_tuples, count))
        conf = 0
        confNoProtectedAttr = 0
        RHSConfidence = 0
        diff = 0
        flagProt = False

        for keyL in parsedRule['lhs'].keys():
            if(keyL in protected_attr):
                flagProt = True
        for keyR in parsedRule['rhs'].keys():
            if(keyR in protected_attr):
                flagProt = True

        if(support[0]!= 0 and support[1]!=0 and flagProt == True):
            conf = count[2]/count[0]
            confNoProtectedAttr = computeConfidenceNoProtectedAttr(parsedRule)
            diff = conf - confNoProtectedAttr

            #lift = support[2]/(support[0]*support[1])


            df3 = df3.append({'Rule': parsedRule, 'Confidence': conf, 'Support': support[2], 'Diff': diff}, ignore_index=True)

             #compute the diff for each protected  attributes
            for attribute in protected_attr:
                diffp = 0
                if(attribute in parsedRule['lhs'].keys()):
                    diffp = computePDifference(parsedRule, conf, attribute)
                    column = attribute+'Diff'
                    df3.loc[row_index, column] = diffp 
            row_index = row_index +1
    return df3

In [15]:
df3 = createTable(parsedRules)
df3 = removeDuplicates(df3)
pandas.set_option('display.max_colwidth', None)
pandas.set_option('display.max_rows', None)
print("Total number of tuples in dataframe: " ,len(df3))
df3.head()

Total number of tuples in dataframe:  25


Unnamed: 0,Rule,Support,Confidence,Diff,PclassDiff,SexDiff
0,"{'lhs': {'Survived': '0', 'Pclass': '2'}, 'rhs': {'Parch': '0'}}",0.091324,0.879121,0.122272,0.073513,
1,"{'lhs': {'Survived': '0', 'Fare': '0-8'}, 'rhs': {'Pclass': '3'}}",0.19863,0.994286,-0.00129,,
2,"{'lhs': {'Survived': '1', 'Pclass': '2'}, 'rhs': {'Embarked': 'S'}}",0.086758,0.873563,0.153244,0.234267,
3,"{'lhs': {'Survived': '0', 'Pclass': '2'}, 'rhs': {'Embarked': 'S'}}",0.093607,0.901099,0.180779,0.129136,
4,"{'lhs': {'Pclass': '3', 'Embarked': 'S'}, 'rhs': {'Survived': '0'}}",0.323059,0.810888,0.156372,0.156372,


Compute a dataframe that contains all the metrics for every CFDs

#### Select the minDifference threshold

In [16]:
minDiff= 0.07
#To select the not ethical rules
df4 = df3[df3.Diff > minDiff]
print("Total number of tuples in dataframe: " ,len(df4))
df4.head(17)

Total number of tuples in dataframe:  24


Unnamed: 0,Rule,Support,Confidence,Diff,PclassDiff,SexDiff
0,"{'lhs': {'Survived': '0', 'Pclass': '2'}, 'rhs': {'Parch': '0'}}",0.091324,0.879121,0.122272,0.073513,
2,"{'lhs': {'Survived': '1', 'Pclass': '2'}, 'rhs': {'Embarked': 'S'}}",0.086758,0.873563,0.153244,0.234267,
3,"{'lhs': {'Survived': '0', 'Pclass': '2'}, 'rhs': {'Embarked': 'S'}}",0.093607,0.901099,0.180779,0.129136,
4,"{'lhs': {'Pclass': '3', 'Embarked': 'S'}, 'rhs': {'Survived': '0'}}",0.323059,0.810888,0.156372,0.156372,
5,"{'lhs': {'Survived': '0'}, 'rhs': {'Sex': 'male'}}",0.518265,0.848598,0.207046,,
6,"{'lhs': {'Sex': 'male'}, 'rhs': {'Survived': '0'}}",0.518265,0.807829,0.197099,,0.197099
7,"{'lhs': {'Survived': '0', 'SibSp': '0'}, 'rhs': {'Sex': 'male'}}",0.396119,0.903646,0.197069,,
8,"{'lhs': {'Sex': 'male', 'SibSp': '0'}, 'rhs': {'Survived': '0'}}",0.396119,0.828162,0.180607,,0.180607
9,"{'lhs': {'Survived': '0', 'Fare': '9-20'}, 'rhs': {'Sex': 'male'}}",0.172374,0.825137,0.168202,,
10,"{'lhs': {'Survived': '0', 'Fare': '0-8'}, 'rhs': {'Sex': 'male'}}",0.182648,0.914286,0.126675,,


In [17]:
df4.head(24)

Unnamed: 0,Rule,Support,Confidence,Diff,PclassDiff,SexDiff
0,"{'lhs': {'Survived': '0', 'Pclass': '2'}, 'rhs': {'Parch': '0'}}",0.091324,0.879121,0.122272,0.073513,
2,"{'lhs': {'Survived': '1', 'Pclass': '2'}, 'rhs': {'Embarked': 'S'}}",0.086758,0.873563,0.153244,0.234267,
3,"{'lhs': {'Survived': '0', 'Pclass': '2'}, 'rhs': {'Embarked': 'S'}}",0.093607,0.901099,0.180779,0.129136,
4,"{'lhs': {'Pclass': '3', 'Embarked': 'S'}, 'rhs': {'Survived': '0'}}",0.323059,0.810888,0.156372,0.156372,
5,"{'lhs': {'Survived': '0'}, 'rhs': {'Sex': 'male'}}",0.518265,0.848598,0.207046,,
6,"{'lhs': {'Sex': 'male'}, 'rhs': {'Survived': '0'}}",0.518265,0.807829,0.197099,,0.197099
7,"{'lhs': {'Survived': '0', 'SibSp': '0'}, 'rhs': {'Sex': 'male'}}",0.396119,0.903646,0.197069,,
8,"{'lhs': {'Sex': 'male', 'SibSp': '0'}, 'rhs': {'Survived': '0'}}",0.396119,0.828162,0.180607,,0.180607
9,"{'lhs': {'Survived': '0', 'Fare': '9-20'}, 'rhs': {'Sex': 'male'}}",0.172374,0.825137,0.168202,,
10,"{'lhs': {'Survived': '0', 'Fare': '0-8'}, 'rhs': {'Sex': 'male'}}",0.182648,0.914286,0.126675,,


In [18]:
import json
# Make it work for Python 2+3 and with Unicode
import io
try:
    to_unicode = unicode
except NameError:
    to_unicode = str

data1 = df4.to_json(orient="split")

# Write JSON file
with io.open('data1.json', 'w', encoding='utf8') as outfile:
    str_1 = json.dumps(data1,
                      indent=4, sort_keys=True,
                      separators=(',', ': '), ensure_ascii=False)
    outfile.write(to_unicode(str_1))



### ACFDs Completion  

In [16]:
def cartesianProduct(set_a, set_b): 
    result =[] 
    for i in range(0, len(set_a)): 
        for j in range(0, len(set_b)): 
  
            # for handling case having cartesian 
            # prodct first time of two sets 
            if type(set_a[i]) != list:          
                set_a[i] = [set_a[i]] 
                  
            # coping all the members 
            # of set_a to temp 
            temp = [num for num in set_a[i]] 
              
            # add member of set_b to  
            # temp to have cartesian product      
            temp.append(set_b[j])              
            result.append(temp)   
              
    return result 


  
def Cartesian(list_a, n):
    # result of cartesian product 
    # of all the sets taken two at a time 
    temp = list_a[0] 
      
    # do product of N sets  
    for i in range(1, n): 
        temp = cartesianProduct(temp, list_a[i]) 
          
    return temp 

def createSide(side):
    elem = {}
    for x in side:
        elem[x[0]] = x[1]
    
    return elem

import copy
def findCFDsCombinations(elem):
    CFDs = []
    perm = []
    attr_names = []
    assocRule = list()
    flag = False
    #select db according to already set attributes
    for key in list(elem['lhs'].keys()):
        
        if((key in protected_attr) or (key == target)):
            attr_names.append(key)
            perm.append(df[key].unique())
            flag = True
            
            
    for key in list(elem['rhs'].keys()):
        
        if((key in protected_attr) or (key== target)):
            attr_names.append(key)
            perm.append(df[key].unique())
            flag = True
    
    if(flag == True):
        
        assocRule = copy.deepcopy(elem)
        mat =  Cartesian(perm, len(perm))

        for m in mat:
            if(len(attr_names) == 1):
                for key in list(assocRule['lhs'].keys()):
                    if(key == attr_names[0]):
                        assocRule['lhs'][key] = m
                for key in list(assocRule['rhs'].keys()):
                    if(key == attr_names[0]):
                        assocRule['rhs'][key] = m
            
            else:
                i= 0

                assocRule = copy.deepcopy(elem)
                while(i< len(m)):

                    for key in list(assocRule['lhs'].keys()):
                        if(key == attr_names[i]):
                            assocRule['lhs'][key] = m[i]
                    for key in list(assocRule['rhs'].keys()):
                        if(key == attr_names[i]):
                            assocRule['rhs'][key] = m[i]
                    i = i+1
                   
            CFDs.append(assocRule) 
        return CFDs 
    else:
        return elem

In [17]:
CFDCombinations = []
for elem in df4.Rule:
    #for every rule compute the combinations over the protected attribute
    rulesCount = findCFDsCombinations(elem)
    
    for ar in rulesCount:
        CFDCombinations.append(ar)
        
print("Total number of combinations found: ", len(CFDCombinations))

#print("Original ACFD: ", df4.Rule[0], "\n")
for i in range(0,8):
    
    print("ACFD n.", i, ": " ,CFDCombinations[i])
CFDCombinations = removeDuplicatesArray(CFDCombinations)
print("Total number of combinations found: ", len(CFDCombinations))

Total number of combinations found:  160
ACFD n. 0 :  {'lhs': {'Survived': '0', 'Pclass': '3'}, 'rhs': {'Parch': '0'}}
ACFD n. 1 :  {'lhs': {'Survived': '0', 'Pclass': '1'}, 'rhs': {'Parch': '0'}}
ACFD n. 2 :  {'lhs': {'Survived': '0', 'Pclass': '2'}, 'rhs': {'Parch': '0'}}
ACFD n. 3 :  {'lhs': {'Survived': '1', 'Pclass': '3'}, 'rhs': {'Parch': '0'}}
ACFD n. 4 :  {'lhs': {'Survived': '1', 'Pclass': '1'}, 'rhs': {'Parch': '0'}}
ACFD n. 5 :  {'lhs': {'Survived': '1', 'Pclass': '2'}, 'rhs': {'Parch': '0'}}
ACFD n. 6 :  {'lhs': {'Survived': '0', 'Pclass': '3'}, 'rhs': {'Embarked': 'S'}}
ACFD n. 7 :  {'lhs': {'Survived': '0', 'Pclass': '1'}, 'rhs': {'Embarked': 'S'}}
Total number of combinations found:  106


#### Create table

In [18]:
df5 = createTable(CFDCombinations)   
df5 = removeDuplicates(df5)
pandas.set_option('display.max_colwidth', None)
pandas.set_option('display.max_rows', None)
print("Total number of tuples in dataframe: " ,len(df5))
df5.head()

Total number of tuples in dataframe:  106


Unnamed: 0,Rule,Support,Confidence,Diff,PclassDiff,SexDiff
0,"{'lhs': {'Survived': '0', 'Pclass': '3'}, 'rhs': {'Parch': '0'}}",0.333333,0.791328,0.034479,-0.01428,
1,"{'lhs': {'Survived': '0', 'Pclass': '1'}, 'rhs': {'Parch': '0'}}",0.067352,0.786667,0.029817,-0.018941,
2,"{'lhs': {'Survived': '0', 'Pclass': '2'}, 'rhs': {'Parch': '0'}}",0.091324,0.879121,0.122272,0.073513,
3,"{'lhs': {'Survived': '1', 'Pclass': '3'}, 'rhs': {'Parch': '0'}}",0.097032,0.720339,-0.03651,0.039987,
4,"{'lhs': {'Survived': '1', 'Pclass': '1'}, 'rhs': {'Parch': '0'}}",0.113014,0.727941,-0.028908,0.047589,


Table with all the permutations

### 4. ACFDs Selection and ACFDs Ranking

In [19]:
#orderingCriterion = 0, order using Support, 1 order using Difference, 2 order using Mean
orderingCriterion = 2
#To select the not ethical rules
df51 = df5[df5.Diff > minDiff]

#print("Total number of tuples in dataframe: " ,len(df51))
#Order the rules by Diff or Support or both
if(orderingCriterion == 0):
    df6 = df51.iloc[df51['Support'].argsort()[::-1][:len(df51)]]
elif(orderingCriterion ==1):
    df6 = df51.iloc[df51['Diff'].argsort()[::-1][:len(df51)]]
else:
    df51['Mean'] = 0
    for index, row in df51.iterrows():
         df51.loc[index, 'Mean'] = ((df51.loc[index, 'Support'] + df51.loc[index,'Diff'])/2)
    df6 = df51.iloc[df51['Mean'].argsort()[::-1][:len(df51)]]
    
print("Number of original CFDs: ", len(df4), ". Number of combinations rules: ", len(df5), ". Number of final rules found: ", len(df6))
df6.head()

Number of original CFDs:  24 . Number of combinations rules:  106 . Number of final rules found:  47


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  del sys.path[0]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s


Unnamed: 0,Rule,Support,Confidence,Diff,PclassDiff,SexDiff,Mean
18,"{'lhs': {'Survived': '0'}, 'rhs': {'Sex': 'male'}}",0.518265,0.848598,0.207046,,,0.362655
22,"{'lhs': {'Sex': 'male'}, 'rhs': {'Survived': '0'}}",0.518265,0.807829,0.197099,,0.197099,0.357682
101,"{'lhs': {'Pclass': '1', 'Sex': 'female'}, 'rhs': {'Survived': '1'}}",0.103881,0.968085,0.578816,0.226047,0.323535,0.341348
54,"{'lhs': {'Survived': '0', 'Parch': '0'}, 'rhs': {'Sex': 'male'}}",0.445205,0.904872,0.197482,,,0.321344
58,"{'lhs': {'Sex': 'male', 'Parch': '0'}, 'rhs': {'Survived': '0'}}",0.445205,0.831557,0.181481,,0.181481,0.313343


In [20]:
#number of rules that the user wants to see
n = 50
df6.head(n)

Unnamed: 0,Rule,Support,Confidence,Diff,PclassDiff,SexDiff,Mean
18,"{'lhs': {'Survived': '0'}, 'rhs': {'Sex': 'male'}}",0.518265,0.848598,0.207046,,,0.362655
22,"{'lhs': {'Sex': 'male'}, 'rhs': {'Survived': '0'}}",0.518265,0.807829,0.197099,,0.197099,0.357682
101,"{'lhs': {'Pclass': '1', 'Sex': 'female'}, 'rhs': {'Survived': '1'}}",0.103881,0.968085,0.578816,0.226047,0.323535,0.341348
54,"{'lhs': {'Survived': '0', 'Parch': '0'}, 'rhs': {'Sex': 'male'}}",0.445205,0.904872,0.197482,,,0.321344
58,"{'lhs': {'Sex': 'male', 'Parch': '0'}, 'rhs': {'Survived': '0'}}",0.445205,0.831557,0.181481,,0.181481,0.313343
25,"{'lhs': {'Sex': 'female'}, 'rhs': {'Survived': '1'}}",0.265982,0.742038,0.352769,,0.352769,0.309375
61,"{'lhs': {'Sex': 'female', 'Parch': '0'}, 'rhs': {'Survived': '1'}}",0.174658,0.78866,0.438735,,0.438735,0.306696
105,"{'lhs': {'Pclass': '2', 'Sex': 'female'}, 'rhs': {'Survived': '1'}}",0.079909,0.921053,0.531783,0.179014,0.432289,0.305846
94,"{'lhs': {'Pclass': '3', 'Sex': 'male'}, 'rhs': {'Survived': '0'}}",0.339041,0.865889,0.255159,0.05806,0.108189,0.2971
26,"{'lhs': {'Survived': '0', 'SibSp': '0'}, 'rhs': {'Sex': 'male'}}",0.396119,0.903646,0.197069,,,0.296594


## 5. ACFDs User Selection and Scoring

In [21]:
#INPUT PARAMETERS
#indexes of the selected rules
indexArray = [22, 25, 101, 105, 94, 85, 102]

#minumum number of rules necessary to have a problematic tuple
nMarked = 0

#for every rule = elem, iter over all rows and add one if the tuple respect the rule
def validates(df,elem):  
    
    for index, row in df.iterrows():
        flag = True
        for key in list(elem['lhs'].keys()):
            value = elem['lhs'][key]

            #add the constraint to manage '?' that could be a missing values
            if (str(row[key]) != value):
                flag = False
            

        for key in list(elem['rhs'].keys()):
            value = elem['rhs'][key]

            #add the constraint to manage missing values
            if (str(row[key]) != value):
                flag = False
                
        if(flag == True):
            #update the marked field
            df.loc[index, 'marked'] = df.loc[index, 'marked'] + 1

#add column 'marked'
df = pandas.read_csv(file_path)
#add one column to count the number of tuples involved by the dependencies
df['marked'] = 0


#create the list of the selected dependencies
dependencies = []
for i in indexArray:
    dependencies.append(df6.Rule[i])
    
#create a copy of the df to count the number of tuples involved by the dependencies
dfMarked = df
for dep in dependencies:
    #for every dependency add one to marked field if the tuple respect the rule
    validates(dfMarked, dep)

def extractProblematicTuples(dfMarked):
    dfEthicalProblems = dfMarked[dfMarked.marked > nMarked]
    return dfEthicalProblems
    
dfEthicalProblems = extractProblematicTuples(dfMarked)
print("Problematic tuples: ", len(dfEthicalProblems))
dfEthicalProblems.head()


Problematic tuples:  759


Unnamed: 0,Survived,Pclass,Sex,SibSp,Parch,Fare,Embarked,marked
0,0,3,male,1,0,0-8,S,2
1,1,1,female,1,0,41-80,C,2
2,1,3,female,0,0,0-8,S,1
3,1,1,female,1,0,41-80,S,2
4,0,3,male,0,0,9-20,S,2


In [22]:
def computeStatistics(df6, selectedDependencies, dfMarked, indexArray):

    scores = 0
    diffs = 0
    marks = dfMarked.marked.sum()
    
    for i in indexArray:
        scores = scores + df6.Mean[i]
        diffs = diffs + df6.Diff[i]

    scoreMean = (scores/len(selectedDependencies))
    diffMean = (diffs/len(selectedDependencies))
    pMean = 0



    dfM = dfMarked[dfMarked.marked != 0]
    #print(All tuples interested by the rules: ', marks)

    print('Number of tuples interested by the rules: ', len(dfM), ". Total number of tuples: ", len(df), "\n")
    print( "Cumulative Support: ", len(dfM)/len(df), ". Difference Mean: ", diffMean, "\n")


    for attribute in protected_attr:
        deps = 0
        if(attribute+'Diff' in df6):
            for i in indexArray:
                if not(pandas.isna(df6[attribute+'Diff'][i])):
                    #print(df6[attribute+'Diff'][i])
                    pMean = pMean + df6[attribute+'Diff'][i];
                    deps = deps+1
            if(pMean != 0):
                pMean = (pMean/deps)
                print(attribute, '-Difference Mean: ', pMean, "\n")

    finalRules =  df6[df6.index.isin(indexArray)]
    print("Total number of ACFDs selected: ", len(finalRules), "\n")
    return finalRules

finalRules = computeStatistics(df6, dependencies, dfMarked, indexArray)
finalRules.head(8)

Number of tuples interested by the rules:  759 . Total number of tuples:  876 

Cumulative Support:  0.8664383561643836 . Difference Mean:  0.35302578542970675 

Pclass -Difference Mean:  0.12215637200322627 

Sex -Difference Mean:  0.2939004783614751 

Total number of ACFDs selected:  7 



Unnamed: 0,Rule,Support,Confidence,Diff,PclassDiff,SexDiff,Mean
22,"{'lhs': {'Sex': 'male'}, 'rhs': {'Survived': '0'}}",0.518265,0.807829,0.197099,,0.197099,0.357682
101,"{'lhs': {'Pclass': '1', 'Sex': 'female'}, 'rhs': {'Survived': '1'}}",0.103881,0.968085,0.578816,0.226047,0.323535,0.341348
25,"{'lhs': {'Sex': 'female'}, 'rhs': {'Survived': '1'}}",0.265982,0.742038,0.352769,,0.352769,0.309375
105,"{'lhs': {'Pclass': '2', 'Sex': 'female'}, 'rhs': {'Survived': '1'}}",0.079909,0.921053,0.531783,0.179014,0.432289,0.305846
94,"{'lhs': {'Pclass': '3', 'Sex': 'male'}, 'rhs': {'Survived': '0'}}",0.339041,0.865889,0.255159,0.05806,0.108189,0.2971
85,"{'lhs': {'Survived': '0', 'Sex': 'female'}, 'rhs': {'Pclass': '3'}}",0.082192,0.888889,0.332953,,0.199169,0.207572
102,"{'lhs': {'Pclass': '2', 'Sex': 'male'}, 'rhs': {'Survived': '0'}}",0.097032,0.833333,0.222603,0.025504,0.322097,0.159817
