## FAIR-DB: ADULT US CENSUS DATASET
1. Data Preparation and Exploration (saltata, solo Data Acquisition)
2. ACFDs Discovery and Filtering (CFDDiscovery algorithm)
3. ACFDs Completion
4. ACFDs Selection and ACFDs Ranking
5. Compute dversity - coverage
5. ACFDs User Selection and Scoring (not present in this notebook)


## 1. Data acquisition

In [24]:
import pandas
import numpy as np
file_path = '~/Dropbox/Tesi/cdfAlgorithm/cfddiscovery/datasets/preprocessedAdultNoOccupation.csv'

df = pandas.read_csv(file_path)
all_tuples = len(df)
cols = df.columns

print("Total number of tuples in dataframe: " ,len(df))
df.head()

Total number of tuples in dataframe:  30169


Unnamed: 0,workclass,race,sex,hours-per-week,native-country,income,age-range,education-degree
0,Private,White,Female,0-20,NC-White,<=50K,75-100,HS-College
1,Private,White,Female,21-40,NC-White,<=50K,45-60,Elementary
2,Private,White,Female,21-40,NC-White,<=50K,30-45,Assoc
3,Private,White,Female,41-60,NC-White,<=50K,30-45,HS-College
4,Private,White,Male,21-40,NC-White,<=50K,30-45,MiddleSchool


In [25]:
#INPUTS
#array of protected attributes
protected_attr = ['race', 'sex', 'native-country']
#target class
target = 'income'
binaryValues = df.income.unique()
print(binaryValues)

#input parameters
confidence =  0.86
supportCount = 900
support = supportCount/len(df)
maxSize = 4
grepValue = target+'='
minDiff = 0.07

['<=50K' '>50K']


## 2. ACFDs Discovery and Filtering

1. Establish a support, confidence and a maxSize rule (= number of attributes that at most appears in the lhs part of the rule) apply CFDDiscovery algorithm (remember to apply cmake before!). We use the grep command to establish a particular attribute or value, in this case the target must be present.


In [26]:
#Apply CFDDiscovery algorithm
output = !../cdfAlgorithm/cfddiscovery/CFDD {file_path} {supportCount} {confidence} {maxSize} | grep {grepValue}

#all rules obtianed
print("Total number of dependencies found: " ,len(output), "\n")

for i in range(0,12):
    print("Dependency n.", i, ": " ,output[i])

Total number of dependencies found:  2487 

Dependency n. 0 :  (education-degree=MiddleSchool) => income=<=50K
Dependency n. 1 :  (age-range=15-30) => income=<=50K
Dependency n. 2 :  (education-degree=Bach, age-range=15-30) => income=<=50K
Dependency n. 3 :  (education-degree=MiddleSchool, age-range=15-30) => income=<=50K
Dependency n. 4 :  (education-degree=Assoc, age-range=15-30) => income=<=50K
Dependency n. 5 :  (education-degree=HS-College, age-range=15-30) => income=<=50K
Dependency n. 6 :  (native-country=NC-Hispanic) => income=<=50K
Dependency n. 7 :  (income=>50K) => native-country=NC-White
Dependency n. 8 :  (income=<=50K) => native-country=NC-White
Dependency n. 9 :  (native-country=NC-White, education-degree=MiddleSchool) => income=<=50K
Dependency n. 10 :  (education-degree, income=>50K) => native-country
Dependency n. 11 :  (income=>50K, education-degree=Mast) => native-country=NC-White


Output contains all rules: approximated FDs and  approximated CFDs

#### Now we parse the rule, first deleting the AFDs - in this notebook we study only CFDs.
#### Second condition to select rules we can establish some conditions like:
- condition1 (both on LHS or RHS) : it could be a particular value of an attribute in which we are not interested 
- condition2: it could be a particular value of an attribute in which we need for our rules (not implemented yet)

In [27]:
#Transform the '<=' in '<' and viceversa to not have problem with the following '=' detection
o1 = list(map(lambda x: x.replace("<=", "<"), output))
#o1 = list(map(lambda x: x.replace(">=", ">"), output))
#Delete the parenthesis
o1 = list(map(lambda x: x.replace("(", ""), o1))
o1 = list(map(lambda x: x.replace(")", ""), o1))
#Split the entire rule in a lhs and rhs 
o2 = list(map(lambda x: x.split(' => '), o1))


In [28]:
#Function to select only CFDs from all rules
#x is the single rule
def parseCFD(x):
    #Flag indicates if the rule is a CFD (True) or and FD (False)
    isCFD = True
    rawLHS = x[0].split(', ')
    #lhs rule
    for i, y in enumerate(rawLHS):
        for attr in cols:
            if (y in str(attr+'=!')):
                isCFD = False
        
       
    rawRHS = x[1].split(', ')
    for i, y in enumerate(rawRHS):
        for attr in cols:
            if (y in str(attr+'=!')):
                isCFD = False
      
        #To keep only CFDs
        if(isCFD == True):
            return [rawLHS, rawRHS]
        else:
            return None
        
#conditions is an array of conditions to delete some rules that are not interesting, for example:
#  ex: conditionslhs = ['age-range=15-30', 'native-country=NC-White']     

def parseCFDWithCond(x, conditionslhs, conditionsrhs):
    #Flag indicates if the rule is a CFD (True) or and FD (False)
    isCFD = True
    #Flag indicates if the rule contains unwanted condition(s) (rhs or lhs) - it doesn't contain the condition (true)
    takenRule = True
    rawLHS= x[0].split(', ')
    #lhs rule
    for i, y in enumerate(rawLHS):
        for attr in cols:
            if (y in str(attr+'=!')):
                isCFD = False
            for condlhs in conditionslhs:
                if (y == condlhs):
                    takenRule = False
        
       
    rawRHS = x[1].split(', ')
    for i, y in enumerate(rawRHS):
        for attr in cols:
            if (y in str(attr+'=!')):
                isCFD = False
            for condrhs in conditionsrhs:
                if (y == condrhs):
                    takenRule = False
      
        #To keep only CFDs
        if(isCFD == True and takenRule == True):
            return [rawLHS, rawRHS]
        else:
            return None
    
    
#condition to delete some rules that are not interesting, for example:
conditionslhs = ['age-range=15-30']
conditionsrhs = []

o3 = list()   
if not conditionslhs and not conditionsrhs:
    for i in o2:
        x = parseCFD(i)
        if (x != None):
            o3.append(x)
else:
    for i in o2:
        x = parseCFDWithCond(i,conditionslhs, conditionsrhs)
        if (x != None):
            o3.append(x)
            
for i in range(0,3):
    print(o3[i])

[['education-degree=MiddleSchool'], ['income=<50K']]
[['native-country=NC-Hispanic'], ['income=<50K']]
[['income=>50K'], ['native-country=NC-White']]


#### Create the dictionary for CFDs

In [29]:
#To split every couple attribute-value
def splitElem(l1):
    return list(map(lambda x: x.split('='), l1))

#To create an array that contains all rules with the lhs and rhs separated
def createSplitting(elem):
    elemLhs = elem[0]
    elemRhs = elem[1]
    LHS = splitElem(elemLhs)
    RHS = splitElem(elemRhs)
    return [LHS, RHS]

#Now that we have deleted all the '=' we can replace the "<" with '<='
def createDictionaryElem(side):
    elem = {}
    for x in side:
        replacedX = x[1].replace('<', '<=')
        elem[x[0]]= replacedX
    return elem

o4 = list(map(createSplitting, o3))
#for i in range(0,4):
#    print(o4[i])

#Create the dictionary with the LHS and RHS that contains all CFDs
parsedRules = list(map(lambda x: {'lhs' : createDictionaryElem(x[0]), 'rhs': createDictionaryElem(x[1])}, o4))

print("Total number of dependencies in the dictionary: " ,len(parsedRules))

Total number of dependencies in the dictionary:  578


#### ParsedRules is the final dictionary of approximated CFDs

In [30]:
for i in range(0,3):
    print("ACFD n.", i, ": " ,parsedRules[i])

ACFD n. 0 :  {'lhs': {'education-degree': 'MiddleSchool'}, 'rhs': {'income': '<=50K'}}
ACFD n. 1 :  {'lhs': {'native-country': 'NC-Hispanic'}, 'rhs': {'income': '<=50K'}}
ACFD n. 2 :  {'lhs': {'income': '>50K'}, 'rhs': {'native-country': 'NC-White'}}


#### Create table: for each rule create a table that presents the main metrics

In [31]:
import math

def countOccur(elem):
    #How many times appears the lhs of the rule
    countX = 0
    #How many times appears the rhs of the rule
    countY = 0
    #How many times appears the entire rule
    countXY = 0
    
    #for every row of the database, count the LHS, RHS and the total count
    for index, row in df.iterrows():
        #The flags help in dealing with missing values
        flagX = True
        flagY = True
        
        for key in list(elem['lhs'].keys()):
            value = elem['lhs'][key]
            
            #add the constraint to manage '?' that could be a missing values
            if (str(row[key]) != value):
                flagX = False
                
        for key in list(elem['rhs'].keys()):
            value = elem['rhs'][key]
            
            #add the constraint to manage missing values
            if (str(row[key]) != value):
                flagY = False
                
        if flagX:
            #increase the lhs support count
            countX = countX + 1
        if flagY:
             #increase the rhs support count
            countY = countY + 1
        if flagX and flagY:
             #increase the entire rule support count
            countXY = countXY + 1
    #return the lhs supp count, rhs supp count and the entire rule supp count 
    return  (countX, countY, countXY)

def computeConfidenceNoProtectedAttr(elem):
    
    filteredRule = {}
    filteredRule['lhs'] = {k: v for k, v in elem['lhs'].items() if ((k not in (protected_attr)) and (k != target))}
    filteredRule['rhs'] = elem['rhs']
    
    fCount = countOccur(filteredRule)
    #if the rule is valid for at least one tuple
    if(fCount[2] != 0 and fCount[0] != 0):
        ratio = fCount[2]/fCount[0]
    else: 
        ratio = 0
    return ratio

def computeConfidenceForProtectedAttr(elem, protAttr):
    
    filteredRule = {}
    filteredRule['lhs'] = {k: v for k, v in elem['lhs'].items() if (k != protAttr)}
    filteredRule['rhs'] = elem['rhs']
    
    fCount = countOccur(filteredRule)
    #if the rule is valid for at least one tuple
    if(fCount[2] != 0 and fCount[0] != 0):
        ratio = fCount[2]/fCount[0]
    else:
        ratio = 0
    return ratio

def computePDifference(rule, conf, attribute):
    if(attribute in protected_attr):
        diffp = 0
        if(attribute in rule['lhs'].keys()):
            RHSConfidence = computeConfidenceForProtectedAttr(rule, attribute)
            diffp = conf - RHSConfidence
            return diffp
    return None

In [32]:
print("Total number of rules: ", len(parsedRules))

Total number of rules:  578


Compute a dataframe that contains all the metrics for every CFDs

In [33]:
def equalRules(rule1,rule2):

    flagR = True
    flagL = True
    
    for keyL in rule1['lhs'].keys():
        if(keyL in rule2['lhs'].keys()):
            if(rule1['lhs'][keyL]!=rule2['lhs'][keyL]):
                flagL = False
        else: 
            flagL = False
            
    for keyR in rule1['rhs'].keys():
        if(keyR in rule2['rhs'].keys()):
            if(rule1['rhs'][keyR]!=rule2['rhs'][keyR]):
                flagR = False
        else:
            flagR = False
            
    if(flagL==True and flagR == True):
        return True
    else:
        return False
    
def removeDuplicates(df):
    dfColumns = df.columns
    k=0
    dfClean= pandas.DataFrame(columns = dfColumns)
    for i, row in df.iterrows():
        flag = True
        rule1 = df.loc[i, 'Rule']
        j=k-1
        while(j>=0):
            rule2 = dfClean.loc[j, 'Rule']
            if(equalRules(rule1,rule2)==True):
                flag = False
            j=j-1
        if(flag == True):
            dfClean.loc[k] = df.loc[i]
            k=k+1
            
    return dfClean 
    
def removeDuplicatesArray(arrayList):
    
    cleanedArray = []
    k=0
    for i in range(0, len(arrayList)):
        flag = True
        rule1 = arrayList[i]
        j=k-1
        while(j>=0):
            rule2 = cleanedArray[j]
            if(equalRules(rule1,rule2)==True):
                flag = False
            j=j-1
        if(flag == True):
            cleanedArray.append(arrayList[i])
            k=k+1
            
    return cleanedArray

In [34]:
def createTable(parsedRules):
    df3 = pandas.DataFrame(columns=['Rule', 'Support', 'Confidence', 'Diff'])
    index_row = 0
    for i,parsedRule in enumerate(parsedRules):

        count = countOccur(parsedRule)
        support = tuple(map(lambda val: val/all_tuples, count))
        conf = 0
        confNoProtectedAttr = 0
        RHSConfidence = 0
        diff = 0
        flagProt = False

        for keyL in parsedRule['lhs'].keys():
            if(keyL in protected_attr):
                flagProt = True
        for keyR in parsedRule['rhs'].keys():
            if(keyR in protected_attr):
                flagProt = True

        if(support[0]!= 0 and support[1]!=0 and flagProt == True):
            conf = count[2]/count[0]
            confNoProtectedAttr = computeConfidenceNoProtectedAttr(parsedRule)
            diff = conf - confNoProtectedAttr

            #lift = support[2]/(support[0]*support[1])


            df3 = df3.append({'Rule': parsedRule, 'Confidence': conf, 'Support': support[2], 'Diff': diff}, ignore_index=True)

             #compute the diff for each protected  attributes
            for attribute in protected_attr:
                if(attribute in parsedRule['lhs'].keys()):
                    diffp = computePDifference(parsedRule, conf, attribute)
                    column = attribute+'Diff'
                    df3.loc[index_row,column] = diffp  
            index_row = index_row+1
    return df3

df3 = createTable(parsedRules)
 
pandas.set_option('display.max_colwidth', None)
pandas.set_option('display.max_rows', None)
print("Total number of tuples in dataframe: " ,len(df3))
df3.head()

Total number of tuples in dataframe:  568


Unnamed: 0,Rule,Support,Confidence,Diff,native-countryDiff,sexDiff,raceDiff
0,"{'lhs': {'native-country': 'NC-Hispanic'}, 'rhs': {'income': '<=50K'}}",0.043919,0.908156,0.157021,0.157021,,
1,"{'lhs': {'income': '>50K'}, 'rhs': {'native-country': 'NC-White'}}",0.231861,0.931673,0.019777,,,
2,"{'lhs': {'income': '<=50K'}, 'rhs': {'native-country': 'NC-White'}}",0.680036,0.905344,-0.006552,,,
3,"{'lhs': {'native-country': 'NC-White', 'education-degree': 'MiddleSchool'}, 'rhs': {'income': '<=50K'}}",0.074016,0.933138,-0.003229,-0.003229,,
4,"{'lhs': {'income': '>50K', 'education-degree': 'Mast'}, 'rhs': {'native-country': 'NC-White'}}",0.028141,0.924837,0.012728,,,


In [35]:
df3 = removeDuplicates(df3)
print("Total number of tuples in dataframe: " ,len(df3))
df3.head()

Total number of tuples in dataframe:  568


Unnamed: 0,Rule,Support,Confidence,Diff,native-countryDiff,sexDiff,raceDiff
0,"{'lhs': {'native-country': 'NC-Hispanic'}, 'rhs': {'income': '<=50K'}}",0.043919,0.908156,0.157021,0.157021,,
1,"{'lhs': {'income': '>50K'}, 'rhs': {'native-country': 'NC-White'}}",0.231861,0.931673,0.019777,,,
2,"{'lhs': {'income': '<=50K'}, 'rhs': {'native-country': 'NC-White'}}",0.680036,0.905344,-0.006552,,,
3,"{'lhs': {'native-country': 'NC-White', 'education-degree': 'MiddleSchool'}, 'rhs': {'income': '<=50K'}}",0.074016,0.933138,-0.003229,-0.003229,,
4,"{'lhs': {'income': '>50K', 'education-degree': 'Mast'}, 'rhs': {'native-country': 'NC-White'}}",0.028141,0.924837,0.012728,,,


#### Select the minDifference threshold

In [36]:
minDiff = 0.07
#To select the not ethical rules
df4 = df3[df3.Diff > minDiff]
print("Total number of tuples in dataframe: " ,len(df4))
df4.head()

Total number of tuples in dataframe:  128


Unnamed: 0,Rule,Support,Confidence,Diff,native-countryDiff,sexDiff,raceDiff
0,"{'lhs': {'native-country': 'NC-Hispanic'}, 'rhs': {'income': '<=50K'}}",0.043919,0.908156,0.157021,0.157021,,
24,"{'lhs': {'hours-per-week': '21-40', 'native-country': 'NC-Hispanic'}, 'rhs': {'income': '<=50K'}}",0.033909,0.928312,0.122855,0.122855,,
54,"{'lhs': {'sex': 'Female'}, 'rhs': {'income': '<=50K'}}",0.287447,0.886345,0.13521,,0.13521,
56,"{'lhs': {'sex': 'Female', 'education-degree': 'Assoc'}, 'rhs': {'income': '<=50K'}}",0.101362,0.910932,0.126163,,0.126163,
57,"{'lhs': {'sex': 'Female', 'education-degree': 'HS-College'}, 'rhs': {'income': '<=50K'}}",0.095926,0.931445,0.095758,,0.095758,


In [37]:
df4.head(128)

Unnamed: 0,Rule,Support,Confidence,Diff,native-countryDiff,sexDiff,raceDiff
0,"{'lhs': {'native-country': 'NC-Hispanic'}, 'rhs': {'income': '<=50K'}}",0.043919,0.908156,0.157021,0.157021,,
24,"{'lhs': {'hours-per-week': '21-40', 'native-country': 'NC-Hispanic'}, 'rhs': {'income': '<=50K'}}",0.033909,0.928312,0.122855,0.122855,,
54,"{'lhs': {'sex': 'Female'}, 'rhs': {'income': '<=50K'}}",0.287447,0.886345,0.13521,,0.13521,
56,"{'lhs': {'sex': 'Female', 'education-degree': 'Assoc'}, 'rhs': {'income': '<=50K'}}",0.101362,0.910932,0.126163,,0.126163,
57,"{'lhs': {'sex': 'Female', 'education-degree': 'HS-College'}, 'rhs': {'income': '<=50K'}}",0.095926,0.931445,0.095758,,0.095758,
58,"{'lhs': {'education-degree': 'HS-College', 'income': '>50K'}, 'rhs': {'sex': 'Male'}}",0.046538,0.868275,0.183995,,,
59,"{'lhs': {'age-range': '45-60', 'income': '>50K'}, 'rhs': {'sex': 'Male'}}",0.084159,0.877636,0.165426,,,
60,"{'lhs': {'sex': 'Female', 'age-range': '30-45', 'education-degree': 'Assoc'}, 'rhs': {'income': '<=50K'}}",0.034207,0.862876,0.130639,,0.130639,
61,"{'lhs': {'sex': 'Female', 'education-degree': 'HS-College', 'age-range': '30-45'}, 'rhs': {'income': '<=50K'}}",0.034108,0.909814,0.088641,,0.088641,
63,"{'lhs': {'sex': 'Female', 'hours-per-week': '21-40'}, 'rhs': {'income': '<=50K'}}",0.204647,0.901314,0.095857,,0.095857,


### 3.ACFDs Completion  

In [14]:
def cartesianProduct(set_a, set_b): 
    result =[] 
    for i in range(0, len(set_a)): 
        for j in range(0, len(set_b)): 
  
            # for handling case having cartesian 
            # prodct first time of two sets 
            if type(set_a[i]) != list:          
                set_a[i] = [set_a[i]] 
                  
            # coping all the members 
            # of set_a to temp 
            temp = [num for num in set_a[i]] 
              
            # add member of set_b to  
            # temp to have cartesian product      
            temp.append(set_b[j])              
            result.append(temp)   
              
    return result 


  
def Cartesian(list_a, n):
    # result of cartesian product 
    # of all the sets taken two at a time 
    temp = list_a[0] 
      
    # do product of N sets  
    for i in range(1, n): 
        temp = cartesianProduct(temp, list_a[i]) 
          
    return temp 

def createSide(side):
    elem = {}
    for x in side:
        elem[x[0]] = x[1]
    
    return elem

import copy
def findCFDsCombinations(elem):
    CFDs = []
    perm = []
    attr_names = []
    assocRule = list()
    flag = False
    #select db according to already set attributes
    for key in list(elem['lhs'].keys()):
        
        if((key in protected_attr) or (key == target)):
            attr_names.append(key)
            perm.append(df[key].unique())
            flag = True
            
            
    for key in list(elem['rhs'].keys()):
        
        if((key in protected_attr) or (key== target)):
            attr_names.append(key)
            perm.append(df[key].unique())
            flag = True
    
    if(flag == True):
        
        assocRule = copy.deepcopy(elem)
        mat =  Cartesian(perm, len(perm))

        for m in mat:
            if(len(attr_names) == 1):
                for key in list(assocRule['lhs'].keys()):
                    if(key == attr_names[0]):
                        assocRule['lhs'][key] = m
                for key in list(assocRule['rhs'].keys()):
                    if(key == attr_names[0]):
                        assocRule['rhs'][key] = m
            
            else:
                i= 0

                assocRule = copy.deepcopy(elem)
                while(i< len(m)):

                    for key in list(assocRule['lhs'].keys()):
                        if(key == attr_names[i]):
                            assocRule['lhs'][key] = m[i]
                    for key in list(assocRule['rhs'].keys()):
                        if(key == attr_names[i]):
                            assocRule['rhs'][key] = m[i]
                    i = i+1
                   
            CFDs.append(assocRule) 
        return CFDs 
    else:
        return elem

In [15]:
CFDCombinations = []
for elem in df4.Rule:
    #for every rule compute the combinations over the protected attribute
    rulesCount = findCFDsCombinations(elem)
    
    for ar in rulesCount:
        CFDCombinations.append(ar)
        
print("Total number of combinations found: ", len(CFDCombinations))

#print("Original ACFD: ", df4.Rule[0], "\n")
for i in range(0,8):
    
    print("ACFD n.", i, ": " ,CFDCombinations[i])

Total number of combinations found:  296
ACFD n. 0 :  {'lhs': {'native-country': 'NC-White'}, 'rhs': {'income': '<=50K'}}
ACFD n. 1 :  {'lhs': {'native-country': 'NC-White'}, 'rhs': {'income': '>50K'}}
ACFD n. 2 :  {'lhs': {'native-country': 'NC-Hispanic'}, 'rhs': {'income': '<=50K'}}
ACFD n. 3 :  {'lhs': {'native-country': 'NC-Hispanic'}, 'rhs': {'income': '>50K'}}
ACFD n. 4 :  {'lhs': {'native-country': 'NC-Non-Hisp-White'}, 'rhs': {'income': '<=50K'}}
ACFD n. 5 :  {'lhs': {'native-country': 'NC-Non-Hisp-White'}, 'rhs': {'income': '>50K'}}
ACFD n. 6 :  {'lhs': {'native-country': 'NC-Asian-Pacific'}, 'rhs': {'income': '<=50K'}}
ACFD n. 7 :  {'lhs': {'native-country': 'NC-Asian-Pacific'}, 'rhs': {'income': '>50K'}}


In [16]:
CFDCombintaions = removeDuplicatesArray(CFDCombinations)
print("Total number of combinations found: ", len(CFDCombinations))

Total number of combinations found:  296


In [18]:
df5 = createTable(CFDCombinations)   
pandas.set_option('display.max_colwidth', None)
pandas.set_option('display.max_rows', None)
print("Total number of tuples in dataframe: " ,len(df5))
df5.head()

Total number of tuples in dataframe:  296


Unnamed: 0,Rule,Support,Confidence,Diff,native-countryDiff,sexDiff,raceDiff
0,"{'lhs': {'native-country': 'NC-White'}, 'rhs': {'income': '<=50K'}}",0.680036,0.745738,-0.005397,-0.005397,,
1,"{'lhs': {'native-country': 'NC-White'}, 'rhs': {'income': '>50K'}}",0.231861,0.254262,0.005397,0.005397,,
2,"{'lhs': {'native-country': 'NC-Hispanic'}, 'rhs': {'income': '<=50K'}}",0.043919,0.908156,0.157021,0.157021,,
3,"{'lhs': {'native-country': 'NC-Hispanic'}, 'rhs': {'income': '>50K'}}",0.004442,0.091844,-0.157021,-0.157021,,
4,"{'lhs': {'native-country': 'NC-Non-Hisp-White'}, 'rhs': {'income': '<=50K'}}",0.014551,0.683801,-0.067335,-0.067335,,


In [19]:
#orderingCriterion = 0, order using Support, 1 order using Difference, 2 order using Mean
orderingCriterion = 2
#To select the not ethical rules
df51 = df5[df5.Diff > minDiff]

#print("Total number of tuples in dataframe: " ,len(df51))
#Order the rules by Diff or Support or both
if(orderingCriterion == 0):
    df6 = df51.iloc[df51['Support'].argsort()[::-1][:len(df51)]]
elif(orderingCriterion ==1):
    df6 = df51.iloc[df51['Diff'].argsort()[::-1][:len(df51)]]
else:
    df51['Mean'] = 0
    for index, row in df51.iterrows():
         df51.loc[index, 'Mean'] = ((df51.loc[index, 'Support'] + df51.loc[index,'Diff'])/2)
    df6 = df51.iloc[df51['Mean'].argsort()[::-1][:len(df51)]]
    
print("Number of original CFDs: ", len(df4), ". Number of combinations rules: ", len(df5), ". Number of final rules found: ", len(df6))
df6.head()

Number of original CFDs:  23 . Number of combinations rules:  296 . Number of final rules found:  89


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  del sys.path[0]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s


Unnamed: 0,Rule,Support,Confidence,Diff,native-countryDiff,sexDiff,raceDiff,Mean
218,"{'lhs': {'native-country': 'NC-Asian-Pacific', 'income': '<=50K'}, 'rhs': {'race': 'Asian-Pac-Islander'}}",0.011369,0.900262,0.870596,0.871711,,,0.440983
223,"{'lhs': {'native-country': 'NC-Asian-Pacific', 'income': '>50K'}, 'rhs': {'race': 'Asian-Pac-Islander'}}",0.005171,0.886364,0.856697,0.853332,,,0.430934
16,"{'lhs': {'sex': 'Female'}, 'rhs': {'income': '<=50K'}}",0.287447,0.886345,0.13521,,0.13521,,0.211329
60,"{'lhs': {'sex': 'Female', 'native-country': 'NC-White'}, 'rhs': {'income': '<=50K'}}",0.26219,0.885382,0.134246,-0.000963,0.139644,,0.198218
44,"{'lhs': {'sex': 'Female', 'native-country': 'NC-White'}, 'rhs': {'income': '<=50K'}}",0.26219,0.885382,0.134246,-0.000963,0.139644,,0.198218


In [20]:
#number of rules that the user wants to see
n = 89
df6.head(n)

Unnamed: 0,Rule,Support,Confidence,Diff,native-countryDiff,sexDiff,raceDiff,Mean
218,"{'lhs': {'native-country': 'NC-Asian-Pacific', 'income': '<=50K'}, 'rhs': {'race': 'Asian-Pac-Islander'}}",0.011369,0.900262,0.870596,0.871711,,,0.440983
223,"{'lhs': {'native-country': 'NC-Asian-Pacific', 'income': '>50K'}, 'rhs': {'race': 'Asian-Pac-Islander'}}",0.005171,0.886364,0.856697,0.853332,,,0.430934
16,"{'lhs': {'sex': 'Female'}, 'rhs': {'income': '<=50K'}}",0.287447,0.886345,0.13521,,0.13521,,0.211329
60,"{'lhs': {'sex': 'Female', 'native-country': 'NC-White'}, 'rhs': {'income': '<=50K'}}",0.26219,0.885382,0.134246,-0.000963,0.139644,,0.198218
44,"{'lhs': {'sex': 'Female', 'native-country': 'NC-White'}, 'rhs': {'income': '<=50K'}}",0.26219,0.885382,0.134246,-0.000963,0.139644,,0.198218
226,"{'lhs': {'sex': 'Female', 'race': 'White'}, 'rhs': {'income': '<=50K'}}",0.22954,0.877026,0.125891,,0.140694,-0.009319,0.177716
246,"{'lhs': {'race': 'White', 'sex': 'Female'}, 'rhs': {'income': '<=50K'}}",0.22954,0.877026,0.125891,,0.140694,-0.009319,0.177716
292,"{'lhs': {'workclass': 'Private', 'sex': 'Female'}, 'rhs': {'income': '<=50K'}}",0.229408,0.905653,0.124445,,0.124445,,0.176926
36,"{'lhs': {'sex': 'Female', 'hours-per-week': '21-40'}, 'rhs': {'income': '<=50K'}}",0.204647,0.901314,0.095857,,0.095857,,0.150252
191,"{'lhs': {'native-country': 'NC-White', 'income': '>50K'}, 'rhs': {'race': 'White'}}",0.216746,0.934811,0.075054,0.023916,,,0.1459


In [23]:
df6 = removeDuplicates(df6)
print(len(df6))
df6.head(57)

57


Unnamed: 0,Rule,Support,Confidence,Diff,native-countryDiff,sexDiff,raceDiff,Mean
0,"{'lhs': {'native-country': 'NC-Asian-Pacific', 'income': '<=50K'}, 'rhs': {'race': 'Asian-Pac-Islander'}}",0.011369,0.900262,0.870596,0.871711,,,0.440983
1,"{'lhs': {'native-country': 'NC-Asian-Pacific', 'income': '>50K'}, 'rhs': {'race': 'Asian-Pac-Islander'}}",0.005171,0.886364,0.856697,0.853332,,,0.430934
2,"{'lhs': {'sex': 'Female'}, 'rhs': {'income': '<=50K'}}",0.287447,0.886345,0.13521,,0.13521,,0.211329
3,"{'lhs': {'sex': 'Female', 'native-country': 'NC-White'}, 'rhs': {'income': '<=50K'}}",0.26219,0.885382,0.134246,-0.000963,0.139644,,0.198218
4,"{'lhs': {'sex': 'Female', 'race': 'White'}, 'rhs': {'income': '<=50K'}}",0.22954,0.877026,0.125891,,0.140694,-0.009319,0.177716
5,"{'lhs': {'workclass': 'Private', 'sex': 'Female'}, 'rhs': {'income': '<=50K'}}",0.229408,0.905653,0.124445,,0.124445,,0.176926
6,"{'lhs': {'sex': 'Female', 'hours-per-week': '21-40'}, 'rhs': {'income': '<=50K'}}",0.204647,0.901314,0.095857,,0.095857,,0.150252
7,"{'lhs': {'native-country': 'NC-White', 'income': '>50K'}, 'rhs': {'race': 'White'}}",0.216746,0.934811,0.075054,0.023916,,,0.1459
8,"{'lhs': {'sex': 'Male', 'race': 'White'}, 'rhs': {'income': '>50K'}}",0.194504,0.325241,0.076376,,0.061574,0.011481,0.13544
9,"{'lhs': {'sex': 'Male', 'native-country': 'NC-White'}, 'rhs': {'income': '>50K'}}",0.197918,0.321419,0.072554,0.007659,0.067157,,0.135236


### 4. ACFDs Selection and ACFDs Ranking

In [None]:
#INPUT PARAMETERS
#indexes of the selected rules
indexArray = [22, 25, 101, 105, 94, 85, 102]

#minumum number of rules necessary to have a problematic tuple
nMarked = 0

#for every rule = elem, iter over all rows and add one if the tuple respect the rule
def validates(df,elem):  
    
    for index, row in df.iterrows():
        flag = True
        for key in list(elem['lhs'].keys()):
            value = elem['lhs'][key]

            #add the constraint to manage '?' that could be a missing values
            if (str(row[key]) != value):
                flag = False
            

        for key in list(elem['rhs'].keys()):
            value = elem['rhs'][key]

            #add the constraint to manage missing values
            if (str(row[key]) != value):
                flag = False
                
        if(flag == True):
            #update the marked field
            df.loc[index, 'marked'] = df.loc[index, 'marked'] + 1

#add column 'marked'
df = pandas.read_csv(file_path)
#add one column to count the number of tuples involved by the dependencies
df['marked'] = 0


#create the list of the selected dependencies
dependencies = []
for i in indexArray:
    dependencies.append(df6.Rule[i])
    
#create a copy of the df to count the number of tuples involved by the dependencies
dfMarked = df
for dep in dependencies:
    #for every dependency add one to marked field if the tuple respect the rule
    validates(dfMarked, dep)

def extractProblematicTuples(dfMarked):
    dfEthicalProblems = dfMarked[dfMarked.marked > nMarked]
    return dfEthicalProblems
    
dfEthicalProblems = extractProblematicTuples(dfMarked)
print("Problematic tuples: ", len(dfEthicalProblems))
dfEthicalProblems.head()


In [None]:
def computeStatistics(df6, selectedDependencies, dfMarked, indexArray):

    scores = 0
    diffs = 0
    marks = dfMarked.marked.sum()
    
    for i in indexArray:
        scores = scores + df6.Mean[i]
        diffs = diffs + df6.Diff[i]

    scoreMean = (scores/len(selectedDependencies))
    diffMean = (diffs/len(selectedDependencies))
    pMean = 0



    dfM = dfMarked[dfMarked.marked != 0]
    #print(All tuples interested by the rules: ', marks)

    print('Number of tuples interested by the rules: ', len(dfM), ". Total number of tuples: ", len(df), "\n")
    print( "Cumulative Support: ", len(dfM)/len(df), ". Difference Mean: ", diffMean, "\n")


    for attribute in protected_attr:
        deps = 0
        if(attribute+'Diff' in df6):
            for i in indexArray:
                if not(pandas.isna(df6[attribute+'Diff'][i])):
                    #print(df6[attribute+'Diff'][i])
                    pMean = pMean + df6[attribute+'Diff'][i];
                    deps = deps+1
            if(pMean != 0):
                pMean = (pMean/deps)
                print(attribute, '-Difference Mean: ', pMean, "\n")

    finalRules =  df6[df6.index.isin(indexArray)]
    print("Total number of ACFDs selected: ", len(finalRules), "\n")
    return finalRules

finalRules = computeStatistics(df6, dependencies, dfMarked, indexArray)
finalRules.head()