# Affinity Analysis

Affinity analysis is a data mining technique that measures how similar items or samples are, typically by how often they occur together.

Use cases:

1. Product recommendations
2. Targeted services and ads
3. Finding people with similar genes, e.g. people who share ancestors

**Loading and understanding data**

Each row of the data represents one sample, i.e. one customer's shopping list, and each column represents an item type (feature).

0 = no item of this type (feature) was purchased

1 = at least 1 item of this type was purchased
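
The encoding above can be illustrated with a small sketch. The baskets and the `encode` helper below are hypothetical and are not part of `affinity_dataset.txt`; they only show how a shopping list might map to one 0/1 row.

```python
import numpy as np

features = ["bread", "milk", "cheese", "apples", "bananas"]

# Hypothetical baskets, used only to illustrate the 0/1 encoding
baskets = [
    ["cheese", "apples", "bananas"],
    ["bread", "milk", "apples", "apples"],  # buying an item twice still encodes as 1
    [],                                     # an empty basket encodes as all zeros
]

def encode(basket, features):
    """Return one row with a 1 for every item type present in the basket."""
    bought = set(basket)
    return [1 if name in bought else 0 for name in features]

X_demo = np.array([encode(b, features) for b in baskets], dtype=int)
print(X_demo)
```

Each row has one entry per feature, in the same order as the `features` list, which is the layout the rest of the notebook relies on.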

In [230]:
import numpy as np
dataset = "affinity_dataset.txt"
X = np.loadtxt(dataset, dtype = "int")

X_type = type(X)
num_samples, num_features = X.shape
features = ["bread", "milk", "cheese", "apples", "bananas"]

print("X is a {0} datatype.".format(X_type))
print("The dataset has {0} samples and {1} features.".format(num_samples, num_features))
print("The feature names are {0}.".format(features[:]))
print("The first 5 samples are:","\n", X[0:5])

X is a <class 'numpy.ndarray'> datatype.
The dataset has 100 samples and 5 features.
The feature names are ['bread', 'milk', 'cheese', 'apples', 'bananas'].
The first 5 samples are: 
 [[0 0 1 1 1]
 [1 1 0 1 0]
 [1 0 1 1 0]
 [0 0 1 1 1]
 [0 1 0 0 1]]


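A set of rules can be derived from the data, based on what each customer purchased.

For example: a person who **"bought"** apples **"also bought"** cheese.

For a given pair of item types, every sample in which the first item type was bought (counted in the **number of purchases**) is either a valid or an invalid case of the rule.

**Valid rules** are the cases where the customer bought the first item type and also bought a second, different item type, e.g. *apples & cheese* or *milk & bread*. A basket with *apples & bananas & milk* gives a valid case for every ordered pair of those three item types.

**Invalid rules** are the cases where the customer bought the first item type but not the second, e.g. *apples* without *cheese*. Samples in which the first item type was not bought are skipped, and an item type is never paired with itself (*bread & bread*).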

In [231]:
from collections import defaultdict
num_purchases = defaultdict(int)
valid_rules = defaultdict(int) #a dictionary of valid rules with keys as tuple (bought, also_bought)
invalid_rules = defaultdict(int) #a dictionary of invalid rules with keys as tuples (bought, also_bought)

In [232]:
#for every sample in the dataset
for sample in X:
    #for every item type in a sample
    for bought in range(num_features):
        #if the item type is not bought, i.e value == 0
        if sample[bought] == 0:
            continue
        #else increment the number of purchases of that item type by 1    
        num_purchases[bought] += 1
        #for every item type that is also bought together with the above item type
        for also_bought in range(num_features):
            #an item type is never paired with itself, so skip this case
            if bought == also_bought:
                continue
            #else if a different type is also bought, i.e. value == 1, then increment the valid rule
            if sample[also_bought] == 1:
                valid_rules[(bought, also_bought)] += 1
            #bought the first item but did not buy the second, different item
            else:
                invalid_rules[(bought, also_bought)] += 1
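
The same counts can also be obtained with matrix operations instead of nested loops. The following is a minimal sketch on a small made-up matrix, not the notebook's dataset; the names `X_demo`, `purchases`, `co_bought` and `not_co_bought` are assumptions introduced for illustration. Unlike the loop, the matrices also contain diagonal entries (an item paired with itself), which should be ignored.

```python
import numpy as np

# Small made-up matrix (rows = samples, columns = item types), for illustration only
X_demo = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
])

# Number of samples in which each item type was bought (same role as num_purchases)
purchases = X_demo.sum(axis=0)

# co_bought[i, j] = number of samples in which both item i and item j were bought
co_bought = X_demo.T @ X_demo

# not_co_bought[i, j] = number of samples in which item i was bought but item j was not
not_co_bought = purchases[:, np.newaxis] - co_bought

print(purchases)
print(co_bought[0, 1], not_co_bought[0, 1])
```

For `i != j`, `co_bought[i, j]` plays the role of `valid_rules[(i, j)]` and `not_co_bought[i, j]` the role of `invalid_rules[(i, j)]`.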

In [233]:
bought_key = int(input("Enter the key of a bought item, i.e. 0-4: "))
also_bought_key = int(input("Enter the key of a different also bought item, i.e. 0-4: "))

print("\n")

print("{0} was bought {1} times.".format(features[bought_key], num_purchases[bought_key]))
print("{0} was bought {1} times.".format(features[also_bought_key], num_purchases[also_bought_key]))

print("{0} valid rules exist for {1} bought with {2}.".format(valid_rules[(bought_key, also_bought_key)],
                                                                        features[bought_key], features[also_bought_key]))
print("{0} invalid rules exist for {1} bought with {2}.".format(invalid_rules[(bought_key, also_bought_key)], 
                                                                          features[bought_key], features[also_bought_key]))

Enter the key of a bought item, i.e. 0-4: 3
Enter the key of a different also bought item, i.e. 0-4: 2


apples was bought 36 times.
cheese was bought 41 times.
25 valid rules exist for apples bought with cheese.
11 invalid rules exist for apples bought with cheese.


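Two basic measures are used to evaluate the rules:

**Support** is the number of times a rule occurs in the data set, i.e. the number of samples in which both items were bought (the valid rule count).

**Confidence** is the accuracy of a rule. It is calculated as the number of times the rule is valid divided by the number of samples to which the rule applies, i.e. the number of times the first item was bought (`num_purchases[bought]` here).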
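
As a worked example, the counts printed above (apples bought 36 times, cheese bought 41 times, apples and cheese bought together 25 times) can be plugged into these definitions. The helper `confidence_of` is a hypothetical name used only for this sketch, and returning 0 for an item that was never bought is an assumption, not something the notebook defines.

```python
def confidence_of(valid_count, premise_count):
    """Fraction of the samples containing the first item that also contain the second."""
    if premise_count == 0:
        return 0.0  # assumption: an item that was never bought gives confidence 0
    return valid_count / premise_count

# Support is the raw co-occurrence count, so it is the same in both directions
support_apples_cheese = 25

# Confidence depends on which item is the premise
confidence_apples_cheese = confidence_of(25, 36)  # apples -> cheese
confidence_cheese_apples = confidence_of(25, 41)  # cheese -> apples

print(round(confidence_apples_cheese, 3))  # 0.694
print(round(confidence_cheese_apples, 3))  # 0.61
```

The two directions share a support of 25 but differ in confidence, because the denominator is the purchase count of the first item only.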

In [234]:
support = valid_rules
confidence = defaultdict(int)

for bought, also_bought in valid_rules.keys():
    rule = (bought, also_bought)
    confidence[rule] = round(valid_rules[rule] / num_purchases[bought], 3)

In [235]:
def print_rule(bought, also_bought, support, confidence, features):
    bought_item = features[bought]
    also_bought_item = features[also_bought]
    
    print("The people who bought {0} are likely to buy {1}.".format(bought_item, also_bought_item))
    print("- Support: {0}".format(support[bought, also_bought]))
    print("- Confidence: {0} or {1:.2f}%".format(confidence[bought, also_bought], 
                                                 100 * confidence[bought, also_bought]))

In [236]:
bought_key = int(input("Enter the key of a bought item, i.e. 0-4: "))
also_bought_key = int(input("Enter the key of a different also bought item, i.e. 0-4: "))

print_rule(bought_key, also_bought_key, support, confidence, features)

Enter the key of a bought item, i.e. 0-4: 3
Enter the key of a different also bought item, i.e. 0-4: 2
The people who bought apples are likely to buy cheese.
- Support: 25
- Confidence: 0.694 or 69.40%


The rules can be ranked by sorting them on support and on confidence.

In [237]:
from operator import itemgetter
#sort the (rule, value) pairs by their value, i.e. itemgetter(1), in descending order
sorted_support = sorted(support.items(), key=itemgetter(1), reverse=True)
sorted_confidence = sorted(confidence.items(), key=itemgetter(1), reverse=True)
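
Several rules can share the same support, in which case `sorted` keeps them in their original dictionary order. The sketch below shows one possible way to break such ties by confidence. The dictionaries `support_demo` and `confidence_demo` are illustrative names holding three values that appear in this notebook's output, with keys of the form `(bought, also_bought)`.

```python
from operator import itemgetter

# Three rules from this notebook's output: bananas->cheese, cheese->apples, cheese->bananas
support_demo = {(4, 2): 27, (2, 3): 25, (2, 4): 27}
confidence_demo = {(4, 2): 0.458, (2, 3): 0.61, (2, 4): 0.659}

# itemgetter(1) picks the value of each (rule, value) pair; tied rules keep dict order
by_support = sorted(support_demo.items(), key=itemgetter(1), reverse=True)

# Tuple key: support first, confidence as the tie-breaker
by_support_then_conf = sorted(
    support_demo.items(),
    key=lambda pair: (pair[1], confidence_demo[pair[0]]),
    reverse=True,
)

print([rule for rule, _ in by_support])
print([rule for rule, _ in by_support_then_conf])
```

With the tuple key, the two rules with support 27 are ordered by their confidence rather than by insertion order.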

In [238]:
def print_rule_by_top_support():
    top_n = int(input("Enter top 'n' number:"))
    print(" ")
    print("Top {0} rules sorted by support are:".format(top_n))
    for index in range(top_n):
        print("Rule #{0}".format(index + 1))
        (bought, also_bought) = sorted_support[index][0]
        print_rule(bought, also_bought, support, confidence, features)
        
print_rule_by_top_support()

Enter top 'n' number:3
 
Top 3 rules sorted by support are:
Rule #1
The people who bought cheese are likely to buy bananas.
- Support: 27
- Confidence: 0.659 or 65.90%
Rule #2
The people who bought bananas are likely to buy cheese.
- Support: 27
- Confidence: 0.458 or 45.80%
Rule #3
The people who bought cheese are likely to buy apples.
- Support: 25
- Confidence: 0.61 or 61.00%


In [239]:
def print_rule_by_top_confidence():
    top_n = int(input("Enter top 'n' number: "))
    print(" ")
    print("Top {0} rules sorted by confidence are:".format(top_n))
    for index in range(top_n):
        print("Rule #{0}".format(index + 1))
        (bought, also_bought) = sorted_confidence[index][0]
        print_rule(bought, also_bought, support, confidence, features)
        
print_rule_by_top_confidence()

Enter top 'n' number: 3
 
Top 3 rules sorted by confidence are:
Rule #1
The people who bought apples are likely to buy cheese.
- Support: 25
- Confidence: 0.694 or 69.40%
Rule #2
The people who bought cheese are likely to buy bananas.
- Support: 27
- Confidence: 0.659 or 65.90%
Rule #3
The people who bought bread are likely to buy bananas.
- Support: 17
- Confidence: 0.63 or 63.00%
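
The two printing functions above differ only in the sorted list they read, and their `for index in range(top_n)` loop indexes past the end of the list when `top_n` is larger than the number of rules. A minimal sketch of a shared helper that slices instead of indexing follows; `top_rules` and `sorted_demo` are hypothetical names introduced only for this example.

```python
def top_rules(sorted_rules, top_n):
    """Return the rule keys of the first top_n entries, never more than exist."""
    # Slicing stops at the end of the list, so a large top_n cannot raise IndexError
    return [rule for rule, _ in sorted_rules[:max(top_n, 0)]]

# Made-up, already sorted (rule, value) pairs for illustration
sorted_demo = [((2, 4), 27), ((4, 2), 27), ((2, 3), 25)]

print(top_rules(sorted_demo, 2))
print(top_rules(sorted_demo, 10))
```

The returned keys could then be passed one by one to `print_rule`, with `sorted_support` or `sorted_confidence` supplied as the first argument.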
