# Getting Started with Data Mining

Data mining is a methodology for training computers to make decisions with data, and it forms the backbone of many of today's high-tech systems.

The Python language is growing fast in popularity, and for good reason. It gives the programmer a lot of flexibility, it has a large number of modules for different tasks, and Python code is usually more readable and concise than equivalent code in many other languages. There is a large and active community of researchers, practitioners, and beginners using Python for data mining.

- **Introducing data mining**
- **A simple affinity analysis example**
- **A simple classification example**
- **What is classification**

## Introducing data mining

Data mining provides a way for a computer to learn how to make decisions with data. This decision could be predicting tomorrow's weather, blocking a spam email from entering your inbox, detecting the language of a website, or finding a new romance on a dating site. 

Data mining draws on algorithms, statistics, engineering, optimization, and computer science. We also use concepts and knowledge from other fields such as linguistics, neuroscience, or town planning. Applying data mining effectively usually requires this domain-specific knowledge to be integrated with the algorithms.

We start our data mining process by creating a dataset that describes an aspect of the real world. A dataset comprises two aspects:

- Samples that are objects in the real world. This can be a book, photograph, animal, person, or any other object.
- Features that are descriptions of the samples in our dataset. They could be the length, frequency of a given word, number of legs, date it was created, and so on. 
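As a rough illustration of samples and features, a dataset is often stored as a two-dimensional array with one row per sample and one column per feature. The sketch below uses a made-up toy dataset; the animals, the feature names, and the values are all hypothetical.

```python
import numpy as np

# Hypothetical toy dataset: each row is a sample (an animal),
# each column is a feature describing that sample.
feature_names = ["number_of_legs", "has_fur", "weight_kg"]
samples = np.array([
    [4, 1, 30.0],  # a dog
    [2, 0, 1.5],   # a chicken
    [4, 1, 4.2],   # a cat
])

# The shape of the array gives the number of samples and features.
n_samples, n_features = samples.shape
print(f"{n_samples} samples, {n_features} features")
```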

The next step is tuning the data mining algorithm. Each data mining algorithm has parameters, either within the algorithm or supplied by the user. This tuning allows the algorithm to learn how to make decisions about the data.

## A simple affinity analysis example

A common use case for data mining is to improve sales by asking a customer who is buying a product whether they would like another, similar product as well. This can be done through affinity analysis, which is the study of when things occur together.

## What is affinity analysis?

Affinity analysis is a type of data mining that measures the similarity between samples (objects). This could be the similarity between the following:

- users on a website, in order to provide varied services or targeted advertising
- items to sell to those users, in order to provide recommended movies or products
- human genes, in order to find people that share the same ancestors

We can measure affinity in a number of ways. For instance, we can record how frequently two products are purchased together. We can also record how accurate the statement "when a person buys object 1, they also buy object 2" is.
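The first of these measures, how frequently two products are purchased together, could be sketched as follows. This is only an illustration: the `baskets` list is invented data, and counting unordered pairs with `itertools.combinations` is one possible approach among several.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase records: one list of items per transaction.
baskets = [
    ["bread", "milk"],
    ["bread", "milk", "cheese"],
    ["apples", "bananas"],
    ["milk", "cheese"],
]

# Count how often each unordered pair of items appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    # Sorting makes ("bread", "milk") and ("milk", "bread") the same key.
    for pair in combinations(sorted(set(basket)), 2):
        pair_counts[pair] += 1

for pair, count in pair_counts.items():
    print(pair, count)
```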

## Product recommendations

One of the issues with moving a traditional business, such as retail, online is that tasks that used to be done by humans need to be automated in order for the online business to scale. One example of this is up-selling, or selling an extra item to a customer who is already buying. Automated product recommendations through data mining are one of the driving forces behind the e-commerce revolution, which generates billions of dollars in revenue per year.

Product recommendations are based on the following idea: when two items are historically purchased together, they are more likely to be purchased together in the future. This sort of thinking is behind many product recommendation services, in both online and offline businesses.

A very simple algorithm for this type of product recommendation is to find any historical case where a user has bought an item and to recommend other items that the historical user bought. In practice, simple algorithms such as this can do well, at least better than choosing random items to recommend. However, they can be improved upon significantly, which is where data mining comes in.
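A minimal sketch of this simple history-based approach might look like the following. The `histories` data and the `recommend` helper are hypothetical and exist only to illustrate the idea; ranking the candidates by how often they co-occur is an added assumption, not something the description above requires.

```python
from collections import Counter

# Hypothetical purchase histories: one set of items per past customer.
histories = [
    {"bread", "milk"},
    {"bread", "milk", "cheese"},
    {"apples", "bananas"},
    {"bread", "cheese"},
]

def recommend(item, histories, top_n=2):
    """Recommend the items most often bought alongside `item`."""
    counts = Counter()
    for basket in histories:
        if item in basket:
            # Every other item in this basket is a candidate.
            counts.update(basket - {item})
    return [other for other, _ in counts.most_common(top_n)]

print(recommend("apples", histories))
```

An item that never appears in any history yields an empty list, so a real system would need a fallback for new products.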

To simplify the coding, we will consider only two items at a time. As an example, people may buy bread and milk at the same time at the supermarket. In this early example, we wish to find simple rules of the form:

*If a person buys product X, then they are likely to purchase product Y*


In [1]:
import numpy as np

In [3]:
DATA = 'data/'
AFFINITY_DATASET = DATA + 'affinity_dataset.txt'

In [9]:
x = np.loadtxt(AFFINITY_DATASET)
n_samples, n_features = x.shape
print(f'This dataset has {n_samples} samples and {n_features} features')

This dataset has 100 samples and 5 features


Each of these features contains binary values, stating only whether the item was purchased and not how many were purchased. A 1 indicates that at least one item of this type was bought, while a 0 indicates that none of that item was purchased.

In [10]:
print(x[:5])

[[0. 0. 1. 1. 1.]
 [1. 1. 0. 1. 0.]
 [1. 0. 1. 1. 0.]
 [0. 0. 1. 1. 1.]
 [0. 1. 0. 0. 1.]]
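
Because the values are binary, summing a column counts the transactions that include that item. The sketch below re-enters the five rows printed above by hand, so it runs without the data file; the column names follow the `features` list used in this notebook, and `first_rows` is an illustrative name.

```python
import numpy as np

# Column order as given by the notebook's features list.
features = ["bread", "milk", "cheese", "apples", "bananas"]

# The five rows printed above, copied by hand.
first_rows = np.array([
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
])

# Summing down each column counts the transactions containing that item.
purchases_per_item = first_rows.sum(axis=0)
for name, count in zip(features, purchases_per_item):
    print(f"{name}: bought in {count} of {len(first_rows)} transactions")
```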


## Implementing a simple ranking of rules

We wish to find rules of the type *If a person buys product X, then they are likely to purchase product Y*. We can easily list all of the rules in our dataset by finding all occasions when two products were purchased together. However, we then need a way to tell good rules from bad ones. This will allow us to choose specific products to recommend.

Rules of this type can be measured in many ways, of which we will focus on two: **support** and **confidence**.

#### Support 
Support is the number of times that a rule occurs in a dataset, which is computed by counting the number of samples for which the rule is valid.

**Note**: Support is sometimes normalized by dividing by the total number of samples in the dataset, but we will simply use the raw count in this implementation.

#### Confidence 
Confidence measures how accurate a rule is when it can be applied. It is computed as the percentage of times the rule holds when the premise applies.
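
The two definitions can be sketched as a small function. This is an illustrative helper rather than the notebook's own code: `rule_support_confidence` and the `toy` array are hypothetical names, and returning a confidence of 0.0 when the premise never occurs is an assumption made here to avoid division by zero.

```python
import numpy as np

def rule_support_confidence(data, premise, conclusion):
    """Return (support, confidence) for the rule premise -> conclusion.

    `data` is a binary samples-by-features array; `premise` and
    `conclusion` are column indices.
    """
    premise_mask = data[:, premise] == 1
    both_mask = premise_mask & (data[:, conclusion] == 1)
    # Support: number of samples where premise and conclusion both hold.
    support = int(both_mask.sum())
    # Confidence: fraction of premise samples where the conclusion holds.
    n_premise = int(premise_mask.sum())
    confidence = support / n_premise if n_premise else 0.0
    return support, confidence

# Toy data: column 0 is the premise, column 1 the conclusion.
toy = np.array([
    [1, 1],
    [1, 0],
    [1, 1],
    [0, 1],
])
support, confidence = rule_support_confidence(toy, 0, 1)
print(f"support={support}, confidence={confidence:.3f}")
```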

In [11]:

# The names of the features, for your reference.
features = ["bread", "milk", "cheese", "apples", "bananas"]


In our first example, we will compute the Support and Confidence of the rule "If a person buys Apples, they also buy Bananas".

In [12]:
# First, how many rows contain our premise: that a person is buying apples
num_apple_purchases = 0
for sample in x:
    if sample[3] == 1:  # This person bought Apples
        num_apple_purchases += 1
print(f"{num_apple_purchases} people bought Apples")


36 people bought Apples


In [13]:
# Of the people who bought Apples, how many also bought Bananas?
# Record both the cases where the rule is valid and those where it is invalid.
rule_valid = 0
rule_invalid = 0
for sample in x:
    if sample[3] == 1:  # This person bought Apples
        if sample[4] == 1:
            # This person bought both Apples and Bananas
            rule_valid += 1
        else:
            # This person bought Apples, but not Bananas
            rule_invalid += 1
print(f"{rule_valid} cases of the rule being valid were discovered")
print(f"{rule_invalid} cases of the rule being invalid were discovered")


21 cases of the rule being valid were discovered
15 cases of the rule being invalid were discovered


In [14]:
# Now we have all the information needed to compute Support and Confidence
support = rule_valid  # The Support is the number of times the rule is discovered.
confidence = rule_valid / num_apple_purchases
print("The support is {0} and the confidence is {1:.3f}.".format(support, confidence))
# Confidence can be thought of as a percentage using the following:
print("As a percentage, that is {0:.1f}%.".format(100 * confidence))


The support is 21 and the confidence is 0.583.
As a percentage, that is 58.3%.


In [15]:
from collections import defaultdict
# Now compute for all possible rules
valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurences = defaultdict(int)

for sample in x:
    for premise in range(n_features):
        if sample[premise] == 0:
            continue
        # Record one more transaction that includes the premise
        num_occurences[premise] += 1
        for conclusion in range(n_features):
            if premise == conclusion:  # It makes little sense to measure if X -> X.
                continue
            if sample[conclusion] == 1:
                # This person also bought the conclusion item
                valid_rules[(premise, conclusion)] += 1
            else:
                # This person bought the premise, but not the conclusion
                invalid_rules[(premise, conclusion)] += 1
support = valid_rules
confidence = defaultdict(float)
for premise, conclusion in valid_rules.keys():
    confidence[(premise, conclusion)] = valid_rules[(premise, conclusion)] / num_occurences[premise]
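
The same counts can also be obtained with a matrix product instead of nested loops. The following is a sketch of that alternative, not the approach used in this notebook: it works on the five rows printed earlier rather than the full file, and `sample_rows`, `co_occurrence`, `item_counts`, and `confidence_matrix` are illustrative names.

```python
import numpy as np

# The first five rows of the dataset, copied by hand for a runnable example.
sample_rows = np.array([
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
])

# Entry [i, j] counts the rows in which items i and j were both bought,
# which is the support of the rule i -> j (and of j -> i).
co_occurrence = sample_rows.T @ sample_rows

# The diagonal holds the number of rows in which each item was bought.
item_counts = np.diag(co_occurrence)

# Dividing row i by the count of item i gives the confidence of i -> j.
# An item that was never bought would produce a division by zero here.
confidence_matrix = co_occurrence / item_counts[:, np.newaxis]

print(co_occurrence[3, 4])      # support of apples -> bananas in these rows
print(confidence_matrix[3, 4])  # confidence of apples -> bananas
```

The diagonal of `confidence_matrix` is always 1, since it corresponds to rules of the form X -> X, and should be ignored when ranking.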


In [16]:
for premise, conclusion in confidence:
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print(f"Rule: If a person buys {premise_name} they will also buy {conclusion_name}")
    print(f" - Confidence: {confidence[(premise, conclusion)]:.3f}")
    print(f" - Support: {support[(premise, conclusion)]}")

Rule: If a person buys cheese they will also buy apples
 - Confidence: 0.610
 - Support: 25
Rule: If a person buys cheese they will also buy bananas
 - Confidence: 0.659
 - Support: 27
Rule: If a person buys apples they will also buy cheese
 - Confidence: 0.694
 - Support: 25
Rule: If a person buys apples they will also buy bananas
 - Confidence: 0.583
 - Support: 21
Rule: If a person buys bananas they will also buy cheese
 - Confidence: 0.458
 - Support: 27
Rule: If a person buys bananas they will also buy apples
 - Confidence: 0.356
 - Support: 21
Rule: If a person buys bread they will also buy milk
 - Confidence: 0.519
 - Support: 14
Rule: If a person buys bread they will also buy apples
 - Confidence: 0.185
 - Support: 5
Rule: If a person buys milk they will also buy bread
 - Confidence: 0.304
 - Support: 14
Rule: If a person buys milk they will also buy apples
 - Confidence: 0.196
 - Support: 9
Rule: If a person buys apples they will also buy bread
 - Confidence: 0.139
 - Support: 5

In [17]:
def print_rule(premise, conclusion, support, confidence, features):
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print(f"Rule: If a person buys {premise_name} they will also buy {conclusion_name}")
    print(f" - Confidence: {confidence[(premise, conclusion)]:.3f}")
    print(f" - Support: {support[(premise, conclusion)]}")

In [18]:
premise = 1
conclusion = 3
print_rule(premise, conclusion, support, confidence, features)

Rule: If a person buys milk they will also buy apples
 - Confidence: 0.196
 - Support: 9


In [19]:
# Inspect the support dictionary before sorting it
from pprint import pprint
pprint(list(support.items()))

[((2, 3), 25),
 ((2, 4), 27),
 ((3, 2), 25),
 ((3, 4), 21),
 ((4, 2), 27),
 ((4, 3), 21),
 ((0, 1), 14),
 ((0, 3), 5),
 ((1, 0), 14),
 ((1, 3), 9),
 ((3, 0), 5),
 ((3, 1), 9),
 ((0, 2), 4),
 ((2, 0), 4),
 ((1, 4), 19),
 ((4, 1), 19),
 ((0, 4), 17),
 ((4, 0), 17),
 ((1, 2), 7),
 ((2, 1), 7)]


In [20]:
from operator import itemgetter
sorted_support = sorted(support.items(), key=itemgetter(1), reverse=True)


In [21]:
for index in range(5):
    print(f"Rule #{index + 1}")
    (premise, conclusion) = sorted_support[index][0]
    print_rule(premise, conclusion, support, confidence, features)


Rule #1
Rule: If a person buys cheese they will also buy bananas
 - Confidence: 0.659
 - Support: 27
Rule #2
Rule: If a person buys bananas they will also buy cheese
 - Confidence: 0.458
 - Support: 27
Rule #3
Rule: If a person buys cheese they will also buy apples
 - Confidence: 0.610
 - Support: 25
Rule #4
Rule: If a person buys apples they will also buy cheese
 - Confidence: 0.694
 - Support: 25
Rule #5
Rule: If a person buys apples they will also buy bananas
 - Confidence: 0.583
 - Support: 21


In [22]:
sorted_confidence = sorted(confidence.items(), key=itemgetter(1), reverse=True)

In [23]:
for index in range(5):
    print(f"Rule #{index + 1}")
    (premise, conclusion) = sorted_confidence[index][0]
    print_rule(premise, conclusion, support, confidence, features)

Rule #1
Rule: If a person buys apples they will also buy cheese
 - Confidence: 0.694
 - Support: 25
Rule #2
Rule: If a person buys cheese they will also buy bananas
 - Confidence: 0.659
 - Support: 27
Rule #3
Rule: If a person buys bread they will also buy bananas
 - Confidence: 0.630
 - Support: 17
Rule #4
Rule: If a person buys cheese they will also buy apples
 - Confidence: 0.610
 - Support: 25
Rule #5
Rule: If a person buys apples they will also buy bananas
 - Confidence: 0.583
 - Support: 21
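
The two ranking loops above differ only in which dictionary they sort, so the ranking step could be factored into one helper. The sketch below is a possible refactoring, not part of the original notebook; `top_rules` is a hypothetical name, and the dictionaries hold only a handful of the values printed above so that the block runs on its own.

```python
from operator import itemgetter

def top_rules(scores, n=5):
    """Return the n (premise, conclusion) pairs with the highest score."""
    ranked = sorted(scores.items(), key=itemgetter(1), reverse=True)
    return [rule for rule, _ in ranked[:n]]

# A subset of the values printed above, keyed by (premise, conclusion).
confidence = {(3, 2): 0.694, (2, 4): 0.659, (0, 4): 0.630,
              (2, 3): 0.610, (3, 4): 0.583, (4, 2): 0.458}
support = {(3, 2): 25, (2, 4): 27, (0, 4): 17,
           (2, 3): 25, (3, 4): 21, (4, 2): 27}

print(top_rules(confidence, n=3))
print(top_rules(support, n=2))
```

Rules with equal scores keep the order in which they appear in the dictionary, because `sorted` is stable.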
