# Getting Started with Data Mining

Data mining is a methodology that we can employ to train computers to make decisions with data, and it forms the backbone of many of today's high-tech systems.

The Python language is growing fast in popularity, and for good reason. It gives the programmer a lot of flexibility, it has a large number of modules to perform different tasks, and Python code is usually more readable and concise than code in many other languages. There is a large and active community of researchers, practitioners, and beginners using Python for data mining.

- **Introducing data mining**
- **A simple affinity analysis example**
- **A simple classification example**
- **What is classification**

## Introducing data mining

Data mining provides a way for a computer to learn how to make decisions with data. This decision could be predicting tomorrow's weather, blocking a spam email from entering your inbox, detecting the language of a website, or finding a new romance on a dating site. 

Data mining draws on algorithms, statistics, engineering, optimization, and computer science. We also use concepts and knowledge from other fields such as linguistics, neuroscience, or town planning. Applying it effectively usually requires this domain-specific knowledge to be integrated with the algorithms.

We start our data mining process by creating a dataset that describes an aspect of the real world. Datasets comprise two aspects:

- Samples that are objects in the real world. This can be a book, photograph, animal, person, or any other object.
- Features that are descriptions of the samples in our dataset. They could be the length, frequency of a given word, number of legs, date it was created, and so on. 
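
As a rough illustration of this layout, a dataset can be held in a two-dimensional NumPy array in which each row is a sample and each column is a feature. The animals, feature names, and values below are made up for this sketch and do not come from any real dataset.

```python
import numpy as np

# Hypothetical toy dataset: each row is a sample (an animal),
# each column is a feature describing that sample.
toy_feature_names = ["number_of_legs", "weight_kg"]
toy_samples = np.array([
    [4, 30.0],   # a dog
    [2, 1.5],    # a chicken
    [8, 0.01],   # a spider
])

# The array shape gives (number of samples, number of features)
n_toy_samples, n_toy_features = toy_samples.shape
print(f"{n_toy_samples} samples, {n_toy_features} features")
```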

The next step is tuning the data mining algorithm. Each data mining algorithm has parameters, either within the algorithm or supplied by the user. This tuning allows the algorithm to learn how to make decisions about the data.

## A simple affinity analysis example

A common use case for data mining is to improve sales by asking a customer who is buying a product if they would like another similar product as well. This can be done through affinity analysis, which is the study of when things occur together.

## What is affinity analysis?

Affinity analysis is a type of data mining that gives similarity between samples (objects). This could be the similarity between the following:

- users on a website, in order to provide varied services or targeted advertising
- items to sell to those users, in order to provide recommended movies or products
- human genes, in order to find people that share the same ancestors

We can measure affinity in a number of ways. For instance, we can record how frequently two products are purchased together. We can also record the accuracy of the statement *when a person buys object 1, they also buy object 2*.

## Product recommendations

One of the issues with moving a traditional business online, such as commerce, is that tasks that used to be done by humans need to be automated in order for the online business to scale. One example of this is up-selling, or selling an extra item to a customer who is already buying. Automated product recommendations through data mining are one of the driving forces behind the e-commerce revolution that is turning billions of dollars per year into revenue.

Product recommendations are based on the following idea: when two items are historically purchased together, they are more likely to be purchased together in the future. This sort of thinking is behind many product recommendation services, in both online and offline businesses.

A very simple algorithm for this type of product recommendation is to find any historical case where a user has bought an item and to recommend the other items that the historical user bought. In practice, simple algorithms such as this can do well, at least better than choosing random items to recommend. However, they can be improved upon significantly, which is where data mining comes in.
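
A minimal sketch of this naive approach is shown below. It assumes purchase histories are stored as a list of sets; the `past_orders` data and the `recommend_naive` helper are hypothetical names invented for this illustration.

```python
# Hypothetical purchase history: one set of items per past order.
past_orders = [
    {"bread", "milk"},
    {"bread", "cheese", "apples"},
    {"milk", "bananas"},
]


def recommend_naive(item, orders):
    """Recommend every item that has ever appeared in an order with `item`."""
    recommendations = set()
    for order in orders:
        if item in order:
            # Add the other items from this order, excluding the item itself
            recommendations |= order - {item}
    return recommendations


print(sorted(recommend_naive("bread", past_orders)))
```

Note that this treats an item bought together once the same as one bought together a hundred times, which is the weakness the support and confidence measures address.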

To simplify the coding, we will consider only two items at a time. As an example, people may buy bread and milk at the same time at the supermarket. In this early example, we wish to find simple rules of the form:

*If a person buys product X, then they are likely to purchase product Y*


In [1]:
import numpy as np

In [3]:
DATA = 'data/'
AFFINITY_DATASET = DATA + 'affinity_dataset.txt'

In [9]:
x = np.loadtxt(AFFINITY_DATASET)
n_samples, n_features = x.shape
print(f'This dataset has {n_samples} samples and {n_features} features')

This dataset has 100 samples and 5 features


Each of these features contains binary values, stating only whether the items were purchased and not how many of them were purchased. A 1 indicates that at least one item of this type was bought, while a 0 indicates that none of that item was purchased.

In [10]:
print(x[:5])

[[0. 0. 1. 1. 1.]
 [1. 1. 0. 1. 0.]
 [1. 0. 1. 1. 0.]
 [0. 0. 1. 1. 1.]
 [0. 1. 0. 0. 1.]]


## Implementing a simple ranking of rules

We wish to find rules of the type *If a person buys product X, then they are likely to purchase product Y*. We can quite easily create a list of all of the rules in our dataset by simply finding all occasions when two products were purchased together. However, we then need a way to distinguish good rules from bad ones. This will allow us to choose specific products to recommend.

Rules of this type can be measured in many ways, of which we will focus on two: **support** and **confidence**.

#### Support 
is the number of times that a rule occurs in a dataset, which is computed by simply counting the number of samples that the rule is valid for. 

**Note**: It can sometimes be normalized by dividing by the total number of samples in the dataset, but we will simply count the total for this implementation.

#### Confidence 
measures how accurate a rule is when it can be used. It is computed by determining the proportion of times the rule applies when the premise applies.
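
To make the two measures concrete, here is a small sketch on a made-up array with only two products, X and Y. The `toy` array and the variable names are assumptions for this illustration and are separate from the dataset loaded above.

```python
import numpy as np

# Hypothetical toy purchases: columns are [X, Y], one row per transaction.
toy = np.array([
    [1, 1],
    [1, 0],
    [1, 1],
    [0, 1],
])

premise_mask = toy[:, 0] == 1                 # transactions where X was bought
rule_mask = premise_mask & (toy[:, 1] == 1)   # transactions with both X and Y

# Support: the number of transactions in which the rule X -> Y holds
toy_support = int(rule_mask.sum())
# Confidence: the share of X purchases that also included Y
toy_confidence = toy_support / premise_mask.sum()

print(toy_support, round(toy_confidence, 3))
```

Here the rule holds in two transactions out of the three in which X was bought, so the support is 2 and the confidence is two thirds.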

In [11]:

# The names of the features, for your reference.
features = ["bread", "milk", "cheese", "apples", "bananas"]


In our first example, we will compute the Support and Confidence of the rule "If a person buys Apples, they also buy Bananas".

In [12]:
# First, how many rows contain our premise: that a person is buying apples
num_apple_purchases = 0
for sample in x:
    if sample[3] == 1:  # This person bought Apples
        num_apple_purchases += 1
print(f"{num_apple_purchases} people bought Apples")


36 people bought Apples


Similarly, we can check if bananas were bought in a transaction by seeing if the value for `sample[4]` is equal to 1 (and so on). We can now compute the number of times our rule exists in our dataset and, from that, the confidence and support.

In [13]:
# How many of the cases that a person bought Apples involved the people purchasing Bananas too?
# Record both cases where the rule is valid and is invalid.
rule_valid = 0
rule_invalid = 0
for sample in x:
    if sample[3] == 1:  # This person bought Apples
        if sample[4] == 1:
            # This person bought both Apples and Bananas
            rule_valid += 1
        else:
            # This person bought Apples, but not Bananas
            rule_invalid += 1
print(f"{rule_valid} cases of the rule being valid were discovered")
print(f"{rule_invalid} cases of the rule being invalid were discovered")


21 cases of the rule being valid were discovered
15 cases of the rule being invalid were discovered


In [24]:
# Now we have all the information needed to compute Support and Confidence
support = rule_valid  # The Support is the number of times the rule is discovered.
confidence = rule_valid / num_apple_purchases
print(f"The support is {support} and the confidence is {confidence:.3f}.")
# Confidence can be thought of as a percentage using the following:
print(f"As a percentage, that is {100 * confidence:.1f}%.")


The support is 21 and the confidence is 0.583.
As a percentage, that is 58.3%.


Now we are going to compute the confidence and support for all possible rules.

In [15]:
from collections import defaultdict
# Now compute for all possible rules
valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurences = defaultdict(int)

for sample in x:
    for premise in range(n_features):
        if sample[premise] == 0: 
            continue
        # Record that the premise was bought in another transaction
        num_occurences[premise] += 1
        for conclusion in range(n_features):
            if premise == conclusion:  # It makes little sense to measure if X -> X.
                continue
            if sample[conclusion] == 1:
                # This person also bought the conclusion item
                valid_rules[(premise, conclusion)] += 1
            else:
                # This person bought the premise, but not the conclusion
                invalid_rules[(premise, conclusion)] += 1
support = valid_rules
confidence = defaultdict(float)
for premise, conclusion in valid_rules.keys():
    confidence[(premise, conclusion)] = valid_rules[(premise, conclusion)] / num_occurences[premise]
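
The nested loops above are easy to follow, but the same counts can also be obtained with matrix operations. The following is one possible vectorized sketch on a small made-up array (`toy_x` is hypothetical data, not the affinity dataset), assuming every item is bought at least once so that no division by zero occurs.

```python
import numpy as np

# Hypothetical toy purchases: 4 transactions, 3 items (binary values).
toy_x = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
])

# co_counts[i, j] = number of transactions containing both item i and item j.
# The diagonal holds how often each item was bought at all.
co_counts = toy_x.T @ toy_x
item_counts = np.diag(co_counts)

# Confidence of rule i -> j: co-purchases divided by purchases of premise i.
# Diagonal entries (rules of the form X -> X) are always 1 and should be ignored.
conf_matrix = co_counts / item_counts[:, np.newaxis]

print(co_counts)
print(conf_matrix.round(3))
```

The off-diagonal entries of `co_counts` play the role of the support values, and the matching entries of `conf_matrix` play the role of the confidence values.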


In [16]:
for premise, conclusion in confidence:
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print(f"Rule: If a person buys {premise_name} they will also buy {conclusion_name}")
    print(f" - Confidence: {confidence[(premise, conclusion)]:.3f}")
    print(f" - Support: {support[(premise, conclusion)]}")

Rule: If a person buys cheese they will also buy apples
 - Confidence: 0.610
 - Support: 25
Rule: If a person buys cheese they will also buy bananas
 - Confidence: 0.659
 - Support: 27
Rule: If a person buys apples they will also buy cheese
 - Confidence: 0.694
 - Support: 25
Rule: If a person buys apples they will also buy bananas
 - Confidence: 0.583
 - Support: 21
Rule: If a person buys bananas they will also buy cheese
 - Confidence: 0.458
 - Support: 27
Rule: If a person buys bananas they will also buy apples
 - Confidence: 0.356
 - Support: 21
Rule: If a person buys bread they will also buy milk
 - Confidence: 0.519
 - Support: 14
Rule: If a person buys bread they will also buy apples
 - Confidence: 0.185
 - Support: 5
Rule: If a person buys milk they will also buy bread
 - Confidence: 0.304
 - Support: 14
Rule: If a person buys milk they will also buy apples
 - Confidence: 0.196
 - Support: 9
Rule: If a person buys apples they will also buy bread
 - Confidence: 0.139
 - Support: 5

In [17]:
def print_rule(premise, conclusion, support, confidence, features):
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print(f"Rule: If a person buys {premise_name} they will also buy {conclusion_name}")
    print(f" - Confidence: {confidence[(premise, conclusion)]:.3f}")
    print(f" - Support: {support[(premise, conclusion)]}")

In [18]:
premise = 1
conclusion = 3
print_rule(premise, conclusion, support, confidence, features)

Rule: If a person buys milk they will also buy apples
 - Confidence: 0.196
 - Support: 9


In [19]:
# Display the raw support counts (not yet sorted)
from pprint import pprint
pprint(list(support.items()))

[((2, 3), 25),
 ((2, 4), 27),
 ((3, 2), 25),
 ((3, 4), 21),
 ((4, 2), 27),
 ((4, 3), 21),
 ((0, 1), 14),
 ((0, 3), 5),
 ((1, 0), 14),
 ((1, 3), 9),
 ((3, 0), 5),
 ((3, 1), 9),
 ((0, 2), 4),
 ((2, 0), 4),
 ((1, 4), 19),
 ((4, 1), 19),
 ((0, 4), 17),
 ((4, 0), 17),
 ((1, 2), 7),
 ((2, 1), 7)]


## Ranking to find the best rules

Now that we can compute the support and confidence of all rules, we want to be able to find the best rules. To do this, we perform a ranking and print the ones with the highest values. We can do this for both the support and confidence values.

In [20]:
from operator import itemgetter
sorted_support = sorted(support.items(), key=itemgetter(1), reverse=True)


In [21]:
for index in range(5):
    print(f"Rule #{index + 1}")
    (premise, conclusion) = sorted_support[index][0]
    print_rule(premise, conclusion, support, confidence, features)


Rule #1
Rule: If a person buys cheese they will also buy bananas
 - Confidence: 0.659
 - Support: 27
Rule #2
Rule: If a person buys bananas they will also buy cheese
 - Confidence: 0.458
 - Support: 27
Rule #3
Rule: If a person buys cheese they will also buy apples
 - Confidence: 0.610
 - Support: 25
Rule #4
Rule: If a person buys apples they will also buy cheese
 - Confidence: 0.694
 - Support: 25
Rule #5
Rule: If a person buys apples they will also buy bananas
 - Confidence: 0.583
 - Support: 21


Similarly, we can print the top rules based on confidence. First, compute the sorted confidence list:

In [22]:
sorted_confidence = sorted(confidence.items(), key=itemgetter(1), reverse=True)

In [23]:
for index in range(5):
    print(f"Rule #{index + 1}")
    (premise, conclusion) = sorted_confidence[index][0]
    print_rule(premise, conclusion, support, confidence, features)

Rule #1
Rule: If a person buys apples they will also buy cheese
 - Confidence: 0.694
 - Support: 25
Rule #2
Rule: If a person buys cheese they will also buy bananas
 - Confidence: 0.659
 - Support: 27
Rule #3
Rule: If a person buys bread they will also buy bananas
 - Confidence: 0.630
 - Support: 17
Rule #4
Rule: If a person buys cheese they will also buy apples
 - Confidence: 0.610
 - Support: 25
Rule #5
Rule: If a person buys apples they will also buy bananas
 - Confidence: 0.583
 - Support: 21


Two rules are near the top of both lists. The first is **If a person buys apples, they will also buy cheese**, and the second is **If a person buys cheese, they will also buy bananas**. A store manager can use rules like these to organize their store. For example, if apples are on sale this week, put a display of cheeses nearby. Similarly, it would make little sense to put bananas on sale at the same time as cheese, as nearly 66 percent of people buying cheese will buy bananas anyway, so our sale won't increase banana purchases all that much.

Data mining has great exploratory power in examples like this. A person can use data mining techniques to explore relationships within their datasets to find new insights.

## A simple classification example

In the affinity analysis example, we looked for correlations between different variables in our dataset. In classification, we instead have a single variable that we are interested in and that we call the class (also called the target). If, in the previous example, we were interested in how to make people buy more apples, we could set that variable to be the class and look for classification rules that obtain that goal. We would then look only for rules that relate to that goal.

## What is classification?

Classification is one of the largest uses of data mining, both in practical use and in research. As before, we have a set of samples that represents objects or things we are interested in classifying. We also have a new array, the class values. These class values give us a categorization of the samples. Some examples are as follows:

- Determining the species of a plant by looking at its measurements. The class value here would be *Which species is this?*.
- Determining if an image contains a dog. The class would be *Is there a dog in this image?*.
- Determining if a patient has cancer based on the test results. The class would be *Does this patient have cancer?*.

The goal of classification applications is to train a model on a set of samples with known classes, and then apply that model to new unseen samples with unknown classes. For example, we may want to train a spam classifier on our past emails, which we have labeled as spam or not spam. We then want to use that classifier to determine whether the next email is spam, without needing to classify it ourselves.

## Loading and preparing the dataset

The dataset we are going to use for this example is the famous **Iris dataset** of plant classification.

We have 150 plant samples and four measurements of each (all in centimeters): 
- **sepal length**
- **sepal width**
- **petal length**
- **petal width**

This dataset (first used in 1936!) is one of the classic datasets for data mining.
There are three classes: 
- **Iris Setosa**
- **Iris Versicolour**
- **Iris Virginica**

The aim is to determine which type of plant a sample is, by examining its measurements.

In [25]:
from sklearn.datasets import load_iris


dataset = load_iris()
X = dataset.data
y = dataset.target


print(dataset.DESCR)
n_samples, n_features = X.shape

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

The features in this dataset are continuous values, meaning they can take any value within a range. Measurements are a good example of this type of feature, where a measurement can take the value of 1, 1.2, or 1.25 and so on. Another aspect of continuous features is that feature values that are close to each other indicate similarity. A plant with a sepal length of 1.2 cm is like a plant with a sepal length of 1.25 cm.

In contrast are categorical features. These features, while often represented as numbers, cannot be compared in the same way. In the Iris dataset, the class values are an example of a categorical feature. 

- class 0 represents Iris Setosa
- class 1 represents Iris Versicolour
- class 2 represents Iris Virginica 

This doesn't mean that Iris Setosa is more similar to Iris Versicolour than it is to Iris Virginica. The numbers here represent categories. All we can say is whether categories are the same or different.

## Implementing the OneR algorithm

OneR is shorthand for *One Rule*, indicating we only use a single rule for this classification by choosing the feature with the best performance. This simple algorithm has been shown to have good performance on a number of real-world datasets.

Our attributes are continuous, while we want categorical features to use OneR. We will perform a preprocessing step called discretisation. At this stage, we will perform a simple procedure: compute the mean and determine whether a value is above or below the mean.
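
The mean-threshold idea can be sketched on a small made-up array before applying it to the real measurements. The `toy_measurements` values below are hypothetical and chosen so the column means are easy to check by hand.

```python
import numpy as np

# Hypothetical measurements: 4 samples, 2 continuous features.
toy_measurements = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [6.0, 60.0],
])

# Per-feature means: one threshold per column (3.0 and 30.0 here)
toy_means = toy_measurements.mean(axis=0)

# 1 where the value is at or above its column mean, 0 otherwise.
# Broadcasting compares each column against its own threshold.
toy_discrete = (toy_measurements >= toy_means).astype(int)
print(toy_discrete)
```

Each continuous feature becomes a binary categorical feature, which is the form OneR needs.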

### The algorithm 
- Start by iterating over every value of every feature
- For that value, count the number of samples from each class that have that feature value
- Record the most frequent class for the feature value, and the error of that prediction

    - **For example**: If a feature has two values, 0 and 1, we first check all samples that have the value 0. For that value, we may have 20 in class A, 60 in class B, and a further 20 in class C. The most frequent class for this value is B, and there are 40 instances that have different classes. The prediction for this feature value is B with an error of 40, as there are 40 samples that have a different class from the prediction. We then do the same procedure for the value 1 for this feature, and then for all other feature value combinations.

- We compute the error for each feature by summing up the errors for all values for that feature
- The feature with the lowest total error is chosen as the One Rule and then used to classify other instances
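
The worked example above (20 samples in class A, 60 in class B, and 20 in class C for one feature value) can be reproduced in a few lines. This is only a sketch of the counting step for a single feature value, using `collections.Counter` and made-up labels rather than the implementation used later in this section.

```python
from collections import Counter

# Class labels of the 100 hypothetical samples whose feature value is 0:
# 20 in class A, 60 in class B, and 20 in class C.
labels_for_value_0 = ["A"] * 20 + ["B"] * 60 + ["C"] * 20

class_counts = Counter(labels_for_value_0)
# most_common(1) returns a list with the single most frequent (class, count) pair
predicted_class, predicted_count = class_counts.most_common(1)[0]

# Every sample with this feature value outside the predicted class is an error
error_for_value_0 = sum(class_counts.values()) - predicted_count

print(predicted_class, error_for_value_0)  # B 40
```

Repeating this for every value of a feature and summing the errors gives that feature's total error.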

In [26]:

# Compute the mean for each attribute
attribute_means = X.mean(axis=0)
assert attribute_means.shape == (n_features,)
X_d = np.array(X >= attribute_means, dtype='int')

In [29]:
# Now, we split into a training and test set
from sklearn.model_selection import train_test_split

# Set the random state to the same number to get the same results every time
random_state = 14

X_train, X_test, y_train, y_test = train_test_split(X_d, y, random_state=random_state)
print("There are {} training samples".format(y_train.shape[0]))
print("There are {} testing samples".format(y_test.shape[0]))

There are 112 training samples
There are 38 testing samples


In [30]:
from collections import defaultdict
from operator import itemgetter


def train(X, y_true, feature):
    """Computes the predictors and error for a given feature using the OneR algorithm
    
    Parameters
    ----------
    X: array [n_samples, n_features]
        The two dimensional array that holds the dataset. Each row is a sample, each column
        is a feature.
    
    y_true: array [n_samples,]
        The one dimensional array that holds the class values. Corresponds to X, such that
        y_true[i] is the class value for sample X[i].
    
    feature: int
        An integer corresponding to the index of the feature we wish to test.
        0 <= feature < n_features
        
    Returns
    -------
    predictors: dictionary mapping each feature value to a predicted class
        For each sample, if the feature has a given value, make the given prediction.
    
    error: int
        The number of training samples that this rule incorrectly predicts.
    """
    # Check that feature is a valid index
    n_samples, n_features = X.shape
    assert 0 <= feature < n_features
    # Get all of the unique values that this variable has
    values = set(X[:,feature])
    # Stores the predictors array that is returned
    predictors = dict()
    errors = []
    for current_value in values:
        most_frequent_class, error = train_feature_value(X, y_true, feature, current_value)
        predictors[current_value] = most_frequent_class
        errors.append(error)
    # Compute the total error of using this feature to classify on
    total_error = sum(errors)
    return predictors, total_error

# Compute what our predictors say each sample is based on its value
#y_predicted = np.array([predictors[sample[feature]] for sample in X])
    

def train_feature_value(X, y_true, feature, value):
    # Create a simple dictionary to count how frequently each class occurs for this feature value
    class_counts = defaultdict(int)
    # Iterate through each sample and count the frequency of each class/value pair
    for sample, y in zip(X, y_true):
        if sample[feature] == value:
            class_counts[y] += 1
    # Now get the best one by sorting (highest first) and choosing the first item
    sorted_class_counts = sorted(class_counts.items(), key=itemgetter(1), reverse=True)
    most_frequent_class = sorted_class_counts[0][0]
    # The error is the number of samples that do not classify as the most frequent class
    # *and* have the feature value.
    error = sum([class_count for class_value, class_count in class_counts.items()
                 if class_value != most_frequent_class])
    return most_frequent_class, error

In [31]:
# Compute all of the predictors
all_predictors = {variable: train(X_train, y_train, variable) for variable in range(X_train.shape[1])}
errors = {variable: error for variable, (mapping, error) in all_predictors.items()}
# Now choose the best and save that as "model"
# Sort by error
best_variable, best_error = sorted(errors.items(), key=itemgetter(1))[0]
print(f"The best model is based on variable {best_variable} and has error {best_error:.2f}")

# Choose the best model
model = {'variable': best_variable,
         'predictor': all_predictors[best_variable][0]}
print(model)

The best model is based on variable 2 and has error 37.00
{'variable': 2, 'predictor': {0: 0, 1: 2}}


In [32]:
def predict(X_test, model):
    variable = model['variable']
    predictor = model['predictor']
    y_predicted = np.array([predictor[int(sample[variable])] for sample in X_test])
    return y_predicted

In [33]:
y_predicted = predict(X_test, model)
print(y_predicted)

[0 0 0 2 2 2 0 2 0 2 2 0 2 2 0 2 0 2 2 2 0 0 0 2 0 2 0 2 2 0 0 0 2 0 2 0 2
 2]


## Testing the algorithm

When we evaluated the affinity analysis algorithm of the last section, our aim was to explore the current dataset. With this classification, our problem is different. We want to build a model that will allow us to classify previously unseen samples by comparing them to what we know about the problem.

For this reason, we split our machine-learning workflow into two stages: training and testing. 
- In training, we take a portion of the dataset and create our model
- In testing, we apply that model to the remaining, held-out portion of the dataset and evaluate how effectively it worked.

As our goal is to create a model that is able to classify previously unseen samples, we cannot use our testing data for training the model. If we do, we run the risk of **overfitting** (the problem of creating a model that classifies our training dataset very well, but performs poorly on new samples).

#### The rule

Never use training data to test your algorithm.
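
One way to see why this rule matters is to consider a "model" that simply memorizes its training samples. The sketch below uses made-up data and a hypothetical `memorizer` dictionary; it is a deliberately extreme illustration of overfitting, not a technique used elsewhere in this section.

```python
# Hypothetical labelled samples: (feature_a, feature_b) -> class
train_samples = {(0, 1): "spam", (1, 1): "ham", (1, 0): "ham"}
test_samples = {(0, 0): "spam", (1, 1): "ham"}

# A "model" that only memorizes the training data
memorizer = dict(train_samples)


def accuracy(model, samples):
    """Fraction of samples whose memorized label matches the true label."""
    correct = sum(model.get(features) == label
                  for features, label in samples.items())
    return correct / len(samples)


print(accuracy(memorizer, train_samples))  # 1.0 on the data it memorized
print(accuracy(memorizer, test_samples))   # 0.5, since (0, 0) was never seen
```

Evaluating on the training samples alone would suggest a perfect model, while the held-out samples reveal that it does not generalize.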


In [34]:
# Compute the accuracy as the percentage of predictions in y_predicted that match y_test
accuracy = np.mean(y_predicted == y_test) * 100
print("The test accuracy is {:.1f}%".format(accuracy))

The test accuracy is 65.8%


In [35]:
from sklearn.metrics import classification_report

In [40]:
print(classification_report(y_test, y_predicted, zero_division=0))

              precision    recall  f1-score   support

           0       0.94      1.00      0.97        17
           1       0.00      0.00      0.00        13
           2       0.40      1.00      0.57         8

    accuracy                           0.66        38
   macro avg       0.45      0.67      0.51        38
weighted avg       0.51      0.66      0.55        38
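
The precision and recall columns in a report like this can be computed by hand. The sketch below uses small made-up label arrays (`toy_true` and `toy_pred` are hypothetical, not the Iris test split) and a helper named `precision_recall` written for this illustration. It returns 0.0 when a class is never predicted, which is assumed here to mirror the `zero_division=0` setting used above.

```python
import numpy as np

# Hypothetical true and predicted labels, invented for this illustration.
toy_true = np.array([0, 0, 1, 1, 2, 2])
toy_pred = np.array([0, 0, 2, 2, 2, 2])


def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class, treating it as the positive class."""
    true_positives = np.sum((y_pred == label) & (y_true == label))
    predicted_positives = np.sum(y_pred == label)
    actual_positives = np.sum(y_true == label)
    # Report 0.0 when there is nothing to divide by
    precision = true_positives / predicted_positives if predicted_positives else 0.0
    recall = true_positives / actual_positives if actual_positives else 0.0
    return float(precision), float(recall)


for label in (0, 1, 2):
    print(label, precision_recall(toy_true, toy_pred, label))
```

In this toy case class 1 is never predicted, so its precision and recall are both zero, while class 2 has perfect recall but low precision because it absorbs the class 1 samples. The report above shows the same pattern.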

