In [6]:
import numpy as np
import pandas as pd

# Data Mining Revision - Elliot Linsey QMUL 

### Attributes:

![attribute%20types.JPG](attachment:attribute%20types.JPG)

An attribute is an observed feature of an object. The value of an attribute may vary between objects. For example, if the object is a country, its average temperature (an attribute) differs from country to country.

Observed values for a given object are known as observations. A set of attributes used to describe an object is known as an *attribute vector*. 

### Qualitative Categorical: 
**Nominal**:
* Categorical
* Means 'relating to names'. Easier to remember it as just 'no order'. 
* There is no ranking involved. 
* Can be represented by integers, for example customer ID numbers, but mathematical operations on these are meaningless.

**Binary**:
* Has only two states, usually coded as 1 and 0.
* If the states correspond to True and False, the attribute is *Boolean*.
* Symmetric means that both states are equally important.
* Asymmetric means the states are not equally important, e.g. having a disease is more significant than not having it.

**Ordinal**: 
* These are attributes that have an order, but the magnitude or size difference between them may not be known.
* Small, medium, large etc. 
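
The difference between nominal and ordinal attributes can be sketched with pandas categoricals. This is only an illustration: the variable names and example values are made up, and `ordered=True` is what tells pandas that the categories have a ranking.

```python
import pandas as pd

# Nominal: categories with no order; only equality checks are meaningful.
marital = pd.Categorical(["single", "married", "divorced", "single"])

# Ordinal: categories with an explicit order, so comparisons are allowed.
size = pd.Categorical(
    ["small", "large", "medium", "small"],
    categories=["small", "medium", "large"],
    ordered=True,
)

print(marital.ordered)               # False
print(size.ordered)                  # True
print((size < "large").tolist())     # [True, False, True, True]
print(size.min(), size.max())        # small large
```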

### Quantitative Numeric:
**Interval**: 
* Measured on a scale of equal size units, where there is order and the difference between two values is meaningful. 
* Examples are temperature (Celsius) and the pH scale. The zero point is arbitrary: 0 °C does not mean 'no temperature' and pH 0 does not mean 'no acidity', so values can go below 0 and ratios between values are not meaningful.

**Ratio**: 
* Has the same properties as Interval, but also has a true zero point. If the value is 0, then none of the attribute is present, so ratios between values are meaningful.
* Examples include temperature (Kelvin), heart rate (bpm), weight (g). All of these have an inherent zero point: you cannot have a negative heart rate or weight.

![ratio%20and%20interval.JPG](attachment:ratio%20and%20interval.JPG)
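
A quick numeric check of why ratios only make sense on a ratio scale. The two temperatures are arbitrary example values and `celsius_to_kelvin` is a helper written just for this sketch.

```python
def celsius_to_kelvin(c):
    """Convert a temperature from degrees Celsius to Kelvin."""
    return c + 273.15

c1, c2 = 10.0, 20.0
k1, k2 = celsius_to_kelvin(c1), celsius_to_kelvin(c2)

# Differences are meaningful on both scales and agree with each other.
diff_celsius = c2 - c1
diff_kelvin = k2 - k1

# Ratios depend on where zero sits, so only the Kelvin ratio is physical.
ratio_celsius = c2 / c1   # 2.0, but 20 °C is not "twice as hot" as 10 °C
ratio_kelvin = k2 / k1    # roughly 1.035

print(diff_celsius, round(diff_kelvin, 6))
print(ratio_celsius, round(ratio_kelvin, 3))
```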

**Continuous**: 
* Represented as floating-point (decimal) numbers. In practice they can only be measured with limited precision.
* Examples include height (cm) and weight (g): anything that could feasibly be measured to a more precise level.

**Discrete**: 
* Represented as a finite or countably infinite set of integers. (Can also be categorical). 
* Examples include population (you can't have half a person), heart rate (bpm), customer ID.
* Binary variables can be represented as discrete (0 and 1). 

**Asymmetric**: 
* Records only the presence of an attribute (non-zero value). 
* Can be either discrete or continuous
* Examples include words present in a document, items in a transaction dataset. 
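
Because only non-zero values matter for an asymmetric attribute, one common way to store it is to keep just the entries that are present. A minimal sketch, using the word counts of Document 1 from the `words` DataFrame later in these notes; the variable names are chosen for illustration.

```python
vocab = ["team", "coach", "play", "ball", "score",
         "game", "win", "lost", "timeout", "season"]
doc1_counts = [3, 0, 5, 0, 2, 6, 0, 2, 0, 2]

# Keep only the words that actually occur (non-zero counts).
present = {word: count for word, count in zip(vocab, doc1_counts) if count != 0}

print(present)
print(len(present), "of", len(vocab), "words are present")
```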

### Recording Data: 

**Record**: 
* A collection of objects that have the same fixed attributes. No explicit relationship between objects.
* Usually stored in flat files

![flat%20file.JPG](attachment:flat%20file.JPG)



The above dataset contains a number of attribute types. 

* TID: Stored as a discrete number, but as an ID it behaves like a nominal attribute (arithmetic on it is meaningless)
* Refund: Categorical binary, may be asymmetric if the presence of a refund is more important than its absence
* Marital Status: Categorical nominal
* Income: Numerical discrete value, as the recorded income cannot be measured any more precisely

**Transaction Record**:
* Each transaction records a set of items that were purchased. 

![transaction%20file.JPG](attachment:transaction%20file.JPG)

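Measures such as the Kulczynski measure and the imbalance ratio can be calculated from transaction data. For itemsets $A$ and $B$, the Kulczynski measure is the average of the two confidences, $\frac{1}{2}(P(A|B) + P(B|A))$, and the imbalance ratio is $\frac{|sup(A) - sup(B)|}{sup(A) + sup(B) - sup(A \cup B)}$.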

In [7]:
def K_measure(dataset, A, B):
    """Kulczynski measure: the mean of the confidences P(B|A) and P(A|B)."""
    A, B = set(A), set(B)
    count_ab = count_a = count_b = 0
    for transaction in dataset:
        items = set(transaction)
        has_a = A.issubset(items)
        has_b = B.issubset(items)
        count_a += has_a
        count_b += has_b
        count_ab += has_a and has_b
    # Raises ZeroDivisionError if A or B never occurs in the dataset.
    conf_a_to_b = count_ab / count_a
    conf_b_to_a = count_ab / count_b
    return (conf_a_to_b + conf_b_to_a) / 2

def imb_ratio(dataset, A, B):
    """Imbalance ratio: |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A u B))."""
    A, B = set(A), set(B)
    count_ab = count_a = count_b = 0
    for transaction in dataset:
        items = set(transaction)
        has_a = A.issubset(items)
        has_b = B.issubset(items)
        count_a += has_a
        count_b += has_b
        count_ab += has_a and has_b
    # Supports are the fractions of transactions containing each itemset.
    support_a = count_a / len(dataset)
    support_b = count_b / len(dataset)
    support_ab = count_ab / len(dataset)
    return abs(support_a - support_b) / (support_a + support_b - support_ab)
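
A small worked example of both measures on a toy dataset. The transactions and the helper name `pair_counts` are made up for illustration. The sketch works with raw counts rather than supports, which gives the same imbalance ratio because the dataset size cancels out.

```python
def pair_counts(transactions, A, B):
    """Count transactions containing A, containing B, and containing both."""
    A, B = set(A), set(B)
    n_a = sum(A <= set(t) for t in transactions)
    n_b = sum(B <= set(t) for t in transactions)
    n_ab = sum((A | B) <= set(t) for t in transactions)
    return n_a, n_b, n_ab

# Hypothetical transaction data.
transactions = [
    ["milk", "bread"],
    ["milk"],
    ["milk", "bread", "eggs"],
    ["bread"],
    ["milk", "eggs"],
]

n_a, n_b, n_ab = pair_counts(transactions, ["milk"], ["bread"])

# Kulczynski: average of P(bread | milk) and P(milk | bread).
kulczynski = 0.5 * (n_ab / n_a + n_ab / n_b)

# Imbalance ratio, computed from counts (len(transactions) cancels out).
imbalance = abs(n_a - n_b) / (n_a + n_b - n_ab)

print(n_a, n_b, n_ab)          # 4 3 2
print(round(kulczynski, 4))    # 0.5833
print(imbalance)               # 0.2
```

A Kulczynski value near 0.5 says little on its own about whether the two itemsets are associated, and an imbalance ratio of 0 would mean the two itemsets are equally frequent.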

**Document-term matrix**: 
* Each entry is the number of times a specific word appears in a document, regardless of word order.

![document-term%20matrix.JPG](attachment:document-term%20matrix.JPG)

The inverse document frequency (idf) can be calculated from this matrix. The formula is:

$idf(w) = \log_{10}\left(\frac{|D|}{|D_w|}\right)$

In words, it is the base-10 log of the total number of documents $|D|$ divided by the number of documents that contain the word, $|D_w|$.

For the word *coach*, which appears in 2 of the 3 documents, $idf(coach) = \log_{10}\left(\frac{3}{2}\right)$

In [8]:
print('idf(coach) = ' + str(np.log10(3/2)))

idf(coach) = 0.17609125905568124
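
The same calculation can be done for every word at once. This sketch assumes the document-term matrix in the image is the same three documents that appear as the first three rows of the `words` DataFrame later in these notes. `dtm`, `doc_freq` and `idf` are names chosen here.

```python
import numpy as np
import pandas as pd

dtm = pd.DataFrame(
    [[3, 0, 5, 0, 2, 6, 0, 2, 0, 2],
     [0, 7, 0, 2, 1, 0, 0, 3, 0, 0],
     [0, 1, 0, 0, 1, 2, 2, 0, 3, 0]],
    columns=["team", "coach", "play", "ball", "score",
             "game", "win", "lost", "timeout", "season"],
    index=["Document 1", "Document 2", "Document 3"],
)

# |D_w|: number of documents in which each word appears at least once.
doc_freq = (dtm > 0).sum(axis=0)

# idf(w) = log10(|D| / |D_w|), computed column by column.
idf = np.log10(len(dtm) / doc_freq)

print(idf.round(3))
```

A word that appears in every document (here *score*) gets an idf of 0, while words confined to a single document get the highest value.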


**Temporal Data**: 
* This data contains relationships that are ordered according to time

![temporal%20graph.JPG](attachment:temporal%20graph.JPG)

Examples could be temperature over the course of a year or business profits over a quarter.
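
As a rough sketch, temporal data in pandas is usually a Series or DataFrame indexed by timestamps. The monthly temperatures below are invented values, used only to show the time-ordered index and a quarterly aggregation.

```python
import pandas as pd

# Hypothetical monthly average temperatures (°C) for one year.
temps = pd.Series(
    [4.0, 5.0, 8.0, 11.0, 15.0, 18.0, 20.0, 19.0, 16.0, 12.0, 7.0, 5.0],
    index=pd.date_range("2023-01-01", periods=12, freq="MS"),
)

# The index is ordered in time, which is what makes the data temporal.
print(temps.index.is_monotonic_increasing)

# Aggregate the monthly values into quarterly means.
quarterly = temps.resample("QS").mean()
print(quarterly.round(2))
```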

For more info on datatypes, including: 
* Data matrix
* Graph 
* Spatial
* Sequence

Check the **Week 2** slides. 

### Simple Matching Coefficient (SMC) 

SMC compares two objects that each have *n* binary attributes. The result is a value between 0 and 1, with 1 meaning the objects match on every attribute and 0 meaning they match on none.

For each attribute, comparing the two objects gives one of 4 possible combinations:

![SMC.JPG](attachment:SMC.JPG)

The SMC formula is: 

![SMC%20formula.JPG](attachment:SMC%20formula.JPG)

Here's an example of two objects with 10 binary attributes:

In [9]:
x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

def smc(x,y):
    count = 0
    for i in range(len(x)):
        if x[i] == y[i]:
            count += 1
    return count/len(x)

smc(x,y)

0.7
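
To see where the 0.7 comes from, the four counts can be computed explicitly. This sketch assumes the usual convention that $f_{10}$ counts attributes where x is 1 and y is 0 (and $f_{01}$ the reverse), with $SMC = \frac{f_{11} + f_{00}}{f_{00} + f_{01} + f_{10} + f_{11}}$.

```python
import numpy as np

x = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

# Count each of the four possible (x, y) combinations.
f11 = int(np.sum((x == 1) & (y == 1)))
f00 = int(np.sum((x == 0) & (y == 0)))
f10 = int(np.sum((x == 1) & (y == 0)))
f01 = int(np.sum((x == 0) & (y == 1)))

# Matches (f11 and f00) divided by the total number of attributes.
smc_value = (f11 + f00) / (f00 + f01 + f10 + f11)

print(f00, f01, f10, f11)   # 7 2 1 0
print(smc_value)            # 0.7
```

The high score comes entirely from the seven 0-0 matches, even though the objects share no 1s.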

### Jaccard Coefficient

Also used for binary attributes, but intended for asymmetric ones, where only the presence of an attribute is important, so $f_{00}$ matches are ignored.

![Jaccard%20coefficient.JPG](attachment:Jaccard%20coefficient.JPG)

In [10]:
xj = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0]
yj = [0, 0, 1, 0, 1, 0, 1, 0, 0, 1]

def jaccard(x,y):
    count1 = 0
    count2 = 0
    for i in range(len(x)):
        if x[i] == 1 and y[i] == 1:
            count1 += 1
            count2 += 1
        if x[i] == 1 and y[i] == 0:
            count2 += 1
        if x[i] == 0 and y[i] == 1:
            count2 += 1
    return count1/count2

jaccard(xj,yj)

0.4
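
A vectorised variant using NumPy boolean arrays. The name `jaccard_np` is made up for this sketch, and returning 0.0 when neither object has any 1s is an assumed convention, since the coefficient is undefined in that case.

```python
import numpy as np

def jaccard_np(x, y):
    """Jaccard coefficient f11 / (f01 + f10 + f11) for binary vectors."""
    x = np.asarray(x, dtype=bool)
    y = np.asarray(y, dtype=bool)
    union = np.sum(x | y)          # f01 + f10 + f11
    if union == 0:
        return 0.0                 # assumption: treat 0/0 as no similarity
    return float(np.sum(x & y) / union)

xj = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0]
yj = [0, 0, 1, 0, 1, 0, 1, 0, 0, 1]

print(jaccard_np(xj, yj))          # 0.4
```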

### Cosine Similarity

This is for comparing sparse vectors, such as 'bag of words' representations. Here's the earlier bag-of-words matrix as a DataFrame. For non-negative vectors such as word counts, cosine similarity lies in the range $[0,1]$: 0 = no similarity, 1 = complete similarity.

In [23]:
words = pd.DataFrame(
    [[3,0,5,0,2,6,0,2,0,2],
    [0,7,0,2,1,0,0,3,0,0],
    [0,1,0,0,1,2,2,0,3,0],
    [1,1,1,1,1,1,1,1,1,1],
    [1,1,1,1,1,1,1,1,1,1]],
    columns=['team','coach','play','ball','score','game','win','lost','timeout','season'],
    index=['Document 1','Document 2', 'Document 3','Document 4', 'Document 5']
)

words

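| | team | coach | play | ball | score | game | win | lost | timeout | season |
|---|---|---|---|---|---|---|---|---|---|---|
| Document 1 | 3 | 0 | 5 | 0 | 2 | 6 | 0 | 2 | 0 | 2 |
| Document 2 | 0 | 7 | 0 | 2 | 1 | 0 | 0 | 3 | 0 | 0 |
| Document 3 | 0 | 1 | 0 | 0 | 1 | 2 | 2 | 0 | 3 | 0 |
| Document 4 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Document 5 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |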


In [25]:
def cosine_sim(dataset,doc1,doc2):
    doc1 = dataset.loc[doc1].to_numpy()
    doc2 = dataset.loc[doc2].to_numpy()
    return (np.dot(doc1,doc2))/(np.linalg.norm(doc1)*np.linalg.norm(doc2))

cosine_sim(words,'Document 1', 'Document 2')
#cosine_sim(words,'Document 4', 'Document 5')

0.11130451615062428
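
The result can be checked by hand. Only *score* and *lost* are non-zero in both Document 1 ($d_1$) and Document 2 ($d_2$), so the dot product has just two terms:

$$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|} = \frac{(2)(1) + (2)(3)}{\sqrt{3^2+5^2+2^2+6^2+2^2+2^2} \, \sqrt{7^2+2^2+1^2+3^2}} = \frac{8}{\sqrt{82} \, \sqrt{63}} \approx 0.1113$$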

We can also calculate the cosine similarity using the function below from SciPy. Notice that the similarity = $1-d$. This is because the cosine distance function calculates the *distance* or *dissimilarity*, which is a measure of how different the two objects are. Dissimilarities usually range over $[0,1]$ but they can also range over $[0,\infty)$.

In [30]:
from scipy import spatial

similarity = 1 - spatial.distance.cosine(words.iloc[0].to_numpy(), words.iloc[1].to_numpy())
print('Similarity = ' + str(similarity))

dissimilarity = spatial.distance.cosine(words.iloc[0].to_numpy(), words.iloc[1].to_numpy())
print('Dissimilarity = ' + str(dissimilarity))

Similarity = 0.11130451615062431
Dissimilarity = 0.8886954838493757
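
To compare every pair of documents at once, one option is to normalise each row to unit length and multiply the matrix by its transpose. This is a sketch with NumPy rather than a specific library routine, and `sim_matrix` is a name chosen here.

```python
import numpy as np
import pandas as pd

words = pd.DataFrame(
    [[3, 0, 5, 0, 2, 6, 0, 2, 0, 2],
     [0, 7, 0, 2, 1, 0, 0, 3, 0, 0],
     [0, 1, 0, 0, 1, 2, 2, 0, 3, 0],
     [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
     [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
    columns=["team", "coach", "play", "ball", "score",
             "game", "win", "lost", "timeout", "season"],
    index=["Document 1", "Document 2", "Document 3", "Document 4", "Document 5"],
)

X = words.to_numpy(dtype=float)

# Scale every row to unit length; dot products then equal cosine similarities.
unit_rows = X / np.linalg.norm(X, axis=1, keepdims=True)
sim_matrix = pd.DataFrame(unit_rows @ unit_rows.T,
                          index=words.index, columns=words.index)

print(sim_matrix.round(3))
```

Documents 4 and 5 are identical, so their similarity comes out as 1 (up to floating-point rounding), and every document has similarity 1 with itself.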
