In [2]:
import numpy as np
import pandas as pd

# Data Mining Revision - Elliot Linsey QMUL 

### Attributes:

![attribute%20types.JPG](attachment:attribute%20types.JPG)

An attribute is an observed feature of an object. Its value may vary between objects: if the object is a country, for example, the average temperature (an attribute) differs from country to country.

Observed values for a given object are known as observations. A set of attributes used to describe an object is known as an *attribute vector*. 

### Qualitative Categorical: 
**Nominal**:
* Categorical
* Means 'relating to names'. Easier to remember it as just 'no order'. 
* There is no ranking involved. 
* Can be represented by integers, for example customer ID numbers, but mathematical operations on these are meaningless.

**Binary**:
* Only contains two states, 1 or 0. 
* If these states correspond to True or False, it's *Boolean*. 
* Symmetric means that both states are equally important.
* Asymmetric means the states are not equally important. For example, having a disease is more serious than not having it.

**Ordinal**: 
* These are attributes that have an order, but the magnitude or size difference between them may not be known.
* Examples: small, medium, large.
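
The difference between nominal and ordinal attributes can be sketched with pandas categoricals. This is only an illustration: the values below are hypothetical and are not taken from the figures in these notes.

```python
import pandas as pd

# Hypothetical example values, chosen only for illustration.
# Nominal: categories have no order.
marital = pd.Categorical(["single", "married", "divorced", "single"])

# Ordinal: categories are ranked, but the gaps between them are unknown.
size = pd.Categorical(
    ["small", "large", "medium", "small"],
    categories=["small", "medium", "large"],
    ordered=True,
)

print(marital.ordered)                 # False
print(size.ordered)                    # True
print(size.min(), size.max())          # small large
print((size < "large").tolist())       # [True, False, True, True]
```

Order comparisons such as `size < "large"` only make sense for the ordered categorical; pandas raises a `TypeError` if they are attempted on an unordered one.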

### Quantitative Numeric:
**Interval**: 
* Measured on a scale of equal size units, where there is order and the difference between two values is meaningful. 
* Examples are temperature in Celsius and the pH scale (a pH of 0 does not denote the absence of acid). Because the zero point is arbitrary, values can go below 0 and ratios between values are not meaningful.

**Ratio**: 
* Has the same properties as Interval, but also has a true 0 point. If the value is 0, then there is none of the attribute.
* Examples include temperature (Kelvin), heart rate (bpm), weight (g). All of these have an inherent 0 point: you cannot have a negative heart rate or weight.

![ratio%20and%20interval.JPG](attachment:ratio%20and%20interval.JPG)
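
A small sketch of why ratios are meaningful on a ratio scale but not on an interval scale. The two temperatures are hypothetical values and `celsius_to_kelvin` is a helper introduced here for illustration.

```python
def celsius_to_kelvin(c):
    """Convert a temperature from Celsius (interval) to Kelvin (ratio)."""
    return c + 273.15

# Two hypothetical temperatures in Celsius.
a_c, b_c = 10.0, 20.0
a_k, b_k = celsius_to_kelvin(a_c), celsius_to_kelvin(b_c)

# Differences agree on both scales, so differences are meaningful on an interval scale.
diff_c = b_c - a_c
diff_k = b_k - a_k

# Ratios do not agree: 20 C is not "twice as hot" as 10 C.
ratio_c = b_c / a_c
ratio_k = b_k / a_k

print(f"difference: {diff_c:.1f} C vs {diff_k:.1f} K")
print(f"ratio: {ratio_c:.3f} (Celsius) vs {ratio_k:.3f} (Kelvin)")
```

The Celsius ratio depends on where the arbitrary zero was placed, whereas the Kelvin ratio reflects the actual quantity.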

**Continuous**: 
* Represented as floating-point or decimal numbers. These can only be measured with limited precision.
* Examples include height (cm), weight (g), and anything else that could feasibly be measured more precisely.

**Discrete**: 
* Takes values from a finite or countably infinite set, often represented as integers. Can also be categorical.
* Examples include population (you can't have half a person), heart rate (bpm), customer ID.
* Binary variables can be represented as discrete (0 and 1). 

**Asymmetric**: 
* Records only the presence of an attribute (non-zero value). 
* Can be either discrete or continuous.
* Examples include words present in a document, items in a transaction dataset. 
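
As a rough illustration of an asymmetric attribute, the sketch below stores the same document twice: once as a full vector over a vocabulary, and once recording only the words that are present. The vocabulary and the document are hypothetical.

```python
from collections import Counter

# Hypothetical vocabulary and document, for illustration only.
vocabulary = ["coach", "team", "ball", "score", "win"]
document = "team coach team win"

counts = Counter(document.split())

# Dense form: one entry per vocabulary word, including the zeros.
dense = [counts[word] for word in vocabulary]

# Asymmetric form: only the words that are present (non-zero counts).
present = dict(counts)

print(dense)    # [1, 2, 0, 0, 1]
print(present)  # {'team': 2, 'coach': 1, 'win': 1}
```

With a realistic vocabulary most entries of the dense vector would be zero, which is why only the non-zero values are usually recorded.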

### Recording Data: 

**Record**: 
* A collection of objects that have the same fixed attributes. No explicit relationship between objects.
* Usually stored in flat files.

![flat%20file.JPG](attachment:flat%20file.JPG)



The above dataset contains a number of attribute types. 

* TID: A discrete integer identifier for each transaction. Like customer IDs, it acts as a label, so arithmetic on it is meaningless.
* Refund: Categorical binary. It may be asymmetric if the presence of a refund is more important than its absence.
* Marital Status: Categorical nominal.
* Income: Numerical discrete value, since the income cannot be measured more precisely.
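
A record dataset like this maps naturally onto a pandas DataFrame. The sketch below uses the column names listed above, but the rows are hypothetical placeholders rather than the values in the figure.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the notes.
records = pd.DataFrame(
    {
        "TID": [1, 2, 3],
        "Refund": ["Yes", "No", "No"],
        "Marital Status": ["Single", "Married", "Divorced"],
        "Income": [125000, 100000, 70000],
    }
)

# Binary attribute: store as booleans.
records["Refund"] = records["Refund"] == "Yes"

# Nominal attribute: store as an unordered category.
records["Marital Status"] = records["Marital Status"].astype("category")

print(records.dtypes)
```

Making the attribute types explicit in the dtypes helps prevent meaningless operations, such as averaging a nominal column.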

**Transaction Record**:
* Each transaction records a set of items that were purchased. 

![transaction%20file.JPG](attachment:transaction%20file.JPG)

Measures such as the Kulczynski measure and the imbalance ratio can be calculated from transaction data.
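
For reference, the usual textbook definitions of the two measures for itemsets $A$ and $B$ are given here. The notation is an assumption on my part, since the notes do not state the formulas: $sup(A \cup B)$ denotes the support of transactions that contain both $A$ and $B$.

$Kulc(A, B) = \frac{1}{2}\left(P(A|B) + P(B|A)\right)$

$IR(A, B) = \frac{|sup(A) - sup(B)|}{sup(A) + sup(B) - sup(A \cup B)}$

Kulczynski ranges from 0 to 1, with 0.5 being neutral. An imbalance ratio of 0 means the two itemsets are perfectly balanced, and values closer to 1 indicate greater imbalance.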

In [4]:
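def _itemset_counts(dataset, A, B):
    """Count the transactions containing itemset A, itemset B, and both."""
    A, B = set(A), set(B)
    count_a = count_b = count_ab = 0
    for transaction in dataset:
        items = set(transaction)
        has_a = A.issubset(items)
        has_b = B.issubset(items)
        count_a += has_a
        count_b += has_b
        count_ab += has_a and has_b
    return count_a, count_b, count_ab


def K_measure(dataset, A, B):
    """Kulczynski measure: the mean of the confidences of A -> B and B -> A."""
    count_a, count_b, count_ab = _itemset_counts(dataset, A, B)
    if count_a == 0 or count_b == 0:
        raise ValueError("A and B must each appear in at least one transaction")
    conf_a_to_b = count_ab / count_a  # P(B | A)
    conf_b_to_a = count_ab / count_b  # P(A | B)
    return (conf_a_to_b + conf_b_to_a) / 2


def imb_ratio(dataset, A, B):
    """Imbalance ratio: |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A and B))."""
    count_a, count_b, count_ab = _itemset_counts(dataset, A, B)
    if count_a + count_b - count_ab == 0:
        raise ValueError("at least one of A and B must appear in the dataset")
    n = len(dataset)
    support_a = count_a / n
    support_b = count_b / n
    support_ab = count_ab / n  # support of transactions containing both A and B
    return abs(support_a - support_b) / (support_a + support_b - support_ab)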
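
A quick worked check of both measures on a hypothetical five-transaction dataset. The block is self-contained and recomputes the counts directly with a small helper, `count_containing`, introduced here for illustration. Calling `K_measure` and `imb_ratio` on the same data with `['bread']` and `['beer']` should give approximately the same values.

```python
# Hypothetical transactions, for illustration only.
transactions = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers"],
    ["bread", "milk", "beer"],
]

def count_containing(dataset, itemset):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset.issubset(transaction) for transaction in dataset)

n_a = count_containing(transactions, ["bread"])           # 4
n_b = count_containing(transactions, ["beer"])            # 3
n_ab = count_containing(transactions, ["bread", "beer"])  # 2

# Kulczynski: mean of the two confidences.
kulc = (n_ab / n_a + n_ab / n_b) / 2

# Imbalance ratio, written in counts (the 1/len(dataset) factors cancel).
ir = abs(n_a - n_b) / (n_a + n_b - n_ab)

print(round(kulc, 4), ir)  # 0.5833 0.2
```

A Kulczynski value near 0.5 together with a low imbalance ratio suggests that bread and beer are neither strongly associated nor strongly imbalanced in this toy dataset.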

**Document-term matrix**: 
* Records the number of times each word appears in each document, regardless of word order.

![document-term%20matrix.JPG](attachment:document-term%20matrix.JPG)
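
One possible way to build such a matrix is sketched below with `collections.Counter` and pandas. The three documents are hypothetical and do not reproduce the figure.

```python
from collections import Counter

import pandas as pd

# Hypothetical documents, for illustration only.
documents = {
    "doc1": "coach team coach",
    "doc2": "team ball team",
    "doc3": "coach ball score",
}

# Count the words in each document; word order is discarded.
counts = {name: Counter(text.split()) for name, text in documents.items()}

# Rows are documents, columns are words; words absent from a document count as 0.
dtm = pd.DataFrame(counts).T.fillna(0).astype(int)
dtm = dtm.sort_index(axis=1)

print(dtm)
```

Splitting on whitespace is a deliberate simplification; real text would normally need lowercasing and punctuation handling first.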

The inverse document frequency (idf) of a word can be calculated from a document-term matrix. The formula is:

$idf(w) = \log_{10}\left(\frac{|D|}{|D_w|}\right)$

In words, it is the base-10 logarithm of the total number of documents, $|D|$, divided by the number of documents that contain the word, $|D_w|$.

For the word "coach", which appears in 2 of the 3 documents, $idf(coach) = \log_{10}\left(\frac{3}{2}\right)$

In [9]:
print('idf(coach) = ' + str(np.log10(3/2)))

idf(coach) = 0.17609125905568124
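
The same calculation can be generalised to any word. The sketch below is one way to do it, assuming the word counts are held in a hypothetical nested dictionary in which "coach" appears in 2 of 3 documents, matching the example above.

```python
import numpy as np

# Hypothetical word counts per document; "coach" appears in 2 of the 3.
doc_term_counts = {
    "doc1": {"coach": 2, "team": 1},
    "doc2": {"team": 2, "ball": 1},
    "doc3": {"coach": 1, "ball": 1, "score": 1},
}

def idf(word, docs):
    """Base-10 inverse document frequency of word over docs."""
    n_docs = len(docs)
    n_containing = sum(1 for counts in docs.values() if counts.get(word, 0) > 0)
    if n_containing == 0:
        raise ValueError(f"'{word}' does not appear in any document")
    return float(np.log10(n_docs / n_containing))

print("idf(coach) =", idf("coach", doc_term_counts))
print("idf(score) =", idf("score", doc_term_counts))
```

Rarer words get a higher idf: "score" appears in only one document, so its idf is larger than that of "coach". A word that appears in every document has an idf of 0.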
