In [1]:
import numpy as np
import pandas as pd

# Data Mining Revision - Elliot Linsey QMUL 

### Attributes:

![attribute%20types.JPG](attachment:attribute%20types.JPG)

An attribute is an observed feature of an object. The specific value of an attribute may vary between objects. E.g., if the object is a country, the average temperature (attribute) varies from country to country. 

Observed values for a given object are known as observations. A set of attributes used to describe an object is known as an *attribute vector*. 

### Qualitative Categorical: 
**Nominal**:
* Categorical
* Means 'relating to names'. Easier to remember it as just 'no order'. 
* There is no ranking involved. 
* Can be represented by integers, for example  customer ID numbers, but mathematical operations on these are meaningless. 

**Binary**:
* Only contains two states, 1 or 0. 
* If these states correspond to True or False, it's *Boolean*. 
* Symmetric means that both states have equal value.
* Asymmetric means they are not equally important, e.g. having a disease is more serious than not having the disease. 

**Ordinal**: 
* These are attributes that have an order, but the magnitude or size difference between them may not be known.
* Small, medium, large etc. 
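
As a small sketch of how an ordinal attribute can be stored, pandas has an ordered categorical type. The values and the variable name `sizes` are made up for illustration and are not from the slides.

```python
import pandas as pd

# Hypothetical ordinal attribute: the order is known, the size of the gaps is not
sizes = pd.Categorical(
    ["medium", "small", "large", "small"],
    categories=["small", "medium", "large"],
    ordered=True,
)

print(sizes.min(), sizes.max())  # order-based operations are allowed
print(list(sizes.codes))         # integer ranks: small=0, medium=1, large=2
print(list(sizes < "large"))     # comparisons respect the declared order
```

The integer codes only encode rank, so subtracting them does not give a meaningful size difference.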

### Quantitative Numeric:
**Interval**: 
* Measured on a scale of equal size units, where there is order and the difference between two values is meaningful. 
* Examples are temperature (Celsius) and the pH scale (0 is the most acidic, it does not denote the absence of acid). The zero point is arbitrary, so values can go below 0 and ratios between values are not meaningful. 

**Ratio**: 
* Contains the same properties as Interval, but also includes a true 0 point. If the value is 0, then there is none of the attribute. 
* Examples include temperature (Kelvin), heart rate (bpm), weight (g). All these have an inherent 0 point, you cannot have a negative heart rate or weight. 

![ratio%20and%20interval.JPG](attachment:ratio%20and%20interval.JPG)

**Continuous**: 
* Represented as floating or decimal numbers. These can only be measured with limited precision.
* Examples include height (cm), weight (g), anything that can be feasibly measured to a more precise level. 

**Discrete**: 
* Represented as a finite or countably infinite set of integers. (Can also be categorical). 
* Examples include population (you can't have half a person), heart rate (bpm), customer ID.
* Binary variables can be represented as discrete (0 and 1). 

**Asymmetric**: 
* Records only the presence of an attribute (non-zero value). 
* Can be either discrete or continuous
* Examples include words present in a document, items in a transaction dataset. 

### Recording Data: 

**Record**: 
* A collection of objects that have the same fixed attributes. No explicit relationship between objects.
* Usually stored in flat files

![flat%20file.JPG](attachment:flat%20file.JPG)



The above dataset contains a number of attribute types. 

* TID: A numerical discrete value for transaction IDs
* Refund: Categorical binary, may be asymmetrical if the presence of a refund is more important than not having a refund
* Marital Status: Categorical nominal
* Income: Numerical value, recorded here as discrete because the income is not measured any more precisely

**Transaction Record**:
* Each transaction records a set of items that were purchased. 

![transaction%20file.JPG](attachment:transaction%20file.JPG)

Measures such as Kulczynski and the Imbalance ratio can be calculated from transaction data. 

In [2]:
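def K_measure(dataset,A,B):
    """Kulczynski measure: the average of the two confidences P(B|A) and P(A|B)."""
    A, B = set(A), set(B)
    count_both = 0  # transactions containing both A and B
    count_a = 0     # transactions containing A
    count_b = 0     # transactions containing B
    for x in dataset:
        items = set(x)
        if A.issubset(items) and B.issubset(items):
            count_both += 1
        if A.issubset(items):
            count_a += 1
        if B.issubset(items):
            count_b += 1
    # Raises ZeroDivisionError if A or B never appears in the dataset
    conA = count_both/count_a  # confidence of A -> B
    conB = count_both/count_b  # confidence of B -> A
    return (conA+conB)/2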

def imb_ratio(dataset,A,B):
    """Imbalance ratio: |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A and B together))."""
    A, B = set(A), set(B)
    count_both = 0  # transactions containing both A and B
    count_a = 0     # transactions containing A
    count_b = 0     # transactions containing B
    for x in dataset:
        items = set(x)
        if A.issubset(items) and B.issubset(items):
            count_both += 1
        if A.issubset(items):
            count_a += 1
        if B.issubset(items):
            count_b += 1
    supportA = count_a/len(dataset)
    supportB = count_b/len(dataset)
    supportAUB = count_both/len(dataset)  # support of the combined itemset
    return abs(supportA-supportB)/(supportA+supportB-supportAUB)
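
To see the two measures on concrete numbers, here is a self-contained sketch. The transactions and the helper `support_count` are hypothetical and only serve as an illustration. The imbalance ratio is computed from raw counts, which gives the same result as using supports because the dataset size cancels out.

```python
def support_count(dataset, itemset):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for transaction in dataset if itemset.issubset(transaction))

# Hypothetical transactions, made up for illustration
transactions = [
    ["milk", "bread"],
    ["milk", "bread", "eggs"],
    ["milk"],
    ["bread"],
    ["milk", "eggs"],
]
A, B = ["milk"], ["bread"]

n_a = support_count(transactions, A)       # transactions containing A
n_b = support_count(transactions, B)       # transactions containing B
n_ab = support_count(transactions, A + B)  # transactions containing both

# Kulczynski: average of the confidences of A -> B and B -> A
kulczynski = 0.5 * (n_ab / n_a + n_ab / n_b)
# Imbalance ratio: 0 means balanced, values near 1 mean very imbalanced
imbalance = abs(n_a - n_b) / (n_a + n_b - n_ab)

print(kulczynski)
print(imbalance)
```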

**Document-term matrix**: 
* This is simply the number of times a specific word appears in each document, regardless of word order. 

![document-term%20matrix.JPG](attachment:document-term%20matrix.JPG)

The inverse document frequency (idf) measure can be calculated from these. The formula is: 

$idf(w) = \log_{10}\left(\frac{|D|}{|D_w|}\right)$

It is simplified as the $\log_{10}$ of the total number of documents divided by the number of documents that contain the word.

For the idf(coach) it would be $idf(coach) = \log_{10}\left(\frac{3}{2}\right)$

In [3]:
print('idf(coach) = ' + str(np.log10(3/2)))

idf(coach) = 0.17609125905568124
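
The same calculation can be done for every word at once. This is a sketch that assumes the figure shows the three-document term counts that reappear as the first three rows of the bag-of-words table later in these notes. The names `dtm`, `doc_freq` and `idf` are my own.

```python
import numpy as np
import pandas as pd

# Term counts for three documents (assumed to match the figure)
dtm = pd.DataFrame(
    [[3, 0, 5, 0, 2, 6, 0, 2, 0, 2],
     [0, 7, 0, 2, 1, 0, 0, 3, 0, 0],
     [0, 1, 0, 0, 1, 2, 2, 0, 3, 0]],
    columns=["team", "coach", "play", "ball", "score",
             "game", "win", "lost", "timeout", "season"],
    index=["Document 1", "Document 2", "Document 3"],
)

n_docs = len(dtm)                 # |D|
doc_freq = (dtm > 0).sum(axis=0)  # |D_w|: documents containing each word
idf = np.log10(n_docs / doc_freq)

print(idf.round(3))
```

A word that appears in every document (here `score`) gets an idf of 0, while a word found in only one document gets the largest value.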


**Temporal Data**: 
* This data contains relationships that are ordered according to time

![temporal%20graph.JPG](attachment:temporal%20graph.JPG)

Examples could be temperature over the course of a year, business profits over a quarter.

For more info on datatypes, including: 
* Data matrix
* Graph 
* Spatial
* Sequence

Check the **Week 2** slides. 

### Simple Matching Coefficient (SMC) 

This is purely for comparing two objects that contain *n* binary attributes. The result is a value between 0 and 1, with 1 meaning the objects match on every attribute and 0 meaning they match on none. 

The comparison of two objects with *n* binary attributes results in 4 possible combinations: 

![SMC.JPG](attachment:SMC.JPG)

The SMC formula is: 

![SMC%20formula.JPG](attachment:SMC%20formula.JPG)

Here's an example of two objects with 10 binary attributes:

In [4]:
x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

def smc(x,y):
    count = 0
    for i in range(len(x)):
        if x[i] == y[i]:
            count += 1
    return count/len(x)

smc(x,y)

0.7

### Jaccard Coefficient

Also used for binary attributes, but for asymmetric ones, where only the presence of an attribute is important. It therefore ignores $f_{00}$ matches.

![Jaccard%20coefficient.JPG](attachment:Jaccard%20coefficient.JPG)

In [5]:
xj = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0]
yj = [0, 0, 1, 0, 1, 0, 1, 0, 0, 1]

def jaccard(x,y):
    count1 = 0
    count2 = 0
    for i in range(len(x)):
        if x[i] == 1 and y[i] == 1:
            count1 += 1
            count2 += 1
        if x[i] == 1 and y[i] == 0:
            count2 += 1
        if x[i] == 0 and y[i] == 1:
            count2 += 1
    return count1/count2

jaccard(xj,yj)

0.4
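
Both coefficients come from the same four match counts, so it can help to compute those counts explicitly. This is a sketch using NumPy. The naming convention ($f_{10}$ meaning x = 1 and y = 0) is an assumption about the figure, and `binary_counts` is a hypothetical helper.

```python
import numpy as np

def binary_counts(x, y):
    """Return (f11, f10, f01, f00) for two equal-length binary vectors."""
    x = np.asarray(x, dtype=bool)
    y = np.asarray(y, dtype=bool)
    f11 = int(np.sum(x & y))    # both 1
    f10 = int(np.sum(x & ~y))   # x is 1, y is 0
    f01 = int(np.sum(~x & y))   # x is 0, y is 1
    f00 = int(np.sum(~x & ~y))  # both 0
    return f11, f10, f01, f00

xj = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0]
yj = [0, 0, 1, 0, 1, 0, 1, 0, 0, 1]

f11, f10, f01, f00 = binary_counts(xj, yj)
smc_value = (f11 + f00) / (f11 + f10 + f01 + f00)  # counts 0-0 matches
jaccard_value = f11 / (f11 + f10 + f01)            # ignores 0-0 matches

print(f11, f10, f01, f00)
print(smc_value, jaccard_value)
```

On these vectors the SMC is higher than the Jaccard coefficient because five of the seven matches are 0-0 matches.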

### Cosine Similarity

This is for comparing sparse vectors, such as 'bag of words' representations. Here's the representation of the previous bag of words sparse vector. For non-negative vectors such as word counts, cosine similarity ranges over $[0,1]$: 0 = no similarity, 1 = complete similarity.

In [6]:
words = pd.DataFrame(
    [[3,0,5,0,2,6,0,2,0,2],
    [0,7,0,2,1,0,0,3,0,0],
    [0,1,0,0,1,2,2,0,3,0],
    [1,1,1,1,1,1,1,1,1,1],
    [1,1,1,1,1,1,1,1,1,1]],
    columns=['team','coach','play','ball','score','game','win','lost','timeout','season'],
    index=['Document 1','Document 2', 'Document 3','Document 4', 'Document 5']
)

words

Unnamed: 0,team,coach,play,ball,score,game,win,lost,timeout,season
Document 1,3,0,5,0,2,6,0,2,0,2
Document 2,0,7,0,2,1,0,0,3,0,0
Document 3,0,1,0,0,1,2,2,0,3,0
Document 4,1,1,1,1,1,1,1,1,1,1
Document 5,1,1,1,1,1,1,1,1,1,1


Here's the formula for cosine similarity:

![similarity2.JPG](attachment:similarity2.JPG)

In [7]:
def cosine_sim(dataset,doc1,doc2):
    doc1 = dataset.loc[doc1].to_numpy()
    doc2 = dataset.loc[doc2].to_numpy()
    return (np.dot(doc1,doc2))/(np.linalg.norm(doc1)*np.linalg.norm(doc2))

cosine_sim(words,'Document 1', 'Document 2')
#cosine_sim(words,'Document 4', 'Document 5')

0.11130451615062428

We can also calculate the cosine similarity using the SciPy function below. Notice that the similarity = $1-d$. This is because SciPy's cosine function calculates the *distance* or *dissimilarity*, which is a measure of how different the two objects are. Dissimilarities usually range over $[0,1]$ but they can also range over $[0,\infty)$.

In [8]:
from scipy import spatial

similarity = 1 - spatial.distance.cosine(words.iloc[0].to_numpy(), words.iloc[1].to_numpy())
print('Similarity = ' + str(similarity))

dissimilarity = spatial.distance.cosine(words.iloc[0].to_numpy(), words.iloc[1].to_numpy())
print('Dissimilarity = ' + str(dissimilarity))

Similarity = 0.11130451615062431
Dissimilarity = 0.8886954838493757


![similarity.JPG](attachment:similarity.JPG)

### Ordinal dissimilarity

If we had a rating system for a course that ranged from: ['very dissatisfied', 'dissatisfied', 'neutral', 'satisfied', 'very satisfied'], we could assign it numerical values from 1 to 5. Then to find the dissimilarity measure we simply find the absolute difference between our numerical values, then divide this by $n-1$, where $n$ is the number of ordered levels (in this case 5), so $n-1 = 4$. 

In this way, very satisfied to very dissatisfied are the most different with $|5-1|/4 = 1.0$ and as values get closer together (say neutral and satisfied) the dissimilarity measure decreases. $|4-3|/4 = 0.25$ 

To find the similarity it's simply $1-d$

In [38]:
ratings = pd.DataFrame(
    [[0,2,0,0,0],
    [0,0,0,0,5],],
    columns=['very dissatisfied', 'dissatisfied', 'neutral', 'satisfied', 'very satisfied'],
    index=['Student 1','Student 2']
)

ratings

Unnamed: 0,very dissatisfied,dissatisfied,neutral,satisfied,very satisfied
Student 1,0,2,0,0,0
Student 2,0,0,0,0,5


In [37]:
abs(4-3)/4

0.25
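
The rule above can be wrapped in a small function. This is a minimal sketch: the names `levels`, `rank`, `ordinal_dissimilarity` and `ordinal_similarity` are my own.

```python
levels = ["very dissatisfied", "dissatisfied", "neutral",
          "satisfied", "very satisfied"]
# Map each level to its rank, 1 to 5
rank = {level: i + 1 for i, level in enumerate(levels)}

def ordinal_dissimilarity(a, b):
    """Absolute rank difference scaled to [0, 1] by the number of levels minus 1."""
    return abs(rank[a] - rank[b]) / (len(levels) - 1)

def ordinal_similarity(a, b):
    return 1 - ordinal_dissimilarity(a, b)

print(ordinal_dissimilarity("very satisfied", "very dissatisfied"))
print(ordinal_dissimilarity("neutral", "satisfied"))
print(ordinal_similarity("neutral", "satisfied"))
```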

Cosine can be used on a table like the one above, but it only tells you whether the two ratings are different, not how far apart they are, so it is not useful for ordinal data. 

In [22]:
dissimilarity2 = spatial.distance.cosine(ratings.iloc[0].to_numpy(), ratings.iloc[1].to_numpy())
print('Dissimilarity = ' + str(dissimilarity2))

Dissimilarity = 1.0


## Distances

#### Euclidean Distance

The most popular distance metric. It is the length of the straight line between two points.

In [9]:
xe = np.array([1,2])
ye = np.array([3,5])

def euclidean(x,y):
    count = 0
    for i in range(len(x)):
        count += (x[i]-y[i])**2
    return count**0.5

euclidean(xe,ye)

3.605551275463989

In [10]:
spatial.distance.euclidean(xe,ye)

3.605551275463989

#### Manhattan distance

The Manhattan distance is the distance in blocks between any two points in a city.

In [11]:
def manhattan(x,y):
    count = 0
    for i in range(len(x)):
        count += abs(x[i]-y[i])
    return count

manhattan(xe,ye)

5

In [12]:
spatial.distance.cityblock(xe,ye)

5

#### Chebyshev distance

This is the maximum absolute difference between two vectors along any single dimension.

In [13]:
def chebyshev(x,y):
    count = 0
    for i in range(len(x)):
        cheby = abs(x[i]-y[i])
        if cheby > count:
            count = cheby
    return count

chebyshev(xe,ye)

3

In [14]:
spatial.distance.chebyshev(xe,ye)

3
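
All three distances are norms of the difference vector: Manhattan is the $L_1$ norm, Euclidean the $L_2$ norm and Chebyshev the $L_\infty$ norm. As a sketch, NumPy's `np.linalg.norm` can compute each one through its `ord` argument. The variable names are my own.

```python
import numpy as np

xe = np.array([1, 2])
ye = np.array([3, 5])
diff = xe - ye

manhattan_d = np.linalg.norm(diff, ord=1)       # sum of absolute differences
euclidean_d = np.linalg.norm(diff, ord=2)       # square root of summed squares
chebyshev_d = np.linalg.norm(diff, ord=np.inf)  # largest absolute difference

print(manhattan_d, euclidean_d, chebyshev_d)
```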

![distances.JPG](attachment:distances.JPG)

### Binning Data

![binning%20data.JPG](attachment:binning%20data.JPG)

### Chi-Square

For the chi-square test, first recreate the table and sum both the rows and columns.

| Rating/University | University A | University B | Total |
| --------------- | ------------ | ----------- | ---------- |
|Satisfied | 71 | 129 | 200 |
|Dissatisfied| 37 | 73 | 110 |
|Total | 108 | 202 | 310 |

Then calculate the expected values. This involves multiplying the column and row total of the variable you wish to use, then dividing this figure by the overall total. 

The expected value for University A and Satisfied is :
(108 * 200)/310 = 69.7

The expected value for University A and Dissatisfied is :
(108 * 110)/310 = 38.3

The expected value for University B and Satisfied is :
(202 * 200)/310 = 130.3

The expected value for University B and Dissatisfied is :
(202 * 110)/310 = 71.7 

Update the table with these figures in parentheses. 

| Rating/University | University A | University B | Total |
| --------------- | ------------ | ----------- | ---------- |
|Satisfied | 71 (69.7) | 129 (130.3)| 200 |
|Dissatisfied| 37 (38.3) | 73 (71.7) | 110 |
|Total | 108 | 202 | 310 |

Calculating the $\chi^2$ statistic uses this formula: $\chi^2 = \sum\frac{(O-E)^2}{E}$

This means finding the difference between the observed and expected value, squaring this difference, dividing it by the expected value, then summing all the resulting values. 

Here are the calculations:

(71-69.7)^2/69.7 = (1.3)^2/69.7 = 1.69/69.7 = 0.024

(37-38.3)^2/38.3 = (-1.3)^2/38.3 = 1.69/38.3 = 0.044

(129-130.3)^2/130.3 = (-1.3)^2/130.3 = 1.69/130.3 = 0.013

(73-71.7)^2/71.7 = (1.3)^2/71.7 = 1.69/71.7 = 0.024

$\chi^2$ = 0.024 + 0.044 + 0.013 + 0.024 = 0.105

With 1 degree of freedom, the hypothesis that satisfaction and university are independent is rejected at a significance level of 0.001 only if $\chi^2$ exceeds 10.828. The resulting statistic of 0.105 is far below this, so we cannot reject the hypothesis and, at this stage, treat these categorical variables as independent. 
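
The hand calculation can be checked with NumPy, with SciPy supplying the critical value. This is a sketch with my own variable names. Because it uses unrounded expected values, the statistic comes out near 0.109 rather than 0.105, which does not change the conclusion.

```python
import numpy as np
from scipy import stats

# Observed counts: rows = Satisfied / Dissatisfied, columns = University A / B
observed = np.array([[71, 129],
                     [37, 73]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
# Expected count = row total * column total / overall total
expected = row_totals * col_totals / observed.sum()

chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
critical = stats.chi2.ppf(1 - 0.001, df=dof)  # critical value at significance 0.001

print(expected.round(1))
print(round(chi2_stat, 3), dof, round(critical, 3))
print("reject independence:", chi2_stat > critical)
```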

### Correlation Coefficient

For numeric attributes, we can calculate Pearson's correlation coefficient which evaluates the relationship between two attributes. 

Note that $-1 \leq r_{a,b} \leq 1$. At -1, the attributes are perfectly negatively correlated, so one attribute increases as the other decreases. 0 means that there is no linear relationship; the attributes may be independent, or they may have a more complex, non-linear relationship (remember statistics lectures). At 1 they are perfectly positively correlated, so as one attribute increases, so does the other. 

![correlation%20coefficient.JPG](attachment:correlation%20coefficient.JPG)

In [15]:
correlation = pd.DataFrame({
    'one':[5,6,7,8,9],
    'two':[4,7,8,1,0]
    })

correlation

Unnamed: 0,one,two
0,5,4
1,6,7
2,7,8
3,8,1
4,9,0


In [16]:
mean_one = np.mean(correlation.one)
mean_two = np.mean(correlation.two)
std_one = np.std(correlation.one)
std_two = np.std(correlation.two)

corr_count = 0
for i in range(len(correlation.one)):
    corr_count += correlation.one[i]*correlation.two[i]
corr_count

126

In [17]:
n = len(correlation)  # number of observations (5 here)
(corr_count-(n*mean_one*mean_two))/(n*std_one*std_two)

-0.6260990336999411

In [18]:
correlation['one'].corr(correlation.two)

-0.6260990336999411

![correlation%20graphs.JPG](attachment:correlation%20graphs.JPG)

## Data Cubes and OLAP (Online Analytical Processing)

Data warehouses utilise data cubes which are multidimensional forms of viewing data. 

Dimension tables contain information about each cuboid dimension. If we use sales data as an example, a dimension could be the item table {item name, brand, type}, or time {day, week, month, quarter, year}. What is not included in the dimension tables is the value that you are measuring (such as dollars). 

The fact table contains the value that you are measuring as well as linkages to the dimension tables. The most common schema type that uses this format is the **Star schema**. 

![star%20schema.JPG](attachment:star%20schema.JPG)

There are a number of OLAP operations that can be used on data cubes. These are: 
* Roll up (or drill up). This summarises data by climbing up the hierarchy of a dimension, or by dimension reduction. For example, going from dollars sold for each week and rolling up to dollars sold per month sums all the week dollar values. 
* Drill down (roll down). This is the opposite of roll up and gives more detail, either by stepping down a hierarchy or by adding a dimension. If drilling down from months to weeks, you split the month dollar value into the corresponding week sums that make up that month. 
* Slice. You select on one dimension. If you wanted to get all data from a specific quarter, you would slice on that quarter.
* Dice. You select on two (or more) dimensions. If you wanted to get all the data from a specific quarter and item name, you would dice on those dimensions. 
* Pivot. Reorients (rotates) the data, used for visualisation. 
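
These operations can be imitated on a flat table with pandas. This is only a sketch of the idea, not how an OLAP server implements them, and the `sales` table is hypothetical.

```python
import pandas as pd

# Hypothetical sales facts with one time hierarchy (month < quarter)
sales = pd.DataFrame(
    [["Q1", "Jan", "phone", "London", 100],
     ["Q1", "Jan", "laptop", "London", 200],
     ["Q1", "Feb", "phone", "Paris", 150],
     ["Q2", "Apr", "phone", "London", 120],
     ["Q2", "May", "laptop", "Paris", 300],
     ["Q2", "May", "phone", "Paris", 80]],
    columns=["quarter", "month", "item", "city", "dollars"],
)

# Drill down: dollars at the more detailed month level
by_month = sales.groupby(["quarter", "month"])["dollars"].sum()
# Roll up: sum the months into their quarters
by_quarter = sales.groupby("quarter")["dollars"].sum()
# Slice: select on one dimension
q1_slice = sales[sales["quarter"] == "Q1"]
# Dice: select on two dimensions
q1_phone_dice = sales[(sales["quarter"] == "Q1") & (sales["item"] == "phone")]
# Pivot: reorient so items are rows and cities are columns
pivoted = sales.pivot_table(index="item", columns="city",
                            values="dollars", aggfunc="sum")

print(by_quarter)
print(pivoted)
```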

Say we want the dollars sold for a specific month, a specific item and a specific city, and we are given the cuboid {Date = year, Item = type, Location = street}. The hierarchies are year > quarter > month > day (for time), type > brand > item name (for item), and country > state > city > street (for location). 

Here we would drill down from year to month and slice on this given month. Then drill down from type to item and slice on this dimension. Then roll up from street to city and slice on this dimension. 

From how I understand it, you roll up or drill down on each dimension until you reach the required level, then slice on that level. You continue this for each dimension until you are at the correct level of the hierarchy for every dimension. 

### Cube subsets

Depending on the number of dimensions, we can create a lattice of cuboids. At the top is the **0-D (apex)** cuboid, which is a full summarisation over all the dimensions. I think this just means that it holds 1 summary value, e.g. dollars sold over all the dimensions. 

At the bottom is the N-D (base) cuboid, which has the lowest level of summarisation. If the apex cuboid calculates 1 summary value, then the base cuboid will produce many values depending on the number of dimensions. If your cube has 4 dimensions, then the base cuboid is also 4-D. 

![cuboid%20lattice.JPG](attachment:cuboid%20lattice.JPG)

Here is a representation of the base cuboid. You can see that all the values are at the lowest level of summarisation. 

![4d%20cuboids.JPG](attachment:4d%20cuboids.JPG)

If the dimensions don't have any hierarchies associated with them, then the number of cuboids you can generate is $2^n$, where $n$ is the number of dimensions. 

In practice, most dimensions do have hierarchies. To count the cuboids in that case you use the formula:

$T = \prod_{i=1}^n(L_i + 1)$

where $L_i$ is the number of levels in the hierarchy of dimension $i$ (the +1 is for the fully summarised 'all' level). This is very simple to deconstruct: for each dimension, take the number of levels in its hierarchy plus 1, then multiply these together across all dimensions. 

If you have {Time: day, month, quarter, year}, {Product: item name}, {Location: city}, the resulting equation is $5 \times 2 \times 2 = 20$.
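
The formula is a one-liner in Python. This is a sketch with a hypothetical helper `total_cuboids` that takes the number of hierarchy levels for each dimension.

```python
from math import prod

def total_cuboids(levels_per_dimension):
    """Multiply (levels + 1) across all dimensions."""
    return prod(levels + 1 for levels in levels_per_dimension)

# Time has 4 levels, Product has 1, Location has 1
print(total_cuboids([4, 1, 1]))
# Three dimensions with a single level each gives 2**3
print(total_cuboids([1, 1, 1]))
```

With one level per dimension the formula reduces to $2^n$, which matches the no-hierarchy case.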

### Bitmap Indexing

This is quite similar to One-Hot Encoding. It allows quick searching within data cubes. 

![bitmap%20indexing.JPG](attachment:bitmap%20indexing.JPG)
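
As a rough sketch of the idea (not of any particular database implementation), a bitmap index keeps one boolean vector per distinct value of a column, with one bit per row. A query over several columns then becomes a bitwise AND. The data and the helper `build_bitmap_index` are hypothetical.

```python
import numpy as np

# Hypothetical base table with 8 rows and two dimensions
item = np.array(["H", "C", "P", "S", "H", "C", "P", "S"])
city = np.array(["V", "V", "V", "V", "T", "T", "T", "T"])

def build_bitmap_index(column):
    """One boolean vector per distinct value; bit i is set if row i has that value."""
    return {str(value): column == value for value in np.unique(column)}

item_index = build_bitmap_index(item)
city_index = build_bitmap_index(city)

print(item_index["H"].astype(int))

# Rows where item is "C" AND city is "T": a single bitwise AND
matches = item_index["C"] & city_index["T"]
print(np.flatnonzero(matches))
```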

### Join indexing

The join index method relates the values from the dimension tables of the star schema to the rows of the fact table. 

![join%20indexing.JPG](attachment:join%20indexing.JPG)
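
A join index can be pictured as a lookup from each dimension value to the fact-table rows that reference it. The following is a simplified sketch with a hypothetical fact table and a hypothetical helper `build_join_index`.

```python
from collections import defaultdict

# Hypothetical fact table: each row links to dimension values and holds a measure
fact_table = [
    {"location": "London", "item": "TV", "dollars": 500},
    {"location": "Paris", "item": "TV", "dollars": 450},
    {"location": "London", "item": "phone", "dollars": 300},
    {"location": "London", "item": "TV", "dollars": 520},
]

def build_join_index(rows, dimension):
    """Map each value of `dimension` to the ids of the fact rows that use it."""
    index = defaultdict(list)
    for row_id, row in enumerate(rows):
        index[row[dimension]].append(row_id)
    return dict(index)

location_index = build_join_index(fact_table, "location")
item_index = build_join_index(fact_table, "item")
print(location_index)
print(item_index)

# Fact rows for London AND TV, found without scanning the whole fact table
row_ids = sorted(set(location_index["London"]) & set(item_index["TV"]))
total = sum(fact_table[i]["dollars"] for i in row_ids)
print(row_ids, total)
```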

It is usually impractical to materialise all available cuboids. Therefore, query processing should be as follows: 
* Determine which operations should be performed on the available cuboids. 
* Determine which materialised cuboid(s) the relevant operations should be applied to. 