### Chapter 1: Introduction

#### Why data mining?

- The proliferation of devices collecting data, the decreased cost of storing data, and the increased speed and ability to process data have increased the opportunity to turn large sets of data into knowledge with commercial and societal benefits.  
  
- Example: Turning Google's search data on flu symptoms into flu trends faster than traditional reporting systems. 

#### What is data mining?

- Data mining is the process of discovering interesting patterns from massive amounts of data. It extracts non-trivial, previously unknown knowledge from large quantities of data by automated or semi-automated means.
- The process is:
    - *preprocessing*: feature selection, dimensionality reduction, normalization, data subsetting
    - *mining*: methods applied to extract patterns
    - *postprocessing*: filtering patterns, visualization, interpretation

#### What kind of data can be mined?

- While nearly any type of data can be mined, relational and transactional data in databases are common sources. 

#### What kind of patterns can be mined?

- Data mining tasks can be classified as:
    - *descriptive*: find human interpretable patterns that describe the data.
    - *predictive*: Use variables to predict unknown future values of other variables.

- Descriptive tasks (descriptive, supervised):
    - *data characterization*: summarizing the data of the class in general terms. For example, summarize the general characteristics of customers who spend more than $5,000 / yr "big spenders". The general profile may include income, credit, age, etc.
    - *data discrimination*: similar to data characterization, but comparing two classes. For example, how does the general profile of a "frequent shopper" differ from an "infrequent shopper".

- Frequent patterns, associations and correlations (descriptive, unsupervised):
    - *frequent patterns*: patterns that occur frequently in the data
    - *association analysis*: finding rules that describe which items or attribute values frequently occur together, such as customers who buy a computer also tending to buy software.

- Classification and regression (predictive, supervised):
    - *classification*: the process of finding a model that distinguishes data classes or concepts. The process relies on labeled data. Decision trees, neural networks, naive Bayes, k-nearest neighbor are common classification methods. Example: model for predicting creditworthiness (decision tree), classifying a breed of dog (neural network).
    - *regression*: similar to classification, but models continuous valued functions, used to predict missing numerical value. Example: predict sales based on ad spend, stock price, wind speed, etc.

- Cluster Analysis (descriptive, unsupervised):
    - *cluster analysis*: finding groups of objects that minimize intra-cluster distance and maximize inter-cluster distance. Does not consult labels. Examples include market segmentation and text similarity analysis. 

- Outlier analysis:
    - Outliers are data objects that do not comply with the general behavior or model of the data. In some applications, such as fraud detection, network intrusion detection, or deforestation monitoring, these anomalies are more interesting than the regular occurrences.

- Not all patterns are interesting. A pattern is interesting if it is: easily understood, valid on new data, useful and new. Association analysis examples that may be interesting: market basket analysis, alarm diagnosis, medical diagnosis. 

#### Which technologies are used? 

- Data mining has incorporated many techniques from other domains such as stats, ML, database systems.  Data mining is closely related to all of these disciplines. 
- It is not uniquely a transformation or application of one of these disciplines but rather an evolution of all of them in response to the need for “effective, scalable, and flexible data analysis in our society”. 
- As data sets become massive and diverse, data mining techniques must be computationally efficient, handle various data types and produce effective results. This is distinct from the more constrained and controlled worlds of traditional disciplines such as statistics.

#### Major issues in data mining

- The main challenge when mining a huge amount of data is the efficiency and scalability of the algorithms used to store and process it: finding an algorithm that can process the data effectively, efficiently and, increasingly, in real time.
- Another challenge is dealing with the increasing diversity of data types. Data is no longer of a uniform data type and stored neatly in a relational database. It is diverse, dynamic and distributed. One way to deal with these challenges is to utilize parallel and distributed data-intensive mining algorithms.

### Chapter 2: Getting to Know Your Data

#### Data objects and attribute types

- Data is a collection of objects and their attributes.
- A data attribute is a property or characteristic of an object (eye color, income).
- A collection of attributes describes an object. 
- Attribute values can be nominal, binary, ordinal or numeric.
    - *nominal*: categorical, the values have no meaningful order. 
    - *binary*: two possible outcomes. 
        - symmetric: equally likely/valuable such as gender
        - asymmetric: one value, such as a positive medical test, is more valuable
    - *ordinal*: categorical, the values have a rank order.
    - *numeric*: quantitative, measurable value.
        - interval-scaled: relative, no true zero exists (temp in C or F, calendar year)
        - ratio-scaled: absolute, true zero exists (temp in K, length)

- The type of an attribute depends on the properties/operations it possesses:
    - Nominal attribute: distinctness (=, !=)
    - Ordinal attribute: distinctness & order (<, >)
    - Interval attribute: distinctness, order & meaningful differences (+, -)
    - Ratio attribute: distinctness, order & meaningful differences and ratios (*, /)

- Discrete vs. continuous variables:
    - Discrete values are finite or countably infinite (age 0-110)
    - Continuous attributes have real numbers as values, typically represented by floating point. 

#### Types of data sets

- Important characteristics of data include: 
    - *Dimensionality*: number of attributes
    - *Sparsity*: Only presence counts
    - *Resolution*: Patterns depend on the scale
    - *Size*: Type of analysis may depend on size

- Types of data sets:
    - *Record*: A collection of records, each with a fixed set of attributes. This can be represented by a matrix or a vector. Transaction data is a special type of record data.
    - *Graph*: Represented by edges and vertices, and may represent things like molecular structures or websites.
    - *Ordered*: May be a sequence of transactions, temporal data, genomic sequence, etc.

#### Data quality issues

- Noise and outliers: for objects, noise means extraneous objects; for attributes, it means modification of the original values. Outliers are objects with considerably different characteristics than the rest of the data set.
- Missing values: Information is either not collected or not relevant in all cases.
- Duplicate data: same person with multiple entries

#### Basic statistical descriptions of data

- Measures of central tendency
    - *mean*: average
    - *median*: middle
    - *mode*: most common
    - *midrange*: average of max/min
    - *negative skew*: long left tail
    - *positive skew*: long right tail


- Measures of dispersion of data
    - *quartiles*: Q1(25th), Q3(75th)
    - *interquartile range*: Q3 - Q1
    - *five number summary*: min, Q1, median, Q3, max
    - *boxplot*: the box marks Q1, the median and Q3; whiskers extend to the min/max values within 1.5x IQR of the quartiles; points beyond that are marked individually as outliers
    - *variance*: sum(x^2)/n - mean^2
    - *standard deviation*: sqrt(variance), measures the spread around the mean
        - mean +/- 1 sigma: 68% under normal curve
        - mean +/- 2 sigma: 95% under normal curve
        - mean +/- 3 sigma: 99.7% under normal curve

#### Graphic displays of basic statistical descriptions of data

- Univariate:
    - Quantile plot: Sort in ascending order, calculate the fraction of each data point as: f = (i-0.5)/N. Plot x = f-value, y = data
    - Quantile-quantile plot: Calculate the quantile of each, plot the data at each f-value against each other.
    - Histogram: plot the data distribution in bins
    
- Bivariate: 
    - Scatter plot: treat each pair as (x, y)
    - Correlation: positive, negative or no correlation

#### Similarity and dissimilarity measures

- Dissimilarity measures:
    - *nominal*: d = 0 if the values match, d = 1 if they differ.
    - *ordinal*: d = |x-y|/(n-1), values are mapped to integers 0 to n-1.
    - *interval or ratio*: d = |x-y|
- Similarity measures: 
    - *nominal*: s = 1 if the values match, s = 0 if they differ.
    - *ordinal*: s = 1-d
    - *interval or ratio*: s = -d, s = 1/(1+d)

- Euclidean distance = $ \sqrt{\sum_{k=1}^n (x_{k}-y_{k})^2} $

- Minkowski distance = $ ({\sum_{k=1}^n |x_{k}-y_{k}|^r})^\frac{1}{r} $

- n is the number of dimensions (attributes), r is a parameter that sets the order of the distance (it is not the number of dimensions).
- r = 1: Manhattan distance
- r = 2: Euclidean distance
- r = infinity: supremum distance, the maximum difference over any single attribute: $ \max_{k} |x_{k}-y_{k}| $
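
As a minimal sketch of how the three special cases fall out of one formula, the hypothetical helper `minkowski` below takes the order `r` as a parameter. The function name and the sample points are illustrative assumptions, not part of the notes.

```python
import math

def minkowski(x, y, r):
    """Minkowski distance of order r between two equal-length vectors."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(r):
        # r = infinity: largest difference over any single attribute
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)

# Illustrative points (made up for this example)
x, y = (1, 2), (4, 6)
manhattan = minkowski(x, y, 1)         # 3 + 4
euclidean = minkowski(x, y, 2)         # sqrt(9 + 16)
supremum = minkowski(x, y, math.inf)   # max(3, 4)
```

For these two points the Manhattan distance is the largest of the three and the supremum distance the smallest.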

- (?) Mahalanobis distance 

- Properties of distance:
    - d(x,y) >= 0, and d(x,y) = 0 only if x = y (Positive definiteness)
    - d(x,y) = d(y,x) (symmetry)
    - d(x,z) <= d(x,y) + d(y,z) (Triangle inequality)

- Properties of similarity:
    - max similarity = 1 if x = y
    - s(x,y) = s(y,x) (symmetry)

#### Proximity for binary attributes

- Compute similarities using the following quantities
    - f01 = the number of attributes where p was 0 and q was 1
    - f10 = the number of attributes where p was 1 and q was 0
    - f00 = the number of attributes where p was 0 and q was 0
    - f11 = the number of attributes where p was 1 and q was 1

- Simple matching = number of matches / number of attributes
    - (f11 + f00) / (f01 + f10 + f11 + f00)

- Jaccard = number of 11 matches / number of non-zero attributes
    - (f11) / (f01 + f10 + f11) 
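
The two coefficients can be compared on a pair of sparse binary vectors. This is a small sketch; the function names and the vectors `p` and `q` are assumptions chosen for illustration.

```python
def binary_counts(p, q):
    """Return (f01, f10, f00, f11) for two equal-length 0/1 vectors."""
    pairs = list(zip(p, q))
    f01 = sum(1 for a, b in pairs if a == 0 and b == 1)
    f10 = sum(1 for a, b in pairs if a == 1 and b == 0)
    f00 = sum(1 for a, b in pairs if a == 0 and b == 0)
    f11 = sum(1 for a, b in pairs if a == 1 and b == 1)
    return f01, f10, f00, f11

def simple_matching(p, q):
    f01, f10, f00, f11 = binary_counts(p, q)
    return (f11 + f00) / (f01 + f10 + f11 + f00)

def jaccard(p, q):
    f01, f10, f00, f11 = binary_counts(p, q)
    denom = f01 + f10 + f11
    # Assumption: two all-zero vectors get a Jaccard value of 0
    return f11 / denom if denom else 0.0

p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
smc = simple_matching(p, q)
jac = jaccard(p, q)
```

Because most positions are 0 in both vectors, simple matching reports a high similarity while Jaccard, which ignores the 0-0 matches, reports none. This is why Jaccard is preferred for asymmetric binary attributes.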

#### Cosine similarity

- Good for long, sparse vectors, such as measuring the similarity of two documents. Measures the cosine of the angle between two vectors, which indicates whether they point in roughly the same direction. 
    - cos(d1, d2) = d1 (dot) d2 / ||d1|| ||d2||

- Extended Jaccard Coefficient (Tanimoto)
    - (x dot y)/(||x||^2+||y||^2 - x dot y)
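
A short sketch of both formulas on two term-count vectors follows. The vectors `d1` and `d2` and the helper names are illustrative assumptions.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    """cos(x, y) = (x . y) / (||x|| ||y||)"""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

def tanimoto(x, y):
    """Extended Jaccard: (x . y) / (||x||^2 + ||y||^2 - x . y)"""
    xy = dot(x, y)
    return xy / (dot(x, x) + dot(y, y) - xy)

# Hypothetical term counts for two documents
d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
cos_sim = cosine(d1, d2)    # 5 / (sqrt(42) * sqrt(6))
tan_sim = tanimoto(d1, d2)  # 5 / (42 + 6 - 5)
```

Neither function guards against all-zero vectors, which would cause a division by zero.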

#### Correlation and covariance

- Correlation measures the linear relationship between two attributes
    - cov(x,y) = (1/(n-1)) x sum(x deviation * y deviation)
    - corr = cov/(sx x sy)
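
The two formulas can be sketched directly, assuming the sample (n-1) form of the covariance and standard deviations. The data values are made up so that y is an exact linear function of x.

```python
import math

def covariance(x, y):
    """Sample covariance: sum of products of deviations divided by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """corr = cov(x, y) / (s_x * s_y), where s^2 = cov(v, v)."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]   # y = 2x, a perfect positive linear relationship
cov_xy = covariance(x, y)
corr_xy = correlation(x, y)
```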

#### Information Based Measures

- The more certain an outcome, the less information that it contains and vice-versa
    - If a coin has 2 heads, a flip provides no additional information.
    - Information is inversely related to the probability of an outcome.

- This is commonly measured as entropy: $ H = -\sum_{i=1}^n p_{i}log_{2}p_{i}$
- Entropy is between 0 and $log_{2}n$
- Mutual information: I(X,Y) = H(X) + H(Y) - H(X,Y), where H(X,Y) is the joint entropy of X and Y
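
The coin example can be checked numerically. The sketch below estimates probabilities from observed frequencies (an assumption about how p_i is obtained) and also evaluates H(X) + H(Y) - H(X,Y), the quantity usually called mutual information.

```python
import math
from collections import Counter

def entropy(values):
    """Entropy in bits, with p_i estimated from value frequencies."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    """H(X) + H(Y) - H(X, Y), where H(X, Y) is the entropy of the pairs."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

fair_coin = entropy(["H", "T"])     # two equally likely outcomes: log2(2) bits
two_headed = entropy(["H", "H"])    # a certain outcome carries no information

xs = [0, 0, 1, 1]
ys = [0, 1, 0, 1]                   # every (x, y) pair occurs once: independent
mi_independent = mutual_information(xs, ys)
mi_identical = mutual_information(xs, xs)
```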

- (?) General approach for combining similarities.

#### Measures of density

- Euclidean Density: grid based approach
    - Divide region into cells of equal volume, count the points in each cell
- Euclidean Density: center-based approach
    - The number of points within a specified radius.

### Chapter 3: Data Preprocessing

- Data quality depends on: accuracy, completeness, consistency, timeliness, believability, interpretability. 
- The major tasks in data preprocessing include:
    - Data cleaning: fill missing values, smooth noisy data, remove outliers, resolve inconsistency.
    - Data integration: combining various data sets / sources
    - Data reduction: Compressing data, removing irrelevant attributes
    - Numerosity reduction: replace data with smaller representation. 
    - Data transformation: normalization, data discretization, concept hierarchy

#### Data Cleaning

- Missing values
    - Ignore the tuple, fill it in manually, use a global constant to fill it, use a central tendency (mean or median) for the attribute, or use a central tendency for the attribute among tuples of the same class
- Noisy data
    - Smoothing by bin means: each value replaced by the mean of the bin
    - Smoothing by bin medians: each value is replaced by the bin median. Smoothing by bin boundaries: each value is replaced by the closest bin boundary (the min or max of the bin).
    - Regression: find best line to fit the data
    - Outlier analysis: use cluster analysis 
    

#### Data integration

- Merging of various data sources. 
    - Entity identification problem (matching up sources)
    - Redundancy - can be detected by correlation analysis (chi-square or correlation coefficient)
    - (?) chi-square calculation

#### Data reduction

- Aggregation: combining two or more attributes into a single attribute. 
    - Smaller data volume, but maintains most of the integrity of the original data
    - Change of scale from weeks to years, cities to states, etc. (concept hierarchy)
    - Purpose: data reduction, change of scale, more stable data
- Histograms:
    - equal width vs. equal frequency

#### Sampling 

- Processing the entire data set is too expensive or time consuming
    - sample must be representative 
    - simple random sampling w/ or w/out replacement
    - cluster sample
    - stratified sampling: ensures a representative sample when data is skewed.

#### Reduce dimensionality

- When dimensionality increases, data becomes sparse and definitions of density become less meaningful 
- Purpose of reducing dimensionality: avoid curse of dimensionality, reduce time and memory to mine, allow data to be visualized.
- Techniques: 
    - Principal components analysis (PCA): find a projection that captures the largest amount of variation in the data

#### Feature subset selection

- Remove redundant features, irrelevant features, especially for classification. 
- Trim based on information gain

#### Feature creation

- Create new attributes that can capture the important information in a data set more efficiently than the original attributes:
    - feature extraction, feature construction, mapping data to new space
    - (?) Fourier and wavelet transformation

#### Discretization 

- The process of converting a continuous attribute into an ordinal one
    - Commonly used in classification; many classification algorithms work best when attributes have only a few possible values

#### Binarization 

- Maps continuous or categorical values to binary variables
    - Typically used for association analysis (asymmetric binary attributes)

#### Attribute transformation 

- Function maps the entire set of values to new values
    - log(x)
    - normalization, standardization
    - example: seasonality accounts for much of correlation of plant growth

#### Normalization

- Min-max normalization: (v - min) / (max - min) * (new_max - new_min) + new_min
    - the new range is typically [0,1]

- Z-score normalization: (v - avg)/sigma

- Normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1; range (-1,1)
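
The three normalization formulas can be sketched as follows. The function names and sample values are assumptions, the z-score uses the population standard deviation, and none of the functions handle a constant attribute (which would divide by zero).

```python
import math

def min_max(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    mean = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sigma for v in values]

def decimal_scaling(values):
    # Smallest j such that every scaled absolute value is below 1
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

scaled = min_max([10, 20, 30])
standardized = z_score([10, 20, 30])
decimals = decimal_scaling([-986, 917])   # j = 3 for these values
```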

#### Concept hierarchy for nominal data

- May be generated based on the number of distinct values per attribute

### Chapter 6

- Frequent pattern: a pattern (set of items, subsequences, substructures) that occurs frequently in a data set.
    - What products are purchased together? (basket data analysis)
    - What products are purchased after purchasing a product? (cross marketing)
    - Web log analysis, DNA sequence analysis

- Market basket analysis:
    - $support(A \rightarrow B) = P(A \cup B) = count(A \cup B) / n$
    - $confidence(A \rightarrow B) = P(B | A) = count(A \cup B) / count(A)$
    - Find all frequent itemsets
    - Generate all strong association rules from each frequent itemset
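
Support and confidence can be computed directly from a list of transactions. The sketch below uses a toy basket data set invented for illustration, where count(A ∪ B) means the number of transactions containing every item of both A and B.

```python
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(a, b, transactions):
    """P(B | A): support of A and B together divided by support of A."""
    return support(set(a) | set(b), transactions) / support(a, transactions)

sup = support({"milk", "diapers", "beer"}, transactions)
conf = confidence({"milk", "diapers"}, {"beer"}, transactions)
```

A rule is called strong when both values meet user-chosen minimum thresholds.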
    

- Closed and max sets
    - An itemset X is closed if X is frequent and there exists no super-pattern Y $\supset$ X with the same support as X
        - Closed itemsets are a lossless compression: the support of every frequent itemset can be recovered from them
    - An itemset X is maximal if X is frequent and there exists no frequent super-pattern Y $\supset$ X
        - Maximal itemsets are a lossy compression: they identify which itemsets are frequent but not the support of their subsets

#### Apriori: Finding Frequent Itemsets by Confined Candidate Generation

- Downward closure property of frequent patterns: any subset of a frequent itemset must be frequent. 
- Scan the database to get all frequent 1-itemsets (L1).
- Generate candidates of length (k+1) from the frequent k-itemsets (Lk): join two k-itemsets if their first (k-1) items are the same.
- Look at all k-item subsets of each candidate and prune the candidate if any subset is not frequent.
- Scan the database to count the remaining candidates and keep those that meet minimum support as L(k+1).
- Repeat until no frequent itemsets or candidates are generated.
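
The join, prune and count steps can be sketched in a few lines. This is an illustrative, unoptimized version (no hash tree); the function name `apriori`, the absolute `min_count` threshold and the toy database are assumptions.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: count} for every itemset with count >= min_count."""
    transactions = [frozenset(t) for t in transactions]
    # First scan: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(level)
    k = 1
    while level:
        prev = sorted(tuple(sorted(s)) for s in level)
        candidates = set()
        for a, b in combinations(prev, 2):
            # Join: two k-itemsets that share their first k-1 (sorted) items
            if a[:k - 1] == b[:k - 1]:
                cand = frozenset(a) | frozenset(b)
                # Prune: every k-item subset must itself be frequent
                if all(frozenset(s) in level for s in combinations(cand, k)):
                    candidates.add(cand)
        # Count: one scan of the database per level
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(level)
        k += 1
    return frequent

# Toy database of nine transactions (made up for illustration)
db = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
frequent = apriori(db, min_count=2)
```

With a minimum count of 2, the candidate {I1, I3, I5} is pruned without being counted because its subset {I3, I5} is not frequent.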

- Problems with Apriori: multiple scans of the database, many candidates, support counting
    - breadth first, huge number of candidates
- (?) Improvements to the algorithm: Candidate itemsets stored in a hash tree.
    - Others include: Transaction reduction, partitioning, sampling

#### Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation

- Depth first search 
- Avoids explicit candidate generation
- Grow long patterns from short ones using locally frequent items

- Construct an FP Tree 
    - Scan the database once to find the frequent 1-itemsets
    - Sort them in descending order of frequency
    - Scan each transaction again and insert its frequent items into the tree in that sorted order
- Create conditional pattern bases from the paths up the tree
    - For each pattern base, create a conditional FP Tree
    - From the conditional FP Tree, generate frequent patterns.

- Benefits of the FP Tree structure
    - Completeness, compactness
    - Divide-and-conquer: 
        - Decompose both the mining task and DB according to the frequent patterns obtained so far
        - Leads to focused search of smaller databases
        - No candidate generation, no candidate test
        - Compressed database: FP-tree structure
        - No repeated scan of entire database

#### Which patterns are interesting?

- Sometimes, support and confidence do not measure the real strength (or lack of strength) of the correlation and implication between A and B, so they do not filter out uninteresting association rules. 

- Lift: $P(A \cup B) / (P(A) * P(B))$   >1 = positive correlation;  <1 = negative
- Chi-square 
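
A quick calculation shows how a rule with high confidence can still be negatively correlated. The counts below are hypothetical, and the argument names are assumptions.

```python
def lift(n_a, n_b, n_ab, n):
    """Lift = P(A and B together) / (P(A) * P(B)), computed from raw counts."""
    return (n_ab / n) / ((n_a / n) * (n_b / n))

# Hypothetical counts: 10,000 transactions, 6,000 contain A,
# 7,500 contain B, and 4,000 contain both.
rule_confidence = 4000 / 6000            # confidence(A -> B), about 0.67
rule_lift = lift(6000, 7500, 4000, 10000)
```

The confidence of about 67% looks strong, but it is lower than the 75% of all transactions that contain B, so the lift is below 1 and A and B are negatively correlated.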

- Four other measures defined have the following property: Its value is only influenced by the supports of A, B, and A ∪ B, or more exactly, by the conditional probabilities of P(A|B) and P(B|A), but not by the total number of transactions. 
- Another common property is that each measure ranges from 0 to 1, and the higher the value, the closer the relationship between A and B

- A measure is null-invariant if its value is free from the influence of null-transactions (transactions that contain neither A nor B). Null-invariance is an important property for measuring association patterns in large transaction databases. Among the six discussed measures in this subsection, only lift and χ² are not null-invariant measures.

### Chapter 8

- Supervised learning: training data with labels indicating the class of the observation (classification)
- Unsupervised learning: the class labels of the training data are not known, looking to establish classes (clustering)

- Prediction problems:
    - Classification: predicts category
        - Typical applications: loan approval, medical diagnosis, fraud detection, web categorization
    - Numeric prediction: regression analysis

- Classification is a two-step process. 
    - Model construction: use the samples/tuples as the training set
        - Model represents classification rules, decision trees, mathematical formula
    - Model usage: used to classify unknown objects
        - estimate accuracy: rate of test samples that are correctly classified. The test set must be independent of the training set, otherwise overfitting makes the estimate too optimistic.
        - if the accuracy is acceptable, it is used to classify new data.

#### Decision tree induction

- Flowchart-like structure where each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label. 
- Decision trees are popular because they don't require any domain knowledge or parameter setting so they are appropriate for exploration. They are simple and intuitive. Frequently used in medicine and astronomy.

- Basic algorithm (a greedy, non-backtracking algorithm):
    - Tree is constructed in top-down recursive divide and conquer manner
    - At the start, all training examples are at the root.
    - All attributes are categorical (continuous values are discretized in advance)
    - Examples are partitioned recursively based on selected attributes
    - Test attributes are selected on the basis of information gain.
    - Stop if: 
        - all samples for a node belong to the same class
        - there are no attributes left, there are no samples left

#### Information gain

- Entropy is a measure of the uncertainty 
- p_i is the probability that a tuple in D belongs to class C_i: |C_i|/|D|
- Expected information (entropy) needed to classify a tuple in D: Info(D) = $ -\sum_{i=1}^n p_{i}log_{2}p_{i}$
- Information needed after partitioning on attribute A: Info_A(D), the weighted average entropy of the new partitions
- Information gain: Gain(A) = Info(D) - Info_A(D)
- When working with continuous data, an optimum split-point must be found
- Problem with information gain: tends to favor attributes with a large number of distinct values.
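
As a worked sketch, the gain of a split is the entropy before the split minus the weighted entropy of the partitions after it. The class counts below (9 "yes" and 5 "no" tuples split three ways) are hypothetical, and the function names are assumptions.

```python
import math
from collections import Counter

def info(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(partitions):
    """partitions: one list of class labels per value of the split attribute."""
    all_labels = [label for part in partitions for label in part]
    n = len(all_labels)
    info_after = sum(len(part) / n * info(part) for part in partitions)
    return info(all_labels) - info_after

partitions = [
    ["yes"] * 2 + ["no"] * 3,   # first attribute value
    ["yes"] * 4,                # second value: a pure partition
    ["yes"] * 3 + ["no"] * 2,   # third value
]
info_before = info([label for part in partitions for label in part])
gain = info_gain(partitions)
```

The pure partition contributes zero entropy, which is where most of the gain comes from in this example.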

#### Gain ratio

- (?) Gain ratio normalizes gain with split info: 
- GainRatio = Gain(A)/SplitInfo(A)
- Problem with gain ratio: Tends to prefer unbalanced splits where one partition is much smaller than the other.

#### Gini Index

- split data into two subsets
- gini(D) = 1 - $\sum_{j=1}^np_j^2$
- gini_A(D) = weighted average gini of the two subsets after the split on attribute A
- need to enumerate all the possible splitting points (all possible binary subsets)
- Problem with gini index: biased towards multivalued attributes and has difficulty when the number of classes is large
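
The impurity before and after one binary split can be sketched as below. The class counts are hypothetical and the function names are assumptions.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """Weighted average gini of the two subsets produced by a binary split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical split of 14 tuples (9 "yes", 5 "no") into 10 and 4
left = ["yes"] * 7 + ["no"] * 3
right = ["yes"] * 2 + ["no"] * 2
gini_before = gini(left + right)
gini_after = gini_split(left, right)
```

The split with the lowest weighted gini (the largest reduction in impurity) is the one selected.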

- A tree may have too many branches, may reflect anomalies due to noise or outliers
- No one method is measurably more effective than another, they all have biases
- Two approaches to avoid overfitting: 
    - prepruning: halt early if gain is below a threshold
    - postpruning: remove branches from a fully grown tree
- Problem with the decision tree is storing the large data set in memory.
    - (?) RainForest framework: builds a list of (attribute, value, class_label)
    - BOAT bootstraps smaller trees into a bigger tree

#### Bayesian Classification

- Naive Bayes assumes class-conditional independence, which simplifies the calculation (this is why it is called naive)
- It is a statistical classifier - predicts class membership probabilities
- Standard: Bayesian methods provide a standard of optimal decision making against which other methods can be measured.
- Incremental: each training example can increase/decrease the probability that a hypothesis is correct.

- Bayes Theorem: P(H|X) = [P(X|H) x P(H)] / P(X)
- Maximum of P(C|X) for all classes
- Practical difficulty: requires initial knowledge of many probabilities involving significant computational cost.

- Since P(X) is constant, only P(X|C) x P(C) needs to be maximized. If class probabilities are not known, only P(X|C) needs to be maximized.
- P(X|C) is the product of all the individual attribute conditional probabilities
- Each conditional probability must be non-zero, use a Laplacian correction, adding 1 to each case.

- Advantages: easy to implement, good results in most cases.
- Disadvantages: Assumes conditional independence, dependencies likely exist but can't be modeled.

#### Rule-Based Classification

- Rules are easier to understand than large trees; one rule is created for each path of the tree
- Rules extracted from a decision tree are mutually exclusive and exhaustive
- Rule antecedent/precondition vs rule consequent

- Coverage = (tuples covered by the rule) / (number of tuples in D)
- Accuracy = (tuples correctly identified) / (tuples covered by the rule)

- If more than one rule is triggered during classification, need a strategy to deal with it:
    - size ordering: give priority to the triggered rule with the most attribute tests
    - class-based ordering: decreasing order of prevalence or cost of misclassification
    - rule-based ordering: make a priority list

#### Sequential covering method

- IF-THEN rules can be extracted directly from the data without building a tree using a sequential covering method.
- Rules are learned one at a time. Each time a rule is learned the tuples covered by the rules are removed
- Learn the best rule for the current class, use a greedy depth first search to do this
- (?) FOIL considers/favors both coverage and accuracy
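- FOIL_Prune(R) = (pos - neg)/(pos + neg), where pos and neg are the positive and negative tuples covered by R; prune R if the value is higher for the pruned version of R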

#### Model evaluation and selection

- Use test data set when measuring accuracy
- Methods for estimating accuracy:

- Confusion matrix (predicted class in columns, actual class in rows):  
    
| Actual \ Predicted | YES | NO | TOTAL |
|---|---|---|---|
| YES | TP | FN | P |
| NO | FP | TN | N |
| TOTAL | P' | N' | P+N |

- Accuracy: % of tuples that are correctly classified = (TP + TN)/(P + N) 
    - Most effective when the classes are roughly balanced

- Error rate: misclassification rate: 1- accuracy = (FP + FN)/(P+N)

- Sensitivity: True positive recognition rate = (TP)/P
- Specificity: True negative recognition rate = TN/N
- Accuracy = sensitivity x (P/(P+N)) + specificity x (N/(P+N))

- Precision: measure of exactness, what % of tuples that the classifier labeled as positive actually are: TP/(TP+FP)
- Recall: measure of completeness, what % of positive tuples did the classifier label as positive: TP/P
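- Precision and recall tend to have an inverse relationship: improving one often reduces the other. Together they show how many tuples were wrongly labeled positive (FP) and how many positive tuples were missed (FN).
- F measure: harmonic mean of precision and recall: F = (2 x precision x recall) / (precision + recall), which weights precision and recall equally
- Weighted version: F_b = ((1 + b^2) x precision x recall) / (b^2 x precision + recall); b = 2 weights recall twice as much as precision, b = 0.5 weights precision twice as much as recall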
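
The measures above can all be derived from the four cells of the confusion matrix. The sketch below uses hypothetical counts for a rare positive class; the function name and the `beta` parameter are assumptions.

```python
def classification_metrics(tp, fn, fp, tn, beta=1.0):
    """Evaluation measures computed from the confusion matrix cells."""
    p, n = tp + fn, fp + tn            # actual positives and negatives
    precision = tp / (tp + fp)
    recall = tp / p                    # same as sensitivity
    f_beta = ((1 + beta ** 2) * precision * recall
              / (beta ** 2 * precision + recall))
    return {
        "accuracy": (tp + tn) / (p + n),
        "error_rate": (fp + fn) / (p + n),
        "sensitivity": recall,
        "specificity": tn / n,
        "precision": precision,
        "recall": recall,
        "f": f_beta,
    }

# Hypothetical, heavily imbalanced data: 300 positives, 9,700 negatives
m = classification_metrics(tp=90, fn=210, fp=140, tn=9560)
```

Accuracy is 96.5% here even though only 30% of the positive tuples are found, which illustrates why accuracy alone is misleading on imbalanced classes.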

#### Holdout and random

- Holdout method: training set = 2/3 of data, test set = 1/3 of data
- Random sampling: repeat holdout k times, accuracy is average

#### Cross validation

- Cross-validation: randomly partition the data into k mutually exclusive subsets. At ith iteration, use Di as test, others as training
- Leave-one-out: k-fold cross-validation where k = number of tuples, used for small data sets
- Stratified cross validation: folds are stratified so the class dist reflects the total data

#### Bootstrap

- Works well with small data sets, tends to be overly optimistic
- Samples the training tuples uniformly with replacement
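- The probability that a tuple is never selected in d draws is (1-1/d)^d, which approaches e^-1 = 0.368 as d grows large, so about 36.8% of tuples end up outside the sample (test set) and 63.2% are in it (training set)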
- Repeat k times to increase accuracy
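
The 0.632 / 0.368 proportions can be checked with a small simulation. This is a sketch: the helper name, the data set size `d` and the fixed random seed are assumptions.

```python
import math
import random

def bootstrap_split(d, rng):
    """Draw d indices with replacement; indices never drawn form the test set."""
    train = [rng.randrange(d) for _ in range(d)]
    chosen = set(train)
    test = [i for i in range(d) if i not in chosen]
    return train, test

d = 10000
rng = random.Random(0)                 # seeded only for repeatability
train, test = bootstrap_split(d, rng)
fraction_out = len(test) / d           # expected to be close to 0.368
limit = (1 - 1 / d) ** d               # analytic value, close to exp(-1)
```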


#### Model selection using statistical tests of significance

- Null hypothesis is that M1 and M2 are the same. 
- Assume t-distribution, k-1 degrees of freedom
- if t > z or t < -z, where z is the t-distribution value at significance level sig/2, reject the null
- If we reject the null, then the difference between M1 and M2 is statistically significant; choose the model with the lower error rate
- Formulas for pairwise comparison

#### Receiver operating characteristic (ROC) curves

- For visual comparison of classification models
- Shows the trade-off between true positive rate and false positive rate
    - The cost associated with false negative (not diagnosing a positive cancer) vs false positive (diagnosing cancer incorrectly)
- Area under the curve is a measure of the accuracy of the model; an area close to 0.5 means the model is little better than random guessing
- Rank the tuples in order of most likely to belong to the positive class down (use naive bayes to get probability of class)
- determine a threshold t for where the model is positive
- Plot by moving up for true positive, right for false positive
- (?) A model with perfect accuracy will have an area of 1
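
The plotting procedure can be sketched as follows: rank tuples by predicted probability, step up for each true positive and right for each false positive, then integrate with the trapezoid rule. The scores are made up, and ties between scores are not handled specially.

```python
def roc_points(scored):
    """scored: list of (probability_of_positive, is_positive) pairs."""
    ranked = sorted(scored, key=lambda s: s[0], reverse=True)
    p = sum(1 for _, is_pos in ranked if is_pos)
    n = len(ranked) - p
    tp = fp = 0
    points = [(0.0, 0.0)]              # (false positive rate, true positive rate)
    for _, is_pos in ranked:
        if is_pos:
            tp += 1                    # move up
        else:
            fp += 1                    # move right
        points.append((fp / n, tp / p))
    return points

def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical classifier output for ten tuples
scored = [(0.90, True), (0.80, True), (0.70, False), (0.60, True),
          (0.55, True), (0.54, False), (0.53, False), (0.51, False),
          (0.50, True), (0.40, False)]
area = auc(roc_points(scored))
perfect = auc(roc_points([(0.9, True), (0.8, True), (0.2, False)]))
```

A ranking that places every positive tuple above every negative one gives an area of 1.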

- Issues affecting model selection: accuracy, speed, robustness, scalability, interpretability

#### Techniques to improve classification accuracy

- Ensemble methods: use a combination of models to increase accuracy

- Bagging: each training set is a bootstrap sample, sampling with replacement is used, averages the prediction over a collection of classifiers
- Boosting and AdaBoost: weighted vote with a collection of classifiers, more accurate but risks overfitting
- (?) Random Forest: Each classifier in the ensemble is a decision tree and is generated using a random selection of attributes at each node to determine the split. Comparable accuracy to AdaBoost, more robust to outliers and errors 

- Dealing with class imbalanced sets: oversampling, under-sampling, threshold moving

### Chapter 10: Cluster Analysis

- Cluster: a group of data objects that are similar to each other and dissimilar to objects in other groups
- Cluster analysis: unsupervised learning, can be used as a stand alone method or as a preprocessing technique
- Preprocessing: may be used to preprocess for classification or attribute selection or outlier detection
- Use in applications: biology, marketing, fraud detection
- Good clustering:
    - high cohesiveness: within cluster
    - distinctive between clusters
- Requirements:
    - scalability: clustering on all of the data, not a sample
    - ability to deal with different data types
    - minimal requirements for domain knowledge to determine input parameters
    - ability to deal with noisy data
- Considerations:
    - single level vs. hierarchical
    - completely separate or allow for overlap
    - how to define similarity
    - subspace clustering

#### Basic clustering methods

- Partitioning methods: distance based, uses iterative relocation techniques to move objects from one partition to another. Works well finding spherical clusters in small-mid sized databases
- Hierarchical methods: can be bottom-up (start with many clusters and merge) or top-down (start with one cluster and split). Once a merge or split has been done, it can't be undone.
- Density based methods: Continue to grow the cluster as long as the density is above some threshold. Can find clusters of arbitrary shape.
- Grid based: fast processing time

#### Partitioning methods

- K-means: partition objects into k groups, then repeat until no change: compute the centroid of each group, assign each data point to the nearest centroid.
    - strengths: efficient O(tkn)
    - weaknesses: often terminates at a local optimal point, only applicable to continuous n-dimensional space, need to specify k in advance, sensitive to outliers, not good with non-convex shapes, sensitive to seed
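
A minimal sketch of the k-means loop is shown below. It assumes Python 3.8+ for `math.dist`, seeds the centroids by picking k random data points (one of several possible initializations), and keeps the old centroid if a cluster becomes empty. The points are made up.

```python
import math
import random

def kmeans(points, k, rng, max_iter=100):
    """Return (centroids, assignment) for a list of numeric tuples."""
    centroids = rng.sample(points, k)          # random data points as seeds
    assignment = None
    for _ in range(max_iter):
        # Assign each point to its nearest centroid
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:       # no change: converged
            break
        assignment = new_assignment
        # Recompute each centroid as the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return centroids, assignment

# Two well-separated groups of illustrative 2-D points
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, assignment = kmeans(points, k=2, rng=random.Random(0))
```

On data this well separated the result does not depend on the seed, but in general different seeds can lead to different local optima, which is the sensitivity noted above.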

- K-medoids: instead of taking the mean, use the most centrally located data point (medoid) as the cluster center. Partition the data, test swapping a medoid with a non-medoid point, and if the swap reduces the clustering cost, reassign the medoid. PAM works well for small data sets, but it is not efficient for large data sets.
    - CLARA is PAM but using samples
    - (?) CLARANS

#### Hierarchical Methods

- Divides data into a tree of clusters, does not require K as an input
- Agglomerative: starts with the individual vs. Divisive: starts as one cluster

- AGNES: merge nodes that are the nearest neighbor single link (least dissimilarity)
- DIANA: opposite of AGNES, start as one
- Similarity is measured as the closest pair between the two clusters

- Measures of distance between clusters:
    - single link: nearest neighbor, minimum distance
    - complete link: maximum distance, largest distance from one element to another
    - average: avg distance between an element in one cluster and element in another
    - centroid: distance between centroids
    - medoid: distance between medoids 
- Major weaknesses: cannot be undone and does not scale well

- (?) BIRCH overcomes these two weaknesses, scalability and inability to undo what was previously done, however it only handles numeric data
    - uses clustering feature to summarize the cluster, and CF tree to represent the hierarchy
    - CF = <n,LS,SS>
    - Can find the centroid, radius, diameter

- Chameleon: graph based and two-phase: 
    - use graph partitioning algo to cluster objects in small sub-clusters
    - use agglomerative hierarchical clustering algo to find genuine clusters by combining sub clusters
    - used for clustering complex objects, however the processing cost is high O(n^2)

- Issues with hierarchical clustering methods: choosing a good distance measure is nontrivial, cannot have any missing attribute values, optimization goal not clear: local search
- Probabilistic clustering: aims to overcome these using probabilistic models to measure distances between clusters
- use a generative model: assume the data set follows a common distribution function (for example, Gaussian) and find the Mu and sigma that maximize the likelihood that the set of data points was generated
- same efficiency, but can handle missing data

#### Density-based Methods

- Clustering based on density (local clustering)
    - Discover clusters of arbitrary shape
    - Can handle noise
    - One scan, but needs a termination condition
- DBSCAN: discover clusters of arbitrary shape, cluster: maximal set of density-connected points
    - Eps: max radius of the neighborhood
    - MinPts: min points in Eps neighborhood
    - A point is a core point if its Eps neighborhood contains at least MinPts points. If a point is not a core point but can be reached from one, it is a border point
    - Directly density reachable: if it belongs to the neighborhood of a core point
    - Density reachable: reachable through a chain of directly density reachable points
    - Computational complexity: O(n log n) if a spatial index is used, otherwise O(n^2)
    - sensitive to setting of parameters
    - Algo: select point p, retrieve all points density reachable from p wrt eps and minPts. if core, cluster is formed, else, visit next point

- OPTICS: extends DBSCAN but much less sensitive to parameter setting
    - Density based clusters are monotonic with respect to the neighborhood threshold
    - The deeper the valley, the denser the cluster, reachability plot
    - core distance: the smallest radius whose neighborhood covers at least MinPts points
    - reachability distance: min radius that makes p density reachable from q
        - max(core-distance, distance(q,p))
    - O(n log n) if indexed

- (?) DENCLUE - 

#### Grid based methods

- partition the data space into cells to form a grid structure, then find clusters from the dense cells
- Efficient and scalable, uniform but hard to handle irregular distributions, locality, curse of dimensionality

- STING: Spatial area is divided into rectangular cells at different levels of resolution to form a tree structure
- Statistical measures for each cell. Calculate the likelihood the cell is relevant at some confidence level, only children at the relevant cells are explored
- Query independent, complexity is O(K), k<< N
- disadvantage: probabilistic nature may imply a loss of accuracy for query processing

- CLIQUE: density based and grid based, connect dense units into a cluster. Starts from a low dimension
- Start in 1D, find the dense regions in each subspace and generate their minimal descriptions, then find promising candidates in 2D, and repeat level by level in higher-dimensional spaces in an Apriori manner; finally, find the connected dense units
- Strengths: automatically finds subspaces of the highest dimensionality as long as high density clusters exist, insensitive to the order of records, scales linearly 
- Weakness: quality of the clustering depends on the resolution of the grid

#### Assessing Clustering 

- Assessing if non-random structure exists in the data by measuring probability the data is generated by uniform data distribution
    - Tested with Hopkins Statistic
- Determine the number of clusters: empirical method (Sqrt(n/2)), elbow method, cross validation method
- Measuring quality: extrinsic (supervised) vs. intrinsic (unsupervised)
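- Quality is good if the clustering satisfies: cluster homogeneity (pure clusters), cluster completeness, the 'rag bag' criterion (objects that fit nowhere go into a miscellaneous cluster rather than a pure one), and small cluster preservation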