# Decision Support Systems

- **Decision-support systems** are used to make business decisions, often based on data collected using OLTP systems
- **Data analysis** tasks are simplified by specialized tools and SQL extensions 
- **Data mining** seeks to discover knowledge automatically in the form of statistical rules and patterns from large databases

# Data Warehousing

- data sources often store only current data, not historical data
- corporate decision making requires a unified view of all organizational data, including historical data
- a **data warehouse** archives information gathered from multiple sources, and stores it under a unified schema, at a single site
    - important for large businesses that generate data from multiple divisions, possibly at multiple sites
    - simplifies querying and permits study of historical trends
    - shifts decision support query load away from transaction processing systems and into other tools
    
<img src="img/Snip20191111_220.png" width=80%/>

# Design Challenges

- *when and how to gather data*
    - **Source driven architecture**: data sources transmit new information to warehouse, either continuously or periodically
    - **Destination driven architecture**: warehouse periodically requests new information from data sources
    - keeping warehouse exactly synchronized with data sources is too expensive
- *what schema to use*
    - schema integration
- *data cleansing*
    - e.g., **correct** mistakes in attributes (e.g., addresses)
    - or **merge** address lists from different sources and **purge** duplicates
- *how to propagate updates*
    - warehouse schema may be a (materialized) view of schema from data sources
- *what data to summarize*
    - raw data may be too large to store 
    - aggregate values (totals/subtotals) often suffice
    - queries on raw data can often be transformed by the query optimizer to use aggregate values



# Warehouse Schemas

- warehouses generally organize data into **fact tables** and **dimension tables**


- **Fact tables** describe specific events (e.g., transactions), and contain mostly *small numeric values* as well as *foreign keys* pointing to rows of dimension tables
    - fact tables can be very large


- **Dimension tables** store descriptive data, for example referring to a time, a place, a product, or a person
    - tend to record a larger number of attributes, including both numeric and text
    - tend to contain fewer rows than fact tables
    

- the resulting schema is called a **star schema**; more elaborate schema structures are possible
    - **Snowflake schema**: multiple levels of dimension tables
    - **Constellation**: multiple fact tables

- Example of Star Schema

<img src="img/Snip20191111_221.png" width=80%/>
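
A star schema can be sketched with Python's built-in `sqlite3` module. The table and column names below (`sales`, `item_info`, `store`) and all rows are hypothetical and are not taken from the figure; the point is only that the fact table holds small numeric values plus foreign keys, and a typical query joins it to the dimension tables and aggregates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- dimension tables: descriptive attributes, few rows
CREATE TABLE item_info (item_id INTEGER PRIMARY KEY, item_name TEXT, category TEXT);
CREATE TABLE store (store_id INTEGER PRIMARY KEY, city TEXT);
-- fact table: one row per event, foreign keys plus numeric measures
CREATE TABLE sales (
    item_id  INTEGER REFERENCES item_info(item_id),
    store_id INTEGER REFERENCES store(store_id),
    quantity INTEGER,
    price    REAL
);
INSERT INTO item_info VALUES (1, 'skirt', 'womenswear'), (2, 'shirt', 'menswear');
INSERT INTO store VALUES (1, 'Toronto'), (2, 'Waterloo');
INSERT INTO sales VALUES (1, 1, 2, 30.0), (1, 2, 1, 30.0), (2, 1, 3, 20.0);
""")

# Total quantity sold per city and category: join facts to both dimensions.
rows = conn.execute("""
    SELECT s.city, i.category, SUM(f.quantity)
    FROM sales AS f
    JOIN item_info AS i ON f.item_id = i.item_id
    JOIN store AS s ON f.store_id = s.store_id
    GROUP BY s.city, i.category
    ORDER BY s.city, i.category
""").fetchall()
print(rows)
# [('Toronto', 'menswear', 3), ('Toronto', 'womenswear', 2), ('Waterloo', 'womenswear', 1)]
```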



# Data Cubes

- a star schema represents multidimensional data, which is difficult to visualize
- OLAP systems enable data summarization and interactive exploration of a multidimensional data set using the **data cube** abstraction
- each cell of the data cube holds a data item corresponding to an intersection of dimensions
- a **slicer** is a dimension that is held constant so that the cube can be collapsed onto fewer dimensions

<img src="img/Snip20191111_222.png" width=60%/>

## Slicing

<img src="img/Snip20191111_223.png" width=80%/>

## Dicing 

<img src="img/Snip20191111_224.png" width=80%/>

## Drill-up/Drill-down

<img src="img/Snip20191111_225.png" width=80%/>

## Pivoting

<img src="img/Snip20191111_226.png" width=80%/>
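
The cube operations can also be sketched on a small in-memory fact list. The dimensions (`item`, `color`, `size`) and the quantities are made up for illustration, and the helper names `slice_`, `dice`, and `roll_up` are assumptions of this sketch rather than the API of any OLAP tool.

```python
from collections import defaultdict

DIMS = ("item", "color", "size")
# Hypothetical facts: (item, color, size, quantity sold)
facts = [
    ("skirt", "dark", "small", 2),
    ("skirt", "dark", "large", 3),
    ("skirt", "pastel", "small", 4),
    ("dress", "dark", "small", 1),
    ("dress", "pastel", "large", 5),
    ("shirt", "pastel", "small", 6),
]

def slice_(facts, dim, value):
    """Slicing: hold one dimension constant."""
    i = DIMS.index(dim)
    return [row for row in facts if row[i] == value]

def dice(facts, allowed):
    """Dicing: restrict several dimensions to sets of values."""
    return [row for row in facts
            if all(row[DIMS.index(d)] in vals for d, vals in allowed.items())]

def roll_up(facts, keep):
    """Drill-up: sum the quantity over every dimension not listed in keep."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)

by_item = roll_up(facts, ("item",))
small_by_color = roll_up(slice_(facts, "size", "small"), ("color",))
dark_skirts_and_dresses = dice(facts, {"item": {"skirt", "dress"}, "color": {"dark"}})
```

Drilling down is the reverse of `roll_up`: add a dimension back to `keep` to see finer-grained totals.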


# Data Mining

- the process of **semi-automatically** analyzing large databases to find **useful patterns**


- data mining is often used for **prediction**
    - e.g., predict if a credit card applicant poses a good credit risk
- examples of prediction mechanisms
    - **classification**: given a new item whose class is unknown, predict to which class it belongs
        - e.g., classify a blog post as either positive or negative
    - **regression**: given a set of mappings for an unknown function, predict the function result for a new parameter value
        - e.g., given the outdoor temperature measurements for the last week, predict tomorrow's outdoor temperature


- other applications of data mining aim to **identify descriptive patterns** in existing data
- examples of descriptive patterns
    - **associations**: "if-then" patterns
        - e.g., if a customer C purchases book B, then they are likely to enjoy book B' because other customers "similar" to C have purchased both B and B'
    - **clustering**: discover groups of similar objects
        - e.g., mobile network users are clustered in certain areas, suggesting where cellular towers should be placed
        




# Data Mining - Classification

## Classification Rules

- classification rules help assign new objects to classes
- classification rules can use a variety of data
- rules are not necessarily exact: there may be some misclassifications
- classification rules can be represented compactly as a decision tree

<img src="img/Snip20191111_227.png" width=80%/>

## Construction of Decision Trees
- **training set**: a sample of instances for which the classification is already known
- the decision tree is generated from the training set using a **greedy** top-down approach
    - each internal node of the tree partitions the data into groups based on a **partitioning attribute**, and a **partitioning condition** for the node
    - at each **leaf** node, either all (or most) of the items at the node belong to the same class, or else all attributes have been considered and no further partitioning is possible

## Best Splits

- a traversal of the decision tree begins with "impure" data (instances from many classes) at the root and terminates with "pure" data (instances from one class only) at the leaf level

- the main **goal in building a decision tree** is to pick the best attributes and conditions on which to partition at each level so as to reduce the "impurity"

- several quantitative measures of **impurity** have been proposed over a set $S$ of **training instances**

- $k$: number of classes
- $|S|$: number of instances
- $p_i$: fraction of instances in class $i$

## Impurity Measures: Gini

- the **Gini** measure of impurity is defined as
\begin{equation}
\textrm{Gini}(S) = 1 - \sum_{i=1}^k p_i^2
\end{equation}

- if all instances are in a single class (i.e., maximum purity), the $\textrm{Gini}$ value is 0
- if each class has the same number of instances (i.e., minimum purity), the value is $1 - {1\over k}$
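
A minimal sketch of the Gini measure, assuming the set $S$ is given as a plain list of class labels (the function name `gini` is chosen here for illustration):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of p_i squared."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

pure = ["a", "a", "a", "a"]       # a single class
balanced = ["a", "b", "c", "d"]   # k = 4 classes, equally frequent
print(gini(pure))      # 0.0
print(gini(balanced))  # 0.75, i.e. 1 - 1/k
```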

## Impurity Measures: Entropy

- another measure of impurity is the **entropy** measure, which is defined as

\begin{equation}
\textrm{Entropy}(S) = -\sum_{i=1}^k p_i \log_2 p_i
\end{equation}

- if all instances are in a single class (i.e., maximum purity), the entropy value is 0
    - note: $p_i \log_2 p_i$ is defined as 0 for $p_i = 0$
- if each class has the same number of instances (i.e., minimum purity), the value is $-\log_2 {1\over k} = \log_2 k$
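
A minimal sketch of the entropy measure, again assuming $S$ is a list of class labels. Classes with $p_i = 0$ never appear in the counts, which matches the convention that their term is 0.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels: -sum of p_i * log2(p_i)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

pure = ["a", "a", "a", "a"]
balanced = ["a", "b", "c", "d"]   # k = 4
print(entropy(pure))      # 0.0
print(entropy(balanced))  # 2.0, i.e. log2(4)
```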

## Information Gain

- when a set $S$ is split into multiple sets $S_i, i = 1, 2, ..., r$, we can measure the impurity of the resultant set of sets as

\begin{equation}
\textrm{Impurity}(S_1, S_2,...,S_r) = \sum_{i = 1}^r {|S_i| \over |S|} \textrm{Impurity}(S_i)
\end{equation}


- the **information gain** due to a particular split of $S$ into $S_i, i = 1, 2, ..., r$ is defined as follows

\begin{equation}
\textrm{Information-gain}(S, \{S_1, S_2, ..., S_r\}) = \textrm{Impurity}(S) - \textrm{Impurity}(S_1, S_2, ..., S_r)
\end{equation}

- a good split always achieves a **positive** information gain


- measure of "cost" of a split

\begin{equation}
\textrm{Information-content}(S, \{S_1, S_2, ..., S_r\}) = -\sum_{i=1}^r {|S_i| \over |S|} \log_2 {|S_i| \over |S|}
\end{equation}


- measure of "goodness" of a split

\begin{equation}
\textrm{Information gain ratio} = {\textrm{Information-gain}(S, \{S_1, S_2, ..., S_r\}) \over \textrm{Information-content}(S, \{S_1, S_2, ..., S_r\})}
\end{equation}

- the best split (i.e., one that tends to yield the simplest and most meaningful decision tree) is the one that produces the **maximum information gain ratio**
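
The formulas in this section can be sketched as follows. This sketch uses entropy as the impurity measure (Gini would work the same way) and represents a split as a list of label lists; the function names are illustrative only.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_impurity(parts):
    """Weighted average impurity of the subsets S_1..S_r."""
    total = sum(len(p) for p in parts)
    return sum(len(p) / total * entropy(p) for p in parts)

def information_gain(labels, parts):
    return entropy(labels) - split_impurity(parts)

def information_content(parts):
    """'Cost' of a split: entropy of the subset sizes."""
    total = sum(len(p) for p in parts)
    return sum(-(len(p) / total) * math.log2(len(p) / total) for p in parts if p)

def gain_ratio(labels, parts):
    content = information_content(parts)
    return information_gain(labels, parts) / content if content > 0 else 0.0

labels = ["yes", "yes", "no", "no"]
perfect = [["yes", "yes"], ["no", "no"]]   # each subset is pure
useless = [["yes", "no"], ["yes", "no"]]   # subsets as impure as S itself
print(information_gain(labels, perfect), gain_ratio(labels, perfect))
print(information_gain(labels, useless), gain_ratio(labels, useless))
```

The perfect split removes all impurity (gain 1 bit), while the useless split has zero gain and therefore zero gain ratio.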

# Finding Best Splits

- **Categorical attributes** (with no meaningful order)
    - binary split: try all possible ways to partition the values into two disjoint sets, and pick the best
    - multi-way split: one child for each value
- **Continuous-valued attributes** (with meaningful order)
    - binary split: sort values, try each as a split point
    - multi-way split: a series of binary splits on the same attribute has roughly equivalent effect
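
As an illustration of the binary split on a continuous-valued attribute, the sketch below sorts the values, tries each distinct value as a split point of the form `value <= threshold`, and keeps the one with the lowest weighted Gini impurity. The data (`ages`, `risk`) and the choice of Gini are assumptions made for the example.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def best_binary_split(values, labels):
    """Return (threshold, weighted impurity) of the best split 'value <= threshold'."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # equal values cannot be separated by a threshold
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best[1]:
            best = (pairs[i - 1][0], impurity)
    return best

ages = [22, 25, 30, 41, 47, 52]
risk = ["bad", "bad", "bad", "good", "good", "good"]
print(best_binary_split(ages, risk))  # (30, 0.0)
```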


# Decision Tree Construction

<img src="img/Snip20191115_11.png" width=60%/>

- $\delta_p$ and $\delta_s$ are user-defined thresholds


# Other Classifiers

- *Bayesian classifiers* use **Bayes' theorem**, which states


\begin{equation}
p(c_j | d) = {p(d | c_j) p(c_j) \over p(d)}
\end{equation}

- $p(c_j | d)$: probability of instance $d$ being in class $c_j$
- $p(d|c_j)$: probability of generating instance $d$ given class $c_j$
- $p(c_j)$: probability of occurrence of class $c_j$
- $p(d)$: probability of occurrence  of instance $d$

## Naive Bayesian Classifiers

- Bayesian classifiers require
    - computation of $p(d | c_j)$
    - pre-computation of $p(c_j)$
    - $p(d)$ can be ignored since it is the same for all classes


- to simplify, **naive Bayesian classifiers** assume attributes have *independent distributions*, and thereby estimate
\begin{equation}
p(d|c_j) = p(d_1 | c_j) \times p(d_2 | c_j)\times ... \times p(d_n | c_j)
\end{equation}
    - each of the $p(d_i | c_j)$ can be estimated from a histogram on $d_i$ values for each class $c_j$
    - histograms are computed from training instances
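
A minimal sketch of a naive Bayesian classifier for categorical attributes, where the "histograms" are simply value counts per class and attribute. The training data is invented, and the sketch applies no smoothing, so a value never seen with a class gets probability 0 for that class.

```python
from collections import Counter, defaultdict

def train(instances, labels):
    """Build class counts and per-class, per-attribute value histograms."""
    class_counts = Counter(labels)
    histograms = defaultdict(Counter)  # (class, attribute index) -> value counts
    for attrs, c in zip(instances, labels):
        for i, v in enumerate(attrs):
            histograms[(c, i)][v] += 1
    return class_counts, histograms

def classify(model, attrs):
    """Pick the class maximizing p(c_j) * product of p(d_i | c_j); p(d) is ignored."""
    class_counts, histograms = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total                        # p(c_j)
        for i, v in enumerate(attrs):
            score *= histograms[(c, i)][v] / n_c   # p(d_i | c_j)
        scores[c] = score
    return max(scores, key=scores.get)

# Hypothetical training set: (income, employed) -> credit risk
instances = [("high", "yes"), ("high", "no"), ("low", "yes"), ("low", "no")]
labels = ["good", "good", "good", "bad"]
model = train(instances, labels)
print(classify(model, ("high", "yes")))  # good
print(classify(model, ("low", "no")))    # bad
```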

# Validating a Classifier

- the quality of a classifier can be quantified along several dimensions

- for a Boolean classifier, outcomes fall into 4 categories
    - **True Positive**(TP): correct, prediction was positive, instance is positive
    - **False Positive**(FP): incorrect, prediction was positive, instance is negative
    - **True Negative**(TN): correct, prediction was negative, instance is negative
    - **False Negative**(FN): incorrect, prediction was negative, instance is positive
    
    
- the quality of a classifier can be described in several ways
    - **accuracy**: fraction of instances where classifier is correct: $TP + TN \over TP + FP + TN + FN$
    - **recall**: fraction of positive instances correctly classified: $TP \over TP + FN$ (aka *true positive rate*)
    - **precision**: fraction of correct positive predictions: $TP \over TP + FP$
    - **specificity**: fraction of negative instances correctly classified: $TN \over FP + TN$ (aka *true negative rate*)
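
The four outcome counts and the derived measures can be computed as in the sketch below. The prediction and ground-truth lists are invented, and the sketch does not guard against empty denominators (e.g., no positive predictions).

```python
def confusion_counts(predicted, actual):
    """Count TP, FP, TN, FN for Boolean predictions against Boolean ground truth."""
    tp = fp = tn = fn = 0
    for p, a in zip(predicted, actual):
        if p and a:
            tp += 1
        elif p and not a:
            fp += 1
        elif not p and not a:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def metrics(tp, fp, tn, fn):
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "specificity": tn / (fp + tn),
    }

predicted = [True, True, False, False, True, False]
actual = [True, False, False, True, True, False]
counts = confusion_counts(predicted, actual)
print(counts)            # (2, 1, 2, 1)
print(metrics(*counts))
```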

    
    

# Cross-Validation

- a simple way to validate a classifier is to split a given data set (sample) into disjoint training and testing parts
- both parts are labelled with the ground truth (i.e., the correct classification)
    - required in the training part to compute the model, and in the testing part for validation
- in **k-fold cross-validation**, the data set is first divided randomly into $k$ parts (folds) of equal size, and a train/test split is computed $k$ times
    - each part is used to validate a model computed using the remaining $k-1$ parts
- measures of quality are computed separately for each part, and then averaged
    - the accuracy of the constructed model is the average of the $k$ accuracy figures computed for the different parts
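
A sketch of k-fold cross-validation that works with any classifier supplied as a pair of functions. The names `train_fn` and `predict_fn`, the fixed `seed`, and the majority-class classifier used in the demonstration are all assumptions of this example. Folds differ in size by at most one when $k$ does not divide the number of instances.

```python
import random
from collections import Counter

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(data, labels, k, train_fn, predict_fn):
    """Average accuracy over k folds, each validated on a model built from the rest."""
    accuracies = []
    for test_idx in k_fold_indices(len(data), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(data)) if i not in held_out]
        model = train_fn([data[i] for i in train_idx], [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, data[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k

# Hypothetical stand-in classifier: always predict the majority training class.
def train_majority(xs, ys):
    return Counter(ys).most_common(1)[0][0]

def predict_majority(model, x):
    return model

data = list(range(10))
labels = ["a"] * 8 + ["b"] * 2
mean_accuracy = cross_validate(data, labels, 5, train_majority, predict_majority)
print(mean_accuracy)
```

Here the majority class is always "a", so the averaged accuracy equals the fraction of "a" instances regardless of how the folds are shuffled.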



# Regression

- regression deals with the prediction of a value, rather than a class
- given values for a set of variables $X_1, X_2,...,X_n$, we want to predict the value of a variable $Y$
- **linear regression**: infer coefficients $a_0, a_1, ..., a_n$ such that $Y = a_0 + a_1 X_1 + a_2 X_2 + ... + a_n X_n$
- in general, the process of finding a curve that fits the data is also called **curve fitting**
- regression aims to find coefficients that give the best possible fit (e.g., coefficients that minimize the sum of squared residuals)
- the fit may only be approximate because of noise in the data, or because the relationship does not follow exactly the type of curve being fitted (e.g., polynomial)
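
For the single-variable case $Y = a_0 + a_1 X_1$, the coefficients that minimize the sum of squared residuals have a closed form. The sketch below assumes that case only; the function name `fit_line` and the data are illustrative.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a0 + a1 * x; returns (a0, a1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx
    a0 = mean_y - a1 * mean_x
    return a0, a1

# Noise-free data on the line y = 1 + 2x is recovered exactly.
print(fit_line([1, 2, 3, 4], [3, 5, 7, 9]))  # (1.0, 2.0)

# With noisy data the fit is only approximate.
a0, a1 = fit_line([1, 2, 3, 4], [3.1, 4.8, 7.2, 8.9])
prediction = a0 + a1 * 5  # predicted Y for a new value X_1 = 5
```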

# Association Rule Mining

- **association rules**
    - $bread => milk$
    - left side is the **antecedent**, right side is the **consequent**
    - an association rule must have an associated **population**; the population consists of a set of **instances**
        - e.g., each transaction at a shop is an instance, and the set of all transactions is the population


- rules have an associated "support" and an associated "confidence"
- **Support**: a measure of what fraction of the population satisfies both the antecedent and the consequent of a rule
    - e.g., suppose only 0.0001% of all purchases include both milk and screwdrivers, then the degree of support for the rule $milk => screwdrivers$ is low
    

- **Confidence**: a measure of how often the consequent is true when the antecedent is true
    - e.g., the rule $bread => milk$ has a confidence of 80% if 80% of the purchases that include bread also include milk
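
Support and confidence can be computed directly from a population of transactions, here represented as Python sets of item names. The five transactions are invented so that the rule $bread => milk$ comes out with 80% confidence.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in items."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "milk", "cereal"},
    {"bread", "milk", "screwdriver"},
    {"bread", "eggs"},
]
print(support(transactions, {"bread", "milk"}))       # 0.8
print(confidence(transactions, {"bread"}, {"milk"}))  # 0.8
print(support(transactions, {"milk", "screwdriver"})) # 0.2
```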


## Finding Association Rules

- we are generally only interested in association rules with reasonably high support (e.g., a few percent)

- naive algorithm
    1. consider all possible sets of relevant items
    2. compute the support for each set and identify sets with sufficiently high support (e.g., based on a threshold)
    3. select **large itemsets** with sufficiently high support
    4. use these large itemsets to generate association rules: from each itemset $A$ and each $b \in A$, generate the rule $A-\{b\} => b$
        - support of rule: $support(A)$
        - confidence: $support(A) \over support(A - \{b\})$
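
The naive algorithm can be sketched by enumerating every possible itemset, which is exponential in the number of items and only practical for tiny examples. The transactions and the threshold of 0.5 are made up for the illustration.

```python
from itertools import combinations

def large_itemsets(transactions, min_support):
    """Steps 1-3: try every itemset and keep those with enough support."""
    items = sorted({i for t in transactions for i in t})
    n = len(transactions)
    large = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            sup = sum(itemset <= t for t in transactions) / n
            if sup >= min_support:
                large[itemset] = sup
    return large

def generate_rules(large):
    """Step 4: from each large itemset A and each b in A, emit A - {b} => b."""
    rules = []
    for itemset, sup in large.items():
        if len(itemset) < 2:
            continue
        for b in itemset:
            antecedent = itemset - {b}
            # every subset of a large itemset is itself large, so the lookup succeeds
            rules.append((antecedent, b, sup, sup / large[antecedent]))
    return rules

transactions = [
    {"bread", "milk"}, {"bread", "milk", "eggs"}, {"bread", "milk"},
    {"bread", "eggs"}, {"milk"},
]
large = large_itemsets(transactions, 0.5)
for antecedent, b, sup, conf in generate_rules(large):
    print(set(antecedent), "=>", b, "support", sup, "confidence", round(conf, 2))
```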

## Apriori Algorithm 

- computes large itemsets given a set $T$ of transactions over a set of items, and a support threshold $\mathcal{E}$

<img src="img/Snip20191115_12.png" width=80%/>

### Apriori Example

<img src="img/Snip20191115_14.png" width=80%/>
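
For comparison, here is a sketch of the standard level-wise formulation of Apriori, which may differ in detail from the pseudocode in the figure. It relies on the fact that every subset of a large itemset must itself be large, so candidates of size $k$ are built only from large itemsets of size $k-1$. The parameter `min_support` plays the role of the support threshold.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    current = {frozenset([i]) for t in transactions for i in t}  # candidate 1-itemsets
    large = {}
    k = 1
    while current:
        # count support of the candidates and keep the large ones
        level = {c: s for c, s in ((c, support(c)) for c in current) if s >= min_support}
        large.update(level)
        k += 1
        # join step: unions of two large (k-1)-itemsets that have exactly k items
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # prune step: every (k-1)-subset of a candidate must be large
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return large

transactions = [
    {"bread", "milk"}, {"bread", "milk", "eggs"}, {"bread", "milk"},
    {"bread", "eggs"}, {"milk"},
]
result = apriori(transactions, 0.5)
for itemset, sup in result.items():
    print(set(itemset), sup)
```

With this data, "eggs" is eliminated in the first pass, so no candidate containing it is ever counted.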

# Clustering

- finding clusters of points in the given data such that similar points lie in the same cluster
- **k-means**: group points into $k$ sets (for a given $k$) such that the average distance (e.g., Euclidean distance) of points from the centroid of their assigned set is minimized
    - **centroid**: the mean of a set of points

## $k$-Means Clustering Algorithm

- input: number of clusters ($k$) and set of input points $x_1, ..., x_n$
- initialization: place centroids $c_1, ..., c_k$ randomly
- loop
    - for each point $x_i$, find the nearest (in terms of Euclidean distance) centroid $c_j$ and assign point $x_i$ to cluster $j$
    - for each non-empty cluster $j$, let the new centroid $c_j$ be the mean of all points assigned to cluster $j$ in the previous step
    - for each empty cluster $j$, re-initialize the centroid $c_j$ randomly
    - exit if none of the centroids $c_1, ..., c_k$ has changed since the previous iteration (i.e., convergence has occurred)
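
The loop above can be sketched as follows. A few details are assumptions of this sketch and not part of the algorithm as stated: random centroids are drawn uniformly from the bounding box of the data, a `seed` makes runs repeatable, and `max_iter` caps the number of iterations. `math.dist` requires Python 3.8 or later.

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    dims = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dims)]
    hi = [max(p[d] for p in points) for d in range(dims)]

    def random_centroid():
        return tuple(rng.uniform(lo[d], hi[d]) for d in range(dims))

    centroids = [random_centroid() for _ in range(k)]
    for _ in range(max_iter):
        # assignment step: each point joins the cluster of its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[j].append(p)
        # update step: mean of each non-empty cluster, random restart for empty ones
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else random_centroid()
            for cluster in clusters
        ]
        if new_centroids == centroids:  # convergence: no centroid moved
            break
        centroids = new_centroids
    return centroids, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = k_means(points, 2)
print(sorted(sorted(c) for c in clusters))
```

On two well-separated groups like these, the algorithm is expected to settle on one cluster per group; in general the result depends on the random initialization.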


## Hierarchical Clustering

- **agglomerative clustering algorithms**: build small clusters, then cluster small clusters into bigger clusters and so on (bottom-up)
- **divisive clustering algorithms**: start with all items in a single cluster, repeatedly refine (break) clusters into smaller ones (top-down)
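
A small sketch of the agglomerative approach on one-dimensional points. It assumes single linkage (the distance between two clusters is the distance between their closest members), which is one common choice among several, and it stops when a requested number of clusters remains.

```python
def agglomerative(points, target):
    """Start with one cluster per point and repeatedly merge the two closest clusters."""
    clusters = [[p] for p in points]
    while len(clusters) > target:
        best = None  # (distance, i, j) of the closest pair of clusters found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

points = [1, 2, 9, 10, 25]
print(agglomerative(points, 3))  # [[1, 2], [9, 10], [25]]
print(agglomerative(points, 2))  # [[1, 2, 9, 10], [25]]
```

Recording the sequence of merges instead of stopping at `target` would give the full cluster hierarchy.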

