# Chapter 3: Data Preprocessing

## Measures of Data Quality

- Accuracy
- Completeness
- Consistency
- Timeliness
- Believability
- Interpretability
- Accessibility

## Major Tasks in Preprocessing

- **Data cleaning**
    - fill in missing values, smooth noisy data, identify or remove outliers, resolve inconsistencies
- **Data integration**
    - integration of multiple data sources
- **Data reduction**
    - dimensionality reduction, numerosity reduction, data compression
- **Data transformation and data discretization**
    - normalization, concept hierarchy generation

## Why Data Cleaning?

- Imperfect real-world data
- **Incomplete**: missing attributes, values
    - e.g., age = "", major = ""
- **Noisy**: containing errors or outliers
    - e.g., salary = "-10"
- **Inconsistent**: containing discrepancies
    - e.g., age = "21", birthday = "08/03/1995"
    - e.g., ratings of "1, 2, 3" and "A, B, C"

## Why Are Data Imperfect?

- **Incomplete data**
    - "not applicable" values
    - time between collection and analysis
    - human/hardware/software problems
- **Noisy data**
    - faulty data collection instruments
    - human or computer error at data entry
    - errors in data transmission
- **Inconsistent data**
    - different data sources
    - naming conventions, data formats
        - e.g., the date "03/07/11" (the order of day, month, and year is ambiguous)
    - functional dependency violation
        - e.g., one of several linked values is modified without updating the others
- **No quality data, no quality data mining results!**

## How to Handle Noisy Data?

- **Binning**
    - first sort & partition data into bins
    - then smooth by
        - bin means
        - bin medians
        - bin boundaries

![Image: Binning](img/3.1.png)
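The three smoothing options can be sketched in a few lines. This is a minimal illustration that assumes equal-frequency bins of three values each; the price list and the helper names (`equal_frequency_bins`, `smooth_by_means`, and so on) are made up for the example and are not taken from the figure.

```python
from statistics import mean, median

def equal_frequency_bins(values, bin_size):
    """Sort the values, then cut them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value with the mean of its bin."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    """Replace every value with the median of its bin."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

# Illustrative data (an assumption, not read from the figure).
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_medians = smooth_by_medians(bins)
by_boundaries = smooth_by_boundaries(bins)
```

Ties in `smooth_by_boundaries` go to the lower boundary here; that is an arbitrary choice for the sketch.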

- **Regression**
    - fit the data to a regression function and replace noisy values with the fitted ones

![Image: Regression](img/3.2.png)

- **Clustering**
    - detect and remove outliers

![Image: Clustering](img/3.3.png)

## Data Integration

- Combines data from multiple sources
- **Entity identification**
    - schema integration, object matching
    - e.g., student_id vs. student_number
- **Redundant data**
    - different naming, derived data
    - may be detected by correlation analysis

## Correlation Analysis

- **Correlation coefficient (numerical data)**
    - $\displaystyle r_{A,B} = \frac{\sum_{i=1}^N(a_i-\bar{A})(b_i-\bar{B})}{N\sigma_A\sigma_B} = \frac{\sum_{i=1}^N(a_ib_i)-N\bar{A}\bar{B}}{N\sigma_A\sigma_B}$
- **$\chi^2$ (chi-square) test (categorical data)**
    - $\displaystyle \chi^2 = \sum_{i=1}^c\sum_{j=1}^r\frac{(o_{ij}-e_{ij})^2}{e_{ij}}$
    - $\displaystyle e_{ij} = \frac{count(A=a_i) \times count(B=b_j)}{N}$
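As a rough check on the correlation coefficient, the sketch below computes $r_{A,B}$ in both forms given in the formula and confirms that they agree. It assumes $\sigma_A$ and $\sigma_B$ are population standard deviations (dividing by $N$); the two short lists are made-up data.

```python
from math import sqrt

def _population_std(values, mean_value):
    """Standard deviation dividing by N, as the N*sigma_A*sigma_B term assumes."""
    return sqrt(sum((v - mean_value) ** 2 for v in values) / len(values))

def pearson_r(a, b):
    """First form: sum of products of deviations over N*sigma_A*sigma_B."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    numerator = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return numerator / (n * _population_std(a, mean_a) * _population_std(b, mean_b))

def pearson_r_shortcut(a, b):
    """Second form: sum(a_i * b_i) - N * mean_A * mean_B in the numerator."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    numerator = sum(x * y for x, y in zip(a, b)) - n * mean_a * mean_b
    return numerator / (n * _population_std(a, mean_a) * _population_std(b, mean_b))

# Made-up attribute values for illustration.
attr_a = [1, 2, 3, 4, 5]
attr_b = [2, 4, 5, 4, 5]
r = pearson_r(attr_a, attr_b)
r_shortcut = pearson_r_shortcut(attr_a, attr_b)
```

For these lists both functions give roughly 0.77, a fairly strong positive correlation.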

## Chi-Square Test: An Example

||play chess|not play chess|total|
|---|---|---|---|
|like fiction|250 (**90**)|200 (**360**)|450|
|not like fiction|50 (**210**)|1000 (**840**)|1050|
|total|300|1200|1500|

$\displaystyle e_{11} = \frac{\text{\#(like fiction)} \times \text{\#(play chess)}}{N} = \frac{450 \times 300}{1500} = 90$

$\displaystyle \chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \frac{(200 - 360)^2}{360} + \frac{(1000 - 840)^2}{840} = 507.93$
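The same computation can be reproduced programmatically. The sketch below derives the expected counts from the row and column totals and sums the four terms; `chi_square` is a name chosen for this example. Without intermediate rounding the statistic comes out at about 507.94; rounding each term to two decimals before summing gives the 507.93 shown above.

```python
def chi_square(observed):
    """Return (chi2, expected) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    expected = []
    for i, row in enumerate(observed):
        expected_row = []
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # e_ij from the marginal counts
            expected_row.append(e)
            chi2 += (o - e) ** 2 / e
        expected.append(expected_row)
    return chi2, expected

# Rows: like fiction / not like fiction; columns: play chess / not play chess.
observed = [[250, 200], [50, 1000]]
chi2, expected = chi_square(observed)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
```

With one degree of freedom, a value this large indicates that the two attributes are strongly associated in this table.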

## Correlation Analysis

- **Correlation coefficient**
    - numeric data, $[-1.0, 1.0]$
- **$\chi^2$ (chi-square) test**
    - categorical data, $\geq 0$
    - degrees of freedom: $d = (c-1)(r-1)$
- **Correlation vs. causality**
    - Does **correlation** imply **causality**?
        - sleeping with one's shoes on is strongly correlated with waking up with a headache
        - the more firemen fighting a fire, the more damage there is going to be
        - as ice cream sales increase, the rate of drowning deaths increases sharply
        - **correlation does not imply causality!**

## Data Reduction

- **Why data reduction?**
    - massive data sets
    - mining takes a long time
- **Goal of data reduction**
    - data set is much smaller in volume
    - produces (almost) the same mining results

## Data Reduction Strategies

- **Dimensionality reduction**
    - attribute subset selection
    - wavelet transform
    - principal component analysis (PCA)
- **Numerosity reduction**
    - regression, log-linear models
    - data cube aggregation
    - histograms, clustering, sampling

## Attribute Subset Selection

- Remove irrelevant or redundant attributes

## Dimensionality Reduction

- **Discrete wavelet transform (DWT)**
    - linear signal processing, multi-resolution
    - store a small fraction of the strongest wavelet coefficients
- **Principal component analysis (PCA)**
    - given $N$ data vectors of $n$ dimensions
    - find $k \leq n$ orthogonal vectors (principal components) that best represent the data
    - for numerical data only
    - used when $n$ is large
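To make the PCA idea concrete, here is a small dependency-free sketch that finds only the first principal component by power iteration on the covariance matrix. Power iteration is a stand-in chosen to keep the example short; a real implementation would normally use an eigendecomposition or SVD from a numerical library. The start vector and the sample points are assumptions, and the method fails if the start vector happens to be orthogonal to the dominant direction.

```python
def first_principal_component(rows, iterations=100):
    """Return (means, unit vector) for the direction of largest variance."""
    n_rows, dims = len(rows), len(rows[0])
    means = [sum(r[d] for r in rows) / n_rows for d in range(dims)]
    centered = [[r[d] - means[d] for d in range(dims)] for r in rows]
    # Covariance matrix of the centered data (dividing by N).
    cov = [[sum(r[i] * r[j] for r in centered) / n_rows for j in range(dims)]
           for i in range(dims)]
    v = [1.0 / (d + 1) for d in range(dims)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return means, v

def project(row, means, component):
    """Coordinate of one data vector along the component (n dims -> 1 number)."""
    return sum((x - m) * c for x, m, c in zip(row, means, component))

# Made-up 2-D points lying exactly on the line y = 2x.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
means, pc1 = first_principal_component(points)
reduced = [project(p, means, pc1) for p in points]
```

Because the points lie on a single line, one component captures all of the variance, so `reduced` represents the two-dimensional data with one number per point.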

## Numerosity Reduction

- Use alternative, smaller data representations
- **Parametric methods**
    - assume the data fits some model
    - estimate model parameters
    - store the parameters, discard the data
- **Non-parametric methods**
    - do not assume models
    - e.g., histograms, clustering, sampling

## Regression & Log-Linear Models

- **Linear regression**
    - $Y = wX + b$
- **Multiple regression**
    - $Y = b_0 + b_1X_1 + b_2X_2$
- **Log-linear models**
    - approximate multi-dimensional probability distributions with lower-dimensional distributions
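As an illustration of the parametric approach, the sketch below fits $Y = wX + b$ by ordinary least squares and keeps only the two parameters. The function name `fit_line` and the data points are assumptions made for the example.

```python
def fit_line(xs, ys):
    """Least-squares estimates of slope w and intercept b for Y = wX + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    w = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - w * mean_x
    return w, b

# Made-up data that happens to follow Y = 2X + 1 exactly.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
w, b = fit_line(xs, ys)

# After fitting, only (w, b) need to be stored; values are re-estimated on demand.
estimate_at_6 = w * 6 + b
```

On real data the fit is only approximate, so the stored parameters trade some accuracy for a much smaller representation.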

## Data Cube Aggregation

- E.g., quarterly sales $\Rightarrow$ annual sales
- Multiple levels of aggregation may be possible
- Use the smallest representation that is sufficient for the task
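A roll-up from quarters to years might look like the following sketch. The sales figures and the `(year, quarter)` key layout are invented for illustration.

```python
from collections import defaultdict

def roll_up_to_year(quarterly_sales):
    """Aggregate {(year, quarter): amount} into {year: total amount}."""
    annual = defaultdict(int)
    for (year, _quarter), amount in quarterly_sales.items():
        annual[year] += amount
    return dict(annual)

# Hypothetical quarterly sales figures.
quarterly_sales = {
    (2023, "Q1"): 224, (2023, "Q2"): 408, (2023, "Q3"): 350, (2023, "Q4"): 586,
    (2024, "Q1"): 300, (2024, "Q2"): 420, (2024, "Q3"): 380, (2024, "Q4"): 610,
}
annual_sales = roll_up_to_year(quarterly_sales)
```

Eight stored values shrink to two, which is enough if the analysis only needs yearly totals.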

## Histograms

- Divide data into buckets and store average (or sum) for each bucket
- Partitioning rules
    - equal-width
    - equal-frequency
    - v-optimal
    - max-diff
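The first two partitioning rules can be sketched as follows; v-optimal and max-diff are not covered here. The function names and the sample values are assumptions chosen to show how the two rules differ on skewed data.

```python
def equal_width_buckets(values, k):
    """Split the value range [min, max] into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # avoid division by zero if all values are equal
    buckets = [[] for _ in range(k)]
    for v in values:
        index = min(int((v - lo) / width), k - 1)  # the maximum goes in the last bucket
        buckets[index].append(v)
    return buckets

def equal_frequency_buckets(values, k):
    """Split the sorted values into k buckets holding (nearly) the same count."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]

def summarize(buckets):
    """Keep only (count, sum) per bucket instead of the raw values."""
    return [(len(b), sum(b)) for b in buckets]

# Made-up skewed data: most values are small, two are large.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30]
width_summary = summarize(equal_width_buckets(values, 3))
frequency_summary = summarize(equal_frequency_buckets(values, 3))
```

On this data the equal-width rule puts ten of the twelve values into the first bucket, while the equal-frequency rule keeps four values in each.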

## Sampling

- Use a small sample to represent the whole data set
- Choose a **representative** subset of the data
    - simple random sampling may have very poor performance in the presence of skew
- Simple random sample without replacement
- Simple random sample with replacement
- Cluster sample
- Stratified sample
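The two simple random sampling variants differ only in whether a drawn item can be drawn again. A minimal sketch using Python's standard `random` module follows; the function names `srswor` and `srswr` and the data set of 100 integers are assumptions for the example.

```python
import random

def srswor(data, n, rng):
    """Simple random sample WITHOUT replacement: each item appears at most once."""
    return rng.sample(data, n)

def srswr(data, n, rng):
    """Simple random sample WITH replacement: the same item may be drawn again."""
    return [rng.choice(data) for _ in range(n)]

rng = random.Random(0)          # seeded so the run is repeatable
data = list(range(1, 101))      # hypothetical data set of 100 tuples
without_replacement = srswor(data, 10, rng)
with_replacement = srswr(data, 150, rng)
```

Drawing 150 items from 100 is only possible with replacement, so `with_replacement` necessarily contains repeats.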

## Sample w/ or w/o Replacement

![Sample w or wo Replacement](./img/3.4.png)

## Cluster or Stratified Sampling

- Approximate the percentage of each class

![Cluster or Stratified Sampling](./img/3.5.png)
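Stratified sampling can be sketched as "group by class, then sample the same fraction from each group". The customer records, the two strata, and the 10% fraction below are invented for the example; rounding the per-stratum size and forcing at least one item per stratum are simplifying assumptions.

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, fraction, rng):
    """Sample the same fraction from every stratum so class percentages are kept."""
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)
    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, size))
    return sample

rng = random.Random(0)
# Hypothetical skewed data: 90 "young" customers and 10 "senior" customers.
customers = ([(i, "young") for i in range(90)]
             + [(i, "senior") for i in range(90, 100)])
sample = stratified_sample(customers, lambda c: c[1], 0.1, rng)
young_in_sample = sum(1 for _, group in sample if group == "young")
senior_in_sample = sum(1 for _, group in sample if group == "senior")
```

A simple random sample of ten could easily miss the small "senior" group entirely; the stratified sample always contains it in proportion.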

## Data Transformation

- **Smoothing**: remove noise from data
- **Aggregation**: summarization
    - e.g., daily sales $\Rightarrow$ monthly, annual sales
- **Generalization**: concept hierarchy climbing
    - e.g., street $\Rightarrow$ city $\Rightarrow$ state
- **Normalization**: scale to fall within a range
- **Attribute/feature construction**: new attributes constructed from existing ones

## Normalization

- **Min-max normalization**
    - $\displaystyle v' = \frac{v-\min_A}{\max_A - \min_A}(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A$
    - e.g., income range $[\$12000, \$98000]$ normalized to $[0.0, 1.0]$; then \$73,600 is mapped to
        - $\displaystyle \frac{73600 - 12000}{98000 - 12000}(1.0 - 0) + 0 = 0.716$
- **Z-score normalization**
    - $\displaystyle v' = \frac{v - \bar{A}}{\sigma_A}$
    - e.g., mean = 54,000 and stdev = 16,000; then \$73,600 is mapped to
        - $\displaystyle \frac{73600 - 54000}{16000} = 1.225$
- **Normalization by decimal scaling**
    - $\displaystyle v' = \frac{v}{10^j}$
    - where $j$ is the smallest integer s.t. $\max(|v'|) < 1$
    - e.g., range $[-986,917]$
        - $j=3$, divide by 1000
        - $-986 \Rightarrow -0.986$
        - $917 \Rightarrow 0.917$
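The three normalization methods translate almost directly into code. The sketch below reuses the numbers from the examples above; the function names are chosen for this illustration.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Linearly map v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean_a) / std_a

def decimal_scaling(values):
    """Divide by 10^j, with j the smallest integer making every |v'| < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return j, [v / 10 ** j for v in values]

income_min_max = min_max(73600, 12000, 98000)
income_z = z_score(73600, 54000, 16000)
j, scaled = decimal_scaling([-986, 917])
```

Min-max normalization depends on the observed minimum and maximum, so a later value outside that range would fall outside the target interval; z-score normalization does not have this limitation.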

## Discretization

- Three types of attributes
    - **nominal**: unordered set (e.g., profession)
    - **ordinal**: ordered set (e.g., military rank)
    - **continuous**: e.g., integer or real numbers
- Discretization
    - divide continuous range into intervals
    - interval labels used to replace data values
    - supervised vs. unsupervised, split vs. merge

## Discretization Methods

- **Binning**: split, unsupervised
- **Histogram analysis**: split, unsupervised
- **Clustering analysis**: split/merge, unsupervised
- **Entropy-based discretization**: split, supervised
- **Interval merging by $\chi^2$ analysis**: merge, supervised
- **Intuitive partitioning**: split, unsupervised

## Entropy-Based Discretization

- Partition $D$ into $D_1$ and $D_2$ at boundary $A$
- Pick boundary $A$ with minimum $Info_A(D)$
    - "purer" distribution has lower entropy
- Apply recursively to each partition
- Top-down split, supervised (uses class info)

$\displaystyle Info_A(D) = \frac{|D_1|}{|D|}Entropy(D_1) + \frac{|D_2|}{|D|}Entropy(D_2)$

$\displaystyle Entropy(D_1) = -\sum_{i=1}^m p_i \log_2(p_i)$
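One split step can be sketched as follows, taking $p_i$ as the fraction of tuples in a partition that belong to class $i$. Using midpoints between adjacent distinct values as candidate boundaries is an assumption of this sketch, as are the ages and class labels.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a partition, from the class proportions p_i."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (boundary, info) for the boundary with the lowest weighted entropy."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary can separate identical values
        boundary = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        info = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if best is None or info < best[1]:
            best = (boundary, info)
    return best

# Made-up data: the class changes cleanly between ages 25 and 40.
ages = [21, 23, 25, 40, 45, 50]
buys = ["no", "no", "no", "yes", "yes", "yes"]
boundary, info = best_split(ages, buys)
```

The chosen boundary separates the two classes perfectly, so both partitions are pure and the weighted entropy is zero. Applying `best_split` again to each side would give the recursive procedure.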

## Interval Merge by $\chi^2$ Analysis

- Bottom-up merge, supervised
- Merge the best neighboring intervals
    - interval w/ most similar class distributions
- **ChiMerge**
    - merge adjacent intervals w/ min $\chi^2$ value
        - i.e., class is independent of interval
    - stopping criterion
        - significance, #intervals, inconsistency, ...
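A single merge step could be sketched like this. It is a simplified illustration rather than the full ChiMerge procedure: there is no stopping criterion, and terms with an expected count of zero are simply skipped. The interval bounds and per-class counts are invented.

```python
def chi2_adjacent(counts_a, counts_b):
    """Chi-square statistic for two adjacent intervals, given per-class counts."""
    table = [counts_a, counts_b]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # simplification: skip empty cells
                chi2 += (observed - expected) ** 2 / expected
    return chi2

def merge_once(intervals):
    """Merge the adjacent pair with the lowest chi-square value.

    intervals: list of ((low, high), per-class counts), ordered by low.
    """
    scores = [chi2_adjacent(intervals[i][1], intervals[i + 1][1])
              for i in range(len(intervals) - 1)]
    i = scores.index(min(scores))
    (low, _), counts_a = intervals[i]
    (_, high), counts_b = intervals[i + 1]
    merged = ((low, high), [a + b for a, b in zip(counts_a, counts_b)])
    return intervals[:i] + [merged] + intervals[i + 2:]

# Hypothetical intervals with counts for two classes.
intervals = [((0, 10), [4, 0]), ((10, 20), [3, 1]), ((20, 30), [0, 5])]
after_merge = merge_once(intervals)
```

The first two intervals have similar class distributions (low $\chi^2$), so they are merged; the third interval is dominated by the other class and stays separate.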

## Concept Hierarchy Generation

- Categorical data
- Partial/total ordering of attributes
    - street < city < state < country
- Automatic concept hierarchy generation
    - fewer distinct values $\Rightarrow$ higher level
    - e.g., street, city, state, country
    - exceptions (distinct-value counts do not follow the hierarchy)
        - e.g., weekday, month, quarter, year
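The distinct-value heuristic can be sketched by counting the distinct values of each attribute and sorting. The address records below are invented, and the function name is an assumption. As the exceptions note, the heuristic is only a starting point and its output may need manual correction.

```python
def order_by_distinct_values(records, attributes):
    """Order attributes from lowest to highest hierarchy level.

    The attribute with the most distinct values is placed at the lowest level.
    """
    counts = {a: len({record[a] for record in records}) for a in attributes}
    ordered = sorted(attributes, key=lambda a: counts[a], reverse=True)
    return ordered, counts

# Hypothetical address records.
records = [
    {"street": "1 Main St", "city": "Urbana", "state": "IL", "country": "USA"},
    {"street": "2 Oak Ave", "city": "Urbana", "state": "IL", "country": "USA"},
    {"street": "3 Elm Rd", "city": "Chicago", "state": "IL", "country": "USA"},
    {"street": "4 Pine Ln", "city": "Madison", "state": "WI", "country": "USA"},
]
hierarchy, counts = order_by_distinct_values(
    records, ["country", "street", "state", "city"]
)
```

For these records the result is street, city, state, country, which matches the intended ordering.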