# Chapter 3: Data Preprocessing

## Measures of Data Quality

- Accuracy
- Completeness
- Consistency
- Timeliness
- Believability
- Interpretability
- Accessibility

## Major Tasks in Preprocessing

- **Data cleaning**
    - fill in missing values, smooth noisy data, identify or remove outliers, resolve inconsistencies
- **Data integration**
    - integration of multiple data sources
- **Data reduction**
    - dimensionality reduction, numerosity reduction, data compression
- **Data transformation and data discretization**
    - normalization, concept hierarchy generation

## Why Data Cleaning?

- Imperfect real-world data
- **Incomplete**: missing attributes or attribute values
    - e.g., age = "", major = ""
- **Noisy**: containing errors or outliers
    - e.g., salary = "-10"
- **Inconsistent**: containing discrepancies
    - e.g., age = "21", birthday = "08/03/1995"
    - e.g., ratings of "1, 2, 3" and "A, B, C"
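
The three kinds of imperfection above can be checked for mechanically. The sketch below is illustrative only: the record layout, the `find_problems` helper, and the reference date are assumptions, and the birthday "08/03/1995" is read as month/day/year.

```python
from datetime import date

def age_on(birthday, today):
    """Return the age in whole years on the date `today`."""
    years = today.year - birthday.year
    # Subtract one year if the birthday has not yet occurred in `today`'s year.
    if (today.month, today.day) < (birthday.month, birthday.day):
        years -= 1
    return years

def find_problems(record, today):
    """Return a list of data quality problems found in one record (hypothetical schema)."""
    problems = []
    # Incomplete: required fields that are absent or empty.
    for field in ("age", "major"):
        if record.get(field) in (None, ""):
            problems.append(f"incomplete: {field} is missing")
    # Noisy: a value that cannot be correct, such as a negative salary.
    salary = record.get("salary")
    if salary is not None and salary < 0:
        problems.append("noisy: salary is negative")
    # Inconsistent: the stored age disagrees with the age implied by the birthday.
    age, birthday = record.get("age"), record.get("birthday")
    if age not in (None, "") and birthday is not None:
        if age != age_on(birthday, today):
            problems.append("inconsistent: age does not match birthday")
    return problems

record = {"age": 21, "major": "", "salary": -10, "birthday": date(1995, 8, 3)}
for problem in find_problems(record, today=date(2020, 1, 1)):
    print(problem)
```

Passing the reference date explicitly keeps the check reproducible; whether age "21" is inconsistent depends on when the data is examined.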

## Why Are Data Imperfect?

- **Incomplete data**
    - "not applicable" values
    - different considerations at collection time and at analysis time
    - human/hardware/software problems
- **Noisy data**
    - faulty data collection instruments
    - human or computer error at data entry
    - errors in data transmission
- **Inconsistent data**
    - different data sources
    - naming conventions, data formats
        - e.g., date "03/07/11"
    - functional dependency violation
        - e.g., modifying one of several linked values without updating the others
- **No quality data, no quality data mining results!**

## How to Handle Noisy Data?

- **Binning**
    - first sort & partition data into bins
    - then smooth by
        - bin means
        - bin medians
        - bin boundaries

![Image: Binning](img/3.1.png)
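
The binning steps can be sketched in a few lines. This is a minimal illustration assuming equal-frequency bins; the function names and the sample prices are made up for the example.

```python
from statistics import mean, median

def equal_frequency_bins(values, bin_size):
    """Sort the values, then split them into consecutive bins of `bin_size` items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's minimum and maximum.

    Ties go to the lower boundary (an arbitrary choice for this sketch).
    """
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, bin_size=3)
print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_medians(bins))     # [[8, 8, 8], [21, 21, 21], [28, 28, 28]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```

Equal-width bins (fixed value ranges rather than fixed counts) are an equally valid partitioning choice; only the partition step would change.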

- **Regression**
    - fit the data to a regression function and use the fitted values in place of the noisy ones

![Image: Regression](img/3.2.png)
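
As a rough sketch of smoothing by regression, the code below fits a straight line by ordinary least squares and replaces each observed value with its fitted value. A simple linear model is an assumption here; the helper names and the sample points are illustrative.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # requires at least two distinct x values
    intercept = mean_y - slope * mean_x
    return slope, intercept

def smooth_by_regression(xs, ys):
    """Replace each y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]

xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]  # roughly y = 2x with some noise
print(fit_line(xs, ys))
print(smooth_by_regression(xs, ys))
```

With more than one predictor, the same idea extends to multiple linear regression.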

- **Clustering**
    - group similar values into clusters, then detect and remove outliers that fall outside the clusters

![Image: Clustering](img/3.3.png)
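
One simple way to illustrate outlier detection by clustering is a gap-based grouping in one dimension: sorted values stay in the same cluster while neighbors are close, and values left in very small clusters are flagged as outliers. This is a sketch under those assumptions, not necessarily the method shown in the figure; `max_gap`, `min_cluster_size`, and the sample salaries are hypothetical.

```python
def cluster_1d(values, max_gap):
    """Group sorted values into clusters, starting a new one when the gap exceeds `max_gap`."""
    ordered = sorted(values)
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)  # close enough to the previous value
        else:
            clusters.append([v])    # too far away: start a new cluster
    return clusters

def split_outliers(values, max_gap, min_cluster_size):
    """Return (kept, outliers); values in clusters smaller than `min_cluster_size` are outliers."""
    kept, outliers = [], []
    for cluster in cluster_1d(values, max_gap):
        target = kept if len(cluster) >= min_cluster_size else outliers
        target.extend(cluster)
    return kept, outliers

salaries = [48, 50, 52, 51, 95, 97, 99, -10, 400]  # in thousands, made up
kept, outliers = split_outliers(salaries, max_gap=5, min_cluster_size=3)
print(kept)      # [48, 50, 51, 52, 95, 97, 99]
print(outliers)  # [-10, 400]
```

For multi-dimensional data, a general clustering algorithm such as k-means would play the role of `cluster_1d`.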

## Data Integration

- Combines data from multiple sources
- **Entity identification**
    - schema integration, object matching
    - e.g., student_id vs. student_number
- **Redundant data**
    - different naming, derived data
    - may be detected by correlation analysis
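
As a sketch of correlation analysis for numeric attributes, the Pearson correlation coefficient can flag a derived attribute: a value near +1 or -1 suggests that one attribute is (nearly) a linear function of the other and may be redundant. The attribute names and numbers below are invented for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)  # undefined if either attribute is constant

monthly_salary = [3000, 4200, 5100, 6500]
annual_salary = [12 * m for m in monthly_salary]  # derived attribute
years_at_company = [2, 1, 4, 3]

print(pearson(monthly_salary, annual_salary))     # approximately 1: redundant
print(pearson(monthly_salary, years_at_company))  # noticeably below 1
```

High correlation only indicates a candidate for removal; correlation does not establish that one attribute was actually derived from the other.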