# [CPSC 222](https://github.com/GonzagaCPSC222) Intro to Data Science
[Gonzaga University](https://www.gonzaga.edu/)

[Gina Sprint](http://cs.gonzaga.edu/faculty/sprint/)

# Data Preprocessing
What are our learning objectives for this lesson?
* Learn about the steps involved in data preprocessing
* Learn about different attribute types
* Summarize data with simple statistics
* Clean data by filling missing values

Content used in this lesson is based upon information in the following sources:
* Dr. Shawn Bowers' Data Mining notes

## Warm-up Task(s)
* Open DataPreprocessingFun/main.py
* Given a table and a column index, write a function to get the column in the table at the index
    * Ignore "NA" values
* Given a table, write a function to get the min and max value of one of its (numerical) attributes
* Test your functions on the following table:

```
header = ["CarName", "ModelYear", "MSRP"]
msrp_table = [["ford pinto", 75, 2769],
              ["toyota corolla", 75, 2711],
              ["ford pinto", 76, 3025],
              ["toyota corolla", 77, 2789]]
```

* Brainstorm with your neighbor possible issues with the different approaches for dealing with missing values

## Data Preprocessing
Data analysts spend a surprising amount of time preparing data for analysis. In fact, one survey found that cleaning big data is the most time-consuming and least enjoyable task data scientists do!
<img src="https://thumbor.forbes.com/thumbor/960x0/https%3A%2F%2Fblogs-images.forbes.com%2Fgilpress%2Ffiles%2F2016%2F03%2FTime-1200x511.jpg" width="700">
(image from [https://thumbor.forbes.com/thumbor/960x0/https%3A%2F%2Fblogs-images.forbes.com%2Fgilpress%2Ffiles%2F2016%2F03%2FTime-1200x511.jpg](https://thumbor.forbes.com/thumbor/960x0/https%3A%2F%2Fblogs-images.forbes.com%2Fgilpress%2Ffiles%2F2016%2F03%2FTime-1200x511.jpg))

The goal of data preprocessing is to produce high-quality data to improve mining results and efficiency.

At a high level, data preprocessing includes the following steps (these steps can be done in any order and are often repeated multiple times):
1. Data Exploration (basic understanding of meaning, attributes, values, issues)
2. Data Reduction (reduce size via aggregation, redundant features, etc.)
3. Data Integration (merge/combine multiple datasets)
4. Data Cleaning (remove noise and inconsistencies)
5. Data Transformation (normalize/scale, etc.)

It is important for data mining that your process is transparent and repeatable:
* Can repeat "experiment" and get the same result
* No "magic" steps

To make this possible, it is important to write down (log) each step:
* Ideally, someone should be able to take your data, program, and description of steps, rerun everything, and get the same results!

## Data Exploration
Get to know your data first by exploring it. 

### Attributes
Different aspects of attributes (variables)
* Data (storage) type - e.g., int versus float versus string
* Measurement scale - e.g., whether values are discrete or continuous
* Semantic type - what the values represent (e.g., colors, ages)

### Measurement Scales
1. Nominal
    * Discrete values without inherent order
    * E.g., colors (red, blue, green), identifiers, occupation, gender
    * Often ints or strings (but could be any data type)
2. Ordinal
    * Discrete values with inherent order
    * E.g., t-shirt size (s, m, l, xl), grades (A+, A-, B+, ...)
    * No guarantee that the difference between values is the same
    * Often ints or strings (but could be any data type)
3. Interval
    * Values measured on a scale of equal-sized widths
    * Unlike ordinal, can compare and quantify difference between values
    * No inherent zero point (i.e., absence)
    * Temperature (Celsius, Fahrenheit) is an example
4. Ratio
    * Interval values with an inherent zero point
    * Temperature in Kelvin is an example
    * Also counts of things (where 0 means not present)
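
To see why the inherent zero point matters, here is a minimal sketch comparing the same two temperatures in Celsius (interval) and Kelvin (ratio). The helper name `celsius_to_kelvin` and the sample temperatures are made up for illustration. Differences are meaningful on both scales, but ratios are only meaningful on the ratio scale.

```python
def celsius_to_kelvin(temp_c):
    """Convert a Celsius temperature (interval scale) to Kelvin (ratio scale)."""
    return temp_c + 273.15

# hypothetical sample temperatures
cool_c, warm_c = 10.0, 20.0

# differences are meaningful on an interval scale (same size on both scales)
diff_c = warm_c - cool_c
diff_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)

# ratios need an inherent zero point to be meaningful
ratio_c = warm_c / cool_c  # 2.0, but 20 C is not "twice as hot" as 10 C
ratio_k = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

print("difference:", diff_c, "C vs", round(diff_k, 2), "K")
print("ratio:", ratio_c, "in Celsius vs", round(ratio_k, 3), "in Kelvin")
```

The Celsius ratio changes if the same temperatures are expressed in Fahrenheit, which is a sign that the ratio is not a property of the data itself.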
    
### Categorical vs Continuous
* Categorical roughly means the nominal and ordinal values
* Continuous roughly means the rest (interval, ratio) ... aka "numerical"
* For many algorithms/approaches, this is enough detail

### Labeled vs Unlabeled Data
* Labeled data implies an attribute that classifies instances (e.g., mpg)
    * Goal is typically to predict the class for new instances
    * This is called "Supervised Learning"
* Unlabeled means there isn't such an attribute (for mining purposes)
    * Can still find patterns, associations, etc.
    * Generally referred to as "Unsupervised Learning"
            
## Data Cleaning
1. Noisy vs Invalid Values
    * Noisy implies the value is valid for the domain, but was recorded incorrectly
        * E.g., decimal place error (5.72 instead of 57.2), wrong categorical value used
    * Invalid implies a noisy value that is not valid for the domain
        * E.g., 57.2X, misspelled categorical data, or value out of range (6 on a 5 point scale)
    * Ways to deal with this:
        * Look for duplicates (when there shouldn't be)
        * Look for outliers
        * Sort and print range of values
    * The term "noisy" may also imply random error or random variance
        * Various techniques to "smooth out" values
        * E.g., using means of bins or regression
2. Missing Values
    * How should we deal with missing values?
        * Discard instances: throw out any row with a missing value
        * Replace with a new value:
            * By hand
            * Use a constant
            * Use a central tendency measure (mean, median, most frequent, ...)
        * Most "probable" value (e.g., regression, using a classifier)
        * Replace either across data set, or based on similar instances
            * E.g. average based on model year
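
As a rough sketch of two of the strategies above (discarding instances and replacing with the mean), the following uses the warm-up table with one MSRP swapped for "NA". That missing value is an assumption added for illustration, and the function names `discard_missing_rows` and `fill_missing_with_mean` are hypothetical, not part of the course code.

```python
import copy

header = ["CarName", "ModelYear", "MSRP"]
# warm-up table, with one MSRP replaced by "NA" for illustration only
msrp_table = [["ford pinto", 75, 2769],
              ["toyota corolla", 75, "NA"],
              ["ford pinto", 76, 3025],
              ["toyota corolla", 77, 2789]]

def discard_missing_rows(table):
    """Return a new table without any row that contains "NA"."""
    return [row for row in table if "NA" not in row]

def fill_missing_with_mean(table, col_index):
    """Return a copy of table with "NA" in one column replaced by that column's mean.

    Assumes the column is numerical and has at least one known value.
    """
    known = [row[col_index] for row in table if row[col_index] != "NA"]
    col_mean = sum(known) / len(known)
    filled = copy.deepcopy(table)  # leave the original table untouched
    for row in filled:
        if row[col_index] == "NA":
            row[col_index] = col_mean
    return filled

msrp_index = header.index("MSRP")
print(discard_missing_rows(msrp_table))
print(fill_missing_with_mean(msrp_table, msrp_index))
```

Both functions return new tables rather than modifying the input, which keeps the original data available and makes the cleaning step easier to repeat. Filling based on similar instances (e.g., by model year) would follow the same pattern, computing the mean over only the rows in the matching group.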
        
## Summary Statistics
Summary statistics give (initial) insights into a dataset, such as:
1. Number of instances (how many rows)
2. Min and max attribute values
    * Q: Do these make sense for both categorical and continuous attributes?
        * Ordinal, but not Nominal
        * Much easier if numeric!
        * Can only count number of each nominal value
    * Q: What should be done with null (NA) values?
        * Really, undefined / unknown
        * In practice just ignore them
3. Middle values of a distribution (aka “Central Tendency”)
    * Mid value: `(max + min) / 2.0` 
        * AKA "Midrange"
    * (Arithmetic) Mean $\bar{x} = (x_1 + x_2 + ... + x_n) / n$
        * AKA average
        * Python: `sum(column) / float(len(column))`
        * Q: Problems with the mean? ... sensitive to extremes (e.g., outliers)
        * Q: Does the mean make sense for both categorical and continuous attributes?
            * Only interval or ratio (equal-sized widths)
    * Median
        * The middle value in a set of sorted values
        * If even number of values, halfway between the two middles
        * Better measure for skewed data
        * Can be expensive for large data sets (sorting!)
    * Mode
        * Value(s) that occurs most frequently
        * Typically assume data is unimodal (one mode), e.g., normally distributed
        * Q: How might we compute the mode in Python?
4. Data Dispersion (Spread)
    * Range (max - min)
    * Quantiles: (Roughly) equal size partitions of data (if sorted from smallest to largest)
        * The "2-quantile" is the data point that divides the data into two halves (AKA median)
        * "Quartiles" are the three data points that divide the data into four groups
            * Used as part of box plots (more later)
        * Interquartile range (IQR) is the distance between the 1st and 3rd quartiles
        * "Percentiles" are 100-quantiles (100 groups)
    * Variance and Standard Deviation
        * Variance measures how spread out the data is (small implies data close to mean, large implies data spread out) $$\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}$$
        * Standard Deviation is the square root of the variance
            * Python: `numpy.std(vals)` 
                * ... more on numpy later
        * For a normal (i.e., Gaussian) data distribution
            * About 68% of values are within 1 standard deviation of mean
            * About 95% of values are within 2 standard deviations
            * About 99.7% of values are within 3 standard deviations
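
The statistics above can be sketched in plain Python on the warm-up table. This is one possible approach, not the course's reference solution; the helper names (`get_column`, `median`, `modes`, `variance`) are hypothetical. The variance here divides by n, matching the formula above.

```python
import math
from collections import Counter

header = ["CarName", "ModelYear", "MSRP"]
msrp_table = [["ford pinto", 75, 2769],
              ["toyota corolla", 75, 2711],
              ["ford pinto", 76, 3025],
              ["toyota corolla", 77, 2789]]

def get_column(table, col_index):
    """Return the values in a column, ignoring "NA" entries."""
    return [row[col_index] for row in table if row[col_index] != "NA"]

def median(values):
    """Middle of the sorted values; halfway between the two middles if even."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def modes(values):
    """Return every value tied for the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]

def variance(values):
    """Population variance: mean squared distance from the mean (divide by n)."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

msrps = get_column(msrp_table, header.index("MSRP"))
years = get_column(msrp_table, header.index("ModelYear"))

print("instances:", len(msrp_table))
print("min, max:", min(msrps), max(msrps))
print("midrange:", (max(msrps) + min(msrps)) / 2)
print("mean:", sum(msrps) / len(msrps))
print("median:", median(msrps))
print("mode(s) of ModelYear:", modes(years))
print("range:", max(msrps) - min(msrps))
print("variance:", variance(msrps))
print("standard deviation:", math.sqrt(variance(msrps)))
```

Python's built-in `statistics` module offers `mean`, `median`, and `pstdev` for the same calculations, and `numpy.std` with its default settings also divides by n, so it should agree with the standard deviation printed here.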