# [CPSC 222](https://github.com/GonzagaCPSC222) Intro to Data Science
[Gonzaga University](https://www.gonzaga.edu/)

[Gina Sprint](http://cs.gonzaga.edu/faculty/sprint/)

# Data Processing
What are our learning objectives for this lesson?
* Clean data by filling missing values
* Perform data aggregation w/split-apply-combine

Content used in this lesson is based upon information in the following sources:
* Dr. Shawn Bowers' Data Mining notes

## Data Processing
Data analysts spend a surprising amount of time preparing data for analysis. In fact, one survey found that cleaning big data is the most time-consuming and least enjoyable task data scientists do!
<img src="https://thumbor.forbes.com/thumbor/960x0/https%3A%2F%2Fblogs-images.forbes.com%2Fgilpress%2Ffiles%2F2016%2F03%2FTime-1200x511.jpg" width="700">
(image from [https://thumbor.forbes.com/thumbor/960x0/https%3A%2F%2Fblogs-images.forbes.com%2Fgilpress%2Ffiles%2F2016%2F03%2FTime-1200x511.jpg](https://thumbor.forbes.com/thumbor/960x0/https%3A%2F%2Fblogs-images.forbes.com%2Fgilpress%2Ffiles%2F2016%2F03%2FTime-1200x511.jpg))

The goal of data preprocessing is to produce high-quality data to improve mining results and efficiency.

At a high level, data preprocessing includes the following steps (these steps can be done in any order and are often repeated multiple times):
1. Data Exploration (basic understanding of meaning, attributes, values, issues)
2. Data Reduction (reduce size via aggregation, redundant features, etc.)
3. Data Integration (join/merge/combine multiple datasets)
4. Data Cleaning (remove noise and inconsistencies)
    * Dealing with missing values
    * Dealing with incorrect values (e.g., misspelled names, values out of range)
5. Data Transformation (normalize/scale, to discrete values, etc.)

It is important for data mining that your process is transparent and repeatable:
* Can repeat "experiment" and get the same result
* No "magic" steps

It is therefore important to write down (log) your steps:
* Ideally, someone should be able to take your data, program, and description of steps, rerun everything, and get the same results!
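
One way to keep the steps explicit is to write each one as a named function and log it as it runs. The following is a minimal sketch rather than a prescribed workflow: the step names (`drop_missing`, `scale_min_max`), the synthetic data, and the seed value are all assumptions chosen for illustration.

```python
import logging

import numpy as np
import pandas as pd

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("preprocessing")

# fixed seed so the synthetic data is the same on every run (assumed value)
rng = np.random.default_rng(seed=0)
raw_df = pd.DataFrame({"value": rng.normal(size=6)})
raw_df.loc[2, "value"] = np.nan  # simulate one missing value


def drop_missing(df):
    """Hypothetical step: discard rows with missing values."""
    return df.dropna()


def scale_min_max(df):
    """Hypothetical step: scale each column to the range [0, 1]."""
    return (df - df.min()) / (df.max() - df.min())


# apply the steps in a fixed, documented order and log each one
steps = [drop_missing, scale_min_max]
clean_df = raw_df
for step in steps:
    clean_df = step(clean_df)
    log.info("%s -> shape %s", step.__name__, clean_df.shape)
print(clean_df)
```

Because the seed and the order of the steps are both fixed, rerunning the script reproduces the same `clean_df`, and the log records which steps were applied.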

## Data Cleaning
It is not uncommon to have datasets with noisy, invalid, or completely missing values.
1. Noisy vs Invalid Values
    * Noisy implies the value is valid for the domain but was recorded incorrectly
        * E.g., decimal place error (5.72 instead of 57.2), wrong categorical value used
    * Invalid implies a noisy value that is not valid for the domain
        * E.g., 57.2X, misspelled categorical data, or value out of range (6 on a 5-point scale)
    * Ways to deal with this:
        * Look for duplicates (when there shouldn't be any)
        * Look for outliers
        * Sort and print range of values
    * The term "noisy" may also imply random error or random variance
        * Various techniques to "smooth out" values
        * E.g., using means of bins or regression
2. Missing Values
    * How should we deal with missing values?
        * Discard instances: throw out any row with a missing value
        * Replace with a new value:
            * By hand
            * Use a constant
            * Use a central tendency measure (mean, median, most frequent, ...)
        * Most "probable" value (e.g., regression, using a classifier)
        * Replace either across data set, or based on similar instances
            * E.g., replace with the average for the same model year
            
Missing values are usually coded as an out-of-range value, such as an empty string in a text field, -1 in a numeric field that is normally positive, or 0 in a numeric field that cannot take on the value 0. In the SciPy ecosystem, the special value `NaN` (not a number) is commonly used to denote missing data, and the SciPy libraries handle `NaN` specially. For example, the Pandas function [`isnull()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.isnull.html) returns a Boolean array that flags `NaN` values element-wise, and [`dropna()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html) removes `NaN` values from a series or data frame:


In [None]:
## Data Cleaning Example
import numpy as np
import pandas as pd
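x = np.arange(0, 10, dtype=float)  # float dtype so the series can hold NaN
ser = pd.Series(x)
# mark two entries as missing
ser[1] = np.nan
ser[5] = np.nan
nans = ser.isnull()
# count the number of missing values
print(nans.sum())
print(ser)
# remove the missing values in place
ser.dropna(inplace=True)
print(ser)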

Note: you can learn more about missing data by reading the [Pandas documentation on missing data](https://pandas.pydata.org/pandas-docs/stable/missing_data.html).
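
Dropping rows is only one of the options listed above. As a rough sketch of the "replace with a new value" strategies, the example below fills gaps with a constant, with the column mean, and with the mean of similar instances (here, cars of the same model year). The data frame, its column names (`model_year`, `mpg`), and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# hypothetical data: mpg values with gaps, plus the model year of each car
cars_df = pd.DataFrame({
    "model_year": [70, 70, 70, 71, 71, 71],
    "mpg": [18.0, np.nan, 16.0, 25.0, 27.0, np.nan],
})

# replace with a constant
const_filled = cars_df["mpg"].fillna(0.0)

# replace with a central tendency measure computed across the whole column
mean_filled = cars_df["mpg"].fillna(cars_df["mpg"].mean())

# replace based on similar instances: the mean mpg of the same model year
year_means = cars_df.groupby("model_year")["mpg"].transform("mean")
year_filled = cars_df["mpg"].fillna(year_means)

print(mean_filled.tolist())  # [18.0, 21.5, 16.0, 25.0, 27.0, 21.5]
print(year_filled.tolist())  # [18.0, 17.0, 16.0, 25.0, 27.0, 26.0]
```

Which strategy is appropriate depends on the data; filling by group tends to preserve differences between groups that a single global mean would blur.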

By learning how to use the Pandas library, we have the skills to perform many of the tasks listed above. In this lesson, we are going to focus on *data cleaning*, modifying the data to make it sufficiently accurate and structured to support the analysis you want to perform. To learn about data cleaning, we are going to clean data by working through an example!
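
The checks suggested earlier for noisy and invalid values (look for duplicates, sort and print the range, flag out-of-range values) might look like the following in Pandas. This is a sketch on a made-up survey data frame; the column names and the 1 to 5 rating scale are assumptions.

```python
import pandas as pd

# hypothetical survey data: ratings are supposed to be on a 1-5 scale
survey_df = pd.DataFrame({
    "respondent": ["a", "b", "b", "c", "d"],
    "rating": [4, 5, 5, 6, 2],
})

# look for duplicates (when there shouldn't be any)
dup_rows = survey_df[survey_df.duplicated(keep=False)]
print(dup_rows)

# sort and print the range of values
print(survey_df["rating"].sort_values().tolist())
print(survey_df["rating"].min(), survey_df["rating"].max())

# flag values outside the valid domain (here, 1 through 5 inclusive)
invalid_rows = survey_df[~survey_df["rating"].between(1, 5)]
print(invalid_rows)
```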

## Data Aggregation
Gathering and summarizing information, perhaps in preparation for statistical analysis or visualization, is called *data aggregation*. For example, suppose you want to investigate the similarities/differences amongst patients in a clinical setting. Suppose specific attributes you are interested in include medical condition, age, and gender. You might *group* the data into two groups: male and female. By grouping the data based on a variable, such as gender, you are aggregating the data. The grouping allows you to then create a bar chart representing the frequency of each medical condition present in each group, or perform hypothesis testing to see if there is a significant age difference between the two groups. 

### Split-Apply-Combine
Data aggregation typically follows a "split, apply, combine" process:
* Split the data into groups based on some criteria
    * Perform *group by* operations
    * Select or slice data to form a subset
    * Example: Group a data frame by rows (axis 0) or by columns (axis 1)
* Apply a function to each group independently, producing a new value
    * Compute summary statistics (aggregation)
        * Example: Count the size of each group
        * Example: Compute mean, standard deviation, custom stats, etc.
    * Transform the data in the group (transformation)
        * Example: Standardizing data (z-score) within each group
        * Example: Filling missing data with a value derived from each group
    * Discard some groups (filtration)
        * Example: Discarding data that belongs to groups with only a few members
        * Example: Filtering out data based on the group sum or mean
* Combine the results of the function applications into a data structure
    * Example: A series with index corresponding to data frame column names and values representing the column means
    
<img src="http://blog.yhat.com/static/img/split-apply-combine.jpg" width="500">
(image from [http://blog.yhat.com/static/img/split-apply-combine.jpg](http://blog.yhat.com/static/img/split-apply-combine.jpg))
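
To make the three kinds of "apply" step concrete, here is a small sketch on a made-up data frame (the names `sac_df`, `group`, and `value` are hypothetical). For simplicity, the transformation centers each value on its group mean instead of computing a full z-score.

```python
import pandas as pd

# hypothetical data: three groups of different sizes
sac_df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "c"],
    "value": [1.0, 2.0, 3.0, 10.0, 20.0, 100.0],
})
values_by_group = sac_df.groupby("group")["value"]

# aggregation: one summary row per group
summary = values_by_group.agg(["count", "mean"])
print(summary)

# transformation: result has the same length as the input;
# here each value is centered on its own group's mean
centered = sac_df["value"] - values_by_group.transform("mean")
print(centered.tolist())  # [-1.0, 0.0, 1.0, -5.0, 5.0, 0.0]

# filtration: keep only rows belonging to groups with at least 2 members
kept = sac_df.groupby("group").filter(lambda g: len(g) >= 2)
print(kept["group"].unique().tolist())  # ['a', 'b']
```

In each case Pandas performs the combine step itself: `agg` returns one row per group, `transform` returns a result aligned with the original rows, and `filter` returns the surviving rows of the original data frame.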
    
### Pandas GroupBy
In the split step, we want to divide a dataset into a mapping of group names to group data. With the Pandas [`groupby()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) function, we can divide a data frame into a [`GroupBy`](http://pandas.pydata.org/pandas-docs/stable/groupby.html) object that stores the mapping. For example:

In [1]:
import numpy as np
import pandas as pd

# adapted from http://pandas.pydata.org/pandas-docs/stable/groupby.html
df = pd.DataFrame({"Gender" : ["F", "F", "M", "F", "M", "M", "M", "F"],
                   "AgeGroup" : ["OA", "A", "OA", "YA", "YA", "OA", "A", "YA"], # OA: older adult, A: adult, YA: young adult
                   "Feature1" : np.random.randn(8),
                   "Feature2" : np.random.randn(8)})
print(df)
# GroupBy object (mapping of group name -> group data frame)
gender_groups = df.groupby("Gender")
# groups attribute is a dictionary storing the mapping
print("Groups:", gender_groups.groups)
print("Female data frame")
F_df = gender_groups.get_group("F")
print(F_df)
print("Male data frame")
M_df = gender_groups.get_group("M")
print(M_df)
# confirm M_df is a data frame
print(type(M_df))
# divided the data frame into 2 groups
print(len(df) == len(F_df) + len(M_df))

  AgeGroup  Feature1  Feature2 Gender
0       OA  0.526954  0.352297      F
1        A  0.419304 -0.145591      F
2       OA -1.541056 -0.701479      M
3       YA -1.003222  0.511651      F
4       YA  0.906797  0.559671      M
5       OA -0.541548 -0.428555      M
6        A  1.321634 -1.693548      M
7       YA -0.475034 -1.297328      F
Groups: {'F': [0, 1, 3, 7], 'M': [2, 4, 5, 6]}
Female data frame
  AgeGroup  Feature1  Feature2 Gender
0       OA  0.526954  0.352297      F
1        A  0.419304 -0.145591      F
3       YA -1.003222  0.511651      F
7       YA -0.475034 -1.297328      F
Male data frame
  AgeGroup  Feature1  Feature2 Gender
2       OA -1.541056 -0.701479      M
4       YA  0.906797  0.559671      M
5       OA -0.541548 -0.428555      M
6        A  1.321634 -1.693548      M
<class 'pandas.core.frame.DataFrame'>
True
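
A GroupBy object can also be built from more than one column and iterated over directly. The sketch below reuses the `Gender` and `AgeGroup` values from the example above (without the random feature columns); the variable names are only illustrative.

```python
import pandas as pd

people_df = pd.DataFrame({
    "Gender": ["F", "F", "M", "F", "M", "M", "M", "F"],
    "AgeGroup": ["OA", "A", "OA", "YA", "YA", "OA", "A", "YA"],
})

# grouping by two columns gives one group per (Gender, AgeGroup) pair
pair_groups = people_df.groupby(["Gender", "AgeGroup"])
# size() counts the rows in each group
sizes = pair_groups.size()
print(sizes)

# iterating over a GroupBy object yields (group name, group data frame) pairs
for name, group_df in people_df.groupby("Gender"):
    print(name, len(group_df))
```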


Now we have learned enough background information to dive into learning about aggregating data by working through an example!

## Data Aggregation Example
We are going to work through an example with the following dataset: `pd_hoa_activities.csv`. This dataset contains information from a smart home study where participants performed 9 activities in a smart home environment. `pd_hoa_activities_cleaned.csv` is a version of the data that is already cleaned. We will start with this cleaned version of the dataset. You can download this file at: 

In [2]:
import pandas as pd
import numpy as np

fname = "files/pd_hoa_activities_cleaned.csv"  # forward slashes work on Windows, macOS, and Linux
df = pd.read_csv(fname, header=0, index_col=[0, 1])
print(df.shape)
print(df.head(n=10))

(665, 3)
                               duration  age class
pid task                                          
0   Water Plants                    146   72   HOA
    Fill Medication Dispenser       210   72   HOA
    Wash Countertop                 241   72   HOA
    Sweep and Dust                  328   72   HOA
    Cook                            229   72   HOA
    Wash Hands                       38   72   HOA
    Perform TUG                      10   72   HOA
    Perform TUG w/Questions          10   72   HOA
    Day Out Task                    680   72   HOA
1   Water Plants                     63   54   HOA


### Split
Now let's group the data into two population groups, HOA and PD. 

In [3]:
classes = df.groupby("class")
for class_name, cls_df in classes:
    print(class_name)
    print(cls_df.head())

HOA
                               duration  age class
pid task                                          
0   Water Plants                    146   72   HOA
    Fill Medication Dispenser       210   72   HOA
    Wash Countertop                 241   72   HOA
    Sweep and Dust                  328   72   HOA
    Cook                            229   72   HOA
PD
                               duration  age class
pid task                                          
2   Water Plants                     47   62    PD
    Fill Medication Dispenser       205   62    PD
    Wash Countertop                 232   62    PD
    Sweep and Dust                  543   62    PD
    Cook                            511   62    PD


### Apply and Combine
Then, we can compute summary statistics for each group, such as the mean and standard deviation of age. We will store the results in a new results data frame with index "HOA" and "PD":

In [4]:
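# one row per group name, one column per statistic
age_results_df = pd.DataFrame(index=list(classes.groups), columns=["age mean", "age std"])
for class_name, cls_df in classes:
    # .loc[row label, column label] sets the cell directly (no chained indexing)
    age_results_df.loc[class_name, "age mean"] = cls_df["age"].mean()
    age_results_df.loc[class_name, "age std"] = cls_df["age"].std()
print(age_results_df)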

    age mean  age std
HOA  68.6771  9.78872
PD   68.8539  9.88264
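
The same apply-and-combine step can usually be written without an explicit loop by calling `agg()` on the grouped column. The sketch below uses a small made-up stand-in for the dataset (the ages are not from the study), so its numbers will not match the results above.

```python
import pandas as pd

# hypothetical stand-in for the cleaned dataset (values are made up)
toy_df = pd.DataFrame({
    "class": ["HOA", "HOA", "HOA", "PD", "PD", "PD"],
    "age": [72, 54, 66, 62, 70, 75],
})

# split by class, apply mean and std to age, combine into one data frame
age_stats_df = toy_df.groupby("class")["age"].agg(["mean", "std"])
age_stats_df.columns = ["age mean", "age std"]
print(age_stats_df)
```

Here Pandas builds the results data frame itself, with one row per group and one column per statistic.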
