<a href="https://colab.research.google.com/github/brendenwest/ad450/blob/master/5_data_wrangling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Wrangling

Advanced techniques for preparing & summarizing data 

### Reading
- McKinney, Chapter 8 & 10
- Molin, “Aggregating Pandas Dataframes”
 
### Tutorials
- https://www.datacamp.com/community/tutorials/pandas-multi-index
- https://www.datacamp.com/community/tutorials/pandas-split-apply-combine-groupby
 
### Learning Outcomes
- hierarchical indexing
- combining & reshaping datasets
- data aggregation
- group operations
- calculating group statistics
- within-group transformations or subset selection
- computing pivot tables and cross-tabulations
- performing quantile & statistical-group analysis

### Hierarchical Indexing

- allows multiple index (MultiIndex) levels on an axis
- either axis of a DataFrame can have a hierarchical index
- partial indexing allows concise selection of data subsets
- selections can be made on the `outer` or `inner` level of a MultiIndex
- can be used to reshape data (using `stack` & `unstack`)
- can be used for group-based operations
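The points above can be sketched on a small Series. This is only an illustration: the labels and values are made up, and the variable names (`s`, `outer`, `inner`, `wide`) are not from the readings.

```python
import numpy as np
import pandas as pd

# Illustrative Series with a two-level index (values are arbitrary).
s = pd.Series(
    np.arange(6.0),
    index=[["a", "a", "b", "b", "c", "c"], [1, 2, 1, 2, 1, 2]],
)

outer = s["b"]              # partial indexing on the outer level
inner = s.loc[:, 2]         # selection on the inner level
wide = s.unstack()          # inner level becomes the columns of a DataFrame
round_trip = wide.stack()   # back to a Series with a MultiIndex

print(outer)
print(inner)
print(wide)
```

`unstack` and `stack` are inverses here because every outer label has the same inner labels; with ragged levels, `unstack` would introduce missing values.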


#### Creating hierarchical indexes

A hierarchical index can be assigned to a dataset by passing a list of lists (or arrays) as the `index` value, one inner list per level, each with the same length as the dataset axis. This creates a `MultiIndex` object.

In [0]:
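import numpy as np
import pandas as pd

# Two parallel lists define the outer and inner index levels.
data = pd.Series(np.random.randn(9),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])
print(data)        # values are random, so they differ on each run
print(data.index)  # the resulting MultiIndex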

a  1   -0.605719
   2    0.123044
   3    1.811644
b  1    0.823819
   3    0.385486
c  1    0.327702
   2   -0.159786
d  2    0.869689
   3   -1.515328
dtype: float64
MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('d', 2),
            ('d', 3)],
           )


#### Reordering & Sorting Levels

- `swaplevel` changes order of levels in a MultiIndex
- `sort_index` sorts the dataset by its index; by default it uses all levels of the MultiIndex, and the `level` argument makes a single level the primary sort key
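A minimal sketch of both methods, assuming a small DataFrame whose level names (`key1`, `key2`) and values are invented for illustration:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[["b", "b", "a", "a"], [2, 1, 2, 1]],
    columns=["x", "y", "z"],
)
frame.index.names = ["key1", "key2"]

# swaplevel exchanges the two levels; the data itself is unchanged.
swapped = frame.swaplevel("key1", "key2")

# With no arguments, sort_index sorts lexicographically over all levels.
by_all = frame.sort_index()

# level= makes one level the primary sort key.
by_inner = frame.sort_index(level="key2")

print(swapped)
print(by_all)
print(by_inner)
```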

#### Summary Statistics by Level

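- In older pandas releases, many descriptive and summary statistics (for example `sum`) accept a `level` argument that names the index level to aggregate by on a particular axis. That argument was removed in pandas 2.0; use `groupby(level=...)` followed by the aggregation instead.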
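As a sketch of aggregating by index level, the example below uses `groupby(level=...)`, which works across pandas versions. The frame and its level names are assumptions made for the example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(8).reshape((4, 2)),
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
    columns=["x", "y"],
)
frame.index.names = ["key1", "key2"]

# Sum the rows that share a label on the outer level.
by_key1 = frame.groupby(level="key1").sum()

# Sum the rows that share a label on the inner level.
by_key2 = frame.groupby(level="key2").sum()

print(by_key1)
print(by_key2)
```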

#### Indexing with DataFrame columns

- DataFrame's `set_index` method creates a new DataFrame using one or more of its columns as the index
- the index columns are removed from the DataFrame unless you pass `drop=False`
- `reset_index` does the opposite, moving the index levels back into the DataFrame as columns
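A short sketch of the round trip between columns and index levels. The column names are placeholders chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "a": range(4),
    "b": [10, 20, 30, 40],
    "c": ["one", "one", "two", "two"],
    "d": [0, 1, 0, 1],
})

indexed = df.set_index(["c", "d"])            # c and d become index levels
kept = df.set_index(["c", "d"], drop=False)   # c and d stay as columns too
restored = indexed.reset_index()              # levels move back into columns

print(indexed)
print(restored)
```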

### Combining & Merging Datasets

pandas supports several different ways to combine datasets:

- `pandas.merge` - connects rows in DataFrames based on one or more keys, like a SQL database join operation
- `pandas.concat` - concatenates or 'stacks' data along an axis
- `combine_first` instance method enables splicing together overlapping data to fill in missing values in one object with values from another.

#### Database-style joins

Merge or join operations combine datasets by linking rows using one or more keys.

If the column to join on is not specified, `merge` uses the overlapping column names as the keys. It is clearer to name the key explicitly with `on`.

When the key columns have different names in each dataset, specify them with `left_on` and `right_on`.

By default `merge` does an `inner` join: the keys in the result are the intersection, or the common set found in both tables. The `how` argument selects the other options: `left`, `right`, and `outer` joins.

To merge with multiple keys, pass a list of column names. 
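The merge options described so far can be sketched as follows. All frames and column names here are invented for illustration.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "rval": [20, 30, 40]})

# No key given: merge falls back to the shared column name ("key"), inner join.
inner = pd.merge(left, right)

# Explicit key and an outer join keeps keys found in either table.
outer = pd.merge(left, right, on="key", how="outer")

# Different key names on each side.
right2 = right.rename(columns={"key": "rkey"})
left_join = pd.merge(left, right2, left_on="key", right_on="rkey", how="left")

# Multiple keys: pass a list of column names.
sales = pd.DataFrame({"store": ["s1", "s1", "s2"],
                      "year": [2019, 2020, 2019],
                      "sales": [5, 6, 7]})
costs = pd.DataFrame({"store": ["s1", "s2", "s2"],
                      "year": [2019, 2019, 2020],
                      "costs": [1, 2, 3]})
both = pd.merge(sales, costs, on=["store", "year"])

print(inner)
print(outer)
print(left_join)
print(both)
```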

`left_index=True` or `right_index=True` (or both) indicates that the index should be used as the merge key.

DataFrame has a convenient `join` instance method for merging by index. It can also combine many DataFrame objects that have the same or similar indexes but non-overlapping columns.
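A sketch of index-based merging, assuming small made-up frames. The first part matches a column on the left against the index on the right; the second uses `join` with one frame and then with a list of frames.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "a", "c"], "value": range(4)})
right = pd.DataFrame({"group_val": [3.5, 7.0]}, index=["a", "b"])

# Match the "key" column on the left against the index on the right.
merged = pd.merge(left, right, left_on="key", right_index=True)

df1 = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "b", "c"])
df2 = pd.DataFrame({"y": [10, 20]}, index=["b", "c"])
df3 = pd.DataFrame({"z": [100, 200]}, index=["a", "c"])

# join merges on the index; how= works as it does for merge.
joined = df1.join(df2, how="inner")

# Passing a list joins several frames with non-overlapping columns at once.
many = df1.join([df2, df3], how="outer")

print(merged)
print(joined)
print(many)
```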

#### Concatenation

`concat` provides a consistent way to:
- combine objects that are indexed differently
- make the combined data identifiable in the resulting objects
- preserve data in the concatenation axis

When concatenating Series, `concat` works along axis=0 by default and produces another Series. If you pass axis=1 (the columns), the result is a DataFrame instead.

The `join` argument (`'outer'` by default, or `'inner'`) controls how labels on the other axis are combined.

Concatenation can create a hierarchical index on the concatenation axis using the `keys` argument.

`ignore_index=True` drops the original index labels along the concatenation axis and replaces them with a new integer range.
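The `concat` options above, sketched on three small Series. The data and names are illustrative only.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=["a", "b"])
s2 = pd.Series([2, 3, 4], index=["c", "d", "e"])
s3 = pd.Series([5, 6], index=["a", "f"])

stacked = pd.concat([s1, s2])                    # axis=0: one longer Series
side = pd.concat([s1, s2], axis=1)               # axis=1: DataFrame, NaN where labels differ
inner = pd.concat([s1, s3], axis=1, join="inner")  # keep only labels present in both
keyed = pd.concat([s1, s2], keys=["one", "two"])   # keys become the outer index level
renumbered = pd.concat([s1, s2], ignore_index=True)  # discard the original labels

print(side)
print(inner)
print(keyed)
print(renumbered)
```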


#### Combining data with overlap

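`combine_first` is similar to using NumPy's `where` function as a vectorized ternary expression: it keeps the calling object's values where they are present and falls back to the other object's values where they are missing, aligning on index labels first.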
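To make the comparison concrete, the sketch below fills the gaps in one Series from another in both ways. The Series `a` and `b` are made up for the example and share the same index, which is what lets `np.where` give the same answer.

```python
import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2.5, np.nan, 3.5], index=["w", "x", "y", "z"])
b = pd.Series([0.0, 1.0, 2.0, 3.0], index=["w", "x", "y", "z"])

# Ternary form: take b where a is missing, otherwise take a.
via_where = pd.Series(np.where(a.isna(), b, a), index=a.index)

# Same result, but combine_first also aligns on index labels.
via_combine = a.combine_first(b)

print(via_combine)
```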


### Reshaping & Pivoting

#### Reshaping with Hierarchical Indexing
#### Pivoting 'long' to 'wide'
#### Pivoting 'wide' to 'long'

### Grouping Data

The expressiveness of Python and pandas allows complex group operations using any function that accepts a pandas object or NumPy array. This can include:

- Splitting a pandas object into pieces using one or more keys
- Calculating group summary statistics
- Applying within-group transformations or other manipulations
- Computing pivot tables and cross-tabulations
- Performing quantile analysis and other statistical group analyses

#### GroupBy Operations

Group operations involve the `split-apply-combine` mechanism.

1. Data are split into groups based on one or more keys
2. A function is applied to each group
3. Results of the function application are combined into a new object
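The three steps can be written out by hand and compared with the one-line `groupby` form. This is a sketch on invented data; in practice `groupby` does all three steps itself.

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "b", "a", "b", "a"],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# 1. Split: one chunk of rows per key.
pieces = {name: chunk for name, chunk in df.groupby("key")}

# 2. Apply: compute a function on each chunk.
means = {name: chunk["value"].mean() for name, chunk in pieces.items()}

# 3. Combine: assemble the per-group results into one object.
manual = pd.Series(means, name="value")

# The same computation in one expression.
direct = df.groupby("key")["value"].mean()

print(manual)
print(direct)
```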

Grouping keys can take many forms, and the keys do not have to be all of the same type.

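The pandas `groupby` method returns a GroupBy object that can be reused for several computations.

DataFrame columns can be used as the group keys.

In older pandas versions, numeric aggregations such as `mean` silently excluded `nuisance` (non-numeric) columns from the result. Since pandas 2.0 they raise an error instead, unless you pass `numeric_only=True` or select the numeric columns first.

By default `groupby` groups rows (axis=0). Grouping columns with `axis=1` also exists but is deprecated in recent pandas releases.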
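A sketch of the different key forms, on a made-up frame (`key1`, `key2`, `data1`, `data2` and the `years` array are assumptions for the example). It passes `numeric_only=True` explicitly so the non-numeric column is left out regardless of pandas version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "data2": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# A column name as the key; the GroupBy object can be reused.
grouped = df.groupby("key1")
sizes = grouped.size()
means = grouped.mean(numeric_only=True)   # key2 is not numeric, so it is skipped

# Keys of different types: a Series plus an external array of the same length.
years = np.array([2005, 2005, 2006, 2005, 2006])
mixed = df["data1"].groupby([df["key1"], years]).sum()

print(sizes)
print(means)
print(mixed)
```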

#### Iterating over groups

The GroupBy object supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data.
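A sketch of iteration on invented data. With several keys, the group name is a tuple of key values; with a single column name it is a scalar, which makes it easy to collect the chunks into a dict.

```python
import pandas as pd

df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Two keys: each group name unpacks into (key1 value, key2 value).
seen = []
for (k1, k2), chunk in df.groupby(["key1", "key2"]):
    seen.append((k1, k2, len(chunk)))

# One key: build a dict mapping each group name to its chunk of rows.
pieces = {name: chunk for name, chunk in df.groupby("key1")}

print(seen)
print(pieces["b"])
```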

Indexing a GroupBy object created from a DataFrame with a column name or list of column names selects only those columns for aggregation. This means that:
```
df.groupby('key1')['data1']
```
is essentially the same as:
```
df['data1'].groupby(df['key1'])
```
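The equivalence can be checked directly. The sketch below assumes a small frame with the same column names as the snippets above; the values are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "data2": [10.0, 20.0, 30.0, 40.0, 50.0],
})

via_subset = df.groupby("key1")["data1"].mean()
via_series = df["data1"].groupby(df["key1"]).mean()

# A list of names (even a single one) returns a DataFrame rather than a Series.
as_frame = df.groupby("key1")[["data1"]].mean()

print(via_subset)
print(as_frame)
```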

#### Grouping with Series or Dicts


#### Grouping with Functions
#### Grouping by Index Levels

### Data Aggregation

Aggregations refer to any data transformation that produces scalar values from arrays.

Programs can use built-in optimized aggregation methods, or custom functions.

Programs can pass any function that aggregates an array to the `aggregate` or `agg` method of a GroupBy object. 

Custom aggregation functions are generally much slower than the built-in optimized functions.
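A minimal sketch of a custom aggregation next to a built-in one. The helper `peak_to_peak` is written here as max minus min, which is an assumption about the function of that name used later in these notes; the data is invented.

```python
import pandas as pd

df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
})

def peak_to_peak(arr):
    """Range of the values in one group: max minus min."""
    return arr.max() - arr.min()

grouped = df.groupby("key1")["data1"]

custom = grouped.agg(peak_to_peak)   # any function that reduces an array to a scalar
builtin = grouped.agg("mean")        # built-in aggregations can be named by string

print(custom)
print(builtin)
```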

#### Column-wise & Multiple-function application

Passing a list of functions or function names results in a DataFrame whose column names are taken from the functions. To override the default column names, pass `(name, function)` tuples:

`grouped_pct.agg([('foo', 'mean'), ('bar', np.std), peak_to_peak])`

You can specify a list of functions to apply to all of the columns, or different functions per column by passing a dict that maps column names to functions.
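The three forms can be sketched together. The frame, the `foo`/`bar` labels, and the `peak_to_peak` helper (assumed to be max minus min) are illustrative; the string `"std"` is used in place of `np.std` to stay on pandas' built-in aggregation.

```python
import pandas as pd

df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "data2": [10.0, 20.0, 30.0, 40.0, 50.0],
})

def peak_to_peak(arr):
    return arr.max() - arr.min()

grouped = df.groupby("key1")

# A list of functions: one result column per function.
multi = grouped["data1"].agg(["mean", "std", peak_to_peak])

# (name, function) tuples override the default column names.
renamed = grouped["data1"].agg([("foo", "mean"), ("bar", "std")])

# A dict applies different functions to different columns.
per_col = grouped.agg({"data1": "max", "data2": ["min", "sum"]})

print(multi)
print(renamed)
print(per_col)
```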

#### Aggregated data without row indexes

By default, the aggregated result has an index, potentially hierarchical, composed from the unique group key combinations. You can disable this behavior in most cases by passing `as_index=False` to `groupby`, which keeps the group keys as ordinary columns.
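A sketch of the difference, using a made-up frame with two key columns:

```python
import pandas as pd

df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "data2": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Default: the group keys form a hierarchical row index.
indexed = df.groupby(["key1", "key2"]).mean()

# as_index=False: the keys stay as columns and the rows get a plain integer index.
flat = df.groupby(["key1", "key2"], as_index=False).mean()

print(indexed)
print(flat)
```

Calling `indexed.reset_index()` gives the same flat layout after the fact.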

### General split-apply-combine

#### Suppressing Group Keys
#### Quantile & Bucket Analysis
#### Filling missing values
#### Random sampling & permutation
#### Group weighted average and correlation

### Pivot Tables & Cross-Tabulation

#### Pivot Tables
#### Cross-Tabulations
