<a href="https://colab.research.google.com/github/brendenwest/ad450/blob/master/5_data_wrangling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Wrangling

Advanced techniques for preparing & summarizing data 

### Reading
- McKinney, Chapter 8 & 10
- Molin, “Aggregating Pandas Dataframes”
 
### Tutorials
- https://www.datacamp.com/community/tutorials/pandas-multi-index
- https://www.datacamp.com/community/tutorials/pandas-split-apply-combine-groupby
 
### Learning Outcomes
- hierarchical indexing
- combining & reshaping datasets
- data aggregation
- group operations
- calculating group statistics
- within-group transformations or subset selection
- computing pivot tables and cross-tabulations
- performing quantile & statistical-group analysis

### Hierarchical Indexing

- allows multiple index (MultiIndex) levels on an axis
- either axis of a DataFrame can have a hierarchical index
- partial indexing allows concise selection of data subsets
- selections can be made on the `outer` or `inner` level of a MultiIndex
- can be used to reshape data (using `stack` & `unstack`)
- can be used for group-based operations
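
As a minimal sketch of the points above, the snippet below builds a small two-level Series and shows partial indexing on the outer level, selection on the inner level, and a round trip through `unstack` and `stack`. The data and the variable names (`s`, `outer`, `inner`, `wide`, `roundtrip`) are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample data: a Series with a two-level (outer, inner) index
s = pd.Series(
    range(6),
    index=[["a", "a", "b", "b", "c", "c"], [1, 2, 1, 2, 1, 2]],
)

# Partial indexing on the outer level: every row stored under "b"
outer = s.loc["b"]

# Selection on the inner level: inner key 2 across all outer keys
inner = s.loc[:, 2]

# unstack moves the inner level into columns (long to wide)
wide = s.unstack()

# stack moves the columns back into an inner index level (wide to long)
roundtrip = wide.stack()

print(outer)
print(inner)
print(wide)
```

Because every outer key here has the same inner keys, `unstack` produces no missing values; with ragged levels the missing combinations would be filled with `NaN`.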


#### Creating hierarchical indexes

A hierarchical index can be assigned to a dataset by passing a list of lists (or arrays) as the `index` value, where each inner list has the same length as the axis being indexed. pandas builds a `MultiIndex` object from them, with one level per inner list.

```python
import pandas as pd
import numpy as np

# Two parallel lists of equal length become the outer and inner index levels
data = pd.Series(np.random.randn(9),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])
print(data)
print(data.index)
```

```
a  1   -0.605719
   2    0.123044
   3    1.811644
b  1    0.823819
   3    0.385486
c  1    0.327702
   2   -0.159786
d  2    0.869689
   3   -1.515328
dtype: float64
MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('d', 2),
            ('d', 3)],
           )
```


#### Reordering & Sorting Levels

- `swaplevel` changes order of levels in a MultiIndex
- `sort_index` sorts the dataset by its index; by default it uses every level of the MultiIndex (outermost first), and the `level` argument selects a single level to sort by
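
The following sketch illustrates both methods on a small frame. The level names `key1` and `key2`, the `value` column, and the data are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical frame with two named index levels, deliberately unsorted
frame = pd.DataFrame(
    {"value": [10, 20, 30, 40]},
    index=pd.MultiIndex.from_tuples(
        [("b", 2), ("a", 1), ("b", 1), ("a", 2)], names=["key1", "key2"]
    ),
)

# swaplevel interchanges the two levels; row order and data are unchanged
swapped = frame.swaplevel("key1", "key2")

# sort_index with no arguments sorts by all levels, outermost first
by_all = frame.sort_index()

# sort_index(level=...) sorts by the chosen level first
by_inner = frame.sort_index(level="key2")

print(swapped)
print(by_all)
print(by_inner)
```

Both methods return new objects and leave `frame` unmodified.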

#### Summary Statistics by Level

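- Descriptive and summary statistics can be aggregated by index level on a particular axis by grouping on that level with `groupby(level=...)`. Older pandas releases also accepted a `level` argument directly on methods such as `sum`; that argument was deprecated in pandas 1.3 and removed in 2.0.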
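
A minimal sketch of level-based aggregation, assuming a recent pandas release where `groupby(level=...)` is the supported route. The frame, its level names (`key1`, `key2`), and its columns are hypothetical.

```python
import pandas as pd

# Hypothetical frame with a two-level row index
frame = pd.DataFrame(
    {"x": [1, 2, 3, 4], "y": [10, 20, 30, 40]},
    index=pd.MultiIndex.from_tuples(
        [("a", 1), ("a", 2), ("b", 1), ("b", 2)], names=["key1", "key2"]
    ),
)

# Sum within each value of the outer level
outer_totals = frame.groupby(level="key1").sum()

# Mean within each value of the inner level
inner_means = frame.groupby(level="key2").mean()

print(outer_totals)
print(inner_means)
```

Levels can be referenced by name, as here, or by integer position (`level=0` for the outermost level).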

#### Indexing with DataFrame columns

- DataFrame's `set_index` method returns a new DataFrame that uses one or more of its columns as the index
- the columns used for the index are removed from the DataFrame unless `drop=False` is passed
- `reset_index` does the reverse: it moves the hierarchical index levels back into the DataFrame as columns
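
The behavior described above can be sketched as follows. The column names `a`, `b`, `c` and the values are hypothetical sample data.

```python
import pandas as pd

# Hypothetical flat frame with three ordinary columns
frame = pd.DataFrame(
    {"a": [0, 1, 2, 3], "b": ["one", "two", "one", "two"], "c": [5, 6, 7, 8]}
)

# Columns "b" and "c" become index levels and leave the column set
indexed = frame.set_index(["b", "c"])

# With drop=False the same columns are also kept as regular columns
kept = frame.set_index(["b", "c"], drop=False)

# reset_index turns the index levels back into columns
restored = indexed.reset_index()

print(indexed)
print(kept)
print(restored)
```

Note that `reset_index` places the former index levels at the front of the column list, so the column order of `restored` differs from the original `frame`.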

### Combining & Merging Datasets

#### Database-style joins
#### Concatenation
#### Combining data with overlap



### Reshaping & Pivoting

#### Reshaping with Hierarchical Indexing
#### Pivoting 'long' to 'wide'
#### Pivoting 'wide' to 'long'

### Grouping Data
#### GroupBy Operations
#### Iterating over groups
#### Grouping with Series or Dicts
#### Grouping with Functions
#### Grouping by Index Levels

### Data Aggregation

#### Column-wise & Multiple-function application
#### Aggregated data without row indexes


### General split-apply-combine

#### Suppressing Group Keys
#### Quantile & Bucket Analysis
#### Filling missing values
#### Random sampling & permutation
#### Group weighted average and correlation

### Pivot Tables & Cross-Tabulation

#### Pivot Tables
#### Cross-Tabulations
