# `pandas` - Reshaping Data and Pivot Tables

__Contents__
1. Setup
1. Reshaping data by pivoting, melting, stacking and unstacking
1. Pivot tables

## Reference
Related/useful documentation:
- http://pandas.pydata.org/pandas-docs/stable/index.html
- https://pandas.pydata.org/pandas-docs/stable/dsintro.html
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html

## 1. Setup

Load libraries

In [6]:
import pandas  as pd
import numpy   as np
(pd.__version__,
 np.__version__
)

Load a DataFrame from the `imports-85.csv` CSV file. Set the column names.

In [8]:
column_names = ['symboling', 'normalized-losses', 'make', 'fuel-type',
                'aspiration', 'num-of-doors', 'body-style', 'drive-wheels',
                'engine-location', 'wheel-base', 'length', 'width',
                'height', 'curb-weight', 'engine-type', 'num-of-cylinders',
                'engine-size', 'fuel-system', 'bore', 'stroke',
                'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
                'highway-mpg', 'price']
import_df = pd.read_csv('/dbfs/mnt/datalab-datasets/file-samples/imports-85.csv',
                        names=[string.replace('-','_') for string in column_names],
                        na_values=['?']
                       )
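
The effect of the `names` and `na_values` arguments can be checked on a tiny in-memory sample. This is a minimal sketch: the two CSV rows and the shortened column list are made up for illustration and are not read from `imports-85.csv`.

```python
import io
import pandas as pd

# Two hypothetical rows in the same layout as the first four columns.
raw = io.StringIO("3,?,alfa-romero,gas\n1,164,audi,gas\n")
short_names = ['symboling', 'normalized-losses', 'make', 'fuel-type']

sample = pd.read_csv(raw,
                     names=[name.replace('-', '_') for name in short_names],
                     na_values=['?'])   # treat '?' as a missing value

print(sample.columns.tolist())
print(sample['normalized_losses'].isna().tolist())
```

Replacing the hyphens with underscores produces column names that are valid Python identifiers, and `'?'` entries are read as `NaN`, so the column can be parsed as numeric.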

Display basic information about each column of the DataFrame.

In [10]:
import_df.info()

Create another sample DataFrame `df` and display the first 5 rows.

In [12]:
import datetime
np.random.seed(0)
df = pd.DataFrame({'item': ['A']*6 + ['B']*6 + ['C']*6 + ['D']*6,
                    'quantity': np.random.randint(1000,size=24),
                    'value': np.random.randn(24),
                    'date': [datetime.datetime(2013, i, 1) for i in range(1, 7)]*4
                   })
df.head()

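Create a second sample DataFrame `icecream_sales`, holding the daily quantity and profit per flavor, and display the first 5 rows.

In [13]: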
import datetime
np.random.seed(1)
icecream_sales = pd.DataFrame({'flavor': ['Chocolate']*6 + ['Vanilla']*6 + ['Cookie Dough']*6 + ['Green Tea']*6,
                    'quantity': np.random.randint(10,size=24)+1,
                    'profit': 1.5*np.random.randint(1, 6, size=24),  # integers 1..5; random_integers() is deprecated
                    'date': [datetime.datetime(2018, 4, i) for i in range(1, 7)]*4
                   })
icecream_sales.head()

## 2. Reshaping data

### `pivot()` method

The `pivot()` method of a DataFrame reshapes it into a new table, using the unique values of chosen columns as row and column labels. It takes 3 arguments:
- `index` 
- `columns` 
- `values`

Note that `pivot()` cannot aggregate: if any combination of `index` and `columns` values occurs more than once, the method raises a `ValueError`.
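
The uniqueness requirement can be seen on a small example. This is a minimal sketch with a made-up `sales` table (hypothetical, not the `icecream_sales` data) in which the pair (`2018-04-01`, `Vanilla`) appears twice.

```python
import pandas as pd

# Hypothetical table with a duplicated (date, flavor) pair.
sales = pd.DataFrame({
    'date': ['2018-04-01', '2018-04-01', '2018-04-01'],
    'flavor': ['Vanilla', 'Vanilla', 'Chocolate'],
    'quantity': [3, 5, 7],
})

try:
    sales.pivot(index='date', columns='flavor', values='quantity')
    pivot_failed = False
except ValueError as err:
    # pivot() has no way to decide between 3 and 5 for the same cell.
    pivot_failed = True
    print(err)
```

When such duplicates are expected, an aggregating reshape such as `pivot_table()` is the usual alternative.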

Use the `pivot()` method to reshape the data into a time series format, with one row per date and one column per flavor.

In [18]:
pivot_df = icecream_sales.pivot(index='date', columns='flavor', values='quantity')
pivot_df

Pivoting multiple value columns. If the `values` parameter is not specified, `pivot()` uses all remaining columns and the result has hierarchically indexed columns.

In [20]:
icecream_sales.pivot(index='date',columns='flavor')
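
Hierarchically indexed columns can be selected either by their top level or by a full `(value, flavor)` tuple. The following is a minimal sketch on a made-up four-row table; `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the ice cream data.
toy = pd.DataFrame({
    'date': ['d1', 'd1', 'd2', 'd2'],
    'flavor': ['Vanilla', 'Chocolate', 'Vanilla', 'Chocolate'],
    'quantity': [1, 2, 3, 4],
    'profit': [1.5, 3.0, 4.5, 6.0],
})

wide = toy.pivot(index='date', columns='flavor')

# Top-level selection returns one sub-table per original value column.
quantity_only = wide['quantity']

# A tuple addresses a single column of the two-level column index.
vanilla_profit = wide[('profit', 'Vanilla')]

print(wide.columns.nlevels)
print(quantity_only)
print(vanilla_profit)
```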

### `melt()` function

The `melt()` function and the `melt()` method of a DataFrame are useful to transform a DataFrame from wide to long format.

Move `date` out of the index and back into a regular column, so that `pivot_df` has a default numeric index.

In [24]:
pivot_df.reset_index(inplace = True)
pivot_df

Use the `melt()` function to reshape the `pivot_df` DataFrame. In the code cell below, `melt()` takes:
- `pivot_df`: the DataFrame to reshape
- `id_vars`: the column(s) to use as identifier variable(s)
- `var_name`: the name to use for the ‘variable’ column
- `value_name`: the name to use for the ‘value’ column

In [26]:
pd.melt(pivot_df,id_vars=['date'],var_name='flavor',value_name='sales')
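
To make the wide-to-long transformation concrete, here is a minimal sketch on a made-up two-row table (`wide` and its values are hypothetical). It also shows that the `melt()` method of a DataFrame gives the same result as the `pd.melt()` function.

```python
import pandas as pd

# Hypothetical wide table: one row per date, one column per flavor.
wide = pd.DataFrame({
    'date': ['d1', 'd2'],
    'Chocolate': [2, 4],
    'Vanilla': [1, 3],
})

# Function form: every non-id column becomes (flavor, sales) rows.
long_fn = pd.melt(wide, id_vars=['date'], var_name='flavor', value_name='sales')

# Method form of the same call.
long_method = wide.melt(id_vars=['date'], var_name='flavor', value_name='sales')

print(long_fn)
```

Each of the two flavor columns contributes one row per date, so the two-row wide table becomes a four-row long table.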

### `stack()` & `unstack()` methods

The `stack()` and `unstack()` methods are designed to work together with MultiIndex objects. Create a DataFrame `df_multi` with a MultiIndex from the `icecream_sales` DataFrame by setting `date` and `flavor` as the index and sorting it.

In [29]:
df_multi = icecream_sales.set_index(['date','flavor']).sort_index()
df_multi.head(8)

The `stack()` method “compresses” a level of `df_multi`’s columns into the rows. The stacked level becomes the new innermost level of the MultiIndex on the rows:

In [31]:
df_multi.stack()\
        .rename_axis(['date','flavor','type'],axis=0)\
        .head(16)
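
The effect of `stack()` on the shape of the data can be checked on a small example. This is a minimal sketch with a made-up table (`toy` is hypothetical): a DataFrame with 4 rows and 2 columns becomes a Series with 8 entries and one extra index level.

```python
import pandas as pd

# Hypothetical miniature of df_multi: a (date, flavor) MultiIndex on the rows.
toy = pd.DataFrame({
    'date': ['d1', 'd1', 'd2', 'd2'],
    'flavor': ['Vanilla', 'Chocolate', 'Vanilla', 'Chocolate'],
    'quantity': [1, 2, 3, 4],
    'profit': [1.5, 3.0, 4.5, 6.0],
}).set_index(['date', 'flavor']).sort_index()

# The column labels move into a new innermost level of the row index.
stacked = toy.stack().rename_axis(['date', 'flavor', 'type'])

print(type(stacked).__name__, stacked.index.nlevels, len(stacked))
print(stacked)
```

Because `quantity` holds integers and `profit` holds floats, the stacked values share a single floating-point dtype.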

The inverse operation of `stack()` is `unstack()`, which by default unstacks the innermost (last) index level:

In [33]:
df_multi.stack()\
        .rename_axis(['date','flavor','type'],axis=0)\
        .unstack()\
        .head(8)
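
Besides the default last level, `unstack()` also accepts a level name. The following minimal sketch, again on a made-up table (`toy` is hypothetical), contrasts the default with unstacking the `flavor` level.

```python
import pandas as pd

toy = pd.DataFrame({
    'date': ['d1', 'd1', 'd2', 'd2'],
    'flavor': ['Vanilla', 'Chocolate', 'Vanilla', 'Chocolate'],
    'quantity': [1, 2, 3, 4],
    'profit': [1.5, 3.0, 4.5, 6.0],
}).set_index(['date', 'flavor']).sort_index()

stacked = toy.stack().rename_axis(['date', 'flavor', 'type'])

# Default: the last level ('type') moves back into the columns.
by_type = stacked.unstack()

# By name: the 'flavor' level moves into the columns instead.
by_flavor = stacked.unstack('flavor')

print(by_type)
print(by_flavor)
```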

The `stack()` or `unstack()` method can be applied to more than one level at a time by passing a list of levels. In the code cell below we pass a list of index level names to the `unstack()` method.

In [35]:
df_multi.stack()\
        .rename_axis(['date','flavor','type'],axis=0)\
        .unstack(['flavor','type'])
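
Unstacking several levels at once produces columns with one level per unstacked name. This is a minimal sketch on a made-up table (`toy` is hypothetical).

```python
import pandas as pd

toy = pd.DataFrame({
    'date': ['d1', 'd1', 'd2', 'd2'],
    'flavor': ['Vanilla', 'Chocolate', 'Vanilla', 'Chocolate'],
    'quantity': [1, 2, 3, 4],
    'profit': [1.5, 3.0, 4.5, 6.0],
}).set_index(['date', 'flavor']).sort_index()

stacked = toy.stack().rename_axis(['date', 'flavor', 'type'])

# Both 'flavor' and 'type' move to the columns; only 'date' stays on the rows.
wide = stacked.unstack(['flavor', 'type'])

print(wide.columns.names)
print(wide)
```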

## 3. Pivot tables

While `pivot()` provides general purpose pivoting with various data types (strings, numerics, etc.), pandas also provides `pivot_table()` for pivoting with aggregation of numeric data. The `pivot_table()` function works like `pivot()`, but it aggregates the values from rows with duplicate entries for the specified columns.

Create a pivot table from a DataFrame with `pd.pivot_table()`:
- The `data` argument takes a DataFrame object.
- The `index` and `columns` arguments take categorical variables, which may have duplicate values in the DataFrame.
- The `values` argument takes the variable(s) to aggregate.
- The `aggfunc` argument takes the function to use for aggregation, defaulting to the mean.
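
The arguments listed above can be tried on a small table where the aggregation is easy to verify by hand. This is a minimal sketch: `cars` and its values are made up, and the aggregation is given as the string `'mean'`.

```python
import pandas as pd

# Hypothetical data: the (sedan, fwd) combination occurs twice.
cars = pd.DataFrame({
    'body_style': ['sedan', 'sedan', 'sedan', 'hatchback', 'hatchback'],
    'drive_wheels': ['fwd', 'fwd', 'rwd', 'fwd', 'rwd'],
    'horsepower': [100, 120, 150, 80, 130],
})

table = pd.pivot_table(data=cars,
                       index='body_style',
                       columns='drive_wheels',
                       values='horsepower',
                       aggfunc='mean')

# The two (sedan, fwd) rows are averaged: (100 + 120) / 2.
print(table)
```

The same call with `pivot()` would fail on this data, because the (`sedan`, `fwd`) pair is not unique.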

Create a pivot table from the `import_df` DataFrame. In the pivot table we calculate the average `horsepower` for each `body_style` according to the type of `drive_wheels`.

In [40]:
pd.pivot_table(data=import_df, index='body_style', columns='drive_wheels', values='horsepower', aggfunc=np.mean)

Below, create a pivot table from the `import_df` DataFrame that calculates the average `price` for each `body_style`. The result object `res` is a DataFrame with a single `price` column. The values are rounded to zero decimals using the `np.round()` function.

In [42]:
res = pd.pivot_table(import_df, 
                     values ='price', 
                     index='body_style',
                     aggfunc=np.mean)
np.round(res)
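
A pivot table with only an `index` behaves like a group-by aggregation. The following minimal sketch on made-up data (`cars` is hypothetical) compares it with an equivalent `groupby()` call.

```python
import pandas as pd

# Hypothetical prices for three body styles.
cars = pd.DataFrame({
    'body_style': ['sedan', 'sedan', 'hatchback', 'wagon'],
    'price': [10000.0, 14000.0, 8000.0, 12000.0],
})

res = pd.pivot_table(cars, values='price', index='body_style', aggfunc='mean')

# The same averages, computed with groupby on the index column.
via_groupby = cars.groupby('body_style')[['price']].mean()

print(type(res).__name__)
print(res)
print(via_groupby)
```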

Create a pivot table from the `import_df` DataFrame. The result object `res` is a DataFrame with a hierarchical index on the rows. Display missing values as empty strings by passing `na_rep=''` to the `to_string()` method.

In [44]:
res = pd.pivot_table(import_df, 
                     values ='price', 
                     columns='body_style', 
                     index  =['make','drive_wheels'], 
                     aggfunc=np.mean)\
        .round()

print(res.to_string(na_rep=''))
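
The role of `na_rep` is easiest to see on a table with a known gap. This is a minimal sketch on made-up data (`cars` is hypothetical), in which no `audi` wagon exists.

```python
import pandas as pd

cars = pd.DataFrame({
    'make': ['audi', 'audi', 'bmw', 'bmw'],
    'drive_wheels': ['fwd', '4wd', 'rwd', 'rwd'],
    'body_style': ['sedan', 'sedan', 'sedan', 'wagon'],
    'price': [15000.0, 17000.0, 21000.0, 23000.0],
})

res = pd.pivot_table(cars,
                     values='price',
                     index=['make', 'drive_wheels'],
                     columns='body_style',
                     aggfunc='mean')

default_text = res.to_string()            # missing cells are shown as NaN
blank_text = res.to_string(na_rep='')     # missing cells are left blank

print(default_text)
print(blank_text)
```

Only the printed text changes: the missing values are still present in `res` itself.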

Note that `pivot_table` is also available as an instance method on DataFrame, i.e. `DataFrame.pivot_table()`.

Create a pivot table from `import_df` by calling the `pivot_table` method. 
Since the `values` parameter is not given, the data is grouped by the `make` variable and all remaining numeric columns are averaged, as the mean is the default aggregation.

In [47]:
import_df.pivot_table(index='make').round()

Note that the following cell produces the same output as the one above.

In [49]:
pd.pivot_table(import_df, index='make').round()
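
The equivalence of the method and function forms can be checked directly. This is a minimal sketch on a made-up table containing only numeric columns besides `make` (`cars` is hypothetical).

```python
import pandas as pd

cars = pd.DataFrame({
    'make': ['audi', 'audi', 'bmw'],
    'horsepower': [100, 120, 180],
    'price': [15000.0, 17000.0, 21000.0],
})

by_method = cars.pivot_table(index='make')       # DataFrame method
by_function = pd.pivot_table(cars, index='make')  # top-level function

print(by_method)
print(by_method.equals(by_function))
```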

__The End__