# `pandas` - Summarization and Grouping

__Contents__: 
1. Setup
1. Grouping data
1. Aggregation with the `agg()` method
1. Other aggregating functions

## Reference
- http://pandas.pydata.org/pandas-docs/stable/index.html
- https://pandas.pydata.org/pandas-docs/stable/dsintro.html
- https://pandas.pydata.org/pandas-docs/stable/groupby.html

## 1. Setup

Load the libraries.

In [6]:
import pandas  as pd
import numpy   as np
(pd.__version__,
 np.__version__
)

Load the DataFrame from the `imports-85.csv` CSV file. Set the column names.

In [8]:
column_names = ['symboling', 'normalized-losses', 'make', 'fuel-type',
                'aspiration', 'num-of-doors', 'body-style', 'drive-wheels',
                'engine-location', 'wheel-base', 'length', 'width',
                'height', 'curb-weight', 'engine-type', 'num-of-cylinders',
                'engine-size', 'fuel-system', 'bore', 'stroke',
                'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
                'highway-mpg', 'price']
import_df = pd.read_csv('/dbfs/mnt/datalab-datasets/file-samples/imports-85.csv',
                        names=[string.replace('-','_') for string in column_names],
                        na_values=['?']
                       )
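As a minimal sketch of what the `names` and `na_values` arguments do, the cell below reads a two-row, headerless sample from memory instead of the real file. The sample text and the shortened `raw_names` list are made up for illustration only.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same headerless style as imports-85.csv
raw = "3,?,alfa-romero,gas\n1,164,audi,gas\n"
raw_names = ['symboling', 'normalized-losses', 'make', 'fuel-type']

sample_df = pd.read_csv(io.StringIO(raw),
                        # replace '-' with '_' so names are valid identifiers
                        names=[name.replace('-', '_') for name in raw_names],
                        # treat '?' as a missing value (NaN)
                        na_values=['?'])
print(sample_df)
print(sample_df.dtypes)
```

Because `'?'` is converted to `NaN`, the `normalized_losses` column is parsed as numeric rather than as text.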

In [9]:
import_df.head()

Display basic information about each column of the DataFrame.

In [11]:
import_df.info()

## 2. Grouping data

Pandas objects can be split into groups on any of their axes. To create a GroupBy object (more on the GroupBy object later), use one of the following:
- `grouped = obj.groupby(key)`
- `grouped = obj.groupby(key, axis=1)` (the default is `axis=0`; the `axis` argument is deprecated in recent pandas releases)
- `grouped = obj.groupby([key1, key2])`

Here `obj` is a pandas DataFrame object.
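The single-key and multi-key forms above can be sketched on a small made-up DataFrame (`toy_df` and its values are illustrative, not taken from the imports data):

```python
import pandas as pd

# Hypothetical toy data standing in for the imports DataFrame
toy_df = pd.DataFrame({
    'make':       ['audi', 'audi', 'bmw', 'bmw', 'bmw'],
    'body_style': ['sedan', 'wagon', 'sedan', 'sedan', 'wagon'],
    'city_mpg':   [24, 19, 23, 21, 16],
})

by_make = toy_df.groupby('make')                          # one key
by_make_and_body = toy_df.groupby(['make', 'body_style'])  # two keys

print(by_make.ngroups)           # number of distinct makes
print(by_make_and_body.ngroups)  # number of distinct (make, body_style) pairs

# Iterating over a GroupBy object yields (key, sub-DataFrame) pairs
for key, group in by_make:
    print(key, len(group))
```

Creating the GroupBy object does not compute anything yet; the split is only materialized when a group is requested or a summary is applied.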

Create a GroupBy object by calling the `groupby()` method of the DataFrame. The cell below groups by the `make` column.

In [15]:
grouped = import_df.groupby('make')
grouped

A single group can be selected using `get_group()`. The following code cell displays all the records in the DataFrame where `make` is `'toyota'`. Recall that `make` was specified above as the "group by" variable.

In [17]:
grouped.get_group('toyota')

The code cell below groups by multiple columns and displays records that meet the conditions of `make='toyota'` and `body_style='sedan'`.

In [19]:
import_df.groupby(['make','body_style']).get_group(('toyota','sedan'))
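To make the behaviour of `get_group()` concrete, this sketch compares it with an ordinary boolean filter on a small made-up DataFrame (`toy_df` is hypothetical). With several grouping columns, the group key is passed as a tuple.

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    'make':       ['toyota', 'toyota', 'bmw', 'toyota'],
    'body_style': ['sedan', 'wagon', 'sedan', 'sedan'],
    'city_mpg':   [27, 24, 21, 29],
})

# One grouping column: the key is a scalar
toyota_rows = toy_df.groupby('make').get_group('toyota')

# Two grouping columns: the key is a tuple in the same order as the columns
toyota_sedans = toy_df.groupby(['make', 'body_style']).get_group(('toyota', 'sedan'))

# The same rows selected with boolean masks
mask_rows = toy_df[toy_df['make'] == 'toyota']
mask_sedans = toy_df[(toy_df['make'] == 'toyota') & (toy_df['body_style'] == 'sedan')]

print(toyota_rows.index.tolist(), mask_rows.index.tolist())
print(toyota_sedans.index.tolist(), mask_sedans.index.tolist())
```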

## 3. Aggregation with the `agg()` method

Once the GroupBy object has been created, several methods are available to perform a computation on the grouped data.
The `aggregate()` method, or equivalently `agg()`, is the most general way to summarize grouped dataframes.
The following cells in this section demonstrate it.

Select the `make`, `body_style`, `city_mpg` and `highway_mpg` columns, group by the `make` and `body_style` columns, and apply the `np.mean` function to the remaining columns.

In [23]:
import_df \
  .loc[:,['make','body_style','city_mpg','highway_mpg']] \
  .groupby(['make','body_style']) \
  .agg(np.mean)
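As a small check that `agg()` and `aggregate()` are interchangeable, the sketch below applies both to a made-up DataFrame and compares the result with the dedicated `mean()` method. It passes the function as the string `'mean'`, which is assumed here to be the safer spelling on recent pandas releases; `toy_df` is hypothetical.

```python
import pandas as pd

# Hypothetical toy data with one grouping column and two numeric columns
toy_df = pd.DataFrame({
    'make':        ['audi', 'audi', 'bmw', 'bmw', 'bmw'],
    'city_mpg':    [24, 19, 23, 21, 16],
    'highway_mpg': [30, 25, 29, 28, 22],
})

grouped = toy_df.groupby('make')

via_agg = grouped.agg('mean')              # short name
via_aggregate = grouped.aggregate('mean')  # long name, same method
via_method = grouped.mean()                # dedicated GroupBy method

print(via_agg)
print(via_agg.equals(via_aggregate), via_agg.equals(via_method))
```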

Apply multiple functions to all non-grouping columns by passing a list of functions to the `agg` or `aggregate` method.

In [25]:
import_df \
  .loc[:,['make','body_style','city_mpg','highway_mpg']] \
  .groupby(['make','body_style']) \
  .agg([np.mean, np.max, np.min]) \
  .head()
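Passing a list of functions produces two-level column labels of the form `(column, function)`. The following sketch, on made-up data, shows how one such column can be selected and one possible way to flatten the labels; the joined names such as `city_mpg_mean` are just an illustrative choice.

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    'make':        ['audi', 'audi', 'bmw', 'bmw', 'bmw'],
    'city_mpg':    [24, 19, 23, 21, 16],
    'highway_mpg': [30, 25, 29, 28, 22],
})

summary = toy_df.groupby('make').agg(['mean', 'max', 'min'])

# Columns are (original column, function name) pairs
print(summary.columns.tolist())

# Select a single summary column with a tuple
print(summary[('city_mpg', 'max')])

# Optionally flatten the two levels into single strings
flat = summary.copy()
flat.columns = ['_'.join(pair) for pair in summary.columns]
print(flat.columns.tolist())
```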

Apply specific functions to specific dataframe columns by passing a dict to the `agg` or `aggregate` method.

In [27]:
import_df \
  .loc[:,['make','body_style','city_mpg','highway_mpg']] \
  .groupby(['make','body_style']) \
  .agg({'city_mpg'   : 'mean',
        'highway_mpg': lambda x: np.mean(x),
         }) \
  .head()

Notice that a summary function can be specified as a string (such as `'mean'`) or as an anonymous function.
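An anonymous function is most useful for a summary that has no built-in name. As an illustration on made-up data, the sketch below computes the mean of one column and the range (maximum minus minimum) of another; `toy_df` and the choice of range are hypothetical.

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    'make':        ['audi', 'audi', 'bmw', 'bmw', 'bmw'],
    'city_mpg':    [24, 19, 23, 21, 16],
    'highway_mpg': [30, 25, 29, 28, 22],
})

summary = toy_df.groupby('make').agg({
    'city_mpg':    'mean',                       # built-in, named by string
    'highway_mpg': lambda s: s.max() - s.min(),  # custom: range within each group
})
print(summary)
```

With one function per column, the result keeps the original column names, so it is worth remembering which summary each column now holds.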

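To name the output columns, use a nested dict (as in the following example):
- Keys of the outer dictionary name existing columns in the grouped dataframe.
- Keys of the inner dictionary name the summary columns of the resulting dataframe.

Values of the inner dictionary can be strings, NumPy functions or lambda functions. Note that this nested-dict form was deprecated in pandas 0.20 and raises a `SpecificationError` from pandas 1.0 onward, so the example in this section runs only on older releases.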

In [30]:
import_df.columns

In [31]:
import_df \
  .groupby(['engine_type']) \
  .agg({'width'   : {'mean': np.mean,
                     'min' : np.min,
                     'max' : np.max
                    },
        'length'  : {'mean': 'mean',
                     'min' : np.min,
                     'max' : lambda x: np.max(x)}
       })
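On pandas 0.25 or later, output columns can instead be named with keyword arguments to `agg()`, where each keyword is the new column name and each value is a `(column, function)` pair. This is a sketch of that alternative on made-up data, not the approach used in the cell above; the output names (`width_mean` and so on) and the values in `toy_df` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    'engine_type': ['ohc', 'ohc', 'dohc', 'dohc'],
    'width':       [64.0, 66.0, 65.0, 68.0],
    'length':      [170.0, 176.0, 172.0, 180.0],
})

summary = toy_df.groupby('engine_type').agg(
    width_mean=('width', 'mean'),
    width_max=('width', 'max'),
    length_min=('length', 'min'),
    length_range=('length', lambda s: s.max() - s.min()),
)
print(summary)
```

Unlike the nested-dict form, this produces a flat set of column names rather than two-level labels.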

## 4. Other aggregating functions

Other aggregation functions are included as GroupBy methods. Some commonly used aggregating functions are:
- `mean()`: compute the mean of each group
- `sum()`: compute the sum of group values
- `size()`: compute group sizes
- `std()`: compute the standard deviation of each group
- `sem()`: compute the standard error of the mean of each group
- `describe()`: generate descriptive statistics
- `min()`: compute the minimum of group values
- `max()`: compute the maximum of group values

The following cells in this section are some examples showing the use of those aggregation functions.

In each case the summary function is applied to all non-grouping columns of the grouped dataframe.
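As a rough illustration of how some of these methods relate, the sketch below checks on made-up data that `sem()` equals `std()` divided by the square root of the group size. This holds here because the toy data has no missing values; `toy_df` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data without missing values
toy_df = pd.DataFrame({
    'make':        ['audi', 'audi', 'bmw', 'bmw', 'bmw'],
    'city_mpg':    [24, 19, 23, 21, 16],
    'highway_mpg': [30, 25, 29, 28, 22],
})

grouped = toy_df.groupby('make')

sizes = grouped.size()  # Series: rows per group
stds = grouped.std()    # DataFrame: sample standard deviation per group
sems = grouped.sem()    # DataFrame: standard error of the mean per group

# Divide each row of stds by sqrt(group size), aligning on the group index
manual_sem = stds.div(np.sqrt(sizes), axis=0)

print(sems)
print(np.allclose(sems, manual_sem))
```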

Display the mean of `city_mpg` and `highway_mpg` for each `make`.

In [35]:
import_df[['make','city_mpg','highway_mpg']].groupby('make').mean()

Display the mean of `city_mpg` and `highway_mpg` for each unique combination of `make` and `body_style`. Sort the result by `city_mpg` in descending order.

In [37]:
import_df[['make','body_style','city_mpg','highway_mpg']] \
  .groupby(['make','body_style'])\
  .mean()\
  .sort_values(by='city_mpg',ascending=False)
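A small sketch on made-up data shows the shape of such a result: the grouping columns become a two-level row index, which `reset_index()` can turn back into ordinary columns if that is more convenient. `toy_df` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    'make':       ['audi', 'audi', 'bmw', 'bmw', 'bmw'],
    'body_style': ['sedan', 'wagon', 'sedan', 'sedan', 'wagon'],
    'city_mpg':   [24, 19, 23, 21, 16],
})

mean_mpg = (toy_df
            .groupby(['make', 'body_style'])
            .mean()
            .sort_values(by='city_mpg', ascending=False))

# Row labels are (make, body_style) tuples, highest city_mpg first
print(mean_mpg.index.tolist())

# Move the grouping keys back into regular columns
print(mean_mpg.reset_index())
```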

Display the number of records in each unique combination of `make` and `body_style` in the dataframe.

In [39]:
import_df[['make','body_style','city_mpg','highway_mpg']]\
       .groupby(['make','body_style'])\
       .size()
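One detail worth keeping in mind is that `size()` counts every row in a group, while the related `count()` method counts only non-missing values per column. The sketch below shows the difference on made-up data containing one `NaN`; `toy_df` and the column name `n_records` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing city_mpg value
toy_df = pd.DataFrame({
    'make':     ['audi', 'audi', 'bmw'],
    'city_mpg': [24, np.nan, 23],
})

grouped = toy_df.groupby('make')

sizes = grouped.size()    # Series: rows per group, NaN rows included
counts = grouped.count()  # DataFrame: non-missing values per column

print(sizes)
print(counts)

# size() returns a Series; reset_index(name=...) turns it into a DataFrame
print(sizes.reset_index(name='n_records'))
```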

Display the descriptive statistics of `city_mpg` and `highway_mpg` for every value of the `make` column.

In [41]:
import_df[['make','city_mpg','highway_mpg']]\
  .groupby(['make'])\
  .describe()
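The result of `describe()` on a GroupBy object has two-level columns: the original column name on top and the statistic name below. As a sketch on made-up data (`toy_df` is hypothetical), a single statistic can be pulled out like this:

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    'make':     ['audi', 'audi', 'bmw', 'bmw', 'bmw'],
    'city_mpg': [24, 19, 23, 21, 16],
})

described = toy_df.groupby('make').describe()

# Statistics reported for the city_mpg column
print(described['city_mpg'].columns.tolist())

# Pick one statistic for one column with a (column, statistic) tuple
print(described[('city_mpg', 'mean')])
```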

__The End__