# Using the `axis` options when working with `DataFrame`

In [2]:
import numpy as np
import pandas as pd

## `DataFrame` used for tests

Number of births for each name and year, with an imbalance between women and men so the operations are easier to follow (if the F and M data were the same, it would be harder to see what each operation does).

In [56]:
births = pd.DataFrame(np.array([
    [1970, 'James', 'M', 1],
    [1970, 'Mary', 'F', 2],
    [1970, 'John', 'M', 3],

    [1971, 'James', 'M', 10],
    [1971, 'Mary', 'F', 20],
    [1971, 'John', 'M', 30],

    ]), columns=('year', 'name', 'genre', 'births'))

births = births.astype({'year': 'int', 'births': 'int'})

# This will be important for calculations later
# It lets Pandas align the elements during those calculations
births.set_index('year', inplace=True)

births

| year | name | genre | births |
|---|---|---|---|
| 1970 | James | M | 1 |
| 1970 | Mary | F | 2 |
| 1970 | John | M | 3 |
| 1971 | James | M | 10 |
| 1971 | Mary | F | 20 |
| 1971 | John | M | 30 |


## Group

Calculate number of births by genre in each year.

In [57]:
births_by_genre_year = births.groupby(['year','genre']).sum()
births_by_genre_year

| year | genre | births |
|---|---|---|
| 1970 | F | 2 |
| 1970 | M | 4 |
| 1971 | F | 20 |
| 1971 | M | 40 |


This result is correct, but the long format is not the easiest way to read the numbers. Move `genre` from the index to the columns with `unstack()`:

In [58]:
births_by_genre_year = births_by_genre_year.unstack()
births_by_genre_year

| | births | births |
|---|---|---|
| genre | F | M |
| year | | |
| 1970 | 2 | 4 |
| 1971 | 20 | 40 |
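As a small sketch of what `unstack()` does with the index levels: by default it moves the innermost level (`genre` here) to the columns, and a level name can be passed to pick another one. The `grouped` frame below is rebuilt by hand with the same numbers, so the snippet runs on its own; the variable names are only for illustration.

```python
import pandas as pd

# Hand-built stand-in for the grouped result (same numbers as above)
grouped = pd.DataFrame(
    {"births": [2, 4, 20, 40]},
    index=pd.MultiIndex.from_tuples(
        [(1970, "F"), (1970, "M"), (1971, "F"), (1971, "M")],
        names=["year", "genre"],
    ),
)

wide = grouped.unstack()                 # innermost level ('genre') goes to the columns
wide_by_name = grouped.unstack("genre")  # same thing, naming the level explicitly
by_year = grouped.unstack("year")        # move 'year' to the columns instead

print(wide)
print(by_year)
```

In both cases the moved level ends up below the existing `births` column label, which is why the result has two-level columns.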


Pivot table version - do it all in one shot.

In [59]:
births.pivot_table('births', index='year', columns='genre',
                   aggfunc='sum')

| genre | F | M |
|---|---|---|
| year | | |
| 1970 | 2 | 4 |
| 1971 | 20 | 40 |
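To check that the pivot table and the `groupby` plus `unstack` route agree, the sketch below builds the same test data and compares the two. Selecting the `births` column before summing is a choice made here (not in the cells above) so that the columns come out single-level, like the pivot table.

```python
import pandas as pd

# Same test data as above, built from a dict so the dtypes are right from the start
births = pd.DataFrame(
    {
        "year": [1970, 1970, 1970, 1971, 1971, 1971],
        "name": ["James", "Mary", "John"] * 2,
        "genre": ["M", "F", "M"] * 2,
        "births": [1, 2, 3, 10, 20, 30],
    }
).set_index("year")

via_pivot = births.pivot_table("births", index="year", columns="genre", aggfunc="sum")

# Group, sum only the 'births' column, then move 'genre' to the columns
via_groupby = births.groupby(["year", "genre"])["births"].sum().unstack()

print(via_pivot)
print(via_groupby)
```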


## Slicing

Picking the `DataFrame` apart by addressing specific pieces of it.

Number of births for a specific year and genre.

In [93]:
births_by_genre_year.loc[1970]['births']['M']

4

Number of births for a specific genre, all years.

In [104]:
births_by_genre_year.loc[:]['births']['M']

year
1970     4
1971    40
Name: M, dtype: int64
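The same slices can also be written in a single `.loc` call, using a tuple to address the two-level column, or with `xs` to select by level name. This is a sketch on a hand-built copy of the unstacked frame (named `by_genre_year` here for illustration), not a replacement for the cells above.

```python
import pandas as pd

# Hand-built copy of the unstacked frame: two-level columns (births, F) and (births, M)
by_genre_year = pd.DataFrame(
    [[2, 4], [20, 40]],
    index=pd.Index([1970, 1971], name="year"),
    columns=pd.MultiIndex.from_tuples(
        [("births", "F"), ("births", "M")], names=[None, "genre"]
    ),
)

one_cell = by_genre_year.loc[1970, ("births", "M")]       # one year, one genre
one_column = by_genre_year.loc[:, ("births", "M")]        # one genre, all years
males = by_genre_year.xs("M", axis=1, level="genre")      # select by column level name

print(one_cell)
print(one_column)
print(males)
```

A single `.loc` call avoids chaining several `[]` lookups, which matters mostly when assigning values rather than reading them.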

## Aggregations - transform and augment data

**VERY IMPORTANT CONCEPT**: the axis specifies the dimension that will be collapsed (i.e. operated on and consolidated).

If we ask to perform an operation on rows, the rows are collapsed and aggregated by the specified operation.
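A quick way to see which dimension disappears is to use a frame that is not square and look at the shapes. The sketch below arranges the test numbers as years (rows) by names (columns); this layout and the variable names are assumptions made only for this illustration.

```python
import pandas as pd

# 2 rows (years) x 3 columns (names), values taken from the test data
by_name = pd.DataFrame(
    {"James": [1, 10], "Mary": [2, 20], "John": [3, 30]},
    index=pd.Index([1970, 1971], name="year"),
)

per_name = by_name.sum(axis=0)  # the 2 rows collapse: one value per column
per_year = by_name.sum(axis=1)  # the 3 columns collapse: one value per row

print(by_name.shape, per_name.shape, per_year.shape)  # (2, 3) (3,) (2,)
```

The result is labelled by whatever axis was left in place: names for `axis=0`, years for `axis=1`.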

### By rows: no axis, axis=0, axis='rows'

Collapse the rows and leave the columns in place, i.e. aggregate (here, sum) the values of all rows in each column.

In [73]:
# These are equivalent
births_by_genre_year.sum().unstack()
births_by_genre_year.sum(axis=0).unstack()
births_by_genre_year.sum(axis='rows').unstack()

| genre | F | M |
|---|---|---|
| births | 22 | 44 |


### By columns: axis=1, axis='columns'

Collapse the columns and leave the rows in place, i.e. aggregate (here, sum) the values of all columns in each row.

In [69]:
# These are equivalent
births_by_genre_year.sum(axis=1)
births_by_genre_year.sum(axis='columns')

year
1970     6
1971    60
dtype: int64
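One caveat worth a sketch: in element-wise methods such as `div`, the `axis` argument does not collapse anything. It names the axis whose labels the other operand is matched against. The example below divides each row by its row total; the frame is a hand-built, single-level version of the F/M table, used here as an assumption to keep the snippet short.

```python
import pandas as pd

# Births by year (rows) and genre (columns), same numbers as the table above
df = pd.DataFrame(
    {"F": [2, 20], "M": [4, 40]},
    index=pd.Index([1970, 1971], name="year"),
)

totals = df.sum(axis=1)             # collapse the columns: one total per year
pct = df.div(totals, axis=0) * 100  # match 'totals' on the row labels (years)

print(pct)
```

With `axis=0`, the yearly totals line up with the rows, so each row of `pct` adds up to 100.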

### `groupby`: split, apply, combine

#### Calculating each name's percentage within its year and genre

How it works:

* **Split**: Since we want the percentage by year and by genre, we need to split using both year and genre.
* **Apply**: Once the dataset is split, we can apply the calculations we need to that slice.
* **Combine**: Pandas reassembles the results of the _apply_ step into one `DataFrame`.

In the code below, `pct_year_genre(group)` is the _apply_ step. The `print(group)` statement is not needed for the calculations. It is there to help us visualize how the data set was split.

In [122]:
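def pct_year_genre(group):
    # Work on a copy: pandas does not support mutating the group inside apply
    group = group.copy()
    # Each row's share of the births in its (year, genre) group
    group['pct_genre_year'] = group.births / group.births.sum() * 100
    print(group)  # only here to show how the data set was split
    return group

births = births.groupby(['year', 'genre']).apply(pct_year_genre)
births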

      name genre  births  pct_genre_year
year                                    
1970  Mary     F       2           100.0
       name genre  births  pct_genre_year
year                                     
1970  James     M       1            25.0
1970   John     M       3            75.0
      name genre  births  pct_genre_year
year                                    
1971  Mary     F      20           100.0
       name genre  births  pct_genre_year
year                                     
1971  James     M      10            25.0
1971   John     M      30            75.0


| year | name | genre | births | pct_genre_year |
|---|---|---|---|---|
| 1970 | Mary | F | 2 | 100.0 |
| 1970 | James | M | 1 | 25.0 |
| 1970 | John | M | 3 | 75.0 |
| 1971 | Mary | F | 20 | 100.0 |
| 1971 | James | M | 10 | 25.0 |
| 1971 | John | M | 30 | 75.0 |


Visually:

![Split, apply, combine example](images/split_apply_combine_births.png)
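The same percentages can probably be obtained without a custom function, using `transform`, which returns one value per original row (the group total, repeated for each member of the group). This is a sketch of an alternative, not the approach used above; `year` is kept as a regular column here so that the row labels are unique.

```python
import pandas as pd

# Same test data, with 'year' as a regular column (an assumption for this sketch)
births = pd.DataFrame(
    {
        "year": [1970, 1970, 1970, 1971, 1971, 1971],
        "name": ["James", "Mary", "John"] * 2,
        "genre": ["M", "F", "M"] * 2,
        "births": [1, 2, 3, 10, 20, 30],
    }
)

# Split by (year, genre), apply 'sum', and broadcast the total back to every row
group_totals = births.groupby(["year", "genre"])["births"].transform("sum")

births["pct_genre_year"] = births["births"] / group_totals * 100

print(births)
```

Because `transform` keeps the original row order, the result lines up with the input rows rather than being ordered by group.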

#### Calculating number of births

We do not need to return the same number of rows when grouping a dataset.

In this example we consolidate the M and F rows of each year to get an overall count of births per year, independent of genre.

In [132]:
def births_per_year(group):
    return group['births'].sum()

births.groupby(['year']).apply(births_per_year)

year
1970     6
1971    60
dtype: int64

When the apply function is simple, we can use a lambda expression:

In [134]:
births.groupby(['year']).apply(lambda x: x['births'].sum())

year
1970     6
1971    60
dtype: int64

Finally, we can clean up the syntax a bit:

* Remove the list from `groupby`, since we are grouping by one column only
* Access the `births` column as an attribute (`x.births`) instead of by key (`x['births']`)

In [136]:
births.groupby('year').apply(lambda x: x.births.sum())

year
1970     6
1971    60
dtype: int64
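For a plain sum like this one, `apply` is not strictly needed. A minimal sketch of two alternatives: select the column and call the aggregation directly, or use named aggregation to get a labelled `DataFrame` back. The output column name `total_births` is made up for this example.

```python
import pandas as pd

births = pd.DataFrame(
    {
        "year": [1970, 1970, 1970, 1971, 1971, 1971],
        "name": ["James", "Mary", "John"] * 2,
        "genre": ["M", "F", "M"] * 2,
        "births": [1, 2, 3, 10, 20, 30],
    }
).set_index("year")

# Select the column, then aggregate: returns a Series indexed by year
totals = births.groupby("year")["births"].sum()

# Named aggregation: returns a DataFrame with a column called 'total_births'
totals_df = births.groupby("year").agg(total_births=("births", "sum"))

print(totals)
print(totals_df)
```

The built-in aggregations are usually faster than `apply` with a Python function, and they state the intent more directly.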

## A word about `value_counts`

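If we are interested only in the proportions of the values in a categorical column, we can use `value_counts`. Note that it counts rows (one per name and year), not the values in the `births` column: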

In [120]:
births['genre'].value_counts(normalize=True)

M    0.666667
F    0.333333
Name: genre, dtype: float64

### Percentage of births per year