# Categoricals and groupby

## Sales data

In [1]:
import pandas as pd

In [2]:
sales = pd.DataFrame({
    'weekday': ['Sun', 'Sun', 'Mon', 'Mon'],
    'city': ['Austin', 'Dallas', 'Austin', 'Dallas'],
    'bread': [139, 237, 327, 456],
    'butter': [20, 45, 70, 98]
})
sales

   bread  butter    city weekday
0    139      20  Austin     Sun
1    237      45  Dallas     Sun
2    327      70  Austin     Mon
3    456      98  Dallas     Mon


### Boolean filtering and count

In [3]:
sales.loc[sales['weekday']=='Sun'].count()

bread      2
butter     2
city       2
weekday    2
dtype: int64

### Groupby and count

In [4]:
sales.groupby('weekday').count()

         bread  butter  city
weekday
Mon          2       2     2
Sun          2       2     2


### Split-apply-combine

`sales.groupby('weekday').count()`

- split the rows into groups by `weekday`
- apply the `count()` function to each group
- combine the per-group counts into one result
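
The three steps can also be written out by hand to see what `groupby` does in one call. This is only an illustrative sketch, not how pandas implements it internally; the names `pieces`, `counts`, `manual` and `builtin` are made up for this example.

```python
import pandas as pd

sales = pd.DataFrame({
    'weekday': ['Sun', 'Sun', 'Mon', 'Mon'],
    'city': ['Austin', 'Dallas', 'Austin', 'Dallas'],
    'bread': [139, 237, 327, 456],
    'butter': [20, 45, 70, 98]
})

# split: one sub-frame per distinct weekday
pieces = {day: group for day, group in sales.groupby('weekday')}

# apply: count the non-null entries of every remaining column in each sub-frame
counts = {day: group.drop(columns='weekday').count()
          for day, group in pieces.items()}

# combine: one row per weekday, one column per counted column
manual = pd.DataFrame(counts).T
manual.index.name = 'weekday'

# the one-line equivalent from the cell above
builtin = sales.groupby('weekday').count()
```

Both `manual` and `builtin` should hold the same counts, although the column order may differ.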



### Aggregation/Reduction

Some reducing functions:
- `mean()`
- `std()`
- `sum()`
- `first()`, `last()`
- `min()`, `max()`
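
As a rough illustration, several of these reductions can be computed in one pass with `agg()`. That method is not used elsewhere in these notes and is brought in here only as a convenience; `by_day` and `summary` are names chosen for this sketch.

```python
import pandas as pd

sales = pd.DataFrame({
    'weekday': ['Sun', 'Sun', 'Mon', 'Mon'],
    'city': ['Austin', 'Dallas', 'Austin', 'Dallas'],
    'bread': [139, 237, 327, 456],
    'butter': [20, 45, 70, 98]
})

# group the bread column by weekday
by_day = sales.groupby('weekday')['bread']

# each reducing function collapses a group to a single value;
# passing a list of names gives one result column per function
summary = by_day.agg(['mean', 'std', 'sum', 'first', 'last', 'min', 'max'])
print(summary)
```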

### Groupby and sum

What was the total amount of bread sold on each day?

In [5]:
sales.groupby('weekday')['bread'].sum()

weekday
Mon    783
Sun    376
Name: bread, dtype: int64

### Groupby and sum: multiple columns

In [6]:
sales.groupby('weekday')[['bread','butter']].sum()

         bread  butter
weekday
Mon        783     168
Sun        376      65


### Groupby and mean: multi-level index

- grouping by several columns creates a sorted multi-level index

In [7]:
sales.groupby(['city','weekday']).mean()

                bread  butter
city   weekday
Austin Mon        327      70
       Sun        139      20
Dallas Mon        456      98
       Sun        237      45
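
A short sketch of how the multi-level result might be read back: a tuple selects one (city, weekday) pair, and a single label selects everything under the outer level. The variable names below are illustrative only.

```python
import pandas as pd

sales = pd.DataFrame({
    'weekday': ['Sun', 'Sun', 'Mon', 'Mon'],
    'city': ['Austin', 'Dallas', 'Austin', 'Dallas'],
    'bread': [139, 237, 327, 456],
    'butter': [20, 45, 70, 98]
})

by_city_day = sales.groupby(['city', 'weekday']).mean()

# the index has two named levels, sorted by city and then weekday
level_names = list(by_city_day.index.names)

# a tuple picks a single (city, weekday) row
austin_mon = by_city_day.loc[('Austin', 'Mon')]

# a single outer label picks all weekdays for that city
dallas = by_city_day.loc['Dallas']

# reset_index() turns the index levels back into ordinary columns
flat = by_city_day.reset_index()
```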


### Customers

- create a `customers` Series that tells us who made each purchase in the `sales` DataFrame
- `customers` has the same index as `sales`, namely a range starting at 0

In [8]:
customers = pd.Series(['Dave', 'Alice', 'Bob', 'Alice'])
customers

0     Dave
1    Alice
2      Bob
3    Alice
dtype: object

### Groupby and sum: by series

In [9]:
sales.groupby(customers)['bread'].sum()

Alice    693
Bob      327
Dave     139
Name: bread, dtype: int64
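
Grouping by an external Series works because the Series and the DataFrame share the same index. One way to check this, sketched below, is to attach the Series as a column first and group by that column instead. The column name `customer` and the result names are assumptions made for this example.

```python
import pandas as pd

sales = pd.DataFrame({
    'weekday': ['Sun', 'Sun', 'Mon', 'Mon'],
    'city': ['Austin', 'Dallas', 'Austin', 'Dallas'],
    'bread': [139, 237, 327, 456],
    'butter': [20, 45, 70, 98]
})
customers = pd.Series(['Dave', 'Alice', 'Bob', 'Alice'])

# group directly by the Series; rows are matched on the shared index
by_series = sales.groupby(customers)['bread'].sum()

# same grouping after attaching the Series as a (hypothetical) 'customer' column
by_column = sales.assign(customer=customers).groupby('customer')['bread'].sum()
```

Both results should give the same totals per customer.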

### Categorical data
- Advantages
    - uses less memory
    - speeds up operations like `groupby()`
  
- `.unique()`: returns an array of the distinct entries
- `.value_counts()`: returns the number of times each value occurs

In [10]:
sales['weekday']

0    Sun
1    Sun
2    Mon
3    Mon
Name: weekday, dtype: object

In [11]:
sales['weekday'].unique()

array(['Sun', 'Mon'], dtype=object)

In [12]:
sales['weekday'].value_counts()

Mon    2
Sun    2
Name: weekday, dtype: int64

In [13]:
sales['weekday'] = sales['weekday'].astype('category')
sales['weekday']

0    Sun
1    Sun
2    Mon
3    Mon
Name: weekday, dtype: category
Categories (2, object): [Mon, Sun]
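
The memory advantage is hard to see on four rows, so the sketch below uses a hypothetical longer column with the same two weekday labels repeated many times. The sizes reported depend on the pandas version and platform, so no specific numbers are claimed here.

```python
import pandas as pd

# hypothetical larger column: the two weekday labels repeated 50,000 times
weekday_obj = pd.Series(['Sun', 'Mon'] * 50_000, dtype=object)
weekday_cat = weekday_obj.astype('category')

# deep=True includes the memory held by the Python string objects
obj_bytes = weekday_obj.memory_usage(deep=True)
cat_bytes = weekday_cat.memory_usage(deep=True)
print(obj_bytes, cat_bytes)

# a categorical stores each distinct label once, plus small integer codes
categories = weekday_cat.cat.categories
codes = weekday_cat.cat.codes
```

Here `categories` holds the two labels and `codes` holds one small integer per row pointing into them, which is where the saving comes from.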