# Session 12: Reading files into dataframes. Operations on data. Aggregating and grouping.

## Reading files into DataFrames:

`pandas` is very versatile when it comes to loading data from different file formats into DataFrames.

We have several functions from `pandas` to read files into DataFrames:
* `pd.read_csv` converts CSV files into a `pd.DataFrame`
* `pd.read_json` converts JSON files into a `pd.DataFrame`
* `pd.read_html` parses the tables in an HTML document and returns a list of `pd.DataFrame` objects
* `pd.read_clipboard` converts the data in your clipboard into a `pd.DataFrame`
* and many more... https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

In general, `pandas` reads files correctly with its default settings, but sometimes we need to specify some arguments in `read_csv()`:
* separator: `sep` can be semicolon (;), comma (,), tab (\t), etc
* encoding: `encoding` can be `utf-8`, `latin1`, ...
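
As a minimal sketch of these two arguments, the snippet below reads semicolon-separated data from an in-memory buffer instead of a file. The data and the names `raw_bytes` and `sample_df` are made up for illustration; with a real file you would pass its path and the encoding it was saved with.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated content encoded as latin1 (not the course file)
raw_bytes = "district;dogs;cats\nCHAMARTÍN;11098;3922\nTETUÁN;12470;5535\n".encode("latin1")

# sep tells pandas how the columns are delimited;
# encoding tells it how to decode the bytes into text
sample_df = pd.read_csv(io.BytesIO(raw_bytes), sep=";", encoding="latin1")

print(sample_df)
```

If the wrong `encoding` is passed, accented characters such as the `Í` in `CHAMARTÍN` typically come out garbled or raise a decoding error.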

In [1]:
# let's read animals.csv

import pandas as pd

animals = pd.read_csv("../files/animals.csv", sep=",")

animals.head()

Unnamed: 0,year,district,dogs,cats
0,2019,ARGANZUELA,10556,5074
1,2019,BARAJAS,5086,1515
2,2019,CARABANCHEL,20258,6387
3,2019,CENTRO,16010,9248
4,2019,CHAMARTÍN,11098,3922


### pandas: `head`, `tail`, `sample`

* `df.head(n)` will display the first n rows of a dataframe. By default, n=5.
* `df.tail(n)` will display the last n rows of a dataframe. By default, n=5.
* `df.sample(n)` will display a random sample of n rows of a dataframe. By default, n=1.
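
A small self-contained sketch of the three methods on a toy DataFrame (the data is invented for illustration). Passing `random_state` to `sample` is optional; it is used here only so that the random rows are the same on every run.

```python
import pandas as pd

# Toy DataFrame with 10 rows (illustrative data, not the course file)
toy = pd.DataFrame({"value": range(10)})

first_three = toy.head(3)                     # rows 0, 1, 2
last_two = toy.tail(2)                        # rows 8, 9
random_four = toy.sample(4, random_state=0)   # 4 random rows, repeatable

print(first_three, last_two, random_four, sep="\n\n")
```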

In [2]:
animals.head()

Unnamed: 0,year,district,dogs,cats
0,2019,ARGANZUELA,10556,5074
1,2019,BARAJAS,5086,1515
2,2019,CARABANCHEL,20258,6387
3,2019,CENTRO,16010,9248
4,2019,CHAMARTÍN,11098,3922


In [3]:
animals.tail(3)

Unnamed: 0,year,district,dogs,cats
123,2014,VICÁLVARO,4584,505
124,2014,VILLA DE VALLECAS,7107,940
125,2014,VILLAVERDE,10467,851


In [4]:
animals.sample(4)

Unnamed: 0,year,district,dogs,cats
93,2015,LATINA,18529,2595
55,2017,RETIRO,8309,2313
77,2016,SALAMANCA,12709,3424
19,2019,VILLA DE VALLECAS,9923,2946


## Operations with the data in the columns

With pandas we can not only store tabular data, but also perform different operations on it.

### Using `pandas` methods and attributes 

Since our columns are nothing but `pd.Series` objects, we can use all the attributes and methods that apply to them:
* https://pandas.pydata.org/docs/reference/api/pandas.Series.html

Just a sample of what we can do:
* Attributes:
    * `.index`, `.shape`, `.size`, `.values`, `.T`
* Methods:
    * `.abs()`, `.min()`, `.max()`, `.count()`, `.value_counts()`
    * `.sum()`, `.cumsum()`, `.mean()`, `.std()`
    * `.isna()`, `.isnull()`, `.idxmin()`, `.idxmax()`
    * `.unique()`, `.nunique()`, `.drop_duplicates()`

In [5]:
list(animals.index)[:5]

[0, 1, 2, 3, 4]

In [6]:
animals.shape

(126, 4)

In [7]:
animals.size

504

In [8]:
animals.values

array([[2019, 'ARGANZUELA', 10556, 5074],
       [2019, 'BARAJAS', 5086, 1515],
       [2019, 'CARABANCHEL', 20258, 6387],
       [2019, 'CENTRO', 16010, 9248],
       [2019, 'CHAMARTÍN', 11098, 3922],
       [2019, 'CHAMBERÍ', 13359, 4692],
       [2019, 'CIUDAD LINEAL', 17286, 8183],
       [2019, 'FUENCARRAL-EL PARDO', 17375, 6121],
       [2019, 'HORTALEZA', 15836, 8556],
       [2019, 'LATINA', 19049, 10564],
       [2019, 'MONCLOA-ARAVACA', 12367, 3931],
       [2019, 'MORATALAZ', 6724, 2502],
       [2019, 'PUENTE DE VALLECAS', 23437, 6208],
       [2019, 'RETIRO', 7786, 3105],
       [2019, 'SALAMANCA', 13471, 5033],
       [2019, 'SAN BLAS', 14228, 5064],
       [2019, 'TETUÁN', 12470, 5535],
       [2019, 'USERA', 12393, 2898],
       [2019, 'VICÁLVARO', 5244, 1505],
       [2019, 'VILLA DE VALLECAS', 9923, 2946],
       [2019, 'VILLAVERDE', 12917, 2694],
       [2018, 'ARGANZUELA', 10622, 4458],
       [2018, 'BARAJAS', 5203, 1300],
       [2018, 'CARABANCHEL', 20265, 5524],

In [9]:
animals.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,116,117,118,119,120,121,122,123,124,125
year,2019,2019,2019,2019,2019,2019,2019,2019,2019,2019,...,2014,2014,2014,2014,2014,2014,2014,2014,2014,2014
district,ARGANZUELA,BARAJAS,CARABANCHEL,CENTRO,CHAMARTÍN,CHAMBERÍ,CIUDAD LINEAL,FUENCARRAL-EL PARDO,HORTALEZA,LATINA,...,MORATALAZ,PUENTE DE VALLECAS,RETIRO,SALAMANCA,SAN BLAS,TETUÁN,USERA,VICÁLVARO,VILLA DE VALLECAS,VILLAVERDE
dogs,10556,5086,20258,16010,11098,13359,17286,17375,15836,19049,...,6706,22072,8774,12942,12786,12301,11310,4584,7107,10467
cats,5074,1515,6387,9248,3922,4692,8183,6121,8556,10564,...,1153,2065,1344,1793,2043,2178,978,505,940,851


#### abs

In [10]:
animals["dogs"].abs()

0      10556
1       5086
2      20258
3      16010
4      11098
       ...  
121    12301
122    11310
123     4584
124     7107
125    10467
Name: dogs, Length: 126, dtype: int64

#### max, min, count, value_counts

In [11]:
# max value in series
animals["dogs"].max()

23860

In [12]:
# min value in series
animals["dogs"].min()

4584

In [13]:
# number of elements in series
animals["district"].count()

126

In [14]:
animals["district"].size

126

In [15]:
animals["district"].shape

(126,)

In [16]:
# counts items per category
animals["year"].value_counts()

2019    21
2018    21
2017    21
2016    21
2015    21
2014    21
Name: year, dtype: int64

In [17]:
# if we pass normalize=True to `value_counts` we will get the proportions instead of the totals
animals["year"].value_counts(normalize=True)

2019    0.166667
2018    0.166667
2017    0.166667
2016    0.166667
2015    0.166667
2014    0.166667
Name: year, dtype: float64

#### sum, cumsum, mean, std

In [18]:
# sum of all elements
animals["cats"].sum()

419173

In [19]:
# cumulative sum
# item1, item1+item2, item1+item2+item3, ...
animals["cats"].cumsum()

0        5074
1        6589
2       12976
3       22224
4       26146
        ...  
121    415899
122    416877
123    417382
124    418322
125    419173
Name: cats, Length: 126, dtype: int64

In [20]:
# mean value of series
animals["cats"].mean()

3326.7698412698414

In [21]:
# standard deviation of series
animals["cats"].std()

2062.750967665068

#### isna/isnull, idxmin/idxmax

In [22]:
import numpy as np

In [25]:
# missing values in pandas are indicated as NaN (None is treated as missing too)
# isna returns a True/False mask that marks which values are missing

s = pd.Series([1, None, "a", True, np.nan]).isna()
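
The cell above only stores the mask. As a short sketch of how such a mask could be used to count missing values (the names `mixed`, `mask` and `n_missing` are just illustrative):

```python
import numpy as np
import pandas as pd

mixed = pd.Series([1, None, "a", True, np.nan])

mask = mixed.isna()      # True where the value is missing (None or NaN)
n_missing = mask.sum()   # True counts as 1, so the sum is the number of missing values

print(mask.tolist())     # [False, True, False, False, True]
print(n_missing)         # 2
```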



In [26]:
# isna: returns a boolean Series of the same shape, with True where the value is NaN
animals["dogs"].isna()

0      False
1      False
2      False
3      False
4      False
       ...  
121    False
122    False
123    False
124    False
125    False
Name: dogs, Length: 126, dtype: bool

In [28]:
# with dropna we can drop rows with NaN
pd.Series([1, None, "a", True]).dropna()

0       1
2       a
3    True
dtype: object

In [29]:
# idxmax() returns the row label (index) of the highest value in series
animals["dogs"].idxmax()

54

In [30]:
animals["dogs"][animals["dogs"].idxmax()] == animals["dogs"].max()

True

In [31]:
# idxmin() returns the row label (index) of the lowest value in series
animals["dogs"].idxmin()

123

#### unique, nunique, drop_duplicates

In [32]:
# returns an array with the unique values, like doing set(series)
animals["district"].unique()

array(['ARGANZUELA', 'BARAJAS', 'CARABANCHEL', 'CENTRO', 'CHAMARTÍN',
       'CHAMBERÍ', 'CIUDAD LINEAL', 'FUENCARRAL-EL PARDO', 'HORTALEZA',
       'LATINA', 'MONCLOA-ARAVACA', 'MORATALAZ', 'PUENTE DE VALLECAS',
       'RETIRO', 'SALAMANCA', 'SAN BLAS', 'TETUÁN', 'USERA', 'VICÁLVARO',
       'VILLA DE VALLECAS', 'VILLAVERDE', 'FUENCARRAL EL PARDO'],
      dtype=object)

In [33]:
# nunique returns how many unique elements there are in the series, like doing len(set(series))
animals["district"].nunique()

22

In [34]:
# drop_duplicates returns a series with the first occurrence of each value, keeping its original index
animals["district"].drop_duplicates()

0               ARGANZUELA
1                  BARAJAS
2              CARABANCHEL
3                   CENTRO
4                CHAMARTÍN
5                 CHAMBERÍ
6            CIUDAD LINEAL
7      FUENCARRAL-EL PARDO
8                HORTALEZA
9                   LATINA
10         MONCLOA-ARAVACA
11               MORATALAZ
12      PUENTE DE VALLECAS
13                  RETIRO
14               SALAMANCA
15                SAN BLAS
16                  TETUÁN
17                   USERA
18               VICÁLVARO
19       VILLA DE VALLECAS
20              VILLAVERDE
112    FUENCARRAL EL PARDO
Name: district, dtype: object

### Create new columns out of existing columns

* We can combine 2 or more columns with arithmetic operators
* We can build a column from a logical condition on other columns using `np.where`
    * ```Python
    np.where(condition_on_column, result_if_true, result_if_false)
    ```


In [35]:
# sum two columns

animals["total_animals"] = animals["cats"] + animals["dogs"]

animals.head()

Unnamed: 0,year,district,dogs,cats,total_animals
0,2019,ARGANZUELA,10556,5074,15630
1,2019,BARAJAS,5086,1515,6601
2,2019,CARABANCHEL,20258,6387,26645
3,2019,CENTRO,16010,9248,25258
4,2019,CHAMARTÍN,11098,3922,15020


### np.where

```Python
np.where(
    condition_to_check,
    value_if_condition_is_true,
    value_if_condition_is_false
)
```

In [36]:
# create a new column based on a logical condition on an existing column: `np.where`

import numpy as np

mean_animals = animals["total_animals"].mean()

animals["total_animals_cat"] = np.where(
    animals["total_animals"] > mean_animals, #if animals above mean
    "above_mean", # save "above_mean"
    "below_mean" # save "below_mean"
)

animals.sample(5)

Unnamed: 0,year,district,dogs,cats,total_animals,total_animals_cat
71,2016,HORTALEZA,16451,6200,22651,above_mean
88,2015,CHAMARTÍN,13159,1860,15019,below_mean
106,2014,BARAJAS,5233,659,5892,below_mean
25,2018,CHAMARTÍN,11417,3601,15018,below_mean
99,2015,SAN BLAS,13067,2188,15255,below_mean


In [37]:
# convert year to string and concatenate it with district
animals["concat_string"] = animals["year"].astype(str) + animals["district"]

animals

Unnamed: 0,year,district,dogs,cats,total_animals,total_animals_cat,concat_string
0,2019,ARGANZUELA,10556,5074,15630,below_mean,2019ARGANZUELA
1,2019,BARAJAS,5086,1515,6601,below_mean,2019BARAJAS
2,2019,CARABANCHEL,20258,6387,26645,above_mean,2019CARABANCHEL
3,2019,CENTRO,16010,9248,25258,above_mean,2019CENTRO
4,2019,CHAMARTÍN,11098,3922,15020,below_mean,2019CHAMARTÍN
...,...,...,...,...,...,...,...
121,2014,TETUÁN,12301,2178,14479,below_mean,2014TETUÁN
122,2014,USERA,11310,978,12288,below_mean,2014USERA
123,2014,VICÁLVARO,4584,505,5089,below_mean,2014VICÁLVARO
124,2014,VILLA DE VALLECAS,7107,940,8047,below_mean,2014VILLA DE VALLECAS


In [38]:
# create a new column called "cats_per_dog" that contains the ratio cats/dogs
animals["cats_per_dog"] = animals["cats"] / animals["dogs"]

animals.head()

Unnamed: 0,year,district,dogs,cats,total_animals,total_animals_cat,concat_string,cats_per_dog
0,2019,ARGANZUELA,10556,5074,15630,below_mean,2019ARGANZUELA,0.480674
1,2019,BARAJAS,5086,1515,6601,below_mean,2019BARAJAS,0.297877
2,2019,CARABANCHEL,20258,6387,26645,above_mean,2019CARABANCHEL,0.315283
3,2019,CENTRO,16010,9248,25258,above_mean,2019CENTRO,0.577639
4,2019,CHAMARTÍN,11098,3922,15020,below_mean,2019CHAMARTÍN,0.353397


In [39]:
# create a new column called "cum_sum_animals" that contains 
# the cumulative sum of the total animals 
animals["cum_sum_animals"] = animals["total_animals"].cumsum()

animals.head()

Unnamed: 0,year,district,dogs,cats,total_animals,total_animals_cat,concat_string,cats_per_dog,cum_sum_animals
0,2019,ARGANZUELA,10556,5074,15630,below_mean,2019ARGANZUELA,0.480674,15630
1,2019,BARAJAS,5086,1515,6601,below_mean,2019BARAJAS,0.297877,22231
2,2019,CARABANCHEL,20258,6387,26645,above_mean,2019CARABANCHEL,0.315283,48876
3,2019,CENTRO,16010,9248,25258,above_mean,2019CENTRO,0.577639,74134
4,2019,CHAMARTÍN,11098,3922,15020,below_mean,2019CHAMARTÍN,0.353397,89154


### Sorting rows by column values using `.sort_values()`

We can sort our dataframes this way:

```Python
df.sort_values(by=[columns_to_order_with], ascending=True)
```

In [40]:
animals.sort_values(by="cats", ascending=False)

Unnamed: 0,year,district,dogs,cats,total_animals,total_animals_cat,concat_string,cats_per_dog,cum_sum_animals
9,2019,LATINA,19049,10564,29613,above_mean,2019LATINA,0.554570,210175
3,2019,CENTRO,16010,9248,25258,above_mean,2019CENTRO,0.577639,74134
8,2019,HORTALEZA,15836,8556,24392,above_mean,2019HORTALEZA,0.540288,180562
24,2018,CENTRO,15881,8186,24067,above_mean,2018CENTRO,0.515459,453995
6,2019,CIUDAD LINEAL,17286,8183,25469,above_mean,2019CIUDAD LINEAL,0.473389,132674
...,...,...,...,...,...,...,...,...,...
125,2014,VILLAVERDE,10467,851,11318,below_mean,2014VILLAVERDE,0.081303,2065011
85,2015,BARAJAS,5217,663,5880,below_mean,2015BARAJAS,0.127085,1464881
106,2014,BARAJAS,5233,659,5892,below_mean,2014BARAJAS,0.125932,1777685
102,2015,VICÁLVARO,4702,545,5247,below_mean,2015VICÁLVARO,0.115908,1739965


In [41]:
animals.sort_values(by=["cats", "dogs"], ascending=[False, True])

Unnamed: 0,year,district,dogs,cats,total_animals,total_animals_cat,concat_string,cats_per_dog,cum_sum_animals
9,2019,LATINA,19049,10564,29613,above_mean,2019LATINA,0.554570,210175
3,2019,CENTRO,16010,9248,25258,above_mean,2019CENTRO,0.577639,74134
8,2019,HORTALEZA,15836,8556,24392,above_mean,2019HORTALEZA,0.540288,180562
24,2018,CENTRO,15881,8186,24067,above_mean,2018CENTRO,0.515459,453995
6,2019,CIUDAD LINEAL,17286,8183,25469,above_mean,2019CIUDAD LINEAL,0.473389,132674
...,...,...,...,...,...,...,...,...,...
125,2014,VILLAVERDE,10467,851,11318,below_mean,2014VILLAVERDE,0.081303,2065011
85,2015,BARAJAS,5217,663,5880,below_mean,2015BARAJAS,0.127085,1464881
106,2014,BARAJAS,5233,659,5892,below_mean,2014BARAJAS,0.125932,1777685
102,2015,VICÁLVARO,4702,545,5247,below_mean,2015VICÁLVARO,0.115908,1739965


## Practice

### Exercise 1:
What percentage of all the dogs in the city in 2018 were in "LATINA"?

In [42]:
dogs_latina_2018 = animals[
    (animals["district"]=="LATINA")&
    (animals["year"]==2018)
]["dogs"].values[0]

dogs_2018 = animals[
    (animals["year"]==2018)
]["dogs"].sum()

ratio = round(dogs_latina_2018 * 100 / dogs_2018, 1)

f"{ratio} % of the dogs in Madrid in 2018 are in Latina"

'6.9 % of the dogs in Madrid in 2018 are in Latina'

### Exercise 2:
How many districts had an "above_mean" rating in 2016?

In [43]:
animals[
    (animals["year"]==2016)&
    (animals["total_animals_cat"]=="above_mean")
]["district"].nunique()

9

### Exercise 3:
Has the "Hortaleza" district increased or decreased its dog population in the analyzed period? By how much?

In [47]:
dogs_hortaleza_2019 = animals[
    (animals["district"]=="HORTALEZA")
][["year", "dogs"]].sort_values(by="year", ascending=False)["dogs"].values[0]

dogs_hortaleza_2014 = animals[
    (animals["district"]=="HORTALEZA")
][["year", "dogs"]].sort_values(by="year", ascending=False)["dogs"].values[-1]

# to calculate the evolution we subtract the number of dogs in Hortaleza in 2014 from the number in 2019
evolution = dogs_hortaleza_2019 - dogs_hortaleza_2014

# results
result = "increased" if evolution > 0 else "decreased"

print(f"The number of dogs in Hortaleza has {result} by {abs(evolution)} dogs from 2014 to 2019")

The number of dogs in Hortaleza has decreased by 640 dogs from 2014 to 2019


## Groupby and aggregations

### `groupby`

Just like in SQL, we can use `groupby` to apply operations to whole groups of rows in our DataFrames.

```Python
df.groupby([columns_to_group]).function_to_apply_to_each_group
```
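
A minimal sketch of this pattern on toy data (the `sales` DataFrame and its values are invented for illustration):

```python
import pandas as pd

# Toy data (made up for illustration)
sales = pd.DataFrame({
    "month": [1, 1, 2, 2, 2],
    "price": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# One mean per distinct value of "month"
mean_price = sales.groupby("month")["price"].mean()

print(mean_price)
```

Month 1 averages its two rows and month 2 averages its three rows, so the result has one entry per month.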

In [49]:
# read energy data
from datetime import datetime

energy = pd.read_csv("../files/energy.csv")

energy.head()

Unnamed: 0,datetime,power_demand,nuclear,gas,solar,hydro,coal,wind,spot_price,year,month,day,hour,weekday
0,2018-12-31 23:00:00+00:00,23251.2,6059.2,2954.0,7.1,3202.8,1867.0,3830.3,66.88,2018,12,31,23,0
1,2019-01-01 00:00:00+00:00,22485.0,6059.2,3044.1,8.0,2884.4,1618.0,3172.1,66.88,2019,1,1,0,1
2,2019-01-01 01:00:00+00:00,20977.0,6059.2,3138.6,7.5,1950.8,1535.3,2980.5,66.0,2019,1,1,1,1
3,2019-01-01 02:00:00+00:00,19754.2,6059.2,3596.2,7.5,1675.7,1344.0,2840.0,63.64,2019,1,1,2,1
4,2019-01-01 03:00:00+00:00,19320.6,6063.4,3192.6,7.5,1581.8,1345.0,3253.4,58.85,2019,1,1,3,1


In [50]:
energy[["power_demand"]].head()

Unnamed: 0,power_demand
0,23251.2
1,22485.0
2,20977.0
3,19754.2
4,19320.6


In [51]:
# mean spot_price per month 
energy.groupby(["month"])[["spot_price"]].mean()

Unnamed: 0_level_0,spot_price
month,Unnamed: 1_level_1
1,61.959852
2,54.020491
3,48.836116
4,50.403611
5,48.393992
6,47.161236
7,51.464301
8,44.951815
9,42.130111
10,47.152325


In [52]:
# max power_demand per hour
energy.groupby(["hour"])[["power_demand"]].max()

Unnamed: 0_level_0,power_demand
hour,Unnamed: 1_level_1
0,27739.0
1,26459.4
2,26298.4
3,26429.5
4,28167.9
5,30511.2
6,34706.4
7,38116.0
8,39241.7
9,39898.3


In [58]:
# day of week with lowest average consumption of fossil fuels 

energy["fossil_fuel_consumption"] = energy["gas"] + energy["coal"]

energy.groupby(["weekday"])[["fossil_fuel_consumption"]].mean().idxmin()  # Sunday

fossil_fuel_consumption    6
dtype: int64

### Inside a `groupby` object

Iterating over a `groupby` object yields one tuple per `category` in the `column`(s) we're grouping by:
* The first element of the tuple is the `category` itself
* The second element is the data associated with that category:
    * ```Python
    df[df[col_groupby]==category]
    ```
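
The structure described above can be sketched with a plain `for` loop over a toy DataFrame (data and names are illustrative). Each group is compared with the equivalent boolean filter to show that they hold the same rows.

```python
import pandas as pd

toy = pd.DataFrame({
    "month": [1, 1, 2],
    "price": [10.0, 20.0, 30.0],
})

groups = {}
# Each iteration yields (category, sub-DataFrame with the rows of that category)
for category, group in toy.groupby("month"):
    same_rows = toy[toy["month"] == category]
    print(category, len(group), group.equals(same_rows))
    groups[category] = group
```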

In [59]:
# what's inside a groupby object?
groupby_object = energy.groupby("month")

In [60]:
# months present in the data of the first group
list(groupby_object)[0][1]["month"].unique()

array([1])

In [61]:
# category
list(groupby_object)[0][0]

1

In [62]:
# data associated with category
list(groupby_object)[0][1]

Unnamed: 0,datetime,power_demand,nuclear,gas,solar,hydro,coal,wind,spot_price,year,month,day,hour,weekday,fossil_fuel_consumption
1,2019-01-01 00:00:00+00:00,22485.0,6059.2,3044.1,8.0,2884.4,1618.0,3172.1,66.88,2019,1,1,0,1,4662.1
2,2019-01-01 01:00:00+00:00,20977.0,6059.2,3138.6,7.5,1950.8,1535.3,2980.5,66.00,2019,1,1,1,1,4673.9
3,2019-01-01 02:00:00+00:00,19754.2,6059.2,3596.2,7.5,1675.7,1344.0,2840.0,63.64,2019,1,1,2,1,4940.2
4,2019-01-01 03:00:00+00:00,19320.6,6063.4,3192.6,7.5,1581.8,1345.0,3253.4,58.85,2019,1,1,3,1,4537.6
5,2019-01-01 04:00:00+00:00,19262.3,6063.4,3167.9,7.5,1535.6,1377.5,3234.0,55.47,2019,1,1,4,1,4545.4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
740,2019-01-31 19:00:00+00:00,37133.6,7112.7,2653.9,,4599.3,2891.0,15494.3,59.90,2019,1,31,19,3,5544.9
741,2019-01-31 20:00:00+00:00,36185.0,7112.7,1542.8,,3940.3,2645.0,15699.5,52.84,2019,1,31,20,3,4187.8
742,2019-01-31 21:00:00+00:00,33965.2,7112.7,1343.0,,3863.8,2367.8,15589.3,51.36,2019,1,31,21,3,3710.8
743,2019-01-31 22:00:00+00:00,30144.1,7111.7,868.4,,3627.9,1832.0,15153.3,48.80,2019,1,31,22,3,2700.4


Now that we understand the `groupby` object, we can dig a bit deeper into the syntax.

If we want to group by several columns, we can pass a list of columns to `groupby` and perform the operation we need.

If we don't want the grouping columns to become the index of the resulting DataFrame, we can pass `as_index=False` to `groupby`.
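
To make the difference concrete, here is a small sketch on invented data that runs the same aggregation with and without `as_index=False`:

```python
import pandas as pd

toy = pd.DataFrame({
    "month": [1, 1, 2, 2],
    "weekday": [0, 1, 0, 0],
    "price": [10.0, 20.0, 30.0, 50.0],
})

# Default: the grouping columns become the levels of a MultiIndex
indexed = toy.groupby(["month", "weekday"])[["price"]].mean()

# as_index=False: the grouping columns stay as regular columns
flat = toy.groupby(["month", "weekday"], as_index=False)[["price"]].mean()

print(list(indexed.index.names))   # ['month', 'weekday']
print(list(flat.columns))          # ['month', 'weekday', 'price']
```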

In [63]:
# groupby with several columns

# mean  power_demand and spot_price per month and weekday
df = energy.groupby(["month", "weekday"])[["power_demand", "spot_price"]].mean()

df.columns = [f"mean_{col}" for col in df.columns]

df

Unnamed: 0_level_0,Unnamed: 1_level_0,mean_power_demand,mean_spot_price
month,weekday,Unnamed: 2_level_1,Unnamed: 3_level_1
1,0,32236.002083,62.180313
1,1,31511.434167,63.257333
1,2,32730.300833,61.783417
1,3,32959.660000,61.658500
1,4,33013.616667,65.393125
...,...,...,...
12,2,28763.133333,37.196458
12,3,29758.062500,33.490417
12,4,28440.963542,31.316875
12,5,26226.058333,27.177708


In [64]:
# with `as_index=False` the grouping columns stay as regular columns instead of becoming the index
energy.groupby(["month", "weekday"], as_index=False)[["power_demand", "spot_price"]].mean()

Unnamed: 0,month,weekday,power_demand,spot_price
0,1,0,32236.002083,62.180313
1,1,1,31511.434167,63.257333
2,1,2,32730.300833,61.783417
3,1,3,32959.660000,61.658500
4,1,4,33013.616667,65.393125
...,...,...,...,...
79,12,2,28763.133333,37.196458
80,12,3,29758.062500,33.490417
81,12,4,28440.963542,31.316875
82,12,5,26226.058333,27.177708


### `groupby` and `agg`

If we want to perform different operations after `groupby` we can mix `groupby` and `agg`.
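
A self-contained sketch on toy data follows. The first call uses the dictionary form. The second uses named aggregation, an alternative pandas syntax that is not used elsewhere in this session and is shown only because it produces flat column names. All data and output names are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "month": [1, 1, 2, 2],
    "coal": [10.0, 30.0, 5.0, 15.0],
    "wind": [100.0, 200.0, 300.0, 500.0],
})

# Dict form: column -> list of functions; the result has two-level column labels
multi = toy.groupby("month").agg({"coal": ["sum", "mean"], "wind": ["max"]})

# Named aggregation: new_name=(column, function); the result has flat column labels
named = toy.groupby("month").agg(
    coal_sum=("coal", "sum"),
    wind_max=("wind", "max"),
)

print(multi)
print(named)
```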

In [66]:
# groupby on several columns and perform mean and sum on coal and wind

energy.groupby(["month", "weekday"]).agg({
    "coal": ["sum", "mean"],
    "wind": ["sum", "mean"]
})

Unnamed: 0_level_0,Unnamed: 1_level_0,coal,coal,wind,wind
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,mean,sum,mean
month,weekday,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
1,0,421269.8,4388.227083,824014.0,8583.479167
1,1,507610.3,4230.085833,846574.9,7054.790833
1,2,512002.1,4266.684167,1127327.0,9394.391667
1,3,499262.3,4160.519167,1107441.0,9228.675000
1,4,456063.6,4750.662500,604727.5,6299.244792
...,...,...,...,...,...
12,2,57690.3,801.254167,687588.4,7162.379167
12,3,60719.6,645.953191,966697.5,10069.765625
12,4,49367.6,530.834409,843487.3,8786.326042
12,5,31480.4,655.841667,675993.2,7041.595833


In [67]:
# We can handle a multiindex like the one resulting from a groupby with several columns 
# and several operations in the following way:

df = energy.groupby(["month", "weekday"]).agg({
    "coal": ["sum", "mean"],
    "wind": ["sum", "mean"]
})

# mean coal generation on Tuesdays in January
df.loc[(1, 1), ("coal", "mean")]

4230.085833333333

When we have a DataFrame with several index levels, we can use `unstack()` and `stack()`:

### `stack` and `unstack`

These methods allow us to "move" labels from rows to columns and vice versa
* `unstack` moves row labels to column labels
* `stack` moves column labels to row labels

By default, both methods operate on the innermost level (`level=-1`).
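
A minimal round trip on toy data, assuming a Series indexed by `month` and `weekday` (values invented for illustration):

```python
import pandas as pd

# Series with a two-level row index: (month, weekday)
long = pd.DataFrame({
    "month": [1, 1, 2, 2],
    "weekday": [0, 1, 0, 1],
    "coal": [10.0, 20.0, 30.0, 40.0],
}).set_index(["month", "weekday"])["coal"]

wide = long.unstack()   # innermost row level (weekday) becomes the column labels
back = wide.stack()     # column labels (weekday) go back to the rows

print(wide)
print(back)
```

Here `wide` has one row per month and one column per weekday, and `back` holds the same four values as `long`.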

In [68]:
# create DF with 2 index levels
energy.groupby(["month", "weekday"]).agg({
    "coal": ["sum", "mean"],
    "wind": ["sum", "mean"]
})

Unnamed: 0_level_0,Unnamed: 1_level_0,coal,coal,wind,wind
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,mean,sum,mean
month,weekday,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
1,0,421269.8,4388.227083,824014.0,8583.479167
1,1,507610.3,4230.085833,846574.9,7054.790833
1,2,512002.1,4266.684167,1127327.0,9394.391667
1,3,499262.3,4160.519167,1107441.0,9228.675000
1,4,456063.6,4750.662500,604727.5,6299.244792
...,...,...,...,...,...
12,2,57690.3,801.254167,687588.4,7162.379167
12,3,60719.6,645.953191,966697.5,10069.765625
12,4,49367.6,530.834409,843487.3,8786.326042
12,5,31480.4,655.841667,675993.2,7041.595833


In [69]:
# move `weekday` from rows to columns: unstack weekday
energy.groupby(["month", "weekday"]).agg({
    "coal": ["sum", "mean"],
    "wind": ["sum", "mean"]
}).unstack(level="weekday")

Unnamed: 0_level_0,coal,coal,coal,coal,coal,coal,coal,coal,coal,coal,...,wind,wind,wind,wind,wind,wind,wind,wind,wind,wind
Unnamed: 0_level_1,sum,sum,sum,sum,sum,sum,sum,mean,mean,mean,...,sum,sum,sum,mean,mean,mean,mean,mean,mean,mean
weekday,0,1,2,3,4,5,6,0,1,2,...,4,5,6,0,1,2,3,4,5,6
month,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3,Unnamed: 11_level_3,Unnamed: 12_level_3,Unnamed: 13_level_3,Unnamed: 14_level_3,Unnamed: 15_level_3,Unnamed: 16_level_3,Unnamed: 17_level_3,Unnamed: 18_level_3,Unnamed: 19_level_3,Unnamed: 20_level_3,Unnamed: 21_level_3
1,421269.8,507610.3,512002.1,499262.3,456063.6,403004.1,299088.4,4388.227083,4230.085833,4266.684167,...,604727.5,547692.5,845505.9,8583.479167,7054.790833,9394.391667,9228.675,6299.244792,5705.130208,8807.353125
2,355023.1,412356.6,409794.9,354202.2,296157.4,228108.9,201911.1,3698.157292,4295.38125,4268.696875,...,687880.1,742590.4,709697.2,4573.125,3127.797917,3264.470833,4621.85,7165.417708,7735.316667,7392.679167
3,90575.2,93075.8,83690.6,115528.7,211058.3,136452.7,105241.6,943.491667,969.539583,871.777083,...,479457.2,509594.6,710068.8,7840.760417,8366.327083,9745.963542,6416.372917,3995.476667,4246.621667,5917.24
4,143198.9,141308.6,93011.4,93500.1,97725.8,86918.0,75626.0,1193.324167,1177.571667,968.86875,...,688169.6,654563.4,540527.0,4688.543333,5507.959167,9419.402083,7540.68125,7168.433333,6818.36875,5630.489583
5,46014.0,57783.7,67651.7,65313.1,54077.2,34526.5,34845.2,479.3125,601.913542,563.764167,...,873835.7,631901.0,625664.3,4705.53125,5094.652083,6217.274167,6080.86,7281.964167,6582.302083,6517.336458
6,60259.3,69456.2,65415.9,57449.5,53481.9,58768.7,65776.0,627.701042,723.502083,681.415625,...,449544.9,414505.6,425935.3,3330.016667,5273.940625,5041.561458,5647.36875,4682.759375,3454.213333,3549.460833
7,103709.9,120206.6,126196.5,102481.9,102255.1,59825.4,54971.8,864.249167,1001.721667,1051.6375,...,315439.9,438007.4,383020.7,5041.106667,4705.435833,4125.745,3559.652083,3285.832292,4562.577083,3989.798958
8,39654.1,47677.5,49256.2,68909.5,68596.1,54148.6,35596.5,413.063542,496.640625,513.085417,...,437170.9,334951.0,385160.3,4416.210417,3737.441667,3280.898958,3903.4125,3643.090833,2791.258333,4012.086458
9,87533.4,74152.3,73621.4,63098.8,62168.6,42217.6,57211.7,729.445,772.419792,766.889583,...,577429.3,548192.7,590555.9,4250.530833,5994.954167,5176.195833,5125.916667,6014.888542,5710.340625,4921.299167
10,100779.0,133426.4,127907.2,116723.8,78866.9,64727.8,65093.0,1049.78125,1111.886667,1065.893333,...,397112.2,492724.4,444821.8,4660.223958,5447.506667,5343.123333,5583.9275,4136.585417,5132.545833,4633.560417


In [70]:
# move the innermost column level ("sum", "mean") from column labels to rows: stack()
energy.groupby(["month", "weekday"]).agg({
    "coal": ["sum", "mean"],
    "wind": ["sum", "mean"]
}).stack()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,coal,wind
month,weekday,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,0,sum,421269.800000,8.240140e+05
1,0,mean,4388.227083,8.583479e+03
1,1,sum,507610.300000,8.465749e+05
1,1,mean,4230.085833,7.054791e+03
1,2,sum,512002.100000,1.127327e+06
...,...,...,...,...
12,4,mean,530.834409,8.786326e+03
12,5,sum,31480.400000,6.759932e+05
12,5,mean,655.841667,7.041596e+03
12,6,sum,51225.700000,7.598387e+05


## Practice

### Exercise 1: `energy` dataset
What was the maximum solar power generation in August?

In [74]:
# let's see the maximum solar generation in each month
energy.groupby(["month"]).agg({"solar": ["max"]})

Unnamed: 0_level_0,solar
Unnamed: 0_level_1,max
month,Unnamed: 1_level_2
1,3058.4
2,3543.3
3,3798.3
4,3704.3
5,3726.1
6,3700.9
7,3998.1
8,4050.1
9,4213.1
10,4212.4


In [75]:
# only getting the value for August (month 8)
energy.groupby(["month"]).agg({"solar": ["max"]}).loc[8]

solar  max    4050.1
Name: 8, dtype: float64

### Exercise 2: `energy` dataset
What's the average production of each of the following technologies at hour 5?

```Python
tech = ["nuclear", "solar", "hydro"]
```

In [77]:
tech = ["nuclear", "solar", "hydro"]

energy.groupby("hour")[tech].mean().loc[5]

nuclear    6386.275824
solar        62.435920
hydro      2461.044231
Name: 5, dtype: float64

### Exercise 3:
Create a new column called `stop_wind` with value 1 if `spot_price` is below 20, and 0 otherwise.

In [78]:
energy["stop_wind"] = np.where(
    energy["spot_price"] < 20, 
    1,
    0
)

energy.head(1)

Unnamed: 0,datetime,power_demand,nuclear,gas,solar,hydro,coal,wind,spot_price,year,month,day,hour,weekday,fossil_fuel_consumption,stop_wind
0,2018-12-31 23:00:00+00:00,23251.2,6059.2,2954.0,7.1,3202.8,1867.0,3830.3,66.88,2018,12,31,23,0,4821.0,0


### Exercise 4:
Create a new column called `weekend` with 0 if `weekday` is 0, 1, 2, 3 or 4, and 1 otherwise.

In [81]:
energy["weekend"] = np.where(
    energy["weekday"] > 4,
    1,
    0
)

energy.head(1)

Unnamed: 0,datetime,power_demand,nuclear,gas,solar,hydro,coal,wind,spot_price,year,month,day,hour,weekday,fossil_fuel_consumption,stop_wind,weekend
0,2018-12-31 23:00:00+00:00,23251.2,6059.2,2954.0,7.1,3202.8,1867.0,3830.3,66.88,2018,12,31,23,0,4821.0,0,0
