In [12]:
import pandas as pd
import numpy as np
df = pd.read_csv('data/Consumo_cerveja.csv', 
                 decimal=',', 
                 thousands='.', 
                 header=0, 
                 names=['date','median_temp','min_temp','max_temp','precip','weekend','consumption'], 
                 parse_dates=['date'], 
                 nrows=365)

Let's finish up by talking about a number of useful functions for analysis before we put it all together!

# Mapping
Often we have data that we want to replace with a better representation. For example,
in our beer consumption data, I would assume that the season matters, but since winter spans two calendar years (December, January, February), there is no simple numeric cut on the month number that divides the year into seasons.

Enter mapping - `.map` is a great convenience method that applies a function or a dictionary lookup to every element of a Series. To create our season mapping, let's set up a dictionary that maps each season label to its month numbers:

In [13]:
season = {
    "winter": [12, 1, 2],
    "spring": [3, 4, 5],
    "summer": [6, 7, 8],
    "autumn": [9, 10,11]
}

We then invert the dictionary, so we have a nice representation that maps each month number to its season:

In [14]:
season_map = {i: k
              for k, v in season.items()
              for i in v
             }
season_map

{12: 'winter',
 1: 'winter',
 2: 'winter',
 3: 'spring',
 4: 'spring',
 5: 'spring',
 6: 'summer',
 7: 'summer',
 8: 'summer',
 9: 'autumn',
 10: 'autumn',
 11: 'autumn'}

Something we haven't mentioned yet: converting `date` to a datetime type gives us access to the special `.dt` namespace, which holds datetime-specific functionality. In this example, we use the `.month` property to get the month of each date.

In [15]:
df.date.dt.month

0       1
1       1
2       1
3       1
4       1
5       1
6       1
7       1
8       1
9       1
10      1
11      1
12      1
13      1
14      1
15      1
16      1
17      1
18      1
19      1
20      1
21      1
22      1
23      1
24      1
25      1
26      1
27      1
28      1
29      1
       ..
335    12
336    12
337    12
338    12
339    12
340    12
341    12
342    12
343    12
344    12
345    12
346    12
347    12
348    12
349    12
350    12
351    12
352    12
353    12
354    12
355    12
356    12
357    12
358    12
359    12
360    12
361    12
362    12
363    12
364    12
Name: date, Length: 365, dtype: int64
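
As a small illustration of what else lives in the `.dt` namespace, here is a sketch on a few hand-written dates (hypothetical toy data, not read from the beer dataset). It assumes only that the column has a datetime dtype.

```python
import pandas as pd

# Hypothetical toy dates, chosen for illustration only.
dates = pd.Series(pd.to_datetime(["2015-01-01", "2015-01-03", "2015-12-31"]))

months = dates.dt.month        # calendar month, 1-12
weekdays = dates.dt.dayofweek  # Monday=0 ... Sunday=6
names = dates.dt.day_name()    # weekday name as a string

print(pd.DataFrame({"date": dates, "month": months,
                    "dayofweek": weekdays, "day_name": names}))
```

The same accessor only works on datetime-like columns, which is why parsing `date` when reading the CSV matters.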

Now we are ready to map our season dictionary onto our months

In [16]:
df['season'] = df.date.dt.month.map(season_map)
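
One thing worth knowing about `.map` with a dictionary: keys that are missing from the dictionary become `NaN` rather than raising an error. The sketch below uses a deliberately incomplete, hypothetical dictionary on toy month numbers to show this, and also shows that `.map` accepts a function.

```python
import pandas as pd

# Hypothetical toy data: a handful of month numbers.
months = pd.Series([12, 1, 4, 7, 10])

# Deliberately incomplete dictionary: month 10 has no entry.
partial_map = {12: "winter", 1: "winter", 4: "spring", 7: "summer"}

by_dict = months.map(partial_map)                 # month 10 becomes NaN
by_func = months.map(lambda m: (m - 1) // 3 + 1)  # a function works too: calendar quarter

print(by_dict)
print(by_func)
```

Checking `by_dict.isna().sum()` after a mapping is a cheap way to spot values the dictionary did not cover.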

Let's get some random samples to check that it worked as expected

![Fun Fact](images/fun_fact.resized.jpeg) Use `.sample` instead of `.head` - you'll catch data errors from the middle of your dataset!

In [17]:
df.sample(10)

Unnamed: 0,date,median_temp,min_temp,max_temp,precip,weekend,consumption,season
161,2015-06-11,22.72,17.5,26.9,0.0,0,24615,summer
68,2015-03-10,23.12,21.2,26.9,9.7,0,23042,spring
76,2015-03-18,21.24,19.7,24.1,0.0,0,20167,spring
314,2015-11-11,26.2,19.8,32.7,0.0,0,29569,autumn
284,2015-10-12,22.76,19.0,29.6,0.0,0,26249,autumn
152,2015-06-02,16.04,15.0,17.5,0.5,0,20106,summer
235,2015-08-24,16.98,15.1,20.5,0.0,0,23210,summer
362,2015-12-29,21.68,20.3,24.1,10.3,0,22309,winter
309,2015-11-06,19.76,18.0,22.8,0.0,0,20575,autumn
31,2015-02-01,24.16,20.6,28.0,0.0,1,32057,winter
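
Because `.sample` is random, the rows change on every run. If you need the same rows each time (for example in a report), a minimal sketch is to pass `random_state`; the toy frame below is hypothetical.

```python
import pandas as pd

# Hypothetical toy frame standing in for a larger dataset.
toy = pd.DataFrame({"day": range(100), "value": range(100, 200)})

first = toy.sample(5, random_state=42)   # 5 random rows
second = toy.sample(5, random_state=42)  # same seed, same rows
fraction = toy.sample(frac=0.1, random_state=0)  # 10% of the rows

print(first)
print(len(fraction))
```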


We can also use the `.value_counts` method to check that everything is as expected

In [18]:
df.season.value_counts()

spring    92
summer    92
autumn    91
winter    90
Name: season, dtype: int64

# Binning

Another very common operation is assigning data to bins.
For example, we might want to turn a regression problem of predicting consumption into a classification problem of low vs high consumption.

Let's arbitrarily choose 25,000 as our cutoff for high consumption.
Since pandas is built on NumPy, we can often use NumPy functions when it suits us.

There is nothing in pandas that does quite what `np.where` does (`Series.where` only replaces values where the condition is false), so I use it all the time for this type of operation.

In [20]:
df['high_consumption'] = np.where(df.consumption < 25000, 0, 1)
df.sample(10)

Unnamed: 0,date,median_temp,min_temp,max_temp,precip,weekend,consumption,season,high_consumption
337,2015-12-04,22.76,19.0,29.1,0.0,0,29513,winter,1
276,2015-10-04,18.6,17.1,21.3,0.2,1,24862,autumn,0
221,2015-08-10,21.2,15.6,28.0,0.0,0,23181,summer,0
96,2015-04-07,17.38,16.1,20.0,7.1,0,21004,spring,0
269,2015-09-27,22.3,18.2,30.6,1.8,1,32184,autumn,1
262,2015-09-20,23.6,19.2,33.3,0.0,1,34695,autumn,1
253,2015-09-11,16.88,14.1,21.8,31.8,0,21454,autumn,0
268,2015-09-26,20.58,18.2,24.9,28.6,1,29637,autumn,1
327,2015-11-24,21.36,19.3,22.6,7.0,0,21689,autumn,0
160,2015-06-10,20.54,15.5,26.2,0.0,0,22008,summer,0


In [21]:
df.high_consumption.value_counts()

0    188
1    177
Name: high_consumption, dtype: int64
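
To see how `np.where` behaves as a vectorised if/else, here is a sketch on four values. Three are taken from the outputs above and 25,000 is added as a hypothetical boundary case. It also shows a boolean-cast alternative and a nested `np.where` for a three-way split; the 30,000 threshold is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# 25000 is a made-up boundary value; the others appear in the data above.
consumption = pd.Series([24886, 25000, 25461, 37937])

# Same rule as above: below 25,000 is 0, everything else is 1.
flag_where = np.where(consumption < 25000, 0, 1)

# Equivalent for a simple binary flag: cast the boolean mask to int.
flag_bool = (consumption >= 25000).astype(int)

# Nested np.where gives a three-way split (30,000 is an arbitrary cutoff).
level = np.where(consumption < 25000, "low",
                 np.where(consumption < 30000, "medium", "high"))

print(flag_where, flag_bool.tolist(), level)
```

Note that `np.where` returns a NumPy array, not a Series; assigning it to a dataframe column works because the lengths match.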

Often our use case is a bit more complicated than a simple higher or lower. `pd.cut` gives us a lot more flexibility in setting our cutoff points. Let's let pandas choose the split point by specifying that we want 2 bins - this divides the range of our data into equal-width bins, regardless of how many rows land in each.

In [25]:
pd.cut(df.consumption, bins=2).value_counts()

(14319.406, 26140.0]    217
(26140.0, 37937.0]      148
Name: consumption, dtype: int64

While descriptive, that's neither pretty nor easy to select on - `pd.cut` also supports passing a list of labels, so let's do that.

In [27]:
pd.cut(df.consumption, bins=2, labels=['low', 'high']).value_counts()

low     217
high    148
Name: consumption, dtype: int64

Same result, but with nice labels for ease of indexing
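
If you want to know where the equal-width edges ended up after replacing the intervals with labels, `pd.cut` can return them via `retbins=True`. A minimal sketch on hypothetical toy values:

```python
import pandas as pd

# Hypothetical toy values with one large outlier.
values = pd.Series([10, 20, 30, 100])

# retbins=True returns the bin edges alongside the binned data.
binned, edges = pd.cut(values, bins=2, labels=["low", "high"], retbins=True)
counts = binned.value_counts()

print(edges)   # lowest edge sits slightly below the minimum so it is included
print(counts)
```

The middle edge is simply the midpoint of the range, which is why a single outlier can push most rows into one bin.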

Of course we can also pass our own bins if we have irregular intervals

In [31]:
pd.cut(df.consumption, bins=[0, 25000, 99999], labels=['low', 'high'])

0      high
1      high
2      high
3      high
4      high
5      high
6      high
7      high
8       low
9      high
10     high
11     high
12     high
13     high
14     high
15     high
16     high
17     high
18     high
19     high
20     high
21     high
22      low
23     high
24     high
25      low
26     high
27      low
28      low
29     high
       ... 
335    high
336    high
337    high
338    high
339    high
340     low
341    high
342    high
343     low
344    high
345    high
346    high
347    high
348    high
349     low
350     low
351    high
352    high
353    high
354     low
355    high
356    high
357    high
358    high
359     low
360    high
361    high
362     low
363     low
364     low
Name: consumption, Length: 365, dtype: category
Categories (2, object): [low < high]
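
One detail to watch with explicit bins: by default `pd.cut` intervals are closed on the right, so a value of exactly 25,000 lands in `low`, whereas the earlier `np.where(df.consumption < 25000, 0, 1)` rule puts it in the high group. The sketch below uses three hypothetical values around the boundary to show the difference and how `right=False` changes it.

```python
import pandas as pd

# Hypothetical values straddling the 25,000 cutoff.
values = pd.Series([24999, 25000, 25001])
edges = [0, 25000, 99999]

# Default: intervals are (0, 25000] and (25000, 99999].
right_closed = pd.cut(values, bins=edges, labels=["low", "high"])

# right=False: intervals are [0, 25000) and [25000, 99999).
left_closed = pd.cut(values, bins=edges, labels=["low", "high"], right=False)

print(right_closed.tolist())
print(left_closed.tolist())
```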

Sometimes you want your bins to be based on quantiles instead of arbitrary intervals - `pd.qcut` makes that easy. 

In [37]:
df['consumption_group'] = pd.qcut(df.consumption, q=3, labels=['low', 'medium', 'high'])

We can of course specify our quantiles explicitly

In [34]:
pd.qcut(df.consumption, q=[0, 0.25, 0.75, 1], labels=['low', 'medium', 'high'])

0      medium
1        high
2        high
3        high
4        high
5      medium
6        high
7      medium
8      medium
9        high
10       high
11     medium
12     medium
13       high
14     medium
15       high
16       high
17       high
18       high
19       high
20       high
21     medium
22        low
23     medium
24       high
25        low
26       high
27     medium
28     medium
29     medium
        ...  
335      high
336    medium
337      high
338      high
339      high
340    medium
341    medium
342    medium
343    medium
344      high
345      high
346      high
347    medium
348    medium
349       low
350    medium
351    medium
352      high
353      high
354    medium
355    medium
356    medium
357      high
358    medium
359       low
360      high
361    medium
362    medium
363       low
364    medium
Name: consumption, Length: 365, dtype: category
Categories (3, object): [low < medium < high]
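
To make the difference between `pd.cut` and `pd.qcut` concrete, here is a sketch on a small, skewed toy series (hypothetical values). Equal-width bins are dominated by the outlier, while quantile bins hold roughly equal numbers of rows.

```python
import pandas as pd

# Hypothetical skewed data: eight small values and one outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])
labels = ["low", "medium", "high"]

width_counts = pd.cut(values, bins=3, labels=labels).value_counts()
quantile_counts = pd.qcut(values, q=3, labels=labels).value_counts()

print(width_counts)     # almost everything falls in the first bin
print(quantile_counts)  # three rows per bin
```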

# Get Dummies

One common task in data science is to one-hot encode categorical columns. As this is also known as "creating dummy variables", pandas has a built-in solution for that - `pd.get_dummies`. It takes your dataframe and one-hot encodes every column with an object, string, or category dtype.

In [38]:
pd.get_dummies(df)

Unnamed: 0,date,median_temp,min_temp,max_temp,precip,weekend,consumption,high_consumption,season_autumn,season_spring,season_summer,season_winter,consumption_group_low,consumption_group_medium,consumption_group_high
0,2015-01-01,27.30,23.9,32.5,0.0,0,25461,1,0,0,0,1,0,1,0
1,2015-01-02,27.02,24.5,33.5,0.0,0,28972,1,0,0,0,1,0,0,1
2,2015-01-03,24.82,22.4,29.9,0.0,1,30814,1,0,0,0,1,0,0,1
3,2015-01-04,23.98,21.5,28.6,1.2,1,29799,1,0,0,0,1,0,0,1
4,2015-01-05,23.82,21.0,28.3,0.0,0,28900,1,0,0,0,1,0,0,1
5,2015-01-06,23.78,20.1,30.5,12.2,0,28218,1,0,0,0,1,0,0,1
6,2015-01-07,24.00,19.5,33.7,0.0,0,29732,1,0,0,0,1,0,0,1
7,2015-01-08,24.90,19.5,32.8,48.6,0,28397,1,0,0,0,1,0,0,1
8,2015-01-09,28.20,21.9,34.0,4.4,0,24886,0,0,0,0,1,0,1,0
9,2015-01-10,26.76,22.1,34.2,0.0,1,37937,1,0,0,0,1,0,0,1


You can also specify which columns to encode, and pass the `drop_first` parameter if you're trying to avoid multicollinearity.

Note that the `high_consumption` column has an int dtype, so it is only encoded here because we list it explicitly in `columns`.

In [40]:
pd.get_dummies(df, drop_first=True, columns=['season', 'high_consumption'])

Unnamed: 0,date,median_temp,min_temp,max_temp,precip,weekend,consumption,consumption_group,season_spring,season_summer,season_winter,high_consumption_1
0,2015-01-01,27.30,23.9,32.5,0.0,0,25461,medium,0,0,1,1
1,2015-01-02,27.02,24.5,33.5,0.0,0,28972,high,0,0,1,1
2,2015-01-03,24.82,22.4,29.9,0.0,1,30814,high,0,0,1,1
3,2015-01-04,23.98,21.5,28.6,1.2,1,29799,high,0,0,1,1
4,2015-01-05,23.82,21.0,28.3,0.0,0,28900,high,0,0,1,1
5,2015-01-06,23.78,20.1,30.5,12.2,0,28218,high,0,0,1,1
6,2015-01-07,24.00,19.5,33.7,0.0,0,29732,high,0,0,1,1
7,2015-01-08,24.90,19.5,32.8,48.6,0,28397,high,0,0,1,1
8,2015-01-09,28.20,21.9,34.0,4.4,0,24886,medium,0,0,1,0
9,2015-01-10,26.76,22.1,34.2,0.0,1,37937,high,0,0,1,1
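
The effect of `drop_first` is easier to see on a tiny frame. The sketch below uses four rows taken from the sampled output above. Passing `dtype=int` is an assumption about what you want: from pandas 2.0 onward the dummy columns default to booleans, so this keeps the 0/1 output shown in this notebook.

```python
import pandas as pd

# Four rows borrowed from the sample output above.
toy = pd.DataFrame({
    "season": ["winter", "spring", "autumn", "winter"],
    "consumption": [32057, 23042, 29569, 22309],
})

# One column per season; exactly one of them is 1 in every row.
full = pd.get_dummies(toy, columns=["season"], dtype=int)

# drop_first removes the first category (alphabetically, 'autumn'),
# which is then encoded implicitly as all zeros.
reduced = pd.get_dummies(toy, columns=["season"], drop_first=True, dtype=int)

print(full)
print(reduced)
```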


# Shifting and Diffing

Sometimes, you want to compare running differences - what's the change between days or months? Pandas provides utility methods to do that in various forms

In [41]:
# Subtract the previous value
df.consumption.diff()

0          NaN
1       3511.0
2       1842.0
3      -1015.0
4       -899.0
5       -682.0
6       1514.0
7      -1335.0
8      -3511.0
9      13051.0
10     -1683.0
11    -10511.0
12      1247.0
13      4835.0
14     -6101.0
15      4214.0
16      7752.0
17     -7166.0
18     -1259.0
19      5862.0
20     -5997.0
21     -3335.0
22     -4011.0
23      6564.0
24      2740.0
25     -9568.0
26      8452.0
27     -7369.0
28        93.0
29      4149.0
        ...   
335     5942.0
336    -2066.0
337     1108.0
338     2938.0
339      329.0
340    -9405.0
341     4338.0
342     -576.0
343    -4204.0
344     7807.0
345    -1161.0
346     -391.0
347    -1057.0
348      486.0
349    -7555.0
350     3275.0
351     2705.0
352     5494.0
353    -2409.0
354    -5293.0
355     1994.0
356     -360.0
357     5104.0
358    -5264.0
359    -4353.0
360    10352.0
361    -6212.0
362    -3786.0
363    -1842.0
364     1979.0
Name: consumption, Length: 365, dtype: float64

In [44]:
# Subtract the value from 30 days before
df.consumption.diff(periods=30).dropna()

30      1569.0
31      3085.0
32     -6717.0
33      1856.0
34     -4162.0
35     -8268.0
36     -6911.0
37       496.0
38      5040.0
39    -13875.0
40    -15117.0
41      1062.0
42      -601.0
43     -7606.0
44      4507.0
45     -4970.0
46    -12347.0
47    -13125.0
48     -7873.0
49    -12205.0
50     -4563.0
51      5148.0
52      9041.0
53     -2656.0
54     -4129.0
55      3846.0
56     -7188.0
57      3638.0
58      3771.0
59       630.0
        ...   
335     9823.0
336     5664.0
337     8034.0
338     9317.0
339    12205.0
340     -955.0
341     -897.0
342    -1319.0
343    -5031.0
344     1171.0
345      312.0
346      541.0
347     1295.0
348     -769.0
349    -3547.0
350    -2627.0
351     3428.0
352     9576.0
353     9795.0
354    -5558.0
355    -5105.0
356     4806.0
357     9883.0
358     1189.0
359    -3330.0
360     3328.0
361    -8287.0
362    -8308.0
363      229.0
364    -2083.0
Name: consumption, Length: 335, dtype: float64

In [45]:
# Get the percentage change compared to 7 days ago
df.consumption.pct_change(periods=7)

0           NaN
1           NaN
2           NaN
3           NaN
4           NaN
5           NaN
6           NaN
7      0.115314
8     -0.141033
9      0.231161
10     0.216618
11    -0.109239
12    -0.043518
13     0.070396
14    -0.094130
15     0.203006
16    -0.006511
17    -0.158052
18     0.136814
19     0.301482
20    -0.084682
21     0.002760
22    -0.272363
23    -0.247864
24     0.018477
25    -0.264651
26    -0.146753
27    -0.224065
28    -0.120140
29     0.232326
         ...   
335    0.213066
336    0.123393
337    0.018427
338   -0.056163
339    0.070647
340    0.155005
341    0.129806
342   -0.109416
343   -0.192642
344    0.041575
345   -0.088503
346   -0.109579
347    0.203465
348    0.032620
349   -0.223864
350    0.061222
351   -0.120299
352    0.099970
353    0.032171
354   -0.117202
355   -0.062515
356    0.256671
357    0.297284
358   -0.027143
359   -0.325209
360    0.072360
361    0.050777
362   -0.168443
363   -0.226727
364   -0.289054
Name: consumption, Length: 365, dtype: float64

These are convenience methods built around `.shift` - `.shift` lets you easily compare a row with another row

In [48]:
# Shift all columns one step
df.shift(1).head()

Unnamed: 0,date,median_temp,min_temp,max_temp,precip,weekend,consumption,season,high_consumption,consumption_group
0,NaT,,,,,,,,,
1,2015-01-01,27.3,23.9,32.5,0.0,0.0,25461.0,winter,1.0,medium
2,2015-01-02,27.02,24.5,33.5,0.0,0.0,28972.0,winter,1.0,high
3,2015-01-03,24.82,22.4,29.9,0.0,1.0,30814.0,winter,1.0,high
4,2015-01-04,23.98,21.5,28.6,1.2,1.0,29799.0,winter,1.0,high


Notice how 25,461 was at index 0 previously and is now at index 1

In [47]:
df.consumption.head()

0    25461
1    28972
2    30814
3    29799
4    28900
Name: consumption, dtype: int64
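
To make the link between `.shift`, `.diff` and `.pct_change` concrete, the sketch below rebuilds both from a shifted copy of the first five consumption values shown above and checks that the results agree.

```python
import pandas as pd

# First five consumption values from the output above.
consumption = pd.Series([25461, 28972, 30814, 29799, 28900])

# diff: current value minus the previous one.
manual_diff = consumption - consumption.shift(1)

# pct_change: current value relative to the previous one, minus 1.
manual_pct = consumption / consumption.shift(1) - 1

# Both raise an AssertionError if the results differ.
pd.testing.assert_series_equal(consumption.diff(), manual_diff)
pd.testing.assert_series_equal(consumption.pct_change(), manual_pct)

print(manual_diff.tolist())
```

The first entry is `NaN` in every case because there is no earlier row to compare against.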

This can be a great way to create a lagged feature set for time series modelling. If you want to predict the next day's consumption, you can simply create a dataframe of shifted periods

In [50]:
pd.concat([df.consumption.shift(i).rename(f't_{-i}') for i in range(5)], axis=1).head(10)

Unnamed: 0,t_0,t_-1,t_-2,t_-3,t_-4
0,25461,,,,
1,28972,25461.0,,,
2,30814,28972.0,25461.0,,
3,29799,30814.0,28972.0,25461.0,
4,28900,29799.0,30814.0,28972.0,25461.0
5,28218,28900.0,29799.0,30814.0,28972.0
6,29732,28218.0,28900.0,29799.0,30814.0
7,28397,29732.0,28218.0,28900.0,29799.0
8,24886,28397.0,29732.0,28218.0,28900.0
9,37937,24886.0,28397.0,29732.0,28218.0
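
A lagged frame like this usually needs one more step before modelling: the leading rows contain `NaN` and have to be dropped, and the unshifted series becomes the target. The sketch below shows one way to do that on the first six consumption values; the column names `lag_1` to `lag_3` and `target` are illustrative choices, not anything pandas requires.

```python
import pandas as pd

# First six consumption values from the table above.
consumption = pd.Series([25461, 28972, 30814, 29799, 28900, 28218],
                        name="consumption")

# Passing a dict to concat uses its keys as the column names.
lags = pd.concat({f"lag_{i}": consumption.shift(i) for i in range(1, 4)}, axis=1)

# Attach the value to predict, then drop rows with incomplete history.
dataset = lags.assign(target=consumption).dropna()

X = dataset.drop(columns="target")  # features: the three previous days
y = dataset["target"]               # label: the current day

print(dataset)
```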


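A nice special case comes when using a DatetimeIndex like we did before. Then we can use `.tshift` and get some nice benefits. Note that `.tshift` was deprecated in pandas 1.1 and removed in 2.0; on newer versions, `.shift` with the `freq` argument moves the index in the same way.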

In [57]:
df = df.set_index('date')

We don't get any missing values, as we are simply incrementing the index by one period

In [60]:
df.tshift(1).head()

date,median_temp,min_temp,max_temp,precip,weekend,consumption,season,high_consumption,consumption_group
2015-01-02,27.3,23.9,32.5,0.0,0,25461,winter,1,medium
2015-01-03,27.02,24.5,33.5,0.0,0,28972,winter,1,high
2015-01-04,24.82,22.4,29.9,0.0,1,30814,winter,1,high
2015-01-05,23.98,21.5,28.6,1.2,1,29799,winter,1,high
2015-01-06,23.82,21.0,28.3,0.0,0,28900,winter,1,high


We can set different frequencies to shift by - for example, using 'M' "rounds up" to the nearest month end

In [65]:
df.tshift(1, freq='M').head()

date,median_temp,min_temp,max_temp,precip,weekend,consumption,season,high_consumption,consumption_group
2015-01-31,27.3,23.9,32.5,0.0,0,25461,winter,1,medium
2015-01-31,27.02,24.5,33.5,0.0,0,28972,winter,1,high
2015-01-31,24.82,22.4,29.9,0.0,1,30814,winter,1,high
2015-01-31,23.98,21.5,28.6,1.2,1,29799,winter,1,high
2015-01-31,23.82,21.0,28.3,0.0,0,28900,winter,1,high
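
On pandas versions where `.tshift` is no longer available, the same index-shifting behaviour can be sketched with `.shift(freq=...)`. The example below uses a small hand-built series (the third value is hypothetical) and `pd.offsets.MonthEnd()` rather than a string alias, since the month-end alias has changed between pandas versions.

```python
import pandas as pd

# Toy series with a DatetimeIndex; the last value is made up.
idx = pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-31"])
consumption = pd.Series([25461, 28972, 30000], index=idx)

# Passing freq moves the index, not the data, so no NaN values appear.
by_day = consumption.shift(1, freq="D")

# A month-end offset rolls each date forward to the next month end;
# a date already on a month end moves to the following one.
by_month_end = consumption.shift(1, freq=pd.offsets.MonthEnd())

print(by_day)
print(by_month_end)
```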
