# Pandas for beer - Drinking patterns in São Paulo

In [1]:
%matplotlib inline

In [2]:
import pandas as pd

# Reading data

Pandas has a fantastic ability to read data files - pretty much any modern data storage format can be read in via Pandas.

In [3]:
#pd.read_
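
One way to see which readers are available without relying on tab completion is to list the read_* functions in the pandas namespace. This is only a minimal sketch; the exact list depends on the installed pandas version.

```python
import pandas as pd

# Collect every top-level pandas name that starts with "read_"
readers = sorted(name for name in dir(pd) if name.startswith("read_"))
print(readers)
```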

One of the main limitations of Python as a data science language used to be reading in data - as an example, this is how you would read in a CSV "the old-fashioned way":

# Old Way

In [5]:
import csv

with open('data/Consumo_cerveja.csv') as f:
    reader = csv.reader(f)
    data = [line for line in reader]

In [7]:
data[1]

['2015-01-01', '27,3', '23,9', '32,5', '0', '0', '25.461']

In [9]:
# If you want to get fancy :-)
with open('data/Consumo_cerveja.csv') as f:
    reader = csv.DictReader(f)
    data = [line for line in reader]
data[0]

OrderedDict([('Data', '2015-01-01'),
             ('Temperatura Media (C)', '27,3'),
             ('Temperatura Minima (C)', '23,9'),
             ('Temperatura Maxima (C)', '32,5'),
             ('Precipitacao (mm)', '0'),
             ('Final de Semana', '0'),
             ('Consumo de cerveja (litros)', '25.461')])

Let's try that the Pandas way!

# Pandas way

In [10]:
df = pd.read_csv('data/Consumo_cerveja.csv')
df.head()

Unnamed: 0,Data,Temperatura Media (C),Temperatura Minima (C),Temperatura Maxima (C),Precipitacao (mm),Final de Semana,Consumo de cerveja (litros)
0,2015-01-01,"27,3","23,9","32,5",0,0.0,25.461
1,2015-01-02,"27,02","24,5","33,5",0,0.0,28.972
2,2015-01-03,"24,82","22,4","29,9",0,1.0,30.814
3,2015-01-04,"23,98","21,5","28,6","1,2",1.0,29.799
4,2015-01-05,"23,82",21,"28,3",0,0.0,28.9


My Brazilian Portuguese is a bit rusty, and those names are a bit long to type, so I want something shorter and in English.

In [11]:
translated_names = ['date',
                    'median_temp',
                    'min_temp',
                    'max_temp',
                    'precip',
                    'weekend',
                    'consumption']

In [13]:
df = pd.read_csv('data/Consumo_cerveja.csv', header=0, names=translated_names)
df.head()

Unnamed: 0,date,median_temp,min_temp,max_temp,precip,weekend,consumption
0,2015-01-01,"27,3","23,9","32,5",0,0.0,25.461
1,2015-01-02,"27,02","24,5","33,5",0,0.0,28.972
2,2015-01-03,"24,82","22,4","29,9",0,1.0,30.814
3,2015-01-04,"23,98","21,5","28,6","1,2",1.0,29.799
4,2015-01-05,"23,82",21,"28,3",0,0.0,28.9


Passing header=0 tells pandas that the first row holds the column names, and names= overwrites them with my list of translated names - all in one line!
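
The interplay of header and names can be tried on a tiny in-memory file. The two-column sample below is hypothetical and only stands in for the real CSV.

```python
import io
import pandas as pd

# Hypothetical two-column sample standing in for the real file
raw = io.StringIO("Data,Final de Semana\n2015-01-01,0\n2015-01-03,1\n")

# header=0 marks the first line as a header row to discard;
# names= supplies the replacement column labels
sample = pd.read_csv(raw, header=0, names=["date", "weekend"])
print(sample.columns.tolist())
print(len(sample))
```

Without header=0, pandas would treat the original header line as a data row once names is given.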

# Data types
CSVs are inherently text-based - in the old way, we saw that everything was a string, and I would have to spend some time parsing those.
Pandas does that conversion for you, but it's always good to check!

In [14]:
df.dtypes

date            object
median_temp     object
min_temp        object
max_temp        object
precip          object
weekend        float64
consumption    float64
dtype: object

Temperatures are definitely numbers and not 'object' - something went wrong here. Any guesses?

In [18]:
df = pd.read_csv('data/Consumo_cerveja.csv', header=0, names=translated_names, decimal=',', thousands='.')

In [19]:
df.dtypes

date            object
median_temp    float64
min_temp       float64
max_temp       float64
precip         float64
weekend        float64
consumption    float64
dtype: object

In [20]:
df.head()

Unnamed: 0,date,median_temp,min_temp,max_temp,precip,weekend,consumption
0,2015-01-01,27.3,23.9,32.5,0.0,0.0,25461.0
1,2015-01-02,27.02,24.5,33.5,0.0,0.0,28972.0
2,2015-01-03,24.82,22.4,29.9,0.0,1.0,30814.0
3,2015-01-04,23.98,21.5,28.6,1.2,1.0,29799.0
4,2015-01-05,23.82,21.0,28.3,0.0,0.0,28900.0


In [None]:
pd.read_csv

CSVs are hard! Especially when working with data from different countries and standards.
All that string parsing reduced to two parameters in .read_csv!
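
To see the two parameters in isolation, here is a small sketch on a hypothetical in-memory sample that uses the same number conventions as the beer file (comma as decimal mark, period as thousands separator).

```python
import io
import pandas as pd

# Hypothetical sample: fields containing a comma are quoted so they
# survive the comma delimiter
raw = io.StringIO('temp,litres\n"27,3",25.461\n"27,02",28.972\n')

as_text = pd.read_csv(raw)          # defaults: temp stays text
raw.seek(0)                         # rewind the buffer for a second read
as_numbers = pd.read_csv(raw, decimal=",", thousands=".")

print(as_text.dtypes)
print(as_numbers.dtypes)
print(as_numbers)
```

Note that with the defaults the litres column still parses, but as 25.461 rather than 25461, which is the kind of silent error that is easy to miss.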

With all the ways CSVs can go wrong, it's important to double check your data after you've loaded it

![Fun Fact](images/fun_fact.resized.jpeg) While there is a published CSV specification (RFC 4180), hardly anyone follows it! That's why pd.read_csv has 49 parameters...

In [22]:
df.tail()

Unnamed: 0,date,median_temp,min_temp,max_temp,precip,weekend,consumption
936,,,,,,,
937,,,,,,,
938,,,,,,,
939,,,,,,,
940,,,,,,,


Looks like some dirty data - what's gone wrong here?

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 941 entries, 0 to 940
Data columns (total 7 columns):
date           365 non-null object
median_temp    365 non-null float64
min_temp       365 non-null float64
max_temp       365 non-null float64
precip         365 non-null float64
weekend        365 non-null float64
consumption    365 non-null float64
dtypes: float64(6), object(1)
memory usage: 51.5+ KB


In [24]:
df.describe()

Unnamed: 0,median_temp,min_temp,max_temp,precip,weekend,consumption
count,365.0,365.0,365.0,365.0,365.0,365.0
mean,21.226356,17.46137,26.611507,5.196712,0.284932,25401.367123
std,3.180108,2.826185,4.317366,12.417844,0.452001,4399.142703
min,12.9,10.6,14.5,0.0,0.0,14343.0
25%,19.02,15.3,23.8,0.0,0.0,22008.0
50%,21.38,17.9,26.9,0.0,0.0,24867.0
75%,23.28,19.6,29.4,3.2,1.0,28631.0
max,28.86,24.5,36.5,94.8,1.0,37937.0


Looks like there are only 365 values in total, but we've read in 941 rows - giving us rows full of NaNs!
Remember, you still have access to your normal shell toolbox when working in Jupyter Lab! (Or you could just open the file in your favorite text editor.)

In [25]:
!tail data/Consumo_cerveja.csv

,,,,,,
,,,,,,
,,,,,,
,,,,,,
,,,,,,
,,,,,,
,,,,,,
,,,,,,
,,,,,,
,,,,,,


So this is no fault of Pandas - the data supplier actually included 576 empty lines.

![Fun Fact](images/fun_fact.resized.jpeg) This can often happen when exporting from Excel and you don't realize you have a lot of blank cells!

We know we have one year's worth of data, so we can simply read in 365 lines

In [28]:
df = pd.read_csv('data/Consumo_cerveja.csv', decimal=',', thousands='.', header=0, names=translated_names, nrows=365)
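
nrows works here because the number of valid rows is known. When it is not, one possible alternative is to read everything and drop the rows that are entirely empty. A minimal sketch on a hypothetical in-memory sample:

```python
import io
import pandas as pd

# Hypothetical sample: two real rows followed by two comma-only rows
raw = io.StringIO("date,weekend\n2015-01-01,0\n2015-01-03,1\n,\n,\n")

padded = pd.read_csv(raw)
cleaned = padded.dropna(how="all")  # drop rows where every column is NaN

print(len(padded), len(cleaned))
print(cleaned.dtypes)
```

Columns that contained NaN keep their float dtype after the drop, so integer columns such as weekend may need an explicit astype(int) afterwards.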

In [29]:
df.dtypes

date            object
median_temp    float64
min_temp       float64
max_temp       float64
precip         float64
weekend          int64
consumption      int64
dtype: object

We still have one object left - the date. Pandas was built by a finance quant, so it has first-class support for handling datetimes. For now, just know that we can load dates as datetimes; they will be useful later!

In [31]:
df = pd.read_csv('data/Consumo_cerveja.csv', decimal=',', thousands='.', header=0, names=translated_names, nrows=365, parse_dates=['date'])
df.dtypes

date           datetime64[ns]
median_temp           float64
min_temp              float64
max_temp              float64
precip                float64
weekend                 int64
consumption             int64
dtype: object
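
As a small preview of why datetimes are worth the trouble, the sketch below uses the .dt accessor on a few dates copied from the first rows of the dataset. The Series is built by hand here rather than read from the file.

```python
import pandas as pd

# Sample dates taken from the first rows of the dataset
dates = pd.Series(pd.to_datetime(["2015-01-01", "2015-01-03", "2015-01-04"]))

# The .dt accessor exposes datetime properties for a whole column at once
day_names = dates.dt.day_name()
months = dates.dt.month

print(day_names.tolist())
print(months.tolist())
```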

In [32]:
df.head()

Unnamed: 0,date,median_temp,min_temp,max_temp,precip,weekend,consumption
0,2015-01-01,27.3,23.9,32.5,0.0,0,25461
1,2015-01-02,27.02,24.5,33.5,0.0,0,28972
2,2015-01-03,24.82,22.4,29.9,0.0,1,30814
3,2015-01-04,23.98,21.5,28.6,1.2,1,29799
4,2015-01-05,23.82,21.0,28.3,0.0,0,28900


In [33]:
df.describe()

Unnamed: 0,median_temp,min_temp,max_temp,precip,weekend,consumption
count,365.0,365.0,365.0,365.0,365.0,365.0
mean,21.226356,17.46137,26.611507,5.196712,0.284932,25401.367123
std,3.180108,2.826185,4.317366,12.417844,0.452001,4399.142703
min,12.9,10.6,14.5,0.0,0.0,14343.0
25%,19.02,15.3,23.8,0.0,0.0,22008.0
50%,21.38,17.9,26.9,0.0,0.0,24867.0
75%,23.28,19.6,29.4,3.2,1.0,28631.0
max,28.86,24.5,36.5,94.8,1.0,37937.0


Our data looks much better now! Let's start manipulating it!

# Indexing
First order of business is how to access our data. Pandas has many ways to get at your data!

We are going to cover the following:
- column selection
- loc
- iloc

In [110]:
# Choose one column
df['median_temp']

0      27.30
1      27.02
2      24.82
3      23.98
4      23.82
5      23.78
6      24.00
7      24.90
8      28.20
9      26.76
10     27.62
11     25.96
12     25.52
13     25.96
14     25.86
15     26.50
16     28.86
17     28.26
18     28.22
19     27.68
20     25.32
21     21.74
22     21.04
23     23.12
24     24.40
25     22.40
26     23.60
27     25.68
28     25.00
29     22.80
       ...  
335    22.10
336    22.44
337    22.76
338    24.80
339    23.12
340    20.04
341    21.70
342    23.96
343    24.00
344    24.04
345    23.92
346    24.54
347    26.28
348    25.66
349    22.04
350    23.32
351    26.42
352    23.74
353    22.84
354    23.12
355    24.60
356    27.46
357    24.72
358    23.58
359    23.34
360    24.00
361    22.64
362    21.68
363    21.38
364    24.76
Name: median_temp, Length: 365, dtype: float64

In [111]:
# Choose multiple columns
df[['median_temp', 'max_temp']]

Unnamed: 0,median_temp,max_temp
0,27.30,32.5
1,27.02,33.5
2,24.82,29.9
3,23.98,28.6
4,23.82,28.3
5,23.78,30.5
6,24.00,33.7
7,24.90,32.8
8,28.20,34.0
9,26.76,34.2


In [114]:
# Choose all rows (:) and the 'median_temp' column
df.loc[:, 'median_temp']

0      27.30
1      27.02
2      24.82
3      23.98
4      23.82
5      23.78
6      24.00
7      24.90
8      28.20
9      26.76
10     27.62
11     25.96
12     25.52
13     25.96
14     25.86
15     26.50
16     28.86
17     28.26
18     28.22
19     27.68
20     25.32
21     21.74
22     21.04
23     23.12
24     24.40
25     22.40
26     23.60
27     25.68
28     25.00
29     22.80
       ...  
335    22.10
336    22.44
337    22.76
338    24.80
339    23.12
340    20.04
341    21.70
342    23.96
343    24.00
344    24.04
345    23.92
346    24.54
347    26.28
348    25.66
349    22.04
350    23.32
351    26.42
352    23.74
353    22.84
354    23.12
355    24.60
356    27.46
357    24.72
358    23.58
359    23.34
360    24.00
361    22.64
362    21.68
363    21.38
364    24.76
Name: median_temp, Length: 365, dtype: float64

In [115]:
# Choose first row and 'median_temp' column
df.loc[0, 'median_temp']

27.300000000000001

In [34]:
# Choose first row and all columns
df.loc[0, :]

date           2015-01-01 00:00:00
median_temp                   27.3
min_temp                      23.9
max_temp                      32.5
precip                           0
weekend                          0
consumption                  25461
Name: 0, dtype: object

In [35]:
# Choose first row and two columns
df.loc[0, ['median_temp', 'min_temp']]

median_temp    27.3
min_temp       23.9
Name: 0, dtype: object

In [36]:
# Choose first row and second column
df.iloc[0, 1]

27.3

In [37]:
# Choose first row and second + third column
df.iloc[0, [1, 2]]

median_temp    27.3
min_temp       23.9
Name: 0, dtype: object
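
In the examples above loc and iloc look interchangeable for rows, because the index labels happen to be 0, 1, 2, ... The difference shows up as soon as the labels are something else. A minimal sketch with a hypothetical frame:

```python
import pandas as pd

# Hypothetical frame whose index labels are not 0, 1, 2
temps = pd.DataFrame(
    {"min_temp": [23.9, 24.5, 22.4], "max_temp": [32.5, 33.5, 29.9]},
    index=[10, 20, 30],
)

by_label = temps.loc[10, "max_temp"]   # row labelled 10, column by name
by_position = temps.iloc[0, 1]         # first row, second column

# Label slices include the end label; positional slices exclude the end
label_slice = temps.loc[10:20]
position_slice = temps.iloc[0:1]

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```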

![Fun Fact](images/fun_fact.resized.jpeg) There are actually two main data structures in Pandas - the DataFrame and the Series! Think of a Series as a single row or column in a DataFrame - it's what we get back in our examples when we select a single row or column.
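
Which of the two you get back depends on how you select. A short sketch with a hypothetical two-row frame:

```python
import pandas as pd

frame = pd.DataFrame({"min_temp": [23.9, 24.5], "max_temp": [32.5, 33.5]})

one_column = frame["min_temp"]    # single label -> Series
sub_frame = frame[["min_temp"]]   # list of labels -> DataFrame, even with one column
one_row = frame.loc[0, :]         # single row -> Series indexed by column name

print(type(one_column).__name__, type(sub_frame).__name__, type(one_row).__name__)
```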

# Boolean indexing

A very common operation is selecting a subset of rows based on some criteria. Pandas borrows "Boolean indexing" from NumPy: you index with an array of True and False values, and only the rows marked True are returned - e.g. show me all the rows where something is true. It's much easier to show by example!

In [129]:
# I want only rows where it's a weekend
df[df['weekend'] == 1]

Unnamed: 0,date,median_temp,min_temp,max_temp,precip,weekend,consumption
2,2015-01-03,24.82,22.4,29.9,0.0,1,30814
3,2015-01-04,23.98,21.5,28.6,1.2,1,29799
9,2015-01-10,26.76,22.1,34.2,0.0,1,37937
10,2015-01-11,27.62,22.2,34.8,3.4,1,36254
16,2015-01-17,28.86,22.0,35.8,0.0,1,37690
17,2015-01-18,28.26,23.4,35.6,0.0,1,30524
23,2015-01-24,23.12,19.0,29.4,13.0,1,28348
24,2015-01-25,24.40,18.1,30.0,0.0,1,31088
30,2015-01-31,21.64,18.5,24.3,0.2,1,27030
31,2015-02-01,24.16,20.6,28.0,0.0,1,32057


In [38]:
# I want only the rows where min_temp is greater than 23
df[df['min_temp'] > 23]

Unnamed: 0,date,median_temp,min_temp,max_temp,precip,weekend,consumption
0,2015-01-01,27.3,23.9,32.5,0.0,0,25461
1,2015-01-02,27.02,24.5,33.5,0.0,0,28972
17,2015-01-18,28.26,23.4,35.6,0.0,1,30524
19,2015-01-20,27.68,23.3,35.6,0.6,0,35127
42,2015-02-12,27.66,23.1,32.7,0.0,0,26389
322,2015-11-19,26.16,23.3,30.4,0.0,0,22960
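
It can help to look at the comparison on its own: the expression inside the brackets is just a Series of True/False values. A minimal sketch on a hypothetical three-row sample that reuses the notebook's column names:

```python
import pandas as pd

# Hypothetical three-row sample
sample = pd.DataFrame({"min_temp": [23.9, 22.4, 23.4], "weekend": [0, 1, 1]})

mask = sample["min_temp"] > 23   # one True/False value per row
selected = sample[mask]          # keeps only the rows where mask is True
n_matches = mask.sum()           # True counts as 1, so this counts the matches

print(mask.tolist())
print(selected)
print(n_matches)
```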


We can also combine filters to set multiple conditions on our data

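![Warning](images/warning.resized.png) Don't use the Python keywords "and", "or" and "not" to combine filters. They ask for a single truth value from an entire Series, which pandas refuses to guess and answers with a ValueError. The element-wise alternatives bind more tightly than comparisons such as > and ==, so wrap each condition in parentheses.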

Use the bitwise operator symbols: 
- & (and)
- | (or)
- ~ (not)
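
A short sketch of what goes wrong with the keywords and how the symbols behave, using a hypothetical three-row sample:

```python
import pandas as pd

sample = pd.DataFrame({"min_temp": [23.9, 22.4, 23.4], "weekend": [0, 1, 1]})
warm = sample["min_temp"] > 23
is_weekend = sample["weekend"] == 1

try:
    warm and is_weekend            # asks for one truth value of a whole Series
except ValueError as err:
    print("ValueError:", err)

both = sample[warm & is_weekend]   # element-wise AND
weekday = sample[~is_weekend]      # element-wise NOT

print(both)
print(weekday)
```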



In [39]:
# I want only the rows where min_temp is greater than 23 and it's the weekend
df[(df['weekend'] == 1) & (df['min_temp'] > 23)]

Unnamed: 0,date,median_temp,min_temp,max_temp,precip,weekend,consumption
17,2015-01-18,28.26,23.4,35.6,0.0,1,30524


In [40]:
# I want only the rows where min_temp is greater than 23 or it's the weekend
df[(df['min_temp'] > 23) | (df['weekend'] == 1)]

Unnamed: 0,date,median_temp,min_temp,max_temp,precip,weekend,consumption
0,2015-01-01,27.30,23.9,32.5,0.0,0,25461
1,2015-01-02,27.02,24.5,33.5,0.0,0,28972
2,2015-01-03,24.82,22.4,29.9,0.0,1,30814
3,2015-01-04,23.98,21.5,28.6,1.2,1,29799
9,2015-01-10,26.76,22.1,34.2,0.0,1,37937
10,2015-01-11,27.62,22.2,34.8,3.4,1,36254
16,2015-01-17,28.86,22.0,35.8,0.0,1,37690
17,2015-01-18,28.26,23.4,35.6,0.0,1,30524
19,2015-01-20,27.68,23.3,35.6,0.6,0,35127
23,2015-01-24,23.12,19.0,29.4,13.0,1,28348


# Operations
Now we know how to select our data - let's start trying to glean some insight from our data! Pandas comes with a rich array of data aggregation methods built-in

In [43]:
# Make a new dataframe called temperatures which only has min_temp and max_temp
temperatures = df.loc[:, ['min_temp', 'max_temp']]

In [42]:
temperatures

Unnamed: 0,min_temp,max_temp
0,23.9,32.5
1,24.5,33.5
2,22.4,29.9
3,21.5,28.6
4,21.0,28.3
5,20.1,30.5
6,19.5,33.7
7,19.5,32.8
8,21.9,34.0
9,22.1,34.2


In [46]:
# What's the mean min+max temperature?
temperatures.mean()

min_temp    17.461370
max_temp    26.611507
dtype: float64

Now I know the mean of each column. But what if I want to ask a different question: what is the midpoint between the minimum and maximum temperature for each day?

In [49]:
# Take the mean across the columns
temperatures.mean(axis='columns')

0      28.20
1      29.00
2      26.15
3      25.05
4      24.65
5      25.30
6      26.60
7      26.15
8      27.95
9      28.15
10     28.50
11     28.40
12     28.00
13     27.65
14     27.15
15     27.50
16     28.90
17     29.50
18     29.60
19     29.45
20     26.80
21     22.65
22     22.30
23     24.20
24     24.05
25     23.80
26     24.60
27     25.00
28     25.35
29     23.90
       ...  
335    23.80
336    23.15
337    24.05
338    25.05
339    24.30
340    20.95
341    23.00
342    23.75
343    24.95
344    25.50
345    25.70
346    25.15
347    26.80
348    26.70
349    22.75
350    23.80
351    26.55
352    25.55
353    24.30
354    24.35
355    26.00
356    27.25
357    26.00
358    24.40
359    23.80
360    24.65
361    23.90
362    22.20
363    20.85
364    24.60
Length: 365, dtype: float64

In [50]:
# The default, axis='index', averages down the rows and gives one value per column
temperatures.mean(axis='index')

min_temp    17.461370
max_temp    26.611507
dtype: float64
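
The axis argument names the direction that gets collapsed, which is easy to mix up. A tiny sketch with hypothetical round numbers makes the two results easy to check by hand:

```python
import pandas as pd

frame = pd.DataFrame({"min_temp": [20.0, 22.0], "max_temp": [30.0, 34.0]})

per_column = frame.mean(axis="index")    # collapses the rows: one value per column
per_row = frame.mean(axis="columns")     # collapses the columns: one value per row

print(per_column.to_dict())
print(per_row.tolist())
```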

Note that these operations merely return the result; the source data is not modified.

In [52]:
temperatures

Unnamed: 0,min_temp,max_temp
0,23.9,32.5
1,24.5,33.5
2,22.4,29.9
3,21.5,28.6
4,21.0,28.3
5,20.1,30.5
6,19.5,33.7
7,19.5,32.8
8,21.9,34.0
9,22.1,34.2


Often we do want to persist our results, so we can use them in other calculations. In Pandas, this is easy - simply assign to a column name.

![Warning](images/warning.resized.png) Assigning to a dataframe works just like in a dictionary - if the name already exists, then it will overwrite the values!

In [53]:
temperatures['mean'] = temperatures.mean(axis='columns')
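
The dictionary-like behaviour can be sketched on a small hypothetical frame. The column name spread is made up for this illustration.

```python
import pandas as pd

temps = pd.DataFrame({"min_temp": [20.0, 22.0], "max_temp": [30.0, 34.0]})

# New name: a column is added
temps["spread"] = temps["max_temp"] - temps["min_temp"]
added = temps["spread"].tolist()

# Existing name: the old values are silently replaced
temps["spread"] = temps["spread"] / 2
overwritten = temps["spread"].tolist()

print(temps.columns.tolist())
print(added, overwritten)
```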

Now we can use this new column in a new calculation. How far is the recorded median_temp from this min/max midpoint?

In [54]:
df['median_temp'] - temperatures['mean']

0     -0.90
1     -1.98
2     -1.33
3     -1.07
4     -0.83
5     -1.52
6     -2.60
7     -1.25
8      0.25
9     -1.39
10    -0.88
11    -2.44
12    -2.48
13    -1.69
14    -1.29
15    -1.00
16    -0.04
17    -1.24
18    -1.38
19    -1.77
20    -1.48
21    -0.91
22    -1.26
23    -1.08
24     0.35
25    -1.40
26    -1.00
27     0.68
28    -0.35
29    -1.10
       ... 
335   -1.70
336   -0.71
337   -1.29
338   -0.25
339   -1.18
340   -0.91
341   -1.30
342    0.21
343   -0.95
344   -1.46
345   -1.78
346   -0.61
347   -0.52
348   -1.04
349   -0.71
350   -0.48
351   -0.13
352   -1.81
353   -1.46
354   -1.23
355   -1.40
356    0.21
357   -1.28
358   -0.82
359   -0.46
360   -0.65
361   -1.26
362   -0.52
363    0.53
364    0.16
Length: 365, dtype: float64

Note that pandas does elementwise operations, so you can also do +, -, / and * and they will work as you expect

In [57]:
# Get consumption in thousands of liters
df['consumption'] / 1000

0      25.461
1      28.972
2      30.814
3      29.799
4      28.900
5      28.218
6      29.732
7      28.397
8      24.886
9      37.937
10     36.254
11     25.743
12     26.990
13     31.825
14     25.724
15     29.938
16     37.690
17     30.524
18     29.265
19     35.127
20     29.130
21     25.795
22     21.784
23     28.348
24     31.088
25     21.520
26     29.972
27     22.603
28     22.696
29     26.845
        ...  
335    30.471
336    28.405
337    29.513
338    32.451
339    32.780
340    23.375
341    27.713
342    27.137
343    22.933
344    30.740
345    29.579
346    29.188
347    28.131
348    28.617
349    21.062
350    24.337
351    27.042
352    32.536
353    30.127
354    24.834
355    26.828
356    26.468
357    31.572
358    26.308
359    21.955
360    32.307
361    26.095
362    22.309
363    20.467
364    22.446
Name: consumption, Length: 365, dtype: float64
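
One detail worth knowing about element-wise operations: when both operands are Series, pandas pairs the values by index label rather than by position. A minimal sketch, where the scale Series is hypothetical and deliberately lacks label 0:

```python
import pandas as pd

litres = pd.Series([25461, 28972, 30814], index=[0, 1, 2])
scale = pd.Series([2, 2], index=[1, 2])   # hypothetical, no entry for label 0

# A scalar is applied to every element
in_thousands = litres / 1000

# Two Series are matched by index label; labels found on only one side give NaN
ratio = litres / scale

print(in_thousands.tolist())
print(ratio)
```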

# Saving Data

In addition to reading from many data sources, pandas can also write to many data sources. Now that we have cleaned up our data, we would like to export it again, so it's easy to read in.
There are a ton of choices, but the 5 I use most are:
- to_csv
- to_excel
- to_parquet
- to_sql
- to_hdf

.to_csv and .to_excel do what we expect, so I want to show off the other three

# SQL
to_sql lets us dump the data directly into the database of our choice, great for working with big datasets! Pandas uses SQLAlchemy under the hood, so we need to ensure SQLAlchemy is installed and specify an engine, so pandas can connect.

In [61]:
from sqlalchemy import create_engine
engine = create_engine('sqlite:///beer.db')

In [63]:
df.to_sql('consumption', engine, index=False)

Now we can use SQL to read back in only the parts we are interested in!

In [69]:
only_weekend = pd.read_sql("select * from consumption where weekend = 1", engine)
only_weekend.head()

Unnamed: 0,date,median_temp,min_temp,max_temp,precip,weekend,consumption
0,2015-01-03 00:00:00.000000,24.82,22.4,29.9,0.0,1,30814
1,2015-01-04 00:00:00.000000,23.98,21.5,28.6,1.2,1,29799
2,2015-01-10 00:00:00.000000,26.76,22.1,34.2,0.0,1,37937
3,2015-01-11 00:00:00.000000,27.62,22.2,34.8,3.4,1,36254
4,2015-01-17 00:00:00.000000,28.86,22.0,35.8,0.0,1,37690


In [67]:
# Note that SQLite doesn't have a dedicated datetime type, so dates come back as strings - other DBs will do this correctly!
only_weekend.dtypes

date            object
median_temp    float64
min_temp       float64
max_temp       float64
precip         float64
weekend          int64
consumption      int64
dtype: object
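
If the dates are needed as datetimes again, read_sql accepts a parse_dates argument. The sketch below is an assumption-laden stand-in for the cells above: it uses an in-memory sqlite3 connection instead of the SQLAlchemy engine, and a hypothetical two-row sample with the dates stored as text, which is how they end up in SQLite.

```python
import sqlite3
import pandas as pd

# Hypothetical sample with dates as ISO-formatted text
sample = pd.DataFrame({"date": ["2015-01-03", "2015-01-05"], "weekend": [1, 0]})

# In-memory SQLite database, used only to keep the sketch self-contained
conn = sqlite3.connect(":memory:")
sample.to_sql("consumption", conn, index=False)

restored = pd.read_sql(
    "select * from consumption where weekend = 1",
    conn,
    parse_dates=["date"],   # convert the stored text back into datetimes
)
conn.close()

print(restored.dtypes)
```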

# Parquet
to_parquet lets us save data as a [Parquet](https://parquet.apache.org/) file - a binary columnar storage format. Columnar data storage is great for analysis, as we are usually interested in retrieving data by columns, as opposed to rows. Columnar data storage is also easier to compress, giving us storage benefits as well. Parquet is an Apache project and is thus used widely in the Hadoop ecosystem.

to_parquet requires a Parquet engine such as pyarrow to be installed.

In [70]:
df.to_parquet('consumption.parquet')

In [71]:
parquet_df = pd.read_parquet('consumption.parquet')

# HDF5
HDF5 is another great format for large datasets - it also allows you to specify metadata and other neat tricks. In addition you can ask it to create an index of data columns, allowing you to query it using simple comparisons. It's great when you want to store large datasets, but still want to be able to query subsets of it.

HDF5 support requires PyTables to be installed.

In [82]:
df.to_hdf('table_data.hdf', key='consumption', format='table', data_columns=True, complevel=9)

In [83]:
df.to_hdf('fixed_data.hdf', key='consumption', complevel=9)

In [80]:
# where= queries only work on files written with format='table'
pd.read_hdf('table_data.hdf', key='consumption', where='weekend == 1')

Unnamed: 0,date,median_temp,min_temp,max_temp,precip,weekend,consumption
2,2015-01-03,24.82,22.4,29.9,0.0,1,30814
3,2015-01-04,23.98,21.5,28.6,1.2,1,29799
9,2015-01-10,26.76,22.1,34.2,0.0,1,37937
10,2015-01-11,27.62,22.2,34.8,3.4,1,36254
16,2015-01-17,28.86,22.0,35.8,0.0,1,37690
17,2015-01-18,28.26,23.4,35.6,0.0,1,30524
23,2015-01-24,23.12,19.0,29.4,13.0,1,28348
24,2015-01-25,24.40,18.1,30.0,0.0,1,31088
30,2015-01-31,21.64,18.5,24.3,0.2,1,27030
31,2015-02-01,24.16,20.6,28.0,0.0,1,32057
