# Pandas for beer - Drinking patterns in Sao Paulo

In [None]:
%matplotlib inline

In [None]:
import pandas as pd

# Reading data

Pandas has a fantastic ability to read data files - pretty much any modern data storage can be read in via Pandas.

In [None]:
#pd.read_

One of the main limitations of Python as a data science language used to be reading in data. As an example, this is how you would read in a CSV "the old-fashioned way":

# Old Way

In [None]:
import csv

with open('data/Consumo_cerveja.csv') as f:
    reader = csv.reader(f)
    data = [line for line in reader]

In [None]:
data[1]

In [None]:
# If you want to get fancy :-)
with open('data/Consumo_cerveja.csv') as f:
    reader = csv.DictReader(f)
    data = [line for line in reader]
data[0]
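To make the pain point concrete, here is a minimal sketch of what the csv module hands back. It uses a made-up two-row sample rather than the real file (the column names `temp` and `consumption` are assumptions for illustration): every field arrives as a string, so any number has to be cleaned and converted by hand.

```python
import csv
import io

# Hypothetical two-row sample in the style of the beer file (comma as decimal mark)
raw = 'date,temp,consumption\n2015-01-01,"27,3",25.461\n2015-01-02,"27,02",28.972\n'

rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0]["temp"], type(rows[0]["temp"]))  # every field is a str

# Manual clean-up: swap the decimal comma for a point, then convert to float
temps = [float(row["temp"].replace(",", ".")) for row in rows]
print(temps)
```

Multiply that by every numeric column and you can see why a dedicated reader is attractive.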

Let's try that the Pandas way!

# Pandas way

In [None]:
df = pd.read_csv('data/Consumo_cerveja.csv')
df.head()

My Brazilian Portuguese is a bit rusty, and those names are a bit long to type, so I want something shorter and in English.

In [None]:
translated_names = ['date',
                    'median_temp',
                    'min_temp',
                    'max_temp',
                    'precip',
                    'weekend',
                    'consumption']

In [None]:
df = pd.read_csv('data/Consumo_cerveja.csv', header=0, names=translated_names)
df.head()

Setting header=0 tells pandas that the first row holds the column names, and names tells it to overwrite them with my list of translated names - One line!
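Why pass header=0 at all? A small sketch on an inline sample (the two-column layout is an assumption, not the real file) shows the difference: when names is given without header, pandas assumes the file has no header row and keeps the original header line as data.

```python
import io
import pandas as pd

# Hypothetical miniature of the file: one header line, two data rows
raw = "Data,Temperatura Media (C)\n2015-01-01,27.3\n2015-01-02,27.02\n"

# header=0 marks the first line as a header to replace; names supplies the new labels
renamed = pd.read_csv(io.StringIO(raw), header=0, names=["date", "median_temp"])

# Without header=0 the original header line is read as an ordinary data row
no_header = pd.read_csv(io.StringIO(raw), names=["date", "median_temp"])

print(renamed.shape, no_header.shape)
print(no_header.head())
```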

# Data types
CSVs are inherently text based - in the old way, we saw that every value was a string, and I would have to spend time parsing each one.
Pandas does that conversion for you, but it's always good to check!

In [None]:
df.dtypes

Temperatures are definitely numbers and not 'object' - something went wrong here. Any guesses?

In [None]:
df = pd.read_csv('data/Consumo_cerveja.csv', header=0, names=translated_names, decimal=',', thousands='.')

In [None]:
df.dtypes

In [None]:
df.head()

CSVs are hard! Especially when working with data from different countries and standards.
All that string parsing reduced to two parameters in .read_csv!
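Here is a self-contained sketch of what those two parameters do, using a made-up sample where the comma is the decimal mark and the point is the thousands separator (column names and values are assumptions for illustration).

```python
import io
import pandas as pd

# Hypothetical sample: "27,3" means 27.3 and 25.461 means 25461
raw = 'date,temp,consumption\n2015-01-01,"27,3",25.461\n2015-01-02,"27,02",28.972\n'

naive = pd.read_csv(io.StringIO(raw))
fixed = pd.read_csv(io.StringIO(raw), decimal=",", thousands=".")

print(naive.dtypes)  # temp is left as text, consumption is misread as a small float
print(fixed.dtypes)  # both columns are numeric
print(fixed.loc[0, "temp"], fixed.loc[0, "consumption"])
```

Note that the naive read does not fail on the consumption column, it silently gives you numbers that are off by a factor of 1000.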

With all the ways CSVs can go wrong, it's important to double check your data after you've loaded it

![Fun Fact](images/fun_fact.resized.jpeg) While there is an official CSV standard no-one follows it! That's why pd.read_csv has 49 parameters...

In [None]:
df.tail()

Looks like some dirty data - what's gone wrong here?

In [None]:
df.info()

In [None]:
df.describe()

Looks like there are only 365 values total, but we've read in 941 rows - giving us rows full of NaNs!
Remember, you still have access to your normal shell toolbox when working in Jupyter Lab! (Or you could just open the file in your favorite text editor)

In [None]:
!tail data/Consumo_cerveja.csv

So this is no fault of Pandas - the data supplier actually included 576 empty lines.

![Fun Fact](images/fun_fact.resized.jpeg) This can often happen when exporting from Excel and you don't realize you have a lot of blank cells!

We know we have one year's worth of data, so we can simply read in 365 lines

In [None]:
df = pd.read_csv('data/Consumo_cerveja.csv', decimal=',', thousands='.', header=0, names=translated_names, nrows=365)
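If you don't know the row count up front, dropping the all-empty rows after loading is an alternative to nrows. This is a sketch on an inline sample, assuming the empty lines consist only of separators (as the tail output suggests); the column names are made up.

```python
import io
import pandas as pd

# Two real rows followed by three rows that contain nothing but a separator
raw = "date,temp\n2015-01-01,27.3\n2015-01-02,27.02\n,\n,\n,\n"

with_blanks = pd.read_csv(io.StringIO(raw))

# how="all" only drops rows where every column is missing
cleaned = with_blanks.dropna(how="all")

print(len(with_blanks), len(cleaned))
```

Using how="all" keeps rows that are only partially missing, which you probably want to inspect rather than throw away.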

In [None]:
df.dtypes

We still have one object left - the date. Pandas was built by a finance quant, so it has first-class support for handling datetimes. For now, just know that we can load in dates as datetimes, they will be useful later!

In [None]:
df = pd.read_csv('data/Consumo_cerveja.csv', decimal=',', thousands='.', header=0, names=translated_names, nrows=365, parse_dates=['date'])
df.dtypes
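As a small taste of why datetimes are worth the trouble, this sketch parses two dates from an inline sample (the consumption values are invented) and asks pandas for the weekday names through the .dt accessor.

```python
import io
import pandas as pd

# Hypothetical sample with ISO-formatted dates
raw = "date,consumption\n2015-01-01,25461\n2015-01-03,30814\n"
dated = pd.read_csv(io.StringIO(raw), parse_dates=["date"])

print(dated["date"].dtype)  # a datetime64 dtype instead of object

# The .dt accessor exposes datetime properties for the whole column at once
day_names = dated["date"].dt.day_name().tolist()
print(day_names)
```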

In [None]:
df.head()

In [None]:
df.describe()

Our data looks much better now! Let's start manipulating it!

# Indexing
First order of business is how to access our data. Pandas has many ways to get at your data!

We are going to cover the following:
- column selection
- loc
- iloc

In [None]:
# Choose one column
df['median_temp']

In [None]:
# Choose multiple columns
df[['median_temp', 'max_temp']]

In [None]:
# Choose all rows and the 'median_temp' column
df.loc[:, 'median_temp']

In [None]:
# Choose first row and 'median_temp' column
df.loc[0, 'median_temp']

In [None]:
# Choose first row and all columns
df.loc[0, :]

In [None]:
# Choose first row and two columns
df.loc[0, ['median_temp', 'min_temp']]

In [None]:
# Choose first row and second column
df.iloc[0, 1]

In [None]:
# Choose first row and second + third column
df.iloc[0, [1, 2]]
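In our dataframe the index labels happen to be 0, 1, 2, ... so loc and iloc look interchangeable for rows. A toy frame with a different index (all values here are made up) shows the distinction: loc looks up labels, iloc looks up positions.

```python
import pandas as pd

toy = pd.DataFrame(
    {"min_temp": [20.1, 21.5, 19.8], "max_temp": [30.2, 33.0, 29.4]},
    index=[10, 20, 30],
)

by_label = toy.loc[10, "max_temp"]   # row labelled 10, column named 'max_temp'
by_position = toy.iloc[0, 1]         # first row, second column

# Label slices include the end point; positional slices exclude it
label_slice = toy.loc[10:20, "min_temp"]
position_slice = toy.iloc[0:1, 0]

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```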

![Fun Fact](images/fun_fact.resized.jpeg) There are actually two main data structures in Pandas - the DataFrame and the Series! Think of a Series as a single row or column in a DataFrame - it's what we get back in our examples when we select out a single row or column.
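A quick way to see which of the two you are holding, sketched on a toy frame with invented values:

```python
import pandas as pd

toy = pd.DataFrame({"min_temp": [20.1, 21.5], "max_temp": [30.2, 33.0]})

column = toy["min_temp"]     # single label -> Series
row = toy.loc[0, :]          # single row -> Series indexed by the column names
frame = toy[["min_temp"]]    # list of labels -> DataFrame, even with one column

print(type(column).__name__, type(row).__name__, type(frame).__name__)
```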

# Boolean indexing

A very common operation is selecting a subset of rows based on some criteria. Pandas borrows "Boolean indexing" from numpy, which means to index using an array of True or False - e.g. show me all the rows where something is true. It's much easier to show by example!

In [None]:
# I want only rows where it's a weekend
df[df['weekend'] == 1]

In [None]:
# I want only the rows where min_temp is greater than 23
df[df['min_temp'] > 23]

We can also combine filters to set multiple conditions on our data

![Warning](images/warning.resized.png) Don't use the Python keywords "and", "not", "or" here. They try to reduce a whole Series to a single True or False, which pandas refuses to do and raises a ValueError.

Use the bitwise operator symbols: 
- & (and)
- | (or)
- ~ (not)
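The sketch below, on a toy frame with invented values, shows what a filter really is (a boolean Series), what happens if you reach for `and`, and how `~` negates a condition.

```python
import pandas as pd

toy = pd.DataFrame({"weekend": [0, 1, 1, 0], "min_temp": [24.0, 22.5, 25.1, 19.0]})

is_weekend = toy["weekend"] == 1   # boolean Series, one True/False per row
is_hot = toy["min_temp"] > 23

message = None
try:
    toy[is_weekend and is_hot]     # the keyword needs a single truth value
except ValueError as err:
    message = str(err)
print(message)

# ~ flips every value, & combines the two masks row by row
hot_weekdays = toy[~is_weekend & is_hot]
print(hot_weekdays)
```

When you write the comparisons inline instead of naming them first, wrap each one in parentheses: `&` and `|` bind more tightly than `==` and `>`.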



In [None]:
# I want only the rows where min_temp is greater than 23 and it's the weekend
df[(df['weekend'] == 1) & (df['min_temp'] > 23)]

In [None]:
# I want only the rows where min_temp is greater than 23 or it's the weekend
df[(df['min_temp'] > 23) | (df['weekend'] == 1)]

# Operations
Now we know how to select our data - let's start trying to glean some insight from our data! Pandas comes with a rich array of data aggregation methods built-in

In [None]:
# Make a new dataframe called temperatures which only has min_temp and max_temp
temperatures = df.loc[:, ['min_temp', 'max_temp']]

In [None]:
temperatures

In [None]:
# What's the mean min+max temperature?
temperatures.mean()

Now I know the mean of each column - but what if I want to ask a different question - what if I want to know the midpoint of the temperature per day?

In [None]:
# Take the mean across the columns
temperatures.mean(axis='columns')

In [None]:
# The default is to take the mean across the rows or 'index'
temperatures.mean(axis='index')
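The axis argument is easy to get backwards, so here is a tiny example with numbers you can check in your head (the values are invented): axis names the direction that gets collapsed.

```python
import pandas as pd

toy = pd.DataFrame({"min_temp": [20.0, 22.0], "max_temp": [30.0, 34.0]})

# Collapse the index: one result per column
per_column = toy.mean(axis="index")
# Collapse the columns: one result per row
per_row = toy.mean(axis="columns")

print(per_column)
print(per_row)
```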

Note that these operations merely return the result, there is no modification of the source data

In [None]:
temperatures

Often we do want to persist our results, so we can use them in other calculations. In Pandas, this is easy - simply assign to a column name.

![Warning](images/warning.resized.png) Assigning to a dataframe works just like in a dictionary - if the name already exists, then it will overwrite the values!

In [None]:
temperatures['mean'] = temperatures.mean(axis='columns')

Now we can use this new column in a new calculation. How far away is the mean from the median?

In [None]:
df['median_temp'] - temperatures['mean']

Note that pandas does elementwise operations, so you can also do +, -, / and * and they will work as you expect

In [None]:
# Get consumption in 1000's of liters
df['consumption'] / 1000
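One detail worth knowing: "elementwise" means matched by index label, not by position. A sketch with two hand-made Series whose indexes are in opposite order:

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=[0, 1, 2])
b = pd.Series([10.0, 20.0, 30.0], index=[2, 1, 0])

# Values are paired up by label: label 0 gets 1.0 + 30.0, not 1.0 + 10.0
total = a + b
print(total)
```

This is why subtracting a column of one dataframe from a column of another works as expected as long as they share the same index.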

# Saving Data

In addition to reading from many datasources, pandas can also write to many datasources. Now that we have cleaned up our data, we would like to export it again, so it's easy to read in.
There are a ton of choices, but the 5 I use most are:
- to_csv
- to_excel
- to_parquet
- to_sql
- to_hdf

.to_csv and .to_excel do what we expect, so I want to show off the other three
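For reference, a CSV round trip might look like the sketch below. It writes to an in-memory buffer instead of a file and uses a toy frame with invented values. Since CSV stores only text, the dates have to be parsed again on the way back in.

```python
import io
import pandas as pd

toy = pd.DataFrame(
    {"date": pd.to_datetime(["2015-01-01", "2015-01-02"]), "consumption": [25461, 28972]}
)

buffer = io.StringIO()
toy.to_csv(buffer, index=False)   # index=False avoids writing the row numbers as a column
buffer.seek(0)

restored = pd.read_csv(buffer, parse_dates=["date"])
print(restored.dtypes)
```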

# SQL
to_sql lets us dump the data directly into the database of our choice, great for working with big datasets! Pandas uses sqlalchemy under the hood, so we need to ensure sqlalchemy is installed and specify an engine, so pandas can connect.

In [None]:
from sqlalchemy import create_engine
# Build an engine from the connection URL so the import above is actually used
engine = create_engine('sqlite:///beer.db')

In [None]:
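# if_exists='replace' lets us re-run this cell without a "table already exists" error
df.to_sql('consumption', engine, index=False, if_exists='replace')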

Now we can use SQL to read back in only the parts we are interested in!

In [None]:
only_weekend = pd.read_sql("select * from consumption where weekend = 1", engine)
only_weekend.head()

In [None]:
# Note that SQLite doesn't have a dedicated datetime type, so dates come back as text - other DBs will do this correctly!
only_weekend.dtypes
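If you are stuck with SQLite, read_sql accepts parse_dates to convert the column back while loading. The sketch below is self-contained and swaps in a standard-library sqlite3 in-memory connection instead of the SQLAlchemy engine used above (pandas accepts both for SQLite); the rows are invented.

```python
import sqlite3
import pandas as pd

toy = pd.DataFrame(
    {
        "date": pd.to_datetime(["2015-01-03", "2015-01-05"]),
        "weekend": [1, 0],
        "consumption": [30814, 28900],
    }
)

# Assumption: a throwaway in-memory database, so nothing is written to disk
conn = sqlite3.connect(":memory:")
toy.to_sql("consumption", conn, index=False, if_exists="replace")

query = "select * from consumption where weekend = 1"
weekend_only = pd.read_sql(query, conn, parse_dates=["date"])
conn.close()

print(weekend_only.dtypes)
```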

# Parquet
to_parquet lets us save data as a [parquet](https://parquet.apache.org/) file - a binary columnar storage format. Columnar data storage is great for analysis, as we are usually interested in retrieving data by columns, as opposed to rows. Columnar data storage is also easier to compress, giving us storage benefits as well. Parquet is an Apache project and is widely used in the Hadoop ecosystem.

![warning](images/warning.resized.png) Parquet requires a parquet engine to be installed - either pyarrow or fastparquet

In [None]:
df.to_parquet('consumption.parquet')

In [None]:
parquet_df = pd.read_parquet('consumption.parquet')

# HDF5
HDF5 is another great format for large datasets - it also allows you to specify metadata and other neat tricks. In addition you can ask it to create an index of data columns, allowing you to query it using simple comparisons. It's great when you want to store large datasets, but still want to be able to query subsets of it.

![warning](images/warning.resized.png) HDF5 requires PyTables to be installed (the package is named `tables`)

In [None]:
# Pass key by keyword - newer pandas versions no longer accept it positionally
df.to_hdf('table_data.hdf', key='consumption', format='table', data_columns=True, complevel=9)

In [None]:
# The default 'fixed' format is faster to write, but can't be queried with where=
df.to_hdf('fixed_data.hdf', key='consumption', complevel=9)

In [None]:
pd.read_hdf('table_data.hdf', 'consumption', where='weekend == 1')