<center>
    
# R406: Applied Economic Modelling with Python

</center>

<br> <br> 

<center>

## Pandas

</center>

<br><br> 

<center>
<b> Andrey Vassilev </b>
</center>
 

# Outline

1. An overview of Pandas
2. Main data structures
3. Basic operations on Pandas objects
4. Importing data
  - reading and writing CSV files
  - reading and writing Excel files
  - Pointers to other functionalities (**Note:** references only, no code examples)
     - SQL queries
     - Stata, SAS and SPSS (via `savReaderWriter`) files
5. Merging data
6. Cleaning and transforming data
7. Reshaping data
8. Aggregation
9. An overview of the split-apply-combine concept
10. Pivot tables

# Main facts about Pandas

- Pandas is a Python package that offers rich data processing and analysis functionality.
- In particular, it can work with series of observations and tabular heterogeneous data (think a dataset consisting of several time series or observations on different subjects).
- Pandas allows us to clean, transform, filter, sort etc. a dataset.
- Pandas also allows us to split, merge and extract various representations of our data.
- Pandas can interact with different data sources.
- It has sophisticated date-time functionality.

# Pandas data structures

- The main data structures in Pandas are: 
    - `Series` 
    - `DataFrame` 
- The `Series` is 1D and can be used as a building block of a `DataFrame`
- The `DataFrame` is 2D 

To start exploring the various Pandas structures we first import the relevant modules:

In [None]:
import pandas as pd # another established convention
import numpy as np

# Series

A `Series` can be created from a list.

In [None]:
s = pd.Series([1,4,-2,0,np.nan,3])
s

A `Series` object has several main characteristics.

It has an index.

In [None]:
s.index

This type of indexing is trivial because it coincides with the familiar indexing for sequences. We can substitute it with more interesting indexes:

In [None]:
dt = pd.date_range(start="2017-01-11",periods=len(s),freq="M") # Month-end frequency: the first date is the first month end on or after Jan 11, 2017
print(dt)
s.index=dt
s

You can inspect the contents of a `Series` by using the `head()` and `tail()` methods.

In [None]:
s.head()

In [None]:
s.head(3) # Try changing it to 2 or 4

In [None]:
s.tail()

In [None]:
s.values # You can extract the values as an array

In [None]:
s.iloc[4] = 8 # assignment by position; plain s[4] = 8 is deprecated when the index is not integer-based
s.describe()

A `Series` can be created from a dictionary. The dictionary keys are used as the index. Recent Pandas versions keep the keys in insertion order; older versions sorted them.

In [None]:
s = pd.Series({"a":1,"b":3,"f":4, "c":-2.2})
s

You can also create it by simultaneously passing values and index.

In [None]:
s = pd.Series(np.random.rand(5),index = ["e"+str(i) for i in range(1,6)])
s

An element of a `Series` can be accessed by its index, "dictionary-style"...

In [None]:
s['e2']

... or by its position:

In [None]:
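s.iloc[1] # plain s[1] relies on a positional fallback that recent Pandas versions deprecate; iloc is covered below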

A `Series` object also supports slicing:

In [None]:
s[1:3]

Slicing can be done with respect to the index elements (notice that it is inclusive, unlike position-based slicing):

In [None]:
s['e1':'e3']

# `DataFrame`s

The `DataFrame` is the Pandas data structure that holds tabular data. It can be created from a NumPy array and takes an index argument, just like a `Series`. In addition, it takes a `columns` argument specifying column names.

In [None]:
df = pd.DataFrame(np.array([[2,1,3,5],[34,36,29,35]]).T,
                  index = pd.date_range(start="2005",periods=4,freq='A'),
                  columns = ['A','B'])
df

In [None]:
df.index

In [None]:
df.columns

A `DataFrame` column can be accessed by direct indexing:

In [None]:
df['B']

However, a slice is assumed to refer to the index, i.e. to rows:

In [None]:
df['A':'B'] 
# Notice that the error message says that 
# the string provided is not a date!

Similarly, you need a slice to access rows. The following is an error because Pandas assumes you are trying to provide a column name:

In [None]:
df['2005-12-31']

A slice over the index labels, on the other hand, works:

In [None]:
df['2005-12-31':'2006-12-31']

Or a trivial type of slice if you need to access a single row:

In [None]:
df['2005-12-31':'2005-12-31']

Since the previous conventions may be inconvenient in some use cases, Pandas offers a more flexible way to access elements.

The `iloc` indexer (integer location) allows us to select by position:

In [None]:
df.iloc[0]

In [None]:
type(df.iloc[0])

In [None]:
df.iloc[-1]

In [None]:
df.iloc[1:3]

In [None]:
df.iloc[:3]

It can be used to access rows and columns simultaneously:

In [None]:
df.iloc[1,0]

In [None]:
df.iloc[:,0]

In [None]:
df.iloc[2,:] # equivalent to df.iloc[2]

The `loc` functionality allows us to refer by label instead of position.

In [None]:
df.loc['20071231'] # You can also provide the date string in this format

In [None]:
df.loc['20071231':'20071231']

In [None]:
df.loc['20061231':'20071231']

In [None]:
df.loc['20061231':]

In [None]:
df.loc[:,'B']

Incidentally, `iloc` and `loc` also work for `Series` objects.
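A minimal sketch of the two indexers on a `Series`, which also shows that position-based slices exclude the end point while label-based slices include it. The series `demo` is made up for this illustration.

```python
import pandas as pd

# Hypothetical series, used only for this illustration
demo = pd.Series([10, 20, 30, 40, 50], index=["e1", "e2", "e3", "e4", "e5"])

second_by_position = demo.iloc[1]   # element at position 1
second_by_label = demo.loc["e2"]    # element with label "e2"

by_position = demo.iloc[0:2]        # end point excluded: e1, e2
by_label = demo.loc["e1":"e3"]      # end point included: e1, e2, e3

print(second_by_position, second_by_label)
print(list(by_position.index), list(by_label.index))
```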

It is possible to select a custom subset of the data by passing a list.

In [None]:
df.iloc[[0,2,3],:]

In [None]:
tmpidx = df.index # change the index temporarily
                  # to avoid complications with dates
df.index = list('abcd')
df.loc[['a','c','d'],'B':]

In [None]:
# restore index
df.index = tmpidx
del tmpidx

## Ways of creating `DataFrame`s

Apart from passing an array, we can also pass a list of lists:

In [None]:
df1 = pd.DataFrame([[2,1,3,5],[34,36,29,35]],
                  index = ['A','B'],
                  columns = range(4))
df1

Or we can create the `DataFrame` from a dictionary of `Series` objects.

In [None]:
s1 = pd.Series(np.random.rand(6),index = range(6,0,-1)) # We can index backward
s2 = pd.Series(np.random.rand(6),index = range(6,0,-1)) 
df1 = pd.DataFrame({'Ser1':s1,'Ser2':s2})
df1['Ser1']

Notice what happens when the indexes of the series are different:

In [None]:
s1 = pd.Series(np.random.rand(6),index = range(6,0,-1)) # We can index backward
s3 = pd.Series(np.random.rand(6),index = list('abcdef')) 
df2 = pd.DataFrame({'Ser1':s1,'Ser3':s3})
df2

## Indexes

The last example hints at some of the properties of indexes. They behave like ordered sets and are designed this way in order to facilitate operations like various joins of datasets.
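As a small illustration of why label-aware indexes matter, arithmetic between two `Series` matches elements by label rather than by position. The series `a` and `b` below are made up for this sketch.

```python
import pandas as pd

# Two hypothetical series whose indexes only partially overlap
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Labels are matched first; "x" has no partner in b and becomes NaN
total = a + b
print(total)
```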

First, an index can be created as an independent object and passed to a `Series` or `DataFrame` constructor later.

In [None]:
i1 = pd.Index(list('abcde'))
i2 = pd.Index(list('acdghkl'))
print(i1) 
print(i2)

You can access the elements of an index by position or using a slice:

In [None]:
i1[2]

In [None]:
i2[1:5:2]

But indexes are immutable. This is a conscious design choice to safeguard the integrity of data transformations and merges.

In [None]:
# This raises an error
i2[2] = 'z'
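If a modified index is needed, one option is to build a new one instead of changing the existing object. The sketch below catches the error and then goes through a plain list; the names `idx` and `new_idx` are illustrative only.

```python
import pandas as pd

idx = pd.Index(list("acdghkl"))

try:
    idx[2] = "z"                 # item assignment is not supported
except TypeError as err:
    print("TypeError:", err)

# Build a modified copy by going through an ordinary (mutable) list
labels = idx.tolist()
labels[2] = "z"
new_idx = pd.Index(labels)
print(new_idx)
```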

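Indexes also support set operations (again useful when combining datasets). Recent Pandas versions expose them as methods; the operator forms `&`, `|` and `^` no longer perform set operations on `Index` objects.

In [None]:
i1.intersection(i2)

In [None]:
i1.union(i2)

In [None]:
i1.symmetric_difference(i2)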

In [None]:
i1.difference(i2) # i1-i2 is deprecated for Index objects

# More on selection and assignment

A column name of a `DataFrame` can be accessed as an attribute.

In [None]:
df.B # equivalent to df['B']

We can assign using a slice:

In [None]:
df.loc['20051231':'20071231','A'] = [111]*3
df

And we can add an entire column:

In [None]:
df['C'] = np.random.rand(4)
df

While we have been working with numeric values up to here, there is nothing to prevent us from having columns of different types:

In [None]:
df['D'] = ['red', 'blue', 'green', 'yellow']
df['E'] = [True, True, False, True]
# df.pop('D') 
df.dtypes

We can delete columns like this:

In [None]:
del df['D']
df

Or like this:

In [None]:
df.pop('E')
df

Or, if we need to delete many columns, we can just keep what we need:

In [None]:
df = df[['A','B']]
df

Rows in a `DataFrame` can be deleted by means of `drop()`. Note that it returns a copy unless you force in-place changes (either by assignment or by passing `inplace=True`).

In [None]:
df.drop(df.index[0]) # Drop the row that corresponds to the first index

In [None]:
df # still the old one

In [None]:
df = df.drop(df.index[0])
df

In [None]:
df.drop(df.index[1],inplace=True)
df

Replace `df` with a new one to use for the following demonstration.

In [None]:
df = pd.DataFrame(np.array([[-4.31464978,  4.18579587, -3.95827137,  0.43225809],
                           [-1.00034678,  4.32407815,  4.79826565, -4.52343789],
                           [ 3.43708467,  1.2913998 ,  4.12525004, -0.55061573],
                           [ 3.54330653,  4.45819847,  4.15887073,  4.50748233],
                           [ 4.1124862 ,  4.18789329, -1.5093025 ,  3.1387294 ]]), 
                  index = range(5),columns=list('ABCD'))
df

# Filtering

We can filter a dataframe based on a global condition (if it can be evaluated). The entries that fail the condition are replaced with `NaN`.

In [None]:
df[df>0]
# An equivalent way would be df.where(df>0)

The `where()` method allows us to replace the `NaN`s with a specified value or with values taken from another object of the same shape.

In [None]:
df.where(df>0,999)
# try also df.where(df>0,-df)

We can also filter a dataframe based on the values of a specific column:

In [None]:
df[df['A']<3]

In [None]:
df[ (df['A']>-2) & (df['A']<3.5) ]

# Sorting

Sometimes we want to rearrange our dataframe based on the values of certain columns. This can be done with `sort_values()`.

In [None]:
df.sort_values('B')

In [None]:
df.sort_values('A',ascending=False) # sort in descending order

In [None]:
df.loc[1,'B'] = df.loc[2,'B']
print(df)
df.sort_values(['B','C']) # sort by two columns to break ties

In [None]:
# apply ascending vs descending sort to different columns
df.sort_values(['B','C'],ascending=[True,False]) 

Sorting can also be forced to happen in-place using the familiar `inplace` argument.

# Importing data

In [None]:
# If necessary
%reset

In [None]:

import numpy as np
import pandas as pd

# Reading CSV files

The basic function for reading CSV files is `read_csv()`. In its most basic form it takes only a string containing the name of the file to be imported. The file is imported as a `DataFrame`.

In [None]:
# See https://github.com/fivethirtyeight/data/tree/master/college-majors
# for a description of the data
gradstudents = pd.read_csv("grad-students.csv")

In [None]:
gradstudents.head()

Some useful parameters (see [docs](http://pandas.pydata.org/pandas-docs/stable/io.html) for a full description):
 - The default separator is a comma. If your file uses a different delimiter, pass something like `sep = ";"`.
 - If the header (the row of column names) is not the first row of the file, pass its zero-based position, e.g. `header = 3`; the rows above it are skipped. The default is `header = 0`. Pass `header = None` if you know your file contains no header.
 - More generally, you can pass something like `skiprows = 2` (skips the first two rows) or `skiprows = [0,2,3]` (skips specific row numbers) to skip rows at the beginning of a file. The parameter `skipfooter = n` skips the last `n` rows.
 - The parameter `names = ["Col1", "Col2"]` will ensure you get specific column names in your `DataFrame`.
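The sketch below combines several of these parameters. To stay self-contained it reads from an in-memory string via `io.StringIO` instead of a file; the file contents and column names are made up.

```python
import io
import pandas as pd

# Made-up contents: two lines of notes, then semicolon-separated data without a header
raw = "exported 2017-01-11\nsource: example\n1;2.5\n2;3.5\n3;4.5\n"

demo = pd.read_csv(io.StringIO(raw),
                   sep=";",          # non-default delimiter
                   skiprows=2,       # skip the two lines of notes
                   header=None,      # the data has no header row
                   names=["id", "value"])
print(demo)
```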

You can read a file that is not in your current working directory by passing its full path:

```python
mj = pd.read_csv(r"C:\Users\User\Downloads\majors-list.csv")
```

Or you can even pass a specific URL to retrieve your CSV from the web:

In [None]:
women_stem = pd.read_csv(r"https://github.com/fivethirtyeight/data/raw/master/college-majors/women-stem.csv")
women_stem.head(3)

# Writing data to CSV files

Data is written to a CSV file using the `to_csv()` method of a dataframe.

In [None]:
gsshort = gradstudents.iloc[0:5,[1,3,5]]
print(gsshort)
gsshort.to_csv("gsshort.csv", header=False)

The `to_csv()` method can also take parameters specifying:
 - the delimiter: `sep = ";"`
 - whether to write the column names as a table header: `header = True` by default, can be set to `header = False`
 - whether to write the index: `index = True` by default, can be set to `index = False`

It can also take a path different from the current working directory.
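A quick way to experiment with these options is to call `to_csv()` without a file name, in which case it returns the CSV text as a string. The dataframe `demo` below is made up for this sketch.

```python
import pandas as pd

demo = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})

# With no path given, to_csv() returns the text instead of writing a file
text = demo.to_csv(sep=";", index=False)
print(text)
```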

# Reading Excel files

Reading an Excel file can be done with the `read_excel()` function. It takes a filename or a URL and returns a `DataFrame`. It can take other arguments, such as a specific sheet name to read the data from.

In [None]:
# Due to changes in the BoE website this cell no longer works. It needs to be changed or deleted.
# ratesraw = pd.read_excel(r"http://www.bankofengland.co.uk/statistics/Documents/dl/251115fsg.xls", sheet_name= "Data")
# ratesraw.head()

Other arguments can include skipping a specific number of rows, including a custom set of column names etc.

In [None]:
rates = pd.read_excel(r"251115fsg.xls", 
                      sheet_name = "Data", header = None, skiprows = 4, 
                      names = ["date","r"])
rates.head()

We can also take the index from a specific column etc. In general, the approach and syntax are similar to those for CSV files.

In [None]:
rates1 = pd.read_excel(r"251115fsg.xls", 
                      sheet_name = "Data", header = None, skiprows = 4, 
                      names = ["r"], index_col=0)
rates1.head()

# Writing Excel files

**Big time warning: If you have an Excel file with the same name, the method shown below will essentially delete it and recreate it, including only the data from your dataframe. It will NOT update only specific sheets or ranges in an existing spreadsheet. If you need more advanced functionality, such as writing data to a specific range in a specific sheet, look elsewhere (e.g. the `openpyxl` library).**

The dataframe method for writing to Excel is called `to_excel()`.

In [None]:
rates1["r"] += 1

In [None]:
rates1.to_excel("rates1.xlsx",sheet_name="rates")

Again, you can pass a number of parameters. For instance, as shown below, you can choose not to include the `DataFrame` index and column names. You can also set the row and column of the sheet where writing starts.

In [None]:
rates1.to_excel("rates1.xlsx", sheet_name="rates", header = False, 
                index = False, startcol=3, startrow=3)

# Getting data directly from statistical sources

- Pandas has an associated library called `pandas_datareader` which facilitates access to information from several popular data sources.
- Examples include (see [here](https://pandas-datareader.readthedocs.io/en/latest/remote_data.html) for the full list):
    - Eurostat      
    - World Bank
    - OECD
    - Yahoo! Finance
    - Google Finance
    - St.Louis FED (FRED)
- We shall look at the first three to get an idea how it works.

# Getting a Eurostat dataset

In [None]:
import pandas_datareader.data as web

In [None]:
# Table tps00001 reports the population of a country 
# on 1 January of the respective year.
df = web.DataReader("tps00001", 'eurostat')

In [None]:
df.head(3)

In [None]:
df.index

In [None]:
df.columns

## An aside on working with `MultiIndex` objects

- The previous example shows that the columns are represented by a complex object, a `MultiIndex`. 
- This is essentially a nested structure of column names with headings, subheadings etc.
- Here are a few common operations on them:

In [None]:
# Selection
df["Population on 1 January - total"]["Albania"]

In [None]:
# One more level, produces a Series
df["Population on 1 January - total"]["Albania"]["Annual"]

In [None]:
# Selecting several at once
df["Population on 1 January - total"][["Albania","Azerbaijan"]]

In [None]:
# You can reassign the columns to simplify the structure
df1 = df["Population on 1 January - total"][["Albania","Azerbaijan"]]
df1.columns = ["Albania","Azerbaijan"]
df1

# Interfacing with a database

**Note: No code examples here. Given for general info.**

- Pandas has functions to retrieve information from databases and write back `DataFrame`s to databases.
- This relies on the powerful `SQLAlchemy` library.
- There are functions such as: 
   - `read_sql_table()` to retrieve a table from a database
   - `read_sql_query()` to run a query against the database
- A dataframe has a method `to_sql()` to write it as a table in the database.
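For a rough idea of the workflow, here is a minimal sketch that avoids `SQLAlchemy` by using Python's built-in `sqlite3` module, whose connections Pandas accepts for SQLite databases. The table name `rates` and its contents are made up. Note that `read_sql_table()` does require `SQLAlchemy`, so only `read_sql_query()` is shown.

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(":memory:")   # throwaway in-memory database

# Hypothetical table, for illustration only
rates = pd.DataFrame({"year": [2015, 2016, 2017], "r": [0.5, 0.25, 0.5]})
rates.to_sql("rates", con, index=False)     # write the dataframe as a table

# Run a query against the database and get the result as a DataFrame
low = pd.read_sql_query("SELECT year, r FROM rates WHERE r < 0.5", con)
print(low)
con.close()
```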

# Working with Stata, SAS and SPSS files

**Note: No code examples here. Given for general info.**

- These file formats are relatively popular and you may have to read in and process data packaged in one of them.
- The respective native Pandas methods are:
  - `read_stata()`
  - `read_sas()`
- Pandas versions from 0.25 onwards can read SPSS files with `read_spss()`, which relies on the `pyreadstat` package. The `savReaderWriter` module is an alternative that provides IO functions for the `sav` format.
- As a general rule, if you need to export data from Python to another program, the safest choice is probably to export in plain-text format and then read it into the other application.

# Data cleaning, merging, transformation and reshaping

In [None]:
# If necessary
%reset 

In [None]:
import numpy as np
import pandas as pd
from IPython.display import display

# Merging datasets

- In Pandas, datasets are merged similarly to database merge operations ("joins").
- There are different kinds of joins depending on which dataset is the "leading" one in the merge operation.
- Technically, one can specify different choices of common element(s) that determine the merging operation.

## Implicit merges

In this case Pandas will automatically try to find common columns to join on.

In [None]:
df1 = pd.DataFrame({"id":[112,113,114,116,115],"x1":[1,3,2,4,5]})
df2 = pd.DataFrame({"id":[112,115,114,116,113],"x2":[23,13,24,45,44]})
display(df1,df2)

In [None]:
pd.merge(df1,df2)

## Explicit merges on key

In [None]:
df1 = pd.DataFrame({"id":[112,113,114,116,115],"id1":[16,14,12,15,13],"x1":[1,3,2,4,5]})
df2 = pd.DataFrame({"id":[112,115,114,116,113],"id1":[16,12,14,15,13], "x2":[23,13,24,45,44]})
display(df1,df2)

In [None]:
pd.merge(df1,df2,on="id")

In [None]:
pd.merge(df1,df2,on="id1")

The keys we are merging on need not have the same names.

In [None]:
df1 = pd.DataFrame({"id1":[112,113,114,116,115],"x1":[1,3,2,4,5]})
df2 = pd.DataFrame({"id2":[112,115,114,116,113],"x2":[23,13,24,45,44]})
display(df1,df2)

In [None]:
pd.merge(df1,df2,left_on="id1", right_on="id2")

What happens when the keys match partially?

In [None]:
df1 = pd.DataFrame({"id":[0,113,114,116,115],"x1":[1,3,2,4,5]})
df2 = pd.DataFrame({"id":[112,115,114,116,999],"x2":[23,13,24,45,44]})
display(df1,df2)

In [None]:
pd.merge(df1,df2)

The match is performed only on the common keys. This is called an *inner join*. It is an intersection operation on the keys. Pandas does this by default but we can control it using the `how` parameter.

In [None]:
pd.merge(df1,df2,how="inner") # same as above!

The merging operation can be made inclusive by making sure that no key from either `DataFrame` has been left out. This is called an *outer join* and is a union operation on the keys. Missing elements are filled with `NaN`.

In [None]:
pd.merge(df1,df2,how="outer")

It is also possible to have one of the `DataFrame`s as the "leading" one and the second one will be merged only where possible.

In [None]:
pd.merge(df1,df2,how="left")

In [None]:
pd.merge(df1,df2,how="right")
# pd.merge(df2,df1,how="left") # will give the same result
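When comparing join types it can help to see where each row of the result came from. The sketch below uses the `indicator=True` argument of `merge()`, which adds a `_merge` column with the values `left_only`, `right_only` or `both`. The dataframes `left` and `right` are made up.

```python
import pandas as pd

left = pd.DataFrame({"id": [0, 113, 114], "x1": [1, 3, 2]})
right = pd.DataFrame({"id": [112, 114, 999], "x2": [23, 24, 44]})

# Outer join: keep every key from both sides and record its origin
merged = pd.merge(left, right, how="outer", indicator=True)
print(merged)
```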

We can also merge on more than one key. Consider these two dataframes.

In [None]:
df1 = pd.DataFrame({"id":[1,1,2,2,3],"x1":[1,3,2,4,5]})
df2 = pd.DataFrame({"id":[1,1,2,2,3],"x2":[23,13,24,45,44]})
display(df1,df2)

Here is what happens when you merge:

In [None]:
pd.merge(df1,df2)
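The number of rows in the result follows from the fact that, for a repeated key, every matching row on the left is paired with every matching row on the right. A short sketch of the count, rebuilding the same data under the illustrative names `left` and `right`:

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 1, 2, 2, 3], "x1": [1, 3, 2, 4, 5]})
right = pd.DataFrame({"id": [1, 1, 2, 2, 3], "x2": [23, 13, 24, 45, 44]})

merged = pd.merge(left, right)

# id 1: 2 x 2 rows, id 2: 2 x 2 rows, id 3: 1 x 1 row
rows_per_id = merged.groupby("id").size()
print(rows_per_id)
```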

Now suppose they have an additional key that can serve to uniquely identify rows:

In [None]:
df1 = pd.DataFrame({"id":[1,1,2,2,3],"id1":[1,2,1,2,1],"x1":[1,3,2,4,5]})
df2 = pd.DataFrame({"id":[1,1,2,2,3],"id1":[1,2,1,2,1],"x2":[23,13,24,45,44]})
display(df1,df2)

In [None]:
pd.merge(df1,df2)

## Merging on index

You can also use dataframe indexes as the merge keys.

In [None]:
ind1 = pd.date_range(start="2005",periods=5,freq="A")
df1 = pd.DataFrame({"x1":[1,3,2,4,5]},index=ind1)
df2 = pd.DataFrame({"x2":[23,13,24,45,44]},index=ind1)
display(df1,df2)

In [None]:
pd.merge(df1,df2,left_index=True,right_index=True)

More complex merges also work as above:

In [None]:
ind1 = pd.date_range(start="2005",periods=5,freq="A")
ind2 = pd.date_range(start="2004",periods=5,freq="A")
df1 = pd.DataFrame({"x1":[1,3,2,4,5]},index=ind1)
df2 = pd.DataFrame({"x2":[23,13,24,45,44]},index=ind2)
display(df1,df2)

In [None]:
pd.merge(df1,df2,left_index=True,right_index=True)

In [None]:
pd.merge(df1,df2,left_index=True,right_index=True,how="outer")

Note that there is also a `join()` method that merges on indexes. Its syntax is a bit more compact than that of `merge()` but we won't deal with it.
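For reference only, a minimal sketch of how the two spellings compare when both dataframes share the same index. The data are made up, and daily dates are used purely to keep the example short.

```python
import pandas as pd

ind = pd.date_range(start="2005-01-01", periods=3, freq="D")
left = pd.DataFrame({"x1": [1, 3, 2]}, index=ind)
right = pd.DataFrame({"x2": [23, 13, 24]}, index=ind)

joined = left.join(right)   # joins on the index; a left join by default
merged = pd.merge(left, right, left_index=True, right_index=True)

print(joined)
print(joined.equals(merged))
```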

## Concatenation

Another way of combining datasets is to concatenate them (think stacking them one on top of another).

In [None]:
ind1 = pd.date_range(start="2000",periods=5,freq="A")
ind2 = pd.date_range(start="2004",periods=5,freq="A")
df1 = pd.DataFrame({"x1":[1,3,2,4,5]},index=ind1)
df2 = pd.DataFrame({"x1":[23,13,24,45,44]},index=ind2)
display(df1,df2)

In [None]:
pd.concat([df1,df2])

Compare with the result of a merge operation:

In [None]:
pd.merge(df1,df2,left_index=True,right_index=True,how="outer")

# Transformations and data cleaning

There are numerous operations that can be classified as "cleaning" or "transforming" the data. Cleaning is generally any type of operation that removes unnecessary information or handles the case of missing information. Transformations can be even more diverse and obviously can be part of a cleaning operation.

## Finding and removing duplicates

In [None]:
df = pd.DataFrame({"x1":[1,3,5,7,3],"x2":[2,4,6,8,4]})
display(df)
df.duplicated()

In [None]:
df = pd.DataFrame({"x1":[1,3,5,1,7,3,],"x2":[2,4,6,2,8,4]})
display(df)
df.duplicated()

In [None]:
df.drop_duplicates()

In [None]:
df.drop_duplicates(inplace=True)
df

## Transforming data with a function or a map

Let's look at the simplest case first:

In [None]:
display(df)

In [None]:
df['x3'] = 5*df['x1'] - df['x2']**2
df

We are obviously not constrained to simple operations:

In [None]:
def Transf(x):
    tmp = x.copy() # What happens if you don't use copy()?
    tmp[tmp<0] *= 2
    tmp[tmp>0] += 33
    return tmp
df['x4'] = Transf(df['x3'])
df

Or we can use the `map()` method to do the transformation. This allows us to use a function which is not vectorized.

In [None]:
df['x5'] = df['x4'].map(lambda x:"Negative" if x<0 else "Positive" if x>0 else "Zero")
df

## Detecting null values

In [None]:
df["x5"]=np.nan
df.iloc[1,1]=np.nan
df.loc[2,"x3"]=None
df

In [None]:
df.isnull()

## Dropping NAs

In [None]:
del df['x5']
df

In [None]:
df.dropna() # Drops rows by default

In [None]:
df.dropna(axis=1) # Drops columns

We can consider only a certain column (or columns) when dropping:

In [None]:
df.dropna(subset=["x2"])

The `dropna()` method also allows us to:
- operate in place by passing `inplace=True` (as seen previously);
- use the `how = 'all'` argument to drop a label only if all entries are missing;
- use the `thresh = n` argument to keep only the labels that have at least `n` non-missing values.
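A short sketch of the `how` and `thresh` arguments of `dropna()` on a made-up dataframe. Note that `thresh` counts non-missing values: a row is kept only if it has at least that many.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"a": [1.0, np.nan, np.nan, np.nan],
                     "b": [2.0, 5.0, np.nan, np.nan],
                     "c": [3.0, 6.0, np.nan, 7.0]})

any_missing = demo.dropna()            # keeps only complete rows: row 0
all_missing = demo.dropna(how="all")   # drops only the fully empty row 2
at_least_two = demo.dropna(thresh=2)   # keeps rows with >= 2 non-missing values: rows 0 and 1

print(any_missing.index.tolist(), all_missing.index.tolist(), at_least_two.index.tolist())
```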

## Filling in missing values

In [None]:
display(df)
df.fillna(-999)

In [None]:
display(df)
df.fillna({'x1':1.11,'x2':2.22,'x3':3.33,'x4':4.44})

In [None]:
display(df)
df.bfill() # backward fill; the older spelling df.fillna(method='backfill') is deprecated

In [None]:
display(df)
df.ffill() # forward fill; the older spelling df.fillna(method='pad') is deprecated

## Replacing values

We can replace values in general using the `replace()` method.

In [None]:
df1 = df.fillna({'x1':1.11,'x2':2.22,'x3':3.33,'x4':4.44})
display(df1)
df1.replace(to_replace = [1.0,2.22,3.33],value=[100,222,333])

We can also use a dictionary to pass the substitution values:

In [None]:
display(df1)
df1.replace({2.22:np.nan,3.33:np.nan})

## Computing dummy variables

Sometimes it is useful for modelling purposes to generate a set of dummy variables from a categorical variable:

In [None]:
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                    'data': range(6)})
display(df1)
pd.get_dummies(df1['key'])

We can get rid of the `key` variable in our example and replace it with the corresponding dummies:

In [None]:
pd.merge(df1[['data']],pd.get_dummies(df1['key']),left_index=True,right_index=True)

## Discretization and binning

We may have to distribute measurements into pre-specified groups, similarly to how one places observations in the different bins of a histogram. This is done with the `cut()` function.

As an example, suppose you are given weight measurements on 10 people and you want to classify them in groups as follows:
- up to 50 kg.
- between 50 and 60 kg.
- between 60 and 70 kg.
- ...
- above 90 kg.

In [None]:
weights = [49,91,61,88,75,56,45,54,77,71]
bins = [0,50,60,70,80,90,np.inf]
wbin = pd.cut(weights,bins)
wbin

In [None]:
# These are the labels
wbin.categories

In [None]:
# And these are the groups the observations belong to
wbin.codes

We can get a tally of the number of people in each group:

In [None]:
pd.Series(wbin).value_counts() # the top-level pd.value_counts() is deprecated in recent Pandas versions
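The interval labels can be replaced with more readable names through the `labels` argument of `cut()`. The group names below are made up; there must be one name per interval. By default the intervals are closed on the right, so a weight of exactly 50 falls in the first group.

```python
import numpy as np
import pandas as pd

weights = [49, 91, 61, 88, 75, 56, 45, 54, 77, 71]
bins = [0, 50, 60, 70, 80, 90, np.inf]
# Hypothetical group names, one per interval
names = ["<=50", "50-60", "60-70", "70-80", "80-90", ">90"]

groups = pd.cut(weights, bins, labels=names)
counts = pd.Series(groups).value_counts()
print(counts)
```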

# Reshaping data

This part deals with various ways of representing our dataset by rearranging it from rows to columns and vice versa, making the data "wide" or "long" etc.

## Stacking and unstacking data

- The `stack()` method pivots from columns to rows.
- The `unstack()` method pivots from rows to columns.

Stacking makes data "long".

In [None]:
display(df)
stacked = df.stack()
# returns a Series with a hierarchical index
display(stacked) 

In [None]:
stacked[0]

In [None]:
stacked[4]['x2':'x4']

In [None]:
display(df)
stacked = df.stack(dropna=False) # keeps the NaNs
display(stacked) 

Unstacking works from rows to columns, i.e. makes your data "wide".

In [None]:
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
index = pd.MultiIndex.from_tuples(index)
pop = pd.Series(populations, index=index)
pop

In [None]:
pop.unstack()
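The two operations undo each other, which a small round trip on a made-up dataframe illustrates. The names `wide`, `long` and `back` are illustrative only.

```python
import pandas as pd

wide = pd.DataFrame({"x1": [1, 3], "x2": [2, 4]}, index=["r1", "r2"])

long = wide.stack()      # Series indexed by (row label, column label)
back = long.unstack()    # the inner index level becomes the columns again

print(long)
print(back)
```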

## Pivoting

- The stacking and unstacking operations can be generalized a bit for more convenient use. 
- This is done through the `pivot()` method, which lets us choose what goes on the rows and what goes on the columns.
- It is especially useful for long data in the format usually retrieved from a database.

Consider the following dataset, which contains artificial balance of payments data:

In [None]:
df1 = pd.DataFrame({'date':[2010,2010,2011,2011,2012,2012],
                   'BOPcat':['X','M']*3,
                   'valLC':np.array([3000,3000,2900,3100,3050,2950]),
                   'valFC':np.array([3000,3000,2900,3100,3050,2950])*2})
display(df1)

Stacking does not produce very usable results:

In [None]:
df1.stack()

And neither does unstacking.

In [None]:
df1.unstack()

Let's use the `pivot()` method and instruct it to put the `date` variable on the rows and the `BOPcat` variable on the columns, tabulating the `valLC` variable.

In [None]:
df1.pivot(index='date', columns='BOPcat', values='valLC') # keyword arguments are required in recent Pandas versions

We can do the same with the `valFC` variable:

In [None]:
df1.pivot(index='date', columns='BOPcat', values='valFC')

Or swap rows for columns:

In [None]:
df1.pivot(index='BOPcat', columns='date', values='valFC')

## Melting data

### The general idea

Sometimes your dataset will be organized in such a way that column names contain information that is actually data. Consider the following dataset:

In [None]:
dt = pd.DataFrame({'first' : ['John', 'Mary'],
                   'last' : ['Doe', 'Bo'],
                   'height' : [170, 180],
                   'weight' : [60, 80]})
dt

- Here the column names `height` and `weight` themselves contain information on the type of measurement (variable). 
- This information can be transformed into a more compact form if we put it in a separate column and place the corresponding values in another column, like this:  

| Variable | Value |
| -------- | ----- |
| height   | 170   |
| height   | 180   |
| weight   | 60    |
| weight   | 80    |

- The above is a basic example of *melting*.

- This proposal may not look too different from the original format.
- However, imagine that we had observations on more variables like waistline, body fat percentage etc. 
- These would grow the dataframe horizontally in the original representation while under the proposed transformation having more variables will imply adding row information to a fixed number of columns.
- Obviously this process can apply only to some variables (called *measured variables* or *value variables*), as we need to keep certain variables (called *identifier variables*) in order to be able to identify observations uniquely.

### The Pandas implementation of melting

The `melt()` function collects the information from the columns (in this case, whether the measurement refers to a person's height or weight) and places it in a new variable:

In [None]:
pd.melt(dt, id_vars=['first', 'last'])

The `id_vars` list declares certain variables as identifiers and excludes them from the `melt` operation.

It is possible to change the name of the variable to something more expressive:

In [None]:
pd.melt(dt, id_vars=['first', 'last'], var_name='quantity')

To put things in perspective, the `id_vars` are needed in order to avoid losing information. In this case, we use the combination of first and last name to identify which person an observation refers to. Here is the (useless) molten dataframe without this declaration:

In [None]:
pd.melt(dt)

### More on the rationale behind melting

- At this stage one might wonder whether melting is such a good idea: it seems to make a choice in favour of "long" rather than "wide" data, with the side effect that the readability of the dataset may be worsened in the process of transformation.
- However, the primary advantage of melting is that it puts the data in a generic format that is suitable for transformation into different alternative representations, as needed.

- Think of it as having the dataset in a database-like format which is convenient for extracting different tables for different purposes.
- Actually, the term "melt" is used in reference to having molten metal that can be cast into different forms, as desired. Indeed, the statistical computing and graphics environment R uses precisely the term "cast" for this reverse operation (recall that in Pandas this is done via the `pivot()` method shown previously).
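To make the "melt, then cast" idea concrete, here is a minimal round trip. It is a sketch under a simplifying assumption: a single identifier column (`first`) is used so that the `pivot()` call stays short. The dataframe `people` is made up.

```python
import pandas as pd

people = pd.DataFrame({"first": ["John", "Mary"],
                       "height": [170, 180],
                       "weight": [60, 80]})

# Melt: column names become values of the new "quantity" column
molten = pd.melt(people, id_vars=["first"], var_name="quantity")

# Cast back: one row per person, one column per quantity
recast = molten.pivot(index="first", columns="quantity", values="value")
print(molten)
print(recast)
```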

# Data aggregation, the split-apply-combine paradigm and pivot tables

In [None]:
# If necessary
%reset 

In [None]:
import numpy as np
import pandas as pd
from IPython.display import display

# Aggregation operations for `Series`

These practically mirror the respective operations for arrays:

In [None]:
rng = np.random.RandomState(5)
s = pd.Series(rng.rand(5))
s

In [None]:
s.sum()

In [None]:
s.mean()

Many other operations are available. Here are a few examples to give you ideas:

In [None]:
s.min()

In [None]:
s.prod()

In [None]:
s.cumsum()

In [None]:
s.cumprod()

# Aggregation operations for `DataFrames`

This is similar in spirit to the operations for `Series`. The novelty here is the option to perform an operation rowwise or columnwise.

In [None]:
wght = pd.DataFrame({'Bob':[90,91,89,88,86],
                     'Jane':[68,62,61,59,59], 
                     'Joe':[75,76,77,79,80]},
                 index=pd.date_range(start="20160601",
                                     periods=5,freq='M'))
wght

In [None]:
wght.mean()

In [None]:
wght.mean(axis=1)

In [None]:
wght.mean(axis="columns")

In [None]:
wage = pd.DataFrame({'Bill':[1000,1100,1050,1000,1200],'Jill':[2000,2000,2000,3000,2000], 'Jane':[500,550,550,600,500]},
                 index=pd.date_range(start="20160101",periods=5,freq='M'))
wage

In [None]:
wage.sum()

In [None]:
wage.sum(axis=1)

In [None]:
otherincome = pd.DataFrame({'Bill':[2000,2100,2050,2000,2200],'Jill':[3000,3000,3000,4000,3000], 'Jane':[1500,1550,1550,1600,1500]},
                 index=pd.date_range(start="20160101",periods=5,freq='M'))
display(otherincome)

In many respects a `DataFrame` behaves just like a NumPy array:

In [None]:
wage + otherincome

In [None]:
# same as above
wage.add(otherincome)

In [None]:
wage*3
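
Unlike NumPy arrays, dataframes are aligned on their row and column labels before the arithmetic is carried out. The frames above share all labels, so the sketch below uses two made-up frames (`left` and `right` are hypothetical) that overlap only partially.

```python
import pandas as pd

# Hypothetical frames that share only some row and column labels
left = pd.DataFrame({'Bill': [1, 2], 'Jane': [3, 4]}, index=['Jan', 'Feb'])
right = pd.DataFrame({'Jane': [10, 20], 'Jill': [30, 40]}, index=['Feb', 'Mar'])

aligned = left + right                  # cells missing from either frame become NaN
filled = left.add(right, fill_value=0)  # a value missing from one frame is treated as 0

print(aligned)
print(filled)
```

With `fill_value=0` a cell is still `NaN` when it is missing from both frames, for example Jill in January.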

# Split-apply-combine operations

- A common need that arises in data analysis is to divide a dataset into several subsets according to some criterion, process and analyse these subsets separately and put the results back together.
- This workflow is known as **split-apply-combine**.
- Pandas supports this approach via the `groupby` operation.
- The next slide contains a nice illustration of the main idea (courtesy of Jake VanderPlas's *Python Data Science Handbook*).

![Split-apply-combine illustrated](http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/figures/03.08-split-apply-combine.png)
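
To make the three steps concrete, here is a rough sketch that performs them by hand on a made-up frame and then compares the outcome with `groupby`. The names `df`, `pieces`, `sums` and `manual` are hypothetical and chosen for illustration only.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': [1, 2, 3, 4, 5, 6]})

# Split: one sub-frame per distinct key
pieces = {k: df[df['key'] == k] for k in df['key'].unique()}
# Apply: reduce each piece to a single number
sums = {k: piece['data'].sum() for k, piece in pieces.items()}
# Combine: collect the per-group results into one Series
manual = pd.Series(sums, name='data').sort_index()

# The same result in one step
grouped = df.groupby('key')['data'].sum()

print(manual)
print(grouped)
```

In practice `groupby` is preferable: it avoids the explicit loop and does not materialise every sub-frame up front.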

### The diamonds dataset

Source: R's `ggplot2` package

Description taken from http://docs.ggplot2.org/current/diamonds.html

**Variables:**  
- price. price in US dollars (\$326-\$18,823)
- carat. weight of the diamond (0.2-5.01)
- cut. quality of the cut (Fair, Good, Very Good, Premium, Ideal)
- color. diamond colour, from J (worst) to D (best)
- clarity. a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
- x. length in mm (0-10.74)
- y. width in mm (0-58.9)
- z. depth in mm (0-31.8)
- depth. total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43-79)
- table. width of top of diamond relative to widest point (43-95)

In [None]:
dia = pd.read_csv("diamonds.csv")

In [None]:
dia.head()

The `unique()` method of a series allows us to get the distinct values.

In [None]:
dia['color'].unique()

In [None]:
dia.describe()

In [None]:
dia.groupby('cut').describe() # dia.groupby('cut') is a GroupBy object; describe() returns a DataFrame of per-group summaries

We can refer to a column of a `GroupBy` object and invoke a method on it:

In [None]:
dia.groupby('cut')['carat'].mean()

Or we can iterate over the groups. Each iteration yields a tuple consisting of the group name and the dataframe holding the rows of that group.

In [None]:
for group in dia.groupby('cut'):
    print(group[0])
    print(type(group[1]))

# Or, almost equivalently
# for gr, fr in dia.groupby('cut'):
#     print(gr,type(fr),sep=": ")

We can get a particular group with `get_group()`:

In [None]:
dia.groupby('cut').get_group('Good').head()

## Operations on groups

### Aggregation

In [None]:
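# String names select the Pandas implementations and avoid warnings in recent Pandas versions
dia.groupby('cut').agg(['sum','min','max'])
# Note the results for string variables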

We can specify the operations to be column-specific.

In [None]:
dia.groupby('cut').agg({'carat':'mean', 'price':'max'})

In [None]:
dia.groupby('cut').agg({'carat':'mean', 'price':['min','max']})
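
Column-specific aggregation can also be written with keyword arguments, which lets us name the output columns directly. This is only a sketch: it assumes a reasonably recent Pandas version (named aggregation is not available in very old releases) and uses a made-up miniature stand-in for the diamonds data, called `mini` here.

```python
import pandas as pd

# Hypothetical miniature stand-in for the diamonds data
mini = pd.DataFrame({'cut': ['Fair', 'Fair', 'Ideal', 'Ideal', 'Ideal'],
                     'carat': [1.0, 0.5, 0.3, 0.4, 0.8],
                     'price': [3000, 1500, 600, 900, 2800]})

# Each keyword is an output column name; the tuple is (input column, function)
summary = mini.groupby('cut').agg(
    mean_carat=('carat', 'mean'),
    min_price=('price', 'min'),
    max_price=('price', 'max'),
)
print(summary)
```

The result has flat, self-explanatory column names instead of the hierarchical columns produced by the dictionary form.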

### Filtering

We may want to keep only groups that satisfy certain conditions. Let's say we want to keep only those groups which have more than 30 diamonds with a price above \$18500. We can do it with the `filter()` method.

In [None]:
def SelectManyExpensive(x):
    # x is the dataframe for one cut; keep it if more than 30 of its diamonds cost above 18500
    return (x['price'] > 18500).sum() > 30
dia.groupby('cut').filter(SelectManyExpensive).head()

In [None]:
dia.groupby('cut').filter(SelectManyExpensive).shape

In [None]:
dia.groupby('cut').filter(SelectManyExpensive)['cut'].unique()
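
The key point about `filter()` is that it keeps or drops whole groups but returns the original rows, not one row per group. A minimal sketch on made-up data (the frame `mini` and the function `has_two_expensive` are hypothetical):

```python
import pandas as pd

# Hypothetical miniature stand-in for the diamonds data
mini = pd.DataFrame({'cut': ['Fair', 'Fair', 'Ideal', 'Ideal', 'Ideal'],
                     'price': [3000, 1500, 600, 900, 2800]})

def has_two_expensive(group):
    # True if the group contains at least two diamonds priced above 1000
    return (group['price'] > 1000).sum() >= 2

kept = mini.groupby('cut').filter(has_two_expensive)
print(kept)
```

Only the 'Fair' group passes the test, so `kept` contains exactly the two 'Fair' rows with their original index labels.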

### Transforming

The `transform()` method allows us to apply a transformation to each group. Here is a group-specific standardization transformation:

In [None]:
dia.loc[:,['cut','table','price']].groupby(
    'cut').transform(lambda x: (x - x.mean()) / x.std()).head()
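
The result of `transform()` has the same number of rows as the input, with each value replaced by its group-specific transformation. The sketch below uses demeaning on made-up data (`mini` is hypothetical) so that the numbers can be verified by hand.

```python
import pandas as pd

# Hypothetical miniature stand-in for the diamonds data
mini = pd.DataFrame({'cut': ['Fair', 'Fair', 'Ideal', 'Ideal'],
                     'price': [1000.0, 3000.0, 500.0, 700.0]})

# Subtract the mean price of the respective cut from every price
demeaned = mini.groupby('cut')['price'].transform(lambda x: x - x.mean())
print(demeaned.tolist())
```

Because the output is aligned with the original index, it can be assigned straight back as a new column, e.g. `mini['price_dm'] = demeaned`.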

### Apply operations

The `apply()` method is similar to `transform()`. The difference is that the function passed to `apply()` receives the dataframe of each group as a whole, performs some calculation on it, and returns a Pandas object or a scalar.

In [None]:
from scipy.stats import linregress
def ReturnSlope(x):
    # linregress(x, y) fits y on x, so this is the slope of carat regressed on price
    return linregress(x['price'],x['carat']).slope
dia.groupby('cut').apply(ReturnSlope)
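
For a version that can be checked without SciPy or the diamonds file, here is a small sketch in which the function combines two columns of each group into one scalar. The frame `mini` and the function `price_per_carat` are made up for illustration; selecting the columns before calling `apply()` is a choice made here to keep the grouping column out of the function's input.

```python
import pandas as pd

# Hypothetical miniature stand-in for the diamonds data
mini = pd.DataFrame({'cut': ['Fair', 'Fair', 'Ideal', 'Ideal'],
                     'carat': [1.0, 2.0, 0.5, 1.0],
                     'price': [3000, 5000, 1000, 2000]})

def price_per_carat(group):
    # Total price divided by total weight within the group: one scalar per group
    return group['price'].sum() / group['carat'].sum()

ppc = mini.groupby('cut')[['carat', 'price']].apply(price_per_carat)
print(ppc)
```

Since the function returns a scalar, the combined result is a `Series` indexed by the group names. A calculation like this cannot be done with `transform()` on a single column, because it needs both columns of the group at once.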

# Pivot tables

We already know the reshaping operation `pivot`. Pivot tables carry this idea further by providing data aggregation functionality. The basic syntax is `pivot_table(values, index, columns)` with aggregation performed using the `mean` function by default.

In [None]:
dia.pivot_table(values='price',index='cut',columns='color')
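
A pivot table can be thought of as a `groupby` on two keys followed by a reshaping step. The sketch below checks this on made-up data (`mini` is a hypothetical stand-in for the diamonds data).

```python
import pandas as pd

# Hypothetical miniature stand-in for the diamonds data
mini = pd.DataFrame({'cut': ['Fair', 'Fair', 'Fair', 'Ideal', 'Ideal'],
                     'color': ['D', 'D', 'E', 'D', 'E'],
                     'price': [1000, 3000, 500, 700, 900]})

# Mean price by cut (rows) and color (columns)
pt = mini.pivot_table(values='price', index='cut', columns='color')

# The same table built by hand: group on both keys, average, move color to the columns
gb = mini.groupby(['cut', 'color'])['price'].mean().unstack()

print(pt)
print(gb)
```

The two 'Fair'/'D' diamonds are averaged into a single cell, which is exactly what plain `pivot` cannot do when index/column pairs are duplicated.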

In [None]:
dia.pivot_table(values='price',index='cut',columns='color',aggfunc='sum')

We can pass several aggregating functions.

In [None]:
dia.pivot_table(values='price',index='cut',columns='color',aggfunc=['min','max'])

We can also work with hierarchical indexes or columns:

In [None]:
dia.pivot_table(values='price',index=['cut','clarity'],columns='color').head(15)

In [None]:
dia.pivot_table(values='price',index='color',columns=['cut','clarity']).head(15)

We can add margins:

In [None]:
dia.pivot_table(values='price',index='color',columns='clarity',margins=True,aggfunc='sum').head(15)
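
With `margins=True` Pandas appends a row and a column labelled `All` that apply the aggregation function to the respective row, column, or the whole table. A minimal sketch on made-up data (`mini` is hypothetical), small enough to check the totals by hand:

```python
import pandas as pd

# Hypothetical miniature stand-in for the diamonds data
mini = pd.DataFrame({'cut': ['Fair', 'Fair', 'Fair', 'Ideal', 'Ideal'],
                     'color': ['D', 'D', 'E', 'D', 'E'],
                     'price': [1000, 3000, 500, 700, 900]})

pt_m = mini.pivot_table(values='price', index='cut', columns='color',
                        aggfunc='sum', margins=True)
print(pt_m)
```

Here the bottom-right cell equals the sum of all prices. The label of the margins can be changed with the `margins_name` argument.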

And choose a fill value for NAs:

In [None]:
dia.pivot_table(values='price',index='color',columns=['cut','clarity'],fill_value=-1).head(15)