# This notebook highlights a potential bug, or at least unexpected behavior, in the pandas.DataFrame.apply method when the applied function returns a list.  This is not a typical usage of the method, but the notebook demonstrates that it behaves inconsistently.

First, let's show what we are working with:

In [1]:
import pandas as pd
import numpy as np
import sys

print 'Pandas version: %s' % pd.__version__
print 'NumPy version: %s' % np.__version__
print 'Base python version:%s' % sys.version

Pandas version: 0.16.0
NumPy version: 1.9.2
Base python version:2.7.9 (default, Apr  7 2015, 12:46:00) 
[GCC 4.8.2]


## Create a DataFrame instance with *only* numbers.  What we will eventually see is that pd.DataFrame.apply behaves differently depending on whether the DataFrame instance also contains other types (e.g. strings).

In [2]:
df1 = pd.DataFrame(np.arange(6).reshape(2,3), columns=['c1','c2','c3'])
df1

   c1  c2  c3
0   0   1   2
1   3   4   5


## Create some function to use with the apply method.  

The issue was originally discovered in a different context, but this function reproduces it.  The function itself does nothing particularly useful, but it should be easy enough to follow.

In [3]:
def f(row, col_choice_1, col_choice_2, max_exp):
    '''
    First argument, 'row', is a row from a DataFrame instance (so, a Series)
    Second and third arguments are keys to select data from 'row'
    max_exp is an integer representing the 'maximum exponent'
    
    This function returns a list of pandas.Series instances
    '''
    s_list = []
    d = row[col_choice_1] + row[col_choice_2]
    for i in range(1, max_exp + 1):
        simple_dict = {'orig':d, 'power': i, 'raised':d**i}
        name = 'r'+str(i)
        s = pd.Series(simple_dict, name=name)
        s_list.append(s)
    return s_list

## For example, see what it does to the first row of the df1 DataFrame:

In [4]:
x = f(df1.iloc[0], 'c2', 'c3', 4)
print x

print '*'*50

print pd.DataFrame(x)

[orig      3
power     1
raised    3
Name: r1, dtype: int64, orig      3
power     2
raised    9
Name: r2, dtype: int64, orig       3
power      3
raised    27
Name: r3, dtype: int64, orig       3
power      4
raised    81
Name: r4, dtype: int64]
**************************************************
    orig  power  raised
r1     3      1       3
r2     3      2       9
r3     3      3      27
r4     3      4      81


## Hopefully that is straightforward.  Now, instead of calling the function directly, pass it to the pandas.DataFrame.apply method of the df1 instance:

In [5]:
df1.apply(f, args=('c2', 'c3', 4), axis=1)

ValueError: Shape of passed values is (2, 4), indices imply (2, 3)

## An error (as expected): the function returns 4 items per row, but df1 has only 3 columns.  Try a different exponent:

In [6]:
df1.apply(f, args=('c2', 'c3', 2), axis=1)

ValueError: Shape of passed values is (2, 2), indices imply (2, 3)

## Also note that it behaves strangely when the exponent argument is 3, which matches the number of columns in df1.  The function returns a list of Series instances, and here each cell of the resulting DataFrame appears to be populated with one of those Series instances:

In [7]:
df1.apply(f, args=('c2', 'c3', 3), axis=1)

                                      c1                                     c2                                   c3
0  orig 3 power 1 raised 3 Name: r1, ...  orig 3 power 2 raised 9 Name: r2, ...  orig 3 power 3 raised 27 Name: r...
1  orig 9 power 1 raised 9 Name: r1, ...    orig 9 power 2 raised 81 Name: r...    orig 9 power 3 raised 729 Name...
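
The pattern in cells In [5] to In [7] lines up with how many items `f` returns per row compared with how many columns the frame has. The sketch below checks only that arithmetic, without calling `apply`. The helper name `returned_length_matches_columns` is hypothetical, and `f` is repeated from In [3] so the block runs on its own. It describes the pattern seen in the outputs above and is not a confirmed account of pandas internals.

```python
import numpy as np
import pandas as pd


def f(row, col_choice_1, col_choice_2, max_exp):
    # Same function as In [3]: returns a list of max_exp Series.
    s_list = []
    d = row[col_choice_1] + row[col_choice_2]
    for i in range(1, max_exp + 1):
        simple_dict = {'orig': d, 'power': i, 'raised': d ** i}
        s_list.append(pd.Series(simple_dict, name='r' + str(i)))
    return s_list


def returned_length_matches_columns(df, max_exp):
    # Hypothetical helper: True when the list returned for the first row
    # has exactly as many items as the frame has columns.
    first_row = df.iloc[0]
    return len(f(first_row, 'c2', 'c3', max_exp)) == len(df.columns)


df1 = pd.DataFrame(np.arange(6).reshape(2, 3), columns=['c1', 'c2', 'c3'])

# Map each exponent tried above to whether the two lengths coincide.
matches = {n: returned_length_matches_columns(df1, n) for n in (2, 3, 4)}
```

Only the exponent of 3 gives a match, and that is the one call above that returned a DataFrame instead of raising `ValueError`.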


## Now, make another DataFrame similar to df1, but with one simple change.  Here, we add an extra column that contains a string.

In [8]:
df2 = df1.copy()
df2['c4'] = 'X'
df2

   c1  c2  c3 c4
0   0   1   2  X
1   3   4   5  X


## Apply the same function as above and see that it now works:

In [9]:
result = df2.apply(f, args=('c2', 'c3', 2), axis=1)
result

for r in result:
    print type(r)
    print pd.DataFrame(r)

<type 'list'>
    orig  power  raised
r1     3      1       3
r2     3      2       9
<type 'list'>
    orig  power  raised
r1     9      1       9
r2     9      2      81


## Try a different exponent:

In [10]:
result = df2.apply(f, args=('c2', 'c3', 5), axis=1)
result

for r in result:
    print type(r)
    print pd.DataFrame(r)

<type 'list'>
    orig  power  raised
r1     3      1       3
r2     3      2       9
r3     3      3      27
r4     3      4      81
r5     3      5     243
<type 'list'>
    orig  power  raised
r1     9      1       9
r2     9      2      81
r3     9      3     729
r4     9      4    6561
r5     9      5   59049


## Try an exponent of 4.  Note that this is the number of columns in the df2 DataFrame:

In [11]:
result = df2.apply(f, args=('c2', 'c3', 4), axis=1)
result

for r in result:
    print type(r)
    print pd.DataFrame(r)

<type 'str'>


PandasError: DataFrame constructor not properly called!

## Note that iterating over the result now yields a string object (?!), which the DataFrame constructor will not accept.
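
One possible explanation for the string, offered as an assumption rather than a confirmed diagnosis: if `apply` returned a DataFrame here (as it did in In [7]) instead of a Series of lists, then `for r in result` would walk over column labels, because iterating over a DataFrame yields its column names. The sketch below shows only that iteration behavior on a small stand-in frame; `stand_in` is a hypothetical name.

```python
import pandas as pd

# A small stand-in frame with the same column labels as df2.
stand_in = pd.DataFrame([[0, 1, 2, 'X'], [3, 4, 5, 'X']],
                        columns=['c1', 'c2', 'c3', 'c4'])

# Iterating over a DataFrame walks over its column labels, not its rows.
labels = [r for r in stand_in]

# Each label is a plain string, which matches the type printed above.
label_types = [type(r).__name__ for r in labels]
```

Checking `type(result)` in the failing cell would confirm or rule out this explanation.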

## To confirm this behavior, create another DataFrame, and add two additional columns:

In [12]:
df3 = df1.copy()
df3['c4'] = 'X'
df3['c5'] = 'Y'
df3

   c1  c2  c3 c4 c5
0   0   1   2  X  Y
1   3   4   5  X  Y


## Try running with an exponent of 4 (which should be OK, since we now have 5 columns in df3):

In [13]:
result = df3.apply(f, args=('c2', 'c3', 4), axis=1)
result

for r in result:
    print type(r)
    print pd.DataFrame(r)

<type 'list'>
    orig  power  raised
r1     3      1       3
r2     3      2       9
r3     3      3      27
r4     3      4      81
<type 'list'>
    orig  power  raised
r1     9      1       9
r2     9      2      81
r3     9      3     729
r4     9      4    6561


## OK, so that worked.  But again, what if the exponent matches the number of columns?

In [14]:
result = df3.apply(f, args=('c2', 'c3', 5), axis=1)
result

for r in result:
    print type(r)
    print pd.DataFrame(r)

<type 'str'>


PandasError: DataFrame constructor not properly called!
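
If the goal is simply one small DataFrame per row, one way to sidestep `apply` and its handling of list-like return values is to loop over the rows explicitly. This is a minimal sketch, not a fix for the behavior shown above. `frames_per_row` is a hypothetical helper, and `f` is repeated from In [3] so the block runs on its own.

```python
import numpy as np
import pandas as pd


def f(row, col_choice_1, col_choice_2, max_exp):
    # Same function as In [3]: returns a list of max_exp Series.
    s_list = []
    d = row[col_choice_1] + row[col_choice_2]
    for i in range(1, max_exp + 1):
        simple_dict = {'orig': d, 'power': i, 'raised': d ** i}
        s_list.append(pd.Series(simple_dict, name='r' + str(i)))
    return s_list


def frames_per_row(df, col_choice_1, col_choice_2, max_exp):
    # Hypothetical helper: call f on every row and keep each result as a
    # DataFrame, keyed by the row's index label.
    frames = {}
    for label, row in df.iterrows():
        frames[label] = pd.DataFrame(f(row, col_choice_1, col_choice_2, max_exp))
    return frames


df2 = pd.DataFrame(np.arange(6).reshape(2, 3), columns=['c1', 'c2', 'c3'])
df2['c4'] = 'X'

# Exponent 4 equals the number of columns in df2, the case that failed above.
frames = frames_per_row(df2, 'c2', 'c3', 4)
```

Because the loop never hands the list back to `apply`, the result does not depend on whether the exponent happens to equal the number of columns.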