# Advanced Pandas functionality 
## - DataFrame.apply()

## Introduction
* We now use pandas DataFrames to hold objects (here: NumPy arrays) instead of plain numbers
* We process all columns, rows, or individual fields using the `.apply()` and `.applymap()` methods

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib notebook

### Preparing test data

First we generate some objects, namely 100 NumPy arrays containing 500 standard-normal random values each:

In [2]:
curves = [np.random.randn(500) for i in range(100)]

Then we generate random, unique IDs for the curves (these could be, for example, tube IDs):

In [3]:
ids = np.random.choice(range(10000, 99999), 100, replace=False)

Next we put everything into a Series, indexed by ID:

In [4]:
# pd.Int64Index was removed in pandas 2.0; pd.Index infers the integer dtype.
s1 = pd.Series(data=curves,
               index=pd.Index(ids, name='ID'),
               name='first_sensor')

Finally we make a DataFrame from it:

In [5]:
df1 = s1.to_frame()
df1.head(2)

                                            first_sensor
ID
56544  [0.8114525441505623, -0.3217884672643395, 0.60...
71891  [-0.26833609828743826, -0.22780342069485734, 0...


For demonstration purposes, we now add measurements from a second sensor:

In [6]:
curves_from_sensor_2 = [np.random.randn(500) for i in range(100)]
# pd.Index replaces pd.Int64Index, which was removed in pandas 2.0.
s2 = pd.Series(data=curves_from_sensor_2,
               index=pd.Index(ids, name='ID'),
               name='second_sensor')
df2 = s2.to_frame()

In [7]:
df = df1.join(df2)
df.head(2)

                                            first_sensor                                      second_sensor
ID
56544  [0.8114525441505623, -0.3217884672643395, 0.60...  [-0.9107989962672383, -1.0180821890732348, 0.7...
71891  [-0.26833609828743826, -0.22780342069485734, 0...  [-0.527631423555406, -0.6043961713463648, 0.45...
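`join()` matches rows by their index labels rather than by position, which is why both sensors had to share the same `ID` index. The following is a minimal sketch of that behaviour; the frames `left` and `right` and their values are made up for illustration and are not part of the notebook's data:

```python
import pandas as pd

# Hypothetical toy frames; the IDs and values are invented for this sketch.
left = pd.DataFrame({'a': [1, 2, 3]},
                    index=pd.Index([10, 20, 30], name='ID'))
right = pd.DataFrame({'b': [300, 100]},
                     index=pd.Index([30, 10], name='ID'))

# join() is a left join on the index by default: rows are matched by label,
# and labels missing on the right side are filled with NaN.
joined = left.join(right)
print(joined)
```

Here ID 20 has no partner in `right`, so its `b` value is missing, while IDs 10 and 30 are matched correctly even though `right` lists them in a different order.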


# Applying functions

## 1. `DataFrame.apply()`
We now want to calculate some summary statistics on the curves. To do this, we call `.apply()` on the DataFrame. The function passed to `.apply()` receives the columns (`axis=0`, the default) or the rows (`axis=1`) of the DataFrame one at a time, each as a `pd.Series`.

In [8]:
def _calculate_mean_of_sensor(row, column='first_sensor'):
    single_curve = row[column]    
    return np.mean(single_curve)

# axis=1 applies the function row-wise: each row is passed in as a Series.
mean_of_first_sensor = df.apply(_calculate_mean_of_sensor, axis=1).rename('mean_of_first_sensor')
mean_of_first_sensor.head(2)

ID
56544    0.004345
71891    0.021336
Name: mean_of_first_sensor, dtype: float64
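To make the difference between `axis=0` and `axis=1` concrete, the sketch below applies the same function both ways and reports what it was handed. The frame `toy` and the helper `describe_input` are hypothetical names used only for this illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 IDs, two sensors, 5 random values per curve.
rng = np.random.default_rng(0)
ids = pd.Index([101, 102, 103], name='ID')
toy = pd.concat(
    [pd.Series([rng.standard_normal(5) for _ in range(3)], index=ids, name=name)
     for name in ('first_sensor', 'second_sensor')],
    axis=1,
)

def describe_input(obj):
    # obj is a Series: its name is the column label (axis=0)
    # or the row's index label (axis=1).
    return f"{obj.name}: {len(obj)} entries"

by_column = toy.apply(describe_input, axis=0)  # one call per column
by_row = toy.apply(describe_input, axis=1)     # one call per row
print(by_column)
print(by_row)
```

With `axis=0` the function sees a whole column (3 entries, one per ID); with `axis=1` it sees one row (2 entries, one per sensor).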

A function can use multiple columns in its calculation. Let's say we want the absolute difference between the means of sensor 1 and sensor 2:

In [9]:
def _get_mean_difference(row, first_sensor='first_sensor', second_sensor='second_sensor'):
    sensor_1_curve = row[first_sensor]
    sensor_2_curve = row[second_sensor]
    
    return np.abs(np.mean(sensor_1_curve) - np.mean(sensor_2_curve))

mean_difference = df.apply(_get_mean_difference, axis=1).rename('mean_difference')
mean_difference.head(2)

ID
56544    0.011244
71891    0.035214
Name: mean_difference, dtype: float64
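The helper functions above take column names as keyword parameters with defaults. `DataFrame.apply()` forwards any extra keyword arguments to the applied function, so the defaults can be overridden at call time. A minimal sketch, using a small made-up frame `toy` and a hypothetical helper `mean_of_sensor`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with easy-to-check means.
ids = pd.Index([101, 102], name='ID')
toy = pd.concat(
    [pd.Series([np.array([1.0, 2.0, 3.0]), np.array([4.0, 6.0])],
               index=ids, name='first_sensor'),
     pd.Series([np.array([10.0, 20.0]), np.array([0.0, 2.0])],
               index=ids, name='second_sensor')],
    axis=1,
)

def mean_of_sensor(row, column='first_sensor'):
    # row is a Series holding one curve per sensor column.
    return np.mean(row[column])

# Uses the default column.
first = toy.apply(mean_of_sensor, axis=1)
# Extra keyword arguments are passed through to mean_of_sensor.
second = toy.apply(mean_of_sensor, axis=1, column='second_sensor')
print(first, second, sep='\n')
```

This avoids writing one wrapper function per column.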

Functions can also return multiple outputs. In this case we return a `pd.Series`; its index labels become the column names of the resulting DataFrame:

In [10]:
# Named _get_means because it returns both means rather than their difference.
def _get_means(row, first_sensor='first_sensor', second_sensor='second_sensor'):
    sensor_1_curve = row[first_sensor]
    sensor_2_curve = row[second_sensor]
    mean_curve_1 = np.mean(sensor_1_curve)
    mean_curve_2 = np.mean(sensor_2_curve)

    return pd.Series({'Mean_Curve_1': mean_curve_1, 'Mean_Curve_2': mean_curve_2})

means = df.apply(_get_means, axis=1)
means.head(2)

       Mean_Curve_1  Mean_Curve_2
ID
56544      0.004345     -0.006899
71891      0.021336      0.056549
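As an alternative to building a `pd.Series` inside the function, the function can return a plain tuple and `apply()` can be asked to spread it over columns with `result_type='expand'`. The sketch below assumes a small made-up frame `toy` and a hypothetical helper `means_as_tuple`; the expanded columns are numbered 0, 1, ... and are renamed afterwards:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with easy-to-check means.
ids = pd.Index([101, 102], name='ID')
toy = pd.concat(
    [pd.Series([np.array([1.0, 2.0, 3.0]), np.array([4.0, 6.0])],
               index=ids, name='first_sensor'),
     pd.Series([np.array([10.0, 20.0]), np.array([0.0, 2.0])],
               index=ids, name='second_sensor')],
    axis=1,
)

def means_as_tuple(row):
    # Return both means as a tuple instead of a Series.
    return np.mean(row['first_sensor']), np.mean(row['second_sensor'])

# result_type='expand' turns each returned tuple into one row of a DataFrame.
expanded = toy.apply(means_as_tuple, axis=1, result_type='expand')
expanded.columns = ['Mean_Curve_1', 'Mean_Curve_2']
print(expanded)
```

Returning a labelled `pd.Series`, as in the cell above, keeps the column names next to the values; the tuple variant is merely a different way to get the same shape of result.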


## 2. `DataFrame.applymap()`

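If we want to apply the same function to every field of the table, rather than row-wise or column-wise, we can use `.applymap()`. (Since pandas 2.1, `.applymap()` is deprecated in favor of the equivalent `DataFrame.map()`.) Here we calculate the length of each curve: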

In [11]:
lengths = df.applymap(len).add_prefix('length_')
lengths.head(2)

       length_first_sensor  length_second_sensor
ID
56544                  500                   500
71891                  500                   500
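Because pandas 2.1 renamed `DataFrame.applymap()` to `DataFrame.map()`, code that has to run on both older and newer versions can pick whichever method is available. This is only a sketch of one possible approach; the frame `toy` and the variable `elementwise` are hypothetical names:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: curves of different lengths.
ids = pd.Index([101, 102], name='ID')
toy = pd.concat(
    [pd.Series([np.array([1.0, 2.0, 3.0]), np.array([4.0, 6.0])],
               index=ids, name='first_sensor'),
     pd.Series([np.array([10.0, 20.0]), np.array([0.0, 2.0])],
               index=ids, name='second_sensor')],
    axis=1,
)

# Prefer DataFrame.map (pandas >= 2.1); fall back to applymap on older versions.
elementwise = toy.map if hasattr(toy, 'map') else toy.applymap

# len is called once per field, i.e. once per stored curve.
lengths = elementwise(len).add_prefix('length_')
print(lengths)
```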


## 3. `Series.apply()`
`Series.apply()` applies the function to each element of the Series. This is very similar to `DataFrame.applymap()`, which does the same for every field of a DataFrame:

In [12]:
s1.apply(len).head(2)

ID
56544    500
71891    500
Name: first_sensor, dtype: int64
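`Series.apply()` calls the Python function once per element. When all curves have the same length, as in this notebook's test data, one possible alternative is to stack them into a 2-D NumPy array and compute the statistic in a single call. The sketch below compares both approaches on freshly generated data; the names `curves`, `stacked`, `means_apply` and `means_vectorized` are chosen for this illustration only:

```python
import numpy as np
import pandas as pd

# Hypothetical test data shaped like the notebook's: 100 curves of 500 values.
rng = np.random.default_rng(0)
curves = pd.Series([rng.standard_normal(500) for _ in range(100)],
                   name='first_sensor')

# Element-wise: one Python-level call per curve.
means_apply = curves.apply(lambda curve: curve.mean())

# Vectorized: stack the equal-length curves into a (100, 500) array first.
stacked = np.stack(curves.tolist())
means_vectorized = pd.Series(stacked.mean(axis=1),
                             index=curves.index, name=curves.name)

print(np.allclose(means_apply, means_vectorized))
```

Stacking only works when every curve has the same length; for ragged data, `Series.apply()` remains the straightforward choice.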