# Advanced Pandas functionality
## - DataFrame.apply()

## Introduction
* We now use Pandas DataFrames to hold objects (here: NumPy arrays) instead of plain numbers
* We process all columns or rows using the `.apply()` and `.applymap()` methods

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Preparing test data

First we generate some objects, namely 100 NumPy arrays containing 500 standard-normal random values each:

In [2]:
curves = [np.random.randn(500) for i in range(100)]

Then we generate a unique random ID for each curve (in practice these could be tube IDs):

In [3]:
ids = np.random.choice(range(10000, 99999), 100, replace=False)
ids

array([95437, 68665, 85526, 82262, 20467, 30134, 35891, 43742, 55650,
       82130, 69579, 73958, 38449, 32999, 94211, 80614, 92808, 88652,
       42248, 25149, 73435, 21957, 83799, 51375, 97738, 29072, 80899,
       83665, 47280, 28518, 91560, 15346, 26089, 42736, 13239, 48198,
       93190, 74132, 33181, 57280, 38715, 26915, 39098, 87776, 39174,
       78156, 96252, 79862, 37778, 81156, 78152, 23293, 72383, 50947,
       94188, 75322, 96960, 94090, 76252, 44455, 77794, 83168, 90352,
       19157, 89723, 65739, 20048, 65337, 11184, 80951, 15388, 21093,
       91278, 26814, 50407, 75116, 53018, 78726, 64545, 74736, 17807,
       53150, 68601, 79925, 34942, 53586, 80881, 45867, 68724, 15337,
       53458, 94026, 38589, 32275, 59805, 25578, 19873, 72391, 98951,
       18504])

Next we put the curves into a Series, indexed by these IDs:

In [4]:
s1 = pd.Series(data=curves,
               index=ids,
               name='first_sensor')

Finally we make a DataFrame from it:

In [5]:
df1 = s1.to_frame()
df1.head(5)

Unnamed: 0,first_sensor
95437,"[0.0058863113656874265, -0.9054804150551561, -..."
68665,"[0.8232155225909663, -0.6140061255516971, -0.8..."
85526,"[1.4396331019401678, 0.22997243788902322, -1.9..."
82262,"[0.791363294878909, 1.2764733125449317, -0.305..."
20467,"[0.025836670891773893, 0.7157073035142217, 0.1..."


For demonstration purposes we now add measurements from a second sensor:

In [7]:
curves_from_sensor_2 = [np.random.randn(500) for i in range(100)]
s2 = pd.Series(data=curves_from_sensor_2,
               index=pd.Index(ids, dtype='int64', name='ID'), # Use pd.Index with dtype='int64'
               name='second_sensor')
df2 = s2.to_frame()

In [8]:
df = df1.join(df2)
df.head(2)

Unnamed: 0,first_sensor,second_sensor
95437,"[0.0058863113656874265, -0.9054804150551561, -...","[-1.3925245410643488, -1.3275409437614447, -0...."
68665,"[0.8232155225909663, -0.6140061255516971, -0.8...","[-1.2868518302789222, -0.7319126210487701, -0...."


# Applying functions

## 1. `DataFrame.apply()`
We now want to calculate summary statistics on the curves, so we use `.apply()` on the DataFrame. The function passed to `.apply()` receives the columns (`axis=0`, the default) or the rows (`axis=1`) of the DataFrame one at a time, each as a Series.

In [9]:
def _calculate_mean_of_sensor(row, column='first_sensor'):
    single_curve = row[column]
    return np.mean(single_curve)

# axis=1 applies the function row-wise
mean_of_first_sensor = df.apply(_calculate_mean_of_sensor, axis=1).rename('mean_of_first_sensor')
mean_of_first_sensor.head(2)

Unnamed: 0,mean_of_first_sensor
95437,-0.018355
68665,-0.022625
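
To make the `axis` argument more concrete, here is a minimal sketch on a tiny stand-in DataFrame. The names `demo`, `calculate_mean_of_sensor` and the IDs 101 to 103 are made up for illustration. It also shows that extra keyword arguments given to `.apply()` are forwarded to the function, so the `column` parameter can select the second sensor:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature version of the tutorial data: 3 curves per sensor.
rng = np.random.default_rng(0)
ids = [101, 102, 103]
first = pd.Series([rng.standard_normal(5) for _ in ids],
                  index=ids, name='first_sensor')
second = pd.Series([rng.standard_normal(5) for _ in ids],
                   index=ids, name='second_sensor')
demo = first.to_frame().join(second)

def calculate_mean_of_sensor(row, column='first_sensor'):
    # `row` is a Series indexed by column name
    return np.mean(row[column])

# axis=1: the function is called once per row
mean_first = demo.apply(calculate_mean_of_sensor, axis=1)

# Extra keyword arguments are passed through to the function
mean_second = demo.apply(calculate_mean_of_sensor, axis=1,
                         column='second_sensor')

# axis=0: the function is called once per column (a Series indexed by ID),
# so len() counts the curves in each column
curves_per_column = demo.apply(len, axis=0)
```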


A function can use multiple columns in its calculation. Let's say we want to calculate the absolute difference between the means of sensor 1 and sensor 2:

In [10]:
def _get_mean_difference(row, first_sensor='first_sensor', second_sensor='second_sensor'):
    sensor_1_curve = row[first_sensor]
    sensor_2_curve = row[second_sensor]

    return np.abs(np.mean(sensor_1_curve) - np.mean(sensor_2_curve))

mean_difference = df.apply(_get_mean_difference, axis=1).rename('mean_difference')
mean_difference.head(2)

Unnamed: 0,mean_difference
95437,0.034585
68665,0.000226
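
As a sanity check on the row-wise approach, the same quantity can be computed one column at a time and combined afterwards. The following is only a sketch on made-up data; `demo`, `get_mean_difference` and `curve_mean` are illustrative names:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature data set: 4 curves per sensor.
rng = np.random.default_rng(1)
ids = [11, 12, 13, 14]
first = pd.Series([rng.standard_normal(50) for _ in ids],
                  index=ids, name='first_sensor')
second = pd.Series([rng.standard_normal(50) for _ in ids],
                   index=ids, name='second_sensor')
demo = first.to_frame().join(second)

def get_mean_difference(row):
    # Uses two columns of the same row
    return np.abs(np.mean(row['first_sensor']) - np.mean(row['second_sensor']))

row_wise = demo.apply(get_mean_difference, axis=1)

def curve_mean(curve):
    # Receives a single NumPy array (one cell of the column)
    return curve.mean()

# Per-column means, then an ordinary vectorized subtraction
column_wise = (demo['first_sensor'].apply(curve_mean)
               - demo['second_sensor'].apply(curve_mean)).abs()

same = np.allclose(row_wise, column_wise)
```

Both variants give the same numbers; the row-wise form keeps the whole calculation in one function, while the column-wise form leaves the final arithmetic to pandas.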


Functions can also return multiple values. In that case we return a `pd.Series`, and each key becomes a column of the result:

In [11]:
def _get_means(row, first_sensor='first_sensor', second_sensor='second_sensor'):
    sensor_1_curve = row[first_sensor]
    sensor_2_curve = row[second_sensor]
    mean_curve_1 = np.mean(sensor_1_curve)
    mean_curve_2 = np.mean(sensor_2_curve)

    # Each key of the returned Series becomes a column of the result
    return pd.Series({'Mean_Curve_1': mean_curve_1, 'Mean_Curve_2': mean_curve_2})

means = df.apply(_get_means, axis=1)
means.head(2)

Unnamed: 0,Mean_Curve_1,Mean_Curve_2
95437,-0.018355,0.01623
68665,-0.022625,-0.022399
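
Returning a `pd.Series` is not the only way to get several output columns. As a possible alternative, the function can return a plain tuple and `.apply()` can be asked to expand it with `result_type='expand'`. This sketch uses made-up data and the illustrative names `demo` and `get_means`; the expanded columns are numbered 0, 1, ... and are renamed afterwards:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature data set: 4 curves per sensor.
rng = np.random.default_rng(2)
ids = [21, 22, 23, 24]
first = pd.Series([rng.standard_normal(50) for _ in ids],
                  index=ids, name='first_sensor')
second = pd.Series([rng.standard_normal(50) for _ in ids],
                   index=ids, name='second_sensor')
demo = first.to_frame().join(second)

def get_means(row):
    # Return a tuple instead of a pd.Series
    return np.mean(row['first_sensor']), np.mean(row['second_sensor'])

# result_type='expand' turns each list-like result into columns 0, 1, ...
means = demo.apply(get_means, axis=1, result_type='expand')
means.columns = ['Mean_Curve_1', 'Mean_Curve_2']
```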


## 2. `DataFrame.applymap()`

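If we want to apply the same function to every element of the table, rather than row- or column-wise, we can use `.applymap()`. Note that `DataFrame.applymap()` is deprecated since pandas 2.1 in favor of `DataFrame.map()`, which works the same way. Here we calculate the length of each curve: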

In [12]:
lengths = df.applymap(len).add_prefix('length_')
lengths.head(2)


Unnamed: 0,length_first_sensor,length_second_sensor
95437,500,500
68665,500,500
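
Since `DataFrame.map()` only exists from pandas 2.1 onwards, code that should run on both older and newer versions can pick whichever method is available. This is one possible pattern, sketched on made-up data (`demo` and `elementwise` are illustrative names):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature data set: 3 curves of length 7 per sensor.
rng = np.random.default_rng(3)
ids = [31, 32, 33]
first = pd.Series([rng.standard_normal(7) for _ in ids],
                  index=ids, name='first_sensor')
second = pd.Series([rng.standard_normal(7) for _ in ids],
                   index=ids, name='second_sensor')
demo = first.to_frame().join(second)

# Prefer DataFrame.map (pandas >= 2.1), fall back to applymap otherwise
elementwise = demo.map if hasattr(demo, 'map') else demo.applymap

# len() is called once per cell, i.e. once per curve
lengths = elementwise(len).add_prefix('length_')
```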


## 3. `Series.apply()`
`Series.apply()` applies the function to each element of the Series. It is the Series counterpart of `DataFrame.applymap()`:

In [13]:
s1.apply(len).head(2)

Unnamed: 0,first_sensor
95437,500
68665,500
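
`Series.apply()` also forwards extra keyword arguments to the function, which is handy for parameterized per-curve statistics. A small sketch with two hand-written curves, so the results can be checked by eye (`count_above` and its `threshold` parameter are hypothetical):

```python
import numpy as np
import pandas as pd

# Two short, hand-written curves instead of random data
s = pd.Series([np.array([0.5, 1.5, 2.5]),
               np.array([-1.0, 0.0, 3.0])],
              index=[41, 42], name='first_sensor')

def count_above(curve, threshold=0.0):
    # Number of samples in one curve that exceed the threshold
    return int((curve > threshold).sum())

lengths = s.apply(len)                             # 3 and 3
above_one = s.apply(count_above, threshold=1.0)    # 2 and 1
```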
