# Demo notebook for analyzing calls and SMS data

## Introduction

Communication data include call and SMS information. These data can reveal important information about people's circadian rhythm, social patterns, and activity, so it is important to organize them for further processing. To this end, `niimpy` includes the function `extract_features_comms` to clean, downsample, and extract features from communication data. This function employs other functions to extract the following features:

- `call_duration_total`: total duration of incoming and outgoing calls
- `call_duration_mean`: mean duration of incoming and outgoing calls
- `call_duration_median`: median duration of incoming and outgoing calls
- `call_duration_std`: standard deviation of the duration of incoming and outgoing calls
- `call_count`: number of calls within a time window
- `call_outgoing_incoming_ratio`: number of outgoing calls divided by the number of incoming calls
- `sms_count`: count of incoming and outgoing SMS (computed by a separate function, not by `extract_features_comms`)
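
To make the call feature definitions above concrete, here is a minimal sketch that computes the same kind of statistics with plain `pandas` on a made-up call log for a single time window. It does not use `niimpy`; the toy values are hypothetical, and returning NaN for an undefined ratio is a choice of this sketch, not necessarily what `niimpy` does.

```python
import numpy as np
import pandas as pd

# Toy call log for one time window (durations in seconds, values made up)
calls = pd.DataFrame(
    {
        "call_type": ["outgoing", "incoming", "incoming", "outgoing"],
        "call_duration": [120, 60, 180, 240],
    }
)

# Duration statistics per call type: total, mean, median, std and count
stats = calls.groupby("call_type")["call_duration"].agg(
    ["sum", "mean", "median", "std", "count"]
)
print(stats)

# Outgoing/incoming ratio; this sketch uses NaN when there are no incoming calls
n_out = (calls["call_type"] == "outgoing").sum()
n_in = (calls["call_type"] == "incoming").sum()
ratio = n_out / n_in if n_in > 0 else np.nan
print(ratio)
```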

In the following, we will analyze call logs provided by `niimpy` as an example to illustrate the use of niimpy's communication preprocessing functions.

## Read data

In [1]:
import sys
sys.path.append('../../')

import niimpy
from config import config
import niimpy.preprocessing.communication as com

In [2]:
data = niimpy.read_csv(config.MULTIUSER_AWARE_CALLS_PATH, tz='Europe/Helsinki')
data.shape

(38, 6)

There are 38 datapoints with 6 columns in the dataset. Let us have a quick look at the data:

In [3]:
data.head()

,user,device,time,call_type,call_duration,datetime
2020-01-09 02:08:03.896000+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1578528000.0,incoming,1079,2020-01-09 02:08:03.896000+02:00
2020-01-09 02:49:44.969000192+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1578531000.0,outgoing,174,2020-01-09 02:49:44.969000192+02:00
2020-01-09 02:22:57.168999936+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1578529000.0,outgoing,890,2020-01-09 02:22:57.168999936+02:00
2020-01-09 02:27:21.187000064+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1578530000.0,outgoing,1342,2020-01-09 02:27:21.187000064+02:00
2020-01-09 02:47:16.176999936+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1578531000.0,incoming,645,2020-01-09 02:47:16.176999936+02:00


In [4]:
data.tail()

,user,device,time,call_type,call_duration,datetime
2019-08-12 22:10:21.504000+03:00,iGyXetHE3S8u,Cq9vueHh3zVs,1565637000.0,incoming,715,2019-08-12 22:10:21.504000+03:00
2019-08-12 22:27:19.923000064+03:00,iGyXetHE3S8u,Cq9vueHh3zVs,1565638000.0,outgoing,225,2019-08-12 22:27:19.923000064+03:00
2019-08-13 07:01:00.960999936+03:00,iGyXetHE3S8u,Cq9vueHh3zVs,1565669000.0,outgoing,1231,2019-08-13 07:01:00.960999936+03:00
2019-08-13 07:28:27.657999872+03:00,iGyXetHE3S8u,Cq9vueHh3zVs,1565671000.0,incoming,591,2019-08-13 07:28:27.657999872+03:00
2019-08-13 07:21:26.436000+03:00,iGyXetHE3S8u,Cq9vueHh3zVs,1565670000.0,outgoing,375,2019-08-13 07:21:26.436000+03:00


The dataframe seems to be complete. Its index consists of timestamps, and its two main columns are `call_type` and `call_duration`. The dataframe also contains data from multiple users.
Note that the calls should be labeled as *incoming*, *outgoing*, or *missed*.
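
As a quick sanity check of the labeling requirement, the following sketch looks for unexpected `call_type` values before any features are extracted. It runs on a hypothetical toy dataframe rather than the notebook's data, and the `rejected` label is made up for illustration.

```python
import pandas as pd

# Hypothetical call log; "rejected" is not one of the expected labels
calls = pd.DataFrame(
    {
        "user": ["u1", "u1", "u2", "u2"],
        "call_type": ["incoming", "outgoing", "missed", "rejected"],
        "call_duration": [30, 45, 0, 0],
    }
)

expected_labels = {"incoming", "outgoing", "missed"}

# Rows whose label falls outside the expected set
unexpected = calls[~calls["call_type"].isin(expected_labels)]
print(unexpected)

# Keep only the rows with a recognised label
clean = calls[calls["call_type"].isin(expected_labels)]
print(clean["call_type"].value_counts())
```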

## Extracting features

To extract communication features, we use the function `extract_features_comms`. It takes two inputs: a dataframe containing the call observations and a dictionary of customizable arguments. The function has default parameters; let's have a look at those first.

### Default option

The default option computes all features in 30-minute aggregation windows. To use `extract_features_comms` with its default options, call it with `features=None`.

In [5]:
default = com.extract_features_comms(data, features=None)

computing <function call_duration_total at 0x000001E848768280>...
computing <function call_duration_mean at 0x000001E848768430>...
computing <function call_duration_median at 0x000001E8487684C0>...
computing <function call_duration_std at 0x000001E848768550>...
computing <function call_count at 0x000001E8487685E0>...
computing <function call_outgoing_incoming_ratio at 0x000001E848768670>...


The function prints the name of each feature as it is computed, so you can track its progress. Now let's have a look at the output:

In [6]:
default.head()

user,,outgoing_duration_total,incoming_duration_total,missed_duration_total,outgoing_duration_mean,incoming_duration_mean,missed_duration_mean,outgoing_duration_median,incoming_duration_median,missed_duration_median,outgoing_duration_std,incoming_duration_std,missed_duration_std,outgoing_count,incoming_count,missed_count,outgoing_incoming_ratio
iGyXetHE3S8u,2019-08-09 07:00:00+03:00,1322.0,0.0,0.0,1322.0,0.0,0.0,1322.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,inf
iGyXetHE3S8u,2019-08-09 07:30:00+03:00,959.0,1824.0,0.0,959.0,912.0,0.0,959.0,912.0,0.0,0.0,172.534055,0.0,1.0,2.0,1.0,0.5
iGyXetHE3S8u,2019-08-09 08:00:00+03:00,0.0,131.0,0.0,0.0,131.0,0.0,0.0,131.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
iGyXetHE3S8u,2019-08-09 08:30:00+03:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
iGyXetHE3S8u,2019-08-09 09:00:00+03:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [7]:
default.tail()

user,,outgoing_duration_total,incoming_duration_total,missed_duration_total,outgoing_duration_mean,incoming_duration_mean,missed_duration_mean,outgoing_duration_median,incoming_duration_median,missed_duration_median,outgoing_duration_std,incoming_duration_std,missed_duration_std,outgoing_count,incoming_count,missed_count,outgoing_incoming_ratio
iGyXetHE3S8u,2019-08-09 05:00:00+03:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
iGyXetHE3S8u,2019-08-09 05:30:00+03:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
iGyXetHE3S8u,2019-08-09 06:00:00+03:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
iGyXetHE3S8u,2019-08-09 06:30:00+03:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
jd9INuQ5BBlW,2020-01-09 03:00:00+02:00,0.0,269.0,0.0,0.0,269.0,0.0,0.0,269.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


The function output is also a dataframe in which each column stands for a feature. The index consists of users and timestamps.
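
Because the output is indexed by user and timestamp, standard `pandas` MultiIndex selection applies. The sketch below uses a small hand-built dataframe as a stand-in for the feature table; the user names and values are hypothetical.

```python
import pandas as pd

# Toy stand-in for the feature table: index levels are (user, timestamp)
times = pd.date_range(
    "2019-08-09 07:00", periods=2, freq="30min", tz="Europe/Helsinki"
)
index = pd.MultiIndex.from_product([["userA", "userB"], times], names=["user", None])
toy_features = pd.DataFrame({"outgoing_count": [1.0, 1.0, 0.0, 2.0]}, index=index)

# All windows of one user; the user level is dropped from the result
user_a = toy_features.xs("userA", level="user")
print(user_a)

# Windows in which at least one outgoing call took place
active = toy_features[toy_features["outgoing_count"] > 0]
print(active)
```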

### Customized features

The `extract_features_comms` function can also be customized. We can:
- extract some of the features (not all)
- modify the aggregation periods

Both modifications are specified through the `features` dictionary.

Let's see how to use this to compute only some of the features. To do so, we create a dictionary whose keys are the feature functions we want to compute and whose values are empty dictionaries.

In [8]:
custom = {}
custom[com.call_duration_mean] = {}
custom[com.call_duration_median] = {}

In [9]:
custom_output = com.extract_features_comms(data, features=custom)
custom_output.head()

computing <function call_duration_mean at 0x000001E848768430>...
computing <function call_duration_median at 0x000001E8487684C0>...


user,,outgoing_duration_mean,incoming_duration_mean,missed_duration_mean,outgoing_duration_median,incoming_duration_median,missed_duration_median
iGyXetHE3S8u,2019-08-09 07:00:00+03:00,1322.0,0.0,0.0,1322.0,0.0,0.0
iGyXetHE3S8u,2019-08-09 07:30:00+03:00,959.0,912.0,0.0,959.0,912.0,0.0
iGyXetHE3S8u,2019-08-09 08:00:00+03:00,0.0,131.0,0.0,0.0,131.0,0.0
iGyXetHE3S8u,2019-08-09 08:30:00+03:00,0.0,0.0,0.0,0.0,0.0,0.0
iGyXetHE3S8u,2019-08-09 09:00:00+03:00,0.0,0.0,0.0,0.0,0.0,0.0


As we see, this time only two features were computed, again in 30-minute aggregation windows. Now, let's compute another set of features with different aggregation windows. For that, we rely on the arguments of the `pandas.DataFrame.resample` function.

For this example, we will compute the features `call_count` and `call_duration_total`. The total call duration will be computed on a daily basis, and the number of calls will be computed in 5-hour periods with a 5-minute offset.

In [10]:
features = {com.call_duration_total: {"communication_column_name": "call_duration",
                                      "resample_args": {"rule": "1D"}},
            com.call_count: {"communication_column_name": "call_duration",
                             "resample_args": {"rule": "5H", "offset": "5min"}}}

As we see, the keys of the input dictionary are the feature functions to compute, and each value is itself a dictionary of arguments for that feature. The main arguments are `communication_column_name`, the name of the column to use for the computation, and `resample_args`. The column name is useful when our dataframe follows a different naming convention. The `resample_args` dictionary contains the arguments passed on for the resampling (see `pandas.DataFrame.resample`).
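
To illustrate what a `resample_args` dictionary does, the sketch below unpacks one directly into a `pandas` resample call on a toy call log. It assumes that the dictionary is forwarded as keyword arguments, as the custom feature later in this notebook does; the user and timestamps are hypothetical, and the lowercase `"5h"` alias is used as an equivalent of `"5H"`.

```python
import pandas as pd

# Toy call log for one hypothetical user, indexed by timestamp
idx = pd.to_datetime(
    ["2020-01-09 00:10", "2020-01-09 04:50", "2020-01-09 05:10", "2020-01-09 11:00"]
)
calls = pd.DataFrame({"user": "u1", "call_duration": [10, 20, 30, 40]}, index=idx)

# 5-hour windows shifted by 5 minutes: 00:05, 05:05, 10:05, ...
resample_args = {"rule": "5h", "offset": "5min"}

# The dictionary is unpacked into keyword arguments of resample
counts = calls.groupby("user")["call_duration"].resample(**resample_args).count()
print(counts)
```

With these settings, the window edges fall at five minutes past the hour, which matches the `05:05`, `10:05`, and `15:05` timestamps in the notebook's output.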

In [11]:
custom_output = com.extract_features_comms(data, features=features)
custom_output.head(10)

computing <function call_duration_total at 0x000001E848768280>...
computing <function call_count at 0x000001E8487685E0>...


user,,outgoing_duration_total,incoming_duration_total,missed_duration_total,outgoing_count,incoming_count,missed_count
iGyXetHE3S8u,2019-08-09 00:00:00+03:00,2281.0,1955.0,0.0,,,
iGyXetHE3S8u,2019-08-10 00:00:00+03:00,2726.0,1298.0,0.0,,,
iGyXetHE3S8u,2019-08-11 00:00:00+03:00,0.0,0.0,0.0,,,
iGyXetHE3S8u,2019-08-12 00:00:00+03:00,418.0,715.0,0.0,,,
iGyXetHE3S8u,2019-08-13 00:00:00+03:00,1606.0,591.0,0.0,,,
jd9INuQ5BBlW,2020-01-09 00:00:00+02:00,7318.0,6643.0,0.0,,,
iGyXetHE3S8u,2019-08-08 00:00:00+03:00,0.0,4409.0,0.0,,,
iGyXetHE3S8u,2019-08-09 05:05:00+03:00,,,,2.0,0.0,1.0
iGyXetHE3S8u,2019-08-09 10:05:00+03:00,,,,0.0,0.0,0.0
iGyXetHE3S8u,2019-08-09 15:05:00+03:00,,,,0.0,0.0,0.0


The output is once again a dataframe. In this case, two aggregations are shown. The first one is the daily aggregation computed for the `call_duration_total` feature. The second one is the 5-hour aggregation with a 5-minute offset for the `call_count` feature. This is why the user IDs are repeated. Note that because `call_count` is not computed for daily windows, its columns are NaN at the daily timestamps. Similarly, because `call_duration_total` is not computed for 5-hour windows, its columns are NaN at those timestamps.
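
The NaN pattern can be reproduced with plain `pandas`. The sketch below assumes that features computed on different time grids are aligned on the union of their indexes, which is consistent with the output shown but is not taken from `niimpy`'s implementation; all values are made up.

```python
import pandas as pd

# Two toy features for one hypothetical user, computed on different time grids
daily = pd.Series(
    [300.0],
    index=pd.to_datetime(["2020-01-09 00:00"]),
    name="outgoing_duration_total",
)
five_hourly = pd.Series(
    [2.0, 1.0],
    index=pd.to_datetime(["2020-01-09 00:05", "2020-01-09 05:05"]),
    name="outgoing_count",
)

# Aligning on the union of both indexes leaves NaN where a feature has no value
combined = pd.concat([daily, five_hourly], axis=1)
print(combined)

# Each feature can be recovered on its own grid by dropping its missing rows
counts_only = combined["outgoing_count"].dropna()
print(counts_only)
```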

### SMS computations

`niimpy` includes one function to count the outgoing and incoming SMS. This function is not called automatically by `extract_features_comms`, but it can be used on its own. Let's see a quick example in which we load the SMS data and preprocess it.

In [12]:
data = niimpy.read_csv(config.MULTIUSER_AWARE_MESSAGES_PATH, tz='Europe/Helsinki')
data.head()

,user,device,time,message_type,datetime
2020-01-09 02:34:46.644999936+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1578530000.0,incoming,2020-01-09 02:34:46.644999936+02:00
2020-01-09 02:34:58.803000064+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1578530000.0,outgoing,2020-01-09 02:34:58.803000064+02:00
2020-01-09 02:35:37.611000064+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1578530000.0,outgoing,2020-01-09 02:35:37.611000064+02:00
2020-01-09 02:55:40.640000+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1578531000.0,outgoing,2020-01-09 02:55:40.640000+02:00
2020-01-09 02:55:50.914000128+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1578531000.0,incoming,2020-01-09 02:55:50.914000128+02:00


In [13]:
sms = com.sms_count(data, feature_functions={})
sms.head()

user,,outgoing_count,incoming_count
iGyXetHE3S8u,2019-08-13 08:30:00+03:00,1,1.0
iGyXetHE3S8u,2019-08-13 09:00:00+03:00,0,0.0
iGyXetHE3S8u,2019-08-13 09:30:00+03:00,2,1.0
iGyXetHE3S8u,2019-08-13 10:00:00+03:00,0,0.0
iGyXetHE3S8u,2019-08-13 10:30:00+03:00,0,0.0


We see that the function also differentiates between the incoming and outgoing messages. This is crucial for understanding the communication patterns of a subject. 

As with `extract_features_comms`, we can modify the aggregation period of the SMS count by including the appropriate arguments in the `feature_functions` dictionary. Let's see an example with a daily aggregation.

In [14]:
sms = com.sms_count(data, feature_functions={"resample_args":{"rule":"1D"}})
sms.head()

user,,outgoing_count,incoming_count
iGyXetHE3S8u,2019-08-13 00:00:00+03:00,3,2.0
iGyXetHE3S8u,2019-08-14 00:00:00+03:00,2,0.0
iGyXetHE3S8u,2019-08-15 00:00:00+03:00,1,0.0
jd9INuQ5BBlW,2020-01-09 00:00:00+02:00,12,18.0
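
For intuition, daily per-direction counts like the ones above can be approximated with plain `pandas`. This is a minimal sketch on a hypothetical toy message log, not the way `sms_count` is implemented.

```python
import pandas as pd

# Toy message log for one hypothetical user, indexed by timestamp
idx = pd.to_datetime(
    ["2020-01-09 02:34", "2020-01-09 02:35", "2020-01-09 02:55", "2020-01-10 09:00"]
)
messages = pd.DataFrame(
    {
        "user": "u1",
        "message_type": ["incoming", "outgoing", "outgoing", "incoming"],
    },
    index=idx,
)

# Count messages per user, day and direction, then spread directions into columns
daily_counts = (
    messages.groupby(["user", pd.Grouper(freq="1D"), "message_type"])
    .size()
    .unstack("message_type", fill_value=0)
)
print(daily_counts)
```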


## Implementing own features

We can easily implement our own customized features. To do so, we need to define a function that accepts a dataframe and returns a dataframe. The returned object should be indexed by user and timestamps.
To make the feature readily available in the default options, we need to add the *call* prefix to the name of the new function (e.g. `call_my_new_feature`).

In [15]:
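import pandas as pd

def call_count_all(df, feature_functions=None):
    # Treat a missing argument dictionary as empty; the caller's dict is not modified
    feature_functions = feature_functions or {}
    col_name = feature_functions.get("communication_column_name", "call_duration")
    # Default to 30-minute windows when no resampling arguments are given
    resample_args = feature_functions.get("resample_args", {"rule": "30min"})

    # Nothing to count: return an empty dataframe instead of failing
    if len(df) == 0:
        return pd.DataFrame(columns=["call_count_all"])

    # Count all calls, regardless of type, per user and time window
    result = df.groupby("user")[col_name].resample(**resample_args).count()
    return result.rename("call_count_all").to_frame()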

Then, we can call our new function using the `extract_features_comms` function.

In [16]:
data = niimpy.read_csv(config.MULTIUSER_AWARE_CALLS_PATH, tz='Europe/Helsinki')
customized_features = com.extract_features_comms(data, features={call_count_all: {}})

computing <function call_count_all at 0x000001E826B4FEB0>...


In [17]:
customized_features.head()

user,,call_count_all
iGyXetHE3S8u,2019-08-08 22:30:00+03:00,5
iGyXetHE3S8u,2019-08-08 23:00:00+03:00,0
iGyXetHE3S8u,2019-08-08 23:30:00+03:00,0
iGyXetHE3S8u,2019-08-09 00:00:00+03:00,0
iGyXetHE3S8u,2019-08-09 00:30:00+03:00,0
