# Demo notebook for analyzing calls and SMS data

## Introduction

Communication data includes calls and SMS information. These data can reveal important information about people's circadian rhythm, social patterns, and activity. Therefore, it is important to organize this information for further processing. To address this, `niimpy` includes the function `extract_features_comms` to clean, downsample, and extract features from communication data. This function employs other functions to extract the following features:

- `call_duration_total`: total duration of incoming and outgoing calls
- `call_duration_mean`: mean duration of incoming and outgoing calls
- `call_duration_median`: median duration of incoming and outgoing calls
- `call_duration_std`: standard deviation of the duration of incoming and outgoing calls
- `call_count`: number of calls within a time window
- `call_outgoing_incoming_ratio`: number of outgoing calls divided by the number of incoming calls
- `sms_count`: count of incoming and outgoing SMS
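
To make the call feature definitions above concrete, the following minimal sketch computes the same kinds of statistics with plain `pandas` on a tiny synthetic call log. It does not use `niimpy`. The data are made up, and restricting the duration statistics to incoming and outgoing calls is an assumption of this sketch, not a statement about niimpy's internals.

```python
import pandas as pd

# Synthetic call log; the column names mirror the ones used in this notebook.
calls = pd.DataFrame(
    {
        "user": ["u1", "u1", "u1", "u1"],
        "call_type": ["incoming", "outgoing", "outgoing", "missed"],
        "call_duration": [60.0, 120.0, 30.0, 0.0],
    },
    index=pd.to_datetime(
        ["2020-01-01 10:00", "2020-01-01 10:05", "2020-01-01 10:10", "2020-01-01 10:20"]
    ),
)

# Assumption of this sketch: duration statistics use answered calls only.
answered = calls[calls["call_type"].isin(["incoming", "outgoing"])]
duration = answered["call_duration"]

n_outgoing = (calls["call_type"] == "outgoing").sum()
n_incoming = (calls["call_type"] == "incoming").sum()

summary = {
    "call_duration_total": duration.sum(),
    "call_duration_mean": duration.mean(),
    "call_duration_median": duration.median(),
    "call_duration_std": duration.std(),
    "call_count": len(calls),
    "call_outgoing_incoming_ratio": n_outgoing / n_incoming,
}
print(summary)
```

In the actual features, these statistics are computed per user and per time window rather than over the whole table.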

In the following, we will analyze call logs provided by `niimpy` as an example to illustrate the use of niimpy's communication preprocessing functions.

## Read data

In [1]:
import niimpy
import config as config
import niimpy.preprocessing.communication as com

In [None]:
data = niimpy.read_csv(config.MULTIUSER_AWARE_CALLS_PATH, tz='Europe/Helsinki')
data.shape

There are 38 datapoints with 6 columns in the dataset. Let us have a quick look at the data:

In [None]:
data.head()

In [None]:
data.tail()

The dataframe seems to be complete. It is indexed by timestamp and has two main columns: `call_type` and `call_duration`. In addition, the dataframe contains information from multiple users.
Note that each call should be labeled as *incoming*, *outgoing*, or *missed*.
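
A quick way to verify the labeling requirement is to look for unexpected values in the `call_type` column. The sketch below uses a synthetic dataframe, and `unexpected_call_types` is a hypothetical helper written for this example, not part of `niimpy`.

```python
import pandas as pd

# The three labels mentioned in the text above.
EXPECTED_CALL_TYPES = {"incoming", "outgoing", "missed"}


def unexpected_call_types(df, column="call_type"):
    """Return the set of labels in `column` that are not expected."""
    return set(df[column].dropna().unique()) - EXPECTED_CALL_TYPES


# Synthetic example with one mislabeled row ("Outgoing" with a capital O).
example = pd.DataFrame({"call_type": ["incoming", "Outgoing", "missed", "outgoing"]})
print(unexpected_call_types(example))
```

An empty set means that every call carries one of the three expected labels.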

## Extracting features

To extract communication features, we need to employ the function `extract_features_comms`. This function needs two inputs: a dataframe with the data and a dictionary. The dataframe should contain the call observations, and the dictionary is used to pass customizable arguments to the function. The function has some default parameters. Let's have a look at those first.

### Default option

The default option will compute all features in 30-minute aggregation windows. To use the `extract_features_comms` function with its default options, simply call the function. 

In [None]:
default = com.extract_features_comms(data, features=None)

The function prints the computed features so you can track its progress. Now let's have a look at the outputs.

In [None]:
default.head()

In [None]:
default.tail()

The function output is also a dataframe where each column stands for a feature. The index has two levels: subject and timestamp.
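
Because the output is indexed by subject and timestamp, standard `pandas` tools for multi-level indexes apply. The sketch below builds a small synthetic table of the same shape and shows how one might select a single subject or aggregate per subject. The index level names and the values are assumptions made for illustration.

```python
import pandas as pd

# Synthetic feature table: first index level is the subject, second the window start.
index = pd.MultiIndex.from_product(
    [["u1", "u2"], pd.date_range("2020-01-01 10:00", periods=2, freq="30min")],
    names=["user", "timestamp"],
)
features = pd.DataFrame({"call_count": [1, 0, 2, 3]}, index=index)

# Select all windows of one subject; the subject level is dropped from the result.
one_user = features.xs("u2", level=0)

# Aggregate a feature over all windows, per subject.
per_user_total = features.groupby(level=0)["call_count"].sum()

print(one_user)
print(per_user_total)
```

Selecting by level position (`level=0`) avoids depending on how the index levels are named.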

### Customized features

The `extract_features_comms` function can also be customized. We can:
- extract some of the features (not all)
- modify the aggregation periods

All of these modifications need to be inside the dictionary input. 

Let's see how to compute only some of the features. To do so, we create a dictionary whose keys are the feature functions we want to compute and whose values are empty dictionaries.

In [None]:
custom = {}
custom[com.call_duration_mean] = {}
custom[com.call_duration_median] = {}

In [None]:
custom_output = com.extract_features_comms(data, features=custom)
custom_output.head()

As we see, this time only two features were computed in a 30-min aggregated period. Now, let's compute another set of features with different aggregation windows. For that, we rely on the arguments from the `pandas.DataFrame.resample` function. 

For this example, we will aggregate the features `call_count` and `call_duration_total`. The total call duration will be computed on a daily basis, and the number of calls will be computed in 5-hour periods with a 5-min offset.

In [None]:
features = {
    com.call_duration_total: {"communication_column_name": "call_duration", "resample_args": {"rule": "1D"}},
    com.call_count: {"communication_column_name": "call_duration", "resample_args": {"rule": "5H", "offset": "5min"}},
}

As we see, we have an input dictionary in which the main keys are the names of the features to compute. For each feature, we also have a dictionary. This new dictionary has some other arguments, mainly the name of the column that we would like to use for the computation and another dictionary named `resample_args`. The name of the column helps us in case our dataframe has some other naming conventions. The `resample_args` dictionary contains the arguments to pass for the resampling (see `pandas.DataFrame.resample`).
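
To illustrate what the `rule` and `offset` resampling arguments do, here is a minimal sketch that applies them directly with `pandas` to a synthetic call log. It mirrors the idea of unpacking a `resample_args` dictionary into `resample`, but it is only an illustration and not niimpy's implementation. The lowercase alias `"5h"` is used here for the 5-hour rule.

```python
import pandas as pd

# Synthetic calls of one user within a single day.
calls = pd.DataFrame(
    {"user": ["u1"] * 4, "call_duration": [60.0, 30.0, 45.0, 10.0]},
    index=pd.to_datetime(
        ["2020-01-01 08:00", "2020-01-01 09:30", "2020-01-01 14:00", "2020-01-01 21:00"]
    ),
)

# 5-hour windows whose edges are shifted by 5 minutes from midnight.
resample_args = {"rule": "5h", "offset": "5min"}

# Count the calls per user and window; the dictionary is unpacked into `resample`.
counts = calls.groupby("user")["call_duration"].resample(**resample_args).count()
print(counts)
```

With these arguments the window edges fall at 00:05, 05:05, 10:05, and so on, so the two morning calls share the window that starts at 05:05.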

In [None]:
custom_output = com.extract_features_comms(data, features=features)
custom_output.head(10)

The output is once again a dataframe. In this case, two aggregations are shown. The first one is the daily aggregation computed for the `call_duration_total` feature. The second one is the 5-hour aggregation with a 5-min offset computed for `call_count`. This is why the user IDs are repeated. Note that because `call_count` is not aggregated daily, it has NaN values at the daily aggregation timestamps. Similarly, because `call_duration_total` is not aggregated in 5-hour windows, it has NaN values at the 5-hour timestamps for all subjects.
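
The NaN pattern described above is what one gets when tables with different timestamps are aligned on a shared index. The following sketch reproduces it with two synthetic feature tables and `pandas.concat`. Whether `niimpy` combines the features in exactly this way is an assumption of this example.

```python
import pandas as pd

# Daily feature: one row per user and day.
daily = pd.DataFrame(
    {"call_duration_total": [300.0]},
    index=pd.MultiIndex.from_tuples([("u1", pd.Timestamp("2020-01-01 00:00"))]),
)

# 5-hour feature with a 5-minute offset: rows at different timestamps.
five_hourly = pd.DataFrame(
    {"call_count": [2, 1]},
    index=pd.MultiIndex.from_tuples(
        [("u1", pd.Timestamp("2020-01-01 05:05")), ("u1", pd.Timestamp("2020-01-01 10:05"))]
    ),
)

# Aligning on the union of both indexes leaves NaN where a feature has no value.
combined = pd.concat([daily, five_hourly], axis=1)
print(combined)

# Dropping the NaN rows recovers a single feature on its own time grid.
print(combined["call_count"].dropna())
```

Inspecting one column at a time with `dropna()` is a simple way to work with features that were aggregated over different windows.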

### SMS computations

`niimpy` includes one function to count the outgoing and incoming SMS. This function is not automatically called by `extract_features_comms`, but it can be used as a standalone function. Let's see a quick example where we load the SMS data and preprocess it.

In [None]:
data = niimpy.read_csv(config.MULTIUSER_AWARE_MESSAGES_PATH, tz='Europe/Helsinki')
data.head()

In [None]:
sms = com.sms_count(data, feature_functions={})
sms.head()

We see that the function also differentiates between the incoming and outgoing messages. This is crucial for understanding the communication patterns of a subject. 

Similar to the use of `extract_features_comms`, we can modify the aggregation period of the SMS count by including the appropriate arguments in the `feature_functions` dictionary. Let's see one example with a daily aggregation.

In [None]:
sms = com.sms_count(data, feature_functions={"resample_args":{"rule":"1D"}})
sms.head()
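
For intuition, a daily count of messages split by direction can be sketched with plain `pandas` as follows. The data are synthetic, and the `message_type` column name with its `incoming` and `outgoing` labels is an assumption of this example. The sketch does not reproduce the exact output format of `sms_count`.

```python
import pandas as pd

# Synthetic message log for two users on the same day.
messages = pd.DataFrame(
    {
        "user": ["u1", "u1", "u1", "u2"],
        "message_type": ["incoming", "outgoing", "incoming", "outgoing"],
    },
    index=pd.to_datetime(
        ["2020-01-01 09:00", "2020-01-01 12:00", "2020-01-01 18:00", "2020-01-01 20:00"]
    ),
)

# Count messages per user, day, and direction, then spread directions into columns.
daily_counts = (
    messages.groupby(["user", pd.Grouper(freq="1D"), "message_type"])
    .size()
    .unstack("message_type", fill_value=0)
)
print(daily_counts)
```

Each row holds one user and one day, with one column per message direction.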

## Implementing own features

We can easily implement our own customized features. To do so, we need to define a function that accepts a dataframe and returns a dataframe. The returned object should be indexed by user and timestamp.
To make the feature readily available in the default options, we need to add the *call* prefix to the new function's name (e.g. `call_my_new_feature`).

In [None]:
def call_count_all(df, feature_functions=None):
    # Fall back to an empty configuration so the lookups below are safe
    if feature_functions is None:
        feature_functions = {}
    # Column to count; defaults to the call duration column
    col_name = feature_functions.get("communication_column_name", "call_duration")
    # Resampling arguments; defaults to 30-minute windows
    resample_args = feature_functions.get("resample_args", {"rule": "30min"})

    # Nothing to aggregate for an empty dataframe
    if len(df) == 0:
        return None
    # Count the observations per user and time window
    result = df.groupby("user")[col_name].resample(**resample_args).count()
    return result.rename("call_count_all").to_frame()
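
Before using a custom feature, it can help to check that its output has the shape described above: a dataframe indexed by user and timestamp. The sketch below does this on synthetic data. `check_feature_output` is a hypothetical helper written for this example, not a `niimpy` function.

```python
import pandas as pd


def check_feature_output(result):
    """Hypothetical helper: verify the shape expected from a custom feature."""
    if not isinstance(result, pd.DataFrame):
        raise TypeError("a feature function should return a DataFrame")
    if result.index.nlevels != 2:
        raise ValueError("the index should have two levels: user and timestamp")
    if not isinstance(result.index.levels[1], pd.DatetimeIndex):
        raise ValueError("the second index level should hold timestamps")
    return True


# Synthetic call log and a 30-minute count, shaped like a feature output.
calls = pd.DataFrame(
    {"user": ["u1", "u1", "u2"], "call_duration": [60.0, 30.0, 45.0]},
    index=pd.to_datetime(["2020-01-01 10:00", "2020-01-01 10:10", "2020-01-01 10:40"]),
)
result = (
    calls.groupby("user")["call_duration"]
    .resample("30min")
    .count()
    .rename("call_count_all")
    .to_frame()
)
print(check_feature_output(result))
```

A check like this catches the common mistake of returning a series instead of a dataframe.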

Then, we can call our new function using the `extract_features_comms` function.

In [None]:
data = niimpy.read_csv(config.MULTIUSER_AWARE_CALLS_PATH, tz='Europe/Helsinki')
customized_features = com.extract_features_comms(data, features={call_count_all: {}})

In [None]:
customized_features.head()