# Demo notebook for analyzing application data

## Introduction

Application data refers to information about which apps are open at a given time. These data can reveal important information about people's circadian rhythms, social patterns, and activity. Application data is event data: it cannot be sampled at a regular frequency; instead, we only have information about the events that occurred. There are two main issues with application data: (1) missing data detection and (2) privacy concerns.
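To make the notion of event data concrete, here is a minimal sketch using plain `pandas`. The events below are made up for illustration and are not part of the `niimpy` sample data. It shows that the gaps between events are uneven, and that counting events in fixed windows turns them into a regular series.

```python
import pandas as pd

# Hypothetical app-opening events; the timestamps are irregular by nature
events = pd.DataFrame(
    {"application_name": ["Gmail", "WhatsApp", "Gmail", "Maps"]},
    index=pd.to_datetime(
        [
            "2019-08-05 14:02:51",
            "2019-08-05 14:02:58",
            "2019-08-05 14:31:10",
            "2019-08-05 16:05:00",
        ]
    ),
)

# Time elapsed between consecutive events: every gap is different
gaps = events.index.to_series().diff().dropna()

# Counting events in fixed 30-minute windows yields a regularly spaced series
counts = events["application_name"].resample("30min").count()
print(gaps)
print(counts)
```

Windows without any event get a count of zero, which is not the same as missing data: an empty window may mean that no app was opened or that events were not recorded.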

Regarding missing data detection, we may never know whether all events were detected and reported, and there is little we can do about it. Nevertheless, we can take into account some factors that may interfere with the correct detection of all events (e.g. when the phone's battery is depleted). Therefore, to correctly process application data, we need to consider other information, such as the battery status of the phone.
Regarding privacy concerns, application names can reveal too much about a subject; for example, the use of an uncommon app may help identify a subject. Consequently, we try to anonymize the data by grouping the apps.

To address both of these issues, `niimpy` includes the function `extract_features_app` to clean, downsample, and extract features from application data while taking into account factors like the battery level and naming groups. In addition, `niimpy` provides a map with some of the common apps for pseudo-anonymization. This function employs other functions to extract the following features:

- `app_count`: number of times an app group has been used 
- `app_duration`: how long an app group has been used

The app module also has one internal function that helps classify the apps into groups.

In the following, we will analyze application data provided by `niimpy` as an example to illustrate the use of application data.

## Read data

In [1]:
import niimpy
import niimpy.preprocessing.application as app
import warnings
import pandas as pd
warnings.filterwarnings("ignore")

In [2]:
data = niimpy.read_csv('/m/cs/scratch/networks/trianaa1/Paper3/niimpy/niimpy/sampledata/singleuser_AwareApplicationNotifications.csv', tz='Europe/Helsinki')
data.shape

(132, 6)

There are 132 datapoints with 6 columns in the dataset. Let us have a quick look at the data:

In [3]:
data.head()

Unnamed: 0,user,device,time,application_name,package_name,datetime
2019-08-05 14:02:51.009999872+03:00,iGyXetHE3S8u,Cq9vueHh3zVs,1565003000.0,Android System,android,2019-08-05 14:02:51.009999872+03:00
2019-08-05 14:02:58.009999872+03:00,iGyXetHE3S8u,Cq9vueHh3zVs,1565003000.0,Android System,android,2019-08-05 14:02:58.009999872+03:00
2019-08-05 14:03:17.009999872+03:00,iGyXetHE3S8u,Cq9vueHh3zVs,1565003000.0,Google Play Music,com.google.android.music,2019-08-05 14:03:17.009999872+03:00
2019-08-05 14:02:55.009999872+03:00,iGyXetHE3S8u,Cq9vueHh3zVs,1565003000.0,Google Play Music,com.google.android.music,2019-08-05 14:02:55.009999872+03:00
2019-08-05 14:03:31.009999872+03:00,iGyXetHE3S8u,Cq9vueHh3zVs,1565003000.0,Gmail,com.google.android.gm,2019-08-05 14:03:31.009999872+03:00


The dataframe seems to be complete. Its index consists of timestamps, and it has a column with the name of the application recorded for each event (*application_name*).

#### A few words on missing data
Missing application data is difficult to detect. Firstly, this sensor is triggered by events (i.e. it is not sampled at a fixed frequency). Secondly, different phones, operating systems, and settings affect how easily apps are detected. Thirdly, events not related to the application sensor may affect its behavior, e.g. the battery running out. Unfortunately, we can only correct for events such as the screen turning off or the phone shutting down, by using data from the screen sensor and the battery level. `niimpy` can take these into account if we provide the screen and battery data.

#### A few words on grouping the apps
As previously mentioned, the application name may reveal too much about a subject, and privacy problems may arise. A possible solution is to classify the apps into more generic groups. For example, popular apps like WhatsApp, Signal, and Telegram are mainly used for texting, so we can group them under the label *texting*. `niimpy` provides a default map, but it should be adapted to the characteristics of the sample, since the apps in use depend on the country and the population.
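The grouping idea can be sketched with a plain dictionary and `pandas`. The map below is hypothetical and much smaller than any realistic one; it is not the default map shipped with `niimpy`, and the group labels are only examples.

```python
import pandas as pd

# Hypothetical map from app name to a generic group; adapt it to your sample
group_map = {
    "WhatsApp": "texting",
    "Signal": "texting",
    "Telegram": "texting",
    "Gmail": "email",
}

apps = pd.DataFrame(
    {"application_name": ["WhatsApp", "Gmail", "Signal", "RareApp"]}
)

# Apps missing from the map fall into a catch-all group, so an uncommon
# (and potentially identifying) app name never reaches the output
apps["app_group"] = apps["application_name"].map(group_map).fillna("na")

# Keep only the group label for further analysis
anonymized = apps.drop(columns="application_name")
print(anonymized)
```

Sending unmapped apps to a catch-all label is a design choice: it keeps rare apps from identifying a subject, at the cost of losing information about them.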

#### A few words on the role of the battery and screen
As mentioned before, sometimes the screen may be OFF, and these events will not be caught by the application sensor. For example, we can open an app and leave it open until the phone screen turns off automatically. Another example is when the battery is depleted and the phone shuts down automatically. Having this information is crucial for correctly computing how long a subject used each app group. `niimpy`'s application module can take into account both the screen and the battery data.
For this example, we have both, so let's load the screen and battery data.

In [4]:
bat_data = niimpy.read_csv('/m/cs/scratch/networks/trianaa1/Paper3/niimpy/niimpy/sampledata/multiuser_AwareBattery.csv', tz='Europe/Helsinki')
screen_data = niimpy.read_csv('/m/cs/scratch/networks/trianaa1/Paper3/niimpy/niimpy/sampledata/multiuser_AwareScreen.csv', tz='Europe/Helsinki')

In [5]:
bat_data.head()

Unnamed: 0,user,device,time,battery_level,battery_status,battery_health,battery_adaptor,datetime
2020-01-09 02:20:02.924999936+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1578529000.0,74,3,2,0,2020-01-09 02:20:02.924999936+02:00
2020-01-09 02:21:30.405999872+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1578529000.0,73,3,2,0,2020-01-09 02:21:30.405999872+02:00
2020-01-09 02:24:12.805999872+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1578529000.0,72,3,2,0,2020-01-09 02:24:12.805999872+02:00
2020-01-09 02:35:38.561000192+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1578530000.0,72,2,2,0,2020-01-09 02:35:38.561000192+02:00
2020-01-09 02:35:38.953000192+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1578530000.0,72,2,2,2,2020-01-09 02:35:38.953000192+02:00


The dataframe looks fine. In this case, we are interested in the battery_status information. This is standard information provided by Android. However, if the dataframe stores this information in a column with a different name, we can use the argument `battery_column_name` and input our custom battery column name (see Extracting features, customized features). 

In [6]:
screen_data.head()

Unnamed: 0,user,device,time,screen_status,datetime
2020-01-09 02:06:41.573999872+02:00,jd9INuQ5BBlW,OWd1Uau8POix,1578528000.0,0,2020-01-09 02:06:41.573999872+02:00
2020-01-09 02:09:29.152000+02:00,jd9INuQ5BBlW,OWd1Uau8POix,1578529000.0,1,2020-01-09 02:09:29.152000+02:00
2020-01-09 02:09:32.790999808+02:00,jd9INuQ5BBlW,OWd1Uau8POix,1578529000.0,3,2020-01-09 02:09:32.790999808+02:00
2020-01-09 02:11:41.996000+02:00,jd9INuQ5BBlW,OWd1Uau8POix,1578529000.0,0,2020-01-09 02:11:41.996000+02:00
2020-01-09 02:16:19.010999808+02:00,jd9INuQ5BBlW,OWd1Uau8POix,1578529000.0,1,2020-01-09 02:16:19.010999808+02:00


This dataframe looks fine too. In this case, we are interested in the screen_status information. However, if the dataframe has this information in a column with a different name, we can use the argument `screen_column_name` and input our custom screen column name (see Extracting features, customized features). 
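To illustrate why screen events matter when computing durations, here is a minimal sketch in plain `pandas`. It is not `niimpy`'s implementation: the two frames, the `off_time` column, and the `next_event` column are made up for illustration. The duration of each app event is cut at the first screen-off that follows it, whenever that happens before the next app event.

```python
import pandas as pd

# Hypothetical app events and screen-off times (both sorted by time)
app_events = pd.DataFrame(
    {
        "datetime": pd.to_datetime(
            ["2020-01-09 10:00:00", "2020-01-09 10:05:00", "2020-01-09 12:00:00"]
        ),
        "application_name": ["Gmail", "WhatsApp", "Gmail"],
    }
)
screen_off = pd.DataFrame(
    {"off_time": pd.to_datetime(["2020-01-09 10:07:00", "2020-01-09 12:30:00"])}
)

# Naive end of use: the moment the next app event starts
app_events["next_event"] = app_events["datetime"].shift(-1)

# Attach to each app event the first screen-off at or after it
merged = pd.merge_asof(
    app_events, screen_off,
    left_on="datetime", right_on="off_time", direction="forward",
)

# End of use: whichever comes first, the next app event or the screen-off
end = merged["off_time"].where(
    merged["off_time"] < merged["next_event"], merged["next_event"]
)
end = end.fillna(merged["off_time"])  # last event has no next app event

merged["duration"] = (end - merged["datetime"]).dt.total_seconds()
print(merged[["application_name", "duration"]])
```

Without the screen data, the second event would appear to last almost two hours (until 12:00) instead of two minutes.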

## Extracting features

To extract app features, we use the function `extract_features_app`. This function needs four inputs: a dataframe with the app observations, two dataframes with the information from the battery and screen sensors, and a dictionary used to pass customizable arguments to the function. The battery and screen dataframes can be empty if we do not have such information. The function has some default parameters. Let's have a look at those first.

### Default option

The default option will compute all features in 30-minute aggregation windows. To use the `extract_features_app` function with its default options, simply call the function. Remember to include battery and screen data when available. 

In [7]:
default = app.extract_features_app(data, bat_data, screen_data, features=None)

computing app_count...
computing app_duration...


The function prints the computed features so you can track its progress. Now, let's have a look at the outputs.

In [10]:
default.head()

user,app_group,datetime,count,duration
iGyXetHE3S8u,comm,2019-08-05 14:00:00+03:00,86,37.0
iGyXetHE3S8u,leisure,2019-08-05 14:00:00+03:00,20,7.0
iGyXetHE3S8u,na,2019-08-05 14:00:00+03:00,19,9.0
iGyXetHE3S8u,work,2019-08-05 14:00:00+03:00,7,6.0


The function output is also a dataframe where each column stands for a feature. The indexes are subjects, app groups, and timestamps. 
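Since the output is indexed by three levels, a short sketch of how such a frame can be queried may help. The frame below is built by hand from the values shown above, so that the example runs without `niimpy`; the variable names are only illustrative.

```python
import pandas as pd

# Rebuild the output shown above by hand (values copied from the table)
window = pd.Timestamp("2019-08-05 14:00:00", tz="Europe/Helsinki")
index = pd.MultiIndex.from_tuples(
    [("iGyXetHE3S8u", group, window) for group in ["comm", "leisure", "na", "work"]],
    names=["user", "app_group", "datetime"],
)
features_df = pd.DataFrame(
    {"count": [86, 20, 19, 7], "duration": [37.0, 7.0, 9.0, 6.0]},
    index=index,
)

# Select a single app group across all users and windows
comm = features_df.xs("comm", level="app_group")

# Sum the counts of all app groups for each user
total = features_df.groupby(level="user")["count"].sum()

# Turn the index levels into ordinary columns, e.g. for plotting or export
flat = features_df.reset_index()
print(comm, total, flat, sep="\n\n")
```

The total count for this user is 132, which matches the number of datapoints in the dataset.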

The default option can also be run in the absence of battery or screen data. In this case, simply pass an empty dataframe as the second or third argument of the `extract_features_app` function.

In [11]:
empty_bat = pd.DataFrame()
empty_screen = pd.DataFrame()
no_bat = app.extract_features_app(data, empty_bat, screen_data, features=None) #no battery information
no_screen = app.extract_features_app(data, bat_data, empty_screen, features=None) #no screen information
no_bat_no_screen = app.extract_features_app(data, empty_bat, empty_screen, features=None) #no battery and no screen information

computing app_count...
computing app_duration...
computing app_count...


TypeError: shutdown_info() missing 1 required positional argument: 'feature_functions'

### Customized features

The `extract_features_app` function can also be customized. We can:
- extract some of the features (not all)
- modify the aggregation periods

All of these modifications need to be inside the dictionary input. 

Let's see how to use this to compute only some of the features. To do so, we need to create a dictionary where the keys are the names of the features we want to compute, and the values are empty dictionaries.

In [12]:
custom = {}
custom['app_count'] = {}

In [20]:
custom_output = app.extract_features_app(data, bat_data, screen_data, features=custom)
custom_output.head()

computing app_count...


user,app_group,datetime,count
iGyXetHE3S8u,comm,2019-08-05 14:00:00+03:00,86
iGyXetHE3S8u,leisure,2019-08-05 14:00:00+03:00,20
iGyXetHE3S8u,na,2019-08-05 14:00:00+03:00,19
iGyXetHE3S8u,work,2019-08-05 14:00:00+03:00,7


Now, let's compute another set of features with different aggregation windows. For that, we rely on the arguments of the `pandas.DataFrame.resample` function.

For this example, we will aggregate the features `app_count` and `app_duration`. The app count will be computed on a daily basis and the app duration in 5-minute periods.

In [21]:
features = {
    "app_count": {"app_column_name": "application_name", "screen_column_name": "screen_status", "resample_args": {"rule": "1D"}},
    "app_duration": {"app_column_name": "application_name", "screen_column_name": "screen_status", "resample_args": {"rule": "5T"}},
}

As we see, we have an input dictionary in which the main keys are the names of the features to compute. For each feature, we also have a dictionary. This new dictionary has some other arguments, mainly the name of the column that we would like to use for the computation and another dictionary named `resample_args`. The `app_column_name` is the column name where the application names are stored; the `screen_column_name` is the column name where the screen status is stored; the `battery_column_name` is the column name where the battery status is stored. This helps in case our dataframe has some other naming conventions. The `resample_args` dictionary contains the arguments to pass for the resampling (see `pandas.DataFrame.resample`).
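As a small illustration of what a `resample_args` dictionary does, the sketch below unpacks it into a `pandas` resampling call on made-up data. The series and variable names are hypothetical. Note that recent `pandas` versions prefer the alias `"5min"` over `"5T"`, so which one to use depends on the installed version.

```python
import pandas as pd

# Hypothetical durations (in seconds) attached to event timestamps
durations = pd.Series(
    [30.0, 45.0, 10.0, 60.0],
    index=pd.to_datetime(
        [
            "2019-08-05 14:01",
            "2019-08-05 14:03",
            "2019-08-05 14:07",
            "2019-08-06 09:00",
        ]
    ),
)

# The same kind of dictionaries that go under "resample_args"
daily_args = {"rule": "1D"}
five_min_args = {"rule": "5min"}  # "5T" is the older alias for the same rule

# Unpacking the dictionary passes its entries as keyword arguments
daily_count = durations.resample(**daily_args).count()
five_min_sum = durations.resample(**five_min_args).sum()

print(daily_count)
print(five_min_sum.head(2))
```

Any other keyword accepted by `resample` (for example `closed` or `label`) can be added to the dictionary in the same way.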

In [25]:
custom_output = app.extract_features_app(data, bat_data, screen_data, features=features)
custom_output

computing app_count...
computing app_duration...


user,app_group,datetime,count,duration
iGyXetHE3S8u,comm,2019-08-05 00:00:00+03:00,86.0,
iGyXetHE3S8u,leisure,2019-08-05 00:00:00+03:00,20.0,
iGyXetHE3S8u,na,2019-08-05 00:00:00+03:00,19.0,
iGyXetHE3S8u,work,2019-08-05 00:00:00+03:00,7.0,
iGyXetHE3S8u,comm,2019-08-05 14:00:00+03:00,,37.0
iGyXetHE3S8u,leisure,2019-08-05 14:00:00+03:00,,7.0
iGyXetHE3S8u,na,2019-08-05 14:00:00+03:00,,9.0
iGyXetHE3S8u,work,2019-08-05 14:00:00+03:00,,6.0


The output is once again a dataframe. In this case, two aggregations are shown. The first one is the daily aggregation computed for the `app_count` feature. The second one is the 5-minute aggregation computed for the `app_duration` feature. Note that because `app_duration` is not aggregated daily, it is NaN at the daily timestamps. Similarly, because `app_count` is not aggregated in 5-minute windows, it is NaN at the 5-minute timestamps.
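The NaN pattern follows from how `pandas` aligns indexes, and it can be reproduced with a minimal sketch. The two series below are made up for illustration (one value each, taken from the table above), and this is not necessarily how `niimpy` combines its features internally.

```python
import pandas as pd

# One feature aggregated daily, another one aggregated in 5-minute windows
daily_stamp = pd.Timestamp("2019-08-05 00:00:00")
window_stamp = pd.Timestamp("2019-08-05 14:00:00")
count = pd.Series([86], index=[daily_stamp], name="count")
duration = pd.Series([37.0], index=[window_stamp], name="duration")

# Aligning on the union of both indexes leaves NaN where a feature
# has no value for that timestamp
combined = pd.concat([count, duration], axis=1)
print(combined)

# Each feature can be recovered on its own by dropping its NaN rows
count_only = combined["count"].dropna()
duration_only = combined["duration"].dropna()
```

This alignment is also why the counts are displayed as floats (`86.0`) in the combined output: an integer column cannot hold NaN by default, so it is converted to floating point.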

## Implementing own features

We can implement our own customized features easily. To do so, we need to define a function that accepts a dataframe and returns a dataframe. The returned object should be indexed by user and timestamps.
To make the feature readily available in the default options, we need to add the *app* prefix to the name of the new function (e.g. `app_my_new_feature`).

In [26]:
def app_min_duration(df, feature_functions=None):
    # classify_app is assumed to be the grouping helper of the app module
    # (imported above as `app`)
    df2 = app.classify_app(df, feature_functions)
    # Duration of an event: time elapsed until the next event
    df2['duration'] = df2['datetime'].diff().shift(-1)
    # Discard implausibly long gaps (more than 10 hours)
    thr = pd.Timedelta('10 hours')
    df2 = df2[~(df2.duration > thr)].copy()
    df2["duration"] = df2["duration"].dt.total_seconds()

    df2.dropna(inplace=True)

    # Returned as-is when no rows survive the filtering above
    result = pd.DataFrame()
    if len(df2) > 0:
        df2['datetime'] = pd.to_datetime(df2['datetime'])
        df2.set_index('datetime', inplace=True)
        # Shortest duration per user and app group in each resampling window
        result = df2.groupby(["user", "app_group"])["duration"].resample(**feature_functions["resample_args"]).min()

    return result

Then, we can call our new function using the `extract_features_app` function.

In [27]:
customized_features = app.extract_features_app(data, bat_data, screen_data, features={"app_min_duration": {}})

computing app_min_duration...


NameError: name 'app_min_duration' is not defined