# Demo notebook for analyzing screen on/off data

## Introduction

Screen data refers to the information about the status of the screen as reported by Android. These data can reveal important information about people's circadian rhythm, social patterns, and activity. Screen data is event data: it is not sampled at a regular frequency; instead, we only have information about the events that occurred. However, some factors may interfere with the correct detection of all events (e.g. when the phone's battery is depleted). Therefore, to correctly process screen data, we need to take into account other information, such as the battery status of the phone, which can complicate the preprocessing. To address this, `niimpy` includes the function `extract_features_screen` to clean, downsample, and extract features from screen data while taking into account factors like the battery level. This function employs other functions to extract the following features:

- `screen_off`: reports when the screen has been turned off
- `screen_count`: number of times the screen has turned on, off, or has been in use
- `screen_duration`: duration in seconds of the screen on, off, and in use statuses
- `screen_duration_min`: minimum duration in seconds of the screen on, off, and in use statuses
- `screen_duration_max`: maximum duration in seconds of the screen on, off, and in use statuses
- `screen_duration_median`: median duration in seconds of the screen on, off, and in use statuses
- `screen_duration_mean`: mean duration in seconds of the screen on, off, and in use statuses
- `screen_duration_std`: standard deviation of the duration in seconds of the screen on, off, and in use statuses
- `screen_first_unlock`: reports the first time when the phone was unlocked every day

In addition, the screen module has three internal functions that help classify the events and calculate their status duration. 

In the following, we will analyze screen data provided by `niimpy` as an example to illustrate the use of screen data.

## Read data

In [15]:
import niimpy
import config as config
import niimpy.preprocessing.screen as s
import warnings
import pandas as pd
warnings.filterwarnings("ignore")

In [16]:
data = niimpy.read_csv(config.MULTIUSER_AWARE_SCREEN_PATH, tz='Europe/Helsinki')
data.shape

(277, 5)

There are 277 datapoints with 5 columns in the dataset. Let us have a quick look at the data:

In [17]:
data.head()

Unnamed: 0,user,device,time,screen_status,datetime
2020-01-09 02:06:41.573999872+02:00,jd9INuQ5BBlW,OWd1Uau8POix,1578528000.0,0,2020-01-09 02:06:41.573999872+02:00
2020-01-09 02:09:29.152000+02:00,jd9INuQ5BBlW,OWd1Uau8POix,1578529000.0,1,2020-01-09 02:09:29.152000+02:00
2020-01-09 02:09:32.790999808+02:00,jd9INuQ5BBlW,OWd1Uau8POix,1578529000.0,3,2020-01-09 02:09:32.790999808+02:00
2020-01-09 02:11:41.996000+02:00,jd9INuQ5BBlW,OWd1Uau8POix,1578529000.0,0,2020-01-09 02:11:41.996000+02:00
2020-01-09 02:16:19.010999808+02:00,jd9INuQ5BBlW,OWd1Uau8POix,1578529000.0,1,2020-01-09 02:16:19.010999808+02:00


In [18]:
data.tail()

Unnamed: 0,user,device,time,screen_status,datetime
2019-09-08 17:17:14.216000+03:00,iGyXetHE3S8u,Cq9vueHh3zVs,1567952000.0,1,2019-09-08 17:17:14.216000+03:00
2019-09-08 17:17:31.966000128+03:00,iGyXetHE3S8u,Cq9vueHh3zVs,1567952000.0,0,2019-09-08 17:17:31.966000128+03:00
2019-09-08 20:50:07.360000+03:00,iGyXetHE3S8u,Cq9vueHh3zVs,1567965000.0,3,2019-09-08 20:50:07.360000+03:00
2019-09-08 20:50:08.139000064+03:00,iGyXetHE3S8u,Cq9vueHh3zVs,1567965000.0,1,2019-09-08 20:50:08.139000064+03:00
2019-09-08 20:53:12.960000+03:00,iGyXetHE3S8u,Cq9vueHh3zVs,1567965000.0,0,2019-09-08 20:53:12.960000+03:00


The dataframe seems to be complete. It is indexed by timestamps, and it has a column indicating the screen status. This screen status is coded in numbers as: 0=off, 1=on, 2=locked, 3=unlocked.
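When inspecting the raw events, it can help to translate the numeric codes into readable labels. The following is a minimal sketch on made-up toy data; the `SCREEN_STATUS_LABELS` dictionary and the `screen_status_label` column are illustrative names, not part of `niimpy`.

```python
import pandas as pd

# Status codes as described in the text above (illustrative helper, not a niimpy object)
SCREEN_STATUS_LABELS = {0: "off", 1: "on", 2: "locked", 3: "unlocked"}

# Toy data (made up) that mimics the screen_status column
toy = pd.DataFrame({"screen_status": [0, 1, 3, 0, 1]})

# Add a human-readable label next to each numeric code
toy["screen_status_label"] = toy["screen_status"].map(SCREEN_STATUS_LABELS)
print(toy)
```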

#### A few words on missing data
Missing data for screen is difficult to detect. Firstly, this sensor is triggered by events and not sampled at a fixed frequency. Secondly, different phones, operating systems, and settings change how the screen is turned on/off; for example, one phone may go from OFF to ON to UNLOCKED, while another phone may go from OFF to UNLOCKED directly. Thirdly, events not related to the screen may affect its behavior, e.g. the battery running out. Nevertheless, some event transitions are impossible, such as a transition from a status to itself (e.g. two consecutive 0s). These *impossible* transitions help us detect missing data.
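As an illustration of the idea, repeated statuses can be flagged by comparing each event with the previous event of the same user. This is only a sketch on made-up toy data with plain `pandas`; it is not how `niimpy` implements its cleaning, and the `repeated_status` column name is an assumption made for this example.

```python
import pandas as pd

# Toy events (made up) for two users, indexed by timestamp
toy = pd.DataFrame(
    {
        "user": ["a", "a", "a", "a", "b", "b", "b"],
        "screen_status": [0, 1, 1, 0, 0, 0, 3],
    },
    index=pd.to_datetime(
        [
            "2020-01-09 02:00", "2020-01-09 02:05", "2020-01-09 02:06",
            "2020-01-09 02:10", "2020-01-09 03:00", "2020-01-09 03:01",
            "2020-01-09 03:02",
        ]
    ),
)

# Previous status of the same user (NaN for each user's first event)
previous = toy.groupby("user")["screen_status"].shift()

# A status equal to the previous one hints that an event in between was missed
toy["repeated_status"] = toy["screen_status"].eq(previous)
print(toy)
```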

#### A few words on the classification of the events
We can know the status of the screen at a certain timepoint. However, we need a bit more to know the duration and the meaning of it. Consequently, we need to look at the numbers of two consecutive events and classify the transitions (going from one state to another consecutively) as:
- from 3 to 0,1,2: the phone was in use 
- from 1 to 0,1,3: the phone was on
- from 0 to 1,2,3: the phone was off

Other transitions are irrelevant. 
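The classification rules above can be sketched in a few lines of plain Python. The example below uses made-up events given as `(time in seconds, status)` pairs and sums the time spent in each class; the names `TRANSITIONS` and `classify_durations` are hypothetical, and the sketch does not reproduce `niimpy`'s internal functions.

```python
# Rules from the list above: starting status -> (allowed next statuses, label)
TRANSITIONS = {
    3: ({0, 1, 2}, "use"),
    1: ({0, 1, 3}, "on"),
    0: ({1, 2, 3}, "off"),
}

def classify_durations(events):
    """Sum the seconds spent in each class for (time, status) pairs sorted by time."""
    totals = {}
    for (t0, current), (t1, following) in zip(events, events[1:]):
        targets, label = TRANSITIONS.get(current, (set(), None))
        if following in targets:
            totals[label] = totals.get(label, 0) + (t1 - t0)
        # Any other transition (e.g. 0 -> 0) is ignored
    return totals

# Toy events (made up): the last pair is a repeated 0 and is therefore skipped
events = [(0, 0), (10, 1), (12, 3), (72, 0), (100, 1), (105, 0), (200, 0)]
totals = classify_durations(events)
print(totals)
```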

#### A few words on the role of the battery
As mentioned before, battery statuses can affect the screen behavior. In particular, when the battery is depleted and the phone shuts down automatically, the screen sensor does not emit any events. So even though the screen is technically OFF because the phone has no battery left, we will not see that 0 in the screen status column. Thus, it is important to take the battery information into account when analyzing screen data. `niimpy`'s screen module is adapted to take the battery data into account.
Since we do have some battery data, we will load it.

In [19]:
bat_data = niimpy.read_csv(config.MULTIUSER_AWARE_BATTERY_PATH, tz='Europe/Helsinki')
bat_data.head()

Unnamed: 0,user,device,time,battery_level,battery_status,battery_health,battery_adaptor,datetime
2020-01-09 02:20:02.924999936+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1578529000.0,74,3,2,0,2020-01-09 02:20:02.924999936+02:00
2020-01-09 02:21:30.405999872+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1578529000.0,73,3,2,0,2020-01-09 02:21:30.405999872+02:00
2020-01-09 02:24:12.805999872+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1578529000.0,72,3,2,0,2020-01-09 02:24:12.805999872+02:00
2020-01-09 02:35:38.561000192+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1578530000.0,72,2,2,0,2020-01-09 02:35:38.561000192+02:00
2020-01-09 02:35:38.953000192+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1578530000.0,72,2,2,2,2020-01-09 02:35:38.953000192+02:00


The dataframe looks fine. In this case, we are interested in the battery_status information. This is standard information provided by Android. However, if the dataframe has this information in a column with a different name, we can use the argument `battery_column_name` similarly to the use of `screen_column_name` (see Extracting features, customized features). 

## Extracting features

To extract screen features, we use the function `extract_features_screen`. This function needs three inputs: a dataframe with the screen data, a dataframe with the battery data, and a dictionary. The main dataframe should contain the screen observations, and the dictionary is used to pass customizable arguments to the function. The battery dataframe can be empty if we do not have such information. The function has some default parameters. Let's have a look at those first.

### Default option

The default option will compute all features in 30-minute aggregation windows. To use the `extract_features_screen` function with its default options, simply call the function. Remember to include battery data when available.

In [20]:
default = s.extract_features_screen(data, bat_data, features=None)

computing <function screen_off at 0x2ab5649dab80>...
computing <function screen_count at 0x2ab5649dac10>...
computing <function screen_duration at 0x2ab5649daca0>...
computing <function screen_duration_min at 0x2ab5649dad30>...
computing <function screen_duration_max at 0x2ab5649dadc0>...
computing <function screen_duration_mean at 0x2ab5649dae50>...
computing <function screen_duration_median at 0x2ab5649daee0>...
computing <function screen_duration_std at 0x2ab5649daf70>...
computing <function screen_first_unlock at 0x2ab5649e3040>...


The function prints the computed features so you can track its progress. Now let's have a look at the output:

In [21]:
default.tail()

Unnamed: 0_level_0,Unnamed: 1_level_0,screen_off,screen_on_count,screen_off_count,screen_use_count,screen_on_durationtotal,screen_off_durationtotal,screen_use_durationtotal,screen_on_durationminimum,screen_off_durationminimum,screen_use_durationminimum,...,screen_on_durationmean,screen_off_durationmean,screen_use_durationmean,screen_on_durationmedian,screen_off_durationmedian,screen_use_durationmedian,screen_on_durationstd,screen_off_durationstd,screen_use_durationstd,datetime
user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
jd9INuQ5BBlW,2020-01-09 22:00:00+02:00,,1.0,1.0,0.0,154.643,0.011,,154.643,0.011,,...,154.643,0.011,,154.643,0.011,,,,,NaT
jd9INuQ5BBlW,2020-01-09 22:30:00+02:00,,0.0,0.0,0.0,0.0,0.0,,,,,...,,,,,,,,,,NaT
jd9INuQ5BBlW,2020-01-09 23:00:00+02:00,,4.0,3.0,0.0,6.931,0.025,,2.079,0.008,,...,2.310333,0.008333,,2.262,0.008,,0.258906,0.000577,,NaT
iGyXetHE3S8u,2019-08-05 00:00:00+03:00,,,,,,,,,,,...,,,,,,,,,,2019-08-05 14:03:42.322000128+03:00
jd9INuQ5BBlW,2020-01-09 00:00:00+02:00,,,,,,,,,,,...,,,,,,,,,,2020-01-09 02:16:19.010999808+02:00


The function output is also a dataframe where each column stands for a feature. The index consists of subjects and timestamps. Note that the last two rows hold the first unlock time of the day; they therefore have separate timestamps and NaN values in all other columns.
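Because the first-unlock rows and the windowed features share one dataframe, it can be convenient to separate them before further analysis. The sketch below does this on a small made-up dataframe shaped like the output above; the index level name `timestamp` and the variable names are assumptions for this example only.

```python
import pandas as pd

# Toy output (made up): two 30-minute windows and one first-unlock row
index = pd.MultiIndex.from_tuples(
    [
        ("u1", pd.Timestamp("2020-01-09 22:00")),
        ("u1", pd.Timestamp("2020-01-09 22:30")),
        ("u1", pd.Timestamp("2020-01-09 00:00")),
    ],
    names=["user", "timestamp"],
)
toy = pd.DataFrame(
    {
        "screen_on_count": [1.0, 0.0, None],
        "datetime": [pd.NaT, pd.NaT, pd.Timestamp("2020-01-09 02:16:19")],
    },
    index=index,
)

# Rows with a first-unlock time: keep only the datetime column
first_unlock = toy.loc[toy["datetime"].notna(), ["datetime"]]

# Remaining rows are the windowed features: drop the datetime column
windowed = toy.loc[toy["datetime"].isna()].drop(columns="datetime")

print(first_unlock)
print(windowed)
```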

The default option can also be run in the absence of battery data. In this case, simply pass an empty dataframe as the second argument of the `extract_features_screen` function.

In [22]:
empty_data = pd.DataFrame()
default = s.extract_features_screen(data, empty_data, features=None)
default.tail()

computing <function screen_off at 0x2ab5649dab80>...
computing <function screen_count at 0x2ab5649dac10>...
computing <function screen_duration at 0x2ab5649daca0>...
computing <function screen_duration_min at 0x2ab5649dad30>...
computing <function screen_duration_max at 0x2ab5649dadc0>...
computing <function screen_duration_mean at 0x2ab5649dae50>...
computing <function screen_duration_median at 0x2ab5649daee0>...
computing <function screen_duration_std at 0x2ab5649daf70>...
computing <function screen_first_unlock at 0x2ab5649e3040>...


Unnamed: 0_level_0,Unnamed: 1_level_0,screen_off,screen_on_count,screen_off_count,screen_use_count,screen_on_durationtotal,screen_off_durationtotal,screen_use_durationtotal,screen_on_durationminimum,screen_off_durationminimum,screen_use_durationminimum,...,screen_on_durationmean,screen_off_durationmean,screen_use_durationmean,screen_on_durationmedian,screen_off_durationmedian,screen_use_durationmedian,screen_on_durationstd,screen_off_durationstd,screen_use_durationstd,datetime
user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
jd9INuQ5BBlW,2020-01-09 22:00:00+02:00,,1.0,1.0,0.0,154.643,0.011,,154.643,0.011,,...,154.643,0.011,,154.643,0.011,,,,,NaT
jd9INuQ5BBlW,2020-01-09 22:30:00+02:00,,0.0,0.0,0.0,0.0,0.0,,,,,...,,,,,,,,,,NaT
jd9INuQ5BBlW,2020-01-09 23:00:00+02:00,,4.0,3.0,0.0,6.931,0.025,,2.079,0.008,,...,2.310333,0.008333,,2.262,0.008,,0.258906,0.000577,,NaT
iGyXetHE3S8u,2019-08-05 00:00:00+03:00,,,,,,,,,,,...,,,,,,,,,,2019-08-05 14:03:42.322000128+03:00
jd9INuQ5BBlW,2020-01-09 00:00:00+02:00,,,,,,,,,,,...,,,,,,,,,,2020-01-09 02:16:19.010999808+02:00


### Customized features

The `extract_features_screen` function can also be customized. We can:
- extract some of the features (not all)
- modify the aggregation periods

All of these modifications need to be inside the dictionary input. 

Let's see how to use this to compute only some of the features. To do so, we need to create a dictionary where the keys are the feature functions we want to compute, and the values are empty dictionaries.

In [25]:
custom = {}
custom[s.screen_duration_max] = {}
custom[s.screen_count] = {}

In [26]:
custom_output = s.extract_features_screen(data, bat_data, features=custom)
custom_output.head()

computing <function screen_duration_max at 0x2ab5649dadc0>...
computing <function screen_count at 0x2ab5649dac10>...


Unnamed: 0_level_0,Unnamed: 1_level_0,screen_on_durationmaximum,screen_off_durationmaximum,screen_use_durationmaximum,screen_on_count,screen_off_count,screen_use_count
user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
iGyXetHE3S8u,2019-08-05 14:00:00+03:00,100.365,1084.703,0.345,4,4,4
iGyXetHE3S8u,2019-08-05 14:30:00+03:00,15.402,284779.15,0.119,2,2,2
iGyXetHE3S8u,2019-08-05 15:00:00+03:00,,,,0,0,0
iGyXetHE3S8u,2019-08-05 15:30:00+03:00,,,,0,0,0
iGyXetHE3S8u,2019-08-05 16:00:00+03:00,,,,0,0,0


As we see, this time only two features were computed in a 30-min aggregated period. Now, let's compute another set of features with different aggregation windows. For that, we rely on the arguments from the `pandas.DataFrame.resample` function. 

For this example, we will aggregate the features `screen_count` and `screen_duration`. The screen count will be computed on a daily basis and the screen duration will be computed in 5-hour periods with a 5-min offset.

In [28]:
features = {s.screen_count:{"screen_column_name":"screen_status","resample_args":{"rule":"1D"}},
            s.screen_duration:{"screen_column_name":"screen_status","resample_args":{"rule":"5H","offset":"5min"}}}

As we see, we have an input dictionary in which the main keys are the feature functions to compute. For each feature, the value is another dictionary with the arguments for that feature: mainly the name of the column that we would like to use for the computation and a nested dictionary named `resample_args`. The `screen_column_name` is the name of the column where the screen status is stored; this helps in case our dataframe follows a different naming convention. The `resample_args` dictionary contains the arguments to pass for the resampling (see `pandas.DataFrame.resample`).
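To see what the two `resample_args` dictionaries do on their own, here is a minimal sketch that applies them directly with `pandas` to a made-up series of durations. It only illustrates the `rule` and `offset` arguments of `resample`; it does not call `niimpy`, and the lowercase `"5h"` alias is used here as an equivalent of `"5H"`.

```python
import pandas as pd

# Toy durations in seconds (made up), indexed by event time
toy = pd.Series(
    [1.0, 2.0, 3.0, 4.0],
    index=pd.to_datetime(
        ["2020-01-09 00:10", "2020-01-09 00:20", "2020-01-09 05:10", "2020-01-10 01:00"]
    ),
)

daily_args = {"rule": "1D"}                      # one bin per calendar day
offset_args = {"rule": "5h", "offset": "5min"}   # 5-hour bins shifted by 5 minutes

daily = toy.resample(**daily_args).sum()
shifted = toy.resample(**offset_args).sum()

print(daily)
print(shifted)
```

With the offset, the bin edges fall at five minutes past the hour, which matches the `00:05`, `05:05`, ... timestamps in the `niimpy` output.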

In [29]:
custom_output = s.extract_features_screen(data, bat_data, features=features)
custom_output.head()

computing <function screen_count at 0x2ab5649dac10>...
computing <function screen_duration at 0x2ab5649daca0>...


Unnamed: 0_level_0,Unnamed: 1_level_0,screen_on_count,screen_off_count,screen_use_count,screen_on_durationtotal,screen_off_durationtotal,screen_use_durationtotal
user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
iGyXetHE3S8u,2019-08-05 00:00:00+03:00,6.0,6.0,6.0,,,
iGyXetHE3S8u,2019-08-06 00:00:00+03:00,0.0,0.0,0.0,,,
iGyXetHE3S8u,2019-08-07 00:00:00+03:00,0.0,0.0,0.0,,,
iGyXetHE3S8u,2019-08-08 00:00:00+03:00,6.0,6.0,6.0,,,
iGyXetHE3S8u,2019-08-09 00:00:00+03:00,2.0,2.0,2.0,,,


In [30]:
custom_output.tail()

Unnamed: 0_level_0,Unnamed: 1_level_0,screen_on_count,screen_off_count,screen_use_count,screen_on_durationtotal,screen_off_durationtotal,screen_use_durationtotal
user,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
jd9INuQ5BBlW,2020-01-09 00:05:00+02:00,,,,54.894,1674.82,656.738
jd9INuQ5BBlW,2020-01-09 05:05:00+02:00,,,,0.0,0.0,0.0
jd9INuQ5BBlW,2020-01-09 10:05:00+02:00,,,,135.896,1455.094,425.015
jd9INuQ5BBlW,2020-01-09 15:05:00+02:00,,,,221.362001,24.673,667.703
jd9INuQ5BBlW,2020-01-09 20:05:00+02:00,,,,557.914999,46.079,169.079


The output is once again a dataframe. In this case, two aggregations are shown. The first one is the daily aggregation computed for the `screen_count` feature. The second one is the 5-hour aggregation period with a 5-min offset for `screen_duration`. Note that because `screen_duration` was not aggregated daily, its columns are NaN at the daily timestamps. Similarly, because `screen_count` was not aggregated in 5-hour windows, its columns are NaN at the 5-hour timestamps for all subjects.

## Implementing own features

We can implement our own customized features easily. To do so, we need to define a function that accepts a dataframe and returns a dataframe. The returned object should be indexed by user and timestamps. 
To make the feature readily available in the default options, we need to add the *screen* prefix to the new function (e.g. `screen_my-new-feature`). 

In [35]:
def screen_last_unlock(df, bat, feature_functions=None):
    # Copy the arguments so that None is handled and the caller's dict is untouched
    feature_functions = dict(feature_functions or {})
    # Default to 30-minute windows when no resampling arguments are given
    feature_functions.setdefault("resample_args", {"rule": "30T"})

    # Clean the screen events using the battery data, then classify the transitions
    df2 = s.util_screen(df, bat, feature_functions)
    df2 = s.event_classification_screen(df2, feature_functions)

    # Keep the rows where `on` equals 1 and take the daily maximum per user
    result = df2[df2.on == 1].groupby("user").resample(rule="1D").max()
    result = result[["datetime"]]

    return result

Then, we can call our new function using the `extract_features_screen` function.

In [36]:
customized_features = s.extract_features_screen(data, bat_data, features={screen_last_unlock: {}})

computing <function screen_last_unlock at 0x2ab5684a7940>...


In [37]:
customized_features.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,datetime
user,Unnamed: 1_level_1,Unnamed: 2_level_1
iGyXetHE3S8u,2019-08-05 00:00:00+03:00,2019-08-05 14:49:45.596999936+03:00
iGyXetHE3S8u,2019-08-06 00:00:00+03:00,NaT
iGyXetHE3S8u,2019-08-07 00:00:00+03:00,NaT
iGyXetHE3S8u,2019-08-08 00:00:00+03:00,2019-08-08 22:44:13.834000128+03:00
iGyXetHE3S8u,2019-08-09 00:00:00+03:00,2019-08-09 07:50:33.224000+03:00
