# Basic transformations

This page shows some basic transformations you can apply once you have read data. It is essentially a pandas crash course: pandas already provides the operations you are likely to need, so there is no reason for us to reinvent them. Pandas gives us a solid but flexible base on which to build more advanced operations.

You can read more at the [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/index.html).

## Extracting single rows and columns
Let's first import mobile phone battery status data.

In [1]:
TZ = 'Europe/Helsinki'

In [2]:
import niimpy
df = niimpy.read_csv(niimpy.sampledata.MULTIUSER_AWAREBATTERY_CSV, tz=TZ)

Then check the first rows of the dataframe.

In [3]:
df.head()

Unnamed: 0,user,device,time,battery_level,battery_status,battery_health,battery_adaptor,datetime
2020-01-09 02:20:02.924999936+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1578529000.0,74,3,2,0,2020-01-09 02:20:02.924999936+02:00
2020-01-09 02:21:30.405999872+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1578529000.0,73,3,2,0,2020-01-09 02:21:30.405999872+02:00
2020-01-09 02:24:12.805999872+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1578529000.0,72,3,2,0,2020-01-09 02:24:12.805999872+02:00
2020-01-09 02:35:38.561000192+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1578530000.0,72,2,2,0,2020-01-09 02:35:38.561000192+02:00
2020-01-09 02:35:38.953000192+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1578530000.0,72,2,2,2,2020-01-09 02:35:38.953000192+02:00


Get a single column, in this case all **users**:

In [4]:
df['user']

2020-01-09 02:20:02.924999936+02:00    jd9INuQ5BBlW
2020-01-09 02:21:30.405999872+02:00    jd9INuQ5BBlW
2020-01-09 02:24:12.805999872+02:00    jd9INuQ5BBlW
2020-01-09 02:35:38.561000192+02:00    jd9INuQ5BBlW
2020-01-09 02:35:38.953000192+02:00    jd9INuQ5BBlW
                                           ...     
2020-01-09 23:02:13.938999808+02:00    jd9INuQ5BBlW
2020-01-09 23:10:37.262000128+02:00    jd9INuQ5BBlW
2020-01-09 23:22:13.966000128+02:00    jd9INuQ5BBlW
2020-01-09 23:32:13.959000064+02:00    jd9INuQ5BBlW
2020-01-09 23:39:06.800000+02:00       jd9INuQ5BBlW
Name: user, Length: 373, dtype: object

Get a single row, in this case the **5th** (note that positions are counted from zero, so the 5th row has position 4):

In [5]:
df.iloc[4]

user                                      jd9INuQ5BBlW
device                                    3p83yASkOb_B
time                                    1578530138.953
battery_level                                       72
battery_status                                       2
battery_health                                       2
battery_adaptor                                      2
datetime           2020-01-09 02:35:38.953000192+02:00
Name: 2020-01-09 02:35:38.953000192+02:00, dtype: object
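
As a complement, the sketch below contrasts positional selection (`iloc`) with label-based selection (`loc`). It uses a small hypothetical dataframe named `toy` rather than the sample data, so the values are illustrative only.

```python
import pandas as pd

# Hypothetical three-row frame indexed by timestamp (not the niimpy sample data).
idx = pd.to_datetime(["2020-01-09 02:20", "2020-01-09 02:21", "2020-01-09 02:24"])
toy = pd.DataFrame({"user": ["a", "a", "b"], "battery_level": [74, 73, 72]}, index=idx)

# Positional selection: the second row, counted from zero.
by_position = toy.iloc[1]

# Label-based selection: the row whose index equals this timestamp.
by_label = toy.loc[pd.Timestamp("2020-01-09 02:21")]

# Passing a list of column names returns a DataFrame instead of a Series.
two_columns = toy[["user", "battery_level"]]
```

Here `by_position` and `by_label` refer to the same row, because the index is unique and the timestamp matches the row at position 1.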

## Listing unique users
We can list unique users with the `pandas.Series.unique()` method.

In [6]:
df['user'].unique()

array(['jd9INuQ5BBlW'], dtype=object)

## List unique values
The same applies to other data features (columns).

In [7]:
df['battery_status'].unique()

array([3, 2, 5], dtype=int64)
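
When the number of rows per value matters as well, `nunique()` and `value_counts()` are natural companions to `unique()`. A minimal sketch on a hypothetical series (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical status codes, not taken from the sample data.
status = pd.Series([3, 3, 2, 5, 2, 3], name="battery_status")

distinct = status.unique()      # distinct values, in order of first appearance
n_distinct = status.nunique()   # how many distinct values there are
counts = status.value_counts()  # number of rows per value, most frequent first
```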

## Extract data of only one subject
We can extract the data of a single subject with a boolean mask, as follows:

In [8]:
df[df['user'] == 'jd9INuQ5BBlW']

Unnamed: 0,user,device,time,battery_level,battery_status,battery_health,battery_adaptor,datetime
2020-01-09 02:20:02.924999936+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1.578529e+09,74,3,2,0,2020-01-09 02:20:02.924999936+02:00
2020-01-09 02:21:30.405999872+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1.578529e+09,73,3,2,0,2020-01-09 02:21:30.405999872+02:00
2020-01-09 02:24:12.805999872+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1.578529e+09,72,3,2,0,2020-01-09 02:24:12.805999872+02:00
2020-01-09 02:35:38.561000192+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1.578530e+09,72,2,2,0,2020-01-09 02:35:38.561000192+02:00
2020-01-09 02:35:38.953000192+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1.578530e+09,72,2,2,2,2020-01-09 02:35:38.953000192+02:00
...,...,...,...,...,...,...,...,...
2020-01-09 23:02:13.938999808+02:00,jd9INuQ5BBlW,OWd1Uau8POix,1.578604e+09,73,3,2,0,2020-01-09 23:02:13.938999808+02:00
2020-01-09 23:10:37.262000128+02:00,jd9INuQ5BBlW,OWd1Uau8POix,1.578604e+09,73,3,2,0,2020-01-09 23:10:37.262000128+02:00
2020-01-09 23:22:13.966000128+02:00,jd9INuQ5BBlW,OWd1Uau8POix,1.578605e+09,72,3,2,0,2020-01-09 23:22:13.966000128+02:00
2020-01-09 23:32:13.959000064+02:00,jd9INuQ5BBlW,OWd1Uau8POix,1.578606e+09,71,3,2,0,2020-01-09 23:32:13.959000064+02:00
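
The same boolean-mask idea extends to several subjects or to combined conditions. The following is a small sketch on a hypothetical dataframe; the user names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical data with three users.
toy = pd.DataFrame({
    "user": ["u1", "u2", "u1", "u3"],
    "battery_level": [74, 60, 73, 90],
})

# Rows of a single user.
one_user = toy[toy["user"] == "u1"]

# Rows of any user in a list.
several = toy[toy["user"].isin(["u1", "u3"])]

# Combine conditions with & (and) or | (or); each condition needs parentheses.
low = toy[(toy["user"] == "u1") & (toy["battery_level"] < 74)]
```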


## Renaming a column or columns
A dataframe column can be renamed with the `pandas.DataFrame.rename()` method.

In [9]:
df.rename(columns={'time': 'timestamp'}, inplace=True)
df.head()

Unnamed: 0,user,device,timestamp,battery_level,battery_status,battery_health,battery_adaptor,datetime
2020-01-09 02:20:02.924999936+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1578529000.0,74,3,2,0,2020-01-09 02:20:02.924999936+02:00
2020-01-09 02:21:30.405999872+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1578529000.0,73,3,2,0,2020-01-09 02:21:30.405999872+02:00
2020-01-09 02:24:12.805999872+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1578529000.0,72,3,2,0,2020-01-09 02:24:12.805999872+02:00
2020-01-09 02:35:38.561000192+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1578530000.0,72,2,2,0,2020-01-09 02:35:38.561000192+02:00
2020-01-09 02:35:38.953000192+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1578530000.0,72,2,2,2,2020-01-09 02:35:38.953000192+02:00
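
Without `inplace=True`, `rename()` returns a new dataframe and leaves the original untouched, and several columns can be renamed in one call. A minimal sketch on a hypothetical dataframe (the new name `level` is chosen only for illustration):

```python
import pandas as pd

# Hypothetical frame with two columns.
toy = pd.DataFrame({"time": [1.0, 2.0], "battery_level": [74, 73]})

# Rename two columns at once; the result is a new DataFrame.
renamed = toy.rename(columns={"time": "timestamp", "battery_level": "level"})
```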


## Change datatypes
Let's then check the dataframe datatypes:

In [10]:
df.dtypes

user                                        object
device                                      object
timestamp                                  float64
battery_level                                int64
battery_status                               int64
battery_health                               int64
battery_adaptor                              int64
datetime           datetime64[ns, Europe/Helsinki]
dtype: object

We can change the datatypes with the `pandas.DataFrame.astype()` method. Here we change the **battery_health** datatype to categorical. Note that `astype()` returns a new dataframe and leaves `df` itself unchanged:

In [11]:
df.astype({'battery_health': 'category'}).dtypes

user                                        object
device                                      object
timestamp                                  float64
battery_level                                int64
battery_status                               int64
battery_health                            category
battery_adaptor                              int64
datetime           datetime64[ns, Europe/Helsinki]
dtype: object
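
To keep the converted datatype, assign the result to a name. The sketch below shows this on a hypothetical dataframe; the values are illustrative only.

```python
import pandas as pd

# Hypothetical frame with integer-coded battery health.
toy = pd.DataFrame({"battery_health": [2, 2, 3], "battery_level": [74, 73, 72]})

# astype returns a converted copy; the original frame keeps its integer dtype.
converted = toy.astype({"battery_health": "category"})

# The distinct codes become the categories of the new column.
categories = list(converted["battery_health"].cat.categories)
```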

## Transforming a column to a new value
Dataframe values can be transformed (decoded, etc.) into new values with the `pandas.Series.transform()` method.

Here we simply add one to the column values.


In [12]:
df['battery_adaptor'].transform(lambda x: x + 1)

2020-01-09 02:20:02.924999936+02:00    1
2020-01-09 02:21:30.405999872+02:00    1
2020-01-09 02:24:12.805999872+02:00    1
2020-01-09 02:35:38.561000192+02:00    1
2020-01-09 02:35:38.953000192+02:00    3
                                      ..
2020-01-09 23:02:13.938999808+02:00    1
2020-01-09 23:10:37.262000128+02:00    1
2020-01-09 23:22:13.966000128+02:00    1
2020-01-09 23:32:13.959000064+02:00    1
2020-01-09 23:39:06.800000+02:00       1
Name: battery_adaptor, Length: 373, dtype: int64
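
For decoding numeric codes into labels, `Series.map()` with a dictionary is a common alternative to `transform()`. The sketch below uses placeholder labels: the mapping `labels` is hypothetical and does not reflect the actual meaning of the status codes.

```python
import pandas as pd

# Hypothetical status codes.
status = pd.Series([3, 2, 5, 3], name="battery_status")

# Placeholder labels, assumed for illustration only.
labels = {2: "label_a", 3: "label_b", 5: "label_c"}

decoded = status.map(labels)                  # look up each code in the dictionary
shifted = status.transform(lambda x: x + 1)   # elementwise arithmetic, as in the text
```

Codes that are missing from the dictionary become `NaN` in the mapped result.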

## Resample
Downsampling and upsampling of time-indexed data can be done with the pandas `resample()` method, which is available on both dataframes and series.

Here we downsample the data by hour and aggregate the mean:

In [13]:
df['battery_level'].resample('H').agg("mean")

2020-01-09 02:00:00+02:00     80.836735
2020-01-09 03:00:00+02:00     96.972973
2020-01-09 04:00:00+02:00     99.990291
2020-01-09 05:00:00+02:00    100.000000
2020-01-09 06:00:00+02:00    100.000000
2020-01-09 07:00:00+02:00    100.000000
2020-01-09 08:00:00+02:00    100.000000
2020-01-09 09:00:00+02:00    100.000000
2020-01-09 10:00:00+02:00    100.000000
2020-01-09 11:00:00+02:00     98.000000
2020-01-09 12:00:00+02:00     95.000000
2020-01-09 13:00:00+02:00     92.500000
2020-01-09 14:00:00+02:00     92.866667
2020-01-09 15:00:00+02:00     99.428571
2020-01-09 16:00:00+02:00     96.333333
2020-01-09 17:00:00+02:00     92.500000
2020-01-09 18:00:00+02:00     89.200000
2020-01-09 19:00:00+02:00     86.166667
2020-01-09 20:00:00+02:00     82.000000
2020-01-09 21:00:00+02:00     78.428571
2020-01-09 22:00:00+02:00     75.000000
2020-01-09 23:00:00+02:00     72.000000
Freq: H, Name: battery_level, dtype: float64
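
The sketch below repeats the hourly downsampling on a tiny hypothetical series, so the arithmetic is easy to follow. It uses the `"60min"` frequency string as an assumption-free way to express one hour across pandas versions; hours without data appear as `NaN` in the result.

```python
import pandas as pd

# Hypothetical readings: two in the 02:00 hour, none at 03:00, one at 04:00.
idx = pd.to_datetime(["2020-01-09 02:05", "2020-01-09 02:40", "2020-01-09 04:10"])
level = pd.Series([80, 90, 100], index=idx, name="battery_level")

# One bin per hour; each bin holds the mean of the readings that fall in it.
hourly_mean = level.resample("60min").mean()
```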

## Groupby
For groupwise data inspection, we can use the `pandas.DataFrame.groupby()` method.

Let's first load a dataframe containing several subjects who belong to different groups.

In [14]:
import os
import niimpy.preprocessing.sampledata
data_path = os.path.join(niimpy.preprocessing.sampledata.SAMPLEDATA_DIR, 'sl_activity.csv')

In [15]:
df = niimpy.read_csv(data_path,tz='Europe/Helsinki')
df

Unnamed: 0,timestamp,user,activity,group
0,2013-03-27 06:00:00-05:00,u00,2,none
1,2013-03-27 07:00:00-05:00,u00,1,none
2,2013-03-27 08:00:00-05:00,u00,2,none
3,2013-03-27 09:00:00-05:00,u00,3,none
4,2013-03-27 10:00:00-05:00,u00,4,none
...,...,...,...,...
55902,2013-05-31 18:00:00-05:00,u59,5,mild
55903,2013-05-31 19:00:00-05:00,u59,5,mild
55904,2013-05-31 20:00:00-05:00,u59,4,mild
55905,2013-05-31 21:00:00-05:00,u59,5,mild


We can summarize the data by grouping the observations by **group** and **user**, and then aggregating the mean:

In [16]:
df.groupby(['group','user']).agg("mean")

group,user,activity
mild,u02,0.922348
mild,u04,1.46696
mild,u07,0.914457
mild,u16,0.702918
mild,u20,0.277946
mild,u24,0.938028
mild,u27,0.653724
mild,u31,0.929495
mild,u35,0.519455
mild,u43,0.809045
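
The grouping logic is easier to verify on a handful of rows. The following sketch uses a hypothetical dataframe with two users; selecting the `activity` column before aggregating restricts the mean to that column.

```python
import pandas as pd

# Hypothetical observations for two users in two groups.
toy = pd.DataFrame({
    "group": ["mild", "mild", "none", "none"],
    "user": ["u02", "u02", "u00", "u00"],
    "activity": [1, 3, 2, 4],
})

# One mean per (group, user) pair; the result is indexed by both keys.
means = toy.groupby(["group", "user"])["activity"].mean()
```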


# Summary statistics

There are many ways you may want to get an overview of your data.

Let's first load mobile phone screen activity data.

In [17]:
import niimpy
df = niimpy.read_csv(niimpy.sampledata.MULTIUSER_AWARESCREEN_CSV, tz='Europe/Helsinki')

In [18]:
df

Unnamed: 0,user,device,time,screen_status,datetime
2020-01-09 02:26:02.896000+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1.578530e+09,0,2020-01-09 02:26:02.896000+02:00
2020-01-09 02:34:39.969000192+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1.578530e+09,1,2020-01-09 02:34:39.969000192+02:00
2020-01-09 02:34:53.168999936+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1.578530e+09,0,2020-01-09 02:34:53.168999936+02:00
2020-01-09 02:34:53.187000064+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1.578530e+09,2,2020-01-09 02:34:53.187000064+02:00
2020-01-09 02:35:39.176999936+02:00,jd9INuQ5BBlW,3p83yASkOb_B,1.578530e+09,1,2020-01-09 02:35:39.176999936+02:00
...,...,...,...,...,...
2020-01-09 23:12:54.517999872+02:00,jd9INuQ5BBlW,OWd1Uau8POix,1.578604e+09,0,2020-01-09 23:12:54.517999872+02:00
2020-01-09 23:12:54.526000128+02:00,jd9INuQ5BBlW,OWd1Uau8POix,1.578604e+09,2,2020-01-09 23:12:54.526000128+02:00
2020-01-09 23:12:54.828000+02:00,jd9INuQ5BBlW,OWd1Uau8POix,1.578604e+09,1,2020-01-09 23:12:54.828000+02:00
2020-01-09 23:13:06.361999872+02:00,jd9INuQ5BBlW,OWd1Uau8POix,1.578604e+09,0,2020-01-09 23:13:06.361999872+02:00


## Hourly data
It is easy to get the number of observations in each hour:

In [19]:
hourly = df.groupby([df.index.date, df.index.hour]).size()
hourly

2020-01-09  2     37
            10     6
            11     3
            12     3
            14    17
            15    35
            16     4
            17     8
            18     4
            20     4
            21    19
            22     3
            23    12
dtype: int64

In [20]:
# The index is the (day, hour) pairs and the
# value is the number at that time
print('%s had %d data points'%(hourly.index[0], hourly.iloc[0]))

(datetime.date(2020, 1, 9), 2) had 37 data points
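
To go through all (day, hour) pairs rather than only the first, the grouped result can be iterated with `items()`. A minimal sketch on a hypothetical time-indexed dataframe:

```python
import pandas as pd

# Hypothetical screen events: two in hour 2 and one in hour 10.
idx = pd.to_datetime(["2020-01-09 02:05", "2020-01-09 02:40", "2020-01-09 10:10"])
toy = pd.DataFrame({"screen_status": [0, 1, 0]}, index=idx)

# Count the rows for each (day, hour) pair.
hourly = toy.groupby([toy.index.date, toy.index.hour]).size()

# Each key is a (day, hour) tuple and each value is the number of rows.
for (day, hour), n in hourly.items():
    print(f"{day} hour {hour}: {n} data points")
```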


## Occurrence

In niimpy, **occurrence** is a way to see the completeness of data.

Occurrence is computed as follows:
* Divide all time into hours.
* Divide each hour into five 12-minute intervals.
* Count the number of 12-minute intervals that have data. This count is `occurrence`.
* For each hour, report `occurrence`. A value of "5" is taken to mean that data is present somewhat regularly, while "0" means we have no data.

This isn't a perfect measure, but it is reasonably effective and simple to calculate. For data which isn't continuous (like the screen data we are using here), it shows how much the sensor has been used.

Column meanings: `day` is the date, `hour` is the hour of the day, `occurrence` is the measure described above, `count` is the total number of data points in that hour, and `withdata` lists which of the 12-minute intervals (0-4) have data.

Note that the assumption of uniformly present data does not hold for all data sources.
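
The definition above can be sketched directly in pandas. This is an illustration of the definition on hypothetical timestamps, not niimpy's actual implementation, and the column names `hour_start` and `interval` are chosen here for illustration.

```python
import pandas as pd

# Hypothetical timestamps: three in the 02:00 hour and one in the 10:00 hour.
times = pd.to_datetime([
    "2020-01-09 02:01", "2020-01-09 02:05", "2020-01-09 02:30", "2020-01-09 10:59",
])

frame = pd.DataFrame({
    "hour_start": times.floor("60min"),  # the hour each timestamp belongs to
    "interval": times.minute // 12,      # index (0-4) of the 12-minute interval
})

# Occurrence: the number of distinct 12-minute intervals with data in each hour.
occurrence = frame.groupby("hour_start")["interval"].nunique()
```

In this example the 02:00 hour has data in intervals 0 and 2, giving an occurrence of 2, and the 10:00 hour has data only in interval 4, giving 1.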

In [21]:
occurrences = niimpy.util.occurrence(df.index)
occurrences.head()

Unnamed: 0,day,hour,occurrence
2020-01-09 02:00:00,2020-01-09,2,5
2020-01-09 10:00:00,2020-01-09,10,2
2020-01-09 11:00:00,2020-01-09,11,1
2020-01-09 12:00:00,2020-01-09,12,1
2020-01-09 14:00:00,2020-01-09,14,2


We can create a simplified presentation (pivot table) of the data with the `pandas.DataFrame.pivot()` method:

In [24]:
occurrences.pivot(index='hour', columns='day')

hour,occurrence (day 2020-01-09)
2,5
10,2
11,1
12,1
14,2
15,3
16,1
17,2
18,1
20,1
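
With more than one day, the pivot produces one column per day, which makes gaps easy to spot. The sketch below uses a hypothetical long-format dataframe and passes `index`, `columns`, and `values` as keyword arguments.

```python
import pandas as pd

# Hypothetical occurrence values for two days and two hours.
long = pd.DataFrame({
    "day": ["2020-01-09", "2020-01-09", "2020-01-10", "2020-01-10"],
    "hour": [2, 10, 2, 10],
    "occurrence": [5, 2, 4, 1],
})

# Rows are hours, columns are days, and cells hold the occurrence value.
wide = long.pivot(index="hour", columns="day", values="occurrence")
```

Note that `pivot()` raises an error if an (hour, day) pair appears more than once; `pivot_table()` with an aggregation function handles that case.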
