# 1. Data Resampling

The data captured from the [bespoke Hubs Cloud
Client](https://github.com/ayman/hubs/tree/hubs-cloud/src/systems/research)
is sampled at the end user's frame rate (frames per second).  This
sampling rate varies between users with different hardware
capabilities.  To analyse the data effectively, it must first be
resampled at a consistent rate.
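
As a rough illustration of the uneven sampling, the sketch below estimates one user's effective frame rate from the gaps between consecutive samples. The millisecond offsets are copied from the first ten rows of sample output shown later in this notebook; the variable names are illustrative and not part of the released code.

```python
import pandas as pd

# Millisecond offsets of the first ten samples in the output shown later
# in this notebook (one user, starting at 2021-08-10 14:45:00 UTC).
offsets_ms = [3, 13, 25, 36, 47, 58, 69, 80, 92, 102]
start = pd.Timestamp("2021-08-10 14:45:00", tz="UTC")
index = pd.DatetimeIndex([start + pd.Timedelta(milliseconds=ms) for ms in offsets_ms])

# Gap between consecutive samples, in seconds (the first gap is undefined).
intervals = index.to_series().diff().dt.total_seconds().dropna()

median_interval = intervals.median()
estimated_fps = 1.0 / median_interval
print(f"median interval: {median_interval * 1000:.0f} ms, about {estimated_fps:.0f} FPS")
```

The gaps are not all equal even within these ten rows, which is why a fixed resampling grid is needed before comparing users.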

We chose to organise the dataset into "frames" based on the datetime
index using Pandas.  This notebook demonstrates how data from the Hubs
logger is processed in preparation for analysis.

The data released with this notebook has been modified for distribution 
as an open dataset.  The logging code produces a JSON file, which we 
have converted to CSV.  The CSV was also given simplified column names, 
columns not used for this analysis were removed, and any columns with 
personally identifying information were removed.

## Using this Notebook, Code, or Data
This notebook and all of the resources included here are released under the
[Mozilla Public License 2.0](https://www.mozilla.org/en-US/MPL/2.0/). 
The data is released under [CC-BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/).
To cite the paper, the bespoke logging client, the dataset, or this 
notebook please see the 
[README.md](https://github.com/ayman/hubs-research-2021/blob/main/README.md) 
or the [DOI in the ACM Digital Library](https://doi.org/10.1145/3411764.3445729).

In [2]:
import datetime
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import pytz
import seaborn as sn

In the dataset provided with this example, the columns have already
been given usable names and all unused columns have been removed.
Columns such as `display_name` have been removed for anonymisation.
The raw data from the updated logging code will have different column
names and more columns than seen here. For reference, here is our format.

| Column          | Description                                                                    |
| :-------------  | :----------------------------------------------------------------------------- |
| `timestamp`     | Event timestamp in milliseconds since the Unix epoch                           |
| `uuid`          | Stable unique ID for users                                                     |
| `states`        | Hubs tags for user state within room                                           |
| `room`          | Which room; room names use / inconsistently and can include subscreens         |
| `position_x`    | Coordinates vary based on size of room                                         |
| `position_y`    | Height (typically 1, unless using fly mode)                                    |
| `position_z`    | Coordinates vary based on size of room                                         |
| `direction_x`   | A vector, value from -1 to 1                                                   |
| `direction_y`   | A vector, value from -1 to 1                                                   |
| `direction_z`   | A vector, value from -1 to 1                                                   |
| `orientation_w` | A quaternion, a value from 0 to 1                                              |
| `orientation_x` | A quaternion, a value from -1 to 1                                             |
| `orientation_y` | A quaternion, a value from -1 to 1                                             |
| `orientation_z` | A quaternion, a value set to 0                                                 |
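
A quick sanity check on the direction and orientation columns is to confirm that each row is (close to) unit length. The sketch below does this for a single row, using values copied from the first row of the sample output later in this notebook, where the released columns are named `direction_x/y/z` and `quat_x/y/z/w`. Treating both as unit-length is an assumption that happens to hold for this row, not something the logger is documented to guarantee.

```python
import numpy as np

# Values copied from the first row of the sample output in this notebook.
direction = np.array([-0.838855, -0.543276, -0.034262])          # direction_x, _y, _z
quaternion = np.array([0.223575, -0.68348, 0.177684, 0.671787])  # quat_x, _y, _z, _w

# Euclidean length of each; both should be close to 1 for this row.
direction_norm = np.linalg.norm(direction)
quaternion_norm = np.linalg.norm(quaternion)

print(f"direction norm: {direction_norm:.4f}, quaternion norm: {quaternion_norm:.4f}")
```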

In [3]:
# Load the data from Hubs Tracker. The anonymised data already has column headings labelled - see specification for column headers.

pd.set_option('display.max_columns', None)
poses = pd.read_csv("../2.Data/group1.csv.bz2")

In [4]:
def add_datetime_index(row):
    """Convert a row's millisecond `timestamp` to a timezone-aware UTC datetime.

    The result is used as the dataframe's datetime index so that the data can
    be treated as a time series and resampled at a consistent frame rate."""
    return datetime.datetime.fromtimestamp(row['timestamp'] / 1000, tz=pytz.timezone("UTC"))

# Apply the function above to add the datetime index to the DF
poses.index = poses.apply(add_datetime_index, axis=1)
poses.index.name = "frame_id"

poses.head(10)

frame_id,uuid,detectOS,is_browser_environment,check_headset_connected,is_mobile_VR,is_oculus_browser,timestamp,room,position_x,position_y,position_z,quat_x,quat_y,quat_z,quat_w,direction_x,direction_y,direction_z,characterControllerFly,spacebubble,visible,loaded,entered,muted,lastFPS,volume,audioOutputMode
2021-08-10 14:45:00.003000+00:00,0ceacf76-e25a-45aa-8641-7829d7a4f571,Linux,1,1,1,1,1628607000000.0,/Ahrjdgs/viajero-room-303,-10.180137,1.221815,22.390828,0.223575,-0.68348,0.177684,0.671787,-0.838855,-0.543276,-0.034262,0,1,1,1,1,0,89.46,5e-324,0
2021-08-10 14:45:00.013000+00:00,0ceacf76-e25a-45aa-8641-7829d7a4f571,Linux,1,1,1,1,1628607000000.0,/Ahrjdgs/viajero-room-303,-10.180094,1.221718,22.390915,0.223354,-0.683562,0.177259,0.671889,-0.839373,-0.542473,-0.034288,0,1,1,1,1,0,89.46,5e-324,0
2021-08-10 14:45:00.025000+00:00,0ceacf76-e25a-45aa-8641-7829d7a4f571,Linux,1,1,1,1,1628607000000.0,/Ahrjdgs/viajero-room-303,-10.180021,1.221647,22.390998,0.223209,-0.68375,0.176803,0.671867,-0.839849,-0.541711,-0.034671,0,1,1,1,1,0,89.46,5e-324,0
2021-08-10 14:45:00.036000+00:00,0ceacf76-e25a-45aa-8641-7829d7a4f571,Linux,1,1,1,1,1628607000000.0,/Ahrjdgs/viajero-room-303,-10.179947,1.221586,22.391065,0.223173,-0.683776,0.176436,0.671949,-0.840173,-0.541207,-0.03471,0,1,1,1,1,0,89.46,5e-324,0
2021-08-10 14:45:00.047000+00:00,0ceacf76-e25a-45aa-8641-7829d7a4f571,Linux,1,1,1,1,1628607000000.0,/Ahrjdgs/viajero-room-303,-10.179872,1.221514,22.391104,0.223031,-0.683782,0.175948,0.672117,-0.84068,-0.540426,-0.034603,0,1,1,1,1,0,89.46,5e-324,0
2021-08-10 14:45:00.058000+00:00,0ceacf76-e25a-45aa-8641-7829d7a4f571,Linux,1,1,1,1,1628607000000.0,/Ahrjdgs/viajero-room-303,-10.179788,1.221435,22.391135,0.222743,-0.683854,0.175286,0.672312,-0.84144,-0.539245,-0.034542,0,1,1,1,1,0,89.46,5e-324,0
2021-08-10 14:45:00.069000+00:00,0ceacf76-e25a-45aa-8641-7829d7a4f571,Linux,1,1,1,1,1628607000000.0,/Ahrjdgs/viajero-room-303,-10.17967,1.22133,22.391178,0.222376,-0.683943,0.174571,0.67253,-0.842303,-0.537902,-0.034458,0,1,1,1,1,0,84.17,5e-324,0
2021-08-10 14:45:00.080000+00:00,0ceacf76-e25a-45aa-8641-7829d7a4f571,Linux,1,1,1,1,1628607000000.0,/Ahrjdgs/viajero-room-303,-10.179427,1.221081,22.391219,0.221928,-0.684066,0.173706,0.672776,-0.843347,-0.536268,-0.034397,0,1,1,1,1,0,84.17,5e-324,0
2021-08-10 14:45:00.092000+00:00,0ceacf76-e25a-45aa-8641-7829d7a4f571,Linux,1,1,1,1,1628607000000.0,/Ahrjdgs/viajero-room-303,-10.179372,1.220915,22.391177,0.221541,-0.684104,0.172712,0.673121,-0.844444,-0.534553,-0.034157,0,1,1,1,1,0,84.17,5e-324,0
2021-08-10 14:45:00.102000+00:00,0ceacf76-e25a-45aa-8641-7829d7a4f571,Linux,1,1,1,1,1628607000000.0,/Ahrjdgs/viajero-room-303,-10.179245,1.220747,22.391115,0.221143,-0.68409,0.17188,0.673479,-0.84542,-0.533033,-0.033767,0,1,1,1,1,0,84.17,5e-324,0
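
The row-wise `apply` above can be slow on large logs. A vectorised alternative is sketched below, assuming the `timestamp` column holds whole milliseconds since the Unix epoch. The `toy` frame and its values are hypothetical stand-ins for `poses`; the cast to `int64` avoids floating-point rounding when pandas scales the values to nanoseconds.

```python
import pandas as pd

# Hypothetical stand-in for `poses`: epoch-millisecond timestamps stored as floats.
toy = pd.DataFrame({"timestamp": [1628606700003.0, 1628606700013.0, 1628606700025.0]})

# Convert the whole column at once; assumes timestamps are whole milliseconds.
frame_times = pd.to_datetime(toy["timestamp"].astype("int64"), unit="ms", utc=True)

toy.index = pd.DatetimeIndex(frame_times)
toy.index.name = "frame_id"

print(toy.index[0])
```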


Select all the pose data for which `visible` is set. This state
corresponds to users who have fully entered the Hubs room
(leaving the lobby) and are visible to others.

In [5]:
entered_poses = poses[poses.visible == 1]

## Time Series Resampling
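Analysing the data requires resampling at a consistent rate.  In
Pandas, we can do this by giving the dataframe a datetime index.

Resampling onto a regular grid leaves empty frames.  There are two
options for handling them:
 * Fill them: `bfill` (or `fillna`) copies a nearby value into each
   empty frame, but won't fill beyond the given `limit`.  It might be
   better to interpolate, but in our experience this behaved in odd
   ways in Pandas and has some bugs.
 * `dropna` removes the frames that are still empty.

The code below back-fills up to the limit and then uses `dropna` to
remove whatever remains empty.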
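
To make the interaction between the fill limit and `dropna` concrete, here is a minimal sketch on toy data. The values, the 100 ms grid, and `limit=2` are chosen for illustration only; the notebook itself uses a 30 FPS grid and a much larger limit.

```python
import pandas as pd

start = pd.Timestamp("2021-08-10 14:45:00", tz="UTC")

# Irregular toy samples: three close together, then a gap of about one second.
offsets_ms = [0, 10, 25, 1025]
toy = pd.DataFrame(
    {"position_x": [0.0, 0.1, 0.2, 5.0]},
    index=pd.DatetimeIndex([start + pd.Timedelta(milliseconds=ms) for ms in offsets_ms]),
)

# Regular 100 ms grid. Each empty slot is back-filled from the next
# observation, but only for the `limit` slots closest to that observation.
# Slots that are still empty afterwards are removed by dropna().
resampled = toy.resample("100ms").bfill(limit=2).dropna()

print(resampled)
```

In this toy case only the slot that coincides with the first sample and the two slots just before the last sample survive; the rest of the long gap is dropped rather than filled with stale values.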

In [6]:
def resample_CSV(room_df, room_name):
    """Resample each user's data in a room to a fixed frame rate and write it to CSV.

    Only 30 FPS is generated here; add other rates (e.g. 20, 40, 60) to the
    list below.  Gaps are back-filled for at most about 1 minute
    (limit=fps*60 frames); frames in longer gaps are dropped.  The output
    file name is based on `room_name`."""
    # Drop rows whose timestamp index is duplicated (checked across all users).
    room_df = room_df.loc[~room_df.index.duplicated(keep='first')]
    users = room_df.groupby('uuid')

    for fps in [30]:
        # Frame interval rounded to whole milliseconds (33 ms for 30 FPS).
        interval = f'{round(1000 / fps)}ms'
        all_users = []
        for user, user_data in users:
            resampled_data = user_data.resample(interval).bfill(limit=int(fps * 60)).dropna()
            all_users.append(resampled_data)
        joined_users = pd.concat(all_users)
        joined_users.to_csv(f'{room_name}_resampled_{fps}.csv')

Resample all our rooms; the function writes them to CSV files for 
later use in the next notebook.

In [7]:
os.makedirs('outputs', exist_ok=True)

resample_CSV(entered_poses, "outputs/group1")



Next, visit the <a href="2.GenerateSocialMetrics.ipynb">Generate Social Metrics</a> notebook.