# 1. Data Resampling

The data captured from the [bespoke Hubs Cloud Client](https://github.com/ayman/hubs/tree/hubs-cloud/src/systems/research) is sampled at each end user's frame rate (frames per second).  This sampling rate varies between users with different hardware capabilities.  To analyse the data effectively, it must first be resampled at a consistent rate.

We chose to organise the dataset into "frames" based on the datetime index using Pandas.  This notebook demonstrates how data from the Hubs logger is processed in preparation for analysis. 

The data released with this notebook has been modified for distribution as an open dataset.  The logging code produces a JSON file, which we have converted to CSV.  The CSV was also given simplified column names, columns not used for this analysis were removed, and any columns with personally identifying information were removed.

## Using this Notebook, Code, or Data
This notebook and all of the resources included here are released under the [Mozilla Public License 2.0](https://www.mozilla.org/en-US/MPL/2.0/).  The data is released under [CC-BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/).  To cite the paper, the bespoke logging client, the dataset, or this notebook, please see the [README.md](https://github.com/ayman/hubs-research-2021/blob/main/README.md) or the [DOI in the ACM Digital Library](https://doi.org/10.1145/3411764.3445729).

In [1]:
import datetime
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import pytz
import seaborn as sn

The dataset provided with this example already has usable column names, and all unused columns have been removed. Columns such as `display_name` have been removed for anonymisation. The raw data from the updated logging code will have different column names and more columns than seen here. For reference, here is our format.

| Column          | Description                                                                    |
| :-------------  | :----------------------------------------------------------------------------- |
| `timestamp`     | Event timestamp                                                                |
| `uuid`          | Stable unique ID for users                                                     |
| `states`        | Hubs tags for user state within room                                           |
| `room`          | Which room, although room names inconsistently use / and can include subscreens |
| `position_x`    | Coordinates vary based on size of room                                         |
| `position_y`    | Height (typically 1, unless using fly mode)                                    |
| `position_z`    | Coordinates vary based on size of room                                         |
| `direction_x`   | A vector, value from -1 to 1                                                   |
| `direction_y`   | A vector, value from -1 to 1                                                   |
| `direction_z`   | A vector, value from -1 to 1                                                   |
| `orientation_w` | A quaternion, a value from 0 to 1                                              |
| `orientation_x` | A quaternion, a value from -1 to 1                                             |
| `orientation_y` | A quaternion, a value from -1 to 1                                             |
| `orientation_z` | A quaternion, a value set to 0                                                 |

In [2]:
poses = pd.read_csv("../2.Data/poses.csv",
                    usecols=["timestamp",
                             "uuid",
                             "states",
                             "room",
                             "position_x",
                             "position_y",
                             "position_z",
                             "direction_x",
                             "direction_y",
                             "direction_z",
                             "orientation_w",
                             "orientation_x",
                             "orientation_y",
                             "orientation_z",])

poses.head(2)

Unnamed: 0,timestamp,uuid,room,states,position_x,position_y,position_z,direction_x,direction_y,direction_z,orientation_x,orientation_y,orientation_z,orientation_w
0,1588147000.0,e66510f1-5be6-49d3-b453-d6c4c06fd90c,/x5Dw6Dp/social-xr-workshop,"['spacebubble', 'visible', 'loaded']",12.514999,9.039499,38.039475,-0.694996,-0.391185,-0.603286,-0.218455,0.388116,0.0,0.895345
1,1588147000.0,e66510f1-5be6-49d3-b453-d6c4c06fd90c,/x5Dw6Dp/social-xr-workshop,"['spacebubble', 'visible', 'loaded']",13.14642,8.565934,37.408054,-0.694996,-0.391185,-0.603286,-0.218455,0.388116,0.0,0.895345
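As a quick sanity check on the column descriptions, the sketch below computes the length of the direction vector and of the orientation quaternion, using the values copied from the first row of the output above. Both come out very close to 1 for this row; whether that holds for every row in the dataset is an assumption we have not verified here.

```python
import math

# Values copied from the first row of the output above.
direction = (-0.694996, -0.391185, -0.603286)        # (x, y, z)
orientation = (0.895345, -0.218455, 0.388116, 0.0)   # (w, x, y, z)

# Euclidean length of each tuple of components.
direction_norm = math.sqrt(sum(c * c for c in direction))
orientation_norm = math.sqrt(sum(c * c for c in orientation))

print(round(direction_norm, 3), round(orientation_norm, 3))
```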


In [3]:
def add_datetime_index(row):
    """Return a timezone-aware UTC datetime for a row's Unix `timestamp`.  The result is
    used as a datetime index so the data can be treated as a time series and resampled
    at a consistent frame rate."""
    e_time = datetime.datetime.fromtimestamp(row['timestamp'], tz=pytz.timezone("UTC"))
    return e_time

# Apply the function above to add the datetime index to the DF
poses.index = poses.apply(add_datetime_index, axis=1)
poses.index.name = "frame_id"

poses.head(2)

frame_id,timestamp,uuid,room,states,position_x,position_y,position_z,direction_x,direction_y,direction_z,orientation_x,orientation_y,orientation_z,orientation_w
2020-04-29 07:49:32.877000+00:00,1588147000.0,e66510f1-5be6-49d3-b453-d6c4c06fd90c,/x5Dw6Dp/social-xr-workshop,"['spacebubble', 'visible', 'loaded']",12.514999,9.039499,38.039475,-0.694996,-0.391185,-0.603286,-0.218455,0.388116,0.0,0.895345
2020-04-29 07:49:44.802000+00:00,1588147000.0,e66510f1-5be6-49d3-b453-d6c4c06fd90c,/x5Dw6Dp/social-xr-workshop,"['spacebubble', 'visible', 'loaded']",13.14642,8.565934,37.408054,-0.694996,-0.391185,-0.603286,-0.218455,0.388116,0.0,0.895345
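The row-wise `apply` above calls a Python function once per row. A vectorised alternative is sketched below using `pd.to_datetime` with `unit="s"` and `utc=True`. This is not the approach used in this notebook, and the small `demo` frame with its two epoch values is hypothetical, standing in for `poses`; it should produce an equivalent UTC datetime index, though we have not benchmarked or compared it on the full dataset.

```python
import pandas as pd

# Hypothetical two-row frame standing in for `poses`; the timestamps are
# illustrative Unix epoch seconds.
demo = pd.DataFrame({"timestamp": [1588146572.877, 1588146584.802]})

# Convert every epoch value to a timezone-aware UTC datetime in one call,
# then use the result as the index, named as in the notebook.
demo.index = pd.DatetimeIndex(
    pd.to_datetime(demo["timestamp"], unit="s", utc=True)
)
demo.index.name = "frame_id"

print(demo.index)
```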


Select all the pose data for when the state is `visible`. This state corresponds to users who have fully entered the Hubs room (leaving the lobby) and are visible to others.

Then we make a data frame for each room.  Our workshop had one main room and three breakout rooms.

In [4]:
entered_poses = poses[poses.states.str.contains("'visible'")]

main_room = entered_poses[entered_poses['room'].str.match('/x5Dw6Dp/social-xr-workshop')]
a_room = entered_poses[entered_poses['room'].str.match('/AJ8FNzb/breakout-room-a')]
b_room = entered_poses[entered_poses['room'].str.match('/uRAjooi/breakout-room-b')]
c_room = entered_poses[entered_poses['room'].str.match('/y5HBKwr/breakout-room-c')]
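The two string filters used above behave differently: `str.contains` finds the pattern anywhere in the string, while `str.match` only requires the pattern to match at the start, so room names with extra trailing characters are still selected. The sketch below illustrates this on hypothetical values; the trailing-slash room name and the `['loaded']` state string are made up for the example.

```python
import pandas as pd

# Hypothetical state strings in the same format as the `states` column.
states = pd.Series([
    "['spacebubble', 'visible', 'loaded']",
    "['loaded']",
])
# regex=False treats the pattern as a literal substring.
is_visible = states.str.contains("'visible'", regex=False)

# Room names; the second one has a hypothetical trailing slash.
rooms = pd.Series([
    "/x5Dw6Dp/social-xr-workshop",
    "/x5Dw6Dp/social-xr-workshop/",
    "/AJ8FNzb/breakout-room-a",
])
# str.match anchors the pattern at the start of each string only.
is_main = rooms.str.match("/x5Dw6Dp/social-xr-workshop")

print(is_visible.tolist())  # [True, False]
print(is_main.tolist())     # [True, True, False]
```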

## Time Series Resampling
Analysing the data requires resampling at a consistent rate.  In Pandas, we can do this by assigning the index of the dataframe as a time index.

There are two options for handling the gaps that resampling introduces:
 * `fillna` takes the nearest value to fill upsampled data, but won't fill beyond the limit.  Interpolating might be better, but it behaved in odd ways in Pandas and has some bugs.
 * `dropna` removes bad values.

We are using `dropna`, applied after a back-fill with a limit.
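The effect of a limited fill followed by a drop can be seen on a small example. The sketch below uses a hypothetical three-row frame and a one-second rule (not the rate used in this notebook), and mirrors the `bfill(limit=...)` followed by `dropna()` pattern that the resampling code in this notebook uses. Empty bins close enough to the next observation are filled; the rest stay empty and are dropped.

```python
import pandas as pd

# Hypothetical observations at 0 s, 2 s, and 10 s.
idx = pd.to_datetime(
    ["2020-04-29 07:00:00", "2020-04-29 07:00:02", "2020-04-29 07:00:10"],
    utc=True,
)
demo = pd.DataFrame({"position_x": [1.0, 2.0, 3.0]}, index=idx)

# Upsample to one-second bins (11 bins from :00 to :10).
upsampled = demo.resample(pd.Timedelta(seconds=1))

# Back-fill at most 2 empty bins before each observation; the bins at
# :03 through :07 are too far from the next observation and stay NaN.
filled = upsampled.bfill(limit=2)

# Drop the bins that are still empty.
kept = filled.dropna()

print(kept)
```

With a larger `limit`, longer absences are bridged; with `limit=2` here, only the two bins immediately before each observation are filled.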

### Note on Sparse Sample Rate
In this code, we sample the data at 10 frames per minute.  This is substantially sparser than the logging code is capable of, but in the interest of optimisation during our workshop deployment, we did not push the performance of the client side logger in this case.  In follow-on work, we tested the performance of real-time logging and succeeded with significantly higher sampling rates.  
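The arithmetic behind the 10 frames per minute figure can be checked directly. The sketch below uses the `fps` value of 0.167 from the resampling code in this notebook and builds the resampling rule string and the fill limit the same way that code does.

```python
fps = 0.167

period_s = 1 / fps            # seconds between frames, about 5.988
rule = f"{period_s:.3f}S"     # rule string as built in the resampling code
frames_per_minute = fps * 60  # about 10.02
limit = int(fps * 60)         # frames that may be back-filled, about one minute

print(rule, round(frames_per_minute, 2), limit)
```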

In [5]:
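def resample_CSV(room_df, room_name):
    """Resample a room's poses at a fixed frame rate and write the result to a CSV file
    named after the room.  Only 0.167 FPS (about 10 frames per minute) is enabled; the
    higher rates listed in the comment on the loop are disabled.  Gaps are back-filled
    for at most int(fps*60) frames, about one minute out of scene, and any frames still
    missing after that are dropped."""
    # Keep only the first row for each duplicated timestamp in the index
    room_df = room_df.loc[~room_df.index.duplicated(keep='first')]

    for fps in [.167]:  # , 30, 40, 60]:
        all_users = []
        users = room_df.groupby('uuid')
        for user, user_data in users:
            # Resample each user separately so one user's poses never fill another's gaps
            resampled_data = user_data.resample(f'{1/fps:.3f}S').bfill(limit=int(fps*60)).dropna()
            all_users.append(resampled_data)
        joined_users = pd.concat(all_users)
        joined_users.to_csv(f'{room_name}_resampled_{fps}.csv')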

Resample all our rooms; the function writes them to CSV files for later use in the next notebook.

In [6]:
os.makedirs('outputs', exist_ok=True)
resample_CSV(a_room, "outputs/room_a")
resample_CSV(b_room, "outputs/room_b")
resample_CSV(c_room, "outputs/room_c")
resample_CSV(main_room, "outputs/main_room")

Next, visit the <a href="2.GenerateSocialMetrics.ipynb">Generate Social Metrics</a> notebook.