# MultimodalHack with CPR - Tutorial

This IPython Notebook was prepared by Daniele Di Mitri for the Learning Analytics Hackathon, which takes place in Amsterdam on 23-24 October 2018. For more information about the Hackathon, see http://lsac2018.org/#hackathon


## 1. The Multimodal Tutor for CPR 
The aim of this study is to test a new intelligent Multimodal Tutor for training people to perform cardiopulmonary resuscitation (CPR) using patient manikins (the CPR tutor). The tutor uses a multi-sensor setup to track the CPR execution and generate personalised feedback. This study is part of a PhD project focusing on multimodal data support for investigating practice-based learning scenarios, such as psychomotor skills training in the classroom or at the workplace. For the CPR tutor, the multimodal data consist of the trainee's body position recorded by a Microsoft Kinect camera, an electromyogram recorded with a Myo armband, and performance metrics derived from the manikin (a Laerdal ResusciAnne QCPR manikin). This manikin can be linked with the Simpad SkillsReporter, a device capable of calculating reliable CPR performance metrics such as *CompressionRate*, *CompressionDepth* and *CompressionRelease*. The manikin currently cannot assess other important CPR performance metrics, such as whether the arms are locked during the chest compressions or whether the trainee uses his/her body weight to facilitate the chest compressions. 
![alt text](https://i.imgur.com/n7KYl4l.png "The Multimodal Tutor for CPR")

### 1. Data collection: Multimodal Learning Hub
The Multimodal Learning Hub (LearningHub) is a system for collecting and storing data from multimodal learning experiences. It uses the concept of the *Meaningful Learning Task* (MLT) and introduces a new data format (the MLT session file) for data storage and exchange. The LearningHub implements a set of specifications that tailor it to certain types of learning activities. It was created to be compatible primarily with commercial devices (e.g. Microsoft Kinect, Leap Motion, Myo Armband) and other sensors whose drivers run on the most common operating systems. It focuses on short and meaningful learning activities (~10 minutes) and uses a distributed client-server architecture, with a master node controlling and receiving updates from multiple data-provider applications. It also handles video and audio recordings, mainly to support the human annotation process. The LearningHub is open source, is developed in C#, and can be found at https://github.com/janschneiderou/LearningHub

### 2. Data storing: MLT sessions
The expected output of the LearningHub is one (or multiple) MLT session files including 1) one-to-n multimodal, time-synchronised sensor recordings; 2) a video/audio file providing evidence for retrospective annotations. Each session is stored in a zip folder and has the following structure:

**SessionFile.zip**
+ annotations.json 
+ app1.json 
+ app2.json 
+ ... 
+ appN.json 
+ video.mp4 

The **app1.json, app2.json, ... appN.json** are the files containing the sensor data (e.g. from Kinect and Myo armband). These files constitute the input data of the data analysis.

The **annotations.json** file contains the time intervals annotated with the *Visual Inspection Tool*. The labels contained in this file are used as the targets for the prediction/classification. 
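
As a rough illustration of this layout, the sketch below builds a tiny in-memory session archive and sorts its JSON members into annotation and sensor files by looking for the top-level keys `intervals` and `frames`, the same test the script in this notebook applies. The file contents, the timestamp strings and the helper names `build_toy_session` and `classify_members` are made up for the example (written for Python 3) and are not part of the LearningHub.

```python
import io
import json
import zipfile


def build_toy_session():
    """Create an in-memory zip that mimics the MLT layout (contents are made up)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as z:
        z.writestr("annotations.json", json.dumps(
            {"intervals": [{"start": "00:00:01.000", "end": "00:00:01.600"}]}))
        z.writestr("app1.json", json.dumps(
            {"applicationName": "Kinect",
             "frames": [{"frameStamp": "00:00:01.100",
                         "frameAttributes": {"ShoulderLeftX": 0.12}}]}))
        z.writestr("video.mp4", b"")  # placeholder, not a real video
    buffer.seek(0)
    return buffer


def classify_members(zip_buffer):
    """Map each JSON member to 'annotation' or 'sensor' based on its top-level keys."""
    kinds = {}
    with zipfile.ZipFile(zip_buffer) as z:
        for name in z.namelist():
            if not name.endswith(".json"):
                continue  # skip the video and any other non-JSON member
            with z.open(name) as f:
                data = json.load(f)
            if "intervals" in data:
                kinds[name] = "annotation"
            elif "frames" in data:
                kinds[name] = "sensor"
    return kinds


kinds = classify_members(build_toy_session())
print(kinds)
```

Checking the keys rather than the file name means the classification does not depend on how the sensor applications name their files.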

### 3. Data annotation: Visual Inspection Tool  
The data annotation is performed using the *Visual Inspection Tool* (VIT). The VIT lets the user load an MLT session, segment the sensor data into time intervals, and attach several key-value annotations to each interval. It also allows annotations generated by the system to be loaded manually. The VIT is open source and can be found on GitHub: https://github.com/dimstudio/visual-inspection-tool 


## 2. Purpose of this Tutorial 
In this tutorial we focus on step 4 of the data-informed cycle: data processing. We use some sample MLT sessions from the Multimodal Tutor. 

### Machine Learning Task description 

INPUT SPACE: all the non-zero attributes contained in the sensor data. Each of these attributes is a single time series, and the series can have different sampling frequencies.  
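
Because the sensors sample at different rates, their series have to be brought onto one timeline before they can be combined. The following is a minimal, dependency-free sketch of one way to do this: take the union of all timestamps and fill each gap with the next observed value (a backward fill), which is roughly what the pandas-based script in this notebook does. The function name `merge_and_backfill`, the attribute name `Myo.Emg0` and all readings are hypothetical.

```python
def merge_and_backfill(series_by_name):
    """Merge {name: {timestamp: value}} onto one sorted timeline, then back-fill gaps."""
    timeline = sorted({t for s in series_by_name.values() for t in s})
    names = sorted(series_by_name)
    rows = [{n: series_by_name[n].get(t) for n in names} for t in timeline]

    # Walk backwards so each missing cell takes the next observed value in time
    next_seen = {n: None for n in names}
    for row in reversed(rows):
        for n in names:
            if row[n] is None:
                row[n] = next_seen[n]
            else:
                next_seen[n] = row[n]
    return timeline, rows


# Hypothetical readings: timestamps in milliseconds, two sensors at different rates
kinect = {0: 0.10, 100: 0.20, 200: 0.30}
myo = {0: 5, 50: 6, 100: 7, 150: 8}

timeline, rows = merge_and_backfill({"Kinect.ShoulderLeftY": kinect, "Myo.Emg0": myo})
for t, row in zip(timeline, rows):
    print(t, row)
```

Values after the last observation of a series stay empty (`None`), just as a backward fill leaves trailing gaps unfilled.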

HYPOTHESIS SPACE: the hypothesis space includes five binary (0 or 1) target attributes: three, *CompressionRate*, *CompressionDepth* and *CompressionRelease*, derived from the CPR manikin data, and two, *BodyWeight* and *ArmsLocked*, which the manikin cannot detect. 

PROBLEM REPRESENTATION: we use an attribute-value representation. Each time interval (i.e. chest compression) is a row, while each sensor attribute is a column. To turn each attribute into numerical features, we apply aggregation functions to its values within the interval. A graphical representation of the problem is shown below. 
![alt text](https://i.imgur.com/MJMaZrB.png "Data transformation visualized")
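
To make the representation concrete, here is a small sketch in plain Python that turns one attribute's time series into one row of features per annotated interval, using the same five aggregations as the script (`max`, `min`, `std`, `mean`, `var`). The sample values, the interval boundaries and the function name `interval_features` are invented for illustration; the notebook itself performs this step with pandas and NumPy.

```python
import statistics

# Population std/var (pstdev, pvariance) match the NumPy defaults (ddof=0)
AGGREGATIONS = {
    "max": max,
    "min": min,
    "std": statistics.pstdev,
    "mean": statistics.mean,
    "var": statistics.pvariance,
}


def interval_features(samples, intervals, attribute):
    """Return one feature dict per interval for a single sensor attribute.

    samples   -- list of (timestamp, value) pairs
    intervals -- list of (start, end) pairs, both ends inclusive
    """
    table = []
    for start, end in intervals:
        values = [v for t, v in samples if start <= t <= end]
        row = {}
        for name, func in AGGREGATIONS.items():
            # Empty intervals get None, so they can be dropped later
            row[attribute + "." + name] = func(values) if values else None
        table.append(row)
    return table


# Hypothetical data: timestamps in seconds, two annotated chest compressions
samples = [(0.0, 1.0), (0.2, 3.0), (0.4, 5.0), (0.6, 2.0), (0.8, 4.0)]
intervals = [(0.0, 0.4), (0.5, 0.9)]

feature_rows = interval_features(samples, intervals, "Kinect.ShoulderLeftY")
for row in feature_rows:
    print(row)
```

Each interval yields one row with columns named `attribute.aggregation`, so an interval with hundreds of samples and one with a handful both end up as fixed-length feature vectors.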

## 3. Script
### 1. Import some Python libraries (NumPy, pandas, scikit-learn, etc.)

In [1]:
# set of imports
import numpy as np

import os 
import zipfile
import json
import re
import operator
import requests
import StringIO
import datetime

import pandas as pd
from pandas.io.json import json_normalize  

from sklearn.svm import SVC
from sklearn import preprocessing
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

import matplotlib.pyplot as plt

### 2. Retrieve session file(s)

In [2]:
sessions = []
zip_file_urls = ['https://github.com/dimstudio/multimodal-analyzer/raw/master/example_sessions/OKDDM1_2018-7-24-11H57M49S371_annotated.zip']
for zip_url in zip_file_urls:
    r1 = requests.get(zip_url, stream=True)
    z1 = StringIO.StringIO(r1.content)
    # append inside the loop so that every downloaded session is kept
    sessions.append(z1)

### 3. Data preparation in a pandas DataFrame

In [3]:
dfALL = pd.DataFrame() # Dataframe with all summarised data
dfAnn = pd.DataFrame() # Dataframe containing the annotations

# for each session in the list of sessions
for s in sessions:
    
    #1. Reading data from zip file
    with zipfile.ZipFile(s) as z:
        
        # Use the modification time stored in the zip (of the last member) as the absolute
        # start time; the relative timestamps of the frames are added to it below
        for info in z.infolist():
            file_datetime = datetime.datetime(*info.date_time)
        current_time_offset = pd.to_datetime(file_datetime)
        
        # Look through every JSON file in the session (annotations and sensor data)
        for filename in z.namelist():
            
            if not os.path.isdir(filename):
                
                if '.json' in filename:
                    
                    with z.open(filename) as f:
                        data = json.load(f) 
                    # if it has 'intervals' then it is an annotation file 
                    
                    if 'intervals' in data:
                        
                        # concatenate the data with the normalized intervals and drop the attribute 'intervals'
                        df = pd.concat([pd.DataFrame(data), 
                            json_normalize(data['intervals'])], 
                            axis=1).drop('intervals', axis=1)
                        
                        # convert to numeric (when reading from JSON it converts into object in the pandas DF)
                        # with the parameter 'ignore' it will skip all the non-numerical fields 
                        df = df.apply(pd.to_numeric, errors='ignore')
                        
                        # remove the prefix 'annotations.' from the column names
                        df.columns = df.columns.str.replace("annotations.", "")
                        
                        # from string to timedelta + offset
                        df.start = pd.to_timedelta(df.start) + current_time_offset
                        
                        # from string to timedelta + offset
                        df.end = pd.to_timedelta(df.end) + current_time_offset
                        
                        # duration as subtractions of delta in seconds
                        df['duration'] = (df.end-df.start) / np.timedelta64(1, 's')   
                        
                        # append this dataframe to the annotations dataframe
                        # (pd.concat also works in pandas versions where DataFrame.append was removed)
                        dfAnn = pd.concat([dfAnn, df]) 
                        
                    # if it has 'frames' then it is a sensor file 
                    elif 'frames' in data:
                        
                        # concatenate the data with the normalized frames and drop the attribute 'frames'
                        df = pd.concat([pd.DataFrame(data), 
                            json_normalize(data['frames'])], 
                            axis=1).drop('frames', axis=1)
                        
                        # remove underscores from the column names, e.g. 3_Ankle_Left_X becomes 3AnkleLeftX
                        df.columns = df.columns.str.replace("_", "")
                        
                        # from string to timedelta + offset
                        df['frameStamp']= pd.to_timedelta(df['frameStamp']) + current_time_offset
                        
                        # retrieve the application name (it is identical in every row of this file)
                        appName = df.applicationName.iloc[0]
                        
                        # replace the prefix 'frameAttributes' in the column names with the application name
                        df.columns = df.columns.str.replace("frameAttributes", appName)
                        
                        # set the timestamp as index 
                        df = df.set_index('frameStamp').iloc[:,2:]
                        
                        # exclude duplicates (taking the first occurence in case of duplicates)
                        df = df[~df.index.duplicated(keep='first')]
                        
                        # convert to numeric (when reading from JSON it converts into object in the pandas DF)
                        # with the parameter 'ignore' it will skip all the non-numerical fields 
                        df = df.apply(pd.to_numeric, errors='ignore')
                        
                        # Keep the numeric types only (categorical data are not supported now)
                        df = df.select_dtypes(include=['float64','int64'])
                        
                        # Remove columns whose values sum to 0 (meaning they carry no information)
                        df = df.loc[:, (df.sum(axis=0) != 0)]
                        
                        # The application KinectReader can track up to 6 people, so its attributes are
                        # prefixed with a body number, e.g. 1ShoulderLeftX or 3AnkleRightY. We strip
                        # this number, assuming there is only one user.
                        # This part has to be rethought if two users are tracked.
                        df.rename(columns=lambda x: re.sub(r'Kinect\.\d', 'Kinect.', x), inplace=True)
                        
                        # Concatenate this dataframe to dfALL and then sort dfALL by index
                        dfALL = pd.concat([dfALL, df], ignore_index=False,sort=False).sort_index()
                        
                        
# Fill each gap with the next observed value (backward fill)
df1 =  dfALL.apply(pd.to_numeric).bfill()

### 4. Feature extraction with aggregate functions

In [4]:
# Exclude irrelevant attributes 
to_exclude = ['Ankle']
for el in to_exclude:
    # filter df1 itself (not dfALL), so that earlier exclusions and the back-fill are kept
    df1 = df1[[col for col in df1.columns if el not in col]]
    
# Feature aggregation functions 
aggregations = ['max','min','std','mean','var']

# masked_df holds, for each annotated interval, the sensor rows that fall within it 
masked_df = [
    df1[(df2_start <= df1.index) & (df1.index <= df2_end)]
    for df2_start, df2_end in zip(dfAnn['start'], dfAnn['end'])
]

features = []

# For each attribute
for key in df1.columns.values:
    
    # For each aggregation function
    for a in aggregations:
        
        # the name is attribute.aggregation 
        fname = key+'.'+a
        
        # append it as a feature
        features.append(fname)
        
        # apply the aggregation function 
        if a == 'max':
            dfAnn[fname] = [np.max(dt[key]) if not dt.empty else None for dt in masked_df]
        elif a == 'min':
            dfAnn[fname] = [np.min(dt[key]) if not dt.empty else None for dt in masked_df]
        elif a == 'std':    
            dfAnn[fname] = [np.std(dt[key]) if not dt.empty else None for dt in masked_df]
        elif a == 'mean':    
            dfAnn[fname] = [np.mean(dt[key]) if not dt.empty else None for dt in masked_df]
        elif a == 'var':
            dfAnn[fname] = [np.var(dt[key]) if not dt.empty else None for dt in masked_df]
            
dfAnn = dfAnn.dropna()

### 5. Feature importances with ETF and SVM classifier training

In [5]:
# The target features set
target_features = ['classRate','classDepth','classRelease']


# For each target feature in the feature set
for target in target_features:
    
    # Scaling features  
    #http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
    
    # Instantiating the model
    min_max_scaler = preprocessing.MinMaxScaler()
    
    # the domain is the values of the features 
    X = dfAnn[features].values

    # Scaling features 
    X = min_max_scaler.fit_transform(X)
    
    # Range (target feature)
    y = dfAnn[target].values.ravel()
    
    # ExtraTreeClassifier for feature importance
    #http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html
    ETF = ExtraTreesClassifier()
    ETF.fit(X, y)
    
    # Just sorting the dictionary by the feature with highest to the lowest importance
    importance = sorted(dict(zip(dfAnn[features].columns, ETF.feature_importances_)).items(), 
                        key=operator.itemgetter(1),reverse=True)
    
    # Classification with a linear-kernel SVM  
    svc = SVC(kernel="linear", C=1)
    accuracies = []
    
    # Try from 1 to n_features to check the best number of features 
    #for n in range(1,len(importance)):
        
    training_features = []

    # For each element in the feature set, ordered by importance
    for el in importance[:len(importance)]:
        training_features.append(el[0])

    # The domain is composed of the scaled training features
    X = min_max_scaler.fit_transform(dfAnn[training_features].values)

    # The range is target values
    y = dfAnn[target].values.ravel()

    # Split the data into training and test sets (33% test, fixed random seed), then fit the SVM
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=72)
    svc.fit(X_train, y_train)

    # Prediction
    y_pred = svc.predict(X_test)

    # Print the accuracy for this target
    print('SVM ' + target + ' N=' + str(np.shape(dfAnn)[0]) + ' Accuracy = ' + str(accuracy_score(y_test, y_pred)))

###  \*\*\*\* End tutorial \*\*\*\*

## Questions for the Hackathon:
+ How can we improve the overall prediction accuracy of the classification models?
+ Can we find better features (e.g. FFT-based ones)? For example, with tools such as https://tech.blue-yonder.com/tsfresh-scientific-and-free/
+ Train the same classifier on other sessions individually?
+ Train the classifier on other sessions together?
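
As a possible starting point for the question about FFT-based features, the sketch below computes the dominant frequency of a signal with a plain discrete Fourier transform. It is only one candidate feature, not something the tutorial's pipeline already uses; the function name `dominant_frequency`, the sampling rate and the synthetic signal are assumptions made for the example. On real data, a library FFT (e.g. from NumPy) would be much faster than this loop.

```python
import cmath
import math


def dominant_frequency(values, sampling_rate_hz):
    """Return the non-zero frequency (in Hz) with the largest DFT magnitude."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]  # remove the constant (DC) offset

    best_k, best_mag = 0, 0.0
    for k in range(1, n // 2 + 1):
        # k-th DFT coefficient of the centered signal
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(centered))
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return best_k * sampling_rate_hz / n


# Hypothetical signal: a 2 Hz oscillation (120 cycles per minute) sampled at 20 Hz for 2 s
rate = 20
signal = [math.sin(2 * math.pi * 2 * i / rate) for i in range(2 * rate)]

freq = dominant_frequency(signal, rate)
print(freq)
```

Applied per attribute over a window of several compressions, such a value could be added as an extra column next to the existing aggregations.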