evaluation.tex

\chapter{Evaluation}
  \section{Data collection method and data collected}
    Table~\ref{tab:data-collected} details the amount of evaluation data collected.
    
    The data was collected with the smartwatch worn on the left wrist and the phone loose in the right hand trouser pocket.
    
    In total I collect over five hours of activity recordings over 12 different activities. This produces a total of 1816 instances to classify in the dataset.
    % Statistics about data gathered: how many records, total length etc.
    \begin{table}
      \centering
        \input{data/TableDataSetMetadata}
      \caption{A summary of data collected.}
      \label{tab:data-collected}
    \end{table}
  \section{Evaluation process}
    Once the data was gathered, it was processed and features were extracted according to Section~\ref{sec:data-processing}.
    
    The data set was then split into training set and a testing set. This was done with a stratified shuffling splitter. This means that the proportional distribution of true labels in the testing set follows that of the training set. Whether an instance is placed in the training set or the testing set is chosen at random.
    
    The split was performed 10 times, with elements chosen at random each time. The splitter does not guarantee that the same split will not be made on subsequent splits, but such an event is unlikely given the size of the dataset.
    
    The data set was split in the ratio 50:50 training:testing.
    
    Once the data had been split, the labels were stripped from the testing data and set aside. Then, three separate instances of each of the four classifiers was trained on the training set. Each of the three instances only had access to features extracted from the phone accelerometer data, features extracted from the watch accelerometer data and both the phone-extracted and the watch-extracted features respectively.
    
    Each of the three instances were then tested on the test set and their evaluation metrics --- confusion matrix and $\mathrm{F}_1$ measure --- were calculated by comparing the true labels to the classified labels.
  \section{Evaluation metrics}
    Two primary methods of evaluation are used in this project: $\mathrm{F}_1$ measure and the confusion matrix.
    This section details these methods of evaluation. These metrics are discussed in Sokolova \emph{et al.}\cite{sokolova2009systematic}
    \subsection{F1 measure}
      In order to define the $\mathrm{F}_1$ measure, it is necessary to first define precision and recall.
      \begin{description}
        \item[Precision], defined as $\frac{\text{True positives}}{\text{True positives} + \text{False positives}}$, is the proportion of instances of a particular label in which the classifier's labels agree with the true labels.
        \item[Recall], defined as $\frac{\text{True positives}}{\text{True positives} + \text{False negatives}}$, is the proportion of all the instances with a particular label which are labelled as such. 
      \end{description}
      
      The $\mathrm{F}_1$ measure is then defined as the harmonic mean of precision and recall:
      
      $$\mathrm{F}_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
      
      We use the $\mathrm{F}_1$ measure rather than precision or recall in isolation because neither provides sufficient information. It is trivial to maximise recall in isolation: label everything with the class in question. A precision of $1$ indicates everything we have returned is correct, but does not take into account how many instances we have missed.
      
      As opposed to accuracy --- the proportion of instances that were correctly classified --- the $\mathrm{F}_1$ measure is more informative when the instances are rare compared to the size of the dataset. Consider a binary classification problem with two labels: $\chi$ and $\bar{\chi}$. If $\chi$ occurs just 1\% of the time, a classifier that always returns $\bar{\chi}$ will by 99\% accurate. However, its  $\mathrm{F}_1$ measure for the $\chi$ class will be 0. Because no class makes up more than 20\% of the dataset (see Table~\ref{tab:data-collected}), $\mathrm{F}_1$ is a better evaluation measure in this case.
    
      The $\mathrm{F}_1$ measure reaches its best value, $1$, when both precision and recall equal $1$. That is when all the classifier's labels agree with all the true labels and none of the true instances have been mislabeled.
      
      Precision and recall, and thus the $\mathrm{F}_1$ measure, are only defined for a single class (i.e. a binary classification problem). As this is a multi-class classification problem, the $\mathrm{F}_1$ measure is reported for each class separately.
      
      The results from the classifiers are nondeterministic for two reasons:
      \begin{enumerate}
        \item the stratified shuffling splitter splits the dataset into a training set and a testing set proportionally at random; and
        \item the decision tree classifier and random forest classifier are nondeterministic in their operation.
      \end{enumerate}
      
      The ten splits will therefore produce different, independent $\mathrm{F}_1$ measures. The mean of the $\mathrm{F}_1$ measures is given together with its standard error. The standard error is given by:
      $$\mathrm{SE}_{\mathrm{F}_1} = \frac{\sigma}{\sqrt{n}}$$
      
      where $\sigma$ is the standard deviation of all measurements of the $\mathrm{F}_1$ measure and $n$ is the number of trials, in this case ten.
        
    \subsection {Confusion matrix}
      A confusion matrix lists the true labels as rows and the classified labels as columns. Any given cell $c_{ij}$ is a count of the number of instances with a true label of $i$ that were classified as label $j$. Confusion matrices display all classifications and misclassifications.
      
      The confusion matrices presented in this project are the additions of confusion matrices from separate trials.
    % F1 measure: define precision and recall first
    % Confusion matrix
    
  
  \section{Phone-only measurements}
    The following results have been obtained by training each of the four classifiers on features extracted only from phone-collected accelerometer data.
    
    Figure~\ref{fig:F1GraphForEachClassifier_phone} gives the $\mathrm{F}_1$ measures for each activity resulting from classification using each of the four classifiers. The random forest classifier performs best of the four classifiers, outperforming others in 75\% of activities. All three proper classifiers significantly outperform the baseline dummy random classifier.
    
    Figure~\ref{fig:F1GraphAverage_phone} averages the $\mathrm{F}_1$ measures given in Figure~\ref{fig:F1GraphForEachClassifier_phone}. The random forest classifier outperforms the other three classifiers in average $\mathrm{F}_1$ measure. The decision tree and naive Bayes classifiers perform at the same level. Again, all three score significantly higher than the baseline dummy random classifier.
    
    Table~\ref{tab:ConfusionMatrix_phone_RandomForestClassifier} presents a cumulative confusion matrix for the random forest classifier trained on phone-only features. The random forest classifier was chosen as the best performing classifier. The matrix is the addition of indvidiual confusion matrices from ten independent trials.
    
    From the confusion matrix, we see some of the misclassifications we would expect from phone-only classification:
    \begin{itemize}
      \item Computer use misclassified as eating, and vice versa. Both these activities are the only seated activities and the phone should struggle to differentiate between them.
      \item Gym cycling misclassified as cycling. Both of these activities exhibit the same peddling motion in the leg, which phone data alone may struggle to differentiate.
      \item Many standing activities have been misclassified as each other, such as fussball, teethbrushing, standing and gallery perusal. Again, the phone data struggles to differentiate these static upright activities from each other.
    \end{itemize}  
    Note that the classifier has almost perfect precision on those activities that require highly periodic leg movements: gym cycling, running and walking. All (bar one) of the instances classified as those activities were correct.


    \begin{figure}
      \centering
      \includegraphics[width=1.0\textwidth]{F1GraphForEachClassifier_phone}
      \caption[$\mathrm{F}_1$ measures for each activity for each of the four classifiers trained on phone-only features.]{$\mathrm{F}_1$ measures for each activity for each of the four classifiers trained on phone-only features. Best is 1, worst is 0.}
      \label{fig:F1GraphForEachClassifier_phone}
    \end{figure}

    \begin{figure}
      \centering
      \includegraphics[width=1.0\textwidth]{F1GraphAverage_phone}
      \caption[Average $\mathrm{F}_1$ measures across all activities for each of the four classifiers trained on phone-only features.]{Average $\mathrm{F}_1$ measures across all activities for each of the four classifiers trained on phone-only features. Error bars are calculated as the standard error in the mean.}
      \label{fig:F1GraphAverage_phone} 
    \end{figure}

    \begin{table}
      \tabcolsep=0.11cm
      \centering
        \input{data/ConfusionMatrix_phone_RandomForestClassifier}
      \caption[Confusion matrix of the random forest classifier trained on phone-only features]{Cumulative confusion matrix from ten trials of the random forest classifier, the best performing of all the classifiers, trained on phone-only features.}
      \label{tab:ConfusionMatrix_phone_RandomForestClassifier}
    \end{table}
    
  \section{Watch-only measurements}
    The following results have been obtained by training each of the four classifiers on features extracted only from watch-collected accelerometer data.
    
    Figure~\ref{fig:F1GraphForEachClassifier_wear} gives the $\mathrm{F}_1$ measures for each activity resulting from classification using each of the four classifiers. Like in phone-only classification, the random forest classifier performs best of the four classifiers, outperforming others in 83\% of activities.
    
    Figure~\ref{fig:F1GraphAverage_wear} averages the $\mathrm{F}_1$ measures given in Figure~\ref{fig:F1GraphForEachClassifier_wear}. The random forest classifier outperforms the other three classifiers in average $\mathrm{F}_1$ measure. The decision tree performs marginally better than the naive Bayes classifier. Again, all three score significantly higher than the baseline dummy random classifier.
    
    Table~\ref{tab:ConfusionMatrix_wear_RandomForestClassifier} presents a cumulative confusion matrix for the random forest classifier trained on watch-only features. Again, the random forest classifier was chosen as the best performing classifier. The matrix is the addition of individual confusion matrices from ten independent trials.
    
    Compared to phone-only classification, the watch-only classifications exhibit less precision in those leg-period activities, such as running and walking. However, some periodicity is still present in the wrist movement and these activities are still classified accurately. A more interesting case is that of gym cycling, which is highly periodic in the leg movement but completely aperiodic in its wrist movement. As a result, its misclassification rate suffers.
    
    Standing activities also suffer from higher misclassification when using watch-only features.
    
    Unexpectedly, computer use and eating also are subject to a higher rate of misclassification than when using the phone-only features. Fussball, however, is better classified using the watch as opposed to the phone.

    \begin{figure}
      \centering
      \includegraphics[width=1.0\textwidth]{F1GraphForEachClassifier_wear}
      \caption{$\mathrm{F}_1$ measures for each activity for each of the four classifiers trained on watch-only features. Best is 1, worst is 0.}
      \label{fig:F1GraphForEachClassifier_wear}
    \end{figure}

    \begin{figure}
      \centering
      \includegraphics[width=1.0\textwidth]{F1GraphAverage_wear}
      \caption[Average $\mathrm{F}_1$ measures across all activities for each of the four classifiers trained on watch-only features]{Average $\mathrm{F}_1$ measures across all activities for each of the four classifiers trained on watch-only features. Error bars are calculated as the standard error in the mean. The random forest classifier again performs best overall.}
      \label{fig:F1GraphAverage_wear}
    \end{figure}
    
    \begin{table}
      \tabcolsep=0.11cm
      \centering
        \input{data/ConfusionMatrix_wear_RandomForestClassifier}
      \caption[Confusion matrix of the random forest classifier trained on watch-only features]{Cumulative confusion matrix from ten trials of the random forest classifier, the best performing of all the classifiers, trained on watch-only features.}
      \label{tab:ConfusionMatrix_wear_RandomForestClassifier}
    \end{table}
    
  
  \section{Phone and watch measurements}
    The following results have been obtained by training each of the four classifiers on features extracted from both phone and watch accelerometer data.
    
    Figure~\ref{fig:F1GraphForEachClassifier_both} gives the $\mathrm{F}_1$ measures for each activity resulting from classification using each of the four classifiers trained on phone and watch features. Like in phone-only and watch-only classification, the random forest classifier performs best of the four classifiers, outperforming others in 67\% of activities. This percentage however is the lowest of the three feature-set cases. The naive Bayes classifier performs better in more types of activities than the decision tree classifier. This is particularly evident in those activities identified to be  periodic: cycling, gym cycling, running, stairs, teethbrushing and walking.
    
    Figure~\ref{fig:F1GraphAverage_both} averages the $\mathrm{F}_1$ measures given in Figure~\ref{fig:F1GraphForEachClassifier_both}. The random forest classifier outperforms the other three classifiers in average $\mathrm{F}_1$ measure. The decision tree and naive Bayes classifiers both score equally at the $\mathrm{F}_1$ measure. Again, all three score significantly higher than the baseline dummy random classifier.
    
    Table~\ref{tab:ConfusionMatrix_both_RandomForestClassifier} presents a cumulative confusion matrix for the random forest classifier trained on both phone and watch features. Again, the random forest classifier was chosen as the best performing classifier.
    
    Using both phone and watch features reduces the effect of the broad categories of uncertainty that were present in phone-only and watch-only classification, though in many cases the difference is negligible. The only activity in which using phone and watch features together significantly outperforms either phone or watch classification on its own is in the climbing activity.
    
    \begin{figure}
      \centering
      \includegraphics[width=1.0\textwidth]{F1GraphForEachClassifier_both}
      \caption[$\mathrm{F}_1$ measures for each activity for each of the four classifiers trained on both phone and watch features]{$\mathrm{F}_1$ measures for each activity for each of the four classifiers trained on both phone and watch features. Best is 1, worst is 0.}
      \label{fig:F1GraphForEachClassifier_both}
    \end{figure}
    
    \begin{figure}
      \centering
      \includegraphics[width=1.0\textwidth]{F1GraphAverage_both}
      \caption[Average $\mathrm{F}_1$ measures across all activities for each of the four classifiers trained on both phone and watch features]{Average $\mathrm{F}_1$ measures across all activities for each of the four classifiers trained on both phone and watch features. Error bars are calculated as the standard error in the mean.}
      \label{fig:F1GraphAverage_both}
    \end{figure}
    
    \begin{table}
      \tabcolsep=0.11cm
      \centering
        \input{data/ConfusionMatrix_both_RandomForestClassifier}
      \caption[Confusion matrix of the random forest classifier trained on both phone and watch features features]{Cumulative confusion matrix from ten trials of the random forest classifier, the best performing of all the classifiers, trained on both phone and watch features.}
      \label{tab:ConfusionMatrix_both_RandomForestClassifier}
    \end{table}
  
  \section{Comparison}
    This section presents graphs that directly compare the $\mathrm{F}_1$ measures as calculated by classifiers trained with phone-only, watch-only and phone and watch feature sets.
    
    Figure~\ref{fig:F1Graph_RandomForestClassifier} gives {$\mathrm{F}_1$ measures for each activity using the random forest classifier trained on each of the three feature sets. On average, the random forest classifier performed best and so is discussed primarily in this evaluation.
    
    Climbing is the only activity in which using both phone and watch feature sets significantly outperforms either feature set on its own. In all other trials, using both the phone and watch feature sets was better but not significantly so. This is not necessarily because using both feature sets performed badly, but because both the phone-only and watch-only feature sets performed well. 
  
    Figure~\ref{fig:F1Graph_average_f1_per_featureset} averages {$\mathrm{F}_1$ measures across all activities grouped by classifier and feature set. In all three of the classifiers, the phone and watch outperform phone-only features and watch-only features. On average, the phone-only and the watch-only features perform equally well.
    
    A particularly interesting result is that phone-only classification performs best in computer use and eating classification; introducing features extracted from watch data actually reduces the accuracy of the classifier. One might expect most of the information to be gained from computer use and eating activities should come from the watch rather than the phone. One reason for this is that the smartwatch data could be noisy: one could argue that movements not necessarily related to the activity are more likely to be recorded on a smartwatch than on a smartphone.
    
    Another possible reason for this misclassification is the lack of other seated activities. Computer use and eating are the only two seated activities. Contrast this to semi-stationary standing activities such as fussball, standing, gallery perusal or teethbrushing. Watch-only classification consistently outperforms phone-only classification in these activities, while phone-only classification scores lower in each case than it does for computer use or for eating. The ability to classify a certain activity depends very much on the other activities present in the dataset and how fine the nuances are between them. 
    
        The second of the two overall aims of the project was to evaluate to what extent the smartwatch is better at helping to classify activities. Both figures here show the the smartwatch is just as good as the smartphone. Though it outperforms the smartphone in some activities and performs marginally better on average, it is not statistically significant. Using both phone and watch features outperforms both on average and on a per-activity level.
    \begin{figure}
      \centering
      \includegraphics[width=1.0\textwidth]{F1Graph_RandomForestClassifier}
      \caption{$\mathrm{F}_1$ measures for each activity using the random forest classifier trained on phone-only, wear-only and both phone and wear features.}
      \label{fig:F1Graph_RandomForestClassifier}
    \end{figure}
    \begin{figure}
      \centering
      \includegraphics[width=1.0\textwidth]{F1Graph_average_f1_per_featureset}
      \caption{Average $\mathrm{F}_1$ measures for each activity from all classifiers, trained on phone-only, wear-only and both phone and wear features.}
      \label{fig:F1Graph_average_f1_per_featureset}
    \end{figure}
  
  \section{Feature importances}
    Feature importances can be calculated when using decision trees and random forests. Feature importances, also known as Gini importance, is the normalised total reduction in impurity brought by that feature\cite{breiman2001random}. A feature that can split the whole dataset into exactly two labels would have a feature importance of 1.
    
    In the case of classification using both phone and watch features, we would expect to see a mix of features from both devices.
    
    Figure~\ref{fig:FeatureImportancesTogether} presents the top five most important features from phone-only, watch-only and both phone and watch classification. The feature importances were averaged from 50 component decision trees of a random forest classifier. As expected, phone and watch features are equally important in the phone and watch classification task. 
    
    This method is also useful to evaluate feature selection. Of all features calculated, three general classes of feature stand out as being important:
    \begin{enumerate}
      \item correlation between axes;
      \item spectral flatness, a measure of periodicity; and
      \item peak frequency.
    \end{enumerate}
    These three classes of feature can be collected just as easily on the watch as on the phone, and the quality of data seems to be comparable.
    
    Figure~\ref{fig:FeatureImportancesCumulative} gives cumulative feature importances for each activity. A random forest classifier was given both phone and watch features but was trained using sets of one vs. rest binary labels. That is, for each activity $\chi$, the labels were converted into having two possible values: $\chi$ and $\bar{\chi}$. This allows extraction of feature importances per activity.
    
    Feature importances for some activities are understandable: phone features are more important when classifying stairs, standing and walking. Some, however, make less sense. As discussed earlier, computer use and eating also assign more importance to phone features, while cycling is the activity that assigns the most importance to watch features.
    
    One possible reason for this observation is that while phone features are good enough to place cycling into a general class of leg-periodic activities, differentiating them requires more information than is available through the phone alone. Though the most characteristic movements could come from one accelerometer, these could be shared with other activities. A second accelerometer that does not necessarily follow those characteristic movements, such as the watch while cycling, could potentially be used to nuance the results and differentiate cycling from, say, cycling in the gym.
    
    \begin{figure}
      \centering
      \includegraphics[width=1.0\textwidth]{FeatureImportancesTogether}
      \caption[Feature importances of the top five most important features]{Feature importances of the top five most important features averaged over 50 decision trees in a random forest classifier trained with the three feature sets. Recall that data from the phone and the watch each produce 22 features, and so phone and watch classification has 44 features from which to pick.}
      \label{fig:FeatureImportancesTogether}
    \end{figure}
    
    \begin{figure}
      \centering
      \includegraphics[width=1.0\textwidth]{FeatureImportancesCumulative}
      \caption[Cumulative feature importances for each activity]{Cumulative feature importances for each activity. A random forest classifier was trained with one vs. rest labels and both phone and watch features. The importances of all the phone features and of all the watch features were totalled separately. The average total watch feature importance, $\approx 0.42$ is marked as the dotted line on the graph.}
      \label{fig:FeatureImportancesCumulative}
    \end{figure}
  \section{Relation to original goals}
    The goals of the project were:
    \begin{enumerate}
      \item to classify activities based on accelerometer recordings from a consumer smartwatch and smartphone; and
      \item evaluate to what extent the smartwatch is better at helping to classify activities.
    \end{enumerate}
    
    The confusion matrices presented meet goal one. In each of the three sections, they show correct classification of between 80\% and 90\% of activities, which is between 60\% and 70\% better than a random baseline. Evaluation using $\mathrm{F}_1$ measures show similar improvement.
    
    The $\mathrm{F}_1$ measures meet goal two. They do not show the smartwatch to be better at classifying activities on average than the smartphone. However, using data from both devices performs better than either device individually. Certain devices are better for classification of particular activities.
  \clearpage
  \section{Summary}
    In this section I compared $\mathrm{F}_1$ measures per activity and on average over all activities between phone-only features, watch-only features and both phone and watch features classification. I presented confusion matrices for each of these three feature sets. Finally, I analysed importances of features.
    
    The key results of this project are:
    \begin{enumerate}
      \item The best $\mathrm{F}_1$ measure was $0.96 \pm 0.01$, measured with a random forest classifier trained on phone and smartwatch data.
      \item The smartwatch performed slightly better than the smartphone, but the difference is not significant.
      \item Some activities such as climbing, fussball and gallery perusal were classified more accurately with the smartwatch than the smartphone. The smartphone performed better at computer use, eating and gym cycling.
      \item Unexpected results were that computer use and eating were both classified better using  phone-only features, while cycling was classified better using watch-only features. One theory for this observation is that the accelerometer we think of as being best suited to the task is only sufficient to place the activity in a general class. The phone isn't enough to distinguish between cycling and gym cycling, for example. For that distinction, the smartwatch may be better suited.
      \item Analysis of feature importances indicates that, to a random forest classifier, the phone features were found to be more important on average than watch features.
      \item The feature importance analysis reflects the findings from the $\mathrm{F}_1$ measures: the watch features are most important in cycling, gallery perusal and climbing, while the phone features are most important in computer use and standing.
    \end{enumerate}