# Comparison of common Machine Learning Algorithms considering a sport activities Classification Task

Even though Deep Learning keeps growing in importance, classical Machine Learning techniques still represent a cornerstone for people approaching computer science for the first time. There are many different techniques: some are devoted to specific types of problems, while others are multi-purpose.

One popular task is the so-called *classification problem*, where the model’s output is a category with a semantic meaning. A classification model attempts to draw a conclusion from observed values.
Different methods can be implemented to tackle this problem.

Our focus is a brief comparative study of four supervised machine learning techniques:
1. Logistic Regression
2. K Nearest Neighbors
3. Decision Trees
4. Multilayer Perceptron

## Pre-processing

Since the [dataset](https://archive-beta.ics.uci.edu/ml/datasets/daily+and+sports+activities) used for this study is organised in folders and subfolders according to a hierarchical scheme, a quick preprocessing step is needed to build a single large dataset. The idea is to extract the files from each folder and stack them together in a single CSV file.

In [45]:
import pandas as pd
import numpy as np

The data is organised in 19 folders (one for each activity), each containing 8 folders (one for each person), which in turn contain 60 text files (each representing 5 s of sampling). Therefore, some indexes are initialized to refer to each of these elements.

In [46]:
fileIndex = ["%02d" % x for x in range(1,61)]
personIndex = ["%01d" % x for x in range(1,9)]
activityIndex = ["%02d" % x for x in range(1,20)]

Each file representing a 5 s segment contains 125 rows (sampling frequency of 25 Hz). To obtain representative values for each 5 s time span, the average and the variance of all the recorded values are computed per column and combined into a single array. In this way, there are 60 arrays for each person and activity, which can be joined together. In addition, the index of the activity, which ranges from 1 to 19, is added as the last element.
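
The per-file feature extraction described above can be sketched on its own. This is a minimal illustration, not the notebook's actual cell: the helper name `extract_features` and the random stand-in segment are assumptions introduced here, and the 125 × 45 shape follows the description in the text.

```python
import numpy as np

def extract_features(segment, activity):
    """Collapse one (125, 45) segment into a 91-element feature vector."""
    mean = segment.mean(axis=0)      # 45 per-channel averages
    variance = segment.var(axis=0)   # 45 per-channel population variances
    return np.concatenate((mean, variance, [activity]))

# Synthetic stand-in for one 5 s file: 125 samples x 45 sensor channels.
rng = np.random.default_rng(0)
segment = rng.normal(size=(125, 45))
features = extract_features(segment, activity=3)
print(features.shape)
```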
To make the final CSV file easy to read, the column headers are defined as follows.

In [None]:
datasetHeader = ["T_xacc", "T_yacc", "T_zacc", "T_xgyro", "T_ygyro", "T_zgyro", "T_xmag", "T_ymag", "T_zmag",
        "RA_xacc", "RA_yacc", "RA_zacc", "RA_xgyro", "RA_ygyro", "RA_zgyro", "RA_xmag", "RA_ymag", "RA_zmag",
        "LA_xacc", "LA_yacc", "LA_zacc", "LA_xgyro", "LA_ygyro", "LA_zgyro", "LA_xmag", "LA_ymag", "LA_zmag",
        "RL_xacc", "RL_yacc", "RL_zacc", "RL_xgyro", "RL_ygyro", "RL_zgyro", "RL_xmag", "RL_ymag", "RL_zmag",
        "LL_xacc", "LL_yacc", "LL_zacc", "LL_xgyro", "LL_ygyro", "LL_zgyro", "LL_xmag", "LL_ymag", "LL_zmag",
        "var_T_xacc", "var_T_yacc", "var_T_zacc", "var_T_xgyro", "var_T_ygyro", "var_T_zgyro", "var_T_xmag", "var_T_ymag", "var_T_zmag",
        "var_RA_xacc", "var_RA_yacc", "var_RA_zacc", "var_RA_xgyro", "var_RA_ygyro", "var_RA_zgyro", "var_RA_xmag", "var_RA_ymag", "var_RA_zmag",
        "var_LA_xacc", "var_LA_yacc", "var_LA_zacc", "var_LA_xgyro", "var_LA_ygyro", "var_LA_zgyro", "var_LA_xmag", "var_LA_ymag", "var_LA_zmag",
        "var_RL_xacc", "var_RL_yacc", "var_RL_zacc", "var_RL_xgyro", "var_RL_ygyro", "var_RL_zgyro", "var_RL_xmag", "var_RL_ymag", "var_RL_zmag",
        "var_LL_xacc", "var_LL_yacc", "var_LL_zacc", "var_LL_xgyro", "var_LL_ygyro", "var_LL_zgyro", "var_LL_xmag", "var_LL_ymag", "var_LL_zmag",
        "activity_index"]

The entire dataset is made of 9120 arrays (19 activities × 8 people × 60 files) of 91 elements each: 45 mean values (3 sensors measuring along 3 axes at 5 sensing locations), the corresponding 45 variance values, plus the activity index at the end. To extract and collect all the data, a set of nested for loops is used.

In [None]:
# Create the entire dataset: 9120 x 91
# Person from 1 to 8
allActivities = []
print("Importing all data")
for k in range(19):
    print("Elaborating activity number: ", activityIndex[k])
    for j in range(8):
        print("Elaborating person number: ", personIndex[j], end = "\r")
        for i in range(60):
            filename = f"./sportsDataset/a{activityIndex[k]}/p{personIndex[j]}/s{fileIndex[i]}.txt"

            data = np.loadtxt(filename, delimiter=',', skiprows=1, dtype=float)
            dataT = data.transpose()            
            average = np.mean(dataT, axis = 1)
            variance = np.var(dataT, axis = 1)
            
            index = np.array([int(activityIndex[k])])
            #newData = np.append(average,variance, index, axis = 0)
            newData = np.concatenate((average, variance, index), axis = None)
            
            allActivities.append(newData)

print("\nData correctly stored")

ActivitiesDataset = np.array(allActivities)
np.savetxt("./sportsDataset/ActivitiesDataset.csv", ActivitiesDataset, delimiter=",", header = ','.join(datasetHeader), comments='')
print("Activity dataset created\n")

To train all the models, a dedicated dataset is created from the data of the first 7 people. It is built with the same approach.

In [None]:
# Create the train dataset: 7980 x 91
# Person from 1 to 7
trainingActivities = []
print("Importing training data")
for k in range(19):
    print("Elaborating activity number: ", activityIndex[k])
    for j in range(7):
        print("Elaborating person number: ", personIndex[j], end = "\r")
        for i in range(60):
            filename = f"./sportsDataset/a{activityIndex[k]}/p{personIndex[j]}/s{fileIndex[i]}.txt"

            data = np.loadtxt(filename, delimiter=',', skiprows=1, dtype=float)
            dataT = data.transpose()            
            average = np.mean(dataT, axis = 1)
            variance = np.var(dataT, axis = 1)
            
            index = np.array([int(activityIndex[k])])
            #newData = np.append(average,variance, index, axis = 0)
            newData = np.concatenate((average, variance, index), axis = None)
            
            trainingActivities.append(newData)

print("\nTraining data correctly stored")

activitiesTrainDataset = np.array(trainingActivities)
np.savetxt("./sportsDataset/TrainingDataset.csv", activitiesTrainDataset, delimiter=",", header = ','.join(datasetHeader), comments='')
print("Training dataset created\n")

To test all the models, the data of the 8th person are used instead.

In [None]:
# Create the test dataset: 1140 x 91
# Person 8
testingActivities = []
print("Importing testing data")
for k in range(19):
    print("Elaborating activity number: ", activityIndex[k])
    print("Elaborating person number: ", personIndex[7], end = "\r")
    for i in range(60):
        filename = f"./sportsDataset/a{activityIndex[k]}/p{personIndex[7]}/s{fileIndex[i]}.txt"

        data = np.loadtxt(filename, delimiter=',', skiprows=1, dtype=float)
        dataT = data.transpose()            
        average = np.mean(dataT, axis = 1)
        variance = np.var(dataT, axis = 1)
        
        index = np.array([int(activityIndex[k])])
        #newData = np.append(average, variance, index, axis = 0)
        newData = np.concatenate((average, variance, index), axis = None)
        
        testingActivities.append(newData)

print("\nTesting data correctly stored")

activitiesTestDataset = np.array(testingActivities)
np.savetxt("./sportsDataset/TestDataset.csv", activitiesTestDataset, delimiter=",", header = ','.join(datasetHeader), comments='')
print("Testing dataset created\n")

Now all the necessary datasets are saved as CSV files and ready to feed the models.
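
Before feeding a model, each CSV has to be split into a feature matrix and a label vector. The sketch below shows one way to do this with NumPy. The helper `load_dataset` and the tiny in-memory table are assumptions made for illustration; with the real files, a path such as `./sportsDataset/TrainingDataset.csv` would be passed instead of the buffer.

```python
import io
import numpy as np

def load_dataset(source):
    """Load a CSV with one header row and split it into features and labels."""
    data = np.loadtxt(source, delimiter=",", skiprows=1)  # skip the header row
    X = data[:, :-1]              # every column except the last: the 90 features
    y = data[:, -1].astype(int)   # last column: the activity index
    return X, y

# Tiny in-memory stand-in for a saved dataset (hypothetical values).
header = ",".join([f"f{i}" for i in range(90)] + ["activity_index"])
rows = np.hstack([np.ones((4, 90)), [[1.0], [1.0], [2.0], [19.0]]])
buffer = io.StringIO()
np.savetxt(buffer, rows, delimiter=",", header=header, comments="")
buffer.seek(0)

X_demo, y_demo = load_dataset(buffer)
print(X_demo.shape, y_demo)
```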

## Logistic Regression

Logistic regression is a good algorithm to start with when approaching classification. Even though the name contains ‘Regression’, it is a classification model, not a regression model. It uses a logistic function to model a binary output. The output of logistic regression is a probability x (0 ≤ x ≤ 1), which can be used to predict the binary label 0 or 1 (if x < 0.5, output = 0, else output = 1).
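
The mapping from a score to a probability and then to a label can be illustrated in a few lines. This is only a sketch: the function names and the example scores are hypothetical, and the 0.5 threshold is the one stated above.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real score to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_label(z, threshold=0.5):
    """Return 1 when the predicted probability reaches the threshold, else 0."""
    return (sigmoid(z) >= threshold).astype(int)

scores = np.array([-4.0, 0.0, 4.0])   # hypothetical linear scores
print(sigmoid(scores))
print(predict_label(scores))
```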

**Loss function**

We use **cross entropy** as our loss function. The basic logic is that whenever a prediction is badly wrong (e.g. y’ = 1 and y = 0), the cost is -log(0), which tends to infinity.
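
The behaviour of the loss can be checked numerically. The sketch below implements binary cross entropy with NumPy; the clipping constant `eps` is an assumption added here to avoid evaluating log(0), and the two example predictions are hypothetical.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross entropy; probabilities are clipped to avoid log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(np.mean(losses))

# True label is 0 in both cases; only the predicted probability of class 1 changes.
confident_right = binary_cross_entropy([0], [0.01])
confident_wrong = binary_cross_entropy([0], [0.99])
print(confident_right, confident_wrong)
```

The closer a wrong prediction gets to full confidence, the faster the loss grows, which is the penalty described above.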

**Advantages**
-   Easy, fast and simple classification method.
-   The θ parameters explain the direction and strength of the effect of each independent variable on the dependent variable.
-   Can also be used for multiclass classification.
-   The loss function is always convex.

**Disadvantages**
-   Cannot model non-linear decision boundaries on the raw features.
-   Proper selection of features is required.
-   A good signal-to-noise ratio is expected.
-   Collinearity and outliers degrade the accuracy of the LR model.

**Hyperparameters**
Logistic regression has two main hyperparameters: the learning rate (α) and the regularization parameter (λ). Both have to be tuned properly to achieve high accuracy.
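
To show where α and λ enter the training procedure, here is a minimal sketch of binary logistic regression trained with batch gradient descent and an L2 penalty. It is an illustration under stated assumptions, not the notebook's implementation: the function name, the default values and the two-blob toy data are hypothetical, and the 19-class task would additionally need a multiclass scheme such as one-vs-rest.

```python
import numpy as np

def fit_logistic_regression(X, y, alpha=0.1, lam=0.01, n_iter=500):
    """Batch gradient descent on L2-regularised binary cross entropy."""
    n_samples, n_features = X.shape
    theta = np.zeros(n_features)
    bias = 0.0
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(X @ theta + bias)))
        error = prob - y
        grad_theta = X.T @ error / n_samples + lam * theta  # lambda shrinks the weights
        grad_bias = error.mean()
        theta -= alpha * grad_theta                          # alpha sets the step size
        bias -= alpha * grad_bias
    return theta, bias

# Toy data: two Gaussian blobs, one per class (hypothetical values).
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),
                   rng.normal(2.0, 1.0, size=(50, 2))])
y_toy = np.concatenate([np.zeros(50), np.ones(50)])

theta, bias = fit_logistic_regression(X_toy, y_toy)
pred = (X_toy @ theta + bias >= 0).astype(int)   # score >= 0 means probability >= 0.5
accuracy = (pred == y_toy).mean()
print(accuracy)
```

A larger α takes bigger steps and may diverge, while a larger λ pulls the weights towards zero and reduces overfitting at the cost of a less flexible fit.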

In [1]:
# Code here

## K Nearest Neighbors

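K-nearest neighbors is a non-parametric method used for classification and regression. It is one of the most used ML techniques. It is a lazy learning model: no explicit training takes place, and each prediction is a local approximation based on the stored samples closest to the query point.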

**Advantages**
-   Easy and simple machine learning model.
-   Few hyperparameters to tune.

**Disadvantages**
-   The value of k must be chosen carefully.
-   High computational cost at prediction time when the sample size is large.
-   Features must be properly scaled so that they are treated fairly.
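
The scaling requirement can be met by standardizing each feature with statistics computed on the training set only. The sketch below assumes z-score standardization, which is one common choice among several; the helper name and the toy arrays are hypothetical.

```python
import numpy as np

def standardize(train, test):
    """Scale both sets with the mean and standard deviation of the training set."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0            # leave constant features unchanged
    return (train - mean) / std, (test - mean) / std

# Two features on very different scales (hypothetical values).
train_demo = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]])
test_demo = np.array([[2.0, 4000.0]])
train_scaled, test_scaled = standardize(train_demo, test_demo)
print(train_scaled)
print(test_scaled)
```

After scaling, both features contribute comparably to a distance computation instead of the second one dominating it.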

**Hyperparameters**
KNN mainly involves two hyperparameters:
-   K value: how many neighbors take part in the vote. k should be tuned based on the validation error.
-   Distance function: in our case, we choose the Minkowski distance because it works in an N-dimensional space and includes the Manhattan (p = 1) and Euclidean (p = 2) distances as special cases.
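
The two hyperparameters can be seen at work in a small from-scratch classifier. This is a sketch for illustration only: the function names and the six training points are hypothetical, and a library implementation would normally be preferred for real data.

```python
from collections import Counter
import numpy as np

def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p along the last axis (p=1 Manhattan, p=2 Euclidean)."""
    return np.sum(np.abs(a - b) ** p, axis=-1) ** (1.0 / p)

def knn_predict(X_train, y_train, x_query, k=3, p=2):
    """Label a query point by majority vote among its k nearest training samples."""
    distances = minkowski_distance(X_train, x_query, p=p)
    nearest = np.argsort(distances)[:k]
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Two well separated groups of points (hypothetical values).
X_knn = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_knn = np.array([1, 1, 1, 2, 2, 2])

label_near_origin = knn_predict(X_knn, y_knn, np.array([0.5, 0.5]), k=3)
label_far = knn_predict(X_knn, y_knn, np.array([5.5, 5.5]), k=3, p=1)
print(label_near_origin, label_far)
```

Every prediction computes a distance to all stored samples, which is why the cost at prediction time grows with the size of the training set.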

In [None]:
# Code Here

## Decision Tree

A decision tree is a tree-based algorithm used to solve regression and classification problems. An inverted tree is built, branching from a root node that holds the whole, mixed set of samples down to leaf nodes that are as homogeneous as possible, from which the output is derived.

**Algorithm to select conditions**

For CART (classification and regression trees), the *Gini index* or the *entropy* can be used as the splitting metric. Both measure how mixed the classes of the data points in a node are.
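
Both impurity measures are easy to compute from the class proportions in a node. The sketch below uses the standard definitions (Gini = 1 - Σ p², entropy = -Σ p log₂ p); the helper names and the two example nodes are hypothetical.

```python
from collections import Counter
from math import log2

def class_proportions(labels):
    """Fraction of samples belonging to each class present in the node."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]

def gini(labels):
    """Gini impurity: 0 for a pure node, larger when classes are mixed."""
    return 1.0 - sum(p * p for p in class_proportions(labels))

def entropy(labels):
    """Shannon entropy in bits: 0 for a pure node, 1 for a 50/50 binary node."""
    return sum(-p * log2(p) for p in class_proportions(labels))

pure = [1, 1, 1, 1]     # all samples share one class
mixed = [1, 1, 2, 2]    # two classes in equal proportion
print(gini(pure), entropy(pure))
print(gini(mixed), entropy(mixed))
```

A split is considered good when the child nodes have lower impurity than the parent.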

**Advantages**
-   Little preprocessing needed on data (for example, no feature scaling).
-   No assumptions on the distribution of the data.
-   Handles collinearity efficiently.
-   Decision trees can provide an understandable explanation of the prediction.

**Disadvantages**
-   Risk of overfitting if the tree keeps growing to achieve high purity. Decision tree pruning can be used to mitigate this issue.
-   Sensitive to outliers.
-   The tree may grow very complex when trained on complicated datasets.
-   Loses valuable information when handling continuous variables.

**Hyperparameters**
Decision trees have many hyperparameters; a few of them are listed here.

-   **criterion:** the cost function used to select the next tree node. The most used ones are gini and entropy.
-   **max depth:** the maximum allowed depth of the decision tree.
-   **minimum samples split:** the minimum number of samples required to split an internal node.
-   **minimum samples leaf:** the minimum number of samples required to be at a leaf node.
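
These four hyperparameters map directly to constructor arguments of scikit-learn's `DecisionTreeClassifier`. The sketch below is only an illustration: the synthetic data from `make_classification` and the chosen values are assumptions, not tuned settings for the sports activities dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem standing in for the real features (assumption).
X_syn, y_syn = make_classification(n_samples=300, n_features=10, n_informative=5,
                                   n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)

tree = DecisionTreeClassifier(
    criterion="gini",        # impurity measure used to choose each split
    max_depth=5,             # upper bound on the depth of the tree
    min_samples_split=10,    # samples needed before an internal node may split
    min_samples_leaf=5,      # samples that must remain in every leaf
    random_state=0,
)
tree.fit(X_tr, y_tr)
print(tree.get_depth(), tree.score(X_te, y_te))
```

Lower `max_depth` and higher minimum sample counts give smaller trees, which is one way to limit the overfitting mentioned among the disadvantages.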

## Multilayer Perceptron

In [None]:
# Brief theory recap here

In [None]:
# Code Here

## Comparison Between Models and Final Considerations

In this section, some considerations about performance and effectiveness are reported with the aim of understanding the best working conditions for each model.

Logistic regression has a convex loss function, so it will not get stuck in a local minimum, whereas a neural network, for example, may. One important thing to consider is that logistic regression tends to outperform neural networks when training data is scarce and the number of features is large, since neural networks need large amounts of training data. On the other hand, neural networks have the advantage of supporting non-linear solutions, which logistic regression cannot.
Regarding time consumption, KNN is comparatively slower than competitors like logistic regression and decision trees, but it supports non-linear solutions too. One major drawback is that KNN, in its basic form, can only output the labels. Luckily, KNN requires less data than neural networks to achieve sufficient accuracy, and neural networks need much more hyperparameter tuning than KNN.
Finally, let's spend a few words on decision trees. In general, they handle collinearity better than logistic regression, but they cannot derive the significance of features, hence they are better suited to categorical evaluation. Compared to KNN, decision trees support automatic feature interaction and are faster, because KNN is expensive at prediction time. Decision trees perform better when there is a large set of categorical values in the training data. In comparison to neural networks, decision trees are better suited when the scenario demands an explanation of the decision, but when there is sufficient training data, neural networks tend to outperform decision trees by a wide margin.

## References

- [Comparative Study on Classic Machine learning Algorithms](https://towardsdatascience.com/comparative-study-on-classic-machine-learning-algorithms-24f9ff6ab222)
- [Daily and Sports Activities Dataset](https://archive-beta.ics.uci.edu/ml/datasets/daily+and+sports+activities)
- [Scikit Learn Python Library](https://scikit-learn.org/stable/)
- [Project GitHub Repository](https://github.com/d-galli/SportActivitiesClassification)