In [1]:
import glob
import pandas as pd
import numpy as np
import time
from IPython.display import display, HTML

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

In the previous notebook we extracted relevant features from the GPS trajectories. In this notebook we will try to classify the modality of each trajectory using these features.

Since the processing can take a few hours, depending on the available computational power, the processed data is provided for you.

It can be downloaded from Google Drive: https://drive.google.com/open?id=0B22kg5oTwAn-Q0xOVDhRMVNhMDQ
It is also being passed around on memory sticks.

In [2]:
#The column containing the labels is not in the clean format we want.
#Some labels have commas at the beginning or end and double commas in the middle,
#so let's make a function which cleans these labels.
def clean_label(label):
    return label.lstrip(',').rstrip(',').replace(',,', ',')

INPUT_FOLDER = '../processed_data/'
headers_metadf = ['trajectory_id', 'start_time', 'end_time', 'v_ave', 'v_med', 'v_max', 'v_std', 'a_ave', 'a_med', 'a_max', 'a_std', 'labels']

#Let's load all of the processed data, containing the features of all trajectories, into one single dataframe.
#The easiest way to do this is to load all of the files into a list and concatenate them.
list_df_metadata = []
for file in glob.glob(INPUT_FOLDER + "*_metadata.csv"):
    df_metadata = pd.read_csv(file, index_col=0)
    list_df_metadata.append(df_metadata)
df_metadata = pd.concat(list_df_metadata)

#Remove all rows which contain NaN values in these columns.
#The .copy() makes df_labeled an independent dataframe, which avoids a possible SettingWithCopyWarning below.
df_labeled = df_metadata.dropna(subset=['v_ave','v_med','v_max', 'v_std', 'a_ave', 'a_med', 'a_max', 'a_std', 'labels']).copy()

#Clean the labels column
df_labeled['labels'] = df_labeled['labels'].apply(clean_label)

ValueError: No objects to concatenate
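
`pd.concat` raises `ValueError: No objects to concatenate` when it is given an empty list, which here means that the glob pattern matched no files in `INPUT_FOLDER`. A minimal sketch of a loader that fails with a clearer message is shown below. The function name `load_metadata` is hypothetical and not part of the original notebook.

```python
import glob
import os

import pandas as pd


def load_metadata(input_folder, pattern="*_metadata.csv"):
    """Load every metadata CSV in input_folder into one DataFrame.

    Hypothetical helper: raises FileNotFoundError naming the searched
    pattern instead of letting pd.concat fail on an empty list.
    """
    search_path = os.path.join(input_folder, pattern)
    files = sorted(glob.glob(search_path))
    if not files:
        raise FileNotFoundError(
            "No files match {!r}; check that the processed data was "
            "downloaded and that INPUT_FOLDER points to it.".format(search_path)
        )
    return pd.concat([pd.read_csv(f, index_col=0) for f in files])
```

With this helper, the loading step would read `df_metadata = load_metadata(INPUT_FOLDER)`, and every later cell that currently fails with a `NameError` depends on that call succeeding.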

Let's analyze the trajectories:

Most of the trajectories are not labeled. 

Of the labeled trajectories, around 60% contain a single modality. The other trajectories are multi-modal (walk -> bus -> train, etc.) and for now we will not take them into consideration.
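
To make the distinction concrete, here is a minimal sketch on a handful of made-up label strings (not taken from the dataset). It applies the same cleaning logic as `clean_label` and flags the labels that describe a single modality.

```python
import pandas as pd


def clean_label(label):
    # Same cleaning logic as the notebook's clean_label
    return label.lstrip(',').rstrip(',').replace(',,', ',')


# Made-up raw labels, for illustration only
raw_labels = pd.Series([',walk', 'bus,', 'walk,,bus', ',walk,,bus,,train,', 'car'])
cleaned = raw_labels.apply(clean_label)

# A cleaned label without a comma describes a single modality
is_single = ~cleaned.str.contains(',')

print(cleaned.tolist())
print("Fraction of single-modality labels: {:.2f}".format(is_single.mean()))
```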

In [3]:
all_labels = df_labeled['labels'].unique()
print("Example of trajectory labels:")
for label in all_labels[0:5]:
    print(label)

#We can select the single-modality trajectories by keeping the labels which do not contain a comma:
single_modality_labels = [elem for elem in all_labels if ',' not in elem]

df_single_modality = df_labeled[df_labeled['labels'].isin(single_modality_labels)]

print("\nTotal number of trajectories: {}".format(len(df_metadata)))
print("Total number of labeled trajectories: {}".format(len(df_labeled)))
print("Total number of single modality trajectories: {}".format(len(df_single_modality)))

NameError: name 'df_labeled' is not defined

Let's split the trajectories containing a single modality into a training set of roughly 70% and a test set of roughly 30%.

In [4]:
mask = np.random.rand(len(df_single_modality)) < 0.7
df_train = df_single_modality[mask]
df_test = df_single_modality[~mask]

print(len(df_train))

NameError: name 'df_single_modality' is not defined

The matrices containing the X and Y values of the training set (X_train, Y_train) will be used to train a classifier, and the matrices X_test and Y_test can be used to test the accuracy of the trained classifier.

Usually a dataset is split randomly into a 70% training set and a 30% test set, but this also depends on the size of the dataset. If the dataset is small, 30% might not be enough to properly test your trained classifier.

You could also split it into a 50% training set, a 25% test set and a 25% validation set.
See: https://en.wikipedia.org/wiki/Test_set
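
The random mask used for the split gives proportions that are only approximately 70/30 and that change on every run. The sketch below shows one possible alternative, assuming scikit-learn's `train_test_split` is acceptable here. The toy dataframe, the `random_state` values and the choice to stratify on the label are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for df_single_modality (made-up values)
rng = np.random.RandomState(0)
df = pd.DataFrame({
    'v_ave': rng.rand(200),
    'labels': np.repeat(['walk', 'bus', 'car', 'bike'], 50),
})

# 70% / 30% split, reproducible and stratified on the label
df_train, df_test = train_test_split(
    df, test_size=0.3, random_state=42, stratify=df['labels'])

# 50% / 25% / 25% split: first split off half, then halve the remainder
df_train3, df_rest = train_test_split(
    df, test_size=0.5, random_state=42, stratify=df['labels'])
df_val3, df_test3 = train_test_split(
    df_rest, test_size=0.5, random_state=42, stratify=df_rest['labels'])

print(len(df_train), len(df_test))
print(len(df_train3), len(df_val3), len(df_test3))
```

Stratifying keeps the share of each modality similar in every subset, which matters when some modalities are rare.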

In [5]:
#The columns used as features (X) and as target (Y)
X_colnames = ['v_ave','v_med','v_max', 'v_std', 'a_ave', 'a_med', 'a_max', 'a_std']
Y_colnames = ['labels']

X_train = df_train[X_colnames].values
Y_train = np.ravel(df_train[Y_colnames].values)
X_test = df_test[X_colnames].values
Y_test = np.ravel(df_test[Y_colnames].values)

NameError: name 'df_train' is not defined

Now that we have all of the data ready and in the format we want, let's start with the classification part.

For classification of the single modal trajectories, we will use three classifiers from the scikit-learn library.

## 1. Random Forest
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html


## 2. Logistic Regression
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

http://ataspinar.com/2016/03/28/regression-logistic-regression-and-maximum-entropy/


## 3. Support Vector Machines
http://scikit-learn.org/stable/modules/svm.html

https://youtu.be/3liCbRZPrZA


### Also see:

http://www.r2d3.us/visual-intro-to-machine-learning-part-1/

http://ciml.info/


In [6]:
rf_classifier = RandomForestClassifier(n_estimators = 18)
logreg_classifier = LogisticRegression()
svm_classifier = SVC()

In [7]:
#Random Forest
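#time.clock() was removed in Python 3.8; time.perf_counter() is the replacement for timing
t_start = time.perf_counter()
rf_classifier.fit(X_train, Y_train)
t_end = time.perf_counter()
t_diff = t_end - t_start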

train_score = rf_classifier.score(X_train, Y_train)
test_score = rf_classifier.score(X_test, Y_test)
y_pred_rf= rf_classifier.predict(X_test)
print("trained Random Forest in {:.2f} s.\t Score on training / test set: {} / {}".format(t_diff, train_score, test_score))

#Logistic Regression
t_start = time.perf_counter()
logreg_classifier.fit(X_train, Y_train)
t_end = time.perf_counter()
t_diff = t_end - t_start

train_score = logreg_classifier.score(X_train, Y_train)
test_score = logreg_classifier.score(X_test, Y_test)
y_pred_logreg = logreg_classifier.predict(X_test)
print("trained Logistic Regression in {:.2f} s.\t Score on training / test set: {} / {}".format(t_diff, train_score, test_score))

#SVM (SVC uses an RBF kernel by default, so this is not a linear SVM)
t_start = time.perf_counter()
svm_classifier.fit(X_train, Y_train)
t_end = time.perf_counter()
t_diff = t_end - t_start

train_score = svm_classifier.score(X_train, Y_train)
test_score = svm_classifier.score(X_test, Y_test)
y_pred_svm = svm_classifier.predict(X_test)
print("trained SVM Classifier in {:.2f} s.\t Score on training / test set: {} / {}".format(t_diff, train_score, test_score))

NameError: name 'X_train' is not defined

## Improving the accuracy of RF classifier
The most accurate classifier is the Random Forest classifier, with an accuracy of 78% on the test set.
Although this is already quite high, let's see how we can improve it even more.

To be able to do this, we first need to understand what this average accuracy of 78% consists of.

In the cell below:

- print the number of entries of each modality in the dataset.
- print the f1-score per class within the test set. 

hint: the metrics module of the scikit-learn library contains a lot of methods which can be used to evaluate the performance of your classifier:
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
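
As one possible approach to the two tasks above, the sketch below uses `value_counts` for the class sizes and `f1_score` / `classification_report` from `sklearn.metrics` for the per-class scores. The true and predicted labels are made up for illustration. In the notebook they would be `Y_test` and, for example, `y_pred_rf`.

```python
import pandas as pd
from sklearn.metrics import classification_report, f1_score

# Made-up true and predicted labels standing in for Y_test and y_pred_rf
y_true = ['walk', 'walk', 'bus', 'bus', 'car', 'car', 'car', 'taxi']
y_pred = ['walk', 'walk', 'bus', 'car', 'car', 'car', 'taxi', 'car']

# Number of entries per modality
print(pd.Series(y_true).value_counts())

# F1-score per class, in the order given by `labels`
labels = sorted(set(y_true))
per_class_f1 = f1_score(y_true, y_pred, labels=labels, average=None)
for label, score in zip(labels, per_class_f1):
    print("{:<6s} {:.2f}".format(label, score))

# Precision, recall, F1 and support in one table
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
```

Passing `average=None` returns one score per class instead of a single averaged value, which is what exposes the weak classes hidden behind the overall accuracy.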

If there are not enough entries of a specific class in the dataset, the classifier will have difficulty finding a general rule which correctly models it. This will lower the overall accuracy of the classifier, so let's remove all of the modalities which have fewer than 10 entries in 'df_single_modality'.
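
A minimal sketch of this filtering step on a toy dataframe with made-up counts. The name `MIN_ENTRIES` is hypothetical, and in the notebook `df` would be `df_single_modality`.

```python
import pandas as pd

# Toy stand-in for df_single_modality (made-up counts)
df = pd.DataFrame(
    {'labels': ['walk'] * 30 + ['bus'] * 12 + ['boat'] * 3 + ['run'] * 9})

MIN_ENTRIES = 10  # threshold named in the text

# Count the entries per modality and keep only the frequent ones
counts = df['labels'].value_counts()
frequent_labels = counts[counts >= MIN_ENTRIES].index
df_filtered = df[df['labels'].isin(frequent_labels)]

print(df_filtered['labels'].value_counts())
```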

You might also have seen that some entries are labeled as different modalities although their behaviour is approximately the same (for example car vs taxi). Incorrectly classifying these entries will also lower the accuracy of the classifier. Which of the existing labels in df_single_modality can be combined into one label?
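
Once you have decided which labels belong together, merging them can be sketched with `Series.replace`. The mapping below is hypothetical: it contains only the car/taxi pair mentioned above and should be extended after inspecting the labels in the data.

```python
import pandas as pd

# Toy stand-in for the 'labels' column (made-up values)
labels = pd.Series(['car', 'taxi', 'walk', 'taxi', 'bus'])

# Hypothetical mapping from a label to the label it is merged into
MERGE_MAP = {'taxi': 'car'}

merged = labels.replace(MERGE_MAP)
print(merged.value_counts())
```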

With the new and improved df_single_modality:
- generate training and test sets
- train the three classifiers and determine if the accuracy has improved 
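
The two steps above can be sketched as a single loop over the classifiers. This is a minimal sketch on synthetic data from `make_classification`, which is a stand-in for the trajectory features and not real data. The `max_iter=1000` setting and the `random_state` values are assumptions added for convergence and reproducibility.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the eight trajectory features (not real data)
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

classifiers = {
    'Random Forest': RandomForestClassifier(n_estimators=18, random_state=0),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'SVM': SVC(),
}

# Fit each classifier, time the fit, and store (train score, test score)
results = {}
for name, clf in classifiers.items():
    t_start = time.perf_counter()
    clf.fit(X_train, Y_train)
    t_diff = time.perf_counter() - t_start
    results[name] = (clf.score(X_train, Y_train), clf.score(X_test, Y_test))
    print("trained {} in {:.2f} s.\t Score on training / test set: {:.2f} / {:.2f}".format(
        name, t_diff, *results[name]))
```

Keeping the scores in a dictionary makes it easy to compare them with the scores from before the labels were cleaned up.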

In [8]:
#generate the training and test sets

In [9]:
#Run the three classifiers again

In [10]:
#Evaluate the performance of each modality