# 10-Fold Cross Validation + Test set + SVM with linear and RBF kernels

To test how well our accuracy scores generalize, we perform 10-fold cross-validation on the training set `uci_train` as provided by the authors. We then evaluate the models once on the test set `uci_test`, also provided by the authors.
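As a small illustration of what `cv=10` means in scikit-learn: for a classifier with multiclass labels, an integer `cv` is resolved to an unshuffled `StratifiedKFold`. The sketch below checks this on synthetic data; `X_demo`, `y_demo` and the other names are placeholders, not the UCI data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (assumption: not the UCI data).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=0
)

demo_pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Integer cv: scikit-learn picks the splitter for us.
scores_int = cross_val_score(demo_pipeline, X_demo, y_demo, cv=10)

# Explicit splitter: stratified folds, no shuffling.
explicit_cv = StratifiedKFold(n_splits=10)
scores_explicit = cross_val_score(demo_pipeline, X_demo, y_demo, cv=explicit_cv)

# Both calls should produce the same ten fold accuracies.
print(np.allclose(scores_int, scores_explicit))
```

Because the scaler sits inside the pipeline, it is refit on the training part of each fold, so the validation part never influences the scaling.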

In [1]:
from utils import *  # assumed to provide uci_df, uci_train, uci_test and the pandas/numpy/scikit-learn names used below

In [2]:
# get the feature columns
feature_cols = [c for c in uci_df.columns if c not in ('subject','activity')]

# Train-test split
X_train = uci_train[feature_cols] # feature vectors only in train
y_train = uci_train['activity'] # labels in train
X_test = uci_test[feature_cols] # feature vectors only in test
y_test = uci_test['activity'] # labels in test


In [3]:
# Create pipeline for each model: 1. SVM with linear kernel, 2. SVM with RBF kernel, 3. Logistic Regression,  4. Random Forest, 5. KNN, 6. Decision Tree
pipeline_svm_linear = make_pipeline(StandardScaler(), SVC(kernel='linear', random_state=42))
pipeline_svm_rbf = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1, random_state=42))
pipeline_logistic = make_pipeline(StandardScaler(), LogisticRegression(random_state=42,max_iter= 1000))
pipeline_random_forest = make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=100, random_state=42))
pipeline_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)) # k = 5
pipeline_decision_tree = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=42))

# Perform 10-fold cross-validation for each model
cv_scores_svm_linear = cross_val_score(pipeline_svm_linear, X_train, y_train, cv=10)
cv_scores_svm_rbf = cross_val_score(pipeline_svm_rbf, X_train, y_train, cv=10)
cv_scores_logistic = cross_val_score(pipeline_logistic, X_train, y_train, cv=10)
cv_scores_random_forest = cross_val_score(pipeline_random_forest, X_train, y_train, cv=10)
cv_scores_knn = cross_val_score(pipeline_knn, X_train, y_train, cv=10)
cv_scores_decision_tree = cross_val_score(pipeline_decision_tree, X_train, y_train, cv=10)

# one time evaluation on the test set for each model
## one time svm linear
pipeline_svm_linear.fit(X_train, y_train)
test_acc_svm_linear = pipeline_svm_linear.score(X_test, y_test)
## one time svm rbf
pipeline_svm_rbf.fit(X_train, y_train)
test_acc_svm_rbf = pipeline_svm_rbf.score(X_test, y_test)
## one time logistic regression
pipeline_logistic.fit(X_train, y_train)
test_acc_logistic = pipeline_logistic.score(X_test, y_test)
## one time random forest
pipeline_random_forest.fit(X_train, y_train)
test_acc_random_forest = pipeline_random_forest.score(X_test, y_test)
## one time KNN
pipeline_knn.fit(X_train, y_train)
test_acc_knn = pipeline_knn.score(X_test, y_test)
## one time decision tree
pipeline_decision_tree.fit(X_train, y_train)
test_acc_decision_tree = pipeline_decision_tree.score(X_test, y_test)

# Report the results in a table format
results = pd.DataFrame({
    'Model': ['SVM Linear', 'SVM RBF', 'Logistic Regression', 'Random Forest', 'KNN', 'Decision Tree'],
    '10-Fold CV Accuracy (%)': [
        np.mean(cv_scores_svm_linear) * 100,
        np.mean(cv_scores_svm_rbf) * 100,
        np.mean(cv_scores_logistic) * 100,
        np.mean(cv_scores_random_forest) * 100,
        np.mean(cv_scores_knn) * 100,
        np.mean(cv_scores_decision_tree) * 100
    ],
    'Test Set Accuracy (%)': [
        test_acc_svm_linear * 100,
        test_acc_svm_rbf * 100,
        test_acc_logistic * 100,
        test_acc_random_forest * 100,
        test_acc_knn * 100,
        test_acc_decision_tree * 100
    ]
})
results = results.set_index('Model')   

# Display the results
results


| Model | 10-Fold CV Accuracy (%) | Test Set Accuracy (%) |
|---|---|---|
| SVM Linear | 94.519133 | 96.063794 |
| SVM RBF | 93.512385 | 95.181541 |
| Logistic Regression | 94.478298 | 95.486936 |
| Random Forest | 93.322057 | 92.772311 |
| KNN | 88.072057 | 88.361045 |
| Decision Tree | 85.895722 | 86.257211 |
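
The mean CV accuracy alone hides how much the ten folds disagree. A minimal sketch of the same comparison written as a loop, reporting the standard deviation across folds as well, is shown here. It runs on synthetic data with only three of the models; `X_demo`, `y_demo`, `models` and `cv_summary` are hypothetical names, not part of the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for X_train / y_train (assumption: not the UCI data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=8, n_classes=3, random_state=42
)

# A subset of the models compared above, keyed by display name.
models = {
    'SVM Linear': SVC(kernel='linear', random_state=42),
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'KNN': KNeighborsClassifier(n_neighbors=5),
}

rows = []
for name, model in models.items():
    # Scaling is part of the pipeline, so it is refit within each fold.
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X_demo, y_demo, cv=10)
    rows.append({
        'Model': name,
        'CV Mean (%)': scores.mean() * 100,
        'CV Std (%)': scores.std() * 100,
    })

cv_summary = pd.DataFrame(rows).set_index('Model')
print(cv_summary.round(2))
```

With the real data, `X_demo` and `y_demo` would be replaced by `X_train` and `y_train`, and the dictionary extended to all six models.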


# Linear Discriminant Analysis (LDA)

Recall that in the main notebook, we performed LDA on the full dataset to embed our data into a lower-dimensional space. We then trained a 1-NN classifier and validated its performance with leave-one-out (LOO) cross-validation.

Here, we apply LDA (with 5 components) only on the train set `X_train`, `y_train`, instead of the entire dataset. We continue to use a 1-NN classifier as our model. However, this time we validate it in two ways:
1. 10-Fold Cross validation on the train set. 
2. Validate on the test set `X_test`, `y_test`.


In [4]:
# Perform LDA on the train set
lda = LinearDiscriminantAnalysis(n_components=5).fit(X_train, y_train)  # LDA with 5 components
X_train_lda = lda.transform(X_train) # this is now the reduced feature space (5 components)

# build a pipeline that standardizes the LDA components and classifies with 1-NN (fitted further below)
pipeline_lda_1nn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors = 1))

# Perform 10-fold cross-validation
cv_scores_lda_1nn = cross_val_score(pipeline_lda_1nn, X_train_lda, y_train, cv=10)

# one time evaluation on the test set
X_test_lda = lda.transform(X_test)  # transform the test set using the same LDA model: the LDA axes are learned only from the training set.
pipeline_lda_1nn.fit(X_train_lda, y_train)  # fit the pipeline on the LDA-transformed train set
test_acc_lda_1nn = pipeline_lda_1nn.score(X_test_lda, y_test)

# Report the results
lda_results = pd.DataFrame({
    'Model': ['LDA 1-NN'],
    '10-Fold CV Accuracy (%)': [np.mean(cv_scores_lda_1nn) * 100],
    'Test Set Accuracy (%)': [test_acc_lda_1nn * 100]
})
lda_results = lda_results.set_index('Model')
lda_results




| Model | 10-Fold CV Accuracy (%) | Test Set Accuracy (%) |
|---|---|---|
| LDA 1-NN | 97.987097 | 95.554801 |
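
In the cell above, `lda` is fit on all of `X_train` before `cross_val_score` splits it into folds, so each validation fold (with its labels) has already contributed to the projection it is scored on. This may be one reason the CV accuracy (97.99%) sits above the test accuracy (95.55%), though that is not verified here. A minimal sketch of a variant that refits LDA inside every fold by placing it in the pipeline follows. It uses synthetic data with 6 classes, since LDA yields at most `n_classes - 1` components; `X_demo`, `y_demo` and the pipeline names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 6-class stand-in for X_train / y_train (assumption: not the UCI data).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=30, n_informative=10,
    n_classes=6, n_clusters_per_class=1, random_state=42
)

# Variant A: LDA fit once on all training rows, then CV on the projected data.
lda_full = LinearDiscriminantAnalysis(n_components=5).fit(X_demo, y_demo)
X_demo_lda = lda_full.transform(X_demo)
prefit_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1))
prefit_scores = cross_val_score(prefit_pipeline, X_demo_lda, y_demo, cv=10)

# Variant B: LDA is a pipeline step, so it is refit on the training part of each fold.
nested_pipeline = make_pipeline(
    LinearDiscriminantAnalysis(n_components=5),
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=1),
)
nested_scores = cross_val_score(nested_pipeline, X_demo, y_demo, cv=10)

print(f"LDA fit before CV: {prefit_scores.mean() * 100:.2f}%")
print(f"LDA fit per fold:  {nested_scores.mean() * 100:.2f}%")
```

On the real data, variant B would take `X_train` and `y_train` directly. The test-set evaluation in the cell above is unaffected, because there the LDA axes are learned from the training set only.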
