# Classifier Analysis

In parallel with the implementation of our CNN, we decided to train several classical classifiers on the same dataset. We will use the following classifiers:
- Logistic Regression
- Random Forest
- SVM
- KNN
- AdaBoost
- Ensemble of the above classifiers

We will analyze the performance of the trained classifiers on the test set, using the following metrics:
- Accuracy
- Precision
- Recall
- F1 Score
- ROC AUC
- Confusion Matrix
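The metrics listed above are all available in `sklearn.metrics`. The following is a minimal sketch of how they could be computed, not the exact evaluation code used for this report. It assumes binary labels and a classifier that exposes `predict_proba`; the helper name `evaluate` and the synthetic data from `make_classification` are illustrative stand-ins.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split


def evaluate(model, X_test, y_test):
    """Return the report's metrics for a fitted binary classifier (hypothetical helper)."""
    y_pred = model.predict(X_test)
    # ROC AUC needs a continuous score, so use the positive-class probability.
    y_score = model.predict_proba(X_test)[:, 1]
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_score),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }


# Synthetic stand-in for the real dataset (assumption).
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
metrics = evaluate(model, X_test, y_test)
print(metrics)
```

For a multiclass problem, `precision_score`, `recall_score`, and `f1_score` would additionally need an `average=` argument such as `"weighted"`.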

# Logistic Regression

For logistic regression, we ran into problems with the default solver, so we switched to the 'saga' solver, which is better suited to larger datasets.

The solver initially failed to converge, so we increased the maximum number of iterations to 5000. We first used GridSearchCV to find the best inverse regularization strength (C), then fixed that value and searched over the remaining hyperparameters.

We found that the best hyperparameters for the model were:
- C: 200
- penalty: l1
- solver: saga
- max_iter: 5000
- tol: 0.0001
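The search over C described above could look roughly like the sketch below. The candidate values in `param_grid`, the 3-fold cross-validation, the `StandardScaler` step, and the synthetic data are all assumptions made for illustration; only the solver, penalty, `max_iter`, and `tol` values come from this report.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset (assumption).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Scaling is an assumption here: 'saga' tends to converge faster on
# features that share a similar scale.
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="saga", penalty="l1", max_iter=5000, tol=1e-4),
)

# Hypothetical candidate values for the inverse regularization strength.
param_grid = {"logisticregression__C": [0.1, 1, 10, 100, 200]}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

best_c = search.best_params_["logisticregression__C"]
print(best_c, round(search.best_score_, 3))
```

The `logisticregression__C` key follows the `step__parameter` naming that `make_pipeline` generates from the lowercased class name.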

However, even with these hyperparameters, the model did not perform well on the test set. The accuracy was around 0.57, which is close to random guessing. The precision, recall, and F1 score were also around 0.57.


## Confusion matrix
![img](../classifiers/diagrams/log_reg/cm.png)

## ROC curve
![img](../classifiers/diagrams/log_reg/roc_curve.png)

## Precision-Recall curve
![img](../classifiers/diagrams/log_reg/pr_curve.png)


## Conclusion
Logistic Regression did not perform well on the dataset. The model was not able to learn the patterns in the data and performed almost the same as random guessing.
Because of this result, we decided not to include the model in the ensemble.

# Random Forest Classifier

For Random Forest Classifier, we were very satisfied with the results.
Again, GridSearchCV was used to find the best hyperparameters for the model.

Because of the computing limitations of our hardware, we searched iteratively: we started with a small set of hyperparameters, fixed the best values found, and then added further hyperparameters to the search on top of them.

We found that the best hyperparameters for the model were:
- n_estimators: 300
- max_depth: 50
- min_samples_split: 2
- min_samples_leaf: 1
- max_features: sqrt
- criterion: entropy
- bootstrap: False
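One way to run such an iterative search is sketched below: each stage tunes a small group of parameters while the winners from earlier stages stay fixed. The grouping into `stages`, the candidate values, and the synthetic data are hypothetical; the report does not state which parameters were tuned together.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real dataset (assumption).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Hypothetical stages: each grid is searched with earlier winners held fixed.
stages = [
    {"n_estimators": [50, 100], "max_depth": [10, 50]},
    {"criterion": ["gini", "entropy"], "max_features": ["sqrt", "log2"]},
    {"bootstrap": [True, False], "min_samples_leaf": [1, 2]},
]

best_params = {}
for grid in stages:
    # Start each stage from the best parameters found so far.
    base = RandomForestClassifier(random_state=0, **best_params)
    search = GridSearchCV(base, grid, cv=3, scoring="accuracy")
    search.fit(X, y)
    best_params.update(search.best_params_)

print(best_params)
```

With these example grids, the staged search evaluates 4 + 4 + 4 = 12 combinations instead of the 64 a full grid over all six parameters would need. The trade-off is that it can miss interactions between parameters tuned in different stages.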

With these hyperparameters, the model performed well on the test set. The accuracy was around 0.87, which is a good result. The precision, recall, and F1 score were also around 0.87.

## Confusion matrix
![img](../classifiers/diagrams/random_forest/cm.png)

## ROC curve
![img](../classifiers/diagrams/random_forest/roc_curve.png)

## Precision-Recall curve
![img](../classifiers/diagrams/random_forest/pr_curve.png)

## Conclusion
Random Forest Classifier performed best on the dataset. The model was able to learn the patterns in the data and reached an accuracy of around 0.87. We decided to include the model in the ensemble. Given its performance, it is a good classifier to pair with our CNN.

# K-Nearest Neighbors Classifier

For K-Nearest Neighbors Classifier, we found that fine-tuning the hyperparameters did not improve the model's performance. The accuracy stayed within the range of 0.69-0.71, which is a semi-satisfactory result, but as shown above, Random Forest is a better classifier for this dataset.

Used hyperparameters:
- n_neighbors: 3
- weights: distance

Other hyperparameters were left at their defaults, which appeared to perform best.
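As a minimal sketch, the KNN configuration listed above maps onto scikit-learn as follows. The synthetic data and the train/test split are assumptions standing in for the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real dataset (assumption).
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# weights="distance" makes closer neighbours count more: each of the
# 3 neighbours votes with a weight equal to the inverse of its distance.
knn = KNeighborsClassifier(n_neighbors=3, weights="distance")
knn.fit(X_train, y_train)

knn_accuracy = knn.score(X_test, y_test)
print(round(knn_accuracy, 3))
```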

## Confusion matrix
![img](../classifiers/diagrams/knn/cm.png)

## ROC curve
![img](../classifiers/diagrams/knn/roc_curve.png)

## Precision-Recall curve
![img](../classifiers/diagrams/knn/pr_curve.png)

## Conclusion
K-Nearest Neighbors Classifier gave mediocre results on this dataset. The model captured the patterns in the data only partially and reached an accuracy of around 0.70. However, we decided to include the model in the ensemble as it performed better than Logistic Regression.

# Support Vector Machine Classifier

Support Vector Machine Classifier also proved hard to fine-tune. We used GridSearchCV to search for the best hyperparameters, but the model's accuracy stayed at around 0.70.

Used hyperparameters:
- C: 100
- kernel: rbf
- gamma: 0.01

Other hyperparameters were left at their defaults, which appeared to perform best.
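A grid search over `C` and `gamma` for an RBF-kernel SVM could be set up roughly as in the sketch below. The candidate values, the `StandardScaler` step, the 3-fold cross-validation, and the synthetic data are assumptions for illustration; the report only states the final values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real dataset (assumption).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Scaling is an assumption: the RBF kernel depends on distances between
# samples, so features on very different scales can dominate it.
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Hypothetical candidate values around the ones reported above.
param_grid = {
    "svc__C": [1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1],
}

svm_search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
svm_search.fit(X, y)
print(svm_search.best_params_, round(svm_search.best_score_, 3))
```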

## Confusion matrix
![img](../classifiers/diagrams/svm/cm.png)

## ROC curve
![img](../classifiers/diagrams/svm/roc_curve.png)

## Precision-Recall curve
![img](../classifiers/diagrams/svm/pr_curve.png)

## Conclusion
Support Vector Machine Classifier gave mediocre results on this dataset. The model captured the patterns in the data only partially and reached an accuracy of around 0.70. However, we decided to include the model in the ensemble as it performed better than Logistic Regression.

# AdaBoost Classifier

For AdaBoost Classifier, we were disappointed with the results: the classifier achieved an accuracy of only 0.49 on the test set, which is roughly equivalent to guessing the class at random.

Used hyperparameters:
- n_estimators: 200
- learning_rate: 0.5
- algorithm: SAMME

Other parameters were left at their defaults.
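The AdaBoost configuration above could be written as in the sketch below. The synthetic data is an assumption. Newer scikit-learn releases deprecate the `algorithm` argument of `AdaBoostClassifier`, so the sketch passes it only when the installed version still accepts it; that guard is our addition and not part of the original setup.

```python
import inspect

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (assumption).
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

ada_params = {"n_estimators": 200, "learning_rate": 0.5, "random_state": 0}

# Pass algorithm="SAMME" only if this scikit-learn version still has the
# parameter; on versions where it is deprecated, a warning may be emitted.
if "algorithm" in inspect.signature(AdaBoostClassifier.__init__).parameters:
    ada_params["algorithm"] = "SAMME"

ada = AdaBoostClassifier(**ada_params).fit(X_train, y_train)
ada_accuracy = ada.score(X_test, y_test)
print(round(ada_accuracy, 3))
```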

## Confusion matrix
![img](../classifiers/diagrams/adaboost/cm.png)

## ROC curve
![img](../classifiers/diagrams/adaboost/roc_curve.png)

## Precision-Recall curve
![img](../classifiers/diagrams/adaboost/pr_curve.png)

## Conclusion
There is not much to say about AdaBoost Classifier for this dataset. Due to the low accuracy, we decided not to include the model in the ensemble.

# Ensemble Classifier

For the ensemble classifier, we used the following classifiers:
- Random Forest
- K-Nearest Neighbors
- Support Vector Machine

We used the VotingClassifier from sklearn to combine the predictions of the classifiers. We used the 'soft' voting strategy, which predicts the class label based on the argmax of the sums of the predicted probabilities.

We also used the VotingClassifier with 'hard' voting strategy, which predicts the class label based on the majority vote.
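The two strategies can disagree on the same sample. The small example below uses made-up probabilities from three hypothetical classifiers for a single sample to show the difference; the numbers are illustrative and not taken from our models.

```python
import numpy as np

# Rows: three hypothetical classifiers; columns: P(class 0), P(class 1).
probas = np.array([
    [0.90, 0.10],  # very confident in class 0
    [0.40, 0.60],  # leans towards class 1
    [0.45, 0.55],  # leans towards class 1
])

# Soft voting: sum the probabilities per class, then take the argmax.
soft_label = int(np.argmax(probas.sum(axis=0)))

# Hard voting: each classifier casts one vote for its own argmax.
votes = probas.argmax(axis=1)
hard_label = int(np.bincount(votes).argmax())

print(soft_label, hard_label)  # prints "0 1"
```

Here the summed probabilities are 1.75 for class 0 and 1.25 for class 1, so soft voting picks class 0, while two of the three individual votes go to class 1, so hard voting picks class 1.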

However, because two of the three classifiers performed only moderately, the ensemble did not outperform Random Forest on its own. The accuracy was around 0.80 and 0.79 for the 'soft' and 'hard' voting strategies, respectively.
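A sketch of how both ensembles could be built with `VotingClassifier` is given below. The base estimators reuse the KNN and SVM hyperparameters reported earlier, the Random Forest is simplified, and the synthetic data is an assumption. `probability=True` on `SVC` is our assumption as well: soft voting needs `predict_proba`, which `SVC` does not provide by default.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the real dataset (assumption).
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

estimators = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=3, weights="distance")),
    # probability=True lets SVC expose predict_proba for soft voting.
    ("svm", SVC(C=100, kernel="rbf", gamma=0.01, probability=True, random_state=0)),
]

scores = {}
for voting in ("soft", "hard"):
    # VotingClassifier clones the base estimators when fitting,
    # so the same list can be reused for both strategies.
    ensemble = VotingClassifier(estimators=estimators, voting=voting)
    ensemble.fit(X_train, y_train)
    scores[voting] = ensemble.score(X_test, y_test)

print(scores)
```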

## Confusion matrix
![img](../classifiers/diagrams/ensemble/cm.png)

## ROC curve
![img](../classifiers/diagrams/ensemble/roc_curve.png)

## Precision-Recall curve
![img](../classifiers/diagrams/ensemble/pr_curve.png)

## Conclusion
The ensemble classifier did not improve on our best accuracy. Better base classifiers are needed to improve the results, and it is probably worth considering removing KNN and SVM from the ensemble.

This shows that the Random Forest Classifier is far ahead of the other classifiers on this dataset.