<a href="https://colab.research.google.com/github/carlosfmorenog/CMM536/blob/master/CMM536_Topic_6/CMM536_T6_Lab_Solved.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic 6 Laboratory (Solved)

In this activity, you will use imbalanced datasets from the [Keel](https://sci2s.ugr.es/keel/imbalanced.php) public repository to properly evaluate the performance metrics of two popular classifiers.

## Performance Evaluation in Binary Datasets

### Making sense of the necessary data

We will use the [glass](https://sci2s.ugr.es/keel/dataset/data/imbalanced/glass-names.txt) dataset. It has 214 samples of 7 different types of glass. Each glass sample has 9 features: the refractive index (RI) and the content of eight elements that compose the glass (Na, Mg, Al, Si, K, Ca, Ba and Fe).

Since we will start with binary datasets, we will first use the [glass0](https://sci2s.ugr.es/keel/keel-dataset/datasets/imbalanced/imb_IRlowerThan9/names/glass0-names.txt) version of the dataset. The only difference from the original is that one glass class, called `class 0`, is compared against all the other glass classes combined.

Click [here](https://sci2s.ugr.es/keel/dataset.php?cod=141) to visit the *glass0* description website. If you go to the bottom (where it says **Files and additional references**) you can download the **complete dataset**. Unzip it to get a file with the extension *.dat*. **OR**, you can simply download `glass0.dat` from Moodle!

### Loading the Data

To import the dataset **directly from the source** to Colab, run the following cell. You are using Linux commands to get and unzip the file!

In [None]:
## Run this cell to import glass0.dat to Colab
!wget -O data.zip https://sci2s.ugr.es/keel/dataset/data/imbalanced/glass0.zip
!unzip data.zip

--2025-02-18 15:59:06--  https://sci2s.ugr.es/keel/dataset/data/imbalanced/glass0.zip
Resolving sci2s.ugr.es (sci2s.ugr.es)... 150.214.190.154
Connecting to sci2s.ugr.es (sci2s.ugr.es)|150.214.190.154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5490 (5.4K) [application/zip]
Saving to: ‘data.zip’


2025-02-18 15:59:06 (235 MB/s) - ‘data.zip’ saved [5490/5490]

Archive:  data.zip
  inflating: glass0.dat              


In [None]:
# Load data and target
import numpy as np

data = np.genfromtxt('/content/glass0.dat',
                     usecols=range(9), # this only brings the first nine columns of the dataset
                     skip_header=14, # The first 14 lines of the .dat contain a description
                     delimiter=',')


target_names = np.genfromtxt('/content/glass0.dat',
                     usecols=range(9,10), # This brings the last column of the dataset, which has the class
                     dtype = None,
                     encoding = None, # This helps us get the strings in a numpy array
                     skip_header=14,
                     delimiter=',')

print(data,data.shape)
print(target_names,target_names.shape)

[[ 1.51588824 12.87795     3.43036    ...  8.04468     0.
   0.1224    ]
 [ 1.5176423  12.9777      3.53812    ...  8.52888     0.
   0.        ]
 [ 1.52212996 14.20795     3.82099    ...  9.5726      0.
   0.        ]
 ...
 [ 1.51837126 14.321       3.25974    ...  5.78508     1.62855
   0.        ]
 [ 1.51657164 14.7998      0.         ...  8.2814      1.71045
   0.        ]
 [ 1.51732338 14.95275     0.         ...  8.61496     1.5498
   0.        ]] (214, 9)
[' positive' ' positive' ' positive' ' positive' ' positive' ' positive'
 ' positive' ' positive' ' positive' ' positive' ' positive' ' positive'
 ' positive' ' positive' ' positive' ' positive' ' positive' ' positive'
 ' positive' ' positive' ' positive' ' positive' ' positive' ' positive'
 ' positive' ' positive' ' positive' ' positive' ' positive' ' positive'
 ' positive' ' positive' ' positive' ' positive' ' positive' ' positive'
 ' positive' ' positive' ' positive' ' positive' ' positive' ' positive'
 ' positive' ' positiv

Notice that once again, we will store the data and the target in separate variables. Moreover, we will generate a new target variable called `target` with `positive=0` and `negative=1` using the **list comprehension** technique:

In [None]:
# Encode the class names as numbers: ' positive' -> 0, anything else -> 1
target = [0 if i == ' positive' else 1 for i in target_names]
target = np.array(target)
print(target,target.shape)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1] (214,)


Notice that we chose 0 to be the positive and 1 to be the negative since the [glass0 documentation](https://sci2s.ugr.es/keel/dataset.php?cod=141) said so!
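The list comprehension above depends on the leading space in `' positive'`. As a minimal sketch (the label array below is made up for illustration and is not read from `glass0.dat`), the same encoding can be done in a vectorised way after stripping whitespace:

```python
import numpy as np

# Hypothetical labels, mimicking the leading space found in glass0.dat
target_names = np.array([' positive', ' negative', ' positive', ' negative'])

# Strip whitespace first so the comparison does not depend on the leading space
clean_names = np.char.strip(target_names)

# Vectorised equivalent of the list comprehension: positive -> 0, everything else -> 1
target = np.where(clean_names == 'positive', 0, 1)
print(target, target.shape)
```

Both approaches produce the same array; the vectorised one is simply less sensitive to formatting details in the file.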

### Training the Classifiers

Now we need to train a supervised learning model to evaluate. In this case we will use two popular classification models so that we can compare which is better for this dataset: **Support Vector Machine (SVM)** vs. **Random Forests (RF)**:

First we need to split our dataset into training and testing sets (80% train, 20% test). You have already done this in the past!

In [None]:
## Use this cell to split your dataset into training and testing w/ stratification
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data,target,stratify=target,test_size=0.2)
print('Number of positive and negative samples in the training set: ',np.count_nonzero(y_train == 0),np.count_nonzero(y_train == 1))
print('Number of positive and negative samples in the test set: ',np.count_nonzero(y_test == 0),np.count_nonzero(y_test == 1))

Number of positive and negative samples in the training set:  56 115
Number of positive and negative samples in the test set:  14 29
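Because the split above is random, the later results can change from run to run. The sketch below uses a synthetic target with the same class counts as glass0 (70 positive, 144 negative) and placeholder features; `random_state=0` is an arbitrary value chosen for illustration. It shows that stratification keeps the class ratio in both sets and that fixing the seed makes the split repeatable:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class counts as glass0 (70 positive, 144 negative)
target = np.array([0] * 70 + [1] * 144)
data = np.arange(len(target)).reshape(-1, 1)  # placeholder features

# Any fixed integer for random_state makes the split repeatable
split_a = train_test_split(data, target, stratify=target, test_size=0.2, random_state=0)
split_b = train_test_split(data, target, stratify=target, test_size=0.2, random_state=0)

y_train, y_test = split_a[2], split_a[3]
print('Train counts (class 0, class 1):', np.bincount(y_train))
print('Test counts (class 0, class 1):', np.bincount(y_test))
print('Same test set both times:', np.array_equal(split_a[1], split_b[1]))
```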


Now train an SVM model called `model_svm` using the training data and predict the test data. Store the fitted model in a variable called `clf_svm` and the prediction results in a variable called `y_svm`.

In [None]:
## Use this cell to train a SVM model to predict the labels of the test data
from sklearn.svm import SVC
model_svm = SVC(kernel='linear')
clf_svm = model_svm.fit(X_train,y_train)
y_svm = model_svm.predict(X_test)
print(y_svm,y_svm.shape)

[1 0 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1
 0 1 0 1 0 1] (43,)


Likewise, train an RF model called `model_rf` (using the `RandomForestClassifier()` class contained in the `sklearn.ensemble` package) on the same training data and predict the test data. Store the results in variables called `clf_rf` and `y_rf`.

In [None]:
## Use this cell to train a RF model to predict the labels of the test data
from sklearn.ensemble import RandomForestClassifier
model_rf = RandomForestClassifier(n_estimators = 1000, random_state = 42)
clf_rf = model_rf.fit(X_train, y_train)
y_rf = model_rf.predict(X_test)
print(y_rf,y_rf.shape)

[1 0 0 0 0 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1
 0 1 0 1 0 1] (43,)


Both `y_svm` and `y_rf` should be *numpy* arrays of shape (43,).

### Performance Evaluation Metrics

In the week 3 lab, we saw that once we predict the values for the test set, we can compare the `y_test` vector with the predicted vectors (in our case `y_svm` and `y_rf`) to obtain the results. If you recall, it was quite tricky to do this as we had more classes in the vector! Therefore, this time we will do it slightly differently...

#### Accuracy

By now you should know that everything in Python can be imported! Performance metrics can be imported as well. We will first use accuracy (although it should be very clear by now that this one is not suitable for imbalanced datasets).

In [None]:
from sklearn.metrics import accuracy_score
# accuracy_score expects the true labels first, then the predictions
print('Accuracy of SVM: ', accuracy_score(y_test, y_svm))
print('Accuracy of RF: ', accuracy_score(y_test, y_rf))

Accuracy of SVM:  0.7906976744186046
Accuracy of RF:  0.8372093023255814


**Which is the best classifier in terms of accuracy?**

**Answer:** RF, as it has the higher accuracy (0.837 vs. 0.791). This agrees with [some sources](https://elitedatascience.com/imbalanced-classes), which suggest that decision-tree-based models learn better from imbalanced data than vector-based models.
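To see why accuracy alone can be misleading on imbalanced data, consider a "classifier" that always predicts the majority class. The sketch below uses synthetic labels with the same class counts as the test split above (14 positive, 29 negative); it is an illustration, not part of the original lab:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic test labels with the same class counts as the test split above
y_true = np.array([0] * 14 + [1] * 29)

# A "classifier" that ignores its input and always predicts the majority class
y_majority = np.ones_like(y_true)

baseline_acc = accuracy_score(y_true, y_majority)
print('Accuracy of always predicting class 1:', baseline_acc)

# Rows are true classes, columns are predicted classes (order: 0, 1)
baseline_cm = confusion_matrix(y_true, y_majority)
print(baseline_cm)
```

This baseline reaches 29/43 (about 0.67) accuracy without ever identifying a single positive sample, which is why the accuracies above should be read against that reference rather than against 0.5.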

#### AUC-ROC

In this week's lecture we saw that one of the most robust performance metrics in the *imbalanced* world is the area under the receiver operating characteristic curve (or AUC-ROC for short). To get this metric, we need to retrain our models, this time in a probabilistic way!

This means that by setting `probability=True` when creating our SVM model (RF doesn't need it) and by changing `predict` to `predict_proba`, we no longer obtain a hard label for each sample, but instead the probability (a value between 0 and 1) of the sample belonging to class 0 and to class 1 respectively.
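The difference between the two kinds of output can be sketched on synthetic data (generated with `make_classification` purely for illustration; the sizes and seeds are arbitrary assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary data, used here only to illustrate the output format
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

labels = clf.predict(X[:5])        # hard labels, shape (5,)
proba = clf.predict_proba(X[:5])   # one column per class, shape (5, 2)

print(clf.classes_)        # column order of predict_proba
print(labels)
print(proba)
print(proba.sum(axis=1))   # each row sums to 1
```

The columns of `predict_proba` follow the order of `clf.classes_`, so with labels 0 and 1 the first column refers to class 0 and the second to class 1.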

To keep the previous models/classes/predictions in our workspace, train new (probabilistic) models in the following cell and store the trained models as `model_svm_proba` and `model_rf_proba`, the classifiers as `clf_svm_proba` and `clf_rf_proba`, and the outputs as `y_svm_proba` and `y_rf_proba`:

In [None]:
## Use this cell to re-train the models in a probabilistic way
model_svm_proba = SVC(kernel='linear', probability=True)
clf_svm_proba = model_svm_proba.fit(X_train,y_train)
y_svm_proba = clf_svm_proba.predict_proba(X_test)

model_rf_proba = RandomForestClassifier(n_estimators = 1000, random_state = 42)
clf_rf_proba = model_rf_proba.fit(X_train, y_train)
y_rf_proba = clf_rf_proba.predict_proba(X_test)

print(y_svm_proba,y_svm_proba.shape)
print(y_rf_proba,y_rf_proba.shape)

[[1.15009056e-06 9.99998850e-01]
 [7.15473282e-01 2.84526718e-01]
 [8.05246444e-01 1.94753556e-01]
 [3.87624827e-01 6.12375173e-01]
 [4.56021534e-01 5.43978466e-01]
 [3.53214303e-01 6.46785697e-01]
 [2.50324793e-01 7.49675207e-01]
 [8.13670943e-06 9.99991863e-01]
 [3.15542156e-01 6.84457844e-01]
 [4.48333398e-01 5.51666602e-01]
 [3.18470282e-01 6.81529718e-01]
 [3.94831976e-01 6.05168024e-01]
 [4.67455823e-02 9.53254418e-01]
 [7.45703186e-01 2.54296814e-01]
 [2.86412734e-01 7.13587266e-01]
 [6.67364519e-06 9.99993326e-01]
 [6.57218899e-02 9.34278110e-01]
 [8.76522627e-01 1.23477373e-01]
 [2.97443861e-02 9.70255614e-01]
 [1.65983082e-06 9.99998340e-01]
 [1.88105199e-02 9.81189480e-01]
 [2.14956458e-07 9.99999785e-01]
 [5.13394744e-01 4.86605256e-01]
 [4.21130269e-01 5.78869731e-01]
 [7.22082729e-06 9.99992779e-01]
 [1.68669602e-01 8.31330398e-01]
 [4.30664685e-01 5.69335315e-01]
 [2.69794430e-01 7.30205570e-01]
 [4.76759981e-01 5.23240019e-01]
 [8.41027402e-01 1.58972598e-01]
 [2.840782

For `y_svm_proba` and `y_rf_proba` you should get *numpy* arrays of shape (43, 2), as now each output contains the probability of each sample to be 0 (first value) or 1 (second value).

Now we will evaluate the AUC-ROC. We will **not** plot the curve here; instead, we will obtain the numeric area under the curve (a value between 0 and 1, the larger the better).
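For reference, the points of the curve can still be inspected without plotting anything. The following is a minimal sketch on toy labels and scores (not the lab data), where the scores play the role of the predicted probability of class 1:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, roc_auc_score

# Toy labels and scores (scores play the role of P(class = 1))
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Each distinct score acts as a candidate decision threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(thresholds)
print(fpr, tpr)

# Integrating the curve gives the same number as roc_auc_score
area_from_curve = auc(fpr, tpr)
area_direct = roc_auc_score(y_true, y_score)
print(area_from_curve, area_direct)
```

The predicted probabilities themselves supply the thresholds, which is why a single call to `roc_auc_score` is enough to get the area.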

To obtain the AUC values for our models, you can run the following cell (provided that you have correctly calculated `y_svm_proba` and `y_rf_proba` in the previous step):

In [None]:
## Testing AUC-ROC considering the probabilities of the positive class (i.e. 0)
from sklearn.metrics import roc_auc_score

# First, we need to extract only the probabilities of classifying 0 (first column of our numpy arrays)
# We can do this once again by means of list comprehension
y_svm_proba_0 = [p[0] for p in y_svm_proba]
y_rf_proba_0 = [p[0] for p in y_rf_proba]

print('AUC for SVM: ',roc_auc_score(y_test, y_svm_proba_0))
print('AUC for RF: ',roc_auc_score(y_test, y_rf_proba_0))

AUC for SVM:  0.20197044334975373
AUC for RF:  0.10960591133004927


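If you are getting AUC-ROC values **smaller than 0.5**, you can simply invert them (i.e. compute 1 - AUC, or equivalently get the AUC-ROC using the probabilities of the rest/negative class) to get the actual AUC-ROC value for this classifier. This happens here because `roc_auc_score` expects the scores of the class with the greater label (1), while we passed the probabilities of class 0.

A comprehensive explanation of why a meaningful AUC-ROC should not be smaller than 0.5 can be found [here](https://www.datascienceblog.net/post/machine-learning/interpreting-roc-curves-auc/).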

In [None]:
## Use this cell to test AUC-ROC considering the probabilities of the rest/negative class (i.e. 1)
from sklearn.metrics import roc_auc_score

# First, we need to extract only the probabilities of classifying 1 (second column of our numpy arrays)
# We can do this once again by means of list comprehension
y_svm_proba_1 = [p[1] for p in y_svm_proba]
y_rf_proba_1 = [p[1] for p in y_rf_proba]

print('AUC for SVM: ',roc_auc_score(y_test, y_svm_proba_1))
print('AUC for RF: ',roc_auc_score(y_test, y_rf_proba_1))

AUC for SVM:  0.7980295566502463
AUC for RF:  0.8903940886699507
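The two sets of results are complementary: for each model, the two AUC values add up to 1. A small sketch with made-up labels and probabilities (assumed values, not taken from the models above) illustrates the relationship, and also shows that flipping the labels together with the scores recovers the same AUC:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and class-1 probabilities, for illustration only
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 1])
proba_1 = np.array([0.2, 0.6, 0.7, 0.9, 0.4, 0.1, 0.8, 0.55])
proba_0 = 1.0 - proba_1

# Scores of class 1 against the original labels (what roc_auc_score expects)
auc_with_class_1 = roc_auc_score(y_true, proba_1)

# Scores of class 0 against the original labels: the ranking is reversed
auc_with_class_0 = roc_auc_score(y_true, proba_0)

# Scores of class 0 against labels where class 0 is marked as the positive (1)
auc_flipped = roc_auc_score((y_true == 0).astype(int), proba_0)

print(auc_with_class_1, auc_with_class_0, auc_flipped)
```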


**Is the classifier that was better in accuracy still better in AUC?**

**Answer:** Not necessarily in general, but in this case it is: RF also has the higher AUC (0.890 vs. 0.798).

### Cross Validation

Let's check that the previous results were not the product of chance or a lucky data split! To do so, we will use cross validation with $k=5$ folds.

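Using the `KFold()` class contained in the `sklearn.model_selection` package you can split a dataset (in this case, the original one contained in the `data` variable) into folds. Its `get_n_splits()` method returns the number of folds, while the `split()` method (used further below) generates the actual partitions: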

In [None]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=5)
kf.get_n_splits(data)
print(kf)

KFold(n_splits=5, random_state=None, shuffle=False)


So far you only get a `KFold` object. However, we can put this object inside a for loop to create different training and testing sets in each iteration:

In [None]:
i=1
for train_index, test_index in kf.split(data):
    print('\nFold '+str(i)+':')
    print('TRAIN INDEXES:', train_index)
    print('TEST INDEXES:', test_index)
    X_train, X_test = data[train_index], data[test_index]
    y_train, y_test = target[train_index], target[test_index]
    i+=1


Fold 1:
TRAIN INDEXES: [ 43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60
  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78
  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96
  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114
 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132
 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150
 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168
 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186
 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204
 205 206 207 208 209 210 211 212 213]
TEST INDEXES: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42]

Fold 2:
TRAIN INDEXES: [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  

Notice that what is being printed in the cell above are not the samples, but the indexes from where the samples will be taken in each fold!

Now use the cell below to implement the **stratified** version of K-fold. Remember that stratification will ensure that both classes are present in (approximately) the same proportion in each set for every fold.

In [None]:
## Use this cell to do a KFolds split in a stratified way
# Hint: Use StratifiedKFold
from sklearn.model_selection import StratifiedKFold
kf_strat = StratifiedKFold(n_splits=5)
kf_strat.get_n_splits(data, target)
print(kf_strat)

StratifiedKFold(n_splits=5, random_state=None, shuffle=False)
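The effect of stratification can be checked by counting the classes in each test fold. The sketch below assumes a synthetic target ordered like the `target` array printed earlier (70 positives followed by 144 negatives) and placeholder features; `fold_class_counts` is a hypothetical helper written for this illustration:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic target ordered like glass0: 70 positives (0) followed by 144 negatives (1)
target = np.array([0] * 70 + [1] * 144)
data = np.zeros((len(target), 1))  # placeholder features; only the labels matter here

def fold_class_counts(splitter):
    """Return the [class 0, class 1] counts of the test set in every fold."""
    return [np.bincount(target[test_idx], minlength=2).tolist()
            for _, test_idx in splitter.split(data, target)]

plain_counts = fold_class_counts(KFold(n_splits=5))
strat_counts = fold_class_counts(StratifiedKFold(n_splits=5))

print('KFold test folds:          ', plain_counts)
print('StratifiedKFold test folds:', strat_counts)
```

Without shuffling, plain `KFold` on class-ordered data produces test folds containing a single class, whereas the stratified version keeps the class ratio in every fold.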


To evaluate models using cross validation you **don't** need all of this! In fact, you only use this if you want to store your folds or if you want to implement more complex models.

As a matter of fact it is easier to simply import the `cross_validate` function from the `sklearn.model_selection` package and get the scores as follows:

In [None]:
# Cross validating the original data (the function will do it all)
from sklearn.model_selection import cross_validate

scores_svm = cross_validate(model_svm, data, target, cv=5)
print('SVM cross-validated scores: ', scores_svm)
scores_rf = cross_validate(model_rf, data, target, cv=5)
print('RF cross-validated scores: ', scores_rf)

SVM cross-validated scores:  {'fit_time': array([0.01265383, 0.008919  , 0.01169252, 0.01165199, 0.0165143 ]), 'score_time': array([0.00155091, 0.00315356, 0.00368977, 0.00476646, 0.00163722]), 'test_score': array([0.65116279, 0.72093023, 0.74418605, 0.79069767, 0.76190476])}
RF cross-validated scores:  {'fit_time': array([5.22920299, 2.48337793, 1.48193789, 1.58507061, 2.29550672]), 'score_time': array([0.20600367, 0.07307172, 0.06779122, 0.10320139, 0.06229401]), 'test_score': array([0.90697674, 0.88372093, 0.86046512, 0.88372093, 0.85714286])}


**From these metrics do you notice any disadvantage of RF?**

**Answer:** RF takes longer to fit!
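The dictionaries returned by `cross_validate` are easier to compare once summarised. The following is a sketch on synthetic data (generated with `make_classification`; the sample size, class weights and number of trees are assumptions chosen to keep the example fast), reporting mean accuracy, its standard deviation and the mean fit time per model:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

# Synthetic imbalanced data with 9 features, standing in for glass0
X, y = make_classification(n_samples=200, n_features=9,
                           weights=[0.33, 0.67], random_state=0)

models = {
    'SVM': SVC(kernel='linear'),
    'RF': RandomForestClassifier(n_estimators=100, random_state=42),
}

summary = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=5)
    summary[name] = {
        'mean_accuracy': np.mean(scores['test_score']),
        'std_accuracy': np.std(scores['test_score']),
        'mean_fit_time': np.mean(scores['fit_time']),
    }
    print(name, summary[name])
```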

In fact, these are not the only metrics you can get from cross validation! Using the `cross_val_score` function from the `sklearn.model_selection` package you can get (almost) all of the measures that we discussed in class. You can check the [scoring documentation](https://scikit-learn.org/stable/modules/model_evaluation.html) to see what you can calculate, and below you will find some examples:

In [None]:
# First we import cross_val_score
from sklearn.model_selection import cross_val_score

In [None]:
# Evaluating Accuracy for each fold
print('Accuracy for SVM: ',cross_val_score(model_svm, data, target, cv=5, scoring = 'accuracy'))
print('Accuracy for RF: ',cross_val_score(model_rf, data, target, cv=5, scoring = 'accuracy'))

Accuracy for SVM:  [0.65116279 0.72093023 0.74418605 0.79069767 0.76190476]
Accuracy for RF:  [0.90697674 0.88372093 0.86046512 0.88372093 0.85714286]


In [None]:
# Evaluating Mean Accuracy for all folds
print('Mean Accuracy for SVM: ',np.mean(cross_val_score(model_svm, data, target, cv=5, scoring = 'accuracy')))
print('Mean Accuracy for RF: ',np.mean(cross_val_score(model_rf, data, target, cv=5, scoring = 'accuracy')))

Mean Accuracy for SVM:  0.7337763012181617
Mean Accuracy for RF:  0.8784053156146179


In [None]:
# Evaluating Precision, Recall and F1-score (both by fold and average)
print('Precision for SVM: ',cross_val_score(model_svm, data, target, cv=5, scoring = 'precision'))
print('Precision for RF: ',cross_val_score(model_rf, data, target, cv=5, scoring = 'precision'))
print('Recall for SVM: ',cross_val_score(model_svm, data, target, cv=5, scoring = 'recall'))
print('Recall for RF: ',cross_val_score(model_rf, data, target, cv=5, scoring = 'recall'))
print('F1-Score for SVM: ',cross_val_score(model_svm, data, target, cv=5, scoring = 'f1'))
print('F1-Score for RF: ',cross_val_score(model_rf, data, target, cv=5, scoring = 'f1'))

print('Mean Precision for SVM: ',np.mean(cross_val_score(model_svm, data, target, cv=5, scoring = 'precision')))
print('Mean Precision for RF: ',np.mean(cross_val_score(model_rf, data, target, cv=5, scoring = 'precision')))
print('Mean Recall for SVM: ',np.mean(cross_val_score(model_svm, data, target, cv=5, scoring = 'recall')))
print('Mean Recall for RF: ',np.mean(cross_val_score(model_rf, data, target, cv=5, scoring = 'recall')))
print('Mean F1-Score for SVM: ',np.mean(cross_val_score(model_svm, data, target, cv=5, scoring = 'f1')))
print('Mean F1-Score for RF: ',np.mean(cross_val_score(model_rf, data, target, cv=5, scoring = 'f1')))

Precision for SVM:  [0.75       0.84       0.8        0.79411765 0.73684211]
Precision for RF:  [0.93103448 0.92857143 0.96       0.9        0.84375   ]
Recall for SVM:  [0.72413793 0.72413793 0.82758621 0.93103448 1.        ]
Recall for RF:  [0.93103448 0.89655172 0.82758621 0.93103448 0.96428571]
F1-Score for SVM:  [0.73684211 0.77777778 0.81355932 0.85714286 0.84848485]
F1-Score for RF:  [0.93103448 0.9122807  0.88888889 0.91525424 0.9       ]
Mean Precision for SVM:  0.7841919504643962
Mean Precision for RF:  0.9126711822660099
Mean Recall for SVM:  0.8413793103448276
Mean Recall for RF:  0.9100985221674875
Mean F1-Score for SVM:  0.8067613821405079
Mean F1-Score for RF:  0.9094916621380064
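One caveat: the string scorers `'precision'`, `'recall'` and `'f1'` treat label 1 as the positive class by default, whereas in this lab label 1 encodes the *negative* class. The sketch below shows one possible way to score class 0 instead, using `make_scorer` with `pos_label=0`, and to compute several metrics in a single `cross_validate` call. It runs on synthetic data; the scorer names in the dictionary are arbitrary choices for this illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, precision_score, recall_score, f1_score
from sklearn.model_selection import cross_validate

# Synthetic imbalanced data with 9 features, standing in for glass0
X, y = make_classification(n_samples=200, n_features=9,
                           weights=[0.33, 0.67], random_state=0)

# Scorers that treat label 0 as the positive class
scoring = {
    'precision_0': make_scorer(precision_score, pos_label=0),
    'recall_0': make_scorer(recall_score, pos_label=0),
    'f1_0': make_scorer(f1_score, pos_label=0),
}

model = RandomForestClassifier(n_estimators=100, random_state=42)

# One call fits each fold once and evaluates every scorer on it
results = cross_validate(model, X, y, cv=5, scoring=scoring)

for name in scoring:
    fold_scores = results['test_' + name]
    print(name, fold_scores, fold_scores.mean())
```

Passing a dictionary of scorers also avoids refitting the models once per metric, as happens when `cross_val_score` is called repeatedly.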


Notice that in the Keel repository, there is already a version of the glass dataset called **5FCV**. This means that the authors of this repository already provide a "proper" partition of this dataset to test cross validation. If you download this version, you will get 10 *.dat* files instead of 1. **Do you know why?**

**Answer:** Because there are 5 pairs of files, one for each fold of train and test data.

## Bonus

In [None]:
## Run this cell to import the 5-fold version of glass0 to Colab
!wget -O data_cv.zip https://sci2s.ugr.es/keel/dataset/data/imbalanced/glass0-5-fold.zip
!unzip data_cv.zip

--2025-02-18 16:00:50--  https://sci2s.ugr.es/keel/dataset/data/imbalanced/glass0-5-fold.zip
Resolving sci2s.ugr.es (sci2s.ugr.es)... 150.214.190.154
Connecting to sci2s.ugr.es (sci2s.ugr.es)|150.214.190.154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 30292 (30K) [application/zip]
Saving to: ‘data_cv.zip’


2025-02-18 16:00:50 (550 KB/s) - ‘data_cv.zip’ saved [30292/30292]

Archive:  data_cv.zip
  inflating: glass0-5-1tra.dat       
  inflating: glass0-5-1tst.dat       
  inflating: glass0-5-2tra.dat       
  inflating: glass0-5-2tst.dat       
  inflating: glass0-5-3tra.dat       
  inflating: glass0-5-3tst.dat       
  inflating: glass0-5-4tra.dat       
  inflating: glass0-5-4tst.dat       
  inflating: glass0-5-5tra.dat       
  inflating: glass0-5-5tst.dat       


In [None]:
accuracy_svm_final = 0
accuracy_rf_final = 0
for i in range(5): # Iterate five times
    # Load data for each fold
    print('%%%%%%%%%%%%%FOLD '+str(i+1)+'%%%%%%%%%%%%%%%%%%%%%%%')
    X_train = np.genfromtxt('/content/glass0-5-'+str(i+1)+'tra.dat',
                     usecols=range(9), # this only brings the first nine columns of the dataset
                     skip_header=14, # The first 14 lines of the .dat contain a description
                     delimiter=',')
    y_train_names = np.genfromtxt('/content/glass0-5-'+str(i+1)+'tra.dat',
                     usecols=range(9,10), # This brings the last column of the dataset, which has the class
                     dtype = None,
                     encoding = None, # This helps us get the strings in a numpy array
                     skip_header=14,
                     delimiter=',')
    X_test = np.genfromtxt('/content/glass0-5-'+str(i+1)+'tst.dat',
                     usecols=range(9), # this only brings the first nine columns of the dataset
                     skip_header=14, # The first 14 lines of the .dat contain a description
                     delimiter=',')
    y_test_names = np.genfromtxt('/content/glass0-5-'+str(i+1)+'tst.dat',
                     usecols=range(9,10), # This brings the last column of the dataset, which has the class
                     dtype = None,
                     encoding = None, # This helps us get the strings in a numpy array
                     skip_header=14,
                     delimiter=',')
    # Convert target into numbers
    # (the comprehension variable is named `name` to avoid confusion with the loop index `i`)
    y_train = np.array([0 if name == ' positive' else 1 for name in y_train_names])
    y_test = np.array([0 if name == ' positive' else 1 for name in y_test_names])
    # TRAIN-TEST SVM
    model_svm = SVC(kernel='linear')
    clf_svm = model_svm.fit(X_train,y_train)
    y_svm = model_svm.predict(X_test)
    # TRAIN-TEST RF
    model_rf = RandomForestClassifier(n_estimators = 1000, random_state = 42)
    clf_rf = model_rf.fit(X_train, y_train)
    y_rf = model_rf.predict(X_test)
    # Calculate accuracy once per model (true labels first, then predictions)
    acc_svm = accuracy_score(y_test, y_svm)
    acc_rf = accuracy_score(y_test, y_rf)
    print('Accuracy of SVM (Fold '+str(i+1)+'): ', acc_svm)
    print('Accuracy of RF (Fold '+str(i+1)+'): ', acc_rf)
    accuracy_svm_final = accuracy_svm_final + acc_svm
    accuracy_rf_final = accuracy_rf_final + acc_rf
print('%%%%%%%%FINAL RESULTS%%%%%%%%%%%%%')
print('Average Accuracy of SVM (All Folds): ', accuracy_svm_final/5)
print('Average Accuracy of RF (All Folds): ', accuracy_rf_final/5)

%%%%%%%%%%%%%FOLD 1%%%%%%%%%%%%%%%%%%%%%%%
Accuracy of SVM (Fold 1):  0.7209302325581395
Accuracy of RF (Fold 1):  0.8837209302325582
%%%%%%%%%%%%%FOLD 2%%%%%%%%%%%%%%%%%%%%%%%
Accuracy of SVM (Fold 2):  0.7906976744186046
Accuracy of RF (Fold 2):  0.8372093023255814
%%%%%%%%%%%%%FOLD 3%%%%%%%%%%%%%%%%%%%%%%%
Accuracy of SVM (Fold 3):  0.7674418604651163
Accuracy of RF (Fold 3):  0.8837209302325582
%%%%%%%%%%%%%FOLD 4%%%%%%%%%%%%%%%%%%%%%%%
Accuracy of SVM (Fold 4):  0.6511627906976745
Accuracy of RF (Fold 4):  0.8372093023255814
%%%%%%%%%%%%%FOLD 5%%%%%%%%%%%%%%%%%%%%%%%
Accuracy of SVM (Fold 5):  0.8095238095238095
Accuracy of RF (Fold 5):  0.9285714285714286
%%%%%%%%FINAL RESULTS%%%%%%%%%%%%%
Average Accuracy of SVM (All Folds):  0.7479512735326688
Average Accuracy of RF (All Folds):  0.8740863787375416
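The loading code in the loop above is repeated four times and relies on `skip_header=14`. As a possible simplification, the sketch below defines a hypothetical helper, `parse_keel`, which assumes that header lines start with `@` and that the class label is the last comma-separated field. The sample text is made up to mimic that layout and does not contain real glass0 rows:

```python
import numpy as np

def parse_keel(lines):
    """Split KEEL-style lines into a float feature matrix and a label array.

    Hypothetical helper: header lines are assumed to start with '@' and the
    class label is assumed to be the last comma-separated field.
    """
    features, labels = [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith('@'):
            continue  # skip blank lines and header lines
        *values, label = [field.strip() for field in line.split(',')]
        features.append([float(v) for v in values])
        labels.append(label)
    X = np.array(features)
    # Same encoding as in the lab: positive -> 0, anything else -> 1
    y = np.array([0 if label == 'positive' else 1 for label in labels])
    return X, y

# Tiny made-up sample in the same layout (not real glass0 rows)
sample = """@relation toy
@attribute A real [0.0, 1.0]
@attribute B real [0.0, 1.0]
@attribute Class {positive, negative}
@data
0.1, 0.2, positive
0.3, 0.4, negative
0.5, 0.6, negative
"""

X, y = parse_keel(sample.splitlines())
print(X.shape, y)
```

Since an open file can be iterated line by line, the same helper could be applied to each `tra`/`tst` file inside a `with open(...)` block, keeping the loop body limited to training and scoring.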
