## About iPython Notebooks ##

iPython Notebooks are interactive coding environments embedded in a webpage. You will be using iPython notebooks in this class. Make sure you fill in any place that says `# BEGIN CODE HERE #END CODE HERE`. After writing your code, you can run the cell by either pressing "SHIFT"+"ENTER" or by clicking on "Run" (denoted by a play symbol). Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All). 

 **What you need to remember:**

- Run your cells using SHIFT+ENTER (or "Run cell")
- Write code in the designated areas using Python 3 only
- Do not modify the code outside of the designated areas
- In some cases you will also need to explain the results. There will also be designated areas for that. 

Fill in your **NAME** and **AEM** below:

In [1]:
NAME = "Vasiliki Zarkadoula"
AEM = ""

---

# Assignment 3 - Ensemble Methods #

Welcome to your third assignment. This exercise will test your understanding of Ensemble Methods.

In [2]:
# Always run this cell
import numpy as np
import pandas as pd

# USE THE FOLLOWING RANDOM STATE FOR YOUR CODE
RANDOM_STATE = 42

## Download the Dataset ##
Download the dataset using the following cell or from this [link](https://github.com/sakrifor/public/tree/master/machine_learning_course/EnsembleDataset) and put the files in the same folder as the .ipynb file. 
In this assignment you are going to work with a dataset that originates from [ImageCLEFmed: The Medical Task 2016](https://www.imageclef.org/2016/medical) and its **Compound figure detection** subtask. The goal of this subtask is to identify whether a figure is a compound figure (one image consisting of more than one figure) or not. The training dataset consists of 4197 examples/figures, and each figure has 4096 features, which were extracted using a deep neural network. The *CLASS* column represents the class of each example, where 1 is a compound figure and 0 is not. 


In [3]:
import urllib.request
url_train = 'https://github.com/sakrifor/public/raw/master/machine_learning_course/EnsembleDataset/train_set.csv'
filename_train = 'train_set.csv'
urllib.request.urlretrieve(url_train, filename_train)
url_test = 'https://github.com/sakrifor/public/raw/master/machine_learning_course/EnsembleDataset/test_set_noclass.csv'
filename_test = 'test_set_noclass.csv'
urllib.request.urlretrieve(url_test, filename_test)

('test_set_noclass.csv', <http.client.HTTPMessage at 0x2cafaaf7280>)

In [4]:
# Run this cell to load the data
train_set = pd.read_csv("train_set.csv").sample(frac=1).reset_index(drop=True)
train_set.head()
X = train_set.drop(columns=['CLASS'])
y = train_set['CLASS'].values

## 1.0 Testing different ensemble methods ##
In this part of the assignment you are asked to create and test different ensemble methods using the train_set.csv dataset. You should use **10-fold cross validation** for your tests and report the average f-measure and accuracy of your models.

### !!! Use n_jobs=-1 wherever possible to use all the cores of a machine for running your tests ###

### 1.1 Voting ###
Create a voting classifier which uses three estimators/classifiers. Test both soft and hard voting and choose the best one.

In [5]:
# BEGIN CODE HERE
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import mean_squared_error, accuracy_score, f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import GradientBoostingClassifier


# Reload the data so that the shuffle uses the fixed random state
train_set = pd.read_csv("train_set.csv").sample(frac=1, random_state=RANDOM_STATE).reset_index(drop=True)
X = train_set.drop(columns=['CLASS'])
y = train_set['CLASS'].values

cls1 =  SVC(probability = False ,random_state=RANDOM_STATE)
cls2 =  SVC(gamma='auto', probability = False ,random_state=RANDOM_STATE)
cls3 =  SVC(kernel= 'linear', probability = False ,random_state=RANDOM_STATE)

classifiers = [('svm1', cls1),('svm2', cls2),('svm3', cls3)]
vcls = VotingClassifier(estimators=classifiers, voting='hard') # Voting Classifier

# evaluate using cross-validation
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=RANDOM_STATE)
scores = cross_validate(vcls, X, y, scoring=('accuracy', 'f1'), cv=cv, n_jobs=-1)

avg_fmeasure = np.mean(scores['test_f1'])       # The average f-measure
avg_accuracy = np.mean(scores['test_accuracy'])  # The average accuracy


#END CODE HERE

In [6]:
print("Classifier:")
print(vcls)
print("F1-Score:{} & Accuracy:{}".format(avg_fmeasure,avg_accuracy))

Classifier:
VotingClassifier(estimators=[('svm1', SVC(random_state=42)),
                             ('svm2', SVC(gamma='auto', random_state=42)),
                             ('svm3', SVC(kernel='linear', random_state=42))])
F1-Score:0.8806722221016031 & Accuracy:0.8572462779861347
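
The difference between the two voting modes can be illustrated with a minimal sketch. Hard voting takes the majority of the predicted labels, while soft voting averages predicted class probabilities, so every estimator must provide `predict_proba` (for `SVC` that means `probability=True`). The data here is synthetic (`make_classification`) and the three estimators are stand-ins chosen only because they train quickly; neither is taken from the assignment.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

RANDOM_STATE = 42

# Synthetic stand-in data (assumption): NOT the train_set.csv of the assignment
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     random_state=RANDOM_STATE)

# Three estimators that all expose predict_proba, so soft voting is possible
estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("nb", GaussianNB()),
    ("dt", DecisionTreeClassifier(max_depth=5, random_state=RANDOM_STATE)),
]

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=RANDOM_STATE)

# Evaluate the same estimators under both voting schemes
voting_results = {}
for voting in ("hard", "soft"):
    clf = VotingClassifier(estimators=estimators, voting=voting)
    scores = cross_validate(clf, X_demo, y_demo,
                            scoring=("accuracy", "f1"), cv=cv)
    voting_results[voting] = (np.mean(scores["test_accuracy"]),
                              np.mean(scores["test_f1"]))

for voting, (acc, f1) in voting_results.items():
    print(f"{voting}: accuracy={acc:.4f}, f1={f1:.4f}")
```

The same loop could be pointed at `X` and `y` from the assignment, at the cost of a much longer run time.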


### 1.2 Stacking ###
Create a stacking classifier which uses two estimators/classifiers. Try different classifiers for the combination of the initial classifiers. Report your results in the following cell.

In [7]:
# BEGIN CODE HERE

cls1 = KNeighborsClassifier(n_neighbors=7)
cls2 = SVC(probability = False ,random_state=RANDOM_STATE)

cls = [('knn',cls1),('svm',cls2)]
scls = StackingClassifier(cls, cv=10, passthrough=False) # Stacking Classifier

# evaluate using cross-validation
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=RANDOM_STATE)
scores = cross_validate(scls, X, y, scoring=('accuracy', 'f1'), cv=cv, verbose=2, n_jobs=-1)

avg_fmeasure = np.mean(scores['test_f1'])       # The average f-measure
avg_accuracy = np.mean(scores['test_accuracy'])  # The average accuracy

#END CODE HERE

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   7 out of  10 | elapsed: 21.8min remaining:  9.3min
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed: 29.0min finished


In [8]:
print("Classifier:")
print(scls)
print("F1-Score:{} & Accuracy:{}".format(avg_fmeasure,avg_accuracy))

Classifier:
StackingClassifier(cv=10,
                   estimators=[('knn', KNeighborsClassifier(n_neighbors=7)),
                               ('svm', SVC(random_state=42))])
F1-Score:0.8790692218309399 & Accuracy:0.8567678145243779
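
The task also asks for different classifiers that combine the initial ones. In scikit-learn this is the `final_estimator` argument of `StackingClassifier`, which defaults to logistic regression when it is left unset. The sketch below compares two final estimators on synthetic data; the data, the base estimators and the candidate final estimators are illustrative assumptions, not the configuration used for the results in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

RANDOM_STATE = 42

# Synthetic stand-in data (assumption): NOT the assignment dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     random_state=RANDOM_STATE)

# Two level-0 estimators, as required by the task
base_estimators = [
    ("knn", KNeighborsClassifier(n_neighbors=7)),
    ("nb", GaussianNB()),
]

# Candidate level-1 (combining) estimators, chosen for illustration only
final_estimators = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=3,
                                            random_state=RANDOM_STATE),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

stacking_results = {}
for name, final_estimator in final_estimators.items():
    scls = StackingClassifier(estimators=base_estimators,
                              final_estimator=final_estimator, cv=5)
    scores = cross_validate(scls, X_demo, y_demo,
                            scoring=("accuracy", "f1"), cv=cv)
    stacking_results[name] = (np.mean(scores["test_accuracy"]),
                              np.mean(scores["test_f1"]))

for name, (acc, f1) in stacking_results.items():
    print(f"{name}: accuracy={acc:.4f}, f1={f1:.4f}")
```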


### 1.3 Report the results ###  
Report the results of your experiments in the following cell. How did you choose your initial classifiers? 

I picked three classifier families to combine more or less arbitrarily: decision trees, k-NN and SVM. <br>
(Decision trees and k-NN are also faster to run.)

-------------------------------------------------------------------------------------------------------------------------------

RESULTS

**1.1 Voting**

i.<br>
DecisionTreeClassifier(random_state=RANDOM_STATE) <br>
DecisionTreeClassifier(max_depth = 15, random_state=RANDOM_STATE)<br>
DecisionTreeClassifier(max_depth = 8,random_state=RANDOM_STATE)

| Voting Type | Accuracy | F1 Score |
| --- | --- | --- |
| Soft | 0.7059  | 0.7549 |
| Hard | 0.7059 | 0.7549 |

ii. <br>
SVC(probability = False ,random_state=RANDOM_STATE)<br>
SVC(gamma='auto', probability = False ,random_state=RANDOM_STATE)<br>
SVC(kernel= 'linear', probability = False ,random_state=RANDOM_STATE)

| Voting Type | Accuracy | F1 Score |
| --- | --- | --- |
| Soft | 0.8102  | 0.8395 |
| Hard | 0.8572 | 0.8806 |

iii.<br>
KNeighborsClassifier(n_neighbors=5) <br>
KNeighborsClassifier(n_neighbors=7) <br>
KNeighborsClassifier(n_neighbors=10)

| Voting Type | Accuracy | F1 Score |
| --- | --- | --- |
| Soft | 0.8102  | 0.8395 |
| Hard | 0.8387 | 0.8107 |

iv.<br>
cls1 =  DecisionTreeClassifier(random_state=RANDOM_STATE)<br>
cls2 =  SVC(probability = False ,random_state=RANDOM_STATE)<br>
cls3 =  KNeighborsClassifier(n_neighbors=7)

| Voting Type | Accuracy | F1 Score |
| --- | --- | --- |
| Soft | 0.8086  | 0.8623 |
| Hard | 0.8353 | 0.8402 |


**It seems like combination (ii) has the best performance.**

-------------------------------------------------------------------------------------------------------------------------------


**1.2 Stacking**

i.<br>
DecisionTreeClassifier(random_state=RANDOM_STATE,max_depth=15)<br>
DecisionTreeClassifier(random_state=RANDOM_STATE) <br>

| Stacking Type | Accuracy | F1 Score |
| ------------- | --- | --- |
| passthrough = F | 0.7059  | 0.7622 |
| passthrough = T | 0.8377 | 0.8622 |

<br>
ii.<br>
DecisionTreeClassifier(random_state=RANDOM_STATE) <br>
SVC(probability = False ,random_state=RANDOM_STATE)<br>

| Stacking Type | Accuracy | F1 Score |
| ------------- | --- | --- |
| passthrough = F | 0.8555  | 0.8778 |
| passthrough = T | 0.8381 | 0.8627 |

<br>
iii.<br>
KNeighborsClassifier(n_neighbors=7)<br>
SVC(probability = False ,random_state=RANDOM_STATE)<br>

| Stacking Type | Accuracy | F1 Score |
| ------------- | --- | --- |
| passthrough = F | 0.8567  | 0.8790 |
| passthrough = T | 0.8365 | 0.8613 |

<br>
iv.<br>
KNeighborsClassifier(n_neighbors=7) <br>
KNeighborsClassifier(n_neighbors=11) <br>

| Stacking Type | Accuracy | F1 Score |
| ------------- | --- | --- |
| passthrough = F | 0.8076  | 0.8405 |
| passthrough = T | 0.8377 | 0.8620 |

<br>
v.<br>
SVC(probability = False ,random_state=RANDOM_STATE) <br>
SVC(gamma='auto', probability = False ,random_state=RANDOM_STATE) <br>

| Stacking Type | Accuracy | F1 Score |
| ------------- | --- | --- |
| passthrough = F | 0.8543  | 0.8767 |
| passthrough = T | 0.8630 | 0.8386 |

<br>
**It seems like combination (iii) has the best performance.**

## 2.0 Randomization ##

**2.1** You are asked to create three ensembles of decision trees where each one uses a different method for producing homogeneous ensembles. Compare them with a simple decision tree classifier and report your results in the dictionaries (dict) below using as key the given name of your classifier and as value the f1/accuracy score. The dictionaries should contain four different elements.  

In [3]:
# BEGIN CODE HERE
# # Gridsearch to tune the hyper-parameters of the classifiers to be used

# # Decision tree gridsearch parameters
# param_grid = {'clf__criterion': ['gini', 'entropy'], 'clf__min_samples_leaf': [1, 2, 3, 4],
#               'clf__max_depth': [5, 7, 10, 12, 15, 17, 20, None], 'clf__splitter': ['best', 'random']}
# # Random forest gridsearch parameters
# param_grid = { 'clf__n_estimators': [200, 500],'clf__max_features': ['auto', 'sqrt', 'log2'],
#             'clf__max_depth' : [4,5,6,7,8],'clf__criterion' :['gini', 'entropy']}
# # Gradient boosting gridsearch parameters
# param_grid = {'clf__learning_rate':[0.1,0.05,0.01,0.005], 'clf__n_estimators':[100], 'clf__max_depth':[2,5,8] }
# # Bagging gridsearch parameters
# param_grid = {'clf__n_estimators': [50, 100], 'clf__max_features': [0.6, 0.8, 1.0], 'clf__base_estimator__max_depth': [5, 8, 10]}

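# # Requires: from sklearn.pipeline import Pipeline; from sklearn.preprocessing import StandardScaler
# pipe = Pipeline([('scale', StandardScaler()), ('clf', BaggingClassifier(base_estimator=DecisionTreeClassifier(), random_state=RANDOM_STATE))])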
# grid = GridSearchCV(pipe, param_grid, n_jobs=-1, verbose=2, refit=True,cv = 10)

# grid.fit(X, y)

# # print best parameter after tuning
# print(grid.best_params_)

ens1 = BaggingClassifier(DecisionTreeClassifier( max_depth=10), max_features=0.8, n_estimators=100, random_state=RANDOM_STATE)
ens2 = RandomForestClassifier(criterion='gini', max_depth=8, max_features='auto', n_estimators=500, random_state=RANDOM_STATE)
ens3 = GradientBoostingClassifier(learning_rate=0.1, max_depth=5, n_estimators=100, random_state=RANDOM_STATE)
tree = DecisionTreeClassifier(criterion='gini',max_depth=7, min_samples_leaf=2,splitter='best', random_state=RANDOM_STATE)

# evaluate a given model using cross-validation
def evaluate_model(model, X, y):
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=RANDOM_STATE)
    scores = cross_validate(model, X, y, scoring=('accuracy', 'f1'), cv=cv, verbose=2, n_jobs=3)
    return scores

# get f1 and accuracy scores
def get_scores(model, X, y):
    scores = evaluate_model(model, X, y)
    # local names chosen so they do not shadow sklearn.metrics.f1_score
    mean_f1 = np.mean(scores['test_f1'])
    mean_acc = np.mean(scores['test_accuracy'])
    return mean_f1, mean_acc

f1_ens1, acc_ens1 = get_scores(ens1, X, y)
f1_ens2, acc_ens2 = get_scores(ens2, X, y)
f1_ens3, acc_ens3 = get_scores(ens3, X, y)
f1_tree, acc_tree = get_scores(tree, X, y)

f_measures = dict()
accuracies = dict()

f_measures['Ensemble with Bagging'] = f1_ens1
f_measures['Ensemble with RandomForest'] = f1_ens2
f_measures['Ensemble with GradientBoost'] = f1_ens3
f_measures['Simple DecisionTree'] = f1_tree

accuracies['Ensemble with Bagging'] = acc_ens1
accuracies['Ensemble with RandomForest'] = acc_ens2
accuracies['Ensemble with GradientBoost'] = acc_ens3
accuracies['Simple DecisionTree'] = acc_tree

#END CODE HERE

[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done  10 out of  10 | elapsed: 20.2min finished
[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done  10 out of  10 | elapsed:  2.5min finished
[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done  10 out of  10 | elapsed: 15.8min finished
[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done  10 out of  10 | elapsed:   15.5s finished


In [4]:
print(ens1)
print(ens2)
print(ens3)
print(tree)
for name,score in f_measures.items():
    print("Classifier:{} -  F1:{}".format(name,score))
for name,score in accuracies.items():
    print("Classifier:{} -  Accuracy:{}".format(name,score))

BaggingClassifier(base_estimator=DecisionTreeClassifier(max_depth=10),
                  max_features=0.8, n_estimators=100, random_state=42)
RandomForestClassifier(max_depth=8, n_estimators=500, random_state=42)
GradientBoostingClassifier(max_depth=5, random_state=42)
DecisionTreeClassifier(max_depth=7, min_samples_leaf=2, random_state=42)
Classifier:Ensemble with Bagging -  F1:0.8447776121623592
Classifier:Ensemble with RandomForest -  F1:0.8310431580006441
Classifier:Ensemble with GradientBoost -  F1:0.8599528033439358
Classifier:Simple DecisionTree -  F1:0.7699716070093251
Classifier:Ensemble with Bagging -  Accuracy:0.8069598818047506
Classifier:Ensemble with RandomForest -  Accuracy:0.7797937265598363
Classifier:Ensemble with GradientBoost -  Accuracy:0.8296067734969883
Classifier:Simple DecisionTree -  Accuracy:0.7204483464030004
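
The commented-out grid search in the cell above relies on `Pipeline` and `StandardScaler`, and on the `clf__` prefix that routes each parameter to the pipeline step named `clf`. A minimal runnable sketch of that pattern follows. It uses synthetic data and a deliberately tiny grid so that it finishes quickly; the grid values are illustrative assumptions and are not the grids that produced the hyper-parameters used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

RANDOM_STATE = 42

# Synthetic stand-in data (assumption): NOT the assignment dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=20,
                                     random_state=RANDOM_STATE)

# The step name 'clf' is what the 'clf__' prefix in the grid refers to
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=RANDOM_STATE)),
])

# Deliberately small, illustrative grid
param_grid = {
    "clf__n_estimators": [20, 50],
    "clf__max_depth": [4, 8],
    "clf__criterion": ["gini", "entropy"],
}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=RANDOM_STATE)
grid = GridSearchCV(pipe, param_grid, scoring="f1", cv=cv, refit=True)
grid.fit(X_demo, y_demo)

# Best parameter combination and its mean cross-validated F1
print(grid.best_params_)
print(round(grid.best_score_, 4))
```

Tree-based models do not depend on feature scaling, so the `StandardScaler` step is kept here only to mirror the commented-out pipeline.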


**2.2** Describe your classifiers and your results.

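Best classifier: gradient boosting (F1 0.860, accuracy 0.830), followed by bagging (F1 0.845, accuracy 0.807) and random forest (F1 0.831, accuracy 0.780). All three ensembles outperform the single decision tree (F1 0.770, accuracy 0.720).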

**2.3** Increasing the number of estimators in a bagging classifier can drastically increase the training time of a classifier. Is there any solution to this problem? Can the same solution be applied to boosting classifiers?
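
One commonly suggested answer, given here as a sketch and not as the author's response: the estimators of a bagging ensemble are trained independently of each other, so they can be fitted in parallel through the `n_jobs` parameter of `BaggingClassifier`. Classic boosting fits each estimator on the errors of the previous ones, so the sequence cannot be parallelised in the same way, and scikit-learn's `GradientBoostingClassifier` and `AdaBoostClassifier` expose no `n_jobs` parameter. The data below is synthetic and the ensemble sizes are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

RANDOM_STATE = 42

# Synthetic stand-in data (assumption): NOT the assignment dataset
X_demo, y_demo = make_classification(n_samples=400, n_features=20,
                                     random_state=RANDOM_STATE)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25,
                                          random_state=RANDOM_STATE)

# Bagging: the trees are independent, so fitting can be spread over workers
bagging = BaggingClassifier(DecisionTreeClassifier(max_depth=10),
                            n_estimators=50, n_jobs=2,
                            random_state=RANDOM_STATE)
bagging.fit(X_tr, y_tr)
bagging_accuracy = bagging.score(X_te, y_te)

# Boosting: each tree depends on the previous ones, so there is no
# n_jobs parameter to set on this estimator
boosting = GradientBoostingClassifier(n_estimators=50,
                                      random_state=RANDOM_STATE)
bagging_has_n_jobs = "n_jobs" in bagging.get_params()
boosting_has_n_jobs = "n_jobs" in boosting.get_params()

print(f"bagging accuracy: {bagging_accuracy:.4f}")
print(f"bagging exposes n_jobs: {bagging_has_n_jobs}")
print(f"gradient boosting exposes n_jobs: {boosting_has_n_jobs}")
```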

## 3.0 Creating the best classifier ##

**3.1** In this part of the assignment you are asked to train the best possible ensemble! Describe the process you followed to achieve this result. How did you choose your classifier and your parameters, and why? Report the f-measure & accuracy (10-fold cross validation) of your final classifier and the results of the classifiers you tried in the cell following the code. Can you achieve an accuracy over 83-84%?

In [11]:
# BEGIN CODE HERE

ens1 = RandomForestClassifier(criterion='gini', max_depth=8, max_features='auto', n_estimators=500, random_state=RANDOM_STATE)
ens2 =  KNeighborsClassifier(n_neighbors=8, weights='distance')
ens3 =  KNeighborsClassifier(n_neighbors=15)
ens4 = SVC(C= 0.1, gamma=1, kernel='poly', probability = False ,random_state=RANDOM_STATE)
ens5 = SVC(C=10, gamma=0.0001, probability = False ,random_state=RANDOM_STATE)

classifiers = [('rf',ens1),('knn1',ens2),('knn2',ens3),('svm1',ens4),('svm2',ens5)]

best_cls = StackingClassifier(classifiers, cv=10, n_jobs=1) 

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=RANDOM_STATE)
scores_stacking = cross_validate(best_cls, X, y, scoring=('accuracy', 'f1'), cv=cv, verbose=2, n_jobs=-1)

best_fmeasure = np.mean(scores_stacking['test_f1'])     
best_accuracy = np.mean(scores_stacking['test_accuracy'])

# best_fmeasure = 0   
# best_accuracy = 0



#END CODE HERE

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   7 out of  10 | elapsed: 52.9min remaining: 22.7min
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed: 71.7min finished


In [12]:
print("Classifier:")
print(best_cls)
print("F1-Score:{} & Accuracy:{}".format(best_fmeasure,best_accuracy))

Classifier:
StackingClassifier(cv=10,
                   estimators=[('rf',
                                RandomForestClassifier(max_depth=8,
                                                       n_estimators=500,
                                                       random_state=42)),
                               ('knn1',
                                KNeighborsClassifier(n_neighbors=8,
                                                     weights='distance')),
                               ('knn2', KNeighborsClassifier(n_neighbors=15)),
                               ('svm1',
                                SVC(C=0.1, gamma=1, kernel='poly',
                                    random_state=42)),
                               ('svm2',
                                SVC(C=10, gamma=0.0001, random_state=42))],
                   n_jobs=1)
F1-Score:0.885870956173741 & Accuracy:0.8648698715763155


**3.2** Describe the process you followed to achieve this result. How did you choose your classifier and your parameters, and why? Report the f-measure & accuracy (10-fold cross validation) of your final classifier and the results of the classifiers you tried in the cell following the code.

Best combination:

ens1 = RandomForestClassifier(criterion='gini', max_depth=8, max_features='auto', n_estimators=500, random_state=RANDOM_STATE)<br>
ens2 =  KNeighborsClassifier(n_neighbors=8, weights='distance')<br>
ens3 =  KNeighborsClassifier(n_neighbors=15)<br>
ens4 = SVC(C= 0.1, gamma=1, kernel='poly', probability = False ,random_state=RANDOM_STATE)<br>
ens5 = SVC(C=10, gamma=0.0001, probability = False ,random_state=RANDOM_STATE)<br>
best_cls = StackingClassifier(classifiers, cv=10, n_jobs=1) <br><br>
F1-Score: 0.885870956173741  Accuracy: 0.8648698715763155 
<br><br>
hard VotingClassifier: F1-Score:0.8692758583179436 & Accuracy:0.8412791226275713
<br>

-------------------------------------------------------------------------------------------------------------------------------

ens1 =  SVC(C=10, gamma=0.0001, probability = False ,random_state=RANDOM_STATE)<br>
ens2 =  SVC(C=1, gamma=0.01, probability = False ,random_state=RANDOM_STATE)<br>
ens3 =  KNeighborsClassifier(n_neighbors=8,weights='distance' )<br>
ens4 =  KNeighborsClassifier(n_neighbors=15,weights='distance' )<br>
ens5 =  DecisionTreeClassifier(criterion='entropy',max_depth=15,min_samples_leaf=50)  <br>
ens6 =  DecisionTreeClassifier(criterion='gini',max_depth=8,min_samples_leaf=5)<br> 
<br>
Hard voting: F1-Score:0.8743786421718432 & Accuracy:0.8415138083873168<br>
Stacking: F1-Score:0.8855501090943972 & Accuracy:0.864389135128992<br>

-------------------------------------------------------------------------------------------------------------------------------

final_dt = DecisionTreeClassifier(max_leaf_nodes=15, max_depth=7)<br>
ens1 = BaggingClassifier(base_estimator=final_dt,n_estimators=100, random_state=RANDOM_STATE)<br>
ens2 = RandomForestClassifier(criterion='gini', max_depth=8, max_features='auto', n_estimators=500, random_state=RANDOM_STATE)<br>
ens3 =  KNeighborsClassifier(n_neighbors=8, weights='distance')<br>
ens4 = SVC(C= 0.1, gamma=1, kernel='poly', probability = False ,random_state=RANDOM_STATE)<br>
<br>
Stacking: F1-Score:0.8722317872878111 & Accuracy:0.8481895669962496<br>
Hard voting: F1-Score:0.8608855333326844 & Accuracy:0.8336532560518239<br>

-------------------------------------------------------------------------------------------------------------------------------

ens1 =  KNeighborsClassifier(n_neighbors=8, weights='distance')<br>
ens3 = SVC(C= 0.1, gamma=1, kernel='poly', probability = False ,random_state=RANDOM_STATE)<br>
ens5 = RandomForestClassifier(criterion='gini', max_depth=8, max_features='auto', n_estimators=500, random_state=RANDOM_STATE)<br>
ens6 = GradientBoostingClassifier(learning_rate=0.1, max_depth=5, n_estimators=100, random_state=RANDOM_STATE)<br>
<br>
Stacking: F1-Score:0.8720228600326587 & Accuracy:0.8486657574724401<br>
Hard voting: F1-Score:0.8658548382169 & Accuracy:0.8410404591430846<br>

-------------------------------------------------------------------------------------------------------------------------------

ens1 =  KNeighborsClassifier(n_neighbors=8, weights='distance')<br>
ens2 =  KNeighborsClassifier(n_neighbors=15)<br>
ens3 = SVC(C= 0.1, gamma=1, kernel='poly', probability = False ,random_state=RANDOM_STATE)<br>
ens4 = SVC(C=10, gamma=0.0001, probability = False ,random_state=RANDOM_STATE)<br>
ens5 = RandomForestClassifier(criterion='gini', max_depth=8, max_features='auto', n_estimators=500, random_state=RANDOM_STATE)<br>
ens6 = GradientBoostingClassifier(learning_rate=0.1, max_depth=5, n_estimators=100, random_state=RANDOM_STATE)<br>
<br>
Stacking: F1-Score:0.8856607633611994 & Accuracy:0.864633481077395<br>
Hard voting: F1-Score:0.8690737891162538 & Accuracy:0.8443772019547676<br>

-------------------------------------------------------------------------------------------------------------------------------

ens1 = RandomForestClassifier(criterion='gini', max_depth=8, max_features='auto', n_estimators=500, random_state=RANDOM_STATE)<br>
ens2 =  KNeighborsClassifier(n_neighbors=8, weights='distance')<br>
ens3 =  KNeighborsClassifier(n_neighbors=15)<br>
ens4 = SVC(C= 0.1, gamma=1, kernel='poly', probability = False ,random_state=RANDOM_STATE)<br>
ens5 = SVC(C=10, gamma=0.0001, probability = False ,random_state=RANDOM_STATE)<br>
ens6 =  DecisionTreeClassifier(criterion='entropy',max_depth=15,min_samples_leaf=50)  <br>
ens7 =  DecisionTreeClassifier(criterion='gini',max_depth=8,min_samples_leaf=5) <br>
<br>
Stacking: F1-Score:0.8835310632971692 & Accuracy:0.8620087509944312<br>
Hard voting: F1-Score:0.8600895789978917 & Accuracy:0.8307915672235481<br>



**3.3** Create a classifier that is going to be used in production - in a live system. Use the *test_set_noclass.csv* to make predictions. Store the predictions in a list.  

In [13]:
# BEGIN CODE HERE
X_test = pd.read_csv("test_set_noclass.csv").reset_index(drop=True)

cls = best_cls

cls.fit(X, y)
ypred = cls.predict(X_test)

predictions = list(ypred)

#END CODE HERE

In [14]:
print(cls)
print(predictions)

StackingClassifier(cv=10,
                   estimators=[('rf',
                                RandomForestClassifier(max_depth=8,
                                                       n_estimators=500,
                                                       random_state=42)),
                               ('knn1',
                                KNeighborsClassifier(n_neighbors=8,
                                                     weights='distance')),
                               ('knn2', KNeighborsClassifier(n_neighbors=15)),
                               ('svm1',
                                SVC(C=0.1, gamma=1, kernel='poly',
                                    random_state=42)),
                               ('svm2',
                                SVC(C=10, gamma=0.0001, random_state=42))],
                   n_jobs=1)
[1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1
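
For a classifier meant for a live system, a frequent follow-up step is to persist the fitted model so that it does not have to be retrained for every prediction. The sketch below assumes `joblib` (installed alongside scikit-learn) is an acceptable storage format. It uses a small random forest on synthetic data as a stand-in for `best_cls`, and the file name `best_cls.joblib` is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

RANDOM_STATE = 42

# Synthetic stand-in data and model (assumption): NOT the stacking ensemble
X_demo, y_demo = make_classification(n_samples=200, n_features=20,
                                     random_state=RANDOM_STATE)
model = RandomForestClassifier(n_estimators=50, random_state=RANDOM_STATE)
model.fit(X_demo, y_demo)

# Save the fitted model to disk and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "best_cls.joblib")  # hypothetical name
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should reproduce the original predictions
same_predictions = np.array_equal(model.predict(X_demo),
                                  restored.predict(X_demo))
print(same_predictions)
```

A persisted model should be loaded with the same scikit-learn version that produced it.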

LEAVE HERE ANY COMMENTS ABOUT YOUR CLASSIFIER

#### The following cell will not be executed. The test_set.csv with the classes will be made available after the deadline, and this cell is for testing purposes only! Do not modify it! ####

In [15]:
from sklearn.metrics import f1_score,accuracy_score
final_test_set = pd.read_csv('test_set.csv')
ground_truth = final_test_set['CLASS']
print("Accuracy:{}".format(accuracy_score(predictions,ground_truth)))
print("F1-Score:{}".format(f1_score(predictions,ground_truth)))

FileNotFoundError: [Errno 2] No such file or directory: 'test_set.csv'