In [None]:
import numpy as np
import pandas as pd

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

import warnings
warnings.filterwarnings("ignore")

**Objective and goal for this lab**

This lab is intended to give a quick user guide on how to fit an AdaBoostClassifier. Analogously, you can just as well fit an AdaBoostRegressor for suitable problems.

---

**Import the data**

In [None]:
iris_df = pd.read_csv('../data/IRIS.csv')
iris_df

In [None]:
import plotly.express as px

# plot the dataframe we loaded from the CSV above, colored by class
fig = px.scatter_3d(iris_df, x='sepal_length', y='petal_length', z='petal_width', color='species')
fig.show()

In [None]:
iris_df['species'].value_counts()

The target column *species* is categorical. scikit-learn classifiers can handle string labels directly, but here we make the target explicitly numerical by assigning each class a specific number.

In [None]:
class_map = {'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica': 2}

numerical_targets = [class_map[value] for value in iris_df['species']]

iris_df['species'] = numerical_targets
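The same encoding can also be written with pandas' `Series.map`, which avoids the explicit loop. The sketch below uses a small made-up series instead of the lab's CSV (the `species`, `encoded`, `inverse_map` and `decoded` names are only for illustration) and also shows how to map numbers back to class names.

```python
import pandas as pd

# Toy stand-in for the 'species' column (made-up rows, not the lab's CSV)
species = pd.Series(['Iris-setosa', 'Iris-virginica', 'Iris-versicolor', 'Iris-setosa'])

class_map = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}

# Vectorized encoding; any label missing from class_map would become NaN
encoded = species.map(class_map)

# Inverse mapping, handy for turning numeric predictions back into class names
inverse_map = {number: name for name, number in class_map.items()}
decoded = encoded.map(inverse_map)

print(encoded.tolist())
print(decoded.tolist())
```

Because `map` silently produces NaN for unknown labels, it is worth checking `encoded.isna().any()` when the class names come from an external file.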

In [None]:
iris_df['species'].value_counts()

In [None]:
X, y = iris_df.drop(columns='species'), iris_df['species']

No further pre-processing or feature engineering is needed for this simple dataset.

---

**GridSearch with AdaBoost**

AdaBoost is generally composed of a sequence of weak learners. Each weak learner, on its own, underfits and therefore has high bias. With boosting, however, each new learner is trained to focus on the samples that the previous learners got wrong. At prediction time we combine all the trained learners, much as we did for bagging (in Random Forest), except that AdaBoost weights each learner's vote by how well it performed. In theory this reduces the high bias of the individual learners and gives us a strong ensemble classifier/regressor.
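To make the bias argument concrete, here is a minimal sketch that compares a single depth-1 tree with a boosted ensemble using cross-validation. It assumes scikit-learn's built-in copy of the iris data (`load_iris`) rather than the lab's CSV, and the `_demo` variable names are only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Built-in iris data, used here only to keep the example self-contained
X_demo, y_demo = load_iris(return_X_y=True)

# A single depth-1 tree can only separate two groups, so it underfits 3 classes
stump = DecisionTreeClassifier(max_depth=1, random_state=0)

# AdaBoost trains many such weak learners sequentially and combines them
boosted = AdaBoostClassifier(n_estimators=50, random_state=0)

stump_acc = cross_val_score(stump, X_demo, y_demo, cv=3, scoring='accuracy').mean()
boosted_acc = cross_val_score(boosted, X_demo, y_demo, cv=3, scoring='accuracy').mean()

print(f"single stump:     {stump_acc:.3f}")
print(f"boosted ensemble: {boosted_acc:.3f}")
```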

If you don't specify otherwise when initializing AdaBoostClassifier, the default weak learner is a decision tree with a max depth of 1, often called a decision stump (this is really a weak learner, right?).
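One way to check the default weak learner yourself is to fit a small model and inspect its fitted learners. This sketch again assumes the built-in `load_iris` data. `default_model` and `first_learner` are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# Fit a tiny ensemble without specifying the weak learner
default_model = AdaBoostClassifier(n_estimators=5, random_state=0).fit(X_demo, y_demo)

# estimators_ holds the fitted weak learners in training order
first_learner = default_model.estimators_[0]

print(type(first_learner).__name__)   # class of the weak learner
print(first_learner.get_depth())      # depth of the fitted tree
```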

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import AdaBoostClassifier

# define the model with default hyperparameters
model = AdaBoostClassifier()

# define the grid of values to search
param_grid = {'n_estimators': [10, 50, 100, 500],                 # number of sequential 'weak' models to train
              'learning_rate': [0.001, 0.01, 0.1, 1.0]}           # controls how strongly each iteration's weak learner is weighted

# define the grid search procedure
grid_search = GridSearchCV(estimator=model, 
                           param_grid=param_grid, 
                           n_jobs=-1, 
                           cv=3, 
                           scoring='accuracy')

# execute the grid search
grid_result = grid_search.fit(X, y)

# summarize the best score and configuration
print(f"Best: {grid_result.best_score_} using {grid_result.best_params_}")
print('---'*25, end='\n\n')

# summarize all scores that were evaluated
mean_test_scores = grid_result.cv_results_['mean_test_score']
params = grid_result.cv_results_['params']
for mean_test_score, param in zip(mean_test_scores, params):
    print('params:')
    print(f"{param}")
    print('mean accuracy:')
    print(f'{round(mean_test_score,4)}')
    print('---'*25, end='\n\n')
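The cross-validated scores above are computed on the same data that is used to pick the hyperparameters. A common complement, sketched below, is to hold out a test set with `train_test_split` and score the best model on it with `accuracy_score`. The sketch assumes the built-in `load_iris` data and a deliberately small grid to keep it fast. The split size and random seeds are arbitrary choices, not part of the lab.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

X_demo, y_demo = load_iris(return_X_y=True)

# Hold out 25% of the data; stratify keeps the class proportions equal
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42)

# Small illustrative grid (the lab's grid is larger)
small_grid = {'n_estimators': [10, 50], 'learning_rate': [0.1, 1.0]}

search = GridSearchCV(AdaBoostClassifier(random_state=42),
                      small_grid, cv=3, scoring='accuracy')
search.fit(X_train, y_train)   # the search only ever sees the training part

# best_estimator_ is refit on the full training part with the best parameters
test_accuracy = accuracy_score(y_test, search.best_estimator_.predict(X_test))

print(f"best params: {search.best_params_}")
print(f"held-out accuracy: {test_accuracy:.3f}")
```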

---

Above, we mentioned that AdaBoost consists of weak learners that we train sequentially, where each one tries to become good at what the previous one was lacking. However, you can control what these weak learners should be yourself. In practice you can choose almost any estimator to serve the role of the weak learner! In fact, it doesn't really have to be a weak learner; it can also be a strong learner!

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier


# define the model with default hyperparameters
model = AdaBoostClassifier(estimator=RandomForestClassifier())    # Note that we've here chosen RandomForestClassifier (with default hyperparameters)
                                                                  # as the base learner! The `estimator` argument requires scikit-learn >= 1.2
                                                                  # (older versions call it `base_estimator`).

# define the grid of values to search
param_grid = {'n_estimators': [10, 50, 100, 500],                 
              'learning_rate': [0.001, 0.01, 0.1, 1.0]}           

# define the grid search procedure
grid_search = GridSearchCV(estimator=model, 
                           param_grid=param_grid, 
                           n_jobs=-1, 
                           cv=3, 
                           scoring='accuracy')

# execute the grid search
grid_result = grid_search.fit(X, y)

# summarize the best score and configuration
print(f"Best: {grid_result.best_score_} using {grid_result.best_params_}")
print('---'*25, end='\n\n')

# summarize all scores that were evaluated
mean_test_scores = grid_result.cv_results_['mean_test_score']
params = grid_result.cv_results_['params']
for mean_test_score, param in zip(mean_test_scores, params):
    print('params:')
    print(f"{param}")
    print('mean accuracy:')
    print(f'{round(mean_test_score,4)}')
    print('---'*25, end='\n\n')

---

## Challenges

**Task 1**

Make sure to completely understand the whole process we've laid out above.

**Task 2**

Read more about [AdaBoostClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) and [AdaBoostRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html) in their respective documentation.

**Task 3**

Redo the grid search, but now include a search for the best *estimator* - in other words, do a GridSearch where you also use different models as 'weak learners'. 

Experiment away!
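As one possible starting point, the weak learner can be put into the parameter grid like any other hyperparameter. The sketch below assumes the built-in `load_iris` data and only two candidate trees. The candidates and grid values are illustrative, and `estimator_key` is a helper name introduced here to cope with the parameter being called `estimator` in recent scikit-learn versions and `base_estimator` in older ones. Candidates need to accept sample weights in `fit`, since AdaBoost reweights the samples at each iteration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)

model = AdaBoostClassifier(random_state=0)

# Pick whichever parameter name this scikit-learn version uses
estimator_key = 'estimator' if 'estimator' in model.get_params() else 'base_estimator'

# The weak learner is searched over just like the other hyperparameters
param_grid = {
    estimator_key: [DecisionTreeClassifier(max_depth=1),
                    DecisionTreeClassifier(max_depth=3)],
    'n_estimators': [10, 50],
    'learning_rate': [0.1, 1.0],
}

grid_search = GridSearchCV(model, param_grid, cv=3, scoring='accuracy')
grid_search.fit(X_demo, y_demo)

print(grid_search.best_params_)
print(round(grid_search.best_score_, 4))
```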

**Task 4**

Now, create your own (more complicated) dataset using make_blobs, as we have done previously. Use the following:

n_samples = 2000

n_features = 3

n_classes = 6 (passed to make_blobs as `centers=6`, since make_blobs has no `n_classes` argument)

Do a 3D plot of the data to make sure you get a grasp of it. Make sure the color of each point indicates its class membership.

When that's done, also pick a random state such that the classes are somewhat difficult to distinguish.

Thereafter, redo Task 3 on this new dataset.
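A minimal sketch of generating such a dataset is shown below. The `cluster_std` and `random_state` values are arbitrary assumptions that you are expected to vary, and the column names `x1`, `x2`, `x3` and `label` are made up for the example. The resulting dataframe can be passed to `px.scatter_3d` in the same way as the iris dataframe earlier.

```python
import pandas as pd
from sklearn.datasets import make_blobs

# centers sets the number of classes; a larger cluster_std makes the blobs overlap more
X_blobs, y_blobs = make_blobs(n_samples=2000,
                              n_features=3,
                              centers=6,
                              cluster_std=2.5,   # assumed value, tune for difficulty
                              random_state=0)    # assumed value, try different seeds

# Collect everything in a dataframe so it is easy to plot and inspect
blobs_df = pd.DataFrame(X_blobs, columns=['x1', 'x2', 'x3'])
blobs_df['label'] = y_blobs.astype(str)   # string labels give discrete plot colors

print(blobs_df.shape)
print(blobs_df['label'].value_counts().sort_index())
```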