## Home assignment 05: Bagging and OOB score

Fill in the lines marked `# Your Code Here` in the code below.
This is a simplified version of `BaggingRegressor` from `sklearn`. Note that the `sklearn` API is **not preserved**.

Your algorithm should train separate instances of the same model class on bootstrapped datasets and provide the [OOB score](https://en.wikipedia.org/wiki/Out-of-bag_error) for the training set.

The model should be passed as a model class, with no explicit parameters and no parentheses.

Example:
```
import numpy as np
from sklearn.linear_model import LinearRegression

bagging_regressor = SimplifiedBaggingRegressor(num_bags=10, oob=True)
bagging_regressor.fit(LinearRegression, X, y)

```
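
As a rough illustration of what "out-of-bag" means, the sketch below draws one bootstrap sample of indices and finds the objects that were never drawn. The names `n_objects`, `bag_indices`, and `oob_mask` are illustrative assumptions and not part of the assignment template.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only to make the sketch reproducible
n_objects = 10_000

# One bootstrap sample: n_objects indices drawn with replacement.
bag_indices = rng.integers(0, n_objects, size=n_objects)

# An object is out-of-bag if its index never appears in the sample.
oob_mask = np.ones(n_objects, dtype=bool)
oob_mask[bag_indices] = False

oob_fraction = oob_mask.mean()
print(f"OOB fraction: {oob_fraction:.3f}, 1/e = {1 / np.e:.3f}")
```

For a large dataset the out-of-bag fraction should land close to `1/e`, which is the value the local tests compare against.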

In [3]:
import numpy as np

In [20]:

import numpy as np

# Create the source array
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Split the array into 3 roughly equal parts
split_array = np.array_split(arr, 3)
print(split_array)

[array([1, 2, 3, 4]), array([5, 6, 7]), array([ 8,  9, 10])]


In [15]:
arr[[0,6]]

array([1, 7])

In [42]:
arr = np.random.rand(20, 1)
display(arr.shape)
arr

(20, 1)

array([[0.14451313],
       [0.90366121],
       [0.95873456],
       [0.29741762],
       [0.85239533],
       [0.48013907],
       [0.75020498],
       [0.97639246],
       [0.33878313],
       [0.35789961],
       [0.65511425],
       [0.80951362],
       [0.01487471],
       [0.67268766],
       [0.58087499],
       [0.40891187],
       [0.13828067],
       [0.47808768],
       [0.03249579],
       [0.4377425 ]])

In [43]:
n = 20
s = 3

indices_list = []
for i in range(s):
    # print((i*s, i*s+s))
    bag_indx = list(range(i*(n//s),  i*(n//s) + n//s))
    if bag_indx[-1] < n-1 and s == i+1:
        bag_indx = list(range(i*(n//s),  n))


    indices_list.append(bag_indx)
indices_list

[[0, 1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11], [12, 13, 14, 15, 16, 17, 18, 19]]

In [44]:
for bag in indices_list:
    print(arr[bag])

[[0.14451313]
 [0.90366121]
 [0.95873456]
 [0.29741762]
 [0.85239533]
 [0.48013907]]
[[0.75020498]
 [0.97639246]
 [0.33878313]
 [0.35789961]
 [0.65511425]
 [0.80951362]]
[[0.01487471]
 [0.67268766]
 [0.58087499]
 [0.40891187]
 [0.13828067]
 [0.47808768]
 [0.03249579]
 [0.4377425 ]]


In [59]:
last_pred = 0
for i in range(3):
    arr = np.ones((20, 1))
    last_pred += arr
    # print(last_pred)
# last_pred/3

In [142]:
class SimplifiedBaggingRegressor:
    def __init__(self, num_bags, oob=False):
        self.num_bags = num_bags
        self.oob = oob
        
    def _generate_splits(self, data: np.ndarray):
        '''
        Generate indices for every bag and store in self.indices_list list
        '''
        self.indices_list = []
        data_length = len(data)
        for bag in range(self.num_bags):
            # Your Code Here
            # bag_indx = list(range(bag*(data_length//self.num_bags), bag*(data_length//self.num_bags) + data_length//self.num_bags))
            # if bag_indx[-1] < data_length-1 and bag+1 == self.num_bags:
            #     bag_indx = list(range(bag*(data_length//self.num_bags), data_length))
            # bag_indx = list(range(data_length))
            
            # Bootstrap: draw data_length indices with replacement
            bag_indx = np.random.choice(data_length, size=data_length, replace=True)
            self.indices_list.append(bag_indx)

        
    def fit(self, model_constructor, data, target):
        '''
        Fit model on every bag.
        Model constructor with no parameters (and with no ()) is passed to this function.
        
        example:
        
        bagging_regressor = SimplifiedBaggingRegressor(num_bags=10, oob=True)
        bagging_regressor.fit(LinearRegression, X, y)
        '''
        self.data = None
        self.target = None
        self._generate_splits(data)
        # print(len(self.indices_list))
        # print(type(self.indices_list))
        # print(len(data))
        # print(list(map(len, self.indices_list)))
        assert len(set(list(map(len, self.indices_list)))) == 1, 'All bags should be of the same length!'
        assert list(map(len, self.indices_list))[0] == len(data), 'All bags should contain `len(data)` number of elements!'
        self.models_list = []
        for bag in range(self.num_bags):
            model = model_constructor()
            data_bag, target_bag = data[self.indices_list[bag]], target[self.indices_list[bag]] # Your Code Here
            self.models_list.append(model.fit(data_bag, target_bag)) # store fitted models here
        if self.oob:
            self.data = data
            self.target = target
        
    def predict(self, data):
        '''
        Get average prediction for every object from passed dataset
        '''
        # Your code here
        last_pred = 0
        for model in self.models_list:
            last_pred += model.predict(data)
        return last_pred / len(self.models_list)

    def _get_oob_predictions_from_every_model(self):
        '''
        Generates list of lists, where list i contains predictions for self.data[i] object
        from all models, which have not seen this object during training phase
        '''
        list_of_predictions_lists = [[] for _ in range(len(self.data))]
        # Your Code Here
        for i in range(len(self.data)): # iterate over every object in the training data
            for j, bag__indx in enumerate(self.indices_list): # iterate over every bag of indices
                if i not in bag__indx: # the index is not in the bag, so model j has not seen this object
                    list_of_predictions_lists[i].append(self.models_list[j].predict([self.data[i]]))
        
        self.list_of_predictions_lists = np.array(list_of_predictions_lists, dtype=object)
    
    def _get_averaged_oob_predictions(self):
        '''
        Compute average prediction for every object from training set.
        If object has been used in all bags on training phase, return None instead of prediction
        '''
        self._get_oob_predictions_from_every_model()
        # len() works whether the entry is a plain list or a numpy array
        self.oob_predictions = [np.mean(pred_list) if len(pred_list) > 0 else None for pred_list in self.list_of_predictions_lists] # Your Code Here
        # print(self.list_of_predictions_lists)
        # print(self.oob_predictions)

    def OOB_score(self):
        '''
        Compute mean square error for all objects, which have at least one prediction
        '''
        self._get_averaged_oob_predictions()
        
        target = np.array([self.target[i] for i, mean_pred in enumerate(self.oob_predictions) if mean_pred is not None]).reshape(-1, 1)
        predictions = np.array([[mean_pred] for mean_pred in self.oob_predictions if mean_pred is not None])
        # print(target[:2])
        # print(predictions[:2])
        return  np.mean((predictions - target) ** 2) # Your Code Here
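
The nested loop in `_get_oob_predictions_from_every_model` calls `predict` once per out-of-bag (object, model) pair, and `i not in bag__indx` scans the whole bag each time, which is likely why the local tests take tens of seconds. One possible alternative, sketched here as a standalone function and not as the expected solution, builds a boolean out-of-bag mask per model and predicts for all of that model's OOB objects in a single call. The function name `oob_mse` and its arguments are hypothetical; the sketch assumes each model exposes a scikit-learn style `predict` and that `target` is one-dimensional.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def oob_mse(models, indices_list, data, target):
    """Hypothetical helper: OOB mean squared error from fitted models and their bags."""
    n = len(data)
    pred_sum = np.zeros(n)    # sum of OOB predictions per object
    pred_count = np.zeros(n)  # number of models that did not see the object
    for model, bag in zip(models, indices_list):
        oob_mask = np.ones(n, dtype=bool)
        oob_mask[bag] = False  # objects in the bag were seen by this model
        if oob_mask.any():
            pred_sum[oob_mask] += model.predict(data[oob_mask])
            pred_count[oob_mask] += 1
    has_pred = pred_count > 0  # skip objects that every model has seen
    oob_pred = pred_sum[has_pred] / pred_count[has_pred]
    return np.mean((oob_pred - target[has_pred]) ** 2)


# Usage on synthetic data with a linear target, as in the simple tests.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
y = X.mean(axis=1)
bags = [rng.integers(0, len(X), size=len(X)) for _ in range(10)]
models = [LinearRegression().fit(X[bag], y[bag]) for bag in bags]
score = oob_mse(models, bags, X, y)
print(score)
```

The sketch keeps running sums and counts and does not store per-object prediction lists, so it does not populate `list_of_predictions_lists`, which the local tests inspect.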

### Local tests:

In [143]:
from sklearn.linear_model import LinearRegression
from tqdm.auto import tqdm

#### Simple tests:

In [144]:
for _ in tqdm(range(100)):
    X = np.random.randn(2000, 10)
    y = np.mean(X, axis=1)
    bagging_regressor = SimplifiedBaggingRegressor(num_bags=10, oob=True)
    bagging_regressor.fit(LinearRegression, X, y)
    predictions = bagging_regressor.predict(X)
    assert np.mean((predictions - y)**2) < 1e-6, 'Linear dependency should be fitted with almost zero error!'
    assert bagging_regressor.oob, 'OOB feature must be turned on'
    oob_score = bagging_regressor.OOB_score()
    # print(oob_score)
    assert oob_score < 1e-6, 'OOB error for linear dependency should be also close to zero!'
    assert abs(
        np.mean(
            list(map(len, bagging_regressor.list_of_predictions_lists))
        ) / bagging_regressor.num_bags - 1/np.exp(1)) < 0.1, 'Probability of missing a bag should be close to theoretical value!'
    
print('Simple tests done!')

100%|██████████| 100/100 [00:37<00:00,  2.68it/s]

Simple tests done!
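
The constant `1/np.exp(1)` in the last assertion comes from the standard bootstrap argument: a bag consists of n independent draws from n objects, and each draw misses a given object with probability 1 - 1/n. A short derivation:

```latex
P(\text{object } i \notin \text{bag})
  = \left(1 - \frac{1}{n}\right)^{n}
  \xrightarrow[n \to \infty]{} e^{-1} \approx 0.368
```

So the mean number of OOB predictions per object, divided by `num_bags`, should be close to 0.368 when the dataset is large.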





#### Medium tests

In [145]:
for _ in tqdm(range(10)):
    X = np.random.randn(200, 150)
    y = np.random.randn(len(X))
    bagging_regressor = SimplifiedBaggingRegressor(num_bags=20, oob=True)
    bagging_regressor.fit(LinearRegression, X, y)
    predictions = bagging_regressor.predict(X)
    average_train_error = np.mean((predictions - y)**2)
    assert bagging_regressor.oob, 'OOB feature must be turned on'
    oob_score = bagging_regressor.OOB_score()
    assert oob_score > average_train_error, 'OOB error must be higher than train error due to overfitting!'
    assert abs(
        np.mean(
            list(map(len, bagging_regressor.list_of_predictions_lists))
        ) / bagging_regressor.num_bags - 1/np.exp(1)) < 0.1, 'Probability of missing a bag should be close to theoretical value!'
    
print('Medium tests done!')

100%|██████████| 10/10 [00:05<00:00,  1.78it/s]

Medium tests done!





#### Complex tests:

In [146]:
for _ in tqdm(range(10)):
    X = np.random.randn(2000, 15)
    y = np.random.randn(len(X))
    bagging_regressor = SimplifiedBaggingRegressor(num_bags=100, oob=True)
    bagging_regressor.fit(LinearRegression, X, y)
    predictions = bagging_regressor.predict(X)
    oob_score = bagging_regressor.OOB_score()
    assert abs(
        np.mean(
            list(map(len, bagging_regressor.list_of_predictions_lists))
        ) / bagging_regressor.num_bags - 1/np.exp(1)) < 1e-2, 'Probability of missing a bag should be close to theoretical value!'
    
print('Complex tests done!')

100%|██████████| 10/10 [00:37<00:00,  3.71s/it]

Complex tests done!





In [147]:
np.mean(
    list(map(len, bagging_regressor.list_of_predictions_lists))
) / bagging_regressor.num_bags - 1/np.exp(1)

0.0006805588285576647

Great job! Save `SimplifiedBaggingRegressor` to `bagging.py` and submit your solution to the grading system.