## Home assignment 05: Bagging and OOB score

Please fill in the missing lines in the code below.
This is a simplified version of `BaggingRegressor` from `sklearn`. Note that the `sklearn` API is **not preserved**.

Your algorithm should be able to train different instances of the same model class on bootstrapped datasets and to provide [OOB score](https://en.wikipedia.org/wiki/Out-of-bag_error) for the training set.

The model should be passed as a model class, with no explicit parameters and no parentheses.

Example:
```
import numpy as np
from sklearn.linear_model import LinearRegression

# toy data, only for illustration
X = np.random.randn(100, 5)
y = np.mean(X, axis=1)

bagging_regressor = SimplifiedBaggingRegressor(num_bags=10, oob=True)
bagging_regressor.fit(LinearRegression, X, y)
```
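
To make the out-of-bag idea concrete, here is a minimal sketch of one bootstrap bag and the objects it leaves out. The variable names (`rng`, `n_objects`, `oob_mask`) and the fixed seed are assumptions made for illustration; they are not part of the assignment template.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only to make the sketch reproducible
n_objects = 8

# One bootstrap bag: n_objects indices drawn with replacement.
bag_indices = rng.integers(0, n_objects, size=n_objects)

# Objects whose index never appears in the bag are out-of-bag for that model.
oob_mask = ~np.isin(np.arange(n_objects), bag_indices)
oob_indices = np.flatnonzero(oob_mask)

print('bag indices:', bag_indices)
print('out-of-bag indices:', oob_indices)
```

A model trained on `bag_indices` can then be evaluated on `oob_indices` without touching data it has already seen.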

In [1]:
import numpy as np
import random

In [8]:
class SimplifiedBaggingRegressor:
    def __init__(self, num_bags, oob=False):
        self.num_bags = num_bags
        self.oob = oob
        
    def _generate_splits(self, data: np.ndarray):
        '''
        Generate indices for every bag and store in self.indices_list list
        '''
        self.indices_list = []
        data_length = len(data)
        for bag in range(self.num_bags):
            # sample data_length indices with replacement (bootstrap)
            bag_ind = np.random.choice(data_length, size=data_length)
            self.indices_list.append(bag_ind)

    def fit(self, model_constructor, data, target):
        '''
        Fit model on every bag.
        Model constructor with no parameters (and with no ()) is passed to this function.
        
        example:
        
        bagging_regressor = SimplifiedBaggingRegressor(num_bags=10, oob=True)
        bagging_regressor.fit(LinearRegression, X, y)
        '''
        self.data = None
        self.target = None
        self._generate_splits(data)
        assert len(set(list(map(len, self.indices_list)))) == 1, 'All bags should be of the same length!'
        assert list(map(len, self.indices_list))[0] == len(data), 'All bags should contain `len(data)` number of elements!'
        self.models_list = []
        for bag in range(self.num_bags):
            model = model_constructor()
            bag_indices = self.indices_list[bag]
            data_bag, target_bag = data[bag_indices], target[bag_indices]
            model.fit(data_bag, target_bag)  # do not rely on fit() returning the model
            self.models_list.append(model)  # store fitted models here

        if self.oob:
            self.data = data
            self.target = target
        
    def predict(self, data):
        '''
        Get average prediction for every object from passed dataset
        '''
        predictions = list()
        for m in self.models_list:
            predictions.append(m.predict(data))
        preds = np.stack(predictions)
        
        return np.mean(preds, axis=0)
    
    def _get_oob_predictions_from_every_model(self):
        '''
        Generates list of lists, where list i contains predictions for self.data[i] object
        from all models, which have not seen this object during training phase
        '''
        list_of_predictions_lists = [[] for _ in range(len(self.data))]
        
        targets_list = [[] for _ in range(len(self.data))]
        
        for i in range(len(self.data)):
            not_seen_m = []
            for b in range(self.num_bags):
                if i not in self.indices_list[b]:
                    not_seen_m.append(self.models_list[b])
            
            list_of_predictions_lists[i] = [m.predict(self.data[i].reshape(1, -1)) for m in not_seen_m]
            targets_list[i] = self.target[i]

        self.list_of_predictions_lists = np.array(list_of_predictions_lists, dtype=object)
        self.targets_list = np.array(targets_list, dtype=object)
    
    def _get_averaged_oob_predictions(self):
        '''
        Compute aggregated OOB prediction for every object from training set.
        Objects used in all bags on training phase have no OOB prediction and are skipped
        '''
        #unique, counts = np.unique(self.indices_list, return_counts=True)
        #d = dict(zip(unique, counts))
        #c = [v for v,k in d.items() if k==self.num_bags ]
        self._get_oob_predictions_from_every_model()
        pred = []
        used_targets = []
        for i in range(len(self.list_of_predictions_lists)):
            if len(self.list_of_predictions_lists[i]) == 0:
                # alternative: keep a NaN placeholder for objects without OOB predictions
                #pred.append(np.nan)
                #used_targets.append(np.nan)
                pass
            else:
                # alternative: plain average over the OOB models
                #pred.append((sum(self.list_of_predictions_lists[i])/len(self.list_of_predictions_lists[i]))[0])
                # most frequent prediction (majority vote) among the OOB models
                pred.append(max(self.list_of_predictions_lists[i], key=self.list_of_predictions_lists[i].count)[0])
                used_targets.append(self.targets_list[i])

        # if NaNs are appended, they have to be replaced with 0
        #self.oob_predictions =np.nan_to_num(np.array(pred, dtype=object))
        #self.used_targets = np.nan_to_num(np.array(used_targets, dtype=object))

        self.oob_predictions = np.array(pred, dtype=object)
        self.used_targets = np.array(used_targets, dtype=object)
        

    def OOB_score(self):
        '''
        Compute mean square error for all objects, which have at least one prediction
        '''
        self._get_averaged_oob_predictions()

        # if NaN placeholders are used
        #np.nanmean((np.subtract(self.oob_predictions,self.used_targets )**2))
        
        return np.mean(((self.oob_predictions-self.used_targets)**2))

### Local tests:

In [9]:
from sklearn.linear_model import LinearRegression
from tqdm.auto import tqdm

#### Simple tests:

In [10]:
for _ in tqdm(range(100)):
    X = np.random.randn(2000, 10)
    y = np.mean(X, axis=1)
    bagging_regressor = SimplifiedBaggingRegressor(num_bags=10, oob=True)
    bagging_regressor.fit(LinearRegression, X, y)
    predictions = bagging_regressor.predict(X)
    assert np.mean((predictions - y)**2) < 1e-6, 'Linear dependency should be fitted with almost zero error!'
    assert bagging_regressor.oob, 'OOB feature must be turned on'
    oob_score = bagging_regressor.OOB_score()
    assert oob_score < 1e-6, 'OOB error for linear dependency should be also close to zero!'
    assert abs(
        np.mean(
            list(map(len, bagging_regressor.list_of_predictions_lists))
        ) / bagging_regressor.num_bags - 1/np.exp(1)) < 0.1, 'Probability of missing a bag should be close to theoretical value!'
    
print('Simple tests done!')

Simple tests done!


#### Medium tests

In [11]:
for _ in tqdm(range(10)):
    X = np.random.randn(200, 150)
    y = np.random.randn(len(X))
    bagging_regressor = SimplifiedBaggingRegressor(num_bags=20, oob=True)
    bagging_regressor.fit(LinearRegression, X, y)
    predictions = bagging_regressor.predict(X)
    average_train_error = np.mean((predictions - y)**2)
    assert bagging_regressor.oob, 'OOB feature must be turned on'
    oob_score = bagging_regressor.OOB_score()
    assert oob_score > average_train_error, 'OOB error must be higher than train error due to overfitting!'
    assert abs(
        np.mean(
            list(map(len, bagging_regressor.list_of_predictions_lists))
        ) / bagging_regressor.num_bags - 1/np.exp(1)) < 0.1, 'Probability of missing a bag should be close to theoretical value!'
    
print('Medium tests done!')

Medium tests done!
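
The medium tests expect the OOB error to exceed the train error because a linear model with 150 features and only 200 objects overfits pure noise. A rough sketch of that effect, using plain least squares from `numpy` instead of `LinearRegression` (an assumption made to keep the example self-contained; there is no intercept and no bagging here):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
n_objects, n_features = 200, 150

X_train = rng.standard_normal((n_objects, n_features))
y_train = rng.standard_normal(n_objects)  # target is pure noise
X_new = rng.standard_normal((n_objects, n_features))
y_new = rng.standard_normal(n_objects)    # fresh, unseen noise

# Ordinary least squares fit on the training data.
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_mse = np.mean((X_train @ coef - y_train) ** 2)
new_mse = np.mean((X_new @ coef - y_new) ** 2)
print('train MSE:', train_mse)
print('MSE on unseen data:', new_mse)
```

OOB predictions play the role of the unseen data here: each object is scored only by models that never saw it.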


#### Complex tests:

In [12]:
for _ in tqdm(range(10)):
    X = np.random.randn(2000, 15)
    y = np.random.randn(len(X))
    bagging_regressor = SimplifiedBaggingRegressor(num_bags=100, oob=True)
    bagging_regressor.fit(LinearRegression, X, y)
    predictions = bagging_regressor.predict(X)
    oob_score = bagging_regressor.OOB_score()
    assert abs(
        np.mean(
            list(map(len, bagging_regressor.list_of_predictions_lists))
        ) / bagging_regressor.num_bags - 1/np.exp(1)) < 1e-2, 'Probability of missing a bag should be close to theoretical value!'
    
print('Complex tests done!')

Complex tests done!


In [13]:
np.mean(
            list(map(len, bagging_regressor.list_of_predictions_lists))
        ) / bagging_regressor.num_bags - 1/np.exp(1)

0.0005605588285576557
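
The "theoretical value" used in the tests comes from the bootstrap itself: an object is absent from a bag of size n drawn with replacement with probability (1 - 1/n)^n, which approaches 1/e (about 0.368) as n grows. A small check of that limit; the helper name `prob_out_of_bag` is hypothetical:

```python
import numpy as np

def prob_out_of_bag(n: int) -> float:
    """Probability that a fixed object is absent from a bootstrap bag of size n."""
    return (1.0 - 1.0 / n) ** n

for n in (10, 200, 2000):
    p = prob_out_of_bag(n)
    print(n, p, abs(p - np.exp(-1)))
```

The gap to 1/e shrinks as n grows, which is why the complex tests with 2000 objects and 100 bags can use the tighter tolerance of `1e-2`.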

Great job! Please save `SimplifiedBaggingRegressor` to `bagging.py` and submit your solution to the grading system!

What was done:
- for each bag (10 in the simple tests), indices were drawn at random with replacement and used to select random objects
- each model was trained on its own "bag" of objects
- OOB predictions were obtained only on objects the model had not been trained on: for each index I checked which bags contained it and did not pass it to the corresponding models. In parallel, I stored the targets of the objects that received a prediction from at least one model
- as the final prediction I took a majority vote, as described in the recommended article on OOB (in practice, averaging the predictions of all models gave the same result, i.e. the tests passed with it as well)
- for simplicity, for objects that ended up in all bags I did not return None but simply skipped them in the list that stores the averaged OOB predictions; alternatively, one can append np.nan to the list, then replace NaN with 0 and compute the mean that way, but this gives a similar result with more operations.
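
The alternative mentioned in the last two points (averaging the OOB predictions and keeping NaN placeholders) could look roughly like the sketch below. The helper name `averaged_oob_mse` and the toy inputs are hypothetical. It also uses `np.nanmean` to ignore missing entries rather than replacing NaN with 0, so objects without OOB predictions do not affect the score.

```python
import numpy as np

def averaged_oob_mse(predictions_lists, targets):
    """Average OOB predictions per object and compute MSE over objects that have any.

    predictions_lists[i] holds the predictions for object i from the models
    that did not see it; an empty list means the object was in every bag.
    """
    averaged = np.array(
        [np.mean(preds) if len(preds) > 0 else np.nan for preds in predictions_lists],
        dtype=float,
    )
    targets = np.asarray(targets, dtype=float)
    # np.nanmean skips objects whose averaged prediction is NaN.
    mse = np.nanmean((averaged - targets) ** 2)
    return averaged, mse

# Toy input: the second object was seen by every model, so it has no OOB prediction.
toy_predictions = [[1.0, 3.0], [], [2.0]]
toy_targets = [2.0, 5.0, 4.0]
averaged, mse = averaged_oob_mse(toy_predictions, toy_targets)
print(averaged, mse)
```

For regression, averaging is the usual aggregation; a majority vote over continuous predictions rarely finds repeated values.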