# 1 Write a step-by-step outline of the RF algorithm

## Bootstrap + Feature Selection

Bootstrap is a method for generating new datasets by sampling, with replacement, from an original dataset. Feature selection applies the same idea of random sampling to the variables (features) of the dataset, although each feature is drawn at most once.
<br>
The number of features to select follows a rule commonly attributed to Leo Breiman (2001), where p is the total number of features and m is the number selected:
- For classification: $$m = \sqrt{p}$$
- For regression: $$m = \frac{p}{3}$$
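
As a quick illustration of the rule above, the sketch below computes m for a given p. The helper name `num_features` and the choice to round down while keeping at least one feature are assumptions made for this example; implementations differ in how they round.

```python
import math

def num_features(p: int, task: str) -> int:
    """Hypothetical helper: number of features to sample out of p."""
    if task == "classification":
        m = math.sqrt(p)
    elif task == "regression":
        m = p / 3
    else:
        raise ValueError(f"unknown task: {task!r}")
    # Assumption: round down, but always keep at least one feature
    return max(1, math.floor(m))

print(num_features(4, "classification"))  # 2
print(num_features(9, "regression"))      # 3
```

For a dataset with 4 features, a classification forest would therefore give each tree 2 features under this rounding choice.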

<br>
This technique draws on the concept of the Wisdom of Crowds and is robust against overfitting.

## Model

Train the base learners (decision trees), one on each of the datasets generated by bootstrap.

## Aggregate

Combine the predictions obtained from the individual models. Because the trees are trained on different samples and different features, their results are weakly correlated, so the aggregated prediction has low variance.
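
The link between correlation and variance can be made precise with a standard result. Assume, for illustration only, B trees whose predictions T_b each have variance σ² and pairwise correlation ρ (these symbols are not part of the original notes). The variance of their average is:

$$\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2$$

The second term shrinks as B grows, while the first depends only on ρ. That is why reducing the correlation between trees, and not just adding more trees, lowers the variance of the ensemble.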

---

# 2 Explain Random Forest in your own words

## Bootstrap + Feature selection

Bootstrap is a method for obtaining datasets that are samples of an original dataset. Feature selection is the random selection of features (variables) present in the dataset. Each new dataset has the same number of instances as the original, but fewer features. Its rows are obtained by drawing instances from the original one at a time. In every draw, all instances have the same probability of being chosen, so this is random sampling with replacement and the same instance can appear more than once. The features are also drawn uniformly at random, each at most once. The number of features drawn depends on the type of problem (regression or classification) and on the total number of features.
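
The two sampling steps can be sketched with the standard library alone. The function names below are hypothetical, and drawing the features without repetition is an assumption of this sketch.

```python
import random

def bootstrap_indices(n: int, rng: random.Random) -> list[int]:
    """Draw n row indices uniformly at random, with replacement."""
    return [rng.randrange(n) for _ in range(n)]

def sample_features(p: int, m: int, rng: random.Random) -> list[int]:
    """Draw m distinct feature indices out of p (no repetition)."""
    return sorted(rng.sample(range(p), m))

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
rows = bootstrap_indices(100, rng)
cols = sample_features(4, 2, rng)

print(len(rows))       # 100: same number of instances as the original
print(len(set(rows)))  # fewer than 100, because some rows repeat
print(cols)            # two distinct feature indices
```

Because rows are drawn with replacement, part of the original instances is left out of each bootstrap dataset while others appear several times.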

## Model

Each of these datasets is used to train one decision tree model. These models are called base learners.

## Aggregate

Each tree produces its own prediction.

- For a classification problem, the result is the class that appears most often among the predictions made by the models.
- For a regression problem, the result is the arithmetic mean of the values predicted by the models.
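
The two aggregation rules above can be sketched as follows. The function names are hypothetical, and the handling of ties (the label seen first wins) is simply what `Counter.most_common` does, not a rule stated in these notes.

```python
from collections import Counter
from statistics import mean

def aggregate_classification(predictions):
    """Majority vote: return the most frequent class label."""
    return Counter(predictions).most_common(1)[0][0]

def aggregate_regression(predictions):
    """Arithmetic mean of the individual predictions."""
    return mean(predictions)

print(aggregate_classification([0, 1, 0, 1, 1, 1, 0]))  # 1 (four votes against three)
print(aggregate_regression([2.0, 3.0, 4.0]))            # 3.0
```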

---

# 3 What is the difference between Bagging and Random Forest?

Bagging is an ensemble technique that aims to improve models and reduce overfitting. Random Forest is bagging combined with random feature selection, using decision trees as the base learners.
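
To make the contrast concrete, the sketch below shows which features a tree is allowed to consider under each method. `candidate_features` is a hypothetical helper written for this comparison, and the sqrt(p) subset size assumes a classification problem. In Breiman's formulation the subset is redrawn at every split; drawing it once per dataset, as described in these notes, is a simplification.

```python
import math
import random

def candidate_features(p: int, method: str, rng: random.Random) -> list[int]:
    """Hypothetical helper: features a tree may consider."""
    if method == "bagging":
        # Plain bagging: any of the p features may be used
        return list(range(p))
    if method == "random_forest":
        # Random Forest: only a random subset of size m is available
        m = max(1, math.floor(math.sqrt(p)))  # classification rule of thumb
        return sorted(rng.sample(range(p), m))
    raise ValueError(f"unknown method: {method!r}")

rng = random.Random(0)
print(candidate_features(9, "bagging", rng))        # all nine feature indices
print(candidate_features(9, "random_forest", rng))  # three of the nine indices
```

Both methods train on bootstrap samples; only the restriction on features differs.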

---

# 4 Implement Random Forest in Python

In [1]:
# Library imports
import pandas as pd
import numpy as np

# Modeling
from sklearn.tree import DecisionTreeClassifier

from sklearn.model_selection import train_test_split

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [2]:
# Build the original dataset from random data
num_instancias = 100
data = {
    "feature_1": np.random.rand(num_instancias),
    "feature_2": np.random.randint(1, 10, num_instancias),
    "feature_3": np.random.randn(num_instancias),
    "feature_4": np.random.randint(10, 100, num_instancias),
    "target": np.random.randint(0, 2, num_instancias),
}
df = pd.DataFrame(data)
# Cast every feature column (all but the target) to float64
feature_cols = df.columns[:-1]
df[feature_cols] = df[feature_cols].astype("float64")
df.head()

| | feature_1 | feature_2 | feature_3 | feature_4 | target |
|---|---|---|---|---|---|
| 0 | 0.591064 | 2.0 | -0.238647 | 27.0 | 1 |
| 1 | 0.509598 | 3.0 | 0.186567 | 28.0 | 0 |
| 2 | 0.784422 | 4.0 | -0.965638 | 63.0 | 0 |
| 3 | 0.754248 | 8.0 | 1.677793 | 23.0 | 1 |
| 4 | 0.822343 | 8.0 | -1.365305 | 11.0 | 0 |


In [3]:
# Inspect the variables of the original dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   feature_1  100 non-null    float64
 1   feature_2  100 non-null    float64
 2   feature_3  100 non-null    float64
 3   feature_4  100 non-null    float64
 4   target     100 non-null    int32  
dtypes: float64(4), int32(1)
memory usage: 3.6 KB


## Bootstrap + Feature selection

In [4]:
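def define_colunas(dataset: pd.DataFrame) -> pd.DataFrame:
    '''
    Randomly selects m = sqrt(p) feature columns (the last column is the target)
    '''
    p = len(dataset.columns) - 1
    m = int(np.sqrt(p))
    # Draw m distinct column positions (randint could repeat a column)
    colunas = np.random.choice(p, size=m, replace=False)
    return dataset.iloc[:, colunas].copy()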

In [28]:
def bootstrap(dataset: pd.DataFrame) -> pd.DataFrame:
    '''
    Receives the original dataset and returns a new one obtained by bootstrap
    :param dataset (pd.DataFrame): Original
    :return new_dataset (pd.DataFrame): Bootstrap
    '''
    # Use the dataset passed as argument (the previous version read the global df).
    # Draw len(dataset) rows with replacement, keeping the original index labels.
    new_dataset = dataset.sample(n=len(dataset), replace=True)
    new_dataset = new_dataset.astype("float64")

    # Keep a random subset of the features and re-attach the target column
    new_dataset_feature_select = define_colunas(dataset=new_dataset)
    new_dataset_feature_select.loc[:, 'target'] = new_dataset.loc[:, 'target']
    return new_dataset_feature_select

In [29]:
# Build a list with all the datasets obtained by bootstrap

# 20 datasets will be generated
entradas = 20

# Positions of the bootstrap datasets in the list
set_teste = set(range(entradas))

c = True
while c:

    # Build the list of datasets
    # X = df.iloc[:, :-1]
    # y = df.iloc[:, -1]
    # _, df_test, _, _ = train_test_split(X, y, test_size=0.3, random_state=0)

    df_bootstrap = [bootstrap(df) for _ in range(entradas)]

    # Array with the index labels contained in each of the datasets
    df_index = np.array([list(df_bootstrap[i].index) for i in range(entradas)])
    try:

        # Look for an index of the original dataset that is present in every bootstrap dataset
        save_index = [i for i in df.index.values if set(np.where(df_index == i)[0]) == set_teste][0]

        # If such an index exists, leave the while loop
        c = False
        print(f"Common index: {save_index}")
    except IndexError:
        # The list above is empty: no index appears in all datasets, so resample
        print("No common index")

save_index

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_dataset_feature_select.loc[:, 'target'] = new_dataset.loc[:, 'target']

No common index
No common index
No common index
No common index
No common index
No common index
No common index
No common index
No common index
Common index: 6

6

## Model

In [30]:
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
print(X.shape)
print(y.shape)

(100, 4)
(100,)


In [31]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
print(X_test.shape)
print(y_test.shape)
print(X_train.shape)
print(y_train.shape)

(30, 4)
(30,)
(70, 4)
(70,)


In [32]:
# Instantiate the decision tree
arvore = DecisionTreeClassifier(random_state=123)

# Train the tree
arvore.fit(X_train, y_train)

# Training metric: for a classifier, score() returns the mean accuracy, not R²
acuracia = arvore.score(X_train, y_train)

template = "The accuracy of the tree with depth "
template += f"{arvore.get_depth()} is: {acuracia:.2f}"
print("Training")
print(template)

Training
The accuracy of the tree with depth 9 is: 1.00


In [33]:
X_test['pred'] = arvore.predict(X_test)
X_test['target'] = y_test
X_test.head()

| | feature_1 | feature_2 | feature_3 | feature_4 | pred | target |
|---|---|---|---|---|---|---|
| 26 | 0.160184 | 5.0 | 0.835536 | 87.0 | 0 | 1 |
| 86 | 0.522215 | 7.0 | -0.120786 | 99.0 | 0 | 0 |
| 2 | 0.784422 | 4.0 | -0.965638 | 63.0 | 1 | 0 |
| 55 | 0.515207 | 7.0 | 1.167991 | 18.0 | 0 | 1 |
| 75 | 0.345315 | 7.0 | 0.022101 | 42.0 | 0 | 1 |


In [34]:
# Look up the prediction for the common index and compare it with the original target
X_test.loc[save_index, :]

feature_1     0.593418
feature_2     9.000000
feature_3     2.043655
feature_4    55.000000
pred          1.000000
target        1.000000
Name: 6, dtype: float64

In [70]:
def base_learnings(df_train: pd.DataFrame, save_index: int) -> dict:
    '''
    Builds one base learner.
    Receives a DataFrame and predicts the target for one specific instance
    :param df_train (pd.DataFrame): Dataset to be used for training
    :param save_index (int): Index label of the instance to predict
    :return (dict): Index and prediction (NaN when the instance is not in the test split)
    '''
    X = df_train.iloc[:, :-1]
    y = df_train.iloc[:, -1]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    arvore = DecisionTreeClassifier(random_state=123)
    arvore.fit(X_train, y_train)

    # Work on a copy so the split returned by train_test_split is not modified
    X_test = X_test.copy()
    X_test['pred'] = arvore.predict(X_test)
    X_test['target'] = y_test.to_numpy()

    if save_index in X_test.index:
        # A bootstrap index may repeat; identical rows receive identical predictions
        valor = float(X_test.loc[[save_index], 'pred'].iloc[0])
    else:
        valor = np.nan
    return {"index": save_index, "valor": valor}

In [71]:
# Test the prediction function on one bootstrap dataset
base_learnings(df_train=df_bootstrap[0], save_index=save_index)

{'index': 6, 'valor': 0.0}

In [77]:
# Run the prediction function on every bootstrap dataset
aggregat_list = [base_learnings(df_train=df_b, save_index=save_index)["valor"] for df_b in df_bootstrap]
aggregat_list = np.array(aggregat_list, dtype=float)
# Discard the datasets in which the instance did not fall into the test split
aggregat_list = aggregat_list[~np.isnan(aggregat_list)]
aggregat_list

array([0., 1., 0., 1., 1., 1., 0.])

## Aggregation

In [78]:
# Count the votes for each class and return the most frequent one (majority vote)
num_zeros = (np.array(aggregat_list) == 0.0).sum()
num_ones = (np.array(aggregat_list) != 0.0).sum()
# A tie is resolved in favor of class 1
aggregat = 0.0 if num_zeros > num_ones else 1.0
print(f"Aggregation = {aggregat}")

Aggregation = 1.0
