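### Assignment goal: implement tree-based models

In this lesson we implemented a decision tree model that computes information gain with entropy. Besides entropy, information gain can also be computed with Gini impurity. This assignment therefore asks the reader to implement Gini-based information gain and to build a random forest model on top of the decision tree from the course.

The `decision_tree_functions.py` file in the assignment folder contains all the functions that have already been implemented; feel free to reuse them throughout the assignment.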

### Q1: Computing information gain with Gini

$$
Gini = \sum_{i=1}^cp(i)(1-p(i)) = 1 - \sum_{i=1}^cp(i)^2
$$
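
As a quick check of the formula, take the toy dataset used in this assignment, which contains 3 Apple, 4 Grape, and 3 Lemon samples out of 10, so the class probabilities are 0.3, 0.4, and 0.3:

$$
Gini = 1 - \left(0.3^2 + 0.4^2 + 0.3^2\right) = 1 - 0.34 = 0.66
$$

A node that contains a single class has $Gini = 1 - 1^2 = 0$, the lowest possible impurity.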

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
%cd './drive/My Drive/NLP/day27'

/content/drive/My Drive/NLP/day27


In [3]:
import pandas as pd
import numpy as np
import random
from decision_tree_functions import decision_tree, train_test_split

In [4]:
# Use the same toy data as in the course

training_data = [
    ['Green', 3.1, 'Apple'],
    ['Red', 3.2, 'Apple'],
    ['Red', 1.2, 'Grape'],
    ['Red', 1, 'Grape'],
    ['Yellow', 3.3, 'Lemon'],
    ['Yellow', 3.1, 'Lemon'],
    ['Green', 3, 'Apple'],
    ['Red', 1.1, 'Grape'],
    ['Yellow', 3, 'Lemon'],
    ['Red', 1.2, 'Grape'],
]

header = ["color", "diameter", "label"]

df = pd.DataFrame(data=training_data, columns=header)
df.head()

|   | color  | diameter | label |
|---|--------|----------|-------|
| 0 | Green  | 3.1      | Apple |
| 1 | Red    | 3.2      | Apple |
| 2 | Red    | 1.2      | Grape |
| 3 | Red    | 1.0      | Grape |
| 4 | Yellow | 3.3      | Lemon |


In [5]:
# Gini impurity
def calculate_gini(data):

    # get the labels (last column) of the data
    ###<your code>###
    data_label = data[:, -1]

    # get the unique classes in the input data and their counts
    ###<your code>###
    unique_class, counts = np.unique(data_label, return_counts=True)

    # compute the probability of each class
    ###<your code>###
    probability = counts / counts.sum()

    # compute the Gini impurity
    ###<your code>###
    gini = 1 - np.sum(probability ** 2)

    return gini
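
How `decision_tree` turns this impurity into a gain is defined in `decision_tree_functions.py`, which is not shown here. A common approach, assumed in the sketch below, is to subtract the size-weighted impurity of the two child nodes from the impurity of the parent. The helper names `gini` and `gini_gain` are hypothetical and are not part of the course code.

```python
import numpy as np

def gini(labels):
    # Gini impurity of a 1-D array of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_gain(parent, left, right):
    # Hypothetical helper: parent impurity minus the size-weighted child impurity
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

# Labels and diameters of the toy dataset
labels = np.array(['Apple', 'Apple', 'Grape', 'Grape', 'Lemon',
                   'Lemon', 'Apple', 'Grape', 'Lemon', 'Grape'])
diameter = np.array([3.1, 3.2, 1.2, 1.0, 3.3, 3.1, 3.0, 1.1, 3.0, 1.2])

# Candidate split: diameter <= 1.2 isolates the four Grape samples
mask = diameter <= 1.2
gain = gini_gain(labels, labels[mask], labels[~mask])
print(round(float(gini(labels)), 2), round(float(gain), 2))
```

Under this assumption the split `diameter <= 1.2` leaves a pure Grape node (impurity 0) and a node with 3 Apple and 3 Lemon (impurity 0.5), so the gain is 0.66 - 0.6 × 0.5 = 0.36.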

In [6]:
# Split the dataset
###<your code>###
train_df, test_df = train_test_split(df, 0.2)

# Train a decision tree with Gini impurity as the metric_function
tree = decision_tree(metric_function=calculate_gini, task_type='classification', counter=0, min_samples=2, max_depth=5)
tree.fit(train_df)

{'diameter <= 1.2': ['Grape', {'color = Yellow': ['Lemon', 'Apple']}]}

In [7]:
# Predict with the fitted tree
sample = test_df.iloc[0]
###<your code>###
tree.pred(sample, tree.sub_tree)

'Lemon'

In [8]:
sample

color       Yellow
diameter         3
label        Lemon
Name: 8, dtype: object

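### Q2: Implementing a random forest
Use the decision tree to implement a random forest model; the random forest lecture notes can serve as a reference.

This assignment only requires random sampling of the training data. For random sampling of the features used in training, see the `get_potential_splits` and `decision_tree` parts of `decision_tree_functions.py` (which add the parameter `random_features`).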
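
The two kinds of randomness mentioned above can be sketched with the standard library alone. This is a minimal illustration, not the code in `decision_tree_functions.py`: the function names `sample_rows` and `sample_features` and the `seed` parameter are hypothetical. Rows are drawn with replacement, as in a classic bootstrap, while feature columns are drawn without replacement.

```python
import random

def sample_rows(n_rows, n_bootstrap, seed=None):
    # Draw row positions with replacement, so a row may appear more than once
    rng = random.Random(seed)
    return [rng.randrange(n_rows) for _ in range(n_bootstrap)]

def sample_features(n_columns, n_features, seed=None):
    # Draw distinct feature column positions (without replacement)
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_columns), k=n_features))

# Example: 8 training rows, 2 feature columns (color, diameter)
row_positions = sample_rows(n_rows=8, n_bootstrap=5, seed=0)
feature_positions = sample_features(n_columns=2, n_features=1, seed=0)
print(row_positions, feature_positions)
```

With a pandas DataFrame, the sampled row positions could then be passed to `iloc` to build the training set of one tree.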

In [9]:
class random_forest():
    '''Random forest model
    Parameters
    ----------
    n_bootstrap: int
        number of samples drawn to train each individual decision tree
    n_trees: int
        number of trees in the forest
    '''

    def __init__(self, n_bootstrap, n_trees, task_type, min_samples, max_depth, metric_function, n_features=None):
        self.n_bootstrap = n_bootstrap
        self.n_trees = n_trees
        self.task_type = task_type
        self.min_samples = min_samples
        self.max_depth = max_depth
        self.metric_function = metric_function
        self.n_features = n_features

    def bootstrapping(self, train_df, n_bootstrap):
        # sample the rows used to train an individual tree
        # (random.sample draws without replacement)
        indices = list(train_df.index)
        bootstrap_indices = random.sample(population=indices, k=n_bootstrap)

        # the sampled values are index labels of train_df, so select with .loc
        df_bootstrapped = train_df.loc[bootstrap_indices, :]

        # resample if all drawn samples share the same label
        labels = df_bootstrapped.iloc[:, -1].values
        if len(np.unique(labels)) == 1:
            df_bootstrapped = self.bootstrapping(train_df, n_bootstrap)

        return df_bootstrapped

    def fit(self, train_df):

        self.forest = []

        for i in range(self.n_trees):
            # train each tree on its own random sample of the training data
            df_bootstrapped = self.bootstrapping(train_df, self.n_bootstrap)
            tree = decision_tree(metric_function=self.metric_function, task_type=self.task_type, counter=0, min_samples=self.min_samples, max_depth=self.max_depth)
            tree.fit(df_bootstrapped)
            self.forest.append(tree)

        # Variant that also passes n_features to decision_tree:
        # tree = decision_tree(self.metric_function, self.task_type, 0, self.min_samples, self.max_depth, self.n_features)

        return self.forest

    def pred(self, test_df):
        df_predictions = {}

        # reset the index first, since it is out of order after the split,
        # then drop the label column
        test_df = test_df.reset_index(drop=True)
        test_df = test_df.iloc[:, :-1]

        #########
        # Note: the decision tree can only predict "one sample" at a time
        #########

        for j in range(len(test_df)):
            pred_list = []
            for i in range(self.n_trees):
                predictions = self.forest[i].pred(test_df.iloc[j, :], self.forest[i].sub_tree)
                # https://stackoverflow.com/questions/17839973/constructing-pandas-dataframe-from-values-in-variables-gives-valueerror-if-usi
                pred_list.append(predictions)
            df_predictions[f'{j}th data'] = pred_list

        # rows are trees, columns are test samples
        df_predictions = pd.DataFrame(df_predictions)
        print(df_predictions)

        random_forest_predictions = []

        if self.task_type == 'classification':
            # majority voting: most frequent label in each column
            for i in range(len(test_df)):
                all_classifier_per_data = df_predictions.iloc[:, i].values
                labels, counts = np.unique(all_classifier_per_data, return_counts=True)
                random_forest_predictions.append(labels[np.argmax(counts)])

        else:
            # regression: average the predictions of all trees
            for i in range(len(test_df)):
                all_classifier_per_data = df_predictions.iloc[:, i].values
                mean = np.mean(all_classifier_per_data)
                random_forest_predictions.append(mean)

        return random_forest_predictions

In [10]:
train_df, test_df = train_test_split(df, 0.2)

# Build the random forest model
###<your code>###
forest = random_forest(n_bootstrap=5, n_trees=4, task_type='classification', min_samples=2, max_depth=5, metric_function=calculate_gini)
forest.fit(train_df)

[<decision_tree_functions.decision_tree at 0x7f26385767b8>,
 <decision_tree_functions.decision_tree at 0x7f2638576940>,
 <decision_tree_functions.decision_tree at 0x7f2638576978>,
 <decision_tree_functions.decision_tree at 0x7f2638576c50>]

In [11]:
test_df

|   | color  | diameter | label |
|---|--------|----------|-------|
| 1 | Red    | 3.2      | Apple |
| 5 | Yellow | 3.1      | Lemon |


In [12]:
forest.pred(test_df)

  0th data 1th data
0    Apple    Lemon
1    Apple    Lemon
2    Apple    Lemon
3    Apple    Lemon


['Apple', 'Lemon']
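
The voting step can also be looked at in isolation. The sketch below is a standalone illustration with `collections.Counter`, not part of the class above; the helper name `majority_vote` is hypothetical. It takes the per-tree predictions printed above (one row per tree, one column per test sample) and keeps the most frequent label for each test sample.

```python
from collections import Counter

# One row per tree, one column per test sample (copied from the printed table)
per_tree_predictions = [
    ['Apple', 'Lemon'],
    ['Apple', 'Lemon'],
    ['Apple', 'Lemon'],
    ['Apple', 'Lemon'],
]

def majority_vote(per_tree_predictions):
    # Transpose so that each tuple holds all tree votes for one test sample
    votes_per_sample = zip(*per_tree_predictions)
    return [Counter(votes).most_common(1)[0][0] for votes in votes_per_sample]

print(majority_vote(per_tree_predictions))  # ['Apple', 'Lemon']
```

The result is one label per test sample, which here matches the true labels of `test_df`.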