# Random Forest (Classification) Hello World

Random Forest is an ensemble of decision trees: it trains many trees and combines their predictions.

![image.png](attachment:image.png)

A decision tree works like an if-else flowchart: it gives a "reason" for every decision it makes. Unlike most ML algorithms, which cannot be easily explained, decision tree algorithms provide high **interpretability**: every decision can be traced through a flowchart rather than through a set of opaque trained numbers.
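To make the flowchart idea concrete, here is a minimal sketch that fits one shallow tree on the iris dataset used later in this notebook and prints its rules as text. The depth limit of 2 and the `random_state` value are illustrative assumptions, not settings taken from the references.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A deliberately shallow tree so the printed rules stay short
# (max_depth=2 and random_state=0 are assumptions for illustration)
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# export_text renders the fitted tree as nested if-else rules
rules = export_text(tree, feature_names=iris.feature_names)
print(rules)
```

Each indented line in the printed output is one question the tree asks, and each leaf line names the predicted class.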

A single tree can only ask a limited number of simple if-else questions. At each node, the tree picks the question (a feature and a threshold) that best separates the classes; the improvement in class purity produced by a split is called **information gain**. By building many trees, a random forest can ask many more questions and tell apart subtle differences between two things that one tree might miss.

However, this only works if the trees make useful splits. Data whose labels are thoroughly mixed has high **Entropy** or **Gini Impurity**, so a good split is one that lowers the impurity of the resulting branches, and a split that leaves the impurity unchanged tells us nothing.
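The three terms above can be computed by hand. The sketch below uses only the standard library; the helper names (`entropy`, `gini`, `information_gain`) and the toy labels are made up for illustration and are not part of scikit-learn.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two branches."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Toy node with two perfectly mixed classes (hypothetical data)
parent = ["a"] * 5 + ["b"] * 5

# A split that separates the classes completely
good = information_gain(parent, ["a"] * 5, ["b"] * 5)

# A split that leaves both branches almost as mixed as the parent
poor = information_gain(parent, ["a", "a", "b", "b", "b"], ["a", "a", "a", "b", "b"])

print("parent entropy:", entropy(parent))
print("parent gini:", gini(parent))
print("gain of good split:", good)
print("gain of poor split:", poor)
```

The clean split recovers the full entropy of the parent node, while the poor split gains almost nothing, which is why a tree prefers the first question.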

Random Forest is an **ensemble method**: it combines many small decision trees, each trained on a bootstrap sample of the data with a random subset of features considered at each split.
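As a rough illustration of what "ensemble" means in scikit-learn, the sketch below fits a small forest and checks that its class probabilities equal the average of the probabilities from its individual trees. The forest size of 10 and `random_state=0` are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

# A small forest keeps the example fast (n_estimators=10 is an assumption)
forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(iris.data, iris.target)

# The fitted trees are stored in estimators_; collect each tree's class probabilities
per_tree = np.stack([t.predict_proba(iris.data) for t in forest.estimators_])
averaged = per_tree.mean(axis=0)

print("number of trees:", len(forest.estimators_))
print("forest equals tree average:", np.allclose(averaged, forest.predict_proba(iris.data)))
```

The forest's predicted class is then the one with the highest averaged probability.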

For more details, please read the references below and the original papers.


**References:**
- https://medium.com/jameslearningnote/%E8%B3%87%E6%96%99%E5%88%86%E6%9E%90-%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92-%E7%AC%AC3-5%E8%AC%9B-%E6%B1%BA%E7%AD%96%E6%A8%B9-decision-tree-%E4%BB%A5%E5%8F%8A%E9%9A%A8%E6%A9%9F%E6%A3%AE%E6%9E%97-random-forest-%E4%BB%8B%E7%B4%B9-7079b0ddfbda
- https://www.datacamp.com/community/tutorials/random-forests-classifier-python
- https://chtseng.wordpress.com/2017/02/24/%E9%9A%A8%E6%A9%9F%E6%A3%AE%E6%9E%97random-forest/


In [1]:
# Dataset

# Load example dataset from scikit-learn dataset library
from sklearn import datasets
iris = datasets.load_iris()


In [2]:
# Print the names of the four features
print(iris.feature_names)

# Print the species names used as labels (setosa, versicolor, virginica)
print(iris.target_names)

# Print the first 5 rows of feature data
print(iris.data[0:5])

# Print the first 5 labels (0: setosa, 1: versicolor, 2: virginica)
print(iris.target[0:5])


['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
['setosa' 'versicolor' 'virginica']
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
[0 0 0 0 0]


In [3]:
# Split the dataset into a training set (70%) and a test set (30%).
# No random_state is set, so the split (and the accuracy below) changes on every run.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3)
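If a repeatable split is wanted, one option is to fix the random seed and stratify by label so each species is equally represented. This is a sketch only: the seed value 42, the `stratify` choice, and the variable names `X_tr`, `X_te`, `y_tr`, `y_te` are assumptions and are not used by the cells that follow.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# random_state fixes the shuffle; stratify keeps the class proportions in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target)

print("train / test shapes:", X_tr.shape, X_te.shape)
print("test samples per class:", np.bincount(y_te))
```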


In [4]:
# Import the Random Forest classifier from the ensemble module
from sklearn.ensemble import RandomForestClassifier

# Create a Random Forest classifier
rfc = RandomForestClassifier(n_estimators=100) # 100 trees will be created

# Train the model on the training set
rfc.fit(X_train, y_train)

# Predict the labels of the test set
y_pred = rfc.predict(X_test)


In [7]:
# Evaluate accuracy with the scikit-learn metrics module
from sklearn import metrics
print("Accuracy: ", metrics.accuracy_score(y_test, y_pred))


Accuracy:  0.9555555555555556
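Accuracy is a single number, so it can hide which species get confused with each other. A confusion matrix and a per-class report give a closer look. The sketch below retrains its own model so it runs on its own; the seed value 0 and the names `model`, `pred`, and `cm` are assumptions, and its numbers will not necessarily match the accuracy printed above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_te)

# Rows are true species, columns are predicted species
cm = confusion_matrix(y_te, pred, labels=[0, 1, 2])
print(cm)

# Precision, recall and F1 for each species
print(classification_report(y_te, pred, labels=[0, 1, 2],
                            target_names=iris.target_names))
```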


In [9]:
# Summarize how much each feature contributes to the forest's decisions
# using the "feature importance" scores (higher means more influential)
import pandas as pd
pd.Series(rfc.feature_importances_, index=iris.feature_names).sort_values()


sepal width (cm)     0.027100
sepal length (cm)    0.091697
petal width (cm)     0.411409
petal length (cm)    0.469795
dtype: float64
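The scores above are normalized to sum to 1, and they are aggregated over the individual trees. To get a feel for how stable each score is, one can also look at its spread across trees. The sketch below is self-contained and retrains on the full dataset with an assumed seed of 0, so its values will differ from the output above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(iris.data, iris.target)

importances = model.feature_importances_

# Standard deviation of each feature's importance across the individual trees
spread = np.std([t.feature_importances_ for t in model.estimators_], axis=0)

# Print features from least to most important
for name, score, sd in sorted(zip(iris.feature_names, importances, spread),
                              key=lambda row: row[1]):
    print(f"{name:20s} {score:.3f} +/- {sd:.3f}")

print("sum of importances:", importances.sum())
```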