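# Random Forest
Random forests are well suited to large, high-dimensional datasets, and they can estimate how much each feature contributes to a classification task.
- Random: given N samples with M features each
    - The training samples are random: draw one sample at a time, with replacement, and repeat N times
    - The features are random: select m features at random (m << M)
- Forest: a classifier made up of many decision trees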
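
The two sources of randomness listed above can be sketched in a few lines of standard-library Python. This is only an illustration: the helper names `bootstrap_sample` and `random_feature_subset` and the sizes `N`, `M`, `m` are made up for the example. Note that scikit-learn's `RandomForestClassifier` draws the feature subset at each split (via `max_features`) rather than once per tree.

```python
import random


def bootstrap_sample(n_samples, rng):
    """Draw n_samples row indices with replacement, one draw at a time."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def random_feature_subset(n_features, m, rng):
    """Pick m distinct feature indices out of n_features."""
    return sorted(rng.sample(range(n_features), m))


rng = random.Random(0)   # fixed seed so the sketch is repeatable
N, M, m = 10, 8, 3       # illustrative sizes only

rows = bootstrap_sample(N, rng)
cols = random_feature_subset(M, m, rng)

print("bootstrap rows:", rows)   # repeated indices are expected
print("distinct rows drawn:", len(set(rows)), "of", N)
print("feature subset:", cols)
```

Each tree in the forest would be trained on its own `rows`, which is what makes the trees differ from one another.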

Predicting survival on the Titanic

In [7]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV

In [5]:
# 1. Load the dataset
titanic = pd.read_csv('data/titanic.csv')

# Select the feature columns and the target column
x = titanic[["pclass", "age", "sex"]]
y = titanic["survived"]

In [6]:
# 2. Data preparation
# 1) Fill missing ages with the mean age. assign() returns a new DataFrame,
#    which avoids the SettingWithCopyWarning that an in-place fillna on a
#    column slice triggers.
x = x.assign(age=x["age"].fillna(x["age"].mean()))

# 2) Convert the rows to a list of dicts
x = x.to_dict(orient="records")

# 3) Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

# 4) Dictionary feature extraction
transfer = DictVectorizer()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)


In [8]:
# 3. Random forest estimator
estimator = RandomForestClassifier()
# Hyperparameter grid to search
param_dict = {"n_estimators": [120,200,300,500,800,1200], "max_depth": [5,8,15,25,30]}
estimator = GridSearchCV(estimator, param_grid=param_dict, cv=5)
estimator.fit(x_train, y_train)
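
The introduction mentions that a random forest can rank features by importance. A fitted `RandomForestClassifier` exposes this through its `feature_importances_` attribute. The sketch below uses synthetic data (not the Titanic set) in which only the first column determines the label; the column names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))      # three synthetic features
y = (X[:, 0] > 0).astype(int)      # only column 0 determines the label

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Importances are non-negative and sum to 1 across features
names = ["signal", "noise_a", "noise_b"]   # hypothetical column names
for name, importance in zip(names, forest.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

In this notebook, since `GridSearchCV` refits the best model by default, the same attribute should be reachable as `estimator.best_estimator_.feature_importances_`, with matching column names from `transfer.get_feature_names_out()`.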

In [9]:
# 4. Model evaluation
# Method 1: compare the true and predicted values directly
y_predict = estimator.predict(x_test)
print("y_predict:\n", y_predict)
print("Element-wise comparison of true and predicted values:\n", y_test == y_predict)

# Method 2: compute the accuracy on the test set
score = estimator.score(x_test, y_test)
print("Accuracy:\n", score)

# Best hyperparameters: best_params_
print("Best parameters:\n", estimator.best_params_)
# Best mean cross-validation score: best_score_
print("Best cross-validation score:\n", estimator.best_score_)
# Best estimator: best_estimator_
print("Best estimator:\n", estimator.best_estimator_)
# Cross-validation results: cv_results_
print("Cross-validation results:\n", estimator.cv_results_)

y_predict:
 [0 0 1 1 0 0 0 1 1 0 1 0 0 0 0 0 0 1 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0
 0 0 0 0 1 1 0 1 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 1 1 0 1 0 1 1
 0 1 1 0 0 0 0 1 0 1 0 0 0 1 1 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 1 1 0 0
 0 1 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 1 0
 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 1 1
 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 1 0 1 0 0 0 0 1 0 0 0 1 0 0 0
 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
Element-wise comparison of true and predicted values:
 588      True
672      True
31       True
511      True
792      True
        ...  
946     False
60      False
1291     True
901      True
632      True
Name: survived, Length: 329, dtype: bool
Accuracy:
 0.8480243161094225
Best parameters:
 {'max_depth': 5, 'n_estimators': 800}
Best cross-validation score:
 0.8150471356055112
Best estimator:
 Random