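## Classification Algorithms: Decision Trees
Common splitting criteria for decision trees:<br>
1. ID3: information gain
> Drawback: the more distinct values a feature has, the smaller the conditional entropy after splitting on it, and so the larger the information gain.<br>
> ID3 is therefore biased toward features with many distinct values.
2. C4.5: divides the information gain by the feature's intrinsic value (split information) to obtain the information gain ratio
> Drawback: the opposite of ID3. The fewer distinct values a feature has, the smaller its intrinsic value, and so the larger the gain ratio.<br>
> C4.5 therefore tends to favor features with few distinct values.
3. CART: Gini index
> Splits are always binary, so the overall partition tends to be finer-grained.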
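
The three criteria above can be compared on a toy example. The following is a minimal sketch in plain Python, not the implementation used by any library; the data (`labels`, `sex`, `row_id`) and the helper names are made up for illustration.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def gini(values):
    """Gini impurity of a list of discrete values (the CART criterion)."""
    n = len(values)
    return 1.0 - sum((c / n) ** 2 for c in Counter(values).values())


def split_by(feature, labels):
    """Group the labels by the value of the feature."""
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    return list(groups.values())


def information_gain(feature, labels):
    """ID3 criterion: entropy of the labels minus the conditional entropy."""
    n = len(labels)
    conditional = sum(len(g) / n * entropy(g) for g in split_by(feature, labels))
    return entropy(labels) - conditional


def gain_ratio(feature, labels):
    """C4.5 criterion: information gain divided by the intrinsic value."""
    intrinsic_value = entropy(feature)  # entropy of the feature's own values
    if intrinsic_value == 0:
        return 0.0
    return information_gain(feature, labels) / intrinsic_value


# Hypothetical toy data: 8 passengers, 1 = survived
labels = [1, 1, 0, 0, 1, 0, 1, 0]
sex = ['f', 'f', 'm', 'm', 'f', 'm', 'm', 'f']  # 2 distinct values
row_id = list(range(8))                          # 8 distinct values, no real signal

for name, feature in [('sex', sex), ('row_id', row_id)]:
    print(name,
          'gain =', round(information_gain(feature, labels), 3),
          'gain ratio =', round(gain_ratio(feature, labels), 3))
print('gini of labels =', gini(labels))
```

In this toy data `row_id` reaches the maximum possible information gain because every group contains a single sample, which is exactly the ID3 bias described above. Dividing by the intrinsic value cuts its score to one third, while the score of `sex` is unchanged because its intrinsic value is 1 bit.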

### Case Study: Titanic Survival Prediction

In [3]:
# Import packages and load the data
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_graphviz  # decision tree, Graphviz export
from sklearn.feature_extraction import DictVectorizer  # feature extraction from dict records
from sklearn.model_selection import train_test_split, GridSearchCV  # dataset splitting, grid search
from sklearn.ensemble import RandomForestClassifier  # random forest

titan = pd.read_csv('./data/titanic.txt')
print(titan.info())
print('-' * 50)
print(titan.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1313 entries, 0 to 1312
Data columns (total 11 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   row.names  1313 non-null   int64  
 1   pclass     1313 non-null   object 
 2   survived   1313 non-null   int64  
 3   name       1313 non-null   object 
 4   age        633 non-null    float64
 5   embarked   821 non-null    object 
 6   home.dest  754 non-null    object 
 7   room       77 non-null     object 
 8   ticket     69 non-null     object 
 9   boat       347 non-null    object 
 10  sex        1313 non-null   object 
dtypes: float64(1), int64(2), object(8)
memory usage: 113.0+ KB
None
--------------------------------------------------
   row.names pclass  survived  \
0          1    1st         1   
1          2    1st         0   
2          3    1st         0   
3          4    1st         0   
4          5    1st         1   

                                              name    

In [4]:
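# Prepare the data: extract the feature values and the target values
# Passenger class, sex, and age are assumed to be related to survival, so they are used as features
x = titan[['pclass', 'age', 'sex']]

y = titan['survived']
print(x.info())  # check whether any feature contains missing values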
print(x.describe(include=object))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1313 entries, 0 to 1312
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   pclass  1313 non-null   object 
 1   age     633 non-null    float64
 2   sex     1313 non-null   object 
dtypes: float64(1), object(2)
memory usage: 30.9+ KB
None
       pclass   sex
count    1313  1313
unique      3     2
top       3rd  male
freq      711   850


In [6]:
# age has 1313 - 633 = 680 missing values; features must not contain NaN, so fill them with the mean
x = x.copy()  # work on an explicit copy to avoid pandas' SettingWithCopyWarning
x['age'] = x['age'].fillna(x['age'].mean())

# After cleaning, split the dataset into a training set and a test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=4)
print(x_train.head())

    pclass        age     sex
598    2nd  30.000000    male
246    1st  62.000000    male
905    3rd  31.194181  female
300    1st  31.194181  female
509    2nd  64.000000    male


In [9]:
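# Extract features: convert the dict records into one-hot encoded columns
# sparse=False returns a dense ndarray rather than a sparse matrix, so the result can be printed directly
vec = DictVectorizer(sparse=False)  # named vec so the built-in dict is not shadowed

# to_dict(orient='records') turns each row into a dict keyed by column name
x_train = vec.fit_transform(x_train.to_dict(orient='records'))
print(type(x_train))
print(vec.get_feature_names_out())
print('-' * 50)
x_test = vec.transform(x_test.to_dict(orient='records'))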
print(x_train)

<class 'numpy.ndarray'>
['age' 'pclass=1st' 'pclass=2nd' 'pclass=3rd' 'sex=female' 'sex=male']
--------------------------------------------------
[[30.          0.          1.          0.          0.          1.        ]
 [62.          1.          0.          0.          0.          1.        ]
 [31.19418104  0.          0.          1.          1.          0.        ]
 ...
 [34.          0.          1.          0.          0.          1.        ]
 [46.          1.          0.          0.          0.          1.        ]
 [31.19418104  0.          0.          1.          0.          1.        ]]
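
To make the encoding above concrete, here is a minimal plain-Python sketch of the same idea: numeric values pass through unchanged, and each string value becomes its own `key=value` indicator column. The function `one_hot_records` is a hypothetical helper written for illustration; it is not how `DictVectorizer` is implemented.

```python
def one_hot_records(records):
    """Encode a list of dicts into (column names, rows of floats)."""
    # First pass: collect column names. Strings get one column per distinct value.
    names = set()
    for record in records:
        for key, value in record.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    names = sorted(names)
    index = {name: i for i, name in enumerate(names)}

    # Second pass: fill one row per record.
    rows = []
    for record in records:
        row = [0.0] * len(names)
        for key, value in record.items():
            if isinstance(value, str):
                row[index[f"{key}={value}"]] = 1.0
            else:
                row[index[key]] = float(value)
        rows.append(row)
    return names, rows


# Three made-up records in the same shape as x_train.to_dict(orient='records')
records = [
    {"pclass": "2nd", "age": 30.0, "sex": "male"},
    {"pclass": "1st", "age": 62.0, "sex": "male"},
    {"pclass": "3rd", "age": 31.0, "sex": "female"},
]
names, rows = one_hot_records(records)
print(names)
for row in rows:
    print(row)
```

The column order comes from sorting the names, which is why `age` appears first and the `pclass=...` and `sex=...` indicators follow.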


In [10]:
# Predict with a decision tree
dec = DecisionTreeClassifier()

# Train the model
dec.fit(x_train, y_train)

# Accuracy on the test set
print('Accuracy:', dec.score(x_test, y_test))

# Export the tree structure
export_graphviz(dec, out_file="tree.dot",
                feature_names=['age', 'pclass=1st', 'pclass=2nd', 'pclass=3rd', 'sex=female', 'sex=male'])

Accuracy: 0.8085106382978723


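The exported tree structure can be rendered as an image with the command<br>
dot -Tpng tree.dot -o tree.png<br>
(run it in a terminal from the directory that contains tree.dot)

Parameters of the decision tree interface, for tuning practice:
> criterion="gini": the split criterion. The default is the Gini index; "entropy" (information gain) is also available. scikit-learn implements an optimized version of CART and does not offer ID3 or C4.5 as options.<br>
> max_depth=None: the maximum depth of the tree<br>
> min_samples_split=2: the minimum number of samples required to split an internal node<br>
> min_samples_leaf=1: the minimum number of samples required at a leaf node<br>
> random_state=None: the random seed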

In [13]:
# Tuning practice
# Split the dataset again and redo the preprocessing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=4)
print(x_train.head())
# Feature engineering: categorical features -> one-hot encoding
vec = DictVectorizer(sparse=False)  # named vec so the built-in dict is not shadowed

# Extract features from the dict records
x_train = vec.fit_transform(x_train.to_dict(orient="records"))
print(type(x_train))
print(vec.get_feature_names_out())
print('-' * 50)
x_test = vec.transform(x_test.to_dict(orient="records"))

# Predict with a decision tree; with max_depth=10 the accuracy improves
dec = DecisionTreeClassifier(max_depth=10)

dec.fit(x_train, y_train)

# Accuracy on the test set
print("Accuracy:", dec.score(x_test, y_test))

# Export the tree structure
export_graphviz(dec, out_file="tree.dot",
                feature_names=vec.get_feature_names_out())

    pclass        age     sex
598    2nd  30.000000    male
246    1st  62.000000    male
905    3rd  31.194181  female
300    1st  31.194181  female
509    2nd  64.000000    male
<class 'numpy.ndarray'>
['age' 'pclass=1st' 'pclass=2nd' 'pclass=3rd' 'sex=female' 'sex=male']
--------------------------------------------------
Accuracy: 0.817629179331307


Introducing a random forest for further improvement

In [15]:
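# Predict with a random forest (with hyperparameter tuning); n_jobs controls how many CPU cores are used
# -1 means use all available cores; a positive integer sets the number of cores explicitly
rf = RandomForestClassifier(n_jobs=-1)
# n_estimators is the number of decision trees (base classifiers) in the forest;
# other candidate values: 120, 200, 300, 500, 800, 1200
# max_samples is the maximum number of samples drawn to train each tree
# Random forest is a bagging-type ensemble
param = {"n_estimators": [1500, 1750, 2000], "max_depth": list(range(2, 10))}

# Grid search with cross-validation
gc = GridSearchCV(rf, param_grid=param, cv=3)

gc.fit(x_train, y_train)

print("Accuracy:", gc.score(x_test, y_test))

print("Best parameters:", gc.best_params_)

print("Best model:", gc.best_estimator_)

Accuracy: 0.8328267477203647
Best parameters: {'max_depth': 6, 'n_estimators': 1500}
Best model: RandomForestClassifier(max_depth=6, n_estimators=1500, n_jobs=-1)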


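The results show that the random forest raises the test accuracy further (about 0.833, compared with about 0.818 for the decision tree with max_depth=10). The best max_depth found was 6, which suggests that the trees should not be too deep.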
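
The cost of the grid search above is easy to underestimate. As a rough sketch, the following plain-Python snippet enumerates the same parameter grid with `itertools.product` and counts the model fits implied by `cv=3`; it only illustrates the enumeration and does not reproduce `GridSearchCV` itself.

```python
from itertools import product

# Same grid and fold count as in the cell above
param = {"n_estimators": [1500, 1750, 2000], "max_depth": list(range(2, 10))}
cv = 3

# Every combination of one value per parameter is one candidate model
keys = sorted(param)
candidates = [dict(zip(keys, values))
              for values in product(*(param[key] for key in keys))]

# Each candidate is trained once per cross-validation fold
total_fits = len(candidates) * cv
print(len(candidates), "candidates,", total_fits, "cross-validation fits")
```

With 3 values for `n_estimators` and 8 for `max_depth` there are 24 candidates and 72 cross-validation fits, each of which trains at least 1500 trees. By default `GridSearchCV` then refits the best candidate once more on the whole training set.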


## Prediction with Linear Regression and Gradient Descent
Using linear regression to predict house prices
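
For reference, the standard least-squares formulation behind the two approaches used in this section can be written as follows. The notation is introduced here and is not from the notebook: $X$ is the feature matrix (with a column of ones for the intercept), $y$ the target vector, $w$ the weights, and $n$ the number of samples. The closed-form solution assumes $X^\top X$ is invertible.

```latex
\begin{aligned}
\hat{y} &= Xw \\
J(w) &= \frac{1}{n}\,\lVert Xw - y \rVert_2^2 \\
\nabla J(w) &= \frac{2}{n}\, X^\top (Xw - y) \\
w^{\ast} &= (X^\top X)^{-1} X^\top y
\end{aligned}
```

The last line is the normal equation, obtained by setting $\nabla J(w) = 0$. Gradient descent instead follows the gradient downhill step by step, $w \leftarrow w - \eta\,\nabla J(w)$, with learning rate $\eta$.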

In [20]:
# Imports
from sklearn.datasets import load_boston  # dataset (removed in scikit-learn 1.2, so an older version is needed)
from sklearn.linear_model import LinearRegression  # linear regression
from sklearn.linear_model import SGDRegressor  # regression by stochastic gradient descent
from sklearn.preprocessing import StandardScaler  # standardization
from sklearn.model_selection import train_test_split  # dataset splitting
from sklearn.metrics import mean_squared_error, classification_report, roc_auc_score
# mean squared error, classification report, ROC AUC score
import joblib
import pandas as pd
import numpy as np

In [22]:
# Load the data
lb = load_boston()
# Inspect the data
print('Feature values:\n', lb.data)
print('Target values:\n', lb.target)
print('Feature names:\n', lb.feature_names)

Feature values:
 [[6.3200e-03 1.8000e+01 2.3100e+00 ... 1.5300e+01 3.9690e+02 4.9800e+00]
 [2.7310e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9690e+02 9.1400e+00]
 [2.7290e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9283e+02 4.0300e+00]
 ...
 [6.0760e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 5.6400e+00]
 [1.0959e-01 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9345e+02 6.4800e+00]
 [4.7410e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 7.8800e+00]]
Target values:
 [24.  21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 15.  18.9 21.7 20.4
 18.2 19.9 23.1 17.5 20.2 18.2 13.6 19.6 15.2 14.5 15.6 13.9 16.6 14.8
 18.4 21.  12.7 14.5 13.2 13.1 13.5 18.9 20.  21.  24.7 30.8 34.9 26.6
 25.3 24.7 21.2 19.3 20.  16.6 14.4 19.4 19.7 20.5 25.  23.4 18.9 35.4
 24.7 31.6 23.3 19.6 18.7 16.  22.2 25.  33.  23.5 19.4 22.  17.4 20.9
 24.2 21.7 22.8 23.4 24.1 21.4 20.  20.8 21.2 20.3 28.  23.9 24.8 22.9
 23.9 26.6 22.5 22.2 23.6 28.7 22.6 22.  22.9 25.  20.6 28.4 21.4 38.7
 43.8 33.2 27.5 26.5 18.6 19.3 20.1 19.5 19


    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np

        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_ho

In [24]:
# Split the dataset
x_train, x_test, y_train, y_test = train_test_split(lb.data, lb.target, test_size=0.25, random_state=1)

# Standardize both the features and the target so they are on a common scale
std_x = StandardScaler()
x_train = std_x.fit_transform(x_train)
x_test = std_x.transform(x_test)

std_y = StandardScaler()
print(y_train.shape)
# fit_transform requires 2-D input, but the target is 1-D, so it has to be reshaped
# reshape(-1, 1) produces a 2-D array with one column and as many rows as the original length
y_train = std_y.fit_transform(y_train.reshape(-1, 1))
y_test = std_y.transform(y_test.reshape(-1, 1))

(379,)
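
What standardization and its inverse do to the target can be shown without scikit-learn. The class below is a minimal stand-in written for illustration (`SimpleScaler` is a hypothetical name, not a library class). It uses the population standard deviation, which is also what `StandardScaler` uses, and it works on plain 1-D lists rather than 2-D arrays.

```python
from math import sqrt


class SimpleScaler:
    """Minimal z-score scaler for a 1-D list of numbers (illustration only)."""

    def fit(self, values):
        n = len(values)
        self.mean_ = sum(values) / n
        # Population standard deviation (divide by n, not n - 1)
        self.scale_ = sqrt(sum((v - self.mean_) ** 2 for v in values) / n)
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.scale_ for v in values]

    def inverse_transform(self, values):
        return [v * self.scale_ + self.mean_ for v in values]


# The first five target values printed earlier in this notebook
prices = [24.0, 21.6, 34.7, 33.4, 36.2]

scaler = SimpleScaler().fit(prices)
z = scaler.transform(prices)             # mean 0, standard deviation 1
restored = scaler.inverse_transform(z)   # back to the original price units

print([round(v, 3) for v in z])
print([round(v, 3) for v in restored])
```

Because the test set is transformed with the mean and scale learned from the training set, the same fitted scaler must be kept around to map predictions back to prices.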


In [26]:
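# Predict with an estimator
# Linear regression solved via the normal equation
lr = LinearRegression()
# Same fit/predict workflow; the metric changes from accuracy to mean squared error (smaller means a better fit)
lr.fit(x_train, y_train)

# Inspect the regression coefficients to see how each feature relates to the target
print('Regression coefficients:', lr.coef_)

y_predict = lr.predict(x_test)
# The predictions are in standardized units; inverse_transform recovers the actual house prices
y_lr_predict = std_y.inverse_transform(y_predict)
# Save the trained model; the file stores the fitted weights w together with the model structure
joblib.dump(lr, "./tmp/test.pkl")
print("Normal equation, predicted price of each house in the test set:\n", y_lr_predict)
# Loss on the test set
print("Normal equation MSE:", mean_squared_error(y_test, y_predict))

Regression coefficients: [[-0.12026411  0.15044778  0.02951803  0.07470354 -0.28043353  0.22170939
   0.02190624 -0.35275513  0.29939558 -0.2028089  -0.23911894  0.06305081
  -0.45259462]]
Normal equation, predicted price of each house in the test set: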
 [[32.37816533]
 [27.95684437]
 [18.07213891]
 [21.63166556]
 [18.93029508]
 [19.96277202]
 [32.2834674 ]
 [18.06715668]
 [24.72989076]
 [26.85359369]
 [27.23326816]
 [28.57021239]
 [21.18778302]
 [26.94393815]
 [23.37892579]
 [20.89176865]
 [17.11746934]
 [37.73997945]
 [30.51980066]
 [ 8.44489436]
 [20.86557977]
 [16.21989418]
 [25.13605925]
 [24.77658813]
 [31.40497629]
 [11.02741407]
 [13.82097563]
 [16.80208261]
 [35.94637198]
 [14.7155729 ]
 [21.23939821]
 [14.15079469]
 [42.72492585]
 [17.83887162]
 [21.84610225]
 [20.40178099]
 [17.50287927]
 [27.00093206]
 [ 9.80760408]
 [20.00288662]
 [24.27066782]
 [21.06719021]
 [29.47089776]
 [16.48482565]
 [19.38852695]
 [14.54778282]
 [39.39838319]
 [18.09810655]
 [26.22164983]
 [20.60676525]
 [25.09994066]
 [24.48366723]
 [25.02297948]
 [26.84986898]

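Note that the mean squared error printed above is computed on standardized data, that is, between the standardized predictions and the standardized targets.<br>
To get the mean squared error in actual house-price units, compute it on the data after applying inverse_transform.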

In [28]:
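# Practice loading the saved model, and compute the MSE in actual house-price units
model = joblib.load('./tmp/test.pkl')
y_predict = model.predict(x_test)  # predict with the reloaded model
print("Normal equation MSE:", mean_squared_error(y_test, y_predict))
print("MSE in actual house-price units:", mean_squared_error(std_y.inverse_transform(y_test),
                                                            std_y.inverse_transform(y_predict)))

Normal equation MSE: 0.2758842244225054
MSE in actual house-price units: 21.89776539604949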
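
The two error values above differ only by a constant factor: undoing a standardization multiplies every residual by the scaler's standard deviation, so the MSE is multiplied by its square. The sketch below checks this on made-up numbers (`mean`, `scale`, `y_true`, and `y_pred` are hypothetical) and then reads off the scale implied by the two values printed above.

```python
from math import sqrt


def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical scaler parameters and price data, for illustration only
mean, scale = 22.0, 9.0
y_true = [24.0, 21.6, 34.7, 33.4]
y_pred = [25.0, 20.0, 30.0, 35.0]


def standardize(values):
    return [(v - mean) / scale for v in values]


mse_price = mse(y_true, y_pred)
mse_standardized = mse(standardize(y_true), standardize(y_pred))

# MSE in price units equals the standardized MSE times scale squared
print(mse_price, mse_standardized * scale ** 2)

# Scale implied by the two MSE values printed in the notebook output above
implied_scale = sqrt(21.89776539604949 / 0.2758842244225054)
print(round(implied_scale, 2))
```

The implied scale is the standard deviation of the training targets that `std_y` learned, which is why the error is much larger once it is expressed in price units.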


### Using gradient descent to find the best w for prediction
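
Before turning to scikit-learn, here is a minimal sketch of what gradient descent does for a one-feature linear model. It is plain batch gradient descent on the mean squared error, written for illustration. `SGDRegressor` differs in that it updates the weights one sample at a time and adds regularization; the function name, data, and settings below are assumptions.

```python
def gradient_descent(xs, ys, eta=0.1, epochs=500):
    """Fit y = w * x + b by batch gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Partial derivatives of (1/n) * sum((w*x + b - y)^2) with respect to w and b
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step against the gradient, scaled by the learning rate eta
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b


# Made-up, already centered data generated from y = 2x + 1
xs = [-1.5, -0.5, 0.0, 0.5, 1.5]
ys = [2 * x + 1 for x in xs]

w, b = gradient_descent(xs, ys)
print(round(w, 4), round(b, 4))
```

The learning rate plays the same role as `eta0` in the cell below: if it is too large the updates overshoot and diverge, and if it is too small convergence is slow.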

In [32]:
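# House-price prediction with gradient descent; preferable when the dataset is large
# eta0 is the initial learning rate (set to 0.008 here); how it evolves depends on learning_rate
# With learning_rate='optimal', the learning rate is computed from alpha instead
# penalty selects the type of regularization (l1 or l2); alpha sets its strength
sgd = SGDRegressor(eta0=0.008,
                   penalty='l1', alpha=0.005)
# Train; ravel() passes y as the 1-D array that SGDRegressor expects
sgd.fit(x_train, y_train.ravel())

print('Gradient descent regression coefficients', sgd.coef_)

# Predict house prices for the test set
y_predict = sgd.predict(x_test)
print(y_predict.shape)
y_sgd_predict = std_y.inverse_transform(y_predict.reshape(-1, 1))
print("Gradient descent, predicted price of each house in the test set:", y_sgd_predict)
print("Gradient descent MSE:", mean_squared_error(y_test, y_predict))
print("Gradient descent MSE in original price units:", mean_squared_error(std_y.inverse_transform(y_test), y_sgd_predict))

Gradient descent regression coefficients [-0.08862077  0.07383654 -0.01991498  0.08008351 -0.16955852  0.27093474
  0.         -0.24018851  0.09679906 -0.01853834 -0.22002831  0.0656138
 -0.42331052]
(127,)
Gradient descent, predicted price of each house in the test set: [[30.15401847]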
 [28.05865708]
 [18.15747402]
 [22.37419684]
 [18.64410312]
 [20.88503473]
 [29.89968201]
 [18.6240174 ]
 [23.82564061]
 [26.86122957]
 [26.53561614]
 [29.23085726]
 [21.59004752]
 [25.50839441]
 [22.96719372]
 [19.72695494]
 [17.30180168]
 [37.90589789]
 [29.57569127]
 [ 9.93875559]
 [20.85204256]
 [17.63096892]
 [25.36913841]
 [25.10677698]
 [30.18155048]
 [11.01656908]
 [14.45790257]
 [19.25534421]
 [35.65602813]
 [14.18790954]
 [23.71592045]
 [14.61310784]
 [40.30263706]
 [18.32577521]
 [24.05765884]
 [20.9654494 ]
 [17.91069116]
 [28.04590387]
 [ 8.33136065]
 [19.65432286]
 [26.33640301]
 [21.93066339]
 [28.62213342]
 [15.75157198]
 [18.82629821]
 [15.36251318]
 [39.91400373]
 [17.87806575]
 [25.90489542]
 [20.94655075]
 [25.01766154]
 [24.36572305]
 [25.57259802]
 [26.6071

