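# Chapter 3 Model Building and Evaluation -- Modeling
Having worked through the previous two chapters, we can now manipulate the data itself (adding, deleting, querying, and filling in values) and carry out the necessary cleaning. Next we put the data we prepared to use. The purpose of data analysis is to apply the data, together with an understanding of the business context, to obtain the results we need. The first step of that analysis is modeling: building a predictive model or some other kind of model. Once the model produces results, we need to judge whether it is reliable enough, which means evaluating it. This section covers modeling; the next one covers evaluation.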

We have the Titanic dataset, so our goal here is to complete the Titanic survival prediction task.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image
%matplotlib inline

In [2]:
plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render Chinese labels
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly
plt.rcParams['figure.figsize'] = (10, 6)  # set the output figure size

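[Think] What do these libraries do? Look them up.  
Answer: pandas organizes data into DataFrame or Series objects, which makes it easy to add, delete, query, and fill in values and to clean the data;
numpy is a package for array computation, including linear algebra operations; pyplot (matplotlib.pyplot) is the basic plotting interface for data visualization; seaborn builds on matplotlib and adds more chart types and styles;
and IPython.display.Image displays an image in the notebook, from a local file or a URL. These answers come from my own understanding and from Baidu.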

In [3]:
cdf = pd.read_csv('clear_data.csv')
df = pd.read_csv('train.csv')
cdf.head(10)

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,0,3,22.0,1,0,7.25,0,1,0,0,1
1,1,1,38.0,1,0,71.2833,1,0,1,0,0
2,2,3,26.0,0,0,7.925,1,0,0,0,1
3,3,1,35.0,1,0,53.1,1,0,0,0,1
4,4,3,35.0,0,0,8.05,0,1,0,0,1
5,5,3,29.699118,0,0,8.4583,0,1,0,1,0
6,6,1,54.0,0,0,51.8625,0,1,0,0,1
7,7,3,2.0,3,1,21.075,0,1,0,0,1
8,8,3,27.0,0,2,11.1333,1,0,0,0,1
9,9,2,14.0,1,0,30.0708,1,0,1,0,0


In [4]:
df.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


<font size=4>Model Building</font>  

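1. Once the earlier data processing is done, we have the modeling data; the next step is to choose a suitable model  
2. Before choosing a model, we need to know whether the task on this dataset is supervised or unsupervised learning  
3. The choice of model is determined partly by the task.  
4. Besides the task, the choice can also depend on the sample size and the sparsity of the features  
5. We usually start with a basic model as a baseline, then train other models for comparison, and finally pick the one with better generalization or performance  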
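
Step 5 can be sketched as follows. This is a minimal example rather than the chapter's own code: it assumes synthetic data from `make_classification` as a stand-in for the Titanic files, and it assumes `DummyClassifier` (always predicting the most frequent class) as the baseline, which is only one common choice. The names `X_demo`, `y_demo`, `baseline`, and `candidate` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the Titanic files).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.6, 0.4], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Baseline: always predict the most frequent class seen in training.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
# Candidate model to compare against the baseline.
candidate = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)
candidate_acc = candidate.score(X_te, y_te)
print(f"baseline accuracy:  {baseline_acc:.3f}")
print(f"candidate accuracy: {candidate_acc:.3f}")
```

A candidate model is only interesting if it clearly beats such a baseline on the held-out data.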

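We do not build the model from scratch by writing all of the code ourselves. Instead we use sklearn (scikit-learn), one of the most commonly used machine learning libraries, to build our models.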

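### Task 1: Split the training and test sets  
Here we use the hold-out method to split the dataset

1. Separate the dataset into independent variables (features) and the dependent variable (target)  
2. Split the data into training and test sets by ratio (typical test-set proportions are 30%, 25%, 20%, 15%, and 10%)  
3. Use stratified sampling  
4. Set a random seed so the results are reproducible  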

[Think]

1. What methods are there for splitting a dataset?  
2. Why use stratified sampling, and what are its benefits?  
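
As a hedged pointer for these two questions: besides the hold-out method used in this chapter, commonly cited alternatives are k-fold cross-validation and bootstrap sampling. The sketch below compares plain and stratified k-fold on toy labels (70 negatives and 30 positives, an assumption for illustration) to show what stratification changes: every validation fold keeps the overall class ratio.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels (an assumption for illustration): 70 negatives, 30 positives.
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# Plain k-fold: each sample lands in a validation fold exactly once,
# but the class ratio inside a fold is left to chance.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
kf_rates = [y_toy[val_idx].mean() for _, val_idx in kf.split(X_toy)]

# Stratified k-fold: each validation fold keeps the overall class ratio.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
skf_rates = [y_toy[val_idx].mean() for _, val_idx in skf.split(X_toy, y_toy)]

print("positive rate per fold, KFold:          ", np.round(kf_rates, 2))
print("positive rate per fold, StratifiedKFold:", np.round(skf_rates, 2))
```

With an imbalanced target, keeping the ratio stable across splits makes the scores from different splits easier to compare.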

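### Task hints 1  
1. Splitting the dataset lets us evaluate the model's generalization ability later  
2. The sklearn function for splitting a dataset is train_test_split  
3. To view the function's documentation in Jupyter Notebook, type train_test_split? and run the cell  
4. Look for the stratification and random seed options among the parameters  
Extract the arguments that train_test_split() needs from clear_data.csv and train.csv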

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
train_test_split?

In [9]:
X = cdf
y = df['Survived']

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [11]:
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(712, 11) (179, 11) (712,) (179,)
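
Note that the call above sets `test_size` and `random_state` but does not pass `stratify`, so that split is random rather than stratified. In `train_test_split`, stratification is requested through the `stratify` parameter. A minimal sketch, assuming toy labels (891 rows, 342 of them positive) in place of `df['Survived']`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in (an assumption): 891 rows with 342 positive labels.
y_toy = np.array([1] * 342 + [0] * 549)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the class ratio (almost) identical in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print(
    f"positive rate: all={y_toy.mean():.3f} "
    f"train={y_tr.mean():.3f} test={y_te.mean():.3f}"
)
```

With the notebook's own variables this would correspond to adding `stratify=y` to the existing call; the scores recorded below were produced without it.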


In [12]:
712 / (712 + 179)

0.7991021324354658

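### Task 2: Create the models
Create a classification model based on a linear model (logistic regression)  
Create classification models based on trees (decision tree, random forest)  
Train each of these models and obtain its score on the training set and the test set  
Inspect the model parameters, change their values, and observe how the model changes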

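Hints

Logistic regression is a classification model, not a regression model; do not confuse it with LinearRegression  
A random forest is an ensemble of decision trees, designed to reduce the overfitting of a single decision tree  
Linear models are in the module sklearn.linear_model  
Random forests are in the module sklearn.ensemble; single decision trees are in sklearn.tree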

In [33]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier

In [14]:
LogisticRegression?

In [22]:
clf = LogisticRegression(random_state= 42).fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
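
The warning above names two remedies: raise `max_iter`, or scale the data. A minimal sketch of the scaling route follows. It assumes synthetic features on very different scales in place of `X_train`; the data and the names `X_demo`, `y_demo`, and `scaled_clf` are illustrative only.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with columns on very different scales (an assumption).
rng = np.random.default_rng(42)
n = 400
X_demo = np.column_stack([
    rng.normal(30, 12, n),   # age-like column
    rng.exponential(30, n),  # fare-like column
    np.arange(n),            # ID-like column with a large range
])
y_demo = (X_demo[:, 0] - 30 + rng.normal(0, 10, n) > 0).astype(int)

# Standardize every column, then fit logistic regression on the scaled data.
scaled_clf = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))

# Record any warnings raised during fitting so we can count them.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    scaled_clf.fit(X_demo, y_demo)

n_convergence_warnings = sum(
    issubclass(w.category, ConvergenceWarning) for w in caught
)
print("convergence warnings:", n_convergence_warnings)
print("iterations used:", scaled_clf.named_steps["logisticregression"].n_iter_)
```

The other route is to keep the data as it is and pass a larger `max_iter` to `LogisticRegression`; whether that alone is enough depends on the data.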


In [16]:
clf.predict(X_test)

array([0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1,
       0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0,
       0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 1, 1], dtype=int64)

In [17]:
clf.score(X_test,y_test)

0.7988826815642458

In [23]:
print('Training set score: {:.3f}'.format(clf.score(X_train, y_train)))
print('Test set score: {:.3f}'.format(clf.score(X_test, y_test)))

Training set score: 0.802
Test set score: 0.799


In [24]:
clf = LogisticRegression(random_state=42, C=0.5).fit(X_train, y_train)
print('Training set score: {:.3f}'.format(clf.score(X_train, y_train)))
print('Test set score: {:.3f}'.format(clf.score(X_test, y_test)))

Training set score: 0.801
Test set score: 0.782


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
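
The cell above changes `C` to 0.5. In `LogisticRegression`, `C` is the inverse of the regularization strength, so a smaller `C` means stronger regularization and, with the default L2 penalty, smaller coefficients. A small sketch on synthetic data (an assumption; the values of `C` are arbitrary) shows the effect on the coefficient norm:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; not the Titanic files).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    # Smaller C = stronger L2 penalty on the coefficients.
    clf_demo = LogisticRegression(C=C, max_iter=1000, random_state=42)
    clf_demo.fit(X_demo, y_demo)
    coef_norms[C] = float(np.linalg.norm(clf_demo.coef_))
    print(f"C={C:<6} ||coef||={coef_norms[C]:.3f}")
```

Whether stronger or weaker regularization gives a better test score has to be checked on held-out data, as the cells in this section do.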


In [25]:
clf = LogisticRegression(random_state=42, C=0.5, class_weight='balanced').fit(X_train, y_train)
print('Training set score: {:.3f}'.format(clf.score(X_train, y_train)))
print('Test set score: {:.3f}'.format(clf.score(X_test, y_test)))

Training set score: 0.785
Test set score: 0.788


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
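
For reference, `class_weight='balanced'` weights each class inversely to its frequency, using `n_samples / (n_classes * np.bincount(y))`. The sketch below checks that formula against scikit-learn's `compute_class_weight` helper on toy labels (60 negatives and 40 positives, an assumption for illustration):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (an assumption): 60 negatives, 40 positives.
y_toy = np.array([0] * 60 + [1] * 40)

# Manual formula: n_samples / (n_classes * count of each class).
manual_weights = len(y_toy) / (2 * np.bincount(y_toy))

# The same weights computed by scikit-learn's helper.
auto_weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_toy
)

print("manual:", manual_weights)
print("sklearn:", auto_weights)
```

The rarer class receives the larger weight, so its errors count more during training. This can lower plain accuracy while improving recall on the rarer class.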


In [26]:
RandomForestRegressor?

In [27]:
rf = RandomForestRegressor(random_state=42, ).fit(X_train, y_train)

In [28]:
rf.score(X_train, y_train)

0.9141012975662229

In [32]:
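# Note: for a regressor, score() returns R^2, not classification accuracy.
rf = RandomForestRegressor(random_state=0).fit(X_train, y_train)
print('Training set score: {:.3f}'.format(rf.score(X_train, y_train)))
print('Test set score: {:.3f}'.format(rf.score(X_test, y_test)))

Training set score: 0.916
Test set score: 0.438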
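
The regressor scores above are not on the same scale as the logistic regression scores: `score` returns the coefficient of determination R² for a regressor and mean accuracy for a classifier. The 0.438 is therefore an R² value and cannot be compared directly with an accuracy. A minimal sketch on synthetic data (an assumption; the 0.5 threshold is also just an illustrative choice):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the Titanic files).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

reg = RandomForestRegressor(random_state=42).fit(X_tr, y_tr)
clf_rf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

reg_score = reg.score(X_te, y_te)     # R^2 of the predicted values
clf_score = clf_rf.score(X_te, y_te)  # mean accuracy of the predicted labels

# Thresholding the regressor's output turns it into labels, which can
# then be scored with accuracy for a like-for-like comparison.
reg_acc = accuracy_score(y_te, (reg.predict(X_te) >= 0.5).astype(int))

print(f"regressor  score (R^2):      {reg_score:.3f}")
print(f"regressor  accuracy at 0.5:  {reg_acc:.3f}")
print(f"classifier score (accuracy): {clf_score:.3f}")
```

For a 0/1 target such as survival, the classifier is the natural choice.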


In [34]:
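# For a classifier, score() returns mean accuracy.
rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print('Training set score: {:.3f}'.format(rf.score(X_train, y_train)))
print('Test set score: {:.3f}'.format(rf.score(X_test, y_test)))

Training set score: 1.000
Test set score: 0.844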
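
A training score of 1.000 against a test score of 0.844 suggests the forest fits the training data more closely than it generalizes. Task 2 also lists a single decision tree, which is not built above. The sketch below uses `DecisionTreeClassifier` from `sklearn.tree` and varies `max_depth`, one of the parameters worth experimenting with. It assumes synthetic data, and `max_depth=3` is an arbitrary illustrative value.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, slightly noisy stand-in data (an assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Unrestricted tree: grows until every training sample is classified correctly.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
# Depth-limited tree: a simpler model that cannot memorize the training set.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)

full_train = full_tree.score(X_tr, y_tr)
shallow_train = shallow_tree.score(X_tr, y_tr)

for name, tree, train_score in [
    ("full", full_tree, full_train),
    ("max_depth=3", shallow_tree, shallow_train),
]:
    print(
        f"{name:<12} depth={tree.get_depth():<3} "
        f"train={train_score:.3f} test={tree.score(X_te, y_te):.3f}"
    )
```

Comparing the gap between training and test scores for different depths is one simple way to see how model complexity affects overfitting.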
