## The Sinking of the Titanic

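This is a classification task whose features are a mix of discrete and continuous variables. The data is available on [Kaggle](https://www.kaggle.com/c/titanic/data). The goal is to predict, from these features, whether a passenger survived the sinking of the Titanic. The columns are described below: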

```
survival        Target column: whether the passenger survived (0 = No; 1 = Yes)
pclass          Ticket class (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of siblings/spouses aboard
parch           Number of parents/children aboard
ticket          Ticket number
fare            Fare
cabin           Cabin
embarked        Port of embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)
```

In [1]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
pd.set_option("display.max_columns",60)

## Loading the data

In [2]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
IDtest = test["PassengerId"]

In [4]:
test.head(2)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S


In [3]:
train.head(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C


In [5]:
train_len = len(train)
dataset = pd.concat(objs=[train, test], axis=0, sort=True).reset_index(drop=True)  # sort=True keeps the columns in alphabetical order


## Inspecting the data

In [7]:
dataset.tail()

Unnamed: 0,Age,Cabin,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Survived,Ticket
1304,,,S,8.05,"Spector, Mr. Woolf",0,1305,3,male,0,,A.5. 3236
1305,39.0,C105,C,108.9,"Oliva y Ocana, Dona. Fermina",0,1306,1,female,0,,PC 17758
1306,38.5,,S,7.25,"Saether, Mr. Simon Sivertsen",0,1307,3,male,0,,SOTON/O.Q. 3101262
1307,,,S,8.05,"Ware, Mr. Frederick",0,1308,3,male,0,,359309
1308,,,C,22.3583,"Peter, Master. Michael J",1,1309,3,male,1,,2668
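
The `NaN` values in `Survived` above come from the test rows, which carry no label. A minimal sketch of the combine-then-split pattern, using small made-up frames in place of `train.csv` and `test.csv` (the values and the names `train_part` / `test_part` are illustrative assumptions):

```python
import pandas as pd

# Toy stand-ins for train.csv / test.csv (hypothetical values).
train = pd.DataFrame({"PassengerId": [1, 2, 3],
                      "Survived": [0, 1, 1],
                      "Fare": [7.25, 71.28, 7.92]})
test = pd.DataFrame({"PassengerId": [4, 5], "Fare": [8.05, 22.36]})

train_len = len(train)
dataset = pd.concat([train, test], axis=0, sort=False).reset_index(drop=True)

# Train rows come first, so train_len is enough to split the frame back.
train_part = dataset.iloc[:train_len]
test_part = dataset.iloc[train_len:].drop(columns="Survived")

print(dataset["Survived"].isna().sum())  # number of unlabeled (test) rows
```

Because the missing labels are stored as `NaN`, the `Survived` column becomes `float64`, which matches the `dtypes` output in this notebook.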


In [8]:
dataset.dtypes

Age            float64
Cabin           object
Embarked        object
Fare           float64
Name            object
Parch            int64
PassengerId      int64
Pclass           int64
Sex             object
SibSp            int64
Survived       float64
Ticket          object
dtype: object

In [11]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age            1046 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
Fare           1308 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB


In [12]:
dataset.isnull().sum()

Age             263
Cabin          1014
Embarked          2
Fare              1
Name              0
Parch             0
PassengerId       0
Pclass            0
Sex               0
SibSp             0
Survived        418
Ticket            0
dtype: int64

## Feature analysis

In [13]:
# pclass
train.groupby('Pclass')['Survived'].count()

Pclass
1    216
2    184
3    491
Name: Survived, dtype: int64

In [14]:
train.groupby('Pclass')['Survived'].sum()/train.groupby('Pclass')['Survived'].count()

Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64
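
Since `Survived` is a 0/1 column, `sum() / count()` is simply its mean. The sketch below shows one possible way to get the group size and the survival rate in a single call; the toy frame and the output column names `passengers` and `survival_rate` are assumptions made for illustration.

```python
import pandas as pd

# Toy data (hypothetical values), same column names as the notebook.
toy = pd.DataFrame({"Pclass":   [1, 1, 2, 3, 3, 3],
                    "Survived": [1, 0, 1, 0, 0, 1]})

# Named aggregation: one row per class, one column per statistic.
summary = toy.groupby("Pclass")["Survived"].agg(passengers="count",
                                                survival_rate="mean")
print(summary)
```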

In [15]:
enc_pclass = preprocessing.OneHotEncoder()  # one-hot encoder
enc_pclass.fit(dataset[['Pclass']])


OneHotEncoder(categorical_features=None, categories=None,
       dtype=<class 'numpy.float64'>, handle_unknown='error',
       n_values=None, sparse=True)

In [16]:
# Use a list for the column names: a set has no guaranteed order and can mislabel the columns
Pclass_feature = pd.DataFrame(enc_pclass.transform(dataset[['Pclass']]).toarray(),
                              columns=['Pclass'+str(i) for i in range(len(dataset['Pclass'].unique()))],dtype=int)

In [17]:
Pclass_feature.head()

Unnamed: 0,Pclass0,Pclass1,Pclass2
0,0,0,1
1,1,0,0
2,0,0,1
3,1,0,0
4,0,0,1
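
As an alternative to building the column names by hand, `pd.get_dummies` names each column after the category value itself. This is a swapped-in technique, not what the notebook uses, and the toy series below is an assumption for illustration.

```python
import pandas as pd

# Same first five Pclass values as the head() output above.
pclass = pd.Series([3, 1, 3, 1, 3], name="Pclass")

# One column per category, named "<prefix>_<value>".
pclass_dummies = pd.get_dummies(pclass, prefix="Pclass", dtype=int)
print(pclass_dummies.columns.tolist())
```

With names derived from the values, a column such as `Pclass_3` cannot end up attached to the wrong class. Note that only categories present in the input get a column, so class 2 is absent in this tiny example.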


In [28]:
dataset[['Name']].head(2)

Unnamed: 0,Name
0,"Braund, Mr. Owen Harris"
1,"Cumings, Mrs. John Bradley (Florence Briggs Th..."


In [25]:
# name
dataset_title = [i.split(",")[1].split(".")[0].strip() for i in dataset["Name"]]  # extract the title (Mr, Mrs, ...) from the name
dataset["Title"] = pd.Series(dataset_title)
dataset["Title"].head()

0      Mr
1     Mrs
2    Miss
3     Mrs
4      Mr
Name: Title, dtype: object
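
The title is the text between the first comma and the following period. A minimal sketch of the same extraction with a regular expression; the helper name `extract_title` and the `"Unknown"` fallback for names without a comma are assumptions, not part of the notebook.

```python
import re

# Text after the first comma, up to (not including) the next period.
TITLE_RE = re.compile(r",\s*([^.]+)\.")

def extract_title(name):
    """Return the title in a 'Surname, Title. Given names' string."""
    match = TITLE_RE.search(name)
    # Hypothetical fallback: the split-based version would raise IndexError here.
    return match.group(1).strip() if match else "Unknown"

names = ["Braund, Mr. Owen Harris",
         "Oliva y Ocana, Dona. Fermina",
         "Peter, Master. Michael J",
         "no comma in this name"]
titles = [extract_title(n) for n in names]
print(titles)
```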

In [32]:
# Collapse uncommon titles into a single 'Rare' category
dataset["Title"] = dataset["Title"].replace(
    ['Lady', 'the Countess','Countess','Capt', 
     'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

In [33]:
# 0 = Master, 1 = Miss/Ms/Mme/Mlle/Mrs, 2 = Mr, 3 = Rare
dataset["Title"] = dataset["Title"].map({"Master":0, "Miss":1, "Ms" : 1 , "Mme":1, "Mlle":1, "Mrs":1, "Mr":2, "Rare":3})
dataset["Title"] = dataset["Title"].astype(int)

In [34]:
dataset['Title'].unique()

array([2, 1, 0, 3])

In [35]:
dataset.groupby(["Title"])['Name'].count()  # distribution of the encoded titles

Title
0     61
1    462
2    757
3     29
Name: Name, dtype: int64

In [36]:
enc_name = preprocessing.OneHotEncoder()  # one-hot encoder
enc_name.fit(dataset[['Title']])


OneHotEncoder(categorical_features=None, categories=None,
       dtype=<class 'numpy.float64'>, handle_unknown='error',
       n_values=None, sparse=True)

In [39]:
Name_feature = pd.DataFrame(enc_name.transform(dataset[['Title']]).toarray(),
                              columns=['Title'+str(i) for i in range(len(dataset['Title'].unique()))],dtype=int)

In [40]:
Name_feature.head()

Unnamed: 0,Title0,Title1,Title2,Title3
0,0,0,1,0
1,0,1,0,0
2,0,1,0,0
3,0,1,0,0
4,0,0,1,0


In [41]:
# sex
train.groupby('Sex')['Survived'].sum()/train.groupby('Sex')['Survived'].count()

Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64

In [46]:
train.groupby('Sex')['Survived'].count()

Sex
female    314
male      577
Name: Survived, dtype: int64

In [44]:
Sex_feature = pd.DataFrame(dataset['Sex'].map({'female':0,'male':1}).values,columns=['sex'])

In [45]:
Sex_feature.head()

Unnamed: 0,sex
0,1
1,0
2,0
3,0
4,1


In [50]:
dataset[['SibSp','Parch','Pclass']].head(2)

Unnamed: 0,SibSp,Parch,Pclass
0,1,0,3
1,1,0,1


In [47]:
# Age: fill in the missing values
index_NaN_age = list(dataset["Age"][dataset["Age"].isnull()].index)

In [51]:
for i in index_NaN_age :
    age_med = dataset["Age"].median()
    # Median age of passengers with the same SibSp, Parch and Pclass
    age_pred = dataset["Age"][((dataset['SibSp'] == dataset.iloc[i]["SibSp"]) & (dataset['Parch'] == dataset.iloc[i]["Parch"]) & (dataset['Pclass'] == dataset.iloc[i]["Pclass"]))].median()
    if not np.isnan(age_pred) :
        dataset.loc[i, 'Age'] = age_pred  # .loc avoids chained assignment (SettingWithCopyWarning)
    else :
        dataset.loc[i, 'Age'] = age_med   # no similar passenger with a known age: fall back to the overall median
        
### Alternative: fit a regression on the remaining features and use it to predict Age


In [52]:
Age_feature = dataset[['Age']]

In [54]:
Age_feature.head(3)

Unnamed: 0,Age
0,22.0
1,38.0
2,26.0
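
The imputation loop above can also be written without an explicit loop. The sketch below is an approximation rather than an exact equivalent: it computes every group median once, whereas the loop recomputes medians after each filled value. The toy frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with the same column names as the notebook.
toy = pd.DataFrame({
    "SibSp":  [0, 0, 0, 1, 1, 2],
    "Parch":  [0, 0, 0, 0, 0, 1],
    "Pclass": [3, 3, 3, 1, 1, 2],
    "Age":    [20.0, 30.0, np.nan, 40.0, np.nan, np.nan],
})

overall_median = toy["Age"].median()

# Median age of each (SibSp, Parch, Pclass) group, aligned to the original rows.
group_median = toy.groupby(["SibSp", "Parch", "Pclass"])["Age"].transform("median")

# Group median first; overall median where the whole group has no known age.
toy["Age"] = toy["Age"].fillna(group_median).fillna(overall_median)
print(toy["Age"].tolist())
```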


In [55]:
# SibSp
dataset['SibSp'].unique()

array([1, 0, 3, 4, 2, 5, 8])

In [56]:
dataset.groupby(['SibSp'])['Survived'].count()

SibSp
0    608
1    209
2     28
3     16
4     18
5      5
8      7
Name: Survived, dtype: int64

In [22]:
dataset.groupby(['SibSp'])['Survived'].sum()

SibSp
0    210.0
1    112.0
2     13.0
3      4.0
4      3.0
5      0.0
8      0.0
Name: Survived, dtype: float64

In [57]:
SibSp_feature = dataset[['SibSp']].copy()  # work on a copy, not a slice of dataset
SibSp_feature['SibSp'] = np.where(SibSp_feature['SibSp']<2,SibSp_feature['SibSp'],2)  # cap values of 2 or more at 2


In [59]:
SibSp_feature.SibSp.unique()

array([1, 0, 2])
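
Capping a count at an upper bound, as done for `SibSp` here, can also be expressed with `Series.clip`. A small sketch comparing both forms on the unique `SibSp` values listed earlier:

```python
import numpy as np
import pandas as pd

sibsp = pd.Series([1, 0, 3, 4, 2, 5, 8], name="SibSp")

capped_where = np.where(sibsp < 2, sibsp, 2)  # form used in the notebook
capped_clip = sibsp.clip(upper=2)             # same result, states the intent directly

print(capped_clip.tolist())
```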

In [60]:
# Parch
train.groupby(['Parch'])['Survived'].count()

Parch
0    678
1    118
2     80
3      5
4      4
5      5
6      1
Name: Survived, dtype: int64

In [63]:
Parch_feature = dataset[['Parch']].copy()  # work on a copy, not a slice of dataset
Parch_feature['Parch'] = np.where(Parch_feature['Parch']>1,2,Parch_feature['Parch'])  # cap values above 1 at 2


In [64]:
Parch_feature.head()

Unnamed: 0,Parch
0,0
1,0
2,0
3,0
4,0


In [27]:
# ticket
dataset['Ticket'].unique()

array(['A/5 21171', 'PC 17599', 'STON/O2. 3101282', '113803', '373450',
       '330877', '17463', '349909', '347742', '237736', 'PP 9549',
       '113783', 'A/5. 2151', '347082', '350406', '248706', '382652',
       '244373', '345763', '2649', '239865', '248698', '330923', '113788',
       '347077', '2631', '19950', '330959', '349216', 'PC 17601',
       'PC 17569', '335677', 'C.A. 24579', 'PC 17604', '113789', '2677',
       'A./5. 2152', '345764', '2651', '7546', '11668', '349253',
       'SC/Paris 2123', '330958', 'S.C./A.4. 23567', '370371', '14311',
       '2662', '349237', '3101295', 'A/4. 39886', 'PC 17572', '2926',
       '113509', '19947', 'C.A. 31026', '2697', 'C.A. 34651', 'CA 2144',
       '2669', '113572', '36973', '347088', 'PC 17605', '2661',
       'C.A. 29395', 'S.P. 3464', '3101281', '315151', 'C.A. 33111',
       'S.O.C. 14879', '2680', '1601', '348123', '349208', '374746',
       '248738', '364516', '345767', '345779', '330932', '113059',
       'SO/C 14885', '31012

In [65]:
Ticket = []
for i in list(dataset.Ticket):
    if not i.isdigit() :
        Ticket.append(i.replace(".","").replace("/","").strip().split(' ')[0]) #Take prefix
    else:
        Ticket.append("X")

In [66]:
Ticket_feature = pd.DataFrame(Ticket,columns=['Ticket'])

In [67]:
Ticket_feature.groupby('Ticket')['Ticket'].count() # The distribution is heavily skewed and hard to exploit. Most people simply drop this feature, and we follow the majority and leave it out.

Ticket
A            1
A4          10
A5          28
AQ3          1
AQ4          1
AS           1
C            8
CA          68
CASOTON      1
FC           3
FCC          9
Fa           1
LINE         4
LP           1
PC          92
PP           4
PPP          2
SC           2
SCA3         1
SCA4         2
SCAH         5
SCOW         1
SCPARIS     14
SCParis      5
SOC          8
SOP          1
SOPP         7
SOTONO2      3
SOTONOQ     24
SP           1
STONO       14
STONO2       7
STONOQ       1
SWPP         2
WC          15
WEP          4
X          957
Name: Ticket, dtype: int64
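
If the ticket prefix were kept instead of dropped, one possible treatment of the long tail is to fold rare prefixes into a single bucket. The sketch below reuses the notebook's prefix rule; the helper names, the `"Other"` label, and the `min_count` threshold are assumptions for illustration.

```python
from collections import Counter

def ticket_prefix(ticket):
    """Same rule as the loop above: 'X' for numeric tickets, else the cleaned prefix."""
    if ticket.isdigit():
        return "X"
    return ticket.replace(".", "").replace("/", "").strip().split(" ")[0]

def fold_rare(prefixes, min_count):
    """Replace prefixes seen fewer than min_count times with 'Other' (hypothetical label)."""
    counts = Counter(prefixes)
    return [p if counts[p] >= min_count else "Other" for p in prefixes]

# A few ticket strings taken from the unique() output above.
tickets = ["A/5 21171", "PC 17599", "113803", "A/5. 2151",
           "PC 17601", "S.O.C. 14879", "373450"]
prefixes = [ticket_prefix(t) for t in tickets]
folded = fold_rare(prefixes, min_count=2)
print(prefixes)
print(folded)
```

Upper-casing the prefix before counting would also merge variants such as `SCPARIS` and `SCParis`, which appear as separate groups in the counts above.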

In [68]:
# Fare 
dataset["Fare"] = dataset["Fare"].fillna(dataset["Fare"].median())
Fare_feature = dataset[['Fare']]

In [70]:
Fare_feature.describe()

Unnamed: 0,Fare
count,1309.0
mean,33.281086
std,51.7415
min,0.0
25%,7.8958
50%,14.4542
75%,31.275
max,512.3292


In [71]:
# Embarked
dataset["Embarked"].isnull().sum()

2

In [72]:
dataset.groupby(["Embarked"])['Fare'].count()

Embarked
C    270
Q    123
S    914
Name: Fare, dtype: int64

In [74]:
dataset["Embarked"] = dataset["Embarked"].fillna("S")

In [75]:
enc_Embarked = preprocessing.OneHotEncoder()  # one-hot encoder
enc_Embarked.fit(dataset[['Embarked']])

OneHotEncoder(categorical_features=None, categories=None,
       dtype=<class 'numpy.float64'>, handle_unknown='error',
       n_values=None, sparse=True)

In [76]:
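# Use a list for the column names: a set has no guaranteed order and mislabels the columns.
# The encoder sorts the categories, so Embarked0 = C, Embarked1 = Q, Embarked2 = S.
Embarked_feature = pd.DataFrame(enc_Embarked.transform(dataset[['Embarked']]).toarray(),
            columns=['Embarked'+str(i) for i in range(len(dataset['Embarked'].unique()))],dtype=int)

In [77]:
Embarked_feature.head()

Unnamed: 0,Embarked0,Embarked1,Embarked2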
0,0,0,1
1,1,0,0
2,0,0,1
3,0,0,1
4,0,0,1


In [78]:
# Cabin
dataset["Cabin"].head()

0     NaN
1     C85
2     NaN
3    C123
4     NaN
Name: Cabin, dtype: object

In [81]:
dataset["Cabin"].unique()

array([nan, 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
       'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
       'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
       'F E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4',
       'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35',
       'C87', 'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19',
       'B49', 'D', 'C22 C26', 'C106', 'C65', 'E36', 'C54',
       'B57 B59 B63 B66', 'C7', 'E34', 'C32', 'B18', 'C124', 'C91', 'E40',
       'T', 'C128', 'D37', 'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44',
       'A34', 'C104', 'C111', 'C92', 'E38', 'D21', 'E12', 'E63', 'A14',
       'B37', 'C30', 'D20', 'B79', 'E25', 'D46', 'B73', 'C95', 'B38',
       'B39', 'B22', 'C86', 'C70', 'A16', 'C101', 'C68', 'A10', 'E68',
       'B41', 'A20', 'D19', 'D50', 'D9', 'A23', 'B50', 'A26', 'D48',
       'E58', 'C126', 'B71', 'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63',
       'C62 C64',

In [100]:
Cabin_feature = dataset[["Cabin"]]

In [79]:
Cabin_feature = Cabin_feature.fillna('X')

In [82]:
Cabin_feature['Cabin'] = Cabin_feature['Cabin'].map(lambda x: x[0])

In [83]:
Cabin_feature.loc[Cabin_feature['Cabin'].isin(['G','T']), 'Cabin'] = 'F'  # merge the rare decks G and T into F

In [84]:
enc_Cabin = preprocessing.OneHotEncoder()  # one-hot encoder
enc_Cabin.fit(Cabin_feature[['Cabin']])
# List (not set) for the column names. Sorted categories: Cabin0..Cabin5 = decks A..F, Cabin6 = X (missing).
Cabin_feature = pd.DataFrame(enc_Cabin.transform(Cabin_feature[['Cabin']]).toarray(),
                             columns=['Cabin'+str(i) for i in range(len(Cabin_feature['Cabin'].unique()))],dtype=int)

In [85]:
Cabin_feature.head()

Unnamed: 0,Cabin0,Cabin1,Cabin2,Cabin3,Cabin4,Cabin5,Cabin6
0,0,0,0,0,0,0,1
1,0,0,1,0,0,0,0
2,0,0,0,0,0,0,1
3,0,0,1,0,0,0,0
4,0,0,0,0,0,0,1


In [86]:
#### new feature
# family size
dataset["Fsize"] = dataset["SibSp"] + dataset["Parch"] + 1

In [87]:
dataset.groupby(['Fsize'])['Survived'].count()

Fsize
1     537
2     161
3     102
4      29
5      15
6      22
7      12
8       6
11      7
Name: Survived, dtype: int64

In [88]:
dataset.groupby(['Fsize'])['Survived'].sum()/dataset.groupby(['Fsize'])['Survived'].count()

Fsize
1     0.303538
2     0.552795
3     0.578431
4     0.724138
5     0.200000
6     0.136364
7     0.333333
8     0.000000
11    0.000000
Name: Survived, dtype: float64

In [89]:
Fsize_feature = dataset[['Fsize']].copy()  # work on a copy, not a slice of dataset

In [90]:
Fsize_feature['Fsize'] = np.where(Fsize_feature['Fsize']>=5,5,Fsize_feature['Fsize'])  # families of 5 or more -> 5

In [91]:
Fsize_feature['Fsize'] = np.where((Fsize_feature['Fsize']<5)&(Fsize_feature['Fsize']>=2),2,Fsize_feature['Fsize'])  # families of 2 to 4 -> 2


In [94]:
Fsize_feature['Fsize'].unique()

array([2, 1, 5])
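
The two `np.where` steps sort family sizes into three bins: alone (1), small (2 to 4), and large (5 or more). The same binning can be sketched in one step with `pd.cut`; the bin edges below are chosen to reproduce the labels 1, 2 and 5, and the sample sizes are made up.

```python
import pandas as pd

fsize = pd.Series([2, 1, 4, 5, 11, 7], name="Fsize")

# Right-closed intervals: (0, 1] -> 1, (1, 4] -> 2, (4, inf] -> 5
binned = pd.cut(fsize,
                bins=[0, 1, 4, float("inf")],
                labels=[1, 2, 5]).astype(int)
print(binned.tolist())
```

The survival rates by `Fsize` shown earlier motivate these bins: passengers in families of 2 to 4 survived more often than both solo travellers and larger families.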

In [97]:
## Combine all features
# Pass the DataFrames directly rather than looking them up by name with eval()
feature_list = [Name_feature, Sex_feature, Age_feature, SibSp_feature, Parch_feature,
                Fare_feature, Embarked_feature, Cabin_feature, Fsize_feature]
baseinfo = pd.concat([dataset[['PassengerId','Survived']]] + feature_list, axis=1)

In [98]:
baseinfo.head()

Unnamed: 0,PassengerId,Survived,Title0,Title1,Title2,Title3,sex,Age,SibSp,Parch,Fare,Embarked0,Embarked1,Embarked2,Cabin0,Cabin1,Cabin2,Cabin3,Cabin4,Cabin5,Cabin6,Fsize
0,1,0.0,0,0,1,0,1,22.0,1,0,7.25,0,0,1,0,0,0,0,0,0,1,2
1,2,1.0,0,1,0,0,0,38.0,1,0,71.2833,1,0,0,0,0,1,0,0,0,0,2
2,3,1.0,0,1,0,0,0,26.0,0,0,7.925,0,0,1,0,0,0,0,0,0,1,1
3,4,1.0,0,1,0,0,0,35.0,1,0,53.1,0,0,1,0,0,1,0,0,0,0,2
4,5,0.0,0,0,1,0,1,35.0,0,0,8.05,0,0,1,0,0,0,0,0,0,1,1


In [99]:
baseinfo.shape

(1309, 22)

In [106]:
train_df = baseinfo[baseinfo['PassengerId'].isin(list(train['PassengerId']))]

In [107]:
test_df = baseinfo[baseinfo['PassengerId'].isin(list(test['PassengerId']))]

In [108]:
train_df.head()

Unnamed: 0,PassengerId,Survived,Title0,Title1,Title2,Title3,sex,Age,SibSp,Parch,Fare,Embarked0,Embarked1,Embarked2,Cabin0,Cabin1,Cabin2,Cabin3,Cabin4,Cabin5,Cabin6,Fsize
0,1,0.0,0,0,1,0,1,22.0,1,0,7.25,0,0,1,0,0,0,0,0,0,1,2
1,2,1.0,0,1,0,0,0,38.0,1,0,71.2833,1,0,0,0,0,1,0,0,0,0,2
2,3,1.0,0,1,0,0,0,26.0,0,0,7.925,0,0,1,0,0,0,0,0,0,1,1
3,4,1.0,0,1,0,0,0,35.0,1,0,53.1,0,0,1,0,0,1,0,0,0,0,2
4,5,0.0,0,0,1,0,1,35.0,0,0,8.05,0,0,1,0,0,0,0,0,0,1,1


In [109]:
test_df.head()

Unnamed: 0,PassengerId,Survived,Title0,Title1,Title2,Title3,sex,Age,SibSp,Parch,Fare,Embarked0,Embarked1,Embarked2,Cabin0,Cabin1,Cabin2,Cabin3,Cabin4,Cabin5,Cabin6,Fsize
891,892,,0,0,1,0,1,34.5,0,0,7.8292,0,1,0,0,0,0,0,0,0,1,1
892,893,,0,1,0,0,0,47.0,1,0,7.0,0,0,1,0,0,0,0,0,0,1,2
893,894,,0,0,1,0,1,62.0,0,0,9.6875,0,1,0,0,0,0,0,0,0,1,1
894,895,,0,0,1,0,1,27.0,0,0,8.6625,0,0,1,0,0,0,0,0,0,1,1
895,896,,0,1,0,0,0,22.0,1,1,12.2875,0,0,1,0,0,0,0,0,0,1,2


### Model 1: XGBoost

In [113]:
import xgboost as xgb
from sklearn.model_selection import train_test_split

In [110]:
params={
    'eta': 0.3,
    'max_depth':3,   
    'min_child_weight':1,
    'gamma':0.1, 
    'subsample':0.8,
    'colsample_bytree':0.8,
    'booster':'gbtree',
    'objective': 'binary:logistic',
    'scale_pos_weight': 1,
    'silent':0 ,
    'eval_metric': 'auc'
}

In [111]:
# Drop the first two columns (PassengerId, Survived) to get the feature matrix
Dtrain = train_df.iloc[:, 2:]
y_train = train_df[['Survived']]
Dtest = test_df.iloc[:, 2:]

In [114]:
Xtrain,Xvalid,ytrain,yvalid = train_test_split(Dtrain,y_train,test_size=0.3, random_state=42)
Xtrain = Xtrain.reset_index(drop=True)
Xvalid = Xvalid.reset_index(drop=True)
ytrain = ytrain.reset_index(drop=True)
yvalid = yvalid.reset_index(drop=True)

In [115]:
d_train = xgb.DMatrix(Xtrain, label=ytrain)
d_valid = xgb.DMatrix(Xvalid, label=yvalid)
all_train = xgb.DMatrix(Dtrain,label=y_train)
test = xgb.DMatrix(Dtest)

In [116]:
watchlist = [(d_train, 'train'), (d_valid, 'valid')]

In [117]:
model_bst = xgb.train(params, d_train, 30, watchlist, early_stopping_rounds=10, verbose_eval=1)

[22:07:51] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 0 pruned nodes, max_depth=3
[0]	train-auc:0.831925	valid-auc:0.809807
Multiple eval metrics have been passed: 'valid-auc' will be used for early stopping.

Will train until valid-auc hasn't improved in 10 rounds.
[22:07:51] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 0 pruned nodes, max_depth=3
[1]	train-auc:0.865525	valid-auc:0.851782
[22:07:51] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 0 pruned nodes, max_depth=3
[2]	train-auc:0.871632	valid-auc:0.877948
[22:07:51] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 0 pruned nodes, max_depth=3
[3]	train-auc:0.876469	valid-auc:0.885809
[22:07:51] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 0 pruned nodes, max_depth=3
[4]	train-auc:0.878578	valid-auc:0.886355
[22:07:51] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 0 p

In [118]:
params={
    'eta': 0.3,
    'max_depth':3,   
    'min_child_weight':1,
    'gamma':0.1, 
    'subsample':0.8,
    'colsample_bytree':0.8,
    'booster':'gbtree',
    'objective': 'binary:logistic',
    'scale_pos_weight': 1,
    'silent':0     
}

In [119]:
model_bst = xgb.train(params, all_train, 9)

[22:08:28] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 0 pruned nodes, max_depth=3
[22:08:28] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 0 pruned nodes, max_depth=3
[22:08:28] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 0 pruned nodes, max_depth=3
[22:08:28] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 0 pruned nodes, max_depth=3
[22:08:28] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 0 pruned nodes, max_depth=3
[22:08:28] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 0 pruned nodes, max_depth=3
[22:08:28] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 0 pruned nodes, max_depth=3
[22:08:28] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 0 pruned nodes, max_depth=3
[22:08:28] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 0 pruned nodes, max_

In [285]:
# predict: with binary:logistic the booster returns probabilities, thresholded at 0.5 below
test_Survived = pd.Series(model_bst.predict(test), name="Survived")  # `test` is the DMatrix built above
results = pd.concat([IDtest,test_Survived],axis=1)
results['Survived'] = np.where(results['Survived']>=0.5,1,0)
results.to_csv("ensemble_python_verone.csv",index=False)



### Model 2: XGBoost + LR


In [120]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from xgboost.sklearn import XGBClassifier

In [121]:
Dtrain = train_df.iloc[0::,2::].reset_index(drop=True)
y_train = train_df[['Survived']].reset_index(drop=True)
Dtest = test_df.iloc[0::,2::].reset_index(drop=True)


In [125]:
param_test1 = {
    'n_estimators':range(3,40,2)}
gsearch1 = GridSearchCV(estimator = XGBClassifier(learning_rate =0.3, max_depth=3,
                                        min_child_weight=1, gamma=0.1, subsample=0.8, colsample_bytree=0.8,
                                        objective= 'binary:logistic', nthread=4, scale_pos_weight=1,random_state=42), 
param_grid = param_test1, scoring='roc_auc',n_jobs=4, cv=5)
gsearch1.fit(Dtrain,y_train.values.ravel())  # pass a 1-D label array to avoid the column_or_1d warning


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=0.8, gamma=0.1, learning_rate=0.3,
       max_delta_step=0, max_depth=3, min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=4, objective='binary:logistic',
       random_state=42, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
       seed=None, silent=True, subsample=0.8),
       fit_params=None, iid='warn', n_jobs=4,
       param_grid={'n_estimators': range(3, 40, 2)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=0)

In [126]:
gsearch1.scorer_, gsearch1.best_params_, gsearch1.best_score_

(make_scorer(roc_auc_score, needs_threshold=True),
 {'n_estimators': 11},
 0.8657790489854768)

In [127]:
best_paras = {
    'n_estimators': 11,
    'learning_rate': 0.3,
    'max_depth':3,   
    'min_child_weight':1,
    'gamma':0.1, 
    'subsample':0.8,
    'colsample_bytree':0.8,
    'booster':'gbtree',
    'objective': 'binary:logistic',
    'scale_pos_weight': 1,
    'random_state':42}

In [128]:
layer0_estimator = XGBClassifier(**best_paras)

In [129]:
layer0_estimator.fit(Dtrain,y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=0.8, gamma=0.1, learning_rate=0.3,
       max_delta_step=0, max_depth=3, min_child_weight=1, missing=None,
       n_estimators=11, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=42, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=True,
       subsample=0.8)

In [130]:
X_train_leaves = layer0_estimator.apply(Dtrain)

In [132]:
X_train_leaves

array([[ 5,  5, 11, ..., 11,  8,  7],
       [ 7,  8,  3, ...,  9, 14,  9],
       [ 7,  7,  7, ...,  8,  8,  7],
       ...,
       [ 7,  8,  8, ...,  8,  8,  9],
       [11, 12,  9, ..., 11,  8,  9],
       [ 5,  5, 11, ..., 11,  8,  7]], dtype=int32)

In [133]:
X_train_leaves.shape

(891, 11)

In [131]:
X_test_leaves = layer0_estimator.apply(Dtest)

In [134]:
xgbenc = preprocessing.OneHotEncoder()
X_trans = xgbenc.fit_transform(X_train_leaves)
X_test = xgbenc.transform(X_test_leaves)


In [135]:
X_trans

<891x82 sparse matrix of type '<class 'numpy.float64'>'
	with 9801 stored elements in Compressed Sparse Row format>

In [136]:
# new feature
Dtrain_new = pd.concat([pd.DataFrame(X_trans.toarray()),Dtrain],axis=1)
Dtest_new = pd.concat([pd.DataFrame(X_test.toarray()),Dtest],axis=1)

In [137]:
Dtrain_new.shape

(891, 102)
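
The steps above (fit boosted trees, read off the leaf each sample lands in, one-hot encode the leaf indices, and feed them together with the raw features to a logistic regression) can be sketched end to end. This sketch deliberately swaps in scikit-learn's `GradientBoostingClassifier` for `XGBClassifier` and uses synthetic data from `make_classification`, so the numbers are not comparable to the notebook's. As in the notebook, the leaf encoder and the logistic regression are fitted on the same rows the trees were trained on.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the Titanic feature matrix (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, y_train, X_test = X[:200], y[:200], X[200:]

# Layer 0: boosted trees (stand-in for XGBClassifier).
gbdt = GradientBoostingClassifier(n_estimators=11, max_depth=3, random_state=42)
gbdt.fit(X_train, y_train)

# apply() has shape (n_samples, n_estimators, 1) for binary problems.
train_leaves = gbdt.apply(X_train)[:, :, 0]
test_leaves = gbdt.apply(X_test)[:, :, 0]

# One-hot encode the leaf index of every tree; ignore leaves unseen in training.
encoder = OneHotEncoder(handle_unknown="ignore")
train_leaf_onehot = encoder.fit_transform(train_leaves).toarray()
test_leaf_onehot = encoder.transform(test_leaves).toarray()

# Layer 1 input: leaf indicators followed by the original features.
train_new = np.hstack([train_leaf_onehot, X_train])
test_new = np.hstack([test_leaf_onehot, X_test])

lr = LogisticRegression(C=1.0, max_iter=1000)
lr.fit(train_new, y_train)
test_proba = lr.predict_proba(test_new)[:, 1]  # probability of the positive class
```

Each row of the leaf matrix contains exactly one active indicator per tree, which is why the notebook's sparse matrix stores 891 x 11 = 9801 non-zero elements.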

In [138]:
# Logistic regression
lr = LogisticRegression(penalty='l2')

In [139]:
param_test1 = {
    'C':range(1,100,10)}
gsearch1 = GridSearchCV(estimator = LogisticRegression(penalty='l2',random_state=42), 
param_grid = param_test1, scoring='roc_auc',n_jobs=4, cv=5)
gsearch1.fit(Dtrain_new,y_train.values.ravel())  # pass a 1-D label array to avoid the column_or_1d warning


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=42, solver='warn',
          tol=0.0001, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=4,
       param_grid={'C': range(1, 100, 10)}, pre_dispatch='2*n_jobs',
       refit=True, return_train_score='warn', scoring='roc_auc', verbose=0)

In [140]:
gsearch1.scorer_, gsearch1.best_params_, gsearch1.best_score_

(make_scorer(roc_auc_score, needs_threshold=True), {'C': 1}, 0.895956232911137)

In [142]:
best_paras = {'C':1,'penalty':'l2'}

In [143]:
lr = LogisticRegression(**best_paras)

In [144]:
lr.fit(Dtrain_new,y_train.values.ravel())  # pass a 1-D label array to avoid the column_or_1d warning


LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [397]:
# predict: lr.predict already returns class labels (0.0 / 1.0), so the 0.5 threshold below only casts them to int
test_Survived = pd.Series(lr.predict(Dtest_new), name="Survived")
results = pd.concat([IDtest,test_Survived],axis=1)
results['Survived'] = np.where(results['Survived']>=0.5,1,0)
results.to_csv("ensemble_python_vertwo.csv",index=False)