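# Cross-Sell Prediction for the Insurance Industry with Python

## 1.1 Business Understanding

Our client is an insurance company that provides health insurance to its customers. It now needs a model to predict whether policyholders (customers) from the past year would also be interested in the vehicle insurance the company offers.

An insurance policy is an arrangement in which the company undertakes to compensate specified kinds of loss, damage, illness, or death, and the customer pays a premium in return. The premium is the sum of money the customer pays the insurer at regular intervals for this guarantee.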

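For example, you might pay a premium of 2,000 a year for health cover of 200,000. If you wonder how the company can bear such high hospitalization costs while charging a premium of only 2,000, this is where the concept of probability comes in. There may be 100 customers like you, each paying a premium of 2,000 a year, but only a few of them (say 2-3) will be hospitalized in that year, not all of them. In this way everyone shares everyone else's risk.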
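The pooling argument can be checked with simple arithmetic. The sketch below takes the pool size and premium from the example; the average payout per claim is a made-up figure used only for illustration.

```python
# Risk-pooling arithmetic for the example in the text.
customers = 100      # policyholders in the pool (from the text)
premium = 2_000      # yearly premium per customer (from the text)
claims = 3           # hospitalized customers that year (upper end of "2-3")
avg_claim = 60_000   # ASSUMPTION: average payout per claim, not from the source

pool = customers * premium         # total premiums collected
payout = claims * avg_claim        # total claims paid out
break_even_claim = pool / claims   # largest average claim the pool can absorb

print(f"premium pool:     {pool}")
print(f"total payout:     {payout}")
print(f"surplus:          {pool - payout}")
print(f"break-even claim: {break_even_claim:.0f}")
```

Under these assumed numbers the pool covers the claims; if the average claim rose above the break-even figure, the premium would have to rise as well.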

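Vehicle insurance works much like medical insurance: the customer pays the insurer a premium of a certain amount every year, and if the vehicle suffers an unfortunate accident, the insurer pays the customer compensation (called the "sum assured").

A model that predicts whether a customer is interested in vehicle insurance is very helpful to the company, because it can then plan its communication strategy accordingly, reach out to those customers, and optimize its business model and revenue.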

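## 1.2 Data Understanding

To predict whether a customer is interested in vehicle insurance, you have information on demographics (gender, age, region code), vehicles (vehicle age, damage history), and policies (premium, sales channel).

The data is split into a training set and a test set. The training data contains 381,109 customer records, each with 12 fields: 1 customer ID field, 10 input fields, and 1 target field, Response (1 = interested, 0 = not interested). The test data contains 127,037 customer records with the same fields, except that the target has no values. The fields are defined below.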

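Field|Description|Role|Measurement type|Distinct values
---|:--:|---:|--:|--:
id |Customer ID|Record identifier|Typeless|381109
Gender |Gender|Input|Categorical|2
Age |Age|Input|Numeric|66
Driving_License |Holds a driving license|Input|Categorical|2
Region_Code |Code of the customer's region|Input|Categorical|53
Previously_Insured |Already has vehicle insurance|Input|Categorical|2
Vehicle_Age |Vehicle age|Input|Categorical|3
Vehicle_Damage |Vehicle damaged in the past|Input|Categorical|2
Annual_Premium |Annual premium (rupees)|Input|Numeric|48838
Policy_Sales_Channel |Sales channel|Input|Categorical|155
Vintage |Time with the company (days)|Input|Numeric|290
Response |Responded (interested)|Target|Categorical|2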

## 1.3 Loading and Previewing the Data

In [1]:
# Data wrangling
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly as py
import plotly.graph_objs as go
import plotly.express as px
pyplot = py.offline.plot
from exploratory_data_analysis import EDAnalysis  # custom module, not shown here

In [39]:
# Load the training set
train = pd.read_csv('../data/train.csv')
train.head()

(index)|id|Gender|Age|Driving_License|Region_Code|Previously_Insured|Vehicle_Age|Vehicle_Damage|Annual_Premium|Policy_Sales_Channel|Vintage|Response
--:|--:|---|--:|--:|--:|--:|---|---|--:|--:|--:|--:
0|1|Male|44|1|28.0|0|> 2 Years|Yes|40454.0|26.0|217|1
1|2|Male|76|1|3.0|0|1-2 Year|No|33536.0|26.0|183|0
2|3|Male|47|1|28.0|0|> 2 Years|Yes|38294.0|26.0|27|1
3|4|Male|21|1|11.0|1|< 1 Year|No|28619.0|152.0|203|0
4|5|Female|29|1|41.0|1|< 1 Year|No|27496.0|152.0|39|0


In [40]:
# Load the test set
test = pd.read_csv('../data/test.csv')
test.head()

(index)|id|Gender|Age|Driving_License|Region_Code|Previously_Insured|Vehicle_Age|Vehicle_Damage|Annual_Premium|Policy_Sales_Channel|Vintage
--:|--:|---|--:|--:|--:|--:|---|---|--:|--:|--:
0|381110|Male|25|1|11.0|1|< 1 Year|No|35786.0|152.0|53
1|381111|Male|40|1|28.0|0|1-2 Year|Yes|33762.0|7.0|111
2|381112|Male|47|1|28.0|0|1-2 Year|Yes|40050.0|124.0|199
3|381113|Male|24|1|27.0|1|< 1 Year|Yes|37356.0|152.0|187
4|381114|Male|27|1|28.0|1|< 1 Year|No|59097.0|152.0|297


In [41]:
print(train.info())
print('-' * 50)
print(test.info()) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 381109 entries, 0 to 381108
Data columns (total 12 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   id                    381109 non-null  int64  
 1   Gender                381109 non-null  object 
 2   Age                   381109 non-null  int64  
 3   Driving_License       381109 non-null  int64  
 4   Region_Code           381109 non-null  float64
 5   Previously_Insured    381109 non-null  int64  
 6   Vehicle_Age           381109 non-null  object 
 7   Vehicle_Damage        381109 non-null  object 
 8   Annual_Premium        381109 non-null  float64
 9   Policy_Sales_Channel  381109 non-null  float64
 10  Vintage               381109 non-null  int64  
 11  Response              381109 non-null  int64  
dtypes: float64(3), int64(6), object(3)
memory usage: 34.9+ MB
None
--------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
Rang

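## 1.4 Exploratory Analysis

We carry out exploratory data analysis on the training set.

### 1.4.1 Descriptive Analysis

First, we compute descriptive statistics for the numeric attributes in the dataset.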

In [42]:
desc_table = train.drop(['id', 'Vehicle_Age'], axis=1).describe().T
desc_table

(index)|count|mean|std|min|25%|50%|75%|max
---|--:|--:|--:|--:|--:|--:|--:|--:
Age|381109.0|38.822584|15.511611|20.0|25.0|36.0|49.0|85.0
Driving_License|381109.0|0.997869|0.04611|0.0|1.0|1.0|1.0|1.0
Region_Code|381109.0|26.388807|13.229888|0.0|15.0|28.0|35.0|52.0
Previously_Insured|381109.0|0.45821|0.498251|0.0|0.0|0.0|1.0|1.0
Annual_Premium|381109.0|30564.389581|17213.155057|2630.0|24405.0|31669.0|39400.0|540165.0
Policy_Sales_Channel|381109.0|112.034295|54.203995|1.0|29.0|133.0|152.0|163.0
Vintage|381109.0|154.347397|83.671304|10.0|82.0|154.0|227.0|299.0
Response|381109.0|0.122563|0.327936|0.0|0.0|0.0|0.0|1.0


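The descriptive statistics show the following:

- Age: customers are between 20 and 85 years old, with a mean of about 38.8 and a median of 36, so younger customers predominate;
- Driving license: 99.79% of customers hold a driving license;
- Previously insured: 45.82% of customers already have vehicle insurance;
- Annual premium: premiums range from 2,630 to 540,165, with a mean of 30,564;
- Vintage: the data covers the past year; customers have been with the company for 10 to 299 days, 154 days on average;
- Response: on average, 12.26% of customers are interested in vehicle insurance.

### 1.4.2 Distribution of the Target Variable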

In [6]:
train['Response'].value_counts() 

0    334399
1     46710
Name: Response, dtype: int64

In [7]:
values = train['Response'].value_counts().values.tolist()

# Trace
trace1 = go.Pie(labels=['Not interested', 'Interested'],
                values=values,
                hole=.5,
                marker={'line': {'color': 'white', 'width': 1.3}}
               )
# List of traces
data = [trace1]
# Layout
layout = go.Layout(title='Distribution_ratio of Response', height=600)
# Figure
fig = go.Figure(data=data, layout=layout)
# Write the HTML file
pyplot(fig, filename='./html/目标变量分布.html')

'./html/目标变量分布.html'

The training set has 381,109 customer records: 46,710 customers are interested (12.3%) and 334,399 are not (87.7%).
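The shares quoted above follow directly from the class counts. As a small self-contained check, the sketch below rebuilds a Response column from those two counts (rather than reading `train.csv`) and lets pandas compute the percentages.

```python
import numpy as np
import pandas as pd

# Rebuild the Response column from the counts reported in the output above.
response = pd.Series(np.repeat([0, 1], [334399, 46710]), name="Response")

counts = response.value_counts()
shares = response.value_counts(normalize=True).mul(100).round(1)

summary = pd.DataFrame({"count": counts, "share_pct": shares})
print(summary)
```

On the real data the same table comes from `train['Response'].value_counts(normalize=True)`.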

### 1.4.3 Gender and Interest

In [23]:
pd.crosstab(train['Gender'], train['Response'])  

Gender|Response = 0|Response = 1
---|--:|--:
Female|156835|18185
Male|177564|28525


In [77]:
# Instantiate the custom EDA helper
eda = EDAnalysis(data=train, id_col='id', target='Response')

# Stacked bar chart
fig = eda.draw_bar_stack_cat(colname='Gender')
pyplot(fig, filename='./html/性别与是否感兴趣.html')

'./html/性别与是否感兴趣.html'

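Conclusion: the bar chart shows that male customers are somewhat more likely to be interested in vehicle insurance, at 13.84%, about 3.5 percentage points higher than female customers (10.39%).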
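The percentages in this and the following subsections are row-wise shares of the crosstab. A minimal sketch, using the Gender counts printed above instead of the raw data:

```python
import pandas as pd

# Cell counts copied from the Gender x Response crosstab above.
counts = pd.DataFrame(
    {0: [156835, 177564], 1: [18185, 28525]},
    index=pd.Index(["Female", "Male"], name="Gender"),
)
counts.columns.name = "Response"

# Row-normalize: percentage of each gender with Response 0 or 1.
rates = counts.div(counts.sum(axis=1), axis=0).mul(100).round(2)
gap = round(rates.loc["Male", 1] - rates.loc["Female", 1], 2)

print(rates)
print("gap in percentage points:", gap)
```

With the raw data, `pd.crosstab(train['Gender'], train['Response'], normalize='index')` produces the same shares in one call, and the same pattern applies to the other categorical fields.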

### 1.4.4 Driving License and Interest

In [21]:
pd.crosstab(train['Driving_License'], train['Response'])  

Driving_License|Response = 0|Response = 1
---|--:|--:
0|771|41
1|333628|46669


In [25]:
fig = eda.draw_bar_stack_cat(colname='Driving_License')
pyplot(fig, filename='./html/是否有驾照和是否感兴趣.html')  

'./html/是否有驾照和是否感兴趣.html'

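Conclusion: customers with a driving license are more likely to be interested in vehicle insurance (12.27%); only 5.05% of customers without a license are interested.

### 1.4.5 Previous Insurance and Interest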

In [26]:
pd.crosstab(train['Previously_Insured'], train['Response'])  

Previously_Insured|Response = 0|Response = 1
---|--:|--:
0|159929|46552
1|174470|158


In [27]:
fig = eda.draw_bar_stack_cat(colname='Previously_Insured')
pyplot(fig, filename='./html/之前是否投保与是否感兴趣.html')  

'./html/之前是否投保与是否感兴趣.html'

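Conclusion: customers who do not yet have vehicle insurance respond at a much higher rate, 22.54%. Customers who already have vehicle insurance have no such need, and only 0.09% of them are interested.

### 1.4.6 Vehicle Age and Interest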

In [28]:
pd.crosstab(train['Vehicle_Age'], train['Response'])  

Vehicle_Age|Response = 0|Response = 1
---|--:|--:
1-2 Year|165510|34806
< 1 Year|157584|7202
> 2 Years|11305|4702


In [29]:
fig = eda.draw_bar_stack_cat(colname='Vehicle_Age')
pyplot(fig, filename='./html/车龄与是否感兴趣.html')   

'./html/车龄与是否感兴趣.html'

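Conclusion: the older the vehicle, the higher the response rate. Vehicles more than two years old have the highest rate, 29.37%, followed by vehicles 1-2 years old at 17.38%; vehicles less than one year old reach only 4.37%.

### 1.4.7 Vehicle Damage and Interest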

In [32]:
pd.crosstab(train['Vehicle_Damage'], train['Response'])  

Vehicle_Damage|Response = 0|Response = 1
---|--:|--:
No|187714|982
Yes|146685|45728


In [33]:
fig = eda.draw_bar_stack_cat(colname='Vehicle_Damage')
pyplot(fig, filename='./html/车辆损坏情况与是否感兴趣.html') 

'./html/车辆损坏情况与是否感兴趣.html'

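Conclusion: customers whose vehicle has been damaged before respond at a higher rate, 23.76%; by contrast, customers whose vehicle has never been damaged respond at only 0.52%.

### 1.4.8 Age and Interest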

In [36]:
fig = eda.draw_bar_stack_num(colname='Age')
pyplot(fig, filename='./html/不同年龄与是否感兴趣.html') 

'./html/不同年龄与是否感兴趣.html'

The histogram shows that the oldest and the youngest groups respond at lower rates, while customers between 30 and 60 respond at higher rates.
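The custom `draw_bar_stack_num` helper is not shown, so how it bins ages is unknown. As a rough sketch of the underlying idea, one way to get a response rate per age band is `pd.cut` followed by a group-by. The toy values and the 20-year band width below are illustrative assumptions, not the real data or the helper's actual logic.

```python
import pandas as pd

# Toy data standing in for train[['Age', 'Response']]; values are illustrative.
toy = pd.DataFrame({
    "Age":      [22, 24, 27, 33, 38, 41, 45, 52, 58, 63, 70, 78],
    "Response": [0,  0,  0,  1,  0,  1,  1,  0,  1,  0,  0,  0],
})

# Bin ages into left-closed bands: [20, 40), [40, 60), [60, 80).
toy["age_band"] = pd.cut(toy["Age"], bins=[20, 40, 60, 80], right=False)

# Response rate (mean of the 0/1 target) and group size per band.
by_band = toy.groupby("age_band", observed=True)["Response"].agg(["mean", "size"])
print(by_band)
```

Reporting the group size next to the rate helps to avoid over-reading bands that contain very few customers.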

### 1.4.9 Annual Premium and Interest

In [37]:
fig = eda.draw_bar_stack_num(colname='Annual_Premium')
pyplot(fig, filename='./html/年度保费与是否感兴趣.html')  

'./html/年度保费与是否感兴趣.html'

### 1.4.10 Age, Annual Premium, and Response

In [91]:
fig = px.scatter(train, 
                x="Annual_Premium", 
                y="Age", 
                color="Response",
                title='Annual_premium vs Age scatter'
)

pyplot(fig, filename='./html/年龄和年度保费与是否响应关系.html') 

'./html/年龄和年度保费与是否响应关系.html'

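Because the annual premium contains outliers and extreme values, it needs to be treated before any further interpretation.

From the visual exploration we can roughly conclude that customers whose vehicle is more than one year old, whose vehicle has been damaged before, and who have not yet bought vehicle insurance have a higher response rate.

## 1.5 Data Preprocessing

This stage mainly covers field selection, data cleaning, and data encoding. The fields are handled as follows:

1. Region_Code and Policy_Sales_Channel: too many categories and hard to interpret, so they are dropped;
2. Annual_Premium: outlier treatment;
3. Gender, Vehicle_Age, Vehicle_Damage: categorical values are converted to numeric codes.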

In [43]:
# Drop fields
train = train.drop(['Region_Code', 'Policy_Sales_Channel'], axis=1)

# Cap outliers at mean ± 3 standard deviations
f_max = train['Annual_Premium'].mean() + 3*train['Annual_Premium'].std()
f_min = train['Annual_Premium'].mean() - 3*train['Annual_Premium'].std()

train.loc[train['Annual_Premium'] > f_max, 'Annual_Premium'] = f_max
train.loc[train['Annual_Premium'] < f_min, 'Annual_Premium'] = f_min

# Encode the categorical fields as numbers
train['Gender'] = train['Gender'].map({'Male': 1, 'Female': 0})
train['Vehicle_Damage'] = train['Vehicle_Damage'].map({'Yes': 1, 'No': 0})
train['Vehicle_Age'] = train['Vehicle_Age'].map({'< 1 Year': 0, '1-2 Year': 1, '> 2 Years': 2})
train.head()

(index)|id|Gender|Age|Driving_License|Previously_Insured|Vehicle_Age|Vehicle_Damage|Annual_Premium|Vintage|Response
--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:
0|1|1|44|1|0|2|1|40454.0|217|1
1|2|1|76|1|0|1|0|33536.0|183|0
2|3|1|47|1|0|2|1|38294.0|27|1
3|4|1|21|1|1|0|0|28619.0|203|0
4|5|0|29|1|1|0|0|27496.0|39|0
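To see what the capping step does in numbers, the sketch below plugs the mean and standard deviation of Annual_Premium from the descriptive table in section 1.4.1 into the same mean ± 3·std rule. The three-value series is a toy example, and `Series.clip` is shown as an equivalent, more compact way to apply both caps.

```python
import pandas as pd

# Summary statistics for Annual_Premium, copied from the describe() table.
mean, std = 30564.389581, 17213.155057
col_min, col_max = 2630.0, 540165.0

f_max = mean + 3 * std   # upper cap
f_min = mean - 3 * std   # lower cap

print(f"upper cap: {f_max:.2f}")
print(f"lower cap: {f_min:.2f}")
print("any value below the lower cap:", col_min < f_min)

# Series.clip applies both caps in one call; toy values for illustration.
toy = pd.Series([2630.0, 31669.0, 540165.0])
capped = toy.clip(lower=f_min, upper=f_max)
print(capped.tolist())
```

Because the lower cap is negative while the smallest premium is 2,630, only the upper cap changes any values in this dataset.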


The test set gets the same treatment:

In [44]:
# Drop fields
test = test.drop(['Region_Code', 'Policy_Sales_Channel'], axis=1)
# Cap outliers with the bounds computed on the training set
test.loc[test['Annual_Premium'] > f_max, 'Annual_Premium'] = f_max
test.loc[test['Annual_Premium'] < f_min, 'Annual_Premium'] = f_min

# Encode the categorical fields as numbers
test['Gender'] = test['Gender'].map({'Male': 1, 'Female': 0})
test['Vehicle_Damage'] = test['Vehicle_Damage'].map({'Yes': 1, 'No': 0})
test['Vehicle_Age'] = test['Vehicle_Age'].map({'< 1 Year': 0, '1-2 Year': 1, '> 2 Years': 2})
test.head()

(index)|id|Gender|Age|Driving_License|Previously_Insured|Vehicle_Age|Vehicle_Damage|Annual_Premium|Vintage
--:|--:|--:|--:|--:|--:|--:|--:|--:|--:
0|381110|1|25|1|1|0|0|35786.0|53
1|381111|1|40|1|0|1|1|33762.0|111
2|381112|1|47|1|0|1|1|40050.0|199
3|381113|1|24|1|1|0|1|37356.0|187
4|381114|1|27|1|1|0|0|59097.0|297


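## 1.6 Modeling

We build the following models and compare their classification performance.

The training data is first split into a training set and a validation set: the training set is used to fit the models, and the validation set to check how well they perform. We begin by importing the modeling libraries: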

In [93]:
# Models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier

# Preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Model evaluation
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, f1_score, roc_auc_score

In [63]:
# Separate the features from the label
X = train.drop(['id', 'Response'], axis=1)
y = train['Response']

# Split into training and validation sets (stratified sampling)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
print(X_train.shape, X_val.shape, y_train.shape, y_val.shape) 

(304887, 8) (76222, 8) (304887,) (76222,)
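`stratify=y` asks `train_test_split` to keep the class ratio the same in both parts. The sketch below checks this on labels rebuilt from the class counts in section 1.4.2; a plain index array stands in for the feature matrix, which is an assumption made to keep the example self-contained.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels rebuilt from the class counts reported in section 1.4.2.
y = np.repeat([0, 1], [334399, 46710])
idx = np.arange(y.size)  # stand-in for the feature matrix

idx_train, idx_val, y_train, y_val = train_test_split(
    idx, y, test_size=0.2, stratify=y, random_state=0
)

print(len(y_train), len(y_val))
print(f"positive rate, all data:   {y.mean():.4f}")
print(f"positive rate, training:   {y_train.mean():.4f}")
print(f"positive rate, validation: {y_val.mean():.4f}")
```

Without stratification the two rates would still be close at this sample size, but stratifying removes that source of variation entirely.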


In [64]:
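# Handle class imbalance by undersampling class 0
from imblearn.under_sampling import RandomUnderSampler

under_model = RandomUnderSampler(sampling_strategy={0: 133759, 1: 37368}, random_state=0)
# fit_resample is the current method name; recent imbalanced-learn releases no longer provide fit_sample
X_train, y_train = under_model.fit_resample(X_train, y_train)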
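With the target counts used above (133,759 negatives and 37,368 positives), the positive share of the training set rises from about 12.3% to about 21.8%. For readers without imbalanced-learn, the following is a plain-pandas sketch of the same idea, random undersampling without replacement. The `undersample` helper and the toy frame are hypothetical and are not the library's implementation.

```python
import numpy as np
import pandas as pd

def undersample(df, target, n_per_class, seed=0):
    """Randomly keep n_per_class[label] rows of each class, without replacement."""
    parts = [
        df[df[target] == label].sample(n=n, random_state=seed)
        for label, n in n_per_class.items()
    ]
    # Shuffle so the classes are not stored in contiguous blocks.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy frame: 90 negatives and 10 positives (illustrative, not the real data).
toy = pd.DataFrame({"x": np.arange(100), "Response": [0] * 90 + [1] * 10})
balanced = undersample(toy, "Response", {0: 30, 1: 10})

class_counts = balanced["Response"].value_counts().to_dict()
print(class_counts)
```

Only the training part should be resampled; the validation set keeps its original class ratio so that the scores reflect the real distribution.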

In [65]:
# Keep a min-max scaled copy of the data
mms = MinMaxScaler()

# Feature names come from X, defined in the split cell above
X_train_scaled = pd.DataFrame(mms.fit_transform(X_train), columns=X.columns)
X_val_scaled = pd.DataFrame(mms.transform(X_val), columns=X.columns)

# Test set
X_test = test.drop('id', axis=1)
X_test_scaled = pd.DataFrame(mms.transform(X_test), columns=X_test.columns)
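The scaler is fitted on the training data only and then reused for the validation and test sets. A small NumPy sketch of the formula, (x - min) / (max - min) per column, makes the consequence visible: values outside the training range map outside [0, 1]. The helper names and toy numbers are illustrative assumptions.

```python
import numpy as np

def minmax_fit(a):
    """Return the per-column minimum and range learned from the training data."""
    lo = a.min(axis=0)
    span = a.max(axis=0) - lo
    return lo, np.where(span == 0, 1.0, span)  # guard against constant columns

def minmax_apply(a, lo, span):
    """Scale with statistics learned elsewhere: (x - min) / (max - min)."""
    return (a - lo) / span

train_x = np.array([[20.0, 2630.0], [40.0, 30000.0], [80.0, 80000.0]])
val_x = np.array([[85.0, 1000.0]])  # outside the range seen in training

lo, span = minmax_fit(train_x)
train_scaled = minmax_apply(train_x, lo, span)
val_scaled = minmax_apply(val_x, lo, span)

print(train_scaled.min(axis=0), train_scaled.max(axis=0))
print(val_scaled)
```

Fitting on the training data alone keeps information from the validation and test sets out of the preprocessing step.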

### 1.6.1 K-Nearest Neighbors (KNN)

In [68]:
# Fit a KNN classifier
knn = KNeighborsClassifier(n_neighbors=3, n_jobs=-1)
knn.fit(X_train_scaled, y_train)

y_pred = knn.predict(X_val_scaled)

print('Simple KNeighborsClassifier accuracy：%.3f' % (accuracy_score(y_val, y_pred)))
print('Simple KNeighborsClassifier f1_score: %.3f' % (f1_score(y_val, y_pred)))  
print('Simple KNeighborsClassifier roc_auc_score: %.3f' % (roc_auc_score(y_val, y_pred))) 

Simple KNeighborsClassifier accuracy：0.807
Simple KNeighborsClassifier f1_score: 0.337
Simple KNeighborsClassifier roc_auc_score: 0.632
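A note on the last metric: the cells in this section pass hard 0/1 predictions to `roc_auc_score`. In that case the ROC curve has a single operating point, and the score reduces to the average of the true-positive and true-negative rates. Passing predicted probabilities instead scores the ranking over all thresholds. The toy labels and scores below are made up to show the difference.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy validation labels and predicted probabilities (illustrative values).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
proba = np.array([0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.45, 0.55, 0.70, 0.90])
labels = (proba >= 0.5).astype(int)  # hard predictions at the default threshold

auc_from_labels = roc_auc_score(y_true, labels)
auc_from_proba = roc_auc_score(y_true, proba)

# With hard labels, AUC equals (TPR + TNR) / 2.
tpr = labels[y_true == 1].mean()
tnr = 1 - labels[y_true == 0].mean()

print(f"AUC from hard labels:   {auc_from_labels:.4f}")
print(f"(TPR + TNR) / 2:        {(tpr + tnr) / 2:.4f}")
print(f"AUC from probabilities: {auc_from_proba:.4f}")
```

For the fitted models, the probability-based version would be `roc_auc_score(y_val, knn.predict_proba(X_val_scaled)[:, 1])`; the figures printed in this section were computed from hard labels and are not directly comparable to it.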


In [69]:
# Predict on the test set (it has no labels, so it cannot be scored)
test_y = knn.predict(X_test_scaled)
test_y[:5] 

array([0, 0, 1, 0, 0], dtype=int64)

### 1.6.2 Logistic Regression

In [75]:
# Logistic regression
lr = LogisticRegression()
lr.fit(X_train_scaled, y_train)

y_pred = lr.predict(X_val_scaled)

print('Simple LogisticRegression accuracy：%.3f' % (accuracy_score(y_val, y_pred)))
print('Simple LogisticRegression f1_score: %.3f' % (f1_score(y_val, y_pred)))  
print('Simple LogisticRegression roc_auc_score: %.3f' % (roc_auc_score(y_val, y_pred)))

Simple LogisticRegression accuracy：0.863
Simple LogisticRegression f1_score: 0.156
Simple LogisticRegression roc_auc_score: 0.536


### 1.6.3 Decision Tree

In [77]:
# Decision tree
dtc = DecisionTreeClassifier(max_depth=10, random_state=0) 
dtc.fit(X_train, y_train)

y_pred = dtc.predict(X_val) 

print('Simple DecisionTreeClassifier accuracy：%.3f' % (accuracy_score(y_val, y_pred)))
print('Simple DecisionTreeClassifier f1_score: %.3f' % (f1_score(y_val, y_pred)))  
print('Simple DecisionTreeClassifier roc_auc_score: %.3f' % (roc_auc_score(y_val, y_pred))) 

Simple DecisionTreeClassifier accuracy：0.849
Simple DecisionTreeClassifier f1_score: 0.310
Simple DecisionTreeClassifier roc_auc_score: 0.603


In [94]:
# Tune the decision tree, using F1 as the optimization criterion
parameters = {
    'splitter': ('best', 'random'),
    'criterion': ('gini', 'entropy'),
    'max_depth': [*range(1, 30, 2)],
}

# Build the model
clf = DecisionTreeClassifier(random_state=0)
GS = GridSearchCV(clf, parameters, cv=5, scoring='f1')
GS.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=0),
             param_grid={'criterion': ('gini', 'entropy'),
                         'max_depth': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21,
                                       23, 25, 27, 29],
                         'splitter': ('best', 'random')},
             scoring='f1')
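Before running a grid search it helps to know how many models it will fit. The sketch below counts the candidates in the grid defined above with scikit-learn's `ParameterGrid` and multiplies by the 5 cross-validation folds.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above.
parameters = {
    'splitter': ('best', 'random'),
    'criterion': ('gini', 'entropy'),
    'max_depth': [*range(1, 30, 2)],
}

n_candidates = len(ParameterGrid(parameters))  # 2 * 2 * 15 combinations
n_fits = n_candidates * 5                      # one fit per candidate per fold

print(n_candidates, n_fits)
```

With the default `refit=True`, `GridSearchCV` performs one more fit of the best candidate on the whole training set after the search.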

In [95]:
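# Best model found by the search (GS is the GridSearchCV object fitted above)
best_model = GS.best_estimator_

# With the default refit=True the best estimator is already fitted on X_train,
# so this call simply refits the same model.
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_val)

# The printed label says 'Randomized', but the search above is an exhaustive grid search.
print('Randomized  DecisionTree accuracy: %.3f' % (accuracy_score(y_val, y_pred)))
print('Randomized  DecisionTree f1_score: %.3f' % (f1_score(y_val, y_pred)))
print('Randomized  DecisionTree roc_auc_score: %.3f' % (roc_auc_score(y_val, y_pred)))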

Randomized  DecisionTree accuracy: 0.790
Randomized  DecisionTree f1_score: 0.333
Randomized  DecisionTree roc_auc_score: 0.634


In [96]:
# Feature importances
imp = pd.DataFrame(zip(X_train.columns, best_model.feature_importances_), columns=['col_name', 'importance'])
imp = imp.sort_values('importance', ascending=False)
imp['accumulative_importance'] = imp['importance'].cumsum() 
imp = round(imp, 3) 
imp 

(index)|col_name|importance|accumulative_importance
--:|---|--:|--:
6|Annual_Premium|0.312|0.312
7|Vintage|0.293|0.604
5|Vehicle_Damage|0.211|0.815
1|Age|0.134|0.949
0|Gender|0.025|0.974
3|Previously_Insured|0.019|0.993
4|Vehicle_Age|0.006|0.999
2|Driving_License|0.001|1.0


### 1.6.4 Random Forest

In [97]:
# Random forest
rfc = RandomForestClassifier(n_estimators=100, max_depth=10, n_jobs=-1)
rfc.fit(X_train, y_train)

y_pred = rfc.predict(X_val) 

print('Simple RandomForestClassifier accuracy：%.3f' % (accuracy_score(y_val, y_pred)))
print('Simple RandomForestClassifier f1_score: %.3f' % (f1_score(y_val, y_pred)))  
print('Simple RandomForestClassifier roc_auc_score: %.3f' % (roc_auc_score(y_val, y_pred))) 

Simple RandomForestClassifier accuracy：0.870
Simple RandomForestClassifier f1_score: 0.177
Simple RandomForestClassifier roc_auc_score: 0.545


### 1.6.5 LightGBM

In [98]:
lgbm = LGBMClassifier(n_estimators=100, random_state=0)
lgbm.fit(X_train, y_train)

y_pred = lgbm.predict(X_val)

print('Simple LGBM accuracy: %.3f' % (accuracy_score(y_val, y_pred)))
print('Simple LGBM f1_score: %.3f' % (f1_score(y_val, y_pred)))  
print('Simple LGBM roc_auc_score: %.3f' % (roc_auc_score(y_val, y_pred))) 

Simple LGBM accuracy: 0.857
Simple LGBM f1_score: 0.290
Simple LGBM roc_auc_score: 0.591


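In summary, with the F1 score as the criterion, KNN classifies best (0.337), narrowly ahead of the tuned decision tree (0.333). The generally low F1 scores may stem from the class imbalance in the data. Possible next steps are to try other techniques for handling class imbalance, to tune the hyperparameters of the other models, and to adjust the prediction threshold to improve predictive performance.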
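As an illustration of the last suggestion, the sketch below sweeps a grid of decision thresholds over predicted probabilities and keeps the one with the highest F1. The helper functions, the toy labels, and the toy probabilities are hypothetical. In practice the probabilities would come from a model's `predict_proba` on the validation set, and the chosen threshold would then be held fixed for the test set.

```python
import numpy as np

def f1_at(y_true, proba, threshold):
    """F1 score of the positive class when predicting 1 for proba >= threshold."""
    pred = proba >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0

def best_threshold(y_true, proba, grid):
    """Return the threshold in grid with the highest F1, and that F1."""
    scores = [f1_at(y_true, proba, t) for t in grid]
    best = int(np.argmax(scores))
    return grid[best], scores[best]

# Toy validation labels and predicted probabilities (illustrative values).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
proba = np.array([0.05, 0.10, 0.15, 0.20, 0.30, 0.45, 0.25, 0.35, 0.40, 0.60])
grid = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

default_f1 = f1_at(y_true, proba, 0.5)
threshold, tuned_f1 = best_threshold(y_true, proba, grid)
print(f"F1 at 0.5: {default_f1:.3f}")
print(f"best threshold: {threshold}, F1: {tuned_f1:.3f}")
```

On imbalanced data the best threshold is often below 0.5, because lowering it trades some precision for recall on the rare class.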