## Loading the data

We use only the portion of the dataset recorded after 1997 (1998 to 2017):

In [1]:
import pandas as pd
import numpy as np

In [2]:
gtd_df = pd.read_hdf('./data/gtd_1998-2017.h5')  # load the dataset

In [3]:
import json

with open('./data/gtd_1998-2017_names.json') as fp:
    names = json.load(fp)                           # load the variable names

In [23]:
class Bunch(dict):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.__dict__ = self

In [24]:
names = Bunch(names)
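
The `Bunch` wrapper exists so that the groups of variable names can be read as attributes (for example `names.NV`) instead of through `names['NV']`. A minimal sketch of how it behaves, using a made-up dictionary rather than the real contents of the JSON file:

```python
class Bunch(dict):
    """A dict whose keys can also be read and written as attributes."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.__dict__ = self  # the attribute namespace and the dict storage are one object

# Hypothetical variable-name groups, for illustration only
demo = Bunch({'NV': ['nkill', 'nwound'], 'TV': ['summary']})

nv_by_attr = demo.NV       # attribute access
nv_by_key = demo['NV']     # ordinary dict access still works
demo.extra = ['iyear']     # setting an attribute also adds a key
has_extra = 'extra' in demo
```

Because the attributes and the keys share one storage, a key that collides with a `dict` method name (such as `items`) would still need bracket access.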

Let's take a look at the GTD data:

In [4]:
gtd_df.head()

| | eventid | iyear | imonth | iday | extended | country | country_txt | region | region_txt | provstate | ... | ishostkid | addnotes | scite1 | scite2 | scite3 | dbsource | INT_LOG | INT_IDEO | INT_MISC | INT_ANY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 67507 | 199801010001 | 1998 | 1 | 1 | 0 | 34 | Burundi | 11 | Sub-Saharan Africa | Bujumbura Mairie | ... | 0.0 | Unknown | Burundi Rebels, Ex-Rwandan Army Soldiers Blam... | Burundi--Attack Reported on Bujumbura Airport... | Unknown | CETIS | 0.0 | 1.0 | 0.0 | 1.0 |
| 67508 | 199801010002 | 1998 | 1 | 1 | 0 | 167 | Russia | 9 | Eastern Europe | Moscow (Federal City) | ... | 0.0 | Unknown | Bomb injures 3 in Moscow subway system, The ... | Bomb injures 3 in Moscow subway, Charleston ... | Bomb Injures 3 Workers in Moscow Metro, Los ... | CETIS | -1.0 | -1.0 | 0.0 | -1.0 |
| 67509 | 199801010003 | 1998 | 1 | 1 | 0 | 603 | United Kingdom | 8 | Western Europe | Northern Ireland | ... | 0.0 | Unknown | Protestant gunmen kill Catholic in New Year's... | Ulster Peace Shattered by Shooting: Catholic ... | Unknown | CETIS | 0.0 | 0.0 | 1.0 | 1.0 |
| 67510 | 199801020001 | 1998 | 1 | 2 | 0 | 95 | Iraq | 10 | Middle East & North Africa | Baghdad | ... | 0.0 | Unknown | Iraq Condemns Attack on UNSCOM Baghdad Office... | Farouk Choukri , Iraq, UN Officials Continue ... | Iraqi Interior Minister on UNSCOM Attack, Kuw... | CETIS | -1.0 | -1.0 | 1.0 | 1.0 |
| 67511 | 199801020002 | 1998 | 1 | 2 | 0 | 155 | West Bank and Gaza Strip | 10 | Middle East & North Africa | West Bank | ... | 0.0 | Unknown | Woman Shot, The Philadelphia Inquirer, Janua... | Israeli Woman Critically Hurt by Gunfire in W... | Unknown | CETIS | -1.0 | -1.0 | 0.0 | -1.0 |


In [6]:
print('Shape of the data:', gtd_df.shape)

Shape of the data: (114184, 65)


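In the earlier preprocessing we already replaced missing values in text variables with `Unknown` and missing values in categorical variables with `-1`, but we left missing values in numerical variables unfilled. How to fill them cannot be decided wholesale; it has to be analyzed case by case. To keep the discussion below simple, we first drop every sample that contains a missing value: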

In [10]:
df = gtd_df.dropna()

np.unique(df.isnull().sum())  # every column has zero missing values

array([0], dtype=int64)

In [11]:
df.shape

(99521, 65)

In [21]:
miss_df = gtd_df.loc[list(set(gtd_df.index) - set(df.index)),:]  # the samples that contain missing values

df.shape[0] + miss_df.shape[0] == gtd_df.shape[0]

True
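
The same split into complete and incomplete samples can be tried on a small frame. This is only a sketch: the toy values are made up, and `Index.difference` is swapped in for the set arithmetic used above because it returns the remaining labels in sorted order.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for gtd_df (values are made up for illustration)
toy = pd.DataFrame(
    {'nkill': [0.0, np.nan, 2.0, 1.0], 'nwound': [3.0, 1.0, np.nan, 0.0]},
    index=[10, 11, 12, 13],
)

complete = toy.dropna()                                      # rows with no missing values
incomplete = toy.loc[toy.index.difference(complete.index)]   # rows with at least one NaN

n_missing_per_column = toy.isnull().sum()                    # missing count per column
```

Looking at `n_missing_per_column` before dropping rows shows which variables are responsible for most of the discarded samples.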

From here on we work mainly with the `df` dataset.

For any machine learning problem, the first task is to **identify the problem**.

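## Step 1: Identify the problem

We first need to decide whether the problem is a **classification** problem or a **regression** problem. A dataset generally consists of features $X$ and labels $y$; $y$ may be a single column or several columns, and it may be binary (Boolean-like, usually encoded as `0` and `1`) or real-valued. In general, when $y$ is binary the problem is a classification problem, and when $y$ is real-valued it is a regression problem.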
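
As a rough illustration of this rule, the label's dtype can be inspected. `infer_problem_type` below is a hypothetical helper written for this sketch, not part of pandas or scikit-learn, and it only looks at the dtype.

```python
import pandas as pd

def infer_problem_type(y):
    """Hypothetical helper: guess the problem type from the label's dtype."""
    y = pd.Series(y)
    if pd.api.types.is_float_dtype(y):
        return 'regression'       # real-valued label
    return 'classification'       # integer, Boolean, or string label

binary_kind = infer_problem_type([0, 1, 1, 0])      # integer codes 0/1
real_kind = infer_problem_type([0.5, 2.3, 7.1])     # real-valued label
```

A dtype check is only a first guess: a categorical label stored as floats (such as `0.0` and `1.0`) would be reported as regression, so the meaning of the column still has to be checked by hand.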

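## Step 2: Split the data

To guard against overfitting, the data is usually split into a **training set** and a **validation set**: the model is trained on the training set, and the hyperparameters are tuned on the validation set.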

In [22]:
from sklearn.model_selection import StratifiedKFold  # for classification problems
#from sklearn.model_selection import KFold        # for regression problems

In [28]:
names.NV  # inspect the numerical variables

['nkillter',
 'imonth',
 'nperpcap',
 'nkillus',
 'nkill',
 'iday',
 'nwoundus',
 'nwoundte',
 'latitude',
 'iyear',
 'nwound',
 'longitude',
 'eventid']

Here we select only the following numerical variables:

In [31]:
nv = [
    'nkillter','nperpcap','nkillus','nkill','nwoundus','nwoundte', 'nwound',
]

X = df[nv]

In [32]:
Standardize the numerical variables:

['natlty1',
 'INT_IDEO',
 'targtype1',
 'INT_ANY',
 'guncertain1',
 'ishostkid',
 'country',
 'specificity',
 'suicide',
 'individual',
 'weapsubtype1',
 'property',
 'crit3',
 'targsubtype1',
 'claimed',
 'success',
 'crit2',
 'extended',
 'doubtterr',
 'propextent',
 'INT_MISC',
 'crit1',
 'multiple',
 'weaptype1',
 'INT_LOG',
 'vicinity',
 'attacktype1',
 'region']
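
The notebook mentions standardizing the numerical variables but does not show the step. A minimal sketch of z-score standardization on made-up values follows; `X_demo` is a hypothetical stand-in for the selected numerical columns, and this is not necessarily the procedure the author used.

```python
import numpy as np
import pandas as pd

# Toy numerical features (made-up values)
X_demo = pd.DataFrame({'nkill': [0.0, 1.0, 25.0, 2.0],
                       'nwound': [3.0, 0.0, 0.0, 5.0]})

# z-score: subtract each column's mean and divide by its standard deviation.
# ddof=0 gives the population standard deviation.
X_std = (X_demo - X_demo.mean()) / X_demo.std(ddof=0)

col_means = X_std.mean()        # approximately 0 for every column
col_stds = X_std.std(ddof=0)    # approximately 1 for every column
```

When a validation set is used, the mean and standard deviation would normally be computed on the training set only and then applied to the validation set, so that no information leaks from the held-out data.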

In [27]:
X = df[list(names.NV)]  # all numerical variables, matching the columns shown below
X

| | nkillter | imonth | nperpcap | nkillus | nkill | iday | nwoundus | nwoundte | latitude | iyear | nwound | longitude | eventid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 67508 | 0.0 | 1 | 0.0 | 0.0 | 0.0 | 1 | 0.0 | 0.0 | 55.751377 | 1998 | 3.0 | 37.579914 | 199801010002 |
| 67509 | 0.0 | 1 | 0.0 | 0.0 | 1.0 | 1 | 0.0 | 0.0 | 54.607712 | 1998 | 0.0 | -5.956210 | 199801010003 |
| 67511 | 0.0 | 1 | 0.0 | 0.0 | 0.0 | 2 | 0.0 | 0.0 | 31.995965 | 1998 | 1.0 | 35.271110 | 199801020002 |
| 67525 | 0.0 | 1 | 0.0 | 0.0 | 1.0 | 9 | 0.0 | 0.0 | 43.280364 | 1998 | 0.0 | -2.171588 | 199801090001 |
| 67526 | 0.0 | 1 | 0.0 | 0.0 | 0.0 | 9 | 0.0 | 0.0 | 28.585836 | 1998 | 44.0 | 77.153336 | 199801090002 |
| 67531 | 0.0 | 1 | 0.0 | 0.0 | 1.0 | 11 | 0.0 | 0.0 | 54.607712 | 1998 | 0.0 | -5.956210 | 199801110002 |
| 67532 | 0.0 | 1 | 0.0 | 0.0 | 25.0 | 11 | 0.0 | 0.0 | 31.505470 | 1998 | 0.0 | 74.342880 | 199801110003 |
| 67538 | 0.0 | 1 | 0.0 | 0.0 | 0.0 | 13 | 0.0 | 0.0 | 36.800000 | 1998 | 0.0 | 4.266667 | 199801130001 |
| 67539 | 0.0 | 1 | 0.0 | 0.0 | 2.0 | 13 | 0.0 | 0.0 | 36.350000 | 1998 | 0.0 | 6.600000 | 199801130002 |
| 67540 | 0.0 | 1 | 0.0 | 0.0 | 0.0 | 13 | 0.0 | 0.0 | 36.296387 | 1998 | 5.0 | 3.669930 | 199801130003 |


In [None]:
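eval_size = 0.1
# `y` is the label column (assumed to be a pandas Series); it must be defined before this cell runs
kf = StratifiedKFold(n_splits=round(1 / eval_size))
# split() yields (train, validation) positional indices; take the first fold
train_indices, valid_indices = next(kf.split(X, y))
X_train, y_train = X.iloc[train_indices], y.iloc[train_indices]
X_valid, y_valid = X.iloc[valid_indices], y.iloc[valid_indices]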
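
Since the label `y` is not defined in the cells shown here, the split can be checked on toy data first. The sketch below assumes a made-up binary label `y_demo` and a single made-up feature column; `shuffle` and `random_state` are added only to make the result reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy features and a made-up binary label (20 samples, 10 per class)
X_demo = pd.DataFrame({'nkill': np.arange(20, dtype=float)})
y_demo = pd.Series([0, 1] * 10)

eval_size = 0.1
kf = StratifiedKFold(n_splits=round(1 / eval_size), shuffle=True, random_state=0)

# Use the first of the 10 folds as the validation set
train_idx, valid_idx = next(kf.split(X_demo, y_demo))
X_train, y_train = X_demo.iloc[train_idx], y_demo.iloc[train_idx]
X_valid, y_valid = X_demo.iloc[valid_idx], y_demo.iloc[valid_idx]
```

Stratification keeps the class proportions of `y_demo` in every fold, so the validation fold here holds one sample of each class. For a real-valued label, `KFold` would be the counterpart, as the commented-out import above suggests.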