## Chapter 11: Improving Machine Learning Performance with Embeddings

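## 11.1 Project Overview
This chapter uses embeddings to improve model performance. For a traditional machine learning algorithm, typified by XGBoost, we keep the dataset and the algorithm fixed and vary only the preprocessing of the input data: one run uses conventional preprocessing, the other uses embeddings. We make the same comparison for a neural network: the model structure stays the same, but the categorical features are either one-hot encoded or passed through embeddings. The results are shown in the table below.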
 
|Algorithm|RMSE|RMSE (with EE)|
|:-|:-|:-|
|XGBoost|0.176|0.098|
|NN|0.101|0.095|

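The table shows that the neural network (NN) outperforms the traditional machine learning algorithm in both settings, and that each algorithm performs better with EE (entity embedding) than without it. The gain is much larger for XGBoost (0.176 to 0.098) than for the NN (0.101 to 0.095).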
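
To make the difference between the two input strategies concrete, here is a minimal sketch in plain Python. It assumes a toy categorical feature with four categories and a made-up 2-dimensional embedding table (`embedding_table`, `one_hot`, and `matvec` are hypothetical names, not part of the chapter's code). It shows that multiplying a one-hot vector by the table selects the same row as a direct lookup, which is why an embedding can be read as a learned, dense replacement for one-hot input.

```python
# Toy illustration: 4 categories, 2-dimensional embedding.
# The numbers are made up; in a real model they are learned during training.
embedding_table = [
    [0.1, 0.9],   # category 0
    [0.4, 0.2],   # category 1
    [0.7, 0.5],   # category 2
    [0.3, 0.8],   # category 3
]

def one_hot(index, num_categories):
    """Return a one-hot list with 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(num_categories)]

def matvec(vector, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    num_cols = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(num_cols)]

category = 2
dense_via_one_hot = matvec(one_hot(category, 4), embedding_table)
dense_via_lookup = embedding_table[category]

print(one_hot(category, 4))   # sparse input: one column per category
print(dense_via_lookup)       # dense input: one short vector per category
```

With 1115 stores, the one-hot input for the store feature alone has 1115 columns, while an embedding replaces it with a much shorter vector.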

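### 11.1.1 About the Dataset
The dataset used in this chapter contains historical sales data for 1,115 Rossmann stores in Germany. The task is to predict the "Sales" column for the test set. Some stores in the dataset were temporarily closed for refurbishment. The data files are:  
train.csv: historical data, including sales;  
test.csv: historical data, excluding sales;  
sample_submission.csv: a sample submission file in the correct format;  
store.csv: supplementary information about the stores.  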

### 11.1.2 Loading the Data
1) Import the required libraries.  
This section loads train.csv, store.csv, and store_states.csv.

In [1]:
import pickle  # serialize in-memory objects to disk
import csv

2) Define two functions: csv2dicts converts a CSV file into a list of dictionaries, and set_nan_as_string replaces empty values with '0'.

In [2]:
# Convert a CSV file into a list of dictionaries
def csv2dicts(csvfile):
    data = []
    keys = []
    for row_index, row in enumerate(csvfile):
        # Use the first row as the keys and print it
        if row_index == 0:
            keys = row
            print(row)
            continue
        
        data.append({key: value for key, value in zip(keys, row)})
    return data

# Replace empty values with '0'
def set_nan_as_string(data, replace_str='0'):
    for i, x in enumerate(data):
        for key, value in x.items():
            if value == '':
                x[key] = replace_str
        data[i] = x
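
As a point of comparison, the standard library's `csv.DictReader` performs the same header-to-keys conversion as csv2dicts. The sketch below is an alternative illustration, not the chapter's method. It assumes a hypothetical two-row sample (`sample_csv`) with a trimmed set of store.csv columns, and applies the same empty-string replacement as set_nan_as_string.

```python
import csv
import io

# Hypothetical sample in the layout of store.csv (columns trimmed).
sample_csv = io.StringIO(
    "Store,StoreType,CompetitionDistance\n"
    "1,c,1270\n"
    "2,a,\n"
)

# DictReader takes the first row as keys, as csv2dicts does.
rows = [dict(row) for row in csv.DictReader(sample_csv)]

# Replace empty strings with '0', as set_nan_as_string does.
for row in rows:
    for key, value in row.items():
        if value == '':
            row[key] = '0'

print(rows)
```

Note that all values stay strings after this step. Converting them to numbers happens later, during preprocessing.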


3) Load the data.

In [3]:
train_data = r".\data\train.csv"
store_data = r".\data\store.csv"
store_states = r'.\data\store_states.csv'

# Write the processed training data to a file
with open(train_data) as csvfile:
    data = csv.reader(csvfile, delimiter=',')
    with open('train_data.pickle', 'wb') as f:
        data = csv2dicts(data)
        # Reverse the row order so the oldest records come first
        data = data[::-1]
        # Serialize the data and save it to the file
        pickle.dump(data, f, -1)
        print(data[:3])

['Store', 'DayOfWeek', 'Date', 'Sales', 'Customers', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday']
[{'Store': '1115', 'DayOfWeek': '2', 'Date': '2013-01-01', 'Sales': '0', 'Customers': '0', 'Open': '0', 'Promo': '0', 'StateHoliday': 'a', 'SchoolHoliday': '1'}, {'Store': '1114', 'DayOfWeek': '2', 'Date': '2013-01-01', 'Sales': '0', 'Customers': '0', 'Open': '0', 'Promo': '0', 'StateHoliday': 'a', 'SchoolHoliday': '1'}, {'Store': '1113', 'DayOfWeek': '2', 'Date': '2013-01-01', 'Sales': '0', 'Customers': '0', 'Open': '0', 'Promo': '0', 'StateHoliday': 'a', 'SchoolHoliday': '1'}]


4) Process the store_data and store_states data.

In [4]:
# Write the processed store_data and store_states data to store_data.pickle
with open(store_data) as csvfile, open(store_states) as csvfile2:
    data = csv.reader(csvfile, delimiter=',')
    state_data = csv.reader(csvfile2, delimiter=',')
    with open('store_data.pickle', 'wb') as f:
        data = csv2dicts(data)
        state_data = csv2dicts(state_data)
        set_nan_as_string(data)
        # Add the state to each store_data row, then save the result
        for index, val in enumerate(data):
            state = state_data[index]
            val['State'] = state['State']
            data[index] = val
        pickle.dump(data, f, -1)
        print(data[:2])

['Store', 'StoreType', 'Assortment', 'CompetitionDistance', 'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek', 'Promo2SinceYear', 'PromoInterval']
['Store', 'State']
[{'Store': '1', 'StoreType': 'c', 'Assortment': 'a', 'CompetitionDistance': '1270', 'CompetitionOpenSinceMonth': '9', 'CompetitionOpenSinceYear': '2008', 'Promo2': '0', 'Promo2SinceWeek': '0', 'Promo2SinceYear': '0', 'PromoInterval': '0', 'State': 'HE'}, {'Store': '2', 'StoreType': 'a', 'Assortment': 'a', 'CompetitionDistance': '570', 'CompetitionOpenSinceMonth': '11', 'CompetitionOpenSinceYear': '2007', 'Promo2': '1', 'Promo2SinceWeek': '13', 'Promo2SinceYear': '2010', 'PromoInterval': 'Jan,Apr,Jul,Oct', 'State': 'TH'}]
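
The loop above pairs rows by position, so it relies on store.csv and store_states.csv listing the stores in the same order. If that cannot be guaranteed, matching on the Store key is a safer variant. The following is a minimal sketch with hypothetical miniature tables (`stores`, `states`), in which the state rows are deliberately out of order.

```python
# Hypothetical miniature versions of the two tables.
stores = [
    {'Store': '1', 'StoreType': 'c'},
    {'Store': '2', 'StoreType': 'a'},
]
states = [
    {'Store': '2', 'State': 'TH'},
    {'Store': '1', 'State': 'HE'},
]

# Build a lookup from store id to state, then attach by key, not by position.
state_by_store = {row['Store']: row['State'] for row in states}

for store in stores:
    store['State'] = state_by_store[store['Store']]

print(stores)
```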


### 11.1.3 Preprocessing the Data
1) Import the required libraries.

In [5]:
from datetime import datetime
from sklearn import preprocessing
import numpy as np
import random
random.seed(42)

2) Use pickle to read the data back from the pickle files.

In [6]:
# Read the pickle files
with open('train_data.pickle', 'rb') as f:
    train_data = pickle.load(f)
    num_records = len(train_data)
with open('store_data.pickle', 'rb') as f:
    store_data = pickle.load(f)

3) Split and convert the sales date field.

In [7]:
# Split and convert the date features; convert features such as promo to integers
def feature_list(record):
    dt = datetime.strptime(record['Date'], '%Y-%m-%d')
    store_index = int(record['Store'])
    year = dt.year
    month = dt.month
    day = dt.day
    day_of_week = int(record['DayOfWeek'])
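    try:
        store_open = int(record['Open'])
    except ValueError:
        # Treat an empty or non-numeric 'Open' value as open
        store_open = 1

    promo = int(record['Promo'])
    # Also return the abbreviation of the store's state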
    return [store_open,
            store_index,
            day_of_week,
            promo,
            year,
            month,
            day,
            store_data[store_index - 1]['State']
            ]
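
The date handling in feature_list can be tried in isolation. The sketch below uses a record shaped like the rows printed during loading and shows how `datetime.strptime` splits the Date string into year, month, and day. For this record, the DayOfWeek value also agrees with `isoweekday()`, where 1 is Monday and 7 is Sunday.

```python
from datetime import datetime

# A record shaped like the rows printed when train.csv was loaded.
record = {'Store': '1115', 'DayOfWeek': '2', 'Date': '2013-01-01',
          'Open': '0', 'Promo': '0'}

dt = datetime.strptime(record['Date'], '%Y-%m-%d')
date_parts = [dt.year, dt.month, dt.day]
print(date_parts)

# 2013-01-01 was a Tuesday, so isoweekday() returns 2 for this record.
print(dt.isoweekday(), int(record['DayOfWeek']))
```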

4) Apply some simple cleaning and filtering to train_data.

In [8]:
# Build the training data
train_data_X = []
train_data_y = []

for record in train_data:
    if record['Sales'] != '0' and record['Open'] != '':
        fl = feature_list(record)
        train_data_X.append(fl)
        train_data_y.append(int(record['Sales']))
print("Number of sales records: ", len(train_data_y))

print("Min sales: {}, max sales: {}".format(min(train_data_y), max(train_data_y)))


Number of sales records:  844338
Min sales: 46, max sales: 41551


5) Encode each feature as integers and save the result to feature_train_data.pickle.

In [9]:
full_X = np.array(train_data_X)
train_data_X = np.array(train_data_X)
les = []
# Process each column, converting category values to integer codes
for i in range(train_data_X.shape[1]):
    le = preprocessing.LabelEncoder()
    le.fit(full_X[:, i])
    les.append(le)
    train_data_X[:, i] = le.transform(train_data_X[:, i])

# Save the fitted label encoders to a pickle file
with open('les.pickle', 'wb') as f:
    pickle.dump(les, f, -1)

# Convert the training data to integers
train_data_X = train_data_X.astype(int)
train_data_y = np.array(train_data_y)

# Save the data to feature_train_data.pickle
with open('feature_train_data.pickle', 'wb') as f:
    pickle.dump((train_data_X, train_data_y), f, -1)
    print(train_data_X[0], train_data_y[0])


[  0 109   1   0   0   0   0   7] 5961
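
One detail of the encoding step is easy to miss. Each feature row mixes integers with the state string, so `np.array` stores every value as a string, and LabelEncoder then assigns codes by sorted order of those strings. The sketch below mimics that sorted-unique mapping in plain Python (`fit_label_encoder` is a hypothetical helper, not a scikit-learn function) to show that store ids are ranked lexicographically rather than numerically.

```python
def fit_label_encoder(values):
    """Map each distinct value to its rank in sorted order."""
    classes = sorted(set(values))
    return {value: code for code, value in enumerate(classes)}

store_ids = ['1', '2', '10', '109', '1115']   # strings, as in full_X
mapping = fit_label_encoder(store_ids)
print(mapping)
# {'1': 0, '10': 1, '109': 2, '1115': 3, '2': 4}
```

For embeddings this ordering is harmless, because each code only serves as a distinct row index. It does mean that the encoded value of a store is not its store number minus one.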


### 11.1.4 Defining Common Functions

Defining the common functions involves the following steps.  
1) First, import the required libraries and modules.

In [10]:
import numpy
import pickle  
numpy.random.seed(123)

from sklearn.preprocessing import StandardScaler
import xgboost as xgb
from sklearn import neighbors
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import OneHotEncoder


from tensorflow.keras.models import Sequential
from tensorflow.keras.models import Model as KerasModel
from tensorflow.keras.layers import Input, Dense, Activation, Reshape,Flatten
from tensorflow.keras.layers import Concatenate
from tensorflow.keras.layers import Embedding
from tensorflow.keras.callbacks import ModelCheckpoint

# Suppress warning messages
import warnings
warnings.filterwarnings("ignore")

2) Set some hyperparameters.

In [11]:
train_ratio = 0.9
shuffle_data = False
one_hot_as_input = False
embeddings_as_input = False
save_embeddings = True
saved_embeddings_fname = "embeddings.pickle"  # set save_embeddings to True to create this file
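
The settings train_ratio = 0.9 and shuffle_data = False suggest a chronological hold-out. The next cell computes train_size = int(train_ratio * num_records), and the records were reversed to oldest-first order when they were loaded, so slicing at train_size would keep the most recent 10% for validation. The slicing itself is not shown in this section, so the following is only a sketch of that assumed usage; `records`, `train_part`, and `val_part` are hypothetical names.

```python
train_ratio = 0.9

# Hypothetical stand-in for the feature rows, oldest record first.
records = list(range(10))

train_size = int(train_ratio * len(records))
train_part = records[:train_size]   # earlier records, used for training
val_part = records[train_size:]     # most recent records, held out

print(len(train_part), len(val_part))
```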

3) Define several common functions.

In [12]:
# Load the encoded features and targets; the with block closes the file afterwards
with open('feature_train_data.pickle', 'rb') as f:
    (X, y) = pickle.load(f)

num_records = len(X)
train_size = int(train_ratio * num_records)

if shuffle_data:
    print("Using shuffled data")
    sh = numpy.arange(X.shape[0])
    numpy.random.shuffle(sh)
    X = X[sh]
    y = y[sh]

if embeddings_as_input:
    print("Using learned embeddings as input")
    X = embed_features(X, saved_embeddings_fname)

if one_hot_as_input:
    print("Using one-hot encoding as input")
    # Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; use that name on newer versions
    enc = OneHotEncoder(sparse=False)
    enc.fit(X)
    X = enc.transform(X)

def sample(X, y, n):
    '''Draw n random samples (with replacement).'''
    num_row = X.shape[0]
    indices = numpy.random.randint(num_row, size=n)
    return X[indices, :], y[indices]

def evaluate_models(models, X, y):
    # Mean absolute relative error of the averaged predictions of all models
    assert min(y) > 0
    guessed_sales = numpy.array([model.guess(X) for model in models])
    mean_sales = guessed_sales.mean(axis=0)
    relative_err = numpy.absolute((y - mean_sales) / y)
    result = numpy.sum(relative_err) / len(y)
    return result

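# Split X into one array per feature, using columns 2-8 of X (column 1, store_open, is skipped)
def split_features(X):
    X_list = []
    # Column 2 of X: store index
    store_index = X[..., [1]]
    X_list.append(store_index)
    # Column 3 of X: day of week; the remaining columns follow the same pattern
    day_of_week = X[..., [2]]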
    X_list.append(day_of_week)

    promo = X[..., [3]]
    X_list.append(promo)

    year = X[..., [4]]
    X_list.append(year)

    month = X[..., [5]]
    X_list.append(month)

    day = X[..., [6]]
    X_list.append(day)

    State = X[..., [7]]
    X_list.append(State)

    return X_list
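
The metric computed by evaluate_models above is the mean of |y - prediction| / y, that is, a mean absolute relative error. This is why the function asserts that every target is positive. The sketch below reproduces that arithmetic in plain Python on made-up numbers; `mean_relative_error`, `actual`, and `predicted` are hypothetical names, and the averaging over several models is left out for brevity.

```python
def mean_relative_error(y_true, y_pred):
    """Mean of |y_true - y_pred| / y_true; requires all y_true > 0."""
    assert min(y_true) > 0
    errors = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return sum(errors) / len(errors)

# Made-up sales figures and predictions.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]

# Per-record errors are 0.1, 0.1 and 0.0, so the mean is about 0.0667.
result = mean_relative_error(actual, predicted)
print(result)
```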
