In [14]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

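# Regularization
Add a regularization term to the cost function: ![regularization](regularization.png)

where α is a constant that scales the regularization term and q selects the norm, 1 for L1 or 2 for L2: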
![L1](L1.png)
![L2](L2.png)

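Training an LR model is the process of minimizing the cost function with respect to the weights w.

If training reaches a point where some weights are very large, the whole cost is dominated by those large weights. A model learned this way generalizes poorly to unseen data.

The regularization term is introduced to penalize large weights, because the weights are now part of the cost being minimized.

Regularization therefore reduces overfitting.

α trades off log loss against generalization. If α is too small, it cannot shrink overly large weights, and the model may show high variance or overfit; if α is too large, the model struggles to fit the training set and underfits. Tuning α is therefore important.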
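The trade-off controlled by α can be made concrete with a small sketch. It assumes the penalty has the common form α·Σ|w_j|^q; the exact scaling in the figure above may differ. The function name `regularized_log_loss` and the toy data are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_log_loss(w, X, y, alpha, q):
    """Mean log loss plus alpha * sum(|w_j| ** q), with q = 1 (L1) or q = 2 (L2)."""
    p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)  # clip to avoid log(0)
    log_loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    penalty = alpha * np.sum(np.abs(w) ** q)
    return log_loss + penalty

# Toy, linearly separable data (made up for illustration)
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, 0, 0])
w_small = np.array([0.5, 0.5])
w_large = np.array([5.0, 5.0])

for alpha in (0.0, 0.01, 1.0):
    print(alpha,
          regularized_log_loss(w_small, X, y, alpha, q=2),
          regularized_log_loss(w_large, X, y, alpha, q=2))
```

With α = 0 the large weights give the lower cost on this toy set; with α = 1 the penalty dominates and the small weights are preferred.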

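## L1 or L2?
As a rule of thumb, the choice depends on whether feature selection is needed.

In classification, feature selection is the process of choosing a subset of informative features in order to build a better model.
In practice, not every feature carries useful information; some are redundant or irrelevant and can be discarded. In terms of the linear combination w1x1 + w2x2 + ..., an unimportant feature is one whose weight w can be set to 0.

### In an LR classifier, only L1 regularization enables feature selection:
Consider two weight vectors w1 = (1, 0) and w2 = (0.5, 0.5), and assume they produce the same log loss. Their L1 and L2 regularization terms are: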
![regularization_of_w](regularization_of_w.png)

L1(w1) = L1(w2)

L2(w2) < L2(w1)
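These two relations can be checked numerically. The sketch below assumes the L1 term is the sum of absolute values and the L2 term is the sum of squares; if the figure uses the Euclidean norm (with a square root) instead, the values change but the ordering does not. The helper names `l1_term` and `l2_term` are made up for illustration.

```python
import numpy as np

w1 = np.array([1.0, 0.0])
w2 = np.array([0.5, 0.5])

def l1_term(w):
    """Sum of absolute values of the weights."""
    return np.sum(np.abs(w))

def l2_term(w):
    """Sum of squared weights."""
    return np.sum(w ** 2)

print(l1_term(w1), l1_term(w2))  # 1.0 1.0
print(l2_term(w1), l2_term(w2))  # 1.0 0.5
```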

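This shows that L2 penalizes a weight vector with one markedly large and one markedly small component more than L1 does. In other words,
    
    L2 regularization favors relatively small values for all weights, large or small, and avoids any markedly large or markedly small weight.
    L1 regularization allows some weights to be markedly large and others markedly small.
    
Only with L1 regularization can some weights be compressed to nearly or exactly 0, which is what makes feature selection possible.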
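One way to see why L1 can produce exact zeros is to compare the proximal update of each penalty. This is a standard textbook illustration, swapped in here for intuition; it is not necessarily how SGDClassifier applies the penalty internally. The function names and the sample weights are made up.

```python
import numpy as np

def l1_step(w, t):
    """Soft-thresholding: move each weight toward 0 by t, stopping at 0."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def l2_step(w, t):
    """Proportional shrinkage: scale every weight by the same factor."""
    return w / (1.0 + 2.0 * t)

w = np.array([1.0, 0.05, -0.02, -0.8])  # made-up weights
t = 0.1                                  # penalty strength times step size

w_l1 = l1_step(w, t)
w_l2 = l2_step(w, t)
print(w_l1)  # the two small weights become exactly 0
print(w_l2)  # every weight shrinks, none reaches 0
```

The L1 step subtracts a fixed amount, so weights smaller than that amount are clipped to 0; the L2 step multiplies by a factor below 1, so weights only approach 0.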


In sklearn, the penalty parameter can be None, 'l1', 'l2', or 'elasticnet' (a mixture of l1 and l2).

## L1 regularization for feature selection:

In [1]:
import numpy as np
import csv
from sklearn.metrics import roc_auc_score

def read_ad_click_data(n, offset=0):
    """Read n rows from train.csv, skipping the first `offset` rows."""
    X_dict, y = [], []
    with open('train.csv', 'r') as csvfile:
        reader = csv.DictReader(csvfile)
        # Skip the rows that precede the requested window
        for _ in range(offset):
            next(reader)
        i = 0
        for row in reader:
            i += 1
            y.append(int(row['click']))
            # Drop the target and the identifier-like columns
            del row['click'], row['id'], row['hour'], row['device_id'], row['device_ip']
            X_dict.append(row)
            if i >= n:
                break
    return X_dict, y

In [2]:
n = 10000
X_dict_train, y_train = read_ad_click_data(n)

In [3]:
from sklearn.feature_extraction import DictVectorizer
dict_one_hot_encoder = DictVectorizer(sparse=True)
X_train = dict_one_hot_encoder.fit_transform(X_dict_train)

X_dict_test, y_test = read_ad_click_data(n,n)
X_test = dict_one_hot_encoder.transform(X_dict_test)

X_train_10k = X_train
y_train_10k = np.array(y_train)

## Feature selection

### Initializing and training the SGD LR model:

In [38]:
from sklearn.linear_model import SGDClassifier
l1_feature_selector = SGDClassifier(loss='log', penalty='l1',
                                   alpha=0.0001, fit_intercept=True,
                                   n_iter=5, learning_rate='constant',
                                   eta0=0.01)
l1_feature_selector.fit(X_train_10k, y_train_10k)

SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.01, fit_intercept=True, l1_ratio=0.15,
       learning_rate='constant', loss='log', n_iter=5, n_jobs=1,
       penalty='l1', power_t=0.5, random_state=None, shuffle=True,
       verbose=0, warm_start=False)

### Selecting the important features:
With the trained model, use the ***transform*** method:

In [39]:
X_train_10k_selected = l1_feature_selector.transform(X_train_10k)



In [40]:
print(X_train_10k_selected.shape) 
print(X_train_10k.shape)

(10000, 629)
(10000, 2820)


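The resulting dataset keeps only the 629 most important of the 2,820 features, which means the weights of the remaining features are zero or close to zero. Because stochastic gradient descent is random, the number of selected features may differ slightly from run to run.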

#### A closer look at the weights of the trained model:

In [42]:
print(l1_feature_selector.coef_)

[[ 0.1697109  0.         0.        ...,  0.         0.         0.       ]]


#### The 10 most negative weights and the corresponding features (those most strongly associated with no click):

In [43]:
print(np.sort(l1_feature_selector.coef_)[0][:10])
print(np.argsort(l1_feature_selector.coef_)[0][:10]) # feature indices in ascending order of weight

[-0.60701977 -0.44406823 -0.42496192 -0.40632397 -0.40632397 -0.40010368
 -0.36818184 -0.32234671 -0.3213957  -0.30947936]
[ 559 2172   34 2370 2566 1540  278 2113  579 2116]


#### The 10 largest weights and the corresponding features (those most strongly associated with a click):

In [44]:
np.sort(l1_feature_selector.coef_)[0][-10:]
np.argsort(l1_feature_selector.coef_)[0][-10:]

array([ 0.28891887,  0.29265008,  0.30409379,  0.30864271,  0.30979478,
        0.3488831 ,  0.35582366,  0.35835959,  0.37607186,  0.38586131])

array([2769,  554,  546, 2275,  547, 2149, 2580, 1503, 1519, 2761])

#### We can also look up which features these are:

In [45]:
dict_one_hot_encoder.feature_names_[2761]
dict_one_hot_encoder.feature_names_[546]

'site_id=d9750ee7'

'C21=13'
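The index lookups above can be wrapped in a small helper that pairs names with weights. The sketch ranks by absolute weight, which is one possible notion of importance and an assumption of this example; `rank_features`, the weights, and the names `f0` to `f4` are made up. In the notebook one would pass `l1_feature_selector.coef_` and `dict_one_hot_encoder.feature_names_` instead.

```python
import numpy as np

def rank_features(coef, feature_names, k=3):
    """Return the k features with the largest absolute weights, largest first."""
    coef = np.ravel(coef)
    order = np.argsort(np.abs(coef))[::-1][:k]
    return [(feature_names[j], float(coef[j])) for j in order]

# Hypothetical weights and feature names, made up for illustration
coef = np.array([[0.17, 0.0, -0.61, 0.39, 0.0]])
names = ['f0', 'f1', 'f2', 'f3', 'f4']

print(rank_features(coef, names))
# [('f2', -0.61), ('f3', 0.39), ('f0', 0.17)]
```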

# Online learning

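We can already handle 100k samples; what about more than that? Next, we look at how to train on large-scale datasets with online learning.

SGD evolved from gradient descent (GD): it updates the weights using one sample at a time, in sequence, instead of going through the whole training set for every update.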
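The difference between the two update rules can be sketched in NumPy for unregularized logistic regression. This is a minimal illustration on synthetic data, not the implementation inside SGDClassifier; the function names are made up, and visiting samples in shuffled order is an assumption of the sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gd_epoch(w, X, y, eta):
    """Batch GD: a single update from the gradient averaged over all samples."""
    grad = X.T @ (sigmoid(X @ w) - y) / len(y)
    return w - eta * grad

def sgd_epoch(w, X, y, eta, rng):
    """SGD: one update per sample, visiting the samples in shuffled order."""
    for i in rng.permutation(len(y)):
        grad_i = (sigmoid(X[i] @ w) - y[i]) * X[i]
        w = w - eta * grad_i
    return w

def log_loss(w, X, y):
    p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Synthetic data, made up for illustration
rng = np.random.RandomState(0)
X = rng.randn(200, 3)
y = (X[:, 0] - X[:, 1] > 0).astype(float)

w_gd = gd_epoch(np.zeros(3), X, y, eta=0.1)
w_sgd = sgd_epoch(np.zeros(3), X, y, eta=0.1, rng=rng)
print(log_loss(w_gd, X, y), log_loss(w_sgd, X, y))
```

One pass over the data gives GD a single weight update, while SGD makes 200 of them, which is why it tends to make faster progress per pass on large datasets.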

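Online learning lets us scale SGD up further.

In online learning, new training data is fed in real time or sequentially, rather than all at once as in an offline learning setting.

Each training step only needs to load and preprocess a relatively small chunk of data, which frees the memory that would otherwise hold the entire large dataset. For example, 40 million samples can be split into 20 chunks of 2 million and fed in one chunk at a time.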
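Chunked loading can be sketched with a generator that reads the file once and yields a fixed number of rows at a time. `iter_chunks` is a hypothetical helper, not part of the notebook; unlike `read_ad_click_data`, it does not reopen the file and skip `offset` rows on every call. A tiny in-memory CSV stands in for `train.csv`.

```python
import csv
import io
from itertools import islice

def iter_chunks(csvfile, chunk_size):
    """Yield lists of at most chunk_size rows from an open CSV file."""
    reader = csv.DictReader(csvfile)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        yield chunk

# A small in-memory file (made up) stands in for a large CSV on disk
fake_file = io.StringIO("id,click\n1,0\n2,1\n3,0\n4,0\n5,1\n")

for chunk in iter_chunks(fake_file, chunk_size=2):
    # Only the current chunk is held in memory at any time
    print(len(chunk), [row['click'] for row in chunk])
```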

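Beyond better computational feasibility, online learning suits many modern scenarios. In stock price prediction, ad click-through prediction, and spam detection, for example, data is generated and updated in real time; the existing model can simply be updated with the latest data instead of being rebuilt from scratch on old and new data together.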



![](online_learning.png)

Implementation:
    
    SGDClassifier.partial_fit (the .fit used earlier is offline learning)

### Training at the million-sample scale: 20 x 100k = 2 million samples

In [46]:
sgd_lr = SGDClassifier(loss='log', penalty=None, 
                      fit_intercept=True,
                      n_iter=1,  # when using partial_fit, set n_iter to 1
                      learning_rate = 'constant',
                      eta0=0.01)

In [49]:
import timeit
start_time = timeit.default_timer()

for i in range(20):
    X_dict_train, y_train_every_100k = read_ad_click_data(100000, i*100000)
    X_train_every_100k = dict_one_hot_encoder.transform(X_dict_train)
    sgd_lr.partial_fit(X_train_every_100k,y_train_every_100k,classes=[0,1])
    
print("--- %0.3fs seconds ---" % (timeit.default_timer() - start_time))

SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.01, fit_intercept=True, l1_ratio=0.15,
       learning_rate='constant', loss='log', n_iter=1, n_jobs=1,
       penalty=None, power_t=0.5, random_state=None, shuffle=True,
       verbose=0, warm_start=False)

--- 440.872s seconds ---


With online learning, training on millions of samples becomes feasible: the run above took about 441 seconds.

In [50]:
X_dict_test, y_test_next10k = read_ad_click_data(10000, (i+1)*200000)
X_test_next10k = dict_one_hot_encoder.transform(X_dict_test)

In [51]:
predictions = sgd_lr.predict_proba(X_test_next10k)[:,1]
print("The ROC AUC on testing set is: {0:.3f}".format(roc_auc_score(y_test_next10k,predictions)))


The ROC AUC on testing set is: 0.714
