The goal of this notebook is to retrain the pipeline using the best parameters found,
with the aim of reducing the 1.2 GB model size.

In [1]:
from w2vpipe.pipeline import word_embedding, text_clean

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression


# sklearn.externals.joblib was deprecated in scikit-learn 0.21 and removed in 0.23
import joblib

In [2]:
logit_clf_word2vec = LogisticRegression(solver="lbfgs", max_iter=10000, C=4.032,
                                        class_weight={0: 0.8, 1: 0.2})

word2vec_pipe = Pipeline([
    ("text_cleaning", text_clean()),
    ("word_embedding", word_embedding(algo_name="word2vec", workers=2, size=300)),
    ("logit_clf_word2vec", logit_clf_word2vec),
])

In [3]:
import pandas as pd

train_data = pd.read_csv("../clean_data/Cleaned_train_text_with_pii_2018_12_29_07_26_56_266227.csv")

In [4]:
%%time
word2vec_pipe.fit(train_data['Text'], train_data['Target'])

Building new vocabulary and training the word2vec model


  0%|          | 0/800000 [00:00<?, ?it/s]

transforming while training word2vec model with new data.


100%|██████████| 800000/800000 [01:27<00:00, 9183.85it/s] 
100%|██████████| 800000/800000 [00:03<00:00, 222464.63it/s]


CPU times: user 10min 26s, sys: 1min 39s, total: 12min 6s
Wall time: 5min 34s


Pipeline(memory=None,
     steps=[('text_cleaning', <w2vpipe.pipeline.text_clean object at 0x7f0f8f632a90>), ('word_embedding', word_embedding(algo_name='word2vec', continue_train_pre_train=True,
        dump_file=False, epochs=5, min_count=1, pre_train=None,
        re_train_new_sentences=True, size=300, window=5, workers=2)...enalty='l2', random_state=None,
          solver='lbfgs', tol=0.0001, verbose=0, warm_start=False))])

The model is about 1.1GB because I chose a vector length of 300 to embed the words.
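
As a rough check on why the vector length drives the size, the sketch below estimates the memory taken by the embedding matrices alone. It assumes float32 weights and two matrices of shape (vocabulary size, vector length), which is a common layout for word2vec but is not confirmed for this pipeline. The vocabulary size is purely illustrative, since the notebook does not report it, and `estimate_embedding_bytes` is a hypothetical helper.

```python
def estimate_embedding_bytes(vocab_size, vector_size, n_matrices=2, bytes_per_value=4):
    """Approximate bytes used by word2vec weight matrices.

    Assumes float32 values (4 bytes) and n_matrices arrays of shape
    (vocab_size, vector_size); both are assumptions, not measured values.
    """
    return vocab_size * vector_size * n_matrices * bytes_per_value


# Hypothetical vocabulary size, for illustration only.
vocab_size = 400_000

for vector_size in (50, 100, 300):
    size_gb = estimate_embedding_bytes(vocab_size, vector_size) / 1e9
    print(f"size={vector_size}: ~{size_gb:.2f} GB")
```

Under these assumptions the estimate scales linearly with the vector length, so halving `size` would roughly halve the embedding part of the model.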

In [6]:
joblib.dump(word2vec_pipe, 'word2vec_pipe_production.pkl', compress = 1)

['word2vec_pipe_production.pkl']
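
To confirm the size of the dumped file on disk, a small helper along these lines could be used. This is a minimal sketch using only the standard library; `file_size_mb` is a hypothetical name, and the path is the one passed to `joblib.dump` above.

```python
import os


def file_size_mb(path):
    """Return the size of the file at `path` in megabytes (10**6 bytes)."""
    return os.path.getsize(path) / 1e6


model_path = "word2vec_pipe_production.pkl"  # file written by joblib.dump above
if os.path.exists(model_path):
    print(f"{model_path}: {file_size_mb(model_path):.1f} MB")
else:
    print(f"{model_path} not found; run the joblib.dump cell first")
```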

## In the future, balance model size against model performance by changing the size of the vector embedding.

Check whether the retrained pipeline still performs the same on the test data.

In [7]:
test_data = pd.read_csv(
    "../clean_data/Cleaned_test_text_with_pii_"
    "2018_12_31_05_35_46_815414.csv"
)

In [8]:
%%time
from sklearn.metrics import classification_report

binary_pred = word2vec_pipe.predict(test_data["Text"])
binary_true = test_data["Target"]

print(classification_report(y_true=binary_true, y_pred=binary_pred))

  0%|          | 0/80000 [00:00<?, ?it/s]

transforming while training word2vec model with new data.


100%|██████████| 80000/80000 [00:09<00:00, 8524.90it/s] 
100%|██████████| 80000/80000 [00:00<00:00, 217229.72it/s]


              precision    recall  f1-score   support

           0       0.87      0.97      0.92     10000
           1       1.00      0.98      0.99     70000

   micro avg       0.98      0.98      0.98     80000
   macro avg       0.93      0.97      0.95     80000
weighted avg       0.98      0.98      0.98     80000

CPU times: user 32.7 s, sys: 8.23 s, total: 41 s
Wall time: 21.7 s
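
To make the averaged rows of the report easier to read, the sketch below recomputes the macro and weighted averages from the per-class rows. The macro average is the unweighted mean over classes, while the weighted average weights each class by its support, so the larger class 1 dominates it. The inputs are the rounded values printed above, so the results may differ from the report in the last digit; `macro_avg` and `weighted_avg` are hypothetical helper names.

```python
# Per-class values copied from the classification report above (rounded to 2 decimals).
report = {
    0: {"precision": 0.87, "recall": 0.97, "support": 10000},
    1: {"precision": 1.00, "recall": 0.98, "support": 70000},
}


def macro_avg(report, metric):
    """Unweighted mean of a metric over classes."""
    return sum(row[metric] for row in report.values()) / len(report)


def weighted_avg(report, metric):
    """Mean of a metric over classes, weighted by class support."""
    total = sum(row["support"] for row in report.values())
    return sum(row[metric] * row["support"] for row in report.values()) / total


for metric in ("precision", "recall"):
    print(metric, "macro:", round(macro_avg(report, metric), 3),
          "weighted:", round(weighted_avg(report, metric), 3))
```

Because class 0 has only 10000 of the 80000 test rows, its lower precision is visible in the macro average but barely moves the weighted average.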
