This notebook vectorizes documents using pre-trained Doc2Vec models.

The environment used is printed in the cell below.

In [1]:
print(__import__('sys').version)
!conda list -n PY27

2.7.16 |Anaconda, Inc.| (default, Mar 14 2019, 15:42:17) [MSC v.1500 64 bit (AMD64)]
# packages in environment at C:\Anaconda3\envs\PY27:
#
# Name                    Version                   Build  Channel
backports                 1.0                        py_2  
backports.shutil_get_terminal_size 1.0.0                    py27_2  
backports_abc             0.5                        py_0  
blas                      1.0                         mkl  
boto                      2.49.0                   pypi_0    pypi
boto3                     1.9.196                  pypi_0    pypi
botocore                  1.12.196                 pypi_0    pypi
certifi                   2019.6.16                py27_0  
chardet                   3.0.4                    pypi_0    pypi
colorama                  0.4.1                    py27_0  
decorator                 4.4.0                    py27_1  
docutils                  0.14                     pypi_0    pypi
enum34                    1.1.6   

In [1]:
import warnings

warnings.simplefilter("ignore")
warnings.simplefilter("ignore", category=PendingDeprecationWarning)

from toolz import compose, curry
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from tqdm import tqdm

import gensim, os

try:
    import cPickle as pickle  # C implementation, available on Python 2 only
except ImportError:
    import pickle
    

This notebook uses a forked version of gensim for Python 2.7: https://github.com/jhlau/gensim

(On Windows, installation requires the VS C++ 9.0 compiler: https://www.microsoft.com/en-gb/download/details.aspx?id=44266)

Once downloaded, unzip gensim and navigate to its directory in a command prompt or bash shell, then run 'python setup.py install'.

Pre-trained Doc2Vec models: https://github.com/jhlau/doc2vec

In [2]:
def save_pickle(filename, data):
    with open(os.path.normpath(filename), 'wb') as open_file:
        pickle.dump(data, open_file)

def load_pickle(filename):
    with open(os.path.normpath(filename), 'rb') as open_file:
        return pickle.load(open_file)
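
A quick round trip can confirm that the two helpers behave as expected before they are pointed at project data. The sketch below repeats the helpers so that it is self-contained; the sample documents and the temporary path are hypothetical and are not part of the project.

```python
import os
import pickle
import tempfile


def save_pickle(filename, data):
    with open(os.path.normpath(filename), 'wb') as open_file:
        pickle.dump(data, open_file)


def load_pickle(filename):
    with open(os.path.normpath(filename), 'rb') as open_file:
        return pickle.load(open_file)


# Hypothetical sample documents, shaped like the dicts used later in the notebook.
sample_docs = [{'content': 'first document'}, {'content': 'second document'}]

# Throwaway location, not one of the project paths.
tmp_dir = tempfile.mkdtemp()
tmp_path = os.path.join(tmp_dir, 'sample.pkl')

save_pickle(tmp_path, sample_docs)
restored = load_pickle(tmp_path)
print(restored == sample_docs)
```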

In [3]:
def datetime_sort(data):
    return sorted(data, key=lambda x: x['datetime'])

In [4]:
def extract_texts(data):
    return [doc['content'] for doc in data]

In [5]:
ap_model = Doc2Vec.load(r'Models\ap\doc2vec.bin')
wiki_model = Doc2Vec.load(r'Models\wiki\doc2vec.bin')

Select the model to use (ap_model or wiki_model).

In [7]:
model = wiki_model

### Main corpus

Vectorize the corpus documents.

Make sure the input and output paths below are correct before running the pipeline.

In [8]:
INPUT_FILENAME = r'C:\Users\Simon\OneDrive - University of Exeter\__Project__\__Data__\03 Preprocessing\out\pre_processed.pkl'
OUTPUT_FILENAME = r'C:\Users\Simon\OneDrive - University of Exeter\__Project__\05 Filter Docs\temp\vectorized.pkl'
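
One way to check the paths before running anything expensive is a small helper like the one below. This is only a sketch: `check_paths` is a hypothetical name, not something defined elsewhere in this notebook, and it assumes the input must already exist as a file and that the output directory must already exist.

```python
import os
import tempfile


def check_paths(input_path, output_path):
    """Return a list of problems; an empty list means both paths look usable."""
    problems = []
    if not os.path.isfile(input_path):
        problems.append('input file not found: ' + input_path)
    output_dir = os.path.dirname(os.path.normpath(output_path))
    if output_dir and not os.path.isdir(output_dir):
        problems.append('output directory not found: ' + output_dir)
    return problems


# Demonstration with throwaway paths (not the project paths).
demo_dir = tempfile.mkdtemp()
demo_input = os.path.join(demo_dir, 'pre_processed.pkl')
open(demo_input, 'wb').close()  # create an empty placeholder file

ok = check_paths(demo_input, os.path.join(demo_dir, 'vectorized.pkl'))
bad = check_paths(os.path.join(demo_dir, 'missing.pkl'),
                  os.path.join(demo_dir, 'no_such_dir', 'vectorized.pkl'))
print(ok)
print(len(bad))
```

In this notebook the call would be `check_paths(INPUT_FILENAME, OUTPUT_FILENAME)`, with the pipeline run only if the returned list is empty.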

In [9]:
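doc2vec_process_pipe = compose(
    curry(save_pickle)(OUTPUT_FILENAME),         # last: save the vectors to disk
    list,
    curry(map)(model.infer_vector),              # infer one vector per token list
    tqdm,
    curry(map)(gensim.utils.simple_preprocess),  # tokenize each text
    tqdm,
    curry(map)(lambda x: x['content']),          # pull the text out of each document
    load_pickle,                                 # first: load the documents
)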

In [10]:
doc2vec_process_pipe(INPUT_FILENAME)

100%|██████████| 227/227 [00:00<00:00, 486.08it/s]
100%|██████████| 227/227 [00:02<00:00, 92.16it/s]
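
The pipeline above is built with `compose` from toolz, which applies its arguments from right to left, so `compose(f, g, h)(x)` equals `f(g(h(x)))`. The sketch below illustrates that ordering with a small stand-in (`compose_right_to_left` is a hypothetical helper, not the toolz implementation) and toy steps in place of the model, so it runs without gensim or toolz.

```python
from functools import reduce


def compose_right_to_left(*funcs):
    """Stand-in for toolz.compose: the last function given is applied first."""
    def composed(value):
        return reduce(lambda acc, func: func(acc), reversed(funcs), value)
    return composed


# Hypothetical documents and toy steps standing in for the real ones.
docs = [{'content': 'Hello World'}, {'content': 'Doc2Vec demo'}]


def extract(items):
    return [d['content'] for d in items]


def tokenize(texts):
    return [t.lower().split() for t in texts]


def count(token_lists):
    return [len(tokens) for tokens in token_lists]


# Written top to bottom as count, tokenize, extract; executed in the reverse order.
pipe = compose_right_to_left(count, tokenize, extract)
result = pipe(docs)
print(result)  # [2, 2]
```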


### SVM training data

Vectorize SVM training data

In [9]:
GTD_DESCRIPTIONS = r'C:\Users\Simon\OneDrive - University of Exeter\__Project__\__Data__\GTD\Preprocessed Info p2.pkl'

POSITIVE_DOCS = r'C:\Users\Simon\OneDrive - University of Exeter\__Project__\__Data__\04 Training Data\SVM Positives\Agg2.pkl'

NEGATIVE_DOCS = r'C:\Users\Simon\OneDrive - University of Exeter\__Project__\__Data__\04 Training Data\SVM Negatives\Agg2.pkl'

In [10]:
gtd_des = load_pickle(GTD_DESCRIPTIONS)
pos_docs = load_pickle(POSITIVE_DOCS)
neg_docs = load_pickle(NEGATIVE_DOCS)

In [11]:
# For positive and negative samples (dicts with a 'content' key)

process_docs = compose(
                       #curry(save_pickle)(SAVENAME),
                       list,
                       curry(map)(model.infer_vector),
                       tqdm,
                       curry(map)(gensim.utils.simple_preprocess),
                       tqdm,
                       curry(map)(lambda x: x['content']))

# For the positive GTD descriptions (plain texts, no 'content' key)

process_texts = compose(
                        #curry(save_pickle)(SAVENAME),
                        list,
                        curry(map)(model.infer_vector),
                        tqdm,
                        curry(map)(gensim.utils.simple_preprocess),)

# Choose the pipeline for each dataset by inspecting its first element
doc2vec_process_pipe = curry(map)(lambda x: process_docs(x) if isinstance(x[0], dict) else process_texts(x))
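
The dispatch on the first element of each dataset can be tried out without loading a model. The sketch below is a simplified illustration under stated assumptions: `pick_texts` is a hypothetical helper, the sample batches are made up, and the guard for empty batches is an addition (indexing `x[0]` on an empty list raises `IndexError`).

```python
def pick_texts(batch):
    """Return raw texts from a batch of dicts (with 'content') or of plain strings."""
    if batch and isinstance(batch[0], dict):
        return [doc['content'] for doc in batch]
    return list(batch)


# Hypothetical batches: one of document dicts, one of plain descriptions.
dict_batch = [{'content': 'an attack was reported'}, {'content': 'markets rose'}]
text_batch = ['a bombing in the city centre']

texts = [pick_texts(batch) for batch in [dict_batch, text_batch]]
print(texts)
```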

In [12]:
svm_data = doc2vec_process_pipe([pos_docs, neg_docs, gtd_des])

100%|██████████| 1508/1508 [00:01<00:00, 865.17it/s] 
100%|██████████| 1508/1508 [00:08<00:00, 183.28it/s]
100%|██████████| 4570/4570 [00:06<00:00, 699.31it/s]
100%|██████████| 4570/4570 [00:30<00:00, 152.16it/s]
100%|██████████| 8624/8624 [00:03<00:00, 2558.29it/s]


In [13]:
SAVENAME = r'C:\Users\Simon\OneDrive - University of Exeter\__Project__\__Data__\04 Training Data\SVM\pos_neg_gtd_wiki.pkl'

save_pickle(SAVENAME, svm_data)