# Word Embeddings with LinearSVC

This notebook provides code for training a LinearSVC model with word embeddings as features. 

It then combines traditional features (in one-hot representation) with the word embeddings. Since concatenating highly sparse one-hot token vectors with dense representations is generally not a good idea, I only use the baseline features for this experiment.

In [None]:
import pickle

import feature_extraction_util as extract
import classification_util as classify
import evaluation_util as evaluate

from gensim import models

## Creating and saving the Word Embedding model

This cell only needs to be run once. If it has already been executed, skip it: the saved model is loaded from disk in the "Word Embeddings as features" section below.
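
The create-once, load-afterwards pattern used here can also be made explicit with a file-existence check. The sketch below is only an illustration: `load_or_build` and `build_toy_model` are hypothetical names that are not part of this notebook or its utility modules, and a small dictionary stands in for the word embedding model.

```python
import os
import pickle
import tempfile


def load_or_build(path, build):
    """Return the object pickled at `path`; call `build()` and save it first if needed.

    Hypothetical helper, not part of this notebook's utility modules.
    """
    if os.path.exists(path):
        with open(path, 'rb') as infile:
            return pickle.load(infile)
    obj = build()
    with open(path, 'wb') as outfile:
        pickle.dump(obj, outfile)
    return obj


# toy stand-in for the expensive load_word2vec_format() call
build_calls = []


def build_toy_model():
    build_calls.append(1)
    return {'london': [0.1, 0.2, 0.3]}


cache_path = os.path.join(tempfile.mkdtemp(), 'toy_model.sav')
first = load_or_build(cache_path, build_toy_model)   # builds and saves
second = load_or_build(cache_path, build_toy_model)  # loads from disk
```

With a helper like this, re-running the cell would not repeat the slow conversion step, since the second call only reads the pickle file.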

In [None]:
# load the pretrained word2vec vectors (GoogleNews, 300 dimensions)
word_embeddings_path = '../word-embeddings/GoogleNews-vectors-negative300.bin'
word_embedding_model = models.KeyedVectors.load_word2vec_format(word_embeddings_path, binary=True)

# save word embedding model; the with-block closes the file after writing
embedding_model_pathname = '../models/word_embedding_model_conll2003.sav'
with open(embedding_model_pathname, 'wb') as outfile:
    pickle.dump(word_embedding_model, outfile)

#### The cell below contains the definitions needed to train the models and to classify the test set with the saved models.

In [None]:
selected_features = ['Token', 'POS', 'Allcaps', 'Cap_after_lower']

train_file = '../data/conll2003.train.preprocessed.conll'
test_file = '../data/conll2003.test.preprocessed.conll'
outputfile = '../data/conll2003.test.output.conll' # generic pathname for saving results

## Word Embeddings as features
In this section, I train the LinearSVC model with word embeddings as features. Since this takes longer than training the other models, I added print() messages that show which step is currently running.
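
As a rough illustration of what "word embeddings as features" means: every token is replaced by its pretrained vector, and tokens missing from the embedding vocabulary need some fallback. The sketch below assumes a zero vector as fallback and uses a toy dictionary in place of the 300-dimensional GoogleNews model. `token_to_vector` is a hypothetical name, and the actual `extract.embeddings_as_features()` may handle unknown tokens differently.

```python
def token_to_vector(token, embedding_model, dimensions):
    """Return the embedding of `token`, or a zero vector if it is out of vocabulary."""
    if token in embedding_model:
        return list(embedding_model[token])
    return [0.0] * dimensions


# toy 3-dimensional embeddings; the real model has 300 dimensions
toy_embeddings = {
    'London': [0.2, -0.1, 0.7],
    'visited': [0.0, 0.4, -0.3],
}

tokens = ['Peter', 'visited', 'London']
features = [token_to_vector(token, toy_embeddings, dimensions=3) for token in tokens]
# every token is now a dense vector of equal length;
# 'Peter' is not in the toy vocabulary and becomes all zeros
```

Every row has the same length regardless of the token, which is what allows the vectors to be passed to the classifier as a regular feature matrix.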

Run the cell below to load the word embedding model.

In [None]:
# load saved word embedding model
embedding_model_pathname = '../models/word_embedding_model_conll2003.sav'
with open(embedding_model_pathname, 'rb') as infile:
    loaded_word_embedding_model = pickle.load(infile)

### Extracting the word embeddings, training and saving the model

If this cell has already been executed, the trained model is loaded in the next cell.

In [None]:
# extracting features and labels
print('Extracting dense features from training file...')
embeddings_as_features, gold_labels = extract.embeddings_as_features(train_file,
                                                                    loaded_word_embedding_model,get_gold=True)

# create SVM classifier and train with word embeddings
print('Training classifier...')
embedding_classifier = classify.create_embedding_classifier(embeddings_as_features,gold_labels)

# save SVM model trained with word embeddings as features
print('Saving trained SVM model...')
embedding_classifier_pathname = '../models/embeddings_svm_conll2003.sav'
with open(embedding_classifier_pathname, 'wb') as outfile:
    pickle.dump(embedding_classifier, outfile)

print('Done.')

### Classifying the test sets with the saved models and evaluating the results

In [None]:
# load saved SVM model
embedding_classifier_pathname = '../models/embeddings_svm_conll2003.sav'
with open(embedding_classifier_pathname, 'rb') as infile:
    loaded_model = pickle.load(infile)

# extract features from test file
print('Extracting features from test file...')
test_features = extract.embeddings_as_features(test_file, loaded_word_embedding_model, get_gold=False)

# classify data and write to file
print('Writing embedding classification to outputfile...')
classify.classify_data_given_features(test_features, loaded_model, test_file,
                                      outputfile.replace('.conll', '.embeddings_svm.conll'))

print('Finished classifying.')

outputdata = '../data/conll2003.test.output.embeddings_svm.conll'

# display classification report
print("Classification Report and Confusion Matrix for SVM trained with embeddings as features")
evaluate.get_confusion_matrix_and_classification_report(outputdata, exclude_majority=True)

## Combined Word Embeddings and traditional features
In this section, I train the LinearSVC model on combined dense and sparse vectors, i.e. word embeddings concatenated with traditional features. Since this takes longer than training the other models, I added print() messages that show which step is currently running.

### Extracting the combined features, training and saving the model

If this cell has already been executed, the trained model is loaded in the next cell.

<span style="color:red">Note: Due to memory issues, my kernel is currently crashing when trying to convert one-hot vectors into dense vectors!!!</span>

The kernel crashes inside `combined_features()`: that function calls `combine_sparse_and_dense_features()`, which runs `sparse_vectors = np.array(sparse_features.toarray())`. This line expands the sparse one-hot matrix into a full dense array, which is presumably where the memory runs out (you can inspect these functions in feature_extraction_util.py).
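
One possible way around the crash is to keep the one-hot features sparse and wrap the embeddings instead: `scipy.sparse.hstack` concatenates both blocks column-wise into a sparse matrix, which scikit-learn's LinearSVC accepts as input. The sketch below uses small made-up matrices and reuses the names `sparse_features` and `dense_features` only for readability. It has not been tested against `combine_sparse_and_dense_features()` in feature_extraction_util.py, so treat it as an assumption about how that function could be changed.

```python
import numpy as np
from scipy import sparse

# toy one-hot block: 4 tokens x 6 one-hot columns, as a vectorizer would return it
sparse_features = sparse.csr_matrix(np.array([
    [1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 1],
    [1, 0, 0, 0, 1, 0],
], dtype=np.float64))

# toy embedding block: 4 tokens x 3 dimensions (300 in the real model)
dense_features = np.arange(12, dtype=np.float64).reshape(4, 3)

# wrap the dense block in a sparse container and stack column-wise;
# the one-hot block is never expanded with .toarray()
combined = sparse.hstack(
    [sparse_features, sparse.csr_matrix(dense_features)], format='csr')

# bytes a dense float64 copy of the combined matrix would need: rows x columns x 8
dense_bytes = combined.shape[0] * combined.shape[1] * 8
```

With the real data, the row count is the number of training tokens and the column count grows with the vocabulary, so a dense copy grows as rows x columns x 8 bytes, while the sparse matrix only stores the non-zero entries.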

In [None]:
# extracting combined features and labels
print('Extracting combined features from training file...')
combined_features, gold_labels, vec = extract.combined_features(train_file,loaded_word_embedding_model,
                                                                    selected_features,get_gold_and_vec=True)

# create SVM classifier and train with combined features
print('Training classifier....')
combined_classifier = classify.create_embedding_classifier(combined_features,gold_labels)

# save SVM model trained with combined features
# also save the vectorizer so the traditional features of the test set can be encoded with the same mapping
print('Saving trained SVM model and vectorizer...')

combined_classifier_pathname = '../models/combined_features_svm_conll2003.sav'
vectorizer_pathname = '../models/traditional_feature_vectorizer_conll2003.sav'

with open(combined_classifier_pathname, 'wb') as outfile:
    pickle.dump(combined_classifier, outfile)
with open(vectorizer_pathname, 'wb') as outfile:
    pickle.dump(vec, outfile)

print('Done.')

### Classifying the test sets with the saved models and evaluating the results
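
The vectorizer is saved and reloaded because the test-set features have to be mapped onto the same columns that the classifier saw during training. The sketch below illustrates this with a deliberately tiny, hypothetical `ToyOneHotVectorizer`. It is not the vectorizer returned by `extract.combined_features()`, and it assumes that feature values unseen in training are simply ignored.

```python
class ToyOneHotVectorizer:
    """Minimal one-hot encoder for feature dicts (hypothetical, for illustration only)."""

    def fit(self, feature_dicts):
        # assign one column to every feature=value pair seen in the training data
        pairs = sorted({(name, value) for d in feature_dicts for name, value in d.items()})
        self.columns = {pair: index for index, pair in enumerate(pairs)}
        return self

    def transform(self, feature_dicts):
        rows = []
        for d in feature_dicts:
            row = [0] * len(self.columns)
            for pair in d.items():
                if pair in self.columns:  # pairs unseen during fit are skipped
                    row[self.columns[pair]] = 1
            rows.append(row)
        return rows


train = [{'POS': 'NNP', 'Allcaps': 'False'}, {'POS': 'VBD', 'Allcaps': 'False'}]
test = [{'POS': 'NNP', 'Allcaps': 'True'}]  # 'Allcaps=True' never occurred in training

toy_vec = ToyOneHotVectorizer().fit(train)  # fit on the training data only
train_rows = toy_vec.transform(train)
test_rows = toy_vec.transform(test)         # reuse the fitted mapping, no refitting
```

Fitting a fresh vectorizer on the test set would instead produce a different column layout, and the trained classifier's weights would no longer line up with the features.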

In [None]:
# load saved SVM model and vectorizer
combined_classifier_pathname = '../models/combined_features_svm_conll2003.sav'
vectorizer_pathname = '../models/traditional_feature_vectorizer_conll2003.sav'

with open(combined_classifier_pathname, 'rb') as infile:
    loaded_model = pickle.load(infile)
with open(vectorizer_pathname, 'rb') as infile:
    loaded_vec = pickle.load(infile)

# extract features from test file
print('Extracting features from test-set...')
test_features = extract.combined_features(test_file,loaded_word_embedding_model,selected_features,
                                          get_gold_and_vec=False, vectorizer=loaded_vec)

# classify data and write to file
print('Writing combined-features classification to outputfile...')
classify.classify_data_given_features(test_features,loaded_model,test_file,
              outputfile.replace('.conll','.combined_features_svm.conll'))

print('Finished classifying.')

outputdata = '../data/conll2003.test.output.combined_features_svm.conll'
    
# display classification report

print("Classification Report and Confusion Matrix for SVM trained with combined features")
evaluate.get_confusion_matrix_and_classification_report(outputdata,exclude_majority=True)