# Feature ablation with LinearSVC

This notebook contains the code for feature ablation with an SVM (LinearSVC). At this point, the baseline model, LogisticRegression(), has already been trained on a few features: `Token`, `POS`, `Cap_after_lower`, and `Allcaps`. The latter two are binary features: `Cap_after_lower` checks whether the word is capitalized and follows a lowercase word, and `Allcaps` checks whether the word is entirely in uppercase. Further features were extracted during the preprocessing phase, and the CoNLL files were updated to include them.
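As an illustration of the two binary capitalization features, here is a minimal sketch. The function names `is_allcaps` and `is_cap_after_lower` are hypothetical, and the exact rules used during preprocessing may differ; the sketch only assumes the definitions given above.

```python
def is_allcaps(token):
    """Return True if the token has cased characters and all of them are uppercase."""
    return token.isupper()


def is_cap_after_lower(token, previous_token):
    """Return True if the token starts with a capital and the previous token is lowercase."""
    if not token or previous_token is None:
        return False
    return token[0].isupper() and previous_token.islower()


# Made-up example sentence; the first token has no predecessor
sentence = ["the", "EU", "rejects", "German", "call"]
previous_tokens = [None] + sentence[:-1]
for prev, tok in zip(previous_tokens, sentence):
    print(tok, is_allcaps(tok), is_cap_after_lower(tok, prev))
```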

This notebook uses the preprocessed CoNLL files, so all of these features are already available. The ablation is performed on the highest-performing model, the SVM, as follows:
1. First, I will train the SVM on the token only and see whether it performs better or worse.
2. I will then add back the baseline features (`POS`, `Cap_after_lower`, and `Allcaps`) and see how the SVM compares with the baseline model trained on the same features, and with the token-only SVM.
3. I will then add to the token the other orthographic features, which are not related to capitalization: `Comp_suf` (followed by a company marker), `Demonym` (contains an adjectival inflection), and `Poss_mark` (followed by 's). I will add each feature on its own, and then train a version that contains all of these orthographic features together.
4. Finally, I will add `POS` and `Chunk` to the token to see to what extent syntactic features affect the result.

This order was chosen because time constraints prevent me from checking all possible combinations. I am mainly interested in the effect of orthographic features that are not related to capitalization, compared with those that are. I am also interested in a feature set that does not contain any syntactic features.
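To give a sense of why an exhaustive search is impractical: with `Token` always included and the seven other features named in this notebook treated as optional, there are 2^7 = 128 possible feature sets, each requiring its own training run. The sketch below enumerates them; the helper name `all_feature_sets` is made up for illustration.

```python
from itertools import combinations

optional_features = ['POS', 'Chunk', 'Allcaps', 'Cap_after_lower',
                     'Demonym', 'Comp_suf', 'Poss_mark']


def all_feature_sets(optional):
    """Yield every feature set that contains 'Token' plus any subset of the optional features."""
    for size in range(len(optional) + 1):
        for subset in combinations(optional, size):
            yield ['Token'] + list(subset)


feature_sets = list(all_feature_sets(optional_features))
print(len(feature_sets))  # 128
```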

Finally, I will take the best performing combination and test it on LogisticRegression and NaiveBayes.

In [None]:
import pickle

import feature_extraction_util as extract
import classification_util as classify
import evaluation_util as evaluate

# Feature Ablation - eliminating and adding features

First, I try eight different combinations. I begin by removing all features except the token, then incrementally add the orthographic features back to the token, compare with the baseline features, and check combinations of orthographic and syntactic features.

#### The cell below contains the definitions needed to train the models and to classify the test sets with the trained models.

In [None]:
token_only = ['Token']
baseline_features = ['Token', 'POS', 'Allcaps', 'Cap_after_lower']
token_and_demonym = ['Token', 'Demonym']
token_and_comp_suf = ['Token', 'Comp_suf']
token_and_poss_mark = ['Token', 'Poss_mark']
orthography_only = ['Token', 'Demonym', 'Comp_suf', 'Poss_mark']
orthography_and_cap = ['Token', 'Allcaps', 'Cap_after_lower', 'Demonym', 'Comp_suf', 'Poss_mark']
token_and_syntax = ['Token', 'POS', 'Chunk']

all_selected_features = {
    'token_only': token_only,
    'baseline_features': baseline_features,
    'token_and_demonym': token_and_demonym,
    'token_and_comp_suf': token_and_comp_suf,
    'token_and_poss_mark': token_and_poss_mark,
    'orthography_only': orthography_only,
    'orthography_and_cap': orthography_and_cap,
    'token_and_syntax': token_and_syntax,
}

train_file = '../data/conll2003.train.preprocessed.conll'
test_file = '../data/conll2003.test.preprocessed.conll'
outputfile = '../data/conll2003.test.output.conll' # generic pathname for saving results

## Extracting the features, training and saving the model

If this cell has already been executed, it can be skipped: the next cell loads the trained models from disk.
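The helper `extract.features_and_labels` is not shown in this notebook. A minimal sketch of what such a function might do is given below. It assumes a tab-separated file with a header row naming the columns and a gold label column called `NER`; both are assumptions made for this example, as is the function name `features_and_labels_sketch`.

```python
import csv
import io

# Tiny stand-in for a preprocessed file; the column layout and the 'NER'
# label column are assumptions made for this example.
sample = (
    "Token\tPOS\tChunk\tAllcaps\tNER\n"
    "EU\tNNP\tB-NP\t1\tB-ORG\n"
    "rejects\tVBZ\tB-VP\t0\tO\n"
)


def features_and_labels_sketch(lines, selected_features, label_column='NER'):
    """Return one feature dict per token, restricted to the selected columns, plus the labels."""
    features, labels = [], []
    reader = csv.DictReader(lines, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        features.append({name: row[name] for name in selected_features})
        labels.append(row[label_column])
    return features, labels


features, labels = features_and_labels_sketch(io.StringIO(sample), ['Token', 'POS'])
print(features)
print(labels)
```

A list of dictionaries like this is the input format accepted by scikit-learn's `DictVectorizer`; whether the helper modules actually use it is an assumption.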

In [None]:
for name,selected_features in all_selected_features.items():
    
    # extract features and create classifier
    features, labels = extract.features_and_labels(train_file,selected_features)
    model, vec = classify.create_classifier(features,labels,'SVM')
    
    # save trained model and vectorizer
    classifier_pathname = '../models/%s_svm_model_conll2003.sav' % name
    vectorizer_pathname = '../models/%s_svm_vec_conll2003.sav' % name
    
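    # use context managers so the files are closed after writing
    with open(classifier_pathname, 'wb') as model_file:
        pickle.dump(model, model_file)
    with open(vectorizer_pathname, 'wb') as vec_file:
        pickle.dump(vec, vec_file)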

## Classifying the test sets with the saved model and evaluating the results

In [None]:
for name,selected_features in all_selected_features.items():
            
    # load saved models and vecs
    classifier_pathname = '../models/%s_svm_model_conll2003.sav' % name
    vectorizer_pathname = '../models/%s_svm_vec_conll2003.sav' % name
    
    with open(classifier_pathname, 'rb') as model_file:
        loaded_model = pickle.load(model_file)
    with open(vectorizer_pathname, 'rb') as vec_file:
        loaded_vec = pickle.load(vec_file)
        
    # classify data and write the predictions to a model-specific output file
    outputdata = outputfile.replace('.conll', '.%s_svm.conll' % name)
    classify.classify_data(loaded_model, loaded_vec, selected_features, test_file, outputdata)
    
    # print confusion matrix and classification report
    print("Classification Report and Confusion Matrix for the %s SVM model" % name)
    evaluate.get_confusion_matrix_and_classification_report(outputdata,exclude_majority=True)

# Checking more feature combinations with LinearSVC

Evidently, the feature combination of `Token`, `POS`, and `Chunk` has the best performance in terms of macro precision, recall and f1-score, as well as overall accuracy. I will now try this feature combination together with the orthographic features `Allcaps` and `Cap_after_lower`, which seem to have better results than the other orthographic features. 
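For reference, macro-averaged scores give every class the same weight regardless of how often it occurs. The sketch below computes them in plain Python on made-up labels. Dropping the `O` class from the average mirrors what `exclude_majority=True` presumably does, which is an assumption since the evaluation helper is not shown here; `macro_scores` is a hypothetical name.

```python
def macro_scores(gold, predicted, exclude=('O',)):
    """Return macro-averaged (precision, recall, f1) over all gold classes not in `exclude`."""
    classes = sorted(set(gold) - set(exclude))
    if not classes:
        return 0.0, 0.0, 0.0
    precisions, recalls, f1s = [], [], []
    for cls in classes:
        tp = sum(1 for g, p in zip(gold, predicted) if g == cls and p == cls)
        fp = sum(1 for g, p in zip(gold, predicted) if g != cls and p == cls)
        fn = sum(1 for g, p in zip(gold, predicted) if g == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Made-up labels, for illustration only
gold = ['B-ORG', 'O', 'B-PER', 'O', 'B-ORG', 'B-PER']
pred = ['B-ORG', 'O', 'O', 'B-ORG', 'B-ORG', 'B-PER']
precision, recall, f1 = macro_scores(gold, pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```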

## Extracting the features, training and saving the model

If this cell has already been executed, it can be skipped: the next cell loads the trained model from disk.
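The train-once, load-later pattern described here can also be made explicit in code. The sketch below is one possible way to do it; `load_or_train` and the `train` callback are hypothetical names, not part of the helper modules used in this notebook.

```python
import os
import pickle
import tempfile


def load_or_train(pathname, train):
    """Load a pickled object from `pathname` if it exists; otherwise call `train()` and save the result."""
    if os.path.exists(pathname):
        with open(pathname, 'rb') as infile:
            return pickle.load(infile)
    obj = train()
    with open(pathname, 'wb') as outfile:
        pickle.dump(obj, outfile)
    return obj


# Demonstration with a dummy "model" so the example runs on its own
calls = []


def dummy_train():
    calls.append('trained')
    return {'weights': [0.1, 0.2]}


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'dummy_model.sav')
    first = load_or_train(path, dummy_train)   # trains and saves
    second = load_or_train(path, dummy_train)  # loads from disk

print(len(calls))
```

In this notebook, `train` would wrap the calls to `extract.features_and_labels` and `classify.create_classifier`. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.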

In [None]:
token_syntax_and_cap = ['Token', 'POS', 'Chunk', 'Allcaps', 'Cap_after_lower']

# extract features and labels
features, labels = extract.features_and_labels(train_file,token_syntax_and_cap)

# create SVM classifier
model, vec = classify.create_classifier(features,labels,'SVM')

# save trained model and vectorizer
classifier_pathname = '../models/token_syntax_and_cap_svm_model_conll2003.sav'
vectorizer_pathname = '../models/token_syntax_and_cap_svm_vec_conll2003.sav'

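# use context managers so the files are closed after writing
with open(classifier_pathname, 'wb') as model_file:
    pickle.dump(model, model_file)
with open(vectorizer_pathname, 'wb') as vec_file:
    pickle.dump(vec, vec_file)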

## Classifying the test sets with the saved models and evaluating the results

In [None]:
# load saved model and vectorizer 

classifier_pathname = '../models/token_syntax_and_cap_svm_model_conll2003.sav'
vectorizer_pathname = '../models/token_syntax_and_cap_svm_vec_conll2003.sav'

with open(classifier_pathname, 'rb') as model_file:
    loaded_model = pickle.load(model_file)
with open(vectorizer_pathname, 'rb') as vec_file:
    loaded_vec = pickle.load(vec_file)

# classify data and write the predictions to a model-specific output file
outputdata = outputfile.replace('.conll', '.token_syntax_and_cap_svm.conll')
classify.classify_data(loaded_model, loaded_vec, token_syntax_and_cap, test_file, outputdata)

# print confusion matrix and classification report
print("Classification Report and Confusion Matrix for the token_syntax_and_cap SVM model")
evaluate.get_confusion_matrix_and_classification_report(outputdata,exclude_majority=True)

# Trying the best feature combination with LogisticRegression and NaiveBayes

I will now try the feature combination of `Token`, `POS`, and `Chunk`, together with the orthographic features `Allcaps` and `Cap_after_lower`, with other classifiers - namely LogisticRegression and NaiveBayes.

## Extracting the features, training and saving the models

If this cell has already been executed, it can be skipped: the next cell loads the trained models from disk.

In [None]:
for modelname in ['logreg', 'NB']:
    
    # extract features and create classifiers
    features, labels = extract.features_and_labels(train_file,token_syntax_and_cap)
    model, vec = classify.create_classifier(features,labels,modelname)
    
    # save trained model and vectorizer
    classifier_pathname = '../models/token_syntax_and_cap_%s_model_conll2003.sav' % modelname
    vectorizer_pathname = '../models/token_syntax_and_cap_%s_vec_conll2003.sav' % modelname

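    # use context managers so the files are closed after writing
    with open(classifier_pathname, 'wb') as model_file:
        pickle.dump(model, model_file)
    with open(vectorizer_pathname, 'wb') as vec_file:
        pickle.dump(vec, vec_file)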

## Classifying the test sets with the saved models and evaluating the results

In [None]:
for modelname in ['logreg', 'NB']:
    
    classifier_pathname = '../models/token_syntax_and_cap_%s_model_conll2003.sav' % modelname
    vectorizer_pathname = '../models/token_syntax_and_cap_%s_vec_conll2003.sav' % modelname
    
    with open(classifier_pathname, 'rb') as model_file:
        loaded_model = pickle.load(model_file)
    with open(vectorizer_pathname, 'rb') as vec_file:
        loaded_vec = pickle.load(vec_file)
    
    # classify data and write the predictions to a model-specific output file
    outputdata = outputfile.replace('.conll', '.token_syntax_and_cap_%s.conll' % modelname)
    classify.classify_data(loaded_model, loaded_vec, token_syntax_and_cap, test_file, outputdata)
    
    # print confusion matrix and classification report
    print("Classification Report and Confusion Matrix for the token_syntax_and_cap %s model" % modelname)
    evaluate.get_confusion_matrix_and_classification_report(outputdata,exclude_majority=True)