# Expanding the System - LogisticRegression, NaiveBayes and LinearSVC

This notebook extends the basic system with additional features, all encoded as one-hot representations. In this notebook I am also training three different models: LogisticRegression, NaiveBayes and LinearSVC (SVM).
This time, I am keeping the baseline features (`Token`, `POS`, `Cap_after_lower`, and `Allcaps`) and adding several new ones. `Chunk` uses the additional syntactic information given in the original CoNLL format, which marks phrase chunks (shallow syntactic structure), and `Demonym`, `Comp_suf`, and `Poss_mark` represent orthographic information based on the original data analysis.
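
To make the one-hot representation concrete, the sketch below encodes two token feature dictionaries in plain Python. The helper `one_hot_encode` and the feature values are hypothetical and invented for illustration. scikit-learn's `DictVectorizer` performs a comparable encoding for string-valued features, but whether `classification_util` relies on it is an assumption.

```python
def one_hot_encode(feature_dicts):
    """Map each distinct 'feature=value' pair to one binary column."""
    # Collect every feature=value pair seen in the data, in sorted order.
    columns = sorted({'%s=%s' % (name, value)
                      for features in feature_dicts
                      for name, value in features.items()})
    index = {column: i for i, column in enumerate(columns)}

    rows = []
    for features in feature_dicts:
        row = [0] * len(columns)
        for name, value in features.items():
            row[index['%s=%s' % (name, value)]] = 1
        rows.append(row)
    return columns, rows


# Hypothetical feature values for two tokens, invented for illustration.
token_features = [
    {'Token': 'Germany', 'POS': 'NNP', 'Allcaps': 'False'},
    {'Token': 'EU', 'POS': 'NNP', 'Allcaps': 'True'},
]

columns, rows = one_hot_encode(token_features)
print(columns)
for row in rows:
    print(row)
```

Every feature contributes exactly one active column per token, so adding a feature such as `Chunk` widens the matrix by the number of distinct chunk tags seen in training.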

In [None]:
import pickle

import feature_extraction_util as extract
import classification_util as classify
import evaluation_util as evaluate

#### The cell below contains the definitions needed to train the models and to classify the test set with the pretrained models.

In [None]:
selected_features = ['Token', 'POS', 'Chunk', 'Allcaps', 'Cap_after_lower', 'Demonym', 'Comp_suf', 'Poss_mark']

train_file = '../data/conll2003.train.preprocessed.conll'
test_file = '../data/conll2003.test.preprocessed.conll'
outputfile = '../data/conll2003.test.output.conll' # base pathname; the model name is inserted before '.conll'

## Extracting the features, training and saving the models

If the cell below has already been executed, it can be skipped: the trained models are loaded from disk in the next section.
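
One way to decide whether training can be skipped is to check for the saved files first. The sketch below is a minimal example under that assumption. The helper `saved_models_exist` is hypothetical and not part of the utility modules; it rebuilds the same pathnames that the training cell writes.

```python
import os


def saved_models_exist(modelnames, model_dir='../models'):
    """Return True if every model and vectorizer file is already on disk."""
    for modelname in modelnames:
        for kind in ('model', 'vec'):
            filename = 'expanded_%s_%s_conll2003.sav' % (modelname, kind)
            if not os.path.exists(os.path.join(model_dir, filename)):
                return False
    return True


if saved_models_exist(['logreg', 'NB', 'SVM']):
    print('Saved models found; the training cell can be skipped.')
else:
    print('Some models are missing; run the training cell.')
```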

In [None]:
# extract features and labels
training_features, gold_labels = extract.features_and_labels(train_file,selected_features)

for modelname in ['logreg', 'NB', 'SVM']:
    
    # create classifier
    model, vec = classify.create_classifier(training_features,gold_labels,modelname)
    
    # save trained model and vectorizer
    classifier_pathname = '../models/expanded_%s_model_conll2003.sav' % modelname    
    vectorizer_pathname = '../models/expanded_%s_vec_conll2003.sav' % modelname

    with open(classifier_pathname, 'wb') as model_file:
        pickle.dump(model, model_file)
    with open(vectorizer_pathname, 'wb') as vec_file:
        pickle.dump(vec, vec_file)

## Classifying the test sets with the saved models and evaluating the results

In [None]:
for modelname in ['logreg', 'NB', 'SVM']:
    
    # load saved model and vec
    classifier_pathname = '../models/expanded_%s_model_conll2003.sav' % modelname
    vectorizer_pathname = '../models/expanded_%s_vec_conll2003.sav' % modelname
    
    with open(classifier_pathname, 'rb') as model_file:
        loaded_model = pickle.load(model_file)
    with open(vectorizer_pathname, 'rb') as vec_file:
        loaded_vec = pickle.load(vec_file)
        
    # build the model-specific output path, then classify data and write to file
    outputdata = outputfile.replace('.conll', '.%s.conll' % modelname)
    classify.classify_data(loaded_model,loaded_vec,selected_features,test_file,outputdata)
    
    print("Classification Report and Confusion Matrix for the %s model" % modelname)
    evaluate.get_confusion_matrix_and_classification_report(outputdata,exclude_majority=True)
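
What `exclude_majority=True` does inside `evaluation_util` is not shown in this notebook. One plausible reading is that the majority label `O` is left out of the report so that it does not dominate the scores. The sketch below illustrates that idea in plain Python. The function `per_label_scores` and the toy labels are hypothetical and do not come from the utility module or the dataset.

```python
from collections import Counter


def per_label_scores(gold, predicted, exclude_label=None):
    """Compute (precision, recall) per label from two parallel label lists."""
    pair_counts = Counter(zip(gold, predicted))  # confusion-matrix cells
    gold_counts = Counter(gold)
    predicted_counts = Counter(predicted)

    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        if label == exclude_label:
            continue
        correct = pair_counts[(label, label)]
        precision = correct / predicted_counts[label] if predicted_counts[label] else 0.0
        recall = correct / gold_counts[label] if gold_counts[label] else 0.0
        scores[label] = (precision, recall)
    return scores


# Invented toy labels; 'O' is assumed to be the majority (non-entity) class.
gold = ['O', 'B-PER', 'O', 'B-LOC', 'O', 'B-PER']
predicted = ['O', 'B-PER', 'O', 'O', 'B-LOC', 'B-LOC']

scores = per_label_scores(gold, predicted, exclude_label='O')
for label, (precision, recall) in scores.items():
    print('%s precision=%.2f recall=%.2f' % (label, precision, recall))
```

Errors involving `O` still count against the entity labels: the `B-LOC` token predicted as `O` lowers recall for `B-LOC` even though `O` itself is not reported.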