# Basic System - LogisticRegression

This notebook provides code for implementing a very simple machine learning system for named entity recognition.
It uses logistic regression and four features: `Token` (the token itself), `POS` (part of speech), `Allcaps` (a binary feature that indicates whether the token is written entirely in capital letters), and `Cap_after_lower` (a binary feature that indicates whether the token is capitalized and follows an all-lowercase token).
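To make the two binary features concrete, here is a minimal sketch of how they could be computed for a list of tokens. The function name `shape_features` is hypothetical, and the exact definitions used in `feature_extraction_util` are not shown in this notebook, so edge cases (punctuation, tokens containing digits, sentence boundaries) may be handled differently there.

```python
def shape_features(tokens):
    """Return one feature dict per token (hypothetical helper, not from the notebook)."""
    features = []
    previous = None  # there is no preceding token at the start of the sequence
    for token in tokens:
        # Assumption: "all caps" means every cased character is uppercase.
        allcaps = token.isupper()
        # Assumption: the previous token is entirely lowercase and the
        # current token starts with an uppercase letter.
        cap_after_lower = (
            previous is not None
            and previous.islower()
            and token[:1].isupper()
        )
        features.append(
            {"Token": token, "Allcaps": allcaps, "Cap_after_lower": cap_after_lower}
        )
        previous = token
    return features


example = shape_features(["the", "European", "Commission", "of", "EU"])
for row in example:
    print(row)
```

In this example, `European` and `EU` both follow a lowercase token and start with a capital, while `Commission` does not because it follows a capitalized token.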

In [None]:
import pickle

import feature_extraction_util as extract
import classification_util as classify
import evaluation_util as evaluate

#### The cell below contains the definitions needed to train the model and to classify the test set with the pretrained model.

In [None]:
baseline_features = ['Token', 'POS', 'Allcaps', 'Cap_after_lower']

train_file = '../data/conll2003.train.preprocessed.conll'
test_file = '../data/conll2003.test.preprocessed.conll'
outputfile = '../data/conll2003.test.output.conll'  # base pathname for results; a system-specific suffix is inserted before '.conll'

## Extracting the features, training and saving the model

The cell below only needs to be run once. If the model has already been trained and saved, it can be skipped: the next section loads the saved model and vectorizer from disk.
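The run-once pattern described above could also be automated, so that training happens only when no saved file exists. The following is a sketch under stated assumptions: `load_or_train` is a hypothetical helper that is not part of the notebook's utility modules, and `fake_training` stands in for the real feature extraction and training calls.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_train(path, train_fn):
    """Load a pickled object from `path`, or build it with `train_fn` and save it."""
    path = Path(path)
    if path.exists():
        with path.open("rb") as infile:
            return pickle.load(infile)
    obj = train_fn()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as outfile:
        pickle.dump(obj, outfile)
    return obj


calls = []


def fake_training():
    # Stand-in for the real training step; records that it was invoked.
    calls.append("trained")
    return {"weights": [0.1, 0.2]}


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "models" / "demo.sav"
    first = load_or_train(target, fake_training)   # trains and saves
    second = load_or_train(target, fake_training)  # loads from disk

print(calls)
```

The second call finds the file written by the first one, so the training function runs only once.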

In [None]:
# extract features and labels
training_features, gold_labels = extract.features_and_labels(train_file,baseline_features)

# create classifier and train it with the training features and labels
model, vec = classify.create_classifier(training_features,gold_labels,'logreg')

# save trained model and vectorizer
classifier_pathname = '../models/baseline_logreg_model_conll2003.sav'
vectorizer_pathname = '../models/baseline_logreg_vec_conll2003.sav'

with open(classifier_pathname, 'wb') as model_file:
    pickle.dump(model, model_file)
with open(vectorizer_pathname, 'wb') as vec_file:
    pickle.dump(vec, vec_file)
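The vectorizer is saved together with the model because logistic regression works on numeric vectors, and the mapping from feature values to vector columns has to be identical at training and test time. The notebook does not show which vectorizer `create_classifier` returns; the sketch below assumes a simple one-hot encoding of `feature=value` pairs, and the names `fit_vocabulary` and `transform` are hypothetical.

```python
def fit_vocabulary(feature_dicts):
    """Assign a column index to every distinct feature=value pair seen in training."""
    vocabulary = {}
    for features in feature_dicts:
        for name, value in features.items():
            vocabulary.setdefault(f"{name}={value}", len(vocabulary))
    return vocabulary


def transform(features, vocabulary):
    """Turn one feature dict into a 0/1 vector; unseen pairs are ignored."""
    vector = [0] * len(vocabulary)
    for name, value in features.items():
        column = vocabulary.get(f"{name}={value}")
        if column is not None:
            vector[column] = 1
    return vector


# Illustrative training data, not taken from the CoNLL files.
train = [
    {"Token": "EU", "POS": "NNP"},
    {"Token": "rejects", "POS": "VBZ"},
]
vocabulary = fit_vocabulary(train)

seen = transform({"Token": "EU", "POS": "NNP"}, vocabulary)
unseen = transform({"Token": "Berlin", "POS": "NNP"}, vocabulary)
print(seen, unseen)
```

A token that never occurred in training, such as `Berlin` here, contributes nothing through its `Token` feature, so the classifier has to rely on the remaining features such as `POS`.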

## Classifying the test set with the saved model and evaluating the results

In [None]:
# load saved model and vec
classifier_pathname = '../models/baseline_logreg_model_conll2003.sav'
vectorizer_pathname = '../models/baseline_logreg_vec_conll2003.sav'

with open(classifier_pathname, 'rb') as model_file:
    loaded_model = pickle.load(model_file)
with open(vectorizer_pathname, 'rb') as vec_file:
    loaded_vec = pickle.load(vec_file)

# classify data and write predictions to file
outputdata = outputfile.replace('.conll', '.baseline_logreg.conll')
classify.classify_data(loaded_model, loaded_vec, baseline_features, test_file, outputdata)

# print confusion matrix and classification report
evaluate.get_confusion_matrix_and_classification_report(outputdata, exclude_majority=True)
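The implementation of `get_confusion_matrix_and_classification_report` is not shown here. As a rough illustration of what a per-label report computes, the sketch below derives precision, recall, and F1 from gold and predicted labels. It assumes that `exclude_majority=True` means leaving the most frequent gold label (typically `O` in named entity recognition) out of the report; the function name `per_label_scores` and the example labels are hypothetical.

```python
from collections import Counter


def per_label_scores(gold, predicted, exclude_majority=True):
    """Compute precision, recall and F1 per label from two parallel label lists."""
    labels = set(gold) | set(predicted)
    if exclude_majority:
        # Assumption: the majority class is the most frequent gold label.
        majority = Counter(gold).most_common(1)[0][0]
        labels.discard(majority)
    pairs = Counter(zip(gold, predicted))  # confusion counts per (gold, predicted)
    scores = {}
    for label in sorted(labels):
        tp = pairs[(label, label)]
        fp = sum(n for (g, p), n in pairs.items() if p == label and g != label)
        fn = sum(n for (g, p), n in pairs.items() if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Illustrative labels, not real system output.
gold = ["O", "B-PER", "O", "B-LOC", "O", "B-PER"]
predicted = ["O", "B-PER", "O", "O", "B-LOC", "O"]
scores = per_label_scores(gold, predicted)
for label, values in scores.items():
    print(label, values)
```

Leaving out the majority label keeps the many easy `O` tokens from dominating the report, so the scores reflect how well the entity labels themselves are recognized.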