# EDAN20 - Assignment #3
### Extracting noun groups using machine learning techniques
Initial code provided by Pierre Nugues (@pnugues/ilpp)
Completed by Jonathan Moran (jo6155mo-s) & Alexis Cole (alexiscole99)

## Choosing a training and a test sets
1. As annotated data and annotation scheme, you will use the data available from CoNLL 2000.

In [1]:
import conll_reader

column_names = ['form', 'pos', 'chunk']

2. Download both the training and test sets (the same as in the previous assignment) and decompress them.

In [2]:
train_file = '../../corpus/conll2000/train.txt'
test_file = '../../corpus/conll2000/test.txt'

train_corpus = conll_reader.read_sentences(train_file)
train_corpus = conll_reader.split_rows(train_corpus, column_names)
test_corpus = conll_reader.read_sentences(test_file)
test_corpus = conll_reader.split_rows(test_corpus, column_names)

In [3]:
train_corpus[1]

[{'form': 'Chancellor', 'pos': 'NNP', 'chunk': 'O'},
 {'form': 'of', 'pos': 'IN', 'chunk': 'B-PP'},
 {'form': 'the', 'pos': 'DT', 'chunk': 'B-NP'},
 {'form': 'Exchequer', 'pos': 'NNP', 'chunk': 'I-NP'},
 {'form': 'Nigel', 'pos': 'NNP', 'chunk': 'B-NP'},
 {'form': 'Lawson', 'pos': 'NNP', 'chunk': 'I-NP'},
 {'form': "'s", 'pos': 'POS', 'chunk': 'B-NP'},
 {'form': 'restated', 'pos': 'VBN', 'chunk': 'I-NP'},
 {'form': 'commitment', 'pos': 'NN', 'chunk': 'I-NP'},
 {'form': 'to', 'pos': 'TO', 'chunk': 'B-PP'},
 {'form': 'a', 'pos': 'DT', 'chunk': 'B-NP'},
 {'form': 'firm', 'pos': 'NN', 'chunk': 'I-NP'},
 {'form': 'monetary', 'pos': 'JJ', 'chunk': 'I-NP'},
 {'form': 'policy', 'pos': 'NN', 'chunk': 'I-NP'},
 {'form': 'has', 'pos': 'VBZ', 'chunk': 'B-VP'},
 {'form': 'helped', 'pos': 'VBN', 'chunk': 'I-VP'},
 {'form': 'to', 'pos': 'TO', 'chunk': 'I-VP'},
 {'form': 'prevent', 'pos': 'VB', 'chunk': 'I-VP'},
 {'form': 'a', 'pos': 'DT', 'chunk': 'B-NP'},
 {'form': 'freefall', 'pos': 'NN', 'chunk': '

In [4]:
test_corpus[1]

[{'form': 'Rockwell', 'pos': 'NNP', 'chunk': 'B-NP'},
 {'form': 'said', 'pos': 'VBD', 'chunk': 'B-VP'},
 {'form': 'the', 'pos': 'DT', 'chunk': 'B-NP'},
 {'form': 'agreement', 'pos': 'NN', 'chunk': 'I-NP'},
 {'form': 'calls', 'pos': 'VBZ', 'chunk': 'B-VP'},
 {'form': 'for', 'pos': 'IN', 'chunk': 'B-SBAR'},
 {'form': 'it', 'pos': 'PRP', 'chunk': 'B-NP'},
 {'form': 'to', 'pos': 'TO', 'chunk': 'B-VP'},
 {'form': 'supply', 'pos': 'VB', 'chunk': 'I-VP'},
 {'form': '200', 'pos': 'CD', 'chunk': 'B-NP'},
 {'form': 'additional', 'pos': 'JJ', 'chunk': 'I-NP'},
 {'form': 'so-called', 'pos': 'JJ', 'chunk': 'I-NP'},
 {'form': 'shipsets', 'pos': 'NNS', 'chunk': 'I-NP'},
 {'form': 'for', 'pos': 'IN', 'chunk': 'B-PP'},
 {'form': 'the', 'pos': 'DT', 'chunk': 'B-NP'},
 {'form': 'planes', 'pos': 'NNS', 'chunk': 'I-NP'},
 {'form': '.', 'pos': '.', 'chunk': 'O'}]

3. Be sure that you have the scikit-learn package: Start it by typing import sklearn in Python.

In [5]:
import sklearn

## Baseline

Most statistical algorithms for language processing start with a so-called baseline. The baseline figure corresponds to the application of a minimal technique that is used to assess the difficulty of a task and for comparison with further programs.

1. Read the baseline proposed by the organizers of the CoNLL 2000 shared task (In the Results Sect.). What do you think of it?

2. Implement this baseline program. You may either create a completely new program or start from an existing program that you will modify [Program folder ].

- For each part of speech, select the most frequent chunk.

In [6]:
def count_pos(corpus):
    """
    Computes the part-of-speech distribution
    in a CoNLL 2000 file
    :param corpus:
    :return:
    """
    pos_cnt = {}
    for sentence in corpus:
        for row in sentence:
            if row['pos'] in pos_cnt:
                pos_cnt[row['pos']] += 1
            else:
                pos_cnt[row['pos']] = 1

    return pos_cnt

- Complete the `train` function so that it computes the chunk distribution for each part of speech. You will use the `train` file to derive your distribution and you will store the results in a dictionary.

In [7]:
def train(corpus):
    """
    Computes the chunk distribution by pos
    The result is stored in a dictionary
    :param corpus: CoNLL tokens annotated with three-column tagset (POS, chunk tag, predicted chunk tag)
    :return pos_chunk: dictionary containing the most-frequent pos-chunk associations
    """
    pos_cnt = count_pos(corpus)

    # We compute the chunk distribution by POS
    chunk_dist = {key: {} for key in pos_cnt.keys()}

    """
    Fill in code to compute the chunk distribution for each part of speech
    Chunk distribution: num of pos-chunk associations 
    - case 1 (expected): chunk matches pos, incremement occurence count
    - case 2 (initialisation): chunk matches pos, set count to 1
    """

    for sentence in corpus:
        for row in sentence:
            # MODIFIED 5/10 -- @jonathanloganmoran:
            chunk = row['chunk']
            pos = row['pos']

            if chunk in chunk_dist[pos]:
                chunk_dist[pos][chunk] += 1
            else:
                chunk_dist[pos][chunk] = 1

    print(chunk_dist['NNP'])
                
    """
    Fill in code so that for each part of speech, you select the most frequent chunk.
    You will build a dictionary with key values:
    pos_chunk[pos] = most frequent chunk for pos
    """
    
    # We determine the best association
    pos_chunk = {}
    """
    MODIFIED 5/10 -- @jonathanloganmoran:
    - use lambda function to retrieve max POS
    """
    for chunk in chunk_dist:
        pos_chunk[chunk] = max(chunk_dist[chunk], key=lambda i: chunk_dist[chunk][i])
    
    return pos_chunk

- In the example above, you will have (NN, I-NP)

In [8]:
model = train(train_corpus)

print("Excerpt of training corpus for 'pos': 'NNP' tag: ", model['NNP'])

{'B-NP': 8314, 'I-NP': 11470, 'O': 57, 'B-VP': 18, 'B-ADVP': 9, 'I-ADVP': 1, 'I-VP': 2, 'B-PRT': 1, 'B-ADJP': 8, 'B-INTJ': 2, 'I-ADJP': 2}
Excerpt of training corpus for 'pos': 'NNP' tag:  I-NP


- You will store your results in an output file that has four columns. The three first columns will be the input columns from the test file: word, part of speech, and gold-standard chunk. You will append the predicted chunk as the 4th column.

In [9]:
def predict(model, corpus):
    """
    Predicts the chunk from the part of speech
    Adds a pchunk column
    :param model: most-frequent pos-chunk associations
    :param corpus: CoNLL annotated tokens + three-column tagset
    :return: CoNLL dataset with predicted chunks
    """
    """
    We add a predicted chunk column: pchunk
    """
    for sentence in corpus:
        for row in sentence:
            row['pchunk'] = model[row['pos']]
    return corpus

- Your output file should look like the excerpt below:

In [10]:
predicted = predict(model, test_corpus)
print(predicted[1])

[{'form': 'Rockwell', 'pos': 'NNP', 'chunk': 'B-NP', 'pchunk': 'I-NP'}, {'form': 'said', 'pos': 'VBD', 'chunk': 'B-VP', 'pchunk': 'B-VP'}, {'form': 'the', 'pos': 'DT', 'chunk': 'B-NP', 'pchunk': 'B-NP'}, {'form': 'agreement', 'pos': 'NN', 'chunk': 'I-NP', 'pchunk': 'I-NP'}, {'form': 'calls', 'pos': 'VBZ', 'chunk': 'B-VP', 'pchunk': 'B-VP'}, {'form': 'for', 'pos': 'IN', 'chunk': 'B-SBAR', 'pchunk': 'B-PP'}, {'form': 'it', 'pos': 'PRP', 'chunk': 'B-NP', 'pchunk': 'B-NP'}, {'form': 'to', 'pos': 'TO', 'chunk': 'B-VP', 'pchunk': 'B-PP'}, {'form': 'supply', 'pos': 'VB', 'chunk': 'I-VP', 'pchunk': 'I-VP'}, {'form': '200', 'pos': 'CD', 'chunk': 'B-NP', 'pchunk': 'I-NP'}, {'form': 'additional', 'pos': 'JJ', 'chunk': 'I-NP', 'pchunk': 'I-NP'}, {'form': 'so-called', 'pos': 'JJ', 'chunk': 'I-NP', 'pchunk': 'I-NP'}, {'form': 'shipsets', 'pos': 'NNS', 'chunk': 'I-NP', 'pchunk': 'I-NP'}, {'form': 'for', 'pos': 'IN', 'chunk': 'B-PP', 'pchunk': 'B-PP'}, {'form': 'the', 'pos': 'DT', 'chunk': 'B-NP', 'pc

3. Measure the performance of the system. Use the `conlleval.txt` evaluation program used by the CoNLL 2000 shared task.

In [11]:
def eval(predicted):
    """
    Evaluates the predicted chunk accuracy
    :param predicted: the pchunk-annotated dataset
    :return: the percentage of correct chunk tags (classification accuracy)
    """
    word_cnt = 0
    correct = 0
    for sentence in predicted:
        for row in sentence:
            word_cnt += 1
            if row['chunk'] == row['pchunk']:
                correct += 1
    return correct / word_cnt

In [12]:
accuracy = eval(predicted)
print("Accuracy", accuracy)

Accuracy 0.7729066846782194


- `conlleval.txt` is the official CoNLL Perl script. It expects the two last columns of the test set to be the manually assigned chunk (gold standard) and the predicted chunk.

In [13]:
f_out = open('out', 'w')
# We write the word (form), part of speech (pos),
# gold-standard chunk (chunk), and predicted chunk (pchunk)
for sentence in predicted:
    for row in sentence:
        f_out.write(row['form'] + ' ' + row['pos'] + ' ' + row['chunk'] + ' ' + row['pchunk'] + '\n')
    f_out.write('\n')
f_out.close()

- where the `out` file contains both the gold and predicted chunk tags. `conlleval.txt` is a Perl script.

In [15]:
!perl conlleval.txt <out

processed 47377 tokens with 23852 phrases; found: 26992 phrases; correct: 19592.
accuracy:  77.29%; precision:  72.58%; recall:  82.14%; FB1:  77.07
             ADJP: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
             ADVP: precision:  44.33%; recall:  77.71%; FB1:  56.46  1518
            CONJP: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
             INTJ: precision:  50.00%; recall:  50.00%; FB1:  50.00  2
              LST: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
               NP: precision:  79.87%; recall:  86.80%; FB1:  83.19  13500
               PP: precision:  74.73%; recall:  97.07%; FB1:  84.45  6249
              PRT: precision:  75.00%; recall:   8.49%; FB1:  15.25  12
             SBAR: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
               VP: precision:  60.53%; recall:  74.22%; FB1:  66.68  5711


## Using Machine Learning

"In this exercise, you will apply and explore the `ml_chunker.py` program. You will start from the original program you downloaded and modify it so that you understand how to improve the performance of your chunker. You will not add new features to the feature vector."

In [16]:
""""
CoNLL 2000 file reader
"""
__author__ = "Pierre Nugues"
# MODIFIED -- @jonathanloganmoran

def read_str(file):
    """
    Creates a string of sentences from the corpus
    :param file:
    :return:
    """
    f = str(open(file).read()).strip()
    sentences = f.split('\n\n')
    return sentences

In [17]:
# column_names = ['form', 'pos', 'chunk']
train_file = '../../corpus/conll2000/train.txt'
test_file = '../../corpus/conll2000/test.txt'

train_sentences = read_str(train_file)
# train_corpus = conll_reader.split_rows(train_corpus, column_names)
test_sentences = read_str(test_file)
# test_corpus = conll_reader.split_rows(test_corpus, column_names)

In [18]:
def extract_features_sent(sentence, w_size, feature_names):
    """
    Extract the features from one sentence
    returns X and y, where X is a list of dictionaries and
    y is a list of symbols
    :param sentence: string containing the CoNLL structure of a sentence
    :param w_size:
    :return:
    """

    # We pad the sentence to extract the context window more easily
    start = "BOS BOS BOS\n"
    end = "\nEOS EOS EOS"
    start *= w_size
    end *= w_size
    sentence = start + sentence
    sentence += end

    # Each sentence is a list of rows
    sentence = sentence.splitlines()
    padded_sentence = list()
    for line in sentence:
        line = line.split()
        padded_sentence.append(line)
    # print(padded_sentence)

    # We extract the features and the classes
    # X contains is a list of features, where each feature vector is a dictionary
    # y is the list of classes
    X = list()
    y = list()
    for i in range(len(padded_sentence) - 2 * w_size):
        # x is a row of X
        x = list()
        # The words in lower case
        for j in range(2 * w_size + 1):
            x.append(padded_sentence[i + j][0].lower())
        # The POS
        for j in range(2 * w_size + 1):
            x.append(padded_sentence[i + j][1])
        # The chunks (Up to the word)
        """
        for j in range(w_size):
            feature_line.append(padded_sentence[i + j][2])
        """
        # We represent the feature vector as a dictionary
        X.append(dict(zip(feature_names, x)))
        # The classes are stored in a list
        y.append(padded_sentence[i + w_size][2])
    return X, y

In [19]:
def extract_features(sentences, w_size, feature_names):
    """
    Builds X matrix and y vector
    X is a list of dictionaries and y is a list
    :param sentences:
    :param w_size:
    :return:
    """
    X_l = []
    y_l = []
    for sentence in sentences:
        X, y = extract_features_sent(sentence, w_size, feature_names)
        X_l.extend(X)
        y_l.extend(y)
    return X_l, y_l

In [20]:
feature_names = ['word_n2', 'word_n1', 'word', 'word_p1', 'word_p2',
                'pos_n2', 'pos_n1', 'pos', 'pos_p1', 'pos_p2']
w_size = 2  # The size of the context window to the left and right of the word

print("Extracting the features...")
X_dict, y = extract_features(train_sentences, w_size, feature_names)

Extracting the features...


In [21]:
from sklearn.feature_extraction import DictVectorizer

print("Encoding the features...")
# Vectorize the feature matrix and carry out a one-hot encoding
vec = DictVectorizer(sparse=True)
X = vec.fit_transform(X_dict)
# The statement below will swallow a considerable memory
# X = vec.fit_transform(X_dict).toarray()

Encoding the features...


In [22]:
from sklearn import linear_model

print("Training the model...")
classifier = linear_model.LogisticRegression(penalty='l2', dual=True, solver='liblinear')
model = classifier.fit(X, y)
print(model)

Training the model...




LogisticRegression(C=1.0, class_weight=None, dual=True, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)


In [23]:
# We apply the model to the test set
# test_sentences = list(conll_reader.read_sentences(test_corpus))

In [24]:
# Here we carry out a chunk tag prediction and we report the per tag error
# This is done for the whole corpus without regard for the sentence structure
print("Predicting the chunks in the test set...")
X_test_dict, y_test = extract_features(test_sentences, w_size, feature_names)
# Vectorize the test set and one-hot encoding
X_test = vec.transform(X_test_dict)  # Possible to add: .toarray()
y_test_predicted = classifier.predict(X_test)

from sklearn import metrics
print("Classification report for classifier %s:\n%s\n" % (classifier, metrics.classification_report(y_test, y_test_predicted)))

Predicting the chunks in the test set...


  'precision', 'predicted', average, warn_for)


Classification report for classifier LogisticRegression(C=1.0, class_weight=None, dual=True, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False):
              precision    recall  f1-score   support

      B-ADJP       0.83      0.67      0.74       438
      B-ADVP       0.81      0.81      0.81       866
     B-CONJP       0.67      0.44      0.53         9
      B-INTJ       1.00      0.50      0.67         2
       B-LST       0.00      0.00      0.00         5
        B-NP       0.96      0.96      0.96     12422
        B-PP       0.96      0.98      0.97      4811
       B-PRT       0.77      0.74      0.75       106
      B-SBAR       0.89      0.84      0.87       535
        B-VP       0.95      0.95      0.95      4658
      I-ADJP       0.86      0.54     

In [25]:
def predict_new(test_sentences, feature_names, f_out):
    for test_sentence in test_sentences:
        X_test_dict, y_test = extract_features_sent(test_sentence, w_size, feature_names)
        # Vectorize the test sentence and one hot encoding
        X_test = vec.transform(X_test_dict)
        # Predicts the chunks and returns numbers
        y_test_predicted = classifier.predict(X_test)
        # Appends the predicted chunks as a last column and saves the rows
        rows = test_sentence.splitlines()
        rows = [rows[i] + ' ' + y_test_predicted[i] for i in range(len(rows))]
        for row in rows:
            f_out.write(row + '\n')
        f_out.write('\n')
    f_out.close()

In [26]:
# Here we tag the test set and we save it.
# This prediction is redundant with the piece of code above,
# but we need to predict one sentence at a time to have the same
# corpus structure
print("Predicting the test set...")
f_out = open('out_v2', 'w')
predict_new(test_sentences, feature_names, f_out)

Predicting the test set...


### CoNLL 2000 shared task (Kudoh and Matsumoto, 2000)
The program that won the CoNLL 2000 shared task (Kudoh and Matsumoto, 2000) used a window of five words around the chunk tag to identify, `c_i` . They built a feature vector consisting of:

- The values of the five words in this window: `w_i-2` , `w_i-1` , `w_i` , `w_i+1` , `w_i+2`
- The values of the five parts of speech in this window: `t_i-2` , `t_i-1` , `t_i` , `t_i+1` , `t_i+2`
- The values of the two previous chunk tags in the first part of the window: `c_i-2` , `c_i-1`

The two last parameters are said to be dynamic because the program computes them at run-time. Kudoh and Matsumoto trained a classifier based on support vector machines. Read Kudoh and Matsumoto's paper and the Yamcha software site.

##### 1. What is the feature vector that corresponds to the `ml_chunker.py` program? Is it the same Kudoh and Matsumoto used in their experiment?

##### 2. What is the performance of the chunker?

In [27]:
!perl conlleval.txt <out_v2

processed 47377 tokens with 23852 phrases; found: 24251 phrases; correct: 22010.
accuracy:  94.96%; precision:  90.76%; recall:  92.28%; FB1:  91.51
             ADJP: precision:  74.22%; recall:  65.07%; FB1:  69.34  384
             ADVP: precision:  78.45%; recall:  79.45%; FB1:  78.94  877
            CONJP: precision:  44.44%; recall:  44.44%; FB1:  44.44  9
             INTJ: precision: 100.00%; recall:  50.00%; FB1:  66.67  1
              LST: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
               NP: precision:  90.31%; recall:  92.34%; FB1:  91.31  12701
               PP: precision:  95.87%; recall:  97.86%; FB1:  96.85  4911
              PRT: precision:  77.23%; recall:  73.58%; FB1:  75.36  101
             SBAR: precision:  89.15%; recall:  84.49%; FB1:  86.76  507
               VP: precision:  90.84%; recall:  92.83%; FB1:  91.82  4760


##### 3. Remove the lexical features (the words) from the feature vector and measure the performance. You should observe a decrease.

In [28]:
def extract_features_sent_wordless(sentence, w_size, feature_names):
    """
    Extract the features from one sentence
    returns X and y, where X is a list of dictionaries and
    y is a list of symbols
    :param sentence: string containing the CoNLL structure of a sentence
    :param w_size:
    :return:
    """

    # We pad the sentence to extract the context window more easily
    start = "BOS BOS BOS\n"
    end = "\nEOS EOS EOS"
    start *= w_size
    end *= w_size
    sentence = start + sentence
    sentence += end

    # Each sentence is a list of rows
    sentence = sentence.splitlines()
    padded_sentence = list()
    for line in sentence:
        line = line.split()
        padded_sentence.append(line)
    # print(padded_sentence)

    # We extract the features and the classes
    # X contains is a list of features, where each feature vector is a dictionary
    # y is the list of classes
    X = list()
    y = list()
    for i in range(len(padded_sentence) - 2 * w_size):
        # x is a row of X
        x = list()
        
        # TASK 2.3 -- REMOVE THE LEXICAL FEATURES (WORDS)
        """
        # The words in lower case
        for j in range(2 * w_size + 1):
            x.append(padded_sentence[i + j][0].lower())
        """
        # The POS
        for j in range(2 * w_size + 1):
            x.append(padded_sentence[i + j][1])
        # The chunks (Up to the word)
        """
        for j in range(w_size):
            feature_line.append(padded_sentence[i + j][2])
        """
        # We represent the feature vector as a dictionary
        X.append(dict(zip(feature_names, x)))
        # The classes are stored in a list
        y.append(padded_sentence[i + w_size][2])
    return X, y

In [29]:
def extract_features_wordless(sentences, w_size, feature_names):
    """
    Builds X matrix and y vector
    X is a list of dictionaries and y is a list
    :param sentences:
    :param w_size:
    :return:
    """
    X_l = []
    y_l = []
    for sentence in sentences:
        X, y = extract_features_sent(sentence, w_size, feature_names)
        X_l.extend(X)
        y_l.extend(y)
    return X_l, y_l

In [30]:
def predict_wordless(test_sentences, feature_names, f_out):
    for test_sentence in test_sentences:
        
        # TASK 2.3 -- REMOVE THE LEXICAL FEATURES (WORDS)
        X_test_dict, y_test = extract_features_sent_wordless(test_sentence, w_size, feature_names)
        # Vectorize the test sentence and one hot encoding
        X_test = vec.transform(X_test_dict)
        # Predicts the chunks and returns numbers
        y_test_predicted = classifier.predict(X_test)
        # Appends the predicted chunks as a last column and saves the rows
        rows = test_sentence.splitlines()
        rows = [rows[i] + ' ' + y_test_predicted[i] for i in range(len(rows))]
        for row in rows:
            f_out.write(row + '\n')
        f_out.write('\n')
    f_out.close()

In [31]:
w_size = 2  # The size of the context window to the left and right of the word
feature_names = ['word_n2', 'word_n1', 'word', 'word_p1', 'word_p2',
                     'pos_n2', 'pos_n1', 'pos', 'pos_p1', 'pos_p2']

print("Extracting the features...")
X_dict, y = extract_features_wordless(train_sentences, w_size, feature_names)

print("Encoding the features...")
# Vectorize the feature matrix and carry out a one-hot encoding
vec = DictVectorizer(sparse=True)
X = vec.fit_transform(X_dict)
# The statement below will swallow a considerable memory
# X = vec.fit_transform(X_dict).toarray()

#training_start_time = time.clock()
print("Training the model...")
classifier = linear_model.LogisticRegression(penalty='l2', dual=True, solver='liblinear')
model = classifier.fit(X, y)
print(model)

# Here we tag the test set and we save it.
# This prediction is redundant with the piece of code above,
# but we need to predict one sentence at a time to have the same
# corpus structure
print("Predicting the test set...")
f_out = open('out_wordless', 'w')
predict_wordless(test_sentences, feature_names, f_out)

Extracting the features...
Encoding the features...
Training the model...
LogisticRegression(C=1.0, class_weight=None, dual=True, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)
Predicting the test set...


In [32]:
!perl conlleval.txt <out_wordless

processed 47377 tokens with 23852 phrases; found: 37022 phrases; correct: 4545.
accuracy:  41.82%; precision:  12.28%; recall:  19.06%; FB1:  14.93
             ADJP: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
             ADVP: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
            CONJP: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
             INTJ: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
              LST: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
               NP: precision:  12.28%; recall:  36.59%; FB1:  18.38  37022
               PP: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
              PRT: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
             SBAR: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
               VP: precision:   0.00%; recall:   0.00%; FB1:   0.00  0


##### 4. a) What is the classifier used in the program?

In the `ml_chunker.py` program, the logisitic regression classifier is used. The model is provided by the scikit `linear_model` package.

##### 4. b) Try two other classifiers and measure their performance: decision trees, perceptron, support vector machines, etc.. Be aware that support vector machines take a long time to train: up to one hour.

In [33]:
# Perceptron linear classifier
from sklearn.linear_model import Perceptron

print("Training the model...")
classifier = Perceptron()
model = classifier.fit(X, y)
print(model)

# Here we tag the test set and we save it.
# This prediction is redundant with the piece of code above,
# but we need to predict one sentence at a time to have the same
# corpus structure
print("Predicting the test set...")
f_out = open('out_perceptron', 'w')
predict_new(test_sentences, feature_names, f_out)

Training the model...
Perceptron(alpha=0.0001, class_weight=None, early_stopping=False, eta0=1.0,
           fit_intercept=True, max_iter=1000, n_iter_no_change=5, n_jobs=None,
           penalty=None, random_state=0, shuffle=True, tol=0.001,
           validation_fraction=0.1, verbose=0, warm_start=False)
Predicting the test set...


In [34]:
!perl conlleval.txt <out_perceptron

processed 47377 tokens with 23852 phrases; found: 24900 phrases; correct: 21464.
accuracy:  93.20%; precision:  86.20%; recall:  89.99%; FB1:  88.05
             ADJP: precision:  63.73%; recall:  57.76%; FB1:  60.60  397
             ADVP: precision:  72.29%; recall:  77.71%; FB1:  74.90  931
            CONJP: precision:   8.70%; recall:  22.22%; FB1:  12.50  23
             INTJ: precision:  12.50%; recall:  50.00%; FB1:  20.00  8
              LST: precision:   0.00%; recall:   0.00%; FB1:   0.00  7
               NP: precision:  86.90%; recall:  89.71%; FB1:  88.28  12824
               PP: precision:  94.82%; recall:  97.11%; FB1:  95.95  4927
              PRT: precision:  63.33%; recall:  71.70%; FB1:  67.26  120
             SBAR: precision:  87.00%; recall:  77.57%; FB1:  82.02  477
              UCP: precision:   0.00%; recall:   0.00%; FB1:   0.00  376
               VP: precision:  87.90%; recall:  90.77%; FB1:  89.31  4810


In [35]:
# Decision trees classifier
from sklearn import tree

print("Training the model...")
classifier = tree.DecisionTreeClassifier()
model = classifier.fit(X, y)
print(model)

# Here we tag the test set and we save it.
# This prediction is redundant with the piece of code above,
# but we need to predict one sentence at a time to have the same
# corpus structure
print("Predicting the test set...")
f_out = open('out_trees', 'w')
predict_new(test_sentences, feature_names, f_out)

Training the model...
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
Predicting the test set...


In [37]:
!perl conlleval.txt <out_trees

processed 47377 tokens with 23852 phrases; found: 24385 phrases; correct: 21929.
accuracy:  94.74%; precision:  89.93%; recall:  91.94%; FB1:  90.92
             ADJP: precision:  63.38%; recall:  67.58%; FB1:  65.41  467
             ADVP: precision:  77.62%; recall:  76.91%; FB1:  77.26  858
            CONJP: precision:  36.36%; recall:  44.44%; FB1:  40.00  11
             INTJ: precision: 100.00%; recall:  50.00%; FB1:  66.67  1
              LST: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
               NP: precision:  89.62%; recall:  92.27%; FB1:  90.93  12789
               PP: precision:  96.36%; recall:  97.26%; FB1:  96.80  4856
              PRT: precision:  68.38%; recall:  75.47%; FB1:  71.75  117
             SBAR: precision:  86.19%; recall:  82.80%; FB1:  84.46  514
               VP: precision:  90.07%; recall:  92.27%; FB1:  91.16  4772


## Improving the Chunker
Implement one of these two options, the first one being easier.

1. Complement the feature vector used in the previous section with the two dynamic features, c i-2 , c i-1 , and train a new model. You will need to modify the extract_features_sent and predict functions. In his experiments, your teacher obtained a F1 score of 92.65 with logistic regression and a lbfgs solver and automatic multiclass;

In [38]:
def extract_features_sent_dynamic(sentence, w_size, feature_names):
    """
    Extract the features from one sentence
    returns X and y, where X is a list of dictionaries and
    y is a list of symbols
    :param sentence: string containing the CoNLL structure of a sentence
    :param w_size:
    :return:
    """

    # We pad the sentence to extract the context window more easily
    start = "BOS BOS BOS\n"
    end = "\nEOS EOS EOS"
    start *= w_size
    end *= w_size
    sentence = start + sentence
    sentence += end

    # Each sentence is a list of rows
    sentence = sentence.splitlines()
    padded_sentence = list()
    for line in sentence:
        line = line.split()
        padded_sentence.append(line)
    # print(padded_sentence)

    # We extract the features and the classes
    # X contains is a list of features, where each feature vector is a dictionary
    # y is the list of classes
    X = list()
    y = list()
    for i in range(len(padded_sentence) - 2 * w_size):
        # x is a row of X
        x = list()
        # The words in lower case
        for j in range(2 * w_size + 1):
            x.append(padded_sentence[i + j][0].lower())
        # The POS
        for j in range(2 * w_size + 1):
            x.append(padded_sentence[i + j][1])
        # The chunks (Up to the word)
        
        # TASK 3 -- PREDICTION USING DYNAMIC FEATURES
        for j in range(w_size):
            #feature_line.append(padded_sentence[i + j][2])
            x.append(padded_sentence[i + j][2])
        # We represent the feature vector as a dictionary
        X.append(dict(zip(feature_names, x)))
        # The classes are stored in a list
        y.append(padded_sentence[i + w_size][2])
    return X, y

In [39]:
def extract_features_dynamic(sentences, w_size, feature_names):
    """
    Builds X matrix and y vector
    X is a list of dictionaries and y is a list
    :param sentences:
    :param w_size:
    :return:
    """
    X_l = []
    y_l = []
    for sentence in sentences:
        X, y = extract_features_sent_dynamic(sentence, w_size, feature_names)
        X_l.extend(X)
        y_l.extend(y)
    return X_l, y_l

In [40]:
def predict_dynamic(test_sentences, feature_names, f_out):
    for test_sentence in test_sentences:
        X_test_dict, y_test = extract_features_sent_dynamic(test_sentence, w_size, feature_names)
        
        # TASK 3 -- USING DYNAMIC FEATURES
        y_c1 = "BOS"        # assume c_i-2 = starting identifier
        y_c2 = "BOS"        # assume c_i-1 = starting identifier
        
        y_test_predicted = []
        for sent, x_dict in enumerate(X_test_dict):
            # using previous tag predicitions (the dynamic features)
            x_dict["c_i-1"] = y_c1
            x_dict["c_i-2"] = y_c2
        
            # Vectorize the test sentence and one hot encoding
            X_test_vectorized = vec.transform(x_dict)
            # Predicts the chunks and returns numbers
            # y_test_predicted = classifier.predict(X_test_vectorized)
            y_test_predicted.append(classifier.predict(X_test_vectorized))
            
            y_c2 = y_c1
            y_c1 = y_test_predicted[sent][0]
            
        # Appends the predicted chunks as a last column and saves the rows
        rows = test_sentence.splitlines()
        rows = [rows[j] + ' ' + str(y_test_predicted[j][0]) for j in range(len(rows))]
        for row in rows:
            f_out.write(row + '\n')
        f_out.write('\n')
    f_out.close()

In [41]:
feature_names = ['word_n2', 'word_n1', 'word', 'word_p1', 'word_p2',
                'pos_n2', 'pos_n1', 'pos', 'pos_p1', 'pos_p2']

print("Extracting the features...")
X_dict, y = extract_features_dynamic(train_sentences, w_size, feature_names)

print("Encoding the features...")
# Vectorize the feature matrix and carry out a one-hot encoding
vec = DictVectorizer(sparse=True)
X = vec.fit_transform(X_dict)
# The statement below will swallow a considerable memory
# X = vec.fit_transform(X_dict).toarray()

print("Training the model...")
classifier = linear_model.LogisticRegression(penalty='l2', dual=True, solver='liblinear')
model = classifier.fit(X, y)
print(model)

# Here we tag the test set and we save it.
# This prediction is redundant with the piece of code above,
# but we need to predict one sentence at a time to have the same
# corpus structure
print("Predicting the test set...")
f_out = open('out_dynamic', 'w')
predict_dynamic(test_sentences, feature_names, f_out)

Extracting the features...
Encoding the features...
Training the model...




LogisticRegression(C=1.0, class_weight=None, dual=True, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)
Predicting the test set...


In [42]:
!perl conlleval.txt <out_dynamic

processed 47377 tokens with 23852 phrases; found: 24251 phrases; correct: 22010.
accuracy:  94.96%; precision:  90.76%; recall:  92.28%; FB1:  91.51
             ADJP: precision:  74.22%; recall:  65.07%; FB1:  69.34  384
             ADVP: precision:  78.45%; recall:  79.45%; FB1:  78.94  877
            CONJP: precision:  44.44%; recall:  44.44%; FB1:  44.44  9
             INTJ: precision: 100.00%; recall:  50.00%; FB1:  66.67  1
              LST: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
               NP: precision:  90.31%; recall:  92.34%; FB1:  91.31  12701
               PP: precision:  95.87%; recall:  97.86%; FB1:  96.85  4911
              PRT: precision:  77.23%; recall:  73.58%; FB1:  75.36  101
             SBAR: precision:  89.15%; recall:  84.49%; FB1:  86.76  507
               VP: precision:  90.84%; recall:  92.83%; FB1:  91.82  4760
