# Challenge: An improved sentiment analysis system

## (10 pts) Now, based on the training and development set, think of a better design for developing an improved sentiment analysis system for tweets using any model you like. 

In the previous parts, we used a Hidden Markov Model (HMM) for the sentiment analysis system.

In this challenge, we use a different model: the Conditional Random Field (CRF).

A CRF can be viewed as a generalization of an HMM; equivalently, an HMM is a particular case of a CRF in which constant probabilities are used to model state transitions.

Unlike the HMM, which is a generative model, a CRF is a discriminative model. The primary advantage of CRFs over HMMs is their conditional nature, which relaxes the independence assumptions that HMMs require.
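
To make the comparison concrete, the two models can be written side by side in standard textbook notation. The symbols $x_t$, $y_t$, $f_k$, $\lambda_k$ and $Z(x)$ are our own notation and do not come from the assignment. An HMM models the joint probability of the words $x$ and the labels $y$, with $y_0$ a fixed start state:

$$p(x, y) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)$$

A linear-chain CRF models the conditional probability of the labels directly:

$$p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t) \right), \qquad Z(x) = \sum_{y'} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y'_{t-1}, y'_t, x, t) \right)$$

Because each feature function $f_k$ may inspect the whole input $x$, overlapping features such as neighboring words and POS tags can be used freely. Restricting the $f_k$ to indicators for transitions $(y_{t-1}, y_t)$ and emissions $(y_t, x_t)$, with the weights $\lambda_k$ set to the corresponding log-probabilities, recovers the HMM.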


## Import relevant libraries

NLTK is a leading platform for building Python programs to work with human language data and is great for natural language processing.

pycrfsuite is a Python binding to CRFsuite, an implementation of Conditional Random Fields (CRFs) for labeling sequential data.
https://github.com/scrapinghub/python-crfsuite

In [1]:
import nltk
import pycrfsuite

## Prepare the Dataset for Training

We have two separate functions for obtaining data, one for the training data (labelled) and one for the test data (unlabelled).

In [2]:
def get_data(filename):
    # Read "token label" lines; sentences are separated by blank lines
    with open(filename, 'r') as f:
        lines = f.readlines()
    datas = []
    
    start = 0
    for i in range(len(lines)):
        if lines[i] == '\n':
            # A blank line closes the current sentence
            datas.append(lines[start:i])
            start = i+1
        lines[i] = lines[i].replace('\n','')
        lines[i] = tuple(lines[i].split(' '))
        
    # check formatting: every entry must be a (token, label) pair
    for sentence in datas:
        for pair in sentence:
            assert len(pair) == 2
    
    return datas

def get_unlabelled_data(filename):
    # Read one token per line; sentences are separated by blank lines
    with open(filename, 'r') as f:
        lines = f.readlines()
    datas = []
    
    start = 0
    for i in range(len(lines)):
        if lines[i] == '\n':
            # A blank line closes the current sentence
            datas.append(lines[start:i])
            start = i+1
        lines[i] = lines[i].replace('\n','')
    
    return datas
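
As a hedged illustration of the file format these functions assume (one "token label" pair per line, sentences separated by a blank line), the sketch below parses an in-memory string. The helper `parse_labelled` and the sample text are hypothetical and not part of the notebook. Unlike `get_data`, this variant also keeps a final sentence that is not followed by a blank line.

```python
def parse_labelled(text):
    """Split blank-line-separated "token label" lines into sentences."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on the last space so the label is always the final field
        token, label = line.rsplit(' ', 1)
        current.append((token, label))
    if current:
        # Keep a final sentence that lacks a trailing blank line
        sentences.append(current)
    return sentences

# Toy input: the tokens and the 'B-positive' label are illustrative only
sample = "great B-positive\nfood O\n\nthanks O"
print(parse_labelled(sample))
```

Each sentence comes back as a list of (token, label) tuples, the same shape that `pos_tagging` expects.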


## Generating Part-of-Speech Tags

First, we add a standard NLP feature: the Part-of-Speech (POS) tag of each word. These tags indicate whether a word is a noun, a verb, an adjective, and so on.

NLTK's POS tagger will be used to generate the POS tags for the data.

Note that POS tagging differs slightly between labelled and unlabelled data: labelled sentences hold (word, label) pairs, whereas unlabelled sentences hold bare words.

In [3]:
def pos_tagging(datas):
    data = []

    for sentence in datas:
        
        # Obtain the list of tokens in the sentence
        tokens = [t for t, label in sentence]

        # Perform POS tagging
        tagged = nltk.pos_tag(tokens)

        # Take the word, POS tag, and its label
        data.append([(w, POS, label) for (w, label), (word, POS) in zip(sentence, tagged)])
        
    return data

def unlabelled_pos_tagging(datas):
    data = []
    
    for sentence in datas:
        
        # Obtain the list of tokens in the sentence
        tokens = list(sentence)
        
        # Perform POS tagging
        tagged = nltk.pos_tag(tokens)
        
        # Take the word and its POS tag
        data.append(tagged)
        
    return data
    
    
"""
Example of data after POS-tagging
"""
datas = get_data('EN/train')
tagged_data = pos_tagging(datas)
print(tagged_data[0])

print("\n" + "BREAK" + "\n")

datas = get_unlabelled_data('EN/dev.in')
tagged_data = unlabelled_pos_tagging(datas)
print(tagged_data[0])

[('We', 'PRP', 'O'), ('were', 'VBD', 'O'), ('then', 'RB', 'O'), ('charged', 'VBN', 'O'), ('for', 'IN', 'O'), ('their', 'PRP$', 'O'), ('most', 'RBS', 'O'), ('expensive', 'JJ', 'O'), ('sake', 'NN', 'O'), ('(', '(', 'O'), ('$', '$', 'O'), ('20', 'CD', 'O'), ('+', 'NNP', 'O'), ('per', 'IN', 'O'), ('serving', 'VBG', 'O'), (')', ')', 'O'), ('when', 'WRB', 'O'), ('we', 'PRP', 'O'), ('in', 'IN', 'O'), ('fact', 'NN', 'O'), ('drank', 'IN', 'O'), ('a', 'DT', 'O'), ('sake', 'NN', 'O'), ('of', 'IN', 'O'), ('less', 'JJR', 'O'), ('than', 'IN', 'O'), ('half', 'PDT', 'O'), ('that', 'DT', 'O'), ('price', 'NN', 'O'), ('.', '.', 'O')]

BREAK

[('When', 'WRB'), ('I', 'PRP'), ('called', 'VBD'), ('this', 'DT'), ('morning', 'NN'), (',', ','), ('I', 'PRP'), ("didn't", 'VBP'), ('think', 'VB'), ('I', 'PRP'), ('would', 'MD'), ('be', 'VB'), ('able', 'JJ'), ('to', 'TO'), ('get', 'VB'), ('in', 'IN'), ('at', 'IN'), ('12', 'CD'), (',', ','), ('but', 'CC'), ('I', 'PRP'), ('was', 'VBD'), ('able', 'JJ'), ('to', 'TO'), ('

## Generating Features

The POS tag is one feature of each token. However, we need more features to improve accuracy.

The following features are commonly used for a word w in named entity recognition:

- The word w itself (converted to lowercase for normalisation)
- The prefix/suffix of w (e.g. -ion)
- The words surrounding w, such as the previous and the next word
- Whether w is in uppercase or lowercase
- Whether w is a number, or contains digits
- The POS tag of w, and those of the surrounding words
- Whether w is or contains a special character (e.g. hyphen, dollar sign)

The word2features function below implements a subset of these: the lowercased word, its two- and three-character suffixes, case and digit checks, and the POS tag, together with the same word-level checks and POS tags for the previous and next words.
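
The listed features that word2features does not compute (prefix, contains digits, contains a special character) could be sketched as follows. The helper `extra_features` and its feature names are hypothetical, and treating `string.punctuation` as the set of special characters is an assumption rather than the notebook's definition.

```python
import string

def extra_features(word):
    """Return illustrative string features for one token (hypothetical helper)."""
    return [
        # Three-character prefix, mirroring the existing suffix features
        'word[:3]=' + word[:3],
        # True if any character is a digit, e.g. '$20' or 'B2B'
        'word.hasdigit=%s' % any(ch.isdigit() for ch in word),
        # True if any character is ASCII punctuation (assumed meaning of "special")
        'word.hasspecial=%s' % any(ch in string.punctuation for ch in word),
    ]

for w in ['$20', 'sake', 'co-operation']:
    print(w, extra_features(w))
```

These strings use the same 'name=value' style as the existing features, so they could be appended to the feature list for a token if they prove useful on the development set.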


In [4]:
def word2features(sentence, i):
    word = sentence[i][0]
    POStag = sentence[i][1]

    # Common features for all words
    features = [
        'bias',
        'word.lower=' + word.lower(),
        'word[-3:]=' + word[-3:],
        'word[-2:]=' + word[-2:],
        'word.isupper=%s' % word.isupper(),
        'word.istitle=%s' % word.istitle(),
        'word.isdigit=%s' % word.isdigit(),
        'POStag=' + POStag
    ]

    # Features for words that are not
    # at the beginning of a sentence
    if i > 0:
        word1 = sentence[i-1][0]
        POStag1 = sentence[i-1][1]
        features.extend([
            '-1:word.lower=' + word1.lower(),
            '-1:word.istitle=%s' % word1.istitle(),
            '-1:word.isupper=%s' % word1.isupper(),
            '-1:word.isdigit=%s' % word1.isdigit(),
            '-1:POStag=' + POStag1
        ])
    else:
        # Indicate that it is the 'beginning of a sentence'
        features.append('BOS')

    # Features for words that are not
    # at the end of a sentence
    if i < len(sentence)-1:
        word1 = sentence[i+1][0]
        POStag1 = sentence[i+1][1]
        features.extend([
            '+1:word.lower=' + word1.lower(),
            '+1:word.istitle=%s' % word1.istitle(),
            '+1:word.isupper=%s' % word1.isupper(),
            '+1:word.isdigit=%s' % word1.isdigit(),
            '+1:POStag=' + POStag1
        ])
    else:
        # Indicate that it is the 'end of a sentence'
        features.append('EOS')

    return features


## Train the model

To train the model, we need to first prepare the training data and the corresponding labels. 

We split the data into two parts, features and labels, by extracting each from the tagged sentences.

The process_data function is a convenience wrapper that returns x_train and y_train.

In [5]:
# A function for extracting features in sentences
def extract_features(sentence):
    return [word2features(sentence, i) for i in range(len(sentence))]

# A function for generating the list of labels for each sentence
def get_labels(sentence):
    return [label for (token, POStag, label) in sentence]

# A function for getting x_train and y_train
def process_data(filename, labelled=True):
    if labelled:
        datas = get_data(filename)
        tagged_data = pos_tagging(datas)
        x = [extract_features(sentence) for sentence in tagged_data]
        y = [get_labels(sentence) for sentence in tagged_data]
    else:
        datas = get_unlabelled_data(filename)
        tagged_data = unlabelled_pos_tagging(datas)
        x = [extract_features(sentence) for sentence in tagged_data]
        # Unlabelled data has no labels to return
        y = None
        
    return x, y


Now, we train the model using pycrfsuite.Trainer.

In [6]:
def train_model(filename, model_out):
    x_train, y_train = process_data(filename)
    
    # Set (verbose=True) to see the steps in training the model
    trainer = pycrfsuite.Trainer(verbose=False)

    # Submit training data to the trainer
    for xseq, yseq in zip(x_train, y_train):
        trainer.append(xseq, yseq)

    # Set the parameters of the model
    trainer.set_params({
        # coefficient for L1 penalty
        'c1': 0.1,

        # coefficient for L2 penalty
        'c2': 0.01,  

        # maximum number of iterations
        'max_iterations': 200,

        # whether to include transitions that
        # are possible, but not observed
        'feature.possible_transitions': True
    })

    # model will be trained and output to the file as specified in the argument
    trainer.train(model_out)


## Apply CRF on Test Data

We apply the trained model to the test data to predict a label for each word.

In [7]:
def decode_file(fin, fout, model):
    # Labels are not available for unlabelled data, so the second value is ignored
    x_test, _ = process_data(fin, labelled=False)
    unlabelled_data = get_unlabelled_data(fin)
    
    tagger = pycrfsuite.Tagger()
    tagger.open(model)
    y_pred = [tagger.tag(xseq) for xseq in x_test]
    
    # Write one "token label" pair per line, with a blank line after each sentence
    with open(fout, 'w') as f_out:
        for tokens, labels in zip(unlabelled_data, y_pred):
            for x, y in zip(tokens, labels):
                f_out.write('{} {}\n'.format(x, y))
            f_out.write('\n')
    print("Conditional Random Fields complete")


## Training and Decoding on EN data results

In [8]:
print('training...')
train_model('EN/train', 'crf_EN.model')

print('decoding ...')
decode_file('EN/dev.in','EN/dev.p5.out', 'crf_EN.model')

training...
decoding ...
Conditional Random Fields complete


    
    >python3 evalResult.py EN/dev.out EN/dev.p5.out

    #Entity in gold data: 226
    #Entity in prediction: 179

    #Correct Entity : 128
    Entity  precision: 0.7151
    Entity  recall: 0.5664
    Entity  F: 0.6321

    #Correct Sentiment : 87
    Sentiment  precision: 0.4860
    Sentiment  recall: 0.3850
    Sentiment  F: 0.4296

## Training and Decoding on FR data results

In [9]:
print('training...')
train_model('FR/train', 'crf_FR.model')

print('decoding ...')
decode_file('FR/dev.in','FR/dev.p5.out', 'crf_FR.model')

training...
decoding ...
Conditional Random Fields complete


   
    >python3 evalResult.py FR/dev.out FR/dev.p5.out

    #Entity in gold data: 223
    #Entity in prediction: 181

    #Correct Entity : 144
    Entity  precision: 0.7956
    Entity  recall: 0.6457
    Entity  F: 0.7129

    #Correct Sentiment : 106
    Sentiment  precision: 0.5856
    Sentiment  recall: 0.4753
    Sentiment  F: 0.5248
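
The scores reported for EN and FR can be reproduced from the raw counts with the standard definitions of precision, recall and F score. The sketch below assumes evalResult.py uses these definitions (its source is not shown here); the helper `prf` is hypothetical, and the counts are copied from the outputs above.

```python
def prf(correct, predicted, gold):
    """Precision, recall and F score from raw counts."""
    precision = correct / predicted   # correct / #entities in prediction
    recall = correct / gold           # correct / #entities in gold data
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# (correct, predicted, gold) copied from the evalResult.py outputs
counts = {
    'EN entity':    (128, 179, 226),
    'EN sentiment': (87, 179, 226),
    'FR entity':    (144, 181, 223),
    'FR sentiment': (106, 181, 223),
}

for name, (c, p, g) in counts.items():
    print('%s: P=%.4f R=%.4f F=%.4f' % ((name,) + prf(c, p, g)))
```

In both languages precision is higher than recall because the model predicts fewer entities than the gold data contains (179 vs 226 for EN, 181 vs 223 for FR).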

## (10 pts) We will evaluate your system’s performance on two held out test sets EN/test.in and FR/test.in. The test sets will only be released on 4 Dec 2017 at 5pm (48 hours before the deadline). Use your new system to generate the outputs. Write your outputs to EN/test.p5.out and FR/test.p5.out.

## Training and Decoding on test(EN) data

In [10]:
print('training...')
train_model('EN/train', 'crf_EN.model')

print('decoding ...')
decode_file('test/EN/test.in','test/EN/test.p5.out', 'crf_EN.model')

training...
decoding ...
Conditional Random Fields complete


## Training and Decoding on test(FR) data 

In [11]:
print('training...')
train_model('FR/train', 'crf_FR.model')

print('decoding ...')
decode_file('test/FR/test.in','test/FR/test.p5.out', 'crf_FR.model')

training...
decoding ...
Conditional Random Fields complete
