# Homework 3: Twitter POS tagging

Student Name: Jiyu Chen

Student ID: 908066

Python version used: 3.6.3

## General info

<b>Due date</b>: 11pm, Sunday April 15

<b>Submission method</b>: see LMS

<b>Submission materials</b>: completed copy of this iPython notebook

<b>Late submissions</b>: -20% per day

<b>Marks</b>: 5% of mark for class

<b>Overview</b>: In this homework, you'll be adapting a POS tagger to Twitter data, starting from a tagger trained on Penn Treebank. You will also use prior information on the Twitter tagset to obtain better performance. Finally, you will also analyse your results in a more fine-grained way. For extra credits, you will implement the Expectation-Maximisation algorithm.

<b>Materials</b>: See the main class LMS page for information on the basic setup required for this class, including an iPython notebook viewer and the python packages NLTK, Numpy, Scipy, Matplotlib, Scikit-Learn, and Gensim. In particular, if you are not using a lab computer which already has it installed, we recommend installing all the data for NLTK, since you will need various parts of it to complete this assignment. You can also use any Python built-in packages, but do not use any other 3rd party packages; if your iPython notebook doesn't run on the marker's machine, you will lose marks.  

<b>Evaluation</b>: Your iPython notebook should run end-to-end without any errors in a reasonable amount of time, and you must follow all instructions provided below, including specific implementation requirements and instructions for what needs to be printed (please avoid printing output we don't ask for). You should leave the output from running your code in the iPython notebook you submit, to assist with marking. The amount each section is worth is given in parentheses after the instructions. You will be marked not only on the correctness of your methods, but also on the quality and efficiency of your code: in particular, you should be careful to use Python built-in functions and operators when appropriate and pick descriptive variable names that adhere to <a href="https://www.python.org/dev/peps/pep-0008/">Python style requirements</a>. If you think it might be unclear what you are doing, you should comment your code to help the marker make sense of it.

<b>Extra credit</b>: Each homework has a task which is optional with respect to getting full marks on the assignment, but that can be used to offset any points lost on this or any other homework assignment (but not the final project or the exam). We recommend you skip over this step on your first pass, and come back if you have time: the amount of effort required to receive full marks (1 point) on an extra credit question will be substantially more than earning the same amount of credit on other parts of the homework.

<b>Updates</b>: Any major changes to the assignment will be announced via LMS. Minor changes and clarifications will be announced in the forum on LMS; we recommend you check the forum regularly.

<b>Academic Misconduct</b>: For most people, collaboration will form a natural part of the undertaking of this homework, and we encourage you to discuss it in general terms with other students. However, this ultimately is still an individual task, and so reuse of code or other instances of clear influence will be considered cheating. We will be checking submissions for originality and will invoke the University’s <a href="http://academichonesty.unimelb.edu.au/policy.html">Academic Misconduct policy</a> where inappropriate levels of collusion or plagiarism are deemed to have taken place.


### Part 1: Preprocessing (2.0)

<b>Instructions</b>: your first task is to preprocess the data. We will use two datasets for training: 1) the Penn Treebank sample used in the workshops and 2) the Twitter samples data you used in Homework 1. In order to adapt the tagger to the Twitter data we need to build a *joint* vocabulary containing all the word types in the PTB and twitter_samples corpora. So, in addition to preprocessing, your code should also build this vocabulary. Finally, you should also store the tagset, for reasons that will become clearer later.

<b>Important</b>: you are allowed to reuse all the code from the workshop notebooks. In fact you are encouraged to do so as much of this homework will be based on these notebooks.

The vocabulary and the tagset should be stored in Python dictionaries, mapping each word (or tag) to an index (integer). This is similar to what is done in the W6/W7 workshop notebooks. The preprocessed corpora should contain indices only, as in the workshop.

Let's start with the PTB data. You should iterate over all sentences and words, and build the vocabulary and the tagset. Important: make sure you <b>lowercase</b> words before they are added to the dictionary. You should also generate the preprocessed corpus. It should be a list where each element is a tagged sentence, represented as another list of (word, tag) indices (which should correspond to the original words/tags). Print the first preprocessed sentence, the index for the word 'electricity' and the length of the full tagset. (0.5)


In [1]:
from nltk.corpus import treebank
from collections import defaultdict
import re

#generate preprocessed corpus

# reference from WSTA_N7_unsupervised_HMMs
def preprocessing():
    """create word and tag list and transform corpus
    into a corpus of (word,tag) pair list
    return: transformed corpus, word index list, tag index list
    """
    corpus = treebank.tagged_sents()
    word_numbers = {}
    tag_numbers = {}
    num_corpus = []
    for sent in corpus:
        num_sent = []
        for word, tag in sent:
            wi = word_numbers.setdefault(word.lower(), len(word_numbers))
            ti = tag_numbers.setdefault(tag, len(tag_numbers))
            num_sent.append((wi, ti))
        num_corpus.append(num_sent)
    return num_corpus,word_numbers,tag_numbers

#end of reference
        
        
corpus,volcabulary,tagset = preprocessing()
print(corpus[0])
print(volcabulary.get('electricity'))
print(len(tagset))

[(0, 0), (1, 0), (2, 1), (3, 2), (4, 3), (5, 4), (2, 1), (6, 5), (7, 6), (8, 7), (9, 8), (10, 9), (11, 7), (12, 4), (13, 8), (14, 0), (15, 2), (16, 10)]
1095
46


<b>Instructions</b>: now you should do the same with the twitter_samples dataset. From now on, we will refer to this dataset as the **training** tweets. Since this data is not tagged, the preprocessed corpus should be a list where each element is another list containing indices only (instead of (word, tag) tuples). A tokenised version of twitter_samples is available through the method .tokenized(); use this method to read your corpus. Besides generating the corpus, you should also **update** the vocabulary with the new words from this corpus.

There are two things to keep in mind when doing this process:

1) We will perform a bit more preprocessing on this dataset, besides lowercasing. Specifically, you should replace special tokens with special symbols, as follows:
- Username mentions are tokens that start with '@': replace these tokens with 'USER_TOKEN'
- Hashtags are tokens that start with '#': replace these with 'HASHTAG_TOKEN'
- Retweets are represented as the token 'RT' (or 'rt' if you lowercase first): replace these with 'RETWEET_TOKEN'
- URLs are tokens that start with 'https://' or 'http://': replace these with 'URL_TOKEN'

2) **Do not create a new vocabulary**. Instead, you should update the vocabulary built from PTB with any new words present in this corpus. These should *include* the special tokens defined above but *not* the original un-preprocessed tokens.

The easiest way to do these steps is by doing 3 passes over the data: preprocess the words first, update the vocabulary and finally convert the corpus into the list format described above. However, it is possible to do all of this in one pass only.

Print the first sentence from your preprocessed corpora, the index for the word 'electricity' and the index for 'HASHTAG_TOKEN'. (0.5)

In [2]:
from nltk.corpus import twitter_samples
import re


def twitter_process(words):
    """transform twitter corpus into a list of word index
    lists and add any new words to the word index dict
    return: transformed twitter, updated word index dict
    """
    corpus = twitter_samples.tokenized()
    word_numbers = words
    num_corpus = []
    
    #special replacement using regex
    user_token = re.compile('^@.*')
    hashtag_token = re.compile('^#.*')
    retweet_token = re.compile('^rt$')
    url_token = re.compile('^http[s]?://.*')
    
    for sent in corpus:
        num_sent = []
        for word in sent:
            word = word.lower()
            if user_token.match(word):
                word = user_token.sub("USER_TOKEN",word)
            if hashtag_token.match(word):
                word = hashtag_token.sub("HASHTAG_TOKEN",word)
            if retweet_token.match(word):
                word = retweet_token.sub("RETWEET_TOKEN",word)
            if url_token.match(word):
                word = url_token.sub("URL_TOKEN",word) 
            wi = word_numbers.setdefault(word, len(word_numbers))
            num_sent.append(wi)
        num_corpus.append(num_sent)
        
    return num_corpus,word_numbers

   
twitter_corpus,volcabulary= twitter_process(volcabulary)
print(volcabulary.get('electricity'))
print(volcabulary.get('HASHTAG_TOKEN'))
print(twitter_corpus[0])


1095
11409
[11387, 182, 11388, 11389]
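The replacement rules for the training tweets can also be written without regular expressions. The following is a minimal sketch, assuming plain prefix checks are enough for the four special cases; `normalise_token`, `index_sentence`, `SPECIAL_PREFIXES` and the toy data are hypothetical names introduced for illustration and are not part of the notebook above.

```python
# Hypothetical helpers for illustration; not part of the original notebook.
SPECIAL_PREFIXES = (
    ("@", "USER_TOKEN"),
    ("#", "HASHTAG_TOKEN"),
    ("http://", "URL_TOKEN"),
    ("https://", "URL_TOKEN"),
)


def normalise_token(token):
    """Lowercase a token and map special Twitter tokens to placeholders."""
    token = token.lower()
    if token == "rt":
        return "RETWEET_TOKEN"
    for prefix, placeholder in SPECIAL_PREFIXES:
        if token.startswith(prefix):
            return placeholder
    return token


def index_sentence(tokens, vocabulary):
    """Normalise tokens and index them, growing the vocabulary in one pass."""
    return [vocabulary.setdefault(normalise_token(t), len(vocabulary))
            for t in tokens]


# Toy data (assumed): a vocabulary that already holds two PTB words.
toy_vocabulary = {"the": 0, "cat": 1}
toy_tweet = ["RT", "@Alice", "the", "#Cats", "https://example.com", "Cat"]
toy_indices = index_sentence(toy_tweet, toy_vocabulary)
print(toy_indices)
```

Because the placeholder is looked up in the vocabulary instead of the raw token, the original un-preprocessed tokens (such as `@alice`) never enter the dictionary.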


<b>Instructions:</b> now we will preprocess the tagged twitter corpus used in W7 (Ritter et al.). This dataset will be referred to from now on as the **test** tweets. Before you do that though, you should update the tagset.

You might have noticed this in the workshop, but this dataset has a few extra tags besides the PTB ones. These were added to cover specific phenomena that occur on Twitter:
- "USR": username mentions
- "HT": hashtags
- "RT": retweets
- "URL": URL addresses

Notice that these special tags correspond to the special tokens we preprocessed before. These steps will be important in Part 3 later.

There are a few additional tags which are not specific to Twitter but are not present in the PTB sample:
- "VPP"
- "TD"
- "O"

You should add these new seven tags to the tagset you built when reading the PTB corpus.

Another task is to add an extra type to the vocabulary: `<unk>`. This is in order to account for unknown or out-of-vocabulary words.

Finally, build two "inverted indices" for the vocabulary and the tagset. These should be lists, where the "i"-th element should contain the word (or tag) corresponding to the index "i" in the vocabulary (or tagset).

After doing these tasks, print the index for `<unk>` and the length of your resulting tagset. (0.5)

In [3]:
def pendingTagset(tagset):
    """append 7 special tags to tagset
    return: updated tagset index dict
    """
    tagset.setdefault("USR", len(tagset))
    tagset.setdefault("HT", len(tagset))
    tagset.setdefault("RT", len(tagset))
    tagset.setdefault("URL", len(tagset))
    tagset.setdefault("VPP", len(tagset))
    tagset.setdefault("TD", len(tagset))
    tagset.setdefault("O", len(tagset))
    return tagset

#print(tagset)
tagset = pendingTagset(tagset)

#add special volcabulary into word index list
volcabulary.setdefault('<unk>',len(volcabulary))

def invert(data):
    """invert an index dict
    invert {item: index}
    to [item] ordered by its index
    return: inverted list
    """
    inverted_list = [None] * len(data)
    # place each item at the position given by its index,
    # so the result does not depend on dict iteration order
    for item, index in data.items():
        inverted_list[index] = item

    return inverted_list
        
inv_volcabulary = invert(volcabulary)
inv_tagset = invert(tagset)
print(len(inv_tagset))
print(volcabulary.get('<unk>'))

    

53
26069


<b>Instructions</b>: now we can read the test tweets. Store them in the same format as the PTB corpora (list of lists containing (word, tag) index tuples). Do the same preprocessing steps that you did for the training tweets (lowercasing + replace special tokens). However, **do not** update the vocabulary. Why? Because the test set should simulate a real-world scenario, where out-of-vocabulary words can appear. Instead, after preprocessing each word, you should check if that word is in the vocabulary. If yes, just replace it with its index, otherwise you should replace it with the index for the `<unk>` token. Remember: you can reuse the code from the workshop for this task. Just be mindful that in the workshop we stored words and tags in two separate lists: here you should have a single list, as in the PTB corpus you preprocessed above.

When reading the POS tags for the test tweets you should do some additional preprocessing. There are three tags in this dataset which correspond to PTB tags but are represented with different names:
- "(". In PTB, this is represented as "-LRB-"
- ")". In PTB, this is represented as "-RRB-"
- "NONE". In PTB, this is represented as "-NONE-"

As you build the corpus for the test tweets, you should check if the tag for a word is one of the above. If yes, you should use the PTB equivalent instead. In practice, it is sufficient to ensure you use the correct index for the corresponding tag, using your tagset dictionary. This concept is sometimes referred to as *tag harmonisation*, where two different tagsets are mapped to each other.

After this, print the first sentence of your preprocessed corpus. (0.5)

In [4]:
try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:  # Python 2
    from urllib import urlretrieve

urlretrieve("https://github.com/aritter/twitter_nlp/raw/master/data/annotated/pos.txt","pos.txt")
    
    
def test_preprocess():
    """transform the test corpus into a
    (word,tag) index list, using the vocabulary
    and tagset built from the training data
    return: transformed test corpus
    """
    user_token = re.compile('^@.*')
    hashtag_token = re.compile('^#.*')
    retweet_token = re.compile('^rt$')
    url_token = re.compile('^http[s]?://.*')
    lrb = re.compile(r'^\($')
    rrb = re.compile(r'^\)$')
    none = re.compile('^NONE$')
    corpus = []
    sent = []
    with open('pos.txt') as f:
        for line in f:
            if line.strip() == '':
                corpus.append(sent)
                sent = []
            else:
                word, pos = line.strip().split()
                word = word.lower()
                if user_token.match(word):
                    word = user_token.sub("USER_TOKEN",word)
                if hashtag_token.match(word):
                    word = hashtag_token.sub("HASHTAG_TOKEN",word)
                if retweet_token.match(word):
                    word = retweet_token.sub("RETWEET_TOKEN",word)
                if url_token.match(word):
                    word = url_token.sub("URL_TOKEN",word)
                if lrb.match(pos):
                    pos = lrb.sub("-LRB-",pos)
                if rrb.match(pos):
                    pos = rrb.sub("-RRB-",pos)
                if none.match(pos):
                    pos = none.sub("-NONE-",pos)
                if word not in volcabulary:
                    word = '<unk>'
                sent.append((volcabulary.get(word),tagset.get(pos)))
    # keep the last sentence if the file does not end with a blank line
    if sent:
        corpus.append(sent)
    return corpus
    
test_corpus = test_preprocess()

#print out the first indexed sentence of transformed corpus
print(test_corpus[0])


#print out the first context sentence of transformed corpus
text = []
for (w,t) in test_corpus[0]:
    text.append((inv_volcabulary[w],inv_tagset[t]))
print(text)
        

    
    

[(11392, 46), (61, 19), (114, 11), (8, 7), (3224, 8), (170, 9), (325, 33), (1325, 19), (2375, 22), (3205, 12), (182, 9), (799, 2), (1522, 3), (16, 10), (8490, 0), (1146, 0), (2495, 0), (14039, 43), (26069, 0), (16, 10), (4263, 17), (1760, 4), (9464, 8), (2259, 17), (888, 4), (741, 8), (16, 10)]
[('USER_TOKEN', 'USR'), ('it', 'PRP'), ("'s", 'VBZ'), ('the', 'DT'), ('view', 'NN'), ('from', 'IN'), ('where', 'WRB'), ('i', 'PRP'), ("'m", 'VBP'), ('living', 'VBG'), ('for', 'IN'), ('two', 'CD'), ('weeks', 'NNS'), ('.', '.'), ('empire', 'NNP'), ('state', 'NNP'), ('building', 'NNP'), ('=', 'SYM'), ('<unk>', 'NNP'), ('.', '.'), ('pretty', 'RB'), ('bad', 'JJ'), ('storm', 'NN'), ('here', 'RB'), ('last', 'JJ'), ('evening', 'NN'), ('.', '.')]
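Tag harmonisation and the out-of-vocabulary fallback can both be expressed as dictionary lookups. The sketch below assumes the three tag renamings listed in the instructions; `TAG_HARMONISATION`, `encode_pair` and the toy dictionaries are hypothetical names used only for illustration.

```python
# Hypothetical mapping from the Twitter tag names to their PTB equivalents.
TAG_HARMONISATION = {"(": "-LRB-", ")": "-RRB-", "NONE": "-NONE-"}


def encode_pair(word, tag, vocabulary, tagset, unk="<unk>"):
    """Return (word index, tag index), using <unk> for unseen words."""
    harmonised_tag = TAG_HARMONISATION.get(tag, tag)
    word_index = vocabulary.get(word, vocabulary[unk])
    return word_index, tagset[harmonised_tag]


# Toy dictionaries (assumed) standing in for the real vocabulary and tagset.
toy_vocabulary = {"hello": 0, "(": 1, "<unk>": 2}
toy_tagset = {"UH": 0, "-LRB-": 1, "-NONE-": 2}
toy_pairs = [("hello", "UH"), ("(", "("), ("zzz", "NONE")]
encoded = [encode_pair(w, t, toy_vocabulary, toy_tagset) for w, t in toy_pairs]
print(encoded)
```

The vocabulary is only read, never updated, so unseen test words such as `zzz` stay out of it and are mapped to the `<unk>` index.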


<b>Hint</b>: if you did these steps correctly you should have 53 tags in your tagset and around 26000 words in your vocabulary.

### Part 2: Running the PTB tagger on the test tweets (1.5)

<b>Instructions</b>: your next task is to train a POS tagger on the PTB data and try it on the test tweets. This is exactly what we did in W7: feel free to reuse code. However, we are also going to modify the code a bit.

Your first task is to encapsulate the HMM training code into a function. You should name your function `count`. This function should take these input parameters:
- A tagged corpus, in the format described above (list of lists containing (word, tag) index tuples).
- The vocabulary (a dict).
- The tagset (a dict).

Output return values should contain:
- The initial tag probabilities (a vector).
- The transition probabilities (a matrix).
- The emission probabilities (a matrix).

Notice that in the workshop code the vocabulary and tagset were built as part of the training process. Here you should pass them explicitly as parameters instead. This is to ensure our tagger can take into account the words in the training tweets and the extra tags. Important: the workshop code initialises the probabilities with an `eps` value, to ensure you end up with non-zero probabilities for unseen events. You should do the same here.

After writing your function, run it on the PTB corpus to obtain the initial, transition and emission probabilities. (0.5)

In [5]:
import numpy as np

def count(tagset,volcabulary,corpus):
    """construct HMM and create
    initial states, transition matrix, emission matrix
    para: tagset: tagset index list
    para: volcabulary: word index list
    para: corpus: index transformed corpus
    return: initial states, transition matrix, emission matrix
    """
    S = len(tagset)
    V = len(volcabulary)
    num_corpus = corpus
    
    #reference from WSTA_N7_unsupervised_HMMs 
    # initalise
    eps = 0.1
    #initial state
    pi = eps * np.ones(S)
    #transition
    A = eps * np.ones((S, S))
    #emission
    O = eps * np.ones((S, V))

    # count
    for sent in num_corpus:
        last_tag = None
        for word, tag in sent:
            #count emissions 
            O[tag, word] += 1
            #using the first tag as initial state
            if last_tag is None:
                pi[tag] += 1
            else:
                #count transitions
                A[last_tag, tag] += 1
            #shift to next tag in HMM
            last_tag = tag
        
    # normalise
    pi /= np.sum(pi)
    for s in range(S):
        O[s,:] /= np.sum(O[s,:])
        A[s,:] /= np.sum(A[s,:]) 
    return pi, A, O
#end of reference

pi, A, O = count(tagset,volcabulary,corpus)
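
The effect of the `eps` initialisation is easiest to see on a tiny count matrix. This is a minimal sketch with made-up counts (two tags, three words); the array names are hypothetical and the normalisation is written with NumPy broadcasting instead of the per-row loop used above.

```python
import numpy as np

eps = 0.1
# Toy emission counts (assumed): rows are tags, columns are words.
# The second tag never occurs in the training data.
toy_counts = np.array([[3.0, 0.0, 1.0],
                       [0.0, 0.0, 0.0]])

# Add eps to every cell, then normalise each row so it sums to one.
smoothed = toy_counts + eps
toy_probs = smoothed / smoothed.sum(axis=1, keepdims=True)
print(toy_probs)
```

An unseen (tag, word) pair keeps a small non-zero probability, and a tag with no counts at all, like the seven tags added in Part 1, ends up with a uniform row of 1/V over the V vocabulary entries.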


<b>Instructions</b>: now you should write a function for Viterbi. The input parameters are the same as in the workshop:
- The parameters (probabilities) of your HMM (a tuple (initial, transition, emission)).
- The input words (a list with numbers).

The output is slightly different though:
- A list of (word, tag) indices, containing the original input word and the predicted tag.

Run Viterbi on the test tweets and store the predictions in a list (this might take a few seconds). Remember that in the preprocessing part you stored the test tweets as lists of (word, tag) indices: make sure the inputs to Viterbi are word index lists only. Print the first sentence of your predicted list. (0.5)

In [6]:
#reference from WSTA_N7_unsupervised_HMMs 
def viterbi(params, observations):
    """HMM prediction
    para: params: HMM parameters
    para: observations: word indexed list that waited to be tagged
    return: (word,tag) indexed list
    """
    prediction = []
    pi, A, O = params
    M = len(observations)
    S = pi.shape[0]
    
    alpha = np.zeros((M, S))
    alpha[:,:] = float('-inf')
    backpointers = np.zeros((M, S), 'int')
    
    # base case
    alpha[0, :] = pi * O[:,observations[0]]
    
    # recursive case
    for t in range(1, M):
        for s2 in range(S):
            for s1 in range(S):
                score = alpha[t-1, s1] * A[s1, s2] * O[s2, observations[t]]
                if score > alpha[t, s2]:
                    alpha[t, s2] = score
                    backpointers[t, s2] = s1
    
    # now follow backpointers to resolve the state sequence
    ss = []
    ss.append(np.argmax(alpha[M-1,:]))
    for i in range(M-1, 0, -1):
        ss.append(backpointers[i, ss[-1]])
    predict =  list(reversed(ss))
    for i in range(len(predict)):
        prediction.append((observations[i],predict[i]))
    return prediction

# end of reference
    
def tagging(corpus,pi,transition,emission):
    """HMM predict by sentences
    """
    prediction = []
    for sent in corpus:
        sents = []
        for (w,t) in sent:
            sents.append(w)
        pred = viterbi((pi,transition,emission),sents)
        prediction.append(pred)
    return prediction

prediction = tagging(test_corpus,pi,A,O)
print(prediction[0])


[(11392, 27), (61, 19), (114, 11), (8, 7), (3224, 8), (170, 9), (325, 33), (1325, 19), (2375, 22), (3205, 12), (182, 9), (799, 2), (1522, 3), (16, 10), (8490, 29), (1146, 8), (2495, 8), (14039, 10), (26069, 38), (16, 10), (4263, 29), (1760, 4), (9464, 8), (2259, 17), (888, 4), (741, 8), (16, 10)]
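Multiplying many probabilities, as the Viterbi cell above does, can underflow on long sentences, and the three nested loops are slow in pure Python. The following is a sketch of one possible alternative, not the workshop's implementation: it works with log-probabilities and lets NumPy handle the loops over states. The name `viterbi_log` and the toy parameters are assumptions made for illustration.

```python
import numpy as np


def viterbi_log(params, observations):
    """Viterbi decoding with log-probabilities.

    Returns a list of (word index, predicted tag index) pairs.
    """
    pi, A, O = params
    # log(0) = -inf is a valid score here, so silence the divide warning
    with np.errstate(divide="ignore"):
        log_pi, log_A, log_O = np.log(pi), np.log(A), np.log(O)
    M, S = len(observations), pi.shape[0]
    alpha = np.full((M, S), -np.inf)
    backpointers = np.zeros((M, S), dtype=int)

    # base case
    alpha[0] = log_pi + log_O[:, observations[0]]

    # recursive case: scores[s1, s2] is the best score of reaching s2 via s1
    for t in range(1, M):
        scores = alpha[t - 1][:, None] + log_A
        backpointers[t] = scores.argmax(axis=0)
        alpha[t] = scores.max(axis=0) + log_O[:, observations[t]]

    # follow the backpointers from the best final state
    tags = [int(alpha[M - 1].argmax())]
    for t in range(M - 1, 0, -1):
        tags.append(int(backpointers[t, tags[-1]]))
    return list(zip(observations, reversed(tags)))


# Toy HMM (assumed): two tags, two words.
toy_params = (
    np.array([0.6, 0.4]),
    np.array([[0.7, 0.3], [0.4, 0.6]]),
    np.array([[0.9, 0.1], [0.2, 0.8]]),
)
toy_path = viterbi_log(toy_params, [0, 1, 1])
print(toy_path)
```

The function takes the same `(pi, A, O)` tuple and word index list as `viterbi` and returns the same (word, tag) format, so it could be swapped into `tagging` without other changes.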


<b>Instructions</b>: you should now evaluate the results. Write a function that takes (word, tag) lists as inputs and outputs the tag sequence using the original tags in the tagset. Your inputs should be a sentence and the tag inverted index you built before.

Run this function on the predictions you obtained above **and** the test tweets, storing them in two separate lists. Finally, flatten your predictions into a single list, do the same for the test tweets, and report accuracy. (0.5)

In [7]:


def transform_prediction(corpus):
    """extract the tags indexes from a (word,tag) corpus
    para: (word,tag) index format corpus
    return: list of tag index
    """
    tag_sequence = []
    for sent in corpus:
        for (w,t) in sent:
            tag_sequence.append(t)
    return tag_sequence

gold_sequence = transform_prediction(test_corpus)
pred_sequence = transform_prediction(prediction)

def evaluation(seq1,seq2):
    """calculate accuracy of seq2 by refering to seq1
    para: seq1: list of gold standard tag index
    para: seq2: list of predicted tag index
    return: accuracy
    """
    error = 0
    total = 0
    for i in range(len(seq1)):
        if seq1[i] != seq2[i]:
            error +=1
        total += 1
    accuracy = 1 - error/total
    return accuracy
        
accuracy = evaluation(gold_sequence, pred_sequence)
print(accuracy)

0.6371419163648337
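
The instructions ask for tag sequences expressed with the original tag names, which is where the inverted tag index comes in. The cell above compares tag indices directly, which gives the same accuracy; the sketch below shows the name-based variant on toy data. `tag_names`, `flat_accuracy` and the toy lists are hypothetical and only illustrate the idea.

```python
def tag_names(sentence, inverted_tagset):
    """Map a list of (word index, tag index) pairs to tag names."""
    return [inverted_tagset[tag] for _, tag in sentence]


def flat_accuracy(gold_sents, pred_sents, inverted_tagset):
    """Flatten both corpora into tag-name lists and compare them."""
    gold = [t for sent in gold_sents for t in tag_names(sent, inverted_tagset)]
    pred = [t for sent in pred_sents for t in tag_names(sent, inverted_tagset)]
    matches = sum(g == p for g, p in zip(gold, pred))
    return matches / len(gold)


# Toy data (assumed): three tags and two short sentences.
toy_inv_tagset = ["NN", "VB", "DT"]
toy_gold = [[(0, 2), (1, 0)], [(2, 1)]]
toy_pred = [[(0, 2), (1, 1)], [(2, 1)]]
toy_accuracy = flat_accuracy(toy_gold, toy_pred, toy_inv_tagset)
print(toy_accuracy)
```

Working with tag names also makes later reports easier to read, since each row is labelled `NN` or `VB` instead of an integer.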


### Part 3: Adapting the tagger using prior information (1.5)

<b>Instructions</b>: now your task is to adapt the tagger using prior information. What do we mean by that? Remember from part 1 that the twitter tagset has some extra tags, related to special tokens such as mentions and hashtags. In other words, **we know beforehand** that these special tokens **should** have these tags. However, because these tags never appear in the PTB data, the tagger has no such information. We are going to add this in order to improve the tagger.

To recap, we know these things about the twitter data:
- username mentions should have the tag 'USR'
- hashtags should have the tag 'HT'
- retweet tokens should have the tag 'RT'
- URL tokens should have the tag 'URL'

Remember how we replaced these tokens with unique special ones (such as 'USER_TOKEN')? Your task is to adapt the emission probabilities for these tokens. Modify the emission matrix: assign 1.0 probability for the emission P('USER_TOKEN'|'USR') and 0.0 for P(word|'USR') for all other words. Do the same for the other three special tags.

In order to do that, you should use the vocabulary and tagset dictionaries in order to obtain the indices for the corresponding words and tags. Then, use the indices to find the values in the emission matrix and modify them. Print your new emission matrix. (0.5)

In [8]:
def change_emission(emission):
    """change emission matrix according to prior info
    para: emission: emission matrix to be adapted
    return: adapted copy of the emission matrix
    """
    tags = [tagset['USR'], tagset['HT'], tagset['RT'], tagset['URL']]
    words = [volcabulary['USER_TOKEN'], volcabulary['HASHTAG_TOKEN'],
             volcabulary['RETWEET_TOKEN'], volcabulary['URL_TOKEN']]
    # work on a copy so the original matrix O is left untouched
    chmx = emission.copy()

    # a special tag emits no ordinary word ...
    chmx[tags, :] = 0.0
    # ... and emits its own special token with probability 1
    chmx[tags, words] = 1.0

    return chmx


emission = change_emission(O)
print(emission)

            

  


[[9.15369893e-05 1.74752434e-04 8.32154448e-06 ... 8.32154448e-06
  8.32154448e-06 8.32154448e-06]
 [1.33457894e-05 1.33457894e-05 6.51955158e-01 ... 1.33457894e-05
  1.33457894e-05 1.33457894e-05]
 [1.62522347e-05 1.62522347e-05 1.62522347e-05 ... 1.62522347e-05
  1.62522347e-05 1.62522347e-05]
 ...
 [3.83582662e-05 3.83582662e-05 3.83582662e-05 ... 3.83582662e-05
  3.83582662e-05 3.83582662e-05]
 [3.83582662e-05 3.83582662e-05 3.83582662e-05 ... 3.83582662e-05
  3.83582662e-05 3.83582662e-05]
 [3.83582662e-05 3.83582662e-05 3.83582662e-05 ... 3.83582662e-05
  3.83582662e-05 3.83582662e-05]]
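
The full emission matrix is too large to inspect by eye, so the adaptation is easier to check on a toy example. This is a minimal sketch with a made-up 3-tag, 4-word matrix; the dictionaries and array names are hypothetical and stand in for the real tagset, vocabulary and `O`.

```python
import numpy as np

# Toy tagset, vocabulary and uniform emission matrix (all assumed).
toy_tagset = {"NN": 0, "USR": 1, "HT": 2}
toy_vocabulary = {"cat": 0, "USER_TOKEN": 1, "HASHTAG_TOKEN": 2, "<unk>": 3}
toy_emission = np.full((3, 4), 0.25)

# Prior knowledge: each special tag emits exactly one special token.
prior = {"USR": "USER_TOKEN", "HT": "HASHTAG_TOKEN"}

adapted = toy_emission.copy()
for tag, token in prior.items():
    row = toy_tagset[tag]
    adapted[row, :] = 0.0                        # no other word for this tag
    adapted[row, toy_vocabulary[token]] = 1.0    # the special token is certain
print(adapted)
```

Each adapted row is still a valid probability distribution. Only the rows of the special tags change, so an ordinary tag such as `NN` keeps its smoothed probability of emitting `USER_TOKEN`.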


<b>Instructions</b>: now evaluate your new tagger on the test tweets again. You should report accuracy but also do a fine-grained error analysis. Print the F-scores for **each tag**. <b>Hint:</b> use the "classification_report" function in scikit-learn for that. You should report the tags that performed the best and the worst. (0.5)
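
For reference, the per-tag F-score is F1 = 2TP / (2TP + FP + FN), where TP, FP and FN are counted separately for each tag. The following pure-Python sketch computes it on toy data and picks the best and worst tags programmatically; `per_tag_f1` and the toy sequences are hypothetical, and this is an illustration of the metric, not a replacement for the scikit-learn report.

```python
from collections import Counter


def per_tag_f1(gold, pred):
    """Return {tag: F1} for every tag seen in gold or pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[g] += 1   # g was missed
    scores = {}
    for tag in set(gold) | set(pred):
        denominator = 2 * tp[tag] + fp[tag] + fn[tag]
        scores[tag] = 2 * tp[tag] / denominator if denominator else 0.0
    return scores


# Toy flattened tag sequences (assumed).
toy_gold = ["NN", "NN", "VB", "RT", "DT"]
toy_pred = ["NN", "VB", "VB", "RT", "NN"]
toy_scores = per_tag_f1(toy_gold, toy_pred)
best_tag = max(toy_scores, key=toy_scores.get)
worst_tag = min(toy_scores, key=toy_scores.get)
print(best_tag, worst_tag)
```

A tag that is never predicted correctly, like `DT` here, gets an F-score of 0.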

In [9]:
#tagging test_corpus using updated emission matrix in viterbi
prediction2 = tagging(test_corpus,pi,A,emission)


In [10]:
from sklearn import metrics


# extract second time pos tags
pred_sequence2 = transform_prediction(prediction2)
#accuracy2 = evaluation(gold_sequence, pred_sequence2)


def print_metrices(y_test,y_pred_class):
    """pos report using scikit-learn classification report
    para:y_test: gold standard labels
    para:y_pred_class: predicted labels
    """
    print("\nClassification report:")
    print(metrics.classification_report(y_test, y_pred_class))

    
print(metrics.accuracy_score(gold_sequence,pred_sequence))
print(metrics.accuracy_score(gold_sequence,pred_sequence2))

print_metrices(gold_sequence, pred_sequence2)

print("best performed tags:")
print(inv_tagset[48])

print("worst performed tags:")
worst = [inv_tagset[16], inv_tagset[27],\
        inv_tagset[34],inv_tagset[35],\
        inv_tagset[36],inv_tagset[37],\
        inv_tagset[39],inv_tagset[41],\
        inv_tagset[42],inv_tagset[43],\
        inv_tagset[44],inv_tagset[45],\
        inv_tagset[50],inv_tagset[51],\
        inv_tagset[52]]

print(worst)




0.6371419163648337
0.6950938426078367

Classification report:
             precision    recall  f1-score   support

          0       0.60      0.27      0.37      1159
          1       0.85      1.00      0.92       303
          2       0.59      0.59      0.59       268
          3       0.43      0.54      0.48       393
          4       0.64      0.59      0.61       670
          5       0.53      0.97      0.69       181
          6       0.65      0.70      0.68       660
          7       0.74      0.93      0.82       825
          8       0.79      0.63      0.70      1931
          9       0.81      0.88      0.85      1091
         10       0.72      0.83      0.77       875
         11       0.69      0.78      0.73       342
         12       0.88      0.50      0.64       303
         13       0.96      0.88      0.92       305
         14       0.77      0.74      0.75       306
         15       0.43      0.63      0.51       140
         16       0.00      0.00    



<b>Instructions</b>: finally, based on the information you got above, do some analysis. Why do you think the tagger performed worse on the tags you mentioned above? How would you improve the tagger? Feel free to inspect some instances manually if you want (and show us if you do). Write your analysis in the markdown cell below. Notice that this question is inherently subjective: this is on purpose as you will be evaluated on your analytical abilities. But don't worry about going into depth: 2-4 sentences is enough (but feel free to write more if you need). (0.5)
    

<b>WRITE YOUR ANALYSIS HERE</b>
The first reason is the limited amount of training data: the classifier cannot generalise well to instances that are rare in the training set. The second is the variance in how a word is tagged. The tagger performs well on the retweet token, which suggests that lower variance in the training data leads to higher performance: the retweet token only ever takes the tag RT, whereas other words, such as nouns and verbs, can take more than one tag and therefore show large variance. A third reason is the lack of prior information. In the first prediction, performance on most tags is worse than in the second prediction, which includes some prior information. Combining this concern with the variance, we could look for a way to maximise our expectation during tagging, which could lower the variance. Another solution is simply to apply a baseline method: it shows 92% accuracy on the WSJ corpus and might also achieve higher accuracy on this data set.
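
The analysis does not say which baseline is meant. One common choice is the most-frequent-tag baseline, and the sketch below assumes that reading: every word receives the tag it carried most often in training, and unseen words receive the overall most frequent tag. The function names and toy corpus are hypothetical, and no accuracy figure is implied for the Twitter data.

```python
from collections import Counter, defaultdict


def train_most_frequent_tag(tagged_sents):
    """Learn each word's most frequent tag and a default tag for unseen words."""
    tag_counts_per_word = defaultdict(Counter)
    all_tag_counts = Counter()
    for sent in tagged_sents:
        for word, tag in sent:
            tag_counts_per_word[word][tag] += 1
            all_tag_counts[tag] += 1
    word_to_tag = {word: counts.most_common(1)[0][0]
                   for word, counts in tag_counts_per_word.items()}
    default_tag = all_tag_counts.most_common(1)[0][0]
    return word_to_tag, default_tag


def tag_baseline(words, word_to_tag, default_tag):
    """Tag a list of word indices; returns (word, tag) index pairs."""
    return [(word, word_to_tag.get(word, default_tag)) for word in words]


# Toy corpus (assumed), in the (word index, tag index) format used above.
toy_train = [[(0, 5), (1, 6)], [(0, 5), (2, 6)], [(0, 7), (1, 6)]]
toy_word_to_tag, toy_default = train_most_frequent_tag(toy_train)
toy_tagged = tag_baseline([0, 1, 9], toy_word_to_tag, toy_default)
print(toy_tagged)
```

Because it uses the same corpus format, such a baseline could be trained on `corpus` and scored with `evaluation` for a direct comparison with the HMM.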