# Twitter POS tagging

### Part 1: Preprocessing 

I will use two datasets for training: 1) the Penn Treebank sample used in the workshops and 2) the Twitter samples from NLTK. To adapt the tagger to the Twitter data, we need to build a *joint* vocabulary containing all the word types in the PTB and twitter_samples corpora. So, in addition to preprocessing, I will build this vocabulary. 

The vocabulary and the tagset should be stored in Python dictionaries, mapping each word (or tag) to an index (integer). 

In [1]:
import nltk
from nltk.corpus import treebank
from collections import defaultdict

## PTB


corpus = treebank.tagged_sents()

word_numbers = {} # vocabulary
tag_numbers = {} # tagset

training_ptb_corpus = []
for sent in corpus:
    num_sent = []
    for word, tag in sent: # each token in a sentence is a (word, tag) tuple
        wi = word_numbers.setdefault(word.lower(), len(word_numbers))
        ti = tag_numbers.setdefault(tag, len(tag_numbers))
        num_sent.append((wi, ti)) # tuple word, tag
    training_ptb_corpus.append(num_sent) # generate the pre processed corpus
    

    
S = len(tag_numbers)
V = len(word_numbers)
    
print("First sentence - Treebank:",training_ptb_corpus[0])
print("Index of word electricity:",word_numbers['electricity'])
print("Tagset size:",S)


First sentence - Treebank: [(0, 0), (1, 0), (2, 1), (3, 2), (4, 3), (5, 4), (2, 1), (6, 5), (7, 6), (8, 7), (9, 8), (10, 9), (11, 7), (12, 4), (13, 8), (14, 0), (15, 2), (16, 10)]
Index of word electricity: 1095
Tagset size: 46


Now for the twitter_samples dataset (**training** tweets). Since this data is not tagged, the preprocessed corpus will be a list where each element is a list of word indices only (instead of (word, tag) tuples).

There are two things to keep in mind during this process:

1) This dataset needs a bit more preprocessing besides lowercasing. Specifically, I will replace special tokens with special symbols, as follows:
- Username mentions are tokens that start with '@': replace these tokens with 'USER_TOKEN'
- Hashtags are tokens that start with '#': replace these with 'HASHTAG_TOKEN'
- Retweets are represented as the token 'RT' (or 'rt' if you lowercase first): replace these with 'RETWEET_TOKEN'
- URLs are tokens that start with 'https://' or 'http://': replace these with 'URL_TOKEN'

2) The vocabulary built from PTB is updated with any new words present in this corpus. These will *include* the special tokens defined above but *not* the original un-preprocessed tokens.



In [2]:
from nltk.corpus import twitter_samples

corpus_twitter=twitter_samples.tokenized()

def preprocess_word(word):
    # lowercase, then map Twitter-specific tokens to special symbols
    word=word.lower()
    if word.startswith('#'):
        word='HASHTAG_TOKEN'
    elif word.startswith('@'):
        word='USER_TOKEN'
    elif word=="rt":
        word='RETWEET_TOKEN'
    elif word.startswith(("http://", "https://")):
        word='URL_TOKEN'
        
    return word



training_tweets_corpus=[]

# preprocess the words with twitter tokens
for tweet in corpus_twitter:
    for word in tweet:
        word_preprocessed=preprocess_word(word)
        word_numbers.setdefault(word_preprocessed,len(word_numbers))

# create the training twitter dataset
for tweet in corpus_twitter:
    num_tweet = []
    for word in tweet:
        # look up the preprocessed form: the raw token may not be in the vocabulary
        num_tweet.append(word_numbers[preprocess_word(word)])
    training_tweets_corpus.append(num_tweet) 



V = len(word_numbers)
    
print("First sentence training tweets indexes:", training_tweets_corpus[0])
print("Index of word electricity:", word_numbers['electricity'])
print("Index of word HASHTAG_TOKEN:",word_numbers['HASHTAG_TOKEN'])


First sentence training tweets indexes: [11387, 182, 11388, 11389]
Index of word electricity: 1095
Index of word HASHTAG_TOKEN: 11409


Now I will preprocess the tagged Twitter corpus (Ritter et al.) and update the tagset. This dataset will be referred to from now on as the **test** tweets.

This dataset has a few extra tags besides the PTB ones. These were added to cover specific phenomena that happen on Twitter:
- "USR": username mentions
- "HT": hashtags
- "RT": retweets
- "URL": URL addresses

There are a few additional tags which are not specific to Twitter but are not present in the PTB sample:
- "VPP"
- "TD"
- "O"

I will add these seven new tags to the tagset built when reading the PTB corpus.

I will also add an extra type to the vocabulary: `<unk>`. This is in order to account for unknown or out-of-vocabulary words.

Finally, I will build two "inverted indices" for the vocabulary and the tagset. These are lists, where the "i"-th element contains the word (or tag) corresponding to the index "i" in the vocabulary (or tagset).
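
As a small illustration of what such an inverted index looks like, here is a sketch on a toy dictionary. The name `toy_numbers` and its contents are made up for the example; sorting by index is just one possible way to build the list, and it assumes the indices run from 0 to N-1 with no gaps.

```python
# Toy word -> index mapping (illustrative only, not the real vocabulary).
toy_numbers = {"the": 0, "cat": 1, "sat": 2}

# Sort the words by their index; valid because indices are 0..N-1 with no gaps.
toy_names = sorted(toy_numbers, key=toy_numbers.get)

print(toy_names)  # ['the', 'cat', 'sat']

# Round-trip check: position i in the list holds the word whose index is i.
round_trip_ok = all(toy_numbers[w] == i for i, w in enumerate(toy_names))
```

The round-trip check is a cheap way to confirm that the dictionary and the list agree before using the list to decode predictions.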


In [3]:
# update the tagset with 7 extra tags
for extra_tag in ["USR", "HT", "RT", "URL", "VPP", "TD", "O"]:
    tag_numbers.setdefault(extra_tag, len(tag_numbers))

S = len(tag_numbers)

# add an extra type to the vocabulary for unknown words: <unk>
word_numbers.setdefault("<unk>",len(word_numbers))

# create inverted lists
def get_vocabulary_list():
    word_names = [None] * len(word_numbers)
    for word, index in word_numbers.items():
        word_names[index] = word
    return word_names

def get_tagset_list():
    tag_names = [None] * len(tag_numbers)
    for tag, index in tag_numbers.items():
        tag_names[index] = tag
    return tag_names

word_names=get_vocabulary_list()
tag_names=get_tagset_list()

print("Index of word <unk>:", word_numbers['<unk>'])
print("New Tagset size:",S)

Index of word <unk>: 26069
New Tagset size: 53


Now let's read the test tweets and store them in the same format as the PTB corpus (a list of lists containing (word, tag) index tuples). However, **I will not** update the vocabulary. Why? Because the test set should simulate a real-world scenario, where out-of-vocabulary words can appear. Instead, after preprocessing each word, I will check whether that word is in the vocabulary. If it is, I replace it with its index; otherwise I replace it with the index of the `<unk>` token. 
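
The lookup with an `<unk>` fallback can be sketched as follows. The vocabulary `toy_vocab` and the helper `lookup` are hypothetical names used only for this illustration. One detail worth noting: testing the truthiness of `vocab.get(word)` would treat the word with index 0 as missing, so the sketch passes the fallback as the default of `get` instead.

```python
# Toy vocabulary; the words and indices are illustrative only.
toy_vocab = {"the": 0, "cat": 1, "<unk>": 2}

def lookup(word, vocab):
    """Return the index of `word`, falling back to the <unk> index."""
    return vocab.get(word, vocab["<unk>"])

encoded = [lookup(w, toy_vocab) for w in ["the", "dog", "cat"]]
print(encoded)  # [0, 2, 1]
```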

I will also do a mapping for the PTB POS tags:
- "(". In PTB, this is represented as "-LRB-"
- ")". In PTB, this is represented as "-RRB-"
- "NONE". In PTB, this is represented as "-NONE-"

As I build the corpus for the test tweets, I will check whether the tag for a word is one of the above. If it is, I will use the PTB equivalent instead. In practice, it is sufficient to use the correct index for the corresponding tag from the tagset dictionary. This concept is sometimes referred to as *tag harmonisation*, where two different tagsets are mapped to each other.
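
One compact way to express this harmonisation is a small mapping dictionary, sketched below. The three pairs come from the list above; the names `PTB_EQUIVALENT`, `harmonise` and `toy_tagset` are assumptions made for the example.

```python
# Twitter-corpus tag -> PTB equivalent (the three pairs listed above).
PTB_EQUIVALENT = {"(": "-LRB-", ")": "-RRB-", "NONE": "-NONE-"}

def harmonise(tag):
    """Return the PTB equivalent of `tag`, or `tag` itself if none is needed."""
    return PTB_EQUIVALENT.get(tag, tag)

# Toy tagset with made-up indices, standing in for the real tagset dictionary.
toy_tagset = {"-LRB-": 0, "-RRB-": 1, "-NONE-": 2, "NN": 3}

tag_indices = [toy_tagset[harmonise(t)] for t in ["(", "NN", ")", "NONE"]]
print(tag_indices)  # [0, 3, 1, 2]
```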


In [4]:
try:
    from urllib.request import urlretrieve # Python 3
except ImportError:
    from urllib import urlretrieve # Python 2

urlretrieve("https://github.com/aritter/twitter_nlp/raw/master/data/annotated/pos.txt","pos.txt")


test_inputs = []
words_pos = []
with open('pos.txt') as f:
    for line in f:
        if line.strip() == '':
            # a blank line marks the end of a tweet
            test_inputs.append(words_pos)
            words_pos = []
        else:
            word, pos = line.strip().split()
            word=preprocess_word(word) # apply the same preprocessing as for the training tweets
            
            # map out-of-vocabulary words to <unk>; passing a default to get()
            # keeps the word with index 0 from being treated as missing
            wi = word_numbers.get(word, word_numbers['<unk>'])
            
            # harmonise tags with their PTB equivalents
            if pos == "(":
                pos="-LRB-"
            elif pos == ")":
                pos="-RRB-"
            elif pos == "NONE":
                pos="-NONE-"
                
            ti = tag_numbers[pos]
            
            words_pos.append((wi, ti))

# keep the last tweet if the file does not end with a blank line
if words_pos:
    test_inputs.append(words_pos)
    
    
print (test_inputs[0])

[(11392, 46), (61, 19), (114, 11), (8, 7), (3224, 8), (170, 9), (325, 33), (1325, 19), (2375, 22), (3205, 12), (182, 9), (799, 2), (1522, 3), (16, 10), (8490, 0), (1146, 0), (2495, 0), (14039, 43), (26069, 0), (16, 10), (4263, 17), (1760, 4), (9464, 8), (2259, 17), (888, 4), (741, 8), (16, 10)]


In [5]:
print(len(word_numbers))
print(len(tag_numbers))

26070
53


### Part 2: Running the PTB tagger on the test tweets

The next step is to train a POS tagger on the PTB data and try it on the test tweets. 

The first task is to train a Hidden Markov Model (HMM), which means estimating from the training data:
- The initial tag probabilities (a vector).
- The transition probabilities (a matrix).
- The emission probabilities (a matrix).


In [1]:
import numpy as np


def count(tagged_corpus,vocabulary,tagset):
    
    S=len(tagset)
    V=len(vocabulary)
    
    # initialise every count with a small pseudo-count (smoothing)
    eps = 0.1
    pi = eps * np.ones(S) # initial tag probability
    A = eps * np.ones((S, S)) # transition probability
    O = eps * np.ones((S, V)) # emission probability

    # count
    for sent in tagged_corpus:
        last_tag = None
        for word, tag in sent:
            O[tag, word] += 1 # counting for emission probabilities
            if last_tag is None:
                pi[tag] += 1
            else:
                A[last_tag, tag] += 1 # counting for transition probabilities
            last_tag = tag

    # normalise
    pi /= np.sum(pi)
    for s in range(S):
        O[s,:] /= np.sum(O[s,:])
        A[s,:] /= np.sum(A[s,:])

 
    return pi,A,O
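
Written out, the estimates computed by `count` appear to correspond to the following smoothed relative frequencies. The notation is my own: $c_0(s)$ is the number of sentences that start with tag $s$, $c(s \to s')$ the number of times tag $s$ is followed by tag $s'$, $c(s, w)$ the number of times word $w$ carries tag $s$, and $\epsilon = 0.1$ is the pseudo-count `eps`.

$$
\pi_s = \frac{c_0(s) + \epsilon}{\sum_{s'} \left( c_0(s') + \epsilon \right)}, \qquad
A_{s,s'} = \frac{c(s \to s') + \epsilon}{\sum_{s''} \left( c(s \to s'') + \epsilon \right)}, \qquad
O_{s,w} = \frac{c(s, w) + \epsilon}{\sum_{w'} \left( c(s, w') + \epsilon \right)}
$$

Because of the pseudo-count, a tag that never occurs in the training data ends up with a uniform emission row, which is consistent with the identical values in the last rows of the emission matrix printed later.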


In [2]:
# get the initial, transition and emission probabilities
pi, A, O= count(training_ptb_corpus,word_numbers,tag_numbers)


Now I will define a function for the Viterbi algorithm that takes as input:
- The parameters (probabilities) of HMM (a tuple (initial, transition, emission)).
- The input words (a list with numbers).

And returns as output:
- A list of (word, tag) indices, containing the original input word and the predicted tag.

I will run Viterbi on the test tweets and store the predictions in a list.

In [8]:
def viterbi(params, observations):
        pi, A, O = params # where pi is the initial tag probability, A is the transition matrix and O the emission matrix
        M = len(observations)
        S = pi.shape[0]

        alpha = np.zeros((M, S))
        alpha[:,:] = float('-inf')
        backpointers = np.zeros((M, S), 'int')

        # base case
        alpha[0, :] = pi * O[:,observations[0]]

        # recursive case
        for t in range(1, M):
            for s2 in range(S):
                for s1 in range(S):
                    score = alpha[t-1, s1] * A[s1, s2] * O[s2, observations[t]]
                    if score > alpha[t, s2]:
                        alpha[t, s2] = score
                        backpointers[t, s2] = s1

        # now follow backpointers to resolve the state sequence
        ss = []
        ss.append(np.argmax(alpha[M-1,:]))
        for i in range(M-1, 0, -1):
            ss.append(backpointers[i, ss[-1]])
            
        predictions=list(reversed(ss))
        observation_predicted_tag=[]
        
        # create the output of model prediction
        for index in range(M):
            observation_predicted_tag.append((observations[index],predictions[index]))
        
        
        
        return observation_predicted_tag
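
The implementation above multiplies probabilities directly, which can underflow to zero on long sequences. A common alternative is to work with log probabilities and add instead of multiply. The following is a minimal sketch of that variant, not a drop-in replacement: `viterbi_log` is a hypothetical name, it returns only the state sequence rather than (word, tag) pairs, and the two-state toy HMM uses made-up numbers.

```python
import numpy as np

def viterbi_log(params, observations):
    """Return the most likely state sequence, computed in log space."""
    pi, A, O = params
    M, S = len(observations), pi.shape[0]
    with np.errstate(divide="ignore"):  # log(0) = -inf is intended here
        log_pi, log_A, log_O = np.log(pi), np.log(A), np.log(O)

    alpha = np.full((M, S), -np.inf)
    backpointers = np.zeros((M, S), dtype=int)

    # base case
    alpha[0] = log_pi + log_O[:, observations[0]]

    # recursive case, vectorised over the previous state s1 and next state s2
    for t in range(1, M):
        # scores[s1, s2] = alpha[t-1, s1] + log A[s1, s2]
        scores = alpha[t - 1][:, None] + log_A
        backpointers[t] = np.argmax(scores, axis=0)
        alpha[t] = np.max(scores, axis=0) + log_O[:, observations[t]]

    # follow the backpointers from the best final state
    states = [int(np.argmax(alpha[M - 1]))]
    for t in range(M - 1, 0, -1):
        states.append(int(backpointers[t, states[-1]]))
    return list(reversed(states))

# Toy two-state HMM with three possible observations (illustrative numbers).
toy_pi = np.array([0.6, 0.4])
toy_A = np.array([[0.7, 0.3], [0.4, 0.6]])
toy_O = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])

toy_path = viterbi_log((toy_pi, toy_A, toy_O), [0, 1, 2])
print(toy_path)  # [0, 0, 1]
```

Since the logarithm is monotonic, the best path is the same as in the probability-space version whenever the latter does not underflow.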


In [9]:
# run the model
predictions = []
for sent in test_inputs:
    encoded_sent = [word[0] for word in sent] # get the word indexes for each word
    predictions.append(viterbi((pi, A, O), encoded_sent))

print(predictions[0])

[(11392, 27), (61, 19), (114, 11), (8, 7), (3224, 8), (170, 9), (325, 33), (1325, 19), (2375, 22), (3205, 12), (182, 9), (799, 2), (1522, 3), (16, 10), (8490, 29), (1146, 8), (2495, 8), (14039, 10), (26069, 38), (16, 10), (4263, 29), (1760, 4), (9464, 8), (2259, 17), (888, 4), (741, 8), (16, 10)]


To evaluate the results I will use a function that takes a list of (word, tag) index tuples and returns the tag sequence using the original tag names. It relies on the inverted tag index built before.

I run this function on the predictions obtained above and on the test tweets, storing the results in two separate lists. Finally, I flatten the predictions into a single list, do the same for the test tweets, and report accuracy.
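
The flatten-then-compare step can be illustrated on toy data without any library support. The tag sequences below are invented for the example, and computing accuracy by hand is only meant to show what the reported number measures.

```python
from itertools import chain

# Toy gold and predicted tag sequences for two "sentences" (illustrative only).
toy_gold = [["DT", "NN"], ["PRP", "VBD", "NN"]]
toy_pred = [["DT", "NN"], ["PRP", "VBZ", "JJ"]]

# Flatten the per-sentence lists into one list of tags each.
flat_gold = list(chain.from_iterable(toy_gold))
flat_pred = list(chain.from_iterable(toy_pred))

# Token-level accuracy: fraction of positions where the tags agree.
toy_accuracy = sum(g == p for g, p in zip(flat_gold, flat_pred)) / len(flat_gold)
print(toy_accuracy)  # 0.6
```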

In [10]:
from sklearn.metrics import accuracy_score as acc

# input_word_tag is a sentence given as a list of (word index, tag index) tuples
def get_tag_sequence(input_word_tag):
    tag_sequence=[]
    for word_tag in input_word_tag:
        tag_sequence.append(tag_names[word_tag[1]])
    
    # output the tag sequence using original tags in the tagset
    return tag_sequence

predicted_tag_sequence=[]
for pred in predictions:
    predicted_tag_sequence.append(get_tag_sequence(pred))

original_tag_sequence=[]
for sent in test_inputs:
    original_tag_sequence.append([tag_names[word[1]] for word in sent] )

# flatten our data into single lists
all_test_tags = [tag for tags in original_tag_sequence for tag in tags]
# for predictions, we need to obtain the original tag from the index
all_pred_tags = [tag for tags in predicted_tag_sequence for tag in tags]

print (acc(all_test_tags, all_pred_tags))

0.637141916365


### Part 3: Adapting the tagger using prior information 

I will adapt the tagger using prior information. What do we mean by that? Remember from part 1 that the twitter tagset has some extra tags, related to special tokens such as mentions and hashtags. In other words, **we know beforehand** that these special tokens **should** have these tags. However, because these tags never appear in the PTB data, the tagger has no such information. We are going to add this in order to improve the tagger.

To recap, we know these things about the twitter data:
- username mentions should have the tag 'USR'
- hashtags should have the tag 'HT'
- retweet tokens should have the tag 'RT'
- URL tokens should have the tag 'URL'

Remember how I replaced these tokens with unique special ones (such as 'USER_TOKEN')? The task is to adapt the emission probabilities for these tokens by modifying the emission matrix: assign probability 1.0 to the emission P('USER_TOKEN'|'USR') and 0.0 to P(word|'USR') for all other words. (The same applies to the other three special tags.)

To do that, I use the vocabulary and tagset dictionaries to obtain the indices of the corresponding words and tags, then use those indices to find the values in the emission matrix and modify them. 
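
The row update can be sketched on a toy emission matrix. All names and sizes here (`toy_tags`, `toy_words`, `toy_O`, three tags by four words) are assumptions made for the illustration, not the real model.

```python
import numpy as np

# Toy setup: 3 tags x 4 words (illustrative only).
toy_tags = {"NN": 0, "VB": 1, "USR": 2}
toy_words = {"dog": 0, "runs": 1, "USER_TOKEN": 2, "<unk>": 3}

# Start from uniform emission rows, each summing to 1.
toy_O = np.full((len(toy_tags), len(toy_words)), 0.25)

# Zero out the whole 'USR' row, then put all the mass on 'USER_TOKEN'.
row = toy_tags["USR"]
toy_O[row, :] = 0.0
toy_O[row, toy_words["USER_TOKEN"]] = 1.0

print(toy_O[row])         # [0. 0. 1. 0.]
print(toy_O.sum(axis=1))  # every row still sums to 1
```

Note that only the row of the special tag changes. The other tags keep whatever probability they assigned to `USER_TOKEN`, so they can still emit it in principle.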

In [11]:
def count_2(tagged_corpus,vocabulary,tagset):
    
    S=len(tagset)
    V=len(vocabulary)
    
    
    # initialise every count with a small pseudo-count (smoothing)
    eps = 0.1
    pi = eps * np.ones(S) # initial tag probability
    A = eps * np.ones((S, S)) # transition probability
    O = eps * np.ones((S, V)) # emission probability

    # count
    for sent in tagged_corpus:
        last_tag = None
        for word, tag in sent:
            O[tag, word] += 1 # counting for emission probabilities
            if last_tag is None:
                pi[tag] += 1
            else:
                A[last_tag, tag] += 1 # counting for transition probabilities
            last_tag = tag

    # normalise
    pi /= np.sum(pi)
    for s in range(S):
        O[s,:] /= np.sum(O[s,:])
        A[s,:] /= np.sum(A[s,:])
        

    # adjust the probabilities for special tokens, using the dictionaries
    # passed as arguments rather than the global ones:
    # first zero out the whole emission row of each special tag
    O[tagset['USR'], :] = 0.0
    O[tagset['HT'], :] = 0.0
    O[tagset['RT'], :] = 0.0
    O[tagset['URL'], :] = 0.0
    # then give all the probability mass to the matching special token
    O[tagset['USR'], vocabulary['USER_TOKEN']] = 1.0
    O[tagset['HT'], vocabulary['HASHTAG_TOKEN']] = 1.0
    O[tagset['RT'], vocabulary['RETWEET_TOKEN']] = 1.0
    O[tagset['URL'], vocabulary['URL_TOKEN']] = 1.0
    
  
    return pi,A,O

In [12]:
print(O)

[[  9.15369893e-05   1.74752434e-04   8.32154448e-06 ...,   8.32154448e-06
    8.32154448e-06   8.32154448e-06]
 [  1.33457894e-05   1.33457894e-05   6.51955158e-01 ...,   1.33457894e-05
    1.33457894e-05   1.33457894e-05]
 [  1.62522347e-05   1.62522347e-05   1.62522347e-05 ...,   1.62522347e-05
    1.62522347e-05   1.62522347e-05]
 ..., 
 [  3.83582662e-05   3.83582662e-05   3.83582662e-05 ...,   3.83582662e-05
    3.83582662e-05   3.83582662e-05]
 [  3.83582662e-05   3.83582662e-05   3.83582662e-05 ...,   3.83582662e-05
    3.83582662e-05   3.83582662e-05]
 [  3.83582662e-05   3.83582662e-05   3.83582662e-05 ...,   3.83582662e-05
    3.83582662e-05   3.83582662e-05]]


Now I evaluate the new tagger on the test tweets again and report accuracy, followed by a fine-grained, per-tag error analysis. 

In [13]:
word_names=get_vocabulary_list()
tag_names=get_tagset_list()
pi, A, O= count_2(training_ptb_corpus,word_numbers,tag_numbers)
predictions = []
for sent in test_inputs:
    encoded_sent = [word[0] for word in sent] # get the word indexes for each word
    predictions.append(viterbi((pi, A, O), encoded_sent))

predicted_tag_sequence=[]
for pred in predictions:
    predicted_tag_sequence.append(get_tag_sequence(pred))

original_tag_sequence=[]
for sent in test_inputs:
    original_tag_sequence.append([tag_names[word[1]] for word in sent] )

# flatten our data into single lists
all_test_tags = [tag for tags in original_tag_sequence for tag in tags]
# for predictions, we need to obtain the original tag from the index
all_pred_tags = [tag for tags in predicted_tag_sequence for tag in tags]

print (acc(all_test_tags, all_pred_tags))

0.695093842608


In [14]:
from sklearn.metrics import accuracy_score, classification_report
import warnings

# silence sklearn warnings about metrics that are undefined for some tags
warnings.filterwarnings('ignore')

def print_report(predictions, classifications):
    # classifications holds the gold tags, predictions the tagger output
    print("Accuracy:")
    print(accuracy_score(classifications,predictions))
    print(classification_report(classifications,predictions))
    
print_report(all_pred_tags,all_test_tags)

Accuracy:
0.695093842608
             precision    recall  f1-score   support

          #       0.00      0.00      0.00         0
          $       0.00      0.00      0.00         0
         ''       0.03      0.20      0.06        91
          ,       0.85      1.00      0.92       303
      -LRB-       0.00      0.00      0.00        32
     -NONE-       0.00      0.00      0.00         2
      -RRB-       0.04      0.15      0.07        34
          .       0.72      0.83      0.77       875
          :       0.97      0.76      0.85       562
         CC       0.96      0.88      0.92       305
         CD       0.59      0.59      0.59       268
         DT       0.74      0.93      0.82       825
         EX       0.38      0.80      0.52        10
         FW       0.00      0.00      0.00         3
         HT       0.98      0.98      0.98       135
         IN       0.81      0.88      0.85      1091
         JJ       0.64      0.59      0.61       670
        JJR       0.

For a given tag, precision is the fraction of the tokens predicted as that tag that actually carry it in the gold annotation. In several cases this value is very low (for example `''` and `-RRB-`).
Additionally, recall, the fraction of the tokens that actually carry the tag and were predicted as such, falls short of expectations for several tags.
Moreover, a few tags (such as `#` and `$`) never occur in the test tweets (support 0), so they have no true positives or false negatives. Their recall, and hence their F-measure, is undefined and is shown as 0.00 in the report.
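
To make the undefined cases concrete, here is a small sketch that computes per-tag precision and recall by hand on invented data. The helper `precision_recall` is a hypothetical function written for this example; it returns `None` where a metric is undefined instead of substituting 0.

```python
def precision_recall(gold, pred, tag):
    """Return (precision, recall, support) for one tag; None means undefined."""
    tp = sum(g == tag and p == tag for g, p in zip(gold, pred))
    predicted = sum(p == tag for p in pred)   # TP + FP
    support = sum(g == tag for g in gold)     # TP + FN
    precision = tp / predicted if predicted else None
    recall = tp / support if support else None
    return precision, recall, support

# Toy gold and predicted tags (illustrative only).
toy_gold = ["NN", "NN", "VB", "DT"]
toy_pred = ["NN", "VB", "VB", "$"]

print(precision_recall(toy_gold, toy_pred, "NN"))  # (1.0, 0.5, 2)
print(precision_recall(toy_gold, toy_pred, "$"))   # (0.0, None, 0)
```

The `$` case mirrors the rows with support 0 in the report: the tag is predicted but never appears in the gold data, so precision is 0 and recall has no defined value.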

In the first part, I built the tagset to cover the Twitter data, but then trained the model on the PTB corpus and tested it on Twitter data. Training the model on Twitter data would probably increase performance, as the writing styles of the PTB and Twitter corpora are significantly different. 

As a consequence, some tags with large support still show low performance:
- CD: cardinal numbers.
- JJ: adjectives.
- IN: prepositions, which could be improved.
- NN*: some of these tags have substantial support, but their scores are generally low.
- VBN, VBG: verb conjugations have low scores and could be improved.

There are also general observations about other tags and words that could be improved:
- Punctuation symbols.
- Emoticons, which are very popular on Twitter, could be included in the model.
- Tokens that mix symbols, words and numbers may be confusing; these words should be preprocessed.
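
As a rough idea of how the emoticon suggestion could be tried, the sketch below maps a few common ASCII emoticons to a single special symbol, in the same spirit as the special tokens from Part 1. Everything here is hypothetical: the pattern `EMOTICON_RE`, the symbol `EMOTICON_TOKEN` and the helper `replace_emoticon` are not part of the tagger above, and the pattern covers only a handful of simple forms.

```python
import re

# Hypothetical pattern: eyes, optional nose, mouth (e.g. ":)", ";-(", ":D").
EMOTICON_RE = re.compile(r"^[:;=8][\-o\*']?[\)\]\(\[dDpP/\\|]$")

def replace_emoticon(token):
    """Map a matching emoticon to a special symbol, leave other tokens alone."""
    return "EMOTICON_TOKEN" if EMOTICON_RE.match(token) else token

toy_tokens = [":)", ";-(", ":D", "hello", ":"]
replaced = [replace_emoticon(t) for t in toy_tokens]
print(replaced)
```

Such a step would have to run before lowercasing and before the vocabulary is built, so that the special symbol receives its own index.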

Finally, combining this with other techniques, such as using bigrams or modifying the classifier, could also help improve the score, as the model would have more context to learn from.