# POS_Tagger

In this notebook, two POS taggers are used to add the part-of-speech tag of each token to the dataframe. <br>
The first is the standard tagger (a <i>pretrained PerceptronTagger</i>) provided by <i>nltk</i>. <br>
The second is the <i>TreeTagger</i>, configured for the <i>Penn Treebank</i> tagset. <br>
The TreeTagger also adds the lemma of each token to the dataframe.

Specific information about installing and configuring the TreeTagger is included in the README.md of the GitHub repository.

In [12]:
from nltk import pos_tag
from nltk.tag.perceptron import PerceptronTagger
import pandas as pd
import numpy as np

In [13]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier, PassiveAggressiveClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

In [14]:
import treetaggerwrapper



<br> Load the tokenized train and dev dataframes.

In [15]:
df_train = pd.read_csv("transitional_data/tokenized_train.csv")
df_dev = pd.read_csv("transitional_data/tokenized_dev.csv")

In [7]:
df_train

Unnamed: 0,SentenceNR,Token,Label
0,0,(,o
1,0,7,o
2,0,),o
3,0,On,o
4,0,specific,o
...,...,...,...
349072,9434,accused,o
349073,9434,No.1,o
349074,9434,as,o
349075,9434,aforementioned,o


## tag_by_SentenceNR
Tags the tokens sentence by sentence. <br>
The second parameter, "tagger", should be a callable that takes a list of tokens and returns (token, tag) pairs (here: the standard tagger of nltk, "pos_tag"). <br>
It returns a list of tags with the same length as the dataframe; otherwise a ValueError is raised.

In [8]:
def tag_by_SentenceNR(df, tagger):
    all_tags = []
    sentence_numbers = df["SentenceNR"].unique()
    for nr in sentence_numbers:
        # select the tokens of one sentence and cast them to str;
        # no column of a dataframe slice is overwritten, so pandas
        # does not emit a SettingWithCopyWarning
        tokens = df.loc[df["SentenceNR"] == nr, "Token"].astype("str").tolist()
        # the tagger returns (token, tag) pairs; keep only the tags
        tags = [e[1] for e in tagger(tokens)]
        all_tags += tags
    if len(all_tags) == len(df):
        return all_tags
    else:
        raise ValueError("number of tags does not match number of rows")

In [9]:
%%time
#df_train["standard_tagger"] = tag_by_SentenceNR(df_train, pos_tag)
df_dev["standard_tagger"]  = tag_by_SentenceNR(df_dev, pos_tag)


CPU times: user 2.28 s, sys: 36.3 ms, total: 2.31 s
Wall time: 2.33 s


In [53]:
df_train

Unnamed: 0,SentenceNR,Token,Label,standard_tagger
0,0,(,o,(
1,0,7,o,CD
2,0,),o,)
3,0,On,o,IN
4,0,specific,o,JJ
...,...,...,...,...
349072,9434,accused,o,JJ
349073,9434,No.1,o,NNP
349074,9434,as,o,IN
349075,9434,aforementioned,o,VBN


## tag_by_SentenceNR_with_PerceptronTagger
A PerceptronTagger object is not called like the function pos_tag, so it gets its own function that uses the <i>.tag()</i> method. <br>
This time a new tagger object is initialised in each loop iteration.

In [10]:
def tag_by_SentenceNR_with_PerceptronTagger(df):
    all_tags = []
    sentence_numbers = df["SentenceNR"].unique()
    for nr in sentence_numbers:
        # read the tokens without assigning to a dataframe slice,
        # which avoids the SettingWithCopyWarning
        tokens = df.loc[df["SentenceNR"] == nr, "Token"].astype("str").tolist()
        # a new pretrained tagger is created for every sentence
        pretrain = PerceptronTagger()
        tags = [e[1] for e in pretrain.tag(tokens)]
        all_tags += tags
    if len(all_tags) == len(df):
        return all_tags
    else:
        raise ValueError("number of tags does not match number of rows")

In [11]:
%%time
df_train["PerceptronTagger"] = tag_by_SentenceNR_with_PerceptronTagger(df_train)
df_dev["PerceptronTagger"] = tag_by_SentenceNR_with_PerceptronTagger(df_dev)


CPU times: user 22.5 s, sys: 219 ms, total: 22.8 s
Wall time: 22.8 s
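
A likely reason for the longer runtime compared with pos_tag is that the pretrained model is loaded again for every sentence. The sketch below shows the alternative pattern: create the tagger once and hand its bound `.tag` method to a function that expects a callable. `DummyTagger` and `tag_sentences` are hypothetical stand-ins so that the example runs without nltk; with nltk, `PerceptronTagger().tag` would presumably take the place of `DummyTagger().tag`.

```python
class DummyTagger:
    """Hypothetical stand-in with the same .tag(tokens) shape as a POS tagger."""

    instances_created = 0

    def __init__(self):
        # count constructions to show that the object is built only once
        DummyTagger.instances_created += 1

    def tag(self, tokens):
        # toy rule, not a real model: digits become CD, everything else NN
        return [(t, "CD" if t.isdigit() else "NN") for t in tokens]


def tag_sentences(sentences, tagger):
    """Apply a callable tagger to each token list and collect the tags in order."""
    all_tags = []
    for tokens in sentences:
        all_tags += [tag for _, tag in tagger(tokens)]
    return all_tags


shared = DummyTagger()                       # created once, outside the loop
sentences = [["On", "7", "specific"], ["accused", "No.1"]]
tags = tag_sentences(sentences, shared.tag)  # bound method used as the callable
print(tags)
```

Because the bound method is an ordinary callable, the same pattern would also fit the "tagger" parameter of tag_by_SentenceNR.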


In [12]:
df_train

Unnamed: 0,SentenceNR,Token,Label,standard_tagger,PerceptronTagger
0,0,(,o,(,(
1,0,7,o,CD,CD
2,0,),o,),)
3,0,On,o,IN,IN
4,0,specific,o,JJ,JJ
...,...,...,...,...,...
349072,9434,accused,o,JJ,JJ
349073,9434,No.1,o,NNP,NNP
349074,9434,as,o,IN,IN
349075,9434,aforementioned,o,VBN,VBN


<br> Compare the output of the standard tagger and the PerceptronTagger.

In [14]:
df_train[df_train['standard_tagger'] != df_train['PerceptronTagger']]
df_dev[df_dev['standard_tagger'] != df_dev['PerceptronTagger']]

Unnamed: 0,SentenceNR,Token,Label,standard_tagger,PerceptronTagger


The two columns are identical, which is expected because the standard tagger is itself a pretrained PerceptronTagger. <br>
Delete the standard_tagger column from the dataframe.

In [15]:
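# drop() returns a new dataframe; df_train and df_dev keep the column
# unless the result is assigned back (e.g. df_dev = df_dev.drop(...))
df_train.drop(['standard_tagger'], axis=1)
df_dev.drop(['standard_tagger'], axis=1)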

Unnamed: 0,SentenceNR,Token,Label,PerceptronTagger
0,0,True,o,NN
1,0,",",o,","
2,0,our,o,PRP$
3,0,Constitution,B-STATUTE,NNP
4,0,has,o,VBZ
...,...,...,...,...
37450,948,of,o,IN
37451,948,right,o,JJ
37452,948,ear,o,NN
37453,948,lobule,o,NN


## tag_with_Treetagger
The TreeTagger always requires a continuous str input rather than a list of tokens. <br>
<i>e. g.<br> 
    tags = tagger.tag_text("This is a very short text to tag.")<br>
    tags[0] = This  (tab)  DT  (tab)  this<br>
    tags[1] = is  (tab)  VBZ  (tab)  be<br>
    ...</i> 
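
Each element returned by the tagger is a single tab-separated string, as in the example above. A minimal sketch of splitting such lines into named fields follows; `TaggedToken` and `parse_tag_line` are hypothetical helper names, and lines that do not have exactly three fields are returned as None rather than guessed at.

```python
from typing import NamedTuple, Optional


class TaggedToken(NamedTuple):
    word: str
    pos: str
    lemma: str


def parse_tag_line(line: str) -> Optional[TaggedToken]:
    """Split one 'word<TAB>pos<TAB>lemma' line; return None for any other shape."""
    parts = line.split("\t")
    if len(parts) != 3:
        return None
    return TaggedToken(*parts)


# the two example lines from the text above
lines = ["This\tDT\tthis", "is\tVBZ\tbe"]
parsed = [parse_tag_line(line) for line in lines]
print(parsed[1].lemma)
```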

## Different tokenization: how to cope with it?
Of course the TreeTagger will <b>NOT</b> produce the same tokenization, since it is given sentences rather than tokens. <br>
Even if a list of tokens is given, the tagger joins the tokens before parsing and finally returns a different tokenization as well. <br>
Since the TreebankWordTokenizer of nltk gives a reliable result when compared with the annotated labels,<br>
we don't want to change the tokenization.<br><br>
Instead, the tokens and tags returned by the TreeTagger are compared with the tokens (rows) in the original dataframe. <br>
For the "equivalent" tokens (those that appear both in the dataframe and in the TreeTagger output), the POS tags and lemmas are added to the dataframe. <br>
The differing tokens are simply discarded. The "TreeTagger" and "Lemma" cells in their rows keep the empty value "np.nan".

In [11]:
def tag_with_Treetagger(df):

    # create a tagger object from treetaggerwrapper
    # TAGDIR="TreeTagger" is the directory where TreeTagger is installed
    tagger = treetaggerwrapper.TreeTagger(TAGLANG='en', TAGDIR="TreeTagger")

    # create two empty columns in the dataframe; dtype=object so that
    # string tags and lemmas can be written into them later
    df["TreeTagger"] = pd.Series(np.nan, index=df.index, dtype=object)
    df["Lemma"] = pd.Series(np.nan, index=df.index, dtype=object)

    # tag sentence by sentence
    sentence_numbers = df["SentenceNR"].unique()

    for nr in sentence_numbers:
        s = df[ df["SentenceNR"] == nr ]
        tokens = [str(t) for t in s["Token"].tolist()]
        tags_and_lemmas = tagger.tag_text( " ".join( tokens ))

        # each output line has the form "token<TAB>tag<TAB>lemma"
        penn = []
        for row in tags_and_lemmas:
            row = row.split("\t")
            if len(row) == 3:
                penn.append([ row[0], row[1], row[2] ])
            else:
                penn.append([np.nan, np.nan, np.nan])

        # give each TreeTagger token to the first still-untagged row
        # of the sentence that holds the same token
        for unit in penn:
            for i in s.index.tolist():
                if str(unit[0]) == str(df.at[i, "Token"]) and pd.isnull(df.at[i, "TreeTagger"]) and pd.isnull(df.at[i, "Lemma"]):
                    df.at[i, "TreeTagger"] = unit[1]
                    df.at[i, "Lemma"] = unit[2]
                    break

<br> Unfortunately, the comparison step is very inefficient and takes much longer than the tagging itself.
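
The nested loops in tag_with_Treetagger compare every TreeTagger token with every row of the sentence. One way to keep the "first still-untagged row with the same token" rule while making only one pass is to hold a queue of row positions per token string. This is a pandas-free sketch under that assumption; `align_tags` and its list-based inputs are illustrative and not part of the notebook, and malformed tagger lines are assumed to have been filtered out beforehand.

```python
from collections import defaultdict, deque


def align_tags(tokens, tagged):
    """Align tagger output with the original tokens of one sentence.

    tokens: list of str (the rows of one sentence, in order)
    tagged: list of (word, pos, lemma) tuples from the tagger
    Returns one (pos, lemma) pair per token; (None, None) where nothing matched.
    """
    # for every token string, remember the positions where it occurs, in order
    positions = defaultdict(deque)
    for i, tok in enumerate(tokens):
        positions[tok].append(i)

    result = [(None, None)] * len(tokens)
    for word, pos, lemma in tagged:
        queue = positions.get(word)
        if queue:  # an unmatched row with this token is still waiting
            result[queue.popleft()] = (pos, lemma)
    return result


tokens = ["the", "No.1", "the"]
# hypothetical tagger output in which "No.1" was split differently
tagged = [("the", "DT", "the"), ("No", "NN", "no"), (".1", "CD", "@card@"),
          ("the", "DT", "the")]
aligned = align_tags(tokens, tagged)
print(aligned)
```

Tokens that the tagger split differently simply stay at (None, None), which corresponds to the np.nan cells in the dataframe.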

In [54]:
%%time
tag_with_Treetagger(df_train)

CPU times: user 29min 50s, sys: 7.34 s, total: 29min 57s
Wall time: 30min


In [12]:
%%time
tag_with_Treetagger(df_dev)

CPU times: user 29.5 s, sys: 682 ms, total: 30.2 s
Wall time: 30.2 s


In [56]:
df_train

Unnamed: 0,SentenceNR,Token,Label,standard_tagger,TreeTagger,Lemma
0,0,(,o,(,(,(
1,0,7,o,CD,CD,@card@
2,0,),o,),),)
3,0,On,o,IN,IN,on
4,0,specific,o,JJ,JJ,specific
...,...,...,...,...,...,...
349072,9434,accused,o,JJ,VVN,accuse
349073,9434,No.1,o,NNP,NN,No.1
349074,9434,as,o,IN,IN,as
349075,9434,aforementioned,o,VBN,JJ,aforementioned


In [13]:
df_dev

Unnamed: 0,SentenceNR,Token,Label,standard_tagger,TreeTagger,Lemma
0,0,True,o,NN,UH,true
1,0,",",o,",",",",","
2,0,our,o,PRP$,PP$,our
3,0,Constitution,B-STATUTE,NNP,NP,Constitution
4,0,has,o,VBZ,VHZ,have
...,...,...,...,...,...,...
37450,948,of,o,IN,IN,of
37451,948,right,o,JJ,JJ,right
37452,948,ear,o,NN,NN,ear
37453,948,lobule,o,NN,NN,lobule


<br> How many cells in the dataframe are still empty?

In [64]:
for column in df_train.columns:
    print(df_train[column].isnull().value_counts())

False    349077
Name: SentenceNR, dtype: int64
False    349074
True          3
Name: Token, dtype: int64
False    349077
Name: Label, dtype: int64
False    349077
Name: standard_tagger, dtype: int64
False    341107
True       7970
Name: TreeTagger, dtype: int64
False    341107
True       7970
Name: Lemma, dtype: int64


<br> 2.28 % of the training tokens (7970 of 349077) did not receive a TreeTagger tag and lemma.

In [14]:
for column in df_dev.columns:
    print(df_dev[column].isnull().value_counts())

False    37455
Name: SentenceNR, dtype: int64
False    37454
True         1
Name: Token, dtype: int64
False    37455
Name: Label, dtype: int64
False    37455
Name: standard_tagger, dtype: int64
False    36785
True       670
Name: TreeTagger, dtype: int64
False    36785
True       670
Name: Lemma, dtype: int64
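
The share of untagged tokens follows directly from the True/False counts printed above. A small sketch of the arithmetic, using the counts from the two outputs; `share_missing` is a hypothetical helper name.

```python
def share_missing(n_missing, n_total):
    """Return the percentage of rows without a value, rounded to two decimals."""
    return round(100 * n_missing / n_total, 2)


train_share = share_missing(7970, 349077)  # TreeTagger NaNs in df_train
dev_share = share_missing(670, 37455)      # TreeTagger NaNs in df_dev
print(train_share, dev_share)              # prints: 2.28 1.79
```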


## Fill the NaNs
With .fillna() the empty cells in the dataframe are replaced (here with the string "0").<br>
This is necessary to prepare the data for machine learning.

In [6]:
df_train_filled = df_train.copy()
df_train_filled = df_train_filled.fillna("0")

In [7]:
df_dev_filled = df_dev.copy()
df_dev_filled = df_dev_filled.fillna("0")

In [8]:
df_train.to_csv("transitional_data/tagged_train.csv", index=False)

In [9]:
df_dev.to_csv("transitional_data/tagged_dev.csv", index=False)

In [10]:
df_train_filled.to_csv("transitional_data/tagged_train_filled.csv", index=False)

In [11]:
df_dev_filled.to_csv("transitional_data/tagged_dev_filled.csv", index=False)

<br> Check one more time.<br>There are no more NaNs in the dataframe.

In [73]:
for column in df_train_filled.columns:
    print(df_train_filled[column].isnull().value_counts())

False    349077
Name: SentenceNR, dtype: int64
False    349077
Name: Token, dtype: int64
False    349077
Name: Label, dtype: int64
False    349077
Name: standard_tagger, dtype: int64
False    349077
Name: TreeTagger, dtype: int64
False    349077
Name: Lemma, dtype: int64


## Do the tags and lemmas make a difference? YES!
This time, with the same primitive model and parameters, the weighted avg reached <b>37%</b>. <br>
This is an improvement of <b>9%</b> over the "thin" matrix built from the tokens alone last time.

In [76]:
X_train = df_train_filled.drop(["Label", "SentenceNR"], axis = 1)
v = DictVectorizer(sparse=True)
X_train = v.fit_transform(X_train.to_dict('records'))
y_train = df_train_filled["Label"]

X_dev = df_dev_filled.drop(["Label", "SentenceNR"], axis=1)
X_dev = v.transform(X_dev.to_dict('records'))
y_dev = df_dev_filled["Label"]

print(X_train.shape, y_train.shape)
print(X_dev.shape, y_dev.shape)

(349077, 45064) (349077,)
(349077, 45064) (349077,)
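
DictVectorizer turns each string-valued entry into its own "column=value" feature, which is why a handful of dataframe columns expands to 45064 features here. A rough pure-Python sketch of that idea follows; `one_hot_records` is a hypothetical helper and not scikit-learn's implementation, which among other things sorts the feature names by default and returns a sparse matrix.

```python
def one_hot_records(records):
    """Map every distinct 'column=value' pair to one 0/1 output column."""
    # first pass: assign a column index to each distinct pair
    vocab = {}
    for rec in records:
        for key, value in rec.items():
            vocab.setdefault(f"{key}={value}", len(vocab))

    # second pass: mark the pairs that are present in each record
    rows = []
    for rec in records:
        row = [0] * len(vocab)
        for key, value in rec.items():
            row[vocab[f"{key}={value}"]] = 1
        rows.append(row)
    return vocab, rows


records = [
    {"Token": "On", "TreeTagger": "IN"},
    {"Token": "specific", "TreeTagger": "JJ"},
    {"Token": "as", "TreeTagger": "IN"},
]
vocab, rows = one_hot_records(records)
print(len(vocab), rows[2])
```

Three distinct tokens and two distinct tags give five columns, so the number of features grows with the vocabulary rather than with the number of dataframe columns.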


In [77]:
classes = df_train_filled["Label"].unique().tolist()
print(classes)

['o', 'B-ORG', 'I-ORG', 'B-OTHER_PERSON', 'I-OTHER_PERSON', 'B-WITNESS', 'I-WITNESS', 'B-GPE', 'B-STATUTE', 'B-DATE', 'I-DATE', 'B-PROVISION', 'I-PROVISION', 'I-STATUTE', 'B-COURT', 'I-COURT', 'B-PRECEDENT', 'I-PRECEDENT', 'B-CASE_NUMBER', 'I-CASE_NUMBER', 'I-GPE', 'B-PETITIONER', 'I-PETITIONER', 'B-JUDGE', 'I-JUDGE', 'B-RESPONDENT', 'I-RESPONDENT']


In [78]:
# partial_fit makes a single pass over the data per call, so max_iter is not used here
per = Perceptron(verbose=10, n_jobs=-1, max_iter=5)
per.partial_fit(X_train, y_train, classes=classes)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.


-- Epoch 1
-- Epoch 1
-- Epoch 1
-- Epoch 1
Norm: 49.58, NNZs: 2195, Bias: -0.150000, T: 349077, Avg. loss: 0.009336
Total training time: 0.10 seconds.
Norm: 37.18, NNZs: 1016, Bias: -0.250000, T: 349077, Avg. loss: 0.006817
Total training time: 0.12 seconds.
Norm: 58.29, NNZs: 2089, Bias: -0.210000, T: 349077, Avg. loss: 0.010353
Total training time: 0.12 seconds.
-- Epoch 1
-- Epoch 1
Norm: 23.07, NNZs: 408, Bias: -0.160000, T: 349077, Avg. loss: 0.005621
Total training time: 0.13 seconds.
-- Epoch 1
-- Epoch 1
Norm: 37.01, NNZs: 1169, Bias: -0.130000, T: 349077, Avg. loss: 0.003532
Total training time: 0.11 seconds.
Norm: 50.00, NNZs: 2203, Bias: -0.190000, T: 349077, Avg. loss: 0.009757
Total training time: 0.11 seconds.
-- Epoch 1
Norm: 74.51, NNZs: 5102, Bias: -0.210000, T: 349077, Avg. loss: 0.016584
Total training time: 0.12 seconds.
-- Epoch 1
-- Epoch 1
Norm: 34.35, NNZs: 1090, Bias: -0.130000, T: 349077, Avg. loss: 0.003368
Total training time: 0.13 seconds.
-- Epoch 1
Norm:

[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:    0.3s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    0.4s


Norm: 45.91, NNZs: 1976, Bias: -0.140000, T: 349077, Avg. loss: 0.006352
Total training time: 0.10 seconds.
-- Epoch 1
Norm: 34.41, NNZs: 914, Bias: -0.300000, T: 349077, Avg. loss: 0.020795
Total training time: 0.11 seconds.
Norm: 54.97, NNZs: 2222, Bias: -0.340000, T: 349077, Avg. loss: 0.033426
Total training time: 0.11 seconds.
-- Epoch 1
-- Epoch 1
Norm: 29.90, NNZs: 486, Bias: -0.150000, T: 349077, Avg. loss: 0.016240
Total training time: 0.12 seconds.
-- Epoch 1
Norm: 19.03, NNZs: 323, Bias: -0.100000, T: 349077, Avg. loss: 0.002467
Total training time: 0.11 seconds.
-- Epoch 1
Norm: 53.44, NNZs: 2460, Bias: -0.300000, T: 349077, Avg. loss: 0.023547
Total training time: 0.11 seconds.
Norm: 27.57, NNZs: 649, Bias: -0.130000, T: 349077, Avg. loss: 0.003027
Total training time: 0.11 seconds.
-- Epoch 1
Norm: 54.75, NNZs: 2649, Bias: -0.240000, T: 349077, Avg. loss: 0.014491
Total training time: 0.10 seconds.
-- Epoch 1
-- Epoch 1


[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:    0.6s
[Parallel(n_jobs=-1)]: Done  23 out of  27 | elapsed:    0.8s remaining:    0.1s


Norm: 23.11, NNZs: 498, Bias: -0.120000, T: 349077, Avg. loss: 0.003472
Total training time: 0.11 seconds.
-- Epoch 1
Norm: 26.94, NNZs: 687, Bias: -0.200000, T: 349077, Avg. loss: 0.004112
Total training time: 0.10 seconds.
-- Epoch 1
Norm: 95.48, NNZs: 7324, Bias: -0.340000, T: 349077, Avg. loss: 0.097998
Total training time: 0.13 seconds.
Norm: 61.34, NNZs: 1937, Bias: -0.340000, T: 349077, Avg. loss: 0.048368
Total training time: 0.13 seconds.
-- Epoch 1
Norm: 50.08, NNZs: 1679, Bias: -0.310000, T: 349077, Avg. loss: 0.026681
Total training time: 0.10 seconds.
Norm: 34.15, NNZs: 1032, Bias: -0.100000, T: 349077, Avg. loss: 0.006094
Total training time: 0.12 seconds.
Norm: 133.98, NNZs: 13271, Bias: 0.340000, T: 349077, Avg. loss: 0.185963
Total training time: 0.10 seconds.


[Parallel(n_jobs=-1)]: Done  27 out of  27 | elapsed:    0.9s finished


In [79]:
classes.remove("o")
print(classification_report(y_pred=per.predict(X_dev), y_true=y_dev, labels=classes))

                precision    recall  f1-score   support

         B-ORG       0.46      0.59      0.52      1441
         I-ORG       0.41      0.25      0.31      2897
B-OTHER_PERSON       0.18      0.84      0.29      2653
I-OTHER_PERSON       0.78      0.01      0.03      2089
     B-WITNESS       0.38      0.60      0.46       881
     I-WITNESS       0.14      0.71      0.23       759
         B-GPE       0.49      0.66      0.56      1395
     B-STATUTE       0.71      0.34      0.46      1803
        B-DATE       0.70      0.17      0.28      1885
        I-DATE       0.42      0.14      0.22      1926
   B-PROVISION       0.82      0.91      0.86      2384
   I-PROVISION       0.37      0.47      0.41      6576
     I-STATUTE       0.29      0.20      0.24      3802
       B-COURT       0.86      0.55      0.67      1293
       I-COURT       0.45      0.20      0.28      2804
   B-PRECEDENT       0.47      0.40      0.43      1351
   I-PRECEDENT       0.38      0.40      0.39  