# Tokenizing
<br>
Purpose of this notebook is to convert the annotated judgement texts from the <b> javascript objects (json) </b> into <b> pandas dataframes </b>, which can be used as the matrix for mashine learning.<br> <br>
Labels in the annotated texts are stored in the json trees according to the sequence number of characters. <br>
<i> e.g. " ... Hongkong Bank ... " - { 'value': {'start': 90, 'end': 103, 'text': 'Hongkong Bank','labels': ["ORG"] } </i> <br> <br> 
With the <b> span_tokenize </b> the labels will be adapted to the sequence of tokens. <br>
<i> e.g. [ ... 'B-ORG', 'I-ORG', ... ] </i> <br> <br> 
Each token and its label makes up a single row in the dataframe.

In [1]:
import pandas as pd
import numpy as np
import json

In [2]:
from nltk.tokenize import word_tokenize, TreebankWordTokenizer, TreebankWordDetokenizer

In [3]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier, PassiveAggressiveClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

In [4]:
# read the training dataset
with open("NER_TRAIN/NER_TRAIN_JUDGEMENT.json") as json_file_train:
    json_object_train = json.load(json_file_train)

In [5]:
# read the developing dataset
with open("NER_DEV/NER_DEV/NER_DEV_JUDGEMENT.json") as json_file_dev:
    json_object_dev = json.load(json_file_dev)

<br>
Have a look at th first sentence of the judgment. (the first json tree) <br>
Each tree includes only one sentence.

In [6]:
json_object_train[0]

{'id': '90d9a97c7b7749ec8a4f460fda6f937e',
 'annotations': [{'result': [{'value': {'start': 90,
      'end': 103,
      'text': 'Hongkong Bank',
      'labels': ['ORG']},
     'id': 'C8HPTIM1',
     'from_name': 'label',
     'to_name': 'text',
     'type': 'labels'},
    {'value': {'start': 267,
      'end': 278,
      'text': 'Rahul & Co.',
      'labels': ['ORG']},
     'id': 'KOWE3RAM',
     'from_name': 'label',
     'to_name': 'text',
     'type': 'labels'}]}],
 'data': {'text': "\n\n(7) On specific query by the Bench about an entry of Rs. 1,31,37,500 on deposit side of Hongkong Bank account of which a photo copy is appearing at p. 40 of assessee's paper book, learned authorised representative submitted that it was related to loan from broker, Rahul & Co. on the basis of his submission a necessary mark is put by us on that photo copy."},
 'meta': {'source': 'tax_districtcourts judgement https://indiankanoon.org/doc/1556717/'}}

### get_start_and_end_and_labels
This function is designed to extract all labels with their spans from a json tree. <br>
It returns a list of labels. (length of the list corresponds to the number of total labels in the sentence. A label that spans more than one tokens count also as only one label) <br>
Each label is list of three elements.<br>
The first element is the name of the label, e. g. "ORG". <br>
The second element is the start of label, the third the end of the label.

In [7]:
def get_start_and_end_and_labels(tree):
    start_and_end_and_labels = []
    for label in tree["annotations"][0]["result"]:
        labels = label["value"]["labels"][0]
        start = label["value"]["start"]
        end = label["value"]["end"]
        start_and_end_and_labels.append([labels, start, end])
    return start_and_end_and_labels

<br>
The labels in the first sentence.

In [8]:
get_start_and_end_and_labels(json_object_train[0])

[['ORG', 90, 103], ['ORG', 267, 278]]

### return_text_and_label
returns a tuple in length of two. <br>
The first element is the text of the tree. (str) <br>
The second a list of all labels. (list)

In [9]:
def return_text_and_label(tree):
    text = tree["data"]["text"]
    labels = get_start_and_end_and_labels(tree)
    return text, labels

In [10]:
print(return_text_and_label(json_object_train[0]))

("\n\n(7) On specific query by the Bench about an entry of Rs. 1,31,37,500 on deposit side of Hongkong Bank account of which a photo copy is appearing at p. 40 of assessee's paper book, learned authorised representative submitted that it was related to loan from broker, Rahul & Co. on the basis of his submission a necessary mark is put by us on that photo copy.", [['ORG', 90, 103], ['ORG', 267, 278]])


### Have a try of the TreebankWordTokenizer of nltk
With TreebankWordDetokenizer the tokens stored in a list after tokenizing will be combined again to a str. 

In [11]:
twt = TreebankWordTokenizer()
twd = TreebankWordDetokenizer()
try_text, try_label = return_text_and_label(json_object_train[0])
tokens = twt.tokenize(try_text)
print(f"tokenized text: \n{tokens}\n")
print(f"recombined text with detokenizer: \n{twd.detokenize(tokens)}")

tokenized text: 
['(', '7', ')', 'On', 'specific', 'query', 'by', 'the', 'Bench', 'about', 'an', 'entry', 'of', 'Rs.', '1,31,37,500', 'on', 'deposit', 'side', 'of', 'Hongkong', 'Bank', 'account', 'of', 'which', 'a', 'photo', 'copy', 'is', 'appearing', 'at', 'p.', '40', 'of', 'assessee', "'s", 'paper', 'book', ',', 'learned', 'authorised', 'representative', 'submitted', 'that', 'it', 'was', 'related', 'to', 'loan', 'from', 'broker', ',', 'Rahul', '&', 'Co.', 'on', 'the', 'basis', 'of', 'his', 'submission', 'a', 'necessary', 'mark', 'is', 'put', 'by', 'us', 'on', 'that', 'photo', 'copy', '.']

recombined text with detokenizer: 
(7) On specific query by the Bench about an entry of Rs. 1,31,37,500 on deposit side of Hongkong Bank account of which a photo copy is appearing at p. 40 of assessee's paper book, learned authorised representative submitted that it was related to loan from broker, Rahul & Co. on the basis of his submission a necessary mark is put by us on that photo copy.


### span_tokenize
span_tokenize is a special function provided by nltk. <br>
It return the start and end of each token (according to the number sequence of characters) in a list. <br>
It is especially useful to 

In [12]:
try_tokens = list(TreebankWordTokenizer().span_tokenize(try_text))
print(try_tokens)

[(2, 3), (3, 4), (4, 5), (6, 8), (9, 17), (18, 23), (24, 26), (27, 30), (31, 36), (37, 42), (43, 45), (46, 51), (52, 54), (55, 58), (59, 70), (71, 73), (74, 81), (82, 86), (87, 89), (90, 98), (99, 103), (104, 111), (112, 114), (115, 120), (121, 122), (123, 128), (129, 133), (134, 136), (137, 146), (147, 149), (150, 152), (153, 155), (156, 158), (159, 167), (167, 169), (170, 175), (176, 180), (180, 181), (182, 189), (190, 200), (201, 215), (216, 225), (226, 230), (231, 233), (234, 237), (238, 245), (246, 248), (249, 253), (254, 258), (259, 265), (265, 266), (267, 272), (273, 274), (275, 278), (279, 281), (282, 285), (286, 291), (292, 294), (295, 298), (299, 309), (310, 311), (312, 321), (322, 326), (327, 329), (330, 333), (334, 336), (337, 339), (340, 342), (343, 347), (348, 353), (354, 358), (358, 359)]


### add_label_to_tokens
This function is crucial in the processing of the raw data. <br>
In the tradition of a NER (Named Entity Recognition) task, tokens should be tabbed not only by their labels, but also by their positions in the label. <br>
When a token does not lay in any label, it should be tabbed als <b> "o" </b>. ("outsider") <br>
When it lays at the beginning of a label, then <b> "B-" </b> ("beginning") plus the name of the label. <br>
All tokens after the first token in a label should be tabbed as "insider". (<b> "I-" </b>) <br> 

<i> e. g. <br>
" ... of Hongkong Bank ... " <br>
["o", "B-ORG", "I-ORG"]</i> <br> <br>

<i> special attention:</i> <br>
The first parameter of this function "<b>tokens</b>" requires a list of <b>token spans</b>, which are the products of a span_tokenizer. <br> <br>

<i> maximal span strategy </i> <br>
Because there is no guarantee that the tokenizer could always produce the same tokenizing as the one used by annotation. <br> 
And it is also possible, that some labels of the annotation does not correspond to the boundaries of (natural) tokens because of carelessness or different understanding of the boundaries. <br>
The maximal span strategy maximizes the included tokens in a label. So lang as a single character in the token is included in the label span, it will be labelled. <br> <br>
    
Later the quality of tokenizing and maximal span strategy will be checked with the <b>compare_label_with_labelled_tokens</b> function. <br>
It will prove that the maximal span strategy would not almost change a single label.

In [13]:
def add_label_to_tokens(tokens, labels):
    # at first, create a list of "o"s with the same length as the number of tokens in the sentence.
    # Hier the parameter tokens requires a list of token spans, which are the products of a span_tokenizer.
    token_labels = ["o" for token in tokens]
    
    # afterwards, search the tokens inside of the labels.
    # This process is not very efficient.
    # For each label in all labels it will interate all token spans in the sentence to find out which tokens belongs to this label.
    for label in labels:
        # label_start, label_end with character numbers
        label_start = label[1]
        label_end = label[2]
        if label_start <= label_end:
            for i in range(0, len(tokens)):
                # token_start, token_end also with character numbers
                token_start, token_end = tokens[i]
                
                # the first token in the label ("Beginning")
                if token_start <= label_start < token_end:
                    token_labels[i] = "B-" + label[0]
                
                # the last token in a label, if the label span does not correspond to the end of the token
                elif token_start < label_end <= token_end:
                    token_labels[i] = "I-" + label[0]
                
                # the following tokens after the first label ("Insider")
                if label_start < token_start <=  token_end <= label_end:
                    token_labels[i] = "I-" + label[0]
    return token_labels

### get_tokens_with_label
This function is an expansion of <i>add_label_to_tokens</i>. <br>
It wraps the <i>add_label_to_tokens</i> and provides it with the required parameters. <br>
Besides it returns also a list of tokens from the tokenizer. (with characters, rather in spans)

In [14]:
def get_tokens_with_label(tree):
    text, labels = return_text_and_label(tree)
    twt = TreebankWordTokenizer()
    tokens_span = list(TreebankWordTokenizer().span_tokenize(text))
    list_of_tokenized_text = twt.tokenize(text)
    list_of_label_of_each_token = add_label_to_tokens(tokens_span, labels)
    
    """
    d = {}
    for i in range(len(tokens_span)):
        d[ tokens_span[i] ] = tokenized_text[i]
    for key, value in d.items():
        print(f"{key}: {value}")
    """
    
    return  list_of_label_of_each_token, list_of_tokenized_text

<br> print out the labels and tokens in the first tree with <i> get_tokens_with_label </i>

In [15]:
list_of_label_of_each_token, list_of_tokens = get_tokens_with_label(json_object_train[0])
print(f"list_of_label_of_each_token: \n{list_of_label_of_each_token}\n")
print(f"list_of_tokenized_text: \n{list_of_tokens}")

list_of_label_of_each_token: 
['o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'B-ORG', 'I-ORG', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'B-ORG', 'I-ORG', 'I-ORG', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o']

list_of_tokenized_text: 
['(', '7', ')', 'On', 'specific', 'query', 'by', 'the', 'Bench', 'about', 'an', 'entry', 'of', 'Rs.', '1,31,37,500', 'on', 'deposit', 'side', 'of', 'Hongkong', 'Bank', 'account', 'of', 'which', 'a', 'photo', 'copy', 'is', 'appearing', 'at', 'p.', '40', 'of', 'assessee', "'s", 'paper', 'book', ',', 'learned', 'authorised', 'representative', 'submitted', 'that', 'it', 'was', 'related', 'to', 'loan', 'from', 'broker', ',', 'Rahul', '&', 'Co.', 'on', 'the', 'basis', 'of', 'his', 'submission', 'a', 'necessary', 'mark', 'is', 'put', 'by', 'us', 'on', 'that', 'pho

### compare_label_with_labelled_tokens
The functions compares the labelles tokens with the labelled text from the annotation to check the quality of Tokenizing. <br>

In [16]:
def compare_label_with_labelled_tokens(tree, object_number, print_the_differences = False):
    number_of_errors = 0
    error_report = ""
    
    # labels_with_text : a list of named entities directly extracted from the text according to the annotation.
    text, labels = return_text_and_label(tree)
    labels_with_text = []
    for label in labels:
        labels_with_text.append(text[label[1]:label[2]])
    
    # all_labelled_tokens: a list of labelled texts after tokenizing with the maximize span strategy.
    labelled_tokens, tokenized_text = get_tokens_with_label(tree)
    all_labelled_tokens = []
    l = len(labelled_tokens)
    for i in range(l):
        single_label = []
        if labelled_tokens[i].startswith("B"):
            single_label.append(tokenized_text[i])
            while i+ 1 < l and labelled_tokens[i+1].startswith("I"):
                single_label.append(tokenized_text[i+1])
                i += 1
        if len(single_label) > 0:
            all_labelled_tokens.append(" ".join(single_label))
    
    # compare whether the length of two lists (number of labels in a sentence) are the same.
    if len(labels_with_text) != len(all_labelled_tokens):
        number_of_errors += abs( len(labels_with_text) - len(all_labelled_tokens) )
        error_report += "--------------\n"
        error_report += f"different number of labels at the {object_number}th object!\n"
        error_report += f"labels: {labels}\n"
        error_report += f"labels_with_text: {labels_with_text}\n"
        error_report += f"all_labelled_tokens: {all_labelled_tokens} \n"
        error_report += "--------------\n"
    
    # compare each element in both lists.
    else:
        for i in range(len(labels_with_text)):
            gold = labels_with_text[i].replace(" ", "")
            tokenized = labels_with_text[i].replace(" ", "")
            if gold != tokenized:
                number_of_errors += 1
                error_report += "--------------\n"
                error_report += f"different labels at the {object_number}th object!\n"
                error_report += "potential tokenizing problem: "
                error_report += f"gold: {gold} -- tokenized: {tokenized}"
                error_report += "--------------"
    
    if print_the_differences:
        print(error_report)
    
    return number_of_errors, error_report

<br>
compare the annotation and labelled texts after tokenizing in a single object (1328)

In [17]:
number_of_errors_1328, error_report_1328 = compare_label_with_labelled_tokens(json_object_train[1328], object_number=1328, print_the_differences = True)

--------------
different number of labels at the 1328th object!
labels: [['OTHER_PERSON', 3, 22], ['CASE_NUMBER', 28, 41], ['CASE_NUMBER', 61, 75], ['COURT', 83, 140], ['ORG', 148, 166], ['CASE_NUMBER', 166, 167], ['CASE_NUMBER', 167, 176]]
labels_with_text: ['Jeevan Bheemmanagar', 'Cr..No.179/05', 'CC No.22109/06', '10th Addl. Chief Metropolitan Magistrate Court, Bangalore', 'Koramangala P.S.Cr', '.', 'No.430/05']
all_labelled_tokens: ['Jeevan Bheemmanagar', 'Cr..No.179/05', 'CC No.22109/06', '10th Addl. Chief Metropolitan Magistrate Court , Bangalore.', 'Koramangala', 'P.S.Cr.No.430/05'] 
--------------



In [18]:
number_of_errors_1328

1

Analysis:<br>
a tokenizing error at the end of the sentence: <br>
annoatation: 'Koramangala P.S.Cr', '.', 'No.430/05'
tokenized text : 'Koramangala', 'P.S.Cr.No.430/05'

### all wrong labelled entities after tokenizing in the training and development dataset

In [19]:
with open("tokenizing_report_train.txt", "w", encoding="utf-8") as f:
    number_of_errors = 0
    error_report = ""
    for i in range(len(json_object_train)):
        new_errors, new_error_report = compare_label_with_labelled_tokens(json_object_train[i], object_number = i)
        number_of_errors += new_errors
        error_report += new_error_report
    f.write(error_report)
    print(f"total number of wrong tokenized labels in the training dataset: {number_of_errors}")
f.close()

total number of wrong tokenized labels in the training dataset: 5


In [20]:
with open("tokenizing_report_dev.txt", "w", encoding="utf-8") as d:
    number_of_errors = 0
    error_report = ""
    for i in range(len(json_object_dev)):
        new_errors, new_error_report = compare_label_with_labelled_tokens(json_object_dev[i], object_number = i)
        number_of_errors += new_errors
        error_report += new_error_report
    d.write(error_report)
    print(f"total number of wrong tokenized labels in the developing dataset: {number_of_errors}")
d.close()

total number of wrong tokenized labels in the developing dataset: 1


## Quality Analysis of the tokenizing
The total number of wrong tokenized labels in both training and developing dataset are very low. <br>
This proves the gut quality of tokenizing. <br>
Among the 57966 labels in the training dataset are only 5 of them false tokenized. <br>
(Tokenizing accuracy = 99.9914 %) 

4 of the total 6 errors are caused by the dashes in the names. <br>
<i>labels_with_text: [ ... 'Bangalore', 'Madras'] <br>
all_labelled_tokens: [ .. 'Bangalore-Madras'] </i>

## Convert the json to dataframe
As preparation of the POS-tagging in next step, the sentence number of each token will also be stored in the dataframe

In [21]:
def json_to_df(trees):
    token_and_labels = []
    for n in range(len(trees)):
        labels, tokens = get_tokens_with_label(trees[n])
        if len(labels) != len(tokens):
            raise ValueError
        else:
            for i in range(len(labels)):
                token_and_labels.append([ n, tokens[i], labels[i]])
    df = pd.DataFrame(token_and_labels)
    df.columns = ['SentenceNR', 'Token', 'Label']
    return df

In [22]:
df_train = json_to_df(json_object_train)
df_train

Unnamed: 0,SentenceNR,Token,Label
0,0,(,o
1,0,7,o
2,0,),o
3,0,On,o
4,0,specific,o
...,...,...,...
349072,9434,accused,o
349073,9434,No.1,o
349074,9434,as,o
349075,9434,aforementioned,o


In [23]:
df_dev = json_to_df(json_object_dev)
df_dev

Unnamed: 0,SentenceNR,Token,Label
0,0,True,o
1,0,",",o
2,0,our,o
3,0,Constitution,B-STATUTE
4,0,has,o
...,...,...,...
37450,948,of,o
37451,948,right,o
37452,948,ear,o
37453,948,lobule,o


## store the dataframes

In [31]:
df_train.to_csv("tokenized_train.csv", index=False)

In [32]:
df_dev.to_csv("tokenized_dev.csv", index=False)

## load the dataframes again to the notebook

In [33]:
df_train = pd.read_csv("tokenized_train.csv")
df_train

Unnamed: 0,SentenceNR,Token,Label
0,0,(,o
1,0,7,o
2,0,),o
3,0,On,o
4,0,specific,o
...,...,...,...
349072,9434,accused,o
349073,9434,No.1,o
349074,9434,as,o
349075,9434,aforementioned,o


In [34]:
df_dev = pd.read_csv("tokenized_dev.csv")
df_dev

Unnamed: 0,SentenceNR,Token,Label
0,0,True,o
1,0,",",o
2,0,our,o
3,0,Constitution,B-STATUTE
4,0,has,o
...,...,...,...
37450,948,of,o
37451,948,right,o
37452,948,ear,o
37453,948,lobule,o


## How good will it work right now?
With a simple <i>Perceptron</i> modell from <i>Sklearn</i>

In [45]:
X_train = df_train.drop(["Label", "SentenceNR"], axis = 1)
v = DictVectorizer(sparse=True)
X_train = v.fit_transform(X_train.to_dict('records'))
y_train = df_train["Label"]

X_dev = df_dev.drop(["Label", "SentenceNR"], axis=1)
X_dev = v.transform(X_dev.to_dict('records'))
y_dev = df_dev["Label"]

print(X_train.shape, y_train.shape)
print(X_dev.shape, y_dev.shape)

(349077, 27286) (349077,)
(37455, 27286) (37455,)


In [46]:
classes = df_train["Label"].unique().tolist()
print(classes)

['o', 'B-ORG', 'I-ORG', 'B-OTHER_PERSON', 'I-OTHER_PERSON', 'B-WITNESS', 'I-WITNESS', 'B-GPE', 'B-STATUTE', 'B-DATE', 'I-DATE', 'B-PROVISION', 'I-PROVISION', 'I-STATUTE', 'B-COURT', 'I-COURT', 'B-PRECEDENT', 'I-PRECEDENT', 'B-CASE_NUMBER', 'I-CASE_NUMBER', 'I-GPE', 'B-PETITIONER', 'I-PETITIONER', 'B-JUDGE', 'I-JUDGE', 'B-RESPONDENT', 'I-RESPONDENT']


In [47]:
per = Perceptron(verbose=10, n_jobs=-1, max_iter=5)
per.partial_fit(X_train, y_train, classes)

-- Epoch 1
-- Epoch 1
-- Epoch 1


[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.


-- Epoch 1
Norm: 9.85, NNZs: 97, Bias: -0.070000, T: 349077, Avg. loss: 0.001413
Total training time: 0.27 seconds.
-- Epoch 1
Norm: 31.54, NNZs: 995, Bias: -0.010000, T: 349077, Avg. loss: 0.001695
Total training time: 0.38 seconds.
Norm: 49.06, NNZs: 2407, Bias: -0.010000, T: 349077, Avg. loss: 0.000801
Total training time: 0.43 seconds.
-- Epoch 1
Norm: 21.59, NNZs: 466, Bias: -0.060000, T: 349077, Avg. loss: 0.001467
Total training time: 0.46 seconds.
-- Epoch 1
-- Epoch 1
Norm: 22.45, NNZs: 504, Bias: -0.020000, T: 349077, Avg. loss: 0.000592
Total training time: 0.23 seconds.
-- Epoch 1
Norm: 30.87, NNZs: 953, Bias: -0.010000, T: 349077, Avg. loss: 0.001874
Total training time: 0.20 seconds.
-- Epoch 1


[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:    0.7s


Norm: 21.42, NNZs: 459, Bias: -0.030000, T: 349077, Avg. loss: 0.000576
Total training time: 0.22 seconds.
-- Epoch 1
Norm: 50.86, NNZs: 2587, Bias: -0.010000, T: 349077, Avg. loss: 0.002511
Total training time: 0.26 seconds.
-- Epoch 1
Norm: 32.63, NNZs: 1065, Bias: -0.010000, T: 349077, Avg. loss: 0.002090
Total training time: 0.24 seconds.
-- Epoch 1
Norm: 14.25, NNZs: 203, Bias: -0.090000, T: 349077, Avg. loss: 0.001243
Total training time: 0.27 seconds.
-- Epoch 1
Norm: 17.61, NNZs: 310, Bias: -0.060000, T: 349077, Avg. loss: 0.000435
Total training time: 0.29 seconds.
-- Epoch 1
Norm: 30.48, NNZs: 929, Bias: -0.010000, T: 349077, Avg. loss: 0.000959
Total training time: 0.24 seconds.


[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    1.0s


-- Epoch 1Norm: 18.25, NNZs: 333, Bias: -0.050000, T: 349077, Avg. loss: 0.002001
Total training time: 0.33 seconds.

-- Epoch 1
Norm: 40.91, NNZs: 1674, Bias: 0.000000, T: 349077, Avg. loss: 0.007197
Total training time: 0.22 seconds.
Norm: 16.76, NNZs: 281, Bias: -0.030000, T: 349077, Avg. loss: 0.005235
Total training time: 0.20 seconds.
-- Epoch 1
-- Epoch 1
Norm: 10.44, NNZs: 109, Bias: -0.010000, T: 349077, Avg. loss: 0.000580
Total training time: 0.21 seconds.
Norm: 12.00, NNZs: 144, Bias: -0.080000, T: 349077, Avg. loss: 0.004115
Total training time: 0.30 seconds.
-- Epoch 1
-- Epoch 1
Norm: 15.68, NNZs: 246, Bias: -0.020000, T: 349077, Avg. loss: 0.000611
Total training time: 0.20 seconds.
-- Epoch 1
Norm: 30.40, NNZs: 924, Bias: -0.020000, T: 349077, Avg. loss: 0.005609
Total training time: 0.24 seconds.
-- Epoch 1


[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:    1.6s


Norm: 32.86, NNZs: 1080, Bias: -0.040000, T: 349077, Avg. loss: 0.003068
Total training time: 0.21 seconds.
Norm: 63.56, NNZs: 4040, Bias: -0.040000, T: 349077, Avg. loss: 0.022261
Total training time: 0.22 seconds.
Norm: 12.08, NNZs: 146, Bias: -0.040000, T: 349077, Avg. loss: 0.000812
Total training time: 0.27 seconds.
-- Epoch 1
-- Epoch 1
Norm: 31.02, NNZs: 962, Bias: -0.080000, T: 349077, Avg. loss: 0.011708
Total training time: 0.27 seconds.
-- Epoch 1
-- Epoch 1


[Parallel(n_jobs=-1)]: Done  23 out of  27 | elapsed:    1.9s remaining:    0.3s


Norm: 15.62, NNZs: 244, Bias: -0.020000, T: 349077, Avg. loss: 0.000912
Total training time: 0.18 seconds.
Norm: 23.87, NNZs: 570, Bias: -0.080000, T: 349077, Avg. loss: 0.006616
Total training time: 0.19 seconds.
Norm: 20.57, NNZs: 423, Bias: -0.050000, T: 349077, Avg. loss: 0.001282
Total training time: 0.17 seconds.
Norm: 129.12, NNZs: 16671, Bias: -0.010000, T: 349077, Avg. loss: 0.042169
Total training time: 0.20 seconds.


[Parallel(n_jobs=-1)]: Done  27 out of  27 | elapsed:    2.2s finished


In [48]:
classes.remove("o")
print(classification_report(y_pred=per.predict(X_dev), y_true=y_dev, labels=classes))

                precision    recall  f1-score   support

         B-ORG       0.24      0.22      0.23       159
         I-ORG       0.24      0.10      0.14       342
B-OTHER_PERSON       0.32      0.17      0.23       276
I-OTHER_PERSON       0.37      0.17      0.24       195
     B-WITNESS       0.02      0.02      0.02        58
     I-WITNESS       0.18      0.04      0.06        54
         B-GPE       0.09      0.25      0.13       182
     B-STATUTE       0.54      0.37      0.44       222
        B-DATE       0.12      0.09      0.10       222
        I-DATE       0.39      0.15      0.22       132
   B-PROVISION       0.87      0.70      0.78       258
   I-PROVISION       0.48      0.16      0.24       772
     I-STATUTE       0.41      0.10      0.17       458
       B-COURT       0.87      0.60      0.71       178
       I-COURT       0.15      0.05      0.07       354
   B-PRECEDENT       0.14      0.05      0.07       177
   I-PRECEDENT       0.43      0.35      0.39  