# Introduction

In order to train on the words in a legislative bill we want to make use of an existing Natural Language Processing (NLP) model in order to come up with a data structure that will encode contextual information.

To this end we use the BERT model, which is one of the modern NLP models that takes into account context.

For this project, we want to take the text from the bill and predict which party sponsored it

## First steps

Load the BERT model using pytorch.

Why pytorch, this is higher level libary for constructing machine learning models.

In [6]:
import os.path

import pandas as pd
import transformers
from transformers import pipeline
import pandas as pd
import numpy as np
import json

In [4]:
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = BertModel.from_pretrained("bert-base-cased")

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Now that the we have the model, we want to run inference on the text and get the output token.

In [5]:
def generate_token(summary_json):
    """
    execute the BERT model on the title section of the summary
    :param summary_json: the data from a summary file
    :return: a numpy vector
    """
    text = summary_json['text']
    encoded_input = tokenizer(text, return_tensors='pt')
    print("encoded input")
    print(encoded_input)
    print("model output")
    output = model(**encoded_input)
    last_shape = output.last_hidden_state.shape
    elements = np.cumproduct(last_shape)
    last_layer_vector = output.last_hidden_state.detach().numpy().reshape(max(elements))
    return last_layer_vector

In [11]:
parent_path = os.path.dirname(os.getcwd())
search_path = os.path.join(parent_path, "data", "extracted")
token_path = os.path.join(parent_path, "data", "tokenized")

for root, dirs, files in os.walk(search_path):
    for f in files:
        with open(os.path.join(root, f), 'r') as s_file:
            print(f'reading {f}')
            summary = json.load(s_file)
            encoding = generate_token(summary)
            output_path = os.path.join(token_path, f.replace(".json", ".npy"))
            print(f'saving {output_path}')
            np.save(output_path, encoding,allow_pickle=False)

reading summary_bill_1811_1393180.json
encoded input
{'input_ids': tensor([[  101, 11336, 26304,  1106,  1103,  5600,  1104,  1671, 11709,  3641,
          1105,  1103,  1671, 10614,  3641,   119,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
model output
saving C:\Users\benja\git-projects\bitbucket\nlp_legislation_prediction\data\tokenized\summary_bill_1811_1393180.npy
reading summary_bill_1811_1393181.json
encoded input
{'input_ids': tensor([[  101, 11336, 26304,  1106, 13178,  4237,  1107,  2078, 10713,  1411,
           117,  1278,  1629,   117,  1137,  1491,  1629,  5845,   119,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
model output
saving C:\Users\benja\git-projects\bitbucket\nlp_legislation_prediction\data\tok

# how dissimilar are the encodings

In [13]:
# lets reread in the vectors
encoded_data = {}
for root, dirs, files in os.walk(token_path):
    for f in files:
        encoding = np.load(os.path.join(root, f))
        encoded_data['source'] = f
        encoded_data['encoding'] = encoding
encoding_df = pd.DataFrame(encoded_data)

source       object
encoding    float32
dtype: object