<a href="https://colab.research.google.com/github/docheem/NLP-Portfolio/blob/main/PR_Applying_Transformers_to_Legal_and_Financial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Applying Transformers to Legal and Financial Documents

T5: Text-To-Text Transfer Transformer

We will go through the concepts and architecture of the T5 transformer model. We will then use a pretrained T5 model from Hugging Face to summarize documents.



In [None]:
!pip install transformers
!pip install sentencepiece==0.1.94

display_architecture = False




In [None]:
import torch
import json

In [None]:
# importing the tokenizer, generation, and configuration classes


from transformers import T5Tokenizer, T5ForConditionalGeneration, T5Config


In [None]:
# Loading the pretrained T5-large conditional generation model,
# used to generate text, and the matching T5-large tokenizer

model = T5ForConditionalGeneration.from_pretrained('t5-large')

tokenizer = T5Tokenizer.from_pretrained('t5-large')



For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-large automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
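The warning above means we should not count on inputs longer than 512 tokens being cut down for us. One possible way to stay inside that window, sketched here on plain lists of token IDs, is to split a long input into windows and summarize each window separately. The helper name `chunk_token_ids` and the `overlap` parameter are our own assumptions for illustration; they are not part of the Hugging Face API, and this notebook does not use them.

```python
def chunk_token_ids(token_ids, max_length=512, overlap=0):
    """Split a list of token IDs into windows of at most max_length items.

    Hypothetical helper: consecutive windows share `overlap` IDs so that
    a sentence cut at a boundary still appears whole in one window.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must satisfy 0 <= overlap < max_length")

    step = max_length - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_length])
        # Stop once a window has reached the end of the input.
        if start + max_length >= len(token_ids):
            break
    return chunks


# Stand-in for real token IDs: 1,200 integers.
fake_ids = list(range(1200))
for window in chunk_token_ids(fake_ids, max_length=512, overlap=12):
    print(len(window), window[0], window[-1])
```

Each window could then be decoded and passed to a summarization call on its own. How the partial summaries are combined afterwards is a separate design choice.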


In [None]:
# device on which torch tensors will be allocated
device = torch.device('cpu')

# Exploring the architecture of the T5 model

In [None]:
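# Printed only when display_architecture is set to True
if display_architecture:
  print(model.config)

# Printed unconditionally, so the configuration always appears below
print(model.config)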

T5Config {
  "_name_or_path": "t5-large",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 4096,
  "d_kv": 64,
  "d_model": 1024,
  "decoder_start_token_id": 0,
  "dense_act_fn": "relu",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": false,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 24,
  "num_heads": 16,
  "num_layers": 24,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },
    "translation_en_to_de": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "p

The summarization entry under task_specific_params shows the generation settings T5 uses for this task:

- num_beams: beam search keeps the four highest-scoring candidate sequences at each step
- early_stopping: generation stops once num_beams complete candidate sequences have been found for each input in the batch
- no_repeat_ngram_size: no n-gram of this size may appear twice in the output
- min_length and max_length: bound the length of the generated text, measured in tokens
- length_penalty: adjusts beam scores according to sequence length

Vocabulary size is a topic in itself. A vocabulary that is too large leads to sparse representations, while one that is too small distorts the NLP tasks.
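Two of the generation settings listed above are easy to illustrate in plain Python. The sketch below is our own simplification and not the library's code: `banned_next_tokens` and `length_penalized_score` are hypothetical names, and the scoring formula (sum of log-probabilities divided by length raised to the penalty) is a common formulation that, to our understanding, matches how Hugging Face scores finished beams.

```python
def banned_next_tokens(generated, ngram_size):
    """Return the tokens that would repeat an n-gram already in `generated`."""
    n = ngram_size
    if n <= 0 or len(generated) < n:
        return set()

    # The last n-1 tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(generated[len(generated) - (n - 1):])
    banned = set()
    for i in range(len(generated) - n + 1):
        # If the same prefix occurred earlier, the token that followed it
        # would complete a repeated n-gram, so it is banned.
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned


def length_penalized_score(sum_logprobs, length, length_penalty):
    """Score a finished beam; assumed form: sum_logprobs / length ** penalty."""
    return sum_logprobs / (length ** length_penalty)


tokens = ["the", "cat", "sat", "on", "the"]
print(banned_next_tokens(tokens, 2))  # {'cat'}: "the cat" already occurred

# Log-probabilities are negative, so a larger penalty shrinks the magnitude
# of the score more for long sequences, which favors longer outputs.
print(length_penalized_score(-6.0, 3, 1.0))
print(length_penalized_score(-6.0, 3, 2.0))
```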

We can also inspect the details of the transformer stacks by printing the model or its parts. Uncomment any of the following lines to do so.

In [None]:
#print(model)

In [None]:
#print(model.encoder)

In [None]:
#print(model.decoder)

In [None]:
#print(model.forward)
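The dimensions reported in the configuration can be cross-checked with a little arithmetic, without loading the model. The sketch below copies the values from the T5Config output and estimates the number of weight-matrix parameters in the two stacks. It assumes bias-free linear layers and a two-matrix (non-gated) feed-forward block, consistent with "is_gated_act": false, and it ignores embeddings, layer norms, and relative position biases. The function name `stack_weight_counts` is our own.

```python
# Values copied from the T5Config output for t5-large.
config = {
    "d_model": 1024,
    "d_kv": 64,
    "d_ff": 4096,
    "num_heads": 16,
    "num_layers": 24,
    "num_decoder_layers": 24,
}


def stack_weight_counts(cfg):
    """Estimate weight-matrix parameters in the encoder and decoder stacks."""
    # Each head projects to d_kv dimensions, so the attention inner size is:
    inner = cfg["num_heads"] * cfg["d_kv"]

    # Query, key, value, and output projections (assumed to have no biases).
    attention = 4 * cfg["d_model"] * inner

    # Two matrices: d_model -> d_ff and d_ff -> d_model.
    feed_forward = 2 * cfg["d_model"] * cfg["d_ff"]

    # An encoder layer has self-attention; a decoder layer adds cross-attention.
    encoder_layer = attention + feed_forward
    decoder_layer = 2 * attention + feed_forward

    return {
        "attention_inner_dim": inner,
        "encoder": cfg["num_layers"] * encoder_layer,
        "decoder": cfg["num_decoder_layers"] * decoder_layer,
    }


counts = stack_weight_counts(config)
print(counts)
print("total:", counts["encoder"] + counts["decoder"])
```

Under these assumptions the attention inner dimension equals d_model (16 heads of size 64), and the two stacks together hold roughly 0.7 billion weights.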

# Summarizing documents with T5-large

We will create a summarizing function that we can call with any text we wish to summarize. We will summarize legal and financial examples.

In [None]:
# Creating a summarization function

def summarize(text, ml):

  # Strip leading and trailing whitespace, then remove newline characters.
  # Newlines are replaced with an empty string, so the words on either
  # side of a line break are joined together.
  preprocess_text = text.strip().replace("\n", "")

  # T5 frames every task as text-to-text and selects the task with a
  # prefix; "summarize: " asks the model for a summary.
  t5_prepared_Text = "summarize: " + preprocess_text

  # Print the prepared input so we can check what the model receives.
  print("Preprocessed and prepared text: \n", t5_prepared_Text)

  # Encode the prepared text into token IDs, returned as a PyTorch
  # tensor and moved to the selected device.
  tokenized_text = tokenizer.encode(t5_prepared_Text,
                                    return_tensors = "pt").to(device)

  # Generate the summary with beam search. ml caps the summary length
  # in tokens; the other settings set the number of beams, block repeated
  # 2-grams, enforce a minimum length, and stop once enough beams finish.
  summary_ids = model.generate(tokenized_text,
                               num_beams = 4,
                               no_repeat_ngram_size = 2,
                               min_length = 30,
                               max_length = ml,
                               early_stopping = True)

  # Decode the generated token IDs back into readable text, dropping
  # special tokens such as padding and end-of-sequence.
  output = tokenizer.decode(summary_ids[0],
                            skip_special_tokens = True)

  return output



Inside the function, the encoded text is sent to the model to generate a summary, and the generated token IDs are then decoded back into text.


The input text is the beginning of the Project Gutenberg e-book containing the Declaration of Independence of the United States of America:

In [None]:
text="""
The United States Declaration of Independence was the first Etext
released by Project Gutenberg, early in 1971.  The title was stored
in an emailed instruction set which required a tape or diskpack be
hand mounted for retrieval.  The diskpack was the size of a large
cake in a cake carrier, cost $1500, and contained 5 megabytes, of
which this file took 1-2%.  Two tape backups were kept plus one on
paper tape.  The 10,000 files we hope to have online by the end of
2001 should take about 1-2% of a comparably priced drive in 2001.
"""

In [None]:
# We can call our summarize function and send the text
# we want to summarize and the maximum length of the summary

print("Number of characters:",len(text))

summary = summarize(text,50)


print ("\n\nSummarized text: \n",summary)

Number of characters: 534
Preprocessed and prepared text: 
 summarize: The United States Declaration of Independence was the first Etextreleased by Project Gutenberg, early in 1971.  The title was storedin an emailed instruction set which required a tape or diskpack behand mounted for retrieval.  The diskpack was the size of a largecake in a cake carrier, cost $1500, and contained 5 megabytes, ofwhich this file took 1-2%.  Two tape backups were kept plus one onpaper tape.  The 10,000 files we hope to have online by the end of2001 should take about 1-2% of a comparably priced drive in 2001.


Summarized text: 
 the united states declaration of independence was the first etext published by project gutenberg, early in 1971. the 10,000 files we hope to have online by the end of2001 should take about 1-2% of a comparably priced drive in
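The prepared text above contains fused words such as "Etextreleased" and "of2001", because the function removes each newline without putting a space in its place. A possible variant of the cleaning step, shown here as a sketch and not as what the function above does, collapses every run of whitespace into a single space. The name `normalize_whitespace` is our own.

```python
def normalize_whitespace(text):
    """Collapse newlines, tabs, and repeated spaces into single spaces."""
    # str.split() with no arguments splits on any run of whitespace and
    # drops leading and trailing whitespace, so no separate strip is needed.
    return " ".join(text.split())


sample = """
The United States Declaration of Independence was the first Etext
released by Project Gutenberg, early in 1971.  The title was stored
in an emailed instruction set
"""
print(normalize_whitespace(sample))
```

Note that this also turns the double spaces after each period into single spaces, so the input the model sees would differ slightly from the one shown above.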


The Bill of Rights sample

In [None]:
# Bill of Rights sample

text ="""
No person shall be held to answer for a capital, or otherwise infamous crime,
unless on a presentment or indictment of a Grand Jury, except in cases arising
in the land or naval forces, or in the Militia, when in actual service
in time of War or public danger; nor shall any person be subject for
the same offense to be twice put in jeopardy of life or limb;
nor shall be compelled in any criminal case to be a witness against himself,
nor be deprived of life, liberty, or property, without due process of law;
nor shall private property be taken for public use without just compensation.

"""
print("Number of characters:",len(text))

summary=summarize(text,50)


print ("\n\nSummarized text: \n",summary)

Number of characters: 591
Preprocessed and prepared text: 
 summarize: No person shall be held to answer for a capital, or otherwise infamous crime,unless on a presentment or indictment of a Grand Jury, except in cases arisingin the land or naval forces, or in the Militia, when in actual servicein time of War or public danger; nor shall any person be subject forthe same offense to be twice put in jeopardy of life or limb;nor shall be compelled in any criminal case to be a witness against himself,nor be deprived of life, liberty, or property, without due process of law;nor shall private property be taken for public use without just compensation.


Summarized text: 
 no person shall be held to answer for a capital, or otherwise infamous crime, unless ona presentment or indictment ofa Grand Jury. nor shall any person be subject for the same offense to be twice put


A corporate law sample

In [None]:
# Montana Corporate Law
# https://corporations.uslegal.com/state-corporation-law/montana-corporation-law/#:~:text=Montana%20Corporation%20Law,carrying%20out%20its%20business%20activities.

text ="""The law regarding corporations prescribes that a corporation can be incorporated in the state of Montana to serve any lawful purpose.  In the state of Montana, a corporation has all the powers of a natural person for carrying out its business activities.  The corporation can sue and be sued in its corporate name.  It has perpetual succession.  The corporation can buy, sell or otherwise acquire an interest in a real or personal property.  It can conduct business, carry on operations, and have offices and exercise the powers in a state, territory or district in possession of the U.S., or in a foreign country.  It can appoint officers and agents of the corporation for various duties and fix their compensation.
The name of a corporation must contain the word “corporation” or its abbreviation “corp.”  The name of a corporation should not be deceptively similar to the name of another corporation incorporated in the same state.  It should not be deceptively identical to the fictitious name adopted by a foreign corporation having business transactions in the state.
The corporation is formed by one or more natural persons by executing and filing articles of incorporation to the secretary of state of filing.  The qualifications for directors are fixed either by articles of incorporation or bylaws.  The names and addresses of the initial directors and purpose of incorporation should be set forth in the articles of incorporation.  The articles of incorporation should contain the corporate name, the number of shares authorized to issue, a brief statement of the character of business carried out by the corporation, the names and addresses of the directors until successors are elected, and name and addresses of incorporators.  The shareholders have the power to change the size of board of directors.
"""
print("Number of characters:",len(text))

summary = summarize(text,50)


print ("\n\nSummarized text: \n",summary)
 

Number of characters: 1816
Preprocessed and prepared text: 
 summarize: The law regarding corporations prescribes that a corporation can be incorporated in the state of Montana to serve any lawful purpose.  In the state of Montana, a corporation has all the powers of a natural person for carrying out its business activities.  The corporation can sue and be sued in its corporate name.  It has perpetual succession.  The corporation can buy, sell or otherwise acquire an interest in a real or personal property.  It can conduct business, carry on operations, and have offices and exercise the powers in a state, territory or district in possession of the U.S., or in a foreign country.  It can appoint officers and agents of the corporation for various duties and fix their compensation.The name of a corporation must contain the word “corporation” or its abbreviation “corp.”  The name of a corporation should not be deceptively similar to the name of another corporation incorporated in the same s