## __Legal Document Summarizer__

This model combines extractive and abstractive text summarization approaches.

__Extractive__: 
Important sentences are selected from the input text to form the summary. Most summarization approaches today are extractive in nature. The extractive step used here is TextRank, which is based on the PageRank algorithm used in the Google search engine.
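
To make the idea concrete, here is a minimal sketch of TextRank-style sentence ranking using only the standard library. It is an illustration under stated assumptions, not the code used for this project: the sentence splitter, the word-overlap similarity, and the helper names (`split_sentences`, `similarity`, `textrank`, `extract_summary`) are all hypothetical choices made for the example.

```python
import math
import re


def split_sentences(text):
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a, b):
    """Shared-word count, normalised by the log of both sentence lengths."""
    words_a = set(re.findall(r"[a-z0-9']+", a.lower()))
    words_b = set(re.findall(r"[a-z0-9']+", b.lower()))
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # avoids log(1) + log(1) == 0 in the denominator
    return len(words_a & words_b) / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences with weighted PageRank over the similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    weights = [[0.0 if i == j else similarity(sentences[i], sentences[j])
                for j in range(n)] for i in range(n)]
    out_weight = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n + damping * sum(
                weights[j][i] / out_weight[j] * scores[j]
                for j in range(n) if out_weight[j] > 0
            )
            for i in range(n)
        ]
    return scores


def extract_summary(text, top_k=2):
    """Return the top_k highest-scoring sentences in their original order."""
    sentences = split_sentences(text)
    scores = textrank(sentences)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_k]
    return [sentences[i] for i in sorted(best)]
```

Calling `extract_summary(document_text, top_k=5)` returns the five sentences that are most strongly connected to the rest of the document. A library such as `summa` provides a more complete implementation of the same idea.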

__Abstractive__:
The model forms its own phrases and sentences to offer a more coherent summary, closer to what a human would write. This approach is more appealing, but much more difficult than extractive summarization.

#### __Data Used for Training__

The Legal Case Reports Data Set contains Australian legal cases from the Federal Court of Australia (FCA).

The original documents had an average length of 15,000 characters. Using the __TextRank extractive approach__, the text was reduced to an average of 3,000 characters, composed of the top-ranking sentences in the text according to the PageRank algorithm.
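
One way to shorten a document to roughly this size is to keep the highest-ranked sentences until a character budget is used up. The sketch below assumes the sentences have already been scored (for example by TextRank). The hard 3,000-character budget and the helper name `select_within_budget` are assumptions made for illustration; the figure above is an average, and the original reduction may have used a ratio or a fixed sentence count instead.

```python
def select_within_budget(sentences, scores, max_chars=3000):
    """Keep the highest-scoring sentences whose combined length fits max_chars."""
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, used = [], 0
    for i in order:
        # One extra character accounts for the newline that joins sentences.
        cost = len(sentences[i]) + (1 if chosen else 0)
        if used + cost > max_chars:
            continue  # too long; a shorter, lower-ranked sentence may still fit
        chosen.append(i)
        used += cost
    # Restore document order so the reduced text still reads naturally.
    return "\n".join(sentences[i] for i in sorted(chosen))
```

Restoring the original sentence order matters here because the reduced text is fed to the abstractive model, which benefits from input that still reads like the source document.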

#### __Summary Prediction__

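In [5]:
import pandas as pd
# Replace every non-alphanumeric character with a space (text1 is defined in cell In [1]).
# regex=True is required; without it the pattern is compared as a literal value and nothing is replaced.
clean_text = pd.Series(text1).replace("[^a-zA-Z0-9]", " ", regex=True)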

In [9]:
print(clean_text)

0    1 this is an appeal from a judgment of riethmu...
dtype: object


In [1]:
# Sample Data for Testing

text1 = """1 this is an appeal from a judgment of riethmuller fm, given on 17 august 2005, dismissing an application for review of a decision of the refugee review tribunal ("the tribunal") to refuse the appellant the grant of a protection visa.
8 riethmuller fm then proceeded to consider the appellant's claim that there had been jurisdictional error in that the tribunal had not given him sufficient time to present his case.
his honour noted that there was correspondence from the tribunal which put the appellant on notice that the tribunal was not able to make a favourable decision upon the evidence he had provided in support of his application and that he had been given ample opportunity to provide more information.
10 the appellant filed a notice of appeal dated 22 august 2005 contending that the adjournment application should not have been refused by riethmuller fm because he was ill on the day of the hearing and he was not given sufficient time to plead his case in detail.
subsequently, the appellant filed an outline of submissions dated 19 december 2005 which largely sets out the appellant's factual background and repeats his substantive claims before riethmuller fm, namely that the tribunal failed to take his particular facts and circumstances into account, did not act in good faith and acted unreasonably.
in an outline of submissions filed on 2 december 2005, the first respondent submitted that it was open to riethmuller fm to refuse the appellant's application for an adjournment of the hearing in the federal magistrates court for the reasons given by his honour.
riethmuller fm was entitled to take the view that the most appropriate order in the interests of justice, bearing in mind the matters relied on by the appellant, the previous history of the proceedings, and the needs of other litigants and the court, was to decline to adjourn the hearing: see szbfl v refugee review tribunal [2005] fca 869 at [8] - [10] ; applicant mzqaf v minister for immigration and multicultural and indigenous affairs [2005] fca 1801 at [5] ; and nalm v minister for immigration and multicultural and indigenous affairs [2004] fcafc 17 at [22] - [26] .
13 as his honour pointed out, the tribunal was required to determine whether the appellant had a well-founded fear of persecution for a convention reason, based upon the appellant's particular claims.
the appellant did not proffer any explanation as to why the material in question had not been presented to the tribunal or at the hearing before riethmuller fm, other than to say it was not yet available to him.
15 in my opinion, the notice of appeal and the appellant's submissions, both written and oral, do not disclose any error of fact or law in the decision of riethmuller fm, or for that matter in the decision of the tribunal.

"""

summ1 = """appeal from decision of federal magistrate.
protection visa.
procedural unfairness.
whether appellant had sufficient time to plead case.
or obtain additional country information.
no error established.
migration."""

In [2]:
l_text = [text1]
l_summ = [summ1]

In [3]:
# Entity Recognition

import spacy

# Load the large English NLP model
nlp = spacy.load('en_core_web_lg')

# The text we want to examine
text = text1

# Parse the text with spaCy. This runs the entire pipeline.
doc = nlp(text)

# 'doc' now contains a parsed version of text. We can use it to do anything we want!
# For example, this will print out all the named entities that were detected:
for entity in doc.ents:
    # Skip bare numbers such as paragraph markers.
    if entity.label_ != "CARDINAL":
        print(f"{entity.text} ({entity.label_})")


17 august 2005 (DATE)
august 2005 (DATE)
the day (DATE)
19 december 2005 (DATE)
2 december 2005 (DATE)
first (ORDINAL)
2005 (DATE)
2005 (DATE)
2004 (DATE)


In [4]:
# Keywords found in text

from summa import keywords
print(keywords.keywords(text1))

visa
protection
riethmuller
fm
honour
v
facts
fact
substantive claims
claim


In [6]:
## Predicting Summary

from __future__ import print_function

import pandas as pd
from keras_text_summarization.library.rnn import RecursiveRNN1
from rouge import Rouge 
import numpy as np
import pickle

def main():
    np.random.seed(42)
    data_dir_path = './data'
    model_dir_path = './models'

    print('loading pickle file ...')
    
    with open(data_dir_path + '/summary2.pkl', 'rb') as f:
        list_of_summaries = pickle.load(f)
    with open(data_dir_path + '/text2.pkl', 'rb') as f:
        list_of_text = pickle.load(f)
    
#     X = list_of_text[0:10]
#     Y = list_of_summaries[0:10]
    
    X = l_text
    Y = l_summ

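    # The config is a pickled dict, so recent NumPy versions need allow_pickle=True to load it.
    config = np.load(RecursiveRNN1.get_config_file_path(model_dir_path=model_dir_path), allow_pickle=True).item()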

    summarizer = RecursiveRNN1(config)
    summarizer.load_weights(weight_file_path=RecursiveRNN1.get_weight_file_path(model_dir_path=model_dir_path))

    print('start predicting ...')
    orig_summary = []
    generated_summary = []
    for i in np.random.permutation(np.arange(len(X)))[0:20]:
        x = X[i]
        actual_summary = Y[i]
        orig_summary.append(actual_summary)
        
        gen_summary = summarizer.summarize(x)
        generated_summary.append(gen_summary)
        
        # print('Article: ', x)
        print('--Generated Summary: ', gen_summary)
        print('--Original Summary: ', actual_summary)
    
#     with open('gen_summary1.pkl', 'wb') as f:
#         pickle.dump(generated_summary, f)
#     with open('orig_summary1.pkl', 'wb') as f:
#         pickle.dump(orig_summary, f)


    # ROUGE scores the generated text (hypothesis) against the original summary (reference).
    # Only the last sample processed in the loop above is scored here.
    hypothesis = gen_summary
    reference = actual_summary
    rouge = Rouge()
    scores = rouge.get_scores(hypothesis, reference)
    print('\n' + 'Rouge Score:')
    print(scores)

if __name__ == '__main__':
    main()

loading pickle file ...
max_input_seq_length 500
max_target_seq_length 50
num_input_tokens 5002
num_target_tokens 1391
start predicting ...
--Generated Summary:  application for another magistrate.
protection visa.
procedural unfairness.
whether from from had error and plead error in applicant's had error in plead error in information.
no error in information.
no error in information.
no error in information.
no error in applicant's had information.
no error in applicant's had information.
no error in applicant's had information.
no error in applicant's
--Original Summary:  appeal from decision of federal magistrate.
protection visa.
procedural unfairness.
whether appellant had sufficient time to plead case.
or obtain additional country information.
no error established.
migration.

Rouge Score:
[{'rouge-1': {'f': 0.5333333285333334, 'p': 0.6666666666666666, 'r': 0.4444444444444444}, 'rouge-2': {'f': 0.285714280733028, 'p': 0.30434782608695654, 'r': 0.2692307692307692}, 'rouge-l': {'f'
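
To show what the ROUGE-1 numbers measure, here is a small standard-library sketch that computes precision, recall, and F-score from the distinct words shared by two texts. It is a simplified illustration, not the `rouge` package's implementation: the tokenisation (lower-casing, dropping full stops, splitting on whitespace) and the helper names `unigrams` and `rouge_1` are assumptions made for the example.

```python
def unigrams(text):
    """Lower-case the text, drop full stops, and return the set of distinct words."""
    return set(text.lower().replace(".", " ").split())


def rouge_1(hypothesis, reference):
    """Set-based ROUGE-1: overlap of distinct words between the two texts."""
    hyp, ref = unigrams(hypothesis), unigrams(reference)
    overlap = len(hyp & ref)
    # Precision: share of the hypothesis words that also occur in the reference.
    precision = overlap / len(hyp) if hyp else 0.0
    # Recall: share of the reference words that the hypothesis recovered.
    recall = overlap / len(ref) if ref else 0.0
    f_score = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"p": precision, "r": recall, "f": f_score}


# "no" and "error" are shared: 2 of 4 hypothesis words, 2 of 3 reference words.
print(rouge_1("no error in information.", "no error established."))
```

By a manual count under this tokenisation, the two summaries above share 12 distinct words, out of 18 in the generated summary and 27 in the original. That gives the ratios 12/18 ≈ 0.667 and 12/27 ≈ 0.444 that appear in the `rouge-1` entry.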