# Data Preprocessing for hybrid model

This notebook uses the `sumy` library to preprocess the judgement data with an extractive summarization model.

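The preprocessing step shortens each judgement from roughly 16k words (an unverified estimate) to a fixed number of extracted sentences (500 originally; `NUM_SENTENCES` is set to 100 in the run below), which are then used as input for the abstractive model. Because the abstractive model receives a much shorter input, this step should reduce inference time substantially.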

The extractive model used here is `LsaSummarizer`, which applies Latent Semantic Analysis (LSA) to extract the most important sentences from a document. LSA is a widely used technique in natural language processing that identifies hidden patterns in text by analyzing the relationships between words and documents.
https://reintech.io/blog/how-to-create-a-text-summarization-tool-with-sumy-tutorial-for-developers
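To make the idea behind LSA more concrete, here is a minimal sketch of LSA-style sentence ranking with NumPy. It is not the `sumy` implementation: the function name `lsa_rank_sentences`, the regex tokenizer, the `num_topics` parameter, and the scoring rule (sentence length in the top-k topic space, weighted by the singular values) are all assumptions chosen for illustration.

```python
import re
import numpy as np

def lsa_rank_sentences(sentences, num_sentences, num_topics=2):
    """Pick the highest-scoring sentences with a simplified LSA scheme (illustrative only)."""
    # Tokenize into lowercase words (a crude tokenizer, assumed for this sketch).
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    vocab = sorted({w for words in tokenized for w in words})
    row = {w: i for i, w in enumerate(vocab)}

    # Term-by-sentence count matrix: rows are words, columns are sentences.
    matrix = np.zeros((len(vocab), len(sentences)))
    for col, words in enumerate(tokenized):
        for w in words:
            matrix[row[w], col] += 1.0

    # SVD splits the matrix into word-topic and topic-sentence factors.
    _, sigma, vt = np.linalg.svd(matrix, full_matrices=False)
    k = min(num_topics, len(sigma))

    # Score each sentence by its singular-value-weighted length in topic space.
    scores = np.sqrt(((sigma[:k, None] ** 2) * (vt[:k, :] ** 2)).sum(axis=0))

    # Keep the best sentences, then restore the original document order.
    best = sorted(np.argsort(-scores)[:num_sentences])
    return [sentences[i] for i in best]

doc = [
    "The appellant filed a civil appeal against the judgment.",
    "The court examined the evidence presented by the appellant.",
    "The weather was pleasant on the day of the hearing.",
    "The appeal was dismissed by the court with costs.",
]
print(lsa_rank_sentences(doc, num_sentences=2))
```

The selected sentences are returned in document order so that the extract still reads like a shortened version of the original text.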

We expect this to be helpful for legal documents such as judgements, since judgements usually follow an implicit structure: evidence, case, issues, analysis, decision, etc.


# Connect to Google Drive

In [None]:
# Load the Drive helper and mount
from google.colab import drive

# This will prompt for authorization.
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Read in Train and Test data directly from csv

In [None]:
# Read in CSV data
import pandas as pd
train_df = pd.read_csv("/content/drive/MyDrive/W266 Final Project/data/train_data.csv")
train_df_filter = train_df[['index', 'judgement','summary']]
test_df = pd.read_csv("/content/drive/MyDrive/W266 Final Project/data/test_data.csv")
test_df_filter = test_df[['index', 'judgement','summary']]

# Randomly Select 1000 Records from train_data_LSA_extractive_500.csv


In [None]:
import pandas as pd
train_data = pd.read_csv("/content/drive/My Drive/W266 Final Project/output/train_data_LSA_extractive_500.csv")
train_data = train_data.dropna()
# # Check for zero nan values in df
# sum(train_data['Summary'].isnull())
train_data1000 = train_data.sample(n=1000, random_state=42)
train_data1000

Unnamed: 0,Index,Summary,ExtractiveSummary
7257,4809.txt,The first respondent joined M.B.B.S. course of...,Civil Appeal No. 2828 of 1977. Appeal by Speci...
4353,543.txt,"On the death of R, a Hindu jat, in April or Ma...",Civil Appeal No. 137 of 1953. Appeal from the ...
4072,372.txt,An application was filed by the first responde...,Appeal No. 312 of 1955. On appeal by special l...
132,uksc-2010-0177.txt,Scottish Widows Plc (Scottish Widows) is a lif...,This is an appeal from an interlocutor of the ...
7505,6391.txt,The appellant and the respondents applied for ...,vil Appeal Nos. 16 16 17 of 1990. From the Jud...
...,...,...,...
7028,329.txt,The exercise of the power conferred on the Reg...,Civil Appeal No. 116 of 1953. Appeal from the ...
3682,1148.txt,The Government of Jammu and Kashmir on the bas...,Appeal No. 31 of 1957. Appeal from the judgmen...
3437,5900.txt,The respondent company manufactures ossein and...,n that the products must contain visible piece...
4105,3856.txt,"Respondents are the ex proprietors, and occupa...",CIVIL Appeal No 2475 of 1968. From the Judgmen...


In [None]:
# Save the 1000-record sample next to the other outputs
output_file_path = '/content/drive/MyDrive/W266 Final Project/output/train_data_1000.csv'

train_data1000.to_csv(output_file_path, index=False)

In [None]:
len(train_data1000)

1000

# Setup Sumy

In [None]:
!pip install sumy

Collecting sumy
  Downloading sumy-0.11.0-py2.py3-none-any.whl (97 kB)
Collecting docopt<0.7,>=0.6.1 (from sumy)
  Downloading docopt-0.6.2.tar.gz (25 kB)
  Preparing metadata (setup.py) ... done
Collecting breadability>=0.1.20 (from sumy)
  Downloading breadability-0.1.20.tar.gz (32 kB)
  Preparing metadata (setup.py) ... done
Collecting pycountry>=18.2.23 (from sumy)
  Downloading pycountry-22.3.5.tar.gz (10.1 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: breadability, docopt, pycountry
  Building wheel for breadability (setup.py) ...

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

# Generate extractive summary

In [None]:
NUM_SENTENCES = 100

In [None]:
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

# Initialize the summarizer with the LsaSummarizer algorithm
summarizer = LsaSummarizer()

# Build the tokenizer once and reuse it for every judgement
tokenizer = Tokenizer("english")

output = []

print("train dataset length:", len(train_df_filter))

for judgement in train_df_filter['judgement']:
  parser = PlaintextParser.from_string(judgement, tokenizer)

  # Summarize the judgement and keep the NUM_SENTENCES most important sentences
  summary = summarizer(parser.document, NUM_SENTENCES)
  summary_sentences = " ".join([str(sentence) for sentence in summary])
  output.append(summary_sentences)

print("Summarization generated for:", len(output))

train dataset length: 7723
Summarization generated for: 7723


In [None]:
import csv

output_file_path = f'/content/drive/MyDrive/W266 Final Project/output/train_data_LSA_extractive_{NUM_SENTENCES}.csv'

with open(output_file_path, 'w', newline='', encoding='utf-8') as csvfile:
    # Create a CSV writer object
    csv_writer = csv.writer(csvfile)

    # Write the header row (optional, if you want to include column headers)
    csv_writer.writerow(['Index', 'Summary', 'ExtractiveSummary'])

    # Write the data rows (index, reference summary, and extractive summary)
    for i in range(len(output)):
        # Row i pairs the i-th judgement's index and reference summary with its extractive summary
        csv_writer.writerow([train_df_filter["index"][i], train_df_filter['summary'][i], output[i]])

print("Summary sentences written to:", output_file_path)

Summary sentences written to: /content/drive/MyDrive/W266 Final Project/output/train_data_LSA_extractive_100.csv


# Count average number of words

In [None]:
import pandas as pd
extractive = pd.read_csv(f"/content/drive/MyDrive/W266 Final Project/output/train_data_LSA_extractive_{NUM_SENTENCES}.csv")
extractive = extractive[['Index', 'ExtractiveSummary']]

In [None]:
words_list = extractive['ExtractiveSummary'][0].split()
print(len(words_list))

3415


In [None]:
total_word_count = 0

for summary in extractive['ExtractiveSummary']:
  word_count = len(summary.split())
  total_word_count += word_count

print(total_word_count)
print(len(extractive))
print(total_word_count / len(extractive))


20350581
7723
2635.0616340800207
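The same average can be computed with pandas string methods instead of an explicit loop. The sketch below is one possible variant; the helper name `average_word_count` and the demo data are hypothetical. Unlike the loop above, it skips missing values instead of failing on them, which may matter if some extractive summaries are empty.

```python
import pandas as pd

def average_word_count(texts):
    """Mean whitespace-token count per text, ignoring missing values."""
    # .str.split() tokenizes on whitespace; .str.len() counts tokens per row.
    # Missing entries stay NaN and are excluded by .mean().
    return pd.Series(texts, dtype="object").str.split().str.len().mean()

# Hypothetical demo data: two summaries and one missing value.
demo = pd.Series(["Civil Appeal No. 1 of 1953.", None, "Appeal dismissed."])
print(average_word_count(demo))
```

On the real data this would be called as `average_word_count(extractive['ExtractiveSummary'])`.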


# Hybrid Model (100-sentence) Evaluation

In [None]:
import pandas as pd
extractive = pd.read_csv(f"/content/drive/MyDrive/W266 Final Project/output/train_data_LSA_extractive_{NUM_SENTENCES}.csv")
extractive = extractive[['Index', 'Summary', 'ExtractiveSummary']]

In [None]:
!pip install -q evaluate
!pip install rouge_score

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... done
  Created wheel for rouge_sco

In [None]:
import evaluate
rouge = evaluate.load('rouge')
# Pass plain Python lists of strings to the metric
results = rouge.compute(predictions=extractive["ExtractiveSummary"].tolist(),
                        references=extractive["Summary"].tolist())
print(results)
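To help interpret the scores, here is a minimal sketch of what ROUGE-1 measures: unigram overlap between a prediction and a reference. It is not the `evaluate` implementation. The function name `rouge1` and the lowercase whitespace tokenization are assumptions, and the library also reports other variants such as ROUGE-2 and ROUGE-L.

```python
from collections import Counter

def rouge1(prediction, reference):
    """Unigram-overlap precision, recall, and F1 (simplified ROUGE-1 sketch)."""
    # Lowercase whitespace tokenization is an assumption of this sketch.
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((pred_counts & ref_counts).values())

    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge1("the court dismissed the appeal", "the appeal was dismissed")
print(scores)
```

Because a 100-sentence extract is usually much longer than the reference summary, we would expect recall to be high relative to precision when reading the results.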

# Archived

In [None]:
# rename column

import pandas as pd
output = pd.read_csv("/content/drive/MyDrive/W266 Final Project/output/train_data_LSA_extractive_500_old.csv")
output = output['BaselineSummary']


import csv

output_file_path = '/content/drive/MyDrive/W266 Final Project/output/train_data_LSA_extractive_500.csv'

with open(output_file_path, 'w', newline='', encoding='utf-8') as csvfile:
    # Create a CSV writer object
    csv_writer = csv.writer(csvfile)

    # Write the header row (optional, if you want to include column headers)
    csv_writer.writerow(['Index', 'Summary', 'ExtractiveSummary'])

    # Write the data rows (judgement and its corresponding summary)
    for i in range(len(output)):
        # Write the judgement and its summary to the CSV file
        csv_writer.writerow([train_df_filter["index"][i], train_df_filter['summary'][i], output[i]])

print("Summary sentences written to:", output_file_path)

Summary sentences written to: /content/drive/MyDrive/W266 Final Project/output/train_data_LSA_extractive_500.csv
