# Codesphere - Hackathon working notebook

This notebook will serve as a working prototype for our hackathon problem statement. 

**Problem Statement:** Notes Summarizer for an Invoice collections application. 

**Description:** Our aim is to create an AI-generated summary of all the notes recorded against a given client. The summary saves time for collections agents, who no longer need to read every note in the system to understand a client's standing and history. It should identify the key points from past notes, present them in a cohesive, readable form, and run to no more than two paragraphs.

In [1]:
pip install transformers sentencepiece openpyxl

Note: you may need to restart the kernel to use updated packages.


### Import

In [2]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import pandas as pd

We will load the pretrained `t5-base` checkpoint for both the tokenizer and the model.

In [3]:
tokenizer = T5Tokenizer.from_pretrained('t5-base')
model = T5ForConditionalGeneration.from_pretrained('t5-base')

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


For phase 1, we will maintain the note-related data in an input file and load it from there. The file has the following columns:

* Client name - This will be our primary reference.
* Note Date - When this particular note was added. 
* Note Comment - The actual comment data that needs to be summarized. 

Note that the file can have multiple notes for the same client.
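
Because one client can own several rows, the order of those rows matters once the notes are concatenated. The following is a minimal sketch in plain Python, assuming the rows may arrive out of date order; the sample rows and the helper name `group_notes_by_client` are hypothetical and only illustrate the layout described above.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows mirroring the three columns described above.
rows = [
    {"Client Name": "A", "Note Date": date(2024, 1, 15), "Note Comment": "Client advised of a cash crunch."},
    {"Client Name": "B", "Note Date": date(2024, 1, 3), "Note Comment": "Invoice copy resent."},
    {"Client Name": "A", "Note Date": date(2024, 1, 1), "Note Comment": "Reached out to AP contact."},
]


def group_notes_by_client(rows):
    """Return {client: [(note_date, comment), ...]} with each list in date order."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["Client Name"]].append((row["Note Date"], row["Note Comment"]))
    for notes in grouped.values():
        notes.sort(key=lambda pair: pair[0])  # oldest note first
    return dict(grouped)


notes_by_client = group_notes_by_client(rows)
print(len(notes_by_client["A"]))    # number of notes held for client A
print(notes_by_client["A"][0][1])   # earliest comment for client A
```

Sorting before concatenation keeps the later "On 1st Jan ..., On 15th Jan ..." narrative chronological even if the spreadsheet is not.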

In [5]:
# Load the Excel file
df = pd.read_excel('Dataset.xlsx', engine='openpyxl')
df.columns = ['Client Name', 'Note Date', 'Note Comment']

## Data Preprocessing

We will not go deep into data preprocessing in phase 1. The one step we do take is to add a date reference to each note by prepending the note date to the comment.

To make the date more human-readable, we will add some formatting, so 01-Jan-2024 will read as 1st Jan 2024 and 15-Jan-2024 will read as 15th Jan 2024. There is no built-in function that adds the ordinal indicator we want, so we write our own, using [grammarly](https://www.grammarly.com/blog/how-to-write-ordinal-numbers-correctly/) as a reference :)

In [6]:
def add_ordinal_indicator(date):
    day = date.day
    if 4 <= day <= 20 or 24 <= day <= 30:
        suffix = "th"
    else:
        suffix = ["st", "nd", "rd"][day % 10 - 1]
    return f"On {day}{suffix} {date.strftime('%b %Y')}, "
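
As a quick sanity check of the suffix rule, the sketch below restates it in a standalone form and prints the suffix for the edge-case days (the teens and the days ending in 1, 2 or 3). The helper names `ordinal_suffix` and `date_prefix` are hypothetical and not part of the notebook; the output format is assumed to match the function above.

```python
from datetime import date


def ordinal_suffix(day):
    """Return the English ordinal suffix for a day of the month (1-31)."""
    if 11 <= day % 100 <= 13:  # 11th, 12th and 13th are irregular
        return "th"
    return {1: "st", 2: "nd", 3: "rd"}.get(day % 10, "th")


def date_prefix(d):
    """Build the 'On 1st Jan 2024, ' style prefix that is prepended to a note."""
    return f"On {d.day}{ordinal_suffix(d.day)} {d.strftime('%b %Y')}, "


for day in (1, 2, 3, 4, 11, 12, 13, 21, 22, 23, 30, 31):
    print(day, ordinal_suffix(day))

print(date_prefix(date(2024, 1, 15)))
```

Note that `%b` follows the active locale, so the month abbreviation is only guaranteed to be English under the default locale.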

Now to concatenate the notes themselves. We will build a dictionary that holds the concatenated notes for every client. Finally, we will set up a pandas DataFrame to hold the output.

In [7]:

client_notes = {}

for index, row in df.iterrows():
    client = row['Client Name']
    
    # If this is the first note for this client, start with an empty string
    if client not in client_notes:
        client_notes[client] = ''
    
    # Siddesh - Added this check as empty notes were causing summarization issues.
    # Empty Excel cells load as NaN rather than None, so test with pd.notna.
    if pd.notna(row['Note Comment']):
        # Format the date and prepend it to the note
        note_date = add_ordinal_indicator(row['Note Date'])
        note_with_date = note_date + str(row['Note Comment'])
        client_notes[client] += note_with_date + ' '

summary_df = pd.DataFrame(columns=['Client Name', 'Summary'])
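
One detail of the empty-note check is worth illustrating: pandas generally represents an empty spreadsheet cell as NaN (a float), not as `None`. The sketch below uses only the standard library to show the difference; the helper name `is_blank` is hypothetical, and treating whitespace-only text as blank is an extra assumption.

```python
import math


def is_blank(value):
    """True for None, NaN, or whitespace-only text."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and not value.strip()


empty_cell = float("nan")  # stand-in for how an empty cell is typically loaded

print(empty_cell is not None)                  # True, so an `is not None` test lets it through
print(is_blank(empty_cell))                    # True
print(is_blank("Partial payment received."))   # False
```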

# Generating the summary

We will now ask the model to generate a summary for each client, capped at 150 tokens. We prefix the concatenated notes with the task prompt "summarize: " so the model knows that we expect summarized output. Inputs longer than 512 tokens are truncated. Each result is saved to a new DataFrame.
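
For reference, the prompt construction and the generation settings used in the next cell can be laid out as plain data. This is only a sketch: the names `TASK_PREFIX`, `GENERATION_KWARGS` and `build_prompt` are hypothetical, and the comments reflect our reading of the `generate` arguments rather than anything measured in this notebook.

```python
TASK_PREFIX = "summarize: "  # task prefix T5 expects for summarization

# Mirrors the keyword arguments passed to model.generate in the next cell.
GENERATION_KWARGS = {
    "max_length": 150,       # upper bound on generated tokens (not characters)
    "min_length": 40,        # lower bound on generated tokens
    "length_penalty": 2.0,   # values above 0 favour longer sequences in beam search
    "num_beams": 4,          # number of beams explored in parallel
    "early_stopping": True,  # stop once enough finished candidates exist
}


def build_prompt(notes):
    """Prefix the concatenated notes with the summarization task prompt."""
    return TASK_PREFIX + notes.strip()


print(build_prompt("On 1st Jan 2024, Reached out to client AP contact. "))
```

Keeping the settings in one dictionary makes it easier to adjust the length bounds later, for example when targeting the two-paragraph limit from the problem statement.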

In [8]:
# Now you can generate the summary for each client and add it to the DataFrame. 
for client, notes in client_notes.items():
   
    inputs = tokenizer.encode("summarize: " + notes, return_tensors='pt', max_length=512, truncation=True)

    outputs = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)

    # Siddesh: The decoded output contains padding/end tags; skip_special_tokens drops them.
    summary = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()

    # DataFrame.append was removed in pandas 2.0, so concatenate a one-row frame instead
    summary_df = pd.concat([summary_df, pd.DataFrame([{'Client Name': client, 'Summary': summary}])], ignore_index=True)



# Saving the output to file

For phase 1, we will save the output back to our dataset file in a sheet named Summary. Note that any earlier output is overwritten, as each run is considered fresh.



In [10]:
with pd.ExcelWriter('Dataset.xlsx', engine='openpyxl', mode='a', if_sheet_exists='replace') as writer:
    summary_df.to_excel(writer, sheet_name='Summary', index=False)

# Summary analysis

Let us take a look at the summary generated by the model and see how it compares with the input data.

In [17]:
print("Input Text:")
print(client_notes['A'])
print("Output summary:")
print(summary_df['Summary'][0])

Input Text:
On 1st Jan 2024, Reached out to client AP contact for payment of 10 open invoices totaling 10250 USD. On 15th Jan 2024, Client advised that there is cash crunch impacting fund transfer. Partial payment expected by Jan Month end with further settlements in Feb On 2nd Feb 2024, As discussed, partial payment of 5,000 USD received via wire transfer.  On 14th Feb 2024, Had further follow-up with client on remaining open balance. This includes newly created 5 invoices with a value of 4000 USD leading to open balance of 12250 USD.  On 1st Mar 2024, Client has released another partial payment of 8000 USD. Had discussion with AP contact on settlement plan for current open balance.  On 15th Mar 2024, AP Contact has been changed from John to Matthew effective immediately. Matthew will be the SPOC for all payments going forward.  On 1st Apr 2024, Client has released full payment for all open invoices, effectively closing out the dunning process for current open AR. Expect all payments 

As we can see, the summary is extractive in nature: the model has picked key points from the input data and stitched them together to generate the summary.

This is a start, but we eventually want an abstractive summary, in which the model can introduce new words to form a more cohesive result.

T5 does have this capability, but we will need to fine-tune it for abstractive summarization by training it on a relevant dataset.
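
To make the extractive/abstractive distinction above a little more concrete, one rough heuristic is to measure how many of a summary's word n-grams are copied verbatim from the input. The sketch below is an assumption-laden illustration, not a standard metric: the helper names are hypothetical, the source sentence is taken from client A's notes, and both candidate summaries are made up for the example.

```python
def ngrams(text, n):
    """Lower-cased, whitespace-tokenized word n-grams of a text."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def copied_ngram_fraction(source, summary, n=3):
    """Fraction of the summary's n-grams that also occur in the source."""
    summary_grams = ngrams(summary, n)
    if not summary_grams:
        return 0.0
    source_grams = set(ngrams(source, n))
    copied = sum(1 for gram in summary_grams if gram in source_grams)
    return copied / len(summary_grams)


source = "Client has released another partial payment of 8000 USD."
extractive = "client has released another partial payment"  # made-up, copied wording
abstractive = "the customer paid part of the balance"       # made-up, new wording

print(copied_ngram_fraction(source, extractive))   # 1.0
print(copied_ngram_fraction(source, abstractive))  # 0.0
```

A value close to 1 suggests the summary mostly reuses the input's wording, while a lower value suggests more rephrasing.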

# Preparing a dataset to train