# Codesphere - Hackathon working notebook

This notebook will serve as a working prototype for our hackathon problem statement. 

**Problem Statement:** Notes Summarizer for an Invoice collections application. 

**Description:** Our aim is to create an AI-generated summary of all the notes added against a given client. The summary will save time for the collections agent, who will no longer need to read every note in the system to understand the client's standing and past history. The summary should identify the key points from past notes, summarize them in a cohesive and readable manner, and run to no more than two paragraphs.

In [1]:
# pip install transformers
# pip install sentencepiece

In [2]:
#pip install --upgrade --user transformers sentencepiece protobuf


### Import

In [3]:
from transformers import T5Tokenizer, T5ForConditionalGeneration,BartTokenizer, BartForConditionalGeneration
import pandas as pd

We will initialize the tokenizer and the model from the `google-t5/t5-base` checkpoint.

In [4]:
tokenizer = T5Tokenizer.from_pretrained('google-t5/t5-base')
model = T5ForConditionalGeneration.from_pretrained('google-t5/t5-base')

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on google-t5/t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [5]:
# Load BART Model
bart_tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
bart_model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

For phase 1, we will maintain the note-related data in an input file and load it from there. Besides an invoice number, which we do not use yet, the file has the columns below:

* Client name - This will be our primary reference.
* Note Date - When this particular note was added. 
* Note Comment - The actual comment data that needs to be summarized. 

Note that the file can have multiple notes for the same client.
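To make the expected layout concrete, here is a minimal sketch that builds a small sample in memory and checks it for the columns the notebook relies on. The sample rows and the `missing_columns` helper are illustrative assumptions, not part of the real dataset.

```python
import pandas as pd

# Columns the notebook reads; 'Invoice Number' is in the file but unused in phase 1.
REQUIRED_COLUMNS = ["Client Name", "Note Date", "Note Comment"]


def missing_columns(frame: pd.DataFrame) -> list:
    """Return the required columns that are absent from the frame."""
    return [col for col in REQUIRED_COLUMNS if col not in frame.columns]


# Hypothetical sample: client A has two notes, client B has one.
sample_df = pd.DataFrame(
    {
        "Invoice Number": ["INV-1", "INV-1", "INV-2"],
        "Client Name": ["A", "A", "B"],
        "Note Date": pd.to_datetime(["2024-01-01", "2024-01-15", "2024-01-13"]),
        "Note Comment": [
            "Reached out to AP contact.",
            "Client advised cash crunch.",
            "Sent initial chaser.",
        ],
    }
)

print(missing_columns(sample_df))  # empty list when nothing is missing
print(sample_df["Client Name"].value_counts().to_dict())  # notes per client
```

Running a check like this right after loading the file would surface a renamed or missing column before any model work starts.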

In [6]:
# Load the Excel file
df = pd.read_excel('Dataset.xlsx', engine='openpyxl')
df.columns = ['Invoice Number','Client Name', 'Note Date', 'Note Comment']

## Data Preprocessing

We will not go deep into data preprocessing for phase 1, but we will add a date reference to each note by concatenating the note date with the comment.

To make it more human-readable, we will add some formatting. Hence, 01-Jan-2024 will read as 1st Jan and 15-Jan-2024 will read as 15th Jan. As there is no built-in function to add the ordinal indicator, we wrote our own, using [grammarly](https://www.grammarly.com/blog/how-to-write-ordinal-numbers-correctly/) as a reference :)

In [7]:
def add_ordinal_indicator(date):
    day = date.day
    if 4 <= day <= 20 or 24 <= day <= 30:
        suffix = "th"
    else:
        suffix = ["st", "nd", "rd"][day % 10 - 1]
    return f"Update On: {day}{suffix} {date.strftime('%b %Y')}, "
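As a sanity check on the suffix rule, the sketch below compares it against an alternative formulation in which the teens take "th" and the last digit decides otherwise. Both `ordinal_suffix` and `notebook_suffix` are hypothetical helpers written for this comparison; the second one reproduces the range-based rule from the cell above.

```python
def ordinal_suffix(day: int) -> str:
    """Return the English ordinal suffix for a day of the month."""
    if 11 <= day % 100 <= 13:  # 11th, 12th and 13th are irregular
        return "th"
    return {1: "st", 2: "nd", 3: "rd"}.get(day % 10, "th")


def notebook_suffix(day: int) -> str:
    """The range-based rule from the cell above, reproduced for comparison."""
    if 4 <= day <= 20 or 24 <= day <= 30:
        return "th"
    return ["st", "nd", "rd"][day % 10 - 1]


for day in (1, 2, 3, 4, 11, 21, 22, 23, 31):
    print(day, ordinal_suffix(day))

# Both rules should agree for every valid day of the month.
assert all(ordinal_suffix(d) == notebook_suffix(d) for d in range(1, 32))
```

The range-based rule is only valid for days 1 to 31, which is all a calendar date needs; the last-digit rule generalizes to larger numbers.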

Now to concatenate the notes themselves. We will create a dictionary to hold the combined notes for every client. Finally, we will pass the output to a pandas DataFrame.

In [8]:
## some pre-processing
Client_Notes = {}

df = df.sort_values(by=['Client Name', 'Note Date'], ascending=[True, True])

for index, row in df.iterrows():
    Client = row['Client Name']
    #Invoice = row['Invoice Number']
    
    # If this is the first note for this client, start with an empty string
    if Client not in Client_Notes:
        Client_Notes[Client] = ''
    
    # Siddesh - Added this check as empty notes were causing summarization issues.
    # pandas reads empty cells as NaN rather than None, so test with pd.notna.
    if pd.notna(row['Note Comment']):
        # Format the date and prepend it to the note
        note_date = add_ordinal_indicator(row['Note Date'])
        note_with_date = note_date + row['Note Comment'].strip()
        
        if note_with_date[-1] !='.':
            note_with_date = note_with_date + '.'
        if note_with_date[0] =='.':
            note_with_date = note_with_date[1:]
        
        Client_Notes[Client] += note_with_date

#summary_df = pd.DataFrame(columns=['Invoice Number','Client Name', 'Summary'])
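The same per-client concatenation can also be expressed with `groupby`, which avoids the explicit loop. This is a sketch on a tiny made-up frame; `format_note` is a hypothetical stand-in for the date-prefixing step and uses a plain `strftime` format rather than the ordinal helper.

```python
import pandas as pd

# Hypothetical sample with one empty note (None) to show that it is skipped.
notes_df = pd.DataFrame(
    {
        "Client Name": ["B", "A", "A", "A"],
        "Note Date": pd.to_datetime(
            ["2024-01-13", "2024-01-15", "2024-01-01", "2024-02-02"]
        ),
        "Note Comment": [
            "Sent initial chaser",
            "Client advised cash crunch.",
            " Reached out to AP contact",
            None,
        ],
    }
)


def format_note(row: pd.Series) -> str:
    """Prefix the comment with its date and make sure it ends with a period."""
    date_text = row["Note Date"].strftime("%d %b %Y")
    text = f"Update On: {date_text}, {row['Note Comment'].strip()}"
    return text if text.endswith(".") else text + "."


# Drop empty notes, then order each client's notes chronologically.
clean = notes_df.dropna(subset=["Note Comment"]).sort_values(
    ["Client Name", "Note Date"]
)
clean = clean.assign(Formatted=clean.apply(format_note, axis=1))

# One concatenated string per client.
client_notes = clean.groupby("Client Name")["Formatted"].agg("".join).to_dict()
print(client_notes["A"])
```

Dropping missing comments up front keeps the formatting function free of special cases.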

# Generating the summary

We will now ask the model to generate summarized text for each client, capped at 250 tokens (`max_length=250`). For T5, we prefix the concatenated notes with the task prompt "summarize: " to let the model know that we are expecting summarized output; BART receives the notes as they are. The results get saved to a new DataFrame.
We are also comparing two models:

1. T5 from Google
2. BART from Facebook

We will compare outputs from both these models to see which one performs better against the data
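One lightweight way to put a number on "performs better" is word overlap with a human-written reference summary, in the spirit of ROUGE-1. The sketch below is a simplified stand-in, not the official ROUGE implementation; `unigram_f1`, the reference text, and the candidate strings are illustrative assumptions.

```python
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_f1(candidate: str, reference: str) -> float:
    """F1 over clipped unigram counts (a simplified ROUGE-1-style score)."""
    cand, ref = Counter(tokenize(candidate)), Counter(tokenize(reference))
    overlap = sum((cand & ref).values())  # shared words, clipped by count
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference and model outputs.
reference = "Client paid in full after two partial payments."
candidates = {
    "model_1": "Client paid in full after two partial payments.",
    "model_2": "Invoices were sent to bad debt collection.",
}
for name, text in candidates.items():
    print(name, round(unigram_f1(text, reference), 2))
```

A score like this only rewards shared wording, so it should be read alongside a manual review of the summaries rather than in place of one.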



In [9]:
# Now you can generate the summary for each client and add it to the DataFrame. 
summaries = []
for client, notes in Client_Notes.items():
   
    # Generate summary using T5
    inputs = tokenizer.encode("summarize: " + notes, return_tensors='pt', max_length=1024, truncation=True)
    outputs = model.generate(
        inputs,
        max_length=250,
        min_length=40,
        length_penalty=2,
        num_beams=6,
        # do_sample=True,
        # temperature=0.7,  # Controls randomness, lower is more deterministic
        # top_k=50,         # Considers only the top k words by probability
        # top_p=0.6         # Nucleus sampling: keeps the top p probability mass
    )
    summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    # Generate summary using BART
    bart_inputs = bart_tokenizer(notes, return_tensors='pt', max_length=1024, truncation=True, padding='max_length')
    bart_outputs = bart_model.generate(
        bart_inputs['input_ids'],
        max_length=250,
        min_length=40,
        length_penalty=2,
        num_beams=6,
        # do_sample=True,
        # temperature=0.7,  # Consistency in parameters for fair comparison
        # top_k=50,
        # top_p=0.6
    )
    bart_summary = bart_tokenizer.decode(bart_outputs[0], skip_special_tokens=True)

    summaries.append({'Client Name': client, 'T5 Summary': summary,'BART Summary': bart_summary})
    
# Create DataFrame after loop
summary_df = pd.DataFrame(summaries)    

# Saving the output to file

For phase 1, we will save the output back to our dataset file under a `Summary` sheet. Note that we overwrite any previous output, as each run is considered fresh.



In [10]:
with pd.ExcelWriter('Dataset.xlsx', engine='openpyxl', mode='a', if_sheet_exists='replace') as writer:
    summary_df.to_excel(writer, sheet_name='Summary', index=False)

# Summary analysis

Let us have a peek into the summary generated by the model and see how the output compares to the input data

In [11]:
print("Input Text:")
print(Client_Notes['A'])
print("\n")
print("Output T5 summary:")
print(summary_df['T5 Summary'][0])
print("\n")
print("Output BART summary:")
print(summary_df['BART Summary'][0])

Input Text:
Update On: 1st Jan 2024, Reached out to client AP contact for payment of 10 open invoices totaling 10250 USD.Update On: 15th Jan 2024, Client advised that there is cash crunch impacting fund transfer. Partial payment expected by Jan Month end with further settlements in Feb.Update On: 2nd Feb 2024, As discussed, partial payment of 5,000 USD received via wire transfer.Update On: 14th Feb 2024, Had further follow-up with client on remaining open balance. This includes newly created 5 invoices with a value of 4000 USD leading to open balance of 12250 USD.Update On: 1st Mar 2024, Client has released another partial payment of 8000 USD. Had discussion with AP contact on settlement plan for current open balance.Update On: 15th Mar 2024, AP Contact has been changed from John to Matthew effective immediately. Matthew will be the SPOC for all payments going forward.Update On: 1st Apr 2024, Client has released full payment for all open invoices, effectively closing out the dunning pr

In [12]:
print("Input Text:")
print(Client_Notes['B'])
print("\n")
print("Output T5 summary:")
print(summary_df['T5 Summary'][1])
print("\n")
print("Output BART summary:")
print(summary_df['BART Summary'][1])

Input Text:
Update On: 13th Jan 2024, Sent initial chaser to client on outstanding balance.Update On: 1st Feb 2024, Per email from Jim (AP), B LLC has expressed an inability to pay at the moment and promised to make a payment by 1st March. EP has been looped in to advise further.Update On: 14th Feb 2024, Per EP advise, we will pause follow-ups till PTP expires.Update On: 27th Feb 2024, As per latest update from Jim, B LLC is under financial strain and might not be able to make payment on agreed upon date. They are unable to commit to a new date, but instead have mentioned payment in "near future".Update On: 13th Mar 2024, Had discussion with EP on this during weekly call. The client will be sent to bad debt collection.Update On: 17th Mar 2024, Invoices sent to Bad debt collection. Payment is not expected and might need a write-off.


Output T5 summary:
B LLC has expressed an inability to pay at the moment and promised to make a payment by 1st march. unable to commit to a new date, but 

As we can see, the summary is extractive in nature: the model has picked key points from the input text and stitched them together, largely verbatim, to generate the summary.

This is a start, but we eventually want abstractive summaries, where the model can introduce new words to form a more cohesive summary.
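How extractive a summary is can be estimated by checking what share of its word n-grams appear verbatim in the source. The sketch below does this for trigrams, using one of client B's notes and the matching sentence from the T5 output above; `copied_ngram_ratio` and the abstractive example sentence are hypothetical.

```python
import re


def ngrams(text: str, n: int) -> list:
    """Return the word n-grams of the lowercased text as tuples."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def copied_ngram_ratio(summary: str, source: str, n: int = 3) -> float:
    """Share of the summary's n-grams that also occur in the source."""
    summary_grams = ngrams(summary, n)
    if not summary_grams:
        return 0.0
    source_grams = set(ngrams(source, n))
    copied = sum(1 for gram in summary_grams if gram in source_grams)
    return copied / len(summary_grams)


source = (
    "Per email from Jim (AP), B LLC has expressed an inability to pay at the "
    "moment and promised to make a payment by 1st March."
)
extractive = (
    "B LLC has expressed an inability to pay at the moment and promised to "
    "make a payment by 1st march."
)
abstractive = "The customer cannot pay right now but pledged to settle in early March."

print(copied_ngram_ratio(extractive, source))   # every trigram is copied
print(copied_ngram_ratio(abstractive, source))  # no trigram is copied
```

A ratio near 1.0 suggests copying rather than rephrasing, so tracking it before and after fine-tuning gives a rough signal of whether the model is becoming more abstractive.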

Neither model does a perfect job, and each outperforms the other on some inputs.

So, we need more training data to fine-tune the models for abstractive summarization; we can then compare their performance and pick the best model.

# Let's try generating the summaries with a fine-tuned T5 model

For now, we have used the BBC News dataset from Hugging Face to fine-tune our T5 model. This extractive text summarization dataset contains 417 political news articles published by the BBC between 2004 and 2005, in its News Articles folder.

In [13]:
OUT_DIR = 'results_t5base'
model_path = f"{OUT_DIR}/checkpoint-1110"  # the path where we saved our fine tuned model
model = T5ForConditionalGeneration.from_pretrained(model_path)
tokenizer = T5Tokenizer.from_pretrained(OUT_DIR)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [30]:
summaries_tuned = []
for client, notes in Client_Notes.items():
   
    # Generate summary using the fine-tuned T5
    inputs = tokenizer.encode("summarize: " + notes, return_tensors='pt', max_length=1024, truncation=True)
    outputs = model.generate(
        inputs,
        max_length=250
    )
    summary_tuned = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    summaries_tuned.append({'Client Name': client, 'T5 Summary': summary_tuned})
    
# Create DataFrame after loop
summary_tuned_df = pd.DataFrame(summaries_tuned)  

In [31]:
print("Input Text:")
print(Client_Notes['A'])
print("\n")
print("Output T5 summary:")
print(summary_tuned_df['T5 Summary'][0])

Input Text:
Update On: 1st Jan 2024, Reached out to client AP contact for payment of 10 open invoices totaling 10250 USD.Update On: 15th Jan 2024, Client advised that there is cash crunch impacting fund transfer. Partial payment expected by Jan Month end with further settlements in Feb.Update On: 2nd Feb 2024, As discussed, partial payment of 5,000 USD received via wire transfer.Update On: 14th Feb 2024, Had further follow-up with client on remaining open balance. This includes newly created 5 invoices with a value of 4000 USD leading to open balance of 12250 USD.Update On: 1st Mar 2024, Client has released another partial payment of 8000 USD. Had discussion with AP contact on settlement plan for current open balance.Update On: 15th Mar 2024, AP Contact has been changed from John to Matthew effective immediately. Matthew will be the SPOC for all payments going forward.Update On: 1st Apr 2024, Client has released full payment for all open invoices, effectively closing out the dunning pr

In [32]:
print("Input Text:")
print(Client_Notes['B'])
print("\n")
print("Output T5 summary:")
print(summary_tuned_df['T5 Summary'][1])

Input Text:
Update On: 13th Jan 2024, Sent initial chaser to client on outstanding balance.Update On: 1st Feb 2024, Per email from Jim (AP), B LLC has expressed an inability to pay at the moment and promised to make a payment by 1st March. EP has been looped in to advise further.Update On: 14th Feb 2024, Per EP advise, we will pause follow-ups till PTP expires.Update On: 27th Feb 2024, As per latest update from Jim, B LLC is under financial strain and might not be able to make payment on agreed upon date. They are unable to commit to a new date, but instead have mentioned payment in "near future".Update On: 13th Mar 2024, Had discussion with EP on this during weekly call. The client will be sent to bad debt collection.Update On: 17th Mar 2024, Invoices sent to Bad debt collection. Payment is not expected and might need a write-off.


Output T5 summary:
The client will be sent to bad debt collection.Update on: 27th Mar 2024, As per latest update from Jim, B LLC is under financial strain

In [17]:
with pd.ExcelWriter('Dataset.xlsx', engine='openpyxl', mode='a', if_sheet_exists='replace') as writer:
    summary_tuned_df.to_excel(writer, sheet_name='Summary_Tuned', index=False)