# Summarization of financial data using Hugging Face LLM models
This notebook introduces how to document an LLM with the ValidMind Developer Framework. The use case is the summarization of financial data (https://huggingface.co/datasets/financial_phrasebank). It covers:

- Initializing the ValidMind Developer Framework
- Running various tests to quickly generate documentation about the data and model

## Before you begin

To use the ValidMind Developer Framework with a Jupyter notebook, you need to install and initialize the client library first, along with getting your Python environment ready.

If you don't already have one, you should also [create a documentation project](https://docs.validmind.ai/guide/create-your-first-documentation-project.html) on the ValidMind platform. You will use this project to upload your documentation and test results.

## Install the client library

In [None]:
# %pip install --upgrade validmind

## Initialize the client library

In a browser, go to the **Client Integration** page of your documentation project and click **Copy to clipboard** next to the code snippet. This code snippet gives you the API key, API secret, and project identifier to link your notebook to your documentation project.

::: {.column-margin}
::: {.callout-tip}
This step requires a documentation project. [Learn how you can create one](https://docs.validmind.ai/guide/create-your-first-documentation-project.html).
:::
:::

Next, replace this placeholder with your own code snippet:

In [None]:
## Replace the code below with the code snippet from your project ## 

import validmind as vm

vm.init(
  api_host = "http://localhost:3000/api/v1/tracking",
  api_key = "2494c3838f48efe590d531bfe225d90b",
  api_secret = "4f692f8161f128414fef542cab2a4e74834c75d01b3a8e088a1834f2afcfe838",
  project = "cllnq0ckr000273y6ev40pmb5"
)

import sys
print(sys.executable)
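
Rather than pasting credentials directly into the notebook, one option is to read them from environment variables. The sketch below is a minimal illustration under stated assumptions: the variable names (`VM_API_HOST`, `VM_API_KEY`, `VM_API_SECRET`, `VM_PROJECT`) and the helper `load_vm_credentials` are hypothetical and not part of the ValidMind library. Only the keyword names passed to `vm.init` are taken from the snippet above.

```python
import os


def load_vm_credentials(env=os.environ):
    """Collect connection settings from a mapping of environment variables.

    The variable names are hypothetical; use whatever names fit your setup.
    """
    required = ["VM_API_HOST", "VM_API_KEY", "VM_API_SECRET", "VM_PROJECT"]
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    # Keys match the keyword arguments used in the vm.init(...) call above
    return {
        "api_host": env["VM_API_HOST"],
        "api_key": env["VM_API_KEY"],
        "api_secret": env["VM_API_SECRET"],
        "project": env["VM_PROJECT"],
    }


# Demonstration with a stand-in mapping instead of the real environment
example_env = {
    "VM_API_HOST": "http://localhost:3000/api/v1/tracking",
    "VM_API_KEY": "my-key",
    "VM_API_SECRET": "my-secret",
    "VM_PROJECT": "my-project",
}
credentials = load_vm_credentials(example_env)
print(sorted(credentials))
```

In the notebook, the result could then be passed on with `vm.init(**load_vm_credentials())`, which keeps the key and secret out of the saved file.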

### Import Libraries

In [None]:
from transformers import pipeline
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity
import textwrap
from tabulate import tabulate
from IPython.display import display, HTML
from rouge import Rouge
import plotly.graph_objects as go
import nltk
from nltk.corpus import stopwords
import numpy as np
import pandas as pd
from pprint import pprint
import torch
import string
import plotly.express as px
import plotly.subplots as sp
from collections import Counter
from itertools import combinations
from dataclasses import dataclass

### Preprocessing functions

In [None]:
def _format_cell_text(text, width=50):  
    """Private function to format a cell's text."""
    return '\n'.join([textwrap.fill(line, width=width) for line in text.split('\n')])

def _format_dataframe_for_tabulate(df):
    """Private function to format the entire DataFrame for tabulation."""
    df_out = df.copy()
    
    # Format all string columns
    for column in df_out.columns:
        if df_out[column].dtype == object:  # Check if column is of type object (likely strings)
            df_out[column] = df_out[column].apply(_format_cell_text)
    return df_out

def _dataframe_to_html_table(df):
    """Private function to convert a DataFrame to an HTML table."""
    headers = df.columns.tolist()
    table_data = df.values.tolist()
    return tabulate(table_data, headers=headers, tablefmt="html")

def display_formatted_dataframe(df, num_rows=None):
    """Primary function to format and display a DataFrame."""
    if num_rows is not None:
        df = df.head(num_rows)
    formatted_df = _format_dataframe_for_tabulate(df)
    html_table = _dataframe_to_html_table(formatted_df)
    display(HTML(html_table))


In [None]:
def add_summaries_to_df(df, summaries):
    """
    Adds a new column 'summary_X' to the dataframe df that contains the given summaries, where X is an incremental number.

    Parameters:
    - df: The original pandas DataFrame.
    - summaries: List/array of summarized texts.

    Returns:
    - A new DataFrame with an additional summary column, ordered as 'topic', 'input', 'reference_summary', then the summary columns, then any remaining columns.
    """

    df = df.copy()  # Make an explicit copy of the DataFrame

    # Check if the length of summaries matches the number of rows in the DataFrame
    if len(summaries) != len(df):
        raise ValueError(f"The number of summaries ({len(summaries)}) does not match the number of rows in the DataFrame ({len(df)}).")

    # Determine the name for the new summary column
    col_index = 1
    col_name = 'summary_1'
    while col_name in df.columns:
        col_index += 1
        col_name = f'summary_{col_index}'

    # Add the summaries to the DataFrame
    df[col_name] = summaries

    # Rearrange the DataFrame columns to have 'topic' first, then the original 'input', followed by summary columns
    summary_columns = [col for col in df.columns if col.startswith('summary')]
    other_columns = [col for col in df.columns if col not in summary_columns + ['topic', 'input', 'reference_summary']]
    
    columns_order = ['topic', 'input', 'reference_summary'] + sorted(summary_columns) + other_columns
    df = df[columns_order]

    return df


In [None]:
def calculate_rouge_scores(df, ref_column, gen_column, metric="rouge-2"):
    """
    Compute ROUGE scores for each row in the DataFrame.

    :param df: DataFrame containing the summaries
    :param ref_column: Column name for the reference summaries
    :param gen_column: Column name for the generated summaries
    :param metric: Type of ROUGE metric ("rouge-1", "rouge-2", "rouge-l")
    :return: DataFrame with ROUGE scores for each row
    """
    # The `rouge` package does not implement ROUGE-S, so it is not offered here.
    if metric not in ["rouge-1", "rouge-2", "rouge-l"]:
        raise ValueError("Invalid metric. Choose from 'rouge-1', 'rouge-2', 'rouge-l'.")
    
    rouge = Rouge(metrics=[metric])
    score_list = []
    
    for _, row in df.iterrows():
        scores = rouge.get_scores(row[gen_column], row[ref_column], avg=True)[metric]
        score_list.append(scores)
    
    return pd.DataFrame(score_list)

def visualize_rouge_scores(df_scores):
    """
    Visualize ROUGE scores using Plotly line plots for each row.

    :param df_scores: DataFrame of ROUGE scores.
    """
    fig = go.Figure()

    # Adding the line plots
    fig.add_trace(go.Scatter(x=df_scores.index, y=df_scores['p'], mode='lines+markers', name='Precision'))
    fig.add_trace(go.Scatter(x=df_scores.index, y=df_scores['r'], mode='lines+markers', name='Recall'))
    fig.add_trace(go.Scatter(x=df_scores.index, y=df_scores['f'], mode='lines+markers', name='F1 Score'))

    fig.update_layout(
        title="ROUGE Scores for Each Row",
        xaxis_title="Row Index",
        yaxis_title="Score"
    )
    fig.show()

### POC Validation Metrics

In [None]:
# First function
def general_text_metrics(df, text_column):
    nltk.download('punkt', quiet=True)
    
    results = []

    for text in df[text_column]:
        sentences = nltk.sent_tokenize(text)
        words = nltk.word_tokenize(text)
        paragraphs = text.split("\n\n")

        total_words = len(words)
        total_sentences = len(sentences)
        avg_sentence_length = round(sum(len(sentence.split()) for sentence in sentences) / total_sentences if total_sentences else 0, 1)
        total_paragraphs = len(paragraphs)

        results.append([total_words, total_sentences, avg_sentence_length, total_paragraphs])

    return pd.DataFrame(results, columns=["Total Words", "Total Sentences", "Avg Sentence Length", "Total Paragraphs"])

# Second function
def vocabulary_structure_metrics(df, text_column, unwanted_tokens, num_top_words, lang):
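    # The stop word lists must be downloaded before stopwords.words() can be used
    nltk.download('stopwords', quiet=True)
    stop_words = set(word.lower() for word in stopwords.words(lang))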
    unwanted_tokens = set(token.lower() for token in unwanted_tokens)

    results = []

    for text in df[text_column]:
        words = nltk.word_tokenize(text)

        filtered_words = [word for word in words if word.lower() not in stop_words and word.lower() not in unwanted_tokens and word not in string.punctuation]

        total_unique_words = len(set(filtered_words))
        total_punctuations = sum(1 for word in words if word in string.punctuation)
        lexical_diversity = round(total_unique_words / len(filtered_words) if filtered_words else 0, 1)

        results.append([total_unique_words, total_punctuations, lexical_diversity])

    return pd.DataFrame(results, columns=["Total Unique Words", "Total Punctuations", "Lexical Diversity"])

# Wrapper function that combines the outputs
def text_description_table(df, params):
    text_column = params["text_column"]
    unwanted_tokens = params["unwanted_tokens"]
    num_top_words = params["num_top_words"]
    lang = params["lang"]
    
    gen_metrics_df = general_text_metrics(df, text_column)
    vocab_metrics_df = vocabulary_structure_metrics(df, text_column, unwanted_tokens, num_top_words, lang)
    
    combined_df = pd.concat([gen_metrics_df, vocab_metrics_df], axis=1)
    
    return combined_df


In [None]:
def text_description_histograms(df, params):
    
    text_column = params["text_column"]
    num_docs_to_plot = params["num_docs_to_plot"]
    
    
    # Ensure the nltk punkt tokenizer is downloaded
    nltk.download('punkt', quiet=True)
    
    # Decide on the number of documents to plot
    if not num_docs_to_plot or num_docs_to_plot > len(df):
        num_docs_to_plot = len(df)

    # Colors for each subplot
    colors = ['blue', 'green', 'red', 'purple']

    # Axis titles for clarity
    x_titles = [
        "Word Frequencies",
        "Sentence Position in Document",
        "Sentence Lengths (Words)",
        "Word Lengths (Characters)"
    ]
    y_titles = [
        "Number of Words",
        "Sentence Length (Words)",
        "Number of Sentences",
        "Number of Words"
    ]

    # Iterate over each document in the DataFrame up to the user-specified limit
    for index, (idx, row) in enumerate(df.head(num_docs_to_plot).iterrows()):
        # Create subplots with a 2x2 grid for each metric
        fig = sp.make_subplots(
            rows=2, cols=2, 
            subplot_titles=[
                "Word Frequencies", 
                "Sentence Positions",
                "Sentence Lengths", 
                "Word Lengths"
            ]
        )
        
        # Tokenize document into sentences and words
        sentences = nltk.sent_tokenize(row[text_column])
        words = nltk.word_tokenize(row[text_column])
        
        # Metrics computation
        word_freq = Counter(words)
        freq_counts = Counter(word_freq.values())
        word_frequencies = list(freq_counts.keys())
        word_frequency_counts = list(freq_counts.values())
        
        sentence_positions = list(range(1, len(sentences) + 1))
        sentence_lengths = [len(sentence.split()) for sentence in sentences]
        word_lengths = [len(word) for word in words]
        
        # Adding data to subplots
        fig.add_trace(go.Bar(x=word_frequencies, y=word_frequency_counts, marker_color=colors[0], showlegend=False), row=1, col=1)
        fig.add_trace(go.Bar(x=sentence_positions, y=sentence_lengths, marker_color=colors[1], showlegend=False), row=1, col=2)
        fig.add_trace(go.Histogram(x=sentence_lengths, nbinsx=50, opacity=0.75, marker_color=colors[2], showlegend=False), row=2, col=1)
        fig.add_trace(go.Histogram(x=word_lengths, nbinsx=50, opacity=0.75, marker_color=colors[3], showlegend=False), row=2, col=2)

        # Update x and y axis titles
        for i, (x_title, y_title) in enumerate(zip(x_titles, y_titles)):
            # `titlefont` is deprecated in Plotly; set the font through the title dict instead
            fig['layout'][f'xaxis{i+1}'].update(title=dict(text=x_title, font=dict(size=10)))
            fig['layout'][f'yaxis{i+1}'].update(title=dict(text=y_title, font=dict(size=10)))

        # Update layout
        fig.update_layout(
            title=f"Text Description for Document {index+1}",
            barmode='overlay',
            height=800
        )
        
        fig.show()

In [None]:
# Function to plot scatter plots for specified combinations using Plotly
def text_description_scatter_plot(df, params):
    # Read the metric pairs from the params argument rather than from a global variable
    combinations_to_plot = params["combinations_to_plot"]

    for metric1, metric2 in combinations_to_plot:
        fig = px.scatter(df, x=metric1, y=metric2, title=f"Scatter Plot: {metric1} vs {metric2}")
        fig.show()

### Hugging Face summarization wrappers

The following code template showcases how to wrap a Hugging Face model for compatibility with the ValidMind Developer Framework. We will load an example model using the transformers API and then run some predictions on our test dataset.

The ValidMind Developer Framework supports Hugging Face transformers out of the box, so the following section shows how to initialize multiple transformers models with the `init_model` function, which removes the need for a custom wrapper. If you need extra pre-processing or post-processing steps, you can use the following code template as a starting point to wrap your model.

In [None]:
from dataclasses import dataclass
from typing import Any

import pandas as pd
from transformers import pipeline

@dataclass
class AbstractSummarization_HuggingFace:
    """
    A VM Model instance wrapper for abstract summarization using HuggingFace Transformers.
    """
    model: Any  # typing.Any, not the built-in any() function
    tokenizer: Any
    predicted_prob_values: list = None

    def __init__(self, model_name=None, model=None, tokenizer=None):
        pipeline_task = "summarization"
        self.model_name = model_name
        self.pipeline_task = pipeline_task
        self.model = pipeline(pipeline_task, model=model, tokenizer=tokenizer)

    def predict(self, texts, params):
        """
        Generates summaries for the given texts.
        
        Parameters:
        - texts (list): List of texts to be summarized.
        - params (dict): Generation options. Recognized keys:
            - "min_length" (int, optional): Minimum length of the produced summary.
            - "max_length" (int, optional): Maximum length of the produced summary.
        
        Returns:
        - List of summaries.
        """
        
        min_length = params.get("min_length")
        max_length = params.get("max_length")
        
        # If either value is None, don't pass it to the model function
        model_args = {}
        if min_length is not None:
            model_args["min_length"] = min_length
        if max_length is not None:
            model_args["max_length"] = max_length

        summaries = []

        for text in texts:
            data = [str(text)]
            results = self.model(data, **model_args)  # Using ** unpacking to pass arguments conditionally
            results_df = pd.DataFrame(results)
            summary = results_df["summary_text"].values[0] if "summary_text" in results_df.columns else results_df["label"].values[0]
            summaries.append(summary)

        return summaries


    def predict_proba(self):
        """
        Retrieves predicted probabilities after prediction. 
        Note: Not all models provide predicted probabilities.
        """
        if self.predicted_prob_values is None:
            raise ValueError("First run predict method to retrieve predicted probabilities")
        return self.predicted_prob_values

    def description(self):
        """
        Describes the methods available in the class.

        Returns:
        - A string describing the methods.
        """
        desc = (
            "This class provides methods for abstract summarization using HuggingFace Transformers.\n"
            "1. predict: Generates summaries for given texts. Accepts optional min_length and max_length parameters.\n"
            "2. predict_proba: Retrieves predicted probabilities after prediction (if available).\n"
        )
        return desc
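
The wrapper pattern above can be reduced to its essentials: a class that holds a summarizer and exposes `predict`, with hooks for pre-processing and post-processing. The following is a minimal sketch, not the ValidMind interface. `SummarizerWrapper` and `first_sentence` are hypothetical names, and `first_sentence` is a trivial stand-in for a real model so that the example runs without downloading anything.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SummarizerWrapper:
    """Hypothetical wrapper that adds pre/post-processing around any summarizer callable."""
    summarize_fn: Callable[[str], str]

    def _preprocess(self, text: str) -> str:
        # Collapse runs of whitespace so the summarizer sees clean input
        return " ".join(str(text).split())

    def _postprocess(self, summary: str) -> str:
        # Trim the summary and capitalize its first character
        summary = summary.strip()
        return summary[:1].upper() + summary[1:] if summary else summary

    def predict(self, texts: List[str]) -> List[str]:
        return [
            self._postprocess(self.summarize_fn(self._preprocess(text)))
            for text in texts
        ]


def first_sentence(text: str) -> str:
    """Stand-in for a real model: returns the text up to the first '. '."""
    return text.split(". ")[0]


wrapper = SummarizerWrapper(summarize_fn=first_sentence)
print(wrapper.predict(["profits   rose sharply. Analysts were surprised."]))
```

Swapping `first_sentence` for a function that calls a transformers pipeline would keep the same `predict` contract.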


In [None]:
from dataclasses import dataclass
from typing import Any

import numpy as np
import torch
from sklearn.metrics.pairwise import cosine_similarity

@dataclass
class ExtractiveSummarization_BERT:
    model: Any  # e.g. a transformers BertModel
    tokenizer: Any  # e.g. a transformers BertTokenizer

    def _get_embedding(self, text):
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True, max_length=512, padding='max_length')
        with torch.no_grad():
            output = self.model(**inputs)
        return output['last_hidden_state'].mean(dim=1).squeeze().detach().numpy()

    def predict(self, texts, params):
        summaries = []
        
        for text in texts:
            sentences = text.split('. ') # Changed to splitting at period for general sentence delineation.
            
            document_embedding = self._get_embedding(' '.join(sentences))
            sentence_embeddings = [self._get_embedding(sentence) for sentence in sentences]
            similarities = cosine_similarity(sentence_embeddings, [document_embedding])
            sorted_indices = np.argsort(similarities, axis=0)[::-1].ravel()  # ravel keeps a 1-D array even for a single sentence

            if params["method"] == "percentage":
                top_k = int(len(sentences) * params["value"])
                selected_sentences = [sentences[i] for i in sorted_indices[:top_k]]

            elif params["method"] == "fixed_sentences":
                top_k = params["value"]
                selected_sentences = [sentences[i] for i in sorted_indices[:top_k]]

            elif params["method"] == "word_count":
                selected_sentences = []
                total_words = 0
                for index in sorted_indices:
                    current_sentence_words = len(sentences[index].split())
                    # If adding the current sentence doesn't exceed the word limit, add it.
                    if total_words + current_sentence_words <= params["value"]:
                        total_words += current_sentence_words
                        selected_sentences.append(sentences[index])
                    # Once the word limit is reached or exceeded, stop adding more sentences.
                    if total_words >= params["value"]:
                        break

            else:
                raise ValueError("Invalid method specified.")
            
            summaries.append(' '.join(selected_sentences))
        
        return summaries

    def description(self):
        """ 
        Provides a description of the methods available for extractive summarization.
        
        1. Percentage: Extracts a given percentage of the total sentences from the original text.
        2. Fixed Sentences: Extracts a fixed number of sentences from the original text.
        3. Word Count: Extracts sentences up to a given word count.
        
        For all methods, sentences are ranked based on their similarity to the overall document meaning, as determined by BERT embeddings.
        """
        return self.description.__doc__
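
The ranking idea behind the extractive model (score each sentence by its cosine similarity to the whole document, then keep the top ones) can be tried without BERT. The sketch below swaps in plain bag-of-words count vectors for the BERT embeddings, so the scores will differ from the class above. All function names here are illustrative.

```python
import math
from collections import Counter


def bow_vector(text):
    """Bag-of-words counts; a simple stand-in for a BERT embedding."""
    words = (word.strip(".,").lower() for word in text.split())
    return Counter(word for word in words if word)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_sentences(text):
    """Return (score, sentence) pairs sorted from most to least similar to the document."""
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    doc_vec = bow_vector(text)
    scored = [(cosine(bow_vector(s), doc_vec), s) for s in sentences]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def extract_top_k(text, k):
    """Keep the k highest-ranked sentences, mirroring the 'fixed_sentences' method."""
    return [sentence for _, sentence in rank_sentences(text)[:k]]


sample = "Profit rose on strong sales. The weather was mild. Sales and profit both rose"
for score, sentence in rank_sentences(sample):
    print(f"{score:.2f}  {sentence}")
```

In this sample the sentence about the weather shares few words with the rest of the text, so it ranks last.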


## 0. Load Data

In this section, we load the dataset that serves as the foundation for the summarization tasks. Only the first five rows are used.

In [None]:
df = pd.read_csv('./datasets/bbc_text_cls_reference.csv')
df = df.head(5)

## 1. Extractive Summarization: Hugging Face-BERT Model

In [None]:
model = BertModel.from_pretrained("bert-base-uncased")  
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
extractive_model = ExtractiveSummarization_BERT(model=model, tokenizer=tokenizer)

In [None]:
df_results = df.copy()
data = df_results.input.values.tolist()

params = {
    "method": "percentage",
    "value": 0.4
}

list_extractive_summary = extractive_model.predict(data, params)

In [None]:
df_results = add_summaries_to_df(df_results, list_extractive_summary)
display_formatted_dataframe(df_results, num_rows=2)

## 2. Abstract Summarization: Hugging Face-T5 Model

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

abstract_model = AbstractSummarization_HuggingFace(model=model, tokenizer=tokenizer)

In [None]:
data = df_results.summary_1.values.tolist()

params = {
    "min_length": None,
    "max_length": 50
}

list_abstract_summary = abstract_model.predict(data, params)

In [None]:
df_results = add_summaries_to_df(df_results, list_abstract_summary)

display_formatted_dataframe(df_results, num_rows=2)

## Validation

In [None]:
from transformers import BertTokenizer

# Initialize the tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Input text
input_text = ("Ad sales boost Time Warner profit Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) "
              "for the three months to December, from $639m year-earlier. The firm, which is now one of the biggest investors "
              "in Google, benefited from sales of high- speed internet connections and higher advert sales For 2005, TimeWarner "
              "is projecting operating earnings growth of around 5%, and also expects higher revenue and wider profit margins. "
              "TimeWarner is to restate its accounts as part of efforts to resolve an inquiry into AOL by US market regulators "
              "TimeWarner also has to restate 2000 and 2003 results following a probe by the US Securities Exchange Commission (SEC), "
              "which is close to concluding. Time Warner's fourth quarter profits were slightly better than analysts' expectations "
              "For the full- year, TimeWarner posted a profit of $3.36bn, up 27% from its 2003 performance, while revenues grew 6.4% to $42.09bn "
              "Its profits were buoyed by one- off gains which offset a profit dip at Warner Bros, and less users for AOL. "
              "Time Warner said on Friday that it now owns 8% of search-engine Google However, the company said AOL's underlying profit before "
              "exceptional items rose 8% on the back of stronger internet advertising revenues.")

# Summary
summary = ("the dollar has hit its highest level against the euro in almost three months . the federal reserve head said the US trade deficit is set to stabilise . he highlighted the government's willingness to curb spending and rising household savings ")

# Tokenize the texts
input_tokens = tokenizer.tokenize(input_text)
summary_tokens = tokenizer.tokenize(summary)

# Print token counts
print(f"Number of tokens in input text: {len(input_tokens)}")
print(f"Number of tokens in the summary: {len(summary_tokens)}")


### Data Description

**TEXT DESCRIPTION TABLE**

- Total Words: Assess the length and complexity of the input text. Longer documents might require more sophisticated summarization techniques, while shorter ones may need more concise summaries.

- Total Sentences: Understand the structural makeup of the content. Longer texts with numerous sentences might require the model to generate longer summaries to capture essential information.

- Avg Sentence Length: Determine the average length of sentences in the text. This can help the model decide on the appropriate length for generated summaries, ensuring they are coherent and readable.

- Total Paragraphs: Analyze how the content is organized into paragraphs. The model should be able to maintain the logical structure of the content when producing summaries.

- Total Unique Words: Measure the diversity of vocabulary in the text. A higher count of unique words could indicate more complex content, which the model needs to capture accurately.

- Most Common Words: Identify frequently occurring words that likely represent key themes. The model should pay special attention to including these words and concepts in its summaries.

- Total Punctuations: Evaluate the usage of punctuation marks, which contribute to the tone and structure of the content. The model should be able to maintain appropriate punctuation in summaries.

- Lexical Diversity: Calculate the richness of vocabulary in relation to the overall text length. A higher lexical diversity suggests a broader range of ideas and concepts that the model needs to capture in its summaries.
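
To make the metrics listed above concrete, here is a minimal sketch that computes a few of them for a single string using only the standard library. It tokenizes with regular expressions instead of NLTK and does not remove stop words, so its numbers will not match `text_description_table` exactly. The function name `simple_text_metrics` is illustrative.

```python
import re


def simple_text_metrics(text):
    """Compute basic length and vocabulary metrics for one document."""
    # Split after sentence-ending punctuation followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Treat runs of letters, digits, and apostrophes as words
    words = re.findall(r"[A-Za-z0-9']+", text.lower())
    unique_words = set(words)
    return {
        "total_words": len(words),
        "total_sentences": len(sentences),
        "avg_sentence_length": round(len(words) / len(sentences), 1) if sentences else 0,
        "total_unique_words": len(unique_words),
        # Lexical diversity: unique words divided by total words
        "lexical_diversity": round(len(unique_words) / len(words), 2) if words else 0,
    }


metrics = simple_text_metrics("Profits rose. Profits rose again this year.")
print(metrics)
```

Here 5 of the 7 words are unique, giving a lexical diversity of about 0.71. Repetitive text pushes this value toward 0, and varied text pushes it toward 1.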

In [None]:
params = {
    "text_column": "raw_text",
    "unwanted_tokens": {'s', 's\'', 'mr', 'ms', 'mrs', 'dr', '\'s', ' ', "''", 'dollar', 'us', '``'},
    "num_top_words": 3,
    "lang": "english"
}

# df_results holds the input texts and the generated summaries built in the sections above
df_text_description = text_description_table(df_results, params)
display(df_text_description)

**TEXT DESCRIPTION SCATTER PLOT**

In [None]:
# Define the combinations you want to plot
combinations_to_plot = [
    ("Total Words", "Total Sentences"),
    ("Total Words", "Total Unique Words"),
    ("Total Sentences", "Avg Sentence Length"),
    ("Total Unique Words", "Lexical Diversity")
]

params = {
    "combinations_to_plot": combinations_to_plot
}

text_description_scatter_plot(df_text_description, params)

**TEXT DESCRIPTION HISTOGRAM**

- Word Frequencies: This metric provides a histogram of how often words appear with a given frequency. For example, if a lot of words appear only once in a document, it might be indicative of a text rich in unique words. On the other hand, a small set of words appearing very frequently might indicate repetitive content or a certain theme or pattern in the text.

- Sentence Positions vs. Sentence Lengths: This bar chart showcases the length of each sentence (in terms of word count) in their order of appearance in the document. This can give insights into the flow of information in a text, highlighting any long, detailed sections or brief, potentially superficial areas.

- Sentence Lengths Distribution: A histogram showing the frequency of sentence lengths across the document. Long sentences might contain a lot of information but could be harder for summarization models to digest and for readers to comprehend. Conversely, many short sentences might indicate fragmented information.

- Word Lengths Distribution: A histogram of the lengths of words in the document. Extremely long words might be anomalies, technical terms, or potential errors in the corpus. Conversely, a majority of very short words might denote lack of depth or specificity.
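
The "Word Frequencies" panel is a count of counts, which can be confusing at first. The sketch below shows the calculation on a toy sentence. It splits on whitespace rather than using the NLTK tokenizer, and the function name is illustrative.

```python
from collections import Counter


def frequency_of_frequencies(text):
    """Map each occurrence count to the number of distinct words with that count."""
    word_counts = Counter(text.lower().split())  # word -> number of occurrences
    return Counter(word_counts.values())         # occurrences -> number of words


freqs = frequency_of_frequencies("the cat saw the dog and the bird")
for occurrences, n_words in sorted(freqs.items()):
    print(f"{n_words} word(s) appear {occurrences} time(s)")
```

In this sentence five words appear once and one word ("the") appears three times, so the bar at 1 is the tallest. That shape is typical of text with many unique words.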

In [None]:
params = {
    "text_column": 'raw_text',
    "num_docs_to_plot": 2
}

# df_results holds the input texts and the generated summaries built in the sections above
text_description_histograms(df_results, params)

#### ROUGE Score 

The ROUGE score (Recall-Oriented Understudy for Gisting Evaluation) is a widely adopted set of metrics used for evaluating automatic summarization and machine translation. It fundamentally measures the overlap between the n-grams in the generated summary and those in the reference summary.

- **ROUGE-N**: This evaluates the overlap of n-grams between the produced summary and reference summary. It calculates precision (the proportion of n-grams in the generated summary that are also present in the reference summary), recall (the proportion of n-grams in the reference summary that are also present in the generated summary), and F1 score (the harmonic mean of precision and recall).

- **ROUGE-L**: This metric is based on the Longest Common Subsequence (LCS) between the generated and reference summaries. The LCS is the longest sequence of tokens that appears in both summaries in the same order, although the tokens do not need to be contiguous. It's beneficial because it can identify and reward longer coherent matching sequences.

- **ROUGE-S**: This measures the skip-bigram overlap, considering the pair of words in order as "bigrams" while allowing arbitrary gaps or "skips". This can be valuable to capture sentence-level structure similarity.
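
The precision, recall, and F1 definitions above can be worked through by hand. The following is a simplified sketch of ROUGE-N and ROUGE-L using whitespace tokens. It is meant to illustrate the definitions, not to reproduce the `rouge` package, which may differ in tokenization and in how the ROUGE-L F-score is weighted.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def precision_recall_f1(overlap, generated_total, reference_total):
    p = overlap / generated_total if generated_total else 0.0
    r = overlap / reference_total if reference_total else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"p": p, "r": r, "f": f}


def rouge_n(generated, reference, n=1):
    gen = ngram_counts(generated.split(), n)
    ref = ngram_counts(reference.split(), n)
    overlap = sum((gen & ref).values())  # clipped count of shared n-grams
    return precision_recall_f1(overlap, sum(gen.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence (order kept, gaps allowed)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(generated, reference):
    gen, ref = generated.split(), reference.split()
    return precision_recall_f1(lcs_length(gen, ref), len(gen), len(ref))


generated = "profits rose sharply last quarter"
reference = "quarterly profits rose sharply"
print("ROUGE-1:", rouge_n(generated, reference, 1))
print("ROUGE-2:", rouge_n(generated, reference, 2))
print("ROUGE-L:", rouge_l(generated, reference))
```

In this example three of the five generated unigrams appear in the four-word reference, so ROUGE-1 precision is 3/5 and recall is 3/4.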

In [None]:
metric = "rouge-l"
df_scores = calculate_rouge_scores(
    df_results,  # DataFrame built above; it holds 'reference_summary' and 'summary_1'
    ref_column="reference_summary", 
    gen_column="summary_1", 
    metric=metric)
visualize_rouge_scores(df_scores)

#### Term Frequency-Inverse Document Frequency (TF-IDF)

This is a statistical measure used to evaluate the importance of a word in a document relative to a corpus. Words with high TF-IDF scores are considered more important. By examining words with the highest scores, you can potentially identify key topics or themes in the text.

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample DataFrame
data = {
    'text': [
        'OpenAI develops artificial general intelligence.',
        'Artificial intelligence has many applications.',
        'OpenAI is leading in AI research and development.'
    ]
}

df = pd.DataFrame(data)

# Initialize TF-IDF vectorizer
vectorizer = TfidfVectorizer(stop_words='english')

# Fit and transform the text column
tfidf_matrix = vectorizer.fit_transform(df['text'])

# Create a DataFrame for TF-IDF vectors
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

# Get the top N words with highest TF-IDF score for each document
N = 3  # Change N as per your requirement

top_words = {}
for i, row in tfidf_df.iterrows():
    top_words[i] = row.nlargest(N).index.tolist()

print(top_words)
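
To see what the TF-IDF score measures, the sketch below computes it by hand with the textbook formula: term frequency multiplied by `log(N / document frequency)`. This is an illustration only. `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its values differ from these. The function name `tf_idf` is illustrative.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


docs = ["profit rose", "profit fell", "rates rose sharply"]
scores = tf_idf(docs)
for doc, doc_scores in zip(docs, scores):
    top_term = max(doc_scores, key=doc_scores.get)
    print(f"{doc!r}: top term = {top_term!r}")
```

A term that appears in every document gets a score of zero under this formula, which is why rare terms such as "fell" stand out.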
