# Extractive Summarization for Financial Texts

## 1. Imports & Preprocessing

In [22]:
import nltk # NLP
import re # for text cleaning
import numpy as np

# Necessary NLP resources (stopwords & tokenizer)
nltk.download('punkt')
nltk.download('stopwords') # unimportant words
nltk.download('punkt_tab') # tokenizer data required to split sentences


print("Libraries imported successfully")

Libraries imported successfully


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [25]:
# a function for preprocessing

def preprocess_text(text):

  # Remove reference markers such as [1], [2]
  text = re.sub(r'\[[0-9]+\]', ' ', text)

  # Collapse runs of whitespace (including newlines) into single spaces
  text = re.sub(r'\s+', ' ', text)

  # Separation into sentences (Tokenization)
  sentences = nltk.sent_tokenize(text)

  return sentences

In [27]:
# Test for preprocess_text func

text = """Apple Inc. (AAPL) reported its quarterly earnings today, exceeding analyst expectations.
Revenue increased by 12% year-over-year, driven by strong iPhone sales and growing services revenue.
The company also announced a $90 billion share buyback program. Investors reacted positively, pushing the stock price up by 4% in after-hours trading.
However, concerns remain about potential supply chain disruptions due to global chip shortages."""

sentences = preprocess_text(text)
print("Cleaned Sentences:\n", sentences)

Cleaned Sentences:
 ['Apple Inc. (AAPL) reported its quarterly earnings today, exceeding analyst expectations.', 'Revenue increased by 12% year-over-year, driven by strong iPhone sales and growing services revenue.', 'The company also announced a $90 billion share buyback program.', 'Investors reacted positively, pushing the stock price up by 4% in after-hours trading.', 'However, concerns remain about potential supply chain disruptions due to global chip shortages.']
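
The test text above contains no reference markers, so the second substitution is never exercised. The following is a minimal sketch of the same two regex steps on a sample that does contain them. The helper name `clean_text` and the sample string are made up for illustration, and references are removed before whitespace is collapsed so that no double spaces are left behind.

```python
import re

def clean_text(text):
    # Hypothetical helper: only the two regex steps, without sentence splitting.
    text = re.sub(r'\[[0-9]+\]', ' ', text)   # drop markers such as [1], [2]
    text = re.sub(r'\s+', ' ', text)          # collapse newlines and repeated spaces
    return text.strip()

# Made-up sample with reference markers, a newline and repeated spaces.
sample = "Revenue rose 12%[1] year over year.\nMargins   held at 38.5%[2] despite costs."
cleaned = clean_text(sample)
print(cleaned)
```

Figures such as `12%` and `38.5%` pass through unchanged because neither pattern touches digits outside square brackets.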


Summary of this section:

- We cleaned the text and divided it into sentences.

- We kept numbers and punctuation intact, so figures such as 12% and $90 billion survive cleaning.

Next Up: Analyzing sentences with TF-IDF and determining their importance levels

## 2. Analyzing Sentences with TF-IDF

What TF-IDF means:

- TF (Term Frequency): Measures how many times a word occurs in a given sentence.

- IDF (Inverse Document Frequency): Measures how rare the word is across all sentences. Here each sentence is treated as a separate document.

Result: A word that is frequent within a sentence (high TF) but rare across the other sentences (high IDF) is considered important for that sentence.
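
To make the weighting concrete, here is a small hand computation on a made-up three-sentence corpus. It assumes the smoothed IDF formula that scikit-learn's `TfidfVectorizer` applies by default, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, and leaves out the final length normalization. The corpus and the helper names are illustrative only.

```python
import math
from collections import Counter

# Made-up, already tokenized "sentences"; each one counts as a document.
docs = [
    ["apple", "earnings", "rose"],
    ["apple", "stock", "rose"],
    ["fed", "rate", "decision"],
]

n_docs = len(docs)
# Document frequency: in how many sentences does each term appear?
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed IDF (assumed to match scikit-learn's default smooth_idf=True).
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc):
    # Raw term count times IDF, before any normalization.
    tf = Counter(doc)
    return {term: count * idf(term) for term, count in tf.items()}

weights = tfidf(docs[0])
for term, weight in weights.items():
    print(f"{term}: {weight:.4f}")
```

In the first sentence, "earnings" appears in only one sentence and therefore receives a higher weight than "apple", which appears in two.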

In [56]:
# Compute score function

from sklearn.feature_extraction.text import TfidfVectorizer

def compute_sentence_scores(sentences, ngram=(1,2)):

  # Build a TF-IDF vectorizer that drops English stopwords and uses the given n-gram range
  vectorizer = TfidfVectorizer(stop_words='english', ngram_range=ngram)

  # Convert sentences to TF-IDF matrix
  sentence_vectors = vectorizer.fit_transform(sentences).toarray()

  # Calculate the average TF-IDF score for each sentence
  sentence_scores = np.mean(sentence_vectors, axis=1)

  return sentence_scores

What Is an N-gram and How to Choose It?

In TF-IDF, the ngram_range=(a, b) setting determines the shortest (a) and longest (b) word sequences that are used as features:

- (1,1): Unigrams only (single words): “Apple”, “reported”, “earnings”
- (1,2): Unigrams + bigrams (single words and word pairs): “Apple”, “reported”, “Apple reported”, “reported earnings”
- (2,2): Bigrams only (word pairs): “Apple reported”, “reported earnings”, “earnings today”
- (2,3): Bigrams + trigrams (word pairs and word triples): “Apple reported”, “reported earnings”, “Apple reported earnings”, “reported earnings today”
- (3,3): Trigrams only (word triples): “Apple reported earnings”, “reported earnings today”
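
The ranges listed above can be reproduced with a few lines of plain Python. This is only a sketch of how word n-grams are formed from an already tokenized sentence. The helper `word_ngrams` is hypothetical and is not the vectorizer's own implementation, which also lowercases the text and removes stopwords first.

```python
def word_ngrams(tokens, ngram_range):
    # Hypothetical helper: build every n-gram with min_n <= n <= max_n.
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = ["apple", "reported", "earnings", "today"]

for ngram_range in [(1, 1), (1, 2), (2, 2), (2, 3), (3, 3)]:
    print(ngram_range, word_ngrams(tokens, ngram_range))
```

Each wider range adds features, so the vocabulary (and the TF-IDF matrix) grows with the upper bound of the range.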

In [65]:
# Tests for ngram results

# Unigram
scores_unigram = compute_sentence_scores(sentences, ngram=(1,1))

# Unigram + Bigram
scores_bigram = compute_sentence_scores(sentences, ngram=(1,2))

# Bigram + Trigram
scores_trigram = compute_sentence_scores(sentences, ngram=(2,3))

print("Unigram:", scores_unigram)
print("Bigram:", scores_bigram)
print("Trigram:", scores_trigram)

Unigram: [0.04342    0.04347826 0.03834422 0.04093061 0.04342    0.04577435
 0.03827988 0.05015279]
Bigram: [0.03051895 0.03278779 0.02670779 0.02866488 0.03051895 0.03226669
 0.02668233 0.03550505]
Trigram: [0.03123374 0.03695626 0.02674697 0.02907703 0.03123374 0.03325085
 0.02674697 0.03695626]


I chose (1,2) because it offers a good balance between capturing context and keeping the feature space small.

- Bigrams capture some local context, e.g. “share buyback” or “interest rate”
- It is cheaper to compute than settings that include trigrams
- It also works well on shorter sentences
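
The raw values printed above are not directly comparable between settings, because each setting averages over a vocabulary of a different size. What matters for summarization is which sentences rank highest. The sketch below copies the printed scores (rounded as shown in the output) and lists the top three sentences per setting. The helper `top_k` is introduced here only for illustration.

```python
# Scores copied from the printed output above, keyed by ngram_range.
scores = {
    "(1,1)": [0.04342, 0.04347826, 0.03834422, 0.04093061,
              0.04342, 0.04577435, 0.03827988, 0.05015279],
    "(1,2)": [0.03051895, 0.03278779, 0.02670779, 0.02866488,
              0.03051895, 0.03226669, 0.02668233, 0.03550505],
    "(2,3)": [0.03123374, 0.03695626, 0.02674697, 0.02907703,
              0.03123374, 0.03325085, 0.02674697, 0.03695626],
}

def top_k(values, k=3):
    # Indices of the k highest scores, returned in original sentence order.
    ranked = sorted(range(len(values)), key=values.__getitem__, reverse=True)
    return sorted(ranked[:k])

for setting, values in scores.items():
    # Report 1-based sentence numbers.
    print(setting, [i + 1 for i in top_k(values)])
```

On this text all three settings pick sentences 2, 6 and 8, so here the choice of range changes the score values but not the selected sentences.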


```
sentence_vectors = vectorizer.fit_transform(sentences).toarray()
```

This line converts each sentence into a numeric vector of TF-IDF values, one entry per term in the vocabulary, so that sentences can be compared and scored mathematically. The values below are illustrative:

```
"Apple earnings increased by 12%" → [0.1, 0.2, 0.5, 0.3, 0.0, ...]
"Investors reacted positively" → [0.0, 0.1, 0.4, 0.7, 0.0, ...]
```
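
One property of averaging these vectors is worth keeping in mind. By default `TfidfVectorizer` scales every row to unit L2 length (`norm='l2'`), so the mean of a row tends to grow with the number of distinct terms in the sentence. The sketch below shows this with made-up vectors over a ten-term vocabulary. The helper names are illustrative, and equal weights are used only to keep the arithmetic simple.

```python
import math

def l2_normalize(vec):
    # Scale the vector to unit Euclidean length (no-op for an all-zero vector).
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def mean_score(vec):
    # Same idea as np.mean over one row of the TF-IDF matrix.
    return sum(vec) / len(vec)

# Made-up rows: a sentence with 2 distinct terms and one with 8 distinct terms.
short_sentence = l2_normalize([1.0] * 2 + [0.0] * 8)
long_sentence = l2_normalize([1.0] * 8 + [0.0] * 2)

print("short:", mean_score(short_sentence))
print("long: ", mean_score(long_sentence))
```

With k equal non-zero weights the mean is `sqrt(k) / vocabulary_size`, so the scoring leans toward sentences that contain more distinct non-stopword terms.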

In [66]:
# Test for compute_sentence_scores func

text = """
Apple Inc. (AAPL) reported its quarterly earnings today, exceeding analyst expectations.
Revenue increased by 12% year-over-year, driven by strong iPhone sales and growing services revenue.
The company also announced a $90 billion share buyback program.
Investors reacted positively, pushing the stock price up by 4% in after-hours trading.
However, concerns remain about potential supply chain disruptions due to global chip shortages.
Meanwhile, the technology sector showed mixed results as Microsoft and Google reported varying performance.
Some analysts remain cautious about inflation and its impact on consumer spending.
The Federal Reserve's recent interest rate decision is expected to influence stock market trends in the coming weeks.
"""

sentences = preprocess_text(text)

sentence_scores = compute_sentence_scores(sentences)

for i, score in enumerate(sentence_scores):
    print(f"Sentence {i+1}: {sentences[i]}")
    print(f"Score: {score}\n")

Sentence 1:  Apple Inc. (AAPL) reported its quarterly earnings today, exceeding analyst expectations.
Score: 0.030518953467132133

Sentence 2: Revenue increased by 12% year-over-year, driven by strong iPhone sales and growing services revenue.
Score: 0.03278779306508986

Sentence 3: The company also announced a $90 billion share buyback program.
Score: 0.026707787225659183

Sentence 4: Investors reacted positively, pushing the stock price up by 4% in after-hours trading.
Score: 0.028664880625958518

Sentence 5: However, concerns remain about potential supply chain disruptions due to global chip shortages.
Score: 0.030518953467132133

Sentence 6: Meanwhile, the technology sector showed mixed results as Microsoft and Google reported varying performance.
Score: 0.03226669211849623

Sentence 7: Some analysts remain cautious about inflation and its impact on consumer spending.
Score: 0.026682331857952716

Sentence 8: The Federal Reserve's recent interest rate decision is expected to influen

Summary of this section:

- We converted the sentences in the text to numerical vectors with TF-IDF.

- We calculated scores with three n-gram settings, (1,1), (1,2) and (2,3), to see how each rates the sentences.

- We chose the “Unigram + Bigram (1,2)” setting as a balance between context and computational cost, and will use it for summarization.

Next Up: Choosing the Most Important Sentences

## 3. Choosing the Most Important Sentences

In [71]:
import heapq # To select the highest scoring sentences

def select_top_sentences(sentences, sentence_scores, num_sentences = 3):

  # Get indexes of sentences with highest scores
  top_sentence_indices = heapq.nlargest(num_sentences, range(len(sentence_scores)), key=sentence_scores.__getitem__)

  # Return sentences sorted in original order
  summary_sentences = [sentences[i] for i in sorted(top_sentence_indices)]

  return ' '.join(summary_sentences) # merge

In [73]:
# Test

text = """Apple Inc. (AAPL) reported its quarterly earnings today, exceeding analyst expectations.
Revenue increased by 12% year-over-year, driven by strong iPhone sales and growing services revenue.
The company also announced a $90 billion share buyback program. Investors reacted positively, pushing the stock price up by 4% in after-hours trading.
However, concerns remain about potential supply chain disruptions due to global chip shortages.
Meanwhile, the technology sector showed mixed results as Microsoft and Google reported varying performance.
Some analysts remain cautious about inflation and its impact on consumer spending.
The Federal Reserve's recent interest rate decision is expected to influence stock market trends in the coming weeks.
"""

sentences = preprocess_text(text)

sentence_scores = compute_sentence_scores(sentences, ngram=(1,2))

summary = select_top_sentences(sentences, sentence_scores, num_sentences=3)

print("Summary:\n", summary)

Summary:
 Revenue increased by 12% year-over-year, driven by strong iPhone sales and growing services revenue. Meanwhile, the technology sector showed mixed results as Microsoft and Google reported varying performance. The Federal Reserve's recent interest rate decision is expected to influence stock market trends in the coming weeks.
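
The selection step is easier to follow on a tiny example. The sketch below uses made-up sentences and scores to show the two stages: `heapq.nlargest` returns the indices ordered by score, and `sorted` then restores document order before joining.

```python
import heapq

# Made-up sentences and scores for illustration.
sentences = ["A.", "B.", "C.", "D."]
scores = [0.5, 0.1, 0.9, 0.2]

# Stage 1: indices of the two highest scores, best first.
top = heapq.nlargest(2, range(len(scores)), key=scores.__getitem__)
print("By score:", top)

# Stage 2: put the chosen indices back into original order.
ordered = sorted(top)
print("By position:", ordered)

summary = " ".join(sentences[i] for i in ordered)
print(summary)
```

Restoring the original order keeps the summary readable, since the selected sentences appear in the same sequence as in the source text.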


Summary of this section:

- We identified the most important sentences using TF-IDF scores.

- We built the summary by selecting the highest-scoring sentences and restoring their original order.

- We can now automatically summarize the most critical information from financial texts.

Next Up: Web Interface

## 4. Streamlit Web Interface

In [75]:
!pip install streamlit

Collecting streamlit
  Downloading streamlit-1.42.2-py2.py3-none-any.whl.metadata (8.9 kB)
Collecting watchdog<7,>=2.1.5 (from streamlit)
  Downloading watchdog-6.0.0-py3-none-manylinux2014_x86_64.whl.metadata (44 kB)
Collecting pydeck<1,>=0.8.0b4 (from streamlit)
  Downloading pydeck-0.9.1-py2.py3-none-any.whl.metadata (4.1 kB)
Downloading streamlit-1.42.2-py2.py3-none-any.whl (9.6 MB)
Downloading pydeck-0.9.1-py2.py3-none-any.whl (6.9 MB)
Downloading watchdog-6.0.0-py3-none-manylinux2014_x86_64.whl (79 kB)

In [76]:
import streamlit as st

# Streamlit Interface
st.title("📄 Financial Text Summarization App")
st.write("We help you summarize long financial news and reports by extracting the most important information.")

# Get user input
user_input = st.text_area("📌 Please enter the financial text you want to summarize:", "")

# Summarization Button
if st.button("Summarize"):
    if user_input.strip():  # Check if input is not empty
        sentences = preprocess_text(user_input)  # Process the text
        sentence_scores = compute_sentence_scores(sentences, ngram=(1,2))  # Compute sentence scores
        summary = select_top_sentences(sentences, sentence_scores, num_sentences=3)  # Generate summary
        st.subheader("📌 Summary:")
        st.write(summary)
    else:
        st.warning("Please enter a valid text!")

2025-02-25 11:17:19.574 
  command:

    streamlit run /usr/local/lib/python3.11/dist-packages/colab_kernel_launcher.py [ARGUMENTS]
2025-02-25 11:17:19.588 Session state does not function when running a script without `streamlit run`
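
The log above indicates that the cell was executed by the notebook kernel rather than through `streamlit run`, so no app is actually served. One way around this, sketched below, is to write the interface code to its own file and launch that file from a terminal. The file name `app.py` and the module name `summarizer` (assumed to contain the three functions from the earlier sections) are hypothetical choices for this example.

```python
from pathlib import Path

# Assumption: preprocess_text, compute_sentence_scores and select_top_sentences
# have been saved in a module called summarizer.py next to app.py.
APP_SOURCE = '''\
import streamlit as st
from summarizer import preprocess_text, compute_sentence_scores, select_top_sentences

st.title("Financial Text Summarization App")
user_input = st.text_area("Please enter the financial text you want to summarize:", "")

if st.button("Summarize"):
    if user_input.strip():
        sentences = preprocess_text(user_input)
        scores = compute_sentence_scores(sentences, ngram=(1, 2))
        st.subheader("Summary:")
        st.write(select_top_sentences(sentences, scores, num_sentences=3))
    else:
        st.warning("Please enter a valid text!")
'''

# Write the script to disk; it is not executed here.
app_path = Path("app.py")
app_path.write_text(APP_SOURCE, encoding="utf-8")
print(f"Wrote {app_path}. Start it from a terminal with: streamlit run {app_path}")
```

In a hosted notebook the served page is usually not reachable without extra tunnelling, so running the command on a local machine is the simpler route.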


## Conclusion

In this project, we developed a financial text summarization web application using TF-IDF and Streamlit.

- We started by preprocessing financial texts, cleaning the data, and splitting it into sentences.

- We used TF-IDF with unigrams and bigrams to score each sentence and extract the highest-scoring ones.

- We built a user-friendly Streamlit interface, allowing users to input financial texts and generate summaries instantly.

- The project is now a functional tool for summarizing financial reports, news, and articles.