# Structured Text Insights Extraction Demo

This notebook demonstrates the **Structured Text Insights Flow** using the Bloomberg Financial News dataset. 

## What You'll Learn
- How to use the structured insights flow for comprehensive text analysis
- Extract summaries, keywords, entities, and sentiment from financial news
- Analyze and visualize results across large datasets
- Extend the flow with custom blocks for domain-specific analysis

## Flow Capabilities
The structured insights flow performs **4 key analyses** on any text:
1. **📝 Summary**: Concise 2-3 sentence summaries
2. **🔑 Keywords**: Top 10 most important terms
3. **🏷️ Entities**: Named entities (people, organizations, locations)
4. **😊 Sentiment**: Emotional tone analysis (positive/negative/neutral)

All results are combined into a **structured JSON output** for easy processing and analysis.

## Setup and Installation

In [1]:
%load_ext autoreload
%autoreload 2

# pip install sdg_hub[examples]

In [2]:
# Third Party
from datasets import load_dataset
import json
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
import nest_asyncio
from datetime import datetime
import warnings
import random
warnings.filterwarnings('ignore')

# First Party
from sdg_hub import Flow, FlowRegistry

# Required for async execution in notebooks
nest_asyncio.apply()

  from .autonotebook import tqdm as notebook_tqdm


## 1. Flow Discovery and Loading

SDG Hub automatically discovers all available flows. Let's find our structured insights flow:

In [3]:
# Auto-discover all available flows
FlowRegistry.discover_flows()

# List all flows
flows = FlowRegistry.list_flows()
print(f"Available flows: {len(flows)}")
for i, flow in enumerate(flows[:10]):  # Show first 10
    print(f"{i+1}. {flow}")
if len(flows) > 10:
    print(f"... and {len(flows) - 10} more")

Available flows: 5
1. {'id': 'green-clay-812', 'name': 'Structured Text Insights Extraction Flow'}
2. {'id': 'small-rock-799', 'name': 'Advanced Document Grounded Question-Answer Generation Flow for Knowledge Tuning'}
3. {'id': 'mild-thunder-748', 'name': 'Detailed Summary Knowledge Tuning Dataset Generation Flow'}
4. {'id': 'heavy-heart-77', 'name': 'Key Facts Knowledge Tuning Dataset Generation Flow'}
5. {'id': 'epic-jade-656', 'name': 'Extractive Summary Knowledge Tuning Dataset Generation Flow'}


In [4]:
# Search for text analysis flows
text_flows = FlowRegistry.search_flows(tag="text-analysis")
print(f"Text analysis flows: {text_flows}")

# Load our structured insights flow
flow_id = "green-clay-812" 
flow_path = FlowRegistry.get_flow_path(flow_id)
flow = Flow.from_yaml(flow_path)

print(f"\n✅ Loaded flow: {flow_id}") 
print(f"📍 Flow path: {flow_path}")

Text analysis flows: [{'id': 'green-clay-812', 'name': 'Structured Text Insights Extraction Flow'}]



✅ Loaded flow: green-clay-812
📍 Flow path: /Users/shiv/workspace/sdg_hub_add-structured-summary-nb/src/sdg_hub/flows/text_analysis/structured_insights/flow.yaml


## 2. Model Configuration

The flow supports multiple LLM models. Let's configure it:

In [5]:
# Check recommended models
print("Default model:", flow.get_default_model())
print("Model recommendations:", flow.get_model_recommendations())

Default model: meta-llama/Llama-3.3-70B-Instruct
Model recommendations: {'default': 'meta-llama/Llama-3.3-70B-Instruct', 'compatible': ['microsoft/phi-4', 'mistralai/Mixtral-8x7B-Instruct-v0.1'], 'experimental': ['gpt-4o', 'gpt-oss-120b']}


In [6]:
# Configure the flow to use a specific model
# Option 1: Use a local vLLM server
flow.set_model_config(
    model="hosted_vllm/meta-llama/Llama-3.3-70B-Instruct",
    api_base="http://localhost:10000/v1",
    api_key="EMPTY",
)

# Option 2: Use OpenAI (requires API key)
# flow.set_model_config(
#     model="gpt-4o-mini",
#     api_key="your-openai-api-key"
# )

# Option 3: Use Anthropic Claude (requires API key)
# flow.set_model_config(
#     model="anthropic/claude-3-haiku",
#     api_key="your-anthropic-api-key"
# )

print("✅ Model configuration ready")

✅ Model configuration ready


## 3. Dataset Loading and Exploration

We'll use the **Bloomberg Financial News dataset** - 447k financial news articles from 2006-2013:

In [7]:
# Load the Bloomberg Financial News dataset
print("Loading Bloomberg Financial News dataset...")
dataset = load_dataset("danidanou/Bloomberg_Financial_News", split="train")

print(f"📊 Dataset size: {len(dataset):,} articles")
print(f"📅 Columns: {dataset.column_names}")
print(f"💾 Dataset features: {dataset.features}")

Loading Bloomberg Financial News dataset...
📊 Dataset size: 446,762 articles
📅 Columns: ['Headline', 'Journalists', 'Date', 'Link', 'Article']
💾 Dataset features: {'Headline': Value(dtype='string', id=None), 'Journalists': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'Date': Value(dtype='timestamp[ns]', id=None), 'Link': Value(dtype='string', id=None), 'Article': Value(dtype='string', id=None)}


In [8]:
# Explore the dataset structure
sample = dataset[0]
print("=== Sample Article ===")
print(f"Headline: {sample['Headline']}")
print(f"Date: {sample['Date']}")
print(f"Journalists: {sample['Journalists']}")
print(f"Article length: {len(sample['Article'])} characters")
print(f"Article preview: {sample['Article'][:300]}...")

=== Sample Article ===
Headline: Ivory Coast Keeps Cocoa Export Tax Below 22%, Document Shows
Date: 2011-10-06 15:14:20
Journalists: ['Baudelaire Mieu']
Article length: 2530 characters
Article preview: Export taxes on cocoa beans from Ivory Coast , the world’s biggest producer of the chocolate ingredient, won’t exceed 22 percent of the international price this season, meeting a commitment to the International Monetary Fund , according to a finance ministry document. In the 2008-9 season taxes aver...


In [9]:
# Select a small sample for demonstration (start with 50 articles)
# For production, you can process thousands of articles
sample_size = 50
demo_dataset = dataset.shuffle(seed=42).select(range(sample_size))

# The flow expects a 'text' column, so we'll use the 'Article' column
demo_dataset = demo_dataset.rename_column("Article", "text")

print(f"📝 Demo dataset prepared: {len(demo_dataset)} articles")
print(f"📊 Average article length: {sum(len(article['text']) for article in demo_dataset) / len(demo_dataset):.0f} characters")

📝 Demo dataset prepared: 50 articles
📊 Average article length: 2397 characters


## 4. Running the Structured Insights Flow

Now let's extract structured insights from our financial news articles:

In [10]:
# Generate structured insights
print("🚀 Running structured insights extraction...")
print("⏱️ This may take a few minutes depending on your model setup...")

# Run the flow
results = flow.generate(demo_dataset)

print("✅ Processing complete!")
print(f"📊 Generated insights for {len(results)} articles")
print(f"📋 Result columns: {results.column_names}")

🚀 Running structured insights extraction...
⏱️ This may take a few minutes depending on your model setup...


Map: 100%|██████████| 50/50 [00:00<00:00, 6322.63 examples/s]


Map: 100%|██████████| 50/50 [00:00<00:00, 5796.44 examples/s]


Map: 100%|██████████| 49/49 [00:00<00:00, 5589.67 examples/s]


Map: 100%|██████████| 49/49 [00:00<00:00, 5183.12 examples/s]


✅ Processing complete!
📊 Generated insights for 49 articles
📋 Result columns: ['Headline', 'Journalists', 'Date', 'Link', 'text', 'summary_prompt', 'raw_summary', 'summary', 'keywords_prompt', 'raw_keywords', 'keywords', 'entities_prompt', 'raw_entities', 'entities', 'sentiment_prompt', 'raw_sentiment', 'sentiment', 'structured_insights']


In [None]:
# Display a sample result
sample_result = results[random.randint(0, len(results) - 1)]

print("=== First Article Analysis ===")
print(f"📰 Original headline: {dataset[0]['Headline']}")
print(f"📅 Date: {dataset[0]['Date']}")
print(f"✍️ Journalists: {dataset[0]['Journalists']}")
print(f"📄 Article length: {len(sample_result['text'])} characters")
print()

# Parse and display the structured insights
insights = json.loads(sample_result["structured_insights"])
print("🔍 EXTRACTED INSIGHTS:")
print(json.dumps(insights, indent=2, ensure_ascii=False))

=== First Article Analysis ===
📰 Original headline: Ivory Coast Keeps Cocoa Export Tax Below 22%, Document Shows
📅 Date: 2011-10-06 15:14:20
✍️ Journalists: ['Baudelaire Mieu']
📄 Article length: 5080 characters

🔍 EXTRACTED INSIGHTS:
{
  "summary": "Portugal's economy is expected to shrink 2% in both 2011 and 2012, more than initially forecast, due to additional austerity measures implemented to secure a 78 billion euro international aid package. The package, which includes loans from the EU and IMF, will allow Portugal to avoid raising funds in bond markets for two years and give the government \"breathing space\" to implement spending cuts and revenue increases. The austerity measures aim to reduce the budget deficit, but are expected to have a significant impact on the domestic economy and increase unemployment to 13% in 2013.",
  "keywords": "Portugal, Austerity Measures, European Union, International Monetary Fund, Bailout, GDP, Economic Crisis, Fiscal Program, Debt Crisis, Euro A

## 5. Dynamic Flow Extension: Adding Stock Ticker Extraction

Now we'll demonstrate SDG Hub's **dynamic flow modification** capabilities. Instead of creating separate flow files, we can extend flows at runtime by adding custom processing blocks using existing SDG Hub components.

### What We'll Add:
We'll extend our structured insights flow to extract **stock ticker symbols** from financial news articles. This is perfect for Bloomberg financial news analysis!

### Approach:
We'll use three existing SDG Hub blocks:
1. **PromptBuilderBlock** - Create a prompt to extract stock tickers
2. **LLMChatBlock** - Process the extraction using the LLM
3. **TextParserBlock** - Parse the output to a clean list

Let's see how to modify flows at runtime!

In [12]:
# We'll modify the existing flow by adding our ticker extraction blocks
# First, let's examine the current flow structure
flow.print_info()

In [13]:
# Import the blocks we need
from sdg_hub.core.blocks.llm import PromptBuilderBlock, LLMChatBlock, TextParserBlock
from sdg_hub.core.blocks.transform import JSONStructureBlock

# Step 1: Add stock ticker extraction blocks to the flow
print("🚀 Adding stock ticker extraction blocks to the flow...")

# Create the stock ticker extraction blocks
ticker_prompt_block = PromptBuilderBlock(
    block_name="stock_ticker_prompt",
    input_cols=["text"],
    output_cols=["ticker_prompt"],
    prompt_config_path="extract_stock_tickers.yaml"
)

ticker_llm_block = LLMChatBlock(
    block_name="extract_stock_tickers",
    input_cols=["ticker_prompt"],
    output_cols=["raw_stock_tickers"],
    max_tokens=100,
    temperature=0.1  # Low temperature for more consistent extraction
)

ticker_parser_block = TextParserBlock(
    block_name="parse_stock_tickers",
    input_cols=["raw_stock_tickers"],
    output_cols=["stock_tickers"],
    start_tags=["[STOCK_TICKERS]"],
    end_tags=["[/STOCK_TICKERS]"]
)

print("✅ Created ticker extraction blocks:")
print(f"  1. {ticker_prompt_block.block_name} - Builds extraction prompt")
print(f"  2. {ticker_llm_block.block_name} - Extracts tickers via LLM")
print(f"  3. {ticker_parser_block.block_name} - Parses LLM output")

# Step 2: Update the JSONStructureBlock to include stock tickers
print("🔧 Updating JSON structure to include stock ticker field...")

# Create a new JSONStructureBlock configuration that includes our new stock_tickers field
enhanced_json_block = JSONStructureBlock(
    block_name="create_enhanced_structured_insights",
    input_cols=["summary", "keywords", "entities", "sentiment", "stock_tickers"],
    output_cols=["enhanced_structured_insights"]
)

print("✅ Enhanced JSON structure will include:")
print("  📝 summary - Article summary")
print("  🔑 keywords - Important keywords")
print("  🏷️ entities - Named entities")
print("  😊 sentiment - Emotional tone")
print("  📈 stock_tickers - Stock ticker symbols (NEW!)")

🚀 Adding stock ticker extraction blocks to the flow...
✅ Created ticker extraction blocks:
  1. stock_ticker_prompt - Builds extraction prompt
  2. extract_stock_tickers - Extracts tickers via LLM
  3. parse_stock_tickers - Parses LLM output
🔧 Updating JSON structure to include stock ticker field...
✅ Enhanced JSON structure will include:
  📝 summary - Article summary
  🔑 keywords - Important keywords
  🏷️ entities - Named entities
  😊 sentiment - Emotional tone
  📈 stock_tickers - Stock ticker symbols (NEW!)


In [14]:

# Remove the original JSONStructureBlock (if it exists in your flow/blocks list)
# (Assume we are not using a flow object here, just not using the old block.)

# Add the new blocks to a list for the enhanced pipeline
ticker_blocks = [
    ticker_prompt_block,
    ticker_llm_block,
    ticker_parser_block,
    enhanced_json_block
]

flow.blocks.pop()
flow.blocks.extend(ticker_blocks)
flow.print_info()


In [15]:
# Configure the new LLM blocks with our model settings
flow.set_model_config(
    model="hosted_vllm/meta-llama/Llama-3.3-70B-Instruct",
    api_base="http://localhost:10000/v1", 
    api_key="EMPTY"
)

print("\n🎯 Ready to run enhanced flow with stock ticker extraction!")


🎯 Ready to run enhanced flow with stock ticker extraction!


In [16]:
# Generate structured insights
print("🚀 Running structured insights extraction...")
print("⏱️ This may take a few minutes depending on your model setup...")

# Run the flow
results2 = flow.generate(demo_dataset)

print("✅ Processing complete!")
print(f"📊 Generated insights for {len(results2)} articles")
print(f"📋 Result columns: {results2.column_names}")

🚀 Running structured insights extraction...
⏱️ This may take a few minutes depending on your model setup...


Map: 100%|██████████| 50/50 [00:00<00:00, 6254.55 examples/s]


Map: 100%|██████████| 50/50 [00:00<00:00, 6575.38 examples/s]


Map: 100%|██████████| 49/49 [00:00<00:00, 6146.51 examples/s]


Map: 100%|██████████| 49/49 [00:00<00:00, 5205.56 examples/s]


Map: 100%|██████████| 46/46 [00:00<00:00, 4345.74 examples/s]


✅ Processing complete!
📊 Generated insights for 46 articles
📋 Result columns: ['Headline', 'Journalists', 'Date', 'Link', 'text', 'summary_prompt', 'raw_summary', 'summary', 'keywords_prompt', 'raw_keywords', 'keywords', 'entities_prompt', 'raw_entities', 'entities', 'sentiment_prompt', 'raw_sentiment', 'sentiment', 'ticker_prompt', 'raw_stock_tickers', 'stock_tickers', 'enhanced_structured_insights']


In [None]:
# Display a sample result
sample_result2 = results2[random.randint(0, len(results2) - 1)]

print("=== First Article Analysis ===")
print(f"📰 Original headline: {dataset[0]['Headline']}")
print(f"📅 Date: {dataset[0]['Date']}")
print(f"✍️ Journalists: {dataset[0]['Journalists']}")
print(f"📄 Article length: {len(sample_result2['text'])} characters")
print()

# Parse and display the structured insights
insights2 = json.loads(sample_result2["enhanced_structured_insights"])
print("🔍 EXTRACTED INSIGHTS:")
print(json.dumps(insights2, indent=2, ensure_ascii=False))

=== First Article Analysis ===
📰 Original headline: Ivory Coast Keeps Cocoa Export Tax Below 22%, Document Shows
📅 Date: 2011-10-06 15:14:20
✍️ Journalists: ['Baudelaire Mieu']
📄 Article length: 1292 characters

🔍 EXTRACTED INSIGHTS:
{
  "summary": "DTE Energy, Enbridge, and Spectra Energy have agreed to jointly develop a $1.2-1.5 billion pipeline to transport natural gas from Ohio's Utica Shale to the Midwest and eastern Canada. The proposed Nexus Gas Transmission system will span 250 miles and have a daily capacity of 1 billion cubic feet of gas. The pipeline, expected to begin operating by November 2015, aims to serve power plants and industrial customers.",
  "keywords": "Natural Gas, Utica Shale, Pipeline, DTE Energy, Enbridge, Spectra Energy, Midwest, Ontario, Energy Companies, Nexus Gas",
  "entities": "DTE Energy Co., Enbridge Inc., Spectra Energy Corp., Nexus Gas Transmission, Ohio, Utica Shale, Midwest, Canada, Michigan, Ontario, Chesapeake Energy Corp., Devon Energy Corp., E

## Next Steps

### 🧪 **Experiment Further**
1. **Scale up**: Process 100+ articles to see larger patterns
2. **Time analysis**: Filter by date ranges to see trends over time
3. **Model comparison**: Try different LLMs and compare results
4. **Custom prompts**: Modify the prompt templates for your domain

### 🔧 **Customize for Your Use Case**
1. **Domain adaptation**: Modify prompts for your specific industry
2. **Additional insights**: Add blocks for topic classification, urgency scoring, etc.
3. **Output format**: Customize JSON structure for your applications
4. **Quality filters**: Add validation and quality checks

### 🚀 Build Your Own Model
- Leverage the generated structured insights as high-quality training data for your own machine learning models.
- Fine-tune LLMs or train classifiers to automate similar analyses at scale.
- Refer to Training Hub (https://github.com/Red-Hat-AI-Innovation-Team/training_hub) to setup your own training pipeline.

### 📚 **Learn More**
- Explore other SDG Hub flows in the repository
- Check the documentation for advanced configuration options
- Join the community for questions and contributions