# Dataset Analysis: Finance-Instruct-500k

This notebook loads and explores the [Josephgflowers/Finance-Instruct-500k](https://huggingface.co/datasets/Josephgflowers/Finance-Instruct-500k) dataset to prepare training/validation subsets for fine-tuning.


In [1]:
from datasets import load_dataset
import pandas as pd


## Load Dataset


In [2]:
dataset = load_dataset("Josephgflowers/Finance-Instruct-500k")
dataset


DatasetDict({
    train: Dataset({
        features: ['system', 'user', 'assistant'],
        num_rows: 518185
    })
})

## Basic Exploration


In [3]:
# Convert to DataFrame for easier exploration
df = dataset["train"].to_pandas()
print(f"Shape: {df.shape}")
df.head()


Shape: (518185, 3)


| | system | user | assistant |
|---|---|---|---|
| 0 | \n | Explain tradeoffs between fiscal and monetary ... | Fiscal and monetary policy are the two main to... |
| 1 | \n | Explain the classical economic theory and its ... | The classical economic theory refers to the ec... |
| 2 | \n | Explain the difference between fiscal and mone... | Fiscal policy and monetary policy are the two ... |
| 3 | \n | Explain how central banks determine currency e... | 1. Interest rates: By changing their benchmark... |
| 4 | \n | Explain how interest rates change with inflati... | 1. When inflation is rising, central banks typ... |


In [4]:
# Column info
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 518185 entries, 0 to 518184
Data columns (total 3 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   system     518185 non-null  object
 1   user       518185 non-null  object
 2   assistant  518185 non-null  object
dtypes: object(3)
memory usage: 11.9+ MB


In [None]:
# Sample a few examples
for i, row in df.sample(3).iterrows():
    print("=" * 80)
    for col in df.columns:
        val = row[col]
        if isinstance(val, str) and len(val) > 500:
            val = val[:500] + "..."
        print(f"\n{col.upper()}:\n{val}")


## Identify Use Cases

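The dataset supports eight use cases (QA, Reasoning, Conversational AI, NER, Sentiment, Topic Classification, LLM Training, RAG). The cell below assigns each row a task type using keyword heuristics on the system and user prompts. Rules are checked in order and the first match wins, so the order of the checks affects the counts. There is no rule for Conversational AI or LLM Training, so those rows fall into `other`.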


In [6]:
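def classify_task(row):
    """Assign a task type from keywords in the system and user prompts.

    Rules are checked in order and the first match wins. Keywords are matched
    as substrings, so short ones such as 'ner' or 'tag' also match inside
    longer words (e.g. 'partner', 'percentage').
    """
    system = str(row.get('system', '')).lower()
    user = str(row.get('user', '')).lower()
    combined = system + ' ' + user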
    
    # Topic Classification patterns
    if any(x in combined for x in ['classify', 'categorize', 'topic', 'category']):
        if any(x in combined for x in ['topic', 'category', 'categories']):
            return 'topic_classification'
    
    # Sentiment Analysis patterns
    if any(x in combined for x in ['sentiment', 'bullish', 'bearish', 'positive', 'negative']):
        return 'sentiment_analysis'
    
    # NER patterns
    if any(x in combined for x in ['entity', 'entities', 'ner', 'extract', 'xbrl', 'tag']):
        return 'ner'
    
    # QA patterns
    if any(x in combined for x in ['question', 'answer', 'what is', 'explain', 'define']):
        return 'qa'
    
    # Reasoning patterns
    if any(x in combined for x in ['calculate', 'compute', 'analyze', 'reasoning', 'portfolio']):
        return 'reasoning'
    
    # RAG patterns (external context prepended)
    if len(user) > 1000 and 'context' in combined:
        return 'rag'
    
    return 'other'

df['task_type'] = df.apply(classify_task, axis=1)
df['task_type'].value_counts()


task_type
other                   166817
ner                     140604
qa                      109417
sentiment_analysis       55613
topic_classification     42609
reasoning                 3108
rag                         17
Name: count, dtype: int64
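
The keyword checks above use substring matching (`x in combined`), so a short keyword like `'ner'` also fires on words such as "partner". The sketch below shows one possible alternative that matches whole words with a regular expression. The helper name `has_keyword` and the example sentences are hypothetical, and the counts above were not produced with it.

```python
import re

# Hypothetical helper: match keywords as whole words instead of substrings.
def has_keyword(text, keywords):
    """Return True if any keyword occurs in text as a whole word or phrase."""
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    return re.search(pattern, text.lower()) is not None

ner_keywords = ['entity', 'entities', 'ner', 'extract', 'xbrl', 'tag']
unrelated = "our business partner owns a percentage"

# Substring matching fires on 'partner' (contains 'ner') and 'percentage' (contains 'tag').
substring_hit = any(k in unrelated for k in ner_keywords)
# Whole-word matching does not.
boundary_hit = has_keyword(unrelated, ner_keywords)
# A prompt that really asks for extraction still matches.
ner_hit = has_keyword("Extract every named entity from the filing.", ner_keywords)

print(substring_hit, boundary_hit, ner_hit)
```

The trade-off is that whole-word matching no longer catches inflected forms such as "extracted", so the keyword lists would probably need to be extended if this approach were adopted.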

## Sentiment Analysis Subset


In [7]:
# Filter for sentiment analysis examples
sentiment_df = df[df['task_type'] == 'sentiment_analysis']
print(f"Sentiment analysis samples: {len(sentiment_df)}")
sentiment_df.head(10)


Sentiment analysis samples: 55613


| | system | user | assistant | task_type |
|---|---|---|---|---|
| 209 | \n | Identify the concept of moral hazard in insura... | Moral hazard refers to the tendency for indivi... | sentiment_analysis |
| 328 | \n | Explain how an increase in workers' skills and... | Workers' skills and education is a component o... | sentiment_analysis |
| 552 | \n | Develop a mathematical model using quadratic e... | Let I(t) = inflation rate at time t\nU(t) = un... | sentiment_analysis |
| 717 | \n | While considering the determinants of long-run... | The neoclassical growth model identifies four ... | sentiment_analysis |
| 747 | \n | Explain what monetary policy refers to and 3 m... | Monetary policy refers to the actions taken by... | sentiment_analysis |
| 927 | \n | Explain what important factors contribute to c... | 1. Consumption - Consumer spending on goods an... | sentiment_analysis |
| 1041 | \n | Explain how interest rates impact the followin... | Consumption: Higher interest rates increase th... | sentiment_analysis |
| 1124 | \n | Describe three macroeconomic factors that can ... | 1. Investment - Higher levels of investment in... | sentiment_analysis |
| 1135 | \n | List three categories of sensitive information... | 1. Personal financial information - Compromise... | sentiment_analysis |
| 1145 | \n | Give a brief summary of 4 key events in United... | World War I (1914 - 1918):\nThe U.S. initially... | sentiment_analysis |


In [8]:
# Examine sentiment analysis examples
for i, row in sentiment_df.sample(min(5, len(sentiment_df))).iterrows():
    print("=" * 80)
    print(f"SYSTEM:\n{row['system'][:300] if row['system'] else 'N/A'}")
    print(f"\nUSER:\n{row['user'][:300]}")
    print(f"\nASSISTANT:\n{row['assistant'][:200]}")


SYSTEM:
You are a financial sentiment analysis expert. Your task is to analyze the sentiment expressed in the given financial text.Only reply with positive, neutral, or negative.

USER:
"The intense headwinds of high inflation, sharply higher interest and additional tax obligations are having a significant impact, leading to a sharp decline in real household disposable income." Indeed household spending was flat quarter on quarter and has barely grown for four quarters in a row, it

ASSISTANT:
positive
SYSTEM:
You are a financial expert. Your task is to answer yes/no questions based on the given headline or news content.

USER:
Answer a question about this headline:
April gold off $2.20, or 0.2%, at $1,207.40/oz.
Does the news headline talk about price staying constant?
Options:
- Yes
- No No

Answer a question about this headline:
gold prices fall in asia on profit taking, mild dollar rebound
Does the news headline talk a

ASSISTANT:
No
SYSTEM:
You are a financial sentiment analysis e

In [9]:
# Group by distinct assistant answers
sentiment_df['assistant'].value_counts()


assistant
neutral     21240
positive

In [10]:
# Create cleansed dataset with only valid sentiment labels
valid_labels = ['neutral', 'positive', 'negative', 'bullish']
sentiment_clean_df = sentiment_df[sentiment_df['assistant'].isin(valid_labels)].copy()

print(f"Original: {len(sentiment_df)} rows")
print(f"Cleansed: {len(sentiment_clean_df)} rows")
print(f"\nLabel distribution:")
sentiment_clean_df['assistant'].value_counts()


Original: 55613 rows
Cleansed: 41281 rows

Label distribution:


assistant
neutral     21240
positive     9976
negative     8226
bullish      1839
Name: count, dtype: int64
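
The exact-match filter above drops 14,332 of the 55,613 rows. Some of them are answers to other tasks (the sampled output earlier includes a yes/no question answered with "No"). Before discarding rows it can help to look at what is being dropped. The sketch below uses a small toy series rather than the real data. The assumption that some labels differ only by case or surrounding whitespace is hypothetical and would need to be checked against `sentiment_df`.

```python
import pandas as pd

# Toy stand-in for sentiment_df['assistant']; these values are illustrative only.
answers = pd.Series(['neutral', 'Positive', ' negative ', 'No', 'bullish', 'Yes'])
valid_labels = ['neutral', 'positive', 'negative', 'bullish']

# Assumption: some labels may differ only by case or surrounding whitespace.
normalized = answers.str.strip().str.lower()

is_valid = normalized.isin(valid_labels)
kept = normalized[is_valid]
dropped = normalized[~is_valid]

print("Kept:\n", kept.value_counts())
print("Dropped:\n", dropped.value_counts())
```

Running `dropped.value_counts().head(20)` on the real column would show which answer strings account for most of the removed rows.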

## Create Balanced Training Dataset (6000 samples)


In [11]:
# Add word count columns
sentiment_clean_df['user_words'] = sentiment_clean_df['user'].str.split().str.len()
sentiment_clean_df['assistant_words'] = sentiment_clean_df['assistant'].str.split().str.len()
sentiment_clean_df['total_words'] = sentiment_clean_df['user_words'] + sentiment_clean_df['assistant_words']

# Filter by max word count (~400 words ≈ 512 tokens)
max_words = 400
filtered_df = sentiment_clean_df[sentiment_clean_df['total_words'] <= max_words].copy()

print(f"After length filter: {len(filtered_df)} rows")
print(f"Per label:\n{filtered_df['assistant'].value_counts()}")


After length filter: 41281 rows
Per label:
assistant
neutral     21240
positive     9976
negative     8226
bullish      1839
Name: count, dtype: int64
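
The length filter removes no rows here (41,281 before and after), and it counts only the user and assistant text. The sketch below turns the "~400 words ≈ 512 tokens" rule of thumb into a small helper that can also include the system prompt. The function names are hypothetical, and the 1.28 tokens-per-word ratio is just the ratio implied by that comment, not a measurement from any particular tokenizer.

```python
# Ratio implied by the comment "~400 words ≈ 512 tokens"; a rule of thumb,
# not a measurement from any particular tokenizer.
TOKENS_PER_WORD = 512 / 400

def estimate_tokens(*texts):
    """Estimate the token count of one or more strings from whitespace word counts."""
    words = sum(len(t.split()) for t in texts)
    return round(words * TOKENS_PER_WORD)

def fits_budget(system, user, assistant, max_tokens=512):
    """Hypothetical check that also counts the system prompt toward the budget."""
    return estimate_tokens(system, user, assistant) <= max_tokens

headline = "Stocks mixed amid conflicting trade reports"
print(estimate_tokens(headline, "neutral"))
print(fits_budget("\n", headline, "neutral"))
```

For a real token budget, counting with the tokenizer of the model being fine-tuned would be more reliable than any word-based estimate.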


In [12]:
# Remove near-duplicates for diversity (based on first 100 chars of user text)
filtered_df['user_key'] = filtered_df['user'].str[:100]
deduped_df = filtered_df.drop_duplicates(subset=['user_key', 'assistant'])

print(f"After deduplication: {len(deduped_df)} rows")
print(f"Per label:\n{deduped_df['assistant'].value_counts()}")


After deduplication: 36268 rows
Per label:
assistant
neutral     18318
positive     8548
negative     7569
bullish      1833
Name: count, dtype: int64
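
The deduplication key above is the raw first 100 characters, so two prompts that differ only in letter case or spacing count as distinct. A possible variant, sketched on toy data below, normalizes the text before truncating. The normalization steps are an assumption about what should count as a near-duplicate and were not used to produce the counts above.

```python
import pandas as pd

# Toy data: the first two rows differ only by case and whitespace.
toy = pd.DataFrame({
    'user': [
        'Stocks rally on earnings',
        'stocks  rally on earnings ',
        'Gold falls on profit taking',
    ],
    'assistant': ['positive', 'positive', 'negative'],
})

# Hypothetical variant of user_key: lowercase, collapse whitespace, then truncate.
toy['user_key'] = (
    toy['user']
    .str.lower()
    .str.replace(r'\s+', ' ', regex=True)
    .str.strip()
    .str[:100]
)

deduped_toy = toy.drop_duplicates(subset=['user_key', 'assistant'])
print(f"{len(toy)} rows -> {len(deduped_toy)} rows")
```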


In [17]:
# Balanced sample: 1500 from each label (6000 total)
final_samples = []
for label in ['neutral', 'positive', 'negative', 'bullish']:
    label_df = deduped_df[deduped_df['assistant'] == label]
    n_sample = min(1500, len(label_df))
    sampled = label_df.sample(n=n_sample, random_state=42)
    final_samples.append(sampled)
    print(f"{label}: sampled {n_sample}")

final_df = pd.concat(final_samples, ignore_index=True)

# Drop temp columns
final_df = final_df.drop(columns=['user_key', 'task_type', 'user_words', 'assistant_words', 'total_words'])
print(f"\nFinal dataset: {len(final_df)} rows")


neutral: sampled 1500
positive: sampled 1500
negative: sampled 1500
bullish: sampled 1500

Final dataset: 6000 rows
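
The per-label loop above can also be written as a single shuffle followed by `groupby(...).head(n)`, which caps each label at `n` rows and keeps smaller groups whole. This is a sketch on toy data. It selects different rows than the loop for the same seed, so it is an alternative formulation rather than a drop-in replacement.

```python
import pandas as pd

# Toy data with unequal label counts: 5 neutral, 3 positive, 2 bullish.
toy = pd.DataFrame({
    'user': [f'headline {i}' for i in range(10)],
    'assistant': ['neutral'] * 5 + ['positive'] * 3 + ['bullish'] * 2,
})

per_label = 3
balanced = (
    toy.sample(frac=1, random_state=42)  # shuffle all rows reproducibly
       .groupby('assistant')
       .head(per_label)                  # keep at most per_label rows per label
       .reset_index(drop=True)
)

counts = balanced['assistant'].value_counts()
print(counts)
```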


In [18]:
# Final dataset stats
final_df['user_words'] = final_df['user'].str.split().str.len()
final_df['total_words'] = final_df['user_words'] + final_df['assistant'].str.split().str.len()

print("=== Final Dataset Stats ===")
print(f"Total rows: {len(final_df)}")
print(f"\nLabel distribution:\n{final_df['assistant'].value_counts()}")
print(f"\nWord count stats:")
print(final_df['total_words'].describe())
print(f"\nUser text word count stats:")
print(final_df['user_words'].describe())


=== Final Dataset Stats ===
Total rows: 6000

Label distribution:
assistant
neutral     1500
positive    1500
negative    1500
bullish     1500
Name: count, dtype: int64

Word count stats:
count    6000.000000
mean       21.793833
std        12.600574
min         2.000000
25%        12.000000
50%        19.000000
75%        28.000000
max       145.000000
Name: total_words, dtype: float64

User text word count stats:
count    6000.000000
mean       20.793833
std        12.600574
min         1.000000
25%        11.000000
50%        18.000000
75%        27.000000
max       144.000000
Name: user_words, dtype: float64


In [19]:
# Preview a few examples from each label
for label in ['neutral', 'positive', 'negative', 'bullish']:
    print(f"\n{'='*60}\n{label.upper()} EXAMPLE:\n{'='*60}")
    row = final_df[final_df['assistant'] == label].sample(1).iloc[0]
    print(f"USER ({len(row['user'].split())} words):\n{row['user'][:400]}")
    print(f"\nASSISTANT: {row['assistant']}")



NEUTRAL EXAMPLE:
USER (6 words):
Stocks mixed amid conflicting trade reports

ASSISTANT: neutral

POSITIVE EXAMPLE:
USER (32 words):
Given the latest inflation data and the tone of official commentary, we fear we were too timid, economists at Deutsche Bank led by Mark Wall said Wednesday in a report to clients.

ASSISTANT: positive

NEGATIVE EXAMPLE:
USER (28 words):
He and his team expect both companies to benefit from a resilient consumer and recent bets on generative AI, as company-specific fundamentals become a "bigger factor" next year.

ASSISTANT: negative

BULLISH EXAMPLE:
USER (14 words):
Newmont Goldcorp stock price target raised to $41.50 from $39.50 at B. Riley FBR

ASSISTANT: bullish


## Export Training Dataset


In [20]:
# Export to CSV (only core columns)
output_path = "sentiment_training_data.csv"
export_df = final_df[['system', 'user', 'assistant']].copy()
export_df.to_csv(output_path, index=False)

print(f"Exported {len(export_df)} rows to {output_path}")


Exported 6000 rows to sentiment_training_data.csv
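
The introduction mentions training and validation subsets, while the export above writes a single file. A minimal sketch of a stratified split is shown below on toy data. The 10% validation fraction and the variable names are assumptions, and `DataFrameGroupBy.sample` requires pandas 1.1 or later.

```python
import pandas as pd

# Toy stand-in for final_df: 10 rows per label.
toy = pd.DataFrame({
    'user': [f'headline {i}' for i in range(40)],
    'assistant': ['neutral', 'positive', 'negative', 'bullish'] * 10,
})

# Assumption: hold out 10% of each label for validation.
val_df = toy.groupby('assistant').sample(frac=0.1, random_state=42)

# Everything not selected for validation becomes training data.
train_df = toy.drop(val_df.index)

print(f"train: {len(train_df)} rows, validation: {len(val_df)} rows")
print(val_df['assistant'].value_counts())
```

Sampling within each label keeps the validation set balanced in the same way as the training set.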
