# **Financial Article Classification Using Machine Learning**
#### *by: Abba Ali-Concern*

---
### Introduction

Financial news and commentary are published in large volumes every day, which makes automatic classification useful for managing content efficiently. This project builds a machine learning model that categorizes financial articles using natural language processing (NLP) techniques.

This notebook covers data sourcing, preprocessing, and a comparison of classification models to determine which is most accurate for the task. It also addresses two challenges: finding relevant labeled data and extracting features that improve model performance.

In [34]:
# Import packages
from datasets import load_dataset
import pandas as pd

# Load the dataset from Hugging Face
dataset = load_dataset('zeroshot/twitter-financial-news-topic')

# Convert the 'train' and 'validation' splits into DataFrames
train_df = pd.DataFrame(dataset['train'])
test_df = pd.DataFrame(dataset['validation'])

# Combine the train and validation splits into one DataFrame
combined_df = pd.concat([train_df, test_df], ignore_index=True)

# Save the combined dataset as a CSV file
combined_df.to_csv('twitter_financial_news_combined.csv', index=False)

print("Combined dataset saved as 'twitter_financial_news_combined.csv'.")

Combined dataset saved as 'twitter_financial_news_combined.csv'.



### Dataset Selection Challenges  
Finding a dataset labeled with the specific financial topics needed here was challenging. As a workaround, I loaded a general financial news dataset (`zeroshot/twitter-financial-news-topic` on Hugging Face) and derived more relevant labels by matching category keywords against the text of each entry.

---

In [35]:
# Import Packages
import re

# Load your dataset
df = pd.read_csv('twitter_financial_news_combined.csv')

# Define keywords for each category
keywords = {
    'Cryptocurrency': ['bitcoin', 'ethereum', 'crypto', 'blockchain', 'altcoin', 'token'],
    'Stocks': ['stock', 'equity', 'shares', 'buy', 'sell', 'earnings', 'ipo'],
    'Economy': ['inflation', 'gdp', 'recession', 'unemployment', 'economy', 'growth'],
    'Banking': ['bank', 'interest rates', 'mortgage', 'lending', 'finance'],
    'Investments': ['portfolio', 'hedge fund', 'mutual fund', 'etf', 'bonds', 'assets']
}

# Function to clean and categorize text
def categorize_text(text):
    # Lowercase, then strip URLs and any non-letter characters (digits included)
    text = re.sub(r'http\S+|[^a-zA-Z\s]', '', text.lower())
    
    # Return the first category whose keyword occurs in the text (substring match)
    for category, words in keywords.items():
        if any(word in text for word in words):
            return category
    return 'Uncategorized'  # For texts that don't match any category

# Apply categorization
df['category'] = df['text'].apply(categorize_text)

# Display first few rows to check results
print(df[['text', 'category']].head())

# Save to a new CSV
df.to_csv('categorized_dataset.csv', index=False)


                                                text       category
0  Here are Thursday's biggest analyst calls: App...  Uncategorized
1  Buy Las Vegas Sands as travel to Singapore bui...         Stocks
2  Piper Sandler downgrades DocuSign to sell, cit...         Stocks
3  Analysts react to Tesla's latest earnings, bre...         Stocks
4  Netflix and its peers are set for a ‘return to...         Stocks
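
Because `categorize_text` tests keywords with `in`, a keyword also matches when it appears inside a longer word. The sketch below illustrates one possible alternative that matches whole words only. The function name `categorize_text_strict` and the shortened keyword dictionary are assumptions made for illustration, not part of the original notebook.

```python
import re

# Shortened copy of the keyword dictionary, for illustration only
keywords = {
    'Stocks': ['stock', 'shares', 'ipo'],
    'Investments': ['etf', 'bonds'],
}

# One precompiled pattern per category; \b anchors each keyword to word boundaries
patterns = {
    category: re.compile(r'\b(?:' + '|'.join(re.escape(w) for w in words) + r')\b')
    for category, words in keywords.items()
}

def categorize_text_strict(text):
    """Hypothetical variant: return the first category whose keyword appears as a whole word."""
    text = re.sub(r'http\S+|[^a-zA-Z\s]', '', text.lower())
    for category, pattern in patterns.items():
        if pattern.search(text):
            return category
    return 'Uncategorized'

sample = 'Netflix adds subscribers'
print('etf' in sample.lower())         # substring test: 'etf' is found inside 'netflix'
print(categorize_text_strict(sample))  # whole-word test: no keyword matches
```

The trade-off is that whole-word matching no longer catches inflected forms: `stock` would not match `stocks` unless the plural is listed as its own keyword.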


In [37]:
# Keep only the text and label columns (copy so later edits do not touch df)
organized_df = df[['text', 'category']].copy()
organized_df.head()

                                                text       category
0  Here are Thursday's biggest analyst calls: App...  Uncategorized
1  Buy Las Vegas Sands as travel to Singapore bui...         Stocks
2  Piper Sandler downgrades DocuSign to sell, cit...         Stocks
3  Analysts react to Tesla's latest earnings, bre...         Stocks
4  Netflix and its peers are set for a ‘return to...         Stocks


In [50]:
# Check for missing values
organized_df.isna().sum()

text        0
category    0
dtype: int64

In [51]:
# View target values
organized_df["category"].value_counts()

category
Uncategorized     13392
Stocks             4182
Economy            1800
Banking             968
Cryptocurrency      402
Investments         363
Name: count, dtype: int64
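
The counts above are heavily skewed toward `Uncategorized`, so it can help to know what accuracy a model would reach by always predicting the largest class. The sketch below computes that majority-class baseline from the counts printed above. The variable names are illustrative.

```python
# Class counts copied from the value_counts() output above
counts = {
    'Uncategorized': 13392,
    'Stocks': 4182,
    'Economy': 1800,
    'Banking': 968,
    'Cryptocurrency': 402,
    'Investments': 363,
}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a classifier that always predicts the most frequent label
baseline_accuracy = counts[majority_label] / total
print(f"{majority_label}: {baseline_accuracy:.1%} of {total} rows")
```

Any accuracy figure reported later is easier to interpret against this baseline than against zero.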

In [53]:
# Import necessary packages
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# CountVectorizer to convert text into a matrix of token counts
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(organized_df['text'])  # Convert text into numeric features

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, organized_df['category'], test_size=0.2, random_state=42)
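
The cell above fits `CountVectorizer` on every row before splitting, so the vocabulary includes words that occur only in test rows. One possible alternative, sketched below, splits the raw text first and wraps the vectorizer and classifier in a scikit-learn pipeline so that both are fitted on training rows only. The toy texts and labels are hypothetical stand-ins for `organized_df['text']` and `organized_df['category']`, and `stratify` is added here as an assumption to keep class proportions equal across the two splits.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy rows standing in for the real text and category columns
texts = [
    'Tesla shares jump after earnings beat',
    'Analyst upgrades Apple stock to buy',
    'IPO prices above expected range',
    'Stock buyback lifts shares',
    'Bitcoin climbs past resistance',
    'Ethereum token upgrade goes live',
    'Crypto exchange lists new altcoin',
    'Blockchain startup raises funding',
]
labels = ['Stocks'] * 4 + ['Cryptocurrency'] * 4

# Split the raw text first; stratify keeps the class mix the same in both splits
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits the vectorizer on training text only, then the classifier
pipeline = make_pipeline(
    CountVectorizer(stop_words='english'),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(X_train_txt, y_train)

print(pipeline.score(X_test_txt, y_test))  # accuracy on the held-out rows
```

A single pipeline object also means only one artifact has to be saved and loaded later, instead of a separate model and vectorizer.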



In [56]:
# Import necessary packages
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Model 1: Logistic Regression
model = LogisticRegression(max_iter=1000)

# Model 2: Naive Bayes (uncomment to train this model instead)
# model = MultinomialNB()

# Fit the model
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"Classification Report:\n{classification_report(y_test, y_pred)}")

Accuracy: 0.980104216011369
Classification Report:
                precision    recall  f1-score   support

       Banking       0.98      0.90      0.94       199
Cryptocurrency       1.00      0.85      0.92        89
       Economy       0.98      0.99      0.98       382
   Investments       0.97      0.92      0.94        74
        Stocks       0.99      0.95      0.97       816
 Uncategorized       0.98      1.00      0.99      2662

      accuracy                           0.98      4222
     macro avg       0.98      0.94      0.96      4222
  weighted avg       0.98      0.98      0.98      4222



### Model Comparison Results  
- Naive Bayes achieved an accuracy of 84%.
- Logistic Regression achieved an accuracy of 98% on the held-out test set of 4,222 samples.

Based on these results, we will proceed with the Logistic Regression model.
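
The comparison was produced by switching the `model` line by hand. As a minimal sketch of an alternative, the loop below trains both classifiers on the same split and reports accuracy alongside macro-averaged F1, which weights small classes equally. The toy data and the `results` dictionary are assumptions for illustration; the scores they produce say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy rows standing in for the real text and category columns
texts = [
    'Tesla shares jump after earnings beat',
    'Analyst upgrades Apple stock to buy',
    'IPO prices above expected range',
    'Stock buyback lifts shares',
    'Bitcoin climbs past resistance',
    'Ethereum token upgrade goes live',
    'Crypto exchange lists new altcoin',
    'Blockchain startup raises funding',
]
labels = ['Stocks'] * 4 + ['Cryptocurrency'] * 4

X = CountVectorizer(stop_words='english').fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42, stratify=labels
)

candidates = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Naive Bayes': MultinomialNB(),
}

# Fit every candidate on the same training rows and score it on the same test rows
results = {}
for name, candidate in candidates.items():
    y_pred = candidate.fit(X_train, y_train).predict(X_test)
    results[name] = {
        'accuracy': accuracy_score(y_test, y_pred),
        'macro_f1': f1_score(y_test, y_pred, average='macro', zero_division=0),
    }

for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.2f}, macro F1={scores['macro_f1']:.2f}")
```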

---


In [58]:
# Import packages
import joblib

# Save the model 
model_filename = 'text_classification_model.joblib'
joblib.dump(model, model_filename)
print(f"Model saved as {model_filename}")

# Save the vectorizer
vectorizer_filename = 'vectorizer.joblib'
joblib.dump(vectorizer, vectorizer_filename)
print(f"Vectorizer saved as {vectorizer_filename}")

Model saved as text_classification_model.joblib
Vectorizer saved as vectorizer.joblib
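
To use the saved artifacts later, both files have to be loaded, and new text must go through the saved vectorizer's `transform` (not `fit_transform`) so that its columns line up with what the model was trained on. The sketch below shows that round trip end to end. It trains on a few hypothetical toy rows and writes to a temporary directory so that it runs on its own; in the notebook, the two `joblib.load` calls would point at the files saved above.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy rows standing in for the real training data
texts = [
    'bitcoin rallies as crypto demand grows',
    'ethereum token upgrade goes live',
    'bank raises mortgage lending rates',
    'bank reports higher interest income',
]
labels = ['Cryptocurrency', 'Cryptocurrency', 'Banking', 'Banking']

vectorizer = CountVectorizer(stop_words='english')
model = LogisticRegression(max_iter=1000).fit(vectorizer.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'text_classification_model.joblib')
    vectorizer_path = os.path.join(tmp_dir, 'vectorizer.joblib')
    joblib.dump(model, model_path)
    joblib.dump(vectorizer, vectorizer_path)

    # Load both artifacts back, as a separate script or session would
    loaded_model = joblib.load(model_path)
    loaded_vectorizer = joblib.load(vectorizer_path)

new_texts = ['crypto token prices jump']
features = loaded_vectorizer.transform(new_texts)  # transform only, never refit
predictions = loaded_model.predict(features)
print(dict(zip(new_texts, predictions)))
```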
