# Twitter Financial News Classification  
## Final Project Summary

---

## 1. Project Objective

The objective of this project was to build an end-to-end Natural Language
Processing (NLP) pipeline to classify Twitter financial news into multiple
financial categories. The project emphasizes practical model selection,
efficient feature engineering, and fair comparison of classical machine
learning models under real-world resource constraints.

---

## 2. Dataset Overview

The dataset consists of Twitter posts related to financial news, each labeled
into predefined financial categories such as company updates, macroeconomic
events, market movements, and corporate actions.

Key characteristics of the dataset include:
- Multi-class classification problem
- Noticeable class imbalance across categories
- Short-form, noisy text typical of social media data

---

## 3. Exploratory Data Analysis (EDA)

Exploratory analysis revealed the following insights:
- A small number of categories dominate the dataset
- Minority classes contain limited but information-dense samples
- Tweet length varies significantly by news category
- Class imbalance is a critical challenge affecting model performance

EDA guided feature engineering decisions and evaluation strategy selection.
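
The two checks behind these observations (class shares and tweet length per category) can be sketched with the standard library alone. The sample tweets and label names below are hypothetical placeholders, not rows from the project dataset.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical toy sample; the real data and label names are not shown here.
samples = [
    ("Fed raises rates by 25 bps", "macro"),
    ("ECB holds rates steady as inflation cools", "macro"),
    ("$AAPL beats Q3 earnings estimates", "company"),
    ("$TSLA recalls 2,000 vehicles", "company"),
    ("$MSFT announces new CFO", "company"),
    ("Acme Corp declares 2-for-1 stock split", "corporate_action"),
]

def class_distribution(rows):
    """Return {label: share of samples}, most frequent label first."""
    counts = Counter(label for _, label in rows)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.most_common()}

def mean_length_by_class(rows):
    """Return {label: mean whitespace-token count per tweet}."""
    lengths = defaultdict(list)
    for text, label in rows:
        lengths[label].append(len(text.split()))
    return {label: mean(vals) for label, vals in lengths.items()}

dist = class_distribution(samples)
avg_len = mean_length_by_class(samples)
```

A skewed `dist` is the signal that accuracy alone would be misleading and that stratified splits and macro-averaged metrics are worth using.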

---

## 4. Feature Engineering

The following feature engineering techniques were applied:
- Text cleaning and normalization
- Removal of low-information samples
- TF-IDF vectorization using unigrams and bigrams
- Feature dimensionality control for computational efficiency

This approach ensured consistency and fairness across all model comparisons.
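
As a rough illustration of the steps listed above, the following dependency-free sketch cleans text, drops very short samples, and builds capped unigram-plus-bigram TF-IDF vectors. The cleaning rules, the `min_tokens` threshold, the `max_features` cap, and the smoothed IDF variant are all assumptions made for the example; the project's actual settings are not stated in this summary.

```python
import math
import re
from collections import Counter

def clean(text):
    """Lowercase, drop URLs, keep letters, digits and $ tickers, collapse spaces."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^a-z0-9$ ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def ngrams(text):
    """Unigrams plus bigrams of an already cleaned string."""
    tokens = text.split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

def fit_tfidf(docs, max_features=1000, min_tokens=2):
    """Return {term: idf} for a vocabulary capped at `max_features` terms."""
    cleaned = [clean(d) for d in docs]
    # Assumed rule for "low-information" samples: fewer than `min_tokens` tokens.
    cleaned = [d for d in cleaned if len(d.split()) >= min_tokens]
    doc_freq = Counter()
    for d in cleaned:
        doc_freq.update(set(ngrams(d)))
    # Dimensionality control: keep only the terms found in the most documents.
    vocab = [term for term, _ in doc_freq.most_common(max_features)]
    n = len(cleaned)
    return {t: math.log((1 + n) / (1 + doc_freq[t])) + 1 for t in vocab}

def transform(doc, idf):
    """Return an L2-normalised sparse TF-IDF vector as {term: weight}."""
    counts = Counter(t for t in ngrams(clean(doc)) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}

docs = ["Fed raises rates https://t.co/abc", "Fed holds rates",
        "$AAPL beats estimates!", "ok"]
idf = fit_tfidf(docs)
vec = transform("Fed raises rates", idf)
```

Fitting the vocabulary and IDF weights once and reusing them for every model is what keeps the comparison fair: each classifier sees exactly the same feature space.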

---

## 5. Models Implemented

### 5.1 Logistic Regression (Baseline Model)

- Served as the primary benchmark model
- Performed reliably on majority classes
- Provided interpretable and stable results
- Used class weighting to mitigate imbalance
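
The summary does not say which weighting scheme was used. One common choice is the "balanced" heuristic, which weights each class inversely to its frequency. The sketch below computes those weights for a hypothetical label list.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

# Hypothetical imbalanced label list, for illustration only.
labels = ["company"] * 6 + ["macro"] * 3 + ["corporate_action"] * 1
weights = balanced_class_weights(labels)
```

With these weights, an error on a rare class contributes more to the training loss than an error on a frequent one, which pushes the decision boundary away from the majority classes.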

---

### 5.2 Multinomial Naive Bayes

- Computationally efficient and fast to train
- Performed adequately on high-frequency classes
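- Struggled with minority categories, likely because of its
  feature-independence assumption and the few samples available for
  estimating minority-class term probabilities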
- Sensitive to class imbalance
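
One plausible mechanism for this sensitivity is the class prior: Multinomial Naive Bayes adds the log of each class's training frequency to its score, so an ambiguous tweet tends to fall to the majority class. The minimal implementation below (Laplace smoothing with `alpha=1.0`, toy documents, hypothetical labels) is an illustrative sketch, not the project's code.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit log-priors and Laplace-smoothed log-likelihoods per class."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, y in zip(docs, labels):
        term_counts[y].update(doc.split())
    vocab = {t for counts in term_counts.values() for t in counts}
    n = len(labels)
    model = {}
    for c in class_counts:
        total = sum(term_counts[c].values()) + alpha * len(vocab)
        model[c] = {
            "log_prior": math.log(class_counts[c] / n),
            "log_lik": {t: math.log((term_counts[c][t] + alpha) / total)
                        for t in vocab},
        }
    return model

def predict_nb(model, doc):
    """Return the class with the highest log-posterior score."""
    def score(c):
        params = model[c]
        # Terms outside the training vocabulary are ignored.
        return params["log_prior"] + sum(
            params["log_lik"][t] for t in doc.split() if t in params["log_lik"]
        )
    return max(model, key=score)

# Toy data: four majority-class tweets, one minority-class tweet.
docs = ["earnings beat"] * 4 + ["rates rise"]
labels = ["company"] * 4 + ["macro"]
model = train_nb(docs, labels)

ambiguous = predict_nb(model, "rates earnings")  # one term from each class
clear = predict_nb(model, "rates rise")
```

In this toy setup the ambiguous tweet is assigned to `company`. With equal priors the likelihood terms alone would favor `macro`, so the imbalance in the training labels is what tips the prediction.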

---

### 5.3 Linear Support Vector Machine (Advanced Classical Model)

- Demonstrated the most balanced overall performance
- Improved separation between semantically similar categories
- Robust to high-dimensional sparse TF-IDF features
- Outperformed the baseline models on macro-averaged evaluation metrics

---

## 6. Model Evaluation and Comparison

- All models were trained using identical preprocessing and TF-IDF features
- Stratified train–test splits ensured fair evaluation
- Confusion matrix analysis highlighted misclassification patterns
- Most errors occurred between semantically related financial categories
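
The evaluation steps above can be sketched without external libraries. The code shows a stratified split, confusion counts, and macro-averaged F1. The split ratio, seed, and toy labels are assumptions for the example, and macro F1 stands in for whichever macro-level metric the project reported.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), splitting each class at the same ratio."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    rng = random.Random(seed)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Keep at least one test sample for any class with two or more samples.
        n_test = max(1, round(len(idxs) * test_fraction)) if len(idxs) > 1 else 0
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)

def confusion_counts(y_true, y_pred):
    """Return a Counter mapping (true_label, predicted_label) to a count."""
    return Counter(zip(y_true, y_pred))

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so every class counts equally."""
    cm = confusion_counts(y_true, y_pred)
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = cm[(c, c)]
        fp = sum(n for (t, p), n in cm.items() if p == c and t != c)
        fn = sum(n for (t, p), n in cm.items() if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels: class "a" is the majority, class "b" the minority.
all_labels = ["a"] * 8 + ["b"] * 2
train_idx, test_idx = stratified_split(all_labels)

y_true = ["a", "a", "a", "a", "b", "b"]
y_pred = ["a", "a", "a", "b", "b", "a"]
score = macro_f1(y_true, y_pred)
pairs = confusion_counts(y_true, y_pred)
```

Off-diagonal entries of `pairs`, such as `("b", "a")`, are the misclassification patterns a confusion matrix exposes. Macro averaging keeps a weak minority class from being hidden behind a strong majority class.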

**Final Classical Model Selected:**  
**Linear Support Vector Machine (SVM)**

---

## 7. Deep Learning Exploration (BERT)

A BERT-based model was set up to explore deep learning approaches to text
classification. However, because of CPU-only hardware and very long training
times, the model was not fully trained.

The decision to prioritize efficient classical models reflects real-world
engineering trade-offs between model complexity, resource availability, and
project timelines.

---

## 8. Key Learnings

- Classical machine learning models remain highly effective for text
  classification tasks
- Feature engineering and data quality often outweigh model complexity
- Class imbalance has a strong impact on multi-class performance
- Practical constraints should guide model selection decisions

---

## 9. Conclusion

This project successfully demonstrates a complete NLP workflow, from data
exploration and feature engineering to model training and evaluation. By
systematically comparing multiple models, the project identifies Linear SVM
as a strong and efficient solution for financial news classification while
maintaining practical feasibility.

---

## 10. Project Status

**Status:** Completed  
**Outcome:** End-to-end, interview-ready NLP classification project

---

