## Feature Engineering: Data Preparation

This step loads the cleaned dataset and removes low-quality records:
rows with missing `clean_text` and rows whose text is 10 characters or
fewer. Filtering these out keeps feature generation reliable.


In [4]:
import pandas as pd

# Load cleaned dataset
data_path = "../data/train_clean.csv"
df = pd.read_csv(data_path)

print("Initial dataset shape:")
print(df.shape)

# Remove rows with missing cleaned text
df_fe = df.dropna(subset=["clean_text"]).copy()

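# Remove very short texts (low information content).
# The strict ">" keeps only texts longer than MIN_TEXT_LENGTH characters,
# so a text of exactly MIN_TEXT_LENGTH characters is dropped as well.
MIN_TEXT_LENGTH = 10
df_fe = df_fe[df_fe["clean_text"].str.len() > MIN_TEXT_LENGTH]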

print("\nDataset shape after cleaning edge cases:")
print(df_fe.shape)


Initial dataset shape:
(16990, 4)

Dataset shape after cleaning edge cases:
(16949, 4)
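
The output above shows that 41 rows were removed in total, but not how many each filter accounts for. The sketch below counts the two steps separately on a tiny made-up DataFrame. The helper name `filter_short_texts` and the toy rows are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

MIN_TEXT_LENGTH = 10  # same threshold as the notebook cell above


def filter_short_texts(df, column="clean_text", min_length=MIN_TEXT_LENGTH):
    """Hypothetical helper: drop missing texts, then texts of min_length characters or fewer.

    Returns the filtered frame and the number of rows removed by each step.
    """
    n_start = len(df)
    no_missing = df.dropna(subset=[column])
    n_missing = n_start - len(no_missing)
    kept = no_missing[no_missing[column].str.len() > min_length].copy()
    n_short = len(no_missing) - len(kept)
    return kept, {"missing": n_missing, "too_short": n_short}


# Toy data (made up for illustration)
toy = pd.DataFrame({
    "clean_text": [
        "apple upgraded to buy by analysts",  # kept
        None,                                 # removed: missing
        "ok",                                 # removed: 2 characters
        "ten chars!",                         # removed: exactly 10 characters
    ],
    "label": [0, 1, 2, 0],
})

toy_fe, removed = filter_short_texts(toy)
print(removed)
print(toy_fe.shape)
```

Applied to the real `df`, the same helper would report how the removed rows split between missing and too-short texts.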


## Feature and Target Definition

This step separates the input text features and target labels
required for supervised learning.


In [5]:
# Define input features and target variable
X = df_fe["clean_text"]
y = df_fe["label"]

print("Feature sample:")
print(X.head(3))

print("\nTarget label sample:")
print(y.head(10).to_string())


Feature sample:
0    here are thursdays biggest analyst calls apple...
1    buy las vegas sands as travel to singapore bui...
2    piper sandler downgrades docusign to sell citi...
Name: clean_text, dtype: object

Target label sample:
0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
9    0
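
The first ten labels are all 0, which says little about how the classes are distributed. One way to inspect class balance is `value_counts`, sketched here on a made-up label series standing in for `y` (the class ids and counts below are illustrative assumptions, not the real data).

```python
import pandas as pd

# Made-up labels standing in for y; the real class counts are not shown in this notebook
y_toy = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 1, 2], name="label")

# Absolute number of samples per class, ordered by class id
class_counts = y_toy.value_counts().sort_index()

# Relative frequency of each class (sums to 1)
class_share = y_toy.value_counts(normalize=True).sort_index()

print(class_counts.to_string())
print(class_share.round(2).to_string())
```

Running the same two calls on `y` would show whether some categories are rare enough to matter for the split that follows.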


## Train-Test Split

The dataset is split 80/20 into training and testing sets. The split is
stratified on the label so that both sets keep the original class
distribution, and `random_state=42` makes it reproducible.


In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print("Train-Test Split Summary:")
print(f"Training samples : {X_train.shape[0]}")
print(f"Testing samples  : {X_test.shape[0]}")


Train-Test Split Summary:
Training samples : 13559
Testing samples  : 3390
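
The effect of `stratify=y` can be checked by comparing class shares in the two splits. The sketch below does this on made-up data with an 80/20 class imbalance. The toy series are assumptions for illustration, and the same comparison could be run on `y_train` and `y_test`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up imbalanced data: 80 samples of class 0, 20 samples of class 1
X_toy = pd.Series([f"text {i}" for i in range(100)])
y_toy = pd.Series([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, class shares should match across the two splits
train_share = y_tr.value_counts(normalize=True).sort_index()
test_share = y_te.value_counts(normalize=True).sort_index()

print(train_share.to_string())
print(test_share.to_string())
```

Without `stratify`, a random split of a small or imbalanced dataset can leave a rare class under-represented in the test set.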


## TF-IDF Vectorization

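This step converts cleaned tweet text into numerical feature vectors
using TF-IDF with unigrams and bigrams. English stop words are removed
and the vocabulary is capped at 5,000 terms. The vectorizer is fitted on
the training set only, so no information from the test set leaks into
the vocabulary or the IDF weights.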


In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(
    max_features=5000,
    ngram_range=(1, 2),
    stop_words="english"
)

# Fit on training data and transform both sets
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

print("TF-IDF Feature Matrix Shape:")
print(f"Training matrix : {X_train_tfidf.shape}")
print(f"Testing matrix  : {X_test_tfidf.shape}")


TF-IDF Feature Matrix Shape:
Training matrix : (13559, 5000)
Testing matrix  : (3390, 5000)
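
Because the vectorizer is fitted on the training data only, terms that appear only in the test set have no column and are ignored at transform time. The sketch below illustrates this on made-up headlines. The texts are assumptions, and `max_features` is left out because the toy vocabulary is far smaller than 5,000.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up headlines for illustration
train_texts = [
    "analyst upgrades apple stock",
    "analyst downgrades docusign stock",
]
test_texts = ["analyst upgrades tesla stock"]  # "tesla" never appears in training

vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
train_matrix = vectorizer.fit_transform(train_texts)  # learns vocabulary and IDF
test_matrix = vectorizer.transform(test_texts)        # reuses the learned vocabulary

# vocabulary_ maps each learned unigram or bigram to its column index
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
print(train_matrix.shape, test_matrix.shape)
print("tesla" in vectorizer.vocabulary_)
```

Both matrices share the same number of columns, which is what lets a model trained on one be evaluated on the other.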


# Feature Engineering Summary

This section outlines the steps performed to convert cleaned Twitter
financial news text into structured numerical features suitable for
machine learning models.

---

## 1. Data Preparation
The cleaned dataset was loaded (16,990 rows, 4 columns). Records with
missing cleaned text were removed, as they provide no information for
text-based models.

Tweets of 10 characters or fewer were also filtered out to reduce
noise. After both filters, 16,949 rows remained (41 removed).

---

## 2. Feature and Target Definition
The cleaned tweet text was selected as the input feature, while the
corresponding financial category label was used as the target variable.
This clear separation ensures compatibility with supervised learning
algorithms.

---

## 3. Train–Test Split
The dataset was divided into training (13,559 samples) and testing
(3,390 samples) sets using an 80/20 stratified split. Stratification
preserves the original class distribution across both sets, which is
essential given the imbalanced nature of the dataset.

---

## 4. Text Vectorization using TF-IDF
Text data was transformed into numerical representations using
Term Frequency–Inverse Document Frequency (TF-IDF) vectorization.

Both unigrams and bigrams were included to capture individual financial
terms as well as commonly occurring short phrases, and English stop
words were removed. The feature space was capped at the 5,000 terms
with the highest frequency in the training corpus (`max_features=5000`)
to keep computation manageable and limit the number of rare,
noisy features.
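
To make the weighting concrete: with scikit-learn's default settings (`smooth_idf=True`, `norm="l2"`, `sublinear_tf=False`), which the notebook's vectorizer does not override, each raw weight is `tf * (ln((1 + n) / (1 + df)) + 1)`, and each row is then scaled to unit L2 length. The sketch below recomputes one row by hand on made-up documents and compares it with the library output. The helper `manual_tfidf_row` is hypothetical and written only for this illustration.

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents for illustration
docs = ["stock rally", "stock slump", "bond rally rally"]

vectorizer = TfidfVectorizer()  # defaults: smooth_idf=True, norm="l2"
matrix = vectorizer.fit_transform(docs).toarray()
vocab = vectorizer.vocabulary_  # term -> column index


def manual_tfidf_row(doc, all_docs, vocabulary):
    """Hypothetical helper: recompute one TF-IDF row with smoothed IDF and L2 normalization."""
    n_docs = len(all_docs)
    tokens = doc.split()
    row = [0.0] * len(vocabulary)
    for term, col in vocabulary.items():
        tf = tokens.count(term)                         # raw term count in this document
        df = sum(term in d.split() for d in all_docs)   # documents containing the term
        idf = math.log((1 + n_docs) / (1 + df)) + 1     # smoothed inverse document frequency
        row[col] = tf * idf
    norm = math.sqrt(sum(v * v for v in row))           # L2 length of the row
    return [v / norm for v in row]


manual = manual_tfidf_row(docs[2], docs, vocab)
print([round(v, 4) for v in manual])
print([round(float(v), 4) for v in matrix[2]])
```

The two printed rows should agree, which shows that terms occurring in fewer documents receive a larger weight than terms spread across many.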

---

## 5. Outcome of Feature Engineering
The feature engineering process produced sparse numerical feature
matrices of shape (13559, 5000) for training and (3390, 5000) for
testing. These matrices can be passed directly to models that accept
sparse input, such as linear classifiers, and serve as the input for
financial news classification.
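
`TfidfVectorizer` returns a SciPy sparse matrix, so most entries are never stored. As a rough check, the shape, number of stored values, and density can be inspected as sketched below on made-up texts. The texts are assumptions, and on a corpus this small the vocabulary stays well under the `max_features` cap.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up texts for illustration
texts = [
    "piper sandler downgrades docusign",
    "buy las vegas sands on travel rebound",
    "analyst calls on apple and tesla",
]

vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2), stop_words="english")
matrix = vectorizer.fit_transform(texts)

n_rows, n_cols = matrix.shape
density = matrix.nnz / (n_rows * n_cols)  # share of entries that are stored (nonzero)

print(sparse.issparse(matrix))
print(f"shape={matrix.shape}, stored values={matrix.nnz}, density={density:.3f}")
```

The same three attributes (`shape`, `nnz`, and the derived density) can be read from `X_train_tfidf` and `X_test_tfidf`.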

---

## 6. Conclusion
The feature engineering phase converted unstructured text data into
numerical TF-IDF features. The resulting dataset is free of missing and
very short texts, split so that the training and testing sets share the
same class distribution, and ready for model training and evaluation.
