In [1]:
# ============================================================
# Twitter Financial News Sentiment Analysis
# Step 1: Data Loading and Text Cleaning
# ============================================================

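# Import required libraries
import re

import pandas as pd
from IPython.display import display  # explicit import so display() is defined outside a notebook cell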

# Configure display settings for better readability
pd.set_option("display.max_colwidth", 150)

# ------------------------------------------------------------
# Load training and validation datasets
# ------------------------------------------------------------
train_path = "../data/train_data.csv"
valid_path = "../data/valid_data.csv"

df_train = pd.read_csv(train_path)
df_valid = pd.read_csv(valid_path)

print("Dataset successfully loaded.")
print(f"Training data shape   : {df_train.shape}")
print(f"Validation data shape : {df_valid.shape}")

# ------------------------------------------------------------
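# Define text cleaning function
# Purpose:
# - Normalize text to lowercase
# - Remove noise: URLs, then every character that is not an ASCII letter
#   or whitespace (punctuation, digits, symbols such as $ % # @)
# - Prepare text for NLP modeling
# Known limitations:
# - HTML entities are not decoded, so "&amp;" leaves the token "amp"
# - Missing values become the literal string "nan" via str()
# ------------------------------------------------------------
def clean_text(text):
    text = str(text).lower()                              # Convert to lowercase
    text = re.sub(r"http\S+|www\S+", "", text)            # Remove URLs
    text = re.sub(r"[^a-zA-Z\s]", "", text)               # Keep only ASCII letters and whitespace
    text = re.sub(r"\s+", " ", text).strip()              # Collapse runs of whitespace and trim
    return text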

# ------------------------------------------------------------
# Apply text cleaning to both datasets
# ------------------------------------------------------------
df_train["clean_text"] = df_train["text"].apply(clean_text)
df_valid["clean_text"] = df_valid["text"].apply(clean_text)

# ------------------------------------------------------------
# Basic data quality checks
# ------------------------------------------------------------
print("\nMissing values in training dataset:")
print(df_train.isnull().sum())

print("\nLabel distribution in training dataset:")
print(df_train["label"].value_counts().sort_index())

# ------------------------------------------------------------
# Feature engineering: text length
# Character count of the cleaned text; useful for EDA and model diagnostics
# ------------------------------------------------------------
df_train["text_length"] = df_train["clean_text"].str.len()

print("\nText length statistics (training data):")
print(df_train["text_length"].describe())

# ------------------------------------------------------------
# Display sample records to verify cleaning
# ------------------------------------------------------------
print("\nSample cleaned records:")
display(df_train[["text", "clean_text", "label"]].head(5))

# ------------------------------------------------------------
# Save cleaned datasets for downstream tasks
# ------------------------------------------------------------
df_train.to_csv("../data/train_clean.csv", index=False)
df_valid.to_csv("../data/valid_clean.csv", index=False)

print("\nCleaned datasets saved successfully.")


Dataset successfully loaded.
Training data shape   : (16990, 2)
Validation data shape : (4117, 2)

Missing values in training dataset:
text          0
label         0
clean_text    0
dtype: int64

Label distribution in training dataset:
label
0      255
1      837
2     3545
3      321
4      359
5      987
6      524
7      624
8      166
9     1557
10      69
11      44
12     487
13     471
14    1822
15     501
16     985
17     495
18    2118
19     823
Name: count, dtype: int64

Text length statistics (training data):
count    16990.000000
mean       100.452796
std         49.726920
min          0.000000
25%         65.000000
50%         88.000000
75%        129.000000
max        276.000000
Name: text_length, dtype: float64

Sample cleaned records:


,text,clean_text,label
0,"Here are Thursday's biggest analyst calls: Apple, Amazon, Tesla, Palantir, DocuSign, Exxon &amp; more https://t.co/QPN8Gwl7Uh",here are thursdays biggest analyst calls apple amazon tesla palantir docusign exxon amp more,0
1,"Buy Las Vegas Sands as travel to Singapore builds, Wells Fargo says https://t.co/fLS2w57iCz",buy las vegas sands as travel to singapore builds wells fargo says,0
2,"Piper Sandler downgrades DocuSign to sell, citing elevated risks amid CEO transition https://t.co/1EmtywmYpr",piper sandler downgrades docusign to sell citing elevated risks amid ceo transition,0
3,"Analysts react to Tesla's latest earnings, break down what's next for electric car maker https://t.co/kwhoE6W06u",analysts react to teslas latest earnings break down whats next for electric car maker,0
4,"Netflix and its peers are set for a ‘return to growth,’ analysts say, giving one stock 120% upside https://t.co/jPpdl0D9s4",netflix and its peers are set for a return to growth analysts say giving one stock upside,0



Cleaned datasets saved successfully.



# Twitter Financial News Sentiment Analysis

## Project Overview
This project analyzes and classifies finance-related tweets using Natural Language Processing (NLP) techniques. The objective of this first step is to turn unstructured financial tweets into clean, structured data that is ready for machine-learning-based sentiment and category classification.

## Dataset Description
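The dataset consists of English-language tweets related to financial markets, companies, and economic events. The training set contains 16,990 tweets and the validation set 4,117. Each tweet is annotated with one of 20 integer-coded labels (0 to 19). The classes are heavily imbalanced: the largest class (label 2) has 3,545 training tweets and the smallest (label 11) has 44, a ratio of roughly 80:1. This makes it a realistic and challenging multi-class classification problem.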
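
To put a number on the imbalance, the sketch below recomputes the class shares from the label counts printed in the training-set output. The variable names (`label_counts`, `shares`, `imbalance_ratio`) are illustrative and not part of the original notebook.

```python
import pandas as pd

# Label counts copied from the "Label distribution in training dataset" output.
label_counts = pd.Series(
    [255, 837, 3545, 321, 359, 987, 524, 624, 166, 1557,
     69, 44, 487, 471, 1822, 501, 985, 495, 2118, 823],
    index=range(20),
    name="count",
)

total = int(label_counts.sum())                 # total number of training tweets
shares = (label_counts / total * 100).round(2)  # per-class share in percent
imbalance_ratio = label_counts.max() / label_counts.min()

print(f"Total tweets   : {total}")
print(f"Largest class  : {label_counts.idxmax()} ({label_counts.max()} tweets, {shares.max()}%)")
print(f"Smallest class : {label_counts.idxmin()} ({label_counts.min()} tweets, {shares.min()}%)")
print(f"Max/min ratio  : {imbalance_ratio:.1f}")
```

In the notebook itself, `df_train["label"].value_counts().sort_index()` would supply the same series directly.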

## Data Preprocessing
The initial phase of the project covered data cleaning and preparation:
- Loaded the training and validation datasets from CSV files
- Normalized text by converting it to lowercase
- Removed URLs, then every character other than ASCII letters and whitespace (punctuation, digits, and symbols), and collapsed extra whitespace
- Stored the result in a `clean_text` column for downstream NLP tasks
- Added a `text_length` feature (character count of the cleaned text) to the training set to support exploratory analysis
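
The first sample record shows that `&amp;` in the raw tweet survives cleaning as the stray token `amp`, because the entity is never decoded. One possible adjustment, shown here as a sketch rather than as the notebook's method, is to decode HTML entities with the standard library's `html.unescape` before filtering. The function name `clean_text_unescaped` is hypothetical.

```python
import html
import re

def clean_text_unescaped(text):
    """Hypothetical variant of clean_text that decodes HTML entities first."""
    text = html.unescape(str(text)).lower()      # "&amp;" becomes "&" before filtering
    text = re.sub(r"http\S+|www\S+", "", text)   # remove URLs
    text = re.sub(r"[^a-z\s]", "", text)         # keep only ASCII letters and whitespace
    return re.sub(r"\s+", " ", text).strip()     # collapse whitespace and trim

sample = "Exxon &amp; more https://t.co/QPN8Gwl7Uh"
print(clean_text_unescaped(sample))  # prints "exxon more"
```

Adopting this variant would change the cleaned text and the length statistics reported above, so the saved files would need to be regenerated.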

## Data Quality Assessment
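Basic validation checks were run on the training set:
- Confirmed that the `text`, `label`, and `clean_text` columns contain no missing values
- Analyzed the label distribution and identified class imbalance across the 20 labels
- Noted a minimum cleaned text length of 0, meaning at least one tweet is reduced to an empty string by cleaning
- Reviewed sample records to check the cleaning output, which also shows side effects such as `&amp;` leaving the stray token `amp` and figures like `120%` being dropped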
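
The text-length statistics report a minimum of 0, so some tweets end up with an empty `clean_text` even though `isnull()` reports no missing values. A minimal sketch of how such rows could be located follows. It uses a small stand-in frame (`df_demo`, with made-up rows) instead of the real `df_train`.

```python
import pandas as pd

# Toy frame standing in for df_train; the rows are illustrative, not from the dataset.
df_demo = pd.DataFrame({
    "text": ["Stocks rally 3% https://t.co/xyz", "https://t.co/abc", "100% !!!"],
    "clean_text": ["stocks rally", "", ""],
    "label": [0, 1, 2],
})

# An empty string is not NaN, so isnull() does not flag it; check the length instead.
empty_mask = df_demo["clean_text"].str.len() == 0

print(f"Rows with empty clean_text: {int(empty_mask.sum())}")
print(df_demo.loc[empty_mask, ["text", "label"]])
```

Whether to drop such rows or keep them is a modeling decision the notebook does not make at this stage.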

## Output Artifacts
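Cleaned versions of the training and validation datasets were saved as `../data/train_clean.csv` and `../data/valid_clean.csv` for reuse in subsequent stages, including exploratory data analysis, feature engineering, and model development. The training file also carries the `text_length` column.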
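
One caveat when reloading these files: `pandas.read_csv` treats empty fields as missing by default, so any empty `clean_text` comes back as `NaN`. The sketch below demonstrates this with an in-memory buffer and a made-up two-row frame (`df_demo`) rather than the real files, and shows `keep_default_na=False` as one way to keep empty strings.

```python
import io

import pandas as pd

# Illustrative rows only; the first one has an empty cleaned text.
df_demo = pd.DataFrame({
    "text": ["https://t.co/abc", "Fed holds rates"],
    "clean_text": ["", "fed holds rates"],
    "label": [1, 2],
})

buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)  # same call pattern as the notebook's save step

buffer.seek(0)
default_read = pd.read_csv(buffer)   # empty field is parsed as NaN
print("NaN after default read:", int(default_read["clean_text"].isna().sum()))

buffer.seek(0)
safe_read = pd.read_csv(buffer, keep_default_na=False)  # empty field stays ""
print("NaN after safe read   :", int(safe_read["clean_text"].isna().sum()))
```

Calling `fillna("")` on the `clean_text` column after a default read is an alternative with the same effect for that column.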

## Key Takeaways
- Consistent text preprocessing (lowercasing, URL and symbol removal) gives downstream NLP models a cleaner, smaller vocabulary
- The labels in this dataset are heavily imbalanced and require careful handling during model training and evaluation
- A clean, reproducible data pipeline makes later stages easier to rerun and verify
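
One common way to handle imbalance during training is to weight each class inversely to its frequency. The sketch below computes such weights from the label counts printed earlier, using the formula `n_samples / (n_classes * count)`, which is the same heuristic scikit-learn applies for `class_weight="balanced"`. This is an illustration of one option, not a step the notebook performs, and `class_weights` is an assumed name.

```python
import pandas as pd

# Label counts copied from the training-set output above.
label_counts = pd.Series(
    [255, 837, 3545, 321, 359, 987, 524, 624, 166, 1557,
     69, 44, 487, 471, 1822, 501, 985, 495, 2118, 823],
    index=range(20),
)

n_samples = label_counts.sum()
n_classes = len(label_counts)

# Rare classes get weights above 1, frequent classes get weights below 1.
class_weights = n_samples / (n_classes * label_counts)

print(class_weights.round(2).sort_values(ascending=False).head(3))
```

The resulting mapping could be passed to a classifier that accepts per-class weights, for example via `class_weights.to_dict()`.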

## Next Steps
The next phase of the project will focus on exploratory data analysis (EDA) to uncover patterns in tweet content, label distribution, and text characteristics, followed by feature engineering and machine learning model development.
