# Text Preprocessing

This notebook defines the preprocessing pipeline for textual data
(news headlines and social media content) prior to sentiment analysis.
The goal is to standardize raw text while preserving financial semantics.


In [5]:
import re
import pandas as pd

In [6]:
df = pd.read_csv("../data/raw/text_data.csv", parse_dates=["timestamp"])
df.head()

| Unnamed: 0 | timestamp | text | source | asset | channel |
|---|---|---|---|---|---|
| 0 | 2024-10-03 01:15:00+00:00 | Momentum Funds : Momentum funds with 4x rise i... | | NIFTY | news_gdelt |
| 1 | 2024-10-03 03:00:00+00:00 | Indian stock market : 10 key things that chang... | | NIFTY | news_gdelt |
| 2 | 2024-10-03 03:00:00+00:00 | Nifty 50 , Sensex today : What to expect from ... | | NIFTY | news_gdelt |
| 3 | 2024-10-03 06:00:00+00:00 | Bitcoin Price Decline Forces $450M in Long Liq... | | BTC | news_gdelt |
| 4 | 2024-10-03 07:00:00+00:00 | Stock Market : शेयर बाजार में बड़ी गिरावट ... ... | | NIFTY | news_gdelt |


In [7]:
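def clean_text(text):
    """Lowercase text, strip URLs, and keep only ASCII letters and single spaces."""
    # Non-string values (e.g. NaN) become empty strings and are dropped by the length filter
    if not isinstance(text, str):
        return ""

    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)     # remove URLs
    # Keep only ASCII letters and whitespace: this drops digits, punctuation,
    # symbols such as @ # $ %, and non-Latin scripts
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    text = re.sub(r"\s+", " ", text).strip()       # collapse whitespace runs

    return text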
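
To make the effect of each step concrete, the sketch below runs the same cleaning steps on a few made-up inputs. The sample strings are illustrative assumptions, not rows from the dataset, and the function is repeated only so the snippet runs on its own.

```python
import re


def clean_text(text):
    # Same steps as the notebook's clean_text, repeated so this snippet is self-contained
    if not isinstance(text, str):
        return ""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical inputs chosen to exercise each rule
samples = [
    "BTC falls 5% after $450M in liquidations https://example.com/news",
    "Nifty 50 up 1.2% today! #markets @trader",
    "शेयर बाजार में बड़ी गिरावट",   # Devanagari only
    None,                          # missing value
]

for s in samples:
    print(repr(clean_text(s)))

# 'btc falls after m in liquidations'
# 'nifty up today markets trader'
# ''
# ''
```

Note that the amount `$450M` collapses to the single letter `m`, hashtag and mention words survive without their prefix, and text written entirely in a non-Latin script becomes an empty string.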

In [8]:
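df["clean_text"] = df["text"].apply(clean_text)
# Keep rows with more than 10 cleaned characters; .copy() avoids SettingWithCopyWarning on later edits
df = df[df["clean_text"].str.len() > 10].copy()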

In [9]:
df.to_csv("../data/processed/text_preprocessed.csv", index=False)
print(f"Saved {len(df)} cleaned text records")

Saved 1377 cleaned text records


## Preprocessing Design Choices

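- URLs are removed to reduce noise
- Digits, punctuation, and non-ASCII characters are stripped, so mentions, hashtags, and cashtags lose their `@`, `#`, and `$` prefixes while the tag words remain
- Text is lowercased for model consistency
- No stopword removal is applied, which keeps financial phrasing intact; numeric amounts and percentages, however, are dropped by the character filter
- Tokenization and sentiment modeling are handled in the next stage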
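
The `clean_text` regex strips the `@`, `#`, and `$` symbols but leaves the tag words in place. If mentions, hashtags, and cashtags should be removed as whole tokens, one option is a small pre-pass applied before `clean_text`. The following is a minimal sketch under that assumption; `strip_social_tokens` and both patterns are hypothetical and not part of the notebook.

```python
import re

# Hypothetical patterns, not taken from the notebook
URL_RE = re.compile(r"http\S+|www\S+")
# A tag is @, # or $ followed by a letter or underscore, and not preceded by a word character.
# Requiring a leading letter means dollar amounts such as $450M are left alone.
SOCIAL_RE = re.compile(r"(?<!\w)[@#$][A-Za-z_]\w*")


def strip_social_tokens(text):
    """Drop URLs and whole @mention, #hashtag, and $cashtag tokens."""
    text = URL_RE.sub("", text)
    text = SOCIAL_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


print(strip_social_tokens("$BTC dumps again, says @whale_alert #crypto https://example.com/x"))
# dumps again, says
print(strip_social_tokens("Forces $450M in longs"))
# Forces $450M in longs
```

Whether dropping cashtags is desirable depends on the downstream model, since a ticker symbol can itself identify the asset being discussed.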
