# News-Based Feature Engineering

This notebook engineers daily features from news sentiment data and from stock price data, then merges them into a single dataset for stock price prediction models.

In [91]:
import pandas as pd
import numpy as np

# Load the cleaned news data
file_path = "../data/cleaned/cleaned_news_data.csv"
news_df = pd.read_csv(file_path)

## 1. Daily Average Sentiment

Averaging the day's sentiment scores yields a single aggregate value per day, which reduces noise. Many articles are published each day and not all of them are relevant; the mean extracts the general market sentiment.

In [92]:
daily_avg_sentiment_score = news_df.groupby("Date").agg(
    avg_sentiment = ("Sentiment Score", "mean")
).reset_index()

daily_avg_sentiment_score.head()

Unnamed: 0,Date,avg_sentiment
0,2025-04-27,0.9899
1,2025-04-28,0.9953
2,2025-04-29,0.18228
3,2025-04-30,0.55374
4,2025-05-01,0.443
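
As a minimal sketch of this aggregation, the toy frame below (three made-up rows, not taken from `cleaned_news_data.csv`) reuses the notebook's `Date` and `Sentiment Score` column names. The names `toy_news` and `toy_daily` are hypothetical. Note that pandas expects the aggregation name in lowercase (`"mean"`).

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy_news = pd.DataFrame({
    "Date": ["2025-01-01", "2025-01-01", "2025-01-02"],
    "Sentiment Score": [0.8, -0.2, 0.5],
})

# Named aggregation: one row per day, all articles of a day collapsed into their mean
toy_daily = toy_news.groupby("Date").agg(
    avg_sentiment=("Sentiment Score", "mean")
).reset_index()

print(toy_daily)
```

The first day holds one strongly positive and one mildly negative article, so its average lands between the two scores.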


## 2. Positive and Negative News Count

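Count the positive and negative articles per day. An article is positive if its sentiment score is above 0.05 and negative if it is below -0.05; scores in between are treated as neutral.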


In [93]:
news_df["is_positive"] = news_df["Sentiment Score"] > 0.05
news_df["is_negative"] = news_df["Sentiment Score"] < -0.05

counts_ = news_df.groupby("Date").agg(
    n_positive = ("is_positive", "sum"),
    n_negative = ("is_negative", "sum"),
    total_articles = ("Sentiment Score", "count")
).reset_index()

counts_.head()

Unnamed: 0,Date,n_positive,n_negative,total_articles
0,2025-04-27,1,0,1
1,2025-04-28,2,0,2
2,2025-04-29,3,2,5
3,2025-04-30,4,1,5
4,2025-05-01,7,2,9
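
The thresholding can be sketched on made-up scores as follows. Summing a boolean column counts its `True` values, which is why the flags can be aggregated with `"sum"`. The `n_neutral` column is a hypothetical addition that is not part of the notebook; it only shows that articles inside the ±0.05 band fall into neither count.

```python
import pandas as pd

# Hypothetical toy data: four articles on one day
toy_news = pd.DataFrame({
    "Date": ["2025-01-01"] * 4,
    "Sentiment Score": [0.6, 0.01, -0.3, 0.2],
})

# Boolean flags using the same thresholds as the notebook
toy_news["is_positive"] = toy_news["Sentiment Score"] > 0.05
toy_news["is_negative"] = toy_news["Sentiment Score"] < -0.05

# Summing booleans counts the True values per day
toy_counts = toy_news.groupby("Date").agg(
    n_positive=("is_positive", "sum"),
    n_negative=("is_negative", "sum"),
    total_articles=("Sentiment Score", "count"),
).reset_index()

# Assumption for illustration: everything that is neither positive nor negative is neutral
toy_counts["n_neutral"] = (
    toy_counts["total_articles"] - toy_counts["n_positive"] - toy_counts["n_negative"]
)

print(toy_counts)
```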


## 3. Sentiment Volatility (Standard Deviation)

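Calculate the daily standard deviation of sentiment scores to measure uncertainty and mixed signals in the news. A day with a single article has no defined sample standard deviation, so its value is NaN.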

In [94]:
volatility = news_df.groupby("Date")["Sentiment Score"].std().reset_index(name="Sentiment_std")
volatility.head()

Unnamed: 0,Date,Sentiment_std
0,2025-04-27,
1,2025-04-28,0.004525
2,2025-04-29,0.868658
3,2025-04-30,0.847931
4,2025-05-01,0.823665
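
The empty value on 2025-04-27 comes from the fact that pandas computes the sample standard deviation (`ddof=1`) by default, which is undefined for a single observation. The sketch below reproduces this on made-up data. Filling the gap with `0.0` is only one possible choice and is an assumption here, not something the notebook does; the column name `Sentiment_std_filled` is hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one article on the first day, two on the second
toy_news = pd.DataFrame({
    "Date": ["2025-01-01", "2025-01-02", "2025-01-02"],
    "Sentiment Score": [0.9, 0.5, -0.5],
})

toy_std = (
    toy_news.groupby("Date")["Sentiment Score"]
    .std()  # sample standard deviation (ddof=1); NaN when a group has one row
    .reset_index(name="Sentiment_std")
)

# Assumption: treat a single-article day as a day without disagreement
toy_std["Sentiment_std_filled"] = toy_std["Sentiment_std"].fillna(0.0)

print(toy_std)
```

Whether to fill, drop, or keep these NaN rows depends on how the downstream model handles missing values.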


## 4. News Volume per Day

In [95]:
news_volume = news_df.groupby("Date").size().reset_index(name="n_news")

## 5. Polarity Ratio 
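What matters is not only the number of positive or negative articles, but also the ratio between them. The polarity ratio is the share of positive articles among all positive and negative articles of the day. The small constant `1e-5` in the denominator avoids division by zero on days with neither.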

In [96]:
polarity_ratio = counts_.copy()
polarity_ratio["Polarity Ratio"] = polarity_ratio["n_positive"] / (polarity_ratio["n_positive"] + polarity_ratio["n_negative"] + 1e-5)
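
A small worked example of the ratio, using made-up counts and the `n_positive` / `n_negative` column names produced by the counting step. The frame `toy_counts` is hypothetical.

```python
import pandas as pd

# Hypothetical counts: a mixed day and a day with only neutral articles
toy_counts = pd.DataFrame({
    "Date": ["2025-01-01", "2025-01-02"],
    "n_positive": [3, 0],
    "n_negative": [2, 0],
})

# Share of positive articles; 1e-5 keeps the denominator non-zero
toy_counts["Polarity Ratio"] = toy_counts["n_positive"] / (
    toy_counts["n_positive"] + toy_counts["n_negative"] + 1e-5
)

print(toy_counts)
```

One consequence of the epsilon guard is that a day with no positive and no negative articles gets a ratio of 0, the same value as a day with only negative articles. If that distinction matters, the two cases would need to be separated by another feature.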

# Stock-Based Feature Engineering


In [97]:
file_path = "../data/cleaned/cleaned_stock_data.csv"
stock_df = pd.read_csv(file_path)

## 1. Intraday return
The relative price move from the open to the close, i.e. the trend within the trading day.

In [98]:
stock_df["Intraday return"] = (stock_df["Close"] - stock_df["Open"]) / stock_df["Open"]
stock_df["Intraday return"]

0     0.000667
1     0.012075
2     0.015289
3     0.020279
4    -0.003591
5    -0.020729
6     0.001513
7    -0.014661
8    -0.001163
9    -0.002362
10   -0.000853
11    0.011880
12   -0.000471
13    0.002370
14   -0.005180
15    0.004184
16   -0.003900
17   -0.015012
18    0.003238
19    0.008262
20    0.001563
Name: Intraday return, dtype: float64

## 2. Next-day return
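The return from today's close to tomorrow's close. This is the prediction target: in hindsight, it shows whether today's news pointed in the right direction. The last row is NaN because no following close exists yet.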

In [99]:
stock_df["Next Close"] = stock_df["Close"].shift(-1)
stock_df["Next Day Return"] = (stock_df["Next Close"] - stock_df["Close"]) / stock_df["Close"]
stock_df["Next Day Return"]

0     0.005092
1     0.006108
2     0.003859
3    -0.037362
4    -0.031458
5    -0.001911
6    -0.011385
7     0.006319
8     0.005266
9     0.063146
10    0.010152
11   -0.002818
12   -0.004145
13   -0.000899
14   -0.011739
15   -0.009196
16   -0.023059
17   -0.003612
18   -0.030244
19    0.017105
20         NaN
Name: Next Day Return, dtype: float64
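
The target construction can be sketched on three made-up closing prices. `shift(-1)` pulls the next row's close onto the current row, so the last row has nothing to pull from. Dropping that row before training is an assumption about the later modeling step, and `train_ready` is a hypothetical name.

```python
import pandas as pd

# Hypothetical closing prices for three consecutive trading days
toy_stock = pd.DataFrame({"Close": [100.0, 102.0, 99.96]})

# Align tomorrow's close with today's row
toy_stock["Next Close"] = toy_stock["Close"].shift(-1)
toy_stock["Next Day Return"] = (
    toy_stock["Next Close"] - toy_stock["Close"]
) / toy_stock["Close"]

# Assumption: rows without a target are removed before model training
train_ready = toy_stock.dropna(subset=["Next Day Return"])

print(toy_stock)
```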

## 3. Gap up/down
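The gap is the relative difference between today's open and the previous close. It captures price moves that happened overnight, while the market was closed, and so indicates whether news had an impact outside trading hours.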

In [100]:
stock_df["Gap"] = (stock_df["Open"] - stock_df["Close"].shift(1)) / stock_df["Close"].shift(1)
stock_df["Gap"]

0          NaN
1    -0.006900
2    -0.009043
3    -0.016094
4    -0.033893
5    -0.010957
6    -0.003419
7     0.003325
8     0.007490
9     0.007646
10    0.064054
11   -0.001708
12   -0.002348
13   -0.006499
14    0.004304
15   -0.015857
16   -0.005317
17   -0.008170
18   -0.006829
19   -0.038190
20    0.015517
Name: Gap, dtype: float64
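
The gap and the intraday return together account for the full close-to-close move, because

$$\frac{\text{Close}_t}{\text{Close}_{t-1}} = \frac{\text{Open}_t}{\text{Close}_{t-1}} \cdot \frac{\text{Close}_t}{\text{Open}_t}, \qquad\text{so}\qquad 1 + r_t^{\text{cc}} = (1 + \text{Gap}_t)\,(1 + r_t^{\text{intraday}}).$$

The sketch below checks this identity on made-up prices. The `Close to close` column and the variable `recombined` are hypothetical names used only for this check.

```python
import numpy as np
import pandas as pd

# Hypothetical open and close prices for three trading days
toy_stock = pd.DataFrame({
    "Open":  [100.0, 103.0, 101.0],
    "Close": [101.0, 104.0, 100.0],
})

prev_close = toy_stock["Close"].shift(1)
toy_stock["Gap"] = (toy_stock["Open"] - prev_close) / prev_close
toy_stock["Intraday return"] = (toy_stock["Close"] - toy_stock["Open"]) / toy_stock["Open"]
toy_stock["Close to close"] = toy_stock["Close"].pct_change()

# Compounding the overnight and intraday moves recovers the close-to-close return
recombined = (1 + toy_stock["Gap"]) * (1 + toy_stock["Intraday return"]) - 1

print(np.allclose(recombined.iloc[1:], toy_stock["Close to close"].iloc[1:]))
```

The first row is excluded from the comparison because it has no previous close.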

## 4. Volatility
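The rolling 5-day standard deviation of the closing price measures instability over a multi-day window. A high value indicates a risky period. The first four rows are NaN because a full 5-day window is not yet available.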


In [101]:
stock_df["Volatility 5 Day"] = stock_df["Close"].rolling(5).std()
stock_df["Volatility 5 Day"]

0          NaN
1          NaN
2          NaN
3          NaN
4     3.122746
5     6.086044
6     7.105778
7     6.940778
8     3.529961
9     1.074919
10    6.044495
11    8.124467
12    7.861268
13    6.139107
14    0.863204
15    1.587434
16    2.256906
17    3.851216
18    4.269917
19    5.279598
20    4.303226
Name: Volatility 5 Day, dtype: float64
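
Because this feature is computed on the closing price itself, it is expressed in price units and grows with the price level. A common alternative, shown here only as a sketch and not as what the notebook does, is the rolling standard deviation of daily returns, which is scale-free. The prices and the names `price_vol` and `return_vol` are made up for illustration.

```python
import pandas as pd

# Hypothetical closing prices for six trading days
close = pd.Series([100.0, 101.0, 99.0, 102.0, 103.0, 101.0])

# Notebook approach: standard deviation of the price level (price units)
price_vol = close.rolling(5).std()

# Alternative (assumption): standard deviation of daily returns (unitless)
return_vol = close.pct_change().rolling(5).std()

print(pd.DataFrame({"price_vol": price_vol, "return_vol": return_vol}))
```

The return-based variant needs one more row before it produces a value, since the first return is itself NaN.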

## 5. Volume Change
The day-over-day percentage change in trading volume. A sudden spike may signal increased interest, speculation, or panic.

In [102]:
stock_df["Volume Change"] = stock_df["Volume"].pct_change()
stock_df["Volume Change"]

0          NaN
1    -0.049441
2     0.419764
3     0.097142
4     0.760819
5    -0.316720
6    -0.257931
7     0.338176
8    -0.263476
9    -0.277839
10    0.749492
11   -0.186066
12   -0.049770
13   -0.087100
14    0.215601
15   -0.157065
16   -0.078974
17    0.393330
18   -0.210590
19    0.675455
20   -0.845467
Name: Volume Change, dtype: float64
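
`pct_change()` is shorthand for the relative change against the previous row. The sketch below confirms this on made-up volumes; `manual` is a hypothetical name for the explicit formula.

```python
import numpy as np
import pandas as pd

# Hypothetical daily volumes
volume = pd.Series([1000.0, 1500.0, 750.0])

pct = volume.pct_change()

# Explicit form of the same calculation: (v_t - v_{t-1}) / v_{t-1}
manual = (volume - volume.shift(1)) / volume.shift(1)

print(pd.DataFrame({"pct_change": pct, "manual": manual}))
```

Percentage changes are not symmetric: in this example a rise of 50% followed by a fall of 50% leaves the volume below its starting point.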

# Merge News and Stock Features

Combine all engineered features into a single dataset for modeling.

In [103]:
# Preparation: combine all news-based features
news_features = daily_avg_sentiment_score\
    .merge(counts_, on="Date", how="left")\
    .merge(volatility, on="Date", how="left")\
    .merge(news_volume, on="Date", how="left")\
    .merge(polarity_ratio[["Date", "Polarity Ratio"]], on="Date", how="left")\
    .rename(columns={"Polarity Ratio": "polarity_ratio"})  # snake_case, as shown in the output below

# Convert the dates to a uniform datetime type
news_features["Date"] = pd.to_datetime(news_features["Date"])
stock_df["Date"] = pd.to_datetime(stock_df["Date"])

# Trim stock_df so that it contains only the relevant columns
stock_features = stock_df[[
    "Date", "Intraday return", "Next Day Return", "Gap",
    "Volatility 5 Day", "Volume Change"
]]

# Merge: join the news-derived and price-derived features
final_features = news_features.merge(stock_features, on="Date", how="inner")

# Save
final_features.to_csv("../data/final/engineered_features.csv", index=False)

# Check
final_features.head()


Unnamed: 0,Date,avg_sentiment,n_positive,n_negative,total_articles,Sentiment_std,n_news,polarity_ratio,Intraday return,Next Day Return,Gap,Volatility 5 Day,Volume Change
0,2025-04-28,0.9953,2,0,2,0.004525,2,0.999995,0.000667,0.005092,,,
1,2025-04-29,0.18228,3,2,5,0.868658,5,0.599999,0.012075,0.006108,-0.0069,,-0.049441
2,2025-04-30,0.55374,4,1,5,0.847931,5,0.799998,0.015289,0.003859,-0.009043,,0.419764
3,2025-05-01,0.443,7,2,9,0.823665,9,0.777777,0.020279,-0.037362,-0.016094,,0.097142
4,2025-05-02,0.647183,5,1,6,0.744187,6,0.833332,-0.003591,-0.031458,-0.033893,3.122746,0.760819
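
The merged output starts on 2025-04-28 even though the news features include 2025-04-27. The inner join keeps only dates present in both tables, so news from days without stock data (presumably days when the market was closed) is dropped. One way to see which dates fall out is an outer merge with `indicator=True`, sketched here on made-up frames. The names `toy_news`, `toy_stock`, `check`, `news_only`, and `stock_only` are hypothetical.

```python
import pandas as pd

# Hypothetical feature tables with partially overlapping dates
toy_news = pd.DataFrame({
    "Date": pd.to_datetime(["2025-04-27", "2025-04-28"]),
    "avg_sentiment": [0.9, 0.5],
})
toy_stock = pd.DataFrame({
    "Date": pd.to_datetime(["2025-04-28", "2025-04-29"]),
    "Gap": [0.01, -0.02],
})

# The _merge column reports where each date was found: left_only, right_only, or both
check = toy_news.merge(toy_stock, on="Date", how="outer", indicator=True)
news_only = check.loc[check["_merge"] == "left_only", "Date"]
stock_only = check.loc[check["_merge"] == "right_only", "Date"]

# The inner join used in the notebook keeps only the dates marked "both"
inner = toy_news.merge(toy_stock, on="Date", how="inner")

print(check)
```

Whether dropped news should instead be carried forward to the next trading day is a modeling decision that this sketch does not make.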
