## Stochastic sequential decision problem
Goal: Find the trading sequence for the YUM–MCD pair that maximizes cumulative dollar profit and loss (P/L).
Stochastic: The prices of the two assets and the news sentiment scores evolve probabilistically over time.
Sequential: On each trading day the agent observes the state and chooses one of BUY, SELL, or HOLD.

## Feasible solution method
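The objective is to maximize dollar P/L by trading mean reversion in the YUM↔MCD spread. Because the price and news time series evolve stochastically, and each day's action determines the position carried into the next day, the problem can be modeled as a Markov decision process (MDP).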
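
One way to write this objective down is sketched here. The symbols are notation introduced for illustration and are not taken from the notebook: $s_t$ is the observed state on day $t$, $a_t$ the chosen action, $r_t$ the dollar P/L credited on that day, $\gamma$ a discount factor, and $T$ the number of trading days.

$$
\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T-1} \gamma^{t}\, r_t\right],
\qquad a_t = \pi(s_t) \in \{\text{HOLD},\ \text{BUY},\ \text{SELL}\}
$$

With $\gamma = 1$ this is the expected cumulative dollar P/L. The DQN configuration later in the notebook uses $\gamma = 0.99$.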

## Why compare YUM–MCD?
Because these two stocks form a suitable pair. The Yahoo API stock price file contains the method used to find such pairs; its code shows how they were identified and what data is needed.

## Resources:
1. https://www.insightbig.com/post/developing-a-profitable-pairs-trading-strategy-with-python
2. https://databento.com/blog/build-a-pairs-trading-strategy-in-python
3. https://medium.databento.com/build-a-pairs-trading-strategy-in-python-a-step-by-step-guide-dcee006e1a50?gi=5738dae53da6
4. https://medium.com/@ngao7/markov-decision-process-value-iteration-2d161d50a6ff
5. https://wire.insiderfinance.io/markov-decision-processes-mdp-ai-meets-finance-algorithms-series-7f34de5680d5
6. https://python.plainenglish.io/understanding-markov-decision-processes-17e852cd9981
7. https://www.datacamp.com/tutorial/markov-chains-python-tutorial


In [399]:
import gym
from gym import spaces
import nltk
import numpy as np
import pandas as pd
import statsmodels.api as sm
import yfinance as yf
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # SentimentIntensityAnalyzer needs the VADER lexicon.


In [370]:
"""
Resources: 
https://www.insightbig.com/post/developing-a-profitable-pairs-trading-strategy-with-python
https://databento.com/blog/build-a-pairs-trading-strategy-in-python
"""
sia = SentimentIntensityAnalyzer()

# Fetch YUM news through the Yahoo API and compute a sentiment score for each article.
ticker = yf.Ticker("YUM")
summaryList = []  # News article summaries.
for article in ticker.news:
    summaryList.append(article["content"]["summary"])

totalScore = []  # VADER polarity scores for each summary.
for i in summaryList:
    sentiment_scores = sia.polarity_scores(i)
    totalScore.append(sentiment_scores)

compoundScore = [s['compound'] for s in totalScore]
avgCoundScoreYUM = sum(compoundScore) / len(compoundScore)
print(compoundScore)
# The average compound score shows whether current news is positive or negative. Range: -1 to +1.
print(round(avgCoundScoreYUM, 4))

[0.4926, 0.0, 0.749, 0.0, 0.8016, 0.8619, 0.0, 0.7574, 0.6597, 0.7096]
0.5032


In [372]:
# Repeat the same steps for the other company in the pair (MCD).
ticker = yf.Ticker("MCD")
summaryList = []  # News article summaries.
for article in ticker.news:
    summaryList.append(article["content"]["summary"])

totalScore = []  # VADER polarity scores for each summary.
for i in summaryList:
    sentiment_scores = sia.polarity_scores(i)
    totalScore.append(sentiment_scores)

compoundScore = [s['compound'] for s in totalScore]
avgCoundScoreMCD = sum(compoundScore) / len(compoundScore)

print(compoundScore)
# The average compound score shows whether current news is positive or negative. Range: -1 to +1.
print(round(avgCoundScoreMCD, 4))

[-0.631, 0.9601, 0.6901, 0.8835, -0.1027, 0.0, -0.1189, 0.3182, 0.6808, 0.0]
0.268
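
The two cells above repeat the same averaging logic. As a minimal sketch, it could be factored into one helper. `mean_compound` is a hypothetical name, and the empty-list fallback of 0.0 is an assumption added here so that a ticker with no news does not raise `ZeroDivisionError`. The inputs are the compound scores printed above.

```python
from statistics import fmean


def mean_compound(scores):
    """Average the 'compound' field of VADER-style score dicts.

    Returns 0.0 for an empty list (an assumed fallback, not in the notebook).
    """
    compounds = [s["compound"] for s in scores]
    return fmean(compounds) if compounds else 0.0


# Compound scores copied from the two printed outputs above.
yum = [{"compound": c} for c in
       [0.4926, 0.0, 0.749, 0.0, 0.8016, 0.8619, 0.0, 0.7574, 0.6597, 0.7096]]
mcd = [{"compound": c} for c in
       [-0.631, 0.9601, 0.6901, 0.8835, -0.1027, 0.0, -0.1189, 0.3182, 0.6808, 0.0]]

avg_yum = round(mean_compound(yum), 4)
avg_mcd = round(mean_compound(mcd), 4)
diff = round(avg_yum - avg_mcd, 4)  # YUM minus MCD
print(avg_yum, avg_mcd, diff)
```

This reproduces the printed averages (0.5032 and 0.268). Their difference, 0.2352, is the value written into `diff_score` for the latest row.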


In [374]:

# 1. Download daily closing prices from 2020 onward (about five years).
tickers   = ["YUM", "MCD"]
start_day = "2020-01-01"

raw = yf.download(tickers, start=start_day, progress=False)["Close"]

# 2. Log spread = ln(P_YUM) − ln(P_MCD)
log_price = np.log(raw)
spread    = log_price["YUM"] - log_price["MCD"]

# 3. Rolling mean, rolling standard deviation, and Z-score over a 30-day window
win = 30
spread_MA = spread.rolling(win).mean()
spread_STD = spread.rolling(win).std(ddof=0)
Z_score = (spread - spread_MA) / spread_STD
diff_score = np.zeros_like(spread)  # News score placeholder; only the latest row is filled in below.
MCD = raw["MCD"]
YUM = raw["YUM"]

# 4. DataFrame
df = pd.DataFrame({  # State
    "spread"    : spread,      # Log spread; shows roughly by what percentage YUM is priced above or below MCD.
    "spread_MA" : spread_MA,   # Rolling mean: the "normal" level in a mean-reversion strategy. The full-history mean reacts too slowly when the series shifts, so a rolling mean is used.
    "spread_STD": spread_STD,  # Volatility (σ) over the same window. Gives a scale for judging whether a given move in the spread is large or small.
    "Z_score"   : Z_score,     # Normalized distance from the mean. Thresholds such as ±2σ define long/short entry and exit rules, and a deep RL model can read this distance directly.
    "price"     : spread,      # Same log spread. Positive: YUM is relatively more expensive than MCD; negative: relatively cheaper. Example: price = 0.1 => e^0.1 ≈ 1.105, i.e. about 10.5% more expensive.
    # When long (action 1), a rising spread produces a profit.
    # When short (action 2), a falling spread produces a profit.
    "diff_score": diff_score,  # News score difference (YUM minus MCD). Zero on every row except the latest, because past news scores are not needed.
    "MCD_closed_price": MCD,
    "YUM_closed_price": YUM
}).dropna()  # Remove NaN rows.

latest_idx = df.index[-1]
df.at[latest_idx, "diff_score"] = ( round(avgCoundScoreYUM, 4) - round(avgCoundScoreMCD, 4))

In [376]:
print(df)

              spread  spread_MA  spread_STD   Z_score     price  diff_score  \
Date                                                                          
2020-02-13 -0.699376  -0.682026    0.015843 -1.095178 -0.699376      0.0000   
2020-02-14 -0.695480  -0.683479    0.014988 -0.800704 -0.695480      0.0000   
2020-02-18 -0.703084  -0.685199    0.014154 -1.263572 -0.703084      0.0000   
2020-02-19 -0.689587  -0.686077    0.013571 -0.258643 -0.689587      0.0000   
2020-02-20 -0.699661  -0.687300    0.013077 -0.945242 -0.699661      0.0000   
...              ...        ...         ...       ...       ...         ...   
2025-04-28 -0.762781  -0.720042    0.046216 -0.924775 -0.762781      0.0000   
2025-04-29 -0.755869  -0.723242    0.045251 -0.721010 -0.755869      0.0000   
2025-04-30 -0.753662  -0.726446    0.043868 -0.620418 -0.753662      0.0000   
2025-05-01 -0.746988  -0.729109    0.042591 -0.419786 -0.746988      0.0000   
2025-05-02 -0.737989  -0.731543    0.040908 -0.15757
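
The `Z_score` column can be reproduced by hand from the other columns. Below is a minimal pure-Python sketch of the same rolling calculation. `rolling_zscore` is a hypothetical helper, not part of the notebook, and it uses the population standard deviation to match `ddof=0`.

```python
from statistics import fmean, pstdev


def rolling_zscore(values, window):
    """Z-score of each point against the trailing window that ends at it.

    Uses the population standard deviation (ddof=0).
    Positions without a full window, or with zero deviation, get None.
    """
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
            continue
        chunk = values[i + 1 - window : i + 1]
        mean, std = fmean(chunk), pstdev(chunk)
        out.append((values[i] - mean) / std if std > 0 else None)
    return out


# Toy series with a 3-point window (the notebook uses win = 30).
z = rolling_zscore([1.0, 2.0, 3.0, 6.0], window=3)
print(z)

# Sanity check against the first printed row of df:
# (spread - spread_MA) / spread_STD should match the Z_score column.
row_z = (-0.699376 - (-0.682026)) / 0.015843
print(round(row_z, 3))
```

The second print agrees with the `Z_score` of -1.095178 shown for 2020-02-13, up to the rounding of the displayed columns.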

## Find beta (the hedge ratio used to express the pair in dollar terms)
Beta is the OLS slope of YUM's price on MCD's price. Using this value, we size the MCD leg of the portfolio: for each YUM share bought or sold, beta MCD shares are sold or bought.

In [390]:
# Remove any missing or infinite values.
clean = (
    raw.replace([np.inf, -np.inf], np.nan)
       .dropna()
)

X = sm.add_constant(clean["MCD"])      # Add an intercept column to the regressor.
model = sm.OLS(clean["YUM"], X).fit()  # Fit an OLS (ordinary least squares) regression of YUM on MCD.
beta  = round(model.params["MCD"], 4)
print("β =", beta)

pair_price = clean["YUM"] - beta * clean["MCD"]  # Dollar value of the pair: long 1 YUM, short beta MCD.
print(pair_price.head())

β = 0.4381
Date
2020-01-02    14.751992
2020-01-03    14.737256
2020-01-06    13.809942
2020-01-07    13.856747
2020-01-08    12.747439
dtype: float64
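
For a single regressor with an intercept, the OLS slope reported by `statsmodels` reduces to a closed form: the sum of cross-deviations divided by the sum of squared deviations of the regressor. The sketch below shows that formula on made-up prices. `ols_slope_intercept` and the toy data are assumptions for illustration only.

```python
from statistics import fmean


def ols_slope_intercept(x, y):
    """Least-squares fit of y = alpha + beta * x; returns (beta, alpha)."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    beta = sxy / sxx
    return beta, my - beta * mx


# Hypothetical prices, chosen so that y = 1 + 2 * x exactly.
mcd_px = [1.0, 2.0, 3.0, 4.0]
yum_px = [3.0, 5.0, 7.0, 9.0]

beta_hat, alpha_hat = ols_slope_intercept(mcd_px, yum_px)

# Pair value per day: long 1 unit of y, short beta_hat units of x.
pair = [y - beta_hat * x for x, y in zip(mcd_px, yum_px)]
print(beta_hat, alpha_hat, pair)
```

On this toy data the pair value is constant, which is the idealized case: once the MCD leg is scaled by beta, the common price movement cancels out.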


In [498]:
class PairTradingEnv(gym.Env):
    metadata = {'render.modes': ['human']}
    
    # Policy
    def sample_policy(self):
        """Rule-based baseline: trade on the Z-score, confirmed by the news score when one exists."""
        row = self.data.iloc[self.current_step]
        z = float(row['Z_score'])
        diff_score = float(row['diff_score'])
        if diff_score == 0:  # No news score for this day (only the latest row has one).
            if z > 1.0:      # YUM has become relatively expensive compared to MCD.
                return 2     # Short the pair (sell YUM, buy MCD)
            elif z < -1.0:   # YUM has become relatively cheap.
                return 1     # Long the pair (buy YUM, sell MCD)
            else:
                return 0     # Hold
        else:  # News score available: require it to agree with the Z-score signal.
            if z > 1.0 and diff_score <= -0.1:
                return 2
            elif z < -1.0 and diff_score >= 0.1:
                return 1
            else:
                return 0
    
    def __init__(self, data, beta=beta):
        """
        data: pandas DataFrame with columns
            - 'spread'           : Log price spread
            - 'spread_MA'        : Moving average of the spread
            - 'spread_STD'       : Standard deviation of the spread
            - 'Z_score'          : Z-score, (spread - MA) / STD
            - 'price'            : Pair price (the log spread)
            - 'diff_score'       : News score difference
            - 'MCD_closed_price' : MCD closing price
            - 'YUM_closed_price' : YUM closing price
        beta: hedge ratio (MCD shares per YUM share)
        """
        super(PairTradingEnv, self).__init__()

        self.data = data.reset_index(drop=True)
        self.n_steps = len(self.data)
        self.current_step = 0
        self.beta = beta
        # Action: 0-Hold, 1-Long, 2-Short
        self.action_space = spaces.Discrete(3)

        # Observation: the 8 columns listed in the docstring
        low = -np.inf * np.ones(8)
        high = np.inf * np.ones(8)
        self.observation_space = spaces.Box(low=low, high=high, dtype=np.float32)

        # Current position: 0-flat, 1-long, -1-short
        self.position = 0
        self.entry_price = 0

    def reset(self):
        self.current_step = 0
        self.position = 0
        self.entry_price = 0
        return self._next_observation()
    
    def _next_observation(self):
        obs = self.data.iloc[self.current_step][['spread', 
                                                 'spread_MA', 
                                                 'spread_STD', 
                                                 'Z_score', 
                                                 'price',
                                                 'diff_score',
                                                 'MCD_closed_price',
                                                 'YUM_closed_price'
                                                ]].values
        return obs.astype(np.float32)
    
    def step(self, action):
        """
        Process one step.
        action: int, {0: Hold, 1: Long, 2: Short}
        """
        done = False
        reward = 0.0         # P/L in log-spread units
        dollar_reward = 0.0  # P/L in dollars
        row_now  = self.data.iloc[self.current_step]
        # At step 0 this index wraps to the last row, but the position is flat then, so the value is unused.
        row_prev = self.data.iloc[self.current_step - 1]
        # One-day dollar change of the pair (1 YUM minus beta MCD), from the previous row to the current row.
        delta_real_price = ((row_now["YUM_closed_price"] - row_prev["YUM_closed_price"])
                            - self.beta * (row_now["MCD_closed_price"] - row_prev["MCD_closed_price"]))

        # Move to the next step
        self.current_step += 1
        if self.current_step >= self.n_steps - 1:
            done = True

        current_price = self.data.iloc[self.current_step]['price']

        if action == 1:  # Long
            if self.position < 0:  # Close the open short first.
                reward += (self.entry_price - current_price)
                dollar_reward += (-delta_real_price)
                self.position = 0

            if self.position == 0:
                self.position = 1
                self.entry_price = current_price
        elif action == 2:  # Short
            if self.position > 0:  # Close the open long first.
                reward += (current_price - self.entry_price)
                dollar_reward += delta_real_price
                self.position = 0

            if self.position == 0:
                self.position = -1
                self.entry_price = current_price
        else:  # 0: Hold. In this environment it closes any open position.
            if self.position != 0:
                dollar_reward += delta_real_price * self.position
                if self.position == 1:
                    reward += (current_price - self.entry_price)
                else:
                    reward += (self.entry_price - current_price)
                self.position = 0
                self.entry_price = 0

        # Update the state
        obs = self._next_observation()

        info = {"dollar_reward": dollar_reward}

        return obs, reward, done, info
        
    def render(self, mode='human', close=False):
        yum_stock = self.data.iloc[self.current_step]['YUM_closed_price']
        mcd_stock = self.data.iloc[self.current_step]['MCD_closed_price']
        pp = self.data.iloc[self.current_step]['price']

    
        action_txt = { 1: f"BUY 1 YUM & SELL {self.beta:.2f} MCD",
                      -1: f"SELL 1 YUM & BUY  {self.beta:.2f} MCD",
                       0: "HOLD"}[self.position]
    
        print(f"Step {self.current_step:3d} | Position: {self.position} | YUM ${yum_stock:,.2f} | MCD ${mcd_stock:,.2f} "
              f"| Pair ${pp:,.2f} | {action_txt:24s} ")


In [500]:
# Instantiate PairTradingEnv with the DataFrame df.
env = PairTradingEnv(df)

state = env.reset()
print("Initial State:", state)
print("Beta: ",beta)

done = False
total_reward = 0
total_dollar = 0
while not done:
    action = env.sample_policy()  # Determine the action using the rule-based policy.
    state, reward, done, info = env.step(action)
    dollar_reward = info.get("dollar_reward", 0.0)
    total_reward += reward
    total_dollar += dollar_reward
    print(total_dollar)
    print('====================================================================================================')
    env.render()


Initial State: [-6.9937646e-01 -6.8202597e-01  1.5842598e-02 -1.0951782e+00
 -6.9937646e-01  0.0000000e+00  1.9254153e+02  9.5672935e+01]
Beta:  0.4381
0.0
Step   1 | Position: 1 | YUM $95.90 | MCD $192.25 | Pair $-0.70 | BUY 1 YUM & SELL 0.44 MCD 
0.3557881057739258
Step   2 | Position: 0 | YUM $94.76 | MCD $191.42 | Pair $-0.70 | HOLD                     
0.3557881057739258
Step   3 | Position: 1 | YUM $95.82 | MCD $190.96 | Pair $-0.69 | BUY 1 YUM & SELL 0.44 MCD 
1.6141677368164062
Step   4 | Position: 0 | YUM $94.62 | MCD $190.47 | Pair $-0.70 | HOLD                     
1.6141677368164062
Step   5 | Position: 0 | YUM $94.29 | MCD $191.17 | Pair $-0.71 | HOLD                     
1.6141677368164062
Step   6 | Position: 1 | YUM $91.32 | MCD $189.09 | Pair $-0.73 | BUY 1 YUM & SELL 0.44 MCD 
1.6141677368164062
Step   7 | Position: 1 | YUM $89.22 | MCD $187.83 | Pair $-0.74 | BUY 1 YUM & SELL 0.44 MCD 
1.6141677368164062
Step   8 | Position: 1 | YUM $88.81 | MCD $186.06 | Pair $-0.74
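
The rule that drives this baseline run can be isolated from the environment as a pure function, which makes the thresholds easy to test. This is a sketch that mirrors `sample_policy`. The function name and the `z_entry` / `news_entry` parameters are hypothetical, with defaults set to the constants used in the class (1.0 and 0.1).

```python
HOLD, LONG, SHORT = 0, 1, 2


def threshold_policy(z, diff_score, z_entry=1.0, news_entry=0.1):
    """Stateless version of the Z-score rule.

    With no news score (diff_score == 0) the Z-score alone decides.
    With a news score, a trade also needs the news to point the same way.
    """
    if diff_score == 0:
        if z > z_entry:
            return SHORT
        if z < -z_entry:
            return LONG
        return HOLD
    if z > z_entry and diff_score <= -news_entry:
        return SHORT
    if z < -z_entry and diff_score >= news_entry:
        return LONG
    return HOLD


# First row of df: Z_score = -1.095178, no news score.
print(threshold_policy(-1.095178, 0.0))
# Cheap spread, but news favors MCD: the signals disagree.
print(threshold_policy(-1.5, -0.3))
```

The first call returns the long action, consistent with `Position: 1` at Step 1 in the output above. The second returns hold because the Z-score and the news score disagree.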

In [502]:
print("Total Reward:", total_reward)
print(f"Total Dollar: ${round(total_dollar,3)}")

Total Reward: -0.054155252913970386
Total Dollar: $199.242
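
The dollar figure is built from one-day changes in the pair value, `ΔYUM − β·ΔMCD`, signed by the position. A minimal sketch of that arithmetic follows. The helper names are hypothetical, and the prices are the ones printed in the initial state and on the Step 1 line, so they are rounded.

```python
def pair_dollar_change(yum_prev, yum_now, mcd_prev, mcd_now, beta):
    """One-day dollar change of a pair holding +1 YUM and -beta MCD."""
    return (yum_now - yum_prev) - beta * (mcd_now - mcd_prev)


def position_pnl(position, delta):
    """Dollar P/L for a position of +1 (long), -1 (short) or 0 (flat)."""
    return position * delta


# Prices from the printed initial state and the Step 1 line (rounded).
delta = pair_dollar_change(95.672935, 95.90, 192.54153, 192.25, beta=0.4381)
print(round(position_pnl(+1, delta), 2))  # long the pair
print(round(position_pnl(-1, delta), 2))  # short the pair
```

The long result is about 0.35, close to the first nonzero running total in the output above (0.3558). The small gap most likely comes from the rounded Step 1 prices.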


In [409]:
from stable_baselines3 import DQN

In [504]:
# env: PairTradingEnv instance
model = DQN(
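    policy="MlpPolicy",      # Multilayer perceptron policy network
    env=env,                 # Environment used for training
    learning_rate=1e-4,      # Learning rate
    buffer_size=10000,       # Replay buffer size
    learning_starts=128,     # Minimum number of steps before learning starts
    target_update_interval=1000,
    gamma=0.99,              # Default discount factor (can be changed dynamically when sentiment is incorporated)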
    verbose=1,
    exploration_fraction=0.3
)


Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
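
`exploration_fraction=0.3` means the ε-greedy exploration rate decays linearly over the first 30% of training and then stays constant. The sketch below computes that schedule. It assumes a start value of 1.0 and an end value of 0.05, which are, to my understanding, the Stable-Baselines3 DQN defaults and are not set explicitly in this notebook. It also assumes the 2,000,000 total timesteps used for training. `linear_epsilon` is a hypothetical helper, not a library function.

```python
def linear_epsilon(step, total_timesteps, exploration_fraction=0.3,
                   initial_eps=1.0, final_eps=0.05):
    """Exploration rate after `step` environment steps under linear decay.

    initial_eps and final_eps are assumed defaults, not values from the notebook.
    """
    progress = step / total_timesteps
    if progress >= exploration_fraction:
        return final_eps
    return initial_eps + progress * (final_eps - initial_eps) / exploration_fraction


for step in (5_244, 10_488, 600_000):
    print(step, round(linear_epsilon(step, 2_000_000), 3))
```

Under these assumptions the first two values come out as 0.992 and 0.983, matching the `exploration_rate` entries in the training log at 5,244 and 10,488 timesteps. From step 600,000 onward the rate stays at 0.05.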




In [506]:
total_timesteps = 2000000
model.learn(total_timesteps=total_timesteps)


----------------------------------
| rollout/            |          |
|    ep_len_mean      | 1.31e+03 |
|    ep_rew_mean      | 0.015    |
|    exploration_rate | 0.992    |
| time/               |          |
|    episodes         | 4        |
|    fps              | 4470     |
|    time_elapsed     | 1        |
|    total_timesteps  | 5244     |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 0.00341  |
|    n_updates        | 1278     |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 1.31e+03 |
|    ep_rew_mean      | -0.0788  |
|    exploration_rate | 0.983    |
| time/               |          |
|    episodes         | 8        |
|    fps              | 4551     |
|    time_elapsed     | 2        |
|    total_timesteps  | 10488    |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 0.00372  |
|    n_updates      

<stable_baselines3.dqn.dqn.DQN at 0x3151ab410>

In [508]:
model.save("dqn_pairtrading")
# Reload the saved model when reusing it later.
model = DQN.load("dqn_pairtrading", env=env)


Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.




In [510]:
obs = env.reset()
done = False
total_reward = 0.0
total_dollar = 0.0

while not done:
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    
    dollar_reward = info.get("dollar_reward", 0.0)
    total_reward += reward
    total_dollar  += dollar_reward
    env.render()
    print('-',total_dollar)


Step   1 | Position: 1 | YUM $95.90 | MCD $192.25 | Pair $-0.70 | BUY 1 YUM & SELL 0.44 MCD 
- 0.0
Step   2 | Position: 0 | YUM $94.76 | MCD $191.42 | Pair $-0.70 | HOLD                     
- 0.3557881057739258
Step   3 | Position: 1 | YUM $95.82 | MCD $190.96 | Pair $-0.69 | BUY 1 YUM & SELL 0.44 MCD 
- 0.3557881057739258
Step   4 | Position: 0 | YUM $94.62 | MCD $190.47 | Pair $-0.70 | HOLD                     
- 1.6141677368164062
Step   5 | Position: 0 | YUM $94.29 | MCD $191.17 | Pair $-0.71 | HOLD                     
- 1.6141677368164062
Step   6 | Position: 1 | YUM $91.32 | MCD $189.09 | Pair $-0.73 | BUY 1 YUM & SELL 0.44 MCD 
- 1.6141677368164062
Step   7 | Position: 1 | YUM $89.22 | MCD $187.83 | Pair $-0.74 | BUY 1 YUM & SELL 0.44 MCD 
- 1.6141677368164062
Step   8 | Position: 1 | YUM $88.81 | MCD $186.06 | Pair $-0.74 | BUY 1 YUM & SELL 0.44 MCD 
- 1.6141677368164062
Step   9 | Position: 1 | YUM $84.41 | MCD $178.00 | Pair $-0.75 | BUY 1 YUM & SELL 0.44 MCD 
- 1.614167736

In [512]:
print(total_dollar)

101.03629010620116
