# Welcome to the Cointegration Challenge!
Welcome to an intriguing journey into the world of cointegration and pairs trading in quantitative finance. Cointegration is a statistical property that describes a long-term equilibrium relationship between time series, and it plays a vital role in strategies such as pairs trading, risk management, and portfolio optimization.

As we delve into this challenge, we will not only unravel the theory behind cointegration but also its practical implications in real-world finance. Our focus will be on mastering cointegration in the context of pairs trading, a market-neutral strategy that capitalises on the historical co-movement of assets. Pairs trading involves identifying and leveraging the relationship between two historically correlated assets, where their combined position is designed to be broadly market-neutral.

Here is where cointegration becomes a game-changer. Unlike correlation, which merely indicates the tendency of assets to move together, cointegration suggests a deeper, more stable relationship. In the realm of non-stationary time series data, which is characteristic of most financial markets, cointegration provides a more robust and statistically significant framework. This is vital in pairs trading, as it allows us to identify pairs of assets whose price movements are not just temporarily aligned but are likely to converge in the long run, despite short-term deviations.

Throughout this challenge, you will harness the power of Python and leverage essential libraries like pandas, numpy, and statsmodels to dissect, analyze, and model cointegrated time series.

## Challenge Overview
* **OLS Regression**: Conduct an Ordinary Least Squares (OLS) regression in which the first difference of the series is regressed on its lagged level and any lagged differences. This captures the relationship between the changes in the series and its past values.
* **Augmented Dickey-Fuller**: Implement the Augmented Dickey-Fuller (ADF) test to obtain the test statistic, which measures how significant the lagged level is in explaining the changes in the series.
* **Trading Bounds Analysis**: Determine the upper and lower trading bounds from the residuals, which are essential for developing a pairs trading strategy.
* **Application Across Three Rounds**:
  * Round 1: Apply your newfound knowledge to an initial dataset, analyzing results and making trading decisions.
  * Round 2: Fine-tune your approach on a new, unseen dataset, refining your strategy based on initial insights.
  * Round 3: Test your refined strategy on a new, unseen dataset, putting your skills to the ultimate test.

## Assets Overview
You will have access to the following assets throughout the challenge.

* Celestial Tech Enterprises (CTEN)
* Oceanic Trade Networks (OTNW)
* Zephyr Mobility Group (ZMG)
* Etherwave Communications (ETWC)
* Apex BioSolutions Inc. (ABSI)
* Pinnacle Finance Collective (PFNC)
* Infinite Realm Interactive (IRIN)
* Cybernex Software Corp. (CNSC)
* Pixelstream Graphics Ltd. (PXGL)
* Starbound Motors Corp. (SBMC)

Whether you are a beginner or have experience in data science, this challenge will offer valuable insights into the practical application of cointegration in financial strategies. You will learn, experiment, and apply these concepts in a hands-on environment, preparing you for real-world quantitative finance challenges.

Happy coding, and may your analytical skills flourish! Note that some cells in this notebook are locked and cannot be edited; they provide the foundation for your own work.

In [None]:
# The following are private packages available only during this simulation:
from AmplifyQuantTrading import Data
from AmplifyArbitrageTrading.TradingEngine import TradingEngine, trading_engine_status, trading_engine_submission

# The following are publicly available packages:
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np

# Imports for OLS Regression Analysis
import statsmodels.api as sm

# Imports for ADF Test
from statsmodels.tsa.stattools import adfuller

In [None]:
portal_id = 25831

# Prices to be used throughout the Simulation
historical_prices = Data.get_historical_price_series_for_round(portal_id, is_dataframe=True)
trading_engine_status(portal_id)

In [None]:
# Plotting Grid
fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(15, 15))
fig.tight_layout(pad=5.0)

tickers = historical_prices.columns
for i, ticker in enumerate(tickers):
    ax = axes[i//2, i%2]
    historical_prices[ticker].plot(ax=ax, title=ticker)
    ax.set_xlabel("Time")
    ax.set_ylabel("Price")

plt.show()

# OLS Regression Analysis: Exploring the Relationship between two assets.

In financial econometrics, understanding the relationship between two time series, especially their linear interdependence, can be pivotal. One of the primary tools for this purpose is the Ordinary Least Squares (OLS) regression. The OLS method minimizes the sum of the squared differences between the observed and estimated values.

Given two assets, we aim to determine if there exists a linear relationship between them. This relationship can be represented by:

$$ Y_t = \alpha + \beta X_t + \epsilon_t $$


Where:
* $Y_t$ is the price of Asset 1 at time t (the dependent variable).
* $X_t$ is the price of Asset 2 at time t (the independent variable).
* α is the intercept term.
* β is the slope coefficient, representing the change in Asset 1's price for a unit change in Asset 2's price.
* $\epsilon_t$ is the error term at time t, capturing all other factors affecting Asset 1's price that are not explained by Asset 2's price.
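
For a single regressor, minimizing the sum of squared errors has a closed-form solution: β is the sample covariance of X and Y divided by the sample variance of X, and α is the mean of Y minus β times the mean of X. The sketch below checks this on synthetic data. The series `x` and `y` are made up for illustration and are not the challenge prices.

```python
import numpy as np

# Synthetic prices, for illustration only (not the challenge data)
rng = np.random.default_rng(0)
x = 100 + np.cumsum(rng.normal(size=500))             # stands in for Asset 2
y = 5.0 + 0.8 * x + rng.normal(scale=0.5, size=500)   # stands in for Asset 1

# Closed-form least-squares estimates for one regressor
beta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
alpha = y.mean() - beta * x.mean()

# Cross-check against NumPy's degree-1 least-squares fit (returns slope, then intercept)
beta_np, alpha_np = np.polyfit(x, y, 1)

print(f"alpha = {alpha:.4f}, beta = {beta:.4f}")
```

The two sets of estimates should agree up to floating-point error, which is also what `sm.OLS` computes in the next cell.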

In [None]:
import statsmodels.api as sm

y = historical_prices["CTEN"] # Dependent variable.
X = historical_prices["CNSC"] # Independent variable.
X = sm.add_constant(X) # Independent variable with constant.

# Perform OLS regression
model = sm.OLS(y, X).fit()
model.summary()

Using the statsmodels library, we’ve performed an OLS regression to understand the linear relationship between the assets. The following are the key results:

1. **Intercept (Alpha)**:
* This value represents the expected value of Asset 1 when Asset 2 is zero. It’s the point where the regression line crosses the Y-axis.

2. **Slope (Beta)**:
* This coefficient represents the change in Asset 1 for a unit change in Asset 2.

3. **R-squared**:
* This metric provides the proportion of the variance in the dependent variable (Asset 1) that’s predictable from the independent variable (Asset 2). An R-squared of 0.333 suggests that approximately 33.3% of the variability in Asset 1 can be explained by Asset 2.

4. **Durbin-Watson**:
* This statistic tests for the presence of autocorrelation in the residuals. It ranges from 0 to 4: a value close to 2 suggests no autocorrelation, values towards 0 indicate positive autocorrelation, and values towards 4 indicate negative autocorrelation.

5. **Condition Number**:
* A high value suggests potential multicollinearity issues or other numerical problems.

### Implement OLS Regression for the other pairs.

In the section below you can use the ```sm.OLS(y, X).fit()``` to check for the other assets being used in this challenge.

<img src="https://algo-assets.amplifyme.com/quant/gresearch/GResearch_Cointegration_C2.png" width="100%"/>

## Spread and Its Significance in Pairs Trading

In pairs trading, the spread is a pivotal concept. It quantifies the deviation of the relationship between two assets from their historical mean. When the spread substantially diverges from this mean, trading opportunities might emerge.


The spread is calculated as:

$$
\text{Spread}_t = Y_t - (\alpha + \beta \times X_t)
$$

Here: 
* Y = Asset 1 (the dependent variable in our OLS regression)
* X = Asset 2 (the independent variable in our OLS regression)
* α represents the intercept from our OLS regression. 
* β denotes the slope or hedge ratio from our OLS regression.


The spread, in essence, captures the residuals of our regression model. It signifies the difference between the actual price of Asset 1 and its predicted value based on the price of Asset 2.

## Visualising Residuals for Spread Trading

It’s essential to compute the residuals to be used for trading. These residuals represent the deviations between the observed price of Asset 1 and its predicted value based on the price of Asset 2.

Residuals are an integral part of pairs trading. They encapsulate the temporary mispricings or deviations from the established relationship between the two assets, which are the trading opportunities we aim to exploit.

In [None]:
# Calculate residuals
X_const = sm.add_constant(historical_prices["CNSC"])
model = sm.OLS(historical_prices["CTEN"], X_const).fit()
residuals = model.resid

plt.figure(figsize=(14,7))
plt.plot(residuals, label='Residuals', color='blue')
plt.axhline(residuals.mean(), color='red', linestyle='--', label='Mean')
plt.title('Residuals Over Time with Mean')
plt.legend()
plt.grid(True)
plt.show()

## Augmented Dickey-Fuller Test

The Augmented Dickey-Fuller (ADF) test is a prominent unit root test used to diagnose the stationarity of a given time series. A time series is deemed stationary if its statistical attributes (such as mean and variance) do not alter over time. For our pairs trading strategy, ascertaining stationarity is pivotal, as it ensures that the spread between the pairs is mean-reverting, allowing for potential arbitrage opportunities.

#### Implementation Steps:
1. **OLS Regression**:
* An Ordinary Least Squares (OLS) regression is conducted in which the first difference of the series is regressed on its lagged level and any lagged differences. The purpose is to capture the relationship between the changes in the series and its past values.

2. **Test Statistic Calculation**:
* The ADF test statistic is the coefficient on the lagged level of the series divided by its standard error. This statistic gauges the significance of the lagged level in explaining the changes in the series.

3. **Comparison with Critical Values**:
* To make an inference regarding stationarity, the computed test statistic is contrasted against predefined critical values. If the statistic is less than the critical value (often at 1%, 5%, or 10% significance levels), the null hypothesis of a unit root (implying non-stationarity) is rejected, signaling that the series is stationary.

In [None]:
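def adf_test(y):
    """Run the ADF test on y, print the results, and return the p-value."""
    y = pd.Series(y)

    # Using statsmodels to get the results (maxlag=0: no lagged differences)
    adf_result = adfuller(y, maxlag=0)
    pvalue = adf_result[1]
    critical_values = adf_result[4]
    
    print(f"ADF Statistic (statsmodels): {adf_result[0]}")
    print(f"p-value: {pvalue:.10f}")
    for key, value in critical_values.items():
        print(f"Critical Value ({key}): {value}")

    # Return the p-value so callers can compare it against a significance level
    return pvalue

adf_test(residuals)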

<img src="https://algo-assets.amplifyme.com/quant/gresearch/GResearch_Cointegration_C3.png" width="100%"/>

## Trading Bounds Analysis

Pairs trading is predicated on the belief that if two co-integrated securities drift apart in terms of price, they will eventually revert back to a mean. The strategy thus revolves around identifying deviations from this mean and capitalizing on the expected mean reversion.

1. **Long Signal**:

This arises when the spread (or the residuals) dips below a predetermined lower threshold. A low spread indicates that, relative to its counterpart, the primary asset (as per our regression) is undervalued. In response, we should:

* Long the primary asset (believing its price will rise).
* Short the secondary asset (anticipating a price drop).

$$
\text{Spread}_t < \text{Lower Bound} \Rightarrow \text{Long Asset 1, Short Asset 2}
$$

2. **Short Signal**:

This is activated when the spread exceeds a predetermined upper threshold. A high spread signifies that the primary asset is overpriced relative to the secondary asset. Consequently, we should:

* Short the primary asset (anticipating its price will fall).
* Long the secondary asset (believing its price will ascend).

$$
\text{Spread}_t > \text{Upper Bound} \Rightarrow \text{Short Asset 1, Long Asset 2}
$$


3. **Exit Signal (Neutralize):**

This signal prompts us to close out any positions and return to a neutral stance. It’s initiated when the spread converges back towards its mean, hinting at the assets reverting to their equilibrium relationship. This is our cue to capitalize on the positions we’ve taken based on the long/short signals and secure our gains.

$$ 
\text{Lower Bound} \leq \text{Spread}_t \leq \text{Upper Bound} \Rightarrow \text{Exit Positions}
$$
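
As a small worked example of the three rules, the sketch below maps a short, made-up spread series to target positions (+1 for long Asset 1 and short Asset 2, -1 for the reverse, 0 for flat). The numbers and the choice of Z = 1 are illustrative assumptions only.

```python
import numpy as np
import pandas as pd

# Made-up spread values, for illustration only
spread = pd.Series([0.2, 1.4, 2.3, 0.9, -0.5, -2.6, -1.1, 0.1])

z = 1.0
upper_bound = spread.mean() + z * spread.std()
lower_bound = spread.mean() - z * spread.std()

# Long below the lower bound, short above the upper bound, flat in between
position = pd.Series(
    np.select([spread < lower_bound, spread > upper_bound], [1, -1], default=0),
    index=spread.index,
)

print(f"bounds: [{lower_bound:.3f}, {upper_bound:.3f}]")
print(position.tolist())  # [0, 0, -1, 0, 0, 1, 0, 0]
```

Only the values 2.3 and -2.6 fall outside the bounds of roughly ±1.5 around the mean, so every other observation maps to the exit (flat) state.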



In [None]:
def generate_signals(residuals, Z, price_data_index, show_graph=True):
    # Generate bounds
    mu_e = residuals.mean()
    sigma_eq = residuals.std()
    upper_bound = mu_e + Z * sigma_eq
    lower_bound = mu_e - Z * sigma_eq
    if not isinstance(residuals, pd.Series):
        residuals_series = pd.Series(residuals, index=price_data_index)
    else:
        residuals_series = residuals

    # Generate signals (computed here for reference; not part of the return value)
    signals = pd.Series(index=residuals_series.index, dtype='float64')
    signals[residuals_series > upper_bound] = -1  # Short signal
    signals[residuals_series < lower_bound] = 1   # Long signal
    signals = signals.fillna(0)  # Fill NaN values with 0

    if show_graph:
        plt.figure(figsize=(14,7))
        residuals_series.plot(label='Residuals')
        plt.axhline(upper_bound, color='red', linestyle='--', label='Upper Bound')
        plt.axhline(lower_bound, color='green', linestyle='--', label='Lower Bound')
        plt.axhline(0, color='blue', linestyle='-')
        plt.title('Residuals with Long and Short Signals')
        plt.legend()
        plt.show()
    return upper_bound, lower_bound, residuals_series

upper_bound, lower_bound, residuals_series = generate_signals(residuals, 1, historical_prices.index[1:-1])

<img src="https://algo-assets.amplifyme.com/quant/gresearch/GResearch_Cointegration_C4.png" width="100%"/>

## Trading Application 

Backtesting is a cornerstone of algorithmic trading. Before deploying a strategy in real-time, it’s imperative to understand how it would have performed historically. This process involves simulating trades based on past data and analyzing the results to evaluate the strategy’s robustness, potential profitability, and associated risks. In this section, we will immerse ourselves in the intricacies of backtesting, ensuring that our strategy is not just theoretically sound but also practical and viable in real-world scenarios. By understanding its historical behavior, we can make more informed decisions about its future deployment.

#### Class TradingEngine

The engine class is used to simulate a high-frequency trading strategy. It contains the following properties:

* balance (float): The balance of the trading account, in dollars.
* profit_loss (float): The profit and loss of the trading account, in dollars.
* profit_loss_without_commission (float): The profit and loss without the commission deductions, in dollars.
* current_positions (dict): A dictionary of CurrentPosition objects, with each key representing a different asset.
* commission_costs (float): The commission costs in dollars.
* commission_percentage (float): The commission percentage to be applied to each trade, expressed

##### **Functions**

```python
execute_order(self, ticker: str, volume: int, action: str, date: int)
```
This function allows the execution of a trade order for a specific stock represented by ticker.

*Parameters:*
* ticker: a string representing the stock's ticker symbol. The symbol will be automatically converted to uppercase.
* volume: an integer representing the number of shares to be traded.
* action: a string representing the type of order to be executed. Acceptable values are BUY or SELL. The action will be automatically converted to uppercase.
* date: an integer representing the date of the trade.

*Returns:*
Based on the current position of the stock and the action provided, the function will determine the trade to be executed. If the order fails to execute, an error is raised.

<hr>

#### Class CurrentPosition

The CurrentPosition class is used to track and manage the positions taken by the engine object. It contains the following properties:

* ticker (str): The ticker symbol of the asset being traded.
* direction (str): The direction of the trade, either "LONG", "SHORT", or "FLAT".
* position_volume (int): The number of shares held in the current position.
* open_price (float): The price at which the position was opened.
* profit_loss (float): The profit or loss in dollars, calculated as the difference between the open price and the current price.
* profit_loss_without_commission (float): The profit or loss in dollars without the commission deductions.
* commission_costs (float): The cumulative commission costs for this position.
* trade_history (list): A list of dictionaries, each representing a single trade in this position.

<hr>

In [None]:
# The code below creates the Trading Engine object
live_prices = Data.get_price_series_for_round(portal_id, is_dataframe=True)
engine = TradingEngine(commission_percentage=0.005, data=live_prices)
trading_engine_status(portal_id)

In [None]:
import random

def template_strategy(tickers, date):    
    # Insert your answer code here
    
    # Randomly select a ticker
    ticker = random.choice(tickers)

    # Randomly decide whether to buy or sell
    action = random.choice(["BUY", "SELL"])

    # Randomly select a volume to trade
    volume_to_trade = random.randint(1, 10)

    try:
        # Place trades in the trading engine
        engine.execute_order(ticker, volume_to_trade, action, date)
    except Exception as e:
        print(f"Failed to execute order for {ticker} on {date}: {e}")
        
    return ticker, volume_to_trade, action

tickers = live_prices.columns.tolist()
for date, _ in live_prices.iterrows():
    # Use the random trading strategy
    template_strategy(tickers, date)

In [None]:
# Pairs currently held. The original cell used this list without defining it.
current_pairs = []

def base_strategy(tickers, date):
    # NOTE: signal_strength() is not defined in this notebook. It is assumed to return a
    # signed, standardised measure of how far the latest residual sits from its mean.
    threshold = 0.05                  # maximum ADF p-value for a pair to be considered
    amount_to_trade = 20000000 / 100  # dollar amount allocated to each leg
    reversal_threshold = 1            # close a pair once |strength| falls below this
    initial_threshold = 1.5           # open a pair once |strength| exceeds this

    # Original warm-up condition: only act once the date index exceeds 10000
    if date <= 10000:
        return

    df = live_prices.loc[live_prices.index <= date]
    res = {'adf_test': {}, 'residuals': {}, 'strength': {}}

    # Regress every asset on every other asset and test the residuals for stationarity
    for ticker in tickers:
        for ticker1 in tickers:
            if ticker1 == ticker:
                continue
            X_const = sm.add_constant(df[ticker])
            model = sm.OLS(df[ticker1], X_const).fit()
            pair = ticker + ' ' + ticker1
            res['residuals'][pair] = model.resid
            # p-value of the ADF test on the residuals (same settings as adf_test)
            res['adf_test'][pair] = adfuller(model.resid, maxlag=0)[1]

    # Close pairs whose spread has reverted by reversing the opening trades
    for dic in list(current_pairs):
        ticker, ticker1 = dic['ticker'], dic['ticker1']
        volume, volume1 = dic['volume'], dic['volume1']
        if abs(signal_strength(res['residuals'][ticker + ' ' + ticker1])) < reversal_threshold:
            if dic['dir']:
                # Opened with BUY ticker / SELL ticker1
                engine.execute_order(ticker, volume, 'SELL', date)
                engine.execute_order(ticker1, volume1, 'BUY', date)
            else:
                # Opened with BUY ticker1 / SELL ticker
                engine.execute_order(ticker1, volume1, 'SELL', date)
                engine.execute_order(ticker, volume, 'BUY', date)
            current_pairs.remove(dic)

    # Keep only pairs whose residuals pass the stationarity test
    for pair, pvalue in res['adf_test'].items():
        if pvalue < threshold:
            res['strength'][pair] = signal_strength(res['residuals'][pair])
    if len(res['strength']) == 0:
        return

    # Open the pair with the strongest signal, if it is strong enough
    strength = pd.Series(res['strength'])
    if strength.abs().max() > initial_threshold:
        pair = strength.abs().idxmax()
        direction = strength.loc[pair] > 0
        ticker, ticker1 = pair.split(' ')
        volume = int(amount_to_trade / df[ticker].iloc[-1])
        volume1 = int(amount_to_trade / df[ticker1].iloc[-1])
        if direction:
            # Residual is high: ticker1 (dependent) looks rich, so sell it and buy ticker
            engine.execute_order(ticker, volume, 'BUY', date)
            engine.execute_order(ticker1, volume1, 'SELL', date)
        else:
            engine.execute_order(ticker1, volume1, 'BUY', date)
            engine.execute_order(ticker, volume, 'SELL', date)
        current_pairs.append({'ticker': ticker, 'volume': volume,
                              'ticker1': ticker1, 'volume1': volume1, 'dir': direction})
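
The strategy above calls `signal_strength`, which is not defined anywhere in this notebook. One plausible reading, given that its result is compared with thresholds of 1 and 1.5, is the z-score of the most recent residual. The helper below is a hypothetical sketch under that assumption and is not the original implementation.

```python
import numpy as np
import pandas as pd

def signal_strength(residuals):
    """Hypothetical helper: z-score of the latest residual relative to its history."""
    residuals = pd.Series(residuals)
    sigma = residuals.std()
    # Guard against a constant (or too short) series, where the z-score is undefined
    if np.isnan(sigma) or sigma == 0:
        return 0.0
    return float((residuals.iloc[-1] - residuals.mean()) / sigma)

# Made-up residuals whose final value sits well above the rest
example = pd.Series([0.0, 0.5, -0.5, 0.2, -0.2, 3.0])
print(round(signal_strength(example), 3))
```

With this definition a positive value means the dependent asset is expensive relative to the fitted relationship, which matches how the `direction` flag is used in the strategy.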


<img src="https://algo-assets.amplifyme.com/quant/gresearch/GResearch_Cointegration_Appendix.png" width="100%"/>

# Python 3.10.10 Documentation
[Check out the documentation](https://docs.python.org/release/3.10.10/)

[What is new in Python 3.10](https://docs.python.org/3/whatsnew/3.10.html)

[Differences between Python 3 and Python 2](https://www.ibm.com/docs/en/sqsp/48?topic=scripts-python-2-python-3-differences)

# Import Explanations

## Publicly Available Packages

1. **Matplotlib**
   ```python
   from matplotlib import pyplot as plt
   ```
   - Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. The `pyplot` module provides a MATLAB-like interface for making plots and graphs.

2. **Pandas**
   ```python
   import pandas as pd
   ```
   - Pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool, built on top of the Python programming language. It offers data structures like DataFrame for handling tabular data.

3. **NumPy**
   ```python
   import numpy as np
   ```
   - NumPy is the fundamental package for scientific computing with Python. It provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.

4. **SciPy Stats**
   ```python
   import scipy.stats as stats
   ```
   - SciPy is a Python-based ecosystem of open-source software for mathematics, science, and engineering. The `stats` module in SciPy contains a large number of probability distributions as well as a growing library of statistical functions.

5. **Seaborn**
   ```python
   import seaborn as sns
   ```
   - Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

## Imports for OLS Regression Analysis

1. **Statsmodels for Statistical Modeling**
   ```python
   import statsmodels.api as sm
   ```
   - Statsmodels is a Python module that provides classes and functions for the estimation of many statistical models, as well as for conducting statistical tests, and statistical data exploration. In the context of Ordinary Least Squares (OLS) regression analysis, `sm` from Statsmodels is used for fitting linear models, performing tests, and exploring data.

## Imports for ADF Test

1. **Statsmodels for Time Series Analysis**
   ```python
   from statsmodels.tsa.stattools import adfuller
   ```
   - Statsmodels provides classes and functions for the estimation of many different statistical models. `adfuller` is used for the Augmented Dickey-Fuller test, a type of statistical test called a unit root test, essential in time series analysis.