# TABLE OF CONTENTS

* [1. TEAM MEMBERS](#section-one)
* [2. PROBLEM SET 1: QUESTION 1](#section-two)
    - [2.1 Draw Packages](#subsection-two-one)
    - [2.2 Setup parameters](#subsection-two-two)
    - [2.3 Define Functions](#subsection-two-three)
    - [2.4 Import Data and process](#subsection-two-four)
* [3. CONCLUSION](#section-three)
* [4. REFERENCES](#section-four)

<a id="section-one"></a>
# 1. TEAM MEMBERS

## Lucas Sebastian A0112080B
## Sekson Ounsaengchan (Beer) A0227885M
## Zhao Mengyu (Jessica) A0227914B

<a id="section-two"></a>
# 2. PROBLEM SET 1: QUESTION 1

## Explore and find an asset (e.g., ETFs, stocks, currencies, bonds) whose variance ratio test is rejected (i.e. H0: market efficiency is rejected).

The code below does the following:
1. Import the stock list CSV file from a local path. In this case we use the top 100 stocks from the US Nasdaq market.
2. Calculate the variance ratio and Z-statistic for each stock using the estimate function defined in Section 2.3.
3. Print the stocks that do not follow a random walk over the k-period horizon, using the conditions specified in the parameters below.
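
For orientation, the quantities computed by the `estimate` function in Section 2.3 can be written out as follows. This is a transcription of what the code computes, in our own notation (prices $P_0, \dots, P_T$, log returns $r_t$), rather than a quotation from Lo and MacKinlay (1988), whose notation differs. Under the random walk null, $Z(k)$ is treated as approximately standard normal.

```latex
\begin{aligned}
r_t &= \ln P_t - \ln P_{t-1}, \qquad t = 1, \dots, T, \\
\hat{\mu} &= \frac{1}{T} \sum_{t=1}^{T} r_t, \qquad
\hat{\sigma}^2(1) = \frac{1}{T-1} \sum_{t=1}^{T} (r_t - \hat{\mu})^2, \\
\hat{\sigma}^2(k) &= \frac{1}{m} \sum_{t=k}^{T} \left( \ln P_t - \ln P_{t-k} - k\hat{\mu} \right)^2,
\qquad m = k\,(T - k + 1)\left(1 - \frac{k}{T}\right), \\
VR(k) &= \frac{\hat{\sigma}^2(k)}{\hat{\sigma}^2(1)}, \\
\hat{\delta}(j) &= \frac{\sum_{t=j+1}^{T} (r_t - \hat{\mu})^2 (r_{t-j} - \hat{\mu})^2}
{\left[ \sum_{t=1}^{T} (r_t - \hat{\mu})^2 \right]^2}, \qquad
\hat{\phi}^2(k) = \sum_{j=1}^{k-1} \left[ \frac{2(k-j)}{k} \right]^2 \hat{\delta}(j), \\
Z(k) &= \frac{VR(k) - 1}{\sqrt{\hat{\phi}^2(k)}}.
\end{aligned}
```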

<a id="subsection-two-one"></a>
## 2.1 Draw Packages

In [16]:
import pandas as pd
import numpy as np
import yfinance as yf
import datetime as dt
try:
    import plotly.express as px
except ImportError:
    # Install plotly on first use, then retry the import
    !pip install plotly==4.14.3
    import plotly.express as px

<a id="subsection-two-two"></a>
## 2.2 Setup parameters

In [2]:
time_intervals = [2, 4, 8, 16]
z_stat_threshold = 1.96  # |Z| > 1.96 corresponds to a two-sided p-value < 0.05
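
The link between the 1.96 threshold and the 5% significance level can be checked directly. The snippet below is a minimal sketch using only the standard library; the helper name `two_sided_p_value` is our own and does not appear elsewhere in the notebook.

```python
import math

def two_sided_p_value(z):
    """Two-sided p-value of a test statistic that is standard normal under H0."""
    # For Z ~ N(0, 1): P(|Z| > |z|) = erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))

# Threshold used in this notebook
print(round(two_sided_p_value(1.96), 4))
```

Any statistic with an absolute value above 1.96 therefore has a two-sided p-value below roughly 0.05, which is the rejection rule applied in Section 2.3.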

<a id="subsection-two-three"></a>
## 2.3 Define Functions

In [3]:
def estimate(data, price, k):
    """
    Function returns the variance ratio, the heteroscedasticity-robust test statistic (stat2) and the number of observations
    
    Parameters:
        data (pandas.DataFrame): dataframe containing the prices of the underlyings 
        price (str): column header of underlying to be analysed 
        k (int): interval of returns (e.g. k = 2 is the 2-period returns)
        
    Returns:
        tuple: containing variance ratio, stat2 and number of observations (T)
    """
    prices = data[price].to_numpy(dtype=np.float64)
    log_prices = np.log(prices)
    rets = np.diff(log_prices)
    T = len(rets)
    mu = np.mean(rets)
    var_1 = np.var(rets, ddof=1, dtype=np.float64)
    rets_k = (log_prices - np.roll(log_prices, k))[k:]
    # print("prices", prices)
    # print("log_prices", log_prices)
    # print("np.roll(log_prices, k))[k:]", np.roll(log_prices, k)[k:])
    # print("rets_k", rets_k)
    m = k * (T - k + 1) * (1 - k / T)
    var_k = 1/m * np.sum(np.square(rets_k - k * mu))

    # Variance Ratio
    vr = var_k / var_1
    
    # Phi2
    def delta(j):
        res = 0
        for t in range(j+1, T+1):
            t -= 1  # array index is t-1 for t-th element
            res += np.square((rets[t]-mu)*(rets[t-j]-mu))
        return res / ((T-1) * var_1)**2

    phi2 = 0
    for j in range(1, k):
        phi2 += (2*(k-j)/k)**2 * delta(j)

    return vr, (vr - 1) / np.sqrt(phi2), T
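
As a sanity check on the logic above, the sketch below re-implements the same formulas in vectorized form and applies them to simulated data: one series with independent returns (a random walk in log prices) and one with negatively autocorrelated MA(1) returns. The function name `variance_ratio`, the seed, the sample size and the MA coefficient are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def variance_ratio(log_prices, k):
    """Variance ratio VR(k) and heteroscedasticity-robust z-statistic (sketch)."""
    rets = np.diff(log_prices)
    T = len(rets)
    mu = rets.mean()
    var_1 = rets.var(ddof=1)
    # Overlapping k-period log returns (T - k + 1 of them)
    rets_k = log_prices[k:] - log_prices[:-k]
    m = k * (T - k + 1) * (1 - k / T)
    var_k = np.sum((rets_k - k * mu) ** 2) / m
    vr = var_k / var_1
    # delta(j): sum of products of squared deviations at lag j,
    # divided by the squared sum of squared deviations
    dev2 = (rets - mu) ** 2
    denom = dev2.sum() ** 2
    phi2 = 0.0
    for j in range(1, k):
        delta_j = np.sum(dev2[j:] * dev2[:-j]) / denom
        phi2 += (2 * (k - j) / k) ** 2 * delta_j
    return vr, (vr - 1) / np.sqrt(phi2)

rng = np.random.default_rng(0)          # fixed seed, arbitrary choice
eps = rng.normal(0.0, 0.01, size=5001)  # simulated shocks

iid_rets = eps[1:]                      # independent returns
ma_rets = eps[1:] - 0.5 * eps[:-1]      # MA(1) returns, negative autocorrelation

log_p_rw = np.concatenate([[0.0], np.cumsum(iid_rets)])
log_p_ma = np.concatenate([[0.0], np.cumsum(ma_rets)])

vr_rw, z_rw = variance_ratio(log_p_rw, 2)
vr_ma, z_ma = variance_ratio(log_p_ma, 2)
print("random walk:", vr_rw, z_rw)
print("MA(1):      ", vr_ma, z_ma)
```

The random walk series should give a variance ratio close to 1, while the MA(1) series should give a ratio well below 1 with a strongly negative statistic, which is the same sign pattern as the rejections reported in Section 2.4.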

In [4]:
def estimate_multiple_k(data, price, time_intervals):
    """
    Function prints the results from estimate() for multiple time intervals as specified in time_intervals
    Parameters:
        data (pandas.DataFrame): dataframe containing the prices of the underlyings 
        price (str): column header of underlying to be analysed 
        time_intervals (list): list of ints representing intervals for returns
    Returns:
        None
    """
    # Estimate different time_intervals.
    for time_interval in time_intervals:

        vr, stat2, T = estimate(data, price, time_interval)
        print('The number of observations : ' + str(T))
        print('Variance Ratio for k = ' + str(time_interval) + ' : ' + str(vr))
        print('Variance Ratio Test Statistic for k = ' + str(time_interval) + ' Heteroscedasticity Assumption : ' + str(stat2))
        print('-------------------------------------------------------------------------------------------------')

The function below computes the variance ratio and Z-statistic with the estimate function above and stores the results in the stock dataframe (stock_df) so they can be printed later.

In [5]:
def estimate_multiple_k_modified(data, price, time_intervals, stock_df):
    """
    Function stores the results from estimate() for multiple time intervals as specified in time_intervals into stock_df dataframe
    Parameters:
        data (pandas.DataFrame): dataframe containing the prices of the underlyings 
        price (str): column header of underlying to be analysed 
        time_intervals (list): list of ints representing intervals for returns
        stock_df (pandas.DataFrame): results table with columns symbol, k, vr, stat2, T, null_rejected
    Returns:
        pandas.DataFrame: stock_df with one row added per time interval
    """
    # Estimate different time_intervals.
    for time_interval in time_intervals:
        vr, stat2, T = estimate(data, price, time_interval)
        # Reject the random walk null when the variance ratio differs from 1
        # and |Z| exceeds z_stat_threshold (1.96, i.e. two-sided p-value < 0.05).
        null_rejected = bool(abs(vr) != 1 and abs(stat2) > z_stat_threshold)
        row = [price, time_interval, vr, stat2, T, null_rejected]
        # DataFrame.append was removed in pandas 2.0; add the row by label instead.
        stock_df.loc[len(stock_df)] = row
    return stock_df
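
An alternative way to assemble the results table is to collect all `(symbol, k, vr, stat2, T)` tuples first and build the dataframe once, which avoids growing it row by row. The sketch below assumes that layout; `build_result_table` is a hypothetical helper, the first two rows reuse the AAPL and MSFT values printed in Section 2.4, and the third row is made up purely to show a non-rejection.

```python
import pandas as pd

Z_STAT_THRESHOLD = 1.96  # same value as z_stat_threshold in Section 2.2

def build_result_table(results):
    """Build the results table from (symbol, k, vr, stat2, T) tuples."""
    df = pd.DataFrame(results, columns=["symbol", "k", "vr", "stat2", "T"])
    # Same rule as in the notebook: VR differs from 1 and |Z| above the threshold
    df["null_rejected"] = (df["vr"] != 1) & (df["stat2"].abs() > Z_STAT_THRESHOLD)
    return df

example = [
    ("AAPL", 2, 0.8615330757642441, -2.3403193043941157, 1006),
    ("MSFT", 2, 0.7185623720069287, -3.155626509082304, 1006),
    ("XXXX", 2, 1.01, 0.3, 1006),  # hypothetical row, not real data
]
table = build_result_table(example)

# Share of rejections per horizon k
rejection_rate_by_k = table.groupby("k")["null_rejected"].mean()
print(table[table["null_rejected"]])
print(rejection_rate_by_k)
```

Grouping by `k` in this way gives a compact summary of how the rejection rate changes with the horizon.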

<a id="subsection-two-four"></a>
## 2.4 Import Data and process

### Import the stock list from a local path and download historical prices from Yahoo Finance. We use the "Adj Close" price as our dataset, downloaded at a daily interval (weekly prices returned NaN for many stocks).
In this case we use the top 100 stocks from the US Nasdaq market: https://www.nasdaq.com/market-activity/stocks/screener
Note: the number of stocks is configurable, not fixed at 100.
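
Because downloaded price tables can contain missing values (for example, for stocks listed after the start date), it may help to clean each series before testing it. The sketch below is one possible approach, not something the notebook itself does: `clean_price_series` and `min_obs` are hypothetical names, and the toy prices are invented for illustration. Note that dropping an interior gap makes the return across that gap span more than one period.

```python
import numpy as np
import pandas as pd

def clean_price_series(prices: pd.DataFrame, symbol: str, min_obs: int = 30):
    """Return the usable prices for one symbol as a one-column dataframe.

    Drops missing and non-positive prices (log prices need positive values)
    and returns None when fewer than min_obs observations remain.
    """
    series = prices[symbol].dropna()
    series = series[series > 0]
    if len(series) < min_obs:
        return None
    return series.to_frame()

# Toy data for illustration only
toy = pd.DataFrame({
    "AAA": [10.0, 10.5, np.nan, 10.2, 10.4],
    "BBB": [np.nan, np.nan, np.nan, 5.0, 5.1],
})
cleaned = clean_price_series(toy, "AAA", min_obs=3)
skipped = clean_price_series(toy, "BBB", min_obs=3)
print(len(cleaned), skipped)
```

The returned one-column dataframe keeps the symbol as its column header, so it could be passed to a function with the `(data, price, k)` signature used in Section 2.3.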

In [14]:
# Update local path
local_path = "C:/Users/sekso/Desktop/MBA/sem4/Fintech/Bootcamp Codes and Data/02 EMH/"

us_stock = pd.read_csv(local_path + 'nasdaq_us_stock.csv', na_values=['.'])
symbols_list = list(us_stock["Symbol"][0:100])
start = dt.datetime(2018,1,1)
end = dt.datetime(2021,12,31)
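data = yf.download(symbols_list, start=start, end=end, interval="1d", auto_adjust=False)
# NOTE: we use daily prices because many stocks returned NaN values for weekly prices from the yfinance API.
# auto_adjust=False keeps the separate "Adj Close" column in yfinance releases that adjust prices by default.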
maindata = data["Adj Close"]

# null_rejected_stock contains only the stocks that do not follow a random walk for some k-period.
# The following section prints only the stocks that pass the Z-statistic threshold.
# for symbol in symbols_list:
#     stock_df = estimate_multiple_k(maindata, symbol, time_intervals)

stock_df = pd.DataFrame(columns = ["symbol", "k", "vr", "stat2", "T", "null_rejected"])
for symbol in symbols_list:
    stock_df = estimate_multiple_k_modified(maindata, symbol, time_intervals, stock_df)
null_rejected_stock = stock_df[stock_df["null_rejected"] == True]

if null_rejected_stock.empty:
    print("All stocks follow random walk")
else:
    print("The following stocks do not follow random walk (Null hypothesis is rejected for the k-period)")
    current_symbol = ""
    for i in range(0,len(null_rejected_stock)):
        if current_symbol != null_rejected_stock.iloc[i]["symbol"]:
            current_symbol = null_rejected_stock.iloc[i]["symbol"]
            print("----------------------------------------------------------------",current_symbol , "------------------------------------------------------------------------")
            # fig = px.scatter(y=maindata[current_symbol], labels={'x':'Time', 'y':'Price'})
            # fig.show()
        # print(rejected_stock.iloc[i])
        print('The number of observations :', null_rejected_stock.iloc[i]["T"])
        print('Variance Ratio for k =', null_rejected_stock.iloc[i]["k"], ':', null_rejected_stock.iloc[i]["vr"])
        print('Variance Ratio Test Statistic for k =', null_rejected_stock.iloc[i]["k"], 'Heteroscedasticity Assumption :', null_rejected_stock.iloc[i]["stat2"])
        print('-------------------------------------------------------------------------------------------------')

[*********************100%***********************]  100 of 100 completed
The following stocks do not follow random walk (Null hypothesis is rejected for the k-period)
---------------------------------------------------------------- AAPL ------------------------------------------------------------------------
The number of observations : 1006
Variance Ratio for k = 2 : 0.8615330757642441
Variance Ratio Test Statistic for k = 2 Heteroscedasticity Assumption : -2.3403193043941157
-------------------------------------------------------------------------------------------------
---------------------------------------------------------------- MSFT ------------------------------------------------------------------------
The number of observations : 1006
Variance Ratio for k = 2 : 0.7185623720069287
Variance Ratio Test Statistic for k = 2 Heteroscedasticity Assumption : -3.155626509082304
-------------------------------------------------------------------------------------------------
[output truncated]

<a id="section-three"></a>
# 3. CONCLUSION
Based on the experiment above, we observe that the longer the return horizon k, the more random the price pattern looks. Most of the stocks do not look random at k = 2 (two-period returns), but for k > 2 only a few stocks, e.g. Texas Instruments Incorporated (TXN), pass the threshold (variance ratio not equal to 1 and |Z-statistic| > 1.96).

<a id="section-four"></a>
# 4. REFERENCES

Lo, A.W. and MacKinlay, A.C. (1988), "Stock market prices do not follow random walks: Evidence from a simple specification test", The Review of Financial Studies, 1, 41-66.

Boehmer, E., Broussard, J.P., and Kallunki, J. (2002), Using SAS in Financial Research. SAS Institute. ISBN 978-1590470398.

This notebook is adapted from code written by [Mingze Gao](https://mingze-gao.com/), University of Sydney. 