In [48]:
import pandas as pd
import numpy as np

## Information About Data

### Files
* `stock_prices.csv`: The core file of interest. Includes the daily closing price for each stock and the target column.
* `options.csv`: Data on the status of a variety of options based on the broader market. Many options include implicit predictions of the future price of the stock market and so may be of interest even though the options are not scored directly.
* `secondary_stock_prices.csv`: The core dataset contains only the 2,000 most commonly traded equities, but many less liquid securities are also traded on the Tokyo market. This file contains data for those securities, which aren't scored but may be of interest for assessing the market as a whole.
* `trades.csv`: Aggregated summary of trading volumes from the previous business week.
* `financials.csv`: Results from quarterly earnings reports.
* `stock_list.csv`: Mapping between the SecuritiesCode and company names, plus general information about which industry the company is in.

### Folders
* **data_specifications/** - Definitions for individual columns.
* **jpx_tokyo_market_prediction/** Files that enable the API. Expect the API to deliver all rows in under five minutes and to reserve less than 0.5 GB of memory.

* Copies of data files exist in multiple folders that cover different time windows and serve different purposes.
    * **train_files/** Data folder covering the main training period.
    * **supplemental_files/** Data folder containing a dynamic window of supplemental training data. This will be updated with new data during the main phase of the competition in early May, early June, and roughly a week before the submissions are locked. The supplemental data will also be updated once at the very beginning of the forecasting phase so that the test set will start with the trading day after the last trading day in the supplemental data.
    * **example_test_files/** Data folder covering the public test period. Intended to facilitate offline testing. Includes the same columns delivered by the API (i.e., no Target column). You can calculate the Target column from the Close column; it's the return from buying a stock the next day and selling the day after that. This folder also includes an example of the sample submission file that will be delivered by the API.

# Exploring

## How are submissions evaluated?
Submissions are evaluated on the Sharpe Ratio of the daily spread returns. You will need to rank each stock active on a given day. The returns for a single day treat the 200 highest-ranked stocks (e.g. ranks 0 to 199) as purchased and the 200 lowest-ranked stocks (e.g. ranks 1999 down to 1800) as shorted. The stocks are then weighted based on their ranks, and the total returns for the portfolio are calculated assuming the stocks were purchased the next day and sold the day after that. A [Python implementation of the metric is available here](https://www.kaggle.com/code/smeitoma/jpx-competition-metric-definition).

Example of a submission file:

In [4]:
example_submission = pd.read_csv("../../data/raw/example_test_files/sample_submission.csv")
example_submission

Unnamed: 0,Date,SecuritiesCode,Rank
0,2021-12-06,1301,0
1,2021-12-06,1332,1
2,2021-12-06,1333,2
3,2021-12-06,1375,3
4,2021-12-06,1376,4
...,...,...,...
111995,2022-02-28,9990,1995
111996,2022-02-28,9991,1996
111997,2022-02-28,9993,1997
111998,2022-02-28,9994,1998
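
A model usually produces a continuous score per stock, not a rank, so the scores need to be converted into the `Rank` column shown above. The following is a minimal sketch of that conversion, assuming a hypothetical `score` column where higher means "expected to rise more"; the dates, codes and scores are made up for illustration.

```python
import pandas as pd

# Hypothetical model output: one score per (Date, SecuritiesCode); higher = better.
preds = pd.DataFrame({
    "Date": ["2021-12-06"] * 4 + ["2021-12-07"] * 4,
    "SecuritiesCode": [1301, 1332, 1333, 1375] * 2,
    "score": [0.2, -0.1, 0.5, 0.2, 0.0, 0.3, -0.4, 0.1],
})

# Rank within each date so that the best score gets rank 0.
# method="first" guarantees unique ranks even when scores are tied.
preds["Rank"] = (
    preds.groupby("Date")["score"]
    .rank(ascending=False, method="first")
    .astype(int)
    - 1
)

submission = preds[["Date", "SecuritiesCode", "Rank"]]
print(submission)
```

Using `method="first"` matters here because the metric asserts that ranks on each day run from 0 to N-1 without gaps or duplicates.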


### Understanding stocks files

#### Stock list

In [23]:
stock_spec = pd.read_csv("../../data/raw/data_specifications/stock_list_spec.csv")
stock_spec.head(16)

Unnamed: 0,Column,Sample value,Type,Addendum,Remarks
0,SecuritiesCode,1301,Int64,,Local Securities Code
1,EffectiveDate,20211230,date,,the effective date
2,Name,"KYOKUYO CO.,LTD.",string,,Name of security
3,Section/Products,First Section (Domestic),string,,Section/Product
4,NewMarketSegment,Prime Market,string,,New market segment effective from 2022-04-04 (...
5,33SectorCode,50,Int64,,33 Sector Name\n\nref. https://www.jpx.co.jp/e...
6,33SectorName,"Fishery, Agriculture and Forestry",string,,33 Sector Name\n\nref. https://www.jpx.co.jp/e...
7,17SectorCode,1,Int64,,17 Sector Code\nref. https://www.jpx.co.jp/eng...
8,17SectorName,FOODS,string,,17 Sector Name\nref. https://www.jpx.co.jp/eng...
9,NewIndexSeriesSizeCode,7,Int64,,TOPIX New Index Series code\n\nref. https://ww...


In [26]:
stock_spec.loc[15, "Remarks"]

'a flag of prediction target universe (top 2000 stocks by market capitalization)'

In [24]:
stocks = pd.read_csv("../../data/raw/stock_list.csv")
stocks.head()

Unnamed: 0,SecuritiesCode,EffectiveDate,Name,Section/Products,NewMarketSegment,33SectorCode,33SectorName,17SectorCode,17SectorName,NewIndexSeriesSizeCode,NewIndexSeriesSize,TradeDate,Close,IssuedShares,MarketCapitalization,Universe0
0,1301,20211230,"KYOKUYO CO.,LTD.",First Section (Domestic),Prime Market,50,"Fishery, Agriculture and Forestry",1,FOODS,7,TOPIX Small 2,20211230.0,3080.0,10928280.0,33659110000.0,True
1,1305,20211230,Daiwa ETF-TOPIX,ETFs/ ETNs,,-,-,-,-,-,-,20211230.0,2097.0,3634636000.0,7621831000000.0,False
2,1306,20211230,NEXT FUNDS TOPIX Exchange Traded Fund,ETFs/ ETNs,,-,-,-,-,-,-,20211230.0,2073.5,7917718000.0,16417390000000.0,False
3,1308,20211230,Nikko Exchange Traded Index Fund TOPIX,ETFs/ ETNs,,-,-,-,-,-,-,20211230.0,2053.0,3736943000.0,7671945000000.0,False
4,1309,20211230,NEXT FUNDS ChinaAMC SSE50 Index Exchange Trade...,ETFs/ ETNs,,-,-,-,-,-,-,20211230.0,44280.0,72632.0,3216145000.0,False


#### Stock Prices

In [21]:
stock_prices_spec = pd.read_csv("../../data/raw/data_specifications/stock_price_spec.csv")
stock_prices_spec.head(12)

Unnamed: 0,Column,Sample value,Type,Addendum,Remarks
0,RowId,20170104_1301,string,,Unique ID of price records
1,Date,2017-01-04 0:00:00,date,,Trade date
2,SecuritiesCode,1301,Int64,,Local securities code
3,Open,2734,float,,first traded price on a day
4,High,2755,float,,highest traded price on a day
5,Low,2730,float,,lowest traded price on a day
6,Close,2742,float,,last traded price on a day
7,Volume,31400,Int64,,number of traded stocks on a day
8,AdjustmentFactor,1,float,,to calculate theoretical price/volume when spl...
9,SupervisionFlag,FALSE,boolean,,Flag of Securities Under Supervision & Securit...


In [34]:
stock_prices_spec.loc[11, "Remarks"]

'Change ratio of adjusted closing price between t+2 and t+1 where t+0 is TradeDate'

In [22]:
stock_prices = pd.read_csv("../../data/raw/train_files/stock_prices.csv")
stock_prices["Date"] = pd.to_datetime(stock_prices["Date"])
stock_prices.head()

Unnamed: 0,RowId,Date,SecuritiesCode,Open,High,Low,Close,Volume,AdjustmentFactor,ExpectedDividend,SupervisionFlag,Target
0,20170104_1301,2017-01-04,1301,2734.0,2755.0,2730.0,2742.0,31400,1.0,,False,0.00073
1,20170104_1332,2017-01-04,1332,568.0,576.0,563.0,571.0,2798500,1.0,,False,0.012324
2,20170104_1333,2017-01-04,1333,3150.0,3210.0,3140.0,3210.0,270800,1.0,,False,0.006154
3,20170104_1376,2017-01-04,1376,1510.0,1550.0,1510.0,1550.0,11300,1.0,,False,0.011053
4,20170104_1377,2017-01-04,1377,3270.0,3350.0,3270.0,3330.0,150800,1.0,,False,0.003026


In [35]:
stock_1301 = stock_prices[stock_prices["SecuritiesCode"]==1301].reset_index(drop=True)
stock_1301.head(3)

Unnamed: 0,RowId,Date,SecuritiesCode,Open,High,Low,Close,Volume,AdjustmentFactor,ExpectedDividend,SupervisionFlag,Target
0,20170104_1301,2017-01-04,1301,2734.0,2755.0,2730.0,2742.0,31400,1.0,,False,0.00073
1,20170105_1301,2017-01-05,1301,2743.0,2747.0,2735.0,2738.0,17900,1.0,,False,0.00292
2,20170106_1301,2017-01-06,1301,2734.0,2744.0,2720.0,2740.0,19900,1.0,,False,-0.001092


## How to calculate Target

1. For each stock ($k$), the model takes as input the closing prices ($C_{(k, t)}$) up to business day $t$, plus any other data available each business day, and predicts the rate of change ($r_{(k, t)}$) of the closing price from the following business day ($C_{(k, t+1)}$) to the business day after that ($C_{(k, t+2)}$). Only the top 200 and bottom 200 ranked stocks enter the score.

    $$
    r_{(k, t)} = \frac{C_{(k, t+2)} - C_{(k, t+1)}}{C_{(k, t+1)}}
    $$

In [38]:
stock_1301["Close-1"] = stock_1301["Close"].shift(-1)
stock_1301["Close-2"] = stock_1301["Close"].shift(-2)
stock_1301["Rate"] = (stock_1301["Close-2"]-stock_1301["Close-1"])/stock_1301["Close-1"]
stock_1301.head()

Unnamed: 0,RowId,Date,SecuritiesCode,Open,High,Low,Close,Volume,AdjustmentFactor,ExpectedDividend,SupervisionFlag,Target,Close-1,Close-2,Rate
0,20170104_1301,2017-01-04,1301,2734.0,2755.0,2730.0,2742.0,31400,1.0,,False,0.00073,2738.0,2740.0,0.00073
1,20170105_1301,2017-01-05,1301,2743.0,2747.0,2735.0,2738.0,17900,1.0,,False,0.00292,2740.0,2748.0,0.00292
2,20170106_1301,2017-01-06,1301,2734.0,2744.0,2720.0,2740.0,19900,1.0,,False,-0.001092,2748.0,2745.0,-0.001092
3,20170110_1301,2017-01-10,1301,2745.0,2754.0,2735.0,2748.0,24200,1.0,,False,-0.0051,2745.0,2731.0,-0.0051
4,20170111_1301,2017-01-11,1301,2748.0,2752.0,2737.0,2745.0,9300,1.0,,False,-0.003295,2731.0,2722.0,-0.003295
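
The same calculation can be applied to every stock at once by shifting within each `SecuritiesCode` group. This is a sketch on toy data: the first four closes are the 1301 values shown above, while code `9999` and its prices are invented. It also assumes no splits occurred, whereas the real Target is defined on the adjusted close, so `AdjustmentFactor` would have to be applied first on the full dataset.

```python
import pandas as pd

prices = pd.DataFrame({
    "Date": pd.to_datetime(["2017-01-04", "2017-01-05", "2017-01-06", "2017-01-10"] * 2),
    "SecuritiesCode": [1301] * 4 + [9999] * 4,  # 9999 is a made-up code
    "Close": [2742.0, 2738.0, 2740.0, 2748.0, 100.0, 110.0, 99.0, 99.0],
})

# Shifts must never cross from one stock into the next, so sort and group first.
prices = prices.sort_values(["SecuritiesCode", "Date"]).reset_index(drop=True)
close_by_stock = prices.groupby("SecuritiesCode")["Close"]

next_close = close_by_stock.shift(-1)   # C(k, t+1)
next2_close = close_by_stock.shift(-2)  # C(k, t+2)

# r(k, t) = (C(k, t+2) - C(k, t+1)) / C(k, t+1)
prices["TargetCalc"] = (next2_close - next_close) / next_close
print(prices)
```

The last two rows of each stock are `NaN` because their future closes are not available yet.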


#### Calculate rank at 2021-12-01

First, let's filter the dataset to 2021-12-01.

Rank ranges from 0 to 1999 and is assigned in descending order of Target.

In [46]:
stocks_2021_12_01 = stock_prices[stock_prices["Date"]=="2021-12-01"].reset_index(drop=True)
stocks_2021_12_01["rank"] = stocks_2021_12_01["Target"].rank(ascending=False, method="first") - 1
stocks_2021_12_01 = stocks_2021_12_01.sort_values("rank").reset_index(drop=True)
stocks_2021_12_01

Unnamed: 0,RowId,Date,SecuritiesCode,Open,High,Low,Close,Volume,AdjustmentFactor,ExpectedDividend,SupervisionFlag,Target,rank
0,20211201_4488,2021-12-01,4488,5940.0,6090.0,5760.0,6020.0,67100,1.0,,False,0.175439,0.0
1,20211201_6047,2021-12-01,6047,539.0,555.0,535.0,551.0,142900,1.0,,False,0.153610,1.0
2,20211201_2987,2021-12-01,2987,3130.0,3330.0,3035.0,3225.0,155100,1.0,,False,0.147595,2.0
3,20211201_9107,2021-12-01,9107,4940.0,4980.0,4760.0,4885.0,4237900,1.0,,False,0.128676,3.0
4,20211201_3926,2021-12-01,3926,1815.0,1902.0,1784.0,1863.0,388700,1.0,,False,0.115712,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,20211201_6378,2021-12-01,6378,994.0,1025.0,951.0,1002.0,1016600,1.0,,False,-0.060456,1995.0
1996,20211201_3635,2021-12-01,3635,4740.0,4870.0,4625.0,4855.0,426300,1.0,,False,-0.065708,1996.0
1997,20211201_4080,2021-12-01,4080,2163.0,2168.0,1971.0,2023.0,1709700,1.0,,False,-0.203046,1997.0
1998,20211201_3031,2021-12-01,3031,1545.0,1595.0,1488.0,1563.0,449400,1.0,,False,-0.206186,1998.0


* Smaller rank: higher return, so the stock is profitable to buy.
* Larger rank: lower return, so the stock is profitable to sell short.

2. Among the top 200 predicted stocks ($up_i\;\;(i = 1, 2, \ldots, 200)$), multiply each stock's rate of change by a linear weight running from 2 down to 1 across ranks 1 to 200, and denote the weighted sum, normalised by the average weight, as $S_{up}$.

    $$
    S_{up} = \frac{\sum^{200}_{i=1}\left(r_{({up_i}, t)} \cdot \text{linear function}(2, 1)_i\right)}{\text{Average}(\text{linear function}(2, 1))}
    $$

In [54]:
# get top 200 (copy so the new columns are added to an independent DataFrame)
stocks_2021_12_01_top200 = stocks_2021_12_01.iloc[:200, :].copy()

# create linear weights from 2 (best rank) down to 1 (200th rank)
weights = np.linspace(start=2, stop=1, num=200)
stocks_2021_12_01_top200["weights"] = weights

# calculate weighted returns
stocks_2021_12_01_top200["calc_weights"] = stocks_2021_12_01_top200["Target"] * stocks_2021_12_01_top200["weights"]
stocks_2021_12_01_top200.head(3)


Unnamed: 0,RowId,Date,SecuritiesCode,Open,High,Low,Close,Volume,AdjustmentFactor,ExpectedDividend,SupervisionFlag,Target,rank,weights,calc_weights
0,20211201_4488,2021-12-01,4488,5940.0,6090.0,5760.0,6020.0,67100,1.0,,False,0.175439,0.0,2.0,0.350877
1,20211201_6047,2021-12-01,6047,539.0,555.0,535.0,551.0,142900,1.0,,False,0.15361,1.0,1.994975,0.306448
2,20211201_2987,2021-12-01,2987,3130.0,3330.0,3035.0,3225.0,155100,1.0,,False,0.147595,2.0,1.98995,0.293707


In [55]:
Sup = stocks_2021_12_01_top200["calc_weights"].sum()/np.mean(weights)
Sup

12.823517325150357

3. Among the bottom 200 predicted stocks ($down_i\;\;(i = 1, 2, \ldots, 200)$), multiply each stock's rate of change by a linear weight running from 2 down to 1 across bottom ranks 1 to 200 (the lowest-ranked stock gets weight 2), and denote the weighted sum, normalised by the average weight, as $S_{down}$.

    $$
    S_{down} = \frac{\sum^{200}_{i=1}\left(r_{({down_i}, t)} \cdot \text{linear function}(2, 1)_i\right)}{\text{Average}(\text{linear function}(2, 1))}
    $$

In [56]:
# get bottom 200
stocks_2021_12_01_bottom200 = stocks_2021_12_01.iloc[-200:,:]
# reverse the order so the worst-ranked stock comes first and receives weight 2
stocks_2021_12_01_bottom200 = stocks_2021_12_01_bottom200.sort_values("rank",ascending = False).reset_index(drop=True)
stocks_2021_12_01_bottom200["weights"] = weights
# calculate weighted returns
stocks_2021_12_01_bottom200["calc_weights"] = stocks_2021_12_01_bottom200["Target"] * stocks_2021_12_01_bottom200["weights"]
Sdown = stocks_2021_12_01_bottom200["calc_weights"].sum()/np.mean(weights)
Sdown

-2.9368666783473216

4. The result of subtracting $S_{down}$ from $S_{up}$ is $R_{day}$ and is called "**daily spread return**".

    $$
    R_{day} = S_{up} - S_{down}
    $$

In [57]:
daily_spread_return = Sup - Sdown
daily_spread_return

15.760384003497679
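
To make the weighting in steps 2 to 4 easier to check by hand, here is a scaled-down sketch with six made-up stocks and a portfolio size of 2 instead of 200. The `Target` values are invented; the arithmetic mirrors the cells above.

```python
import numpy as np
import pandas as pd

# Six hypothetical stocks, already ranked 0 (best) to 5 (worst).
day = pd.DataFrame({
    "Rank": [0, 1, 2, 3, 4, 5],
    "Target": [0.05, 0.03, 0.01, -0.01, -0.02, -0.04],
})

portfolio_size = 2
weights = np.linspace(start=2, stop=1, num=portfolio_size)  # [2.0, 1.0]

# Long leg: best ranks first. Short leg: worst ranks first.
top = day.sort_values("Rank")["Target"].to_numpy()[:portfolio_size]
bottom = day.sort_values("Rank", ascending=False)["Target"].to_numpy()[:portfolio_size]

# S_up   = (0.05 * 2 + 0.03 * 1) / 1.5       ≈  0.0867
# S_down = ((-0.04) * 2 + (-0.02) * 1) / 1.5 ≈ -0.0667
s_up = (top * weights).sum() / weights.mean()
s_down = (bottom * weights).sum() / weights.mean()

r_day = s_up - s_down  # ≈ 0.1533
print(s_up, s_down, r_day)
```

Because the short leg is subtracted, a negative $S_{down}$ adds to the daily spread return.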

5. The daily spread return is calculated every business day during the public/private period, which gives a time series for that period. The mean of this time series divided by its standard deviation is used as the score. In the formula below, $x$ is the number of business days in the public/private period.

    $$
    Score = \frac{\text{Average}(R_{day_1 \ldots day_x})}{\text{STD}(R_{day_1 \ldots day_x})}
    $$

In [59]:
def calc_spread_return_sharpe(df: pd.DataFrame, portfolio_size: int = 200, toprank_weight_ratio: float = 2) -> float:
    """
    Args:
        df (pd.DataFrame): predicted results
        portfolio_size (int): # of equities to buy/sell
        toprank_weight_ratio (float): the relative weight of the most highly ranked stock compared to the least.
    Returns:
        (float): sharpe ratio
    """
    def _calc_spread_return_per_day(df, portfolio_size, toprank_weight_ratio):
        """
        Args:
            df (pd.DataFrame): predicted results
            portfolio_size (int): # of equities to buy/sell
            toprank_weight_ratio (float): the relative weight of the most highly ranked stock compared to the least.
        Returns:
            (float): spread return
        """
        assert df['Rank'].min() == 0
        assert df['Rank'].max() == len(df['Rank']) - 1
        weights = np.linspace(start=toprank_weight_ratio, stop=1, num=portfolio_size)
        purchase = (df.sort_values(by='Rank')['Target'][:portfolio_size] * weights).sum() / weights.mean()
        short = (df.sort_values(by='Rank', ascending=False)['Target'][:portfolio_size] * weights).sum() / weights.mean()
        return purchase - short

    buf = df.groupby('Date').apply(_calc_spread_return_per_day, portfolio_size, toprank_weight_ratio)
    sharpe_ratio = buf.mean() / buf.std()
    return sharpe_ratio
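
The last two lines of the function reduce the per-day series to a single number. The sketch below illustrates that step in isolation with made-up daily spread returns. One detail worth knowing: pandas `Series.std()` uses the sample standard deviation (`ddof=1`) by default, while `numpy.std` defaults to `ddof=0`, so a NumPy reimplementation gives a slightly different score unless `ddof=1` is passed.

```python
import numpy as np
import pandas as pd

# Hypothetical daily spread returns for four business days.
daily_spread = pd.Series(
    [0.02, 0.01, 0.03, -0.01],
    index=pd.to_datetime(["2021-12-06", "2021-12-07", "2021-12-08", "2021-12-09"]),
)

# Same computation as in calc_spread_return_sharpe: mean / sample std (ddof=1).
score = daily_spread.mean() / daily_spread.std()

# NumPy's default is the population std (ddof=0), which yields a larger ratio here.
score_ddof0 = daily_spread.mean() / np.std(daily_spread.to_numpy())
score_ddof1 = daily_spread.mean() / np.std(daily_spread.to_numpy(), ddof=1)

print(score, score_ddof0, score_ddof1)
```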

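#### Calculate for 2021

Here `Rank` is derived from the realised `Target`, so the resulting score is that of a perfect-foresight ranking, not of a model.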

In [62]:
stock_prices_2021 = stock_prices.loc[stock_prices["Date"]>= "2021-01-01"].reset_index(drop=True)
stock_prices_2021["Rank"] = stock_prices_2021.groupby("Date")["Target"].rank(ascending=False,method="first") -1 
stock_prices_2021["Rank"] = stock_prices_2021["Rank"].astype("int")
stock_prices_2021

Unnamed: 0,RowId,Date,SecuritiesCode,Open,High,Low,Close,Volume,AdjustmentFactor,ExpectedDividend,SupervisionFlag,Target,Rank
0,20210104_1301,2021-01-04,1301,2951.0,2951.0,2913.0,2950.0,9700,1.0,,False,0.011502,655
1,20210104_1332,2021-01-04,1332,428.0,429.0,416.0,421.0,1780500,1.0,,False,0.019093,375
2,20210104_1333,2021-01-04,1333,2229.0,2231.0,2179.0,2202.0,112400,1.0,,False,0.015075,497
3,20210104_1375,2021-01-04,1375,1701.0,1701.0,1672.0,1674.0,67900,1.0,,False,-0.003503,1481
4,20210104_1376,2021-01-04,1376,1597.0,1597.0,1577.0,1588.0,4500,1.0,,False,-0.012033,1737
...,...,...,...,...,...,...,...,...,...,...,...,...,...
451995,20211203_9990,2021-12-03,9990,514.0,528.0,513.0,528.0,44200,1.0,,False,0.034816,580
451996,20211203_9991,2021-12-03,9991,782.0,794.0,782.0,794.0,35900,1.0,,False,0.025478,1119
451997,20211203_9993,2021-12-03,9993,1690.0,1690.0,1645.0,1645.0,7200,1.0,,False,-0.004302,1941
451998,20211203_9994,2021-12-03,9994,2388.0,2396.0,2380.0,2389.0,6500,1.0,,False,0.009098,1768


In [63]:
score = calc_spread_return_sharpe(stock_prices_2021, portfolio_size= 200, toprank_weight_ratio= 2)
score

5.7907451128813605

The following rules are used to determine which stocks are available for investment.

* The top 2,000 common stocks by market capitalization that have been listed for at least one year as of 2021-12-31 are eligible for investment.

* If a stock is designated as Securities Under Supervision or Securities to Be Delisted during the private period, it will be excluded from investment after the date of designation.

* When calculating the score, the adjusted stock price is used.

### Intent of the problem design

In general, financial market time series cannot be assumed to follow the same distribution in two different periods. For example, the nature of the market changed dramatically between February 2020 and March 2020 and beyond due to changes in global conditions caused by COVID-19 and other factors.

In the case of a competition that focuses on a financial market with shifting data distribution characteristics, we thought that the winner of the competition should be the Kaggler who constructed a robust model that does not depend on changes in the data distribution.

Based on the above assumptions, the following were considered in the design of this competition:

* The quantity scored each business day is the difference in the rate of change between the top 200 and the bottom 200 stocks, each group being 10% of the investable universe (2,000 stocks), so that model performance can be compared without being dominated by events affecting individual stocks. In practice, however, institutional investors and funds often invest in 50-100 stocks, so the problem setting deviates slightly from the real world.

* When calculating the daily spread return, a linear weight of 2 to 1 is applied to stocks 1-200, so that a model is rewarded for placing the stocks with the highest rates of return at the very top of its ranking.

* Since "risk control" is also an important element of investment, the competition score is the **mean/standard deviation** of the time series of daily spread returns, rather than the simple mean or sum of daily spread returns. This makes it necessary to build a model that can respond to changes in the distribution of data and produce stable and high performance, rather than a model that only wins big on certain days.

* The competition also provides option data and other data that can provide clues for estimating the volatility and risk factors of the market itself. These data may be used for more sophisticated risk control. Since the bottom 200 stocks are also included in the forecast, it is possible to adopt a market-neutral strategy (it is also possible to intentionally bias the beta toward the long side).