# Exploration of Market Index Price Distributions and Normality - Part 2

A 2024 continuation of [Market Index Data Exploration.ipynb](https://github.com/edM777/index-data-explore/blob/main/Market%20Index%20Data%20Exploration.ipynb)

In [2]:
import shelve
import ipyplot

#### Intro

In this study we will revisit the past exploration of Market Index Price Distributions, which included tests to determine if the distributions are log-normal/normal. According to market theory, share prices are log-normally distributed, so we will test these assumptions.  To see the conclusions of my last study, see the accompanying notebook -  "[Market Index Data Exploration.ipynb](https://github.com/edM777/index-data-explore/blob/main/Market%20Index%20Data%20Exploration.ipynb)". The same *References* used there also apply here, in addition to any new ones noted.<br>
Furthermore, this study expands on the last one by analyzing the normality of index log returns as well as the effects of the Compound Annual Growth Rate (CAGR). The additional analysis gives more insight into the price evolution of financial instruments and whether or not they behave according to theory.


To collect the new data analyzed in this study, we used the same market data collection Python script as before: historical-data-collector.py <br>
As will be seen in the time series plot below, the new data spans roughly April 2008 to December 2022, an almost 3 year shift forward from the last study.

### New Price Data - Initial Visualization and Analysis

#### Time Series Plot

As before, the study begins by creating a time-series plot of the DAX Price data. To try and account for all bar price values, we will plot the average, where $avgPrice = (open + high + low + close) / 4$. This is one of the price calculations typically used for creating moving averages.

![Dax Time Series Plot](ln-imgs/dax-time-vs-price-plot-new-2024-data.png) 

As expected, the new time-series plot looks almost identical to the one in the previous study, just shifted forward in time by almost 3 years. So, as before, one can assume that there are no major outliers or erroneous entries in this dataset. In future continuations of this study, perhaps alternative market data sources could be used, and one could compare the quality of data between sources. For now, this 15 year dataset of DAX prices from IBKR will suffice.<br>
The plot also shows that since the end of the last study's data - July 2021, there was an expected price increase at first, but then prices started to dip around the beginning of 2022. The price bottomed out around July 2022 and has since continued to increase.

#### Histograms - Visual Inspection of the Data Distribution using Old Methodologies 

First, we may re-create the histograms presented in the old study using the new DAX historical data. The main purpose of creating these figures is to help verify our past conclusions on normality. If our assumptions were correct, then we should see similar distributions in our new datasets.<br>
We will explore new ways of analyzing the price data in the next section of this study.

**Define Parameters**

As before, the independent variables that we will vary to test for normality are: Index, price type, and time period. <br>
For this study, our parameters are as follows (only "DAX" Index this time):
- **Index symbols**: "DAX"
- **Price Types**: 
    - 0 -> (O + H + L + C) / 4
    - 1 -> (H + L + C) / 3
    - 2 -> C (Close Price only)
    - 3 -> (H + L) / 2 
- **Time Periods**: 2 months, 6 months, 18 months, 10 years, all available time
    - Note, all time periods are defined in terms of days, where every month in a year is treated as 30 days
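
The parameter definitions above can be sketched in a few lines of Python. This is only an illustration: `calc_price` and `months_to_period` are hypothetical helper names, and the bar is represented as a plain dict rather than the ibapi bar objects used in the study.

```python
from datetime import timedelta


def calc_price(bar, price_type):
    """Return the bar price for one of the four price types listed above."""
    o, h, l, c = bar["open"], bar["high"], bar["low"], bar["close"]
    if price_type == 0:
        return (o + h + l + c) / 4
    if price_type == 1:
        return (h + l + c) / 3
    if price_type == 2:
        return c
    if price_type == 3:
        return (h + l) / 2
    raise ValueError(f"unknown price type: {price_type}")


def months_to_period(months):
    """Convert months to a day-based period, treating every month as 30 days."""
    return timedelta(days=30 * months)


# Made-up bar values, for illustration only
bar = {"open": 100.0, "high": 110.0, "low": 90.0, "close": 104.0}
prices = [calc_price(bar, t) for t in range(4)]

# 2 months, 6 months, 18 months and 10 years expressed in days
periods = [months_to_period(m) for m in (2, 6, 18, 120)]
```

With the 30-day convention, the four fixed periods come out as 60, 180, 540 and 3600 days.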

Now, below are the resulting histograms created using the new DAX Price data<br>
(See past study [Market Index Data Exploration.ipynb](https://github.com/edM777/index-data-explore/blob/main/Market%20Index%20Data%20Exploration.ipynb) for detailed methodology and code)

![Dax New Raw Data Histograms](ln-imgs/histograms-newData-TightFig-originalFigVar.png) 

As can be seen above, there are some noticeable differences between the price distributions generated in this study and those of the previous one, now that the data extends almost three years further forward. However, all of the histograms do have some characteristics of normal distributions: a peak and a bell-curve shape with two tails. This is the case for all time periods, even the longer ones, which does not exactly agree with the statistical tests performed later on. So, these histograms show broadly the same distribution shapes as before, though not identical ones.<br>
It's also worth noting that, while at a glance, one can identify the shape of a typical log-normal or normal distribution, quantifying the "*degree*" of normality visually is quite difficult. Similarly, determining at what exact point a distribution stops being normal is also difficult.

## New Feature Study: Log Returns

I became interested in investigating log returns partly due to the studies and analysis conducted by Vance in sixfigureinvesting<sup>1</sup>. Furthermore, the main mathematical theory, as stated by the CFA Institute, is that "If continuously compounded returns are normally distributed, asset prices are lognormally distributed."<sup>6</sup><br>With these theories in mind, one could assume that both features of a price dataset could be studied interchangeably, and normality tests would be "equivalent". One might just choose one or the other depending on what is easier to visualize. In this study we will explore these assumptions.

### Statistical Testing Pt. 1 - Pearson's, Shapiro, D'Agostino

Here we will begin the normality tests with our collection of DAX historical price data - sortedBars. As before, sortedBars consists of historical bar data in standard [ibapi](https://interactivebrokers.github.io/tws-api/introduction.html) format (Open, High, Low, Close), sorted by date.

The parameters and price types defined in the section above (Histograms - Visual Inspection of the Data Distribution using Old Methodologies) will remain the same. The only difference in this study is that now we will exclusively look at the *log returns* as opposed to the raw price data.

#### Obtain Log Returns Data

Please see the past study - [Market Index Data Exploration.ipynb](https://github.com/edM777/index-data-explore/blob/main/Market%20Index%20Data%20Exploration.ipynb): "Initial Normality Tests" for the detailed methodology and code used to generate the statistical test results below. The code was only slightly altered to calculate and use the log returns instead of the raw prices. Below is the relevant code snippet where log returns are calculated by simply subtracting the natural log of the current period's price from the natural log of the next one.

In [None]:

            calcCagrColl = cagrSortedBars(currSortedColl, myCalcPrices)

            currCalc = calcCagrColl[0]
            calcCagrCollReturn = np.log(np.array(currCalc[1:]) / np.array(currCalc[:-1]))

            mstatsP = sp.stats.mstats.normaltest(calcCagrCollReturn).pvalue
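
For reference, the log return calculation can also be written as a small standalone sketch. The helper name `log_returns` and the price values are made up for illustration; the ratio form used in the snippet above, `np.log(next / current)`, is mathematically the same as the difference of logs used here.

```python
import math


def log_returns(prices):
    """r_t = ln(P_t) - ln(P_{t-1}) for each pair of consecutive prices."""
    return [math.log(nxt) - math.log(cur) for cur, nxt in zip(prices[:-1], prices[1:])]


prices = [100.0, 105.0, 102.9, 108.045]  # made-up prices
rets = log_returns(prices)

# Log returns are additive: their sum is the log return over the whole period.
total = sum(rets)
```

The additivity shown in the last line is one reason log returns are convenient to work with: the return over any longer period is just the sum of the shorter-period log returns.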
            

#### First Results for Log Returns Testing

Then I simply print the **p value** results of the statistical testing. The table below is sorted in descending order by the "shapiro_test" column. <br>Also note that *no* log-normal transformation was done to our dataset this time, unlike in our previous study. This is because the theory states that log returns should be normal (not lognormal).

In [21]:
myLnDb = shelve.open('myLnReturnsStats.db')
myLnStats = myLnDb['barStats0']
myLnDb.close()
myLnStats.head(30)

Unnamed: 0,Index,Price_Type,Time_Period,mstats_normaltest,shapiro_test,dagostino_test
15,DAX,3,60 days 00:00:00,0.3281003,0.4439808,0.3281003
0,DAX,0,60 days 00:00:00,0.5684429,0.394408,0.5684429
5,DAX,1,60 days 00:00:00,0.3888791,0.3421267,0.3888791
35,INDU,3,60 days 00:00:00,0.3388371,0.3409079,0.3388371
50,SPX,2,60 days 00:00:00,0.4395631,0.199734,0.4395631
55,SPX,3,60 days 00:00:00,0.1726734,0.1980959,0.1726734
45,SPX,1,60 days 00:00:00,0.2398082,0.1439784,0.2398082
40,SPX,0,60 days 00:00:00,0.2415956,0.1069764,0.2415956
10,DAX,2,60 days 00:00:00,0.2342222,0.09473832,0.2342222
25,INDU,1,60 days 00:00:00,0.2544365,0.0909259,0.2544365


As before, these tests assume that our sample data is normal (**H<sub>0</sub>**) and we will use the standard threshold, where **alpha = 0.05**. That is, if p <= alpha: reject the null hypothesis, the data is most likely not normal.<br>
Thus, the table below shows the log returns cases which "passed" the normality tests.

In [14]:
passedLnRes = myLnStats.loc[myLnStats['shapiro_test'] > 0.05]
passedLnRes

Unnamed: 0,Index,Price_Type,Time_Period,mstats_normaltest,shapiro_test,dagostino_test
15,DAX,3,60 days 00:00:00,0.3281,0.443981,0.3281
0,DAX,0,60 days 00:00:00,0.568443,0.394408,0.568443
5,DAX,1,60 days 00:00:00,0.388879,0.342127,0.388879
35,INDU,3,60 days 00:00:00,0.338837,0.340908,0.338837
50,SPX,2,60 days 00:00:00,0.439563,0.199734,0.439563
55,SPX,3,60 days 00:00:00,0.172673,0.198096,0.172673
45,SPX,1,60 days 00:00:00,0.239808,0.143978,0.239808
40,SPX,0,60 days 00:00:00,0.241596,0.106976,0.241596
10,DAX,2,60 days 00:00:00,0.234222,0.094738,0.234222
25,INDU,1,60 days 00:00:00,0.254436,0.090926,0.254436


Already, it can be said that the log returns data yielded interesting results. Here we have 12 cases which pass the test. In our study done almost 3 years ago, only 2 cases passed the test. And we still see a pattern regarding the index and time period: the shorter time periods (60 days) are most likely to pass the test and DAX seems to have the most normal distribution. Conclusions about the price type are not as clear here.  <br>Another aspect worth noting is that all 3 statistical tests here seem to reach similar conclusions, so there isn't one test which disagrees with another. Note, however, that the mstats_normaltest and dagostino_test columns are identical in every row shown, so they appear to be the same test computed twice and should not be counted as independent confirmation. <br><br>Without performing further tests, one might think that log returns tend to be distributed more normally than raw prices, but we will see that the behavior may be the opposite in some cases.

#### Comparing Log Returns to Raw Price (w/ CAGR Adjustment) Distributions

Now we will conduct the same statistical tests on our raw price data, adjusted by the Compound Annual Growth Rate (CAGR). In order to be in line with our past study, *all our raw prices will be adjusted by CAGR to account for growth smears*. So, we will run the exact same code as in our past study, but this time we will use our new historical bar collection of data from almost 3 years forward in time.<br>Based on some interpretations of mathematical theory, one could expect these tests to yield the same results, but not necessarily.

In [22]:
myRawDb = shelve.open('myRawReturnsStats.db')
myRawStats = myRawDb['barStats0']
myRawDb.close()
myRawStats.head(30)

Unnamed: 0,Index,Price_Type,Time_Period,mstats_normaltest,shapiro_test,dagostino_test
15,DAX,3,60 days 00:00:00,0.4579012,0.4274214,0.4579012
5,DAX,1,60 days 00:00:00,0.4932132,0.3681036,0.4932132
0,DAX,0,60 days 00:00:00,0.3971657,0.3155884,0.3971657
10,DAX,2,60 days 00:00:00,0.4270801,0.2981354,0.4270801
20,INDU,0,60 days 00:00:00,0.4485496,0.2728799,0.4485496
51,SPX,2,180 days 00:00:00,0.1814112,0.2406062,0.1814112
35,INDU,3,60 days 00:00:00,0.4264858,0.2206722,0.4264858
11,DAX,2,180 days 00:00:00,0.2730447,0.1989837,0.2730447
25,INDU,1,60 days 00:00:00,0.4414437,0.1692764,0.4414437
46,SPX,1,180 days 00:00:00,0.03716727,0.160366,0.03716727


In [24]:
passedRawRes = myRawStats.loc[myRawStats['shapiro_test'] > 0.05]
passedRawRes

Unnamed: 0,Index,Price_Type,Time_Period,mstats_normaltest,shapiro_test,dagostino_test
15,DAX,3,60 days 00:00:00,0.457901,0.427421,0.457901
5,DAX,1,60 days 00:00:00,0.493213,0.368104,0.493213
0,DAX,0,60 days 00:00:00,0.397166,0.315588,0.397166
10,DAX,2,60 days 00:00:00,0.42708,0.298135,0.42708
20,INDU,0,60 days 00:00:00,0.44855,0.27288,0.44855
51,SPX,2,180 days 00:00:00,0.181411,0.240606,0.181411
35,INDU,3,60 days 00:00:00,0.426486,0.220672,0.426486
11,DAX,2,180 days 00:00:00,0.273045,0.198984,0.273045
25,INDU,1,60 days 00:00:00,0.441444,0.169276,0.441444
46,SPX,1,180 days 00:00:00,0.037167,0.160366,0.037167


As can be seen in the table above, 23 cases passed the normality test when we tested the raw price data. This is almost double the number of passes compared to the log returns tests. The "top" 3 results are also the same, with DAX 60 day time periods, and price types 3, 1, and 0. After these, the results differ a bit.<br> In both cases we can definitely still see the same trend of the shorter time periods (60 days and 180 days) being most likely to pass the test. It also still seems as though DAX is the most likely to be normal, as it appears a bit more than the rest.<br>Initially, these results were a bit unexpected. As explained before, I thought that the results would be similar according to the theory. But we can clearly see more passing results when we analyze the raw prices. And again, only smaller time periods pass, so our statistical tests may still have some limitations. So, further testing should be done to validate our results here.

### Statistical Testing Pt. 2 - KS Test, Jarque Bera

The reasoning behind conducting these extra tests is to try and ensure that the differences between the raw price datasets and the log returns datasets are not only due to limitations of the statistical tests. There are many different types of tests that can be done to test the normality of a distribution, but here we will choose two more popular tests: the **Kolmogorov-Smirnov** test (scipy - kstest) and the **Jarque-Bera** test (scipy - jarque_bera).
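
To make the Jarque-Bera test less of a black box, here is a minimal standard-library sketch of the statistic it is based on: the sample skewness $S$ and kurtosis $K$ are combined as $JB = \frac{n}{6}\left(S^2 + \frac{(K-3)^2}{4}\right)$, and the p-value comes from a chi-square distribution with 2 degrees of freedom. The function name and sample values are made up for illustration, and the study itself uses scipy's `jarque_bera` rather than this code.

```python
import math


def jarque_bera_stat(sample):
    """Return (JB statistic, p-value) using biased sample moments."""
    n = len(sample)
    mean = sum(sample) / n
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    m4 = sum((x - mean) ** 4 for x in sample) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)
    # The survival function of a chi-square with 2 degrees of freedom is exp(-x / 2)
    p_value = math.exp(-jb / 2)
    return jb, p_value


# Tiny symmetric sample (made-up values)
symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
# Larger, strongly right-skewed sample (made-up values)
skewed = [0.0] * 95 + [10.0] * 5

jb_sym, p_sym = jarque_bera_stat(symmetric)
jb_skew, p_skew = jarque_bera_stat(skewed)
```

The tiny sample passes at alpha = 0.05 even though five evenly spaced points are hardly a bell curve, while the larger skewed sample is rejected decisively. Normality tests generally have little power on small samples, which is one plausible contributor to the pattern of short time periods passing more often.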

#### Log Returns Testing Pt. 2 

To perform our two new statistical tests, we will use code very similar to that used for the original statistical tests. Both new tests, the kstest and the jarque_bera, are contained in the scipy package as well. The raw Python code used to obtain the results in this section is given below. 

In [None]:
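# Imports needed by this cell
from datetime import datetime, timedelta

import numpy as np
import pandas as pd
import scipy as sp
import scipy.stats
from scipy.stats import jarque_bera

# symbolsMap, sortedBars, getCalcPriceBars and cagrSortedBars are assumed to be
# defined as in the previous study's notebook.
resultColumnsTestsR2p2 = ['Index', 'Price_Type', 'Time_Period', 'ks_test', 'jarque_test']
normTest_results_LnRet_R2 = pd.DataFrame(columns=resultColumnsTestsR2p2)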

priceTypes = [0, 1, 2, 3]
# Value 0 for timePeriods list indicates max available time period
timePeriods = [timedelta(days=60), timedelta(days=180), timedelta(days=540), timedelta(days=3600), 0]
index_symbols = ["DAX", "INDU", "SPX"]

for index in index_symbols:
    for priceT in priceTypes:
        for timeP in timePeriods:
            currCollIndex = symbolsMap.get(index)
            currSortedColl = []
            currColl = sortedBars[currCollIndex]  # Current collection of bars for the current index
            if timeP != 0:
                timePColl = []
                endTimeP = datetime.strptime(currColl[0].date, '%Y%m%d') + timeP
                for bar in currColl:  # Iterate through curr collection and only get bars within current time frame
                    if (datetime.strptime(bar.date, '%Y%m%d')) < endTimeP:
                        timePColl.append(bar)
                    else:
                        break  # Can just break since we have a sorted list
            else:
                timePColl = currColl  # Use all bar data for time period 0
            currSortedColl.append(timePColl)
            myCalcPrices = getCalcPriceBars(currSortedColl, priceT)
            calcCagrColl = cagrSortedBars(currSortedColl, myCalcPrices)

            currCalc = calcCagrColl[0]
            calcCagrCollReturn = np.log(np.array(currCalc[1:]) / np.array(currCalc[:-1]))

            params = sp.stats.norm.fit(calcCagrCollReturn)

            ks_test = sp.stats.kstest(calcCagrCollReturn, 'norm', params) # For log returns test for normality (NOT lognormality)
            test_statistic, j_p_value = jarque_bera(calcCagrCollReturn)

            normTest_results_LnRet_R2.loc[len(normTest_results_LnRet_R2.index)] = [index, priceT, timeP, ks_test.pvalue, j_p_value]

sortedResultsRawR2p2 = normTest_results_LnRet_R2.sort_values(by=['jarque_test'], ascending = False)

Per usual, we will use a standard significance level where **alpha = 0.05**. <br>The table shows the log returns cases which "passed" the normality tests.

In [3]:
myLnDbP2 = shelve.open('myLogReturnsStatsP2.db')
myLnStatsP2 = myLnDbP2['barStats0']
myLnDbP2.close()

passedLnResP2 = myLnStatsP2.loc[myLnStatsP2['jarque_test'] > 0.05]
passedLnResP2

Unnamed: 0,Index,Price_Type,Time_Period,ks_test,jarque_test
50,SPX,2,60 days 00:00:00,0.581746,0.681751
0,DAX,0,60 days 00:00:00,0.738031,0.610211
35,INDU,3,60 days 00:00:00,0.921129,0.553033
5,DAX,1,60 days 00:00:00,0.736781,0.486302
30,INDU,2,60 days 00:00:00,0.413175,0.485003
20,INDU,0,60 days 00:00:00,0.266545,0.456296
45,SPX,1,60 days 00:00:00,0.825393,0.407978
15,DAX,3,60 days 00:00:00,0.741832,0.401338
40,SPX,0,60 days 00:00:00,0.514477,0.395443
25,INDU,1,60 days 00:00:00,0.798412,0.373763


With these 2 new tests, there are 13 cases which pass the normality tests, just one more than in the previous statistical tests. However, the results are not exactly the same. In this case, INDU is the most likely to be normal, with the most entries in the table. Again, the shorter time period (60 days) is the most prevalent as well. One could say that the results are not that distinct overall. So, with regards to the log returns testing, we can see that both sets of tests agree with each other. These results do also suggest that the particular scipy statistical test chosen can make a difference in normality assessments. 

#### Comparing log Returns to Raw Price Distributions Pt. 2

To complete the second round of statistical tests, we will run the jarque_bera and the kstest on our raw historical price data. We will mostly use the same code given above for the log return case, except that now we will test the raw price data. And since we are looking at raw prices now, per the theory, our ks_test will use 'lognorm' params (instead of 'norm'). A snippet of the altered code is given below. 

In [None]:

            calcCagrColl = cagrSortedBars(currSortedColl, myCalcPrices)

            params = sp.stats.lognorm.fit(np.array(calcCagrColl[0]))

            ks_test = sp.stats.kstest(np.array(calcCagrColl[0]), 'lognorm', params)
            test_statistic, j_p_value = jarque_bera(np.array(calcCagrColl[0]))

            normTest_results_rawP_R2.loc[len(normTest_results_rawP_R2.index)] = [index, priceT, timeP, ks_test.pvalue, j_p_value]


Now continue the test with a standard significance level where **alpha = 0.05**. <br>The table below shows the raw price cases which "passed" the normality tests.

In [6]:
myRawDbP2 = shelve.open('myRawPStatsP2.db')
myRawStatsP2 = myRawDbP2['barStats0']
myRawDbP2.close()

passedRawResP2 = myRawStatsP2.loc[myRawStatsP2['jarque_test'] > 0.05]
passedRawResP2

Unnamed: 0,Index,Price_Type,Time_Period,ks_test,jarque_test
30,INDU,2,60 days 00:00:00,0.354552,0.557759
5,DAX,1,60 days 00:00:00,0.670496,0.508899
15,DAX,3,60 days 00:00:00,0.851714,0.495334
20,INDU,0,60 days 00:00:00,0.574821,0.465191
25,INDU,1,60 days 00:00:00,0.49279,0.458544
0,DAX,0,60 days 00:00:00,0.833551,0.447328
35,INDU,3,60 days 00:00:00,0.607709,0.444766
10,DAX,2,60 days 00:00:00,0.70056,0.438327
11,DAX,2,180 days 00:00:00,0.569483,0.40447
51,SPX,2,180 days 00:00:00,0.684346,0.285485


<br>Here 24 cases pass the normality tests, just one more than in the previous tests. Again, we see very similar results to our first set of statistical tests. The number of samples that "passed" the normality tests differed by only 1 or 2 at the most. Additionally, the small time period trend continues, where the 60 day time period samples are most likely to pass the test. So, in this second set of raw price testing, our new results overall agree with past results.

Going back to the larger topic, these new statistical tests further support the idea that the distributions of stock prices and their log returns are not directly correlated. The results may even agree with mathematical theory: "Because of the central limit theorem, continuously compounded returns need not be normally distributed for asset prices to be reasonably well described by a lognormal distribution."<sup>6</sup>
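
The central limit theorem argument in the quote can be illustrated with a small simulation. This is a sketch under made-up assumptions, not part of the study's data: per-period log returns are drawn from a uniform distribution (clearly not normal), and the log of the terminal price is the sum of 250 such returns across 2000 independent simulated paths.

```python
import random


def skew_and_excess_kurtosis(sample):
    """Return (skewness, excess kurtosis) from biased sample moments."""
    n = len(sample)
    mean = sum(sample) / n
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    m4 = sum((x - mean) ** 4 for x in sample) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3


rng = random.Random(42)  # fixed seed so the sketch is reproducible
n_paths, n_steps = 2000, 250

# Single-period log returns: uniform, so far from normal (excess kurtosis near -1.2)
single_step = [rng.uniform(-0.02, 0.02) for _ in range(5000)]

# Log of terminal price relative to the start = sum of the per-period log returns
terminal_log_price = [
    sum(rng.uniform(-0.02, 0.02) for _ in range(n_steps)) for _ in range(n_paths)
]

step_skew, step_kurt = skew_and_excess_kurtosis(single_step)
term_skew, term_kurt = skew_and_excess_kurtosis(terminal_log_price)
```

The single-period returns have a strongly negative excess kurtosis, while the summed log prices have skewness and excess kurtosis close to 0, as a normal distribution would. Note that this concerns the distribution of the price at a fixed horizon across many paths, which is not quite the same object as a histogram of prices along one historical path.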

### Visual Tests - New Log Returns Histograms

We now move on to our visual tests to further analyze the effects of examining log returns as opposed to the raw prices. To begin, we plotted a histogram of DAX log returns for all available time, using Price Type 1. This can be considered another "proof of concept" for our normality testing. If the whole dataset showed at least some normal characteristics under certain circumstances, then this would also indicate that varying the parameters as before may lead to interesting results. And as discussed above, the initial statistical tests already support the notion that log returns can yield different results as compared to raw prices when analyzing normality. 

**Histogram of DAX Log Returns, 2008 - 2022, 'Price Type' 1**

![Dax log ret hist 15Y](ln-imgs/histogram-DAX-logReturns-15Y-2024-price1.png) 

On visual inspection, the log returns histogram above clearly looks normal. There are no noticeable outliers in this case and it has a typical bell curve shape with a mean at 0. This is likely the most noticeably normal/log-normal distribution shown in both studies thus far.

Next, a histogram will be created for the same dataset and Price Type, but only the raw price will be plotted this time. 

**Histogram of DAX Prices, 2008 - 2022, 'Price Type' 1**

![Dax raw pricehist 15Y](ln-imgs/histogram-DAX-rawP-15Y-2024-price1.png) 

Visually, the raw price histogram does not look very log-normal or normal at all. There is no clear mean, there's no clear bell shape, and overall, there seem to be too many random deviations. Even if we try to smooth things out with a basic log transform, the distribution does not look normal. So, again, we see clear differences between the distribution of raw prices vs. their log returns.<br>
Also note that we did *not* adjust by CAGR in either of the plots above. However as can be seen, the log returns appear normal regardless, and we will explore CAGR further in the next topic of this study.

#### Further Visual Comparison of log Returns to Raw Price Distributions

After generating fairly distinct histograms above, it's also appropriate to analyze another set of samples in the same way. This time, INDU was chosen, price type 1, 180 days. This particular sample/case was chosen because this is an example where the statistical test results disagreed - i.e. the raw price's normality tests passed while the tests on its log returns failed. 

So, below the log returns and the raw price histograms are shown side by side for  ease of comparison.

In [3]:
ipyplot.plot_images(["ln-imgs/indu-log-returns-180D-P1-2024NewData.png", "ln-imgs/indu-rawP-180D-P1-2024NewData.png"], max_images=2, img_width=450)

As can be seen above, the histograms generated for this sample look noticeably normal and lognormal, respectively. So, even though the statistical tests on the log returns determined that this sample is unlikely to be normal, the histogram can appear to say otherwise. It's only after closer inspection that you can see the outliers on the right of the graph. It's possible that there are some limitations to the accuracy of these statistical tests. Or, it could be that those outliers are enough to break the definitions of normality.<br>
All in all, normality should be easier to spot visually in log returns, since the bell curve is a familiar shape. However, statistical testing may be more reliable in cases where a distribution seems normal but a few outliers skew it. And it seems these outliers are especially hard to spot when the rest of the distribution looks very normal.

#### Raw Plots of Log Returns - Spectrograms

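To complete the log returns study, below are the figures generated by plotting the log returns over time. If you consider the Index price a signal, these can loosely be called spectrograms (strictly speaking, a spectrogram shows frequency content over time, whereas these are time-series plots of the returns).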

In [4]:
ipyplot.plot_images(["ln-imgs/dax-log-returns-vs-time-plot-60D-price1.png", "ln-imgs/dax-log-returns-vs-time-plot-180D-price1.png", "ln-imgs/dax-log-returns-initialIndex0OutlierRemoved-type0.png"], max_images=3, img_width=450)

As can be seen above, all spectrograms show some sort of oscillation in returns, going between negative and positive returns continuously. So, overall they show the same type of evolution over time. With a smaller sample size, there appears to be a lot more volatility, as the returns seem to vary a lot more. This could just be the effect of the lower "resolution", however.

### Conclusions Regarding Log Returns

As has been mentioned in this report before, an initial interpretation of the mathematical theory behind the log-normal model for stock prices makes it seem as though raw price distributions and log return distributions directly correlate with each other. Again, as the CFA puts it, "If continuously compounded returns are normally distributed, asset prices are lognormally distributed."<sup>6</sup> <br>
To be more specific on the theory, Malkiel's random walk model for share prices, with share price $S(t)$, uses the main equation: $dS(t) = \mu S(t)\,dt + \sigma S(t)\,dZ$ <sup>7</sup>. This is essentially describing the continuous returns of a stock. Then, through derivation, the equivalent discrete share price equation is: $S(t) = S(t-1)\exp\{\mu^* + \sigma Z(t)\}$. <br>So, we can see how the equations would both yield normal distributions (you would need to perform a log transform on the share price equation) and they directly correlate with each other. <br><br>
However, the Central Limit Theorem seems to expand on the theory, as a price distribution should eventually be log-normal, even if the log returns are not. In our analysis we even had a case where the log returns appear normal, however the raw price distribution does *not* seem to be normal/log-normal. There are a few possibilities for our results. One interpretation is that the real life index data, at least in this study, is not distributed according to theory at all. However, with some results showing clear indications of normality on both ends, I don't think that's completely accurate. I believe it may more often be the case that financial instrument prices are log-normally distributed than that their log returns are normally distributed.<br>In any case, more tests should be done with many varied samples to make more definite conclusions on these distributions. The most apparent finding in this log returns study simply seems to be that raw price distributions and their log returns distributions do not always correlate with each other. 
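
The log transform mentioned for the discrete share price equation can be written out explicitly. As a sketch, assume the $Z(t)$ are independent standard normal variables (an assumption made here for illustration). Taking logs of the discrete equation gives the one-period log return:

$$
\ln S(t) - \ln S(t-1) = \mu^* + \sigma Z(t) \sim N(\mu^*, \sigma^2)
$$

Summing these one-period log returns from time $0$ to time $t$ gives the log price:

$$
\ln S(t) = \ln S(0) + t\mu^* + \sigma \sum_{i=1}^{t} Z(i) \sim N\left(\ln S(0) + t\mu^*,\; t\sigma^2\right)
$$

So under this model the log returns are normal by construction, and $S(t)$ at a fixed horizon $t$ is lognormal because its logarithm is normal.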

## Analysis of Compound Annual Growth Rate (CAGR)

Up until this point, almost all our raw price datasets have been adjusted by their Compound Annual Growth Rate (CAGR). This includes both the "raw" price tests and the log returns tests. To calculate the CAGR for each price interval, the standard formula was used: $CAGR = ((fv / pv)^{365/d}) - 1$, where $pv$ is the starting value, $fv$ is the final value, and $d$ is the number of days in the interval. See the past study's "Initial Normality Tests" section for details and code.<br> 
This CAGR adjustment was also inspired by the studies done by sixfigureinvesting. According to that study, and other sources, "the CAGR can be used to smooth returns so that they may be more easily understood"<sup>8</sup>. After further reflection, a question that came to mind is how does this "smoothing" affect normality? Does the CAGR adjustment keep the underlying price distribution the same? This next part of the study explores these questions further.
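
As a quick worked example of the CAGR formula above, here is a minimal sketch. The function name `cagr` and the input values are made up for illustration; it only evaluates the formula and does not reproduce how the study applies the adjustment to each bar.

```python
def cagr(pv, fv, days):
    """CAGR = (fv / pv) ** (365 / days) - 1, with the period given in days."""
    if pv <= 0 or days <= 0:
        raise ValueError("pv and days must be positive")
    return (fv / pv) ** (365 / days) - 1


# 100 -> 121 over two years (730 days): 10% per year
two_year = cagr(100.0, 121.0, 730)

# 100 -> 102 over 60 days: a 2% move annualizes to roughly 12.8%
sixty_day = cagr(100.0, 102.0, 60)
```

The second case shows how strongly a short interval is extrapolated by the $365/d$ exponent, which is worth keeping in mind when the adjustment is applied to the 60-day samples.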

Since CAGR is supposed to "smooth" out results, visual plots should be able to show this behavior more clearly in the data.<br> 
So, below are two sets of histogram plots to compare: those on the left are log return histograms with the usual CAGR adjustment applied. On the right are histograms constructed with the same data, but this time no CAGR adjustment was applied.

In [5]:
ipyplot.plot_images(["ln-imgs/histogram-newData-logReturns-allPeriods-2024.png", "ln-imgs/histogram-newData-logReturns-allPeriods-NoCAGR-2024.png"], max_images=2, img_width=600)

As shown above, we can see the CAGR adjustment did make a noticeable difference. The main difference we can see visually is that it got rid of some outliers, on both sides of the distribution. It's not easy to tell visually which distribution was affected the most, especially at this scale. But the smaller time period distributions seem to have changed the most, as before there was a large gap between the mean and the outliers. The shape of the larger time period seems fairly similar, but now the range is much smaller, as a lot of small outliers seem to have been removed.<br>So, we can see that the CAGR adjustment did do some sort of "smoothing" to our dataset. Note, however, that the overall shape of the distributions remained the same. With no CAGR adjustment, all distributions had a normal bell curve shape, especially the larger time periods. After the adjustment, the distributions were tightened up by removing the outliers. 

### Is it valid to adjust by CAGR?

According to theory, it is already known that the CAGR calculation has some limitations, such as not accounting for volatility<sup>8</sup>. However, from all our studies done thus far, including the one above, we can see that distributions can still be noticeably different from a normal/log-normal one even with the CAGR adjustment. Indeed, from the histograms generated, it seems the effect of removing some volatility from price histories is that outliers are removed. However, in some cases, even if the outliers are removed, the distribution can be very haphazard. One way to look at this is that the CAGR is a way to remove noise from the signal, but the underlying signal remains the same.<br>
It can also be fair to say that only analyzing raw prices is valid, and that any manipulations/adjustments of the data makes results invalid. It seems few other researchers in this area have looked into adjusting by the CAGR. However, our results, when the visual and the statistical tests are taken into account, do make logical sense, even with the CAGR adjustment. It may also be that the CAGR is just another tool that is most useful to use in certain circumstances. In the literature, for example, the CAGR is meant to be used for larger time periods where the compounding growth most affects the price evolution. It would be interesting to explore more in the future how these conditions interact with CAGR adjustments.
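
The limitation that CAGR does not account for volatility can be made concrete with a small sketch. The two price paths below are made up: they start and end at the same values, so they have the same CAGR, but the path taken in between is very different. The helper names are hypothetical.

```python
import math
import statistics


def cagr(pv, fv, days):
    """CAGR = (fv / pv) ** (365 / days) - 1."""
    return (fv / pv) ** (365 / days) - 1


def log_returns(prices):
    """r_t = ln(P_t) - ln(P_{t-1}) for consecutive prices."""
    return [math.log(nxt) - math.log(cur) for cur, nxt in zip(prices[:-1], prices[1:])]


# Yearly prices over three years (made-up values)
smooth = [100.0, 105.0, 110.25, 115.7625]  # +5% every year
choppy = [100.0, 130.0, 90.0, 115.7625]    # same endpoints, large swings
days = 3 * 365

cagr_smooth = cagr(smooth[0], smooth[-1], days)
cagr_choppy = cagr(choppy[0], choppy[-1], days)

# Standard deviation of the yearly log returns as a simple volatility measure
vol_smooth = statistics.pstdev(log_returns(smooth))
vol_choppy = statistics.pstdev(log_returns(choppy))
```

Both paths have a CAGR of 5%, yet the volatility of the first is essentially zero while the second is large. Because CAGR depends only on the endpoints and the elapsed time, it carries no information about how the price moved in between.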

# Final Thoughts on this Revisitation Study

There were many interesting observations made in this second exploration study of Market Indices. One of the primary objectives behind these investigations was to scrutinize the validity and applicability of market theories through different testing methods.<br>
After this second set of studies, I have more confidence in saying that market index prices can have log-normal distributions under certain conditions. Those conditions can be: the specific time period analyzed, what specific price is used (average price, close price, open, etc), and the index itself (DAX, INDU, SPX, etc). Of course there could be other conditions that could be analyzed in the future, but these are the params we examined which made a difference. 

In this study it was also observed that the log-normality/normality of index prices and their log returns are not directly correlated - if log returns are normal, this does not necessarily mean that the raw price distribution is lognormal and vice-versa. It seems that, overall, this behavior is still according to mathematical theory. However, a deeper study on the theory is required to confirm this. In any case, both the returns and the raw prices should be considered when analyzing the distributions of financial instruments.<br>
The CAGR is another parameter that can be considered when analyzing distributions. It seems that it doesn't affect the overall distribution shape, but it does smooth out and remove outliers.

Note that while I am basing my conclusions on multiple sets of tests and research, I cannot say with complete certainty that all my findings are accurate. More samples would need to be analyzed in future studies to make more definitive conclusions. These samples should be quite varied, coming from different regions, from different time periods, performing different adjustments, etc. In particular, the CAGR adjustment may not be ideal for shorter time periods, while the statistical tests may not be ideal for longer ones. 

Others that have studied this log-normal/normal theory such as Mota seem to have different findings<sup>9</sup>. Overall, it seems that others did not find much evidence for log-normality in the distribution of share prices. The studies also all seem to be conducted differently. They use different statistical tests, look at different time periods, only consider certain "parameters" to vary, etc. It would be interesting to try and determine which testing methodologies are the best.

Lastly, I would like to mention that the market theory seems to specifically say that *share* prices are log-normally distributed, which should limit the application to stock prices, and not indices. However, indices are composed of stocks, so one might assume that indices are included in the theory. Furthermore, other online sources seem to extend the theory to index values as well. In Wilkies' log-normal random walk paper, he shows that, under the mathematical models, an index cannot be log-normal if its underlying stocks are log-normal<sup>7</sup>. But, because of the CLT, the indices will eventually be log-normal. <br>
In any case, in our studies, the distributions of all 3 indices examined did show log-normal/normal characteristics.

## References 

1. https://sixfigureinvesting.com/2018/11/predicting-price-ranges-with-historic-volatility/
2. https://www.macroption.com/calculating-moving-average-prices/
3. https://machinelearningmastery.com/a-gentle-introduction-to-normality-tests-in-python/
4. https://www.r-bloggers.com/2011/10/normality-tests-don%e2%80%99t-do-what-you-think-they-do/
5. https://interactivebrokers.github.io/tws-api/historical_bars.html

New References in this Study:<br>

6. https://www.cfainstitute.org/en/membership/professional-development/refresher-readings/common-probability-distributions
7. https://www.actuaries.org/EVENTS/Congresses/Paris/Papers/3203.pdf
8. https://www.investopedia.com/terms/c/cagr.asp
9. https://bibliotekanauki.pl/articles/729892.pdf