# Exploring Stock Market Distributions

In [1]:
import ipyplot
from datetime import datetime, timedelta
from typing import List
import math
import matplotlib.pyplot as plt
import matplotlib.ticker as plticker
from time import sleep
import psycopg2
from psycopg2.extras import execute_values
import pandas as pd
import sqlalchemy
from sqlalchemy import create_engine, insert, Table, MetaData
import itertools
import numpy as np
import scipy as sp
import scipy.stats
from configparser import ConfigParser
import matplotlib.dates as mdates



In [1]:
import getpass
password = getpass.getpass('Enter your password')

Enter your password ········


## Introduction And Motivation

Stock market behavior is a complex domain, and analyzing the statistical distribution of stock prices and returns is fundamental to understanding market dynamics. Traditional financial theories, such as the Geometric Brownian Motion model (which forms the basis of the Black-Scholes formula), often assume that continuously compounded returns are normally distributed, leading to the assumption that prices follow a lognormal distribution.

This study is the latest phase of a multi-stage data exploration project investigating the distributional properties of equity prices, expanding the analysis with a larger dataset, more parameters, and a more rigorous approach. While past work focused on indices and smaller datasets, this project specifically examines individual stock distributions, providing a deeper analysis on the specifics of stock theory.<br>
As had been mentioned in past [conclusions](https://github.com/edM777/index-data-explore/blob/main/Market-Index-Data-Exploration-revisit-Jan2024.ipynb), in the future "these samples should be quite varied, coming from different regions, from different time periods, performing different adjustments, etc." This study greatly expands on the parameter analysis as well.<br>
And so, this project is an exploratory data analysis, aiming to uncover the fundamental nature of stock distributions and their relationship with normality.

#### Theoretical Background:
A core tenet of traditional financial theory is that stock prices tend to follow a lognormal distribution over time<sup>1</sup>. This follows from modeling prices as geometric Brownian motion, and the formula for log returns is derived directly from this model.
$$
\ln\left(\frac{S(t)}{S(t-1)}\right) = \mu + \sigma \epsilon_t
$$

Where:
- $S(t)$ is the stock price at time $t$
- $S(t-1)$ is the stock price at time $t-1$
- $\mu$ is the expected return
- $\sigma$ is the volatility
- $\epsilon_t$ is a random variable from a standard normal distribution.

Theoretically, if raw prices are lognormally distributed, then their continuously compounded returns (log returns) are normally distributed. Additionally, a log transformation of the raw prices will also be normally distributed. This relationship is derived from the properties of lognormal and normal distributions, which is also explained in Wilkie's paper <sup>2</sup>.<br>
However, this study investigates whether real-world stock data truly conforms to these theoretical assumptions.
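As a minimal sketch of the relationship above, the snippet below simulates log returns from the model $\mu + \sigma \epsilon_t$, builds a price path from them, and then recovers the same log returns from the prices. The drift, volatility, starting price, and series length are hypothetical values chosen only for illustration; they are not estimates from this study's data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical daily parameters, chosen only for illustration
mu, sigma, n_days = 0.0005, 0.01, 1000
start_price = 100.0

# Log returns under the model: mu + sigma * epsilon_t, epsilon_t ~ N(0, 1)
eps = rng.standard_normal(n_days)
log_returns = mu + sigma * eps

# Prices follow by exponentiating the cumulative sum of log returns
prices = start_price * np.exp(np.concatenate(([0.0], np.cumsum(log_returns))))

# Recover the log returns from prices: ln(S(t) / S(t-1))
recovered = np.log(prices[1:] / prices[:-1])
```

Under this model the recovered log returns are normal by construction, and `np.log(prices)` is a Gaussian random walk, which is what makes the prices lognormal.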

### Goal

The primary objective of this project is to critically evaluate the standard assumption of lognormality and normality in stock price distributions by applying rigorous statistical analysis to a large set of real-world market data. This exploration moves beyond theoretical models to provide an empirical, data-driven perspective on the actual characteristics of stock returns, with the goal of identifying where classical financial theories align with and diverge from observed market behavior.

## Data Acquisition and Preparation 

***Data Source***: Financial data for this study was primarily sourced from [Financial Modeling Prep](https://site.financialmodelingprep.com/) (**FMP**).<br> 
To ensure a diverse representation of the market, data was collected for **100** individual stocks spanning small, mid, and large market capitalization ranges across three different countries: Hong Kong (HK), the United States (US), and Sweden (SE).<br> 

The distribution included: approximately 11 small, 11 mid, and 11 large-cap stocks per country. Small-cap stocks were defined as having a market capitalization between 3 million and 2 billion USD, mid-cap between 2 billion and 10 billion USD, and large-cap above 10 billion USD. Stocks for each segment were randomly selected using [FMP's stock screener](https://site.financialmodelingprep.com/developer/docs/stable/search-company-screener), with parameters configured to include only actively trading stocks, excluding funds and ETFs.<br>
The main programming language for this section and the rest of the project is Python.
<br><br>
A full list of the 100 stocks used in this project can be found in this **[Google Sheet](https://docs.google.com/spreadsheets/d/1Xg0Nsfbj1lZVDYbo34lS32oRoagoINdk_xNKeIwtFgs/edit?gid=295300716#gid=295300716)**.

The data collection and initial storage were managed using a Python script (`data-collectory-storage.py`), which interacted with the FMP API. The collected data was then stored in a PostgreSQL database. The `sqlalchemy` library was used for both inserting data into and extracting data from the database. An example data extraction is shown below.

In [3]:
engine = create_engine(f'postgresql://postgres:{password}@localhost:5432/market-data-analysis')
metadata = MetaData()
# Query to retrieve all stock data
query = "SELECT * FROM stock_prices_filtered;"
ins_eod_df = pd.read_sql(query, engine)
# Convert 'bar_date' to datetime and sort DataFrame
ins_eod_df['bar_date'] = pd.to_datetime(ins_eod_df['bar_date'], utc=True)
ins_eod_df = ins_eod_df.sort_values('bar_date')
ins_eod_df.reset_index(drop=True, inplace=True)

In [4]:
ins_eod_df.head()

| | id | bar_date | isin | open | high | low | close | volume | split | merger | OHLC Avg | price_change |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 968956 | 1994-04-18 00:00:00+00:00 | US9842451000 | 24.75 | 25.0 | 24.25 | 24.25 | 908300.0 | 0.0 | False | 24.5625 | 0.0 |
| 1 | 782362 | 1994-04-18 00:00:00+00:00 | US4042511000 | 15.5 | 16.38 | 15.13 | 16.38 | 88200.0 | 0.0 | False | 15.8475 | 0.0 |
| 2 | 1019722 | 1994-04-18 00:00:00+00:00 | US67021C2061 | 51.16 | 51.16 | 51.16 | 51.16 | 0.0 | 0.0 | False | 51.16 | 0.0 |
| 3 | 815025 | 1994-04-18 00:00:00+00:00 | US3208171096 | 10.94 | 10.94 | 10.94 | 10.94 | 2080.0 | 0.0 | False | 10.94 | 0.0 |
| 4 | 774812 | 1994-04-18 00:00:00+00:00 | US1501851067 | 16.81 | 16.88 | 16.75 | 16.88 | 12800.0 | 0.0 | False | 16.83 | 0.0 |


Notes on some of the columns / data features:<br>
- `bar_date`: Represents the date for which the daily OHLC values were obtained
- `split`: Represents the decimal value for a split. For example a 1:2 split would have a 0.5 value. A 0 indicates there was no split on this day.

#### Data Cleaning and Filtering
Initial data cleaning involved checking for and removing any NaN or infinity values. While the dataset was largely clean, functions like `dropna()` were used to ensure data integrity.<br>
For example, see the dropna usage below.<br>
See full data cleaning logic in `normality-scanner-tester.py`

In [None]:
param_adjusted_data.dropna(subset=['calculated_values'], inplace=True)

In the end, across all the stock data, there is a total of **380,448** rows of daily stock price data. 

**Data Filtering Exploration** <br>
A key challenge in financial data analysis is appropriate outlier detection. Standard statistical methods like the IQR (Interquartile Range) proved insufficient for this dataset, as they did not account for domain-specific data characteristics observed from the FMP API. For instance, certain stocks like 'ARD' exhibited anomalous data points that were not flagged by conventional methods. Therefore, a custom filtering logic was developed based on the mean price change (see `stk-data-cleaner-filter-R.py` in the program files).<br> 
This approach resulted in a seemingly more robust dataset, reducing the row count by roughly 7% (from 380,448 to 352,119 entries).

The final filtering method used a threshold based on the mean of absolute price changes: `threshold = abs(df['price_change']).mean() * 24`.

It's important to note that this was an initial exploration, and further analysis needs to be conducted to determine the most adequate filtering methods. As such, this further filtered dataset was *not* used in the rest of the analysis.
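The threshold rule above can be sketched as follows. This is an illustrative reading, not the logic of `stk-data-cleaner-filter-R.py`: the function name `filter_by_mean_abs_change` is hypothetical, and it assumes that rows whose absolute `price_change` exceeds the threshold are dropped and that the rule is applied to one stock's data at a time.

```python
import pandas as pd

def filter_by_mean_abs_change(df, multiplier=24):
    """Drop rows whose absolute price change exceeds multiplier * mean(|price_change|).

    Hypothetical helper: assumes `df` holds a single stock's rows and that
    exceeding the threshold marks a row as anomalous.
    """
    abs_change = df['price_change'].abs()
    threshold = abs_change.mean() * multiplier
    return df[abs_change <= threshold]

# Toy example: 99 ordinary moves of size 1 and one anomalous jump of 1000
toy = pd.DataFrame({'price_change': [1.0] * 99 + [1000.0]})
filtered = filter_by_mean_abs_change(toy)
```

Because the threshold is based on the mean, a sufficiently extreme value raises the threshold it is compared against, which is one reason the multiplier needs to be tuned against real examples.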

### Obtaining Test Results Data

To generate a comprehensive set of results for analysis, a Python script (`normality-scanner-tester.py`) was developed. This script manipulated the stock data according to various parameters and performed statistical tests on the resulting distributions.<br> 
The duration parameter for each test run was applied by randomly selecting `start_time` and `end_time` dates that yielded the desired time duration.
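One way to pick a random window of a fixed duration is sketched below. The helper `random_window` and its arguments are hypothetical and only illustrate the idea; the actual selection logic lives in `normality-scanner-tester.py` and may differ (for example, in how it handles stocks whose history is shorter than the requested duration).

```python
import numpy as np
import pandas as pd

def random_window(first_date, last_date, duration_seconds, rng):
    """Return (start_time, end_time) spanning exactly `duration_seconds`,
    placed uniformly at random inside [first_date, last_date].
    Returns None when the available history is too short."""
    duration = pd.Timedelta(seconds=duration_seconds)
    latest_start = last_date - duration
    if latest_start < first_date:
        return None
    span_seconds = (latest_start - first_date).total_seconds()
    start_time = first_date + pd.Timedelta(seconds=float(rng.uniform(0, span_seconds)))
    return start_time, start_time + duration

rng = np.random.default_rng(1)
first = pd.Timestamp('1994-04-18', tz='UTC')
last = pd.Timestamp('2024-04-14', tz='UTC')

# 31,536,000 seconds is one 365-day year
start_time, end_time = random_window(first, last, 31_536_000, rng)
too_short = random_window(first, first + pd.Timedelta(days=30), 31_536_000, rng)
```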

See *Appendix - Code 1* for the primary function used to manipulate and test the stock data in bulk within the `normality-scanner-tester.py` script.<br>
Further analysis of the logic of the script as it pertains to the application of parameter testing is done in later sections.

Multiple runs of this script were conducted to create a large dataset of test results with diverse parameter combinations (start and end times were generated at random each run). After running the "tester" script *three times*, an initial total of *25,552* rows of test results were obtained.

To ensure the uniqueness of each test result, duplicate entries were identified and removed using a SQL query. Duplicates were defined as rows with identical values across key columns including `ISIN`, normality test results (`KS_TEST`, `JARQUE_BERA_TEST`, etc.), shape metrics (`SKEW`, `KURTOSIS`, `HURST_EXPONENT`), time parameters (`START_TIME`, `END_TIME`, `DURATION`), and other defining parameters (`N`, `PERIOD_VOLUME`, `SPLIT`, `MERGER`, `CAGR`, `DATA_TYPE`, `LOG_RETURNS`). The query used `GROUP BY` and `HAVING COUNT(*) > 1` to identify duplicates. Duplicates were likely caused by overlapping time frames, which can occur even when the start and end times are randomly generated.<br> 
This process identified 22 duplicated results (44 rows in total). One row from each pair was removed, leaving a final dataset of **25,530** unique test results.<br>
Each row, representing a set of test results, is uniquely identified by its **`runId`**.
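For readers working in pandas rather than SQL, the same idea can be sketched with `duplicated` and `drop_duplicates`. This is an analogue of the `GROUP BY ... HAVING COUNT(*) > 1` approach, not the query that was actually run; the toy rows and the abbreviated `key_columns` list are made up for illustration.

```python
import pandas as pd

# Abbreviated, illustrative subset of the key columns listed above
key_columns = ['isin', 'ks_test', 'start_time', 'end_time', 'data_type', 'log_returns']

# Synthetic rows: runid 1 and 2 describe the same test result
results = pd.DataFrame({
    'runid': [1, 2, 3, 4],
    'isin': ['AAA', 'AAA', 'BBB', 'BBB'],
    'ks_test': [0.41, 0.41, 0.02, 0.30],
    'start_time': ['2020-01-01', '2020-01-01', '2019-05-01', '2021-05-01'],
    'end_time': ['2021-01-01', '2021-01-01', '2020-05-01', '2022-05-01'],
    'data_type': ['eod_close'] * 4,
    'log_returns': [True, True, False, False],
})

# Flag every row that belongs to a duplicated group (both members of a pair)
dup_mask = results.duplicated(subset=key_columns, keep=False)

# Keep the first row of each group and drop the rest
deduplicated = results.drop_duplicates(subset=key_columns, keep='first')
```

Note that `runid` is deliberately excluded from the key columns, since it is unique per row and would otherwise hide the duplicates.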

## Methodology 

The core of this project's methodology involved applying statistical analysis and visualization techniques to understand the distributional properties of stock data. Based on research and iterative observations, the following methods were selected for this study.

### Initial Distribution Analysis
The standard normal distribution, characterized by its bell shape, is a widely used model in various fields, including finance. Below is the typical distribution.

![Standard Normal Distribution](standard-normal-distribution-graph-gaussian-260nw-2291382909.webp)

A crucial part of the initial analysis involved visually inspecting the distributions. This manual inspection, guided by prior knowledge and experience, served as a "self-check" to complement automated test results. To aid this visual comparison, an approximation of the expected normal curve was overlaid on the histogram of the actual data. This synthetic curve is generated from the mean and standard deviation of the observed data using the `norm` distribution object from `scipy.stats`.<br>
The code that creates the overlay is shown below.

In [None]:
from scipy.stats import norm

data = stock_data['calculated_values']
mean = np.mean(data)
std_dev = np.std(data)
# Plot actual histogram data here ....
# Generate points on the x axis, spanning the current plot limits
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 200)
# Evaluate the normal density using the sample mean and standard deviation
p = norm.pdf(x, mean, std_dev)

The complete sample code for plotting the histogram can be found in *Appendix - Code 2* and in `auto-visual-normality-checker.py`.

Below is a sample distribution produced from the test runs with the normal line overlaying the distribution. There are of course thousands of other distributions produced that were analyzed more closely.

![Sample histogram with normal overlay](Histogram-for-HK2778034606---Start-2007-10-30-190000-0500-End-2024-04-14-190000-0500-Log_returns-False-CAGR-False-Data-Type-eod_high_low(log-transformation)-ID54-wOver-scaleCorrection-prelim-weirdButNormalishShapeLike6fig.png)

In addition to visual inspection, standard error metrics were calculated to quantify the "closeness" of the actual data to the generated normal lines. These metrics are discussed in detail in a later section.

### Statistical Tests and Metrics

To provide a more objective assessment of normality, several widely recognized statistical tests were employed. These tests, primarily sourced from the **`scipy.stats`** Python module (with the exception of the Lilliefors test from `statsmodels.stats.diagnostic`), are well-documented and commonly used in statistical analysis.

The specific statistical tests used include:
- **Kolmogorov-Smirnov (KS) test**: A non-parametric test to assess whether a sample comes from a specified distribution (in this case, a normal distribution) or whether two samples come from the same distribution.
- **Shapiro-Wilk test**: A test for normality, generally considered powerful for smaller sample sizes.
- **Jarque-Bera test**: A test for normality based on the sample skewness and kurtosis.
- **Lilliefors test**: A modification of the KS test specifically for testing normality when the mean and variance are unknown.
- **D'Agostino K² test**: A test for normality based on transformations of the sample skewness and kurtosis.<br>

See *Appendix - Code 3* for the code used to apply these tests, from `normality-scanner-tester.py`.
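As a rough illustration of how these tests are called, the sketch below collects p-values from the four `scipy.stats` tests on synthetic samples. It is not the code from *Appendix - Code 3*: the helper name `normality_p_values` is hypothetical, the data is simulated, and the Lilliefors test is left out so the example depends only on SciPy (in `statsmodels` it is available as `statsmodels.stats.diagnostic.lilliefors`).

```python
import numpy as np
from scipy import stats

def normality_p_values(values):
    """Return p-values of four normality tests for a 1-D sample (hypothetical helper)."""
    x = np.asarray(values, dtype=float)
    mean, std = x.mean(), x.std(ddof=1)
    return {
        # KS against a normal with parameters estimated from the sample
        'ks_test': float(stats.kstest(x, 'norm', args=(mean, std))[1]),
        'shapiro_test': float(stats.shapiro(x)[1]),
        'jarque_bera_test': float(stats.jarque_bera(x)[1]),
        # scipy's normaltest is the D'Agostino-Pearson K^2 test
        'dagostino_test': float(stats.normaltest(x)[1]),
    }

rng = np.random.default_rng(42)
normal_sample = rng.normal(0.0, 1.0, 500)      # synthetic, normal by construction
skewed_sample = rng.exponential(1.0, 500)      # synthetic, clearly non-normal

p_normal = normality_p_values(normal_sample)
p_skewed = normality_p_values(skewed_sample)
```

One caveat worth keeping in mind: when the mean and standard deviation passed to the KS test are estimated from the same sample, its p-values tend to be too lenient, which is the situation the Lilliefors correction is designed for.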

Beyond visual inspection and formal statistical tests, the shape of the distribution was quantified using three key metrics. These metrics provide a numerical basis for understanding how a sample distribution deviates from a perfect normal distribution.<br><br>
**Skewness**<br>
Skewness measures the asymmetry of a probability distribution around its mean. It provides insight into the direction and relative magnitude of a distribution's tails.

- A skewness value of 0 indicates a perfectly symmetrical distribution, as is the case with a normal distribution.
- Positive Skewness (> 0) indicates a "right-skewed" distribution, where the tail on the right side is longer or fatter than the left side. The mass of the distribution is concentrated on the left.
- Negative Skewness (< 0) indicates a "left-skewed" distribution, where the tail on the left side is longer or fatter than the right. The mass is concentrated on the right.<br>
  
**Kurtosis**<br>
Kurtosis measures the "tailedness" of a distribution—that is, it describes the weight of the tails relative to the rest of the distribution. This project utilized the Fisher definition of kurtosis, as implemented in the scipy.stats library, which calculates excess kurtosis. For excess kurtosis, a normal distribution has a value of 0.

- Excess Kurtosis > 0 (Leptokurtic): This indicates a distribution with heavier or "fatter" tails and a sharper peak than a normal distribution. Leptokurtic distributions have a higher probability of extreme outlier events. This is a common feature in financial return data.
- Excess Kurtosis < 0 (Platykurtic): This indicates a distribution with lighter or "thinner" tails and a flatter peak. This suggests a lower probability of extreme events compared to a normal distribution.
- Excess Kurtosis = 0 (Mesokurtic): This is the characteristic of any normal distribution.
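To make these definitions concrete, the sketch below computes skewness and excess kurtosis with `scipy.stats` for a few synthetic samples whose shapes are known in advance. The samples are simulated purely for illustration and are unrelated to the stock data.

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(7)
n = 100_000

# Synthetic samples with known shapes
samples = {
    'normal': rng.normal(size=n),            # symmetric, mesokurtic
    'laplace': rng.laplace(size=n),          # symmetric, leptokurtic (fat tails)
    'uniform': rng.uniform(-1, 1, size=n),   # symmetric, platykurtic (thin tails)
    'exponential': rng.exponential(size=n),  # right-skewed
}

# fisher=True (the scipy default) returns excess kurtosis, so a normal sample is near 0
shape = {name: (skew(x), kurtosis(x, fisher=True)) for name, x in samples.items()}

for name, (s, k) in shape.items():
    print(f"{name:12s} skew={s:+.2f} excess_kurtosis={k:+.2f}")
```

The Laplace sample is a useful reference point for financial data: it is symmetric like the normal but has a sharper peak and heavier tails, so its excess kurtosis is clearly positive.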

Additionally, the **Hurst exponent** was calculated as an auxiliary measure to provide insight into the long-term memory of the time series. A Hurst exponent value of 0.5 suggests a random walk, consistent with a normal distribution in returns. Values greater than 0.5 indicate persistent behavior (trending), while values less than 0.5 indicate anti-persistent behavior (mean-reverting).<br>
The Hurst exponent is commonly associated with rescaled range (R/S) analysis; the formulas below instead estimate it from how the standard deviation of lagged differences scales with the lag $n$. Here $\Delta X_t(n) = X_{t+n} - X_t$ is the increment over lag $n$, $S_D(n)$ is the standard deviation of these increments, and $m$ and $c$ are the slope and intercept of the fitted log-log line.

$$
S_D(n) = \text{StdDev}(\{ \Delta X_t(n) \}_t)
$$
$$
\mathcal{T}(n) = \sqrt{S_D(n)} = \sqrt{\text{StdDev}(X_{t+n} - X_t)}
$$
$$
\log(\mathcal{T}(n)) = m \cdot \log(n) + c
$$
$$
H = 2m
$$

See *Appendix - Code 4*, for the code, from `normality-scanner-tester.py`
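The formulas above translate fairly directly into NumPy. The sketch below is an illustrative implementation of that lagged-difference estimate, not a copy of *Appendix - Code 4*; the function name and the choice of lags (2 through 19) are assumptions.

```python
import numpy as np

def hurst_exponent(series, max_lag=20):
    """Estimate H from the scaling of lagged differences (illustrative sketch).

    tau(n) = sqrt(StdDev(X[t+n] - X[t])) is fitted as log(tau) = m*log(n) + c,
    and H = 2*m. The lag range is an assumption.
    """
    x = np.asarray(series, dtype=float)
    lags = np.arange(2, max_lag)
    tau = [np.sqrt(np.std(x[lag:] - x[:-lag])) for lag in lags]
    m, c = np.polyfit(np.log(lags), np.log(tau), 1)
    return 2.0 * m

rng = np.random.default_rng(11)
steps = rng.normal(size=10_000)

h_random_walk = hurst_exponent(np.cumsum(steps))  # cumulative sum of noise: expected near 0.5
h_white_noise = hurst_exponent(steps)             # stationary noise: expected near 0
```

This estimator is sensitive to whether it is fed a level series (such as log prices) or a differenced series (such as returns), and it becomes noisy on short samples, which may help explain small or negative values on short windows.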

#### Combined Statistical Decision Rule
To arrive at a programmatic determination of normality that synthesized the information from multiple statistical tests and distribution shape characteristics, a combined decision rule was developed. This rule aimed to incorporate the results from the most common and robust statistical tests alongside the key shape parameters of skewness and kurtosis, which are fundamental descriptors of a distribution.

The criteria established for a distribution to be classified as "normal" under this combined rule were:

1. The **average p-value** across the five statistical normality tests (KS, Jarque-Bera, Lilliefors, Shapiro, D'Agostino) is *greater than **0.05***. This threshold is commonly used to determine if there is sufficient evidence to reject the null hypothesis of normality.
2. The *absolute value of the **skewness** is less than a defined threshold (specifically, < 0.5)*. Skewness measures the asymmetry of the distribution; a value close to **0** indicates symmetry, characteristic of a normal distribution.
3. The *absolute value of the **kurtosis** (specifically, excess kurtosis) is less than a defined threshold (specifically, < 0.5)*. Kurtosis measures the "tailedness" and peakedness of the distribution relative to a normal distribution (which has an excess kurtosis of 0). A value close to **0** indicates a peak and tails similar to a normal distribution.<br>

The code used to apply this logic via filtering is below, taken from `auto-visual-normality-checker.py`.

In [6]:
query = "SELECT * FROM tests_results;"
tests_results_df = pd.read_sql(query, engine)

In [7]:
normality_columns = ['ks_test', 'jarque_bera_test', 'lilliefors_test', 'shapiro_test', 'dagostino_test']
tests_results_df['average_p_value'] = tests_results_df[normality_columns].mean(axis=1)

# Determine normal distribution based on combined decision rule
tests_results_df['is_normal'] = ((tests_results_df['average_p_value'] > 0.05) & (tests_results_df['skew'].abs() < 0.5) &
                                 (tests_results_df['kurtosis'].abs() < 0.5))

tests_normal_distributions = tests_results_df[tests_results_df['is_normal']]

### Visual Heuristic Rule - Actual vs Expected Values

As a final layer of normality assessment, an algorithmic approach was developed to mimic the human process of visually comparing the observed distribution to the expected normal or lognormal curve. This method aimed to quantify how well the actual data distribution aligned with the theoretical shape.<br>

Error metrics were crucial for this comparison. A fundamental and standard way to compare observed and expected frequencies is the weighted chi-squared statistic, which quantifies the difference between the observed counts in histogram bins and the counts expected under the assumption of a normal distribution. In this rule, the specific metric implemented is the **Sum of Squared Differences (SSD)**, given by the formula below.<br>
Squaring the differences ensures that both positive and negative deviations contribute positively to the total error.

$$ \text{SSD} = \sum_i (\text{Observed}_i - \text{Expected}_i)^2 $$

In addition, the *maximum absolute difference* between the observed and expected counts in any bin was calculated. This catches cases where a single bin has a large deviation even if the overall SSD is low.<br>

Lastly, this algorithm checked the number of peaks in the observed distribution. A characteristic of a normal distribution is that it is unimodal, possessing a single peak. Distributions with multiple peaks are indicative of a departure from normality.

These three conditions (SSD, maximum absolute difference, and number of peaks) formed another set of criteria used in a combined decision logic to assess the "visual normality" of the distributions.<br> 
Thresholds for these three conditions were determined empirically through manual visual inspection of numerous distributions. The resulting classifications were compared against the visual assessment, and thresholds were adjusted iteratively until the automated classification reasonably aligned with human judgment. The final thresholds agreed upon were:

In [None]:
ssd_threshold = 13  # Threshold for Sum of Squared Differences (Weighted Chi^2)
max_abs_diff_threshold = 8  # Threshold for Maximum Absolute Difference
max_peaks = 1

See the *Appendix - Code 5* for the code used to make these calculations, from `auto-visual-normality-checker.py`
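For orientation, the three quantities can be sketched as below. This is a simplified illustration, not the code from *Appendix - Code 5*: the function name, the bin count, the use of `scipy.signal.find_peaks` with a prominence cutoff, and the use of raw bin counts are all assumptions, so the thresholds above would not necessarily apply to these values unchanged.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.stats import norm

def visual_fit_metrics(values, bins=30, prominence_frac=0.1):
    """Return (ssd, max_abs_diff, n_peaks) for a sample against a fitted normal.

    Hypothetical helper; bin count and peak prominence are illustrative choices.
    """
    values = np.asarray(values, dtype=float)
    observed, edges = np.histogram(values, bins=bins)

    # Expected count per bin = N * probability mass of the fitted normal in that bin
    mu, sigma = values.mean(), values.std()
    expected = len(values) * np.diff(norm.cdf(edges, mu, sigma))

    diff = observed - expected
    ssd = float(np.sum(diff ** 2))
    max_abs_diff = float(np.max(np.abs(diff)))

    # Pad with zeros so a peak in the first or last bin can still be detected;
    # the prominence cutoff ignores small bumps caused by sampling noise
    padded = np.concatenate(([0], observed, [0]))
    peaks, _ = find_peaks(padded, prominence=prominence_frac * observed.max())
    return ssd, max_abs_diff, len(peaks)

rng = np.random.default_rng(5)
bimodal = np.concatenate([rng.normal(-5, 1, 2000), rng.normal(5, 1, 2000)])
ssd, max_abs_diff, n_peaks = visual_fit_metrics(bimodal)
```

A clearly bimodal sample like this one is flagged both by the peak count and by a large SSD, since a single fitted normal places most of its expected mass in the empty region between the two modes.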

### Parameter Exploration
With numerous parameters potentially influencing the distribution of stock data, a primary method for exploring their relationships and impact on normality was the use of a correlation matrix. A correlation matrix visually represents the pairwise linear correlation coefficients between variables, indicating both the direction and strength of their linear relationship. This initial visualization was valuable in providing insights into which parameters might have the most significant influence on the normality test results.<br>
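A correlation matrix of this kind can be produced with `DataFrame.corr`. The sketch below uses a small synthetic table with a handful of column names borrowed from the results data; the values are randomly generated and carry no meaning, and booleans are cast to floats so they can take part in the Pearson correlation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n_rows = 200

# Synthetic stand-in for the test results table (values are random, for illustration only)
toy = pd.DataFrame({
    'duration': rng.choice([7884000, 15768000, 31536000], size=n_rows),
    'log_returns': rng.choice([True, False], size=n_rows),
    'cagr': rng.choice([True, False], size=n_rows),
    'skew': rng.normal(size=n_rows),
    'kurtosis': rng.normal(size=n_rows),
})

# Cast booleans to 0/1, then compute pairwise Pearson correlations
corr = toy.astype(float).corr(method='pearson')
print(corr.round(2))
```

Since Pearson correlation only captures linear relationships, a value near zero in such a matrix does not rule out a nonlinear dependence between a parameter and the test results.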

Altogether, the **parameters** analyzed, along with the types of values, included:

- "`CAGR`": A boolean indicating whether the Compound Annual Growth Rate calculation was applied (True/False).
- "`logReturns`": A boolean indicating whether log returns were used (True) or raw prices were analyzed (False).
- "`Merger`": A boolean indicating the occurrence of a merger within the data period (True/False).
- "`Split`": A number representing the ratio of a split, if there is any (0 indicates no split).
- "`Duration`": The length of the time series in seconds. This parameter is directly related to the sample size (N).
- Normality Test Results: "`ks_test`", "`jarque_bera_test`", "`lilliefors_test`", "`shapiro_test`", "`dagostino_test`" (representing p-values).
- Distribution Shape Metrics: "`skew`" and "`kurtosis`" (specifically, excess kurtosis).
- "`hurst_exponent`": A number representing the calculated Hurst exponent.
- "Period in time": The specific chronological period from which the data was sampled.

To generate all possible combinations of categorical parameters for systematic testing, the following code was used:

In [8]:
query = "SELECT * FROM parameters;"
params_df = pd.read_sql(query, engine)

# Generate all combinations of parameters
param_combos = [
    {col: value for col, value in zip(params_df.columns, combo)}
    for combo in itertools.product(*(params_df[col].dropna().unique() for col in params_df))
]

In [9]:
print(param_combos)

[{'cagr': False, 'data_type': 'eod_avg', 'duration': 15768000.0, 'log_returns': False}, {'cagr': False, 'data_type': 'eod_avg', 'duration': 15768000.0, 'log_returns': True}, {'cagr': False, 'data_type': 'eod_avg', 'duration': 31536000.0, 'log_returns': False}, {'cagr': False, 'data_type': 'eod_avg', 'duration': 31536000.0, 'log_returns': True}, {'cagr': False, 'data_type': 'eod_avg', 'duration': 126144000.0, 'log_returns': False}, {'cagr': False, 'data_type': 'eod_avg', 'duration': 126144000.0, 'log_returns': True}, {'cagr': False, 'data_type': 'eod_avg', 'duration': 473040000.0, 'log_returns': False}, {'cagr': False, 'data_type': 'eod_avg', 'duration': 473040000.0, 'log_returns': True}, {'cagr': False, 'data_type': 'eod_avg', 'duration': 252288000.0, 'log_returns': False}, {'cagr': False, 'data_type': 'eod_avg', 'duration': 252288000.0, 'log_returns': True}, {'cagr': False, 'data_type': 'eod_avg', 'duration': 7884000.0, 'log_returns': False}, {'cagr': False, 'data_type': 'eod_avg', 'd

The function `apply_params_to_data(data, params)` was responsible for transforming the raw stock data based on the specified parameters for each test run (see *Appendix - Code 6*, located in `normality-scanner-tester.py`).

#### Time Dependency

As suggested by past studies<sup>3</sup> and research<sup>5</sup>, the impact of duration (the length of the time series, or sample size N) on normality appeared to be significant and warranted a more in-depth analysis. This analysis explored the relationship between duration and normality from several perspectives. The distribution of p-values from normality tests was examined across different durations, and the frequency with which very long time periods resulted in distributions passing normality tests was specifically investigated.

## Findings and Analysis

The analysis of the collected test results and observed data yielded several significant findings regarding the distributional characteristics of stock market data. With a substantial dataset of **25,530** unique test results available for analysis, a broad range of parameter combinations were explored.

A primary question addressed by this study was whether actual stock behavior conforms to the general financial theory which posits that log returns exhibit normal distributions. While this question has many considerations and nuances, a direct answer based on the findings from this sample dataset would be that *normality is not a consistently observed characteristic across all conditions*. The details and nuances of this finding are explored in the following sections.

Beyond the core question of normality, this study also uncovered other significant features of stock data distributions. The observation that traditional theory may not perfectly model actual stock data led to the identification of other recurring distributional characteristics. Even in the limited instances where distributions showed some resemblance to normality, notable observations were made regarding the specific conditions and parameters associated with these cases. This also prompted consideration of other potential distributional models that might provide a better fit for the observed stock data.

### **Overall Normality Is Rare:**

The primary finding of this analysis is that, *under rigorous statistical scrutiny, stock price and return distributions rarely conform to a true normal or lognormal distribution*. This conclusion was drawn from a comprehensive testing process involving numerous data samples.

The strict *combined statistical decision rule* is described in the Methodology above. When this rule was applied to the entire dataset, only 2,014 out of 25,530 samples **(7.9%)** were classified as normal. This low percentage strongly suggests that while normality is theoretically convenient, it is not an inherent characteristic of empirical stock market data.<br>

Even within this small subset of "passing" distributions, visual inspection reveals that few are perfect. The figure below displays six distributions that passed the combined statistical rule.<br> 
Note that while they clearly exhibit the characteristic bell shape, some still show slight deviations from a true Gaussian curve, often in the form of a sharper central peak, a sign of the leptokurtosis that will be explored in a later section.<br>
The `runIds` of the figures below, from left to right, are: 4871, 4333, 359, 3481, 15620, 14468

In [1]:
import ipyplot

image_paths = [
    "Histogram-for-US92890T2050---Start-2022-01-02-180000-0600-End-2022-12-29-180000-0600-Log_returns-True-CAGR-False-Data-Type-eod_close-ID4871-fromCombinedRule-muchBetterNow2StillMoreLepto-scaleCorrection.png", 
    "Histogram-for-HK0045000319---Start-2012-03-20-190000-0500-End-2013-03-17-190000-0500-Log_returns-True-CAGR-False-Data-Type-eod_typical-ID4333-wOver-VeryEvenSmoothButLeptoPeak-fromCombined2-scaleCorrection.png", 
    "Histogram-for-NL0013056914---Start-2020-09-14-190000-0500-End-2021-03-11-180000-0600-Log_returns-True-CAGR-True-Data-Type-eod_high_low-ID359-wOver-alright6mo-fromCombined2-scaleCorrection.png",
    "Histogram-for-SE0017486889-Start-2016-09-25-190000-0500-End-2017-09-21-190000-0500-Log_returns-True-CAGR-False-Data-Type-eod_high_low-ID3481-veryCloseTowardsTopSAME2-scaleCorrection.png",
    "Histogram-for-HK2388011192---Start-2011-12-28-180000-0600-End-2012-12-23-180000-0600-Log_returns-True-CAGR-False-Data-Type-eod_typical-ID15620-fromCombinedRule-kindaClosebutPeakedLeptokurticStill2-scaleCorrection.png",
    "Histogram-for-US4042511000---Start-2001-07-16-190000-0500-End-2002-07-11-190000-0500-Log_returns-True-CAGR-True-Data-Type-eod_close-ID14468-wOver-veryCloseIsh1YRLepto-fromCombined2-scaleCorrection.png"
]

# Pass a list of empty labels so the default file paths are not shown under each image
empty_labels = [''] * len(image_paths)

ipyplot.plot_images(
    image_paths,
    labels=empty_labels,
    max_images=6,
    img_width=450,
    show_url=False
)

Referring back to the initial visual inspection of a random subset of all distributions, this provided the first indication that while stock prices are not entirely random, their distributions often deviate from a perfect normal shape. While some distributions exhibited a semblance of the classic bell shape, a closer examination revealed that most did not perfectly match the precise form of a standard normal distribution.<br>

Visual analysis across multiple stock distributions and various parameter combinations clearly revealed a prevalent trend in shape: distributions are primarily leptokurtic. This observation, which will be explored further in a later section, suggests that extreme price movements (both positive and negative) occur more frequently than predicted by a normal distribution model. 

Further details on analysis and calculations done for classifying distributions are given in the next subsection.

#### Calculations on the Likelihood of Normality

Below is a sample of the runs that passed the combined statistical decision rule, as described in the Methodology.

In [10]:
tests_normal_distributions

| | runid | isin | ks_test | jarque_bera_test | lilliefors_test | shapiro_test | dagostino_test | skew | kurtosis | hurst_exponent | ... | duration | n | period_volume | split | merger | cagr | data_type | log_returns | average_p_value | is_normal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15 | 16 | HK2778034606 | 0.471040 | 0.169405 | 0.040047 | 0.006477 | 0.157919 | -0.410331 | -0.053668 | 0.234005 | ... | 15724800 | 126 | 1.168846e+09 | 0.0 | False | False | eod_typical | False | 0.168978 | True |
| 58 | 58 | HK2778034606 | 0.898834 | 0.558764 | 0.437773 | 0.307523 | 0.583477 | 0.165454 | -0.340296 | -0.143638 | ... | 15465600 | 124 | 1.749103e+08 | 0.0 | False | True | eod_avg | False | 0.557274 | True |
| 69 | 69 | HK2778034606 | 0.839691 | 0.819193 | 0.557840 | 0.962543 | 0.818942 | 0.180245 | -0.164255 | -0.145827 | ... | 7776000 | 61 | 9.793645e+07 | 0.0 | False | True | eod_avg | True | 0.799642 | True |
| 70 | 70 | HK2778034606 | 0.160546 | 0.618345 | 0.013840 | 0.015067 | 0.428302 | 0.215046 | 0.439641 | -0.104925 | ... | 7603200 | 61 | 6.981619e+08 | 0.0 | False | True | eod_avg | False | 0.247220 | True |
| 86 | 86 | HK2778034606 | 0.830298 | 0.465564 | 0.612094 | 0.181833 | 0.440079 | -0.160239 | -0.439580 | 0.073521 | ... | 15724800 | 124 | 7.349658e+08 | 0.0 | False | True | eod_close | False | 0.505974 | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 25501 | 25524 | CNE100005980 | 0.930544 | 0.606074 | 0.728318 | 0.562387 | 0.539610 | -0.222014 | 0.038544 | 0.035790 | ... | 15379200 | 121 | 3.729091e+09 | 0.0 | False | False | eod_high_low | True | 0.673386 | True |
| 25505 | 25528 | CNE100005980 | 0.987733 | 0.992096 | 0.920726 | 0.790251 | 0.892724 | 0.034010 | 0.040215 | 0.034301 | ... | 7603200 | 61 | 2.005452e+09 | 0.0 | False | False | eod_high_low | True | 0.916706 | True |
| 25523 | 25546 | CNE100005980 | 0.926219 | 0.782391 | 0.737013 | 0.541907 | 0.679649 | -0.217423 | 0.029935 | -0.007894 | ... | 7516800 | 62 | 1.661139e+09 | 0.0 | False | True | eod_close | True | 0.733436 | True |
| 25524 | 25547 | CNE100005980 | 0.769390 | 0.147315 | 0.468162 | 0.013833 | 0.150125 | 0.417068 | -0.265124 | 0.074990 | ... | 15206400 | 120 | 1.030926e+10 | 0.0 | False | True | eod_high_low | False | 0.309765 | True |


The number of test runs that passed this combined decision rule was 2014 out of 25530, which is approximately 7.9%. This percentage is notably low, especially considering that classical financial models would lead one to expect a large majority of distributions to exhibit normality.

The full list of `runIds` that passed this combined decision rule can be found in the accompanying file `passing-runIds-combinedDecisionRule.txt`.
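
As a minimal sketch of how such a rule could be applied to the results table, the snippet below assumes the rule consists of an average p-value above 0.05 together with |skew| < 0.5 and |kurtosis| < 0.5 (the thresholds quoted elsewhere in this notebook), and that `average_p_value` is the plain mean of the five test p-values, which is consistent with the tabulated values. The helper name `passes_combined_rule` is hypothetical; the three rows are copied from the result tables in this section.

```python
import pandas as pd

P_VALUE_COLUMNS = [
    "ks_test", "jarque_bera_test", "lilliefors_test", "shapiro_test", "dagostino_test",
]

# Three rows copied from the result tables in this section (runids 16, 69 and 7)
sample = pd.DataFrame(
    {
        "runid": [16, 69, 7],
        "ks_test": [0.471040, 0.839691, 9.697693e-40],
        "jarque_bera_test": [0.169405, 0.819193, 0.0],
        "lilliefors_test": [0.040047, 0.557840, 0.001],
        "shapiro_test": [0.006477, 0.962543, 9.030016e-50],
        "dagostino_test": [0.157919, 0.818942, 2.083774e-184],
        "skew": [-0.410331, 0.180245, -0.404693],
        "kurtosis": [-0.053668, -0.164255, 12.938399],
    }
)


def passes_combined_rule(df, alpha=0.05, shape_limit=0.5):
    """Hypothetical helper: True where a run meets all three assumed criteria."""
    return (
        (df["average_p_value"] > alpha)
        & (df["skew"].abs() < shape_limit)
        & (df["kurtosis"].abs() < shape_limit)
    )


# Assumption: the average p-value is the unweighted mean of the five tests
sample["average_p_value"] = sample[P_VALUE_COLUMNS].mean(axis=1)
sample["is_normal"] = passes_combined_rule(sample)
print(sample[["runid", "average_p_value", "is_normal"]])
```

Under these assumptions, runs 16 and 69 pass while run 7 fails on both its p-values and its kurtosis.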

Another area of exploration was the Hurst exponent as an additional filter for normality, considering values between 0.48 and 0.52 to indicate a lack of long-term dependence associated with normal distributions. Applying this filter to the distributions already identified as normal by the combined decision rule resulted in a significantly smaller subset, with only 23 results:

In [11]:
hurst_filtered_normals = tests_normal_distributions[(tests_normal_distributions['hurst_exponent'] >= 0.48) & (tests_normal_distributions['hurst_exponent'] <= 0.52)]
hurst_filtered_normals

Unnamed: 0,runid,isin,ks_test,jarque_bera_test,lilliefors_test,shapiro_test,dagostino_test,skew,kurtosis,hurst_exponent,...,duration,n,period_volume,split,merger,cagr,data_type,log_returns,average_p_value,is_normal
94,94,HK2778034606,0.0002632758,0.13245,0.001,2.064995e-12,0.135508,-0.099215,-0.098398,0.508182,...,252288000,1978,7754798000.0,0.0,False,True,eod_close,False,0.053844,True
244,244,SE0012853455,2.037081e-05,0.797104,0.001,9.292981e-11,0.832999,0.019903,-0.09599,0.491904,...,125971200,1008,1065955000.0,0.0,False,False,eod_high_low,False,0.326225,True
1992,1992,SE0015811963,0.1003169,0.543826,0.001,3.889456e-05,0.564963,0.109172,-0.264994,0.481312,...,31536000,248,1085016000.0,0.0,False,True,eod_close,False,0.242029,True
2186,2186,HK1883037637,7.780792e-05,0.26396,0.001,6.444447e-07,0.220505,0.224582,0.236748,0.50892,...,31536000,248,1521203000.0,0.0,False,True,eod_typical,False,0.097109,True
4073,4070,US96208T1043,0.4455475,0.027643,0.378696,8.976494e-07,0.026314,-0.194325,0.142009,0.486303,...,125971200,1006,363304000.0,0.0,False,True,eod_close,False,0.17564,True
7994,7994,HK0883013259,9.947153e-11,0.412615,0.001,1.761794e-14,0.396497,0.049242,0.182507,0.483956,...,126144000,988,112298700000.0,0.0,False,True,eod_close,False,0.162023,True
9745,9745,SE0012853455,2.215211e-05,0.674453,0.001,1.065014e-10,0.701998,-0.029351,-0.12373,0.495779,...,126144000,1008,1066851000.0,0.0,False,False,eod_avg,False,0.275495,True
9956,9957,SE0000115420,0.6354562,0.000184,0.006238,4.60026e-06,0.000237,0.321063,0.029758,0.485292,...,126144000,999,238876500.0,0.0,False,True,eod_high_low,False,0.128424,True
10542,10543,US7995661045,0.0002430277,0.262937,0.001,4.493251e-06,0.180092,0.120432,-0.443204,0.497182,...,31536000,252,455870500.0,0.0,False,False,eod_avg,False,0.088855,True
11569,11557,SE0000115446,0.003702611,0.184814,0.001,2.201667e-07,0.177697,-0.106461,-0.188107,0.480887,...,126057600,1004,9029486000.0,0.0,False,False,eod_high_low,False,0.073443,True


RunID 94 below is an example distribution from this subset.

![bad-hurst-combined-pass](Histogram-for-HK2778034606---Start-2015-03-22-190000-0500-End-2023-03-20-190000-0500-Log_returns-False-CAGR-True-Data-Type-eod_close(log-transformation)-ID94-wOver-like8YearPassButNoGoodStill2-scaleCorrection.png)

RunID 94, shown above, is also an example of a distribution that passed this strict filtering but does not appear visually normal. In general, manual inspection revealed that some distributions that passed all these criteria were still visually not normal, suggesting the Hurst exponent, in this context, might produce false positives.<br>

Conversely, the Hurst exponent also identified some distributions that appear normal by other measures, such as runID 2445 (shown below), suggesting it has some utility but should be used in conjunction with other methods.

![normal-hurst-result](Histogram-for-HK0388045442---Start-2014-12-21-180000-0600-End-2015-03-19-190000-0500-Log_returns-True-CAGR-False-Data-Type-eod_avg-ID2445-wOver-passStrictTest2VERYClose-scaleCorrection.png)

Further analysis involved comparing the actual distributions to expected normal distributions using error metrics. This is the **"Visual Heuristic Rule"** described above. Distributions were considered to 'pass' this comparison test based on empirically determined thresholds for these metrics.

See Appendix - List 1 for all 178 runs that passed the algorithm comparing actual to expected values.
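
One way such a comparison could be computed is sketched below. This is an illustration only: the function name `normal_fit_errors`, the choice of 20 bins, and the use of MAE and RMSE as the error metrics are assumptions, and the empirically determined thresholds used in this study are not reproduced here.

```python
import numpy as np
from scipy import stats


def normal_fit_errors(values, bins=20):
    """Compare histogram densities with the fitted normal pdf at the bin centers."""
    values = np.asarray(values, dtype=float)
    observed, edges = np.histogram(values, bins=bins, density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    # Expected density from a normal curve with the sample's own mean and std
    expected = stats.norm.pdf(centers, loc=values.mean(), scale=values.std(ddof=1))
    residuals = observed - expected
    return {
        "mae": float(np.mean(np.abs(residuals))),
        "rmse": float(np.sqrt(np.mean(residuals ** 2))),
    }


# Synthetic data for illustration: one normal sample and one clearly skewed sample
rng = np.random.default_rng(0)
normal_errors = normal_fit_errors(rng.normal(size=5000))
skewed_errors = normal_fit_errors(rng.exponential(size=5000))
print(normal_errors, skewed_errors)
```

A run would then "pass" when its error metrics fall below chosen cutoffs; the skewed sample should show markedly larger errors than the normal one.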

The combination of *passing both* the combined statistical decision rule and this visual comparison test yielded 57 distributions. The `runIds` for these distributions are:

[69, 1561, 2195, 2955, 3901, 4955, 4964, 5356, 6071, 6435, 6747, 6756, 6766, 6850, 7711, 7959, 8851, 9347, 9357, 9396, 9756, 9852, 10708, 11576, 11578, 11904, 11937, 12086, 12262, 12298, 12866, 14265, 14944, 15255, 15652, 15927, 16310, 16837, 17023, 17209, 17405, 17433, 18195, 19615, 19850, 20152, 20186, 20596, 20946, 21706, 22275, 23111, 23205, 24589, 24697, 25256, 25359]

Visually inspecting these 57 distributions, many appear to be good candidates for approximately normal distributions, showing a strong resemblance to the classic bell shape and aligning well with the expected normal line. Examples include:

In [2]:
import ipyplot

image_paths = [
    "Histogram-for-US92890T2050---Start-2022-01-02-180000-0600-End-2022-12-29-180000-0600-Log_returns-True-CAGR-False-Data-Type-eod_close-ID4871-fromCombinedRule-muchBetterNow2StillMoreLepto-scaleCorrection.png", 
    "Histogram-for-SE0017486897---Start-2008-07-13-190000-0500-End-2008-10-09-190000-0500-Log_returns-True-CAGR-False-Data-Type-eod_close-ID1561-wOver-passStrictTest2ReallyLooksAlmostAnotherSE-scaleCorrection.png"
]

# By providing a list of empty labels, you override the default file paths
empty_labels = [''] * len(image_paths)

ipyplot.plot_images(
    image_paths,
    labels=empty_labels,  # pass the empty labels so they are actually applied
    max_images=2,
    img_width=450,
    show_url=False
)

However, it is important to note that some distributions in this list, particularly those with small sample sizes (e.g., runID 6766 with N=12, shown below), may pass these automated checks but are difficult to definitively label as belonging to a standard distribution type based on visual inspection. Our current filtering criteria require a minimum of 9 data points, which may still be insufficient for reliable distribution analysis.

![small-distribution-pass-expected-test](Histogram-for-SE0015988019---Start-2020-03-05-180000-0600-End-2020-08-04-190000-0500-Log_returns-True-CAGR-True-Data-Type-eod_avg-ID6766-wOver-Correction.png)

Notably, no distributions passed all three criteria simultaneously (combined statistical rule, visual comparison rule, and the Hurst exponent range), further emphasizing the rarity of perfect normality in real-world stock data under the conditions tested.
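
The overlap between the three criteria can be checked with plain set intersections. The run ids below are small hypothetical placeholders rather than the real lists, chosen only to mirror the situation described above, where two rules overlap but the triple intersection is empty.

```python
# Hypothetical placeholder run ids; the real lists are much longer
combined_rule_pass = {1, 2, 3, 4}
visual_rule_pass = {2, 3, 5}
hurst_range_pass = {4, 5}

# Runs passing both the combined rule and the visual comparison
both_rules = sorted(combined_rule_pass & visual_rule_pass)

# Runs passing all three criteria at once
all_three = sorted(combined_rule_pass & visual_rule_pass & hurst_range_pass)

print(both_rules, all_three)
```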

### Parameter Analysis

To understand which factors influence the normality of stock distributions, the correlations between the various parameters and the outcomes of the normality tests were analyzed.<br>
First, a large correlation matrix encompassing the majority of the parameters was generated to visualize the relationships between the parameters and the likelihood of passing the normality tests, specifically the average p-value criterion with a threshold of 0.05.

![Parameters-correlation-matrix](Correlation_Matrix_Parameters_vs_Normality_Test_Pass_Rates_005-gemini-043025.png)
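
A matrix of this kind could be produced roughly as follows. The rows are hypothetical toy values, the column `duration_days` is an illustrative name, and the use of Pearson correlation is an assumption, since the method behind the figure is not stated here.

```python
import pandas as pd

# Hypothetical miniature results table for illustration only
runs = pd.DataFrame(
    {
        "duration_days": [90, 180, 365, 1460, 2920, 5475],
        "log_returns": [True, False, True, False, True, False],
        "split": [0.0, 0.0, 0.0, 2.0, 0.0, 1.0],
        "average_p_value": [0.62, 0.41, 0.12, 0.03, 0.001, 0.0002],
    }
)

# Binary pass flag for the average p-value criterion (threshold 0.05)
runs["passes_avg_p"] = (runs["average_p_value"] > 0.05).astype(int)

# Cast booleans to numbers so every column enters the correlation matrix
corr = runs.astype(float).corr(method="pearson")
duration_vs_pass = corr.loc["duration_days", "passes_avg_p"]
print(corr.round(2))
```

In this toy table the longer runs are the ones that fail, so `duration_vs_pass` comes out negative.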

From this analysis, it can already be observed that Duration (or equivalently, the number of data points, N) emerged as the most significant independent variable influencing the average p-value.<br>

Examining the correlation matrix for *skewness and kurtosis* provides further insight into parameter influence:

![Parameters-correlation-matrix](Correlation_Matrix_Parameters_vs_Skewness__Kurtosis.png)

Here, "logReturns" shows the strongest impact on kurtosis, and "Duration" also exhibits a notable correlation, reinforcing its significance in shaping the distribution's characteristics.

#### The Dominant Effect of Time Duration

*The single most significant parameter correlated with normality was time duration.* Shorter durations consistently had a higher likelihood of passing normality tests, with the probability dropping sharply as the duration increased.<br>

The initial parameter analysis with the correlation matrices had already indicated this clearly. Duration exhibits a negative correlation of **-0.43** with the average p-value pass rate: as the duration of the time series increases, the pass rate from the normality tests tends to decrease, making the distribution less likely to be identified as normal by these tests.

Further investigation into the influence of duration involved plotting the relationship between the time series duration and the average p-value obtained from the normality tests.<br>

As depicted in the figure below, a clear trend is observed: there is a notable decline in the average p-value as the duration of the time series increases. This indicates that distributions from longer time periods are statistically less likely to satisfy the criteria for normality (where the significance threshold is 0.05). Beyond a certain duration, very few or no distributions meet this p-value threshold for normality.

![average-p-time](Average_Normality_Test_Pvalue_vs_Duration.png)

As an exception to this trend, the dataset includes a single data point representing a distribution from a long duration (approximately 15 years, `RunID 20289`) that achieved an average p-value slightly above the 0.05 threshold (0.0611). This instance, while rare, suggests that achieving a statistically 'normal' result on the average p-value test is possible even for long durations, though it is highly infrequent and may highlight potential limitations or sensitivities of these tests over extended periods.<br>

A visual inspection of the distribution corresponding to `RunID 20289` (shown below) reveals it to be less leptokurtic, with thinner tails and a more discernible bell curve shape, than many other distributions from similarly long durations. The overlaid normal distribution line shows a decent alignment with the actual data, supporting the higher p-value. However, the fit is not perfect and the distribution still appears to exhibit some degree of leptokurtosis, in line with typical observations of financial data over long periods, which are often described as leptokurtic, similar to the S&P 500 distribution shape noted in some analyses.

![Standard Normal Distribution](Histogram-for-US89531P1057-Start-2000-05-31-190000-0500-End-2015-05-28-190000-0500-Log_returns-False-CAGR-False-Data-Type-eod_close(log-transformation)-ID20289-wOver-ONLYLONGDurationPassPerAveragePFAILCOMBINEDDECISIONFailAuto-scaleCorrection.png)

The raw row values for runId 20289 are given below. Also note - Duration, Days: 5475 (15 YEARS); Average p-value: 0.0611 (fails skew and kurtosis test)<br>

- runid:                                   20289
- isin:                             US89531P1057
- ks_test:                              0.000008
- jarque_bera_test:                     0.147209
- lilliefors_test:                         0.001
- shapiro_test:                              0.0
- dagostino_test:                       0.157368
- skew:                                  -0.0395
- kurtosis:                             0.134708
- hurst_exponent:                        0.44495
- start_time:          2000-05-31 19:00:00-05:00
- end_time:            2015-05-28 19:00:00-05:00
- duration:                            473040000
- N:                                        3771
- period_volume:                    5223603756.0
- split:                                     2.0
- merger:                                  False
- cagr:                                    False
- data_type:                           eod_close
- log_returns:                             False

Continuing the analysis of duration, the line plot below shows the percentage of runs that passed the combined statistical decision rule described above, that is, the **Combined Statistical Decision Rule Pass Rate vs. Duration**:

![combined-pass-vs-duration](Combined_Statistical_Decision_Rule_Pass_Rate_vs_Duration-(2).png)

As can be seen above, there is a very rapid decline in pass rates as duration increases. By the time the duration reaches about 1,000 days (about 2.7 years), the pass rate is already consistently in the low single digits.<br>
For almost all durations beyond about 1,500 days (about 4 years), the probability of a run passing this combined statistical rule is effectively zero. The line flatlines at the bottom of the chart for the majority of the time scale.
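
A pass-rate-by-duration summary of this kind can be sketched with a binned group-by. The rows and the bin edges below are hypothetical; the only detail taken from the results table is that `duration` is stored in seconds (for example, 473040000 seconds corresponds to 5475 days).

```python
import pandas as pd

SECONDS_PER_DAY = 86_400

# Hypothetical rows; durations are in seconds, as in the results table
runs = pd.DataFrame(
    {
        "duration": [7_776_000, 7_603_200, 15_724_800, 31_536_000,
                     125_971_200, 252_288_000, 473_040_000],
        "is_normal": [True, True, True, False, False, False, False],
    }
)
runs["duration_days"] = runs["duration"] / SECONDS_PER_DAY

# Illustrative bin edges in days (an assumption, not the binning used for the plot)
bins = [0, 100, 400, 1500, 6000]
runs["duration_bin"] = pd.cut(runs["duration_days"], bins=bins)

# Share of passing runs per duration bin, in percent
pass_rate_by_bin = (
    runs.groupby("duration_bin", observed=True)["is_normal"].mean() * 100
)
print(pass_rate_by_bin)
```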

When only *skew and kurtosis* were examined, similar trends appeared, although the trend was less pronounced for skewness. By the time the duration reaches about 2,500 days (about 7 years), the pass rate for kurtosis has already fallen below 10%, and it continues to trend towards zero for longer periods. Excess kurtosis appears to increase systematically with duration.

The analysis showed a clear trend: as the duration of the time series increased, the percentage of runs passing the combined normality rule decreased. Short-duration time series (e.g., days to a few months) had a higher probability of passing the normality tests compared to longer time series (e.g., several years). This is a crucial finding and aligns with the understanding that while price changes over very short intervals might approximate normality, cumulative effects over longer periods, influenced by market dynamics, events, and trends, lead to deviations from a purely random walk model.<br>

This observation is consistent with the concept of "fat tails" in financial data, where extreme events occur more frequently than predicted by a normal distribution, and these deviations become more pronounced over longer time horizons.

#### Log Returns vs. Raw Prices
A fundamental aspect of our analysis was to compare the distributional properties of log returns and raw prices, particularly in relation to the theoretical assumption that log returns are normally distributed if prices are lognormal<sup>2,3</sup>.<br>

**Mathematical Theory Considerations**<br>
For a more detailed discussion of the mathematical relationship between raw price normality and log returns normality, see the past study "Market-Index-Data-Exploration-revisit-Jan2024"<sup>3</sup>, specifically the section "Conclusions Regarding Log Returns."

The underlying mathematical equations clearly demonstrate that if raw price distributions are truly lognormal, then their corresponding log returns should inherently follow a normal distribution. The analysis conducted in this study largely supports this theoretical relationship, which aligns with standard financial theory. However, certain discrepancies are clearly observed when applying this theory to real-life stock data. These deviations are likely attributable to the challenges of real-world data fully meeting all the underlying assumptions of theoretical models. A significant discrepancy compared to theory lies in the observed volatility (σ) of real stock data; unlike the constant volatility often assumed in theoretical models, real-life stock volatility tends to be much more varied.
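
For reference, the textbook relationship can be stated compactly. The following assumes Geometric Brownian Motion with constant drift $\mu$ and constant volatility $\sigma$; these are the standard model assumptions, not quantities estimated from this dataset.

$$
dS_t = \mu S_t\,dt + \sigma S_t\,dW_t
\quad\Longrightarrow\quad
\ln S_t = \ln S_0 + \left(\mu - \tfrac{\sigma^2}{2}\right)t + \sigma W_t
$$

$$
r_{t,\Delta} = \ln\frac{S_{t+\Delta}}{S_t} \sim \mathcal{N}\!\left(\left(\mu - \tfrac{\sigma^2}{2}\right)\Delta,\ \sigma^2\Delta\right)
$$

Under these assumptions each price $S_t$ is lognormal and each log return over an interval $\Delta$ is normal. Both conclusions rely on $\sigma$ being constant, which is the assumption that real stock data most visibly violates.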

In any case, initial observations from the correlation matrices suggested that the "logReturns" parameter had minimal linear correlation with the average p-value (only 0.01), implying little difference in normality based solely on statistical tests when viewed in isolation. However, the second correlation matrix indicated a more pronounced impact of "logReturns" on kurtosis.

Further analysis involved comparing the pass rates based on the *average p-value criterion* for runs where log_returns was 'True' versus 'False':

Log Returns = `True`
- Total Runs: 12,748
- Passing Runs: 3,677
- Pass Percentage: 28.84%
<br>

Log Returns = `False`
- Total Runs: 12,764
- Passing Runs: 3,529
- Pass Percentage: 27.65%

These results show a marginal difference of about 1.2 percentage points in the pass rate for the average p-value test, with log returns having a slight edge.

When examining the distributions of the average p-values for both cases, they appear almost identical as well. This complements the pure correlation analysis: not only are the actual pass rates quite similar, but the overall distributions of p-values are too, which is crucial for the impact analysis.

![average-p-distributions-ln](Distribution_of_Average_Pvalues.png)

However, when analyzing the other components of the combined decision rule – the shape parameters, skew and kurtosis – a more pronounced difference was observed.

Log Returns = `True`
- Total Runs: 12,640
- Passed Skewness Test (|skew| < 0.5): 7,352 (58.16%)
- Passed Kurtosis Test (|kurtosis| < 0.5): 1,141 (9.03%)
- Passed Both Tests: 1,035 (8.19%)
<br>

Log Returns = `False`
- Total Runs: 12,643
- Passed Skewness Test (|skew| < 0.5): 7,707 (60.96%)
- Passed Kurtosis Test (|kurtosis| < 0.5): 3,710 (29.34%)
- Passed Both Tests: 1,839 (14.55%)

Here, a clearer difference emerges: 14.55% of runs passed both the skewness and kurtosis criteria when log_returns = False, while only 8.19% passed when log_returns = True. This difference is primarily driven by kurtosis: more than three times as many runs passed the kurtosis test when log_returns = False as when log_returns = True. This initially suggests that log-transformed raw prices exhibit excess kurtosis values close to 0 (equivalently, kurtosis close to the value of 3 expected for a normal distribution) more frequently than log returns do.

However, when considering the *full combined decision rule*, which includes the average p-value criterion, the results change again:

Log Returns = `True`
- Total Runs Analyzed: 12,640
- Passing Runs: 1,021
- Pass Percentage: 8.08%
<br>

Log Returns = `False`
- Total Runs Analyzed: 12,643
- Passing Runs: 993
- Pass Percentage: 7.85%

These results suggest that while log-transformed raw prices may more frequently exhibit excess kurtosis close to 0 (kurtosis close to 3), the overall likelihood of passing all combined criteria for normality is very similar for both log returns and log-transformed raw prices.

<br>
Furthermore, visually analyzing a few different distributions provides a qualitative check on these findings. On visual inspection, raw price distributions appear to exhibit more instances of anomalous shapes, while distributions with a semblance of normality are more easily spotted within the log returns data. While this manual analysis is subjective and limited, it offers a complementary perspective to the quantitative findings.

To illustrate the types of anomalous shapes generated, a few distributions without a strong resemblance to a normal distribution are shown below. As the file names indicate, the first plot (runID 192) uses log-transformed raw prices (log_returns=False) and the second (runID 1545) uses log returns (log_returns=True).

In [3]:
import ipyplot

image_paths = [
    "Histogram-for-US34964C1062---Start-2015-01-20-180000-0600-End-2023-01-18-180000-0600-Log_returns-False-CAGR-True-Data-Type-eod_close(log-transformation)-ID192-wOver-weirdMaybeLegitLookingForPlaty-scaleCorrection.png", 
    "Histogram-for-SE0017486897---Start-1999-12-01-180000-0600-End-2024-04-14-190000-0500-Log_returns-True-CAGR-False-Data-Type-eod_typical-ID1545-wOver-almost25YWeird-scaleCorrection.png"
]

# By providing a list of empty labels, you override the default file paths
empty_labels = [''] * len(image_paths)

ipyplot.plot_images(
    image_paths,
    labels=empty_labels,  # pass the empty labels so they are actually applied
    max_images=2,
    img_width=450,
    show_url=False
)

Next, a few examples of distributions based on raw prices (log_returns=False) that do not exhibit a normal shape are presented. <br> 
These correspond to runIds 232 and 6832.

In [5]:
import ipyplot

image_paths = [
    "Histogram-for-SE0012853455---Start-2020-06-25-190000-0500-End-2021-06-23-190000-0500-Log_returns-False-CAGR-False-Data-Type-eod_close(log-transformation)-ID232-wOver-nowNoPlatykurticAtAll-scaleCorrection.png", 
    "Histogram-for-CNE100004QG0---Start-2021-08-19-190000-0500-End-2024-04-14-190000-0500-Log_returns-False-CAGR-False-Data-Type-eod_high_low(log-transformation)-ID6832-wOver-scaleCorrection.png"
]

# By providing a list of empty labels, you override the default file paths
empty_labels = [''] * len(image_paths)

ipyplot.plot_images(
    image_paths,
    labels=empty_labels,  # pass the empty labels so they are actually applied
    max_images=2,
    img_width=450,
    show_url=False
)

Finally, as an example, a distribution that passed both the combined decision rule and the expected value comparison algorithm is shown below. This distribution used raw prices (log_returns=False) and corresponds to `runId 14223`. Upon visual inspection, the figure appears relatively close to a normal distribution, although a slight leptokurtic peak and some gaps in the distribution are still observable.

![ln-ret-false-pass-combined](Histogram-for-HK0000544194---Start-2022-03-22-190000-0500-End-2022-09-20-190000-0500-Log_returns-False-CAGR-True-Data-Type-eod_typical(log-transformation)-ID14223-wOver-nowDifferentnow2withLogTransVeryCloseButGap-scaleCorrection.png)

This qualitative analysis, while subjective, supports the quantitative findings and further illustrates the characteristics of stock distributions in this dataset, including the prevalence of leptokurtosis and the variability in shapes observed.

**Analysis of "Paired" Runs**:<br>
To isolate the effect of using log returns, a final paired analysis was conducted. For this analysis, the approach was to compare all the runs with all matching parameters, except for log returns. To be more specific, two sets were compared based on identical `isin`, `cagr`, `data_type`, `split`, and `merger` values. The final two sets were created by filtering these pairs to include only those where the `start_time` and `end_time` are within +/- 2 days of each other.<br>

A total of 638 valid pairs were found under this corrected date-matching logic. The date condition disqualified many pairs because dates were chosen at random, based on the duration of interest, as shown in the normality-scanner-tester.py script.<br>
From these valid pairs, we can see the following:<br>
1. Average p-value test (> 0.05): **95.61%** of the pairs (610 pairs) had the same pass/fail status.
2. Skew/kurtosis test (|skew| < 0.5 and |kurtosis| < 0.5, using corrected kurtosis): **87.62%** of the pairs (559 pairs) had matching outcomes.
3. Both tests (average p-value > 0.05 and skew/kurtosis within limits): **97.02%** of the pairs (619 pairs) had matching outcomes.
    - Of the 19 mismatching pairs, 52.63% (10 pairs) were cases where the log_returns=True run passed (met all criteria) and the log_returns=False run failed.
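
The pairing logic described above could be sketched with a self-merge on the matching keys followed by a date-tolerance filter. The rows and the placeholder ISINs `AAA` and `BBB` are hypothetical, and this is not the code from normality-scanner-tester.py, only an illustration of the matching conditions.

```python
import pandas as pd

KEYS = ["isin", "cagr", "data_type", "split", "merger"]
TOLERANCE = pd.Timedelta(days=2)

# Hypothetical runs: AAA forms a valid pair, BBB differs too much in its dates
runs = pd.DataFrame(
    {
        "runid": [1, 2, 3, 4],
        "isin": ["AAA", "AAA", "BBB", "BBB"],
        "cagr": [True, True, False, False],
        "data_type": ["eod_close"] * 4,
        "split": [0.0] * 4,
        "merger": [False] * 4,
        "log_returns": [True, False, True, False],
        "start_time": pd.to_datetime(
            ["2020-01-01", "2020-01-02", "2019-03-01", "2019-06-01"]
        ),
        "end_time": pd.to_datetime(
            ["2020-06-01", "2020-06-02", "2019-09-01", "2019-12-01"]
        ),
        "is_normal": [True, True, False, True],
    }
)

# Split by log_returns, then pair runs whose other parameters are identical
with_lr = runs[runs["log_returns"]]
without_lr = runs[~runs["log_returns"]]
pairs = with_lr.merge(without_lr, on=KEYS, suffixes=("_lr", "_raw"))

# Keep only pairs whose start and end dates are within +/- 2 days of each other
close_dates = (
    ((pairs["start_time_lr"] - pairs["start_time_raw"]).abs() <= TOLERANCE)
    & ((pairs["end_time_lr"] - pairs["end_time_raw"]).abs() <= TOLERANCE)
)
valid_pairs = pairs[close_dates]

# Percentage of valid pairs with the same pass/fail outcome
match_rate = (valid_pairs["is_normal_lr"] == valid_pairs["is_normal_raw"]).mean() * 100
print(len(valid_pairs), match_rate)
```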

These paired run analyses indicate a very high degree of agreement between the results for log returns and log-transformed raw prices when other parameters are held constant. The high percentage of matching outcomes for both the average p-value test and the combined decision rule (which includes skew and kurtosis) strongly supports the theoretical relationship between the distributions of raw prices and log returns. While there was a noticeable difference in individual kurtosis pass rates, the overall normality assessment based on combined criteria shows strong alignment between the two approaches.<br>

Taking all this analysis into account, it appears that log returns distributions may have a very slightly increased chance of normality according to the combined decision rule. However, the differences are very slight and not statistically significant in the context of the overall low prevalence of normality observed. The lopsided kurtosis pass rates when log_returns = False are interesting; it is possible that this difference was influenced by irregularities or characteristics specific to the dataset. This reinforces the importance of using a wide variety of criteria for a comprehensive normality assessment.<br>
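
As a rough check on how small the gap in combined-rule pass rates is, a pooled two-proportion z-test can be applied to the counts reported above (1,021 of 12,640 versus 993 of 12,643). This is a sketch rather than part of the original analysis, and it treats runs as independent, which is an assumption that overlapping windows on the same stock do not strictly satisfy.

```python
import math


def two_proportion_z_test(pass_a, n_a, pass_b, n_b):
    """Pooled two-sided z-test for the difference between two pass rates."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided tail probability of a standard normal variable
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Counts for log_returns = True vs. log_returns = False (full combined rule)
z_stat, p_val = two_proportion_z_test(1021, 12640, 993, 12643)
print(f"z = {z_stat:.3f}, p = {p_val:.3f}")
```

The resulting z statistic is about 0.66, with a p-value of roughly 0.5, far from any conventional significance level.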

Ultimately, these results largely support the theory that raw prices and their log returns, under the same conditions, exhibit a strong correlation with each other in terms of their closeness to normal or lognormal distributions.

#### Influence of Other Parameters

**Splits and Mergers** <br>
While not the most significant parameters in terms of correlation with normality test results, stock splits and mergers are worth noting. They showed negative correlations of -0.19 and -0.08 with average p-values, respectively. Although the number of splits and mergers in the dataset is not large and the correlations are weak, they are still noticeable. It also intuitively follows that disruptive corporate actions like splits and mergers could introduce deviations from standard theoretical distributions.

**Differences in specific stocks** <br>
Analyzing the impact of the specific stock (ISIN) on normality is less straightforward within a correlation matrix format. However, the performance across all tests for individual stocks can be analyzed.

When analyzing pass rates based solely on the average p-value criterion, some stocks, particularly newer ones like US55445L1008, exhibited notably higher normality test scores. US55445L1008 earned a pass rate of 87.50%. CNE100005980 was also relatively high at a pass rate of 75.62%.<br>

However, when applying the combined decision rule, the difference in pass rates between individual stocks is less pronounced. Nevertheless, the difference is still significant in comparison to the overall low pass rate. Crucially, as already mentioned, this is largely attributed to the shorter duration of data available for newer stocks. Therefore, the difference in normality with older stocks with longer data histories is not as significant when considering the influence of duration.

The table below presents the pass rates for the top-performing stocks based on the combined statistical decision rule:

| ISIN | Pass Rate (All Criteria) | Total Runs |
|:---|:---|:---|
| US6742151243 | 27.50% | 160 |
| CNE1000055G1 | 21.38% | 159 |
| US7995661045 | 20.00% | 160 |
| ARBCOM4603D8 | 20.00% | 160 |
| US55445L1008 | 18.75% | 64 |
| ARBCOM4601M3 | 17.86% | 84 |
| ARBCOM460390 | 16.98% | 159 |
| HK0000069689 | 16.41% | 256 |
| SE0016276851 | 16.25% | 160 |
| CNE100005980 | 16.25% | 160 |
| HK0000735800 | 13.75% | 160 |
| SE0015960935 | 13.12% | 160 |
| NL0013056914 | 12.98% | 208 |
| SE0004977692 | 12.60% | 254 |
| HK0941009539 | 12.50% | 304 |


Again, it's important to note that the ranking changes depending on the specific definition of "most normal." The ranking above, which incorporates both p-value significance and specific shape characteristics (skew/kurtosis), has much lower pass rates overall, highlighting how few runs met all these conditions simultaneously.

**Specific Dates**:<br>
The impact of the amount of time (duration/N) in a given test run is explored in detail above. However, the effect of specific dates and the period in time can also be analyzed. These findings focus on the influence of specific time periods (such as year, month, or market conditions) on normality. Overall, these time period parameters do not appear to be as significant as duration, with no clear trends emerging from the data.<br>

Based on the data, the following specific periods produced the most lognormal/normal stock distributions, as indicated by the highest average p-values from normality tests:

Top Years (by Average p-value):
1. 2024: 0.234
2. 2023: 0.186
3. 2022: 0.160
4. 2021: 0.146
5. 2018: 0.103

Top Months (by Average p-value):
1. November 2023: 0.299
2. December 2023: 0.250
3. January 2024: 0.234
4. May 2023: 0.225
5. February 2023: 0.207

These results suggest that recent periods (2023-2024) and specific months within those years may have exhibited slightly higher average p-values, potentially indicating conditions that were marginally more conducive to observed normality based on the statistical tests.

#### The Optimal Conditions for Normal Distributions
Based on the extensive analysis conducted, certain conditions appear to be more conducive to observing distributions that approximate normality in this dataset. While perfect normality is rarely achieved, specific parameter settings increase the likelihood of passing the defined normality criteria.<br>

*A short duration and the absence of splits or mergers represent some of the most impactful conditions for increasing the likelihood of observing distributions that meet the criteria for normality. Among these, a short duration and a smaller number of data points (N) emerge as the most critical condition for obtaining distributions that are most likely to be classified as normal or lognormal.*

Below are presented some examples of distributions that passed various normality metrics (combined decision rule, comparison to expected values) and, based on subjective visual inspection, appear to most closely resemble standard normal or lognormal distributions. These correspond to `runIds`: 6850, 69, 9852, 11578.

In [7]:
import ipyplot

image_paths = [
    "Histogram-for-CNE100004QG0---Start-2023-07-09-190000-0500-End-2023-09-27-190000-0500-Log_returns-False-CAGR-True-Data-Type-eod_typical(log-transformation)-ID6850-wOver-Correction.png", 
    "Histogram-for-HK2778034606---Start-2023-02-06-180000-0600-End-2023-05-07-190000-0500-Log_returns-True-CAGR-True-Data-Type-eod_avg-ID69-wOver-scaleCorrection-prelim-onlyPASSwithStrict.png", 
    "Histogram-for-NL0013056914---Start-2018-10-23-190000-0500-End-2019-01-21-180000-0600-Log_returns-True-CAGR-True-Data-Type-eod_typical-ID9852-wOver-passBOTHtestsVERYCloseShortish-scaleCorrection.png",
    "Histogram-for-SE0000115446---Start-2009-08-25-190000-0500-End-2010-02-22-180000-0600-Log_returns-True-CAGR-True-Data-Type-eod_typical-ID11578-wOver-scaleCorrection.png"
]

# By providing a list of empty labels, you override the default file paths
empty_labels = [''] * len(image_paths)

ipyplot.plot_images(
    image_paths,
    labels=empty_labels,  # pass the empty labels so they are actually applied
    max_images=4,
    img_width=450,
    show_url=False
)

Utilizing a variety of normality criteria indeed appears to facilitate the visual identification of distributions with shapes approximating normality. This further supports the notion that even though individual methods may have limitations, their combined application contributes to a more robust assessment.

### The Pervasive Nature of Leptokurtosis (Fat Tails)

Given that only a small minority of stock distributions conform to a normal model, the analysis then shifted to identifying the actual characteristic shape of the data.<br>
*Following this, the data shows that a vast majority of stock distributions are leptokurtic*. Leptokurtosis is characterized by a distribution having a sharper peak and heavier tails than a normal distribution. The heavier tails indicate a greater probability of extreme values or outliers occurring compared to what a normal distribution would predict. This characteristic is often referred to as "fat tails" in finance.<sup>4</sup><br>

Visually, leptokurtic distributions appear more "peaked" around the mean and have more density in the tails compared to the bell-shaped curve of a normal distribution. Many of the distributions observed in this study, even those that might have otherwise appeared somewhat symmetrical, exhibited this characteristic leptokurtic shape.
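
To make the measure concrete, the sketch below computes excess kurtosis for two synthetic samples and labels values by sign, using 0.5 as the cutoff for "significantly" leptokurtic as elsewhere in this notebook. The helper `classify_tail_shape` is hypothetical, and the Student-t sample merely stands in for a heavy-tailed series; it is not stock data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
normal_sample = rng.normal(size=100_000)
heavy_tailed_sample = rng.standard_t(df=5, size=100_000)

# fisher=True (the default) returns excess kurtosis, i.e. kurtosis minus 3,
# so a normal distribution scores approximately 0
normal_excess = stats.kurtosis(normal_sample, fisher=True)
heavy_excess = stats.kurtosis(heavy_tailed_sample, fisher=True)


def classify_tail_shape(excess_kurtosis, threshold=0.5):
    """Hypothetical helper: label a distribution by its excess kurtosis."""
    if excess_kurtosis > threshold:
        return "significantly leptokurtic"
    if excess_kurtosis > 0:
        return "leptokurtic"
    if excess_kurtosis < 0:
        return "platykurtic"
    return "mesokurtic"


print(normal_excess, classify_tail_shape(normal_excess))
print(heavy_excess, classify_tail_shape(heavy_excess))
```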

#### Quantitative Evidence of Leptokurtosis

The prevalence of this characteristic was quantified by analyzing the excess kurtosis for all 25,530 samples. An excess kurtosis value greater than 0 indicates a leptokurtic distribution. The analysis revealed:<br>

- 15,642 samples (**61.3%**) exhibited some degree of leptokurtosis (excess kurtosis > 0).
- 13,535 samples (53.0%) were significantly leptokurtic, with an excess kurtosis greater than 0.5.

In contrast, only *9,641* (37.7%) samples showed negative excess kurtosis (*platykurtosis*), confirming that leptokurtosis is the dominant feature of the dataset. Platykurtic distributions have "thinner" tails and a flatter peak than a normal distribution, meaning that extreme events are less likely than a normal model would predict.
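The breakdown above can be reproduced with a few lines of pandas. The sketch below is illustrative rather than the notebook's own code: the helper name `kurtosis_breakdown` is hypothetical, and it assumes the input holds excess (Fisher) kurtosis values, as in the `kurtosis` column of the results table. Shares are computed relative to the non-missing values passed in.

```python
import pandas as pd


def kurtosis_breakdown(excess_kurtosis, threshold=0.5):
    """Count samples in each kurtosis class and report their share in percent."""
    k = pd.Series(excess_kurtosis, dtype=float).dropna()
    classes = {
        "leptokurtic (> 0)": k > 0,
        f"significantly leptokurtic (> {threshold})": k > threshold,
        "platykurtic (< 0)": k < 0,
        f"significantly platykurtic (< -{threshold})": k < -threshold,
    }
    rows = [
        {"class": name, "count": int(mask.sum()), "share_pct": round(100 * mask.mean(), 1)}
        for name, mask in classes.items()
    ]
    return pd.DataFrame(rows).set_index("class")


# Toy values only; in the notebook the input would be tests_results_df['kurtosis'].
example = kurtosis_breakdown([4.40, 2.49, -0.38, 0.30, -1.14])
print(example)
```

Keeping the thresholds as a parameter makes it easy to check how sensitive the reported shares are to the (somewhat arbitrary) 0.5 cut-off.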

The output below gives a sample of the runs which exhibited some degree of leptokurtosis (excess kurtosis > 0).

In [12]:
# Filter for all leptokurtic samples
leptokurtic_samples = tests_results_df[tests_results_df['kurtosis'] > 0]

# "Significantly" leptokurtic samples: excess kurtosis above the 0.5 threshold
significantly_leptokurtic = tests_results_df[tests_results_df['kurtosis'] > 0.5]

leptokurtic_samples

Unnamed: 0,runid,isin,ks_test,jarque_bera_test,lilliefors_test,shapiro_test,dagostino_test,skew,kurtosis,hurst_exponent,...,duration,n,period_volume,split,merger,cagr,data_type,log_returns,average_p_value,is_normal
0,1,HK2778034606,4.822982e-01,1.057955e-28,0.115889,1.423551e-05,9.858966e-10,-1.204408,4.397224,0.054735,...,15638400,123,1.780971e+08,0.00000,False,False,eod_avg,True,0.119640,False
2,3,HK2778034606,1.500044e-01,1.236803e-15,0.005518,2.058174e-05,5.806263e-06,-0.366522,2.487527,0.016331,...,31104000,245,4.273732e+08,0.00000,False,False,eod_avg,True,0.031110,False
4,5,HK2778034606,8.342432e-05,1.480880e-87,0.001000,6.644434e-16,1.429512e-21,-0.359527,3.029298,-0.010965,...,125884800,990,2.224356e+09,0.00000,False,False,eod_avg,True,0.000217,False
6,7,HK2778034606,9.697693e-40,0.000000e+00,0.001000,9.030016e-50,2.083774e-184,-0.404693,12.938399,-0.000511,...,472953600,3704,1.970117e+10,1.03741,False,False,eod_avg,True,0.000200,False
7,8,HK2778034606,4.708469e-08,1.338261e-51,0.001000,3.565071e-24,3.543891e-38,-0.452180,0.836997,0.544815,...,472953600,3703,1.831768e+10,1.03741,False,False,eod_avg,False,0.000200,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25523,25546,CNE100005980,9.262188e-01,7.823915e-01,0.737013,5.419071e-01,6.796489e-01,-0.217423,0.029935,-0.007894,...,7516800,62,1.661139e+09,0.00000,False,True,eod_close,True,0.733436,True
25525,25548,CNE100005980,8.533742e-01,3.210386e-02,0.554073,9.575221e-02,4.682158e-02,0.317487,0.974562,0.058080,...,15379200,122,5.909578e+09,0.00000,False,True,eod_high_low,True,0.316425,False
25527,25550,CNE100005980,5.688428e-01,2.256117e-02,0.195912,3.280892e-02,2.760886e-02,0.310649,0.605001,0.020721,...,31190400,242,1.039070e+10,0.00000,False,True,eod_high_low,True,0.169547,False
25528,25551,CNE100005980,3.563323e-01,1.141846e-01,0.057392,2.368541e-02,6.939642e-02,0.603129,0.557060,-0.039352,...,7603200,59,1.984560e+09,0.00000,False,True,eod_high_low,False,0.124198,False


Then the output below gives a sample of the runs which can be considered significantly leptokurtic, with an excess kurtosis greater than 0.5.

In [13]:
significantly_leptokurtic

Unnamed: 0,runid,isin,ks_test,jarque_bera_test,lilliefors_test,shapiro_test,dagostino_test,skew,kurtosis,hurst_exponent,...,duration,n,period_volume,split,merger,cagr,data_type,log_returns,average_p_value,is_normal
0,1,HK2778034606,4.822982e-01,1.057955e-28,0.115889,1.423551e-05,9.858966e-10,-1.204408,4.397224,0.054735,...,15638400,123,1.780971e+08,0.00000,False,False,eod_avg,True,0.119640,False
2,3,HK2778034606,1.500044e-01,1.236803e-15,0.005518,2.058174e-05,5.806263e-06,-0.366522,2.487527,0.016331,...,31104000,245,4.273732e+08,0.00000,False,False,eod_avg,True,0.031110,False
4,5,HK2778034606,8.342432e-05,1.480880e-87,0.001000,6.644434e-16,1.429512e-21,-0.359527,3.029298,-0.010965,...,125884800,990,2.224356e+09,0.00000,False,False,eod_avg,True,0.000217,False
6,7,HK2778034606,9.697693e-40,0.000000e+00,0.001000,9.030016e-50,2.083774e-184,-0.404693,12.938399,-0.000511,...,472953600,3704,1.970117e+10,1.03741,False,False,eod_avg,True,0.000200,False
7,8,HK2778034606,4.708469e-08,1.338261e-51,0.001000,3.565071e-24,3.543891e-38,-0.452180,0.836997,0.544815,...,472953600,3703,1.831768e+10,1.03741,False,False,eod_avg,False,0.000200,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25519,25542,CNE100005980,7.595515e-01,1.006649e-04,0.415448,2.285338e-02,1.287956e-02,-0.069466,1.897849,-0.021899,...,15379200,122,6.097239e+09,0.00000,False,True,eod_close,True,0.242167,False
25521,25544,CNE100005980,3.104083e-01,1.029169e-08,0.037865,9.110859e-05,5.144891e-04,-0.187653,1.868721,-0.022812,...,31104000,243,1.360325e+10,0.00000,False,True,eod_close,True,0.069776,False
25525,25548,CNE100005980,8.533742e-01,3.210386e-02,0.554073,9.575221e-02,4.682158e-02,0.317487,0.974562,0.058080,...,15379200,122,5.909578e+09,0.00000,False,True,eod_high_low,True,0.316425,False
25527,25550,CNE100005980,5.688428e-01,2.256117e-02,0.195912,3.280892e-02,2.760886e-02,0.310649,0.605001,0.020721,...,31190400,242,1.039070e+10,0.00000,False,True,eod_high_low,True,0.169547,False


Platykurtic runs were filtered with the same logic, this time selecting runs whose excess kurtosis is less than 0.

In [14]:
# Filter for all platykurtic samples
platykurtic_samples = tests_results_df[tests_results_df['kurtosis'] < 0]

# "Significantly" platykurtic samples: excess kurtosis below the -0.5 threshold
significantly_platykurtic = tests_results_df[tests_results_df['kurtosis'] < -0.5]

platykurtic_samples

Unnamed: 0,runid,isin,ks_test,jarque_bera_test,lilliefors_test,shapiro_test,dagostino_test,skew,kurtosis,hurst_exponent,...,duration,n,period_volume,split,merger,cagr,data_type,log_returns,average_p_value,is_normal
1,2,HK2778034606,0.028623,1.007815e-01,0.001000,4.487778e-04,9.952278e-02,-0.438444,-0.376034,0.084991,...,15465600,121,3.834700e+08,0.00000,False,False,eod_avg,False,0.046075,False
3,4,HK2778034606,0.013533,1.046401e-03,0.001000,3.347931e-08,1.463774e-19,0.112616,-1.137427,0.611886,...,31449600,245,4.526237e+08,0.00000,False,False,eod_avg,False,0.003116,False
5,6,HK2778034606,0.000059,7.402845e-08,0.001000,2.956093e-14,2.269385e-13,0.291613,-0.675809,0.582338,...,125971200,989,6.251064e+09,1.03741,False,False,eod_avg,False,0.000212,False
13,14,HK2778034606,0.000615,2.429463e-03,0.001000,2.195219e-07,4.707226e-03,-1.095860,-0.111482,-0.119628,...,7862400,60,1.109813e+08,0.00000,False,False,eod_avg,False,0.001750,False
15,16,HK2778034606,0.471040,1.694054e-01,0.040047,6.477186e-03,1.579193e-01,-0.410331,-0.053668,0.234005,...,15724800,126,1.168846e+09,0.00000,False,False,eod_typical,False,0.168978,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25518,25541,CNE100005980,0.720618,1.089444e-02,0.001000,1.299130e-04,9.093174e-03,0.600159,-0.542887,0.088991,...,15724800,125,5.951234e+09,0.00000,False,True,eod_close,False,0.148347,False
25520,25543,CNE100005980,0.586643,2.363564e-03,0.027189,2.568614e-05,2.393897e-03,0.503683,-0.412367,0.244956,...,31536000,245,1.074896e+10,0.00000,False,True,eod_close,False,0.123723,False
25522,25545,CNE100005980,0.276935,8.403633e-02,0.008526,8.088786e-04,1.098194e-03,0.397201,-1.107387,-0.200160,...,7862400,64,1.769000e+09,0.00000,False,True,eod_close,False,0.074281,False
25524,25547,CNE100005980,0.769390,1.473152e-01,0.468162,1.383283e-02,1.501253e-01,0.417068,-0.265124,0.074990,...,15206400,120,1.030926e+10,0.00000,False,True,eod_high_low,False,0.309765,True


Finally, the output below gives a sample of the runs which can be considered significantly platykurtic, with an excess kurtosis less than -0.5.

In [15]:
significantly_platykurtic

Unnamed: 0,runid,isin,ks_test,jarque_bera_test,lilliefors_test,shapiro_test,dagostino_test,skew,kurtosis,hurst_exponent,...,duration,n,period_volume,split,merger,cagr,data_type,log_returns,average_p_value,is_normal
3,4,HK2778034606,0.013533,1.046401e-03,0.001000,3.347931e-08,1.463774e-19,0.112616,-1.137427,0.611886,...,31449600,245,4.526237e+08,0.00000,False,False,eod_avg,False,0.003116,False
5,6,HK2778034606,0.000059,7.402845e-08,0.001000,2.956093e-14,2.269385e-13,0.291613,-0.675809,0.582338,...,125971200,989,6.251064e+09,1.03741,False,False,eod_avg,False,0.000212,False
17,18,HK2778034606,0.703173,6.716223e-02,0.381102,2.422669e-02,2.100655e-03,0.024353,-0.721341,0.498597,...,31536000,248,1.050633e+09,0.00000,False,False,eod_typical,False,0.235553,False
19,20,HK2778034606,0.001063,4.497043e-06,0.001000,5.307837e-14,1.720138e-09,-0.244231,-0.601154,0.542938,...,125798400,985,4.514344e+09,0.00000,False,False,eod_typical,False,0.000413,False
29,30,HK2778034606,0.047337,8.467223e-03,0.001000,8.478568e-06,2.452728e-18,0.197826,-1.305959,0.060748,...,15724800,123,6.226666e+08,0.00000,False,False,eod_close,False,0.011363,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25514,25537,CNE100005980,0.071120,2.574128e-02,0.001000,1.455646e-04,6.307473e-04,-0.212651,-0.734202,0.236463,...,31536000,244,9.698744e+09,0.00000,False,True,eod_typical,False,0.019728,False
25516,25539,CNE100005980,0.859559,3.675000e-01,0.493873,3.723035e-01,1.213763e-01,0.084498,-0.878787,-0.147590,...,7862400,60,1.882257e+09,0.00000,False,True,eod_typical,False,0.442922,False
25518,25541,CNE100005980,0.720618,1.089444e-02,0.001000,1.299130e-04,9.093174e-03,0.600159,-0.542887,0.088991,...,15724800,125,5.951234e+09,0.00000,False,True,eod_close,False,0.148347,False
25522,25545,CNE100005980,0.276935,8.403633e-02,0.008526,8.088786e-04,1.098194e-03,0.397201,-1.107387,-0.200160,...,7862400,64,1.769000e+09,0.00000,False,True,eod_close,False,0.074281,False


The figure below provides two examples of distributions that were flagged as platykurtic, both for the same Hong Kong-based stock but over different time periods.<br>
The `runId`s are 20 and 30.

In [1]:
import ipyplot

image_paths = [
    "Histogram-for-HK2778034606---Start-2016-12-27-180000-0600-End-2020-12-22-180000-0600-Log_returns-False-CAGR-False-Data-Type-eod_typical(log-transformation)-ID20-wOver-platykurticCase1-scaleCorrection.png", 
    "Histogram-for-HK2778034606---Start-2011-10-05-190000-0500-End-2012-04-04-190000-0500-Log_returns-False-CAGR-False-Data-Type-eod_close(log-transformation)-ID30-wOver-scaleCorrection-platykurticCase2.png"
]

# By providing a list of empty labels, you override the default file paths
empty_labels = [''] * len(image_paths)

ipyplot.plot_images(
    image_paths,
    max_images=2,
    img_width=450,
    show_url=False
)

It is important to acknowledge the *limitations* of relying on any single statistical metric. As was observed during manual verification, no automated test is perfect, and some distributions flagged by the kurtosis filter do not match their classification upon visual inspection. However, combining the quantitative kurtosis metric with qualitative visual analysis (presented in the next subsection) supports the overarching conclusion: most stock and log-return distributions in this study are leptokurtic rather than platykurtic or normal.

#### Visual Evidence

The statistical findings are strongly supported by visual inspection. The figures below present a sample of typical distributions from the dataset. Unlike the ideal bell curve, these histograms clearly display the signature leptokurtic shape: a tall, narrow peak around the mean and pronounced tails on either side.

The plots below are for the following `runId`s: 6807, 7, 463, 7929, 133, 8721

In [1]:
import ipyplot

image_paths = [
    "leptokurtic/Histogram-for-CNE100004QG0---Start-2021-08-22-190000-0500-End-2024-04-14-190000-0500-Log_returns-True-CAGR-False-Data-Type-eod_avg-ID6807-wOver-anotherTopVolShortDurCLOSEIshDist-anotherTopVolShortDurCLOSEishDist-scaleCorrection.png", 
    "leptokurtic/Histogram-for-HK2778034606---Start-2008-02-25-180000-0600-End-2023-02-20-180000-0600-Log_returns-True-CAGR-False-Data-Type-eod_avg-ID7-wOver-scaleCorrection-prelim-leptoButNotablyCloser.png", 
    "leptokurtic/Histogram-for-SE0000115420---Start-2000-01-03-180000-0600-End-2024-04-14-190000-0500-Log_returns-True-CAGR-True-Data-Type-eod_close-ID463-wOver-scaleCorrection.png",
    "Histogram-for-HK0883013259---Start-2004-03-17-180000-0600-End-2024-04-14-190000-0500-Log_returns-True-CAGR-False-Data-Type-eod_typical-ID7929-wOver-scaleCorrection.png",
    "leptokurtic/Histogram-for-US34964C1062---Start-2011-09-18-190000-0500-End-2024-04-11-190000-0500-Log_returns-True-CAGR-False-Data-Type-eod_typical-ID133-wOver-scaleCorrection-prelim-13YLeptoButCloser.png",
    "leptokurtic/Histogram-for-US43114Q1058---Start-2022-02-14-180000-0600-End-2023-02-13-180000-0600-Log_returns-True-CAGR-True-Data-Type-eod_close-ID8721-wOver-scaleCorrection.png"
]

# By providing a list of empty labels, you override the default file paths
empty_labels = [''] * len(image_paths)

ipyplot.plot_images(
    image_paths,
    max_images=6,
    img_width=450,
    show_url=False
)

#### Relationship with Time Duration

Further analysis reveals a strong relationship between leptokurtosis and the duration of the time series. Of the samples identified as significantly leptokurtic, approximately 53% had a duration greater than one year. This suggests that as the time horizon increases, and more market events and volatility regimes are included, the distribution of returns tends to deviate further from normality and exhibit more pronounced "fat tail" characteristics.

The output below shows sample runs which are considered leptokurtic and also have a duration greater than 31536000 seconds, or about a year. This resulted in 7904 rows.

In [16]:
leptokurtic_samples[leptokurtic_samples['duration'] > 31536000]

Unnamed: 0,runid,isin,ks_test,jarque_bera_test,lilliefors_test,shapiro_test,dagostino_test,skew,kurtosis,hurst_exponent,...,duration,n,period_volume,split,merger,cagr,data_type,log_returns,average_p_value,is_normal
4,5,HK2778034606,8.342432e-05,1.480880e-87,0.001,6.644434e-16,1.429512e-21,-0.359527,3.029298,-0.010965,...,125884800,990,2.224356e+09,0.00000,False,False,eod_avg,True,0.000217,False
6,7,HK2778034606,9.697693e-40,0.000000e+00,0.001,9.030016e-50,2.083774e-184,-0.404693,12.938399,-0.000511,...,472953600,3704,1.970117e+10,1.03741,False,False,eod_avg,True,0.000200,False
7,8,HK2778034606,4.708469e-08,1.338261e-51,0.001,3.565071e-24,3.543891e-38,-0.452180,0.836997,0.544815,...,472953600,3703,1.831768e+10,1.03741,False,False,eod_avg,False,0.000200,False
8,9,HK2778034606,2.429426e-13,0.000000e+00,0.001,1.447219e-31,2.643720e-70,0.224367,7.760693,0.001626,...,251856000,1968,9.783957e+09,0.00000,False,False,eod_avg,True,0.000200,False
9,10,HK2778034606,2.644193e-04,1.432433e-25,0.001,1.134355e-19,2.148015e-22,0.588246,0.073949,0.535148,...,252288000,1976,1.137699e+10,1.03741,False,False,eod_avg,False,0.000253,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25447,25470,SE0008347652,1.955333e-06,0.000000e+00,0.001,3.102112e-21,2.397813e-31,-0.006008,6.374842,-0.003660,...,126057600,1005,3.608504e+07,0.00000,False,False,eod_high_low,True,0.000200,False
25455,25478,SE0008347652,3.271507e-07,0.000000e+00,0.001,6.664362e-25,9.191063e-42,-0.029543,10.228066,0.006471,...,125884800,1001,3.699632e+07,0.00000,False,True,eod_avg,True,0.000200,False
25463,25486,SE0008347652,5.081008e-04,1.706680e-96,0.001,8.135483e-15,1.727564e-18,0.011131,3.245227,-0.000697,...,126057600,1005,3.394490e+07,0.00000,False,True,eod_typical,True,0.000302,False
25471,25494,SE0008347652,1.460981e-06,1.061688e-99,0.001,1.262347e-16,9.307747e-20,-0.175783,3.285358,0.000943,...,126057600,1002,3.493358e+07,0.00000,False,True,eod_close,True,0.000200,False


Then, the output below shows sample runs which are considered significantly leptokurtic and also have a duration greater than 31536000 seconds, or about a year. This resulted in 7143 rows.

In [17]:
significantly_leptokurtic[significantly_leptokurtic['duration'] > 31536000]

Unnamed: 0,runid,isin,ks_test,jarque_bera_test,lilliefors_test,shapiro_test,dagostino_test,skew,kurtosis,hurst_exponent,...,duration,n,period_volume,split,merger,cagr,data_type,log_returns,average_p_value,is_normal
4,5,HK2778034606,8.342432e-05,1.480880e-87,0.001,6.644434e-16,1.429512e-21,-0.359527,3.029298,-0.010965,...,125884800,990,2.224356e+09,0.00000,False,False,eod_avg,True,0.000217,False
6,7,HK2778034606,9.697693e-40,0.000000e+00,0.001,9.030016e-50,2.083774e-184,-0.404693,12.938399,-0.000511,...,472953600,3704,1.970117e+10,1.03741,False,False,eod_avg,True,0.000200,False
7,8,HK2778034606,4.708469e-08,1.338261e-51,0.001,3.565071e-24,3.543891e-38,-0.452180,0.836997,0.544815,...,472953600,3703,1.831768e+10,1.03741,False,False,eod_avg,False,0.000200,False
8,9,HK2778034606,2.429426e-13,0.000000e+00,0.001,1.447219e-31,2.643720e-70,0.224367,7.760693,0.001626,...,251856000,1968,9.783957e+09,0.00000,False,False,eod_avg,True,0.000200,False
10,11,HK2778034606,1.866339e-38,0.000000e+00,0.001,3.175108e-49,1.218832e-190,-0.425149,11.144351,-0.002272,...,519264000,4062,2.057901e+10,1.03741,False,False,eod_avg,True,0.000200,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25447,25470,SE0008347652,1.955333e-06,0.000000e+00,0.001,3.102112e-21,2.397813e-31,-0.006008,6.374842,-0.003660,...,126057600,1005,3.608504e+07,0.00000,False,False,eod_high_low,True,0.000200,False
25455,25478,SE0008347652,3.271507e-07,0.000000e+00,0.001,6.664362e-25,9.191063e-42,-0.029543,10.228066,0.006471,...,125884800,1001,3.699632e+07,0.00000,False,True,eod_avg,True,0.000200,False
25463,25486,SE0008347652,5.081008e-04,1.706680e-96,0.001,8.135483e-15,1.727564e-18,0.011131,3.245227,-0.000697,...,126057600,1005,3.394490e+07,0.00000,False,True,eod_typical,True,0.000302,False
25471,25494,SE0008347652,1.460981e-06,1.061688e-99,0.001,1.262347e-16,9.307747e-20,-0.175783,3.285358,0.000943,...,126057600,1002,3.493358e+07,0.00000,False,True,eod_close,True,0.000200,False


This duration analysis agrees with the overall trend uncovered in the earlier analysis of duration and its relationship to kurtosis: excess kurtosis increases with time.
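One way to put the 53% figure in context is to compare it with the share of long-duration runs in the whole dataset, and to compare the fat-tail rate of long runs with that of short runs. The sketch below is a hypothetical helper (`long_duration_share` is not part of the project's scripts) that assumes the `duration` column is in seconds and the `kurtosis` column holds excess kurtosis, as in the results table shown above.

```python
import pandas as pd

SECONDS_PER_YEAR = 31_536_000  # 365 days, the cut-off used in the filters above


def long_duration_share(results, kurtosis_threshold=0.5, min_duration=SECONDS_PER_YEAR):
    """Compare long-duration shares and fat-tail rates across duration groups."""
    is_long = results["duration"] > min_duration
    is_fat = results["kurtosis"] > kurtosis_threshold
    return {
        # Base rate: how many of all runs are longer than the cut-off
        "share_long_overall": float(is_long.mean()),
        # The figure quoted in the text: long runs among fat-tailed runs
        "share_long_among_fat_tailed": float(is_long[is_fat].mean()),
        # Fat-tail rate within each duration group
        "share_fat_among_long": float(is_fat[is_long].mean()),
        "share_fat_among_short": float(is_fat[~is_long].mean()),
    }


# Toy data only; in the notebook the input would be tests_results_df.
toy = pd.DataFrame({
    "duration": [7_862_400, 15_724_800, 125_884_800, 472_953_600],
    "kurtosis": [0.1, 0.9, 3.0, 12.9],
})
summary = long_duration_share(toy)
print(summary)
```

If the share of long runs among fat-tailed samples clearly exceeds the overall share of long runs, that supports the duration effect; if the two are similar, the 53% figure mostly reflects how the samples were constructed.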

Finally, the prevalence of leptokurtosis in this dataset reinforces the earlier finding that the normal distribution may not be the most appropriate model for capturing the full characteristics of real-world stock distributions, particularly when considering the likelihood of extreme price movements.

### Limitations and Observations on Testing Methodology

Expanding on what was brought up earlier, a key insight from this project is the inherent limitation of relying on any single statistical test or automated rule for determining normality. While quantitative methods are essential for large-scale analysis, they must be interpreted with caution and validated with qualitative observation.<br>


Analysis of the correlation matrix between the various statistical tests (Shapiro-Wilk, Kolmogorov-Smirnov, etc.) revealed that while the tests were generally correlated, the strength of the correlation was often weak (in some cases, less than 0.5). This is expected, as each test is designed with different sensitivities to specific types of deviations from normality (e.g., tail behavior vs. central peak). This finding underscores the risk of drawing a conclusion based on a single statistical test, as the choice of test can itself influence the outcome.
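A correlation matrix of this kind can be computed directly from the p-value columns of the results table. The sketch below is an illustration under assumptions: the column names are taken from the table output above, while the helper name `p_value_correlations` and the choice of Spearman rank correlation are assumptions (this section does not state which coefficient was used).

```python
import numpy as np
import pandas as pd

# p-value columns as they appear in the results table
TEST_COLUMNS = [
    "ks_test", "jarque_bera_test", "lilliefors_test", "shapiro_test", "dagostino_test",
]


def p_value_correlations(results, method="spearman"):
    """Pairwise correlation between the p-values of the normality tests."""
    return results[TEST_COLUMNS].corr(method=method)


# Toy data only: independent random p-values, so correlations should be near zero.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.uniform(size=(200, len(TEST_COLUMNS))), columns=TEST_COLUMNS)
corr = p_value_correlations(toy)

# List the test pairs whose correlation is weaker than 0.5 in absolute value
weak_pairs = [
    (a, b)
    for i, a in enumerate(TEST_COLUMNS)
    for b in TEST_COLUMNS[i + 1:]
    if abs(corr.loc[a, b]) < 0.5
]
print(corr.round(2))
print(weak_pairs)
```

A rank-based coefficient is a natural choice here because p-values are bounded and often heavily skewed toward zero, but Pearson correlation can be requested with `method="pearson"`.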

Furthermore, even the "Combined Statistical Rule" developed for this project proved imperfect. Manual verification of the results showed that the algorithm occasionally classified a distribution as "normal" when it was visually obvious that it was not.<br>

The figure below illustrates two such "false positives" (runIDs 13084 and 17579). Although these samples passed the combined p-value and shape-metric thresholds, they exhibit clear non-normal characteristics, including bimodality (two peaks) and extreme skewness, which were not fully captured by the automated rules.

In [2]:
import ipyplot

image_paths = [
    "Histogram-for-SE0000648669---Start-2022-10-18-190000-0500-End-2023-04-17-190000-0500-Log_returns-True-CAGR-True-Data-Type-eod_high_low-ID13084-wOver-2PeakCaseAlright-fromCombined2-scaleCorrection.png", 
    "Histogram-for-HK2778034606---Start-2008-06-01-190000-0500-End-2008-08-28-190000-0500-Log_returns-True-CAGR-False-Data-Type-eod_typical-ID17579-fromCombinedRule-veryBadVeryShortTimeStill2-scaleCorrection.png"
]

# By providing a list of empty labels, you override the default file paths
empty_labels = [''] * len(image_paths)

ipyplot.plot_images(
    image_paths,
    max_images=2,
    img_width=450,
    show_url=False
)

These examples demonstrate that while automated tests are invaluable for processing thousands of samples, they can fail to capture nuances that are immediately apparent to a human analyst. This reinforces the necessity of a blended approach, using quantitative analysis to identify candidates and guide the investigation, but always relying on critical visual inspection to validate the final conclusions.

### Alternative Distributions

Given the strong evidence that stock and log-return distributions are not normal and are predominantly leptokurtic, the final stage of this analysis explored alternative distribution models better suited to capture these "fat-tail" characteristics. Two prominent candidates were investigated: the Student's t-distribution and the Tsallis q-Gaussian distribution.<br>

Both models are known for their ability to account for a higher probability of extreme events compared to the normal distribution and have been used in financial modeling before<sup>6</sup>. Below are notable properties of these distributions.

- **Student's t-distribution**, from classical statistics, is defined by its degrees of freedom (ν), which control the heaviness of the tails.
- **q-Gaussian distribution**, from non-extensive statistical mechanics, uses an entropic index (q) to model systems with long-range memory effects. When q=1, it is equivalent to a normal distribution, and as q increases, the tails become heavier.

For this analysis, Student's t-distribution was fitted directly to the data. For simplicity, the q-Gaussian was approximated by a normal distribution fit, which corresponds to the limiting case **q ~ 1**; q itself was not estimated. This is a reasonable simplification for a preliminary investigation where q is expected to be close to 1. <br>
The full logic can be reviewed in the accompanying script, `mySamples-studentT-qGaussian-overlayTest.py`.
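For reference, a q-Gaussian density with a free q can be written down directly. The sketch below is not part of the accompanying script: `q_gaussian_pdf` and its `beta` (inverse-width) parameter are illustrative names, and only the heavy-tailed range 1 < q < 3 is handled. In that range the q-Gaussian coincides with a Student's t-distribution with ν = (3 − q)/(q − 1) degrees of freedom when β = 1/(3 − q), which the last lines check numerically.

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln


def q_gaussian_pdf(x, q, beta=1.0):
    """Density of a zero-centred q-Gaussian for 1 < q < 3 (heavy-tailed case)."""
    if not 1.0 < q < 3.0:
        raise ValueError("this sketch only covers 1 < q < 3")
    x = np.asarray(x, dtype=float)
    # log of the normalising constant C_q for 1 < q < 3
    log_c_q = (
        0.5 * np.log(np.pi)
        + gammaln((3.0 - q) / (2.0 * (q - 1.0)))
        - 0.5 * np.log(q - 1.0)
        - gammaln(1.0 / (q - 1.0))
    )
    # log of the q-exponential kernel: [1 + (q - 1) * beta * x^2]^(-1 / (q - 1))
    log_kernel = -np.log1p((q - 1.0) * beta * x**2) / (q - 1.0)
    return np.exp(0.5 * np.log(beta) - log_c_q + log_kernel)


# Consistency check against Student's t: q = 1.5 corresponds to nu = 3
q = 1.5
nu = (3.0 - q) / (q - 1.0)
grid = np.linspace(-5.0, 5.0, 101)
matches_t = np.allclose(q_gaussian_pdf(grid, q, beta=1.0 / (3.0 - q)), stats.t.pdf(grid, df=nu))
print(nu, matches_t)
```

This equivalence also means that, for a symmetric zero-centred fit, a variable-q q-Gaussian and a Student's t describe the same family of shapes under different parameterisations.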


In [None]:
# Fit the sample to a Student's t-distribution and to a normal curve
# (the normal fit stands in for a Tsallis q-Gaussian with q ~ 1).
import numpy as np
from scipy import stats

# stock_data is the DataFrame for one run; 'calculated_values' holds the
# prices or log returns produced by the parameter adjustments
stock_data = stock_data['calculated_values']

# Student's t-distribution: fit() returns (degrees of freedom, loc, scale)
t_params = stats.t.fit(stock_data)
t_dist = stats.t(*t_params)

# Tsallis q-Gaussian approximation: q is not estimated here; a normal fit
# (loc, scale) is used as the q ~ 1 limiting case
q_params = stats.norm.fit(stock_data)
q_gaussian = stats.norm(*q_params)

# Evaluate both fitted densities on a fixed grid for the overlay
x = np.linspace(-0.1, 0.1, 1000)
pdf_t = t_dist.pdf(x)
pdf_q = q_gaussian.pdf(x)

# Plotting the original histogram with both fits (omitted in this excerpt)

**Visual Fit Analysis**<br>
The figures below show the fitted curves for both the Student's t-distribution (red line) and the q-Gaussian approximation (blue line) overlaid on sample data histograms. This first set of images displays the probability *density*.<br>
The `runIds` are: 785, 13442, 7, 1507

In [3]:
import ipyplot

image_paths = [
    "Histogram-for-CNE1000002V2---Start-2002-11-17-180000-0600-End-2024-04-14-190000-0500-Log_returns-True-CAGR-True-Data-Type-eod_typical-ID785-wStudentTqGaussianOver.png", 
    "Histogram-for-SE0009216278---Start-2020-07-30-190000-0500-End-2021-07-29-190000-0500-Log_returns-True-CAGR-False-Data-Type-eod_high_low-ID13442-wStudentTqGaussianOver.png", 
    "Histogram-for-HK2778034606---Start-2008-02-25-180000-0600-End-2023-02-20-180000-0600-Log_returns-True-CAGR-False-Data-Type-eod_avg-ID7-wStudentTqGaussianOver.png",
    "Histogram-for-KYG9288X1043---Start-2008-04-15-190000-0500-End-2008-10-13-190000-0500-Log_returns-True-CAGR-True-Data-Type-eod_high_low-ID1507-wStudentTqGaussianOver.png"
]

# By providing a list of empty labels, you override the default file paths
empty_labels = [''] * len(image_paths)

ipyplot.plot_images(
    image_paths,
    max_images=4,
    img_width=450,
    show_url=False
)

For thoroughness, the second set of plots below displays the *frequency* distributions (plotted using the accompanying `mySamples-studentT-qGaussian-overlayTest-2-frequencyDistAddition.py`).<br>
The `runIds` are: 399, 7929, 785, 3335

In [4]:
import ipyplot

image_paths = [
    "Histogram-for-SE0000115420-(Run-ID-399)-Start-2019-10-27-End-2020-10-22-Log-Returns-True-CAGR-False-Data-Type-eod_close-ID399-wStudentTqGaussianOver2-frequency.png", 
    "Histogram-for-HK0883013259-(Run-ID-7929)-Start-2004-03-17-End-2024-04-14-Log-Returns-True-CAGR-False-Data-Type-eod_typical-ID7929-wStudentTqGaussianOver2-frequency.png", 
    "Histogram-for-CNE1000002V2-(Run-ID-785)-Start-2002-11-17-End-2024-04-14-Log-Returns-True-CAGR-True-Data-Type-eod_typical-ID785-wStudentTqGaussianOver2-frequency.png",
    "Histogram-for-US9842451000-(Run-ID-3335)-Start-1994-04-18-End-2024-04-11-Log-Returns-True-CAGR-False-Data-Type-eod_avg-ID3335-wStudentTqGaussianOver2-frequency.png"
]

# By providing a list of empty labels, you override the default file paths
empty_labels = [''] * len(image_paths)

ipyplot.plot_images(
    image_paths,
    max_images=4,
    img_width=450,
    show_url=False
)

From visual inspection, the Student's t-distribution provides a markedly better fit to the empirical data than the normal distribution. Because the q-Gaussian was approximated here with q ~ 1, its curve is effectively a normal fit and serves mainly as a baseline. In the majority of cases, *the **Student's t-distribution** appears to provide the closest fit*, accurately modeling both the sharp central peak and the heavier tails.<br>

This suggests a powerful connection to classical statistical theory. The Student's t-distribution's key parameter, the *degrees of freedom* (obtained from `stats.t.fit(x)`), adjusts the tail thickness to account for uncertainty in samples with unknown variance. This mathematical feature serves as an excellent proxy for the inherent economic uncertainty of market volatility. By allowing for heavier tails, the t-distribution effectively models the higher-than-normal probability of extreme events (market crashes or rallies), providing a more realistic and robust representation of stock return behavior.
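The link between the degrees of freedom and tail weight can be made concrete. For ν > 4 the excess kurtosis of a Student's t-distribution is 6/(ν − 4), so smaller ν means heavier tails. The sketch below uses hypothetical helper names (`t_excess_kurtosis`, `two_sided_tail`) and compares the probability of a move beyond 4 scale units under a t-distribution and under a standard normal.

```python
from scipy import stats


def t_excess_kurtosis(df):
    """Excess kurtosis of a Student's t-distribution; finite only for df > 4."""
    if df <= 4:
        raise ValueError("excess kurtosis is not finite for df <= 4")
    return 6.0 / (df - 4.0)


def two_sided_tail(dist, k):
    """P(|X| > k) for a frozen scipy distribution that is symmetric about zero."""
    return 2.0 * dist.sf(k)


normal_tail = two_sided_tail(stats.norm(), 4.0)
for df in (3, 5, 10, 30):
    t_tail = two_sided_tail(stats.t(df), 4.0)
    print(f"df={df:>2}: P(|X| > 4) = {t_tail:.2e}  (normal: {normal_tail:.2e})")
```

Note that the comparison is in scale units rather than standard deviations: a t-distribution with small ν has a larger standard deviation than its scale parameter, so a comparison at matched variance would give different numbers, although the tails remain heavier than the normal's.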

## Conclusions

This multi-stage exploratory analysis, conducted on over 25,000 unique samples of stock data, consistently challenges the simplistic assumption of normal or lognormal distributions in financial markets. The findings lead to several key conclusions:<br>

1. **Stock Distributions are Predominantly Leptokurtic**: The most consistent characteristic observed across the dataset is leptokurtosis, or "fat tails." This means that extreme price movements occur far more frequently than would be predicted by a normal distribution. Only a small fraction of samples (7.8%) passed a rigorous, multi-factor statistical test for normality. This could be reduced even further, depending on the strictness of the chosen normality criteria. 

2. **Time Duration is the Most Significant Factor**: The single most influential parameter on distribution shape is the time window of the sample. There is a strong negative correlation between duration and the likelihood of normality; as the time frame increases, distributions are much more likely to exhibit fat tails and fail normality tests.

3. **Alternative Models Provide a Better Fit**: Given the prevalence of leptokurtosis, alternative distributions were investigated. The Student's t-distribution, in particular, provides a visually superior fit to the empirical data, accurately capturing the high peak and heavy tails characteristic of stock returns.

4. **Theoretical Models Have Practical Limitations**: The discrepancy between the theoretical Geometric Brownian Motion model and the observed data can be largely attributed to the model's assumption of constant volatility (**σ**). Real-world financial data exhibits volatility clustering and non-constant variance, especially over longer periods, which leads to the observed leptokurtic distributions.

5. **Log Returns and Raw Prices Yield Highly Consistent Classifications**: While analyzing log returns is often more convenient, a paired analysis of samples confirmed very high agreement (over 97% matching outcomes) in normality classification between raw prices and their corresponding log returns. This suggests that for a given time period, the underlying distributional characteristics are largely preserved regardless of the transformation used.

Furthermore, this work refines the conclusions of previous studies<sup>3</sup>. It moves from a general observation that normality is conditional to a more specific conclusion: normality is rare, primarily found in short-duration samples, and the default characteristic of stock returns is leptokurtosis, which is better modeled by distributions like the Student's t.

## Future Work 

This exploratory study opens several avenues for future research. Based on the findings, the following areas warrant further investigation:<br>

- **Deeper Parameter Analysis**:

    - *Duration and Data Frequency*: This study confirmed that time duration is a critical factor influencing distribution shape. Future work should systematically analyze this effect across different data frequencies. For instance, replicating the analysis on weekly, monthly, and quarterly returns would provide insight into whether distributional characteristics change as data is aggregated over time. This approach would also allow for a direct comparison with the findings of researchers like Mota (2012)<sup>5</sup>, who investigated similar properties in stock data.



    - *Corporate Actions*: Conduct a focused study on a larger dataset of stocks that have undergone splits or mergers to quantify their impact on distribution shape more robustly.

- **Advanced Distribution Modeling**:

    - *Quantitative Goodness-of-Fit*: Implement quantitative tests (e.g., AIC/BIC, likelihood-ratio tests) to formally compare the fit of the Student's t, q-Gaussian (with a variable q), and other fat-tailed distributions.

    - *Mixture Models*: Explore mixture models (e.g., a mix of two or more normal or t-distributions) to capture complex behaviors like regime shifts, as suggested by recent literature.

- **Methodological Refinements**:

    - *Outlier Detection*: Develop and test more sophisticated, domain-aware outlier detection methods specifically designed for financial time series to ensure data cleanliness.

    - *Broader Data Scope*: Expand the analysis to a much greater number of stocks, possibly thousands, across more international markets and different asset classes to validate the universality of these findings.
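As a starting point for the quantitative goodness-of-fit item above, the normal and Student's t fits can be compared by AIC. This is a minimal sketch under assumptions, not an implemented part of this study: the helper names `aic` and `compare_normal_vs_t` are hypothetical, parameters are estimated by maximum likelihood with `scipy.stats`, and the example data is simulated rather than taken from the dataset.

```python
import numpy as np
from scipy import stats


def aic(dist, data, params):
    """Akaike information criterion: 2k - 2 * log-likelihood (lower is better)."""
    log_likelihood = np.sum(dist.logpdf(data, *params))
    return 2 * len(params) - 2 * log_likelihood


def compare_normal_vs_t(data):
    """Fit a normal and a Student's t by maximum likelihood and return their AICs."""
    data = np.asarray(data, dtype=float)
    norm_params = stats.norm.fit(data)  # (loc, scale)
    t_params = stats.t.fit(data)        # (df, loc, scale)
    return {
        "normal": aic(stats.norm, data, norm_params),
        "student_t": aic(stats.t, data, t_params),
    }


# Simulated heavy-tailed sample (3 degrees of freedom); real input would be
# the 'calculated_values' of a single run.
sample = stats.t.rvs(df=3, size=3000, random_state=0)
scores = compare_normal_vs_t(sample)
print(scores)
```

AIC penalises the extra degrees-of-freedom parameter of the t-distribution, so a lower AIC for the t fit indicates that the heavier tails improve the fit by more than the added flexibility alone would explain.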

## Appendix

Code 1: Function from `normality-scanner-tester.py` that applies each parameter combination to the stock data, runs the normality tests, and stores the results used to identify normal/lognormal distributions

In [None]:
# Runs all my normality/lognormality tests for given ISINs/stocks, data parameters, and data.
def runAllTests(ISINs, param_combos, data):
    for ISIN in ISINs:
        instrument_data = data[data[ins_identifier].isin([ISIN])]
        # Sort this instrument's rows by date, to be sure they are in order before applying params:
        instrument_data = instrument_data.sort_values('bar_date')
        instrument_data.reset_index(drop=True, inplace=True)
        for params in param_combos:
            log_returns_param = False
            sample_merger = False
            cagr_param = False
            data_type_param = 'eod_avg'
            average_split = 0
            param_adjusted_data = apply_params_to_data(instrument_data, params)
            # Below, check if adjusted data is a DataFrame, if not (it's -1), then don't log any results.
            if isinstance(param_adjusted_data, pd.DataFrame):
                param_adjusted_data.dropna(subset=['calculated_values'], inplace=True)
                sample_start = param_adjusted_data['bar_date'].min()
                sample_end = param_adjusted_data['bar_date'].max()
                sample_duration = sample_end - sample_start
                sample_duration = sample_duration.total_seconds()
                sample_volume = param_adjusted_data['volume'].sum()
                if param_adjusted_data['merger'].any():  # Just want to know if there was any merger in sample
                    sample_merger = True
                if (param_adjusted_data['split'] != 0).any():  # Get the average of non-zero values, if any splits
                    average_split = param_adjusted_data['split'][param_adjusted_data['split'] != 0].mean()
                param_adjusted_data = param_adjusted_data['calculated_values']
                param_adjusted_data = param_adjusted_data.tolist()
                sample_N = len(param_adjusted_data)
                if 'log_returns' in params:
                    log_returns_param = params.get('log_returns')
                if 'cagr' in params:
                    cagr_param = params.get('cagr')
                if 'data_type' in params:
                    data_type_param = params.get('data_type')
                test_results = apply_tests(param_adjusted_data, log_returns_param)
                test_results.update({'isin': ISIN, 'start_time': sample_start, 'end_time': sample_end,
                                     'duration': sample_duration, 'n': sample_N, 'period_volume': sample_volume,
                                     'split': average_split, 'merger': sample_merger, 'cagr': cagr_param,
                                     'data_type': data_type_param, 'log_returns': log_returns_param})
                stmt = insert(results_table).values(**test_results)
                with engine.begin() as conn:
                    db_result = conn.execute(stmt)

Code 2: Main logic for generating the histogram plots of the stock data, overlaid with the expected normal distribution curve, from auto-visual-normality-checker.py

In [None]:
    # --- 1. Generate Histogram Data ---
    try:
        hist_counts, bin_edges = np.histogram(data, bins='auto')  # Use 'auto' or a fixed number
        if len(hist_counts) > max_bins:
            hist_counts, bin_edges = np.histogram(data, bins=max_bins)
    except MemoryError:
        hist_counts, bin_edges = np.histogram(data, bins=max_bins)
    bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2

    # --- 2. Generate Expected Normal Distribution Values ---
    expected_counts = []
    for i in range(len(bin_edges) - 1):
        prob, _ = quad(norm.pdf, bin_edges[i], bin_edges[i + 1], args=(mean, std_dev))
        expected_counts.append(prob * len(data))
    expected_counts = np.array(expected_counts)

    # --- Plotting (for Visualization) ---
    plt.figure(figsize=(10, 6))
    plt.hist(data, bins=bin_edges, color='purple', alpha=0.7, label='Histogram')
    x = np.linspace(mean - 4 * std_dev, mean + 4 * std_dev, 200)
    plt.plot(x, norm.pdf(x, mean, std_dev) * len(data) * (bin_edges[1] - bin_edges[0]), 'k', linewidth=2,
             label='Normal Distribution')
    plt.title(titleStr)
    plt.xlabel('Price')
    plt.ylabel('Frequency')
    plt.legend()
    file_name = titleStr.replace(',', '').replace(':', '')  # Remove commas and colons
    file_name = file_name.replace(' ', '-')  # Replace spaces with dashes
    file_name = (file_name + "-ID" + str(row['runid']))
    file_name = file_name + "-wOver-scaleCorrection.png"
    file_name = "correction-2-normal-line-scaled" + "/" + file_name
    print("FILE NAME: ", file_name)
    # plt.savefig(file_name)
    plt.show()

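Step 2 of Code 2 integrates `norm.pdf` over each bin with `quad`. Because the integral of a PDF over a bin equals the difference of the CDF at the bin edges, the same expected counts can also be obtained in one vectorized call. The sketch below checks that equivalence on synthetic data; the sample, its parameters, and the variable names `expected_cdf` and `expected_quad` are assumptions for illustration only.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Synthetic stand-in for a price sample (hypothetical values)
rng = np.random.default_rng(42)
data = rng.normal(loc=100.0, scale=5.0, size=2000)
mean, std_dev = data.mean(), data.std()

hist_counts, bin_edges = np.histogram(data, bins="auto")

# Expected counts via CDF differences (vectorized)
expected_cdf = np.diff(norm.cdf(bin_edges, mean, std_dev)) * len(data)

# Same quantity via per-bin numerical integration, as in Code 2
expected_quad = np.array([
    quad(norm.pdf, lo, hi, args=(mean, std_dev))[0] * len(data)
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:])
])

print("methods agree:", np.allclose(expected_cdf, expected_quad))
```

The CDF form avoids one `quad` call per bin, which may matter when many runs are processed.
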
Code 3: Obtains results for the statistical tests done on sample stock data, from normality-scanner-tester.py

In [14]:
# Function to apply tests, given array/list of specific data
def apply_tests(data, log_returns):
    cdf = 'norm'
    if not log_returns:
        cdf = 'lognorm'  # For RAW PRICES, distribution assumed to be lognormal, per theory.
        # KS test against a lognormal distribution
        # Note: the lognormal parameters (shape, loc, scale) are estimated from the sample itself via lognorm.fit
        shape, loc, scale = lognorm.fit(data)
        ks_stat, ks_p = kstest(data, cdf, args=(shape, loc, scale))
        hurst = hurst_exponent(data)  # calculate Hurst exponent, before any log transforms
        data = (np.log(np.array(data)))  # Apply log transform to raw prices for normality testing using other functions
    else:
        mean = (np.array(data)).mean()
        std = (np.array(data)).std()
        # Else perform k_s test against a normal distribution (for log returns)
        ks_stat, ks_p = kstest(data, cdf, args=(mean, std))
        hurst = hurst_exponent(data)  # Calculate Hurst for log returns as well, even though this is debatable

    # Normality Tests
    shapiro_p = shapiro(data)[1]  # P-value of Shapiro-Wilk Test
    dagostino_p = normaltest(data)[1]  # P-value of D'Agostino's K^2 Test
    jarque_statistic, jarque_p = jarque_bera(data)
    lilliefors_statistic, lilliefors_p = lilliefors(data)
    # Skew and kurtosis calculations of distribution below. For this the log transform for raw prices is very necessary!
    skewness = skew(data)
    excess_kurtosis = kurtosis(data)  # Excess kurtosis should be zero for normal distribution
    return {
        'shapiro_test': shapiro_p,
        'dagostino_test': dagostino_p,
        'ks_test': ks_p,  # P-value of the Kolmogorov-Smirnov Test
        'jarque_bera_test': jarque_p,
        'lilliefors_test': lilliefors_p,
        'skew': skewness,
        'kurtosis': excess_kurtosis,
        'hurst_exponent': hurst
    }

Code 4: Function used to calculate the Hurst Exponent for sample stock data, from normality-scanner-tester.py

In [16]:
def hurst_exponent(time_series):
    lags = range(2, 100)
    tau = [sqrt(np.std(np.subtract(time_series[lag:], time_series[:-lag]))) for lag in lags]
    # Filter out zero, NaN, and infinite values
    valid_indices = [i for i, t in enumerate(tau) if t > 0 and not np.isnan(t) and np.isfinite(t)]
    valid_lags = np.array([lags[i] for i in valid_indices])
    valid_tau = np.array([tau[i] for i in valid_indices])

    # Apply polyfit if there are valid points
    if len(valid_lags) > 0 and len(valid_tau) > 0:
        poly = np.polyfit(np.log(valid_lags), np.log(valid_tau), 1)
        return poly[0] * 2.0
    else:
        return None

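A quick sanity check of this lag-difference estimator can be run on synthetic series. The sketch below re-implements the same idea in a self-contained form (`hurst_sketch` is a hypothetical name, and the data are simulated, not stock data). Under these assumptions a random walk should score near 0.5, while i.i.d. noise should score near 0, which is one reason applying the estimator directly to log returns is open to debate.

```python
import numpy as np


def hurst_sketch(series, max_lag=100):
    """Slope-based Hurst estimate mirroring the lag-difference approach of Code 4."""
    series = np.asarray(series, dtype=float)
    lags = np.arange(2, max_lag)
    # Square root of the standard deviation of lagged differences, per lag
    tau = np.array([np.sqrt(np.std(series[lag:] - series[:-lag])) for lag in lags])
    valid = np.isfinite(tau) & (tau > 0)
    if valid.sum() < 2:  # need at least two points to fit a line
        return None
    slope = np.polyfit(np.log(lags[valid]), np.log(tau[valid]), 1)[0]
    return slope * 2.0


rng = np.random.default_rng(7)
noise = rng.normal(size=10_000)   # synthetic i.i.d. "returns"
random_walk = np.cumsum(noise)    # synthetic "log price" path

h_walk = hurst_sketch(random_walk)
h_noise = hurst_sketch(noise)
print(f"random walk: {h_walk:.2f}, white noise: {h_noise:.2f}")
```
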
Code 5: Logic used to generate histograms and compare them to expected values, generate error metrics, etc. From auto-visual-normality-checker.py

In [None]:
    # --- 1. Generate Histogram Data ---
    try:
        hist_counts, bin_edges = np.histogram(data, bins='auto')  # Use 'auto' or a fixed number
        if len(hist_counts) > max_bins:
            hist_counts, bin_edges = np.histogram(data, bins=max_bins)
    except MemoryError:
        hist_counts, bin_edges = np.histogram(data, bins=max_bins)
    bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2

    # --- 2. Generate Expected Normal Distribution Values ---
    expected_counts = []
    for i in range(len(bin_edges) - 1):
        prob, _ = quad(norm.pdf, bin_edges[i], bin_edges[i + 1], args=(mean, std_dev))
        expected_counts.append(prob * len(data))
    expected_counts = np.array(expected_counts)

    # --- 3. Calculate Difference Metrics ---
    ssd = np.sum((hist_counts - expected_counts) ** 2)  # Sum of Squared Differences (not divided by expected counts, unlike Pearson's Chi^2)
    max_abs_diff = np.max(np.abs(hist_counts - expected_counts))
    peaks, _ = find_peaks(hist_counts)
    num_peaks = len(peaks)

    # --- 4. Automated Decision based on Combined Criteria ---
    if ssd < ssd_threshold and max_abs_diff < max_abs_diff_threshold and num_peaks <= max_peaks:
        print(
            f"Run {row['runid']}: Visually Normal (SSD = {ssd:.2f}, Max Abs Diff = {max_abs_diff:.2f}, Num Peaks = {num_peaks})")
        pass_list.append(row['runid'])
        counter += 1
    else:
        print(
            f"Run {row['runid']}: NOT Visually Normal (SSD = {ssd:.2f}, Max Abs Diff = {max_abs_diff:.2f}, Num Peaks = {num_peaks})")

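The difference metrics of step 3 can be wrapped in a small function for experimentation. The sketch below is a restatement under assumptions, not the script's code: `histogram_fit_metrics` and its `max_bins` default are hypothetical, expected counts come from CDF differences rather than `quad`, and Pearson's chi-square is included only for contrast with the unnormalized SSD.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.stats import norm


def histogram_fit_metrics(data, max_bins=200):
    """Compare a histogram with the counts expected under a fitted normal."""
    data = np.asarray(data, dtype=float)
    mean, std_dev = data.mean(), data.std()
    hist_counts, bin_edges = np.histogram(data, bins="auto")
    if len(hist_counts) > max_bins:
        hist_counts, bin_edges = np.histogram(data, bins=max_bins)
    # Per-bin probability under the normal, scaled to counts
    expected = np.diff(norm.cdf(bin_edges, mean, std_dev)) * len(data)
    diff = hist_counts - expected
    mask = expected > 0  # guard against division by zero in far tails
    peaks, _ = find_peaks(hist_counts)
    return {
        "ssd": float(np.sum(diff ** 2)),
        "max_abs_diff": float(np.max(np.abs(diff))),
        "pearson_chi2": float(np.sum(diff[mask] ** 2 / expected[mask])),
        "num_peaks": int(len(peaks)),
    }


rng = np.random.default_rng(1)
normal_sample = rng.normal(size=5000)
bimodal_sample = np.concatenate([rng.normal(-3, 1, 2500), rng.normal(3, 1, 2500)])

metrics_normal = histogram_fit_metrics(normal_sample)
metrics_bimodal = histogram_fit_metrics(bimodal_sample)
print(metrics_normal)
print(metrics_bimodal)
```

Because SSD and the maximum absolute difference grow with sample size and depend on bin width, any thresholds applied to them are best chosen with the sample sizes of the runs in mind.
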
Code 6: Function to manipulate given stock data by applying requested parameters, from normality-scanner-tester.py

In [17]:
# Takes in data for a specific instrument at this time, then returns the data, with the parameters applied.
# Ensure date columns are already given in datetime format as well (and ideally sorted)
def apply_params_to_data(data, params):
    data = data.copy()  # Used to prevent Pandas "SettingWithCopyWarning"
    CAGR = False
    log_returns = False
    for key, value in params.items():
        if key == 'duration' and int(value) != -1:  # IF duration value is -1, grab WHOLE data, no sampling
            # timedelta is used directly, assuming `from datetime import datetime, timedelta` as in the import cell
            time_duration = timedelta(seconds=int(value))
            latest_date = data['bar_date'].max()
            earliest_date = data['bar_date'].min()
            max_start_date = latest_date - time_duration
            # Ensure there's a minimum of N (time_duration) given days in data, else return -1 (SKIP):
            if (latest_date - earliest_date).days < time_duration.days:
                return -1
            # randint's upper bound is exclusive, so add 1 to include max_start_date (and to allow 0 days of slack)
            random_delta = timedelta(days=np.random.randint(0, (max_start_date - earliest_date).days + 1))
            random_start_date = (earliest_date + random_delta)
            # Extract the sample data from the complete dataframe:
            data = data[(data['bar_date'] >= random_start_date) &
                        (data['bar_date'] <= random_start_date + time_duration)]
            # Ensure chosen time frame is not empty and has at least 9 rows (whether it is legitimate or sampling error)
            if data.empty or len(data.index) < 9:
                return -1
        if (key == 'data_type'):  # ESSENTIAL parameter condition as it creates required 'calculated_values' column
            data = get_calculated_prices(data, value)
        if (key == 'cagr'):
            if value:
                CAGR = True
        if (key == 'log_returns'):
            if value:
                log_returns = True
        # Ensure chosen time frame is not empty and has at least 9 rows, even if had whole time frame (-1)
        if data.empty or len(data.index) < 9:
            return -1

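The random-window sampling step of Code 6 can be sketched in isolation as follows. This is a minimal illustration under assumptions, not the script's implementation: `sample_random_window`, its arguments, and the synthetic daily frame are hypothetical, and the window start is drawn uniformly in whole days so that the full window always fits inside the available data.

```python
import numpy as np
import pandas as pd


def sample_random_window(df, duration, date_col="bar_date", min_rows=9, rng=None):
    """Return a random contiguous window of length `duration`, or None if it cannot fit."""
    if rng is None:
        rng = np.random.default_rng()
    earliest, latest = df[date_col].min(), df[date_col].max()
    slack_days = (latest - duration - earliest).days  # room to shift the start
    if slack_days < 0:
        return None
    # integers() excludes the upper bound, hence the +1
    start = earliest + pd.Timedelta(days=int(rng.integers(0, slack_days + 1)))
    window = df[(df[date_col] >= start) & (df[date_col] <= start + duration)]
    return window if len(window) >= min_rows else None


# Synthetic daily frame (hypothetical stand-in for one instrument's bars)
frame = pd.DataFrame({
    "bar_date": pd.date_range("2020-01-01", periods=365, freq="D"),
    "close": np.linspace(100.0, 120.0, 365),
})
window = sample_random_window(frame, pd.Timedelta(days=30), rng=np.random.default_rng(3))
too_long = sample_random_window(frame, pd.Timedelta(days=400))
print(len(window), too_long)
```
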
List 1: RunIDs of all the runs which passed the thresholds when compared to expected values. 

[69, 381, 1371, 1561, 2195, 2403, 2445, 2655, 2729, 2740, 2827, 2855, 2883, 2955, 3845, 3901, 4035, 4955, 4964, 5356, 5629, 6057, 6059, 6071, 6435, 6735, 6744, 6745, 6747, 6756, 6766, 6767, 6775, 6776, 6794, 6795, 6850, 7045, 7711, 7959, 8851, 8901, 8922, 8931, 9317, 9318, 9342, 9347, 9356, 9357, 9359, 9360, 9369, 9382, 9396, 9406, 9407, 9409, 9420, 9699, 9756, 9852, 10708, 11576, 11578, 11867, 11874, 11881, 11904, 11934, 11937, 11984, 12086, 12262, 12298, 12334, 12619, 12866, 14265, 14270, 14488, 14525, 14574, 14576, 14668, 14944, 14992, 15211, 15233, 15242, 15248, 15255, 15262, 15264, 15266, 15271, 15331, 15368, 15392, 15401, 15402, 15652, 15927, 16310, 16837, 16860, 17023, 17209, 17371, 17373, 17383, 17393, 17394, 17404, 17405, 17422, 17424, 17432, 17433, 17443, 18195, 18567, 19519, 19615, 19850, 19890, 19901, 19907, 19979, 20152, 20186, 20596, 20946, 21394, 21706, 22275, 22282, 22502, 22546, 22547, 22844, 23111, 23205, 23209, 23214, 23215, 23216, 23217, 23224, 23231, 23233, 23252, 23264, 23269, 24009, 24417, 24589, 24697, 25126, 25256, 25359, 25368, 25370, 25379, 25381, 25388, 25400, 25401, 25418, 25420, 25431, 25540]

The full scripts are also available in the GitHub repo: https://github.com/edM777/stock-distribution-analysis/tree/master/scripts

## References

1. https://analystprep.com/cfa-level-1-exam/quantitative-methods/lognormal-distribution-and-continuous-compounding/
2. https://www.actuaries.org/EVENTS/Congresses/Paris/Papers/3203.pdf
3. https://github.com/edM777/index-data-explore/blob/main/Market-Index-Data-Exploration-revisit-Jan2024.ipynb
4. https://www.investopedia.com/terms/l/leptokurtic.asp
5. https://eudml.org/doc/270939
6. https://stochasticeconomics.wordpress.com/2017/08/14/stock-returns-q-gaussian-vs-gaussian-distribution