# Exploratory Data Analysis (EDA) on Quantitative Data Tutorial
The quanteda package is designed for exploratory data analysis (EDA) of historical stock returns in a time series framework. It calculates return and risk metrics and simulates time series returns based on historical performance. Its EDA functions visualize missing values and the distribution of returns.

The package computes several financial metrics to assess the historical performance of a stock: total return, annualized return, annualized volatility, and the Sharpe ratio. It can also simulate a return series for a stock from the observed return distribution and these return and risk metrics.

This tutorial illustrates how to use the functions in the quanteda package.

The package's functions are as follows:

- `plot_missing_vals`: Visualizes missing values in the dataset.

- `plot_num_dist`: Visualizes the distribution of numerical values.

- `generate_financial_metrics`: Computes total return, annualized return, annualized volatility, and the Sharpe ratio for a stock's historical returns.

- `generate_return_series`: Simulates time series returns for a stock from a specified return distribution, expected return, and volatility.

## Import the functions

In [1]:
import requests
import zipfile
import warnings
import pandas as pd

from io import BytesIO

from quanteda.plot_missing_vals import plot_missing_vals
from quanteda.plot_num_dist import plot_num_dist
from quanteda.generate_financial_metrics import generate_financial_metrics
from quanteda.generate_return_series import generate_return_series

The [dataset](https://archive.ics.uci.edu/dataset/247/istanbul+stock+exchange) used for this demonstration, Istanbul Stock Exchange, was curated by Oguz Akbilgic and comes from the UCI Machine Learning Repository. We performed some basic data wrangling to match the dataset to the requirements of our functions. In practice, the DataFrame passed to the functions in this package is expected to be indexed by date at a regular frequency.

For our demonstration, we selected three stock indices, `SP`, `FTSE`, and `NIKKEI`, and took their daily returns from January 1, 2009, to December 31, 2009. This subset is enough to showcase the functionality of the quanteda package.

## Download the Raw Data

In [2]:
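import os

# Silence openpyxl's UserWarning when the workbook is read.
warnings.filterwarnings("ignore", category=UserWarning, module="openpyxl")

zip_url = 'https://archive.ics.uci.edu/static/public/247/istanbul+stock+exchange.zip'
response = requests.get(zip_url)
response.raise_for_status()  # stop early if the download failed

# Save the archive under data/ (created if missing), then extract it.
os.makedirs('data', exist_ok=True)
with open('data/data.zip', 'wb') as zip_file:
    zip_file.write(response.content)
with zipfile.ZipFile('data/data.zip', 'r') as zip_ref:
    zip_ref.extractall('data/')
df = pd.read_excel('data/data_akbilgic.xlsx', skiprows=1)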
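The archive can also be read directly from memory rather than written to disk first, which is what the imported `BytesIO` is typically used for. The following is a minimal sketch under that assumption; the helper name `read_zip_member` is hypothetical and not part of quanteda, and the tiny archive is built in memory only so the example runs without network access.

```python
import zipfile
from io import BytesIO


def read_zip_member(zip_bytes: bytes, member: str) -> bytes:
    """Return the raw bytes of one file stored in an in-memory ZIP archive."""
    with zipfile.ZipFile(BytesIO(zip_bytes)) as archive:
        return archive.read(member)


# Build a small archive in memory to stand in for the downloaded content.
buffer = BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("example.txt", "hello")

content = read_zip_member(buffer.getvalue(), "example.txt")
print(content)
```

With the real download, the same idea would be `pd.read_excel(BytesIO(read_zip_member(response.content, 'data_akbilgic.xlsx')), skiprows=1)`, since `pd.read_excel` accepts file-like objects.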

## Preprocess Data

In [3]:
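# Parse the date strings (day-abbreviated month-two-digit year) and use them as the index.
df['date'] = pd.to_datetime(df['date'], format='%d-%b-%y')
df.set_index('date', inplace=True)
df.index.name = 'index'
# Keep only the 2009 observations.
df = df[(df.index >= '2009-01-01') & (df.index <= '2009-12-31')]

# Reindex to calendar-day frequency; dates with no record become NaN rows.
index_returns = df[['SP', 'FTSE', 'NIKKEI']].asfreq('D')
index_returns.head()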

| index | SP | FTSE | NIKKEI |
|---|---|---|---|
| 2009-01-05 | -0.004679 | 0.003894 | 0.0 |
| 2009-01-06 | 0.007787 | 0.012866 | 0.004162 |
| 2009-01-07 | -0.030469 | -0.028735 | 0.017293 |
| 2009-01-08 | 0.003391 | -0.000466 | -0.040061 |
| 2009-01-09 | -0.021533 | -0.01271 | -0.004474 |


## `plot_missing_vals`

The `plot_missing_vals` function takes a Pandas DataFrame and visualizes its missing values. Identifying and handling missing values early is important for an accurate evaluation of historical performance. In financial returns data, missing records typically fall on weekends and statutory holidays, when the stock exchange is closed.

In [4]:
plot_missing_vals(index_returns)
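To see where such gaps come from, the sketch below reproduces the effect of `asfreq('D')` on a small, made-up series of business-day returns and counts the resulting missing values with plain pandas. The column name `A` and the return values are hypothetical; this is not how `plot_missing_vals` is implemented, only a numerical counterpart to what the plot shows.

```python
import pandas as pd

# Six hypothetical business-day returns: Mon 5 Jan to Fri 9 Jan 2009, then Mon 12 Jan.
trading_days = pd.bdate_range("2009-01-05", periods=6)
returns = pd.DataFrame(
    {"A": [0.01, -0.02, 0.005, 0.0, 0.013, -0.007]},
    index=trading_days,
)

# Reindexing to calendar days inserts NaN rows for the weekend.
daily = returns.asfreq("D")

missing_per_column = daily.isna().sum()
missing_dates = daily.index[daily["A"].isna()]

print(missing_per_column)
print(missing_dates)

# Forward-filling, as done later in this tutorial, removes the gaps.
filled = daily.ffill()
print(filled.isna().sum())
```

Here the two missing rows are the Saturday and Sunday between the two trading weeks.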

## `plot_num_dist`

The `plot_num_dist` function takes a Pandas DataFrame and visualizes the distribution of time series returns for stocks. Understanding the return distribution matters most when simulating future returns for a stock, because the randomly generated returns are assumed to follow this historical distribution.

In [5]:
plot_num_dist(index_returns.ffill())

From the plots for the `index_returns` DataFrame, the daily returns of the three selected indices (`SP`, `FTSE`, and `NIKKEI`) appear approximately normally distributed. This guides the simulation of future returns by suggesting a distribution for modeling the randomness of return movements.
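A visual impression of normality can be complemented with sample skewness and excess kurtosis, both of which are close to zero for normally distributed data. The sketch below uses synthetic data (not the index returns) to contrast a normal sample with a heavy-tailed one; the column names are hypothetical, and the same two pandas calls could be applied to `index_returns.ffill()`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic samples: one normal, one heavy-tailed (Student's t with 3 degrees of freedom).
sample = pd.DataFrame({
    "normal_like": rng.normal(0.0, 0.02, size=5000),
    "heavy_tailed": rng.standard_t(df=3, size=5000) * 0.02,
})

# pandas reports excess kurtosis, so a normal sample gives a value near 0.
summary = pd.DataFrame({
    "skew": sample.skew(),
    "excess_kurtosis": sample.kurt(),
})
print(summary)
```

A large positive excess kurtosis would suggest fatter tails than the normal distribution assumes.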

## `generate_financial_metrics`

The `generate_financial_metrics` function evaluates stock performance and risk. It takes two parameters: a Pandas DataFrame of historical stock returns and the annual risk-free rate, `annual_risk_free`, as a float (default 0.0). The output includes two return metrics (total return and annualized return), one risk metric (annualized volatility), and one risk-adjusted performance metric (the Sharpe ratio). These metrics are important indicators of stock performance. Annualized return and annualized volatility are also the key inputs for simulating future returns.

In [6]:
metrics = generate_financial_metrics(index_returns.ffill())
metrics

| | count | total_return | annual_return | annual_volatility | sharpe_ratio |
|---|---|---|---|---|---|
| SP | 361 | 0.233877 | 0.236469 | 0.29596 | 0.798988 |
| FTSE | 361 | 0.362978 | 0.367 | 0.256957 | 1.428256 |
| NIKKEI | 361 | 0.369559 | 0.373654 | 0.320181 | 1.167006 |
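For orientation, the sketch below computes the same four metrics using common textbook definitions. It is an illustration under stated assumptions, not quanteda's implementation: the function name `summarize_returns`, the `periods_per_year=365` convention, and the sample standard deviation are all assumptions, so the numbers need not match the package exactly. One thing the output above does confirm is that, with `annual_risk_free` at 0.0, each Sharpe ratio equals `annual_return / annual_volatility` (for `SP`, 0.236469 / 0.29596 ≈ 0.799).

```python
import numpy as np
import pandas as pd


def summarize_returns(returns: pd.Series,
                      periods_per_year: int = 365,
                      annual_risk_free: float = 0.0) -> pd.Series:
    """Textbook return and risk metrics for a series of simple periodic returns."""
    n = returns.count()
    # Compound the periodic returns into one cumulative return.
    total_return = (1.0 + returns).prod() - 1.0
    # Rescale the cumulative growth to a one-year horizon.
    annual_return = (1.0 + total_return) ** (periods_per_year / n) - 1.0
    # Scale the per-period standard deviation by the square root of time.
    annual_volatility = returns.std() * np.sqrt(periods_per_year)
    sharpe_ratio = (annual_return - annual_risk_free) / annual_volatility
    return pd.Series({
        "count": n,
        "total_return": total_return,
        "annual_return": annual_return,
        "annual_volatility": annual_volatility,
        "sharpe_ratio": sharpe_ratio,
    })


# Four made-up periodic returns, used only to exercise the function.
example = pd.Series([0.01, -0.02, 0.03, 0.005])
metrics_example = summarize_returns(example)
print(metrics_example)
```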


## `generate_return_series`

The `generate_return_series` function simulates time series returns given the expected return, volatility, and return distribution of a stock. It takes the following parameters:
- `expected_annual_return`: Expected annualized return as a decimal (e.g., 0.05 for 5%).
- `annual_volatility`: Annualized volatility as a decimal (e.g., 0.2 for 20%).
- `n_rows`: Number of days, hours, or minutes (rows) to generate.
- `num_series`: Number of independent time series (columns) to generate.
- `freq`: The frequency of returns ('D' for daily, 'H' for hourly, 'min' for minute).
- `dist`: Type of return distribution (only normal and log-normal distributions are supported).
- `random_state`: Random seed, set in the example below so that the output is reproducible.
- `start_date`: Start date for the series in the format 'YYYY-MM-DD'.

The values of `expected_annual_return` and `annual_volatility` come from the metrics computed above, `freq` matches the daily frequency of the data, and `dist` follows from the distribution plots. Below, the function simulates 365 daily returns for the `SP` index from its historical annualized return and volatility, assuming normally distributed returns and starting on 2024-01-01. The result is a Pandas DataFrame.

In [7]:
# Use the historical SP metrics as inputs to the simulation.
expected_annual_return = metrics.loc['SP', 'annual_return']
annual_volatility = metrics.loc['SP', 'annual_volatility']

generate_return_series(
    expected_annual_return,
    annual_volatility,
    n_rows=365,
    freq='D',
    num_series=1,
    dist='normal',
    random_state=524,
    start_date='2024-01-01',
)

| | series_1 |
|---|---|
| 2024-01-01 | -0.021539 |
| 2024-01-02 | 0.025164 |
| 2024-01-03 | 0.029528 |
| 2024-01-04 | 0.029991 |
| 2024-01-05 | 0.028716 |
| ... | ... |
| 2024-12-26 | -0.025290 |
| 2024-12-27 | 0.012590 |
| 2024-12-28 | 0.005374 |
| 2024-12-29 | 0.010130 |
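To make the idea behind such a simulation concrete, here is a minimal sketch of one common approach: convert the annual figures to per-day parameters and draw independent normal returns with NumPy. The function `simulate_normal_returns` and the `periods_per_year=365` scaling are assumptions for illustration, not quanteda's implementation, so its values will not reproduce the output above even with the same seed.

```python
import numpy as np
import pandas as pd


def simulate_normal_returns(expected_annual_return: float,
                            annual_volatility: float,
                            n_rows: int,
                            start_date: str,
                            periods_per_year: int = 365,
                            seed=None) -> pd.DataFrame:
    """Draw i.i.d. normal daily returns scaled from annual return and volatility."""
    # Daily mean that compounds to the expected annual return.
    period_mean = (1.0 + expected_annual_return) ** (1.0 / periods_per_year) - 1.0
    # Square-root-of-time scaling of the annual volatility.
    period_std = annual_volatility / np.sqrt(periods_per_year)
    rng = np.random.default_rng(seed)
    values = rng.normal(period_mean, period_std, size=n_rows)
    index = pd.date_range(start=start_date, periods=n_rows, freq="D")
    return pd.DataFrame({"series_1": values}, index=index)


# Inputs taken from the SP row of the metrics table above.
simulated = simulate_normal_returns(
    0.236469, 0.29596, n_rows=365, start_date="2024-01-01", seed=524
)
print(simulated.head())
```

Fixing the seed makes the draw repeatable, which is the role `random_state` plays in the call above.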


## Reference

Akbilgic, Oguz. (2013). Istanbul Stock Exchange. UCI Machine Learning Repository. https://doi.org/10.24432/C54P4J