## Imports

In [10]:
# Execute before using this notebook if using google colab
kernel = str(get_ipython())
if 'google.colab' in kernel:    
    !wget https://raw.githubusercontent.com/fredzett/rmqa/master/utils.py -P local_modules -nc 
    !npx degit fredzett/rmqa/data data
    import sys
    sys.path.append('local_modules')

In [11]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
import statsmodels.api as sm
from patsy import dmatrices
from sklearn.cluster import KMeans

plt.rcParams['font.size'] = 12
plt.rcParams['figure.figsize'] = [9,7]
plt.rcParams['figure.dpi'] = 80
plt.rcParams["axes.spines.right"] = False
plt.rcParams["axes.spines.top"] = False

# Simulation

Use simulation to answer the following two exercises.

## Exercise 1

A game involving a coin toss has the following rules:

The player tosses a coin repeatedly until a tail appears or tosses it a maximum of 1000 times if no tail appears. The initial stake starts at 2 dollars and is doubled every time heads appears. The first time tails appears, the game ends and the player wins whatever is in the pot. Thus the player wins 2 dollars if tails appears on the first toss, 4 dollars if heads appears on the first toss and tails on the second, 8 dollars if heads appears on the first two tosses and tails on the third, and so on. 

_(Note: the coin is a fair coin with two sides (head and tail))_

__Question:__ What is the probability of profit if it costs __30__ dollars to participate in the game?
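
One possible way to set up the simulation is sketched below. It is a minimal sketch, not a reference solution. The function name `simulate_payouts`, the number of simulated games, and the seed are illustrative choices. Instead of tossing coin by coin, the sketch draws the number of tosses up to the first tail from a geometric distribution. It also assumes the pot simply stands at $2^{1000}$ if the 1000-toss cap is ever reached.

```python
import numpy as np

def simulate_payouts(n_games, max_tosses=1000, seed=0):
    """Simulate the payout of n_games independent games (illustrative helper)."""
    rng = np.random.default_rng(seed)
    # Number of tosses up to and including the first tail (fair coin, p = 0.5)
    tosses = rng.geometric(p=0.5, size=n_games)
    # The game stops after max_tosses at the latest
    tosses = np.minimum(tosses, max_tosses)
    # Pot: 2 dollars if the game ends on toss 1, 4 on toss 2, ... = 2**tosses
    return 2.0 ** tosses

payouts = simulate_payouts(200_000)
# Profit means the payout exceeds the participation fee of 30 dollars
prob_profit = np.mean(payouts > 30)
print(f"Estimated probability of profit: {prob_profit:.4f}")
```

As a cross-check, a payout above 30 dollars requires the first four tosses to be heads, so the estimate can be compared with `0.5 ** 4`.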

## Exercise 2

A box contains 100 coins, of which 99 are fair and one is double-headed, i.e.

- 99 of the 100 coins land heads or tails with equal probability (50% each)
- 1 of the 100 coins always lands heads

A coin is chosen at random from the box. The chosen coin is then flipped $7$ times, and it lands heads all 7 times. 

__Question:__ Given this information, what is the probability that the chosen coin is a fair coin?
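
A conditional probability like this can be estimated by simulating the whole experiment many times and keeping only the runs that match the observed outcome. The following is a minimal sketch under that idea. The variable names, the number of trials, and the seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_trials = 1_000_000   # number of simulated experiments (illustrative choice)
n_flips = 7

# Draw a coin from the box: index 0 is the double-headed coin, 1..99 are fair
is_fair = rng.integers(0, 100, size=n_trials) != 0
# Probability of heads for the chosen coin
p_heads = np.where(is_fair, 0.5, 1.0)
# Number of heads in 7 flips of the chosen coin
heads = rng.binomial(n_flips, p_heads)

# Condition on the observed outcome: all 7 flips landed heads
all_heads = heads == n_flips
prob_fair = is_fair[all_heads].mean()
print(f"Estimated P(fair | 7 heads): {prob_fair:.3f}")
```

The estimate can be compared with Bayes' rule, $P(\text{fair} \mid 7H) = \frac{0.99 \cdot 0.5^7}{0.99 \cdot 0.5^7 + 0.01 \cdot 1}$.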

# Resampling method

## Exercise 3

We are using a statistical model (e.g. linear regression) to predict a response variable $Y$ for a particular value of $X$. Describe how we might estimate the standard deviation of our prediction using a __bootstrap__ approach. 
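
To make the idea concrete, here is a minimal sketch of the bootstrap for a simple linear regression. The data are synthetic and purely illustrative, and the prediction point `x0`, the number of bootstrap samples `n_boot`, and the use of `np.polyfit` as the model are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data for illustration only: y = 1 + 2x + noise
n = 200
x = rng.uniform(0, 10, size=n)
y = 1.0 + 2.0 * x + rng.normal(0, 1, size=n)

x0 = 5.0        # value of X at which we want a prediction
n_boot = 1000   # number of bootstrap samples
preds = np.empty(n_boot)

for b in range(n_boot):
    # Resample the observations (rows) with replacement
    idx = rng.integers(0, n, size=n)
    # Refit the model on the bootstrap sample
    slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
    # Store the prediction at x0 from this refitted model
    preds[b] = intercept + slope * x0

# The standard deviation of the bootstrap predictions estimates
# the standard deviation of our prediction at x0
boot_sd = preds.std(ddof=1)
print(f"Bootstrap SD of the prediction at x0={x0}: {boot_sd:.4f}")
```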

## Exercise 4

Answer the following questions regarding __cross-validation__:

1. Explain why cross-validation is useful for evaluating the predictive performance of a statistical model

2. Explain how k-fold cross-validation is implemented

3. Explain the advantages and disadvantages of k-fold cross-validation relative to

    1. validation set approach
    
    2. LOOCV approach
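
The mechanics of k-fold cross-validation can be sketched in a few lines. This is a minimal illustration on synthetic data with a simple linear model. The helper name `k_fold_mse`, the data, and the choice of mean squared error as the metric are assumptions of this sketch.

```python
import numpy as np

def k_fold_mse(x, y, k=5, seed=0):
    """Estimate the test MSE of a simple linear fit via k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    # Shuffle the observation indices and split them into k folds
    folds = np.array_split(rng.permutation(len(x)), k)
    mses = []
    for i in range(k):
        test_idx = folds[i]   # fold i is held out
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on the remaining k-1 folds
        slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
        # Evaluate on the held-out fold
        pred = intercept + slope * x[test_idx]
        mses.append(np.mean((y[test_idx] - pred) ** 2))
    # Average the k held-out errors
    return np.mean(mses)

# Synthetic data for illustration only: y = 1 + 2x + noise with variance 1
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
y = 1.0 + 2.0 * x + rng.normal(0, 1, size=200)

cv_mse = k_fold_mse(x, y, k=5)
print(f"5-fold CV estimate of the test MSE: {cv_mse:.3f}")
```

Setting `k=len(x)` in this sketch leaves out one observation at a time, which corresponds to LOOCV.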

# Statistical models

## Exercise 5: Regression model

A friend of yours claims that he earns money by predicting today's stock return from the returns of previous days (so-called "lags"). Specifically, he argues that today's return can be explained by the returns of the five previous days.

Example: 

- today's return is +3%
- returns in the five previous days were +1%, -2%, +2%, +0.4% and -2% (called lag1 to lag5)

Then he argues that the +3% can be explained by the previous five day returns. 

__Exercise:__ 

Below you are given a dataset of approximately 5 years of daily returns of the German Stock Index (DAX 30). Use linear regression analysis to assess the validity of your friend's statements. Specifically, do the following:

1. Build a linear regression model, specifying y and X such that the above statement can be scrutinized in a reasonable way

2. Evaluate and interpret the model's goodness of fit (i.e. how good is the overall model at explaining stock returns)

3. Evaluate and interpret the regression coefficients (i.e. how do individual variables contribute to explaining stock returns). Also include the intercept (i.e. $\beta_0$) in your discussion

4. What is your conclusion from your analysis? What would you tell your friend regarding his statement?

In [12]:
df = pd.read_csv("./data/dax_lags.csv")
df.head()

| Unnamed: 0 | Date | volume | today | lag1 | lag2 | lag3 | lag4 | lag5 |
|---|---|---|---|---|---|---|---|---|
| 0 | 2016-01-04 | 119844300.0 | -0.042778 | -0.010785 | 0.019357 | -0.006873 | 0.0 | 0.022776 |
| 1 | 2016-01-05 | 84894800.0 | 0.002592 | -0.042778 | -0.010785 | 0.019357 | -0.006873 | 0.0 |
| 2 | 2016-01-06 | 90465700.0 | -0.009319 | 0.002592 | -0.042778 | -0.010785 | 0.019357 | -0.006873 |
| 3 | 2016-01-07 | 128029000.0 | -0.022926 | -0.009319 | 0.002592 | -0.042778 | -0.010785 | 0.019357 |
| 4 | 2016-01-08 | 98631100.0 | -0.013077 | -0.022926 | -0.009319 | 0.002592 | -0.042778 | -0.010785 |


In [13]:
df.tail()

| Unnamed: 0 | Date | volume | today | lag1 | lag2 | lag3 | lag4 | lag5 |
|---|---|---|---|---|---|---|---|---|
| 1253 | 2020-12-15 | 67265000.0 | 0.010566 | 0.008301 | -0.013646 | -0.003338 | 0.004652 | 0.000564 |
| 1254 | 2020-12-16 | 77798600.0 | 0.0152 | 0.010566 | 0.008301 | -0.013646 | -0.003338 | 0.004652 |
| 1255 | 2020-12-17 | 77206700.0 | 0.007465 | 0.0152 | 0.010566 | 0.008301 | -0.013646 | -0.003338 |
| 1256 | 2020-12-18 | 156772100.0 | -0.002688 | 0.007465 | 0.0152 | 0.010566 | 0.008301 | -0.013646 |
| 1257 | 2020-12-21 | 98290900.0 | -0.028187 | -0.002688 | 0.007465 | 0.0152 | 0.010566 | 0.008301 |


## Exercise 6: Logistic regression

Extend your stock return analysis by using a _logistic regression model_. 

__Exercise:__

Use the same data set as in Exercise 5. Build a logistic regression model to analyse your friend's statement that money can be earned from predicting stock returns.

Specifically do the following:

1. Build a logistic regression model using y and X. Keep in mind that logistic regression requires a binary response variable $y$. For this, create a variable "direction" that takes a value of 1 if today's stock return was positive and a value of 0 if today's stock return was negative (hint: use `np.where`). For X, consider trading volume (variable "volume") in addition to the variables lag1 to lag5.

2. Evaluate and interpret the model's goodness of fit (i.e. how good is the overall model at explaining stock return direction)

3. Evaluate and interpret the regression coefficients (i.e. how do individual variables contribute to explaining stock return direction). Also include the intercept (i.e. $\beta_0$) in your discussion

4. What is your conclusion from your analysis? What would you tell your friend regarding his statement?
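
The `np.where` hint from step 1 can be illustrated with a tiny example. The frame name `demo` is a placeholder, and its values are the first five "today" returns of the data set from Exercise 5.

```python
import numpy as np
import pandas as pd

# First five "today" returns of the DAX data set, used here as a small demo
demo = pd.DataFrame({"today": [-0.042778, 0.002592, -0.009319, -0.022926, -0.013077]})

# 1 if today's return is positive, 0 otherwise
demo["direction"] = np.where(demo["today"] > 0, 1, 0)
print(demo)
```

Note that this expression also codes a return of exactly zero as 0. The exercise only defines the positive and negative cases, so how to treat zero returns is a choice left to the analyst.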


In [14]:
df = pd.read_csv("./data/dax_lags.csv")
df.head()

| Unnamed: 0 | Date | volume | today | lag1 | lag2 | lag3 | lag4 | lag5 |
|---|---|---|---|---|---|---|---|---|
| 0 | 2016-01-04 | 119844300.0 | -0.042778 | -0.010785 | 0.019357 | -0.006873 | 0.0 | 0.022776 |
| 1 | 2016-01-05 | 84894800.0 | 0.002592 | -0.042778 | -0.010785 | 0.019357 | -0.006873 | 0.0 |
| 2 | 2016-01-06 | 90465700.0 | -0.009319 | 0.002592 | -0.042778 | -0.010785 | 0.019357 | -0.006873 |
| 3 | 2016-01-07 | 128029000.0 | -0.022926 | -0.009319 | 0.002592 | -0.042778 | -0.010785 | 0.019357 |
| 4 | 2016-01-08 | 98631100.0 | -0.013077 | -0.022926 | -0.009319 | 0.002592 | -0.042778 | -0.010785 |


# Clustering

## Exercise 7 - Clustering analysis

Generate a simulated data set with 150 observations and 50 variables. The dataset should contain three classes such that:

- class 1: has 50 observations and 50 variables, where all observations should be sampled from a normal distribution with $\mu = 0$ and $\sigma = 1$
- class 2: has 50 observations and 50 variables, where all observations should be sampled from a normal distribution with $\mu = 5$ and $\sigma = 1$
- class 3: has 50 observations and 50 variables, where all observations should be sampled from a normal distribution with $\mu = 10$ and $\sigma = 1$

Perform a __K-Means__ clustering analysis with the above observations with $K=3$ and $K=5$. 

- Explain how K-Means clustering works in your own words

- Compare the results from the clustering analysis to the true clusters

- Explain your findings

Hint: 
- create each class separately using `scipy.stats.norm` (remember that `.rvs` can be used to create random data from a defined distribution)
- combine all three classes into $X$ using `np.vstack((class1, class2, class3))`
- use `KMeans` from `sklearn.cluster` to conduct clustering analysis. 
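
The hints can be combined roughly as follows. This is a minimal sketch rather than a full solution. The seeds, the `n_init` setting, and the use of `pd.crosstab` to compare cluster assignments with the true classes are assumptions of this sketch.

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.cluster import KMeans

# One (50 x 50) block per class; the classes differ only in their mean
class1 = stats.norm(loc=0, scale=1).rvs(size=(50, 50), random_state=1)
class2 = stats.norm(loc=5, scale=1).rvs(size=(50, 50), random_state=2)
class3 = stats.norm(loc=10, scale=1).rvs(size=(50, 50), random_state=3)

X = np.vstack((class1, class2, class3))   # shape (150, 50)
true_labels = np.repeat([0, 1, 2], 50)    # true class of each row of X

labels = {}
for k in (3, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    labels[k] = km.labels_
    # Rows: true classes, columns: clusters found by K-Means
    print(pd.crosstab(true_labels, labels[k], rownames=["true"], colnames=["cluster"]))
```

The cluster numbers assigned by K-Means are arbitrary, so the comparison should look at which observations are grouped together, not at whether the label values match.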