## Assignment 3

This assignment is based on content discussed in modules 3 to 5 and tests basic concepts of statistical inference theory and probability distributions.

## Learning outcomes

-   Work on problems involving different distributions, e.g., Gaussian
-   Calculate a z-score
-   Make statistical inferences from given data
-   Construct a null and an alternative hypothesis
-   Find the p-value for a given hypothesis and t-test statistic

**Question 1**

The Capital Asset Pricing Model (CAPM) is a financial model that assumes returns on a portfolio are normally distributed.  Suppose a portfolio has an average annual return of 14.7% (i.e., an average gain of 14.7%) with a standard deviation of 33%.  A return of 0% means the value of the portfolio doesn't change, a negative return means that the portfolio loses money, and a positive return means that the portfolio gains money. Determine the following:

1. What percentage of years does this portfolio lose money (i.e., have a return less than 0%)?
2. What is the cutoff for the highest 15% of annual returns with this portfolio?

For background on CAPM, see https://en.wikipedia.org/wiki/Capital_asset_pricing_model

**Question 2**

Past experience indicates that because of low morale, a company loses 20 hours a year per employee due to lateness and absenteeism.  Assume that the population is normally distributed with a standard deviation of 6 hours.

The HR department implemented a new rewards system to increase employee morale, and after a few months it collected a random sample of 20 employees, whose average annualized absenteeism was 14 hours.

1. Could you confirm that the new rewards system was effective with 90% confidence?
2. An HR subject matter expert would be very happy if the program could reduce absenteeism by 20% (i.e., to 16 hours).  Given the current sampling parameters (sample size of 20 and population standard deviation of 6), what is the probability that the new rewards system reduced absenteeism to 16 hours and you miss it?
3. What should the sample size be if you want β to be 5%?

**Question 3**

Chi-Square Goodness of fit

Please access and review **section 6.3.5** in the OpenIntro Statistics textbook:

Diez, D., Çetinkaya-Rundel, M. & Barr, C (2019). OpenIntro Statistics (4th Ed.). https://leanpub.com/openintro-statistics

Given the information in section 6.3.5, write Python code for the following:

 - Calculate the expected values based on the geometric distribution with a probability of 53.2%
 - Compare the expected vs. the observed values from the textbook using the Chi-Square distribution
 - Reach a conclusion
 - Explain the business impact of your conclusion

In [1]:
import numpy as np
from scipy import stats
import pandas as pd
from scipy.stats import chisquare

In [2]:
#P(r < 0)
#1) What percentage of years does this portfolio lose money, (i.e. have a return less than 0%)?

mean = 14.7
std_dev = 33
stats.norm(mean, std_dev).cdf(0)

0.3279956507031998

The portfolio is expected to have a negative return in about 32.8% of years.

In [3]:
#2) What is the cutoff for the highest 15% of annual returns with this portfolio?
stats.norm(mean, std_dev).ppf(0.85)

48.90230185329506

In the highest 15% of years, the return is greater than 48.90%.
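The same two answers can be reached by standardizing to a z-score first, which is the calculation the learning outcomes refer to. The following is a minimal sketch that uses only the Python standard library (`statistics.NormalDist`) instead of SciPy; the variable names are illustrative and do not come from the notebook.

```python
from statistics import NormalDist

mean_return = 14.7   # average annual return, in percent
sd_return = 33.0     # standard deviation of annual returns, in percent

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

# Part 1: standardize a 0% return, then look up the left-tail area.
z_loss = (0 - mean_return) / sd_return
p_loss = std_normal.cdf(z_loss)

# Part 2: find the z-score with 85% of the area below it, then convert back to a return.
z_top15 = std_normal.inv_cdf(0.85)
cutoff = mean_return + z_top15 * sd_return

print(f"z = {z_loss:.4f}, P(return < 0) = {p_loss:.4f}")
print(f"z = {z_top15:.4f}, cutoff = {cutoff:.2f}%")
```

Both routes should agree with the results above: a loss probability of about 32.8% and a cutoff of about 48.90%.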

**Question 2**

In [4]:
# Given data
pop_mean = 20 # Mean absence of 20 hrs per employee per year
std_dev = 6 # Population standard deviation
sample_size = 20 # Number of employees in the sample
sample_mean = 14 # Sample mean of hours absent after the new system's implementation
confidence_lvl = 0.90 # 90% confidence
alpha = 1 - confidence_lvl # Significance level (alpha) for the test

In [5]:
#1) Hypothesis test (to check whether the new rewards system is effective)
# H0: mean absenteeism under the new rewards system is unchanged (mu = 20)
# HA: mean absenteeism under the new rewards system is lower (mu < 20)

st_error = std_dev/np.sqrt(sample_size) #standard error of mean
z_score = (sample_mean - pop_mean) / st_error #z-score for the sample mean
# Find the p-value for the one-tailed test
p_value = stats.norm.cdf(z_score)

# Decision: if p-value < alpha, we reject the null hypothesis.
test_result = p_value < alpha
p_value , test_result

(3.872108215522035e-06, True)

The p-value is far below alpha (0.10), so we reject the null hypothesis and conclude, at the 90% confidence level, that the new rewards system was effective at reducing absenteeism.
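As a complementary check, the same conclusion can be read off a one-sided confidence bound. The sketch below uses only the Python standard library (`statistics.NormalDist`) rather than SciPy, and its variable names are illustrative rather than taken from the notebook. It computes a 90% upper confidence bound for mean absenteeism under the new system.

```python
from math import sqrt
from statistics import NormalDist

pop_mean = 20        # hours lost per employee per year before the program (H0 value)
pop_sd = 6           # known population standard deviation
n = 20               # sample size
sample_mean = 14     # observed sample mean after the program
confidence = 0.90

se = pop_sd / sqrt(n)                          # standard error of the sample mean
z_conf = NormalDist().inv_cdf(confidence)      # z-score with 90% of the area below it
upper_bound = sample_mean + z_conf * se        # one-sided 90% upper confidence bound

print(f"90% upper bound for the new mean: {upper_bound:.2f} hours")
print("Bound is below the historical mean:", upper_bound < pop_mean)
```

Because the upper bound (about 15.72 hours) sits below the historical mean of 20 hours, it agrees with rejecting the null hypothesis.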

In [6]:
# 2) Probability that the new rewards system reduced absenteeism to 16 hours but we miss it (Type II error)

z_critical = stats.norm.ppf(alpha)  # One-tailed (lower-tail) test, so the critical z is negative

# x_critical: cutoff above which we fail to reject the null hypothesis
x_critical = pop_mean + z_critical * st_error

# Probability of Type II error given that the true mean is 16 hrs
z_score_beta = (x_critical - 16) / st_error  # z-score of the cutoff under the alternative hypothesis
beta = 1 - stats.norm.cdf(z_score_beta)  # P(sample mean > x_critical | true mean = 16)
round(x_critical, 4), round(beta, 4)


(18.2806, 0.0446)

The probability of missing it is the area to the right of x_critical under the alternative distribution (true mean of 16 hours), which is about 4.5%. There is therefore roughly a 4.5% chance of failing to detect a reduction to 16 hours.

In [7]:
#3) What should the sample size be if you want β to be 5% 

# Solve Zα * SE + Zβ * SE = distance for n, where distance = 20 - 16 = 4 and SE = σ/√n.
# Rearranging gives n = ((Zα + Zβ) * σ / distance)**2.

z_alpha = stats.norm.ppf(1 - alpha) # for alpha = 0.10 (90% confidence)
z_beta = stats.norm.ppf(0.95) # for beta = 5%

# Calculate the required sample size
req_samp_size = ((z_alpha + z_beta) * std_dev / 4) ** 2
req_samp_size

19.268656539002937

Rounding up to the next whole employee, a sample size of 20 is sufficient to achieve β = 5% (95% power).
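One way to sanity-check a sample-size result is to plug neighbouring values of n back into the Type II error calculation. The sketch below does that with the Python standard library only; `required_sample_size` and `type_ii_error` are hypothetical helper names introduced for illustration, and the test is assumed to be a lower-tailed z-test with a known population standard deviation.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def required_sample_size(mu0, mu1, sigma, alpha, beta):
    """Smallest n for a one-sided z-test with the given Type I and Type II error rates."""
    z_alpha = Z.inv_cdf(1 - alpha)
    z_beta = Z.inv_cdf(1 - beta)
    return ceil(((z_alpha + z_beta) * sigma / abs(mu0 - mu1)) ** 2)

def type_ii_error(n, mu0, mu1, sigma, alpha):
    """P(fail to reject H0: mu = mu0) when the true mean is mu1 < mu0 (lower-tailed z-test)."""
    se = sigma / sqrt(n)
    x_crit = mu0 + Z.inv_cdf(alpha) * se     # reject H0 when the sample mean is below this cutoff
    return 1 - Z.cdf((x_crit - mu1) / se)    # chance the sample mean lands above the cutoff

n_req = required_sample_size(mu0=20, mu1=16, sigma=6, alpha=0.10, beta=0.05)
print("required n:", n_req)
for n in (n_req - 1, n_req):
    print(f"n = {n}: beta = {type_ii_error(n, 20, 16, 6, 0.10):.4f}")
```

With these inputs the formula rounds up to 20, and β moves from just above 5% at n = 19 to just below 5% at n = 20.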

**Question 3**

Chi-Square Goodness of fit

In [8]:
# Given probability of a positive trading day
p = 0.532

# Total number of observed streaks
total_streaks = 1362

# Observed data from the textbook; the last bin (Days = 7) holds streaks of 7 or more days
observed_data = {'Days': [1, 2, 3, 4, 5, 6, 7],
                 'Observed': [717, 369, 155, 69, 28, 14, 10]}
df_observed = pd.DataFrame(observed_data)

# Expected counts for 1 to 6 days from the geometric distribution:
# P(X=k) = (1-p)**(k-1) * p and ExpectedCount = P(X=k) * total_streaks
expected_values = []
for day in df_observed['Days'].iloc[:-1]:
    expected_count = ((1 - p) ** (day - 1)) * p * total_streaks
    expected_values.append(expected_count)

# Expected count for the last bin uses the tail probability P(X>=7) = (1-p)**6,
# so the expected counts sum to total_streaks without any rescaling
expected_7_above = ((1 - p) ** 6) * total_streaks
expected_values.append(expected_7_above)

# Add expected values to the dataframe
df_observed['Expected'] = expected_values

print(df_observed.round(2))

# Chi-square goodness of fit (degrees of freedom = 7 bins - 1 = 6)
chi2_statistic, pval = chisquare(df_observed['Observed'], df_observed['Expected'])

# Result
round(chi2_statistic, 3), round(pval, 3)

   Days  Observed  Expected
0     1       717    724.58
1     2       369    339.11
2     3       155    158.70
3     4        69     74.27
4     5        28     34.76
5     6        14     16.27
6     7        10     14.31


(6.104, 0.412)

The p-value is about 0.412, which is well above a typical significance level such as 0.05, so we fail to reject the null hypothesis.
This suggests that the observed data do not deviate significantly from the expected values derived from a geometric distribution with p = 0.532.
Therefore, there is no strong evidence that daily stock returns depend on the previous day's returns.
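To see which streak lengths drive the test statistic, the goodness-of-fit calculation can also be written out by hand. This is a sketch under two stated assumptions: the last bin is treated as "7 or more days", so it receives the tail probability (1 - p)^6, and the p-value uses a closed-form upper-tail formula that is valid only for an even number of degrees of freedom. It uses the standard library only, and the variable names are illustrative.

```python
from math import exp, factorial

p_up = 0.532                                  # probability of a positive trading day
observed = [717, 369, 155, 69, 28, 14, 10]    # streak counts for 1, 2, ..., 6 and 7+ days
labels = ["1", "2", "3", "4", "5", "6", "7+"]
n_total = sum(observed)

# Geometric probabilities for 1..6 days, plus the tail P(X >= 7) = (1 - p)^6 for the last bin.
probs = [(1 - p_up) ** (k - 1) * p_up for k in range(1, 7)]
probs.append((1 - p_up) ** 6)
expected = [n_total * pr for pr in probs]

# Each bin's contribution to the chi-square statistic.
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi2_stat = sum(contributions)

# Upper-tail probability of the chi-square distribution (closed form for even df only).
dof = len(observed) - 1
half = chi2_stat / 2
p_value = exp(-half) * sum(half ** i / factorial(i) for i in range(dof // 2))

for label, o, e, c in zip(labels, observed, expected, contributions):
    print(f"{label:>2}  observed={o:4d}  expected={e:7.2f}  contribution={c:.3f}")
print(f"chi2 = {chi2_stat:.3f}, df = {dof}, p-value = {p_value:.3f}")
```

Under these assumptions the 2-day bin contributes the most to the statistic, yet the total stays far too small to reject the geometric model at the 0.05 level.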


Business Impact:

Based on these data, daily stock movements in the S&P500 appear to be independent of previous trading days. For traders, this is consistent with the efficient market hypothesis, which states that it’s difficult to predict future returns based solely on past prices. Failing to reject the null hypothesis does not prove independence, but it does mean that up/down streak patterns alone offer no demonstrated edge. As a result, traders may need to look beyond daily patterns and incorporate other factors or data points to develop predictive strategies.