## Problem 1
Given the dataset in problem1.csv
- A. Calculate the Mean, Variance, Skewness and Kurtosis of the data
- B. Given a choice between a Normal Distribution and a T-Distribution, which one would you
choose to model the data? Why?
- C. Fit both distributions and prove or disprove your choice in B using methods presented in
class.

In [1]:
import pandas as pd
import numpy as np
from scipy.stats import norm, t, kurtosis, skew

In [2]:
# Read data
data = pd.read_csv("problem1.csv")
data.head()

Unnamed: 0,X
0,-0.118037
1,0.149343
2,-0.083849
3,-0.025407
4,0.119084


In [3]:
# Drop any missing values before computing statistics
data_values = data['X'].dropna()

In [4]:
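# Calculate statistics
# Note: np.var defaults to the population variance (ddof=0), and scipy's
# kurtosis() returns excess kurtosis (Normal = 0), not raw kurtosis (Normal = 3)
mean_value = np.mean(data_values)
variance_value = np.var(data_values)
skewness_value = skew(data_values)
kurtosis_value = kurtosis(data_values)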

# Output results
print(f"Mean: {mean_value}")
print(f"Variance: {variance_value}")
print(f"Skewness: {skewness_value}")
print(f"Kurtosis: {kurtosis_value}")

Mean: 0.05019795790476916
Variance: 0.010322143931072109
Skewness: 0.1204447119194402
Kurtosis: 0.2229270674503816


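If the data are roughly symmetric (skewness close to 0) and the kurtosis is close to 3 (equivalently, excess kurtosis close to 0), a Normal distribution is the more plausible model.
If the data are heavy-tailed (kurtosis greater than 3, i.e. positive excess kurtosis), the T distribution is better suited. Note that `scipy.stats.kurtosis` reports excess kurtosis by default, so the value 0.22 above should be compared with 0, not 3. Both distributions are symmetric, so strong skewness would argue against either model rather than in favor of the T distribution.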

In [5]:
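# Choose the model
# Caveat: kurtosis_value is excess kurtosis (scipy default), so subtracting 3
# here makes the check fail even for near-Normal data
if abs(skewness_value) < 0.5 and abs(kurtosis_value - 3) < 1: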
    print("The data tends to be more normally distributed.")
else:
    print("The data is more inclined to T distribution.")

The data is more inclined to T distribution.
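The check above compares scipy's excess kurtosis against 3. A minimal sketch of the same rule on a consistent scale is shown here, reusing the notebook's thresholds (0.5 for skewness, 1 for kurtosis). The function name `moment_check` and the small `sample` array are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import kurtosis


def moment_check(skewness, excess_kurt, skew_tol=0.5, kurt_tol=1.0):
    """Return 'normal' when both moments are near the Normal reference values (0 and 0)."""
    if abs(skewness) < skew_tol and abs(excess_kurt) < kurt_tol:
        return "normal"
    return "t"


# Skewness and (excess) kurtosis values printed by the notebook above
print(moment_check(0.1204447119194402, 0.2229270674503816))  # prints "normal"

# The two kurtosis conventions differ by exactly 3
sample = np.array([-1.2, -0.4, 0.0, 0.3, 0.5, 1.9])  # arbitrary illustrative values
excess = kurtosis(sample)                  # Fisher definition, Normal = 0
pearson = kurtosis(sample, fisher=False)   # Pearson definition, Normal = 3
print(pearson - excess)
```

Passing `fisher=False` to `kurtosis` and keeping the comparison against 3 would be an equivalent fix.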


In [6]:
t_params = t.fit(data_values)
print(f"T-Distribution Parameters: df={t_params[0]}, Mean={t_params[1]}, Scale={t_params[2]}")

T-Distribution Parameters: df=28.71016692613074, Mean=0.04986675416872735, Scale=0.09800128766123102
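To get a feel for what a fitted `df` of about 28.7 implies, the sketch below compares upper-tail quantiles of a standardized T distribution with that `df` against the standard Normal. It uses only the `df` value printed above; the probability levels are arbitrary choices for illustration.

```python
from scipy.stats import norm, t

df_fit = 28.71016692613074  # df reported by t.fit in the notebook output

# Compare standardized upper-tail quantiles (loc=0, scale=1 for both)
for p in (0.95, 0.99, 0.999):
    q_t = t.ppf(p, df_fit)
    q_n = norm.ppf(p)
    print(f"p={p}: t quantile={q_t:.3f}, normal quantile={q_n:.3f}, ratio={q_t / q_n:.3f}")
```

The closer the ratios are to 1, the less the T distribution's tails differ from the Normal's at that level.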


In [7]:
from scipy.stats import norm, t

# Normal distribution fitting parameters
norm_params = norm.fit(data_values)
# Log-likelihood of normal distribution
log_likelihood_norm = np.sum(norm.logpdf(data_values, *norm_params))
# AIC and BIC of the Normal distribution (k = 2 parameters: loc, scale)
aic_norm = 2 * 2 - 2 * log_likelihood_norm
bic_norm = 2 * np.log(len(data_values)) - 2 * log_likelihood_norm

# T distribution fitting parameters
t_params = t.fit(data_values)
# Log-likelihood of T distribution
log_likelihood_t = np.sum(t.logpdf(data_values, *t_params))
# AIC and BIC of the T distribution (k = 3 parameters: df, loc, scale)
aic_t = 2 * 3 - 2 * log_likelihood_t
bic_t = 3 * np.log(len(data_values)) - 2 * log_likelihood_t

# Output results
print(f"AIC for Normal Distribution: {aic_norm}")
print(f"AIC for T-Distribution: {aic_t}")
print(f"BIC for Normal Distribution: {bic_norm}")
print(f"BIC for T-Distribution: {bic_t}")

AIC for Normal Distribution: -1731.586728836508
AIC for T-Distribution: -1731.4183689195443
BIC for Normal Distribution: -1721.7712182785438
BIC for T-Distribution: -1716.6951030825978


The moment-based check pointed to a **T-distribution**, but that result is an artifact of the kurtosis convention: `scipy.stats.kurtosis` returns excess kurtosis, so the reported 0.22 should be compared with 0 rather than 3. On that scale the data show only mild skewness (0.12) and tails that are slightly heavier than a Normal's. Comparing both fitted models using **AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion)** confirms that the **Normal distribution is the preferred model** for the data.  

- **AIC for Normal Distribution:** **-1731.59**  
- **AIC for T-Distribution:** **-1731.42**
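The **Normal distribution** has the lower AIC, but only by about 0.17, so AIC treats the two models as nearly equivalent. The T-distribution's extra parameter raises the log-likelihood by about 0.92, which is slightly less than the 1 log-likelihood unit (2 AIC points) that AIC charges for it.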

- **BIC for Normal Distribution:** **-1721.77**  
- **BIC for T-Distribution:** **-1716.70**
The **Normal distribution also has the lower BIC**, by about 5.08. BIC charges ln(n) per parameter instead of 2, so it penalizes the T-distribution's extra degrees-of-freedom parameter more heavily and favors the Normal distribution more clearly than AIC does.  

Although the initial threshold check suggested a T-distribution, **both AIC and BIC favor the Normal distribution**. The fitted T-distribution has about 28.7 degrees of freedom, which makes it very close to a Normal, so the extra parameter buys almost no improvement in fit. Based on **likelihood-based model selection with a penalty for complexity**, the **Normal distribution should be chosen** for this dataset.
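The AIC/BIC comparison can be wrapped in a small reusable helper. The following is a sketch under stated assumptions: `information_criteria` is a hypothetical name, the parameter count is taken as the length of the tuple returned by `fit`, and the synthetic `demo` sample stands in for `data_values`.

```python
import numpy as np
from scipy.stats import norm, t


def information_criteria(dist, x):
    """Fit a scipy.stats continuous distribution by MLE and return log-likelihood, AIC and BIC."""
    x = np.asarray(x, dtype=float)
    params = dist.fit(x)                        # shape parameters (if any), then loc and scale
    log_lik = np.sum(dist.logpdf(x, *params))
    k, n = len(params), len(x)
    return {
        "params": params,
        "loglik": log_lik,
        "aic": 2 * k - 2 * log_lik,             # AIC = 2k - 2 ln L
        "bic": k * np.log(n) - 2 * log_lik,     # BIC = k ln n - 2 ln L
    }


# Synthetic stand-in for data_values, used only to make the sketch runnable
rng = np.random.default_rng(42)
demo = rng.normal(loc=0.05, scale=0.1, size=1000)

results = {name: information_criteria(dist, demo) for name, dist in (("normal", norm), ("t", t))}
for name, res in results.items():
    print(f"{name}: loglik={res['loglik']:.3f}, AIC={res['aic']:.3f}, BIC={res['bic']:.3f}")
```

Lower AIC or BIC indicates the preferred model. Differences much smaller than 2 are usually read as the models being practically tied.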