# Advanced Statistical Functions

### Hypothesis Testing

**What is Hypothesis Testing?**
Hypothesis testing is a statistical method for deciding whether sample data provide enough evidence to reject an initial assumption about a population (the null hypothesis) in favor of an alternative hypothesis. It does not measure the probability that a hypothesis is true; it measures how surprising the observed data would be if the null hypothesis held.

**Example of Hypothesis Testing:**

Suppose a company claims that its new product increases the average duration of customer engagement on its platform from 15 minutes to 20 minutes. To test this claim, the engagement times of a sample of 30 customers are observed.

**Null Hypothesis (H0):** The average engagement duration is 15 minutes.
**Alternative Hypothesis (H1):** The average engagement duration is more than 15 minutes.

**Importance and Usage:**
- Hypothesis testing is vital in research for validating theories and results.
- Commonly used in fields like science, economics, and medicine to draw conclusions from data.

**How to Conduct Hypothesis Testing:**
1. Define Null and Alternative Hypotheses.
2. Choose a significance level (alpha, typically 0.05).
3. Calculate the test statistic and p-value from the sample data.
4. Compare the p-value with the alpha level: reject the Null Hypothesis if the p-value is below alpha; otherwise fail to reject it.
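
The steps above can be sketched end to end on simulated data. The sample below is generated with an arbitrary seed and hypothetical parameters (mean 19, standard deviation 5), so it illustrates the mechanics only and says nothing about any real product. The test statistic is computed by hand and then checked against SciPy; the `alternative` argument assumes SciPy 1.6 or later.

```python
import numpy as np
from scipy import stats

# Simulated engagement times in minutes (illustrative only, not real measurements)
rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=19.0, scale=5.0, size=30)

mu_0 = 15.0   # mean under the null hypothesis
alpha = 0.05  # significance level

# Step 3 by hand: t = (sample mean - mu_0) / (sample std / sqrt(n))
n = sample.size
t_manual = (sample.mean() - mu_0) / (sample.std(ddof=1) / np.sqrt(n))

# One-sided p-value for H1: mean > mu_0, using the t distribution with n - 1 degrees of freedom
p_manual = stats.t.sf(t_manual, df=n - 1)

# The same one-sided test via SciPy
t_scipy, p_scipy = stats.ttest_1samp(sample, mu_0, alternative="greater")

# Step 4: compare the p-value with alpha
print(f"t = {t_manual:.3f}, one-sided p = {p_manual:.4f}")
print("Reject H0" if p_manual < alpha else "Fail to reject H0")
```

Computing the statistic manually makes it clear that the p-value depends on the sample mean, the sample standard deviation, and the sample size together.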

**Python Code for Hypothesis Testing:**


In [None]:
import numpy as np
from scipy import stats

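# Sample data (engagement time in minutes)
# Placeholder: replace [...] with the 30 observed engagement times before running
sample_data = np.array([...])  # Example data

# Performing a one-sample, one-sided t-test (H1: mean > 15).
# The `alternative` argument requires SciPy 1.6 or later.
t_statistic, p_value = stats.ttest_1samp(sample_data, 15, alternative="greater")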

# Interpreting the result
alpha = 0.05  # Significance level
if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")



### Variance

**What is Variance?**
Variance measures how much the values in a dataset vary from the mean. It's a key concept in probability and statistics, indicating the degree of spread or dispersion in the data.
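
For reference, the two standard definitions can be written as follows. The notation is an assumption of this note: $\mu$ and $N$ denote the population mean and size, while $\bar{x}$ and $n$ denote the sample mean and size.

$$
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2
$$

The first is the population variance; the second is the sample variance, which divides by $n - 1$ to correct the bias that arises from estimating the mean from the same data.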

**Importance and Example:**
- Understanding variance helps in assessing risk, quality control, and variability in data.
- Example: In finance, high variance in the return of an investment indicates higher risk.

**Example of Variance:**

Consider an investment portfolio with annual returns over the past 5 years as follows: 8%, 12%, -5%, 10%, 7%.
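
As a worked check, the variance of these five returns can be computed step by step. This is a minimal sketch; the variable names are illustrative, and both the population form (divide by N) and the sample form (divide by N - 1) are shown because they differ noticeably for only five observations.

```python
import numpy as np

# Annual returns from the example, expressed as decimals
returns = np.array([0.08, 0.12, -0.05, 0.10, 0.07])

# Mean return: (0.08 + 0.12 - 0.05 + 0.10 + 0.07) / 5 = 0.064
mean_return = returns.sum() / returns.size

# Squared deviations from the mean
squared_dev = (returns - mean_return) ** 2

# Population variance divides by N; sample variance divides by N - 1
population_variance = squared_dev.sum() / returns.size
sample_variance = squared_dev.sum() / (returns.size - 1)

print(f"Population variance: {population_variance:.6f}")  # 0.003544
print(f"Sample variance:     {sample_variance:.6f}")      # 0.004430
```

The population figure matches `np.var(returns)`, and the sample figure matches `np.var(returns, ddof=1)`.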

**Python Code for Variance Calculation:**


In [None]:
import numpy as np

# Annual returns of the investment
returns = np.array([0.08, 0.12, -0.05, 0.10, 0.07])

# Calculating variance
# np.var defaults to the population variance (ddof=0);
# pass ddof=1 for the sample variance
variance = np.var(returns)
print("Variance of the investment returns:", variance)



![Alt text](../media/2_variance.png)

### Skewness and Kurtosis

**What is Skewness?**
Skewness measures the asymmetry of the probability distribution of a real-valued random variable. Positive skew indicates a tail on the right side of the distribution, and negative skew indicates a tail on the left.

**What is Kurtosis?**
Kurtosis measures the "tailedness" of the distribution. High kurtosis means more of the variance is due to infrequent extreme deviations, as opposed to frequent modestly sized deviations.

**Real-World Example for Skewness and Kurtosis:**

Consider a dataset of the heights of adult males in a certain region. If heights are approximately normally distributed, the skewness should be close to 0 and the excess kurtosis should also be close to 0.

**Python Code for Calculation:**


In [None]:
import numpy as np
from scipy.stats import skew, kurtosis

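# Example dataset (heights in cm)
# Placeholder: replace [...] with measured heights before running
heights = np.array([...])  # Example data

# Calculating skewness and kurtosis
# scipy.stats.kurtosis returns excess kurtosis by default (a normal distribution gives 0)
skewness = skew(heights)
kurtosis_value = kurtosis(heights)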

print("Skewness:", skewness)
print("Kurtosis:", kurtosis_value)



**Usage:**
- Skewness and kurtosis are used in data analysis to describe the shape of a distribution: its asymmetry and the weight of its tails.
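
To see these measures on concrete numbers, the sketch below uses simulated heights rather than real measurements. The mean of 175 cm and standard deviation of 7 cm are hypothetical values chosen only for illustration, and an exponential sample is added as a right-skewed comparison.

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(seed=42)

# Simulated heights in cm (hypothetical parameters, not survey data)
simulated_heights = rng.normal(loc=175.0, scale=7.0, size=10_000)

height_skew = skew(simulated_heights)
# Default (Fisher) definition: a normal distribution has excess kurtosis 0
excess_kurt = kurtosis(simulated_heights)
# Pearson definition: a normal distribution has kurtosis 3
pearson_kurt = kurtosis(simulated_heights, fisher=False)

# Exponential data has a long right tail, so its skewness is clearly positive
right_skewed = rng.exponential(scale=1.0, size=10_000)
right_skew_value = skew(right_skewed)

print("Normal sample skewness:", height_skew)
print("Normal sample excess kurtosis:", excess_kurt)
print("Normal sample Pearson kurtosis:", pearson_kurt)
print("Exponential sample skewness:", right_skew_value)
```

The two kurtosis values differ by exactly 3, which is worth keeping in mind when comparing results across libraries that use different conventions.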

![Alt text](../media/2_skewness.jpg)
![Alt text](../media/2_Kurtosis.jpg)

### Probability Distributions

Using NumPy's random module, we can simulate a variety of probability distributions. Simulated data is useful in statistical modeling and hypothesis testing because the true parameters of the distribution are known in advance.

#### Simulating a Normal Distribution


In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Generate a normal distribution
mean = 0
std_dev = 1
samples = np.random.normal(mean, std_dev, 1000)

# Plotting the distribution
plt.hist(samples, bins=30, density=True)
plt.title("Normal Distribution")
plt.xlabel("Value")
plt.ylabel("Density")  # density=True normalizes the histogram, so the y-axis is density, not counts
plt.show()
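
The same approach extends to other distributions. The sketch below uses NumPy's `default_rng` Generator interface instead of the legacy `np.random.normal` call above; the parameter values are arbitrary choices for illustration. Each sample mean is printed so it can be compared with the theoretical mean (0.5, 3, 4, and 2 respectively).

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n_samples = 10_000

# Draws from several common distributions (parameters are illustrative)
uniform_samples = rng.uniform(low=0.0, high=1.0, size=n_samples)   # mean 0.5
binomial_samples = rng.binomial(n=10, p=0.3, size=n_samples)       # mean n * p = 3
poisson_samples = rng.poisson(lam=4.0, size=n_samples)             # mean lam = 4
exponential_samples = rng.exponential(scale=2.0, size=n_samples)   # mean scale = 2

draws_by_name = {
    "uniform": uniform_samples,
    "binomial": binomial_samples,
    "poisson": poisson_samples,
    "exponential": exponential_samples,
}

# Sample means should land near the theoretical means for large n_samples
for name, draws in draws_by_name.items():
    print(f"{name:12s} mean={draws.mean():.3f} var={draws.var():.3f}")
```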



### Statistical Sampling

Statistical sampling involves selecting a subset of data from a larger dataset. NumPy provides tools for both random sampling and data shuffling.

#### Random Sampling


In [None]:
# Random sampling from an array
data = np.arange(10)
sample = np.random.choice(data, size=5, replace=False)
print("Random Sample:", sample)



#### Data Shuffling


In [None]:
# Shuffling data (np.random.shuffle reorders the array in place and returns None)
np.random.shuffle(data)
print("Shuffled Data:", data)
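
A few variations on sampling and shuffling are sketched below with the `default_rng` Generator interface. The weights are made up for illustration. The sketch contrasts sampling with and without replacement, shows weighted sampling through the `p` argument, and uses `permutation`, which returns a shuffled copy and leaves the original array untouched.

```python
import numpy as np

rng = np.random.default_rng(seed=7)
data = np.arange(10)

# Without replacement: every drawn element is distinct
without_replacement = rng.choice(data, size=5, replace=False)

# With replacement: the same element may be drawn more than once
with_replacement = rng.choice(data, size=5, replace=True)

# Weighted sampling: larger values are more likely (illustrative weights)
weights = np.linspace(1, 10, num=10)
weights = weights / weights.sum()  # probabilities must sum to 1
weighted_sample = rng.choice(data, size=5, p=weights)

# permutation returns a shuffled copy; `data` itself stays in order
shuffled_copy = rng.permutation(data)

print("Without replacement:", without_replacement)
print("With replacement:   ", with_replacement)
print("Weighted sample:    ", weighted_sample)
print("Shuffled copy:      ", shuffled_copy)
print("Original data:      ", data)
```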



### Descriptive Statistics

NumPy offers functions to calculate descriptive statistics such as the mean, median, and variance, which describe the central tendency and variability of a dataset's distribution. Shape measures such as skewness and kurtosis are provided by SciPy.

#### Calculating Variance, Skewness, and Kurtosis


In [None]:
# Variance
variance = np.var(samples)
print("Variance:", variance)

# Skewness and Kurtosis require scipy
from scipy.stats import skew, kurtosis
print("Skewness:", skew(samples))
print("Kurtosis:", kurtosis(samples))
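
The measures of central tendency and spread mentioned at the start of this section can be computed in the same way. This is a minimal sketch on a freshly simulated standard normal sample; the seed and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=3)
values = rng.normal(loc=0.0, scale=1.0, size=1_000)

# Central tendency
mean_value = np.mean(values)
median_value = np.median(values)

# Spread: sample standard deviation and interquartile range
std_value = np.std(values, ddof=1)
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

print("Mean:", mean_value)
print("Median:", median_value)
print("Sample standard deviation:", std_value)
print("Interquartile range:", iqr)
```

For a symmetric distribution such as the normal, the mean and median should be close to each other; a large gap between them is a quick hint of skew.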



**Expected Outputs**:
- Plot of the normal distribution.
- A random sample from the array.
- Shuffled version of the original array.
- Calculated variance, skewness, and kurtosis of the distribution.

---

### Exercise

1. **Sampling and Shuffling Data**: Perform random sampling on a dataset, followed by shuffling the sampled data.
2. **Compute Descriptive Statistics**: Calculate mean, median, variance, skewness, and kurtosis of a randomly generated dataset.

**Expected Outputs**:
- A randomly sampled subset and its shuffled version.
- Calculated mean, median, variance, skewness, and kurtosis of the dataset.

---
