# Distributions and Summary Statistics (mean, mode, median, std, variance)

In [2]:
import numpy as np
import pandas as pd
import plotly.express as px

Most common distributions:

* Normal Distribution (Gaussian)
* Uniform Distribution
* Exponential Distribution
* Binomial Distribution
* Poisson Distribution
* Gamma Distribution
* Beta Distribution
* Log-Normal Distribution
* Chi-Square Distribution
* Pareto Distribution

## Normal Dist. (Gaussian)

> What is it?

Analogy: Imagine shoe sizes of people in a city. Most people wear an average size (like size 8), but some wear very small or very large sizes.

Shape: Bell curve, with most values near the center (mean).

> Use Cases:

* Customer heights or weights.
* Price fluctuations around an average.
* Average product ratings from customers.

In [3]:
data = np.random.normal(loc=0, scale=1, size=1000)

fig = px.histogram(data, nbins=50, title="Normal Distribution")
fig.show()
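
To put a number on "most values near the center", here is a minimal sketch that measures how much of a simulated normal sample falls within 1, 2, and 3 standard deviations of the mean. The seed, sample size, and the helper name `fraction_within` are assumptions made for illustration.

```python
import numpy as np

# Seed is arbitrary; it only makes the run reproducible.
rng = np.random.default_rng(0)
sample = rng.normal(loc=0, scale=1, size=100_000)

def fraction_within(values, k):
    """Fraction of values within k standard deviations of the sample mean."""
    mean, std = values.mean(), values.std()
    return float(np.mean(np.abs(values - mean) <= k * std))

within_1 = fraction_within(sample, 1)
within_2 = fraction_within(sample, 2)
within_3 = fraction_within(sample, 3)

# Theory for a normal distribution: roughly 68%, 95%, and 99.7%
print(f"Within 1 std: {within_1:.3f}")
print(f"Within 2 std: {within_2:.3f}")
print(f"Within 3 std: {within_3:.3f}")
```

If real data gives fractions far from these, the bell-curve assumption is probably a poor fit.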

## Uniform Dist

> What is it?

Analogy: Think of a raffle where each number has an equal chance of being drawn.

Shape: Flat distribution where all outcomes are equally likely within a range.

> Use Cases:

* Random selection of products for testing.
* Assigning equal probabilities to promotional offers.
* When each size of a garment has an equal chance of being picked randomly.

In [4]:
data_2 = np.random.uniform(low=0, high=1, size=1000)
fig = px.histogram(data_2, nbins=50, title="Uniform Distribution")
fig.show()

## Exponential Dist

> What is it?

Analogy: Imagine the time between customers arriving at a checkout line. Long gaps are rare, and short gaps are more common.

Shape: Starts high and tapers off, with many small values and fewer large ones.

> Use Cases:

* Time between customer purchases.
* Time until a product runs out of stock.
* Waiting times for customer service queries.

In [5]:
data_3 = np.random.exponential(scale=1, size=1000)

fig = px.histogram(data_3, nbins=50, title="Exponential Distribution")
fig.show()
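
The "long gaps are rare" idea can be checked directly: for an exponential distribution, the chance that a gap exceeds `t` is `exp(-t / scale)`. The sketch below compares that formula with a simulated sample; the mean gap of 1.0 and the seed are assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
mean_gap = 1.0                  # hypothetical average gap between arrivals
gaps = rng.exponential(scale=mean_gap, size=100_000)

# Share of gaps longer than 1x, 2x, and 3x the average gap
tail_fractions = {m: float(np.mean(gaps > m * mean_gap)) for m in (1, 2, 3)}

for multiple, observed in tail_fractions.items():
    theoretical = np.exp(-multiple)
    print(f"gap > {multiple} x mean: observed {observed:.4f}, theory {theoretical:.4f}")
```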


## Binomial Dist

> What is it?

Analogy: Think of flipping a coin 10 times and counting how many times it lands on heads. This distribution tells you the likelihood of each possible number of heads.

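Shape: Discrete bars, one for each possible number of successes (e.g., heads) in a fixed number of trials. The bars peak near the expected count and are symmetric only when success and failure are equally likely.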

> Use Cases:

* Customer surveys where you track yes/no responses.
* Success rates of product launches.
* Modeling customer choices between two options (like online vs. in-store shopping).

In [9]:
data_4 = np.random.binomial(n=5, p=0.3, size=1000)

fig = px.histogram(data_4, nbins=10, title="Binomial Distribution")
fig.show()
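
For the coin-flip analogy itself (10 flips of a fair coin), the exact probabilities can be computed without simulation. This is a small sketch using the standard binomial formula; the helper name `binomial_pmf` is made up for the example.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n_flips, p_heads = 10, 0.5
pmf = [binomial_pmf(k, n_flips, p_heads) for k in range(n_flips + 1)]

for k, prob in enumerate(pmf):
    print(f"P({k} heads) = {prob:.4f}")
```

Five heads is the single most likely outcome, yet it happens only about a quarter of the time.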


## Poisson Dist

> What is it?

Analogy: Think of how many customers enter a store in an hour. You might get 2, 3, or sometimes more, but most of the time, the count hovers around a typical value.

Shape: Discrete and skewed to the right when the average count is small. Most counts sit near the average, and much larger counts (like a sudden rush of customers) become increasingly unlikely.

> Use Cases:

* Number of customers arriving at a store in a given time frame.
* Number of returns or defects in a batch of products.
* Customer service requests per hour.

In [10]:
data_5 = np.random.poisson(lam=3, size=1000)

fig = px.histogram(data_5, nbins=15, title="Poisson Distribution")
fig.show()
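
A handy property of the Poisson distribution is that its mean and variance are both equal to `lam`. The sketch below checks this on a simulated sample, using the same `lam=3` as the cell above; the seed and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
lam = 3
counts = rng.poisson(lam=lam, size=100_000)

sample_mean = counts.mean()
sample_var = counts.var()

print(f"Sample mean:     {sample_mean:.3f}")
print(f"Sample variance: {sample_var:.3f}")
```

If count data shows a variance much larger than its mean, a plain Poisson model is likely too simple for it.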


## Gamma Dist

> What is it?

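Analogy: Imagine waiting for several independent events to happen one after another, like how long it takes until the second customer makes a purchase. The Gamma distribution models the total waiting time until a given number of such events has occurred.

Shape: Depends on how many events are being waited for. With a single event it starts high and decays like the exponential distribution; with two or more (as with `shape=2` in the cell below) it rises to a peak and then tails off gradually to the right.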

> Use Cases:

* Time until a customer buys multiple products.
* Duration of events, like how long customers stay in a store.
* Modeling demand patterns over time for certain product lines.

In [11]:
data_6 = np.random.gamma(shape=2, scale=1, size=1000)
fig = px.histogram(data_6, nbins=50, title="Gamma Distribution")
fig.show()
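
The waiting-time analogy can be checked by simulation: adding two independent exponential waiting times should behave like a Gamma draw with `shape=2`. This is a rough sketch that compares means and variances only; the seed and sample size are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
n = 100_000

# Total time until the second event = sum of two exponential waits
waits = rng.exponential(scale=1, size=(n, 2))
total_wait = waits.sum(axis=1)

# Direct Gamma draws with shape=2 for comparison
gamma_sample = rng.gamma(shape=2, scale=1, size=n)

print(f"Sum of two waits: mean {total_wait.mean():.3f}, variance {total_wait.var():.3f}")
print(f"Gamma(shape=2):   mean {gamma_sample.mean():.3f}, variance {gamma_sample.var():.3f}")
```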


## Beta Dist

> What is it?

Analogy: Think of the probability of success in a situation where you’ve seen a few trials, like predicting the success of a new product after limited sales data.

Shape: Can be skewed to the left, right, or centered, depending on how likely a "success" is, making it very flexible for modeling probabilities.

> Use Cases:

* Customer satisfaction ratings.
* Probability of a customer making a repeat purchase.
* Predicting product success based on early feedback.

In [12]:
data_7 = np.random.beta(a=2, b=5, size=1000)
fig = px.histogram(data_7, nbins=50, title="Beta Distribution")
fig.show()
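
One common way to use the Beta distribution for the "limited sales data" analogy is a Bayesian update: start from a prior and add the observed successes and failures to its two parameters. The sketch below assumes a flat Beta(1, 1) prior and made-up counts (7 successes, 3 failures); it is an illustration, not the only way to model this.

```python
from scipy import stats

# Assumed flat prior: every success probability equally plausible
prior_a, prior_b = 1, 1

# Hypothetical early results: 7 customers bought, 3 did not
successes, failures = 7, 3

# Updated belief about the success probability
posterior = stats.beta(prior_a + successes, prior_b + failures)
posterior_mean = posterior.mean()
low, high = posterior.interval(0.95)

print(f"Estimated success probability: {posterior_mean:.3f}")
print(f"95% interval: {low:.3f} to {high:.3f}")
```

With so few trials the interval is wide, which is exactly the uncertainty the Beta distribution is meant to express.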


## Log-Normal Dist

> What is it?

**Analogy**: Imagine the distribution of incomes in a city where most people earn around an average amount but some earn much higher. This distribution models such skewed data.

**Shape**: Skewed to the right with a long tail, showing that while most values are clustered around a central point, there are a few extremely high values.

> Use Cases:

* Modeling income distributions where a few people earn significantly more than the average.
* Distribution of product prices where most items are priced around a median but a few are very expensive.
* Modeling time-to-failure of equipment where most equipment lasts a moderate amount of time but a few units last much longer.


In [13]:
data_8 = np.random.lognormal(mean=0, sigma=1, size=1000)
fig = px.histogram(data_8, nbins=50, title="Log-Normal Distribution")
fig.show()
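
The name comes from the fact that the logarithm of a log-normal variable is normally distributed. The sketch below checks this on simulated values and also shows the mean being pulled above the median by the long right tail. The seed and the variable name `incomes` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
incomes = rng.lognormal(mean=0, sigma=1, size=100_000)  # synthetic "incomes"

# Taking logs should recover a normal sample with mean 0 and std 1
log_incomes = np.log(incomes)
print(f"Log scale: mean {log_incomes.mean():.3f}, std {log_incomes.std():.3f}")

# On the original scale the long right tail pulls the mean above the median
print(f"Original scale: mean {incomes.mean():.3f}, median {np.median(incomes):.3f}")
```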


## Chi-Square Distribution

> What is it?

**Analogy**: Imagine measuring how far each item in a batch is from its target (in standardized units), squaring those deviations, and adding them up. The Chi-Square distribution describes how that total varies from batch to batch when the individual deviations are normally distributed.

**Shape**: Skewed to the right, with values clustering around lower numbers and a long tail extending towards higher numbers. It is continuous and non-negative, and it becomes more symmetric as the degrees of freedom increase.

> Use Cases:

* Analyzing the variability in quality control results.
* Testing the goodness-of-fit in statistical models.
* Testing whether two categorical variables (e.g., product type and defect category) are independent.


In [14]:
data_9 = np.random.chisquare(df=2, size=1000)
fig = px.histogram(data_9, nbins=50, title="Chi-Square Distribution")
fig.show()
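
One way to see where this distribution comes from: a Chi-Square variable with `df` degrees of freedom is the sum of `df` squared standard normal draws. The sketch below builds such sums by hand and checks that their mean is close to `df` and their variance close to `2 * df`. The seed and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
df = 2
n = 100_000

# Each row holds df standard normal draws; square them and add up per row
z = rng.normal(size=(n, df))
sum_of_squares = (z**2).sum(axis=1)

print(f"Mean:     {sum_of_squares.mean():.3f} (theory: {df})")
print(f"Variance: {sum_of_squares.var():.3f} (theory: {2 * df})")
```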


## Pareto Distribution

> What is it?

**Analogy**: Think of wealth distribution where a small percentage of people hold most of the wealth. This distribution models such scenarios where a large proportion of the total is held by a small proportion of the population.

**Shape**: Highly skewed to the right, with a steep drop-off. Most values are concentrated near the lower end, with a few extreme values extending far out.

> Use Cases:

* Modeling wealth distribution where a few people have most of the wealth.
* Distribution of sales or revenue where a small number of products generate most of the income.
* Analyzing the frequency of rare events or top-performing products. 


In [15]:
data_10 = np.random.pareto(a=2, size=1000)
fig = px.histogram(data_10, nbins=50, title="Pareto Distribution")
fig.show()
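
The "small proportion holds most of the total" idea can be measured directly. The sketch below computes the share of the total held by the top 20% of simulated values. Note that NumPy's `pareto` draws from the Pareto II (Lomax) form, so the exact share will not match the classic 80/20 figure; the seed and the name `revenue` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
revenue = rng.pareto(a=2, size=100_000)  # synthetic per-product revenue

# Sort from largest to smallest and take the top 20% of products
sorted_revenue = np.sort(revenue)[::-1]
top_count = int(0.2 * len(sorted_revenue))
top_share = sorted_revenue[:top_count].sum() / sorted_revenue.sum()

print(f"Top 20% of products hold {top_share:.1%} of total revenue")
```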


## Main Summary Functions

> 1. **Skewness**

- **Positive Skew (Right Skew)**: Imagine a long tail on the right side of a graph. Most data points are low, but a few are very high.
- **Negative Skew (Left Skew)**: Picture a long tail on the left side. Most data points are high, with a few very low ones.
- **Zero Skew**: The graph is symmetrical, like a bell curve.

> 2. **Kurtosis**

- **High Kurtosis**: Think of a peak with heavy tails. There are more extreme values (outliers) than usual.
- **Low Kurtosis**: Picture a flatter peak with light tails. Extreme values are rare.

> 3. **Tailedness**

- **Heavy Tailed**: Extreme values (outliers) happen more often. Like very high or very low scores in a test.
- **Light Tailed**: Extreme values are rare. Most scores are closer to the average.
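
Skewness and kurtosis are just averages of the cubed and fourth-power deviations from the mean, scaled by the spread. The sketch below writes both out by hand (the helper names are made up) and compares a symmetric, light-tailed sample with a right-skewed, heavier-tailed one. The kurtosis helper subtracts 3, matching Fisher's definition, so a normal distribution scores about 0.

```python
import numpy as np

def sample_skewness(values):
    """Third central moment divided by the variance to the power 1.5."""
    centered = values - values.mean()
    return float(np.mean(centered**3) / np.mean(centered**2) ** 1.5)

def excess_kurtosis(values):
    """Fourth central moment divided by the squared variance, minus 3."""
    centered = values - values.mean()
    return float(np.mean(centered**4) / np.mean(centered**2) ** 2 - 3)

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
normal_sample = rng.normal(size=100_000)
exponential_sample = rng.exponential(size=100_000)

for name, sample in [("normal", normal_sample), ("exponential", exponential_sample)]:
    print(f"{name}: skewness {sample_skewness(sample):.2f}, "
          f"excess kurtosis {excess_kurtosis(sample):.2f}")
```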


In [17]:
import numpy as np
import scipy.stats as stats
import plotly.express as px

# Generate example data
data = np.random.exponential(scale=1, size=1000)

# Calculate skewness
skewness = stats.skew(data)
print(f"Skewness: {skewness}")

# Plot histogram
fig = px.histogram(data, nbins=50, title='Histogram of Data with Skewness')
fig.show()


Skewness: 2.2669392223410263


In [18]:
import numpy as np
import scipy.stats as stats
import plotly.express as px

# Generate example data
data = np.random.normal(loc=0, scale=1, size=1000)

# Calculate kurtosis
kurtosis = stats.kurtosis(data, fisher=True)  # Fisher's definition (subtracts 3)
print(f"Kurtosis: {kurtosis}")

# Plot histogram
fig = px.histogram(data, nbins=50, title='Histogram of Data with Kurtosis')
fig.show()


Kurtosis: 0.0706305832303431


In [19]:
import numpy as np
import plotly.express as px

# Generate example data for heavy-tailed and light-tailed distributions
data_heavy_tailed = np.random.pareto(a=2, size=1000)
data_light_tailed = np.random.normal(loc=0, scale=1, size=1000)

# Create figures
fig_heavy = px.histogram(data_heavy_tailed, nbins=50, title='Heavy-Tailed Distribution')
fig_light = px.histogram(data_light_tailed, nbins=50, title='Light-Tailed Distribution')

# Show figures
fig_heavy.show()
fig_light.show()


**Explanation of Terms**

> Mean: The average value of the data.

> Mode: The most frequently occurring value in the data.

> Median: The middle value when the data is sorted.

> Variance: The average squared distance of the data from the mean, measuring how much the data varies.

> Standard Deviation: The square root of the variance, showing how spread out the data is.

> Covariance: Measures how two datasets vary together. Positive values indicate they increase together, negative values indicate one increases while the other decreases.

In [20]:
import numpy as np
import scipy.stats as stats

# Generate example data
data = np.random.normal(loc=0, scale=1, size=1000)

# Mean
mean = np.mean(data)
print(f"Mean: {mean}")

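# Mode (every value in a continuous sample is usually unique, so this is
# typically just the smallest value and not very informative)
# np.atleast_1d handles SciPy versions that return either a scalar or an array
mode = np.atleast_1d(stats.mode(data).mode)[0]
print(f"Mode: {mode}")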

# Median
median = np.median(data)
print(f"Median: {median}")

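# Variance (NumPy defaults to the population formula, ddof=0; pass ddof=1 for the sample variance)
variance = np.var(data)
print(f"Variance: {variance}")

# Standard Deviation (same ddof=0 default as np.var)
std_dev = np.std(data)
print(f"Standard Deviation: {std_dev}")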

# Generate two independent example datasets
data1 = np.random.normal(loc=0, scale=1, size=1000)
data2 = np.random.normal(loc=0, scale=1, size=1000)

# Covariance
covariance = np.cov(data1, data2)[0, 1]
print(f"Covariance: {covariance}")



Mean: -0.02263102178783502
Mode: -3.1174025143967796
Median: 0.02814054931444751
Variance: 0.9897105712690446
Standard Deviation: 0.9948419830651724
Covariance: -0.0412200933572272
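
The mode is far more useful on discrete data, where values actually repeat. In a continuous sample like the one above, nearly every value occurs exactly once, so the reported mode is most likely just the smallest value. Here is a small sketch on made-up customer ratings, counting occurrences with `np.unique`.

```python
import numpy as np

# Hypothetical 1-to-5 star ratings from ten customers
ratings = np.array([5, 4, 4, 3, 5, 4, 2, 4, 5, 1])

# Count how often each distinct rating appears
values, counts = np.unique(ratings, return_counts=True)
mode_value = values[np.argmax(counts)]
mode_count = counts.max()

print(f"Mode: {mode_value} (appears {mode_count} times)")
```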






## How to Detect Distribution Types?

**Step-by-Step Checklist**

> Visualize:
* Plot histogram: This gives you a visual sense of the data’s distribution shape (e.g., bell-shaped for normal).
* Use Q-Q plot: Compares your data’s quantiles to those of a theoretical distribution. If the points fall close to a straight line, it suggests a good fit.

> Calculate Statistics:
* Mean, Median: For a normal distribution (or any symmetric one), the mean and median are approximately the same.
* Skewness: Measures asymmetry. Positive skew means a longer right tail; negative skew means a longer left tail.
* Kurtosis: Measures the heaviness of the tails. High kurtosis means heavy tails; low kurtosis means light tails.

> Statistical Tests: 
* Shapiro-Wilk for normality: Tests whether the data follows a normal distribution. A p-value < 0.05 suggests it doesn’t.
* Chi-Square goodness-of-fit: Compares observed bin counts to the counts expected under a theoretical distribution to see if they match.

> Fit Distributions:
* Compare fit of different distributions: Use statistical tools to fit your data to different distributions (e.g., normal, exponential) and compare how well they match using criteria like AIC or BIC (lower values indicate a better fit).
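
As a rough illustration of the AIC comparison, the sketch below fits a normal and an exponential distribution to the same synthetic sample and computes the AIC for each as `2 * (number of parameters) - 2 * (log-likelihood)`. The data, the seed, the choice to fix the exponential location at 0, and the helper name `aic` are all assumptions made for this example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
sample = rng.exponential(scale=1, size=1000)  # synthetic data for illustration

def aic(log_likelihood, n_params):
    """Akaike information criterion: lower means a better fit/complexity trade-off."""
    return 2 * n_params - 2 * log_likelihood

# Normal candidate: two free parameters (mean and standard deviation)
mu, sigma = stats.norm.fit(sample)
aic_normal = aic(stats.norm.logpdf(sample, mu, sigma).sum(), n_params=2)

# Exponential candidate: location fixed at 0, so one free parameter (scale)
loc, scale = stats.expon.fit(sample, floc=0)
aic_expon = aic(stats.expon.logpdf(sample, loc, scale).sum(), n_params=1)

print(f"AIC normal:      {aic_normal:.1f}")
print(f"AIC exponential: {aic_expon:.1f}")
```

Since the sample was generated from an exponential distribution, the exponential fit should come out with the lower AIC.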

In [25]:
import numpy as np
import scipy.stats as stats
import plotly.express as px
import plotly.graph_objects as go

# Generate example data
data = np.random.normal(loc=0, scale=1, size=1000)

# 1. Visualize
# Histogram
fig_hist = px.histogram(data, nbins=30, title='Histogram of Data')
fig_hist.show()

# Q-Q Plot with Plotly
# Use probabilities strictly between 0 and 1: norm.ppf(0) is -inf and cannot be plotted
probs = np.arange(1, 100) / 100
quantiles = np.quantile(data, probs)
theoretical_quantiles = stats.norm.ppf(probs)

fig_qq = go.Figure()
fig_qq.add_trace(go.Scatter(x=theoretical_quantiles, y=quantiles,
                            mode='markers',
                            marker=dict(size=5),
                            name='Data vs Theoretical Quantiles'))
fig_qq.add_trace(go.Scatter(x=theoretical_quantiles, y=theoretical_quantiles,
                            mode='lines',
                            line=dict(color='red'),
                            name='Reference Line'))
fig_qq.update_layout(title='Q-Q Plot', xaxis_title='Theoretical Quantiles', yaxis_title='Sample Quantiles')
fig_qq.show()

# 2. Calculate Statistics
mean = np.mean(data)
median = np.median(data)
skewness = stats.skew(data)
kurtosis = stats.kurtosis(data, fisher=True)

print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Skewness: {skewness}")
print(f"Kurtosis: {kurtosis}")

# 3. Statistical Tests
# Shapiro-Wilk Test
shapiro_stat, shapiro_p = stats.shapiro(data)
print(f"Shapiro-Wilk Test Statistic: {shapiro_stat}, p-value: {shapiro_p}")

# Chi-Square Goodness-of-Fit Test
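# Use 10 equal-width bins between the sample minimum and maximum.
# Note: the outermost bins can still have expected counts below 5, which weakens
# the chi-square approximation; merging sparse bins would be more robust.
bins = np.linspace(min(data), max(data), 11)
observed_freq, _ = np.histogram(data, bins=bins)
# Expected counts per bin under the standard normal N(0, 1)
expected_freq = len(data) * (stats.norm.cdf(bins[1:]) - stats.norm.cdf(bins[:-1]))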

# Ensure sums match
observed_sum = np.sum(observed_freq)
expected_sum = np.sum(expected_freq)
if not np.isclose(observed_sum, expected_sum):
    # Adjust expected frequencies to match the total observed frequency
    expected_freq *= observed_sum / expected_sum

# Perform the chi-square test
chi2_stat, chi2_p = stats.chisquare(observed_freq, expected_freq)
print(f"Chi-Square Statistic: {chi2_stat}, p-value: {chi2_p}")

# 4. Fit Distributions
# Fit normal distribution and plot
mu, std = stats.norm.fit(data)
# Normalize the histogram to a probability density so the fitted PDF overlays it
# directly, without depending on the histogram's bin width
fig_fit = px.histogram(data, nbins=30, histnorm='probability density',
                       title='Fitted Distribution with Normal Curve')
x = np.linspace(min(data), max(data), 100)
pdf = stats.norm.pdf(x, mu, std)
fig_fit.add_scatter(x=x, y=pdf, mode='lines', name='Fitted Normal Curve')
fig_fit.show()


Mean: 0.08110437384278317
Median: 0.10027415327889941
Skewness: 0.008252371307320823
Kurtosis: -0.04193957944396676
Shapiro-Wilk Test Statistic: 0.9988213181495667, p-value: 0.7686306834220886
Chi-Square Statistic: 9.53996732554912, p-value: 0.38899306908606374


> Interpretation of Results

**Mean: 0.081**
- The average value of the data is close to 0, indicating that the data is centered around zero.

**Median: 0.100**
- The median is also close to 0 and close to the mean, suggesting that the distribution is roughly symmetric.

**Skewness: 0.008**
- Skewness is near zero, indicating that the distribution is nearly symmetrical, with neither tail noticeably longer than the other.

**Kurtosis: -0.042**
- Kurtosis close to zero (for Fisher's definition) suggests that the tails of the distribution are similar to those of a normal distribution. The value is slightly negative, which would point to marginally lighter tails, but a difference this small is well within sampling noise for 1,000 points.

### Statistical Tests

**Shapiro-Wilk Test:**
- **Statistic:** 0.999
- **p-value:** 0.769
- **Interpretation:** The p-value is high (greater than 0.05), indicating that you cannot reject the null hypothesis. This suggests that your data does not significantly deviate from a normal distribution.

**Chi-Square Test:**
- **Statistic:** 9.540
- **p-value:** 0.389
- **Interpretation:** The p-value is high (greater than 0.05), indicating that you cannot reject the null hypothesis that the observed data fits the expected distribution. This suggests that the data fits the normal distribution reasonably well.

> Summary

- **Normality:** Neither the Shapiro-Wilk test nor the Chi-Square test rejects normality, so the data is consistent with a normal distribution. Failing to reject is not proof of normality.
- **Distribution Shape:** The mean, median, skewness, and kurtosis all match what a roughly symmetrical, normally shaped distribution would produce.
