
#Understanding probability distributions: Normal, Binomial, Poisson


##Overview


In data science, probability distributions play a crucial role in describing and modeling random variables and their outcomes. Understanding different probability distributions is essential for making data-driven decisions, performing statistical analysis, and building predictive models. Let's explore three common probability distributions used in data science:

1. **Normal Distribution:**
   - Also known as the Gaussian distribution, it is one of the most important and widely used probability distributions.
   - The probability density function (PDF) of a normal distribution is characterized by its mean (μ) and standard deviation (σ).
   - It is symmetric and bell-shaped, with the peak at the mean, and tails extending to infinity.
   - Many natural phenomena, such as heights, weights, and measurement errors, tend to follow a normal distribution.
   - The central limit theorem states that the suitably standardized sum (or mean) of a large number of independent, identically distributed random variables with finite variance is approximately normally distributed, regardless of the shape of the original distribution.

2. **Binomial Distribution:**
   - The binomial distribution models the number of successes in a fixed number of independent Bernoulli trials.
   - A Bernoulli trial is an experiment with two possible outcomes, often referred to as "success" and "failure," with a fixed probability of success (p) and failure (1-p).
   - The binomial distribution is characterized by two parameters: the number of trials (n) and the probability of success (p).
   - Examples of scenarios following a binomial distribution include coin flips, success/failure experiments, and click-through rates in online advertising.

3. **Poisson Distribution:**
   - The Poisson distribution models the number of events that occur in a fixed interval of time or space.
   - It is appropriate for situations where events happen independently and with a constant average rate (λ) over time or space.
   - The probability mass function (PMF) of the Poisson distribution is characterized by its mean (λ).
   - Examples of scenarios following a Poisson distribution include the number of arrivals at a store, the number of customer service calls in an hour, or the number of defects in a product.

In data science, the choice of which distribution to use depends on the nature of the data and the problem at hand. Properly understanding and applying these probability distributions help data scientists gain insights from data, build accurate models, and make informed decisions in various domains, including finance, healthcare, marketing, and more.
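The central limit theorem mentioned in the list above can be checked numerically. The following is a minimal sketch rather than part of the original notebook: it sums draws from a strongly skewed exponential distribution (an arbitrary choice for illustration; `n_terms`, `n_sums`, and the seed are assumed values) and compares the skewness of the raw draws with that of the standardized sums.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed (assumed) for reproducibility

n_terms = 50      # how many variables go into each sum (assumed value)
n_sums = 10_000   # how many sums we simulate (assumed value)

# Exponential(1) draws: mean 1, variance 1, clearly not bell-shaped
draws = rng.exponential(scale=1.0, size=(n_sums, n_terms))
sums = draws.sum(axis=1)

# Standardize each sum with the theoretical mean and variance of the sum
z = (sums - n_terms * 1.0) / np.sqrt(n_terms * 1.0)

raw_skew = stats.skew(draws.ravel())
sum_skew = stats.skew(z)

print(f"Mean of standardized sums: {z.mean():.3f}")
print(f"Std of standardized sums:  {z.std():.3f}")
print(f"Skewness of raw draws:     {raw_skew:.3f}")
print(f"Skewness of the sums:      {sum_skew:.3f}")
```

The standardized sums should have a mean close to 0, a standard deviation close to 1, and far less skew than the individual draws. Increasing `n_terms` should shrink the remaining skew further.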

##Normal distribution



The SciPy library in Python provides a suite of functions that allow you to work with different probability distributions including the Normal Distribution (also known as Gaussian Distribution).

Here, we'll load the Pima Indians Diabetes dataset and plot a histogram together with a fitted normal probability density function (PDF) for a feature that is roughly bell-shaped, such as BMI.

Firstly, let's load the data:


In [None]:
import pandas as pd

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
df = pd.read_csv(url, names=names)


Now let's analyze the 'BMI' column:


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# isolate BMI values
bmi_values = df['BMI'].values

# calculate mean and standard deviation
mu, std = np.mean(bmi_values), np.std(bmi_values)

# create a range of x values spanning 4 standard deviations on either side of the mean
x = np.linspace(mu - 4*std, mu + 4*std, 100)

# plot histogram
plt.hist(bmi_values, bins=30, density=True, alpha=0.6, color='g')

# plot PDF
plt.plot(x, norm.pdf(x, mu, std), color='crimson')

plt.title('BMI Distribution with PDF')
plt.xlabel('BMI')
plt.ylabel('Density')
plt.grid(True)
plt.show()


In the code snippet above, we first isolate the BMI values from the dataframe, then calculate their mean and standard deviation. These are the parameters of the normal distribution.

Next, we generate a range of x values that span 4 standard deviations on either side of the mean. We then plot the histogram of the BMI values and the PDF of a normal distribution with the same mean and standard deviation.

Note that the histogram is "normalized" (`density=True`) so that it forms a probability density (i.e., the area under the histogram integrates to 1). This allows it to be properly compared with the PDF.
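To see what `density=True` does, the normalization can be reproduced by hand. This is a small sketch on synthetic data (the normal sample below is a hypothetical stand-in for the BMI column, with assumed mean 32 and standard deviation 7): each bar height is the bin count divided by the total count times the bin width, so the bar areas add up to 1.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical stand-in for the BMI values (assumed mean 32, std 7)
sample = rng.normal(loc=32.0, scale=7.0, size=768)

# Heights returned with density=True, plus the bin edges
heights, edges = np.histogram(sample, bins=30, density=True)
bin_widths = np.diff(edges)

# Total area of the bars: sum of height * width
area = np.sum(heights * bin_widths)
print(f"Area under the density histogram: {area:.6f}")

# The same heights computed manually from raw counts
counts, _ = np.histogram(sample, bins=30)
manual_heights = counts / (counts.sum() * bin_widths)
print("Manual normalization matches:", np.allclose(heights, manual_heights))
```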

Remember that not all data are normally distributed. Before relying on a normal model, check the assumption with a graphical tool (such as a Q-Q plot) or a statistical test (such as the Shapiro-Wilk test). Keep in mind that such a test can only provide evidence against normality; it cannot confirm it. Also note that a BMI of zero is not physically possible, so the zeros in this dataset most likely stand for missing values and should be handled before fitting a distribution.
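As a rough illustration of the Shapiro-Wilk test mentioned above, the sketch below applies `scipy.stats.shapiro` to two synthetic samples (both hypothetical, not taken from the dataset): one drawn from a normal distribution and one from a skewed exponential distribution. It also shows one simple way to drop zero placeholders before fitting, on a handful of made-up values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two hypothetical samples: one bell-shaped, one strongly skewed
samples = {
    "normal": rng.normal(loc=32.0, scale=7.0, size=300),
    "skewed": rng.exponential(scale=7.0, size=300),
}

results = {}
for name, sample in samples.items():
    # H0 of the Shapiro-Wilk test: the sample comes from a normal distribution
    w_stat, p_value = stats.shapiro(sample)
    results[name] = (w_stat, p_value)
    print(f"{name}: W = {w_stat:.3f}, p-value = {p_value:.4g}")

# Dropping zero placeholders (made-up values for illustration)
values = np.array([33.6, 26.6, 0.0, 28.1, 43.1, 0.0])
cleaned = values[values > 0]
print("Values kept after removing zeros:", cleaned)
```

A very small p-value for the skewed sample is evidence against normality. A large p-value for the normal sample only means the test found no evidence against it.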


##Binomial distribution



The binomial distribution is a probability distribution that describes the number of successes in a fixed number of independent Bernoulli trials, each with the same probability of success.

In the context of the Pima Indians Diabetes dataset, one way to use the binomial distribution is to model the number of diabetes cases in a group of people, taking the proportion of positive outcomes observed in the dataset as the probability of success. Let's see how we can do this with SciPy and pandas.

**Step 1: Load Libraries and Data**


In [None]:
import pandas as pd
from scipy.stats import binom
import matplotlib.pyplot as plt

# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
dataframe = pd.read_csv(url, names=names)


**Step 2: Compute the Probability of Diabetes**


In [None]:
# compute the probability of diabetes
prob_diabetes = dataframe['Outcome'].mean()


**Step 3: Compute the Binomial Probabilities**


In [None]:
# number of trials and probability of success
n = 10
p = prob_diabetes

# evaluate the binomial PMF for every possible number of cases (0 to n)
r_values = list(range(n + 1))
dist = [binom.pmf(r, n, p) for r in r_values]


**Step 4: Plot the Binomial Distribution**


In [None]:
# plot the distribution
plt.bar(r_values, dist)
plt.title('Binomial Distribution of Diabetes Occurrence')
plt.xlabel('Number of Diabetes Cases')
plt.ylabel('Probability')
plt.show()


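This will create a bar plot of the binomial distribution of diabetes occurrence in a sample of 10 individuals, based on the probability of diabetes occurrence in the dataset. Each bar gives the probability of observing exactly that many cases, assuming the 10 individuals are sampled independently and each has the same probability of having diabetes.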
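The cells above evaluate the PMF exactly. As a complementary sketch, the code below also uses the cumulative functions and actual simulation with `binom.rvs`. The value `p = 0.35` is a hypothetical stand-in for `prob_diabetes`, chosen so the example runs without downloading the data; the seed and sample size are likewise assumptions.

```python
import numpy as np
from scipy.stats import binom

n = 10
p = 0.35  # hypothetical stand-in for prob_diabetes

# Exact PMF for 0..n cases, evaluated in one vectorized call
k = np.arange(n + 1)
pmf = binom.pmf(k, n, p)
print(f"PMF sums to: {pmf.sum():.6f}")

# Cumulative probabilities
p_at_most_2 = binom.cdf(2, n, p)   # P(X <= 2)
p_at_least_5 = binom.sf(4, n, p)   # P(X >= 5) = 1 - P(X <= 4)
print(f"P(at most 2 cases):  {p_at_most_2:.4f}")
print(f"P(at least 5 cases): {p_at_least_5:.4f}")

# Simulation: draw many groups of n individuals and count cases in each
rng = np.random.default_rng(0)
draws = binom.rvs(n, p, size=100_000, random_state=rng)
empirical = np.bincount(draws, minlength=n + 1) / draws.size

max_gap = np.max(np.abs(empirical - pmf))
print(f"Largest gap between simulated and exact probabilities: {max_gap:.4f}")
```

With this many simulated groups, the empirical frequencies should sit very close to the exact PMF values.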


##Poisson distribution



The Poisson distribution is a discrete probability distribution often used to model the number of times an event occurs in a fixed interval of time. For instance, it can describe the number of users who visit a website in a given interval, given the average number of visits per interval.

In this example, we will analyze the number of pregnancies from the Pima Indians Diabetes dataset. We'll assume that this number follows a Poisson distribution. Here is how you can use the `scipy` library to work with the Poisson distribution:



In [None]:
import pandas as pd
from scipy.stats import poisson

# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
dataframe = pd.read_csv(url, names=names)

# data
data = dataframe['Pregnancies']

# the average number of pregnancies in the dataset
mu = data.mean()

# PMF
print("Probability mass function: ", poisson.pmf(3, mu)) # probability of exactly 3 pregnancies

# CDF
print("Cumulative distribution function: ", poisson.cdf(3, mu)) # probability of 3 or fewer pregnancies

# Random variates
print("Random variates: ", poisson.rvs(mu, size=10))  # generate 10 random numbers from poisson distribution with mean mu


Note that this is a simplification. The number of pregnancies in the dataset might not follow a Poisson distribution in reality, and we are treating it as such for illustrative purposes. For real-world data analysis, it's important to carefully consider the underlying assumptions and validate them using appropriate statistical tests or graphical checks.

Remember to replace `'Pregnancies'` with the actual name of your column if it's different. If you want to analyze a different column, ensure that column represents count data (i.e., non-negative integer values), as the Poisson distribution is defined for such data.
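One quick graphical-free check of the Poisson assumption is to compare the sample mean and variance, since a Poisson distribution has both equal to λ. The sketch below is illustrative only and uses synthetic counts (the mean of 3.8 and the negative binomial alternative are assumptions, not values from the dataset): a ratio of variance to mean well above 1 suggests overdispersion, in which case a Poisson model may fit poorly.

```python
import numpy as np

rng = np.random.default_rng(7)
mean_count = 3.8  # hypothetical average count

# Counts that really are Poisson
poisson_counts = rng.poisson(lam=mean_count, size=5000)

# Counts with the same mean but extra variance (negative binomial)
r = 2
nb_counts = rng.negative_binomial(n=r, p=r / (r + mean_count), size=5000)

def dispersion_index(counts):
    """Sample variance divided by sample mean; close to 1 for Poisson data."""
    return counts.var(ddof=1) / counts.mean()

di_poisson = dispersion_index(poisson_counts)
di_nb = dispersion_index(nb_counts)

print(f"Dispersion index, Poisson counts:           {di_poisson:.2f}")
print(f"Dispersion index, negative binomial counts: {di_nb:.2f}")
```

The same function could be applied to a real count column, for example `dispersion_index(dataframe['Pregnancies'])`, before committing to a Poisson model.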


#Hypothesis testing: Null Hypothesis, Alternate Hypothesis, p-value


##Overview


Hypothesis testing is a fundamental concept in statistics and data science that allows us to make decisions or draw conclusions about a population based on sample data. It involves formulating two competing hypotheses, the null hypothesis and the alternative hypothesis, and then using statistical methods to evaluate the evidence in the data to support one of the hypotheses.

1. **Null Hypothesis (H0):**
   The null hypothesis is the default or the status quo hypothesis. It states that there is no significant difference, effect, or relationship between the variables of interest in the population. In other words, any observed differences or patterns in the sample data are due to random chance or sampling variability.

   For example, if we are testing the effectiveness of a new drug, the null hypothesis would state that the drug has no effect on the condition being treated.

2. **Alternative Hypothesis (Ha or H1):**
   The alternative hypothesis is the opposite of the null hypothesis. It represents what the researcher wants to establish or show evidence for. It proposes that there is a significant difference, effect, or relationship between the variables in the population, beyond random chance.

   In the drug example, the alternative hypothesis would state that the new drug has a significant effect on the condition being treated.

3. **p-value:**
   The p-value measures the strength of the evidence against the null hypothesis. It is the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true. It is not the probability that the null hypothesis itself is true. A small p-value indicates that the observed data would be unlikely under the null hypothesis, which counts as evidence in favor of the alternative hypothesis.

   In practice, a significance level (alpha) is chosen before running the test, often 0.05. If the calculated p-value is less than or equal to the significance level, we reject the null hypothesis in favor of the alternative hypothesis. If the p-value is greater than the significance level, we fail to reject the null hypothesis (note: failing to reject the null hypothesis does not mean that we accept it as true).

   For example, if the p-value is 0.03 and the significance level is 0.05, we would reject the null hypothesis because the p-value is less than the significance level.

In conclusion, hypothesis testing is a powerful tool that helps data scientists and researchers make data-driven decisions. By formulating null and alternative hypotheses and calculating p-values, we can judge whether the observed data provide enough evidence to reject the null hypothesis in favor of the alternative, or whether the results are consistent with chance variation alone.



Here's a simple example using the Pima Indians Diabetes dataset to perform a hypothesis test. For this demonstration, we'll use the BMI (Body Mass Index) variable and test the hypothesis that the average BMI of the population is 30.

**Step 1: Load Libraries and Data**


In [None]:
import pandas as pd
from scipy import stats

# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
dataframe = pd.read_csv(url, names=names)


**Step 2: Define Hypotheses**

* Null Hypothesis (H0): The mean BMI of the population is 30.
* Alternative Hypothesis (H1): The mean BMI of the population is not 30.

**Step 3: Conduct Hypothesis Test**

We can perform a One-Sample T-Test to test our hypothesis. A one-sample t-test is a statistical procedure used to determine whether a sample of observations could have been generated by a process with a specific mean.


In [None]:
bmi_values = dataframe['BMI'].values

# one-sample t-test of H0: mean BMI = 30 (two-sided by default)
t_stat, pval = stats.ttest_1samp(bmi_values, 30)

print('p-value:', pval)

if pval < 0.05:    # alpha value is 0.05 or 5%
    print("We reject the null hypothesis")
else:
    print("We fail to reject the null hypothesis")


This test returns a p-value. If the p-value is below our significance level (often 0.05), we reject the null hypothesis. If it's greater, we cannot reject the null hypothesis.

In the example above, if the p-value is less than 0.05, we reject the null hypothesis that the average BMI is 30, and infer the average BMI is different from 30. If the p-value is greater than 0.05, we cannot reject the null hypothesis that the average BMI is 30.

Remember that "failing to reject" the null hypothesis is not the same as "accepting" it. The test is simply saying that the data do not provide strong enough evidence against the null hypothesis.
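To make the p-value less of a black box, the one-sample t statistic and its two-sided p-value can be computed by hand and compared with `stats.ttest_1samp`. This is a sketch on synthetic data: the sample below is a hypothetical stand-in for the BMI column (assumed mean 32, standard deviation 7, 200 observations), and `mu0` is the hypothesized mean from the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Hypothetical stand-in for the BMI values
sample = rng.normal(loc=32.0, scale=7.0, size=200)
mu0 = 30.0  # mean under the null hypothesis

n = sample.size
# Standard error of the mean uses the sample standard deviation (ddof=1)
standard_error = sample.std(ddof=1) / np.sqrt(n)

# t statistic: how many standard errors the sample mean lies from mu0
t_manual = (sample.mean() - mu0) / standard_error

# Two-sided p-value: probability of a |t| at least this large under H0
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# The same test with SciPy
t_scipy, p_scipy = stats.ttest_1samp(sample, mu0)

print(f"Manual: t = {t_manual:.4f}, p = {p_manual:.6f}")
print(f"SciPy:  t = {t_scipy:.4f}, p = {p_scipy:.6f}")
```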


##Types of errors in hypothesis testing: Type I and Type II


In hypothesis testing, two kinds of error can occur, known as Type I and Type II errors:

1. **Type I Error (False Positive):** When the null hypothesis is true and we incorrectly reject it. It's also known as a "false alarm" or "false positive". The probability of committing a Type I error is denoted by the significance level α.

2. **Type II Error (False Negative):** When the null hypothesis is false and we fail to reject it. It's also known as a "miss" or "false negative". The probability of committing a Type II error is denoted by β. The power of a test (1 - β) is the probability that it correctly rejects a false null hypothesis.

To illustrate these concepts with an example, we'll use the Pima Indians Diabetes dataset and perform a hypothesis test on whether the average Glucose level differs between people with diabetes and people without.


In [None]:
import pandas as pd
from scipy import stats

# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
dataframe = pd.read_csv(url, names=names)

# Split the dataset based on Outcome
diabetes_positive = dataframe[dataframe['Outcome'] == 1]
diabetes_negative = dataframe[dataframe['Outcome'] == 0]

# Conduct a two-sample t-test
t_stat, p_val = stats.ttest_ind(diabetes_positive['Glucose'], diabetes_negative['Glucose'], equal_var=False, nan_policy='omit')

print(f"T-statistic: {t_stat}")
print(f"P-value: {p_val}")


If the p-value is less than our chosen significance level (say 0.05), we reject the null hypothesis and infer that there is a significant difference in average glucose levels for individuals with and without diabetes. If we falsely reject the null hypothesis (i.e., in reality, there's no difference), that's a Type I error.

If the p-value is greater than our chosen significance level, we fail to reject the null hypothesis: the data do not provide sufficient evidence of a difference in average glucose levels. If in reality there is a difference that we failed to detect, that's a Type II error.

Remember, these are probabilistic inferences. A low p-value means that our observed data is highly unlikely under the null hypothesis, not that the null hypothesis is definitely false. Similarly, failing to reject the null hypothesis doesn't prove it true. It merely suggests that we don't have strong enough evidence to conclude otherwise.
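The two error rates can be estimated by simulation. The sketch below is illustrative and uses synthetic data only (the group size, number of simulations, effect size of 0.5 standard deviations, and seed are all assumed values). It repeats a two-sample Welch t-test many times: once when the null hypothesis is true, where the rejection rate estimates the Type I error rate, and once when there is a real difference, where the rejection rate estimates the power.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

alpha = 0.05
n_sims = 2000       # number of simulated experiments (assumed)
n_per_group = 50    # observations per group (assumed)

def rejection_rate(mean_shift):
    """Fraction of simulated experiments in which H0 is rejected."""
    group_a = rng.normal(0.0, 1.0, size=(n_sims, n_per_group))
    group_b = rng.normal(mean_shift, 1.0, size=(n_sims, n_per_group))
    # One Welch t-test per row (axis=1 runs the test across columns)
    _, p_values = stats.ttest_ind(group_a, group_b, axis=1, equal_var=False)
    return np.mean(p_values < alpha)

# H0 true (no difference): every rejection is a Type I error
type_i_rate = rejection_rate(0.0)

# H0 false (true difference of 0.5): rejections are correct detections
power = rejection_rate(0.5)
type_ii_rate = 1 - power

print(f"Estimated Type I error rate: {type_i_rate:.3f} (alpha = {alpha})")
print(f"Estimated power:             {power:.3f}")
print(f"Estimated Type II error rate: {type_ii_rate:.3f}")
```

The estimated Type I error rate should land near the chosen alpha, while the Type II error rate depends on the effect size and the sample size: larger samples or larger true differences reduce it.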


#Reflection Points

1. **Normal Distribution**
   - What is the normal distribution, and why is it important in statistics?
   - How is the normal distribution characterized, and what are its key properties?
   - How do you generate random numbers following a normal distribution using SciPy?
   - How do you calculate the probability density function (PDF) and cumulative distribution function (CDF) of a normal distribution using SciPy?
   - How can you perform statistical tests and confidence interval calculations based on the normal distribution using SciPy?

2. **Binomial Distribution**
   - What is the binomial distribution, and in what scenarios is it applicable?
   - How is the binomial distribution defined, and what are its parameters?
   - How do you generate random numbers following a binomial distribution using SciPy?
   - How can you calculate the probability mass function (PMF) and cumulative distribution function (CDF) of a binomial distribution using SciPy?
   - How can you use the binomial distribution for hypothesis testing and confidence interval estimation using SciPy?

3. **Poisson Distribution**
   - What is the Poisson distribution, and when is it commonly used?
   - How is the Poisson distribution defined, and what are its characteristics?
   - How do you generate random numbers following a Poisson distribution using SciPy?
   - How can you calculate the probability mass function (PMF) and cumulative distribution function (CDF) of a Poisson distribution using SciPy?
   - How can you utilize the Poisson distribution in various applications, such as modeling rare events, queuing theory, and counting processes?


#A quiz on Probability distributions


Question 1: What type of probability distribution is appropriate to model the number of successes in a fixed number of independent Bernoulli trials?
<br>a) Normal Distribution
<br>b) Binomial Distribution
<br>c) Poisson Distribution

Question 2: The sum of a large number of independent, identically distributed random variables with finite variance will tend to follow which distribution?
<br>a) Normal Distribution
<br>b) Binomial Distribution
<br>c) Poisson Distribution

Question 3: Which probability distribution is commonly used to describe continuous random variables with a symmetric bell-shaped curve?
<br>a) Normal Distribution
<br>b) Binomial Distribution
<br>c) Poisson Distribution

Question 4: In hypothesis testing, the statement that there is no significant difference between specified populations is called:
<br>a) Null Hypothesis
<br>b) Alternate Hypothesis
<br>c) P-value

Question 5: The probability of making a Type I error in hypothesis testing is represented by:
<br>a) Alpha (α)
<br>b) Beta (β)
<br>c) P-value

Question 6: The probability of making a Type II error in hypothesis testing is represented by:
<br>a) Alpha (α)
<br>b) Beta (β)
<br>c) P-value

Question 7: Which of the following is NOT a step in hypothesis testing?
<br>a) Formulating the null hypothesis
<br>b) Selecting the level of significance
<br>c) Proving that the null hypothesis is true

Question 8: The probability value (p-value) in hypothesis testing represents:
<br>a) The probability of the null hypothesis being true
<br>b) The probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true
<br>c) The probability of making a Type I error

Question 9: In a binomial distribution, if the number of trials increases and the probability of success remains constant, what happens to the shape of the distribution?
<br>a) It becomes narrower and taller
<br>b) It becomes wider and flatter
<br>c) It remains the same

Question 10: Which SciPy function can be used to generate random numbers from a normal distribution?
<br>a) scipy.stats.norm.pdf
<br>b) scipy.stats.norm.rvs
<br>c) scipy.stats.binom.rvs

---
Answers:

<br>1) b) Binomial Distribution
<br>2) a) Normal Distribution
<br>3) a) Normal Distribution
<br>4) a) Null Hypothesis
<br>5) a) Alpha (α)
<br>6) b) Beta (β)
<br>7) c) Proving that the null hypothesis is true
<br>8) b) The probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true
<br>9) b) It becomes wider and flatter
<br>10) b) scipy.stats.norm.rvs

---