In [None]:
"""
Statistical tests play an important role in data science because they let us make inferences about data.
These tests help determine whether observed patterns are statistically significant or could plausibly
have occurred by chance.
"""

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Parametric Statistical Tests

When to Use Parametric Tests:

Parametric tests are appropriate when the data meet specific assumptions, such as the normality
of the data distribution and the homogeneity of variances. They are powerful for detecting
relationships and differences in data when these assumptions are met.
Here are some general guidelines:

- Normality: Use parametric tests when the data approximate a normal distribution.
- Continuous Data: Parametric tests are suitable for continuous data or when data can be
transformed to approximate normality.
- Large Sample Size: Parametric tests become more robust to departures from normality as the sample
size grows. Thresholds such as n > 30 are common rules of thumb rather than strict requirements.

These tests form the backbone of statistical analysis in many fields, providing tools to make
data-driven decisions, test hypotheses, and understand relationships between variables in a
structured and rigorous manner.
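
Before running a parametric test, the assumptions listed above can be checked directly. The sketch below is one possible approach, assuming SciPy's Shapiro-Wilk test for normality and Levene's test for equal variances; the two samples and the seed are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)    # fixed seed so the sketch is repeatable
sample_a = rng.normal(10, 2, 50)  # hypothetical group A
sample_b = rng.normal(12, 2, 50)  # hypothetical group B

# Shapiro-Wilk: the null hypothesis is that the sample comes from a normal distribution
shapiro_stat, shapiro_p = stats.shapiro(sample_a)

# Levene: the null hypothesis is that the groups have equal variances
levene_stat, levene_p = stats.levene(sample_a, sample_b)

print(f'Shapiro-Wilk p-value: {shapiro_p:.3f}')
print(f'Levene p-value: {levene_p:.3f}')
```

A small p-value in either check is a hint that the corresponding assumption may not hold and that a non-parametric alternative could be worth considering.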

In [None]:
# Regression Tests

"""
Parametric tests make assumptions about the population distribution from which the sample is drawn
and are used to draw inferences about its parameters (such as the mean and variance). They are
generally more powerful and reliable when their assumptions are met.
"""

import statsmodels.api as sm
from sklearn.datasets import fetch_california_housing

california= fetch_california_housing()
X= pd.DataFrame(california.data, columns=california.feature_names)
y= california.target
X= sm.add_constant(X)

# fit the model - ordinary least squares regression
model= sm.OLS(y, X).fit()

print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.606
Model:                            OLS   Adj. R-squared:                  0.606
Method:                 Least Squares   F-statistic:                     3970.
Date:                Sun, 22 Sep 2024   Prob (F-statistic):               0.00
Time:                        15:46:47   Log-Likelihood:                -22624.
No. Observations:               20640   AIC:                         4.527e+04
Df Residuals:                   20631   BIC:                         4.534e+04
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -36.9419      0.659    -56.067      0.0

The model explains about 60.6% of the variability in housing prices, as indicated by the R-squared
value. Key predictors such as Median Income, Housing Age, Average Number of Rooms, Average Number
of Bedrooms, Average Occupancy, Latitude, and Longitude are all statistically significant, with
p-values well below the 0.05 threshold. Median Income, for instance, has a positive coefficient
of 0.4367, suggesting that an increase in median income is associated with an increase in housing
prices when the other predictors are held fixed. Conversely, the Average Number of Rooms has a
negative coefficient, indicating that more rooms are associated with lower housing prices, likely
due to confounding with other predictors in the dataset.

The model's F-statistic is 3970 with a p-value that prints as 0.00 (smaller than the displayed
precision), underscoring the overall significance of the regression model. Additionally, the high
condition number (2.38e+05) suggests potential multicollinearity among the predictors.
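
A large condition number can also reflect predictors measured on very different scales, so it helps to look at a per-predictor measure as well. The sketch below computes variance inflation factors (VIFs) with NumPy only, using the fact that the VIFs are the diagonal of the inverse correlation matrix. The three predictors are synthetic and hypothetical, not the housing data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)  # nearly a copy of x1, so strongly collinear
x3 = rng.normal(size=n)                  # unrelated predictor
predictors = np.column_stack([x1, x2, x3])

# VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal entry of the inverse correlation matrix
corr_matrix = np.corrcoef(predictors, rowvar=False)
vif = np.diag(np.linalg.inv(corr_matrix))

for name, value in zip(['x1', 'x2', 'x3'], vif):
    print(f'{name}: VIF = {value:.1f}')
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic collinearity.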

Other Regression Tests
- Multiple Linear Regression: Extends simple linear regression to model the relationship between a
dependent variable and multiple independent variables. The model above, with eight predictors, is
an example.
- Logistic Regression: Models the probability of a binary outcome based on one or more predictor
variables using a logistic function.

In [None]:
# Comparison Tests

"""
The T-test is used to compare the means of two groups to determine if they are significantly
different from each other. There are several types of T-tests, each suited for different scenarios.
"""

# Independent T-test: Used to compare the means of two independent groups
from scipy.stats import ttest_ind

group1= np.random.normal(10,2,30)
group2= np.random.normal(12,2,30)

# perform the independent t-test
t_stat, p_value= ttest_ind(group1, group2)

print(f't-statistic: {t_stat}, p-value: {p_value}')

t-statistic: -4.645255605036813, p-value: 1.9970732555673982e-05


Here, ttest_ind computes the t-statistic and the p-value, which help us determine whether there is
a significant difference between the two groups.

The t-statistic of -4.65 and the corresponding p-value of approximately 0.00002 indicate a
significant difference between the means of group1 and group2: a difference this large would be
highly unlikely if the two population means were equal.
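
To see what the test computes, the sketch below reproduces the pooled-variance t-statistic by hand and compares it with ttest_ind. It also runs Welch's variant (equal_var=False), which does not assume equal variances. The variable names and the seed are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(10, 2, 30)
group_b = rng.normal(12, 2, 30)
n_a, n_b = len(group_a), len(group_b)

# pooled variance: weighted average of the two sample variances
pooled_var = ((n_a - 1) * group_a.var(ddof=1)
              + (n_b - 1) * group_b.var(ddof=1)) / (n_a + n_b - 2)
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# ttest_ind assumes equal variances by default
t_scipy, p_scipy = stats.ttest_ind(group_a, group_b)

# Welch's t-test drops the equal-variance assumption
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f'manual t: {t_manual:.4f}, scipy t: {t_scipy:.4f}, Welch t: {t_welch:.4f}')
```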

In [None]:
# Paired T-test: Used to compare the means of two related groups. This is applicable in scenarios
# like before-and-after studies.
from scipy.stats import ttest_rel

before= np.random.normal(10,2,30)
after = before + np.random.normal(1,1,30)

# perform the paired t-test
t_stat, p_value= ttest_rel(before, after)

print(f't-statistic: {t_stat}, p-value: {p_value}')

t-statistic: -5.252001219836367, p-value: 1.259134552229505e-05


In this example, ttest_rel is used to assess the significance of the difference in means for the
paired data.

The t-statistic of -5.25 and the p-value of approximately 1.26e-05 obtained from the paired
t-test indicate a significant difference between the paired observations (before and after). The
statistic is negative because the after values are larger on average, and the small p-value
suggests that the observed change is unlikely to be due to random variation.
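
A paired t-test is equivalent to a one-sample t-test on the pairwise differences against zero. The sketch below checks that equivalence on hypothetical seeded data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
before_scores = rng.normal(10, 2, 30)
after_scores = before_scores + rng.normal(1, 1, 30)

# paired test on the two related samples
t_paired, p_paired = stats.ttest_rel(before_scores, after_scores)

# one-sample test on the differences, with a hypothesized mean difference of 0
differences = before_scores - after_scores
t_diff, p_diff = stats.ttest_1samp(differences, 0)

print(f'paired: t={t_paired:.4f}, on differences: t={t_diff:.4f}')
```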

In [None]:
# One Sample T-test: Used to compare the mean of a single group to a known value.
# It tests whether the mean of a single sample is significantly different from a known value.

from scipy.stats import ttest_1samp

# example data
data= np.random.normal(10,2,30)

# perform the one-sample-t-test
t_stat, p_value= ttest_1samp(data, 10)

print(f't-statistic: {t_stat}, p-value: {p_value}')

t-statistic: 0.5721672659818833, p-value: 0.5716178624094004


Here, ttest_1samp helps us determine if the sample mean differs significantly from the known value.
The t-statistic of 0.5722 and the p-value of 0.5716 obtained from the one-sample t-test indicate
that there is no significant difference between the mean of the sample data and the hypothesized
population mean of 10.
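
The one-sample statistic is the distance between the sample mean and the hypothesized mean, measured in standard errors. The sketch below computes it by hand and compares it with ttest_1samp; the data and seed are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sample = rng.normal(10, 2, 30)
mu0 = 10  # hypothesized population mean
n = len(sample)

standard_error = sample.std(ddof=1) / np.sqrt(n)
t_manual = (sample.mean() - mu0) / standard_error

# two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = stats.ttest_1samp(sample, mu0)
print(f'manual: t={t_manual:.4f}, p={p_manual:.4f}')
```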

In [None]:
# ANOVA (Analysis of Variance) is used to compare the means of three or more groups. It helps to
# determine if at least one of the group means is significantly different from the others.

from scipy.stats import f_oneway

# example data
group1= np.random.normal(10,2,30)
group2= np.random.normal(12,2,30)
group3= np.random.normal(11,2,30)

# perform the ANOVA test
f_stat, p_value= f_oneway(group1, group2, group3)

print(f'F-statistic: {f_stat}, p-value: {p_value}')

F-statistic: 7.283306632997532, p-value: 0.0011896528333366068


In this example, f_oneway performs the ANOVA test, and the F-statistic and p-value help us
determine if there are significant differences among the group means. In this case:

- The F-statistic of 7.283 indicates that the variation between the group means is large relative
to the variation within the groups.
- The p-value of 0.00119 is less than the typical significance level of 0.05, indicating strong
evidence against the null hypothesis (that all group means are equal). ANOVA alone does not
identify which groups differ.
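
The F-statistic is the ratio of the between-group mean square to the within-group mean square. The sketch below works that out by hand on hypothetical seeded groups and compares the result with f_oneway.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
groups = [rng.normal(10, 2, 30), rng.normal(12, 2, 30), rng.normal(11, 2, 30)]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = len(all_values)  # total number of observations

# variation of the group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# variation of the observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f'manual: F={f_manual:.4f}, p={p_manual:.5f}')
```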

In [None]:
# Z-test - used to compare the mean of a sample to a known population mean when the sample size is
# large (n > 30). It's similar to the t-test but is used when the sample size is sufficiently large
# for the Central Limit Theorem to apply.

import statsmodels.api as sm

# example data
data= np.random.normal(10,2,100)

# perform the one-sample z-test
z_stat, p_value= sm.stats.ztest(data, value=10)

print(f'z-statistic: {z_stat}, p-value: {p_value}')

z-statistic: 0.20197764922380826, p-value: 0.839934197410085


Here, sm.stats.ztest performs the Z-test, providing the z-statistic and p-value to determine the
significance of the difference. The z-statistic of 0.2020 and the p-value of 0.8399 obtained from
the z-test suggest that there is no significant difference between the sample mean and the
hypothesized population mean. The small observed deviation is consistent with random sampling
variation, so there is not sufficient evidence to reject the null hypothesis.
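
The z-statistic has the same form as the one-sample t-statistic; only the reference distribution changes. The sketch below computes it by hand, assuming the sample standard deviation stands in for the population one, and shows how close the normal and t-based p-values are at n = 100. The data and seed are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
sample = rng.normal(10, 2, 100)
mu0 = 10  # hypothesized population mean
n = len(sample)

standard_error = sample.std(ddof=1) / np.sqrt(n)
z_manual = (sample.mean() - mu0) / standard_error

# two-sided p-value from the standard normal distribution
p_normal = 2 * stats.norm.sf(abs(z_manual))
# the same statistic referred to a t distribution with n - 1 degrees of freedom
p_t = 2 * stats.t.sf(abs(z_manual), df=n - 1)

print(f'z={z_manual:.4f}, normal p={p_normal:.4f}, t p={p_t:.4f}')
```

Because the t distribution has heavier tails, its p-value is never smaller than the normal one, and the gap shrinks as the sample grows.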

In [None]:
# Correlation Tests

# Pearson Correlation Coefficient - measures the linear relationship between two variables. It
# ranges from -1 to 1, where 1 means a perfect positive linear relationship, -1 means a perfect
# negative linear relationship, and 0 means no linear relationship.

from scipy.stats import pearsonr

# example data
x= np.random.normal(10,2,30)
y= x + np.random.normal(1,1,30)

# calculate the pearson correlation coefficient
corr, p_value= pearsonr(x, y)

print(f'Pearson: {corr}, p_value: {p_value}')

Pearson: 0.9172774039058988, p_value: 1.031860290501634e-12


In this example, pearsonr calculates the correlation coefficient and the p-value, indicating the
strength and significance of the linear relationship between the two variables.

- The Pearson correlation coefficient (r) of 0.9173 indicates a strong positive correlation. A
coefficient close to +1 means that as one variable increases, the other tends to increase as
well.
- The very small p-value (1.03e-12) indicates strong evidence against the null hypothesis of no
linear correlation, suggesting that the observed correlation is unlikely to be due to random chance.
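
Pearson's r is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it that way on hypothetical seeded data and compares it with pearsonr.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x_values = rng.normal(10, 2, 30)
y_values = x_values + rng.normal(1, 1, 30)

# sample covariance is the off-diagonal entry of the 2x2 covariance matrix
covariance = np.cov(x_values, y_values, ddof=1)[0, 1]
r_manual = covariance / (x_values.std(ddof=1) * y_values.std(ddof=1))

r_scipy, p_scipy = stats.pearsonr(x_values, y_values)
print(f'manual r: {r_manual:.4f}, scipy r: {r_scipy:.4f}')
```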

# Non-parametric Statistical Tests

When to Use Non-Parametric Tests:

- Data Distribution: When the data are not normally distributed.
- Ordinal Data: When dealing with ranked or ordered data.
- Small Sample Sizes: When sample sizes are small and assumptions for parametric tests are not met.
- Robustness: When robustness to outliers is desired.

These tests provide alternatives to parametric tests and are valuable in various fields, including
medicine, biology, social sciences, and finance, where data often deviate from normality or
assumptions of parametric tests cannot be met.

Non-parametric tests do not assume a specific distribution for the data and are useful when
parametric test assumptions are not met. These tests are more flexible but generally less powerful
than parametric tests.

In [None]:
# Chi-square Test - used to examine the association between two categorical variables.
# It's often used in contingency tables to test the independence of variables.

from scipy.stats import chi2_contingency

# example data
data= pd.DataFrame({
    'A': [10,20,30],
    'B': [6,9,17]
})

# create a contingency table
contingency_table= pd.crosstab(index=data['A'], columns=data['B'])

# perform the chi-square test
chi2, p, dof, expected= chi2_contingency(contingency_table)

print(f'Chi-square statistic: {chi2}, p_value: {p}')

Chi-square statistic: 6.000000000000001, p_value: 0.19914827347145564


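- The Chi-square statistic tests the independence of two categorical variables. In this case, the
Chi-square statistic of 6.00 is not large enough to reject the null hypothesis.
- The p-value of 0.1991 is greater than the typical significance level of 0.05. This indicates
that we do not have sufficient evidence to reject the null hypothesis of independence between the
variables.
- Note that pd.crosstab treats the values in columns A and B as category labels rather than counts,
so the table tested here is 3x3 with a single observation in each diagonal cell. With expected
counts this small, the chi-square approximation is unreliable and the result is illustrative only.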
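
If the numbers in A and B are meant as observed counts, they can be passed to chi2_contingency directly as a contingency table. The sketch below assumes that reading: each row is a hypothetical category and the two columns hold the counts under A and B.

```python
import numpy as np
from scipy.stats import chi2_contingency

# rows: three hypothetical categories; columns: counts observed under A and B
observed = np.array([[10, 6],
                     [20, 9],
                     [30, 17]])

chi2_stat, p_val, dof, expected = chi2_contingency(observed)

print(f'Chi-square statistic: {chi2_stat:.4f}, p-value: {p_val:.4f}, dof: {dof}')
print('Expected counts under independence:')
print(expected.round(2))
```

With this layout the test has (3 - 1) x (2 - 1) = 2 degrees of freedom, and the expected counts are all well above the usual minimum of 5.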

Other commonly used Non-parametric Tests
- Mann-Whitney U Test (also called the Mann-Whitney-Wilcoxon Test): Non-parametric alternative to
the independent t-test that compares two independent groups using ranks.
- Friedman Test: Non-parametric alternative to repeated measures ANOVA. It tests whether there are
differences between groups across multiple measurements.
- Kolmogorov-Smirnov Test: Tests whether a sample comes from a specific distribution
(e.g., a normal distribution).
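
The three tests listed above are all available in scipy.stats. The sketch below is a minimal illustration on hypothetical seeded data, not a recommendation for any particular dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

# Mann-Whitney U: two independent, skewed samples
group_a = rng.exponential(1.0, 30)
group_b = rng.exponential(1.5, 30)
u_stat, p_u = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')

# Friedman: the same 20 subjects measured under three conditions
cond1 = rng.normal(10, 2, 20)
cond2 = cond1 + rng.normal(0.5, 1, 20)
cond3 = cond1 + rng.normal(1.0, 1, 20)
fr_stat, p_fr = stats.friedmanchisquare(cond1, cond2, cond3)

# Kolmogorov-Smirnov: compare a sample against the standard normal distribution
ks_sample = rng.normal(0, 1, 100)
ks_stat, p_ks = stats.kstest(ks_sample, 'norm')

print(f'Mann-Whitney U: {u_stat:.1f}, p-value: {p_u:.4f}')
print(f'Friedman: {fr_stat:.4f}, p-value: {p_fr:.4f}')
print(f'Kolmogorov-Smirnov: {ks_stat:.4f}, p-value: {p_ks:.4f}')
```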

Also, consider the following when choosing a test:

Data Type: Identify the type of data you have.
- Use t-tests and ANOVA for continuous data.
- Use Chi-square tests for categorical data.

Number of Groups: Determine the number of groups you're comparing.
- Use a one-sample t-test to compare a sample mean to a known value.
- Use an independent t-test to compare the means of two independent groups.
- Use a paired t-test to compare the means of two related groups.
- Use ANOVA to compare the means of three or more groups.

Assumptions: Check the assumptions required for each test.
- Parametric tests like t-tests and ANOVA assume a normal distribution and homogeneity of variances.
- Non-parametric tests like Chi-square tests do not assume a specific distribution and are suitable
for categorical data.

Sample Size: Consider the sample size.
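- Use Z-tests for large sample sizes (roughly n > 30) or when the population standard deviation is
known.
- Use t-tests for smaller sample sizes (n < 30) when the population standard deviation is unknown.
With large samples the two tests give nearly identical results.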

In [None]:
# https://medium.com/tech-tensorflow/essential-statistical-tests-every-data-scientist-should-know-d3ce651cf62f

**Continuous Probability Distribution (PDF and CDF)** describes how probability density is distributed for a continuous random variable. A continuous random variable can take infinitely many values, so the distribution assigns probabilities to intervals. Unlike with a probability mass function, the probability of any single exact value is zero.

- The PDF gives the probability density at a specified value, not the probability of that value. Any normal distribution can be rescaled to the standard normal distribution.
- To obtain a probability, choose an interval and integrate the PDF over it. Equivalently, take the difference of the CDF at the interval's endpoints.

In brief, for discrete values the probability mass function (PMF) gives the probability of each outcome, and cumulative probability is found by summing the individual outcomes.

For continuous values, integrating the probability density function (PDF) gives the cumulative distribution function (CDF), and differentiating the CDF gives back the PDF. Probability is calculated over an interval. The probability of any specific value is theoretically zero, even though the probability density there is not. Although the normal distribution is the usual illustration, the same holds for all other continuous distributions.
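
The relationship between the PDF and the CDF can be checked numerically. The sketch below uses the standard normal distribution from scipy.stats and integrates its PDF with scipy.integrate.quad; the interval from -1 to 1 is an arbitrary choice for illustration.

```python
from scipy import stats
from scipy.integrate import quad

standard_normal = stats.norm(loc=0, scale=1)

# the PDF returns a density, not a probability
density_at_zero = standard_normal.pdf(0)

# probability of the interval [-1, 1] as a difference of CDF values
prob_interval = standard_normal.cdf(1) - standard_normal.cdf(-1)

# the same probability obtained by integrating the PDF over the interval
area, abs_error = quad(standard_normal.pdf, -1, 1)

print(f'density at 0: {density_at_zero:.4f}')
print(f'P(-1 <= X <= 1) from CDF: {prob_interval:.4f}')
print(f'P(-1 <= X <= 1) from integrating the PDF: {area:.4f}')
```

The two probabilities agree at about 0.6827, while the density at zero (about 0.3989) is not itself a probability.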

In [None]:
# https://towardsdatascience.com/3-key-concepts-of-probability-distribution-every-data-scientist-must-know-bfb429c61cc6