
# Section 8 Python Example: Implementing Statistical Calculations

Statistical calculations are fundamental to data analysis: they let us summarize data, interpret it, and draw conclusions from it. Python's scientific libraries, notably NumPy, SciPy, and statsmodels, provide efficient implementations of these calculations. This section illustrates how to implement essential statistical calculations in Python, covering measures of central tendency and variability, probability distributions, hypothesis testing, and regression analysis.

1. Measures of Central Tendency and Variability:

Using Python's NumPy library, we can easily calculate the mean, median, and mode (measures of central tendency) along with the standard deviation and variance (measures of variability). Here's an example using a sample dataset:

In [None]:
import numpy as np

# Sample data
data = np.array([23, 29, 20, 32, 34, 29, 27, 24, 21, 33, 25, 31])

# Calculate measures of central tendency
mean = np.mean(data)
median = np.median(data)
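mode = np.bincount(data).argmax()  # Simple mode: needs non-negative integers; ties resolve to the smallest value

# Calculate measures of variability
# (NumPy defaults to the population formulas, ddof=0; pass ddof=1 for sample estimates)
std_deviation = np.std(data)
variance = np.var(data)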

print("Mean:", mean)
print("Median:", median)
print("Mode:", mode)
print("Standard Deviation:", std_deviation)
print("Variance:", variance)
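
The cell above uses NumPy's defaults, which treat the data as a full population. When the twelve values are a sample from a larger population, the sample formulas (dividing by n - 1) are usually preferred. The following is a minimal sketch of that variant, reusing the same dataset; the use of `statistics.multimode` from the standard library for the mode is an assumption of this sketch rather than part of the original example.

```python
import numpy as np
from statistics import multimode

data = np.array([23, 29, 20, 32, 34, 29, 27, 24, 21, 33, 25, 31])

# Population standard deviation (NumPy default, ddof=0): divides by n
pop_std = np.std(data)

# Sample standard deviation and variance (ddof=1): divide by n - 1
sample_std = np.std(data, ddof=1)
sample_var = np.var(data, ddof=1)

# multimode returns every most-common value as a list, so ties are visible,
# and it also works for data that are not non-negative integers
modes = multimode(data.tolist())

print("Population std:", pop_std)
print("Sample std:", sample_std)
print("Sample variance:", sample_var)
print("Modes:", modes)
```

For a dataset this small the two standard deviations differ noticeably; the gap shrinks as the number of observations grows.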

2. Probability Distributions:

Let’s simulate data from a normal distribution and calculate probabilities using the SciPy library, which complements NumPy with more advanced statistical functions:

In [None]:
from scipy.stats import norm

# Generate random data from a normal distribution
normal_data = np.random.normal(loc=0, scale=1, size=1000)

# Estimate P(X < 0) under a normal distribution fitted to the sample mean and standard deviation
prob_less_than_zero = norm.cdf(x=0, loc=np.mean(normal_data), scale=np.std(normal_data))

print("Probability of less than zero:", prob_less_than_zero)
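
To check that the fitted probability is sensible, it can be compared with the theoretical value for the standard normal distribution and with the proportion observed in the simulated sample. The sketch below assumes a fixed seed (42, chosen arbitrarily) so the run is reproducible; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import norm

# Seeded generator so repeated runs give the same sample (seed value is arbitrary)
rng = np.random.default_rng(42)
sample = rng.normal(loc=0, scale=1, size=1000)

# Theoretical P(X < 0) for the standard normal distribution
theoretical = norm.cdf(0, loc=0, scale=1)

# Empirical proportion of simulated values below zero
empirical = np.mean(sample < 0)

# Probability of falling within one standard deviation of the mean
within_one_sd = norm.cdf(1) - norm.cdf(-1)

print("Theoretical P(X < 0):", theoretical)
print("Empirical proportion below zero:", empirical)
print("P(-1 < X < 1):", within_one_sd)
```

The theoretical value is exactly 0.5 by symmetry, and the empirical proportion should land close to it, with the difference reflecting sampling noise.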

3. Hypothesis Testing:

We can perform an independent two-sample t-test to determine whether the means of two independent samples are significantly different. By default, SciPy's `ttest_ind` assumes the two groups have equal variances. Here’s how you might use SciPy to perform this test:

In [None]:
from scipy.stats import ttest_ind

# Sample data (two groups)
group1 = np.random.normal(30, 10, 100)
group2 = np.random.normal(35, 10, 100)

# Perform a t-test
t_stat, p_value = ttest_ind(group1, group2)

print("t-statistic:", t_stat)
print("p-value:", p_value)
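
When the equal-variance assumption is in doubt, Welch's t-test is a common alternative, selected in SciPy by passing `equal_var=False`. The sketch below also shows one way to turn the p-value into a decision; the significance level of 0.05 and the seed are assumptions made for illustration, not values from the example above.

```python
import numpy as np
from scipy.stats import ttest_ind

# Seeded generator for reproducible groups (seed value is arbitrary)
rng = np.random.default_rng(0)
group_a = rng.normal(30, 10, 100)
group_b = rng.normal(35, 10, 100)

# equal_var=False selects Welch's t-test, which does not assume equal variances
t_stat, p_value = ttest_ind(group_a, group_b, equal_var=False)

# Conventional significance level (an assumption of this sketch)
alpha = 0.05
reject_null = p_value < alpha

print("t-statistic:", t_stat)
print("p-value:", p_value)
print("Reject the null hypothesis of equal means:", reject_null)
```

A negative t-statistic here simply means the first group's sample mean is lower than the second's.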

4. Regression Analysis:

For regression analysis, the statsmodels library provides model classes such as OLS (ordinary least squares) together with detailed fit summaries. Here's an example of fitting a simple linear regression:

In [None]:
import statsmodels.api as sm

# Dependent and independent variables (toy data that is exactly linear: Y = X + 20)
Y = np.array([25, 30, 35, 40, 45])  # Dependent variable (e.g., salary)
X = np.array([5, 10, 15, 20, 25])  # Independent variable (e.g., years of experience)
X = sm.add_constant(X)  # Prepends a column of ones so the model estimates an intercept

# Fit the regression model
model = sm.OLS(Y, X).fit()

# Print out the statistics
print(model.summary())
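
To see what the OLS fit estimates, the coefficients can be reproduced with NumPy's least-squares solver. This is a minimal cross-check using only NumPy, not a description of how statsmodels works internally; the variable names are illustrative.

```python
import numpy as np

y = np.array([25, 30, 35, 40, 45], dtype=float)  # Dependent variable
x = np.array([5, 10, 15, 20, 25], dtype=float)   # Independent variable

# Design matrix with a leading column of ones, mirroring sm.add_constant
design = np.column_stack([np.ones_like(x), x])

# Least-squares solution: minimizes the sum of squared residuals
coeffs, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
intercept, slope = coeffs

# R-squared: share of the variance in y explained by the fitted line
predicted = design @ coeffs
ss_res = np.sum((y - predicted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print("Intercept:", intercept)
print("Slope:", slope)
print("R-squared:", r_squared)
```

Because the toy data lie exactly on the line Y = X + 20, the intercept is 20, the slope is 1, and R-squared is 1 up to floating-point error. Real data would leave nonzero residuals.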

These examples demonstrate how Python can be used to perform a range of statistical calculations, from simple measures of central tendency to more complex analyses such as hypothesis testing and regression. With NumPy, SciPy, and statsmodels, data scientists can efficiently process and analyze large datasets and apply statistical techniques to support data-driven decisions.
