
# Section 8: Python example - statistical analysis tools
Statistical analysis is a cornerstone of data science, providing methods to interpret data, draw conclusions, and make predictions. Python’s ecosystem of libraries offers a comprehensive set of tools for this work. This section walks through practical examples: generating sample data, summarizing and visualizing it, testing a hypothesis, and fitting a regression model.

1. Setting Up the Environment:

To perform statistical analysis in Python, you need NumPy for numerical operations, SciPy for statistical tests, statsmodels for more advanced statistical modeling, and Matplotlib for plotting. If these libraries are not already installed, you can install them from a notebook cell using pip:

In [None]:
%pip install numpy scipy statsmodels matplotlib

2. Importing Required Libraries:

Begin by importing the necessary libraries. NumPy will be used for handling numerical data, SciPy for performing specific statistical tests, statsmodels for regression analysis, and Matplotlib for plotting:

In [None]:
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
import matplotlib.pyplot as plt

3. Generating Sample Data:

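For our example, let’s generate sample data that could represent test scores from three groups. Each group has 200 scores drawn from a normal distribution with its own mean and standard deviation, and the random seed is fixed so the results are reproducible: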

In [None]:
# Generate test scores for three different groups
np.random.seed(0)
group1 = np.random.normal(70, 10, 200)
group2 = np.random.normal(75, 12, 200)
group3 = np.random.normal(80, 15, 200)

4. Descriptive Statistics:

Calculate and display descriptive statistics for this sample data, which provide an initial understanding of the central tendency and dispersion:

In [None]:
# Calculate means and sample standard deviations (ddof=1 divides by n - 1)
groups = [group1, group2, group3]
means = [np.mean(group) for group in groups]
stddevs = [np.std(group, ddof=1) for group in groups]
print("Means:", means)
print("Standard Deviations:", stddevs)
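
Means and standard deviations are only part of a descriptive summary. As a minimal sketch, assuming the same seed and parameters as in step 3, the cell below uses `scipy.stats.describe` and `np.percentile` to add the range, skewness, kurtosis, and quartiles for one group. The variable names `summary`, `q1`, `median`, `q3`, and `iqr` are illustrative and not part of the original example.

```python
import numpy as np
import scipy.stats as stats

# Recreate group 1 exactly as in step 3 (same seed, same parameters)
np.random.seed(0)
group1 = np.random.normal(70, 10, 200)

# describe() returns the count, (min, max), mean, variance (ddof=1 by default),
# skewness, and kurtosis in a single result object
summary = stats.describe(group1)

# Quartiles give a dispersion measure that is less sensitive to outliers
q1, median, q3 = np.percentile(group1, [25, 50, 75])
iqr = q3 - q1

print("n:", summary.nobs)
print("Min, max:", summary.minmax)
print("Mean:", summary.mean)
print("Sample variance:", summary.variance)
print("Skewness:", summary.skewness)
print("Kurtosis:", summary.kurtosis)
print("Q1, median, Q3:", q1, median, q3)
print("IQR:", iqr)
```

The same calls can be repeated for `group2` and `group3` to compare the three summaries side by side.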

5. Visualizing the Data:

Use histograms to visualize the distribution of scores within the groups:

In [None]:
# Plot histograms
plt.hist(group1, alpha=0.7, label='Group 1')
plt.hist(group2, alpha=0.7, label='Group 2')
plt.hist(group3, alpha=0.7, label='Group 3')
plt.title('Distribution of Scores')
plt.xlabel('Scores')
plt.ylabel('Frequency')
plt.legend()
plt.show()

6. Hypothesis Testing:

Conduct a one-way ANOVA to test the null hypothesis that all three groups share the same mean. A small p-value indicates that at least one group mean differs from the others:

In [None]:
# Perform ANOVA
anova_result = stats.f_oneway(group1, group2, group3)
print("ANOVA Result: F-statistic =", anova_result.statistic, "P-value =", anova_result.pvalue)
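
A one-way ANOVA assumes the groups have similar variances, and a significant result does not say which groups differ. The sample data here was generated with different standard deviations (10, 12, and 15), so a follow-up check is reasonable. The cell below is a sketch of one possible follow-up, not the only one: Levene's test for equal variances, then pairwise Welch t-tests (`equal_var=False`) with a Bonferroni-adjusted threshold. The names `groups`, `pairs`, `adjusted_alpha`, and `pairwise_pvalues`, and the choice of `alpha = 0.05`, are assumptions made for illustration.

```python
from itertools import combinations

import numpy as np
import scipy.stats as stats

# Recreate the three groups exactly as in step 3
np.random.seed(0)
groups = {
    "Group 1": np.random.normal(70, 10, 200),
    "Group 2": np.random.normal(75, 12, 200),
    "Group 3": np.random.normal(80, 15, 200),
}

# Levene's test: the null hypothesis is that all groups have equal variance
levene_result = stats.levene(*groups.values())
print("Levene: statistic =", levene_result.statistic, "P-value =", levene_result.pvalue)

# Pairwise Welch t-tests, which do not assume equal variances
alpha = 0.05  # illustrative significance level
pairs = list(combinations(groups, 2))
adjusted_alpha = alpha / len(pairs)  # Bonferroni correction for 3 comparisons

pairwise_pvalues = {}
for name_a, name_b in pairs:
    result = stats.ttest_ind(groups[name_a], groups[name_b], equal_var=False)
    pairwise_pvalues[(name_a, name_b)] = result.pvalue
    verdict = "differs" if result.pvalue < adjusted_alpha else "no evidence of a difference"
    print(f"{name_a} vs {name_b}: P-value = {result.pvalue:.3g} ({verdict})")
```

The Bonferroni correction is deliberately conservative; it is used here because it is simple to state, not because it is the best choice for every analysis.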

7. Regression Analysis:

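Use statsmodels to fit an ordinary least squares (OLS) regression, which can help quantify how one variable predicts another. In this illustrative example, the pooled scores are the predictor and the numeric group label (1, 2, or 3) is the response: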

In [None]:
# Predictor: all scores pooled into one array
X = np.concatenate([group1, group2, group3])
# Add a constant column so the model fits an intercept
X = sm.add_constant(X)
# Response variable: numeric group label (1, 2, or 3) for each score
Y = np.concatenate([np.ones_like(group1), 2*np.ones_like(group2), 3*np.ones_like(group3)])
model = sm.OLS(Y, X).fit()
print(model.summary())

8. Conclusion:

These examples illustrate a few ways Python can be used for statistical analysis. With NumPy, SciPy, statsmodels, and Matplotlib, you can summarize data with descriptive statistics, visualize distributions, test hypotheses with ANOVA, and fit regression models. Together, these steps form the core of a thorough exploratory data analysis.