# Statistical Analysis with NumPy and Pandas

This notebook demonstrates how to perform statistical analysis using NumPy and Pandas, with SciPy's `stats` module for the hypothesis tests. It covers basic statistical measures, hypothesis testing, and correlation analysis.

## Table of Contents

1. Descriptive Statistics
   1. Mean
   2. Median
   3. Mode
   4. Standard Deviation
   5. Variance
   6. Range
   7. Quartiles
2. Hypothesis Testing
   1. t-Test
   2. ANOVA
3. Correlation Analysis
   1. Pearson Correlation Coefficient
   2. Spearman Correlation Coefficient

Let's get started!

## 1. Descriptive Statistics

Descriptive statistics summarize and describe the main features of a dataset. NumPy and Pandas provide functions for calculating various statistical measures.

### 1.1 Mean

The mean represents the average value of a dataset. NumPy and Pandas offer functions to calculate the mean.

Let's calculate the mean of a dataset using NumPy and Pandas.

In [ ]:
import numpy as np
import pandas as pd

# Create a NumPy array
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Calculate mean using NumPy
mean_numpy = np.mean(data)
print('Mean (NumPy):', mean_numpy)

# Calculate mean using Pandas
mean_pandas = pd.Series(data).mean()
print('Mean (Pandas):', mean_pandas)
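
One practical difference between the two libraries is how they treat missing values. The following is a minimal sketch on a small made-up array (`data_with_nan` is an illustrative name, not part of the dataset above): `np.mean` propagates `NaN`, `np.nanmean` ignores it, and `Series.mean` skips it by default.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value (for illustration only)
data_with_nan = np.array([1.0, 2.0, np.nan, 4.0])

# np.mean propagates NaN, so the result is NaN
mean_propagated = np.mean(data_with_nan)

# np.nanmean ignores NaN values
mean_ignoring_nan = np.nanmean(data_with_nan)

# Series.mean skips NaN by default (skipna=True)
mean_pandas_skipna = pd.Series(data_with_nan).mean()

print('np.mean:', mean_propagated)
print('np.nanmean:', mean_ignoring_nan)
print('Series.mean:', mean_pandas_skipna)
```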

### 1.2 Median

The median is the middle value of a sorted dataset; when the dataset has an even number of values, it is the average of the two middle values. NumPy and Pandas provide functions to calculate the median.

Let's calculate the median of a dataset using NumPy and Pandas.

In [ ]:
# Calculate median using NumPy
median_numpy = np.median(data)
print('Median (NumPy):', median_numpy)

# Calculate median using Pandas
median_pandas = pd.Series(data).median()
print('Median (Pandas):', median_pandas)
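
The dataset above has an odd number of values, so its median is a single element. As a small sketch of the even-length case, the made-up array below (`even_data` is an illustrative name) has no single middle value, and both libraries average the two middle values. The input does not need to be sorted first.

```python
import numpy as np
import pandas as pd

# Hypothetical unsorted data with an even number of values (sorted: 1, 3, 5, 7)
even_data = np.array([7, 1, 5, 3])

# Both functions average the two middle values, 3 and 5
median_even_numpy = np.median(even_data)
median_even_pandas = pd.Series(even_data).median()

print('Median (NumPy):', median_even_numpy)
print('Median (Pandas):', median_even_pandas)
```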

### 1.3 Mode

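The mode is the most frequently occurring value in a dataset; a dataset can have more than one mode. Pandas provides `Series.mode()`, which returns every mode. NumPy has no dedicated mode function, but the mode can be derived from the counts returned by `np.unique`.

Let's calculate the mode of a dataset using NumPy and Pandas.

In [ ]:
# Create a Pandas Series with multiple modes (2 and 3 each appear twice)
data_mode = pd.Series([1, 2, 2, 3, 3, 4, 5])

# Calculate mode using NumPy: count each unique value, keep the most frequent
values, counts = np.unique(data_mode, return_counts=True)
mode_numpy = values[counts == counts.max()]
print('Mode (NumPy):', mode_numpy)

# Calculate mode using Pandas
mode_pandas = data_mode.mode()
print('Mode (Pandas):', mode_pandas.values)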

### 1.4 Standard Deviation

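The standard deviation measures the amount of variation or dispersion in a dataset. NumPy and Pandas both provide a function for it, but with different defaults: `np.std` computes the population standard deviation (`ddof=0`), while `Series.std` computes the sample standard deviation (`ddof=1`), so the two results below differ.

Let's calculate the standard deviation of a dataset using NumPy and Pandas.

In [ ]:
# Calculate standard deviation using NumPy (population, ddof=0 by default)
std_dev_numpy = np.std(data)
print('Standard Deviation (NumPy):', std_dev_numpy)

# Calculate standard deviation using Pandas (sample, ddof=1 by default)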
std_dev_pandas = pd.Series(data).std()
print('Standard Deviation (Pandas):', std_dev_pandas)
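
To make the two libraries agree, the `ddof` ("delta degrees of freedom") argument can be set explicitly. The sketch below reuses the same nine values and computes both the population and the sample standard deviation in each library.

```python
import numpy as np
import pandas as pd

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Population standard deviation: divide by n (ddof=0)
population_std_numpy = np.std(data, ddof=0)
population_std_pandas = pd.Series(data).std(ddof=0)

# Sample standard deviation: divide by n - 1 (ddof=1)
sample_std_numpy = np.std(data, ddof=1)
sample_std_pandas = pd.Series(data).std(ddof=1)

print('Population:', population_std_numpy, population_std_pandas)
print('Sample:', sample_std_numpy, sample_std_pandas)
```

Which one is appropriate depends on whether the data is the whole population or a sample drawn from it.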

### 1.5 Variance

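The variance is the average of the squared deviations from the mean. NumPy and Pandas both provide a function for it, but with different defaults: `np.var` divides by n (population variance, `ddof=0`), while `Series.var` divides by n - 1 (sample variance, `ddof=1`), so the two results below differ.

Let's calculate the variance of a dataset using NumPy and Pandas.

In [ ]:
# Calculate variance using NumPy (population, ddof=0 by default)
variance_numpy = np.var(data)
print('Variance (NumPy):', variance_numpy)

# Calculate variance using Pandas (sample, ddof=1 by default)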
variance_pandas = pd.Series(data).var()
print('Variance (Pandas):', variance_pandas)
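
As a check on what these functions compute, the variance can also be written out from its definition. This is a minimal sketch on the same nine values; the variable names are illustrative.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Deviation of each value from the mean
deviations = data - data.mean()

# Sum of squared deviations
squared_sum = np.sum(deviations ** 2)

# Population variance divides by n; sample variance divides by n - 1
population_variance = squared_sum / len(data)
sample_variance = squared_sum / (len(data) - 1)

print('Population variance:', population_variance)
print('Sample variance:', sample_variance)
```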

### 1.6 Range

The range is the difference between the maximum and minimum values in a dataset. NumPy provides `np.ptp` ("peak to peak") to calculate it.

Let's calculate the range of a dataset using NumPy.

In [ ]:
# Calculate range using NumPy
range_numpy = np.ptp(data)
print('Range (NumPy):', range_numpy)
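
A Pandas equivalent can be sketched by subtracting the minimum from the maximum directly, which gives the same result as `np.ptp`.

```python
import numpy as np
import pandas as pd

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
series = pd.Series(data)

# Range as maximum minus minimum
range_pandas = series.max() - series.min()
print('Range (Pandas):', range_pandas)
```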

### 1.7 Quartiles

The quartiles are the three cut points (the 25th, 50th, and 75th percentiles) that divide a sorted dataset into four parts, each containing about 25% of the data. NumPy and Pandas provide functions to calculate quartiles.

Let's calculate the quartiles of a dataset using NumPy and Pandas.

In [ ]:
# Calculate quartiles using NumPy
quartiles_numpy = np.percentile(data, [25, 50, 75])
print('Quartiles (NumPy):', quartiles_numpy)

# Calculate quartiles using Pandas
quartiles_pandas = pd.Series(data).quantile([0.25, 0.5, 0.75])
print('Quartiles (Pandas):')
print(quartiles_pandas)
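
A common use of quartiles is the interquartile range (IQR), the distance between the first and third quartiles. The sketch below also flags values lying more than 1.5 × IQR beyond the quartiles; that threshold is a widely used convention, chosen here as an assumption rather than something this notebook prescribes.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# First and third quartiles
q1, q3 = np.percentile(data, [25, 75])

# Interquartile range: spread of the middle 50% of the data
iqr = q3 - q1

# Conventional fences at 1.5 * IQR beyond the quartiles (an assumed threshold)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print('IQR:', iqr)
print('Outliers:', outliers)
```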

## 2. Hypothesis Testing

Hypothesis testing is used to draw inferences about a population from sample data. NumPy and Pandas do not include hypothesis tests themselves; the examples below use the `scipy.stats` module.

### 2.1 t-Test

The independent two-sample t-test compares the means of two groups to determine whether they differ significantly. SciPy provides `stats.ttest_ind` for this test.

Let's perform a t-test using SciPy.

In [ ]:
from scipy import stats

# Create two samples
sample1 = np.array([1, 2, 3, 4, 5])
sample2 = np.array([2, 4, 6, 8, 10])

# Perform t-test
t_statistic, p_value = stats.ttest_ind(sample1, sample2)
print('t-statistic:', t_statistic)
print('p-value:', p_value)
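
To show what the test computes, the sketch below rebuilds the t-statistic by hand, assuming the pooled-variance (equal variance) form that `stats.ttest_ind` uses by default, and compares it with SciPy's result. It also runs Welch's variant (`equal_var=False`), which does not assume equal variances. The names `pooled_var`, `dof`, and the `_manual` suffixes are illustrative.

```python
import numpy as np
from scipy import stats

sample1 = np.array([1, 2, 3, 4, 5])
sample2 = np.array([2, 4, 6, 8, 10])

n1, n2 = len(sample1), len(sample2)
dof = n1 + n2 - 2  # degrees of freedom for the pooled test

# Pooled estimate of the common variance from the two sample variances
pooled_var = ((n1 - 1) * sample1.var(ddof=1) + (n2 - 1) * sample2.var(ddof=1)) / dof

# t-statistic: difference in means divided by its standard error
t_manual = (sample1.mean() - sample2.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Two-sided p-value from the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

# SciPy's default (equal variances assumed) and Welch's variant
t_scipy, p_scipy = stats.ttest_ind(sample1, sample2)
t_welch, p_welch = stats.ttest_ind(sample1, sample2, equal_var=False)

print('Manual:', t_manual, p_manual)
print('SciPy: ', t_scipy, p_scipy)
print('Welch: ', t_welch, p_welch)
```

With these samples the variances differ (2.5 versus 10), so Welch's variant is arguably the safer choice.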

### 2.2 ANOVA

One-way Analysis of Variance (ANOVA) compares the means of several groups, typically three or more, to determine whether at least one differs significantly from the others. SciPy provides `stats.f_oneway` for this test.

Let's perform ANOVA using SciPy.

In [ ]:
# Create three samples
sample1 = np.array([1, 2, 3, 4, 5])
sample2 = np.array([2, 4, 6, 8, 10])
sample3 = np.array([3, 6, 9, 12, 15])

# Perform ANOVA
f_statistic, p_value = stats.f_oneway(sample1, sample2, sample3)
print('F-statistic:', f_statistic)
print('p-value:', p_value)
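
The F-statistic is the ratio of the variation between group means to the variation within groups. The following sketch computes it by hand for the same three samples and compares it with `stats.f_oneway`; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

groups = [
    np.array([1, 2, 3, 4, 5]),
    np.array([2, 4, 6, 8, 10]),
    np.array([3, 6, 9, 12, 15]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = len(all_values)  # total number of observations

# Between-group sum of squares: group means versus the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group sum of squares: values versus their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print('Manual:', f_manual, p_manual)
print('SciPy: ', f_scipy, p_scipy)
```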

## 3. Correlation Analysis

Correlation analysis measures the strength and direction of the relationship between two variables. NumPy, Pandas, and SciPy provide functions to calculate correlation coefficients.

### 3.1 Pearson Correlation Coefficient

The Pearson correlation coefficient measures the linear correlation between two variables. NumPy and Pandas provide functions to calculate the Pearson correlation coefficient.

Let's calculate the Pearson correlation coefficient using NumPy and Pandas.

In [ ]:
# Create two arrays
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

# Calculate Pearson correlation coefficient using NumPy
pearson_corr_numpy = np.corrcoef(x, y)[0, 1]
print('Pearson Correlation Coefficient (NumPy):', pearson_corr_numpy)

# Create a DataFrame
df = pd.DataFrame({'x': x, 'y': y})

# Calculate Pearson correlation coefficient using Pandas
pearson_corr_pandas = df['x'].corr(df['y'])
print('Pearson Correlation Coefficient (Pandas):', pearson_corr_pandas)
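
For reference, the Pearson coefficient is the sum of the products of the deviations divided by the square root of the product of the summed squared deviations. A minimal sketch of that definition, checked against `np.corrcoef`:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

# Deviations of each variable from its own mean
x_dev = x - x.mean()
y_dev = y - y.mean()

# Pearson r from its definition
pearson_manual = np.sum(x_dev * y_dev) / np.sqrt(np.sum(x_dev ** 2) * np.sum(y_dev ** 2))
print('Pearson (manual):', pearson_manual)
```

Because `y` is exactly `2 * x`, the relationship is perfectly linear and the coefficient is 1.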

### 3.2 Spearman Correlation Coefficient

The Spearman correlation coefficient measures the strength of the monotonic relationship between two variables, based on the ranks of the values rather than the values themselves. SciPy and Pandas provide functions to calculate it; NumPy has no dedicated function.

Let's calculate the Spearman correlation coefficient using SciPy and Pandas.

In [ ]:
# Calculate Spearman correlation coefficient using SciPy
spearman_corr_scipy = stats.spearmanr(x, y).correlation
print('Spearman Correlation Coefficient (SciPy):', spearman_corr_scipy)

# Calculate Spearman correlation coefficient using Pandas
spearman_corr_pandas = df['x'].corr(df['y'], method='spearman')
print('Spearman Correlation Coefficient (Pandas):', spearman_corr_pandas)
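
The example above uses perfectly linear data, so Pearson and Spearman agree. The sketch below uses a made-up monotonic but nonlinear relationship (`y_cubic` is an illustrative name) to show where they differ, and checks that Spearman's coefficient equals Pearson's coefficient computed on the ranks.

```python
import numpy as np
import pandas as pd
from scipy import stats

x = np.array([1, 2, 3, 4, 5])
y_cubic = x ** 3  # hypothetical data: increasing, but not linear in x

# Pearson is below 1 because the relationship is not linear
pearson = np.corrcoef(x, y_cubic)[0, 1]

# Spearman is 1 because the ordering of the values is preserved exactly
spearman, _ = stats.spearmanr(x, y_cubic)

# Spearman is Pearson applied to the ranks of the values
rank_pearson = pd.Series(x).rank().corr(pd.Series(y_cubic).rank())

print('Pearson:', pearson)
print('Spearman:', spearman)
print('Pearson on ranks:', rank_pearson)
```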