# (PART) STATISTICAL ANALYSIS {-}

# What are summary statistics and how do you compute them?

## Explanation

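Summary statistics condense a dataset into a few numbers that describe its central tendency (mean, median, mode) and its spread (standard deviation, minimum, maximum, quantiles). In pandas, `describe()` reports the count, mean, standard deviation, minimum, quartiles, and maximum of each numeric column; in R, `summary()` reports the minimum, quartiles, median, mean, and maximum. Neither reports the mode.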

## Python Code

```python
import pandas as pd

# Load sample data
df = pd.read_csv("data/iris.csv")

# Summary statistics for the numeric columns
summary = df.describe()
print(summary)
```
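Because `describe()` leaves out the mode, it can help to compute it alongside the median and the interquartile range. The sketch below uses a small made-up sample rather than `data/iris.csv`; the name `sample` and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical measurements, made up for illustration (not the iris data)
sample = pd.Series([5.1, 4.9, 5.0, 5.1, 5.4, 5.0, 5.1])

median = sample.median()
modes = sample.mode()  # returns a Series because several values can tie
q1, q3 = sample.quantile([0.25, 0.75])

print(f"Median: {median}")
print(f"Mode(s): {modes.tolist()}")
print(f"Interquartile range: {q3 - q1:.3f}")
```

`mode()` returns every value that ties for the highest frequency, so the result is a list rather than a single number.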


## R Code

```{r}
library(readr)
library(dplyr)

# Load sample data
df <- read_csv("data/iris.csv")

# Summary statistics
summary(df)
```

# How do you test if the mean of two groups is significantly different?

## Explanation

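You can use a two-sample t-test to compare the means of two independent groups. It tests the null hypothesis that the two population means are equal. Note that the defaults differ between the two languages: R's `t.test()` runs Welch's t-test, which does not assume equal variances, whereas SciPy's `ttest_ind` assumes equal variances unless you pass `equal_var=False`.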

## Python Code

```python
import pandas as pd
from scipy.stats import ttest_ind

# Load sample data
df = pd.read_csv("data/iris.csv")

# t-test between two species
group1 = df[df['species'] == 'setosa']['sepal_length']
group2 = df[df['species'] == 'versicolor']['sepal_length']

# Welch's t-test (equal_var=False), matching the default of R's t.test()
t_stat, p_val = ttest_ind(group1, group2, equal_var=False)
print(f"T-statistic: {t_stat}, P-value: {p_val}")
```
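To see what the test computes, the following sketch works out Welch's t statistic (the variant R's `t.test()` runs by default) by hand and compares it with SciPy. The two arrays `a` and `b` are made-up values, not the iris data.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical samples, made up for illustration
a = np.array([5.1, 4.9, 4.7, 5.0, 5.4, 4.6])
b = np.array([7.0, 6.4, 6.9, 5.5, 6.5, 5.7])

# Squared standard error of each group mean
var_a = a.var(ddof=1) / len(a)
var_b = b.var(ddof=1) / len(b)

# Welch's t statistic: difference in means over its standard error
t_manual = (a.mean() - b.mean()) / np.sqrt(var_a + var_b)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch = (var_a + var_b) ** 2 / (
    var_a ** 2 / (len(a) - 1) + var_b ** 2 / (len(b) - 1)
)

t_scipy, p_scipy = ttest_ind(a, b, equal_var=False)
print(f"Manual t: {t_manual:.4f}, SciPy t: {t_scipy:.4f}")
print(f"Welch degrees of freedom: {df_welch:.2f}, P-value: {p_scipy:.4g}")
```

The degrees of freedom are usually not a whole number, which is why Welch output in R shows a fractional `df`.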


## R Code

```{r}
library(readr)
library(dplyr)

# Load data
df <- read_csv("data/iris.csv")

# t-test between two species
setosa <- df %>% filter(species == "setosa") %>% pull(sepal_length)
versicolor <- df %>% filter(species == "versicolor") %>% pull(sepal_length)

t.test(setosa, versicolor)
```

# How do you compute the correlation between two variables?

## Explanation

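Correlation measures the strength and direction of a linear relationship between two variables. The Pearson correlation coefficient ranges from -1 (perfect negative linear relationship) to 1 (perfect positive linear relationship), with 0 indicating no linear relationship.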

## Python Code

```python
import pandas as pd

# Load sample data
df = pd.read_csv("data/iris.csv")

# Pearson correlation between sepal length and sepal width
correlation = df["sepal_length"].corr(df["sepal_width"])
print("Correlation:", correlation)
```
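The coefficient can also be computed directly from its definition: the sum of products of deviations divided by the square root of the product of the summed squared deviations. This is a minimal sketch on made-up arrays `x` and `y` (assumed values, not the iris columns), checked against `np.corrcoef`.

```python
import numpy as np

# Hypothetical paired measurements, made up for illustration
x = np.array([5.1, 4.9, 4.7, 4.6, 5.0, 5.4])
y = np.array([3.5, 3.0, 3.2, 3.1, 3.6, 3.9])

# Deviations from each variable's mean
dx = x - x.mean()
dy = y - y.mean()

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
r_numpy = np.corrcoef(x, y)[0, 1]

print(f"Manual r: {r_manual:.4f}, NumPy r: {r_numpy:.4f}")
```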


## R Code

```{r}
library(readr)

# Load sample data
df <- read_csv("data/iris.csv")

# Pearson correlation
cor(df$sepal_length, df$sepal_width)
```

# What is a confidence interval and how do you calculate it?

## Explanation

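A confidence interval is a range of values, computed from a sample, that is likely to contain the population parameter. The confidence level describes the procedure rather than any single interval: across repeated samples, about 95% of 95% confidence intervals would contain the true value. A 95% interval for the mean is the most common choice.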

## Python Code

```python
import pandas as pd
from scipy import stats
import numpy as np

# Load sample data
df = pd.read_csv("data/iris.csv")
data = df["sepal_length"]

# Compute 95% confidence interval
mean = np.mean(data)
sem = stats.sem(data)
conf_int = stats.t.interval(0.95, len(data)-1, loc=mean, scale=sem)

print(f"95% Confidence Interval: {conf_int}")
```
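One way to build intuition for the 95% level is a small simulation: draw many samples from a population with a known mean and count how often the t-based interval contains it. The population parameters, sample size, and seed below are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_mean, n, n_reps = 5.8, 30, 2000  # assumed values for the simulation

# Draw n_reps independent samples of size n from a normal population
samples = rng.normal(loc=true_mean, scale=0.8, size=(n_reps, n))

means = samples.mean(axis=1)
sems = samples.std(axis=1, ddof=1) / np.sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)  # two-sided 95% critical value

# Interval for each sample: mean +/- t_crit * standard error
lower = means - t_crit * sems
upper = means + t_crit * sems

coverage = np.mean((lower <= true_mean) & (true_mean <= upper))
print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

The printed share should land close to 0.95, though it varies a little with the seed and the number of repetitions.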


## R Code

```{r}
library(readr)

# Load sample data
df <- read_csv("data/iris.csv")
data <- df$sepal_length

# 95% confidence interval
t.test(data)$conf.int
```

# How do you perform a one-way ANOVA?

## Explanation

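One-way ANOVA (analysis of variance) compares the means of three or more groups defined by a single factor. It tests the null hypothesis that all groups have the same mean; a significant result indicates that at least one group mean differs, but not which one.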

## Python Code

```python
import pandas as pd
from scipy.stats import f_oneway

# Load sample data
df = pd.read_csv("data/iris.csv")

# Group by species
setosa = df[df["species"] == "setosa"]["sepal_length"]
versicolor = df[df["species"] == "versicolor"]["sepal_length"]
virginica = df[df["species"] == "virginica"]["sepal_length"]

# Perform ANOVA
f_stat, p_val = f_oneway(setosa, versicolor, virginica)
print(f"F-statistic: {f_stat}, P-value: {p_val}")
```
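The F-statistic is the ratio of between-group variability to within-group variability. The sketch below computes it from sums of squares on three made-up groups (assumed values, not the iris data) and compares the result with `f_oneway`.

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical groups, made up for illustration
groups = [
    np.array([5.1, 4.9, 4.7, 5.0]),
    np.array([7.0, 6.4, 6.9, 5.5]),
    np.array([6.3, 5.8, 7.1, 6.5]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k, n_total = len(groups), len(all_values)

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
f_scipy, p_val = f_oneway(*groups)

print(f"Manual F: {f_manual:.4f}, SciPy F: {f_scipy:.4f}, P-value: {p_val:.4g}")
```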


## R Code

```{r}
library(readr)

# Load sample data
df <- read_csv("data/iris.csv")

# Perform one-way ANOVA
anova_result <- aov(sepal_length ~ species, data = df)
summary(anova_result)
```

# How do you interpret a p-value in hypothesis testing?

## Explanation

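The p-value is the probability of observing a test statistic at least as extreme as the one obtained, assuming the null hypothesis is true. A smaller p-value indicates stronger evidence against the null hypothesis. It is not the probability that the null hypothesis is true, and the common 0.05 threshold is a convention rather than a rule.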

## Python Code

```python
# P-value is typically output by test functions, e.g., t-test, ANOVA
# Interpret p < 0.05 as statistically significant
print("If p < 0.05, we reject the null hypothesis.")
```
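To make the definition concrete, the sketch below derives a two-sided p-value for a one-sample t-test from the tail area of the t distribution and checks it against `scipy.stats.ttest_1samp`. The data, the hypothesized mean `null_mean`, and the threshold `alpha` are assumptions for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical sample and null hypothesis, made up for illustration
data = np.array([5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0])
null_mean = 5.0

n = len(data)
t_stat = (data.mean() - null_mean) / (data.std(ddof=1) / np.sqrt(n))

# Two-sided p-value: probability of a |t| at least this large under the null
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 1)

result = stats.ttest_1samp(data, popmean=null_mean)

alpha = 0.05  # conventional significance level
decision = "reject" if p_manual < alpha else "fail to reject"
print(f"t = {t_stat:.3f}, manual p = {p_manual:.4f}, SciPy p = {result.pvalue:.4f}")
print(f"At alpha = {alpha}, we {decision} the null hypothesis.")
```

Failing to reject the null hypothesis does not show that it is true; it only means the sample does not provide enough evidence against it.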


## R Code

```{r}
# P-values are returned by tests such as t.test or aov
# Print interpretation guide
cat("If p < 0.05, we reject the null hypothesis.\n")
```

# How do you use a chi-squared test to determine if two categorical variables are independent?

## Explanation

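The chi-squared test of independence assesses whether two categorical variables are associated. It compares the observed counts in a contingency table with the counts expected if the variables were independent; a large chi-squared statistic (and a small p-value) is evidence against independence. For 2x2 tables, both `scipy.stats.chi2_contingency` and R's `chisq.test` apply Yates' continuity correction by default.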

## Python Code

```python
import pandas as pd
import scipy.stats as stats

# Sample contingency table
data = pd.DataFrame({
    "A": [20, 15],
    "B": [30, 35]
}, index=["Yes", "No"])

# Chi-squared test
chi2, p, dof, expected = stats.chi2_contingency(data)
print(f"Chi2: {chi2:.2f}, p-value: {p:.4f}")
```
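The expected counts come from the row and column totals: each cell's expected value is its row total times its column total, divided by the grand total. The following sketch recomputes the statistic by hand for the same counts as the table above and compares it with SciPy, with and without the continuity correction. The variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Same counts as the contingency table above (rows: Yes/No, columns: A/B)
observed = np.array([[20, 30],
                     [15, 35]])

# Expected counts under independence: row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Pearson chi-squared statistic without continuity correction
chi2_manual = ((observed - expected) ** 2 / expected).sum()

chi2_uncorrected, p_uncorrected, dof, _ = chi2_contingency(observed, correction=False)
chi2_yates, p_yates, _, _ = chi2_contingency(observed)  # Yates' correction (default)

print(f"Manual: {chi2_manual:.4f}, SciPy uncorrected: {chi2_uncorrected:.4f}")
print(f"SciPy with Yates' correction: {chi2_yates:.4f} (p = {p_yates:.4f})")
```

The corrected statistic is smaller than the uncorrected one, so the default test is slightly more conservative for 2x2 tables.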

## R Code

```{r}
# Create a contingency table
data <- matrix(c(20, 30, 15, 35), nrow = 2, byrow = TRUE)
colnames(data) <- c("A", "B")
rownames(data) <- c("Yes", "No")

# Perform chi-squared test
test <- chisq.test(data)
test
```