1. Covariance and Correlation

Covariance

Definition: Covariance measures how two variables change together.


Positive Covariance: When one variable increases, the other tends to increase too.

Negative Covariance: When one variable increases, the other tends to decrease.

Example:

If students study more hours and score higher marks, study hours and exam scores have positive covariance.

If temperature increases and sweater sales decrease, temperature and sweater sales have negative covariance.
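
The definition above can be made concrete with a small sketch that computes the sample covariance by hand. The helper name `sample_covariance` and the temperature and sweater-sales numbers are illustrative assumptions, not data from this notebook. The n − 1 divisor is the sample convention, which is also the pandas default.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    products = [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

# Study hours vs. exam scores: both rise together.
study_hours = [2, 3, 4, 5, 6]
exam_scores = [50, 55, 60, 65, 70]
cov_study = sample_covariance(study_hours, exam_scores)
print("Study hours vs. scores:", cov_study)  # 12.5 (positive)

# Hypothetical temperature vs. sweater sales: one rises as the other falls.
temperature = [10, 15, 20, 25, 30]
sweater_sales = [40, 32, 25, 15, 8]
cov_sweaters = sample_covariance(temperature, sweater_sales)
print("Temperature vs. sweater sales:", cov_sweaters)  # negative
```

Note that the size of a covariance depends on the units of both variables, so only its sign is easy to interpret directly.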


Correlation

Definition: Correlation standardizes covariance to a value between −1 and 1 for easier interpretation.


Correlation = 1: Perfect positive linear relationship.

Correlation = -1: Perfect negative linear relationship.

Correlation = 0: No linear relationship (a nonlinear relationship may still exist).

Example:

Correlation between height and weight might be 0.85, indicating a strong positive relationship.
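
As a rough sketch of what "standardizing" means, the Pearson correlation divides the covariance by the product of the two standard deviations. The function name `pearson_correlation` and the small `xs`/`ys` lists are illustrative assumptions. The n − 1 factors cancel, so the code works directly with sums of deviations.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson r = cov(x, y) / (std(x) * std(y)); the n - 1 terms cancel."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    if sum_xx == 0 or sum_yy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sum_xy / math.sqrt(sum_xx * sum_yy)

# Hypothetical data with an imperfect positive relationship.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
r = pearson_correlation(xs, ys)

# Rescaling one variable changes the covariance but not the correlation.
r_rescaled = pearson_correlation(xs, [y * 100 for y in ys])

print("r:", r)
print("r after rescaling ys:", r_rescaled)
```

Both calls give a value of about 0.8, which is why correlation is easier to compare across datasets than covariance.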

In [2]:
import pandas as pd

# Sample dataset
data = {
    "Study Hours": [2, 3, 4, 5, 6],
    "Exam Scores": [50, 55, 60, 65, 70]
}
df = pd.DataFrame(data)

# Covariance
cov_matrix = df.cov()
print("Covariance Matrix:\n", cov_matrix)

# Correlation
correlation_matrix = df.corr()
print("\nCorrelation Matrix:\n", correlation_matrix)


Covariance Matrix:
              Study Hours  Exam Scores
Study Hours          2.5         12.5
Exam Scores         12.5         62.5

Correlation Matrix:
              Study Hours  Exam Scores
Study Hours          1.0          1.0
Exam Scores          1.0          1.0


2. Descriptive Statistics
Descriptive statistics summarize the main features of a dataset.

Central Tendency

Mean: The average of all data points.

Example: Exam scores [50,60,70]. Mean = (50 + 60 + 70) / 3 = 60

Median: The middle value when data is sorted.

Example: [50,60,70]. Median = 60

If there is an even number of values, the median is the average of the two middle values.

Mode: The most frequent value.

Example: [50,60,60,70]. Mode = 60
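
The three examples above can be checked with Python's standard `statistics` module. This is only a sketch; the even-length list `[50, 60, 70, 80]` is an extra illustrative input that does not appear in the text.

```python
import statistics

# Mean and median of the three-score example.
mean_value = statistics.mean([50, 60, 70])
median_odd = statistics.median([50, 60, 70])

# Even number of values: the median averages the two middle values (60 and 70).
median_even = statistics.median([50, 60, 70, 80])

# Mode: the most frequent value.
mode_value = statistics.mode([50, 60, 60, 70])

# When every value occurs once, all of them are tied for "most frequent".
tied_modes = statistics.multimode([50, 55, 60, 65, 70])

print("Mean:", mean_value)
print("Median (odd count):", median_odd)
print("Median (even count):", median_even)
print("Mode:", mode_value)
print("Tied modes:", tied_modes)
```

The last call shows why a single "mode" can be misleading for data with no repeated values.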


In [3]:
# Central Tendency
mean = df["Exam Scores"].mean()
median = df["Exam Scores"].median()
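# mode() returns every value tied for most frequent, sorted ascending.
# All five scores occur once here, so [0] simply picks the smallest (50).
mode = df["Exam Scores"].mode()[0]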

print("Mean:", mean)
print("Median:", median)
print("Mode:", mode)


Mean: 60.0
Median: 60.0
Mode: 50


Variability

Range: The difference between the highest and lowest values.

Example: [50,60,70]. Range = 70 − 50 = 20.

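Variance: The average squared deviation from the mean. For a sample, the sum of squared deviations is divided by n − 1 rather than n, which is what pandas var() computes by default.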

Standard Deviation: The square root of the variance.

Provides a measure of the spread of data around the mean.

Example: If exam scores are roughly normally distributed with a standard deviation of 5, about 68% of scores fall within 5 points of the mean.
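
A minimal sketch of these measures computed step by step, using the five exam scores from the sample dataset. It shows both divisors side by side: n for the population variance and n − 1 for the sample variance. The variable names are illustrative.

```python
import math

scores = [50, 55, 60, 65, 70]
n = len(scores)
mean = sum(scores) / n

# Range: highest minus lowest value.
value_range = max(scores) - min(scores)

# Squared deviations from the mean.
squared_devs = [(x - mean) ** 2 for x in scores]

# Population variance divides by n; sample variance divides by n - 1.
population_variance = sum(squared_devs) / n
sample_variance = sum(squared_devs) / (n - 1)

# Standard deviation is the square root of the variance.
sample_std = math.sqrt(sample_variance)

print("Range:", value_range)
print("Population variance:", population_variance)
print("Sample variance:", sample_variance)
print("Sample standard deviation:", sample_std)
```

The sample variance here is 62.5, matching the pandas default, while the population variance for the same scores is 50.0.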

In [4]:
# Variability
variance = df["Exam Scores"].var()
std_dev = df["Exam Scores"].std()

print("Variance:", variance)
print("Standard Deviation:", std_dev)


Variance: 62.5
Standard Deviation: 7.905694150420948


3. Inferential Statistics
Inferential statistics allow us to make generalizations or test hypotheses using sample data.

Hypothesis Testing

Definition: A statistical method to determine if there is enough evidence to support a claim.

Example:


Claim: Students who study more than 4 hours score higher than 60.

Null Hypothesis (H0): The mean score equals 60 (no difference).

Alternative Hypothesis (H1): Mean > 60.

T-Test: Used to compare a sample mean with a population mean.

Output:

T-Statistic: Measures the difference between sample and population mean in units of standard error.

P-Value: The probability of observing a result at least as extreme as the one obtained, assuming the null hypothesis is true.

Decision: If p < 0.05, reject H0.
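
To show what "in units of standard error" means, here is a sketch that computes the one-sample t-statistic by hand. The helper name `one_sample_t_statistic` is an assumption for illustration. The input `[65, 70]` is the pair of scores for students in the sample dataset who studied more than 4 hours.

```python
import math
import statistics

def one_sample_t_statistic(sample, popmean):
    """t = (sample mean - hypothesized mean) / (sample std / sqrt(n))."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least 2 observations")
    sample_std = statistics.stdev(sample)  # uses the n - 1 divisor
    if sample_std == 0:
        raise ValueError("t-statistic is undefined when all values are equal")
    std_error = sample_std / math.sqrt(n)
    return (statistics.mean(sample) - popmean) / std_error

t_value = one_sample_t_statistic([65, 70], 60)
print("T-Statistic:", t_value)  # approximately 3.0
```

With only two observations the test has a single degree of freedom, so even a t-statistic of about 3 gives a large p-value.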


In [5]:
from scipy.stats import ttest_1samp

# Filter data for students studying more than 4 hours
study_above_4 = df[df["Study Hours"] > 4]["Exam Scores"]

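# Perform one-sample t-test.
# Note: ttest_1samp is two-sided by default, so this p-value tests mean != 60.
# For the one-sided H1 (mean > 60), SciPy 1.6+ accepts alternative="greater".
t_stat, p_value = ttest_1samp(study_above_4, 60)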
print("T-Statistic:", t_stat)
print("P-Value:", p_value)

# Decision
alpha = 0.05  # Significance level
if p_value < alpha:
    print("Reject the null hypothesis: Students studying >4 hours score significantly higher than 60.")
else:
    print("Fail to reject the null hypothesis: No significant difference.")


T-Statistic: 3.0
P-Value: 0.20483276469913345
Fail to reject the null hypothesis: No significant difference.


Confidence Intervals
Definition: A range of values that likely contains the true population mean.

Example: A 95% confidence interval of (58,62) means that if the sampling were repeated many times, about 95% of the intervals constructed this way would contain the true mean.
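
The interval is built as mean ± (critical t value × standard error). The sketch below works through that arithmetic for the five exam scores. The critical value 2.776 is an assumption taken from a standard t-table (two-sided 95%, 4 degrees of freedom, rounded), so the endpoints are approximate.

```python
import math
import statistics

scores = [50, 55, 60, 65, 70]
n = len(scores)

mean_score = statistics.mean(scores)
std_error = statistics.stdev(scores) / math.sqrt(n)  # standard error of the mean

# Two-sided 95% critical value for n - 1 = 4 degrees of freedom,
# hard-coded from a t-table (rounded); a library lookup would be more precise.
t_critical = 2.776

margin = t_critical * std_error
lower, upper = mean_score - margin, mean_score + margin
print(f"Approximate 95% CI: ({lower:.2f}, {upper:.2f})")
```

The interval is wide, roughly 50 to 70, because the sample contains only five scores.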


In [6]:
import scipy.stats as stats

# Data
scores = df["Exam Scores"]
mean_score = scores.mean()
std_error = stats.sem(scores)  # Standard error of the mean

# Confidence interval
confidence_level = 0.95
confidence_interval = stats.t.interval(confidence_level, len(scores)-1, loc=mean_score, scale=std_error)
print(f"95% Confidence Interval: {confidence_interval}")


95% Confidence Interval: (50.183784192612194, 69.8162158073878)


In [7]:
import pandas as pd
from scipy.stats import ttest_1samp, sem
import scipy.stats as stats

# Sample dataset
data = {
    "Study Hours": [2, 3, 4, 5, 6],
    "Exam Scores": [50, 55, 60, 65, 70]
}
df = pd.DataFrame(data)

# Covariance and Correlation
cov_matrix = df.cov()
correlation_matrix = df.corr()

# Descriptive Statistics
mean = df["Exam Scores"].mean()
median = df["Exam Scores"].median()
mode = df["Exam Scores"].mode()[0]
variance = df["Exam Scores"].var()
std_dev = df["Exam Scores"].std()

# Hypothesis Testing
study_above_4 = df[df["Study Hours"] > 4]["Exam Scores"]
t_stat, p_value = ttest_1samp(study_above_4, 60)

# Confidence Interval
mean_score = df["Exam Scores"].mean()
std_error = sem(df["Exam Scores"])
confidence_interval = stats.t.interval(0.95, len(df["Exam Scores"])-1, loc=mean_score, scale=std_error)

# Results
print("Covariance Matrix:\n", cov_matrix)
print("Correlation Matrix:\n", correlation_matrix)
print("Mean:", mean)
print("Median:", median)
print("Mode:", mode)
print("Variance:", variance)
print("Standard Deviation:", std_dev)
print("T-Statistic:", t_stat)
print("P-Value:", p_value)
print(f"95% Confidence Interval: {confidence_interval}")


Covariance Matrix:
              Study Hours  Exam Scores
Study Hours          2.5         12.5
Exam Scores         12.5         62.5
Correlation Matrix:
              Study Hours  Exam Scores
Study Hours          1.0          1.0
Exam Scores          1.0          1.0
Mean: 60.0
Median: 60.0
Mode: 50
Variance: 62.5
Standard Deviation: 7.905694150420948
T-Statistic: 3.0
P-Value: 0.20483276469913345
95% Confidence Interval: (50.183784192612194, 69.8162158073878)
