# Statistics Basics

## 1. What is statistics, and why is it important?
Statistics is the science of collecting, organizing, analyzing, interpreting, and presenting data. It is important because it:
- Helps make informed decisions based on data
- Enables prediction of future trends
- Provides tools for testing hypotheses
- Is essential for scientific research and business analytics
- Helps quantify uncertainty and variability in data

## 2. What are the two main types of statistics?
The two main types are:
1. **Descriptive Statistics**: Summarizes and describes data
2. **Inferential Statistics**: Makes predictions or inferences about a population based on sample data

## 3. What are descriptive statistics?
Descriptive statistics are methods used to summarize and describe the main features of a dataset. Examples include:
- Measures of central tendency (mean, median, mode)
- Measures of variability (range, variance, standard deviation)
- Graphical representations (histograms, box plots)

## 4. What is inferential statistics?
Inferential statistics uses sample data to make generalizations about a larger population. It includes:
- Hypothesis testing
- Confidence intervals
- Regression analysis
- ANOVA
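
As a rough illustration of one item in the list above, the following sketch computes an approximate 95% confidence interval for a population mean from a sample. The sample values and the helper name `mean_confidence_interval` are made up for illustration. The sketch assumes the normal approximation; for a sample this small, a t critical value would give a somewhat wider interval.

```python
import math
import statistics

def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the mean (normal approximation)."""
    n = len(sample)
    mean = statistics.mean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)  # sample SD uses n - 1
    # Two-sided critical value, roughly 1.96 for 95% confidence
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_err, mean + z * std_err

# Hypothetical sample of ten measurements
sample = [48, 52, 51, 49, 50, 53, 47, 50, 52, 48]
low, high = mean_confidence_interval(sample)
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")
```

The interval is centred on the sample mean and narrows as the sample size grows, because the standard error shrinks with the square root of n.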

## 5. What is sampling in statistics?
Sampling is the process of selecting a subset of individuals from a population to estimate characteristics of the whole population. It's done because studying the entire population is often impractical.

## 6. What are the different types of sampling methods?
Main sampling methods:
- **Probability Sampling**:
  - Simple random sampling
  - Stratified sampling
  - Cluster sampling
  - Systematic sampling
- **Non-Probability Sampling**:
  - Convenience sampling
  - Purposive sampling
  - Quota sampling
  - Snowball sampling

## 7. What is the difference between random and non-random sampling?
**Random Sampling**:
- Every member has a known, non-zero chance of selection
- Reduces bias
- Allows for statistical inference

**Non-Random Sampling**:
- Selection based on convenience or judgment
- May introduce bias
- Limits generalizability
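
The difference can be made concrete with a small simulation. This is a minimal sketch under stated assumptions: the population is synthetic, the seed is fixed only so the run is repeatable, and "convenience sampling" is modelled as taking the first 100 records of a sorted list, which is a deliberately extreme case.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical population: 1,000 values stored in sorted order
population = sorted(random.gauss(50, 10) for _ in range(1000))
true_mean = statistics.mean(population)

# Random sampling: every member has the same chance of selection
srs = random.sample(population, 100)

# Convenience sampling: take whatever comes first (here, the smallest values)
convenience = population[:100]

srs_error = abs(statistics.mean(srs) - true_mean)
convenience_error = abs(statistics.mean(convenience) - true_mean)

print(f"Population mean:          {true_mean:.2f}")
print(f"Random sample error:      {srs_error:.2f}")
print(f"Convenience sample error: {convenience_error:.2f}")
```

The convenience sample misses the population mean by a wide margin because it only ever sees one end of the distribution, which is the bias the bullets above describe.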

## 8. Define and give examples of qualitative and quantitative data
**Qualitative Data** (Categorical):
- Describes qualities/characteristics
- Examples: Gender, color, satisfaction level

**Quantitative Data** (Numerical):
- Can be measured numerically
- Examples: Height, weight, temperature

## 9. What are the different types of data in statistics?
Four measurement scales:
1. **Nominal**: Categories without order (e.g., colors)
2. **Ordinal**: Ordered categories (e.g., satisfaction levels)
3. **Interval**: Equal intervals, no true zero (e.g., temperature in °C)
4. **Ratio**: Equal intervals with true zero (e.g., height, weight)

## 10. Explain nominal, ordinal, interval, and ratio levels of measurement
- **Nominal**: Categories without mathematical meaning (e.g., gender, race)
- **Ordinal**: Ordered categories where difference between values isn't meaningful (e.g., Likert scales)
- **Interval**: Equal intervals between values, no true zero (e.g., temperature in Celsius)
- **Ratio**: Equal intervals with true zero point (e.g., height, weight, age)
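
One way to see the interval/ratio distinction is to compare temperatures in Celsius and Kelvin. The sketch below uses two made-up readings and a hypothetical helper `celsius_to_kelvin`; it only illustrates why ratios need a true zero.

```python
def celsius_to_kelvin(celsius):
    """Convert Celsius to Kelvin, a scale with a true zero."""
    return celsius + 273.15

temp_a_c, temp_b_c = 10.0, 20.0  # hypothetical readings in °C

# Differences are meaningful on an interval scale
difference = temp_b_c - temp_a_c

# Ratios are not: 20 °C is not "twice as hot" as 10 °C
naive_ratio = temp_b_c / temp_a_c

# On the Kelvin (ratio) scale the ratio is meaningful
kelvin_ratio = celsius_to_kelvin(temp_b_c) / celsius_to_kelvin(temp_a_c)

print(f"Difference: {difference} degrees")
print(f"Ratio in Celsius: {naive_ratio:.2f}")
print(f"Ratio in Kelvin:  {kelvin_ratio:.3f}")
```

The Celsius ratio of 2 is an artifact of where the zero happens to sit; on the Kelvin scale the same two temperatures differ by only a few percent.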

## 11. What is the measure of central tendency?
Measures of central tendency describe the center or typical value of a dataset. The three main measures are:
- Mean
- Median
- Mode

## 12. Define mean, median, and mode
- **Mean**: The average (sum of all values divided by number of values)
- **Median**: The middle value when data is ordered
- **Mode**: The most frequently occurring value
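
The three definitions respond differently to extreme values. The following sketch uses Python's standard `statistics` module on made-up salary figures (in thousands) to show that a single outlier pulls the mean while leaving the median and mode unchanged.

```python
import statistics

salaries = [30, 32, 35, 35, 38, 40]   # hypothetical values, in thousands
with_outlier = salaries + [400]       # same data plus one extreme value

for label, values in [("Without outlier", salaries),
                      ("With outlier", with_outlier)]:
    print(f"{label}: "
          f"mean={statistics.mean(values):.2f}, "
          f"median={statistics.median(values)}, "
          f"mode={statistics.mode(values)}")
```

This is why the median is often preferred as the "typical" value for skewed data such as incomes.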

## 13. What is the significance of the measure of central tendency?
Significance:
- Provides a single value representing the entire dataset
- Helps compare different datasets
- Serves as basis for many statistical analyses
- Gives first impression of data distribution

## 14. What is variance, and how is it calculated?
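Variance measures how spread out a dataset is: it is the average squared distance of the data points from the mean. Calculation:
1. Find the mean
2. Subtract the mean from each data point and square the result
3. Average these squared differences (divide by N for a population, or by n - 1 for a sample)

Population formula: σ² = Σ(xᵢ - μ)²/N

Sample formula: s² = Σ(xᵢ - x̄)²/(n - 1)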
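
The calculation steps can be traced on a small dataset. This is a minimal sketch with illustrative numbers; it also computes the sample version (dividing by n - 1 instead of N) for comparison.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(data) / len(data)                        # step 1: the mean is 5.0
squared_diffs = [(x - mean) ** 2 for x in data]     # step 2: squared deviations
population_var = sum(squared_diffs) / len(data)     # step 3: divide by N
sample_var = sum(squared_diffs) / (len(data) - 1)   # sample version: divide by n - 1

print(f"Population variance: {population_var:.2f}")
print(f"Sample variance:     {sample_var:.2f}")

# Cross-check against the standard library
print(statistics.pvariance(data), statistics.variance(data))
```

The sample variance is always slightly larger than the population variance for the same numbers, because the divisor is smaller.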

## 15. What is standard deviation, and why is it important?
Standard deviation is the square root of variance. It's important because:
- Measures dispersion in the same units as the original data
- Indicates how spread out the data is
- Helps identify outliers
- Fundamental for many statistical tests
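
As one example of the outlier point, values can be flagged when they lie more than a chosen number of standard deviations from the mean. The readings below are made up, and the threshold of 2 is a common rule of thumb chosen here for illustration, not a fixed standard.

```python
import statistics

readings = [10, 12, 11, 13, 12, 11, 50]  # hypothetical measurements

mean = statistics.mean(readings)
sd = statistics.pstdev(readings)  # population SD, in the same units as the data

# Flag values more than 2 standard deviations from the mean
flagged = [x for x in readings if abs(x - mean) > 2 * sd]

print(f"Mean: {mean:.2f}, SD: {sd:.2f}, flagged: {flagged}")
```

Note that the outlier itself inflates both the mean and the standard deviation, so this simple rule can miss outliers in very small datasets.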

## 16. Define and explain the term range in statistics
Range is the difference between the highest and lowest values in a dataset. It is the simplest measure of dispersion, but because it depends only on the two extreme values it is highly sensitive to outliers.
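
A short sketch of that sensitivity, using made-up scores and a hypothetical helper `value_range`:

```python
def value_range(values):
    """Range: largest value minus smallest value."""
    return max(values) - min(values)

scores = [10, 20, 5, 30, 15]
scores_with_outlier = scores + [500]  # one extreme value added

print(value_range(scores))               # 25
print(value_range(scores_with_outlier))  # 495
```

A single added value changes the range almost twentyfold, even though the other five values are unchanged.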

## 17. What is the difference between variance and standard deviation?
- **Variance**: Average of squared deviations from the mean (in squared units)
- **Standard Deviation**: Square root of the variance (in the original units)
- SD is easier to interpret because it is in the same units as the data

## 18. What is skewness in a dataset?
Skewness measures asymmetry in data distribution:
- **Positive skew**: Right tail longer
- **Negative skew**: Left tail longer
- **Zero skew**: Symmetrical distribution

## 19. What does it mean if a dataset is positively or negatively skewed?
- **Positively skewed**: The tail extends to the right; the mean is typically greater than the median
- **Negatively skewed**: The tail extends to the left; the mean is typically less than the median
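
The relationship between skew direction, mean, and median can be checked on a small dataset. The sketch below uses made-up values and a hypothetical helper `skewness` that implements the moment-based definition (third central moment divided by the cube of the standard deviation).

```python
import statistics

def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (no small-sample correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

right_skewed = [20, 22, 25, 27, 30, 32, 35, 200]  # long right tail
left_skewed = [-x for x in right_skewed]          # mirror image, long left tail

for name, values in [("right-skewed", right_skewed),
                     ("left-skewed", left_skewed)]:
    print(f"{name}: skewness={skewness(values):.2f}, "
          f"mean={statistics.mean(values):.2f}, "
          f"median={statistics.median(values):.2f}")
```

In the right-skewed data the single large value drags the mean above the median; mirroring the data flips both the sign of the skewness and the order of mean and median.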

## 20. Define and explain kurtosis
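Kurtosis measures the "tailedness" of a distribution, that is, how much of its variability comes from extreme values:
- **Leptokurtic**: Heavy tails, often with a sharper peak (kurtosis > 3)
- **Mesokurtic**: Tails like those of the normal distribution (kurtosis = 3)
- **Platykurtic**: Light tails, often with a flatter shape (kurtosis < 3)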
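
The thresholds above refer to the fourth standardized moment. The sketch below computes it by hand with a hypothetical helper `kurtosis_moment` on two tiny made-up datasets. Many libraries report excess kurtosis instead (the value minus 3, so a normal distribution scores 0); `scipy.stats.kurtosis` does so by default.

```python
def kurtosis_moment(values):
    """Fourth standardized moment: m4 / m2**2 (normal distribution = 3)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2

heavy_tails = [0] * 8 + [-3, 3]  # mostly central values plus two extremes
light_tails = [-1, 1] * 5        # no extreme values at all

for name, values in [("heavy tails", heavy_tails),
                     ("light tails", light_tails)]:
    k = kurtosis_moment(values)
    print(f"{name}: kurtosis={k:.2f}, excess kurtosis={k - 3:.2f}")
```

The first dataset comes out leptokurtic (above 3) and the second platykurtic (below 3), driven entirely by how much weight sits in the extremes.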

## 21. What is the purpose of covariance?
Covariance measures how two variables change together:
- Indicates direction of linear relationship
- Basis for correlation calculation
- Used in portfolio theory in finance

## 22. What does correlation measure in statistics?
Correlation measures the strength and direction of the linear relationship between two variables. The coefficient ranges from -1 to 1:
- +1: Perfect positive correlation
- -1: Perfect negative correlation
- 0: No linear correlation
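
A correlation of 0 rules out a linear relationship, not every relationship. The sketch below, using a hypothetical helper `pearson_r` and made-up data, compares an exact straight line with an exact parabola.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient computed from its definition."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

x = [-2, -1, 0, 1, 2]
y_linear = [2 * v + 1 for v in x]  # exact straight line
y_curved = [v ** 2 for v in x]     # exact parabola, symmetric around 0

print(pearson_r(x, y_linear))  # 1.0
print(pearson_r(x, y_curved))  # 0.0
```

The parabola is perfectly predictable from x, yet its correlation is 0 because the upward and downward halves cancel out.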

## 23. What is the difference between covariance and correlation?
- **Covariance**: Measures the direction of a linear relationship (unstandardized, so its magnitude depends on the units)
- **Correlation**: Measures both strength and direction (standardized to the range -1 to 1)
- Correlation is dimensionless; covariance has units equal to the product of the two variables' units
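
The unit dependence is easy to demonstrate by expressing the same data in two units. The heights and weights below are made up, and `sample_covariance` and `correlation` are hypothetical helpers written out from the definitions.

```python
import math

def sample_covariance(x, y):
    """Sample covariance (divides by n - 1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Covariance divided by the product of the two standard deviations."""
    return sample_covariance(x, y) / math.sqrt(
        sample_covariance(x, x) * sample_covariance(y, y))

heights_m = [1.50, 1.60, 1.70, 1.80, 1.90]  # hypothetical heights in metres
weights_kg = [50, 58, 65, 77, 80]           # hypothetical weights in kilograms
heights_cm = [h * 100 for h in heights_m]   # same heights, different unit

print(f"Covariance (m, kg):   {sample_covariance(heights_m, weights_kg):.3f}")
print(f"Covariance (cm, kg):  {sample_covariance(heights_cm, weights_kg):.3f}")
print(f"Correlation (m, kg):  {correlation(heights_m, weights_kg):.4f}")
print(f"Correlation (cm, kg): {correlation(heights_cm, weights_kg):.4f}")
```

Switching from metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged, which is why correlation is the easier quantity to compare across datasets.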

## 24. What are some real-world applications of statistics?
- **Medicine**: Clinical trials, epidemiology
- **Business**: Market research, quality control
- **Finance**: Risk assessment, portfolio management
- **Sports**: Player performance analysis
- **Government**: Census, policy evaluation
- **Science**: Experimental design, data analysis
- **Technology**: Machine learning, AI algorithms

# Practical Questions

# Statistical Calculations and Python Implementations

## 1. How to calculate mean, median, and mode of a dataset

data = [3, 7, 2, 5, 5, 8, 4]

# Mean
mean = sum(data) / len(data)

# Median
sorted_data = sorted(data)
n = len(sorted_data)
median = (sorted_data[n//2] if n % 2 != 0 
          else (sorted_data[n//2 - 1] + sorted_data[n//2]) / 2)

# Mode
from collections import Counter
count = Counter(data)
mode = [k for k, v in count.items() if v == max(count.values())]

print(f"Mean: {mean}, Median: {median}, Mode: {mode}")


## 2. Python program to compute variance and standard deviation


import math

def variance_std_dev(data):
    n = len(data)
    mean = sum(data) / n
    variance = sum((x - mean)**2 for x in data) / (n - 1)  # Sample variance
    std_dev = math.sqrt(variance)
    return variance, std_dev

data = [2, 4, 4, 4, 5, 5, 7, 9]
var, std = variance_std_dev(data)
print(f"Variance: {var:.2f}, Standard Deviation: {std:.2f}")


## 3. Dataset classification by measurement type


# Example dataset classification
nominal = ["red", "blue", "green", "blue"]  # Colors (no order)
ordinal = ["low", "medium", "high"]  # Ordered categories
interval = [20, 25, 30, 35]  # Temperature in °C (no true zero)
ratio = [150, 160, 175, 180]  # Heights in cm (true zero exists)


## 4. Sampling techniques implementation


import random
import numpy as np

# Random sampling
def random_sample(data, n):
    return random.sample(data, n)

# Stratified sampling
def stratified_sample(df, strata_col, n_per_stratum):
    return df.groupby(strata_col).apply(lambda x: x.sample(min(len(x), n_per_stratum)))

# Example usage
data = list(range(100))
print("Random sample:", random_sample(data, 10))

import pandas as pd
df = pd.DataFrame({
    'value': np.random.randn(100),
    'group': np.random.choice(['A', 'B', 'C'], 100)
})
print("Stratified sample:\n", stratified_sample(df, 'group', 2))


## 5. Python function to calculate range


def data_range(data):
    return max(data) - min(data)

data = [10, 20, 5, 30, 15]
print("Range:", data_range(data))


## 6. Visualizing skewness with histogram


import matplotlib.pyplot as plt
import numpy as np

# Create skewed data
right_skewed = np.random.gamma(2, 2, 1000)
left_skewed = np.max(right_skewed) - right_skewed

plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.hist(right_skewed, bins=30)
plt.title("Right (Positive) Skew")

plt.subplot(1, 2, 2)
plt.hist(left_skewed, bins=30)
plt.title("Left (Negative) Skew")
plt.show()


## 7. Calculating skewness and kurtosis


from scipy.stats import skew, kurtosis
import numpy as np

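data = np.random.normal(0, 1, 1000)  # Normal distribution
print(f"Skewness: {skew(data):.2f}")
# scipy's kurtosis() returns excess kurtosis by default (about 0 for normal data);
# pass fisher=False for the definition in which the normal distribution scores 3
print(f"Excess kurtosis: {kurtosis(data):.2f}")
print(f"Kurtosis: {kurtosis(data, fisher=False):.2f}")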


## 8. Demonstrating positive and negative skewness


import numpy as np
import matplotlib.pyplot as plt

# Positive skew (right tail)
pos_skew = np.random.exponential(scale=2, size=1000)

# Negative skew (left tail)
neg_skew = np.max(pos_skew) - pos_skew

plt.hist(pos_skew, alpha=0.5, label='Positive Skew')
plt.hist(neg_skew, alpha=0.5, label='Negative Skew')
plt.legend()
plt.show()


## 9. Calculating covariance between two datasets


import numpy as np

def covariance(x, y):
    n = len(x)
    mean_x, mean_y = np.mean(x), np.mean(y)
    return sum((x[i] - mean_x) * (y[i] - mean_y) for i in range(n)) / (n - 1)

x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 8, 7]
print("Covariance:", covariance(x, y))
print("Numpy covariance:", np.cov(x, y)[0, 1])


## 10. Calculating correlation coefficient


import numpy as np
from scipy.stats import pearsonr

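def correlation(x, y):
    cov = np.cov(x, y)[0, 1]  # sample covariance (divides by n - 1)
    # Use ddof=1 so the standard deviations match np.cov's n - 1 divisor;
    # the default ddof=0 would inflate the result by a factor of n / (n - 1)
    std_x, std_y = np.std(x, ddof=1), np.std(y, ddof=1)
    return cov / (std_x * std_y)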

x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 8, 7]
print("Correlation coefficient:", correlation(x, y))
print("Pearson correlation:", pearsonr(x, y)[0])


## 11. Creating a scatter plot


import matplotlib.pyplot as plt
import numpy as np

x = np.random.rand(50)
y = 2 * x + np.random.normal(0, 0.1, 50)

plt.scatter(x, y)
plt.xlabel('X Variable')
plt.ylabel('Y Variable')
plt.title('Scatter Plot of X vs Y')
plt.show()


## 12. Comparing sampling techniques


import random
import numpy as np

population = list(range(1000))

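# Simple random sampling
# (named simple_random_sample so it does not shadow the random_sample() function defined earlier)
simple_random_sample = random.sample(population, 100)

# Systematic sampling: pick a random start, then take every k-th member
k = len(population) // 100
start = random.randint(0, k-1)
systematic_sample = population[start::k]

print("Random sample first 10:", simple_random_sample[:10])
print("Systematic sample first 10:", systematic_sample[:10])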


## 13. Central tendency for grouped data


import numpy as np
from scipy import stats

# For grouped data (midpoints and frequencies)
midpoints = [15, 25, 35, 45]
frequencies = [5, 12, 8, 5]

# Mean
mean = np.average(midpoints, weights=frequencies)

# Median class
cum_freq = np.cumsum(frequencies)
n = sum(frequencies)
median_class = next(i for i, cf in enumerate(cum_freq) if cf >= n/2)

# Mode (class with highest frequency)
mode_class = np.argmax(frequencies)

print(f"Mean: {mean}, Median class: {median_class}, Mode class: {mode_class}")


## 14. Data simulation and analysis


import numpy as np
from scipy import stats

# Simulate data
data = np.random.normal(50, 10, 1000)

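# Central tendency
mean, median = np.mean(data), np.median(data)
# For continuous data almost every value is unique, so the mode is rarely informative.
# np.atleast_1d copes with both older SciPy (array result) and newer SciPy (scalar result).
mode = np.atleast_1d(stats.mode(data).mode)[0]

# Dispersion (np.std and np.var default to the population formulas, ddof=0)
std_dev, variance = np.std(data), np.var(data)
range_val = np.ptp(data)  # peak-to-peak (max-min)

print(f"Mean: {mean:.2f}, Median: {median:.2f}, Mode: {mode:.2f}")
print(f"Std Dev: {std_dev:.2f}, Variance: {variance:.2f}, Range: {range_val:.2f}")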


## 15. Dataset summary with pandas


import pandas as pd
import numpy as np

data = pd.DataFrame({
    'A': np.random.normal(0, 1, 100),
    'B': np.random.uniform(5, 10, 100),
    'C': np.random.randint(0, 5, 100)
})

print(data.describe())


## 16. Boxplot for spread and outliers


import seaborn as sns
import matplotlib.pyplot as plt  # needed for plt.title() and plt.show() below
import numpy as np

data = np.random.normal(0, 1, 100)
data = np.append(data, [3, -3])  # Add outliers

sns.boxplot(data=data)
plt.title('Boxplot Showing Spread and Outliers')
plt.show()


## 17. Calculating IQR


import numpy as np
from scipy import stats

data = np.random.normal(0, 1, 100)
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

print(f"IQR: {iqr:.2f}")
print(f"Scipy IQR: {stats.iqr(data):.2f}")


## 18. Z-score normalization


import numpy as np

def z_score_normalize(data):
    mean = np.mean(data)
    std = np.std(data)
    return [(x - mean) / std for x in data]

data = [10, 20, 30, 40, 50]
normalized = z_score_normalize(data)
print("Original:", data)
print("Normalized:", normalized)


## 19. Comparing datasets by standard deviation


import numpy as np

data1 = np.random.normal(0, 1, 1000)
data2 = np.random.normal(0, 2, 1000)

std1, std2 = np.std(data1), np.std(data2)
print(f"Dataset 1 SD: {std1:.2f}, Dataset 2 SD: {std2:.2f}")
print(f"Dataset 2 is {std2/std1:.1f} times more variable than Dataset 1")


## 20. Covariance heatmap


import seaborn as sns
import matplotlib.pyplot as plt  # needed for plt.title() and plt.show() below
import pandas as pd
import numpy as np

data = pd.DataFrame(np.random.randn(100, 5), columns=list('ABCDE'))
cov_matrix = data.cov()

sns.heatmap(cov_matrix, annot=True, cmap='coolwarm')
plt.title('Covariance Heatmap')
plt.show()


## 21. Correlation matrix with seaborn


import seaborn as sns
import matplotlib.pyplot as plt  # needed for plt.title() and plt.show() below
import pandas as pd
import numpy as np

data = pd.DataFrame(np.random.randn(100, 5), columns=list('ABCDE'))
corr_matrix = data.corr()

sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix')
plt.show()


## 22. Variance and standard deviation implementation


import math

def compute_var_std(data):
    n = len(data)
    mean = sum(data) / n
    variance = sum((x - mean)**2 for x in data) / (n - 1)
    std_dev = math.sqrt(variance)
    return variance, std_dev

data = [2, 4, 4, 4, 5, 5, 7, 9]
variance, std_dev = compute_var_std(data)
print(f"Variance: {variance:.2f}, Standard Deviation: {std_dev:.2f}")


## 23. Visualizing skewness and kurtosis


import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import skew, kurtosis

# Create different distributions
normal = np.random.normal(0, 1, 1000)
right_skew = np.random.exponential(1, 1000)
high_kurtosis = np.random.laplace(0, 1, 1000)

# Plot
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
titles = [
    f"Normal (Skew: {skew(normal):.2f}, Kurtosis: {kurtosis(normal):.2f})",
    f"Right Skew (Skew: {skew(right_skew):.2f}, Kurtosis: {kurtosis(right_skew):.2f})",
    f"High Kurtosis (Skew: {skew(high_kurtosis):.2f}, Kurtosis: {kurtosis(high_kurtosis):.2f})"
]

for ax, data, title in zip(axes, [normal, right_skew, high_kurtosis], titles):
    ax.hist(data, bins=30)
    ax.set_title(title)

plt.tight_layout()
plt.show()


## 24. Pearson and Spearman correlation


import numpy as np
from scipy.stats import pearsonr, spearmanr

# Create correlated data
x = np.linspace(0, 10, 100)
y = x + np.random.normal(0, 1, 100)

# Calculate correlations
pearson_corr, _ = pearsonr(x, y)
spearman_corr, _ = spearmanr(x, y)

print(f"Pearson correlation: {pearson_corr:.3f}")
print(f"Spearman correlation: {spearman_corr:.3f}")


Mean: 4.857142857142857, Median: 5, Mode: [5]
Variance: 4.57, Standard Deviation: 2.14

