# Statistics Basics

## 1. What is statistics, and why is it important?
Statistics is the science of collecting, organizing, analyzing, interpreting, and presenting data. It is important because it:
- Supports informed decisions based on data
- Enables prediction of future trends
- Provides tools for testing hypotheses
- Underpins scientific research and business analytics
- Helps quantify uncertainty and variability in data

## 2. What are the two main types of statistics?
The two main types are:
1. **Descriptive Statistics**: Summarizes and describes data
2. **Inferential Statistics**: Makes predictions or inferences about a population based on sample data

## 3. What are descriptive statistics?
Descriptive statistics are methods used to summarize and describe the main features of a dataset. Examples include:
- Measures of central tendency (mean, median, mode)
- Measures of variability (range, variance, standard deviation)
- Graphical representations (histograms, box plots)

## 4. What is inferential statistics?
Inferential statistics uses sample data to make generalizations about a larger population. It includes:
- Hypothesis testing
- Confidence intervals
- Regression analysis
- ANOVA
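
As one illustration of the techniques listed above, a confidence interval for a mean can be sketched with the standard library alone. This is a rough sketch, not a definitive method: the `sample` values are made up, and the interval uses a normal approximation (for a sample this small, a t-distribution would usually be the better choice).

```python
import math
import statistics

# Illustrative measurements (assumed data, not from the notes)
sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0, 5.1, 4.9]

n = len(sample)
sample_mean = statistics.mean(sample)

# Standard error of the mean uses the sample standard deviation (n - 1 divisor)
standard_error = statistics.stdev(sample) / math.sqrt(n)

# z is the 97.5th percentile of the standard normal, about 1.96
z = statistics.NormalDist().inv_cdf(0.975)

lower = sample_mean - z * standard_error
upper = sample_mean + z * standard_error
print(f"Approximate 95% CI for the mean: ({lower:.2f}, {upper:.2f})")
```

The interval is an inference about the population mean drawn from the sample, which is what separates it from a purely descriptive summary.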

## 5. What is sampling in statistics?
Sampling is the process of selecting a subset of individuals from a population to estimate characteristics of the whole population. It's done because studying the entire population is often impractical.

## 6. What are the different types of sampling methods?
Main sampling methods:
- **Probability Sampling**:
  - Simple random sampling
  - Stratified sampling
  - Cluster sampling
  - Systematic sampling
- **Non-Probability Sampling**:
  - Convenience sampling
  - Purposive sampling
  - Quota sampling
  - Snowball sampling
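
Systematic sampling from the list above can be sketched in plain Python. The function name `systematic_sample`, its parameters, and the population of 20 IDs are assumptions made for illustration only.

```python
import random

def systematic_sample(population, sample_size, seed=None):
    """Pick every k-th item after a random start, where k = N // sample_size."""
    k = len(population) // sample_size
    if k < 1:
        raise ValueError("sample_size cannot exceed the population size")
    # The random start keeps this a probability sampling method
    start = random.Random(seed).randrange(k)
    return population[start::k][:sample_size]

population = list(range(1, 21))  # hypothetical member IDs 1..20
sample = systematic_sample(population, sample_size=5, seed=1)
print(sample)  # five IDs spaced 4 apart
```

Only the starting point is random; after that, the fixed interval `k` determines every other selection.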

## 7. What is the difference between random and non-random sampling?
**Random Sampling**:
- Every member has a known, non-zero chance of selection
- Reduces bias
- Allows for statistical inference

**Non-Random Sampling**:
- Selection based on convenience or judgment
- May introduce bias
- Limits generalizability

## 8. Define and give examples of qualitative and quantitative data
**Qualitative Data** (Categorical):
- Describes qualities/characteristics
- Examples: Gender, color, satisfaction level

**Quantitative Data** (Numerical):
- Can be measured numerically
- Examples: Height, weight, temperature

## 9. What are the different types of data in statistics?
Four measurement scales:
1. **Nominal**: Categories without order (e.g., colors)
2. **Ordinal**: Ordered categories (e.g., satisfaction levels)
3. **Interval**: Equal intervals, no true zero (e.g., temperature in °C)
4. **Ratio**: Equal intervals with true zero (e.g., height, weight)

## 10. Explain nominal, ordinal, interval, and ratio levels of measurement
- **Nominal**: Categories with no inherent order or numeric meaning (e.g., gender, race)
- **Ordinal**: Ordered categories where the size of the difference between values isn't meaningful (e.g., Likert scales)
- **Interval**: Equal intervals between values, no true zero (e.g., temperature in Celsius)
- **Ratio**: Equal intervals with true zero point (e.g., height, weight, age)
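
The practical difference between interval and ratio scales shows up when taking ratios. A small sketch, using made-up temperatures and a hypothetical helper `celsius_to_kelvin`:

```python
def celsius_to_kelvin(celsius):
    """Convert Celsius (interval scale) to Kelvin (ratio scale with a true zero)."""
    return celsius + 273.15

cool_c, warm_c = 10.0, 20.0
cool_k, warm_k = celsius_to_kelvin(cool_c), celsius_to_kelvin(warm_c)

# Differences are meaningful on both scales and agree
print(warm_c - cool_c)            # 10.0
print(round(warm_k - cool_k, 2))  # 10.0

# Ratios are only meaningful on the ratio scale
print(warm_c / cool_c)            # 2.0, yet 20 °C is not "twice as hot" as 10 °C
print(round(warm_k / cool_k, 3))  # 1.035
```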

## 11. What is the measure of central tendency?
Measures of central tendency describe the center or typical value of a dataset. The three main measures are:
- Mean
- Median
- Mode

## 12. Define mean, median, and mode
- **Mean**: The average (sum of all values divided by number of values)
- **Median**: The middle value when data is ordered
- **Mode**: The most frequently occurring value

## 13. What is the significance of the measure of central tendency?
Significance:
- Provides a single value representing the entire dataset
- Helps compare different datasets
- Serves as basis for many statistical analyses
- Gives first impression of data distribution

## 14. What is variance, and how is it calculated?
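Variance measures how far, on average, the values in a dataset lie from the mean, using squared distances. Calculation:
1. Find the mean
2. Subtract the mean from each data point and square the result
3. Average these squared differences

Formula (population variance): σ² = Σ(xᵢ - μ)²/N

For a sample, divide by n - 1 instead of N: s² = Σ(xᵢ - x̄)²/(n - 1)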
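
The three steps can be traced by hand in a few lines of Python. This is a minimal sketch using the standard-library `statistics` module; the `values` list is made up for illustration.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative data

# Step 1: find the mean
mu = sum(values) / len(values)

# Step 2: subtract the mean from each data point and square the result
squared_diffs = [(x - mu) ** 2 for x in values]

# Step 3: average the squared differences (population variance, divides by N)
population_variance = sum(squared_diffs) / len(values)

print(mu)                   # 5.0
print(population_variance)  # 4.0

# Sample variance divides by N - 1 instead, so it is slightly larger
sample_variance = statistics.variance(values)
```

`statistics.pvariance(values)` gives the same result as the manual population calculation, while `statistics.variance` applies the N - 1 divisor used when the data is a sample.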

## 15. What is standard deviation, and why is it important?
Standard deviation is the square root of the variance. It's important because it:
- Measures dispersion in the same units as the original data
- Indicates how spread out data is
- Helps identify outliers
- Fundamental for many statistical tests

## 16. Define and explain the term range in statistics
Range is the difference between the highest and lowest values in a dataset. It's the simplest measure of dispersion, but it is highly sensitive to outliers because it depends only on the two extreme values.
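
A quick sketch of how a single extreme value changes the range. The data and the helper name `value_range` are made up for illustration.

```python
def value_range(xs):
    """Return the difference between the largest and smallest value."""
    return max(xs) - min(xs)

values = [12, 15, 11, 14, 13]  # illustrative data
with_outlier = values + [95]   # the same data plus one extreme value

print(value_range(values))        # 4
print(value_range(with_outlier))  # 84
```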

## 17. What is the difference between variance and standard deviation?
- **Variance**: Average of the squared deviations from the mean (in squared units)
- **Standard Deviation**: Square root of the variance (in the original units)
- SD is easier to interpret because it is in the same units as the data
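
The unit difference is easiest to see with a concrete measurement. A small sketch, assuming a made-up list of heights in centimetres and treating it as a whole population:

```python
import math
import statistics

heights_cm = [160, 175, 168, 170]  # illustrative heights in cm

variance_cm2 = statistics.pvariance(heights_cm)  # units: cm squared
std_cm = statistics.pstdev(heights_cm)           # units: cm

print(variance_cm2)      # 29.1875
print(round(std_cm, 2))  # 5.4
```

A spread of about 5.4 cm can be compared directly with the heights themselves, whereas 29.1875 cm² cannot.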

## 18. What is skewness in a dataset?
Skewness measures asymmetry in data distribution:
- **Positive skew**: Right tail longer
- **Negative skew**: Left tail longer
- **Zero skew**: Symmetrical distribution

## 19. What does it mean if a dataset is positively or negatively skewed?
- **Positively skewed**: Tail extends to the right; the mean is typically greater than the median
- **Negatively skewed**: Tail extends to the left; the mean is typically less than the median
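
A minimal sketch of these relationships, using the moment-based (population) skewness formula m₃ / m₂^1.5. The helper `skewness` and both datasets are assumptions made for illustration.

```python
import statistics

def skewness(xs):
    """Moment-based skewness: third central moment over second to the power 1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]     # one large value stretches the right tail
left_skewed = [-x for x in right_skewed]     # mirror image, so the tail points left

for name, xs in [("right", right_skewed), ("left", left_skewed)]:
    print(name, skewness(xs), statistics.mean(xs), statistics.median(xs))
```

In the right-skewed data the mean (4.75) is pulled above the median (3.0) by the large value; mirroring the data flips both the sign of the skewness and the order of mean and median.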

## 20. Define and explain kurtosis
Kurtosis measures the "tailedness" of a distribution:
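- **Leptokurtic**: Heavier tails than a normal distribution, often with a sharper peak (kurtosis > 3)
- **Mesokurtic**: Tails like a normal distribution (kurtosis = 3)
- **Platykurtic**: Lighter tails than a normal distribution, often flatter (kurtosis < 3)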
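
The thresholds above refer to the moment-based kurtosis m₄ / m₂². A small sketch with a hypothetical helper `kurtosis` and two made-up datasets:

```python
def kurtosis(xs):
    """Moment-based kurtosis: fourth central moment over the squared second moment."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / m2 ** 2

heavy_tailed = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]  # most mass at the center, two extremes
flat = [1, 2, 3, 4, 5]                            # evenly spread values

print(kurtosis(heavy_tailed))  # 5.0 (leptokurtic, > 3)
print(kurtosis(flat))          # 1.7 (platykurtic, < 3)
```

Some libraries report excess kurtosis instead (kurtosis - 3), in which case the normal distribution scores 0 rather than 3; it is worth checking which convention a given function uses.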

## 21. What is the purpose of covariance?
Covariance measures how two variables change together:
- Indicates direction of linear relationship
- Basis for correlation calculation
- Used in portfolio theory in finance

## 22. What does correlation measure in statistics?
Correlation measures the strength and direction of the linear relationship between two variables. Its value ranges from -1 to 1:
- +1: Perfect positive correlation
- -1: Perfect negative correlation
- 0: No linear correlation

## 23. What is the difference between covariance and correlation?
- **Covariance**: Measures the direction of the relationship; its magnitude depends on the units of the variables (unstandardized)
- **Correlation**: Measures both strength and direction (standardized to the range -1 to 1)
- Correlation is dimensionless, while covariance has units equal to the product of the two variables' units
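
Rescaling one variable makes the difference visible: covariance changes with the units, correlation does not. A minimal sketch with hand-written helpers (`covariance`, `correlation`) and made-up data:

```python
import math

def covariance(xs, ys):
    """Population covariance: mean product of deviations from each mean."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

hours = [1, 2, 3, 4, 5]                 # illustrative study hours
scores = [52, 55, 61, 64, 70]           # illustrative scores
scores_x10 = [s * 10 for s in scores]   # the same scores on a 10x larger scale

# Covariance grows tenfold with the rescaling
print(covariance(hours, scores), covariance(hours, scores_x10))

# Correlation is unchanged by the rescaling
print(correlation(hours, scores), correlation(hours, scores_x10))
```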

## 24. What are some real-world applications of statistics?
- **Medicine**: Clinical trials, epidemiology
- **Business**: Market research, quality control
- **Finance**: Risk assessment, portfolio management
- **Sports**: Player performance analysis
- **Government**: Census, policy evaluation
- **Science**: Experimental design, data analysis
- **Technology**: Machine learning, AI algorithms

# Practical Questions

In [None]:
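# 1. How do you calculate the mean, median, and mode of a dataset?

import statistics

import numpy as np

# Named `data` rather than `list` so the built-in list type is not shadowed
data = [1, 2, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# NumPy provides the mean and median, but has no mode function
np.mean(data)
np.median(data)

# SciPy alternative for the mode:
# from scipy import stats
# stats.mode(data)

# The standard library covers all three measures
statistics.mean(data)
statistics.median(data)
statistics.mode(data)  # only this last expression is displayed as the cell output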


2

In [None]:
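# 2. Write a Python program to compute the variance and standard deviation of a dataset.

import numpy as np

# Named `data` rather than `list` so the built-in list type is not shadowed
data = [1, 2, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# NumPy defaults to the population formulas (ddof=0); pass ddof=1 for sample statistics
np.var(data)
np.std(data)  # only this last expression is displayed as the cell output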

np.float64(2.9176011883420387)

In [6]:
# 3. Create a dataset and classify it into nominal, ordinal, interval, and ratio types.

import pandas as pd

# Create the dataset
data = {
    "Name": ["Alice", "Bob", "Charlie", "Diana"],                         # Nominal
    "Education Level": ["High School", "Bachelor", "Master", "PhD"],     # Ordinal
    "Temperature (°C)": [36.5, 37.0, 36.8, 37.2],                         # Interval
    "Height (cm)": [160, 175, 168, 170]                                   # Ratio
}

df = pd.DataFrame(data)

# Display the dataset
print("Dataset:\n", df)

# Classification of variables
data_types = {
    "Name": "Nominal",
    "Education Level": "Ordinal",
    "Temperature (°C)": "Interval",
    "Height (cm)": "Ratio"
}

# Display classification
print("\nData Type Classification:")
for column, dtype in data_types.items():
    print(f"{column}: {dtype}")


Dataset:
       Name Education Level  Temperature (°C)  Height (cm)
0    Alice     High School              36.5          160
1      Bob        Bachelor              37.0          175
2  Charlie          Master              36.8          168
3    Diana             PhD              37.2          170

Data Type Classification:
Name: Nominal
Education Level: Ordinal
Temperature (°C): Interval
Height (cm): Ratio


In [7]:
# 4. Implement sampling techniques like random sampling and stratified sampling.
import pandas as pd

# Create a sample dataset with two equally sized strata (A and B)
data = {
    'Category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'Value': [10, 20, 30, 40, 50, 60, 70, 80]
}
df = pd.DataFrame(data)

# Simple random sampling: draw 4 rows from the whole dataset
random_sample = df.sample(n=4, random_state=1)
print("\nRandom Sample:")
print(random_sample)

# Stratified sampling: draw 2 rows from each category
# (recent pandas versions may emit a deprecation warning here because
# apply() operates on the grouping column)
stratified_sample = df.groupby('Category', group_keys=False).apply(
    lambda group: group.sample(n=2, random_state=1)
)
print("\nStratified Sample:")
print(stratified_sample)



Random Sample:
  Category  Value
7        B     80
2        A     30
1        B     20
6        A     70

Stratified Sample:
  Category  Value
6        A     70
4        A     50
7        B     80
5        B     60


