# 📊 8.1 Statistics for Data Analysts

## 🔑 Key Branches of Statistics

### 1️⃣ Descriptive Statistics
Used to **summarize and describe** data.

Examples:
- Mean
- Median
- Mode
- Variance
- Standard Deviation

### 2️⃣ Inferential Statistics
Used to **draw conclusions, make predictions**, and generalize from a sample to a population.

Examples:
- Hypothesis Testing
- Confidence Intervals
- Regression Analysis

---

## 📌 Core Concepts

### ✔ Descriptive & Inferential Statistics

### ✔ Probability Distributions
- **Normal Distribution**
- **Binomial Distribution**
- **Poisson Distribution**

### ✔ Data Visualization
- Bar Charts
- Line Charts
- Histograms
- Box Plots
- Scatter Plots

### ✔ Exploratory Data Analysis (EDA)
- Understanding data patterns
- Detecting outliers
- Identifying trends and relationships

### ✔ Central Limit Theorem (CLT)
- States that the sampling distribution of the mean approaches a normal distribution as the sample size increases, regardless of the shape of the population distribution (provided the observations are independent and the population variance is finite).
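
The statement above can be checked with a small simulation. This is a minimal sketch, assuming a skewed exponential population with mean 1 and standard deviation 1; the sample size, the number of samples, and the seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Skewed population (assumption): exponential with mean 1 and standard deviation 1
n = 50                 # size of each sample
num_samples = 10_000   # how many samples to draw

# Draw many samples and record the mean of each one
sample_means = rng.exponential(scale=1.0, size=(num_samples, n)).mean(axis=1)

# The sample means should centre on the population mean (1)
# with a spread close to sigma / sqrt(n) = 1 / sqrt(50)
print("Mean of sample means:", sample_means.mean())
print("Std of sample means:", sample_means.std(ddof=1))
print("Theoretical std:", 1 / np.sqrt(n))
```

Plotting `sample_means` as a histogram should show a roughly bell-shaped curve even though the underlying population is strongly skewed.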

### ✔ Hypothesis Testing
- Null Hypothesis (H₀)
- Alternative Hypothesis (H₁)
- p-value
- Significance Level (α)

### ✔ Regression & Correlation
- **Regression**: Predicts a dependent variable
- **Correlation**: Measures strength and direction of relationship

# 📂 8.2 Types of Data
## 8.2.1 📈 Types of Data Based on Structure
### 1️⃣ Structured Data
- Organized in rows and columns
- Stored in **tables, spreadsheets, and relational databases**

Examples:
- Excel files
- SQL tables

### 2️⃣ Unstructured Data
- No predefined format

Examples:
- Text
- Images
- Videos
- Audio
- Multimedia content

---

## 8.2.2 📈 Types of Data Based on Variables

### 🔹 Univariate Data
- Single variable
- Example: Student ages

### 🔹 Bivariate Data
- Two variables
- Example: Height vs Weight

### 🔹 Multivariate Data
- More than two variables
- Example: Age, income, education, experience

---

## 8.2.3 ⏳ Data Based on Time

### 🔸 Cross-Sectional Data
- Collected at a **single point in time**
- Example: Survey conducted once

### 🔸 Time Series Data
- Collected over **multiple time intervals**
- Example: Monthly sales data

# 🔠 8.3 Types of Variables

### 1️⃣ Categorical Variables
Label observations with categories rather than quantities.

#### 🔹 Nominal
- No specific order
- Example: Gender, Blood Group

#### 🔹 Ordinal
- Ordered categories
- Example: Education Level, Ratings

---

### 2️⃣ Numerical Variables
Take numeric values that can be counted or measured.

#### 🔹 Discrete
- Countable values
- Example: Number of students

#### 🔹 Continuous
- Measurable values
- Example: Height, Weight

---

## 📏 Levels of Measurement

### 🔹 Ratio Scale
- True zero exists
- Ratios are meaningful (for example, 80 kg is twice 40 kg)
- Example: Height, Weight

### 🔹 Interval Scale
- Equal intervals but **no true zero**
- Example: Temperature (°C, °F)

---

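## 🎨 Color & Labels
- Used for classification and visualization
- On an interval scale, zero has **no mathematical meaning**: it is an arbitrary reference point, not an absence of the quantity
- Example: Temperature scale (0 °C does not mean "no temperature")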

# 📊 8.4 Descriptive Statistics

Descriptive statistics **summarize the core features of a dataset**, including:
- **Center**
- **Spread**
- **Shape**

They provide **initial insights** for data exploration, help identify **trends and outliers**, and guide **visualizations** such as histograms and box plots.

## 📌 Key Measures in Descriptive Statistics

### 1️⃣ Measures of Central Tendency
Describe where the **center of the data** lies.

- **Mean** – Average value  
- **Median** – Middle value (robust to outliers)
- **Mode** – Most frequent value

In [30]:
import numpy as np
import pandas as pd

# Dummy numeric data
data = np.array([10, 20, 20, 30, 40, 50, 20])

# Create DataFrame
df = pd.DataFrame({
    'column': data
})

df

Unnamed: 0,column
0,10
1,20
2,20
3,30
4,40
5,50
6,20


In [31]:
print("Mean:", np.mean(data))
print("Median:", np.median(data))
print("Mode:", df['column'].mode().values)

Mean: 27.142857142857142
Median: 20.0
Mode: [20]


### 2️⃣ Measures of Variability (Spread)
Show how spread out the data is.

- **Range**: `max(data) - min(data)`
- **Variance**: `np.var(data)` (population variance; NumPy divides by n by default, `ddof=0`)
- **Standard Deviation**: `np.std(data)` (population standard deviation, `ddof=0`)
- **Interquartile Range (IQR)**: `np.percentile(data, 75) - np.percentile(data, 25)`

In [32]:
print("Range:", (max(data) - min(data)))
print("Variance:", np.var(data))
print("Standard Deviation:", np.std(data))
print("Interquartile Range (IQR):", (np.percentile(data, 75) - np.percentile(data, 25)))

Range: 40
Variance: 163.26530612244898
Standard Deviation: 12.777531299998799
Interquartile Range (IQR): 15.0
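
The variance and standard deviation printed above use NumPy's default divisor n. When the data is a sample rather than the full population, the n − 1 divisor is usually preferred. A minimal sketch of the difference on the same data, using NumPy's `ddof` argument:

```python
import numpy as np

data = np.array([10, 20, 20, 30, 40, 50, 20])

# Population formulas divide by n (NumPy default, ddof=0)
pop_var = np.var(data)

# Sample formulas divide by n - 1 (ddof=1)
sample_var = np.var(data, ddof=1)
sample_std = np.std(data, ddof=1)

print("Population variance:", pop_var)
print("Sample variance:", sample_var)
print("Sample standard deviation:", sample_std)
```

Note that pandas makes the opposite default choice: `Series.var()` and `Series.std()` use `ddof=1`, so the two libraries disagree on the same column unless `ddof` is set explicitly.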


### 3️⃣ Measures of Shape
Describe the distribution shape.

- **Skewness** – Measures asymmetry: `df['column'].skew()`
- **Kurtosis** – Measures tail heaviness: `df['column'].kurt()` (excess kurtosis, so a normal distribution scores about 0)

In [33]:
print("Skewness:", df['column'].skew())
print("Kurtosis:", df['column'].kurt())

Skewness: 0.7064546163767351
Kurtosis: -0.3255000000000008


### 4️⃣ Frequency Distribution
Shows how often values occur.

- **Frequency** – Counts how many times a specific value appears in the dataset.
- **Relative Frequency** – Converts raw counts into proportions of the total, showing how often each value occurs relative to the whole dataset.
    - `RF% = (Freq / Total no. of obs) * 100`
- **Cumulative Frequency** – The running total of frequencies, showing how many observations fall at or below each value.

In [34]:
import pandas as pd

# Create initial data
data = {
    'Score': [4, 5, 6, 7],
    'Frequency': [2, 4, 4, 2]
}

# Create DataFrame
df = pd.DataFrame(data)

df

Unnamed: 0,Score,Frequency
0,4,2
1,5,4
2,6,4
3,7,2


In [35]:
total_freq = df['Frequency'].sum()
print("Total Frequency",total_freq)

df['Relative Frequency (%)'] = (df['Frequency'] / total_freq) * 100
df['Cumulative Frequency'] = df['Frequency'].cumsum()

df

Total Frequency 12


Unnamed: 0,Score,Frequency,Relative Frequency (%),Cumulative Frequency
0,4,2,16.666667,2
1,5,4,33.333333,6
2,6,4,33.333333,10
3,7,2,16.666667,12
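
In practice the frequencies are rarely given up front and have to be counted from raw observations. A minimal sketch, assuming a hypothetical list of raw scores chosen to reproduce the table above:

```python
import pandas as pd

# Hypothetical raw scores that reproduce the frequency table above
scores = pd.Series([4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7], name="Score")

# Count each distinct score and order the counts by score
freq = scores.value_counts().sort_index()

table = pd.DataFrame({
    "Frequency": freq,
    "Relative Frequency (%)": freq / freq.sum() * 100,
    "Cumulative Frequency": freq.cumsum(),
})

print(table)
print("Median:", scores.median())
```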



---

### 🔍 Key Insights from the Data
- Most common scores: 5 and 6 (4 students each)
- Median: 5.5, since the 6th and 7th of the 12 ordered scores are 5 and 6
- Highest score: 7, achieved by only 2 students
- Lowest score: 4, achieved by only 2 students

---

### 🎯 Why These Concepts Matter
- Data Exploration: Frequency tables reveal patterns and anomalies
- Probability Foundation: Relative frequency approximates probability in large datasets
- Percentile Calculations: Basis for quartiles and percentile analysis
- Visualization: Forms the foundation for histograms and distribution plots

---



### 5️⃣ Outliers

### 🔹 What are Outliers?
Outliers are **data points that significantly differ** from other observations in a dataset.  
They may occur due to:
- Measurement errors
- Data entry mistakes
- Natural variability
- Rare but important events

---

### 🔹 Why Outliers Matter
- Can **distort mean and standard deviation**
- Affect **regression models**
- Reveal **anomalies or valuable insights**
- Impact **business decisions**

---

### 🔹 Methods to Detect Outliers

#### 1️⃣ IQR Method (Interquartile Range)
- Calculate Q1 (25th percentile) and Q3 (75th percentile)
- IQR = Q3 − Q1
- Outliers lie outside:
  - Lower Bound = Q1 − 1.5 × IQR
  - Upper Bound = Q3 + 1.5 × IQR

#### 2️⃣ Z-Score Method
Measures how many standard deviations a value is from the mean.
- |Z| > 3 → potential outlier

#### 3️⃣ Visualization-Based Detection
- Box Plot
- Scatter Plot
### 🔹 Handling Outliers
- Remove (if data error)
- Cap or floor (winsorization)
- Transform (log, square root)
- Keep (if meaningful)
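
A rough sketch of the first three options, assuming the same kind of hypothetical data and using the 1.5 × IQR fences as the cap and floor limits (percentile-based limits are another common choice for winsorization):

```python
import numpy as np

# Hypothetical data with one unusually large value
values = np.array([10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Remove: keep only values inside the fences
kept = values[(values >= lower) & (values <= upper)]

# Cap or floor: pull extreme values back to the fences
capped = np.clip(values, lower, upper)

# Transform: a log transform compresses large values (needs positive data)
logged = np.log(values)

print("Kept:", kept)
print("Capped max:", capped.max())
print("Largest value after log:", logged.max())
```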

---

### 6️⃣ Statistical Visualization

##### 🔹 Purpose of Statistical Visualization
- Identify patterns, trends, and distributions
- Detect outliers
- Compare groups
- Support data-driven decisions

##### 🔹 Common Statistical Plots
- 📊 Histogram
    - Shows frequency distribution
    - Helps understand shape (normal, skewed)
- 📦 Box Plot
    - Visualizes median, quartiles, and outliers
- 📈 Line Plot
    - Used for time-series data
- 🔵 Scatter Plot
    - Shows relationship between two variables
- 📊 Bar Chart
    - Compares categorical data

---

### 🎯 Key Takeaways
- Visualization simplifies complex data
- Enhances EDA and storytelling
- Essential for interviews and real-world analysis

---



### 7️⃣🔗 Correlation, Covariance & Causation

Understanding relationships between variables is critical in data analysis.

---

#### 📐 Covariance

#### 🔹 What is Covariance?
Covariance measures the **direction of a linear relationship** between two variables.

It tells whether:
- Variables move **together**
- Variables move in **opposite directions**

---

#### 🔹 Interpretation

- **Positive Covariance (> 0):**  
  Both variables increase or decrease together

- **Negative Covariance (< 0):**  
  One variable increases while the other decreases

- **Zero Covariance (~ 0):**  
  No linear relationship (may still be non-linear)

---

#### 🔹 Limitations of Covariance
- Scale-dependent
- Hard to interpret magnitude
- Only indicates **direction**, not strength
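
The scale dependence is easy to see by changing units. In this sketch the height and weight values are made up for illustration; converting height from metres to centimetres multiplies the covariance by 100 while the correlation stays the same.

```python
import numpy as np

# Hypothetical measurements
height_m = np.array([1.50, 1.60, 1.70, 1.80, 1.90])
weight_kg = np.array([50.0, 58.0, 65.0, 74.0, 82.0])
height_cm = height_m * 100  # same heights, different unit

# np.cov returns the 2x2 covariance matrix; [0, 1] is cov(x, y)
cov_m = np.cov(height_m, weight_kg)[0, 1]
cov_cm = np.cov(height_cm, weight_kg)[0, 1]

# np.corrcoef returns the 2x2 correlation matrix
corr_m = np.corrcoef(height_m, weight_kg)[0, 1]
corr_cm = np.corrcoef(height_cm, weight_kg)[0, 1]

print("Covariance (m, kg):", cov_m)
print("Covariance (cm, kg):", cov_cm)
print("Correlation (m, kg):", corr_m)
print("Correlation (cm, kg):", corr_cm)
```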

---



### 8️⃣📊 Correlation

#### 🔹 What is Correlation?
Correlation is a **standardized measure** that quantifies both the **strength** and the **direction** of a **linear relationship** between two variables.

It removes the scale dependency of covariance by dividing the covariance by the product of the two standard deviations.

---

#### 🔹 Correlation Coefficient (r)
- Range: **–1 to +1**

| Value of r | Interpretation |
|-----------|---------------|
| +1 | Perfect positive correlation |
| –1 | Perfect negative correlation |
| ~0 | Weak or no linear relationship |

In [36]:
# Dummy data
x = [2, 4, 6, 8, 10]
y = [10, 8, 6, 4, 2]   # decreases as x increases

def mean(data):
    return sum(data) / len(data)

def covariance(x, y):
    mx, my = mean(x), mean(y)
    return sum((i - mx) * (j - my) for i, j in zip(x, y)) / (len(x) - 1)

print("Covariance:", covariance(x, y))

import math

def correlation(x, y):
    # Pearson r = cov(x, y) / (s_x * s_y).
    # The variances use the same n - 1 denominator as the covariance,
    # so the result stays within [-1, 1].
    var_x = covariance(x, x)
    var_y = covariance(y, y)
    return covariance(x, y) / math.sqrt(var_x * var_y)

print("Correlation:", correlation(x, y))

Covariance: -10.0
Correlation: -1.0


**Correlation Using Pandas**

In [37]:
import pandas as pd

df = pd.DataFrame({
    'x': [2, 4, 6, 8, 10],
    'y': [10, 8, 6, 4, 2]
})

df

print("Correlation Using Pandas")
print(df[['x', 'y']].corr())

Correlation Using Pandas
     x    y
x  1.0 -1.0
y -1.0  1.0


### 9️⃣🔥 Causation
- Causation means that one event directly causes another.
- **Example**: Smoking → Lung cancer

## ⚠️ Correlation vs Covariance

| Aspect | Covariance | Correlation |
|---|---|---|
| Scale dependent | Yes | No |
| Range | Unbounded | –1 to +1 |
| Interpretation | Direction only | Strength + direction |

---

### ⚠️ Important Rule
- Correlation does NOT imply causation
- Just because two variables move together does not mean one causes the other.
- 🔹 **Example**
    - Ice cream sales ↑ and drowning cases ↑
    - ❌ Ice cream does NOT cause drowning
    - ✔ Summer temperature affects both
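
The ice cream example can be imitated with simulated data. This is only a sketch: the coefficients, noise levels, and seed are invented, and removing the effect of temperature with a straight-line fit is one simple way to control for a confounder, not the only one.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000

# Temperature drives both variables; they do not affect each other
temperature = rng.normal(25, 5, n)
ice_cream_sales = 20 * temperature + rng.normal(0, 40, n)
drownings = 0.3 * temperature + rng.normal(0, 1, n)

# Raw correlation between the two outcomes is strongly positive
raw_corr = np.corrcoef(ice_cream_sales, drownings)[0, 1]

def residuals(y, x):
    """Part of y that a straight-line fit on x does not explain."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Correlation after removing the effect of temperature from both
partial_corr = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]

print("Raw correlation:", raw_corr)
print("Correlation after controlling for temperature:", partial_corr)
```

Once temperature is accounted for, the remaining correlation should be close to zero, which is what a shared cause (rather than a direct causal link) looks like in data.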

---


### 🎯 Key Takeaways
- **Covariance** → Direction
- **Correlation** → Strength + Direction
- **Causation** → Direct cause-effect
- Always validate relationships using domain knowledge

---



# 📈 8.5 Inferential Statistics
Inferential statistics use **sample data** to:
- Draw conclusions about a **larger population**
- Make **predictions**
- Test **hypotheses**

Instead of analyzing the entire population (census), we analyze a **representative sample**.

---

## 🎯 Why Inferential Statistics Are Needed
- Reduces **cost and time** of data collection
- Census data is often **impractical or impossible**
- Helps generalize findings from **sample → population**

---

## 🧠 Key Techniques in Inferential Statistics

- **t-test**
- **z-test**
- **ANOVA**
- **Chi-Square**
- **p-values** to determine whether results are:
  - Statistically significant  
  - Or due to random chance

---

## 👥 Population vs Sample

- **Population:** Entire group of interest  
- **Sample:** Subset of the population  

A sample acts as a **bridge** between limited data and the whole population.

## 8.5.1 📌 Sampling Techniques

In [38]:
import pandas as pd

# Population data
data = {
    'Student_ID': range(1, 21),
    'Department': ['CS','CS','CS','CS','CS',
                   'IT','IT','IT','IT','IT',
                   'ECE','ECE','ECE','ECE','ECE',
                   'ME','ME','ME','ME','ME'],
    'Year': [1,2,3,4,1,
             1,2,3,4,1,
             1,2,3,4,1,
             1,2,3,4,1],
    'Marks': [78,85,90,66,72,
              80,88,91,70,75,
              76,82,89,68,74,
              79,83,87,65,71]
}

df = pd.DataFrame(data)
df

Unnamed: 0,Student_ID,Department,Year,Marks
0,1,CS,1,78
1,2,CS,2,85
2,3,CS,3,90
3,4,CS,4,66
4,5,CS,1,72
5,6,IT,1,80
6,7,IT,2,88
7,8,IT,3,91
8,9,IT,4,70
9,10,IT,1,75


### 1️⃣ Simple Random Sampling
- Every data point has an **equal chance** of selection (here, each student).
- Reduces selection bias
- Simple to implement

In [39]:
simple_random_sample = df.sample(n=5, random_state=1)
simple_random_sample

Unnamed: 0,Student_ID,Department,Year,Marks
3,4,CS,4,66
16,17,ME,2,83
6,7,IT,2,88
10,11,ECE,1,76
2,3,CS,3,90


### 2️⃣ Stratified Sampling
- The population is divided into homogeneous subgroups (strata) based on a characteristic, and each stratum is sampled, usually in proportion to its size.
- Common in train-test splitting for ML.
- Preserves class distribution
- Useful for imbalanced datasets
- Here, Department is the stratum. Each department has 5 students, so drawing 2 from each is proportional and ensures every department is represented.

In [40]:
stratified_sample = df.groupby('Department').sample(n=2, random_state=1)
stratified_sample

Unnamed: 0,Student_ID,Department,Year,Marks
2,3,CS,3,90
1,2,CS,2,85
10,11,ECE,1,76
12,13,ECE,3,89
7,8,IT,3,91
8,9,IT,4,70
18,19,ME,4,65
15,16,ME,1,79


### 3️⃣ Systematic Sampling
- Selects every k-th data point after a random starting point.
- Simple
- May introduce bias if the data has a periodic pattern
- The cell below picks every 4th student (k = 4) and, for simplicity, starts at the first row rather than at a random offset.

In [41]:
k = 4
systematic_sample = df.iloc[::k]
systematic_sample

Unnamed: 0,Student_ID,Department,Year,Marks
0,1,CS,1,78
4,5,CS,1,72
8,9,IT,4,70
12,13,ECE,3,89
16,17,ME,2,83
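
The cell above always starts at the first row. A minimal sketch of the random-start variant, assuming a stand-in `students` frame with only the ID column and an arbitrary seed:

```python
import numpy as np
import pandas as pd

# Stand-in population: 20 students, IDs 1..20
students = pd.DataFrame({"Student_ID": range(1, 21)})

k = 4
rng = np.random.default_rng(1)

# Random starting offset in {0, 1, ..., k - 1}
start = int(rng.integers(0, k))

# Take every k-th row from the random start
systematic_sample = students.iloc[start::k]

print("Start offset:", start)
print(systematic_sample)
```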


### 4️⃣ Cluster Sampling
- The population is divided into clusters, and entire clusters are randomly selected.
- Cost-effective
- Used in geographical or large-scale studies
- Here, Department = Cluster, and every student from a selected cluster is included.
- The cell below hardcodes the selected clusters for illustration; in practice they would be drawn at random.

In [42]:
# Select clusters
selected_clusters = ['CS', 'ECE']

cluster_sample = df[df['Department'].isin(selected_clusters)]
cluster_sample

Unnamed: 0,Student_ID,Department,Year,Marks
0,1,CS,1,78
1,2,CS,2,85
2,3,CS,3,90
3,4,CS,4,66
4,5,CS,1,72
10,11,ECE,1,76
11,12,ECE,2,82
12,13,ECE,3,89
13,14,ECE,4,68
14,15,ECE,1,74
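
To draw the clusters at random instead of hardcoding them, one option is NumPy's `Generator.choice`. This sketch uses a reduced stand-in `students` frame and an arbitrary seed:

```python
import numpy as np
import pandas as pd

# Stand-in population: 4 departments with 5 students each
students = pd.DataFrame({
    "Student_ID": range(1, 21),
    "Department": ["CS"] * 5 + ["IT"] * 5 + ["ECE"] * 5 + ["ME"] * 5,
})

rng = np.random.default_rng(1)

# Randomly pick 2 of the 4 clusters, without replacement
departments = sorted(students["Department"].unique())
selected_clusters = rng.choice(departments, size=2, replace=False)

# Keep every student that belongs to a selected cluster
cluster_sample = students[students["Department"].isin(selected_clusters)]

print("Selected clusters:", list(selected_clusters))
print(cluster_sample)
```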


### Sampling Types

| Sampling Type | Key Idea |
|---|---|
| Simple Random | Equal probability for each unit |
| Stratified | Subgroups proportionally represented |
| Systematic | Every k-th unit selected |
| Cluster | Entire groups randomly selected |

## 📐 8.5.2 Estimation (Inferential Statistics)

Estimation is the process of **drawing conclusions about a population parameter** using **sample data**.

---

### 🧠 Key Idea
- **Population Parameters:** Properties of the population, such as the mean (μ), proportion (p), standard deviation (σ), and variance (σ²)
- **Sample Statistics:** Properties of the sample, such as the mean (x̄), proportion (p̂), and standard deviation (s)
- Sample statistics are used to **estimate population parameters**

---

### 🎯 Types of Estimation

#### 1️⃣ Point Estimation

Provides a **single numerical value** as the best guess of a population parameter.

Examples:
- Sample mean (x̄) → estimates population mean (μ)
- Sample proportion (p̂) → estimates population proportion (p)

#### 🔹 Properties of a Good Point Estimator

- **Unbiased:** Expected value equals the true parameter
- **Consistent:** Accuracy improves as sample size increases
- **Efficient:** Low variance

---

### 📊 Methods of Point Estimation (Proportion)

#### 🔹 Maximum Likelihood Estimation (MLE)

Let:
- **S** = Number of successes
- **T** = Total number of trials

`MLE = S / T`

#### 🔹 Laplace Estimator

`(S + 1) / (T + 2)`

#### 🔹 Wilson Estimator

`(S + Z^2 / 2) / (T + Z^2)`, where Z is the critical value for the chosen confidence level (for example, 1.96 for 95%)

#### 🔹 Jeffreys Estimator

`(S + 0.5) / (T + 1)`
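
The four estimators can be compared side by side. In this sketch the function name `proportion_estimates` and the example counts (8 successes in 10 trials) are illustrative choices, and Z defaults to 1.96.

```python
def proportion_estimates(successes, trials, z=1.96):
    """Return the four point estimates of a proportion described above."""
    return {
        "mle": successes / trials,
        "laplace": (successes + 1) / (trials + 2),
        "wilson": (successes + z**2 / 2) / (trials + z**2),
        "jeffreys": (successes + 0.5) / (trials + 1),
    }

# Hypothetical example: 8 successes in 10 trials
estimates = proportion_estimates(successes=8, trials=10)

for name, value in estimates.items():
    print(f"{name}: {value:.4f}")
```

The Laplace, Wilson, and Jeffreys estimates all pull the raw proportion toward 0.5, which matters most when the number of trials is small or when S is 0 or T.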

---


### 2️⃣ Interval Estimation

Gives a **range of values** (a confidence interval) that is likely to contain the **true population parameter**, together with a **confidence level** describing how often intervals built this way capture the true value.

#### 📦 Confidence Interval (CI)

A **confidence interval** is a range calculated from sample data: `point estimate ± margin of error`. At a 95% confidence level, about 95% of intervals constructed this way from repeated samples would contain the true, unknown population parameter.

#### 🔹 Margin of Error (ME)

Quantifies the uncertainty of the estimate: `ME = critical value × standard error`, where the critical value depends on the confidence level.

#### 🔹 Standard Error (SE) for Population Mean

Measures how much a sample mean is expected to vary from the true population mean: `SE = s / √n` (or `σ / √n` when σ is known).
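
Putting the pieces together, a confidence interval for a mean is the sample mean plus or minus the margin of error. A worked sketch with made-up summary statistics (x̄ = 50, s = 12, n = 36) and the 95% critical value Z = 1.96:

```python
import math

# Hypothetical summary statistics from a sample
x_bar = 50.0   # sample mean
s = 12.0       # sample standard deviation
n = 36         # sample size

z = 1.96       # critical value for 95% confidence

# Standard error of the mean: s / sqrt(n)
standard_error = s / math.sqrt(n)

# Margin of error: critical value * standard error
margin_of_error = z * standard_error

# Confidence interval: point estimate +/- margin of error
ci_lower = x_bar - margin_of_error
ci_upper = x_bar + margin_of_error

print("Standard error:", standard_error)
print("Margin of error:", margin_of_error)
print("95% CI:", (ci_lower, ci_upper))
```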

---

### 📊 Confidence Levels & Critical Values

| Confidence Level | Critical Value (Z) |
|------------------|-------------------|
| 90% | 1.645 |
| 95% | 1.96 |
| 99% | 2.576 |
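
These critical values come from the standard normal distribution: for a two-sided interval at confidence level c, Z is the quantile at 1 − (1 − c) / 2. A small sketch using Python's standard-library `statistics.NormalDist`; the helper name `z_critical` is an illustrative choice.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Two-sided critical value of the standard normal distribution."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: {z_critical(level):.3f}")
```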

---

### 📈 Distribution Used for CI

- **n ≥ 30** → Use **Z-distribution**
- **n < 30** (with σ unknown) → Use **t-distribution**

---

### 🎯 Key Takeaways
- Point estimation gives a single best value
- Interval estimation provides reliability with confidence
- Larger samples → smaller margin of error
- Foundation for hypothesis testing and prediction

## 📉 t-Distribution: Distribution for Small Samples (n < 30)

When the **sample size is small (n < 30)** and the **population standard deviation (σ) is unknown**, we use the **t-distribution** instead of the normal (Z) distribution.

---

### 📌 Key Points

- Population standard deviation (σ) is **unknown**
- Sample standard deviation (s) is used
- Data is assumed to be **approximately normal**
- Variability is higher for small samples

---

### 📊 Comparison: Z vs t Distribution

| Feature | Z-Distribution | t-Distribution |
|------|---------------|----------------|
| Sample size | n ≥ 30 | n < 30 |
| Population σ | Known | Unknown |
| Variability | Fixed | Higher, depends on degrees of freedom |
| Shape | Normal | Bell-shaped with heavier tails |

---

### 🎯 Key Takeaways
- Use **t-distribution** for small samples
- Replace σ with sample standard deviation (s)
- Always adjust using **degrees of freedom** (n − 1 for a one-sample mean)
- As n increases, t → normal distribution
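
The heavier tails can be seen by simulation without any extra libraries. In this sketch, samples of size 5 are drawn from a standard normal population, the t statistic is computed with the sample standard deviation, and the share of values beyond ±1.96 is counted. Under a normal distribution that share would be about 5%; the sample size, repetition count, and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5                  # small sample size, so 4 degrees of freedom
num_samples = 100_000

# Each row is one small sample from a standard normal population (true mean 0)
samples = rng.standard_normal((num_samples, n))

# t statistic: sample mean divided by its estimated standard error
sample_means = samples.mean(axis=1)
sample_stds = samples.std(axis=1, ddof=1)
t_stats = sample_means / (sample_stds / np.sqrt(n))

# Share of t statistics beyond the normal 95% cut-off
tail_rate = np.mean(np.abs(t_stats) > 1.96)

print("Share of |t| > 1.96:", tail_rate)
```

The share comes out well above 5%, which is why a larger critical value from the t-distribution is needed for small samples.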

## 🧪 8.5.3 Hypothesis Testing

**Hypothesis Testing** is a formal process for testing a claim about a population using data from a sample.

It helps decide whether observed sample results are:
- Due to **random chance**
- Or due to a **real (significant) effect**

This is done by comparing **two opposing hypotheses**.

---

### 📌 Types of Hypothesis

#### 1️⃣ Null Hypothesis (H₀)
- Default assumption
- States **no effect** or **no difference**

#### 2️⃣ Alternative Hypothesis (H₁ or Hₐ)
- Researcher’s claim
- States that **an effect or difference exists**

---

### 🎯 Level of Significance (α)

- Threshold for rejecting the null hypothesis
- Probability of making a **Type I error**
- Common values: **0.05, 0.01**

---

### 📊 P-value

The **p-value** is the probability of obtaining results **at least as extreme as the observed data**, assuming the null hypothesis is true.

#### Decision Rule:
- **p ≤ α → Reject H₀**
- **p > α → Fail to reject H₀**

Low p-value ⇒ Strong evidence against H₀
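
The decision rule can be walked through with a two-tailed one-sample z-test. All numbers below are hypothetical (claimed mean 100, known σ = 12, sample mean 103 from n = 64), and the standard library's `statistics.NormalDist` supplies the normal CDF.

```python
import math
from statistics import NormalDist

# Hypothetical setup: H0 says the population mean is 100
mu_0 = 100.0
sigma = 12.0     # population standard deviation, assumed known
x_bar = 103.0    # observed sample mean
n = 64           # sample size
alpha = 0.05

# Test statistic: how many standard errors the sample mean is from mu_0
z = (x_bar - mu_0) / (sigma / math.sqrt(n))

# Two-tailed p-value: probability of a result at least this extreme under H0
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

reject_h0 = p_value <= alpha

print("z statistic:", z)
print("p-value:", round(p_value, 4))
print("Reject H0:", reject_h0)
```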

---

### ⚠️ Types of Errors

| Error Type | Description |
|----------|------------|
| **Type I Error** (False Positive) | Rejecting H₀ when it is actually true |
| **Type II Error** (False Negative) | Failing to reject H₀ when it is false |

---

### 🧠 Decision Outcomes

| Reality | Decision | Result |
|------|---------|--------|
| H₀ True | Fail to Reject H₀ | ✅ Correct |
| H₀ True | Reject H₀ | ❌ Type I Error |
| H₀ False | Reject H₀ | ✅ Correct |
| H₀ False | Fail to Reject H₀ | ❌ Type II Error |
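
A simulation makes the link between α and the Type I error rate concrete. In this sketch H₀ is true by construction (population mean 0, known σ = 1), so every rejection is a false positive; the sample size, number of experiments, and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
num_experiments = 20_000
z_critical = 1.96   # two-tailed cut-off for alpha = 0.05

# H0 is true: every sample comes from a population with mean 0 and sigma 1
samples = rng.standard_normal((num_experiments, n))

# z statistic for each experiment (sigma is known to be 1)
z_stats = samples.mean(axis=1) / (1 / np.sqrt(n))

# Any rejection here is a Type I error
type_1_rate = np.mean(np.abs(z_stats) > z_critical)

print("Observed Type I error rate:", type_1_rate)
```

The observed rate should land close to 0.05, matching the chosen significance level.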

---

### 📈 Types of Tests by Tail Direction

#### 1️⃣ One-Tailed Test
- Tests **directional** claims (`>` or `<`)
- Critical region is in **one tail** of the distribution

✔ More powerful for directional hypotheses

#### 2️⃣ Two-Tailed Test
- Tests **any difference**
- Critical region split across **both tails**

✔ Most commonly used in practice

---

### 📊 Tail Comparison

| Test Type | Direction | Critical Region |
|---------|----------|----------------|
| One-tailed | One direction | One tail |
| Two-tailed | Any difference | Both tails |
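
For the same test statistic, the two-tailed p-value of a z-test is twice the one-tailed value, which is why a one-tailed test can reject H₀ when a two-tailed test does not. A short sketch with a hypothetical z statistic of 1.8:

```python
from statistics import NormalDist

z = 1.8        # hypothetical test statistic
alpha = 0.05

# One-tailed (H1: mean is greater): area in the upper tail only
p_one_tailed = 1 - NormalDist().cdf(z)

# Two-tailed (H1: mean is different): area in both tails
p_two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))

print("One-tailed p-value:", round(p_one_tailed, 4), "reject:", p_one_tailed <= alpha)
print("Two-tailed p-value:", round(p_two_tailed, 4), "reject:", p_two_tailed <= alpha)
```

The direction of a one-tailed test should be fixed before looking at the data; choosing it afterwards effectively doubles the false positive risk.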

---

### 🎯 Key Takeaways
- Hypothesis testing validates population claims
- p-value determines statistical significance
- α controls false positive risk
- One-tailed → directional
- Two-tailed → non-directional