# **Sampling methods**
- Sampling methods are ways to select a sample of data from a given population (every individual in the whole group).

It is usually unrealistic to collect data from the entire population because doing so:
- involves too many individuals
- takes too much time
- costs too much money

We therefore take an appropriately sized sample to represent the population.

---

### **Types of Sampling Methods**

| **Sampling Method**       | **Description**                                                                                                                                         | **Example**                                                                                                                                                         |
|---------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Random Sampling**       | • Gathering a representative sample from a population where each member in the population has an **equal chance** of being selected.                   | • Using a **random number generator** to select students in a class to complete a task.                                                                             |
| **Stratified Sampling**   | • The population is split into smaller groups or **strata**, and each stratum is represented in the sample **proportionally** to its size in the population. | • Finding out a favourite soap opera from different age **categories** of people in a year group. |
| **Systematic Sampling**   | • Every member in the population is given a **number**. After the first member is chosen at random, the remaining members are chosen at a fixed **interval**. | • A **list of people** ordered alphabetically by first name is numbered. The 5th person is **chosen randomly** as the starting point, followed by every subsequent 8th person. |
| **Non-Random Sampling**   | • **Convenience** sampling selects whoever is easiest to reach, for **ease** of data collection. **Volunteer** sampling relies on people who choose to take part. | • Asking people at a given location about how long their **commute** to work is. |
| **Capture Recapture**     | • **Capturing, marking and releasing** a sample, then capturing a second sample from the **same location at a later time**. The share of marked individuals in the second sample is used to estimate the population size. | • A sample of woodlice was **captured, marked and released**. Another sample of woodlice was captured 5 days later and the number of marked woodlice was counted. |
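
The sampling methods in the table can be sketched in a few lines of Python. This is a minimal illustration using only the standard library. The function names, the class list and the age strata are hypothetical, and the capture-recapture estimate uses the standard Lincoln-Petersen formula (N ≈ M × C / R), which the table does not spell out.

```python
import random

def simple_random_sample(population, k, seed=0):
    """Random sampling: every member has an equal chance of selection."""
    rng = random.Random(seed)
    return rng.sample(population, k)

def systematic_sample(population, interval, seed=0):
    """Systematic sampling: random start, then every `interval`-th member."""
    rng = random.Random(seed)
    start = rng.randrange(interval)
    return population[start::interval]

def stratified_sample(strata, total_k, seed=0):
    """Stratified sampling: each stratum contributes in proportion to its size."""
    rng = random.Random(seed)
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        k = round(total_k * len(members) / population_size)
        sample[name] = rng.sample(members, k)
    return sample

def lincoln_petersen(marked_first, caught_second, marked_in_second):
    """Capture-recapture: estimated population size N = M * C / R."""
    return marked_first * caught_second / marked_in_second

# Hypothetical class list, numbered 1..100
students = list(range(1, 101))
print(simple_random_sample(students, 5))
print(systematic_sample(students, 8))

# Hypothetical year group split into age strata of size 60, 30 and 10
year_group = {
    "11-12": list(range(0, 60)),
    "13-14": list(range(60, 90)),
    "15-16": list(range(90, 100)),
}
by_age = stratified_sample(year_group, 20)
print({name: len(chosen) for name, chosen in by_age.items()})

# 40 woodlice marked, 50 caught later, 10 of those were marked
print(lincoln_petersen(40, 50, 10))
```

With a sample of 20 drawn from strata of 60, 30 and 10 members, the stratified sample takes 12, 6 and 2 members respectively, which mirrors the 60/30/10 split of the population.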

---

# **Statistics**
- Statistics is a branch of mathematics that deals with the collection, analysis and interpretation of large amounts of data.
- It allows us to derive knowledge from large datasets, and this knowledge can be used for prediction, decision-making, classification, etc.
- It underpins data visualisation and machine learning. For example, when building an ML model we use statistical tests to find the important columns (features) among many.
  
![image.png](attachment:974b7f45-e39f-4ace-89d7-81e2ec38cb48.png)

---

### **Statistics and Its Types**

| **Type of Statistics**    | **Description**                                                                                                           | **Example**                                                                                   |
|---------------------------|---------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------|
| **Descriptive Statistics**| • Involves collecting, organizing, summarizing, and presenting data.                                                     <br>• Focuses on what **has happened** using measures like mean, median, mode, and visual tools.| • Calculating the **average marks** of students in a class. <br>• Creating **bar charts** and **pie charts** to represent data. |
| **Inferential Statistics**| • Makes **predictions or inferences** about a population using a sample.                                                  <br>• Uses **probability theory** and statistical tests to draw conclusions.                    | • Estimating the **average income** of a city from a sample survey. <br>• **Predicting election results** based on exit polls.   |

---

### ✅ **Comparison between Descriptive and Inferential Statistics**

| Feature                  | Descriptive Statistics                    | Inferential Statistics                             |
|--------------------------|-------------------------------------------|----------------------------------------------------|
| Purpose                  | Summarize and describe data               | Draw conclusions/predictions about a population    |
| Data Focus               | The data set actually collected           | A sample, used to generalise to the population     |
| Techniques Used          | Mean, median, mode, range, graphs         | Hypothesis testing, confidence intervals, regression |
| Example Use Case         | Monthly sales report                      | Estimating next month’s sales                      |

---

### **Descriptive Statistics and Its Types**

![image.png](attachment:dd78a5b2-0e61-4e83-8083-81adb3c95f6f.png)


| **Type**                   | **Description**                                                                                              | **Example**                                                                                         |
|----------------------------|--------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------|
| **Measures of Central Tendency** | • Describe the **center** or average of a data set. <br>• Common measures: **Mean**, **Median**, **Mode**.        | • Finding the **average score** of students in an exam. <br>• **Median income** in a city.         |
| **Measures of Dispersion**       | • Indicate how **spread out** the data is. <br>• Common measures: **Range**, **Variance**, **Standard Deviation**. | • Finding the **range of temperatures** in a week. <br>• Measuring **variability in salaries**.     |
| **Measures of Position**         | • Identify the **relative position** of data values in a dataset. <br>• Includes: **Percentiles**, **Quartiles**, **Z-scores**. | • Determining the **top 25%** scorers in a test. <br>• Calculating a **Z-score** to compare data.   |
| **Frequency Distribution**       | • Shows how often each **value or range of values** occurs in a dataset.                                     | • Creating a **frequency table** for test scores. <br>• Using **histograms** to show score ranges.  |
| **Data Visualization**           | • Graphical representation of data for easier understanding. <br>• Tools include **Bar charts**, **Pie charts**, **Histograms**. | • Drawing a **pie chart** to show time spent on daily activities. <br>• Creating a **bar chart** of product sales. |
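
The measures in the table can be computed with Python's built-in `statistics` module. The sketch below is illustrative only: the list of exam scores is made up, and the quartiles use the module's default (exclusive) method, which is one of several conventions.

```python
import statistics
from collections import Counter

# Hypothetical exam scores for ten students
scores = [55, 60, 60, 65, 70, 75, 80, 85, 90, 100]

# Measures of central tendency
mean = statistics.mean(scores)
median = statistics.median(scores)
mode = statistics.mode(scores)

# Measures of dispersion
data_range = max(scores) - min(scores)
variance = statistics.variance(scores)   # sample variance (divides by n - 1)
std_dev = statistics.stdev(scores)       # sample standard deviation

# Measures of position
q1, q2, q3 = statistics.quantiles(scores, n=4)   # quartiles
z_top = (max(scores) - mean) / std_dev           # z-score of the top score

# Frequency distribution
frequency = Counter(scores)

print(f"mean={mean}, median={median}, mode={mode}")
print(f"range={data_range}, variance={variance:.2f}, std_dev={std_dev:.2f}")
print(f"Q1={q1}, Q2={q2}, Q3={q3}, z-score of top score={z_top:.2f}")
print(f"frequency of 60: {frequency[60]}")
```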

---

### **Inferential Statistics and Its Types**

![image.png](attachment:b5aefde8-b8d2-4ddb-a3a1-a37e2e584ebd.png)

---

### 📊 **Inferential Statistics and Its Types (Detailed Table)**

| **Main Type**              | **Subcategory / Test**              | **Description**                                                                                               | **Example**                                                                                           |
|----------------------------|--------------------------------------|---------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------|
| **Estimation**             | • Point Estimation                   | • Provides a **single best guess** of a population parameter based on sample data.                            | • Estimating the **mean height** of students from a sample.                                            |
|                            | • Confidence Interval               | • Gives a **range of values** (interval) within which the population parameter is expected to lie.            | • Calculating a **95% confidence interval** for average monthly income.                                |
| **Hypothesis Testing**     | • z-test                             | • Used for **large samples (n ≥ 30)** when population **standard deviation is known**.                        | • Testing if average IQ of a sample is significantly different from population average.                |
|                            | • t-test                             | • Used for **small samples (n < 30)** or when **population SD is unknown**.                                   | • Comparing average scores between two teaching methods.                                               |
|                            | • F-test                             | • Compares **variances** between two or more groups. Often used as a base for ANOVA.                          | • Checking if two datasets have significantly different variances.                                     |
|                            | • ANOVA (Analysis of Variance)       | • Tests if the **means of 3 or more groups** are significantly different.                                     | • Comparing test scores of students taught by 3 different teachers.                                    |
|                            | • Chi-Square Test                    | • Used to assess relationships between **categorical variables** or **goodness of fit**.                      | • Testing if there’s a link between **gender** and **mobile brand preference**.                        |
|                            | • One-Tailed & Two-Tailed Tests      | • One-tailed tests check for **directional differences**, two-tailed for **any difference**.                  | • One-tailed: Is A **better** than B? <br>• Two-tailed: Is A **different** from B?                     |
| **Regression Analysis**    | • Simple Linear Regression           | • Examines the **linear relationship** between two variables (1 independent, 1 dependent).                     | • Predicting **exam score** based on **hours studied**.                                                |
|                            | • Multiple Linear Regression         | • Involves **2 or more independent variables** predicting one dependent variable.                             | • Predicting **house price** using **size, location, number of rooms**, etc.                           |
|                            | • Logistic Regression                | • Used when the **dependent variable is categorical** (e.g., Yes/No, 0/1).                                    | • Predicting if a customer will **buy a product** or not.                                              |
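
As an illustration of the **Estimation** rows, the sketch below computes a point estimate and a 95% confidence interval for a mean. It assumes a simulated (not real) sample of 40 monthly incomes and uses the normal approximation with a z critical value. For small samples a t critical value would give a wider interval.

```python
import math
import random
import statistics
from statistics import NormalDist

# Hypothetical sample of 40 monthly incomes (simulated, not real data)
rng = random.Random(7)
incomes = [rng.gauss(3000, 500) for _ in range(40)]

n = len(incomes)
point_estimate = statistics.mean(incomes)              # single best guess of the mean
std_error = statistics.stdev(incomes) / math.sqrt(n)   # standard error of the mean

# Critical value for a two-sided 95% interval (about 1.96)
z_crit = NormalDist().inv_cdf(0.975)
lower = point_estimate - z_crit * std_error
upper = point_estimate + z_crit * std_error

print(f"Point estimate: {point_estimate:.1f}")
print(f"95% confidence interval: ({lower:.1f}, {upper:.1f})")
```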

---

### 🧠 **Recap**

- ✅ **Estimation** → Point Estimation, Confidence Interval  
- ✅ **Hypothesis Testing** → t-test, z-test, F-test, ANOVA, Chi-Square, One-/Two-tailed Tests  
- ✅ **Regression Analysis** → Simple, Multiple, Logistic Regression  
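
For the regression branch, simple linear regression can be sketched with the ordinary least-squares formulas (slope = Sxy / Sxx, intercept = mean(y) − slope × mean(x)). The helper name `fit_line` and the hours/score data are hypothetical, chosen to match the "exam score from hours studied" example.

```python
def fit_line(x, y):
    """Ordinary least squares for one independent variable."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: hours studied and exam score for five students
hours = [1, 2, 3, 4, 5]
scores = [52, 57, 61, 68, 72]

slope, intercept = fit_line(hours, scores)
predicted = intercept + slope * 6   # predicted score after 6 hours of study

print(f"score = {intercept:.1f} + {slope:.1f} * hours")
print(f"predicted score for 6 hours: {predicted:.1f}")
```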

---

### 🔍 **Test Type Summary**

| **Test**                  | **Type of Test**         |
|---------------------------|--------------------------|
| t-test                    | Hypothesis Testing (Means) |
| z-test                    | Hypothesis Testing (Means) |
| z-test for proportions    | Hypothesis Testing (Proportions) |
| F-test                    | Hypothesis Testing (Variances) |
| ANOVA                     | Extension of t-test (Means) |
| Chi-Square                | Hypothesis Testing (Categorical) |
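
To make the z-test for means concrete, here is a minimal one-sample z-test that also shows the difference between a one-tailed and a two-tailed p-value. The function name is hypothetical, and the numbers (population mean 100, population SD 15, a sample of 36 with mean 104) are assumed for illustration, loosely following the IQ example.

```python
import math
from statistics import NormalDist

def one_sample_z_test(sample_mean, pop_mean, pop_sd, n):
    """Return the z-statistic with two-tailed and upper-tailed p-values."""
    z = (sample_mean - pop_mean) / (pop_sd / math.sqrt(n))
    std_normal = NormalDist()
    p_two_tailed = 2 * (1 - std_normal.cdf(abs(z)))   # H1: mean is different
    p_upper_tail = 1 - std_normal.cdf(z)              # H1: mean is greater
    return z, p_two_tailed, p_upper_tail

z, p_two, p_greater = one_sample_z_test(sample_mean=104, pop_mean=100, pop_sd=15, n=36)
print(f"z = {z:.2f}")
print(f"two-tailed p-value = {p_two:.4f}")
print(f"one-tailed p-value = {p_greater:.4f}")
```

For a positive z, the one-tailed p-value is half the two-tailed one, so a result can be significant in a directional test while not being significant in a two-tailed test at the same level.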

---


### 📊 **Comparison of Statistical Tests**

| **Feature**                 | **t-test**                                 | **z-test**                                 | **z-test for proportions**                     | **F-test**                                | **ANOVA**                                      | **Chi-Square (χ²)**                                 |
|----------------------------|--------------------------------------------|---------------------------------------------|------------------------------------------------|-------------------------------------------|--------------------------------------------------|------------------------------------------------------|
| **Purpose**                | Compare means between groups               | Compare means between groups                | Compare two proportions                        | Compare variances                         | Compare means across ≥3 groups                 | Test association / goodness of fit                  |
| **Data Type**              | Continuous (quantitative)                  | Continuous (quantitative)                   | Proportions (categorical)                      | Continuous                                 | Continuous                                     | Categorical (nominal)                              |
| **Sample Size**            | Small (n < 30)                             | Large (n ≥ 30)                              | Large (n ≥ 30)                                 | Any                                        | Any                                            | Preferably large                                   |
| **Population SD Known?**  | No                                         | Yes                                         | Not needed (standard error is computed from the proportions) | No                                         | No                                             | Not applicable                                     |
| **# Groups / Variables**   | 2 groups                                   | 2 groups                                    | 2 groups                                       | 2 groups                                   | 3 or more groups                              | 2 or more categories                               |
| **Distribution Assumption**| Normal (or near-normal)                    | Normal                                      | Approx. normal (via CLT)                       | Normal                                     | Normal & equal variances                      | Independent observations; expected counts ≥ 5 per cell (rule of thumb) |
| **Example Use Case**       | Compare test scores of 2 classes           | Compare average heights of 2 populations     | Compare pass % of males vs. females            | Compare score variance by department      | Compare average salary across departments     | Check if education level is independent of gender  |
| **Test Type**              | Hypothesis Testing (Means)                 | Hypothesis Testing (Means)                  | Hypothesis Testing (Proportions)               | Hypothesis Testing (Variances)            | Hypothesis Testing (Means across groups)       | Hypothesis Testing (Categorical)                  |
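
The z-test for proportions column can be sketched as follows, using the pooled-proportion form of the test. The function name and the pass counts (45 of 60 in one group, 30 of 50 in the other) are hypothetical, chosen to mirror the "pass % of males vs. females" use case.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-tailed z-test for the difference between two proportions."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled proportion under the null hypothesis that both groups share one rate
    pooled = (success_a + success_b) / (n_a + n_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical pass counts: 45 of 60 in group A, 30 of 50 in group B
z, p_value = two_proportion_z_test(45, 60, 30, 50)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```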

---

![{70C07BCC-576E-4EEC-887B-1A0AED4145E3}.png](attachment:cbb89b62-39b3-4f72-bf32-c7b8a691b564.png)

![{30A72ABF-8A75-445A-B6F8-D2DACF77F62F}.png](attachment:1544d68d-56bd-4bae-9f8c-2326999ae045.png)

![{7BE932E7-A8FF-4E7D-85D2-36725C0A128E}.png](attachment:842ab359-3c21-4265-8b2f-09763be95bf3.png)

![{975A5288-748A-4E9B-8C05-1452FD56F506}.png](attachment:24c161b3-b6d8-4d6c-afb9-ae0dc37cb856.png)

In [1]:
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.feature_selection import chi2, f_classif

# Sample dataset
np.random.seed(42)
data = {
    'Feature1': np.random.normal(50, 10, 100),  # Continuous variable
    'Feature2': np.random.normal(55, 15, 100),  # Continuous variable
    'Feature3': np.random.randint(0, 2, 100),   # Binary categorical variable
    'Target': np.random.randint(0, 2, 100)      # Binary target variable
}

df = pd.DataFrame(data)

# **T-test: Checking if Feature1 & Feature2 have different means**
t_stat, p_value = stats.ttest_ind(df['Feature1'], df['Feature2'])
print(f"T-test: t-stat={t_stat}, p-value={p_value}")

# **Z-test: Assuming large sample, compare means**
# With equal group sizes this statistic coincides with the pooled t-statistic above.
z_stat = (df['Feature1'].mean() - df['Feature2'].mean()) / np.sqrt(df['Feature1'].var()/len(df) + df['Feature2'].var()/len(df))
print(f"Z-test: z-stat={z_stat}")

# **F-test (ANOVA)**
# f_classif runs a one-way ANOVA F-test per feature across the Target classes.
f_stat, p_value = f_classif(df[['Feature1', 'Feature2']], df['Target'])
print(f"F-test (ANOVA) p-values: {p_value}")

# **Chi-Square Test for Categorical Data**
# sklearn's chi2 expects non-negative features (here a 0/1 indicator).
chi2_stat, p_value = chi2(df[['Feature3']], df['Target'])
print(f"Chi-Square Test: chi2-stat={chi2_stat}, p-value={p_value}")

T-test: t-stat=-3.7611557434761576, p-value=0.00022265532013974478
Z-test: z-stat=-3.7611557434761576
F-test (ANOVA) p-values: [0.75331897 0.4651729 ]
Chi-Square Test: chi2-stat=[0.34808035], p-value=[0.55520182]


In [2]:
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.feature_selection import chi2, f_classif
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# **1. Generate a Sample Dataset**
np.random.seed(42)
data = {
    'Age': np.random.randint(20, 60, 200),  # Continuous variable
    'BP': np.random.normal(120, 15, 200),   # Continuous variable
    'Cholesterol': np.random.normal(200, 50, 200),  # Continuous variable
    'Diabetes': np.random.randint(0, 2, 200),  # Binary categorical variable
    'Exercise': np.random.randint(0, 2, 200),  # Binary categorical variable
    'HeartDisease': np.random.randint(0, 2, 200)  # Binary Target Variable
}

df = pd.DataFrame(data)

# **2. Split into Features & Target**
X = df.drop(columns=['HeartDisease'])
y = df['HeartDisease']

# **3. Apply Statistical Tests for Feature Selection**

# **T-Test (Comparing mean Age between the two HeartDisease classes)**
t_stat, p_ttest = stats.ttest_ind(df['Age'][y==0], df['Age'][y==1])
print(f"T-test (Age vs HeartDisease): p-value = {p_ttest}")

# **F-test (ANOVA) for Continuous Features**
# With only two classes the F-test is equivalent to the t-test,
# so the p-value for Age matches the one above.
f_stat, p_anova = f_classif(X[['Age', 'BP', 'Cholesterol']], y)
print(f"F-test (ANOVA) p-values: {p_anova}")

# **Chi-Square Test for Categorical Features**
chi2_stat, p_chi2 = chi2(X[['Diabetes', 'Exercise']], y)
print(f"Chi-Square Test p-values: {p_chi2}")

# **Z-test for Large Samples (Comparing Means)**
# Note: this compares the means of two different features, so it says nothing
# about their relationship with the target; it only shows the mechanics.
z_stat = (df['BP'].mean() - df['Cholesterol'].mean()) / np.sqrt(df['BP'].var()/len(df) + df['Cholesterol'].var()/len(df))
print(f"Z-test (BP vs Cholesterol): z-stat = {z_stat}")

# **4. Select Features**
# All columns are random noise, so no p-value above falls below 0.05.
# The list below is illustrative only, not a selection based on significance.
selected_features = ['BP', 'Cholesterol', 'Diabetes']
X_selected = X[selected_features]

# **5. Split Data into Train and Test Sets**
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.2, random_state=42)

# **6. Standardize Data (Important for Logistic Regression)**
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# **7. Train a Logistic Regression Model**
model = LogisticRegression()
model.fit(X_train, y_train)

# **8. Make Predictions and Evaluate Accuracy**
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy with Selected Features: {accuracy:.2f}")

T-test (Age vs HeartDisease): p-value = 0.48331879243604925
F-test (ANOVA) p-values: [0.48331879 0.70551203 0.85151539]
Chi-Square Test p-values: [0.69271039 0.81865944]
Z-test (BP vs Cholesterol): z-stat = -21.343818244707247
Model Accuracy with Selected Features: 0.38
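
The p-values printed above can be checked against the 0.05 threshold directly instead of hard-coding the feature list. The sketch below copies the printed p-values into a dictionary (F-test values for `Age`, `BP`, `Cholesterol`; chi-square values for `Diabetes`, `Exercise`). The names `ALPHA`, `p_values` and `significant` are illustrative.

```python
ALPHA = 0.05  # significance threshold used in step 4

# p-values copied from the printed output above
p_values = {
    "Age": 0.48331879,
    "BP": 0.70551203,
    "Cholesterol": 0.85151539,
    "Diabetes": 0.69271039,
    "Exercise": 0.81865944,
}

# Keep only the features whose p-value is below the threshold
significant = [name for name, p in p_values.items() if p < ALPHA]
print(significant)  # prints []
```

No feature passes the threshold, which is expected because every column, including `HeartDisease`, is generated independently at random. The accuracy of 0.38 is therefore no better than guessing, and the pipeline demonstrates the workflow rather than a real result.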
