
In [3]:
library(boot)

### Random Sampling and Sample Bias

##### **1. Sample**
#####A **sample** is a subset of units selected from the population for analysis. The goal is for the sample to be representative of the population, enabling valid inferences.

##### **2. Population**
#####The **population** refers to the entire set of individuals, items, or observations that are the subject of a study. It is the broader group from which the sample is drawn.

##### **3. Population Size ($N$) and Sample Size ($n$)**
- $N$: The total number of units in the population.  
- $n$: The total number of units in the sample.

##### **4. Random Sampling**
#####**Random sampling** ensures that every unit in the population has an equal chance of being included in the sample. This method minimizes bias and enhances the representativeness of the sample.

#####Key features of random sampling:
- Equal probability of selection for all units.
- Reduces the risk of systematic bias.
- Results can be generalized to the population.

##### **5. Stratified Sampling**
#####**Stratified sampling** involves dividing the population into subgroups (strata) based on specific characteristics, then randomly sampling from each stratum. It ensures representation across key subgroups.

##### **6. Strata**
#####A **stratum** (plural: strata) is a subgroup of the population that shares a common characteristic, such as age, income level, or geographical region. Stratification improves the precision of sample estimates.

##### **7. Random Sampling Within Strata**
#####In **stratified sampling**, random samples are drawn from each stratum. This can be proportional to the stratum’s size in the population or equal across strata, depending on the research objective.
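
As a rough illustration of the difference between simple random sampling and proportional stratified sampling, the sketch below uses a made-up population of 1,000 units. The `population` data frame, the `region` column, and the 10% sampling fraction are all assumptions chosen for the example.

```r
set.seed(42)

# Hypothetical population: 1,000 units in three regions of unequal size
population <- data.frame(
  id = 1:1000,
  region = rep(c("north", "south", "west"), times = c(500, 300, 200))
)

# Simple random sample: every unit has the same chance of selection
srs <- population[sample(nrow(population), size = 100), ]

# Proportional stratified sample: draw 10% at random within each stratum
strata <- split(population, population$region)
stratified <- do.call(rbind, lapply(strata, function(s) {
  s[sample(nrow(s), size = round(0.10 * nrow(s))), , drop = FALSE]
}))

table(srs$region)         # counts vary from draw to draw
table(stratified$region)  # exactly 50 north, 30 south, 20 west
```

The stratified sample fixes each region's share by design, while the simple random sample only matches those shares on average.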

##### **8. Bias**
#####**Bias** refers to a systematic error that causes the sample to misrepresent the population. Common sources of bias include:
- Non-random sampling methods.
- Exclusion of specific groups.
- Inconsistent data collection techniques.

##### **9. Sample Bias**
#####**Sample bias** occurs when the sample is not representative of the population, leading to distorted results. Examples include:
- **Selection bias**: Overrepresentation or underrepresentation of certain groups.
- **Non-response bias**: Results skewed due to a lack of participation from certain individuals.

### Selection Bias

#####Selection bias occurs when the sample data used in an analysis is not representative of the target population, leading to skewed or incorrect conclusions. This bias can arise from the way data is collected, the choice of sample, or the conditions under which the sample is observed.

##### Important Concepts:

1. **Selection Bias:**
   - This is the distortion of a statistical analysis resulting from the method of selecting participants or data. It leads to non-random selection of data points, causing the sample to differ from the general population.


2. **Data Hunting (Data Snooping):**
   - Data hunting, also called data snooping, is an extensive search through a dataset for patterns or insights without a hypothesis fixed in advance. Patterns found this way often reflect noise rather than real structure, which leads to overfitting and misinterpreted findings.
   - It is especially problematic when combined with confirmation bias: the researcher keeps the patterns that support preconceived notions and disregards the rest.

3. **The Effect of Search Through Multiple Hypotheses:**
   - This effect occurs when a researcher tests multiple hypotheses or multiple models on the same dataset. The more hypotheses tested, the higher the likelihood of finding a spurious relationship, simply by chance.
   - Example: With 10 independent tests at the 5% significance level and no true effect in any of them, the probability that at least one appears statistically significant is $1 - 0.95^{10} \approx 0.40$.

4. **Regression to the Mean:**
   - This phenomenon occurs when an extreme measurement is followed by one that is closer to the average. Extreme outcomes are partly the product of random variation, and that part is unlikely to repeat.
   - Example: A student who scores unusually high on one test is likely to score closer to the average on the next, because part of the first result was due to random factors.
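
Items 3 and 4 can each be checked numerically. The sketch below first computes the chance of at least one false positive among `k` independent tests at level `alpha`, then simulates two test scores that share a hypothetical "true ability" plus independent noise. The score model and all its parameters are assumptions made for illustration.

```r
set.seed(7)

# Item 3: chance of at least one false positive among k independent tests
alpha <- 0.05
k <- c(1, 5, 10, 20)
p_any <- 1 - (1 - alpha)^k
round(p_any, 3)   # 0.050 0.226 0.401 0.642

# Item 4: regression to the mean with two noisy tests of the same ability
ability <- rnorm(10000, mean = 70, sd = 10)   # hypothetical true ability
test1 <- ability + rnorm(10000, sd = 10)      # ability plus luck
test2 <- ability + rnorm(10000, sd = 10)      # same ability, fresh luck
top <- test1 >= quantile(test1, 0.95)         # top 5% on the first test

mean(test1[top])   # far above the overall average
mean(test2[top])   # still above average, but closer to it
```

The top scorers on the first test remain above average on the second, but by a smaller margin, because the luck that helped put them in the top group does not carry over.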


### Sampling Distribution for Statistics

#####The sampling distribution is a probability distribution of a given statistic (such as the mean or variance) based on repeated sampling from a population. Understanding the sampling distribution is key to making inferences about the population from sample data.

##### Important Concepts:

1. **Statistic for the Sample:**
   - A statistic is a numerical value calculated from a sample of data. Common statistics include the sample mean, sample variance, and sample proportion.
   - Example: If you take a sample of 50 students' test scores and calculate the average, this average is a statistic for the sample.

2. **Data Distribution:**
   - The data distribution describes how the values of a dataset are spread or dispersed. It includes features like the central tendency, dispersion, and shape of the data.
   - Example: A normal distribution has a bell-shaped curve, with most of the data points clustered around the mean.

3. **Sampling Distribution:**
   - The sampling distribution refers to the distribution of a statistic (such as the sample mean) that is obtained through repeated sampling from a population.
   - When we take many random samples from a population and compute a statistic for each sample, the distribution of these statistics is known as the sampling distribution.
   - Example: If we repeatedly take samples of 30 students' test scores and calculate the mean for each sample, the distribution of these means is the sampling distribution.

4. **Central Limit Theorem:**
   - The Central Limit Theorem (CLT) states that for sufficiently large sample sizes, the sampling distribution of the sample mean will be approximately normal, regardless of the population distribution, as long as the population has finite variance.
   - This is a powerful result because it allows us to apply inferential statistics (such as confidence intervals and hypothesis tests) even if the underlying population distribution is not normal.
   - Example: If you repeatedly draw samples of 30 students from a population of students and calculate the mean score of each sample, the distribution of these means will be approximately normal even if the individual scores are not normally distributed.

5. **Standard Error:**
   - The standard error is the standard deviation of the sampling distribution of a statistic, typically the sample mean. It measures how much the sample statistic is expected to vary from the true population parameter.
   - The formula for the standard error of the sample mean is:

$$ SE = \frac{\sigma}{\sqrt{n}} $$  

   - where $\sigma$ is the population standard deviation and $n$ is the sample size. In practice $\sigma$ is usually unknown and is replaced by the sample standard deviation $s$.
   - Example: A larger sample size leads to a smaller standard error, meaning that the sample mean is likely to be closer to the population mean.
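
The Central Limit Theorem and the standard error formula can be checked with a small simulation. This is a minimal sketch that assumes a deliberately skewed population (exponential with mean 1 and standard deviation 1); the population, the sample size of 30, and the 10,000 repetitions are illustrative choices.

```r
set.seed(1)

n <- 30   # size of each sample

# Draw 10,000 samples from a skewed population and keep each sample mean
sample_means <- replicate(10000, mean(rexp(n, rate = 1)))

mean(sample_means)   # close to the population mean, 1
sd(sample_means)     # empirical standard error
1 / sqrt(n)          # theoretical SE = sigma / sqrt(n), about 0.183

# Share of sample means within 1.96 SE of the population mean;
# roughly 0.95 if the normal approximation is reasonable
mean(abs(sample_means - 1) <= 1.96 / sqrt(n))

# hist(sample_means, breaks = 50)  # uncomment to view the shape
```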



### Bootstrap Sampling


#####Bootstrap sampling is a statistical method used to estimate the sampling distribution of a statistic by resampling with replacement from the observed data. This technique is particularly useful when the theoretical distribution of a statistic is complex or unknown.

##### Key Concepts:

1. **Bootstrap Sample:**
   - A bootstrap sample is a sample drawn at random, with replacement, from the original dataset.
   - Each bootstrap sample has the same size as the original dataset, but some observations may appear more than once, while others may not appear at all.
   - Example: If your original dataset has 5 observations $\{x_1, x_2, x_3, x_4, x_5\}$, a bootstrap sample might look like $\{x_1, x_2, x_2, x_4, x_5\}$.

2. **Resampling:**
   - Resampling refers to the process of repeatedly drawing samples from a dataset to compute a statistic multiple times.
   - Bootstrap resampling allows for estimating the variability of a statistic, constructing confidence intervals, or performing hypothesis testing without relying on strong parametric assumptions.
   - Example: Generate 1000 bootstrap samples from a dataset to calculate the sampling distribution of the mean.

##### Steps for Bootstrap Sampling:

1. Start with an original dataset of size $n$.
2. Randomly draw $n$ observations from the dataset with replacement to create a bootstrap sample.
3. Compute the statistic of interest (e.g., mean, median) for the bootstrap sample.
4. Repeat steps 2 and 3 many times (e.g., 1000 iterations) to generate the bootstrap distribution of the statistic.
5. Use the bootstrap distribution to calculate standard errors, confidence intervals, or other inferential measures.

##### Example:
Suppose you have a dataset of exam scores: $\{78, 85, 92, 88, 76\}$.  
1. Generate a bootstrap sample: $\{78, 85, 85, 92, 76\}$.
2. Compute the mean of the bootstrap sample: $\bar{x}_{\text{bootstrap}} = 83.2$.
3. Repeat this process 1000 times to create the bootstrap distribution of the mean.
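
The same steps can be written in base R without any package. This is a minimal sketch using the exam scores from the example; the seed and the choice of a percentile interval at the end are assumptions, and the `boot` package used in the exercises handles the resampling loop for you.

```r
set.seed(123)

scores <- c(78, 85, 92, 88, 76)
n <- length(scores)

# Step 2: one bootstrap sample is n draws with replacement
one_sample <- sample(scores, size = n, replace = TRUE)

# The specific bootstrap sample quoted in the example
mean(c(78, 85, 85, 92, 76))   # 83.2

# Steps 3 and 4: repeat 1000 times, keeping the mean of each resample
boot_means <- replicate(1000, mean(sample(scores, size = n, replace = TRUE)))

# Step 5: summarise the bootstrap distribution
sd(boot_means)                          # bootstrap standard error
quantile(boot_means, c(0.025, 0.975))   # simple percentile interval
```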

##### Key Takeaways:
- A **bootstrap sample** is created by resampling with replacement from the original data.
- **Resampling** allows repeated estimation of statistics to assess their variability and build confidence intervals.
- The **bootstrap** is a powerful tool for non-parametric statistical inference.


#### Exercise 2.1

##### Write a program that performs the following tasks:  
1. Load the `loans_income.csv` dataset from the given GitHub URL.  
2. Display the first few rows of the dataset to understand its structure.  
3. Define a custom statistic function to calculate the median of a sample.  
4. Use the **bootstrap method** to resample the `x` variable from the dataset 1000 times and estimate the median for each resample.  
5. Print the bootstrap results, including the original median, bias, and standard error.  


In [1]:
# Define the URL of the CSV file on GitHub
url <- "https://raw.githubusercontent.com/davidofitaly/notes_02_50_key_stats_ds/main/02_chapter/files/loans_income.csv"

# Load the data from the CSV file into the variable 'data_state'
data_state <- read.csv(url)

# Display the first few rows of the loaded dataset
head(data_state)

| | x (`<int>`) |
|---|---|
| 1 | 67000 |
| 2 | 52000 |
| 3 | 100000 |
| 4 | 78762 |
| 5 | 37041 |
| 6 | 33000 |


In [9]:
# Define a function to calculate the median for bootstrap samples
stat_fun_median <- function(x, idx) {median(x[idx])}

# Perform bootstrap resampling with 1000 iterations
boot_obj_median <- boot(data_state$x, R = 1000, statistic = stat_fun_median)

# Print the bootstrap results
print(boot_obj_median)


ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = data_state$x, statistic = stat_fun_median, R = 1000)


Bootstrap Statistics :
    original   bias    std. error
t1*    62000 -65.7235    204.4154


#### Exercise 2.2

Load the `loans_income.csv` dataset and perform bootstrap resampling to calculate the mean of the data.

1. Use bootstrap resampling with 1000 iterations.
2. Calculate the mean of each resample.
3. Display the bootstrap results including the original mean, bias, and standard error.

In [10]:
# Define a function to calculate the mean for bootstrap samples
stat_fun_mean <- function(x, idx) {mean(x[idx])}

# Perform bootstrap resampling with 1000 iterations
boot_obj_mean <- boot(data_state$x, R=1000, statistic=stat_fun_mean)

# Print the bootstrap results
print(boot_obj_mean)


ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = data_state$x, statistic = stat_fun_mean, R = 1000)


Bootstrap Statistics :
    original   bias    std. error
t1* 68760.52 6.022033    151.9284


### Confidence Intervals

#####A **Confidence Interval (CI)** is a range of values used to estimate the true value of a population parameter, based on sample data. It provides a measure of uncertainty about the estimate and tells us how confident we can be that the interval contains the true parameter value.

##### **Key Elements of Confidence Intervals**:
1. **Confidence Level**:
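   - The **confidence level** (e.g., 90%, 95%, 99%) is the long-run proportion of intervals, built by the same procedure from repeated samples, that contain the true population parameter. It is not the probability that one particular computed interval contains the parameter.
   - For example, at a 95% confidence level we expect about 95% of the intervals computed from repeated samples to contain the true parameter value.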

2. **Upper and Lower Bounds**:
   - The **lower bound** is the smallest value in the interval.
   - The **upper bound** is the largest value in the interval.

##### **Formula for Confidence Interval**:
For a population mean, the confidence interval is calculated as:
$$
CI = \hat{\mu} \pm Z \cdot \frac{\sigma}{\sqrt{n}}
$$
Where:
- $ \hat{\mu} $ is the sample mean,
- $ Z $ is the Z-score corresponding to the desired confidence level (e.g., 1.96 for 95% confidence),
- $\sigma$ is the population standard deviation (or its estimate from the sample; with a small sample and an estimated $\sigma$, a $t$ quantile is commonly used in place of $Z$),
- $ n $ is the sample size.

##### **Interpretation of Confidence Intervals**:
- A **95% confidence interval** means that if we were to take many samples and compute a CI for each, 95% of those intervals would contain the true population parameter.
- A **wider interval** indicates less precision in the estimate, while a **narrower interval** indicates a more precise estimate.
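
The repeated-sampling interpretation can be checked by simulation. The sketch below assumes a hypothetical normal population with known mean and standard deviation, builds a 95% interval from each of 10,000 samples, and counts how often the interval covers the true mean. All parameter values are illustrative.

```r
set.seed(2024)

mu <- 170      # true population mean (assumed known for the simulation)
sigma <- 10    # true population standard deviation
n <- 100       # sample size
z <- qnorm(0.975)

covers <- replicate(10000, {
  x <- rnorm(n, mean = mu, sd = sigma)
  half_width <- z * sigma / sqrt(n)
  abs(mean(x) - mu) <= half_width   # TRUE if the interval contains mu
})

mean(covers)   # close to 0.95
```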

##### **Example**:
Suppose we compute the sample mean height of 100 individuals to be 170 cm with a standard deviation of 10 cm. For a 95% confidence interval, the Z-score is 1.96. The confidence interval for the mean height is calculated as:
$$
CI = 170 \pm 1.96 \cdot \frac{10}{\sqrt{100}} = 170 \pm 1.96 \cdot 1 = [168.04, 171.96]
$$
This means we are 95% confident that the true average height of the population lies between 168.04 cm and 171.96 cm.
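
The arithmetic above can be reproduced in R. In this short sketch, `qnorm(0.975)` supplies the Z-score that the text rounds to 1.96; the variable names are chosen for the example.

```r
x_bar <- 170   # sample mean (cm)
s <- 10        # standard deviation (cm)
n <- 100       # sample size

z <- qnorm(0.975)                         # about 1.96
ci <- x_bar + c(-1, 1) * z * s / sqrt(n)  # lower and upper bound
round(ci, 2)                              # 168.04 171.96
```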



#### Exercise 2.3

#####In this task, you are required to calculate the **95% confidence interval** for the **mean** of a dataset. The dataset contains the following income values (in USD): $[45000, 52000, 67000, 68000, 75000, 90000, 82000, 67000, 88000, 62000]$

1. Load the dataset.
2. Calculate the sample mean and sample standard deviation.
3. Calculate the standard error of the mean, $s / \sqrt{n}$, and take the Z-score for a 95% confidence level ($Z = 1.96$).
4. Compute the confidence interval using the formula:
   $$ \text{Confidence Interval} = \text{mean} \pm Z \times \text{standard error} $$

In [13]:
# Dataset (income values)
data <- c(45000,52000,67000,68000,75000,90000,82000,67000,88000,62000)

# Number of elements in the sample
n <- length(data)

# Sample mean
mean_data <- mean(data)

# Sample standard deviation
sd_data <- sd(data)

# Z-score for 95% confidence level
z_score <- 1.96

# Standard error calculation
se <- sd_data / sqrt(n)

# Calculate confidence interval
lower_bound <- mean_data - z_score * se
upper_bound <- mean_data + z_score * se

# Print the result
cat("95% Confidence Interval: [", round(lower_bound, 2), ", ", round(upper_bound, 2), "]\n")


95% Confidence Interval: [ 60532.07 ,  78667.93 ]
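
With only 10 observations and a standard deviation estimated from the sample, a common alternative is to replace the Z-score with a quantile of the $t$ distribution with $n - 1$ degrees of freedom. This sketch is a variation on the exercise, not part of its stated solution; it gives a somewhat wider interval and matches what base R's `t.test()` reports.

```r
data <- c(45000, 52000, 67000, 68000, 75000, 90000, 82000, 67000, 88000, 62000)
n <- length(data)
se <- sd(data) / sqrt(n)

# t quantile for 95% confidence with n - 1 = 9 degrees of freedom
t_crit <- qt(0.975, df = n - 1)   # about 2.262, versus 1.96 for Z

ci_t <- mean(data) + c(-1, 1) * t_crit * se
ci_t

# The same interval as computed by t.test()
t.test(data)$conf.int
```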
