# Week 11

## Questions for this week

In [1]:
import numpy as np
import scipy.stats as stats

# Statistical Analysis of a Random Sample

## 🎲 Selecting a Random Sample
**Random Seed:** `12`  
**Sample Size:** `40`


In [2]:
# Set random seed
np.random.seed(12)

In [5]:
# Generate a population (assuming normal distribution)
population = np.random.normal(loc=50, scale=15, size=10000)

# Select a random sample of 40 individuals
sample = np.random.choice(population, size=40, replace=False)
sample

array([52.72451271, 78.47412225, 69.22624098, 48.57328095, 58.86986555,
       69.50223134, 32.49030412, 47.57876149, 36.35072556, 44.981589  ,
       34.65569003, 71.43387533, 33.83252514, 67.34596896, 45.84041446,
       46.98703666, 55.79642113, 53.69672801, 51.73804517, 33.67577262,
       47.93904604, 55.19051765, 45.05322819, 60.53931261, 31.42706035,
       52.87131453, 54.687056  , 48.60288453, 68.87699443, 45.22913029,
       44.74025133, 39.84083702, 93.25538353, 30.76313021, 35.96244765,
       48.49242437, 62.83524774, 42.93536615, 52.70921068, 64.37815737])

## 📊 Sample Statistics
- **Sample Mean:** <span style="color:blue;">`sample_mean`</span>  
- **Sample Standard Deviation:** <span style="color:green;">`sample_std`</span>  
- **Standard Error:** <span style="color:cyan;">`standard_error`</span>  


In [8]:
# Calculate the sample mean
sample_mean = np.mean(sample)
sample_mean



np.float64(51.50257780335321)

In [9]:
# Calculate the standard deviation of the sample
sample_std = np.std(sample, ddof=1)
sample_std


np.float64(13.964941383741639)

In [10]:

# Calculate the standard error of the sample
standard_error = sample_std / np.sqrt(len(sample))
standard_error

np.float64(2.208051108168354)
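
As a small cross-check of the two cells above, the sketch below compares `ddof=0` with `ddof=1` and computes the standard error both by hand and with `scipy.stats.sem`. The `data` array is made up for illustration and is not the notebook's sample.

```python
import numpy as np
from scipy import stats

# Hypothetical data, used only for illustration (not the notebook's sample)
data = np.array([52.0, 48.5, 61.2, 39.9, 55.3, 47.1, 50.8, 58.4])
n = len(data)

# ddof=1 divides by n - 1 (sample standard deviation);
# ddof=0 (NumPy's default) divides by n (population standard deviation)
sd_sample = np.std(data, ddof=1)
sd_population = np.std(data)

# Standard error of the mean: s / sqrt(n)
se_manual = sd_sample / np.sqrt(n)

# scipy.stats.sem uses ddof=1 by default, so it should agree with se_manual
se_scipy = stats.sem(data)

print(sd_population, sd_sample)
print(se_manual, se_scipy)
```

The sample standard deviation is always a little larger than the `ddof=0` version, and the gap shrinks as `n` grows.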


## 📉 Confidence Interval (95%)
- **Confidence Coefficient (t-value):** <span style="color:purple;">`c`</span>  
- **Lower Bound:** <span style="color:red;">`lower_bound`</span>  
- **Upper Bound:** <span style="color:red;">`upper_bound`</span>  

In [13]:
# Calculate the confidence coefficient for a 95% confidence level
c = stats.t.ppf(0.975, df=len(sample)-1)
c

np.float64(2.0226909200367604)

In [15]:

# Calculate the lower bound of the confidence interval
lower_bound = sample_mean - (c * standard_error)
lower_bound


np.float64(47.03637287588398)

In [16]:
# Calculate the upper bound of the confidence interval
upper_bound = sample_mean + (c * standard_error)
upper_bound

np.float64(55.96878273082245)
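
The three cells above build the interval step by step. As a hedged sketch, the same interval can be obtained in one call with `stats.t.interval`; the `data` array here is hypothetical and only stands in for a sample.

```python
import numpy as np
from scipy import stats

# Hypothetical data, used only for illustration (not the notebook's sample)
data = np.array([52.0, 48.5, 61.2, 39.9, 55.3, 47.1, 50.8, 58.4])
n = len(data)
mean = data.mean()
se = stats.sem(data)  # s / sqrt(n) with ddof=1

# Manual 95% interval: mean +/- t * SE, with n - 1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n - 1)
manual = (mean - t_crit * se, mean + t_crit * se)

# Same interval in one call; 0.95 is passed positionally as the confidence level
builtin = stats.t.interval(0.95, df=n - 1, loc=mean, scale=se)

print(manual)
print(builtin)
```

Note that 0.975 (not 0.95) goes into `ppf` because a two-sided 95% interval leaves 2.5% in each tail.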

## 🔄 Bootstrap Analysis
- **Bootstrapped Sample Means:** `1000` values  
- **2.5th Percentile:** <span style="color:orange;">`percentile_2_5`</span>  
- **97.5th Percentile:** <span style="color:orange;">`percentile_97_5`</span>  


In [19]:
# Bootstrap sampling
bootstrapped_means = np.array([np.mean(np.random.choice(sample, size=40, replace=True)) for _ in range(1000)])
bootstrapped_means.sort()

In [20]:
# Calculate the 2.5th percentile
percentile_2_5 = np.percentile(bootstrapped_means, 2.5)
percentile_2_5

np.float64(47.57574260536952)

In [22]:
# Calculate the 97.5th percentile
percentile_97_5 = np.percentile(bootstrapped_means, 97.5)
percentile_97_5

np.float64(56.078772581881545)
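
A vectorized variant of the bootstrap above, sketched with NumPy's `default_rng` generator. The sample is regenerated here as a stand-in, so the numbers will not match the outputs above.

```python
import numpy as np

rng = np.random.default_rng(12)

# Stand-in sample of 40 values (assumption: not the notebook's exact sample)
data = rng.normal(loc=50, scale=15, size=40)

n_boot = 1000

# Each row is one resample of the data, drawn with replacement
resamples = rng.choice(data, size=(n_boot, len(data)), replace=True)
boot_means = resamples.mean(axis=1)

# Both percentiles in a single call give the 95% percentile interval
lower, upper = np.percentile(boot_means, [2.5, 97.5])

print(lower, data.mean(), upper)
```

The percentile interval should land close to the t-based interval when the sample is roughly symmetric, which is what the notebook's two results show.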

## Project Guidelines
- <font color="red">Introduction</font> (what the project is about; this is the only background information he will have)
- <font color="red">Methods</font> section (say where the data came from, describe the dataset itself, and outline any statistical analysis you use, e.g. "I am going to use a t-test with a significance level of 0.05")
- <font color="red">Results</font> (no discussion in this part; only list what you found, with a summary of what is to come; tables are nice to have)
    - this section is optional, but including it makes the report read more like a paper
    - if it is left out, put the results in the discussion
- <font color="red">Discussion</font> section
- <font color="red">Reference</font> section at the bottom (don't go crazy)

# Start of Week 11 Stats

### Levels of measurement
- Know your data type
    - Categorical: nominal (no natural order), ordinal (natural order)
    - Numerical: interval (no true zero, like temperature in °C), ratio (true zero, like weight)

- Random variable: a function that assigns a number to the outcome of some action
    - typical variable names are X, Y, K...
    - two types of random variables
        - discrete: can take on a countable number of values (for example, whole-number counts)
        - continuous: can take on an uncountable number of values (any real number in an interval)

- Probabilities
    - for a continuous variable you can only talk about probabilities of intervals, like P(a<X<b); for a discrete variable you can talk about exact values, like P(K=k), e.g. that exactly 5 people out of 30 fit a condition (a count cannot be 7.5 people)

- The appropriate statistical test depends on the type of variable
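
A minimal sketch of the two kinds of probability statement using `scipy.stats`. The success probability of 0.2 and the Normal(50, 15) distribution are assumptions chosen only for illustration.

```python
from scipy import stats

# Discrete: K = number of people (out of 30) who fit a condition.
# Assumption: each person fits it independently with probability 0.2.
p_exactly_five = stats.binom.pmf(5, n=30, p=0.2)

# Continuous: X ~ Normal(mean=50, sd=15) (assumed distribution).
# P(X = x) is zero for any single x, so probabilities come from intervals.
X = stats.norm(loc=50, scale=15)
p_interval = X.cdf(65) - X.cdf(35)  # P(35 < X < 65)

print(p_exactly_five)
print(p_interval)
```

The discrete case uses a PMF evaluated at one value, while the continuous case subtracts two CDF values to get the probability of an interval.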

### <font color="violet">Fun Note</font>
- vibe coding: creating small games and selling them for a lot of money is happening a lot right now; people are not even writing the code themselves but using AI to generate it


In [1]:
import numpy
from plotly import graph_objects, figure_factory
import pandas
from scipy import stats

In [2]:
# Seed the random number generator with a known value
numpy.random.seed(28)

# Create an empty list named "results" to store the results
results = []

# Create a for loop to roll the dice 1000 times
for _ in range(1000):
    results.append(numpy.random.randint(1, 7) + numpy.random.randint(1, 7))

In [3]:
# Convert the list of results to a pandas DataFrame
results = pandas.DataFrame(results, columns=['Value'])

In [4]:
# Create a DataFrame object from the value counts and sort the index
value_counts = pandas.DataFrame(results['Value'].value_counts()).sort_index()

# Print the DataFrame to the screen
print(value_counts)

       count
Value       
2         24
3         39
4         93
5        108
6        121
7        183
8        152
9        105
10       103
11        53
12        19


In [5]:
# Create a bar chart of the value counts
fig = graph_objects.Figure(
    graph_objects.Bar(
        x=value_counts.index.to_list(),
        y=value_counts.values[:, 0],
        text=value_counts.values[:, 0],
        textposition='auto'
    )
)

# Update the layout to improve readability
fig.update_layout(
    title='Value Counts for 1000 Dice Rolls',
    xaxis_title='Dice Value',
    yaxis_title='Count',
    width=700,
)

# Display the plot
fig.show()

In [6]:
# Empirical probability mass function of the two-dice sum distribution
value_counts / 1000

       count
Value       
2      0.024
3      0.039
4      0.093
5      0.108
6      0.121
7      0.183
8      0.152
9      0.105
10     0.103
11     0.053
12     0.019


In [7]:
# Calculate the empirical probability of rolling a 7 (look up the row by its label, not its position)
print('Probability of rolling a 7:', value_counts.loc[7].iloc[0] / 1000)

Probability of rolling a 7: 0.183


In [8]:
# Create a bar chart of the empirical probabilities
fig = graph_objects.Figure(
    graph_objects.Bar(
        x=value_counts.index.to_list(),
        y=value_counts.values[:, 0] / 1000
    )
)

# Update the layout to improve readability
fig.update_layout(
    title='Visualizing the Probabilities',
    xaxis_title='Dice Value',
    yaxis_title='Probability (proportion)',
    width=700
)

# Display the plot
fig.show()

### Probability Mass Function (PMF)
- a function that gives the probability that a discrete random variable is exactly equal to some value
- P(X=x)
- the theoretical PMF describes the ideal population (every possible outcome), not what happened to occur in one sample
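
As a sketch, the theoretical PMF for the sum of two fair dice can be built by listing all 36 equally likely outcomes. For comparison, the simulation above gave 0.183 for a sum of 7.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 36 equally likely (die1, die2) pairs give each sum
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Theoretical PMF: P(X = x) = (number of pairs summing to x) / 36
pmf = {total: Fraction(count, 36) for total, count in sorted(counts.items())}

for total, prob in pmf.items():
    print(total, prob, round(float(prob), 3))
```

Using `Fraction` keeps the probabilities exact, so they sum to exactly 1.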

In [9]:
# Create a bar chart of the theoretical PMF for the sum of two dice
fig = graph_objects.Figure(
    graph_objects.Bar(
        x=numpy.arange(2,13),
        y=numpy.array([1,2,3,4,5,6,5,4,3,2,1]) / 36
    )
)

# Update the layout to improve readability
fig.update_layout(
    title='Visualizing the PMF',
    xaxis_title='Dice Value',
    yaxis_title='Probability (proportion)',
    width=700
)

# Display the plot
fig.show()

In [10]:
# Generate bar colors
colors = ['lightslategray',] * 11
colors[8] = 'crimson'
colors[9] = 'crimson'
colors[10] = 'crimson'


# Create a bar chart of the theoretical PMF, highlighting sums of 10 or more
fig = graph_objects.Figure(
    graph_objects.Bar(
        x=numpy.arange(2,13),
        y=numpy.array([1,2,3,4,5,6,5,4,3,2,1]) / 36,
        marker_color=colors
    )
)

# Update the layout to improve readability
fig.update_layout(
    title='Probability of rolling a 10 or more (red)',
    xaxis_title='Dice Value',
    yaxis_title='Probability (proportion)',
    width=700
)

# Display the plot
fig.show()
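
The red bars can be added up directly. A short sketch using the same theoretical PMF as the plot; the simulated counts earlier give (103 + 53 + 19) / 1000 = 0.175 for comparison.

```python
import numpy as np

# Possible sums of two dice and their theoretical probabilities
totals = np.arange(2, 13)
pmf = np.array([1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1]) / 36

# P(X >= 10) = P(10) + P(11) + P(12)
p_ten_or_more = pmf[totals >= 10].sum()

print(p_ten_or_more)
```

For a discrete variable, the probability of a range of values is just the sum of the PMF over that range.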

### Continuous random variables


- Probability Density Function (PDF)
    - The PDF is denoted by f(x). For real numbers a and b, the probability P(a≤X≤b) is the area under the PDF curve between a and b.
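
A minimal sketch of "probability as area under the PDF", assuming a standard normal distribution and the interval from -1 to 1 (both chosen only for illustration). It integrates the PDF numerically and compares the result with a difference of CDF values.

```python
from scipy import integrate, stats

a, b = -1.0, 1.0

# Area under the standard normal PDF between a and b (numerical integration)
area, _abs_error = integrate.quad(stats.norm.pdf, a, b)

# The same probability from the cumulative distribution function
via_cdf = stats.norm.cdf(b) - stats.norm.cdf(a)

print(area, via_cdf)
```

Both routes give P(a≤X≤b); in practice the CDF is the usual shortcut.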

In [11]:
# Set the numpy pseudo-random number generator seed
numpy.random.seed(28)

# Create an array of 10000 values from a normal distribution with a mean of 100 and a standard deviation of 10 and assign it to the variable values
values = numpy.random.normal(100, 10, 10000)

# Create a plotly create_distplot figure of the data in values
fig = figure_factory.create_distplot(
    [values], ['Values'],
    show_hist=False,
    histnorm='probability density',
    curve_type='normal',
    show_rug=False
)

fig.update_layout(
    title='Probability density function for normal distribution',
    xaxis_title='Variable value',
    yaxis_title='Probability density',
    width=700
)

fig.show()

In [12]:
# Calculate the probability of a value less than -1 for the standard normal distribution
print('Probability of a value less than -1:', stats.norm.cdf(-1))

Probability of a value less than -1: 0.15865525393145707


In [13]:
# Seed the numpy pseudo-random number generator
numpy.random.seed(28)

# Create an array of 1000 values using a continuous uniform distribution between 50 and 100 and assign it to the variable population
population = numpy.random.uniform(50, 100, 1000)

# Create a plotly create_distplot figure of the data in population
fig = figure_factory.create_distplot(
    [population], ['Population'],
    show_hist=True,
    show_rug=False,
    show_curve=False
)

fig.update_layout(
    title='Histogram of population values',
    xaxis_title='Variable value',
    yaxis_title='Probability density',
    width=700
)

fig.show()

In [14]:
# Calculate the population mean of the population and assign it to the variable mu
mu = population.mean()
mu

np.float64(74.8702856608023)

In [15]:
# Calculate the population standard deviation of the population and assign it to the variable sigma
sigma = population.std()
sigma

np.float64(14.807689712963851)
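
For reference, a sketch of the theoretical mean and standard deviation of a continuous uniform distribution on [50, 100], which the simulated values above should be close to. The formulas (a + b) / 2 and (b - a) / sqrt(12) are the standard results for a uniform distribution.

```python
import numpy as np
from scipy import stats

a, b = 50, 100

# scipy parameterizes the uniform distribution as [loc, loc + scale]
theoretical = stats.uniform(loc=a, scale=b - a)

# Closed-form results for a continuous uniform distribution
mean_formula = (a + b) / 2
sd_formula = (b - a) / np.sqrt(12)

print(theoretical.mean(), mean_formula)
print(theoretical.std(), sd_formula)
```

Note that `population.std()` uses `ddof=0` by default, which is the right choice when the array is treated as the whole population.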

In [16]:
# Add a vertical line to the plot at the mean and dashed vertical lines one standard deviation either side
fig.add_vline(x=mu, line_width=2, line_color='green')
fig.add_vline(x=mu - sigma, line_width=2, line_dash='dash', line_color='red')
fig.add_vline(x=mu + sigma, line_width=2, line_dash='dash', line_color='red')

fig.show()

In [17]:
# Seed the pseudo-random number generator
numpy.random.seed(28)

# Select a random sample of 30 values from the population and assign it to the variable sample
sample = numpy.random.choice(population, 30)

# Calculate the sample mean and assign it to the variable x_bar
x_bar = sample.mean()
x_bar

np.float64(75.39338504047777)

In [18]:
# Use list comprehension to calculate the sample mean of 1000 samples of size 30 and assign it to the variable x_bar_vals
x_bar_vals = [numpy.random.choice(population, 30).mean() for _ in range(1000)]

In [19]:
# Sort the values in x_bar_vals and assign it to the variable x_bar_vals_sorted
x_bar_vals_sorted = numpy.sort(x_bar_vals)

In [20]:
# 25th smallest of the 1000 sorted sample means (approximately the 2.5th percentile)
x_bar_vals_sorted[24]

np.float64(69.54310782732712)

In [21]:
# 975th smallest of the 1000 sorted sample means (approximately the 97.5th percentile)
x_bar_vals_sorted[974]

np.float64(80.28681596822955)

In [22]:
# Create a histogram of the sample means
fig = graph_objects.Figure(
    graph_objects.Histogram(
        x=x_bar_vals,
        histnorm='probability'
    )
)

fig.update_layout(
    title='Histogram of sample means',
    xaxis_title='Sample means',
    yaxis_title='Probability',
    width=700
)

fig.show()
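
The spread of the sample means can be checked against the standard error formula sigma / sqrt(n). This is a sketch that regenerates a similar population with NumPy's `default_rng`, so its numbers will differ from the cells above; with the notebook's values, sigma / sqrt(30) is about 2.7.

```python
import numpy as np

rng = np.random.default_rng(28)

# Stand-in population (assumption: same shape as the notebook's, different draws)
population = rng.uniform(50, 100, 1000)
n = 30

# 1000 samples of size n, drawn with replacement; one mean per row
sample_means = rng.choice(population, size=(1000, n)).mean(axis=1)

# Observed spread of the sample means vs. the theoretical standard error
observed_se = sample_means.std(ddof=1)
expected_se = population.std() / np.sqrt(n)

print(observed_se, expected_se)
print(sample_means.mean(), population.mean())
```

The histogram of sample means looks roughly normal even though the population is uniform, which is the central limit theorem at work.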

## The rest of the notes were followed in class in the Week 11 Jupyter notebook