# Engineering Data Analysis

> **Mohamad M. Hallal, PhD** <br> Teaching Professor, UC Berkeley

[![License](https://img.shields.io/badge/license-CC%20BY--NC--ND%204.0-blue)](https://creativecommons.org/licenses/by-nc-nd/4.0/)
***

# Discrete Random Variables

In this notebook, we will explore the samples you collected during the in-class activity. We will then combine the results from all groups to see how a discrete random variable behaves as more observations are added.

Let's get started!

# Your Sample

First, run the code below and enter your group number. This reads the data your group collected: the number of color A ducks observed in each of the 30 repetitions.

In [None]:
import pandas as pd

# Base URL of the published spreadsheet and the sheet ID (gid) of each group
url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vQ2fUKXMhPV62PJbqbEC8Dx2cQ0r7DmCBCpLv-j18s0pt91Xr_VmbiIsRSr_uiiPmtHeTBE9LEgUU6A/pub?output=csv&gid="
gids = [0, 256169159, 1309475386, 1766258035, 1993533243, 629036200, 1471433991, 1243754353, 1081653691, 1551041451, 1962198564, 502609536, 1908262079,
        296536638, 11798463, 2073400276, 1086870903, 1160920316, 788098645, 1587255657, 682312113, 1718069535, 1389110583, 1856898227, 1288189134]

# Ask user for input
group = int(input("Please enter your group number: "))

# Validate the input and stop before reading any data if it is out of range
if not 1 <= group <= len(gids):
    raise ValueError(f"The number is out of range (should be between 1 and {len(gids)}). Please run the code again and enter a valid number.")
print(f"You selected: {group}")

# Read the first 30 rows of the selected group's sheet
df = pd.read_csv(f'{url}{gids[group-1]}', nrows=30)

# Probability Mass Function

Next, run the code below to plot the probability mass function of your sample.

In [None]:
import matplotlib.pyplot as plt

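# Extract the observed values of X, ignoring any empty rows
sample = df['X'].dropna()

# Plot the PMF: bins of width 1 centered on 0, 1, 2, 3, so density=True
# makes each bar height equal to the relative frequency of that value
plt.hist(sample, bins=[-0.5, 0.5, 1.5, 2.5, 3.5], density=True, rwidth=0.8)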

# Control plot appearance
plt.xlabel('$X$, Number of color A ducks in sample of 3')
plt.ylabel('$p(x)$')
plt.title(f'Probability Mass Function (PMF), Group {group}')
plt.xticks([0, 1, 2, 3])
plt.show()
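
Because each bin has width 1 and `density=True` is set, the bar heights in the plot above are the relative frequencies of each value of $X$. As a minimal sketch, the same PMF can be tabulated directly with `value_counts`. The values in `example_sample` are made up for illustration and are not class data; in this notebook you would use `sample` instead.

```python
import pandas as pd

# Hypothetical observations of X (illustration only, not class data)
example_sample = pd.Series([0, 1, 1, 2, 1, 0, 2, 3, 1, 2])

# Relative frequency of each observed value; reindex so that every
# possible value 0..3 appears, even if it was never observed
pmf = (
    example_sample.value_counts(normalize=True)
    .reindex([0, 1, 2, 3], fill_value=0.0)
)

print(pmf)
print(f"Sum of probabilities: {pmf.sum():.3f}")
```

The probabilities of a PMF must sum to 1, which is a quick check that the table was built correctly.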

The 30 values you obtained are based on a random sample, and as we discussed in class, they are affected by sampling variation. A different group with another set of 30 values will likely have a different distribution. Let's now combine the results from all groups into a single probability mass function to get a clearer picture of the overall distribution.

Run the code below to read the samples of all groups and plot their probability mass function.

In [None]:
# Collect the 'X' column of every group's sheet
group_samples = []

# Loop over all sheets
for gid in gids:
    group_df = pd.read_csv(f'{url}{gid}', nrows=30)
    group_samples.append(group_df['X'].dropna())

# Concatenate all groups vertically (along rows) into a single Series
all_samples = pd.concat(group_samples, ignore_index=True)

# Plot the PMF
plt.hist(all_samples, bins=[-0.5, 0.5, 1.5, 2.5, 3.5], density=True, rwidth=0.8)

# Control plot appearance
plt.xlabel('$X$, Number of color A ducks in sample of 3')
plt.ylabel('$p(x)$')
plt.title('Probability Mass Function (PMF), All Groups')
plt.xticks([0, 1, 2, 3])
plt.show()
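
Sampling variation can also be illustrated with a small simulation. The sketch below is not the class data. It assumes, purely for illustration, that each of the 3 ducks is independently color A with a hypothetical probability `p_assumed = 0.4`, so that $X$ follows a binomial distribution. The actual proportion in the activity, and whether ducks were drawn with replacement, may differ.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

p_assumed = 0.4   # hypothetical proportion of color A ducks (an assumption)
n_draws = 3       # ducks per sample
n_reps = 30       # repetitions per group
n_groups = 25     # number of groups

# One row per simulated group, one column per repetition
simulated = rng.binomial(n=n_draws, p=p_assumed, size=(n_groups, n_reps))

# Each group's mean varies; the pooled mean uses all 750 values
group_means = simulated.mean(axis=1)
pooled_mean = simulated.mean()

print(f"Range of group means: {group_means.min():.3f} to {group_means.max():.3f}")
print(f"Pooled mean: {pooled_mean:.3f}")
print(f"Mean under the assumed model: {n_draws * p_assumed:.3f}")
```

Under these assumptions, the individual group means spread around the model mean, while the pooled mean is based on far more observations and is typically closer to it.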

# Numerical Summaries

Finally, run the code below to calculate the mean and standard deviation of your sample and of the combined results from all groups.

In [None]:
# Mean of selected group
print(f'Mean of Group {group}: {round(sample.mean(),3)}')

# Mean of all groups
print(f'Mean of All Groups: {round(all_samples.mean(),3)}')

# Standard deviation of selected group (pandas uses the sample formula, ddof=1, by default)
print(f'Standard Deviation of Group {group}: {round(sample.std(), 3)}')

# Standard deviation of all groups
print(f'Standard Deviation of All Groups: {round(all_samples.std(), 3)}')
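
For a discrete random variable, the mean and variance can also be computed from the PMF as $\mu = \sum_x x\,p(x)$ and $\sigma^2 = \sum_x (x-\mu)^2\,p(x)$. The sketch below checks this against the direct calculation. It uses the made-up values in `example_sample` (not class data); the same steps should apply to `sample` or `all_samples`.

```python
import numpy as np
import pandas as pd

# Hypothetical observations of X (illustration only, not class data)
example_sample = pd.Series([0, 1, 1, 2, 1, 0, 2, 3, 1, 2])

# Empirical PMF: relative frequency of each observed value
pmf = example_sample.value_counts(normalize=True).sort_index()
x = pmf.index.to_numpy()
p = pmf.to_numpy()

# Mean and standard deviation computed from the PMF
mean_from_pmf = np.sum(x * p)
var_from_pmf = np.sum((x - mean_from_pmf) ** 2 * p)
std_from_pmf = np.sqrt(var_from_pmf)

print(f"Mean from PMF:  {mean_from_pmf:.3f}")
print(f"Mean from data: {example_sample.mean():.3f}")
print(f"Std from PMF:   {std_from_pmf:.3f}")
print(f"Std from data (ddof=0): {example_sample.std(ddof=0):.3f}")
print(f"Std from data (ddof=1, pandas default): {example_sample.std():.3f}")
```

The PMF-based standard deviation matches `std(ddof=0)`. The pandas default, `std()`, divides by $n-1$ instead of $n$, so it is slightly larger; the difference shrinks as the number of observations grows.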