# Project 2: Exploratory Data Analysis and Simulation-Based Inference on Soybean Cultivars

Your Name  
2025-03-02

## Dataset Context

Soybean is a key crop for the food industry. This dataset, **Forty
Soybean Cultivars from Subsequent Harvests**, contains 320 observations
from 40 soybean cultivars planted over two seasons; each observation is
the average of 10 plants in a plot. The experiment used randomized
blocks arranged in a split-plot scheme with 4 replications, so 40
cultivars × 2 seasons × 4 replications = 320 rows. Variables include:

-   **Season:** Planting season (1 or 2).
-   **Cultivar:** Soybean cultivar name.
-   **Repetition:** Replication number (1–4).
-   **PH:** Plant Height (cm)
-   **IFP:** Insertion of the First Pod (cm)
-   **NLP:** Number of Legumes (pods) per Plant
-   **NGP:** Number of Grains per Plant
-   **NGL:** Number of Grains per Pod
-   **NS:** Number of Stems
-   **MHG:** Thousand Seed Weight (g)
-   **GY:** Grain Yield (kg/ha)
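
The design described above implies one row per combination of season, cultivar, and replication. A minimal sketch of how that balance could be checked, using a small synthetic frame in place of `soybean.csv`. The three cultivar names and the helper `cells_per_group` are made up for illustration; only the column names follow the variable list.

```python
import itertools
import pandas as pd

# Synthetic stand-in for the real data: 3 hypothetical cultivars instead of 40.
cultivars = ["cultivar_a", "cultivar_b", "cultivar_c"]
rows = list(itertools.product([1, 2], cultivars, [1, 2, 3, 4]))
toy = pd.DataFrame(rows, columns=["Season", "Cultivar", "Repetition"])

def cells_per_group(df):
    """Count rows for every Season x Cultivar combination."""
    return df.groupby(["Season", "Cultivar"]).size()

counts = cells_per_group(toy)
print(len(toy))         # 2 seasons * 3 cultivars * 4 replications = 24
print(counts.unique())  # a balanced design shows a single value here
```

Applied to the real data, the same check should report 4 rows for each of the 80 season-by-cultivar combinations if the file matches the description.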

> **Note for Casey:** Download the dataset (e.g., as `soybean.csv`) and
> host it on Canvas so that students have access.

------------------------------------------------------------------------

## Overview

The goal of this project is to perform exploratory data analysis (EDA)
and use a simulation-based approach to explore the following research
question:

> **Research Question:**  
> *Do soybean plants exhibit a substantial difference in plant height
> (PH) between Season 1 and Season 2?*
>
> **Exploratory Question:**  
> *How much do plant heights differ between the seasons, and is this
> difference typical of what might occur by chance?*
>
> **Hint:** If you are stuck on the simulation part, refer to Lab 4 (see
> “Assessing Models with Simulation”) for guidance.

------------------------------------------------------------------------

## Part 1: Data Cleaning and Exploratory Data Analysis (EDA)

### 1.1 Import Packages and Load the Dataset

*Student Code:* Load and preview the dataset.

In [1]:
# Import necessary packages
import pandas as pd
import numpy as np
import altair as alt

# Load the Soybean Cultivars dataset
# Assumes the file downloaded from Canvas is saved as "soybean.csv" in the working directory.
soybean = pd.read_csv("soybean.csv")
print("First 5 rows of the dataset:")
print(soybean.head())
# (Tip: See Lab 3, Section 1.1 for similar file loading methods.)

------------------------------------------------------------------------

### 1.2 Examine the Dataset

*Student Code:* Identify dataset dimensions, column names, and summary
statistics.

In [2]:
# Print dataset dimensions and column names
print("Dataset dimensions:", soybean.shape)
print("Column names:", list(soybean.columns))

# Display summary statistics for numeric variables
print(soybean.describe())
# (Tip: Use Lab 3, Section 1.2 as a reference for exploring datasets.)

> **Assignment Question:**  
> **(Outside Code)**  
> *Summarize the key features of the dataset. What are the dimensions
> and basic statistics of the plant height (PH) variable?*

------------------------------------------------------------------------

### 1.3 Check for Missing Values

*Student Code:* Check whether any column contains missing values.

In [3]:
# Check for missing values in each column
print("Missing values per column:")
print(soybean.isnull().sum())
# (Tip: See Lab 3, Section 1.3 for strategies to check for missing data.)

> **Assignment Question:**  
> **(Outside Code)**  
> *Describe any issues related to missing data and how they might affect
> the analysis.*
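
If the check does turn up gaps, it helps to see which rows are affected before deciding what to do. A minimal sketch on a toy frame with deliberately missing values. The numbers are invented, and dropping incomplete rows is only one possible choice, not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with deliberately missing values (not the real soybean data).
toy = pd.DataFrame({
    "Season": [1, 1, 2, 2],
    "PH": [50.2, np.nan, 61.0, 58.4],
    "GY": [3100.0, 2950.0, np.nan, 3300.0],
})

missing_counts = toy.isnull().sum()              # missing values per column
rows_with_gaps = toy[toy.isnull().any(axis=1)]   # rows containing any gap
ph_complete = toy.dropna(subset=["PH"])          # keep rows where PH is present

print(missing_counts)
print(rows_with_gaps)
```

Restricting `dropna` to the `PH` column keeps rows that are only missing other variables, which matters when the analysis uses plant height alone.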

------------------------------------------------------------------------

### 1.4 Create Visualizations

#### Histogram of Plant Height (PH) by Season

We will create separate histograms for each season to compare the
distributions.

*Student Code:* Generate histograms for each season.

In [4]:
hist_season = alt.Chart(soybean).mark_bar().encode(
    x=alt.X("PH", bin=alt.Bin(maxbins=20), title="Plant Height (cm)"),
    y=alt.Y("count()", title="Frequency"),
    color="Season:N",
    # One panel per season, so the bars are not stacked on top of each other
    row=alt.Row("Season:N", title="Season")
).properties(
    title="Distribution of Plant Height by Season"
)
hist_season.display()
# (Tip: For visualization techniques, refer to Lab 4, Section 2 on using Altair.)

> **Assignment Question:**  
> **(Outside Code)**  
> *Interpret the histogram. Based on the distribution, what differences
> (if any) do you observe between Season 1 and Season 2 in terms of
> plant height?*
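
A numeric summary by season can back up what the histogram shows. A minimal sketch on made-up heights: the values below are not from the soybean data, and only the column names `Season` and `PH` come from the dataset.

```python
import pandas as pd

# Made-up plant heights (cm) for illustration only.
toy = pd.DataFrame({
    "Season": [1, 1, 1, 2, 2, 2],
    "PH": [50.0, 52.0, 54.0, 60.0, 62.0, 64.0],
})

# Center and spread of PH within each season.
summary = toy.groupby("Season")["PH"].agg(["count", "mean", "median", "std"])
print(summary)
```

Comparing the means and standard deviations side by side gives a rough sense of whether the seasons differ in center, in spread, or in both.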

------------------------------------------------------------------------

## Part 2: Simulation-Based Inference: Assessing Typicality

Instead of formal hypothesis testing, we use a simulation-based approach
to assess whether the observed difference in mean plant height is
typical relative to what might occur by chance.  
*Hint: If you need a refresher on simulation, review Lab 4, Section 2.2
(“Simulating the Model”).*

### 2.1 Compute the Observed Difference

*Student Code:* Calculate the difference in means between Season 1 and
Season 2.

In [5]:
# Split the data by Season and compute the mean plant height for each group
group1 = soybean[soybean["Season"] == 1]["PH"]
group2 = soybean[soybean["Season"] == 2]["PH"]
mean_group1 = group1.mean()
mean_group2 = group2.mean()
observed_diff = mean_group1 - mean_group2

print("Mean PH for Season 1:", mean_group1)
print("Mean PH for Season 2:", mean_group2)
print("Observed difference (Season 1 - Season 2):", observed_diff)
# (Tip: This is similar to Lab 4, Section 2.1 where group differences are calculated.)

> **Assignment Question:**  
> **(Outside Code)**  
> *Calculate and report the difference in mean plant height between the
> two seasons.*

------------------------------------------------------------------------

### 2.2 Generate a Simulated Distribution

*TA Demonstration Code:* The following code generates a distribution of
differences by randomly shuffling the season labels.  
*Hint: For additional context, see Lab 4, Section 2.2 (“Simulating the
Model”) and Lab 3 for similar techniques.*

In [6]:
n_sim = 1000
simulated_diffs = []

for _ in range(n_sim):
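    # Shuffle the Season labels to see what differences arise by chance alone.
    # reset_index(drop=True) realigns the shuffled labels with the original rows
    # (this assumes soybean still has its default 0..n-1 index).
    shuffled = soybean["Season"].sample(frac=1, replace=False).reset_index(drop=True)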
    sim_group1 = soybean.loc[shuffled == 1, "PH"]
    sim_group2 = soybean.loc[shuffled == 2, "PH"]
    sim_diff = sim_group1.mean() - sim_group2.mean()
    simulated_diffs.append(sim_diff)

simulated_diffs = np.array(simulated_diffs)
# (Tip: Refer to Lab 4's simulation section for guidance on shuffling techniques.)

> **Assignment Question:**  
> **(Outside Code)**  
> *Explain why randomly shuffling the season labels can help assess
> whether the observed difference in plant height is typical or
> unusual.*
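
The shuffling idea can also be written as a small reusable function with a seeded random generator, so that repeated runs give the same simulated distribution. This is a sketch under stated assumptions, not the course's reference solution: the function name `shuffled_mean_diffs`, the seed, and the toy heights are all hypothetical.

```python
import numpy as np
import pandas as pd

def shuffled_mean_diffs(values, labels, n_sim=1000, seed=0):
    """Differences in group means (label 1 minus label 2) after shuffling labels."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    diffs = np.empty(n_sim)
    for i in range(n_sim):
        # Same number of 1s and 2s as before, assigned to rows at random.
        permuted = rng.permutation(labels)
        diffs[i] = values[permuted == 1].mean() - values[permuted == 2].mean()
    return diffs

# Toy heights (cm), not taken from the soybean data.
toy = pd.DataFrame({
    "Season": [1, 1, 1, 1, 2, 2, 2, 2],
    "PH": [48.0, 51.5, 50.2, 49.1, 60.3, 58.7, 61.9, 59.4],
})
sim = shuffled_mean_diffs(toy["PH"], toy["Season"], n_sim=500, seed=42)
print(sim[:5])
```

Because each shuffle keeps the heights fixed and only reassigns the labels, any difference in the shuffled group means reflects random assignment alone, which is the baseline the observed difference is compared against.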

------------------------------------------------------------------------

### 2.3 Visualize the Simulated Distribution

*Student Code:* Create a histogram of the simulated differences and
overlay a red line indicating the observed difference.  
*Hint: For tips on using Altair for layered plots, see Lab 4, Section
2.3 (“Visualizing the Simulation”).*

In [7]:
sim_df = pd.DataFrame({'sim_diff': simulated_diffs})

sim_chart = alt.Chart(sim_df).mark_bar().encode(
    x=alt.X("sim_diff", bin=alt.Bin(maxbins=20), title="Simulated Difference in Mean PH"),
    y=alt.Y("count()", title="Frequency")
).properties(
    title="Simulated Distribution of Difference in Plant Height"
)

obs_line = alt.Chart(pd.DataFrame({'obs': [observed_diff]})).mark_rule(color='red').encode(
    x='obs:Q'
)

(sim_chart + obs_line).configure_title(fontSize=16).display()
# (Tip: See Lab 4, Section 2.3 for combining charts with a rule marker.)

> **Assignment Question:**  
> **(Outside Code)**  
> *Interpret the simulation plot. Based on the position of the observed
> difference (red line) relative to the simulated distribution, what can
> you conclude about the typicality of the observed difference?*

------------------------------------------------------------------------

### 2.4 Compute the Typicality Score

*Student Code:* Calculate the proportion of simulated differences that
are at least as extreme (in absolute value) as the observed difference.  
*Hint: For a similar computation, review Lab 4, Section 2.3.*

In [8]:
extreme_count = np.sum(np.abs(simulated_diffs) >= np.abs(observed_diff))
typicality_score = extreme_count / n_sim
print("Typicality score:", typicality_score)

> **Assignment Question:**  
> **(Outside Code)**  
> *Based on the typicality score, explain in plain language whether the
> observed difference in plant height between seasons appears typical or
> unusual. What might this indicate about seasonal effects on plant
> growth?*
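
To make the score concrete, here is a minimal sketch on six hand-picked numbers standing in for simulated differences. The helper `typicality_score` and the values are hypothetical; it relies on the fact that the mean of a boolean array is the proportion of `True` entries.

```python
import numpy as np

def typicality_score(simulated, observed):
    """Share of simulated differences at least as far from zero as the observed one."""
    simulated = np.asarray(simulated, dtype=float)
    return float(np.mean(np.abs(simulated) >= abs(observed)))

toy_sims = [-3.0, -1.0, 0.0, 1.0, 2.0, 4.0]
score = typicality_score(toy_sims, observed=-2.5)
print(score)  # 2 of the 6 values (-3.0 and 4.0) are at least 2.5 away from zero
```

A score near 0 means few shuffles produced a difference as large as the observed one, while a score near 1 means such differences are common under random labeling.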

------------------------------------------------------------------------

## Reporting and Interpretation

In markdown cells (outside of code), provide your written answers to the
following:

1.  **Introduction and Research Question:**
    -   State your research question and describe the exploratory
        approach used.
2.  **Data Exploration Summary:**
    -   Summarize key findings from your EDA, including the distribution
        of plant height and any seasonal differences.
3.  **Inference Results:**
    -   Report the observed difference in mean plant height.
    -   Explain the simulation results and what the typicality score
        indicates about the difference between seasons.
4.  **Discussion:**
    -   Discuss any limitations of your analysis.
    -   Suggest potential further analyses (e.g., exploring other
        variables or applying additional non-parametric methods).

------------------------------------------------------------------------

## Final Deliverables

-   **Project Submission:**  
    Submit a well-documented Jupyter Notebook that includes your
    completed code, visualizations, and written interpretations.

-   **Written Report:**  
    Include a concise written report (approximately 2–3 pages)
    summarizing your methodology, findings, and conclusions.

------------------------------------------------------------------------