# Day 15 In-Class Assignment
---


### <p style="text-align: right;"> &#9989; Put your name here.</p>

#### <p style="text-align: right;"> &#9989; Put your group member names here.</p>

## Correlations and contaminated pollen

<img src="https://imgs.mongabay.com/wp-content/uploads/sites/20/2018/01/04160950/bee-up-close.jpg" style="display:block; margin-left: auto; margin-right: auto; width: 50%" alt="Long-tongued bumble bee queens visit flowers of the alpine skypilot. Acoustic sensors can distinguish the distinctive flight buzz of these large bees, a bee version of a cargo-plane flying from flower to flower.">
<p style="font-size:0.85em; text-align: center;">Credits: <a href="https://news.mongabay.com/2018/01/10-top-conservation-tech-innovations-from-2017/" target="_blank">Zo&euml; Maffett</a></p>

### Learning goals for today's assignment

* Describe the utility of fitting trendlines to data, in the context of making predictions about the future
* Use best-fit lines to make predictions
* Quantitatively and qualitatively describe how to determine the goodness of fit for a given line

### Assignment instructions

Work with your group to complete this assignment. Instructions for submitting this assignment are at the end of the Notebook. The assignment is due at the end of class.

---
## Background

As willows move higher with climate change, they will drive postpollination interference, reducing the fitness benefits of pollinator visitation for *Polemonium viscosum* (alpine skypilot) and selecting for traits that reduce pollinator sharing. The authors probed potential impacts of interference from encroaching *Salix* (willows) on pollination quality of *Polemonium*. Overlap in flowering time of *Salix* and *Polemonium* is a precondition for interference.

Pollinator sharing was ascertained from observations of willow pollen on bumble bees visiting *Polemonium* flowers and on *Polemonium* pistils. The authors measured the **correlation between *Salix* pollen contamination and seed set in naturally pollinated *Polemonium***. 
After accounting for variance in flowering date due to latitude, *Salix* and *Polemonium* showed similar advances in flowering under warmer summers. This trend supports the idea that **sensitivity to temperature promotes reproductive synchrony in both species**. 

<img src="https://onlinelibrary.wiley.com/cms/asset/5cb5dd0d-4e7f-4b00-9c60-588059d51a4c/ece33272-fig-0005-m.jpg" style="display:block; margin-left: auto; margin-right: auto; width: 50%" alt="Relationship between contamination of Polemonium stigma pollen loads with Salix pollen and seeds per flower averaged for the plant.">
<p style="font-size:0.85em; text-align: center;">Credits: <a href="https://doi.org/10.1002/ece3.3272" target="_blank">Kettenbach et al 2017</a></p>

The data comes from:

> Kettenbach JA, Miller-Struttmann N, Moffett Z, Galen C. (2017) [How shrub encroachment under climate change could threaten pollination services for alpine wildflowers: A case study using the alpine skypilot, *Polemonium viscosum*](https://doi.org/10.1002/ece3.3272). *Ecol Evol*. **7**: 6963–6971.


&#9989;&nbsp; **Question 1**

- What is reflected on the x-axis?
- What is reflected on the y-axis?
- In your own words, what information do you get out of this figure?

<font size=+3>&#9998;</font> *Put your answer here.*


---

## 1. Setting everything up

Before jumping straight into correlations, we need to go through the usual setup steps.

&#9989;&nbsp; **Task 2** 

- Import the usual suspects: NumPy, pandas, matplotlib
- To compute correlations and p-values, we will need one more import: the `stats` submodule from the `scipy` module.

[SciPy is the scientific library of Python](https://docs.scipy.org/doc/scipy/index.html) and it contains several useful submodules for all sorts of statistical and computational tasks. SciPy is the last piece of this course's "quadfecta".

In [None]:
# Import the rest of the needed libraries

from scipy import stats

Remember that it is always good coding practice to import your libraries at the top before doing anything else.

&#9989;&nbsp; **Task 3**

- With pandas, load the dataset `salix+and+pv+phenology+data.csv`
- Display its first few rows to make sure everything worked as expected

In [None]:
# Load the data


&#9989;&nbsp; **Question 4**

- What columns will you need to reproduce [Fig. 2 from Kettenbach et al. (2017)](https://onlinelibrary.wiley.com/cms/asset/1a360e97-826a-4c93-95a2-96a10e966698/ece33272-fig-0002-m.jpg)? This is a different figure from the one above.

<font size=+3>&#9998;</font> *Put your answer here.*


&#9989;&nbsp; **Task 5**

- With matplotlib, make a scatterplot of *Polemonium* vs *Salix* Julian day of flowering, like Fig. 2 from the reference paper.
- Make sure to label your axes

*Note*: You can pass `facecolor='none', edgecolor='red'` to the scatter call to get empty markers with a red edge.
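
As a minimal sketch of that styling, the cell below draws hollow red markers on made-up data. The arrays `x` and `y` and the axis labels are placeholders invented for this illustration; they are not columns from the assignment dataset.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data, only to demonstrate the marker styling
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=30)
y = 2 * x + rng.normal(0, 2, size=30)

fig, ax = plt.subplots()
# facecolor='none' leaves the marker interior empty; edgecolor sets the outline
ax.scatter(x, y, facecolor='none', edgecolor='red')
ax.set_xlabel('x (made-up units)')
ax.set_ylabel('y (made-up units)')
```

Hollow markers make overlapping points easier to tell apart, which helps when several observations share similar values.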

In [None]:
# Your code

&#9989;&nbsp; **Question 6**

- Do you think the flowering days of *Salix* and *Polemonium* are correlated? If so, does the correlation look linear to you?
- Will the correlation coefficient be positive, negative, or close to zero? Explain your answer.

<font size=+3>&#9998;</font> *Put your answer here.*


---

## 2. Computing Pearson correlation with `stats.pearsonr`

With SciPy's `stats` module, we can compute the Pearson correlation coefficient and its associated p-value with the function `stats.pearsonr`. That function returns a *named tuple*.
- A *tuple* is almost like a list: you can have all sorts of variables in a tuple. You can access them via indices. *But* you cannot modify the items in a tuple.
- A *named tuple* is a tuple where you can access its items with a *name* rather than an index.

See the snippet below:

```python
# store the results in a named tuple
pearson = stats.pearsonr(x_axis_values, y_axis_values)

# This tuple has two names:
# - statistic (the coefficient)
# - pvalue
# You access them with a dot . instead of using an index and square brackets [i]
print('The Pearson coefficient is', pearson.statistic)
print('Its associated p-value is', pearson.pvalue)
```
[Check its documentation for more details](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html).
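
To make the difference between a tuple and a named tuple concrete, here is a minimal sketch using Python's built-in `collections.namedtuple`. The name `CorrelationResult` and the numbers are made up for illustration; the object SciPy returns may be built differently under the hood, but you read it the same way.

```python
from collections import namedtuple

# A plain tuple: items are reached by position only
plain = (0.87, 0.002)  # made-up values
print(plain[0])

# A named tuple: same data, but each position also has a name
# (CorrelationResult is a hypothetical name for this illustration)
CorrelationResult = namedtuple('CorrelationResult', ['statistic', 'pvalue'])
named = CorrelationResult(statistic=0.87, pvalue=0.002)
print(named.statistic, named.pvalue)  # access by name
print(named[0], named[1])             # index access still works

# Like any tuple, a named tuple cannot be modified
try:
    named.statistic = 0.5
except AttributeError as err:
    print('Cannot modify a named tuple:', err)
```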

&#9989;&nbsp; **Task 7** 

- Compute and print the Pearson correlation and its associated p-value between *Salix* and *Polemonium* flowering days.

In [None]:
# Your code

&#9989;&nbsp; **Question 8**

- Does the coefficient match your guess from Question 6?
- Do your reported values match the ones reported in Figure 2's caption?
- How do you interpret your reported p-value?
- Overall, what can you say about *Polemonium* and *Salix* flowering times?

<font size=+3>&#9998;</font> *Put your answer here.*


---

## 3. Computing Spearman correlation with `stats.spearmanr`

SciPy also comes with an easy way to compute Spearman correlation with `stats.spearmanr`. It works pretty much the same as `stats.pearsonr`. [Check its documentation for more details](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html).
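
The sketch below runs both functions on made-up data to show how the two coefficients can differ. Spearman correlation is the Pearson correlation of the ranks, so it measures whether the relationship is monotonic rather than strictly linear. The arrays `x` and `y` are invented for this example, and the results are unpacked as plain tuples, which is assumed to work alongside the named access shown earlier.

```python
import numpy as np
from scipy import stats

# Made-up data: y always increases with x, but along a curve, not a line
x = np.arange(1, 11)
y = np.exp(x)

# Both functions return (coefficient, p-value), so they can be unpacked
r, r_pvalue = stats.pearsonr(x, y)
rho, rho_pvalue = stats.spearmanr(x, y)

print('Pearson coefficient: ', r)
print('Spearman coefficient:', rho)
```

Because the ranks of `x` and `y` line up perfectly, the Spearman coefficient reaches 1 while the Pearson coefficient stays below it.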

&#9989;&nbsp; **Task 9** 

- Compute and print the Spearman correlation and its associated p-value between *Salix* and *Polemonium* flowering days.

In [None]:
# Your code

&#9989;&nbsp; **Question 10**

- How different are the two correlation coefficients?
- Does the difference, or lack thereof, make sense to you? Explain your answer.
- Do the Spearman correlation results change or reinforce your interpretation of *Polemonium* and *Salix* flowering times as in Question 8?

<font size=+3>&#9998;</font> *Put your answer here.*


---

## 4. Data wrangling and more correlation practice

&#9989;&nbsp; **Task 11**

Now look at Figure 3 from [Kettenbach et al (2017)](https://doi.org/10.1002/ece3.3272). Using the same data you loaded in Part 1:

- Make a scatterplot of *Salix* Julian flowering day versus minimum June temperature
- On the same plot, make a scatterplot of *Polemonium* Julian flowering day versus minimum June temperature
- Add labels and legends

In [None]:
# Your code

&#9989;&nbsp; **Question 12**

Looking at the plot alone:

- Do you think temperature is correlated with either plant's flowering day? If so, is the correlation linear?
- Does one plant seem more correlated than the other?
- How can this correlation, or lack thereof, be interpreted in terms of climate change?

<font size=+3>&#9998;</font> *Put your answer here.*


Ideally when handling data, you would like to just focus on computing this or that statistic. But you always have to keep in mind that some ad-hoc data wrangling comes first. And you don't know it until you see it.

&#9989;&nbsp; **Task 13**

- Compute the Pearson correlation between minimum June temperature and *Salix* flowering day.
- Did you get the expected result?

In [None]:
# Unless you get a red error, you're not doing anything wrong

There are many reasons you can get an unexpected NaN ("not a number"; remember them from Day 11?). The most common are:

- We are trying to do an undefined math computation (like dividing by 0, log-transforming a negative number, or trying to correlate a single data point).
- There is a NaN elsewhere that is messing things up downstream.

The math behind correlation coefficients should always produce a number, so most likely there's a NaN in the original data.
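
The short sketch below shows how a single NaN spreads through a calculation. The arrays `temps` and `days` hold made-up numbers, and NumPy's `np.corrcoef` stands in for the correlation step so the example depends only on NumPy.

```python
import numpy as np

# Made-up values, with one missing measurement in temps
temps = np.array([2.1, 3.4, np.nan, 4.0, 5.2])
days = np.array([180.0, 176.0, 174.0, 171.0, 168.0])

# Any sum that touches a NaN becomes NaN, so the mean is NaN...
print(np.mean(temps))

# ...and so is the correlation, which is built from means and sums
print(np.corrcoef(temps, days)[0, 1])

# np.isnan flags the missing entries so we can count them
print(np.isnan(temps).sum(), 'missing value(s) in temps')
```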

&#9989;&nbsp; **Task 14**

- Use `.dropna` to drop the rows that have NaNs in either the Julian day flowering columns or the temperature column
- Save the modified dataframe in a different variable: we want to keep the original data around
- Compare the length of the original dataframe vs the modified one: how many rows did you actually drop?

Check the Day 11 in-class assignment if you need a reminder of how to use `.dropna`.
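
As a quick refresher, here is a minimal sketch of `.dropna` with its `subset` argument on a toy dataframe. The column names (`site`, `flower_day`, `min_temp`, `notes`) are hypothetical and will not match the real dataset, so check your own column names before adapting it.

```python
import numpy as np
import pandas as pd

# Toy dataframe with hypothetical column names (not the real dataset's)
toy = pd.DataFrame({
    'site': ['a', 'b', 'c', 'd'],
    'flower_day': [180.0, np.nan, 174.0, 171.0],
    'min_temp': [2.1, 3.4, np.nan, 4.0],
    'notes': [None, 'windy', None, None],
})

# Drop rows with a NaN in the listed columns only; 'notes' is ignored
toy_clean = toy.dropna(subset=['flower_day', 'min_temp'])

# The original dataframe is untouched, so the two lengths can be compared
print(len(toy), 'rows before,', len(toy_clean), 'rows after')
```

Without `subset`, `.dropna` would remove every row that has a missing value in any column, including columns that are irrelevant to the analysis.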

In [None]:
# Your code

&#9989;&nbsp; **Task 15** 

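- Now compute the Pearson correlation coefficients again, this time with the cleaned dataframe: minimum June temperature versus *Salix* flowering day, and minimum June temperature versus *Polemonium* flowering day.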
- Do the correlation coefficients agree with your intuition from Question 12?

In [None]:
# Your code

<font size=+3>&#9998;</font> *Put your answer here.*

---

## 5. One more practice (time-permitting)

Now let's turn to Figure 5 from Kettenbach et al.

&#9989;&nbsp; **Task 16** 

In one or more cells, you'll essentially repeat Parts 1 through 3:

- Load the dataset `pollen+purity+vs+seed+set.csv` and check its first few rows.
- Identify which columns you need to reproduce Fig. 5
- Make a scatterplot of seeds per *Polemonium* flower versus contamination (percentage of *Salix* pollen in the stigma pollen load)
- Compute the Pearson and Spearman correlations between these two variables.

In [None]:
# Load the data

In [None]:
# Scatterplot

In [None]:
# Correlations

&#9989;&nbsp; **Question 17** 

- Do the coefficients match your observation from the scatterplot?
- Does the Spearman coefficient agree with the one published by Kettenbach et al? (Look at Figure 5's caption).
- Are the Pearson and Spearman coefficients very different from each other?
- What does their difference, or lack thereof, tell you about the relationship between contamination and seed number?

<font size=+3>&#9998;</font> *Put your answer here.*

---

### Assignment wrap-up

Please fill out the form at the link below. You must log in using your MU credentials. **You must completely fill this out in order to receive credit for the assignment!**

#### https://forms.office.com/r/cADesBUd7V

In [None]:
# Click on the link above if this cell fails to produce a survey form.

from IPython.display import HTML
HTML(
"""
<iframe 
	src="https://forms.office.com/r/cADesBUd7V" 
	width="800px" 
	height="600px" 
	frameborder="0" 
	marginheight="0" 
	marginwidth="0">
	Click the link above if this cell fails to produce a survey
</iframe>
"""
)

---

## Congratulations, you're done!

Submit this assignment by uploading it to the course Desire2Learn web page.  Go to the "In-class assignments" folder, find the appropriate submission link, and upload it there.

See you next class!

&#169; Copyright 2026,  Division of Plant Science & Technology&mdash;University of Missouri