![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

## Analysis of a Plant's Environment

In this short notebook, we look at some data collected from a house plant growing in one of our homes.

<img src="images/plant2.jpg" alt="A plant with sensor" width="400"/>
<div align="center">

A basement window with our plant.
</div>

We have a number of data sensors around the plant, made by the [Phidgets company](https://www.phidgets.com/). These sensors keep track of the temperature, humidity, soil moisture, and light levels. Values from the sensors have been recorded several times a day and stored in an online spreadsheet.

You can see the spreadsheet here: https://ethercalc.net/callysto_plant_01

The sensors measure data in different units:
- Temperature, in degrees Celsius. e.g. $20^\circ C$ is room temperature.
- Humidity, as a percentage. e.g. 30% to 60% humidity is typical for a room.
- Moisture, as a ratio. e.g. 0.0 is bone dry, 1.0 is soaking wet.
- Luminance, in lux. e.g. 1,000 lux is the light outdoors on a dark, cloudy day.

In this notebook, we load the data into a Pandas DataFrame. From there we can plot the data and perform some numerical calculations that give us an idea about the state of the plant's environment.


## Step 1 - Libraries

Let's import some Python libraries for Pandas, Plotly, and tools for dealing with dates and times. 

In [None]:
import pandas as pd
import plotly.express as px
from datetime import datetime

## Step 2 - Getting some data

Let's get some plant data that we have stored as a file with this notebook. 

The following code reads in the data file and shows it as a data frame. We will call it **df** to represent the initial data set. Later, we will store a subset of the data in a new dataframe called **df_m** (for the modified data frame). 

In [None]:
df = pd.read_csv('./data/Plant_00.csv')
df

You should see about 5000 rows and 6 columns of data above.

## Step 3 - Adding datetime

Let's fix the time format, and add a "datetime" stamp by combining date and time into a single column. This will help us when plotting the data.
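
To see what this step does on a small scale, here is a minimal sketch on a made-up three-row table. The values are invented for illustration, and we assume the times may be written without a leading zero (such as `9:05:00`). The last line is an optional extra, not part of the notebook's own code: it turns the text into a true timestamp column, here called `Timestamp`.

```python
import pandas as pd
from datetime import datetime

# Made-up sample rows, for illustration only
sample = pd.DataFrame({
    "Date": ["2023-05-16", "2023-05-16", "2023-05-17"],
    "Time": ["9:05:00", "14:30:00", "0:00:05"],
})

# Parse each time string, then write it back zero-padded as HH:MM:SS
sample["Time"] = sample["Time"].apply(
    lambda t: datetime.strptime(t, "%H:%M:%S").strftime("%H:%M:%S")
)

# Join date and time into a single text column
sample["DateTime"] = sample["Date"] + " " + sample["Time"]

# Optional: a true timestamp column, which sorts and subtracts correctly
sample["Timestamp"] = pd.to_datetime(sample["DateTime"], format="%Y-%m-%d %H:%M:%S")
print(sample)
```

Zero-padding matters because text such as `9:05:00` would otherwise sort after `14:30:00`.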



In [None]:
df['Time'] = df['Time'].apply(lambda x: datetime.strptime(x,"%H:%M:%S").strftime("%H:%M:%S"))
df["DateTime"] = df["Date"] + ' ' + df["Time"]
df

## Step 4 - First plots

Using Plotly, let's do some initial plots of the four different measurements: Temperature, Humidity, Moisture and Luminance. 

In [None]:
fig1 = px.scatter(df, x="DateTime", y="Temperature", 
                 title="Temperature versus Time")
fig2 = px.scatter(df, x="DateTime", y="Humidity", 
                 title="Humidity versus Time")
fig3 = px.scatter(df, x="DateTime", y="Moisture", 
                 title="Moisture versus Time")
fig4 = px.scatter(df, x="DateTime", y="Luminance", 
                 title="Luminance versus Time")
# Show each figure in turn
for fig in (fig1, fig2, fig3, fig4):
    fig.show()

## Step 5 - First observations

Let's check what range of dates is present in the data set.

In [None]:
print("Start and end dates are",df["Date"].min(), "and", df["Date"].max())

We see there is a lot of missing data in the above plots, especially early on. 

Let's set the early data aside and pick a two-week period starting when the data is stable. In the following, we pick the data from May 16 to May 30. You will need to change this when examining your own plant data.

We call this modified data frame **df_m**. This way, if you want to change the date range, the initial data in **df** is still saved for you. 
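
The date selection can be tried out on a tiny made-up table first. The sketch below uses invented values and the pandas method `Series.between`, which is one possible way to write the comparison. It also adds `.copy()`, an optional extra, so the subset is fully independent of the original table.

```python
import pandas as pd

# Made-up sample rows, for illustration only
sample = pd.DataFrame({
    "Date": ["2023-05-10", "2023-05-16", "2023-05-23", "2023-05-30", "2023-06-02"],
    "Temperature": [19.5, 21.0, 22.4, 20.8, 23.1],
})

start_date = "2023-05-16"
end_date = "2023-05-30"

# Dates written as YYYY-MM-DD sort the same way as text and as calendar dates,
# so comparing the strings directly selects the right rows.
mask = sample["Date"].between(start_date, end_date)  # inclusive on both ends

# .copy() makes the subset independent of the original table
subset = sample[mask].copy()
print(subset)
```

Note that this shortcut depends on the YYYY-MM-DD format. Dates written as, say, 16/05/2023 would not compare correctly as text.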

In [None]:
start_date = '2023-05-16'
end_date = '2023-05-30'
# Select DataFrame rows between two dates
mask = (df['Date'] >= start_date) & (df['Date'] <= end_date)
df_m = df[mask]
df_m

## Now the plots are better

We plot with the modified data in **df_m**.


In [None]:
fig1 = px.scatter(df_m, x="DateTime", y="Temperature", 
                 title="Temperature versus Time")
fig2 = px.scatter(df_m, x="DateTime", y="Humidity", 
                 title="Humidity versus Time")
fig3 = px.scatter(df_m, x="DateTime", y="Moisture", 
                 title="Moisture versus Time")
fig4 = px.scatter(df_m, x="DateTime", y="Luminance", 
                 title="Luminance versus Time")
# Show each figure in turn
for fig in (fig1, fig2, fig3, fig4):
    fig.show()

## Step 6: Analyzing the data

Before doing any statistical analysis, what can we discover just by looking at the raw data?

1. What is the typical range of: 
    - Temperature? 
    - Humidity?
    - Moisture? 
    - Luminance?
2. Are there any jumps in the data? 
    - We might expect a jump in moisture when the plant gets watered. 
    - Can you find the day and time the plant was watered?
    - Are there other places where you see a jump in data? Why do you think it jumps?
3. One day, when the moisture jumped, the humidity also briefly jumped. Why?
    - Was it a rainy day, so that rain moistened the plant and also raised the humidity in the air?
    - Was someone taking a shower and the air in the room got humid?
    - Did someone spill water on the humidity sensor while trying to water the plant?
    - What do you think happened?
4. Is there any correlation between time of day and the various data? 
    - Temperature? 
    - Humidity?
    - Moisture? 
    - Luminance?
5. It looks like temperature goes up when humidity goes down, and vice versa. Can you state this more precisely?
6. Does temperature go up with luminance? Or not?
7. What conclusions might you draw from this data? Some good questions include:
    - Is this an indoor plant? 
    - Can you say anything about the house it is in? 
    - Does the house have air conditioning for summer? 
    - Does the house have heating for winter? 
    - Are there indoor lights, or is the plant only seeing sunlight?
    - What other conclusions can you draw from the data?
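
Looking at the plots is the main point here, but a couple of these questions can also be checked with a few lines of Pandas. The sketch below is one possible approach on made-up readings (all values are invented): it finds the smallest and largest value of each measurement, and the reading where moisture rose the most.

```python
import pandas as pd

# Made-up sample rows, for illustration only
sample = pd.DataFrame({
    "DateTime": ["2023-05-24 08:00:00", "2023-05-24 10:00:00",
                 "2023-05-24 12:00:00", "2023-05-24 14:00:00"],
    "Temperature": [19.0, 20.5, 22.0, 21.5],
    "Moisture": [0.30, 0.29, 0.62, 0.60],
})

# Smallest and largest value of each measurement
ranges = sample[["Temperature", "Moisture"]].agg(["min", "max"])
print(ranges)

# Change from one reading to the next; a big positive change is a "jump"
change = sample["Moisture"].diff()
biggest = change.idxmax()
print("Largest moisture jump at", sample.loc[biggest, "DateTime"])
```

With the real data, the same lines could be applied to **df_m** instead of the made-up `sample` table.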


## Step 6a: Analyzing the data - temperature and humidity

We can graph humidity versus temperature to see if there are any obvious correlations. 

The Plotly graphing package allows us to indicate the date via the color of the marker on the graph. By doing so, we can see how the data clusters by day. 

In [None]:
fig4 = px.scatter(df_m, x="Temperature", y="Humidity", color="Date",
                 title="Humidity versus Temperature")

fig4.show()

## Observations

The colors of the dots in the above graph show how the data clusters by day. Clicking on the legend on the right, you can select specific dates to examine more closely. Are there some dates where the data cluster tightly? Others more loosely? What might that mean?
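
One way to put a number on "tightly" versus "loosely" is the standard deviation of each day's readings. This is a sketch on made-up values, not a result from the plant data: a small standard deviation means that day's points sit close together.

```python
import pandas as pd

# Made-up sample rows, for illustration only
sample = pd.DataFrame({
    "Date": ["2023-05-20"] * 3 + ["2023-05-21"] * 3,
    "Temperature": [20.0, 20.1, 20.2, 18.0, 21.0, 24.0],
    "Humidity": [45.0, 45.5, 45.2, 40.0, 48.0, 55.0],
})

# Standard deviation per day, for each measurement
spread = sample.groupby("Date")[["Temperature", "Humidity"]].std()
print(spread)
```

In this invented example the first day is the tight cluster and the second day is the loose one.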

## Step 6b: Analyzing the data - light and time of day

We can expect sunlight in the daytime, and darkness at night. Let's plot the luminance data as a function of time of day.

Again, we can color the dots by "Date" to get an indication of how the data clusters based on the actual day. We also add the "category order" in order to sort the times in proper order along the x-axis.

In [None]:
fig5 = px.scatter(df_m, x="Time", y="Luminance", color="Date",
                 title="Luminance versus Time of Day")
fig5.update_layout(xaxis_categoryorder = 'category ascending')
fig5.show()

## Observations

We see the luminance peaks around 14:00 hours, which is 2:00 in the afternoon. That makes sense, as mid-afternoon is the brightest time of the day. 
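
A statement like this can also be checked numerically by averaging the luminance for each hour of the day. The sketch below uses made-up readings, and the `Hour` column is a helper introduced only for this illustration.

```python
import pandas as pd

# Made-up sample rows, for illustration only
sample = pd.DataFrame({
    "Time": ["08:15:00", "11:00:00", "14:05:00", "14:40:00", "21:30:00"],
    "Luminance": [300.0, 900.0, 1500.0, 1700.0, 20.0],
})

# Take the hour (first two characters of HH:MM:SS) as a number
sample["Hour"] = sample["Time"].str.slice(0, 2).astype(int)

# Average luminance for each hour of the day
by_hour = sample.groupby("Hour")["Luminance"].mean()
print(by_hour)
print("Brightest hour:", by_hour.idxmax())
```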

We also see a small peak at 11:00 am, which dips downwards to about 1:00 pm, then goes up again. Why could this be?
- Is the sun briefly darker each day?
- Is there some dude walking in front of our plant each day at the same time?
- Is the sun moving behind a tree or building which casts a shadow on our plant each morning?
    
What do you think?

There are little blips of light around 9:00pm or 10:00pm on several days. What is going on here?
- Is the sun briefly appearing in the middle of the night?
- Is the moon shining on our plant?
- Is some dude walking by with a flashlight?
- Is someone working late in the evening, and turning on room lights where the plant is?

What do you think?

May 16 and May 17 are a bit unusual. 
- Describe the features of the data that make these two days stand out as different from the others.
- Can you suggest reasons **why** these two days are unusual? What might have happened on those days?

Notice there is a difference between **noticing** something unusual in the data, and **knowing** what actually happened to make that data unusual.

## Step 6c: Statistical analysis

From the observations above, can we come up with numerical statements that make these observations quantifiable?

For instance, we saw that temperature and humidity seemed to cluster together on certain days. Is there a trend line we can see?

We use Plotly with the "trendline" parameter to attach a linear trend line to each cluster of daily data. (The "ols" option relies on the statsmodels package being installed.) We see the result below. 

In [None]:
fig6 = px.scatter(df_m, x="Temperature", y="Humidity", color="Date",
            trendline="ols", template="simple_white", title="Humidity vs Temperature")
fig6.show()

## Observation

Most of the trendlines in the previous graph appear to slope downwards. What does this mean?
- As temperature goes up, the humidity tends to go up? Or,
- As temperature goes up, the humidity tends to go down?

Which do you think it is?

While most dates have a downward-sloping trend line, May 24 stands out with a noticeably increasing trend line. Using the legend on the right of the chart, pick out May 24. 
- Can you see why the trend line is up?
- Did anything unusual happen on that day?
- Try looking at your earlier plots of moisture as a function of time, at the beginning of this notebook. What happened that day?

## Advanced work -- for a deeper dive in the analysis

The trend lines above attempt to show a linear relationship between humidity and temperature. That is, each one tries to fit a day's data with an equation of the form

$$y = mx + b.$$

Roll your mouse over one of the trend lines in the plot above. An information box should pop up. Take a look at all the information there.
- Can you see the equation for the line? 
- What values do you see for $m$? for $b$? 
- What does it mean if $m$ is negative? 
- What does it mean if $m$ is positive?
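
The values of $m$ and $b$ can also be computed directly. Here is a minimal sketch using NumPy's `polyfit` on made-up readings for a single day. This is a stand-in for illustration, not the routine Plotly itself calls, but a least-squares straight-line fit should give the same kind of line.

```python
import numpy as np

# Made-up readings for one day, for illustration only
temperature = np.array([18.0, 19.0, 20.0, 21.0, 22.0])
humidity = np.array([50.0, 48.5, 46.0, 44.5, 42.0])

# Fit a straight line y = m*x + b by least squares
m, b = np.polyfit(temperature, humidity, 1)
print(f"m = {m:.2f}, b = {b:.2f}")
```

For these invented numbers the slope $m$ comes out negative: humidity drops as temperature rises.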

There is also a value for $R^2$, called the R-squared statistical measure. It tells us what fraction of the variation in the second variable (here, humidity) is accounted for by the straight-line fit to the first (here, temperature). A value near 1.0 says the points lie close to the line. A value near 0.0 says the line explains almost none of the variation. Values midway between 0.0 and 1.0 say the second variable only partly follows the line. 
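
To make $R^2$ less mysterious, it can be computed by hand as one minus the ratio of the leftover scatter around the line to the total scatter around the average. The sketch below does this with NumPy on made-up readings. The variable names are chosen for this illustration only.

```python
import numpy as np

# Made-up readings for one day, for illustration only
temperature = np.array([18.0, 19.0, 20.0, 21.0, 22.0])
humidity = np.array([50.0, 48.5, 46.0, 44.5, 42.0])

# Straight-line fit, then the humidity the line predicts at each temperature
m, b = np.polyfit(temperature, humidity, 1)
predicted = m * temperature + b

# Scatter left over around the line, and total scatter around the average
ss_res = np.sum((humidity - predicted) ** 2)
ss_tot = np.sum((humidity - humidity.mean()) ** 2)

r_squared = 1 - ss_res / ss_tot
print(f"R squared = {r_squared:.3f}")
```

These invented points lie almost exactly on a line, so the result is close to 1. More scattered points would give a smaller value.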

Mouse over the various trend lines and look at the $R^2$ values.
- Can you find an $R^2$ that is close to zero?
- Can you find an $R^2$ that is close to one?
- What is the biggest one you can find?
- Do you notice anything about the plots with small $R^2$ values?
- Do you notice anything about the plots with large $R^2$ values?


## Step 6d: Analyzing the data - Moisture and temperature

We might expect moisture to depend on temperature, as higher temperatures could cause the soil to dry out. Let's see if the data shows this. 

Again, we use Plotly to draw the scatter plots and add some trend lines. 

In [None]:
fig6 = px.scatter(df_m, x="Temperature", y="Moisture", color="Date",
            trendline="ols", template="simple_white", title="Moisture vs Temperature")
fig6.show()

## Observations

Most of the trend lines seem to slope slightly downwards. What does this mean?
- As the temperature rises, the moisture of the soil also rises?
- As the temperature rises, the moisture of the soil drops?
- There is no strong connection between moisture and temperature, it's just random?

What do you think?

One line stands out, though. May 24 has a sharply increasing trend line. Do you recall what happened on May 24?
- There was an earthquake in Vancouver that day that messed up our sensors?
- There was a flood in Vancouver that day that drenched our plant?
- Some dude came along and watered the plant, just as the day was getting hotter?

What do you think?

Should we draw conclusions about the relationship between moisture and temperature based on the one unusual day, or on the dozen or so other days?

## Step 7: Going further

Can you explore other relationships in the data?

For instance, is there a connection between temperature and time of day? How would you explore this with code? Can you use the code above as a model?

Can you see if there is a trend in average temperatures as we go from early May to late May? How do we compute average values using Pandas?
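
As a starting point for the question about averages, here is a minimal sketch using Pandas `groupby` and `mean` on made-up readings (all values are invented). It computes one average temperature per date.

```python
import pandas as pd

# Made-up sample rows, for illustration only
sample = pd.DataFrame({
    "Date": ["2023-05-01", "2023-05-01", "2023-05-15",
             "2023-05-15", "2023-05-29", "2023-05-29"],
    "Temperature": [18.0, 20.0, 20.0, 22.0, 23.0, 25.0],
})

# Group the rows by date, then average the temperature within each group
daily_mean = sample.groupby("Date")["Temperature"].mean()
print(daily_mean)
```

The same pattern should work on **df** or **df_m**, and the result can be plotted to look for a trend across the month.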

Can you explore other ranges of dates in the dataset, beyond just May 16 to 30?

If you had your own plant to monitor, what questions would you want answered about your plant?

## Step 8 - Using your own data from the web

Many students and teachers have been saving their own plant data on the internet. 

We can access data in this notebook by downloading it from one of two places: Google Sheets, or EtherCalc. 

You may adjust the following code to download your data. If the data is on Google Sheets, set
```
google_sheet = True
```
otherwise set it to **False** and the code will use EtherCalc instead.

You should also set the name of the spreadsheet where your data lives. In the code below, change 'Plant_03' to the name of your spreadsheet.  
```
sheet_name = 'Plant_03'
```

For a first demo, you might like to just leave the code as it is. This way you examine the data at our spreadsheet called 'Plant_03' on Google Sheets.
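
If your spreadsheet name contains spaces or other special characters, the download address may need those characters encoded. The sketch below wraps the same two URL patterns in a hypothetical helper, `build_csv_url`, which is not part of this notebook. The `sheet_id` default is a placeholder to replace with your own. It only builds the address and does not download anything.

```python
from urllib.parse import quote

def build_csv_url(sheet_name, google_sheet=True, sheet_id="YOUR_SHEET_ID"):
    """Hypothetical helper: return a CSV download URL for the chosen service."""
    safe_name = quote(sheet_name)  # e.g. a space becomes %20
    if google_sheet:
        return (f"https://docs.google.com/spreadsheets/d/{sheet_id}"
                f"/gviz/tq?tqx=out:csv&sheet={safe_name}")
    return f"https://ethercalc.nomagic.uk/_/{safe_name}/csv"

print(build_csv_url("Plant 03"))
print(build_csv_url("Plant_03", google_sheet=False))
```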

In [None]:
google_sheet = True
sheet_name = 'Plant_03'

if google_sheet:
    sheet_id = "12s1bTFF0o4-i3iSsbm4-_9J358a3fPoS9lx5szZZjjE"
    url = f"https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}"
else:
    url = f"https://ethercalc.nomagic.uk/_/{sheet_name}/csv"
    
print("Data read from ", url)

df = pd.read_csv(url)
df
    

## Clean up date/time

As in the earlier demo, we add a new column to the data frame that combines date and time, for convenience.
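
Data typed into a shared spreadsheet can contain the odd malformed row. As an optional variation, assuming some of your rows might not hold a valid time, the sketch below uses `pd.to_datetime` with `errors="coerce"` so that bad rows are marked as missing and then dropped instead of stopping the code with an error. The sample rows are made up.

```python
import pandas as pd

# Made-up sample rows, one of them deliberately malformed
sample = pd.DataFrame({
    "Date": ["2023-09-05", "2023-09-05", "2023-09-06"],
    "Time": ["08:00:00", "not a time", "09:30:00"],
})

# Combine date and time, then parse; rows that cannot be parsed become NaT
stamp = pd.to_datetime(sample["Date"] + " " + sample["Time"],
                       format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Keep only the rows that parsed, and store the cleaned text column
valid = stamp.notna()
clean = sample[valid].copy()
clean["DateTime"] = stamp[valid].dt.strftime("%Y-%m-%d %H:%M:%S")
print(clean)
```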

In [None]:
df['Time'] = df['Time'].apply(lambda x: datetime.strptime(x,"%H:%M:%S").strftime("%H:%M:%S"))
df["DateTime"] = df["Date"] + ' ' + df["Time"]
df

## First plots

Here are the initial plots: Temperature, Humidity, Moisture and Luminance. 

In [None]:
fig1 = px.scatter(df, x="DateTime", y="Temperature", 
                 title="Temperature versus Time")
fig2 = px.scatter(df, x="DateTime", y="Humidity", 
                 title="Humidity versus Time")
fig3 = px.scatter(df, x="DateTime", y="Moisture", 
                 title="Moisture versus Time")
fig4 = px.scatter(df, x="DateTime", y="Luminance", 
                 title="Luminance versus Time")
# Show each figure in turn
for fig in (fig1, fig2, fig3, fig4):
    fig.show()

## Set the date range

We notice the interesting data goes from September 5th to 10th. (Your data will have a different range.)

The following code restricts the range to those dates, so we get more interesting data. 

**Note: You can change the dates to whatever you like. If you do, you will also need to rerun the code cells that use them.**

In [None]:
start_date = '2023-09-05'
end_date = '2023-09-10'
# Select DataFrame rows between two dates
mask = (df['Date'] >= start_date) & (df['Date'] <= end_date)
df_m = df[mask]
df_m

## More plots

The plots now just show this date range, with the interesting data.

In [None]:
fig1 = px.scatter(df_m, x="DateTime", y="Temperature", 
                 title="Temperature versus Time")
fig2 = px.scatter(df_m, x="DateTime", y="Humidity", 
                 title="Humidity versus Time")
fig3 = px.scatter(df_m, x="DateTime", y="Moisture", 
                 title="Moisture versus Time")
fig4 = px.scatter(df_m, x="DateTime", y="Luminance", 
                 title="Luminance versus Time")
# Show each figure in turn
for fig in (fig1, fig2, fig3, fig4):
    fig.show()

## Next step

Now continue your analysis, by repeating the code from the earlier part of this notebook.

Or, if you like, just return to step 6. The dataframes called "df" and "df_m" now hold the data from the downloaded spreadsheet, so the code above will work on this new data. 

## Conclusions

We collected data from a real plant, capturing environmental measurements such as temperature, humidity, soil moisture, and light levels.

With the help of Pandas and Plotly, we explored relationships between the different measurements. Our analysis aimed to identify correlations, such as those between temperature and humidity, or between the time of day and light levels.

Unusual data points often indicated interesting or unique plant events, like watering occurrences or late-night activities.

To provide a more quantitative perspective, we calculated trendlines to suggest mathematical relationships between these variables.

Finally, we can load plant data from the web, so we can analyze our own plant's data and compare it with data from other plants.


[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)