# Exercise 3. Basic Introduction to Probability

## Part 2. Real-World Probability (Science Garden)

### Task 1. Loading real data.

First, let's load a dataset of daily observations from Science Garden covering the years 2009 to 2024. Each row represents weather data for a single day.

In [None]:
import pandas as pd

# Load the dataset
file_path = '../../data/2009-2024_SCIENCE_GARDEN.csv'
df = pd.read_csv(file_path)

# Quick preview
df.head()


#### Questions for Task 1
1. How many rows are in the dataset? What does each row represent?
2. How many columns? What kinds of variables are present?
3. Are there any missing values?
4. What are the earliest and latest years in the data?

#### Some helpful commands

```python
# Show first few rows
df.head()

# Get the number of rows and columns
df.shape

# List all column names
df.columns

# Summary of each column
df.info()

# Quick stats (mean, min, max, etc.) for numeric columns
df.describe()
```
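
The commands above do not directly answer the questions about missing values and the range of years. Here is a minimal sketch on a tiny made-up table; the values are invented, and the `YEAR` and `PRCP_mm` column names are assumed to match the ones used later in this exercise.

```python
import numpy as np
import pandas as pd

# Tiny made-up table standing in for the real dataset (values are invented)
toy = pd.DataFrame({
    'YEAR': [2009, 2009, 2010, 2010],
    'PRCP_mm': [0.0, 12.5, np.nan, 3.2],
})

# Count missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Earliest and latest year present
first_year = toy['YEAR'].min()
last_year = toy['YEAR'].max()
print(first_year, last_year)
```

The same calls should work on `df` once the real file is loaded, provided the column names match.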

---

### Task 2a. Defining Rainy Days

Let’s define a rainy day as any day where `PRCP_mm > 0`. We'll create a new column called `is_rainy` that stores a `1` for rainy days, and `0` for dry days.

This binary classification allows us to treat rain as a simple yes/no event, perfect for computing basic probabilities.

In [None]:
# Define rainy days (1 if PRCP_mm > 0, else 0)
df['is_rainy'] = (df['PRCP_mm'] > 0).astype(int)

# Check
df[['date', 'PRCP_mm', 'is_rainy']].head(10)
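
To see what the comparison does before it is converted to integers, here is a small sketch on made-up values. Note that a missing value compares as `False`, so it would be counted as a dry day unless it is handled separately.

```python
import numpy as np
import pandas as pd

# Made-up precipitation values, including one missing observation
prcp = pd.Series([0.0, 4.1, np.nan, 0.2], name='PRCP_mm')

# The comparison gives True/False; NaN > 0 evaluates to False
is_wet = prcp > 0

# Casting booleans to int gives 1 for True and 0 for False
is_rainy = is_wet.astype(int)
print(is_rainy.tolist())  # [0, 1, 0, 1]
```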


### Task 2b. Comparing Rainy Day Probabilities for single years.

Now that we’ve defined what counts as a rainy day, we can start computing probabilities.

Let's calculate:
- Probability of a rainy day in 2009
- Probability of a rainy day in 2010

This is simply the proportion of days with `is_rainy == 1` in each year.

In [None]:
# Subset the data for each year
df_2009 = df[df['YEAR'] == 2009]
df_2010 = df[df['YEAR'] == 2010]

# Calculate the probability of a rainy day
p_rain_2009 = df_2009['is_rainy'].mean()
p_rain_2010 = df_2010['is_rainy'].mean()

print(f"Probability of rain in 2009: {p_rain_2009:.4f}")
print(f"Probability of rain in 2010: {p_rain_2010:.4f}")
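
The mean of a 0/1 column equals the number of ones divided by the number of rows, which is why `.mean()` gives the proportion of rainy days. A quick check on made-up values:

```python
import pandas as pd

# Made-up indicator column: 3 rainy days out of 8
is_rainy = pd.Series([1, 0, 0, 1, 0, 0, 1, 0])

p_by_counting = is_rainy.sum() / len(is_rainy)  # rainy days / all days
p_by_mean = is_rainy.mean()

print(p_by_counting, p_by_mean)  # 0.375 0.375
```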


### Task 2c. Comparing Rainy Day Probabilities for all available years.

In the crash course, we learned about [for loops](../../crash-course/02b_for_loops.ipynb). We can create a loop that calculates the probability of a rainy day for each year in the dataset and stores the results in a new DataFrame called `rain_prob_by_year`.

We loop through the available years (`years = sorted(df['YEAR'].unique())`), subset the DataFrame for each one (`subset = df[df['YEAR'] == y]`), compute the values we need, and append them to the `results` list. Finally, we convert `results` to a DataFrame (`rain_prob_by_year = pd.DataFrame(results)`).

In [None]:
# Get unique years in the dataset (sorted)
# The column is named 'YEAR', as in Task 2b and Task 3a
years = sorted(df['YEAR'].unique())

# Create an empty list to store results
results = []

# Loop through each year and calculate rainy day probability
for y in years:
    subset = df[df['YEAR'] == y]
    prob_rain = subset['is_rainy'].mean()
    total_rain = subset['PRCP_mm'].sum()
    results.append({'year': y, 'P_rain': prob_rain, 'total_rain': total_rain})

# Convert the list of results to a DataFrame
rain_prob_by_year = pd.DataFrame(results)

# Show the results
rain_prob_by_year
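
The same kind of table can also be built without an explicit loop. This is an alternative to the approach above, not a replacement for it: a sketch using pandas `groupby` with named aggregation, run here on made-up values. The column name `YEAR` is assumed to match the earlier tasks.

```python
import pandas as pd

# Made-up daily records for two years
toy = pd.DataFrame({
    'YEAR': [2009, 2009, 2009, 2009, 2010, 2010],
    'PRCP_mm': [0.0, 5.0, 0.0, 15.0, 2.0, 4.0],
})
toy['is_rainy'] = (toy['PRCP_mm'] > 0).astype(int)

# One row per year: mean of the indicator and sum of the rainfall
by_year = (
    toy.groupby('YEAR')
       .agg(P_rain=('is_rainy', 'mean'), total_rain=('PRCP_mm', 'sum'))
       .reset_index()
)
print(by_year)
```

The loop makes each step visible, which is useful while learning; `groupby` is shorter once the idea is familiar.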


### Task 2d. Visualize rainy day probabilities.

Now that we have the yearly probabilities of rainy days stored in `rain_prob_by_year`, let’s create a plot to visualize how the likelihood of rain changes from year to year.

You can start with something like:

```python
import matplotlib.pyplot as plt
plt.plot(...)
plt.show()
```

Hint: 
[`plt.plot()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html) uses the following syntax

```python
plt.plot([x], y, [fmt], *, data=None, **kwargs)
```

We can select individual columns of `rain_prob_by_year` as follows:
- `rain_prob_by_year['year']`
- `rain_prob_by_year['P_rain']`
- `rain_prob_by_year['total_rain']`


Create two cells below that plot `P_rain` and `total_rain`.

In [None]:
import matplotlib.pyplot as plt
plt.plot(rain_prob_by_year['year'], ... , '-o')
plt.title('Probability of rain per year, Science Garden (2009-2024).')
plt.show()

In [None]:
import matplotlib.pyplot as plt
plt.plot(rain_prob_by_year['year'], ... , '-ro')
plt.title('Total rain per year, Science Garden (2009-2024).')
plt.show()
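
A plot is easier to read when its axes are labeled. The cells above call `plt.plot` directly; the sketch below uses the equivalent `fig, ax = plt.subplots()` style so the labels can be set on the axes object. The numbers are made up and only the column names mirror `rain_prob_by_year`.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up yearly summary with the same column names as rain_prob_by_year
toy_summary = pd.DataFrame({
    'year': [2009, 2010, 2011],
    'P_rain': [0.40, 0.35, 0.45],
})

fig, ax = plt.subplots()
ax.plot(toy_summary['year'], toy_summary['P_rain'], '-o')
ax.set_xlabel('Year')
ax.set_ylabel('Probability of a rainy day')
ax.set_xticks(toy_summary['year'])  # show whole years on the x-axis
ax.set_title('Toy example (made-up values)')
plt.show()
```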

#### Questions for Task 2
1. Between 2009 and 2010, which year had a higher chance of rain? Can you guess why? What might have influenced this?
2. Compare the plots of probabilities and total rainfall. Which other years have high probabilities of rain? Which years have high amounts of rain?
3. What kind of rain events might increase the total rainfall but not the frequency?
4. Total rainfall measures **how much rain falls**, while the probability of rain measures **how often it rains**. When is it useful to calculate the total amount of rain? When is it more useful to focus on the probability or frequency of rain?

---

### Task 3a. Probability of a Hot and Rainy Day (Joint Event)
Now that we've looked at rainy days, let’s introduce another condition: hot days.

We’ll define a hot day as any day with a maximum temperature (`MAX_C`) above a chosen threshold. Here we use 35°C.

P(hot ∩ rainy) = Probability that a day is both hot and rainy in 2010

In [None]:
# Focus on a single year
df_2010 = df[df['YEAR'] == 2010].copy()

# Define hot days: MAX_C > 35
df_2010['is_hot'] = (df_2010['MAX_C'] > 35).astype(int)

# Joint event: hot AND rainy
df_2010['hot_and_rainy'] = ((df_2010['is_hot'] == 1) & (df_2010['is_rainy'] == 1)).astype(int)

# Calculate joint probability
p_hot_and_rainy = df_2010['hot_and_rainy'].mean()

print(f"Probability of a hot and rainy day in 2010: {p_hot_and_rainy:.4f}")
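
A joint probability can also be read off a two-way table of proportions. The sketch below uses `pd.crosstab` with `normalize='all'` on made-up indicator columns; each cell is the share of all days that fall in that combination, and the four cells sum to 1.

```python
import pandas as pd

# Made-up indicators for 10 days (1 = yes, 0 = no)
toy = pd.DataFrame({
    'is_hot':   [1, 1, 0, 0, 0, 1, 0, 0, 1, 0],
    'is_rainy': [1, 0, 1, 1, 0, 0, 1, 0, 0, 1],
})

# Joint probability computed the same way as in the cell above
p_hot_and_rainy = ((toy['is_hot'] == 1) & (toy['is_rainy'] == 1)).mean()

# Two-way table of joint probabilities for all four combinations
joint = pd.crosstab(toy['is_hot'], toy['is_rainy'], normalize='all')
print(joint)
print(p_hot_and_rainy)
```

The cell at row `is_hot = 1`, column `is_rainy = 1` should match `p_hot_and_rainy`; the other cells show how common the remaining combinations are.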


### Task 3b. Probability of a Cool and Rainy Day (Joint Event)
In a new cell below, copy the code from Task 3a and adapt it to calculate the joint probability of cool (`MIN_C < 25`) and rainy days.

#### Questions for Task 3
1. Is the probability of a cool and rainy day higher or lower than the probability of a hot and rainy day in 2010?
2. Does this match your expectations based on your own experience of rainy weather?
3. What might be the typical season or weather pattern that causes cool and rainy days in the Philippines?
4. Why might minimum temperature be more closely linked to rain than maximum temperature?

---

### Final Questions

1. What are the advantages of using probability instead of just counting rainy or hot days?
2. How does working with real data (versus synthetic data) change your understanding of uncertainty?
3. Which combinations of weather conditions were common? Which were rare?
4. When is it more helpful to calculate the total amount of rain, and when is it more useful to calculate the probability of rain? (What kinds of questions can each one answer?)
5. What are some challenges or limitations you noticed when working with real data?
6. If you had more time or more weather variables, what patterns would you want to explore next?