# Exercise 4. Exploratory Data Analysis (EDA)

## Task 1: Loading the Dataset
Load the dataset `2009_SCIENCE_GARDEN.csv` using the relative path provided. Display the first few rows of the data to understand its structure, and check for any missing values.

In [None]:
import pandas as pd

# Load the dataset using the relative path
data_path = '../../lessons/03_statistics/data/2009_SCIENCE_GARDEN.csv'
df = pd.read_csv(data_path)

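# Display the first few rows (print is needed because a cell only
# auto-displays its last expression)
print(df.head())

# Display basic info: column names, dtypes, and non-null counts
df.info()

# Count missing values per column
df.isnull().sum()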


### Questions
1. Are there missing values in the dataset? Which columns contain these missing values?
2. Based on the first few rows, what variables are available in this dataset, and what types of data do they contain?
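
As a minimal sketch of how the two questions above could be answered in code, the snippet below runs the same checks on a toy frame. The values are invented for illustration; only the column names (`DATE`, `DAILY RAINFALL`, `TEMP MEAN`) mirror those used in later tasks of this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; only the column names mirror the notebook.
toy = pd.DataFrame({
    "DATE": ["2009-01-01", "2009-01-02", "2009-01-03", "2009-01-04"],
    "DAILY RAINFALL": [0.0, 12.5, np.nan, 3.2],
    "TEMP MEAN": [27.1, np.nan, np.nan, 28.4],
})

# Count missing values per column, then keep only columns that have any
missing_counts = toy.isnull().sum()
columns_with_missing = missing_counts[missing_counts > 0]
print(columns_with_missing)

# dtypes shows which columns are numeric and which are still plain strings
print(toy.dtypes)
```

Note that `DATE` shows up with a string (object) dtype here because it has not been converted to datetime.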

---
## Task 2: Summary Statistics
Calculate and interpret the summary statistics (mean, median, min, max, standard deviation) for each numeric column in the dataset.

In [None]:
# Summary statistics for all numeric columns
summary_stats = df.describe()
print(summary_stats)

# Calculate specific statistics for the 'DAILY RAINFALL' column
rainfall_mean = df['DAILY RAINFALL'].mean()
rainfall_median = df['DAILY RAINFALL'].median()

print(f"Mean Rainfall: {rainfall_mean:.2f}")
print(f"Median Rainfall: {rainfall_median:.2f}")

# Same statistics for the 'TEMP MEAN' column
mean_temp_mean = df['TEMP MEAN'].mean()
mean_temp_median = df['TEMP MEAN'].median()

print(f"Mean of Mean Temperature: {mean_temp_mean:.2f}")
print(f"Median of Mean Temperature: {mean_temp_median:.2f}")

### Questions

1. What are the mean and median values for Daily Rainfall? Mean Temperature? Max Temperature? Minimum Temperature?
2. How does the mean compare to the median for each variable? What might this indicate about the data's distribution?
3. What is the maximum daily rainfall? How might this represent extreme weather events?
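
To make the mean-versus-median comparison concrete, here is a small sketch on an invented rainfall sample (not values from the dataset). It assumes the typical situation of many dry or light-rain days plus one extreme day, and uses pandas' `Series.skew()` as one possible numeric indicator of asymmetry.

```python
import pandas as pd

# Invented values: mostly dry or light-rain days plus one extreme day.
rain = pd.Series([0.0, 0.0, 0.0, 1.0, 2.0, 3.0, 50.0])

mean_value = rain.mean()      # pulled upward by the single 50 mm day
median_value = rain.median()  # middle value, unaffected by the extreme
skewness = rain.skew()        # positive when the long tail is on the right

print(f"Mean: {mean_value:.2f}, Median: {median_value:.2f}, Skewness: {skewness:.2f}")
```

A mean well above the median, together with positive skewness, points to a right-skewed distribution.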

---
## Task 3: Histogram for Rainfall Distribution
Create a histogram for `DAILY RAINFALL` to visualize its distribution. Look for any skewness or extreme values in the rainfall data.

In [None]:
import matplotlib.pyplot as plt

# Histogram for Daily Rainfall
plt.figure(figsize=(8, 5))
plt.hist(df['DAILY RAINFALL'].dropna(), bins=20, color='skyblue', edgecolor='black')
plt.title('Daily Rainfall Distribution')
plt.xlabel('Daily Rainfall (mm)')
plt.ylabel('Frequency')
plt.show()


### Questions

1. Describe the shape of the Daily Rainfall distribution. Is it symmetric, positively skewed, or negatively skewed?
2. How might the shape of the histogram reflect the frequency of rainy versus dry days?
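
The height of the first histogram bar is largely driven by dry days. As a rough sketch, the share of dry versus rainy days can be computed directly; the values below are invented for illustration.

```python
import pandas as pd

# Invented daily rainfall values in mm
rain = pd.Series([0.0, 0.0, 5.0, 0.0, 12.0, 0.0, 0.0, 1.5])

# The mean of a boolean Series is the fraction of True values
dry_share = (rain == 0).mean()
rainy_share = (rain > 0).mean()

print(f"Dry days: {dry_share:.1%}, rainy days: {rainy_share:.1%}")
```

If the real column contains missing values, those days count as neither dry nor rainy in this calculation, so the two shares may not sum to 100%.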

---
## Task 4: Analyze Rainy Days Only
Drop the days with zero rainfall and create a new histogram for `DAILY RAINFALL`. Compare this histogram to the one in Task 3 to see how excluding zero-rain days changes the distribution.


In [None]:
# Create a new DataFrame that only includes days with rainfall
rainy_days_df = df[df['DAILY RAINFALL'] > 0]

# Plot histogram for Daily Rainfall (rainy days only)
plt.figure(figsize=(8, 5))
plt.hist(rainy_days_df['DAILY RAINFALL'], bins=20, color='skyblue', edgecolor='black')
plt.title('Daily Rainfall Distribution (Rainy Days Only)')
plt.xlabel('Daily Rainfall (mm)')
plt.ylabel('Frequency')
plt.show()


### Questions
1. How does this histogram differ from the previous histogram that included all days?
2. How does excluding zero-rain days affect the mean and median of daily rainfall? HINT: compare `rainy_days_df['DAILY RAINFALL'].mean()` and `df['DAILY RAINFALL'].mean()`. Use additional cells if needed.
3. In what scenarios might it be useful to exclude zero-rain days versus include them?
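
The comparison suggested in the hint can be sketched on a toy sample as follows. The values are invented and only meant to show the direction of the effect, not the numbers you should expect from the dataset.

```python
import pandas as pd

# Invented values: half of the days are dry
rain = pd.Series([0.0, 0.0, 0.0, 0.0, 2.0, 4.0, 6.0, 8.0])
rainy_only = rain[rain > 0]

all_mean, all_median = rain.mean(), rain.median()
rainy_mean, rainy_median = rainy_only.mean(), rainy_only.median()

print(f"All days:   mean={all_mean:.2f}, median={all_median:.2f}")
print(f"Rainy days: mean={rainy_mean:.2f}, median={rainy_median:.2f}")
```

Removing the zeros raises both statistics, because every dropped value sits at the bottom of the distribution.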

---
## Task 5: Cumulative Rainfall Over Time
Calculate the cumulative rainfall over time to observe how rainfall accumulates. Plot the cumulative rainfall to identify any trends or periods with significant rainfall.

In [None]:
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

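# Make sure DATE is a datetime column so the month locator below works
df['DATE'] = pd.to_datetime(df['DATE'], errors='coerce')

# Calculate cumulative rainfall
df['Cumulative Rainfall'] = df['DAILY RAINFALL'].cumsum()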

# Plot cumulative rainfall over time with monthly x-axis ticks
plt.figure(figsize=(10, 5))
plt.plot(df['DATE'], df['Cumulative Rainfall'], color='tab:blue')
plt.title('Cumulative Rainfall Over Time')
plt.xlabel('Date')
plt.ylabel('Cumulative Rainfall (mm)')

# Set x-axis to display one tick per month
plt.gca().xaxis.set_major_locator(mdates.MonthLocator())
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%b'))  # Abbreviated month name, e.g. 'Jan'

plt.xticks(rotation=45)  # Rotate labels for readability
plt.tight_layout()
plt.show()

### Questions

1. Describe the trend of cumulative rainfall over time. Are there any periods with rapid accumulation?
2. What could be some causes for sudden increases in cumulative rainfall?
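
A steep step in the cumulative curve corresponds to a day with heavy rainfall. The sketch below, using invented values and a hypothetical six-day date range, shows how `cumsum()` accumulates the series and how `idxmax()` can point to the day behind the largest jump.

```python
import pandas as pd

# Invented six-day series indexed by date
dates = pd.date_range("2009-01-01", periods=6, freq="D")
rain = pd.Series([0.0, 5.0, 0.0, 40.0, 2.0, 0.0], index=dates)

# Running total: each entry is the sum of all rainfall up to that day
cumulative = rain.cumsum()

# The largest single-day jump in the cumulative curve is the wettest day
wettest_day = rain.idxmax()

print(cumulative)
print(f"Largest jump on: {wettest_day.date()}")
```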

---
## Task 6: Handling Outliers in Daily Rainfall
The `DAILY RAINFALL` data contains several extreme outliers and many zero-rain days. For this part of the exercise:

- First, filter the data to exclude zero-rain days.
- Then, filter out rainfall values above the 95th percentile to remove extreme outliers.
- Plot a new boxplot for the filtered data.
- Compare the summary statistics (mean and median) for:
    - All days (including zero-rain and outliers)
    - Rainy days only (excluding zero-rain)
    - Filtered rainy days (excluding zero-rain and the top 5% of rainfall values)

In [None]:
# Calculate the 95th percentile for DAILY RAINFALL (non-zero only)
threshold = df[df['DAILY RAINFALL'] > 0]['DAILY RAINFALL'].quantile(0.95)

# Filter data to exclude zero-rain days
rainy_days_df = df[df['DAILY RAINFALL'] > 0]

# Further filter to exclude outliers above the 95th percentile
filtered_rainy_days_df = rainy_days_df[rainy_days_df['DAILY RAINFALL'] <= threshold]

# Plot boxplot for filtered DAILY RAINFALL data (rainy days only, no extreme outliers)
plt.figure(figsize=(6, 4))
plt.boxplot(filtered_rainy_days_df['DAILY RAINFALL'], vert=False)
plt.title('DAILY RAINFALL Boxplot (Filtered Rainy Days Only)')
plt.xlabel('DAILY RAINFALL (mm)')
plt.show()

# Summary statistics
# All days (with zeros and outliers)
original_mean = df['DAILY RAINFALL'].mean()
original_median = df['DAILY RAINFALL'].median()

# Rainy days only (no zeros, but includes outliers)
rainy_mean = rainy_days_df['DAILY RAINFALL'].mean()
rainy_median = rainy_days_df['DAILY RAINFALL'].median()

# Filtered rainy days only (no zeros, no extreme outliers)
filtered_mean = filtered_rainy_days_df['DAILY RAINFALL'].mean()
filtered_median = filtered_rainy_days_df['DAILY RAINFALL'].median()

print("Summary Statistics:")
print(f"Original Mean (All Days): {original_mean:.2f} mm")
print(f"Original Median (All Days): {original_median:.2f} mm\n")
print(f"Mean (Rainy Days Only): {rainy_mean:.2f} mm")
print(f"Median (Rainy Days Only): {rainy_median:.2f} mm\n")
print(f"Filtered Mean (Filtered Rainy Days Only): {filtered_mean:.2f} mm")
print(f"Filtered Median (Filtered Rainy Days Only): {filtered_median:.2f} mm")


### Questions
1. Are there any outliers in the `DAILY RAINFALL` data? If so, what might they represent?
2. How might these outliers affect the mean and median values of rainfall?
3. How do zero-rain days affect the mean and median of rainfall?
4. After filtering out both zero-rain days and the top 5% of rainfall values, how did the mean and median change? What does this suggest about the influence of zero values and extreme values on rainfall statistics?
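
The task above uses a 95th-percentile cutoff. As a complementary sketch (a different rule, not the one used in the exercise code), the snippet below flags outliers with the 1.5 × IQR fence, which is the convention matplotlib's `boxplot` uses by default to decide which points are drawn individually beyond the whiskers. The values are invented.

```python
import pandas as pd

# Invented rainy-day values in mm, with one extreme day
rain = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 5.0, 60.0])

q1 = rain.quantile(0.25)
q3 = rain.quantile(0.75)
iqr = q3 - q1

# Points above this fence would be drawn as individual markers in a boxplot
upper_fence = q3 + 1.5 * iqr
outliers = rain[rain > upper_fence]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}, upper fence={upper_fence}")
print(outliers)
```

Comparing the points flagged this way with those removed by the percentile cutoff can help judge how sensitive the conclusions are to the choice of outlier rule.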

---
## Task 7: Scatter Plot - Temperature vs Rainfall
From the `filtered_rainy_days_df` dataframe, create a scatter plot of `TEMP MEAN` vs. `DAILY RAINFALL` to explore if there’s any relationship between temperature and rainfall.

In [None]:
import seaborn as sns
from scipy.stats import pearsonr

# Scatter plot with trend line
plt.figure(figsize=(8, 5))
sns.regplot(x='TEMP MEAN', y='DAILY RAINFALL', data=filtered_rainy_days_df, ci=None, line_kws={"color": "red"})
plt.title("Temperature Mean vs Daily Rainfall (Filtered Data)")
plt.xlabel("Temperature Mean (°C)")
plt.ylabel("Daily Rainfall (mm)")
plt.show()

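# Calculate Pearson's correlation coefficient
# (pearsonr does not skip NaN values, so drop incomplete pairs first)
pairs = filtered_rainy_days_df[['TEMP MEAN', 'DAILY RAINFALL']].dropna()
pearson_corr, p_value = pearsonr(pairs['TEMP MEAN'], pairs['DAILY RAINFALL'])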
print(f"Pearson's Correlation Coefficient: {pearson_corr:.2f}")
print(f"P-value: {p_value:.3f}")


### Questions
1. What is Pearson’s correlation coefficient between TEMP MEAN and Daily Rainfall? Is it positive, negative, or close to zero?
2. Based on the correlation coefficient and the trend line, do you observe a meaningful relationship between temperature and rainfall for moderate rainy days?
3. How does the trend line help in visualizing the relationship, or lack thereof, between temperature and rainfall?
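
To see what the correlation coefficient measures, the following sketch computes Pearson's r directly from its definition and checks it against pandas' `Series.corr`. The toy values are invented, and one row is deliberately incomplete to show that missing pairs need to be dropped first.

```python
import numpy as np
import pandas as pd

# Invented values; the third row has a missing temperature
toy = pd.DataFrame({
    "TEMP MEAN": [26.0, 27.0, np.nan, 28.0, 29.0],
    "DAILY RAINFALL": [8.0, 6.0, 5.0, 5.0, 2.0],
})

# Keep only rows where both variables are present
pairs = toy.dropna()
x = pairs["TEMP MEAN"].to_numpy()
y = pairs["DAILY RAINFALL"].to_numpy()

# Pearson's r: sum of products of deviations, scaled by both spreads
dx, dy = x - x.mean(), y - y.mean()
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

# pandas computes the same quantity (method='pearson' is the default)
r_pandas = pairs["TEMP MEAN"].corr(pairs["DAILY RAINFALL"])

print(f"Manual r: {r_manual:.3f}, pandas r: {r_pandas:.3f}")
```

A value near -1 or +1 indicates a strong linear relationship, while a value near 0 indicates little or no linear relationship.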

---
## Task 8: Exploring Seasonal Patterns
Group the `DAILY RAINFALL` data by month to observe seasonal patterns. Calculate the mean rainfall for each month and plot it to see how rainfall changes over the year.

In [None]:
# Convert 'DATE' column to datetime format
df['DATE'] = pd.to_datetime(df['DATE'], errors='coerce')  # 'errors=coerce' handles any invalid dates by setting them to NaT

# Add a 'Month' column based on the DATE
df['Month'] = df['DATE'].dt.month

# Calculate mean monthly rainfall
monthly_rainfall = df.groupby('Month')['DAILY RAINFALL'].mean()

# Plot mean monthly rainfall
plt.figure(figsize=(8, 5))
monthly_rainfall.plot(kind='bar', color='skyblue', edgecolor='black')
plt.title("Average Daily Rainfall by Month")
plt.xlabel("Month")
plt.ylabel("Average Daily Rainfall (mm)")
plt.xticks(rotation=0)
plt.show()


### Questions
1. Which month(s) have the highest average rainfall? What might be causing this seasonal pattern?
2. How does the seasonal variation in rainfall align with your expectations for the region?
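
The mean daily rainfall per month and the total rainfall per month answer slightly different questions. The sketch below contrasts the two on an invented toy frame, and also shows what `errors='coerce'` does to an unparseable date.

```python
import pandas as pd

# Invented values; the last DATE entry is deliberately invalid
toy = pd.DataFrame({
    "DATE": ["2009-01-10", "2009-01-20", "2009-07-05", "2009-07-15", "not a date"],
    "DAILY RAINFALL": [0.0, 2.0, 30.0, 10.0, 5.0],
})

# Invalid dates become NaT instead of raising an error
toy["DATE"] = pd.to_datetime(toy["DATE"], errors="coerce")
toy["Month"] = toy["DATE"].dt.month

# groupby drops rows whose key is missing, so the NaT row is excluded
monthly_mean = toy.groupby("Month")["DAILY RAINFALL"].mean()
monthly_total = toy.groupby("Month")["DAILY RAINFALL"].sum()

print(monthly_mean)
print(monthly_total)
```

Because rows with unparseable dates silently drop out of the grouping, it is worth checking `df['DATE'].isnull().sum()` after the conversion.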