# Weather Data Exploratory Analysis for NASA Space Apps Challenge

This notebook explores weather data for our "Will it rain on my parade?" project, which aims to predict rainfall and weather conditions for specific locations. This analysis will help us understand the relationships between different weather variables and precipitation patterns.

## Project Objective
Develop a predictive model that can accurately forecast weather conditions, specifically rainfall, for particular locations to help users plan outdoor activities effectively.

## 1. Project Setup and Configuration

In [None]:
# Import required libraries
import os
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings

# For geospatial data handling and visualization
import geopandas as gpd
try:
    import cartopy.crs as ccrs
    import cartopy.feature as cfeature
    HAS_CARTOPY = True
except ImportError:
    HAS_CARTOPY = False
    print("Cartopy not installed. Some map visualizations will be disabled.")

# For statistical analysis
from scipy import stats
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# For interactive visualizations
import plotly.express as px
import plotly.graph_objects as go

# Configure visualization settings
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_context("notebook", font_scale=1.2)
warnings.filterwarnings('ignore')

# Display settings for pandas
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', 1000)

print("Project environment configured successfully!")

In [None]:
# Add project root to Python path to import our custom modules
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.append(project_root)

# Create helper function to ensure data directories exist
def ensure_data_dirs():
    """Create data directories if they don't exist"""
    data_dirs = [
        os.path.join(project_root, 'data', 'raw'),
        os.path.join(project_root, 'data', 'processed'),
        os.path.join(project_root, 'data', 'raw', 'nasa'),
        os.path.join(project_root, 'data', 'raw', 'noaa'),
        os.path.join(project_root, 'data', 'raw', 'openweathermap'),
        os.path.join(project_root, 'data', 'raw', 'scraped')
    ]
    
    for directory in data_dirs:
        os.makedirs(directory, exist_ok=True)
        
    print(f"Data directories created at {os.path.join(project_root, 'data')}")

ensure_data_dirs()

## 2. Data Collection and Analysis

In this section, we'll load sample weather data and analyze its structure. In a real scenario, this data would come from:
1. NASA Earth Data (IMERG, MODIS, MERRA-2)
2. NOAA Weather Data
3. OpenWeatherMap API
4. Scraped data from weather websites using AI-Agent-Scraper

For this notebook, we'll create some sample data to demonstrate the analysis process.
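Because each provider names its fields differently (the simulated sources later in this notebook use `precipitation_mm`, `rainfall_mm`, and `rain_1h`), the real feeds will need to be mapped onto one schema before they can be combined. The sketch below shows one way this might look. `COLUMN_MAPS` and `harmonize` are hypothetical names introduced for illustration, and the sketch assumes the providers already agree on units and accumulation windows, which real data may not.

```python
import pandas as pd

# Hypothetical mapping from provider-specific column names to a shared schema.
# The provider-side names mirror the simulated sources in this notebook.
COLUMN_MAPS = {
    "NASA": {"precipitation_mm": "precipitation", "surface_temp_celsius": "temperature"},
    "NOAA": {"rainfall_mm": "precipitation"},
    "OpenWeatherMap": {"rain_1h": "precipitation"},
}

def harmonize(frames):
    """Rename provider-specific columns and stack the frames into one table."""
    renamed = [
        df.rename(columns=COLUMN_MAPS[source]).assign(source=source)
        for source, df in frames.items()
    ]
    # Columns that a provider does not report are filled with NaN by concat.
    return pd.concat(renamed, ignore_index=True, sort=False)

nasa = pd.DataFrame({"date": ["2024-01-01"], "surface_temp_celsius": [3.2], "precipitation_mm": [0.4]})
noaa = pd.DataFrame({"date": ["2024-01-01"], "temperature": [3.0], "humidity": [71.0], "rainfall_mm": [0.5]})

combined = harmonize({"NASA": nasa, "NOAA": noaa})
print(combined)
```

Keeping the `source` column makes it possible to compare providers later or to prefer one when they disagree.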

In [None]:
# Generate sample weather data for demonstration
def generate_sample_weather_data(days=365, locations=3):
    """
    Generate sample weather data for demonstration purposes.
    
    Args:
        days: Number of days of data to generate
        locations: Number of different locations
    
    Returns:
        Pandas DataFrame with sample weather data
    """
    np.random.seed(42)  # For reproducibility
    
    # Sample locations (lat, lon, name)
    location_data = [
        (40.7128, -74.0060, "New York"),
        (34.0522, -118.2437, "Los Angeles"),
        (41.8781, -87.6298, "Chicago"),
        (29.7604, -95.3698, "Houston"),
        (39.9526, -75.1652, "Philadelphia")
    ]
    
    # Select the first n locations
    selected_locations = location_data[:locations]
    
    # Create a daily date range of exactly `days` entries ending today.
    # normalize() drops the time of day so every timestamp falls on midnight.
    end_date = pd.Timestamp.now().normalize()
    date_range = pd.date_range(end=end_date, periods=days, freq='D')
    
    # Generate data
    data = []
    
    for location in selected_locations:
        lat, lon, name = location
        
        for date in date_range:
            # Base temperature with seasonal pattern (higher in summer, lower in winter)
            # Northern hemisphere seasonal pattern
            day_of_year = date.dayofyear
            # A sine wave peaks a quarter period (~91 days) after its zero crossing,
            # so an offset of 105 days puts the maximum near day 196 (mid-July)
            seasonal_factor = np.sin(2 * np.pi * (day_of_year - 105) / 365)
            
            # Temperature depends on location and season
            base_temp = 15 + 10 * seasonal_factor  # Base temperature with seasonal variation
            
            # For southern locations, make it warmer
            lat_factor = (90 - abs(lat)) / 30  # Higher for locations closer to equator
            temp_with_lat = base_temp + lat_factor * 5
            
            # Add some random variation
            temperature = temp_with_lat + np.random.normal(0, 2)
            
            # Humidity: inversely related to temperature but with variation
            humidity = 70 - (temperature - 15) + np.random.normal(0, 10)
            humidity = max(min(humidity, 100), 20)  # Keep between 20% and 100%
            
            # Pressure: normal atmospheric with random variation
            pressure = 1013 + np.random.normal(0, 5)
            
            # Wind speed: random, shifted slightly by the pressure anomaly
            wind_speed = max(0, np.random.gamma(2, 2) + (pressure - 1013) / 10)
            
            # Wind direction: random
            wind_direction = np.random.uniform(0, 360)
            
            # Cloud cover: correlated with humidity
            cloud_cover = humidity * 0.8 + np.random.normal(0, 10)
            cloud_cover = max(min(cloud_cover, 100), 0)  # Keep between 0% and 100%
            
            # Dew point: based on temperature and humidity
            dew_point = temperature - ((100 - humidity) / 5) + np.random.normal(0, 1)
            
            # Precipitation: more likely with higher humidity and cloud cover
            precip_probability = (humidity + cloud_cover) / 200  # Scale to 0-1
            precip_intensity = np.random.exponential(1) if np.random.random() < precip_probability else 0
            
            # More precipitation in spring/fall for mid latitudes: this factor peaks
            # near mid-April and mid-October and vanishes in mid-January and mid-July
            shoulder_season_factor = np.abs(np.sin(2 * np.pi * (day_of_year - 15) / 365))
            season_precip_factor = shoulder_season_factor * (1 - abs(lat) / 90)
            precipitation = precip_intensity * season_precip_factor if precip_intensity > 0 else 0
            
            # UV Index: higher in summer, lower in winter, affected by cloud cover
            max_uv = 10 * (0.5 + 0.5 * seasonal_factor) * (1 - cloud_cover / 200)
            uv_index = max(0, max_uv + np.random.normal(0, 1))
            
            # Visibility: inversely related to precipitation and humidity
            visibility = 10 - (precipitation * 2) - (humidity - 50) / 10 + np.random.normal(0, 1)
            visibility = max(min(visibility, 10), 0.1)  # Keep between 0.1 and 10 km
            
            data.append({
                'date': date,
                'location': name,
                'latitude': lat,
                'longitude': lon,
                'temperature': round(temperature, 1),
                'humidity': round(humidity, 1),
                'pressure': round(pressure, 1),
                'wind_speed': round(wind_speed, 1),
                'wind_direction': round(wind_direction, 1),
                'cloud_cover': round(cloud_cover, 1),
                'dew_point': round(dew_point, 1),
                'precipitation': round(precipitation, 2),
                'uv_index': round(uv_index, 1),
                'visibility': round(visibility, 1)
            })
    
    df = pd.DataFrame(data)
    return df

# Generate sample data for one year across multiple locations
weather_data = generate_sample_weather_data(days=365, locations=5)

# Display the first few rows
print(f"Sample weather data shape: {weather_data.shape}")
weather_data.head()

In [None]:
# Simulate loading data from different sources and combining them
def simulate_data_sources():
    """Simulate loading data from multiple sources with slightly different structures"""
    
    # Base data is our weather_data from above
    base_data = weather_data.copy()
    
    # Subset 1: "NASA" data - focus on cloud cover, temperature, and precipitation
    nasa_data = base_data[['date', 'location', 'latitude', 'longitude', 
                           'temperature', 'cloud_cover', 'precipitation']].copy()
    nasa_data.rename(columns={'precipitation': 'precipitation_mm',
                             'temperature': 'surface_temp_celsius'}, inplace=True)
    nasa_data['source'] = 'NASA'
    
    # Subset 2: "NOAA" data - focus on full weather metrics
    noaa_data = base_data[['date', 'location', 'latitude', 'longitude', 
                          'temperature', 'humidity', 'pressure', 'wind_speed',
                          'wind_direction', 'precipitation']].copy()
    noaa_data.rename(columns={'precipitation': 'rainfall_mm'}, inplace=True)
    noaa_data['source'] = 'NOAA'
    
    # Subset 3: "OpenWeatherMap" data - comprehensive metrics
    owm_data = base_data.copy()
    owm_data.rename(columns={'precipitation': 'rain_1h'}, inplace=True)
    owm_data['source'] = 'OpenWeatherMap'
    
    # Save sample datasets to the data/raw directory for future use
    data_dir = os.path.join(project_root, 'data', 'raw')
    
    nasa_data.to_csv(os.path.join(data_dir, 'nasa', 'nasa_sample_data.csv'), index=False)
    noaa_data.to_csv(os.path.join(data_dir, 'noaa', 'noaa_sample_data.csv'), index=False)
    owm_data.to_csv(os.path.join(data_dir, 'openweathermap', 'owm_sample_data.csv'), index=False)
    
    print("Sample data files saved to data/raw directory")
    return nasa_data, noaa_data, owm_data

# Generate and save data from simulated sources
nasa_data, noaa_data, owm_data = simulate_data_sources()

# Display sample from each source
print("\nSample NASA data:")
print(nasa_data.head(3))

print("\nSample NOAA data:")
print(noaa_data.head(3))

print("\nSample OpenWeatherMap data:")
print(owm_data.head(3))

## 3. Exploratory Data Analysis

Now let's perform exploratory data analysis on our weather data to understand patterns, distributions, and relationships between variables.

In [None]:
# Basic statistics of our weather data
weather_data.describe()

In [None]:
# Check data types and missing values
print("Data types:")
print(weather_data.dtypes)
print("\nMissing values:")
print(weather_data.isnull().sum())

In [None]:
# Distribution of key weather variables
plt.figure(figsize=(18, 12))

# Temperature distribution
plt.subplot(2, 3, 1)
sns.histplot(weather_data['temperature'], kde=True)
plt.title('Temperature Distribution')
plt.xlabel('Temperature (°C)')

# Humidity distribution
plt.subplot(2, 3, 2)
sns.histplot(weather_data['humidity'], kde=True)
plt.title('Humidity Distribution')
plt.xlabel('Humidity (%)')

# Precipitation distribution
plt.subplot(2, 3, 3)
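# Precipitation is right-skewed with many zeros; clip zeros to 0.01 mm so they
# can be drawn on a log axis (they show up as a spike in the lowest bin)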
sns.histplot(weather_data['precipitation'].clip(lower=0.01), kde=True, log_scale=True)
plt.title('Precipitation Distribution (log scale)')
plt.xlabel('Precipitation (mm)')

# Wind speed distribution
plt.subplot(2, 3, 4)
sns.histplot(weather_data['wind_speed'], kde=True)
plt.title('Wind Speed Distribution')
plt.xlabel('Wind Speed (m/s)')

# Cloud cover distribution
plt.subplot(2, 3, 5)
sns.histplot(weather_data['cloud_cover'], kde=True)
plt.title('Cloud Cover Distribution')
plt.xlabel('Cloud Cover (%)')

# Pressure distribution
plt.subplot(2, 3, 6)
sns.histplot(weather_data['pressure'], kde=True)
plt.title('Pressure Distribution')
plt.xlabel('Pressure (hPa)')

plt.tight_layout()
plt.show()

In [None]:
# Correlation analysis between weather variables
correlation_matrix = weather_data.select_dtypes(include=[np.number]).corr()

plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1, fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix of Weather Variables', fontsize=16)
plt.tight_layout()
plt.show()

# Focus on correlation with precipitation
precip_corr = correlation_matrix['precipitation'].sort_values(ascending=False)
print("\nCorrelation with Precipitation:")
print(precip_corr)

In [None]:
# Time series analysis of temperature and precipitation
# Group by date and location: average temperature, humidity, and cloud cover; sum precipitation
daily_avg = weather_data.groupby(['date', 'location']).agg({
    'temperature': 'mean',
    'precipitation': 'sum',
    'humidity': 'mean',
    'cloud_cover': 'mean'
}).reset_index()

# Create a pivot table for easier visualization
pivot_temp = daily_avg.pivot(index='date', columns='location', values='temperature')
pivot_precip = daily_avg.pivot(index='date', columns='location', values='precipitation')

# Plot temperature time series
# DataFrame.plot opens its own figure, so set the size there; a preceding
# plt.figure() call would only leave an empty figure behind
pivot_temp.plot(figsize=(14, 6), linewidth=1, alpha=0.8)
plt.title('Average Daily Temperature by Location', fontsize=16)
plt.ylabel('Temperature (°C)')
plt.xlabel('Date')
plt.legend(title='Location')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Plot precipitation time series
pivot_precip.plot(figsize=(14, 6), linewidth=1, alpha=0.8)
plt.title('Daily Precipitation by Location', fontsize=16)
plt.ylabel('Precipitation (mm)')
plt.xlabel('Date')
plt.legend(title='Location')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## Exploring Rain Patterns and Conditions

Now that we've examined the basic relationships between weather variables, let's focus specifically on analyzing rain patterns and conditions to better understand when precipitation occurs.
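One simple way to see when precipitation occurs is to bin a predictor and compute the share of rainy days in each bin. The sketch below does this for humidity on toy data. The data-generating rule (rain probability rising with the square of humidity) is an assumption made purely for illustration, not a property of the notebook's dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy data: rain is made more likely at high humidity (illustrative assumption).
humidity = rng.uniform(20, 100, size=2000)
rained = rng.random(2000) < (humidity / 100) ** 2
toy = pd.DataFrame({"humidity": humidity, "rain_occurred": rained.astype(int)})

# Share of rainy days within each humidity band, plus the band's sample size.
bands = pd.cut(toy["humidity"], bins=[20, 40, 60, 80, 100], include_lowest=True)
rain_rate = toy.groupby(bands, observed=True)["rain_occurred"].agg(["mean", "size"])
print(rain_rate)
```

Reporting the bin size alongside the rate helps avoid over-reading rates that rest on only a handful of days.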

In [None]:
# Create a binary rain occurrence variable
weather_data['rain_occurred'] = (weather_data['precipitation'] > 0).astype(int)

# Analyze conditions when rain occurs
rain_conditions = weather_data.groupby('rain_occurred').agg({
    'temperature': ['mean', 'std'],
    'humidity': ['mean', 'std'],
    'cloud_cover': ['mean', 'std'],
    'wind_speed': ['mean', 'std'],
    'pressure': ['mean', 'std']
})

print("Weather conditions when rain occurs vs doesn't occur:")
print(rain_conditions)

# Visualize distributions for rainy vs non-rainy days
# Map the 0/1 flag to readable labels so seaborn builds the legend itself;
# overriding it afterwards with ax.legend(labels=...) can attach the labels
# to the wrong curves because it relies on the order the artists were drawn in.
plot_data = weather_data.assign(
    rain_label=weather_data['rain_occurred'].map({0: 'No Rain', 1: 'Rain'})
)

# (column, title prefix, x-axis label, (no-rain color, rain color))
panels = [
    ("temperature", "Temperature", "Temperature (°C)", ("skyblue", "darkblue")),
    ("humidity", "Humidity", "Humidity (%)", ("palegreen", "darkgreen")),
    ("cloud_cover", "Cloud Cover", "Cloud Cover (%)", ("lightsalmon", "darkred")),
    ("pressure", "Pressure", "Pressure (hPa)", ("plum", "purple")),
]

fig, axes = plt.subplots(2, 2, figsize=(16, 12))

for ax, (column, title, xlabel, (dry_color, wet_color)) in zip(axes.flat, panels):
    sns.kdeplot(
        data=plot_data, x=column, hue="rain_label",
        hue_order=['No Rain', 'Rain'],
        palette={'No Rain': dry_color, 'Rain': wet_color},
        fill=True, alpha=0.5, ax=ax
    )
    ax.set_title(f"{title} Distribution: Rain vs No Rain", fontsize=14)
    ax.set_xlabel(xlabel)
    ax.get_legend().set_title("")  # Hide the 'rain_label' legend heading

plt.tight_layout()
plt.show()

## Feature Engineering for Rainfall Prediction

To improve our predictive models, let's engineer additional features that might help identify rainfall patterns. We'll create features like:

1. Rolling statistics for temperature and humidity
2. Weather change indicators (pressure deltas, temperature swings)
3. Seasonal indicators
4. Time-based features
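When these features are meant to forecast a day that has not happened yet, it can help to build them from past days only and to compute them separately per location. The sketch below is one possible variant, not the approach used in the cell that follows: it shifts each series by one day before rolling, so a row never sees its own value. The column names `temp_prev3_mean` and `pressure_delta` and the toy values are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "location": ["A"] * 4 + ["B"] * 4,
    "date": list(pd.date_range("2024-01-01", periods=4)) * 2,
    "temperature": [10.0, 12.0, 14.0, 16.0, 20.0, 20.0, 26.0, 20.0],
    "pressure": [1010.0, 1012.0, 1009.0, 1009.0, 1000.0, 1003.0, 1003.0, 1001.0],
}).sort_values(["location", "date"])

by_loc = toy.groupby("location")

# Mean of the previous three days only: shift(1) keeps today's value
# out of its own feature, and groupby keeps locations from mixing.
toy["temp_prev3_mean"] = by_loc["temperature"].transform(
    lambda s: s.shift(1).rolling(window=3, min_periods=1).mean()
)

# Day-over-day pressure change, computed within each location.
toy["pressure_delta"] = by_loc["pressure"].diff()

print(toy)
```

The first row of each location has no history, so both features are NaN there; whether to drop or fill those rows is a modeling choice.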

In [None]:
# Make a copy of the DataFrame to avoid modifying the original
df_features = weather_data.copy()

# Convert date to datetime if not already
df_features['date'] = pd.to_datetime(df_features['date'])

# Extract time-based features
df_features['year'] = df_features['date'].dt.year
df_features['month'] = df_features['date'].dt.month
df_features['day'] = df_features['date'].dt.day
df_features['dayofweek'] = df_features['date'].dt.dayofweek
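# Quarter-based season labels: Jan-Mar = Winter, Apr-Jun = Spring,
# Jul-Sep = Summer, Oct-Dec = Fall. These differ from meteorological
# seasons, which would place December in winter and March in spring.
df_features['season'] = pd.cut(
    df_features['date'].dt.month, 
    bins=[0, 3, 6, 9, 12], 
    labels=['Winter', 'Spring', 'Summer', 'Fall'],
    include_lowest=True
)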

# Group by location to calculate rolling features
locations = df_features['location'].unique()
feature_dfs = []

for location in locations:
    # Filter data for this location
    loc_data = df_features[df_features['location'] == location].sort_values('date')
    
    # Calculate rolling statistics (3-day windows)
    loc_data['temp_rolling_mean_3d'] = loc_data['temperature'].rolling(window=3, min_periods=1).mean()
    # The std of a single observation is NaN, so the first day per location is set to 0
    loc_data['temp_rolling_std_3d'] = loc_data['temperature'].rolling(window=3, min_periods=1).std().fillna(0)
    loc_data['humidity_rolling_mean_3d'] = loc_data['humidity'].rolling(window=3, min_periods=1).mean()
    
    # Calculate pressure and temperature deltas (change from previous day)
    loc_data['pressure_delta'] = loc_data['pressure'].diff().fillna(0)
    loc_data['temp_delta'] = loc_data['temperature'].diff().fillna(0)
    
    # Add to the list of dataframes
    feature_dfs.append(loc_data)

# Combine all locations back together
df_features = pd.concat(feature_dfs)

# Create interaction features
df_features['temp_humidity_interaction'] = df_features['temperature'] * df_features['humidity']
df_features['wind_temp_interaction'] = df_features['wind_speed'] * df_features['temperature']

# Show the first few rows with new features
display(df_features.head())

# Check correlation of new features with precipitation
correlation_with_precip = df_features.drop(columns=['date', 'location', 'rain_occurred', 'season']).corrwith(df_features['precipitation'])
print("\nCorrelation of features with precipitation:")
print(correlation_with_precip.sort_values(ascending=False))

## Geographical Analysis of Rainfall Patterns

Weather patterns can vary significantly based on location. Let's analyze how precipitation patterns differ across our sampled locations.

In [None]:
# Calculate precipitation statistics by location
def days_percent(x):
    """Percentage of days with any precipitation."""
    return (x > 0).mean() * 100

location_rain_stats = weather_data.groupby('location').agg({
    'precipitation': ['mean', 'median', 'max', 'std', days_percent],
    'temperature': ['mean', 'std'],
    'humidity': ['mean', 'std']
})

# Flatten the two-level column index, e.g. ('precipitation', 'mean') -> 'precipitation_mean'.
# The named function yields 'precipitation_days_percent' directly; a lambda would get a
# version-dependent name such as '<lambda>' or '<lambda_0>', so renaming it is fragile.
location_rain_stats.columns = location_rain_stats.columns.map('_'.join)

# Sort by average precipitation
location_rain_stats = location_rain_stats.sort_values('precipitation_mean', ascending=False)

# Display statistics by location
print("Precipitation Statistics by Location:")
display(location_rain_stats)

# Visualize average precipitation by location
plt.figure(figsize=(12, 6))
sns.barplot(
    x=location_rain_stats.index, 
    y=location_rain_stats['precipitation_mean'],
    palette="Blues_d"
)
plt.title("Average Daily Precipitation by Location", fontsize=16)
plt.ylabel("Average Precipitation (mm)")
plt.xlabel("Location")
plt.xticks(rotation=45)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

# Visualize percentage of rainy days by location
plt.figure(figsize=(12, 6))
sns.barplot(
    x=location_rain_stats.index,
    y=location_rain_stats['precipitation_days_percent'],
    palette="Greens_d"
)
plt.title("Percentage of Rainy Days by Location", fontsize=16)
plt.ylabel("Rainy Days (%)")
plt.xlabel("Location")
plt.xticks(rotation=45)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

# Calculate monthly precipitation by location
monthly_precip = weather_data.copy()
monthly_precip['month'] = pd.to_datetime(monthly_precip['date']).dt.month
monthly_location_rain = monthly_precip.groupby(['location', 'month'])['precipitation'].mean().reset_index()

# Create a pivot table for visualization
monthly_location_pivot = monthly_location_rain.pivot(index='month', columns='location', values='precipitation')

# Plot monthly precipitation patterns by location
plt.figure(figsize=(14, 8))
sns.heatmap(
    monthly_location_pivot, 
    cmap='YlGnBu', 
    annot=True, 
    fmt=".2f", 
    linewidths=.5
)
plt.title("Monthly Average Precipitation by Location", fontsize=16)
plt.ylabel("Month")
plt.xlabel("Location")
plt.tight_layout()
plt.show()

## Preparing Data for Machine Learning Models

Now that we've done extensive exploratory analysis and feature engineering, let's prepare our data for machine learning models. We'll focus on two prediction tasks:

1. **Rain Occurrence Classification**: Predicting whether it will rain or not (binary classification)
2. **Rainfall Amount Regression**: Predicting how much rain will fall (regression)

Let's prepare the data for both tasks.
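Because the engineered features include rolling windows and day-over-day deltas, a random split can place neighbouring days of the same location on both sides of the train/test boundary. One alternative, shown here as a sketch rather than what the following cell does, is to hold out the most recent dates. `chronological_split` is a hypothetical helper, and the toy frame stands in for the real feature table.

```python
import pandas as pd

def chronological_split(df, date_col="date", test_fraction=0.2):
    """Hold out the most recent dates as the test set (hypothetical helper)."""
    dates = df[date_col].sort_values().unique()
    n_test = max(1, int(round(len(dates) * test_fraction)))
    cutoff = dates[-n_test]  # First date that belongs to the test set
    train = df[df[date_col] < cutoff]
    test = df[df[date_col] >= cutoff]
    return train, test

toy = pd.DataFrame({
    "date": list(pd.date_range("2024-01-01", periods=10)) * 2,
    "location": ["A"] * 10 + ["B"] * 10,
    "rain_occurred": [0, 1] * 10,
})

train, test = chronological_split(toy)
print(len(train), len(test))
print(train["date"].max(), test["date"].min())
```

Splitting on unique dates rather than on rows keeps every location's observation for a given day on the same side of the cutoff.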

In [None]:
# Import necessary ML libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, classification_report, mean_squared_error, r2_score
import numpy as np

# Use the engineered features DataFrame
ml_data = df_features.copy()

# Drop rows with missing values or handle them appropriately
ml_data = ml_data.dropna()

# Define features and target variables
# For classification: predict whether it will rain
X = ml_data.drop(columns=['precipitation', 'rain_occurred', 'date'])
y_class = ml_data['rain_occurred']

# For regression: predict amount of rain (only on days when it rained)
rain_data = ml_data[ml_data['precipitation'] > 0].copy()
X_reg = rain_data.drop(columns=['precipitation', 'rain_occurred', 'date'])
y_reg = rain_data['precipitation']

# Print dataset shapes
print(f"Classification dataset: {X.shape[0]} samples, {X.shape[1]} features")
print(f"Regression dataset: {X_reg.shape[0]} samples, {X_reg.shape[1]} features")

# Handle categorical variables
categorical_features = ['location', 'season']
numerical_features = [col for col in X.columns if col not in categorical_features]

# Create preprocessor for both categorical and numerical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y_class, test_size=0.2, random_state=42)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

print("\nTrain/Test Split:")
print(f"Classification - Training: {X_train.shape[0]} samples, Testing: {X_test.shape[0]} samples")
print(f"Regression - Training: {X_train_reg.shape[0]} samples, Testing: {X_test_reg.shape[0]} samples")

## Building and Evaluating Machine Learning Models

Now let's build our two machine learning models:
1. A random forest classifier to predict whether it will rain
2. A random forest regressor to predict the amount of rainfall

We'll evaluate both models and analyze their performance.
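Accuracy alone can be misleading when dry days dominate, so it can be useful to compare against a trivial baseline and to score the predicted probabilities directly. The sketch below uses made-up labels and probabilities (not outputs of the models in this notebook) with scikit-learn's `roc_auc_score` and `brier_score_loss`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, brier_score_loss, roc_auc_score

# Hypothetical held-out labels (1 = rain) and predicted rain probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.4, 0.7, 0.6, 0.1, 0.9])

# Baseline: always predict the majority class ("no rain").
baseline_pred = np.zeros_like(y_true)
baseline_accuracy = accuracy_score(y_true, baseline_pred)

# Model accuracy at the default 0.5 threshold.
model_accuracy = accuracy_score(y_true, (y_prob >= 0.5).astype(int))

auc = roc_auc_score(y_true, y_prob)       # How well the probabilities rank rainy days
brier = brier_score_loss(y_true, y_prob)  # Mean squared error of the probabilities

print(f"Baseline accuracy: {baseline_accuracy:.2f}")
print(f"Model accuracy:    {model_accuracy:.2f}")
print(f"ROC AUC:           {auc:.2f}")
print(f"Brier score:       {brier:.3f}")
```

A model is only adding value if it beats the baseline; the Brier score is relevant here because the application reports a percentage chance of rain, not just a yes/no answer.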

In [None]:
# 1. Rain Occurrence Classification Model
# Create a pipeline with preprocessing and classifier
classification_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Train the model
print("Training rain occurrence classifier...")
classification_pipeline.fit(X_train, y_train)

# Make predictions
y_pred_class = classification_pipeline.predict(X_test)
y_pred_proba = classification_pipeline.predict_proba(X_test)[:, 1]

# Evaluate the classifier
accuracy = accuracy_score(y_test, y_pred_class)
print(f"\nClassification Model Performance:")
print(f"Accuracy: {accuracy:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_class))

# 2. Rainfall Amount Regression Model
# Create a pipeline with preprocessing and regressor
regression_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42))
])

# Train the model
print("\nTraining rainfall amount regressor...")
regression_pipeline.fit(X_train_reg, y_train_reg)

# Make predictions
y_pred_reg = regression_pipeline.predict(X_test_reg)

# Evaluate the regressor
mse = mean_squared_error(y_test_reg, y_pred_reg)
rmse = np.sqrt(mse)
r2 = r2_score(y_test_reg, y_pred_reg)

print(f"\nRegression Model Performance:")
print(f"Mean Squared Error: {mse:.4f}")
print(f"Root Mean Squared Error: {rmse:.4f}")
print(f"R² Score: {r2:.4f}")

In [None]:
# Analyze feature importances for both models
def plot_feature_importance(model, preprocessor, feature_names, title):
    """Plot the 20 largest feature importances and return them as a DataFrame."""
    # Extract feature names after preprocessing (including one-hot encoded categories).
    # A fitted ColumnTransformer reports them via get_feature_names_out()
    # (scikit-learn >= 1.0), e.g. 'num__temperature' or 'cat__location_Chicago'.
    # Note that transformers_[i][2] holds column *names* here, not integer positions,
    # so it cannot be used to index into the feature_names list.
    if hasattr(preprocessor, 'get_feature_names_out'):
        all_feature_names = list(preprocessor.get_feature_names_out())
    else:
        # Fallback to original feature names
        all_feature_names = list(feature_names)
    
    # Get feature importances
    if hasattr(model, 'feature_importances_'):
        importances = model.feature_importances_
        
        # Create DataFrame for easier sorting and plotting
        feature_importance_df = pd.DataFrame({
            'Feature': all_feature_names[:len(importances)],
            'Importance': importances
        })
        
        # Sort by importance
        feature_importance_df = feature_importance_df.sort_values('Importance', ascending=False)
        
        # Plot
        plt.figure(figsize=(12, 8))
        sns.barplot(x='Importance', y='Feature', data=feature_importance_df.head(20))
        plt.title(title, fontsize=16)
        plt.tight_layout()
        plt.show()
        
        return feature_importance_df
    else:
        print("Model does not have feature_importances_ attribute")
        return None

# Extract the classifier and regressor from the pipelines
classifier = classification_pipeline.named_steps['classifier']
regressor = regression_pipeline.named_steps['regressor']
preprocessor_clf = classification_pipeline.named_steps['preprocessor']
preprocessor_reg = regression_pipeline.named_steps['preprocessor']

# Plot feature importances
print("Top Features for Rain Occurrence Prediction:")
clf_importances = plot_feature_importance(
    classifier, preprocessor_clf, 
    list(X.columns), 
    "Feature Importance for Rain Occurrence Prediction"
)
display(clf_importances.head(10))

print("\nTop Features for Rainfall Amount Prediction:")
reg_importances = plot_feature_importance(
    regressor, preprocessor_reg, 
    list(X_reg.columns), 
    "Feature Importance for Rainfall Amount Prediction"
)
display(reg_importances.head(10))

## Testing Prediction on New Data

Let's demonstrate how to use our trained models to make predictions for specific locations and dates. This is the functionality that will be core to our "Will it rain on my parade?" application.
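The two models can also be combined into a single expected-rainfall figure: the probability of rain multiplied by the amount predicted for a rainy day. This is offered as one possible way to summarize the outputs, not as what the loop in the next cell does (it reports an amount only when the classifier predicts rain). `expected_rainfall` and the input values are hypothetical.

```python
def expected_rainfall(rain_probability, amount_if_rain):
    """Expected rainfall in mm: P(rain) * E[amount | rain]."""
    if not 0.0 <= rain_probability <= 1.0:
        raise ValueError("rain_probability must be between 0 and 1")
    # A regressor can occasionally return a negative amount; clamp it at zero.
    return rain_probability * max(amount_if_rain, 0.0)

# Hypothetical model outputs for one location and date.
rain_probability = 0.35   # e.g. from the classifier's predict_proba
amount_if_rain = 4.0      # e.g. from the regressor, in mm

print(f"Expected rainfall: {expected_rainfall(rain_probability, amount_if_rain):.2f} mm")
```

Unlike a hard yes/no threshold, this figure changes smoothly as the predicted probability moves, which may be easier to communicate for borderline days.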

In [None]:
# Create sample new data for prediction
# This would typically come from a weather forecast API
# but we'll create some synthetic examples

def prepare_prediction_data(location, date_str, forecast_data):
    """
    Prepare a single data point for prediction based on forecast data
    
    Parameters:
    -----------
    location : str
        The name of the location
    date_str : str
        Date string in 'YYYY-MM-DD' format
    forecast_data : dict
        Dictionary containing forecast weather variables
        
    Returns:
    --------
    pd.DataFrame
        A dataframe with a single row prepared for prediction
    """
    # Create a dataframe with the basic forecast data
    df = pd.DataFrame([forecast_data])
    
    # Add location and date
    df['location'] = location
    df['date'] = pd.to_datetime(date_str)
    
    # Add the engineered features we used during training
    df['year'] = df['date'].dt.year
    df['month'] = df['date'].dt.month
    df['day'] = df['date'].dt.day
    df['dayofweek'] = df['date'].dt.dayofweek
    
    # Create season
    df['season'] = pd.cut(
        df['date'].dt.month, 
        bins=[0, 3, 6, 9, 12], 
        labels=['Winter', 'Spring', 'Summer', 'Fall'],
        include_lowest=True
    )
    
    # Since we don't have historical data for this new point,
    # use the provided forecast data for the rolling features
    df['temp_rolling_mean_3d'] = df['temperature']
    df['temp_rolling_std_3d'] = 0
    df['humidity_rolling_mean_3d'] = df['humidity']
    df['pressure_delta'] = 0
    df['temp_delta'] = 0
    
    # Create interaction features
    df['temp_humidity_interaction'] = df['temperature'] * df['humidity']
    df['wind_temp_interaction'] = df['wind_speed'] * df['temperature']
    
    # Drop columns not needed for prediction
    if 'precipitation' in df.columns:
        df = df.drop(columns=['precipitation'])
    if 'rain_occurred' in df.columns:
        df = df.drop(columns=['rain_occurred'])
        
    return df

# Create a few sample forecast data points
sample_forecasts = [
    {
        'location': 'New York',
        'date': '2023-10-10',
        'forecast': {
            'temperature': 18.5,
            'humidity': 72,
            'cloud_cover': 65,
            'wind_speed': 12,
            'pressure': 1013.2
        }
    },
    {
        'location': 'San Francisco',
        'date': '2023-10-10',
        'forecast': {
            'temperature': 16.8,
            'humidity': 68,
            'cloud_cover': 30,
            'wind_speed': 8,
            'pressure': 1016.5
        }
    },
    {
        'location': 'Miami',
        'date': '2023-10-10',
        'forecast': {
            'temperature': 28.2,
            'humidity': 85,
            'cloud_cover': 75,
            'wind_speed': 15,
            'pressure': 1009.8
        }
    }
]

# Make predictions for each sample
print("Predictions for Sample Forecast Data:")
print("=" * 50)

for sample in sample_forecasts:
    location = sample['location']
    date = sample['date']
    forecast = sample['forecast']
    
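    # Prepare the data for prediction
    pred_data = prepare_prediction_data(location, date, forecast)
    
    # The sample forecasts omit several columns the pipelines were trained on
    # (e.g. latitude, dew_point, visibility). As a stopgap, align to the training
    # columns and fill the numeric gaps with training-set medians so predict() can run.
    pred_data = pred_data.reindex(columns=X_train.columns)
    pred_data = pred_data.fillna(X_train.median(numeric_only=True))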
    
    # Make predictions
    rain_prob = classification_pipeline.predict_proba(pred_data)[0, 1]
    will_rain = classification_pipeline.predict(pred_data)[0]
    
    # If it's predicted to rain, estimate the amount
    rain_amount = 0
    if will_rain == 1:
        rain_amount = regression_pipeline.predict(pred_data)[0]
    
    # Print results
    print(f"Location: {location}, Date: {date}")
    print(f"Forecast: Temp={forecast['temperature']}°C, Humidity={forecast['humidity']}%, Pressure={forecast['pressure']}hPa")
    print(f"Prediction: {rain_prob*100:.1f}% chance of rain")
    
    if will_rain == 1:
        print(f"Expected rainfall: {rain_amount:.2f} mm")
        print(f"Recommendation: Consider bringing an umbrella!")
    else:
        print("No rain expected")
        print("Recommendation: Enjoy your outdoor activities!")
    
    print("-" * 50)

## Conclusion and Next Steps

In this exploratory analysis notebook, we've:

1. Generated and analyzed sample weather data to understand the relationships between weather variables
2. Performed feature engineering to create useful predictors for rainfall
3. Analyzed geographical and seasonal patterns in precipitation
4. Built and evaluated machine learning models for predicting both rain occurrence and amount
5. Demonstrated how to apply the trained models to new forecast inputs

### Next Steps

1. **Integrate with Real Data Sources**: 
   - Replace our simulated data with real weather data from NASA, NOAA, and OpenWeatherMap
   - Configure AI-Agent-Scraper for efficient web scraping of weather data

2. **Enhance the Models**:
   - Experiment with different machine learning algorithms
   - Perform hyperparameter tuning to improve model performance
   - Consider using deep learning for time series forecasting

3. **Feature Development**:
   - Add more sophisticated features like atmospheric indices and radar data
   - Incorporate satellite imagery for cloud pattern recognition

4. **Application Development**:
   - Build a user-friendly API for "Will it rain on my parade?" predictions
   - Develop a front-end interface for users to input their event details
   - Create visualization components for the prediction results

5. **Validation and Testing**:
   - Validate predictions against actual weather outcomes
   - Implement a feedback mechanism to improve model accuracy over time
   - Test the system with various use cases and locations