# Weather Temperature Prediction using Linear Regression with Station Identification
This notebook demonstrates temperature prediction using Linear Regression, incorporating weather station identifiers (STA) for station-specific analysis.

**Data Source:** Summary of Weather.csv

## 1. Import Required Libraries

In [14]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Tuple, List, Optional, Dict
import warnings
warnings.filterwarnings('ignore')

# Machine Learning imports
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import joblib

# Set random seed for reproducibility
np.random.seed(42)

# Set plot style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print("‚úÖ All libraries imported successfully!")

‚úÖ All libraries imported successfully!


## 2. Load and Explore Data with Station Identification

In [15]:
def load_weather_data(file_path: str) -> pd.DataFrame:
    """
    Load weather data from CSV file with station identification.
    
    Args:
        file_path: Path to the CSV file
        
    Returns:
        DataFrame containing weather data with STA column
    """
    df: pd.DataFrame = pd.read_csv(file_path)
    print(f"‚úÖ Data loaded successfully from {file_path}")
    print(f"üìä Shape: {df.shape}")
    print(f"\nüìã Columns: {df.columns.tolist()}")
    
    # Display station information
    if 'STA' in df.columns:
        unique_stations: np.ndarray = df['STA'].unique()
        print(f"\nüè¢ Number of unique stations: {len(unique_stations)}")
        print(f"üè¢ Station IDs: {unique_stations[:10]}{'...' if len(unique_stations) > 10 else ''}")
        
        # Station data distribution
        station_counts: pd.Series = df['STA'].value_counts()
        print(f"\nüìä Records per station (top 5):")
        print(station_counts.head())
    
    return df

# Load the Summary of Weather.csv file
file_path: str = 'Summary of Weather.csv'
df: pd.DataFrame = load_weather_data(file_path)

# Display first few rows
print("\nüëÄ First 5 rows:")
df.head()

‚úÖ Data loaded successfully from Summary of Weather.csv
üìä Shape: (119040, 31)

üìã Columns: ['STA', 'Date', 'Precip', 'WindGustSpd', 'MaxTemp', 'MinTemp', 'MeanTemp', 'Snowfall', 'PoorWeather', 'YR', 'MO', 'DA', 'PRCP', 'DR', 'SPD', 'MAX', 'MIN', 'MEA', 'SNF', 'SND', 'FT', 'FB', 'FTI', 'ITH', 'PGT', 'TSHDSBRSGF', 'SD3', 'RHX', 'RHN', 'RVG', 'WTE']

üè¢ Number of unique stations: 159
üè¢ Station IDs: [10001 10002 10101 10102 10502 10505 10701 10703 10704 10705]...

üìä Records per station (top 5):
STA
22508    2192
10701    2185
22502    2154
22504    2118
10803    1750
Name: count, dtype: int64

üëÄ First 5 rows:


Unnamed: 0,STA,Date,Precip,WindGustSpd,MaxTemp,MinTemp,MeanTemp,Snowfall,PoorWeather,YR,...,FB,FTI,ITH,PGT,TSHDSBRSGF,SD3,RHX,RHN,RVG,WTE
0,10001,1942-7-1,1.016,,25.555556,22.222222,23.888889,0.0,,42,...,,,,,,,,,,
1,10001,1942-7-2,0.0,,28.888889,21.666667,25.555556,0.0,,42,...,,,,,,,,,,
2,10001,1942-7-3,2.54,,26.111111,22.222222,24.444444,0.0,,42,...,,,,,,,,,,
3,10001,1942-7-4,2.54,,26.666667,22.222222,24.444444,0.0,,42,...,,,,,,,,,,
4,10001,1942-7-5,0.0,,26.666667,21.666667,24.444444,0.0,,42,...,,,,,,,,,,


In [16]:
# Data information
print("üìä Data Info:")
print("=" * 70)
df.info()

print("\nüìà Statistical Summary:")
print("=" * 70)
df.describe()

üìä Data Info:
<class 'pandas.DataFrame'>
RangeIndex: 119040 entries, 0 to 119039
Data columns (total 31 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   STA          119040 non-null  int64  
 1   Date         119040 non-null  str    
 2   Precip       119040 non-null  str    
 3   WindGustSpd  532 non-null     float64
 4   MaxTemp      119040 non-null  float64
 5   MinTemp      119040 non-null  float64
 6   MeanTemp     119040 non-null  float64
 7   Snowfall     117877 non-null  object 
 8   PoorWeather  34237 non-null   object 
 9   YR           119040 non-null  int64  
 10  MO           119040 non-null  int64  
 11  DA           119040 non-null  int64  
 12  PRCP         117108 non-null  str    
 13  DR           533 non-null     float64
 14  SPD          532 non-null     float64
 15  MAX          118566 non-null  float64
 16  MIN          118572 non-null  float64
 17  MEA          118542 non-null  float64
 18  SNF          117877

Unnamed: 0,STA,WindGustSpd,MaxTemp,MinTemp,MeanTemp,YR,MO,DA,DR,SPD,...,FT,FB,FTI,ITH,PGT,SD3,RHX,RHN,RVG,WTE
count,119040.0,532.0,119040.0,119040.0,119040.0,119040.0,119040.0,119040.0,533.0,532.0,...,0.0,0.0,0.0,0.0,525.0,0.0,0.0,0.0,0.0,0.0
mean,29659.435795,37.774534,27.045111,17.789511,22.411631,43.805284,6.726016,15.79753,26.998124,20.396617,...,,,,,12.085333,,,,,
std,20953.209402,10.297808,8.717817,8.334572,8.297982,1.136718,3.425561,8.794541,15.221732,5.560371,...,,,,,5.731328,,,,,
min,10001.0,18.52,-33.333333,-38.333333,-35.555556,40.0,1.0,1.0,2.0,10.0,...,,,,,0.0,,,,,
25%,11801.0,29.632,25.555556,15.0,20.555556,43.0,4.0,8.0,11.0,16.0,...,,,,,8.5,,,,,
50%,22508.0,37.04,29.444444,21.111111,25.555556,44.0,7.0,16.0,32.0,20.0,...,,,,,11.6,,,,,
75%,33501.0,43.059,31.666667,23.333333,27.222222,45.0,10.0,23.0,34.0,23.25,...,,,,,15.0,,,,,
max,82506.0,75.932,50.0,34.444444,40.0,45.0,12.0,31.0,78.0,41.0,...,,,,,23.9,,,,,


## 3. Station Selection and Data Preprocessing

In [17]:
def select_station_data(df: pd.DataFrame, station_id: Optional[int] = None) -> pd.DataFrame:
    """
    Select data for a specific station or the station with most records.
    
    Args:
        df: Input DataFrame with STA column
        station_id: Specific station ID to select (None = auto-select largest)
        
    Returns:
        DataFrame filtered for the selected station
    """
    if 'STA' not in df.columns:
        print("‚ö†Ô∏è  No STA column found, using all data")
        return df
    
    if station_id is None:
        # Select station with most records
        station_counts: pd.Series = df['STA'].value_counts()
        station_id = station_counts.index[0]
        print(f"üéØ Auto-selected station {station_id} with {station_counts.iloc[0]} records")
    else:
        if station_id not in df['STA'].values:
            print(f"‚ö†Ô∏è  Station {station_id} not found in data")
            return pd.DataFrame()
        print(f"üéØ Selected station {station_id}")
    
    df_station: pd.DataFrame = df[df['STA'] == station_id].copy()
    print(f"üìä Station {station_id} data shape: {df_station.shape}")
    
    return df_station

# Select station data (change station_id to select a specific station, or leave None for auto-select)
selected_station_id: Optional[int] = None  # Set to specific station ID or None
df_station: pd.DataFrame = select_station_data(df, selected_station_id)

# Store the selected station ID for later use
if 'STA' in df_station.columns and len(df_station) > 0:
    selected_station_id = df_station['STA'].iloc[0]
    print(f"\n‚úÖ Working with Station ID: {selected_station_id}")

üéØ Auto-selected station 22508 with 2192 records
üìä Station 22508 data shape: (2192, 31)

‚úÖ Working with Station ID: 22508


In [None]:
def preprocess_data(df: pd.DataFrame, target_column: str = 'MeanTemp') -> pd.DataFrame:
    """
    Preprocess weather data for a specific station.
    
    Args:
        df: Input DataFrame (should be station-specific)
        target_column: Name of the target column to predict
        
    Returns:
        Preprocessed DataFrame
    """
    df_processed: pd.DataFrame = df.copy()
    
    # Convert Date column to datetime and extract temporal features
    if 'Date' in df_processed.columns:
        df_processed['Date'] = pd.to_datetime(df_processed['Date'], errors='coerce')
        df_processed = df_processed.sort_values(by='Date')
        
        # Extract temporal features
        df_processed['Year'] = df_processed['Date'].dt.year
        df_processed['Month'] = df_processed['Date'].dt.month
        df_processed['Day'] = df_processed['Date'].dt.day
        df_processed['DayOfYear'] = df_processed['Date'].dt.dayofyear
        df_processed['DayOfWeek'] = df_processed['Date'].dt.dayofweek
        
        # Create cyclical features for seasonality
        df_processed['Month_sin'] = np.sin(2 * np.pi * df_processed['Month'] / 12)
        df_processed['Month_cos'] = np.cos(2 * np.pi * df_processed['Month'] / 12)
        df_processed['DayOfYear_sin'] = np.sin(2 * np.pi * df_processed['DayOfYear'] / 365)
        df_processed['DayOfYear_cos'] = np.cos(2 * np.pi * df_processed['DayOfYear'] / 365)
    
    df_processed = df_processed.reset_index(drop=True)
    
    # Handle missing values in numeric columns
    numeric_cols: List[str] = df_processed.select_dtypes(include=[np.number]).columns.tolist()
    for col in numeric_cols:
        if col != target_column:
            df_processed[col] = df_processed[col].ffill().bfill()
    
    # Handle target column separately
    if target_column in df_processed.columns:
        df_processed[target_column] = df_processed[target_column].ffill().bfill()
        df_processed = df_processed.dropna(subset=[target_column])
    
    print(f"‚úÖ Data preprocessing completed!")
    print(f"üìä Processed data shape: {df_processed.shape}")
    print(f"üéØ Target variable: '{target_column}'")
    
    if 'Date' in df_processed.columns:
        print(f"üìÖ Date range: {df_processed['Date'].min()} to {df_processed['Date'].max()}")
    
    return df_processed

# Preprocess data - using MeanTemp as the target variable
df_processed: pd.DataFrame = preprocess_data(df_station, target_column='MeanTemp')
df_processed.head()

‚úÖ Data preprocessing completed!
üìä Processed data shape: (2192, 40)
üéØ Target variable: 'MeanTemp'
üìÖ Date range: 1940-01-01 00:00:00 to 1945-12-31 00:00:00


Unnamed: 0,STA,Date,Precip,WindGustSpd,MaxTemp,MinTemp,MeanTemp,Snowfall,PoorWeather,YR,...,WTE,Year,Month,Day,DayOfYear,DayOfWeek,Month_sin,Month_cos,DayOfYear_sin,DayOfYear_cos
0,22508,1940-01-01,0.254,,23.333333,17.222222,20.0,0,,40,...,,1940,1,1,1,0,0.5,0.866025,0.017213,0.999852
1,22508,1940-01-02,10.16,,23.333333,16.111111,19.444444,0,,40,...,,1940,1,2,2,1,0.5,0.866025,0.034422,0.999407
2,22508,1940-01-03,T,,23.888889,15.555556,20.0,0,,40,...,,1940,1,3,3,2,0.5,0.866025,0.05162,0.998667
3,22508,1940-01-04,2.286,,23.888889,18.333333,21.111111,0,,40,...,,1940,1,4,4,3,0.5,0.866025,0.068802,0.99763
4,22508,1940-01-05,0.254,,22.222222,15.0,18.333333,0,,40,...,,1940,1,5,5,4,0.5,0.866025,0.085965,0.996298


## 4. Feature Engineering and Selection

In [19]:
def create_lag_features(df: pd.DataFrame, target_col: str, lags: List[int] = [1, 2, 3, 7, 14, 30]) -> pd.DataFrame:
    """
    Create lag features for time series prediction.
    
    Args:
        df: Input DataFrame
        target_col: Target column name
        lags: List of lag periods to create
        
    Returns:
        DataFrame with lag features
    """
    df_lagged: pd.DataFrame = df.copy()
    
    for lag in lags:
        df_lagged[f'{target_col}_lag_{lag}'] = df_lagged[target_col].shift(lag)
    
    # Create rolling statistics
    df_lagged[f'{target_col}_rolling_mean_7'] = df_lagged[target_col].rolling(window=7, min_periods=1).mean()
    df_lagged[f'{target_col}_rolling_std_7'] = df_lagged[target_col].rolling(window=7, min_periods=1).std()
    df_lagged[f'{target_col}_rolling_mean_30'] = df_lagged[target_col].rolling(window=30, min_periods=1).mean()
    df_lagged[f'{target_col}_rolling_std_30'] = df_lagged[target_col].rolling(window=30, min_periods=1).std()
    
    # Drop rows with NaN values created by lagging
    df_lagged = df_lagged.dropna()
    
    print(f"‚úÖ Created {len(lags)} lag features and 4 rolling statistics")
    print(f"üìä Data shape after feature engineering: {df_lagged.shape}")
    
    return df_lagged

# Create lag features
target_col: str = 'MeanTemp'
df_features: pd.DataFrame = create_lag_features(df_processed, target_col, lags=[1, 2, 3, 7, 14, 30])

print("\nüìã New features created:")
lag_features: List[str] = [col for col in df_features.columns if 'lag' in col or 'rolling' in col]
print(lag_features)

‚úÖ Created 6 lag features and 4 rolling statistics
üìä Data shape after feature engineering: (0, 50)

üìã New features created:
['MeanTemp_lag_1', 'MeanTemp_lag_2', 'MeanTemp_lag_3', 'MeanTemp_lag_7', 'MeanTemp_lag_14', 'MeanTemp_lag_30', 'MeanTemp_rolling_mean_7', 'MeanTemp_rolling_std_7', 'MeanTemp_rolling_mean_30', 'MeanTemp_rolling_std_30']


In [None]:
def select_features(df: pd.DataFrame, target_col: str) -> Tuple[List[str], pd.DataFrame]:
    """
    Select relevant features for linear regression.
    
    Args:
        df: Input DataFrame
        target_col: Target column name
        
    Returns:
        Tuple of (feature names list, DataFrame with selected features)
    """
    # Select numeric features (excluding target and identifiers)
    exclude_cols: List[str] = [target_col, 'STA', 'Date', 'YR', 'MO', 'DA']
    
    feature_cols: List[str] = [col for col in df.select_dtypes(include=[np.number]).columns 
                                if col not in exclude_cols]
    
    # Prioritize lag features and temporal features
    priority_features: List[str] = [col for col in feature_cols if 'lag' in col or 'rolling' in col 
                                     or 'Month' in col or 'DayOfYear' in col or 'Year' in col]
    
    # Add other weather features if available
    weather_features: List[str] = ['MaxTemp', 'MinTemp', 'Precip', 'WindGustSpd', 'Snowfall']
    available_weather: List[str] = [col for col in weather_features if col in feature_cols]
    
    # Combine all features
    selected_features: List[str] = list(set(priority_features + available_weather))
    
    # Remove features with too many missing values
    valid_features: List[str] = []
    for col in selected_features:
        if df[col].isnull().sum() / len(df) < 0.5:  # Less than 50% missing
            valid_features.append(col)
    
    print(f"‚úÖ Selected {len(valid_features)} features for modeling")
    print("\nüìã Feature categories:")
    lag_feats = [f for f in valid_features if 'lag' in f]
    rolling_feats = [f for f in valid_features if 'rolling' in f]
    temporal_feats = [f for f in valid_features if any(x in f for x in ['Month', 'Day', 'Year'])]
    weather_feats = [f for f in valid_features if f in weather_features]
    
    print(f"  ‚Ä¢ Lag features: {len(lag_feats)}")
    print(f"  ‚Ä¢ Rolling statistics: {len(rolling_feats)}")
    print(f"  ‚Ä¢ Temporal features: {len(temporal_feats)}")
    print(f"  ‚Ä¢ Weather features: {len(weather_feats)}")
    
    return valid_features, df[valid_features + [target_col]].copy()

# Select features
feature_names: List[str]
df_modeling: pd.DataFrame
feature_names, df_modeling = select_features(df_features, target_col)

print(f"\nüìä Final modeling dataset shape: {df_modeling.shape}")

‚úÖ Selected 0 features for modeling

üìã Feature categories:
  ‚Ä¢ Lag features: 0
  ‚Ä¢ Rolling statistics: 0
  ‚Ä¢ Temporal features: 0
  ‚Ä¢ Weather features: 0

üìä Final modeling dataset shape: (0, 1)


## 5. Prepare Train and Test Sets

In [22]:
# Remove any remaining NaN values
df_modeling = df_modeling.dropna()

# Split features and target
X: pd.DataFrame = df_modeling[feature_names]
y: pd.Series = df_modeling[target_col]

print(f"üìä Feature matrix shape: {X.shape}")
print(f"üìä Target vector shape: {y.shape}")

# Split into train and test sets (80-20 split, preserving temporal order)
train_size: int = int(len(X) * 0.8)
X_train: pd.DataFrame = X[:train_size]
X_test: pd.DataFrame = X[train_size:]
y_train: pd.Series = y[:train_size]
y_test: pd.Series = y[train_size:]

print(f"\nüìä Training set size: {len(X_train)} ({(len(X_train)/len(X)*100):.1f}%)")
print(f"üìä Test set size: {len(X_test)} ({(len(X_test)/len(X)*100):.1f}%)")

# Standardize features
scaler: StandardScaler = StandardScaler()
X_train_scaled: np.ndarray = scaler.fit_transform(X_train)
X_test_scaled: np.ndarray = scaler.transform(X_test)

print(f"\n‚úÖ Features standardized (mean=0, std=1)")

üìä Feature matrix shape: (0, 0)
üìä Target vector shape: (0,)


ZeroDivisionError: division by zero

## 6. Train Linear Regression Model

In [None]:
# Initialize and train Linear Regression model
print(f"üöÄ Training Linear Regression model for Station {selected_station_id}...")
print("=" * 70)

model: LinearRegression = LinearRegression()
model.fit(X_train_scaled, y_train)

print("‚úÖ Model training completed!")
print(f"\nüìä Model coefficients: {len(model.coef_)} features")
print(f"üìä Model intercept: {model.intercept_:.4f}")

# Display feature importance (absolute coefficient values)
feature_importance: pd.DataFrame = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': model.coef_,
    'Abs_Coefficient': np.abs(model.coef_)
}).sort_values('Abs_Coefficient', ascending=False)

print(f"\nüìä Top 10 Most Important Features:")
print("=" * 70)
print(feature_importance.head(10).to_string(index=False))

## 7. Make Predictions

In [None]:
# Make predictions
print(f"üîÆ Making predictions for Station {selected_station_id}...")

y_train_pred: np.ndarray = model.predict(X_train_scaled)
y_test_pred: np.ndarray = model.predict(X_test_scaled)

print("‚úÖ Predictions completed!")

## 8. Calculate Performance Metrics

In [None]:
def calculate_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> Dict[str, float]:
    """
    Calculate performance metrics.
    
    Args:
        y_true: True values
        y_pred: Predicted values
        
    Returns:
        Dictionary containing performance metrics
    """
    mse: float = mean_squared_error(y_true, y_pred)
    rmse: float = np.sqrt(mse)
    mae: float = mean_absolute_error(y_true, y_pred)
    r2: float = r2_score(y_true, y_pred)
    
    # Calculate MAPE (avoiding division by zero)
    mape: float = np.mean(np.abs((y_true - y_pred) / np.where(y_true != 0, y_true, 1))) * 100
    
    metrics: Dict[str, float] = {
        'MSE': mse,
        'RMSE': rmse,
        'MAE': mae,
        'R2 Score': r2,
        'MAPE': mape
    }
    
    return metrics

# Calculate metrics for training set
train_metrics: Dict[str, float] = calculate_metrics(y_train.values, y_train_pred)
print(f"\nüìä Training Set Metrics - Station {selected_station_id}:")
print("=" * 70)
for metric, value in train_metrics.items():
    print(f"  {metric:15s}: {value:.6f}")

print("\n" + "=" * 70 + "\n")

# Calculate metrics for test set
test_metrics: Dict[str, float] = calculate_metrics(y_test.values, y_test_pred)
print(f"üéØ Test Set Metrics - Station {selected_station_id}:")
print("=" * 70)
for metric, value in test_metrics.items():
    print(f"  {metric:15s}: {value:.6f}")

print("\n" + "=" * 70)

## 9. Visualize Feature Importance

In [None]:
# Plot feature importance
plt.figure(figsize=(12, 8))
top_features: pd.DataFrame = feature_importance.head(15)
colors = ['#2E86AB' if coef > 0 else '#D62246' for coef in top_features['Coefficient']]

plt.barh(range(len(top_features)), top_features['Coefficient'], color=colors)
plt.yticks(range(len(top_features)), top_features['Feature'])
plt.xlabel('Coefficient Value', fontsize=12)
plt.title(f'Top 15 Feature Importance - Station {selected_station_id}', fontsize=14, fontweight='bold')
plt.axvline(x=0, color='black', linestyle='--', linewidth=1)
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

print("\nüí° Interpretation:")
print("  ‚Ä¢ Blue bars: Positive correlation (increase in feature ‚Üí increase in temperature)")
print("  ‚Ä¢ Red bars: Negative correlation (increase in feature ‚Üí decrease in temperature)")

## 10. Visualize Predictions

In [None]:
# Plot predictions vs actual values
plt.figure(figsize=(18, 7))

# Training data
train_indices: np.ndarray = np.arange(len(y_train))
plt.plot(train_indices, y_train.values, label='Actual (Train)', 
         color='#2E86AB', alpha=0.7, linewidth=1.5)
plt.plot(train_indices, y_train_pred, label='Predicted (Train)', 
         color='#F18F01', alpha=0.7, linewidth=1.5)

# Test data
test_indices: np.ndarray = np.arange(len(y_train), len(y_train) + len(y_test))
plt.plot(test_indices, y_test.values, label='Actual (Test)', 
         color='#06A77D', alpha=0.7, linewidth=1.5)
plt.plot(test_indices, y_test_pred, label='Predicted (Test)', 
         color='#D62246', alpha=0.7, linewidth=1.5)

# Add vertical line to separate train and test
plt.axvline(x=len(y_train), color='black', linestyle='--', 
            linewidth=2, label='Train/Test Split', alpha=0.5)

plt.title(f'Temperature Prediction: Actual vs Predicted - Station {selected_station_id}', 
          fontsize=16, fontweight='bold')
plt.xlabel('Time Steps', fontsize=12)
plt.ylabel('Temperature (¬∞C)', fontsize=12)
plt.legend(fontsize=10, loc='best')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Zoom in on test predictions
plt.figure(figsize=(18, 6))
plt.plot(y_test.values, label='Actual Temperature', 
         color='#06A77D', linewidth=2, marker='o', markersize=3, alpha=0.8)
plt.plot(y_test_pred, label='Predicted Temperature', 
         color='#D62246', linewidth=2, marker='s', markersize=3, alpha=0.8)
plt.title(f'Test Set: Detailed Comparison - Station {selected_station_id}', 
          fontsize=16, fontweight='bold')
plt.xlabel('Time Steps', fontsize=12)
plt.ylabel('Temperature (¬∞C)', fontsize=12)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Scatter plot: Predicted vs Actual
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Training set
axes[0].scatter(y_train.values, y_train_pred, alpha=0.5, s=20, color='#2E86AB')
axes[0].plot([y_train.min(), y_train.max()], 
             [y_train.min(), y_train.max()], 
             'r--', linewidth=2, label='Perfect Prediction')
axes[0].set_title(f'Training Set: Predicted vs Actual - Station {selected_station_id}', 
                  fontsize=14, fontweight='bold')
axes[0].set_xlabel('Actual Temperature (¬∞C)', fontsize=12)
axes[0].set_ylabel('Predicted Temperature (¬∞C)', fontsize=12)
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)

# Test set
axes[1].scatter(y_test.values, y_test_pred, alpha=0.5, s=20, color='#06A77D')
axes[1].plot([y_test.min(), y_test.max()], 
             [y_test.min(), y_test.max()], 
             'r--', linewidth=2, label='Perfect Prediction')
axes[1].set_title(f'Test Set: Predicted vs Actual - Station {selected_station_id}', 
                  fontsize=14, fontweight='bold')
axes[1].set_xlabel('Actual Temperature (¬∞C)', fontsize=12)
axes[1].set_ylabel('Predicted Temperature (¬∞C)', fontsize=12)
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 11. Residual Analysis

In [None]:
# Calculate residuals
train_residuals: np.ndarray = y_train.values - y_train_pred
test_residuals: np.ndarray = y_test.values - y_test_pred

fig, axes = plt.subplots(2, 2, figsize=(16, 10))

# Training residuals over time
axes[0, 0].plot(train_residuals, color='#2E86AB', alpha=0.7, linewidth=1)
axes[0, 0].axhline(y=0, color='r', linestyle='--', linewidth=2)
axes[0, 0].set_title(f'Training Residuals Over Time - Station {selected_station_id}', 
                     fontsize=12, fontweight='bold')
axes[0, 0].set_xlabel('Time Steps', fontsize=10)
axes[0, 0].set_ylabel('Residuals (¬∞C)', fontsize=10)
axes[0, 0].grid(True, alpha=0.3)

# Test residuals over time
axes[0, 1].plot(test_residuals, color='#06A77D', alpha=0.7, linewidth=1)
axes[0, 1].axhline(y=0, color='r', linestyle='--', linewidth=2)
axes[0, 1].set_title(f'Test Residuals Over Time - Station {selected_station_id}', 
                     fontsize=12, fontweight='bold')
axes[0, 1].set_xlabel('Time Steps', fontsize=10)
axes[0, 1].set_ylabel('Residuals (¬∞C)', fontsize=10)
axes[0, 1].grid(True, alpha=0.3)

# Training residuals distribution
axes[1, 0].hist(train_residuals, bins=50, edgecolor='black', alpha=0.7, color='#2E86AB')
axes[1, 0].axvline(x=0, color='r', linestyle='--', linewidth=2)
axes[1, 0].set_title('Training Residuals Distribution', fontsize=12, fontweight='bold')
axes[1, 0].set_xlabel('Residuals (¬∞C)', fontsize=10)
axes[1, 0].set_ylabel('Frequency', fontsize=10)
axes[1, 0].grid(True, alpha=0.3)

# Test residuals distribution
axes[1, 1].hist(test_residuals, bins=50, edgecolor='black', alpha=0.7, color='#06A77D')
axes[1, 1].axvline(x=0, color='r', linestyle='--', linewidth=2)
axes[1, 1].set_title('Test Residuals Distribution', fontsize=12, fontweight='bold')
axes[1, 1].set_xlabel('Residuals (¬∞C)', fontsize=10)
axes[1, 1].set_ylabel('Frequency', fontsize=10)
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nüìä Training Residuals - Mean: {train_residuals.mean():.4f}¬∞C, Std: {train_residuals.std():.4f}¬∞C")
print(f"üìä Test Residuals - Mean: {test_residuals.mean():.4f}¬∞C, Std: {test_residuals.std():.4f}¬∞C")

## 12. Summary Report

In [None]:
def print_summary_report(train_metrics: Dict[str, float], test_metrics: Dict[str, float], 
                        station_id: Optional[int] = None, n_features: int = 0) -> None:
    """
    Print comprehensive summary report.
    
    Args:
        train_metrics: Training set metrics
        test_metrics: Test set metrics
        station_id: Station identifier
        n_features: Number of features used
    """
    print("\n" + "="*80)
    print(" " * 15 + "üå°Ô∏è  LINEAR REGRESSION TEMPERATURE PREDICTION - SUMMARY REPORT")
    print("="*80)
    
    print("\nüìä MODEL CONFIGURATION:")
    print("-" * 80)
    print(f"  ‚Ä¢ Data Source: Summary of Weather.csv")
    print(f"  ‚Ä¢ Station ID: {station_id}")
    print(f"  ‚Ä¢ Target Variable: MeanTemp (Mean Temperature)")
    print(f"  ‚Ä¢ Number of Features: {n_features}")
    print(f"  ‚Ä¢ Training Samples: {len(X_train)}")
    print(f"  ‚Ä¢ Test Samples: {len(X_test)}")
    print(f"  ‚Ä¢ Model Type: Linear Regression (Ordinary Least Squares)")
    print(f"  ‚Ä¢ Feature Scaling: StandardScaler (mean=0, std=1)")
    
    print("\nüìà TRAINING SET PERFORMANCE:")
    print("-" * 80)
    for metric, value in train_metrics.items():
        print(f"  ‚Ä¢ {metric:15s}: {value:.6f}")
    
    print("\nüéØ TEST SET PERFORMANCE:")
    print("-" * 80)
    for metric, value in test_metrics.items():
        print(f"  ‚Ä¢ {metric:15s}: {value:.6f}")
    
    print("\nüí° MODEL INSIGHTS:")
    print("-" * 80)
    print(f"  ‚Ä¢ Model explains {test_metrics['R2 Score']*100:.2f}% of variance in test data")
    print(f"  ‚Ä¢ Average prediction error on test set: {test_metrics['MAE']:.4f}¬∞C")
    print(f"  ‚Ä¢ Mean Absolute Percentage Error: {test_metrics['MAPE']:.2f}%")
    
    print("\nüèÜ PERFORMANCE EVALUATION:")
    print("-" * 80)
    if test_metrics['R2 Score'] > 0.9:
        print("  ‚úÖ Excellent model performance! The model captures temperature patterns very well.")
    elif test_metrics['R2 Score'] > 0.7:
        print("  ‚úÖ Good model performance! The model shows strong predictive capability.")
    elif test_metrics['R2 Score'] > 0.5:
        print("  ‚ö†Ô∏è  Moderate model performance - consider adding more features or using non-linear models.")
    else:
        print("  ‚ö†Ô∏è  Poor model performance - linear regression may not be suitable for this data.")
    
    print("\n" + "="*80 + "\n")

print_summary_report(train_metrics, test_metrics, selected_station_id, len(feature_names))

## 13. Save Model and Results

In [None]:
import json

# Save the model
model_filename: str = f'linear_regression_model_station_{selected_station_id}.pkl'
joblib.dump(model, model_filename)
print(f"‚úÖ Model saved as '{model_filename}'")

# Save the scaler
scaler_filename: str = f'scaler_station_{selected_station_id}.pkl'
joblib.dump(scaler, scaler_filename)
print(f"‚úÖ Scaler saved as '{scaler_filename}'")

# Save predictions to CSV
predictions_filename: str = f'lr_predictions_station_{selected_station_id}.csv'
predictions_df: pd.DataFrame = pd.DataFrame({
    'Station_ID': selected_station_id,
    'Actual_Test': y_test.values,
    'Predicted_Test': y_test_pred,
    'Residuals': test_residuals,
    'Absolute_Error': np.abs(test_residuals)
})

predictions_df.to_csv(predictions_filename, index=False)
print(f"‚úÖ Predictions saved as '{predictions_filename}'")

# Save feature importance
feature_importance_filename: str = f'feature_importance_station_{selected_station_id}.csv'
feature_importance.to_csv(feature_importance_filename, index=False)
print(f"‚úÖ Feature importance saved as '{feature_importance_filename}'")

# Save metrics to JSON
metrics_filename: str = f'lr_metrics_station_{selected_station_id}.json'
metrics_summary: Dict[str, any] = {
    'data_source': 'Summary of Weather.csv',
    'station_id': int(selected_station_id) if selected_station_id else None,
    'target_variable': 'MeanTemp',
    'training_metrics': {k: float(v) for k, v in train_metrics.items()},
    'test_metrics': {k: float(v) for k, v in test_metrics.items()},
    'model_config': {
        'model_type': 'Linear Regression',
        'n_features': len(feature_names),
        'feature_names': feature_names,
        'scaler': 'StandardScaler'
    },
    'data_split': {
        'train_size': len(X_train),
        'test_size': len(X_test),
        'train_percentage': 80,
        'test_percentage': 20
    }
}

with open(metrics_filename, 'w') as f:
    json.dump(metrics_summary, f, indent=4)

print(f"‚úÖ Metrics saved as '{metrics_filename}'")
print("\nüéâ All results saved successfully!")
print(f"\nüìÅ Generated Files for Station {selected_station_id}:")
print(f"  ‚Ä¢ {model_filename} - Trained Linear Regression model")
print(f"  ‚Ä¢ {scaler_filename} - Feature scaler")
print(f"  ‚Ä¢ {predictions_filename} - Test predictions and errors")
print(f"  ‚Ä¢ {feature_importance_filename} - Feature importance ranking")
print(f"  ‚Ä¢ {metrics_filename} - Comprehensive metrics and configuration")

## 14. Model Comparison with LSTM (Optional)

In [None]:
# This cell compares Linear Regression with LSTM results (if available)
print("üìä Model Comparison: Linear Regression vs LSTM")
print("=" * 70)

comparison_data: Dict[str, Dict[str, float]] = {
    'Linear Regression': test_metrics
}

# Try to load LSTM metrics if available
lstm_metrics_file: str = f'model_metrics_station_{selected_station_id}.json'
try:
    with open(lstm_metrics_file, 'r') as f:
        lstm_data: Dict = json.load(f)
        comparison_data['LSTM'] = lstm_data['test_metrics']
    
    # Create comparison DataFrame
    comparison_df: pd.DataFrame = pd.DataFrame(comparison_data).T
    print("\nüìä Performance Comparison:")
    print(comparison_df.to_string())
    
    # Visualize comparison
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    metrics_to_plot: List[str] = ['RMSE', 'MAE', 'R2 Score', 'MAPE']
    
    for idx, metric in enumerate(metrics_to_plot):
        ax = axes[idx // 2, idx % 2]
        values = [comparison_data[model][metric] for model in comparison_data.keys()]
        colors = ['#2E86AB', '#F18F01']
        ax.bar(comparison_data.keys(), values, color=colors[:len(values)])
        ax.set_title(f'{metric} Comparison', fontsize=12, fontweight='bold')
        ax.set_ylabel(metric, fontsize=10)
        ax.grid(True, alpha=0.3, axis='y')
        
        # Add value labels on bars
        for i, v in enumerate(values):
            ax.text(i, v, f'{v:.4f}', ha='center', va='bottom', fontsize=9)
    
    plt.suptitle(f'Model Performance Comparison - Station {selected_station_id}', 
                 fontsize=14, fontweight='bold', y=1.00)
    plt.tight_layout()
    plt.show()
    
    print("\nüí° Key Insights:")
    print("-" * 70)
    if comparison_data['Linear Regression']['R2 Score'] > comparison_data['LSTM']['R2 Score']:
        print("  ‚Ä¢ Linear Regression outperforms LSTM on this dataset")
        print("  ‚Ä¢ This suggests the relationship is primarily linear")
    else:
        print("  ‚Ä¢ LSTM outperforms Linear Regression on this dataset")
        print("  ‚Ä¢ This suggests non-linear temporal patterns are important")
    
except FileNotFoundError:
    print("\n‚ö†Ô∏è  LSTM metrics file not found. Run the LSTM notebook first for comparison.")
    print(f"   Expected file: {lstm_metrics_file}")

## Conclusion

### ‚úÖ What We Accomplished:

1. **Station-Specific Analysis**
   - Loaded weather data with station identification (STA column)
   - Selected and processed data for a specific weather station
   - Created station-specific models and results

2. **Feature Engineering**
   - Created lag features (1, 2, 3, 7, 14, 30 days)
   - Generated rolling statistics (7-day and 30-day windows)
   - Extracted temporal features (year, month, day, cyclical encodings)
   - Selected relevant weather features (MaxTemp, MinTemp, Precip, etc.)

3. **Model Development**
   - Built Linear Regression model with StandardScaler normalization
   - Trained on 80% of data, tested on 20%
   - Analyzed feature importance through coefficients

4. **Model Evaluation**
   - Calculated comprehensive metrics: MSE, RMSE, MAE, R¬≤ Score, MAPE
   - Analyzed residuals and error distributions
   - Visualized predictions vs actual values
   - Compared with LSTM performance (if available)

### üéØ Key Features:
- **Interpretability**: Linear coefficients show direct feature impact
- **Efficiency**: Fast training and prediction (no iterative optimization)
- **Baseline**: Provides strong baseline for comparison with complex models
- **Feature Insights**: Reveals which features are most predictive

### üìä Linear Regression vs LSTM:

**Linear Regression Advantages:**
- Faster training (no backpropagation)
- More interpretable (clear feature weights)
- No hyperparameter tuning needed
- Works well when relationships are linear

**LSTM Advantages:**
- Captures non-linear patterns
- Learns complex temporal dependencies
- Better for long-term forecasting
- Handles sequential patterns automatically

### üöÄ Next Steps:

1. **Regularization**: Try Ridge or Lasso regression to reduce overfitting
2. **Polynomial Features**: Add interaction terms and polynomial features
3. **Multi-Station Models**: Train models for multiple stations and compare
4. **Ensemble Methods**: Combine Linear Regression with LSTM predictions
5. **Time Series Cross-Validation**: Use rolling window validation for robust evaluation

### üìö Resources:
- [Scikit-learn Linear Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
- [Time Series Feature Engineering](https://www.kaggle.com/learn/time-series)
- [Regression Metrics Guide](https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics)