# NPRI Time Series Analysis - Data Exploration

This notebook explores the National Pollutant Release Inventory (NPRI) dataset, using the merged dataset from the CMPT2400 project extended with 2023 data.

## Overview

The NPRI dataset contains information about pollutant releases across different industries in Canada. This notebook will:

1. Load and inspect the merged dataset
2. Perform exploratory data analysis
3. Identify key patterns and trends
4. Prepare for data preprocessing

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Set plotting style
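# ('seaborn-whitegrid' was renamed in matplotlib 3.6 and removed in 3.8,
# so the style and palette are set through seaborn instead)
sns.set_theme(style='whitegrid', palette='viridis')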

## 1. Load and Inspect the Dataset

First, let's load the merged dataset from CMPT2400 that includes the 2023 data.

In [None]:
# Load the dataset
df_releases = pd.read_csv('../data/raw/df_merged_releases.csv')

# Display basic information
print(f"Dataset shape: {df_releases.shape}")
print("\nFirst 5 rows:")
df_releases.head()

In [None]:
# Check data types and missing values
print("\nData Types:")
print(df_releases.dtypes)

print("\nMissing Values:")
print(df_releases.isnull().sum())
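
Raw missing-value counts are easier to judge as a share of all rows. The sketch below shows one possible way to summarize them. The helper name `missing_summary` and the toy DataFrame are hypothetical and stand in for `df_releases`.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })
    # Keep only columns that actually have gaps
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_count", ascending=False)


# Toy data standing in for df_releases (values are made up)
toy = pd.DataFrame({
    "Reporting_Year": [2021, 2022, 2023, 2023],
    "Total_Air_Releases": [1.5, np.nan, 0.2, np.nan],
    "Total_Land_Releases": [np.nan, 0.0, 0.0, 3.0],
})

summary = missing_summary(toy)
print(summary)
```

On the real data, `missing_summary(df_releases)` would give the same table for every column with gaps.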

## 2. Explore Key Variables

Let's examine the distribution of key variables such as reporting years, industry sectors, and pollutant releases.

In [None]:
# Distribution of reporting years
yearly_counts = df_releases['Reporting_Year'].value_counts().sort_index()
plt.figure(figsize=(12, 6))
yearly_counts.plot(kind='bar')
plt.title('Number of Reports by Year')
plt.xlabel('Reporting Year')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

In [None]:
# Distribution of industry sectors
sector_counts = df_releases['Industry_Sector'].value_counts()
plt.figure(figsize=(14, 8))
sector_counts.plot(kind='barh')
plt.title('Reports by Industry Sector')
plt.xlabel('Count')
plt.tight_layout()
plt.show()

In [None]:
# Distribution of pollutant release amounts
release_columns = ['Total_Air_Releases', 'Total_Land_Releases', 'Total_Water_Releases']

fig, axes = plt.subplots(3, 1, figsize=(12, 15))
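for i, col in enumerate(release_columns):
    # A log scale cannot represent zero, so plot strictly positive amounts only
    # (the comparison also drops missing values)
    positive = df_releases.loc[df_releases[col] > 0, col]
    sns.histplot(positive, ax=axes[i], kde=True, log_scale=True)
    axes[i].set_title(f'Distribution of {col} (non-zero values)')
    axes[i].set_xlabel('Release Amount (tonnes)')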
plt.tight_layout()
plt.show()
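
Because a logarithmic axis has no place for zero, it can help to check how many reported values are exactly zero before reading the histograms. Here is a minimal sketch on made-up numbers. The toy series stands in for one of the release columns, and whether zeros occur in the real data is an assumption to verify.

```python
import numpy as np
import pandas as pd

# Toy release amounts standing in for one release column (values are made up)
toy = pd.Series([0.0, 0.0, 0.5, 10.0, 1000.0, np.nan])

reported = toy.dropna()                 # ignore missing entries
positive = reported[reported > 0]       # values a log axis can display
zero_share = (reported == 0).mean()     # fraction of reported values equal to zero

# log10 of the positive values, i.e. what a log-scaled histogram bins
log10_values = np.log10(positive)

print(f"Share of zero releases: {zero_share:.0%}")
print(log10_values.round(3).tolist())
```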

## 3. Time Series Exploration

Let's examine how pollutant releases have changed over time, first in total and then for the five industry sectors with the highest air releases.

In [None]:
# Time series of total releases by year
yearly_releases = df_releases.groupby('Reporting_Year')[release_columns].sum().reset_index()

fig = px.line(yearly_releases, x='Reporting_Year', y=release_columns,
             title='Total Releases by Year',
             labels={'value': 'Total Releases (tonnes)', 'variable': 'Release Type'})
fig.show()

In [None]:
# Time series by industry sector (top 5 sectors by air releases)
top_sectors = df_releases.groupby('Industry_Sector')['Total_Air_Releases'].sum().nlargest(5).index

# Filter for top sectors
top_sector_data = df_releases[df_releases['Industry_Sector'].isin(top_sectors)]

# Group by year and sector
sector_time_series = top_sector_data.groupby(['Reporting_Year', 'Industry_Sector'])['Total_Air_Releases'].sum().reset_index()

# Create interactive plot
fig = px.line(sector_time_series, x='Reporting_Year', y='Total_Air_Releases', color='Industry_Sector',
             title='Air Releases by Top Industry Sectors Over Time',
             labels={'Total_Air_Releases': 'Air Releases (tonnes)'})
fig.show()
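
Line charts show the direction of change. A year-over-year percentage makes its size explicit. The following sketch illustrates the idea with `pct_change` on made-up totals. The toy DataFrame stands in for `df_releases` and reuses its column names.

```python
import pandas as pd

# Toy records standing in for df_releases (values are made up)
toy = pd.DataFrame({
    "Reporting_Year": [2021, 2021, 2022, 2022, 2023],
    "Total_Air_Releases": [10.0, 30.0, 20.0, 30.0, 25.0],
})

# Total air releases per year, in chronological order
yearly = toy.groupby("Reporting_Year")["Total_Air_Releases"].sum().sort_index()

# Percentage change relative to the previous year (first year has no baseline)
yoy_pct = yearly.pct_change() * 100

print(pd.DataFrame({"total": yearly, "yoy_pct": yoy_pct}))
```

The first year is `NaN` by construction, since there is no earlier year to compare against.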

## 4. Correlation Analysis

Let's explore correlations between different types of releases.

In [None]:
# Correlation heatmap
plt.figure(figsize=(10, 8))
correlation = df_releases[release_columns].corr()
sns.heatmap(correlation, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Between Release Types')
plt.show()
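
`DataFrame.corr()` computes Pearson correlation by default, which a few very large values can dominate. If the release amounts turn out to be heavily skewed (an assumption to check against the histograms), a rank-based Spearman correlation may be a useful comparison. Here is a small sketch on made-up data:

```python
import pandas as pd

# Toy data: y increases with x everywhere, but x has one very large value
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 100.0],
    "y": [1.0, 4.0, 9.0, 16.0, 25.0],
})

pearson = toy.corr(method="pearson").loc["x", "y"]    # linear association
spearman = toy.corr(method="spearman").loc["x", "y"]  # rank (monotonic) association

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

On the real data, `df_releases[release_columns].corr(method='spearman')` could be passed to the same heatmap call.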

## 5. Data Quality Assessment

Let's identify any data quality issues that need to be addressed in the preprocessing step.

In [None]:
# Check for outliers in release amounts
plt.figure(figsize=(14, 6))
for i, col in enumerate(release_columns):
    plt.subplot(1, 3, i+1)
    sns.boxplot(y=df_releases[col].dropna())
    plt.title(f'Boxplot of {col}')
# Adjust spacing once, after all subplots are drawn
plt.tight_layout()
plt.show()

In [None]:
# Identify extreme high outliers: values above Q3 + 3 * IQR
for col in release_columns:
    q75, q25 = np.percentile(df_releases[col].dropna(), [75, 25])
    iqr = q75 - q25
    upper_bound = q75 + 3 * iqr
    
    extreme_outliers = df_releases[df_releases[col] > upper_bound]
    print(f"\nExtreme outliers in {col}: {len(extreme_outliers)}")
    if len(extreme_outliers) > 0:
        print(extreme_outliers[["Reporting_Year", "Industry_Sector", "Facility_Name", col]].head())
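
To make the threshold used above concrete, here is the same rule (third quartile plus three times the interquartile range) worked through on a tiny made-up series. The helper name `extreme_upper_bound` is hypothetical.

```python
import numpy as np
import pandas as pd


def extreme_upper_bound(values: pd.Series, k: float = 3.0) -> float:
    """Upper fence Q3 + k * IQR, ignoring missing values.

    Hypothetical helper for illustration; k=3.0 matches the rule used above.
    """
    clean = values.dropna()
    q25, q75 = np.percentile(clean, [25, 75])
    return q75 + k * (q75 - q25)


# Toy release amounts (values are made up)
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0, np.nan])

bound = extreme_upper_bound(toy)
outliers = toy[toy > bound]

print(f"Upper bound: {bound}")
print(f"Extreme outliers: {outliers.tolist()}")
```

Here Q1 = 2.25 and Q3 = 4.75, so the IQR is 2.5 and the bound is 4.75 + 3 × 2.5 = 12.25. Only the value 100 is flagged.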

## 6. Summary of Findings

Key findings from the exploratory data analysis:

1. The dataset includes data from [YEAR_RANGE] years, with [OBSERVATIONS] on the merged 2023 data.
2. Industry sectors with highest releases include [TOP_SECTORS].
3. Time trends show [TRENDS_OBSERVED].
4. Data quality issues identified: [ISSUES].

Next steps:
- Proceed to data preprocessing (notebook 02)
- Handle identified quality issues
- Prepare for time series modeling