# Emissions Decision Support System (Maryland)
The goal is to build a clean, analysis-ready dataset for comparing greenhouse-gas emissions across Maryland counties and sectors, then train a simple model to estimate emissions.

This notebook follows the same structure as a typical portfolio project:
- Data loading + quick checks
- Preprocessing (standardization, missing values, date range)
- Exploratory data analysis (categorical + continuous)
- Encoding + correlation heatmap
- Outlier removal
- Train/test split + model building (Decision Tree + GridSearchCV)
- Evaluation + feature importance + conclusion


In [None]:
# Loading the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
# Loading the dataset
df = pd.read_csv('../data/raw_emissions_maryland.csv')
df.head()


In [None]:
# Shape + types
df.shape, df.dtypes
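
Beyond shape and dtypes, a quick check can confirm that the columns used later in this notebook are present. The sketch below is illustrative only: `missing_columns` is a hypothetical helper, and the toy frame (with made-up values) stands in for the real CSV, whose full schema is not shown here.

```python
import pandas as pd

# Columns referenced in later cells of this notebook.
REQUIRED = ['county', 'sector', 'year', 'emissions_mtco2e',
            'population', 'gdp_billions_usd']

def missing_columns(frame, required=REQUIRED):
    """Return the required column names absent from `frame` (hypothetical helper)."""
    return [c for c in required if c not in frame.columns]

# Toy frame standing in for the raw CSV; values are invented for illustration.
toy = pd.DataFrame({'county': ['Howard'], 'sector': ['transport'], 'year': [2020]})
print(missing_columns(toy))
```

On the real data, an empty list from `missing_columns(df)` would mean the remaining cells can at least find their inputs.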


## Data Preprocessing Part 1
The raw dataset intentionally includes messy county labels and inconsistent sector naming.  
We standardize:
- `county` casing + whitespace
- `sector` labels (e.g. `transport` → `Transportation`)
- `year`, restricted to a consistent analysis window (2015–2024)


In [None]:
# Drop rows missing required fields first. Doing this before astype(str)
# matters: on pandas versions where astype(str) renders NaN as the text 'nan',
# a later dropna would no longer see those rows as missing.
required = ['emissions_mtco2e','population','gdp_billions_usd','sector','county']
df = df.dropna(subset=required).copy()

# Standardize county strings: trim, collapse inner whitespace, title-case
df['county'] = (
    df['county']
      .astype(str)
      .str.strip()
      .str.replace(r"\s+", " ", regex=True)
      .str.title()
)

# Standardize sector labels
df['sector'] = (
    df['sector']
      .astype(str)
      .str.strip()
      .str.lower()
      .replace({'transport': 'transportation'})  # exact-match rename only
      .str.title()
)

# Keep a consistent year range
df = df[df['year'].between(2015, 2024)].copy()
df.head()
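
To see what the standardization does on concrete values, here is a minimal sketch on made-up labels (not rows from the real CSV). Missing labels are dropped before any string handling, so they cannot be converted into text by accident.

```python
import pandas as pd

# Invented examples of the kind of mess described above.
toy = pd.DataFrame({
    'county': ['  baltimore   city ', 'HOWARD', None],
    'sector': ['Transport ', 'residential', 'transport'],
})

# Remove rows with a missing label, then clean the survivors.
toy = toy.dropna(subset=['county', 'sector']).copy()
toy['county'] = (
    toy['county'].str.strip().str.replace(r"\s+", " ", regex=True).str.title()
)
toy['sector'] = (
    toy['sector'].str.strip().str.lower()
       .replace({'transport': 'transportation'})
       .str.title()
)
print(toy)
```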


In [None]:
# Quick sanity checks
df.isna().sum()


In [None]:
# Descriptive statistics
df[['emissions_mtco2e','population','gdp_billions_usd']].describe()


## Exploratory Data Analysis
We look at:
- County totals (who contributes most)
- Sector distribution
- Emissions trends over time


In [None]:
# Sector distribution
plt.figure(figsize=(10,4))
sns.countplot(x='sector', data=df)
plt.xticks(rotation=25)
plt.title('Record count by sector')
plt.show()


In [None]:
# Total emissions by county (top 10)
county_totals = df.groupby('county')['emissions_mtco2e'].sum().sort_values(ascending=False).head(10).reset_index()

plt.figure(figsize=(8,5))
sns.barplot(data=county_totals, y='county', x='emissions_mtco2e')
plt.title('Top counties by total emissions (2015–2024)')
plt.xlabel('Total emissions (MtCO2e)')
plt.ylabel('County')
plt.show()


In [None]:
# Trend: emissions over time by sector (aggregated across counties)
trend = df.groupby(['year','sector'])['emissions_mtco2e'].sum().reset_index()

plt.figure(figsize=(10,5))
sns.lineplot(data=trend, x='year', y='emissions_mtco2e', hue='sector')
plt.title('Emissions by sector over time (aggregated)')
plt.show()


## Data Preprocessing Part 2 (Encoding + Correlation)
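We encode categorical variables as integers for correlation and modeling. `LabelEncoder` assigns codes in sorted label order, so correlations involving `county` or `sector` reflect that arbitrary ordering rather than a real relationship.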


In [None]:
from sklearn.preprocessing import LabelEncoder

model_df = df.copy()

# Encode categorical cols
cat_cols = ['county','sector']
encoders = {}

for col in cat_cols:
    le = LabelEncoder()
    model_df[col] = le.fit_transform(model_df[col])
    encoders[col] = le

model_df.head()
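
Once labels are replaced by integers, it helps to be able to look up which code stands for which category. A small sketch on made-up sector labels (the real label set may differ), using the fitted encoder's `classes_` attribute and `inverse_transform`:

```python
from sklearn.preprocessing import LabelEncoder

# Invented labels for illustration only.
labels = ['Transportation', 'Residential', 'Industrial', 'Residential']

le = LabelEncoder()
codes = le.fit_transform(labels)

# classes_ is sorted; each label's position is its integer code.
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes.
decoded = le.inverse_transform(codes)
print(list(decoded))
```

In this notebook the same lookup would go through the stored encoders, e.g. `encoders['sector'].classes_`.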


In [None]:
# Correlation heatmap
plt.figure(figsize=(8,6))
sns.heatmap(model_df[['emissions_mtco2e','population','gdp_billions_usd','county','sector']].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation matrix (encoded)')
plt.show()


## Outlier Removal
We use a simple Z-score filter (|z| < 3) on the numeric columns, target included, to reduce extreme points that can distort tree splits.


In [None]:
from scipy import stats

z = np.abs(stats.zscore(model_df[['population','gdp_billions_usd','emissions_mtco2e']]))
model_df = model_df[(z < 3).all(axis=1)]
model_df.shape
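
The effect of the filter is easier to see on a toy frame. The values below are invented: one column contains a single obvious outlier, the other is evenly spaced. A row is kept only if every column has |z| < 3.

```python
import numpy as np
import pandas as pd
from scipy import stats

toy = pd.DataFrame({
    'population': list(range(40, 70)) + [500],       # last value is the outlier
    'gdp_billions_usd': np.linspace(1.0, 2.0, 31),   # evenly spaced, no outliers
})

# Column-wise z-scores (scipy uses the population std, ddof=0, by default).
z = np.abs(stats.zscore(toy.to_numpy(), axis=0))

# Keep rows where all columns are within 3 standard deviations.
kept = toy[(z < 3).all(axis=1)]
print(len(toy), len(kept))
```

One caveat of this rule: with the population standard deviation, the largest possible |z| in a sample of n rows is sqrt(n - 1), so with fewer than 10 rows no value can reach |z| = 3.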


## Train/Test Split + Model Building
Target: `emissions_mtco2e`  
Model: DecisionTreeRegressor + GridSearchCV


In [None]:
from sklearn.model_selection import train_test_split

X = model_df.drop(columns=['emissions_mtco2e'])
y = model_df['emissions_mtco2e']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape


In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

# Fix the seed on the estimator; it is not a hyperparameter to search over
dtr = DecisionTreeRegressor(random_state=42)

params = {
    'max_depth': [3,5,7,9],
    'min_samples_split': [2,4,8],
    'min_samples_leaf': [1,2,4],
}

grid = GridSearchCV(dtr, param_grid=params, cv=5, n_jobs=-1, verbose=0)
grid.fit(X_train, y_train)

grid.best_params_


In [None]:
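# GridSearchCV refits the best configuration on the full training set
# by default (refit=True), so best_estimator_ is already fitted
best = grid.best_estimator_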

train_r2 = best.score(X_train, y_train)
test_r2 = best.score(X_test, y_test)

train_r2, test_r2


In [None]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

pred = best.predict(X_test)

print("R2 Score:", r2_score(y_test, pred))
print("MSE:", mean_squared_error(y_test, pred))
print("MAE:", mean_absolute_error(y_test, pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
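
For reference, the four metrics can be reproduced by hand from the residuals. This is a worked sketch on three made-up values, using only NumPy, to show what each number measures:

```python
import numpy as np

# Invented true and predicted values.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

resid = y_true - y_pred                      # [1, 0, -2]

mae = np.mean(np.abs(resid))                 # mean absolute error
mse = np.mean(resid ** 2)                    # mean squared error
rmse = np.sqrt(mse)                          # same units as the target
ss_res = np.sum(resid ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                     # share of variance explained

print(mae, mse, rmse, r2)
```

MAE and RMSE are in the units of `emissions_mtco2e`, which makes them easier to communicate than MSE; R2 is unitless.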


In [None]:
# Feature importance
feat_df = pd.DataFrame({'Feature': X.columns, 'Importance': best.feature_importances_}).sort_values('Importance', ascending=False)
feat_df


In [None]:
plt.figure(figsize=(8,4))
sns.barplot(data=feat_df, x='Importance', y='Feature')
plt.title('Feature Importance')
plt.show()
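
The `feature_importances_` values above are impurity-based: they are computed from the training data and can favor features with many distinct values. A common cross-check is permutation importance on held-out data. The sketch below uses synthetic data (not this notebook's dataset) in which only the first feature drives the target, so the expected ranking is known in advance.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: only feature 0 influences y.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(300, 3))
y_syn = 5.0 * X_syn[:, 0] + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0
)
tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature in turn on the test set and measure the score drop.
result = permutation_importance(tree, X_te, y_te, n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
print(ranking, result.importances_mean)
```

Applied to this notebook, the analogous call would be `permutation_importance(best, X_test, y_test, ...)`; if the two rankings disagree sharply, the impurity-based one deserves less weight.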


## Export Cleaned Dataset
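This is the dataset you'd hand off to Excel/Power BI. It is exported from `df`, so it keeps the human-readable labels and is not affected by the outlier filter applied to `model_df`.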


In [None]:
# Export cleaned dataset (human-readable labels)
cleaned = df.copy()
cleaned.to_csv('../reports/cleaned_emissions_maryland.csv', index=False)
cleaned.head()


## Conclusion
This workflow demonstrates a decision-support pipeline:
- messy inputs → standardized categories
- multi-year comparisons via consistent filtering
- EDA to understand sector/county patterns
- baseline model + feature importance to rank which inputs the model relies on
