## NMF-PY Workflow

The steps in this notebook are intended to replicate the preprocessing, base model building, and base model post-processing steps of PMF5.

The error estimation functionality has not yet been implemented in the new code base.

In [None]:
# Notebook imports
import os
import sys
import json

sys.path.insert(0, "/content/nmf_py")
sys.path.insert(0, "/content/nmf_py/src")

#### Sample Dataset
The three sample datasets from PMF5 are available for use, but a new dataset can be used in their place.

In [None]:
# Baton Rouge Dataset
br_input_file = os.path.join("/content/nmf_py/data/Dataset-BatonRouge-con.csv")
br_uncertainty_file = os.path.join("/content/nmf_py/data/Dataset-BatonRouge-unc.csv")
br_output_path = os.path.join("/content/nmf_py/data/output/BatonRouge")
# Baltimore Dataset
# b_input_file = os.path.join("data", "Dataset-Baltimore_con.txt")
# b_uncertainty_file = os.path.join("data", "Dataset-Baltimore_unc.txt")
# b_output_path = os.path.join("data", "output", "Baltimore")
# Saint Louis Dataset
# sl_input_file = os.path.join("data", "Dataset-StLouis-con.csv")
# sl_uncertainty_file = os.path.join("data", "Dataset-StLouis-unc.csv")
# sl_output_path = os.path.join("data", "output", "StLouis")

In [None]:
# !unzip /content/nmf_py-main-20231012.zip

In [None]:
!pip install fuzzy-c-means

Collecting fuzzy-c-means
  Downloading fuzzy_c_means-1.7.0-py3-none-any.whl (9.0 kB)
Collecting tabulate<0.9.0,>=0.8.9 (from fuzzy-c-means)
  Downloading tabulate-0.8.10-py3-none-any.whl (29 kB)
Collecting typer<0.5.0,>=0.4.0 (from fuzzy-c-means)
  Downloading typer-0.4.2-py3-none-any.whl (27 kB)
Installing collected packages: typer, tabulate, fuzzy-c-means
  Attempting uninstall: typer
    Found existing installation: typer 0.9.0
    Uninstalling typer-0.9.0:
      Successfully uninstalled typer-0.9.0
  Attempting uninstall: tabulate
    Found existing installation: tabulate 0.9.0
    Uninstalling tabulate-0.9.0:
      Successfully uninstalled tabulate-0.9.0
Successfully installed fuzzy-c-means-1.7.0 tabulate-0.8.10 typer-0.4.2


#### Code Imports

In [None]:
from src.data.datahandler import DataHandler
from src.model.nmf import NMF
from src.model.model import BatchNMF
from src.data.analysis import ModelAnalysis

#### Input Parameters

In [None]:
index_col = "Date"                  # the index of the input/uncertainty datasets
factors = 6                         # the number of factors
method = "ls-nmf"                   # the NMF algorithm to use: "ls-nmf" or "ws-nmf"
models = 20                         # the number of models to train
init_method = "col_means"           # initialization method: "col_means" (default, column means), "kmeans", or "cmeans"
init_norm = True                    # if init_method is "kmeans" or "cmeans", normalize the data prior to clustering
seed = 42                           # random seed for initialization
max_iterations = 20000              # the maximum number of iterations for fitting a model
converge_delta = 0.1                # convergence criterion: threshold on the change in the loss, Q
converge_n = 10                     # convergence criterion: number of steps where the loss changes by less than converge_delta
verbose = True                      # print more detail about the algorithm workflow during execution
optimized = True                    # use the Rust code if possible
parallel = True                     # execute the model training in parallel, multiple models at the same time
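
One way to read the two convergence parameters is sketched below. This is a minimal illustration, not the nmf_py implementation: it assumes training stops once each of the last `converge_n` steps changed the loss Q by less than `converge_delta`. The function name `has_converged` and the `q_history` list are hypothetical.

```python
def has_converged(q_history, converge_delta=0.1, converge_n=10):
    """Return True if each of the last `converge_n` steps changed Q by less than `converge_delta`.

    `q_history` is a list holding the loss value Q recorded after every step
    (a hypothetical structure, used here for illustration only).
    """
    if len(q_history) <= converge_n:
        return False  # not enough steps recorded yet
    recent = q_history[-(converge_n + 1):]
    changes = [abs(previous - current) for previous, current in zip(recent, recent[1:])]
    return all(change < converge_delta for change in changes)


# A slowly decreasing loss (0.01 per step) satisfies the criterion,
# a quickly decreasing loss (1.0 per step) does not.
slow_loss = [100.0 - 0.01 * step for step in range(20)]
fast_loss = [100.0 - 1.0 * step for step in range(20)]
slow_done = has_converged(slow_loss)
fast_done = has_converged(fast_loss)
```

With the defaults above, a smaller `converge_delta` or a larger `converge_n` makes the stopping rule stricter and the runs longer.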

#### Dataset Selection
One of the three sample datasets can be selected, or a new dataset can be used. Datasets must be cleaned beforehand so that they contain no missing data: either drop the rows with missing values (NaNs) or interpolate the missing values.

In [None]:
# Loading the Baton Rouge dataset
input_file = br_input_file
uncertainty_file = br_uncertainty_file
output_path = br_output_path
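
The cleaning requirement described in this section could be met with pandas before the files are handed to the notebook. The sketch below is an assumption-laden example rather than part of nmf_py: `clean_dataset` is a hypothetical helper, and the tiny inline CSV stands in for a real concentration file.

```python
import io

import pandas as pd


def clean_dataset(df, strategy="drop"):
    """Return a copy of `df` without missing values.

    strategy="drop" removes every row that contains a NaN;
    strategy="interpolate" fills NaNs by linear interpolation between neighbouring rows.
    """
    if strategy == "drop":
        return df.dropna(axis=0, how="any")
    if strategy == "interpolate":
        return df.interpolate(method="linear", limit_direction="both")
    raise ValueError(f"Unknown strategy: {strategy}")


# Illustrative data only: species A is missing on the second date.
raw_csv = "Date,A,B\n2005-06-01,1.0,2.0\n2005-06-02,,4.0\n2005-06-03,3.0,6.0\n"
raw = pd.read_csv(io.StringIO(raw_csv), index_col="Date")

dropped = clean_dataset(raw, strategy="drop")
filled = clean_dataset(raw, strategy="interpolate")
```

When rows are dropped, the same rows should be removed from the matching uncertainty file so that the two datasets keep identical shapes.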

#### Load Data
Assign the processed data and uncertainty datasets to the variables V and U. These steps will be simplified/streamlined in a future version of the code.

In [None]:
data_handler = DataHandler(
    input_path=input_file,
    uncertainty_path=uncertainty_file,
    index_col=index_col
)
V = data_handler.input_data_processed               # Cleaned input dataset (numpy array)
U = data_handler.uncertainty_data_processed         # Cleaned uncertainty dataset (numpy array)

#### Input/Uncertainty Data Metrics and Visualizations

In [None]:
# Show the input data metrics, including signal to noise ratio of the data and uncertainty
data_handler.metrics

Feature,Category,S/N,Min,25th,50th,75th,Max
124-Trimethylbenzene,Strong,5.445168,0.005,0.820001,1.290001,1.865001,5.470003
224-Trimethylpentane,Strong,5.666667,0.41,1.580001,2.490002,3.865002,13.560008
234-Trimethylpentane,Strong,5.537459,0.005,0.53,0.820001,1.300001,4.410003
23-Dimethylbutane,Strong,5.500543,0.005,0.64,1.110001,2.285001,10.500007
23-Dimethylpentane,Strong,5.463626,0.005,0.34,0.49,0.78,3.310002
2-Methylheptane,Strong,5.039088,0.005,0.215,0.33,0.535,2.480002
3-Methylhexane,Strong,5.648208,0.005,0.655,1.050001,1.510001,7.780005
3-Methylpentane,Strong,5.611292,0.54,1.720001,2.990002,5.945004,29.100018
Acetylene,Strong,5.666667,0.38,1.410001,1.990001,2.835002,8.070005
Benzene,Strong,5.666667,0.59,1.960001,2.770002,4.440003,9.330006
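
The S/N column can be understood through the signal-to-noise definition used by PMF5, sketched here with NumPy. This assumes nmf_py follows the same definition: each sample contributes (concentration - uncertainty) / uncertainty when the concentration exceeds its uncertainty and zero otherwise, and the values are averaged per feature. The function name `signal_to_noise` is hypothetical.

```python
import numpy as np


def signal_to_noise(conc, unc):
    """Per-feature S/N following the PMF5-style definition (assumed, not taken from nmf_py).

    conc, unc: arrays of shape (samples, features); uncertainties must be positive.
    """
    conc = np.asarray(conc, dtype=float)
    unc = np.asarray(unc, dtype=float)
    # Samples at or below their uncertainty carry no signal and count as zero.
    d = np.where(conc > unc, (conc - unc) / unc, 0.0)
    return d.mean(axis=0)


# Illustrative data: feature 0 has a 15% uncertainty, feature 1 is always below its uncertainty.
conc = np.array([[1.0, 0.1], [2.0, 0.2]])
unc = np.array([[0.15, 0.5], [0.30, 0.5]])
sn = signal_to_noise(conc, unc)
```

Under this definition a feature whose uncertainty is always 15% of its concentration scores 0.85 / 0.15 ≈ 5.67.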


In [None]:
# Concentration / Uncertainty Scatter plot for specific feature, feature/column specified by index
data_handler.data_uncertainty_plot(feature_idx=2)

In [None]:
# Species Concentration plot comparing features, features/columns specified by index
data_handler.feature_data_plot(x_idx=0, y_idx=1)

In [None]:
# Species Timeseries, a single or list of features/columns specified by index
data_handler.feature_timeseries_plot(feature_selection=[0, 1, 2, 3])

#### Train Model

In [None]:
%%time
# Training multiple models, optional parameters are commented out.
nmf_models = BatchNMF(V=V, U=U, factors=factors, models=models, method=method, seed=seed, max_iter=max_iterations,
                    # init_method=init_method, init_norm=init_norm,
                    converge_delta=converge_delta, converge_n=converge_n,
                    parallel=parallel, optimized=False,  # optimized is hard-coded to False here; pass optimized=optimized to use the setting above
                    # verbose=verbose
                   )
nmf_models.train()

Model: 3, Seed: 43887, Q(true): 66051.8651, Steps: 1135/20000, Converged: True, Runtime: 11.5 sec
Model: 0, Seed: 8925, Q(true): 65069.5744, Steps: 2301/20000, Converged: True, Runtime: 23.55 sec
Model: 4, Seed: 43301, Q(true): 63928.0347, Steps: 1496/20000, Converged: True, Runtime: 14.93 sec
Model: 1, Seed: 77395, Q(true): 65541.8666, Steps: 1292/20000, Converged: True, Runtime: 12.35 sec
Model: 2, Seed: 65457, Q(true): 65049.2052, Steps: 1349/20000, Converged: True, Runtime: 13.71 sec
Model: 5, Seed: 85859, Q(true): 63840.4524, Steps: 2344/20000, Converged: True, Runtime: 23.23 sec
Model: 6, Seed: 8594, Q(true): 66533.3494, Steps: 1012/20000, Converged: True, Runtime: 10.55 sec
Model: 9, Seed: 9417, Q(true): 66038.0929, Steps: 1678/20000, Converged: True, Runtime: 16.77 sec
Model: 10, Seed: 52647, Q(true): 66365.8347, Steps: 1089/20000, Converged: True, Runtime: 11.75 sec
Model: 7, Seed: 69736, Q(true): 65055.9772, Steps: 2168/20000, Converged: True, Runtime: 22.75 sec
Model: 11, Seed: 97562, Q(true): 66462.6334, Steps: 1070/20000, Converged: True, Runtime: 11.28 sec
Model: 8, Seed: 20146, Q(true): 63894.2365, Steps: 2287/20000, Converged: True, Runtime: 23.52 sec
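
The Q(true) value reported during training can be read as the sum of squared, uncertainty-scaled residuals. The sketch below is a rough illustration under two assumptions: that Q is defined as in PMF, and that the "ls-nmf" method follows the multiplicative update rules of the published LS-NMF algorithm with weights 1/U². The names `W` (factor contributions), `H` (factor profiles), `q_loss`, and `ls_nmf_step` are chosen here and are not taken from nmf_py.

```python
import numpy as np


def q_loss(V, U, W, H):
    """Sum of squared residuals, each scaled by its uncertainty."""
    return float(np.sum(((V - W @ H) / U) ** 2))


def ls_nmf_step(V, U, W, H, eps=1e-12):
    """One weighted multiplicative update of H, then W (assumed LS-NMF form)."""
    weights = 1.0 / U**2
    weighted_V = weights * V
    # eps guards against division by zero; the updates keep W and H non-negative.
    H = H * (W.T @ weighted_V) / (W.T @ (weights * (W @ H)) + eps)
    W = W * (weighted_V @ H.T) / ((weights * (W @ H)) @ H.T + eps)
    return W, H


# Synthetic data for illustration only: 30 samples, 8 features, 3 factors.
rng = np.random.default_rng(42)
V = rng.random((30, 8)) + 0.1
U = 0.1 * V + 0.01
W = rng.random((30, 3))
H = rng.random((3, 8))

q_start = q_loss(V, U, W, H)
for _ in range(50):
    W, H = ls_nmf_step(V, U, W, H)
q_end = q_loss(V, U, W, H)
```

Because each model starts from a different random initialization, the runs settle at different final Q values, which is why several models are trained and compared.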

In [None]:
# Select the best performing model to review
best_model = nmf_models.best_model
best_model

5
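
The selected index is consistent with choosing the run with the lowest final Q(true). The short check below assumes that this is how `best_model` is determined, which the notebook does not state. The dictionary `final_q` is a hypothetical structure filled with the final values of the runs that appear as converged in the training log above (the log does not show all 20 models).

```python
# Final Q(true) of each converged run reported in the training log, keyed by model index.
final_q = {
    0: 65069.5744, 1: 65541.8666, 2: 65049.2052, 3: 66051.8651,
    4: 63928.0347, 5: 63840.4524, 6: 66533.3494, 7: 65055.9772,
    8: 63894.2365, 9: 66038.0929, 10: 66365.8347, 11: 66462.6334,
}

# Pick the model index with the smallest loss.
best = min(final_q, key=final_q.get)
print(best)  # prints 5
```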

In [None]:
# Initialize the Model Analysis module
model_analysis = ModelAnalysis(datahandler=data_handler, model=nmf_models, selected_model=best_model)

In [None]:
# Residual analysis shows the scaled residual histogram, along with metrics and distribution curves. The function returns the residuals whose absolute value exceeds abs_threshold.
abs_threshold = 3.0
threshold_residuals = model_analysis.plot_residual_histogram(feature_idx=0, abs_threshold=abs_threshold)

In [None]:
print(f"List of Absolute Scaled Residuals Greater than: {abs_threshold}. Count: {threshold_residuals.shape[0]}")
threshold_residuals

List of Absolute Scaled Residuals Greater than: 3.0. Count: 71


Index,124-Trimethylbenzene,datetime
0,-3.999048,6/1/2005 6:00
8,5.534593,6/6/2005 3:00
9,6.028516,6/6/2005 6:00
14,4.639467,6/9/2005 3:00
15,6.373150,6/10/2005 3:00
...,...,...
294,-3.422925,9/18/2006 6:00
301,6.528177,9/22/2006 6:00
303,5.100881,9/25/2006 6:00
304,6.383691,9/26/2006 3:00
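
The filtering behind this table can be sketched with NumPy. The example assumes a scaled residual is (observed - estimated) / uncertainty, as in PMF; the helper names and the `V_hat` array of model estimates are hypothetical and do not reflect how `plot_residual_histogram` is implemented.

```python
import numpy as np


def scaled_residuals(V, U, V_hat):
    """Residuals between observed and estimated data, scaled by the uncertainty."""
    return (V - V_hat) / U


def exceeding(residuals, feature_idx, abs_threshold=3.0):
    """Return the row indices and values of one feature whose |residual| exceeds the threshold."""
    column = residuals[:, feature_idx]
    rows = np.flatnonzero(np.abs(column) > abs_threshold)
    return rows, column[rows]


# Illustrative single-feature data: the second and third samples are poorly fit.
V = np.array([[1.0], [2.0], [3.0]])
U = np.array([[0.1], [0.1], [0.1]])
V_hat = np.array([[1.05], [2.5], [2.6]])

rows, values = exceeding(scaled_residuals(V, U, V_hat), feature_idx=0, abs_threshold=3.0)
```

Negative values indicate samples the model overestimates and positive values samples it underestimates.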


In [None]:
# The model output statistics for the estimated V, including SE: Standard Error metrics, and 3 normal distribution tests of the residuals (KS Normal is used in PMF5)
model_analysis.calculate_statistics()
model_analysis.statistics

Index,Features,Category,r2,Intercept,Intercept SE,Slope,Slope SE,SE,SE Regression,Anderson Normal Residual,Anderson Statistic,Shapiro Normal Residuals,Shapiro PValue,KS Normal Residuals,KS PValue,KS Statistic
0,124-Trimethylbenzene,Strong,0.588973,0.197586,0.054143,0.657848,0.031468,0.034396,0.511612,No,2.847252,No,5.039691e-06,Yes,0.1540227,0.064066
1,224-Trimethylpentane,Strong,0.827492,0.695924,0.064181,0.65117,0.017024,0.057165,0.649723,15.0,0.214079,Yes,0.7244727,Yes,0.9710335,0.027341
2,234-Trimethylpentane,Strong,0.752289,0.189695,0.027837,0.66684,0.02191,0.021412,0.282952,No,1.46942,No,0.000582703,Yes,0.2716517,0.056446
3,23-Dimethylbutane,Strong,0.552614,0.207565,0.08312,0.639939,0.03297,0.067606,1.004347,No,1.306493,No,0.0009724029,Yes,0.3156685,0.054238
4,23-Dimethylpentane,Strong,0.784656,0.099089,0.016666,0.739934,0.022196,0.011492,0.167215,No,3.303195,No,1.36764e-08,No,0.03784841,0.079814
5,2-Methylheptane,Strong,0.573763,0.056348,0.016459,0.674406,0.033284,0.011334,0.173263,No,1.438563,No,0.0006605691,Yes,0.3804811,0.051341
6,3-Methylhexane,Strong,0.707751,0.183255,0.036302,0.676358,0.024886,0.025434,0.357434,No,4.307489,No,2.793057e-10,No,0.002475226,0.103789
7,3-Methylpentane,Strong,0.541216,1.198529,0.161995,0.454739,0.023973,0.183461,1.957702,No,1.571499,No,0.001082421,Yes,0.2083368,0.060135
8,Acetylene,Strong,0.794266,0.326204,0.059588,0.761776,0.0222,0.034662,0.517446,15.0,0.199257,Yes,0.8460033,Yes,0.9064345,0.031752
9,Benzene,Strong,0.77526,0.663911,0.083857,0.707843,0.021822,0.051291,0.713236,5.0,0.700786,Yes,0.08472921,Yes,0.6284243,0.042233
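
The r2, Intercept, and Slope columns can be approximated by a straight-line fit of the estimated values against the observed values for one feature. The sketch below assumes that this is the regression being reported; `fit_line` is a hypothetical helper, and the standard errors and normality tests from the table are not reproduced here.

```python
import numpy as np


def fit_line(observed, predicted):
    """Least-squares line predicted = intercept + slope * observed, plus r squared."""
    slope, intercept = np.polyfit(observed, predicted, deg=1)
    r = np.corrcoef(observed, predicted)[0, 1]
    return slope, intercept, r**2


# Illustrative data: estimates that follow the observations exactly along a line.
observed = np.array([1.0, 2.0, 3.0, 4.0])
predicted = 0.5 + 0.7 * observed
slope, intercept, r2 = fit_line(observed, predicted)
```

A slope below 1 together with a positive intercept, as in most rows of the table, means that high concentrations are underestimated and low concentrations are overestimated.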


In [None]:
# Model feature observed vs predicted plot with regression and one-to-one lines. Feature/Column specified by index.
model_analysis.plot_estimated_observed(feature_idx=2)

In [None]:
# Model feature timeseries analysis plot showing the observed vs predicted values of the feature, along with the residuals shown below. Feature/column specified by index.
model_analysis.plot_estimated_timeseries(feature_idx=1)

In [None]:
# Factor profile plot: the top plot shows the factor's summed concentration for each feature (blue bars) and the percentage of each feature attributed to the factor (red dots); the bottom plot shows the normalized contributions by date (values are resampled at a daily timestep for timeseries consistency).
# Factor specified by index.
model_analysis.plot_factor_profile(factor_idx=0)

In [None]:
# Model factor fingerprint specifies the feature percentage of each factor.
model_analysis.plot_factor_fingerprints()
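
The percentages in a fingerprint plot can be computed from the factor profiles as sketched below. This assumes a profile matrix `H` of shape (factors, features), a name chosen here for illustration, and that each feature's percentages sum to 100 across factors.

```python
import numpy as np


def factor_fingerprints(H):
    """Percentage of each feature attributed to each factor; every column sums to 100."""
    H = np.asarray(H, dtype=float)
    return 100.0 * H / H.sum(axis=0, keepdims=True)


# Illustrative profiles: two factors, two features.
H = np.array([[1.0, 3.0], [3.0, 1.0]])
fingerprints = factor_fingerprints(H)
```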

In [None]:
# Factor G-Space plot shows the normalized contributions of one factor vs another factor. Factor specified by index.
model_analysis.plot_g_space(factor_1=2, factor_2=1)
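
A G-space plot compares normalized contributions. The sketch below assumes the PMF convention of scaling each factor's contributions so that their mean over all samples equals 1; whether nmf_py normalizes in exactly this way is not stated in the notebook. The contribution matrix `W` of shape (samples, factors) is a name chosen for this example.

```python
import numpy as np


def normalize_contributions(W):
    """Scale each factor column so that its mean contribution over all samples is 1."""
    W = np.asarray(W, dtype=float)
    return W / W.mean(axis=0, keepdims=True)


# Illustrative contributions: two samples, two factors on very different scales.
W = np.array([[1.0, 10.0], [3.0, 30.0]])
G = normalize_contributions(W)
```

Plotting one column of `G` against another puts factors with different absolute magnitudes on a common scale.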

In [None]:
# Factor contribution pie chart shows the percentage of factor contributions for the specified feature, and the corresponding normalized contribution of each factor for that feature (bottom plot). Feature specified by index.
model_analysis.plot_factor_contributions(feature_idx=1)