# Feed-forward neural network using ozone data at Joshua Tree
[![Latest release](https://badgen.net/github/release/Naereen/Strapdown.js)](https://github.com/eabarnes1010/course_ml_ats/tree/main/code)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/eabarnes1010/course_ml_ats/blob/main/code/ann_ozone_joshuatree_blank.ipynb)

* Created originally by TA Jamin Rader [CSU] for ATS 780A7 at Colorado State University led by Prof. Elizabeth Barnes

# 0. Set Up Environments

In [None]:
try:
    import google.colab

    IN_COLAB = True
except ImportError:
    IN_COLAB = False
print("IN_COLAB = " + str(IN_COLAB))

In [None]:
import sys
import numpy as np
import seaborn as sb

import pandas as pd
import datetime
import tensorflow as tf
import sklearn

# import pydot
import matplotlib.pyplot as plt
import matplotlib.patches as patches

# %matplotlib inline

# tf.compat.v1.disable_v2_behavior()

In [None]:
print(f"python version = {sys.version}")
print(f"numpy version = {np.__version__}")
print(f"tensorflow version = {tf.__version__}")

In [None]:
## UNCOMMENT IF YOU WANT TO SAVE FIGURES IN COLABORATORY

# if(IN_COLAB==True):
#     try:
#         from google.colab import drive
#         drive.mount('/content/drive', force_remount=True)
#         local_path = '/content/drive/My Drive/Colab Notebooks/'
#     except:
#         local_path = './'
# else:
#     local_path = '../figures/'


# 1. Data Preparation

### 1.1 Data Overview

These are ozone and meteorological data from [CASTNET](https://www.epa.gov/castnet) (Clean Air Status and Trends Network) for Joshua Tree National Park, located just outside of Palm Springs and about 100 miles east of Los Angeles. The National Park Service monitors ozone in its parks. Joshua Tree has recorded at least 30 exceedance days per year [since 2016](https://www.nps.gov/subjects/air/ozone-exceed.htm). An exceedance day occurs when the daily maximum 8-hour ozone average is 71 ppb or higher. For comparison, Rocky Mountain NP has experienced only 35 ozone exceedance days since 2016.
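To make the exceedance definition concrete, here is a minimal sketch that counts exceedance days from a series of hourly ozone values. The function name `count_exceedance_days`, the synthetic data, and the choice to assign each trailing 8-hour window to the day in which it ends are assumptions made for illustration; the regulatory calculation has additional rules (for example, data completeness) that are not reproduced here.

```python
import numpy as np


def count_exceedance_days(hourly_ppb, threshold=71.0, window=8):
    """Count days whose maximum 8-hour running mean is at or above threshold.

    Simplified sketch: expects complete hourly data starting at hour 0 of a day.
    """
    hourly_ppb = np.asarray(hourly_ppb, dtype=float)
    n_days = hourly_ppb.size // 24
    hourly_ppb = hourly_ppb[: n_days * 24]

    # Trailing running mean; the first (window - 1) hours have no full window
    kernel = np.ones(window) / window
    running = np.convolve(hourly_ppb, kernel, mode="valid")
    padded = np.concatenate([np.full(window - 1, np.nan), running])

    # Daily maximum of the running mean, ignoring the undefined leading hours
    daily_max = np.nanmax(padded.reshape(n_days, 24), axis=1)
    return int(np.sum(daily_max >= threshold))


# Synthetic example: three days at 40 ppb, with eight midday hours at 80 ppb on day 2
hourly = np.full(72, 40.0)
hourly[24 + 10 : 24 + 18] = 80.0
print("Exceedance days:", count_exceedance_days(hourly))
```

Only the second day contains a full 8-hour window averaging 71 ppb or more, so it is the only day counted.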





In [None]:
# Read in data from url
url = "https://raw.githubusercontent.com/eabarnes1010/course_ml_ats/main/data/ozone_data_joshuatreenp.csv"
data = pd.read_csv(url, parse_dates=["DATE_TIME"])

# Fix data issue with Daylight Savings Time
duplicate_dates = data["DATE_TIME"][data.duplicated("DATE_TIME")]
for dup_date in duplicate_dates:
    idx = data["DATE_TIME"].eq(dup_date).idxmax()
    data.at[idx, "DATE_TIME"] = dup_date - pd.Timedelta(value=1, unit="hours")

# Add hour, month, year, and day of year
data["HOUR"] = data["DATE_TIME"].dt.hour
data["MONTH"] = data["DATE_TIME"].dt.month
data["YEAR"] = data["DATE_TIME"].dt.year
data["DAYOFYEAR"] = data["DATE_TIME"].dt.dayofyear
data.sort_values("DATE_TIME", inplace=True, ignore_index=True)

Let's take a look at the data. We have data for ozone, temperature, relative humidity, and wind direction, among others.

In [None]:
display(data.head())


### 1.2 Define Input and Output

The 2015 benchmark for [human health ozone condition](https://www.nps.gov/articles/analysis-methods2020.htm) is shown here. Let us predict whether the ozone quality will be classified as good, fair, or poor over 8-hour periods.

**Good**   $\leq$ 54.9 ppb

**Fair**   55.0 - 70.9 ppb

**Poor**   $\geq$ 71.0 ppb

Let's start out by training our model using Temperature, Relative Humidity, Windspeed, and Day of Year.

In [None]:
# Here are all the different variables that we could use for training our neural
# network (except ozone, of course)
data.columns


**EDIT the Input Variables Here:** Reminder: if you choose to use wind direction, you must first convert it to vector components before averaging, because direction is circular (the average of 350° and 10° should be 0°, not 180°).
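One possible way to do the wind-direction conversion is sketched below. It assumes the direction is given in degrees using the meteorological convention (the direction the wind blows *from*); the helper names `wind_to_components` and `components_to_direction` are hypothetical and are not part of the notebook or the data set.

```python
import numpy as np


def wind_to_components(speed, direction_deg):
    """Convert wind speed and direction (degrees, blowing FROM) to u and v."""
    theta = np.deg2rad(direction_deg)
    u = -speed * np.sin(theta)  # eastward component
    v = -speed * np.cos(theta)  # northward component
    return u, v


def components_to_direction(u, v):
    """Convert u and v components back to a direction in degrees [0, 360)."""
    return np.rad2deg(np.arctan2(-u, -v)) % 360.0


# Two northerly winds on either side of 0 degrees
speed = np.array([5.0, 5.0])
direction = np.array([350.0, 10.0])

u, v = wind_to_components(speed, direction)
mean_dir = components_to_direction(u.mean(), v.mean())
naive_mean_dir = direction.mean()

print("Vector-averaged direction:", mean_dir)
print("Naive average of degrees:", naive_mean_dir)
```

The naive average points due south, while the vector average recovers a northerly wind (up to floating-point rounding, which may print a value just below 360 instead of 0). The same idea applies to a rolling mean: average `u` and `v`, then convert back if a direction is needed.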

In [None]:
# List of strings from the available column names in the data set
INPUT_VARIABLES = [
    "TEMPERATURE",
    "RELATIVE_HUMIDITY",
    "WINDSPEED",
    "DAYOFYEAR",
]


In [None]:
# Let's isolate our variables of interest and take the 8-hour running mean

# First using input and output variables together to take running mean
df_data_to_be_used = data[["OZONE"] + INPUT_VARIABLES]

# Here we take the 8-hour rolling mean (note: DATE_TIME does not work)
df_data_to_be_used = df_data_to_be_used.rolling(8).mean()

# Now adding Date and Time components
df_data_to_be_used[["DATE_TIME", "HOUR", "MONTH", "YEAR"]] = data[["DATE_TIME", "HOUR", "MONTH", "YEAR"]]

# Dropping NaNs
df_data_to_be_used.dropna(inplace=True)

display(df_data_to_be_used.head())

In [None]:
# Creating a numpy array for our inputs and outputs
input = df_data_to_be_used[INPUT_VARIABLES].values
output_raw = df_data_to_be_used["OZONE"].values

# Creating numpy arrays for time/date info for visualizations
hour = df_data_to_be_used["HOUR"].values
month = df_data_to_be_used["MONTH"].values
year = df_data_to_be_used["YEAR"].values

# Turning ozone into classification problem:
# Class 0: Good, Class 1: Fair, Class 2: Poor
output_class = (output_raw >= 55).astype(int) + (output_raw >= 71).astype(int)
output = (output_class.reshape(-1, 1) == np.unique(output_class)).astype(int)

# Here are some examples of how our data is encoded into classes.
print("Ozone Value:", output_raw[0])
print("Ozone Class:", output_class[0])
print("Encoded As:", output[0])
print()
print("Ozone Value:", output_raw[2000])
print("Ozone Class:", output_class[2000])
print("Encoded As:", output[2000])
print()
print("Ozone Value:", output_raw[90116])
print("Ozone Class:", output_class[90116])
print("Encoded As:", output[90116])

In [None]:
# Printing the shapes of our input and output arrays (#samples , #dimension of input/output)
print("Input Array Shape:", input.shape)
print("Output Array Shape:", output.shape)

### 1.3 Visualizing our Data

Let's look at what our output data actually looks like.

In [None]:
# How often does our data fall into each category?
calcpercent = lambda cat: str((np.sum(output_class == cat) / len(output_class) * 100).astype(int))

# Print out the sizes of each class
print("Frequency for each Ozone Category")
print("Good: " + calcpercent(0) + "%")
print("Fair: " + calcpercent(1) + "%")
print("Poor: " + calcpercent(2) + "%")

In [None]:
# Distribution of ozone concentrations
sb.displot(output_raw, kind="hist")
plt.xlabel("Ozone (ppb)")
plt.axvline(x=71, color="red")
plt.axvline(x=55, color="goldenrod")
plt.text(56, 50, "Fair", rotation=90, color="goldenrod")
plt.text(72, 50, "Poor", rotation=90, color="red")

plt.title("Histogram of O3 Concentrations in Joshua Tree NP", fontweight="demi")
plt.plot()
plt.show()

In [None]:
# Distribution of ozone concentrations in each month

fig, axs = plt.subplots(2, 6, figsize=(15, 5))

for m in np.arange(12):
    ax = axs[m // 6, m % 6]
    sb.histplot(output_raw[month == m + 1], ax=ax)
    ax.set_title(datetime.datetime.strptime(str(m + 1), "%m").strftime("%b"))
    ax.axvline(x=71, color="red")
    ax.axvline(x=55, color="goldenrod")
    ax.set_xlim(0, 100)
    ax.set_ylim(0, 500)

fig.tight_layout(pad=1.0)
print("O3 Concentrations in Each Month")

In [None]:
# Distribution of ozone concentrations in each year

fig, axs = plt.subplots(2, 7, figsize=(15, 5))

axidx = 0
for y in np.unique(year):
    ax = axs[axidx // 7, axidx % 7]
    sb.histplot(output_raw[year == y], ax=ax)
    ax.set_title(y)
    ax.axvline(x=71, color="red")
    ax.axvline(x=55, color="goldenrod")
    ax.set_xlim(0, 100)
    axidx += 1

fig.tight_layout(pad=1.0)
print("O3 Concentrations in Each Year")

In [None]:
# How many samples are available in each year? Data cannot be used if
# there are NaNs (see 2013)
sb.histplot(year)
plt.title("Number of Usable Samples in Each Year")


### 1.4 Partitioning Data in Training, Validation, and Testing Sets

Our data is highly temporally correlated, so we are going to separate training, validation, and testing by grabbing different years of data, *not* by random sampling.
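A quick way to see why random sampling would be risky: an 8-hour running mean makes neighboring samples share seven of their eight hours. The sketch below uses synthetic white noise rather than the ozone data, and the helper `lag_correlation` is a hypothetical name introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Uncorrelated synthetic "hourly" values
noise = rng.standard_normal(20_000)

# The same 8-sample running mean that is applied to the ozone data
window = 8
smoothed = np.convolve(noise, np.ones(window) / window, mode="valid")


def lag_correlation(x, lag):
    """Pearson correlation between a series and itself shifted by lag samples."""
    return float(np.corrcoef(x[:-lag], x[lag:])[0, 1])


r1 = lag_correlation(smoothed, 1)
r8 = lag_correlation(smoothed, 8)
print(f"lag-1 correlation: {r1:.2f}")
print(f"lag-8 correlation: {r8:.2f}")
```

Even with uncorrelated inputs, adjacent smoothed samples are strongly correlated (roughly 7/8), so a random split would place near-duplicates of training samples in the validation and testing sets. Real ozone data adds diurnal and seasonal persistence on top of this.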

**Some Variable Definitions**

* **`Xtrain/Xval/Xtest`:** 2-D arrays of input data (shape: #samples, #input_variables)

* **`Ttrain/Tval/Ttest`:** 2-D arrays of target output data (true ozone class likelihood; shape: #samples, #classes)

* **`Ptrain/Pval/Ptest`:** 2-D arrays of predicted output data (predicted ozone class likelihoods; shape: #samples, #classes)

* **`Xtrain_raw/Xval_raw/Xtest_raw`:** 2-D arrays of raw (pre-standardized) input data (shape: #samples, #input_variables)

* **`O3train/O3val/O3test`:** 1-D arrays of raw ozone measurements (ppb; shape: #samples)

* **`Cttrain/Ctval/Cttest`:** 1-D arrays of the true ozone class (shape: #samples)

* **`Cptrain/Cpval/Cptest`:** 1-D arrays of the predicted ozone class with the highest likelihood (shape: #samples)
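The relationship between the `T`, `P`, `Ct`, and `Cp` arrays can be sketched on a handful of made-up values. The ozone values and predicted likelihoods below are invented for illustration, and the one-hot encoding compares against `np.arange(3)` (an assumption that all three classes are always represented as columns) rather than `np.unique`.

```python
import numpy as np

# Made-up 8-hour mean ozone values (ppb)
ozone = np.array([40.0, 60.0, 75.0, 54.9, 71.0])

# Class 0: Good (< 55), Class 1: Fair (55 to < 71), Class 2: Poor (>= 71)
classes = (ozone >= 55).astype(int) + (ozone >= 71).astype(int)

# T: one-hot targets, shape (#samples, #classes)
T = (classes.reshape(-1, 1) == np.arange(3)).astype(int)

# P: made-up predicted class likelihoods (each row sums to 1)
P = np.array(
    [
        [0.7, 0.2, 0.1],
        [0.3, 0.5, 0.2],
        [0.1, 0.3, 0.6],
        [0.4, 0.5, 0.1],
        [0.2, 0.3, 0.5],
    ]
)

Ct = T.argmax(axis=1)  # true class per sample
Cp = P.argmax(axis=1)  # predicted class per sample (highest likelihood)
accuracy = float(np.mean(Ct == Cp))

print("Ct:", Ct)
print("Cp:", Cp)
print("Categorical accuracy:", accuracy)
```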

**EDIT the years used for training, validation and testing here:**

In [None]:
# Using the years 2010 - 2017 for training, 2018-2019 for validation, and 2020-2021 for testing
TRAIN_RANGE = (2010, 2017)
VAL_RANGE = (2018, 2019)
TEST_RANGE = (2020, 2021)


In [None]:
# Splitting into training, testing, validation

# This function returns a boolean array of years that fall within the given year range
year_bool = lambda yrrange: np.logical_and(year >= yrrange[0], year <= yrrange[1])

# Create the input and output arrays from training, testing, validation sets
# Inputs haven't been standardized yet (thus "_raw")

Xtrain_raw = input[year_bool(TRAIN_RANGE)]  # these are the inputs (X)
Ttrain = output[year_bool(TRAIN_RANGE)]  # these are the outputs (T is for target)

Xval_raw = input[year_bool(VAL_RANGE)]
Tval = output[year_bool(VAL_RANGE)]

Xtest_raw = input[year_bool(TEST_RANGE)]
Ttest = output[year_bool(TEST_RANGE)]

# These are the raw outputs in each set for use later
O3train = output_raw[year_bool(TRAIN_RANGE)]
O3val = output_raw[year_bool(VAL_RANGE)]
O3test = output_raw[year_bool(TEST_RANGE)]

print("Shapes:")
print("  Xtrain: ", Xtrain_raw.shape)
print("  Xval: ", Xval_raw.shape)
print("  Xtest: ", Xtest_raw.shape)
print("  Ttrain: ", Ttrain.shape)
print("  Tval: ", Tval.shape)
print("  Ttest: ", Ttest.shape)

In [None]:
# Standardizing the training, testing, and validation data

# This function takes a raw set of input fields (for example, the training,
# validation, or testing arrays), and standardizes it based on the training data.

standardize_input = lambda dat, x, s: (dat - x) / s

# Calculate mean and standard deviation of the training data
trainmean = Xtrain_raw.mean(axis=0)
trainstd = Xtrain_raw.std(axis=0)

Xtrain = standardize_input(Xtrain_raw, trainmean, trainstd)
Xval = standardize_input(Xval_raw, trainmean, trainstd)
Xtest = standardize_input(Xtest_raw, trainmean, trainstd)

In [None]:
raise Exception("stop here.")


# 2. Neural Network

### 2.1 Building the Model

In [None]:
def build_model(x_train, y_train, settings):
    # create input layer
    input_layer = tf.keras.layers.Input(shape=x_train.shape[1:])

    # create a normalization layer if you would like
    normalizer = tf.keras.layers.Normalization(axis=(1,))
    normalizer.adapt(x_train)
    layers = normalizer(input_layer)

    # create hidden layers each with specific number of nodes
    assert len(settings["hiddens"]) == len(
        settings["activations"]
    ), "hiddens and activations settings must be the same length."

    # add dropout layer
    layers = tf.keras.layers.Dropout(rate=settings["dropout_rate"])(layers)

    for hidden, activation in zip(settings["hiddens"], settings["activations"]):
        layers = tf.keras.layers.Dense(
            units=hidden,
            activation=activation,
            use_bias=True,
            kernel_regularizer=tf.keras.regularizers.l1_l2(l1=0, l2=0),
            bias_initializer=tf.keras.initializers.RandomNormal(seed=settings["random_seed"]),
            kernel_initializer=tf.keras.initializers.RandomNormal(seed=settings["random_seed"]),
        )(layers)

    # create output layer
    output_layer = tf.keras.layers.Dense(
        units=y_train.shape[-1],
        activation="softmax",
        use_bias=True,
        bias_initializer=tf.keras.initializers.RandomNormal(seed=settings["random_seed"] + 1),
        kernel_initializer=tf.keras.initializers.RandomNormal(seed=settings["random_seed"] + 2),
    )(layers)

    # construct the model
    model = tf.keras.Model(inputs=input_layer, outputs=output_layer)
    model.summary()

    return model


def compile_model(model, settings):
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=settings["learning_rate"]),
        loss=tf.keras.losses.CategoricalCrossentropy(),
        metrics=[
            tf.keras.metrics.CategoricalAccuracy(),
        ],
    )
    return model


In [None]:
settings = {
    "hiddens": [3, 3],
    "activations": ["relu", "relu"],
    "dropout_rate": 0.0,  # read by build_model; 0.0 disables dropout
    "learning_rate": 0.001,
    "random_seed": 33,
    "max_epochs": 1_000,
    "batch_size": 256,
    "patience": 10,
}

tf.keras.backend.clear_session()
tf.keras.utils.set_random_seed(settings["random_seed"])

model = build_model(Xtrain, Ttrain, settings)
model = compile_model(model, settings)


### 2.2 Training the Model

In [None]:
# define the class weights
class_weights = {
    0: 1 / np.mean(Ttrain[:, 0] == 1),
    1: 1 / np.mean(Ttrain[:, 1] == 1),
    2: 1 / np.mean(Ttrain[:, 2] == 1),
}
# class_weights = {0: 1, 1: 1, 2: 1}

# define the early stopping callback
early_stopping_callback = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=settings["patience"], restore_best_weights=True, mode="auto"
)

# train the model via model.fit
history = model.fit(
    Xtrain,
    Ttrain,
    epochs=settings["max_epochs"],
    batch_size=settings["batch_size"],
    shuffle=True,
    validation_data=(Xval, Tval),
    class_weight=class_weights,
    callbacks=[early_stopping_callback],
    verbose=1,
)


# 3. Model Performance
How should we evaluate our ANN's performance? Categorical accuracy is one option: it tells us how often the predicted class matches the true class. However, because 65% of our samples fall in the "Good" class, a model that predicts "Good" every time would still be 65% accurate while teaching us nothing about what it takes to predict "Fair" or "Poor" ozone. Below, we compare categorical accuracy with weighted categorical accuracy, which weights each sample by the inverse of its class frequency to account for class imbalance.
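The effect of inverse-frequency weighting can be checked on a toy case. In this sketch, the 65/25/10 class split is an assumption (only the 65% "Good" share comes from the text above), and the weights are computed from the same toy set that is scored, whereas the notebook derives them from the training set.

```python
import numpy as np

# Toy data set: 65 "Good", 25 "Fair", 10 "Poor" samples
true_class = np.repeat([0, 1, 2], [65, 25, 10])

# A trivial model that predicts "Good" every time
pred_class = np.zeros_like(true_class)
correct = (pred_class == true_class).astype(float)

# Plain categorical accuracy
accuracy = float(correct.mean())

# Weighted accuracy: each sample is weighted by 1 / (frequency of its true class)
class_freq = np.bincount(true_class, minlength=3) / true_class.size
sample_weight = 1.0 / class_freq[true_class]
weighted_accuracy = float(np.sum(sample_weight * correct) / np.sum(sample_weight))

print(f"Categorical accuracy:          {accuracy:.3f}")
print(f"Weighted categorical accuracy: {weighted_accuracy:.3f}")
```

With inverse-frequency weights, every class contributes equally to the score, so the always-"Good" model drops from 65% to one in three, the same as random guessing among three classes.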

In [None]:
# Let's plot the change in loss and categorical_accuracy

fig, axs = plt.subplots(1, 2, figsize=(10, 4))

axs[0].plot(history.history["loss"], label="training")
axs[0].plot(history.history["val_loss"], label="validation")
axs[0].set_xlabel("Epoch")
axs[0].set_ylabel("Loss")
axs[0].legend()

axs[1].plot(history.history["categorical_accuracy"], label="training")
axs[1].plot(history.history["val_categorical_accuracy"], label="validation")
axs[1].set_xlabel("Epoch")
axs[1].set_ylabel("Categorical Accuracy")
axs[1].legend()

In [None]:
# What predictions did the model make for our training, validation, and test sets?
Ptrain = model.predict(Xtrain)  # Array of class likelihoods for each class
Pval = model.predict(Xval)
Ptest = model.predict(Xtest)

Cptrain = Ptrain.argmax(axis=1)  # 1-D array of predicted class (highest likelihood)
Cpval = Pval.argmax(axis=1)
Cptest = Ptest.argmax(axis=1)

Cttrain = Ttrain.argmax(axis=1)  # 1-D array of truth class
Ctval = Tval.argmax(axis=1)
Cttest = Ttest.argmax(axis=1)

In [None]:
from sklearn.metrics import f1_score, accuracy_score

print("Validation Categorical Accuracy:", accuracy_score(Ctval, Cpval))

# Weight equal to the inverse of the frequency of the class
cat_weights = np.sum((1 / np.mean(Ttrain, axis=0)) * Tval, axis=1)
print("Validation Weighted Categorical Accuracy:", accuracy_score(Ctval, Cpval, sample_weight=cat_weights))

In [None]:
def confusion_matrix(predclasses, targclasses):
    class_names = np.unique(targclasses)

    table = []
    for pred_class in class_names:
        row = []
        for true_class in class_names:
            row.append(100 * np.mean(predclasses[targclasses == true_class] == pred_class))
        table.append(row)
    class_titles_t = ["T(Good)", "T(Fair)", "T(Poor)"]
    class_titles_p = ["P(Good)", "P(Fair)", "P(Poor)"]
    conf_matrix = pd.DataFrame(table, index=class_titles_p, columns=class_titles_t)
    display(conf_matrix.style.background_gradient(cmap="Blues").format("{:.1f}"))

In [None]:
print("Predicted versus Target Classes")
confusion_matrix(Cptrain, Cttrain)
confusion_matrix(Cpval, Ctval)
# confusion_matrix(Cptest, Cttest)

In [None]:
df_class0 = pd.DataFrame(O3val[Cpval == 0])
df_class1 = pd.DataFrame(O3val[Cpval == 1])
df_class2 = pd.DataFrame(O3val[Cpval == 2])

sb.violinplot(data=[O3val[Cpval == 0], O3val[Cpval == 1], O3val[Cpval == 2]])

plt.axhline(55, color="goldenrod", zorder=0)
plt.axhline(71, color="red", zorder=0)
plt.ylabel("Ozone (ppb)")
plt.xlabel("Predicted Class")
plt.xticks(
    [0, 1, 2],
    ["Good", "Fair", "Poor"],
)
plt.title("O3 Concentrations when each Class is Predicted", fontweight="demi")
plt.show()

In [None]:
raise Exception("stop code prior to explainability techniques")

# 4. Explainability Techniques

### 4.1 SHAP (SHapley Additive exPlanations)

We know the accuracy of our neural network. When it makes correct predictions, is it making the right prediction for the right reasons? This is where explainable AI comes in.

Below we will use DeepSHAP to understand how our neural network made its predictions. 

In [None]:
if IN_COLAB:
    # Installing SHAP
    %pip install shap

import shap
shap.initjs()

In [None]:
# DeepSHAP is slow, so we use a random subset of 100 training samples as the
# background (reference) data against which feature contributions are measured.
background = Xtrain[np.random.choice(Xtrain.shape[0], 100, replace=False)]

# We will use this to explain the neural network's decision-making process
explainer = shap.DeepExplainer(model, background)

# Selecting data for when a prediction was correct
Xgoodgood = Xtest[np.logical_and(Cttest == 0, Cptest == 0)]
Xfairfair = Xtest[np.logical_and(Cttest == 1, Cptest == 1)]
Xpoorpoor = Xtest[np.logical_and(Cttest == 2, Cptest == 2)]

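# These are the SHAP values for when a prediction was correct.
# Note: indexing with [0], [1], [2] assumes shap_values returns a list with one
# array per class; some newer SHAP releases may return a single array instead.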
shap_values_goodgood = explainer.shap_values(Xgoodgood)[0]
shap_values_fairfair = explainer.shap_values(Xfairfair)[1]
shap_values_poorpoor = explainer.shap_values(Xpoorpoor)[2]

In [None]:
# Reduce noise by summarizing sorted 1% segments of the data (num=100 segments)
def get_percentile_stats(arr, sortarr, num=100, func=np.mean):
    sorted_arr = arr[np.argsort(sortarr)]  # Sorts arr based on values in sortarr
    split_arrs = np.array_split(sorted_arr, num)  # Splits the array into [num] segments
    statslist = list(map(func, split_arrs))  # Applies func (np.mean by default) to each segment
    return np.array(statslist)  # Returns array of the statistic for [num] sorted segments of arr


# Plot out the Shapley Values in a more visually appealing format
def plot_shap(shapvals, featurevals, title=False):
    # Number of samples
    samp_num = shapvals.shape[0]

    # Init colormap
    n = len(INPUT_VARIABLES)
    color = iter(plt.get_cmap("viridis")(np.linspace(0, 1, n)))

    for varindex, varname in enumerate(INPUT_VARIABLES):
        # Step color
        c = next(color)

        # Get the mean feature value within each 1% segment of sorted SHAP values
        featuremean_for_shappercentile = get_percentile_stats(featurevals[:, varindex], shapvals[:, varindex])

        # Get the median SHAP value within each 1% segment of sorted SHAP values
        shapmedian_for_shappercentile = get_percentile_stats(
            shapvals[:, varindex], shapvals[:, varindex], func=np.median
        )

        # Plot
        plt.plot(featuremean_for_shappercentile, shapmedian_for_shappercentile, "o", label=varname, color=c)
        plt.axhline(0, zorder=0, color="k", alpha=0.1)
        plt.axvline(0, zorder=0, color="k", alpha=0.1)

    plt.xlim(-3, 3)
    plt.legend(bbox_to_anchor=(1.6, 1), loc="upper right")

    plt.ylabel("Median Shap Value")
    plt.xlabel("Mean Feature Value")
    if title:
        plt.title(
            "SHAP Values and Feature Values when \nNeural Network correctly "
            + "predicts '"
            + title
            + "' Ozone\nN="
            + str(samp_num)
        )
    plt.show()

The plots below are de-noised versions of all our SHAP values. They tell us two things:

1) When a certain class of ozone was correctly predicted, what did the feature values look like? For example, when a 'Good' day was correctly predicted, were the temperature values generally warmer or cooler than average? (The data is standardized, so the average is zero.)

2) When a certain class of ozone was correctly predicted, how did changes in the feature values allow the network to be more certain (positive SHAP values) or less certain (negative SHAP values) in its prediction? For example, when windspeed was higher, did this make the neural network more confident in predicting a 'Poor' ozone day?

See what you can learn about the relationship between ozone and the input features by looking at these plots.
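To build intuition for the sign and size of SHAP values, it can help to look at a case where they have a closed form. The sketch below uses a made-up *linear* model, not the neural network or DeepSHAP: for a linear model with features treated as independent, the SHAP value of feature `i` is its weight times the feature's departure from the background mean. All names and numbers here are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Background data standing in for standardized training inputs (3 features)
background = rng.standard_normal((500, 3))

# A made-up linear model: f(x) = x . weights + bias
weights = np.array([2.0, -1.0, 0.5])
bias = 0.3


def model(x):
    return x @ weights + bias


# One sample to explain
x = np.array([1.0, 2.0, -1.0])

# Closed-form SHAP values for a linear model with independent features
shap_vals = weights * (x - background.mean(axis=0))

# Additivity: the SHAP values sum to the prediction minus the mean prediction
baseline = model(background).mean()
print("SHAP values:", shap_vals)
print("Sum of SHAP values:", shap_vals.sum())
print("f(x) - mean f(background):", model(x) - baseline)
```

A positive SHAP value means the feature pushed the output above the background average, and a negative value means it pushed the output below. The same additive reading applies to the per-class outputs explained in the plots, although the neural network's contributions are not linear in the features.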

In [None]:
plot_shap(shap_values_goodgood, Xgoodgood, title="Good")
plot_shap(shap_values_fairfair, Xfairfair, title="Fair")
plot_shap(shap_values_poorpoor, Xpoorpoor, title="Poor")

# 5. Model Competition - Do Not Run

We have set aside a bunch of ozone data. Tune your model to the best of your abilities, and we will see how it performed at the end of class. Specifically, we will be using Weighted Categorical Accuracy to measure model performance. **EDIT the code below to test.**

In [None]:
CODE = ""  # We will give you this at the end of class

DO NOT EDIT THE FOLLOWING:

In [None]:
def compete():
    # Read in data from url
    url = (
        "https://raw.githubusercontent.com/eabarnes1010/course_ml_ats/main/data/ozone_data_joshuatreenp_"
        + CODE
        + ".csv"
    )
    data = pd.read_csv(url, parse_dates=["DATE_TIME"])

    # Fix data issue with Daylight Savings Time
    duplicate_dates = data["DATE_TIME"][data.duplicated("DATE_TIME")]
    for dup_date in duplicate_dates:
        idx = data["DATE_TIME"].eq(dup_date).idxmax()
        data.at[idx, "DATE_TIME"] = dup_date - pd.Timedelta(value=1, unit="hours")

    # Add hour and day of year
    data["HOUR"] = data["DATE_TIME"].dt.hour
    data["MONTH"] = data["DATE_TIME"].dt.month
    data["YEAR"] = data["DATE_TIME"].dt.year
    data["DAYOFYEAR"] = data["DATE_TIME"].dt.dayofyear
    data.sort_values("DATE_TIME", inplace=True, ignore_index=True)

    df_data_to_be_used = data[["OZONE"] + INPUT_VARIABLES]
    df_data_to_be_used = df_data_to_be_used.rolling(8).mean()
    df_data_to_be_used[["DATE_TIME", "HOUR", "MONTH", "YEAR"]] = data[["DATE_TIME", "HOUR", "MONTH", "YEAR"]]
    df_data_to_be_used.dropna(inplace=True)

    Xcompete_raw = df_data_to_be_used[INPUT_VARIABLES].values
    output_raw = df_data_to_be_used["OZONE"].values
    hour = df_data_to_be_used["HOUR"].values
    month = df_data_to_be_used["MONTH"].values
    year = df_data_to_be_used["YEAR"].values

    output_class = (output_raw >= 55).astype(int) + (output_raw >= 71).astype(int)
    Tcompete = (output_class.reshape(-1, 1) == np.unique(output_class)).astype(int)
    year_bool = lambda yrrange: np.logical_and(year >= yrrange[0], year <= yrrange[1])

    standardize_input = lambda dat, x, s: (dat - x) / s
    Xcompete = standardize_input(Xcompete_raw, trainmean, trainstd)

    Pcompete = model.predict(Xcompete)
    Cpcompete = Pcompete.argmax(axis=1)
    Ctcompete = Tcompete.argmax(axis=1)

    cat_weights = np.sum((1 / np.mean(Ttrain, axis=0)) * Tcompete, axis=1)
    print(
        "Congrats! Your overall weighted categorical accuracy is:",
        accuracy_score(Ctcompete, Cpcompete, sample_weight=cat_weights),
    )
    print("Predicted versus Target Classes for Competition Data")
    confusion_matrix(Cpcompete, Ctcompete)

In [None]:
compete()
