**Predicting the age of a dataset of pipes: a regression (and classification) task**


---


First, let us import some packages and functions.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.initializers import RandomNormal

We load the dataset of pipes and we briefly visualize and outline its main properties.

In [None]:
url = 'https://raw.githubusercontent.com/cesc14/AgePipes/main/dataset.csv'
df = pd.read_csv(url)
df.head()   # At a glance...

In [None]:
df.info()   # Some further information

As we are interested in estimating the age of the pipes, it is worth visualizing the overall distribution of ages in this dataset.

In [None]:
ages_array = df['anno']
sns.displot(ages_array, binwidth=1)

print(np.unique(df['anno']))

Before proceeding with the exercise, it is a good idea to transform the dataset in such a way that it contains numerical entries only. To do so, we need to encode the material of each pipe "numerically". After that, we can set our regression problem up by considering pipes' age as the label to be predicted.

In [None]:
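# One-hot encode the categorical 'materiale' column
df = pd.get_dummies(df, columns=['materiale'])
db = df.to_numpy().astype(float)

# The first column is taken as the label, the remaining columns as features
X = db[:,1:]
y = db[:,0]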
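
The encoding step above can be illustrated on a tiny table. The sketch below is only an illustration: the column `diametro` and all values are made up, and only `anno` and `materiale` are taken from the notebook. It also selects the label by name rather than by position, which is one possible way to avoid depending on the column order.

```python
import pandas as pd

# Toy stand-in for the pipes table; 'diametro' and every value here are hypothetical.
toy = pd.DataFrame({
    "anno": [1965, 1980, 1992, 2005],
    "diametro": [80.0, 100.0, 150.0, 100.0],
    "materiale": ["steel", "iron", "steel", "pvc"],
})

# Each distinct material becomes its own 0/1 indicator column.
toy_encoded = pd.get_dummies(toy, columns=["materiale"])

# Pick the label by name, so the split does not rely on 'anno' being the first column.
y_toy = toy_encoded["anno"].to_numpy(dtype=float)
X_toy = toy_encoded.drop(columns="anno").to_numpy(dtype=float)

print(toy_encoded.columns.tolist())
print(X_toy.shape, y_toy.shape)
```

With three distinct materials, the single `materiale` column is replaced by three indicator columns, so the feature matrix has one numeric column plus three indicators.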


To experiment with this dataset, we then randomly split it into a training and a test set. We will learn from the training set and we will then assess the performance of the trained model on the test set.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True, random_state=42)

# Scale data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
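
Note that the scaler is fitted on the training set only and then applied unchanged to the test set, so no information from the test data leaks into training. A minimal NumPy sketch of the same idea, on made-up random data (all names and values are hypothetical), might look as follows. It uses the population standard deviation, which is also what `StandardScaler` uses.

```python
import numpy as np

rng = np.random.default_rng(0)
X_tr = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # made-up training features
X_te = rng.normal(loc=5.0, scale=2.0, size=(30, 3))   # made-up test features

# Statistics are estimated on the training set only.
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)

# The same shift and scale are then applied to both sets.
X_tr_scaled = (X_tr - mu) / sigma
X_te_scaled = (X_te - mu) / sigma

print(X_tr_scaled.mean(axis=0).round(3), X_tr_scaled.std(axis=0).round(3))
print(X_te_scaled.mean(axis=0).round(3), X_te_scaled.std(axis=0).round(3))
```

The training columns end up with zero mean and unit standard deviation by construction, while the test columns are only approximately standardized, which is expected.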

We are ready to train a simple feed-forward neural network on our dataset!

In [None]:
keras.utils.set_random_seed(42)

model = Sequential()
model.add(Dense(1000, input_dim=X_train.shape[1], activation='relu'))
model.add(Dense(200, activation='relu'))
model.add(Dense(1, activation='linear'))

loss = 'mean_squared_error'

opt = Adam(learning_rate=1e-3)
model.compile(loss=loss, optimizer=opt)

model.fit(X_train, y_train, epochs=60, batch_size=32, verbose=1)

y_pred = model.predict(X_test)

In [None]:
print("Mean absolute error (regression):", np.round(mean_absolute_error(y_test, y_pred),2),"years")
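
To judge whether a given mean absolute error is good, it can help to compare it with a trivial baseline that always predicts the same value. The sketch below is an illustration on made-up numbers, with a hypothetical helper `mae`. It flattens both arrays first, since `model.predict` returns an array of shape `(n, 1)` and subtracting it from a flat array of shape `(n,)` would broadcast to `(n, n)`.

```python
import numpy as np

def mae(y_true, y_hat):
    """Mean absolute error; both inputs are flattened to avoid (n,) vs (n, 1) broadcasting."""
    return float(np.mean(np.abs(np.ravel(y_true) - np.ravel(y_hat))))

# Made-up labels, for illustration only.
y_tr_toy = np.array([1960.0, 1970.0, 1975.0, 1990.0, 2000.0])
y_te_toy = np.array([1965.0, 1985.0, 1995.0])

# Constant baseline: always predict the median of the training labels.
baseline_pred = np.full_like(y_te_toy, np.median(y_tr_toy))

print("Baseline MAE:", round(mae(y_te_toy, baseline_pred), 2), "years")
```

A trained model is only useful if its error is clearly below such a baseline.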

You can play with the network's hyperparameters (e.g. the layer widths, the learning rate, or the number of epochs) to observe how the result changes!

**Moving to classification**

Alternatively, we may interpret the prediction of the pipes' age as a classification problem, which might lead to good results as well.
First, to obtain a restricted number of classes to be predicted, we group together pipes of similar ages.


In [None]:
def myround(x):
    # Round to the nearest multiple of 10, so that similar ages share one class
    return 10 * np.round(x/10)

y_train_rounded = myround(y_train)
num_classes = len(np.unique(y_train_rounded))

print(np.unique(y_train_rounded))

y_test_rounded = myround(y_test)
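
The grouping can be checked on a few values. The sketch below uses a hypothetical function `round_to_decade` that mirrors `myround`, applied to made-up years. One detail worth knowing is that `np.round` rounds exact halves to the nearest even number, so 1965 goes down to 1960 while 1975 goes up to 1980.

```python
import numpy as np

def round_to_decade(x):
    """Round to the nearest multiple of 10 (ties follow NumPy's round-half-to-even rule)."""
    return 10 * np.round(np.asarray(x, dtype=float) / 10)

years = np.array([1964.0, 1965.0, 1966.0, 1975.0, 1981.0])  # made-up values
decades = round_to_decade(years)
print(decades)

# Count how many samples fall into each class.
values, counts = np.unique(decades, return_counts=True)
for v, c in zip(values, counts):
    print(int(v), c)
```

Looking at the class counts in this way is a quick check for strongly unbalanced classes before training a classifier.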

Furthermore, to perform neural network classification, we need to one-hot encode our labels.

In [None]:
enchot = OneHotEncoder()
y_train_hot = enchot.fit_transform(y_train_rounded.reshape(-1,1)).toarray()
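
To see what one-hot encoding does to the labels, here is a minimal NumPy sketch of the encoding and of the way back, on made-up decade labels. It is an illustration rather than a replacement for `OneHotEncoder`, and all variable names are hypothetical.

```python
import numpy as np

labels = np.array([1960.0, 1980.0, 1970.0, 1960.0, 1990.0])  # made-up decade labels

classes = np.unique(labels)                    # sorted distinct labels
class_idx = np.searchsorted(classes, labels)   # position of each label within `classes`
one_hot = np.eye(len(classes))[class_idx]      # one row per sample, a single 1 per row

# Decoding: the position of the largest entry in each row selects the class.
decoded = classes[np.argmax(one_hot, axis=1)]

print(one_hot)
print(decoded)
```

The same decoding applies to the softmax output of a network: the column with the largest probability selects the predicted class.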

We are ready to train a similar feed-forward neural network for classification.

In [None]:
keras.utils.set_random_seed(42)

model = Sequential()
model.add(Dense(1000, input_dim=X_train.shape[1], activation='relu'))
model.add(Dense(200, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))

loss = 'categorical_crossentropy'

opt = Adam(learning_rate=1e-3)
model.compile(loss=loss, optimizer=opt)

model.fit(X_train, y_train_hot, epochs=60, batch_size=32, verbose=1)

# Take the most probable class, re-encode it as one-hot, and map it back to the rounded age
y_pred_hot = to_categorical(np.argmax(model.predict(X_test),axis=1),num_classes)
y_pred_rounded = enchot.inverse_transform(y_pred_hot)

In [None]:
print("Mean absolute error (classification):", np.round(mean_absolute_error(y_test_rounded, y_pred_rounded),2),"years")
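
When comparing the two errors, keep in mind that the classification error above is measured against the rounded labels, while the regression error is measured against the exact ones. The sketch below, on made-up years and with a hypothetical helper `mae`, shows that even a classifier that always picks the right class has a nonzero error with respect to the exact values.

```python
import numpy as np

def mae(y_true, y_hat):
    """Mean absolute error on flattened arrays."""
    return float(np.mean(np.abs(np.ravel(y_true) - np.ravel(y_hat))))

y_exact = np.array([1961.0, 1964.0, 1978.0, 1983.0])  # made-up exact labels
y_decade = 10 * np.round(y_exact / 10)                # the corresponding classes

# A hypothetical classifier that always predicts the correct class.
perfect_class_pred = y_decade.copy()

print("MAE against rounded labels:", mae(y_decade, perfect_class_pred))
print("MAE against exact labels:  ", mae(y_exact, perfect_class_pred))
```

For a like-for-like comparison with the regression model, one option is therefore to evaluate `y_pred_rounded` against `y_test` as well.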