In [None]:
# Full name test
NAME = ""
# Institutional email (hm.edu or hmtm.de)
EMAIL = ""

In [None]:
#@title install required packages
%pip install otter-grader
%pip install matplotlib
%pip install seaborn
%pip install scikit-learn
%pip install tensorflow

In [None]:
%%capture
#@title clone git repository
%rm -rf aica-assignments
!git clone https://github.com/aica-wavelab/aica-assignments.git
%cd aica-assignments/A1_introduction

In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("4_tutorial_machine_learning.ipynb")

# Day 2 - Rudiments of Machine Learning: Predicting Visitors in Bristol Museums

+ **AI in Culture and Arts - Tech Crash Course**
+ **Date:** 10.04.2025
+ **Author:** Dr. Benedikt Zönnchen & Dr. Téo Sanchez

<a href="https://colab.research.google.com/github/aica-wavelab/aica-assignments/blob/main/A1_introduction/4_tutorial_machine_learning.ipynb" target="_parent">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This notebook is a brief introduction to machine learning. It is intended for beginners who are interested in predictive modelling and machine learning. The notebook covers the different steps of the development of a machine learning model, and provides a concrete application: **predicting the number of visitors in the museums of Bristol (UK), based on the date and the weather**.


<img src="https://live.staticflickr.com/1571/25139082514_36dcf31ae0_b.jpg" style="width: 700px; display: block; margin-left: auto; margin-right: auto;"/>

We possess two datasets that comprise daily information from April 2014 to February 2019:

- `bristol_museum_visit.csv` contains the daily number of visitors for the museums of Bristol.
- `bristol_weather.csv` contains the daily weather information in Bristol over the same period: the mean temperature, the temperature range, the total precipitation, the total snowfall, and the maximum wind speed.

Our goal is simple: we want to train a machine learning model that predicts the number of visitors from all available information (museum name, day of the week, month, weather, etc.).


The development of a machine learning model can be divided into several steps. The most common steps include:

1. Data collection
2. Data preprocessing and feature selection
3. Model selection
4. Model training
5. Model evaluation
6. Model deployment

Rather than steps, these can be seen as a cycle, as a problem in one step can lead to revisiting a previous step. In this notebook, we will go through these steps in order to build a predictive model for the number of visitors in Bristol museums.

## 12 Data collection

### 12.1 Daily Museum Visitors

In [None]:
import pandas as pd
museum_visits = pd.read_csv('data/bristol_museum_visit.csv')
museum_visits["date"] = pd.to_datetime(museum_visits["date"])
museum_visits

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
    
<b>Instruction 12.1:</b> How many unique museums are in the dataset?

</div>

In [None]:
...

In [None]:
grader.check("q121")

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
    
<b>Instruction 12.2:</b> What is the time range of the dataset?

</div>

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
<b>Instruction 12.3:</b> Plot the number of visitors for each museum as a function of time, limited to the year 2018. Visualize each museum in a different color.
</div>

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
visitors2018 = museum_visits[museum_visits["date"].dt.year == 2018]
plt.figure(figsize=(15, 5))
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
<b>Instruction 12.4:</b> Describe the visualization. What can you say about the number of visitors in 2018?
</div>

_Type your answer here, replacing this text._

<!-- END QUESTION -->

### 12.2 Daily weather information

In [None]:
import pandas as pd
weather = pd.read_csv('data/bristol_weather.csv')
weather["date"] = pd.to_datetime(weather["date"])
weather

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
<b>Instruction 12.5:</b> Plot each weather feature as a function of time, limited to the year 2018. Visualize each weather feature in a different color.
</div>

<div class="alert alert-warning">
<b>Warning:</b> Seaborn requires the data to be in a specific format called "long-form" or "tidy data". 

In this format, each row is an observation and each column is a variable. You can use the `melt` method to convert the data to this format.
</div>

In [None]:
weather2018 = weather[weather["date"].dt.year == 2018]
# Melt all features except the date column into long-form:
# one row per (date, weather feature) pair
weather2018_melted = weather2018.melt(
    id_vars=["date"], 
    value_vars=weather2018.columns.drop("date"), 
    var_name="weather_feature", 
    value_name="value")
weather2018_melted

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(15, 5))
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
    
<b>Instruction 12.6:</b> Describe the visualization. What kind of variations and periodicity do you observe in the weather features?

</div>

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## 13 Data Preprocessing and Feature Selection

Now that we have collected our two datasets, we need to preprocess them before we can use them to train a machine learning model. The preprocessing steps include:

- Merging of all relevant features
- Handling missing values
- Encoding categorical features into numerical values or one-hot encoding
- Feature normalization 

### 13.1 Merging all relevant features

<!-- BEGIN QUESTION -->

<div  class="alert alert-info">
    
<b>Instruction 13.1:</b> Use the `pd.merge` function to combine the two datasets based on the `date` column. After this operation, ``data`` should be the data frame containing the columns of both datasets.

</div>

In [None]:
data = ...

<!-- END QUESTION -->

### 13.2 Handling missing values

Let's check if there are any missing values in the dataset using the `isna` method.

In [None]:
data.isna().sum()

We are lucky as there are no missing values in the dataset. However, in a real-world scenario, missing values are common and need to be handled. There are several strategies to handle missing values, including:

- Removing the rows with missing values
- Imputing the missing values with the mean, median, or mode of the column
- Using machine learning algorithms that can handle missing values
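
The first two strategies can be sketched on a small toy table. The column names and values below are hypothetical and only serve as an illustration; they are not taken from the Bristol datasets.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the Bristol dataset) with two missing values
toy = pd.DataFrame({
    "temperature_mean": [10.0, np.nan, 14.0, 12.0],
    "precipitation_sum": [0.0, 2.5, np.nan, 1.5],
})

# Strategy 1: drop every row that contains at least one missing value
toy_dropped = toy.dropna()

# Strategy 2: replace each missing value with the mean of its column
toy_imputed = toy.fillna(toy.mean())

print(toy_dropped)
print(toy_imputed)
```

Dropping rows is simple but discards data, whereas imputation keeps every row at the cost of inserting estimated values.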

### 13.3 Encoding categorical features into numerical values

Our data comprise at least one categorical feature: the museum name. Machine learning algorithms require numerical input data, so we need to encode the categorical features into numerical values. 

The machine learning package `scikit-learn`, which we are going to use today, provides a class called `LabelEncoder` that maps each category to an integer. However, integer codes suggest an order between categories (e.g. museum 2 > museum 1) that does not exist for museum names.

A better practice for such features is one-hot encoding, which creates a binary column for each category: the column contains 1 if the category is present and 0 otherwise. Below, we use the `get_dummies` function from `pandas` to do so.
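
To make the difference between integer codes and one-hot encoding concrete, here is a minimal sketch on a toy column. The museum names `A`, `B` and `C` are placeholders, and the integer codes are produced with the pandas `category` dtype rather than with `LabelEncoder`, purely to keep the example short.

```python
import pandas as pd

# Hypothetical museum names, used only for illustration
toy = pd.DataFrame({"museum": ["A", "B", "C", "A"]})

# Integer codes: compact, but they suggest an order (A < B < C) that does not exist
toy["museum_code"] = toy["museum"].astype("category").cat.codes

# One-hot encoding: one binary column per category, no implied order
toy_one_hot = pd.get_dummies(toy["museum"], prefix="museum", dtype=int)

print(toy)
print(toy_one_hot)
```

Each row of `toy_one_hot` contains exactly one 1, in the column of the museum it belongs to.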

In [None]:
from sklearn import preprocessing
data_encoded = pd.get_dummies(data, columns=["museum"], prefix="museum")
data_encoded

The date format might not be useful for our predictive model as it is. A more useful feature would be to extract the day of the week and the month from the date.

These new features are also **categorical** and need to be one-hot encoded.

In [None]:
# Map day of the week and month from numbers to names
data_encoded['day_of_week'] = data_encoded['date'].dt.dayofweek.map({
    0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday', 5: 'Saturday', 6: 'Sunday'
})
data_encoded['month'] = data_encoded['date'].dt.month.map({
    1: 'January', 2: 'February', 3: 'March', 4: 'April', 5: 'May', 6: 'June',
    7: 'July', 8: 'August', 9: 'September', 10: 'October', 11: 'November', 12: 'December'
})

# Create one-hot encoded features
data_one_hot = pd.get_dummies(data_encoded, columns=['day_of_week', 'month'], prefix=['day', 'month'])

# Display the new DataFrame with named one-hot encoded columns
data_one_hot.head()

Since public holidays can affect the number of visitors, we add a new feature containing the day of the month (1 to 31).

In [None]:
# Add number of the day in the month as a feature
data_one_hot['day'] = data_one_hot['date'].dt.day
data_one_hot

We can now remove the `date` column from the dataset and move to our next step: **feature normalization**.

In [None]:
data_one_hot = data_one_hot.drop(columns=["date"])
data_one_hot

### 13.4 Feature normalization

Feature normalization is an important step in data preprocessing and **only applies to numerical features**, in our case the weather features and the day of the month. Normalizing them puts all features on a comparable scale, so that no feature dominates the prediction simply because of its larger range.

Several normalization techniques exist, including:

- Min-max scaling: scales the data to a fixed range, usually [0, 1]
- Z-score normalization: scales the data to have a mean of 0 and a standard deviation of 1
- ...
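
As a minimal sketch of the first two techniques, the formulas can be written out directly with NumPy. The temperature values are made up for illustration.

```python
import numpy as np

# Hypothetical daily mean temperatures in °C (illustrative values only)
x = np.array([2.0, 8.0, 14.0, 20.0])

# Min-max scaling: (x - min) / (max - min), the result lies in [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: (x - mean) / std, the result has mean 0 and std 1
x_zscore = (x - x.mean()) / x.std()

print(x_minmax)
print(x_zscore)
```

Min-max scaling preserves the shape of the distribution but is sensitive to outliers, since a single extreme value determines the range.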

We use min-max scaling to normalize the weather features and the day of the month, relying on the `MinMaxScaler` class from the `scikit-learn` package.



In [None]:
# min-max scaling of certain features
scaler = preprocessing.MinMaxScaler()
data_one_hot_scaled = data_one_hot.copy()
# Normalize the weather features (assumed to be the first 5 columns)
data_one_hot_scaled[data_one_hot.columns[:5]] = scaler.fit_transform(data_one_hot[data_one_hot.columns[:5]])
# Normalize the column "day"
data_one_hot_scaled["day"] = scaler.fit_transform(data_one_hot[["day"]])
data_one_hot_scaled

As a last step, let's move the output variable `daily_visitors` to the end of the dataset.

In [None]:
# Move the daily_visitors column to the end
data_processed = data_one_hot_scaled[[c for c in data_one_hot_scaled if c not in ['daily_visitors']] + ['daily_visitors']]
data_processed

We have now preprocessed our data to be used in a machine learning model. Before moving to the next step, we can develop our intuition about the most important features to predict the daily number of visitors.

To do so, we can visualize the correlation matrix of the dataset, i.e. the correlation between each pair of features. The correlation matrix can be visualized using a heatmap, available in the `seaborn` package.

In [None]:
# Correlation matrix of the processed data
correlation_matrix = data_processed.corr()
plt.figure(figsize=(20, 20))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f")

This visualization is almost impossible to read when the number of features is large. Instead, we can focus on the correlation between the output variable `daily_visitors` and the other features.

In [None]:
# Correlation of the daily_visitors with the other features, by descending order
correlation_daily_visitors = correlation_matrix['daily_visitors'].sort_values(ascending=False)
correlation_daily_visitors

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
    
<b>Instruction 13.2:</b> Which features are most correlated with the number of daily visitors? Can you interpret this result?

</div>

_Type your answer here, replacing this text._

<!-- END QUESTION -->

Finally, we can split our dataset into an input matrix `X` and an output vector `y`, and then split `X` and `y` into training and testing sets: `(X_train, y_train)` and `(X_test, y_test)`.

The training set is used to train the machine learning model, while the testing set is used to evaluate the model's performance on unseen data.

It is common to split the dataset into 80% training and 20% testing data. We can use the `train_test_split` function from the `scikit-learn` package to do so.

In [None]:
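# Split the data into X (all columns but the last) and y (last column).
# X is cast to float32 so that boolean one-hot columns (the default of
# pd.get_dummies in recent pandas versions) and float columns share one numeric dtype.
X = data_processed.iloc[:, :-1].astype("float32")
y = data_processed.iloc[:, -1]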

# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

## 14 Model Selection and Definition

To simplify the model selection step, we will only consider a multi-layer perceptron (MLP) model. The MLP model is a type of artificial neural network that is commonly used for regression and classification tasks.

The package `tensorflow` provides an easy-to-use API to build and train neural networks. We will use the `Sequential` class from `keras` to build our MLP regressor.

The architecture of the MLP model is defined by the number of layers, the number of neurons in each layer, and the activation function used in each layer.
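
To illustrate what layers, neurons and activation functions do, here is a minimal NumPy sketch of the forward pass of a very small MLP. The sizes (3 inputs, one hidden layer of 4 neurons, 1 output) and the random weights are assumptions chosen for readability; this is not the model used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    """ReLU activation: keeps positive values and sets negative values to 0."""
    return np.maximum(z, 0.0)

# Hypothetical sizes: 3 input features, one hidden layer of 4 neurons, 1 output
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

x = np.array([[0.2, 0.5, 1.0]])  # one sample, shape (1, 3)

hidden = relu(x @ W1 + b1)       # hidden layer activations, shape (1, 4)
output = hidden @ W2 + b2        # linear output for regression, shape (1, 1)

# Each dense layer has (inputs x neurons) weights plus one bias per neuron
n_params = W1.size + b1.size + W2.size + b2.size
print(output, n_params)
```

Training consists of adjusting `W1`, `b1`, `W2` and `b2` so that `output` gets close to the target values.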

In [None]:
# MLP regressor using keras
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

# Create the model
model = Sequential()
model.add(Input(shape=(X_train.shape[1],)))
model.add(Dense(64, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='linear'))

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
    
<b>Instruction 13.2:</b> Execute ``model.summary()`` and explain what your neural network looks like.

</div>

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## 15 Model Training

Now that we have chosen and defined our machine learning model, we can train it on the training data `(X_train, y_train)`.

In [None]:
# Compile the model
model.compile(loss='mean_squared_error', optimizer=Adam())

# Train the model
history = model.fit(X_train, y_train, epochs=30, batch_size=30, validation_split=0.2, verbose=1)


<!-- BEGIN QUESTION -->

<div class="alert alert-info">
    
<b>Instruction 14.1:</b> Plot the training and validation loss as a function of the number of epochs, using the `history` variable.

</div>

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
    
<b>Instruction 14.2:</b> Describe and interpret the plot. What can you say about the model's performance?

</div>

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## 15 Model Evaluation

The model's performance must now be evaluated on the unseen testing data `(X_test, y_test)` that we set aside.

Let's first observe the real `y` values (ground truth) alongside the predicted `y_pred` values produced by our machine learning model.

In [None]:
# Predict the daily visitors for the test set
y_pred = model.predict(X_test)
# Flatten the predictions from shape (n, 1) to shape (n,)
y_pred = y_pred.reshape(y_pred.shape[0])
results = pd.DataFrame({'True': y_test, 'Predicted': y_pred})
results

The most common metrics to evaluate a regression model are:

- Mean absolute error (MAE)
- Mean squared error (MSE)
- R-squared
- ...
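
As a minimal sketch, the three metrics above can be computed by hand with NumPy. The visitor counts below are invented for illustration and are not model outputs.

```python
import numpy as np

# Hypothetical true and predicted visitor counts (illustrative values only)
y_true_toy = np.array([100.0, 200.0, 300.0, 400.0])
y_pred_toy = np.array([110.0, 190.0, 320.0, 380.0])

errors = y_true_toy - y_pred_toy

# MAE: average size of the errors, in the same unit as the target (visitors)
toy_mae = np.mean(np.abs(errors))

# MSE: average squared error, penalizes large errors more strongly
toy_mse = np.mean(errors ** 2)

# R-squared: 1 minus the ratio of residual variance to total variance
toy_r2 = 1 - np.sum(errors ** 2) / np.sum((y_true_toy - y_true_toy.mean()) ** 2)

print(toy_mae, toy_mse, toy_r2)
```

The `sklearn.metrics` module provides the same quantities as `mean_absolute_error`, `mean_squared_error` and `r2_score`.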

We will use the mean absolute error (MAE) to obtain a performance score of our model.

In [None]:
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test, y_pred)
print(f'Mean absolute error on the test set: {mae}')

In the context of our problem of visitor prediction, the MAE represents the average absolute difference between the predicted number of visitors and the actual number of visitors.

In other words, our model makes an average error of `N = MAE` visitors when predicting the number of visitors in Bristol's museums.

## 16 Model Prediction and Deployment (for the brave!)

We may have trained and evaluated a good machine learning model, but it remains useless if it is not deployed in a real-world scenario.

For instance, how can we help workers in the Bristol museums to predict the number of visitors for the next week?

To do so, we must develop meaningful interactions between users and the machine learning pipeline: the data, the model, and its predictions.

[Marcelle](https://marcelle.dev/) is a modular open source toolkit for programming interactive machine learning applications. Marcelle is built around components embedding computation and interaction that can be composed to form reactive machine learning pipelines and custom user interfaces. This architecture enables rapid prototyping and extension. Marcelle can be used to build interfaces to Python scripts, and it provides flexible data stores to facilitate collaboration between machine learning experts, designers and end users.

<div class="alert alert-info">
    
<b>Instruction 15.1:</b> Use Marcelle to build an interactive application that predicts the number of visitors in the Bristol museums.

</div>

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()