# Generative AI Model

## Different approaches to re-creating such a game

### Machine Learning Classification:

#### First approach :

The first approach would be Machine Learning classification. For our problem, this approach would be simple and fast, because the characteristics of each gait appear to be relatively independent of one another. Here, the objective is to classify each gait in order to recreate our match.
One could therefore consider the Random Forest algorithm. On the one hand, averaging many decision trees limits the risk of overfitting; on the other hand, the algorithm tends to remain usable when the class distribution is imbalanced, as in our case, where the "run" label has 552 samples while the "shot" label has 18. However, this approach does not model dependencies between successive gaits, which is a drawback when the goal is to recreate a coherent sequence of actions.
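
As an illustration of this first approach, the sketch below trains a Random Forest on synthetic data. The features (summary statistics of each `norm` sequence), the helper name `summarize`, and the toy class proportions are assumptions made for this example; they are not part of the original analysis. `class_weight="balanced"` is one option scikit-learn offers to compensate for imbalanced classes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def summarize(norm):
    """Hypothetical fixed-size features for one variable-length norm sequence."""
    norm = np.asarray(norm, dtype=float)
    return np.array([norm.mean(), norm.std(), norm.min(), norm.max(), len(norm)])

# Synthetic, imbalanced toy data: many "run" gaits, few "shot" gaits.
gaits = [rng.normal(30, 5, rng.integers(20, 60)) for _ in range(200)]   # "run"
gaits += [rng.normal(60, 15, rng.integers(20, 60)) for _ in range(10)]  # "shot"
labels = np.array(["run"] * 200 + ["shot"] * 10)

# One feature row per gait
X = np.stack([summarize(g) for g in gaits])

clf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=0)
clf.fit(X, labels)
predicted = clf.predict(X)
print(X.shape, np.unique(predicted))
```

Each gait is classified on its own here, so the order of the gaits in the match plays no role in the prediction.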

### Sequence Modeling with Recurrent Neural Networks (RNNs):

#### Second approach :

The second approach would be a deep learning one, namely sequence modeling with Recurrent Neural Networks (RNNs). RNNs are well suited to capturing sequential dependencies, which is useful for modeling the successive actions of players. A plain RNN, however, can suffer from vanishing gradients, which makes long-range dependencies hard to learn. The LSTM (Long Short-Term Memory) architecture alleviates this problem while remaining effective at detecting dependencies. Like any deep learning approach, though, it makes the results harder to interpret and explain.
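
To make the vanishing gradient issue concrete, here is a minimal sketch on a scalar RNN of the form h_t = tanh(w * h_{t-1} + x_t). The function name `gradient_through_time`, the weight value, and the random inputs are assumptions chosen for illustration. The gradient of the last state with respect to the first one is a product of one factor per time step, so it shrinks quickly when each factor is smaller than 1.

```python
import numpy as np

def gradient_through_time(w, inputs):
    """Return |d h_T / d h_0| for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h = 0.0
    grad = 1.0
    for x in inputs:
        h = np.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2)
        grad *= w * (1.0 - h ** 2)
    return abs(grad)

rng = np.random.default_rng(0)
xs = rng.normal(size=50)

grad_short = gradient_through_time(0.5, xs[:5])  # 5 time steps
grad_long = gradient_through_time(0.5, xs)       # 50 time steps
print(grad_short, grad_long)
```

With `w = 0.5`, every factor has magnitude at most 0.5, so the gradient over 50 steps is many orders of magnitude smaller than over 5 steps.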

In [65]:
import numpy as np
import seaborn as sns
import json
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.layers import LSTM, Dense, Reshape, Input, Embedding, Masking
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

with open("../Data/match_1.json", 'r') as json_file:
    data = json.load(json_file)

with open("../Data/match_2.json", 'r') as json_file:
    data2 = json.load(json_file)

# Concatenate the entries of both matches (a plain Python list, not a DataFrame)
df = data + data2

## LSTM approach

To solve this problem, I decided to use an LSTM, a type of RNN: the recurrent structure lets us work with our gait sequences, and the LSTM cell mitigates the vanishing gradient problem that limits the memory of plain RNN models.
Before using it, I'll explain the mathematical theory behind the algorithm.
The LSTM cell is the basic unit of the algorithm. It carries two main states: a short-term memory (the output ht) and a long-term memory (the memory cell Ct). To control the flow of information within the cell, it uses so-called gates.

There are 3 of them:

A forget gate, which filters the information contained in the previous memory cell using a function ft.
<img src="../Images/Function_ft.png" alt="function ft" width="200"/>

Here, xt is the input to the network at time t, and ht-1 is the output produced by the LSTM cell at the previous time step. Mathematically, [ht-1,xt] is the concatenation of the arrays ht-1 and xt. Wf is the weight matrix of the gate and bf its bias, with the sigmoid as activation function. The sigmoid returns values between 0 and 1, which lets this function select the relevant values to retain from the previous memory cell.

<img src="../Images/ct.png" alt="maj ct" width="100"/>

We then multiply ft, term by term, with Ct-1, the value of the previous memory cell. Each component of ft lies between 0 and 1: a component close to 1 keeps the corresponding value of Ct-1, whereas a component close to 0 discards it. In short, the purpose of the forget gate is to sort out the old values contained in the previous memory cell before the memory cell is updated at time t.

Next, we have the input gate, whose purpose is to decide which of the new incoming values xt are allowed to enter the memory cell. The most relevant ones are stored in the memory cell and contribute to the decision at time t.
<img src="../Images/Input_gate.png" alt="input gate" width="250"/>
Here, we start with the function it, whose activation function is the sigmoid, which outputs values between 0 and 1. We then compute a similar function, ~Ct, whose activation function is the hyperbolic tangent (tanh), which outputs values between -1 and 1. The term-by-term multiplication of these two functions determines which values are allowed to enter the memory cell. As with the forget gate, components of it close to 1 let the candidate value through, while components close to 0 block it.
In short, the input gate selects the most relevant input values and writes them into the memory cell, so that they influence the decision at time t and possibly at later time steps, depending on the forget gate.

Once the work on these two gates has been completed, our memory will be updated as follows: 
<img src="../Images/memory.png" alt="maj ctfinal" width="200"/>
Finally, we have an output gate that will allow us to determine which part of our long-term memory will be exposed as an output.

<img src="../Images/Output_gate.png" alt="Output gate" width="200"/>
The output ht depends on the information in the memory cell Ct and, through the output gate, on the input values xt and the previous output ht-1.
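
For readers who cannot display the images, and assuming they follow the standard LSTM formulation, the gate functions can be written with the same symbols as in the text. Here $\sigma$ is the sigmoid and $\odot$ the term-by-term product; the names $W_i$, $W_C$, $W_o$ and $b_i$, $b_C$, $b_o$ are chosen by analogy with $W_f$ and $b_f$.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \\
\tilde{C}_t &= \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
o_t &= \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \\
h_t &= o_t \odot \tanh(C_t)
\end{aligned}
```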

Just below is a flow chart showing how an LSTM memory cell works:
<img src="../Images/flow-chart.png" alt="Flow chart" width="500"/>
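
The cell described above can also be sketched in a few lines of NumPy. This is a minimal illustration of one time step under the standard LSTM formulation, not the Keras implementation; the function names, the `params` dictionary layout, and the random weights are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step. `params` maps gate names to (W, b) pairs."""
    concat = np.concatenate([h_prev, x_t])                       # [h_{t-1}, x_t]
    f_t = sigmoid(params["f"][0] @ concat + params["f"][1])      # forget gate
    i_t = sigmoid(params["i"][0] @ concat + params["i"][1])      # input gate
    c_tilde = np.tanh(params["c"][0] @ concat + params["c"][1])  # candidate values
    o_t = sigmoid(params["o"][0] @ concat + params["o"][1])      # output gate
    c_t = f_t * c_prev + i_t * c_tilde                           # updated memory cell
    h_t = o_t * np.tanh(c_t)                                     # exposed output
    return h_t, c_t

rng = np.random.default_rng(0)
hidden, features = 4, 1
# Random weights and zero biases for the four gates (illustration only)
params = {g: (rng.normal(size=(hidden, hidden + features)), np.zeros(hidden)) for g in "fico"}

h, c = np.zeros(hidden), np.zeros(hidden)
for value in [0.2, 0.5, 0.1]:  # a toy sequence of norm values
    h, c = lstm_step(np.array([value]), h, c, params)
print(h.shape, c.shape)
```

The same weights are reused at every time step; only `h` and `c` change as the sequence is read.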

Using our LSTM model raises a practical challenge: our sequences have different lengths, whereas Keras trains on batches stored as a single array, so all sequences in a batch must share the same length. We therefore need a preprocessing step (padding) to address this issue.
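
The idea of padding can be sketched without Keras. The helper `pad_post` below is a hypothetical stand-in that mimics post-padding with zeros on toy values; the notebook itself relies on `pad_sequences`.

```python
import numpy as np

def pad_post(sequences, value=0.0):
    """Pad variable-length sequences with `value` at the end (hypothetical helper)."""
    max_len = max(len(seq) for seq in sequences)
    padded = np.full((len(sequences), max_len), value, dtype="float32")
    for row, seq in enumerate(sequences):
        padded[row, :len(seq)] = seq
    return padded

toy_norms = [[23.1, 22.4, 30.9], [20.5], [19.8, 25.0]]
padded = pad_post(toy_norms)
# Time steps that differ from the padding value, i.e. the ones a mask would keep
mask = padded != 0.0
print(padded.shape)
print(mask.sum(axis=1))
```

One caveat of using 0.0 as the padding value: a genuine measurement equal to 0.0 would be indistinguishable from padding.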

In [72]:
# Definition of the model creation function
def create_lstm_model(max_norm_length, lstm_units=120, dense_units=723):
    # Branch for norm data
    input_norm = Input(shape=(max_norm_length, 1), name='input_norm')
    masked_norm = Masking(mask_value=0.0)(input_norm)
    lstm_norm = LSTM(lstm_units, name='lstm_norm')(masked_norm)

    # Dense layer for norm prediction
    output_norm = Dense(dense_units, activation='relu', name='output_norm')(lstm_norm)

    # Model definition
    model = Model(inputs=input_norm, outputs=output_norm)

    return model

# Data preprocessing
norms = [entry["norm"] for entry in df]
labels = [entry["label"] for entry in df]

# Padding norm sequences
max_norm_length = max(len(norm) for norm in norms)
padded_norms = pad_sequences(norms, padding='post', maxlen=max_norm_length, dtype='float32')
# Norms are good: well-padded with the max_norm_length

# Shift by one entry so that each target is the sequence that follows its input
padded_norms_input = padded_norms[:-1, :]
padded_norms_output = padded_norms[1:, :]

# Convert data to numpy arrays
labels_array = np.array(labels)

# Transform labels into numbers
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(labels_array)

# One-hot encoding of labels
one_hot_labels = to_categorical(encoded_labels)

# Same one-entry shift for the labels
one_hot_labels_input = one_hot_labels[:-1, :]
one_hot_labels_output = one_hot_labels[1:, :]

# Definition of input shapes
input_shape_label = (len(one_hot_labels[0]), 1)  # Shape of label data

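# Model creation (the output layer has one unit per time step of the padded sequence)
model = create_lstm_model(max_norm_length, dense_units=max_norm_length)

# Model compilation
# optimizer = Adam(learning_rate=0.001)  # Adjust the learning rate according to your needs
# Accuracy is not meaningful for a regression target, so only MAE is tracked
model.compile(optimizer='adam', loss='mean_squared_error', metrics=['mae'])

print(padded_norms.shape)
# Model training (fit returns a History object, not predictions)
history = model.fit(padded_norms_input, padded_norms_output, epochs=50, batch_size=32)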



(1187, 723)
Epoch 1/50
...
Epoch 50/50


Here, the goal is to generate new sequences from the input sequences. I first padded the sequences so that they all have the same length and can be processed by the LSTM model. To recreate the match, the first step is to generate sequences that make sense as a succession of movements: I shifted the inputs and outputs by one entry so that each output is the norm sequence following its input. In a second step, I aimed to predict the labels associated with these generated sequences.
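
A minimal sketch of this shift on toy data: pairing each sequence with the one that follows it means dropping the last row from the inputs and the first row from the targets. The array `toy_padded` is a made-up stand-in for the padded norms.

```python
import numpy as np

# Toy stand-in for the padded norms: 4 sequences, each padded to length 3
toy_padded = np.array([
    [1.0, 1.5, 0.0],
    [2.0, 2.5, 2.7],
    [3.0, 0.0, 0.0],
    [4.0, 4.5, 0.0],
], dtype="float32")

inputs = toy_padded[:-1]   # sequences 0..N-2
targets = toy_padded[1:]   # sequences 1..N-1, i.e. the one following each input

for current, following in zip(inputs, targets):
    print(current, "->", following)
```

Slicing the other way round (`[1:]` for the inputs and `[:-1]` for the targets) would instead train the model to predict the preceding sequence.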

In [73]:
predictions = model.predict(padded_norms)
print(predictions)

[[34.029617  0.       38.363422 ...  0.        0.        0.      ]
 [33.936417  0.       38.25449  ...  0.        0.        0.      ]
 [33.96271   0.       38.284317 ...  0.        0.        0.      ]
 ...
 [34.594425  0.       39.013397 ...  0.        0.        0.      ]
 [35.650845  0.       40.195686 ...  0.        0.        0.      ]
 [33.971966  0.       38.295265 ...  0.        0.        0.      ]]


In [74]:
# Extract sequences and labels
sequences = [example["norm"] for example in df]
labels = [example["label"] for example in df]

# Convert labels to numerical categories
label_mapping = {label: idx for idx, label in enumerate(np.unique(labels))}
labels_numeric = np.array([label_mapping[label] for label in labels])
labels_one_hot = to_categorical(labels_numeric)

# Padding sequences to have the same length
padded_sequences = pad_sequences(sequences, dtype='float32', padding='post', truncating='post')
print(padded_sequences.shape)

# Define accelerometer time sequence as input
input_sequence = Input(shape=(padded_sequences.shape[1], 1), name='input_sequence')

# Add a masking layer to account for padding
masked_input = Masking(mask_value=0.0)(input_sequence)

# LSTM layer to capture temporal sequences
lstm_output = LSTM(100)(masked_input)

# Dense layer for final prediction
output = Dense(len(label_mapping), activation='softmax')(lstm_output)

# Create the model
model = Model(inputs=input_sequence, outputs=output)

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Display the model structure
model.summary()

# Train the model
model.fit(padded_sequences, labels_one_hot, epochs=10, batch_size=32, validation_split=0.2)


(1187, 723)
Model: "model_16"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_sequence (InputLayer  [(None, 723, 1)]          0         
 )                                                               
                                                                 
 masking_22 (Masking)        (None, 723, 1)            0         
                                                                 
 lstm_16 (LSTM)              (None, 100)               40800     
                                                                 
 dense_12 (Dense)            (None, 9)                 909       
                                                                 
Total params: 41709 (162.93 KB)
Trainable params: 41709 (162.93 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/

<keras.src.callbacks.History at 0x1b409f06700>

Here, I created a model that predicts the labels of the generated sequences, in order to recreate our match.

In [75]:
labels_predicted = model.predict(predictions)



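Unfortunately, the output is not what was hoped for: every generated sequence is classified as a run, so the recreated match consists only of runs. This is consistent with the generated sequences printed above, which are nearly identical from one row to the next, and with the class imbalance noted earlier.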
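
One common mitigation for imbalanced labels, not tested in this notebook, is to weight each class inversely to its frequency. The sketch below computes such weights with the usual "balanced" heuristic. Only the counts for "run" (552) and "shot" (18) come from the text; the count for "walk" and the reduced label set are made up for the example.

```python
from collections import Counter

# "run": 552 and "shot": 18 are the counts quoted earlier; the "walk" count is hypothetical
toy_labels = ["run"] * 552 + ["shot"] * 18 + ["walk"] * 300

counts = Counter(toy_labels)
classes = sorted(counts)  # alphabetical order, as produced by np.unique or LabelEncoder
n_samples, n_classes = len(toy_labels), len(classes)

# "Balanced" heuristic: weight = n_samples / (n_classes * class_count)
class_weight = {idx: n_samples / (n_classes * counts[name]) for idx, name in enumerate(classes)}
print(dict(zip(classes, class_weight.values())))
```

A dictionary of this form, keyed by class index, is what the `class_weight` argument of the Keras `fit` method expects; whether it would resolve the issue here remains to be verified.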

In [76]:


# Convert predicted probabilities to class indices
predicted_labels_numeric = np.argmax(labels_predicted, axis=1)

# Convert true labels to numerical labels
true_labels_numeric = np.argmax(labels_one_hot, axis=1)

# Reverse the mapping to get a mapping from numbers to labels
inverse_label_mapping = {idx: label for label, idx in label_mapping.items()}

# Convert predicted numerical labels to original labels
predicted_labels_original = [inverse_label_mapping[idx] for idx in predicted_labels_numeric]
print(predicted_labels_original)

['run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run', 'run'

Below, I attempted to combine the two steps into a single model. Unfortunately, the same issue remains: this time, every predicted label is a walk.

In [57]:
# Data preprocessing
labels = [example["label"] for example in df]
sequences = [example["norm"] for example in df]

# Padding sequences
max_sequence_length = max(len(seq) for seq in sequences)
padded_sequences = pad_sequences(sequences, dtype='float32', padding='post', truncating='post')

# Convert labels to numbers
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(labels)

# One-hot encoding of labels
one_hot_labels = to_categorical(encoded_labels)

# Definition of the combined LSTM model
input_sequence = Input(shape=(max_sequence_length, 1), name='input_sequence')
masked_sequence = Masking(mask_value=0.0)(input_sequence)
lstm_output = LSTM(100)(masked_sequence)
output = Dense(len(np.unique(labels)), activation='softmax')(lstm_output)
model_lstm_combined = Model(inputs=input_sequence, outputs=output)

# Compilation of the combined LSTM model
model_lstm_combined.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Training the model
model_lstm_combined.fit(padded_sequences, one_hot_labels, epochs=10, batch_size=32, validation_split=0.2)

# Inference with the combined model
predictions = model_lstm_combined.predict(padded_sequences)

# Convert predictions to labels
predicted_labels_numeric = np.argmax(predictions, axis=1)
predicted_labels_original = label_encoder.inverse_transform(predicted_labels_numeric)

# Display predicted labels
print(predicted_labels_original)



Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
['walk' 'walk' 'walk' ... 'walk' 'walk' 'walk']
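
A quick way to confirm this kind of collapse is to count the predicted labels and compare the accuracy with a majority-class baseline. The helper `prediction_summary` and the toy labels below are assumptions made for illustration, not results from this notebook.

```python
import numpy as np

def prediction_summary(true_labels, predicted_labels):
    """Compare predicted label counts with the majority-class baseline."""
    true_labels = np.asarray(true_labels)
    predicted_labels = np.asarray(predicted_labels)
    names, counts = np.unique(predicted_labels, return_counts=True)
    _, true_counts = np.unique(true_labels, return_counts=True)
    return {
        "predicted_counts": dict(zip(names.tolist(), counts.tolist())),
        "accuracy": float(np.mean(true_labels == predicted_labels)),
        "majority_baseline": float(true_counts.max() / true_counts.sum()),
    }

# Toy example (hypothetical labels): the model predicts "walk" for everything
truth = ["walk", "walk", "run", "walk", "shot", "run"]
summary = prediction_summary(truth, ["walk"] * len(truth))
print(summary)
```

When a single label accounts for all predictions and the accuracy equals the majority baseline, the model has learned nothing beyond the most frequent class.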


# Conclusion

To conclude this use case: I first created a data exploration notebook. It allowed me to better understand our data and its context, and to formulate initial hypotheses about which model to use.

Next, I researched the most suitable model for our problem and ultimately chose the LSTM, a type of RNN, primarily for its ability to handle sequential data; the model is explained earlier in this notebook.

Finally, I attempted to implement this model to recreate a football match. Unfortunately, the output of my model is not as expected: the predicted labels collapse to a single class.