# COMPOSITIONAL SEMANTICS RESULTS

Here, we compare the original audio features with the audio features predicted by the model.

Once video features had been mapped onto audio features, an experiment was performed to verify whether the model had developed the ability to combine elements autonomously, i.e. whether it had developed compositional semantics capabilities. To do so, five separate tests were performed, one for each object: pen, phone, spoon, knife and fork. For each test, a reduced dataset was prepared by removing the videos that showed that object moving to the left, to the right, up, down and rotating. The model was trained on the reduced dataset of 13,050 samples and then tested against the 1,450 videos that had been removed.
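The hold-out procedure described above can be sketched as follows. This is a minimal illustration, not the code used to build the actual datasets: the metadata table, its column names (`video`, `object`, `action`), the action labels and the function `split_holdout` are all assumptions introduced for the example.

```python
import pandas as pd

# Hypothetical action labels for the five motions that are held out per object.
MOTIONS = ["left", "right", "up", "down", "rotate"]


def split_holdout(metadata: pd.DataFrame, held_out_object: str):
    """Split video metadata into (train, test).

    The test set holds every video in which `held_out_object` performs one of
    MOTIONS; everything else (including still videos of that object and moving
    videos of other objects) stays in the training set.
    """
    is_held_out = (metadata["object"] == held_out_object) & metadata["action"].isin(MOTIONS)
    return metadata[~is_held_out], metadata[is_held_out]


# Toy metadata (assumed layout, one row per video).
toy = pd.DataFrame(
    {
        "video": ["v1", "v2", "v3", "v4", "v5", "v6"],
        "object": ["pen", "pen", "pen", "spoon", "spoon", "phone"],
        "action": ["still", "left", "rotate", "left", "still", "up"],
    }
)

train, test = split_holdout(toy, "pen")
print(train)
print(test)
```

With this split the training set still contains still pens and other objects moving, which is exactly the information the model would need to combine in order to describe a moving pen.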

The results are shown in Table 12 on page 27 of https://github.com/fabiodeponte/symbol_grounding/blob/main/grounding_words_visual_perceptions.pdf.

The model was able to generalize when similar videos were present during training. However, it was not able to combine information from the videos that showed the objects staying still with information about the movement applied to other objects, to form a sentence composed of “move” and the name of the object.

For example, when videos showing pens moving to the left were part of the training set, the model was able to recognize new videos showing a similar scene. Conversely, when videos showing pens moving to the left were NOT part of the training set at all, the model was not able to recognize new videos showing that scene, even though it had been exposed to videos showing still pens and spoons moving to the left. The (failed) attempt was to let the model compose separate pieces of information from different videos and apply them to a single one.
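One way to make this failure measurable is to check, for each held-out video, whether the predicted utterance contains an action word and whether it contains the object name. The sketch below is only an assumption about how such a check could look, not the evaluation behind Table 12; `composition_report` and its parameters are hypothetical names, and the sample pairs are the first five rows of the pen test shown in this notebook.

```python
def composition_report(pairs, object_name, action_words=("MOVE", "ROTATE")):
    """Summarize how often predictions contain an action word or the object name.

    `pairs` is a list of (original_utterance, predicted_utterance) strings.
    Matching is done on whole upper-case words.
    """
    n = len(pairs)
    has_action = sum(
        any(word in predicted.split() for word in action_words)
        for _, predicted in pairs
    )
    has_object = sum(object_name in predicted.split() for _, predicted in pairs)
    return {
        "n": n,
        "action_rate": has_action / n,
        "object_rate": has_object / n,
    }


# First five rows of the "move pen" test in this notebook.
pen_pairs = [
    ("MOVE THE PEN TO THE LEFT", "THIS IS A TAN"),
    ("MOVE THE PEN TO THE RIGHT", "THIS IS A TAN"),
    ("MOVE THE PEN UP", "THIS IS A PEN"),
    ("MOVE THE PEN DOWN", "THIS IS A PEN"),
    ("WROTATE THE PEN", "THIS IS A PEN"),
]

report = composition_report(pen_pairs, "PEN")
print(report)
```

On these five rows no prediction contains an action word, while three of the five still name the object, which matches the behaviour described above: the object is sometimes recognized, but the movement is never composed with it.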

NOTE: for the sake of clarity, only the first 100 utterances are shown here. The predicted utterance for each of the 1,450 tested videos can be found in the folder https://github.com/fabiodeponte/symbol_grounding/tree/main/seq2seq%20chollet/compositional%20semantics%20-%20complete%20tests.

In [7]:
import numpy as np
import pandas as pd
import os
from numpy import save
from numpy import load
import matplotlib.pyplot as plt
from numpy import argmax

from tensorflow import keras
from keras import models
from keras.models import Sequential
from keras.layers import LSTM, TimeDistributed, RepeatVector, Dense

import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer

import IPython.display as display

# Importing Wav2Vec pretrained model

tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'Wav2Vec2CTCTokenizer'. 
The class this function is called from is 'Wav2Vec2Tokenizer'.
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# MODEL TRAINED WITHOUT MOVE PEN: TEST ON MOVE PEN

In [8]:
# LSTM model (Chollet seq2seq), new encoder: load predicted and original audio features
predicted_features = pd.DataFrame(load('predictions/PREDICTED_MODEL_CHOLLET_06_movepen_0-100.npy', allow_pickle=True))
original_features = pd.DataFrame(load('features/audio_features_move_pen.npy', allow_pickle=True))

# Decode each row of features to text and pair the original utterance with the predicted one
out = []
for i in range(predicted_features.shape[0]):
    original_text = tokenizer.batch_decode(torch.tensor([original_features.iloc[i]]))[0]
    predicted_text = tokenizer.batch_decode(torch.tensor([predicted_features.iloc[i]]))[0]
    out.append([original_text, predicted_text])

# Show every row when the DataFrame is displayed
pd.set_option('display.max_rows', None)




In [9]:
pd.DataFrame(out, columns=['original', 'predicted'])

,original,predicted
0,MOVE THE PEN TO THE LEFT,THIS IS A TAN
1,MOVE THE PEN TO THE RIGHT,THIS IS A TAN
2,MOVE THE PEN UP,THIS IS A PEN
3,MOVE THE PEN DOWN,THIS IS A PEN
4,WROTATE THE PEN,THIS IS A PEN
5,MOTHE THE PEN TO THE LEFT,THIS IS A TAN
6,MOVE THE PEN TO THE RIGHT,THIS IS A PEN
7,MOE THE PENNOB,THIS IS A PEN
8,MOVE THE PEN DOWN,THIS IS A TAN
9,ROTATE THE PEN,THIS IS A PEN


# MODEL TRAINED WITHOUT MOVE PHONE: TEST ON MOVE PHONE

In [10]:
# LSTM model (Chollet seq2seq), new encoder: load predicted and original audio features
predicted_features = pd.DataFrame(load('predictions/PREDICTED_MODEL_CHOLLET_06_move_phone_0-100.npy', allow_pickle=True))
original_features = pd.DataFrame(load('features/audio_features_move_phone.npy', allow_pickle=True))

# Decode each row of features to text and pair the original utterance with the predicted one
out = []
for i in range(predicted_features.shape[0]):
    original_text = tokenizer.batch_decode(torch.tensor([original_features.iloc[i]]))[0]
    predicted_text = tokenizer.batch_decode(torch.tensor([predicted_features.iloc[i]]))[0]
    out.append([original_text, predicted_text])

# Show every row when the DataFrame is displayed
pd.set_option('display.max_rows', None)




In [11]:
pd.DataFrame(out, columns=['original', 'predicted'])

,original,predicted
0,MOVE THE PHONE TO THE LEFT,THIS IS A PHONE
1,MOVE THE PHONE TO THE RIGHT,THIS IS A FOWN
2,MOVE THE PHONE UP,THIS IS A SPOON
3,MOVE THE PHONE DOWN,THIS IS A PHONE
4,ROTATE THE PHONE,THIS IS A PHONE
5,MOOVE THE PHONE TO THE LEFT,THIS IS A PHAWN
6,MOVE THE PHONE TO THE RIGHT,THIS IS A PHAWN
7,MOO THE PHOE KNOP,THIS IS A PHOAM
8,MOVE THE PHONE DOWN,THIS IS A FARM
9,ROTATE THE PHONE,THIS IS A PHONE


# MODEL TRAINED WITHOUT MOVE SPOON: TEST ON MOVE SPOON

In [19]:
# LSTM model (Chollet seq2seq), new encoder: load predicted and original audio features
predicted_features = pd.DataFrame(load('predictions/PREDICTED_MODEL_CHOLLET_06_move_spoon_0-100_02.npy', allow_pickle=True))
original_features = pd.DataFrame(load('features/audio_features_move_spoon.npy', allow_pickle=True))

# Decode each row of features to text and pair the original utterance with the predicted one
out = []
for i in range(predicted_features.shape[0]):
    original_text = tokenizer.batch_decode(torch.tensor([original_features.iloc[i]]))[0]
    predicted_text = tokenizer.batch_decode(torch.tensor([predicted_features.iloc[i]]))[0]
    out.append([original_text, predicted_text])

# Show every row when the DataFrame is displayed
pd.set_option('display.max_rows', None)




In [20]:
pd.DataFrame(out, columns=['original', 'predicted'])

,original,predicted
0,MOVE THE SPOON TO THE LEFT,THIS IS A SPOON
1,MOVE THE SPOON TO THE RIGHT,THIS IS A SPOON
2,MOVE THE SPOON UP,THIS IS A SPERN
3,MOVE THE SPOON DOWN,THIS IS A SPOON
4,RO TAKE THE SPOON,THIS IS A SPOON
5,MOVED THE SPOON TO THE LEFT,THIS IS A PHONE
6,MOVE THE SPOON TO THE RIGHT,THIS IS A PHONE
7,MOVE THE SPOON UP,THIS IS A FONE
8,MOOTH THE SPOON DOWN,THIS IS A SPERN
9,ROTATE THE SPOON,THIS IS A SPOON


# MODEL TRAINED WITHOUT MOVE KNIFE: TEST ON MOVE KNIFE

In [14]:
# LSTM model (Chollet seq2seq), new encoder: load predicted and original audio features
predicted_features = pd.DataFrame(load('predictions/PREDICTED_MODEL_CHOLLET_06_move_knife_0-100.npy', allow_pickle=True))
original_features = pd.DataFrame(load('features/audio_features_move_knife.npy', allow_pickle=True))

# Decode each row of features to text and pair the original utterance with the predicted one
out = []
for i in range(predicted_features.shape[0]):
    original_text = tokenizer.batch_decode(torch.tensor([original_features.iloc[i]]))[0]
    predicted_text = tokenizer.batch_decode(torch.tensor([predicted_features.iloc[i]]))[0]
    out.append([original_text, predicted_text])

# Show every row when the DataFrame is displayed
pd.set_option('display.max_rows', None)




In [15]:
pd.DataFrame(out, columns=['original', 'predicted'])

,original,predicted
0,MOVE THE KNIFE TO THE LEFT,THIS IS A KNIFE
1,MOVE THE KNIFE TO THE RIGHT,THIS IS A KNIFE
2,MOVE THE KNIFE UP,THIS IS A SPOON
3,MOVE THE KNIFE DOWN,THIS IS A KNIFE
4,RO ATE THE KNIFE,THIS IS A PEN
5,MOOTHE THE KNIFE TO THE LEFT,THIS IS A KNIFE
6,MOVE THE KNIFE TO THE RIGHT,THIIS IS A PEN
7,MOVE THE KNIFE UP,THIS IS A KNIFE
8,MOOVE THE KNIFE DOWN,THIS IS A PHONE
9,ROTATE THE KNIFE,THIS IS A KNIFE


# MODEL TRAINED WITHOUT MOVE FORK: TEST ON MOVE FORK

In [17]:
# LSTM model (Chollet seq2seq), new encoder: load predicted and original audio features
predicted_features = pd.DataFrame(load('predictions/PREDICTED_MODEL_CHOLLET_06_movefork_0-100.npy', allow_pickle=True))
original_features = pd.DataFrame(load('features/audio_features_move_fork.npy', allow_pickle=True))

# Decode each row of features to text and pair the original utterance with the predicted one
out = []
for i in range(predicted_features.shape[0]):
    original_text = tokenizer.batch_decode(torch.tensor([original_features.iloc[i]]))[0]
    predicted_text = tokenizer.batch_decode(torch.tensor([predicted_features.iloc[i]]))[0]
    out.append([original_text, predicted_text])

# Show every row when the DataFrame is displayed
pd.set_option('display.max_rows', None)




In [18]:
pd.DataFrame(out, columns=['original', 'predicted'])

,original,predicted
0,MOVE THE FORK TO THE LEFT,THIS IS A FORK
1,MOVE THE FORK TO THE RIGHT,THIS IS A FORK
2,MOVE THE FORECUP,THIS IS A FOR
3,MOVE THE FORK DOWN,THIS IS A FORK
4,ROTATE THE FORK,THIS IS A PHONE
5,MOVE THE FORK TO THE LEFT,THIS IS A FORK
6,MOVE THE FORK TO THE RIGHT,THIS IS A FORK
7,MOE THE FORECOP,THIS IS A PEN
8,MOOTHE THE FORK DOWN,THIS IS A FORK
9,ROTE THE FORK,THIS IS A FFORK
