# Assignment - Probabilistic Graphical Models
### Year 2020-2021 - Semester I
### CCE5225
####  Developed by - Adrian Muscat, 2020
---
Zachary Cauchi, 197999M, BSc CS, Yr I

Submit a PDF version of the final Jupyter notebook (with the attached plagiarism form) as a Turnitin job on VLE, and submit the Jupyter notebook itself separately as an assignment job on VLE.

This assignment is to be attempted individually. It is essential that the work you submit and present consists only of your own work; use of copied material will be treated as plagiarism. Discussion is only permitted on general issues, and it is absolutely forbidden to discuss specific details with anyone and/or share results.



In [1]:
import numpy as np
import pickle

from skmultilearn.problem_transform import BinaryRelevance
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, MultiLabelBinarizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.utils.multiclass import unique_labels
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import classification_report, multilabel_confusion_matrix

import pandas
from collections import Counter

import re

def saveAnswer(obj, name):
    # Pickle an answer object to saved_answers/<name>.pkl; the directory must already exist.
    with open(f'saved_answers/{name}.pkl', 'wb') as answer_file:
        pickle.dump(obj, answer_file)

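def trimSubClasses(labels):
    # Strip a trailing "_<digits>" sub-class suffix so that only the base class name remains.
    pattern = re.compile(r'.+?(?=_\d+(?!.))')

    def trim(label):
        # Match once; labels without a numeric suffix are returned unchanged.
        match = pattern.match(label)
        return match.group(0) if match else label

    return [[trim(label) for label in row] for row in labels]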
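
To make the effect of `trimSubClasses` concrete, the sketch below applies the same regular expression to a handful of labels. The labels `person_2` and `tvmonitor_10` are hypothetical; it is only assumed here that sub-classes in the pickle file follow a `<class>_<number>` naming pattern.

```python
import re

# Same pattern as in trimSubClasses: lazily match everything before a trailing "_<digits>".
pattern = re.compile(r'.+?(?=_\d+(?!.))')

def trim_label(label):
    # Return the label without a trailing numeric suffix; other labels pass through unchanged.
    match = pattern.match(label)
    return match.group(0) if match else label

# Hypothetical labels, for illustration only.
examples = ['person_2', 'tvmonitor_10', 'diningtable', '2008_001130.jpg']
trimmed = [trim_label(label) for label in examples]
print(trimmed)
```

The file name is left intact because its digits are followed by `.jpg`, so the "suffix at the very end" condition never holds.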

In [2]:
with open('MLC_data_2020_21.pkl', 'rb') as infile:
    data = pickle.load(infile, encoding='latin1')

In [3]:
# Explore dataset
print("First split is into :",data.keys(),'\n')
#
# Let's explore the development set
# This is organised into three lists
print("The three lists are",data['development'].keys(),'\n')
#
# The first element of each list corresponds to the object_labels,
# geometric features and output labels for the first example
# ...and so on
# When getting the object labels, trim them accordingly to obtain only the 20 classes
train_obj_labels = trimSubClasses(data['development']['object_labels'])
train_out_labels = data['development']['output_labels']
train_geo_feat = data['development']['geometric_features']
test_obj_labels = trimSubClasses(data['test']['object_labels'])
test_out_labels = data['test']['output_labels']
test_geo_feat = data['test']['geometric_features']

print("There are",len(train_obj_labels), "examples in dev set\n")
print("First example:")
print(train_obj_labels[0])
print(train_out_labels[0])
print(train_geo_feat[0])
print("\nSecond example:")
print(train_obj_labels[1])
print(train_out_labels[1])
print(train_geo_feat[1])
print("\n...")

First split is into : dict_keys(['development', 'test']) 

The three lists are dict_keys(['object_labels', 'output_labels', 'geometric_features']) 

There are 4253 examples in dev set

First example:
['2008_001130.jpg', 'tvmonitor', 'bottle']
['next_to', 'at_the_level_of', 'near']
[ 0.68888274  0.07051991  0.          0.88679245  0.39215686  0.63316053
  0.109375    1.36170213  1.14893617  1.06603774  0.58490566  9.76862745
  0.5546875  -0.30530973]

Second example:
['2008_002210.jpg', 'person', 'diningtable']
['behind', 'opposite', 'near']
[ 0.43984962  0.28696742  0.16        0.40206186  2.36082474  0.48306117
  0.          2.27350427  0.31623932  1.          0.66666667  1.53275109
  0.34962406 -0.33333333]

...


In [4]:
# Example 
# Learning the one-hot encoder

# read all prepositions in multilabel examples and flatten
all_preps=[]
for Y in data['development']['output_labels']:
    for y in Y:
        all_preps.append(y)

values = np.array(all_preps).reshape(len(all_preps),)
print("Shape of values", values.shape,'\n')

# integer encode
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
print("Unique labels:\n",label_encoder.classes_,'\n')

# onehot encode
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print("Example of One-Hot Encoded:\n", onehot_encoded[0],'\n')

# single label encoding for first example
print("Consider first example\n")
b = np.array(data['development']['output_labels'][0])
b = b.reshape(len(b),1)
print("Output Labels:\n",b)
print("\nOne-Hot encoded labels:")
for i in b:
    a = label_encoder.transform(i)
    print(onehot_encoder.transform(a.reshape(-1, 1))[0])



Shape of values (9180,) 

Unique labels:
 ['above' 'against' 'along' 'around' 'at_the_level_of' 'behind' 'beyond'
 'far from' 'in' 'in_front_of' 'near' 'next_to' 'none' 'on' 'opposite'
 'outside_of' 'under'] 

Example of One-Hot Encoded:
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.] 

Consider first example

Output Labels:
 [['next_to']
 ['at_the_level_of']
 ['near']]

One-Hot encoded labels:
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]


# Section 1: Preparing the data

## Part 1

In [5]:
# 1.a. Computing the mean output label count per example, per dataset (development and test)
average_out_count_train = 0
average_out_count_test = 0

for row in train_out_labels:
    average_out_count_train += len(row)
for row in test_out_labels:
    average_out_count_test += len(row)

average_out_count_train /= len(train_out_labels)
average_out_count_test /= len(test_out_labels)

print('Answer to 1.a:')
print('Mean output labels per row (train set): ', average_out_count_train)
print('Mean output labels per row (test set): ', average_out_count_test)

saveAnswer({
    'train_average_out': average_out_count_train,
    'test_average_out': average_out_count_test
}, '1a')


Answer to 1.a:
Mean output labels per row (train set):  2.1584763696214435
Mean output labels per row (test set):  2.148496240601504


In [6]:
# 1.b. Flatten the output labels to a 1-d array, computing the distribution for both datasets

# Flatten the labels into a 1D array
flat_out_train = np.concatenate(train_out_labels)
flat_out_test = np.concatenate(test_out_labels)

# Count the numbers of each label
train_out_counts = Counter(flat_out_train)
test_out_counts = Counter(flat_out_test)

# Create dataframes from each counter object above.
train_out_counts_df = pandas.DataFrame.from_dict(train_out_counts, orient='index')
train_out_counts_df.index.name = 'Label distribution in development (train) set'
test_out_counts_df = pandas.DataFrame.from_dict(test_out_counts, orient='index')
test_out_counts_df.index.name = 'Label distribution in test set'

print("Results for 1.b:")
display(train_out_counts_df)
display(test_out_counts_df)

saveAnswer({
    'train_out_counts': train_out_counts_df,
    'test_out_counts': test_out_counts_df
}, '1b')

Results for 1.b:


Label distribution in development (train) set,0
next_to,1411
at_the_level_of,926
near,2276
behind,1055
opposite,267
on,359
in_front_of,1102
above,117
under,432
far from,376


Label distribution in test set,0
in_front_of,270
against,136
next_to,359
at_the_level_of,227
near,578
under,101
behind,270
far from,100
on,88
opposite,66
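
The tables above hold raw counts. Since 1.b asks for a distribution, one possible follow-up is to divide each count by the total number of labels. The sketch below does this on toy data; `toy_out_labels` is a hypothetical stand-in for `train_out_labels`, not the real dataset.

```python
from collections import Counter

# Hypothetical stand-in for train_out_labels.
toy_out_labels = [['next_to', 'near'], ['near'], ['behind', 'near']]

# Flatten, count, then normalise so the values sum to 1.
flat = [label for row in toy_out_labels for label in row]
counts = Counter(flat)
total = sum(counts.values())
distribution = {label: count / total for label, count in counts.items()}
print(distribution)
```

The same normalisation could be applied to a pandas column by dividing it by its sum.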


In [7]:
# 1.c. Computing the composite output labels (without flattening like in 1.b) for both datasets.

# Same as above, compute the occurrences of each composite output label.
# Unlike above, we first need to transform each row from an unhashable list to a hashable tuple object.
train_cmp_out_counts = Counter(map(tuple, train_out_labels))
test_cmp_out_counts = Counter(map(tuple, test_out_labels))

train_cmp_out_counts_df = pandas.DataFrame.from_dict(train_cmp_out_counts, orient='index')
train_cmp_out_counts_df.index.name = 'Composite output label distribution in development (train) set'
test_cmp_out_counts_df = pandas.DataFrame.from_dict(test_cmp_out_counts, orient='index')
test_cmp_out_counts_df.index.name = 'Composite output label distribution in test set'

print('Results for 1.c:')
display(train_cmp_out_counts_df)
display(test_cmp_out_counts_df)

saveAnswer({
    'train_out_counts': train_cmp_out_counts_df,
    'test_out_counts': test_cmp_out_counts_df
}, '1c')


Results for 1.c:


Composite output label distribution in development (train) set,0
"(next_to, at_the_level_of, near)",509
"(behind, opposite, near)",3
"(on,)",135
"(in_front_of, near)",269
"(near, behind)",31
...,...
"(opposite, beyond)",1
"(in_front_of, opposite, under)",1
"(outside_of, next_to, at_the_level_of, near)",1
"(in_front_of, next_to, opposite, near)",1


Composite output label distribution in test set,0
"(in_front_of, against)",7
"(next_to, at_the_level_of, near)",132
"(under,)",53
"(at_the_level_of,)",9
"(in_front_of, next_to, at_the_level_of, near)",2
...,...
"(above, next_to, against, behind, near)",1
"(in, on)",1
"(in_front_of, next_to, against)",1
"(around, against, near)",1
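
Counting tuples treats label order as significant: the table above lists `(near, behind)` while other rows place `behind` before `near`. Whether order carries meaning in this dataset is not stated, so the following is only a sketch of an order-insensitive alternative, using hypothetical toy data.

```python
from collections import Counter

# Hypothetical stand-in for train_out_labels.
toy_out_labels = [['near', 'behind'], ['behind', 'near'], ['on']]

# Order-sensitive: ('near', 'behind') and ('behind', 'near') are different keys.
order_sensitive = Counter(map(tuple, toy_out_labels))

# Order-insensitive: sort each row before turning it into a hashable tuple.
order_insensitive = Counter(tuple(sorted(row)) for row in toy_out_labels)

print(order_sensitive)
print(order_insensitive)
```

If order is irrelevant, the second counter merges permutations of the same label set into one composite label.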


In [8]:
# 1.d. Compute a word-word co-occurrence probability distribution
# Join each example's labels into one string; 'far from' becomes the single token 'far_from'.
train_1d_out_labels = [' '.join(label).replace('far from', 'far_from') for label in train_out_labels]

# L2-normalised term frequencies (no IDF weighting), one row per example.
vectorizer = TfidfVectorizer(use_idf=False)
X = vectorizer.fit_transform(train_1d_out_labels)
feature_names = vectorizer.get_feature_names()
print(feature_names)
print(X.shape)

# Label-by-label product; its rows and columns follow the order of feature_names.
Xc = (X.T * X)
# vocabulary_ is a dict whose iteration order need not match the column order,
# so the axes are labelled with feature_names instead.
df = pandas.DataFrame(Xc.todense(), index=feature_names, columns=feature_names)
display(df)


['above', 'against', 'along', 'around', 'at_the_level_of', 'behind', 'beyond', 'far_from', 'in', 'in_front_of', 'near', 'next_to', 'none', 'on', 'opposite', 'outside_of', 'under']
(4253, 17)


,above,against,along,around,at_the_level_of,behind,beyond,far_from,in,in_front_of,near,next_to,none,on,opposite,outside_of,under
above,50.116667,1.25,0.0,0.0,2.25,12.95,0.583333,3.916667,0.25,5.2,27.116667,8.5,0.0,2.083333,1.95,0.833333,0.0
against,1.25,263.466667,1.283333,1.583333,32.483333,27.85,0.333333,0.0,4.25,27.033333,20.966667,47.05,0.0,95.583333,3.95,1.0,64.916667
along,0.0,1.283333,20.516667,0.0,4.483333,5.85,0.0,0.333333,0.0,6.333333,14.433333,15.516667,0.0,0.0,0.0,0.0,0.25
around,0.0,1.583333,0.0,29.083333,0.0,0.583333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.75
at_the_level_of,2.25,32.483333,4.483333,0.0,343.883333,8.8,0.333333,8.083333,0.0,11.833333,244.133333,251.55,0.0,0.0,15.333333,1.25,1.583333
behind,12.95,27.85,5.85,0.583333,8.8,544.7,9.333333,74.333333,0.0,24.45,246.95,66.25,0.0,0.333333,14.616667,4.833333,13.166667
beyond,0.583333,0.333333,0.0,0.0,0.333333,9.333333,16.916667,6.75,0.0,3.916667,2.0,0.333333,0.0,0.0,1.166667,0.0,0.333333
far_from,3.916667,0.0,0.333333,0.0,8.083333,74.333333,6.75,188.333333,0.0,83.166667,0.5,1.166667,0.0,0.0,4.333333,3.083333,2.0
in,0.25,4.25,0.0,0.0,0.0,0.0,0.0,0.0,41.25,0.0,0.0,0.0,0.0,10.25,0.0,0.0,0.0
in_front_of,5.2,27.033333,6.333333,0.0,11.833333,24.45,3.916667,83.166667,0.0,574.233333,248.15,68.616667,0.0,6.0,26.4,5.5,11.166667
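
The matrix above is built from L2-normalised term frequencies, so its entries are not yet probabilities. One possible way to reach a co-occurrence probability distribution is sketched below. It swaps in plain binary indicator counts instead of `TfidfVectorizer`, and all names (`toy_out_labels`, `Y`, `joint`) are hypothetical.

```python
import numpy as np

# Hypothetical stand-in for train_out_labels.
toy_out_labels = [['next_to', 'near'], ['near'], ['behind', 'near'], ['next_to', 'near']]

vocab = sorted({label for row in toy_out_labels for label in row})
index = {label: i for i, label in enumerate(vocab)}

# Binary indicator matrix: Y[n, k] = 1 if example n carries label k.
Y = np.zeros((len(toy_out_labels), len(vocab)), dtype=int)
for n, row in enumerate(toy_out_labels):
    for label in row:
        Y[n, index[label]] = 1

# counts[i, j] = number of examples that contain both label i and label j.
counts = Y.T @ Y
np.fill_diagonal(counts, 0)  # drop self co-occurrence

# Normalise so all entries sum to 1 (each unordered pair appears twice by symmetry).
joint = counts / counts.sum()
print(vocab)
print(joint)
```

Dividing each row by its own sum instead would give a conditional distribution of one label given another; which of the two the assignment intends is not specified here.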


In [9]:
import numpy as np
import nltk
from nltk import bigrams
import pandas as pd
 
 
def generate_co_occurrence_matrix(corpus):
    vocab = set(corpus)
    vocab = list(vocab)
    vocab_index = {word: i for i, word in enumerate(vocab)}

    # Create bigrams from all words in corpus
    bi_grams = list(bigrams(corpus))

    # Frequency distribution of bigrams ((word1, word2), num_occurrences)
    bigram_freq = nltk.FreqDist(bi_grams).most_common(len(bi_grams))

    # Initialise co-occurrence matrix
    # co_occurrence_matrix[current][previous]
    co_occurrence_matrix = np.zeros((len(vocab), len(vocab)))
 
    # Loop through the bigrams taking the current and previous word,
    # and the number of occurrences of the bigram.
    for bigram in bigram_freq:
        current = bigram[0][1]
        previous = bigram[0][0]
        count = bigram[1]
        pos_current = vocab_index[current]
        pos_previous = vocab_index[previous]
        co_occurrence_matrix[pos_current][pos_previous] = count
    co_occurrence_matrix = np.matrix(co_occurrence_matrix)
 
    # return the matrix and the index
    return co_occurrence_matrix, vocab_index

# Note: the labels are flattened first, so bigrams are also counted across example boundaries.
matrix, vocab_index = generate_co_occurrence_matrix(np.concatenate(train_out_labels))
 
 
data_matrix = pd.DataFrame(matrix, index=vocab_index,
                             columns=vocab_index)
display(data_matrix)

# WIP 1.d

,at_the_level_of,on,against,under,in,behind,near,opposite,none,far from,in_front_of,beyond,above,outside_of,along,around,next_to
at_the_level_of,5.0,5.0,11.0,4.0,1.0,20.0,82.0,23.0,1.0,13.0,21.0,0.0,2.0,2.0,1.0,0.0,735.0
on,11.0,1.0,38.0,22.0,6.0,21.0,184.0,0.0,3.0,24.0,31.0,4.0,3.0,1.0,1.0,2.0,7.0
against,52.0,185.0,11.0,129.0,1.0,49.0,72.0,4.0,1.0,9.0,39.0,2.0,3.0,1.0,1.0,5.0,29.0
under,20.0,11.0,30.0,6.0,5.0,34.0,228.0,7.0,1.0,26.0,49.0,2.0,1.0,0.0,1.0,2.0,9.0
in,3.0,11.0,16.0,3.0,0.0,1.0,17.0,0.0,0.0,2.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0
behind,45.0,44.0,99.0,60.0,7.0,35.0,382.0,31.0,3.0,94.0,100.0,5.0,18.0,7.0,1.0,5.0,119.0
near,616.0,13.0,66.0,34.0,4.0,436.0,150.0,81.0,1.0,16.0,434.0,7.0,48.0,5.0,49.0,2.0,314.0
opposite,15.0,6.0,21.0,11.0,6.0,24.0,103.0,3.0,1.0,12.0,41.0,2.0,4.0,2.0,0.0,1.0,15.0
none,1.0,1.0,0.0,0.0,1.0,2.0,12.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0
far from,12.0,1.0,13.0,10.0,2.0,110.0,79.0,7.0,1.0,2.0,121.0,1.0,5.0,5.0,0.0,1.0,6.0


## Part 2

In [10]:
# 2.a Transform the object and geometrical features into an input matrix.

# Trim the file names from the inputs.
train_trimmed = np.array(train_obj_labels)[:, 1:]
test_trimmed = np.array(test_obj_labels)[:, 1:]

# Transform the features into one-hot encoded.
obj_encoder = OneHotEncoder(sparse=False)
obj_encoder = obj_encoder.fit(train_trimmed)
train_input_matrix = obj_encoder.transform(train_trimmed)
test_input_matrix = obj_encoder.transform(test_trimmed)

# Append the geometrical features onto the obtained one-hot features.
train_input_matrix = np.append(train_input_matrix, train_geo_feat, axis=1)
test_input_matrix = np.append(test_input_matrix, test_geo_feat, axis=1)

saveAnswer({
    'train_input_matrix': train_input_matrix,
    'test_input_matrix': test_input_matrix
}, '2.a')

XTrain = train_input_matrix
XTest = test_input_matrix
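
As an illustration of the layout of the resulting input matrix, the sketch below one-hot encodes each object column separately and then appends the geometric features, using NumPy only. The object names and feature values are invented for the example, and `one_hot_column` is a hypothetical helper rather than part of scikit-learn.

```python
import numpy as np

# Hypothetical (first object, second object) pairs and two made-up geometric features.
toy_objects = np.array([['person', 'dog'], ['dog', 'chair'], ['person', 'chair']])
toy_geo = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])

def one_hot_column(column):
    # One indicator column per distinct category, in sorted order.
    categories = np.unique(column)
    return (column[:, None] == categories[None, :]).astype(float)

# Encode each object column on its own, then place the blocks side by side.
encoded = np.hstack([one_hot_column(toy_objects[:, j]) for j in range(toy_objects.shape[1])])

# Append the geometric features as extra columns.
X = np.hstack([encoded, toy_geo])
print(X.shape)
print(X[0])
```

Each row therefore consists of one indicator block per object position followed by the raw geometric features.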


In [11]:
# 2.b Transform the output features into a multi-label output matrix.

# Use a multi-label binarizer to one-hot encode and reduce multiple features into a single vector.
out_one_hot = MultiLabelBinarizer()
out_one_hot = out_one_hot.fit(train_out_labels)

train_output_matrix = out_one_hot.transform(train_out_labels)
test_output_matrix = out_one_hot.transform(test_out_labels)

saveAnswer({
    'train_output_matrix': train_output_matrix,
    'test_output_matrix': test_output_matrix
}, '2.b')

yTrain = train_output_matrix
yTest = test_output_matrix

## Part 3


In [69]:
# 3 Functions for calculating accuracy metrics

def getMatrix(predictions, truths):
    # Generate a single confusion matrix for all labels
    tp = 0
    fp = 1
    fn = 2
    tn = 3

    # Initialise an empty array.
    matrix = [0, 0, 0, 0]

    # Over each prediction-truth pair, update the confusion matrix for that label.
    for (plabel, tlabel) in np.nditer([predictions, truths], flags=['refs_ok']):
        if plabel == 1 and tlabel == 1: matrix[tp] += 1
        elif plabel == 1 and tlabel == 0: matrix[fp] += 1
        elif plabel == 0 and tlabel == 1: matrix[fn] += 1
        elif plabel == 0 and tlabel == 0: matrix[tn] += 1

    return matrix

def getMatrices(predictions, truths, num_labels):
    # Generate a multi-label confusion matrix
    tp = 0
    fp = 1
    fn = 2
    tn = 3

    # Initialise an empty set of arrays.
    matrices = [[0, 0, 0, 0] for i in range(0, num_labels)]

    it = np.nditer([predictions, truths], flags=['multi_index', 'refs_ok'])

    # Over each prediction-truth pair, update the confusion matrix for that label.
    for plabel, tlabel in it:
        i = it.multi_index[1] # This is the label index
        if plabel == 1 and tlabel == 1: matrices[i][tp] += 1
        elif plabel == 1 and tlabel == 0: matrices[i][fp] += 1
        elif plabel == 0 and tlabel == 1: matrices[i][fn] += 1
        elif plabel == 0 and tlabel == 0: matrices[i][tn] += 1

    return matrices

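# 3.a Accuracy
def getAccuracy(predictions, truths):
    # Get the overall accuracy as the fraction of correct label decisions, (tp + tn) / total.
    # This is not intersection over union, which would be tp / (tp + fp + fn).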
    matrix = getMatrix(predictions, truths)
    
    correct = matrix[0] + matrix[3] # tp + tn
    total = sum(matrix) # tp + fp + fn + tn
    
    return correct / total

# 3.b Precision
def getPrecision(predictions, truths):
    # Get the overall precision
    matrix = getMatrix(predictions, truths)

    positives = matrix[0] # tp
    positiveGuesses = matrix[0] + matrix[1] # tp + fp

    return positives / positiveGuesses

# 3.c Recall
def getRecall(predictions, truths):
    # Get the overall recall
    matrix = getMatrix(predictions, truths)

    positives = matrix[0] # tp
    allPositives = matrix[0] + matrix[2] # tp + fn
    
    return positives / allPositives

# 3.d Per-label precision
def getMultiLabelPrecision(predictions, truths):
    # Get the per-label precision; a label that is never predicted (tp + fp == 0) scores 1.0.
    matrices = getMatrices(predictions, truths, len(truths[0]))

    precisions = [0 for i in range(len(matrices))]

    for i, (tp, fp, fn, tn) in enumerate(matrices):
        p = tp + fp
        precisions[i] = (tp / p) if p != 0 else 1.0
    
    return precisions

# 3.e Per-label recall
def getMultiLabelRecall(predictions, truths):
    # Get the per-label recall; a label with no true occurrences (tp + fn == 0) scores 1.0.
    matrices = getMatrices(predictions, truths, len(truths[0]))

    recalls = [0 for i in range(len(matrices))]
    
    for i, (tp, fp, fn, tn) in enumerate(matrices):
        allPositives = tp + fn
        recalls[i] = (tp / allPositives) if allPositives != 0 else 1.0
    
    return recalls
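
The element-wise loops above can also be expressed with boolean masks. The sketch below computes the same four counts on toy matrices and contrasts the accuracy used by `getAccuracy` with a micro-averaged intersection over union. The toy values are hypothetical, and which IoU variant the assignment intends (micro-averaged or per-example) is an assumption.

```python
import numpy as np

# Toy binary matrices (rows = examples, columns = labels); hypothetical values.
truths = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
predictions = np.array([[1, 0, 0], [0, 1, 1], [1, 0, 0]])

# Confusion counts over every (example, label) decision.
tp = np.sum((predictions == 1) & (truths == 1))
fp = np.sum((predictions == 1) & (truths == 0))
fn = np.sum((predictions == 0) & (truths == 1))
tn = np.sum((predictions == 0) & (truths == 0))

# Fraction of correct decisions, as in getAccuracy.
label_accuracy = (tp + tn) / (tp + fp + fn + tn)

# Micro-averaged intersection over union: true negatives are ignored.
iou = tp / (tp + fp + fn)

print(tp, fp, fn, tn)
print(label_accuracy, iou)
```

Because true negatives dominate sparse multi-label targets, the first figure is usually much higher than the second.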

# Section 2

## Part 1


In [83]:
# 4.a Develop a binary-relevance model using logistic regression: select hyperparameters by
# cross-validated grid search, then refit the best model on the whole training set.

parameters = [
    {
        'classifier': [LogisticRegression()],
        'classifier__solver': ['sag', 'saga'],
        'classifier__C': [1.0, 0.5, 1.5],
        'classifier__max_iter': [250, 500, 100],
        'classifier__class_weight': [None, 'balanced'],
        'classifier__warm_start': [True],
        'classifier__random_state': [12]
    }
]

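# For multi-label targets, scoring='accuracy' is subset accuracy: every label of an example must match.
clf = GridSearchCV(BinaryRelevance(), parameters, scoring='accuracy', verbose=2, n_jobs=4)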

clf.fit(XTrain, yTrain)

display(clf.best_params_, clf.best_score_)
print(f'Best estimator achieved an accuracy score of {clf.best_score_} and trained in {clf.refit_time_:.2f} sec')

predictions = clf.predict(XTest).todense()

saveAnswer({
    'trained_model': clf,
    'predictions': predictions
}, '4.a')

Fitting 5 folds for each of 36 candidates, totalling 180 fits
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  33 tasks      | elapsed:  1.0min
[Parallel(n_jobs=4)]: Done 154 tasks      | elapsed:  4.8min
[Parallel(n_jobs=4)]: Done 180 out of 180 | elapsed:  5.6min finished


{'classifier': LogisticRegression(max_iter=500, random_state=12, solver='sag', warm_start=True),
 'classifier__C': 1.0,
 'classifier__class_weight': None,
 'classifier__max_iter': 500,
 'classifier__random_state': 12,
 'classifier__solver': 'sag',
 'classifier__warm_start': True}

0.0338582981958941

Best estimator achieved an accuracy score of 0.0338582981958941 and trained in 10.91 sec
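
The cross-validated score above is very low compared with the per-label accuracy reported in 4.b. A likely reason is that scikit-learn's `accuracy` scorer computes subset accuracy for multi-label indicator targets, where an example counts as correct only if all of its labels match. The sketch below contrasts the two definitions on hypothetical toy matrices.

```python
import numpy as np

# Toy binary matrices (rows = examples, columns = labels); hypothetical values.
truths = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
predictions = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]])

# Subset accuracy: an example counts only if every label matches.
subset_accuracy = np.mean(np.all(predictions == truths, axis=1))

# Per-label accuracy: each label decision is counted individually.
per_label_accuracy = np.mean(predictions == truths)

print(subset_accuracy, per_label_accuracy)
```

With 17 labels per example, a single wrong label is enough to make the whole example count as incorrect under the first definition.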


In [84]:
# 4.b Metrics on the trained models from 4.a


# Compute the overall and per-label metrics on the test-set predictions.
acc = getAccuracy(predictions, yTest)
pre = getPrecision(predictions, yTest)
rec = getRecall(predictions, yTest)
prePerLabel = getMultiLabelPrecision(predictions, yTest)
recPerLabel = getMultiLabelRecall(predictions, yTest)

# In the case of the multi-label metrics, convert them into dataframes for readability.
prePerLabel = pd.DataFrame(prePerLabel, index=out_one_hot.classes_, columns=['Precision'])
recPerLabel = pd.DataFrame(recPerLabel, index=out_one_hot.classes_, columns=['Recall'])

print(f'Accuracy: {acc}')
print(f'Precision: {pre}')
print(f'Recall: {rec}')
display(prePerLabel)
display(recPerLabel)

saveAnswer({
    'accuracy': acc,
    'precision': pre,
    'recall': rec,
    'multiPrecision': prePerLabel,
    'recPerLabel': recPerLabel
}, '4.b')

Accuracy: 0.8857253427686864
Precision: 0.6102719033232629
Recall: 0.2650918635170604


Label,Precision
above,1.0
against,1.0
along,1.0
around,1.0
at_the_level_of,1.0
behind,0.681818
beyond,1.0
far from,0.625
in,1.0
in_front_of,0.672131


Label,Recall
above,0.0
against,0.0
along,0.0
around,0.0
at_the_level_of,0.0
behind,0.222222
beyond,0.0
far from,0.05
in,0.0
in_front_of,0.151852
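
Several labels show a precision of 1.0 together with a recall of 0.0. This follows from the convention in `getMultiLabelPrecision`, which returns 1.0 when a label is never predicted. The sketch below reproduces the effect on hypothetical toy data and shows an alternative that marks the undefined case as NaN instead.

```python
import numpy as np

# Toy binary matrices; label 0 is never predicted (hypothetical values).
truths = np.array([[1, 0], [1, 1], [0, 1]])
predictions = np.array([[0, 1], [0, 1], [0, 1]])

# Per-label confusion counts (summed over examples).
tp = np.sum((predictions == 1) & (truths == 1), axis=0)
fp = np.sum((predictions == 1) & (truths == 0), axis=0)
fn = np.sum((predictions == 0) & (truths == 1), axis=0)

predicted = tp + fp
ratio = tp / np.maximum(predicted, 1)  # safe division; only used where predicted > 0

# Convention used in this notebook: undefined precision is reported as 1.0.
precision_as_one = np.where(predicted == 0, 1.0, ratio)
# Alternative: flag undefined precision explicitly.
precision_as_nan = np.where(predicted == 0, np.nan, ratio)

recall = tp / (tp + fn)
print(precision_as_one, precision_as_nan, recall)
```

Reporting NaN makes it easier to separate labels the model never predicts from labels it predicts perfectly.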
