### Bayesian Sets

This is an experiment to check whether Bayesian Sets (BS) work well for our companion recommendation use case. The implementation we are going to use is https://github.com/MaLL-UFSCar/bayessets. Be sure to install it first via `!pip install git+https://github.com/MaLL-UFSCar/bayessets.git`.

In [1]:
import numpy
import pandas
from scipy import sparse
import itertools

In [2]:
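def cal_sparsity(array):
    """Print and return the fraction of zero entries in a 2-D array."""
    num_total = total_elems(array)
    num_zero = zero_elems(array, num_total)
    sparsity = num_zero/num_total
    print("Sparsity of matrix is = {}".format(sparsity))
    return sparsity


def zero_elems(array, num_total):
    """Return the number of zero entries, given the total element count."""
    non_zero = numpy.count_nonzero(array)
    return num_total-non_zero


def total_elems(array):
    """Return the total number of elements of a 2-D array."""
    shape = array.shape
    return shape[0]*shape[1]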

In [3]:
import json
import pickle

In [4]:
with open('package-to-id-dict-without-trans.json', 'r') as f:
    pack_to_id = json.load(f)

In [5]:
with open('manifest-to-id-without-trans.pickle', 'rb') as f:
    man_to_id = pickle.load(f)

In [6]:
man_to_id.get(frozenset(['django']))

16101

In [7]:
users = len(man_to_id)

In [8]:
items = len(pack_to_id)

In [9]:
users

66018

In [10]:
items

18796

#### Step 1 - Training the model

Here we define a matrix of manifests (users) x packages (items): an entry is 1 when the manifest contains the package and 0 otherwise. We first build it as a dense array and then convert it to a sparse matrix.

In [11]:
rating_matrix = numpy.zeros((users, items))

In [12]:
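for item_list, user in man_to_id.items():
    for item in item_list:
        # Index with [] rather than .get(): for a missing package .get() returns
        # None, and indexing a NumPy row with None would set the whole row to 1.
        rating_matrix[user, pack_to_id[item]] = 1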

In [13]:
cal_sparsity(rating_matrix)

Sparsity of matrix is = 0.9996093746247605


0.9996093746247605

In [14]:
rating_matrix[:10]

array([[1., 1., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.]])

In [15]:
sparse_rating_matrix = sparse.csr_matrix(rating_matrix)
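
The dense intermediate array above holds 66018 x 18796 float64 values, which is roughly 10 GB, although more than 99.9% of the entries are zero. A minimal sketch of assembling the same matrix directly in sparse form follows. It assumes, like the loop above, that the ids in both dictionaries are contiguous integers starting at 0. The helper names `build_sparse_rating_matrix` and `sparse_sparsity` and the toy dictionaries are illustrative only, not part of the original notebook.

```python
import numpy
from scipy import sparse


def build_sparse_rating_matrix(man_to_id, pack_to_id):
    """Build a manifests x packages 0/1 matrix without a dense intermediate."""
    rows, cols = [], []
    for item_list, user in man_to_id.items():
        for item in item_list:
            col = pack_to_id.get(item)
            if col is None:  # skip packages missing from the vocabulary
                continue
            rows.append(user)
            cols.append(col)
    data = numpy.ones(len(rows), dtype=numpy.float64)
    shape = (len(man_to_id), len(pack_to_id))
    matrix = sparse.csr_matrix((data, (rows, cols)), shape=shape)
    # The constructor sums duplicate (row, col) pairs; reset stored values to 1.
    matrix.data[:] = 1.0
    return matrix


def sparse_sparsity(matrix):
    """Fraction of zero entries, computed from the stored non-zero count."""
    total = matrix.shape[0] * matrix.shape[1]
    return 1.0 - matrix.nnz / total


# Toy data (hypothetical, not the real dictionaries)
toy_man_to_id = {frozenset(['django']): 0,
                 frozenset(['keras', 'tensorflow']): 1}
toy_pack_to_id = {'django': 0, 'keras': 1, 'tensorflow': 2}
toy_matrix = build_sparse_rating_matrix(toy_man_to_id, toy_pack_to_id)
print(toy_matrix.toarray())
print(sparse_sparsity(toy_matrix))
```

Computing the sparsity from `nnz` avoids scanning a dense array, so the check in `cal_sparsity` could be done on the sparse matrix as well.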

We now transpose the matrix, i.e. make it items x users. The reason is the following:

The model ranks the rows of the matrix it is given. If each package is a row, then each package is an item that can be recommended, and the model works like a recommender system. If each manifest is a feature of those items, i.e. a user, then it works like a collaborative filtering RS. Had we kept the original orientation, we would be querying the model for similar users rather than similar items.

An interesting follow-up question: can we combine both approaches, i.e. find both the similar items and the similar users, and merge that information to generate the final set of recommendations? That might help us filter the recommendations better.

In [16]:
sparse_rating_matrix = sparse_rating_matrix.transpose()

In [17]:
sparse_rating_matrix[:10]

<10x66018 sparse matrix of type '<class 'numpy.float64'>'
	with 48037 stored elements in Compressed Sparse Column format>

In [18]:
import bayessets

In [19]:
# We just pass the sparse rating matrix to the model

model = bayessets.BernoulliBayesianSet(sparse_rating_matrix)
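
For intuition about what such a model computes, here is a minimal sketch of the Bernoulli Bayesian Sets score as described by Ghahramani and Heller. It is not taken from the `bayessets` source, which may differ in hyperparameters and details. The function name `bayesian_sets_scores`, the prior scale `c`, and the clipping of the feature means are assumptions made for this illustration.

```python
import numpy
from scipy import sparse


def bayesian_sets_scores(X, query_rows, c=2.0):
    """Log Bayesian Sets score of every row of a binary matrix X for a query.

    X is items x features; query_rows lists the row ids that form the query.
    """
    X = sparse.csr_matrix(X, dtype=numpy.float64)
    n_query = len(query_rows)

    # Beta prior per feature, scaled from the empirical feature mean.
    mean = numpy.asarray(X.mean(axis=0)).ravel()
    mean = numpy.clip(mean, 1e-12, 1 - 1e-12)  # assumption: avoids log(0)
    alpha = c * mean
    beta = c * (1.0 - mean)

    # Posterior parameters after observing the query items.
    sum_x = numpy.asarray(X[query_rows].sum(axis=0)).ravel()
    alpha_tilde = alpha + sum_x
    beta_tilde = beta + n_query - sum_x

    # The log score is linear in x: const + x . q
    const = numpy.sum(numpy.log(alpha + beta)
                      - numpy.log(alpha + beta + n_query)
                      + numpy.log(beta_tilde) - numpy.log(beta))
    q = (numpy.log(alpha_tilde) - numpy.log(alpha)
         + numpy.log(beta) - numpy.log(beta_tilde))
    return const + X.dot(q)


# Toy example: rows 0 and 1 are identical, row 2 shares nothing with them.
toy_X = numpy.array([[1, 1, 0, 0],
                     [1, 1, 0, 0],
                     [0, 0, 1, 1],
                     [1, 0, 0, 1]])
toy_scores = bayesian_sets_scores(toy_X, [0])
print(numpy.argsort(toy_scores)[::-1])
```

Because the score reduces to one sparse matrix-vector product, a query stays cheap even with tens of thousands of items, which is what makes the approach attractive for this use case.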

In [20]:
# Let's save the trained model

with open('Bayesian_Sets.pkl', 'wb') as f:
    pickle.dump(model, f)

#### Step 2 - Scoring the model

Here we look at how the model behaves for different input stacks and personas.

In [21]:
# Let's load the model

with open('Bayesian_Sets.pkl', 'rb') as f:
    model = pickle.load(f)

In [22]:
import random
from math import ceil

In [23]:
with open('id-to-package-dict-without-trans.json', 'r') as f:
    id_to_pack = json.load(f)

In [24]:
def get_packages_from_id(package_ids):
    package_list = list()
    for i in package_ids:
        package = id_to_pack.get(str(i))
        package_list.append(package)
    return package_list

In [25]:
def map_input_to_package_ids(input_stack):
    package_id_list = list()
    for package in input_stack:
        package_id = pack_to_id.get(package)
        if package_id is not None:
            package_id_list.append(package_id)
    return package_id_list

#### Let's see how the model behaves for stacks that contain both tensorflow and keras

In [30]:
# Collect every stack (manifest) that contains both tensorflow and keras
count = 0
l = []
for item in man_to_id.items():
    if 'tensorflow' in item[0] and 'keras' in item[0]:
        count+=1
        l.append(item[0])
print(count)

248


As we can see, 248 stacks contain both tensorflow and keras.

In [31]:
# Let's look at the first 10 of those stacks
x = l[:10]
print(x)

[frozenset({'lxml', 'keras', 'tensorflow'}), frozenset({'keras', 'tensorflow', 'midi'}), frozenset({'tqdm', 'keras', 'docopt', 'tensorflow', 'opencv-python', 'python-resize-image', 'logger', 'scikit-image', 'hdfs3'}), frozenset({'dill', 'keras', 'tensorflow', 'matplotlib', 'scikit-learn'}), frozenset({'pykitti', 'packaging', 'pip', 'keras', 'unrealcv', 'transforms3d', 'pymongo', 'xxhash', 'tensorflow'}), frozenset({'tqdm', 'theano', 'keras', 'pygame', 'sgf', 'tensorflow', 'scikit-learn'}), frozenset({'sphinx-gallery', 'nbsphinx', 'keras', 'pillow', 'tensorflow', 'cython', 'ipykernel', 'scikit-learn'}), frozenset({'keras', 'pillow', 'dill', 'tensorflow'}), frozenset({'simplegeneric', 'certifi', 'werkzeug', 'cycler', 'jupyter', 'pytz', 'matplotlib', 'jupyter-console', 'wcwidth', 'imageio', 'olefile', 'pyparsing', 'pillow', 'pathlib2', 'html5lib', 'tqdm', 'singledispatch', 'theano', 'keras', 'moviepy', 'tensorflow', 'backports-shutil-get-terminal-size', 'scandir', 'subprocess32', 'pbr', '

In [33]:
# Take the 10 highest-scoring packages for each of our 10 stacks and drop the
# ones already in the stack, so fewer than 10 recommendations may remain

for stack in x:
    # Map the package names in the stack to their ids
    print("Stack is: ", stack)
    input_stack = map_input_to_package_ids(stack)
    scores = model.query(list(input_stack))
    # Rank all packages from highest to lowest score
    ranking = numpy.argsort(scores)[::-1]
    top10 = ranking[:10]
    recommendations = numpy.array(list(itertools.compress(
        top10, [i not in input_stack for i in top10])))
    print("========================================")
    print("Recommendations are: ", set(get_packages_from_id(recommendations)))
    print("========================================")

Stack is:  frozenset({'lxml', 'keras', 'tensorflow'})
Recommendations are:  {'tqdm', 'theano', 'sklearn', 'tensorflow-tensorboard', 'opencv-python', 'jupyter', 'scikit-learn'}
Stack is:  frozenset({'keras', 'tensorflow', 'midi'})
Recommendations are:  {'theano', 'tensorflow-gpu', 'sklearn', 'tensorflow-tensorboard', 'opencv-python', 'jupyter', 'scikit-image', 'scikit-learn'}
Stack is:  frozenset({'tqdm', 'keras', 'docopt', 'tensorflow', 'opencv-python', 'python-resize-image', 'logger', 'scikit-image', 'hdfs3'})
Recommendations are:  {'dockerpty', 'tensorflow-tensorboard', 'tensorflow-gpu', 'theano'}
Stack is:  frozenset({'dill', 'keras', 'tensorflow', 'matplotlib', 'scikit-learn'})
Recommendations are:  {'pandas', 'pyparsing', 'scipy', 'sklearn', 'cycler', 'jupyter'}
Stack is:  frozenset({'pykitti', 'packaging', 'pip', 'keras', 'unrealcv', 'transforms3d', 'pymongo', 'xxhash', 'tensorflow'})
Recommendations are:  {'olefile', 'theano', 'sklearn', 'appdirs', 'wcwidth'}
Stack is:  frozense
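
Because the loop above slices the ranking to 10 entries before removing the input packages, each stack ends up with fewer than 10 recommendations in the output. A minimal sketch of filtering first and slicing afterwards follows, so that `k` results are returned whenever enough candidates exist. The helper name `top_k_excluding` is hypothetical and the scores in the example are made up.

```python
import numpy


def top_k_excluding(scores, exclude_ids, k=10):
    """Return the ids of the k best-scoring items not in exclude_ids, best first."""
    scores = numpy.asarray(scores, dtype=float)
    ranking = numpy.argsort(scores)[::-1]  # highest score first
    exclude = set(exclude_ids)
    kept = [int(i) for i in ranking if int(i) not in exclude]
    return kept[:k]


# Toy example with made-up scores: item 1 scores best but is part of the input.
toy_top = top_k_excluding([0.1, 0.9, 0.5, 0.7], exclude_ids=[1], k=2)
print(toy_top)
```

Returning a list rather than a set also keeps the recommendations in ranked order, which the `set(...)` call in the printing code discards.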

**As we can see, the recommendations look noticeably better than those of the previous models. Notice the serendipity of the model: it recommends mostly different packages for each stack, based on the set of packages the stack already uses. Unlike the previous models, it does not fall back on generic suggestions (there is no CF here, because there are no users here), so it gives useful rather than merely popular packages.**

#### TODO:

1. While scoring, we can either use the whole vector (everything we already know the user likes or has consumed), or randomly sample from it a few times and keep the items that appear at the top of the rankings most often. The latter could increase the serendipity of the model.
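
The sampling idea in the TODO could be sketched as follows. This is one possible reading of it, not a tested design: the function name `subsample_vote` and the parameters `n_rounds`, `sample_frac`, and `seed` are assumptions, and `score_fn` stands for any callable that maps a list of item ids to one score per item.

```python
import random
from collections import Counter
from math import ceil

import numpy


def subsample_vote(score_fn, input_ids, n_rounds=20, sample_frac=0.5,
                   k=10, seed=0):
    """Query with random subsets of the input and vote on the top-k items."""
    input_ids = list(input_ids)
    if not input_ids:
        return []
    rng = random.Random(seed)
    exclude = set(input_ids)
    sample_size = max(1, ceil(sample_frac * len(input_ids)))
    votes = Counter()
    for _ in range(n_rounds):
        subset = rng.sample(input_ids, sample_size)
        scores = numpy.asarray(score_fn(subset))
        ranking = numpy.argsort(scores)[::-1]
        # Keep the k best items of this round that are not part of the input.
        top = [int(i) for i in ranking if int(i) not in exclude][:k]
        votes.update(top)
    # Items that reached the top in the most rounds come first.
    return [item for item, _ in votes.most_common(k)]


# Toy scoring function (hypothetical): fixed scores regardless of the query.
def toy_score_fn(query_ids):
    return [0.0, 5.0, 4.0, 3.0, 2.0, 1.0]


toy_votes = subsample_vote(toy_score_fn, input_ids=[0, 1], k=2)
print(toy_votes)
```

With the objects from this notebook, the call would presumably be `subsample_vote(model.query, input_stack)`, assuming `model.query` accepts a list of item ids as in the scoring loop above.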