## Initial data exploration - Amazon Book

This notebook computes summary statistics to better understand each dataset.

For the train and test sets, we report the number of edges and the distribution of interactions per user and per item. We also calculate the expected number of times each item appears as a positive during sampling.
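
As a sketch of what this expected value means: assume each user is drawn $\bar{m}$ times, where $\bar{m}$ is `dataset.mean_item_per_user`, and that each draw picks one of the user's training items uniformly at random. This reading is inferred from the analysis code in this notebook, not from the sampler itself. Under those assumptions, the expected number of times item $i$ appears as a positive is:

$$
E_i = \sum_{u \,:\, i \in P_u} \frac{\bar{m}}{|P_u|}
$$

Here $P_u$ is the set of training items of user $u$ (`dataset.all_pos[user_id]` in the code). The symbols $E_i$, $P_u$ and $\bar{m}$ are notation introduced here for illustration only.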

### Imports

In [1]:
# Imports
import os
try:
    os.chdir('code')
except FileNotFoundError:
    pass

from world import Config, FakeArgs
from dataloader import DataLoader
from scipy.sparse import csr_matrix
import numpy as np

### Define analysis functions

In [2]:
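def analyse_dataset(config, dataset):
    """Load the dataset with the given name, print its statistics and return them in a dict."""
    config.dataset = dataset
    # Load the dataset (from here on, `dataset` is the loader object, not the dataset name)
    dataset = DataLoader(config)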
    # Create the user item sparse matrices for the train and test set
    train_user_item_matrix = csr_matrix(
        (np.ones(len(dataset.df_train['user_id'])), (dataset.df_train['user_id'], dataset.df_train['item_id'])),
        shape=(dataset.n_user, dataset.m_item)
    )

    test_user_item_matrix = csr_matrix(
        (np.ones(len(dataset.df_test['user_id'])), (dataset.df_test['user_id'], dataset.df_test['item_id'])),
        shape=(dataset.n_user, dataset.m_item)
    )
    # Number of training items per user
    train_user = np.array(train_user_item_matrix.sum(axis=1)).squeeze()
    # Number of training users per item
    train_item = np.array(train_user_item_matrix.sum(axis=0)).squeeze()

    # Number of test items per user
    test_user = np.array(test_user_item_matrix.sum(axis=1)).squeeze()
    # Number of test users per item
    test_item = np.array(test_user_item_matrix.sum(axis=0)).squeeze()
    
    # Print the dataset statistics
    print('In the training dataset:')
    print(f'- There are {len(dataset.df_train["user_id"])} edges.')
    print(f'- The users have a minimum, mean, median, maximum, std of {get_array_statistics(train_user)} edges.')
    print(f'- The items have a minimum, mean, median, maximum, std of {get_array_statistics(train_item)} users.\n')

    print('In the testing dataset:')
    print(f'- There are {len(dataset.df_test["user_id"])} edges.')
    print(f'- The users have a minimum, mean, median, maximum, std of {get_array_statistics(test_user)} edges.')
    print(f'- The items have a minimum, mean, median, maximum, std of {get_array_statistics(test_item)} users.\n')
    
    # Expected number of occurrences of each item as a positive under random sampling:
    # each user adds mean_item_per_user / (number of items of that user) to each of their items
    expected_n_item = {i: 0 for i in dataset.df_train['item_id'].unique()}
    for user_id in dataset.df_train['user_id'].unique():
        user_items = dataset.all_pos[user_id]
        user_items_len = len(user_items)
        for i in user_items:
            expected_n_item[i] += dataset.mean_item_per_user / user_items_len
    
    # Compute the statistics of the expected number of occurrences
    expected_values = list(expected_n_item.values())
    min_expected_value = min(expected_values)
    mean_expected_value = np.mean(expected_values)
    max_expected_value = max(expected_values)
    std_expected_value = np.std(expected_values)

    print(f'Minimum item expected value: {min_expected_value}')
    print(f'Mean item expected value: {mean_expected_value}')
    print(f'Maximum item expected value: {max_expected_value}')
    print(f'Std item expected value: {std_expected_value}')
    
    data_dic = {
        'dataset': dataset,
        'min_expected_value': min_expected_value,
        'mean_expected_value': mean_expected_value,
        'max_expected_value': max_expected_value,
        'std_expected_value': std_expected_value,
    }
    return data_dic
    

def get_array_statistics(array):
    """Return the (minimum, mean, median, maximum, std) of a NumPy array."""
    return array.min(), array.mean(), np.median(array), array.max(), array.std()
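
The nested loop over users and items in `analyse_dataset` can also be written as one sparse matrix-vector product. The sketch below is illustrative and not part of the original notebook: the function name `expected_positive_counts` and its parameters are hypothetical, and it assumes the same model as the loop, where each user contributes `draws_per_user / (number of items of that user)` to each of their items. It also removes duplicate (user, item) pairs first, which may or may not match how `dataset.all_pos` is built.

```python
import numpy as np
from scipy.sparse import csr_matrix


def expected_positive_counts(user_ids, item_ids, n_users, n_items, draws_per_user):
    """Expected number of times each item is drawn as a positive.

    Assumption: every user is drawn `draws_per_user` times and each draw
    picks one of that user's training items uniformly at random.
    """
    # Deduplicate (user, item) pairs so the matrix is binary
    pairs = np.unique(np.stack([user_ids, item_ids], axis=1), axis=0)
    matrix = csr_matrix(
        (np.ones(len(pairs)), (pairs[:, 0], pairs[:, 1])),
        shape=(n_users, n_items),
    )
    # Number of training items per user
    items_per_user = np.asarray(matrix.sum(axis=1)).ravel()
    # Per-user weight draws_per_user / items_per_user (0 for users without items)
    weights = np.divide(
        draws_per_user,
        items_per_user,
        out=np.zeros_like(items_per_user),
        where=items_per_user > 0,
    )
    # Sum the weights of all users that interacted with each item
    return np.asarray(matrix.T @ weights).ravel()


# Toy example: user 0 has items {0, 1}, user 1 has {1}, user 2 has {0, 1, 2}
toy_counts = expected_positive_counts(
    np.array([0, 0, 1, 2, 2, 2]), np.array([0, 1, 1, 0, 1, 2]),
    n_users=3, n_items=3, draws_per_user=2.0,
)
print(toy_counts)
```

In the toy example the expected counts sum to `n_users * draws_per_user = 6`, because every draw selects exactly one item.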

### Set up

In [3]:
# Extract the default values to instantiate the Config class
args = FakeArgs()

# Instantiate the Config class
config = Config(
    args.dataset, args.model, args.bpr_batch, args.recdim, args.layer, args.dropout, args.keepprob, args.a_fold,
    args.testbatch, args.multicore, args.lr, args.decay, args.pretrain, args.seed, args.epochs, args.load,
    args.checkpoint_path, args.results_path, args.topks, args.tensorboard, args.comment, args.sampling
)

### Gowalla dataset analysis

In [4]:
gowalla_analysis = analyse_dataset(config, 'gowalla')

loading [../data/gowalla]
0 training samples and 0 test samples were dropped during the data cleaning.
The user ids were not updated.
The item ids were not updated.
810128 interactions for training
217242 interactions for testing
gowalla Sparsity : 0.0008396216228570436
gowalla is ready to go
In the training dataset:
- There are 810128 edges.
- The users have a minimum, mean, median, maximum, std of (8.0, 27.132694755174494, 16.0, 811.0, 36.85818812689325) edges.
- The items have a minimum, mean, median, maximum, std of (1.0, 19.76838046899783, 12.0, 1415.0, 33.11268050158492) users.

In the testing dataset:
- There are 217242 edges.
- The users have a minimum, mean, median, maximum, std of (1.0, 7.275838971130016, 4.0, 203.0, 9.217030630034714) edges.
- The items have a minimum, mean, median, maximum, std of (0.0, 5.301041946267783, 3.0, 895.0, 13.345033564190903) users.

Minimum item expected value: 0.14594594594594595
Mean item expected value: 19.671701520216686
Maximu

### Yelp 2018 dataset analysis

In [5]:
yelp_analysis = analyse_dataset(config, 'yelp2018')

loading [../data/yelp2018]
0 training samples and 0 test samples were dropped during the data cleaning.
The user ids were not updated.
The item ids were not updated.
1237259 interactions for training
324147 interactions for testing
yelp2018 Sparsity : 0.0012958757851778647
yelp2018 is ready to go
In the training dataset:
- There are 1237259 edges.
- The users have a minimum, mean, median, maximum, std of (16.0, 39.06969180245042, 25.0, 1848.0, 45.10830201741167) edges.
- The items have a minimum, mean, median, maximum, std of (1.0, 32.51837153069807, 17.0, 1258.0, 49.266030880676894) users.

In the testing dataset:
- There are 324147 edges.
- The users have a minimum, mean, median, maximum, std of (2.0, 10.235790071996968, 7.0, 463.0, 11.245355174258071) edges.
- The items have a minimum, mean, median, maximum, std of (0.0, 8.51942283431455, 5.0, 275.0, 12.541067574469807) users.

Minimum item expected value: 0.16956521739130434
Mean item expected value: 32.46036585365854

### Amazon Book dataset analysis

In [6]:
amazon_book_analysis = analyse_dataset(config, 'amazon-book')

loading [../data/amazon-book]
0 training samples and 0 test samples were dropped during the data cleaning.
The user ids were not updated.
The item ids were not updated.
2380730 interactions for training
603378 interactions for testing
amazon-book Sparsity : 0.0006188468344849981
amazon-book is ready to go
In the training dataset:
- There are 2380730 edges.
- The users have a minimum, mean, median, maximum, std of (16.0, 45.22405637976559, 26.0, 10682.0, 77.95751270232581) edges.
- The items have a minimum, mean, median, maximum, std of (1.0, 25.990785925610542, 15.0, 1741.0, 38.39710866361827) users.

In the testing dataset:
- There are 603378 edges.
- The users have a minimum, mean, median, maximum, std of (0.0, 11.461694812225748, 7.0, 2631.0, 18.933403968453813) edges.
- The items have a minimum, mean, median, maximum, std of (0.0, 6.587167982183211, 3.0, 416.0, 11.468086187789028) users.

Minimum item expected value: 0.004212694252012731
Mean item expected value: 25.8

### LastFM dataset analysis

In [7]:
lastfm_analysis = analyse_dataset(config, 'lastfm')

loading [../data/lastfm]
0 training samples and 0 test samples were dropped during the data cleaning.
The user ids were not updated.
The item ids were not updated.
2418427 interactions for training
616336 interactions for testing
lastfm Sparsity : 0.0026760006439057356
lastfm is ready to go
In the training dataset:
- There are 2418427 edges.
- The users have a minimum, mean, median, maximum, std of (8.0, 102.62356785199016, 42.0, 8384.0, 208.9881824698514) edges.
- The items have a minimum, mean, median, maximum, std of (1.0, 50.25511709577541, 24.0, 2836.0, 86.9692285597461) users.

In the testing dataset:
- There are 616336 edges.
- The users have a minimum, mean, median, maximum, std of (2.0, 26.153611134685566, 11.0, 2096.0, 52.24164370985187) edges.
- The items have a minimum, mean, median, maximum, std of (0.0, 12.80751407850716, 6.0, 1362.0, 23.542949552649013) users.

Minimum item expected value: 0.09705042816365367
Mean item expected value: 49.94975375600024
Maxi