# Demo for BERT4Rec from scratch

In this notebook we demonstrate how to train a `BERT4Rec` model from scratch, including preparing a dataset.

## 1. Create runtime environment

In [2]:
%%bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Successfully installed PyYAML-6.0 colorama-0.4.4 colorlog-4.7.2 joblib-1.2.0 markdown-3.4.1 pandas-1.4.4 pillow-9.3.0 protobuf-3.19.6 pytz-2022.6 recbole-1.0.1 scikit-learn-1.1.3 scipy-1.6.0 torch-1.12.1+cu116 torchaudio-0.12.1+cu116 torchvision-0.13.1+cu116 tqdm-4.64.1 typing-extensions-4.4.0


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
spacy 3.4.1 requires jinja2, which is not installed.
-bash: line 1: python: command not found
-bash: line 2: venv/bin/activate: No such file or directory


Defaulting to user installation because normal site-packages is not writeable
Looking in links: https://download.pytorch.org/whl/cu116/torch_stable.html


## 2. Dataset Preparation

### 1. Download and unpack the chosen dataset inside the datasets directory

In [3]:
%%bash
mkdir -p datasets && cd datasets

if [ ! -f ml-100k.zip ]; then
wget http://files.grouplens.org/datasets/movielens/ml-100k.zip;
unzip ml-100k.zip;
rm ml-100k.zip
fi

--2022-11-21 12:34:14--  http://files.grouplens.org/datasets/movielens/ml-100k.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4924029 (4.7M) [application/zip]
Saving to: ‘ml-100k.zip’

     0K .......... .......... .......... .......... ..........  1%  203K 23s
    50K .......... .......... .......... .......... ..........  2%  399K 17s
   100K .......... .......... .......... .......... ..........  3% 34.2M 12s
   150K .......... .......... .......... .......... ..........  4%  399K 11s
   200K .......... .......... .......... .......... ..........  5% 38.0M 9s
   250K .......... .......... .......... .......... ..........  6% 9.86M 8s
   300K .......... .......... .......... .......... ..........  7%  415K 8s
   350K .......... .......... .......... .......... ..........  8% 26.9M 7s
   400K .......... .......... ..

Archive:  ml-100k.zip
   creating: ml-100k/
  inflating: ml-100k/allbut.pl       
  inflating: ml-100k/mku.sh          
  inflating: ml-100k/README          
  inflating: ml-100k/u.data          
  inflating: ml-100k/u.genre         
  inflating: ml-100k/u.info          
  inflating: ml-100k/u.item          
  inflating: ml-100k/u.occupation    
  inflating: ml-100k/u.user          
  inflating: ml-100k/u1.base         
  inflating: ml-100k/u1.test         
  inflating: ml-100k/u2.base         
  inflating: ml-100k/u2.test         
  inflating: ml-100k/u3.base         
  inflating: ml-100k/u3.test         
  inflating: ml-100k/u4.base         
  inflating: ml-100k/u4.test         
  inflating: ml-100k/u5.base         
  inflating: ml-100k/u5.test         
  inflating: ml-100k/ua.base         
  inflating: ml-100k/ua.test         
  inflating: ml-100k/ub.base         
  inflating: ml-100k/ub.test         


### 2. Preprocess the downloaded dataset to the atomic format that the RecSys models understand.

The following command converts the raw dataset into the necessary atomic-formatted files:
- interactions
- items
- users

In [11]:
%%bash
python3 preprocess_dataset.py --dataset ml-100k  --convert_inter --convert_item --convert_user

100%|██████████| 100000/100000 [00:04<00:00, 22814.22it/s]
100%|██████████| 1682/1682 [00:00<00:00, 22982.11it/s]
100%|██████████| 943/943 [00:00<00:00, 19741.10it/s]


## 3. Prepare config file

Prepare a config.yaml file with the chosen dataset, model, hyperparameters and so on. You can copy one of the example config files in the `config` folder and adjust it to fit your needs.

## 4. Train the model

In [13]:
%%bash
python3 run.py --model_name BERT4Rec --config config_ml-100k.yaml --dataset=ml-100k

command line args [--model_name BERT4Rec --config config_ml-100k.yaml] will not be used in RecBole
21 Nov 12:38    INFO  
General Hyper Parameters:
gpu_id = 0
use_gpu = True
seed = 2020
state = INFO
reproducibility = True
data_path = /home/lazerdance/.local/lib/python3.8/site-packages/recbole/config/../dataset_example/ml-100k
checkpoint_dir = saved
show_progress = True
save_dataset = False
dataset_save_path = None
save_dataloaders = False
dataloaders_save_path = None
log_wandb = False

Training Hyper Parameters:
epochs = 10
train_batch_size = 3500
learner = adam
learning_rate = 0.01
neg_sampling = None
eval_step = 1
stopping_step = 10
clip_grad_norm = None
weight_decay = 0.0
loss_decimal_place = 4

Evaluation Hyper Parameters:
eval_args = {'split': {'RS': [4, 1, 1]}, 'group_by': 'None', 'order': 'TO', 'mode': 'uni50'}
repeatable = True
metrics = ['Recall', 'MRR', 'NDCG', 'Hit', 'Precision', 'MAP']
topk = [12]
valid_metric = MAP@12
valid_metric_bigger = True
eval_batch_size = 3500
metri

Congratulations! You just trained a BERT4Rec model on a dataset of your choice. The model and the dataset are stored in the `saved` folder as `.path` file

In [17]:
%%bash
pip freeze > full_requirements.txt

/mnt/c/Users/amroh/Desktop/Master Thesis/Crawlers/recommender_systems
