<a href="https://colab.research.google.com/github/Yanina-Kutovaya/RecSys-yandex/blob/main/notebooks/01_Data_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Yandex Cup 2022 - ML RecSys


## A. Like prediction
An important task in music recommendation is finding tracks that are new to the user and that the user will like. Models that analyze explicit user feedback play an important role in solving this problem.
Likes and dislikes placed on a track are considered the user's explicit feedback.

Actions such as play and skip are also used in recommendations, but they provide less information about the user's preferences. Moreover, it is far more important to find a track that the user will like than a track that the user will simply listen to. 

The task is to predict the next track that the user will like, based on the user's previous likes.

## Input format
The data is provided at the link https://disk.yandex.ru/d/SI1aAooPn9i8TA

There are three files in the likes_data.zip archive:
* train - training dataset. Each line is a sequence of track ids that one user has liked. It is guaranteed that the likes appear in the order in which the user placed them.
* test - test dataset. It has exactly the same format, but each line lacks the last like, which is the one to be predicted.
* track_artists.csv - information about track artists. It is guaranteed that each track has exactly one artist. For tracks that actually have multiple artists, only the one considered to be the main artist of the track has been kept.
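
Because each test line is a like sequence with its last like removed, the same setup can be mimicked on the training data for local validation. The sketch below assumes space-separated track ids on each line (which is how the preprocessing code in this notebook splits them); `split_last_like` is a hypothetical helper name, not part of the competition materials.

```python
def split_last_like(line):
    """Split one line of liked track ids into (history, held-out last like)."""
    track_ids = [int(t) for t in line.split()]
    if len(track_ids) < 2:
        raise ValueError("need at least two likes to hold one out")
    # Everything except the final like plays the role of a test line;
    # the final like is the target to predict.
    return track_ids[:-1], track_ids[-1]


history, target = split_last_like("333396 267089 155959")
print(history, target)
```

Holding out only the final like keeps the validation target in the same position as the hidden like in the test file.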

## Output format
The solution is a file that contains, for each user, one line with no more than 100 tracks.
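
A minimal sketch of producing such a file is shown below. The separator and line order are not stated above, so this assumes space-separated track ids and one line per test user in the same order as the test file. `format_submission_line` and `write_submission` are hypothetical helper names, and dropping repeated tracks is a precaution added here, not a stated requirement.

```python
def format_submission_line(track_ids, max_tracks=100):
    """Format ranked track ids as one line with at most max_tracks entries."""
    seen = set()
    unique = []
    for track_id in track_ids:
        # Keep the first (best-ranked) occurrence of each track only.
        if track_id not in seen:
            seen.add(track_id)
            unique.append(track_id)
    return " ".join(str(t) for t in unique[:max_tracks])


def write_submission(predictions, path):
    """Write one line per user; predictions is a list of ranked id lists."""
    with open(path, "w") as f:
        for track_ids in predictions:
            f.write(format_submission_line(track_ids) + "\n")


print(format_submission_line([5, 3, 5, 7], max_tracks=2))
```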

## Notes
MRR@100 (mean reciprocal rank over the top 100 predictions) is used as the metric.
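
For local evaluation, MRR@100 can be sketched as follows: each user contributes the reciprocal of the rank at which the true track appears among the first 100 predictions (0 if it is absent), and the scores are averaged over users. This follows the standard definition of the metric; the organizers' exact scoring script is not shown here, and the function names are illustrative.

```python
def reciprocal_rank(predicted, target, k=100):
    """Return 1/rank of target within the first k predictions, else 0."""
    for rank, track_id in enumerate(predicted[:k], start=1):
        if track_id == target:
            return 1.0 / rank
    return 0.0


def mrr_at_k(predictions, targets, k=100):
    """Average the reciprocal rank over all users."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    total = sum(reciprocal_rank(p, t, k) for p, t in zip(predictions, targets))
    return total / len(targets)


# Toy data: hits at rank 2 and rank 1, then a miss.
preds = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
targets = [20, 40, 99]
print(mrr_at_k(preds, targets))  # (1/2 + 1 + 0) / 3 = 0.5
```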

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
!unzip /content/drive/MyDrive/ML_projects/recsys_yandex/data/01_raw_data/likes_data.zip

Archive:  /content/drive/MyDrive/ML_projects/recsys_yandex/data/01_raw_data/likes_data.zip
  inflating: test                    
  inflating: __MACOSX/._test         
  inflating: track_artists.csv       
  inflating: train                   
  inflating: __MACOSX/._train        


## Data preprocessing

In [3]:
import pandas as pd

def preprocess_data(
    data_path='train',
    track_artists_path='track_artists.csv',
    output_path='train_preprocessed.parquet.gzip',
):
    # Load the track/artist table as a NumPy array; column 1 is used as the artist id.
    track_artists = pd.read_csv(track_artists_path).values

    seq_length = []
    product_ids = []
    category_ids = []
    with open(data_path) as f:
        for line in f:
            # Each line holds one user's liked track ids, space-separated, in order.
            track = line.strip().split(' ')
            product_id = [int(track_id) for track_id in track]
            # Positional lookup: assumes row i of track_artists.csv describes track i.
            category_id = track_artists[product_id, 1].tolist()

            seq_length.append(len(track))
            product_ids.append(product_id)
            category_ids.append(category_id)

    # One row per user: sequence length, liked tracks, and the artists of those tracks.
    pd.DataFrame(
        zip(seq_length, product_ids, category_ids),
        columns=['seq_len', 'product_id', 'category_id'],
    ).to_parquet(output_path, compression='gzip')
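
`preprocess_data` looks up artists by using each track id as a row position in the `track_artists.csv` array, which is only correct if row i of that file describes track i. A quick sanity check for this is sketched below on a toy frame. The column names `trackId` and `artistId` are assumptions for illustration, since the file's header is not shown in this notebook, and `ids_match_row_positions` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def ids_match_row_positions(track_artists):
    """Return True if the first column is exactly 0, 1, ..., n-1 in order."""
    first_col = track_artists.iloc[:, 0].to_numpy()
    return bool(np.array_equal(first_col, np.arange(len(track_artists))))


# Toy stand-in for track_artists.csv (assumed column names).
toy = pd.DataFrame({"trackId": [0, 1, 2], "artistId": [7, 7, 9]})
print(ids_match_row_positions(toy))

# The same positional lookup that preprocess_data performs.
print(toy.values[[2, 0], 1].tolist())
```

If the check fails on the real file, a lookup keyed on the id column (for example a Series indexed by track id) would be a safer alternative to positional indexing.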

In [4]:
preprocess_data(data_path='train', output_path='train_preprocessed.parquet.gzip')

In [5]:
preprocess_data(data_path='test', output_path='test_preprocessed.parquet.gzip')

In [6]:
train = pd.read_parquet('train_preprocessed.parquet.gzip')

print(f'train.shape = {train.shape}\n')
train.head(2)

train.shape = (1160084, 3)



| | seq_len | product_id | category_id |
|---|---|---|---|
| 0 | 54 | [333396, 267089, 155959, 353335, 414000, 33998... | [37399, 52345, 25987, 55650, 23545, 13408, 446... |
| 1 | 10 | [174197, 335779, 141676, 119856, 376664, 31175... | [21355, 41149, 39683, 30298, 18426, 51447, 114... |


In [7]:
test = pd.read_parquet('test_preprocessed.parquet.gzip')

print(f'test.shape = {test.shape}\n')
test.head(2)

test.shape = (289914, 3)


In [8]:
!cp -r '/content/train_preprocessed.parquet.gzip' '/content/drive/MyDrive/ML_projects/recsys_yandex/data/02_intermediate'
!cp -r '/content/test_preprocessed.parquet.gzip' '/content/drive/MyDrive/ML_projects/recsys_yandex/data/02_intermediate'