# HACKATHON_Yandex.Music

### **Goal:** Develop an ML model that matches song lyrics and finds covers (reworkings of an original with elements of a new arrangement) by their lyrics

### **Task:** Build an ML product that:

- Finds all cover tracks and/or originals for a given track in the dataset
- Lists all cover tracks and/or originals for a given track and indicates the position of that track in the cover chain
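
The second requirement, locating a track within its chain of covers, could be sketched roughly as follows. This is only an illustration under assumptions: each record is taken to carry `track_id`, `original_track_id` (possibly missing) and a first-seen timestamp `dttm`, and chain position is assumed to follow `dttm` order. The function name `cover_chain` and the sample records are hypothetical.

```python
from collections import defaultdict

def cover_chain(records, track_id):
    """Return (chain, position): ids of all tracks sharing an original with
    `track_id`, ordered by first-seen time, and the 0-based index of `track_id`."""
    groups = defaultdict(list)
    for rec in records:
        # An ORIGINAL points at itself; a cover with no known original
        # falls back to its own id and ends up in a chain of its own.
        root = rec.get("original_track_id") or rec["track_id"]
        groups[root].append(rec)

    for chain in groups.values():
        ordered = sorted(chain, key=lambda r: r["dttm"])
        ids = [r["track_id"] for r in ordered]
        if track_id in ids:
            return ids, ids.index(track_id)
    return [], -1

# Hypothetical records: "a" is an original with two covers, "d" has no known original.
records = [
    {"track_id": "a", "original_track_id": "a", "dttm": 100},
    {"track_id": "b", "original_track_id": "a", "dttm": 300},
    {"track_id": "c", "original_track_id": "a", "dttm": 200},
    {"track_id": "d", "original_track_id": None, "dttm": 150},
]
chain, position = cover_chain(records, "b")
```

Covers whose original is unknown cannot be attached to a chain from the annotations alone; that is where lyric matching would have to supply the missing link.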

#### **Stack: pandas, pyplot, seaborn, sklearn, gensim, langdetect, sentence_transformers, CatBoost, XGBoost**

### **Data description**

#### Cover annotations:

The file `covers.json` contains cover annotations made by the service's editors:

- `track_id` - unique track identifier;
- `track_remake_type` - label assigned by the editors; takes the values `ORIGINAL` and `COVER`;
- `original_track_id` - unique identifier of the original track.

<aside>
💡 Note that the identifier of the original track is not known for every cover.

</aside>

#### Metadata:

- `track_id` - unique track identifier;
- `dttm` - date when information about the track first appeared;
- `title` - track title;
- `language` - language of the performance;
- `isrc` - international unique track identifier (ISRC);
- `genres` - genres;
- `duration` - track duration.

#### Lyrics:

- `track_id` - unique track identifier;
- `lyricId` - unique lyric identifier;
- `text` - lyric text.


### **Solution outline:**
  1. Data loading.
     - Importing the required libraries.
     - Preliminary look at the data.
     - Preliminary data preparation: removing clearly irrelevant information.
     - Building a combined dataset for further analysis and preprocessing.
  2. Exploratory analysis:
     - Handling anomalies, missing values, and duplicates.
     - Feature analysis.
  3. Feature engineering:
     - Data analysis.
     - Removing uninformative features and generating new ones where needed.
  4. Model building and training:
     - Preparing data for training:
       - Encoding and scaling features (standardisation) where needed.
       - Splitting the combined dataset into training and validation sets.
     - Training the models:
       - Data encoding.
       - RandomForestClassifier.
       - CatBoost.
       - XGBClassifier.
  5. Selecting the best model. Testing.
  6. Conclusions.

## **1. Data loading**
   - Importing the required libraries.

In [1]:
# the project is run in Colab
import sys
ENV_COLAB = 'google.colab' in sys.modules

if ENV_COLAB:
    # send pip output to /dev/null (">> None" would create a file named "None")
    !pip install catboost > /dev/null
    !"{sys.executable}" -m pip install phik > /dev/null
    !pip install pytorch-transformers > /dev/null
    !pip install transformers > /dev/null
    #!pip install pytorch-pretrained-bert > /dev/null
    !pip install pyinflect > /dev/null
    !pip install sentence_splitter > /dev/null
    !pip install contractions > /dev/null
    !pip install sentence-transformers > /dev/null
    !pip install langdetect > /dev/null


    print('Environment: Google Colab')


Environment: Google Colab


In [2]:
# library imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import json
import math
import time
import re
import contractions
# text processing
import transformers as ppb
from langdetect import detect  # language detection
import pyinflect
from sentence_splitter import SentenceSplitter, split_text_into_sentences
# model imports
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from catboost import CatBoostClassifier

# functions from statsmodels and scipy
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller
from scipy.stats import norm
from scipy import stats

# preprocessing
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer, SimpleImputer
# text processing
import gensim
from gensim.utils import simple_preprocess
import spacy
import en_core_web_sm  # small spaCy model
import gensim.downloader as api
from gensim.models import Word2Vec
from sentence_transformers import SentenceTransformer

# cross-validation
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    train_test_split
)

# metrics
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    roc_auc_score
)


# settings
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")
#from skimpy import clean_columns
import logging
logging.getLogger('matplotlib.font_manager').disabled = True

# constant (upper case by convention)
RANDOM_STATE = 123456

In [3]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [4]:
%cd /content/drive/My Drive/Colab Notebooks/

/content/drive/My Drive/Colab Notebooks


### **Data loading**
   - Preliminary look at the data.
   - Preliminary data preparation: removing clearly irrelevant information.
   - Building a combined dataset for further analysis and preprocessing.

In [5]:
def convert_js(path):
    """Read a file with one JSON object per line and return a list of dicts."""
    data_js = []
    with open(path, 'r', encoding='utf-8') as file:
        for line in file:
            try:
                item = json.loads(line)
                # Convert list values to JSON strings
                for key, value in item.items():
                    if isinstance(value, list):
                        item[key] = json.dumps(value)
                data_js.append(item)
            except json.JSONDecodeError as e:
                # Report the malformed line and keep going
                print(f"JSON parsing error: {e}")
    return data_js

In [6]:
covers_js = convert_js('covers.json')
lyrics_js = convert_js( 'lyrics.json')
meta_js = convert_js( 'meta.json')
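
The three files appear to store one JSON object per line (JSON Lines), which is why `convert_js` parses them line by line. Below is a minimal self-contained sketch of the same idea on an in-memory sample. The helper name `parse_json_lines` and the sample records are made up for illustration, and unlike `convert_js` the sketch counts malformed lines instead of printing them.

```python
import io
import json

def parse_json_lines(stream):
    """Parse one JSON object per line, skipping blank or malformed lines and
    serialising list values back to JSON strings (as `convert_js` does)."""
    rows, bad_lines = [], 0
    for line in stream:
        line = line.strip()
        if not line:
            continue
        try:
            item = json.loads(line)
        except json.JSONDecodeError:
            bad_lines += 1
            continue
        rows.append({key: json.dumps(value) if isinstance(value, list) else value
                     for key, value in item.items()})
    return rows, bad_lines

# Hypothetical sample: two valid records and one broken line.
sample = io.StringIO(
    '{"track_id": "t1", "genres": ["ROCK", "ALLROCK"]}\n'
    'not json\n'
    '{"track_id": "t2", "genres": []}\n'
)
rows, bad_lines = parse_json_lines(sample)
```

pandas can also read this layout directly with `pd.read_json(path, lines=True)`; list-valued columns such as `genres` then stay as Python lists rather than strings.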

In [7]:
# Create a DataFrame from the JSON data
covers = pd.DataFrame(covers_js)
covers.head()

Unnamed: 0,original_track_id,track_id,track_remake_type
0,eeb69a3cb92300456b6a5f4162093851,eeb69a3cb92300456b6a5f4162093851,ORIGINAL
1,fe7ee8fc1959cc7214fa21c4840dff0a,fe7ee8fc1959cc7214fa21c4840dff0a,ORIGINAL
2,cd89fef7ffdd490db800357f47722b20,cd89fef7ffdd490db800357f47722b20,ORIGINAL
3,995665640dc319973d3173a74a03860c,995665640dc319973d3173a74a03860c,ORIGINAL
4,,d6288499d0083cc34e60a077b7c4b3e1,COVER


In [8]:
lyrics = pd.DataFrame(lyrics_js)
lyrics.head()

Unnamed: 0,lyricId,text,track_id
0,a951f9504e89759e9d23039b7b17ec14,"Живу сейчас обломами, обломками не той любви\n...",1c4b1230f937e4c548ff732523214dcd
1,0c749bc3f01eb8e6cf986fa14ccfc585,Tell me your fable\nA fable\nTell me your fabl...,0faea89b0d7d6235b5b74def72511bd8
2,e2c8830fbc86e5964478243099eec23a,You're ashamed about all your fears and doubts...,9c6dc41d5ccd9968d07f055da5d8f741
3,e2c8830fbc86e5964478243099eec23a,You're ashamed about all your fears and doubts...,bfd04a73e9cffdf0e282c92219a86ea1
4,7624653ca8522ba93470843c74961b7d,"You showed him all the best of you,\nBut I'm a...",8d70930d09cd239c948408d1317d8659


In [9]:
meta = pd.DataFrame(meta_js)
meta.head()

Unnamed: 0,track_id,dttm,title,language,isrc,genres,duration
0,c3b9d6a354ca008aa4518329aaa21380,1639688000000.0,Happy New Year,EN,RUB422103970,"[""DANCE""]",161120.0
1,c57e3d13bbbf5322584a7e92e6f1f7ff,1637762000000.0,Bad Habits,EN,QZN882178276,"[""ELECTRONICS""]",362260.0
2,955f2aafe8717908c140bf122ba4172d,1637768000000.0,Por Esa Loca Vanidad,,QZNJZ2122549,"[""FOLK"", ""LATINFOLK""]",260000.0
3,fae5a077c9956045955dde02143bd8ff,1637768000000.0,Mil Lagrimas,,QZNJZ2166033,"[""FOLK"", ""LATINFOLK""]",190000.0
4,6bede082154d34fc18d9a6744bc95bf5,1637768000000.0,Sexo Humo y Alcohol,,QZNJZ2122551,"[""FOLK"", ""LATINFOLK""]",203000.0


In [10]:
# Merge on the track_id column
data = covers.merge(lyrics, on='track_id', how='left').merge(meta, on='track_id', how='left')

# dttm is a Unix timestamp in milliseconds; store it as a DD-MM-YYYY string
data['dttm'] = pd.to_datetime(data['dttm'], unit='ms').dt.strftime('%d-%m-%Y')
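
Because one track can have several lyric versions, a left join on `track_id` repeats the cover row once per lyric. The toy example below (made-up ids, not the hackathon data) illustrates how the merged frame can end up with more rows than `covers`, which is where rows differing only in `lyricId` come from.

```python
import pandas as pd

covers_demo = pd.DataFrame({
    "track_id": ["t1", "t2"],
    "track_remake_type": ["ORIGINAL", "COVER"],
})
lyrics_demo = pd.DataFrame({
    "track_id": ["t1", "t1"],          # two lyric versions of the same track
    "lyricId": ["l1", "l2"],
    "text": ["version one", "version two"],
})

merged = covers_demo.merge(lyrics_demo, on="track_id", how="left")

# t1 appears twice (one row per lyric); t2 has no lyrics, so its text is NaN.
rows_per_track = merged["track_id"].value_counts().to_dict()
missing_text = int(merged["text"].isna().sum())
```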

In [11]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 72571 entries, 0 to 72570
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   original_track_id  5378 non-null   object 
 1   track_id           72571 non-null  object 
 2   track_remake_type  72571 non-null  object 
 3   lyricId            11097 non-null  object 
 4   text               11097 non-null  object 
 5   dttm               72571 non-null  object 
 6   title              72571 non-null  object 
 7   language           22598 non-null  object 
 8   isrc               72242 non-null  object 
 9   genres             72571 non-null  object 
 10  duration           72571 non-null  float64
dtypes: float64(1), object(10)
memory usage: 6.6+ MB


In [12]:
# helper function for exploring the data
def research(data, name, figsize, silent):
    print(f'Data shape:      {data.shape}')
    print(f'Number of exact duplicates: {data.duplicated().sum()}')
    print(f'Missing values:           {data.isna().sum().sum()}')
    print('Missing data (percent):')
    logging.getLogger('matplotlib.font_manager').disabled = True
    print(round(data.isna().mean()*100).sort_values(ascending=False).head(15))
    display(data.head(3))
    if not silent:
        print(f'\nStructure check for {name}:')
        data.hist(linewidth=2, histtype='step', figsize=figsize)
        plt.suptitle(f'Distribution histogram: {name}', y=0.95, fontsize=12)
        plt.show()

        print()
        # correlation heatmap (numeric columns only; text columns cannot be correlated)
        plt.figure(figsize=(6, 6))
        corr = data.select_dtypes('number').corr()
        matrix = np.triu(corr)
        heatmap = sns.heatmap(corr, annot=True, fmt='.2g',
                          mask=matrix, square=True,
                          cmap='GnBu',  cbar=False,
                          xticklabels=True, yticklabels=True, vmin=0, vmax=1, center=0)
        plt.suptitle(f'Correlation matrix heatmap: {name}', y=0.90, fontsize=12)
        plt.show()
        print()
        display(data.describe())

In [13]:
research(data, 'the combined dataset', figsize=(13, 7), silent=True)

Data shape:      (72571, 11)
Number of exact duplicates: 0
Missing values:           240443
Missing data (percent):
original_track_id    93.0
lyricId              85.0
text                 85.0
language             69.0
track_id              0.0
track_remake_type     0.0
dttm                  0.0
title                 0.0
isrc                  0.0
genres                0.0
duration              0.0
dtype: float64


Unnamed: 0,original_track_id,track_id,track_remake_type,lyricId,text,dttm,title,language,isrc,genres,duration
0,eeb69a3cb92300456b6a5f4162093851,eeb69a3cb92300456b6a5f4162093851,ORIGINAL,260f21d9f48e8de874a6e844159ddf28,Left a good job in the city\nWorkin' for the m...,11-11-2009,Proud Mary,EN,USFI86900049,"[""ROCK"", ""ALLROCK""]",187220.0
1,eeb69a3cb92300456b6a5f4162093851,eeb69a3cb92300456b6a5f4162093851,ORIGINAL,f3331cf99637ee24559242d13d8cf259,Left a good job in the city\nWorkin' for the m...,11-11-2009,Proud Mary,EN,USFI86900049,"[""ROCK"", ""ALLROCK""]",187220.0
2,fe7ee8fc1959cc7214fa21c4840dff0a,fe7ee8fc1959cc7214fa21c4840dff0a,ORIGINAL,2498827bd11eca5846270487e4960080,Some folks are born made to wave the flag\nOoh...,11-11-2009,Fortunate Son,EN,USFI86900065,"[""ROCK"", ""ALLROCK""]",137780.0


In [14]:
#data.to_csv("data.csv", index=False)

**Conclusions:**
Initial analysis:
- The `dttm` timestamps were converted to dates right away.
- Number of exact duplicates: 0.
- Missing values: 240443 in total. Share of missing data by column (percent):
  - original_track_id    93.0
  - lyricId              85.0
  - text                 85.0
  - language             69.0
- Next steps for the missing values:
  - `text`: preprocess the lyrics;
  - `language`: can be inferred from the track title and lyrics.

- In `genres`, the quotation marks will be removed.

## **2. Exploratory analysis:**
  - Text preprocessing
  - Handling anomalies, missing values, and duplicates.

#### **Filling the gaps in 'language'**

In [15]:
%%time
tqdm.pandas()

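def detect_language(text):
    """Return the language code detected by langdetect, or pd.NA on failure."""
    try:
        if not pd.isna(text):
            return detect(text)
    except Exception:
        # detection can fail, e.g. when the text has no usable features
        pass
    return pd.NA

def detect_lang_fill(row):
    # Fill a missing language from the lyrics, falling back to the title
    if pd.isna(row['language']):
        if pd.notna(row['text']):
            row['language'] = detect_language(row['text'])
        elif pd.notna(row['title']):
            row['language'] = detect_language(row['title'])

    return row['language']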

data['language'] = data.progress_apply(detect_lang_fill, axis=1)

100%|██████████| 72571/72571 [14:15<00:00, 84.82it/s]

CPU times: user 13min 23s, sys: 5.33 s, total: 13min 28s
Wall time: 14min 15s
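
Detection took about 14 minutes here because `detect` runs once per row. One possible speed-up, sketched below under assumptions, is to detect each distinct string only once and reuse the result, which may help when titles or lyrics repeat. The `detector` argument is a stand-in for `langdetect.detect`, and the toy detector in the example is purely illustrative. Note also that langdetect describes its algorithm as non-deterministic on short or ambiguous input unless `DetectorFactory.seed` is set before the first call.

```python
def fill_language(rows, detector):
    """Fill missing 'language' from 'text' (or 'title' as a fallback),
    calling `detector` at most once per distinct string.
    Returns the number of detector calls made."""
    cache = {}
    calls = 0
    for row in rows:
        if row.get("language"):
            continue                      # language already known
        source = row.get("text") or row.get("title")
        if not source:
            continue                      # nothing to detect from
        if source not in cache:
            calls += 1
            try:
                cache[source] = detector(source)
            except Exception:
                cache[source] = None      # detection failed; leave the gap
        row["language"] = cache[source]
    return calls

def toy_detector(text):
    # Hypothetical stand-in: Cyrillic letters -> 'ru', anything else -> 'en'.
    return "ru" if any("а" <= ch <= "я" for ch in text.lower()) else "en"

rows = [
    {"language": None, "text": "proud mary keep on burnin", "title": "Proud Mary"},
    {"language": None, "text": "proud mary keep on burnin", "title": "Proud Mary"},
    {"language": None, "text": None, "title": "Не улетай"},
    {"language": "EN", "text": None, "title": "Arcade"},
]
detector_calls = fill_language(rows, toy_detector)
```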





In [16]:
data.tail(5)

Unnamed: 0,original_track_id,track_id,track_remake_type,lyricId,text,dttm,title,language,isrc,genres,duration
72566,4788e0bf61d80ef5ec9380aa8a8119d9,4788e0bf61d80ef5ec9380aa8a8119d9,ORIGINAL,,,28-09-2023,"Милый, прощай",ru,RUAGT2312928,"[""POP"", ""RUSPOP""]",178980.0
72567,,78b2db35476f134dc3cdfbf4d77ba034,COVER,,,01-10-2023,Habits (Stay Hight),EN,TCAHK2396284,"[""ELECTRONICS""]",149570.0
72568,,e720ff378efe032df56e0e656a6a92d3,COVER,,,05-10-2023,Arcade,EN,TCAHM2318975,"[""FOREIGNBARD"", ""BARD""]",201580.0
72569,554e33d79e258da91149c3a4985cf6a1,554e33d79e258da91149c3a4985cf6a1,ORIGINAL,,,05-10-2023,Май,bg,SMRUS0076417,"[""RUSRAP"", ""RAP""]",156870.0
72570,7b0f6ff24137be50cf5ea5f82d789448,7b0f6ff24137be50cf5ea5f82d789448,ORIGINAL,,,05-10-2023,Не улетай,ru,DGA0M2316512,"[""POP"", ""RUSPOP""]",148500.0


In [17]:
data['language'].value_counts() #.unique() #

EN    15866
en    13979
pt    10824
es     6688
de     2273
      ...  
YO        1
SA        1
MN        1
te        1
TG        1
Name: language, Length: 134, dtype: int64

#### **Text preprocessing**

In [19]:
data['text'][0]

"Left a good job in the city\nWorkin' for the man ev'ry night and day\nAnd I never lost one minute of sleepin'\nWorryin' 'bout the way things might have been\n\nBig wheel keep on turnin'\nProud Mary keep on burnin'\nRollin', rollin', rollin' on the river\n\nCleaned a lot of plates in Memphis\nPumped a lot of 'pane down in New Orleans\nBut I never saw the good side of a city\n'Til I hitched a ride on a riverboat queen\n\nBig wheel keep on turnin'\nProud Mary keep on burnin'\nRollin', rollin', rollin' on the river\n\nRollin', rollin', rollin' on the river\n\nIf you come down to the river\nBet you gonna find some people who live\nYou don't have to worry 'cause you have no money\nPeople on the river are happy to give\n\nBig wheel keep on turnin'\nProud Mary keep on burnin'\nRollin', rollin', rollin' on the river\nRollin', rollin', rollin' on the river\nRollin', rollin', rollin' on the river\nRollin', rollin', rollin' on"

In [20]:
data['text'][1]

"Left a good job in the city\nWorkin' for the man every night and day\nAnd I never lost one minute of sleepin'\nWorryin' 'bout the way things might have been\n\nBig wheel keep on turnin'\nProud Mary keep on burnin'\nRollin', rollin'\nRollin' on the river\n\nCleaned a lot of plates in Memphis\nPumped a lot of 'pane down in New Orleans\nBut I never saw the good side of the city\n'Til I hitched a ride on a river boat queen\n\nBig wheel keep on turnin'\nProud Mary keep on burnin'\nRollin', rollin' (rollin')\nRollin' on the river\n\nRollin', rollin'\nRollin' on the river\n\nIf you come down to the river\nBet you gonna find some people who live\nYou don't have to worry 'cause you have no money\nPeople on the river are happy to give\n\nBig wheel keep on turnin'\nProud Mary keep on burnin'\nRollin', rollin'\nRollin' on the river\n\nRollin', rollin' (roll, Lord)\nRollin' on the river\nRollin', rollin'\nRollin' on the river\nRollin', rollin'\nRollin' on the river"

In [21]:
data['text'][2]

'Some folks are born made to wave the flag\nOoh, they\'re red, white and blue\nAnd when the band plays "Hail to the Chief"\nOoh, they point the cannon at you, Lord\n\nIt ain\'t me, it ain\'t me\nI ain\'t no senator\'s son, son\nIt ain\'t me, it ain\'t me\nI ain\'t no fortunate one, no\n\nSome folks are born silver spoon in hand\nLord, don\'t they help themselves, Lord?\nBut when the taxman come to the door\nLord, the house lookin\' like a rummage sale, yeah\n\nIt ain\'t me, it ain\'t me\nI ain\'t no millionaire\'s son, no, no\nIt ain\'t me, it ain\'t me\nI ain\'t no fortunate one, no\n\nYeah-yeah, some folks inherit star-spangled eyes\nOoh, they send you down to war, Lord\nAnd when you ask \'em, "How much should we give?"\nHoo, they only answer, "More, more, more, more"\n\nIt ain\'t me, it ain\'t me\nI ain\'t no military son, son, Lord\nIt ain\'t me, it ain\'t me\nI ain\'t no fortunate one, one\n\nIt ain\'t me, it ain\'t me\nI ain\'t no fortunate one, no, no, no\nIt ain\'t me, it ain\'

In [22]:
# convert the text columns to lower case
data['text'] = data['text'].str.lower()
data['title'] = data['title'].str.lower()
data['genres'] = data['genres'].str.lower()
data['language'] = data['language'].str.lower()

# expand contracted forms in 'text'
# (note: astype(str) turns missing values into the string 'nan')
data['text'] = data['text'].astype(str)
data['text'] = data['text'].apply(lambda x: contractions.fix(x))

# normalise "ev'ry" and strip the quotation marks from 'genres'
data['text'] = data['text'].apply(lambda x: re.sub(r"ev'ry", 'every', x))
data['genres'] = [re.sub(r'"', '', genre) for genre in data['genres']]
data.head(3)

Unnamed: 0,original_track_id,track_id,track_remake_type,lyricId,text,dttm,title,language,isrc,genres,duration
0,eeb69a3cb92300456b6a5f4162093851,eeb69a3cb92300456b6a5f4162093851,ORIGINAL,260f21d9f48e8de874a6e844159ddf28,left a good job in the city\nworkin' for the m...,11-11-2009,proud mary,en,USFI86900049,"[rock, allrock]",187220.0
1,eeb69a3cb92300456b6a5f4162093851,eeb69a3cb92300456b6a5f4162093851,ORIGINAL,f3331cf99637ee24559242d13d8cf259,left a good job in the city\nworkin' for the m...,11-11-2009,proud mary,en,USFI86900049,"[rock, allrock]",187220.0
2,fe7ee8fc1959cc7214fa21c4840dff0a,fe7ee8fc1959cc7214fa21c4840dff0a,ORIGINAL,2498827bd11eca5846270487e4960080,some folks are born made to wave the flag\nooh...,11-11-2009,fortunate son,en,USFI86900065,"[rock, allrock]",137780.0


In [23]:
%%time
# Tokenise the text and join the tokens back with spaces; simple_preprocess
# drops newlines, punctuation, and tokens shorter than two characters
data['text'] = data['text'].apply(lambda x: ' '.join(simple_preprocess(x)))

CPU times: user 4.75 s, sys: 23.8 ms, total: 4.78 s
Wall time: 4.84 s


In [24]:
data['text'][0]

'left good job in the city workin for the man every night and day and never lost one minute of sleepin worryin bout the way things might have been big wheel keep on turnin proud mary keep on burnin rollin rollin rollin on the river cleaned lot of plates in memphis pumped lot of pane down in new orleans but never saw the good side of city til hitched ride on riverboat queen big wheel keep on turnin proud mary keep on burnin rollin rollin rollin on the river rollin rollin rollin on the river if you come down to the river bet you going to find some people who live you do not have to worry because you have no money people on the river are happy to give big wheel keep on turnin proud mary keep on burnin rollin rollin rollin on the river rollin rollin rollin on the river rollin rollin rollin on the river rollin rollin rollin on'

In [25]:
data['text'][1]

'left good job in the city workin for the man every night and day and never lost one minute of sleepin worryin bout the way things might have been big wheel keep on turnin proud mary keep on burnin rollin rollin rollin on the river cleaned lot of plates in memphis pumped lot of pane down in new orleans but never saw the good side of the city til hitched ride on river boat queen big wheel keep on turnin proud mary keep on burnin rollin rollin rollin rollin on the river rollin rollin rollin on the river if you come down to the river bet you going to find some people who live you do not have to worry because you have no money people on the river are happy to give big wheel keep on turnin proud mary keep on burnin rollin rollin rollin on the river rollin rollin roll lord rollin on the river rollin rollin rollin on the river rollin rollin rollin on the river'
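
After normalisation the two `Proud Mary` lyric versions differ in only a few tokens. One simple way to quantify that closeness, shown here purely as an illustration and not as the method used in this project, is the Jaccard similarity of the token sets. The short strings are excerpts of the preprocessed lyrics in this section; the function name `jaccard` is our own.

```python
def jaccard(text_a, text_b):
    """Jaccard similarity of the sets of whitespace-separated tokens."""
    tokens_a, tokens_b = set(text_a.split()), set(text_b.split())
    if not tokens_a and not tokens_b:
        return 1.0                         # two empty texts count as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

version_0 = "but never saw the good side of city til hitched ride on riverboat queen"
version_1 = "but never saw the good side of the city til hitched ride on river boat queen"
unrelated = "some folks are born made to wave the flag"

same_song = jaccard(version_0, version_1)    # two versions of one lyric
other_song = jaccard(version_0, unrelated)   # lyric of a different song
```

A set-based score ignores word order and repetition, so it is only a rough first filter; sentence embeddings are one alternative.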

In [26]:
data['text'][2]

'some folks are born made to wave the flag ooh they are red white and blue and when the band plays hail to the chief ooh they point the cannon at you lord it are not me it are not me are not no senator son son it are not me it are not me are not no fortunate one no some folks are born silver spoon in hand lord do not they help themselves lord but when the taxman come to the door lord the house lookin like rummage sale yeah it are not me it are not me are not no millionaire son no no it are not me it are not me are not no fortunate one no yeah yeah some folks inherit star spangled eyes ooh they send you down to war lord and when you ask them how much should we give hoo they only answer more more more more it are not me it are not me are not no military son son lord it are not me it are not me are not no fortunate one one it are not me it are not me are not no fortunate one no no no it are not me it are not me are not no fortunate son no no no it are not me it are not me'

In [27]:
data['text'][6]

'state of emergency yeah yeah yeah oh yeah oh oh oh yeah yeah yeah oh yeah oh oh oh remember the time baby yeah yeah yeah oh yeah oh oh oh are not got no money are not got no car to take you on date cannot even buy you flowers but together we will be the perfect soulmates talk to me girl oh baby it is alright now you are not got to flaunt for me if we go dutch you can still touch my love it is free we can work without the perks just you and me thug it out til we get it right baby if you strip you can get tip because like you just the way you are am about to strip and am well equipped can you handle me the way are do not need the or the car keys boy like you just the way you are let me see ya strip you can get tip because like like like are not got no visa are not got no red american express we cannot go nowhere exotic it do not matter because am the one that loves you best talk to me girl oh baby it is alright now you are not got to flaunt for me if we go dutch you can still touch my l

#### **Checking for implicit duplicates**

In [28]:
# Count implicit duplicates (rows that differ only in lyricId)
duplicate_count = data.drop('lyricId', axis=1).duplicated().sum()
print('Number of implicit duplicates', duplicate_count)
duplicates = data[data.drop('lyricId', axis=1).duplicated()]
display(duplicates['text'].head(2))

Number of implicit duplicates 281


7     in state of emergency are not got no money are...
34    see trees of green red roses too see them bloo...
Name: text, dtype: object

In [29]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 72571 entries, 0 to 72570
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   original_track_id  5378 non-null   object 
 1   track_id           72571 non-null  object 
 2   track_remake_type  72571 non-null  object 
 3   lyricId            11097 non-null  object 
 4   text               72571 non-null  object 
 5   dttm               72571 non-null  object 
 6   title              72571 non-null  object 
 7   language           72510 non-null  object 
 8   isrc               72242 non-null  object 
 9   genres             72571 non-null  object 
 10  duration           72571 non-null  float64
dtypes: float64(1), object(10)
memory usage: 8.7+ MB


In [30]:
# Drop the implicit duplicates
data = data.drop_duplicates(subset=data.columns.difference(['lyricId'])).reset_index(drop=True)
# Verify
duplicate_count = data.drop('lyricId', axis=1).duplicated().sum()
print('Number of implicit duplicates after processing:', duplicate_count)

Number of implicit duplicates after processing: 0


In [31]:
data['track_id'].duplicated().value_counts()

False    71597
True       693
Name: track_id, dtype: int64

In [32]:
data[data['track_id'] == '2bfb9427a1d97d16ab61ff31d6408870']

Unnamed: 0,original_track_id,track_id,track_remake_type,lyricId,text,dttm,title,language,isrc,genres,duration
72019,,2bfb9427a1d97d16ab61ff31d6408870,COVER,3525530bd73b2802420fd85c265ff6ab,хочу запомнить как смята постель как ты одевае...,15-06-2023,спектакль окончен,ru,AEA2Z2314296,"[punk, postpunk]",189480.0


In [33]:
print('Number of implicit duplicates by "track_id":', len(data[data.duplicated(subset=['track_id', 'track_remake_type'], keep='first')]))


Number of implicit duplicates by "track_id": 693
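
These 693 rows repeat a `track_id` because the same track has several lyric versions, and the row counts in the later output suggest they stay in the dataset. If one row per track were needed, one option (an assumption, not something done in this project) would be to keep the longest lyric for each track. The frame `demo` below is made-up data.

```python
import pandas as pd

demo = pd.DataFrame({
    "track_id": ["t1", "t1", "t2"],
    "lyricId": ["l1", "l2", "l3"],
    "text": ["short text", "a noticeably longer lyric text", "only version"],
})

# Sort so the longest text of each track comes first, then keep that row.
one_per_track = (
    demo.assign(text_len=demo["text"].str.len())
        .sort_values(["track_id", "text_len"], ascending=[True, False])
        .drop_duplicates(subset="track_id", keep="first")
        .drop(columns="text_len")
        .reset_index(drop=True)
)
```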


In [36]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72290 entries, 0 to 72289
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   original_track_id  5253 non-null   object 
 1   track_id           72290 non-null  object 
 2   track_remake_type  72290 non-null  object 
 3   lyricId            10816 non-null  object 
 4   text               72290 non-null  object 
 5   dttm               72290 non-null  object 
 6   title              72290 non-null  object 
 7   language           72229 non-null  object 
 8   isrc               71965 non-null  object 
 9   genres             72290 non-null  object 
 10  duration           72290 non-null  float64
dtypes: float64(1), object(10)
memory usage: 6.1+ MB


### **Filling the gaps**

In [37]:
data['language'] = data['language'].fillna(value='No')
data['title'] = data['title'].fillna(value='No')
data['duration'] = data['duration'].fillna(value=0)

In [38]:
research(data, 'the combined dataset', figsize=(13, 7), silent=True)

Data shape:      (72290, 11)
Number of exact duplicates: 0
Missing values:           128836
Missing data (percent):
original_track_id    93.0
lyricId              85.0
track_id              0.0
track_remake_type     0.0
text                  0.0
dttm                  0.0
title                 0.0
language              0.0
isrc                  0.0
genres                0.0
duration              0.0
dtype: float64


Unnamed: 0,original_track_id,track_id,track_remake_type,lyricId,text,dttm,title,language,isrc,genres,duration
0,eeb69a3cb92300456b6a5f4162093851,eeb69a3cb92300456b6a5f4162093851,ORIGINAL,260f21d9f48e8de874a6e844159ddf28,left good job in the city workin for the man e...,11-11-2009,proud mary,en,USFI86900049,"[rock, allrock]",187220.0
1,eeb69a3cb92300456b6a5f4162093851,eeb69a3cb92300456b6a5f4162093851,ORIGINAL,f3331cf99637ee24559242d13d8cf259,left good job in the city workin for the man e...,11-11-2009,proud mary,en,USFI86900049,"[rock, allrock]",187220.0
2,fe7ee8fc1959cc7214fa21c4840dff0a,fe7ee8fc1959cc7214fa21c4840dff0a,ORIGINAL,2498827bd11eca5846270487e4960080,some folks are born made to wave the flag ooh ...,11-11-2009,fortunate son,en,USFI86900065,"[rock, allrock]",137780.0


#### Conclusion:
- The texts were processed:
  - contracted verb forms in `text` were expanded;
  - unneeded characters were removed;
  - everything was converted to lower case.
- Implicit duplicates were checked in two ways: rows that differ only in `lyricId`, and rows that repeat a `track_id`:
  > 974 implicit duplicates were found in total: 281 rows differing only in `lyricId`, which were removed (72571 → 72290 rows), and 693 rows repeating a `track_id`, which are still present in the 72290-row dataset shown above.

- The lyrics in `text` were tokenised.
- Missing values were filled:
  - `language` from the track title and lyrics, the remainder with 'No';
  - `title` with 'No';
  - `duration` with 0.


### **3. Feature engineering:**
- Data analysis.
- Removing uninformative features and generating new ones where needed.

#### **Creating the embedding model**
There was not enough time to evaluate it: embeddings for the full dataset were never generated.

In [41]:
%%time
# Load the SentenceTransformer model
model = SentenceTransformer("ai-forever/FRED-T5-1.7B")
def embeddings(text):
    return model.encode(text)

.gitattributes:   0%|          | 0.00/1.48k [00:00<?, ?B/s]

README.md:   0%|          | 0.00/3.67k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/2.50k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/632 [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.27M [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/6.96G [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/360 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/640 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.71M [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


CPU times: user 22.6 s, sys: 25.5 s, total: 48.1 s
Wall time: 3min 44s


In [42]:
# take a small sample to check that the embedding model works
# (.copy() so that new columns are added to an independent frame, not a view of data)
data_cut = data.head(10).copy()

In [43]:
%%time

tqdm.pandas()
# Create embeddings for the 'title' column
data_cut['title_embeddings'] = data_cut['title'].progress_apply(embeddings)
# save the sample that actually holds the embeddings (data itself has none)
data_emb = data_cut.copy()
data_emb.to_csv("data_emb.csv", index=False)

  0%|          | 0/10 [00:00<?, ?it/s]Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
100%|██████████| 10/10 [00:06<00:00,  1.59it/s]


CPU times: user 7.38 s, sys: 76.2 ms, total: 7.46 s
Wall time: 8.1 s


In [44]:
%%time
tqdm.pandas()
# Create embeddings for the 'genres' column
data_cut['genre_embedding'] = data_cut['genres'].progress_apply(embeddings)
data_emb = data_cut.copy()
data_emb.to_csv("data_emb.csv", index=False)

100%|██████████| 10/10 [00:08<00:00,  1.18it/s]

CPU times: user 8.27 s, sys: 33.2 ms, total: 8.3 s
Wall time: 8.5 s





In [45]:
%%time
tqdm.pandas()
# Create a combined embedding for the 'genres' and 'track_id' columns
data_cut['genre_track_id_embedding'] = data_cut.progress_apply(lambda row: model.encode(row['genres'] + " " + row['track_id']), axis=1)

# Save the result
data_emb1 = data_cut.copy()
data_emb1.to_csv("data_emb1.csv", index=False)

100%|██████████| 10/10 [00:15<00:00,  1.59s/it]

CPU times: user 15.6 s, sys: 35.8 ms, total: 15.7 s
Wall time: 16.2 s
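
A caveat about saving embeddings with `to_csv`: each NumPy array is written as its text representation, so reading the file back yields strings rather than numeric vectors. A possible alternative, sketched with random stand-in vectors rather than real model output, is to stack the embeddings into one 2-D array and store it with `np.save`, keeping the `track_id` order as the row key. The file name and ids here are hypothetical.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)
track_ids = ["t1", "t2", "t3"]
# Stand-in for model.encode(...) results: three 8-dimensional vectors.
embeddings = [rng.normal(size=8).astype(np.float32) for _ in track_ids]

matrix = np.vstack(embeddings)            # shape: (n_tracks, dim)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "title_embeddings.npy")
    np.save(path, matrix)                 # binary format, values kept exactly
    restored = np.load(path)

# Row i of `restored` belongs to track_ids[i].
round_trip_ok = np.array_equal(matrix, restored)
```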





In [46]:
display(data_cut)

Unnamed: 0,original_track_id,track_id,track_remake_type,lyricId,text,dttm,title,language,isrc,genres,duration,title_embeddings,genre_embedding,genre_track_id_embedding
0,eeb69a3cb92300456b6a5f4162093851,eeb69a3cb92300456b6a5f4162093851,ORIGINAL,260f21d9f48e8de874a6e844159ddf28,left good job in the city workin for the man e...,11-11-2009,proud mary,en,USFI86900049,"[rock, allrock]",187220.0,"[0.0011266496, -0.0048668487, -0.0024569903, -...","[0.0013713714, 0.0013830342, -0.003086395, -0....","[0.006809471, -0.0025031648, -0.0055086534, 0...."
1,eeb69a3cb92300456b6a5f4162093851,eeb69a3cb92300456b6a5f4162093851,ORIGINAL,f3331cf99637ee24559242d13d8cf259,left good job in the city workin for the man e...,11-11-2009,proud mary,en,USFI86900049,"[rock, allrock]",187220.0,"[0.0011266496, -0.0048668487, -0.0024569903, -...","[0.0013713714, 0.0013830342, -0.003086395, -0....","[0.006809471, -0.0025031648, -0.0055086534, 0...."
2,fe7ee8fc1959cc7214fa21c4840dff0a,fe7ee8fc1959cc7214fa21c4840dff0a,ORIGINAL,2498827bd11eca5846270487e4960080,some folks are born made to wave the flag ooh ...,11-11-2009,fortunate son,en,USFI86900065,"[rock, allrock]",137780.0,"[0.009950027, -0.001544389, -0.006491204, -0.0...","[0.0013713714, 0.0013830342, -0.003086395, -0....","[0.008026071, -0.0034410378, -0.0005314903, -0..."
3,cd89fef7ffdd490db800357f47722b20,cd89fef7ffdd490db800357f47722b20,ORIGINAL,5237001311d4062bf2b80de30652bf58,uno por pobre feo hombre pero antoja ay ome te...,21-09-2009,la camisa negra,es,USUL10400965,"[pop, folk, latinfolk]",216840.0,"[0.0036356219, -0.001111055, -0.0058290297, -0...","[-0.002829399, -0.0003187269, -0.0030060736, -...","[0.0026712085, -0.002920405, -0.0038161918, 8...."
4,995665640dc319973d3173a74a03860c,995665640dc319973d3173a74a03860c,ORIGINAL,e5b1b57090b728e8d98d2b4d9b781bf4,yeah yeah remember the time baby yeah are not ...,16-11-2009,the way i are,en,USUM70722806,"[foreignrap, rap]",179660.0,"[0.0028019436, 0.00031179507, -0.0003042277, -...","[0.005633045, -0.0022621476, -0.0027199036, -0...","[0.005826189, 0.0011910216, -0.0025019816, -0...."
5,995665640dc319973d3173a74a03860c,995665640dc319973d3173a74a03860c,ORIGINAL,b6625d84706fefe8782e63bd36067bc2,in state of emergency are not got no money are...,16-11-2009,the way i are,en,USUM70722806,"[foreignrap, rap]",179660.0,"[0.0028019436, 0.00031179507, -0.0003042277, -...","[0.005633045, -0.0022621476, -0.0027199036, -0...","[0.005826189, 0.0011910216, -0.0025019816, -0...."
6,995665640dc319973d3173a74a03860c,995665640dc319973d3173a74a03860c,ORIGINAL,4b30eb13f54a1d83f34202ab8e8a3357,state of emergency yeah yeah yeah oh yeah oh o...,16-11-2009,the way i are,en,USUM70722806,"[foreignrap, rap]",179660.0,"[0.0028019436, 0.00031179507, -0.0003042277, -...","[0.005633045, -0.0022621476, -0.0027199036, -0...","[0.005826189, 0.0011910216, -0.0025019816, -0...."
7,,d6288499d0083cc34e60a077b7c4b3e1,COVER,,,17-09-2009,extraball,en,FR8Q10900116,[electronics],212620.0,"[-0.001989016, -0.004578747, -0.0076768906, -0...","[0.004979183, -0.0006452594, -0.00030280798, -...","[0.0027971673, -0.0046878983, -0.0041917684, -..."
8,,4da9d7b6d119db4d2d564a2197798380,COVER,58b6145f2fb180f8cdc2067b4f1baebd,cannot buy me love cannot buy me love cannot b...,17-09-2009,can't buy me love,en,USGR10110569,"[jazz, vocaljazz]",158950.0,"[0.002162083, -0.0088259755, -0.0088212555, 0....","[-0.00277486, 0.0009907121, -0.004430046, -0.0...","[-0.002666236, 0.001698053, -0.004355389, 0.00..."
9,,2bf283c05b601f21364d052ca0ec798d,COVER,eb38211a25c1320991c5a23ad2417f33,wednesday morning at five of the clock as the ...,17-09-2009,she's leaving home,en,USGR19900418,[jazz],356070.0,"[-0.0009148408, -0.0015381218, -0.006161625, -...","[-0.005207289, 0.00032963644, -0.0013126861, 0...","[0.0004017738, -0.004560808, -0.0024972449, -0..."


**Conclusion:**
- The model works as expected; all checks passed.

#### **Removing uninformative features**

In [47]:
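def get_data_info(data):
    # Show a random sample, the column summary and descriptive statistics
    display(data.sample(5))
    data.info()  # prints directly and returns None, so it is not wrapped in display()
    display(data.describe(include='all'))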

In [48]:
corpus = data[['title', 'language', 'duration', 'track_remake_type']]\
                      .drop_duplicates()\
                      .reset_index(drop=True)

get_data_info(corpus)

Unnamed: 0,title,language,duration,track_remake_type
45194,el trotamundo instrumental nicola di bari,ro,156130.0,COVER
45604,the unforgiven,en,384170.0,COVER
4204,sweet dreams (are made of this),en,207700.0,COVER
1122,stranglehold,en,506290.0,COVER
1280,sin,de,299890.0,COVER


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69969 entries, 0 to 69968
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   title              69969 non-null  object 
 1   language           69969 non-null  object 
 2   duration           69969 non-null  float64
 3   track_remake_type  69969 non-null  object 
dtypes: float64(1), object(3)
memory usage: 2.1+ MB



Unnamed: 0,title,language,duration,track_remake_type
count,69969,69969,69969.0,69969
unique,43984,93,,2
top,smooth criminal,en,,COVER
freq,84,28120,,65701
mean,,,205059.5,
std,,,85885.84,
min,,,0.0,
25%,,,161220.0,
50%,,,199240.0,
75%,,,239150.0,


In [49]:
corpus['track_remake_type'].unique()

array(['ORIGINAL', 'COVER'], dtype=object)

In [50]:
def codirovanie(text):
    # Binary target encoding: 1 for ORIGINAL, 0 for COVER
    return 1 if text == 'ORIGINAL' else 0

In [51]:
corpus['type'] = corpus['track_remake_type'].apply(codirovanie)

corpus = corpus.drop(['track_remake_type'], axis=1)

## **4. Model building and training:**
- Preparing the data for model training.
- Encoding and scaling the features (standardization), where needed.
- Splitting the full dataset into training and test sets.

### **Preparing the data for model training**

In [52]:
target = corpus['type']
features = corpus.drop(['type'], axis=1)

In [53]:
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.25, random_state=RANDOM_STATE, stratify=target
)

In [54]:
print('X_train.shape = ', X_train.shape, 'y_train.shape = ', y_train.shape)
print('X_test.shape = ', X_test.shape, 'y_test.shape = ', y_test.shape)

X_train.shape =  (52476, 3) y_train.shape =  (52476,)
X_test.shape =  (17493, 3) y_test.shape =  (17493,)
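
The shapes above follow from `test_size=0.25`. As a quick arithmetic check (a sketch, assuming scikit-learn rounds the test part up and the train part down), the split sizes and the share of `ORIGINAL` rows can be reproduced from the counts that `get_data_info(corpus)` printed earlier:

```python
import math

n_rows = 69969    # rows in corpus (from the info() output)
n_cover = 65701   # rows labelled COVER (freq of the top class in describe())
test_size = 0.25

n_test = math.ceil(n_rows * test_size)          # test part is rounded up
n_train = math.floor(n_rows * (1 - test_size))  # train part is rounded down
original_share = (n_rows - n_cover) / n_rows    # share of the positive class (type == 1)

print(n_train, n_test)           # 52476 17493
print(round(original_share, 3))  # 0.061
```

Only about 6% of the rows are originals, which is why the split is stratified on the target: both parts keep roughly the same class ratio.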


In [55]:
get_data_info(X_train)

Unnamed: 0,title,language,duration
42577,danzón laura,es,199360.0
49959,the poet and the muse,en,254690.0
49262,"kraid's lair (from ""metroid"")",en,204370.0
59562,born to be alive,en,202980.0
20157,numb,id,196620.0


<class 'pandas.core.frame.DataFrame'>
Int64Index: 52476 entries, 61729 to 11979
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   title     52476 non-null  object 
 1   language  52476 non-null  object 
 2   duration  52476 non-null  float64
dtypes: float64(1), object(2)
memory usage: 1.6+ MB



Unnamed: 0,title,language,duration
count,52476,52476,52476.0
unique,34868,87,
top,smooth criminal,en,
freq,61,21138,
mean,,,205064.3
std,,,85367.87
min,,,0.0
25%,,,161190.0
50%,,,199130.0
75%,,,239180.0


In [56]:
get_data_info(X_test)

Unnamed: 0,title,language,duration
66381,ghost,sk,144380.0
36616,antecedentes de culpa,es,170760.0
52009,njinek,en,88690.0
27344,home,en,318010.0
18420,medley navidad,es,154610.0


<class 'pandas.core.frame.DataFrame'>
Int64Index: 17493 entries, 46052 to 56350
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   title     17493 non-null  object 
 1   language  17493 non-null  object 
 2   duration  17493 non-null  float64
dtypes: float64(1), object(2)
memory usage: 546.7+ KB



Unnamed: 0,title,language,duration
count,17493,17493,17493.0
unique,13810,69,
top,smooth criminal,en,
freq,23,6982,
mean,,,205044.8
std,,,87423.71
min,,,0.0
25%,,,161300.0
50%,,,199700.0
75%,,,239020.0


In [57]:
# categorical features
cat_features = X_train.select_dtypes(include='object').columns.to_list()
cat_features

['title', 'language']

In [58]:
num_features = X_train.select_dtypes(exclude='object').columns.to_list()
num_features

['duration']

### **Model training:**
  - Data encoding
  - RandomForestClassifier
  - CatBoost
  - XGBClassifier

In [59]:
X_train_oe = X_train.copy()
X_test_oe = X_test.copy()

In [60]:
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
encoder.fit(X_train_oe[cat_features])
X_train_oe[cat_features] = encoder.transform(X_train_oe[cat_features])
X_test_oe[cat_features] = encoder.transform(X_test_oe[cat_features])
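
The encoder is fitted on the training split only, so any `title` or `language` value that first appears in the test split is mapped to `-1`. The following pure-Python sketch mimics that behaviour without calling scikit-learn; the helper names `fit_ordinal` and `transform_ordinal` and the sample titles are hypothetical and serve only as an illustration:

```python
def fit_ordinal(values):
    """Map each distinct training value to an integer code, in sorted order."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


def transform_ordinal(values, mapping, unknown_value=-1):
    """Encode values; anything not seen during fitting gets unknown_value."""
    return [mapping.get(value, unknown_value) for value in values]


train_titles = ["proud mary", "fortunate son", "numb"]  # hypothetical training titles
test_titles = ["numb", "ghost"]                         # "ghost" is absent from training

mapping = fit_ordinal(train_titles)
print(mapping)                                  # {'fortunate son': 0, 'numb': 1, 'proud mary': 2}
print(transform_ordinal(test_titles, mapping))  # [1, -1]
```

Every test title that is absent from the training split shares the same code, so the model cannot tell such titles apart.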

In [61]:
scaler=StandardScaler()
scaler.fit(X_train_oe[num_features])

X_train_oe[num_features]=scaler.transform(X_train_oe[num_features])
X_test_oe[num_features]=scaler.transform(X_test_oe[num_features])

#### RandomForestClassifier

In [62]:
model_rf = RandomForestClassifier()

param_grid = {
    'max_depth': [None] + [i for i in range(2, 7)],
    'min_samples_split': [2, 10],
    'min_samples_leaf': [2, 10],
    'n_estimators': [10, 50, 100],
}


grid_search_rf = GridSearchCV(model_rf, param_grid, cv=5,scoring='roc_auc')


grid_search_rf.fit(X_train_oe, y_train)
grid_search_rf.best_params_
grid_search_rf.best_score_


0.9171824359313794

#### CatBoostClassifier

In [63]:
model_cat =  CatBoostClassifier(random_state=RANDOM_STATE, verbose=0)
param_grid = {
    'learning_rate': [0.1, 0.3],
    'iterations': [50, 100],
    'l2_leaf_reg': [3, 9]
}

grid_search_cat = GridSearchCV(model_cat, param_grid, cv=5,scoring='roc_auc')


grid_search_cat.fit(X_train_oe, y_train)
grid_search_cat.best_params_
grid_search_cat.best_score_

0.9250154082875645

#### XGBClassifier

In [64]:
model_xgb = XGBClassifier(random_state=RANDOM_STATE, verbosity=0)  # XGBoost's logging parameter is verbosity, not verbose

param_grid = {
    'n_estimators': [10, 50, 100],
    'learning_rate': [0.1, 0.3],
    'max_depth': [None] + [i for i in range(2, 7)],

}

grid_search_xgb = GridSearchCV(model_xgb, param_grid, cv=5,scoring='roc_auc')


grid_search_xgb.fit(X_train_oe, y_train)
grid_search_xgb.best_params_
grid_search_xgb.best_score_

0.9261956631195233

In [65]:
result = pd.DataFrame(
    [grid_search_rf.best_score_, grid_search_cat.best_score_, grid_search_xgb.best_score_],
    index=['RandomForestClassifier', 'CatBoostClassifier', 'XGBClassifier'],
    columns=['roc_auc']
)
result

Unnamed: 0,roc_auc
RandomForestClassifier,0.917182
CatBoostClassifier,0.925015
XGBClassifier,0.926196
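
All three models are compared by ROC AUC. As a reminder of what the metric measures: it is the probability that a randomly chosen positive row (`ORIGINAL`) gets a higher score than a randomly chosen negative row (`COVER`), with ties counted as one half. Below is a pure-Python sketch on hypothetical toy scores; the function `roc_auc_pairwise` is illustrative and is not part of the notebook:

```python
def roc_auc_pairwise(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 2 positives, 3 negatives
y_true = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.4]
print(roc_auc_pairwise(y_true, scores))  # 0.75
```

Because the metric depends only on how the rows are ranked, it is insensitive to the decision threshold and to class imbalance. Threshold-based metrics such as F1 do not share this property.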


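Conclusion:
- Grid search with 5-fold cross-validation gave the following results:
  - **XGBClassifier** scored best:
    ```
    ROC_AUC: 0.926
    Best parameters:
    'n_estimators' = 100, 'learning_rate' = 0.1,
    'max_depth' = 7
    ```
  - The reported `max_depth` of 7 lies outside the grid shown above (`None` or 2 to 6), so these parameters presumably come from an earlier run with a wider grid.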

## 5. **Best model selection. Testing.**

In [66]:
clf = XGBClassifier(**grid_search_xgb.best_params_ )

model_xgb = clf.fit(X_train_oe, y_train)

In [67]:
# Evaluate the 'XGBClassifier' model on the test set


xgb_predict = model_xgb.predict(X_test_oe)
prediction_xgb = model_xgb.predict_proba(X_test_oe)[:,1]

roc_test = roc_auc_score(y_test, prediction_xgb)
accuracy_test = accuracy_score(y_test, xgb_predict)
f1_test = f1_score(y_test, xgb_predict)

print("roc_auc_score_test:", roc_test)
print("accuracy_score_test:", accuracy_test)
print("f1_score_test:", f1_test)

roc_auc_score_test: 0.9195607724558559
accuracy_score_test: 0.9534099354027326
f1_score_test: 0.5122681029323758
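
`predict` uses the default decision rule, which for a binary classifier amounts to a 0.5 probability cut-off. With a rare positive class, F1 often depends strongly on that cut-off. The sketch below uses hypothetical toy scores (not the model's output) to show how F1 can be computed for several thresholds applied to `predict_proba`-style scores; `f1_at_threshold` is an illustrative helper, not a library function:

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 for the positive class when scores >= threshold are predicted as 1."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: 2 positives among 10 rows
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.1, 0.25, 0.05, 0.2, 0.1, 0.15]

for threshold in (0.3, 0.4, 0.5):
    print(threshold, round(f1_at_threshold(y_true, scores, threshold), 3))
```

On this toy data the 0.4 threshold gives a higher F1 than the default 0.5. In practice the threshold should be chosen on a validation split or with cross-validation, not on the test set.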


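## **6. Conclusion:**
- On the test set, the best model, **XGBClassifier**, achieved:
    ```
    ROC_AUC: 0.920
    accuracy: 0.953
    f1: 0.512
    ```
- The test ROC AUC (0.920) is close to the cross-validated value on the training set (0.926), so the model behaves consistently and shows no sign of strong overfitting. The classes are heavily imbalanced (about 94% of the rows are `COVER`), so the high accuracy should be read with caution. The F1 score of 0.512 shows that predictions for the `ORIGINAL` class are noticeably less reliable.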
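
For context on the accuracy figure, it helps to compare it with a trivial baseline that always predicts the majority class `COVER`. The sketch below uses the class counts of the whole `corpus` from the `describe()` output. It assumes the stratified test split keeps roughly the same class ratio:

```python
n_rows = 69969                      # rows in corpus
n_cover = 65701                     # rows labelled COVER (majority class)
test_accuracy = 0.9534099354027326  # accuracy_score_test printed above

baseline_accuracy = n_cover / n_rows  # accuracy of always predicting COVER
print(round(baseline_accuracy, 3))                  # 0.939
print(round(test_accuracy - baseline_accuracy, 3))  # 0.014
```

The model improves on this baseline by about 1.4 percentage points, so ROC AUC and F1 are more informative than accuracy for this task.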