👕This notebook evaluates two classical ML algorithms (**naive Bayes and logistic regression**) on the task of classifying products by their descriptions.

🏆Best result (weighted F1 of **0.74**): logistic regression trained on TF-IDF features plus the categorical shop_id feature.

#Data preparation

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [None]:
!unzip -qq '/content/drive/MyDrive/internship_2023.zip'

In [None]:
import pandas as pd
from sklearn.metrics import f1_score, classification_report
from utils import get_description, RANDOM_STATE

Load the dataset and process the product descriptions.

In [None]:
df = pd.read_parquet('train.parquet').reset_index().dropna(subset=['category_id'])
# Series.map keeps df's index, so each description stays aligned with its row after dropna
df['description'] = df.text_fields.map(get_description)

Split the data into training and validation sets.

In [None]:
from sklearn.model_selection import train_test_split
df_train, df_val = train_test_split(df, random_state=RANDOM_STATE)

In [None]:
import numpy as np
def get_data(df):
  texts = np.array(df.description.fillna(''))
  shop_ids = np.array(df.shop_id.fillna(''))
  y = np.array(df.category_id)
  return texts, shop_ids, y

Create a class that converts the bag of words into TF-IDF features and one-hot encodes each product's shop_id.

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import OneHotEncoder
from scipy.sparse import hstack
class Vectorizer(BaseEstimator,TransformerMixin):
  def __init__(self, df):
    # map every shop_id seen in df to a consecutive integer code
    self.shop_unique = df.shop_id.unique()
    self.shop_id2categorical = dict(zip(map(lambda i: i.item(), self.shop_unique), range(len(self.shop_unique))))
    self.one_hot_enc = OneHotEncoder()
    self.tf_idf = TfidfTransformer()
    # the one-hot encoder is fitted on the integer codes up front
    self.one_hot_enc.fit(np.arange(len(self.shop_unique)).reshape(-1,1))
  def fit(self, X, y=None):
    # X is a (bag_of_words, shop_ids) pair; only the TF-IDF part needs fitting
    self.tf_idf.fit(X[0])
    return self
  def transform(self, X, y=None):
    bow, shop_ids = X
    shop_ids_vectorized = list(map(self.shop_id2categorical.get, shop_ids))
    one_hot_shop_id = self.one_hot_enc.transform(np.asarray(shop_ids_vectorized).reshape(-1,1))
    tf_idf = self.tf_idf.transform(bow)
    # concatenate the TF-IDF and one-hot columns into one sparse matrix
    return hstack((tf_idf, one_hot_shop_id))
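
As a point of comparison only, the same idea (TF-IDF columns stacked next to one-hot shop columns) can be sketched with scikit-learn's built-in `ColumnTransformer`. This is not the notebook's method: the toy DataFrame, the `toy` and `features` names, and the choice of `TfidfVectorizer` with its default tokenizer are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): three products from two shops
toy = pd.DataFrame({
    'description': ['red cotton shirt', 'blue denim jeans', 'red wool scarf'],
    'shop_id': [10, 20, 10],
})

features = ColumnTransformer([
    # a plain string selects a 1-D column, which is what text vectorizers expect
    ('tfidf', TfidfVectorizer(), 'description'),
    # a list selects a 2-D column block; unseen shops become all-zero rows
    ('shop', OneHotEncoder(handle_unknown='ignore'), ['shop_id']),
])

X_toy = features.fit_transform(toy)
print(X_toy.shape)  # 8 distinct words + 2 shops -> (3, 10)
```

With `handle_unknown='ignore'`, a shop_id that was absent during fitting does not raise an error at prediction time, which is one reason this variant may be worth considering.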

Initialize the objects used for feature extraction and feature selection.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectPercentile, chi2
import nltk
nltk.download('punkt')

#represents the text as a bag of words
count_vectorizer = CountVectorizer(lowercase=True, tokenizer=nltk.word_tokenize)
#converts the bag of words into TF-IDF features
tf_idf = TfidfTransformer()
#converts the bag of words into TF-IDF features and one-hot encodes the products' shop_id
vectorizer = Vectorizer(df)
#keeps the 20% of features that matter most for classification by the chi-squared criterion
selector = SelectPercentile(chi2, percentile=20)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
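
To make the chi-squared selection step concrete, here is a minimal sketch on toy counts. The arrays and the `toy_*` names are invented for illustration; only `SelectPercentile(chi2, percentile=20)` mirrors the notebook's configuration.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, chi2

# Toy count matrix (hypothetical): feature 0 appears only in class 0,
# the other four features are identical across both classes
toy_counts = np.array([
    [3, 1, 1, 1, 1],
    [3, 1, 1, 1, 1],
    [0, 1, 1, 1, 1],
    [0, 1, 1, 1, 1],
])
toy_labels = np.array([0, 0, 1, 1])

# Keep the top 20% of features by chi-squared score (1 of 5 here)
toy_selector = SelectPercentile(chi2, percentile=20)
toy_selected = toy_selector.fit_transform(toy_counts, toy_labels)

kept = toy_selector.get_support(indices=True)
print(kept, toy_selected.shape)
```

Only feature 0 should survive, because it is the only column whose counts depend on the class label.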


In [None]:
texts_train, shop_ids_train, y_train = get_data(df_train)
texts_val, shop_ids_val, y_val = get_data(df_val)

bow_train = count_vectorizer.fit_transform(texts_train)
bow_val = count_vectorizer.transform(texts_val)

tf_idf_train = tf_idf.fit_transform(bow_train)
tf_idf_val = tf_idf.transform(bow_val)

X_train = vectorizer.fit_transform((bow_train, shop_ids_train))
X_val = vectorizer.transform((bow_val, shop_ids_val))
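
The difference between the bag-of-words matrix and its TF-IDF counterpart can be seen on a toy corpus. The sketch below uses made-up texts and the default `CountVectorizer` tokenizer rather than the notebook's `nltk.word_tokenize`; it only illustrates that TF-IDF reweights the same columns and, with default settings, scales every row to unit L2 norm.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpus (hypothetical)
toy_texts = ['red shirt', 'red red scarf', 'blue jeans']

toy_bow = CountVectorizer().fit_transform(toy_texts)    # raw term counts
toy_tfidf = TfidfTransformer().fit_transform(toy_bow)   # reweighted, L2-normalized rows

# Euclidean norm of every row of the TF-IDF matrix
row_norms = np.sqrt(np.asarray(toy_tfidf.multiply(toy_tfidf).sum(axis=1)))
print(toy_bow.shape, toy_tfidf.shape, row_norms.ravel())
```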

#Naive Bayes

In [None]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

nb_pipeline = Pipeline([
        ('selector', selector),
        ('final_classifier', nb)
    ])
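
The cells below call `save_metrics`, whose definition does not appear in this notebook. The following is a purely hypothetical sketch of what such a helper might look like, assuming it writes a per-class report to a CSV named after the experiment; the `metrics_dir` parameter, the CSV layout, and the use of `zero_division=0` are all assumptions, not the original implementation.

```python
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.metrics import classification_report


def save_metrics(experiment_name, y_true, y_pred, metrics_dir='metrics'):
    """Hypothetical helper: write the report to <metrics_dir>/<experiment_name>.csv."""
    # zero_division=0 scores never-predicted classes as 0 without emitting warnings
    report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)
    # keep the per-class and averaged rows (dict values); 'accuracy' is a bare float
    rows = {label: scores for label, scores in report.items() if isinstance(scores, dict)}
    out_dir = Path(metrics_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f'{experiment_name}.csv'
    pd.DataFrame.from_dict(rows, orient='index').to_csv(path)
    print(f'Metrics saved to {path}')
    return path


# Toy usage in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_metrics('demo', [0, 1, 1, 2], [0, 1, 0, 2], metrics_dir=tmp_dir)
    saved = pd.read_csv(saved_path, index_col=0)
```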

Naive Bayes on the bag of words

In [None]:
experiment_name = f'naive_bayes_bow'
nb_pipeline.fit(bow_train, y_train)
predictions = nb_pipeline.predict(bow_val)
print(f'F1 for the "{experiment_name}" experiment: {f1_score(y_val, predictions, average="weighted")}')
save_metrics(experiment_name, y_val, predictions)

F1 for the "naive_bayes_bow" experiment: 0.5280035968327977
Metrics saved to /content/drive/MyDrive/KazanExpress/metrics/naive_bayes_bow.csv




Naive Bayes on TF-IDF

In [None]:
experiment_name = f'naive_bayes_tfidf'
nb_pipeline.fit(tf_idf_train, y_train)
predictions = nb_pipeline.predict(tf_idf_val)
print(f'F1 for the "{experiment_name}" experiment: {f1_score(y_val, predictions, average="weighted")}')
save_metrics(experiment_name, y_val, predictions)

F1 for the "naive_bayes_tfidf" experiment: 0.2981906279608247
Metrics saved to /content/drive/MyDrive/KazanExpress/metrics/naive_bayes_tfidf.csv




Naive Bayes on TF-IDF with shop_id



In [None]:
experiment_name = f'naive_bayes_tfidf_shop_id'
nb_pipeline.fit(X_train, y_train)
predictions = nb_pipeline.predict(X_val)
print(f'F1 for the "{experiment_name}" experiment: {f1_score(y_val, predictions, average="weighted")}')
save_metrics(experiment_name, y_val, predictions)

F1 for the "naive_bayes_tfidf_shop_id" experiment: 0.3615729104297685
Metrics saved to /content/drive/MyDrive/KazanExpress/metrics/naive_bayes_tfidf_shop_id.csv




In [None]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(solver='saga', verbose=True, n_jobs=-1)

logreg_pipeline = Pipeline([
        ('selector', selector),
        ('final_classifier', logreg)
    ])

#Logistic regression

Logistic regression on the bag of words

In [None]:
experiment_name = f'logistic_regression_bow'
logreg_pipeline.fit(bow_train, y_train)
predictions = logreg_pipeline.predict(bow_val)
print(f'F1 for the "{experiment_name}" experiment: {f1_score(y_val, predictions, average="weighted")}')
save_metrics(experiment_name, y_val, predictions)

F1 for the "logistic_regression_bow" experiment: 0.15841273839537479
Metrics saved to /content/drive/MyDrive/KazanExpress/metrics/logistic_regression_bow.csv




Logistic regression on TF-IDF

In [None]:
experiment_name = f'logistic_regression_tfidf'
logreg_pipeline.fit(tf_idf_train, y_train)
predictions = logreg_pipeline.predict(tf_idf_val)
print(f'F1 for the "{experiment_name}" experiment: {f1_score(y_val, predictions, average="weighted")}')
save_metrics(experiment_name, y_val, predictions)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 2 concurrent workers.


convergence after 19 epochs took 376 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  6.3min finished


F1 for the "logistic_regression_tfidf" experiment: 0.7255188172878155
Metrics saved to /content/drive/MyDrive/KazanExpress/metrics/logistic_regression_tfidf.csv




Logistic regression on TF-IDF with shop_id

In [None]:
experiment_name = f'logistic_regression__tfidf_shop_id'
logreg_pipeline.fit(X_train, y_train)
predictions = logreg_pipeline.predict(X_val)
print(f'F1 for the "{experiment_name}" experiment: {f1_score(y_val, predictions, average="weighted")}')
save_metrics(experiment_name, y_val, predictions)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 2 concurrent workers.


convergence after 22 epochs took 413 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  6.9min finished


F1 for the "logistic_regression__tfidf_shop_id" experiment: 0.741626367750366
Metrics saved to /content/drive/MyDrive/KazanExpress/metrics/logistic_regression__tfidf_shop_id.csv




#Conclusion

**The naive Bayes classifier** trains quickly (a few seconds) but is less accurate than logistic regression. Interestingly, its weighted F1 on the bag of words (0.53) is much higher than on TF-IDF (0.30). One possible explanation is that TF-IDF violates the model's assumption that the features are uncorrelated to a greater degree.

**Logistic regression** takes longer to train: more than 40 minutes on the bag of words and 7-8 minutes on TF-IDF. On the bag of words the quality is also very low (0.16), so these features are apparently not enough to train the model.

**The best result (0.74) came from logistic regression trained on TF-IDF features together with shop_id.**
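
All scores quoted here come from `f1_score(..., average="weighted")`. As a rough illustration of what that average means, the sketch below recomputes it by hand on toy labels: each class's F1 is weighted by its number of true samples. The `weighted_f1` function and the toy labels are hypothetical and are not part of the notebook.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighting each class by its support."""
    support = Counter(y_true)  # number of true samples per class
    total = 0.0
    for label, n_true in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n_true - tp
        f1 = 2 * tp / (2 * tp + fp + fn)  # equals 0 when the class is never predicted correctly
        total += n_true * f1
    return total / len(y_true)


toy_true = [0, 0, 0, 1, 1, 2]
toy_pred = [0, 0, 1, 1, 1, 1]
score = weighted_f1(toy_true, toy_pred)
print(round(score, 4))  # prints 0.6222
```

Class 2 is never predicted in this toy example, so it contributes an F1 of 0, which is the same situation as the categories with 0.00 scores in the classification report below.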

In [None]:
print(classification_report(y_val, predictions))

              precision    recall  f1-score   support

        2599       0.75      0.94      0.83        32
        2600       0.00      0.00      0.00         3
        2601       0.40      0.38      0.39        21
        2602       0.62      0.42      0.50        12
        2605       0.00      0.00      0.00        10
        2725       0.94      0.85      0.89        20
        2726       0.79      0.83      0.81        18
        2727       1.00      0.04      0.08        25
        2728       0.77      0.77      0.77        43
        2729       0.86      0.75      0.80        16
        2730       0.90      0.93      0.92        29
        2731       1.00      0.83      0.91        12
        2732       1.00      1.00      1.00         2
        2733       1.00      0.86      0.92        35
        2735       0.91      1.00      0.95        20
        2736       0.00      0.00      0.00         3
        2737       1.00      0.22      0.36        18
        2738       1.00    

