## News classification: [AG's News Topic Classification Dataset](https://github.com/mhjabreel/CharCnn_Keras/tree/master/data/ag_news_csv)

Three neural network architectures are used:
- One-dimensional convolutional neural network (CNN)
- LSTM recurrent neural network
- GRU recurrent neural network

To speed up training, it is recommended to enable a GPU (Runtime -> Change Runtime Type -> Hardware Accelerator -> GPU).

In [1]:
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
os.environ["TF_DETERMINISTIC_OPS"] = "1"
os.environ["PYTHONHASHSEED"] = "42"

def set_global_seed(seed: int = 42):
    import random
    import numpy as np
    import tensorflow as tf

    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)

    print(f"[INFO] Global seed fixed to {seed}")

set_global_seed(42)

[INFO] Global seed fixed to 42


In [2]:
import re
import os
from time import perf_counter as timer
os.environ['TF_CPP_MIN_LOG_LEVEL'] = "2"

import nltk
from tqdm.autonotebook import tqdm, trange
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
from tensorflow.keras import utils
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.callbacks import ModelCheckpoint
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline




In [3]:
# Enable to see which operations are assigned to the GPU
tf.debugging.set_log_device_placement(False)  # Set to True for detailed logs
tf.get_logger().setLevel("ERROR")

# Confirm GPU is available
print(tf.config.list_physical_devices('GPU'))
print(tf.config.list_logical_devices())

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
[LogicalDevice(name='/device:CPU:0', device_type='CPU'), LogicalDevice(name='/device:GPU:0', device_type='GPU')]


I0000 00:00:1764625938.563256    8160 gpu_device.cc:2020] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 3584 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3060 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6


In [4]:
# Maximum number of unique words (vocabulary size)
num_words = 10000
# Maximum news length (in tokens)
max_news_len = 30
# Number of news classes
nb_classes = 4

## Loading the dataset

Download the training data

In [5]:
# !wget https://github.com/mhjabreel/CharCnn_Keras/raw/master/data/ag_news_csv/train.csv -O train.csv

Download the test data

In [6]:
# !wget https://github.com/mhjabreel/CharCnn_Keras/raw/master/data/ag_news_csv/test.csv -O test.csv

Download the class names

In [7]:
!wget https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/classes.txt -O classes.txt

--2025-12-02 00:52:18--  https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/classes.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31 [text/plain]
Saving to: 'classes.txt'


2025-12-02 00:52:19 (1.44 MB/s) - 'classes.txt' saved [31/31]



## Inspecting the data

In [8]:
!ls

classes.txt		lab_4_ktuner.ipynb   lab_5.ipynb	   train.csv
experiment_results.csv	lab_4_ktuner2.ipynb  nlp_class_news.ipynb  wget-log
lab_3.ipynb		lab_4_manual.ipynb   test.csv


In [9]:
!cat classes.txt

World
Sports
Business
Sci/Tech


In [10]:
!head train.csv

"3","Wall St. Bears Claw Back Into the Black (Reuters)","Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again."
"3","Carlyle Looks Toward Commercial Aerospace (Reuters)","Reuters - Private investment firm Carlyle Group,\which has a reputation for making well-timed and occasionally\controversial plays in the defense industry, has quietly placed\its bets on another part of the market."
"3","Oil and Economy Cloud Stocks' Outlook (Reuters)","Reuters - Soaring crude prices plus worries\about the economy and the outlook for earnings are expected to\hang over the stock market next week during the depth of the\summer doldrums."
"3","Iraq Halts Oil Exports from Main Southern Pipeline (Reuters)","Reuters - Authorities have halted oil export\flows from the main pipeline in southern Iraq after\intelligence showed a rebel militia could strike\infrastructure, an oil official said on Saturday."
"3","Oil prices soar to all-time record, posing new menace to 

In [11]:
!head test.csv

"3","Fears for T N pension after talks","Unions representing workers at Turner   Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul."
"4","The Race is On: Second Private Team Sets Launch Date for Human Spaceflight (SPACE.com)","SPACE.com - TORONTO, Canada -- A second\team of rocketeers competing for the  #36;10 million Ansari X Prize, a contest for\privately funded suborbital space flight, has officially announced the first\launch date for its manned rocket."
"4","Ky. Company Wins Grant to Study Peptides (AP)","AP - A company founded by a chemistry researcher at the University of Louisville won a grant to develop a method of producing better peptides, which are short chains of amino acids, the building blocks of proteins."
"4","Prediction Unit Helps Forecast Wildfires (AP)","AP - It's barely dawn when Mike Fitzpatrick starts his shift with a blur of colorful maps, figures and endless charts, but already he knows what the day will bring. Lightning 

In [12]:
!wc -l train.csv
!wc -l test.csv

120000 train.csv
7600 test.csv


## Loading the data into memory

Read the data from the file

In [13]:
train = pd.read_csv('train.csv',
                    header=None,
                    names=['class', 'title', 'text'])

In [14]:
train

| | class | title | text |
|---|---|---|---|
| 0 | 3 | Wall St. Bears Claw Back Into the Black (Reuters) | Reuters - Short-sellers, Wall Street's dwindli... |
| 1 | 3 | Carlyle Looks Toward Commercial Aerospace (Reu... | Reuters - Private investment firm Carlyle Grou... |
| 2 | 3 | Oil and Economy Cloud Stocks' Outlook (Reuters) | Reuters - Soaring crude prices plus worries\ab... |
| 3 | 3 | Iraq Halts Oil Exports from Main Southern Pipe... | Reuters - Authorities have halted oil export\f... |
| 4 | 3 | Oil prices soar to all-time record, posing new... | AFP - Tearaway world oil prices, toppling reco... |
| ... | ... | ... | ... |
| 119995 | 1 | Pakistan's Musharraf Says Won't Quit as Army C... | KARACHI (Reuters) - Pakistani President Perve... |
| 119996 | 2 | Renteria signing a top-shelf deal | Red Sox general manager Theo Epstein acknowled... |
| 119997 | 2 | Saban not going to Dolphins yet | The Miami Dolphins will put their courtship of... |
| 119998 | 2 | Today's NFL games | PITTSBURGH at NY GIANTS Time: 1:30 p.m. Line: ... |


Extract the training texts

In [15]:
news = train['text']
# news = train['title'] + ' ' + train['text']

In [16]:
news[:5]

0    Reuters - Short-sellers, Wall Street's dwindli...
1    Reuters - Private investment firm Carlyle Grou...
2    Reuters - Soaring crude prices plus worries\ab...
3    Reuters - Authorities have halted oil export\f...
4    AFP - Tearaway world oil prices, toppling reco...
Name: text, dtype: object

Extract the correct answers (labels)

In [17]:
# Convert the integer class vector into a binary (one-hot) matrix
# Arguments: the class vector from the data and the total number of classes
y_train = utils.to_categorical(train['class'] - 1, nb_classes)

In [18]:
y_train

array([[0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       ...,
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.]], shape=(120000, 4))
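
As a rough illustration of what the conversion above does, the sketch below reproduces one-hot encoding with plain NumPy. The helper name `one_hot` and the sample labels are assumptions made for this example; the notebook itself relies on `utils.to_categorical`.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Hypothetical helper: turn 0-based integer labels into a one-hot matrix."""
    labels = np.asarray(labels, dtype=int)
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes, dtype="float32")[labels]

# AG News labels run from 1 to 4, so subtract 1 to get 0-based class indices
raw_labels = np.array([3, 3, 2, 1, 4])
encoded = one_hot(raw_labels - 1, num_classes=4)
print(encoded)
```

Each row contains a single 1 in the column of its class, which is the target format that the `categorical_crossentropy` loss expects.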

## Text tokenization

In [19]:
news[:5]

0    Reuters - Short-sellers, Wall Street's dwindli...
1    Reuters - Private investment firm Carlyle Grou...
2    Reuters - Soaring crude prices plus worries\ab...
3    Reuters - Authorities have halted oil export\f...
4    AFP - Tearaway world oil prices, toppling reco...
Name: text, dtype: object

Create the Keras tokenizer

<img src="https://habrastorage.org/r/w1560/getpro/habr/upload_files/bc9/feb/314/bc9feb3143af4759aceff629305cf8ae.png">

How the TextVectorization layer works

In [20]:
vectorize_layer = tf.keras.layers.TextVectorization(max_tokens=num_words,
    output_mode='int',
    output_sequence_length=max_news_len)

In [21]:
vectorize_layer.adapt(news)
vectorize_layer.get_vocabulary()

['',
 '[UNK]',
 np.str_('the'),
 np.str_('a'),
 np.str_('to'),
 np.str_('of'),
 np.str_('in'),
 np.str_('and'),
 np.str_('on'),
 np.str_('for'),
 np.str_('that'),
 np.str_('39s'),
 np.str_('with'),
 np.str_('as'),
 np.str_('its'),
 np.str_('at'),
 np.str_('is'),
 np.str_('said'),
 np.str_('by'),
 np.str_('it'),
 np.str_('has'),
 np.str_('new'),
 np.str_('an'),
 np.str_('from'),
 np.str_('his'),
 np.str_('us'),
 np.str_('will'),
 np.str_('was'),
 np.str_('reuters'),
 np.str_('after'),
 np.str_('have'),
 np.str_('be'),
 np.str_('their'),
 np.str_('are'),
 np.str_('over'),
 np.str_('ap'),
 np.str_('he'),
 np.str_('but'),
 np.str_('two'),
 np.str_('first'),
 np.str_('this'),
 np.str_('more'),
 np.str_('monday'),
 np.str_('wednesday'),
 np.str_('tuesday'),
 np.str_('thursday'),
 np.str_('company'),
 np.str_('up'),
 np.str_('friday'),
 np.str_('inc'),
 np.str_('one'),
 np.str_('world'),
 np.str_('yesterday'),
 np.str_('they'),
 np.str_('last'),
 np.str_('york'),
 np.str_('against'),
 np.str_
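
The principle behind the vocabulary shown above can be sketched in plain Python: lowercase the text, strip punctuation, split on whitespace, rank words by frequency, and reserve index 0 for padding and index 1 for unknown words. This is a simplified approximation, not the actual `TextVectorization` implementation; the function names `standardize`, `build_vocab`, and `encode` and the toy corpus are assumptions for illustration.

```python
import re
from collections import Counter

def standardize(text):
    # Lowercase and drop punctuation, roughly like the layer's default standardization
    return re.sub(r"[^\w\s]", "", text.lower())

def build_vocab(texts, max_tokens):
    counts = Counter(word for t in texts for word in standardize(t).split())
    # Index 0 is reserved for padding, index 1 for out-of-vocabulary words
    most_common = [w for w, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def encode(text, vocab, seq_len):
    index = {w: i for i, w in enumerate(vocab)}
    # Unknown words map to index 1; long texts are cut at the end
    ids = [index.get(w, 1) for w in standardize(text).split()][:seq_len]
    return ids + [0] * (seq_len - len(ids))  # pad with zeros at the end

corpus = ["The oil price rose.", "The oil market fell, the price too."]
vocab = build_vocab(corpus, max_tokens=5)
encoded = encode("The gold price rose", vocab, seq_len=6)
print(vocab)
print(encoded)
```

Stripping punctuation before splitting is also why a token such as `39s` appears in the real vocabulary: the HTML entity residue `#39;s` loses its punctuation characters and collapses into one word.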

In [22]:
tokenizer = Tokenizer(num_words=num_words)

Fit the tokenizer on the news texts

In [23]:
tokenizer.fit_on_texts(news)

Inspect the tokenizer vocabulary

In [24]:
tokenizer.word_index

{'the': 1,
 'a': 2,
 'to': 3,
 'of': 4,
 'in': 5,
 'and': 6,
 'on': 7,
 'for': 8,
 '39': 9,
 's': 10,
 'that': 11,
 'with': 12,
 'as': 13,
 'its': 14,
 'at': 15,
 'said': 16,
 'is': 17,
 'by': 18,
 'it': 19,
 'has': 20,
 'new': 21,
 'an': 22,
 'from': 23,
 'reuters': 24,
 'his': 25,
 'will': 26,
 'was': 27,
 'after': 28,
 'have': 29,
 'be': 30,
 'their': 31,
 'two': 32,
 'are': 33,
 'us': 34,
 'over': 35,
 'quot': 36,
 'year': 37,
 'first': 38,
 'ap': 39,
 'he': 40,
 'but': 41,
 'gt': 42,
 'lt': 43,
 'this': 44,
 'more': 45,
 'monday': 46,
 'wednesday': 47,
 'one': 48,
 'tuesday': 49,
 'up': 50,
 'thursday': 51,
 'company': 52,
 'inc': 53,
 'friday': 54,
 'world': 55,
 'than': 56,
 'u': 57,
 '1': 58,
 'last': 59,
 'they': 60,
 'york': 61,
 'yesterday': 62,
 'against': 63,
 'about': 64,
 'who': 65,
 'not': 66,
 'were': 67,
 'into': 68,
 'out': 69,
 'three': 70,
 'been': 71,
 'president': 72,
 '2': 73,
 'had': 74,
 'million': 75,
 'corp': 76,
 'oil': 77,
 'when': 78,
 'week': 79,
 'time'

Convert the news texts to a numeric representation

In [25]:
sequences = tokenizer.texts_to_sequences(news)

Inspect the news in numeric form

In [26]:
index = 1
print(news[index])
print(sequences[index])

Reuters - Private investment firm Carlyle Group,\which has a reputation for making well-timed and occasionally\controversial plays in the defense industry, has quietly placed\its bets on another part of the market.
[24, 863, 751, 371, 93, 84, 20, 2, 3916, 8, 453, 431, 6, 1308, 2799, 5, 1, 549, 237, 20, 3528, 2002, 14, 8267, 7, 216, 314, 4, 1, 131]


In [27]:
tokenizer.word_index['reuters']

24

Limit the text length

In [28]:
x_train = pad_sequences(sequences, maxlen=max_news_len)

In [29]:
x_train[:5]

array([[   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,   24,  758, 7851,  433,
        5786, 2861,    4, 5916,   33, 3642,  831,  432],
       [  24,  863,  751,  371,   93,   84,   20,    2, 3916,    8,  453,
         431,    6, 1308, 2799,    5,    1,  549,  237,   20, 3528, 2002,
          14, 8267,    7,  216,  314,    4,    1,  131],
       [  24, 2199,  463,  105, 1568, 1484,   64,    1,  397,    6,    1,
        1026,    8,  317,   33,  178,    3, 6377,   35,    1,  311,  131,
          99,   79,  189,    1, 6120,    4,    1, 1068],
       [   0,   24,  713,   29, 5142,   77, 3549, 7993,   23,    1,  737,
        3199,    5,  493,  106,   28, 1402,  573,    2,  825, 2601,   90,
         760, 2559,   22,   77,  292,   16,    7,   97],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,  165,
          55,   77,  105, 8851, 1776,    6, 8268, 3454,    2,   21,  343,
        3036,   70,  266,  151, 
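
By default `pad_sequences` both pads and truncates at the beginning of a sequence (`padding='pre'`, `truncating='pre'`), which is why the zeros in the output above sit on the left. The sketch below mimics that default behavior in pure Python; `pad_pre` is a hypothetical helper written for this illustration and assumes a positive `maxlen`.

```python
def pad_pre(sequences, maxlen, value=0):
    """Hypothetical pure-Python analogue of pad_sequences with its default 'pre' modes."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the LAST maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)  # fill with zeros on the left
    return padded

result = pad_pre([[24, 758, 7851], [1, 2, 3, 4, 5, 6, 7]], maxlen=5)
print(result)
```

Note that truncating from the front discards the beginning of long news items. If the opening words matter more, `pad_sequences(..., truncating='post')` keeps the start of the text instead.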

## Loading the test dataset

In [30]:
test = pd.read_csv('test.csv',
                    header=None,
                    names=['class', 'title', 'text'])

In [31]:
test

| | class | title | text |
|---|---|---|---|
| 0 | 3 | Fears for T N pension after talks | Unions representing workers at Turner Newall... |
| 1 | 4 | The Race is On: Second Private Team Sets Launc... | SPACE.com - TORONTO, Canada -- A second\team o... |
| 2 | 4 | Ky. Company Wins Grant to Study Peptides (AP) | AP - A company founded by a chemistry research... |
| 3 | 4 | Prediction Unit Helps Forecast Wildfires (AP) | AP - It's barely dawn when Mike Fitzpatrick st... |
| 4 | 4 | Calif. Aims to Limit Farm-Related Smog (AP) | AP - Southern California's smog-fighting agenc... |
| ... | ... | ... | ... |
| 7595 | 1 | Around the world | Ukrainian presidential candidate Viktor Yushch... |
| 7596 | 2 | Void is filled with Clement | With the supply of attractive pitching options... |
| 7597 | 2 | Martinez leaves bitter | Like Roger Clemens did almost exactly eight ye... |
| 7598 | 3 | 5 of arthritis patients in Singapore take Bext... | SINGAPORE : Doctors in the United States have ... |


Convert the news texts to a numeric representation

Note that you must use the tokenizer that was fitted on the train dataset.

In [32]:
test_sequences = tokenizer.texts_to_sequences(test['text'])
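
The reason for reusing the train tokenizer is that its word indices define the model's input vocabulary: refitting on the test set would assign different numbers to the same words. The sketch below imitates this in pure Python. It assumes the behavior of a Keras `Tokenizer` created without `oov_token`, which skips words it has not seen, and the helper names and toy texts are made up for the example.

```python
from collections import Counter

def fit_word_index(texts):
    counts = Counter(w for t in texts for w in t.lower().split())
    # The most frequent word gets index 1; index 0 stays free for padding
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

def texts_to_ids(texts, word_index, num_words):
    # Words that are unknown, or ranked at num_words or beyond, are dropped
    return [[word_index[w] for w in t.lower().split()
             if w in word_index and word_index[w] < num_words]
            for t in texts]

train_texts = ["oil prices rise", "oil exports halted"]
test_texts = ["oil prices plunge"]

word_index = fit_word_index(train_texts)
test_ids = texts_to_ids(test_texts, word_index, num_words=10)
print(test_ids)
```

The word `plunge` never occurred in the training texts, so it simply disappears from the encoded test sequence, while `oil` and `prices` keep the indices they had during training.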

In [33]:
x_test = pad_sequences(test_sequences, maxlen=max_news_len)

In [34]:
x_test[:5]

array([[   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0, 2020, 3371,  449,   15, 6956,  252,   60,   33,
          28,  289,   12, 9088, 2184,  371,  169, 9284],
       [  92,  119,    4, 3472,    8,    1,  402,  134,   75, 5471, 1516,
        1227,    2, 3436,    8, 2744, 5220,  230,  936,   20, 2148,  117,
           1,   38,  561, 1974,    8,   14, 3957, 1218],
       [   2,   52, 5787,   18,    2, 4914,   15,    1,  525,    4, 4586,
         227,    2, 3922,    3, 1316,    2, 6888,    4, 2893,  576,   84,
          33,  758, 6515,    4,    1, 1013, 5154,    4],
       [9642, 1291,    6, 8714,   41,  671,   40, 2977,  177,    1,  108,
          26, 1027, 7300,   26,  760,    5, 3356,   40, 1271, 2450,   26,
        2170,   50, 3356,   26, 7415,    6, 9153,   26],
       [   0,    0,    0,    0,    0,    0,   39,  493, 7343,  832,  400,
         820,   28, 3827,    4,    1, 4474,   54, 9411,    1, 1273,   38,
        1230,    3, 1365,  348, 

Correct answers (labels)

In [35]:
y_test = utils.to_categorical(test['class'] - 1, nb_classes)

In [36]:
y_test

array([[0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.],
       ...,
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.]], shape=(7600, 4))

## Assignment

1. Try to improve the models' predictions by using not only the news text but also the headline.
In that case, it is recommended to increase the maximum text length. <br/>
2. Check how text preprocessing (removing stop words and punctuation, lemmatization, etc.) affects prediction quality.
3. Apply regularization methods to reduce model overfitting.
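
One technique that fits item 3 is early stopping: training halts once the validation metric has not improved for a given number of epochs (the patience), and the weights from the best epoch are kept. The sketch below imitates that patience logic in plain Python. It is not the Keras `EarlyStopping` implementation, and the function name and accuracy values are invented for illustration.

```python
def early_stopping_epoch(val_scores, patience):
    """Return (stop_epoch, best_epoch), both 0-based, for a metric to maximize."""
    best_score = float("-inf")
    best_epoch = -1
    waited = 0
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            # Improvement: remember it and reset the patience counter
            best_score, best_epoch, waited = score, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch  # stop training here
    return len(val_scores) - 1, best_epoch  # ran through every epoch

# Made-up validation accuracies, for illustration only
val_acc = [0.88, 0.90, 0.91, 0.905, 0.90, 0.899, 0.91, 0.89]
stop_epoch, best_epoch = early_stopping_epoch(val_acc, patience=3)
print(stop_epoch, best_epoch)
```

A larger patience tolerates longer plateaus at the cost of extra epochs; a patience longer than the run itself never triggers a stop.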


## Stop words and punctuation

**Stop words** are words that occur frequently in almost any text and carry no useful information about a specific document. For the model they are just noise, and noise should be removed. Punctuation is removed for the same reason.

In [37]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/vad1mchk/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/vad1mchk/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/vad1mchk/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [38]:
# import stop words from the nltk library
from nltk.corpus import stopwords

# count the stop words available for English
print(len(stopwords.words('english')))

198


Punctuation marks are best imported from the `string` module, which provides various constants for working with strings (punctuation, alphabet, etc.).

In [39]:
from string import punctuation
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'



Combine the stop words and punctuation marks and store them in the variable noise:


In [40]:
noise = stopwords.words('english') + list(punctuation)
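
A minimal sketch of how such a noise collection can be applied to a tokenized text. To stay self-contained it uses a tiny hand-written stop-word set instead of the nltk list, and the names `toy_stopwords`, `noise_set`, and `remove_noise` are assumptions for this example. A set is used rather than a list because membership checks on a set take constant time on average.

```python
from string import punctuation

# Tiny stand-in stop-word list (an assumption; the notebook uses nltk's list)
toy_stopwords = {"the", "a", "of", "and", "to", "in"}
noise_set = toy_stopwords | set(punctuation)

def remove_noise(tokens, noise):
    # Compare in lowercase so that "The" is matched by "the"
    return [tok for tok in tokens if tok.lower() not in noise]

tokens = ["The", "price", "of", "oil", ",", "and", "gas", "!"]
cleaned = remove_noise(tokens, noise_set)
print(cleaned)
```

Punctuation characters only match tokens that consist of a single mark, so this filter assumes the text was tokenized in a way that separates punctuation from words.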


## Lemmatization

**Lemmatization** is the reduction of different forms of a word to its base form, the **lemma**. Why is this useful?
* First, it is natural to treat each *word* as a separate feature, rather than each of its individual forms.
* Second, some stop words are listed only in their base form, so without lemmatization only that form gets removed.

In [41]:
from nltk.stem import WordNetLemmatizer

In [42]:
lemmatizer = WordNetLemmatizer()

In [43]:
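# Lemmatize every word inside each news text (not the whole text as a single token)
news = [' '.join(lemmatizer.lemmatize(word) for word in text.split()) for text in news]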
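
The second point from the list above, that lemmatization helps stop-word removal, can be illustrated with a toy example. The lemma table and stop-word set below are hypothetical and hand-written; a real lemmatizer such as `WordNetLemmatizer` derives base forms from a dictionary and morphological rules rather than from a fixed lookup.

```python
# Toy lemma table and stop-word set (both hypothetical, for illustration only)
TOY_LEMMAS = {"buns": "bun", "prices": "price", "was": "be", "were": "be"}
TOY_STOPWORDS = {"be", "the"}

def lemmatize_tokens(tokens):
    # Words missing from the table are returned unchanged
    return [TOY_LEMMAS.get(tok, tok) for tok in tokens]

def drop_stopwords(tokens):
    return [tok for tok in tokens if tok not in TOY_STOPWORDS]

tokens = "the prices were high".split()
without_lemmas = drop_stopwords(tokens)                 # 'were' is not in the stop list
with_lemmas = drop_stopwords(lemmatize_tokens(tokens))  # 'were' becomes 'be' and is removed
print(without_lemmas)
print(with_lemmas)
```

Because the stop list here contains only the base form `be`, the inflected form `were` survives filtering unless the tokens are lemmatized first.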

## Useful links

1. [Understanding LSTM Networks](https://alexsosn.github.io/ml/2015/11/17/LSTM.html) (in Russian)
2. [Recurrent neural networks in Keras](https://keras.io/layers/recurrent/)
3. [Regularizing the right way!](https://telegra.ph/Regulyarizuem-pravilno-09-20) (in Russian)
4. [12 Main Dropout Methods](https://towardsdatascience.com/12-main-dropout-methods-mathematical-and-visual-explanation-58cdc2112293?source=topic_page---------6------------------1)

In [44]:
# Use the public callbacks API rather than the private keras.src path
from tensorflow.keras.callbacks import EarlyStopping
from typing import Dict, List, Literal, Any

df_train = pd.read_csv('train.csv', header=None, names=['class', 'title', 'text'])
df_test = pd.read_csv('test.csv', header=None, names=['class', 'title', 'text'])
# Read class names (one per line) and map them to 1-based class ids
with open('classes.txt') as f:
    classes = {i + 1: line for i, line in enumerate(f.read().split('\n')) if line}
classes

{1: 'World', 2: 'Sports', 3: 'Business', 4: 'Sci/Tech'}

In [45]:
# nltk resources
_stopwords = set(stopwords.words('english'))
_punctuation = set(punctuation)
_noise = _stopwords | _punctuation
_lemmatizer = WordNetLemmatizer()

_NON_ALPHANUMERIC_REGEX = re.compile(r'[^a-zA-Z0-9\s]+')

def preprocess_text(text: str, method: Literal['nothing', 'remove_noise', 'lemmatize', 'all'] = 'nothing') -> str:
    if method == 'nothing':
        return text
    text = str(text).lower()
    
    text = _NON_ALPHANUMERIC_REGEX.sub(' ', text)
    tokens = text.split()
    
    if method == 'all':
        tokens = [_lemmatizer.lemmatize(tok) for tok in tokens if tok not in _noise]
    elif method == 'remove_noise':
        tokens = [tok for tok in tokens if tok not in _noise]
    elif method == 'lemmatize':
        tokens = [_lemmatizer.lemmatize(tok) for tok in tokens]
        
    return ' '.join(tokens)

for method in ['nothing', 'remove_noise', 'lemmatize', 'all']:
    print(f'{method}: {preprocess_text("So eat more of these soft french buns, and drink some tea", method=method)}')

nothing: So eat more of these soft french buns, and drink some tea
remove_noise: eat soft french buns drink tea
lemmatize: so eat more of these soft french bun and drink some tea
all: eat soft french bun drink tea


In [46]:
from sklearn.metrics import f1_score

def run_experiment(
        experiment_name: str = "experiment",
        nn_type: Literal["cnn", "lstm", "gru"] = "cnn",
        do_dropout: bool = False,
        do_early_stopping: bool = False,
        max_news_len: int = 30,
        do_add_header: bool = False,
        preprocess_method: Literal['nothing', 'remove_noise', 'lemmatize', 'all'] = 'nothing',
        epochs: int = 10,
        batch_size: int = 256,
        validation_split: float = 0.1,
) -> Dict[str, Any]:
    """
    Generalized experiment runner matching the original notebook's setup.

    Parameters:
        experiment_name    : used in checkpoint filename.
        nn_type            : "cnn" / "lstm" / "gru".
        do_dropout         : add Dropout layer(s) after pooling/recurrent layer.
        do_early_stopping  : use EarlyStopping callback.
        max_news_len       : pad_sequences length.
        do_add_header      : if True, use title + ' ' + text as input.
        preprocess_method  : which kinds of text preprocessing to apply.
        epochs             : max epochs to train.
        batch_size         : batch size.
        validation_split   : fraction of train data for validation.

    Returns:
        Dict with experiment config and metrics (test accuracy, macro-F1).
    """

    t0 = timer()
    t0_prep = timer()
    set_global_seed(42)
    print(f"\n=== {experiment_name} | nn_type={nn_type} ===")
    print(f"  do_dropout={do_dropout}, do_early_stopping={do_early_stopping}, "
          f"max_news_len={max_news_len}, do_add_header={do_add_header}, "
          f"preprocess_method={preprocess_method}")

    # ----- 1. Build raw texts exactly as in the notebook -----
    if do_add_header:
        # i.e. news = train['title'] + ' ' + train['text']
        train_texts_raw = (df_train['title'].astype(str) + ' ' + df_train['text'].astype(str)).tolist()
        test_texts_raw = (df_test['title'].astype(str) + ' ' + df_test['text'].astype(str)).tolist()
    else:
        # i.e. news = train['text']
        train_texts_raw = df_train['text'].astype(str).tolist()
        test_texts_raw = df_test['text'].astype(str).tolist()

    if preprocess_method != 'nothing':
        train_texts = [preprocess_text(t, method=preprocess_method) for t in train_texts_raw]
        test_texts = [preprocess_text(t, method=preprocess_method) for t in test_texts_raw]
    else:
        train_texts = train_texts_raw
        test_texts = test_texts_raw

    # ----- 2. Labels as in notebook -----
    # y_train = utils.to_categorical(train['class'] - 1, nb_classes)
    y_train = utils.to_categorical(df_train['class'].values.astype(int) - 1, nb_classes)
    y_test = utils.to_categorical(df_test['class'].values.astype(int) - 1, nb_classes)

    # ----- 3. Tokenize (same Tokenizer usage as in notebook) -----
    tokenizer = Tokenizer(num_words=num_words)
    tokenizer.fit_on_texts(train_texts)

    sequences_train = tokenizer.texts_to_sequences(train_texts)
    sequences_test = tokenizer.texts_to_sequences(test_texts)

    x_train = pad_sequences(sequences_train, maxlen=max_news_len)
    x_test = pad_sequences(sequences_test, maxlen=max_news_len)
    t_prep = timer() - t0_prep
    print(f"  x_train shape = {x_train.shape}, x_test shape = {x_test.shape}")

    # ----- 4. Build model as in notebook -----
    model = Sequential()
    model.add(layers.Embedding(num_words, 32))

    if nn_type == "cnn":
        # CNN architecture from the notebook
        model.add(layers.Conv1D(250, 5, padding='valid', activation='relu'))
        model.add(layers.GlobalMaxPooling1D())
        if do_dropout:
            model.add(layers.Dropout(0.3))
        model.add(layers.Dense(128, activation='relu'))
        model.add(layers.Dense(nb_classes, activation='softmax'))

    elif nn_type == "lstm":
        model.add(layers.LSTM(16))
        if do_dropout:
            model.add(layers.Dropout(0.3))
        model.add(layers.Dense(nb_classes, activation='softmax'))

    elif nn_type == "gru":
        model.add(layers.GRU(16))
        if do_dropout:
            model.add(layers.Dropout(0.3))
        model.add(layers.Dense(nb_classes, activation='softmax'))

    else:
        raise ValueError(f"Unknown nn_type: {nn_type}")

    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

    # ----- 5. Callbacks: ModelCheckpoint + optional EarlyStopping -----
    callbacks = []

    os.makedirs("/tmp/ckpt", exist_ok=True)
    checkpoint_filepath = f"/tmp/ckpt/checkpoint.{experiment_name}.{nn_type}.weights.h5"
    checkpoint_callback = ModelCheckpoint(
        filepath=checkpoint_filepath,
        monitor='val_accuracy',
        save_best_only=True,
        verbose=0,
        save_weights_only=True
    )
    callbacks.append(checkpoint_callback)

    if do_early_stopping:
        early_stopping = EarlyStopping(
            monitor='val_accuracy',
            patience=5,
            restore_best_weights=True,
            verbose=0
        )
        callbacks.append(early_stopping)

    # ----- 6. Training (as in CNN/LSTM/GRU fit with validation_split=0.1) -----
    t0_fit = timer()
    history = model.fit(
        x_train,
        y_train,
        epochs=epochs,
        batch_size=batch_size,
        validation_split=validation_split,
        callbacks=callbacks,
        verbose=0
    )
    t_fit = timer() - t0_fit

    # Reload best weights if checkpoint exists
    if os.path.exists(checkpoint_filepath):
        model.load_weights(checkpoint_filepath)

    # ----- 7. Evaluation on test -----
    t0_eval = timer()
    eval_res = model.evaluate(x_test, y_test, verbose=0, return_dict=True)
    test_loss = float(eval_res["loss"])
    test_accuracy = float(eval_res["accuracy"])
    
    y_test_pred_probs = model.predict(x_test, verbose=0)
    y_test_true = df_test['class'].values.astype(int) - 1
    y_test_pred = np.argmax(y_test_pred_probs, axis=1)
    test_f1 = f1_score(y_test_true, y_test_pred, average='macro')
    t_eval = timer() - t0_eval

    print(f"[{experiment_name} / {nn_type}] "
          f"Test accuracy = {test_accuracy:.4f}, macro-F1 = {test_f1:.4f}")
    time = timer() - t0

    return {
        "experiment_name": experiment_name,
        "nn_type": nn_type,
        "do_dropout": do_dropout,
        "do_early_stopping": do_early_stopping,
        "max_news_len": max_news_len,
        "do_add_header": do_add_header,
        "preprocess_method": preprocess_method,
        "test_accuracy": float(test_accuracy),
        "test_f1": float(test_f1),
        "history": history.history,
        "time_total": time,
        "time_prep": t_prep,
        "time_fit": t_fit,
        "time_eval": t_eval,
    }
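
For readers unfamiliar with the macro-F1 metric returned above, here is a pure-Python sketch of the idea: compute F1 separately for each class, then take the unweighted mean. The function `macro_f1` and the toy labels are assumptions for illustration. Unlike scikit-learn's `f1_score`, this sketch always averages over a fixed `num_classes`.

```python
def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores (a sketch of average='macro')."""
    f1_scores = []
    for c in range(num_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class with no true and no predicted samples contributes 0 here
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_classes

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred, num_classes=3)
print(round(score, 4))
```

Since every class carries equal weight, macro-F1 tends to stay close to accuracy when the classes are roughly balanced, which is consistent with the nearly identical accuracy and macro-F1 values in the experiment log.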


In [47]:
experiments = [
    { "experiment_name": "cnn", "nn_type": "cnn", "do_dropout": False, "do_early_stopping": False,
      "max_news_len": 50, "do_add_header": False, "preprocess_method": "nothing" },
    { "experiment_name": "lstm", "nn_type": "lstm", "do_dropout": False, "do_early_stopping": False,
      "max_news_len": 50, "do_add_header": False, "preprocess_method": "nothing" },
    { "experiment_name": "gru", "nn_type": "gru", "do_dropout": False, "do_early_stopping": False,
      "max_news_len": 50, "do_add_header": False, "preprocess_method": "nothing" },
    { "experiment_name": "cnn_header", "nn_type": "cnn", "do_dropout": False, "do_early_stopping": False,
      "max_news_len": 50, "do_add_header": True, "preprocess_method": "nothing" },
    { "experiment_name": "lstm_header", "nn_type": "lstm", "do_dropout": False, "do_early_stopping": False,
      "max_news_len": 50, "do_add_header": True, "preprocess_method": "nothing" },
    { "experiment_name": "gru_header", "nn_type": "gru", "do_dropout": False, "do_early_stopping": False,
      "max_news_len": 50, "do_add_header": True, "preprocess_method": "nothing" },
    { "experiment_name": "cnn_header_short", "nn_type": "cnn", "do_dropout": False, "do_early_stopping": False,
      "max_news_len": 40, "do_add_header": True, "preprocess_method": "nothing" },
    { "experiment_name": "cnn_header_long", "nn_type": "cnn", "do_dropout": False, "do_early_stopping": False,
      "max_news_len": 75, "do_add_header": True, "preprocess_method": "nothing" },
    { "experiment_name": "cnn_header_long_preproc1", "nn_type": "cnn", "do_dropout": False, "do_early_stopping": False,
      "max_news_len": 75, "do_add_header": True, "preprocess_method": "remove_noise" },
    { "experiment_name": "cnn_header_long_preproc2", "nn_type": "cnn", "do_dropout": False, "do_early_stopping": False,
      "max_news_len": 75, "do_add_header": True, "preprocess_method": "lemmatize" },
    { "experiment_name": "cnn_header_long_preproc3", "nn_type": "cnn", "do_dropout": False, "do_early_stopping": False,
      "max_news_len": 75, "do_add_header": True, "preprocess_method": "all" },
    { "experiment_name": "cnn_header_long_preproc3_dropout", "nn_type": "cnn", "do_dropout": True, "do_early_stopping": False,
      "max_news_len": 75, "do_add_header": True, "preprocess_method": "all" },
    { "experiment_name": "cnn_header_long_preproc3_es", "nn_type": "cnn", "do_dropout": False, "do_early_stopping": True,
      "max_news_len": 75, "do_add_header": True, "preprocess_method": "all" },
    { "experiment_name": "cnn_header_long_preproc3_dropout_es", "nn_type": "cnn", "do_dropout": True, "do_early_stopping": True,
      "max_news_len": 75, "do_add_header": True, "preprocess_method": "all" },
    { "experiment_name": "lstm_header_long_preproc3_dropout_es", "nn_type": "lstm", "do_dropout": False, "do_early_stopping": True,
      "max_news_len": 75, "do_add_header": True, "preprocess_method": "all" },
    { "experiment_name": "gru_header_long_preproc3_dropout_es", "nn_type": "gru", "do_dropout": False, "do_early_stopping": True,
      "max_news_len": 75, "do_add_header": True, "preprocess_method": "all" },
]
experiment_results = []

for exp in tqdm(experiments):
    result = run_experiment(
        experiment_name=exp["experiment_name"],
        nn_type=exp["nn_type"],
        do_dropout=exp["do_dropout"],
        do_early_stopping=exp["do_early_stopping"],
        max_news_len=exp["max_news_len"],
        do_add_header=exp["do_add_header"],
        preprocess_method=exp["preprocess_method"],
        batch_size=exp.get("batch_size", 256),
    )
    experiment_results.append(result)

  0%|          | 0/16 [00:00<?, ?it/s]

[INFO] Global seed fixed to 42

=== cnn | nn_type=cnn ===
  do_dropout=False, do_early_stopping=False, max_news_len=50, do_add_header=False, preprocess_method=nothing
  x_train shape = (120000, 50), x_test shape = (7600, 50)


2025-12-02 00:52:34.449166: E tensorflow/core/framework/node_def_util.cc:680] NodeDef mentions attribute use_unbounded_threadpool which is not in the op definition: Op<name=MapDataset; signature=input_dataset:variant, other_arguments: -> handle:variant; attr=f:func; attr=Targuments:list(type),min=0; attr=output_types:list(type),min=1; attr=output_shapes:list(shape),min=1; attr=use_inter_op_parallelism:bool,default=true; attr=preserve_cardinality:bool,default=false; attr=force_synchronous:bool,default=false; attr=metadata:string,default=""> This may be expected if your graph generating binary is newer  than this binary. Unknown attributes will be ignored. NodeDef: {{node ParallelMapDatasetV2/_15}}
2025-12-02 00:52:43.569421: E tensorflow/core/framework/node_def_util.cc:680] NodeDef mentions attribute use_unbounded_threadpool which is not in the op definition: Op<name=MapDataset; signature=input_dataset:variant, other_arguments: -> handle:variant; attr=f:func; attr=Targuments:list(type),

[cnn / cnn] Test accuracy = 0.9037, macro-F1 = 0.9037
[INFO] Global seed fixed to 42

=== lstm | nn_type=lstm ===
  do_dropout=False, do_early_stopping=False, max_news_len=50, do_add_header=False, preprocess_method=nothing
  x_train shape = (120000, 50), x_test shape = (7600, 50)



[lstm / lstm] Test accuracy = 0.8984, macro-F1 = 0.8983
[INFO] Global seed fixed to 42

=== gru | nn_type=gru ===
  do_dropout=False, do_early_stopping=False, max_news_len=50, do_add_header=False, preprocess_method=nothing
  x_train shape = (120000, 50), x_test shape = (7600, 50)


[gru / gru] Test accuracy = 0.8911, macro-F1 = 0.8909
[INFO] Global seed fixed to 42

=== cnn_header | nn_type=cnn ===
  do_dropout=False, do_early_stopping=False, max_news_len=50, do_add_header=True, preprocess_method=nothing
  x_train shape = (120000, 50), x_test shape = (7600, 50)


[cnn_header / cnn] Test accuracy = 0.9124, macro-F1 = 0.9122
[INFO] Global seed fixed to 42

=== lstm_header | nn_type=lstm ===
  do_dropout=False, do_early_stopping=False, max_news_len=50, do_add_header=True, preprocess_method=nothing
  x_train shape = (120000, 50), x_test shape = (7600, 50)


[lstm_header / lstm] Test accuracy = 0.9101, macro-F1 = 0.9101
[INFO] Global seed fixed to 42

=== gru_header | nn_type=gru ===
  do_dropout=False, do_early_stopping=False, max_news_len=50, do_add_header=True, preprocess_method=nothing
  x_train shape = (120000, 50), x_test shape = (7600, 50)


[gru_header / gru] Test accuracy = 0.9038, macro-F1 = 0.9039
[INFO] Global seed fixed to 42

=== cnn_header_short | nn_type=cnn ===
  do_dropout=False, do_early_stopping=False, max_news_len=40, do_add_header=True, preprocess_method=nothing
  x_train shape = (120000, 40), x_test shape = (7600, 40)


[cnn_header_short / cnn] Test accuracy = 0.9084, macro-F1 = 0.9083
[INFO] Global seed fixed to 42

=== cnn_header_long | nn_type=cnn ===
  do_dropout=False, do_early_stopping=False, max_news_len=75, do_add_header=True, preprocess_method=nothing
  x_train shape = (120000, 75), x_test shape = (7600, 75)


[cnn_header_long / cnn] Test accuracy = 0.9129, macro-F1 = 0.9128
[INFO] Global seed fixed to 42

=== cnn_header_long_preproc1 | nn_type=cnn ===
  do_dropout=False, do_early_stopping=False, max_news_len=75, do_add_header=True, preprocess_method=remove_noise
  x_train shape = (120000, 75), x_test shape = (7600, 75)


[cnn_header_long_preproc1 / cnn] Test accuracy = 0.9141, macro-F1 = 0.9140
[INFO] Global seed fixed to 42

=== cnn_header_long_preproc2 | nn_type=cnn ===
  do_dropout=False, do_early_stopping=False, max_news_len=75, do_add_header=True, preprocess_method=lemmatize
  x_train shape = (120000, 75), x_test shape = (7600, 75)


[cnn_header_long_preproc2 / cnn] Test accuracy = 0.9141, macro-F1 = 0.9140
[INFO] Global seed fixed to 42

=== cnn_header_long_preproc3 | nn_type=cnn ===
  do_dropout=False, do_early_stopping=False, max_news_len=75, do_add_header=True, preprocess_method=all
  x_train shape = (120000, 75), x_test shape = (7600, 75)


[cnn_header_long_preproc3 / cnn] Test accuracy = 0.9146, macro-F1 = 0.9145
[INFO] Global seed fixed to 42

=== cnn_header_long_preproc3_dropout | nn_type=cnn ===
  do_dropout=True, do_early_stopping=False, max_news_len=75, do_add_header=True, preprocess_method=all
  x_train shape = (120000, 75), x_test shape = (7600, 75)


[cnn_header_long_preproc3_dropout / cnn] Test accuracy = 0.9130, macro-F1 = 0.9129
[INFO] Global seed fixed to 42

=== cnn_header_long_preproc3_es | nn_type=cnn ===
  do_dropout=False, do_early_stopping=True, max_news_len=75, do_add_header=True, preprocess_method=all
  x_train shape = (120000, 75), x_test shape = (7600, 75)


[cnn_header_long_preproc3_es / cnn] Test accuracy = 0.9146, macro-F1 = 0.9145
[INFO] Global seed fixed to 42

=== cnn_header_long_preproc3_dropout_es | nn_type=cnn ===
  do_dropout=True, do_early_stopping=True, max_news_len=75, do_add_header=True, preprocess_method=all
  x_train shape = (120000, 75), x_test shape = (7600, 75)


[cnn_header_long_preproc3_dropout_es / cnn] Test accuracy = 0.9130, macro-F1 = 0.9129
[INFO] Global seed fixed to 42

=== lstm_header_long_preproc3_dropout_es | nn_type=lstm ===
  do_dropout=False, do_early_stopping=True, max_news_len=75, do_add_header=True, preprocess_method=all
  x_train shape = (120000, 75), x_test shape = (7600, 75)


[lstm_header_long_preproc3_dropout_es / lstm] Test accuracy = 0.9112, macro-F1 = 0.9110
[INFO] Global seed fixed to 42

=== gru_header_long_preproc3_dropout_es | nn_type=gru ===
  do_dropout=False, do_early_stopping=True, max_news_len=75, do_add_header=True, preprocess_method=all
  x_train shape = (120000, 75), x_test shape = (7600, 75)


[gru_header_long_preproc3_dropout_es / gru] Test accuracy = 0.9089, macro-F1 = 0.9090
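
The macro-F1 reported next to each accuracy is conventionally the unweighted mean of the per-class F1 scores, so every class counts equally regardless of its size. The code that computes the metric is not shown in this section, so the following is only a minimal pure-Python sketch of that definition. The function name `macro_f1` and the toy labels are hypothetical and are not taken from the notebook or from the AG News test set.

```python
def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores (hypothetical helper)."""
    f1_scores = []
    for c in range(num_classes):
        # Count true positives, false positives and false negatives for class c
        tp = sum(1 for t, p in zip(y_true, y_pred) if p == c and t == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if p == c and t != c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if p != c and t == c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        f1_scores.append(2 * tp / denom if denom > 0 else 0.0)
    return sum(f1_scores) / num_classes


# Toy labels for four classes (illustrative only)
y_true_demo = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred_demo = [0, 1, 1, 1, 2, 2, 3, 0]
demo_score = macro_f1(y_true_demo, y_pred_demo, num_classes=4)
print(f"macro-F1 = {demo_score:.4f}")
```

On a balanced test set with similar per-class error rates, macro-F1 and accuracy tend to be close, which is consistent with the near-identical pairs of numbers printed above.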


In [48]:
for result in experiment_results:
    print(f"experiment name: {result['experiment_name']}, accuracy: {result['test_accuracy']:.4f}, macro-F1: {result['test_f1']:.4f}"
          f" in {result['time_prep']:.1f} + {result['time_fit']:.1f} + {result['time_eval']:.1f} = {result['time_total']:.1f}s")

experiment name: cnn, accuracy: 0.9037, macro-F1: 0.9037 in 3.4 + 64.2 + 1.9 = 69.6s
experiment name: lstm, accuracy: 0.8984, macro-F1: 0.8983 in 3.5 + 68.6 + 2.8 = 75.0s
experiment name: gru, accuracy: 0.8911, macro-F1: 0.8909 in 3.6 + 64.3 + 2.5 = 70.5s
experiment name: cnn_header, accuracy: 0.9124, macro-F1: 0.9122 in 4.0 + 57.2 + 1.7 = 62.9s
experiment name: lstm_header, accuracy: 0.9101, macro-F1: 0.9101 in 3.9 + 65.3 + 2.4 = 71.7s
experiment name: gru_header, accuracy: 0.9038, macro-F1: 0.9039 in 4.1 + 57.2 + 2.3 = 63.6s
experiment name: cnn_header_short, accuracy: 0.9084, macro-F1: 0.9083 in 3.8 + 56.7 + 1.8 = 62.3s
experiment name: cnn_header_long, accuracy: 0.9129, macro-F1: 0.9128 in 3.9 + 71.0 + 1.8 = 76.7s
experiment name: cnn_header_long_preproc1, accuracy: 0.9141, macro-F1: 0.9140 in 4.1 + 76.4 + 1.7 = 82.2s
experiment name: cnn_header_long_preproc2, accuracy: 0.9141, macro-F1: 0.9140 in 14.3 + 69.6 + 1.9 = 85.9s
experiment name: cnn_header_long_preproc3, accuracy: 0.9146

In [49]:
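df = pd.DataFrame(experiment_results)
df = df.drop(columns=['history'])  # drop the training-history column before summarizing
df.to_csv("experiment_results.csv")  # persist the summary table
df.describe(include='all')  # last expression in the cell, so the notebook renders it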
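
To compare configurations at a glance, the result records can be ranked by test accuracy. The sketch below assumes each record is a dict with the keys used in the print loop above (`experiment_name`, `test_accuracy`, `test_f1`, `time_total`). `sample_results` is a hypothetical stand-in holding only four rows copied from the printed summary, not the full `experiment_results` list.

```python
# Four rows copied from the printed summary; the real records carry more keys.
sample_results = [
    {"experiment_name": "cnn", "test_accuracy": 0.9037, "test_f1": 0.9037, "time_total": 69.6},
    {"experiment_name": "lstm", "test_accuracy": 0.8984, "test_f1": 0.8983, "time_total": 75.0},
    {"experiment_name": "gru", "test_accuracy": 0.8911, "test_f1": 0.8909, "time_total": 70.5},
    {"experiment_name": "cnn_header", "test_accuracy": 0.9124, "test_f1": 0.9122, "time_total": 62.9},
]

# Sort from best to worst test accuracy
ranked = sorted(sample_results, key=lambda r: r["test_accuracy"], reverse=True)

for rank, r in enumerate(ranked, start=1):
    print(f"{rank}. {r['experiment_name']}: accuracy={r['test_accuracy']:.4f}, "
          f"macro-F1={r['test_f1']:.4f}, total={r['time_total']:.1f}s")

best = ranked[0]
```

The same ordering could be obtained on the DataFrame with `df.sort_values("test_accuracy", ascending=False)`. The plain-list version is shown only because it needs no extra dependencies.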