In [6]:
# config (run before anything else)
LOG_RESULTS = True
INPUT_FILE_PATH = "data/train.csv"
N_SPLITS = 5
RANDOM_SEED = 42

## Weak baseline

The weak baseline establishes a foundational classification model using TF-IDF (Term Frequency–Inverse Document Frequency) vectorization on the raw description field and a Support Vector Classifier (SVC) with a linear kernel.

- Only the description field is used as input.
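- The raw text is passed directly to TfidfVectorizer, which turns each description into a numerical vector: a term's weight grows with its frequency in that description and shrinks the more documents it appears in.
- No additional preprocessing step (such as stopword removal, lowercasing, or cleaning) is applied before vectorization, keeping this baseline deliberately simple. Note that TfidfVectorizer itself lowercases text by default unless lowercase=False is passed.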

This setup provides the simplest possible benchmark for the classification task. It helps establish how much predictive power comes purely from recipe descriptions, serving as a lower reference point against which richer models can be compared.
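
As a rough illustration of the setup described above, the following sketch chains TfidfVectorizer and a linear-kernel SVC with scikit-learn. The toy descriptions and labels are invented for the example, and the internals of run_kfold_experiment are not shown in this notebook, so this is an assumption about the general shape of the model rather than the exact code in utils.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy stand-ins for recipe descriptions and chef_id labels (not real data).
descriptions = [
    "quick weeknight pasta with garlic and basil",
    "creamy pasta bake with garlic butter",
    "slow smoked brisket with dry rub",
    "smoked ribs with a sweet dry rub",
]
chef_ids = [4470, 4470, 5060, 5060]

# TF-IDF features on the raw text, followed by an SVC with a linear kernel.
weak_baseline = make_pipeline(
    TfidfVectorizer(),
    SVC(kernel="linear", random_state=42),
)
weak_baseline.fit(descriptions, chef_ids)

# Predict the chef for an unseen description.
prediction = weak_baseline.predict(["garlic pasta with basil"])
print(prediction)
```

Wrapping both steps in one pipeline means the vectorizer is fitted only on the training portion of each fold, which avoids leaking validation vocabulary into training.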

### Evaluation: K-Fold Cross-Validation

Performance is estimated using Stratified K-Fold Cross-Validation:

- The dataset is partitioned into K folds of (almost) equal size.
- Each fold is used once as the validation set, while the remaining K-1 folds form the training set.
- Class distributions (chef_id) are preserved across folds, so every validation set reflects the overall class balance.
- Results are averaged over all folds, providing a more robust and reliable accuracy estimate than a single 80/20 split.
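
The stratification idea can be sketched with scikit-learn's StratifiedKFold. The labels below are a made-up 60/40 split, not the chef_id column, and create_folds may well be implemented differently; the point is only to show that each validation fold keeps the class ratio.

```python
from collections import Counter

from sklearn.model_selection import StratifiedKFold

# Toy labels with a 60/40 class imbalance (hypothetical, not from train.csv).
labels = ["a"] * 12 + ["b"] * 8
features = list(range(len(labels)))  # placeholder; only the labels drive the split

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)

fold_counts = []
for fold, (train_idx, val_idx) in enumerate(skf.split(features, labels), start=1):
    # Count how many samples of each class landed in this validation fold.
    counts = Counter(labels[i] for i in val_idx)
    fold_counts.append(counts)
    print(f"Fold {fold}: size={len(val_idx)}, counts={dict(counts)}")
```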


In [7]:
from utils import create_folds

create_folds(
    input_data=INPUT_FILE_PATH,
    output_file="data/weak_baseline/folds_not_preprocessed.csv",
    text_columns=["description"],
    n_splits=N_SPLITS,
    random_seed=RANDOM_SEED,
)

Loading data from 'data/train.csv'...

## Chef ID Distribution Check

=== Original Dataset === (Size: 2999)
         Count  Percentage
chef_id                   
4470       806       26.88
5060       534       17.81
3288       451       15.04
8688       432       14.40
1533       404       13.47
6357       372       12.40

=== Fold 1 === (Size: 600)
         Count  Percentage
chef_id                   
4470       161       26.83
5060       107       17.83
3288        90       15.00
8688        86       14.33
1533        81       13.50
6357        75       12.50

=== Fold 2 === (Size: 600)
         Count  Percentage
chef_id                   
4470       162       27.00
5060       107       17.83
3288        90       15.00
8688        86       14.33
1533        81       13.50
6357        74       12.33

=== Fold 3 === (Size: 600)
         Count  Percentage
chef_id                   
4470       161       26.83
5060       107       17.83
3288        90       15.00
8688        87       14.5

### Evaluation


In [8]:
from utils import run_kfold_experiment
from sklearn.svm import SVC

_ = run_kfold_experiment(
    folds_file="data/weak_baseline/folds_not_preprocessed.csv",
    model_cls=SVC,
    model_kwargs={"kernel": "linear"},
    model_desc="Weak baseline (without pre-processing): Linear SVC + TF-IDF on raw description",
    random_seed=RANDOM_SEED,
    log_results=LOG_RESULTS
)

_ = run_kfold_experiment(
    folds_file="data/weak_baseline/folds_preprocessed.csv",
    model_cls=SVC,
    model_kwargs={"kernel": "linear"},
    model_desc="Weak baseline (with pre-processing): Linear SVC + TF-IDF on raw description",
    random_seed=RANDOM_SEED,
    log_results=LOG_RESULTS
)

Model: SVC
Cross-Validation (using data/weak_baseline/folds_not_preprocessed.csv)
  Fold 1: 0.7117
  Fold 2: 0.7000
  Fold 3: 0.7550
  Fold 4: 0.7183
  Fold 5: 0.7195
Mean accuracy: 0.7209  |  Std: 0.0206
Total runtime: 4.84 seconds
➕ Added new results for 'Weak baseline (without pre-processing): Linear SVC + TF-IDF on raw description'.
✅ Results saved to results/log.xlsx




Model: SVC
Cross-Validation (using data/weak_baseline/folds_preprocessed.csv)
  Fold 1: 0.6767
  Fold 2: 0.6617
  Fold 3: 0.6767
  Fold 4: 0.6883
  Fold 5: 0.6745
Mean accuracy: 0.6756  |  Std: 0.0095
Total runtime: 3.35 seconds
➕ Added new results for 'Weak baseline (with pre-processing): Linear SVC + TF-IDF on raw description'.
✅ Results saved to results/log.xlsx
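
The summary line can be cross-checked from the per-fold scores. The sketch below uses only the rounded accuracies printed for the run without pre-processing. It suggests that the reported Std corresponds to the sample standard deviation (dividing by K-1) rather than the population one, although how run_kfold_experiment computes it internally is not shown here.

```python
import statistics

# Per-fold accuracies copied from the output above (weak baseline, no pre-processing).
fold_accuracies = [0.7117, 0.7000, 0.7550, 0.7183, 0.7195]

mean_acc = statistics.mean(fold_accuracies)
sample_std = statistics.stdev(fold_accuracies)        # divides by K-1
population_std = statistics.pstdev(fold_accuracies)   # divides by K

print(f"Mean: {mean_acc:.4f}")
print(f"Sample std (K-1): {sample_std:.4f}")
print(f"Population std (K): {population_std:.4f}")
```

With only five folds the two definitions differ noticeably, so it helps to know which one a logged Std refers to when comparing runs.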


## Strong baseline

The strong baseline keeps the same model as the weak baseline: TF-IDF vectorization feeding a Support Vector Classifier (SVC) with a linear kernel.
Instead of relying only on the description field, it uses a richer textual representation by concatenating multiple fields into a single document:

- recipe_name
- description
- tags
- steps
- ingredients
- data (date field, cast to text)
- n_ingredients (numeric field, cast to text)

The combined text gives the classifier broader context about each recipe than the description alone.
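
One way to build such a combined document with pandas is sketched below. The two rows are invented, the output column name text is a hypothetical choice, and create_folds may join the fields differently (for example with another separator); the column names follow the list above.

```python
import pandas as pd

# Two made-up rows using the column names listed above; values are illustrative.
df = pd.DataFrame({
    "recipe_name": ["lemon tart", "beef stew"],
    "description": ["a sharp, buttery tart", "slow cooked and hearty"],
    "tags": ["dessert, baking", "dinner, winter"],
    "steps": ["blind bake the crust", "brown the beef"],
    "ingredients": ["lemon, butter, flour", "beef, carrot, stock"],
    "data": ["2004-03-12", "2006-11-02"],
    "n_ingredients": [3, 3],
})

text_columns = [
    "recipe_name", "data", "tags", "steps",
    "description", "ingredients", "n_ingredients",
]

# Cast every field to text (missing values become empty strings) and join
# the fields of each row with single spaces. "text" is a hypothetical column name.
df["text"] = df[text_columns].apply(
    lambda row: " ".join("" if pd.isna(v) else str(v) for v in row),
    axis=1,
)
print(df.loc[0, "text"])
```

Casting the date and the ingredient count to text means TF-IDF treats them as ordinary tokens, so they carry no ordering or magnitude information.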

### Evaluation: K-Fold Cross-Validation


In [9]:
from utils import create_folds

create_folds(
    input_data=INPUT_FILE_PATH,
    output_file="data/strong_baseline/folds_not_preprocessed.csv",
    text_columns=[
        "recipe_name",
        "data",
        "tags",
        "steps",
        "description",
        "ingredients",
        "n_ingredients",
    ],
    n_splits=N_SPLITS,
    random_seed=RANDOM_SEED,
)

Loading data from 'data/train.csv'...

## Chef ID Distribution Check

=== Original Dataset === (Size: 2999)
         Count  Percentage
chef_id                   
4470       806       26.88
5060       534       17.81
3288       451       15.04
8688       432       14.40
1533       404       13.47
6357       372       12.40

=== Fold 1 === (Size: 600)
         Count  Percentage
chef_id                   
4470       161       26.83
5060       107       17.83
3288        90       15.00
8688        86       14.33
1533        81       13.50
6357        75       12.50

=== Fold 2 === (Size: 600)
         Count  Percentage
chef_id                   
4470       162       27.00
5060       107       17.83
3288        90       15.00
8688        86       14.33
1533        81       13.50
6357        74       12.33

=== Fold 3 === (Size: 600)
         Count  Percentage
chef_id                   
4470       161       26.83
5060       107       17.83
3288        90       15.00
8688        87       14.5

### Evaluation


In [10]:
from utils import run_kfold_experiment
from sklearn.svm import SVC

_ = run_kfold_experiment(
    folds_file="data/strong_baseline/folds_not_preprocessed.csv",
    model_cls=SVC,
    model_kwargs={"kernel": "linear"},
    model_desc="Strong baseline (without pre-processing): Linear SVC + TF-IDF on all fields",
    random_seed=RANDOM_SEED,
    log_results=LOG_RESULTS
)

_ = run_kfold_experiment(
    folds_file="data/strong_baseline/folds_preprocessed.csv",
    model_cls=SVC,
    model_kwargs={"kernel": "linear"},
    model_desc="Strong baseline (with pre-processing): Linear SVC + TF-IDF on all fields",
    random_seed=RANDOM_SEED,
    log_results=LOG_RESULTS
)

Model: SVC
Cross-Validation (using data/strong_baseline/folds_not_preprocessed.csv)
  Fold 1: 0.8667
  Fold 2: 0.8500
  Fold 3: 0.8667
  Fold 4: 0.8600
  Fold 5: 0.8464
Mean accuracy: 0.8579  |  Std: 0.0094
Total runtime: 19.63 seconds
➕ Added new results for 'Strong baseline (without pre-processing): Linear SVC + TF-IDF on all fields'.
✅ Results saved to results/log.xlsx

Model: SVC
Cross-Validation (using data/strong_baseline/folds_preprocessed.csv)
  Fold 1: 0.8533
  Fold 2: 0.8533
  Fold 3: 0.8483
  Fold 4: 0.8433
  Fold 5: 0.8447
Mean accuracy: 0.8486  |  Std: 0.0047
Total runtime: 16.34 seconds
➕ Added new results for 'Strong baseline (with pre-processing): Linear SVC + TF-IDF on all fields'.
✅ Results saved to results/log.xlsx
