# News Classification Exploration Notebook


Goal: build and evaluate a text classification pipeline for the news dataset, then train a tuned model and export predictions for submission.


This notebook is organized into the following steps:

1. **Load and inspect data**: understand the structure, size, and label balance of the dataset.
2. **Build text feature (title + article)**: combine text fields into a single input column.
3. **Baseline TF-IDF + Logistic Regression**: create a first working model and check that the pipeline runs.
4. **Train/validation split**: create a robust evaluation setup with stratification.
5. **Baseline evaluation (Macro F1)**: quantify how good the baseline is across all classes.
6. **Handle class imbalance & hyperparameter tuning**: improve the model with class weights and TF-IDF / LogisticRegression tuning.
7. **Alternative models / advanced experiments**: try SVMs, character n-grams, and other ideas for potential gains.
8. **Final chosen pipeline**: fix the best settings and retrain on all labeled data.
9. **Generate submission file**: run the final model on `evaluation.csv` and save predictions to CSV.


In the sections below, each code block is paired with explanations of:

- what the cell is trying to achieve (aim), and

- how to read its output (what the result shows).

### 1) Load and inspect data

**Aim:** Load the labeled news articles from `development.csv` and quickly understand the structure of the dataset (columns, dtypes, number of rows) and the distribution of the target labels.


- The next code cells import libraries, read the CSV into a DataFrame `df`, and display `.head()`, `.columns`, and `.info()`.
- The label frequency cell (`df['label'].value_counts()`) shows how many examples there are per class and highlights any class imbalance.


**What the results show:**

- Whether the file loads correctly without errors.
- How many articles you have and what fields are available (e.g. `title`, `article`, `label`).
- That some classes are more frequent than others, which motivates later steps on class imbalance.

In [123]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

In [124]:


df = pd.read_csv("development.csv")
df.head()
df.columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79997 entries, 0 to 79996
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Id         79997 non-null  int64 
 1   source     79997 non-null  object
 2   title      79996 non-null  object
 3   article    79996 non-null  object
 4   page_rank  79997 non-null  int64 
 5   timestamp  79997 non-null  object
 6   label      79997 non-null  int64 
dtypes: int64(3), object(4)
memory usage: 4.3+ MB


In [91]:
df['label'].value_counts()

label
0    23542
5    13053
2    11161
1    10588
3     9977
4     8574
6     3102
Name: count, dtype: int64
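
To put a number on the imbalance, the counts printed above can be turned into class shares and a largest-to-smallest ratio. This is a minimal sketch that re-enters the printed counts by hand; the names `label_counts`, `label_share`, and `imbalance_ratio` are illustrative and not part of the notebook.

```python
import pandas as pd

# Class counts copied from the value_counts() output shown above.
label_counts = pd.Series(
    {0: 23542, 5: 13053, 2: 11161, 1: 10588, 3: 9977, 4: 8574, 6: 3102},
    name="count",
)

# Share of each class in the whole dataset.
label_share = label_counts / label_counts.sum()

# Ratio between the most frequent and the least frequent class.
imbalance_ratio = label_counts.max() / label_counts.min()

print(label_share.round(3))
print(f"largest / smallest class: {imbalance_ratio:.2f}")
```

On the real DataFrame, `df['label'].value_counts(normalize=True)` should give the same shares directly.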

### 2) Build text feature (title + article)

**Aim:** Create a single text field that concatenates `title` and `article` so the model can learn from both in one go.


- The following code cell constructs `df['text'] = title + ' ' + article` with missing values filled as empty strings.
- The `.head()` on `df['text']` is just a quick sanity check that the text looks reasonable.


**What the results show:**

- Each row now has a clean combined text representation.
- You can visually confirm that the concatenation worked and there are no obvious issues like `NaN` or duplicated separators.

In [92]:
df['text'] = df['title'].fillna('') + ' ' + df['article'].fillna('')
df['text'].head()

0    OPEC Boosts Nigeria&#39;s Oil Revenue By .82m ...
1    Yearender: Mideast peace roadmap reaches dead-...
2    Battleground Dispatches for Oct. 5 \\n    (CQP...
3    Air best to resuscitate newborns Air rather th...
4    High tech German train crash kills at least on...
Name: text, dtype: object

### 3) Baseline approach using TF-IDF + Logistic Regression

**Aim:** Turn the raw text into numerical features with TF-IDF and train a simple Logistic Regression classifier as a strong, interpretable baseline.


- The TF-IDF vectorizer converts each article into a sparse vector of word weights (up to 50,000 features). With the default `ngram_range=(1, 1)` used here, each feature is a single word, not a phrase.

- The shape printed after `fit_transform` confirms how many samples (rows) and features (columns) you obtained.


**What the results show:**

- That the text-to-numbers transformation succeeds (no errors, non-zero dimensions).

- The very high dimensionality of the feature space, which motivates using linear models with regularization.

In [93]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    stop_words="english",
    max_features=50000
)

X_tfidf = vectorizer.fit_transform(df["text"])
X_tfidf.shape

(79997, 50000)

**Description:** rows are articles, columns are words (features).

- **79997** is the number of articles. `development.csv` contains 79,997 news articles, and each row corresponds to one of them.
- **50000** is the number of features. With `max_features=50000`, TF-IDF keeps the 50,000 terms with the highest frequency across the corpus (after removing English stop words), and each column corresponds to one of those words.

One row looks like this: `[0.0, 0.12, 0.0, 0.87, 0.03, 0.0, ...]`. Most entries are zero, which is why the matrix is stored in sparse form.

This output confirms that:

- the text-to-numbers conversion worked,
- you now have a valid ML input, and
- each article is represented consistently (the same 50,000 columns for every row).

This is an important milestone, even if it looks simple.

### 4) Train / validation split

**Aim:** Split the labeled data into training and validation sets so that model performance is measured on unseen examples in a fair, reproducible way.


- `X = X_tfidf` and `y = df['label']` define the feature matrix and target vector.
- `train_test_split(..., test_size=0.2, random_state=42, stratify=y)` creates an 80/20 split while keeping class proportions similar in both sets.
- The resulting shapes printed at the end confirm the sizes of train and validation sets.


**What the results show:**

- The split completed successfully and produced non-empty train/validation partitions.
- The number of features (columns) is identical in both sets, as expected.
- Thanks to `stratify=y`, rare classes are still present in the validation set, making Macro F1 evaluation meaningful.
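
The effect of `stratify=y` can be checked by comparing class shares before and after the split. The sketch below uses synthetic labels (an assumption, chosen only to be imbalanced) and a hypothetical helper `class_share`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels (assumption: not the notebook's data).
rng = np.random.default_rng(42)
y_toy = np.array([0] * 700 + [1] * 250 + [2] * 50)
X_toy = rng.normal(size=(len(y_toy), 3))

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)


def class_share(labels):
    """Return the fraction of samples per class, ordered by class id."""
    counts = np.bincount(labels)
    return counts / counts.sum()


print("full :", class_share(y_toy))
print("train:", class_share(y_tr))
print("val  :", class_share(y_va))
```

The same check on the real split would be `y_train.value_counts(normalize=True)` next to `y_val.value_counts(normalize=True)`.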

In [94]:
from sklearn.model_selection import train_test_split

X = X_tfidf          # features (numbers)
y = df['label']     # target labels

X_train, X_val, y_train, y_val = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

X_train.shape, X_val.shape

((63997, 50000), (16000, 50000))

What `X = X_tfidf` and `y = df['label']` mean:

- `X`: the input features, here the TF-IDF matrix (numbers representing article text).
- `y`: the target variable, the label (0 to 6) indicating the news category.

In ML notation, `X` = inputs and `y` = correct answers.

What the `train_test_split(...)` call does:

1. **Splits the data.** `X_train`, `y_train` are used to train the model; `X_val`, `y_val` are used to evaluate it on unseen data.
2. **`test_size=0.2`.** 20% of the data goes to validation and 80% remains for training. This is a standard choice in the lectures.
3. **`random_state=42`.** Fixes the randomness of the split, so running the code again gives the same split. This reproducibility is very important for debugging and for fair comparison of models.
4. **`stratify=y`.** Keeps the class proportions the same in train and validation. This is important because the dataset is imbalanced (e.g. Health is rare). Without it, the validation set could miss or under-represent rare classes and the evaluation would be misleading.

What `X_train.shape, X_val.shape` checks:

- Confirms the split worked.
- Shows how many samples are in each set (63,997 for training, 16,000 for validation).
- The number of columns (features) stays the same (50,000).



In [95]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(
    max_iter=1000,
    n_jobs=-1
)

model.fit(X_train, y_train)



Parameters of the fitted estimator, as displayed by the notebook:

| Parameter | Value |
| --- | --- |
| `penalty` | `'deprecated'` |
| `C` | `1.0` |
| `l1_ratio` | `0.0` |
| `dual` | `False` |
| `tol` | `0.0001` |
| `fit_intercept` | `True` |
| `intercept_scaling` | `1` |
| `class_weight` | `None` |
| `random_state` | `None` |
| `solver` | `'lbfgs'` |



Each row is a hyperparameter of Logistic Regression.

Hyperparameters are settings that:

- are chosen by you,
- are fixed before training, and
- control how the model learns.

**`solver = 'lbfgs'`**

This is the optimization algorithm used to find the weights. In plain English: it is the math engine that adjusts the model until it fits the data.

Why this is good:

- `lbfgs` is the scikit-learn default,
- it works well for multiclass classification, and
- it handles many features (like TF-IDF).

Fully aligned with course defaults.

**`max_iter = 1000`**

This is very important. In plain English: it is the maximum number of steps the optimizer is allowed to take.

Why we increased it:

- TF-IDF has 50,000 features,
- the default (100) is often not enough for the optimizer to converge, and
- 1000 prevents premature stopping.

Correct and recommended.

**`C = 1.0`**

This controls regularization strength; `C` is the inverse of that strength. In plain English:

- Large `C` → weaker regularization, so the model fits the data more closely.
- Small `C` → stronger regularization, so the model is more conservative.

`C = 1.0` means: "Use the default amount of regularization."

A sound baseline choice. We may tune this later, not now.

**`penalty = 'deprecated'`**

This looks scary but it is not a problem.

What it really means:

- You did not explicitly set a penalty.
- In this scikit-learn version the `penalty` argument is deprecated in favor of `l1_ratio`, and `l1_ratio = 0.0` (the value shown in the parameter list) corresponds to an L2 penalty.

So effectively you are using L2 regularization, which is standard. You can safely ignore this for now.

**`class_weight = None`**

This means all classes are treated equally during training.

Is this okay?

- Yes, for a baseline.
- Later we may try `class_weight='balanced'` as an improvement.

Right now this is totally fine.

**`n_jobs = -1`**

In plain English: use all available CPU cores.

This can only affect speed, not results. In `LogisticRegression` the setting has traditionally parallelized over classes in one-vs-rest mode, so with the multinomial `lbfgs` fit used here it may bring little or no speed-up.

Harmless, but not essential.

In summary:

- `LogisticRegression(...)` → creates the model object.
- `max_iter=1000` → allows more training iterations so the model converges (important with many features like TF-IDF).
- `n_jobs=-1` → requests all available CPU cores (for speed).
- `model.fit(X_train, y_train)` → this is where learning happens: the model finds patterns linking word features to labels.
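
One detail worth noting: in the cells above, `TfidfVectorizer` is fit on all rows before the split, so validation text also contributes to the vocabulary and IDF weights. A common way to avoid this is to wrap both steps in a `Pipeline` and fit it on raw training text only. The sketch below shows the pattern on a tiny made-up corpus; `texts`, `labels`, and `text_clf` are illustrative names, not the notebook's final pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Tiny made-up corpus (assumption: stands in for df["text"] and df["label"]).
texts = [
    "stocks rally as markets open", "shares fall on weak earnings",
    "bank profits beat forecasts", "investors eye interest rates",
    "team wins the cup final", "striker scores late goal",
    "coach praises young squad", "club signs new defender",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

txt_train, txt_val, lab_train, lab_val = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The vectorizer is fit inside the pipeline, so it only sees training text.
text_clf = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])
text_clf.fit(txt_train, lab_train)

val_pred = text_clf.predict(txt_val)
print(val_pred)
```

Whether this changes the validation score noticeably on the real data is an open question; the main benefit is that the same object can later be refit on all labeled data and applied to new text in one call.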

### 5) Baseline evaluation (Macro F1)

**Aim:** Evaluate the first Logistic Regression model on the validation set using Macro F1, which gives equal importance to all classes.


- The next code cells compute predictions on `X_val` and then call `f1_score(y_val, y_val_pred, average="macro")`.

- Macro F1 is the main competition metric, so this value is your baseline to beat.


**What the results show:**

- A single number (around 0.65 in your runs) that summarizes how well the model performs across all 7 categories.
- If this score is much higher than random guessing and simple baselines, it confirms that the pipeline is meaningful and ready for further tuning.
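
Macro F1 is the unweighted mean of the per-class F1 scores, so a small class counts as much as a large one. The sketch below illustrates this on made-up labels (not the notebook's predictions) and contrasts it with micro F1, which for single-label multiclass problems equals accuracy.

```python
import numpy as np
from sklearn.metrics import f1_score

# Small made-up example (assumption: not the notebook's y_val / y_val_pred).
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 2, 0])

per_class_f1 = f1_score(y_true, y_pred, average=None)   # one score per class
macro_f1 = f1_score(y_true, y_pred, average="macro")    # plain mean of those
micro_f1 = f1_score(y_true, y_pred, average="micro")    # pooled over samples

print("per-class F1:", per_class_f1.round(3))
print("macro F1    :", round(macro_f1, 3))
print("micro F1    :", round(micro_f1, 3))
```

Printing `f1_score(y_val, y_val_pred, average=None)` on the real validation set would show which categories pull the macro average down.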

In [96]:
y_val_pred = model.predict(X_val)

In [97]:
from sklearn.metrics import f1_score

f1_macro = f1_score(y_val, y_val_pred, average="macro")
f1_macro

0.6475014177285618


A Macro F1 of 0.6475 means that, on unseen validation data, the unweighted average of the per-class F1 scores over all 7 categories is about 0.65, so every category counts equally regardless of its size.


"The baseline Logistic Regression model achieves a Macro F1 of approximately 0.65 on the validation set, indicating that it generalizes reasonably well across all news categories, including underrepresented ones. This suggests that the feature extraction and learning pipeline works as intended."

### 6) Handle class imbalance (class_weight)

**Aim:** Make the model pay more attention to minority classes by using `class_weight="balanced"` in Logistic Regression, and see if this improves Macro F1.


- The following code defines `model_balanced` with `class_weight="balanced"`, fits it on the same train split, and evaluates Macro F1 on the validation set.
- Comparing `f1_macro_balanced` to the previous baseline tells you whether balancing helps.


**What the results show:**

- If `f1_macro_balanced` is higher than the original baseline, then weighting classes is beneficial and should be kept in later models.
- If the score is similar or worse, it suggests that imbalance is not the main bottleneck or that other tuning is needed.
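
For reference, `class_weight="balanced"` uses the formula `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces it on made-up labels with scikit-learn's `compute_class_weight`; the variable names are illustrative.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up imbalanced labels (assumption: not the notebook's y_train).
y_toy = np.array([0] * 60 + [1] * 30 + [2] * 10)
classes = np.unique(y_toy)

# scikit-learn's "balanced" heuristic.
balanced = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# The same formula written out by hand.
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

for cls, weight in zip(classes, balanced):
    print(f"class {cls}: weight {weight:.3f}")
```

Rare classes get weights above 1 and frequent classes get weights below 1, so errors on rare classes cost more during training.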

In [103]:
from sklearn.linear_model import LogisticRegression

model_balanced = LogisticRegression(
    max_iter=1000,
    class_weight="balanced"
)

model_balanced.fit(X_train, y_train)

Parameters of the fitted estimator, as displayed by the notebook:

| Parameter | Value |
| --- | --- |
| `penalty` | `'deprecated'` |
| `C` | `1.0` |
| `l1_ratio` | `0.0` |
| `dual` | `False` |
| `tol` | `0.0001` |
| `fit_intercept` | `True` |
| `intercept_scaling` | `1` |
| `class_weight` | `'balanced'` |
| `random_state` | `None` |
| `solver` | `'lbfgs'` |


In [102]:
from sklearn.metrics import f1_score

y_val_pred_balanced = model_balanced.predict(X_val)
f1_macro_balanced = f1_score(y_val, y_val_pred_balanced, average="macro")
f1_macro_balanced

0.6600408599457183

### 6) Hyperparameter tuning overview

**Aim:** Systematically explore different hyperparameters for Logistic Regression and TF-IDF to find configurations that yield better Macro F1 than the simple baseline.


In the next cells you will tune:

- **C** for Logistic Regression (inverse of the regularization strength).

- **TF-IDF options** such as `ngram_range`, `min_df`, `max_df`, `sublinear_tf`, and `max_features`.

Some settings are tested via a simple train/validation split, others via cross-validation with `StratifiedKFold`.



**What the results show:**

- Printed tables of `(hyperparameter value → mean Macro F1 ± std)` or lines like `C=1  MacroF1=...`.

- From these, you select reasonable defaults (e.g. `C=1`, `ngram_range=(1,2)`, `min_df=2`) to use in the final pipeline.
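
The cross-validation variant could look roughly like the sketch below, which compares two `ngram_range` settings with `StratifiedKFold` and reports mean Macro F1 ± std. The corpus is made up and tiny, and the fold count and parameter values are assumptions for illustration, so the scores themselves carry no meaning.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Made-up corpus (assumption: stands in for df["text"] / df["label"]).
texts = [
    "stocks rally as markets open", "shares fall on weak earnings",
    "bank profits beat forecasts", "investors eye interest rates",
    "markets slide on rate fears", "earnings lift bank shares",
    "team wins the cup final", "striker scores late goal",
    "coach praises young squad", "club signs new defender",
    "goal seals cup win for club", "squad trains before final",
]
labels = [0] * 6 + [1] * 6

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

cv_results = {}
for ngram_range in [(1, 1), (1, 2)]:
    # Vectorizer and classifier are refit inside every fold.
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=ngram_range, sublinear_tf=True)),
        ("clf", LogisticRegression(max_iter=1000, class_weight="balanced", C=1)),
    ])
    scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="f1_macro")
    cv_results[ngram_range] = scores
    print(f"ngram_range={ngram_range}  MacroF1={scores.mean():.3f} ± {scores.std():.3f}")
```

Reporting the standard deviation alongside the mean helps judge whether a difference between two settings is larger than the fold-to-fold noise.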

In [104]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

Cs = [0.25, 0.5, 1, 2, 4, 8]

results = []
for C in Cs:
    m = LogisticRegression(max_iter=1000, class_weight="balanced", C=C)
    m.fit(X_train, y_train)
    pred = m.predict(X_val)
    f1 = f1_score(y_val, pred, average="macro")
    results.append((C, f1))
    print(f"C={C:<4}  MacroF1={f1:.5f}")

best = max(results, key=lambda x: x[1])
print("\nBEST:", best)

C=0.25  MacroF1=0.65316
C=0.5   MacroF1=0.65759
C=1     MacroF1=0.66004
C=2     MacroF1=0.65472
C=4     MacroF1=0.64859
C=8     MacroF1=0.64073

BEST: (1, 0.6600408599457183)


### Decision: C = 1

**Aim:** Summarize the Logistic Regression `C` tuning results and record the chosen value.

The previous loop evaluated several `C` values (inverse regularization strengths) using Macro F1 on the validation set.
`C = 1` gave the highest score (0.66004). Macro F1 drops on both sides, to 0.65316 at `C = 0.25` and to 0.64073 at `C = 8`, which is consistent with underfitting under stronger regularization and overfitting under weaker regularization. `C = 1` will therefore be used as the default in later models.

**What the results show:**
- `C = 1` is the best-performing value among those tested on this validation split.
- Future experiments use this value unless stated otherwise.

### Future idea: two-branch TF-IDF (title vs article)

**Aim:** Consider a more advanced architecture where title and article text are vectorized separately (two branches) and then combined with different weights.

This is a design idea only and is **not implemented** in the current notebook.
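
Purely as an illustration of the idea (not part of this notebook's experiments), one possible shape is a `ColumnTransformer` with one `TfidfVectorizer` per column and `transformer_weights` to scale the branches. The toy data and the weights `2.0` / `1.0` are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data (assumption: mimics the title / article / label columns).
toy_df = pd.DataFrame({
    "title": ["Oil prices jump", "Cup final tonight", "Markets slide", "Striker injured"],
    "article": [
        "Crude oil rose after producers cut output.",
        "Two clubs meet in the final this evening.",
        "Shares fell as investors worried about rates.",
        "The forward will miss three matches.",
    ],
    "label": [0, 1, 0, 1],
})

two_branch = ColumnTransformer(
    transformers=[
        # A plain string (not a list) selects a 1-D column, as text vectorizers expect.
        ("title", TfidfVectorizer(), "title"),
        ("article", TfidfVectorizer(), "article"),
    ],
    # Hypothetical weights: title features scaled up relative to article features.
    transformer_weights={"title": 2.0, "article": 1.0},
)

two_branch_clf = Pipeline([
    ("features", two_branch),
    ("clf", LogisticRegression(max_iter=1000)),
])
two_branch_clf.fit(toy_df[["title", "article"]], toy_df["label"])

fitted = two_branch_clf.named_steps["features"]
n_title = len(fitted.named_transformers_["title"].vocabulary_)
n_article = len(fitted.named_transformers_["article"].vocabulary_)
print(f"title features: {n_title}, article features: {n_article}")
```

On the real data, missing titles or articles would need `fillna('')` first, and the branch weights would be one more hyperparameter to tune.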

### Future idea: LogisticRegression L1 / Elastic Net (saga)

**Aim:** Try sparsity-inducing penalties (L1 or Elastic Net) with the `saga` solver to see if they improve performance or interpretability.

Also a **future experiment only**: there is no corresponding code in this notebook yet.
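
As a rough sketch of what such an experiment might look like, the example below fits an Elastic Net model with `saga` on a made-up corpus and compares the share of exactly-zero coefficients with a default L2 model. It uses the long-standing `penalty="elasticnet"` spelling; scikit-learn 1.8 deprecates `penalty` in favor of `l1_ratio` alone, so a warning may appear there. The values `l1_ratio=0.5` and `C=1.0` are arbitrary assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up corpus (assumption: not the notebook's data).
texts = [
    "stocks rally as markets open", "shares fall on weak earnings",
    "bank profits beat forecasts", "investors eye interest rates",
    "team wins the cup final", "striker scores late goal",
    "coach praises young squad", "club signs new defender",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

X_small = TfidfVectorizer().fit_transform(texts)

# Reference model: default L2 penalty with lbfgs.
l2_clf = LogisticRegression(max_iter=1000, C=1.0)
l2_clf.fit(X_small, labels)

# Elastic Net requires the saga solver; l1_ratio=0.5 mixes L1 and L2 equally.
enet_clf = LogisticRegression(
    penalty="elasticnet", solver="saga", l1_ratio=0.5, C=1.0,
    max_iter=5000, random_state=42,
)
enet_clf.fit(X_small, labels)


def zero_fraction(clf):
    """Share of coefficients that are exactly zero."""
    return float(np.mean(clf.coef_ == 0))


print(f"L2 zero coefficients         : {zero_fraction(l2_clf):.0%}")
print(f"Elastic Net zero coefficients: {zero_fraction(enet_clf):.0%}")
```

A higher share of zero coefficients means fewer words drive each prediction, which is where the interpretability benefit would come from.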

### Future idea: sample weights (inverse class frequency)

**Aim:** Weight each training example by the inverse frequency of its class, as an alternative to `class_weight='balanced'`.

This is listed as a potential improvement and is **not implemented** below.
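
For illustration only, per-sample weights can be derived with `compute_sample_weight` and passed to `fit` through `sample_weight`. The data below is synthetic and the names are assumptions of this sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic imbalanced data (assumption: not the notebook's features).
rng = np.random.default_rng(0)
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = rng.normal(size=(100, 4)) + y_toy[:, None]  # shift class 1 so it is learnable

# One weight per row: n_samples / (n_classes * count of that row's class).
weights = compute_sample_weight(class_weight="balanced", y=y_toy)

weighted_clf = LogisticRegression(max_iter=1000)
weighted_clf.fit(X_toy, y_toy, sample_weight=weights)

print("weight for class 0 rows:", weights[y_toy == 0][0])
print("weight for class 1 rows:", weights[y_toy == 1][0])
```

With the `"balanced"` rule this should behave like `class_weight='balanced'`; the approach becomes a real alternative once the weights follow a different rule, for example a softened inverse frequency.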

### Future idea: TF-IDF custom preprocessor

**Aim:** Build a custom text preprocessor to normalize numbers, URLs, emails, etc., before TF-IDF, which might make features more stable.

Again, this is only a **note for future work** and not implemented here.
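
A minimal sketch of such a preprocessor is shown below. The regular expressions, the placeholder tokens (`urltoken`, `emailtoken`, `numtoken`), and the HTML-entity unescaping are all assumptions chosen for illustration. Note that passing `preprocessor=` replaces the vectorizer's built-in lowercasing and accent stripping, so the function lowercases the text itself.

```python
import html
import re

from sklearn.feature_extraction.text import TfidfVectorizer

# Deliberately simple patterns (assumption: good enough for a first experiment).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
NUMBER_RE = re.compile(r"\b\d+(?:[.,]\d+)*\b")


def normalize_text(text):
    """Lowercase and replace URLs, emails, and numbers with placeholder tokens."""
    text = html.unescape(text)  # e.g. "&#39;" becomes "'"
    text = text.lower()
    text = URL_RE.sub(" urltoken ", text)
    text = EMAIL_RE.sub(" emailtoken ", text)
    text = NUMBER_RE.sub(" numtoken ", text)
    return re.sub(r"\s+", " ", text).strip()


sample = "Nigeria&#39;s revenue rose 12.5% see http://example.com or mail news@example.org"
normalized = normalize_text(sample)
print(normalized)

# The function plugs into TF-IDF through the `preprocessor` argument.
custom_vectorizer = TfidfVectorizer(preprocessor=normalize_text)
custom_vectorizer.fit([sample, "Call 555 now"])
print(sorted(custom_vectorizer.vocabulary_))
```

Collapsing every number into one token trades detail for stability: the model can learn that "a number appears here" without memorizing thousands of distinct values.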

### TF-IDF strip_accents and token_pattern

**Aim:** Explore alternative tokenization and accent-stripping options for TF-IDF.

The next code cell tests a few configurations of `strip_accents` and `token_pattern` and reports whether they succeed and how they affect Macro F1.
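
To make the two options concrete, the sketch below fits three vectorizer configurations on a two-sentence toy corpus and prints the resulting vocabularies. The corpus and the configuration names are made up for illustration; the real experiment compares Macro F1, not vocabularies.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy sentences with accented characters, written as escapes for robustness.
toy_docs = ["Caf\u00e9 in S\u00e3o Paulo", "Plan B for the caf\u00e9"]

configs = {
    "default": {},
    "strip_accents=unicode": {"strip_accents": "unicode"},
    # The default pattern keeps tokens of 2+ characters; this one keeps 1-character tokens too.
    "keep 1-char tokens": {"token_pattern": r"(?u)\b\w+\b"},
}

vocabularies = {}
for name, params in configs.items():
    vec = TfidfVectorizer(**params)
    vec.fit(toy_docs)
    vocabularies[name] = sorted(vec.vocabulary_)
    print(f"{name:<24} {vocabularies[name]}")
```

Accent stripping merges spelling variants into one feature, while the wider token pattern adds very short tokens that the default would discard.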

### TF-IDF max_features tuning

(See the following code cell for the actual experiment and Macro F1 results.)

We choose `min_df = 2` because it is safer and simpler: it removes extremely rare words while keeping enough signal for the model.
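
The effect of `min_df` is easy to see on a toy corpus (made up for this sketch): with `min_df=2`, any word that occurs in only one document is dropped from the vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (assumption: not the notebook's data).
toy_docs = [
    "oil prices rise again",
    "oil output falls",
    "prices fall as output rises",
]

vocab_by_min_df = {}
for min_df in (1, 2):
    vec = TfidfVectorizer(min_df=min_df)
    vec.fit(toy_docs)
    vocab_by_min_df[min_df] = sorted(vec.vocabulary_)
    print(f"min_df={min_df}: {len(vocab_by_min_df[min_df])} terms -> {vocab_by_min_df[min_df]}")
```

On a corpus of roughly 80,000 articles the same rule mostly removes typos and one-off names, which is why it shrinks the vocabulary with little loss of signal.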
