In [5]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

1) Load and inspect data

In [6]:


df = pd.read_csv("development.csv")
df.head()
df.columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79997 entries, 0 to 79996
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Id         79997 non-null  int64 
 1   source     79997 non-null  object
 2   title      79996 non-null  object
 3   article    79996 non-null  object
 4   page_rank  79997 non-null  int64 
 5   timestamp  79997 non-null  object
 6   label      79997 non-null  int64 
dtypes: int64(3), object(4)
memory usage: 4.3+ MB


In [7]:
df['label'].value_counts()

label
0    23542
5    13053
2    11161
1    10588
3     9977
4     8574
6     3102
Name: count, dtype: int64

2) Build text feature (title + article)

fillna('') replaces missing titles and articles with empty strings
this prevents NaN values from propagating through the string concatenation
a space is added between title and article

In [8]:
df['text'] = df['title'].fillna('') + ' ' + df['article'].fillna('')
df['text'].head()

0    OPEC Boosts Nigeria&#39;s Oil Revenue By .82m ...
1    Yearender: Mideast peace roadmap reaches dead-...
2    Battleground Dispatches for Oct. 5 \\n    (CQP...
3    Air best to resuscitate newborns Air rather th...
4    High tech German train crash kills at least on...
Name: text, dtype: object

TRAIN/VAL SPLIT

X_text -> selects the input: the text the model reads
y -> selects the target value: in supervised learning, what the model must predict

In [9]:
X_text = df["text"]
y = df["label"]

random split between train and val

X_train_text
text used to train the model

y_train
labels used to learn

X_val_text
text the model never sees during training

y_val
true labels used only for evaluation


In [10]:
from sklearn.model_selection import train_test_split

X_train_text, X_val_text, y_train, y_val = train_test_split(
    X_text,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

3) Baseline approach using TF-IDF + Logistic Regression (exploratory baseline, discarded)

text classification
stop_words="english" removes very common English words
max_features caps the vocabulary size; we don't want to overfit, so we should choose a sensible value

fit_transform: fit -> learns the vocabulary, how frequent each term is, and its IDF weight

transform -> converts each document into a TF-IDF vector
each row: article
each column: one word
each value: importance of that word in the article
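
To make "each value: importance of that word in the article" concrete, here is a minimal pure-Python sketch of the calculation. It assumes a plain whitespace tokenizer and no stop-word removal, and it uses the smoothed IDF formula and L2 row normalization that scikit-learn documents as the TfidfVectorizer defaults; the function name `tfidf_rows` and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) with smoothed IDF and L2-normalized rows."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF: ln((1 + n_docs) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

# Hypothetical toy corpus: "oil" appears in two documents, "prices" in one.
toy_docs = ["oil prices rise", "oil exports fall", "team wins final"]
vocab, rows = tfidf_rows(toy_docs)
oil_weight = rows[0][vocab.index("oil")]
prices_weight = rows[0][vocab.index("prices")]
```

In the first toy document, `prices_weight` is larger than `oil_weight` because "oil" also occurs in another document and therefore gets a lower IDF.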

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    stop_words="english",
    max_features=50000
)

X_tfidf = vectorizer.fit_transform(df["text"])
X_tfidf.shape

(79997, 50000)

Raw text (title + article)
↓
TF-IDF vectors (numerical, sparse, high-dimensional)
↓
Ready for LinearSVC / LogisticRegression

NOTE: TF-IDF is fitted on all the data (train + val), so the vocabulary and IDF weights have already seen the validation articles

we will fix it later
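
A minimal sketch of the leak-free order of operations, assuming a toy whitespace tokenizer and raw counts instead of TF-IDF weights: the vocabulary is learned from the training documents only and then reused unchanged on the validation documents. The helper names `fit_vocabulary` and `to_counts` and the example documents are hypothetical.

```python
def fit_vocabulary(train_docs):
    """Learn the vocabulary from the training documents only."""
    return sorted({tok for doc in train_docs for tok in doc.lower().split()})

def to_counts(doc, vocabulary):
    """Count vocabulary terms in one document; unseen words are ignored."""
    index = {term: i for i, term in enumerate(vocabulary)}
    row = [0] * len(vocabulary)
    for tok in doc.lower().split():
        if tok in index:
            row[index[tok]] += 1
    return row

train_docs = ["oil prices rise", "team wins final"]
val_docs = ["oil prices collapse"]

# fit: training split only
vocabulary = fit_vocabulary(train_docs)
# transform: reuse the fitted vocabulary, never refit on validation data
val_rows = [to_counts(doc, vocabulary) for doc in val_docs]
```

The word "collapse" never enters the vocabulary because it only occurs in the validation document. The Pipeline used in section 6 gets the same effect, since the vectorizer is fitted inside pipe.fit(X_train_text, y_train).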

Description:
Rows = articles
Columns = words (features)

79997 → number of articles
You have 79,997 news articles in development.csv.
Each row corresponds to one article.

50000 → number of features (words)
You told TF-IDF to keep at most 50,000 features (max_features=50000):
- The vectorizer kept the 50,000 terms with the highest frequency across the corpus
- Each column corresponds to one word (the default ngram_range=(1,1) produces single words only)

One row looks like this: [0.0, 0.12, 0.0, 0.87, 0.03, 0.0, ...]

This line confirms that:
- Your text → numbers conversion worked
- You now have a valid ML input
- Each article is represented consistently

This is a big milestone, even if it looks simple.

TRAIN VALIDATION SPLIT

In [12]:
from sklearn.model_selection import train_test_split

X = X_tfidf          # features (numbers)
y = df['label']     # target labels

X_train, X_val, y_train, y_val = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

X_train.shape, X_val.shape

((63997, 50000), (16000, 50000))

X = X_tfidf
y = df['label']
What this means:
- X → the input features
  - Here: the TF-IDF matrix (numbers representing article text)
- y → the target variable
  - The label (0–6) indicating the news category

In ML notation:
- X = inputs
- y = correct answers

X_train, X_val, y_train, y_val = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
1. Splits the data
- X_train, y_train → used to train the model
- X_val, y_val → used to evaluate the model on unseen data
2. test_size=0.2
- 20% of the data goes to validation
- 80% remains for training
This is a standard choice in the lectures.
3. random_state=42
- Fixes the randomness of the split
- Ensures reproducibility
- Running the code again gives the same split

Very important for:
- debugging
- fair comparison of models
4. stratify=y
- Keeps the class proportions the same in train and validation
- Important because the dataset is imbalanced (e.g. Health is rare; label counts range from 3,102 to 23,542)

Without this:
- the validation set could under-represent rare classes
- evaluation would be misleading
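
The effect of stratify=y can be sketched in pure Python. This is an illustration of the idea (split each class separately, then merge), not scikit-learn's actual implementation; the function name `stratified_split` and the toy labels are assumptions.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, val_idx) with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)
    train_idx, val_idx = [], []
    for lab, idxs in by_class.items():
        rng.shuffle(idxs)                      # shuffle within the class
        n_val = round(len(idxs) * test_size)   # same fraction for every class
        val_idx.extend(idxs[:n_val])
        train_idx.extend(idxs[n_val:])
    return train_idx, val_idx

# Hypothetical imbalanced labels: 80 articles of class 0, 20 of class 6.
labels = [0] * 80 + [6] * 20
train_idx, val_idx = stratified_split(labels)
val_counts = Counter(labels[i] for i in val_idx)
```

Here `val_counts` holds 16 articles of class 0 and 4 of class 6, so the 80/20 class ratio of the full toy set carries over to the validation set.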

X_train.shape, X_val.shape
What this checks:
- Confirms the split worked
- Shows how many samples are in each set
- Number of columns (features) stays the same



In [13]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(
    max_iter=1000,
    n_jobs=-1
)

model.fit(X_train, y_train)



Parameter          Value
-----------------  ------------
penalty            'deprecated'
C                  1.0
l1_ratio           0.0
dual               False
tol                0.0001
fit_intercept      True
intercept_scaling  1
class_weight       None
random_state       None
solver             'lbfgs'



Each row is a hyperparameter of Logistic Regression.

Hyperparameters are:
- chosen by you
- fixed before training
- they control how the model learns
    


solver = 'lbfgs'

This is the optimization algorithm used to find the weights.

Plain English:

This is the math engine that adjusts the model until it fits the data.

Why this is good:
- lbfgs is the default solver
- works well for multiclass classification
- handles many features (like TF-IDF)

Fully aligned with course defaults.

max_iter = 1000

This is very important.

Plain English:

Maximum number of steps the optimizer is allowed to take.

Why we increased it:
- TF-IDF has 50,000 features
- Default (100) is often not enough
- 1000 gives the optimizer room to converge instead of stopping early

Correct and recommended.

C = 1.0

This controls regularization strength; C is the inverse of the regularization strength.

Plain English:
- Large C → weaker regularization, model fits the training data more closely
- Small C → stronger regularization, model is more conservative

C = 1.0 means:

"Use the default amount of regularization."

A sensible baseline choice.
We may tune this later, not now.

penalty = 'deprecated'

This looks scary but it is not a problem.

What it really means:
- You did not explicitly set a penalty
- In this scikit-learn version the penalty parameter is deprecated in favor of l1_ratio
- The default l1_ratio=0.0 corresponds to an L2 penalty

So effectively:

You are using L2 regularization, which is standard.

You can safely ignore this for now.

class_weight = None

This means:

All classes are treated equally during training.

Is this okay?
- Yes, for a baseline
- Later we may try class_weight='balanced' as an improvement

Right now:
Totally fine.


n_jobs = -1

Plain English:

Use all available CPU cores.

This only affects speed, not results. With a multinomial fit using lbfgs it may bring little or no speed-up.

Harmless to set.


- LogisticRegression(...)
  → creates the model object
- max_iter=1000
  → allows more training iterations so the model converges
  (important with many features like TF-IDF)
- n_jobs=-1
  → uses all available CPU cores (faster)
- model.fit(X_train, y_train)
  → this is where learning happens
  The model finds patterns linking word features to labels

4) Baseline evaluation (Macro F1)

In [13]:
y_val_pred = model.predict(X_val)

In [14]:
from sklearn.metrics import f1_score

f1_macro = f1_score(y_val, y_val_pred, average="macro")
f1_macro

0.6475014177285618


0.6475 means:
On the validation data, the model classifies articles reasonably well across all 7 categories, with each category counting equally in the score regardless of its size.


"The baseline Logistic Regression model achieves a Macro F1 of approximately 0.65 on the validation set. Because Macro F1 averages the per-class F1 scores with equal weight, this suggests the model performs reasonably across categories, including underrepresented ones, although per-class scores would be needed to confirm that. The result indicates that the feature extraction and learning pipeline works end to end; since TF-IDF was fitted on train and validation together, the score may be slightly optimistic."
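
To show what "giving equal importance to each category" means in practice, here is a small pure-Python sketch of Macro F1: compute F1 per class, then take the unweighted mean. The function name `macro_f1` and the toy labels are made up for illustration; the notebook itself uses f1_score(..., average="macro").

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy example: class 0 is common, class 1 is rare and half of it is missed.
y_true_toy = [0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1]
accuracy_toy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
macro_f1_toy = macro_f1(y_true_toy, y_pred_toy)
```

In this toy case the accuracy is 5/6 while the Macro F1 is 7/9, because the weak F1 on the rare class counts as much as the strong F1 on the common class.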

5) Handle class imbalance (class_weight)

In [15]:
from sklearn.linear_model import LogisticRegression

model_balanced = LogisticRegression(
    max_iter=1000,
    class_weight="balanced"
)

model_balanced.fit(X_train, y_train)

Parameter          Value
-----------------  ------------
penalty            'deprecated'
C                  1.0
l1_ratio           0.0
dual               False
tol                0.0001
fit_intercept      True
intercept_scaling  1
class_weight       'balanced'
random_state       None
solver             'lbfgs'


In [16]:
from sklearn.metrics import f1_score

y_val_pred_balanced = model_balanced.predict(X_val)
f1_macro_balanced = f1_score(y_val, y_val_pred_balanced, average="macro")
f1_macro_balanced

0.6600408599457183
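
For reference, the weights behind class_weight="balanced" follow the formula n_samples / (n_classes * count). The sketch below applies it to the label counts printed in section 1. Those counts are for the full dataset; the model is fitted on the stratified training split, so the actual weights should be close to these but not identical.

```python
# Label counts copied from df['label'].value_counts() above (full dataset).
label_counts = {0: 23542, 5: 13053, 2: 11161, 1: 10588, 3: 9977, 4: 8574, 6: 3102}

n_samples = sum(label_counts.values())
n_classes = len(label_counts)

# "balanced" weight per class: n_samples / (n_classes * class_count)
balanced_weights = {
    label: n_samples / (n_classes * count)
    for label, count in label_counts.items()
}

for label in sorted(balanced_weights):
    print(f"label {label}: weight {balanced_weights[label]:.3f}")
```

The rarest label (6) gets the largest weight and the most common label (0) the smallest, so every class contributes the same total weight to the loss.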

6) Hyperparameter tuning
- C tuning
- TF-IDF ngram_range tuning
- TF-IDF min_df tuning
- (placeholder for max_df, max_features later)

In [14]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

Cs = [0.25, 0.5, 1, 2, 4, 8]

for C in Cs:
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(
            stop_words="english",
            ngram_range=(1,2),
            min_df=2,
            max_df=1.0,          # don't lock in 0.6 unless it proves better
            sublinear_tf=True,   # use 1 + log(tf) instead of raw term counts
            max_features=100000
        )),
        ("clf", LogisticRegression(
            C=C, max_iter=2000, class_weight="balanced", n_jobs=-1
        ))
    ])
    scores = cross_val_score(pipe, X_train_text, y_train, scoring="f1_macro", cv=cv)
    print("C=", C, "mean=", scores.mean(), "std=", scores.std())



C= 0.25 mean= 0.6521822851941533 std= 0.0018349588448387715
C= 0.5 mean= 0.6618358807948415 std= 0.0013555367375706737
C= 1 mean= 0.6664789544884246 std= 0.0035769640537670743
C= 2 mean= 0.6678991571840923 std= 0.0034437685098182578
C= 4 mean= 0.6656707341117357 std= 0.00210675895660936
C= 8 mean= 0.6611166785622088 std= 0.0028619064054469148


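Decision: C = 1. C=2 has the highest mean Macro F1 (0.6679 vs 0.6665 for C=1), but the gap is smaller than the fold-to-fold standard deviation (about 0.0035), so the two are effectively tied. Note that the pipeline in the next cell is built with C=2, while later cells set C=1.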
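
One way to read the cross-validation table is to find the best mean and then list every C whose mean lies within one standard deviation of it. The sketch below does that with the numbers printed above; the "within one standard deviation" rule is a common heuristic assumed here, not something the notebook prescribes.

```python
# (mean, std) of Macro F1 per C, copied from the cross-validation output above.
cv_results = {
    0.25: (0.6521822851941533, 0.0018349588448387715),
    0.5: (0.6618358807948415, 0.0013555367375706737),
    1: (0.6664789544884246, 0.0035769640537670743),
    2: (0.6678991571840923, 0.0034437685098182578),
    4: (0.6656707341117357, 0.00210675895660936),
    8: (0.6611166785622088, 0.0028619064054469148),
}

# C with the highest mean score.
best_C = max(cv_results, key=lambda c: cv_results[c][0])
best_mean, best_std = cv_results[best_C]

# Every C whose mean is within one std of the best mean is a plausible choice.
within_one_std = sorted(
    c for c, (mean, _std) in cv_results.items() if mean >= best_mean - best_std
)
print("best C:", best_C, "candidates:", within_one_std)
```

With these numbers the best mean belongs to C=2, and C=1, C=2 and C=4 all fall inside the one-standard-deviation band.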

In [18]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(
        C=2,
        max_iter=1000,
        class_weight="balanced"
    ))
])

In [35]:
configs = [
    {"tfidf__ngram_range": (1,1)},
    {"tfidf__ngram_range": (1,2)},
    {"tfidf__ngram_range": (1,3)},
]

for cfg in configs:
    pipe.set_params(**cfg)
    pipe.fit(X_train_text, y_train)
    preds = pipe.predict(X_val_text)
    f1 = f1_score(y_val, preds, average="macro")
    print(cfg, "→ Macro F1:", round(f1, 5))

{'tfidf__ngram_range': (1, 1)} → Macro F1: 0.65907
{'tfidf__ngram_range': (1, 2)} → Macro F1: 0.67554
{'tfidf__ngram_range': (1, 3)} → Macro F1: 0.6778


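Decision: ngram_range = (1,2). (1,3) scores only about 0.002 higher (0.6778 vs 0.67554) while adding trigrams to the vocabulary.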
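
As a reminder of what ngram_range controls, this small sketch builds word n-grams from a token list. It only illustrates the idea; the function name `word_ngrams` and the example tokens are made up, and the real vectorizer also handles tokenization and stop words.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """Return all word n-grams with n between the two bounds (inclusive)."""
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

tokens = ["mideast", "peace", "roadmap"]
unigrams = word_ngrams(tokens, (1, 1))
uni_bigrams = word_ngrams(tokens, (1, 2))
uni_bi_trigrams = word_ngrams(tokens, (1, 3))
```

For these three tokens, (1,1) yields 3 features, (1,2) yields 5 and (1,3) yields 6, which is why wider ranges grow the vocabulary quickly on real text.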

In [20]:
configs = [
    {"tfidf__min_df": 1},
    {"tfidf__min_df": 2},
    {"tfidf__min_df": 3},
]

for cfg in configs:
    pipe.set_params(tfidf__ngram_range=(1,2), **cfg)
    pipe.fit(X_train_text, y_train)
    preds = pipe.predict(X_val_text)
    f1 = f1_score(y_val, preds, average="macro")
    print(cfg, "→ Macro F1:", round(f1, 5))

{'tfidf__min_df': 1} → Macro F1: 0.66983
{'tfidf__min_df': 2} → Macro F1: 0.67469
{'tfidf__min_df': 3} → Macro F1: 0.67526


Decision: min_df = 2. min_df=3 scores about 0.0006 higher (0.67526 vs 0.67469), a negligible difference.


Tuning max_df

In [22]:
from sklearn.metrics import f1_score

max_dfs = [0.6, 0.7, 0.75, 0.8, 0.9, 0.95, 1.0]
results = []

for md in max_dfs:
    pipe.set_params(tfidf__ngram_range=(1,2), tfidf__min_df=2, tfidf__max_df=md, tfidf__sublinear_tf=True)
    pipe.fit(X_train_text, y_train)
    preds = pipe.predict(X_val_text)
    f1 = f1_score(y_val, preds, average="macro")
    results.append((md, f1))
    print(f"max_df={md:<4}  MacroF1={f1:.5f}")

best = max(results, key=lambda x: x[1])
print("\nBEST max_df:", best)

max_df=0.6   MacroF1=0.67491
max_df=0.7   MacroF1=0.67491
max_df=0.75  MacroF1=0.67491
max_df=0.8   MacroF1=0.67491
max_df=0.9   MacroF1=0.67491
max_df=0.95  MacroF1=0.67491
max_df=1.0   MacroF1=0.67491

BEST max_df: (0.6, 0.6749082920824913)


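Every max_df value from 0.6 to 1.0 gives the same Macro F1 (0.67491). This suggests that no remaining term appears in more than 60% of the training documents, so max_df filters nothing in this range. The score is slightly higher than in the min_df=2 run above (0.67469); that gain most likely comes from sublinear_tf=True, which is switched on in this cell, not from max_df.

You are tuning max_df while keeping fixed:
- ngram_range=(1,2)
- min_df=2
- sublinear_tf=True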
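
Why several max_df values can give identical scores: a float max_df drops only terms whose document frequency exceeds max_df times the number of documents. If the most widespread term is already below the lowest threshold tried, every threshold keeps the same vocabulary. The sketch below uses hypothetical document frequencies to show this; `filter_by_max_df` is an illustrative helper, not a scikit-learn function.

```python
def filter_by_max_df(doc_freq, n_docs, max_df):
    """Keep terms whose document frequency is at most max_df * n_docs."""
    limit = max_df * n_docs
    return {term for term, df in doc_freq.items() if df <= limit}

# Hypothetical document frequencies in a corpus of 100 documents.
doc_freq = {"said": 40, "oil": 12, "election": 9, "goal": 5}
n_docs = 100

kept = {md: filter_by_max_df(doc_freq, n_docs, md) for md in (0.3, 0.6, 0.9, 1.0)}
```

In this toy corpus the most frequent term appears in 40% of documents, so max_df=0.3 removes it while 0.6, 0.9 and 1.0 all keep the same four terms.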

In [24]:
from sklearn.metrics import f1_score

max_feats = [20000, 50000, 100000, None]
results = []

for mf in max_feats:
    pipe.set_params(
        tfidf__ngram_range=(1,2),
        tfidf__min_df=2,
        tfidf__max_df=0.6,
        tfidf__sublinear_tf=True,
        tfidf__max_features=mf,
    )
    pipe.fit(X_train_text, y_train)
    preds = pipe.predict(X_val_text)
    f1 = f1_score(y_val, preds, average="macro")
    results.append((mf, f1))
    print(f"max_features={str(mf):<6}  MacroF1={f1:.5f}")

best = max(results, key=lambda x: x[1])
print("\nBEST max_features:", best)

max_features=20000   MacroF1=0.67174
max_features=50000   MacroF1=0.67567
max_features=100000  MacroF1=0.67602
max_features=None    MacroF1=0.67491

BEST max_features: (100000, 0.6760192182529935)


Best max_features: 100,000

In [28]:
from sklearn.metrics import f1_score
from sklearn.feature_extraction.text import CountVectorizer
import traceback

configs = [
    {"tfidf__strip_accents": None,      "tfidf__token_pattern": r"(?u)\b\w\w+\b"},
    {"tfidf__strip_accents": "unicode", "tfidf__token_pattern": r"(?u)\b\w\w+\b"},
    {"tfidf__strip_accents": "unicode"},  # rely on default token_pattern
]

# Sanity check (tokenization); named count_vec so it does not overwrite the StratifiedKFold object `cv`
count_vec = CountVectorizer(stop_words="english", token_pattern=r"(?u)\b\w\w+\b")
try:
    count_vec.fit(X_train_text)
    print("Pre-check OK: vocab size =", len(count_vec.vocabulary_))
except Exception as e:
    print("Pre-check failed →", e)

for cfg in configs:
    pipe.set_params(
        tfidf__ngram_range=(1,2),
        tfidf__min_df=2,
        tfidf__max_df=1.0,
        tfidf__sublinear_tf=True,
        tfidf__max_features=100000,   
        **cfg,
    )
    try:
        pipe.fit(X_train_text, y_train)
        preds = pipe.predict(X_val_text)
        f1 = f1_score(y_val, preds, average="macro")
        print(cfg, "→ Macro F1:", round(f1, 5))
    except ValueError as e:
        print(cfg, "→ Error:", str(e))
    except Exception:
        print(cfg, "→ Unexpected error:")
        traceback.print_exc()

Pre-check OK: vocab size = 95612
{'tfidf__strip_accents': None, 'tfidf__token_pattern': '(?u)\\b\\w\\w+\\b'} → Macro F1: 0.67602
{'tfidf__strip_accents': 'unicode', 'tfidf__token_pattern': '(?u)\\b\\w\\w+\\b'} → Macro F1: 0.67554
{'tfidf__strip_accents': 'unicode'} → Macro F1: 0.67554


Best: strip_accents=None (0.67602 vs 0.67554 with 'unicode')

In [29]:
import re
from sklearn.metrics import f1_score

def simple_preproc(text):
    if text is None:
        return ""
    s = text
    s = re.sub(r"https?://\S+|www\.\S+", " URL ", s)
    s = re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", " EMAIL ", s)
    s = re.sub(r"\d+", " 0 ", s)
    return s

pipe.set_params(
    tfidf__ngram_range=(1,2),
    tfidf__min_df=2,
    tfidf__max_df=1.0,
    tfidf__sublinear_tf=True,
    tfidf__strip_accents='unicode',
    tfidf__preprocessor=simple_preproc,
    tfidf__max_features=100000,  
 )
pipe.fit(X_train_text, y_train)
preds = pipe.predict(X_val_text)
f1 = f1_score(y_val, preds, average="macro")
print("preprocessor on → Macro F1:", round(f1, 5))

# Reset preprocessor (baseline) and compare
pipe.set_params(tfidf__preprocessor=None)
pipe.fit(X_train_text, y_train)
preds = pipe.predict(X_val_text)
f1 = f1_score(y_val, preds, average="macro")
print("preprocessor off → Macro F1:", round(f1, 5))

preprocessor on → Macro F1: 0.66115
preprocessor off → Macro F1: 0.67554


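Decision: no custom preprocessor (0.66115 with it vs 0.67554 without). Note that a custom preprocessor replaces TfidfVectorizer's built-in preprocessing stage, so lowercasing and strip_accents are skipped in the "on" run; part of the drop may come from that and not from the URL/email/digit substitutions themselves.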

In [38]:
import numpy as np
from sklearn.metrics import f1_score

# Build sample weights for y_train
unique, counts = np.unique(y_train, return_counts=True)
freq = dict(zip(unique, counts))
inv_freq = {k: 1.0/v for k, v in freq.items()}
sample_weight = np.array([inv_freq[c] for c in y_train])

# Fit with sample weights (to the classifier step)
pipe.set_params(
    tfidf__ngram_range=(1,2),
    tfidf__min_df=2,
    tfidf__max_df=1.0,
    tfidf__sublinear_tf=True,
    tfidf__max_features=100000,
    clf__C=1,
    clf__max_iter=1000,
    clf__class_weight=None,
)
pipe.fit(X_train_text, y_train, clf__sample_weight=sample_weight)
preds = pipe.predict(X_val_text)
print("MacroF1:", f1_score(y_val, preds, average="macro"))


MacroF1: 0.5532974830089769
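
A likely reason for the drop to 0.553 is the scale of the weights, not the idea of reweighting. With weight 1/count per sample, all weights together sum to just the number of classes, so the data term of the loss becomes tiny next to the L2 penalty at C=1 and the model is heavily over-regularized. The sketch below compares that total with the total under the "balanced" formula. It uses the full-dataset label counts printed in section 1 as a stand-in for the training-split counts, which is an assumption.

```python
# Full-dataset label counts from section 1 (stand-in for the training split).
label_counts = {0: 23542, 5: 13053, 2: 11161, 1: 10588, 3: 9977, 4: 8574, 6: 3102}
n_samples = sum(label_counts.values())
n_classes = len(label_counts)

# Scheme used in the cell above: weight = 1 / class_count for every sample.
inverse_total = sum(count * (1.0 / count) for count in label_counts.values())

# "balanced" scheme: weight = n_samples / (n_classes * class_count).
balanced_total = sum(
    count * (n_samples / (n_classes * count)) for count in label_counts.values()
)

# Rescaling the 1/count weights by this factor gives the balanced weights.
rescale_factor = n_samples / n_classes
print(inverse_total, balanced_total, rescale_factor)
```

The two schemes give the same relative weights per class and differ only by the constant `rescale_factor`. Multiplying the 1/count weights by that factor, or simply using class_weight="balanced", keeps the total weight equal to the number of samples.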


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# L1 penalty
Cs = [0.5, 1.0, 2.0]
for C in Cs:
    pipe.set_params(
        tfidf__ngram_range=(1,2), tfidf__min_df=2, tfidf__max_df=1.0, tfidf__sublinear_tf=True,
        clf=LogisticRegression(solver='saga', penalty='l1', C=C, max_iter=1000, class_weight='balanced', n_jobs=-1),
    )
    pipe.fit(X_train_text, y_train)
    preds = pipe.predict(X_val_text)
    f1 = f1_score(y_val, preds, average='macro')
    print(f"L1 C={C:<4}  MacroF1={f1:.5f}")

# Elastic Net
Cs = [0.5, 1.0, 2.0]
l1_ratios = [0.1, 0.5, 0.9]
for C in Cs:
    for l1 in l1_ratios:
        pipe.set_params(
            tfidf__ngram_range=(1,2), tfidf__min_df=2, tfidf__max_df=1.0, tfidf__sublinear_tf=True,
            clf=LogisticRegression(solver='saga', penalty='elasticnet', l1_ratio=l1, C=C, max_iter=1000, class_weight='balanced', n_jobs=-1),
        )
        pipe.fit(X_train_text, y_train)
        preds = pipe.predict(X_val_text)
        f1 = f1_score(y_val, preds, average='macro')
        print(f"ElasticNet C={C:<4} l1_ratio={l1:<3} MacroF1={f1:.5f}")



In [15]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
import numpy as np

X_cols = df[["title", "article"]]
y_full = df["label"]

Xtr, Xva, ytr, yva = train_test_split(
    X_cols, y_full, test_size=0.2, random_state=42, stratify=y_full
)

def scaler(alpha):
    return FunctionTransformer(lambda X: X * alpha)

alphas = [1.0, 1.5, 2.0, 3.0]
for a in alphas:
    ct = ColumnTransformer([
        ("title", make_pipeline(
            TfidfVectorizer(stop_words='english', ngram_range=(1,2), min_df=2, sublinear_tf=True),
            scaler(a)
        ), "title"),
        ("article", TfidfVectorizer(stop_words='english', ngram_range=(1,2), min_df=2, sublinear_tf=True), "article"),
    ], remainder="drop")

    pipe2 = Pipeline([
        ("features", ct),
        ("clf", LogisticRegression(C=1, max_iter=1000, class_weight='balanced', n_jobs=-1))
    ])
    pipe2.fit(Xtr, ytr)
    preds = pipe2.predict(Xva)
    f1 = f1_score(yva, preds, average="macro")
    print(f"title_weight={a:<3} MacroF1={f1:.5f}")

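ValueError: np.nan is an invalid document, expected byte or unicode string.

Cause: title and article each contain one missing value (see df.info() above), and TfidfVectorizer cannot process NaN. Filling the gaps before the split, e.g. X_cols = df[["title", "article"]].fillna(""), avoids the error.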

In [None]:
from sklearn.pipeline import FeatureUnion
from sklearn.metrics import f1_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

word_vect = TfidfVectorizer(stop_words='english', ngram_range=(1,2), min_df=2, sublinear_tf=True)
char_vect = TfidfVectorizer(analyzer='char', ngram_range=(3,5), min_df=2, max_df=0.95, sublinear_tf=True)

features = FeatureUnion([
    ("word", word_vect),
    ("char", char_vect),
])

pipe_wc = Pipeline([
    ("feats", features),
    ("clf", LogisticRegression(C=1, max_iter=1000, class_weight='balanced', n_jobs=-1))
])

pipe_wc.fit(X_train_text, y_train)
preds = pipe_wc.predict(X_val_text)
f1 = f1_score(y_val, preds, average='macro')
print("Word+Char Macro F1:", round(f1, 5))
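
To see what the FeatureUnion does to the feature space, here is a small sketch on a hypothetical three-document corpus (not the notebook data): the word and char TF-IDF matrices are stacked side by side, so the final width is the sum of the two vocabularies.

```python
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, for illustration only
docs = ["oil prices rise again", "train crash kills one", "oil train derails"]

union = FeatureUnion([
    ("word", TfidfVectorizer(ngram_range=(1, 2))),
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(3, 5))),
])
X = union.fit_transform(docs)

# Look up the fitted vectorizers through the union itself
fitted = dict(union.transformer_list)
n_word = len(fitted["word"].vocabulary_)
n_char = len(fitted["char"].vocabulary_)

print("word features:", n_word)
print("char features:", n_char)
print("combined width:", X.shape[1])
```

Because the blocks are concatenated, the char n-grams can greatly enlarge the matrix on the full dataset, which is worth keeping in mind for fit time and memory.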

In [None]:
import numpy as np
from sklearn.metrics import f1_score

# Fit baseline pipe for proba
pipe.set_params(tfidf__ngram_range=(1,2), tfidf__min_df=2, tfidf__max_df=1.0, tfidf__sublinear_tf=True)
pipe.fit(X_train_text, y_train)
proba = pipe.predict_proba(X_val_text)
classes_ = pipe.named_steps['clf'].classes_
n_classes = len(classes_)

def preds_with_weights(P, w):
    Pw = P * w  # broadcast class-wise weights over columns
    return classes_[np.argmax(Pw, axis=1)]

def eval_f1(w):
    yhat = preds_with_weights(proba, w)
    return f1_score(y_val, yhat, average='macro')

w = np.ones(n_classes)
grid = [0.9, 1.0, 1.1, 1.2]
improved = True
iters = 0
while improved and iters < 3:
    improved = False
    for i in range(n_classes):
        # Re-score the current weights so each class is compared against the latest accepted state
        base = eval_f1(w)
        best_local = (base, w[i])
        for g in grid:
            w_try = w.copy()
            w_try[i] = g
            score = eval_f1(w_try)
            if score > best_local[0] + 1e-6:
                best_local = (score, g)
        if best_local[1] != w[i]:
            w[i] = best_local[1]
            improved = True
    iters += 1

print("best weights:", w)
print("Macro F1 after scaling:", round(eval_f1(w), 5))
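
A toy illustration of the class-wise multipliers, with made-up probabilities and weights (not values from the fitted model): scaling one column can flip the argmax for borderline rows while leaving confident rows unchanged.

```python
import numpy as np

# Hypothetical class labels and predicted probabilities (3 samples x 3 classes)
classes = np.array([0, 1, 2])
P = np.array([
    [0.50, 0.45, 0.05],   # borderline between class 0 and class 1
    [0.40, 0.35, 0.25],   # borderline between class 0 and class 1
    [0.10, 0.20, 0.70],   # confidently class 2
])

# Hypothetical weights: boost class 1 by 20%
w = np.array([1.0, 1.2, 1.0])

plain = classes[np.argmax(P, axis=1)]
scaled = classes[np.argmax(P * w, axis=1)]   # w broadcasts across the columns

print("plain :", plain)
print("scaled:", scaled)
```

Since the weights in the cell above are tuned and scored on the same validation split, the reported Macro F1 after scaling is likely optimistic.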

In [None]:
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

baseline_pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words='english', ngram_range=(1,2), min_df=2, sublinear_tf=True)),
    ("clf", LogisticRegression(C=1, max_iter=1000, class_weight='balanced', n_jobs=-1))
])

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(baseline_pipe, X_text, y, cv=cv, scoring='f1_macro', n_jobs=1)
print("CV Macro F1 per fold:", [round(s,5) for s in scores])
print("CV Macro F1 mean:", round(scores.mean(),5), "+/-", round(scores.std(),5))

In [None]:
from tempfile import mkdtemp

cachedir = mkdtemp()
cached_pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words='english', ngram_range=(1,2), min_df=2, sublinear_tf=True)),
    ("clf", LogisticRegression(C=1, max_iter=1000, class_weight='balanced', n_jobs=-1))
], memory=cachedir)
print("Cache directory:", cachedir)
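
A sketch of how the cached pipeline could be fitted and the temporary directory cleaned up afterwards. The toy documents and labels are hypothetical. Caching only pays off when the same transformer is refit on the same data, for example while searching over classifier parameters.

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data, for illustration only
docs = ["oil prices rise", "oil output falls", "train crash kills one", "train derails again"]
labels = [0, 0, 1, 1]

cachedir = mkdtemp()
try:
    toy_pipe = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", LogisticRegression(max_iter=1000)),
    ], memory=cachedir)          # fitted transformers are stored under cachedir
    toy_pipe.fit(docs, labels)
    toy_preds = toy_pipe.predict(docs)
    print(toy_preds)
finally:
    rmtree(cachedir)             # remove the cache once it is no longer needed
```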

Pipeline caching (optional speed-up; no accuracy change)

StratifiedKFold CV for Macro F1 (baseline pipe)

Per-class probability scaling (argmax after class-wise multipliers)

Word + Character n-grams (FeatureUnion)

Two-branch TF-IDF: title vs article with weighting

LogisticRegression L1 and Elastic Net (saga)

Sample weights (inverse class frequency)

TF-IDF custom preprocessor (normalize numbers, URLs, emails)

TF-IDF strip_accents and token_pattern

TF-IDF max_features tuning

(Placeholder) Future tuning: max_df and max_features

We choose min_df=2 because it is safer to proceed with the simpler model.
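
To make the effect of min_df concrete, here is a small sketch on a hypothetical corpus: with min_df=2, terms that appear in only one document are dropped, which shrinks the vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "oil" and "prices" occur in two documents, everything else in one
docs = ["oil prices rise", "oil prices fall", "train crash germany"]

vocab_min1 = TfidfVectorizer(min_df=1).fit(docs).vocabulary_
vocab_min2 = TfidfVectorizer(min_df=2).fit(docs).vocabulary_

print("min_df=1:", sorted(vocab_min1))
print("min_df=2:", sorted(vocab_min2))
```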


7) Final pipeline for submission

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

final_pipe = Pipeline([
    ("tfidf", TfidfVectorizer(
        stop_words="english",
        ngram_range=(1,2),
        min_df=2,
        max_features=100000
    )),
    ("clf", LogisticRegression(
        C=1,
        max_iter=1000,
        class_weight="balanced"
    ))
])

8) Train final model on full data

In [None]:
X_text_all = df["text"]          # already title+article
y_all = df["label"]

final_pipe.fit(X_text_all, y_all)

9) Generate submission.csv (Id, Predicted)

In [None]:
import pandas as pd

eval_df = pd.read_csv("evaluation.csv")
eval_df["text"] = eval_df["title"].fillna("") + " " + eval_df["article"].fillna("")

In [None]:
eval_pred = final_pipe.predict(eval_df["text"])

submission = pd.DataFrame({
    "Id": eval_df["Id"],
    "Predicted": eval_pred
})

submission.to_csv("submission_v1.csv", index=False)
submission.head()

In [None]:
print("FINAL PIPELINE USED:")
print(final_pipe)

submission = pd.DataFrame({
    "Id": eval_df["Id"],
    "Predicted": eval_pred
})

submission.to_csv("submission.csv", index=False)
print("Saved submission.csv with columns:", submission.columns.tolist())
print("Rows:", len(submission))

In [None]:
print("evaluation rows:", len(eval_df))
print("submission rows:", len(submission))
print("columns:", submission.columns.tolist())
print(submission["Predicted"].unique())
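
The manual checks above could also be wrapped in assertions so that a malformed file fails loudly. The helper name check_submission and the toy frames are hypothetical. The valid labels 0 to 6 are taken from the label counts of the development set.

```python
import pandas as pd

def check_submission(submission, eval_df, valid_labels):
    """Hypothetical helper: raise AssertionError if the submission looks malformed."""
    assert list(submission.columns) == ["Id", "Predicted"], "unexpected columns"
    assert len(submission) == len(eval_df), "row count differs from evaluation set"
    assert submission["Id"].is_unique, "duplicate Ids"
    assert submission["Predicted"].isin(valid_labels).all(), "unknown label predicted"
    return True

# Toy stand-ins for eval_df and submission
toy_eval = pd.DataFrame({"Id": [10, 11, 12]})
toy_submission = pd.DataFrame({"Id": [10, 11, 12], "Predicted": [0, 5, 2]})

ok = check_submission(toy_submission, toy_eval, valid_labels=range(7))
print("submission ok:", ok)
```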